gcf-python 0.3.0 (tar.gz) → 0.4.0 (tar.gz)

Files changed (26)

{gcf_python-0.3.0 → gcf_python-0.4.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: gcf-python
-Version: 0.3.0
+Version: 0.4.0
 Summary: Python implementation of GCF (Graph Compact Format): token-optimized wire format for LLM tool responses
 Project-URL: Homepage, https://github.com/blackwell-systems/gcf-python
 Project-URL: Documentation, https://blackwell-systems.github.io/gcf/
@@ -32,7 +32,7 @@ Description-Content-Type: text/markdown
 Python implementation of [GCF (Graph Compact Format)](https://gcformat.com/) — the most token-efficient wire format for LLMs. A drop-in alternative to JSON and TOON for any structured data.
-**79% fewer input tokens than JSON. 75% fewer output tokens. 52% smaller than TOON. 100% LLM comprehension at 500 symbols, where JSON fails at 66.7%.**
+**79% fewer input tokens than JSON. 75% fewer output tokens. 52% smaller than TOON. 100% LLM comprehension at 500 symbols, where JSON scores 76.9% and TOON scores 92.3%.**
 Docs: [gcformat.com](https://gcformat.com/) · [Playground](https://gcformat.com/playground.html) · [GCF vs TOON](https://gcformat.com/guide/vs-toon.html)
@@ -119,6 +119,35 @@ out2 = encode_with_session(payload2, sess)  # reused symbols as "@N  # previousl
 By the 5th call in a session: 92.7% token savings vs JSON.
+## Streaming Encode
+Write GCF output incrementally as symbols and edges arrive. Zero buffering, O(1) memory per row:
+```python
+from gcf import StreamEncoder, Symbol, Edge
+enc = StreamEncoder(sys.stdout, "context_for_task", token_budget=5000)
+enc.write_symbol(Symbol(qualified_name="pkg.Auth", kind="function", score=0.95, provenance="lsp", distance=0))
+enc.write_symbol(Symbol(qualified_name="pkg.Server", kind="function", score=0.60, provenance="lsp", distance=1))
+enc.write_edge(Edge(source="pkg.Server", target="pkg.Auth", edge_type="calls"))
+enc.close()  # emits ## _summary trailer
+```
+Output:
+```
+GCF tool=context_for_task budget=5000
+## targets
+@0 fn pkg.Auth 0.95 lsp
+## related
+@1 fn pkg.Server 0.60 lsp
+## edges [?]
+@0<@1 calls
+## _summary symbols=2 edges=1 sections=targets:1,related:1,edges:1
+```
+The writer is any object with a `write(s: str)` method. Thread-safe. Standard `decode()` handles streaming output with no changes.
 ## Delta Encoding
 When the consumer already has a prior context pack, send only what changed:
@@ -189,15 +218,17 @@ Works on dicts, lists, and primitives. Lists of uniform dicts get tabular rows.
 ## Comprehension Eval
-Rigorous 3-way benchmark (GCF vs TOON vs JSON) at 500 symbols, 200 edges. Six structured extraction questions sent to an LLM:
+Rigorous 3-way benchmark (GCF vs TOON vs JSON) at 500 symbols, 200 edges. 13 structured extraction questions sent to an LLM with zero format instructions:
 | Format | Accuracy | Tokens | vs JSON |
 |--------|----------|--------|---------|
-| **GCF** | **100%** (6/6) | **11,090** | **79% fewer** |
-| TOON | 100% (6/6) | 16,378 | 69% fewer |
-| JSON | 66.7% (4/6) | 53,341 | baseline |
+| **GCF** | **100%** (13/13) | **11,090** | **79% fewer** |
+| TOON | 92.3% (12/13) | 16,378 | 69% fewer |
+| JSON | 76.9% (10/13) | 53,341 | baseline |
+GCF is the only format with perfect accuracy at scale, at 32% fewer tokens than TOON.
-JSON failed on counting tasks. GCF and TOON both achieved perfect accuracy. GCF does it in 32% fewer tokens.
+Reproduce: `git clone https://github.com/blackwell-systems/gcf-go && cd gcf-go/eval && GOWORK=off go test -run TestComprehension -v -timeout 0`
 ## Token Efficiency (TOON's Own Benchmark)
@@ -205,13 +236,13 @@ Running [TOON's benchmark harness](https://github.com/blackwell-systems/toon/tre
 | Track | GCF | TOON | Result |
 |-------|-----|------|--------|
-| Mixed-structure (nested, semi-uniform) | 169,554 | 227,896 | **GCF 34% smaller** |
-| Flat-only (tabular) | 66,026 | 67,837 | **GCF 3% smaller** |
-| Semi-uniform event logs | 107,269 | 154,032 | **GCF 44% smaller** |
+| Mixed-structure (nested, semi-uniform) | 170,367 | 227,896 | **GCF 34% smaller** |
+| Flat-only (tabular) | 66,029 | 67,837 | **GCF 3% smaller** |
+| Semi-uniform event logs | 108,158 | 154,032 | **GCF 42% smaller** |
-GCF wins on every dataset except deeply nested config (75 tokens on a 618-token payload). On semi-uniform data, GCF uses 44% fewer tokens than TOON.
+GCF wins all 6 datasets. On semi-uniform data (the most common real-world pattern), GCF uses 42% fewer tokens than TOON.
-Reproducible: [blackwell-systems/toon@gcf-comparison](https://github.com/blackwell-systems/toon/tree/gcf-comparison)
+Reproduce: `git clone https://github.com/blackwell-systems/toon && cd toon && git checkout gcf-comparison && cd benchmarks && pnpm install && pnpm benchmark:tokens`
 ## Links

{gcf_python-0.3.0 → gcf_python-0.4.0}/README.md RENAMED

@@ -7,7 +7,7 @@
 Python implementation of [GCF (Graph Compact Format)](https://gcformat.com/) — the most token-efficient wire format for LLMs. A drop-in alternative to JSON and TOON for any structured data.
-**79% fewer input tokens than JSON. 75% fewer output tokens. 52% smaller than TOON. 100% LLM comprehension at 500 symbols, where JSON fails at 66.7%.**
+**79% fewer input tokens than JSON. 75% fewer output tokens. 52% smaller than TOON. 100% LLM comprehension at 500 symbols, where JSON scores 76.9% and TOON scores 92.3%.**
 Docs: [gcformat.com](https://gcformat.com/) · [Playground](https://gcformat.com/playground.html) · [GCF vs TOON](https://gcformat.com/guide/vs-toon.html)
@@ -94,6 +94,35 @@ out2 = encode_with_session(payload2, sess)  # reused symbols as "@N  # previousl
 By the 5th call in a session: 92.7% token savings vs JSON.
+## Streaming Encode
+Write GCF output incrementally as symbols and edges arrive. Zero buffering, O(1) memory per row:
+```python
+from gcf import StreamEncoder, Symbol, Edge
+enc = StreamEncoder(sys.stdout, "context_for_task", token_budget=5000)
+enc.write_symbol(Symbol(qualified_name="pkg.Auth", kind="function", score=0.95, provenance="lsp", distance=0))
+enc.write_symbol(Symbol(qualified_name="pkg.Server", kind="function", score=0.60, provenance="lsp", distance=1))
+enc.write_edge(Edge(source="pkg.Server", target="pkg.Auth", edge_type="calls"))
+enc.close()  # emits ## _summary trailer
+```
+Output:
+```
+GCF tool=context_for_task budget=5000
+## targets
+@0 fn pkg.Auth 0.95 lsp
+## related
+@1 fn pkg.Server 0.60 lsp
+## edges [?]
+@0<@1 calls
+## _summary symbols=2 edges=1 sections=targets:1,related:1,edges:1
+```
+The writer is any object with a `write(s: str)` method. Thread-safe. Standard `decode()` handles streaming output with no changes.
 ## Delta Encoding
 When the consumer already has a prior context pack, send only what changed:
@@ -164,15 +193,17 @@ Works on dicts, lists, and primitives. Lists of uniform dicts get tabular rows.
 ## Comprehension Eval
-Rigorous 3-way benchmark (GCF vs TOON vs JSON) at 500 symbols, 200 edges. Six structured extraction questions sent to an LLM:
+Rigorous 3-way benchmark (GCF vs TOON vs JSON) at 500 symbols, 200 edges. 13 structured extraction questions sent to an LLM with zero format instructions:
 | Format | Accuracy | Tokens | vs JSON |
 |--------|----------|--------|---------|
-| **GCF** | **100%** (6/6) | **11,090** | **79% fewer** |
-| TOON | 100% (6/6) | 16,378 | 69% fewer |
-| JSON | 66.7% (4/6) | 53,341 | baseline |
+| **GCF** | **100%** (13/13) | **11,090** | **79% fewer** |
+| TOON | 92.3% (12/13) | 16,378 | 69% fewer |
+| JSON | 76.9% (10/13) | 53,341 | baseline |
+GCF is the only format with perfect accuracy at scale, at 32% fewer tokens than TOON.
-JSON failed on counting tasks. GCF and TOON both achieved perfect accuracy. GCF does it in 32% fewer tokens.
+Reproduce: `git clone https://github.com/blackwell-systems/gcf-go && cd gcf-go/eval && GOWORK=off go test -run TestComprehension -v -timeout 0`
 ## Token Efficiency (TOON's Own Benchmark)
@@ -180,13 +211,13 @@ Running [TOON's benchmark harness](https://github.com/blackwell-systems/toon/tre
 | Track | GCF | TOON | Result |
 |-------|-----|------|--------|
-| Mixed-structure (nested, semi-uniform) | 169,554 | 227,896 | **GCF 34% smaller** |
-| Flat-only (tabular) | 66,026 | 67,837 | **GCF 3% smaller** |
-| Semi-uniform event logs | 107,269 | 154,032 | **GCF 44% smaller** |
+| Mixed-structure (nested, semi-uniform) | 170,367 | 227,896 | **GCF 34% smaller** |
+| Flat-only (tabular) | 66,029 | 67,837 | **GCF 3% smaller** |
+| Semi-uniform event logs | 108,158 | 154,032 | **GCF 42% smaller** |
-GCF wins on every dataset except deeply nested config (75 tokens on a 618-token payload). On semi-uniform data, GCF uses 44% fewer tokens than TOON.
+GCF wins all 6 datasets. On semi-uniform data (the most common real-world pattern), GCF uses 42% fewer tokens than TOON.
-Reproducible: [blackwell-systems/toon@gcf-comparison](https://github.com/blackwell-systems/toon/tree/gcf-comparison)
+Reproduce: `git clone https://github.com/blackwell-systems/toon && cd toon && git checkout gcf-comparison && cd benchmarks && pnpm install && pnpm benchmark:tokens`
 ## Links

{gcf_python-0.3.0 → gcf_python-0.4.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "gcf-python"
-version = "0.3.0"
+version = "0.4.0"
 description = "Python implementation of GCF (Graph Compact Format): token-optimized wire format for LLM tool responses"
 readme = "README.md"
 license = {text = "MIT"}

{gcf_python-0.3.0 → gcf_python-0.4.0}/src/gcf/__init__.py RENAMED

@@ -40,6 +40,7 @@ from .delta import encode_delta
 from .encode import encode
 from .generic import encode_generic
 from .session import Session, encode_with_session
+from .stream import StreamEncoder
 from .types import Components, DeltaPayload, Edge, Payload, Symbol
 __all__ = [
@@ -51,6 +52,7 @@ __all__ = [
     "KIND_EXPAND",
     "Payload",
     "Session",
+    "StreamEncoder",
     "Symbol",
     "decode",
     "encode",

gcf_python-0.4.0/src/gcf/stream.py ADDED

@@ -0,0 +1,151 @@
+"""GCF streaming encoder: zero-buffering encode to any writable."""
+from __future__ import annotations
+import threading
+from typing import Any, Protocol
+from .constants import KIND_ABBREV
+from .types import Edge, Symbol
+class StreamWriter(Protocol):
+    """Any object with a write(s: str) method."""
+    def write(self, s: str) -> Any: ...
+class StreamEncoder:
+    """Writes GCF output incrementally as symbols and edges arrive.
+    Zero buffering: each symbol/edge is written immediately. A trailer summary
+    is emitted on close() with the final counts.
+    Example::
+        enc = StreamEncoder(sys.stdout, "context_for_task", token_budget=5000)
+        enc.write_symbol(sym1)  # emitted immediately
+        enc.write_edge(edge1)   # emitted immediately
+        enc.close()             # emits ## _summary trailer
+    """
+    def __init__(
+        self,
+        writer: StreamWriter,
+        tool: str,
+        *,
+        token_budget: int = 0,
+        tokens_used: int = 0,
+        pack_root: str = "",
+        session: bool = False,
+    ) -> None:
+        self._w = writer
+        self._lock = threading.Lock()
+        self._sym_index: dict[str, int] = {}
+        self._next_id = 0
+        self._current_group = ""
+        self._group_counts: dict[str, int] = {}
+        self._edge_count = 0
+        self._edges_started = False
+        # Emit header immediately.
+        parts = [f"GCF tool={tool}"]
+        if token_budget:
+            parts.append(f"budget={token_budget}")
+        if tokens_used:
+            parts.append(f"tokens={tokens_used}")
+        if pack_root:
+            parts.append(f"pack_root={pack_root}")
+        if session:
+            parts.append("session=true")
+        self._w.write(" ".join(parts) + "\n")
+    def write_symbol(self, s: Symbol) -> None:
+        """Emit a symbol line immediately. Group headers auto-managed."""
+        with self._lock:
+            group_names = ["targets", "related", "extended"]
+            if s.distance < len(group_names):
+                group_name = group_names[s.distance]
+            else:
+                group_name = f"distance_{s.distance}"
+            if group_name != self._current_group:
+                self._w.write(f"## {group_name}\n")
+                self._current_group = group_name
+            idx = self._next_id
+            self._sym_index[s.qualified_name] = idx
+            self._next_id += 1
+            kind = KIND_ABBREV.get(s.kind, s.kind)
+            self._w.write(f"@{idx} {kind} {s.qualified_name} {s.score:.2f} {s.provenance}\n")
+            self._group_counts[group_name] = self._group_counts.get(group_name, 0) + 1
+    def write_edge(self, e: Edge) -> None:
+        """Emit an edge line immediately. Edges section header auto-emitted on first edge."""
+        with self._lock:
+            src_idx = self._sym_index.get(e.source)
+            tgt_idx = self._sym_index.get(e.target)
+            if src_idx is None or tgt_idx is None:
+                return
+            if not self._edges_started:
+                self._w.write("## edges [?]\n")
+                self._edges_started = True
+            line = f"@{tgt_idx}<@{src_idx} {e.edge_type}"
+            if e.status and e.status != "unchanged":
+                line += f" {e.status}"
+            self._w.write(line + "\n")
+            self._edge_count += 1
+    def write_bare_ref(self, qname: str, distance: int) -> None:
+        """Emit a bare reference for a previously-transmitted symbol (session mode)."""
+        with self._lock:
+            group_names = ["targets", "related", "extended"]
+            if distance < len(group_names):
+                group_name = group_names[distance]
+            else:
+                group_name = f"distance_{distance}"
+            if group_name != self._current_group:
+                self._w.write(f"## {group_name}\n")
+                self._current_group = group_name
+            idx = self._next_id
+            self._sym_index[qname] = idx
+            self._next_id += 1
+            self._w.write(f"@{idx}  # previously transmitted\n")
+            self._group_counts[group_name] = self._group_counts.get(group_name, 0) + 1
+    def close(self) -> None:
+        """Emit ## _summary trailer with final counts."""
+        with self._lock:
+            sections: list[str] = []
+            group_order = ["targets", "related", "extended"]
+            for g in group_order:
+                c = self._group_counts.get(g, 0)
+                if c > 0:
+                    sections.append(f"{g}:{c}")
+            for g, c in self._group_counts.items():
+                if g not in group_order and c > 0:
+                    sections.append(f"{g}:{c}")
+            if self._edge_count > 0:
+                sections.append(f"edges:{self._edge_count}")
+            self._w.write(
+                f"## _summary symbols={self._next_id} edges={self._edge_count}"
+                f" sections={','.join(sections)}\n"
+            )
+    @property
+    def symbol_count(self) -> int:
+        """Number of symbols written so far."""
+        return self._next_id
+    @property
+    def edge_count(self) -> int:
+        """Number of edges written so far."""
+        return self._edge_count
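
Two behaviors in `stream.py` above matter when feeding the encoder. `write_edge` returns without writing anything if either endpoint has not been passed to `write_symbol` (or `write_bare_ref`) yet. A group header is also emitted every time the distance group differs from the previous symbol's group, so interleaved distances produce repeated headers. The following is a minimal sketch of a pre-ordering step, assuming the caller can hold the items in memory first (which gives up the zero-buffering property). `order_for_stream` and the tuple shapes are hypothetical helpers for illustration, not part of gcf:

```python
from __future__ import annotations

# Hypothetical shapes for illustration only (not gcf types):
#   symbol = (qualified_name, distance)
#   edge   = (source, target, edge_type)
SymbolT = tuple[str, int]
EdgeT = tuple[str, str, str]


def order_for_stream(
    symbols: list[SymbolT],
    edges: list[EdgeT],
) -> tuple[list[SymbolT], list[EdgeT], list[EdgeT]]:
    """Return (ordered_symbols, writable_edges, dropped_edges)."""
    # Stable sort by distance so each group's symbols are contiguous
    # and its header would be emitted only once.
    ordered = sorted(symbols, key=lambda s: s[1])
    known = {name for name, _ in ordered}
    # An edge is only writable when both endpoints are known symbols;
    # the encoder in the diff silently skips the rest.
    writable = [e for e in edges if e[0] in known and e[1] in known]
    dropped = [e for e in edges if not (e[0] in known and e[1] in known)]
    return ordered, writable, dropped


symbols = [("pkg.Server", 1), ("pkg.Auth", 0), ("pkg.Config", 0)]
edges = [
    ("pkg.Server", "pkg.Auth", "calls"),
    ("unknown.B", "pkg.Auth", "calls"),
]
ordered, writable, dropped = order_for_stream(symbols, edges)
```

The ordered symbols would then go to `write_symbol` one by one, followed by the writable edges to `write_edge`. Reporting `dropped` separately makes the otherwise silent skip visible to the caller.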

gcf_python-0.4.0/tests/test_stream.py ADDED

@@ -0,0 +1,116 @@
+"""Tests for the StreamEncoder."""
+import io
+from gcf import StreamEncoder, Symbol, Edge, decode
+def test_stream_basic():
+    buf = io.StringIO()
+    enc = StreamEncoder(buf, "context_for_task", token_budget=5000)
+    enc.write_symbol(Symbol(qualified_name="pkg.Auth", kind="function", score=0.78, provenance="lsp_resolved", distance=0))
+    enc.write_symbol(Symbol(qualified_name="pkg.Server", kind="function", score=0.54, provenance="lsp_resolved", distance=1))
+    enc.write_edge(Edge(source="pkg.Server", target="pkg.Auth", edge_type="calls"))
+    enc.close()
+    out = buf.getvalue()
+    assert "GCF tool=context_for_task budget=5000\n" in out
+    assert "## targets\n" in out
+    assert "@0 fn pkg.Auth 0.78 lsp_resolved\n" in out
+    assert "## related\n" in out
+    assert "@1 fn pkg.Server 0.54 lsp_resolved\n" in out
+    assert "## edges [?]\n" in out
+    assert "@0<@1 calls\n" in out
+    assert "## _summary symbols=2 edges=1" in out
+    # Header should not have symbols= or edges=
+    header = out.split("\n")[0]
+    assert "symbols=" not in header
+    assert "edges=" not in header
+def test_stream_round_trip():
+    buf = io.StringIO()
+    enc = StreamEncoder(buf, "blast_radius", token_budget=10000)
+    enc.write_symbol(Symbol(qualified_name="pkg.Auth", kind="function", score=0.95, provenance="lsp", distance=0))
+    enc.write_symbol(Symbol(qualified_name="pkg.Config", kind="type", score=0.80, provenance="ast", distance=0))
+    enc.write_symbol(Symbol(qualified_name="pkg.Server", kind="function", score=0.60, provenance="lsp", distance=1))
+    enc.write_edge(Edge(source="pkg.Server", target="pkg.Auth", edge_type="calls"))
+    enc.write_edge(Edge(source="pkg.Auth", target="pkg.Config", edge_type="references"))
+    enc.close()
+    p = decode(buf.getvalue())
+    assert p.tool == "blast_radius"
+    assert len(p.symbols) == 3
+    assert len(p.edges) == 2
+def test_stream_no_edges():
+    buf = io.StringIO()
+    enc = StreamEncoder(buf, "test")
+    enc.write_symbol(Symbol(qualified_name="a.A", kind="function", score=0.9, provenance="x", distance=0))
+    enc.close()
+    out = buf.getvalue()
+    assert "## edges" not in out
+    assert "edges=0" in out
+def test_stream_multiple_groups():
+    buf = io.StringIO()
+    enc = StreamEncoder(buf, "test")
+    enc.write_symbol(Symbol(qualified_name="a", kind="function", score=1.0, provenance="x", distance=0))
+    enc.write_symbol(Symbol(qualified_name="b", kind="function", score=0.8, provenance="x", distance=1))
+    enc.write_symbol(Symbol(qualified_name="c", kind="function", score=0.6, provenance="x", distance=2))
+    enc.write_symbol(Symbol(qualified_name="d", kind="function", score=0.4, provenance="x", distance=5))
+    enc.close()
+    out = buf.getvalue()
+    assert "## targets\n" in out
+    assert "## related\n" in out
+    assert "## extended\n" in out
+    assert "## distance_5\n" in out
+    assert "sections=targets:1,related:1,extended:1,distance_5:1" in out
+def test_stream_skips_unknown_refs():
+    buf = io.StringIO()
+    enc = StreamEncoder(buf, "test")
+    enc.write_symbol(Symbol(qualified_name="a.A", kind="function", score=0.9, provenance="x", distance=0))
+    enc.write_edge(Edge(source="unknown.B", target="a.A", edge_type="calls"))
+    enc.close()
+    out = buf.getvalue()
+    assert "calls" not in out
+    assert "edges=0" in out
+def test_stream_incremental():
+    buf = io.StringIO()
+    enc = StreamEncoder(buf, "test")
+    # Header written immediately.
+    assert buf.tell() > 0
+    pos_after_header = buf.tell()
+    enc.write_symbol(Symbol(qualified_name="a.A", kind="function", score=0.9, provenance="x", distance=0))
+    assert buf.tell() > pos_after_header
+def test_stream_bare_ref():
+    buf = io.StringIO()
+    enc = StreamEncoder(buf, "test", session=True)
+    enc.write_bare_ref("pkg.Auth", 0)
+    enc.write_symbol(Symbol(qualified_name="pkg.New", kind="function", score=0.85, provenance="lsp", distance=0))
+    enc.close()
+    out = buf.getvalue()
+    assert "session=true" in out
+    assert "@0  # previously transmitted" in out
+    assert "@1 fn pkg.New 0.85 lsp" in out
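
The tests above use `io.StringIO` as the writer, but the `StreamWriter` protocol in `stream.py` only requires a `write(s: str)` method. The following is a minimal sketch of a custom writer that forwards each chunk and tracks how much has passed through it. `CountingWriter` is a hypothetical name, and the lines fed to it are copied from the README output rather than produced by a real encoder, so the block runs without gcf installed:

```python
import io


class CountingWriter:
    """Wraps another writer and counts the lines and characters written."""

    def __init__(self, inner) -> None:
        self._inner = inner
        self.lines = 0
        self.chars = 0

    def write(self, s: str) -> int:
        # Count before forwarding; the encoder in the diff writes one
        # newline-terminated line per call.
        self.lines += s.count("\n")
        self.chars += len(s)
        return self._inner.write(s)


buf = io.StringIO()
w = CountingWriter(buf)
# Stand-in for encoder output: lines taken from the README example.
for chunk in (
    "GCF tool=context_for_task budget=5000\n",
    "## targets\n",
    "@0 fn pkg.Auth 0.95 lsp\n",
):
    w.write(chunk)
```

Since the encoder shown in the diff only ever calls `write` on its writer, an instance like this could presumably be passed as the first argument to `StreamEncoder` in place of `sys.stdout`. Note that the README snippet uses `sys.stdout` without showing `import sys`, which it needs in order to run.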