reasonix 0.2.0 → 0.3.0-alpha.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -58,17 +58,17 @@ actually stays byte-stable.
58
58
 
59
59
  ## Validated numbers
60
60
 
61
- **τ-bench-lite** — 8 multi-turn tool-use tasks, same tools / same prompt
62
- on both sides, sole variable is prefix stability. Measured on live
63
- DeepSeek `deepseek-chat`:
61
+ **τ-bench-lite** — 8 multi-turn tool-use tasks × 3 repeats = 48 runs per
62
+ side. Same tools / same prompt / same client on both sides, sole variable
63
+ is prefix stability. Measured on live DeepSeek `deepseek-chat`:
64
64
 
65
65
  | metric | baseline (cache-hostile) | Reasonix | delta |
66
66
  |---|---:|---:|---:|
67
- | runs | 8 | 8 | — |
68
- | **cache hit** | 43.9% | **94.3%** | **+50.3pp** |
69
- | cost / task | $0.002783 | $0.001621 | **−42% (×0.58)** |
67
+ | runs | 24 | 24 | — |
68
+ | **cache hit** | 46.6% | **94.4%** | **+47.7pp** |
69
+ | cost / task | $0.002599 | $0.001579 | **−39% (×0.61)** |
70
70
  | vs Claude Sonnet 4.6 (token-count estimate) | — | — | **~96% cheaper** |
71
- | pass rate | 100% (8/8) | 88% (7/8) | 1 refusal-task predicate too strict (see [report.md][r]) |
71
+ | pass rate | 96% (23/24) | **100% (24/24)** | Reasonix held the guardrail on every run |
72
72
 
73
73
  **Verify it yourself — no API key, zero cost:**
74
74
 
@@ -86,11 +86,34 @@ stays byte-stable across every model call; baseline's prefix churns on
86
86
  every turn. The cache delta is *mechanically* attributable to log
87
87
  stability, not to a different system prompt.
88
88
 
89
- Full 16-run report: [`benchmarks/tau-bench/report.md`][r]. Reproduce
90
- with your own API key: `npx tsx benchmarks/tau-bench/runner.ts`.
89
+ Full 48-run report: [`benchmarks/tau-bench/report.md`][r]. Reproduce
90
+ with your own API key: `npx tsx benchmarks/tau-bench/runner.ts --repeats 3`.
91
91
 
92
92
  [r]: ./benchmarks/tau-bench/report.md
93
93
 
94
+ ### Extends to MCP (v0.3-alpha)
95
+
96
+ Any [MCP](https://spec.modelcontextprotocol.io/) server's tools inherit
97
+ the same Cache-First benefits. Live run with an MCP tool call in the
98
+ middle of a conversation:
99
+
100
+ | turn | what happened | cache hit |
101
+ |---|---|---:|
102
+ | 1 | user asks, model decides to call `add` tool via MCP stdio | 0.0% (first-ever prefix) |
103
+ | 1 (continued) | model receives tool result (42), writes final answer | **96.6%** |
104
+
105
+ The MCP round-trip did not disturb the byte-stable prefix — server-side
106
+ prompt cache kicked in on turn 2. Cost $0.000254 total, **94% cheaper
107
+ than Claude** at equivalent token counts. Reference transcript:
108
+ [`mcp-demo.add.jsonl`][mcp]. Reproduce:
109
+
110
+ ```bash
111
+ reasonix chat --mcp "node --import tsx examples/mcp-server-demo.ts"
112
+ # ask "use add to compute 17 + 25" — model calls the MCP tool, cache holds
113
+ ```
114
+
115
+ [mcp]: ./benchmarks/tau-bench/transcripts/mcp-demo.add.jsonl
116
+
94
117
  ---
95
118
 
96
119
  ## Usage