reasonix 0.2.0 → 0.3.0-alpha.1
- package/README.md +32 -9
- package/dist/cli/index.js +944 -86
- package/dist/cli/index.js.map +1 -1
- package/dist/index.d.ts +270 -2
- package/dist/index.js +327 -1
- package/dist/index.js.map +1 -1
- package/package.json +1 -1
package/README.md
CHANGED
@@ -58,17 +58,17 @@ actually stays byte-stable.
 
 ## Validated numbers
 
-**τ-bench-lite** — 8 multi-turn tool-use tasks
-
-DeepSeek `deepseek-chat`:
+**τ-bench-lite** — 8 multi-turn tool-use tasks × 3 repeats = 48 runs per
+side. Same tools / same prompt / same client on both sides, sole variable
+is prefix stability. Measured on live DeepSeek `deepseek-chat`:
 
 | metric | baseline (cache-hostile) | Reasonix | delta |
 |---|---:|---:|---:|
-| runs |
-| **cache hit** |
-| cost / task | $0.
+| runs | 24 | 24 | — |
+| **cache hit** | 46.6% | **94.4%** | **+47.7pp** |
+| cost / task | $0.002599 | $0.001579 | **−39% (×0.61)** |
 | vs Claude Sonnet 4.6 (token-count estimate) | — | — | **~96% cheaper** |
-| pass rate |
+| pass rate | 96% (23/24) | **100% (24/24)** | Reasonix held the guardrail on every run |
 
 **Verify it yourself — no API key, zero cost:**
 
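As a quick sanity check on the table in this hunk, the TypeScript sketch below recomputes the derived figures from the raw numbers shown. The variable names are illustrative and not part of the package. Two things stand out: 8 tasks × 3 repeats gives 24 runs per side, which matches the `runs` row, so the 48 in the prose appears to count both sides together; and the delta recomputed from the rounded percentages is 47.8pp, so the listed +47.7pp presumably comes from unrounded values.

```typescript
// Figures copied from the README table; names are illustrative only.
const tasks = 8;
const repeats = 3;
const runsPerSide = tasks * repeats; // 24, matches the "runs" row
const totalRuns = runsPerSide * 2;   // 48 across baseline + Reasonix

const baselineCostUsd = 0.002599; // cost / task, baseline
const reasonixCostUsd = 0.001579; // cost / task, Reasonix
const costRatio = reasonixCostUsd / baselineCostUsd; // about 0.61
const savingsPct = (1 - costRatio) * 100;            // about 39

// Percentage-point delta computed from the rounded table values.
const cacheHitDeltaPp = 94.4 - 46.6;

console.log({
  runsPerSide,                                 // 24
  totalRuns,                                   // 48
  costRatio: costRatio.toFixed(2),             // "0.61"
  savingsPct: savingsPct.toFixed(0),           // "39"
  cacheHitDeltaPp: cacheHitDeltaPp.toFixed(1), // "47.8"
});
```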
@@ -86,11 +86,34 @@ stays byte-stable across every model call; baseline's prefix churns on
 every turn. The cache delta is *mechanically* attributable to log
 stability, not to a different system prompt.
 
-Full
-with your own API key: `npx tsx benchmarks/tau-bench/runner.ts`.
+Full 48-run report: [`benchmarks/tau-bench/report.md`][r]. Reproduce
+with your own API key: `npx tsx benchmarks/tau-bench/runner.ts --repeats 3`.
 
 [r]: ./benchmarks/tau-bench/report.md
 
+### Extends to MCP (v0.3-alpha)
+
+Any [MCP](https://spec.modelcontextprotocol.io/) server's tools inherit
+the same Cache-First benefits. Live run with an MCP tool call in the
+middle of a conversation:
+
+| turn | what happened | cache hit |
+|---|---|---:|
+| 1 | user asks, model decides to call `add` tool via MCP stdio | 0.0% (first-ever prefix) |
+| 1 (continued) | model receives tool result (42), writes final answer | **96.6%** |
+
+The MCP round-trip did not disturb the byte-stable prefix — server-side
+prompt cache kicked in on turn 2. Cost $0.000254 total, **94% cheaper
+than Claude** at equivalent token counts. Reference transcript:
+[`mcp-demo.add.jsonl`][mcp]. Reproduce:
+
+```bash
+reasonix chat --mcp "node --import tsx examples/mcp-server-demo.ts"
+# ask "use add to compute 17 + 25" — model calls the MCP tool, cache holds
+```
+
+[mcp]: ./benchmarks/tau-bench/transcripts/mcp-demo.add.jsonl
+
 ---
 
 ## Usage
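The claim that a tool round-trip leaves the prefix byte-stable can be illustrated with a minimal sketch. It assumes an append-only message log and a hypothetical `serialize` helper; how Reasonix actually serializes requests is not shown in this diff, so none of the names below are the package's API. The point is only that appending a tool call and its result keeps the earlier request as an exact string prefix of the next one, whereas rewriting any earlier message breaks that property.

```typescript
// Hypothetical message shape and serializer, for illustration only.
type Message = {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
};

// One JSON object per line, in log order (an assumed wire format).
function serialize(log: readonly Message[]): string {
  return log.map((m) => JSON.stringify(m)).join("\n");
}

const log: Message[] = [
  { role: "system", content: "You can call the add tool." },
  { role: "user", content: "use add to compute 17 + 25" },
];
const firstCall = serialize(log);

// Append-only: the tool call and its result go after everything already sent.
log.push({ role: "assistant", content: 'call add {"a":17,"b":25}' });
log.push({ role: "tool", content: "42" });
const secondCall = serialize(log);

// The first request is an exact prefix of the second, so a prefix cache can hit.
const prefixStable = secondCall.startsWith(firstCall);

// Cache-hostile variant: an earlier message is rewritten between calls.
const churned = serialize([
  { role: "system", content: "You can call the add tool. t=2" },
  ...log.slice(1),
]);
const churnedStable = churned.startsWith(firstCall);

console.log({ prefixStable, churnedStable }); // { prefixStable: true, churnedStable: false }
```

A server-side prompt cache keyed on request prefixes would be able to reuse the first call's prefix in the append-only case but not in the churned one, which is consistent with the jump from 0.0% to 96.6% reported in the table.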