@blockrun/clawrouter 0.12.61 → 0.12.63
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/docs/anthropic-cost-savings.md +349 -0
- package/docs/architecture.md +559 -0
- package/docs/assets/blockrun-248-day-cost-overrun-problem.png +0 -0
- package/docs/assets/blockrun-clawrouter-7-layer-token-compression-openclaw.png +0 -0
- package/docs/assets/blockrun-clawrouter-observation-compression-97-percent-token-savings.png +0 -0
- package/docs/assets/blockrun-clawrouter-openclaw-agentic-proxy-architecture.png +0 -0
- package/docs/assets/blockrun-clawrouter-openclaw-automatic-tier-routing-model-selection.png +0 -0
- package/docs/assets/blockrun-clawrouter-openclaw-error-classification-retry-storm-prevention.png +0 -0
- package/docs/assets/blockrun-clawrouter-openclaw-session-memory-journaling-vs-context-compounding.png +0 -0
- package/docs/assets/blockrun-clawrouter-vs-openclaw-standalone-comparison-production-safety.png +0 -0
- package/docs/assets/blockrun-clawrouter-x402-usdc-micropayment-wallet-budget-control.png +0 -0
- package/docs/assets/blockrun-openclaw-inference-layer-blind-spots.png +0 -0
- package/docs/blog-benchmark-2026-03.md +184 -0
- package/docs/blog-openclaw-cost-overruns.md +197 -0
- package/docs/clawrouter-savings.png +0 -0
- package/docs/configuration.md +512 -0
- package/docs/features.md +257 -0
- package/docs/image-generation.md +380 -0
- package/docs/plans/2026-02-03-smart-routing-design.md +267 -0
- package/docs/plans/2026-02-13-e2e-docker-deployment.md +1260 -0
- package/docs/plans/2026-02-28-worker-network.md +947 -0
- package/docs/plans/2026-03-18-error-classification.md +574 -0
- package/docs/plans/2026-03-19-exclude-models.md +538 -0
- package/docs/routing-profiles.md +81 -0
- package/docs/subscription-failover.md +320 -0
- package/docs/technical-routing-2026-03.md +322 -0
- package/docs/troubleshooting.md +159 -0
- package/docs/vision.md +49 -0
- package/docs/vs-openrouter.md +157 -0
- package/docs/worker-network.md +1241 -0
- package/package.json +3 -2
- package/scripts/reinstall.sh +8 -4
- package/scripts/update.sh +8 -4

@@ -0,0 +1,184 @@

# We Benchmarked 39 AI Models Through Our Payment Gateway. Here's What We Found.

*March 16, 2026 | BlockRun Engineering*

Last week we ran every model on BlockRun through a real-world latency benchmark — 39 models, same prompts, same payment pipeline, same hardware. No cherry-picked results. No synthetic lab conditions. Just cold, hard numbers from production infrastructure.

The results changed how we route requests.

## Why We Did This

BlockRun is an x402 micropayment gateway that sits between your AI agent and 39+ LLM providers. Every request flows through our payment verification layer before hitting the model API. That means our latency numbers include everything a real user experiences: payment auth, provider API call, and response delivery.

Most benchmarks measure model speed in isolation. We wanted to measure what users actually feel.

## The Leaderboard

We sent 2 coding prompts per model (256 max tokens, non-streaming) and measured end-to-end response time.

### Speed Rankings (End-to-End Latency Through BlockRun)

| # | Model | Latency | Tok/s | $/1M in | $/1M out |
|---|-------|---------|-------|---------|----------|
| 1 | xai/grok-4-fast-non-reasoning | 1,143ms | 224 | $0.20 | $0.50 |
| 2 | xai/grok-3-mini | 1,202ms | 215 | $0.30 | $0.50 |
| 3 | google/gemini-2.5-flash | 1,238ms | 208 | $0.15 | $0.60 |
| 4 | xai/grok-3 | 1,244ms | 207 | $3.00 | $15.00 |
| 5 | xai/grok-4-1-fast-non-reasoning | 1,244ms | 206 | $0.20 | $0.50 |
| 6 | nvidia/gpt-oss-120b | 1,252ms | 204 | FREE | FREE |
| 7 | minimax/minimax-m2.5 | 1,278ms | 202 | $0.30 | $1.10 |
| 8 | google/gemini-2.5-pro | 1,294ms | 198 | $1.25 | $10.00 |
| 9 | xai/grok-4-fast-reasoning | 1,298ms | 198 | $0.20 | $0.50 |
| 10 | xai/grok-4-0709 | 1,348ms | 190 | $0.20 | $1.50 |
| 11 | google/gemini-3-pro-preview | 1,352ms | 190 | $1.25 | $10.00 |
| 12 | google/gemini-2.5-flash-lite | 1,353ms | 193 | $0.10 | $0.40 |
| 13 | google/gemini-3-flash-preview | 1,398ms | 183 | $0.15 | $0.60 |
| 14 | deepseek/deepseek-chat | 1,431ms | 179 | $0.27 | $1.10 |
| 15 | deepseek/deepseek-reasoner | 1,454ms | 183 | $0.55 | $2.19 |
| 16 | xai/grok-4-1-fast-reasoning | 1,454ms | 176 | $0.20 | $0.50 |
| 17 | google/gemini-3.1-pro | 1,609ms | 167 | $1.25 | $10.00 |
| 18 | moonshot/kimi-k2.5 | 1,646ms | 156 | $0.60 | $3.00 |
| 19 | anthropic/claude-sonnet-4.6 | 2,110ms | 121 | $3.00 | $15.00 |
| 20 | anthropic/claude-opus-4.6 | 2,139ms | 120 | $15.00 | $75.00 |
| 21 | openai/o3-mini | 2,260ms | 114 | $1.10 | $4.40 |
| 22 | openai/gpt-5-mini | 2,264ms | 114 | $1.10 | $4.40 |
| 23 | anthropic/claude-haiku-4.5 | 2,305ms | 141 | $0.80 | $4.00 |
| 24 | openai/o4-mini | 2,328ms | 111 | $1.10 | $4.40 |
| 25 | openai/gpt-4.1-mini | 2,340ms | 109 | $0.40 | $1.60 |
| 26 | openai/o1 | 2,562ms | 100 | $15.00 | $60.00 |
| 27 | openai/gpt-4.1-nano | 2,640ms | 97 | $0.10 | $0.40 |
| 28 | openai/o1-mini | 2,746ms | 93 | $1.10 | $4.40 |
| 29 | openai/gpt-4o-mini | 2,764ms | 93 | $0.15 | $0.60 |
| 30 | openai/o3 | 2,862ms | 90 | $2.00 | $8.00 |
| 31 | openai/gpt-5-nano | 3,187ms | 81 | $0.50 | $2.00 |
| 32 | openai/gpt-5.2-pro | 3,546ms | 73 | $2.50 | $10.00 |
| 33 | openai/gpt-4o | 5,378ms | 48 | $2.50 | $10.00 |
| 34 | openai/gpt-4.1 | 5,477ms | 47 | $2.00 | $8.00 |
| 35 | openai/gpt-5.3 | 5,910ms | 43 | $2.50 | $10.00 |
| 36 | openai/gpt-5.4 | 6,213ms | 41 | $2.50 | $15.00 |
| 37 | openai/gpt-5.2 | 6,507ms | 40 | $2.50 | $10.00 |
| 38 | openai/gpt-5.4-pro | 6,671ms | 40 | $2.50 | $15.00 |
| 39 | openai/gpt-5.3-codex | 7,935ms | 32 | $2.50 | $10.00 |

## Three Things That Surprised Us

### 1. xAI Grok is Absurdly Fast

Grok 4 Fast clocked in at **1,143ms** end-to-end. That's the full round trip: payment verification, API call, response. For context, OpenAI's GPT-5.4 took **6,213ms** for the same request — more than **5x slower**.

The entire xAI lineup dominated the top of the leaderboard. Six of the top 10 fastest models are from xAI. At $0.20 per million input tokens, they're also among the cheapest.

### 2. Google Gemini Owns the Efficiency Frontier

Gemini 2.5 Flash delivered **1,238ms** latency at **$0.15/$0.60** per million tokens. For simple tasks, it's the clear winner on cost-per-quality.

But here's what's more impressive: Gemini 2.5 Pro came in at **1,294ms** — barely slower than Flash — while scoring significantly higher on intelligence benchmarks. Google's infrastructure advantage is showing.

Five Google models landed in the top 13. No other provider came close to that kind of lineup depth.

### 3. OpenAI Flagship Models Are Surprisingly Slow

Every OpenAI model with "5.x" in the name landed in the bottom third of the leaderboard. GPT-5.3 Codex was dead last at **7,935ms**. Even GPT-4o, a model from 2024, took over 5 seconds.

OpenAI's "mini" and "nano" variants are faster (2.2-3.2s range) but still 2x slower than the fastest competitors. The speed gap is real and consistent across their entire lineup.

## Speed vs. Intelligence: The Tradeoff That Broke Our Routing

We cross-referenced our latency data with quality scores from [Artificial Analysis](https://artificialanalysis.ai/leaderboards/models) (Intelligence Index v4.0):

| Model | BlockRun Latency | Intelligence Index | Price Tier |
|-------|-----------------|-------------------|-----------|
| Gemini 3.1 Pro | 1,609ms | 57 | $1.25/$10 |
| GPT-5.4 | 6,213ms | 57 | $2.50/$15 |
| GPT-5.3 Codex | 7,935ms | 54 | $2.50/$10 |
| Claude Opus 4.6 | 2,139ms | 53 | $15/$75 |
| Claude Sonnet 4.6 | 2,110ms | 52 | $3/$15 |
| Kimi K2.5 | 1,646ms | 47 | $0.60/$3 |
| Gemini 3 Flash Preview | 1,398ms | 46 | $0.15/$0.60 |
| Grok 4 | 1,348ms | 41 | $0.20/$1.50 |
| Grok 4.1 Fast | 1,244ms | 41 | $0.20/$0.50 |
| DeepSeek V3 | 1,431ms | 32 | $0.27/$1.10 |
| Grok 3 | 1,244ms | 32 | $3/$15 |
| Grok 4 Fast | 1,143ms | 23 | $0.20/$0.50 |
| Gemini 2.5 Flash | 1,238ms | 20 | $0.15/$0.60 |

**Gemini 3.1 Pro** is the standout: highest intelligence score (57) at just 1.6 seconds. GPT-5.4 matches its intelligence but takes **4x longer**.

We initially used these numbers to promote fast models (Grok 4 Fast, Grok 4.1 Fast) as our default routing targets. It backfired. Users reported that the fast models were refusing complex tasks and giving shallow responses. Fast and cheap doesn't mean capable.

The fix: we now weight **quality and user retention** alongside speed in our routing algorithm. Gemini 2.5 Flash became our default for simple tasks (fast, cheap, reliable), while Kimi K2.5 handles medium-complexity work and Claude/GPT flagships handle the hard stuff.

## What This Means for Developers

**If you're building agents:** Don't default to GPT. At 5-7 seconds per call, your agent's chain-of-actions will feel sluggish. Route simple subtasks to Grok/Gemini Flash and save the flagships for reasoning-heavy steps.

**If you're cost-sensitive:** Gemini 2.5 Flash-Lite at $0.10/$0.40 with 1.35s latency is the budget king. DeepSeek Chat at $0.27/$1.10 with 1.43s is a close second.

**If you need peak intelligence:** Gemini 3.1 Pro (IQ 57, 1.6s) gives you the same quality as GPT-5.4 (IQ 57, 6.2s) at one-quarter the latency and lower cost. Claude Opus 4.6 (IQ 53, 2.1s) is the best option if you need Anthropic-family capabilities.

**If you want it all handled for you:** That's what BlockRun's smart router does. Set your profile to `auto` and we'll pick the right model based on task complexity, balancing speed, quality, and cost automatically.

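Since the gateway speaks an OpenAI-compatible chat schema (per the ClawRouter docs), opting into smart routing is just a model name in an ordinary request. A minimal sketch of the payload, where everything beyond the documented `auto` profile and the standard chat-completions fields is illustrative:

```python
import json

def auto_request(prompt: str, max_tokens: int = 256) -> str:
    """Build a chat-completions payload that lets the router pick the model."""
    payload = {
        "model": "auto",  # smart routing: the gateway selects a tier per task
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(payload)
```

POST this body to the gateway's chat-completions endpoint with any OpenAI-compatible client; no other client changes are needed.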
## Methodology

- **Date:** March 16, 2026
- **Setup:** BlockRun ClawRouter v0.12.47 proxy on localhost, connected to BlockRun's x402 payment gateway on Base (EVM)
- **Prompts:** 3 Python coding tasks (IPv4 validation, LCS algorithm, LRU cache), 2 requests per model
- **Config:** 256 max tokens, non-streaming, temperature 0.7
- **Latency:** End-to-end wall clock time including x402 payment verification (~50-100ms overhead)
- **Intelligence scores:** [Artificial Analysis Intelligence Index v4.0](https://artificialanalysis.ai/leaderboards/models) (March 2026)

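The methodology above reduces to a simple wall-clock loop. This is an illustrative sketch, not the actual harness; `call_model` stands in for one end-to-end request through the gateway, payment auth included:

```python
import time
from statistics import mean

def benchmark(call_model, prompts, runs_per_prompt=2):
    """Measure end-to-end latency in ms for each (prompt, run) pair."""
    latencies_ms = []
    for prompt in prompts:
        for _ in range(runs_per_prompt):
            start = time.perf_counter()
            call_model(prompt)  # wall clock covers every hop, not just the model
            latencies_ms.append((time.perf_counter() - start) * 1000)
    return {"samples": len(latencies_ms), "mean_ms": mean(latencies_ms)}
```

Because the timer wraps the whole call, payment verification overhead (~50-100ms) is included in every sample, matching the numbers in the tables above.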
Raw benchmark data: [benchmark-results.json](https://github.com/BlockRunAI/ClawRouter/blob/main/benchmark-results.json)

---

*BlockRun is the x402 micropayment gateway for AI. One wallet, 39+ models, pay-per-request with USDC. [Get started](https://blockrun.ai)*

---

## Twitter Thread

**Thread: We benchmarked 39 AI models through our payment gateway. The speed differences are wild. (thread)**

**1/** We ran every model on @BlockRunAI through a real-world latency benchmark. 39 models, same prompts, full payment pipeline included.

The fastest model (Grok 4 Fast) was 7x faster than the slowest (GPT-5.3 Codex). Here's the full breakdown:

**2/** Top 5 fastest (end-to-end latency):
1. xai/grok-4-fast — 1,143ms
2. xai/grok-3-mini — 1,202ms
3. google/gemini-2.5-flash — 1,238ms
4. xai/grok-3 — 1,244ms
5. nvidia/gpt-oss-120b — 1,252ms (FREE)

**3/** Bottom 5 (all OpenAI):
35. openai/gpt-5.3 — 5,910ms
36. openai/gpt-5.4 — 6,213ms
37. openai/gpt-5.2 — 6,507ms
38. openai/gpt-5.4-pro — 6,671ms
39. openai/gpt-5.3-codex — 7,935ms

Every OpenAI 5.x model: 5-8 seconds. Every Grok/Gemini model: 1.1-1.6 seconds.

**4/** But speed isn't everything.

We tried routing all requests to the fastest models. Users complained the "fast" models refused complex tasks and gave shallow answers.

Lesson: you need to balance speed, quality, AND cost.

**5/** The efficiency frontier winners:
- Best overall: Gemini 3.1 Pro (IQ 57, 1.6s, $1.25/M)
- Best budget: Gemini 2.5 Flash (IQ 20, 1.2s, $0.15/M)
- Best reasoning: Claude Opus 4.6 (IQ 53, 2.1s, $15/M)
- Best speed/quality: Kimi K2.5 (IQ 47, 1.6s, $0.60/M)

**6/** This is why we built smart routing into BlockRun.

Set `model: "auto"` and we pick the right model based on task complexity. Simple tasks get Gemini Flash. Complex reasoning gets Claude/GPT flagships.

One wallet. 39 models. The router handles the rest.

**7/** Full leaderboard, methodology, and raw data in our blog post: [link]

All 39 models benchmarked through real x402 micropayment infrastructure. No synthetic lab conditions.

Build with @BlockRunAI: blockrun.ai

@@ -0,0 +1,197 @@

# The Most AI-Agent-Native Router for OpenClaw

> *OpenClaw is one of the best AI agent frameworks available. Its LLM abstraction layer is not.*

---

## The $248/Day Problem

<p align="center"><img src="assets/blockrun-248-day-cost-overrun-problem.png" alt="The Autopsy of an Overrun — token volume compounds exponentially in agentic workloads, reaching 11.3M input tokens in a single hour" width="720"></p>

From [openclaw/openclaw#3181](https://github.com/openclaw/openclaw/issues/3181):

> *"We ended up at $248/day before we caught it. Heartbeat on Opus 4.6 with a large context. The dedup fix reduced trigger rate, but there's nothing bounding the run itself."*

> *"11.3M input tokens in 1 hour on claude-opus-4-6 (128K context), ~$20/hour."*

Both users ended up disabling heartbeat entirely. The workaround: `heartbeat.every: "0"` — turning off the feature to avoid burning money.

The root cause isn't a configuration error. It's that OpenClaw's LLM layer has no concept of what things cost, and no way to stop a run that's spending too much.

---

## What OpenClaw Gets Wrong at the Inference Layer

<p align="center"><img src="assets/blockrun-openclaw-inference-layer-blind-spots.png" alt="Orchestration frameworks are blind to inference realities — cost tier, error semantics, and context size go unscreened" width="720"></p>

OpenClaw is an excellent orchestration framework — session management, tool dispatch, agent routing, memory. But every request it makes hits a single configured model with no awareness of:

**Cost tier** — A heartbeat status check doesn't need Opus. A file read result doesn't need 128K context. OpenClaw sends both to the same model at the same price.

**Rate limit isolation** — When one provider hits a 429, OpenClaw's failover logic applies that cooldown to the entire profile, not just the offending model. Every model in the same group is penalized ([#49834](https://github.com/openclaw/openclaw/issues/49834)). If you configured 5 models for fallback, one slow provider can block all of them.

**Empty/degraded responses** — Some providers return HTTP 200 with empty content, repeated tokens, or a single newline. OpenClaw passes this through to the agent. The agent either errors out, loops, or silently gets a blank response ([#49902](https://github.com/openclaw/openclaw/issues/49902)).

**Error semantics** — OpenClaw's failover logic has known gaps. We found and fixed two while building ClawRouter:

- **MiniMax HTTP 520** ([PR #49550](https://github.com/openclaw/openclaw/pull/49550)) — MiniMax returns `{"type":"api_error","message":"unknown error, 520 (1000)"}` for transient server errors. OpenClaw's classifier required both `"type":"api_error"` AND the string `"internal server error"`. MiniMax fails the second check. Result: no failover, silent failure, retry storm.

- **Z.ai codes 1311 and 1113** ([PR #49552](https://github.com/openclaw/openclaw/pull/49552)) — Z.ai error 1311 means "model not on your plan" (billing — stop retrying). Error 1113 means "wrong endpoint" (auth — rotate key). Both fell through to `null`, got treated as `rate_limit`, triggered exponential backoff, and charged for every retry.

**Context size** — Agents accumulate context. A 10-message conversation with tool results can easily hit 40K+ tokens. OpenClaw sends the full context every request, on every retry.

---

## ClawRouter: Built for Agentic Workloads

<p align="center"><img src="assets/blockrun-clawrouter-openclaw-agentic-proxy-architecture.png" alt="ClawRouter proxy manifold sits between OpenClaw and upstream APIs like GPT-4o, Claude Opus, and Gemini — cost control is a gateway concern" width="720"></p>

ClawRouter is a local OpenAI-compatible proxy, purpose-built for how AI agents actually behave — not how simple chat clients do. It sits between OpenClaw and the upstream model APIs.

```
OpenClaw → ClawRouter → blockrun.ai → GPT-4o / Opus / Gemini / ...
               ↑
  All the smart stuff happens here
```

### 1. Token Compression — 7 Layers, Agent-Aware

<p align="center"><img src="assets/blockrun-clawrouter-7-layer-token-compression-openclaw.png" alt="Seven-layer agent-aware token compression — ClawRouter intercepts and compresses requests through 7 filters for 15–40% overall token reduction" width="720"></p>

Agents are the worst offenders for context bloat. Tool call results are verbose. File reads return thousands of lines. Conversation history compounds with every turn.

ClawRouter compresses every request through 7 layers before it hits the wire:

| Layer | What it does | Saves |
|-------|-------------|-------|
| Deduplication | Removes repeated messages (retries, echoes) | Variable |
| Whitespace | Strips excessive whitespace from all content | 2–8% |
| Dictionary | Replaces common phrases with short codes | 5–15% |
| Path shortening | Codebook for repeated file paths in tool results | 3–10% |
| JSON compaction | Removes whitespace from embedded JSON | 5–12% |
| **Observation compression** | **Summarizes tool results to key information** | **Up to 97%** |
| Dynamic codebook | Learns repetitions in the actual conversation | 3–15% |

Layer 6 is the big one. Tool results — file reads, API responses, shell output — can be 10KB+ each. The actual useful signal is often 200–300 chars. ClawRouter extracts errors, status lines, key JSON fields, and compresses the rest. Same model intelligence, 97% fewer tokens on the bulk.

<p align="center"><img src="assets/blockrun-clawrouter-observation-compression-97-percent-token-savings.png" alt="Extracting intelligence from tool bloat — raw tool output is 97% noise, ClawRouter filters to 3% signal with errors, status lines, and key values" width="720"></p>

**Overall reduction: 15–40% on typical agentic workloads.** On the $248/day scenario, that's $150–$200/day in savings from compression alone, before any routing changes.

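The exact layer-6 heuristics aren't published; as a rough illustration of the idea (keep the error/status signal, drop the bulk), a sketch might look like this. The marker list, character budget, and fallback behavior are our assumptions, not ClawRouter's implementation:

```python
KEY_MARKERS = ("error", "fail", "warn", "status", "exception")

def compress_observation(raw: str, budget: int = 300) -> str:
    """Keep lines carrying signal (errors, status); cap the total size."""
    signal = [line for line in raw.splitlines()
              if any(m in line.lower() for m in KEY_MARKERS)]
    summary = "\n".join(signal) if signal else raw
    return summary[:budget]  # hard cap: never forward the full bulk
```

On a 10KB tool result with one error line, this forwards a few hundred characters instead of the whole blob, which is the shape of the 97% figure above.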
### 2. Automatic Tier Routing — Right Model for Each Request

<p align="center"><img src="assets/blockrun-clawrouter-openclaw-automatic-tier-routing-model-selection.png" alt="Right-sizing models for specific agent tasks — ClawRouter's task-to-tier routing engine with session pinning routes heartbeats to Flash and reasoning to Opus" width="720"></p>

ClawRouter classifies every request before forwarding:

```
heartbeat status check   → SIMPLE    → gemini-2.5-flash  (~0.04¢ / request)
code review, refactor    → COMPLEX   → claude-sonnet-4-6 (~5¢ / request)
formal proof, reasoning  → REASONING → o3 / claude-opus  (~30¢ / request)
```

**Tool detection is automatic.** When OpenClaw sends a request with tools attached, ClawRouter forces agentic routing tiers — guaranteeing tool-capable models and preventing the silent fallback to models that refuse tool calls.

**Session pinning.** Once a session selects a model for a task, ClawRouter pins that model for the session lifetime. No mid-task model switching, no consistency issues across a long agent run.

The heartbeat that was burning $248/day on Opus routes to Flash at ~1/500th the cost. Same heartbeat feature, working as designed.

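The classifier itself isn't shown here; a toy sketch of the tier decision, using the tier names from the table above but with heuristics and thresholds that are entirely our own invention, could be:

```python
TIER_MODEL = {
    "SIMPLE": "gemini-2.5-flash",
    "COMPLEX": "claude-sonnet-4-6",
    "REASONING": "claude-opus-4-6",
}

def route(request: dict) -> str:
    """Map an OpenAI-style chat request to a model via a tier heuristic."""
    text = " ".join(m.get("content") or "" for m in request.get("messages", []))
    if request.get("tools"):
        tier = "COMPLEX"          # tools present: force a tool-capable tier
    elif any(k in text.lower() for k in ("prove", "derivation", "step by step")):
        tier = "REASONING"
    elif len(text) < 200:
        tier = "SIMPLE"           # heartbeats and status pings are short
    else:
        tier = "COMPLEX"
    return TIER_MODEL[tier]
```

A real classifier would also weigh context length and conversation history, but the routing shape (classify first, forward second) is the point.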
### 3. Per-Model Rate Limit Isolation — No Cross-Contamination

When a provider returns a 429, ClawRouter marks only that specific model as rate-limited, for 60 seconds — unlike the profile-wide cooldown OpenClaw applies ([#49834](https://github.com/openclaw/openclaw/issues/49834)). Other models in the fallback chain are unaffected. If Claude Sonnet gets rate-limited, Gemini Flash and GPT-4o continue working. No cascade.

Before failing over, ClawRouter also retries the rate-limited model once after 200ms. Token-bucket limits often recover within milliseconds — most short-burst 429s resolve on the first retry without ever touching a fallback model.

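Per-model isolation reduces to a cooldown map keyed by model. The timings come from the text above; the data structure is our illustration:

```python
import time

COOLDOWN_S = 60.0  # per-model cooldown after a 429

class RateLimitTracker:
    """A 429 on one model never penalizes its siblings in the fallback chain."""

    def __init__(self):
        self._blocked_until = {}  # model -> monotonic deadline

    def mark_429(self, model, now=None):
        now = time.monotonic() if now is None else now
        self._blocked_until[model] = now + COOLDOWN_S

    def available(self, model, now=None):
        now = time.monotonic() if now is None else now
        return now >= self._blocked_until.get(model, 0.0)
```

Because the map is keyed per model, rate-limiting one entry leaves every other model's availability untouched, which is exactly the no-cascade property described above.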
### 4. Empty Response Detection — No Silent Failures

ClawRouter inspects every HTTP 200 response body before forwarding it ([#49902](https://github.com/openclaw/openclaw/issues/49902)). Blank responses, repeated-token loops, and single-character outputs trigger model fallback — the same as a 5xx. The agent never sees a degraded response that would cause it to loop or silently fail.

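The exact detection rules aren't specified here; a heuristic in the spirit of the description (blank, single-character, or repeated-token outputs), with thresholds that are our assumptions:

```python
def is_degraded(text: str) -> bool:
    """Flag response bodies that should trigger fallback despite HTTP 200."""
    stripped = (text or "").strip()
    if len(stripped) <= 1:
        return True                       # blank or single-character body
    tokens = stripped.split()
    if len(tokens) >= 20 and len(set(tokens)) <= 2:
        return True                       # repeated-token loop
    return False
```

Treating a degraded 200 like a 5xx means the existing fallback machinery handles both cases with no extra plumbing.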
### 5. Correct Error Classification — No Retry Storms

<p align="center"><img src="assets/blockrun-clawrouter-openclaw-error-classification-retry-storm-prevention.png" alt="Stopping retry storms at the HTTP layer — ClawRouter classifies errors per provider with logic gate classifier and automated mechanical actions" width="720"></p>

ClawRouter classifies errors at the HTTP/body layer before OpenClaw sees them:

```
401 / 403               → auth_failure   → stop retrying, rotate key
402 / billing body      → quota_exceeded → stop retrying, surface alert
429                     → rate_limited   → backoff, try next model
529 / overloaded body   → overloaded     → short cooldown, fallback model
5xx / 520               → server_error   → retry with different model
Z.ai 1311               → billing        → stop retrying
Z.ai 1113               → auth           → rotate key
MiniMax 520 (api_error) → server_error   → retry with fallback
```

Per-provider error state is tracked independently. If MiniMax is having a bad hour, Anthropic and OpenAI routes continue working. No cross-contamination, no single provider poisoning the session.

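That mapping can be sketched as one classifier function. Body matching is deliberately simplified here (real provider payloads are JSON with provider-specific code fields), and the category names follow the table above:

```python
def classify(status: int, body: str = "") -> str:
    """Map an HTTP status plus response body to a retry/stop category."""
    b = body.lower()
    if "1311" in b:                    # Z.ai: model not on plan (billing)
        return "quota_exceeded"
    if "1113" in b:                    # Z.ai: wrong endpoint (auth)
        return "auth_failure"
    if status in (401, 403):
        return "auth_failure"
    if status == 402 or "billing" in b:
        return "quota_exceeded"
    if status == 429:
        return "rate_limited"
    if status == 529 or "overloaded" in b:
        return "overloaded"
    if status >= 500:                  # includes MiniMax's 520 api_error
        return "server_error"
    return "unknown"
```

The ordering matters: provider-specific body codes are checked before generic status buckets, so a 200-with-error-body still lands in the right category instead of falling through to `rate_limit`-style retries.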
### 6. Session Memory — Agents That Remember

<p align="center"><img src="assets/blockrun-clawrouter-openclaw-session-memory-journaling-vs-context-compounding.png" alt="Agents that remember without compounding cost — ClawRouter session journaling vs standard OpenClaw context compounding across turns" width="720"></p>

OpenClaw sessions can be long-lived. ClawRouter maintains a session journal — extracting decisions, results, and context from each turn — and injects relevant history when the agent asks questions that reference earlier work.

Less context repeated = fewer tokens = lower cost. Agents that need to recall earlier decisions don't need to carry the entire history in every prompt.

### 7. x402 Micropayments — Wallet-Based Budget Control

<p align="center"><img src="assets/blockrun-clawrouter-x402-usdc-micropayment-wallet-budget-control.png" alt="Budget limits enforced by physical construction — wallet loaded via Base/Solana, pay per call across 41+ models, balance hits zero and the valve shuts cleanly" width="720"></p>

ClawRouter pays for inference via [x402](https://x402.org/) USDC micropayments (Base or Solana). You load a wallet. Each inference call costs exactly what it costs. When the wallet runs low, requests stop cleanly.

There is no monthly invoice. There is no 3am email. There is a wallet balance, and it either has funds or it doesn't. Your budget stops the burn at the source — not an invoice that arrives after the damage is done.

**`maxCostPerRun`** — a per-session cost ceiling that stops or downgrades requests once a session exceeds a configured threshold (e.g., `$0.50`). This closes the remaining gap ([#3181](https://github.com/openclaw/openclaw/issues/3181)) where a wallet with sufficient funds can still accumulate within a single run. Two modes: `graceful` (downgrade to cheaper models) and `strict` (hard 429 once the cap is hit).

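The cap semantics reduce to a tiny per-request decision. Function and return names here are ours, purely to illustrate the two documented modes:

```python
def cap_decision(spent_usd: float, next_cost_usd: float,
                 max_cost_per_run: float = 0.50, mode: str = "graceful") -> str:
    """Decide what to do with the next request under a per-session ceiling."""
    if spent_usd + next_cost_usd <= max_cost_per_run:
        return "allow"
    if mode == "graceful":
        return "downgrade"    # reroute to a cheaper tier, keep the run alive
    return "reject_429"       # strict: hard stop once the cap is hit
```

Checking `spent + next_cost` before forwarding (rather than after billing) is what makes the ceiling a guarantee instead of an alert.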
```
41+ models. One wallet. Pay per call.
```

---

## OpenClaw + ClawRouter: The Full Picture

<p align="center"><img src="assets/blockrun-clawrouter-vs-openclaw-standalone-comparison-production-safety.png" alt="Architecting for production safety — OpenClaw standalone vs OpenClaw + ClawRouter comparison across cost, context, error handling, and budgeting" width="720"></p>

| Problem | OpenClaw alone | OpenClaw + ClawRouter |
|---------|---------------|----------------------|
| Heartbeat cost overrun | No per-run cap | Tier routing → 50–500× cheaper model |
| Large context | Full context every call | 7-layer compression, 15–40% reduction |
| Tool result bloat | Raw output forwarded | Observation compression, up to 97% |
| Rate limit contaminates profile | All models penalized (#49834) | Per-model 60s cooldown, others unaffected |
| Empty / degraded 200 response | Passed through to agent (#49902) | Detected, triggers model fallback |
| Short-burst 429 failover | Immediate failover to next model | 200ms retry first, failover only if needed |
| MiniMax 520 failure | Silent drop / retry storm | Classified as server_error, retried correctly |
| Z.ai 1311 (billing) | Treated as rate_limit, retried | Classified as billing, stopped immediately |
| Mid-task model switch | Model can change mid-session | Session pinning, consistent model per task |
| Monthly billing surprise | Possible | Wallet-based, stops when empty |
| Per-session cost ceiling | None | `maxCostPerRun` — graceful or strict cap |
| Cost visibility | None | `/stats` with per-provider error counts |

---

## Getting Started

```bash
# 1. Install with smart routing enabled
curl -fsSL https://blockrun.ai/ClawRouter-update | bash
openclaw gateway restart
```

ClawRouter auto-injects itself into `~/.openclaw/openclaw.json` as a provider on startup. No manual config needed — your existing tools, sessions, and extensions are unchanged.

Load a wallet, choose a model profile (`eco` / `auto` / `premium` / `agentic`), and run.

---

## On Our OpenClaw Contributions

We contribute upstream when we find bugs. The two PRs linked above fix real error classification gaps. Everyone using OpenClaw directly benefits.

ClawRouter exists because proxy-layer cost control, context compression, and agent-aware routing are fundamentally gateway concerns — not framework concerns. OpenClaw can't know that your heartbeat doesn't need Opus. It can't compress tool results it hasn't seen. It can't enforce a wallet ceiling.

That's what ClawRouter is for.

---

*[github.com/BlockRunAI/ClawRouter](https://github.com/BlockRunAI/ClawRouter) · [blockrun.ai](https://blockrun.ai) · `npm install -g @blockrun/clawrouter`*