@blockrun/clawrouter 0.12.61 → 0.12.63
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/docs/anthropic-cost-savings.md +349 -0
- package/docs/architecture.md +559 -0
- package/docs/assets/blockrun-248-day-cost-overrun-problem.png +0 -0
- package/docs/assets/blockrun-clawrouter-7-layer-token-compression-openclaw.png +0 -0
- package/docs/assets/blockrun-clawrouter-observation-compression-97-percent-token-savings.png +0 -0
- package/docs/assets/blockrun-clawrouter-openclaw-agentic-proxy-architecture.png +0 -0
- package/docs/assets/blockrun-clawrouter-openclaw-automatic-tier-routing-model-selection.png +0 -0
- package/docs/assets/blockrun-clawrouter-openclaw-error-classification-retry-storm-prevention.png +0 -0
- package/docs/assets/blockrun-clawrouter-openclaw-session-memory-journaling-vs-context-compounding.png +0 -0
- package/docs/assets/blockrun-clawrouter-vs-openclaw-standalone-comparison-production-safety.png +0 -0
- package/docs/assets/blockrun-clawrouter-x402-usdc-micropayment-wallet-budget-control.png +0 -0
- package/docs/assets/blockrun-openclaw-inference-layer-blind-spots.png +0 -0
- package/docs/blog-benchmark-2026-03.md +184 -0
- package/docs/blog-openclaw-cost-overruns.md +197 -0
- package/docs/clawrouter-savings.png +0 -0
- package/docs/configuration.md +512 -0
- package/docs/features.md +257 -0
- package/docs/image-generation.md +380 -0
- package/docs/plans/2026-02-03-smart-routing-design.md +267 -0
- package/docs/plans/2026-02-13-e2e-docker-deployment.md +1260 -0
- package/docs/plans/2026-02-28-worker-network.md +947 -0
- package/docs/plans/2026-03-18-error-classification.md +574 -0
- package/docs/plans/2026-03-19-exclude-models.md +538 -0
- package/docs/routing-profiles.md +81 -0
- package/docs/subscription-failover.md +320 -0
- package/docs/technical-routing-2026-03.md +322 -0
- package/docs/troubleshooting.md +159 -0
- package/docs/vision.md +49 -0
- package/docs/vs-openrouter.md +157 -0
- package/docs/worker-network.md +1241 -0
- package/package.json +3 -2
- package/scripts/reinstall.sh +8 -4
- package/scripts/update.sh +8 -4
@@ -0,0 +1,320 @@ package/docs/subscription-failover.md

# Using Subscriptions with ClawRouter Failover

This guide explains how to use your existing LLM subscriptions (Claude Pro/Max, ChatGPT Plus, etc.) as primary providers, with ClawRouter x402 micropayments as automatic failover.

## Why Not Built Into ClawRouter?

After careful consideration, we decided **not** to integrate subscription support directly into ClawRouter for several important reasons:

### 1. Terms of Service Compliance

- Most subscription ToS (Claude Code, ChatGPT Plus) are designed for personal use
- Using them through a proxy/API service may violate provider agreements
- We want to keep ClawRouter compliant and low-risk for all users

### 2. Security & Privacy

- Integrating subscriptions would require ClawRouter to access your credentials/sessions
- Spawning external processes (like the Claude CLI) introduces security concerns
- Better to keep authentication at the OpenClaw layer, where you control it

### 3. Maintenance & Flexibility

- Each subscription provider has different APIs, CLIs, and authentication methods
- OpenClaw already has a robust provider system that handles this
- Duplicating this in ClawRouter would increase complexity without added value

### 4. Better Architecture

- OpenClaw's native failover mechanism is more flexible and powerful
- Works with **any** provider (not just Claude)
- Zero code changes needed in ClawRouter
- You maintain full control over your credentials

## How It Works

OpenClaw has a built-in **model fallback chain** that automatically tries alternative providers when the primary fails:

```
User Request
    ↓
Primary Provider (e.g., Claude subscription via OpenClaw)
    ↓ (rate limited / quota exceeded / auth failed)
OpenClaw detects failure
    ↓
Fallback Chain (try each in order)
    ↓
ClawRouter (blockrun/auto)
    ↓
Smart routing picks cheapest model
    ↓
x402 micropayment to BlockRun API
    ↓
Response returned to user
```

**Key benefits:**

- ✅ Automatic failover (no manual intervention)
- ✅ Works with any subscription provider OpenClaw supports
- ✅ Respects provider ToS (you configure authentication directly)
- ✅ ClawRouter stays focused on cost optimization
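
As a mental model only, the chain behaves like an ordered retry loop. OpenClaw's real implementation differs; the `Provider` shape and function name below are illustrative, not its API:

```typescript
// Illustrative sketch of a fallback chain walk; not OpenClaw's actual
// code. The Provider shape and function name are hypothetical.
type Provider = { id: string; call: (prompt: string) => Promise<string> };

async function completeWithFailover(chain: Provider[], prompt: string): Promise<string> {
  let lastError: unknown = new Error("empty fallback chain");
  for (const provider of chain) {
    try {
      // First entry is your subscription; later entries are ClawRouter
      // models paid per-request via x402.
      return await provider.call(prompt);
    } catch (err) {
      lastError = err; // rate limit / quota / auth failure: try the next one
    }
  }
  throw lastError;
}
```

The point of the sketch: the subscription is always tried first, and ClawRouter only sees traffic the subscription rejected.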

## Setup Guide

### Prerequisites

1. **OpenClaw Gateway installed** with the ClawRouter plugin

   ```bash
   npm install -g openclaw
   openclaw plugins install @blockrun/clawrouter
   ```

2. **Subscription configured in OpenClaw**
   - For Claude: Use `claude setup-token` or an API key
   - For OpenAI: Set the `OPENAI_API_KEY` environment variable
   - For others: See the [OpenClaw provider docs](https://docs.openclaw.ai)

3. **ClawRouter wallet funded** (for failover)

   ```bash
   openclaw gateway logs | grep "Wallet:"
   # Send USDC to the displayed address on Base network
   ```

### Configuration Steps

#### Step 1: Set Primary Model (Your Subscription)

```bash
# Option A: Using Claude subscription
openclaw models set anthropic/claude-sonnet-4.6

# Option B: Using ChatGPT Plus (via OpenAI provider)
openclaw models set openai/gpt-4o

# Option C: Using any other provider
openclaw models set <provider>/<model>
```

#### Step 2: Add ClawRouter as Fallback

```bash
# Add blockrun/auto for smart routing (recommended)
openclaw models fallbacks add blockrun/auto

# Or pin a specific model
openclaw models fallbacks add blockrun/google/gemini-2.5-pro
```

#### Step 3: Verify Configuration

```bash
openclaw models show
```

Expected output:

```
Primary: anthropic/claude-sonnet-4.6
Fallbacks:
  1. blockrun/auto
```

#### Step 4: Test Failover (Optional)

To verify failover works:

1. **Temporarily exhaust your subscription quota** (or wait for a rate limit)
2. **Make a request** - OpenClaw should automatically fail over to ClawRouter
3. **Check logs:**

   ```bash
   openclaw gateway logs | grep -i "fallback\|blockrun"
   ```

### Advanced Configuration

#### Configure Multiple Fallbacks

```bash
openclaw models fallbacks add blockrun/google/gemini-2.5-flash  # Fast & cheap
openclaw models fallbacks add blockrun/deepseek/deepseek-chat   # Even cheaper
openclaw models fallbacks add blockrun/nvidia/gpt-oss-120b      # Free tier
```

#### Per-Agent Configuration

Edit `~/.openclaw/openclaw.json`:

```json
{
  "agents": {
    "main": {
      "model": {
        "primary": "anthropic/claude-opus-4.6",
        "fallbacks": ["blockrun/auto"]
      }
    },
    "coding": {
      "model": {
        "primary": "anthropic/claude-sonnet-4.6",
        "fallbacks": ["blockrun/google/gemini-2.5-pro", "blockrun/deepseek/deepseek-chat"]
      }
    }
  }
}
```

#### Tier-Based Configuration (ClawRouter Smart Routing)

When using `blockrun/auto`, ClawRouter automatically classifies your request and picks the cheapest capable model:

- **SIMPLE** queries → Gemini 2.5 Flash, DeepSeek Chat (~$0.0001/req)
- **MEDIUM** queries → GPT-4o-mini, Gemini Flash (~$0.001/req)
- **COMPLEX** queries → Claude Sonnet, Gemini Pro (~$0.01/req)
- **REASONING** queries → DeepSeek R1, o3-mini (~$0.05/req)

Learn more: [ClawRouter Smart Routing](./smart-routing.md)

## Monitoring & Troubleshooting

### Check If Failover Is Working

```bash
# Watch real-time logs
openclaw gateway logs --follow | grep -i "fallback\|blockrun\|rate.limit\|quota"

# Check ClawRouter proxy logs
openclaw gateway logs | grep "ClawRouter"
```

**Success indicators:**

- ✅ "Rate limit reached" or "Quota exceeded" → primary failed
- ✅ "Trying fallback: blockrun/auto" → failover triggered
- ✅ "ClawRouter: Success with model" → failover succeeded

### Common Issues

#### Issue: Failover never triggers

**Symptoms:** Always uses the primary, never switches to ClawRouter

**Solutions:**

1. Check that fallbacks are configured:

   ```bash
   openclaw models show
   ```

2. Verify the primary is actually failing (check the provider dashboard for quota/rate limits)
3. Check OpenClaw logs for authentication errors

#### Issue: "Wallet empty" errors during failover

**Symptoms:** Failover triggers but ClawRouter returns balance errors

**Solutions:**

1. Check the ClawRouter wallet balance:

   ```bash
   openclaw gateway logs | grep "Balance:"
   ```

2. Fund the wallet on Base network (USDC)
3. Verify the wallet key is configured correctly

#### Issue: Slow failover (high latency)

**Symptoms:** 5-10 second delay when switching to ClawRouter

**Cause:** OpenClaw tries multiple auth profiles before failing over

**Solutions:**

1. Reduce auth profile retry attempts (see OpenClaw config)
2. Use `blockrun/auto` as primary for faster responses
3. Accept the latency as a tradeoff for cheaper requests

## Cost Analysis

### Example Scenario

**Usage pattern:**

- 100 requests/day
- 50% hit the Claude subscription quota (rate limited)
- 50% use ClawRouter failover

**Without failover:**

- Pay the Anthropic API: $50/month (100% API usage)

**With failover:**

- Claude subscription: $20/month (covers 50%)
- ClawRouter x402: ~$5/month (~50 requests/day via smart routing)
- **Total: $25/month (50% savings)**
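
The scenario's arithmetic can be checked directly. Every figure below is an assumption from the example above, not a measured price:

```typescript
// Reproduces the example scenario; all figures are the example's
// assumptions, not measured prices.
const requestsPerMonth = 100 * 30;  // 100 requests/day
const failoverShare = 0.5;          // half fall through to ClawRouter

const withoutFailover = 50;         // all-API baseline, $/month
const subscription = 20;            // Claude subscription, $/month
const clawrouter = 5;               // ~$5/month for the routed half

const withFailover = subscription + clawrouter;                     // $25
const savings = (withoutFailover - withFailover) / withoutFailover; // 0.5
const perRoutedRequest = clawrouter / (failoverShare * requestsPerMonth); // ≈ $0.0033
```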

### When Does This Make Sense?

✅ **Good fit:**

- You already have a subscription for personal use
- You occasionally exceed quota/rate limits
- You want cost optimization without managing API keys

❌ **Not ideal:**

- You need 100% reliability (subscriptions have rate limits)
- You prefer a single provider (no failover complexity)
- Your usage is low (< 10 requests/day)

## FAQ

### Q: Will this violate my subscription ToS?

**A:** You configure the subscription directly in OpenClaw using your own credentials. ClawRouter only receives requests after your subscription fails. This is similar to using multiple API keys yourself.

However, each provider has different ToS. Check yours before proceeding:

- [Claude Code Terms](https://claude.ai/terms)
- [ChatGPT Terms](https://openai.com/policies/terms-of-use)

### Q: Can I use multiple subscriptions?

**A:** Yes! Configure multiple providers with fallback chains:

```bash
openclaw models set anthropic/claude-opus-4.6
openclaw models fallbacks add openai/gpt-4o   # ChatGPT Plus
openclaw models fallbacks add blockrun/auto   # x402 as final fallback
```

### Q: Does this work with Claude Max API Proxy?

**A:** Yes! Configure the proxy as a custom provider in OpenClaw, then add `blockrun/auto` as a fallback.

See: [Claude Max API Proxy Guide](https://github.com/anthropics/claude-code/blob/main/docs/providers/claude-max-api-proxy.md)

### Q: How is this different from PR #15?

**A:** PR #15 integrated the Claude CLI directly into ClawRouter. Our approach:

- ✅ Works with any provider (not just Claude)
- ✅ Respects provider ToS (no proxy/wrapper)
- ✅ Uses OpenClaw's native failover (more reliable)
- ✅ Zero maintenance burden on ClawRouter

## Feedback & Support

We'd love to hear about your experience with subscription failover:

- **GitHub Discussions:** [Share your setup](https://github.com/BlockRunAI/ClawRouter/discussions)
- **Issues:** [Report problems](https://github.com/BlockRunAI/ClawRouter/issues)
- **Telegram:** [Join the community](https://t.me/blockrunAI)

## Related Documentation

- [OpenClaw Model Failover](https://docs.openclaw.ai/concepts/model-failover)
- [OpenClaw Provider Configuration](https://docs.openclaw.ai/gateway/configuration)
- [ClawRouter Smart Routing](./smart-routing.md)
- [ClawRouter x402 Micropayments](./x402-payments.md)

@@ -0,0 +1,322 @@ package/docs/technical-routing-2026-03.md

# Building a Smart LLM Router: How We Benchmarked 46 Models and Built a 14-Dimension Classifier

*March 20, 2026 | BlockRun Engineering*

When you route AI requests across 46 models from 8 providers, you can't just pick the cheapest one. You can't just pick the fastest one either. We learned this the hard way.

This is the technical story of how we benchmarked every model on our platform, discovered that speed and intelligence are poorly correlated, and built a production routing system that classifies requests in under 1ms using 14 weighted dimensions with sigmoid confidence calibration.

## The Problem: One Gateway, 46 Models, Infinite Wrong Choices

BlockRun is an x402 micropayment gateway. Every LLM request flows through our proxy, gets authenticated via on-chain USDC payment, and is forwarded to the appropriate provider. The payment overhead adds 50-100ms to every request.

Our users set `model: "auto"` and expect us to pick the right model. But "right" means different things for different requests:

- A "what is Python?" query should route to the cheapest, fastest model
- An "implement a B-tree with concurrent insertions" query needs a capable model
- A "prove this theorem step by step" query needs reasoning capabilities
- An agentic workflow with tool calls needs models that follow instructions precisely

We needed a system that could classify any request and route it to the optimal model in real time.

## Step 1: Benchmarking the Fleet

Before building the router, we needed ground truth. We benchmarked all 46 models through our production payment pipeline.

### Methodology

```
Setup:    ClawRouter v0.12.47 proxy on localhost
            → BlockRun x402 gateway (Base EVM chain)
            → Provider APIs (OpenAI, Anthropic, Google, xAI, DeepSeek, Moonshot, MiniMax, NVIDIA, Z.AI)

Prompts:  3 Python coding tasks (IPv4 validation, LCS algorithm, LRU cache)
          2 requests per model per prompt
Config:   256 max tokens, non-streaming, temperature 0.7
Measured: End-to-end wall clock time (includes x402 payment verification)
```

This is not a synthetic benchmark. Every measurement includes the full payment-verification round trip that real users experience.
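
Aggregating the raw samples into per-model figures is the uninteresting-but-necessary part. A minimal helper of the kind used for the numbers that follow might look like this (the `Sample` shape is illustrative, not our harness):

```typescript
// Collapses repeated measurements of one model into mean end-to-end
// latency and an approximate decode throughput. The x402 payment
// round trip is deliberately left inside wallClockMs.
interface Sample { wallClockMs: number; completionTokens: number }

function summarize(samples: Sample[]): { meanMs: number; tokPerSec: number } {
  const meanMs = samples.reduce((s, x) => s + x.wallClockMs, 0) / samples.length;
  const tokPerSec =
    samples.reduce((s, x) => s + x.completionTokens / (x.wallClockMs / 1000), 0) /
    samples.length;
  return { meanMs, tokPerSec };
}
```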

### The Latency Landscape

Results revealed a 7x spread between the fastest and slowest models:

```
FAST TIER (<1.5s):
  xai/grok-4-fast            1,143ms   224 tok/s   $0.20/$0.50
  xai/grok-3-mini            1,202ms   215 tok/s   $0.30/$0.50
  google/gemini-2.5-flash    1,238ms   208 tok/s   $0.30/$2.50
  google/gemini-2.5-pro      1,294ms   198 tok/s   $1.25/$10.00
  google/gemini-3-flash      1,398ms   183 tok/s   $0.50/$3.00
  deepseek/deepseek-chat     1,431ms   179 tok/s   $0.28/$0.42

MID TIER (1.5-2.5s):
  google/gemini-3.1-pro      1,609ms   167 tok/s   $2.00/$12.00
  moonshot/kimi-k2.5         1,646ms   156 tok/s   $0.60/$3.00
  anthropic/claude-sonnet    2,110ms   121 tok/s   $3.00/$15.00
  anthropic/claude-opus      2,139ms   120 tok/s   $5.00/$25.00
  openai/o3-mini             2,260ms   114 tok/s   $1.10/$4.40

SLOW TIER (>3s):
  openai/gpt-5.2-pro         3,546ms    73 tok/s   $21.00/$168.00
  openai/gpt-4o              5,378ms    48 tok/s   $2.50/$10.00
  openai/gpt-5.4             6,213ms    41 tok/s   $2.50/$15.00
  openai/gpt-5.3-codex       7,935ms    32 tok/s   $1.75/$14.00
```

Two clear patterns:

1. **Google and xAI dominate speed.** 11 of the top 13 fastest models are from Google or xAI.
2. **OpenAI flagship models are consistently slow.** Every GPT-5.x model takes 3-8 seconds. Even their cheapest models (GPT-4.1-nano at $0.10/$0.40) are 2x slower than Google's cheapest.

## Step 2: Adding the Quality Dimension

Speed alone tells you nothing about whether a model can actually handle your request. We cross-referenced our latency data with Artificial Analysis Intelligence Index v4.0 scores (a composite of GPQA, MMLU, MATH, HumanEval, and other benchmarks):

```
MODEL                         LATENCY    IQ    $/M INPUT
─────────────────────────────────────────────────────────
google/gemini-3.1-pro         1,609ms    57    $2.00   ← SWEET SPOT
openai/gpt-5.4                6,213ms    57    $2.50
openai/gpt-5.3-codex          7,935ms    54    $1.75
anthropic/claude-opus-4.6     2,139ms    53    $5.00
anthropic/claude-sonnet-4.6   2,110ms    52    $3.00
google/gemini-3-pro-prev      1,352ms    48    $2.00
moonshot/kimi-k2.5            1,646ms    47    $0.60
google/gemini-3-flash-prev    1,398ms    46    $0.50   ← VALUE SWEET SPOT
xai/grok-4                    1,348ms    41    $0.20
xai/grok-4.1-fast             1,244ms    41    $0.20
deepseek/deepseek-chat        1,431ms    32    $0.28
xai/grok-4-fast               1,143ms    23    $0.20
google/gemini-2.5-flash       1,238ms    20    $0.30
```

### The Efficiency Frontier

Plotting IQ against latency reveals a clear efficiency frontier:

```
IQ
57 | Gem3.1Pro ·························· GPT-5.4
   |
53 |         · Opus
52 |        · Sonnet
   |
48 | Gem3Pro ·
47 |          · Kimi
46 | Gem3Flash ·
   |
41 |     Grok4 ·
   |
32 |  Grok3 ·      · DeepSeek
   |
23 | GrokFast ·
20 | GemFlash ·
   └──────────────────────────────────────────────
     1.0   1.5   2.0   2.5   3.0       6.0    8.0
              End-to-End Latency (seconds)
```

The frontier runs from Gemini 2.5 Flash (IQ 20, 1.2s) up to Gemini 3.1 Pro (IQ 57, 1.6s). Everything above and to the right of this line is dominated: you can get equal or better quality at lower latency from a different model.

Key insight: **Gemini 3.1 Pro matches GPT-5.4's IQ at 1/4 the latency and lower cost.** Claude Sonnet 4.6 nearly matches Opus 4.6 quality at 60% of the price. These dominated pairings directly informed our routing fallback chains.
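
The "dominated" test from the plot is mechanical and worth making precise. A sketch, with assumed field names and data points taken from the tables above:

```typescript
interface ModelPoint { id: string; latencyMs: number; iq: number }

// A model is dominated when another is at least as smart and at least
// as fast, and strictly better on one of the two axes.
function isDominated(m: ModelPoint, all: ModelPoint[]): boolean {
  return all.some(
    (o) =>
      o.id !== m.id &&
      o.iq >= m.iq &&
      o.latencyMs <= m.latencyMs &&
      (o.iq > m.iq || o.latencyMs < m.latencyMs),
  );
}

function efficiencyFrontier(all: ModelPoint[]): ModelPoint[] {
  return all.filter((m) => !isDominated(m, all));
}
```

Running this over the benchmark table drops GPT-5.4 (Gemini 3.1 Pro matches its IQ at lower latency) while keeping both ends of the frontier.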

## Step 3: The Failed Experiment (Latency-First Routing)

Armed with benchmark data, we initially optimized for speed. The routing config promoted fast models:

```typescript
// v0.12.47 — latency-optimized (REVERTED)
COMPLEX: {
  primary: "xai/grok-4-0709",            // 1,348ms, IQ 41
  fallback: [
    "xai/grok-4-1-fast-non-reasoning",   // 1,244ms, IQ 41
    "google/gemini-2.5-flash",           // 1,238ms, IQ 20
    // ... fast models first
  ],
}
```

Users complained within 24 hours. The fast models were refusing complex tasks and giving shallow responses. A model with IQ 41 can't reliably handle architecture design or multi-step code generation, no matter how fast it is.

**Lesson: optimizing for a single metric in a multi-objective system creates failure modes.** We needed to optimize across speed, quality, and cost simultaneously.

## Step 4: The 14-Dimension Scoring System

The router needs to determine what kind of request it's looking at before selecting a model. We built a rule-based classifier that scores requests across 14 weighted dimensions:

### Architecture

```
User Prompt → Lowercase + Tokenize
       ↓
┌──────────────────────────────────┐
│ 14 Dimension Scorers             │
│ Each returns score ∈ [-1, 1]     │
└──────┬───────────────────────────┘
       ↓
Weighted Sum (configurable weights)
       ↓
Tier Boundaries (SIMPLE < 0.0 < MEDIUM < 0.3 < COMPLEX < 0.5 < REASONING)
       ↓
Sigmoid Confidence Calibration
       ↓
confidence < 0.7 → AMBIGUOUS → default to MEDIUM
confidence ≥ 0.7 → Classified tier
       ↓
Tier × Profile → Model Selection
```

### The 14 Dimensions

| Dimension | Weight | What It Detects | Score Range |
|-----------|--------|-----------------|-------------|
| reasoningMarkers | 0.18 | "prove", "theorem", "step by step" | 0 to 1.0 |
| codePresence | 0.15 | "function", "class", "import", "```" | 0 to 1.0 |
| multiStepPatterns | 0.12 | "first...then", "step N", numbered lists | 0 or 0.5 |
| technicalTerms | 0.10 | "algorithm", "kubernetes", "distributed" | 0 to 1.0 |
| tokenCount | 0.08 | Short (<50 tokens) vs long (>500 tokens) | -1.0 to 1.0 |
| creativeMarkers | 0.05 | "story", "poem", "brainstorm" | 0 to 0.7 |
| questionComplexity | 0.05 | Number of question marks (>3 = complex) | 0 or 0.5 |
| agenticTask | 0.04 | "edit", "deploy", "fix", "debug" | 0 to 1.0 |
| constraintCount | 0.04 | "at most", "within", "O()" | 0 to 0.7 |
| imperativeVerbs | 0.03 | "build", "create", "implement" | 0 to 0.5 |
| outputFormat | 0.03 | "json", "yaml", "table", "csv" | 0 to 0.7 |
| simpleIndicators | 0.02 | "what is", "hello", "define" | 0 to -1.0 |
| referenceComplexity | 0.02 | "the code above", "the API docs" | 0 to 0.5 |
| domainSpecificity | 0.02 | "quantum", "FPGA", "genomics" | 0 to 0.8 |

Weights sum to 1.0. The weighted score maps to a continuous axis where tier boundaries partition the space.
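
To make the pipeline concrete, here is a heavily condensed sketch: three dimensions instead of fourteen, toy keyword regexes, and weights renormalized to sum to 1.0 for the illustration. The real weights and keyword lists live in `src/router/rules.ts`:

```typescript
// Condensed scorer sketch: 3 of the 14 dimensions, toy keyword
// regexes, illustrative weights. Each scorer returns a value in [-1, 1].
type Scorer = (prompt: string) => number;

const dimensions: { weight: number; score: Scorer }[] = [
  { weight: 0.5, score: (p) => (/prove|theorem|step by step/.test(p) ? 1 : 0) }, // reasoningMarkers
  { weight: 0.3, score: (p) => (/function|class|import/.test(p) ? 1 : 0) },      // codePresence
  { weight: 0.2, score: (p) => (/^(what is|define|hello)/.test(p) ? -1 : 0) },   // simpleIndicators
];

function weightedScore(prompt: string): number {
  const p = prompt.toLowerCase();
  return dimensions.reduce((sum, d) => sum + d.weight * d.score(p), 0);
}

// Tier boundaries from the architecture diagram above.
function tierFor(score: number): "SIMPLE" | "MEDIUM" | "COMPLEX" | "REASONING" {
  if (score < 0.0) return "SIMPLE";
  if (score < 0.3) return "MEDIUM";
  if (score < 0.5) return "COMPLEX";
  return "REASONING";
}
```

With these toy weights, "What is Python?" scores -0.2 (SIMPLE) and "Prove this theorem step by step" scores 0.5 (REASONING).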

### Multilingual Support

Every keyword list includes translations in 9 languages (EN, ZH, JA, RU, DE, ES, PT, KO, AR). A Chinese user asking "证明这个定理" triggers the same reasoning classification as "prove this theorem."

### Confidence Calibration

Raw tier assignments can be ambiguous when a score falls near a boundary. We use sigmoid calibration:

```
confidence = 1 / (1 + exp(-steepness * distance_from_boundary))
```

Where `steepness = 12` and `distance_from_boundary` is the score's distance to the nearest tier boundary. This maps to a [0.5, 1.0] confidence range. Below `threshold = 0.7`, the request is classified as ambiguous and defaults to MEDIUM.
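
In code, the calibration is a one-liner; the constants come from the text above, and the function names are ours:

```typescript
// distance_from_boundary is >= 0, so confidence lands in [0.5, 1.0].
const STEEPNESS = 12;
const AMBIGUITY_THRESHOLD = 0.7;

function confidence(distanceFromBoundary: number): number {
  return 1 / (1 + Math.exp(-STEEPNESS * distanceFromBoundary));
}

// Below the threshold the request is AMBIGUOUS and defaults to MEDIUM.
function isAmbiguous(distanceFromBoundary: number): boolean {
  return confidence(distanceFromBoundary) < AMBIGUITY_THRESHOLD;
}
```

A score sitting exactly on a boundary gives confidence 0.5; with steepness 12, the 0.7 threshold is crossed at a distance of roughly 0.07.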

### Agentic Detection

A separate scoring pathway detects agentic tasks (multi-step, tool-using, iterative). When `agenticScore >= 0.5`, the router switches to agentic-optimized tier configs that prefer models with strong instruction following (Claude Sonnet for complex tasks, GPT-4o-mini for simple tool calls).

## Step 5: Tier-to-Model Mapping

Once a request is classified into a tier, the router selects from 4 routing profiles:

### Auto Profile (Default)

Tuned from our benchmark data + user retention metrics:

```
SIMPLE  → gemini-2.5-flash          (1,238ms, IQ 20, 60% retention)
MEDIUM  → kimi-k2.5                 (1,646ms, IQ 47, strong tool use)
COMPLEX → gemini-3.1-pro            (1,609ms, IQ 57, fastest flagship)
REASON  → grok-4-1-fast-reasoning   (1,454ms, $0.20/$0.50)
```

### Eco Profile

Ultra cost-optimized. Uses free/near-free models:

```
SIMPLE  → nvidia/gpt-oss-120b       (FREE)
MEDIUM  → gemini-2.5-flash-lite     ($0.10/$0.40, 1M context)
COMPLEX → gemini-2.5-flash-lite     ($0.10/$0.40)
REASON  → grok-4-1-fast-reasoning   ($0.20/$0.50)
```

### Premium Profile

Best quality regardless of cost:

```
SIMPLE  → kimi-k2.5                 ($0.60/$3.00)
MEDIUM  → gpt-5.3-codex             ($1.75/$14.00, 400K context)
COMPLEX → claude-opus-4.6           ($5.00/$25.00)
REASON  → claude-sonnet-4.6         ($3.00/$15.00)
```

### Fallback Chains

Each tier config includes an ordered fallback list. When the primary model returns a 402 (payment failed), 429 (rate limited), or 5xx, the proxy walks the fallback chain. Fallback ordering is benchmark-informed:

```typescript
// COMPLEX tier — quality-first fallback order
fallback: [
  "google/gemini-3-pro-preview",     // IQ 48, 1,352ms
  "google/gemini-3-flash-preview",   // IQ 46, 1,398ms
  "xai/grok-4-0709",                 // IQ 41, 1,348ms
  "google/gemini-2.5-pro",           // 1,294ms
  "anthropic/claude-sonnet-4.6",     // IQ 52, 2,110ms
  "deepseek/deepseek-chat",          // IQ 32, 1,431ms
  "google/gemini-2.5-flash",         // IQ 20, 1,238ms
  "openai/gpt-5.4",                  // IQ 57, 6,213ms — last resort
]
```

The chain descends by quality first (IQ 48 → 46 → 41), then trades quality for speed. GPT-5.4 is last despite having IQ 57, because its 6.2s latency is a worst-case user experience.
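
The retryable-status logic is small enough to state exactly. A sketch; the helper names are ours, not ClawRouter's API:

```typescript
// 402 (payment failed), 429 (rate limited), and any 5xx advance the
// chain; other statuses surface to the caller unchanged.
function shouldFallThrough(status: number): boolean {
  return status === 402 || status === 429 || (status >= 500 && status <= 599);
}

// Walk the ordered chain until a model's status is not retryable.
function resolveModel(chain: string[], statusFor: (model: string) => number): string | null {
  for (const model of chain) {
    if (!shouldFallThrough(statusFor(model))) return model;
  }
  return null; // chain exhausted
}
```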

## Step 6: Context-Aware Filtering

The fallback chain is filtered at runtime based on request properties:

1. **Context window filtering**: Models with insufficient context window for the estimated total tokens are excluded (with a 10% safety buffer)
2. **Tool calling filter**: When the request includes tool definitions, only models that support function calling are kept
3. **Vision filter**: When the request includes images, only vision-capable models are kept

If filtering eliminates all candidates, the full chain is used as a fallback (better to let the API error than return nothing).
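
The three filters compose into a single pass over the chain. A sketch with assumed field names (the real model metadata shape is internal):

```typescript
interface ModelMeta {
  id: string;
  contextWindow: number;   // tokens
  supportsTools: boolean;
  supportsVision: boolean;
}

function filterChain(
  chain: ModelMeta[],
  estimatedTokens: number,
  needsTools: boolean,
  needsVision: boolean,
): ModelMeta[] {
  const kept = chain.filter(
    (m) =>
      m.contextWindow >= estimatedTokens * 1.1 &&  // 10% safety buffer
      (!needsTools || m.supportsTools) &&
      (!needsVision || m.supportsVision),
  );
  // If every candidate was eliminated, return the full chain and let
  // the upstream API produce the error.
  return kept.length > 0 ? kept : chain;
}
```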

## Cost Calculation and Savings

Every routing decision includes a cost estimate and savings percentage against a baseline (Claude Opus 4.6 pricing):

```typescript
savings = max(0, (opusCost - routedCost) / opusCost)
```

For a typical SIMPLE request (500 input tokens, 256 output tokens):

- Opus cost: $0.0089 (at $5.00/$25.00 per 1M tokens)
- Gemini Flash cost: $0.0008 (at $0.30/$2.50 per 1M tokens)
- Savings: 91.0%

Across our user base, the median savings rate is 85% compared to routing everything to a premium model.
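
The worked example reproduces directly, with prices per 1M tokens as in the list above (the helper name is ours):

```typescript
// Per-request cost at per-1M-token prices, and savings against the
// Claude Opus 4.6 baseline.
function requestCost(inTok: number, outTok: number, inPrice: number, outPrice: number): number {
  return (inTok / 1e6) * inPrice + (outTok / 1e6) * outPrice;
}

const opusCost = requestCost(500, 256, 5.0, 25.0);    // $0.0089
const routedCost = requestCost(500, 256, 0.3, 2.5);   // ≈ $0.0008
const savings = Math.max(0, (opusCost - routedCost) / opusCost); // ≈ 0.91
```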

## Performance

The entire classification pipeline (14 dimensions + tier mapping + model selection) runs in under 1ms. No external API calls. No LLM inference. Pure keyword matching and arithmetic.

We originally designed a two-stage system where low-confidence rule-based classifications would fall back to an LLM classifier (Gemini 2.5 Flash). In practice, the rules handle 70-80% of requests with high confidence, and the remaining ambiguous cases default to MEDIUM — which is the correct conservative choice.

## What We Learned

1. **Speed and intelligence are weakly correlated.** The fastest model (Grok 4 Fast, IQ 23) is at the bottom of the quality scale. The smartest model at low latency (Gemini 3.1 Pro, IQ 57, 1.6s) is a Google model, not OpenAI's.

2. **Optimizing for one metric fails.** Latency-first routing breaks quality. Quality-first routing breaks latency budgets. You need multi-objective optimization.

3. **User retention is the real metric.** Our best-performing model for SIMPLE tasks isn't the cheapest or the fastest — it's Gemini 2.5 Flash (60% retention rate), which balances speed, cost, and just-enough quality.

4. **Fallback ordering matters more than primary selection.** The primary model handles the happy path. The fallback chain handles reality — rate limits, outages, payment failures. A well-ordered fallback chain is more important than picking the perfect primary.

5. **Rule-based classification is underrated.** 14 keyword dimensions with sigmoid confidence calibration handle 70-80% of requests correctly in <1ms. The remaining 20-30% default to a safe middle tier. For a routing system where every millisecond of overhead compounds across millions of requests, avoiding LLM inference in the classification step is worth the reduced accuracy.

---

## Appendix: Full Benchmark Data

Raw data (46 models; latency, throughput, IQ scores, pricing): [`benchmark-merged.json`](https://github.com/BlockRunAI/ClawRouter/blob/main/benchmark-merged.json)

Routing configuration: [`src/router/config.ts`](https://github.com/BlockRunAI/ClawRouter/blob/main/src/router/config.ts)

Scoring implementation: [`src/router/rules.ts`](https://github.com/BlockRunAI/ClawRouter/blob/main/src/router/rules.ts)

---

*BlockRun is the x402 micropayment gateway for AI. One wallet, 46+ models, pay-per-request with USDC. [blockrun.ai](https://blockrun.ai)*