lynkr 9.4.5 → 9.5.0

package/README.md CHANGED
@@ -545,6 +545,28 @@ TOOL_INJECTION_ENABLED=false
  CODE_MODE_ENABLED=true
  ```
  
+ Always-on (no config): **smart tool selection** (server mode), **RTK tool-result
+ compression** (test/git/grep/lint/build/JSON output), **MCP tool dedup** (drops
+ built-in WebSearch/WebFetch when an Exa/Tavily MCP tool is present), and
+ **request bypass** (Claude CLI Warmup / title-extraction calls are answered
+ locally, never hitting a provider).
+
+ Optional **terse-output mode** to cut *output* tokens:
+ ```bash
+ CAVEMAN_ENABLED=true # off by default — nudges the model to be concise
+ CAVEMAN_LEVEL=lite # lite | full | ultra
+ ```
+
+ ### Cost tracking & model pricing
+ Per-request cost is computed from a model-pricing registry (LiteLLM → models.dev,
+ cached 24h) and recorded in telemetry. Models the registry doesn't know record
+ `cost_usd=null` (logged once) rather than a fabricated price. Pin prices for
+ unknown models:
+ ```bash
+ # Per-1M-token USD prices, JSON keyed by model name
+ MODEL_PRICE_OVERRIDES={"my-model":{"input":0.5,"output":1.5}}
+ ```
+
  ### Memory System (Titans-inspired)
  ```bash
  MEMORY_ENABLED=true
@@ -652,35 +674,45 @@ npm start
  
  ## Benchmark Results
  
- Measured on real agentic coding workloads (Claude Code / Cursor sessions) with Ollama, Moonshot, and Azure OpenAI backends. Run with `node benchmark-tier-routing.js`.
+ Head-to-head against **LiteLLM** on the **same backends** (Ollama `minimax-m2.5`, Moonshot, Azure OpenAI), 9 scenarios across 4 feature categories. Apples-to-apples comparison is Lynkr vs LiteLLM **billed tokens on the same scenario**. Run with `node benchmark-tier-routing.js`.
  
- ### Token compression
+ > _Run: June 5, 2026 · Lynkr v9.3.2 · LiteLLM v1.87.1 · macOS, Apple Silicon._
  
- | Scenario | Tokens without Lynkr | Tokens with Lynkr | Reduction |
+ ### Token reduction (vs LiteLLM, same model & prompt)
+
+ | Mechanism | Lynkr | LiteLLM | Result |
  |---|---|---|---|
- | 14-tool request (read task) | 1,042 | **547** | **47%** |
- | 14-tool request (write task) | 1,043 | **412** | **60%** |
- | Large JSON grep result (60 items) | 3,458 | **427** | **87.6%** |
+ | Smart tool selection (14 tools) | **959** tokens · $0.0044 | 2,085 tokens · $0.0091 | **53% fewer tokens, 52% cheaper** |
+ | TOON compression (60-item grep JSON) | **427** tokens · $0.009 | 3,458 tokens · $0.018 | **87.6% fewer tokens, 50% cheaper** |
  
- Lynkr strips irrelevant tool schemas before forwarding (smart tool selection) and binary-compresses large JSON tool results (TOON) — both happen in-process with no added latency.
+ Lynkr strips irrelevant tool schemas (smart tool selection) and binary-compresses large JSON tool results (TOON) — both in-process, no added latency.
  
  ### Semantic cache
  
  | | Tokens billed | Response time |
  |---|---|---|
  | First call (cold) | 2,857 | 1,891ms |
- | **Second call — paraphrased, cache hit** | **0** | **171ms** |
+ | **Second call — paraphrased, cache hit** | **0** (served from cache) | **171ms (11× faster)** |
  
- Near-identical prompts return cached responses in 171ms. Zero tokens billed on a cache hit.
+ Near-identical prompts return cached responses in 171ms. Zero model tokens billed on a cache hit.
  
  ### Tier routing
  
- | Request | Routed to |
- |---|---|
- | "What does git stash do?" | SIMPLE local model (free) |
- | JWT vs cookies security analysis | COMPLEX cloud model (correct) |
+ | Request | Lynkr routes to | LiteLLM routes to |
+ |---|---|---|
+ | "What does git stash do?" | `minimax-m2.5` (local, free) | Ollama (local) |
+ | JWT vs cookies security analysis | `moonshot` (cloud, correct) | **Ollama (local — wrong call)** |
+
+ Lynkr scores each request on 15 dimensions (token count, code complexity, reasoning markers, risk signals, agentic patterns) and escalates automatically. LiteLLM's `cost-based-routing` sends everything to the cheapest model regardless of complexity.
+
+ ### Cost projection (100,000 requests/month, same backend)
+
+ | | Monthly cost | vs LiteLLM |
+ |---|---|---|
+ | LiteLLM | ~$818 | baseline |
+ | **Lynkr** | **~$409** | **~50% cheaper** |
  
- Lynkr scores each request on 15 dimensions (token count, code complexity, reasoning markers, risk signals, agentic patterns) and routes automatically. No caller changes needed.
+ _Based on a tool-heavy agentic session (TOON scenario). On equal footing (same provider, same model), Lynkr is cheaper due to token optimization._
  
  → [Full benchmark report with methodology](BENCHMARK_REPORT.md)
  
package/install.sh CHANGED
@@ -108,8 +108,24 @@ clone_or_update() {
  install_dependencies() {
  print_info "Installing dependencies..."
  cd "$INSTALL_DIR"
- npm install --production
+ # --omit=dev keeps optionalDependencies (better-sqlite3, hnswlib-node,
+ # tree-sitter) which back telemetry, the memory store and routing ML.
+ # The postinstall hook (scripts/check-native.js) verifies the native ABI
+ # and rebuilds if Node was upgraded — best-effort, never fails the install.
+ npm install --omit=dev
  print_success "Dependencies installed"
+
+ # Native optional modules need a C/C++ toolchain only if no prebuilt binary
+ # is available for this platform. They degrade gracefully if absent.
+ if ! node -e "const D=require('better-sqlite3'); new D(':memory:').close()" >/dev/null 2>&1; then
+ print_warning "Native module 'better-sqlite3' is not loadable."
+ echo " Telemetry, the memory store and sessions need it. To enable:"
+ echo " - Ensure a build toolchain is present (Xcode CLT on macOS, build-essential + python3 on Linux), then:"
+ echo " - ${BLUE}cd $INSTALL_DIR && npm run rebuild-native${NC}"
+ echo " Lynkr still runs without it (those features stay disabled)."
+ else
+ print_success "Native modules OK (telemetry, memory, sessions enabled)"
+ fi
  }
  
  # Create default .env file
@@ -131,7 +147,7 @@ create_env_file() {
  MODEL_PROVIDER=ollama
  
  # Server Configuration
- PORT=8080
+ PORT=8081
  
  # Ollama Configuration (default for local development)
  OLLAMA_MODEL=qwen2.5-coder:7b
@@ -161,7 +177,7 @@ EOF
  print_info "📝 Configuration ready! Key settings:"
  echo " • Default provider: Ollama (local, offline)"
  echo " • Memory system: Enabled (learns from conversations)"
- echo " • Port: 8080"
+ echo " • Port: 8081"
  echo ""
  print_warning "To use cloud providers (Databricks/OpenAI/Azure):"
  echo " Edit: ${BLUE}nano $INSTALL_DIR/.env${NC}"
@@ -220,7 +236,7 @@ print_next_steps() {
  echo " ${BLUE}lynkr${NC}"
  echo ""
  echo " 3. Configure Claude Code CLI:"
- echo " ${BLUE}export ANTHROPIC_BASE_URL=http://localhost:8080${NC}"
+ echo " ${BLUE}export ANTHROPIC_BASE_URL=http://localhost:8081${NC}"
  echo " ${BLUE}claude${NC}"
  echo ""
  echo " ${YELLOW}Option B: Use Cloud Providers (Databricks/OpenAI/Azure)${NC}"
@@ -238,7 +254,7 @@ print_next_steps() {
  echo " ${BLUE}lynkr${NC}"
  echo ""
  echo " 3. Configure Claude Code CLI:"
- echo " ${BLUE}export ANTHROPIC_BASE_URL=http://localhost:8081${NC}"
+ echo " ${BLUE}export ANTHROPIC_BASE_URL=http://localhost:8081${NC}"
  echo " ${BLUE}export ANTHROPIC_API_KEY=any-non-empty-value${NC} ${GREEN}← Placeholder${NC}"
  echo " ${BLUE}claude${NC}"
  echo ""
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "lynkr",
- "version": "9.4.5",
+ "version": "9.5.0",
  "description": "Self-hosted LLM gateway and tier-routing proxy for Claude Code, Cursor, and Codex. Routes across Ollama, AWS Bedrock, OpenRouter, Databricks, Azure OpenAI, llama.cpp, and LM Studio with prompt caching, MCP tools, and 60-80% cost savings.",
  "main": "index.js",
  "bin": {
@@ -8,13 +8,15 @@
  "lynkr-setup": "scripts/setup.js"
  },
  "scripts": {
+ "postinstall": "node scripts/check-native.js",
+ "rebuild-native": "node scripts/check-native.js",
  "prestart": "node -e \"if(process.env.HEADROOM_ENABLED==='true'&&process.env.HEADROOM_DOCKER_ENABLED!=='false'){process.exit(0)}else{process.exit(1)}\" && docker compose --profile headroom up -d --build headroom 2>/dev/null || echo 'Headroom skipped (disabled or Docker not running)'",
  "start": "node index.js 2>&1 | npx pino-pretty --sync",
  "stop": "node -e \"if(process.env.HEADROOM_ENABLED==='true'&&process.env.HEADROOM_DOCKER_ENABLED!=='false'){process.exit(0)}else{process.exit(1)}\" && docker compose --profile headroom down || echo 'Headroom skipped (disabled or Docker not running)'",
  "dev": "nodemon index.js",
  "lint": "eslint src index.js",
  "test": "npm run test:unit && npm run test:performance",
- "test:unit": "DATABRICKS_API_KEY=test-key DATABRICKS_API_BASE=http://test.com node --test test/routing.test.js test/hybrid-routing-integration.test.js test/web-tools.test.js test/passthrough-mode.test.js test/openrouter-error-resilience.test.js test/format-conversion.test.js test/azure-openai-config.test.js test/azure-openai-format-conversion.test.js test/azure-openai-routing.test.js test/azure-openai-streaming.test.js test/azure-openai-error-resilience.test.js test/azure-openai-integration.test.js test/openai-integration.test.js test/toon-compression.test.js test/llamacpp-integration.test.js test/resilience.test.js test/telemetry-routing.test.js test/memory/store.test.js test/memory/surprise.test.js test/memory/extractor.test.js test/memory/search.test.js test/memory/retriever.test.js test/distill.test.js test/large-payload.test.js test/code-mode.test.js test/prompt-cache-injection.test.js test/risk-analyzer.test.js test/interaction-block.test.js test/preflight.test.js",
+ "test:unit": "DATABRICKS_API_KEY=test-key DATABRICKS_API_BASE=http://test.com node --test test/routing.test.js test/hybrid-routing-integration.test.js test/web-tools.test.js test/passthrough-mode.test.js test/openrouter-error-resilience.test.js test/format-conversion.test.js test/azure-openai-config.test.js test/azure-openai-format-conversion.test.js test/azure-openai-routing.test.js test/azure-openai-streaming.test.js test/azure-openai-error-resilience.test.js test/azure-openai-integration.test.js test/openai-integration.test.js test/toon-compression.test.js test/llamacpp-integration.test.js test/resilience.test.js test/telemetry-routing.test.js test/memory/store.test.js test/memory/surprise.test.js test/memory/extractor.test.js test/memory/search.test.js test/memory/retriever.test.js test/distill.test.js test/large-payload.test.js test/code-mode.test.js test/prompt-cache-injection.test.js test/risk-analyzer.test.js test/interaction-block.test.js test/preflight.test.js test/token-reduction.test.js test/session-affinity.test.js test/model-registry-cost.test.js",
  "test:memory": "DATABRICKS_API_KEY=test-key DATABRICKS_API_BASE=http://test.com node --test test/memory/store.test.js test/memory/surprise.test.js test/memory/extractor.test.js test/memory/search.test.js test/memory/retriever.test.js",
  "test:new-features": "DATABRICKS_API_KEY=test-key DATABRICKS_API_BASE=http://test.com node --test test/passthrough-mode.test.js test/openrouter-error-resilience.test.js test/format-conversion.test.js",
  "test:performance": "DATABRICKS_API_KEY=test-key DATABRICKS_API_BASE=http://test.com node test/hybrid-routing-performance.test.js && DATABRICKS_API_KEY=test-key DATABRICKS_API_BASE=http://test.com node test/performance-tests.js",
@@ -244,6 +244,7 @@ const App = {
  const t = d.today;
  const s = d.stats;
  
+ const tierLabel = t => t === 'default' ? 'default' : String(t).toLowerCase();
  const providerCards = d.providers.length === 0
  ? `<p class="text-slate-500 text-sm">No providers configured</p>`
  : d.providers.map(p => `
@@ -251,10 +252,21 @@ const App = {
  <div class="flex items-center gap-2">
  <span class="status-dot ${providerDot(p.type)}"></span>
  <span class="text-sm font-medium text-slate-200">${p.name}</span>
+ ${(p.tiers || []).map(t => `<span class="badge bg-slate-600/60 text-slate-300">${tierLabel(t)}</span>`).join('')}
  </div>
  <span class="text-xs ${p.type === 'local' ? 'text-green-400' : 'text-blue-400'}">${p.type}</span>
  </div>`).join('');
  
+ const providerWarnings = (d.providerWarnings || []).map(w => `
+ <div class="flex items-center justify-between bg-amber-500/10 border border-amber-500/30 rounded-lg px-4 py-3">
+ <div class="flex items-center gap-2">
+ <span class="text-amber-400 text-sm">⚠</span>
+ <span class="text-sm font-medium text-amber-200">${w.name}</span>
+ ${(w.tiers || []).map(t => `<span class="badge bg-amber-500/20 text-amber-300">${tierLabel(t)}</span>`).join('')}
+ </div>
+ <span class="text-xs text-amber-400">no credentials</span>
+ </div>`).join('');
+
  const recentRows = (d.recentRequests || []).map(r => `
  <tr class="table-row border-b border-slate-700/50">
  <td class="py-2 px-3 text-xs text-slate-500">${fmt.ago(r.timestamp)}</td>
@@ -279,7 +291,7 @@ const App = {
  <!-- Providers -->
  ${card(`
  <h3 class="text-sm font-semibold text-slate-300 mb-3">Configured Providers</h3>
- <div class="flex flex-col gap-2">${providerCards}</div>
+ <div class="flex flex-col gap-2">${providerCards}${providerWarnings}</div>
  `)}
  
  <!-- 24h Stats -->
@@ -0,0 +1,97 @@
+ #!/usr/bin/env node
+ /**
+  * Native module ABI guard (postinstall).
+  *
+  * better-sqlite3 (and the other native optionalDependencies) are compiled
+  * against a specific Node ABI. When Node is upgraded, the prebuilt/compiled
+  * binary stops loading with:
+  *
+  * "was compiled against a different Node.js version using
+  * NODE_MODULE_VERSION 115. This version of Node.js requires
+  * NODE_MODULE_VERSION 141."
+  *
+  * The failure is silent at runtime — telemetry, request logs, and the memory
+  * store all sit behind try/catch and simply go empty. This probe detects the
+  * mismatch and rebuilds the native modules so it self-heals on `npm install`.
+  *
+  * It is intentionally best-effort: it NEVER exits non-zero, so it can't break
+  * `npm install` on machines without a build toolchain (the modules are
+  * optional and the app degrades gracefully without them).
+  */
+
+ const { execSync } = require("child_process");
+
+ // Native optionalDependencies that are ABI-sensitive. If Node changed, all of
+ // them are stale, so we rebuild the set in one pass.
+ const NATIVE_DEPS = [
+   "better-sqlite3",
+   "hnswlib-node",
+   "tree-sitter",
+   "tree-sitter-javascript",
+   "tree-sitter-python",
+   "tree-sitter-typescript",
+ ];
+
+ function log(msg) {
+   console.log(`[check-native] ${msg}`);
+ }
+
+ /**
+  * Probe better-sqlite3 — the canary. `require()` alone is not enough: the
+  * native addon only loads when a Database is instantiated.
+  * @returns {"ok"|"absent"|"mismatch"}
+  */
+ function probe() {
+   let Database;
+   try {
+     Database = require("better-sqlite3");
+   } catch (err) {
+     if (err && err.code === "MODULE_NOT_FOUND") return "absent";
+     return "mismatch";
+   }
+   try {
+     const db = new Database(":memory:");
+     db.close();
+     return "ok";
+   } catch (err) {
+     if (/NODE_MODULE_VERSION|different Node\.js version|invalid ELF|dlopen|\.node/i.test(err.message || "")) {
+       return "mismatch";
+     }
+     // Some other instantiation error — not an ABI issue we can fix by rebuild.
+     return "ok";
+   }
+ }
+
+ function main() {
+   const status = probe();
+
+   if (status === "absent") {
+     // Optional dependency not installed (e.g. build skipped). Nothing to do.
+     return;
+   }
+   if (status === "ok") {
+     return;
+   }
+
+   log("native module ABI mismatch detected (Node was likely upgraded). Rebuilding native modules…");
+   try {
+     execSync(`npm rebuild ${NATIVE_DEPS.join(" ")}`, { stdio: "inherit" });
+   } catch {
+     log("rebuild did not complete (a build toolchain may be missing). Continuing — native features will be disabled until you run: npm rebuild better-sqlite3");
+     return;
+   }
+
+   // Re-probe to report the outcome.
+   if (probe() === "ok") {
+     log("native modules rebuilt successfully.");
+   } else {
+     log("native modules still not loadable after rebuild. Run `npm rebuild better-sqlite3` manually.");
+   }
+ }
+
+ try {
+   main();
+ } catch (err) {
+   // Never fail the install.
+   log(`skipped (${err.message})`);
+ }
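The loader error quoted in the header comment carries both ABI numbers, which can be useful when reading install logs. The following is a minimal sketch, assuming the message keeps the wording quoted above; `parseAbiMismatch` is a hypothetical helper for illustration and is not part of `check-native.js`. `process.versions.modules` is Node's own report of the running ABI number.

```javascript
// Hypothetical helper (not in the package): extract the two ABI numbers from
// the "compiled against a different Node.js version" loader error.
function parseAbiMismatch(message) {
  const found = String(message || "").match(/NODE_MODULE_VERSION (\d+)/g) || [];
  if (found.length < 2) return null; // not the ABI-mismatch message
  const [compiled, required] = found.map((s) => Number(s.split(" ")[1]));
  return { compiled, required };
}

// Sample text copied from the header comment of check-native.js.
const sample =
  "was compiled against a different Node.js version using " +
  "NODE_MODULE_VERSION 115. This version of Node.js requires " +
  "NODE_MODULE_VERSION 141.";

const abi = parseAbiMismatch(sample);
// process.versions.modules is the ABI number of the running Node binary.
console.log(abi, "running ABI:", process.versions.modules);
```

If `required` differs from `process.versions.modules`, the message was produced by a different Node binary than the one currently running.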
@@ -1506,10 +1506,16 @@ async function invokeMoonshot(body) {
  "claude-haiku-4-5-20251001": "kimi-k2-turbo-preview",
  "claude-haiku-4-5": "kimi-k2-turbo-preview",
  "claude-3-haiku": "kimi-k2-turbo-preview",
+ // moonshot-v1-auto 400s with "tokenization failed" (its server-side auto
+ // context-size pass fails on large tool-bearing payloads). Remap to a
+ // fixed model that's broadly available on api.moonshot.ai.
+ "moonshot-v1-auto": "moonshot-v1-128k",
  };
  
  const requestedModel = body._tierModel || body.model || config.moonshot.model;
- const mappedModel = modelMap[requestedModel] || config.moonshot.model || "kimi-k2-turbo-preview";
+ let mappedModel = modelMap[requestedModel] || config.moonshot.model || "kimi-k2-turbo-preview";
+ // Guard against the deprecated auto model arriving via config too.
+ if (mappedModel === "moonshot-v1-auto") mappedModel = "moonshot-v1-128k";
  
  // Convert messages using existing utility
  const messages = convertAnthropicMessagesToOpenRouter(body.messages || []);
@@ -1522,12 +1528,18 @@ async function invokeMoonshot(body) {
  messages.unshift({ role: "system", content: systemContent });
  }
  
+ // kimi-k2.x (k2.5 / k2.6 …) are thinking models that only accept
+ // temperature: 1 — any other value 400s with "invalid temperature".
+ const isKimiThinking = /^kimi-k2/i.test(mappedModel);
+
  const moonshotBody = {
  model: mappedModel,
  messages,
  max_tokens: body.max_tokens || 16384,
- temperature: body.temperature ?? 0.7,
- top_p: body.top_p ?? 1.0,
+ // kimi-k2.x thinking models pin sampling params: temperature must be 1
+ // and top_p must be 0.95; any other value 400s.
+ temperature: isKimiThinking ? 1 : (body.temperature ?? 0.7),
+ top_p: isKimiThinking ? 0.95 : (body.top_p ?? 1.0),
  stream: false, // Force non-streaming - OpenAI SSE to Anthropic SSE conversion not implemented
  };
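The pinning rule in the hunk above can be exercised in isolation. This is a sketch only: `samplingParams` is a hypothetical wrapper, while the regex, the pinned values and the `??` defaults are copied from the diff. Note that the regex as written also matches `kimi-k2-turbo-preview`, the default mapped model, not just the `k2.5` / `k2.6` names mentioned in the comment.

```javascript
// Hypothetical wrapper around the sampling-parameter rule from invokeMoonshot.
function samplingParams(model, body = {}) {
  const isKimiThinking = /^kimi-k2/i.test(model);
  return {
    // Pinned for kimi-k2* models; otherwise caller value, then default.
    temperature: isKimiThinking ? 1 : (body.temperature ?? 0.7),
    top_p: isKimiThinking ? 0.95 : (body.top_p ?? 1.0),
  };
}

// Caller-supplied values are overridden for kimi-k2* models.
console.log(samplingParams("kimi-k2-turbo-preview", { temperature: 0.2 }));
// `??` keeps an explicit temperature of 0 for other models.
console.log(samplingParams("moonshot-v1-128k", { temperature: 0 }));
```

Using `??` rather than `||` matters here: a caller that deliberately sends `temperature: 0` to a non-kimi model keeps that value instead of falling back to 0.7.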
 
@@ -2027,6 +2039,65 @@ async function invokeCodex(body) {
  };
  }
  
+ /**
+  * Compute request cost in USD from model pricing × token usage.
+  * Registry returns per-1M-token prices ({ input, output }); returns null when
+  * pricing is unknown so we don't record misleading zeros.
+  */
+ const _unknownCostWarned = new Set();
+ function computeCostUsd(model, inputTokens, outputTokens) {
+   try {
+     const { getModelRegistrySync } = require("../routing/model-registry");
+     const reg = getModelRegistrySync && getModelRegistrySync();
+     const cost = reg?.getCost?.(model);
+     if (!cost) return null;
+     // Unknown model → record null (not a fabricated default), warn once so the
+     // gap is visible and can be fixed via MODEL_PRICE_OVERRIDES.
+     if (cost.unknown) {
+       if (model && !_unknownCostWarned.has(model)) {
+         _unknownCostWarned.add(model);
+         logger.warn({ model }, "[Cost] No pricing for model — recording cost_usd=null. Set MODEL_PRICE_OVERRIDES to fix.");
+       }
+       return null;
+     }
+     if (cost.input == null && cost.output == null) return null;
+     const inUsd = ((inputTokens || 0) / 1e6) * (cost.input || 0);
+     const outUsd = ((outputTokens || 0) / 1e6) * (cost.output || 0);
+     return Number((inUsd + outUsd).toFixed(6));
+   } catch {
+     return null;
+   }
+ }
+
+ // Telemetry prompt/response text is always captured (truncated) to build the
+ // routing ML training corpus. Stored locally in .lynkr/telemetry.db only.
+ const TELEMETRY_TEXT_MAXLEN = 2000;
+
+ /** Flatten the latest user message to plain text (for telemetry capture). */
+ function captureRequestText(body) {
+   const messages = body?.messages;
+   if (!Array.isArray(messages)) return null;
+   for (let i = messages.length - 1; i >= 0; i--) {
+     const m = messages[i];
+     if (m?.role !== "user") continue;
+     let text = "";
+     if (typeof m.content === "string") text = m.content;
+     else if (Array.isArray(m.content)) {
+       text = m.content.filter((b) => b?.type === "text").map((b) => b.text || "").join(" ");
+     }
+     if (text) return text.slice(0, TELEMETRY_TEXT_MAXLEN);
+   }
+   return null;
+ }
+
+ /** Flatten an Anthropic response's text blocks to plain text (for telemetry). */
+ function captureResponseText(resultJson) {
+   const content = resultJson?.content;
+   if (!Array.isArray(content)) return null;
+   const text = content.filter((b) => b?.type === "text").map((b) => b.text || "").join(" ");
+   return text ? text.slice(0, TELEMETRY_TEXT_MAXLEN) : null;
+ }
+
  async function invokeModel(body, options = {}) {
  const { determineProviderSmart, isFallbackEnabled, getFallbackProvider } = require("./routing");
  const metricsCollector = getMetricsCollector();
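The per-1M-token arithmetic in `computeCostUsd` can be checked by hand. The sketch below is illustrative only: it replaces the model-pricing registry with a plain object parsed from the `MODEL_PRICE_OVERRIDES` example in the README hunk, and `costUsd` is a hypothetical stand-in, not the package's function. How the real registry consumes the override variable is not shown in this diff.

```javascript
// Stand-in for the pricing registry (assumption): per-1M-token USD prices,
// keyed by model name, taken from the README's MODEL_PRICE_OVERRIDES example.
const overrides = JSON.parse('{"my-model":{"input":0.5,"output":1.5}}');

function costUsd(model, inputTokens, outputTokens) {
  const price = overrides[model];
  if (!price) return null; // unknown model: null, never a made-up price
  const inUsd = ((inputTokens || 0) / 1e6) * (price.input || 0);
  const outUsd = ((outputTokens || 0) / 1e6) * (price.output || 0);
  return Number((inUsd + outUsd).toFixed(6)); // round to micro-dollars
}

// 12,000 input tokens at $0.50/1M plus 800 output tokens at $1.50/1M.
console.log(costUsd("my-model", 12000, 800));
// A model with no pricing entry yields null rather than 0.
console.log(costUsd("other-model", 12000, 800));
```

Returning `null` instead of `0` keeps unpriced requests distinguishable from genuinely free ones (for example a local model) when telemetry is aggregated later.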
@@ -2233,6 +2304,9 @@ async function invokeModel(body, options = {}) {
  circuit_breaker_state: breaker.state,
  quality_score: qualityScore,
  tokens_per_second: outputTokens && latency > 0 ? outputTokens / (latency / 1000) : null,
+ cost_usd: computeCostUsd(routingDecision.model || body._tierModel, inputTokens, outputTokens),
+ request_text: captureRequestText(body),
+ response_text: captureResponseText(result.json),
  });
  
  // Return result with provider info and routing decision for headers
@@ -2394,6 +2468,9 @@ async function invokeModel(body, options = {}) {
  { status_code: 200, output_tokens: fbOutputTokens, tool_calls_made: fbToolCalls, was_fallback: true, retry_count: 0, latency_ms: Date.now() - startTime }
  ),
  tokens_per_second: fbOutputTokens && fallbackLatency > 0 ? fbOutputTokens / (fallbackLatency / 1000) : null,
+ cost_usd: computeCostUsd(routingDecision.model || body._tierModel, fbInputTokens, fbOutputTokens),
+ request_text: captureRequestText(body),
+ response_text: captureResponseText(fallbackResult.json),
  });
  
  // Return result with actual provider used (fallback provider) and routing decision
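To see what ends up in `request_text`, the capture logic can be run against a small Anthropic-style request. The function body below repeats `captureRequestText` from the diff so the example runs on its own; the sample messages are made up for illustration.

```javascript
const TELEMETRY_TEXT_MAXLEN = 2000;

// Same logic as captureRequestText in the diff, repeated for a standalone run.
function captureRequestText(body) {
  const messages = body?.messages;
  if (!Array.isArray(messages)) return null;
  for (let i = messages.length - 1; i >= 0; i--) {
    const m = messages[i];
    if (m?.role !== "user") continue;
    let text = "";
    if (typeof m.content === "string") text = m.content;
    else if (Array.isArray(m.content)) {
      text = m.content.filter((b) => b?.type === "text").map((b) => b.text || "").join(" ");
    }
    if (text) return text.slice(0, TELEMETRY_TEXT_MAXLEN);
  }
  return null;
}

// Illustrative request: the latest user turn mixes a tool result with text.
const sampleBody = {
  messages: [
    { role: "user", content: "first question" },
    { role: "assistant", content: [{ type: "text", text: "answer" }] },
    {
      role: "user",
      content: [
        { type: "tool_result", tool_use_id: "t1", content: "ignored" },
        { type: "text", text: "follow-up" },
        { type: "text", text: "question" },
      ],
    },
  ],
};

console.log(captureRequestText(sampleBody));
```

Two behaviors follow from the loop as written: non-text blocks such as `tool_result` are dropped, and a latest user turn with no text at all falls through to the nearest earlier user turn that has some.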
@@ -176,6 +176,21 @@ function convertAnthropicMessagesToOpenRouter(anthropicMessages) {
  }
  }
  
+ // Kimi/Moonshot (and some OpenAI-compatible APIs) reject a message whose
+ // content is an empty string with "Invalid request: tokenization failed".
+ // This happens when a turn had only non-text blocks (thinking / image /
+ // stripped content) and flattened to "". Replace empty/whitespace-only
+ // content with a single space — but never touch an assistant message that
+ // carries tool_calls, where content: null is intentional and required.
+ for (const m of converted) {
+ if (m.role === 'tool') continue;
+ const hasToolCalls = Array.isArray(m.tool_calls) && m.tool_calls.length > 0;
+ if (hasToolCalls) continue;
+ if (typeof m.content !== 'string' || m.content.trim() === '') {
+ m.content = ' ';
+ }
+ }
+
  // Log the converted messages for debugging
  logger.debug({
  inputCount: anthropicMessages.length,
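The effect of the new normalization loop is easiest to see on a small message list. The sketch below wraps the loop from the diff in a hypothetical `padEmptyContent` function; the sample messages use the OpenAI-style shape the converter produces and are invented for the demo.

```javascript
// Hypothetical wrapper around the loop added to
// convertAnthropicMessagesToOpenRouter; the rule itself is copied from the diff.
function padEmptyContent(converted) {
  for (const m of converted) {
    if (m.role === "tool") continue; // tool results are left untouched
    const hasToolCalls = Array.isArray(m.tool_calls) && m.tool_calls.length > 0;
    if (hasToolCalls) continue; // content: null is valid alongside tool_calls
    if (typeof m.content !== "string" || m.content.trim() === "") {
      m.content = " ";
    }
  }
  return converted;
}

const padded = padEmptyContent([
  { role: "user", content: "" },
  {
    role: "assistant",
    content: null,
    tool_calls: [{ id: "call_1", type: "function", function: { name: "grep", arguments: "{}" } }],
  },
  { role: "tool", tool_call_id: "call_1", content: "" },
  { role: "assistant", content: "done" },
]);

console.log(padded.map((m) => JSON.stringify(m.content)));
```

One consequence of the `typeof m.content !== 'string'` test is that any non-string content on a message without `tool_calls`, including an array, is also replaced by a single space; whether the converter can still emit array content at this point is not visible in this hunk.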
@@ -208,6 +208,11 @@ const tokenBudgetWarning = Number.parseInt(process.env.TOKEN_BUDGET_WARNING ?? "
  const tokenBudgetMax = Number.parseInt(process.env.TOKEN_BUDGET_MAX ?? "180000", 10);
  const tokenBudgetEnforcement = process.env.TOKEN_BUDGET_ENFORCEMENT !== "false"; // default true
  
+ // Caveman terse-output injection (opt-in, off by default)
+ const cavemanEnabled = process.env.CAVEMAN_ENABLED === "true";
+ const cavemanLevel = (process.env.CAVEMAN_LEVEL ?? "lite").toLowerCase();
+
+
  // TOON payload compression (opt-in)
  const toonEnabled = process.env.TOON_ENABLED === "true"; // default false
  const toonMinBytes = Number.parseInt(process.env.TOON_MIN_BYTES ?? "4096", 10);
@@ -641,6 +646,10 @@ var config = {
  toolResultCompression: {
  enabled: true,
  },
+ caveman: {
+ enabled: cavemanEnabled,
+ level: cavemanLevel,
+ },
  server: {
  jsonLimit: process.env.REQUEST_JSON_LIMIT ?? "1gb",
  },
@@ -0,0 +1,94 @@
+ /**
+  * Caveman Terse-Output Injector
+  *
+  * Appends a brevity instruction to the system prompt so the model produces
+  * terser responses, reducing OUTPUT tokens. Opt-in and off by default — it
+  * changes model behavior, so it's only applied when explicitly enabled.
+  *
+  * Enable with CAVEMAN_ENABLED=true. Level via CAVEMAN_LEVEL=lite|full|ultra
+  * (default: lite). Adapted from 9router's caveman injector / the caveman skill
+  * (https://github.com/JuliusBrussee/caveman).
+  *
+  * @module context/caveman
+  */
+
+ const config = require("../config");
+ const logger = require("../logger");
+
+ const LEVELS = ["lite", "full", "ultra"];
+
+ // Shared guardrails so brevity never corrupts the substance that matters.
+ const BOUNDARIES =
+   "Code blocks, file paths, commands, errors, URLs: keep exact. " +
+   "Security warnings, irreversible-action confirmations, and multi-step ordered " +
+   "sequences: write in full normal prose. Resume terse style afterward.";
+
+ const EXAMPLES =
+   'Not: "Sure! I\'d be happy to help. The issue is likely caused by..." ' +
+   'Yes: "Bug in auth middleware. Token expiry uses `<` not `<=`. Fix:"';
+
+ const PERSISTENCE = "Apply this to every response unless a guardrail above applies.";
+
+ const PROMPTS = {
+   lite: [
+     "Respond tersely. Keep grammar and full sentences but drop filler, hedging, and pleasantries (just/really/basically/sure/of course/I'd be happy to).",
+     "Pattern: state the thing, the action, the reason. Then the next step.",
+     EXAMPLES,
+     BOUNDARIES,
+     PERSISTENCE,
+   ].join(" "),
+
+   full: [
+     "Respond like a terse caveman. All technical substance stays exact; only fluff dies.",
+     "Drop articles (a/an/the), filler (just/really/basically/actually/simply), pleasantries, and hedging. Fragments OK. Prefer short synonyms (big not extensive, fix not implement a solution for).",
+     "Pattern: [thing] [action] [reason]. [next step].",
+     EXAMPLES,
+     BOUNDARIES,
+     PERSISTENCE,
+   ].join(" "),
+
+   ultra: [
+     "Respond ultra-terse. Maximum compression. Telegraphic.",
+     "Abbreviate (DB/auth/config/req/res/fn/impl), strip conjunctions, use arrows for causality (X → Y). One word when one word is enough.",
+     "Pattern: [thing] → [result]. [fix].",
+     EXAMPLES,
+     BOUNDARIES,
+     PERSISTENCE,
+   ].join(" "),
+ };
+
+ const MARKER = "[brevity]";
+
+ /** Resolve the configured level, falling back to "lite". */
+ function resolveLevel(level) {
+   const l = String(level || config.caveman?.level || "lite").toLowerCase();
+   return LEVELS.includes(l) ? l : "lite";
+ }
+
+ /**
+  * Append the brevity instruction to a system prompt string.
+  * Idempotent — won't double-inject if the marker is already present.
+  *
+  * @param {string} system - Existing system prompt (may be empty).
+  * @param {object} [opts]
+  * @param {boolean} [opts.enabled] - Override config enablement.
+  * @param {string} [opts.level] - Override level.
+  * @returns {string} system prompt, possibly with brevity instruction appended.
+  */
+ function injectCaveman(system, opts = {}) {
+   const enabled = opts.enabled ?? config.caveman?.enabled === true;
+   if (!enabled) return system || "";
+
+   const base = system || "";
+   if (base.includes(MARKER)) return base;
+
+   const level = resolveLevel(opts.level);
+   const instruction = `\n\n${MARKER} ${PROMPTS[level]}`;
+   logger.debug({ level }, "[Caveman] Injected brevity instruction into system prompt");
+   return base + instruction;
+ }
+
+ module.exports = {
+   injectCaveman,
+   LEVELS,
+ };
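The idempotency guarantee in `injectCaveman` rests on a plain substring check for the marker. The following standalone sketch isolates that mechanism; `inject` and `PROMPT` are simplified placeholders (the real module depends on `../config` and `../logger` and selects one of three level prompts), so treat it as an illustration of the technique rather than the module's API.

```javascript
// Simplified stand-in for injectCaveman: same marker check, placeholder prompt.
const MARKER = "[brevity]";
const PROMPT = "Respond tersely."; // placeholder, not one of the real prompts

function inject(system, enabled) {
  const base = system || "";
  // Disabled, or already injected: return the prompt unchanged.
  if (!enabled || base.includes(MARKER)) return base;
  return `${base}\n\n${MARKER} ${PROMPT}`;
}

const once = inject("You are a coding assistant.", true);
const twice = inject(once, true); // second pass is a no-op

console.log(once === twice);
```

A side effect of the substring check is that a system prompt which already contains the literal text `[brevity]` for unrelated reasons would never receive the instruction.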