llm-cli-gateway 1.5.35 → 1.6.0

package/CHANGELOG.md CHANGED
@@ -2,6 +2,111 @@
2
2
 
3
3
  All notable changes to the llm-cli-gateway project.
4
4
 
5
+ ## [1.6.0] - 2026-05-26 — cache-awareness phase 1 + security posture
6
+
7
+ Also includes (beyond cache-awareness):
8
+
9
+ ### Added — free-OSS security posture (matches verivus-oss/agent-assurance)
10
+
11
+ - New `.github/workflows/security.yml` running on every push + PR:
12
+ actionlint, zizmor, shellcheck, typos, osv-scanner, gitleaks, ruff,
13
+ bandit, lychee. SHA-pinned, fail-on-finding.
14
+ - `eslint-plugin-security` 3.0.1 wired into the existing eslint config.
15
+ - `SECURITY.md` (vulnerability reporting policy), `.github/CODEOWNERS`
16
+ (review routing for security-sensitive paths), `_typos.toml`,
17
+ `lychee.toml`, `.gitleaks.toml`, `.github/actionlint.yaml`,
18
+ `integrations/llm-plugin/.bandit`.
19
+ - Workflow hygiene: top-level `permissions: contents: read`, per-job
20
+ explicit, `persist-credentials: false` on every `actions/checkout`
21
+ except the upload job in `release-installer.yml`. Cache disabled on
22
+ release-triggered setup-node/setup-go (zizmor cache-poisoning).
23
+ - Dependabot: added `npm` ecosystem at `/` and `pip` ecosystem at
24
+ `/integrations/llm-plugin/` (github-actions group preserved).
25
+ - `installer/go.mod` bumped Go 1.22 → 1.25 (clears 26 stdlib CVEs
26
+ flagged by osv-scanner); `release-installer.yml` setup-go pin
27
+ updated in lock-step.
28
+
29
+ ### Added — cache-awareness slice 1+2+3 (all opt-in, default OFF)
30
+
33
+ - **`promptParts` on every `*_request` / `*_request_async` tool** (claude, codex,
34
+ gemini, grok, mistral; sync + async = 10 tools). Accepts
35
+ `{ system?, tools?, context?, task }`. Mutually exclusive with `prompt`.
36
+ The gateway concatenates in canonical order (`system → tools → context → task`)
37
+ so the stable prefix bytes stay unchanged across calls and precede the
38
+ volatile task tail, raising implicit cache hit rate without calling provider cache APIs.
39
+ The exact error strings `provide exactly one of \`prompt\` or \`promptParts\``
40
+ and `one of \`prompt\` or \`promptParts\` is required` are part of the stable API
40
+ contract.
42
+ - **Flight-recorder v3 migration**: new columns `stable_prefix_hash`
43
+ (sha256) and `stable_prefix_tokens` (integer bytes/4 heuristic) on
44
+ `requests`, plus `idx_requests_stable_hash`. Legacy rows keep NULL.
45
+ - **Cache-state MCP resources** (read-only, tokens/hashes/aggregates only —
46
+ never raw prompt text):
47
+ - `cache_state://global` (last 24h aggregates + per-CLI breakdown).
48
+ - `cache_state://session/{sessionId}` (per-session).
49
+ - `cache_state://prefix/{hash}` (per-stable-prefix-hash).
50
+ - **`session_get.cacheState`** projection: compact hit-rate / hit-count /
51
+ cache-token-totals / estimated-savings-USD block, present only when the
52
+ session has prior requests. Omitted entirely (not null, not empty) for
53
+ fresh sessions. NOT persisted on the Session interface — it is a
54
+ read-time projection from the flight recorder.
55
+ - **`computeTtlRemaining()` + `cache_ttl_expiring_soon` warning**: claude
56
+ sync + async handlers attach a structured `warnings[]` entry when a
57
+ resumed session's Anthropic cache breakpoint is within 30 s of expiry
58
+ (gated on `[cache_awareness].warn_on_ttl_expiry`; default false). The
59
+ TTL math respects `anthropic_ttl_seconds = 300 | 3600`.
60
+ - **Doctor `cache_awareness` block**: always present, zeroed when the
61
+ flight recorder is empty. Reports `enabled_features` (active flags),
62
+ `last_24h` (hit rate + savings), and `per_cli` aggregates. JSON schema
63
+ updated; `setup/status.schema.json` `additionalProperties: false`
64
+ intact at the root.
65
+ - **`[cache_awareness]` config block** in `~/.llm-cli-gateway/config.toml`:
66
+ - `emit_anthropic_cache_control = false`
67
+ - `anthropic_ttl_seconds = 300` (enum: 300 | 3600)
68
+ - `warn_on_ttl_expiry = false`
69
+ - `[cache_awareness.min_stable_tokens_for_cache_control]` per-family
70
+ table (sonnet=1024, opus=4096, haiku=4096, default=4096).
71
+ Validated by a separate Zod schema and loader (`loadCacheAwarenessConfig`);
72
+ a malformed `[cache_awareness]` block does NOT break `loadPersistenceConfig`
73
+ and vice versa. No env-var overrides.
74
+
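The `promptParts` concatenation and the v3 `stable_prefix_*` columns described above can be sketched roughly as follows. This is a minimal illustration, not the gateway's implementation: `assemblePrompt` is a hypothetical name, the `"\n\n"` separator is an assumption, treating `system + tools + context` as the hashed stable prefix is an assumption, and so is rounding the bytes/4 heuristic up.

```javascript
const { createHash } = require("node:crypto");

// Canonical order from the changelog: system → tools → context → task.
// Everything before `task` is treated as the stable prefix (assumption).
const STABLE_KEYS = ["system", "tools", "context"];

// Hypothetical helper; the separator is an assumption, not taken from the package.
function assemblePrompt(parts, separator = "\n\n") {
  const stablePrefix = STABLE_KEYS
    .map((key) => parts[key])
    .filter((value) => typeof value === "string" && value.length > 0)
    .join(separator);

  const prompt = stablePrefix ? stablePrefix + separator + parts.task : parts.task;

  return {
    prompt,
    // sha256 over the stable prefix only, so a changing task does not alter it.
    stablePrefixHash: stablePrefix
      ? createHash("sha256").update(stablePrefix, "utf8").digest("hex")
      : null,
    // bytes/4 token heuristic; rounding up is an assumption.
    stablePrefixTokens: stablePrefix
      ? Math.ceil(Buffer.byteLength(stablePrefix, "utf8") / 4)
      : null,
  };
}

const shared = {
  system: "You are a helpful code reviewer.",
  tools: "You have access to Read, Grep, Bash.",
  context: "<long stable context block>",
};
const first = assemblePrompt({ ...shared, task: "Review src/foo.ts." });
const second = assemblePrompt({ ...shared, task: "Review src/bar.ts." });

// Different tasks, identical prefix, identical hash.
console.log(first.stablePrefixHash === second.stablePrefixHash); // true
```

Because only the tail differs between the two calls, a provider-side prefix cache sees the same leading bytes both times, which is the effect the changelog calls prefix discipline.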
75
+ ### Decision: Branch B (prefix-discipline only) for slice 1
76
+
77
+ The gateway does NOT emit explicit `cache_control` JSON to Claude in this
78
+ slice and does NOT route `promptParts.system` into `--system-prompt`. The
79
+ upstream injection mechanism is unverified; Branch A is gated on a live
80
+ smoke test in a follow-up slice. The
81
+ `[cache_awareness].emit_anthropic_cache_control` flag is in place for
82
+ when that lands.
83
+
84
+ ### Deferred / out of scope
85
+
86
+ - **Async-path `stable_prefix_hash` recording**: `src/async-job-manager.ts`
87
+ has zero flight-recorder integration today, so the v3 columns are NOT
88
+ populated for async-job rows. This is a separate concern beyond
89
+ cache-awareness — tracked for a future plan
90
+ (`docs/plans/async-flight-recorder.dag.toml`, TBD). Slice 1's runtime
91
+ mutex check IS in place on the async tool surface; only the flight-recorder
92
+ write is deferred.
93
+ - **Codex parser cache-tokens fix**: `src/codex-json-parser.ts` reads
94
+ Anthropic-style `cache_read_input_tokens` but Codex CLI 0.133.0+ emits
95
+ `cached_input_tokens`. `cache_read_tokens` therefore stays NULL for codex
96
+ rows today. Out of scope for this slice (see PROVIDER_CACHE_SURFACES.md).
97
+
98
+ ### Invariant
99
+
100
+ "No conversation content in session storage" is preserved. The session
101
+ manager (`~/.llm-cli-gateway/sessions.json`) is UNTOUCHED by this slice.
102
+ The cache-awareness columns added by migration v3
103
+ (`stable_prefix_hash`, `stable_prefix_tokens`) live on the existing
104
+ flight recorder (`~/.llm-cli-gateway/logs.db`), which is a separate
105
+ audit-focused store that already records prompts and responses (and is
106
+ not subject to the session-storage invariant). `session_get.cacheState`
107
+ is a read-time PROJECTION from the flight recorder, never persisted on
108
+ the Session interface.
109
+
5
110
  ## [1.5.35] - 2026-05-25
6
111
 
7
112
  ### Fixed
package/README.md CHANGED
@@ -88,6 +88,36 @@ docker compose -f docker-compose.personal.yml run --rm doctor
88
88
  ### Observability
89
89
  - **SQLite Flight Recorder**: Every request/response logged to `~/.llm-cli-gateway/logs.db` with correlation IDs, token usage, duration, retry counts, and circuit breaker state. Browse with [Datasette](https://datasette.io/): `datasette ~/.llm-cli-gateway/logs.db`
90
90
  - **Structured Metadata**: Tool responses include machine-readable `structuredContent` (model, cli, correlationId, sessionId, durationMs, token counts)
91
+ - **Cache observability resources**: `cache_state://global`, `cache_state://session/{id}`, and `cache_state://prefix/{hash}` MCP resources return aggregate cache hit/miss/savings — tokens and hashes only, no prompt text. `session_get` includes a `cacheState` block when the session has prior requests.
92
+
93
+ ### Cache-aware operation
94
+
95
+ Every `*_request` and `*_request_async` tool accepts an optional `promptParts` field that structures the prompt for better cache hit rates. The gateway concatenates the parts in canonical order (`system → tools → context → task`) so that the stable prefix bytes stay unchanged across calls and precede the volatile task tail, letting each provider's automatic prompt caching land on the same content hash each time.
96
+
97
+ ```json
98
+ {
99
+ "promptParts": {
100
+ "system": "You are a helpful code reviewer.",
101
+ "tools": "You have access to Read, Grep, Bash.",
102
+ "context": "<long stable context block — file dumps, etc.>",
103
+ "task": "Review the changes in src/foo.ts for security issues."
104
+ }
105
+ }
106
+ ```
107
+
108
+ `prompt` and `promptParts` are mutually exclusive — pass exactly one.
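A rough sketch of how the exactly-one rule might be enforced. `resolvePromptInput` is a hypothetical name; only the two error strings come from the changelog, and which string pairs with which condition (both supplied vs. neither supplied) is an assumption inferred from their wording.

```javascript
// Hypothetical validator; not the gateway's actual code.
function resolvePromptInput(args) {
  const hasPrompt = typeof args.prompt === "string";
  const hasParts = args.promptParts !== undefined && args.promptParts !== null;

  if (hasPrompt && hasParts) {
    // Assumed mapping: both fields supplied.
    throw new Error("provide exactly one of `prompt` or `promptParts`");
  }
  if (!hasPrompt && !hasParts) {
    // Assumed mapping: neither field supplied.
    throw new Error("one of `prompt` or `promptParts` is required");
  }
  return hasPrompt
    ? { kind: "prompt", value: args.prompt }
    : { kind: "promptParts", value: args.promptParts };
}

console.log(resolvePromptInput({ prompt: "Summarize the diff." }).kind); // prompt
console.log(resolvePromptInput({ promptParts: { task: "Summarize the diff." } }).kind); // promptParts
```

Callers that match on these error strings can rely on them, since the changelog declares the exact text part of the API contract.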
109
+
110
+ Per-CLI capability matrix:
111
+
112
+ | CLI | Prefix discipline (auto via `promptParts`) | Explicit `cache_control` emission |
113
+ |---------|--------------------------------------------|------------------------------------|
114
+ | claude | yes | not yet (Branch B; gated on `[cache_awareness].emit_anthropic_cache_control`) |
115
+ | codex | yes | n/a (OpenAI implicit cache, no CLI lever) |
116
+ | gemini | yes | n/a (implicit prefix cache server-side) |
117
+ | grok | yes | n/a (no surfaced cache lever) |
118
+ | mistral | yes | n/a (no surfaced cache lever) |
119
+
120
+ Opt-in flags (all default off) live under `[cache_awareness]` in `~/.llm-cli-gateway/config.toml`. See `docs/personal-mcp/PROVIDER_CACHE_SURFACES.md` for the per-model minimum cacheable token thresholds and field-name divergences.
91
121
 
92
122
  ### Reliability & Performance
93
123
  - **Retry Logic**: Exponential backoff with circuit breaker for transient failures
@@ -1019,6 +1049,7 @@ If you're vetting `llm-cli-gateway` through [Socket](https://socket.dev/npm/pack
1019
1049
  | **Shell access** | `src/executor.ts` uses `child_process.spawn(cmd, args, …)` to invoke the underlying LLM CLIs. | `spawn` is called with an argument array and **never** `shell: true`, so there is no shell interpolation path for caller input. The command name is restricted to an allow-list of known CLI binaries (`claude`, `codex`, `gemini`, `grok`, `vibe`). |
1020
1050
  | **Uses eval** | None in our source. Transitive: `@modelcontextprotocol/sdk` → `ajv@8` uses `new Function(...)` in `ajv/dist/compile/index.js` to compile JSON Schema validators. | This is ajv's standard codegen path. Only known schemas (defined in our source and the MCP SDK) flow into it; no caller-supplied data ever reaches the compiled function body. |
1021
1051
  | **better-sqlite3 PRAGMA helper** | Transitive: `better-sqlite3/lib/methods/pragma.js` interpolates its caller-provided `source` into a `PRAGMA ${source}` statement. | We do not call `db.pragma()` from production source. Internal SQLite setup uses fixed literal `db.exec("PRAGMA ...")` statements, and `npm run security:audit` fails the release if production code reintroduces `.pragma()` calls. |
1052
+ | **ioredis obfuscated code** | Optional peer/dev dependency: `ioredis@5.10.1` may be flagged at `built/constants/TLSProfiles.js` for base64-looking strings. | Reviewed as a false positive. The file is a Redis Cloud TLS CA certificate bundle in PEM format, which is base64 by design. It contains no decoder loop, dynamic evaluation, network call, or hidden execution path. The same file is byte-for-byte identical in `ioredis@5.9.2`; our default production install does not install `ioredis`, and our code does not pass ioredis TLS profile options. |
1022
1053
  | **Dependency ownership** | A handful of small transitive packages (e.g. `bindings` via `better-sqlite3`, `media-typer` via `@modelcontextprotocol/sdk`) trip Socket's "unstable ownership" or "obfuscated code" heuristics. | These are pinned, well-known micro-deps in the Node ecosystem with no known issues. We pin direct override versions of `content-type` and `type-is` in `package.json#overrides`. Our previous direct dependency on `toml@3.0.0` (also single-maintainer, last released 2020) was replaced with the actively-maintained `smol-toml` to reduce inherited risk. |
1023
1054
 
1024
1055
  See [`socket.yml`](./socket.yml) for the same context in machine-readable form.
@@ -0,0 +1,112 @@
1
+ /**
2
+ * Cache observability aggregates.
3
+ *
4
+ * Pure read-only aggregation over the FlightRecorder's `requests` table.
5
+ * No new storage — every value is computed at query time from existing
6
+ * columns (`cache_read_tokens`, `cache_creation_tokens`, `stable_prefix_*`,
7
+ * `datetime_utc`, etc.).
8
+ *
9
+ * COALESCE / NULL handling: rows from before the v3 migration have NULL
10
+ * for stable_prefix_*. Rows from CLIs whose parser does not surface cache
11
+ * tokens (gemini, grok, mistral, and codex until its parser is fixed)
12
+ * have NULL for cache_read_tokens / cache_creation_tokens. All aggregates
13
+ * tolerate NULL via COALESCE(col, 0), and hit rates return 0 when there are no rows.
14
+ */
15
+ import type { FlightRecorderQuery } from "./flight-recorder.js";
16
+ export type CacheStatsCli = "claude" | "codex" | "gemini" | "grok" | "mistral";
17
+ export interface SessionCacheStats {
18
+ sessionId: string;
19
+ cli: CacheStatsCli | null;
20
+ /** Total cache_read_tokens across all rows in this session. */
21
+ totalCacheReadTokens: number;
22
+ /** Total cache_creation_tokens across all rows in this session. */
23
+ totalCacheCreationTokens: number;
24
+ /** Number of rows in this session. */
25
+ requestCount: number;
26
+ /** Number of rows where cache_read_tokens > 0. */
27
+ hitCount: number;
28
+ /** hitCount / requestCount (0 when requestCount = 0). */
29
+ hitRate: number;
30
+ /** Distinct stable_prefix_hash values seen in this session. */
31
+ distinctPrefixCount: number;
32
+ /** Last time any row in this session was written (datetime_utc max). ISO string or null. */
33
+ lastRequestAt: string | null;
34
+ /** Estimated USD saved by cache reads in this session (best-effort). */
35
+ estimatedSavingsUsd: number;
36
+ /**
37
+ * Slice 3: best-effort remaining TTL on the Anthropic cache breakpoint
38
+ * established at lastRequestAt. Null for non-claude CLIs (we have no
39
+ * read on their cache state) and null when lastRequestAt is null.
40
+ * Computed by computeTtlRemaining(); see ttlPolicy parameter.
41
+ */
42
+ ttlRemainingMs: number | null;
43
+ }
44
+ export interface PrefixCacheStats {
45
+ stablePrefixHash: string;
46
+ requestCount: number;
47
+ hitCount: number;
48
+ hitRate: number;
49
+ totalCacheReadTokens: number;
50
+ totalCacheCreationTokens: number;
51
+ /** Distinct CLI x model combos that hashed to this prefix. */
52
+ cliBreakdown: Array<{
53
+ cli: CacheStatsCli;
54
+ model: string;
55
+ count: number;
56
+ }>;
57
+ firstSeenAt: string | null;
58
+ lastSeenAt: string | null;
59
+ estimatedSavingsUsd: number;
60
+ }
61
+ export interface GlobalCacheStats {
62
+ /** Optional window: rows since (now - lastNHours * 3600s). */
63
+ windowHours: number | null;
64
+ totalRequests: number;
65
+ totalHits: number;
66
+ hitRate: number;
67
+ totalCacheReadTokens: number;
68
+ totalCacheCreationTokens: number;
69
+ perCli: Array<{
70
+ cli: CacheStatsCli;
71
+ requestCount: number;
72
+ hitCount: number;
73
+ hitRate: number;
74
+ totalCacheReadTokens: number;
75
+ totalCacheCreationTokens: number;
76
+ estimatedSavingsUsd: number;
77
+ }>;
78
+ estimatedSavingsUsd: number;
79
+ }
80
+ export declare function computeSessionCacheStats(db: FlightRecorderQuery, sessionId: string): SessionCacheStats;
81
+ export interface TtlPolicy {
82
+ /**
83
+ * Seconds: how long Anthropic holds a cache entry after the last
84
+ * write. Default 300 (5 minutes). Set to 3600 when the operator has
85
+ * opted into Anthropic's 1-hour cache TTL via
86
+ * `[cache_awareness].anthropic_ttl_seconds = 3600`.
87
+ */
88
+ anthropicTtlSeconds: 300 | 3600;
89
+ /** Defaults to `() => Date.now()`. Overridable for deterministic tests. */
90
+ now?: () => number;
91
+ }
92
+ /**
93
+ * Slice 3: compute the best-effort milliseconds remaining on the cache
94
+ * breakpoint established at `stats.lastRequestAt`.
95
+ *
96
+ * - Claude: Anthropic's documented TTL (5min default, 1h beta). Computed
97
+ * as max(0, ttl - (now - lastWriteAt)).
98
+ * - Other CLIs: returns null. We do not observe the provider's actual
99
+ * cache state, so any number we'd return would be a guess. session_get
100
+ * and cache_state resources should report null for these.
101
+ *
102
+ * Note: this is "best effort". A cache eviction inside Anthropic's
103
+ * window will NOT be visible to us — the warning may be optimistic
104
+ * (see risks section in dag.toml).
105
+ */
106
+ export declare function computeTtlRemaining(stats: SessionCacheStats, cli: CacheStatsCli | null, ttlPolicy: TtlPolicy): number | null;
107
+ export declare function computePrefixCacheStats(db: FlightRecorderQuery, stablePrefixHash: string): PrefixCacheStats;
108
+ export interface GlobalCacheStatsOpts {
109
+ /** If set, restrict to rows whose datetime_utc is within the last N hours. */
110
+ lastNHours?: number;
111
+ }
112
+ export declare function computeGlobalCacheStats(db: FlightRecorderQuery, opts?: GlobalCacheStatsOpts): GlobalCacheStats;
@@ -0,0 +1,225 @@
1
+ /**
2
+ * Cache observability aggregates.
3
+ *
4
+ * Pure read-only aggregation over the FlightRecorder's `requests` table.
5
+ * No new storage — every value is computed at query time from existing
6
+ * columns (`cache_read_tokens`, `cache_creation_tokens`, `stable_prefix_*`,
7
+ * `datetime_utc`, etc.).
8
+ *
9
+ * COALESCE / NULL handling: rows from before the v3 migration have NULL
10
+ * for stable_prefix_*. Rows from CLIs whose parser does not surface cache
11
+ * tokens (gemini, grok, mistral, and codex until its parser is fixed)
12
+ * have NULL for cache_read_tokens / cache_creation_tokens. All aggregates
13
+ * tolerate NULL via COALESCE(col, 0), and hit rates return 0 when there are no rows.
14
+ */
15
+ import { estimateCacheSavingsUsd } from "./pricing.js";
16
+ function safeNum(n) {
17
+ return typeof n === "number" && Number.isFinite(n) ? n : 0;
18
+ }
19
+ function isCacheStatsCli(s) {
20
+ return s === "claude" || s === "codex" || s === "gemini" || s === "grok" || s === "mistral";
21
+ }
22
+ export function computeSessionCacheStats(db, sessionId) {
23
+ const rows = db.queryRequests(`SELECT cli, model,
24
+ COALESCE(cache_read_tokens, 0) AS cache_read_tokens,
25
+ COALESCE(cache_creation_tokens, 0) AS cache_creation_tokens,
26
+ stable_prefix_hash,
27
+ datetime_utc
28
+ FROM requests
29
+ WHERE session_id = ?
30
+ ORDER BY datetime_utc DESC`, sessionId);
31
+ let totalRead = 0;
32
+ let totalCreation = 0;
33
+ let hitCount = 0;
34
+ const prefixSet = new Set();
35
+ let lastAt = null;
36
+ let cli = null;
37
+ let estimatedSavingsUsd = 0;
38
+ for (const row of rows) {
39
+ const reads = safeNum(row.cache_read_tokens);
40
+ const creation = safeNum(row.cache_creation_tokens);
41
+ totalRead += reads;
42
+ totalCreation += creation;
43
+ if (reads > 0)
44
+ hitCount += 1;
45
+ if (row.stable_prefix_hash)
46
+ prefixSet.add(row.stable_prefix_hash);
47
+ if (!lastAt || row.datetime_utc > lastAt)
48
+ lastAt = row.datetime_utc;
49
+ if (cli === null && isCacheStatsCli(row.cli))
50
+ cli = row.cli;
51
+ if (isCacheStatsCli(row.cli)) {
52
+ estimatedSavingsUsd += estimateCacheSavingsUsd(row.cli, row.model, reads);
53
+ }
54
+ }
55
+ const requestCount = rows.length;
56
+ return {
57
+ sessionId,
58
+ cli,
59
+ totalCacheReadTokens: totalRead,
60
+ totalCacheCreationTokens: totalCreation,
61
+ requestCount,
62
+ hitCount,
63
+ hitRate: requestCount > 0 ? hitCount / requestCount : 0,
64
+ distinctPrefixCount: prefixSet.size,
65
+ lastRequestAt: lastAt,
66
+ estimatedSavingsUsd,
67
+ // ttlRemainingMs is populated by computeTtlRemaining() — the field
68
+ // exists on the type so the resource shape is uniform, but its value
69
+ // is left null here. Callers (session_get / cache_state resources)
70
+ // apply the configured TTL policy and set the field.
71
+ ttlRemainingMs: null,
72
+ };
73
+ }
74
+ /**
75
+ * Slice 3: compute the best-effort milliseconds remaining on the cache
76
+ * breakpoint established at `stats.lastRequestAt`.
77
+ *
78
+ * - Claude: Anthropic's documented TTL (5min default, 1h beta). Computed
79
+ * as max(0, ttl - (now - lastWriteAt)).
80
+ * - Other CLIs: returns null. We do not observe the provider's actual
81
+ * cache state, so any number we'd return would be a guess. session_get
82
+ * and cache_state resources should report null for these.
83
+ *
84
+ * Note: this is "best effort". A cache eviction inside Anthropic's
85
+ * window will NOT be visible to us — the warning may be optimistic
86
+ * (see risks section in dag.toml).
87
+ */
88
+ export function computeTtlRemaining(stats, cli, ttlPolicy) {
89
+ if (cli !== "claude")
90
+ return null;
91
+ if (!stats.lastRequestAt)
92
+ return null;
93
+ const nowMs = (ttlPolicy.now ?? Date.now)();
94
+ const lastWriteMs = Date.parse(stats.lastRequestAt);
95
+ if (!Number.isFinite(lastWriteMs))
96
+ return null;
97
+ const elapsedMs = nowMs - lastWriteMs;
98
+ const ttlMs = ttlPolicy.anthropicTtlSeconds * 1000;
99
+ return Math.max(0, ttlMs - elapsedMs);
100
+ }
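The TTL arithmetic above is easiest to check with a fixed clock. The sketch below repeats `computeTtlRemaining` in condensed form so it runs standalone, then applies a 30-second threshold like the one the changelog describes for `cache_ttl_expiring_soon`. The constant name `EXPIRING_SOON_MS` and the inclusive `<=` comparison are assumptions; the timestamps are made up.

```javascript
// Condensed copy of computeTtlRemaining from the diff, repeated so this runs standalone.
function computeTtlRemaining(stats, cli, ttlPolicy) {
  if (cli !== "claude") return null;
  if (!stats.lastRequestAt) return null;
  const nowMs = (ttlPolicy.now ?? Date.now)();
  const lastWriteMs = Date.parse(stats.lastRequestAt);
  if (!Number.isFinite(lastWriteMs)) return null;
  return Math.max(0, ttlPolicy.anthropicTtlSeconds * 1000 - (nowMs - lastWriteMs));
}

// Hypothetical constant; the changelog only says "within 30 s of expiry".
const EXPIRING_SOON_MS = 30_000;

const stats = { lastRequestAt: "2026-05-26T12:00:00.000Z" };
// Deterministic clock: 280 seconds after the last request.
const fixedNow = () => Date.parse("2026-05-26T12:04:40.000Z");

const remaining = computeTtlRemaining(stats, "claude", {
  anthropicTtlSeconds: 300,
  now: fixedNow,
});
console.log(remaining); // 20000
console.log(remaining !== null && remaining <= EXPIRING_SOON_MS); // true

// With the 1-hour TTL the same session is nowhere near expiry.
console.log(computeTtlRemaining(stats, "claude", { anthropicTtlSeconds: 3600, now: fixedNow })); // 3320000

// Non-claude CLIs always report null.
console.log(computeTtlRemaining(stats, "codex", { anthropicTtlSeconds: 300, now: fixedNow })); // null
```

Injecting `now` through the policy object is what makes the function testable without mocking the global clock.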
101
+ export function computePrefixCacheStats(db, stablePrefixHash) {
102
+ const rows = db.queryRequests(`SELECT cli, model,
103
+ COALESCE(cache_read_tokens, 0) AS cache_read_tokens,
104
+ COALESCE(cache_creation_tokens, 0) AS cache_creation_tokens,
105
+ stable_prefix_hash,
106
+ datetime_utc
107
+ FROM requests
108
+ WHERE stable_prefix_hash = ?
109
+ ORDER BY datetime_utc ASC`, stablePrefixHash);
110
+ let totalRead = 0;
111
+ let totalCreation = 0;
112
+ let hitCount = 0;
113
+ let firstAt = null;
114
+ let lastAt = null;
115
+ let estimatedSavingsUsd = 0;
116
+ const cliMap = new Map();
117
+ for (const row of rows) {
118
+ const reads = safeNum(row.cache_read_tokens);
119
+ totalRead += reads;
120
+ totalCreation += safeNum(row.cache_creation_tokens);
121
+ if (reads > 0)
122
+ hitCount += 1;
123
+ if (!firstAt)
124
+ firstAt = row.datetime_utc;
125
+ lastAt = row.datetime_utc;
126
+ if (isCacheStatsCli(row.cli)) {
127
+ estimatedSavingsUsd += estimateCacheSavingsUsd(row.cli, row.model, reads);
128
+ const key = `${row.cli}::${row.model}`;
129
+ const entry = cliMap.get(key);
130
+ if (entry) {
131
+ entry.count += 1;
132
+ }
133
+ else {
134
+ cliMap.set(key, { cli: row.cli, model: row.model, count: 1 });
135
+ }
136
+ }
137
+ }
138
+ const requestCount = rows.length;
139
+ return {
140
+ stablePrefixHash,
141
+ requestCount,
142
+ hitCount,
143
+ hitRate: requestCount > 0 ? hitCount / requestCount : 0,
144
+ totalCacheReadTokens: totalRead,
145
+ totalCacheCreationTokens: totalCreation,
146
+ cliBreakdown: Array.from(cliMap.values()).sort((a, b) => b.count - a.count),
147
+ firstSeenAt: firstAt,
148
+ lastSeenAt: lastAt,
149
+ estimatedSavingsUsd,
150
+ };
151
+ }
152
+ export function computeGlobalCacheStats(db, opts = {}) {
153
+ const windowHours = opts.lastNHours ?? null;
154
+ const sinceIso = windowHours !== null && windowHours > 0
155
+ ? new Date(Date.now() - windowHours * 3600_000).toISOString()
156
+ : null;
157
+ const sql = sinceIso
158
+ ? `SELECT cli, model,
159
+ COALESCE(cache_read_tokens, 0) AS cache_read_tokens,
160
+ COALESCE(cache_creation_tokens, 0) AS cache_creation_tokens,
161
+ stable_prefix_hash,
162
+ datetime_utc
163
+ FROM requests
164
+ WHERE datetime_utc >= ?`
165
+ : `SELECT cli, model,
166
+ COALESCE(cache_read_tokens, 0) AS cache_read_tokens,
167
+ COALESCE(cache_creation_tokens, 0) AS cache_creation_tokens,
168
+ stable_prefix_hash,
169
+ datetime_utc
170
+ FROM requests`;
171
+ const rows = sinceIso ? db.queryRequests(sql, sinceIso) : db.queryRequests(sql);
172
+ const perCliMap = new Map();
173
+ let totalRequests = 0;
174
+ let totalHits = 0;
175
+ let totalRead = 0;
176
+ let totalCreation = 0;
177
+ let totalSavings = 0;
178
+ for (const row of rows) {
179
+ totalRequests += 1;
180
+ const reads = safeNum(row.cache_read_tokens);
181
+ const creation = safeNum(row.cache_creation_tokens);
182
+ totalRead += reads;
183
+ totalCreation += creation;
184
+ if (reads > 0)
185
+ totalHits += 1;
186
+ if (!isCacheStatsCli(row.cli))
187
+ continue;
188
+ const cli = row.cli;
189
+ const savings = estimateCacheSavingsUsd(cli, row.model, reads);
190
+ totalSavings += savings;
191
+ const agg = perCliMap.get(cli) ?? {
192
+ requestCount: 0,
193
+ hitCount: 0,
194
+ totalCacheReadTokens: 0,
195
+ totalCacheCreationTokens: 0,
196
+ estimatedSavingsUsd: 0,
197
+ };
198
+ agg.requestCount += 1;
199
+ if (reads > 0)
200
+ agg.hitCount += 1;
201
+ agg.totalCacheReadTokens += reads;
202
+ agg.totalCacheCreationTokens += creation;
203
+ agg.estimatedSavingsUsd += savings;
204
+ perCliMap.set(cli, agg);
205
+ }
206
+ const perCli = Array.from(perCliMap.entries()).map(([cli, agg]) => ({
207
+ cli,
208
+ requestCount: agg.requestCount,
209
+ hitCount: agg.hitCount,
210
+ hitRate: agg.requestCount > 0 ? agg.hitCount / agg.requestCount : 0,
211
+ totalCacheReadTokens: agg.totalCacheReadTokens,
212
+ totalCacheCreationTokens: agg.totalCacheCreationTokens,
213
+ estimatedSavingsUsd: agg.estimatedSavingsUsd,
214
+ }));
215
+ return {
216
+ windowHours,
217
+ totalRequests,
218
+ totalHits,
219
+ hitRate: totalRequests > 0 ? totalHits / totalRequests : 0,
220
+ totalCacheReadTokens: totalRead,
221
+ totalCacheCreationTokens: totalCreation,
222
+ perCli,
223
+ estimatedSavingsUsd: totalSavings,
224
+ };
225
+ }
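One consequence of the aggregation above is worth seeing with numbers: rows whose parser reports no cache tokens still count toward the request total, so they lower the reported hit rate rather than being excluded. The sketch below isolates that arithmetic. `summarize` is a hypothetical helper that mirrors the loop shared by the three aggregators, and the rows are made up.

```javascript
// Same guard as safeNum in the diff: anything that is not a finite number counts as 0.
function safeNum(n) {
  return typeof n === "number" && Number.isFinite(n) ? n : 0;
}

// Hypothetical helper mirroring the hit-count and hit-rate arithmetic in the diff.
function summarize(rows) {
  let hitCount = 0;
  let totalRead = 0;
  for (const row of rows) {
    const reads = safeNum(row.cache_read_tokens);
    totalRead += reads;
    if (reads > 0) hitCount += 1;
  }
  const requestCount = rows.length;
  return {
    requestCount,
    hitCount,
    totalRead,
    // Guard against an empty result set instead of dividing by zero.
    hitRate: requestCount > 0 ? hitCount / requestCount : 0,
  };
}

// Made-up rows: one claude hit, one claude miss, two CLIs that report no cache tokens.
const rows = [
  { cli: "claude", cache_read_tokens: 2048 },
  { cli: "claude", cache_read_tokens: 0 },
  { cli: "codex", cache_read_tokens: null },
  { cli: "gemini", cache_read_tokens: undefined },
];

console.log(summarize(rows).hitRate); // 0.25
console.log(summarize(rows.slice(0, 2)).hitRate); // 0.5
console.log(summarize([]).hitRate); // 0
```

Reading the `per_cli` breakdown alongside the global figure avoids mistaking missing telemetry for cache misses.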
package/dist/config.d.ts CHANGED
@@ -63,3 +63,44 @@ export interface PersistenceConfigSources {
63
63
  * Throws on incoherent configs (memory/none + asyncJobsEnabled without ack).
64
64
  */
65
65
  export declare function loadPersistenceConfig(logger?: Logger): PersistenceConfig;
66
+ export declare const ANTHROPIC_TTL_SECONDS_VALUES: readonly [300, 3600];
67
+ export type AnthropicTtlSeconds = (typeof ANTHROPIC_TTL_SECONDS_VALUES)[number];
68
+ /**
69
+ * Per-Anthropic-model-family minimum cacheable tokens. Sourced from
70
+ * docs/personal-mcp/PROVIDER_CACHE_SURFACES.md (Anthropic API docs as of
71
+ * 2026-05-26). Prompts below the threshold are not cached even with
72
+ * cache_control set; Anthropic silently processes them un-cached.
73
+ */
74
+ export declare const DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL: {
75
+ readonly sonnet: 1024;
76
+ readonly opus: 4096;
77
+ readonly haiku: 4096;
78
+ readonly default: 4096;
79
+ };
80
+ export type ModelFamilyAlias = keyof typeof DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL;
81
+ export interface CacheAwarenessConfig {
82
+ emitAnthropicCacheControl: boolean;
83
+ anthropicTtlSeconds: AnthropicTtlSeconds;
84
+ warnOnTtlExpiry: boolean;
85
+ minStableTokensForCacheControl: {
86
+ sonnet: number;
87
+ opus: number;
88
+ haiku: number;
89
+ default: number;
90
+ };
91
+ /** Audit trail: file the config was loaded from (or null if defaults). */
92
+ sources: {
93
+ configFile: string | null;
94
+ };
95
+ }
96
+ /**
97
+ * Load [cache_awareness] from ~/.llm-cli-gateway/config.toml. Defaults: all
98
+ * behaviour off, per-model min-token thresholds from PROVIDER_CACHE_SURFACES.md.
99
+ */
100
+ export declare function loadCacheAwarenessConfig(logger?: Logger): CacheAwarenessConfig;
101
+ /**
102
+ * Look up the per-model-family threshold. `modelName` is the user-facing model
103
+ * string (e.g. "claude-sonnet-4-6", "claude-opus-4-7"). Falls back to `default`
104
+ * when the family is unrecognised.
105
+ */
106
+ export declare function minStableTokensForModel(config: CacheAwarenessConfig, modelName: string): number;
package/dist/config.js CHANGED
@@ -227,3 +227,112 @@ export function loadPersistenceConfig(logger = noopLogger) {
227
227
  sources,
228
228
  };
229
229
  }
230
+ //──────────────────────────────────────────────────────────────────────────────
231
+ // Cache-awareness configuration
232
+ //
233
+ // Reads the [cache_awareness] block from the same ~/.llm-cli-gateway/config.toml
234
+ // file as [persistence], but uses a SEPARATE loader and schema. Keeping the two
235
+ // independent means a malformed [cache_awareness] never breaks persistence
236
+ // loading and vice versa. No env-var overrides — purely TOML.
237
+ //
238
+ // All defaults are "off"; behavioural changes (slice 1 cache_control, slice 3
239
+ // TTL warnings) ship dormant until operators opt in.
240
+ //──────────────────────────────────────────────────────────────────────────────
241
+ export const ANTHROPIC_TTL_SECONDS_VALUES = [300, 3600];
242
+ /**
243
+ * Per-Anthropic-model-family minimum cacheable tokens. Sourced from
244
+ * docs/personal-mcp/PROVIDER_CACHE_SURFACES.md (Anthropic API docs as of
245
+ * 2026-05-26). Prompts below the threshold are not cached even with
246
+ * cache_control set; Anthropic silently processes them un-cached.
247
+ */
248
+ export const DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL = {
249
+ sonnet: 1024,
250
+ opus: 4096,
251
+ haiku: 4096,
252
+ default: 4096,
253
+ };
254
+ const MinStableTokensSchema = z
255
+ .object({
256
+ sonnet: z.number().int().positive().default(DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL.sonnet),
257
+ opus: z.number().int().positive().default(DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL.opus),
258
+ haiku: z.number().int().positive().default(DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL.haiku),
259
+ default: z
260
+ .number()
261
+ .int()
262
+ .positive()
263
+ .default(DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL.default),
264
+ })
265
+ .strict()
266
+ .default({
267
+ sonnet: DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL.sonnet,
268
+ opus: DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL.opus,
269
+ haiku: DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL.haiku,
270
+ default: DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL.default,
271
+ });
272
+ const CacheAwarenessSchema = z
273
+ .object({
274
+ emit_anthropic_cache_control: z.boolean().default(false),
275
+ anthropic_ttl_seconds: z.union([z.literal(300), z.literal(3600)]).default(300),
276
+ warn_on_ttl_expiry: z.boolean().default(false),
277
+ min_stable_tokens_for_cache_control: MinStableTokensSchema,
278
+ })
279
+ .strict();
280
+ function readCacheAwarenessFile(configPath, logger) {
281
+ if (!existsSync(configPath)) {
282
+ return { raw: undefined, sourcePath: null };
283
+ }
284
+ try {
285
+ const require = createRequire(import.meta.url);
286
+ const TOML = require("smol-toml");
287
+ const text = readFileSync(configPath, "utf-8");
288
+ const parsed = TOML.parse(text);
289
+ return { raw: parsed?.cache_awareness, sourcePath: configPath };
290
+ }
291
+ catch (err) {
292
+ logger.error(`Failed to parse gateway config at ${configPath}; using cache_awareness defaults`, err);
293
+ return { raw: undefined, sourcePath: null };
294
+ }
295
+ }
296
+ /**
297
+ * Load [cache_awareness] from ~/.llm-cli-gateway/config.toml. Defaults: all
298
+ * behaviour off, per-model min-token thresholds from PROVIDER_CACHE_SURFACES.md.
299
+ */
300
+ export function loadCacheAwarenessConfig(logger = noopLogger) {
301
+ const configPath = defaultPersistenceConfigPath();
302
+ const { raw, sourcePath } = readCacheAwarenessFile(configPath, logger);
303
+ let parsed;
304
+ try {
305
+ parsed = CacheAwarenessSchema.parse(raw ?? {});
306
+ }
307
+ catch (err) {
308
+ throw new Error(`Invalid [cache_awareness] config: ${err instanceof Error ? err.message : String(err)}`);
309
+ }
310
+ return {
311
+ emitAnthropicCacheControl: parsed.emit_anthropic_cache_control,
312
+ anthropicTtlSeconds: parsed.anthropic_ttl_seconds,
313
+ warnOnTtlExpiry: parsed.warn_on_ttl_expiry,
314
+ minStableTokensForCacheControl: {
315
+ sonnet: parsed.min_stable_tokens_for_cache_control.sonnet,
316
+ opus: parsed.min_stable_tokens_for_cache_control.opus,
317
+ haiku: parsed.min_stable_tokens_for_cache_control.haiku,
318
+ default: parsed.min_stable_tokens_for_cache_control.default,
319
+ },
320
+ sources: { configFile: sourcePath },
321
+ };
322
+ }
323
+ /**
324
+ * Look up the per-model-family threshold. `modelName` is the user-facing model
325
+ * string (e.g. "claude-sonnet-4-6", "claude-opus-4-7"). Falls back to `default`
326
+ * when the family is unrecognised.
327
+ */
328
+ export function minStableTokensForModel(config, modelName) {
329
+ const lower = modelName.toLowerCase();
330
+ const table = config.minStableTokensForCacheControl;
331
+ if (lower.includes("sonnet"))
332
+ return table.sonnet;
333
+ if (lower.includes("opus"))
334
+ return table.opus;
335
+ if (lower.includes("haiku"))
336
+ return table.haiku;
337
+ return table.default;
338
+ }
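A short usage sketch for the family lookup above. The function body is copied from the diff so the example runs standalone. The `default` value is overridden to 2048 here purely to make the fallback visible, and `prefixIsCacheable` is a hypothetical helper: the diff does not show how the threshold is consumed, so the `>=` comparison is an assumption.

```javascript
// Copy of minStableTokensForModel from the diff, repeated so this runs standalone.
function minStableTokensForModel(config, modelName) {
  const lower = modelName.toLowerCase();
  const table = config.minStableTokensForCacheControl;
  if (lower.includes("sonnet")) return table.sonnet;
  if (lower.includes("opus")) return table.opus;
  if (lower.includes("haiku")) return table.haiku;
  return table.default;
}

// Shipped defaults, except `default`, which is changed so the fallback is distinguishable.
const config = {
  minStableTokensForCacheControl: { sonnet: 1024, opus: 4096, haiku: 4096, default: 2048 },
};

console.log(minStableTokensForModel(config, "claude-sonnet-4-6")); // 1024
console.log(minStableTokensForModel(config, "Claude-Opus-4-7")); // 4096 (match is case-insensitive)
console.log(minStableTokensForModel(config, "some-unknown-model")); // 2048 (fallback)

// Hypothetical gate: is the stable prefix long enough to be worth marking?
function prefixIsCacheable(cfg, modelName, stablePrefixTokens) {
  return stablePrefixTokens >= minStableTokensForModel(cfg, modelName);
}

console.log(prefixIsCacheable(config, "claude-sonnet-4-6", 900)); // false
console.log(prefixIsCacheable(config, "claude-sonnet-4-6", 1500)); // true
```

The lookup is a substring match checked in a fixed order, so a model string containing more than one family name resolves to the first match (`sonnet`, then `opus`, then `haiku`).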