npm - llm-cli-gateway - Versions diffs - 1.5.35 → 1.6.0 - Mend

llm-cli-gateway 1.5.35 → 1.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (21)

package/CHANGELOG.md +105 -0
package/README.md +31 -0
package/dist/cache-stats.d.ts +112 -0
package/dist/cache-stats.js +225 -0
package/dist/config.d.ts +41 -0
package/dist/config.js +109 -0
package/dist/doctor.d.ts +42 -1
package/dist/doctor.js +121 -2
package/dist/flight-recorder.d.ts +27 -0
package/dist/flight-recorder.js +79 -2
package/dist/index.d.ts +46 -9
package/dist/index.js +395 -67
package/dist/pricing.d.ts +54 -0
package/dist/pricing.js +100 -0
package/dist/prompt-parts.d.ts +38 -0
package/dist/prompt-parts.js +42 -0
package/dist/resources.d.ts +32 -1
package/dist/resources.js +52 -1
package/package.json +2 -1
package/setup/status.schema.json +39 -0
package/socket.yml +10 -0

package/CHANGELOG.md CHANGED

@@ -2,6 +2,111 @@
 All notable changes to the llm-cli-gateway project.
+## [1.6.0] - 2026-05-26 — cache-awareness phase 1 + security posture
+Also includes (beyond cache-awareness):
+### Added — free-OSS security posture (matches verivus-oss/agent-assurance)
+- New `.github/workflows/security.yml` running on every push + PR:
+  actionlint, zizmor, shellcheck, typos, osv-scanner, gitleaks, ruff,
+  bandit, lychee. SHA-pinned, fail-on-finding.
+- `eslint-plugin-security` 3.0.1 wired into the existing eslint config.
+- `SECURITY.md` (vulnerability reporting policy), `.github/CODEOWNERS`
+  (review routing for security-sensitive paths), `_typos.toml`,
+  `lychee.toml`, `.gitleaks.toml`, `.github/actionlint.yaml`,
+  `integrations/llm-plugin/.bandit`.
+- Workflow hygiene: top-level `permissions: contents: read`, per-job
+  explicit, `persist-credentials: false` on every `actions/checkout`
+  except the upload job in `release-installer.yml`. Cache disabled on
+  release-triggered setup-node/setup-go (zizmor cache-poisoning).
+- Dependabot: added `npm` ecosystem at `/` and `pip` ecosystem at
+  `/integrations/llm-plugin/` (github-actions group preserved).
+- `installer/go.mod` bumped Go 1.22 → 1.25 (clears 26 stdlib CVEs
+  flagged by osv-scanner); `release-installer.yml` setup-go pin
+  updated in lock-step.
+### Added — cache-awareness slice 1+2+3 (all opt-in, default OFF)
+- **`promptParts` on every `*_request` / `*_request_async` tool** (claude, codex,
+  gemini, grok, mistral; sync + async = 10 tools). Accepts
+  `{ system?, tools?, context?, task }`. Mutually exclusive with `prompt`.
+  The gateway concatenates in canonical order (`system → tools → context → task`)
+  so the stable prefix bytes precede the volatile task tail unchanged across
+  calls — raising implicit cache hit rate without calling provider cache APIs.
+  The exact error strings `provide exactly one of \`prompt\` or \`promptParts\``
+  and `one of \`prompt\` or \`promptParts\` is required` are stable API
+  contract.
+- **Flight-recorder v3 migration**: new columns `stable_prefix_hash`
+  (sha256) and `stable_prefix_tokens` (integer bytes/4 heuristic) on
+  `requests`, plus `idx_requests_stable_hash`. Legacy rows keep NULL.
+- **Cache-state MCP resources** (read-only, tokens/hashes/aggregates only —
+  never raw prompt text):
+  - `cache_state://global` (last 24h aggregates + per-CLI breakdown).
+  - `cache_state://session/{sessionId}` (per-session).
+  - `cache_state://prefix/{hash}` (per-stable-prefix-hash).
+- **`session_get.cacheState`** projection: compact hit-rate / hit-count /
+  cache-token-totals / estimated-savings-USD block, present only when the
+  session has prior requests. Omitted entirely (not null, not empty) for
+  fresh sessions. NOT persisted on the Session interface — it is a
+  read-time projection from the flight recorder.
+- **`computeTtlRemaining()` + `cache_ttl_expiring_soon` warning**: claude
+  sync + async handlers attach a structured `warnings[]` entry when a
+  resumed session's Anthropic cache breakpoint is within 30 s of expiry
+  (gated on `[cache_awareness].warn_on_ttl_expiry`; default false). The
+  TTL math respects `anthropic_ttl_seconds = 300 | 3600`.
+- **Doctor `cache_awareness` block**: always present, zeroed when the
+  flight recorder is empty. Reports `enabled_features` (active flags),
+  `last_24h` (hit rate + savings), and `per_cli` aggregates. JSON schema
+  updated; `setup/status.schema.json` `additionalProperties: false`
+  intact at the root.
+- **`[cache_awareness]` config block** in `~/.llm-cli-gateway/config.toml`:
+  - `emit_anthropic_cache_control = false`
+  - `anthropic_ttl_seconds = 300` (enum: 300 | 3600)
+  - `warn_on_ttl_expiry = false`
+  - `[cache_awareness.min_stable_tokens_for_cache_control]` per-family
+    table (sonnet=1024, opus=4096, haiku=4096, default=4096).
+  Validated by a separate Zod schema and loader (`loadCacheAwarenessConfig`);
+  a malformed `[cache_awareness]` block does NOT break `loadPersistenceConfig`
+  and vice versa. No env-var overrides.
+### Decision: Branch B (prefix-discipline only) for slice 1
+The gateway does NOT emit explicit `cache_control` JSON to Claude in this
+slice and does NOT route `promptParts.system` into `--system-prompt`. The
+upstream injection mechanism is unverified; Branch A is gated on a live
+smoke test in a follow-up slice. The
+`[cache_awareness].emit_anthropic_cache_control` flag is in place for
+when that lands.
+### Deferred / out of scope
+- **Async-path `stable_prefix_hash` recording**: `src/async-job-manager.ts`
+  has zero flight-recorder integration today, so the v3 columns are NOT
+  populated for async-job rows. This is a separate concern beyond
+  cache-awareness — tracked for a future plan
+  (`docs/plans/async-flight-recorder.dag.toml`, TBD). Slice 1's runtime
+  mutex check IS in place on the async tool surface; only the flight-recorder
+  write deferral applies.
+- **Codex parser cache-tokens fix**: `src/codex-json-parser.ts` reads
+  Anthropic-style `cache_read_input_tokens` but Codex CLI 0.133.0+ emits
+  `cached_input_tokens`. `cache_read_tokens` therefore stays NULL for codex
+  rows today. Out of scope for this slice (see PROVIDER_CACHE_SURFACES.md).
+### Invariant
+"No conversation content in session storage" is preserved. The session
+manager (`~/.llm-cli-gateway/sessions.json`) is UNTOUCHED by this slice.
+The cache-awareness columns added by migration v3
+(`stable_prefix_hash`, `stable_prefix_tokens`) live on the existing
+flight recorder (`~/.llm-cli-gateway/logs.db`), which is a separate
+audit-focused store that already records prompts and responses (and is
+not subject to the session-storage invariant). `session_get.cacheState`
+is a read-time PROJECTION from the flight recorder, never persisted on
+the Session interface.
 ## [1.5.35] - 2026-05-25
 ### Fixed
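The `prompt` / `promptParts` contract described in the 1.6.0 entry can be sketched roughly as below. This is an illustration under stated assumptions, not the gateway's actual code: the names `PromptParts`, `RequestArgs`, and `resolvePrompt` are hypothetical, the `"\n\n"` separator is a guess, and the mapping of each error string to each condition is one plausible reading of the changelog. Only the field names, the canonical order, and the two error strings come from the source.

```typescript
// Hypothetical sketch of the prompt / promptParts contract.
// Type names, function name, and the "\n\n" separator are assumptions.
interface PromptParts {
  system?: string;
  tools?: string;
  context?: string;
  task: string;
}

interface RequestArgs {
  prompt?: string;
  promptParts?: PromptParts;
}

function resolvePrompt(args: RequestArgs): string {
  const hasPrompt = args.prompt !== undefined;
  const hasParts = args.promptParts !== undefined;

  // Mutual exclusion: both supplied (assumed mapping of the error string).
  if (hasPrompt && hasParts) {
    throw new Error("provide exactly one of `prompt` or `promptParts`");
  }
  // Neither supplied (assumed mapping of the error string).
  if (!hasPrompt && !hasParts) {
    throw new Error("one of `prompt` or `promptParts` is required");
  }
  if (hasPrompt) {
    return args.prompt as string;
  }

  const parts = args.promptParts as PromptParts;
  // Canonical order: system -> tools -> context -> task, so the stable
  // prefix always precedes the volatile task tail.
  return [parts.system, parts.tools, parts.context, parts.task]
    .filter((p): p is string => typeof p === "string" && p.length > 0)
    .join("\n\n");
}
```

Because optional parts are skipped rather than replaced with placeholders, two calls that share the same `system`, `tools`, and `context` values produce byte-identical prefixes and differ only in the `task` tail.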

package/README.md CHANGED

@@ -88,6 +88,36 @@ docker compose -f docker-compose.personal.yml run --rm doctor
 ### Observability
 - **SQLite Flight Recorder**: Every request/response logged to `~/.llm-cli-gateway/logs.db` with correlation IDs, token usage, duration, retry counts, and circuit breaker state. Browse with [Datasette](https://datasette.io/): `datasette ~/.llm-cli-gateway/logs.db`
 - **Structured Metadata**: Tool responses include machine-readable `structuredContent` (model, cli, correlationId, sessionId, durationMs, token counts)
+- **Cache observability resources**: `cache_state://global`, `cache_state://session/{id}`, and `cache_state://prefix/{hash}` MCP resources return aggregate cache hit/miss/savings — tokens and hashes only, no prompt text. `session_get` includes a `cacheState` block when the session has prior requests.
+### Cache-aware operation
+Every `*_request` and `*_request_async` tool accepts an optional `promptParts` field that structures the prompt for better cache hit rates. The gateway concatenates the parts in canonical order (`system → tools → context → task`) so that the stable prefix bytes precede the volatile task tail unchanged across calls, letting each provider's automatic prompt-caching land on the same content hash each time.
+```json
+{
+  "promptParts": {
+    "system": "You are a helpful code reviewer.",
+    "tools": "You have access to Read, Grep, Bash.",
+    "context": "<long stable context block — file dumps, etc.>",
+    "task": "Review the changes in src/foo.ts for security issues."
+  }
+}
+```
+`prompt` and `promptParts` are mutually exclusive — pass exactly one.
+Per-CLI capability matrix:
+| CLI     | Prefix discipline (auto via `promptParts`) | Explicit `cache_control` emission |
+|---------|--------------------------------------------|------------------------------------|
+| claude  | yes                                        | not yet (Branch B; gated on `[cache_awareness].emit_anthropic_cache_control`) |
+| codex   | yes                                        | n/a (OpenAI implicit cache, no CLI lever) |
+| gemini  | yes                                        | n/a (implicit prefix cache server-side)  |
+| grok    | yes                                        | n/a (no surfaced cache lever)            |
+| mistral | yes                                        | n/a (no surfaced cache lever)            |
+Opt-in flags (all default off) live under `[cache_awareness]` in `~/.llm-cli-gateway/config.toml`. See `docs/personal-mcp/PROVIDER_CACHE_SURFACES.md` for the per-model minimum cacheable token thresholds and field-name divergences.
 ### Reliability & Performance
 - **Retry Logic**: Exponential backoff with circuit breaker for transient failures
@@ -1019,6 +1049,7 @@ If you're vetting `llm-cli-gateway` through [Socket](https://socket.dev/npm/pack
 | **Shell access** | `src/executor.ts` uses `child_process.spawn(cmd, args, …)` to invoke the underlying LLM CLIs. | `spawn` is called with an argument array and **never** `shell: true`, so there is no shell interpolation path for caller input. The command name is restricted to an allow-list of known CLI binaries (`claude`, `codex`, `gemini`, `grok`, `vibe`). |
 | **Uses eval** | None in our source. Transitive: `@modelcontextprotocol/sdk` → `ajv@8` uses `new Function(...)` in `ajv/dist/compile/index.js` to compile JSON Schema validators. | This is ajv's standard codegen path. Only known schemas (defined in our source and the MCP SDK) flow into it; no caller-supplied data ever reaches the compiled function body. |
 | **better-sqlite3 PRAGMA helper** | Transitive: `better-sqlite3/lib/methods/pragma.js` interpolates its caller-provided `source` into a `PRAGMA ${source}` statement. | We do not call `db.pragma()` from production source. Internal SQLite setup uses fixed literal `db.exec("PRAGMA ...")` statements, and `npm run security:audit` fails the release if production code reintroduces `.pragma()` calls. |
+| **ioredis obfuscated code** | Optional peer/dev dependency: `ioredis@5.10.1` may be flagged at `built/constants/TLSProfiles.js` for base64-looking strings. | Reviewed as a false positive. The file is a Redis Cloud TLS CA certificate bundle in PEM format, which is base64 by design. It contains no decoder loop, dynamic evaluation, network call, or hidden execution path. The same file is byte-for-byte identical in `ioredis@5.9.2`; our default production install does not install `ioredis`, and our code does not pass ioredis TLS profile options. |
 | **Dependency ownership** | A handful of small transitive packages (e.g. `bindings` via `better-sqlite3`, `media-typer` via `@modelcontextprotocol/sdk`) trip Socket's "unstable ownership" or "obfuscated code" heuristics. | These are pinned, well-known micro-deps in the Node ecosystem with no known issues. We pin direct override versions of `content-type` and `type-is` in `package.json#overrides`. Our previous direct dependency on `toml@3.0.0` (also single-maintainer, last released 2020) was replaced with the actively-maintained `smol-toml` to reduce inherited risk. |
 See [`socket.yml`](./socket.yml) for the same context in machine-readable form.

package/dist/cache-stats.d.ts ADDED

@@ -0,0 +1,112 @@
+/**
+ * Cache observability aggregates.
+ *
+ * Pure read-only aggregation over the FlightRecorder's `requests` table.
+ * No new storage — every value is computed at query time from existing
+ * columns (`cache_read_tokens`, `cache_creation_tokens`, `stable_prefix_*`,
+ * `datetime_utc`, etc.).
+ *
+ * COALESCE / NULL handling: rows from before the v3 migration have NULL
+ * for stable_prefix_*. Rows from CLIs whose parser does not surface cache
+ * tokens (gemini, grok, mistral, and codex until its parser is fixed)
+ * have NULL for cache_read_tokens / cache_creation_tokens. All aggregates
+ * tolerate NULL via COALESCE(col, 0) — never divides by zero.
+ */
+import type { FlightRecorderQuery } from "./flight-recorder.js";
+export type CacheStatsCli = "claude" | "codex" | "gemini" | "grok" | "mistral";
+export interface SessionCacheStats {
+    sessionId: string;
+    cli: CacheStatsCli | null;
+    /** Total cache_read_tokens across all rows in this session. */
+    totalCacheReadTokens: number;
+    /** Total cache_creation_tokens across all rows in this session. */
+    totalCacheCreationTokens: number;
+    /** Number of rows in this session. */
+    requestCount: number;
+    /** Number of rows where cache_read_tokens > 0. */
+    hitCount: number;
+    /** hitCount / requestCount (0 when requestCount = 0). */
+    hitRate: number;
+    /** Distinct stable_prefix_hash values seen in this session. */
+    distinctPrefixCount: number;
+    /** Last time any row in this session was written (datetime_utc max). ISO string or null. */
+    lastRequestAt: string | null;
+    /** Estimated USD saved by cache reads in this session (best-effort). */
+    estimatedSavingsUsd: number;
+    /**
+     * Slice 3: best-effort remaining TTL on the Anthropic cache breakpoint
+     * established at lastRequestAt. Null for non-claude CLIs (we have no
+     * read on their cache state) and null when lastRequestAt is null.
+     * Computed by computeTtlRemaining(); see ttlPolicy parameter.
+     */
+    ttlRemainingMs: number | null;
+}
+export interface PrefixCacheStats {
+    stablePrefixHash: string;
+    requestCount: number;
+    hitCount: number;
+    hitRate: number;
+    totalCacheReadTokens: number;
+    totalCacheCreationTokens: number;
+    /** Distinct CLI x model combos that hashed to this prefix. */
+    cliBreakdown: Array<{
+        cli: CacheStatsCli;
+        model: string;
+        count: number;
+    }>;
+    firstSeenAt: string | null;
+    lastSeenAt: string | null;
+    estimatedSavingsUsd: number;
+}
+export interface GlobalCacheStats {
+    /** Optional window: rows since (now - lastNHours * 3600s). */
+    windowHours: number | null;
+    totalRequests: number;
+    totalHits: number;
+    hitRate: number;
+    totalCacheReadTokens: number;
+    totalCacheCreationTokens: number;
+    perCli: Array<{
+        cli: CacheStatsCli;
+        requestCount: number;
+        hitCount: number;
+        hitRate: number;
+        totalCacheReadTokens: number;
+        totalCacheCreationTokens: number;
+        estimatedSavingsUsd: number;
+    }>;
+    estimatedSavingsUsd: number;
+}
+export declare function computeSessionCacheStats(db: FlightRecorderQuery, sessionId: string): SessionCacheStats;
+export interface TtlPolicy {
+    /**
+     * Seconds: how long Anthropic holds a cache entry after the last
+     * write. Default 300 (5 minutes). Set to 3600 when the operator has
+     * opted into Anthropic's 1-hour cache TTL via
+     * `[cache_awareness].anthropic_ttl_seconds = 3600`.
+     */
+    anthropicTtlSeconds: 300 | 3600;
+    /** Defaults to `() => Date.now()`. Overridable for deterministic tests. */
+    now?: () => number;
+}
+/**
+ * Slice 3: compute the best-effort milliseconds remaining on the cache
+ * breakpoint established at `stats.lastRequestAt`.
+ *
+ * - Claude: Anthropic's documented TTL (5min default, 1h beta). Computed
+ *   as max(0, ttl - (now - lastWriteAt)).
+ * - Other CLIs: returns null. We do not observe the provider's actual
+ *   cache state, so any number we'd return would be a guess. session_get
+ *   and cache_state resources should report null for these.
+ *
+ * Note: this is "best effort". A cache eviction inside Anthropic's
+ * window will NOT be visible to us — the warning may be optimistic
+ * (see risks section in dag.toml).
+ */
+export declare function computeTtlRemaining(stats: SessionCacheStats, cli: CacheStatsCli | null, ttlPolicy: TtlPolicy): number | null;
+export declare function computePrefixCacheStats(db: FlightRecorderQuery, stablePrefixHash: string): PrefixCacheStats;
+export interface GlobalCacheStatsOpts {
+    /** If set, restrict to rows whose datetime_utc is within the last N hours. */
+    lastNHours?: number;
+}
+export declare function computeGlobalCacheStats(db: FlightRecorderQuery, opts?: GlobalCacheStatsOpts): GlobalCacheStats;
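The TTL arithmetic documented on `computeTtlRemaining` can be exercised with a fixed clock, which is what the optional `now` field on `TtlPolicy` appears to be for. The sketch below is a standalone restatement for illustration, not an import of the package: `ttlRemainingMs` and `EXPIRING_SOON_MS` are hypothetical names, and the `<=` comparison for the 30 s window is an assumption based on the changelog wording.

```typescript
// Standalone restatement of max(0, ttl - (now - lastWriteAt)); illustration only.
type Cli = "claude" | "codex" | "gemini" | "grok" | "mistral";

interface TtlPolicy {
  anthropicTtlSeconds: 300 | 3600;
  now?: () => number; // injectable clock for deterministic tests
}

function ttlRemainingMs(
  lastRequestAt: string | null,
  cli: Cli | null,
  policy: TtlPolicy
): number | null {
  // Only claude has a documented TTL to reason about.
  if (cli !== "claude" || !lastRequestAt) return null;
  const lastWriteMs = Date.parse(lastRequestAt);
  if (isNaN(lastWriteMs)) return null;
  const nowMs = (policy.now ?? Date.now)();
  return Math.max(0, policy.anthropicTtlSeconds * 1000 - (nowMs - lastWriteMs));
}

// Assumed threshold, taken from "within 30 s of expiry" in the changelog.
const EXPIRING_SOON_MS = 30_000;

const lastRequestAt = "2026-05-26T12:00:00.000Z";
// Pretend the current time is 4 min 40 s after the last request.
const fixedNow = () => Date.parse(lastRequestAt) + 280_000;

const remaining = ttlRemainingMs(lastRequestAt, "claude", {
  anthropicTtlSeconds: 300,
  now: fixedNow,
});
// 300000 - 280000 = 20000 ms left, inside the assumed 30 s window.
const expiringSoon = remaining !== null && remaining <= EXPIRING_SOON_MS;
console.log(remaining, expiringSoon);
```

With `anthropicTtlSeconds: 3600` the same inputs leave well over 30 s, so no warning would be expected under this reading.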

package/dist/cache-stats.js ADDED

@@ -0,0 +1,225 @@
+/**
+ * Cache observability aggregates.
+ *
+ * Pure read-only aggregation over the FlightRecorder's `requests` table.
+ * No new storage — every value is computed at query time from existing
+ * columns (`cache_read_tokens`, `cache_creation_tokens`, `stable_prefix_*`,
+ * `datetime_utc`, etc.).
+ *
+ * COALESCE / NULL handling: rows from before the v3 migration have NULL
+ * for stable_prefix_*. Rows from CLIs whose parser does not surface cache
+ * tokens (gemini, grok, mistral, and codex until its parser is fixed)
+ * have NULL for cache_read_tokens / cache_creation_tokens. All aggregates
+ * tolerate NULL via COALESCE(col, 0) — never divides by zero.
+ */
+import { estimateCacheSavingsUsd } from "./pricing.js";
+function safeNum(n) {
+    return typeof n === "number" && Number.isFinite(n) ? n : 0;
+}
+function isCacheStatsCli(s) {
+    return s === "claude" || s === "codex" || s === "gemini" || s === "grok" || s === "mistral";
+}
+export function computeSessionCacheStats(db, sessionId) {
+    const rows = db.queryRequests(`SELECT cli, model,
+            COALESCE(cache_read_tokens, 0) AS cache_read_tokens,
+            COALESCE(cache_creation_tokens, 0) AS cache_creation_tokens,
+            stable_prefix_hash,
+            datetime_utc
+     FROM requests
+     WHERE session_id = ?
+     ORDER BY datetime_utc DESC`, sessionId);
+    let totalRead = 0;
+    let totalCreation = 0;
+    let hitCount = 0;
+    const prefixSet = new Set();
+    let lastAt = null;
+    let cli = null;
+    let estimatedSavingsUsd = 0;
+    for (const row of rows) {
+        const reads = safeNum(row.cache_read_tokens);
+        const creation = safeNum(row.cache_creation_tokens);
+        totalRead += reads;
+        totalCreation += creation;
+        if (reads > 0)
+            hitCount += 1;
+        if (row.stable_prefix_hash)
+            prefixSet.add(row.stable_prefix_hash);
+        if (!lastAt || row.datetime_utc > lastAt)
+            lastAt = row.datetime_utc;
+        if (cli === null && isCacheStatsCli(row.cli))
+            cli = row.cli;
+        if (isCacheStatsCli(row.cli)) {
+            estimatedSavingsUsd += estimateCacheSavingsUsd(row.cli, row.model, reads);
+        }
+    }
+    const requestCount = rows.length;
+    return {
+        sessionId,
+        cli,
+        totalCacheReadTokens: totalRead,
+        totalCacheCreationTokens: totalCreation,
+        requestCount,
+        hitCount,
+        hitRate: requestCount > 0 ? hitCount / requestCount : 0,
+        distinctPrefixCount: prefixSet.size,
+        lastRequestAt: lastAt,
+        estimatedSavingsUsd,
+        // ttlRemainingMs is populated by computeTtlRemaining() — the field
+        // exists on the type so the resource shape is uniform, but its value
+        // is left null here. Callers (session_get / cache_state resources)
+        // apply the configured TTL policy and set the field.
+        ttlRemainingMs: null,
+    };
+}
+/**
+ * Slice 3: compute the best-effort milliseconds remaining on the cache
+ * breakpoint established at `stats.lastRequestAt`.
+ *
+ * - Claude: Anthropic's documented TTL (5min default, 1h beta). Computed
+ *   as max(0, ttl - (now - lastWriteAt)).
+ * - Other CLIs: returns null. We do not observe the provider's actual
+ *   cache state, so any number we'd return would be a guess. session_get
+ *   and cache_state resources should report null for these.
+ *
+ * Note: this is "best effort". A cache eviction inside Anthropic's
+ * window will NOT be visible to us — the warning may be optimistic
+ * (see risks section in dag.toml).
+ */
+export function computeTtlRemaining(stats, cli, ttlPolicy) {
+    if (cli !== "claude")
+        return null;
+    if (!stats.lastRequestAt)
+        return null;
+    const nowMs = (ttlPolicy.now ?? Date.now)();
+    const lastWriteMs = Date.parse(stats.lastRequestAt);
+    if (!Number.isFinite(lastWriteMs))
+        return null;
+    const elapsedMs = nowMs - lastWriteMs;
+    const ttlMs = ttlPolicy.anthropicTtlSeconds * 1000;
+    return Math.max(0, ttlMs - elapsedMs);
+}
+export function computePrefixCacheStats(db, stablePrefixHash) {
+    const rows = db.queryRequests(`SELECT cli, model,
+            COALESCE(cache_read_tokens, 0) AS cache_read_tokens,
+            COALESCE(cache_creation_tokens, 0) AS cache_creation_tokens,
+            stable_prefix_hash,
+            datetime_utc
+     FROM requests
+     WHERE stable_prefix_hash = ?
+     ORDER BY datetime_utc ASC`, stablePrefixHash);
+    let totalRead = 0;
+    let totalCreation = 0;
+    let hitCount = 0;
+    let firstAt = null;
+    let lastAt = null;
+    let estimatedSavingsUsd = 0;
+    const cliMap = new Map();
+    for (const row of rows) {
+        const reads = safeNum(row.cache_read_tokens);
+        totalRead += reads;
+        totalCreation += safeNum(row.cache_creation_tokens);
+        if (reads > 0)
+            hitCount += 1;
+        if (!firstAt)
+            firstAt = row.datetime_utc;
+        lastAt = row.datetime_utc;
+        if (isCacheStatsCli(row.cli)) {
+            estimatedSavingsUsd += estimateCacheSavingsUsd(row.cli, row.model, reads);
+            const key = `${row.cli}::${row.model}`;
+            const entry = cliMap.get(key);
+            if (entry) {
+                entry.count += 1;
+            }
+            else {
+                cliMap.set(key, { cli: row.cli, model: row.model, count: 1 });
+            }
+        }
+    }
+    const requestCount = rows.length;
+    return {
+        stablePrefixHash,
+        requestCount,
+        hitCount,
+        hitRate: requestCount > 0 ? hitCount / requestCount : 0,
+        totalCacheReadTokens: totalRead,
+        totalCacheCreationTokens: totalCreation,
+        cliBreakdown: Array.from(cliMap.values()).sort((a, b) => b.count - a.count),
+        firstSeenAt: firstAt,
+        lastSeenAt: lastAt,
+        estimatedSavingsUsd,
+    };
+}
+export function computeGlobalCacheStats(db, opts = {}) {
+    const windowHours = opts.lastNHours ?? null;
+    const sinceIso = windowHours !== null && windowHours > 0
+        ? new Date(Date.now() - windowHours * 3600_000).toISOString()
+        : null;
+    const sql = sinceIso
+        ? `SELECT cli, model,
+              COALESCE(cache_read_tokens, 0) AS cache_read_tokens,
+              COALESCE(cache_creation_tokens, 0) AS cache_creation_tokens,
+              stable_prefix_hash,
+              datetime_utc
+       FROM requests
+       WHERE datetime_utc >= ?`
+        : `SELECT cli, model,
+              COALESCE(cache_read_tokens, 0) AS cache_read_tokens,
+              COALESCE(cache_creation_tokens, 0) AS cache_creation_tokens,
+              stable_prefix_hash,
+              datetime_utc
+       FROM requests`;
+    const rows = sinceIso ? db.queryRequests(sql, sinceIso) : db.queryRequests(sql);
+    const perCliMap = new Map();
+    let totalRequests = 0;
+    let totalHits = 0;
+    let totalRead = 0;
+    let totalCreation = 0;
+    let totalSavings = 0;
+    for (const row of rows) {
+        totalRequests += 1;
+        const reads = safeNum(row.cache_read_tokens);
+        const creation = safeNum(row.cache_creation_tokens);
+        totalRead += reads;
+        totalCreation += creation;
+        if (reads > 0)
+            totalHits += 1;
+        if (!isCacheStatsCli(row.cli))
+            continue;
+        const cli = row.cli;
+        const savings = estimateCacheSavingsUsd(cli, row.model, reads);
+        totalSavings += savings;
+        const agg = perCliMap.get(cli) ?? {
+            requestCount: 0,
+            hitCount: 0,
+            totalCacheReadTokens: 0,
+            totalCacheCreationTokens: 0,
+            estimatedSavingsUsd: 0,
+        };
+        agg.requestCount += 1;
+        if (reads > 0)
+            agg.hitCount += 1;
+        agg.totalCacheReadTokens += reads;
+        agg.totalCacheCreationTokens += creation;
+        agg.estimatedSavingsUsd += savings;
+        perCliMap.set(cli, agg);
+    }
+    const perCli = Array.from(perCliMap.entries()).map(([cli, agg]) => ({
+        cli,
+        requestCount: agg.requestCount,
+        hitCount: agg.hitCount,
+        hitRate: agg.requestCount > 0 ? agg.hitCount / agg.requestCount : 0,
+        totalCacheReadTokens: agg.totalCacheReadTokens,
+        totalCacheCreationTokens: agg.totalCacheCreationTokens,
+        estimatedSavingsUsd: agg.estimatedSavingsUsd,
+    }));
+    return {
+        windowHours,
+        totalRequests,
+        totalHits,
+        hitRate: totalRequests > 0 ? totalHits / totalRequests : 0,
+        totalCacheReadTokens: totalRead,
+        totalCacheCreationTokens: totalCreation,
+        perCli,
+        estimatedSavingsUsd: totalSavings,
+    };
+}
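The hit-rate arithmetic shared by the three aggregate functions above can be checked on in-memory rows without a database. The following is a simplified sketch: `RequestRow`, `hitRateOf`, and the sample data are hypothetical, and only the NULL-to-zero coercion and the `hitCount / requestCount` rule mirror the code in this file.

```typescript
// Illustrative row shape; only the columns needed for the hit rate.
interface RequestRow {
  cli: string;
  cache_read_tokens: number | null;
}

// Coerce NULL / non-finite values to 0, as the aggregates above do.
function safeNum(n: unknown): number {
  return typeof n === "number" && isFinite(n) ? n : 0;
}

function hitRateOf(rows: RequestRow[]): {
  requestCount: number;
  hitCount: number;
  hitRate: number;
} {
  let hitCount = 0;
  for (const row of rows) {
    // A row counts as a hit only when it reports cache reads > 0.
    if (safeNum(row.cache_read_tokens) > 0) hitCount += 1;
  }
  const requestCount = rows.length;
  // Guard the division so an empty result set yields 0 rather than NaN.
  return {
    requestCount,
    hitCount,
    hitRate: requestCount > 0 ? hitCount / requestCount : 0,
  };
}

const sampleRows: RequestRow[] = [
  { cli: "claude", cache_read_tokens: 2048 },
  { cli: "claude", cache_read_tokens: 0 },
  { cli: "codex", cache_read_tokens: null }, // cache tokens not surfaced
  { cli: "claude", cache_read_tokens: 512 },
];

const summary = hitRateOf(sampleRows);
```

One consequence worth noting: rows whose CLI never reports cache tokens still count toward `requestCount`, so they read as misses and pull the overall hit rate down. The per-CLI breakdown is the place to look when that matters.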

package/dist/config.d.ts CHANGED

@@ -63,3 +63,44 @@ export interface PersistenceConfigSources {
  * Throws on incoherent configs (memory/none + asyncJobsEnabled without ack).
  */
 export declare function loadPersistenceConfig(logger?: Logger): PersistenceConfig;
+export declare const ANTHROPIC_TTL_SECONDS_VALUES: readonly [300, 3600];
+export type AnthropicTtlSeconds = (typeof ANTHROPIC_TTL_SECONDS_VALUES)[number];
+/**
+ * Per-Anthropic-model-family minimum cacheable tokens. Sourced from
+ * docs/personal-mcp/PROVIDER_CACHE_SURFACES.md (Anthropic API docs as of
+ * 2026-05-26). Models below the threshold cannot be cached even with
+ * cache_control set — Anthropic silently returns un-cached.
+ */
+export declare const DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL: {
+    readonly sonnet: 1024;
+    readonly opus: 4096;
+    readonly haiku: 4096;
+    readonly default: 4096;
+};
+export type ModelFamilyAlias = keyof typeof DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL;
+export interface CacheAwarenessConfig {
+    emitAnthropicCacheControl: boolean;
+    anthropicTtlSeconds: AnthropicTtlSeconds;
+    warnOnTtlExpiry: boolean;
+    minStableTokensForCacheControl: {
+        sonnet: number;
+        opus: number;
+        haiku: number;
+        default: number;
+    };
+    /** Audit trail: file the config was loaded from (or null if defaults). */
+    sources: {
+        configFile: string | null;
+    };
+}
+/**
+ * Load [cache_awareness] from ~/.llm-cli-gateway/config.toml. Defaults: all
+ * behaviour off, per-model min-token thresholds from PROVIDER_CACHE_SURFACES.md.
+ */
+export declare function loadCacheAwarenessConfig(logger?: Logger): CacheAwarenessConfig;
+/**
+ * Look up the per-model-family threshold. `modelName` is the user-facing model
+ * string (e.g. "claude-sonnet-4-6", "claude-opus-4-7"). Falls back to `default`
+ * when the family is unrecognised.
+ */
+export declare function minStableTokensForModel(config: CacheAwarenessConfig, modelName: string): number;

package/dist/config.js CHANGED

@@ -227,3 +227,112 @@ export function loadPersistenceConfig(logger = noopLogger) {
         sources,
     };
 }
+//──────────────────────────────────────────────────────────────────────────────
+// Cache-awareness configuration
+//
+// Reads the [cache_awareness] block from the same ~/.llm-cli-gateway/config.toml
+// file as [persistence], but uses a SEPARATE loader and schema. Keeping the two
+// independent means a malformed [cache_awareness] never breaks persistence
+// loading and vice versa. No env-var overrides — purely TOML.
+//
+// All defaults are "off"; behavioural changes (slice 1 cache_control, slice 3
+// TTL warnings) ship dormant until operators opt in.
+//──────────────────────────────────────────────────────────────────────────────
+export const ANTHROPIC_TTL_SECONDS_VALUES = [300, 3600];
+/**
+ * Per-Anthropic-model-family minimum cacheable tokens. Sourced from
+ * docs/personal-mcp/PROVIDER_CACHE_SURFACES.md (Anthropic API docs as of
+ * 2026-05-26). Models below the threshold cannot be cached even with
+ * cache_control set — Anthropic silently returns un-cached.
+ */
+export const DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL = {
+    sonnet: 1024,
+    opus: 4096,
+    haiku: 4096,
+    default: 4096,
+};
+const MinStableTokensSchema = z
+    .object({
+    sonnet: z.number().int().positive().default(DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL.sonnet),
+    opus: z.number().int().positive().default(DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL.opus),
+    haiku: z.number().int().positive().default(DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL.haiku),
+    default: z
+        .number()
+        .int()
+        .positive()
+        .default(DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL.default),
+})
+    .strict()
+    .default({
+    sonnet: DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL.sonnet,
+    opus: DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL.opus,
+    haiku: DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL.haiku,
+    default: DEFAULT_MIN_STABLE_TOKENS_FOR_CACHE_CONTROL.default,
+});
+const CacheAwarenessSchema = z
+    .object({
+    emit_anthropic_cache_control: z.boolean().default(false),
+    anthropic_ttl_seconds: z.union([z.literal(300), z.literal(3600)]).default(300),
+    warn_on_ttl_expiry: z.boolean().default(false),
+    min_stable_tokens_for_cache_control: MinStableTokensSchema,
+})
+    .strict();
+function readCacheAwarenessFile(configPath, logger) {
+    if (!existsSync(configPath)) {
+        return { raw: undefined, sourcePath: null };
+    }
+    try {
+        const require = createRequire(import.meta.url);
+        const TOML = require("smol-toml");
+        const text = readFileSync(configPath, "utf-8");
+        const parsed = TOML.parse(text);
+        return { raw: parsed?.cache_awareness, sourcePath: configPath };
+    }
+    catch (err) {
+        logger.error(`Failed to parse gateway config at ${configPath}; using cache_awareness defaults`, err);
+        return { raw: undefined, sourcePath: null };
+    }
+}
+/**
+ * Load [cache_awareness] from ~/.llm-cli-gateway/config.toml. Defaults: all
+ * behaviour off, per-model min-token thresholds from PROVIDER_CACHE_SURFACES.md.
+ */
+export function loadCacheAwarenessConfig(logger = noopLogger) {
+    const configPath = defaultPersistenceConfigPath();
+    const { raw, sourcePath } = readCacheAwarenessFile(configPath, logger);
+    let parsed;
+    try {
+        parsed = CacheAwarenessSchema.parse(raw ?? {});
+    }
+    catch (err) {
+        throw new Error(`Invalid [cache_awareness] config: ${err instanceof Error ? err.message : String(err)}`);
+    }
+    return {
+        emitAnthropicCacheControl: parsed.emit_anthropic_cache_control,
+        anthropicTtlSeconds: parsed.anthropic_ttl_seconds,
+        warnOnTtlExpiry: parsed.warn_on_ttl_expiry,
+        minStableTokensForCacheControl: {
+            sonnet: parsed.min_stable_tokens_for_cache_control.sonnet,
+            opus: parsed.min_stable_tokens_for_cache_control.opus,
+            haiku: parsed.min_stable_tokens_for_cache_control.haiku,
+            default: parsed.min_stable_tokens_for_cache_control.default,
+        },
+        sources: { configFile: sourcePath },
+    };
+}
+/**
+ * Look up the per-model-family threshold. `modelName` is the user-facing model
+ * string (e.g. "claude-sonnet-4-6", "claude-opus-4-7"). Falls back to `default`
+ * when the family is unrecognised.
+ */
+export function minStableTokensForModel(config, modelName) {
+    const lower = modelName.toLowerCase();
+    const table = config.minStableTokensForCacheControl;
+    if (lower.includes("sonnet"))
+        return table.sonnet;
+    if (lower.includes("opus"))
+        return table.opus;
+    if (lower.includes("haiku"))
+        return table.haiku;
+    return table.default;
+}
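The family lookup in `minStableTokensForModel` is a substring match on the lowercased model name. A standalone sketch of how a caller might use it to decide whether a stable prefix is large enough follows. `minTokensFor` restates the lookup for illustration, while `prefixIsCacheable` and the `>=` comparison are assumptions that do not appear in the package.

```typescript
// Default thresholds as listed in the diff above.
const DEFAULT_TABLE = { sonnet: 1024, opus: 4096, haiku: 4096, default: 4096 } as const;

type Thresholds = { sonnet: number; opus: number; haiku: number; default: number };

// Restatement of the substring-based family lookup, for illustration.
function minTokensFor(table: Thresholds, modelName: string): number {
  const lower = modelName.toLowerCase();
  if (lower.indexOf("sonnet") !== -1) return table.sonnet;
  if (lower.indexOf("opus") !== -1) return table.opus;
  if (lower.indexOf("haiku") !== -1) return table.haiku;
  return table.default; // unrecognised family
}

// Hypothetical helper: does a stable prefix of this size clear the threshold?
function prefixIsCacheable(
  table: Thresholds,
  modelName: string,
  stablePrefixTokens: number
): boolean {
  return stablePrefixTokens >= minTokensFor(table, modelName);
}

console.log(minTokensFor(DEFAULT_TABLE, "claude-sonnet-4-6")); // 1024
console.log(prefixIsCacheable(DEFAULT_TABLE, "claude-opus-4-7", 1500)); // false
```

Since matching is by substring, any model string containing `sonnet`, `opus`, or `haiku` picks up that family's threshold regardless of version suffix; everything else falls through to `default`.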