claude-code-cache-fix 3.7.0 → 3.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -29,7 +29,7 @@ That's it. The proxy applies all 7 cache-fix extensions automatically. No wrappe

  ### What the proxy does

- On every `/v1/messages` request, 7 extensions run in order:
+ On every `/v1/messages` request, 9 extensions run in order (one opt-in):

  | Extension | What it fixes |
  |-----------|--------------|
@@ -40,6 +40,8 @@ On every `/v1/messages` request, 7 extensions run in order:
  | `fresh-session-sort` | Fixes non-deterministic ordering on first turn |
  | `cache-control-normalize` | Normalizes cache_control markers across messages |
  | `cache-telemetry` | Extracts cache stats from response headers → `~/.claude/quota-status/{account.json,sessions/<id>.json}` |
+ | `session-health` | Observes per-session thinking-desync risk (context size + thinking-block count) and warns before a session reaches the danger zone. Read-only |
+ | `thinking-block-sanitize` | Drops omitted (empty-text) thinking blocks to head off the CC thinking-desync `400` (#63147). **Opt-in** (`CACHE_FIX_THINKING_SANITIZE=on`) |
 
  Extensions are hot-reloadable — add, remove, or modify `.mjs` files in `proxy/extensions/` and changes apply to the next request without restarting. Configuration in `proxy/extensions.json`.
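The extensions added in this release share one module shape: an object with `name`, `description`, an `order` that positions it in the pipeline (the source comments later in this diff place `thinking-block-sanitize` at 550, `session-health` at 590, and `cache-telemetry` at 600, with lower numbers running earlier), and async hooks such as `onRequest(ctx)`. Below is a minimal sketch of a read-only extension in that shape. The extension name, its `order` value, and the `_exampleMessageCount` field are hypothetical, and the hook contract is inferred from the bundled extensions rather than taken from a documented API.

```javascript
// Hypothetical read-only extension in the shape the bundled extensions use.
// In a real proxy/extensions/*.mjs file this object would be the module's
// `export default`; a plain const keeps the sketch runnable on its own.
const messageCountExample = {
  name: "message-count-example", // hypothetical name
  description: "Records how many messages each forwarded /v1/messages body carries. Read-only.",
  order: 595, // hypothetical slot: after session-health (590), before cache-telemetry (600)

  async onRequest(ctx) {
    // Observe only: stash a scalar on ctx.meta and never mutate ctx.body.
    const messages = Array.isArray(ctx.body?.messages) ? ctx.body.messages : [];
    ctx.meta._exampleMessageCount = messages.length;
  },
};

// Usage: the hook body contains no await, so ctx.meta is updated synchronously.
const ctx = { body: { messages: [{ role: "user", content: "hi" }] }, meta: {} };
messageCountExample.onRequest(ctx);
console.log(ctx.meta._exampleMessageCount); // 1
```

Writing only to `ctx.meta` follows the read-only contract `session-health` states for itself: the forwarded body stays byte-identical, so the cached prompt prefix is unaffected.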
 
@@ -212,11 +214,21 @@ Options (all optional; all fall back to the same env vars used by the CLI):

  **Cache-economics regressions.** The original purpose of cache-fix is to absorb the cache-handling behaviors in Claude Code that cost users real money and quota — TTL downgrades, cache-breaking header churn, identity-latching issues, and the rest of the regression catalog documented across our issue history. The proxy sits between CC and the Anthropic API, normalizes the request and response stream, and emits enough observability (via statusline integration and the quota-status files) that users can see what their session is actually doing. This is the load-bearing feature for almost every user today.

- **Bootstrap-channel observability.** Claude Code v2.1.150 introduced a prompt-section consumer that fetches a server-supplied string from `/api/claude_cli/bootstrap` and merges it into the agent's behavioral-instructions prompt path. We filed this behavior with Anthropic's security team in May 2026; Anthropic closed the report as *Informative*, treating TLS as the transport-integrity boundary and declining to add application-layer authenticity checks. Cache-fix v3.7.0 adds explicit handling for this path. Default mode is `audit` — bootstrap responses proxy through to CC and are logged to `~/.claude/cache-fix-bootstrap-log.jsonl` so users can inspect them locally. To opt into block mode instead, set `CACHE_FIX_BOOTSTRAP_MODE=block` in the proxy environment; block mode short-circuits the upstream call and returns a 200 with an empty JSON body, dropping bootstrap content before it reaches CC. (Note: cache-fix v3.6.2 and earlier returned 404 for this path because the proxy router did not include it — the practical effect was that bootstrap content was not previously reaching CC for cache-fix users. v3.7.0's default `audit` changes that behavior; explicit `CACHE_FIX_BOOTSTRAP_MODE=block` preserves it.) The full disclosure record, including Anthropic's verbatim close text, is in [`docs/disclosure/heron-brook-2026-05.md`](docs/disclosure/heron-brook-2026-05.md).
+ **Bootstrap-channel observability.** Claude Code v2.1.150 introduced a prompt-section consumer that fetches a server-supplied string from `/api/claude_cli/bootstrap` and merges it into the agent's behavioral-instructions prompt path. We filed this behavior with Anthropic's security team in May 2026; Anthropic closed the report as *Informative*, treating TLS as the transport-integrity boundary and declining to add application-layer authenticity checks. Cache-fix v3.7.0 added explicit handling for this path. v3.7.1 extends it to also cover the env-var-selected GrowthBook prompt-injection surface that landed in CC v2.1.152 (remote-control mode: `CLAUDE_CODE_SYSTEM_PROMPT_GB_FEATURE` names a flag key whose cached value is used as the system prompt body).
+
+ Cache-fix's `bootstrap-defense` extension ships three modes, selected via `CACHE_FIX_BOOTSTRAP_MODE`:
+
+ | Mode | Default? | Behavior |
+ |---|---|---|
+ | `audit` | yes | Bootstrap responses proxy through to CC. Each response is logged to `~/.claude/cache-fix-bootstrap-log.jsonl` with surface metadata: which prompt-source surfaces fired (`tengu_heron_brook` legacy and/or env-var-selected), the SHA-256 hash of the value (first 16 hex chars — never the value itself), and the `CLAUDE_CODE_REMOTE` flag. Multi-surface responses emit one record per surface, correlated by `request_id` + timestamp window. |
+ | `block` | opt-in | `onRequest` returns a 200 with an empty JSON body. Upstream is never called, so no flag map ever reaches the on-disk GrowthBook cache. Defeats both legacy and env-var-selected injection surfaces. |
+ | `allowlist` | opt-in (experimental) | Bootstrap response proxies through, but prompt-source-eligible keys (legacy `tengu_heron_brook` + env-var-selected key) not in the allowlist are stripped from the response body before it reaches CC. Default allowlist is `tengu_heron_brook` (the only known-legitimate historical key); configure via `CACHE_FIX_BOOTSTRAP_ALLOWED_KEYS=comma,separated,list`. Pass `CACHE_FIX_BOOTSTRAP_ALLOWED_KEYS=` (explicit empty) for full deny-all. Other GrowthBook flag keys pass through untouched. The allowlist may need updates if Anthropic adds legitimate prompt-source keys in future CC releases. |
+
+ Note: cache-fix v3.6.2 and earlier returned 404 for the bootstrap path because the proxy router did not include it — the practical effect was that bootstrap content was not reaching CC for cache-fix users. v3.7.0's default `audit` changes that behavior; explicit `CACHE_FIX_BOOTSTRAP_MODE=block` preserves it. The full disclosure record, including Anthropic's verbatim close text, is in [`docs/disclosure/heron-brook-2026-05.md`](docs/disclosure/heron-brook-2026-05.md).

  **Reference material:**
  - [`docs/disclosure/heron-brook-2026-05.md`](docs/disclosure/heron-brook-2026-05.md) — full disclosure record
- - [`CHANGELOG.md`](CHANGELOG.md#370---2026-05-26) — v3.7.0 release entry (includes the behavior-change note for prior users)
+ - [`CHANGELOG.md`](CHANGELOG.md#371---2026-05-27) — v3.7.1 release entry (extended surface coverage + allowlist mode); [v3.7.0 entry](CHANGELOG.md#370---2026-05-26) covers the prior behavior-change note
  - [`cnighswonger/heron-brook-poc`](https://github.com/cnighswonger/heron-brook-poc) — reproducer for the bootstrap-channel behavior

  ## Recommended CC operational config
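The `allowlist` mode in the bootstrap-defense table combines two rules: three-state parsing of `CACHE_FIX_BOOTSTRAP_ALLOWED_KEYS` (unset, explicitly empty, or a comma-separated list), and stripping only prompt-source-eligible keys, meaning the legacy `tengu_heron_brook` key plus whatever `CLAUDE_CODE_SYSTEM_PROMPT_GB_FEATURE` names. The sketch below mirrors the parsing logic of `allowedKeysFromEnv` in the `bootstrap-defense` source later in this diff, but takes the environment as a parameter so it can be exercised directly. `stripPromptSourceKeys` is a hypothetical helper that condenses the strip step, and the `custom_prompt_flag` and `other_flag` keys are made-up examples.

```javascript
const LEGACY_PROMPT_KEY = "tengu_heron_brook";

// Three states: unset -> default [legacy key]; "" -> deny-all;
// "a,b,c" -> trimmed entries with empty ones dropped.
function allowedKeysFromEnv(env) {
  const raw = env.CACHE_FIX_BOOTSTRAP_ALLOWED_KEYS;
  if (raw === undefined) return new Set([LEGACY_PROMPT_KEY]);
  if (raw === "") return new Set();
  return new Set(raw.split(",").map((s) => s.trim()).filter((s) => s.length > 0));
}

// Hypothetical helper: delete prompt-source-eligible keys that are not
// allowlisted and return the deleted keys. Other flags are never candidates.
function stripPromptSourceKeys(body, env) {
  const allowed = allowedKeysFromEnv(env);
  const eligible = new Set([LEGACY_PROMPT_KEY]);
  if (env.CLAUDE_CODE_SYSTEM_PROMPT_GB_FEATURE) eligible.add(env.CLAUDE_CODE_SYSTEM_PROMPT_GB_FEATURE);
  const stripped = [];
  for (const key of eligible) {
    if (!allowed.has(key) && Object.prototype.hasOwnProperty.call(body, key)) {
      delete body[key];
      stripped.push(key);
    }
  }
  return stripped;
}

const body = { tengu_heron_brook: "legacy", custom_prompt_flag: "injected", other_flag: true };
const env = { CLAUDE_CODE_SYSTEM_PROMPT_GB_FEATURE: "custom_prompt_flag" };
console.log(stripPromptSourceKeys(body, env).join(",")); // custom_prompt_flag
console.log(Object.keys(body).join(",")); // tengu_heron_brook,other_flag
```

Under the default allowlist the legacy key survives and only the env-var-selected key is removed; an explicit empty `CACHE_FIX_BOOTSTRAP_ALLOWED_KEYS` would strip `tengu_heron_brook` as well.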
@@ -713,6 +725,40 @@ Scoping rules baked into the extension:
  |---------|---------|---------|
  | `CACHE_FIX_THINKING_DISPLAY` | `summarized` (built-in) | One of `summarized` / `omitted` / `disabled`. `summarized` restores thinking summaries (default). `omitted` force-suppresses thinking blocks. `disabled` opts the extension out entirely. |

+ ## Session-health early-warning (proxy mode, thinking-desync risk)
+
+ Long-running Opus 4.7 `[1m]` sessions accumulate interleaved thinking blocks and grow their live context until Claude Code's own history reconstruction desyncs a thinking-block signature, producing a permanent `400 … thinking blocks … cannot be modified` on every subsequent turn (upstream root cause: [anthropics/claude-code#63147](https://github.com/anthropics/claude-code/issues/63147)). The session dies abruptly with no prior signal.
+
+ The `session-health` extension watches the conditions that correlate with the trip and warns **before** a session reaches the danger zone, so the operator can retire it deliberately (write a session-state handoff, `/clear`) instead of being surprised by a dead session. It is **read-only** — it never mutates the request/response body and never attempts to repair the desync (that is CC-side, #63147). It records numeric telemetry into the per-session file (`~/.claude/quota-status/sessions/<id>.json`) on each request and, when a session first crosses into `high` risk, emits a one-time stderr line. Counts only — no thinking text or signatures are ever logged.
+
+ Fields added to the per-session JSON:
+
+ - `context_tokens` — latest request's live context (`input + cache_read + cache_creation`)
+ - `thinking_block_count` — `thinking`/`redacted_thinking` blocks in the latest request
+ - `thinking_block_max` — session high-water mark (carried across proxy restarts)
+ - `first_seen`, `request_count` — session age + request tally
+ - `thinking_desync_risk` — `ok` / `warn` / `high` (omitted when the signal is disabled)
+
+ Token thresholds are anchored to the observed ~382K-token trip, with margin. The warning is conservative by design, since a premature "retire soon" is far cheaper than a dead session. The thinking-block count is recorded but does not yet gate the warning; it is slated to do so in a calibrated fast-follow once the failure distribution is known.
+
+ | Env var | Default | Purpose |
+ |---------|---------|---------|
+ | `CACHE_FIX_THINKING_RISK_WARN_TOKENS` | `250000` | Context-token level at which `thinking_desync_risk` becomes `warn`. |
+ | `CACHE_FIX_THINKING_RISK_HIGH_TOKENS` | `340000` | Context-token level at which risk becomes `high` and the one-time stderr warn fires. |
+ | `CACHE_FIX_THINKING_RISK` | unset (on) | Set to `off` to suppress the warning signal (stderr line + `thinking_desync_risk` field). Raw count telemetry keeps recording. |
+
+ ## Thinking-block sanitize (proxy mode, opt-in, thinking-desync mitigation)
+
+ This is the *mitigate* half of the thinking-desync response; the *warn-before* half is `session-health`, above. On history-replay paths (resume / `--continue` / auto-compaction / parallel-tool-cancel), Claude Code re-sends prior assistant turns' extended thinking in the **omitted** shape `{ "type":"thinking", "thinking":"", "signature":"<intact>" }`. The API rejects modified thinking in the **latest** assistant message with a permanent `400 … thinking … blocks cannot be modified`, which wedges the session on every subsequent turn (upstream root cause: [anthropics/claude-code#63147](https://github.com/anthropics/claude-code/issues/63147)).
+
+ The `thinking-block-sanitize` extension drops those omitted blocks — which the API treats as optional history — from the request before it is forwarded. The turn-selection rule, resolved empirically, is: drop omitted thinking from **all prior assistant turns and from the latest assistant turn, unless the latest turn is an active tool-continuation** (its last block is a `tool_use` answered by a following `tool_result`). In that one case the API requires the signed thinking intact and the proxy cannot restore the emptied text, so it leaves the turn untouched. **No env var both preserves thinking and avoids the wedge in that case:** `CLAUDE_CODE_DISABLE_THINKING=1` / `MAX_THINKING_TOKENS=0` stop the wedge only by disabling thinking entirely (lossy, with no reasoning), and `DISABLE_INTERLEAVED_THINKING=1` does *not* stop the `400`, so the answer there is to avoid resuming and to heal or retire the session. That is exactly why the proxy mitigation matters: **it is the only path that preserves reasoning while avoiding the wedge** on the history-replay paths it covers. Non-empty thinking is never touched; `redacted_thinking` is out of scope for v1.
+
+ **Opt-in.** v1 ships behind `CACHE_FIX_THINKING_SANITIZE=on` (default off) because it mutates request bodies and full live-coverage validation is still pending. The transform is deterministic and cache-prefix-stable, and it emits a per-request `thinking_blocks_dropped` count into the per-session JSON (counts only, never content) that complements the session-health signal.
+
+ | Env var | Default | Purpose |
+ |---------|---------|---------|
+ | `CACHE_FIX_THINKING_SANITIZE` | unset (off) | Set to `on` to enable the request-path drop of omitted thinking blocks. Off = no-op (no mutation, no telemetry). |
+
  ## System prompt rewrite (preload mode, optional)

  The interceptor can rewrite Claude Code's `# Output efficiency` system-prompt section. Disabled by default. Enable with `CACHE_FIX_OUTPUT_EFFICIENCY_REPLACEMENT`. See [docs/output-efficiency-prompts.md](docs/output-efficiency-prompts.md) for the three known prompt variants and usage instructions.
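The `thinking-block-sanitize` turn-selection rule can be sketched as follows. This is a hypothetical reimplementation, not the extension's own code, and it rests on stated assumptions: messages follow the Anthropic Messages API shape, an "omitted" block is `type: "thinking"` with `thinking: ""`, and a `tool_use` counts as "answered" when a later message carries a `tool_result` whose `tool_use_id` matches its `id`. The function names are made up for illustration.

```javascript
// An omitted block keeps its signature but carries empty thinking text.
function isOmittedThinking(block) {
  return block != null && block.type === "thinking" && block.thinking === "";
}

// Assumption: the latest assistant turn is an active tool-continuation when its
// last block is a tool_use and a later message holds a matching tool_result.
function isActiveToolContinuation(messages, idx) {
  const content = messages[idx].content;
  if (!Array.isArray(content) || content.length === 0) return false;
  const last = content[content.length - 1];
  if (!last || last.type !== "tool_use") return false;
  return messages.slice(idx + 1).some(
    (m) => Array.isArray(m.content) &&
      m.content.some((b) => b && b.type === "tool_result" && b.tool_use_id === last.id),
  );
}

// Drops omitted thinking from every assistant turn except a latest turn that is
// an active tool-continuation. Returns how many blocks were dropped.
function sanitizeOmittedThinking(body) {
  const messages = Array.isArray(body?.messages) ? body.messages : [];
  let latest = -1;
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === "assistant") {
      latest = i;
      break;
    }
  }
  let dropped = 0;
  messages.forEach((msg, i) => {
    if (msg.role !== "assistant" || !Array.isArray(msg.content)) return;
    if (i === latest && isActiveToolContinuation(messages, i)) return; // keep signed thinking
    const kept = msg.content.filter((b) => !isOmittedThinking(b));
    dropped += msg.content.length - kept.length;
    msg.content = kept; // non-empty thinking and redacted_thinking pass through
  });
  return dropped;
}

const req = {
  messages: [
    { role: "user", content: "q1" },
    { role: "assistant", content: [{ type: "thinking", thinking: "", signature: "s1" }, { type: "text", text: "a1" }] },
    { role: "user", content: "q2" },
    { role: "assistant", content: [{ type: "thinking", thinking: "", signature: "s2" }, { type: "tool_use", id: "t1", name: "ls", input: {} }] },
    { role: "user", content: [{ type: "tool_result", tool_use_id: "t1", content: "ok" }] },
  ],
};
console.log(sanitizeOmittedThinking(req)); // 1
```

The prior assistant turn loses its omitted block, while the latest turn keeps its signed thinking because its `tool_use` has been answered. The sketch does not handle a turn whose only content is omitted thinking, which would be left with an empty `content` array.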
@@ -763,6 +809,7 @@ We monitor 30+ upstream Claude Code issues related to cache, quota, and context
  - **[@deafsquad](https://github.com/deafsquad)** — Universal smoosh_split un-smoosh fix (PR #26), source-level function attribution of resume scatter bug (anthropics/claude-code#43657), OTEL telemetry discovery, proposed and built proxy architecture for v3.0.0
  - **[@vmfarms](https://github.com/vmfarms)** — Concurrent multi-runner production validation, surfaced proxy-mode resume-marker regex no-op (#96), TTL tier detection gap (#97), and image-strip stderr leak (#98)
  - **[@ojura](https://github.com/ojura)** — Opus 4.7 thinking-summaries root-cause analysis: filed [anthropics/claude-code#59844](https://github.com/anthropics/claude-code/issues/59844) with the CLI-binary decode (`!getIsNonInteractiveSession()` gate at offset 230510599 in v2.1.142) and the two-stacked-special-cases framing, which made the `thinking-display` extension (v3.6.1) a clean proxy-side complement to the proposed upstream fix
+ - **[@schuay](https://github.com/schuay)** — `quota-statusline.sh` enhancements: 10-cell quota bar with elapsed-time tick and exhaust-vs-reset projection replacing the prior `%/min` burn-rate display (PR #140, v3.6.2), and d/h vs h/m time-format autoselect plus named time-unit and burn-warmup constants (PR #143, v3.7.0)

  If you contributed to the community effort on these issues and aren't listed here, please open an issue or PR — we want to credit everyone properly.
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "claude-code-cache-fix",
-   "version": "3.7.0",
+   "version": "3.8.0",
    "description": "Cache optimization proxy and interceptor for Claude Code. Fixes prompt cache bugs, stabilizes prefix, reduces quota burn.",
    "type": "module",
    "exports": {
@@ -1,10 +1,13 @@
  import { appendFileSync, statSync, renameSync, mkdirSync } from "node:fs";
  import { join, dirname } from "node:path";
  import { homedir } from "node:os";
+ import { createHash } from "node:crypto";

  const LOG_ROTATE_BYTES = 5 * 1024 * 1024;
- const SCHEMA_VERSION = 1;
- const EXTENSION_VERSION = "v3.6.3";
+ const SCHEMA_VERSION = 2;
+ const EXTENSION_VERSION = "v3.7.1";
+
+ const LEGACY_PROMPT_KEY = "tengu_heron_brook";

  function logPath() {
    return process.env.CACHE_FIX_BOOTSTRAP_LOG_PATH || join(homedir(), ".claude", "cache-fix-bootstrap-log.jsonl");
@@ -40,19 +43,43 @@ function appendRecord(record) {

  function modeFromEnv(fallback) {
    const raw = process.env.CACHE_FIX_BOOTSTRAP_MODE;
-   if (raw === "audit" || raw === "block") return raw;
+   if (raw === "audit" || raw === "block" || raw === "allowlist") return raw;
    return fallback;
  }

+ // Parse the allowlist env var. Distinguishes three states:
+ //   - var unset → default allowlist [LEGACY_PROMPT_KEY]
+ //   - var = "" (explicit) → empty allowlist (deny-all)
+ //   - var = "a,b,c" → [a, b, c] (trimmed, empty entries dropped)
+ function allowedKeysFromEnv() {
+   const raw = process.env.CACHE_FIX_BOOTSTRAP_ALLOWED_KEYS;
+   if (raw === undefined) return new Set([LEGACY_PROMPT_KEY]);
+   if (raw === "") return new Set();
+   return new Set(raw.split(",").map((s) => s.trim()).filter((s) => s.length > 0));
+ }
+
+ // Hash derivation contract (directive § Hash derivation): first 16 chars of
+ // the lowercase hex SHA-256 digest over the UTF-8-encoded flag value. Pinned
+ // here so a future refactor cannot silently change historical-record identity.
+ function hashFlagValue(value) {
+   return createHash("sha256").update(value, "utf8").digest("hex").slice(0, 16);
+ }
+
  // PII discipline: the audit log MUST NOT include client headers (Authorization,
  // x-api-key, cookies, etc.) or request/response bodies. Callers pass only the
  // extracted scalar fields below — the full headers object is never threaded
  // through this function, so a future maintainer can't accidentally widen the
  // log surface by spreading the parameter object.
  //
- // v3.7.0 extension fields are emitted as null/defaulted so log readers don't
- // break when v3.7.0 starts populating baseline_hash, anomaly_status, etc.
- function recordShape({ phase, mode, status, body_bytes, upstream_host, request_id, error }) {
+ // v3.7.0 extension fields (baseline_hash, anomaly_status) are emitted as null
+ // so log readers from that schema don't break. v3.7.1 adds the prompt-source
+ // surface fields (surface, prompt_key, prompt_value_hash, remote_mode,
+ // stripped_keys) — consumers reading v1 records should treat these as
+ // null/empty array.
+ function recordShape({
+   phase, mode, status, body_bytes, upstream_host, request_id, error,
+   surface, prompt_key, prompt_value_hash, remote_mode, stripped_keys,
+ }) {
    return {
      schema_version: SCHEMA_VERSION,
      extension_version: EXTENSION_VERSION,
@@ -66,6 +93,11 @@ function recordShape({ phase, mode, status, body_bytes, upstream_host, request_i
      baseline_hash: null,
      anomaly_status: null,
      error: error ?? null,
+     surface: surface ?? "bootstrap",
+     prompt_key: prompt_key ?? null,
+     prompt_value_hash: prompt_value_hash ?? null,
+     remote_mode: remote_mode ?? false,
+     stripped_keys: stripped_keys ?? [],
    };
  }
 
@@ -94,10 +126,32 @@ function extractAuditFields(meta, headers) {
    };
  }

+ // Surface-activation logic (directive § Multi-surface records):
+ //   - "bootstrap" surface fires when the legacy hardcoded key is in the body.
+ //   - "prompt_injection_gb" surface fires when CLAUDE_CODE_SYSTEM_PROMPT_GB_FEATURE
+ //     is set, independent of whether that named key is in the body — this gives
+ //     the audit log operator-visibility into env-configured intent even when
+ //     the upstream did not deliver the value.
+ function detectSurfaces(body) {
+   const surfaces = [];
+   const bodyIsObject = body !== null && typeof body === "object" && !Array.isArray(body);
+   if (bodyIsObject && Object.prototype.hasOwnProperty.call(body, LEGACY_PROMPT_KEY)) {
+     surfaces.push({ surface: "bootstrap", prompt_key: LEGACY_PROMPT_KEY });
+   }
+   const envKey = process.env.CLAUDE_CODE_SYSTEM_PROMPT_GB_FEATURE;
+   if (envKey) {
+     surfaces.push({ surface: "prompt_injection_gb", prompt_key: envKey });
+   }
+   return surfaces;
+ }
+
  export default {
    name: "bootstrap-defense",
    description:
-     "Audit (default) or block /api/claude_cli/bootstrap traffic. Audit mode proxies the response through to CC and logs metadata to ~/.claude/cache-fix-bootstrap-log.jsonl. Block mode returns empty 200, preserving v3.6.2's de-facto behavior in explicit form.",
+     "Audit (default), block, or allowlist server-controlled system-prompt injection through /api/claude_cli/bootstrap. " +
+     "Covers both the legacy tengu_heron_brook reader and the env-var-selected (CLAUDE_CODE_SYSTEM_PROMPT_GB_FEATURE) reader. " +
+     "Audit mode logs surface metadata to ~/.claude/cache-fix-bootstrap-log.jsonl. Block mode returns empty 200 from onRequest. " +
+     "Allowlist mode strips non-allowlisted prompt-source-eligible keys from the response body before returning it to CC.",
    // Pipeline route scoping (pipeline.mjs:appliesToRoute) gates this extension
    // to the bootstrap path. The internal route guard below is belt-and-suspenders
    // in case a caller invokes the hook directly.
@@ -116,6 +170,7 @@ export default {
          phase: "request_blocked",
          mode,
          ...extractAuditFields(ctx.meta, ctx.headers),
+         remote_mode: Boolean(process.env.CLAUDE_CODE_REMOTE),
        }),
      );
      return {
@@ -130,7 +185,7 @@ export default {
    async onResponse(ctx) {
      if (ctx.meta?.route !== "bootstrap") return;
      const mode = ctx.meta._bootstrapDefenseMode || "audit";
-     if (mode !== "audit") return;
+     if (mode !== "audit" && mode !== "allowlist") return;

      // Prefer raw on-wire byte count from server.mjs (accurate even for
      // non-JSON / unparseable upstream responses). The fallback stringify path
@@ -142,20 +197,90 @@ export default {
      }

      const upstreamError = ctx.meta?._bootstrapUpstreamError ?? null;
-     // Request-side context (request_id, upstream_host) is captured at
-     // forward time and stashed on ctx.meta — we read from meta here rather
-     // than ctx.headers (which on onResponse contains the upstream RESPONSE
-     // headers, not the client request).
-     appendRecord(
-       recordShape({
-         phase: upstreamError ? "upstream_error_audited" : "response_audited",
-         mode,
-         status: ctx.status,
-         body_bytes: bodyBytes,
-         upstream_host: ctx.meta?._bootstrapUpstreamHost ?? null,
-         request_id: ctx.meta?._bootstrapRequestId ?? null,
+     const auditFields = extractAuditFields(ctx.meta, ctx.headers);
+     const remoteMode = Boolean(process.env.CLAUDE_CODE_REMOTE);
+     const baseRecord = {
+       mode,
+       status: ctx.status,
+       body_bytes: bodyBytes,
+       ...auditFields,
+       remote_mode: remoteMode,
+     };
+
+     // Anomaly path: upstream error or unparseable body. Single record with
+     // new fields defaulted; no multi-surface emission (directive § Multi-surface
+     // records, last paragraph).
+     if (upstreamError) {
+       appendRecord(recordShape({
+         ...baseRecord,
+         phase: "upstream_error_audited",
          error: upstreamError,
-       }),
-     );
+       }));
+       return;
+     }
+     if (ctx.body === null || ctx.body === undefined) {
+       appendRecord(recordShape({
+         ...baseRecord,
+         phase: "response_audited",
+       }));
+       return;
+     }
+
+     // Detect which prompt-source surfaces apply to this response.
+     const surfaces = detectSurfaces(ctx.body);
+
+     // No injection-detected baseline: single record preserving v3.7.0's
+     // record shape (surface defaults to "bootstrap", prompt_key null).
+     if (surfaces.length === 0) {
+       appendRecord(recordShape({
+         ...baseRecord,
+         phase: "response_audited",
+       }));
+       return;
+     }
+
+     // Compute per-surface fields. Read from the original (unmutated) body
+     // for every surface BEFORE applying any allowlist strips, so multiple
+     // surfaces referencing the same key (env-var-aliases-legacy-key) both
+     // see the same value and emit identical prompt_value_hash.
+     const allowed = mode === "allowlist" ? allowedKeysFromEnv() : null;
+     const perSurface = surfaces.map((s) => {
+       const value = ctx.body[s.prompt_key];
+       const hash = typeof value === "string" ? hashFlagValue(value) : null;
+       const stripThis =
+         mode === "allowlist" &&
+         !allowed.has(s.prompt_key) &&
+         Object.prototype.hasOwnProperty.call(ctx.body, s.prompt_key);
+       return {
+         ...s,
+         prompt_value_hash: hash,
+         stripped_keys: stripThis ? [s.prompt_key] : [],
+       };
+     });
+
+     // Apply strips to ctx.body. Deletion is idempotent across multiple
+     // surfaces that target the same key (alias case).
+     if (mode === "allowlist") {
+       for (const r of perSurface) {
+         for (const k of r.stripped_keys) {
+           delete ctx.body[k];
+         }
+       }
+     }
+
+     // Emit one audit record per detected surface. Shared scalars
+     // (request_id, timestamp window, body_bytes, etc.) are duplicated across
+     // records emitted from a single response; consumers correlate by
+     // request_id + timestamp window.
+     for (const r of perSurface) {
+       appendRecord(recordShape({
+         ...baseRecord,
+         phase: "response_audited",
+         surface: r.surface,
+         prompt_key: r.prompt_key,
+         prompt_value_hash: r.prompt_value_hash,
+         stripped_keys: r.stripped_keys,
+       }));
+     }
    },
  };
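The multi-surface audit path in `onResponse` above can be reduced to a small, testable sketch. It reuses the `detectSurfaces` rules (the legacy surface fires only when `tengu_heron_brook` is in the body; the env-var surface fires whenever `CLAUDE_CODE_SYSTEM_PROMPT_GB_FEATURE` is set) and emits one record per surface sharing a `request_id`. To stay dependency-free, it records a hypothetical `value_present` boolean where the real extension stores `prompt_value_hash` (the first 16 hex characters of a SHA-256 digest); `auditRecords` is likewise a hypothetical name, and the environment is passed in rather than read from `process.env`.

```javascript
const LEGACY_PROMPT_KEY = "tengu_heron_brook";

// Same activation rules as detectSurfaces() above, with env passed in.
function detectSurfaces(body, env) {
  const surfaces = [];
  const bodyIsObject = body !== null && typeof body === "object" && !Array.isArray(body);
  if (bodyIsObject && Object.prototype.hasOwnProperty.call(body, LEGACY_PROMPT_KEY)) {
    surfaces.push({ surface: "bootstrap", prompt_key: LEGACY_PROMPT_KEY });
  }
  const envKey = env.CLAUDE_CODE_SYSTEM_PROMPT_GB_FEATURE;
  if (envKey) surfaces.push({ surface: "prompt_injection_gb", prompt_key: envKey });
  return surfaces;
}

// Hypothetical helper: one record per surface. Values are read from the
// unmutated body, so alias surfaces (env var naming the legacy key) agree.
function auditRecords(body, env, requestId) {
  return detectSurfaces(body, env).map((s) => ({
    request_id: requestId,
    surface: s.surface,
    prompt_key: s.prompt_key,
    value_present: typeof body?.[s.prompt_key] === "string", // stands in for prompt_value_hash
  }));
}

const alias = auditRecords(
  { tengu_heron_brook: "v" },
  { CLAUDE_CODE_SYSTEM_PROMPT_GB_FEATURE: "tengu_heron_brook" },
  "req_1",
);
console.log(alias.map((r) => r.surface).join(",")); // bootstrap,prompt_injection_gb
```

The alias case yields two records for one key, which is why the real code computes every surface's hash before applying any allowlist deletion. The env-var surface also fires when the named key is absent, which is what gives the log visibility into env-configured intent.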
@@ -49,6 +49,13 @@ export function sessionFilename(rawId) {
    return "inv-" + createHash("sha256").update(s).digest("hex").slice(0, 16);
  }

+ // Full path to the per-session file for a raw session id. Exported so sibling
+ // extensions (e.g. session-health) can READ the prior state this writer wrote,
+ // using the identical filename rule — reuse, not duplicate.
+ export function sessionFilePath(rawId) {
+   return join(paths().sessionsDir, `${sessionFilename(rawId)}.json`);
+ }
+
  function resolveSessionId(headers) {
    if (!headers) return null;
    const sid =
@@ -222,6 +229,13 @@ export default {
          hit_rate: hitRate,
          timestamp,
        },
+       // Additive session-health fields (session-health extension, order
+       // 590, stashes these before this writer runs). Optional — absent if
+       // that extension is disabled or produced nothing this request.
+       ...(ctx.meta._sessionHealth || {}),
+       // Additive thinking-block-sanitize drop count (order 550, opt-in).
+       // Optional — absent unless CACHE_FIX_THINKING_SANITIZE=on.
+       ...(ctx.meta._thinkingSanitize || {}),
        timestamp,
        session_id: rawSid,
      },
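The `cache-telemetry` change above lets sibling extensions contribute fields without writing the per-session file themselves. Below is a simplified, hypothetical model of that single-writer merge; `buildSessionRecord` is not the extension's function name, and the real record carries more fields than shown. Two properties follow from object-spread semantics: spreading `undefined || {}` adds nothing, so a disabled extension leaves no fields behind, and because the spreads come before `timestamp` and `session_id`, a stashed field with either name cannot override them.

```javascript
// Hypothetical, simplified model of the cache-telemetry per-session record.
function buildSessionRecord(meta, timestamp, rawSid) {
  return {
    // Optional groups stashed by earlier extensions; `undefined || {}` adds nothing.
    ...(meta._sessionHealth || {}),
    ...(meta._thinkingSanitize || {}),
    // Written after the spreads, so these win over any same-named stashed field.
    timestamp,
    session_id: rawSid,
  };
}

const ts = "2026-05-27T00:00:00.000Z";
const withHealth = buildSessionRecord(
  { _sessionHealth: { context_tokens: 350000, thinking_desync_risk: "high" } },
  ts,
  "abc",
);
console.log(Object.keys(withHealth).join(",")); // context_tokens,thinking_desync_risk,timestamp,session_id
console.log("thinking_blocks_dropped" in buildSessionRecord({}, ts, "abc")); // false
```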
@@ -0,0 +1,152 @@
1
+ import { readFileSync } from "node:fs";
2
+ import { sessionFilename, sessionFilePath } from "./cache-telemetry.mjs";
3
+
4
+ // session-health — read-only early-warning for the CC thinking-desync wedge
5
+ // (anthropics/claude-code#63147). Long-running Opus 4.7 [1m] sessions grow
6
+ // their live context until CC's own history reconstruction desyncs a
7
+ // thinking-block signature, producing a permanent 400 on every subsequent
8
+ // turn. This extension OBSERVES (never mutates the body) and records the
9
+ // conditions that correlate with the trip, plus emits a one-time stderr warn
10
+ // so the operator can retire the session deliberately before it dies.
11
+ //
12
+ // It hands its computed fields to the existing per-session writer
13
+ // (cache-telemetry, order 600) via ctx.meta._sessionHealth; cache-telemetry
14
+ // merges them into the single per-session JSON write. This extension never
15
+ // writes that file itself (single-writer invariant).
16
+
17
+ const THINKING_TYPES = new Set(["thinking", "redacted_thinking"]);
18
+
19
+ const DEFAULT_WARN_TOKENS = 250_000;
20
+ const DEFAULT_HIGH_TOKENS = 340_000; // just under the observed ~382K trip
21
+
22
+ // --- Module-scope state ---
23
+ // Cross-request accumulators, seeded once-per-process from the prior persisted
24
+ // file so first_seen / max / count stay accurate across the proxy restarts
25
+ // that multi-week sessions inevitably span.
26
+ const sessionState = new Map(); // key -> { firstSeen, max, count }
27
+ // Sessions already given the one-time "high" stderr warn this process.
28
+ const warnedSessions = new Set();
29
+
30
+ function parseTokenEnv(raw, def) {
31
+ if (raw === undefined || raw === "") return def;
32
+ const n = Number(raw);
33
+ return Number.isFinite(n) && n >= 0 ? n : def;
34
+ }
35
+
36
+ // Exported for unit testing.
37
+ export function loadConfig(env = process.env) {
38
+ return {
39
+ warnTokens: parseTokenEnv(env.CACHE_FIX_THINKING_RISK_WARN_TOKENS, DEFAULT_WARN_TOKENS),
40
+ highTokens: parseTokenEnv(env.CACHE_FIX_THINKING_RISK_HIGH_TOKENS, DEFAULT_HIGH_TOKENS),
41
+ enabled: env.CACHE_FIX_THINKING_RISK !== "off",
42
+ };
43
+ }
44
+
45
+ export function countThinkingBlocks(body) {
46
+ if (!body || !Array.isArray(body.messages)) return 0;
47
+ let n = 0;
48
+ for (const msg of body.messages) {
49
+ if (!Array.isArray(msg.content)) continue;
50
+ for (const block of msg.content) {
51
+ if (block && THINKING_TYPES.has(block.type)) n++;
52
+ }
53
+ }
54
+ return n;
55
+ }
56
+
57
+ export function computeContextTokens(cacheStats) {
58
+ if (!cacheStats) return 0;
59
+ return (
60
+ (cacheStats.inputTokens || 0) +
61
+ (cacheStats.cacheRead || 0) +
62
+ (cacheStats.cacheCreation || 0)
63
+ );
64
+ }
65
+
66
+ export function computeRisk(contextTokens, { warnTokens, highTokens }) {
67
+ if (contextTokens >= highTokens) return "high";
68
+ if (contextTokens >= warnTokens) return "warn";
69
+ return "ok";
70
+ }
71
+
72
+ function seedFromFile(rawSid, now) {
73
+ let prev = null;
74
+ try {
75
+ prev = JSON.parse(readFileSync(sessionFilePath(rawSid), "utf8"));
76
+ } catch {}
77
+ return {
78
+ firstSeen: typeof prev?.first_seen === "string" ? prev.first_seen : now,
79
+ max: Number.isFinite(prev?.thinking_block_max) ? prev.thinking_block_max : 0,
80
+ count: Number.isFinite(prev?.request_count) ? prev.request_count : 0,
81
+ };
82
+ }
83
+
84
+ export default {
85
+ name: "session-health",
86
+ description:
87
+ "Observe per-session thinking-desync risk (context size + thinking-block count) and warn before the session reaches the danger zone. Read-only; never mutates the body.",
88
+ order: 590, // after request-body mutators (so the count is the forwarded body), before the writer (cache-telemetry, 600)
89
+
90
+ async onRequest(ctx) {
91
+ // Count thinking blocks in the (near-final) forwarded body. Session id is
92
+ // resolved by cache-telemetry's onRequest (order 600), which runs AFTER
93
+ // this hook — so we don't read the session id here; we read it in
94
+ // onStreamEvent, by which time it is set.
95
+ ctx.meta._thinkingBlockCount = countThinkingBlocks(ctx.body);
96
+ },
97
+
98
+ async onStreamEvent(ctx) {
99
+ const { event } = ctx;
100
+ if (!event || event.type !== "message_delta") return;
101
+ // Once per response, regardless of how many message_delta events arrive.
102
+ if (ctx.meta._sessionHealthDone) return;
103
+ ctx.meta._sessionHealthDone = true;
104
+
105
+ const now = new Date().toISOString();
106
+ const rawSid = ctx.meta._sessionId ?? null;
107
+ const key = sessionFilename(rawSid);
108
+ const thinkingBlockCount = ctx.meta._thinkingBlockCount || 0;
109
+ const contextTokens = computeContextTokens(ctx.meta.cacheStats);
110
+
111
+ let st = sessionState.get(key);
112
+ if (!st) {
113
+ st = seedFromFile(rawSid, now);
114
+ sessionState.set(key, st);
115
+ }
116
+ st.count += 1;
117
+ st.max = Math.max(st.max, thinkingBlockCount);
118
+
119
+ const health = {
120
+ context_tokens: contextTokens,
121
+ thinking_block_count: thinkingBlockCount,
122
+ thinking_block_max: st.max,
123
+ first_seen: st.firstSeen,
124
+ request_count: st.count,
125
+ };
126
+
127
+ const cfg = loadConfig();
128
+ if (cfg.enabled) {
129
+ const risk = computeRisk(contextTokens, cfg);
130
+ health.thinking_desync_risk = risk;
131
+ if (risk === "high" && !warnedSessions.has(key)) {
132
+ warnedSessions.add(key);
133
+ const sidLabel = rawSid || "unknown";
134
+ process.stderr.write(
135
+ `[session-health] session ${sidLabel} high thinking-desync risk: ` +
136
+ `context_tokens=${contextTokens} (>= ${cfg.highTokens}), ` +
137
+ `thinking_block_count=${thinkingBlockCount}. ` +
138
+ `Consider retiring this session (write SESSION_STATE + /clear).\n`,
139
+ );
140
+ }
141
+ }
142
+
143
+ // Hand off to cache-telemetry (order 600) to persist in its single write.
144
+ ctx.meta._sessionHealth = health;
145
+ },
146
+
147
+ // Test-only: reset module state between tests.
148
+ __resetForTests() {
149
+ sessionState.clear();
150
+ warnedSessions.clear();
151
+ },
152
+ };
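session-health depends on hook order (550, 590, 600) and on passing values through `ctx.meta` rather than writing files itself. The sketch below is a hypothetical, synchronous miniature of that handoff. The stub hooks, the `runPipeline` runner and the `persisted` field are illustrations only. They are not the proxy's actual loader or cache-telemetry writer. The sketch shows two things: `thinking_block_count` reflects the sanitized body, and the session id has to be read in `onStreamEvent` rather than in `onRequest`.

```javascript
// Simplified, hypothetical stand-ins for the three hooks discussed above. Only
// the `order` values and the ctx.meta handoff mirror the source. Hooks are
// synchronous here for brevity, and there is no file I/O or risk scoring.
const extensions = [
  {
    name: "cache-telemetry",
    order: 600,
    onRequest(ctx) {
      ctx.meta._sessionId = "demo-session"; // resolved AFTER session-health's onRequest
    },
    onStreamEvent(ctx) {
      // Single writer: merge whatever earlier hooks left in ctx.meta.
      ctx.meta.persisted = { ...ctx.meta._sessionHealth, ...ctx.meta._thinkingSanitize };
    },
  },
  {
    name: "session-health",
    order: 590,
    onRequest(ctx) {
      // Count thinking blocks in the body as it will be forwarded.
      ctx.meta._thinkingBlockCount = ctx.body.messages
        .flatMap((m) => (Array.isArray(m.content) ? m.content : []))
        .filter((b) => b.type === "thinking").length;
    },
    onStreamEvent(ctx) {
      if (ctx.event.type !== "message_delta" || ctx.meta._sessionHealthDone) return;
      ctx.meta._sessionHealthDone = true; // once per response
      ctx.meta._sessionHealth = {
        session: ctx.meta._sessionId, // safe to read now: set during onRequest
        thinking_block_count: ctx.meta._thinkingBlockCount,
      };
    },
  },
  {
    name: "thinking-block-sanitize",
    order: 550,
    onRequest(ctx) {
      // Stub: drops every empty thinking block, with no latest-turn protection.
      let dropped = 0;
      ctx.body.messages = ctx.body.messages.map((m) => {
        if (m.role !== "assistant" || !Array.isArray(m.content)) return m;
        const kept = m.content.filter((b) => !(b.type === "thinking" && b.thinking === ""));
        dropped += m.content.length - kept.length;
        return { ...m, content: kept };
      });
      ctx.meta._thinkingSanitize = { thinking_blocks_dropped: dropped };
    },
  },
];

// Hypothetical runner: sort by `order`, run every onRequest, then feed each
// stream event through every onStreamEvent in the same order.
function runPipeline(exts, body, events) {
  const sorted = [...exts].sort((a, b) => a.order - b.order);
  const ctx = { body, meta: {} };
  for (const ext of sorted) ext.onRequest?.(ctx);
  for (const event of events) {
    ctx.event = event;
    for (const ext of sorted) ext.onStreamEvent?.(ctx);
  }
  return ctx;
}

const ctx = runPipeline(
  extensions,
  {
    messages: [
      { role: "user", content: "hi" },
      {
        role: "assistant",
        content: [
          { type: "thinking", thinking: "", signature: "sig" },
          { type: "text", text: "hello" },
        ],
      },
      { role: "user", content: "again" },
    ],
  },
  [{ type: "message_start" }, { type: "message_delta" }, { type: "message_delta" }],
);
console.log(JSON.stringify(ctx.meta.persisted));
// {"session":"demo-session","thinking_block_count":0,"thinking_blocks_dropped":1}
```

The stubs run in ascending `order`. If session-health were moved below 550, the same request would report `thinking_block_count: 1`, which is the pre-sanitize count.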
@@ -0,0 +1,130 @@
1
+ // thinking-block-sanitize — request-path mitigation for the CC thinking-desync
2
+ // wedge (anthropics/claude-code#63147). On replay paths (resume / --continue /
3
+ // auto-compaction / parallel-tool-cancel), CC re-sends prior assistant turns'
4
+ // thinking in the OMITTED shape `{ type:"thinking", thinking:"", signature }`.
5
+ // The API rejects modified thinking in the *latest* assistant message with a
6
+ // permanent 400, which wedges the session. This extension drops the omitted
7
+ // thinking blocks the API treats as optional, before the request is forwarded.
8
+ //
9
+ // Turn-selection rule (settled by empirical capture; see directive Open Question 1):
10
+ // - drop omitted thinking from ALL prior assistant turns, AND
11
+ // - from the LATEST assistant turn UNLESS it is an active tool-continuation
12
+ // (its last block is a tool_use answered by a later tool_result with the
13
+ // same tool_use_id); the proxy cannot cover that case (the API needs the signed thinking for the
14
+ // pending tool call; we can't restore the emptied text). No env var both
15
+ // preserves thinking and avoids the wedge there: CLAUDE_CODE_DISABLE_THINKING=1
16
+ // / MAX_THINKING_TOKENS=0 stop it only by disabling thinking entirely
17
+ // (lossy), and DISABLE_INTERLEAVED_THINKING=1 does NOT stop the 400. For that
18
+ // case the answer is: don't resume; heal or retire the session instead.
19
+ // Never touches non-empty thinking, and never touches redacted_thinking (v1).
20
+ //
21
+ // OPT-IN for v1: only runs when CACHE_FIX_THINKING_SANITIZE=on (default off) —
22
+ // it mutates request bodies and its coverage is not yet live-validated.
23
+ //
24
+ // Order 550: after the request-body mutators (ttl-management 500) and before
25
+ // session-health (590), so #160's thinking_block_count reflects the forwarded
26
+ // body. The per-request drop count is exposed via ctx.meta._thinkingSanitize
27
+ // for cache-telemetry (600) to merge into the per-session JSON.
28
+
29
+ export function isOmittedThinking(block) {
30
+ return (
31
+ !!block &&
32
+ block.type === "thinking" &&
33
+ typeof block.thinking === "string" &&
34
+ block.thinking.trim() === ""
35
+ );
36
+ }
37
+
38
+ function answersToolUse(msg, toolUseId) {
39
+ return (
40
+ !!msg &&
41
+ Array.isArray(msg.content) &&
42
+ msg.content.some(
43
+ (b) => b && b.type === "tool_result" && b.tool_use_id === toolUseId,
44
+ )
45
+ );
46
+ }
47
+
48
+ // The latest assistant message is an active tool-continuation when its terminal
49
+ // block is a `tool_use` that is *paired with* (i.e. answered by) a following
50
+ // `tool_result` carrying the same `tool_use_id`. Only then does the API require
51
+ // that turn's thinking to be intact, so only then must we leave it untouched. Matching
52
+ // the id (not merely the presence of any later tool_result) keeps the guard as
53
+ // narrow as the approved rule: an unanswered terminal tool_use, or a later
54
+ // tool_result that answers a *different* call, is not the protected case.
55
+ export function isActiveToolContinuation(messages, idx) {
56
+ const msg = messages[idx];
57
+ if (!msg || !Array.isArray(msg.content) || msg.content.length === 0) return false;
58
+ const last = msg.content[msg.content.length - 1];
59
+ if (!last || last.type !== "tool_use" || !last.id) return false;
60
+ for (let j = idx + 1; j < messages.length; j++) {
61
+ if (answersToolUse(messages[j], last.id)) return true;
62
+ }
63
+ return false;
64
+ }
65
+
66
+ function latestAssistantIndex(messages) {
67
+ for (let i = messages.length - 1; i >= 0; i--) {
68
+ if (messages[i] && messages[i].role === "assistant") return i;
69
+ }
70
+ return -1;
71
+ }
72
+
73
+ // Pure planner: returns { messages, dropped }. Does not mutate the input.
74
+ // `messages` is a new array if anything changed, else the input array itself (a message that loses all content is dropped).
75
+ export function planSanitize(messages) {
76
+ if (!Array.isArray(messages)) return { messages, dropped: 0 };
77
+ const latestAsst = latestAssistantIndex(messages);
78
+ const protectLatest = latestAsst >= 0 && isActiveToolContinuation(messages, latestAsst);
79
+
80
+ let dropped = 0;
81
+ let changed = false;
82
+ const out = [];
83
+ for (let i = 0; i < messages.length; i++) {
84
+ const msg = messages[i];
85
+ if (!msg || msg.role !== "assistant" || !Array.isArray(msg.content)) {
86
+ out.push(msg);
87
+ continue;
88
+ }
89
+ if (i === latestAsst && protectLatest) {
90
+ out.push(msg); // active continuation — leave its thinking intact
91
+ continue;
92
+ }
93
+ const kept = msg.content.filter((b) => {
94
+ if (isOmittedThinking(b)) {
95
+ dropped++;
96
+ return false;
97
+ }
98
+ return true;
99
+ });
100
+ if (kept.length === msg.content.length) {
101
+ out.push(msg); // unchanged
102
+ } else if (kept.length === 0) {
103
+ changed = true; // message became empty → drop it entirely
104
+ } else {
105
+ out.push({ ...msg, content: kept });
106
+ changed = true;
107
+ }
108
+ }
109
+ return { messages: changed ? out : messages, dropped };
110
+ }
111
+
112
+ export default {
113
+ name: "thinking-block-sanitize",
114
+ description:
115
+ "Drop omitted (empty-text) thinking blocks from prior assistant turns and the latest non-continuation turn, to head off the CC thinking-desync 400 (#63147). Opt-in via CACHE_FIX_THINKING_SANITIZE=on.",
116
+ order: 550,
117
+
118
+ async onRequest(ctx) {
119
+ if (process.env.CACHE_FIX_THINKING_SANITIZE !== "on") return;
120
+ const body = ctx.body;
121
+ if (!body || !Array.isArray(body.messages)) return;
122
+
123
+ const { messages, dropped } = planSanitize(body.messages);
124
+ if (dropped > 0) body.messages = messages;
125
+
126
+ // Counts only — never content. Exposed for cache-telemetry to persist and
127
+ // for the #160 session-health signal.
128
+ ctx.meta._thinkingSanitize = { thinking_blocks_dropped: dropped };
129
+ },
130
+ };
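The next example makes the turn-selection rule concrete. It is a condensed restatement of `isOmittedThinking`, `isActiveToolContinuation` and `planSanitize`, applied to two replayed transcripts. It is an illustrative rewrite, not the shipped module. The names `sanitize` and `protectsLatest` are hypothetical, and unlike `planSanitize` it always returns a fresh array. The tool name `Read` and the ids are made up.

```javascript
// Condensed, illustrative restatement of the rule implemented by planSanitize.
const isOmitted = (b) =>
  !!b && b.type === "thinking" && typeof b.thinking === "string" && b.thinking.trim() === "";

// The latest assistant turn is protected only if its final block is a tool_use
// answered by a later tool_result carrying the same id.
function protectsLatest(messages, idx) {
  const content = messages[idx].content;
  if (!Array.isArray(content) || content.length === 0) return false;
  const last = content[content.length - 1];
  if (!last || last.type !== "tool_use" || !last.id) return false;
  return messages
    .slice(idx + 1)
    .some(
      (m) =>
        Array.isArray(m?.content) &&
        m.content.some((b) => b?.type === "tool_result" && b.tool_use_id === last.id),
    );
}

function sanitize(messages) {
  let latest = -1;
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i]?.role === "assistant") {
      latest = i;
      break;
    }
  }
  const protect = latest >= 0 && protectsLatest(messages, latest);
  let dropped = 0;
  const out = [];
  messages.forEach((m, i) => {
    if (m?.role !== "assistant" || !Array.isArray(m.content) || (i === latest && protect)) {
      out.push(m); // non-assistant, or the protected active continuation
      return;
    }
    const kept = m.content.filter((b) => !isOmitted(b));
    dropped += m.content.length - kept.length;
    // A message left with no content is dropped entirely.
    if (kept.length > 0) out.push(kept.length === m.content.length ? m : { ...m, content: kept });
  });
  return { messages: out, dropped };
}

// Two replayed transcripts that differ only in which call the last tool_result answers.
const omitted = { type: "thinking", thinking: "", signature: "sig" };
const toolResult = (id) => ({
  role: "user",
  content: [{ type: "tool_result", tool_use_id: id, content: "ok" }],
});
const transcript = (answeredId) => [
  { role: "user", content: "start" },
  { role: "assistant", content: [omitted, { type: "text", text: "done" }] },
  { role: "user", content: "next" },
  { role: "assistant", content: [omitted, { type: "tool_use", id: "tu_1", name: "Read", input: {} }] },
  toolResult(answeredId),
];

console.log(sanitize(transcript("tu_1")).dropped); // 1: latest turn is an active continuation, kept intact
console.log(sanitize(transcript("tu_2")).dropped); // 2: id mismatch, so the latest turn is sanitized too
```

The second call shows how narrow the id match keeps the guard. A later `tool_result` for a different call does not protect the latest turn, so that turn's omitted thinking is dropped as well.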
@@ -10,7 +10,17 @@ function detectRequestType(system) {
10
10
  return isSubagent ? "subagent" : "main";
11
11
  }
12
12
 
13
+ // Thinking and redacted_thinking blocks must be returned to the API byte-identical
14
+ // to the original model response — the API validates them and rejects any
15
+ // modification with "thinking blocks ... cannot be modified" (a 400 on the whole
16
+ // request). With interleaved thinking on Opus 4.7, CC can place a cache_control
17
+ // breakpoint on a thinking block; injecting a ttl there would mutate the block
18
+ // and break the request. Skip them — the marginal TTL benefit on one breakpoint
19
+ // is never worth corrupting a thinking turn.
20
+ const PROTECTED_BLOCK_TYPES = new Set(["thinking", "redacted_thinking"]);
21
+
13
22
  function injectTtl(block, ttlParam) {
23
+ if (block && PROTECTED_BLOCK_TYPES.has(block.type)) return block;
14
24
  if (block.cache_control?.type === "ephemeral" && !block.cache_control.ttl) {
15
25
  return { ...block, cache_control: { ...block.cache_control, ttl: ttlParam } };
16
26
  }
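Here is a minimal sketch of the new guard's effect. This hunk shows only the first lines of `injectTtl`, so the final pass-through `return block` is an assumption about the rest of the function. `"1h"` is used as an example `ttl` value.

```javascript
// Sketch of the guard added above. The trailing `return block` is assumed;
// the rest of injectTtl is not shown in the hunk.
const PROTECTED_BLOCK_TYPES = new Set(["thinking", "redacted_thinking"]);

function injectTtl(block, ttlParam) {
  if (block && PROTECTED_BLOCK_TYPES.has(block.type)) return block; // never touch signed thinking
  if (block.cache_control?.type === "ephemeral" && !block.cache_control.ttl) {
    return { ...block, cache_control: { ...block.cache_control, ttl: ttlParam } };
  }
  return block; // assumed: everything else passes through unchanged
}

const thinking = { type: "thinking", thinking: "...", signature: "sig", cache_control: { type: "ephemeral" } };
const text = { type: "text", text: "hi", cache_control: { type: "ephemeral" } };

console.log(injectTtl(thinking, "1h") === thinking); // true: same object, forwarded unmodified
console.log(injectTtl(text, "1h").cache_control); // { type: 'ephemeral', ttl: '1h' }
```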
@@ -1,20 +1,82 @@
1
1
  {
2
- "bootstrap-defense": { "enabled": true, "order": 45 },
3
- "ttl-tier-detect": { "enabled": true, "order": 75 },
4
- "fingerprint-strip": { "enabled": true, "order": 100 },
5
- "image-strip": { "enabled": true, "order": 150 },
6
- "sort-stabilization": { "enabled": true, "order": 200 },
7
- "fresh-session-sort": { "enabled": true, "order": 250 },
8
- "identity-normalization": { "enabled": true, "order": 300 },
9
- "smoosh-split": { "enabled": true, "order": 320 },
10
- "content-strip": { "enabled": true, "order": 330 },
11
- "tool-input-normalize": { "enabled": true, "order": 340 },
12
- "microcompact-stability": { "enabled": true, "order": 350 },
13
- "thinking-display": { "enabled": true, "order": 360 },
14
- "cache-control-normalize": { "enabled": true, "order": 400 },
15
- "messages-cache-breakpoint": { "enabled": true, "order": 410 },
16
- "ttl-management": { "enabled": true, "order": 500 },
17
- "cache-telemetry": { "enabled": true, "order": 600 },
18
- "overage-warning": { "enabled": true, "order": 610 },
19
- "request-log": { "enabled": false, "order": 700 }
2
+ "bootstrap-defense": {
3
+ "enabled": true,
4
+ "order": 45
5
+ },
6
+ "ttl-tier-detect": {
7
+ "enabled": true,
8
+ "order": 75
9
+ },
10
+ "fingerprint-strip": {
11
+ "enabled": true,
12
+ "order": 100
13
+ },
14
+ "image-strip": {
15
+ "enabled": true,
16
+ "order": 150
17
+ },
18
+ "sort-stabilization": {
19
+ "enabled": true,
20
+ "order": 200
21
+ },
22
+ "fresh-session-sort": {
23
+ "enabled": true,
24
+ "order": 250
25
+ },
26
+ "identity-normalization": {
27
+ "enabled": true,
28
+ "order": 300
29
+ },
30
+ "smoosh-split": {
31
+ "enabled": true,
32
+ "order": 320
33
+ },
34
+ "content-strip": {
35
+ "enabled": true,
36
+ "order": 330
37
+ },
38
+ "tool-input-normalize": {
39
+ "enabled": true,
40
+ "order": 340
41
+ },
42
+ "microcompact-stability": {
43
+ "enabled": true,
44
+ "order": 350
45
+ },
46
+ "thinking-display": {
47
+ "enabled": true,
48
+ "order": 360
49
+ },
50
+ "cache-control-normalize": {
51
+ "enabled": true,
52
+ "order": 400
53
+ },
54
+ "messages-cache-breakpoint": {
55
+ "enabled": true,
56
+ "order": 410
57
+ },
58
+ "ttl-management": {
59
+ "enabled": true,
60
+ "order": 500
61
+ },
62
+ "cache-telemetry": {
63
+ "enabled": true,
64
+ "order": 600
65
+ },
66
+ "overage-warning": {
67
+ "enabled": true,
68
+ "order": 610
69
+ },
70
+ "request-log": {
71
+ "enabled": false,
72
+ "order": 700
73
+ },
74
+ "usage-log": {
75
+ "enabled": true,
76
+ "order": 650
77
+ },
78
+ "rate-limit-log": {
79
+ "enabled": true,
80
+ "order": 660
81
+ }
20
82
  }
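The reformatted file appends `usage-log` (650) and `rate-limit-log` (660) after `request-log` (700). This example assumes the proxy runs enabled extensions in ascending `order`, as the order comments in the extension sources describe. Under that assumption, position in the file does not affect run order. The hypothetical `runOrder` helper below, applied to a subset of the entries, shows the resulting sequence.

```javascript
// Hypothetical helper, assuming enabled extensions run in ascending `order`
// regardless of where they appear in extensions.json.
const config = {
  "cache-telemetry": { enabled: true, order: 600 },
  "overage-warning": { enabled: true, order: 610 },
  "request-log": { enabled: false, order: 700 },
  "usage-log": { enabled: true, order: 650 },
  "rate-limit-log": { enabled: true, order: 660 },
};

function runOrder(cfg) {
  return Object.entries(cfg)
    .filter(([, v]) => v.enabled) // disabled entries never run
    .sort(([, a], [, b]) => a.order - b.order)
    .map(([name]) => name);
}

console.log(runOrder(config).join(" > "));
// cache-telemetry > overage-warning > usage-log > rate-limit-log
```

To check the real file, pass the same helper the parsed contents of `proxy/extensions.json`.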