claude-code-cache-fix 3.7.1 → 3.8.0 (npm)

Files changed (7)

package/README.md +37 -1
package/package.json +1 -1
package/proxy/extensions/cache-telemetry.mjs +14 -0
package/proxy/extensions/session-health.mjs +152 -0
package/proxy/extensions/thinking-block-sanitize.mjs +130 -0
package/proxy/extensions/ttl-management.mjs +10 -0
package/proxy/extensions.json +80 -18

package/README.md CHANGED

@@ -29,7 +29,7 @@ That's it. The proxy applies all 7 cache-fix extensions automatically. No wrappe
 ### What the proxy does
-On every `/v1/messages` request, 7 extensions run in order:
+On every `/v1/messages` request, 9 extensions run in order (one opt-in):
 | Extension | What it fixes |
 |-----------|--------------|
@@ -40,6 +40,8 @@ On every `/v1/messages` request, 7 extensions run in order:
 | `fresh-session-sort` | Fixes non-deterministic ordering on first turn |
 | `cache-control-normalize` | Normalizes cache_control markers across messages |
 | `cache-telemetry` | Extracts cache stats from response headers → `~/.claude/quota-status/{account.json,sessions/<id>.json}` |
+| `session-health` | Observes per-session thinking-desync risk (context size + thinking-block count) and warns before a session reaches the danger zone. Read-only |
+| `thinking-block-sanitize` | Drops omitted (empty-text) thinking blocks to head off the CC thinking-desync `400` (#63147). **Opt-in** (`CACHE_FIX_THINKING_SANITIZE=on`) |
 Extensions are hot-reloadable — add, remove, or modify `.mjs` files in `proxy/extensions/` and changes apply to the next request without restarting. Configuration in `proxy/extensions.json`.
@@ -723,6 +725,40 @@ Scoping rules baked into the extension:
 |---------|---------|---------|
 | `CACHE_FIX_THINKING_DISPLAY` | `summarized` (built-in) | One of `summarized` / `omitted` / `disabled`. `summarized` restores thinking summaries (default). `omitted` force-suppresses thinking blocks. `disabled` opts the extension out entirely. |
+## Session-health early-warning (proxy mode, thinking-desync risk)
+Long-running Opus 4.7 `[1m]` sessions accumulate interleaved thinking blocks and grow their live context until Claude Code's own history reconstruction desyncs a thinking-block signature, producing a permanent `400 … thinking blocks … cannot be modified` on every subsequent turn (upstream root cause: [anthropics/claude-code#63147](https://github.com/anthropics/claude-code/issues/63147)). The session dies abruptly with no prior signal.
+The `session-health` extension watches the conditions that correlate with the trip and warns **before** a session reaches the danger zone, so the operator can retire it deliberately (write a session-state handoff, `/clear`) instead of being surprised by a dead session. It is **read-only** — it never mutates the request/response body and never attempts to repair the desync (that is CC-side, #63147). It records numeric telemetry into the per-session file (`~/.claude/quota-status/sessions/<id>.json`) on each request and, when a session first crosses into `high` risk, emits a one-time stderr line. Counts only — no thinking text or signatures are ever logged.
+Fields added to the per-session JSON:
+- `context_tokens` — latest request's live context (`input + cache_read + cache_creation`)
+- `thinking_block_count` — `thinking`/`redacted_thinking` blocks in the latest request
+- `thinking_block_max` — session high-water mark (carried across proxy restarts)
+- `first_seen`, `request_count` — session age + request tally
+- `thinking_desync_risk` — `ok` / `warn` / `high` (omitted when the signal is disabled)
+Token thresholds are anchored to the observed ~382K-token trip with margin; the warning is conservative by design — a premature "retire soon" is far cheaper than a dead session. Block-count is recorded but does not yet gate the warning (it activates in a calibrated fast-follow once the failure distribution is known).
+| Env var | Default | Purpose |
+|---------|---------|---------|
+| `CACHE_FIX_THINKING_RISK_WARN_TOKENS` | `250000` | Context-token level at which `thinking_desync_risk` becomes `warn`. |
+| `CACHE_FIX_THINKING_RISK_HIGH_TOKENS` | `340000` | Context-token level at which risk becomes `high` and the one-time stderr warn fires. |
+| `CACHE_FIX_THINKING_RISK` | unset (on) | Set to `off` to suppress the warning signal (stderr line + `thinking_desync_risk` field). Raw count telemetry keeps recording. |
+## Thinking-block sanitize (proxy mode, opt-in, thinking-desync mitigation)
+The *mitigate* half of the thinking-desync response (the *warn-before* half is session-health above). On history-replay paths (resume / `--continue` / auto-compaction / parallel-tool-cancel), Claude Code re-sends prior assistant turns' extended thinking in the **omitted** shape `{ "type":"thinking", "thinking":"", "signature":"<intact>" }`. The API rejects modified thinking in the **latest** assistant message with a permanent `400 … thinking … blocks cannot be modified`, which wedges the session on every subsequent turn (upstream root cause: [anthropics/claude-code#63147](https://github.com/anthropics/claude-code/issues/63147)).
+The `thinking-block-sanitize` extension drops those omitted blocks — which the API treats as optional history — from the request before it is forwarded. Empirically-resolved turn-selection rule: drop omitted thinking from **all prior assistant turns and the latest assistant turn, unless the latest turn is an active tool-continuation** (its last block is a `tool_use` answered by a following `tool_result`). In that one case the API requires the signed thinking intact and the proxy cannot restore the emptied text, so it leaves the turn untouched. **No env var both preserves thinking and avoids the wedge for that case:** `CLAUDE_CODE_DISABLE_THINKING=1` / `MAX_THINKING_TOKENS=0` stop the wedge only by disabling thinking entirely (lossy — no reasoning), and `DISABLE_INTERLEAVED_THINKING=1` does *not* stop the `400` — so there the answer is don't-resume + heal/retire the session. That is exactly why the proxy mitigation matters: **it is the only path that preserves reasoning while avoiding the wedge** for the history-replay paths it covers. Non-empty thinking is never touched; `redacted_thinking` is out of scope for v1.
+**Opt-in.** v1 ships behind `CACHE_FIX_THINKING_SANITIZE=on` (default off): it mutates request bodies and full live-coverage validation is pending. The transform is deterministic and cache-prefix-stable, and emits a per-request `thinking_blocks_dropped` count into the per-session JSON (counts only — never content) that complements the session-health signal.
+| Env var | Default | Purpose |
+|---------|---------|---------|
+| `CACHE_FIX_THINKING_SANITIZE` | unset (off) | Set to `on` to enable the request-path drop of omitted thinking blocks. Off = no-op (no mutation, no telemetry). |
 ## System prompt rewrite (preload mode, optional)
 The interceptor can rewrite Claude Code's `# Output efficiency` system-prompt section. Disabled by default. Enable with `CACHE_FIX_OUTPUT_EFFICIENCY_REPLACEMENT`. See [docs/output-efficiency-prompts.md](docs/output-efficiency-prompts.md) for the three known prompt variants and usage instructions.
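
Reviewer note on the README hunk above: the per-session fields it documents could be consumed by a small status helper along these lines. This is only a sketch. The name `describeSession`, the advice strings, and the sample record are hypothetical; only the field names and the `ok` / `warn` / `high` values come from the README text.

```javascript
// Hypothetical helper: turn a parsed per-session record into a one-line status.
// Field names follow the README; the advice wording is illustrative only.
const ADVICE = new Map([
  ["ok", "no action needed"],
  ["warn", "plan a handoff soon"],
  ["high", "retire the session now"],
  ["unknown", "risk signal disabled"],
]);

function describeSession(record) {
  const tokens = Number.isFinite(record.context_tokens) ? record.context_tokens : 0;
  // thinking_desync_risk is omitted when CACHE_FIX_THINKING_RISK=off.
  const risk = record.thinking_desync_risk ?? "unknown";
  const advice = ADVICE.get(risk) ?? "unrecognized risk value";
  return `${tokens} context tokens, risk=${risk}: ${advice}`;
}

// Sample record shaped like the documented fields (all values are made up).
const sample = {
  context_tokens: 262144,
  thinking_block_count: 41,
  thinking_block_max: 57,
  first_seen: "2025-01-01T00:00:00.000Z",
  request_count: 312,
  thinking_desync_risk: "warn",
};

console.log(describeSession(sample));
```

In practice the record would be parsed from `~/.claude/quota-status/sessions/<id>.json`; the file read is left out here so the snippet runs anywhere.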

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "claude-code-cache-fix",
-  "version": "3.7.1",
+  "version": "3.8.0",
   "description": "Cache optimization proxy and interceptor for Claude Code. Fixes prompt cache bugs, stabilizes prefix, reduces quota burn.",
   "type": "module",
   "exports": {

package/proxy/extensions/cache-telemetry.mjs CHANGED

@@ -49,6 +49,13 @@ export function sessionFilename(rawId) {
   return "inv-" + createHash("sha256").update(s).digest("hex").slice(0, 16);
 }
+// Full path to the per-session file for a raw session id. Exported so sibling
+// extensions (e.g. session-health) can READ the prior state this writer wrote,
+// using the identical filename rule — reuse, not duplicate.
+export function sessionFilePath(rawId) {
+  return join(paths().sessionsDir, `${sessionFilename(rawId)}.json`);
+}
 function resolveSessionId(headers) {
   if (!headers) return null;
   const sid =
@@ -222,6 +229,13 @@ export default {
             hit_rate: hitRate,
             timestamp,
           },
+          // Additive session-health fields (session-health extension, order
+          // 590, stashes these before this writer runs). Optional — absent if
+          // that extension is disabled or produced nothing this request.
+          ...(ctx.meta._sessionHealth || {}),
+          // Additive thinking-block-sanitize drop count (order 550, opt-in).
+          // Optional — absent unless CACHE_FIX_THINKING_SANITIZE=on.
+          ...(ctx.meta._thinkingSanitize || {}),
           timestamp,
           session_id: rawSid,
         },
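
Reviewer note on the hunk above: the two spreads make the new fields purely additive. A minimal sketch of that merge pattern follows. `buildSessionRecord` is a hypothetical stand-in for the object literal inside the writer, and the sample values are made up.

```javascript
// Sketch of the additive-merge pattern; not the shipped writer.
function buildSessionRecord(meta, rawSid, timestamp) {
  return {
    // Spreading `undefined || {}` contributes no keys, so a disabled
    // extension leaves the record shape unchanged.
    ...(meta._sessionHealth || {}),
    ...(meta._thinkingSanitize || {}),
    // Keys written after the spreads win on a name collision.
    timestamp,
    session_id: rawSid,
  };
}

const withBoth = buildSessionRecord(
  {
    _sessionHealth: { context_tokens: 120000, request_count: 7 },
    _thinkingSanitize: { thinking_blocks_dropped: 2 },
  },
  "abc",
  "2025-01-01T00:00:00.000Z",
);
const withNeither = buildSessionRecord({}, "abc", "2025-01-01T00:00:00.000Z");

console.log(Object.keys(withBoth));
console.log(Object.keys(withNeither));
```

Because `timestamp` and `session_id` follow the spreads, a stray key of the same name in either meta object would be overwritten rather than leak into the record.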

package/proxy/extensions/session-health.mjs ADDED

@@ -0,0 +1,152 @@
+import { readFileSync } from "node:fs";
+import { sessionFilename, sessionFilePath } from "./cache-telemetry.mjs";
+// session-health — read-only early-warning for the CC thinking-desync wedge
+// (anthropics/claude-code#63147). Long-running Opus 4.7 [1m] sessions grow
+// their live context until CC's own history reconstruction desyncs a
+// thinking-block signature, producing a permanent 400 on every subsequent
+// turn. This extension OBSERVES (never mutates the body) and records the
+// conditions that correlate with the trip, plus emits a one-time stderr warn
+// so the operator can retire the session deliberately before it dies.
+//
+// It hands its computed fields to the existing per-session writer
+// (cache-telemetry, order 600) via ctx.meta._sessionHealth; cache-telemetry
+// merges them into the single per-session JSON write. This extension never
+// writes that file itself (single-writer invariant).
+const THINKING_TYPES = new Set(["thinking", "redacted_thinking"]);
+const DEFAULT_WARN_TOKENS = 250_000;
+const DEFAULT_HIGH_TOKENS = 340_000; // just under the observed ~382K trip
+// --- Module-scope state ---
+// Cross-request accumulators, seeded once-per-process from the prior persisted
+// file so first_seen / max / count stay accurate across the proxy restarts
+// that multi-week sessions inevitably span.
+const sessionState = new Map(); // key -> { firstSeen, max, count }
+// Sessions already given the one-time "high" stderr warn this process.
+const warnedSessions = new Set();
+function parseTokenEnv(raw, def) {
+  if (raw === undefined || raw === "") return def;
+  const n = Number(raw);
+  return Number.isFinite(n) && n >= 0 ? n : def;
+}
+// Exported for unit testing.
+export function loadConfig(env = process.env) {
+  return {
+    warnTokens: parseTokenEnv(env.CACHE_FIX_THINKING_RISK_WARN_TOKENS, DEFAULT_WARN_TOKENS),
+    highTokens: parseTokenEnv(env.CACHE_FIX_THINKING_RISK_HIGH_TOKENS, DEFAULT_HIGH_TOKENS),
+    enabled: env.CACHE_FIX_THINKING_RISK !== "off",
+  };
+}
+export function countThinkingBlocks(body) {
+  if (!body || !Array.isArray(body.messages)) return 0;
+  let n = 0;
+  for (const msg of body.messages) {
+    if (!Array.isArray(msg.content)) continue;
+    for (const block of msg.content) {
+      if (block && THINKING_TYPES.has(block.type)) n++;
+    }
+  }
+  return n;
+}
+export function computeContextTokens(cacheStats) {
+  if (!cacheStats) return 0;
+  return (
+    (cacheStats.inputTokens || 0) +
+    (cacheStats.cacheRead || 0) +
+    (cacheStats.cacheCreation || 0)
+  );
+}
+export function computeRisk(contextTokens, { warnTokens, highTokens }) {
+  if (contextTokens >= highTokens) return "high";
+  if (contextTokens >= warnTokens) return "warn";
+  return "ok";
+}
+function seedFromFile(rawSid, now) {
+  let prev = null;
+  try {
+    prev = JSON.parse(readFileSync(sessionFilePath(rawSid), "utf8"));
+  } catch {}
+  return {
+    firstSeen: typeof prev?.first_seen === "string" ? prev.first_seen : now,
+    max: Number.isFinite(prev?.thinking_block_max) ? prev.thinking_block_max : 0,
+    count: Number.isFinite(prev?.request_count) ? prev.request_count : 0,
+  };
+}
+export default {
+  name: "session-health",
+  description:
+    "Observe per-session thinking-desync risk (context size + thinking-block count) and warn before the session reaches the danger zone. Read-only; never mutates the body.",
+  order: 590, // after request-body mutators (so the count is the forwarded body), before the writer (cache-telemetry, 600)
+  async onRequest(ctx) {
+    // Count thinking blocks in the (near-final) forwarded body. Session id is
+    // resolved by cache-telemetry's onRequest (order 600), which runs AFTER
+    // this hook — so we don't read the session id here; we read it in
+    // onStreamEvent, by which time it is set.
+    ctx.meta._thinkingBlockCount = countThinkingBlocks(ctx.body);
+  },
+  async onStreamEvent(ctx) {
+    const { event } = ctx;
+    if (!event || event.type !== "message_delta") return;
+    // Once per response, regardless of how many message_delta events arrive.
+    if (ctx.meta._sessionHealthDone) return;
+    ctx.meta._sessionHealthDone = true;
+    const now = new Date().toISOString();
+    const rawSid = ctx.meta._sessionId ?? null;
+    const key = sessionFilename(rawSid);
+    const thinkingBlockCount = ctx.meta._thinkingBlockCount || 0;
+    const contextTokens = computeContextTokens(ctx.meta.cacheStats);
+    let st = sessionState.get(key);
+    if (!st) {
+      st = seedFromFile(rawSid, now);
+      sessionState.set(key, st);
+    }
+    st.count += 1;
+    st.max = Math.max(st.max, thinkingBlockCount);
+    const health = {
+      context_tokens: contextTokens,
+      thinking_block_count: thinkingBlockCount,
+      thinking_block_max: st.max,
+      first_seen: st.firstSeen,
+      request_count: st.count,
+    };
+    const cfg = loadConfig();
+    if (cfg.enabled) {
+      const risk = computeRisk(contextTokens, cfg);
+      health.thinking_desync_risk = risk;
+      if (risk === "high" && !warnedSessions.has(key)) {
+        warnedSessions.add(key);
+        const sidLabel = rawSid || "unknown";
+        process.stderr.write(
+          `[session-health] session ${sidLabel} high thinking-desync risk: ` +
+            `context_tokens=${contextTokens} (>= ${cfg.highTokens}), ` +
+            `thinking_block_count=${thinkingBlockCount}. ` +
+            `Consider retiring this session (write SESSION_STATE + /clear).\n`,
+        );
+      }
+    }
+    // Hand off to cache-telemetry (order 600) to persist in its single write.
+    ctx.meta._sessionHealth = health;
+  },
+  // Test-only: reset module state between tests.
+  __resetForTests() {
+    sessionState.clear();
+    warnedSessions.clear();
+  },
+};
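
Reviewer note on the file above: the threshold logic is easiest to check at its boundaries. The snippet restates `parseTokenEnv` and `computeRisk` from the diff so it runs on its own; the inputs are illustrative.

```javascript
// Restated from the diff above so the snippet is self-contained.
function parseTokenEnv(raw, def) {
  if (raw === undefined || raw === "") return def;
  const n = Number(raw);
  return Number.isFinite(n) && n >= 0 ? n : def;
}

function computeRisk(contextTokens, { warnTokens, highTokens }) {
  if (contextTokens >= highTokens) return "high";
  if (contextTokens >= warnTokens) return "warn";
  return "ok";
}

const defaults = { warnTokens: 250_000, highTokens: 340_000 };

// Both thresholds are inclusive lower bounds.
const belowWarn = computeRisk(249_999, defaults);
const atWarn = computeRisk(250_000, defaults);
const atHigh = computeRisk(340_000, defaults);

// Unparseable or negative overrides fall back to the default.
const badText = parseTokenEnv("lots", 340_000);
const negative = parseTokenEnv("-1", 340_000);

// "0" is accepted as a valid override, so every request classifies as "high".
const zeroHigh = parseTokenEnv("0", 340_000);
const alwaysHigh = computeRisk(0, { warnTokens: 250_000, highTokens: zeroHigh });

console.log({ belowWarn, atWarn, atHigh, badText, negative, alwaysHigh });
```

The last case is worth knowing when setting `CACHE_FIX_THINKING_RISK_HIGH_TOKENS`: a value of `0` does not disable the warning, it fires it on the first response.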

package/proxy/extensions/thinking-block-sanitize.mjs ADDED

@@ -0,0 +1,130 @@
+// thinking-block-sanitize — request-path mitigation for the CC thinking-desync
+// wedge (anthropics/claude-code#63147). On replay paths (resume / --continue /
+// auto-compaction / parallel-tool-cancel), CC re-sends prior assistant turns'
+// thinking in the OMITTED shape `{ type:"thinking", thinking:"", signature }`.
+// The API rejects modified thinking in the *latest* assistant message with a
+// permanent 400, which wedges the session. This extension drops the omitted
+// thinking blocks the API treats as optional, before the request is forwarded.
+//
+// Resolved turn-selection rule (directive Open Question 1, empirical capture):
+//   - drop omitted thinking from ALL prior assistant turns, AND
+//   - from the LATEST assistant turn UNLESS it is an active tool-continuation
+//     (last block is a tool_use with a following tool_result) — that case is
+//     uncoverable by the proxy (the API needs the signed thinking for the
+//     pending tool call; we can't restore the emptied text). No env var both
+//     preserves thinking and avoids the wedge there — CLAUDE_CODE_DISABLE_THINKING=1
+//     / MAX_THINKING_TOKENS=0 stop it only by disabling thinking entirely
+//     (lossy); DISABLE_INTERLEAVED_THINKING=1 does NOT stop the 400 — so the
+//     answer for that case is don't-resume + heal/retire.
+// Never touches non-empty thinking, and never touches redacted_thinking (v1).
+//
+// OPT-IN for v1: only runs when CACHE_FIX_THINKING_SANITIZE=on (default off) —
+// it mutates request bodies and its coverage is not yet live-validated.
+//
+// Order 550: after the request-body mutators (ttl-management 500) and before
+// session-health (590), so #160's thinking_block_count reflects the forwarded
+// body. The per-request drop count is exposed via ctx.meta._thinkingSanitize
+// for cache-telemetry (600) to merge into the per-session JSON.
+export function isOmittedThinking(block) {
+  return (
+    !!block &&
+    block.type === "thinking" &&
+    typeof block.thinking === "string" &&
+    block.thinking.trim() === ""
+  );
+}
+function answersToolUse(msg, toolUseId) {
+  return (
+    !!msg &&
+    Array.isArray(msg.content) &&
+    msg.content.some(
+      (b) => b && b.type === "tool_result" && b.tool_use_id === toolUseId,
+    )
+  );
+}
+// The latest assistant message is an active tool-continuation when its terminal
+// block is a `tool_use` that is *paired with* — i.e. answered by — a following
+// `tool_result` carrying the same `tool_use_id`. Only then does the API require
+// that turn's thinking intact, so only then must we leave it untouched. Matching
+// the id (not merely the presence of any later tool_result) keeps the guard as
+// narrow as the approved rule: an unanswered terminal tool_use, or a later
+// tool_result that answers a *different* call, is not the protected case.
+export function isActiveToolContinuation(messages, idx) {
+  const msg = messages[idx];
+  if (!msg || !Array.isArray(msg.content) || msg.content.length === 0) return false;
+  const last = msg.content[msg.content.length - 1];
+  if (!last || last.type !== "tool_use" || !last.id) return false;
+  for (let j = idx + 1; j < messages.length; j++) {
+    if (answersToolUse(messages[j], last.id)) return true;
+  }
+  return false;
+}
+function latestAssistantIndex(messages) {
+  for (let i = messages.length - 1; i >= 0; i--) {
+    if (messages[i] && messages[i].role === "assistant") return i;
+  }
+  return -1;
+}
+// Pure planner: returns { messages, dropped }. Does not mutate the input.
+// `messages` is the new array (a message that loses all content is dropped).
+export function planSanitize(messages) {
+  if (!Array.isArray(messages)) return { messages, dropped: 0 };
+  const latestAsst = latestAssistantIndex(messages);
+  const protectLatest = latestAsst >= 0 && isActiveToolContinuation(messages, latestAsst);
+  let dropped = 0;
+  let changed = false;
+  const out = [];
+  for (let i = 0; i < messages.length; i++) {
+    const msg = messages[i];
+    if (!msg || msg.role !== "assistant" || !Array.isArray(msg.content)) {
+      out.push(msg);
+      continue;
+    }
+    if (i === latestAsst && protectLatest) {
+      out.push(msg); // active continuation — leave its thinking intact
+      continue;
+    }
+    const kept = msg.content.filter((b) => {
+      if (isOmittedThinking(b)) {
+        dropped++;
+        return false;
+      }
+      return true;
+    });
+    if (kept.length === msg.content.length) {
+      out.push(msg); // unchanged
+    } else if (kept.length === 0) {
+      changed = true; // message became empty → drop it entirely
+    } else {
+      out.push({ ...msg, content: kept });
+      changed = true;
+    }
+  }
+  return { messages: changed ? out : messages, dropped };
+}
+export default {
+  name: "thinking-block-sanitize",
+  description:
+    "Drop omitted (empty-text) thinking blocks from prior assistant turns and the latest non-continuation turn, to head off the CC thinking-desync 400 (#63147). Opt-in via CACHE_FIX_THINKING_SANITIZE=on.",
+  order: 550,
+  async onRequest(ctx) {
+    if (process.env.CACHE_FIX_THINKING_SANITIZE !== "on") return;
+    const body = ctx.body;
+    if (!body || !Array.isArray(body.messages)) return;
+    const { messages, dropped } = planSanitize(body.messages);
+    if (dropped > 0) body.messages = messages;
+    // Counts only — never content. Exposed for cache-telemetry to persist and
+    // for the #160 session-health signal.
+    ctx.meta._thinkingSanitize = { thinking_blocks_dropped: dropped };
+  },
+};
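
Reviewer note on the file above: the turn-selection rule can be exercised on a small history. The snippet is a condensed restatement of `planSanitize` (renamed `sanitizeSketch` here, and always returning a fresh array), not the shipped module. The message contents, tool name, and ids are made up.

```javascript
// Condensed restatement of the rule in the diff above; illustrative only.
const isOmittedThinking = (b) =>
  !!b && b.type === "thinking" && typeof b.thinking === "string" && b.thinking.trim() === "";

function isActiveToolContinuation(messages, idx) {
  const content = messages[idx]?.content;
  if (!Array.isArray(content) || content.length === 0) return false;
  const last = content[content.length - 1];
  if (!last || last.type !== "tool_use" || !last.id) return false;
  // Protected only when a later message answers this exact tool_use id.
  return messages.slice(idx + 1).some(
    (m) =>
      Array.isArray(m?.content) &&
      m.content.some((b) => b && b.type === "tool_result" && b.tool_use_id === last.id),
  );
}

function sanitizeSketch(messages) {
  let latest = -1;
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i]?.role === "assistant") { latest = i; break; }
  }
  const protect = latest >= 0 && isActiveToolContinuation(messages, latest);
  let dropped = 0;
  const out = [];
  messages.forEach((msg, i) => {
    if (msg?.role !== "assistant" || !Array.isArray(msg.content) || (i === latest && protect)) {
      out.push(msg); // non-assistant, or the protected continuation turn
      return;
    }
    const kept = msg.content.filter((b) => !isOmittedThinking(b));
    dropped += msg.content.length - kept.length;
    if (kept.length === msg.content.length) out.push(msg);
    else if (kept.length > 0) out.push({ ...msg, content: kept });
    // A message left with no content is dropped entirely.
  });
  return { messages: out, dropped };
}

const history = [
  { role: "user", content: [{ type: "text", text: "list files" }] },
  { role: "assistant", content: [
    { type: "thinking", thinking: "", signature: "sig-1" },
    { type: "text", text: "Sure." },
  ] },
  { role: "user", content: [{ type: "text", text: "now run it" }] },
  { role: "assistant", content: [
    { type: "thinking", thinking: "", signature: "sig-2" },
    { type: "tool_use", id: "toolu_1", name: "Bash", input: {} },
  ] },
  { role: "user", content: [{ type: "tool_result", tool_use_id: "toolu_1", content: "ok" }] },
];

const answered = sanitizeSketch(history);               // latest turn is protected
const unanswered = sanitizeSketch(history.slice(0, 4)); // no tool_result yet
console.log(answered.dropped, unanswered.dropped);
```

With the `tool_result` present, only the earlier turn loses its omitted block; without it, the latest turn is not an active continuation and loses its block too.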

package/proxy/extensions/ttl-management.mjs CHANGED

@@ -10,7 +10,17 @@ function detectRequestType(system) {
   return isSubagent ? "subagent" : "main";
 }
+// Thinking and redacted_thinking blocks must be returned to the API byte-identical
+// to the original model response — the API validates them and rejects any
+// modification with "thinking blocks ... cannot be modified" (a 400 on the whole
+// request). On Opus 4.7 interleaved thinking, CC can place a cache_control
+// breakpoint on a thinking block; injecting a ttl there would mutate the block
+// and break the request. Skip them — the marginal TTL benefit on one breakpoint
+// is never worth corrupting a thinking turn.
+const PROTECTED_BLOCK_TYPES = new Set(["thinking", "redacted_thinking"]);
 function injectTtl(block, ttlParam) {
+  if (block && PROTECTED_BLOCK_TYPES.has(block.type)) return block;
   if (block.cache_control?.type === "ephemeral" && !block.cache_control.ttl) {
     return { ...block, cache_control: { ...block.cache_control, ttl: ttlParam } };
   }
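
Reviewer note on the hunk above: the effect of the new guard can be shown with two blocks that both carry an ephemeral `cache_control` marker. This is a sketch. The trailing `return block` is an assumption, since the hunk ends before the function's last lines, and the `"1h"` value and block contents are illustrative.

```javascript
const PROTECTED_BLOCK_TYPES = new Set(["thinking", "redacted_thinking"]);

// Sketch of the guarded injector from the hunk above.
function injectTtl(block, ttlParam) {
  // Thinking blocks are returned as-is, by reference, so nothing is rewritten.
  if (block && PROTECTED_BLOCK_TYPES.has(block.type)) return block;
  if (block.cache_control?.type === "ephemeral" && !block.cache_control.ttl) {
    return { ...block, cache_control: { ...block.cache_control, ttl: ttlParam } };
  }
  return block; // assumed fall-through: leave everything else unchanged
}

const thinkingBlock = {
  type: "thinking",
  thinking: "step one",
  signature: "sig",
  cache_control: { type: "ephemeral" },
};
const textBlock = { type: "text", text: "hi", cache_control: { type: "ephemeral" } };

const thinkingOut = injectTtl(thinkingBlock, "1h");
const textOut = injectTtl(textBlock, "1h");

console.log(thinkingOut === thinkingBlock); // same object, untouched
console.log(textOut.cache_control);
```

Returning the original object for protected types, rather than a copy, also means later reference-equality checks can tell that the block was never rewritten.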

package/proxy/extensions.json CHANGED

@@ -1,20 +1,82 @@
 {
-  "bootstrap-defense": { "enabled": true, "order": 45 },
-  "ttl-tier-detect": { "enabled": true, "order": 75 },
-  "fingerprint-strip": { "enabled": true, "order": 100 },
-  "image-strip": { "enabled": true, "order": 150 },
-  "sort-stabilization": { "enabled": true, "order": 200 },
-  "fresh-session-sort": { "enabled": true, "order": 250 },
-  "identity-normalization": { "enabled": true, "order": 300 },
-  "smoosh-split": { "enabled": true, "order": 320 },
-  "content-strip": { "enabled": true, "order": 330 },
-  "tool-input-normalize": { "enabled": true, "order": 340 },
-  "microcompact-stability": { "enabled": true, "order": 350 },
-  "thinking-display": { "enabled": true, "order": 360 },
-  "cache-control-normalize": { "enabled": true, "order": 400 },
-  "messages-cache-breakpoint": { "enabled": true, "order": 410 },
-  "ttl-management": { "enabled": true, "order": 500 },
-  "cache-telemetry": { "enabled": true, "order": 600 },
-  "overage-warning": { "enabled": true, "order": 610 },
-  "request-log": { "enabled": false, "order": 700 }
+  "bootstrap-defense": {
+    "enabled": true,
+    "order": 45
+  },
+  "ttl-tier-detect": {
+    "enabled": true,
+    "order": 75
+  },
+  "fingerprint-strip": {
+    "enabled": true,
+    "order": 100
+  },
+  "image-strip": {
+    "enabled": true,
+    "order": 150
+  },
+  "sort-stabilization": {
+    "enabled": true,
+    "order": 200
+  },
+  "fresh-session-sort": {
+    "enabled": true,
+    "order": 250
+  },
+  "identity-normalization": {
+    "enabled": true,
+    "order": 300
+  },
+  "smoosh-split": {
+    "enabled": true,
+    "order": 320
+  },
+  "content-strip": {
+    "enabled": true,
+    "order": 330
+  },
+  "tool-input-normalize": {
+    "enabled": true,
+    "order": 340
+  },
+  "microcompact-stability": {
+    "enabled": true,
+    "order": 350
+  },
+  "thinking-display": {
+    "enabled": true,
+    "order": 360
+  },
+  "cache-control-normalize": {
+    "enabled": true,
+    "order": 400
+  },
+  "messages-cache-breakpoint": {
+    "enabled": true,
+    "order": 410
+  },
+  "ttl-management": {
+    "enabled": true,
+    "order": 500
+  },
+  "cache-telemetry": {
+    "enabled": true,
+    "order": 600
+  },
+  "overage-warning": {
+    "enabled": true,
+    "order": 610
+  },
+  "request-log": {
+    "enabled": false,
+    "order": 700
+  },
+  "usage-log": {
+    "enabled": true,
+    "order": 650
+  },
+  "rate-limit-log": {
+    "enabled": true,
+    "order": 660
+  }
 }
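
Reviewer note on the config above: `usage-log` (650) and `rate-limit-log` (660) are appended after `request-log` (700) in the file, so key position and `order` disagree. The sketch below assumes enabled extensions run in ascending `order` regardless of key position, which matches how the extension comments in this diff reason about ordering. `effectiveOrder` is a hypothetical helper, and the config is a subset copied from the diff.

```javascript
// Subset of the new extensions.json (name -> { enabled, order }).
const config = {
  "ttl-management": { enabled: true, order: 500 },
  "cache-telemetry": { enabled: true, order: 600 },
  "overage-warning": { enabled: true, order: 610 },
  "request-log": { enabled: false, order: 700 },
  "usage-log": { enabled: true, order: 650 },
  "rate-limit-log": { enabled: true, order: 660 },
};

// Assumption: lower `order` runs first; disabled entries are skipped.
function effectiveOrder(cfg) {
  return Object.entries(cfg)
    .filter(([, ext]) => ext.enabled)
    .sort(([, a], [, b]) => a.order - b.order)
    .map(([name]) => name);
}

const pipeline = effectiveOrder(config);
console.log(pipeline);
```

Note that neither `session-health` nor `thinking-block-sanitize` appears in this file; their `order` values (590 and 550) are declared in the modules themselves.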