claude-code-cache-fix 3.7.1 → 3.9.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +55 -1
- package/README.zh.md +691 -159
- package/hooks/README.md +36 -0
- package/hooks/examples/worktree-edit-guard.py +93 -0
- package/package.json +2 -1
- package/proxy/extensions/auto-1m-guard.mjs +117 -0
- package/proxy/extensions/cache-telemetry.mjs +19 -0
- package/proxy/extensions/session-health.mjs +152 -0
- package/proxy/extensions/thinking-block-sanitize.mjs +130 -0
- package/proxy/extensions/ttl-management.mjs +10 -0
- package/proxy/extensions.json +80 -18
- package/tools/MANUAL-COMPACT.md +15 -8
- package/tools/manual-compact.sh +17 -11
- package/tools/quota-statusline.sh +4 -2
package/README.md
CHANGED
@@ -29,7 +29,7 @@ That's it. The proxy applies all 7 cache-fix extensions automatically. No wrappe
 
 ### What the proxy does
 
-On every `/v1/messages` request, 7 extensions run in order:
+On every `/v1/messages` request, 9 extensions run in order (one opt-in):
 
 | Extension | What it fixes |
 |-----------|--------------|
@@ -40,6 +40,8 @@ On every `/v1/messages` request, 7 extensions run in order:
 | `fresh-session-sort` | Fixes non-deterministic ordering on first turn |
 | `cache-control-normalize` | Normalizes cache_control markers across messages |
 | `cache-telemetry` | Extracts cache stats from response headers → `~/.claude/quota-status/{account.json,sessions/<id>.json}` |
+| `session-health` | Observes per-session thinking-desync risk (context size + thinking-block count) and warns before a session reaches the danger zone. Read-only |
+| `thinking-block-sanitize` | Drops omitted (empty-text) thinking blocks to head off the CC thinking-desync `400` (#63147). **Opt-in** (`CACHE_FIX_THINKING_SANITIZE=on`) |
 
 Extensions are hot-reloadable — add, remove, or modify `.mjs` files in `proxy/extensions/` and changes apply to the next request without restarting. Configuration in `proxy/extensions.json`.
 
@@ -229,6 +231,24 @@ Note: cache-fix v3.6.2 and earlier returned 404 for the bootstrap path because t
 - [`CHANGELOG.md`](CHANGELOG.md#371---2026-05-27) — v3.7.1 release entry (extended surface coverage + allowlist mode); [v3.7.0 entry](CHANGELOG.md#370---2026-05-26) covers the prior behavior-change note
 - [`cnighswonger/heron-brook-poc`](https://github.com/cnighswonger/heron-brook-poc) — reproducer for the bootstrap-channel behavior
 
+**Auto-1M-context overage protection.** CC v2.1.161 onward (notably the VS Code Extension surface) can auto-select 1M context on Pro Plan without user request, immediately consuming overage credits. The proxy's `auto-1m-guard` extension detects the `context-1m-2025-08-07` token on the outbound `anthropic-beta` header and either warns or strips it, depending on the mode you opt into via `CACHE_FIX_AUTO_1M_GUARD`:
+
+| Mode | Default? | Behavior |
+|---|---|---|
+| `off` | no | Extension no-op. |
+| `warn` | yes | Detect the token. Stash an annotation into the per-session JSON (`auto_1m_detected`, `auto_1m_action: "warn"`, `auto_1m_advice`) and emit a stderr log line. Does not modify the request. |
+| `strip` | opt-in | Detect AND remove the token from the `anthropic-beta` header before forwarding. Annotation: `auto_1m_action: "stripped"`. |
+
+The CC-side kill switch is `CLAUDE_CODE_DISABLE_1M_CONTEXT=1` (env var), which is the right fix when it actually reaches the CC process. On the VS Code extension surface that env var is reportedly unreliable; the proxy intercept bypasses that gap because it acts on the wire regardless of which CC launcher produced the request. Tracks [CC#64919](https://github.com/anthropics/claude-code/issues/64919); see [`docs/directives/proxy-auto-1m-guard.md`](docs/directives/proxy-auto-1m-guard.md) for the binary-walk that confirms the proxy-visible signal is the beta header (CC strips the `[1m]` suffix from `req.body.model` client-side before sending).
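Reviewer sketch of the three modes described in the table above. This is not the shipped `auto-1m-guard.mjs`: the function name `guardAuto1m`, its return shape, and the treatment of the header as a comma-separated token list are assumptions made for illustration. The `auto_1m_advice` text is omitted because the diff does not show it, and whether `strip` also records `auto_1m_detected` is a guess.

```javascript
// Hypothetical helper, for illustration only.
const ONE_M_TOKEN = "context-1m-2025-08-07";

/**
 * @param {string|undefined} betaHeader raw `anthropic-beta` header value
 * @param {string} mode value of CACHE_FIX_AUTO_1M_GUARD ("off" | "warn" | "strip")
 * @returns {{ header: string|undefined, annotation: object|null }}
 */
function guardAuto1m(betaHeader, mode = "warn") {
  if (mode === "off" || !betaHeader) {
    return { header: betaHeader, annotation: null };
  }
  // Assumption: the header carries a comma-separated list of beta tokens.
  const tokens = betaHeader.split(",").map((t) => t.trim()).filter(Boolean);
  if (!tokens.includes(ONE_M_TOKEN)) {
    return { header: betaHeader, annotation: null };
  }
  if (mode === "strip") {
    const kept = tokens.filter((t) => t !== ONE_M_TOKEN);
    return {
      // undefined signals "omit the header entirely" when nothing is left.
      header: kept.length > 0 ? kept.join(",") : undefined,
      annotation: { auto_1m_detected: true, auto_1m_action: "stripped" },
    };
  }
  // Any other value is treated as "warn" here: annotate, leave the request alone.
  return {
    header: betaHeader,
    annotation: { auto_1m_detected: true, auto_1m_action: "warn" },
  };
}
```

Working on the header rather than on `req.body.model` matches the note above that CC strips the `[1m]` suffix client-side, so the beta token is the only signal a proxy can see.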
+
+## Client-side hooks
+
+Some Claude Code behaviors live below the request layer — they happen client-side, in the tool-dispatch path, before the proxy ever sees traffic. cache-fix ships standalone hook scripts under [`hooks/examples/`](hooks/README.md) for those cases. They're independent of the proxy and you install them by pointing at them from your own `~/.claude/settings.json`.
+
+| Script | What it does |
+|---|---|
+| [`worktree-edit-guard.py`](docs/hooks/worktree-edit-guard.md) | Block `Edit`/`Write`/`MultiEdit`/`NotebookEdit` tool calls whose target path escapes the active git worktree, preventing parent-checkout corruption from worktree sessions. Addresses [CC#59628](https://github.com/anthropics/claude-code/issues/59628). |
+
 ## Recommended CC operational config
 
 The proxy fixes what it can fix at the request layer. A handful of CC client-side env vars and `~/.claude/settings.json` knobs solve adjacent problems the proxy can't reach — silent model swaps on CC update, ambiguous model fallback, schema-strip side effects. Surfacing these here as a recommendation; users decide their own config.
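Reviewer sketch of the containment check that the `worktree-edit-guard.py` row in this hunk describes. The shipped hook is a Python script whose source is not shown here, so the code below is only an illustration in JavaScript. The helper names (`normalizePosix`, `isInside`, `shouldBlock`) are hypothetical, paths are assumed to be absolute POSIX paths, and symlinks are not resolved. How the real hook obtains the tool name, target path, and worktree root is not covered.

```javascript
// Hypothetical sketch, not the shipped hook.
const GUARDED_TOOLS = new Set(["Edit", "Write", "MultiEdit", "NotebookEdit"]);

// Collapse "." and ".." segments of an absolute POSIX path (no filesystem access).
function normalizePosix(absPath) {
  const out = [];
  for (const seg of absPath.split("/")) {
    if (seg === "" || seg === ".") continue;
    if (seg === "..") out.pop();
    else out.push(seg);
  }
  return "/" + out.join("/");
}

// True when `target` is the root itself or lies underneath it.
function isInside(root, target) {
  const r = normalizePosix(root);
  const t = normalizePosix(target);
  // Appending "/" keeps "/repo/wt-a" from matching "/repo/wt-ab".
  return t === r || t.startsWith(r === "/" ? "/" : r + "/");
}

// Block only the guarded tools, and only when the target escapes the worktree.
function shouldBlock(toolName, targetPath, worktreeRoot) {
  if (!GUARDED_TOOLS.has(toolName)) return false;
  return !isInside(worktreeRoot, targetPath);
}
```

For example, `shouldBlock("Write", "/repo/.worktrees/a/../../README.md", "/repo/.worktrees/a")` normalizes the target to `/repo/README.md`, which sits in the parent checkout, so the call would be blocked.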
@@ -723,6 +743,40 @@ Scoping rules baked into the extension:
 |---------|---------|---------|
 | `CACHE_FIX_THINKING_DISPLAY` | `summarized` (built-in) | One of `summarized` / `omitted` / `disabled`. `summarized` restores thinking summaries (default). `omitted` force-suppresses thinking blocks. `disabled` opts the extension out entirely. |
 
+## Session-health early-warning (proxy mode, thinking-desync risk)
+
+Long-running Opus 4.7 `[1m]` sessions accumulate interleaved thinking blocks and grow their live context until Claude Code's own history reconstruction desyncs a thinking-block signature, producing a permanent `400 … thinking blocks … cannot be modified` on every subsequent turn (upstream root cause: [anthropics/claude-code#63147](https://github.com/anthropics/claude-code/issues/63147)). The session dies abruptly with no prior signal.
+
+The `session-health` extension watches the conditions that correlate with the trip and warns **before** a session reaches the danger zone, so the operator can retire it deliberately (write a session-state handoff, `/clear`) instead of being surprised by a dead session. It is **read-only** — it never mutates the request/response body and never attempts to repair the desync (that is CC-side, #63147). It records numeric telemetry into the per-session file (`~/.claude/quota-status/sessions/<id>.json`) on each request and, when a session first crosses into `high` risk, emits a one-time stderr line. Counts only — no thinking text or signatures are ever logged.
+
+Fields added to the per-session JSON:
+
+- `context_tokens` — latest request's live context (`input + cache_read + cache_creation`)
+- `thinking_block_count` — `thinking`/`redacted_thinking` blocks in the latest request
+- `thinking_block_max` — session high-water mark (carried across proxy restarts)
+- `first_seen`, `request_count` — session age + request tally
+- `thinking_desync_risk` — `ok` / `warn` / `high` (omitted when the signal is disabled)
+
+Token thresholds are anchored to the observed ~382K-token trip with margin; the warning is conservative by design — a premature "retire soon" is far cheaper than a dead session. Block-count is recorded but does not yet gate the warning (it activates in a calibrated fast-follow once the failure distribution is known).
+
+| Env var | Default | Purpose |
+|---------|---------|---------|
+| `CACHE_FIX_THINKING_RISK_WARN_TOKENS` | `250000` | Context-token level at which `thinking_desync_risk` becomes `warn`. |
+| `CACHE_FIX_THINKING_RISK_HIGH_TOKENS` | `340000` | Context-token level at which risk becomes `high` and the one-time stderr warn fires. |
+| `CACHE_FIX_THINKING_RISK` | unset (on) | Set to `off` to suppress the warning signal (stderr line + `thinking_desync_risk` field). Raw count telemetry keeps recording. |
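Reviewer sketch of how the token thresholds in the table above could map onto `thinking_desync_risk`. It is not taken from `session-health.mjs`. The function names, the shape of the `usage` argument, and the choice of inclusive (`>=`) comparisons are assumptions. Only the env var names, the defaults, and the `input + cache_read + cache_creation` sum come from the README text.

```javascript
// Hypothetical helper: read a positive integer threshold, else use the default.
function readThreshold(env, name, fallback) {
  const n = Number.parseInt(env[name] ?? "", 10);
  return Number.isFinite(n) && n > 0 ? n : fallback;
}

/**
 * @param {{input:number, cache_read:number, cache_creation:number}} usage (assumed shape)
 * @param {Record<string,string>} env environment variables
 */
function classifyRisk(usage, env = {}) {
  const contextTokens = usage.input + usage.cache_read + usage.cache_creation;
  // With the signal off, raw telemetry is still recorded but the risk field is omitted.
  if (env.CACHE_FIX_THINKING_RISK === "off") {
    return { context_tokens: contextTokens };
  }
  const warnAt = readThreshold(env, "CACHE_FIX_THINKING_RISK_WARN_TOKENS", 250000);
  const highAt = readThreshold(env, "CACHE_FIX_THINKING_RISK_HIGH_TOKENS", 340000);
  let risk = "ok";
  if (contextTokens >= highAt) risk = "high";
  else if (contextTokens >= warnAt) risk = "warn";
  return { context_tokens: contextTokens, thinking_desync_risk: risk };
}
```

With the defaults, the `high` boundary sits roughly 42K tokens below the ~382K-token trip reported above, which is the margin the README calls conservative by design.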
+
+## Thinking-block sanitize (proxy mode, opt-in, thinking-desync mitigation)
+
+The *mitigate* half of the thinking-desync response (the *warn-before* half is session-health above). On history-replay paths (resume / `--continue` / auto-compaction / parallel-tool-cancel), Claude Code re-sends prior assistant turns' extended thinking in the **omitted** shape `{ "type":"thinking", "thinking":"", "signature":"<intact>" }`. The API rejects modified thinking in the **latest** assistant message with a permanent `400 … thinking … blocks cannot be modified`, which wedges the session on every subsequent turn (upstream root cause: [anthropics/claude-code#63147](https://github.com/anthropics/claude-code/issues/63147)).
+
+The `thinking-block-sanitize` extension drops those omitted blocks — which the API treats as optional history — from the request before it is forwarded. Empirically-resolved turn-selection rule: drop omitted thinking from **all prior assistant turns and the latest assistant turn, unless the latest turn is an active tool-continuation** (its last block is a `tool_use` answered by a following `tool_result`). In that one case the API requires the signed thinking intact and the proxy cannot restore the emptied text, so it leaves the turn untouched. **No env var both preserves thinking and avoids the wedge for that case:** `CLAUDE_CODE_DISABLE_THINKING=1` / `MAX_THINKING_TOKENS=0` stop the wedge only by disabling thinking entirely (lossy — no reasoning), and `DISABLE_INTERLEAVED_THINKING=1` does *not* stop the `400` — so there the answer is don't-resume + heal/retire the session. That is exactly why the proxy mitigation matters: **it is the only path that preserves reasoning while avoiding the wedge** for the history-replay paths it covers. Non-empty thinking is never touched; `redacted_thinking` is out of scope for v1.
+
+**Opt-in.** v1 ships behind `CACHE_FIX_THINKING_SANITIZE=on` (default off): it mutates request bodies and full live-coverage validation is pending. The transform is deterministic and cache-prefix-stable, and emits a per-request `thinking_blocks_dropped` count into the per-session JSON (counts only — never content) that complements the session-health signal.
+
+| Env var | Default | Purpose |
+|---------|---------|---------|
+| `CACHE_FIX_THINKING_SANITIZE` | unset (off) | Set to `on` to enable the request-path drop of omitted thinking blocks. Off = no-op (no mutation, no telemetry). |
+
 ## System prompt rewrite (preload mode, optional)
 
 The interceptor can rewrite Claude Code's `# Output efficiency` system-prompt section. Disabled by default. Enable with `CACHE_FIX_OUTPUT_EFFICIENCY_REPLACEMENT`. See [docs/output-efficiency-prompts.md](docs/output-efficiency-prompts.md) for the three known prompt variants and usage instructions.
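Reviewer sketch of the turn-selection rule stated in the thinking-block sanitize section of this hunk. It is an illustration written from the README wording, not the contents of `thinking-block-sanitize.mjs`. The function names are hypothetical, matching a `tool_result` to its `tool_use` by `tool_use_id` is an assumption about what "answered by" means, and the case where dropping blocks leaves a turn with empty `content` is not handled.

```javascript
// An "omitted" block keeps its signature but has empty thinking text.
function isOmittedThinking(block) {
  return Boolean(block) && block.type === "thinking" && block.thinking === "";
}

// Latest turn ends in a tool_use that the next message answers with a tool_result.
function isActiveToolContinuation(messages, idx) {
  const content = messages[idx].content;
  const last = Array.isArray(content) ? content[content.length - 1] : null;
  if (!last || last.type !== "tool_use") return false;
  const next = messages[idx + 1];
  return Boolean(
    next &&
      Array.isArray(next.content) &&
      next.content.some((b) => b.type === "tool_result" && b.tool_use_id === last.id)
  );
}

// Returns a new message list plus a count; input messages are not mutated.
function sanitizeThinking(messages) {
  let latestAssistant = -1;
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === "assistant") {
      latestAssistant = i;
      break;
    }
  }
  let dropped = 0;
  const out = messages.map((msg, i) => {
    if (msg.role !== "assistant" || !Array.isArray(msg.content)) return msg;
    // The one exception: leave an active tool-continuation turn untouched.
    if (i === latestAssistant && isActiveToolContinuation(messages, i)) return msg;
    const kept = msg.content.filter((b) => !isOmittedThinking(b));
    dropped += msg.content.length - kept.length;
    return kept.length === msg.content.length ? msg : { ...msg, content: kept };
  });
  return { messages: out, thinking_blocks_dropped: dropped };
}
```

Non-empty `thinking` and `redacted_thinking` blocks pass through unchanged because the filter only matches the empty-text shape, in line with the v1 scope described above.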