npm - @ictechgy/context-guard - Versions diffs - 0.4.10 → 0.4.12 - Mend

@ictechgy/context-guard 0.4.10 → 0.4.12

This diff shows the content of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects changes between package versions as they appear in their public registries.

Files changed (32)

package/CHANGELOG.md +17 -1
package/README.ko.md +46 -28
package/README.md +42 -33
package/docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl +24 -0
package/docs/benchmark-workflow-examples.md +3 -0
package/docs/benchmark-workflows/context-pack-byte-proxy.example.json +278 -137
package/docs/benchmark-workflows/measured-token-workflow.example.json +279 -138
package/docs/benchmark-workflows/provider-cache-telemetry.example.json +279 -138
package/docs/experimental-benchmark-fixtures.md +24 -7
package/package.json +2 -1
package/plugins/context-guard/.claude-plugin/plugin.json +1 -1
package/plugins/context-guard/README.ko.md +14 -11
package/plugins/context-guard/README.md +15 -14
package/plugins/context-guard/bin/context-guard +48 -17
package/plugins/context-guard/bin/context-guard-artifact +342 -33
package/plugins/context-guard/bin/context-guard-audit +36 -5
package/plugins/context-guard/bin/context-guard-bench +1675 -44
package/plugins/context-guard/bin/context-guard-cache-score +347 -35
package/plugins/context-guard/bin/context-guard-compress +89 -27
package/plugins/context-guard/bin/context-guard-cost +7 -2
package/plugins/context-guard/bin/context-guard-experiments +364 -8
package/plugins/context-guard/bin/context-guard-failed-nudge +6 -2
package/plugins/context-guard/bin/context-guard-filter +88 -18
package/plugins/context-guard/bin/context-guard-pack +329 -19
package/plugins/context-guard/bin/context-guard-read-symbol +27 -0
package/plugins/context-guard/bin/context-guard-sanitize-output +245 -18
package/plugins/context-guard/bin/context-guard-setup +21 -5
package/plugins/context-guard/bin/context-guard-tool-prune +287 -62
package/plugins/context-guard/bin/context-guard-trim-output +394 -90
package/plugins/context-guard/brief/README.md +5 -5
package/plugins/context-guard/lib/context_guard_command_manifest_loader.py +123 -0
package/plugins/context-guard/lib/context_guard_commands.py +217 -190

package/README.md CHANGED

@@ -1,22 +1,22 @@
 # ContextGuard
-ContextGuard is a local-first context management toolkit for AI coding and tool agents. It ships first as a Claude Code plugin: install it once, enable it per project, and roll it back when needed.
+ContextGuard is a local-first context-management toolkit for AI coding and tool-using agents. It starts with a Claude Code plugin: install it once, enable it explicitly per project, and roll it back when needed.
-It trims noisy output, steers agents toward symbol-level reads, nudges repeated failures, redacts secret-like patterns, and measures usage. The same guardrails extend to other agents through local helper commands and advisory brief-mode rule snippets.
+It trims noisy output, guides agents toward symbol-level reads, flags repeated failures, redacts secret-like patterns, and measures usage. The same guardrails are reusable by other agents through local helper commands and advisory brief-mode snippets.
 - Korean documentation: [`README.ko.md`](README.ko.md)
 - Static landing page: [GitHub Pages](https://ictechgy.github.io/context-guard/) ([source](docs/index.html))
 ## TL;DR
-Installation and activation are deliberately separate. Installing ContextGuard only makes local helpers or Claude plugin skills available. Configuration changes happen only when you run an explicit setup command.
+Installation and activation are deliberately separate. Installing ContextGuard only makes local helpers or Claude plugin skills available; it does not write configuration until you run an explicit setup command.
 | If you use... | Install | Activate |
 | --- | --- | --- |
 | Claude Code | `/plugin marketplace add ictechgy/context-guard` then `/plugin install context-guard@context-guard` | Run `/context-guard:setup` inside the project. |
 | Codex CLI or any terminal-first agent | `npm install -g @ictechgy/context-guard` or one-shot `npx @ictechgy/context-guard ...` | `context-guard setup --agent codex --scope project --with-init --with-skill --plan`, then rerun with `--yes`. |
 | Other rule-file agents | Use the npm/npx install path above. | `context-guard setup --agent gemini,cursor,windsurf,cline,copilot --scope project --with-init --plan`, then apply only the agents you want. |
-| macOS/Homebrew users | release path: `brew install ictechgy/tap/context-guard` | Same `context-guard setup ...` commands after install. |
+| macOS/Homebrew users | Release path: `brew install ictechgy/tap/context-guard` | Same `context-guard setup ...` commands after install. |
 Common commands:
@@ -29,13 +29,15 @@ context-guard setup --agent claude --scope user --verify --json  # read-only use
 context-guard setup --agent claude --scope user --plan
 ```
-Project scope is the default. User-level setup is opt-in, requires an explicit agent for writes, records backups and rollback metadata, and never runs during package installation. Use `context-guard doctor` or `context-guard setup --verify` for a read-only health check before applying setup. `doctor` reports next commands and makes no changes. Setup resolves bundled or checkout-local helpers first; it does not trust arbitrary `PATH` helpers unless you explicitly pass `--allow-path-helper-fallback` for a known-good install.
+Project scope is the default. User-level setup is opt-in, requires an explicit agent for writes, records backups and rollback metadata, and never runs during package installation. Before applying setup, use `context-guard doctor` or `context-guard setup --verify` for a read-only health check. `doctor` reports next commands and makes no changes. Setup looks for bundled or checkout-local helpers first; it does not trust arbitrary `PATH` helpers unless you explicitly pass `--allow-path-helper-fallback` for a known-good install.
-ContextGuard is intentionally conservative about savings claims. It reduces common sources of context bloat and provides benchmark tooling so you can measure before/after results on your own tasks. It does **not** promise a fixed token or cost reduction for every repository.
+Distribution and helper trust boundaries are conservative too: npm exposes only canonical `context-guard`/`context-guard-*` bin links, legacy `claude-*` wrappers remain package files for path-based migration, command manifests are treated as literal data rather than executable Python, and the macOS visibility helper is discovered only from bundled/resource/executable-relative paths or an absolute explicit override with a minimal child environment. Current working directories, relative overrides, symlinked helpers, arbitrary `PATH`, and ambient shell environment are not trusted by default.
+ContextGuard is intentionally conservative about savings claims. It reduces common sources of context bloat and provides benchmark tooling so you can measure before-and-after results on your own tasks. It does **not** promise a fixed token or cost reduction for every repository.
 ## Claude Code first, other agents too
-ContextGuard ships first as a Claude Code plugin, which is still the fastest path to value for Claude users. After installation, the same local-first guardrails can be reused by other AI coding and tool agents through:
+ContextGuard ships as a Claude Code plugin first, which is still the fastest starting point for Claude users. After installation, the same local-first guardrails can be reused by other AI coding and tool-using agents through:
 - **Local helper commands** (`context-guard-*`) that run as plain shell commands, independent of any specific agent.
 - **Advisory brief-mode rule snippets** that you install into an agent's own instruction file (`AGENTS.md`, `GEMINI.md`, `.cursorrules`, Copilot instructions, and similar rule files) and remove by deleting the marker-delimited block.
@@ -56,7 +58,7 @@ Current setup surfaces:
 ## How ContextGuard reduces token waste
-ContextGuard does not make the model cheaper by itself. It reduces avoidable context before it reaches an AI coding agent, then gives you signals to measure whether the change helped.
+ContextGuard does not change model prices. It reduces avoidable context before it reaches an AI coding agent, then gives you signals to measure whether the change helped.
 | Waste path | ContextGuard guardrail |
 | --- | --- |
@@ -71,14 +73,14 @@ ContextGuard does not make the model cheaper by itself. It reduces avoidable con
 ## How it fits with caching and compression tools
-ContextGuard complements provider and semantic caches, and sits next to prompt compression. Its main job is simpler: **do not send unnecessary files, logs, or output in the first place**.
+ContextGuard complements provider and semantic caches, and works alongside prompt compression. Its main job is simpler: **do not send unnecessary files, logs, or output in the first place**.
 | Tool category | Saves by | ContextGuard relationship |
 | --- | --- | --- |
-| Provider prompt/context caching | Reusing stable prompt prefixes. | Complementary; ContextGuard helps keep the changing tail of context smaller and cleaner, `context-guard-audit` can flag likely volatile prefix layouts, and `context-guard cost` can warn when an Anthropic request is likely to create/cache-write instead of read. |
+| Provider prompt/context caching | Reusing stable prompt prefixes. | Complementary; ContextGuard helps keep the changing tail of context smaller and cleaner, `context-guard-audit` can flag likely volatile prefix layouts, and `context-guard cost` can warn when an Anthropic request is likely to cache-write instead of cache-read. |
 | Semantic response cache | Reusing answers to identical or similar requests. | Complementary; ContextGuard does not serve cached AI answers. |
 | Prompt/context compression | Shortening text that is already selected for the model. | Adjacent; ContextGuard trims and summarizes local output, but does not promise lossless semantic compression. |
-| Experimental planners and local runtimes | Default-off and explicit-command-only; covers local-proxy plans and gate records plus narrow local runtimes for caller-supplied context-diff, visual evidence-pack, learned-compression, and self-hosted metrics evidence. | The local proxy `record` command starts no listener and forwards no traffic; `serve local-proxy` binds and forwards only literal loopback IPs for one bounded request. Compressor/model execution, OCR/crop services, external forwarding, credential persistence, and hosted-savings claims stay out of scope until a separate evidence gate and future PR allow them. |
+| Experimental planners and local runtimes | Default-off and explicit-command-only; covers local-proxy plans and gate records plus narrow local runtimes for caller-supplied context-diff, visual evidence-pack, learned-compression, and self-hosted metrics evidence. | The local proxy `record` command starts no listener and forwards no traffic; `serve local-proxy` binds and forwards only literal loopback IPs for one bounded request; `--response-sandbox` can replace a safe UTF-8 upstream body with a compact local artifact rehydration envelope. Compressor/model execution, OCR/crop services, external forwarding, credential persistence, and hosted-savings claims stay out of scope until a separate evidence gate and future PR allow them. |
 | ContextGuard | Avoiding unnecessary files, logs, repeated failures, and noisy output before they enter agent context. | Local guardrails, reversible artifacts, and measurement. |
 Related patterns that informed the design:
@@ -87,24 +89,24 @@ Related patterns that informed the design:
 | --- | --- | --- |
 | Compression-first | Shortening text already selected for the model, often with lossy transforms. | ContextGuard prefers local artifact storage with exact slice retrieval over lossy one-way compression, so you can get the original back. |
 | Terse-output rulesets across agents | Installing brief-mode output rules into many agents at once. | ContextGuard offers advisory brief-mode snippets and dry-run cross-agent setup — opt-in per project, no guaranteed savings claimed. |
-| ContextGuard | Avoiding unnecessary files, logs, and output before they enter context, with conservative measurement. | Local guardrails, reversible artifacts and retrieval, and benchmark evidence you measure yourself. |
+| ContextGuard | Avoiding unnecessary files, logs, and output before they enter context, with conservative measurement. | Local guardrails, reversible artifacts and retrieval, plus benchmark evidence you measure yourself. |
 ## Brief mode (advisory)
-Brief mode is a set of agent-neutral, advisory rule snippets that ask a coding agent to cut filler while preserving the evidence a reviewer needs: file paths, commands, command output and errors, code blocks, verification status, changed files, known gaps, and caveats. It is best-effort guidance, not enforcement, and does **not** guarantee any token or cost savings.
+Brief mode is a set of agent-neutral, advisory rule snippets that ask a coding agent to cut filler while preserving reviewer evidence: file paths, commands, command output and errors, code blocks, verification status, changed files, known gaps, and caveats. It is best-effort guidance, not enforcement, and does **not** guarantee token or cost savings.
 Three deterministic levels ship under [`plugins/context-guard/brief/`](plugins/context-guard/brief/): `lite`, `standard`, and `ultra`. Each level is a single marker-delimited block for an agent's rule/instruction file (for example `AGENTS.md`, `CLAUDE.md`, a Cursor rules file, or Copilot instructions). Manage it through setup with `context-guard setup --agent codex --scope project --brief-mode standard --plan`, rerun with `--yes` to apply, and use `--brief-mode off` to remove the managed block. See [`plugins/context-guard/brief/README.md`](plugins/context-guard/brief/README.md).
 ## What to measure
-When you need a savings claim, measure it on your own tasks:
+If you need a savings claim, measure it on your own tasks:
 - full-file reads versus symbol or line-range reads
 - raw logs versus digest output or artifact receipts
 - transcript hotspots reported by `context-guard-audit`, including `cache_friendliness` prompt-layout signals and `cache_layout_advice` experiment priorities
 - statusline `cache` / `reuse` as observed transcript/provider-cache signals, not savings caused by ContextGuard
 - `context-guard cost preflight` estimates for Anthropic request JSON, followed by `context-guard cost observe` using provider usage fields (`cache_creation_input_tokens`, `cache_read_input_tokens`) after the call
-- static prompt/request cache layout checks from `context-guard-cache-score`; its char/4 token estimates and warnings are advisory only until provider usage fields confirm real cache hits
+- static prompt/request cache layout checks from `context-guard-cache-score`, including optional user-supplied cache write/read multiplier amortization risk; its char/4 token estimates and warnings are advisory only until provider usage fields confirm real cache hits
 - matched successful baseline/variant runs from `context-guard-bench`
 - large tool/MCP catalogs versus `context-guard-tool-prune` top-k reports plus receipt retrieval
 - optional experimental lanes in [`research/experimental-token-reduction-radar.md`](research/experimental-token-reduction-radar.md); fixture-only starters in [`docs/experimental-benchmark-fixtures.md`](docs/experimental-benchmark-fixtures.md) use the same matched-task benchmark gates before any savings claim
@@ -114,13 +116,14 @@ When you need a savings claim, measure it on your own tasks:
 - It does not guarantee a fixed token or cost reduction.
 - It does not send work to external AI providers to save model tokens.
 - It does not mutate global Claude settings during install.
+- It does not execute command manifests as code or trust arbitrary `PATH`/current-working-directory helpers during setup or packaged smoke checks.
 - It does not replace real before/after measurement when you need a savings claim.
-- Local RAM/disk receipts can reduce what you send next, but they do **not** replace Anthropic's provider prompt cache or guarantee cache hits. Recheck Anthropic prompt-caching and pricing docs before release or billing claims: https://docs.anthropic.com/en/build-with-claude/prompt-caching and https://platform.claude.com/docs/en/about-claude/pricing.
-- Experimental helpers are mostly dry-run checker/planner surfaces, including a design-only external-forwarding opt-in gate. Explicit local runtimes exist only for caller-supplied context-diff replacement payloads, caller-supplied visual crop/OCR evidence packs, caller-supplied learned-compression prose candidates, self-hosted metrics JSONL sidecar records, local-proxy runtime-gate JSONL records, and one-shot `serve local-proxy` loopback forwarding with a private ready-file nonce plus optional shifted-cost diagnostic JSONL rows for successful forwarded requests.
+- Local RAM/disk receipts can help reduce what you send next, but they do **not** replace Anthropic's provider prompt cache or guarantee cache hits. Recheck Anthropic prompt-caching and pricing docs before release or billing claims: https://docs.anthropic.com/en/build-with-claude/prompt-caching and https://platform.claude.com/docs/en/about-claude/pricing.
+- Experimental helpers are mostly dry-run checker/planner surfaces, including a design-only external-forwarding opt-in gate. Explicit local runtimes exist only for caller-supplied context-diff replacement payloads, caller-supplied visual crop/OCR evidence packs, caller-supplied learned-compression prose candidates, self-hosted metrics JSONL sidecar records, local-proxy runtime-gate JSONL records, and one-shot `serve local-proxy` loopback forwarding with a private ready-file nonce, optional `--response-sandbox` compact artifact envelopes for safe UTF-8 responses, plus optional shifted-cost diagnostic JSONL rows for successful forwarded requests.
 - ContextGuard does not ship learned/synthetic compressor execution, embeddings, rerankers, model calls, generated replacement text, screenshot capture, image cropping, OCR execution, image parsing, external OCR/image services, self-hosted KV/latent inference optimization beyond explicit local metrics recording, or broader proxy forwarding beyond literal-loopback, one-request HTTP forwarding with credential material blocked.
 - It does not alias the old `/claude-token-optimizer:*` Claude Code slash-command namespace. Use `/context-guard:*` after installing this plugin.
-Legacy local CLI wrappers (`claude-token-*`, `claude-read-symbol`, `claude-trim-output`, and `claude-sanitize-output`) still ship in `bin/` so existing automation can migrate gradually.
+Legacy local CLI wrappers (`claude-token-*`, `claude-read-symbol`, `claude-trim-output`, and `claude-sanitize-output`) still ship as package files under `plugins/context-guard/bin/` so existing plugin-path automation can migrate gradually. npm global/`npx` bin links intentionally expose only the canonical `context-guard`/`context-guard-*` commands; call the legacy wrappers by package/plugin path if you still need them.
 ## Features
@@ -173,7 +176,7 @@ Setup is explicit, project-local, and reversible. The plugin does not configure
 ## Install with npm/npx
-The npm package exposes a canonical `context-guard` command plus backward-compatible `context-guard-*` helper commands. Package installation is passive: there is no `postinstall` setup hook and no config write until you run `context-guard setup` yourself. If setup cannot find bundled or checkout-local helpers, `PATH` fallback remains disabled by default; use `--allow-path-helper-fallback` only for trusted helper directories after `context-guard doctor` or `setup --verify` confirms the plan.
+The npm package exposes a canonical `context-guard` command plus `context-guard-*` helper commands. Package installation is passive: there is no `postinstall` setup hook and no config write until you run `context-guard setup` yourself. npm global/`npx` bin links intentionally expose only canonical `context-guard`/`context-guard-*` commands; legacy `claude-*` wrapper files remain packaged for explicit path-based migration but are not advertised as executable bin aliases. If setup cannot find bundled or checkout-local helpers, `PATH` fallback remains disabled by default; use `--allow-path-helper-fallback` only for trusted helper directories after `context-guard doctor` or `setup --verify` confirms the plan.
 ```bash
 npm install -g @ictechgy/context-guard
@@ -249,10 +252,11 @@ The optional Read guard uses a progressive path for oversized files: search firs
 ```bash
 long-command 2>&1 | ./plugins/context-guard/bin/context-guard-artifact store --command "long-command" --json
 ./plugins/context-guard/bin/context-guard-artifact search "ERROR" --json
+./plugins/context-guard/bin/context-guard-artifact receipt <artifact_id> --json
 ./plugins/context-guard/bin/context-guard-artifact get <artifact_id> --lines 1:80
 ```
-Artifact mode is for capture, sandbox search, and retrieval. It stores sanitized output under `.context-guard/artifacts` by default and can still read legacy `.claude-token-optimizer/artifacts` receipts from before the rebrand. JSON receipts include line-numbered top-error receipts, duplicate-line groups, and sanitized bounded `suggested_queries` so an agent can fetch the smallest useful exact slice instead of replaying the full log. `search` scans the local sanitized artifact sandbox by literal substring, returns capped match/context records, and includes `context-guard-artifact get ... --lines START:END` rehydration commands for omitted detail. For custom `--dir` values, raw private paths stay redacted by default; rerun with the same `--dir`, or pass `search --show-paths` when you explicitly want a directly executable local command. The search report is local-only and does not make hosted token/cost savings claims. When `--max-lines` accompanies a `--lines START:END` selector, it caps lines returned within that range; it does not expand the selector. Preserve the producer command's exit code yourself when using shell pipelines in release checks, or use `context-guard-trim-output -- ...` when exit-code preservation is the primary requirement.
+Artifact mode is for capture, sandbox search, and retrieval. It stores sanitized output under `.context-guard/artifacts` by default and can still read legacy `.claude-token-optimizer/artifacts` receipts from before the rebrand. JSON receipts include line-numbered top-error receipts, duplicate-line groups, sanitized bounded `suggested_queries`, and an `output_sandbox` envelope with a stable `contextguard-artifact:<id>` handle. Use `context-guard-artifact receipt <artifact_id> --json` to rehydrate metadata-only handles without returning content, then fetch the smallest useful exact slice instead of replaying the full log. `search` scans the local sanitized artifact sandbox by literal substring, returns capped match/context records, and includes `context-guard-artifact get ... --lines START:END` rehydration commands for omitted detail. For custom `--dir` values, raw private paths stay redacted by default; rerun with the same `--dir`, or pass `search --show-paths` when you explicitly want a directly executable local command. The search report is local-only and does not make hosted token/cost savings claims. When `--max-lines` accompanies a `--lines START:END` selector, it caps lines returned within that range; it does not expand the selector. Preserve the producer command's exit code yourself when using shell pipelines in release checks, or use `context-guard-trim-output -- ...` when exit-code preservation is the primary requirement.
 ### Build a budgeted context pack
@@ -267,7 +271,7 @@ Artifact mode is for capture, sandbox search, and retrieval. It stores sanitized
 # Or run the two explicit steps:
 ./plugins/context-guard/bin/context-guard-pack suggest \
   --root . --query "review failing tests" --diff HEAD \
-  --manifest-out suggested-pack.json --budget-bytes 12000 --json --adaptive-k
+  --manifest-out suggested-pack.json --budget-bytes 12000 --json --adaptive-k --adaptive-k-policy recall
 ./plugins/context-guard/bin/context-guard-pack build \
   --root . --manifest suggested-pack.json --budget-bytes 12000 --json
 ./plugins/context-guard/bin/context-guard-pack slice --root . --path README.md --lines 1:40 --json
@@ -280,7 +284,7 @@ A few boundaries are intentional:
 - Add `--explain` for compact deterministic local selection/build reasons in JSON or text output.
 - `--explain` may include bounded `repo_map` metadata: sampled byte/token-proxy tree entries, category-only secret-risk counts, signature-first file hints, explain-only graph ranks, and exact `slice`/symbol retrieval hints.
 - Explain metadata does not change the manifest, pack body, receipt, or byte budget. It does not use network/model/embedding calls, and token values remain local `chars_div_4` proxies rather than provider-token or savings claims.
-- Add `--adaptive-k` to `suggest` or `auto` for advisory-only shrink/expand top-k metadata derived from local score distribution, byte-budget fit, and score-mass recall/precision proxies. It never applies the recommendation automatically and does not change the manifest, pack body, receipt, or byte budget.
+- Add `--adaptive-k` to `suggest` or `auto` for advisory-only shrink/expand top-k metadata derived from local score distribution, byte-budget fit, and clamped score-mass recall/precision proxies. Use `--adaptive-k-policy balanced|recall|precision` plus optional `--adaptive-k-min-recall-proxy` / `--adaptive-k-min-precision-proxy` gates to choose a local recommendation policy; gate failures are metadata-only (`pass|failed`). The adaptive block includes capped selected/omitted evidence and structured source-verification hints, never applies the recommendation automatically, and does not change the manifest, pack body, receipt, or byte budget.
 - Add `--symbol-memory` to `auto` for repo-map-derived symbol/graph advisory metadata with exact `slice` / `read-symbol` verification hints. It is source-verification guidance only and does not change the manifest, pack body, receipt, or byte budget.
 - `--manifest-out` writes a build-compatible manifest; `--pack-out` saves the rendered pack.
 - `context-guard-pack suggest` is the lower-level additive local-only planning step. It ranks candidate files and line ranges from `--query`, `--diff`, repeated `--files`, and optional sanitized `--output` / `--test-output` files under `--root`, then writes a manifest that `build --manifest` can consume.
@@ -303,7 +307,7 @@ The packer uses deterministic standard-library heuristics only: no network, mode
 ./plugins/context-guard/bin/context-guard-tool-prune get <receipt_id> --tool read_file --json
 ```
-`context-guard-tool-prune` ranks a local tool or MCP catalog with deterministic lexical heuristics and emits a bounded top-k advisory report. Inline selected schemas respect an observed UTF-8 byte budget, and omitted or budget-skipped schemas remain recoverable from a compact local receipt plus a separate sanitized payload under `.context-guard/tool-prune`. `defer-report` uses the same receipt path to split a catalog into core inline tools plus deferred tool stubs and namespace summaries. This is advisory only: it does not mutate MCP configuration, does not configure native provider tool search, and token counts remain estimated proxies rather than measured provider savings.
+`context-guard-tool-prune` ranks a local tool or MCP catalog with deterministic lexical heuristics and emits a bounded top-k advisory report. Inline selected schemas respect an observed UTF-8 byte budget, and omitted or budget-skipped schemas remain recoverable from a compact local receipt plus a separate sanitized payload under `.context-guard/tool-prune`. `defer-report` uses the same receipt path to split a catalog into core inline tools plus deferred tool stubs and namespace summaries, and reports gross deferred-schema plus net initial-report char/4 proxy accounting so you can see what moved out of the first prompt. This is advisory only: it does not mutate MCP configuration, does not configure native provider tool search, and token counts remain estimated proxies rather than measured provider savings.
 ### Score static prompt cacheability
@@ -312,7 +316,7 @@ The packer uses deterministic standard-library heuristics only: no network, mode
 ./plugins/context-guard/bin/context-guard cache-score --input prompt.txt --provider anthropic --json
 ```
-`context-guard-cache-score` is a local static lint for prompt/request layout. It estimates total and cacheable-prefix size with a tokenizer-free char/4 proxy, warns about dynamic-looking values near the prefix, and records provider caveats for OpenAI, Anthropic, Gemini, or a generic threshold. It does not call providers, store raw prompts, estimate prices, observe cache hits, or prove token/cost savings; verify real cache behavior with provider usage telemetry.
+`context-guard-cache-score` is a local static lint for prompt/request layout. It estimates total and cacheable-prefix size with a tokenizer-free char/4 proxy, warns about dynamic-looking values near the prefix, and records provider caveats for OpenAI, Anthropic, Gemini, or a generic threshold. Optional `--expected-reuses`, `--cache-write-multiplier`, and `--cache-read-multiplier` inputs add an advisory amortization-risk section using user-supplied economics only. It does not call providers, store raw prompts, estimate prices from bundled defaults, observe cache hits, or prove token/cost savings; verify real cache behavior with provider usage telemetry.
 ### Advise on total cost, batchability, and routing
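Reviewer note on the cache-score hunk above: the new `--expected-reuses`, `--cache-write-multiplier`, and `--cache-read-multiplier` inputs feed an advisory amortization-risk section. The diff does not show how `context-guard-cache-score` computes it, so the following is only a minimal sketch of one plausible reading, not the package's implementation. The function names, the round-up rule for the char/4 proxy, and the cost model (one cache write followed by `expected_reuses` cache reads, compared with paying the full prefix price on every call) are all assumptions made for illustration.

```python
import math


def estimate_tokens(text: str) -> int:
    """Tokenizer-free proxy: about one token per four characters.

    Rounding up is an assumption; the real tool may round differently.
    """
    return math.ceil(len(text) / 4)


def amortization(prefix_tokens: int, expected_reuses: int,
                 write_multiplier: float, read_multiplier: float) -> dict:
    """Compare relative prefix cost with and without provider caching.

    Hypothetical model: the first call writes the cache at
    write_multiplier x base price, and each later reuse reads it at
    read_multiplier x base price. Without caching, every call pays 1x.
    """
    uncached = prefix_tokens * (1 + expected_reuses)
    cached = prefix_tokens * (write_multiplier + expected_reuses * read_multiplier)
    return {
        "uncached_units": uncached,
        "cached_units": cached,
        "caching_pays_off": cached < uncached,
    }


def break_even_reuses(write_multiplier: float, read_multiplier: float):
    """Smallest reuse count at which caching is strictly cheaper.

    Returns None when cached reads cost at least as much as uncached
    input, because the write premium can then never be recovered.
    """
    if read_multiplier >= 1:
        return None
    threshold = (write_multiplier - 1) / (1 - read_multiplier)
    return max(0, math.floor(threshold) + 1)


if __name__ == "__main__":
    # Multipliers are user-supplied illustrative values, not bundled defaults.
    prefix = estimate_tokens("You are a careful reviewer. " * 200)
    print(amortization(prefix, expected_reuses=0,
                       write_multiplier=1.25, read_multiplier=0.1))
    print(break_even_reuses(1.25, 0.1))
```

Under this model, a prefix that is written once and never reused costs more with caching than without. That is the kind of risk an amortization warning would be expected to surface, and like the tool's own output it stays advisory until provider usage fields confirm real cache reads.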
@@ -344,7 +348,7 @@ Add `--mode readable` only for sanitized prose previews. It uses a deterministic
 ./plugins/context-guard/bin/context-guard-trim-output --max-lines 120 -- npm test
 ```
-Use `--digest markdown` or `--digest json` for a compact semantic digest instead of head/tail logs. Digest mode keeps status, exit code, truncation counts, runner failure facts, a sanitized failure signature, duplicate-line groups, representative lines, redaction counts, and suggested next queries while preserving the wrapped command exit code. Add `--artifact-receipt` with digest mode when you want the exact sanitized full output stored locally as a `context-guard-artifact` receipt; re-expand with the emitted `context-guard-artifact get ...` command before relying on omitted details. Wrapped commands time out after 600 seconds by default; tune this with `--timeout-seconds`.
+Use `--digest markdown` or `--digest json` for a compact semantic digest instead of head/tail logs. Digest mode keeps status, exit code, truncation counts, runner failure facts, a sanitized failure signature, duplicate-line groups, representative lines, redaction counts, and suggested next queries while preserving the wrapped command exit code. Add `--artifact-receipt` with digest mode when you want the exact sanitized full output stored locally as a `context-guard-artifact` receipt; keep the emitted `contextguard-artifact:<id>` handle in agent context and re-expand with the emitted `context-guard-artifact receipt/get/search ...` commands before relying on omitted details. Wrapped commands time out after 600 seconds by default; tune this with `--timeout-seconds`.
 ### Sanitize search and diff output
@@ -406,20 +410,25 @@ These fields can flag likely volatile content near the prompt prefix, stable-pre
 ```bash
 ./plugins/context-guard/bin/context-guard-bench \
   --tasks bench/tasks.json --variants bench/variants.json --csv bench/results.csv \
-  --ledger-jsonl bench/cost-shift.jsonl --report-json bench/report.json
+  --ledger-jsonl bench/cost-shift.jsonl --report-json bench/report.json \
+  --dashboard-md bench/dashboard.md
 ```
+For deterministic local replay before a live provider run, add `--evidence-jsonl docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl` and, for the 12-task fixture, `--baseline-variant baseline_full_context_fixture`. Replay mode skips provider and `success_command` execution, writes the same CSV/report/dashboard surfaces, and marks synthetic/manual evidence as non-public-claim-eligible.
 Read the report through its claim boundaries before writing any savings statement:
 - Successful baseline/variant runs are compared by real tokens and `cost_usd + external_cost_usd`; byte reductions stay proxy evidence.
 - Token-savings claims require `primary_tokens_measured` on both sides of a matched task.
 - `matched_pair_evidence` links each successful task bucket to the transform, measurement availability, quality gate, and claim boundary.
+- `default_matrix` classifies trimming, artifact escrow, tool pruning, cache advice, adaptive-k, and optional compression as `default-on`, `advisory`, `experimental`, or `reject/rework` from the same matched evidence. The matrix is report-only: it does not change runtime defaults or authorize hosted token/cost savings claims.
+- `public_claim_readiness` is the authoritative release/public-claim gate. It remains false unless matched successful tasks, provider-measured primary tokens/cost, quality non-inferiority, shifted-cost accounting, explicit confidence/failure notes, and complete provider-export provenance all pass; unsupported hosted savings claims are forbidden when `claim_allowed` is false.
 - `wall_time_seconds`, `provider_cached_tokens`, and `provider_cached_tokens_measured` are diagnostic telemetry, not proof of ContextGuard-caused token or cost savings.
 - Optional `self_hosted_metrics` from provider payloads are stored as per-row JSONL ledger sidecars, kept out of CSV/report summaries, and must not be folded into hosted API token/cost savings claims.
 - If cost fields are zero or unavailable, the report can still mark token savings but will not claim shifted-cost savings.
 - CSV schemas are strict; after upgrading the benchmark helper, start a new `--csv` file or migrate the header named in the mismatch error.
-See [`docs/benchmark-report.example.json`](docs/benchmark-report.example.json) for a minimal report-shape example, [`docs/benchmark-workflow-examples.md`](docs/benchmark-workflow-examples.md) for workflow-specific synthetic examples, and [`docs/experimental-benchmark-fixtures.md`](docs/experimental-benchmark-fixtures.md) for fixture-only experimental task/variant starters.
+See [`docs/benchmark-report.example.json`](docs/benchmark-report.example.json) for a minimal report-shape example, [`docs/benchmark-workflow-examples.md`](docs/benchmark-workflow-examples.md) for workflow-specific synthetic examples, and [`docs/experimental-benchmark-fixtures.md`](docs/experimental-benchmark-fixtures.md) for fixture-only experimental task/variant starters plus synthetic evidence replay.
 ### Manage experimental opt-ins
@@ -439,7 +448,7 @@ context-guard experiments record self-hosted-metrics-ledger --ledger-jsonl .cont
 context-guard experiments plan local-proxy --json --bind-host 127.0.0.1 --target-host 127.0.0.1 --runtime-gate-ack
 context-guard experiments plan local-proxy-external-forwarding --external-forwarding-intent --external-forwarding-design-ack --allow-host api.example.com --allow-scheme https --credential-redaction-policy strip-sensitive-headers --provider-evidence-boundary diagnostic-only-provider-measured-required --threat-model-note "Only user-owned HTTPS endpoint; sensitive headers are stripped before any future forwarding." --json
 context-guard experiments record local-proxy-runtime-gate --ledger-jsonl .context-guard/local-proxy-gates.jsonl --bind-host 127.0.0.1 --target-host 127.0.0.1 --runtime-gate-ack --json
-context-guard experiments serve local-proxy --bind-host 127.0.0.1 --bind-port 18080 --target-host 127.0.0.1 --target-port 18081 --runtime-gate-ack --forwarding-gate-ack --once --ready-file .context-guard/local-proxy-ready.json --diagnostic-ledger-jsonl .context-guard/local-proxy-diagnostics.jsonl --json
+context-guard experiments serve local-proxy --bind-host 127.0.0.1 --bind-port 18080 --target-host 127.0.0.1 --target-port 18081 --runtime-gate-ack --forwarding-gate-ack --once --ready-file .context-guard/local-proxy-ready.json --response-sandbox --response-artifact-dir .context-guard/artifacts --diagnostic-ledger-jsonl .context-guard/local-proxy-diagnostics.jsonl --json
 context-guard experiments enable output-receipt-trim --root .
 context-guard experiments disable output-receipt-trim --root .
 ```
@@ -448,7 +457,7 @@ The local-proxy examples are intentionally split by side effect:
 - `plan local-proxy` produces advisory metadata only; it does not enable forwarding.
 - `record local-proxy-runtime-gate` appends one localhost-only gate row and still starts no listener, forwards no traffic, persists no API keys, and makes no hosted-savings claim.
-- `serve local-proxy` is the separate MVP. It requires both runtime and forwarding acknowledgements plus `--once`, a private `--ready-file` nonce handoff for the forwarding client, binds only a literal loopback IP, forwards only to a literal loopback IP target, blocks credential-bearing requests, uses byte/time limits, uses literal IPs instead of hostname DNS targets, does not persist API keys, and does not support external forwarding, CONNECT/TLS proxying, or hosted-savings claims.
+- `serve local-proxy` is the separate MVP. It requires both runtime and forwarding acknowledgements plus `--once`, a private `--ready-file` nonce handoff for the forwarding client, binds only a literal loopback IP, forwards only to a literal loopback IP target, blocks credential-bearing requests, uses byte/time limits, uses literal IPs instead of hostname DNS targets, does not persist API keys, and does not support external forwarding, CONNECT/TLS proxying, or hosted-savings claims. Optional `--response-sandbox` is a mediated response mode, not transparent forwarding: it artifacts only safe UTF-8 upstream response text and returns a compact JSON envelope with `contextguard-artifact:<id>` and rehydration commands; binary, sensitive, oversized, or blocked responses are not artifacted.
 - With `--diagnostic-ledger-jsonl`, `serve` appends one shifted-cost diagnostic row only after a successful forwarded request. The row stores hashes/metadata rather than raw headers, request bodies, response bodies, or hosted-savings evidence.
 - `plan local-proxy-external-forwarding` is a dry-run design gate only. It requires explicit external intent, design acknowledgement, HTTPS host allowlist, threat model notes, credential redaction policy, and provider-evidence boundary, but starts no listener, performs no DNS lookup, calls no external service, forwards no traffic, persists no credentials, and does not ship an external proxy forwarding runtime.
@@ -462,11 +471,11 @@ Shipped experimental checker/planner surfaces, plus explicit local context-diff,
 | `visual-crop-ocr` | Dry-run visual evidence advice plus an explicit `emit visual-crop-ocr` runtime for caller-supplied evidence packs. | `emit` requires a full visual evidence receipt, missed-context note, and complete user-supplied crop and/or OCR evidence; ContextGuard does not capture screenshots, crop images, run OCR, parse images, call external services, write files, or support hosted token/cost savings claims. |
 | `learned-compression` | Deny-by-default policy checks plus an explicit `emit learned-compression` runtime for caller-supplied compact prose candidates with verified exact fallback content. | `emit` requires sanitized trusted prose, protected-signal denial, a verified local fallback artifact matching the input, and a smaller caller-supplied prose candidate; ContextGuard does not run compressors, embeddings, rerankers, model calls, subprocesses, external services, generated replacement text, or hosted savings claims. |
 | `self-hosted-metrics-ledger` | Dry-run preview plus an explicit `record ... --ledger-jsonl` runtime for local/model-server latency, memory, quality, energy, throughput, and local-cost metrics. | The dry-run preview does not write a ledger; the explicit record command writes only local JSONL sidecars and still does not support hosted API token/cost savings claims. |
-| `local-proxy` | Localhost-only advisory metadata, design-only `plan local-proxy-external-forwarding` review for future external forwarding, an explicit `record local-proxy-runtime-gate --ledger-jsonl` runtime for one local gate row, an explicit one-shot `serve local-proxy` loopback forwarding MVP, and optional `--diagnostic-ledger-jsonl` shifted-cost diagnostics for successful forwarded requests. | `plan` writes no ledger. `record` writes only after localhost-only metadata and `--runtime-gate-ack`; it starts no listener, forwards no traffic, and performs no DNS lookup. `serve` additionally requires `--forwarding-gate-ack --once`, a private `--ready-file` nonce handoff, literal loopback bind/target IPs, nonzero ports, bounded bytes/timeouts, and credential-free requests; it performs no external forwarding, no CONNECT/TLS proxying, no API-key persistence, and no hosted-savings claim. `--diagnostic-ledger-jsonl` writes only successful-forward diagnostics with no raw headers/bodies and no hosted-savings claim. `plan local-proxy-external-forwarding` emits threat-model/allowlist/redaction/provider-evidence design metadata only and still performs no DNS lookup, external service call, traffic forwarding, credential persistence, or hosted-savings claim. |
+| `local-proxy` | Localhost-only advisory metadata, design-only `plan local-proxy-external-forwarding` review for future external forwarding, an explicit `record local-proxy-runtime-gate --ledger-jsonl` runtime for one local gate row, an explicit one-shot `serve local-proxy` loopback forwarding MVP, optional `--response-sandbox` compact artifact envelopes, and optional `--diagnostic-ledger-jsonl` shifted-cost diagnostics for successful forwarded requests. | `plan` writes no ledger. `record` writes only after localhost-only metadata and `--runtime-gate-ack`; it starts no listener, forwards no traffic, and performs no DNS lookup. `serve` additionally requires `--forwarding-gate-ack --once`, a private `--ready-file` nonce handoff, literal loopback bind/target IPs, nonzero ports, bounded bytes/timeouts, and credential-free requests; it performs no external forwarding, no CONNECT/TLS proxying, no API-key persistence, and no hosted-savings claim. `--response-sandbox` can store safe UTF-8 response text as a sanitized local artifact receipt and return a compact envelope with redacted rehydration command templates; it does not claim hosted token/cost savings. `--diagnostic-ledger-jsonl` writes only successful-forward diagnostics with no raw headers/bodies and no hosted-savings claim. `plan local-proxy-external-forwarding` emits threat-model/allowlist/redaction/provider-evidence design metadata only and still performs no DNS lookup, external service call, traffic forwarding, credential persistence, or hosted-savings claim. |
 ## What is not yet shipped
-These are directions the project has tracked, not committed features. Nothing here ships unless documented elsewhere in the repository.
+These are tracked directions, not committed features. Nothing here ships unless another repository document says it does.
 ContextGuard does not yet ship:
@@ -515,7 +524,7 @@ export PATH="$PWD/plugins/context-guard/bin:$PATH"
 context-guard-setup --plan
 ```
-Do not rely on `PATH` lookup for generated hooks by default. The setup wizard records explicit bundled or checkout-local helper paths; `--allow-path-helper-fallback` is only for trusted external installs and validates the resolved helper before writing commands.
+Do not rely on `PATH` lookup for generated hooks by default. The setup wizard records explicit bundled or checkout-local helper paths; `--allow-path-helper-fallback` is only for trusted external installs and validates the resolved helper path, symlink state, and bounded identity probe before writing commands. The macOS app helper follows the same trust model: no launch-CWD discovery, no relative override paths, and no inherited ambient shell environment beyond the allowlisted values it needs to start.
 ## Release checks
@@ -527,7 +536,7 @@ python3 scripts/prepublish_check.py
 python3 scripts/release_smoke.py
 ```
-When a helper under `context-guard-kit/` changes, run `python3 scripts/sync_plugin_copies.py --write` before the gates. `sync_plugin_copies.py --check` verifies the maintainer-facing exact-copy contract up front. npm packages intentionally ship only the synchronized plugin-local `plugins/context-guard/bin` entrypoints and `plugins/context-guard/lib` helpers to avoid duplicate implementation payloads. `prepublish_check.py` verifies package invariants, synchronized plugin binaries, manifests, diagnostic redaction, and the regression suite. `release_smoke.py` executes representative packaged entrypoints from `plugins/context-guard/bin` in a temporary project so broken CLI wiring is caught before publish. See [docs/release-runbook.md](docs/release-runbook.md) for the full release workflow, evidence checklist, quad-review requirement, and rollback checklist.
+When a helper under `context-guard-kit/` changes, run `python3 scripts/sync_plugin_copies.py --write` before the gates. `sync_plugin_copies.py --check` verifies the maintainer-facing exact-copy contract up front. npm packages intentionally ship only the synchronized plugin-local `plugins/context-guard/bin` entrypoints and `plugins/context-guard/lib` helpers to avoid duplicate implementation payloads, and the npm bin map intentionally omits legacy `claude-*` wrapper aliases. Command manifests are loaded as literal assignments for release and runtime checks; executable Python, imports, functions, or shadow manifests are rejected. `prepublish_check.py` verifies package invariants, synchronized plugin binaries, manifests, diagnostic redaction, and the regression suite. `release_smoke.py` executes representative packaged entrypoints from `plugins/context-guard/bin` in a temporary project so broken CLI wiring is caught before publish. See [docs/release-runbook.md](docs/release-runbook.md) for the full release workflow, evidence checklist, quad-review requirement, and rollback checklist.
 Versioned release notes live in [CHANGELOG.md](CHANGELOG.md); the prepublish gate requires an entry matching the plugin manifest version before publishing.
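
The README text in this diff describes `context-guard-cache-score` as estimating total and cacheable-prefix size with a tokenizer-free char/4 proxy. A minimal sketch of that kind of estimate follows. The function names, the round-up behavior, and the output keys are assumptions made for illustration; they are not taken from the packaged helper.

```python
import math


def estimate_tokens(text: str) -> int:
    """Tokenizer-free proxy: about one token per four characters.

    Rounding up is an assumption; the real helper may round differently.
    """
    return math.ceil(len(text) / 4)


def prefix_report(stable_prefix: str, dynamic_suffix: str) -> dict:
    """Estimate total and cacheable-prefix size for a two-part prompt layout."""
    return {
        "total_tokens_est": estimate_tokens(stable_prefix + dynamic_suffix),
        "cacheable_prefix_tokens_est": estimate_tokens(stable_prefix),
    }


if __name__ == "__main__":
    report = prefix_report("You are a code reviewer. " * 40, "Review this diff now.")
    print(report)
```

Like the README's description, an estimate of this kind is a static proxy only; it says nothing about observed cache hits or real token counts.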

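The README hunk above states that token-savings claims require `primary_tokens_measured` on both sides of a matched task. A rough sketch of that pairing check against evidence rows shaped like the fixture below might look as follows. The helper names are hypothetical, and summing `input_tokens` and `output_tokens` is an assumption about how a comparison could be made; this is not the logic of `context-guard-bench`.

```python
import json
from collections import defaultdict

# Baseline variant name as used in the 12-task fixture.
BASELINE = "baseline_full_context_fixture"


def load_rows(lines):
    """Parse JSONL evidence rows, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def total_tokens(row):
    """Assumed comparison metric: input plus output tokens."""
    tokens = row["tokens"]
    return tokens["input_tokens"] + tokens["output_tokens"]


def matched_pairs(rows, baseline=BASELINE):
    """Pair each successful variant row with the successful baseline row of the same task."""
    by_task = defaultdict(dict)
    for row in rows:
        if row.get("success"):
            by_task[row["task_id"]][row["variant"]] = row
    pairs = []
    for task_id, variants in sorted(by_task.items()):
        base = variants.get(baseline)
        if base is None:
            continue  # no successful baseline, so nothing to match against
        for name, row in sorted(variants.items()):
            if name == baseline:
                continue
            pairs.append({
                "task_id": task_id,
                "variant": name,
                "baseline_tokens": total_tokens(base),
                "variant_tokens": total_tokens(row),
                # Eligible only when both sides carry measured primary tokens.
                "claim_eligible": bool(
                    base.get("primary_tokens_measured")
                    and row.get("primary_tokens_measured")
                ),
            })
    return pairs


if __name__ == "__main__":
    # Two rows abbreviated from the fixture (only the fields used here).
    sample = [
        '{"task_id": "token_savings_01_bugfix", "variant": "baseline_full_context_fixture", '
        '"success": true, "primary_tokens_measured": false, '
        '"tokens": {"input_tokens": 1715, "output_tokens": 229}}',
        '{"task_id": "token_savings_01_bugfix", "variant": "fixture_only_contextguard_advisory_foundations", '
        '"success": true, "primary_tokens_measured": false, '
        '"tokens": {"input_tokens": 1131, "output_tokens": 210}}',
    ]
    for pair in matched_pairs(load_rows(sample)):
        print(pair)
```

With the synthetic fixture rows, every pair comes back with `claim_eligible` set to false, which matches the fixture's own note that these rows are not provider measured and not public-claim eligible.
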
package/docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl ADDED

@@ -0,0 +1,24 @@
+{"artifacts_used": 0, "bytes_after": 9450, "bytes_before": 9450, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_01_bugfix", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1715, "output_tokens": 229}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 11.17}
+{"artifacts_used": 1, "bytes_after": 5481, "bytes_before": 9450, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_01_bugfix", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1131, "output_tokens": 210}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 11.13}
+{"artifacts_used": 0, "bytes_after": 9900, "bytes_before": 9900, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_02_exploration", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1830, "output_tokens": 238}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 11.34}
+{"artifacts_used": 1, "bytes_after": 5742, "bytes_before": 9900, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_02_exploration", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1207, "output_tokens": 218}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 11.26}
+{"artifacts_used": 0, "bytes_after": 10350, "bytes_before": 10350, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_03_code_review", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1945, "output_tokens": 247}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 11.51}
+{"artifacts_used": 1, "bytes_after": 6003, "bytes_before": 10350, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_03_code_review", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1283, "output_tokens": 227}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 11.39}
+{"artifacts_used": 0, "bytes_after": 10800, "bytes_before": 10800, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_04_long_log_analysis", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2060, "output_tokens": 256}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 11.68}
+{"artifacts_used": 1, "bytes_after": 6264, "bytes_before": 10800, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_04_long_log_analysis", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1359, "output_tokens": 235}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 11.52}
+{"artifacts_used": 0, "bytes_after": 11250, "bytes_before": 11250, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_05_migration", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2175, "output_tokens": 265}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 11.85}
+{"artifacts_used": 1, "bytes_after": 6525, "bytes_before": 11250, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_05_migration", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1435, "output_tokens": 243}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 11.65}
+{"artifacts_used": 0, "bytes_after": 11700, "bytes_before": 11700, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_06_docs", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2290, "output_tokens": 274}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 12.02}
+{"artifacts_used": 1, "bytes_after": 6785, "bytes_before": 11700, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_06_docs", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1511, "output_tokens": 252}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 11.78}
+{"artifacts_used": 0, "bytes_after": 12150, "bytes_before": 12150, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_07_refactor", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2405, "output_tokens": 283}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 12.19}
+{"artifacts_used": 1, "bytes_after": 7046, "bytes_before": 12150, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_07_refactor", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1587, "output_tokens": 260}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 11.91}
+{"artifacts_used": 0, "bytes_after": 12600, "bytes_before": 12600, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_08_performance", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2520, "output_tokens": 292}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 12.36}
+{"artifacts_used": 1, "bytes_after": 7307, "bytes_before": 12600, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_08_performance", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1663, "output_tokens": 268}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 12.04}
+{"artifacts_used": 0, "bytes_after": 13050, "bytes_before": 13050, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_09_telemetry", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2635, "output_tokens": 301}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 12.53}
+{"artifacts_used": 1, "bytes_after": 7568, "bytes_before": 13050, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_09_telemetry", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1739, "output_tokens": 276}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 12.17}
+{"artifacts_used": 0, "bytes_after": 13500, "bytes_before": 13500, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_10_cache_layout", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2750, "output_tokens": 310}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 12.7}
+{"artifacts_used": 1, "bytes_after": 7829, "bytes_before": 13500, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_10_cache_layout", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1815, "output_tokens": 285}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 12.3}
+{"artifacts_used": 0, "bytes_after": 13950, "bytes_before": 13950, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_11_tool_schema", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2865, "output_tokens": 319}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 12.87}
+{"artifacts_used": 1, "bytes_after": 8090, "bytes_before": 13950, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_11_tool_schema", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1890, "output_tokens": 293}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 12.43}
+{"artifacts_used": 0, "bytes_after": 14400, "bytes_before": 14400, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 0, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_12_artifact_receipt", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 2980, "output_tokens": 328}, "turns": 3, "variant": "baseline_full_context_fixture", "wall_time_seconds": 13.04}
+{"artifacts_used": 1, "bytes_after": 8352, "bytes_before": 14400, "corrections": 0, "cost_measured": false, "cost_usd": 0.0, "effort": "medium", "external_cost_measured": true, "external_cost_usd": 0.0, "external_tokens": 0, "external_tokens_measured": true, "hook_triggers": 1, "model": "sonnet", "notes": "synthetic fixture-only replay row; not provider measured and not public-claim eligible", "primary_tokens_measured": false, "provenance": {"capture_command_or_export_id": "docs/benchmark-fixtures/token-savings-12task.evidence.example.jsonl", "claim_scope": "local_replay_fixture_not_public_claim", "evidence_source_type": "synthetic_fixture"}, "provider_cached_tokens": 0, "provider_cached_tokens_measured": false, "schema_version": "contextguard.bench.run-evidence.v1", "success": true, "task_id": "token_savings_12_artifact_receipt", "tokens": {"cache_creation": 0, "cache_read": 0, "input_tokens": 1966, "output_tokens": 301}, "turns": 2, "variant": "fixture_only_contextguard_advisory_foundations", "wall_time_seconds": 12.56}

package/docs/benchmark-workflow-examples.md CHANGED

@@ -26,6 +26,7 @@ Use them to decide what evidence a workflow has and what it does **not** prove:
 3. Treat `comparisons[].quality_gate != "pass"` as a warning to inspect failures, correction burden, and unmatched tasks before discussing savings.
 4. Keep byte-proxy, provider-cache, wall-time, and shifted-cost evidence in separate language from provider-measured token/cost claims. Provider-cache telemetry is not independent savings proof.
 5. Keep self-hosted local/model-server latency, memory, and quality metrics in the run-evidence ledger sidecar; do not fold them into hosted API token/cost savings claims unless provider-measured matched-task evidence separately supports that claim.
+6. For deterministic local replay, add `--evidence-jsonl ... --dashboard-md ...`. Synthetic/manual replay evidence regenerates CSV/report/dashboard artifacts, but the report is marked `replay_only_not_public_claim` or `unknown_mixed_csv` unless every report row has complete provider-export provenance. Public hosted savings claims must additionally have `public_claim_readiness.claim_allowed=true`, which requires matched successful tasks, provider-measured token/cost, quality non-inferiority, shifted-cost accounting, explicit confidence/failure notes, and complete provider-export provenance.
 ## Safe wording
@@ -42,3 +43,5 @@ The `.example.json` fixtures intentionally use full `context-guard-bench-report-
 The self-hosted metrics example is a JSONL run-evidence sidecar, not a full report shape. Its fields are additive ledger evidence only: `latency_ms`, `peak_memory_mb`, and normalized `quality_score` describe local/model-server behavior and leave hosted API report calculations unchanged. Use `context-guard experiments plan self-hosted-metrics-ledger --json ...` only as a dry-run ledger-preview checker for explicit metrics; it does not write the benchmark ledger.
 For task/variant starter fixtures rather than full report-shape examples, see [`experimental-benchmark-fixtures.md`](experimental-benchmark-fixtures.md). Those files are fixture-only and synthetic dry-run-only starters until users replace the placeholder prompts and success checks; they are not shipped OCR, visual-token, learned-compression, or output-transform benchmark results, and real claims still require provider-measured matched successful tasks plus failure-rate, correction, and shifted-cost guardrails.
+The token-savings 12-task starter also includes [`benchmark-fixtures/token-savings-12task.evidence.example.jsonl`](benchmark-fixtures/token-savings-12task.evidence.example.jsonl) for `context-guard-bench --evidence-jsonl` replay. That file is synthetic local replay evidence, not provider-measured savings proof; use it to validate dashboards and claim-boundary handling before collecting real provider exports.
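
To sanity-check the replay fixture by hand, the following is a minimal sketch that pairs baseline and candidate rows by `task_id` and compares them. It is not how `context-guard-bench` computes its report. The helper name `summarize` and the output keys are hypothetical; the field names and variant labels are taken from the fixture rows shown in this diff, and the two sample rows are trimmed to the fields the sketch reads.

```python
import json
from collections import defaultdict

# Variant labels as they appear in the synthetic fixture rows.
BASELINE = "baseline_full_context_fixture"
CANDIDATE = "fixture_only_contextguard_advisory_foundations"


def summarize(lines):
    """Pair baseline/candidate rows by task_id (hypothetical helper).

    Returns one entry per matched task; unmatched tasks are skipped.
    """
    by_task = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL input
        row = json.loads(line)
        by_task[row["task_id"]][row["variant"]] = row

    summary = {}
    for task_id, variants in sorted(by_task.items()):
        base, cand = variants.get(BASELINE), variants.get(CANDIDATE)
        if base is None or cand is None:
            continue  # no matched pair, so no comparison
        pair = (base, cand)
        summary[task_id] = {
            "input_token_delta": base["tokens"]["input_tokens"]
            - cand["tokens"]["input_tokens"],
            "byte_ratio": round(cand["bytes_after"] / base["bytes_before"], 3),
            # Flags that keep replay numbers out of public-claim language.
            "synthetic": any(
                r["provenance"]["evidence_source_type"] == "synthetic_fixture"
                for r in pair
            ),
            "provider_measured": all(r["primary_tokens_measured"] for r in pair),
        }
    return summary


# Two rows trimmed from the fixture (task 09) to the fields used above.
sample_lines = [
    json.dumps({
        "task_id": "token_savings_09_telemetry", "variant": BASELINE,
        "bytes_before": 13050, "bytes_after": 13050,
        "primary_tokens_measured": False,
        "provenance": {"evidence_source_type": "synthetic_fixture"},
        "tokens": {"input_tokens": 2635, "output_tokens": 301},
    }),
    json.dumps({
        "task_id": "token_savings_09_telemetry", "variant": CANDIDATE,
        "bytes_before": 13050, "bytes_after": 7568,
        "primary_tokens_measured": False,
        "provenance": {"evidence_source_type": "synthetic_fixture"},
        "tokens": {"input_tokens": 1739, "output_tokens": 276},
    }),
]

print(summarize(sample_lines))
```

For the task 09 pair, the sketch reports an input-token delta of 896 and a byte ratio of 0.58, with `synthetic` true and `provider_measured` false. Those flags are the reason the numbers stay in replay-only language rather than being read as savings proof.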