@ictechgy/context-guard 0.4.4 → 0.4.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (32)
  1. package/CHANGELOG.md +15 -0
  2. package/README.ko.md +15 -2
  3. package/README.md +15 -4
  4. package/context-guard-kit/README.md +2 -2
  5. package/context-guard-kit/benchmark_runner.py +244 -6
  6. package/context-guard-kit/claude_transcript_cost_audit.py +171 -1
  7. package/context-guard-kit/context_pack.py +82 -35
  8. package/context-guard-kit/cost_guard.py +457 -45
  9. package/docs/benchmark-fixtures/learned-compression-baseline-context-pack.prompt.example.md +19 -0
  10. package/docs/benchmark-fixtures/learned-compression-candidate-digest.prompt.example.md +21 -0
  11. package/docs/benchmark-fixtures/learned-compression.tasks.example.json +5 -1
  12. package/docs/benchmark-fixtures/output-transform-baseline-raw-output.prompt.example.md +20 -0
  13. package/docs/benchmark-fixtures/output-transform-digest-receipt.prompt.example.md +23 -0
  14. package/docs/benchmark-fixtures/output-transform.tasks.example.json +28 -0
  15. package/docs/benchmark-fixtures/output-transform.variants.example.json +10 -0
  16. package/docs/benchmark-fixtures/visual-ocr-cropped-ocr.prompt.example.md +22 -0
  17. package/docs/benchmark-fixtures/visual-ocr-full-visual.prompt.example.md +19 -0
  18. package/docs/benchmark-fixtures/visual-ocr.tasks.example.json +5 -1
  19. package/docs/benchmark-workflow-examples.md +6 -2
  20. package/docs/benchmark-workflows/self-hosted-metrics-ledger.example.jsonl +1 -0
  21. package/docs/experimental-benchmark-fixtures.md +17 -6
  22. package/docs/mac-visibility-feasibility-schema.md +62 -0
  23. package/docs/mac-visibility-feasibility.example.json +130 -0
  24. package/package.json +5 -1
  25. package/packaging/homebrew/context-guard.rb.template +1 -1
  26. package/plugins/context-guard/.claude-plugin/plugin.json +1 -1
  27. package/plugins/context-guard/README.ko.md +1 -1
  28. package/plugins/context-guard/README.md +1 -1
  29. package/plugins/context-guard/bin/context-guard-audit +171 -1
  30. package/plugins/context-guard/bin/context-guard-bench +244 -6
  31. package/plugins/context-guard/bin/context-guard-cost +457 -45
  32. package/plugins/context-guard/bin/context-guard-pack +82 -35
@@ -0,0 +1,21 @@
1
+ Fixture-only candidate prompt for learned-compression experiment setup.
2
+
3
+ You are reviewing an already-sanitized compressed digest candidate. This is synthetic benchmark input only. No learned compressor, latent helper, embedding model, reranker, or provider call is shipped or invoked by this fixture.
4
+
5
+ Sanitized evidence only: private paths, endpoints, screenshots, secrets, raw credentials, and unsanitized logs do not belong in this fixture. Protected evidence no semantic rewrite: protected identifiers, constants, hashes, paths, quoted strings, stack frames, JSON keys, code fences, and diff zones must remain exact or receipt-retrievable.
6
+
7
+ Compressed digest candidate:
8
+ - candidate id: fixture-compression-alpha
9
+ - digest summary: sample_module.py branch returns quoted string `retry` after numeric constant `3` attempts
10
+ - protected evidence preserved exactly: identifier `sample_status`, numeric constant `3`, quoted string `retry`, JSON key `status`, and stack frame label `sample_module:31`
11
+ - omitted protected context: sample_helper.py lines 1:80
12
+ - receipt fallback: fixture-receipt-alpha
13
+ - exact retrieval fallback: context-guard-pack slice --path sample_helper.py --lines 1:80
14
+
15
+ Task:
16
+ 1. Decide whether required evidence is exact or receipt-retrievable.
17
+ 2. Identify any protected evidence that would make semantic rewrite unsafe.
18
+ 3. State that digest size, byte ratios, and receipt availability are proxy or retrieval evidence only, not hosted API token or cost savings evidence.
19
+ 4. State that real comparisons require provider-measured primary token/cost fields on matched successful tasks, plus a failure-rate guardrail, human corrections, and shifted-cost accounting.
20
+
21
+ This prompt is dry-run-only fixture scaffolding and does not claim hosted API savings.
@@ -8,7 +8,11 @@
8
8
  "max_budget_usd": 1.0,
9
9
  "allowed_tools": [],
10
10
  "success_command": "python3 -c \"raise SystemExit('fixture-only placeholder: replace success_command before real benchmark runs')\"",
11
- "success_cwd": "."
11
+ "success_cwd": ".",
12
+ "variant_prompt_files": {
13
+ "baseline_uncompressed_fixture": "learned-compression-baseline-context-pack.prompt.example.md",
14
+ "fixture_only_learned_compression_candidate": "learned-compression-candidate-digest.prompt.example.md"
15
+ }
12
16
  },
13
17
  {
14
18
  "id": "learned_compression_artifact_digest_fixture",
@@ -0,0 +1,20 @@
1
+ Fixture-only raw-output prompt for reversible output-transform A/B setup.
2
+
3
+ You are reviewing an already-sanitized command transcript. Treat this as synthetic benchmark input only.
4
+
5
+ Raw sanitized command output:
6
+ - command: python3 -m unittest sample_suite
7
+ - status: failed
8
+ - summary: one assertion failed in sample_test_alpha
9
+ - excerpt line 01: expected status ok
10
+ - excerpt line 02: actual status retry
11
+ - excerpt line 03: sanitized stack frame in sample_module
12
+ - excerpt line 04: sanitized assertion message
13
+ - excerpt line 05: sanitized context marker
14
+
15
+ Task:
16
+ 1. Identify the failing command and failing check.
17
+ 2. Explain whether the visible raw output is enough to diagnose the synthetic failure.
18
+ 3. State that real token or cost comparisons require provider-measured telemetry on matched successful tasks, a failure-rate guardrail, human corrections, and shifted-cost accounting.
19
+
20
+ This prompt is not shipped benchmark evidence and does not claim hosted API savings.
@@ -0,0 +1,23 @@
1
+ Fixture-only digest plus artifact receipt prompt for reversible output-transform A/B setup.
2
+
3
+ You are reviewing an already-sanitized digest and receipt. Treat this as synthetic benchmark input only.
4
+
5
+ Digest of sanitized command output:
6
+ - command: python3 -m unittest sample_suite
7
+ - status: failed
8
+ - failure summary: sample_test_alpha expected ok but saw retry
9
+ - omitted sanitized lines: 5
10
+
11
+ Artifact receipt:
12
+ - artifact id: fixture-artifact-alpha
13
+ - digest id: fixture-digest-alpha
14
+ - exact re-expand command: context-guard-artifact show fixture-artifact-alpha
15
+ - re-expand expectation: retrieves the omitted sanitized lines exactly from a user-supplied local artifact store
16
+
17
+ Task:
18
+ 1. Identify the failing command and failing check.
19
+ 2. Describe which exact re-expand step would retrieve the omitted sanitized lines.
20
+ 3. State that artifact receipt metadata and byte counts are retrieval or proxy evidence only, not token or cost savings evidence.
21
+ 4. State that real comparisons require provider-measured telemetry on matched successful tasks, a failure-rate guardrail, human corrections, and shifted-cost accounting.
22
+
23
+ This prompt is dry-run-only fixture scaffolding and does not claim hosted API savings.
@@ -0,0 +1,28 @@
1
+ [
2
+ {
3
+ "id": "output_transform_trim_digest_fixture",
4
+ "prompt": "Fixture-only synthetic reversible output-transform task. Compare a placeholder raw command log with a digest plus artifact receipt and answer whether omitted sanitized lines can be exactly re-expanded. This fixture does not run a provider, trim command output, or fetch artifacts; future real runs must supply sanitized raw and digest evidence, artifact receipt metadata, provider-measured token/cost telemetry, matched successful tasks, failure-rate guardrail, human corrections, and shifted-cost accounting.",
5
+ "model": "sonnet",
6
+ "effort": "medium",
7
+ "max_turns": 3,
8
+ "max_budget_usd": 1.0,
9
+ "allowed_tools": [],
10
+ "success_command": "python3 -c \"raise SystemExit('fixture-only placeholder: replace success_command before real benchmark runs')\"",
11
+ "success_cwd": ".",
12
+ "variant_prompt_files": {
13
+ "baseline_raw_output_fixture": "output-transform-baseline-raw-output.prompt.example.md",
14
+ "fixture_only_digest_artifact_receipt": "output-transform-digest-receipt.prompt.example.md"
15
+ }
16
+ },
17
+ {
18
+ "id": "output_transform_failure_summary_fixture",
19
+ "prompt": "Fixture-only synthetic reversible output-transform task. Given a placeholder failure summary and a receipt-backed sanitized output handle, identify the failing command and describe which exact re-expand step would retrieve the omitted context. This fixture is dry-run-only until prompts, success checks, provider-measured primary token/cost fields, human corrections, and shifted-cost accounting are supplied for matched successful tasks.",
20
+ "model": "sonnet",
21
+ "effort": "medium",
22
+ "max_turns": 3,
23
+ "max_budget_usd": 1.0,
24
+ "allowed_tools": [],
25
+ "success_command": "python3 -c \"raise SystemExit('fixture-only placeholder: replace success_command before real benchmark runs')\"",
26
+ "success_cwd": "."
27
+ }
28
+ ]
@@ -0,0 +1,10 @@
1
+ [
2
+ {
3
+ "name": "baseline_raw_output_fixture",
4
+ "extra_args": []
5
+ },
6
+ {
7
+ "name": "fixture_only_digest_artifact_receipt",
8
+ "extra_args": []
9
+ }
10
+ ]
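The keys of `variant_prompt_files` in the task fixture mirror the `name` values in this variants file. A minimal Python sketch of that cross-check follows, assuming only the two JSON shapes shown in these fixtures. `unmatched_prompt_variants` is a hypothetical helper for illustration, not a function shipped by `context-guard-bench`.

```python
def unmatched_prompt_variants(tasks: list[dict], variants: list[dict]) -> dict[str, list[str]]:
    """Map task id -> variant_prompt_files keys that name no known variant."""
    known = {variant["name"] for variant in variants}
    problems: dict[str, list[str]] = {}
    for task in tasks:
        # variant_prompt_files is optional in the fixtures (the second task omits it).
        prompt_files = task.get("variant_prompt_files", {})
        unknown = sorted(set(prompt_files) - known)
        if unknown:
            problems[task["id"]] = unknown
    return problems


# Abridged copies of the fixture shapes above; other task fields are omitted.
tasks = [
    {
        "id": "output_transform_trim_digest_fixture",
        "variant_prompt_files": {
            "baseline_raw_output_fixture": "output-transform-baseline-raw-output.prompt.example.md",
            "fixture_only_digest_artifact_receipt": "output-transform-digest-receipt.prompt.example.md",
        },
    },
    {"id": "output_transform_failure_summary_fixture"},
]
variants = [
    {"name": "baseline_raw_output_fixture", "extra_args": []},
    {"name": "fixture_only_digest_artifact_receipt", "extra_args": []},
]

print(unmatched_prompt_variants(tasks, variants))
```

An empty result only shows that the names line up; it says nothing about token or cost savings.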
@@ -0,0 +1,22 @@
1
+ Fixture-only cropped/OCR prompt for multimodal crop/OCR experiment setup.
2
+
3
+ You are reviewing sanitized textual cropped/OCR-derived evidence only. This fixture contains no screenshot, image file, image URL, private path, endpoint, crop helper, OCR helper, visual-token helper, or provider call.
4
+
5
+ Cropped or OCR-derived evidence:
6
+ - original image dimensions telemetry: width 1200, height 800
7
+ - crop area telemetry: x 80, y 120, width 640, height 360
8
+ - visible area: checkout form region with validation text `Card number required`
9
+ - OCR text: `Card number required`; `Continue`
10
+ - OCR confidence telemetry: 0.96 for validation text; 0.92 for navigation label
11
+ - OCR error notes: footer notice omitted by crop; decorative icon ignored
12
+ - omitted context: footer notice `Sandbox order` and page-wide navigation are not visible in the crop
13
+ - missed-context guardrail: if the answer depends on omitted context, fall back to full visual evidence before judging success
14
+ - full visual fallback: rerun with baseline_full_visual_fixture evidence when OCR confidence is low, crop area excludes required context, or human correction is needed
15
+
16
+ Task:
17
+ 1. Identify the navigation control associated with the visible validation error.
18
+ 2. List any missed or omitted context that could change the answer.
19
+ 3. State that crop area, OCR text, OCR confidence, and byte counts are proxy or telemetry evidence only, not hosted API token or cost savings evidence.
20
+ 4. State that real comparisons require provider-measured image/text token or cost fields when available, matched successful tasks, failure-rate guardrail, human corrections, and shifted-cost accounting.
21
+
22
+ This prompt is dry-run-only fixture scaffolding and does not claim hosted API savings.
@@ -0,0 +1,19 @@
1
+ Fixture-only full visual prompt for multimodal crop/OCR experiment setup.
2
+
3
+ You are reviewing sanitized textual visual evidence only. This fixture contains no screenshot, image file, image URL, private path, endpoint, crop helper, OCR helper, visual-token helper, or provider call.
4
+
5
+ Full visual evidence:
6
+ - image dimensions telemetry: width 1200, height 800
7
+ - visible area: full page region
8
+ - checkout form context: navigation button label `Continue`, validation text `Card number required`, footer notice `Sandbox order`
9
+ - missed context risk: none for this baseline because the full visual description is present
10
+ - OCR confidence telemetry: not applicable for full visual baseline
11
+ - OCR error notes: not applicable for full visual baseline
12
+
13
+ Task:
14
+ 1. Identify the navigation control associated with the visible validation error.
15
+ 2. State what evidence would need to remain visible if a cropped or OCR-derived variant is used.
16
+ 3. State that image dimensions and visible area are telemetry only, not hosted API token or cost savings evidence.
17
+ 4. State that real comparisons require provider-measured image/text token or cost fields when available, matched successful tasks, failure-rate guardrail, human corrections, and shifted-cost accounting.
18
+
19
+ This prompt is dry-run-only fixture scaffolding and does not claim hosted API savings.
@@ -8,7 +8,11 @@
8
8
  "max_budget_usd": 1.0,
9
9
  "allowed_tools": [],
10
10
  "success_command": "python3 -c \"raise SystemExit('fixture-only placeholder: replace success_command before real benchmark runs')\"",
11
- "success_cwd": "."
11
+ "success_cwd": ".",
12
+ "variant_prompt_files": {
13
+ "baseline_full_visual_fixture": "visual-ocr-full-visual.prompt.example.md",
14
+ "fixture_only_cropped_or_ocr_evidence": "visual-ocr-cropped-ocr.prompt.example.md"
15
+ }
12
16
  },
13
17
  {
14
18
  "id": "visual_ocr_table_status_fixture",
@@ -17,6 +17,7 @@ Use them to decide what evidence a workflow has and what it does **not** prove:
17
17
  | [`benchmark-workflows/context-pack-byte-proxy.example.json`](benchmark-workflows/context-pack-byte-proxy.example.json) | `context-guard-pack auto` can reduce selected local bytes and inferred token proxies. | No hosted API token-savings claim because primary provider token fields are unavailable. |
18
18
  | [`benchmark-workflows/provider-cache-telemetry.example.json`](benchmark-workflows/provider-cache-telemetry.example.json) | Cache-layout diagnostics can coincide with observed provider cached-token telemetry. | Provider-cache telemetry is not proof that ContextGuard reduced prompt tokens or cost. |
19
19
  | [`benchmark-workflows/measured-token-workflow.example.json`](benchmark-workflows/measured-token-workflow.example.json) | A matched successful task pair with measured primary tokens may expose `token_savings_pct`. | The percentage is sample report data only, not a general savings promise; real claims require your own matched successful task runs and quality gates. |
20
+ | [`benchmark-workflows/self-hosted-metrics-ledger.example.jsonl`](benchmark-workflows/self-hosted-metrics-ledger.example.jsonl) | A run-evidence JSONL row can carry explicit local/model-server latency, peak-memory, and quality sidecar metrics. | Self-hosted metrics are not hosted API token/cost telemetry and do not change report savings math. |
20
21
 
21
22
  ## How to use the examples
22
23
 
@@ -24,6 +25,7 @@ Use them to decide what evidence a workflow has and what it does **not** prove:
24
25
  2. Compare your report's `claim_status`, `summary_by_variant`, and `comparisons[].quality_gate` to the examples.
25
26
  3. Treat `comparisons[].quality_gate != "pass"` as a warning to inspect failures, correction burden, and unmatched tasks before discussing savings.
26
27
  4. Keep byte-proxy, provider-cache, wall-time, and shifted-cost evidence in separate language from provider-measured token/cost claims. Provider-cache telemetry is not independent savings proof.
28
+ 5. Keep self-hosted local/model-server latency, memory, and quality metrics in the run-evidence ledger sidecar; do not fold them into hosted API token/cost savings claims unless provider-measured matched-task evidence separately supports that claim.
27
29
 
28
30
  ## Safe wording
29
31
 
@@ -35,6 +37,8 @@ Avoid language like:
35
37
 
36
38
  > ContextGuard guarantees this workflow will save tokens or cost.
37
39
 
38
- The fixtures intentionally use full `context-guard-bench-report-v1` shapes so tests can catch schema drift and overclaim wording.
40
+ The `.example.json` fixtures intentionally use full `context-guard-bench-report-v1` shapes so tests can catch schema drift and overclaim wording.
39
41
 
40
- For task/variant starter fixtures rather than full report-shape examples, see [`experimental-benchmark-fixtures.md`](experimental-benchmark-fixtures.md). Those files are fixture-only and synthetic dry-run-only starters until users replace the placeholder prompts and success checks; they are not shipped OCR, visual-token, or learned-compression runtime features, and real claims still require provider-measured matched successful tasks plus failure-rate, correction, and shifted-cost guardrails.
42
+ The self-hosted metrics example is a JSONL run-evidence sidecar, not a full report shape. Its fields are additive ledger evidence only: `latency_ms`, `peak_memory_mb`, and normalized `quality_score` describe local/model-server behavior and leave hosted API report calculations unchanged.
43
+
44
+ For task/variant starter fixtures rather than full report-shape examples, see [`experimental-benchmark-fixtures.md`](experimental-benchmark-fixtures.md). Those files are fixture-only and synthetic dry-run-only starters until users replace the placeholder prompts and success checks; they are not shipped OCR, visual-token, learned-compression, or output-transform benchmark results, and real claims still require provider-measured matched successful tasks plus failure-rate, correction, and shifted-cost guardrails.
@@ -0,0 +1 @@
1
+ {"schema_version":"contextguard.bench.run-evidence.v1","task_id":"self-hosted-demo","variant":"local-cache-reuse","success":true,"primary_tokens_measured":false,"primary_tokens":0,"primary_cost_measured":false,"primary_cost_usd":0.0,"external_tokens_measured":false,"external_tokens":0,"external_cost_measured":false,"external_cost_usd":0.0,"total_cost_with_shift_usd":null,"wall_time_seconds":0.0,"measurement_availability":{"primary_tokens":false,"primary_cost":false,"external_tokens":false,"external_cost":false,"shifted_cost":false,"provider_cache":false,"byte_metrics":false,"wall_time":true,"self_hosted_metrics":true},"self_hosted_metrics":{"schema_version":"contextguard.bench.self-hosted-metrics.v1","source":"explicit_provider_payload.self_hosted_metrics","metrics":{"latency_ms":842.5,"peak_memory_mb":14336.0,"quality_score":0.98},"labels":{"model_server":"local test server","optimization":"prefix cache reuse","quality_metric":"golden task pass rate"},"measurement_availability":{"latency_ms":true,"peak_memory_mb":true,"quality_score":true},"claim_boundary":{"id":"self_hosted_metrics_only_not_hosted_api_token_or_cost_savings","hosted_api_token_savings_claim_allowed":false,"hosted_api_cost_savings_claim_allowed":false,"requires_provider_measured_matched_tasks_for_hosted_claims":true,"reason":"Self-hosted local/model-server latency, memory, and quality metrics are not hosted API token or cost telemetry."}},"proxy_metrics":{"byte_metrics_observed":false,"token_proxy":"chars_div_4","bytes_per_token":4,"claim_boundary":"proxy_only_not_hosted_token_savings"},"notes":"Synthetic JSONL shape only; make hosted API savings claims only from provider-measured matched successful task evidence."}
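A rough Python sketch of how a consumer might read a row of this shape, keeping the self-hosted sidecar separate from hosted API evidence. Both helper names are hypothetical, and the gate (success plus measured primary tokens and cost) is a conservative assumption drawn from the row's `notes`, not the package's own logic. A single row can never establish a matched comparison by itself.

```python
import json


def self_hosted_sidecar(row: dict) -> dict[str, float]:
    """Return only the self-hosted metrics that the row marks as measured."""
    block = row.get("self_hosted_metrics") or {}
    available = block.get("measurement_availability", {})
    metrics = block.get("metrics", {})
    return {name: value for name, value in metrics.items() if available.get(name)}


def eligible_for_hosted_comparison(row: dict) -> bool:
    """A row can serve as one side of a hosted comparison only with measured primary fields."""
    return bool(
        row.get("success")
        and row.get("primary_tokens_measured")
        and row.get("primary_cost_measured")
    )


# Abridged version of the JSONL row above; unrelated fields are omitted.
line = (
    '{"task_id": "self-hosted-demo", "success": true, '
    '"primary_tokens_measured": false, "primary_cost_measured": false, '
    '"self_hosted_metrics": {"metrics": {"latency_ms": 842.5, '
    '"peak_memory_mb": 14336.0, "quality_score": 0.98}, '
    '"measurement_availability": {"latency_ms": true, '
    '"peak_memory_mb": true, "quality_score": true}}}'
)
row = json.loads(line)

print(self_hosted_sidecar(row))
print(eligible_for_hosted_comparison(row))
```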
@@ -1,6 +1,6 @@
1
1
  # Experimental benchmark fixtures
2
2
 
3
- These fixtures are **fixture-only** starter scaffolds for future visual/OCR and learned-compression experiments. They are **synthetic**, package-visible examples for `context-guard-bench` task and variant shapes; they are **not a shipped runtime feature**, not an OCR/compression implementation, and not a hosted API savings claim.
3
+ These fixtures are **fixture-only** starter scaffolds for future visual/OCR, learned-compression, and reversible output-transform experiments. They are **synthetic**, package-visible examples for `context-guard-bench` task and variant shapes; they are **not shipped benchmark results**, not OCR/compression implementations, and not hosted API savings claims.
4
4
 
5
5
  Use them when designing an experiment that starts from ContextGuard's existing benchmark discipline:
6
6
 
@@ -12,20 +12,31 @@ Use them when designing an experiment that starts from ContextGuard's existing b
12
12
  5. Treat byte counts, image dimensions, OCR confidence, and local compressor ratios as proxy evidence. Real token/cost claims require **provider-measured** primary token/cost fields on both sides.
13
13
  6. Keep private screenshots, raw secrets, and external service endpoints out of fixture files.
14
14
 
15
+ ## Runner-native variant prompt files
16
+
17
+ `context-guard-bench` supports optional file-backed `variant_prompt_files` in task fixtures. The map is keyed by variant name and lets a single logical task swap sanitized prompt evidence per variant, for example a baseline raw-output prompt versus a digest plus artifact receipt prompt. Prompt files are resolved relative to the task JSON, must be relative paths, and are read with the same no-follow/symlink-safe posture as task and variant fixtures.
18
+
19
+ This runner-native swap only proves command shape and prompt selection until the user supplies real sanitized tasks, success checks, and provider telemetry. It does **not** make dry-run output, artifact receipts, byte counts, or digest metadata into token/cost savings evidence. For real non-dry-run output-transform experiments, keep task IDs matched across baseline and digest variants and require provider-measured primary token/cost fields on matched successful tasks before making any comparison claim.
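The resolution rules described above (relative to the task JSON, relative paths only, no symlink following) can be sketched in Python as follows. This is an illustration under stated assumptions, not the runner's actual implementation: `resolve_prompt_file` is a hypothetical name, rejecting `..` segments is an extra assumption, and the `is_symlink()` check is a simplification of a true no-follow open.

```python
import tempfile
from pathlib import Path


def resolve_prompt_file(task_json: Path, relative_name: str) -> str:
    """Read a variant prompt file located relative to the task JSON."""
    candidate = Path(relative_name)
    # Assumption: absolute paths and parent-directory escapes are both rejected.
    if candidate.is_absolute() or ".." in candidate.parts:
        raise ValueError(f"prompt file must be a relative path: {relative_name}")
    target = task_json.parent / candidate
    # Simplified symlink guard; a stricter reader would open with a no-follow flag.
    if target.is_symlink():
        raise ValueError(f"refusing to follow symlink: {target}")
    return target.read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    task_json = Path(tmp) / "output-transform.tasks.example.json"
    task_json.write_text("[]", encoding="utf-8")
    (Path(tmp) / "baseline.prompt.example.md").write_text("raw output prompt", encoding="utf-8")

    loaded = resolve_prompt_file(task_json, "baseline.prompt.example.md")
    try:
        resolve_prompt_file(task_json, "../outside.md")
        rejected = False
    except ValueError:
        rejected = True

print(loaded, rejected)
```

Checking `is_symlink()` before reading leaves a small race window, which is why the comment points to a no-follow open as the stricter option.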
20
+
15
21
  ## Included fixture sets
16
22
 
17
23
  | Fixture set | Task file | Variant file | Intended future experiment |
18
24
  | --- | --- | --- | --- |
19
- | Visual/OCR evidence | [`benchmark-fixtures/visual-ocr.tasks.example.json`](benchmark-fixtures/visual-ocr.tasks.example.json) | [`benchmark-fixtures/visual-ocr.variants.example.json`](benchmark-fixtures/visual-ocr.variants.example.json) | Compare full visual evidence against cropped or OCR-derived evidence after the user supplies sanitized artifacts and provider telemetry. |
20
- | Learned compression | [`benchmark-fixtures/learned-compression.tasks.example.json`](benchmark-fixtures/learned-compression.tasks.example.json) | [`benchmark-fixtures/learned-compression.variants.example.json`](benchmark-fixtures/learned-compression.variants.example.json) | Compare baseline context packs or artifact digests against a future learned-compression candidate after quality gates and shifted costs are measured. |
25
+ | Visual/OCR evidence | [`benchmark-fixtures/visual-ocr.tasks.example.json`](benchmark-fixtures/visual-ocr.tasks.example.json) | [`benchmark-fixtures/visual-ocr.variants.example.json`](benchmark-fixtures/visual-ocr.variants.example.json) | Compare full visual evidence against cropped or OCR-derived evidence after the user supplies sanitized textual evidence, missed-context notes, crop/OCR telemetry, and provider telemetry. |
26
+ | Learned compression | [`benchmark-fixtures/learned-compression.tasks.example.json`](benchmark-fixtures/learned-compression.tasks.example.json) | [`benchmark-fixtures/learned-compression.variants.example.json`](benchmark-fixtures/learned-compression.variants.example.json) | Compare sanitized baseline context packs against a fixture-only compressed digest candidate after exact retrieval or receipt fallback, quality gates, and shifted costs are measured. |
27
+ | Reversible output transform | [`benchmark-fixtures/output-transform.tasks.example.json`](benchmark-fixtures/output-transform.tasks.example.json) | [`benchmark-fixtures/output-transform.variants.example.json`](benchmark-fixtures/output-transform.variants.example.json) | Compare raw sanitized command output against a digest plus artifact receipt after variant prompt files, success checks, and provider telemetry are supplied. |
21
28
 
22
29
  ## Visual/OCR fixture notes
23
30
 
24
- The visual/OCR fixtures describe placeholder evidence only. They do not crop images, run OCR, prune visual tokens, or call a model. Future experiments should record image dimensions, crop area, OCR confidence/error notes, provider image/text token telemetry when available, task success, corrections, and any external/local processing cost.
31
+ The visual/OCR fixtures describe sanitized textual visual evidence only and now demonstrate `variant_prompt_files` for full visual evidence versus cropped/OCR-derived evidence. They do not include image assets, crop images, run OCR, prune visual tokens, or call a model. Future experiments should record image dimensions, crop area, visible area, omitted or missed context, OCR confidence/error notes, full visual fallback conditions, provider image/text token telemetry when available, task success, corrections, and any external/local processing cost.
25
32
 
26
33
  ## Learned-compression fixture notes
27
34
 
28
- The learned-compression fixtures describe already-sanitized context-pack or artifact-digest comparisons. They do not invoke LLMLingua-style, gist-token, latent-context, or reranking implementations. Future experiments should preserve exact retrieval for lossy transforms where possible and record bytes before/after, primary provider tokens, cost, success, corrections, compressor latency, and external cost.
35
+ The learned-compression fixtures describe already-sanitized context-pack or artifact-digest comparisons and now demonstrate `variant_prompt_files` for baseline context-pack evidence versus a fixture-only compressed digest candidate. They do not invoke LLMLingua-style, gist-token, latent-context, embedding, or reranking implementations. Future experiments must follow a sanitized evidence only rule, keep protected evidence exact or receipt-retrievable, forbid semantic rewrites of identifiers, numeric constants, hashes, paths, quoted strings, stack frames, JSON keys, code fences, and diff zones, and record bytes before/after, primary provider tokens, cost, success, corrections, compressor latency, and external cost.
36
+
37
+ ## Reversible output-transform fixture notes
38
+
39
+ The output-transform fixtures describe already-sanitized command output comparisons and now demonstrate `variant_prompt_files` for raw sanitized output versus digest plus artifact receipt prompt evidence. They do not execute `context-guard-trim-output`, store artifacts, call `context-guard-artifact`, or invoke a provider. Future experiments should compare raw sanitized output against `--digest` output plus an `--artifact-receipt`, verify the receipt's exact re-expand command retrieves the omitted sanitized lines, and record bytes before/after, primary provider tokens, cost, success, corrections, artifact-store usage, and any external/local processing cost.
29
40
 
30
41
  ## Safe wording
31
42
 
@@ -33,4 +44,4 @@ Use language like:
33
44
 
34
45
  > This synthetic fixture validates benchmark task/variant shape only. A real claim needs provider-measured token/cost data for matched successful baseline and variant tasks, plus failure-rate, correction, and shifted-cost guardrails.
35
46
 
36
- Avoid language that presents dry-run output, bytes saved, OCR text, or compressor ratios as hosted API token/cost savings evidence.
47
+ Avoid language that presents dry-run output, bytes saved, OCR text, artifact receipts, exact re-expand handles, or compressor ratios as hosted API token/cost savings evidence.
@@ -0,0 +1,62 @@
1
+ # macOS visibility feasibility contract
2
+
3
+ `context-guard-audit --feasibility-json` emits a local, pre-GUI contract for future macOS-visible surfaces such as a menu-bar app, xbar item, Raycast command, or SwiftUI prototype. It is a transcript-scan contract, not a GUI implementation and not a live daemon.
4
+
5
+ The full feasibility envelope is versioned as `contextguard.metric-feasibility.v1.3`. The macOS binding/index inside that envelope is the top-level `mac_visibility` object with nested `schema_version: contextguard.mac-visibility.v1`.
6
+
7
+ ## Contract boundary
8
+
9
+ - `mac_visibility` is a thin index over stable top-level feasibility fields. It does not recompute totals and does not read diagnostic `summary`.
10
+ - GUI or menu-bar consumers should bind only to fields listed in `consumer_contract.stable_top_level_fields` and `mac_visibility.bind_to_top_level_fields`.
11
+ - `summary` is diagnostic/backward-compatible payload only. It may be shown in a debug panel, but it must not drive primary cards.
12
+ - Historical transcript scans do not include live context-window state. Context and headroom cards stay `missing` until a future surface provides `live_statusline_snapshot`.
13
+ - Values are local transcript observations. They are not invoice-grade billing records, do not prove provider cache hits, and do not guarantee token or cost savings.
14
+
15
+ ## Stable `mac_visibility` keys
16
+
17
+ | Key | Meaning |
18
+ | --- | --- |
19
+ | `schema_version` | Nested contract version: `contextguard.mac-visibility.v1`. |
20
+ | `surface_kind` | Local surface family; currently `local_macos_visibility_contract`. |
21
+ | `readiness.status` | One of `ready`, `partial`, or `missing`, derived from token availability and scan integrity. |
22
+ | `bind_to_top_level_fields` | Stable top-level fields primary consumers may use. |
23
+ | `diagnostic_only_fields` | Fields that must not drive primary UI; currently `summary`. |
24
+ | `primary_cards` | Ordered card descriptors with `id`, `title`, `status`, and `binding_paths`. |
25
+ | `missing_live_observations` | Required live observations that transcript scans cannot provide. |
26
+ | `claim_boundaries` | Copy-safe caveats for UI labels and docs. |
27
+ | `redaction_required` | Always `true` for default GUI/menu-bar presentation. |
28
+
29
+ ## Card IDs and binding paths
30
+
31
+ `primary_cards[*].binding_paths` use dotted paths inside the feasibility envelope. Current card IDs are:
32
+
33
+ 1. `source_freshness` → `source_kind`, `source_freshness.status`, `source_freshness.generated_at`
34
+ 2. `scan_integrity` → scan completeness and skipped counts
35
+ 3. `token_totals` → `totals.total_tokens` and `totals.tokens.*`
36
+ 4. `cache_reuse` → `totals.cache_read_share`, `totals.cache_reuse_ratio`, `metric_availability.cache`
37
+ 5. `observed_cost` → `totals.cost_usd_observed`, `metric_availability.cost`
38
+ 6. `context_availability` → `context_availability`, `metric_availability.context`
39
+ 7. `headroom_availability` → `headroom_availability`, `cache_diagnostics.headroom_diagnostics`
40
+ 8. `cache_layout_advice` → `cache_layout_advice`, `cache_friendliness`, `cache_diagnostics.dynamic_prefix_breakers`
41
+
42
+ When a card includes `required_observation: live_statusline_snapshot`, consumers should show an unavailable or setup state rather than treating the value as zero.
43
+
44
+ ## Example
45
+
46
+ See [`mac-visibility-feasibility.example.json`](mac-visibility-feasibility.example.json) for an abridged feasibility envelope. It keeps `summary` out of primary bindings and demonstrates the missing live context/headroom boundary.
47
+
48
+ ## Verification guidance
49
+
50
+ For a local fixture:
51
+
52
+ ```bash
53
+ context-guard-audit ./fixtures/transcripts --feasibility-json --recommend
54
+ ```
55
+
56
+ Then verify:
57
+
58
+ - `schema_version == "contextguard.metric-feasibility.v1.3"`
59
+ - `consumer_contract.stable_top_level_fields` contains `mac_visibility`
60
+ - `mac_visibility.diagnostic_only_fields` contains `summary`
61
+ - no `primary_cards[*].binding_paths` entry starts with `summary`
62
+ - `missing_live_observations[*].required_observation` names `live_statusline_snapshot` when context/headroom are missing
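The checklist above could be automated with a short Python sketch like the one below. `feasibility_contract_failures` is a hypothetical helper, and the shapes it assumes (a list of card objects with `binding_paths`, a list of `missing_live_observations` objects with `required_observation`, and `metric_availability.*.status` as the "missing" signal) are inferred from the dotted paths in this document rather than confirmed against the tool.

```python
import copy

EXPECTED_SCHEMA = "contextguard.metric-feasibility.v1.3"


def feasibility_contract_failures(envelope: dict) -> list[str]:
    """Return human-readable failures for the verification checklist."""
    failures: list[str] = []
    if envelope.get("schema_version") != EXPECTED_SCHEMA:
        failures.append("unexpected schema_version")
    stable = envelope.get("consumer_contract", {}).get("stable_top_level_fields", [])
    if "mac_visibility" not in stable:
        failures.append("mac_visibility missing from stable_top_level_fields")
    mac = envelope.get("mac_visibility", {})
    if "summary" not in mac.get("diagnostic_only_fields", []):
        failures.append("summary not listed in diagnostic_only_fields")
    for card in mac.get("primary_cards", []):
        for path in card.get("binding_paths", []):
            if path.startswith("summary"):
                failures.append(f"card {card.get('id')} binds to diagnostic path {path}")
    availability = envelope.get("metric_availability", {})
    live_missing = any(
        availability.get(key, {}).get("status") == "missing" for key in ("context", "headroom")
    )
    named = any(
        obs.get("required_observation") == "live_statusline_snapshot"
        for obs in mac.get("missing_live_observations", [])
    )
    if live_missing and not named:
        failures.append("live_statusline_snapshot not named for missing context/headroom")
    return failures


# Minimal synthetic envelope; real output carries many more fields.
envelope = {
    "schema_version": EXPECTED_SCHEMA,
    "consumer_contract": {"stable_top_level_fields": ["mac_visibility", "totals"]},
    "metric_availability": {"context": {"status": "missing"}, "headroom": {"status": "missing"}},
    "mac_visibility": {
        "diagnostic_only_fields": ["summary"],
        "primary_cards": [{"id": "token_totals", "binding_paths": ["totals.total_tokens"]}],
        "missing_live_observations": [{"required_observation": "live_statusline_snapshot"}],
    },
}

broken = copy.deepcopy(envelope)
broken["mac_visibility"]["primary_cards"][0]["binding_paths"] = ["summary.total_tokens"]

print(feasibility_contract_failures(envelope))
print(feasibility_contract_failures(broken))
```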
@@ -0,0 +1,130 @@
1
+ {
2
+ "schema_version": "contextguard.metric-feasibility.v1.3",
3
+ "producer": "context-guard-audit",
4
+ "generated_at": "2026-06-08T12:00:00Z",
5
+ "consumer_contract": {
6
+ "stable_top_level_fields": [
7
+ "schema_version",
8
+ "producer",
9
+ "generated_at",
10
+ "source_kind",
11
+ "source_freshness",
12
+ "scan_integrity",
13
+ "metric_availability",
14
+ "metric_caveats",
15
+ "redaction_mode",
16
+ "context_availability",
17
+ "headroom_availability",
18
+ "cache_friendliness",
19
+ "cache_diagnostics",
20
+ "cache_layout_advice",
21
+ "mac_visibility",
22
+ "totals"
23
+ ],
24
+ "diagnostic_fields": ["summary"],
25
+ "summary_contract": "summary is the legacy audit JSON payload for diagnostics and backward compatibility; new GUI prototypes should bind to stable top-level feasibility fields first."
26
+ },
27
+ "source_kind": "historical_transcript_scan",
28
+ "source_freshness": {
29
+ "status": "snapshot_at_scan_time",
30
+ "live": false,
31
+ "generated_at": "2026-06-08T12:00:00Z",
32
+ "description": "Local transcript files were scanned when this report was generated; this is not a live statusline snapshot."
33
+ },
34
+ "scan_integrity": {
35
+ "status": "complete",
36
+ "files_scanned": 1,
37
+ "records_scanned": 1,
38
+ "skipped_files": 0,
39
+ "skipped_records": 0,
40
+ "parse_error_count": 0,
41
+ "complete": true
42
+ },
43
+ "metric_availability": {
44
+ "tokens": {"status": "available", "present_fields": {"input": 1, "output": 1, "cache_read": 1, "cache_creation": 1}, "evidence": "observed"},
45
+ "cache": {"status": "available", "present_fields": {"cache_read": 1, "cache_creation": 1}, "zero_values_observed": {"cache_read": false, "cache_creation": false}, "evidence": "observed"},
46
+ "cost": {"status": "available", "present_count": 1, "observed_cost_usd": 0.1234, "evidence": "observed"},
47
+ "context": {"status": "missing", "evidence": "unavailable", "reason": "Transcript scans do not include live Claude Code context_window data. Pass a live statusline snapshot in a future surface to populate context availability."},
48
+ "headroom": {"status": "missing", "evidence": "unavailable", "reason": "Transcript scans do not carry live context-window or remaining-token data, so context headroom cannot be observed or conservatively inferred from history alone.", "observable_via": "live_statusline_snapshot"}
49
+ },
50
+ "metric_caveats": [
51
+ "Values are observed from local Claude Code transcript JSON/JSONL fields and are not official billing records.",
52
+ "cache-read share is cache_read / (input + cache_read + cache_creation), not a provider billing hit-rate."
53
+ ],
54
+ "redaction_mode": {
55
+ "paths": "basename_plus_stable_hash_by_default",
56
+ "commands": "command_category_plus_stable_hash_by_default",
57
+ "secret_like_values": "pattern_redacted",
58
+ "raw_path_and_command_flags": ["--show-paths", "--show-commands"]
59
+ },
60
+ "context_availability": {"status": "missing", "evidence": "unavailable", "reason": "Transcript scans do not include live Claude Code context_window data. Pass a live statusline snapshot in a future surface to populate context availability."},
61
+ "headroom_availability": {"status": "missing", "evidence": "unavailable", "observable_via": "live_statusline_snapshot", "reason": "Transcript scans do not carry live context-window or remaining-token data, so context headroom cannot be observed or conservatively inferred from history alone."},
62
+ "cache_friendliness": {"status": "partial", "confidence": "partial", "evidence": "observed", "heuristic": true},
63
+ "cache_diagnostics": {
64
+ "schema_version": "contextguard.cache-diagnostics.v1",
65
+ "status": "partial",
66
+ "confidence": "hypothesis",
67
+ "evidence": "inferred",
68
+ "heuristic": true,
69
+ "dynamic_prefix_breakers": [],
70
+ "headroom_diagnostics": {"status": "missing", "evidence": "unavailable", "observable_via": "live_statusline_snapshot", "required_observation": "live_statusline_snapshot", "historical_total_tokens_are_not_headroom": true}
71
+ },
72
+ "cache_layout_advice": {
73
+ "schema_version": "contextguard.cache-layout-advice.v1",
74
+ "status": "partial",
75
+ "confidence": "partial",
76
+ "priority": "P1",
77
+ "observed_issue": "long_session_accumulation"
78
+ },
79
+ "mac_visibility": {
80
+ "schema_version": "contextguard.mac-visibility.v1",
81
+ "surface_kind": "local_macos_visibility_contract",
82
+ "readiness": {
83
+ "status": "ready",
84
+ "reason": "Transcript token totals are available and the scan completed within configured limits."
85
+ },
86
+ "bind_to_top_level_fields": [
87
+ "source_kind",
88
+ "source_freshness",
89
+ "scan_integrity",
90
+ "metric_availability",
91
+ "metric_caveats",
92
+ "redaction_mode",
93
+ "context_availability",
94
+ "headroom_availability",
95
+ "cache_friendliness",
96
+ "cache_diagnostics",
97
+ "cache_layout_advice",
98
+ "totals"
99
+ ],
100
+ "diagnostic_only_fields": ["summary"],
101
+ "primary_cards": [
102
+ {"id": "source_freshness", "title": "Source freshness", "status": "available", "binding_paths": ["source_kind", "source_freshness.status", "source_freshness.generated_at"]},
103
+ {"id": "scan_integrity", "title": "Scan integrity", "status": "complete", "binding_paths": ["scan_integrity.status", "scan_integrity.files_scanned", "scan_integrity.records_scanned", "scan_integrity.skipped_files", "scan_integrity.skipped_records"]},
104
+ {"id": "token_totals", "title": "Token totals", "status": "available", "binding_paths": ["totals.total_tokens", "totals.tokens.input", "totals.tokens.output", "totals.tokens.cache_read", "totals.tokens.cache_creation"]},
105
+ {"id": "cache_reuse", "title": "Cache-read share and reuse ratio", "status": "available", "binding_paths": ["totals.cache_read_share", "totals.cache_reuse_ratio", "metric_availability.cache"]},
106
+ {"id": "observed_cost", "title": "Observed transcript cost", "status": "available", "binding_paths": ["totals.cost_usd_observed", "metric_availability.cost"]},
107
+ {"id": "context_availability", "title": "Context availability", "status": "missing", "binding_paths": ["context_availability", "metric_availability.context"], "required_observation": "live_statusline_snapshot"},
108
+ {"id": "headroom_availability", "title": "Headroom availability", "status": "missing", "binding_paths": ["headroom_availability", "cache_diagnostics.headroom_diagnostics"], "required_observation": "live_statusline_snapshot"},
109
+ {"id": "cache_layout_advice", "title": "Cache layout advice", "status": "partial", "binding_paths": ["cache_layout_advice", "cache_friendliness", "cache_diagnostics.dynamic_prefix_breakers"]}
110
+ ],
111
+ "missing_live_observations": [
112
+ {"id": "live_context_window", "required_observation": "live_statusline_snapshot", "affects": ["context_availability", "metric_availability.context"], "reason": "Historical transcript scans do not include live Claude Code context_window data."},
113
+ {"id": "live_headroom", "required_observation": "live_statusline_snapshot", "affects": ["headroom_availability", "cache_diagnostics.headroom_diagnostics"], "reason": "Historical transcript totals are not remaining-token or live headroom observations."}
114
+ ],
115
+ "claim_boundaries": [
116
+ "Local transcript observations are not invoice-grade billing records.",
117
+ "Provider cache fields are telemetry, not ContextGuard-caused token reduction and do not prove provider cache hits.",
118
+ "Historical transcript totals do not infer live context headroom or remaining tokens.",
119
+ "This contract does not guarantee token or cost savings."
120
+ ],
121
+ "redaction_required": true
122
+ },
123
+ "totals": {
124
+ "total_tokens": 1150,
125
+ "tokens": {"input": 100, "output": 50, "cache_read": 800, "cache_creation": 200},
126
+ "cost_usd_observed": 0.1234,
127
+ "cache_read_share": 0.7272727272727273,
128
+ "cache_reuse_ratio": 4.0
129
+ }
130
+ }
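As a worked check of the `totals` block above, the figures can be recomputed from the four token buckets. The `cache_read_share` formula is the one stated in `metric_caveats`. The `cache_reuse_ratio` formula (`cache_read / cache_creation`) is an assumption that happens to match the example value; the example file does not define it.

```python
# Token buckets copied from the example's totals.tokens object.
tokens = {"input": 100, "output": 50, "cache_read": 800, "cache_creation": 200}

# Sum of all four buckets: 100 + 50 + 800 + 200 = 1150.
total_tokens = sum(tokens.values())

# Stated in metric_caveats: cache_read / (input + cache_read + cache_creation).
cache_read_share = tokens["cache_read"] / (
    tokens["input"] + tokens["cache_read"] + tokens["cache_creation"]
)

# Assumed definition, not confirmed by the source: cache_read / cache_creation.
cache_reuse_ratio = tokens["cache_read"] / tokens["cache_creation"]

print(total_tokens)                  # 1150
print(round(cache_read_share, 4))    # 0.7273
print(cache_reuse_ratio)             # 4.0
```

Note that `output` tokens are excluded from the share's denominator, which is why the share is 800/1100 rather than 800/1150.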
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@ictechgy/context-guard",
3
- "version": "0.4.4",
3
+ "version": "0.4.6",
4
4
  "description": "ContextGuard CLI helpers for keeping AI coding agent context focused and local-first.",
5
5
  "license": "Apache-2.0",
6
6
  "homepage": "https://github.com/ictechgy/context-guard#readme",
@@ -55,9 +55,13 @@
55
55
  "docs/cache-diagnostics-schema.md",
56
56
  "docs/cache-diagnostics.schema.json",
57
57
  "docs/cache-diagnostics.example.json",
58
+ "docs/mac-visibility-feasibility-schema.md",
59
+ "docs/mac-visibility-feasibility.example.json",
58
60
  "docs/benchmark-workflows/*.example.json",
61
+ "docs/benchmark-workflows/*.example.jsonl",
59
62
  "docs/benchmark-workflow-examples.md",
60
63
  "docs/benchmark-fixtures/*.example.json",
64
+ "docs/benchmark-fixtures/*.prompt.example.md",
61
65
  "docs/experimental-benchmark-fixtures.md",
62
66
  "packaging/homebrew/context-guard.rb.template"
63
67
  ],
@@ -5,7 +5,7 @@ class ContextGuard < Formula
5
5
 
6
6
  desc "Local-first context guardrails for AI coding agents"
7
7
  homepage "https://github.com/ictechgy/context-guard"
8
- url "https://github.com/ictechgy/context-guard/archive/refs/tags/v0.4.4.tar.gz"
8
+ url "https://github.com/ictechgy/context-guard/archive/refs/tags/v0.4.6.tar.gz"
9
9
  sha256 "REPLACE_WITH_RELEASE_TARBALL_SHA256"
10
10
  license "Apache-2.0"
11
11
 
@@ -37,5 +37,5 @@
37
37
  "gated-experiments",
38
38
  "future-roadmap"
39
39
  ],
40
- "version": "0.4.4"
40
+ "version": "0.4.6"
41
41
  }
@@ -97,7 +97,7 @@ context-guard-statusline-merged
97
97
  - **Statusline** shows compact model, context, and cost signals and, when transcript data is available, also displays cache-read and cache-reuse signals.
98
98
  - **Transcript audit** aggregates usage/cost/cache buckets and reports token hotspots, `cache_friendliness` prompt-layout signals, and `cache_layout_advice` check/experiment priorities as bounded, redacted segment hashes. It does not print raw prompt text.
99
99
  - **Repeated-failure nudge** advises the agent to change strategy when Bash failures repeat, instead of retrying the same path.
100
- - **Benchmark helper** pairs baseline/variant runs and records them with real token and cost fields, separate byte-reduction proxy evidence, diagnostic `wall_time_seconds`, `provider_cached_tokens`, and provider-cache availability telemetry.
100
+ - **Benchmark helper** pairs baseline/variant runs and records real token and cost fields, separate byte-reduction proxy evidence, diagnostic `wall_time_seconds`, `provider_cached_tokens`, provider-cache availability telemetry, file-backed `variant_prompt_files`, and optional per-run `self_hosted_metrics` JSONL ledger sidecars. These sidecars are not merged into hosted API savings claims.
101
101
 
102
102
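  The cost guard's local HMAC key is generated automatically at `.context-guard/cost-ledger/hmac.key` by default. If an administrator provisions it directly, the file must contain exactly one canonical URL-safe base64 32-byte key, including required padding, with no trailing newline or whitespace. Reports do not print the key or raw prompt text, and the local ledger does not replace Anthropic/provider prompt caching.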
103
103
 
@@ -103,7 +103,7 @@ context-guard-statusline-merged
103
103
  - **Statusline** displays compact model/context/cost signals and, when transcript data is available, cache-read and cache-reuse signals.
104
104
  - **Transcript audit** aggregates usage/cost/cache buckets, flags likely token hotspots, and exposes `cache_friendliness`, additive [`cache_diagnostics`](https://github.com/ictechgy/context-guard/blob/main/docs/cache-diagnostics-schema.md), and `cache_layout_advice` experiment priorities from bounded usage fields, timestamped cache telemetry records, and redacted segment hashes without printing raw prompt text or claiming provider-cache savings.
105
105
  - **Repeated-failure nudge** warns after repeated Bash failures so the agent switches strategy instead of retrying the same context-heavy path.
106
- - **Benchmark helper** records matched baseline/variant runs with real token and cost fields, separate byte-reduction proxy evidence, diagnostic `wall_time_seconds`, `provider_cached_tokens`, and provider-cache availability telemetry.
106
+ - **Benchmark helper** records matched baseline/variant runs with real token and cost fields, separate byte-reduction proxy evidence, diagnostic `wall_time_seconds`, `provider_cached_tokens`, provider-cache availability telemetry, file-backed `variant_prompt_files`, and optional per-run `self_hosted_metrics` JSONL ledger sidecars that stay out of hosted API savings claims.
107
107
 
108
108
  Cost guard creates its local HMAC key automatically at `.context-guard/cost-ledger/hmac.key`. If you provision that file yourself, it must contain exactly one canonical URL-safe base64 32-byte key with required padding and no trailing newline or whitespace. Reports never emit the key or raw prompt text, and the local ledger does not replace Anthropic/provider prompt caching.
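For anyone provisioning the key file by hand, the stated format can be illustrated with the Python standard library. This is only a sketch of the constraints described above, not the package's own validation code; `generate_key_text` and `is_canonical_key_text` are hypothetical helper names.

```python
import base64
import os


def generate_key_text() -> str:
    """Encode 32 random bytes as padded URL-safe base64 (44 characters)."""
    return base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")


def is_canonical_key_text(text: str) -> bool:
    """Check the file content against the constraints stated in the README.

    The text must decode to exactly 32 bytes and re-encode to the identical
    string, which rejects missing padding, trailing newlines or whitespace,
    and the standard-alphabet characters '+' and '/'.
    """
    try:
        raw = base64.b64decode(text.encode("ascii"), altchars=b"-_", validate=True)
    except ValueError:  # covers binascii.Error and UnicodeEncodeError
        return False
    return len(raw) == 32 and base64.urlsafe_b64encode(raw).decode("ascii") == text


key_text = generate_key_text()
print(len(key_text))                          # 44
print(is_canonical_key_text(key_text))        # True
print(is_canonical_key_text(key_text + "\n")) # False
```

When writing the file, avoid tools that append a newline (for example, a plain `echo`), since the README states that trailing whitespace is not allowed.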
109
109