
@ictechgy/context-guard 0.4.4 → 0.4.5

Files changed (27)

package/CHANGELOG.md CHANGED

@@ -2,6 +2,13 @@
 All notable changes for the ContextGuard plugin are documented here.
+## [Unreleased]
+## [0.4.5] - 2026-06-09
+- Added a package-visible `mac_visibility` feasibility contract for future local macOS-visible surfaces without building a GUI or inferring live headroom from historical transcript scans.
+- Clarified README, plugin README, kit README, and GitHub Pages measurement boundaries for self-hosted metrics sidecars, benchmark evidence, mac visibility contracts, and experimental fixtures.
 ## [0.4.4] - 2026-06-08
 - Added top-level `cache_layout_advice` to transcript audit JSON and feasibility output so cache-prefix instability can be prioritized without mixing advice into evidence-only diagnostics.

package/README.ko.md CHANGED

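@@ -125,13 +125,16 @@ brief mode asks the coding agent to cut filler, but
 | Claude Code plugin skills | Run the setup wizard, optimization check, and transcript usage audit from inside Claude Code. |
 | Project-scoped setup wizard | Applies recommended `.claude/settings.json` options to the project while leaving global settings untouched. |
 | Context management scanner | Finds missing guardrails, excessive hook output, broad read scopes, large context files, sensitive-looking files, excessive MCP servers, and expensive defaults. |
+| Structural waste diagnostics | Read-only diagnosis of duplicate rules, stale import candidates, unused skill candidates, oversized tool schemas, and repeated read/tool-call loops. |
 | Large-read guard and symbol reader | Steers toward `rg`, symbol-level reads, and small line-range reads instead of whole-file reads. |
 | Output trimming and secret redaction | Shrinks test, build, search, and diff output, and redacts sensitive-looking values before they enter the agent context. |
+| Declarative output filters | A user-defined JSON DSL explicitly trims only successful output; failure output that must be protected keeps its raw stdout/stderr and exit code. |
 | Local log store | Keeps large logs in local storage outside the conversation and retrieves only summaries or requested line ranges. |
 | Anthropic cost guard | `context-guard cost preflight/observe/ledger/compile` estimates cache risk and cost ranges, stores only keyed HMAC fingerprints instead of raw text, and only warns unless `--enforce` is specified. |
 | Budget-based context packer | Assembles prioritized local file evidence into a Markdown pack within a byte budget, recommends a manifest for `build` from local signals, and can append short local selection reasons with `--explain`. |
 | Tool/MCP schema pruner | Builds a bounded top-k tool/schema advisory report from a local catalog, and leaves a compact summary record plus a retrieval path for the full redacted payload. |
 | Conservative stdin compressor | Shrinks selected JSON, diffs, logs, search output, code, and prose, and reports observed-byte evidence alongside an estimated token proxy. |
+| Protected-zone policy records | `context-guard-compress --protected-policy` and `context-guard cost compile` mark code, diff, path, hash, and JSON/literal zones as structural-only transform targets and record exact retrieval boundaries. |
 | Repeated-failure nudge | When Bash failures repeat, prompts a change of strategy before failure logs fill the context. |
 | Statusline, audit, and benchmarks | Shows context, cache, and cost signals, finds usage and cache-friendliness hotspots, and records conservative before/after comparison evidence. |
@@ -300,7 +303,7 @@ If you need a semantic summary instead of head/tail logs, use `--digest markdown` or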
 ./plugins/context-guard/bin/context-guard-audit ~/.claude/projects --top 20 --recommend
 ```
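-The audit command skips oversized transcript files and JSONL records by default (`--max-file-bytes`, `--max-line-bytes`) and reports the skipped counts. This is a defense that keeps a corrupt trace from dominating memory or hiding scan gaps. JSON output also includes `cache_friendliness` and [`cache_diagnostics`](docs/cache-diagnostics-schema.md): heuristic prompt-layout/cache-read diagnostics built from bounded usage fields, timestamped cache telemetry records, and redacted segment hashes. The sibling `cache_layout_advice` turns these signals into ranked **checks/experiments** such as splitting long sessions or stabilizing prefixes, while keeping observed issues separate from hypothesized or corroborated causes. The output does not print raw prompts and does not prove provider cache hits, and fields may be `missing`, `partial`, `hypothesis`, or `unavailable` when the transcript schema does not expose enough evidence.
+The audit command skips oversized transcript files and JSONL records by default (`--max-file-bytes`, `--max-line-bytes`) and reports the skipped counts. This is a defense that keeps a corrupt trace from dominating memory or hiding scan gaps. JSON output also includes `cache_friendliness` and [`cache_diagnostics`](docs/cache-diagnostics-schema.md): heuristic prompt-layout/cache-read diagnostics built from bounded usage fields, timestamped cache telemetry records, and redacted segment hashes. The sibling `cache_layout_advice` turns these signals into ranked **checks/experiments** such as splitting long sessions or stabilizing prefixes, while keeping observed issues separate from hypothesized or corroborated causes. `--feasibility-json` output also includes a [`mac_visibility`](docs/mac-visibility-feasibility-schema.md) contract that a local macOS visibility surface can bind to. The contract points only at stable top-level fields, and `summary` is not a primary UI binding target. The output does not print raw prompts and does not prove provider cache hits or live headroom, and fields may be `missing`, `partial`, `hypothesis`, or `unavailable` when the transcript schema does not expose enough evidence.
 ### Check context and cache health in the statusline
@@ -318,7 +321,17 @@ If you need a semantic summary instead of head/tail logs, use `--digest markdown` or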
   --ledger-jsonl bench/cost-shift.jsonl --report-json bench/report.json
 ```
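-The report compares successful baseline/variant runs by real tokens and `cost_usd + external_cost_usd`. Byte reductions are recorded only as indirect evidence and are not treated as proof of savings in themselves. Token-savings claims are computed only when both sides of a matched task have `primary_tokens_measured`. `matched_pair_evidence` links each successful baseline/variant task bucket to the transform, measurement availability, quality gate, and claim boundary, so check it before writing any savings statement. `wall_time_seconds`, `provider_cached_tokens`, and `provider_cached_tokens_measured` are diagnostic telemetry and are not treated as evidence of token or cost savings produced by ContextGuard. If cost fields are zero or missing, the report shows only token savings and does not claim real cost savings. Savings claims are based on matched tasks that succeeded on both sides, and are downgraded to a warning level when failure-rate guardrails regress. CSV schemas are checked strictly. After upgrading the benchmark helper, start a new `--csv` file or migrate to the header named in the mismatch error. See [`docs/benchmark-report.example.json`](docs/benchmark-report.example.json) for a minimal report-shape example, [`docs/benchmark-workflow-examples.md`](docs/benchmark-workflow-examples.md) for workflow-specific synthetic examples and safe interpretation boundaries, and [`docs/experimental-benchmark-fixtures.md`](docs/experimental-benchmark-fixtures.md) for fixture-only experiment starter examples.
+When reading the report, check the claim boundaries first.
+- Successful baseline/variant runs are compared by real tokens and `cost_usd + external_cost_usd`, and byte reductions are recorded only as indirect evidence.
+- Token-savings claims are computed only when both sides of a matched task have `primary_tokens_measured`.
+- `matched_pair_evidence` links each successful task bucket to the transform, measurement availability, quality gate, and claim boundary, so check it before writing any savings statement.
+- `wall_time_seconds`, `provider_cached_tokens`, and `provider_cached_tokens_measured` are diagnostic telemetry and are not treated as evidence of token or cost savings produced by ContextGuard.
+- Optional `self_hosted_metrics` are recorded only as per-run JSONL ledger sidecars, are kept out of CSV/report summaries, and must not be used as grounds for hosted API token/cost savings claims.
+- If cost fields are zero or missing, the report shows only token savings and does not claim real cost savings.
+- CSV schemas are checked strictly. After upgrading the benchmark helper, start a new `--csv` file or migrate to the header named in the mismatch error.
+See [`docs/benchmark-report.example.json`](docs/benchmark-report.example.json) for a minimal report-shape example, [`docs/benchmark-workflow-examples.md`](docs/benchmark-workflow-examples.md) for workflow-specific synthetic examples and safe interpretation boundaries, and [`docs/experimental-benchmark-fixtures.md`](docs/experimental-benchmark-fixtures.md) for fixture-only experiment starter examples.
 ## What is not yet shipped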

package/README.md CHANGED

@@ -339,7 +339,7 @@ JSON
 ./plugins/context-guard/bin/context-guard-audit ~/.claude/projects --top 20 --recommend
 ```
-The audit command skips oversized transcript files and JSONL records by default (`--max-file-bytes`, `--max-line-bytes`) and reports skipped counts, so a corrupt trace cannot dominate memory or hide scan gaps. JSON output also includes `cache_friendliness` and [`cache_diagnostics`](docs/cache-diagnostics-schema.md): heuristic prompt-layout/cache-read diagnostics built from bounded usage fields, timestamped cache telemetry records, and redacted segment hashes. The sibling `cache_layout_advice` field turns those signals into ranked **checks/experiments** such as splitting long sessions or stabilizing early prompt prefixes, while keeping observed issues separate from hypothesized or corroborated causes. These fields can flag likely volatile content near the prompt prefix, stable-prefix candidates, cache-miss hypotheses, and TTL/headroom evidence gaps, but they do not print raw prompt text, do not prove provider cache hits, and may be `missing`, `partial`, `hypothesis`, or `unavailable` when transcript schemas do not expose enough evidence.
+The audit command skips oversized transcript files and JSONL records by default (`--max-file-bytes`, `--max-line-bytes`) and reports skipped counts, so a corrupt trace cannot dominate memory or hide scan gaps. JSON output also includes `cache_friendliness` and [`cache_diagnostics`](docs/cache-diagnostics-schema.md): heuristic prompt-layout/cache-read diagnostics built from bounded usage fields, timestamped cache telemetry records, and redacted segment hashes. The sibling `cache_layout_advice` field turns those signals into ranked **checks/experiments** such as splitting long sessions or stabilizing early prompt prefixes, while keeping observed issues separate from hypothesized or corroborated causes. `--feasibility-json` also includes a [`mac_visibility`](docs/mac-visibility-feasibility-schema.md) contract that local macOS-visible consumers can bind against; only stable top-level fields are designated binding targets, and `summary` is not a primary UI binding source. These fields can flag likely volatile content near the prompt prefix, stable-prefix candidates, cache-miss hypotheses, and TTL/headroom evidence gaps, but they do not print raw prompt text, do not prove provider cache hits, and may be `missing`, `partial`, `hypothesis`, or `unavailable` when transcript schemas do not expose enough evidence.
 ### Watch context and cache health in the statusline
@@ -357,7 +357,17 @@ The audit command skips oversized transcript files and JSONL records by default
   --ledger-jsonl bench/cost-shift.jsonl --report-json bench/report.json
 ```
-The report compares successful baseline/variant runs by real tokens and `cost_usd + external_cost_usd`. Byte reductions are recorded as proxy evidence, not treated as proof of savings. Token-savings claims require `primary_tokens_measured` on both sides of a matched task. `matched_pair_evidence` links each successful baseline/variant task bucket to the transform, measurement availability, quality gate, and claim boundary; inspect it before making any savings statement. `wall_time_seconds`, `provider_cached_tokens`, and `provider_cached_tokens_measured` are diagnostic telemetry, not proof of ContextGuard-caused token or cost savings. If cost fields are zero or unavailable, the report can still mark token savings but will not claim shifted-cost savings. Claims are paired by matched successful tasks and downgraded when failure-rate guardrails regress. CSV schemas are strict; after upgrading the benchmark helper, start a new `--csv` file or migrate the header named in the mismatch error. See [`docs/benchmark-report.example.json`](docs/benchmark-report.example.json) for a minimal report-shape example, [`docs/benchmark-workflow-examples.md`](docs/benchmark-workflow-examples.md) for workflow-specific synthetic examples, and [`docs/experimental-benchmark-fixtures.md`](docs/experimental-benchmark-fixtures.md) for fixture-only experimental task/variant starters.
+Read the report through its claim boundaries before writing any savings statement:
+- Successful baseline/variant runs are compared by real tokens and `cost_usd + external_cost_usd`; byte reductions stay proxy evidence.
+- Token-savings claims require `primary_tokens_measured` on both sides of a matched task.
+- `matched_pair_evidence` links each successful task bucket to the transform, measurement availability, quality gate, and claim boundary.
+- `wall_time_seconds`, `provider_cached_tokens`, and `provider_cached_tokens_measured` are diagnostic telemetry, not proof of ContextGuard-caused token or cost savings.
+- Optional `self_hosted_metrics` from provider payloads are stored as per-row JSONL ledger sidecars, kept out of CSV/report summaries, and must not be folded into hosted API token/cost savings claims.
+- If cost fields are zero or unavailable, the report can still mark token savings but will not claim shifted-cost savings.
+- CSV schemas are strict; after upgrading the benchmark helper, start a new `--csv` file or migrate the header named in the mismatch error.
+See [`docs/benchmark-report.example.json`](docs/benchmark-report.example.json) for a minimal report-shape example, [`docs/benchmark-workflow-examples.md`](docs/benchmark-workflow-examples.md) for workflow-specific synthetic examples, and [`docs/experimental-benchmark-fixtures.md`](docs/experimental-benchmark-fixtures.md) for fixture-only experimental task/variant starters.
 ## What is not yet shipped
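
The claim boundaries listed in this hunk can be read as a small gate. The sketch below is illustrative only: `PairedRun` and `allowed_claims` are hypothetical names that do not appear in the package, and only the field names `primary_tokens_measured`, `cost_usd`, and `external_cost_usd` come from the README text. It also assumes that a zero summed cost means "cost unavailable".

```python
from dataclasses import dataclass


@dataclass
class PairedRun:
    """One side of a matched task (hypothetical shape, for illustration)."""
    success: bool
    primary_tokens_measured: bool
    cost_usd: float = 0.0
    external_cost_usd: float = 0.0


def cost_known(run: PairedRun) -> bool:
    # Assumption: a zero summed cost is treated as "cost unavailable".
    return (run.cost_usd + run.external_cost_usd) > 0


def allowed_claims(baseline: PairedRun, variant: PairedRun) -> dict[str, bool]:
    """Return which savings claims the README's boundaries would permit."""
    both_succeeded = baseline.success and variant.success
    # Token claims need real token measurements on both sides of the pair.
    tokens_ok = (
        both_succeeded
        and baseline.primary_tokens_measured
        and variant.primary_tokens_measured
    )
    # Cost claims additionally need a usable cost on both sides.
    cost_ok = both_succeeded and cost_known(baseline) and cost_known(variant)
    return {"token_savings": tokens_ok, "cost_savings": cost_ok}


base = PairedRun(success=True, primary_tokens_measured=True, cost_usd=0.12)
var = PairedRun(success=True, primary_tokens_measured=True)  # cost fields are zero
print(allowed_claims(base, var))  # {'token_savings': True, 'cost_savings': False}
```

Byte reductions and `self_hosted_metrics` are deliberately absent from the gate, since the README treats them as proxy or sidecar evidence rather than as inputs to a savings claim.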

package/context-guard-kit/README.md CHANGED

@@ -60,12 +60,12 @@ python3 context-guard-kit/sanitize_output.py -- git diff
 `cost_guard.py compile` reads the `protected`, `semantic_sensitive`, `protected_zone_classes`, `content_type`, `volatile`, `ttl`, and `bytes` fields of the section manifest and emits `protected_zone_policy` and `transform_policy`. When `protected=true` and `volatile=true` appear together, volatile pushes cache ordering toward the tail, and protection controls only the transform/retrieval policy. Large protected sections get a recommendation for local artifact retrieval, but this is not claimed to replace the provider prompt cache.
-`benchmark_runner.py` runs the fixed task/variant experiments from `research/benchmark-plan.md`. `--ledger-jsonl` records token/cost that shifted to external execution surfaces such as subagents and artifacts, along with per-run measurement availability, and `--report-json` generates an A/B report that separates real token/cost savings against baseline from proxy byte reductions. The report's `matched_pair_evidence` links each successful baseline/variant task bucket to the transform, quality gate, measurement availability, and claim boundary, so check it before writing any savings claim.
+`benchmark_runner.py` runs the fixed task/variant experiments from `research/benchmark-plan.md`. `variant_prompt_files` filters down to the selected task/variant first and then reads only the file-backed prompts that are needed, so a missing file in an unselected fixture does not break the selected run. `--ledger-jsonl` records token/cost that shifted to external execution surfaces such as subagents and artifacts, along with per-run measurement availability, and optional `self_hosted_metrics` provider payloads are recorded only as per-run sidecars. `--report-json` generates an A/B report that separates real token/cost savings against baseline from proxy byte reductions, and `self_hosted_metrics` is not folded into CSV/report summaries. The report's `matched_pair_evidence` links each successful baseline/variant task bucket to the transform, quality gate, measurement availability, and claim boundary, so check it before writing any savings claim.
 `../research/experimental-token-reduction-radar.md` is a gate that documents optional future experiments such as learned compression, multimodal crop/OCR/visual-token pruning, and self-hosted KV/latent inference optimization. `../docs/experimental-benchmark-fixtures.md` contains fixture-only task/variant starter examples. The radar and fixtures are not currently shipped runtime helpers and do not guarantee hosted API token/cost savings. Hosted API token/cost savings claims are allowed only with provider-measured matched-task evidence. The radar's later-roadmap gate keeps neural/semantic compression, trust-tiered injection-aware compression, context-diff compaction, and local proxy constraints experimental/non-shipped until a separate future PR passes the gate.
 The default output of `claude_transcript_cost_audit.py --recommend` anonymizes transcript paths as `basename#hash` and commands as `command#hash` so that it is safe to share. Add `--show-paths` or `--show-commands` only when you really need the raw local identifiers.
-To defend against oversized or corrupt transcripts, the per-file `--max-file-bytes` and per-JSONL-record `--max-line-bytes` limits are also applied by default, and skipped items are surfaced as skip counts and warnings. `cache_friendliness` in the JSON summary/feasibility output is a heuristic that compares stable-prefix and volatile prefix/tail signals using bounded, sanitized segment hashes. `cache_layout_advice` connects those signals to ranked checks/experiments such as splitting long sessions, stabilizing prefixes, and diet checks, while keeping observed issues separate from hypothesized or corroborated causes. Raw prompt text is not printed; interpret provider cache token fields as separate diagnostic telemetry, not as evidence of token savings produced by ContextGuard.
+To defend against oversized or corrupt transcripts, the per-file `--max-file-bytes` and per-JSONL-record `--max-line-bytes` limits are also applied by default, and skipped items are surfaced as skip counts and warnings. `cache_friendliness` in the JSON summary/feasibility output is a heuristic that compares stable-prefix and volatile prefix/tail signals using bounded, sanitized segment hashes. `cache_layout_advice` connects those signals to ranked checks/experiments such as splitting long sessions, stabilizing prefixes, and diet checks, while keeping observed issues separate from hypothesized or corroborated causes. `--feasibility-json` also provides a `mac_visibility` contract so that consumers such as a macOS-visible prototype bind only to stable top-level fields. Raw prompt text is not printed; interpret provider cache token fields and historical token totals as separate diagnostic telemetry, not as evidence of ContextGuard-produced token savings or live headroom.
 `context_guard_diet.py scan` is a read-only scanner that always reads locally only. The default output anonymizes the project root and reports mainly relative paths. `--top` applies to both the report's context-like file list and its context-exclusion recommendation list. Use `--show-paths` only for local/private debugging.

package/context-guard-kit/benchmark_runner.py CHANGED

@@ -27,6 +27,7 @@ Task fixture (`tasks.json`): each task has the following fields.
     "max_turns": 3,
     "max_budget_usd": 1.0,
     "allowed_tools": ["Read", "Edit", "Bash(npm test*)"],
+    "variant_prompt_files": {"context_hygiene": "t01.context_hygiene.prompt.md"},
     "success_command": "npm test -- auth/session",
     "success_cwd": "."
   }
@@ -183,6 +184,13 @@ MAX_USAGE_COST_USD = 10**9
 TOKEN_PROXY_BYTES_PER_TOKEN = 4
 BENCH_RUN_EVIDENCE_SCHEMA_VERSION = "contextguard.bench.run-evidence.v1"
 MATCHED_PAIR_EVIDENCE_SCHEMA_VERSION = "contextguard.bench.matched-pair.v1"
+SELF_HOSTED_METRICS_SCHEMA_VERSION = "contextguard.bench.self-hosted-metrics.v1"
+SELF_HOSTED_METRICS_KEY = "self_hosted_metrics"
+SELF_HOSTED_METRICS_CLAIM_BOUNDARY = "self_hosted_metrics_only_not_hosted_api_token_or_cost_savings"
+MAX_SELF_HOSTED_LABEL_CHARS = 120
+MAX_SELF_HOSTED_LATENCY_MS = 7 * 24 * 60 * 60 * 1000
+MAX_SELF_HOSTED_MEMORY_MB = 10_000_000
+MAX_VARIANT_PROMPT_FILE_BYTES = 128_000
 CLAUDE_OUTPUT_MAX_BYTES = 1_000_000
 SUCCESS_COMMAND_OUTPUT_MAX_BYTES = 64_000
 VERSION_OUTPUT_MAX_BYTES = 16_000
@@ -354,6 +362,8 @@ class TaskFixture:
     allowed_tools: list[str] = field(default_factory=list)
     success_command: str | None = None
     success_cwd: str = "."
+    variant_prompt_files: dict[str, str] = field(default_factory=dict)
+    variant_prompt_texts: dict[str, str] = field(default_factory=dict)
 @dataclass
@@ -387,6 +397,7 @@ class RunResult:
     provider_cached_tokens: int = 0
     provider_cached_tokens_measured: bool = False
     primary_tokens_measured: bool = False
+    self_hosted_metrics: dict[str, Any] | None = None
 @dataclass
@@ -433,6 +444,22 @@ def parse_string_list(value: Any, *, field: str, owner: str) -> list[str]:
     return items
+def parse_string_map(value: Any, *, field: str, owner: str) -> dict[str, str]:
+    """Parse a JSON fixture field that must be an object of non-empty string values."""
+    if value is None:
+        return {}
+    if not isinstance(value, dict):
+        raise SystemExit(f"{owner} {field} must be a JSON object of strings")
+    items: dict[str, str] = {}
+    for raw_key, raw_value in value.items():
+        if not isinstance(raw_key, str) or not raw_key.strip():
+            raise SystemExit(f"{owner} {field} keys must be non-empty strings")
+        if not isinstance(raw_value, str) or not raw_value.strip():
+            raise SystemExit(f"{owner} {field}.{raw_key} must be a non-empty string")
+        items[raw_key] = raw_value
+    return items
 def validate_variant_extra_args(extra_args: list[str], *, owner: str) -> list[str]:
     for index, arg in enumerate(extra_args):
         flag = arg.split("=", 1)[0]
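
For orientation, the `parse_string_map` helper added in this hunk can be exercised on its own. The block below repeats the function body from the diff so it runs standalone; the `rejected` helper and the sample values are made up for illustration and are not part of the package.

```python
from typing import Any


def parse_string_map(value: Any, *, field: str, owner: str) -> dict[str, str]:
    # Copied from the hunk above: None means "field absent", so return {}.
    if value is None:
        return {}
    if not isinstance(value, dict):
        raise SystemExit(f"{owner} {field} must be a JSON object of strings")
    items: dict[str, str] = {}
    for raw_key, raw_value in value.items():
        if not isinstance(raw_key, str) or not raw_key.strip():
            raise SystemExit(f"{owner} {field} keys must be non-empty strings")
        if not isinstance(raw_value, str) or not raw_value.strip():
            raise SystemExit(f"{owner} {field}.{raw_key} must be a non-empty string")
        items[raw_key] = raw_value
    return items


def rejected(value: Any) -> bool:
    """Helper for this sketch only: True when the fixture value is refused."""
    try:
        parse_string_map(value, field="variant_prompt_files", owner="task t01")
    except SystemExit:
        return True
    return False


print(rejected(["t01.prompt.md"]))              # a list is not a JSON object
print(rejected({"context_hygiene": "   "}))     # whitespace-only path
print(rejected({"context_hygiene": "t01.context_hygiene.prompt.md"}))
```

Note that accepted values are stored as given: the `.strip()` calls only test for emptiness and do not trim what ends up in the map.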
@@ -443,6 +470,101 @@ def validate_variant_extra_args(extra_args: list[str], *, owner: str) -> list[st
     return extra_args
+def validate_variant_prompt_file_path(raw_path: str, *, owner: str) -> Path:
+    """Return a safe relative prompt-file path, or fail before any file read."""
+    rel_path = Path(raw_path)
+    if rel_path.is_absolute():
+        raise SystemExit(f"{owner} variant_prompt_files path must be relative: {raw_path}")
+    if not rel_path.parts or rel_path == Path("."):
+        raise SystemExit(f"{owner} variant_prompt_files path must name a file")
+    if any(part in ("", ".", "..") for part in rel_path.parts):
+        raise SystemExit(f"{owner} variant_prompt_files path must not contain '.', '..', or empty components: {raw_path}")
+    return rel_path
+def validate_variant_prompt_file_references(
+    tasks: list[TaskFixture],
+    variants: list["Variant"],
+) -> None:
+    """Validate variant prompt-file keys and paths without dereferencing files.
+    Unknown variant keys and unsafe relative paths are rejected before any file
+    read. Missing prompt files are intentionally not checked here so a run
+    narrowed by --task-id/--variant is not blocked by unselected prompt files.
+    """
+    known_variants = {variant.name for variant in variants}
+    for task in tasks:
+        unknown = sorted(set(task.variant_prompt_files) - known_variants)
+        if unknown:
+            raise SystemExit(
+                f"task {task.id} variant_prompt_files references unknown variant(s): {', '.join(unknown)}"
+            )
+        for variant_name, raw_path in task.variant_prompt_files.items():
+            validate_variant_prompt_file_path(
+                raw_path,
+                owner=f"task {task.id} variant {variant_name}",
+            )
+def read_variant_prompt_file(path: Path, *, owner: str, display_path: str | None = None) -> str:
+    """Read one selected prompt file with no-follow IO and an argv-safe size cap."""
+    label = display_path or path.name
+    try:
+        fd = _open_regular_no_symlink(path)
+    except OSError as exc:
+        detail = exc.strerror or exc.__class__.__name__
+        raise SystemExit(f"{owner} variant_prompt_files could not read prompt file: {label}: {detail}") from None
+    try:
+        size = os.fstat(fd).st_size
+        if size > MAX_VARIANT_PROMPT_FILE_BYTES:
+            raise SystemExit(
+                f"{owner} variant_prompt_files prompt file exceeds "
+                f"{MAX_VARIANT_PROMPT_FILE_BYTES} bytes: {label}"
+            )
+        try:
+            with os.fdopen(fd, "r", encoding="utf-8") as handle:
+                fd = -1
+                text = handle.read()
+        except UnicodeDecodeError as exc:
+            raise SystemExit(
+                f"{owner} variant_prompt_files prompt file must be UTF-8 text: "
+                f"{label}: {exc.reason}"
+            ) from None
+        except OSError as exc:
+            detail = exc.strerror or exc.__class__.__name__
+            raise SystemExit(f"{owner} variant_prompt_files could not read prompt file: {label}: {detail}") from None
+    finally:
+        if fd != -1:
+            os.close(fd)
+    if len(text.encode("utf-8", errors="replace")) > MAX_VARIANT_PROMPT_FILE_BYTES:
+        raise SystemExit(
+            f"{owner} variant_prompt_files prompt text exceeds "
+            f"{MAX_VARIANT_PROMPT_FILE_BYTES} bytes after decoding: {label}"
+        )
+    return text
+def load_variant_prompt_files_for_targets(
+    targets: list[tuple[TaskFixture, "Variant"]],
+    *,
+    task_file_dir: Path,
+) -> None:
+    """Load file-backed prompts only for selected (task, variant) targets."""
+    for task, variant in targets:
+        raw_path = task.variant_prompt_files.get(variant.name)
+        if raw_path is None:
+            continue
+        rel_path = validate_variant_prompt_file_path(
+            raw_path,
+            owner=f"task {task.id} variant {variant.name}",
+        )
+        task.variant_prompt_texts[variant.name] = read_variant_prompt_file(
+            task_file_dir / rel_path,
+            owner=f"task {task.id} variant {variant.name}",
+            display_path=str(rel_path),
+        )
 def normalize_usage_token(value: Any) -> int | None:
     """Return a safe non-negative token count, or None for invalid metrics."""
     if isinstance(value, bool) or not isinstance(value, (int, float)):
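
The lexical checks in `validate_variant_prompt_file_path` can be tried in isolation. This is a minimal sketch, not the package's code: `is_safe_relative` is a hypothetical name, it returns a boolean instead of raising `SystemExit`, and it uses `PurePosixPath` rather than `Path` so the result does not depend on the host platform.

```python
from pathlib import PurePosixPath


def is_safe_relative(raw_path: str) -> bool:
    """Mirror the three lexical checks from the hunk above."""
    rel = PurePosixPath(raw_path)
    if rel.is_absolute():
        return False
    if not rel.parts:  # "" and "." both parse to zero components
        return False
    return not any(part in ("", ".", "..") for part in rel.parts)


for candidate in (
    "t01.context_hygiene.prompt.md",
    "prompts/t01.md",
    "./t01.md",
    "../secret.md",
    "prompts/../../secret.md",
    "/etc/passwd",
    ".",
):
    print(f"{candidate!r}: {is_safe_relative(candidate)}")
```

One detail worth knowing when reading the original: `pathlib` drops single-dot components while parsing, so `./t01.md` is seen as `t01.md` and passes, and `..` is the component the membership test catches in practice. The check is purely lexical; symlink handling is left to the no-follow open that `read_variant_prompt_file` performs through `_open_regular_no_symlink`.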
@@ -469,7 +591,7 @@ def normalize_usage_cost(value: Any) -> float | None:
     return numeric
-def parse_tasks(path: Path) -> list[TaskFixture]:
+def parse_tasks(path: Path, variants: list["Variant"] | None = None) -> list[TaskFixture]:
     raw = json.loads(_read_text_no_follow(path))
     if not isinstance(raw, list):
         raise SystemExit(f"tasks file must be a JSON list: {path}")
@@ -488,21 +610,33 @@ def parse_tasks(path: Path) -> list[TaskFixture]:
                 raise SystemExit(f"task {item.get('id')} max_budget_usd must be finite and > 0 (use null for unlimited)")
         else:
             budget = None
+        task_id = str(item["id"])
+        if "variant_prompts" in item:
+            raise SystemExit(
+                f"task {task_id} variant_prompts is not supported; use file-backed variant_prompt_files"
+            )
         fixtures.append(TaskFixture(
-            id=str(item["id"]),
+            id=task_id,
             prompt=str(item["prompt"]),
             model=str(item.get("model", "sonnet")),
             effort=str(effort_raw) if effort_raw is not None else None,
-            max_turns=parse_positive_int(item.get("max_turns", 3), field="max_turns", owner=f"task {item.get('id')}"),
+            max_turns=parse_positive_int(item.get("max_turns", 3), field="max_turns", owner=f"task {task_id}"),
             max_budget_usd=budget,
             allowed_tools=parse_string_list(
                 item.get("allowed_tools", []),
                 field="allowed_tools",
-                owner=f"task {item.get('id')}",
+                owner=f"task {task_id}",
             ),
             success_command=item.get("success_command"),
             success_cwd=str(item.get("success_cwd", ".")),
+            variant_prompt_files=parse_string_map(
+                item.get("variant_prompt_files"),
+                field="variant_prompt_files",
+                owner=f"task {task_id}",
+            ),
         ))
+    if variants is not None:
+        validate_variant_prompt_file_references(fixtures, variants)
     return fixtures
@@ -717,6 +851,102 @@ def collect_shift_metrics(payload: Any) -> dict[str, int | float | bool]:
     return metrics
+def normalize_self_hosted_metric(value: Any, *, maximum: float) -> float | None:
+    if isinstance(value, bool) or not isinstance(value, (int, float)):
+        return None
+    number = float(value)
+    if not math.isfinite(number) or number < 0 or number > maximum:
+        return None
+    return number
+def sanitize_self_hosted_label(value: Any) -> str | None:
+    if not isinstance(value, str):
+        return None
+    text = sanitize_note_text(value)
+    if not text:
+        return None
+    if len(text) > MAX_SELF_HOSTED_LABEL_CHARS:
+        text = text[:MAX_SELF_HOSTED_LABEL_CHARS - 12].rstrip() + "…[truncated]"
+    return text
+def normalize_self_hosted_metrics(raw: Any, *, source: str) -> dict[str, Any] | None:
+    if not isinstance(raw, dict):
+        return None
+    metrics: dict[str, float] = {}
+    labels: dict[str, str] = {}
+    availability = {
+        "latency_ms": False,
+        "peak_memory_mb": False,
+        "quality_score": False,
+    }
+    latency = normalize_self_hosted_metric(raw.get("latency_ms"), maximum=MAX_SELF_HOSTED_LATENCY_MS)
+    if latency is not None:
+        metrics["latency_ms"] = latency
+        availability["latency_ms"] = True
+    peak_memory = normalize_self_hosted_metric(raw.get("peak_memory_mb"), maximum=MAX_SELF_HOSTED_MEMORY_MB)
+    if peak_memory is not None:
+        metrics["peak_memory_mb"] = peak_memory
+        availability["peak_memory_mb"] = True
+    quality = normalize_self_hosted_metric(raw.get("quality_score"), maximum=1.0)
+    if quality is not None:
+        metrics["quality_score"] = quality
+        availability["quality_score"] = True
+    for key in ("model_server", "optimization", "quality_metric"):
+        label = sanitize_self_hosted_label(raw.get(key))
+        if label is not None:
+            labels[key] = label
+    if not metrics:
+        return None
+    return {
+        "schema_version": SELF_HOSTED_METRICS_SCHEMA_VERSION,
+        "source": source,
+        "metrics": metrics,
+        "labels": labels,
+        "measurement_availability": availability,
+        "claim_boundary": {
+            "id": SELF_HOSTED_METRICS_CLAIM_BOUNDARY,
+            "hosted_api_token_savings_claim_allowed": False,
+            "hosted_api_cost_savings_claim_allowed": False,
+            "requires_provider_measured_matched_tasks_for_hosted_claims": True,
+            "reason": (
+                "Self-hosted local/model-server latency, memory, and quality metrics "
+                "are not hosted API token or cost telemetry."
+            ),
+        },
+    }
+def collect_self_hosted_metrics(payload: Any) -> dict[str, Any] | None:
+    """Collect explicit self-hosted metric sidecars without broad key inference.
+    Only explicit top-level telemetry envelopes are considered.  Do not infer
+    from incidental keys like `self_hosted_latency_ms` or arbitrary nested model
+    message content: that would make local/model-server telemetry too easy to
+    mix into hosted API claim surfaces.
+    """
+    if not isinstance(payload, dict):
+        return None
+    candidates = [
+        (
+            payload.get(SELF_HOSTED_METRICS_KEY),
+            f"explicit_provider_payload.{SELF_HOSTED_METRICS_KEY}",
+        )
+    ]
+    metrics_envelope = payload.get("metrics")
+    if isinstance(metrics_envelope, dict):
+        candidates.append((
+            metrics_envelope.get(SELF_HOSTED_METRICS_KEY),
+            f"explicit_provider_payload.metrics.{SELF_HOSTED_METRICS_KEY}",
+        ))
+    for raw, source in candidates:
+        normalized = normalize_self_hosted_metrics(raw, source=source)
+        if normalized is not None:
+            return normalized
+    return None
 def claude_version(claude_bin: str) -> str:
     try:
         proc = run_bounded_command(
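
The bounds applied to self-hosted metrics and labels in this hunk are easy to check by hand. The sketch below restates `normalize_self_hosted_metric` under the shorter name `normalize_metric` and isolates the truncation step of `sanitize_self_hosted_label` as `truncate_label`; both short names are hypothetical, and the call to `sanitize_note_text` is omitted because that helper is not shown in the diff.

```python
from __future__ import annotations

import math
from typing import Any

MAX_LABEL_CHARS = 120              # mirrors MAX_SELF_HOSTED_LABEL_CHARS
TRUNCATION_MARK = "…[truncated]"   # 12 characters, matching the literal 12 in the diff


def normalize_metric(value: Any, *, maximum: float) -> float | None:
    """Same checks as normalize_self_hosted_metric in the hunk above."""
    # bool is a subclass of int, so it has to be excluded explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    number = float(value)
    if not math.isfinite(number) or number < 0 or number > maximum:
        return None
    return number


def truncate_label(text: str) -> str:
    """Cap a label at MAX_LABEL_CHARS, marking the cut (sanitizing omitted)."""
    if len(text) > MAX_LABEL_CHARS:
        keep = MAX_LABEL_CHARS - len(TRUNCATION_MARK)
        text = text[:keep].rstrip() + TRUNCATION_MARK
    return text


print(normalize_metric(0.87, maximum=1.0))          # in range
print(normalize_metric(True, maximum=1.0))          # bool is rejected
print(normalize_metric(float("nan"), maximum=1.0))  # non-finite is rejected
print(len(truncate_label("x" * 500)))
```

Invalid values become `None` rather than being clamped, which matches how `normalize_self_hosted_metrics` leaves the corresponding `measurement_availability` flag at `False` and returns `None` when no metric survives.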
@@ -747,7 +977,7 @@ def build_claude_argv(claude_bin: str, task: TaskFixture, variant: Variant) -> l
         argv.extend(["--allowedTools", ",".join(task.allowed_tools)])
     argv.extend(variant.extra_args)
     argv.append("--")
-    argv.append(task.prompt)
+    argv.append(task.variant_prompt_texts.get(variant.name, task.prompt))
     return argv
@@ -1003,6 +1233,7 @@ def run_fixture(task: TaskFixture, variant: Variant, claude_bin: str,
     tokens, cost, cost_measured, primary_tokens_measured = collect_usage(payload)
     provider_cached_tokens, provider_cached_tokens_measured = collect_provider_cache_telemetry(payload)
     shift_metrics = collect_shift_metrics(payload)
+    self_hosted_metrics = collect_self_hosted_metrics(payload)
     success, success_note = run_success_command(task, project_root)
     return RunResult(
         task_id=task.id, variant=variant.name, model=task.model, effort=task.effort,
@@ -1021,6 +1252,7 @@ def run_fixture(task: TaskFixture, variant: Variant, claude_bin: str,
         external_cost_measured=bool(shift_metrics["external_cost_measured"]),
         provider_cached_tokens=provider_cached_tokens,
         provider_cached_tokens_measured=provider_cached_tokens_measured,
+        self_hosted_metrics=self_hosted_metrics,
     )
@@ -1169,6 +1401,7 @@ def append_cost_shift_ledger(path: Path, claude_ver: str, result: RunResult) ->
             "provider_cache": result.provider_cached_tokens_measured,
             "byte_metrics": byte_metrics_observed,
             "wall_time": result.wall_time_seconds >= 0,
+            "self_hosted_metrics": result.self_hosted_metrics is not None,
         },
         "proxy_metrics": {
             "byte_metrics_observed": byte_metrics_observed,
@@ -1177,6 +1410,8 @@ def append_cost_shift_ledger(path: Path, claude_ver: str, result: RunResult) ->
             "claim_boundary": "proxy_only_not_hosted_token_savings",
         },
     }
+    if result.self_hosted_metrics is not None:
+        payload["self_hosted_metrics"] = result.self_hosted_metrics
     with csv_file_lock(path, create_parent=True):
         fd = _open_regular_no_symlink(path, os.O_CREAT | os.O_APPEND | os.O_WRONLY, 0o600, create_parent=True)
         try:
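
The sidecar rule in this hunk (write `self_hosted_metrics` into the ledger row only when it was collected) can be sketched as follows. `ledger_line` and the `availability` container name are hypothetical; the diff shows the flag and the top-level key but not the name of the enclosing dictionary or the other row fields.

```python
from __future__ import annotations

import json
from typing import Any


def ledger_line(task_id: str, variant: str, self_hosted: dict[str, Any] | None) -> str:
    """Serialize one JSONL row; the sidecar key exists only when measured."""
    row: dict[str, Any] = {
        "task_id": task_id,
        "variant": variant,
        # Hypothetical container name; the hunk does not show the real one.
        "availability": {"self_hosted_metrics": self_hosted is not None},
    }
    if self_hosted is not None:
        row["self_hosted_metrics"] = self_hosted
    return json.dumps(row, sort_keys=True)


without_sidecar = json.loads(ledger_line("t01", "baseline", None))
with_sidecar = json.loads(
    ledger_line("t01", "context_hygiene", {"metrics": {"latency_ms": 420.0}})
)
print("self_hosted_metrics" in without_sidecar)
print(with_sidecar["self_hosted_metrics"]["metrics"]["latency_ms"])
```

Keeping the key absent, instead of writing `null`, means a consumer that never looks for the sidecar sees the same row shape as before the change.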
@@ -2090,8 +2325,8 @@ def main() -> int:
     require_no_follow_file_ops_supported()
     validate_distinct_output_paths(args.csv, args.ledger_jsonl, args.report_json)
-    tasks = parse_tasks(args.tasks)
     variants = parse_variants(args.variants)
+    tasks = parse_tasks(args.tasks, variants=variants)
     targets = filter_targets(tasks, variants, args.task_id, args.variant)
     if not targets:
         print("no (task, variant) targets matched the filters", file=sys.stderr)
@@ -2122,6 +2357,9 @@ def main() -> int:
             print(f"claude binary not found: {args.claude_bin}", file=sys.stderr)
             return 2
+    if runnable_targets:
+        load_variant_prompt_files_for_targets(runnable_targets, task_file_dir=args.tasks.parent)
     project_root = args.project_root.resolve()
     claude_ver = "dry-run" if args.dry_run else (claude_version(args.claude_bin) if runnable_targets else "skipped")
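
The ordering in `main` (validate every reference up front, then read files only for the targets that will run) is what keeps an unselected fixture's missing prompt file from failing a narrowed run. The sketch below shows that effect with simplified stand-ins: `load_selected` and the tuple-keyed mapping are hypothetical, and it uses plain `Path.read_text` without the no-follow open or the size cap that the real loader applies.

```python
import tempfile
from pathlib import Path


def load_selected(
    prompt_files: dict[tuple[str, str], str],
    selected: list[tuple[str, str]],
    base_dir: Path,
) -> dict[tuple[str, str], str]:
    """Read prompt files only for the selected (task_id, variant) pairs."""
    texts: dict[tuple[str, str], str] = {}
    for key in selected:
        rel = prompt_files.get(key)
        if rel is None:
            continue  # this pair falls back to the task's inline prompt
        texts[key] = (base_dir / rel).read_text(encoding="utf-8")
    return texts


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "t01.prompt.md").write_text("Fix the session test.", encoding="utf-8")
    mapping = {
        ("t01", "context_hygiene"): "t01.prompt.md",
        ("t02", "context_hygiene"): "missing.prompt.md",  # never created
    }
    # Only t01 is selected, so the missing t02 file is never opened.
    loaded = load_selected(mapping, [("t01", "context_hygiene")], base)

print(loaded)
```

Selecting `("t02", "context_hygiene")` instead would raise `FileNotFoundError` here; the real loader turns that case into a `SystemExit` with a message naming the prompt file.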