@event4u/agent-config 2.20.1 → 2.21.0

@@ -0,0 +1,107 @@
+ ---
+ stability: beta
+ keep-beta-until: 2026-08-15
+ ---
+
+ # cost-summary schema (`cost-summary/v1`)
+
+ Stable JSON contract for inter-tool consumption of cost-tracking data
+ emitted by [`scripts/cost_summary.py`](../../scripts/cost_summary.py).
+ Schema-versioned so downstream consumers can pin and migrate explicitly.
+
+ Design reference: Ruflo `scripts/summary.mjs` (upstream cite). Our shape
+ diverges to align with the local `agents/cost-tracking/sessions.jsonl`
+ fields and the caveman-suspended-multiplier contract.
+
+ ## Envelope
+
+ ```json
+ {
+   "schema_version": "cost-summary/v1",
+   "generated_at": "2026-05-16T23:45:00Z",
+   "totals": { ... },
+   "by_session": [ ... ],
+   "by_conversation": [ ... ],
+   "by_model": [ ... ]
+ }
+ ```
+
+ | Field | Type | Notes |
+ |---|---|---|
+ | `schema_version` | string | Pinned to `cost-summary/v1`. Downstream consumers MUST refuse unknown versions. |
+ | `generated_at` | string (ISO-8601 UTC, `Z` suffix) | Emit time. |
+ | `totals` | object | Lifetime aggregates — see `totals` below. |
+ | `by_session` | array | Per `sessionId` row; ordered by `sessionId` ascending. |
+ | `by_conversation` | array | Per `conversation_id` row; ordered by `conversation_id` ascending. |
+ | `by_model` | array | Per `model` row; ordered by `model` ascending. |
+
+ ## `totals` shape
+
+ ```json
+ {
+   "sessions": 123,
+   "total_cost_usd": 1.2345,
+   "input_tokens": 100000,
+   "output_tokens": 50000,
+   "caveman_delta_tokens": 0,
+   "caveman_multiplier_version": "v1",
+   "caveman_multiplier_active": false
+ }
+ ```
+
+ `caveman_delta_tokens` is always `0` while
+ `caveman_multiplier_active == false` — see
+ [`caveman-telemetry.md`](caveman-telemetry.md) for the suspension contract.
+
+ ## `by_session` / `by_conversation` row shape
+
+ ```json
+ {
+   "key": "<sessionId or conversation_id>",
+   "sessions": 12,
+   "total_cost_usd": 0.4567,
+   "input_tokens": 8000,
+   "output_tokens": 4500,
+   "caveman_delta_tokens": 0
+ }
+ ```
+
+ The `key` field is the grouping identifier; consumers identify the
+ group by inspecting which array the row lives in.
+
+ ## `by_model` row shape
+
+ ```json
+ {
+   "model": "claude-3-5-sonnet-20241022",
+   "sessions": 12,
+   "total_cost_usd": 0.4567,
+   "input_tokens": 8000,
+   "output_tokens": 4500
+ }
+ ```
+
+ `by_model` omits caveman fields — the multiplier is dialect-scoped, not
+ model-scoped.
+
+ ## Stability guarantees
+
+ - **Field additions** are **non-breaking**: consumers MUST ignore unknown fields.
+ - **Field removals or renames** bump the `schema_version` minor (`v1` → `v2`).
+ - **Type changes** bump the major (`v1.*` → `v2.0`).
+ - Downstream consumers SHOULD pin to a specific `schema_version` and
+   refuse unknown ones; the pin is the migration boundary.
+
+ ## Downstream consumers
+
+ - `agent-status` skill — surfaces lifetime / current-conversation slice.
+ - Future `cost-export-to-monitoring` scripts (deferred; trigger:
+   consumer request) would wrap this JSON to push to Prometheus / OTLP.
+
+ ## See also
+
+ - [`caveman-telemetry.md`](caveman-telemetry.md) — defines the
+   `caveman_*` fields and the suspended-multiplier contract.
+ - [`scripts/cost_summary.py`](../../scripts/cost_summary.py) — implementation.
+ - [`scripts/cost_by_conversation.py`](../../scripts/cost_by_conversation.py) — narrower per-conversation lens with the same JSONL source.
+ - [`scripts/caveman_stats.py`](../../scripts/caveman_stats.py) — caveman-only delta lens with the same JSONL source.
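Reviewer note: the schema document asks consumers to refuse unknown `schema_version` values and to ignore unknown fields. A minimal sketch of a consumer that follows both rules might look like the block below. The function name `parse_cost_summary`, the constants, and the `future_field` key are illustrative assumptions and are not part of the package.

```python
from __future__ import annotations

import json
from typing import Any

# The version this hypothetical consumer is pinned to.
PINNED_SCHEMA_VERSION = "cost-summary/v1"

# Envelope fields named in the schema document; anything else is ignored.
KNOWN_ENVELOPE_FIELDS = (
    "schema_version", "generated_at", "totals",
    "by_session", "by_conversation", "by_model",
)


def parse_cost_summary(raw: str) -> dict[str, Any]:
    """Parse a cost-summary document, refusing any version other than the pin."""
    doc = json.loads(raw)
    version = doc.get("schema_version")
    if version != PINNED_SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version: {version!r}")
    # Field additions are non-breaking, so unknown fields are dropped, not rejected.
    return {name: doc[name] for name in KNOWN_ENVELOPE_FIELDS if name in doc}


sample = json.dumps({
    "schema_version": "cost-summary/v1",
    "generated_at": "2026-05-16T23:45:00Z",
    "totals": {"sessions": 123, "total_cost_usd": 1.2345},
    "by_session": [],
    "by_conversation": [],
    "by_model": [],
    "future_field": True,  # hypothetical later addition; must be ignored
})
summary = parse_cost_summary(sample)
print(summary["totals"]["total_cost_usd"])
```

Keeping the pin in one constant makes the migration boundary explicit: moving to a later schema version is a deliberate one-line change rather than something that happens silently.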
@@ -1800,6 +1800,12 @@
      "load_context": [],
      "load_context_eager": []
    },
+   ".agent-src.uncompressed/skills/compress-memory/SKILL.md": {
+     "kind": "skill",
+     "rule_type": null,
+     "load_context": [],
+     "load_context_eager": []
+   },
    ".agent-src.uncompressed/skills/content-funnel-design/SKILL.md": {
      "kind": "skill",
      "rule_type": null,
@@ -6396,6 +6402,13 @@
      "via": "self",
      "depth": 0
    },
+   {
+     "source": ".agent-src.uncompressed/rules/caveman-speak.md",
+     "target": ".agent-src.uncompressed/skills/compress-memory/SKILL.md",
+     "type": "READ_ONLY",
+     "via": "body_link",
+     "depth": 1
+   },
    {
      "source": ".agent-src.uncompressed/rules/cli-output-handling.md",
      "target": ".agent-src.uncompressed/rules/cli-output-handling.md",
@@ -8048,6 +8061,34 @@
      "via": "self",
      "depth": 0
    },
+   {
+     "source": ".agent-src.uncompressed/skills/compress-memory/SKILL.md",
+     "target": ".agent-src.uncompressed/rules/caveman-speak.md",
+     "type": "READ_ONLY",
+     "via": "body_link",
+     "depth": 1
+   },
+   {
+     "source": ".agent-src.uncompressed/skills/compress-memory/SKILL.md",
+     "target": ".agent-src.uncompressed/rules/role-mode-adherence.md",
+     "type": "READ_ONLY",
+     "via": "body_link",
+     "depth": 1
+   },
+   {
+     "source": ".agent-src.uncompressed/skills/compress-memory/SKILL.md",
+     "target": ".agent-src.uncompressed/skills/agents-md-thin-root/SKILL.md",
+     "type": "READ_ONLY",
+     "via": "body_link",
+     "depth": 1
+   },
+   {
+     "source": ".agent-src.uncompressed/skills/compress-memory/SKILL.md",
+     "target": ".agent-src.uncompressed/skills/compress-memory/SKILL.md",
+     "type": "WRITE",
+     "via": "self",
+     "depth": 0
+   },
    {
      "source": ".agent-src.uncompressed/skills/content-funnel-design/SKILL.md",
      "target": ".agent-src.uncompressed/skills/activation-design/SKILL.md",
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@event4u/agent-config",
-   "version": "2.20.1",
+   "version": "2.21.0",
    "description": "Shared agent configuration \u2014 skills, rules, commands, guidelines, and templates for AI coding tools",
    "license": "MIT",
    "private": false,
@@ -0,0 +1,273 @@
+ # Caveman compression bench — step-16 Phase 1 Step 4.
+ #
+ # Three-arm live bench against bench/corpora/caveman/prompts.yaml:
+ # compressed — system prompt embeds caveman-speak rule (aggressive).
+ # terse_control — system prompt = "Answer concisely. …" (carve-out-free baseline).
+ # uncompressed — generic helpful-assistant system prompt.
+ #
+ # Token counts come from Anthropic API `usage` (authoritative). Carve-out
+ # share is measured via regex extraction on the reply text; chars/4 yields
+ # an estimated carve-out-token figure for the carve-out-tax accounting.
+ #
+ # Cost-touch: 10 prompts × 3 arms × claude-sonnet-4-5 (~$3/M in, ~$15/M out).
+ """Caveman compression bench runner."""
+ from __future__ import annotations
+
+ import re
+ import statistics
+ import time
+ from dataclasses import dataclass, field
+ from pathlib import Path
+ from typing import Any
+
+ import yaml
+
+ # ── system prompts per arm ──────────────────────────────────────────────
+
+ SYSTEM_PROMPT_COMPRESSED = """You are speaking in CAVEMAN-SPEAK mode (speak_scope=aggressive).
+
+ Compress all body prose to caveman grammar:
+ - Drop articles (the, a, an).
+ - Drop linking auxiliaries (is, are, was, be) where unambiguous.
+ - Drop pronouns when context is clear.
+ - Keep nouns, verbs, key adjectives, negation, numbers.
+ - Example: "I will now check the file and see if it exists" -> "Check file. Exists?"
+
+ Carve-outs — preserve BYTE-FOR-BYTE (do NOT compress these):
+ 1. Triple-backtick code/literal blocks (any language, including ALL-CAPS Iron-Law fences).
+ 2. Numbered-options lines matching ^\\d+\\.\\s + a **Recommendation:** label.
+ 3. Backtick spans (file paths, command names, identifiers).
+ 4. Status markers: lines starting with ❌, ⚠️, or ✅.
+ 5. Mode markers.
+ 6. Markdown tables.
+ 7. Deliverables (PR titles, commit messages, ticket summaries, articles, the prompt
+ line of any single question asked to the user).
+
+ Apply caveman compression aggressively to every other prose surface."""
+
+ SYSTEM_PROMPT_TERSE = (
+     "Answer concisely. Skip preamble. Do not restate the question. "
+     "Avoid filler phrases ('Let me', 'Here is', 'I will'). Get to the answer."
+ )
+
+ SYSTEM_PROMPT_UNCOMPRESSED = (
+     "You are a helpful AI assistant. Answer the user's question clearly and completely."
+ )
+
+ ARMS: tuple[str, ...] = ("compressed", "terse_control", "uncompressed")
+ ARM_SYSTEM_PROMPT: dict[str, str] = {
+     "compressed": SYSTEM_PROMPT_COMPRESSED,
+     "terse_control": SYSTEM_PROMPT_TERSE,
+     "uncompressed": SYSTEM_PROMPT_UNCOMPRESSED,
+ }
+
+ # ── carve-out detection ────────────────────────────────────────────────
+
+ _RE_TRIPLE_BACKTICK = re.compile(r"```[\s\S]*?```")
+ _RE_BACKTICK_SPAN = re.compile(r"`[^`\n]+`")
+ _RE_NUMBERED_LINE = re.compile(r"^>?\s*\d+\.\s.*$", re.MULTILINE)
+ _RE_STATUS_LINE = re.compile(r"^(❌|⚠️|✅).*$", re.MULTILINE)
+ _RE_TABLE_LINE = re.compile(r"^\s*\|.*\|\s*$", re.MULTILINE)
+ _RE_RECOMMENDATION = re.compile(r"^\*\*(Recommendation|Empfehlung):\*\*.*$", re.MULTILINE)
+
+
+ def carve_out_chars(text: str) -> int:
+     """Sum byte-length of every carve-out region (union, no double-count)."""
+     if not text:
+         return 0
+     mask = bytearray(len(text))
+     for pattern in (
+         _RE_TRIPLE_BACKTICK, _RE_BACKTICK_SPAN, _RE_NUMBERED_LINE,
+         _RE_STATUS_LINE, _RE_TABLE_LINE, _RE_RECOMMENDATION,
+     ):
+         for m in pattern.finditer(text):
+             for i in range(m.start(), m.end()):
+                 mask[i] = 1
+     return sum(mask)
+
+
+ # ── data shapes ────────────────────────────────────────────────────────
+
+ @dataclass
+ class ArmResult:
+     arm: str
+     text: str
+     input_tokens: int
+     output_tokens: int
+     latency_ms: int
+     output_chars: int
+     carve_out_chars: int
+     error: str | None = None
+
+     @property
+     def realised_carve_out_pct(self) -> float:
+         return self.carve_out_chars / self.output_chars if self.output_chars else 0.0
+
+
+ @dataclass
+ class PromptResult:
+     id: str
+     category: str
+     expected_carve_out_pct: float
+     arms: dict[str, ArmResult] = field(default_factory=dict)
+
+     @property
+     def savings_vs_raw(self) -> float | None:
+         c = self.arms.get("compressed")
+         u = self.arms.get("uncompressed")
+         if not c or not u or u.output_tokens == 0:
+             return None
+         return 1.0 - (c.output_tokens / u.output_tokens)
+
+     @property
+     def savings_vs_terse(self) -> float | None:
+         c = self.arms.get("compressed")
+         t = self.arms.get("terse_control")
+         if not c or not t or t.output_tokens == 0:
+             return None
+         return 1.0 - (c.output_tokens / t.output_tokens)
+
+
+ # ── corpus + runner ────────────────────────────────────────────────────
+
+ def load_corpus(corpus_path: Path) -> list[dict[str, Any]]:
+     """Read bench/corpora/caveman/prompts.yaml → list of prompt dicts."""
+     data = yaml.safe_load(corpus_path.read_text(encoding="utf-8")) or {}
+     prompts = data.get("prompts") or []
+     if not prompts:
+         raise ValueError(f"empty corpus: {corpus_path}")
+     return prompts
+
+
+ def run_arm(
+     client: Any,
+     arm: str,
+     user_prompt: str,
+     *,
+     max_tokens: int = 1024,
+ ) -> ArmResult:
+     """Invoke one arm against the live API. Returns ArmResult including text."""
+     t0 = time.monotonic()
+     system = ARM_SYSTEM_PROMPT[arm]
+     try:
+         resp = client.ask(system, user_prompt, max_tokens=max_tokens)
+     except Exception as exc:  # noqa: BLE001
+         latency_ms = int((time.monotonic() - t0) * 1000)
+         return ArmResult(arm=arm, text="", input_tokens=0, output_tokens=0,
+                          latency_ms=latency_ms, output_chars=0, carve_out_chars=0,
+                          error=str(exc))
+     return ArmResult(
+         arm=arm, text=resp.text or "",
+         input_tokens=int(resp.input_tokens or 0),
+         output_tokens=int(resp.output_tokens or 0),
+         latency_ms=int(resp.latency_ms or (time.monotonic() - t0) * 1000),
+         output_chars=len(resp.text or ""),
+         carve_out_chars=carve_out_chars(resp.text or ""),
+         error=resp.error,
+     )
+
+
+ # ── aggregation ────────────────────────────────────────────────────────────
+
+ def _stats(values: list[float]) -> dict[str, float]:
+     """Median / p10 / p90 / stdev / n on a list of floats. Empty → zeros."""
+     if not values:
+         return {"n": 0, "median": 0.0, "p10": 0.0, "p90": 0.0, "stdev": 0.0}
+     s = sorted(values)
+     n = len(s)
+     def _pct(p: float) -> float:
+         if n == 1:
+             return s[0]
+         k = (n - 1) * p
+         lo, hi = int(k), min(int(k) + 1, n - 1)
+         return s[lo] + (s[hi] - s[lo]) * (k - lo)
+     return {
+         "n": n,
+         "median": statistics.median(s),
+         "p10": _pct(0.10),
+         "p90": _pct(0.90),
+         "stdev": statistics.pstdev(s) if n > 1 else 0.0,
+     }
+
+
+ def aggregate_results(results: list[PromptResult]) -> dict[str, Any]:
+     """Compute median/p10/p90 for compression metrics across the corpus."""
+     vs_raw = [r.savings_vs_raw for r in results if r.savings_vs_raw is not None]
+     vs_terse = [r.savings_vs_terse for r in results if r.savings_vs_terse is not None]
+     realised_carve_pct = [
+         r.arms["compressed"].realised_carve_out_pct
+         for r in results if "compressed" in r.arms and r.arms["compressed"].output_chars
+     ]
+     expected_carve_pct = [r.expected_carve_out_pct for r in results]
+
+     per_arm_tokens: dict[str, list[int]] = {a: [] for a in ARMS}
+     for r in results:
+         for arm in ARMS:
+             ar = r.arms.get(arm)
+             if ar:
+                 per_arm_tokens[arm].append(ar.output_tokens)
+
+     return {
+         "savings_vs_raw": _stats(vs_raw),
+         "savings_vs_terse": _stats(vs_terse),
+         "realised_carve_out_pct": _stats(realised_carve_pct),
+         "expected_carve_out_pct": _stats(expected_carve_pct),
+         "output_tokens": {
+             arm: _stats([float(v) for v in per_arm_tokens[arm]]) for arm in ARMS
+         },
+     }
+
+
+ def compute_cost(results: list[PromptResult], pricing: dict[str, float]) -> dict[str, Any]:
+     """Sum input/output tokens across all arms; cost from per-1M pricing dict."""
+     totals = {"input_tokens": 0, "output_tokens": 0, "calls": 0, "errors": 0}
+     per_arm: dict[str, dict[str, int]] = {a: {"input_tokens": 0, "output_tokens": 0, "calls": 0} for a in ARMS}
+     for r in results:
+         for arm, ar in r.arms.items():
+             totals["input_tokens"] += ar.input_tokens
+             totals["output_tokens"] += ar.output_tokens
+             totals["calls"] += 1
+             if ar.error:
+                 totals["errors"] += 1
+             per_arm[arm]["input_tokens"] += ar.input_tokens
+             per_arm[arm]["output_tokens"] += ar.output_tokens
+             per_arm[arm]["calls"] += 1
+     cost_usd = (
+         totals["input_tokens"] / 1e6 * pricing.get("input", 0.0)
+         + totals["output_tokens"] / 1e6 * pricing.get("output", 0.0)
+     )
+     totals["total_cost_usd"] = round(cost_usd, 6)
+     return {"totals": totals, "per_arm": per_arm}
+
+
+ # ── orchestrator ───────────────────────────────────────────────────────────
+
+ def run_caveman_bench(
+     client: Any,
+     corpus_path: Path,
+     *,
+     max_prompts: int | None = None,
+     max_tokens: int = 1024,
+     on_progress: Any = None,
+ ) -> list[PromptResult]:
+     """Run all three arms over the corpus. Returns per-prompt results."""
+     prompts = load_corpus(corpus_path)
+     if max_prompts:
+         prompts = prompts[:max_prompts]
+     results: list[PromptResult] = []
+     total = len(prompts) * len(ARMS)
+     done = 0
+     for p in prompts:
+         pr = PromptResult(
+             id=str(p["id"]),
+             category=str(p.get("category", "unknown")),
+             expected_carve_out_pct=float(p.get("expected_carve_out_pct", 0.0)),
+         )
+         for arm in ARMS:
+             ar = run_arm(client, arm, str(p["prompt"]), max_tokens=max_tokens)
+             pr.arms[arm] = ar
+             done += 1
+             if on_progress:
+                 on_progress(done, total, pr.id, arm, ar)
+         results.append(pr)
+     return results
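Reviewer note: `carve_out_chars` in the module above marks matched characters in a mask and sums the mask, so a region matched by two patterns is counted once. The sketch below reproduces that idea with only two of the six patterns to show how the union differs from a naive per-pattern sum. The names `union_chars`, `naive_chars`, and the sample reply are illustrative assumptions, and the fence marker is built at runtime purely so the example can be quoted inside a fenced block.

```python
import re

FENCE = "`" * 3  # built at runtime so this example can sit inside a fenced block

# Two of the bench module's patterns: fenced blocks and inline backtick spans.
RE_FENCE = re.compile(FENCE + r"[\s\S]*?" + FENCE)
RE_SPAN = re.compile(r"`[^`\n]+`")


def union_chars(text: str) -> int:
    """Count characters covered by at least one carve-out match."""
    mask = bytearray(len(text))
    for pattern in (RE_FENCE, RE_SPAN):
        for m in pattern.finditer(text):
            for i in range(m.start(), m.end()):
                mask[i] = 1
    return sum(mask)


def naive_chars(text: str) -> int:
    """Sum match lengths per pattern; overlapping regions are counted twice."""
    return sum(
        m.end() - m.start()
        for pattern in (RE_FENCE, RE_SPAN)
        for m in pattern.finditer(text)
    )


# The inline span `date` sits inside the fenced block, so the two patterns overlap.
reply = f"Use `ls`.\n{FENCE}sh\necho `date`\n{FENCE}\nDone."
covered = union_chars(reply)    # 25 of 37 characters
double = naive_chars(reply)     # 31, because `date` is counted twice
share = covered / len(reply)    # the realised carve-out share for this reply
```

The mask costs one byte per character of reply text, which is negligible for replies capped at 1024 output tokens and avoids any interval-merging logic.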
@@ -0,0 +1,152 @@
+ # Caveman bench report serializer — step-16 Phase 1 Step 5.
+ #
+ # Emits the caveman-v1 JSON + Markdown shape. Distinct schema_version
+ # ("caveman-v1") from the selection-accuracy bench (v1) because the
+ # blocks are disjoint: caveman has no `selection`/`quality`, and the
+ # selection bench has no three-arm compression metrics.
+ """Caveman bench report serializer."""
+ from __future__ import annotations
+
+ from typing import Any
+
+ from _lib.bench_caveman import ARMS, PromptResult, aggregate_results, compute_cost
+
+
+ def build_caveman_report(
+     *,
+     results: list[PromptResult],
+     corpus_path_rel: str,
+     generated_at: str,
+     bench_run_version: str,
+     model: str,
+     transport: str,
+     pricing_rates: dict[str, float],
+     pricing_sourced_on: str | None,
+ ) -> dict[str, Any]:
+     aggregate = aggregate_results(results)
+     cost = compute_cost(results, pricing_rates)
+     cost["source"] = "live-api"
+     cost["model"] = model
+     cost["pricing_sourced_on"] = pricing_sourced_on
+     errors = cost["totals"]["errors"]
+     return {
+         "schema_version": "caveman-v1",
+         "generated_at": generated_at,
+         "corpus": {
+             "id": "caveman",
+             "path": corpus_path_rel,
+             "prompt_count": len(results),
+         },
+         "runner": {
+             "bench_run_version": bench_run_version,
+             "transport": transport,
+             "model": model,
+         },
+         "caveman": {
+             "arms": list(ARMS),
+             "aggregate": aggregate,
+             "per_prompt": [_prompt_block(r) for r in results],
+         },
+         "cost": cost,
+         "verdict": {
+             "overall": "measured" if errors == 0 else "partial",
+             "errors": errors,
+         },
+     }
+
+
+ def _prompt_block(r: PromptResult) -> dict[str, Any]:
+     return {
+         "id": r.id,
+         "category": r.category,
+         "expected_carve_out_pct": r.expected_carve_out_pct,
+         "realised_carve_out_pct": (
+             r.arms["compressed"].realised_carve_out_pct
+             if "compressed" in r.arms else None
+         ),
+         "savings_vs_raw": r.savings_vs_raw,
+         "savings_vs_terse": r.savings_vs_terse,
+         "arms": {
+             arm: {
+                 "input_tokens": ar.input_tokens,
+                 "output_tokens": ar.output_tokens,
+                 "latency_ms": ar.latency_ms,
+                 "output_chars": ar.output_chars,
+                 "carve_out_chars": ar.carve_out_chars,
+                 "error": ar.error,
+                 "text": ar.text,
+             }
+             for arm, ar in r.arms.items()
+         },
+     }
+
+
+ def _fmt_pct(x: float | None) -> str:
+     return f"{x:.2%}" if isinstance(x, (int, float)) else "—"
+
+
+ def render_caveman_markdown(report: dict[str, Any]) -> str:
+     cv = report["caveman"]
+     agg = cv["aggregate"]
+     cost = report["cost"]
+     head = [
+         f"# Caveman Bench Report — `caveman` · {report['generated_at']}",
+         "",
+         "## Headline",
+         "",
+         f"- prompts: **{report['corpus']['prompt_count']}** · "
+         f"arms: **{', '.join(cv['arms'])}** · "
+         f"model: **{report['runner']['model']}** · "
+         f"transport: **{report['runner']['transport']}**",
+         f"- median savings vs raw: **{_fmt_pct(agg['savings_vs_raw']['median'])}** "
+         f"(p10 {_fmt_pct(agg['savings_vs_raw']['p10'])} · p90 {_fmt_pct(agg['savings_vs_raw']['p90'])})",
+         f"- median savings vs terse-control: **{_fmt_pct(agg['savings_vs_terse']['median'])}** "
+         f"(p10 {_fmt_pct(agg['savings_vs_terse']['p10'])} · p90 {_fmt_pct(agg['savings_vs_terse']['p90'])})",
+         f"- median realised carve-out share (compressed arm): **{_fmt_pct(agg['realised_carve_out_pct']['median'])}** "
+         f"(expected median {_fmt_pct(agg['expected_carve_out_pct']['median'])})",
+         f"- total cost: **${cost['totals']['total_cost_usd']:.6f}** "
+         f"(calls {cost['totals']['calls']} · errors {cost['totals']['errors']})",
+         f"- verdict: **{report['verdict']['overall']}**",
+         "",
+     ]
+     per_arm = [
+         "## Per-arm token totals",
+         "",
+         "| arm | calls | input_tokens | output_tokens | median out/prompt |",
+         "|---|---:|---:|---:|---:|",
+     ]
+     for arm in cv["arms"]:
+         a = cost["per_arm"][arm]
+         m = agg["output_tokens"][arm]["median"]
+         per_arm.append(
+             f"| `{arm}` | {a['calls']} | {a['input_tokens']} | {a['output_tokens']} | {m:.0f} |"
+         )
+     per_arm.append("")
+     per_prompt = [
+         "## Per-prompt results",
+         "",
+         "| id | category | exp.carve | real.carve | out.compressed | out.terse | out.uncompressed | vs raw | vs terse |",
+         "|---|---|---:|---:|---:|---:|---:|---:|---:|",
+     ]
+     for r in cv["per_prompt"]:
+         arms = r["arms"]
+         oc = arms.get("compressed", {}).get("output_tokens", "—")
+         ot = arms.get("terse_control", {}).get("output_tokens", "—")
+         ou = arms.get("uncompressed", {}).get("output_tokens", "—")
+         per_prompt.append(
+             f"| `{r['id']}` | {r['category']} | "
+             f"{_fmt_pct(r['expected_carve_out_pct'])} | {_fmt_pct(r['realised_carve_out_pct'])} | "
+             f"{oc} | {ot} | {ou} | "
+             f"{_fmt_pct(r['savings_vs_raw'])} | {_fmt_pct(r['savings_vs_terse'])} |"
+         )
+     per_prompt.append("")
+     notes = [
+         "## Notes",
+         "",
+         f"- corpus: `{report['corpus']['path']}`",
+         f"- pricing: `bench/pricing.yaml` (sourced {cost.get('pricing_sourced_on') or '—'})",
+         f"- schema: `caveman-v1` (see `docs/contracts/benchmark-report-schema.md`)",
+         f"- bench_run version: `{report['runner']['bench_run_version']}`",
+         "",
+     ]
+     return "\n".join(head + per_arm + per_prompt + notes)
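Reviewer note: the headline lines in the report quote a median with p10 and p90 for each savings metric. A small worked example of how those figures arise may help when reading a report: savings are `1 - compressed / baseline` per prompt, and the percentiles use linear interpolation between neighbouring ranks, as in `_stats`. The token pairs below are invented for illustration, and `savings` and `percentile` are standalone stand-ins rather than the module's own functions.

```python
from __future__ import annotations


def savings(compressed_tokens: int, baseline_tokens: int) -> float | None:
    """Fraction of output tokens saved relative to a baseline arm."""
    if baseline_tokens == 0:
        return None  # no baseline output, so the ratio is undefined
    return 1.0 - (compressed_tokens / baseline_tokens)


def percentile(values: list[float], p: float) -> float:
    """Linear interpolation between the two nearest ranks of the sorted values."""
    s = sorted(values)
    n = len(s)
    if n == 0:
        return 0.0
    if n == 1:
        return s[0]
    k = (n - 1) * p           # fractional rank in [0, n - 1]
    lo = int(k)
    hi = min(lo + 1, n - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)


# Hypothetical (compressed, uncompressed) output-token pairs for five prompts.
pairs = [(90, 100), (80, 100), (70, 100), (60, 100), (50, 100)]
per_prompt = [v for v in (savings(c, u) for c, u in pairs) if v is not None]

p10 = percentile(per_prompt, 0.10)  # about 0.14: rank 0.4 lies between 0.10 and 0.20
p90 = percentile(per_prompt, 0.90)  # about 0.46: rank 3.6 lies between 0.40 and 0.50
```

With only ten prompts in the corpus, p10 and p90 are each interpolated from two neighbouring observations, so they are better read as a rough spread than as stable tail estimates.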