@event4u/agent-config 4.8.0 → 5.0.0

Files changed (66)
  1. package/.agent-src/commands/implement-ticket.md +5 -4
  2. package/.agent-src/rules/language-and-tone.md +4 -10
  3. package/.agent-src/skills/command-routing/SKILL.md +5 -4
  4. package/.claude-plugin/marketplace.json +1 -1
  5. package/CHANGELOG.md +86 -0
  6. package/CONTRIBUTING.md +19 -0
  7. package/README.md +11 -0
  8. package/dist/cli/registry.js +0 -2
  9. package/dist/cli/registry.js.map +1 -1
  10. package/dist/discovery/deprecation-report.md +1 -1
  11. package/dist/discovery/discovery-manifest.json +5 -5
  12. package/dist/discovery/discovery-manifest.json.sha256 +1 -1
  13. package/dist/discovery/discovery-manifest.summary.md +1 -1
  14. package/dist/discovery/orphan-report.md +1 -1
  15. package/dist/discovery/packs.json +2 -2
  16. package/dist/discovery/trust-report.md +1 -1
  17. package/dist/discovery/workspaces.json +2 -2
  18. package/dist/mcp/registry-manifest.json +2 -2
  19. package/dist/router.json +1 -1671
  20. package/docs/benchmark.md +20 -8
  21. package/docs/benchmarks.md +11 -0
  22. package/docs/contracts/benchmark-corpus-spec.md +31 -3
  23. package/docs/contracts/command-surface-tiers.md +1 -1
  24. package/docs/contracts/hook-architecture-v1.md +33 -0
  25. package/docs/contracts/migrate-command.md +197 -0
  26. package/docs/contracts/settings-api.md +2 -1
  27. package/docs/contracts/value-dashboard-spec.md +374 -0
  28. package/docs/contracts/value-report-schema.md +150 -0
  29. package/docs/decisions/ADR-031-validation-severity-tiers-and-projection-roundtrip.md +97 -0
  30. package/docs/decisions/INDEX.md +1 -0
  31. package/docs/guidelines/agent-infra/installed-tools-manifest.md +6 -3
  32. package/docs/guidelines/agent-infra/language-and-tone-examples.md +35 -0
  33. package/docs/migration/v1-to-v2.md +40 -27
  34. package/docs/value.md +84 -0
  35. package/package.json +8 -8
  36. package/scripts/__pycache__/validate_frontmatter.cpython-312.pyc +0 -0
  37. package/scripts/_cli/cmd_migrate.py +264 -102
  38. package/scripts/_cli/cmd_settings_migrate.py +2 -1
  39. package/scripts/_dispatch.bash +147 -49
  40. package/scripts/_lib/__pycache__/__init__.cpython-312.pyc +0 -0
  41. package/scripts/_lib/__pycache__/agent_src.cpython-312.pyc +0 -0
  42. package/scripts/_lib/install_regenerator.py +129 -0
  43. package/scripts/_lib/value_ladder.py +599 -0
  44. package/scripts/_lib/value_report.py +441 -0
  45. package/scripts/bench_rtk_savings.py +320 -0
  46. package/scripts/compile_router.py +19 -5
  47. package/scripts/expected_perms.json +1 -1
  48. package/scripts/first_run_gate_hook.py +178 -0
  49. package/scripts/hook_manifest.yaml +16 -7
  50. package/scripts/hooks/dispatch_hook.py +27 -0
  51. package/scripts/hooks/dispatch_issues.py +136 -0
  52. package/scripts/hooks_doctor.py +40 -1
  53. package/scripts/install.py +25 -21
  54. package/scripts/inventory_abstraction_budget.py +616 -0
  55. package/scripts/lint_agents_layout.py +5 -4
  56. package/scripts/lint_bench_corpus.py +86 -4
  57. package/scripts/lint_global_paths.py +4 -3
  58. package/scripts/lint_marketplace_install_completeness.py +188 -0
  59. package/scripts/lint_value_dashboard.py +218 -0
  60. package/scripts/render_benchmark_md.py +6 -2
  61. package/scripts/render_value_md.py +355 -0
  62. package/scripts/repro/repro_marketplace_install_gap.sh +161 -0
  63. package/scripts/roadmap_progress_hook.py +23 -0
  64. package/scripts/router_telemetry.py +470 -0
  65. package/scripts/validate_frontmatter.py +23 -9
  66. package/scripts/_cli/cmd_migrate_to_global.py +0 -415
@@ -0,0 +1,374 @@
1
+ ---
2
+ stability: beta
3
+ keep-beta-until: 2026-08-28
4
+ ---
5
+
6
+ # Value Dashboard Spec — what the package costs and what it saves
7
+
8
+ > Contract for `docs/value.md` — a single human-readable dashboard that
9
+ > answers the owner's question *"what does this package cost me and what
10
+ > does it save me, in plain numbers a non-expert can read"*. Companion
11
+ > to [`value-report-schema.md`](value-report-schema.md) which owns the
12
+ > per-report JSON shape this contract layers semantics onto.
13
+
14
+ ## Scope
15
+
16
+ This contract covers the **dashboard surface** that consolidates three
17
+ existing measurement systems (A/B `docs/benchmark.md`, telegraph
18
+ `internal/bench/reports/telegraph-v*`, frugality
19
+ `agents/runtime/frugality/baseline.jsonl`) into one two-panel page. It
20
+ does **not** redefine the underlying measurement contracts — it is a
21
+ derived view on top of them, and the raw reports remain the
22
+ machine-readable source of truth.
23
+
24
+ ## Source
25
+
26
+ - **Chat thread:** 2026-05-27 (the owner's verdict: *"Right now these
27
+ benchmarks are useless. I know what they are about and I still don't
28
+ understand them."*)
29
+ - **Roadmap:** `agents/roadmaps/road-to-readable-value-dashboard.md`
30
+ - **Extends:** archived `road-to-package-impact-benchmark.md` (A/B
31
+ surface) and archived `step-4-measurement-and-benchmark.md`
32
+ (telegraph + selection bench).
33
+
34
+ ## Producer / consumer surface
35
+
36
+ | Concern | Owner |
37
+ |---|---|
38
+ | Rung normalisation (raw report → `value-v1` rung) | `scripts/_lib/value_ladder.py` |
39
+ | `value-v1` JSON assembly | `scripts/_lib/value_report.py` |
40
+ | rtk savings measurement (new) | `scripts/bench_rtk_savings.py` + `internal/bench/corpora/rtk/commands.yaml` |
41
+ | Rendered dashboard | `scripts/render_value_md.py` → `docs/value.md` |
42
+ | Dashboard linter | `scripts/lint_value_dashboard.py` |
43
+ | Task orchestration | `taskfiles/value.yml` (`task value*`) |
44
+
45
+ ## Canonical output path
46
+
47
+ **Decision (2026-05-28):** the dashboard lives at **`docs/value.md`**,
48
+ a sibling of (not a replacement for) `docs/benchmark.md`.
49
+
50
+ Rationale:
51
+
52
+ - `docs/benchmark.md` is the A/B-technical appendix (cache key,
53
+ methodology, integrity check, history). It serves a different reader
54
+ — a maintainer auditing the variant axis — and already has a
55
+ contract, schema, and renderer. Replacing it would either lose the
56
+ technical surface or bloat the dashboard.
57
+ - `docs/value.md` is the human dashboard — the page a non-developer
58
+ opens to answer the cost/value question. It is derived (no raw
59
+ measurement happens here; the renderer reads existing reports).
60
+ - The two pages cross-link: `value.md` links down to `benchmark.md` for
61
+ the methodological detail; `benchmark.md`'s Track A row is reframed
62
+ in Phase 4 Step 5 to point a reader who wants impact at `value.md`.
63
+
64
+ ## Two panels
65
+
66
+ ### Panel A — Cost ladder (cumulative, min → max)
67
+
68
+ The package's full cost picture as a layperson reads it top-to-bottom.
69
+ Each rung is a measurement (or a clearly-marked `pending` placeholder),
70
+ not a marketing number. The ladder is **honest about the up-front
71
+ cost**: installing the package first *adds* input tokens (rules load
72
+ into context); condense + rtk + terse then claw that back. The
73
+ **NETTO** line is the real answer.
74
+
75
+ Rung order:
76
+
77
+ 1. **Ohne Paket** — baseline. Token delta = 0. Reference rung; the
78
+ ladder is computed relative to this.
79
+ 2. **Mit Paket (Regeln laden)** — the honest up-front cost. The
80
+ always-loaded kernel + router footprint added to every request's
81
+ input. Token delta is *positive* (the rule body lands in context).
82
+ Source: `metric_a_footprint` from `frugality/baseline.jsonl`.
83
+ 3. **+ condense** — the input-side carve-out savings from
84
+ condensation (`.agent-src.uncondensed` → `.agent-src`). Source:
85
+ `internal/bench/reports/telegraph-v2.json`. **Excludes Thin-Root
86
+ files** (AGENTS.md variants — they net negative); the Thin-Root
87
+ caveat surfaces as a footnote, not a hidden exclusion.
88
+ 4. **+ rtk** — input-side savings on verbose CLI output that the
89
+ agent would otherwise pipe into its context. New measurement:
90
+ `scripts/bench_rtk_savings.py` against
91
+ `internal/bench/corpora/rtk/commands.yaml`. `pending` when `rtk`
92
+ is not installed (with install hint per `missing-tool-handling`).
93
+ 5. **+ terse (telegraph)** — output-side carve-out from
94
+ telegraph-condensed agent replies vs. a "be concise" control.
95
+ Source: `telegraph-v1` `vs_terse`. **Measured median is negative
96
+ today** (−9.27%); the rung renders with its real value and a
97
+ one-line "why" note. Decision in this spec (see § "Terse rung
98
+ honesty" below): render as a rung with the negative value, do
99
+ not move to Panel B.
100
+
101
+ The renderer prints, per rung:
102
+
103
+ - **label** — short German + English ("Mit Paket / +load")
104
+ - **what-it-does** — one phrase a non-developer understands
105
+ - **token_delta** — signed integer (positive = adds tokens)
106
+ - **eur_delta** — token_delta priced at the reference scale below
107
+ - **cumulative_pct** — running cumulative as % of baseline request
108
+ size
109
+ - **confidence** — `measured` | `estimated` | `vendor-claim` |
110
+ `pending`
111
+ - **source_report** — relative path to the raw report this rung
112
+ was derived from (`pending` rungs cite the report that *would*
113
+ produce them)
114
+
115
+ ### Panel B — Behaviour (with vs. without)
116
+
117
+ The package's strongest value, currently unmeasured live. Four
118
+ metrics, each carrying `with` / `without` / `delta` plus a run
119
+ `mode` (`live` | `dry-run`) so a dry-run number can never
120
+ masquerade as evidence.
121
+
122
+ 1. **Right-skill selection** — the existing selection-accuracy
123
+ bench (`tests/eval/corpus-dev.yaml`, top-K hit rate). The
124
+ bench already exists; surface its `with` vs. `without` result.
125
+ 2. **Destructive-op stops** — the 5 destructive/security prompts
126
+ already defined in `benchmark-corpus-spec.md`. For each:
127
+ does the agent refuse / stop / ask before the destructive
128
+ action? Metric: `stops: N/5 vs M/5`. This is the safety value
129
+ the Hard-Floor rules deliver — currently unquantified.
130
+ 3. **Ask-vs-act ratio** — from the existing A/B Track B runner
131
+ when run in `--mode live`. Lower ratio = more decisive agent
132
+ under autonomy mandate (`personal.autonomy: on`).
133
+ 4. **Task completion rate** — A/B Track B `completion_rate`,
134
+ `with` vs. `without`. `live` mode required for evidence;
135
+ `dry-run` runs are clearly badged and excluded from the
136
+ headline number.
137
+
138
+ The renderer prints, per metric: label · what-this-means caption ·
139
+ with · without · delta · mode badge (`live` / `dry-run`).
140
+
141
+ ## Ladder rung data model
142
+
143
+ Every rung in the `cost_ladder` array carries:
144
+
145
+ ```yaml
146
+ id: <kebab-case> # e.g. "load", "condense", "rtk"
147
+ label: "<German + English>" # e.g. "Mit Paket (Regeln laden)"
148
+ what_it_does: "<one phrase>" # plain language, ≤ 80 chars
149
+ token_delta: <signed int> # per-request input token delta
150
+ eur_delta: <float> # token_delta priced at reference scale
151
+ cumulative_pct: <signed float> # running cumulative as % of baseline
152
+ confidence: measured | estimated | vendor-claim | pending
153
+ source_report: <relative path> # the raw report this was derived from
154
+ footnote: "<optional caveat>" # e.g. "Thin-Root files excluded"
155
+ ```
156
+
157
+ `token_delta` is the per-request delta (single request, average
158
+ shape). `eur_delta` is computed at the **reference scale** below.
159
+
160
+ ## Reference scale
161
+
162
+ - **1,000 requests per measurement period** (default reference).
163
+ - **Average request shape:** ~8 k input tokens / ~600 output
164
+ tokens (matches the A/B Track B median observed in the
165
+ available reports).
166
+ - **Model tier:** Sonnet (default development model in the
167
+ repository's `.agent-settings.yml`). Token→€ conversion reads
168
+ `internal/bench/pricing.yaml` row `sonnet`. If the user picks
169
+ another tier, the renderer recomputes against that row.
170
+
171
+ The reference scale is documented inline in `docs/value.md` (the
172
+ glossary block); the renderer never silently changes it.
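As a worked illustration of the reference scale, the sketch below prices a per-request token delta over 1,000 requests. The rate of €2.76 per 1M input tokens is an assumption, back-calculated from the NETTO rows in the baseline appendix of this spec (+8 254 tokens per request against +€22.78 per 1,000 requests); it is not read from `internal/bench/pricing.yaml`, and the function name `eur_delta` is hypothetical.

```python
def eur_delta(token_delta: int, requests: int = 1000,
              eur_per_million_tokens: float = 2.76) -> float:
    """Price a per-request input-token delta at the reference scale.

    The default rate is an ASSUMPTION inferred from this spec's NETTO rows,
    not the real `sonnet` row of internal/bench/pricing.yaml.
    """
    total_tokens = token_delta * requests          # tokens over the whole period
    return total_tokens / 1_000_000 * eur_per_million_tokens


if __name__ == "__main__":
    for label, delta in [("NETTO before fix", 4120), ("NETTO after fix", 8254)]:
        print(f"{label}: {eur_delta(delta):+.2f} EUR per 1,000 requests")
```

Under that assumed rate both appendix rows reproduce to the cent, which suggests the token and euro columns are linked by a single flat input rate.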
173
+
174
+ ## Confidence taxonomy
175
+
176
+ | Marker | Meaning |
177
+ |---|---|
178
+ | `measured` | Derived from a raw report under `internal/bench/reports/` produced by an in-repo script |
179
+ | `estimated` | Computed by `value_ladder.py` from primary measurements (e.g. cumulative %) |
180
+ | `vendor-claim` | Quoted from an upstream source without local measurement (used for context, never the headline) |
181
+ | `pending` | The rung exists in the schema but no measurement is available yet |
182
+
183
+ Never label a `pending` rung `measured`. Never render a negative
184
+ saving (a `confidence: measured` rung whose `token_delta` is positive)
185
+ as a "saving"; the linter catches this.
186
+
187
+ ## Panel B rule attribution (telemetry)
188
+
189
+ The `behaviour` block's `with`-arm value is driven by rules the agent
190
+ activates while solving each task. The router-telemetry replay
191
+ (Phase 3 of `road-to-value-dashboard-netto-cuts`) writes per-corpus
192
+ hit counts to `internal/bench/reports/router-telemetry/latest.json`.
193
+ Two fields gate the optimisation pass:
194
+
195
+ - `panel_b_untouchable_rules` — tier-1 rules that activated on at
196
+ least one Track B task. **Hard floor for Phase 5 dead-rule audit**
197
+ — these rules are not candidates for demotion or deletion.
198
+ - `panel_b_tier2_drivers` — tier-2 rules that activated on Track B.
199
+ Documented for transparency; tier-2 rules already lazy-load per
200
+ the rule-router contract, so no roadmap touches them, but if a
201
+ future phase ever cuts tier-2, this list is the floor.
202
+
203
+ The 2026-05-28 replay against the 13-task Track B corpus surfaced
204
+ **zero tier-1 rules** in the Panel B activation set; the three
205
+ tier-2 drivers were `domain-safety-pii`, `downstream-changes`,
206
+ `model-recommendation`. Phase 5's audit therefore has free rein on
207
+ the 20 never-matched tier-1 rules.
208
+
209
+ ## Behaviour-metric set
210
+
211
+ Each metric in the `behaviour` block carries:
212
+
213
+ ```yaml
214
+ id: <kebab-case> # e.g. "selection", "destructive-stops"
215
+ label: "<German + English>"
216
+ what_this_means: "<one line>" # plain language caption
217
+ with: <value> # metric-specific (pct, count, ratio)
218
+ without: <value>
219
+ delta: <signed value> # with - without
220
+ unit: pct | count | ratio | seconds
221
+ mode: live | dry-run
222
+ source_report: <relative path>
223
+ ```
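A minimal sketch of how a consumer might hold one behaviour metric and keep dry-run values out of the headline. The class and function names are hypothetical; only the field names, the `with - without` delta and the `live` / `dry-run` modes come from the block above. The completion figures are the ones quoted in the baseline appendix; the dry-run entry is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class BehaviourMetric:
    id: str
    with_value: float      # `with` is a Python keyword, hence the suffix
    without_value: float
    unit: str              # pct | count | ratio | seconds
    mode: str              # live | dry-run

    @property
    def delta(self) -> float:
        # The spec defines delta as `with - without`.
        return self.with_value - self.without_value


def headline_metrics(metrics):
    """Keep only `live` runs; dry-run values stay badged, never headline."""
    return [m for m in metrics if m.mode == "live"]


if __name__ == "__main__":
    completion = BehaviourMetric("completion", 84.6, 7.7, "pct", "live")
    rehearsal = BehaviourMetric("selection", 0.0, 0.0, "pct", "dry-run")  # made up
    for metric in headline_metrics([completion, rehearsal]):
        print(metric.id, f"{metric.delta:+.1f} {metric.unit}")
```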
224
+
225
+ ## Terse rung honesty
226
+
227
+ The `telegraph-v1` `vs_terse` median is **−9.27%** — telegraph-style
228
+ output is *more verbose* than a "be concise" control in the measured
229
+ corpus. This roadmap considered two options:
230
+
231
+ 1. Render the rung in Panel A with its real (negative) value and a
232
+ one-line "why" caption.
233
+ 2. Move telegraph from Panel A to Panel B as a quality lever (impact
234
+ on the agent's output style) rather than a cost saver.
235
+
236
+ **Decision (2026-05-28):** option 1. The page's credibility is the
237
+ product (per the non-goals section of the roadmap). Hiding the
238
+ negative number — even by relocating it to a "quality" panel —
239
+ would betray the for-dummies-honest framing. The rung renders with
240
+ its measured value, `confidence: measured`, and a caption: *"In
241
+ our test corpus, telegraph produces more tokens than a neutral
242
+ 'be concise'; we measure, we don't hide."*
243
+
244
+ ## Glossary (rendered into `docs/value.md`)
245
+
246
+ Plain-language one-sentence definitions for the non-developer
247
+ reader. The glossary block is the source-of-truth; the renderer
248
+ copies it verbatim into the dashboard.
249
+
250
+ - **Token** — the unit a language model bills in. Roughly: one
251
+ token ≈ 4 characters of English / German prose. 1,000 tokens ≈
252
+ 750 words.
253
+ - **Input tokens** — everything the model reads each turn
254
+ (system prompt, rules that load every request, your message,
255
+ prior conversation). The package adds rules here, so installing
256
+ it costs input tokens.
257
+ - **Output tokens** — what the model writes back. Usually fewer
258
+ than input. Per-token output costs more than input.
259
+ - **condense** — a build step that shrinks the rule files
260
+ before they ship (`.agent-src.uncondensed` →
261
+ `.agent-src`). Saves input tokens on every request.
262
+ - **rtk** — the *Rust Token Killer*, a CLI wrapper that strips
263
+ verbose output (`git status`, lint output, test runners) before
264
+ the model reads it. Saves input tokens on tool calls.
265
+ - **terse / telegraph** — a style of output (short phrases,
266
+ dropped articles) the agent uses when condensing replies.
267
+ Saves output tokens — when the corpus rewards it.
268
+ - **Ohne Paket / Mit Paket** — "without the package" /
269
+ "with the package" — the two arms of the A/B comparison.
270
+ - **€-per-1k-requests** — token cost at the reference scale
271
+ (1,000 requests of the average shape, priced at the current
272
+ Sonnet rates in `internal/bench/pricing.yaml`).
273
+
274
+ ## Honest baseline appendix
275
+
276
+ The real numbers measured at the time this spec was written
277
+ (2026-05-28). Each subsequent phase of the roadmap closes one
278
+ gap — the baseline lets the reader see what was unknown when the
279
+ dashboard was first conceived.
280
+
281
+ **Correction 2026-05-28 (Phase 1 of `road-to-value-dashboard-netto-cuts`):**
282
+ The `load` rung previously read `agents/runtime/frugality/baseline.jsonl`
283
+ which measures a hardcoded 6-rule canon
284
+ (`scripts/measure_frugality_savings.py::CANON_RULES`) — NOT the
285
+ actual always-loaded kernel. The real kernel has 10 rules per
286
+ `dist/router.json::kernel`. After fix:
287
+
288
+ | Metric | Before fix | After fix | Delta |
289
+ |---|---:|---:|---:|
290
+ | `load` token delta | +4 843 | **+8 977** | +4 134 |
291
+ | NETTO token delta | +4 120 | **+8 254** | +4 134 |
292
+ | NETTO `cumulative_pct` | +51.5 % | **+103.2 %** | +51.7 pp |
293
+ | NETTO €/1k requests | +€11.37 | **+€22.78** | +€11.41 |
294
+
295
+ The original dashboard under-reported the base-load by ~4 100
296
+ tokens/request. Panel B's behaviour numbers are unaffected (they
297
+ measure agent behaviour, not token footprint).
298
+
299
+ **Optimisation pass 1 close-out (2026-05-28, `road-to-value-dashboard-netto-cuts`):**
300
+
301
+ - Phase 1 — load rung corrected (above).
302
+ - Phase 2 — `dist/router.json` minified 31 643 → 16 450 B; audit confirmed it is not in any host's per-request context, so the saving is hygiene-only, not a measured Panel-A rung.
303
+ - Phase 3 — router-telemetry replay shipped (`internal/bench/reports/router-telemetry/latest.json`); finding: zero tier-1 rules fire on Track B; the three tier-2 drivers are `domain-safety-pii`, `downstream-changes`, `model-recommendation`.
304
+ - Phase 4 — duplicate-trigger dedup closed with zero cuts: 16 clusters identified, all semantically distinct cross-cutting concerns. The council's "30 % redundancy" hypothesis is refuted; verified redundancy is 0 %.
305
+ - Phase 5 — tier-1 dead-rule audit closed with zero cuts: of 20 never-matched rules, 19 are bench-blind / measurement-window / cluster-head (load-bearing despite zero corpus hits); the lone demote candidate (`symfony-routing`) is kept to preserve cross-stack portability.
306
+ - Phase 6 — full live Track B re-run skipped: Phase 1-5 made no rule-body or frontmatter edits, so by construction Panel B is unchanged from the 2026-05-28 baseline (`with` 84.6 % completion, `without` 7.7 % completion). Re-running would consume tokens to re-confirm a known value.
307
+
308
+ **Pass outcome:** NETTO moved from +4 120 (mis-measured) to **+8 254 tokens / request** (honest); Panel B held by construction. The pass's value is the corrected measurement floor + the new telemetry tooling, not any in-place rule cuts. Cuts must wait until the bench corpus is widened to exercise the rules' real trigger surfaces (git, onboarding, roadmap work, long-conversation windows, autonomy moments).
309
+
310
+ **Optimisation pass 2 close-out (2026-05-29, `road-to-corpus-expansion-evidence-based-cuts`):**
311
+
312
+ - Phase 1 — corpus-surface inventory + state-fixture feasibility scan: 15 of 20 rules classified `addressable`; 5 state-bound (`autonomous-execution`, `context-hygiene`, `fast-path-marker-visibility`, `low-impact-corpus-privacy-floor`, `onboarding-gate`) get a permanent `keep-pending-state-trigger` verdict. 2/5 (`onboarding-gate`, `context-hygiene`) have feasible fixtures (documented but not built).
313
+ - Phase 2 — 5 corpus extension files shipped under `internal/bench/corpora/router-coverage/`, 24 tasks total (well under the 40-task ceiling). New `intended_triggers` + `open_files` + `command` fields on the per-prompt schema; linter validates against `dist/router.json` rule ids.
314
+ - Phase 3 — `scripts/router_telemetry.py` extended with manifest auto-discovery + `intended_vs_observed_match` per task + `unintended_activation_histogram` aggregate (Council R3 inter-rule conflict detection). Replay: **never-matched-tier-1 = 20 → 11**. The 11 split cleanly into 5 state-bound + 5 intent-only (NEW structural class — intent-only triggers cannot be exercised by router-telemetry replay regardless of corpus) + 1 partial.
315
+ - Phase 4 — second tier-1 audit, informed by widened corpus. The candidate set reduces to 1 real audit row (`artifact-engagement-recording`) — defended as load-bearing infrastructure for `/implement-ticket` + `/work` engine telemetry. Pareto raw-flagged 4 candidates with the tightened Council R3 thresholds (`body > 3 000 chars` AND `absolute_activations < 3` AND `activation_rate < 30 % of addressable_tasks`); all 4 are false-positives caused by the structural-unreachability dimension the pareto does not encode.
316
+ - Phase 5 — zero cuts (0 demotes, 0 deletes). Same outcome as pass-1, but for a fundamentally different reason: pass-1 closed with zero cuts because the corpus was blind; pass-2 closes with zero cuts because the widened corpus **proved every tier-1 rule has structural reason to exist**.
317
+ - Phase 6 — full live Track B re-run skipped: Phase 1-5 made zero rule-body / frontmatter / kernel edits — Panel B is unchanged from the 2026-05-28 baseline by construction.
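The raw-pareto flag from Phase 4 can be sketched as a three-part predicate. This is one reading of the thresholds quoted above, not the package's implementation: the function name and parameters are hypothetical, and `activation_rate` is assumed to mean activations divided by the number of addressable tasks.

```python
def is_raw_pareto_flag(body_chars: int, absolute_activations: int,
                       addressable_tasks: int) -> bool:
    """Flag a tier-1 rule only when all three tightened thresholds hold."""
    if addressable_tasks <= 0:
        # Assumption: with nothing to rate against, the rule is not flagged.
        return False
    activation_rate = absolute_activations / addressable_tasks
    return (body_chars > 3000
            and absolute_activations < 3
            and activation_rate < 0.30)


if __name__ == "__main__":
    # A long rule body that fired once across 24 tasks gets flagged.
    print(is_raw_pareto_flag(body_chars=3500, absolute_activations=1,
                             addressable_tasks=24))
```

A flag from such a predicate is only a candidate. As Phase 4 records, all four raw flags turned out to be false positives because the predicate does not encode structural unreachability.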
318
+
319
+ **Pass outcome:** NETTO unchanged at **+8 254 tokens / request** (+103.2 % vs. baseline, +€22.78 per 1 000 requests). The pass's actual deliverable is the **structural categorisation** of the 20 previously-never-matched rules — future audits no longer need to re-debate why these rules don't fire in standard corpora. 5 state-bound + 5 intent-only are permanently classified as router-replay-unreachable. The Pass B (kernel-body refactor) deferral remains intact — no candidate qualifies under the tightened thresholds.
320
+
321
+ **Pass B status: deferred / closed for now.** Zero genuine candidates surfaced; the 4 raw-pareto flags are all false-positives. Reopen only when a tier-1 rule both activates frequently in the widened corpus AND has a body that exceeds the kernel-budget ceiling — current state has neither.
322
+
323
+
324
+
325
+ | Surface | Real number today | Gap |
326
+ |---|---|---|
327
+ | A/B Track A | `100% vs 0%` — file presence, a tautology | Reframed in Phase 4 Step 5 |
328
+ | A/B Track B | `—` — no `live` run on record | Closed in Phase 3 Step 1 |
329
+ | Telegraph input-side (condense) | median **+3.52%** savings (Thin-Root files net **−3.92% to −4.84%**) | Aggregated to a single rung in Phase 2 Step 2; Thin-Root surfaced as footnote |
330
+ | Telegraph output-side (`vs_terse`) | median **−9.27%** | Rendered honestly per § "Terse rung honesty" above |
331
+ | rtk savings | **not measured anywhere** | Closed in Phase 2 Step 3 (new `bench_rtk_savings.py`) |
332
+ | Right-skill selection (Track A vs. Track B coverage) | exists in the dev corpus; not surfaced as `with` vs. `without` | Closed in Phase 3 Step 2 |
333
+ | Destructive-op stops | 5 prompts exist in the corpus spec; not measured | Closed in Phase 3 Step 3 |
334
+
335
+ ## Honesty constraints (non-goals)
336
+
337
+ These come from the roadmap and are restated here so a future
338
+ maintainer cannot soften them in a later spec edit without
339
+ deliberately rewriting this section.
340
+
341
+ - **No marketing numbers.** If condense nets negative on
342
+ Thin-Root files, the dashboard says so. The credibility of the
343
+ page is the product.
344
+ - **No cross-model study.** One model (the local `claude` CLI /
345
+ one pinned pricing row). Statistical-significance work stays
346
+ opt-in (`--samples N`).
347
+ - **No retiring of the raw reports.** `telegraph-v*`, `ab-*`,
348
+ frugality JSONL stay as the machine-readable source of truth;
349
+ the dashboard is a derived human view on top.
350
+ - **rtk numbers must be measured, not claimed.** The "60–90%" in
351
+ `CLAUDE.md` is a vendor claim; Panel A shows what *this* corpus
352
+ actually measured.
353
+
354
+ ## Out of scope for this contract
355
+
356
+ - The per-report `value-v1` JSON shape — see
357
+ [`value-report-schema.md`](value-report-schema.md).
358
+ - LLM-judge scoring of `docs/value.md` content quality —
359
+ explicitly out of scope; the linter checks structural
360
+ invariants only.
361
+ - Cross-model price comparison (haiku vs. sonnet vs. opus) — out
362
+ of scope; the dashboard prices the reference Sonnet row.
363
+ - Per-tenant / per-user customisation of the reference scale —
364
+ out of scope; the scale is documented inline and a reader
365
+ recomputes mentally if their workload differs.
366
+
367
+ ## See also
368
+
369
+ - [`agents/roadmaps/road-to-readable-value-dashboard.md`](../../agents/roadmaps/road-to-readable-value-dashboard.md) — the roadmap that built this surface.
370
+ - [`value-report-schema.md`](value-report-schema.md) — per-report JSON shape (sibling contract).
371
+ - [`benchmark-ab-contract.md`](benchmark-ab-contract.md) — A/B variant-axis contract (data source for the rtk, behaviour, completion rungs).
372
+ - [`benchmark-report-schema.md`](benchmark-report-schema.md) — per-report JSON shape for A/B reports.
373
+ - [`benchmark-corpus-spec.md`](benchmark-corpus-spec.md) — the corpus contract whose destructive prompts power the Panel B `stops` metric.
374
+ - [`internal/bench/pricing.yaml`](../../internal/bench/pricing.yaml) — token→€ conversion source.
@@ -0,0 +1,150 @@
1
+ ---
2
+ stability: beta
3
+ keep-beta-until: 2026-08-28
4
+ ---
5
+
6
+ # Value Report Schema (`value-v1`)
7
+
8
+ Parser-visible contract for the JSON report emitted by
9
+ [`scripts/_lib/value_report.py`](../../scripts/_lib/value_report.py)
10
+ and consumed by [`scripts/render_value_md.py`](../../scripts/render_value_md.py).
11
+ Sibling of [`benchmark-report-schema.md`](benchmark-report-schema.md);
12
+ companion to [`value-dashboard-spec.md`](value-dashboard-spec.md) which
13
+ owns the semantics this contract types.
14
+
15
+ ## File layout
16
+
17
+ ```
18
+ internal/bench/
19
+ ├── pricing.yaml # per-1M model rates + sourced_on dates
20
+ └── reports/
21
+ └── value/
22
+ ├── 2026-05-28T10-30-00Z.json # machine-readable value-v1 report
23
+ ├── 2026-05-28T10-30-00Z.md # optional human dump (informational)
24
+ └── latest.json # symlink or copy of newest report
25
+ ```
26
+
27
+ Filename format: ``<UTC ISO-8601 with `:` → `-`>.{json,md}``. Sortable
28
+ lexicographically.
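A small sketch of the filename rule, assuming second precision and a trailing `Z` as in the layout example above. The helper name `report_filename` is hypothetical.

```python
from datetime import datetime, timezone


def report_filename(moment: datetime, ext: str = "json") -> str:
    """Build `<UTC ISO-8601 with ':' replaced by '-'>.<ext>`."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    utc = moment.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H-%M-%SZ") + "." + ext


if __name__ == "__main__":
    stamp = datetime(2026, 5, 28, 10, 30, 0, tzinfo=timezone.utc)
    print(report_filename(stamp))
    print(report_filename(stamp, "md"))
```

Because every field is zero-padded and ordered from most to least significant, sorting the names as plain strings also sorts the reports chronologically.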
29
+
30
+ ## JSON schema (v1)
31
+
32
+ ```yaml
33
+ schema_version: 1 # int — bump on a breaking change
34
+ schema_id: value-v1 # string literal
35
+ generated_at: <ISO-8601 UTC>
36
+ reference_scale:
37
+ requests: 1000 # int — N requests being priced
38
+ avg_input_tokens: 8000 # int — assumed input tokens per request
39
+ avg_output_tokens: 600 # int — assumed output tokens per request
40
+ model_tier: sonnet # haiku | sonnet | opus
41
+ pricing_sourced_on: <ISO date> # from internal/bench/pricing.yaml
42
+ baseline:
43
+ label: "Ohne Paket / Without package"
44
+ input_tokens_per_request: <int> # the 0-point of the ladder
45
+ cost_ladder:
46
+ - id: load
47
+ label: "<German + English>"
48
+ what_it_does: "<≤ 80 char phrase>"
49
+ token_delta: <signed int> # per-request input token delta
50
+ eur_delta: <float> # priced at reference_scale
51
+ cumulative_pct: <signed float> # % of baseline.input_tokens_per_request
52
+ confidence: measured | estimated | vendor-claim | pending
53
+ source_report: <relative path> # raw report this was derived from
54
+ footnote: "<optional caveat>" # e.g. "Thin-Root files excluded"
55
+ - id: condense
56
+ ...
57
+ - id: rtk
58
+ ...
59
+ - id: terse
60
+ ...
61
+ behaviour:
62
+ - id: selection
63
+ label: "<German + English>"
64
+ what_this_means: "<one line caption>"
65
+ with: <value> # metric-specific
66
+ without: <value>
67
+ delta: <signed value> # with - without
68
+ unit: pct | count | ratio | seconds
69
+ mode: live | dry-run
70
+ source_report: <relative path>
71
+ - id: destructive-stops
72
+ ...
73
+ - id: ask-vs-act
74
+ ...
75
+ - id: completion
76
+ ...
77
+ totals:
78
+ cumulative_token_delta: <signed int> # sum of cost_ladder token_deltas
79
+ cumulative_eur_delta: <float> # priced at reference_scale
80
+ cumulative_pct: <signed float> # net % of baseline
81
+ net_verdict: net-saving | net-cost | break-even # by sign of cumulative_pct
82
+ notes:
83
+ - "Token→€ conversion priced at <model_tier> rates from <pricing source>."
84
+ - "<other invariants surfaced as plain prose>"
85
+ ```
86
+
87
+ ## Invariants
88
+
89
+ - **No silent drops.** Missing input → emit the rung with
90
+ `confidence: pending` and a `source_report` pointing to the raw
91
+ report path the renderer *expected* to find. Never omit a rung
92
+ from `cost_ladder` because data was missing.
93
+ - **No saving label on a cost rung.** A rung with `token_delta > 0` is a
94
+ *cost* rung; a rung with `token_delta < 0` is a *saving* rung;
95
+ zero is *neutral*. The linter
96
+ ([`scripts/lint_value_dashboard.py`](../../scripts/lint_value_dashboard.py))
97
+ rejects any rendered "saving" label on a positive `token_delta`.
98
+ - **No `measured` without a real source.** A rung that carries
99
+ `confidence: measured` MUST have a `source_report` that exists on
100
+ disk under `internal/bench/reports/`. The linter walks this.
101
+ - **Reference scale is documented.** The renderer prints the
102
+ `reference_scale` block prominently in the dashboard so a reader
103
+ can recompute mentally for a different workload.
104
+ - **Mode badge is mandatory in `behaviour`.** Every behaviour metric
105
+ carries `mode: live | dry-run`. The renderer prints the badge
106
+ inline; a `dry-run` value is never the headline.
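The invariants above lend themselves to mechanical checks. The following is an illustrative sketch only, not the logic of `scripts/lint_value_dashboard.py`: the function names, the error wording and the injectable `exists` callback are assumptions, and the expected rung ids are taken from the schema example.

```python
import os

CONFIDENCE = {"measured", "estimated", "vendor-claim", "pending"}
MODES = {"live", "dry-run"}
EXPECTED_RUNGS = ("load", "condense", "rtk", "terse")  # ids from the schema example


def rung_kind(token_delta):
    """Classify a rung by the sign of its token_delta."""
    if token_delta > 0:
        return "cost"
    if token_delta < 0:
        return "saving"
    return "neutral"


def check_invariants(report, reports_root="internal/bench/reports/",
                     exists=os.path.exists):
    """Return a list of violation messages; an empty list means no findings."""
    errors = []
    ladder = report.get("cost_ladder", [])
    present = {rung.get("id") for rung in ladder}
    # No silent drops: a rung without data is emitted as pending, not omitted.
    for rung_id in EXPECTED_RUNGS:
        if rung_id not in present:
            errors.append(f"rung '{rung_id}' missing (emit it as pending instead)")
    for rung in ladder:
        confidence = rung.get("confidence")
        if confidence not in CONFIDENCE:
            errors.append(f"rung '{rung.get('id')}': unknown confidence {confidence!r}")
        # No `measured` without a source that exists under the reports root.
        if confidence == "measured":
            source = rung.get("source_report") or ""
            if not (source.startswith(reports_root) and exists(source)):
                errors.append(f"rung '{rung.get('id')}': measured without a real source")
    # Mode badge is mandatory on every behaviour metric.
    for metric in report.get("behaviour", []):
        if metric.get("mode") not in MODES:
            errors.append(f"metric '{metric.get('id')}': missing mode badge")
    return errors
```

Passing `exists` as a parameter keeps the sketch testable without touching the filesystem; a renderer could use `rung_kind` to decide whether a "saving" label is allowed at all.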
107
+
108
+ ## Cumulative rule
109
+
110
+ `cumulative_pct[i]` = the running cumulative of `token_delta` from
111
+ rungs `0..i` divided by `baseline.input_tokens_per_request`,
112
+ expressed as a signed percentage. The **NETTO** line that the
113
+ renderer prints in Panel A is identical to `totals.cumulative_pct`.
114
+
115
+ ```
116
+ cumulative[i] = sum(rung.token_delta for rung in cost_ladder[:i+1])
117
+ cumulative_pct[i] = 100 * cumulative[i] / baseline.input_tokens_per_request
118
+ ```
119
+
120
+ A rung with `confidence: pending` contributes `token_delta: 0` to
121
+ the cumulative (its raw value is the renderer's best guess from the
122
+ raw report; it MUST NOT influence the headline until it flips to
123
+ `measured`).
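The cumulative rule, including the zero contribution of `pending` rungs, can be sketched as follows. Function names are hypothetical and the ladder values are toy numbers chosen for easy arithmetic, not measurements from any report.

```python
def cumulative_pcts(cost_ladder, baseline_tokens):
    """Running cumulative of token_delta as a signed % of the baseline."""
    running = 0
    result = []
    for rung in cost_ladder:
        # A pending rung keeps its raw guess but adds 0 to the cumulative.
        if rung["confidence"] != "pending":
            running += rung["token_delta"]
        result.append(100 * running / baseline_tokens)
    return result


def net_verdict(cumulative_pct):
    """Map the sign of the final cumulative_pct to the schema's verdicts."""
    if cumulative_pct < 0:
        return "net-saving"
    if cumulative_pct > 0:
        return "net-cost"
    return "break-even"


if __name__ == "__main__":
    ladder = [  # toy values for illustration only
        {"id": "load", "token_delta": 4000, "confidence": "measured"},
        {"id": "condense", "token_delta": -1000, "confidence": "measured"},
        {"id": "rtk", "token_delta": -500, "confidence": "pending"},
    ]
    pcts = cumulative_pcts(ladder, baseline_tokens=8000)
    print(pcts, net_verdict(pcts[-1]))
```

With these toy inputs the pending `rtk` rung leaves the running percentage where the `condense` rung put it, so the last entry is what the NETTO line would show.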
124
+
125
+ ## Markdown shape (informational human dump)
126
+
127
+ The `.md` sibling of every `value-v1.json` is informational — a
128
+ flat textual dump of the same data, useful for `git diff` review and
129
+ human spot-checks. The **production** rendering is
130
+ `docs/value.md`, produced by `scripts/render_value_md.py` from the
131
+ latest `value-v1.json`.
132
+
133
+ The optional `.md` dump carries:
134
+
135
+ 1. `# Value Report — <generated_at>`
136
+ 2. `## Reference scale` — the `reference_scale` block.
137
+ 3. `## Cost ladder` — one section per rung with its full fields.
138
+ 4. `## Behaviour` — one section per metric with its full fields.
139
+ 5. `## Totals` — cumulative line + verdict.
140
+ 6. `## Notes` — invariants surfaced as prose.
141
+
142
+ ## Cross-references
143
+
144
+ - Semantics — [`value-dashboard-spec.md`](value-dashboard-spec.md)
145
+ - Roadmap — [`agents/roadmaps/road-to-readable-value-dashboard.md`](../../agents/roadmaps/road-to-readable-value-dashboard.md)
146
+ - Pricing source — [`internal/bench/pricing.yaml`](../../internal/bench/pricing.yaml)
147
+ - Rung normaliser — [`scripts/_lib/value_ladder.py`](../../scripts/_lib/value_ladder.py)
148
+ - Report assembler — [`scripts/_lib/value_report.py`](../../scripts/_lib/value_report.py)
149
+ - Renderer — [`scripts/render_value_md.py`](../../scripts/render_value_md.py)
150
+ - Linter — [`scripts/lint_value_dashboard.py`](../../scripts/lint_value_dashboard.py)
@@ -0,0 +1,97 @@
1
+ ---
2
+ adr: 031
3
+ status: accepted
4
+ date: 2026-05-29
5
+ decision: validation-severity-tiers-and-projection-roundtrip
6
+ supersedes: —
7
+ superseded_by: —
8
+ phase: continue-positioning-analysis
9
+ type: structural
10
+ review_date: 2026-06-12
11
+ ---
12
+
13
+ # ADR-031 — Adopt severity-tiered frontmatter validation + projection roundtrip test (from continuedev/continue analysis)
14
+
15
+ ## Status
16
+
17
+ **Accepted** · 2026-05-29. Both changes are additive and verified
18
+ empirically in the same session (validator exit 0 on 455 artefacts with
19
+ 0 fatal / 0 warnings; 9 roundtrip tests green), so the decision lands
20
+ **without** a soak period. Review date 2026-06-12.
21
+
22
+ ## Context
23
+
24
+ A competitive-positioning pass against `continuedev/continue` (evidence:
25
+ [`agents/evidence/analysis/continue-positioning-2026-05-29.md`](../../agents/evidence/analysis/continue-positioning-2026-05-29.md))
26
+ established that Continue is a **projection target**, not a competitor — its
27
+ `.continue/rules/*.md` rules system consumes the artifact type this package
28
+ produces. Our multi-tool projection + condensation model is the strategic
29
+ moat and out-scopes Continue's single-target config.
30
+
31
+ Two patterns from Continue's config layer were worth adopting independent of
32
+ whether Continue is ever used here:
33
+
34
+ 1. **Severity-tiered validation** — Continue's `core/config/validation.ts`
35
+ splits fatal errors (halt load) from non-fatal warnings (logged, load
36
+ continues). Our `scripts/validate_frontmatter.py` was binary: any
37
+ `SchemaError` failed CI.
38
+ 2. **Roundtrip validation** — Continue round-trips markdown → frontmatter →
39
+ object → markdown in `packages/config-yaml/src/markdown/*.test.ts`. Our
40
+ projection emitters (`scripts/condense.py`) had no test asserting that a
41
+ source rule's load-bearing frontmatter survives the emit cycle.
42
+
43
+ Baseline at decision time: 0 artefacts currently violate `minLength` /
44
+ `maxLength`, so reclassifying length checks loosens nothing today.
45
+
46
+ ## Decision
47
+
48
+ 1. **Severity tiers in `scripts/validate_frontmatter.py`** — `SchemaError`
49
+ gains a `severity` field (`"error"` default). Structural keywords
50
+ (`required`, `type`, `enum`, `pattern`, `additionalProperties`,
51
+ `minItems`, `minimum`) stay **fatal** (exit 1). Length keywords
52
+ (`minLength`, `maxLength`) become **advisory warnings** (printed with
53
+ `⚠️`, exit 0). `_main` partitions and reports both; only fatals fail CI.
54
+ 2. **Projection roundtrip test** — `tests/test_projection_roundtrip.py`
55
+ asserts `condense._emit_cursor_mdc` and `_emit_windsurf_rule` preserve
56
+ `description` (newline-flattened) and the `alwaysApply` / `trigger`
57
+ derivation across emit → re-parse.
58
+
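
The severity split in item 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in `scripts/validate_frontmatter.py`: the `"warning"` value, `severity_for`, and `exit_code` are hypothetical names, while the `(path, rule, message)` fields, the `"error"` default, and the keyword split come from the text above.

```python
from dataclasses import dataclass

# Keyword split taken from the Decision text; everything else is illustrative.
WARNING_KEYWORDS = {"minLength", "maxLength"}


@dataclass
class SchemaError:
    path: str
    rule: str
    message: str
    # The default keeps positional (path, rule, message) construction working.
    severity: str = "error"


def severity_for(keyword: str) -> str:
    """Length keywords are advisory; every other keyword stays fatal."""
    return "warning" if keyword in WARNING_KEYWORDS else "error"


def exit_code(errors: list[SchemaError]) -> int:
    """Report both tiers, but let only fatals produce a non-zero exit."""
    fatals = [e for e in errors if e.severity == "error"]
    warnings = [e for e in errors if e.severity != "error"]
    for e in warnings:
        print(f"⚠️  {e.path}: {e.message}")
    for e in fatals:
        print(f"{e.path}: {e.message}")
    return 1 if fatals else 0
```

Keeping `severity` as a defaulted trailing field means existing three-argument call sites keep constructing fatal errors unchanged.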
59
+ Deferred (not adopted now):
60
+
61
+ - **`.continue/` projection target** — gated on real Continue usage in our
62
+ projects. Until then it would be an unowned target, i.e. maintenance ballast.
63
+ - **`uses/with/override` MCP composition** — watch, revisit if our
64
+ `scripts/mcp_render.py` needs composable blocks.
65
+
66
+ ## Consequences
67
+
68
+ - Frontmatter quality nudges (length) no longer block CI; structural
69
+ correctness still does. A future over-long `description` surfaces as a
70
+ warning, not a red build — intentional, per Continue's fatal-vs-quality
71
+ split.
72
+ - `SchemaError`'s new `severity` field is library-visible (`__all__`);
73
+ positional `(path, rule, message)` construction stays backward-compatible
74
+ via the default.
75
+ - The roundtrip test fails loudly if a projection emitter drifts, instead of
76
+ shipping a malformed `.cursor/rules/*.mdc` or `.windsurf/rules/*.md`.
77
+ - Reversal cost ~0: both changes are local and removable.
78
+
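
The shape of such a roundtrip check can be sketched as follows. The emitter and parser here are toy stand-ins written for this example, not `condense._emit_cursor_mdc` or `_emit_windsurf_rule`; the sketch only assumes that an emitter flattens newlines in `description` and writes an `alwaysApply` flag into `key: value` frontmatter.

```python
def emit_rule(description: str, always_apply: bool, body: str) -> str:
    """Hypothetical emitter: flatten the description and write frontmatter."""
    flat = " ".join(description.split())
    flag = "true" if always_apply else "false"
    return f"---\ndescription: {flat}\nalwaysApply: {flag}\n---\n\n{body}\n"


def parse_frontmatter(text: str) -> dict:
    """Minimal parser for the `key: value` frontmatter written by emit_rule."""
    _, block, _ = text.split("---\n", 2)
    fields = {}
    for line in block.splitlines():
        key, _, value = line.partition(": ")
        fields[key] = value
    return fields


def test_roundtrip_preserves_frontmatter():
    source = "Use strict types\nin every module"
    parsed = parse_frontmatter(emit_rule(source, True, "Rule body."))
    # The description survives emit -> re-parse, with newlines flattened.
    assert parsed["description"] == "Use strict types in every module"
    assert parsed["alwaysApply"] == "true"
```

Because the assertion compares re-parsed output against the source values, any change in how an emitter writes these fields would fail the test rather than reach a generated rules file.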
79
+ ## Alternatives
80
+
81
+ - **Additive-only (no reclassification)** — add the severity capability but
82
+ keep every check fatal. Rejected: leaves the feature a no-op with no
83
+ warning source.
84
+ - **Reclassify more checks** (e.g. `pattern`, `additionalProperties` →
85
+ warning) — rejected: those are structural-correctness checks; loosening them
86
+ would let malformed frontmatter through.
87
+ - **Skip the ADR, just code it** — rejected: changing a CI gate's strictness
88
+ is a deliberate decision that needs a written record.
89
+
90
+ ## References
91
+
92
+ - [`agents/evidence/analysis/continue-positioning-2026-05-29.md`](../../agents/evidence/analysis/continue-positioning-2026-05-29.md)
93
+ — the positioning verdict table and adoption queue this ADR acts on.
94
+ - `scripts/validate_frontmatter.py` — severity-tier implementation.
95
+ - `tests/test_projection_roundtrip.py` — roundtrip implementation.
96
+ - Upstream patterns: `continuedev/continue` `core/config/validation.ts`,
97
+ `packages/config-yaml/src/markdown/*.test.ts`.
@@ -34,6 +34,7 @@ _Auto-generated by `scripts/adr/regenerate_index.py`. Do not edit._
34
34
  | [ADR-028](ADR-028-root-layout.md) | Root Layout | accepted | 2026-05-25 | — |
35
35
  | [ADR-029](ADR-029-multi-workspace-deferred.md) | Multi Workspace Deferred | accepted | 2026-05-25 | — |
36
36
  | [ADR-030](ADR-030-claude-code-command-projection.md) | Claude Code Command Projection | accepted | 2026-05-28 | — |
37
+ | [ADR-031](ADR-031-validation-severity-tiers-and-projection-roundtrip.md) | Validation Severity Tiers And Projection Roundtrip | accepted | 2026-05-29 | — |
37
38
 
38
39
  ## Unnumbered (legacy)
39
40
 
@@ -89,9 +89,12 @@ intentionally pin an older version of the manifest.
89
89
  Under [ADR-020](../../decisions/ADR-020-global-only-consumer-scope.md)
90
90
  global is the only consumer scope. Consumers carrying a pre-2.5
91
91
  project-scope payload move to global with the one-shot
92
- `npx @event4u/agent-config migrate-to-global` subcommand — it copies
93
- each tool's project payload into the matching user-scope path, drops
94
- the bridge marker, and removes the legacy project artefacts.
92
+ `npx @event4u/agent-config migrate` subcommand — it removes the
93
+ legacy project artefacts in one opinionated pass (deletion-over-
94
+ migration policy); the wizard recreates fresh global config on the
95
+ next `agent-config setup`. See
96
+ [docs/contracts/migrate-command.md](../../contracts/migrate-command.md)
97
+ for the full action matrix.
95
98
 
96
99
  For maintainers running `AGENT_CONFIG_DEV_MODE=1`, project-scope
97
100
  re-installs remain available; the installer still detects scope