@event4u/agent-config 2.19.0 → 2.20.0
- package/.agent-src/commands/agent-status.md +29 -0
- package/.agent-src/commands/onboard.md +221 -81
- package/.agent-src/packs/README.md +49 -0
- package/.agent-src/packs/agency-delivery.yml +63 -0
- package/.agent-src/packs/content-engine.yml +53 -0
- package/.agent-src/packs/founder-mvp.yml +51 -0
- package/.agent-src/presets/README.md +26 -0
- package/.agent-src/presets/balanced.yml +34 -0
- package/.agent-src/presets/fast.yml +31 -0
- package/.agent-src/presets/strict.yml +38 -0
- package/.agent-src/profiles/README.md +29 -0
- package/.agent-src/profiles/agency.yml +27 -0
- package/.agent-src/profiles/content_creator.yml +25 -0
- package/.agent-src/profiles/developer.yml +26 -0
- package/.agent-src/profiles/finance.yml +24 -0
- package/.agent-src/profiles/founder.yml +25 -0
- package/.agent-src/profiles/ops.yml +25 -0
- package/.agent-src/rules/no-cheap-questions.md +25 -17
- package/.agent-src/skills/adr-create/SKILL.md +78 -68
- package/.agent-src/skills/subagent-orchestration/SKILL.md +33 -0
- package/.agent-src/templates/agents/agent-project-settings.example.yml +1 -1
- package/.agent-src/templates/skill-archive-note.md +101 -0
- package/.claude-plugin/marketplace.json +1 -1
- package/CHANGELOG.md +52 -30
- package/README.md +68 -72
- package/config/agent-settings.template.yml +22 -0
- package/docs/adrs/caveman/0001-default-off-until-bench.md +93 -0
- package/docs/adrs/caveman/README.md +9 -0
- package/docs/adrs/cost/0001-hard-stop-hook.md +114 -0
- package/docs/adrs/cost/README.md +9 -0
- package/docs/adrs/memory/0001-consumer-side-snapshot.md +111 -0
- package/docs/adrs/memory/README.md +9 -0
- package/docs/adrs/router/0001-three-tier-routing.md +119 -0
- package/docs/adrs/router/README.md +9 -0
- package/docs/adrs/schema/0001-json-schema-frontmatter.md +102 -0
- package/docs/adrs/schema/README.md +9 -0
- package/docs/adrs/smoke/0001-per-tier-smoke-scripts.md +99 -0
- package/docs/adrs/smoke/README.md +9 -0
- package/docs/architecture/current-onboard-baseline.md +126 -0
- package/docs/architecture/current-safety-behavior.md +137 -0
- package/docs/archive/CHANGELOG-pre-2.16.0.md +48 -0
- package/docs/contracts/adr-layout.md +108 -0
- package/docs/contracts/benchmark-corpus-spec.md +97 -0
- package/docs/contracts/benchmark-report-schema.md +111 -0
- package/docs/contracts/command-clusters.md +1 -0
- package/docs/contracts/command-taxonomy.md +137 -0
- package/docs/contracts/compression-default-kill-criterion.md +69 -0
- package/docs/contracts/config-presets.md +144 -0
- package/docs/contracts/cost-dashboard.md +143 -0
- package/docs/contracts/cost-enforcement.md +134 -0
- package/docs/contracts/file-ownership-matrix.json +0 -7
- package/docs/contracts/mcp-tool-inventory.md +53 -0
- package/docs/contracts/measurement-baseline.md +102 -0
- package/docs/contracts/namespace.md +125 -0
- package/docs/contracts/profile-system.md +142 -0
- package/docs/contracts/safety-model.md +129 -0
- package/docs/contracts/smoke-contracts.md +144 -0
- package/docs/contracts/workflow-packs.md +121 -0
- package/docs/decisions/ADR-010-profile-pack-preset-boundary.md +132 -0
- package/docs/decisions/INDEX.md +1 -0
- package/docs/featured-commands.md +27 -0
- package/docs/parity/bench-ruflo.json +58 -0
- package/docs/parity/bench.json +41 -0
- package/docs/parity/ruflo.md +46 -0
- package/docs/profiles.md +91 -0
- package/package.json +1 -1
- package/scripts/_cli/cmd_explain.py +250 -0
- package/scripts/_lib/bench_cost.py +138 -0
- package/scripts/_lib/bench_quality.py +118 -0
- package/scripts/_lib/bench_report.py +150 -0
- package/scripts/agent-config +13 -0
- package/scripts/audit_adr_coverage.py +175 -0
- package/scripts/audit_mcp_tools.py +146 -0
- package/scripts/bench_baseline_ready.py +108 -0
- package/scripts/bench_drift_check.py +151 -0
- package/scripts/bench_per_tool.py +216 -0
- package/scripts/bench_run.py +155 -0
- package/scripts/config/__init__.py +9 -0
- package/scripts/config/presets.py +206 -0
- package/scripts/config/profiles.py +173 -0
- package/scripts/cost/budget.mjs +73 -12
- package/scripts/cost/preflight.mjs +89 -0
- package/scripts/lint_archived_skills.py +143 -0
- package/scripts/lint_bench_corpus.py +161 -0
- package/scripts/lint_namespace.py +135 -0
- package/scripts/skill_overlap.py +204 -0
- package/scripts/skill_usage_collect.py +191 -0
- package/scripts/skill_usage_report.py +162 -0
- package/scripts/smoke/kernel.sh +101 -0
- package/scripts/smoke/router.sh +129 -0
- package/scripts/smoke/schema.sh +71 -0
- package/scripts/smoke/skills.sh +101 -0
@@ -0,0 +1,126 @@
# Current `/onboard` Baseline (pre-step-15)

> **Status:** descriptive baseline · **Owner:** package maintainer ·
> **Last reviewed:** 2026-05-16
>
> Documents the **current** `/onboard` flow so the Phase 1 Guided
> Setup Wizard (step-15 item 2) has a baseline to extend. Council v3
> unique finding (cannot "extend" an undocumented surface). This file
> describes what ships today; it is **not** a proposal.

## Surface

`/onboard` lives at [`.agent-src.uncompressed/commands/onboard.md`](../../.agent-src.uncompressed/commands/onboard.md)
(canonical source) and is triggered by the
[`onboarding-gate`](../../.agent-src/rules/onboarding-gate.md) rule on
the first turn when `onboarding.onboarded == false` in
`.agent-settings.yml`. Cloud surfaces (Claude.ai Web, Skills API): fully
inert — no settings file, no flow.

## The 12 steps today

| # | Step | Captures | Asked if |
|---|---|---|---|
| 1 | Greet + set expectations | — | always |
| 2 | Offer user-global cross-project defaults | intent flag for step 9 | first-time-setup heuristic only |
| 3 | `personal.user_name` | first name | unset |
| 4 | `personal.ide` (+ auto-detect via `ps aux`) and `personal.open_edited_files` | IDE id, auto-open flag | unset |
| 5 | `personal.pr_comment_bot_icon` | bool | always (no detection possible) |
| 6 | `personal.rtk_installed` (via `which rtk`) | bool + install action | rtk not found |
| 7 | `cost_profile` and `pipelines.skill_improvement` | profile id, learning bool | always (one summary screen) |
| 8 | Mark `onboarding.onboarded: true` | — | always |
| 9 | Write user-global `~/.event4u/agent-config/agent-settings.yml` | six whitelisted keys | step 2 captured "yes" |
| 10 | Summary block | — | always |
| 11 | Quickstart pointer (`/work` and `/implement-ticket`) | — | local only |
| 12 | Maintainer telemetry hint (opt-in) | — | local only |

## What `/onboard` does **not** capture today

Step-15 Phase 1 item 2 introduces a new role-selection step ("8 options
covering Software / Content / Founder / Consulting / Marketing / Finance
/ Handwerk / Self-configure") that produces a `user_type`. Today, no
`user_type` is captured. Specifically:

- **No audience/role question.** `/onboard` knows the developer's name,
  IDE, and rtk install status — never the audience taxonomy.
- **No `profile.id`.** `profile.id` does not exist as a key in
  `.agent-settings.yml`. Per
  [ADR-010](../decisions/ADR-010-profile-pack-preset-boundary.md), it
  is owned by the Phase 1 item 1 profile loader.
- **No `preset.id`.** Same status — `preset.id` arrives with Phase 1
  item 4.
- **No `pack.id`.** Arrives with Phase 2 item 7.
- **No risk-appetite question.** The current flow defers risk posture
  to `personal.autonomy`, which is itself not part of the onboard
  questions (it inherits the template default).
- **No stack question.** Stack is inferred at runtime by detectors
  (`scripts/detect/*`), not asked here.

## Settings keys written today

```yaml
personal:
  user_name: "<first-name>"          # step 3
  ide: "code|phpstorm|cursor"        # step 4
  open_edited_files: true|false      # step 4
  pr_comment_bot_icon: true|false    # step 5
  rtk_installed: true|false          # step 6
cost_profile: "balanced"             # step 7 (default unchanged)
pipelines:
  skill_improvement: true            # step 7 (default unchanged)
onboarding:
  onboarded: true                    # step 8
```

User-global file (step 9, opt-in): the six whitelisted keys in
[`scripts/_lib/agent_settings.py`](../../scripts/_lib/agent_settings.py)
— `name`, `ide`, `cost_profile`, `personal.bot_icon`,
`personal.autonomy`, `caveman.speak_scope`.

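The whitelist behavior described above can be illustrated with a minimal sketch. It is not the code in `scripts/_lib/agent_settings.py`: the function name `pick_user_global` and the helper names are hypothetical, and it assumes the six keys are literal dotted paths into the settings mapping (how `name` and `ide` map onto `personal.user_name` and `personal.ide` is not stated in this baseline).

```python
from typing import Any, Dict

# The six keys named in the baseline. The authoritative list lives in
# scripts/_lib/agent_settings.py; this tuple is only a copy for illustration.
USER_GLOBAL_KEYS = (
    "name",
    "ide",
    "cost_profile",
    "personal.bot_icon",
    "personal.autonomy",
    "caveman.speak_scope",
)

_MISSING = object()


def _lookup(data: Dict[str, Any], dotted: str) -> Any:
    """Follow a dotted path; return _MISSING if any segment is absent."""
    node: Any = data
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return _MISSING
        node = node[part]
    return node


def _assign(target: Dict[str, Any], dotted: str, value: Any) -> None:
    """Set a dotted path, creating intermediate mappings as needed."""
    parts = dotted.split(".")
    node = target
    for part in parts[:-1]:
        node = node.setdefault(part, {})
    node[parts[-1]] = value


def pick_user_global(settings: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: keep only whitelisted keys that are present."""
    result: Dict[str, Any] = {}
    for key in USER_GLOBAL_KEYS:
        value = _lookup(settings, key)
        if value is not _MISSING:
            _assign(result, key, value)
    return result
```

Keys outside the whitelist, such as `onboarding.onboarded` or `personal.rtk_installed`, are dropped rather than copied, which matches the idea that the user-global file carries only cross-project defaults.
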
## Iron Laws today

- **One question per turn** ([`ask-when-uncertain`](../../.agent-src/rules/ask-when-uncertain.md)).
- **Re-runnable** — invoking `/onboard` when `onboarded: true` walks the
  flow again, never silently rewrites a value (asks before overwriting
  `user_name` / `ide`).
- **Never commits** — `.agent-settings.yml` is git-ignored.
- **User-global write is opt-in + one-shot + never silent** — step 2
  captures intent, step 9 re-confirms.

## Gaps the wizard (Phase 1 item 2) must close

1. **Add role-selection step** producing a `user_type` (later mapped to
   `profile.id`). Eight options covering Software / Content / Founder /
   Consulting / Marketing / Finance / Handwerk / Self-configure.
   Inserted **before** step 8 (mark onboarded) so the profile loader
   has a value to read on the next session start.
2. **Add stack-detection confirmation step.** Run the existing
   `scripts/detect/*` detectors, present the result, allow the user
   to override. Without confirmation, profile-aware presets cannot
   resolve.
3. **Add risk-appetite question.** Maps to `preset.id` from
   [`config-presets.md`](../contracts/config-presets.md). Three
   options: `fast` / `balanced` / `strict`.
4. **Write the new keys.** `profile.id`, `preset.id`, optionally
   `pack.id`, plus the user-typed `user_type` as a stable audit field.

## Wizard contract (Phase 1 item 2 acceptance)

The wizard MUST:

- Preserve every existing step semantically (no silent removal).
- Insert role + stack + risk-appetite questions **before** step 8.
- Honor the one-question-per-turn Iron Law.
- Write `profile.id`, `preset.id`, and `user_type` to
  `.agent-settings.yml` using the section-aware merge rules.
- Be re-runnable (idempotent for unchanged answers).
- Work offline (no network call required for any question).
- Skip itself on cloud surfaces (inherit current cloud-noop behavior).

## See also

- [`/onboard` command](../../.agent-src.uncompressed/commands/onboard.md) — canonical source.
- [`onboarding-gate`](../../.agent-src/rules/onboarding-gate.md) — trigger rule.
- [`ADR-010`](../decisions/ADR-010-profile-pack-preset-boundary.md) — boundary the wizard must respect.
- [`config-presets.md`](../contracts/config-presets.md) — preset axis the wizard writes.
- [`agents/roadmaps/step-15-product-refinement.md`](../../agents/roadmaps/step-15-product-refinement.md) — Phase 1 item 2.

@@ -0,0 +1,137 @@
# Current Safety Behavior — Baseline (pre-step-15)

> **Status:** descriptive baseline · **Owner:** package maintainer ·
> **Last reviewed:** 2026-05-16
>
> Documents the **current** safety / autonomy surface so the Phase 2
> Universal Safety Model ADR (step-15 item 9) has a baseline to diff
> against. Council v3 action #4 prerequisite. This file describes what
> ships today; it is **not** a proposal for what should ship next.

## Scope

The current package has **one autonomy switch** plus **four
non-overridable floors**. The Phase 2 ADR will replace the single switch
with per-profile, per-domain `deny / ask / allow` declarations. Before
that ADR can specify "replace X", X has to be written down.

## The one switch — `personal.autonomy`

**Where defined:** `.agent-settings.yml` under `personal.autonomy`.
Template: `config/agent-settings.template.yml`.

**Values:** `on` · `off` · `auto`.

**Read site:** [`.agent-src/rules/autonomous-execution.md`](../../.agent-src/rules/autonomous-execution.md)
(Iron-Law rule, kernel-loaded in every profile). Cached on the first
turn; missing key treated as `on`.

**What it gates:** trivial workflow questions (suppression). Examples:
"Should I run the tests now?", "Should I create the branch?", "Continue
with the next phase?". These are suppressed when `autonomy` resolves to
`on`.

**What it does NOT gate:** any of the four floors below, any
[`scope-control`](../../.agent-src/rules/scope-control.md) git operation,
or any [`commit-policy`](../../.agent-src/rules/commit-policy.md) commit
default. The switch only narrows the **trivial-question** surface.

### State table

| State | Behavior on trivial workflow questions | Blocking / Hard-Floor / Commit gates |
|---|---|---|
| `on` | **Suppress** — agent acts, surfaces what it did | Unchanged — still apply |
| `off` | **Ask** — numbered options, single question | Unchanged — still apply |
| `auto` | Same as `off` until the user opts in via a standing autonomy directive ("just work", or "arbeite eigenständig", German for "work independently"). Then sticky-flip to `on` for the rest of the conversation. Mirror opt-out flips back. | Unchanged — still apply |

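The state table can be read as a small state machine. The sketch below is illustrative only: the names `AutonomyState`, `load_state`, `apply_directive`, and `suppresses_trivial_questions` are hypothetical, and it assumes that opt-in and opt-out directives only change behavior when the setting is `auto`, since the table describes the sticky flip for that state alone. None of this touches the floors, which apply in every state.

```python
from dataclasses import dataclass, replace
from typing import Optional

VALID_VALUES = ("on", "off", "auto")


@dataclass(frozen=True)
class AutonomyState:
    setting: str            # resolved value of personal.autonomy
    opted_in: bool = False  # sticky flag, only meaningful when setting == "auto"


def load_state(raw: Optional[str]) -> AutonomyState:
    """Resolve the raw settings value; a missing key is treated as `on`."""
    value = "on" if raw is None else raw
    if value not in VALID_VALUES:
        raise ValueError(f"unsupported personal.autonomy value: {value!r}")
    return AutonomyState(setting=value)


def apply_directive(state: AutonomyState, opt_in: bool) -> AutonomyState:
    """Record a standing opt-in (True) or mirror opt-out (False).

    Assumption: directives are ignored for `on` and `off`, because the
    baseline only describes the sticky flip for `auto`.
    """
    if state.setting != "auto":
        return state
    return replace(state, opted_in=opt_in)


def suppresses_trivial_questions(state: AutonomyState) -> bool:
    """True when trivial workflow questions should be suppressed."""
    if state.setting == "auto":
        return state.opted_in
    return state.setting == "on"
```

Keeping the sticky flag separate from the configured value means an opt-out in `auto` mode returns to asking without rewriting `.agent-settings.yml`.
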
### Opt-in detection

Intent-matched, not literal-string-matched. Speech-act-checked: the
phrase must be a meta-instruction, not content / quote / code. Detail:
[`autonomy-detection`](../../.agent-src/contexts/execution/autonomy-detection.md),
[`autonomy-mechanics`](../../.agent-src/contexts/execution/autonomy-mechanics.md).

### Task scope vs conversation scope

Two distinct autonomy shapes:

| Shape | Trigger | Scope |
|---|---|---|
| **Conversation-wide trivial-question suppression** | "stop asking on trivial steps" — no deliverable named | Sticky for the rest of the conversation. Suppresses trivial workflow questions only. |
| **Task-scoped autonomous execution** | "work autonomously on X", or "arbeite die Roadmap Y komplett ab" (German for "work through roadmap Y completely") — deliverable named | Bound to that task. Ends when the task ends. Does NOT authorize a different later deliverable. |

Per [`autonomous-execution § task-scope`](../../.agent-src/rules/autonomous-execution.md#task-scope--autonomy-is-bound-to-the-named-task).

## The four non-overridable floors

No value of `personal.autonomy` lifts any of these. Standing
autonomy directives, roadmap authorizations, or "just keep going"
phrases never reach them.

### 1. Hard Floor — `non-destructive-by-default`

[`.agent-src/rules/non-destructive-by-default.md`](../../.agent-src/rules/non-destructive-by-default.md).
Stops on: production-branch merges; deploy / release; push to remote;
production data / infra writes; whimsical bulk deletions; commits
containing bulk deletions or infra changes. **Always confirm this turn.**

### 2. Git-ops Permission Gate — `scope-control`

[`.agent-src/rules/scope-control.md § Git operations`](../../.agent-src/rules/scope-control.md#git-operations--permission-gated).
Stops on: commit · push · merge · rebase · force-push · branch create /
switch / delete · PR create / close / retarget · tag / release / pin.
Permission must be **this turn or a standing instruction not yet
revoked**.

### 3. Commit Default — `commit-policy`

[`.agent-src/rules/commit-policy.md`](../../.agent-src/rules/commit-policy.md).
**Never commit, never ask about committing.** Four exceptions: user
says so this turn · standing instruction · `/commit` invoked · roadmap
authorization. Anything else → no commit.

### 4. Security-sensitive STOP — `security-sensitive-stop`

[`.agent-src/rules/security-sensitive-stop.md`](../../.agent-src/rules/security-sensitive-stop.md).
Stops on: auth, billing, tenant boundaries, secrets, uploads,
integrations, webhooks, public endpoints. Threat-model **before**
editing.

## Coverage map

| Surface | What governs it |
|---|---|
| Trivial workflow question | `personal.autonomy` (the switch) |
| Blocking architectural / scope question | [`ask-when-uncertain`](../../.agent-src/rules/ask-when-uncertain.md) (always) |
| Tool / MCP call cost | None today — Phase 1 item 4 introduces preset-loader Hard Enforcement |
| Skill / command allowlist per audience | None today — Phase 2 item 7 introduces packs |
| Per-domain `deny / ask / allow` | None today — Phase 2 item 9 introduces this |
| Hard Floor (prod, deploy, push, bulk-destructive) | Universal — not switchable |
| Git ops | Universal permission gate — not switchable |
| Commit | Universal default-deny — not switchable |

## Gaps the Phase 2 ADR will address

1. **One switch, one granularity.** Today, `autonomy: on` suppresses
   *every* trivial question identically. A founder running the
   `content-engine` pack may want autonomy for content, ask-mode for
   spend; the current model cannot express that.
2. **No per-domain policy.** Domain-safety rules
   (`.agent-src/rules/domain-safety-*.md`) act as output floors but do
   not declare `deny / ask / allow` per profile. The Phase 2 model
   centralizes this.
3. **No machine-readable safety schema.** The current behavior is
   distributed across four rules. A consuming tool (the wizard, the
   explain command) cannot ask "what is this install's safety posture?"
   without reading rule prose.

The Phase 2 ADR (`docs/contracts/safety-model.md`) inherits this
baseline and adds: per-profile policy table, machine-readable schema,
explain-trace integration. It MUST NOT silently relax any of the four
floors above.

## See also

- [`autonomous-execution`](../../.agent-src/rules/autonomous-execution.md) · [`non-destructive-by-default`](../../.agent-src/rules/non-destructive-by-default.md) · [`scope-control`](../../.agent-src/rules/scope-control.md) · [`commit-policy`](../../.agent-src/rules/commit-policy.md) · [`security-sensitive-stop`](../../.agent-src/rules/security-sensitive-stop.md).
- [`docs/safety.md`](../safety.md) — domain-safety output floors.
- [`agents/roadmaps/step-15-product-refinement.md`](../../agents/roadmaps/step-15-product-refinement.md) — Phase 1 item 2a (this doc) and Phase 2 item 9 (Universal Safety Model ADR).

@@ -0,0 +1,48 @@
# Changelog Archive — pre-2.16.0

> Frozen snapshot of `event4u/agent-config` changelog entries from
> `2.15.0`, split out of the main
> [`CHANGELOG.md`](../../CHANGELOG.md) on 2026-05-16 once the active
> era's body crossed the 200-line drift cap enforced by
> `tests/test_changelog_eras.py`.
>
> **Read-only.** New entries land in `CHANGELOG.md` § "Era: 2.16.x".
> Entries here are not amended — git tag `2.15.0` remains the
> canonical source for what shipped.
>
> Entry shape follows the conventions documented in
> [`docs/contracts/CHANGELOG-conventions.md`](../contracts/CHANGELOG-conventions.md).
> Earlier eras live in
> [`CHANGELOG-pre-2.15.0.md`](CHANGELOG-pre-2.15.0.md),
> [`CHANGELOG-pre-2.11.0.md`](CHANGELOG-pre-2.11.0.md),
> [`CHANGELOG-pre-2.7.0.md`](CHANGELOG-pre-2.7.0.md), and
> [`CHANGELOG-pre-2.2.0.md`](CHANGELOG-pre-2.2.0.md).

## [2.15.0](https://github.com/event4u-app/agent-config/compare/2.14.0...2.15.0) (2026-05-15)

### Features

* **agent-user:** add /agents user command cluster (init, show, review, accept, update) ([15d53d8](https://github.com/event4u-app/agent-config/commit/15d53d8d9a2365b044831cd42127e247a70d7e20))
* **agent-user:** add v1 schema contract for .agent-user.md persona file ([64f4eab](https://github.com/event4u-app/agent-config/commit/64f4eab62ccf6a2606fbca0c56d398372c05a7a0))

### Bug Fixes

* **agent-user:** inline council-reference summary per no-roadmap-references ([ee4d3ce](https://github.com/event4u-app/agent-config/commit/ee4d3cedf9f4429450d21ca5badc2ae5c2ecaaed))
* **agent-user:** drop roadmap references per no-roadmap-references rule ([c8ade8d](https://github.com/event4u-app/agent-config/commit/c8ade8d7c5b495e0e4295aa0cb801e59076ee0b0))
* **agent-user:** adjust keep-beta-until to fit 90-day window ([801b365](https://github.com/event4u-app/agent-config/commit/801b365117a2d1efb4505e504bdd730e4cbbc217))

### Documentation

* **persona:** README section + agent-settings legacy-fallback note ([4da7629](https://github.com/event4u-app/agent-config/commit/4da7629f1f0b5a35a64d0a861040ad8639a66ebe))
* **roadmap:** mark step-3-agent-user-persona phases as in-progress ([f29d3bc](https://github.com/event4u-app/agent-config/commit/f29d3bce2380c0ea9c67e6094540b88d920ed9ff))

### Chores

* **roadmap:** close out + archive step-3-agent-user-persona ([09c0229](https://github.com/event4u-app/agent-config/commit/09c0229efd67af9cad7b2ca8202f4caa351d028d))
* **ownership:** regenerate file-ownership-matrix for /agents user ([128890d](https://github.com/event4u-app/agent-config/commit/128890d880584704b4842a398555dd979ae54462))
* **docs:** bump command count from 109 to 115 ([f8c61b1](https://github.com/event4u-app/agent-config/commit/f8c61b1d0ec48034e0d66e8d32534056ca4aa1f0))
* **template:** bump agent_config_version pin to 2.14.0 ([fcb885f](https://github.com/event4u-app/agent-config/commit/fcb885fd19bdbca46ef91ec4d5e723cc6c186c6d))
* **index:** regenerate agents/index.md + docs/catalog.md for /agents user ([56b281d](https://github.com/event4u-app/agent-config/commit/56b281d69960d3e57adbd24b9ec6fd24fc1a5aff))
* **agent-user:** regenerate compressed sources + claude tool stubs ([f79b6d1](https://github.com/event4u-app/agent-config/commit/f79b6d1cfcf1caccde4a723ad779c65d9ed87198))

Tests: 4352 (+12 since 2.14.0)

@@ -0,0 +1,108 @@
---
stability: stable
---

# ADR Layout — Per-area Directories

> Status: accepted · 2026-05-16 · Roadmap: `step-11-ruflo-parity` Phase 4

## Scope

Two ADR surfaces coexist in this repo. **Both are canonical** — neither supersedes the other.

| Surface | Path | Use for |
|---|---|---|
| **Flat (legacy)** | `docs/decisions/ADR-NNN-<slug>.md` | Cross-cutting governance decisions: kernel composition, rule taxonomy, package-wide architecture. Numbering is global, sequential, gap-free. |
| **Per-area** | `docs/adrs/<area>/NNNN-<slug>.md` | Sub-area decisions whose blast radius is one plugin / one subsystem. Numbering is per-area, starts at `0001`, padded to 4 digits. |

Choice rule — does the decision constrain code **inside one area folder** (one runtime module, one contract group, one CLI surface)? → per-area. Does it constrain **the package's contract with consumers**? → flat. In doubt → per-area (cheaper to surface, easier to relocate).

## Per-area layout

```
docs/adrs/
  <area>/
    README.md          # one-paragraph area scope + table of all ADRs in this area
    0001-<slug>.md     # first ADR, retrospective or prospective
    0002-<slug>.md
    ...
```

`<area>` is a kebab-case stem matching one of:

- An entry in the canonical area inventory (see [`scripts/audit_adr_coverage.py`](../../scripts/audit_adr_coverage.py) `AREAS`).
- A new area added to that inventory in the same PR.

Reserved areas (bootstrap pass — step-11 Phase 4 Step 3):

| Area | Scope | Owner contract |
|---|---|---|
| `cost` | Budget ladder, hard-stop hook, cost reporting | [`cost-enforcement.md`](cost-enforcement.md) |
| `caveman` | Caveman-speak compression, decompression, reversibility | [`compression-default-kill-criterion.md`](compression-default-kill-criterion.md) |
| `schema` | Frontmatter schemas, v2 rigor, lint behaviour | [`schema-versioning.md`](schema-versioning.md) (when published) |
| `router` | `router.json` shape, tier semantics, dispatch precedence | [`rule-router.md`](rule-router.md) |
| `smoke` | Per-tier smoke contracts, baseline locks | [`smoke-contracts.md`](smoke-contracts.md) |
| `memory` | Memory MCP, propose / promote / poison flow | [`agent-memory-contract.md`](agent-memory-contract.md) |

## Frontmatter

Identical across both surfaces:

```yaml
---
adr: NNN                              # zero-padded; per-area uses 4-digit (0001), flat uses 3-digit (010)
area: <area> | flat                   # 'flat' for docs/decisions/, otherwise the area slug
status: proposed | accepted | superseded | deprecated
date: YYYY-MM-DD
decision: <slug>
supersedes: — | ADR-<area>-NNNN | ADR-MMM
superseded_by: — | ADR-<area>-NNNN | ADR-MMM
phase: <roadmap-stem> · <phase-id>    # optional but recommended
type: retrospective | prospective
---
```

Supersession links cross surfaces: a per-area ADR may supersede a flat ADR and vice versa. The numeric prefix in `supersedes:` makes the target unambiguous (`ADR-007` = flat, `ADR-cost-0001` = per-area).

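To show why the two reference shapes cannot be confused, here is a minimal parsing sketch. The function name `classify_adr_ref` and both regular expressions are assumptions made for illustration; the package's own lint may resolve references differently.

```python
import re
from typing import Optional, Tuple

# Flat references carry a bare 3-digit number: ADR-007.
FLAT_REF = re.compile(r"^ADR-(\d{3})$")
# Per-area references carry a kebab-case area slug and a 4-digit number: ADR-cost-0001.
AREA_REF = re.compile(r"^ADR-([a-z0-9]+(?:-[a-z0-9]+)*)-(\d{4})$")


def classify_adr_ref(ref: str) -> Tuple[str, Optional[str], int]:
    """Return (surface, area, number) for a supersession reference.

    Hypothetical helper: raises ValueError for anything that matches
    neither documented shape.
    """
    flat = FLAT_REF.match(ref)
    if flat:
        return ("flat", None, int(flat.group(1)))
    per_area = AREA_REF.match(ref)
    if per_area:
        return ("per-area", per_area.group(1), int(per_area.group(2)))
    raise ValueError(f"not a recognised ADR reference: {ref!r}")
```

Because a flat reference has no slug segment and a per-area reference always has one, a resolver can pick the target directory from the reference text alone.
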
## Per-area README contract

Every `<area>/` directory carries a `README.md` with:

1. One-paragraph area scope (≤ 4 sentences).
2. Single contract pointer — the `docs/contracts/<X>.md` this area implements (or "no published contract" if pre-Phase 5).
3. Numbered table of ADRs in the area: `| # | Title | Status | Date | Supersedes |`. Generated by `scripts/audit_adr_coverage.py --regen-area-readme <area>`.

## Coverage gate

`scripts/audit_adr_coverage.py --check` (wired to `task lint-adr-coverage`):

- Warns when a `docs/contracts/<X>.md` exists without a matching `docs/adrs/<X>/0001-*.md`.
- Hard-fails on number gaps within an area (e.g. `0001`, `0003` without `0002`).
- Hard-fails on missing `README.md` in any non-empty area directory.
- Warns on dangling `supersedes:` or `superseded_by:` references.

Default mode is **warn** at the consumer surface; **fail** under `task ci`. Rationale: a new contract dropped without an ADR is a documentation gap, not a bug. CI enforces it for this package; consumer projects opt in by adding the task to their own pipeline.

## Numbering & gaps

- Per-area: 4-digit, gap-free, starts at `0001`. Re-use of numbers is a hard failure in the index regenerator.
- Flat: 3-digit, gap-free, starts at `001`. Existing ADRs in `docs/decisions/` set the precedent.
- A deleted ADR is **never** removed from history — supersede it. The lint surfaces broken supersession chains.

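The gap and re-use rules for a per-area directory can be sketched as a pure function over filenames. This is an illustration of the invariant, not the logic in `scripts/audit_adr_coverage.py` or the index regenerator; the name `find_numbering_problems` and the filename pattern are assumptions.

```python
import re
from typing import Iterable, List

# Assumed filename shape for per-area ADRs: NNNN-<slug>.md
ADR_FILE = re.compile(r"^(\d{4})-[a-z0-9][a-z0-9-]*\.md$")


def find_numbering_problems(filenames: Iterable[str]) -> List[str]:
    """Report duplicate numbers and gaps in one area's ADR filenames.

    Hypothetical helper: files that do not look like ADRs (for example
    README.md) are ignored.
    """
    numbers = sorted(
        int(match.group(1))
        for name in filenames
        if (match := ADR_FILE.match(name))
    )
    problems: List[str] = []
    seen = set()
    for number in numbers:
        if number in seen:
            problems.append(f"duplicate {number:04d}")
        seen.add(number)
    if seen:
        # Numbering must start at 0001 and run without holes.
        for missing in sorted(set(range(1, max(seen) + 1)) - seen):
            problems.append(f"gap {missing:04d}")
    return problems
```

An empty result means the area is gap-free and has no reused numbers; a caller could treat any non-empty result as a hard failure, in line with the gate described above.
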
## Relationship to `adr-create` skill

[`adr-create`](../../.agent-src.uncompressed/skills/adr-create/SKILL.md) accepts an optional `<area>` argument (added in step-11 Phase 4 Step 4):

- No `<area>` → flat surface, `docs/decisions/`.
- `<area>` matches inventory → per-area surface, `docs/adrs/<area>/`.
- `<area>` does **not** match inventory → skill refuses with a hint to update the inventory first.

The skill's template, numbering logic, and validation hooks are identical for both surfaces; only the target directory and number padding differ.

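The three-way dispatch above might be modelled as follows. This is a sketch under assumptions: `resolve_adr_target` is a hypothetical name, and the inventory is passed in as a plain set rather than read from `scripts/audit_adr_coverage.py`.

```python
from pathlib import PurePosixPath
from typing import Optional, Set, Tuple


def resolve_adr_target(area: Optional[str], inventory: Set[str]) -> Tuple[PurePosixPath, int]:
    """Return (target directory, number padding) for a new ADR.

    Hypothetical helper mirroring the documented rules:
    no area -> flat surface, known area -> per-area surface,
    unknown area -> refuse.
    """
    if area is None:
        return PurePosixPath("docs/decisions"), 3
    if area in inventory:
        return PurePosixPath("docs/adrs") / area, 4
    raise ValueError(
        f"unknown area {area!r}: add it to the area inventory first"
    )


# Example inventory using the six reserved areas listed in this contract.
RESERVED_AREAS = {"cost", "caveman", "schema", "router", "smoke", "memory"}
```

Returning the padding alongside the directory keeps the only two surface-specific values in one place, which fits the statement that everything else is shared.
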
## References

- [`docs/adrs/cost/0001-hard-stop-hook.md`](../adrs/cost/0001-hard-stop-hook.md) — first per-area ADR (bootstrap).
- [`docs/decisions/INDEX.md`](../decisions/INDEX.md) — flat surface index.
- [`scripts/audit_adr_coverage.py`](../../scripts/audit_adr_coverage.py) — coverage gate.
- [`scripts/adr/regenerate_index.py`](../../scripts/adr/regenerate_index.py) — index regenerator (works on both surfaces; pass `--dir`).
- `step-11-ruflo-parity` Phase 4 — origin.

@@ -0,0 +1,97 @@
---
stability: beta
keep-beta-until: 2026-08-14
---

# Benchmark Corpus Spec — step-4 Phase 1

Parser-visible contract for the golden corpus consumed by
[`scripts/bench_runner.py`](../../scripts/bench_runner.py) and the
upcoming `scripts/lint_bench_corpus.py`. Defines composition, schema,
and validation invariants.

## Path decision

Roadmap `step-4-measurement-and-benchmark.md`
Phase 1 Step 2 names `bench/corpus.yaml`. The existing benchmark
infrastructure (runner + non-dev corpus + `task bench`) lives under
`tests/eval/` and `scripts/bench_runner.py` hardcodes that directory.
**Canonical location:** `tests/eval/corpus-<id>.yaml`. The `bench/`
directory is reserved for **reports + pricing** (Phase 2 deliverables).
Migration to `bench/corpus.yaml` is a no-op rename if downstream Phase
2 work proves the consolidation is worth the diff cost.

## Composition (25 prompts)

| Bucket | Count | Purpose |
|---|---|---|
| **Routing-canonical** | 10 | One prompt per major skill cluster — exact-match scoring |
| **Ambiguous** | 8 | Multiple plausible skills — set-intersection ≥ 0.7 scoring |
| **Destructive / security carve-out** | 5 | Triggers a safety floor — selection must surface the floor skill |
| **Long-context** | 2 | ≥ 4 k input tokens — exercises retrieval under context pressure |

The 10 routing-canonical prompts MUST cover the kernel + tier-1 skill
clusters used by the dev profile (`developer.yml`). The 8 ambiguous
prompts MUST each declare ≥ 2 acceptable skills in `expected_skills`.
The 5 destructive / security prompts MUST declare an
`expected_carve_outs` value (e.g. `security-sensitive-stop`,
`non-destructive-by-default`).

## Schema

```yaml
version: 1  # corpus format version (int)
corpus_id: <id>  # short kebab-case identifier
selection_accuracy_target: 0.60  # 0.0–1.0; runner exits non-zero below
prompts:
  - id: <bucket>-<NN>  # e.g. canonical-01, ambiguous-03
    category: <bucket>  # canonical | ambiguous | destructive | long-context
    user_type_candidates: [<slug>, ...]  # optional; informational
    language: en  # en | de — per language-and-tone
    prompt: "<text>"  # the agent-facing prompt
    expected_skills: [<slug>, ...]  # ≥ 1 entry; non-empty
    expected_carve_outs: [<slug>, ...]  # required when category == destructive
    rubric:  # optional structural assertion
      must_include: ["<phrase>", ...]  # all phrases must appear in output
      must_not_include: ["<phrase>", ...]
      length_words: { min: 0, max: 0 }
    quality_assertion: "<regex>"  # optional regex over agent output
```

### Invariants (lint-bench gate)

| Drift | `reason` | Example |
|---|---|---|
| Missing top-level `version` / `corpus_id` / `prompts` | `missing_top_level` | — |
| `version` not in `{1}` | `unsupported_version` | `version: 2` |
| `selection_accuracy_target` outside `[0.0, 1.0]` | `target_out_of_range` | `1.5` |
| Duplicate `id` across prompts | `duplicate_id` | two `canonical-01` |
| `id` does not match `^[a-z][a-z0-9-]*-\d{2}$` | `bad_id_format` | `Canonical_1` |
| `category` not in `{canonical, ambiguous, destructive, long-context}` | `bad_category` | `category: misc` |
| `language` not in `{en, de}` | `bad_language` | `language: fr` |
| `expected_skills` empty / missing | `empty_expected` | `expected_skills: []` |
| `expected_skills` references an unknown skill slug | `unknown_skill` | `expected_skills: [imaginary]` |
| `category == destructive` without `expected_carve_outs` | `missing_carve_out` | — |
| Prompt text empty / whitespace-only | `empty_prompt` | — |

The linter MUST run with `--quiet` honour per the script-output
convention and emit one violation per line in non-quiet mode.
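
The per-prompt rows of the invariant table could be checked along the following lines. This is a sketch, not the upcoming `scripts/lint_bench_corpus.py`: the function name `lint_prompt`, its signature, and the `known_skills` argument are hypothetical, and it assumes a missing `language` counts as `bad_language`. Corpus-level checks (`missing_top_level`, `unsupported_version`, `target_out_of_range`, `duplicate_id`) are left out.

```python
import re
from typing import Any, Dict, List, Set

# Pattern and allowed values are copied from the invariant table.
ID_PATTERN = re.compile(r"^[a-z][a-z0-9-]*-\d{2}$")
CATEGORIES = {"canonical", "ambiguous", "destructive", "long-context"}
LANGUAGES = {"en", "de"}


def lint_prompt(prompt: Dict[str, Any], known_skills: Set[str]) -> List[str]:
    """Return the `reason` codes violated by one corpus prompt (empty list = clean)."""
    reasons: List[str] = []
    if not ID_PATTERN.match(str(prompt.get("id", ""))):
        reasons.append("bad_id_format")
    if prompt.get("category") not in CATEGORIES:
        reasons.append("bad_category")
    if prompt.get("language") not in LANGUAGES:
        reasons.append("bad_language")
    skills = prompt.get("expected_skills") or []
    if not skills:
        reasons.append("empty_expected")
    elif any(slug not in known_skills for slug in skills):
        reasons.append("unknown_skill")
    if prompt.get("category") == "destructive" and not prompt.get("expected_carve_outs"):
        reasons.append("missing_carve_out")
    if not str(prompt.get("prompt") or "").strip():
        reasons.append("empty_prompt")
    return reasons
```

Returning reason codes rather than raising keeps the "one violation per line" output trivial: a caller can print each code with the prompt `id`, or print nothing under `--quiet`.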

## Composition gates (25-prompt-complete state)

Once `corpus-dev.yaml` reaches the 25-prompt target, the linter
additionally enforces the per-bucket counts above. Until then, the
linter only enforces per-prompt invariants — partial corpora are
valid during Phase 1 build-out.

The composition gate is opt-in via `--require-full` to keep the
reduced 10-prompt suite (Phase 1 Step 4) usable during development
without tripping CI.
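
A minimal sketch of the opt-in composition gate, assuming the bucket counts from the composition table map onto the four `category` values. The function name `check_composition`, the `require_full` parameter (mirroring `--require-full`), and the message wording are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List

# Per-bucket targets from the composition table (25 prompts in total).
EXPECTED_COUNTS = {"canonical": 10, "ambiguous": 8, "destructive": 5, "long-context": 2}


def check_composition(prompts: List[Dict[str, Any]], require_full: bool = False) -> List[str]:
    """Return one message per bucket whose count is off; no-op unless require_full."""
    if not require_full:
        # Partial corpora are valid while the gate is not requested.
        return []
    counts = Counter(p.get("category") for p in prompts)
    return [
        f"{bucket}: expected {want}, found {counts.get(bucket, 0)}"
        for bucket, want in EXPECTED_COUNTS.items()
        if counts.get(bucket, 0) != want
    ]
```

With `require_full=False` a 10-prompt corpus passes untouched; the same corpus with `require_full=True` reports every bucket that is short.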

## Cross-references

- Runner — [`scripts/bench_runner.py`](../../scripts/bench_runner.py)
- Linter — `scripts/lint_bench_corpus.py` (Phase 1 Step 3)
- Existing non-dev corpus — [`tests/eval/corpus-non-dev.yaml`](../../tests/eval/corpus-non-dev.yaml)
- Language gate — [`language-and-tone`](../../.agent-src.uncompressed/rules/language-and-tone.md)
- Report schema — `docs/contracts/benchmark-report-schema.md` (Phase 2 Step 4)

@@ -0,0 +1,111 @@
---
stability: beta
keep-beta-until: 2026-08-14
---

# Benchmark Report Schema — step-4 Phase 2

Parser-visible contract for the JSON + Markdown reports emitted by
[`scripts/bench_run.py`](../../scripts/bench_run.py). Every `task bench`
run writes one `bench/reports/<ts>-<corpus_id>.json` + matching `.md`.

## File layout

```
bench/
├── pricing.yaml  # per-1M model rates + sourced_on dates
└── reports/
    ├── 2026-05-16T10-30-00Z-dev.json  # machine-readable
    ├── 2026-05-16T10-30-00Z-dev.md  # human-readable
    └── ...
```

Filename format: `<UTC ISO-8601 with `:` → `-`>-<corpus_id>.{json,md}`.
Sortable lexicographically.
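
The filename rule can be illustrated with a short helper. The function name `report_filenames` is hypothetical; only the format itself comes from the contract above.

```python
from datetime import datetime, timezone
from typing import Tuple


def report_filenames(generated_at: datetime, corpus_id: str) -> Tuple[str, str]:
    """Build the .json and .md report names for one benchmark run."""
    if generated_at.tzinfo is None:
        raise ValueError("generated_at must be timezone-aware")
    # UTC ISO-8601 timestamp with ':' replaced by '-'.
    stamp = generated_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
    stem = f"{stamp}-{corpus_id}"
    return f"{stem}.json", f"{stem}.md"
```

For a run at 10:30:00 UTC on 2026-05-16 against the `dev` corpus this produces the two names shown in the file layout. Because the timestamp has fixed-width fields and comes first, a plain string sort orders reports chronologically.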

## JSON schema (v1)

```yaml
schema_version: 1
generated_at: <ISO-8601 UTC>
corpus:
  id: <corpus_id>
  path: tests/eval/corpus-<id>.yaml
  prompt_count: <int>
runner:
  bench_run_version: <semver>
  baseline_collector: scripts/bench_runner.py  # selection-accuracy floor
  baseline_collector_sha: <git-sha-or-mtime>
selection:
  top_k: 3
  prompts_hit: <int>
  prompts_total: <int>
  selection_accuracy: <float 0.0-1.0>  # hits / total
  target: <float>  # from corpus
  passed: <bool>  # accuracy >= target
  per_prompt:  # one entry per corpus prompt
    - id: canonical-01
      expected_skills: [...]
      top_k_ranked: [...]
      hit: <bool>
cost:
  source: agents/cost-tracking/sessions.jsonl  # or "unavailable"
  sessions_scanned: <int>
  totals:
    input_tokens: <int>
    output_tokens: <int>
    cache_read_input_tokens: <int>
    cache_creation_input_tokens: <int>
    total_cost_usd: <float>
  per_tier:  # haiku / sonnet / opus / unknown
    sonnet: { messages: <int>, cost_usd: <float> }
    ...
  pricing_sourced_on: <ISO date from bench/pricing.yaml>
quality:
  source: <path-or-"not_collected">
  prompts_with_assertion: <int>
  prompts_passing: <int>
  quality_score: <float 0.0-1.0>  # passing / total OR 0.0 if not_collected
  per_prompt:
    - id: canonical-01
      assertion: <regex-string>
      assertion_kind: rubric.must_include | quality_assertion
      passed: <bool | "not_collected">
verdict:
  selection: pass | fail
  quality: pass | fail | not_collected
  overall: pass | fail | partial  # partial = quality not_collected
```
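
How the `selection` totals and the `verdict` block relate can be sketched as below. The function names are hypothetical, and two points are assumptions rather than stated contract: an empty corpus scores `0.0`, and `overall` is `pass` only when both selection and quality pass. The `partial` case follows the schema comment (quality `not_collected`).

```python
from typing import Any, Dict, List, Optional


def selection_summary(per_prompt: List[Dict[str, Any]], target: float) -> Dict[str, Any]:
    """Aggregate per-prompt hits into the `selection` totals."""
    total = len(per_prompt)
    hits = sum(1 for entry in per_prompt if entry["hit"])
    accuracy = hits / total if total else 0.0  # assumption: empty corpus scores 0.0
    return {
        "prompts_hit": hits,
        "prompts_total": total,
        "selection_accuracy": accuracy,  # hits / total
        "target": target,
        "passed": accuracy >= target,
    }


def verdict(selection_passed: bool, quality_passed: Optional[bool]) -> Dict[str, str]:
    """Build the `verdict` block; quality_passed=None means quality was not collected."""
    selection = "pass" if selection_passed else "fail"
    if quality_passed is None:
        # partial = quality not_collected, per the schema comment.
        return {"selection": selection, "quality": "not_collected", "overall": "partial"}
    quality = "pass" if quality_passed else "fail"
    # Assumption: overall passes only when both axes pass.
    overall = "pass" if selection_passed and quality_passed else "fail"
    return {"selection": selection, "quality": quality, "overall": overall}
```

Keeping `not_collected` as a distinct outcome, instead of folding it into pass or fail, is what lets a report without agent outputs stay honest about what was measured.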

## Markdown shape

Headers in order:

1. `# Benchmark Report — <corpus_id> · <generated_at>`
2. `## Headline` — three-line summary (selection · cost · quality).
3. `## Selection accuracy` — table per prompt with hit/miss + expected/got.
4. `## Cost capture` — per-tier table + total; "unavailable" block if no
   session jsonl was found.
5. `## Quality probe` — per-prompt assertion pass/fail; `not_collected`
   block when no agent-output path was passed.
6. `## Notes` — pointer to `pricing.yaml`, `corpus path`, and the
   versioned filename for citation.

## Invariants

- **No silent drops.** Missing cost source → emit `source: unavailable`
  and `total_cost_usd: 0.0` with a marker; never omit the section.
- **Quality stub honesty.** When agent outputs are not provided, set
  `quality.source: not_collected` and `verdict.overall: partial`. Score
  stays `0.0`; never inflate by assuming pass.
- **Pricing dated.** Every cost row reads `sourced_on` from
  `bench/pricing.yaml`. Stale price (> 90 days) → warning line in the
  Markdown footer.
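
The staleness rule in the last invariant reduces to a date comparison. A minimal sketch, assuming `sourced_on` has already been parsed into a `date`; the function name `pricing_is_stale` and the `max_age_days` parameter are hypothetical.

```python
from datetime import date


def pricing_is_stale(sourced_on: date, today: date, max_age_days: int = 90) -> bool:
    """True when the price is strictly older than max_age_days (the "> 90 days" rule)."""
    return (today - sourced_on).days > max_age_days
```

A price sourced exactly 90 days ago is still considered fresh under this reading; the warning line would start on day 91.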

## Cross-references

- Runner — [`scripts/bench_run.py`](../../scripts/bench_run.py)
- Baseline collector — [`scripts/bench_runner.py`](../../scripts/bench_runner.py)
- Corpus contract — [`benchmark-corpus-spec.md`](benchmark-corpus-spec.md)
- Pricing source — [`bench/pricing.yaml`](../../bench/pricing.yaml)
- Cost session reader (live sessions) — [`scripts/cost/track.mjs`](../../scripts/cost/track.mjs)

@@ -297,4 +297,5 @@ A command that fails either floor drops to **Tier-1** at the next minor release;
 - [`docs/migrations/commands-1.15.0.md`](../migrations/commands-1.15.0.md) — user-facing migration notes.
 - [`docs/contracts/STABILITY.md`](STABILITY.md) — `beta` level rules apply.
 - [`docs/contracts/command-surface-tiers.md`](command-surface-tiers.md) — what each tier means and what `--help` surfaces.
+- [`docs/contracts/command-taxonomy.md`](command-taxonomy.md) — profile axis (discoverability) layered on top of this verb axis (invocation).
 - [`.agent-src.uncompressed/contexts/contracts/artifact-engagement-flow.md`](../../.agent-src.uncompressed/contexts/contracts/artifact-engagement-flow.md) — sibling telemetry surface; same privacy floor and four-layer enforcement model.