@event4u/agent-config 2.19.0 → 2.20.1
- package/.agent-src/commands/agent-status.md +29 -0
- package/.agent-src/commands/onboard.md +221 -81
- package/.agent-src/packs/README.md +49 -0
- package/.agent-src/packs/agency-delivery.yml +63 -0
- package/.agent-src/packs/content-engine.yml +53 -0
- package/.agent-src/packs/founder-mvp.yml +51 -0
- package/.agent-src/presets/README.md +26 -0
- package/.agent-src/presets/balanced.yml +34 -0
- package/.agent-src/presets/fast.yml +31 -0
- package/.agent-src/presets/strict.yml +38 -0
- package/.agent-src/profiles/README.md +29 -0
- package/.agent-src/profiles/agency.yml +27 -0
- package/.agent-src/profiles/content_creator.yml +25 -0
- package/.agent-src/profiles/developer.yml +26 -0
- package/.agent-src/profiles/finance.yml +24 -0
- package/.agent-src/profiles/founder.yml +25 -0
- package/.agent-src/profiles/ops.yml +25 -0
- package/.agent-src/rules/no-cheap-questions.md +25 -17
- package/.agent-src/skills/adr-create/SKILL.md +78 -68
- package/.agent-src/skills/subagent-orchestration/SKILL.md +33 -0
- package/.agent-src/templates/agents/agent-project-settings.example.yml +1 -1
- package/.agent-src/templates/skill-archive-note.md +101 -0
- package/.claude-plugin/marketplace.json +1 -1
- package/CHANGELOG.md +73 -70
- package/README.md +68 -72
- package/config/agent-settings.template.yml +22 -0
- package/docs/adrs/caveman/0001-default-off-until-bench.md +93 -0
- package/docs/adrs/caveman/README.md +9 -0
- package/docs/adrs/cost/0001-hard-stop-hook.md +114 -0
- package/docs/adrs/cost/README.md +9 -0
- package/docs/adrs/memory/0001-consumer-side-snapshot.md +111 -0
- package/docs/adrs/memory/README.md +9 -0
- package/docs/adrs/router/0001-three-tier-routing.md +119 -0
- package/docs/adrs/router/README.md +9 -0
- package/docs/adrs/schema/0001-json-schema-frontmatter.md +102 -0
- package/docs/adrs/schema/README.md +9 -0
- package/docs/adrs/smoke/0001-per-tier-smoke-scripts.md +99 -0
- package/docs/adrs/smoke/README.md +9 -0
- package/docs/architecture/current-onboard-baseline.md +126 -0
- package/docs/architecture/current-safety-behavior.md +137 -0
- package/docs/archive/CHANGELOG-pre-2.16.0.md +48 -0
- package/docs/archive/CHANGELOG-pre-2.17.0.md +63 -0
- package/docs/contracts/adr-layout.md +108 -0
- package/docs/contracts/benchmark-corpus-spec.md +97 -0
- package/docs/contracts/benchmark-report-schema.md +111 -0
- package/docs/contracts/command-clusters.md +1 -0
- package/docs/contracts/command-taxonomy.md +137 -0
- package/docs/contracts/compression-default-kill-criterion.md +69 -0
- package/docs/contracts/config-presets.md +144 -0
- package/docs/contracts/cost-dashboard.md +143 -0
- package/docs/contracts/cost-enforcement.md +134 -0
- package/docs/contracts/file-ownership-matrix.json +0 -7
- package/docs/contracts/mcp-tool-inventory.md +53 -0
- package/docs/contracts/measurement-baseline.md +102 -0
- package/docs/contracts/namespace.md +125 -0
- package/docs/contracts/profile-system.md +142 -0
- package/docs/contracts/safety-model.md +129 -0
- package/docs/contracts/smoke-contracts.md +144 -0
- package/docs/contracts/workflow-packs.md +121 -0
- package/docs/decisions/ADR-010-profile-pack-preset-boundary.md +132 -0
- package/docs/decisions/INDEX.md +1 -0
- package/docs/featured-commands.md +27 -0
- package/docs/parity/bench-ruflo.json +58 -0
- package/docs/parity/bench.json +41 -0
- package/docs/parity/ruflo.md +46 -0
- package/docs/profiles.md +91 -0
- package/package.json +1 -1
- package/scripts/_cli/cmd_explain.py +250 -0
- package/scripts/_lib/bench_cost.py +138 -0
- package/scripts/_lib/bench_quality.py +118 -0
- package/scripts/_lib/bench_report.py +150 -0
- package/scripts/agent-config +13 -0
- package/scripts/audit_adr_coverage.py +175 -0
- package/scripts/audit_mcp_tools.py +146 -0
- package/scripts/bench_baseline_ready.py +108 -0
- package/scripts/bench_drift_check.py +151 -0
- package/scripts/bench_per_tool.py +216 -0
- package/scripts/bench_run.py +155 -0
- package/scripts/config/__init__.py +9 -0
- package/scripts/config/presets.py +206 -0
- package/scripts/config/profiles.py +173 -0
- package/scripts/cost/budget.mjs +73 -12
- package/scripts/cost/preflight.mjs +89 -0
- package/scripts/lint_archived_skills.py +143 -0
- package/scripts/lint_bench_corpus.py +161 -0
- package/scripts/lint_namespace.py +135 -0
- package/scripts/lint_roadmap_complexity.py +3 -2
- package/scripts/skill_overlap.py +204 -0
- package/scripts/skill_usage_collect.py +191 -0
- package/scripts/skill_usage_report.py +162 -0
- package/scripts/smoke/kernel.sh +101 -0
- package/scripts/smoke/router.sh +129 -0
- package/scripts/smoke/schema.sh +71 -0
- package/scripts/smoke/skills.sh +101 -0
|
@@ -0,0 +1,143 @@
|
|
|
1
|
+
---
|
|
2
|
+
stability: beta
|
|
3
|
+
keep-beta-until: 2026-08-12
|
|
4
|
+
---
|
|
5
|
+
|
|
6
|
+
# Cost governance dashboard
|
|
7
|
+
|
|
8
|
+
> **Status:** beta — first draft 2026-05-16 (Phase 2 Item 10 of
|
|
9
|
+
> `step-15-product-refinement`).
|
|
10
|
+
>
|
|
11
|
+
> **Related:** [`config-presets`](config-presets.md) (caps schema) ·
|
|
12
|
+
> [`cost-profile-defaults`](cost-profile-defaults.md) (default
|
|
13
|
+
> selection) · `scripts/cost/budget.mjs` (existing local-store
|
|
14
|
+
> primitive) · `scripts/cost/track.mjs` (session ingest).
|
|
15
|
+
|
|
16
|
+
The `agent-config cost` subcommand surfaces accumulated spend against
|
|
17
|
+
the active preset's caps. Read-only, CLI-first, no UI. Wraps the
|
|
18
|
+
existing `scripts/cost/*.mjs` primitives behind a single discoverable
|
|
19
|
+
verb so a user can ask "where am I against my budget?" without knowing
|
|
20
|
+
the storage layout.
|
|
21
|
+
|
|
22
|
+
## Surface
|
|
23
|
+
|
|
24
|
+
```
|
|
25
|
+
agent-config cost # default: status (this period's spend)
|
|
26
|
+
agent-config cost status [--json] # spend vs caps for daily/weekly/monthly
|
|
27
|
+
agent-config cost ingest # pull latest session.jsonl → local store
|
|
28
|
+
agent-config cost history [--period=today|week|month] [--limit=N]
|
|
29
|
+
agent-config cost reset --confirm # truncate sessions.jsonl + budget.json
|
|
30
|
+
```
|
|
31
|
+
|
|
32
|
+
All subcommands are **read-only by default**. `ingest` writes only to
|
|
33
|
+
`agents/cost-tracking/sessions.jsonl`. `reset` is destructive and
|
|
34
|
+
gated by `--confirm` (Hard-Floor per
|
|
35
|
+
[`non-destructive-by-default`](../../.agent-src/rules/non-destructive-by-default.md)).
|
|
36
|
+
|
|
37
|
+
## `cost status` — output contract
|
|
38
|
+
|
|
39
|
+
Human format:
|
|
40
|
+
|
|
41
|
+
```
|
|
42
|
+
Cost (preset: balanced · profile: developer)
|
|
43
|
+
|
|
44
|
+
Period Spent Cap Remaining % Status
|
|
45
|
+
today $2.43 $10.00 $7.57 24% ✅
|
|
46
|
+
week $14.20 $40.00 $25.80 36% ✅
|
|
47
|
+
month $52.10 $150.00 $97.90 35% ✅
|
|
48
|
+
|
|
49
|
+
MCP calls: 12 today · 47 this week · 188 this month
|
|
50
|
+
Council calls: 1 today · 3 this week · 11 this month
|
|
51
|
+
|
|
52
|
+
Next threshold notification at 75% (week: $30.00).
|
|
53
|
+
```
|
|
54
|
+
|
|
55
|
+
`--json` output schema:
|
|
56
|
+
|
|
57
|
+
```json
|
|
58
|
+
{
|
|
59
|
+
"preset": "balanced",
|
|
60
|
+
"profile": "developer",
|
|
61
|
+
"periods": {
|
|
62
|
+
"today": {"spent_usd": 2.43, "cap_usd": 10.00, "remaining_usd": 7.57, "pct": 0.243, "status": "ok"},
|
|
63
|
+
"week": {"spent_usd": 14.20, "cap_usd": 40.00, "remaining_usd": 25.80, "pct": 0.355, "status": "ok"},
|
|
64
|
+
"month": {"spent_usd": 52.10, "cap_usd": 150.00, "remaining_usd": 97.90, "pct": 0.347, "status": "ok"}
|
|
65
|
+
},
|
|
66
|
+
"calls": {
|
|
67
|
+
"mcp": {"today": 12, "week": 47, "month": 188},
|
|
68
|
+
"council": {"today": 1, "week": 3, "month": 11}
|
|
69
|
+
},
|
|
70
|
+
"next_threshold": {"period": "week", "pct": 0.75, "trigger_usd": 30.00}
|
|
71
|
+
}
|
|
72
|
+
```
|
|
73
|
+
|
|
74
|
+
### Status field
|
|
75
|
+
|
|
76
|
+
| Value | Trigger | Exit code |
|
|
77
|
+
|---|---|---|
|
|
78
|
+
| `ok` | `pct < 0.75` | 0 |
|
|
79
|
+
| `warn` | `0.75 ≤ pct < 1.0` | 0 |
|
|
80
|
+
| `over` | `pct ≥ 1.0` | 1 |
|
|
81
|
+
|
|
82
|
+
Overall exit = worst-of across the three periods. `--json` always
|
|
83
|
+
emits the full object regardless of exit.
|
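The status table and the worst-of exit rule can be sketched as follows. This is a minimal illustration, not the shipped implementation: `period_status`, `EXIT_CODE`, and `overall_exit` are hypothetical names, and `pct` is assumed to be the fraction `spent_usd / cap_usd` as in the `--json` sample.

```python
from typing import Iterable

# Exit code per status value, copied from the "Status field" table.
EXIT_CODE = {"ok": 0, "warn": 0, "over": 1}


def period_status(pct: float) -> str:
    """Map a utilization fraction (spent / cap) to ok / warn / over."""
    if pct >= 1.0:
        return "over"
    if pct >= 0.75:
        return "warn"
    return "ok"


def overall_exit(pcts: Iterable[float]) -> int:
    """Worst-of across periods: the highest per-period exit code wins."""
    return max((EXIT_CODE[period_status(p)] for p in pcts), default=0)
```

With the sample figures (`0.243`, `0.355`, `0.347`) every period is `ok`, so the overall exit is `0`; a single period at or above `1.0` is enough to make it `1`.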
|
84
|
+
|
|
85
|
+
## Data sources
|
|
86
|
+
|
|
87
|
+
| Field | Source |
|
|
88
|
+
|---|---|
|
|
89
|
+
| `preset` | Active preset id from [`config-presets`](config-presets.md) resolution chain. |
|
|
90
|
+
| `cap_usd` | `preset.cost.{daily,weekly,monthly}_max_usd`. |
|
|
91
|
+
| `spent_usd` | Sum of `cost_usd` field over `agents/cost-tracking/sessions.jsonl` records inside the period window. |
|
|
92
|
+
| `calls.mcp.*` | Sum of `mcp_calls` field in the same records. |
|
|
93
|
+
| `calls.council.*` | Count of records whose `kind` is `council`. |
|
|
94
|
+
| `next_threshold` | Smallest `(period, pct ∈ preset.notifications.threshold_pct)` tuple where `spent_usd < pct × cap_usd`. |
|
|
95
|
+
|
|
96
|
+
When the active preset declares no `cost.*` cap (legacy installs),
|
|
97
|
+
`cap_usd` is reported as `null` and `status` is `ok`. The tool does
|
|
98
|
+
**not** invent a default cap.
|
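The `next_threshold` row does not say how "smallest" is ordered. One reading that reproduces the sample output (`week`, 75 %, $30.00) is to pick, per period, the lowest threshold not yet reached and then choose the period whose utilization is closest to its threshold. The sketch below assumes that reading; `next_threshold` as a function name and the tie-breaking are hypothetical, and periods with a `null` cap are skipped because no cap is invented.

```python
from typing import Optional


def next_threshold(periods: dict, threshold_pcts: list) -> Optional[dict]:
    """Return the upcoming (period, pct) notification, or None if there is none.

    Assumption: ordering is by the gap between current utilization and
    the next unreached threshold, smallest gap first.
    """
    candidates = []
    for name, p in periods.items():
        cap = p.get("cap_usd")
        if not cap:  # null (or zero) cap: nothing to notify against
            continue
        used = p["spent_usd"] / cap
        for pct in sorted(threshold_pcts):
            if p["spent_usd"] < pct * cap:
                candidates.append((pct - used, name, pct, round(pct * cap, 2)))
                break  # only the lowest unreached threshold per period
    if not candidates:
        return None
    _, name, pct, trigger = min(candidates)
    return {"period": name, "pct": pct, "trigger_usd": trigger}
```

Whether the real tool orders candidates this way is not stated in the contract; treat the ordering as a guess that happens to match the example.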
|
99
|
+
|
|
100
|
+
## Enforcement vs surfacing
|
|
101
|
+
|
|
102
|
+
`agent-config cost` is **read-only**. Enforcement (refuse a council
|
|
103
|
+
or MCP call that would push spend over a cap) lives at the call site
|
|
104
|
+
per the active preset's `cost.enforce` setting (`off`, `advisory`,
|
|
105
|
+
`hybrid`, `hard`). This contract does not change enforcement; it only
|
|
106
|
+
makes the existing local-store data discoverable.
|
|
107
|
+
|
|
108
|
+
## Refresh model
|
|
109
|
+
|
|
110
|
+
`sessions.jsonl` is appended to by the Claude Code session hooks
|
|
111
|
+
(see `scripts/cost/track.mjs`). `cost status` reads what's there;
|
|
112
|
+
`cost ingest` triggers a one-shot pull from `~/.claude/projects/`.
|
|
113
|
+
Users running a non-Claude-Code agent surface call `cost ingest`
|
|
114
|
+
manually after a session; users on Claude Code with hooks installed
|
|
115
|
+
never need to.
|
|
116
|
+
|
|
117
|
+
## Validation
|
|
118
|
+
|
|
119
|
+
`scripts/lint_cost_dashboard.py` (Phase 2 deliverable — not yet
|
|
120
|
+
shipped) fails CI on:
|
|
121
|
+
|
|
122
|
+
- Schema drift in `sessions.jsonl` (missing required fields).
|
|
123
|
+
- Preset declaring `cost.*` caps that disagree with this contract's
|
|
124
|
+
expected period grid.
|
|
125
|
+
- `cost status --json` output diverging from the schema above.
|
|
126
|
+
|
|
127
|
+
## What this contract does **not** do
|
|
128
|
+
|
|
129
|
+
- **Does not** ship a UI. CLI-first, by design.
|
|
130
|
+
- **Does not** introduce per-skill or per-command cost attribution
|
|
131
|
+
beyond `kind` (`council` vs other). Per-skill attribution is a
|
|
132
|
+
Phase 3 candidate.
|
|
133
|
+
- **Does not** override per-call hard caps from the preset.
|
|
134
|
+
- **Does not** roll up across multiple projects. Each project's
|
|
135
|
+
`agents/cost-tracking/` is its own scope.
|
|
136
|
+
|
|
137
|
+
## See also
|
|
138
|
+
|
|
139
|
+
- [`config-presets`](config-presets.md) — preset caps + `enforce` semantics
|
|
140
|
+
- [`cost-profile-defaults`](cost-profile-defaults.md) — default preset selection
|
|
141
|
+
- [`safety-model`](safety-model.md) — `mcp_call_costly` domain
|
|
142
|
+
- `scripts/cost/budget.mjs`, `scripts/cost/track.mjs` — wrapped primitives
|
|
143
|
+
- `step-15-product-refinement` § Phase 2 Item 10
|
|
@@ -0,0 +1,134 @@
|
|
|
1
|
+
---
|
|
2
|
+
stability: stable
|
|
3
|
+
---
|
|
4
|
+
|
|
5
|
+
# Cost Enforcement Contract
|
|
6
|
+
|
|
7
|
+
> Status: stable · Owner: `step-11-measurement-governance-parity` · Last reviewed: 2026-05-16
|
|
8
|
+
|
|
9
|
+
How USD budgets read from `.agent-settings.yml` interact with the
|
|
10
|
+
session-cost ledger (`agents/cost-tracking/sessions.jsonl`) and the
|
|
11
|
+
budget evaluator (`scripts/cost/budget.mjs`).
|
|
12
|
+
|
|
13
|
+
## Surface
|
|
14
|
+
|
|
15
|
+
Two files. Settings file declares the budget; ledger file accumulates
|
|
16
|
+
spend. The evaluator joins them and emits a tier.
|
|
17
|
+
|
|
18
|
+
| File | Role |
|
|
19
|
+
|---|---|
|
|
20
|
+
| `.agent-settings.yml § cost` | Declarative: budgets per period + enforcement mode. |
|
|
21
|
+
| `agents/cost-tracking/sessions.jsonl` | Append-only: per-session cost records (model, tokens, USD). |
|
|
22
|
+
| `scripts/cost/budget.mjs` | Evaluator: joins both, emits `{ level, utilization_pct, enforcement, source }`. |
|
|
23
|
+
| `scripts/cost/preflight.mjs` | Hard-stop hook: wraps `budget.mjs check` and exits non-zero at HARD_STOP when `enforcement: hard-stop`. |
|
|
24
|
+
|
|
25
|
+
## Settings schema
|
|
26
|
+
|
|
27
|
+
```yaml
|
|
28
|
+
cost:
|
|
29
|
+
budgets:
|
|
30
|
+
daily: 0 # USD ceiling for rolling 24h. 0 = unbudgeted.
|
|
31
|
+
weekly: 0 # USD ceiling for rolling 7d. 0 = unbudgeted.
|
|
32
|
+
monthly: 0 # USD ceiling for rolling 30d. 0 = unbudgeted.
|
|
33
|
+
enforcement: advisory # advisory | hard-stop
|
|
34
|
+
```
|
|
35
|
+
|
|
36
|
+
- `0` (or absent) on any period = that period is not enforced. The
|
|
37
|
+
evaluator falls back to a longer-period budget when checking shorter
|
|
38
|
+
periods, never the other way around.
|
|
39
|
+
- `enforcement: advisory` is the default. Dashboards surface the
|
|
40
|
+
breach; the agent keeps working.
|
|
41
|
+
- `enforcement: hard-stop` is opt-in. `scripts/cost/preflight.mjs`
|
|
42
|
+
exits non-zero at the HARD_STOP tier; wrapping shells / CI / `task`
|
|
43
|
+
bindings must check this before composing a turn.
|
|
44
|
+
|
|
45
|
+
## Tier ladder (5-stage)
|
|
46
|
+
|
|
47
|
+
| Utilization | Level | Emoji | Threshold-pct |
|
|
48
|
+
|---:|---|:---:|---:|
|
|
49
|
+
| `< 50 %` | `OK` | 🟢 | 0 |
|
|
50
|
+
| `50–74 %` | `INFO` | 🟡 | 50 |
|
|
51
|
+
| `75–89 %` | `WARNING` | 🟠 | 75 |
|
|
52
|
+
| `90–99 %` | `CRITICAL` | 🔴 | 90 |
|
|
53
|
+
| `≥ 100 %` | `HARD_STOP` | 🛑 | 100 |
|
|
54
|
+
|
|
55
|
+
The legacy 4-stage draft (`under / 50 / 75 / 90 / 100`) folded `OK`
|
|
56
|
+
into `under`. Parity-doc Phase 6 maps both forms verbatim.
|
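The 5-stage ladder above maps a utilization percentage to a level. A minimal sketch, assuming the thresholds in the last column are inclusive lower bounds and that fractional values (for example 74.5 %) fall into the band whose threshold they have reached; `budget_level` is a hypothetical name, not a function from `budget.mjs`.

```python
# (threshold_pct, level) pairs from the tier table, highest first.
TIERS = [
    (100, "HARD_STOP"),
    (90, "CRITICAL"),
    (75, "WARNING"),
    (50, "INFO"),
    (0, "OK"),
]


def budget_level(utilization_pct: float) -> str:
    """Return the first tier whose threshold the utilization has reached."""
    for threshold, level in TIERS:
        if utilization_pct >= threshold:
            return level
    return "OK"  # negative input; treated as no spend
```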
|
57
|
+
|
|
58
|
+
## Hook surface
|
|
59
|
+
|
|
60
|
+
`scripts/cost/preflight.mjs` is the **single** turn-start surface.
|
|
61
|
+
It wraps `budget.mjs check` and:
|
|
62
|
+
|
|
63
|
+
1. Reads `cost.enforcement` from `.agent-settings.yml`.
|
|
64
|
+
2. If `advisory` → always exits `0`, prints the tier as advisory text.
|
|
65
|
+
3. If `hard-stop` and level is `HARD_STOP` → prints a refusal block
|
|
66
|
+
citing this contract and exits `1`.
|
|
67
|
+
4. If no budget is configured at all → exits `0` (fail-open). Never
|
|
68
|
+
blocks unbudgeted work.
|
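The four steps above reduce to a small decision function. The sketch below is an illustration under assumptions: `preflight_exit_code` is a hypothetical name, `level=None` stands for "no budget configured", and `hard-stop` with a level below `HARD_STOP` is assumed to exit `0`, which the steps imply but do not state.

```python
from typing import Optional


def preflight_exit_code(enforcement: str, level: Optional[str]) -> int:
    """Exit code for the turn-start gate.

    enforcement: "advisory" or "hard-stop" (cost.enforcement).
    level: tier from the evaluator, or None when no budget is configured.
    """
    if level is None:
        return 0  # fail-open: unbudgeted work is never blocked
    if enforcement == "hard-stop" and level == "HARD_STOP":
        return 1  # refusal block is printed by the real hook
    return 0  # advisory mode, or hard-stop below the HARD_STOP tier
```

A wrapping shell or CI step would only need to check this exit code before composing a turn.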
|
69
|
+
|
|
70
|
+
The hook does **not** rewrite or block individual tool calls. It is a
|
|
71
|
+
process-entry gate, intended to be invoked by:
|
|
72
|
+
|
|
73
|
+
- `task ci`, `task work:*`, `task roadmap:*` wrappers.
|
|
74
|
+
- The `/onboard` boot path (`scripts/install.py`-side guidance only).
|
|
75
|
+
- Manual `node scripts/cost/preflight.mjs` for shell wrappers.
|
|
76
|
+
|
|
77
|
+
## Bypass
|
|
78
|
+
|
|
79
|
+
User-facing bypass mechanisms (documented for the refusal block):
|
|
80
|
+
|
|
81
|
+
- Raise the budget: edit `.agent-settings.yml § cost.budgets.<period>`.
|
|
82
|
+
- Reset the ledger (drops historical spend from the calculation):
|
|
83
|
+
`node scripts/cost/track.mjs reset --confirm`.
|
|
84
|
+
- Disable enforcement: set `cost.enforcement: advisory`.
|
|
85
|
+
|
|
86
|
+
No environment-variable override. Bypass must be an explicit edit so
|
|
87
|
+
the change is durable and auditable.
|
|
88
|
+
|
|
89
|
+
## Default behaviour without a budget
|
|
90
|
+
|
|
91
|
+
When `cost.budgets.{daily,weekly,monthly}` are all `0`:
|
|
92
|
+
|
|
93
|
+
- `budget.mjs check` reports cumulative spend, no tier (returns the
|
|
94
|
+
no-budget JSON shape).
|
|
95
|
+
- `preflight.mjs` exits `0`. Never blocks.
|
|
96
|
+
- `agent-status` panel shows **only** the measured-spend USD figure;
|
|
97
|
+
the tier table is suppressed.
|
|
98
|
+
|
|
99
|
+
## Source precedence
|
|
100
|
+
|
|
101
|
+
`budget.mjs` reads budget config in this order:
|
|
102
|
+
|
|
103
|
+
1. `.agent-settings.yml § cost` (when any value > 0).
|
|
104
|
+
2. `agents/cost-tracking/budget.json` (legacy single-period JSON).
|
|
105
|
+
3. None → no-budget output shape.
|
|
106
|
+
|
|
107
|
+
The evaluator output carries `source: 'agent-settings.yml' | 'budget.json'`
|
|
108
|
+
so dashboards can show where the figure came from.
|
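The precedence order can be sketched as a lookup that also returns the `source` tag. This is a hedged illustration: `resolve_budget_source` is a hypothetical name, the settings budgets are assumed to be a plain mapping of period to USD, and the legacy `budget.json` content is treated as an opaque dictionary because its shape is not described here.

```python
from typing import Optional, Tuple


def resolve_budget_source(
    settings_budgets: Optional[dict],
    legacy_budget: Optional[dict],
) -> Tuple[Optional[str], Optional[dict]]:
    """Return (source, budget config) following the documented order."""
    # 1. .agent-settings.yml wins as soon as any period value is > 0.
    if settings_budgets and any((v or 0) > 0 for v in settings_budgets.values()):
        return "agent-settings.yml", settings_budgets
    # 2. Otherwise fall back to the legacy single-period JSON, if present.
    if legacy_budget:
        return "budget.json", legacy_budget
    # 3. Nothing configured: caller emits the no-budget output shape.
    return None, None
```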
|
109
|
+
|
|
110
|
+
## Period mapping
|
|
111
|
+
|
|
112
|
+
`BUDGET_PERIOD={today|week|month|all}` selects which budget value
|
|
113
|
+
applies:
|
|
114
|
+
|
|
115
|
+
| `BUDGET_PERIOD` | Settings key |
|
|
116
|
+
|---|---|
|
|
117
|
+
| `today` | `cost.budgets.daily` |
|
|
118
|
+
| `week` | `cost.budgets.weekly` |
|
|
119
|
+
| `month` | `cost.budgets.monthly` |
|
|
120
|
+
| `all` (default) | First non-zero of `monthly → weekly → daily`. |
|
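The period mapping table translates directly into a lookup. A minimal sketch, assuming budgets arrive as a mapping with `daily`, `weekly`, and `monthly` keys where `0` or a missing key means unbudgeted; `select_budget` is a hypothetical name, and the longer-period fallback described under "Settings schema" is deliberately not modelled here.

```python
# BUDGET_PERIOD value -> key under cost.budgets, per the table above.
PERIOD_KEYS = {"today": "daily", "week": "weekly", "month": "monthly"}


def select_budget(budgets: dict, period: str = "all") -> float:
    """Return the USD ceiling for a period; 0 means unbudgeted."""
    if period == "all":
        # First non-zero of monthly -> weekly -> daily.
        for key in ("monthly", "weekly", "daily"):
            value = budgets.get(key) or 0
            if value > 0:
                return value
        return 0
    return budgets.get(PERIOD_KEYS[period]) or 0
```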
|
121
|
+
|
|
122
|
+
## Acceptance fixtures
|
|
123
|
+
|
|
124
|
+
`tests/fixtures/cost/budget/` carries five reference fixtures:
|
|
125
|
+
`under-50`, `mid-75`, `high-90`, `at-100`, `over-100`. Each fixture
|
|
126
|
+
ships a `sessions.jsonl` slice + an expected JSON output. The fixture
|
|
127
|
+
suite is wired to `task test-cost-budget` per `step-11` Phase 2 Step 5.
|
|
128
|
+
|
|
129
|
+
## See also
|
|
130
|
+
|
|
131
|
+
- `step-11-ruflo-parity` — Measurement & Governance Parity roadmap.
|
|
132
|
+
- `docs/contracts/cost-dashboard.md` — companion dashboard contract.
|
|
133
|
+
- `scripts/cost/budget.mjs` — evaluator implementation.
|
|
134
|
+
- `bench/pricing.yaml` — per-model USD pricing table.
|
|
@@ -6928,13 +6928,6 @@
|
|
|
6928
6928
|
"via": "body_link",
|
|
6929
6929
|
"depth": 1
|
|
6930
6930
|
},
|
|
6931
|
-
{
|
|
6932
|
-
"source": ".agent-src.uncompressed/rules/no-cheap-questions.md",
|
|
6933
|
-
"target": ".agent-src.uncompressed/contexts/contracts/frugality-charter.md",
|
|
6934
|
-
"type": "READ_ONLY",
|
|
6935
|
-
"via": "body_link",
|
|
6936
|
-
"depth": 1
|
|
6937
|
-
},
|
|
6938
6931
|
{
|
|
6939
6932
|
"source": ".agent-src.uncompressed/rules/no-cheap-questions.md",
|
|
6940
6933
|
"target": ".agent-src.uncompressed/rules/ask-when-uncertain.md",
|
|
@@ -0,0 +1,53 @@
|
|
|
1
|
+
---
|
|
2
|
+
stability: beta
|
|
3
|
+
keep-beta-until: 2026-08-14
|
|
4
|
+
---
|
|
5
|
+
|
|
6
|
+
# MCP tool inventory
|
|
7
|
+
|
|
8
|
+
> Generated by [`scripts/audit_mcp_tools.py`](../../scripts/audit_mcp_tools.py)
|
|
9
|
+
> from the source-of-truth catalog
|
|
10
|
+
> [`scripts/mcp_server/consumer_tool_catalog.json`](../../scripts/mcp_server/consumer_tool_catalog.json).
|
|
11
|
+
> Do **not** hand-edit; rerun `python3 scripts/audit_mcp_tools.py --write`.
|
|
12
|
+
>
|
|
13
|
+
> Step-11 Phase 5 Step 3 (`step-11-ruflo-parity.md`).
|
|
14
|
+
|
|
15
|
+
## Summary
|
|
16
|
+
|
|
17
|
+
- **Total tools:** 20
|
|
18
|
+
- **By transport:** stdio=9
|
|
19
|
+
- **By side-effect:** fs-write=5, ro=12, shell=3
|
|
20
|
+
- **Discovery-only stubs (no implementation):** 11
|
|
21
|
+
|
|
22
|
+
## Tools
|
|
23
|
+
|
|
24
|
+
| Tool | Side-effect | Transports | Catalog | Handler |
|
|
25
|
+
|---|---|---|---|---|
|
|
26
|
+
| `lint_skills` | `ro` | stdio | [`consumer_tool_catalog.json:7`](../../scripts/mcp_server/consumer_tool_catalog.json#L7) | [`tools.py:510`](../../scripts/mcp_server/tools.py#L510) |
|
|
27
|
+
| `chat_history_append` | `fs-write` | stdio | [`consumer_tool_catalog.json:24`](../../scripts/mcp_server/consumer_tool_catalog.json#L24) | [`tools.py:535`](../../scripts/mcp_server/tools.py#L535) |
|
|
28
|
+
| `chat_history_read` | `ro` | stdio | [`consumer_tool_catalog.json:43`](../../scripts/mcp_server/consumer_tool_catalog.json#L43) | [`tools.py:571`](../../scripts/mcp_server/tools.py#L571) |
|
|
29
|
+
| `memory_lookup` | `ro` | stdio | [`consumer_tool_catalog.json:59`](../../scripts/mcp_server/consumer_tool_catalog.json#L59) | [`tools.py:590`](../../scripts/mcp_server/tools.py#L590) |
|
|
30
|
+
| `memory_signal` | `fs-write` | _(stub)_ | [`consumer_tool_catalog.json:75`](../../scripts/mcp_server/consumer_tool_catalog.json#L75) | _stub-only_ |
|
|
31
|
+
| `memory_status` | `ro` | stdio | [`consumer_tool_catalog.json:91`](../../scripts/mcp_server/consumer_tool_catalog.json#L91) | [`tools.py:617`](../../scripts/mcp_server/tools.py#L617) |
|
|
32
|
+
| `skill_trigger_eval` | `ro` | _(stub)_ | [`consumer_tool_catalog.json:98`](../../scripts/mcp_server/consumer_tool_catalog.json#L98) | _stub-only_ |
|
|
33
|
+
| `suggest_command` | `ro` | _(stub)_ | [`consumer_tool_catalog.json:114`](../../scripts/mcp_server/consumer_tool_catalog.json#L114) | _stub-only_ |
|
|
34
|
+
| `suggest_skill_for_task` | `ro` | _(stub)_ | [`consumer_tool_catalog.json:129`](../../scripts/mcp_server/consumer_tool_catalog.json#L129) | _stub-only_ |
|
|
35
|
+
| `mine_session` | `ro` | _(stub)_ | [`consumer_tool_catalog.json:144`](../../scripts/mcp_server/consumer_tool_catalog.json#L144) | _stub-only_ |
|
|
36
|
+
| `update_form_request_messages` | `fs-write` | _(stub)_ | [`consumer_tool_catalog.json:158`](../../scripts/mcp_server/consumer_tool_catalog.json#L158) | _stub-only_ |
|
|
37
|
+
| `sync_gitignore` | `fs-write` | _(stub)_ | [`consumer_tool_catalog.json:173`](../../scripts/mcp_server/consumer_tool_catalog.json#L173) | _stub-only_ |
|
|
38
|
+
| `sync_agent_settings` | `fs-write` | _(stub)_ | [`consumer_tool_catalog.json:186`](../../scripts/mcp_server/consumer_tool_catalog.json#L186) | _stub-only_ |
|
|
39
|
+
| `run_tests` | `shell` | _(stub)_ | [`consumer_tool_catalog.json:200`](../../scripts/mcp_server/consumer_tool_catalog.json#L200) | _stub-only_ |
|
|
40
|
+
| `run_quality_checks` | `shell` | _(stub)_ | [`consumer_tool_catalog.json:214`](../../scripts/mcp_server/consumer_tool_catalog.json#L214) | _stub-only_ |
|
|
41
|
+
| `list_skills` | `ro` | stdio | [`consumer_tool_catalog.json:227`](../../scripts/mcp_server/consumer_tool_catalog.json#L227) | [`tools.py:631`](../../scripts/mcp_server/tools.py#L631) |
|
|
42
|
+
| `list_commands` | `ro` | stdio | [`consumer_tool_catalog.json:234`](../../scripts/mcp_server/consumer_tool_catalog.json#L234) | [`tools.py:644`](../../scripts/mcp_server/tools.py#L644) |
|
|
43
|
+
| `list_rules` | `ro` | stdio | [`consumer_tool_catalog.json:241`](../../scripts/mcp_server/consumer_tool_catalog.json#L241) | [`tools.py:657`](../../scripts/mcp_server/tools.py#L657) |
|
|
44
|
+
| `compile_router` | `shell` | _(stub)_ | [`consumer_tool_catalog.json:248`](../../scripts/mcp_server/consumer_tool_catalog.json#L248) | _stub-only_ |
|
|
45
|
+
| `read_resource_body` | `ro` | stdio | [`consumer_tool_catalog.json:261`](../../scripts/mcp_server/consumer_tool_catalog.json#L261) | [`tools.py:670`](../../scripts/mcp_server/tools.py#L670) |
|
|
46
|
+
|
|
47
|
+
## Glossary
|
|
48
|
+
|
|
49
|
+
- **Side-effect** — `ro` (read-only) · `fs-write` (filesystem write) · `shell` (spawns processes).
|
|
50
|
+
- **Transports** — `stdio` (`scripts/mcp_server/`) · `worker` (`workers/mcp/`). A tool may live on both.
|
|
51
|
+
- **Stub** — catalog-listed for discovery; returns the `not_implemented` envelope from
|
|
52
|
+
[`mcp-tool-stub-envelope.md`](mcp-tool-stub-envelope.md) until promoted.
|
|
53
|
+
|
|
@@ -0,0 +1,102 @@
|
|
|
1
|
+
---
|
|
2
|
+
stability: stable
|
|
3
|
+
---
|
|
4
|
+
|
|
5
|
+
# Measurement baseline — contract
|
|
6
|
+
|
|
7
|
+
> **Status:** locked 2026-05-16 · **Owner:** `step-4-measurement-and-benchmark.md`
|
|
8
|
+
> · **Cited by:** every P2 enforcement roadmap (skill rationalization G0, north-star G1, compression default decision).
|
|
9
|
+
|
|
10
|
+
Single source of truth for what `task bench` measures, what counts as
|
|
11
|
+
drift, and what unblocks enforcement. Read this before pinning a number
|
|
12
|
+
to a roadmap or PR description.
|
|
13
|
+
|
|
14
|
+
## What `task bench` measures
|
|
15
|
+
|
|
16
|
+
Four axes, all numeric, all reproducible from the same input:
|
|
17
|
+
|
|
18
|
+
| Axis | Source | Definition | Units |
|
|
19
|
+
|---|---|---|---:|
|
|
20
|
+
| **selection accuracy** | [`scripts/bench_runner.py`](../../scripts/bench_runner.py) | Keyword-overlap ranker hits the expected skill in top-K | % |
|
|
21
|
+
| **cost** | [`scripts/cost/track.mjs`](../../scripts/cost/track.mjs) session jsonl | Token+USD per model, captured live | USD |
|
|
22
|
+
| **quality** | regex / rubric assertions per prompt | `quality_assertion` matches in agent output | % |
|
|
23
|
+
| **projection fidelity** | [`scripts/bench_per_tool.py`](../../scripts/bench_per_tool.py) | `accuracy(tool) / accuracy(augment)` for skill-projecting tools | ratio |
|
|
24
|
+
|
|
25
|
+
Schemas: [`benchmark-report-schema.md`](benchmark-report-schema.md) ·
|
|
26
|
+
[`benchmark-corpus-spec.md`](benchmark-corpus-spec.md). Reports land at
|
|
27
|
+
`bench/reports/<utc-stamp>-<corpus>[-projection].{json,md}` —
|
|
28
|
+
timestamped, never overwritten, content-addressed by run.
|
|
29
|
+
|
|
30
|
+
## Corpora — frozen for the soak window
|
|
31
|
+
|
|
32
|
+
| Corpus | Path | Prompts | Purpose |
|
|
33
|
+
|---|---|---:|---|
|
|
34
|
+
| `dev` | [`tests/eval/corpus-dev.yaml`](../../tests/eval/corpus-dev.yaml) | 10 | Developer task surface (Laravel/Symfony/React/CI/PR) |
|
|
35
|
+
| `non-dev` | [`tests/eval/corpus-non-dev.yaml`](../../tests/eval/corpus-non-dev.yaml) | 16 | Founder / agency / content creator surface (Wing-4) |
|
|
36
|
+
|
|
37
|
+
Total 26 prompts ≥ Acceptance Criteria floor of 25. Mid-window edits
|
|
38
|
+
to either YAML restart the 60-day clock per
|
|
39
|
+
[`compression-default-kill-criterion.md`](compression-default-kill-criterion.md) § 2.
|
|
40
|
+
|
|
41
|
+
## What counts as drift
|
|
42
|
+
|
|
43
|
+
[`scripts/bench_drift_check.py`](../../scripts/bench_drift_check.py)
|
|
44
|
+
compares the latest report against a sliding window of the prior N runs
|
|
45
|
+
(default 5) for the same corpus.
|
|
46
|
+
|
|
47
|
+
| Axis | Threshold | Note |
|
|
48
|
+
|---|---|---|
|
|
49
|
+
| selection accuracy | latest − baseline_mean ≤ −5 pp | always evaluated |
|
|
50
|
+
| cost | (latest − baseline_mean) / baseline_mean ≥ +20 % | only when both sides have `source: captured` |
|
|
51
|
+
| quality | latest − baseline_mean ≤ −10 pp | skipped when latest is `not_collected` |
|
|
52
|
+
| projection fidelity | tool fidelity < 0.85 | exit 1 from `task bench:projection` |
|
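The first three thresholds in the table can be sketched as a single check. This is an illustration, not the logic of `bench_drift_check.py`: the field names (`accuracy_pct`, `cost_usd`, `cost_source`, `quality_pct`) are hypothetical, `quality_pct=None` stands in for `not_collected`, and the cost rule is read as a relative increase of at least 20 %, that is a ratio of 1.20 or more.

```python
def drift_flags(latest: dict, baseline_mean: dict) -> list:
    """Return the axes that breach their drift threshold."""
    flags = []
    # Selection accuracy: always evaluated, drop of 5 pp or more.
    if latest["accuracy_pct"] - baseline_mean["accuracy_pct"] <= -5:
        flags.append("selection_accuracy")
    # Cost: only when both sides were captured live.
    both_captured = (
        latest.get("cost_source") == "captured"
        and baseline_mean.get("cost_source") == "captured"
    )
    if both_captured and baseline_mean["cost_usd"] > 0:
        if latest["cost_usd"] / baseline_mean["cost_usd"] >= 1.20:
            flags.append("cost")
    # Quality: skipped when the latest run collected nothing.
    if latest.get("quality_pct") is not None:
        if latest["quality_pct"] - baseline_mean["quality_pct"] <= -10:
            flags.append("quality")
    return flags


def drift_exit_code(flags: list) -> int:
    """Drift exits with code 2; no drift exits 0."""
    return 2 if flags else 0
```

Projection fidelity is left out because it is reported by a separate task with its own exit code.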
|
53
|
+
|
|
54
|
+
Drift exits with code 2 from `task bench:drift`. **CI posture during
|
|
55
|
+
soak:** all bench-drift steps run with `continue-on-error: true` and post a
|
|
56
|
+
sticky PR comment — informational only, not a merge gate. Flip to
|
|
57
|
+
required check happens via a separate PR once
|
|
58
|
+
`task bench:baseline-ready` returns 0 (see below).
|
|
59
|
+
|
|
60
|
+
## What unblocks enforcement (the G1 gate)
|
|
61
|
+
|
|
62
|
+
```
|
|
63
|
+
TASK bench:baseline-ready EXIT 0 IS THE ONLY AUTHORITY.
|
|
64
|
+
NO ANECDOTE, NO INDIVIDUAL REPORT, NO ROADMAP-SIDE OVERRIDE.
|
|
65
|
+
```
|
|
66
|
+
|
|
67
|
+
[`scripts/bench_baseline_ready.py`](../../scripts/bench_baseline_ready.py)
|
|
68
|
+
returns exit 0 iff both:
|
|
69
|
+
|
|
70
|
+
1. **Wall-clock soak:** `today − bench/baseline-start.txt ≥ --min-days` (default 60)
|
|
71
|
+
2. **Report density:** `bench/reports/*-<corpus>.json` count ≥ `--min-reports` (default 30)
|
|
72
|
+
|
|
73
|
+
Soak start anchored at [`bench/baseline-start.txt`](../../bench/baseline-start.txt)
|
|
74
|
+
= **2026-05-16**. Earliest possible flip: **2026-07-15**, contingent
|
|
75
|
+
on the 30-report floor.
|
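The two-condition gate and the earliest flip date can be checked with a few lines of date arithmetic. A minimal sketch, assuming the soak start and report count are already known; `baseline_ready` is a hypothetical helper, not the interface of `bench_baseline_ready.py`.

```python
from datetime import date


def baseline_ready(
    start: date,
    today: date,
    report_count: int,
    min_days: int = 60,
    min_reports: int = 30,
) -> bool:
    """True only when both the wall-clock soak and the report floor are met."""
    soaked = (today - start).days >= min_days
    dense = report_count >= min_reports
    return soaked and dense
```

With a start of 2026-05-16, the 60-day condition first holds on 2026-07-15, and only if at least 30 reports exist by then.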
|
76
|
+
|
|
77
|
+
Downstream consumers:
|
|
78
|
+
|
|
79
|
+
- `step-99-north-star-restructure.md` § Acceptance G1 (reads this exit code).
|
|
80
|
+
- [`compression-default-kill-criterion.md` § 3](compression-default-kill-criterion.md) — reads the decision table after baseline closes.
|
|
81
|
+
- `step-2-skill-inventory-rationalization.md` § G0 (usage-data soak floor).
|
|
82
|
+
|
|
83
|
+
## What the closeout writes
|
|
84
|
+
|
|
85
|
+
On baseline closure, the step-4 closeout writes the numeric verdict to
|
|
86
|
+
[`docs/parity/bench.json`](../parity/bench.json) — frozen snapshot with
|
|
87
|
+
the 30+ reports averaged, drift verdict, and the compression-default
|
|
88
|
+
decision per the kill-criterion table. That file is the artefact every
|
|
89
|
+
P2 roadmap reads — not the live `bench/reports/` directory.
|
|
90
|
+
|
|
91
|
+
## Carve-outs
|
|
92
|
+
|
|
93
|
+
- **Pricing freshness:** [`bench/pricing.yaml`](../../bench/pricing.yaml) rows must carry `sourced_on: YYYY-MM-DD`. Stale prices = stale numbers = no trust (ruflo "measured-vs-claimed" pattern).
|
|
94
|
+
- **Subjective grading excluded:** quality scoring is mechanical via `quality_assertion`. No vibes.
|
|
95
|
+
- **Cursor / Cline / Windsurf:** rules-only surfaces, no SKILL.md projection. `bench:projection` reports them as `not_applicable` — the gap is acknowledged, not silently dropped.
|
|
96
|
+
|
|
97
|
+
## Cross-references
|
|
98
|
+
|
|
99
|
+
- [`benchmark-report-schema.md`](benchmark-report-schema.md) · per-report JSON schema
|
|
100
|
+
- [`benchmark-corpus-spec.md`](benchmark-corpus-spec.md) · corpus YAML schema
|
|
101
|
+
- [`compression-default-kill-criterion.md`](compression-default-kill-criterion.md) · decision table read by step-4 closeout
|
|
102
|
+
- `step-4-measurement-and-benchmark.md` · the owning roadmap
|
|
@@ -0,0 +1,125 @@
|
|
|
1
|
+
---
|
|
2
|
+
stability: stable
|
|
3
|
+
---
|
|
4
|
+
|
|
5
|
+
# Namespace contract — skills, rules, commands, personas
|
|
6
|
+
|
|
7
|
+
> Every artefact name is a **stable identifier**: routed to from
|
|
8
|
+
> `router.json`, cited from skills, surfaced in `/help`, embedded in
|
|
9
|
+
> command paths, and back-referenced in test fixtures. Drift breaks
|
|
10
|
+
> all five surfaces silently.
|
|
11
|
+
>
|
|
12
|
+
> **Source:** Step-11 Phase 5 Step 1
|
|
13
|
+
> (`step-11-ruflo-parity.md`).
|
|
14
|
+
> **Enforcer:** [`scripts/lint_namespace.py`](../../scripts/lint_namespace.py),
|
|
15
|
+
> wired into `task lint-skills`.
|
|
16
|
+
|
|
17
|
+
## 1. Shape
|
|
18
|
+
|
|
19
|
+
```
|
|
20
|
+
<stem>-<intent> kebab-case, ASCII, lowercase
|
|
21
|
+
```
|
|
22
|
+
|
|
23
|
+
| Component | Rule |
|
|
24
|
+
|---|---|
|
|
25
|
+
| Charset | `[a-z0-9-]+` only |
|
|
26
|
+
| Separator | single `-` between tokens; never `_`, `.`, or camelCase |
|
|
27
|
+
| Length | skills: 3 ≤ name ≤ 64 · rules / commands / personas: 2 ≤ name ≤ 64 (two-letter slot reserved for intentional acronyms — `pr`, `ci`, `qa`, `me`) |
|
|
28
|
+
| First char | `[a-z]` (digits and `-` forbidden at start) |
|
|
29
|
+
| Last char | `[a-z0-9]` (trailing `-` forbidden) |
|
|
30
|
+
| Run | no consecutive `--` |
|
|
31
|
+
|
|
32
|
+
The `<stem>` carries the **subject** (`commit`, `eloquent`,
|
|
33
|
+
`livewire`); the `<intent>` (optional) carries the **verb / lens**
|
|
34
|
+
(`-writing`, `-architect`, `-routing`). Single-token names are
|
|
35
|
+
permitted when the stem already encodes both (`commit`, `eloquent`,
|
|
36
|
+
`docker`).
|
|
37
|
+
|
|
38
|
+
## 2. Reserved names — forbidden as artefact names
|
|
39
|
+
|
|
40
|
+
| Name | Reason |
|
|
41
|
+
|---|---|
|
|
42
|
+
| `pattern` | Reserved for trigger-pattern fixtures (see `tests/fixtures/triggers/`). |
|
|
43
|
+
| `claude-memories` | Reserved for the `~/.claude/CLAUDE.md` shape — host-agent state, not a package artefact. |
|
|
44
|
+
| `default` | Ambiguous with profile / mode defaults; collides with `.agent-settings.yml` keys. |
|
|
45
|
+
| `index` | Reserved for auto-generated INDEX.md files. |
|
|
46
|
+
| `router` | Reserved for `router.json` and the router contract. |
|
|
47
|
+
|
|
48
|
+
Reserved names apply at the **top level** of each artefact type. A
|
|
49
|
+
sub-verb under a namespaced group (e.g. `council/default.md` →
|
|
50
|
+
`/council:default`) is **not** a top-level identifier — the group
|
|
51
|
+
prefix disambiguates it, and reserved-name enforcement is skipped
|
|
52
|
+
for sub-verbs by the linter. A future artefact `pattern-foo` at the
|
|
53
|
+
top level is fine; bare `pattern` is not.
|
|
54
|
+
|
|
55
|
+
`README.md` and `INDEX.md` are documentation, not artefacts, and are
|
|
56
|
+
skipped by the linter.
|
|
57
|
+
|
|
58
|
+
## 3. Per-type conventions
|
|
59
|
+
|
|
60
|
+
| Type | Source path | Naming nuance |
|
|
61
|
+
|---|---|---|
|
|
62
|
+
| Skill | `.agent-src.uncompressed/skills/<name>/SKILL.md` | Directory name == frontmatter `name`. |
|
|
63
|
+
| Rule | `.agent-src.uncompressed/rules/<name>.md` | Filename stem == frontmatter `id` (when present). |
|
|
64
|
+
| Command | `.agent-src.uncompressed/commands/<name>.md` or `<group>/<verb>.md` | Slash-command invocation `<name>` or `<group>:<verb>`. |
|
|
65
|
+
| Persona | `.agent-src.uncompressed/personas/<name>.md` | Cited from skill frontmatter `personas:` list. |
|
|
66
|
+
|
|
67
|
+
Sub-namespacing (`commit/in-chunks.md` → `/commit:in-chunks`) uses
|
|
68
|
+
the same charset rules per segment; the joining colon is implicit.
|
|
69
|
+
|
|
70
|
+
## 4. Linter — `scripts/lint_namespace.py`
|
|
71
|
+
|
|
72
|
+
Walks the four source roots above, asserts each artefact name:
|
|
73
|
+
|
|
74
|
+
1. Matches the regex `^[a-z][a-z0-9]*(-[a-z0-9]+)*$`.
|
|
75
|
+
2. Length within the per-type bounds from § 1 (skills: 3 ≤ name ≤ 64; rules / commands / personas: 2 ≤ name ≤ 64).
|
|
76
|
+
3. Not in the reserved-names list.
|
|
77
|
+
4. Skill: directory name matches frontmatter `name`.
|
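The first three checks can be sketched with the regex from step 1, the reserved list from § 2, and the per-type length bounds from the Shape table in § 1. This is an illustration only: `check_name` and the issue labels are hypothetical, the frontmatter comparison from step 4 is omitted, and the real linter's diagnostics may differ.

```python
import re

# Regex string as given in this contract.
NAME_RE = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")
RESERVED = {"pattern", "claude-memories", "default", "index", "router"}
# Minimum lengths per artefact type, from the Shape table in section 1.
MIN_LEN = {"skill": 3, "rule": 2, "command": 2, "persona": 2}
MAX_LEN = 64


def check_name(name: str, kind: str = "skill") -> list:
    """Return the list of violated rules; an empty list means the name is valid."""
    issues = []
    if not NAME_RE.fullmatch(name):
        issues.append("regex")
    if not MIN_LEN[kind] <= len(name) <= MAX_LEN:
        issues.append("length")
    if name in RESERVED:
        issues.append("reserved")
    return issues
```

The regex alone already rejects leading digits, trailing hyphens, consecutive hyphens, underscores, and uppercase, so the charset rows of the Shape table need no separate checks.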
|
78
|
+
|
|
79
|
+
Exit codes:
|
|
80
|
+
|
|
81
|
+
| Exit | Meaning |
|
|
82
|
+
|---|---|
|
|
83
|
+
| `0` | All names valid. |
|
|
84
|
+
| `1` | At least one name fails a rule. |
|
|
85
|
+
| `2` | Linter crashed (filesystem error, malformed frontmatter). |
|
|
86
|
+
|
|
87
|
+
Diagnostic format: one issue per line — `<path>: <rule> — <detail>`.
|
|
88
|
+
|
|
89
|
+
## 5. Adding a new artefact
|
|
90
|
+
|
|
91
|
+
Pick the name; verify locally:
|
|
92
|
+
|
|
93
|
+
```bash
|
|
94
|
+
python3 scripts/lint_namespace.py --name <candidate>
|
|
95
|
+
# or full run:
|
|
96
|
+
python3 scripts/lint_namespace.py
|
|
97
|
+
```
|
|
98
|
+
|
|
99
|
+
If the candidate fails, the linter prints the rule it violated.
|
|
100
|
+
**Renames after release are expensive** — touch router.json, every
|
|
101
|
+
skill citing the old name, the bench corpus, and consumer settings.
|
|
102
|
+
Pay the naming cost once, upfront.
|
|
103
|
+
|
|
104
|
+
## 6. Relationship to the frontmatter contract
|
|
105
|
+
|
|
106
|
+
The **shape** lives here. The **frontmatter keys** that carry the
|
|
107
|
+
name (`name:` in skills, `id:` in rules) live in
|
|
108
|
+
[`frontmatter-contract.md`](../../agents/docs/frontmatter-contract.md).
|
|
109
|
+
Both contracts share the regex; this file is the source of truth for
|
|
110
|
+
the regex string.
|
|
111
|
+
|
|
112
|
+
## 7. Why this exists
|
|
113
|
+
|
|
114
|
+
`router.json` resolves `<kind>:<id>` strings at session start. Any
|
|
115
|
+
artefact rename breaks every routing entry pointing at the old name
|
|
116
|
+
without compile-time error. The linter catches the rename at the PR
|
|
117
|
+
boundary, not at runtime in a consumer.
|
|
118
|
+
|
|
119
|
+
## 8. Out of scope
|
|
120
|
+
|
|
121
|
+
- File-system case sensitivity (we rely on lowercase-only names).
|
|
122
|
+
- Cross-tool aliases (Augment / Claude / Cursor all consume the same
|
|
123
|
+
name — projection is by content, not by alias).
|
|
124
|
+
- Versioning suffixes (`-v2`, `-legacy`). Use `status: superseded`
|
|
125
|
+
in frontmatter instead; never rename in place.
|