@event4u/agent-config 2.18.0 → 2.20.0

Files changed (108)
  1. package/.agent-src/commands/agent-status.md +29 -0
  2. package/.agent-src/commands/onboard.md +221 -81
  3. package/.agent-src/commands/refine-ticket.md +3 -0
  4. package/.agent-src/packs/README.md +49 -0
  5. package/.agent-src/packs/agency-delivery.yml +63 -0
  6. package/.agent-src/packs/content-engine.yml +53 -0
  7. package/.agent-src/packs/founder-mvp.yml +51 -0
  8. package/.agent-src/personas/README.md +8 -0
  9. package/.agent-src/presets/README.md +26 -0
  10. package/.agent-src/presets/balanced.yml +34 -0
  11. package/.agent-src/presets/fast.yml +31 -0
  12. package/.agent-src/presets/strict.yml +38 -0
  13. package/.agent-src/profiles/README.md +29 -0
  14. package/.agent-src/profiles/agency.yml +27 -0
  15. package/.agent-src/profiles/content_creator.yml +25 -0
  16. package/.agent-src/profiles/developer.yml +26 -0
  17. package/.agent-src/profiles/finance.yml +24 -0
  18. package/.agent-src/profiles/founder.yml +25 -0
  19. package/.agent-src/profiles/ops.yml +25 -0
  20. package/.agent-src/rules/no-cheap-questions.md +25 -17
  21. package/.agent-src/skills/adr-create/SKILL.md +78 -68
  22. package/.agent-src/skills/refine-ticket/SKILL.md +3 -0
  23. package/.agent-src/skills/subagent-orchestration/SKILL.md +33 -0
  24. package/.agent-src/templates/agents/agent-project-settings.example.yml +1 -1
  25. package/.agent-src/templates/skill-archive-note.md +101 -0
  26. package/.agent-src/user-types/README.md +124 -0
  27. package/.agent-src/user-types/_template/user-type.md +95 -0
  28. package/.agent-src/user-types/galabau-field-crew.md +100 -0
  29. package/.agent-src/user-types/metalworking-shop.md +105 -0
  30. package/.agent-src/user-types/truck-driver.md +113 -0
  31. package/.claude-plugin/marketplace.json +1 -1
  32. package/CHANGELOG.md +91 -30
  33. package/README.md +68 -72
  34. package/config/agent-settings.template.yml +22 -0
  35. package/docs/adrs/caveman/0001-default-off-until-bench.md +93 -0
  36. package/docs/adrs/caveman/README.md +9 -0
  37. package/docs/adrs/cost/0001-hard-stop-hook.md +114 -0
  38. package/docs/adrs/cost/README.md +9 -0
  39. package/docs/adrs/memory/0001-consumer-side-snapshot.md +111 -0
  40. package/docs/adrs/memory/README.md +9 -0
  41. package/docs/adrs/router/0001-three-tier-routing.md +119 -0
  42. package/docs/adrs/router/README.md +9 -0
  43. package/docs/adrs/schema/0001-json-schema-frontmatter.md +102 -0
  44. package/docs/adrs/schema/README.md +9 -0
  45. package/docs/adrs/smoke/0001-per-tier-smoke-scripts.md +99 -0
  46. package/docs/adrs/smoke/README.md +9 -0
  47. package/docs/architecture/current-onboard-baseline.md +126 -0
  48. package/docs/architecture/current-safety-behavior.md +137 -0
  49. package/docs/archive/CHANGELOG-pre-2.16.0.md +48 -0
  50. package/docs/contracts/adr-layout.md +108 -0
  51. package/docs/contracts/adr-mcp-runtime.md +128 -0
  52. package/docs/contracts/adr-user-types-axis.md +127 -0
  53. package/docs/contracts/benchmark-corpus-spec.md +97 -0
  54. package/docs/contracts/benchmark-report-schema.md +111 -0
  55. package/docs/contracts/command-clusters.md +1 -0
  56. package/docs/contracts/command-taxonomy.md +137 -0
  57. package/docs/contracts/compression-default-kill-criterion.md +69 -0
  58. package/docs/contracts/config-presets.md +144 -0
  59. package/docs/contracts/cost-dashboard.md +143 -0
  60. package/docs/contracts/cost-enforcement.md +134 -0
  61. package/docs/contracts/file-ownership-matrix.json +0 -7
  62. package/docs/contracts/mcp-tool-inventory.md +53 -0
  63. package/docs/contracts/measurement-baseline.md +102 -0
  64. package/docs/contracts/namespace.md +125 -0
  65. package/docs/contracts/profile-system.md +142 -0
  66. package/docs/contracts/safety-model.md +129 -0
  67. package/docs/contracts/smoke-contracts.md +144 -0
  68. package/docs/contracts/user-type-schema.md +146 -0
  69. package/docs/contracts/workflow-packs.md +121 -0
  70. package/docs/decisions/ADR-010-profile-pack-preset-boundary.md +132 -0
  71. package/docs/decisions/INDEX.md +1 -0
  72. package/docs/featured-commands.md +27 -0
  73. package/docs/parity/bench-ruflo.json +58 -0
  74. package/docs/parity/bench.json +41 -0
  75. package/docs/parity/ruflo.md +46 -0
  76. package/docs/profiles.md +91 -0
  77. package/docs/recruits/_template.md +81 -0
  78. package/package.json +1 -1
  79. package/scripts/_cli/cmd_explain.py +250 -0
  80. package/scripts/_lib/bench_cost.py +138 -0
  81. package/scripts/_lib/bench_quality.py +118 -0
  82. package/scripts/_lib/bench_report.py +150 -0
  83. package/scripts/agent-config +13 -0
  84. package/scripts/audit_adr_coverage.py +175 -0
  85. package/scripts/audit_mcp_tools.py +146 -0
  86. package/scripts/bench_baseline_ready.py +108 -0
  87. package/scripts/bench_drift_check.py +151 -0
  88. package/scripts/bench_per_tool.py +216 -0
  89. package/scripts/bench_run.py +155 -0
  90. package/scripts/compress.py +48 -2
  91. package/scripts/config/__init__.py +9 -0
  92. package/scripts/config/presets.py +206 -0
  93. package/scripts/config/profiles.py +173 -0
  94. package/scripts/cost/budget.mjs +73 -12
  95. package/scripts/cost/preflight.mjs +89 -0
  96. package/scripts/lint_archived_skills.py +143 -0
  97. package/scripts/lint_bench_corpus.py +161 -0
  98. package/scripts/lint_namespace.py +135 -0
  99. package/scripts/schemas/user-type.schema.json +35 -0
  100. package/scripts/skill_linter.py +139 -4
  101. package/scripts/skill_overlap.py +204 -0
  102. package/scripts/skill_tools/audit_user_type_coverage.py +148 -0
  103. package/scripts/skill_usage_collect.py +191 -0
  104. package/scripts/skill_usage_report.py +162 -0
  105. package/scripts/smoke/kernel.sh +101 -0
  106. package/scripts/smoke/router.sh +129 -0
  107. package/scripts/smoke/schema.sh +71 -0
  108. package/scripts/smoke/skills.sh +101 -0
@@ -0,0 +1,144 @@
1
+ ---
2
+ stability: beta
3
+ keep-beta-until: 2026-08-14
4
+ ---
5
+
6
+ # Config Presets — Contract
7
+
8
+ > **Status:** beta · **Owner:** package maintainer · **Last reviewed:** 2026-05-16
9
+ >
10
+ > Schema and semantics for the **Config Preset** axis introduced in
11
+ > step-15 Phase 1 item 4. Records the **Cost Enforcement** model
12
+ > (Council v3 action #3 prerequisite) so the preset loader can ship.
13
+ > Boundary against `profile.id`, `pack.id`, and `cost_profile`:
14
+ > [`ADR-010`](../decisions/ADR-010-profile-pack-preset-boundary.md).
15
+
16
+ ## Decision
17
+
18
+ A **preset** owns governance knobs that the user wants to tune as a
19
+ bundle, not individually. Three seed presets ship; users can declare
20
+ their own under `.agent-src.uncompressed/presets/<id>.yml`.
21
+
22
+ | `preset.id` | Stance | Typical user |
23
+ |---|---|---|
24
+ | `fast` | Lowest friction; widest autonomy; loosest cost caps | Solo founder, throw-away prototype, exploration |
25
+ | **`balanced`** *(default)* | Moderate friction; per-task autonomy; sensible cost caps | Day-to-day work; default for any new install |
26
+ | `strict` | Highest friction; ask-by-default; tight cost caps; block-on-risk | Production paths, regulated work, shared trunks |
27
+
28
+ Profile-aware overlay: `developer + strict` ≠ `founder + strict` — the
29
+ profile selects which knob in the preset is read first (e.g. `developer`
30
+ reads `block_on_risk.code_paths`, `founder` reads `block_on_risk.financial_paths`).
31
+
32
+ ## Preset shape
33
+
34
+ ```yaml
35
+ preset:
36
+ id: balanced
37
+ autonomy:
38
+ default: auto # on | off | auto (see autonomous-execution rule)
39
+ trivial_suppress: true
40
+ confidence:
41
+ min_band: medium # low | medium | high — block plan if below
42
+ require_evidence: false
43
+ risk:
44
+ block_on: [security, prod_data]
45
+ ask_on: [bulk_delete, schema_change]
46
+ council:
47
+ auto_consult: false
48
+ cap_per_consult_usd: 0.50
49
+ mcp:
50
+ per_call_max_usd: 0.10
51
+ per_session_max_usd: 2.00
52
+ cost:
53
+ daily_max_usd: 10.00
54
+ weekly_max_usd: 50.00
55
+ monthly_max_usd: 150.00
56
+ enforce: hybrid # see Cost Enforcement section
57
+ notifications:
58
+ threshold_pct: [50, 75, 90, 100]
59
+ ```
60
+
61
+ ## Cost Enforcement
62
+
63
+ *Hybrid model* — recorded as the Phase 1 prerequisite per Council v3
64
+ action #3. Two enforcement surfaces, one decision per call.
65
+
66
+ ### Hard enforcement (preset loader, blocking)
67
+
68
+ The preset loader **refuses to dispatch** any council or MCP call whose
69
+ *estimated* cost exceeds the active preset's per-call ceiling. The
70
+ estimate is read from the model adapter (`council_cli.py estimate` for
71
+ council; the MCP tool manifest for MCP). The block is raised **before**
72
+ the network call. There is no override flag — the user must change the
73
+ preset, override `cost.per_call_max_usd` in `.agent-settings.yml`, or
74
+ pass `--preset=fast` on the CLI.
75
+
76
+ ```
77
+ PRE-CALL CEILING IS HARD.
78
+ NO RUNTIME OVERRIDE. NO "JUST THIS ONCE" FLAG.
79
+ EXCEED → REFUSE → SURFACE THE CEILING + THE OVERRIDE PATH.
80
+ ```
81
+
82
+ Applies to:
83
+
84
+ - AI Council consults (`scripts/council_cli.py run`).
85
+ - MCP tool calls dispatched through the universal dispatcher
86
+ ([`hook-architecture-v1`](hook-architecture-v1.md)).
87
+ - Any future skill that reads `preset.cost.per_call_max_usd`.
88
+
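A minimal sketch of the pre-call gate described in this section, assuming the estimate and the ceiling are already available as plain USD floats. The names `CeilingExceeded` and `check_pre_call` are hypothetical and not part of the package; the knob name in the example is taken from the preset shape above.

```python
class CeilingExceeded(Exception):
    """Hypothetical error raised before any network call is made."""


def check_pre_call(estimated_usd: float, ceiling_usd: float, knob: str) -> None:
    """Refuse the call when the estimate exceeds the per-call ceiling.

    The message surfaces both the ceiling and the override path, as the
    contract requires. There is deliberately no bypass parameter.
    """
    if estimated_usd > ceiling_usd:
        raise CeilingExceeded(
            f"estimated ${estimated_usd:.2f} exceeds {knob} = ${ceiling_usd:.2f}; "
            f"change the preset or override {knob} in .agent-settings.yml"
        )


# An estimate at the ceiling passes; anything above it is refused.
check_pre_call(0.10, 0.10, "mcp.per_call_max_usd")
try:
    check_pre_call(0.11, 0.10, "mcp.per_call_max_usd")
except CeilingExceeded as err:
    print(err)
```

The sketch treats "exceeds" as strictly greater than, so a call estimated at exactly the ceiling is dispatched; that boundary choice is an assumption.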
89
+ ### Advisory dashboard (retroactive, non-blocking)
90
+
91
+ `agent-config cost` (Phase 2 item 10) surfaces daily / weekly / monthly
92
+ spend against the active preset's caps. The dashboard **does not**
93
+ block — it warns at the thresholds in `preset.notifications.threshold_pct`
94
+ (default `50 / 75 / 90 / 100`). At 100 %, the dashboard prints a hard
95
+ warning; the next session start re-checks the cap against the running
96
+ total before dispatching the next paid call.
97
+
98
+ The advisory layer's role is **awareness**, not enforcement. Enforcement
99
+ is exclusively the per-call ceiling above; retroactive blocking would
100
+ turn a session unrecoverably hostile mid-task.
101
+
102
+ ### What the loader does **not** do
103
+
104
+ - It does **not** estimate cost for unpaid local model calls
105
+ (`ollama`, local llama.cpp). These bypass both surfaces.
106
+ - It does **not** estimate cost for non-LLM tool calls (file reads,
107
+ shell commands, MCP-static-resource fetches). The per-call ceiling
108
+ targets paid token spend.
109
+ - It does **not** override the Hard Floor in
110
+ [`non-destructive-by-default`](../../.agent-src/rules/non-destructive-by-default.md)
111
+ — a preset cannot lift the universal safety floor.
112
+
113
+ ## Resolution chain
114
+
115
+ Reads happen in this order; last writer wins for any single knob:
116
+
117
+ 1. `pack.preset_id` (if pack active) → set `preset.id`.
118
+ 2. `profile.preset_id` → set `preset.id` (if not already set by pack).
119
+ 3. `preset.<id>.yml` → fill all knobs.
120
+ 4. `.agent-settings.yml` user keys under `preset:` → override per-knob.
121
+ 5. Environment variables (`AGENT_CONFIG_PRESET_COST_DAILY_MAX_USD=…`)
122
+ → override per-knob.
123
+ 6. Runtime CLI flags (`--preset-cost-per-call-max-usd=…`) → override
124
+ per-knob, single session.
125
+
126
+ Per [`ADR-010`](../decisions/ADR-010-profile-pack-preset-boundary.md),
127
+ no other axis may write preset-owned knobs.
128
+
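The resolution chain can be read as a layered merge. The sketch below is illustrative only, assuming each layer has already been parsed into a nested dict; `select_preset_id`, `merge_knobs`, and `resolve_knobs` are hypothetical names, and the real loader in `scripts/config/presets.py` may be structured differently.

```python
from copy import deepcopy
from typing import Optional


def select_preset_id(pack_preset_id: Optional[str],
                     profile_preset_id: Optional[str]) -> Optional[str]:
    """Steps 1-2: the pack's choice wins; the profile only fills an unset id."""
    return pack_preset_id if pack_preset_id is not None else profile_preset_id


def merge_knobs(base: dict, override: dict) -> dict:
    """Apply override on top of base, knob by knob (nested dicts are merged)."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_knobs(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


def resolve_knobs(*layers: dict) -> dict:
    """Steps 3-6: later layers overwrite earlier ones for any single knob."""
    resolved: dict = {}
    for layer in layers:
        resolved = merge_knobs(resolved, layer)
    return resolved


# Layer order: preset file, .agent-settings.yml, environment, CLI flags.
preset_file = {"cost": {"daily_max_usd": 10.00, "weekly_max_usd": 50.00},
               "mcp": {"per_call_max_usd": 0.10}}
settings = {"cost": {"daily_max_usd": 5.00}}
environment = {"cost": {"daily_max_usd": 7.50}}
cli_flags = {"mcp": {"per_call_max_usd": 0.25}}

resolved = resolve_knobs(preset_file, settings, environment, cli_flags)
print(select_preset_id(None, "balanced"), resolved)
```

Untouched knobs (here `cost.weekly_max_usd`) keep the value from the preset file, which is what "override per-knob" implies.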
129
+ ## Drift detection
130
+
131
+ `task lint-config-schema` (added in Phase 1) hard-fails when:
132
+
133
+ - A pack YAML or profile YAML names a preset-owned knob.
134
+ - A preset YAML names a knob outside this contract.
135
+ - The three seed presets diverge from the documented stances above.
136
+
137
+ ## Non-goals
138
+
139
+ - This contract does **not** define profiles, packs, or `cost_profile`.
140
+ See the corresponding contracts.
141
+ - It does **not** ship a UI. CLI-first (`agent-config cost`,
142
+ `agent-config preset set <id>`).
143
+ - It does **not** auto-migrate existing installs. Without a preset,
144
+ the loader falls back to current per-knob defaults (`balanced`-equivalent).
@@ -0,0 +1,143 @@
1
+ ---
2
+ stability: beta
3
+ keep-beta-until: 2026-08-12
4
+ ---
5
+
6
+ # Cost governance dashboard
7
+
8
+ > **Status:** beta — first draft 2026-05-16 (Phase 2 Item 10 of
9
+ > `step-15-product-refinement`).
10
+ >
11
+ > **Related:** [`config-presets`](config-presets.md) (caps schema) ·
12
+ > [`cost-profile-defaults`](cost-profile-defaults.md) (default
13
+ > selection) · `scripts/cost/budget.mjs` (existing local-store
14
+ > primitive) · `scripts/cost/track.mjs` (session ingest).
15
+
16
+ The `agent-config cost` subcommand surfaces accumulated spend against
17
+ the active preset's caps. Read-only, CLI-first, no UI. Wraps the
18
+ existing `scripts/cost/*.mjs` primitives behind a single discoverable
19
+ verb so a user can ask "where am I against my budget?" without knowing
20
+ the storage layout.
21
+
22
+ ## Surface
23
+
24
+ ```
25
+ agent-config cost # default: status (this period's spend)
26
+ agent-config cost status [--json] # spend vs caps for daily/weekly/monthly
27
+ agent-config cost ingest # pull latest session.jsonl → local store
28
+ agent-config cost history [--period=today|week|month] [--limit=N]
29
+ agent-config cost reset --confirm # truncate sessions.jsonl + budget.json
30
+ ```
31
+
32
+ The default `status` subcommand and `history` are **read-only**. `ingest` writes only to
33
+ `agents/cost-tracking/sessions.jsonl`. `reset` is destructive and
34
+ gated by `--confirm` (Hard-Floor per
35
+ [`non-destructive-by-default`](../../.agent-src/rules/non-destructive-by-default.md)).
36
+
37
+ ## `cost status` — output contract
38
+
39
+ Human format:
40
+
41
+ ```
42
+ Cost (preset: balanced · profile: developer)
43
+
44
+ Period Spent Cap Remaining % Status
45
+ today $2.43 $10.00 $7.57 24% ✅
46
+ week $14.20 $40.00 $25.80 36% ✅
47
+ month $52.10 $150.00 $97.90 35% ✅
48
+
49
+ MCP calls: 12 today · 47 this week · 188 this month
50
+ Council calls: 1 today · 3 this week · 11 this month
51
+
52
+ Next threshold notification at 75% (week: $30.00).
53
+ ```
54
+
55
+ `--json` output schema:
56
+
57
+ ```json
58
+ {
59
+ "preset": "balanced",
60
+ "profile": "developer",
61
+ "periods": {
62
+ "today": {"spent_usd": 2.43, "cap_usd": 10.00, "remaining_usd": 7.57, "pct": 0.243, "status": "ok"},
63
+ "week": {"spent_usd": 14.20, "cap_usd": 40.00, "remaining_usd": 25.80, "pct": 0.355, "status": "ok"},
64
+ "month": {"spent_usd": 52.10, "cap_usd": 150.00, "remaining_usd": 97.90, "pct": 0.347, "status": "ok"}
65
+ },
66
+ "calls": {
67
+ "mcp": {"today": 12, "week": 47, "month": 188},
68
+ "council": {"today": 1, "week": 3, "month": 11}
69
+ },
70
+ "next_threshold": {"period": "week", "pct": 0.75, "trigger_usd": 30.00}
71
+ }
72
+ ```
73
+
74
+ ### Status field
75
+
76
+ | Value | Trigger | Exit code |
77
+ |---|---|---|
78
+ | `ok` | `pct < 0.75` | 0 |
79
+ | `warn` | `0.75 ≤ pct < 1.0` | 0 |
80
+ | `over` | `pct ≥ 1.0` | 1 |
81
+
82
+ Overall exit = worst-of across the three periods. `--json` always
83
+ emits the full object regardless of exit.
84
+
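The status table and the worst-of exit rule can be sketched as follows. This is an illustration under the thresholds stated above, not the shipped implementation; `period_status` and `overall_exit` are hypothetical names.

```python
EXIT_CODES = {"ok": 0, "warn": 0, "over": 1}


def period_status(pct: float) -> str:
    """Map a utilisation fraction (spent / cap) to the documented status value."""
    if pct >= 1.0:
        return "over"
    if pct >= 0.75:
        return "warn"
    return "ok"


def overall_exit(pcts: dict) -> int:
    """Worst-of across periods: a single `over` period makes the exit code 1."""
    return max(EXIT_CODES[period_status(pct)] for pct in pcts.values())


# Fractions taken from the example output in this contract.
example = {"today": 0.243, "week": 0.355, "month": 0.347}
print({name: period_status(pct) for name, pct in example.items()}, overall_exit(example))
```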
85
+ ## Data sources
86
+
87
+ | Field | Source |
88
+ |---|---|
89
+ | `preset` | Active preset id from [`config-presets`](config-presets.md) resolution chain. |
90
+ | `cap_usd` | `preset.cost.{daily,weekly,monthly}_max_usd`. |
91
+ | `spent_usd` | Sum of `cost_usd` field over `agents/cost-tracking/sessions.jsonl` records inside the period window. |
92
+ | `calls.mcp.*` | Sum of `mcp_calls` field in the same records. |
93
+ | `calls.council.*` | Count of records whose `kind` is `council`. |
94
+ | `next_threshold` | Smallest `(period, pct ∈ preset.notifications.threshold_pct)` tuple where `spent_usd < pct × cap_usd`. |
95
+
96
+ When the active preset declares no `cost.*` cap (legacy installs),
97
+ `cap_usd` is reported as `null` and `status` is `ok`. The tool does
98
+ **not** invent a default cap.
99
+
100
+ ## Enforcement vs surfacing
101
+
102
+ `agent-config cost` is **read-only**. Enforcement (refuse a council
103
+ or MCP call that would push spend over a cap) lives at the call site
104
+ per the active preset's `cost.enforce` setting (`off`, `advisory`,
105
+ `hybrid`, `hard`). This contract does not change enforcement; it only
106
+ makes the existing local-store data discoverable.
107
+
108
+ ## Refresh model
109
+
110
+ `sessions.jsonl` is appended to by the Claude Code session hooks
111
+ (see `scripts/cost/track.mjs`). `cost status` reads what's there;
112
+ `cost ingest` triggers a one-shot pull from `~/.claude/projects/`.
113
+ Users running a non-Claude-Code agent surface call `cost ingest`
114
+ manually after a session; users on Claude Code with hooks installed
115
+ never need to.
116
+
117
+ ## Validation
118
+
119
+ `scripts/lint_cost_dashboard.py` (Phase 2 deliverable — not yet
120
+ shipped) fails CI on:
121
+
122
+ - Schema drift in `sessions.jsonl` (missing required fields).
123
+ - Preset declaring `cost.*` caps that disagree with this contract's
124
+ expected period grid.
125
+ - `cost status --json` output diverging from the schema above.
126
+
127
+ ## What this contract does **not** do
128
+
129
+ - **Does not** ship a UI. CLI-first, by design.
130
+ - **Does not** introduce per-skill or per-command cost attribution
131
+ beyond `kind` (`council` vs other). Per-skill attribution is a
132
+ Phase 3 candidate.
133
+ - **Does not** override per-call hard caps from the preset.
134
+ - **Does not** roll up across multiple projects. Each project's
135
+ `agents/cost-tracking/` is its own scope.
136
+
137
+ ## See also
138
+
139
+ - [`config-presets`](config-presets.md) — preset caps + `enforce` semantics
140
+ - [`cost-profile-defaults`](cost-profile-defaults.md) — default preset selection
141
+ - [`safety-model`](safety-model.md) — `mcp_call_costly` domain
142
+ - `scripts/cost/budget.mjs`, `scripts/cost/track.mjs` — wrapped primitives
143
+ - `step-15-product-refinement` § Phase 2 Item 10
@@ -0,0 +1,134 @@
1
+ ---
2
+ stability: stable
3
+ ---
4
+
5
+ # Cost Enforcement Contract
6
+
7
+ > Status: stable · Owner: `step-11-measurement-governance-parity` · Last reviewed: 2026-05-16
8
+
9
+ How USD budgets read from `.agent-settings.yml` interact with the
10
+ session-cost ledger (`agents/cost-tracking/sessions.jsonl`) and the
11
+ budget evaluator (`scripts/cost/budget.mjs`).
12
+
13
+ ## Surface
14
+
15
+ Two files. Settings file declares the budget; ledger file accumulates
16
+ spend. The evaluator joins them and emits a tier.
17
+
18
+ | File | Role |
19
+ |---|---|
20
+ | `.agent-settings.yml § cost` | Declarative: budgets per period + enforcement mode. |
21
+ | `agents/cost-tracking/sessions.jsonl` | Append-only: per-session cost records (model, tokens, USD). |
22
+ | `scripts/cost/budget.mjs` | Evaluator: joins both, emits `{ level, utilization_pct, enforcement, source }`. |
23
+ | `scripts/cost/preflight.mjs` | Hard-stop hook: wraps `budget.mjs check` and exits non-zero at HARD_STOP when `enforcement: hard-stop`. |
24
+
25
+ ## Settings schema
26
+
27
+ ```yaml
28
+ cost:
29
+ budgets:
30
+ daily: 0 # USD ceiling for rolling 24h. 0 = unbudgeted.
31
+ weekly: 0 # USD ceiling for rolling 7d. 0 = unbudgeted.
32
+ monthly: 0 # USD ceiling for rolling 30d. 0 = unbudgeted.
33
+ enforcement: advisory # advisory | hard-stop
34
+ ```
35
+
36
+ - `0` (or absent) on any period = that period is not enforced. The
37
+ evaluator falls back to a longer-period budget when checking shorter
38
+ periods, never the other way around.
39
+ - `enforcement: advisory` is the default. Dashboards surface the
40
+ breach; the agent keeps working.
41
+ - `enforcement: hard-stop` is opt-in. `scripts/cost/preflight.mjs`
42
+ exits non-zero at the HARD_STOP tier; wrapping shells / CI / `task`
43
+ bindings must check this before composing a turn.
44
+
45
+ ## Tier ladder (5-stage)
46
+
47
+ | Utilization | Level | Emoji | Threshold-pct |
48
+ |---:|---|:---:|---:|
49
+ | `< 50 %` | `OK` | 🟢 | 0 |
50
+ | `50–74 %` | `INFO` | 🟡 | 50 |
51
+ | `75–89 %` | `WARNING` | 🟠 | 75 |
52
+ | `90–99 %` | `CRITICAL` | 🔴 | 90 |
53
+ | `≥ 100 %` | `HARD_STOP` | 🛑 | 100 |
54
+
55
+ The legacy 4-stage draft (`under / 50 / 75 / 90 / 100`) folded `OK`
56
+ into `under`. Parity-doc Phase 6 maps both forms verbatim.
57
+
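A small sketch of the 5-stage ladder, assuming each Threshold-pct value is an inclusive lower bound so that fractional utilisation such as 74.5 % stays in the lower tier. The function name `tier` is hypothetical; `scripts/cost/budget.mjs` remains the authoritative evaluator.

```python
# (threshold_pct, level) pairs, checked from the highest tier down.
TIERS = [
    (100, "HARD_STOP"),
    (90, "CRITICAL"),
    (75, "WARNING"),
    (50, "INFO"),
    (0, "OK"),
]


def tier(utilization_pct: float) -> str:
    """Return the first level whose threshold the utilisation has reached."""
    for threshold, level in TIERS:
        if utilization_pct >= threshold:
            return level
    return "OK"  # only reachable for negative input; treated as no spend


for sample in (12.0, 50.0, 74.5, 90.0, 130.0):
    print(sample, tier(sample))
```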
58
+ ## Hook surface
59
+
60
+ `scripts/cost/preflight.mjs` is the **single** turn-start surface.
61
+ It wraps `budget.mjs check` and:
62
+
63
+ 1. Reads `cost.enforcement` from `.agent-settings.yml`.
64
+ 2. If `advisory` → always exits `0`, prints the tier as advisory text.
65
+ 3. If `hard-stop` and level is `HARD_STOP` → prints a refusal block
66
+ citing this contract and exits `1`.
67
+ 4. If no budget is configured at all → exits `0` (fail-open). Never
68
+ blocks unbudgeted work.
69
+
70
+ The hook does **not** rewrite or block individual tool calls. It is a
71
+ process-entry gate, intended to be invoked by:
72
+
73
+ - `task ci`, `task work:*`, `task roadmap:*` wrappers.
74
+ - The `/onboard` boot path (`scripts/install.py`-side guidance only).
75
+ - Manual `node scripts/cost/preflight.mjs` for shell wrappers.
76
+
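The four-step decision listed in this section reduces to a short function. The sketch below is written in Python for illustration even though the hook itself is an `.mjs` script; `preflight_exit_code` is a hypothetical name, and `level=None` is an assumed way of representing "no budget configured".

```python
from typing import Optional


def preflight_exit_code(enforcement: str, level: Optional[str]) -> int:
    """Return the process exit code for the turn-start gate.

    enforcement: "advisory" or "hard-stop", as read from .agent-settings.yml.
    level: tier emitted by the evaluator, or None when no budget is configured.
    """
    if level is None:
        return 0  # fail-open: unbudgeted work is never blocked
    if enforcement == "hard-stop" and level == "HARD_STOP":
        return 1  # refusal block is printed by the caller
    return 0  # advisory mode, or hard-stop below the HARD_STOP tier


print(preflight_exit_code("advisory", "HARD_STOP"))
print(preflight_exit_code("hard-stop", "HARD_STOP"))
print(preflight_exit_code("hard-stop", None))
```

Only one combination blocks: `hard-stop` enforcement at the `HARD_STOP` tier. Every other path exits `0`.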
77
+ ## Bypass
78
+
79
+ User-facing bypass mechanisms (documented for the refusal block):
80
+
81
+ - Raise the budget: edit `.agent-settings.yml § cost.budgets.<period>`.
82
+ - Reset the ledger (drops historical spend from the calculation):
83
+ `node scripts/cost/track.mjs reset --confirm`.
84
+ - Disable enforcement: set `cost.enforcement: advisory`.
85
+
86
+ No environment-variable override. Bypass must be an explicit edit so
87
+ the change is durable and auditable.
88
+
89
+ ## Default behaviour without a budget
90
+
91
+ When `cost.budgets.{daily,weekly,monthly}` are all `0`:
92
+
93
+ - `budget.mjs check` reports cumulative spend, no tier (returns the
94
+ no-budget JSON shape).
95
+ - `preflight.mjs` exits `0`. Never blocks.
96
+ - `agent-status` panel shows **only** the measured-spend USD figure;
97
+ the tier table is suppressed.
98
+
99
+ ## Source precedence
100
+
101
+ `budget.mjs` reads budget config in this order:
102
+
103
+ 1. `.agent-settings.yml § cost` (when any value > 0).
104
+ 2. `agents/cost-tracking/budget.json` (legacy single-period JSON).
105
+ 3. None → no-budget output shape.
106
+
107
+ The evaluator output carries `source: 'agent-settings.yml' | 'budget.json'`
108
+ so dashboards can show where the figure came from.
109
+
110
+ ## Period mapping
111
+
112
+ `BUDGET_PERIOD={today|week|month|all}` selects which budget value
113
+ applies:
114
+
115
+ | `BUDGET_PERIOD` | Settings key |
116
+ |---|---|
117
+ | `today` | `cost.budgets.daily` |
118
+ | `week` | `cost.budgets.weekly` |
119
+ | `month` | `cost.budgets.monthly` |
120
+ | `all` (default) | First non-zero of `monthly → weekly → daily`. |
121
+
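The mapping table can be expressed as a lookup plus a first-non-zero scan. This sketch models only the table above; the longer-period fallback mentioned under the settings schema is left out. `budget_for` is a hypothetical name, and a return value of `0` stands for "unbudgeted".

```python
PERIOD_KEYS = {"today": "daily", "week": "weekly", "month": "monthly"}


def budget_for(period: str, budgets: dict) -> float:
    """Pick the USD budget that applies to a BUDGET_PERIOD value."""
    if period in PERIOD_KEYS:
        return budgets.get(PERIOD_KEYS[period], 0)
    if period == "all":
        # First non-zero of monthly -> weekly -> daily.
        for key in ("monthly", "weekly", "daily"):
            if budgets.get(key, 0) > 0:
                return budgets[key]
        return 0
    raise ValueError(f"unknown BUDGET_PERIOD: {period!r}")


budgets = {"daily": 5, "weekly": 20, "monthly": 0}
print(budget_for("today", budgets), budget_for("all", budgets))
```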
122
+ ## Acceptance fixtures
123
+
124
+ `tests/fixtures/cost/budget/` carries five reference fixtures:
125
+ `under-50`, `mid-75`, `high-90`, `at-100`, `over-100`. Each fixture
126
+ ships a `sessions.jsonl` slice + an expected JSON output. The fixture
127
+ suite is wired to `task test-cost-budget` per `step-11` Phase 2 Step 5.
128
+
129
+ ## See also
130
+
131
+ - `step-11-ruflo-parity` — Measurement & Governance Parity roadmap.
132
+ - `docs/contracts/cost-dashboard.md` — companion dashboard contract.
133
+ - `scripts/cost/budget.mjs` — evaluator implementation.
134
+ - `bench/pricing.yaml` — per-model USD pricing table.
@@ -6928,13 +6928,6 @@
6928
6928
  "via": "body_link",
6929
6929
  "depth": 1
6930
6930
  },
6931
- {
6932
- "source": ".agent-src.uncompressed/rules/no-cheap-questions.md",
6933
- "target": ".agent-src.uncompressed/contexts/contracts/frugality-charter.md",
6934
- "type": "READ_ONLY",
6935
- "via": "body_link",
6936
- "depth": 1
6937
- },
6938
6931
  {
6939
6932
  "source": ".agent-src.uncompressed/rules/no-cheap-questions.md",
6940
6933
  "target": ".agent-src.uncompressed/rules/ask-when-uncertain.md",
@@ -0,0 +1,53 @@
1
+ ---
2
+ stability: beta
3
+ keep-beta-until: 2026-08-14
4
+ ---
5
+
6
+ # MCP tool inventory
7
+
8
+ > Generated by [`scripts/audit_mcp_tools.py`](../../scripts/audit_mcp_tools.py)
9
+ > from the source-of-truth catalog
10
+ > [`scripts/mcp_server/consumer_tool_catalog.json`](../../scripts/mcp_server/consumer_tool_catalog.json).
11
+ > Do **not** hand-edit; rerun `python3 scripts/audit_mcp_tools.py --write`.
12
+ >
13
+ > Step-11 Phase 5 Step 3 (`step-11-ruflo-parity.md`).
14
+
15
+ ## Summary
16
+
17
+ - **Total tools:** 20
18
+ - **By transport:** stdio=9
19
+ - **By side-effect:** fs-write=5, ro=12, shell=3
20
+ - **Discovery-only stubs (no implementation):** 11
21
+
22
+ ## Tools
23
+
24
+ | Tool | Side-effect | Transports | Catalog | Handler |
25
+ |---|---|---|---|---|
26
+ | `lint_skills` | `ro` | stdio | [`consumer_tool_catalog.json:7`](../../scripts/mcp_server/consumer_tool_catalog.json#L7) | [`tools.py:510`](../../scripts/mcp_server/tools.py#L510) |
27
+ | `chat_history_append` | `fs-write` | stdio | [`consumer_tool_catalog.json:24`](../../scripts/mcp_server/consumer_tool_catalog.json#L24) | [`tools.py:535`](../../scripts/mcp_server/tools.py#L535) |
28
+ | `chat_history_read` | `ro` | stdio | [`consumer_tool_catalog.json:43`](../../scripts/mcp_server/consumer_tool_catalog.json#L43) | [`tools.py:571`](../../scripts/mcp_server/tools.py#L571) |
29
+ | `memory_lookup` | `ro` | stdio | [`consumer_tool_catalog.json:59`](../../scripts/mcp_server/consumer_tool_catalog.json#L59) | [`tools.py:590`](../../scripts/mcp_server/tools.py#L590) |
30
+ | `memory_signal` | `fs-write` | _(stub)_ | [`consumer_tool_catalog.json:75`](../../scripts/mcp_server/consumer_tool_catalog.json#L75) | _stub-only_ |
31
+ | `memory_status` | `ro` | stdio | [`consumer_tool_catalog.json:91`](../../scripts/mcp_server/consumer_tool_catalog.json#L91) | [`tools.py:617`](../../scripts/mcp_server/tools.py#L617) |
32
+ | `skill_trigger_eval` | `ro` | _(stub)_ | [`consumer_tool_catalog.json:98`](../../scripts/mcp_server/consumer_tool_catalog.json#L98) | _stub-only_ |
33
+ | `suggest_command` | `ro` | _(stub)_ | [`consumer_tool_catalog.json:114`](../../scripts/mcp_server/consumer_tool_catalog.json#L114) | _stub-only_ |
34
+ | `suggest_skill_for_task` | `ro` | _(stub)_ | [`consumer_tool_catalog.json:129`](../../scripts/mcp_server/consumer_tool_catalog.json#L129) | _stub-only_ |
35
+ | `mine_session` | `ro` | _(stub)_ | [`consumer_tool_catalog.json:144`](../../scripts/mcp_server/consumer_tool_catalog.json#L144) | _stub-only_ |
36
+ | `update_form_request_messages` | `fs-write` | _(stub)_ | [`consumer_tool_catalog.json:158`](../../scripts/mcp_server/consumer_tool_catalog.json#L158) | _stub-only_ |
37
+ | `sync_gitignore` | `fs-write` | _(stub)_ | [`consumer_tool_catalog.json:173`](../../scripts/mcp_server/consumer_tool_catalog.json#L173) | _stub-only_ |
38
+ | `sync_agent_settings` | `fs-write` | _(stub)_ | [`consumer_tool_catalog.json:186`](../../scripts/mcp_server/consumer_tool_catalog.json#L186) | _stub-only_ |
39
+ | `run_tests` | `shell` | _(stub)_ | [`consumer_tool_catalog.json:200`](../../scripts/mcp_server/consumer_tool_catalog.json#L200) | _stub-only_ |
40
+ | `run_quality_checks` | `shell` | _(stub)_ | [`consumer_tool_catalog.json:214`](../../scripts/mcp_server/consumer_tool_catalog.json#L214) | _stub-only_ |
41
+ | `list_skills` | `ro` | stdio | [`consumer_tool_catalog.json:227`](../../scripts/mcp_server/consumer_tool_catalog.json#L227) | [`tools.py:631`](../../scripts/mcp_server/tools.py#L631) |
42
+ | `list_commands` | `ro` | stdio | [`consumer_tool_catalog.json:234`](../../scripts/mcp_server/consumer_tool_catalog.json#L234) | [`tools.py:644`](../../scripts/mcp_server/tools.py#L644) |
43
+ | `list_rules` | `ro` | stdio | [`consumer_tool_catalog.json:241`](../../scripts/mcp_server/consumer_tool_catalog.json#L241) | [`tools.py:657`](../../scripts/mcp_server/tools.py#L657) |
44
+ | `compile_router` | `shell` | _(stub)_ | [`consumer_tool_catalog.json:248`](../../scripts/mcp_server/consumer_tool_catalog.json#L248) | _stub-only_ |
45
+ | `read_resource_body` | `ro` | stdio | [`consumer_tool_catalog.json:261`](../../scripts/mcp_server/consumer_tool_catalog.json#L261) | [`tools.py:670`](../../scripts/mcp_server/tools.py#L670) |
46
+
47
+ ## Glossary
48
+
49
+ - **Side-effect** — `ro` (read-only) · `fs-write` (filesystem write) · `shell` (spawns processes).
50
+ - **Transports** — `stdio` (`scripts/mcp_server/`) · `worker` (`workers/mcp/`). A tool may live on both.
51
+ - **Stub** — catalog-listed for discovery; returns the `not_implemented` envelope from
52
+ [`mcp-tool-stub-envelope.md`](mcp-tool-stub-envelope.md) until promoted.
53
+
@@ -0,0 +1,102 @@
1
+ ---
2
+ stability: stable
3
+ ---
4
+
5
+ # Measurement baseline — contract
6
+
7
+ > **Status:** locked 2026-05-16 · **Owner:** `step-4-measurement-and-benchmark.md`
8
+ > · **Cited by:** every P2 enforcement roadmap (skill rationalization G0, north-star G1, compression default decision).
9
+
10
+ Single source of truth for what `task bench` measures, what counts as
11
+ drift, and what unblocks enforcement. Read this before pinning a number
12
+ to a roadmap or PR description.
13
+
14
+ ## What `task bench` measures
15
+
16
+ Four axes, all numeric, all reproducible from the same input:
17
+
18
+ | Axis | Source | Definition | Units |
19
+ |---|---|---|---:|
20
+ | **selection accuracy** | [`scripts/bench_runner.py`](../../scripts/bench_runner.py) | Keyword-overlap ranker hits the expected skill in top-K | % |
21
+ | **cost** | [`scripts/cost/track.mjs`](../../scripts/cost/track.mjs) session jsonl | Token+USD per model, captured live | USD |
22
+ | **quality** | regex / rubric assertions per prompt | `quality_assertion` matches in agent output | % |
23
+ | **projection fidelity** | [`scripts/bench_per_tool.py`](../../scripts/bench_per_tool.py) | `accuracy(tool) / accuracy(augment)` for skill-projecting tools | ratio |
24
+
25
+ Schemas: [`benchmark-report-schema.md`](benchmark-report-schema.md) ·
26
+ [`benchmark-corpus-spec.md`](benchmark-corpus-spec.md). Reports land at
27
+ `bench/reports/<utc-stamp>-<corpus>[-projection].{json,md}` —
28
+ timestamped, never overwritten, content-addressed by run.
29
+
30
+ ## Corpora — frozen for the soak window
31
+
32
+ | Corpus | Path | Prompts | Purpose |
33
+ |---|---|---:|---|
34
+ | `dev` | [`tests/eval/corpus-dev.yaml`](../../tests/eval/corpus-dev.yaml) | 10 | Developer task surface (Laravel/Symfony/React/CI/PR) |
35
+ | `non-dev` | [`tests/eval/corpus-non-dev.yaml`](../../tests/eval/corpus-non-dev.yaml) | 16 | Founder / agency / content creator surface (Wing-4) |
36
+
37
+ Total 26 prompts ≥ Acceptance Criteria floor of 25. Mid-window edits
38
+ to either YAML restart the 60-day clock per
39
+ [`compression-default-kill-criterion.md`](compression-default-kill-criterion.md) § 2.
40
+
41
+ ## What counts as drift
42
+
43
+ [`scripts/bench_drift_check.py`](../../scripts/bench_drift_check.py)
44
+ compares the latest report against a sliding window of the prior N runs
45
+ (default 5) for the same corpus.
46
+
47
+ | Axis | Threshold | Note |
48
+ |---|---|---|
49
+ | selection accuracy | latest − baseline_mean ≤ −5 pp | always evaluated |
50
+ | cost | latest / baseline_mean ≥ 1.20 (+20 %) | only when both sides have `source: captured` |
51
+ | quality | latest − baseline_mean ≤ −10 pp | skipped when latest is `not_collected` |
52
+ | projection fidelity | tool fidelity < 0.85 | exit 1 from `task bench:projection` |
53
+
54
+ Drift exits with code 2 from `task bench:drift`. **CI posture during
55
+ soak:** all bench-drift steps `continue-on-error: true` and post a
56
+ sticky PR comment — informational only, not a merge gate. Flip to
57
+ required check happens via a separate PR once
58
+ `task bench:baseline-ready` returns 0 (see below).
59
+
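The first three drift thresholds can be sketched as plain comparisons against the mean of the prior runs. This is an illustration, not the logic of `scripts/bench_drift_check.py`: the function names are hypothetical, the cost rule is read as a ratio of at least 1.20, and `None` is an assumed stand-in for a `not_collected` quality value.

```python
from statistics import mean
from typing import Optional, Sequence


def accuracy_drift(latest_pct: float, baseline_pcts: Sequence[float]) -> bool:
    """Drift when accuracy falls 5 percentage points or more below the mean."""
    return latest_pct - mean(baseline_pcts) <= -5.0


def cost_drift(latest_usd: float, baseline_usds: Sequence[float]) -> bool:
    """Drift when cost is 20 % or more above the baseline mean.

    The caller is expected to skip this check unless both sides were captured.
    """
    return latest_usd / mean(baseline_usds) >= 1.20


def quality_drift(latest_pct: Optional[float], baseline_pcts: Sequence[float]) -> bool:
    """Drift when quality falls 10 pp or more; skipped when not collected."""
    if latest_pct is None:
        return False
    return latest_pct - mean(baseline_pcts) <= -10.0


window = [86.0, 84.0, 85.0, 85.0, 85.0]  # prior N = 5 runs, mean 85.0
print(accuracy_drift(80.0, window), accuracy_drift(81.0, window))
```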
60
+ ## What unblocks enforcement (the G1 gate)
61
+
62
+ ```
63
+ TASK bench:baseline-ready EXIT 0 IS THE ONLY AUTHORITY.
64
+ NO ANECDOTE, NO INDIVIDUAL REPORT, NO ROADMAP-SIDE OVERRIDE.
65
+ ```
66
+
67
+ [`scripts/bench_baseline_ready.py`](../../scripts/bench_baseline_ready.py)
68
+ returns exit 0 iff both:
69
+
70
+ 1. **Wall-clock soak:** `today − bench/baseline-start.txt ≥ --min-days` (default 60)
71
+ 2. **Report density:** `bench/reports/*-<corpus>.json` count ≥ `--min-reports` (default 30)
72
+
73
+ Soak start anchored at [`bench/baseline-start.txt`](../../bench/baseline-start.txt)
74
+ = **2026-05-16**. Earliest possible flip: **2026-07-15**, contingent
75
+ on the 30-report floor.
76
+
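The two gate conditions and the quoted flip date can be checked with a few lines of date arithmetic. The sketch assumes the start date and report count are already loaded; `baseline_ready` is a hypothetical function, not the interface of `scripts/bench_baseline_ready.py`.

```python
from datetime import date, timedelta


def baseline_ready(start: date, today: date, report_count: int,
                   min_days: int = 60, min_reports: int = 30) -> bool:
    """True only when both the wall-clock soak and the report floor are met."""
    soaked = (today - start).days >= min_days
    dense = report_count >= min_reports
    return soaked and dense


soak_start = date(2026, 5, 16)
earliest_flip = soak_start + timedelta(days=60)
print(earliest_flip.isoformat())
print(baseline_ready(soak_start, earliest_flip, report_count=30))
```

Sixty days after 2026-05-16 is 2026-07-15, which matches the earliest flip date stated above; with fewer than 30 reports the gate stays closed even after that date.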
77
+ Downstream consumers:
78
+
79
+ - `step-99-north-star-restructure.md` § Acceptance G1 — reads this exit code.
80
+ - [`compression-default-kill-criterion.md` § 3](compression-default-kill-criterion.md) — reads the decision table after baseline closes.
81
+ - `step-2-skill-inventory-rationalization.md` § G0 — usage-data soak floor.
82
+
83
+ ## What the closeout writes
84
+
85
+ On baseline closure, the step-4 closeout writes the numeric verdict to
86
+ [`docs/parity/bench.json`](../parity/bench.json) — frozen snapshot with
87
+ the 30+ reports averaged, drift verdict, and the compression-default
88
+ decision per the kill-criterion table. That file is the artefact every
89
+ P2 roadmap reads — not the live `bench/reports/` directory.
90
+
91
+ ## Carve-outs
92
+
93
+ - **Pricing freshness:** [`bench/pricing.yaml`](../../bench/pricing.yaml) rows must carry `sourced_on: YYYY-MM-DD`. Stale prices = stale numbers = no trust (ruflo "measured-vs-claimed" pattern).
94
+ - **Subjective grading excluded:** quality scoring is mechanical via `quality_assertion`. No vibes.
95
+ - **Cursor / Cline / Windsurf:** rules-only surfaces, no SKILL.md projection. `bench:projection` reports them as `not_applicable` — the gap is acknowledged, not silently dropped.
96
+
97
+ ## Cross-references
98
+
99
+ - [`benchmark-report-schema.md`](benchmark-report-schema.md) · per-report JSON schema
100
+ - [`benchmark-corpus-spec.md`](benchmark-corpus-spec.md) · corpus YAML schema
101
+ - [`compression-default-kill-criterion.md`](compression-default-kill-criterion.md) · decision table read by step-4 closeout
102
+ - `step-4-measurement-and-benchmark.md` · the owning roadmap