@event4u/agent-config 2.18.0 → 2.20.0
- package/.agent-src/commands/agent-status.md +29 -0
- package/.agent-src/commands/onboard.md +221 -81
- package/.agent-src/commands/refine-ticket.md +3 -0
- package/.agent-src/packs/README.md +49 -0
- package/.agent-src/packs/agency-delivery.yml +63 -0
- package/.agent-src/packs/content-engine.yml +53 -0
- package/.agent-src/packs/founder-mvp.yml +51 -0
- package/.agent-src/personas/README.md +8 -0
- package/.agent-src/presets/README.md +26 -0
- package/.agent-src/presets/balanced.yml +34 -0
- package/.agent-src/presets/fast.yml +31 -0
- package/.agent-src/presets/strict.yml +38 -0
- package/.agent-src/profiles/README.md +29 -0
- package/.agent-src/profiles/agency.yml +27 -0
- package/.agent-src/profiles/content_creator.yml +25 -0
- package/.agent-src/profiles/developer.yml +26 -0
- package/.agent-src/profiles/finance.yml +24 -0
- package/.agent-src/profiles/founder.yml +25 -0
- package/.agent-src/profiles/ops.yml +25 -0
- package/.agent-src/rules/no-cheap-questions.md +25 -17
- package/.agent-src/skills/adr-create/SKILL.md +78 -68
- package/.agent-src/skills/refine-ticket/SKILL.md +3 -0
- package/.agent-src/skills/subagent-orchestration/SKILL.md +33 -0
- package/.agent-src/templates/agents/agent-project-settings.example.yml +1 -1
- package/.agent-src/templates/skill-archive-note.md +101 -0
- package/.agent-src/user-types/README.md +124 -0
- package/.agent-src/user-types/_template/user-type.md +95 -0
- package/.agent-src/user-types/galabau-field-crew.md +100 -0
- package/.agent-src/user-types/metalworking-shop.md +105 -0
- package/.agent-src/user-types/truck-driver.md +113 -0
- package/.claude-plugin/marketplace.json +1 -1
- package/CHANGELOG.md +91 -30
- package/README.md +68 -72
- package/config/agent-settings.template.yml +22 -0
- package/docs/adrs/caveman/0001-default-off-until-bench.md +93 -0
- package/docs/adrs/caveman/README.md +9 -0
- package/docs/adrs/cost/0001-hard-stop-hook.md +114 -0
- package/docs/adrs/cost/README.md +9 -0
- package/docs/adrs/memory/0001-consumer-side-snapshot.md +111 -0
- package/docs/adrs/memory/README.md +9 -0
- package/docs/adrs/router/0001-three-tier-routing.md +119 -0
- package/docs/adrs/router/README.md +9 -0
- package/docs/adrs/schema/0001-json-schema-frontmatter.md +102 -0
- package/docs/adrs/schema/README.md +9 -0
- package/docs/adrs/smoke/0001-per-tier-smoke-scripts.md +99 -0
- package/docs/adrs/smoke/README.md +9 -0
- package/docs/architecture/current-onboard-baseline.md +126 -0
- package/docs/architecture/current-safety-behavior.md +137 -0
- package/docs/archive/CHANGELOG-pre-2.16.0.md +48 -0
- package/docs/contracts/adr-layout.md +108 -0
- package/docs/contracts/adr-mcp-runtime.md +128 -0
- package/docs/contracts/adr-user-types-axis.md +127 -0
- package/docs/contracts/benchmark-corpus-spec.md +97 -0
- package/docs/contracts/benchmark-report-schema.md +111 -0
- package/docs/contracts/command-clusters.md +1 -0
- package/docs/contracts/command-taxonomy.md +137 -0
- package/docs/contracts/compression-default-kill-criterion.md +69 -0
- package/docs/contracts/config-presets.md +144 -0
- package/docs/contracts/cost-dashboard.md +143 -0
- package/docs/contracts/cost-enforcement.md +134 -0
- package/docs/contracts/file-ownership-matrix.json +0 -7
- package/docs/contracts/mcp-tool-inventory.md +53 -0
- package/docs/contracts/measurement-baseline.md +102 -0
- package/docs/contracts/namespace.md +125 -0
- package/docs/contracts/profile-system.md +142 -0
- package/docs/contracts/safety-model.md +129 -0
- package/docs/contracts/smoke-contracts.md +144 -0
- package/docs/contracts/user-type-schema.md +146 -0
- package/docs/contracts/workflow-packs.md +121 -0
- package/docs/decisions/ADR-010-profile-pack-preset-boundary.md +132 -0
- package/docs/decisions/INDEX.md +1 -0
- package/docs/featured-commands.md +27 -0
- package/docs/parity/bench-ruflo.json +58 -0
- package/docs/parity/bench.json +41 -0
- package/docs/parity/ruflo.md +46 -0
- package/docs/profiles.md +91 -0
- package/docs/recruits/_template.md +81 -0
- package/package.json +1 -1
- package/scripts/_cli/cmd_explain.py +250 -0
- package/scripts/_lib/bench_cost.py +138 -0
- package/scripts/_lib/bench_quality.py +118 -0
- package/scripts/_lib/bench_report.py +150 -0
- package/scripts/agent-config +13 -0
- package/scripts/audit_adr_coverage.py +175 -0
- package/scripts/audit_mcp_tools.py +146 -0
- package/scripts/bench_baseline_ready.py +108 -0
- package/scripts/bench_drift_check.py +151 -0
- package/scripts/bench_per_tool.py +216 -0
- package/scripts/bench_run.py +155 -0
- package/scripts/compress.py +48 -2
- package/scripts/config/__init__.py +9 -0
- package/scripts/config/presets.py +206 -0
- package/scripts/config/profiles.py +173 -0
- package/scripts/cost/budget.mjs +73 -12
- package/scripts/cost/preflight.mjs +89 -0
- package/scripts/lint_archived_skills.py +143 -0
- package/scripts/lint_bench_corpus.py +161 -0
- package/scripts/lint_namespace.py +135 -0
- package/scripts/schemas/user-type.schema.json +35 -0
- package/scripts/skill_linter.py +139 -4
- package/scripts/skill_overlap.py +204 -0
- package/scripts/skill_tools/audit_user_type_coverage.py +148 -0
- package/scripts/skill_usage_collect.py +191 -0
- package/scripts/skill_usage_report.py +162 -0
- package/scripts/smoke/kernel.sh +101 -0
- package/scripts/smoke/router.sh +129 -0
- package/scripts/smoke/schema.sh +71 -0
- package/scripts/smoke/skills.sh +101 -0
@@ -0,0 +1,127 @@
+---
+stability: beta
+keep-beta-until: 2026-08-14
+---
+
+# ADR — Runtime user-types axis (review lens, parallel to personas)
+
+> **Status:** Decided · 2026-05-15
+> **Source:** user-authored brief (no council session — direct user spec)
+> **Sibling axis (distinct layer):** [`adr-install-user-type-axis`](adr-install-user-type-axis.md) — install-time `personal.user_type` filter; same vocabulary, different layer
+
+## Context
+
+The persona axis (`personas/`) was overloaded with two semantics:
+
+1. **Methodology lenses** — `qa`, `senior-engineer`, `critical-challenger`,
+   `developer`, `product-owner`. These voices answer: *how* we review.
+2. **End-user simulations** — proposals like `galabau-field-crew`,
+   `truck-driver`, `metalworking-shop`. These voices answer: *who*
+   experiences the software.
+
+Mixing the two collapses the taxonomy. A `qa` reviewer applies QA
+methodology regardless of which end-user the software serves. A
+`galabau-field-crew` is not a review methodology — it is the end-user
+viewpoint a methodology reviewer should adopt while reviewing.
+
+The composition the system needs is orthogonal:
+
+```
+/refine-ticket --personas=qa --user-type=truck-driver PROJ-123
+```
+
+QA methodology applied through a truck-driver end-user lens. Two
+axes, one orthogonal product.
+
+## Decision
+
+Split into a parallel axis. Add `.agent-src.uncompressed/user-types/`
+as a first-class directory mirroring the persona pipeline:
+
+- Source dir: `.agent-src.uncompressed/user-types/`
+- Schema doc: [`user-type-schema`](user-type-schema.md) — 7-section spine, ≤ 120 lines
+- JSON schema: [`scripts/schemas/user-type.schema.json`](../../scripts/schemas/user-type.schema.json)
+- Linter: `scripts/skill_linter.py § lint_usertype`
+- CLI surface: `/refine-ticket --user-type=<id>` (single id in v1)
+- Composition: `--user-type=` and `--personas=` compose orthogonally
+
+Persona surface is **byte-identical** after this work. No persona
+moves, no schema change to `persona-schema.md` or `persona.schema.json`,
+no behaviour change to `--personas=`. The three seed user-types
+(`galabau-field-crew`, `metalworking-shop`, `truck-driver`) are born
+as user-types — existing personas stay as personas.
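
Editor's sketch (not part of the diff): one way the two flags could compose, assuming an `argparse`-style front end. The function name `parse_review_flags` and the comma-separated form of `--personas=` are illustrative assumptions; the ADR only fixes that `--user-type=` takes a single id in v1 and that the two axes are independent.

```python
import argparse


def parse_review_flags(argv):
    """Parse the methodology axis and the end-user axis independently.

    Hypothetical helper: the real /refine-ticket parser is not shown in the diff.
    """
    parser = argparse.ArgumentParser(prog="refine-ticket")
    # Methodology axis. Assumption: a comma-separated list of persona ids.
    parser.add_argument("--personas", default="")
    # End-user axis. The ADR locks this to a single id in v1.
    parser.add_argument("--user-type", dest="user_type", default=None)
    parser.add_argument("ticket")
    args = parser.parse_args(argv)

    personas = [p for p in args.personas.split(",") if p]
    if args.user_type is not None and "," in args.user_type:
        # Multi-user-type composition is deferred to v2.
        parser.error("--user-type takes a single id in v1")
    return personas, args.user_type, args.ticket
```

Keeping the two values separate all the way to the return tuple mirrors the ADR's point: neither axis needs to know the other exists.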
+
+## Consequences
+
+**Additive surface:**
+
+- One new CLI flag (`--user-type=`)
+- One new schema doc + JSON schema
+- One new linter hook (`lint_usertype` + classifier branch)
+- One new directory (`user-types/`) projected the same way `personas/`
+  is
+- Three seed files at merge time; consumer projects add their own
+  domain-specific user-types under `.agent-src/user-types/`
+
+**Locked v1 boundaries:**
+
+- CLI-only. Skills do NOT declare a default `user-types:` frontmatter
+  key in v1. Migration path to v2: if usage patterns show > 3 skills
+  citing the same user-type default, add the key and the audit script
+  to mirror `recommended_for_user_types` discipline (smaller surface
+  now, additive later).
+- Single user-type per invocation (`--user-type=<id>`, not a list).
+  Multi-user-type composition deferred to v2 with a one-line note —
+  it requires synthesis logic that does not exist yet and would block
+  v1 on a non-load-bearing nice-to-have.
+- **Review lens only.** User-types never provide trade execution
+  instructions. Guardrails encoded in every file's `Anti-Patterns`
+  section per [`user-type-schema § 5`](user-type-schema.md#-5--guardrails-encoded-in-every-anti-patterns-block).
+- **Anti-Generic Quality Bar.** Every user-type encodes ≥ 5 concrete,
+  domain-specific review points and ≥ 3 Unique Questions no other
+  persona asks verbatim. Generic prose is rejected at lint or review
+  time.
+
+## Alternatives considered
+
+**Alt-1 — Extend persona schema with a `subtype: end-user` discriminator.**
+Rejected. Same physical file, two semantics, two enforcement paths
+inside one linter hook. Scales worse: every persona-consuming surface
+(`--personas=`, `lint_persona`, `audit_persona_coverage.py`) would
+need a branch on `subtype` to know whether the artefact is a
+methodology lens or an end-user lens. The clean axis split is a
+single fork-point at the classifier; the subtype fork-point recurs at
+every consumption site.
+
+**Alt-2 — Reuse the existing `user-types/` (install-time) directory
+for runtime lenses.** Rejected. The install-time axis stores YAML
+configs filtering *which skills load*; the runtime axis stores
+Markdown lenses filtering *whose viewpoint a review adopts*. Same
+vocabulary, completely different content shape (YAML key-value vs.
+Markdown prose + frontmatter), completely different consumer
+(`scripts/install.sh` vs. `refine-ticket`). Co-locating them would
+force a single `kind:` discriminator on a directory whose two halves
+do not share a schema. The separation is in different physical paths
+(`user-types/` root vs. `.agent-src.uncompressed/user-types/`) and
+the vocabulary overlap is deliberate per [`adr-install-user-type-axis`](adr-install-user-type-axis.md).
+
+**Alt-3 — Defer the axis until end-user lenses prove themselves in
+the field.** Rejected. The methodology / end-user overload is already
+producing taxonomy drift in the persona README (review-lens vs
+end-user examples mixing). Splitting now is cheaper than splitting
+after three more end-user "personas" land.
+
+## Migration
+
+No migration. v1 ships with three seed user-types born under the new
+axis. Existing personas (`qa`, `developer`, `senior-engineer`,
+`product-owner`, `stakeholder`, `critical-challenger`, `ai-agent`,
+plus specialists) stay put. The `personas/README.md` gains one
+cross-link sentence pointing readers at the parallel axis when their
+intent is end-user simulation rather than methodology review.
+
+## See also
+
+- [`user-type-schema`](user-type-schema.md) — locked shape, 7-section spine, size budget, quality bar
+- [`persona-schema`](persona-schema.md) — sister axis (untouched by this ADR)
+- [`adr-install-user-type-axis`](adr-install-user-type-axis.md) — install-time `personal.user_type` filter (distinct layer)
@@ -0,0 +1,97 @@
+---
+stability: beta
+keep-beta-until: 2026-08-14
+---
+
+# Benchmark Corpus Spec — step-4 Phase 1
+
+Parser-visible contract for the golden corpus consumed by
+[`scripts/bench_runner.py`](../../scripts/bench_runner.py) and the
+upcoming `scripts/lint_bench_corpus.py`. Defines composition, schema,
+and validation invariants.
+
+## Path decision
+
+Roadmap `step-4-measurement-and-benchmark.md`
+Phase 1 Step 2 names `bench/corpus.yaml`. The existing benchmark
+infrastructure (runner + non-dev corpus + `task bench`) lives under
+`tests/eval/` and `scripts/bench_runner.py` hardcodes that directory.
+**Canonical location:** `tests/eval/corpus-<id>.yaml`. The `bench/`
+directory is reserved for **reports + pricing** (Phase 2 deliverables).
+Migration to `bench/corpus.yaml` is a no-op rename if downstream Phase
+2 work proves the consolidation is worth the diff cost.
+
+## Composition (25 prompts)
+
+| Bucket | Count | Purpose |
+|---|---|---|
+| **Routing-canonical** | 10 | One prompt per major skill cluster — exact-match scoring |
+| **Ambiguous** | 8 | Multiple plausible skills — set-intersection ≥ 0.7 scoring |
+| **Destructive / security carve-out** | 5 | Triggers a safety floor — selection must surface the floor skill |
+| **Long-context** | 2 | ≥ 4 k input tokens — exercises retrieval under context pressure |
+
+The 10 routing-canonical prompts MUST cover the kernel + tier-1 skill
+clusters used by the dev profile (`developer.yml`). The 8 ambiguous
+prompts MUST each declare ≥ 2 acceptable skills in `expected_skills`.
+The 5 destructive / security prompts MUST declare an
+`expected_carve_outs` value (e.g. `security-sensitive-stop`,
+`non-destructive-by-default`).
+
+## Schema
+
+```yaml
+version: 1                          # corpus format version (int)
+corpus_id: <id>                     # short kebab-case identifier
+selection_accuracy_target: 0.60     # 0.0–1.0; runner exits non-zero below
+prompts:
+  - id: <bucket>-<NN>               # e.g. canonical-01, ambiguous-03
+    category: <bucket>              # canonical | ambiguous | destructive | long-context
+    user_type_candidates: [<slug>, ...]   # optional; informational
+    language: en                    # en | de — per language-and-tone
+    prompt: "<text>"                # the agent-facing prompt
+    expected_skills: [<slug>, ...]  # ≥ 1 entry; non-empty
+    expected_carve_outs: [<slug>, ...]    # required when category == destructive
+    rubric:                         # optional structural assertion
+      must_include: ["<phrase>", ...]     # all phrases must appear in output
+      must_not_include: ["<phrase>", ...]
+      length_words: { min: 0, max: 0 }
+    quality_assertion: "<regex>"    # optional regex over agent output
+```
+
+### Invariants (lint-bench gate)
+
+| Drift | `reason` | Example |
+|---|---|---|
+| Missing top-level `version` / `corpus_id` / `prompts` | `missing_top_level` | — |
+| `version` not in `{1}` | `unsupported_version` | `version: 2` |
+| `selection_accuracy_target` outside `[0.0, 1.0]` | `target_out_of_range` | `1.5` |
+| Duplicate `id` across prompts | `duplicate_id` | two `canonical-01` |
+| `id` does not match `^[a-z][a-z0-9-]*-\d{2}$` | `bad_id_format` | `Canonical_1` |
+| `category` not in `{canonical, ambiguous, destructive, long-context}` | `bad_category` | `category: misc` |
+| `language` not in `{en, de}` | `bad_language` | `language: fr` |
+| `expected_skills` empty / missing | `empty_expected` | `expected_skills: []` |
+| `expected_skills` references an unknown skill slug | `unknown_skill` | `expected_skills: [imaginary]` |
+| `category == destructive` without `expected_carve_outs` | `missing_carve_out` | — |
+| Prompt text empty / whitespace-only | `empty_prompt` | — |
+
+The linter MUST run with `--quiet` honour per the script-output
+convention and emit one violation per line in non-quiet mode.
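
Editor's sketch (not part of the diff): the invariants table could be checked roughly as below. This is not the shipped `scripts/lint_bench_corpus.py`; `lint_corpus` and its `(prompt_id, reason)` return shape are assumptions, the corpus is taken as an already-parsed dict so no YAML library is needed, and a missing `language` is treated as `bad_language`, which the table does not state explicitly.

```python
import re

ID_PATTERN = re.compile(r"^[a-z][a-z0-9-]*-\d{2}$")
CATEGORIES = {"canonical", "ambiguous", "destructive", "long-context"}
LANGUAGES = {"en", "de"}


def lint_corpus(corpus, known_skills):
    """Return (prompt_id, reason) pairs; prompt_id is None for top-level drift."""
    violations = []
    if any(key not in corpus for key in ("version", "corpus_id", "prompts")):
        violations.append((None, "missing_top_level"))
    if "version" in corpus and corpus["version"] not in {1}:
        violations.append((None, "unsupported_version"))
    target = corpus.get("selection_accuracy_target")
    if target is not None and not 0.0 <= target <= 1.0:
        violations.append((None, "target_out_of_range"))

    seen_ids = set()
    for prompt in corpus.get("prompts", []):
        pid = prompt.get("id", "")
        if pid in seen_ids:
            violations.append((pid, "duplicate_id"))
        seen_ids.add(pid)
        if not ID_PATTERN.match(pid):
            violations.append((pid, "bad_id_format"))
        if prompt.get("category") not in CATEGORIES:
            violations.append((pid, "bad_category"))
        if prompt.get("language") not in LANGUAGES:  # assumption: missing counts as bad
            violations.append((pid, "bad_language"))
        skills = prompt.get("expected_skills") or []
        if not skills:
            violations.append((pid, "empty_expected"))
        elif any(slug not in known_skills for slug in skills):
            violations.append((pid, "unknown_skill"))
        if prompt.get("category") == "destructive" and not prompt.get("expected_carve_outs"):
            violations.append((pid, "missing_carve_out"))
        if not str(prompt.get("prompt", "")).strip():
            violations.append((pid, "empty_prompt"))
    return violations
```

Returning reasons as data rather than printing them leaves the `--quiet` decision to the caller.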
+
+## Composition gates (25-prompt-complete state)
+
+Once `corpus-dev.yaml` reaches the 25-prompt target, the linter
+additionally enforces the per-bucket counts above. Until then, the
+linter only enforces per-prompt invariants — partial corpora are
+valid during Phase 1 build-out.
+
+The composition gate is opt-in via `--require-full` to keep the
+reduced 10-prompt suite (Phase 1 Step 4) usable during development
+without tripping CI.
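
Editor's sketch (not part of the diff): the opt-in gate could look like this, assuming the bucket counts from the composition table map onto the four `category` values. `check_composition` and its tuple output are hypothetical names; only `--require-full` comes from the spec.

```python
from collections import Counter

# Per-bucket targets from the composition table (10 + 8 + 5 + 2 = 25).
REQUIRED_COUNTS = {"canonical": 10, "ambiguous": 8, "destructive": 5, "long-context": 2}


def check_composition(prompts, require_full=False):
    """Return (category, found, required) for each bucket that misses its target.

    With require_full=False (the default) partial corpora always pass.
    """
    if not require_full:
        return []
    counts = Counter(p.get("category") for p in prompts)
    return [
        (category, counts.get(category, 0), required)
        for category, required in REQUIRED_COUNTS.items()
        if counts.get(category, 0) != required
    ]
```

A reduced 10-prompt suite therefore passes unless the caller explicitly asks for the full gate.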
+
+## Cross-references
+
+- Runner — [`scripts/bench_runner.py`](../../scripts/bench_runner.py)
+- Linter — `scripts/lint_bench_corpus.py` (Phase 1 Step 3)
+- Existing non-dev corpus — [`tests/eval/corpus-non-dev.yaml`](../../tests/eval/corpus-non-dev.yaml)
+- Language gate — [`language-and-tone`](../../.agent-src.uncompressed/rules/language-and-tone.md)
+- Report schema — `docs/contracts/benchmark-report-schema.md` (Phase 2 Step 4)
@@ -0,0 +1,111 @@
+---
+stability: beta
+keep-beta-until: 2026-08-14
+---
+
+# Benchmark Report Schema — step-4 Phase 2
+
+Parser-visible contract for the JSON + Markdown reports emitted by
+[`scripts/bench_run.py`](../../scripts/bench_run.py). Every `task bench`
+run writes one `bench/reports/<ts>-<corpus_id>.json` + matching `.md`.
+
+## File layout
+
+```
+bench/
+├── pricing.yaml                        # per-1M model rates + sourced_on dates
+└── reports/
+    ├── 2026-05-16T10-30-00Z-dev.json   # machine-readable
+    ├── 2026-05-16T10-30-00Z-dev.md     # human-readable
+    └── ...
+```
+
+Filename format: `<UTC ISO-8601 with `:` → `-`>-<corpus_id>.{json,md}`.
+Sortable lexicographically.
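
Editor's sketch (not part of the diff): the filename rule can be reproduced with the standard library alone. `report_basename` is a hypothetical helper name, and it assumes the caller passes a timezone-aware `datetime`.

```python
from datetime import datetime, timezone


def report_basename(generated_at, corpus_id):
    """Build '<UTC timestamp with ':' replaced by '-'>-<corpus_id>' (no extension)."""
    if generated_at.tzinfo is None:
        raise ValueError("generated_at must be timezone-aware")
    utc = generated_at.astimezone(timezone.utc)
    # Fixed-width fields keep names sortable as plain strings.
    stamp = utc.strftime("%Y-%m-%dT%H-%M-%SZ")
    return f"{stamp}-{corpus_id}"
```

For the example in the file layout, `report_basename(datetime(2026, 5, 16, 10, 30, tzinfo=timezone.utc), "dev")` gives `2026-05-16T10-30-00Z-dev`; appending `.json` and `.md` yields the report pair.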
+
+## JSON schema (v1)
+
+```yaml
+schema_version: 1
+generated_at: <ISO-8601 UTC>
+corpus:
+  id: <corpus_id>
+  path: tests/eval/corpus-<id>.yaml
+  prompt_count: <int>
+runner:
+  bench_run_version: <semver>
+  baseline_collector: scripts/bench_runner.py   # selection-accuracy floor
+  baseline_collector_sha: <git-sha-or-mtime>
+selection:
+  top_k: 3
+  prompts_hit: <int>
+  prompts_total: <int>
+  selection_accuracy: <float 0.0-1.0>           # hits / total
+  target: <float>                               # from corpus
+  passed: <bool>                                # accuracy >= target
+  per_prompt:                                   # one entry per corpus prompt
+    - id: canonical-01
+      expected_skills: [...]
+      top_k_ranked: [...]
+      hit: <bool>
+cost:
+  source: agents/cost-tracking/sessions.jsonl   # or "unavailable"
+  sessions_scanned: <int>
+  totals:
+    input_tokens: <int>
+    output_tokens: <int>
+    cache_read_input_tokens: <int>
+    cache_creation_input_tokens: <int>
+    total_cost_usd: <float>
+  per_tier:                                     # haiku / sonnet / opus / unknown
+    sonnet: { messages: <int>, cost_usd: <float> }
+    ...
+  pricing_sourced_on: <ISO date from bench/pricing.yaml>
+quality:
+  source: <path-or-"not_collected">
+  prompts_with_assertion: <int>
+  prompts_passing: <int>
+  quality_score: <float 0.0-1.0>                # passing / total OR 0.0 if not_collected
+  per_prompt:
+    - id: canonical-01
+      assertion: <regex-string>
+      assertion_kind: rubric.must_include | quality_assertion
+      passed: <bool | "not_collected">
+verdict:
+  selection: pass | fail
+  quality: pass | fail | not_collected
+  overall: pass | fail | partial                # partial = quality not_collected
+```
+
+## Markdown shape
+
+Headers in order:
+
+1. `# Benchmark Report — <corpus_id> · <generated_at>`
+2. `## Headline` — three-line summary (selection · cost · quality).
+3. `## Selection accuracy` — table per prompt with hit/miss + expected/got.
+4. `## Cost capture` — per-tier table + total; "unavailable" block if no
+   session jsonl was found.
+5. `## Quality probe` — per-prompt assertion pass/fail; `not_collected`
+   block when no agent-output path was passed.
+6. `## Notes` — pointer to `pricing.yaml`, `corpus path`, and the
+   versioned filename for citation.
+
+## Invariants
+
+- **No silent drops.** Missing cost source → emit `source: unavailable`
+  and `total_cost_usd: 0.0` with a marker; never omit the section.
+- **Quality stub honesty.** When agent outputs are not provided, set
+  `quality.source: not_collected` and `verdict.overall: partial`. Score
+  stays `0.0`; never inflate by assuming pass.
+- **Pricing dated.** Every cost row reads `sourced_on` from
+  `bench/pricing.yaml`. Stale price (> 90 days) → warning line in the
+  Markdown footer.
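
Editor's sketch (not part of the diff): the verdict block and the pricing-staleness rule could be derived as follows. Both function names are hypothetical. Two points are assumptions where the contract is silent: `overall` is `partial` whenever quality was not collected (even if selection failed), and `overall: pass` requires both selection and quality to pass.

```python
from datetime import date

STALE_AFTER_DAYS = 90  # "Stale price (> 90 days)" from the invariants


def build_verdict(selection_accuracy, target, quality_passed=None):
    """Derive the verdict block; quality_passed=None means no agent outputs were given."""
    selection = "pass" if selection_accuracy >= target else "fail"
    if quality_passed is None:
        quality = "not_collected"
        overall = "partial"  # never inflate by assuming a quality pass
    else:
        quality = "pass" if quality_passed else "fail"
        overall = "pass" if selection == "pass" and quality == "pass" else "fail"
    return {"selection": selection, "quality": quality, "overall": overall}


def pricing_is_stale(sourced_on, today):
    """True when the pricing entry is more than 90 days old on `today`."""
    return (today - sourced_on).days > STALE_AFTER_DAYS
```

Passing `today` explicitly keeps the staleness check deterministic, which matters if the warning line is asserted in tests.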
+
+## Cross-references
+
+- Runner — [`scripts/bench_run.py`](../../scripts/bench_run.py)
+- Baseline collector — [`scripts/bench_runner.py`](../../scripts/bench_runner.py)
+- Corpus contract — [`benchmark-corpus-spec.md`](benchmark-corpus-spec.md)
+- Pricing source — [`bench/pricing.yaml`](../../bench/pricing.yaml)
+- Cost session reader (live sessions) — [`scripts/cost/track.mjs`](../../scripts/cost/track.mjs)
@@ -297,4 +297,5 @@ A command that fails either floor drops to **Tier-1** at the next minor release;
 - [`docs/migrations/commands-1.15.0.md`](../migrations/commands-1.15.0.md) — user-facing migration notes.
 - [`docs/contracts/STABILITY.md`](STABILITY.md) — `beta` level rules apply.
 - [`docs/contracts/command-surface-tiers.md`](command-surface-tiers.md) — what each tier means and what `--help` surfaces.
+- [`docs/contracts/command-taxonomy.md`](command-taxonomy.md) — profile axis (discoverability) layered on top of this verb axis (invocation).
 - [`.agent-src.uncompressed/contexts/contracts/artifact-engagement-flow.md`](../../.agent-src.uncompressed/contexts/contracts/artifact-engagement-flow.md) — sibling telemetry surface; same privacy floor and four-layer enforcement model.
@@ -0,0 +1,137 @@
+---
+stability: beta
+keep-beta-until: 2026-08-12
+---
+
+# Command taxonomy
+
+> **Status:** beta — first draft 2026-05-16 (Phase 2 Item 6 of
+> `step-15-product-refinement`).
+
+The taxonomy answers **"how is the command surface organized so each
+profile finds their three first commands in under 30 seconds?"** It is
+a **catalog-organization contract**, not an invocation-rename. Existing
+slash invocations (`/work`, `/fix ci`, `/research deep`) are preserved
+by the locked verb-cluster contract at
+[`command-clusters`](command-clusters.md). This file adds a **profile
+axis** on top of the verb axis without breaking either.
+
+## The two axes
+
+| Axis | Owner | Surface |
+|---|---|---|
+| **Verb-cluster** (existing) | [`command-clusters`](command-clusters.md) | Defines the invocation tree (`/fix ci` dispatches to the `ci` sub-command of the `fix` cluster). Linter-enforced. **Source of truth for invocation.** |
+| **Profile** (this contract) | [`profile-system`](profile-system.md) | Defines which verb-clusters and sub-commands are surfaced first for each profile (developer · content_creator · founder · agency · finance · ops). **Source of truth for discoverability.** |
+
+A command can be discoverable under multiple profiles. `/work` is
+universal — it appears in `commands_hint` for every profile. `/dcf-modeling`
+is finance-only. Discoverability is many-to-many; invocation stays
+single-source.
+
+## Membership rules
+
+### Profile membership
+
+A command appears in a profile's `commands_hint` (in
+`.agent-src.uncompressed/profiles/<id>.yml`) iff **all** hold:
+
+1. **First-week reach.** A user of that profile will reach for this
+   command within their first five sessions without being told.
+2. **Profile-coherent.** The command's domain matches the profile's
+   primary work surface (engineering for `developer`, content for
+   `content_creator`, etc.).
+3. **Verb-cluster owned.** The command exists in `command-clusters` —
+   no profile may declare a command that has not gone through the
+   verb-cluster linter.
+4. **Cap of five.** A profile's `commands_hint` is capped at five
+   entries. The cap is what makes "three first commands" possible.
+
+### Top-10 most-used (for alias / deprecation policy)
+
+The top-10 list is the **union of all six profiles' `commands_hint`
+lists, ranked by per-profile membership count**. As of 2026-05-16
+that union is, in rank order:
+
+1. `work` (6/6 profiles)
+2. `implement-ticket` (2/6 — developer, agency)
+3. `feature` (2/6 — founder, agency)
+4. `council` (2/6 — founder, finance)
+5. `challenge-me` (2/6 — founder, finance)
+6. `review-changes` (2/6 — developer, ops)
+7. `fix` (2/6 — developer, ops)
+8. `refine-ticket` (1/6 — agency)
+9. `commit` (1/6 — developer)
+10. `roadmap` (1/6 — agency)
+
+The top-10 is regenerated automatically from the profile YAMLs by
+`scripts/regen_top10.py` (Phase 2 deliverable — not yet shipped). Until
+the regen script lands, the list above is the locked snapshot.
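
Editor's sketch (not part of the diff): the ranking rule ("union of all `commands_hint` lists, ranked by per-profile membership count") can be expressed with a `Counter`. This is not the unshipped `scripts/regen_top10.py`. The `SNAPSHOT` mapping is reconstructed from the membership annotations in the list above, not read from the real profile YAMLs, which may hold further entries. The alphabetical tie-break is an assumption, so ties may come out in a different order than the locked snapshot.

```python
from collections import Counter

# Reconstructed from the ranked list above; hypothetical stand-in for the YAMLs.
SNAPSHOT = {
    "developer": ["work", "implement-ticket", "review-changes", "fix", "commit"],
    "content_creator": ["work"],
    "founder": ["work", "feature", "council", "challenge-me"],
    "agency": ["work", "implement-ticket", "feature", "refine-ticket", "roadmap"],
    "finance": ["work", "council", "challenge-me"],
    "ops": ["work", "review-changes", "fix"],
}


def rank_commands(profiles, limit=10):
    """Rank commands by how many profiles list them in commands_hint."""
    counts = Counter()
    for hints in profiles.values():
        counts.update(set(hints))  # one vote per profile, even if listed twice
    # Highest membership first; assumed tie-break: command name, ascending.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

On `SNAPSHOT`, `work` ranks first with 6 of 6 profiles, followed by six commands at 2 of 6 and three at 1 of 6, which matches the membership counts in the list.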
+
+## Backward-compat policy
+
+The top-10 commands carry a **two-release backward-compat guarantee**:
+
+- A rename of any top-10 command (whether by verb-cluster restructure
+  or profile-axis reorganization) ships with an alias for **at least
+  two minor releases**.
+- The alias is recorded in the verb-cluster's `Replaces` column in
+  [`command-clusters`](command-clusters.md) and re-emits a one-line
+  deprecation notice to stderr on every invocation.
+- Removing the alias requires the `bundled-always-rules-acknowledged`
+  PR label and an entry in the CHANGELOG `Removed` section naming the
+  end-of-deprecation release.
+
+Commands outside the top-10 follow the existing verb-cluster
+deprecation rules (one release as a shim, then disappear).
+
+## Discoverability surfaces
+
+Three surfaces consume this contract:
+
+| Surface | Path | What it shows |
+|---|---|---|
+| **README** | `README.md` § "Six entry paths" | Per-profile `commands_hint` (max 5) rendered as the first-commands list per profile block |
+| **Catalog** | `docs/catalog.md` | All commands grouped by verb-cluster (primary axis), with a per-command `profiles:` line listing which profiles surface it |
+| **Wizard** | `.agent-src.uncompressed/commands/onboard.md` | After role selection, prints the five-command starter list from the selected profile's `commands_hint` |
+
+The README and wizard surfaces are already wired. The catalog `profiles:`
+line is a Phase 2 deliverable.
+
+## What this contract does **not** do
+
+- **Does not** rename any command. Invocation stays flat (`/work`, not
+  `/dev/work`). The `/dev/...` / `/ops/...` strawman in the Item 6
+  roadmap entry is **rejected** — adding a profile prefix to invocation
+  would dual-namespace the surface, conflict with verb-cluster cluster
+  heads, and require a 124-command migration with no measurable
+  discoverability gain over the README + wizard surfaces above.
+- **Does not** modify the verb-cluster contract. `command-clusters`
+  remains the locked source of truth for invocation. This contract is
+  additive.
+- **Does not** ship telemetry. The top-10 is derived from declared
+  profile membership, not observed usage. A usage-based top-10
+  recomputation is deferred to Item 10 (Cost Governance Dashboard),
+  which already collects per-command call counts.
+
+## Open questions (post-beta)
+
+1. **Profile evolution.** When a seventh profile lands (e.g.
+   `researcher`), what is the membership review process for the
+   top-10? Proposal: any new profile triggers a `regen_top10.py` run
+   and a CHANGELOG entry; no manual review unless the top-10 order
+   changes.
+2. **Profile-prefix invocation.** If the no-rename verdict is
+   revisited (e.g. user research shows discoverability still fails
+   even with the README + wizard surfaces), a separate ADR records
+   the decision; this contract does not pre-authorize it.
+3. **Catalog generator.** `docs/catalog.md` is currently
+   handwritten. The `profiles:` line proposed in the discoverability
+   table requires `scripts/regen_catalog.py` to consume profile YAMLs
+   — deferred to its own roadmap step.
+
+## See also
+
+- [`command-clusters`](command-clusters.md) — verb-axis (invocation)
+- [`profile-system`](profile-system.md) — profile-axis (discoverability)
+- [`command-surface-tiers`](command-surface-tiers.md) — tier-axis (`./agent-config --help` visibility)
+- `step-15-product-refinement` § Phase 2 Item 6
@@ -0,0 +1,69 @@
+---
+stability: beta
+keep-beta-until: 2026-08-14
+---
+
+# Compression default — kill-criterion
+
+> **Status:** parked, criterion-deferred · **Owner:** `step-4-measurement-and-benchmark.md`
+> closeout phase · **Source:** [`council-synthesis.md` § 7](../../agents/audit-2026-05-14-north-star/council-synthesis.md)
+
+## Rule
+
+```
+DEFAULT STAYS OFF UNTIL `task bench` PRODUCES A NUMBER.
+DECISION OWNED BY step-4 CLOSEOUT, NOT BY THIS DOC OR BY step-99.
+```
+
+1. **Current state.** `caveman.speak_scope` defaults `off`. Carve-outs
+   (security · destructive · multi-step · code blocks · paths · numbered
+   options · Iron-Law markers) are documented in
+   [`caveman-speak`](../../.agent-src.uncompressed/rules/caveman-speak.md)
+   but the feature is non-promoted: no skill recommends turning it on,
+   no preset enables it, no profile depends on it.
+2. **Baseline window.** 60 days from the first green run of
+   `task bench` against the locked 25-prompt corpus
+   (`step-4-measurement-and-benchmark.md`
+   Phase 2). The corpus, the model, and the cost-tracker are frozen
+   for the window; mid-window changes restart the clock.
+3. **Decision points.** After the window closes, `step-4` closeout
+   reads `docs/parity/bench.json` and applies exactly one of:
+
+   | Measured tokens saved | Quality regression on corpus | Verdict |
+   |---|---|---|
+   | < 30 % | any | **Deprecate** — remove `caveman-speak` rule, archive `caveman-compress` script, retire `caveman.*` settings keys with a one-release deprecation window |
+   | ≥ 30 % | < 5 % | **Flip default on** — `caveman.speak_scope` defaults to a non-`off` value, carve-outs stay, statusline surfaces lifetime tokens saved |
+   | ≥ 30 % | ≥ 5 % | **Hold** — repeat the window once with tuned intensity ladder; second hold → deprecate |
+
+   "Quality regression" = host-side rubric on the corpus per
+   `step-4-measurement-and-benchmark.md` Phase 3. Numbers checked into
+   `docs/parity/bench.json` as the decision artefact.
+4. **No interim flip.** The default does not move on anecdote,
+   gut feeling, or a single benchmark snapshot. The 60-day window and
+   the table above are the only path to a default change.
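
Editor's sketch (not part of the diff): the decision table reduces to a three-branch function. `compression_verdict`, its string return values, and the `prior_holds` counter are illustrative names; the thresholds (30 % tokens saved, 5 % quality regression, one permitted hold) come from the table.

```python
def compression_verdict(tokens_saved_pct, quality_regression_pct, prior_holds=0):
    """Apply the kill-criterion table to one closed measurement window.

    prior_holds counts earlier windows that already ended in "hold".
    """
    if tokens_saved_pct < 30:
        return "deprecate"            # row 1: savings too small, quality irrelevant
    if quality_regression_pct < 5:
        return "flip_default_on"      # row 2: enough savings, acceptable quality
    # Row 3: enough savings but too much regression. Only one hold is allowed.
    return "deprecate" if prior_holds >= 1 else "hold"
```

Because every input lands in exactly one branch, the closeout cannot produce two verdicts for the same window, which is the "applies exactly one of" requirement.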
+
+## Why this is parked, not decided
+
+The council split (Opus = remove now, o1 = measure-then-decide) is
+real. Either branch is wrong-shaped without numbers. The kill-criterion
+gives the audit a deterministic resolution path and stops every
+downstream roadmap from re-litigating compression on every PR.
+
+## Cross-references
+
+- `step-99-north-star-restructure.md` § Phase 4
+  — parks this criterion, does not decide.
+- `step-4-measurement-and-benchmark.md`
+  — owns `task bench`, the corpus, and the closeout that applies the
+  table above.
+- `step-10-caveman-parity.md`
+  — implements the carve-outs and the statusline integration the
+  "flip default on" branch depends on; blocks the default flip until
+  acceptance is green.
+- [`caveman-speak`](../../.agent-src.uncompressed/rules/caveman-speak.md)
+  — runtime rule; reads `caveman.speak_scope` from settings.
+
+## Done
+
+This doc exists to keep the decision visible. It is **not** an action
+item. `step-4` closeout closes the loop.