@pilotspace/add 1.1.0 → 1.3.0

Files changed (61)
  1. package/CHANGELOG.md +81 -0
  2. package/GETTING-STARTED.md +187 -139
  3. package/README.md +13 -7
  4. package/bin/cli.js +96 -5
  5. package/docs/01-principles.md +3 -3
  6. package/docs/02-the-flow.md +19 -12
  7. package/docs/03-step-1-specify.md +15 -13
  8. package/docs/04-step-2-scenarios.md +2 -2
  9. package/docs/05-step-3-contract.md +3 -3
  10. package/docs/06-step-4-tests.md +10 -2
  11. package/docs/07-step-5-build.md +3 -1
  12. package/docs/08-step-6-verify.md +25 -5
  13. package/docs/09-the-loop.md +12 -6
  14. package/docs/10-setup-and-stages.md +27 -13
  15. package/docs/11-governance.md +6 -2
  16. package/docs/12-roles.md +3 -3
  17. package/docs/13-adoption.md +1 -1
  18. package/docs/14-foundation.md +15 -15
  19. package/docs/15-foundations-and-lineage.md +106 -0
  20. package/docs/README.md +4 -0
  21. package/docs/appendix-a-templates.md +3 -3
  22. package/docs/appendix-b-prompts.md +40 -5
  23. package/docs/appendix-c-glossary.md +49 -12
  24. package/docs/appendix-d-worked-example.md +2 -2
  25. package/docs/appendix-e-checklists.md +16 -4
  26. package/docs/appendix-f-requirements-matrix.md +8 -8
  27. package/docs/appendix-g-references.md +106 -0
  28. package/package.json +1 -1
  29. package/skill/add/SKILL.md +41 -38
  30. package/skill/add/adopt.md +13 -11
  31. package/skill/add/deltas.md +8 -6
  32. package/skill/add/fold.md +19 -17
  33. package/skill/add/graduate.md +74 -0
  34. package/skill/add/intake.md +22 -7
  35. package/skill/add/loop.md +59 -0
  36. package/skill/add/phases/0-ground.md +66 -0
  37. package/skill/add/phases/0-setup.md +32 -25
  38. package/skill/add/phases/1-specify.md +28 -13
  39. package/skill/add/phases/2-scenarios.md +14 -4
  40. package/skill/add/phases/3-contract.md +27 -12
  41. package/skill/add/phases/4-tests.md +15 -5
  42. package/skill/add/phases/5-build.md +33 -4
  43. package/skill/add/phases/6-verify.md +40 -2
  44. package/skill/add/phases/7-observe.md +13 -5
  45. package/skill/add/report-template.md +65 -7
  46. package/skill/add/run.md +93 -39
  47. package/skill/add/scope.md +10 -6
  48. package/skill/add/setup-review.md +13 -10
  49. package/skill/add/streams.md +88 -23
  50. package/tooling/add.py +1817 -90
  51. package/tooling/templates/CONVENTIONS.md.tmpl +1 -1
  52. package/tooling/templates/DESIGN.md.tmpl +66 -0
  53. package/tooling/templates/GLOSSARY.md.tmpl +29 -0
  54. package/tooling/templates/MILESTONE.md.tmpl +1 -0
  55. package/tooling/templates/PROJECT.md.tmpl +6 -3
  56. package/tooling/templates/TASK.md.tmpl +55 -15
  57. package/tooling/templates/catalog.sample.json +38 -0
  58. package/tooling/templates/prototype.sample.json +48 -0
  59. package/tooling/templates/tokens.sample.json +55 -0
  60. package/tooling/templates/udd-catalog.md +122 -0
  61. package/tooling/templates/udd-tokens.md +79 -0
@@ -10,6 +10,21 @@ Pick ONE task-sized slice, restate the tests it must satisfy, implement, run
10
10
  tests, iterate to green. Keep each batch small enough to review in full — you
11
11
  cannot move faster than you can verify.
12
12
 
13
+ ## Declaring the scope of impact (Scope + Strategy)
14
+
15
+ §5 of TASK.md opens with two declarations, drafted WITH the specification bundle
16
+ and frozen by the one §3 approval — never invented mid-build:
17
+
18
+ - **Scope (may touch)** — the allowlist of every file the build may write
19
+ (backticked tokens; grammar in the template comment). During build, needing a
20
+ file outside the declared Scope is a **STOP → change request** back to Specify,
21
+ never improvisation.
22
+ - **Strategy (ordered batches)** — the planned build order. Guidance, not
23
+ enforced: it aims the small-batches loop, it does not gate it.
24
+
25
+ Deferral, named: the engine gate (touched ⊆ declared) lands in the
26
+ `scope-gate-enforce` task — until it ships this section is prose discipline.
27
+
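The deferred engine gate (touched ⊆ declared) might look roughly like the sketch below. This is not the `scope-gate-enforce` implementation, which the text says has not shipped. The helper name `out_of_scope` and the choice to treat each Scope token as a glob pattern are assumptions made for illustration; the real token grammar lives in the template comment.

```python
from fnmatch import fnmatch


def out_of_scope(touched, declared):
    """Return the touched paths that match no declared Scope token.

    Hypothetical helper: each declared token is treated as a glob pattern.
    """
    return sorted(
        path for path in touched
        if not any(fnmatch(path, token) for token in declared)
    )


# Illustrative paths only; they do not come from the package.
declared = ["src/billing/*.py", "tests/test_billing.py"]
touched = ["src/billing/invoice.py", "src/auth/login.py"]

violations = out_of_scope(touched, declared)
if violations:
    # Per the rule above: STOP and raise a change request, never improvise.
    print("STOP -> change request:", ", ".join(violations))
```

An empty result means every touched file sits inside the declared Scope, which is the condition the new exit-gate checkbox asks for.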
13
28
  ## The cardinal rule
14
29
 
15
30
  **Never weaken or delete a test to make it pass, and never edit the frozen
@@ -19,18 +34,26 @@ change request back to Specify. Honor the feature-specific safety rule named in
19
34
 
20
35
  ## AI prompt
21
36
 
22
- > Read §1, §3, §4, and CONVENTIONS. Make EVERY failing test pass, one small batch
23
- > at a time. Constraints: do NOT change any test; do NOT change the contract; honor
24
- > the §5 safety rule; use only allow-listed packages; stop and ask if unclear.
25
- > Report which tests pass and exactly what changed.
37
+ <prompt>
38
+ Role: implement the feature so EVERY failing test passes the build phase.
39
+ Read first: §1 · §3 · §4 · CONVENTIONS.
40
+ Objective: every §4 test green, one small batch at a time.
41
+ Steps:
42
+ 1. Make EVERY failing test pass, one small batch at a time, honoring the §5 safety rule.
43
+ 2. Report which tests pass and exactly what changed.
44
+ Never: change a test or the contract; use a package off the allow-list; or push past something unclear instead of asking.
45
+ </prompt>
26
46
 
27
47
  ## Exit gate
28
48
 
49
+ <exit_gate>
29
50
  - [ ] All tests pass.
30
51
  - [ ] Coverage did not decrease.
31
52
  - [ ] No test and no contract modified by the AI.
32
53
  - [ ] No dependency outside the allow-list.
54
+ - [ ] No file outside the declared §5 Scope was touched.
33
55
  - [ ] Change small enough to review in full.
56
+ </exit_gate>
34
57
 
35
58
  ## Next
36
59
 
@@ -39,3 +62,9 @@ Book: `docs/07-step-5-build.md`.
39
62
 
40
63
  > Under `autonomy: auto` (the default) Build and Verify run together as one dynamic,
41
64
  > evidence-auto-gated run — not two manual stops. See `run.md`.
65
+ >
66
+ > **Honest redo.** If the verify gate finds a confirmed cheat (a tamper, or a reported
67
+ > earned-green failure), the task returns HERE for an honest redo — revert the tampered
68
+ > file or de-overfit src, then advance again. This is the bounded self-heal loop (`run.md`),
69
+ > capped: after the cap a confirmed cheat HARD-STOPs to the human. Never weaken a test or
70
+ > edit the frozen contract to pass.
@@ -1,4 +1,4 @@
1
- # Phase 6 — Verify (evidence + blind-spot checks)
1
+ # Phase 6 — Verify (evidence + non-functional review)
2
2
 
3
3
  Goal: establish trust and record an outcome. Passing tests are necessary, not
4
4
  sufficient. Fill **§6** in TASK.md including the GATE RECORD.
@@ -31,8 +31,44 @@ If any is false, stop and return to Build — there is nothing to verify yet.
31
31
  note reviewed by the auto-gate is an audit finding (`unescalated_security_note`).
32
32
  - **Architecture** — does it respect layering/dependency rules in CONVENTIONS.md?
33
33
 
34
+ ## Part three — the deep check (do not skim)
35
+
36
+ Green tests prove behavior on the inputs you thought of. They do not prove the change
37
+ is *wired in*, nor that you did not leave a dead end behind — and for a non-coding change
38
+ they prove nothing about whether you actually *read* the thing you signed off. So one more
39
+ requirement, every gate:
40
+
41
+ Deep check — do not skim. If the task produced code, record that every new symbol is
42
+ referenced (wiring) and that no new dead/unused code was introduced. If it produced prose
43
+ or non-code, record a semantic read — what you read in full and what it confirmed. Which
44
+ path applies is the resolver's judgement; the engine never classifies.
45
+
46
+ Record it in the §6 **Deep checks** block — where each new symbol is called (a reference
47
+ search), the dead-code scan result, or the prose you read in full and what it confirmed.
48
+ An unfilled Deep checks block is a **shallow verify**, not a PASS.
49
+
50
+ ## Part four — was the green earned?
51
+
52
+ A green suite proves the tests pass — not that the build EARNED them. Three judgment cheats
53
+ pass the unchanged suite without earning it: src overfit to the test fixtures (special-cased
54
+ to the literal inputs, not the general behavior §1 asked for), vacuous asserts (tautological —
55
+ green even against an empty implementation), and real logic stubbed away (the function returns
56
+ a constant the tests happen to accept). These cheats are invisible to the mechanical tamper
57
+ tripwire, which only sees edited files. Score them with an adversarial refute-read: an
58
+ independent reviewer — a subagent under `autonomy: auto` is recommended, the engine never
59
+ spawns one — prompted to argue the green was NOT earned from outside the build context. This
60
+ is the verify-gate, whole-suite specialization of run.md's adversarial verify (see run.md), not
61
+ a new discipline. A confirmed earned-green failure is HARD-STOP-class: never auto-passed, never
62
+ RISK-ACCEPTED — but a first cheat is a chance to redo: a confirmed cheat (mechanical tamper or a
63
+ reported earned-green failure) enters the bounded self-heal loop — it returns to build for an honest
64
+ redo, and only after the loop's cap does it HARD-STOP to the human (the loop lives in run.md).
65
+
34
66
  ## Record exactly one outcome (no silent pass)
35
67
 
68
+ When you present this gate to the human, open with the ARC (goal · done · plan) per
69
+ `report-template.md`, and reconcile its FLAGS with `add.py report --decide`'s open-item count
70
+ before the ask — per that file's reconcile rule (verify is where a flag-vs-digest mismatch bites).
71
+
36
72
  | Outcome | When |
37
73
  |---------|------|
38
74
  | `PASS` | all checks met |
@@ -41,8 +77,10 @@ If any is false, stop and return to Build — there is nothing to verify yet.
41
77
 
42
78
  ## Exit gate / Next
43
79
 
44
- - [ ] Evidence confirmed, blind-spots checked, outcome recorded — a person approved, or
80
+ <exit_gate>
81
+ - [ ] Evidence confirmed, non-functional risks checked, outcome recorded — a person approved, or
45
82
  (under `autonomy: auto` with no residue) the run auto-resolved as the accountable owner.
83
+ </exit_gate>
46
84
 
47
85
  ```bash
48
86
  python3 .add/tooling/add.py gate PASS # marks the task done
@@ -6,7 +6,7 @@ about the feature finally appears. Fill **§7** in TASK.md.
6
6
 
7
7
  ## Do
8
8
 
9
- 1. **Release behind a blast-radius limit** — feature flag and/or gradual rollout.
9
+ 1. **Release behind a scope-of-impact limit** — feature flag and/or gradual rollout.
10
10
  2. **Reuse scenarios as monitors** — the §2 scenarios that defined "correct" now
11
11
  define what you alert on: overall error rate, each rejection's rate (a spike in
12
12
  one is a signal), latency of the risky operation under load.
@@ -15,16 +15,24 @@ about the feature finally appears. Fill **§7** in TASK.md.
15
15
 
16
16
  ## AI prompt
17
17
 
18
- > Role: a reliability analyst feeding the next cycle. Read telemetry, objectives,
19
- > incidents. Report error-budget burn; cluster errors and surface the top
20
- > real-world failures; draft a SPEC delta with evidence links. Never auto-roll-back
21
- > recommend; a human owns the production decision.
18
+ <prompt>
19
+ Role: a reliability analyst feeding the next cycle.
20
+ Read first: telemetry · objectives · incidents.
21
+ Objective: turn what production shows into the next SPEC delta.
22
+ Steps:
23
+ 1. Report error-budget burn.
24
+ 2. Cluster errors and surface the top real-world failures.
25
+ 3. Draft a SPEC delta with evidence links.
26
+ Never: auto-roll-back — recommend; a human owns the production decision.
27
+ </prompt>
22
28
 
23
29
  ## Exit gate
24
30
 
31
+ <exit_gate>
25
32
  - [ ] Released behind a flag/rollout.
26
33
  - [ ] Scenario-based monitors live.
27
34
  - [ ] A reviewed spec delta captured (becomes the next `new-task`).
35
+ </exit_gate>
28
36
 
29
37
  ## Next
30
38
 
@@ -1,19 +1,59 @@
1
- # Chat reports — the seam template (for the AI, not for add.py)
1
+ # Chat reports — the decision-point template (for the AI, not for add.py)
2
2
 
3
3
  The engine renders artifacts (`report`, `report --decide`, `status`); this file
4
4
  governs the CHAT MESSAGE you wrap around them. The digest is the artifact BEHIND
5
5
  your presentation, never a replacement for it — and your prose is never a
6
6
  replacement for the digest.
7
7
 
8
- Use it every time you report at or near a decision seam: an intake proposal, a
9
- bundle/front approval, a verify gate, a task completion, a milestone close.
8
+ Use it every time you report at or near a decision point: an intake proposal, a
9
+ bundle approval, a verify gate, a task completion, a milestone close.
10
+
11
+ ## The decision arc — rendered first, above the five blocks
12
+
13
+ Every report at a human gate opens with the **ARC** — three labelled lines that
14
+ place the decision in the work's whole arc, so the human confirms with sight of
15
+ where this is going, not just the step in front of them. Render it first, then a
16
+ separator, then the unchanged five blocks below:
17
+
18
+ ```
19
+ ARC goal: <the milestone / project goal this decision serves>
20
+ done: <proven progress — tasks done · exit-criteria met · what this gate proves>
21
+ plan: <this gate → the next step → the goal>
22
+ ```
23
+
24
+ - **goal** — the milestone or project goal the decision serves, read from the
25
+ `m-goal` line in `add.py status`; never re-typed from memory.
26
+ - **done** — proven progress only: exit-criteria met/total and tasks done from
27
+ the rollup, plus what this gate proves. An honest fact, never a hope.
28
+ - **plan** — this gate → the next step → the goal, mirroring the rollup's
29
+ `DECIDE NEXT` line.
30
+
31
+ The arc is required at every human gate: **baseline-lock · contract-freeze ·
32
+ verify · intake · scope · milestone-close · graduation**. The three labels stay
33
+ constant; their content adapts to the gate. The arc is presentation only — it
34
+ adds no gate and changes no PASS / RISK-ACCEPTED / HARD-STOP / freeze outcome.
35
+
36
+ Its facts are engine-sourced, exactly like EVIDENCE below: goal = `m-goal` ·
37
+ done = exit-criteria met/total + tasks done · plan = `DECIDE NEXT`. If your arc
38
+ and `add.py` output disagree, the engine wins — fix the arc, not the engine.
39
+
40
+ ### Per-gate examples — one shape, gate-specific content
41
+
42
+ - **verify** — `goal:` ship the decision arc · `done:` report-arc tests 6/6
43
+ green, gate ready · `plan:` PASS this gate → wire the arc into every gate → goal.
44
+ - **contract-freeze** — `goal:` … · `done:` bundle drafted, lowest-confidence
45
+ flag surfaced · `plan:` freeze §3 → build → goal.
46
+ - **milestone-close** — `goal:` … · `done:` exit-criteria 3/3 met, all tasks
47
+ done · `plan:` close → archive → the next milestone.
48
+ - **intake** — `goal:` the sized request · `done:` classified new-major,
49
+ rationale stated · `plan:` create the milestone → first contract → goal.
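The fixed three-label shape of the ARC can be sketched as a small renderer. This is an illustration, not part of `add.py`: the function name `render_arc` is hypothetical, and the three values are assumed to have been copied from engine output already (the `m-goal` line, the rollup, and the `DECIDE NEXT` line), since the source does not show how that output is formatted.

```python
def render_arc(goal, done, plan):
    """Render the three labelled ARC lines in their fixed order.

    Hypothetical helper: the values are assumed to be engine-sourced
    strings, never re-typed from memory.
    """
    rows = [("goal", goal), ("done", done), ("plan", plan)]
    for label, text in rows:
        if not text.strip():
            raise ValueError(f"ARC {label} is empty; every gate needs all three")
    lines = []
    for index, (label, text) in enumerate(rows):
        prefix = "ARC " if index == 0 else "    "
        lines.append(f"{prefix}{label}: {text}")
    return "\n".join(lines)


print(render_arc(
    "ship the decision arc",
    "report-arc tests 6/6 green, gate ready",
    "PASS this gate -> wire the arc into every gate -> goal",
))
```

The labels stay constant across gates; only the three strings change.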
10
50
 
11
51
  ## The five blocks, in order
12
52
 
13
53
  ```
14
54
  SUMMARY one line: intent + target + where we are
15
55
  DECISION what you need from the human (or "none — FYI")
16
- ⚠ FLAGS least-sure first, why + cost-if-wrong
56
+ ⚠ FLAGS lowest-confidence first, why + cost-if-wrong
17
57
  EVIDENCE small table: tests · gates · parity · check — engine-sourced
18
58
  NEXT the single next action + what it unlocks
19
59
  ```
@@ -24,7 +64,7 @@ NEXT the single next action + what it unlocks
24
64
  2. **DECISION** — the question the human must answer, stated plainly; exactly
25
65
  one decision per report, or an explicit "none — FYI". If a decision exists,
26
66
  ask it AFTER everything below has been shown (show-before-ask).
27
- 3. **⚠ FLAGS** — least-sure first, each with *why* it is least sure and the
67
+ 3. **⚠ FLAGS** — lowest-confidence first, each with *why* confidence is lowest and the
28
68
  *cost if wrong*. Where TASK.md markers exist (`⚠` / `- [~]` / `- [ ]`),
29
69
  quote them verbatim and keep their document order — extraction ≠ judgment.
30
70
  4. **EVIDENCE** — engine-sourced facts pasted from `add.py` output, never
@@ -34,15 +74,33 @@ NEXT the single next action + what it unlocks
34
74
  line when it is right; overrule it only with a stated reason (e.g. planned
35
75
  tasks the state file cannot see yet).
36
76
 
77
+ **The ask itself** — when block 2's decision becomes a literal question component
78
+ (option picker, numbered menu), compose it as a summary: the detail stays in the
79
+ report above, the question carries intent + what "yes" means + the flag count.
80
+
37
81
  ## Hard rules
38
82
 
83
+ <constraints>
39
84
  - **Summary-first.** Never bury the decision under a task list or a diff.
40
85
  - **Show before ask.** Render the artifact (digest · diff · report) before any
41
86
  approval question; the human decides on what they can see.
42
- - **Never pre-stamp a human seam.** Freeze / gate / lock fields stay DRAFT or
87
+ - **Reconcile the count.** Before the ask, your FLAGS must reconcile with
88
+ `add.py report --decide`'s open-item count. If your prose calls an item
89
+ resolved while the digest still counts it open, the engine wins — fix the data
90
+ (the TASK.md markers the digest reads), not the sentence. A report whose flag
91
+ count disagrees with the engine is the un-transparent gate the ARC exists to close.
92
+ - **Never pre-stamp a human decision point.** Freeze / gate / lock fields stay DRAFT or
43
93
  blank until the answer returns: show → ask → stamp → advance. An artifact
44
94
  must never claim an approval that has not happened.
45
- - **One report per seam.** After an approval, point at the frozen artifact —
95
+ - **One report per decision point.** After an approval, point at the frozen artifact —
46
96
  do not re-render the whole bundle.
47
97
  - **Honest scope.** "Done" means the request, not the last task: report
48
98
  "task 2/3", never "done" while approved scope remains.
99
+ - **The question is a summary, never the artifact.** Every approval ask carries
100
+ two layers: a compact SUMMARY · DECISION · ⚠ FLAGS block sits in chat
101
+ immediately before the ask (positional), and the question text itself is a
102
+ summary of two lines at most — intent + what "yes" means + the flag count —
103
+ pointing at the report above (compositional). The full bundle, diff, or
104
+ artifact lives only in the chat report; a question that re-carries it buries
105
+ the decision.
106
+ </constraints>
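The "Reconcile the count" rule above can be sketched as a pre-ask check. This is a hedged illustration: `reconcile` is a hypothetical helper, the open-item count is assumed to have been read from `add.py report --decide` by hand or by some parser the source does not describe, and treating "reconcile" as strict equality is also an assumption.

```python
def reconcile(report_flags, engine_open_count):
    """Compare the report's FLAGS with the engine's open-item count.

    Returns (ok, message). The engine's count is authoritative, so a
    mismatch is fixed in the TASK.md markers, not in the prose.
    """
    listed = len(report_flags)
    if listed == engine_open_count:
        return True, f"reconciled: {engine_open_count} open item(s)"
    return False, (
        f"mismatch: report lists {listed} flag(s), engine counts "
        f"{engine_open_count} open; fix the TASK.md markers, not the sentence"
    )


ok, message = reconcile(["contract shape may be too narrow"], 2)
print(ok, message)
```

Running this before the approval question keeps a flag-vs-digest mismatch from reaching the human.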
package/skill/add/run.md CHANGED
@@ -1,25 +1,24 @@
1
1
  # The dynamic run — executing a locked scope
2
2
 
3
3
  Once a task's CONTRACT is frozen (phase 3), the scope is *locked*: the external shape will not move.
4
- That lock is ADD's autonomy seam — below it code is disposable; above it nothing breaks. This rubric
5
- covers what runs on the far side of the seam: the **build->verify half, executed as a dynamic,
6
- self-improving run** instead of a manual, sequential build. The human-led FRONT (Specify · Scenarios
7
- · Contract) still owns *direction*, but v7 compresses it to a **single human approval at the seam**
8
- (see "The one-approval front" below) — the AI drafts the whole front, a human approves it once.
4
+ That lock is ADD's autonomy decision point — below it code is disposable; above it nothing breaks. This rubric
5
+ covers what runs on the far side of the decision point: the **build->verify half, executed as a dynamic,
6
+ self-improving run** instead of a manual, sequential build. The human-led **specification bundle** (Specify · Scenarios
7
+ · Contract) still owns *direction*, but v7 compresses it to a **single human approval at the decision point**
8
+ (see "The specification bundle" below) — the AI drafts the whole bundle, a human approves it once.
9
9
 
10
10
  > **Self-improving = within-run convergence + emit v5 deltas** — same definition as v5: tracked,
11
11
  > evidence-backed, never autonomous training. The run converges in-turn AND feeds the human-gated
12
- > fold loop (`deltas.md` · `fold.md`). The engine stays judgment-free: this is a rubric, not `add.py`.
12
+ > consolidation loop (`deltas.md` · `fold.md`). The engine stays judgment-free: this is a rubric, not `add.py`.
13
13
 
14
- ## The one-approval front (v7)
14
+ ## The specification bundle (v7)
15
15
 
16
- The human-led front used to be three separate approvals — Specify, then Scenarios, then the Contract
17
- freeze. v7 compresses it to **one**. From the user's input the AI **drafts the whole front as a single
18
- bundle** the Spec, the Scenarios, the Contract, and the failing Tests and presents it together. The
19
- human gives **one approval, at the frozen contract** (the seam). That single approval is the green light
16
+ The specification bundle used to be three separate approvals — Specify, then Scenarios, then the Contract
17
+ freeze. v7 compresses it to **one**. From the user's input the AI **drafts the whole specification bundle in one pass** — the Spec, the Scenarios, the Contract, and the failing Tests — and presents it together. The
18
+ human gives **one approval, at the frozen contract** (the decision point). That single approval is the green light
20
19
  for the self-driving run.
21
20
 
22
- Why one approval and not zero: the contract freeze is the autonomy seam, and the seam **stays human**.
21
+ Why one approval and not zero: the contract freeze is the autonomy decision point, and the decision point **stays human**.
23
22
  The AI *drafts* the contract but never *freezes its own* — a person approves the frozen shape before any
24
23
  auto-run touches code. This is exactly what keeps "never self-gate a human-led gate" true under an auto
25
24
  default: the one gate that remains is human. Drop it to zero and the AI would freeze the interface it
@@ -28,11 +27,11 @@ then builds against and self-gate the result — the circular trust v6's dogfood
28
27
  What the human is actually approving in that one gate: that the drafted Spec captures the real intent,
29
28
  that the Scenarios cover the cases that matter, and that the Contract shape is the one to freeze. Reject
30
29
  any part and the bundle goes back to draft — that is backward-correction (principle 4), not failure.
31
- Approve, and the run begins. The seam guide (`phases/3-contract.md`) carries the
32
- **freeze review checklist** — six lines that walk the human through exactly this, ⚠-first.
30
+ Approve, and the run begins. The decision-point guide (`phases/3-contract.md`) carries the
31
+ **freeze review checklist** — seven lines that walk the human through exactly this, ⚠-first.
33
32
 
34
- **The least-sure flag — aiming the one approval.** A single approval over a whole bundle invites a
35
- rubber stamp. So the AI presents the bundle **least-sure first**: of everything it is asking the human
33
+ **The lowest-confidence flag — aiming the one approval.** A single approval over a whole bundle is easy to
34
+ grant without reading. So the AI presents the bundle **lowest-confidence first**: of everything it is asking the human
36
35
  to freeze, it names the **1–2 points most likely to be wrong**, tagged by part
37
36
  (`⚠ [spec|scenario|contract|test] … — because …; if wrong: …`), each with *why* it is uncertain and
38
37
  *what it costs if wrong*. The §1 assumptions feed it, but a flag may equally point at an uncovered
@@ -40,7 +39,7 @@ scenario or the contract shape. If nothing is materially uncertain, the AI still
40
39
  biggest risk, however small — never a blank "none". Honest about its limit: the flag records that the
41
40
  human approved with the soft spots **in front of them**, eyes open; it makes a real review cheap and a
42
41
  lazy one visibly negligent, but it cannot *force* engagement — and the AI never asserts that the human
43
- engaged when it cannot know (a self-asserted gate would just be the rubber stamp one level up). Closing
42
+ engaged when it cannot know (a self-asserted gate would just move the unread approval one level up). Closing
44
43
  that enforcement gap is the job of a CI checker, not of prose.
45
44
 
46
45
  ## When the run begins — the scope-lock trigger
@@ -50,17 +49,18 @@ The trigger is the **frozen contract**, nothing else. A run may start only when:
50
49
  - §3 CONTRACT is marked `FROZEN @ vN` (the shape is fixed), AND
51
50
  - §4 TESTS exist and are RED for the right reason (the target the run drives to green).
52
51
 
53
- No frozen contract -> no run: you are still on the human-led front, and starting early is the
52
+ No frozen contract -> no run: you are still inside the specification bundle, and starting early is the
54
53
  forward-skip the flow forbids. The lock is what makes autonomous execution *safe* — the AI cannot
55
54
  drift the interface, because the interface is frozen above it.
56
55
 
57
- ## The touch-boundary — what the run may and may not touch
56
+ ## The change scope — what the run may and may not touch
58
57
 
58
+ <constraints>
59
59
  A locked run has a hard boundary. It MAY:
60
60
 
61
- - write and rewrite **code** (`src/`) — code is disposable below the seam;
61
+ - write and rewrite **code** (`src/`) — code is disposable below the decision point;
62
62
  - drive the **tests** to green WITHOUT weakening them (a weakened test is a method violation);
63
- - gather **evidence** for the verify gate (test output, blind-spot checks).
63
+ - gather **evidence** for the verify gate (test output, non-functional review).
64
64
 
65
65
  It MUST NOT:
66
66
 
@@ -68,10 +68,11 @@ It MUST NOT:
68
68
  the run STOPS and hands back to a human to reopen Specify (principle 4). The run never re-locks
69
69
  scope on its own.
70
70
  - weaken, delete, or skip a **test** to make the build pass (that inverts the method).
71
- - touch the **human-led front artifacts** (§1–§3) except to halt and escalate.
71
+ - touch the **specification-bundle artifacts** (§1–§3) except to halt and escalate.
72
+ </constraints>
72
73
 
73
74
  Crossing the boundary is not a fast run; it is an unverified one. When the run hits something only the
74
- front can resolve, it stops — and that stop is the loop working, not failing.
75
+ specification bundle can resolve, it stops — and that stop is the loop working, not failing.
75
76
 
76
77
  ## The dynamic run — fan-out and in-run convergence
77
78
 
@@ -83,21 +84,28 @@ on a trustworthy result with three loops:
83
84
  Stopping at the first green is how defects survive; the run stops only when the well runs dry.
84
85
  - **adversarial verify** — for every "done" claim, an independent skeptic tries to REFUTE it. The
85
86
  claim survives only if it withstands refutation, not because one pass looked plausible.
86
- - **completeness-critic** — a final pass that asks "what did we NOT cover — a scenario, a blind-spot,
87
+ - **completeness-critic** — a final pass that asks "what did we NOT cover — a scenario, a non-functional risk,
87
88
  an unstated assumption?" Whatever it finds re-enters the run.
88
89
 
89
90
  The run ends only when the loops go dry AND the auto-gate's evidence is satisfied. This is the run
90
91
  **self-improving within the turn** — the same convergence the foundation loop runs across milestones,
91
92
  compressed into one task.
92
93
 
93
- ## The evidence auto-gate
94
+ ## The automated quality gate
94
95
 
96
+ <constraints>
95
97
  The verify gate may be resolved by **evidence** rather than by a person — when the evidence is
96
98
  sufficient and the result is recorded (principle 7, reframed: an automated, recorded pass is an
97
99
  explicit pass, not a skip).
98
100
 
99
101
  - **Auto-PASS requires ALL of:** every test green; coverage not decreased; no test weakened and no
100
- contract edited; the convergence loops dry; the completeness-critic found nothing open.
102
+ contract edited; the convergence loops dry; the completeness-critic found nothing open; and the
103
+ deep check below recorded.
104
+ - **The deep check (every gate, no skim).** Deep check — do not skim. If the task produced code, record
105
+ that every new symbol is referenced (wiring) and that no new dead/unused code was introduced. If it
106
+ produced prose or non-code, record a semantic read — what you read in full and what it confirmed.
107
+ Which path applies is the resolver's judgement; the engine never classifies. An unfilled deep check is
108
+ a **shallow verify**, not an auto-PASS — evidence the work is wired, not merely plausible.
101
109
  - **Always escalates to a human (never auto-passed):** any **security** finding (HARD-STOP, always);
102
110
  a **concurrency**/timing risk the tests cannot exercise; an **architecture**/layering violation; and
103
111
  any failing test. These are the residue principle 2 names — automation cannot judge them.
@@ -107,54 +115,100 @@ explicit pass, not a skip).
107
115
 
108
116
  The auto-gate NEVER writes a human signature it did not get. An auto-PASS is logged as *auto-resolved*,
109
117
  honestly — the line between a pass and a skip is the recorded outcome, not a forged name.
118
+ </constraints>
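The auto-PASS conditions and the always-escalate residue can be read as one decision function. The sketch below is an illustration under assumptions, not the engine's logic: the `Evidence` record, its field names, and the returned strings are all hypothetical, and the behaviour when evidence is merely incomplete (for example, coverage decreased) is a guess, since the text only specifies the auto-PASS and escalation cases.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """Hypothetical record of what the verify gate observed."""
    tests_green: bool
    coverage_not_decreased: bool
    tests_and_contract_untouched: bool
    loops_dry: bool
    critic_found_nothing_open: bool
    deep_check_recorded: bool
    # Findings automation cannot judge, e.g. "security", "concurrency".
    residue: list = field(default_factory=list)


def auto_gate(ev):
    """Return the gate's outcome string; never a human signature."""
    # Residue and failing tests always go to a human.
    if ev.residue:
        return "escalate: " + ", ".join(ev.residue)
    if not ev.tests_green:
        return "escalate: failing test"
    # An unfilled deep check is a shallow verify, not an auto-PASS.
    if not ev.deep_check_recorded:
        return "shallow verify: deep check unfilled"
    remaining = [
        ev.coverage_not_decreased,
        ev.tests_and_contract_untouched,
        ev.loops_dry,
        ev.critic_found_nothing_open,
    ]
    if all(remaining):
        return "auto-PASS (logged as auto-resolved)"
    return "no auto-PASS: evidence incomplete"


print(auto_gate(Evidence(True, True, True, True, True, True)))
```

Checking residue first mirrors the rule that a security finding escalates regardless of how green the rest of the evidence is.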
119
+
120
+ ## The bounded self-heal loop — a confirmed cheat returns to build
121
+
122
+ The auto-gate trusts evidence; but evidence can be **gamed**. A build can make the unchanged red suite
123
+ pass without EARNING it — a test or the frozen contract edited after the red run, src **overfit** to the
124
+ fixtures, **vacuous** asserts, or real logic **stubbed away**. That is a **confirmed cheat**, and a cheat
125
+ is **HARD-STOP-class**: never auto-passed, never RISK-ACCEPTED-waived (like a security finding). But a
126
+ first cheat is not yet a stop — it is a chance to redo honestly.
127
+
128
+ So a confirmed cheat enters a **bounded self-heal loop**: the engine returns the task to **build** for an
129
+ honest redo, **counts** the attempt, and **caps** it. After **3** honest re-build attempts a fourth
130
+ confirmed cheat forces a **HARD-STOP that escalates to the human** — never an auto-PASS, never an unbounded
131
+ loop. The engine COUNTS, CAPS, and ESCALATES; the **agent** does the honest re-build (the engine never
132
+ auto-fixes). The counter is **monotonic** — it never auto-resets, so the cap cannot be cleared by
133
+ re-crossing a phase; only an honest build (no cheat) escapes the loop, and an honest build PASSes even at
134
+ the third attempt (the cap bites a *continued* cheat, never a recovery).
135
+
136
+ Two findings enter the loop:
137
+ - **mechanical** (enforced) — the tamper tripwire (`tamper-tripwire`): at the gate the engine re-hashes the
138
+ red test files + the frozen §3 against the `tests→build` snapshot; any divergence is a cheat, routed to
139
+ the loop before any completing outcome is recorded.
140
+ - **semantic** (honor-system, necessary-not-sufficient) — the **adversarial refute-read** (`6-verify.md`):
141
+ an independent reviewer argues "the green was NOT earned" and, on a confirmed overfit/vacuous/stub, the
142
+ agent reports it with `add.py heal <slug> --reason "<finding>"`. The engine cannot SEE a judgment cheat,
143
+ so this entry is the agent's honest report — the human verify gate stays the real backstop.
144
+
145
+ The mechanical entry returns-to-build automatically at the gate; the `heal` verb is how a *reported* cheat
146
+ enters the same bounded loop. Either way: ≤3 honest redos, then escalate. A gamed green never ships.
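The count, cap, and escalate behaviour described above, plus the re-hash comparison behind the tamper tripwire, can be sketched as follows. This is an illustration only and not the `add.py` implementation: `HealLoop`, `digest`, and `tampered` are hypothetical names, and SHA-256 is an assumed hash, since the text says only that the engine "re-hashes" the files.

```python
import hashlib

HEAL_CAP = 3  # honest re-build attempts allowed, per the text above


def digest(content):
    """Hash file content (bytes); the algorithm is an assumption."""
    return hashlib.sha256(content).hexdigest()


def tampered(snapshot, current):
    """Names whose current digest differs from the tests->build snapshot."""
    return sorted(
        name for name, saved in snapshot.items()
        if current.get(name) != saved
    )


class HealLoop:
    """Counts confirmed cheats; the counter is monotonic and never resets."""

    def __init__(self):
        self.attempts = 0

    def confirmed_cheat(self, reason):
        if self.attempts >= HEAL_CAP:
            # The cap bites a continued cheat: escalate, never auto-PASS.
            return f"HARD-STOP: escalate to human ({reason})"
        self.attempts += 1
        return f"return to build: honest redo {self.attempts}/{HEAL_CAP}"


snapshot = {"tests/test_pay.py": digest(b"assert pay(1) == 1\n")}
current = {"tests/test_pay.py": digest(b"assert True\n")}

loop = HealLoop()
for name in tampered(snapshot, current):
    print(loop.confirmed_cheat(f"tamper in {name}"))
```

An honest build never calls `confirmed_cheat`, so it passes at any attempt count; only a fourth confirmed cheat reaches the HARD-STOP branch.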
110
147
 
111
148
  ## Emitting deltas — feeding the foundation back
112
149
 
113
150
  The completeness-critic does not discard what it finds. Every gap, surprise, or convention that helped
114
- or hurt becomes an **`open` competency delta** in the task's OBSERVE block, in the `deltas.md` grammar,
151
+ or hurt becomes an **`open` lesson learned** in the task's OBSERVE block, in the `deltas.md` grammar,
115
152
  tagged by competency:
116
153
 
117
154
  - a finding the run FIXED but that taught the foundation something (a missing scenario -> `TDD`);
118
155
  - a finding the run could NOT fix — a residue escalation -> a delta AND the escalation to a human.
119
156
 
120
- These `open` deltas feed v5's human-gated fold (`fold.md`) at milestone close: the run emits `open`;
121
- the human folds. That is the loop closing — **v6 run -> v5 foundation** — so a dynamic run sharpens the
157
+ These `open` deltas feed v5's human-gated consolidation (`fold.md`) at milestone close: the run emits `open`;
158
+ the human consolidates. That is the loop closing — **v6 run -> v5 foundation** — so a dynamic run sharpens the
122
159
  five competencies instead of letting its findings evaporate at end-of-run.
123
160
 
124
- ## The autonomy dial
161
+ ## The autonomy level
125
162
 
163
+ <constraints>
126
164
  How much a run may auto-gate is a **per-scope setting**, not a global switch (principle 5: trust is
127
165
  earned per scope). A task declares its level in its `TASK.md` header:
128
166
 
129
167
  ```
130
- autonomy: auto | conservative
168
+ autonomy: manual | conservative | auto
131
169
  ```
132
170
 
133
- - **auto (the default)** the run may auto-PASS when the evidence + residue checks above are
171
+ An ordered ladder, `manual < conservative < auto`, declared once in the header and reviewed at the freeze:
172
+
173
+ - **auto (the seeded default)** — the run may auto-PASS when the evidence + residue checks above are
134
174
  satisfied. Security still always escalates. This is the default starting point: a frozen contract
135
175
  flips the task into a self-driving run that converges and auto-gates on evidence.
136
176
  - **conservative** — the deliberate *lowering*: the run does all the work and converges, but STOPS at
137
177
  the verify gate for a human. Auto-PASS is disabled. Choose it wherever evidence is thin or risk is high.
178
+ - **manual** — the strict floor: the human owns the verify gate and the engine never auto-resolves
179
+ (behaviourally the conservative floor with the explicit "I drive this decision; the AI proposes only"
180
+ name). Choose it for the highest-stakes scope; like `conservative`, it satisfies the high-risk guard.
138
181
 
139
182
  > **v7 reversal (recorded, not hidden).** Earlier the default was `conservative` and `auto` was the
140
183
  > earned exception; v7 flips this — `auto` is the default, `conservative` is the deliberate lowering.
141
- > What did **not** change is principle 5: the dial is still **per-scope**, the level still lives in the
184
+ > What did **not** change is principle 5: the autonomy level is still **per-scope**, and it still lives in the
142
185
  > `TASK.md` header, and you still lower it anywhere risk demands. Only the starting point moved.
143
186
 
144
- **The high-risk guard — `auto` is refused where it matters most.** The dial is not a blank cheque. On a
187
+ **The high-risk guard — `auto` is refused where it matters most.** The autonomy level is not a blank cheque. On a
145
188
  **high-risk or method-defining scope** — anything where a wrong-but-plausible result is expensive or
146
189
  hard to reverse (auth, money, data-loss paths, the method/trust-layer itself) — `auto` must be lowered
147
- to `conservative`; leaving it at `auto` there is the reject code **`unguarded_high_risk_auto`**. This
148
- closes the v6 dogfood blind-spot, where the whole milestone ran at `auto` on the riskiest possible
190
+ to a stricter rung — `conservative` or `manual`; leaving it at `auto` there is the reject code
191
+ **`unguarded_high_risk_auto`**. This
192
+ closes the v6 dogfood gap, where the whole milestone ran at `auto` on the riskiest possible
149
193
  scope (defining the method) with no friction. The default is `auto` *for ordinary, well-tested scope*;
150
194
  high risk still earns a human gate.
151
195
 
152
196
  Judging *what* is high-risk stays human — the scope declares **`risk: high`** in the same `TASK.md`
153
- header where the dial lives, reviewed at the freeze like every header line (the engine never
197
+ header where the autonomy level lives, reviewed at the freeze like every header line (the engine never
154
198
  classifies scope). **Since v14 the guard is mechanical for the declared case:**
155
199
  the engine refuses the declared combination — `add.py gate` will not complete (`PASS`/`RISK-ACCEPTED`) a task whose header
156
- carries `risk: high` without `autonomy: conservative` (error `unguarded_high_risk_auto`; `HARD-STOP`
200
+ carries `risk: high` without a lowered level — `conservative` or `manual` (error `unguarded_high_risk_auto`; `HARD-STOP`
157
201
  always records — stopping is never blocked), and `add.py audit` flags the same code on a finished
158
202
  record whose header was tampered or whose GATE RECORD reviewer is the auto-gate — which CI enforces
159
203
  (audit-ci). The honest limit mirrors the audit's: an **undeclared** high-risk scope passes; declaring
160
- is the human seam, the engine enforces what was declared.
204
+ is the human decision point; the engine enforces what was declared.
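The declared-case guard reduces to a small header check. The following is a sketch under stated assumptions, not `add.py` itself: `parse_header` and `guard` are hypothetical helpers, the header is modelled as bare `key: value` lines, and a missing `autonomy` line is assumed to mean the seeded default `auto`.

```python
LOWERED = {"conservative", "manual"}  # the rungs that satisfy the high-risk guard


def parse_header(text):
    """Read `key: value` lines into a dict (a simplified stand-in for the TASK.md header)."""
    header = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            header[key.strip().lower()] = value.strip().lower()
    return header


def guard(header, outcome):
    """Return the reject code for a completing outcome on an unguarded high-risk header."""
    if outcome == "HARD-STOP":
        return None  # stopping is never blocked
    if header.get("risk") == "high" and header.get("autonomy", "auto") not in LOWERED:
        return "unguarded_high_risk_auto"
    return None


header = parse_header("autonomy: auto\nrisk: high\n")
print(guard(header, "PASS"))       # refused: declared high risk left at auto
print(guard(header, "HARD-STOP"))  # allowed: a stop always records
```

Note what the sketch cannot do: a header with no `risk: high` line passes untouched, which is the undeclared-scope limit the text describes.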
205
+
206
+ **Autonomy is earned by goal-clarity — the auto-ready goal.** The level decides *who* resolves Verify;
207
+ an **auto-ready goal** decides whether a self-verifying run is even *meaningful*. A milestone goal is
208
+ auto-ready when **every exit criterion cites a verifier** — `(verify: <test | command | metric>)` — so the
209
+ run can check its own result against the goal without human judgment. `add.py check` raises a
210
+ `goal_not_auto_ready` WARN (never red, the active milestone only) while criteria are uncited, and `status`
211
+ prints a `goal-ready:` line every session. It **measures, never blocks** — it changes neither the freeze
212
+ gate nor the autonomy level. The lint forces a citation slot per criterion (raising the floor) but cannot
213
+ prove the citation is honest (`(verify: it works)` passes) — that judgment stays the human's.
214
+ </constraints>
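To make the auto-ready lint concrete, here is a rough sketch of a per-criterion citation check. It is illustrative only: `uncited` and `goal_ready` are hypothetical names, the regular expression is an assumed reading of the `(verify: <test | command | metric>)` slot, and the returned dict is not the engine's output format.

```python
import re

# A citation slot: "(verify: " followed by at least one non-space character.
VERIFY = re.compile(r"\(verify:\s*\S[^)]*\)")


def uncited(criteria):
    """Return the exit criteria that carry no (verify: ...) citation."""
    return [text for text in criteria if not VERIFY.search(text)]


def goal_ready(criteria):
    """Measure readiness; this reports a warning and never blocks anything."""
    missing = uncited(criteria)
    return {
        "auto_ready": not missing,
        "warn": "goal_not_auto_ready" if missing else None,
        "uncited": missing,
    }


report = goal_ready([
    "User can export a CSV (verify: pytest tests/test_export.py)",
    "Latency feels fine",
])
print(report["warn"], report["uncited"])
```

As the text warns, a check of this shape only proves the slot is filled: `goal_ready(["Ships (verify: it works)"])` reports the goal as auto-ready, so judging whether the citation is honest stays with the human.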
@@ -20,7 +20,7 @@ scope drafting honors intake's classification — it never re-sizes a request:
20
20
  means one drafting pass, NOT auto-creation. Nothing is written to disk — single draft or the
21
21
  whole batch — until the human confirms. You propose; you wait.
22
22
 
23
- ## Brainstorm before you draft — co-specify at milestone altitude
23
+ ## Brainstorm before you draft — co-specify at milestone level
24
24
 
25
25
  Don't draft a MILESTONE.md from thin input. Run the same three-move co-specify as a
26
26
  task's §1 (`phases/1-specify.md`) — Diverge (framings + open questions) → Converge
@@ -31,12 +31,14 @@ Draft the WHOLE milestone before showing; nothing hits disk until the human conf
31
31
  Diverge seeds (pick the live ones):
32
32
  - **Outcome** — done means a user can do *what* they can't today? (goal sentence)
33
33
  - **Edge of scope** — nearest thing assumed IN that you want OUT? (Out list)
34
- - **Riskiest seam** — which contract, if wrong, costs the most rework? (freeze-first)
34
+ - **Riskiest decision point** — which contract, if wrong, costs the most rework? (freeze-first)
35
35
  - **Done-looks-like** — how do we SEE each outcome without reading code? (exit criteria)
36
36
  - **First slice** — which task unblocks the rest? (breadth-first order)
37
37
 
38
- Rank assumptions least-sure first; the top 1–2 get the flag the human reads at confirm:
39
- `⚠ <assumption> — least sure because <why>; if wrong: <cost>`.
38
+ Rank assumptions lowest-confidence first; the top 1–2 get the flag the human reads at confirm:
39
+ `⚠ <assumption> — lowest confidence because <why>; if wrong: <cost>`. Present the draft via
40
+ `report-template.md` — open with the ARC (goal · done · plan): the goal this milestone serves,
41
+ what is already covered, and the plan its task list lays out.
40
42
 
41
43
  ## Drafting a good MILESTONE.md (section by section)
42
44
 
@@ -45,8 +47,8 @@ Rank assumptions least-sure first; the top 1–2 get the flag the human reads at
45
47
  - **Scope In/Out** — the explicit anti-creep deferral list. Naming what is OUT is as important
46
48
  as what is IN; an empty Out list usually means the scope is not yet thought through.
47
49
  - **Shared decisions & glossary deltas** — cross-cutting rules every task must honor, named from
48
- the glossary. New terms get a glossary entry (the survivor layer stays honest).
49
- - **Shared / risky contracts to freeze first** — the seams between tasks; name the owning task.
50
+ the glossary. New terms get a glossary entry (the living documentation stays honest).
51
+ - **Shared / risky contracts to freeze first** — the decision points between tasks; name the owning task.
50
52
  - **Tasks (breadth-first)** — `slug · depends-on · one line` each. Decompose by deliverable, not
51
53
  by phase; keep each task one-file-sized. Order by dependency, not by guesswork.
52
54
  - **Exit criteria** — observable, and **every exit criterion maps to a declared task slug**
@@ -54,6 +56,7 @@ Rank assumptions least-sure first; the top 1–2 get the flag the human reads at
54
56
 
55
57
  ## Reject codes (emit `{ reject, rationale }`, create nothing)
56
58
 
59
+ <reject_codes>
57
60
  - `not_classified` — the request has not been through intake yet. Classify it first; you cannot
58
61
  draft scope for an unclassified request.
59
62
  - `dangling_criterion` — a drafted MILESTONE.md has an exit criterion that maps to no declared
@@ -61,6 +64,7 @@ Rank assumptions least-sure first; the top 1–2 get the flag the human reads at
61
64
  a malformed milestone. With no engine lint, you are the first check and the human is the backstop.
62
65
  - `no_milestone` — intake routed the request to `task` or `change-request`; scope drafting
63
66
  creates NO milestone. Honor the classification; do not invent milestone-sized scope.
67
+ </reject_codes>
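The `dangling_criterion` check can be sketched as a set-membership test. This is a hedged illustration: the source says no engine lint exists for it, so `check_criteria` is a hypothetical helper an agent might reason through, and representing each exit criterion as a `(text, slug)` pair is an assumption about the draft's shape.

```python
def check_criteria(task_slugs, criteria):
    """Return a {reject, rationale} dict for dangling criteria, or None if all map.

    criteria is a list of (text, slug) pairs, where slug names the task
    that is meant to deliver the criterion.
    """
    declared = set(task_slugs)
    dangling = [text for text, slug in criteria if slug not in declared]
    if dangling:
        return {
            "reject": "dangling_criterion",
            "rationale": "no declared task for: " + "; ".join(dangling),
        }
    return None


# Made-up slugs and criteria for illustration.
result = check_criteria(
    ["csv-export", "audit-log"],
    [("User can export a CSV", "csv-export"), ("Admin sees history", "history-view")],
)
print(result)
```

On a reject, nothing is created; the draft goes back for either a new task or a dropped criterion.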
64
68
 
65
69
  ## Worked example (from this repo's own history)
66
70
 
@@ -1,11 +1,11 @@
1
1
  # Setup review — the one page the human signs
2
2
 
3
- Autonomous setup ends at a single human gate: the **lock-down** (`add.py lock`). Before that
3
+ Autonomous setup ends at a single human gate: the **baseline approval** (`add.py lock`). Before that
4
4
  signature is honest, the human needs to see *what you drafted and how sure you were* — not re-derive
5
5
  it. `SETUP-REVIEW.md` is that page: every decision you made while drafting the foundation, first-scope,
6
- and the first contract, **ordered least-sure-first** so the riskiest guesses meet their eye first.
6
+ and the first contract, **ordered lowest-confidence-first** so the riskiest guesses meet their eye first.
7
7
 
8
- This is the setup-altitude analog of presenting a task's front least-sure-first at the contract freeze.
8
+ This is the setup-level analog of presenting a task's specification bundle lowest-confidence-first at the contract freeze.
9
9
  The engine never reads this file — `add.py lock` is judgment-free, the signature *is* the gate (see
10
10
  `setup-lock-state`). The human **reading** this page is the review; your job is to make the reading honest.
11
11
 
@@ -13,7 +13,7 @@ The engine never reads this file — `add.py lock` is judgment-free, the signatu
13
13
 
14
14
  Write **one** artifact at `.add/SETUP-REVIEW.md`. **Never clobber a human-edited one** — if it already
15
15
  exists with hand edits, append/update, don't overwrite (the same non-clobber rule `init` applies to
16
- survivors). It is a per-onboarding, setup-altitude artifact; it sits beside `PROJECT.md`, not under a task.
16
+ living docs). It is a per-onboarding, setup-level artifact; it sits beside `PROJECT.md`, not under a task.
17
17
 
18
18
  ## The template
19
19
 
@@ -27,14 +27,15 @@ survivors). It is a per-onboarding, setup-altitude artifact; it sits beside `PRO
27
27
  | 1 | <the drafted decision> | PROJECT.md \| scope \| first-contract | `guessed` | <the inference + why you had to guess> |
28
28
  | 2 | <…> | <…> | `evidence-grounded` | <cite the source file/line you read it from> |
29
29
 
30
- Sign: reviewed the above → `add.py lock --by "<name>"`
30
+ Sign: confirm in chat → the agent runs `add.py lock --by "<name>"` (typing it yourself works too)
31
31
  ```
32
32
 
33
- Rows are numbered for reference at the gate ("row 1 is the one I'm least sure about").
33
+ Rows are numbered for reference at the gate ("row 1 is where my confidence is lowest").
34
34
 
35
35
  ## The two rules that make it honest
36
36
 
37
- 1. **Least-sure-first.** Order rows by confidence **ascending**. A `guessed` row always floats above an
37
+ <constraints>
38
+ 1. **Lowest-confidence-first.** Order rows by confidence **ascending**. A `guessed` row always floats above an
38
39
  `evidence-grounded` one. The point is not completeness theatre — it is to spend the human's attention
39
40
  where it changes outcomes: the top of the table is the part they actually need to challenge.
40
41
 
@@ -45,13 +46,15 @@ Rows are numbered for reference at the gate ("row 1 is the one I'm least sure ab
45
46
  onboarding (a near-empty repo, only the 4-lens answers) produces these. These are what the human
46
47
  must check; that is why they sit on top.
47
48
 
48
- The tag vocabulary is shared with `adopt.md` — the brownfield map tags each filled survivor decision
49
+ The tag vocabulary is shared with `adopt.md` — the brownfield map tags each filled living-doc decision
49
50
  `guessed`/`evidence-grounded`, and those tags flow straight into this table.
51
+ </constraints>
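The ordering rule amounts to a stable sort on the confidence tag. A minimal sketch, assuming rows are held as dicts before being rendered into the table; the keys `decision` and `confidence` and the helper names are hypothetical, while the two tags come from the text.

```python
RANK = {"guessed": 0, "evidence-grounded": 1}  # ascending confidence


def order_rows(rows):
    """Sort lowest-confidence-first; sorted() is stable, so ties keep their drafted order."""
    return sorted(rows, key=lambda row: RANK[row["confidence"]])


def number_rows(rows):
    """Attach the 1-based row number used for reference at the gate."""
    return [dict(row, n=i) for i, row in enumerate(order_rows(rows), start=1)]


# Made-up decisions for illustration.
drafted = [
    {"decision": "Python 3.11 runtime", "confidence": "evidence-grounded"},
    {"decision": "Primary user is an ops engineer", "confidence": "guessed"},
    {"decision": "Postgres for storage", "confidence": "evidence-grounded"},
]
for row in number_rows(drafted):
    print(row["n"], row["confidence"], row["decision"])
```

An unknown tag raises `KeyError` here, a deliberate choice for the sketch so that a row cannot be ordered without declaring how sure it is.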
50
52
 
51
53
  ## Where it ends
52
54
 
53
- `SETUP-REVIEW.md` is **read-only context** for the lock-down. You do not ask the human to approve it
54
- field-by-field; you present it, least-sure-first, and they sign once:
55
+ `SETUP-REVIEW.md` is **read-only context** for the baseline approval. You do not ask the human to approve it
56
+ field-by-field; you present it, lowest-confidence-first; they confirm in conversation, and you run the lock
57
+ with their name:
55
58
 
56
59
  ```bash
57
60
  python3 .add/tooling/add.py lock --by "<name>"