npm - @pilotspace/add - Versions diffs - 1.1.0 → 1.3.0 - Mend

@pilotspace/add 1.1.0 → 1.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (61) hide show

package/CHANGELOG.md +81 -0
package/GETTING-STARTED.md +187 -139
package/README.md +13 -7
package/bin/cli.js +96 -5
package/docs/01-principles.md +3 -3
package/docs/02-the-flow.md +19 -12
package/docs/03-step-1-specify.md +15 -13
package/docs/04-step-2-scenarios.md +2 -2
package/docs/05-step-3-contract.md +3 -3
package/docs/06-step-4-tests.md +10 -2
package/docs/07-step-5-build.md +3 -1
package/docs/08-step-6-verify.md +25 -5
package/docs/09-the-loop.md +12 -6
package/docs/10-setup-and-stages.md +27 -13
package/docs/11-governance.md +6 -2
package/docs/12-roles.md +3 -3
package/docs/13-adoption.md +1 -1
package/docs/14-foundation.md +15 -15
package/docs/15-foundations-and-lineage.md +106 -0
package/docs/README.md +4 -0
package/docs/appendix-a-templates.md +3 -3
package/docs/appendix-b-prompts.md +40 -5
package/docs/appendix-c-glossary.md +49 -12
package/docs/appendix-d-worked-example.md +2 -2
package/docs/appendix-e-checklists.md +16 -4
package/docs/appendix-f-requirements-matrix.md +8 -8
package/docs/appendix-g-references.md +106 -0
package/package.json +1 -1
package/skill/add/SKILL.md +41 -38
package/skill/add/adopt.md +13 -11
package/skill/add/deltas.md +8 -6
package/skill/add/fold.md +19 -17
package/skill/add/graduate.md +74 -0
package/skill/add/intake.md +22 -7
package/skill/add/loop.md +59 -0
package/skill/add/phases/0-ground.md +66 -0
package/skill/add/phases/0-setup.md +32 -25
package/skill/add/phases/1-specify.md +28 -13
package/skill/add/phases/2-scenarios.md +14 -4
package/skill/add/phases/3-contract.md +27 -12
package/skill/add/phases/4-tests.md +15 -5
package/skill/add/phases/5-build.md +33 -4
package/skill/add/phases/6-verify.md +40 -2
package/skill/add/phases/7-observe.md +13 -5
package/skill/add/report-template.md +65 -7
package/skill/add/run.md +93 -39
package/skill/add/scope.md +10 -6
package/skill/add/setup-review.md +13 -10
package/skill/add/streams.md +88 -23
package/tooling/add.py +1817 -90
package/tooling/templates/CONVENTIONS.md.tmpl +1 -1
package/tooling/templates/DESIGN.md.tmpl +66 -0
package/tooling/templates/GLOSSARY.md.tmpl +29 -0
package/tooling/templates/MILESTONE.md.tmpl +1 -0
package/tooling/templates/PROJECT.md.tmpl +6 -3
package/tooling/templates/TASK.md.tmpl +55 -15
package/tooling/templates/catalog.sample.json +38 -0
package/tooling/templates/prototype.sample.json +48 -0
package/tooling/templates/tokens.sample.json +55 -0
package/tooling/templates/udd-catalog.md +122 -0
package/tooling/templates/udd-tokens.md +79 -0

package/skill/add/phases/5-build.md CHANGED Viewed

@@ -10,6 +10,21 @@ Pick ONE task-sized slice, restate the tests it must satisfy, implement, run
 tests, iterate to green. Keep each batch small enough to review in full — you
 cannot move faster than you can verify.
+## Declaring the scope of impact (Scope + Strategy)
+§5 of TASK.md opens with two declarations, drafted WITH the specification bundle
+and frozen by the one §3 approval — never invented mid-build:
+- **Scope (may touch)** — the allowlist of every file the build may write
+  (backticked tokens; grammar in the template comment). During build, needing a
+  file outside the declared Scope is a **STOP → change request** back to Specify,
+  never improvisation.
+- **Strategy (ordered batches)** — the planned build order. Guidance, not
+  enforced: it aims the small-batches loop, it does not gate it.
+Deferral, named: the engine gate (touched ⊆ declared) lands in the
+`scope-gate-enforce` task — until it ships this section is prose discipline.
 ## The cardinal rule
 **Never weaken or delete a test to make it pass, and never edit the frozen
@@ -19,18 +34,26 @@ change request back to Specify. Honor the feature-specific safety rule named in
 ## AI prompt
-> Read §1, §3, §4, and CONVENTIONS. Make EVERY failing test pass, one small batch
-> at a time. Constraints: do NOT change any test; do NOT change the contract; honor
-> the §5 safety rule; use only allow-listed packages; stop and ask if unclear.
-> Report which tests pass and exactly what changed.
+<prompt>
+Role: implement the feature so EVERY failing test passes — the build phase.
+Read first: §1 · §3 · §4 · CONVENTIONS.
+Objective: every §4 test green, one small batch at a time.
+Steps:
+  1. Make EVERY failing test pass, one small batch at a time, honoring the §5 safety rule.
+  2. Report which tests pass and exactly what changed.
+Never: change a test or the contract; use a package off the allow-list; or push past something unclear instead of asking.
+</prompt>
 ## Exit gate
+<exit_gate>
 - [ ] All tests pass.
 - [ ] Coverage did not decrease.
 - [ ] No test and no contract modified by the AI.
 - [ ] No dependency outside the allow-list.
+- [ ] No file outside the declared §5 Scope was touched.
 - [ ] Change small enough to review in full.
+</exit_gate>
 ## Next
@@ -39,3 +62,9 @@ Book: `docs/07-step-5-build.md`.
 > Under `autonomy: auto` (the default) Build and Verify run together as one dynamic,
 > evidence-auto-gated run — not two manual stops. See `run.md`.
+>
+> **Honest redo.** If the verify gate finds a confirmed cheat (a tamper, or a reported
+> earned-green failure), the task returns HERE for an honest redo — revert the tampered
+> file or de-overfit src, then advance again. This is the bounded self-heal loop (`run.md`),
+> capped: after the cap a confirmed cheat HARD-STOPs to the human. Never weaken a test or
+> edit the frozen contract to pass.

package/skill/add/phases/6-verify.md CHANGED Viewed

@@ -1,4 +1,4 @@
-# Phase 6 — Verify (evidence + blind-spot checks)
+# Phase 6 — Verify (evidence + non-functional review)
 Goal: establish trust and record an outcome. Passing tests are necessary, not
 sufficient. Fill **§6** in TASK.md including the GATE RECORD.
@@ -31,8 +31,44 @@ If any is false, stop and return to Build — there is nothing to verify yet.
   note reviewed by the auto-gate is an audit finding (`unescalated_security_note`).
 - **Architecture** — does it respect layering/dependency rules in CONVENTIONS.md?
+## Part three — the deep check (do not skim)
+Green tests prove behavior on the inputs you thought of. They do not prove the change
+is *wired in*, nor that you did not leave a dead end behind — and for a non-coding change
+they prove nothing about whether you actually *read* the thing you signed off. So one more
+requirement, every gate:
+Deep check — do not skim. If the task produced code, record that every new symbol is
+referenced (wiring) and that no new dead/unused code was introduced. If it produced prose
+or non-code, record a semantic read — what you read in full and what it confirmed. Which
+path applies is the resolver's judgement; the engine never classifies.
+Record it in the §6 **Deep checks** block — where each new symbol is called (a reference
+search), the dead-code scan result, or the prose you read in full and what it confirmed.
+An unfilled Deep checks block is a **shallow verify**, not a PASS.
+## Part four — was the green earned?
+A green suite proves the tests pass — not that the build EARNED them. Three judgment cheats
+pass the unchanged suite without earning it: src overfit to the test fixtures (special-cased
+to the literal inputs, not the general behavior §1 asked for), vacuous asserts (tautological —
+green even against an empty implementation), and real logic stubbed away (the function returns
+a constant the tests happen to accept). These cheats are invisible to the mechanical tamper
+tripwire, which only sees edited files. Score them with an adversarial refute-read: an
+independent reviewer — a subagent under `autonomy: auto` is recommended, the engine never
+spawns one — prompted to argue the green was NOT earned from outside the build context. This
+is the verify-gate, whole-suite specialization of run.md's adversarial verify (see run.md), not
+a new discipline. A confirmed earned-green failure is HARD-STOP-class: never auto-passed, never
+RISK-ACCEPTED — but a first cheat is a chance to redo: a confirmed cheat (mechanical tamper or a
+reported earned-green failure) enters the bounded self-heal loop — it returns to build for an honest
+redo, and only after the loop's cap does it HARD-STOP to the human (the loop lives in run.md).
 ## Record exactly one outcome (no silent pass)
+When you present this gate to the human, open with the ARC (goal · done · plan) per
+`report-template.md`, and reconcile its FLAGS with `add.py report --decide`'s open-item count
+before the ask — per that file's reconcile rule (verify is where a flag-vs-digest mismatch bites).
 | Outcome | When |
 |---------|------|
 | `PASS` | all checks met |
@@ -41,8 +77,10 @@ If any is false, stop and return to Build — there is nothing to verify yet.
 ## Exit gate / Next
-- [ ] Evidence confirmed, blind-spots checked, outcome recorded — a person approved, or
+<exit_gate>
+- [ ] Evidence confirmed, non-functional risks checked, outcome recorded — a person approved, or
   (under `autonomy: auto` with no residue) the run auto-resolved as the accountable owner.
+</exit_gate>
 ```bash
 python3 .add/tooling/add.py gate PASS          # marks the task done

package/skill/add/phases/7-observe.md CHANGED Viewed

@@ -6,7 +6,7 @@ about the feature finally appears. Fill **§7** in TASK.md.
 ## Do
-1. **Release behind a blast-radius limit** — feature flag and/or gradual rollout.
+1. **Release behind a scope-of-impact limit** — feature flag and/or gradual rollout.
 2. **Reuse scenarios as monitors** — the §2 scenarios that defined "correct" now
    define what you alert on: overall error rate, each rejection's rate (a spike in
    one is a signal), latency of the risky operation under load.
@@ -15,16 +15,24 @@ about the feature finally appears. Fill **§7** in TASK.md.
 ## AI prompt
-> Role: a reliability analyst feeding the next cycle. Read telemetry, objectives,
-> incidents. Report error-budget burn; cluster errors and surface the top
-> real-world failures; draft a SPEC delta with evidence links. Never auto-roll-back
-> — recommend; a human owns the production decision.
+<prompt>
+Role: a reliability analyst feeding the next cycle.
+Read first: telemetry · objectives · incidents.
+Objective: turn what production shows into the next SPEC delta.
+Steps:
+  1. Report error-budget burn.
+  2. Cluster errors and surface the top real-world failures.
+  3. Draft a SPEC delta with evidence links.
+Never: auto-roll-back — recommend; a human owns the production decision.
+</prompt>
 ## Exit gate
+<exit_gate>
 - [ ] Released behind a flag/rollout.
 - [ ] Scenario-based monitors live.
 - [ ] A reviewed spec delta captured (becomes the next `new-task`).
+</exit_gate>
 ## Next

package/skill/add/report-template.md CHANGED Viewed

@@ -1,19 +1,59 @@
-# Chat reports — the seam template (for the AI, not for add.py)
+# Chat reports — the decision-point template (for the AI, not for add.py)
 The engine renders artifacts (`report`, `report --decide`, `status`); this file
 governs the CHAT MESSAGE you wrap around them. The digest is the artifact BEHIND
 your presentation, never a replacement for it — and your prose is never a
 replacement for the digest.
-Use it every time you report at or near a decision seam: an intake proposal, a
-bundle/front approval, a verify gate, a task completion, a milestone close.
+Use it every time you report at or near a decision point: an intake proposal, a
+bundle approval, a verify gate, a task completion, a milestone close.
+## The decision arc — rendered first, above the five blocks
+Every report at a human gate opens with the **ARC** — three labelled lines that
+place the decision in the work's whole arc, so the human confirms with sight of
+where this is going, not just the step in front of them. Render it first, then a
+separator, then the unchanged five blocks below:
+```
+ARC  goal: <the milestone / project goal this decision serves>
+     done: <proven progress — tasks done · exit-criteria met · what this gate proves>
+     plan: <this gate → the next step → the goal>
+```
+- **goal** — the milestone or project goal the decision serves, read from the
+  `m-goal` line in `add.py status`; never re-typed from memory.
+- **done** — proven progress only: exit-criteria met/total and tasks done from
+  the rollup, plus what this gate proves. An honest fact, never a hope.
+- **plan** — this gate → the next step → the goal, mirroring the rollup's
+  `DECIDE NEXT` line.
+The arc is required at every human gate: **baseline-lock · contract-freeze ·
+verify · intake · scope · milestone-close · graduation**. The three labels stay
+constant; their content adapts to the gate. The arc is presentation only — it
+adds no gate and changes no PASS / RISK-ACCEPTED / HARD-STOP / freeze outcome.
+Its facts are engine-sourced, exactly like EVIDENCE below: goal = `m-goal` ·
+done = exit-criteria met/total + tasks done · plan = `DECIDE NEXT`. If your arc
+and `add.py` output disagree, the engine wins — fix the arc, not the engine.
+### Per-gate examples — one shape, gate-specific content
+- **verify** — `goal:` ship the decision arc · `done:` report-arc tests 6/6
+  green, gate ready · `plan:` PASS this gate → wire the arc into every gate → goal.
+- **contract-freeze** — `goal:` … · `done:` bundle drafted, lowest-confidence
+  flag surfaced · `plan:` freeze §3 → build → goal.
+- **milestone-close** — `goal:` … · `done:` exit-criteria 3/3 met, all tasks
+  done · `plan:` close → archive → the next milestone.
+- **intake** — `goal:` the sized request · `done:` classified new-major,
+  rationale stated · `plan:` create the milestone → first contract → goal.
 ## The five blocks, in order
 ```
 SUMMARY   one line: intent + target + where we are
 DECISION  what you need from the human (or "none — FYI")
-⚠ FLAGS   least-sure first, why + cost-if-wrong
+⚠ FLAGS   lowest-confidence first, why + cost-if-wrong
 EVIDENCE  small table: tests · gates · parity · check — engine-sourced
 NEXT      the single next action + what it unlocks
 ```
@@ -24,7 +64,7 @@ NEXT      the single next action + what it unlocks
 2. **DECISION** — the question the human must answer, stated plainly; exactly
    one decision per report, or an explicit "none — FYI". If a decision exists,
    ask it AFTER everything below has been shown (show-before-ask).
-3. **⚠ FLAGS** — least-sure first, each with *why* it is least sure and the
+3. **⚠ FLAGS** — lowest-confidence first, each with *why* confidence is lowest and the
    *cost if wrong*. Where TASK.md markers exist (`⚠` / `- [~]` / `- [ ]`),
    quote them verbatim and keep their document order — extraction ≠ judgment.
 4. **EVIDENCE** — engine-sourced facts pasted from `add.py` output, never
@@ -34,15 +74,33 @@ NEXT      the single next action + what it unlocks
    line when it is right; overrule it only with a stated reason (e.g. planned
    tasks the state file cannot see yet).
+**The ask itself** — when block 2's decision becomes a literal question component
+(option picker, numbered menu), compose it as a summary: the detail stays in the
+report above, the question carries intent + what "yes" means + the flag count.
 ## Hard rules
+<constraints>
 - **Summary-first.** Never bury the decision under a task list or a diff.
 - **Show before ask.** Render the artifact (digest · diff · report) before any
   approval question; the human decides on what they can see.
-- **Never pre-stamp a human seam.** Freeze / gate / lock fields stay DRAFT or
+- **Reconcile the count.** Before the ask, your ⚠ FLAGS must reconcile with
+  `add.py report --decide`'s open-item count. If your prose calls an item
+  resolved while the digest still counts it open, the engine wins — fix the data
+  (the TASK.md markers the digest reads), not the sentence. A report whose flag
+  count disagrees with the engine is the un-transparent gate the ARC exists to close.
+- **Never pre-stamp a human decision point.** Freeze / gate / lock fields stay DRAFT or
   blank until the answer returns: show → ask → stamp → advance. An artifact
   must never claim an approval that has not happened.
-- **One report per seam.** After an approval, point at the frozen artifact —
+- **One report per decision point.** After an approval, point at the frozen artifact —
   do not re-render the whole bundle.
 - **Honest scope.** "Done" means the request, not the last task: report
   "task 2/3", never "done" while approved scope remains.
+- **The question is a summary, never the artifact.** Every approval ask carries
+  two layers: a compact SUMMARY · DECISION · ⚠ FLAGS block sits in chat
+  immediately before the ask (positional), and the question text itself is a
+  summary of two lines at most — intent + what "yes" means + the flag count —
+  pointing at the report above (compositional). The full bundle, diff, or
+  artifact lives only in the chat report; a question that re-carries it buries
+  the decision.
+</constraints>

package/skill/add/run.md CHANGED Viewed

@@ -1,25 +1,24 @@
 # The dynamic run — executing a locked scope
 Once a task's CONTRACT is frozen (phase 3), the scope is *locked*: the external shape will not move.
-That lock is ADD's autonomy seam — below it code is disposable; above it nothing breaks. This rubric
-covers what runs on the far side of the seam: the **build->verify half, executed as a dynamic,
-self-improving run** instead of a manual, sequential build. The human-led FRONT (Specify · Scenarios
-· Contract) still owns *direction*, but v7 compresses it to a **single human approval at the seam**
-(see "The one-approval front" below) — the AI drafts the whole front, a human approves it once.
+That lock is ADD's autonomy decision point — below it code is disposable; above it nothing breaks. This rubric
+covers what runs on the far side of the decision point: the **build->verify half, executed as a dynamic,
+self-improving run** instead of a manual, sequential build. The human-led **specification bundle** (Specify · Scenarios
+· Contract) still owns *direction*, but v7 compresses it to a **single human approval at the decision point**
+(see "The specification bundle" below) — the AI drafts the whole bundle, a human approves it once.
 > **Self-improving = within-run convergence + emit v5 deltas** — same definition as v5: tracked,
 > evidence-backed, never autonomous training. The run converges in-turn AND feeds the human-gated
-> fold loop (`deltas.md` · `fold.md`). The engine stays judgment-free: this is a rubric, not `add.py`.
+> consolidation loop (`deltas.md` · `fold.md`). The engine stays judgment-free: this is a rubric, not `add.py`.
-## The one-approval front (v7)
+## The specification bundle (v7)
-The human-led front used to be three separate approvals — Specify, then Scenarios, then the Contract
-freeze. v7 compresses it to **one**. From the user's input the AI **drafts the whole front as a single
-bundle** — the Spec, the Scenarios, the Contract, and the failing Tests — and presents it together. The
-human gives **one approval, at the frozen contract** (the seam). That single approval is the green light
+The specification bundle used to be three separate approvals — Specify, then Scenarios, then the Contract
+freeze. v7 compresses it to **one**. From the user's input the AI **drafts the whole specification bundle in one pass** — the Spec, the Scenarios, the Contract, and the failing Tests — and presents it together. The
+human gives **one approval, at the frozen contract** (the decision point). That single approval is the green light
 for the self-driving run.
-Why one approval and not zero: the contract freeze is the autonomy seam, and the seam **stays human**.
+Why one approval and not zero: the contract freeze is the autonomy decision point, and the decision point **stays human**.
 The AI *drafts* the contract but never *freezes its own* — a person approves the frozen shape before any
 auto-run touches code. This is exactly what keeps "never self-gate a human-led gate" true under an auto
 default: the one gate that remains is human. Drop it to zero and the AI would freeze the interface it
@@ -28,11 +27,11 @@ then builds against and self-gate the result — the circular trust v6's dogfood
 What the human is actually approving in that one gate: that the drafted Spec captures the real intent,
 that the Scenarios cover the cases that matter, and that the Contract shape is the one to freeze. Reject
 any part and the bundle goes back to draft — that is backward-correction (principle 4), not failure.
-Approve, and the run begins. The seam guide (`phases/3-contract.md`) carries the
-**freeze review checklist** — six lines that walk the human through exactly this, ⚠-first.
+Approve, and the run begins. The decision-point guide (`phases/3-contract.md`) carries the
+**freeze review checklist** — seven lines that walk the human through exactly this, ⚠-first.
-**The least-sure flag — aiming the one approval.** A single approval over a whole bundle invites a
-rubber stamp. So the AI presents the bundle **least-sure first**: of everything it is asking the human
+**The lowest-confidence flag — aiming the one approval.** A single approval over a whole bundle is easy to
+grant without reading. So the AI presents the bundle **lowest-confidence first**: of everything it is asking the human
 to freeze, it names the **1–2 points most likely to be wrong**, tagged by part
 (`⚠ [spec|scenario|contract|test] … — because …; if wrong: …`), each with *why* it is uncertain and
 *what it costs if wrong*. The §1 assumptions feed it, but a flag may equally point at an uncovered
@@ -40,7 +39,7 @@ scenario or the contract shape. If nothing is materially uncertain, the AI still
 biggest risk, however small — never a blank "none". Honest about its limit: the flag records that the
 human approved with the soft spots **in front of them**, eyes open; it makes a real review cheap and a
 lazy one visibly negligent, but it cannot *force* engagement — and the AI never asserts that the human
-engaged when it cannot know (a self-asserted gate would just be the rubber stamp one level up). Closing
+engaged when it cannot know (a self-asserted gate would just move the unread approval one level up). Closing
 that enforcement gap is the job of a CI checker, not of prose.
 ## When the run begins — the scope-lock trigger
@@ -50,17 +49,18 @@ The trigger is the **frozen contract**, nothing else. A run may start only when:
 - §3 CONTRACT is marked `FROZEN @ vN` (the shape is fixed), AND
 - §4 TESTS exist and are RED for the right reason (the target the run drives to green).
-No frozen contract -> no run: you are still on the human-led front, and starting early is the
+No frozen contract -> no run: you are still inside the specification bundle, and starting early is the
 forward-skip the flow forbids. The lock is what makes autonomous execution *safe* — the AI cannot
 drift the interface, because the interface is frozen above it.
-## The touch-boundary — what the run may and may not touch
+## The change scope — what the run may and may not touch
+<constraints>
 A locked run has a hard boundary. It MAY:
-- write and rewrite **code** (`src/`) — code is disposable below the seam;
+- write and rewrite **code** (`src/`) — code is disposable below the decision point;
 - drive the **tests** to green WITHOUT weakening them (a weakened test is a method violation);
-- gather **evidence** for the verify gate (test output, blind-spot checks).
+- gather **evidence** for the verify gate (test output, non-functional review).
 It MUST NOT:
@@ -68,10 +68,11 @@ It MUST NOT:
   the run STOPS and hands back to a human to reopen Specify (principle 4). The run never re-locks
   scope on its own.
 - weaken, delete, or skip a **test** to make the build pass (that inverts the method).
-- touch the **human-led front artifacts** (§1–§3) except to halt and escalate.
+- touch the **specification-bundle artifacts** (§1–§3) except to halt and escalate.
+</constraints>
 Crossing the boundary is not a fast run; it is an unverified one. When the run hits something only the
-front can resolve, it stops — and that stop is the loop working, not failing.
+specification bundle can resolve, it stops — and that stop is the loop working, not failing.
 ## The dynamic run — fan-out and in-run convergence
@@ -83,21 +84,28 @@ on a trustworthy result with three loops:
   Stopping at the first green is how defects survive; the run stops only when the well runs dry.
 - **adversarial verify** — for every "done" claim, an independent skeptic tries to REFUTE it. The
   claim survives only if it withstands refutation, not because one pass looked plausible.
-- **completeness-critic** — a final pass that asks "what did we NOT cover — a scenario, a blind-spot,
+- **completeness-critic** — a final pass that asks "what did we NOT cover — a scenario, a non-functional risk,
   an unstated assumption?" Whatever it finds re-enters the run.
 The run ends only when the loops go dry AND the auto-gate's evidence is satisfied. This is the run
 **self-improving within the turn** — the same convergence the foundation loop runs across milestones,
 compressed into one task.
-## The evidence auto-gate
+## The automated quality gate
+<constraints>
 The verify gate may be resolved by **evidence** rather than by a person — when the evidence is
 sufficient and the result is recorded (principle 7, reframed: an automated, recorded pass is an
 explicit pass, not a skip).
 - **Auto-PASS requires ALL of:** every test green; coverage not decreased; no test weakened and no
-  contract edited; the convergence loops dry; the completeness-critic found nothing open.
+  contract edited; the convergence loops dry; the completeness-critic found nothing open; and the
+  deep check below recorded.
+- **The deep check (every gate, no skim).** Deep check — do not skim. If the task produced code, record
+  that every new symbol is referenced (wiring) and that no new dead/unused code was introduced. If it
+  produced prose or non-code, record a semantic read — what you read in full and what it confirmed.
+  Which path applies is the resolver's judgement; the engine never classifies. An unfilled deep check is
+  a **shallow verify**, not an auto-PASS — evidence the work is wired, not merely plausible.
 - **Always escalates to a human (never auto-passed):** any **security** finding (HARD-STOP, always);
   a **concurrency**/timing risk the tests cannot exercise; an **architecture**/layering violation; and
   any failing test. These are the residue principle 2 names — automation cannot judge them.
@@ -107,54 +115,100 @@ explicit pass, not a skip).
 The auto-gate NEVER writes a human signature it did not get. An auto-PASS is logged as *auto-resolved*,
 honestly — the line between a pass and a skip is the recorded outcome, not a forged name.
+</constraints>
+## The bounded self-heal loop — a confirmed cheat returns to build
+The auto-gate trusts evidence; but evidence can be **gamed**. A build can make the unchanged red suite
+pass without EARNING it — a test or the frozen contract edited after the red run, src **overfit** to the
+fixtures, **vacuous** asserts, or real logic **stubbed away**. That is a **confirmed cheat**, and a cheat
+is **HARD-STOP-class**: never auto-passed, never RISK-ACCEPTED-waived (like a security finding). But a
+first cheat is not yet a stop — it is a chance to redo honestly.
+So a confirmed cheat enters a **bounded self-heal loop**: the engine returns the task to **build** for an
+honest redo, **counts** the attempt, and **caps** it. After **3** honest re-build attempts a fourth
+confirmed cheat forces a **HARD-STOP that escalates to the human** — never an auto-PASS, never an unbounded
+loop. The engine COUNTS, CAPS, and ESCALATES; the **agent** does the honest re-build (the engine never
+auto-fixes). The counter is **monotonic** — it never auto-resets, so the cap cannot be cleared by
+re-crossing a phase; only an honest build (no cheat) escapes the loop, and an honest build PASSes even at
+the third attempt (the cap bites a *continued* cheat, never a recovery).
+Two findings enter the loop:
+- **mechanical** (enforced) — the tamper tripwire (`tamper-tripwire`): at the gate the engine re-hashes the
+  red test files + the frozen §3 against the `tests→build` snapshot; any divergence is a cheat, routed to
+  the loop before any completing outcome is recorded.
+- **semantic** (honor-system, necessary-not-sufficient) — the **adversarial refute-read** (`6-verify.md`):
+  an independent reviewer argues "the green was NOT earned" and, on a confirmed overfit/vacuous/stub, the
+  agent reports it with `add.py heal <slug> --reason "<finding>"`. The engine cannot SEE a judgment cheat,
+  so this entry is the agent's honest report — the human verify gate stays the real backstop.
+The mechanical entry returns-to-build automatically at the gate; the `heal` verb is how a *reported* cheat
+enters the same bounded loop. Either way: ≤3 honest redos, then escalate. A gamed green never ships.
 ## Emitting deltas — feeding the foundation back
 The completeness-critic does not discard what it finds. Every gap, surprise, or convention that helped
-or hurt becomes an **`open` competency delta** in the task's OBSERVE block, in the `deltas.md` grammar,
+or hurt becomes an **`open` lesson learned** in the task's OBSERVE block, in the `deltas.md` grammar,
 tagged by competency:
 - a finding the run FIXED but that taught the foundation something (a missing scenario -> `TDD`);
 - a finding the run could NOT fix — a residue escalation -> a delta AND the escalation to a human.
-These `open` deltas feed v5's human-gated fold (`fold.md`) at milestone close: the run emits `open`;
-the human folds. That is the loop closing — **v6 run -> v5 foundation** — so a dynamic run sharpens the
+These `open` deltas feed v5's human-gated consolidation (`fold.md`) at milestone close: the run emits `open`;
+the human consolidates. That is the loop closing — **v6 run -> v5 foundation** — so a dynamic run sharpens the
 five competencies instead of letting its findings evaporate at end-of-run.
-## The autonomy dial
+## The autonomy level
+<constraints>
 How much a run may auto-gate is a **per-scope setting**, not a global switch (principle 5: trust is
 earned per scope). A task declares its level in its `TASK.md` header:
 ```
-autonomy: auto | conservative
+autonomy: manual | conservative | auto
 ```
-- **auto (the default)** — the run may auto-PASS when the evidence + residue checks above are
+An ordered ladder — `manual < conservative < auto` — declared once in the header and reviewed at the freeze:
+- **auto (the seeded default)** — the run may auto-PASS when the evidence + residue checks above are
   satisfied. Security still always escalates. This is the default starting point: a frozen contract
   flips the task into a self-driving run that converges and auto-gates on evidence.
 - **conservative** — the deliberate *lowering*: the run does all the work and converges, but STOPS at
   the verify gate for a human. Auto-PASS is disabled. Choose it wherever evidence is thin or risk is high.
+- **manual** — the strict floor: the human owns the verify gate and the engine never auto-resolves
+  (behaviourally the conservative floor with the explicit "I drive this decision; the AI proposes only"
+  name). Choose it for the highest-stakes scope; like `conservative`, it satisfies the high-risk guard.
 > **v7 reversal (recorded, not hidden).** Earlier the default was `conservative` and `auto` was the
 > earned exception; v7 flips this — `auto` is the default, `conservative` is the deliberate lowering.
-> What did **not** change is principle 5: the dial is still **per-scope**, the level still lives in the
+> What did **not** change is principle 5: the autonomy level is still **per-scope**, and it still lives in the
 > `TASK.md` header, and you still lower it anywhere risk demands. Only the starting point moved.
-**The high-risk guard — `auto` is refused where it matters most.** The dial is not a blank cheque. On a
+**The high-risk guard — `auto` is refused where it matters most.** The autonomy level is not a blank cheque. On a
 **high-risk or method-defining scope** — anything where a wrong-but-plausible result is expensive or
 hard to reverse (auth, money, data-loss paths, the method/trust-layer itself) — `auto` must be lowered
-to `conservative`; leaving it at `auto` there is the reject code **`unguarded_high_risk_auto`**. This
-closes the v6 dogfood blind-spot, where the whole milestone ran at `auto` on the riskiest possible
+to a stricter rung — `conservative` or `manual`; leaving it at `auto` there is the reject code
+**`unguarded_high_risk_auto`**. This
+closes the v6 dogfood gap, where the whole milestone ran at `auto` on the riskiest possible
 scope (defining the method) with no friction. The default is `auto` *for ordinary, well-tested scope*;
 high risk still earns a human gate.
 Judging *what* is high-risk stays human — the scope declares **`risk: high`** in the same `TASK.md`
-header where the dial lives, reviewed at the freeze like every header line (the engine never
+header where the autonomy level lives, reviewed at the freeze like every header line (the engine never
 classifies scope). **Since v14 the guard is mechanical for the declared case:**
 the engine refuses the declared combination — `add.py gate` will not complete (`PASS`/`RISK-ACCEPTED`) a task whose header
-carries `risk: high` without `autonomy: conservative` (error `unguarded_high_risk_auto`; `HARD-STOP`
+carries `risk: high` without a lowered level — `conservative` or `manual` (error `unguarded_high_risk_auto`; `HARD-STOP`
 always records — stopping is never blocked), and `add.py audit` flags the same code on a finished
 record whose header was tampered or whose GATE RECORD reviewer is the auto-gate — which CI enforces
 (audit-ci). The honest limit mirrors the audit's: an **undeclared** high-risk scope passes; declaring
-is the human seam, the engine enforces what was declared.
+is the human decision point, the engine enforces what was declared.
+**Autonomy is earned by goal-clarity — the auto-ready goal.** The level decides *who* resolves Verify;
+an **auto-ready goal** decides whether a self-verifying run is even *meaningful*. A milestone goal is
+auto-ready when **every exit criterion cites a verifier** — `(verify: <test | command | metric>)` — so the
+run can check its own result against the goal without human judgment. `add.py check` raises a
+`goal_not_auto_ready` WARN (never red, the active milestone only) while criteria are uncited, and `status`
+prints a `goal-ready:` line every session. It **measures, never blocks** — it changes neither the freeze
+gate nor the autonomy level. The lint forces a citation slot per criterion (raising the floor) but cannot
+prove the citation is honest (`(verify: it works)` passes) — that judgment stays the human's.
+</constraints>

package/skill/add/scope.md CHANGED Viewed

@@ -20,7 +20,7 @@ scope drafting honors intake's classification — it never re-sizes a request:
 means one drafting pass, NOT auto-creation. Nothing is written to disk — single draft or the
 whole batch — until the human confirms. You propose; you wait.
-## Brainstorm before you draft — co-specify at milestone altitude
+## Brainstorm before you draft — co-specify at milestone level
 Don't draft a MILESTONE.md from thin input. Run the same three-move co-specify as a
 task's §1 (`phases/1-specify.md`) — Diverge (framings + open questions) → Converge
@@ -31,12 +31,14 @@ Draft the WHOLE milestone before showing; nothing hits disk until the human conf
 Diverge seeds (pick the live ones):
 - **Outcome** — done means a user can do *what* they can't today? (goal sentence)
 - **Edge of scope** — nearest thing assumed IN that you want OUT? (Out list)
-- **Riskiest seam** — which contract, if wrong, costs the most rework? (freeze-first)
+- **Riskiest decision point** — which contract, if wrong, costs the most rework? (freeze-first)
 - **Done-looks-like** — how do we SEE each outcome without reading code? (exit criteria)
 - **First slice** — which task unblocks the rest? (breadth-first order)
-Rank assumptions least-sure first; the top 1–2 get the flag the human reads at confirm:
-`⚠ <assumption> — least sure because <why>; if wrong: <cost>`.
+Rank assumptions lowest-confidence first; the top 1–2 get the flag the human reads at confirm:
+`⚠ <assumption> — lowest confidence because <why>; if wrong: <cost>`. Present the draft via
+`report-template.md` — open with the ARC (goal · done · plan): the goal this milestone serves,
+what is already covered, and the plan its task list lays out.
 ## Drafting a good MILESTONE.md (section by section)
@@ -45,8 +47,8 @@ Rank assumptions least-sure first; the top 1–2 get the flag the human reads at
 - **Scope In/Out** — the explicit anti-creep deferral list. Naming what is OUT is as important
   as what is IN; an empty Out list usually means the scope is not yet thought through.
 - **Shared decisions & glossary deltas** — cross-cutting rules every task must honor, named from
-  the glossary. New terms get a glossary entry (the survivor layer stays honest).
-- **Shared / risky contracts to freeze first** — the seams between tasks; name the owning task.
+  the glossary. New terms get a glossary entry (the living documentation stays honest).
+- **Shared / risky contracts to freeze first** — the decision points between tasks; name the owning task.
 - **Tasks (breadth-first)** — `slug · depends-on · one line` each. Decompose by deliverable, not
   by phase; keep each task one-file-sized. Order by dependency, not by guesswork.
 - **Exit criteria** — observable, and **every exit criterion maps to a declared task slug**
@@ -54,6 +56,7 @@ Rank assumptions least-sure first; the top 1–2 get the flag the human reads at
 ## Reject codes (emit `{ reject, rationale }`, create nothing)
+<reject_codes>
 - `not_classified` — the request has not been through intake yet. Classify it first; you cannot
   draft scope for an unclassified request.
 - `dangling_criterion` — a drafted MILESTONE.md has an exit criterion that maps to no declared
@@ -61,6 +64,7 @@ Rank assumptions least-sure first; the top 1–2 get the flag the human reads at
   a malformed milestone. With no engine lint, you are the first check and the human is the backstop.
 - `no_milestone` — intake routed the request to `task` or `change-request`; scope drafting
   creates NO milestone. Honor the classification; do not invent milestone-sized scope.
+</reject_codes>
 ## Worked example (from this repo's own history)

package/skill/add/setup-review.md CHANGED Viewed

@@ -1,11 +1,11 @@
 # Setup review — the one page the human signs
-Autonomous setup ends at a single human gate: the **lock-down** (`add.py lock`). Before that
+Autonomous setup ends at a single human gate: the **baseline approval** (`add.py lock`). Before that
 signature is honest, the human needs to see *what you drafted and how sure you were* — not re-derive
 it. `SETUP-REVIEW.md` is that page: every decision you made while drafting the foundation, first-scope,
-and the first contract, **ordered least-sure-first** so the riskiest guesses meet their eye first.
+and the first contract, **ordered lowest-confidence-first** so the riskiest guesses meet their eye first.
-This is the setup-altitude analog of presenting a task's front least-sure-first at the contract freeze.
+This is the setup-level analog of presenting a task's specification bundle lowest-confidence-first at the contract freeze.
 The engine never reads this file — `add.py lock` is judgment-free, the signature *is* the gate (see
 `setup-lock-state`). The human **reading** this page is the review; your job is to make the reading honest.
@@ -13,7 +13,7 @@ The engine never reads this file — `add.py lock` is judgment-free, the signatu
 Write **one** artifact at `.add/SETUP-REVIEW.md`. **Never clobber a human-edited one** — if it already
 exists with hand edits, append/update, don't overwrite (the same non-clobber rule `init` applies to
-survivors). It is a per-onboarding, setup-altitude artifact; it sits beside `PROJECT.md`, not under a task.
+living docs). It is a per-onboarding, setup-level artifact; it sits beside `PROJECT.md`, not under a task.
 ## The template
@@ -27,14 +27,15 @@ survivors). It is a per-onboarding, setup-altitude artifact; it sits beside `PRO
 | 1 | <the drafted decision> | PROJECT.md \| scope \| first-contract | `guessed` | <the inference + why you had to guess> |
 | 2 | <…> | <…> | `evidence-grounded` | <cite the source file/line you read it from> |
-Sign: reviewed the above → `add.py lock --by "<name>"`
+Sign: confirm in chat → the agent runs `add.py lock --by "<name>"` (typing it yourself works too)
 ```
-Rows are numbered for reference at the gate ("row 1 is the one I'm least sure about").
+Rows are numbered for reference at the gate ("row 1 is where my confidence is lowest").
 ## The two rules that make it honest
-1. **Least-sure-first.** Order rows by confidence **ascending**. A `guessed` row always floats above an
+<constraints>
+1. **Lowest-confidence-first.** Order rows by confidence **ascending**. A `guessed` row always floats above an
    `evidence-grounded` one. The point is not completeness theatre — it is to spend the human's attention
    where it changes outcomes: the top of the table is the part they actually need to challenge.
@@ -45,13 +46,15 @@ Rows are numbered for reference at the gate ("row 1 is the one I'm least sure ab
      onboarding (a near-empty repo, only the 4-lens answers) produces these. These are what the human
      must check; that is why they sit on top.
-   The tag vocabulary is shared with `adopt.md` — the brownfield map tags each filled survivor decision
+   The tag vocabulary is shared with `adopt.md` — the brownfield map tags each filled living-doc decision
    `guessed`/`evidence-grounded`, and those tags flow straight into this table.
+</constraints>
 ## Where it ends
-`SETUP-REVIEW.md` is **read-only context** for the lock-down. You do not ask the human to approve it
-field-by-field; you present it, least-sure-first, and they sign once:
+`SETUP-REVIEW.md` is **read-only context** for the baseline approval. You do not ask the human to approve it
+field-by-field; you present it, lowest-confidence-first; they confirm in conversation, and you run the lock
+with their name:
 ```bash
 python3 .add/tooling/add.py lock --by "<name>"