npm - @pilotspace/add - Versions diffs - 1.2.0 → 1.3.0 - Mend

@pilotspace/add 1.2.0 → 1.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (32)

package/CHANGELOG.md +41 -0
package/GETTING-STARTED.md +22 -0
package/bin/cli.js +84 -2
package/docs/02-the-flow.md +4 -1
package/docs/03-step-1-specify.md +2 -0
package/docs/06-step-4-tests.md +8 -0
package/docs/07-step-5-build.md +2 -0
package/docs/08-step-6-verify.md +11 -0
package/docs/10-setup-and-stages.md +1 -1
package/docs/11-governance.md +4 -0
package/docs/appendix-c-glossary.md +8 -1
package/docs/appendix-e-checklists.md +14 -2
package/package.json +1 -1
package/skill/add/SKILL.md +4 -3
package/skill/add/phases/0-ground.md +66 -0
package/skill/add/phases/0-setup.md +3 -1
package/skill/add/phases/1-specify.md +5 -0
package/skill/add/phases/3-contract.md +3 -1
package/skill/add/phases/5-build.md +22 -0
package/skill/add/phases/6-verify.md +16 -0
package/skill/add/run.md +48 -5
package/skill/add/streams.md +21 -6
package/tooling/add.py +1348 -63
package/tooling/templates/DESIGN.md.tmpl +66 -0
package/tooling/templates/GLOSSARY.md.tmpl +7 -1
package/tooling/templates/PROJECT.md.tmpl +3 -1
package/tooling/templates/TASK.md.tmpl +23 -4
package/tooling/templates/catalog.sample.json +38 -0
package/tooling/templates/prototype.sample.json +48 -0
package/tooling/templates/tokens.sample.json +55 -0
package/tooling/templates/udd-catalog.md +122 -0
package/tooling/templates/udd-tokens.md +79 -0

package/skill/add/run.md CHANGED

@@ -28,7 +28,7 @@ What the human is actually approving in that one gate: that the drafted Spec cap
 that the Scenarios cover the cases that matter, and that the Contract shape is the one to freeze. Reject
 any part and the bundle goes back to draft — that is backward-correction (principle 4), not failure.
 Approve, and the run begins. The decision-point guide (`phases/3-contract.md`) carries the
-**freeze review checklist** — six lines that walk the human through exactly this, ⚠-first.
+**freeze review checklist** — seven lines that walk the human through exactly this, ⚠-first.
 **The lowest-confidence flag — aiming the one approval.** A single approval over a whole bundle is easy to
 grant without reading. So the AI presents the bundle **lowest-confidence first**: of everything it is asking the human
@@ -117,6 +117,34 @@ The auto-gate NEVER writes a human signature it did not get. An auto-PASS is log
 honestly — the line between a pass and a skip is the recorded outcome, not a forged name.
 </constraints>
+## The bounded self-heal loop — a confirmed cheat returns to build
+The auto-gate trusts evidence, but evidence can be **gamed**. A build can make the unchanged red suite
+pass without EARNING it — a test or the frozen contract edited after the red run, src **overfit** to the
+fixtures, **vacuous** asserts, or real logic **stubbed away**. That is a **confirmed cheat**, and a cheat
+is **HARD-STOP-class**: never auto-passed, never RISK-ACCEPTED-waived (like a security finding). But a
+first cheat is not yet a stop — it is a chance to redo honestly.
+So a confirmed cheat enters a **bounded self-heal loop**: the engine returns the task to **build** for an
+honest redo, **counts** the attempt, and **caps** it. After **3** honest re-build attempts, a fourth
+confirmed cheat forces a **HARD-STOP that escalates to the human** — never an auto-PASS, never an unbounded
+loop. The engine COUNTS, CAPS, and ESCALATES; the **agent** does the honest re-build (the engine never
+auto-fixes). The counter is **monotonic** — it never auto-resets, so the cap cannot be cleared by
+re-crossing a phase; only an honest build (no cheat) escapes the loop, and an honest build PASSes even at
+the third attempt (the cap bites a *continued* cheat, never a recovery).
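The count, cap, and escalate behaviour described above can be sketched in a few lines of Python. This is an illustration only, not the code in `add.py`: the names `HealLoop`, `record_cheat`, `record_honest_build`, and `MAX_HONEST_REDOS` are hypothetical, and only the cap of 3 and the never-resetting counter come from the text.

```python
from dataclasses import dataclass

MAX_HONEST_REDOS = 3  # cap stated in the text; the constant name is hypothetical


@dataclass
class HealLoop:
    """Illustrative per-task counter of confirmed cheats (not add.py's real state)."""
    cheats: int = 0  # monotonic: nothing in this class decrements or resets it

    def record_cheat(self) -> str:
        """Count one confirmed cheat and return the next step for the task."""
        self.cheats += 1
        if self.cheats > MAX_HONEST_REDOS:
            return "HARD-STOP"       # the fourth confirmed cheat escalates to the human
        return "return-to-build"     # the agent, not the engine, does the honest redo

    def record_honest_build(self) -> str:
        """An honest build escapes the loop; the counter is left untouched."""
        return "PASS"


loop = HealLoop()
outcomes = [loop.record_cheat() for _ in range(4)]
print(outcomes)
# ['return-to-build', 'return-to-build', 'return-to-build', 'HARD-STOP']
```

Because `record_honest_build` never touches `cheats`, re-crossing a phase cannot clear the cap in this sketch, which mirrors the monotonic rule above.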
+Two findings enter the loop:
+- **mechanical** (enforced) — the tamper tripwire (`tamper-tripwire`): at the gate the engine re-hashes the
+  red test files + the frozen §3 against the `tests→build` snapshot; any divergence is a cheat, routed to
+  the loop before any completing outcome is recorded.
+- **semantic** (honor-system, necessary-not-sufficient) — the **adversarial refute-read** (`6-verify.md`):
+  an independent reviewer argues "the green was NOT earned" and, on a confirmed overfit/vacuous/stub, the
+  agent reports it with `add.py heal <slug> --reason "<finding>"`. The engine cannot SEE a judgment cheat,
+  so this entry is the agent's honest report — the human verify gate stays the real backstop.
+The mechanical entry returns-to-build automatically at the gate; the `heal` verb is how a *reported* cheat
+enters the same bounded loop. Either way: ≤3 honest redos, then escalate. A gamed green never ships.
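The mechanical entry amounts to comparing file digests against a stored snapshot. The sketch below assumes SHA-256 and a plain path-to-digest mapping. The text does not say which hash or snapshot format `add.py` uses, and the helper names `digest`, `snapshot`, and `tampered` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def digest(path: Path) -> str:
    """SHA-256 hex digest of a file's bytes, or a marker if the file is gone."""
    if not path.is_file():
        return "<missing>"
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(paths: list[Path]) -> dict[str, str]:
    """Map each path to its current digest (taken once, at tests→build)."""
    return {str(p): digest(p) for p in paths}


def tampered(baseline: dict[str, str], paths: list[Path]) -> list[str]:
    """Return the paths whose digest no longer matches the baseline snapshot."""
    current = snapshot(paths)
    names = baseline.keys() | current.keys()
    return sorted(n for n in names if baseline.get(n) != current.get(n))


with tempfile.TemporaryDirectory() as tmp:
    test_file = Path(tmp) / "test_example.py"
    test_file.write_text("assert add(1, 2) == 3\n")
    baseline = snapshot([test_file])            # the red suite as frozen
    clean = tampered(baseline, [test_file])     # nothing edited yet
    test_file.write_text("assert True\n")       # a vacuous edit after the red run
    dirty = tampered(baseline, [test_file])     # the edited file is reported

print(clean, len(dirty))
```

A non-empty result is what would route the task into the bounded loop. An empty result proves only that the files are byte-identical, which is why the semantic refute-read remains a separate entry.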
 ## Emitting deltas — feeding the foundation back
 The completeness-critic does not discard what it finds. Every gap, surprise, or convention that helped
@@ -137,14 +165,19 @@ How much a run may auto-gate is a **per-scope setting**, not a global switch (pr
 earned per scope). A task declares its level in its `TASK.md` header:
 ```
-autonomy: auto | conservative
+autonomy: manual | conservative | auto
 ```
-- **auto (the default)** — the run may auto-PASS when the evidence + residue checks above are
+An ordered ladder — `manual < conservative < auto` — declared once in the header and reviewed at the freeze:
+- **auto (the seeded default)** — the run may auto-PASS when the evidence + residue checks above are
   satisfied. Security still always escalates. This is the default starting point: a frozen contract
   flips the task into a self-driving run that converges and auto-gates on evidence.
 - **conservative** — the deliberate *lowering*: the run does all the work and converges, but STOPS at
   the verify gate for a human. Auto-PASS is disabled. Choose it wherever evidence is thin or risk is high.
+- **manual** — the strict floor: the human owns the verify gate and the engine never auto-resolves
+  (behaviourally the conservative floor with the explicit "I drive this decision; the AI proposes only"
+  name). Choose it for the highest-stakes scope; like `conservative`, it satisfies the high-risk guard.
 > **v7 reversal (recorded, not hidden).** Earlier the default was `conservative` and `auto` was the
 > earned exception; v7 flips this — `auto` is the default, `conservative` is the deliberate lowering.
@@ -154,7 +187,8 @@ autonomy: auto | conservative
 **The high-risk guard — `auto` is refused where it matters most.** The autonomy level is not a blank cheque. On a
 **high-risk or method-defining scope** — anything where a wrong-but-plausible result is expensive or
 hard to reverse (auth, money, data-loss paths, the method/trust-layer itself) — `auto` must be lowered
-to `conservative`; leaving it at `auto` there is the reject code **`unguarded_high_risk_auto`**. This
+to a stricter rung — `conservative` or `manual`; leaving it at `auto` there is the reject code
+**`unguarded_high_risk_auto`**. This
 closes the v6 dogfood gap, where the whole milestone ran at `auto` on the riskiest possible
 scope (defining the method) with no friction. The default is `auto` *for ordinary, well-tested scope*;
 high risk still earns a human gate.
@@ -163,9 +197,18 @@ Judging *what* is high-risk stays human — the scope declares **`risk: high`**
 header where the autonomy level lives, reviewed at the freeze like every header line (the engine never
 classifies scope). **Since v14 the guard is mechanical for the declared case:**
 the engine refuses the declared combination — `add.py gate` will not complete (`PASS`/`RISK-ACCEPTED`) a task whose header
-carries `risk: high` without `autonomy: conservative` (error `unguarded_high_risk_auto`; `HARD-STOP`
+carries `risk: high` without a lowered level — `conservative` or `manual` (error `unguarded_high_risk_auto`; `HARD-STOP`
 always records — stopping is never blocked), and `add.py audit` flags the same code on a finished
 record whose header was tampered or whose GATE RECORD reviewer is the auto-gate — which CI enforces
 (audit-ci). The honest limit mirrors the audit's: an **undeclared** high-risk scope passes; declaring
 is the human decision point, the engine enforces what was declared.
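As an illustration of the declared-case guard, the following Python sketch models the ladder as an ordered enum and refuses a completing outcome when `risk: high` sits next to `autonomy: auto`. The names `Autonomy`, `parse_header`, and `gate_refusal` are hypothetical, and treating a missing `autonomy:` line as `auto` is an assumption made for the example. Neither is confirmed as how `add.py` works.

```python
from enum import IntEnum
from typing import Optional


class Autonomy(IntEnum):
    """The ordered ladder from the text: manual < conservative < auto."""
    MANUAL = 0
    CONSERVATIVE = 1
    AUTO = 2


def parse_header(text: str) -> dict[str, str]:
    """Parse simple `key: value` lines, as in `autonomy: auto` or `risk: high`."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def gate_refusal(header: dict[str, str], outcome: str) -> Optional[str]:
    """Return the reject code if this outcome must be refused, else None."""
    if outcome == "HARD-STOP":
        return None  # stopping is never blocked
    # Assumption: an absent level means `auto`; an unknown level raises KeyError.
    level = Autonomy[header.get("autonomy", "auto").upper()]
    if header.get("risk") == "high" and level >= Autonomy.AUTO:
        return "unguarded_high_risk_auto"
    return None


risky = parse_header("autonomy: auto\nrisk: high")
print(gate_refusal(risky, "PASS"))       # unguarded_high_risk_auto
print(gate_refusal(risky, "HARD-STOP"))  # None
```

Comparing against the enum order rather than a string keeps `conservative` and `manual` both on the accepted side of the guard without listing them by name.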
+**Autonomy is earned by goal-clarity — the auto-ready goal.** The level decides *who* resolves Verify;
+an **auto-ready goal** decides whether a self-verifying run is even *meaningful*. A milestone goal is
+auto-ready when **every exit criterion cites a verifier** — `(verify: <test | command | metric>)` — so the
+run can check its own result against the goal without human judgment. `add.py check` raises a
+`goal_not_auto_ready` WARN (never red, the active milestone only) while criteria are uncited, and `status`
+prints a `goal-ready:` line every session. It **measures, never blocks** — it changes neither the freeze
+gate nor the autonomy level. The lint forces a citation slot per criterion (raising the floor) but cannot
+prove the citation is honest (`(verify: it works)` passes) — that judgment stays the human's.
 </constraints>
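The auto-ready lint described above can be approximated with a single regular expression. This is a sketch under assumptions: the function name `uncited`, the pattern, and the sample criteria are hypothetical, and the real check in `add.py` may parse milestone files differently.

```python
import re

# Matches "(verify: <something non-empty>)"; the exact pattern is an assumption.
VERIFY_CITATION = re.compile(r"\(verify:\s*[^\s)][^)]*\)")


def uncited(criteria: list[str]) -> list[str]:
    """Return the exit criteria that carry no `(verify: ...)` citation."""
    return [c for c in criteria if not VERIFY_CITATION.search(c)]


criteria = [
    "Login rejects bad passwords (verify: pytest tests/test_login.py)",
    "Latency stays low",
    "Feature is done (verify: it works)",  # passes the lint, as the text warns
]
missing = uncited(criteria)
print("goal_not_auto_ready" if missing else "goal-ready", missing)
# goal_not_auto_ready ['Latency stays low']
```

The third criterion shows the stated limit: the lint can require a citation slot but cannot judge whether the citation is honest.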

package/skill/add/streams.md CHANGED

@@ -42,9 +42,9 @@ How much concurrency you actually get is set by each task's `autonomy:` header
 | `autonomy` (TASK.md) | What serializes on the human | Concurrency |
 |----------------------|------------------------------|-------------|
-| `conservative` | bundle approval **+** every Verify | pure pipelining — builds overlap, both gates queue |
+| `conservative` / `manual` | bundle approval **+** every Verify | pure pipelining — builds overlap, both gates queue (`manual` is the strict floor; same streams behaviour) |
 | `auto` (default) | bundle approval **only**; Verify auto-PASSes on evidence | real concurrency — only the decision point + residue escalations queue |
-| `auto` but **high-risk** | refused → forced `conservative` (`unguarded_high_risk_auto`) | back to pipelining, by design |
+| `auto` but **high-risk** | refused → must lower to `conservative` / `manual` (`unguarded_high_risk_auto`) | back to pipelining, by design |
 The irreducible floor is **one human approval per task at the contract decision point** — the decision point
 never drops to zero (`run.md:22`). That floor is correct; do not engineer around it.
@@ -72,6 +72,16 @@ never drops to zero (`run.md:22`). That floor is correct; do not engineer around
   worktree forked from a stale base forces the worker to recreate the frozen artifacts by hand
   (the v10 dogfood hit exactly this). Before the worker starts, confirm `git -C <worktree>
   rev-parse HEAD` equals the orchestrator's `HEAD`; if it drifted, `git merge` the base in first.
+  On a runner that creates each worktree **at spawn** from a pool (e.g. Claude Code), that pool can hand
+  out a STALE base, so the pre-spawn `rev-parse` evidence cell is unsatisfiable. The `unverified_fork_base`
+  check then **shifts** — it never skips: the worker's **step-0** syncs to base (`git merge` the orchestrator's
+  `HEAD`) and re-echoes `rev-parse HEAD`, which the orchestrator verifies at **merge-time**, before merge-back.
+  The pre-spawn check stays the DEFAULT for fresh-`HEAD`-worktree runners; the merge-time path is the additive
+  ALTERNATIVE for spawn-time runners — never a replacement of the pre-spawn rule.
+  **The engine executes this gate** (engine-merge-base-enforcement): run
+  `python3 .add/tooling/add.py wave-verify` before the first merge-back — it refuses a mismatched or
+  pending echo (`unverified_fork_base`) and an off-template ledger (`wave_ledger_malformed`, fail-closed);
+  `add.py check` is the standing monitor (red at `status: merging`, `fork_base_pending` WARN at `live`).
 - **Lease + timeout** — record which worker holds which task (in the wave ledger, below);
   if a worker dies, release the claim back to READY (re-spawn, do not assume partial work is sound).
 - **Failure isolates** — a worker that hits a STOP-and-escalate (below) blocks only its
@@ -114,7 +124,10 @@ base: <orchestrator HEAD at spawn — the sha every fork must equal>
 `git -C <worktree> rev-parse HEAD`, and it must equal `base:`. A tick is not evidence; a row
 you can only fill by running the command is the fresh-worktree-base check EXECUTING — the
 v12-1 lesson (words-exist ≠ method-works) closed structurally. Spawning a worker whose roster
-row lacks that evidence is refused (`unverified_fork_base`).
+row lacks that evidence is refused (`unverified_fork_base`). On a spawn-time pool runner this
+PRE-spawn paste is unsatisfiable (the pooled base is stale until the worker syncs), so the cell
+instead holds the worker's **step-0** post-sync echo (still `== base:`) and the `unverified_fork_base`
+refusal **shifts to merge-time**, before merge-back — it shifts, it never lifts.
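The evidence check itself is a sha comparison. The Python sketch below reads a worktree's `HEAD` with the same `git -C <worktree> rev-parse HEAD` command the text names and compares an echoed value with `base:`. The function names are hypothetical, and this is not the implementation of `add.py wave-verify`, whose ledger parsing is not shown in the source.

```python
import subprocess
from typing import Optional


def worktree_head(worktree: str) -> str:
    """Return the sha printed by `git -C <worktree> rev-parse HEAD`."""
    result = subprocess.run(
        ["git", "-C", worktree, "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def verify_fork_base(base: str, echoed: Optional[str]) -> Optional[str]:
    """Compare a roster row's echoed sha with `base:`; return a refusal code or None."""
    if not echoed or echoed.strip() != base.strip():
        return "unverified_fork_base"  # pending or mismatched echo: refuse
    return None


base = "a" * 40  # placeholder sha for the example
print(verify_fork_base(base, None))      # unverified_fork_base (echo still pending)
print(verify_fork_base(base, "b" * 40))  # unverified_fork_base (stale base)
print(verify_fork_base(base, base))      # None (evidence matches)
```

On a fresh-`HEAD` runner the echoed value would come from `worktree_head` before spawn. On a spawn-time pool runner it would be the worker's step-0 post-sync echo, checked before merge-back.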
 **Lifecycle — open → consume → digest → delete.** Open the ledger when the first worker
 spawns. The serial integration Verify consumes it (the merge order is read from it, one
@@ -181,7 +194,7 @@ STOP-and-escalate (return your findings; do not decide):
   • a discovered scope/contract gap  → backward-correction, reopen Specify (principle 4)
   • any SECURITY finding              → HARD-STOP, always
   • a concurrency/timing OR architecture/layering risk the tests cannot exercise
-  • [include this bullet ONLY when autonomy=conservative] the verify gate itself — STOP for the human
+  • [include this bullet when autonomy is conservative OR manual — any lowered rung] the verify gate itself — STOP for the human
 Auto-PASS only if autonomy=auto AND: all tests green · coverage not decreased · no test weakened ·
   no contract edited · loops dry · completeness-critic clean · no residue above. Log it as
   auto-resolved, naming this run as owner — never forge a human signature.
@@ -204,7 +217,9 @@ ripgrep otherwise. Design every IO path for failure — timeouts, retries, rollb
 </tools>
 <return>   <!-- the worker PROPOSES; the orchestrator RECORDS. A worker never runs add.py. -->
-End with a structured verdict AND write the same into SUMMARY.md in the task dir:
+End with a structured verdict AND write the same into SUMMARY.md in the task dir, then
+**commit SUMMARY.md + deltas.md** in the worktree (uncommitted worktree files survive only by
+harness courtesy — commit them so the serial-integration merge-back carries your report):
 { task, outcome: PASS|RISK-ACCEPTED|HARD-STOP|ESCALATE, evidence: <tests+coverage>,
   residue: [security|concurrency|architecture findings], deltas: [open lessons learned] }.
 Do NOT touch add.py or any shared file — the orchestrator gates on your verdict.
@@ -222,7 +237,7 @@ The contract is identical whichever model runs it (the model is disposable, like
 | **top** | complex / ambiguous / cross-cutting / broad scope of impact | `opus` | the runner's strongest reasoning model |
 Two rules sit **above** model choice and never bend:
-- **High-risk ⇒ `conservative` autonomy, regardless of model** (`run.md` high-risk guard). A
+- **High-risk ⇒ a lowered rung (`conservative` or `manual`), regardless of model** (`run.md` high-risk guard). A
   stronger model does not buy back the human gate.
 - **Security residue always escalates** — no tier and no model auto-passes it.