@pilotspace/add 1.2.0 → 1.3.0

package/skill/add/run.md CHANGED
@@ -28,7 +28,7 @@ What the human is actually approving in that one gate: that the drafted Spec cap
  that the Scenarios cover the cases that matter, and that the Contract shape is the one to freeze. Reject
  any part and the bundle goes back to draft — that is backward-correction (principle 4), not failure.
  Approve, and the run begins. The decision-point guide (`phases/3-contract.md`) carries the
- **freeze review checklist** — six lines that walk the human through exactly this, ⚠-first.
+ **freeze review checklist** — seven lines that walk the human through exactly this, ⚠-first.
 
  **The lowest-confidence flag — aiming the one approval.** A single approval over a whole bundle is easy to
  grant without reading. So the AI presents the bundle **lowest-confidence first**: of everything it is asking the human
@@ -117,6 +117,34 @@ The auto-gate NEVER writes a human signature it did not get. An auto-PASS is log
  honestly — the line between a pass and a skip is the recorded outcome, not a forged name.
  </constraints>
 
+ ## The bounded self-heal loop — a confirmed cheat returns to build
+
+ The auto-gate trusts evidence; but evidence can be **gamed**. A build can make the unchanged red suite
+ pass without EARNING it — a test or the frozen contract edited after the red run, src **overfit** to the
+ fixtures, **vacuous** asserts, or real logic **stubbed away**. That is a **confirmed cheat**, and a cheat
+ is **HARD-STOP-class**: never auto-passed, never RISK-ACCEPTED-waived (like a security finding). But a
+ first cheat is not yet a stop — it is a chance to redo honestly.
+
+ So a confirmed cheat enters a **bounded self-heal loop**: the engine returns the task to **build** for an
+ honest redo, **counts** the attempt, and **caps** it. After **3** honest re-build attempts a fourth
+ confirmed cheat forces a **HARD-STOP that escalates to the human** — never an auto-PASS, never an unbounded
+ loop. The engine COUNTS, CAPS, and ESCALATES; the **agent** does the honest re-build (the engine never
+ auto-fixes). The counter is **monotonic** — it never auto-resets, so the cap cannot be cleared by
+ re-crossing a phase; only an honest build (no cheat) escapes the loop, and an honest build PASSes even at
+ the third attempt (the cap bites a *continued* cheat, never a recovery).
+
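A minimal sketch of the counting rule this hunk adds, assuming the engine keeps one counter per task. `HealCounter`, `record_cheat`, and `HEAL_CAP` are hypothetical names chosen for illustration, not `add.py` internals; only the cap of 3 and the build / HARD-STOP routing come from the text above.

```python
from dataclasses import dataclass

HEAL_CAP = 3  # per the text: at most 3 honest re-build attempts


@dataclass
class HealCounter:
    """Monotonic count of confirmed cheats for one task (hypothetical shape)."""

    cheats: int = 0

    def record_cheat(self) -> str:
        """Count a confirmed cheat and return where the task goes next."""
        self.cheats += 1  # never decremented and never reset
        if self.cheats > HEAL_CAP:
            return "HARD-STOP"  # a fourth confirmed cheat escalates to the human
        return "build"  # back to build for an honest redo


counter = HealCounter()
routes = [counter.record_cheat() for _ in range(4)]
print(routes)
```

An honest build never calls `record_cheat`, which is why it can still pass on the third attempt: the cap only bites when another cheat is confirmed.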
+ Two findings enter the loop:
+ - **mechanical** (enforced) — the tamper tripwire (`tamper-tripwire`): at the gate the engine re-hashes the
+ red test files + the frozen §3 against the `tests→build` snapshot; any divergence is a cheat, routed to
+ the loop before any completing outcome is recorded.
+ - **semantic** (honor-system, necessary-not-sufficient) — the **adversarial refute-read** (`6-verify.md`):
+ an independent reviewer argues "the green was NOT earned" and, on a confirmed overfit/vacuous/stub, the
+ agent reports it with `add.py heal <slug> --reason "<finding>"`. The engine cannot SEE a judgment cheat,
+ so this entry is the agent's honest report — the human verify gate stays the real backstop.
+
+ The mechanical entry returns-to-build automatically at the gate; the `heal` verb is how a *reported* cheat
+ enters the same bounded loop. Either way: ≤3 honest redos, then escalate. A gamed green never ships.
+
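The mechanical entry above (the tamper tripwire) can be illustrated with a small hashing sketch. It assumes SHA-256 over raw file bytes; the source says only that the engine re-hashes the red test files and the frozen §3 against the `tests→build` snapshot. The helpers `snapshot` and `diverged` and the two file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(paths):
    """Map each path to the SHA-256 hex digest of its current bytes."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}


def diverged(frozen, paths):
    """Return the sorted paths whose current hash differs from the snapshot."""
    current = snapshot(paths)
    return sorted(p for p in frozen if current.get(p) != frozen[p])


with tempfile.TemporaryDirectory() as tmp:
    test_file = Path(tmp) / "test_example.py"
    contract = Path(tmp) / "contract.md"
    test_file.write_text("assert add(1, 2) == 3\n")
    contract.write_text("add(a, b) -> int\n")
    watched = [test_file, contract]

    frozen = snapshot(watched)  # taken once, at the tests→build crossing
    clean = diverged(frozen, watched)  # nothing edited yet
    test_file.write_text("assert True\n")  # a test weakened after the red run
    tampered = diverged(frozen, watched)
    tampered_names = [Path(p).name for p in tampered]

print(clean, tampered_names)
```

Any non-empty result would count as a divergence, which the text routes into the self-heal loop before a completing outcome is recorded.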
  ## Emitting deltas — feeding the foundation back
 
  The completeness-critic does not discard what it finds. Every gap, surprise, or convention that helped
@@ -137,14 +165,19 @@ How much a run may auto-gate is a **per-scope setting**, not a global switch (pr
  earned per scope). A task declares its level in its `TASK.md` header:
 
  ```
- autonomy: auto | conservative
+ autonomy: manual | conservative | auto
  ```
 
- - **auto (the default)** the run may auto-PASS when the evidence + residue checks above are
+ An ordered ladder, `manual < conservative < auto`, declared once in the header and reviewed at the freeze:
+
+ - **auto (the seeded default)** — the run may auto-PASS when the evidence + residue checks above are
  satisfied. Security still always escalates. This is the default starting point: a frozen contract
  flips the task into a self-driving run that converges and auto-gates on evidence.
  - **conservative** — the deliberate *lowering*: the run does all the work and converges, but STOPS at
  the verify gate for a human. Auto-PASS is disabled. Choose it wherever evidence is thin or risk is high.
+ - **manual** — the strict floor: the human owns the verify gate and the engine never auto-resolves
+ (behaviourally the conservative floor with the explicit "I drive this decision; the AI proposes only"
+ name). Choose it for the highest-stakes scope; like `conservative`, it satisfies the high-risk guard.
 
  > **v7 reversal (recorded, not hidden).** Earlier the default was `conservative` and `auto` was the
  > earned exception; v7 flips this — `auto` is the default, `conservative` is the deliberate lowering.
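The three-rung ladder introduced in this hunk can be modelled as an ordered enum. This is a sketch under assumptions: the `Autonomy` class and both helper functions are hypothetical, and the parser assumes the header holds a plain `autonomy: <level>` line as shown in the fenced example above.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    """Ordered so that manual < conservative < auto, as the text states."""

    MANUAL = 0
    CONSERVATIVE = 1
    AUTO = 2


def parse_autonomy(header_text: str) -> Autonomy:
    """Read the `autonomy:` line; fall back to the seeded default, auto."""
    for line in header_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "autonomy":
            # An unknown level raises KeyError rather than being guessed.
            return Autonomy[value.strip().upper()]
    return Autonomy.AUTO


def stops_at_verify_for_human(level: Autonomy) -> bool:
    """Any rung below auto hands the verify gate to a human."""
    return level < Autonomy.AUTO


level = parse_autonomy("risk: high\nautonomy: manual\n")
print(level.name, stops_at_verify_for_human(level))
```

Using an ordered type keeps "a stricter rung" a simple comparison instead of a list of special cases.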
@@ -154,7 +187,8 @@ autonomy: auto | conservative
  **The high-risk guard — `auto` is refused where it matters most.** The autonomy level is not a blank cheque. On a
  **high-risk or method-defining scope** — anything where a wrong-but-plausible result is expensive or
  hard to reverse (auth, money, data-loss paths, the method/trust-layer itself) — `auto` must be lowered
- to `conservative`; leaving it at `auto` there is the reject code **`unguarded_high_risk_auto`**. This
+ to a stricter rung — `conservative` or `manual`; leaving it at `auto` there is the reject code
+ **`unguarded_high_risk_auto`**. This
  closes the v6 dogfood gap, where the whole milestone ran at `auto` on the riskiest possible
  scope (defining the method) with no friction. The default is `auto` *for ordinary, well-tested scope*;
  high risk still earns a human gate.
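A sketch of the high-risk guard as this diff describes it: a header declaring `risk: high` must sit on a lowered rung before a completing outcome is recorded, while a `HARD-STOP` is never blocked. The function name and the dict-shaped header are assumptions; only the `unguarded_high_risk_auto` code, the header keys, and the outcome names come from the text.

```python
from typing import Optional

LOWERED = {"conservative", "manual"}
COMPLETING = {"PASS", "RISK-ACCEPTED"}


def high_risk_refusal(header: dict, outcome: str) -> Optional[str]:
    """Return the reject code, or None when the gate may record the outcome."""
    if outcome not in COMPLETING:
        return None  # stopping is never blocked
    autonomy = header.get("autonomy", "auto")  # auto is the seeded default
    if header.get("risk") == "high" and autonomy not in LOWERED:
        return "unguarded_high_risk_auto"
    return None


print(high_risk_refusal({"risk": "high", "autonomy": "auto"}, "PASS"))
print(high_risk_refusal({"risk": "high", "autonomy": "manual"}, "PASS"))
```

Note the stated limit still applies: a scope that never declares `risk: high` passes this check, because the engine enforces only what was declared.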
@@ -163,9 +197,18 @@ Judging *what* is high-risk stays human — the scope declares **`risk: high`**
  header where the autonomy level lives, reviewed at the freeze like every header line (the engine never
  classifies scope). **Since v14 the guard is mechanical for the declared case:**
  the engine refuses the declared combination — `add.py gate` will not complete (`PASS`/`RISK-ACCEPTED`) a task whose header
- carries `risk: high` without `autonomy: conservative` (error `unguarded_high_risk_auto`; `HARD-STOP`
+ carries `risk: high` without a lowered level — `conservative` or `manual` (error `unguarded_high_risk_auto`; `HARD-STOP`
  always records — stopping is never blocked), and `add.py audit` flags the same code on a finished
  record whose header was tampered or whose GATE RECORD reviewer is the auto-gate — which CI enforces
  (audit-ci). The honest limit mirrors the audit's: an **undeclared** high-risk scope passes; declaring
  is the human decision point, the engine enforces what was declared.
+
+ **Autonomy is earned by goal-clarity — the auto-ready goal.** The level decides *who* resolves Verify;
+ an **auto-ready goal** decides whether a self-verifying run is even *meaningful*. A milestone goal is
+ auto-ready when **every exit criterion cites a verifier** — `(verify: <test | command | metric>)` — so the
+ run can check its own result against the goal without human judgment. `add.py check` raises a
+ `goal_not_auto_ready` WARN (never red, the active milestone only) while criteria are uncited, and `status`
+ prints a `goal-ready:` line every session. It **measures, never blocks** — it changes neither the freeze
+ gate nor the autonomy level. The lint forces a citation slot per criterion (raising the floor) but cannot
+ prove the citation is honest (`(verify: it works)` passes) — that judgment stays the human's.
  </constraints>
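The auto-ready lint added in the hunk above amounts to checking each exit criterion for a `(verify: ...)` slot. The sketch below assumes the criteria are already available as a list of strings; how `add.py check` locates them in a milestone file is not shown in this diff. The regex and both function names are hypothetical.

```python
import re

# A citation slot: "(verify:" followed by at least one non-space character.
VERIFY_CITATION = re.compile(r"\(verify:\s*[^\s)][^)]*\)")


def uncited(criteria):
    """Return the exit criteria that cite no verifier."""
    return [c for c in criteria if not VERIFY_CITATION.search(c)]


def goal_auto_ready(criteria):
    """True when every exit criterion carries a citation slot."""
    return not uncited(criteria)


criteria = [
    "login returns 200 (verify: pytest tests/test_login.py)",
    "p95 latency under 200 ms",
    "it runs (verify: it works)",
]
missing = uncited(criteria)
print(missing, goal_auto_ready(criteria))
```

As the text warns, the third criterion passes the lint even though `it works` verifies nothing: the check raises the floor but cannot judge whether a citation is honest.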
@@ -42,9 +42,9 @@ How much concurrency you actually get is set by each task's `autonomy:` header
 
  | `autonomy` (TASK.md) | What serializes on the human | Concurrency |
  |----------------------|------------------------------|-------------|
- | `conservative` | bundle approval **+** every Verify | pure pipelining — builds overlap, both gates queue |
+ | `conservative` / `manual` | bundle approval **+** every Verify | pure pipelining — builds overlap, both gates queue (`manual` is the strict floor; same streams behaviour) |
  | `auto` (default) | bundle approval **only**; Verify auto-PASSes on evidence | real concurrency — only the decision point + residue escalations queue |
- | `auto` but **high-risk** | refused → forced `conservative` (`unguarded_high_risk_auto`) | back to pipelining, by design |
+ | `auto` but **high-risk** | refused → must lower to `conservative` / `manual` (`unguarded_high_risk_auto`) | back to pipelining, by design |
 
  The irreducible floor is **one human approval per task at the contract decision point** — the decision point
  never drops to zero (`run.md:22`). That floor is correct; do not engineer around it.
@@ -72,6 +72,16 @@ never drops to zero (`run.md:22`). That floor is correct; do not engineer around
  worktree forked from a stale base forces the worker to recreate the frozen artifacts by hand
  (the v10 dogfood hit exactly this). Before the worker starts, confirm `git -C <worktree>
  rev-parse HEAD` equals the orchestrator's `HEAD`; if it drifted, `git merge` the base in first.
+ On a runner that creates each worktree **at spawn** from a pool (e.g. Claude Code), that pool can hand
+ out a STALE base, so the pre-spawn `rev-parse` evidence cell is unsatisfiable. The `unverified_fork_base`
+ check then **shifts** — it never skips: the worker's **step-0** syncs to base (`git merge` the orchestrator's
+ `HEAD`) and re-echoes `rev-parse HEAD`, which the orchestrator verifies at **merge-time**, before merge-back.
+ The pre-spawn check stays the DEFAULT for fresh-`HEAD`-worktree runners; the merge-time path is the additive
+ ALTERNATIVE for spawn-time runners — never a replacement of the pre-spawn rule.
+ **The engine executes this gate** (engine-merge-base-enforcement): run
+ `python3 .add/tooling/add.py wave-verify` before the first merge-back — it refuses a mismatched or
+ pending echo (`unverified_fork_base`) and an off-template ledger (`wave_ledger_malformed`, fail-closed);
+ `add.py check` is the standing monitor (red at `status: merging`, `fork_base_pending` WARN at `live`).
  - **Lease + timeout** — record which worker holds which task (in the wave ledger, below);
  if a worker dies, release the claim back to READY (re-spawn, do not assume partial work is sound).
  - **Failure isolates** — a worker that hits a STOP-and-escalate (below) blocks only its
@@ -114,7 +124,10 @@ base: <orchestrator HEAD at spawn — the sha every fork must equal>
  `git -C <worktree> rev-parse HEAD`, and it must equal `base:`. A tick is not evidence; a row
  you can only fill by running the command is the fresh-worktree-base check EXECUTING — the
  v12-1 lesson (words-exist ≠ method-works) closed structurally. Spawning a worker whose roster
- row lacks that evidence is refused (`unverified_fork_base`).
+ row lacks that evidence is refused (`unverified_fork_base`). On a spawn-time pool runner this
+ PRE-spawn paste is unsatisfiable (the pooled base is stale until the worker syncs), so the cell
+ instead holds the worker's **step-0** post-sync echo (still `== base:`) and the `unverified_fork_base`
+ refusal **shifts to merge-time**, before merge-back — it shifts, it never lifts.
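The fork-base rule above reduces to one comparison: the sha echoed for a worktree must equal the ledger's `base:`. This is a sketch of that comparison only, not the `wave-verify` implementation. `worktree_head` wraps the `git -C <worktree> rev-parse HEAD` command quoted in the text; `fork_base_refusal` and its arguments are hypothetical names.

```python
import subprocess
from typing import Optional


def worktree_head(worktree: str) -> str:
    """Return the commit sha that HEAD points at in the given worktree."""
    result = subprocess.run(
        ["git", "-C", worktree, "rev-parse", "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def fork_base_refusal(base: str, echoed: Optional[str]) -> Optional[str]:
    """Return the refusal code for a roster row, or None when the row verifies."""
    if not echoed or echoed.strip() != base.strip():
        return "unverified_fork_base"  # a pending or mismatched echo
    return None


# Pre-spawn runners would pass worktree_head(path) as `echoed`; spawn-time
# pool runners would pass the worker's step-0 post-sync echo instead.
print(fork_base_refusal("abc123", "abc123"), fork_base_refusal("abc123", None))
```

Either path runs the same comparison, which matches the point that the check shifts in time but is never lifted.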
 
  **Lifecycle — open → consume → digest → delete.** Open the ledger when the first worker
  spawns. The serial integration Verify consumes it (the merge order is read from it, one
@@ -181,7 +194,7 @@ STOP-and-escalate (return your findings; do not decide):
  • a discovered scope/contract gap → backward-correction, reopen Specify (principle 4)
  • any SECURITY finding → HARD-STOP, always
  • a concurrency/timing OR architecture/layering risk the tests cannot exercise
- • [include this bullet ONLY when autonomy=conservative] the verify gate itself — STOP for the human
+ • [include this bullet when autonomy is conservative OR manual — any lowered rung] the verify gate itself — STOP for the human
  Auto-PASS only if autonomy=auto AND: all tests green · coverage not decreased · no test weakened ·
  no contract edited · loops dry · completeness-critic clean · no residue above. Log it as
  auto-resolved, naming this run as owner — never forge a human signature.
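The auto-PASS rule in this hunk is a conjunction: `autonomy=auto` and every listed evidence condition must hold. A minimal sketch, assuming each condition has already been reduced to a boolean; the `Evidence` fields and `may_auto_pass` are hypothetical names that mirror the wording of the list, not an `add.py` data structure.

```python
from dataclasses import astuple, dataclass, replace


@dataclass(frozen=True)
class Evidence:
    """One boolean per auto-PASS condition named in the worker brief."""

    tests_green: bool
    coverage_not_decreased: bool
    no_test_weakened: bool
    no_contract_edited: bool
    loops_dry: bool
    critic_clean: bool
    no_residue: bool


def may_auto_pass(autonomy: str, evidence: Evidence) -> bool:
    """Auto-PASS needs autonomy=auto and every evidence condition satisfied."""
    return autonomy == "auto" and all(astuple(evidence))


clean = Evidence(*([True] * 7))
edited = replace(clean, no_contract_edited=False)
print(may_auto_pass("auto", clean), may_auto_pass("auto", edited))
```

A single false field, or any lowered rung, is enough to hand the verify gate back to the human.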
@@ -204,7 +217,9 @@ ripgrep otherwise. Design every IO path for failure — timeouts, retries, rollb
  </tools>
 
  <return> <!-- the worker PROPOSES; the orchestrator RECORDS. A worker never runs add.py. -->
- End with a structured verdict AND write the same into SUMMARY.md in the task dir:
+ End with a structured verdict AND write the same into SUMMARY.md in the task dir, then
+ **commit SUMMARY.md + deltas.md** in the worktree (uncommitted worktree files survive only by
+ harness courtesy — commit them so the serial-integration merge-back carries your report):
  { task, outcome: PASS|RISK-ACCEPTED|HARD-STOP|ESCALATE, evidence: <tests+coverage>,
  residue: [security|concurrency|architecture findings], deltas: [open lessons learned] }.
  Do NOT touch add.py or any shared file — the orchestrator gates on your verdict.
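One way a worker might assemble the structured verdict above before writing it into SUMMARY.md. This is an illustrative sketch: the JSON serialization, the `Verdict` class, and the sample values are assumptions, since the diff specifies only the field names and the four allowed outcomes.

```python
import json
from dataclasses import asdict, dataclass, field

OUTCOMES = {"PASS", "RISK-ACCEPTED", "HARD-STOP", "ESCALATE"}


@dataclass
class Verdict:
    """The worker's proposal; the orchestrator is the one that records it."""

    task: str
    outcome: str
    evidence: str
    residue: list = field(default_factory=list)
    deltas: list = field(default_factory=list)

    def __post_init__(self):
        if self.outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {self.outcome!r}")


verdict = Verdict(
    task="example-task",  # hypothetical task slug
    outcome="PASS",
    evidence="12 tests green, coverage unchanged",
)
summary_text = json.dumps(asdict(verdict), indent=2)
print(summary_text)
```

Rejecting unknown outcomes at construction time keeps a malformed verdict from reaching the merge-back step.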
@@ -222,7 +237,7 @@ The contract is identical whichever model runs it (the model is disposable, like
  | **top** | complex / ambiguous / cross-cutting / broad scope of impact | `opus` | the runner's strongest reasoning model |
 
  Two rules sit **above** model choice and never bend:
- - **High-risk ⇒ `conservative` autonomy, regardless of model** (`run.md` high-risk guard). A
+ - **High-risk ⇒ a lowered rung (`conservative` or `manual`), regardless of model** (`run.md` high-risk guard). A
  stronger model does not buy back the human gate.
  - **Security residue always escalates** — no tier and no model auto-passes it.