@pilotspace/add 1.2.0 → 1.3.0
- package/CHANGELOG.md +41 -0
- package/GETTING-STARTED.md +22 -0
- package/bin/cli.js +84 -2
- package/docs/02-the-flow.md +4 -1
- package/docs/03-step-1-specify.md +2 -0
- package/docs/06-step-4-tests.md +8 -0
- package/docs/07-step-5-build.md +2 -0
- package/docs/08-step-6-verify.md +11 -0
- package/docs/10-setup-and-stages.md +1 -1
- package/docs/11-governance.md +4 -0
- package/docs/appendix-c-glossary.md +8 -1
- package/docs/appendix-e-checklists.md +14 -2
- package/package.json +1 -1
- package/skill/add/SKILL.md +4 -3
- package/skill/add/phases/0-ground.md +66 -0
- package/skill/add/phases/0-setup.md +3 -1
- package/skill/add/phases/1-specify.md +5 -0
- package/skill/add/phases/3-contract.md +3 -1
- package/skill/add/phases/5-build.md +22 -0
- package/skill/add/phases/6-verify.md +16 -0
- package/skill/add/run.md +48 -5
- package/skill/add/streams.md +21 -6
- package/tooling/add.py +1348 -63
- package/tooling/templates/DESIGN.md.tmpl +66 -0
- package/tooling/templates/GLOSSARY.md.tmpl +7 -1
- package/tooling/templates/PROJECT.md.tmpl +3 -1
- package/tooling/templates/TASK.md.tmpl +23 -4
- package/tooling/templates/catalog.sample.json +38 -0
- package/tooling/templates/prototype.sample.json +48 -0
- package/tooling/templates/tokens.sample.json +55 -0
- package/tooling/templates/udd-catalog.md +122 -0
- package/tooling/templates/udd-tokens.md +79 -0
package/skill/add/run.md
CHANGED
@@ -28,7 +28,7 @@ What the human is actually approving in that one gate: that the drafted Spec cap
 that the Scenarios cover the cases that matter, and that the Contract shape is the one to freeze. Reject
 any part and the bundle goes back to draft — that is backward-correction (principle 4), not failure.
 Approve, and the run begins. The decision-point guide (`phases/3-contract.md`) carries the
-**freeze review checklist** —
+**freeze review checklist** — seven lines that walk the human through exactly this, ⚠-first.
 
 **The lowest-confidence flag — aiming the one approval.** A single approval over a whole bundle is easy to
 grant without reading. So the AI presents the bundle **lowest-confidence first**: of everything it is asking the human
@@ -117,6 +117,34 @@ The auto-gate NEVER writes a human signature it did not get. An auto-PASS is log
 honestly — the line between a pass and a skip is the recorded outcome, not a forged name.
 </constraints>
 
+## The bounded self-heal loop — a confirmed cheat returns to build
+
+The auto-gate trusts evidence; but evidence can be **gamed**. A build can make the unchanged red suite
+pass without EARNING it — a test or the frozen contract edited after the red run, src **overfit** to the
+fixtures, **vacuous** asserts, or real logic **stubbed away**. That is a **confirmed cheat**, and a cheat
+is **HARD-STOP-class**: never auto-passed, never RISK-ACCEPTED-waived (like a security finding). But a
+first cheat is not yet a stop — it is a chance to redo honestly.
+
+So a confirmed cheat enters a **bounded self-heal loop**: the engine returns the task to **build** for an
+honest redo, **counts** the attempt, and **caps** it. After **3** honest re-build attempts a fourth
+confirmed cheat forces a **HARD-STOP that escalates to the human** — never an auto-PASS, never an unbounded
+loop. The engine COUNTS, CAPS, and ESCALATES; the **agent** does the honest re-build (the engine never
+auto-fixes). The counter is **monotonic** — it never auto-resets, so the cap cannot be cleared by
+re-crossing a phase; only an honest build (no cheat) escapes the loop, and an honest build PASSes even at
+the third attempt (the cap bites a *continued* cheat, never a recovery).
+
+Two findings enter the loop:
+- **mechanical** (enforced) — the tamper tripwire (`tamper-tripwire`): at the gate the engine re-hashes the
+red test files + the frozen §3 against the `tests→build` snapshot; any divergence is a cheat, routed to
+the loop before any completing outcome is recorded.
+- **semantic** (honor-system, necessary-not-sufficient) — the **adversarial refute-read** (`6-verify.md`):
+an independent reviewer argues "the green was NOT earned" and, on a confirmed overfit/vacuous/stub, the
+agent reports it with `add.py heal <slug> --reason "<finding>"`. The engine cannot SEE a judgment cheat,
+so this entry is the agent's honest report — the human verify gate stays the real backstop.
+
+The mechanical entry returns-to-build automatically at the gate; the `heal` verb is how a *reported* cheat
+enters the same bounded loop. Either way: ≤3 honest redos, then escalate. A gamed green never ships.
+
 ## Emitting deltas — feeding the foundation back
 
 The completeness-critic does not discard what it finds. Every gap, surprise, or convention that helped
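Reviewer note on the hunk above: the count, cap, and escalate behaviour and the tamper tripwire can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in `tooling/add.py`: the names `snapshot`, `tampered`, `HealLoop`, and the `RETURN-TO-BUILD` label are hypothetical, SHA-256 is an assumed hash, and an honest build is simplified to `PASS` (the real gate also weighs evidence and residue).

```python
import hashlib
from dataclasses import dataclass
from typing import Dict

MAX_HONEST_REDOS = 3  # the cap stated in run.md


def snapshot(files: Dict[str, bytes]) -> Dict[str, str]:
    """Hash each guarded file (red tests + frozen contract) by content."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def tampered(snap: Dict[str, str], current: Dict[str, bytes]) -> bool:
    """Any divergence from the recorded snapshot counts as a cheat.

    Comparing whole dicts also catches added or deleted files.
    """
    return snapshot(current) != snap


@dataclass
class HealLoop:
    cheats: int = 0  # monotonic: nothing in this sketch ever resets it

    def gate(self, cheat_confirmed: bool) -> str:
        if not cheat_confirmed:
            # An honest build escapes the loop, even at the cap (simplified).
            return "PASS"
        self.cheats += 1
        if self.cheats > MAX_HONEST_REDOS:
            # A fourth confirmed cheat escalates to the human.
            return "HARD-STOP"
        return "RETURN-TO-BUILD"
```

Because the counter only ever increases, a recovery at attempt three still passes, while a cheat after that recovery is counted as the fourth and stops the run.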
@@ -137,14 +165,19 @@ How much a run may auto-gate is a **per-scope setting**, not a global switch (pr
 earned per scope). A task declares its level in its `TASK.md` header:
 
 ```
-autonomy:
+autonomy: manual | conservative | auto
 ```
 
-
+An ordered ladder — `manual < conservative < auto` — declared once in the header and reviewed at the freeze:
+
+- **auto (the seeded default)** — the run may auto-PASS when the evidence + residue checks above are
 satisfied. Security still always escalates. This is the default starting point: a frozen contract
 flips the task into a self-driving run that converges and auto-gates on evidence.
 - **conservative** — the deliberate *lowering*: the run does all the work and converges, but STOPS at
 the verify gate for a human. Auto-PASS is disabled. Choose it wherever evidence is thin or risk is high.
+- **manual** — the strict floor: the human owns the verify gate and the engine never auto-resolves
+(behaviourally the conservative floor with the explicit "I drive this decision; the AI proposes only"
+name). Choose it for the highest-stakes scope; like `conservative`, it satisfies the high-risk guard.
 
 > **v7 reversal (recorded, not hidden).** Earlier the default was `conservative` and `auto` was the
 > earned exception; v7 flips this — `auto` is the default, `conservative` is the deliberate lowering.
@@ -154,7 +187,8 @@ autonomy: auto | conservative
 **The high-risk guard — `auto` is refused where it matters most.** The autonomy level is not a blank cheque. On a
 **high-risk or method-defining scope** — anything where a wrong-but-plausible result is expensive or
 hard to reverse (auth, money, data-loss paths, the method/trust-layer itself) — `auto` must be lowered
-to `conservative`; leaving it at `auto` there is the reject code
+to a stricter rung — `conservative` or `manual`; leaving it at `auto` there is the reject code
+**`unguarded_high_risk_auto`**. This
 closes the v6 dogfood gap, where the whole milestone ran at `auto` on the riskiest possible
 scope (defining the method) with no friction. The default is `auto` *for ordinary, well-tested scope*;
 high risk still earns a human gate.
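Reviewer note: the ordered ladder and the high-risk guard described in these hunks can be sketched as follows. This is an illustration only, assuming the `TASK.md` header has already been parsed into a dict with `autonomy` and `risk` keys; the `Autonomy` enum and the `guard` function are hypothetical names, not taken from `add.py`.

```python
from enum import IntEnum
from typing import Dict, Optional


class Autonomy(IntEnum):
    """The ladder from run.md: manual < conservative < auto."""
    MANUAL = 0
    CONSERVATIVE = 1
    AUTO = 2


def guard(header: Dict[str, str]) -> Optional[str]:
    """Return the reject code for an unguarded high-risk task, else None."""
    # `auto` is the seeded default when the header omits the level.
    level = Autonomy[header.get("autonomy", "auto").upper()]
    if header.get("risk") == "high" and level == Autonomy.AUTO:
        return "unguarded_high_risk_auto"
    return None
```

Modelling the levels as an ordered enum keeps "any lowered rung" a single comparison, so both `conservative` and `manual` satisfy the guard without a special case.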
@@ -163,9 +197,18 @@ Judging *what* is high-risk stays human — the scope declares **`risk: high`**
 header where the autonomy level lives, reviewed at the freeze like every header line (the engine never
 classifies scope). **Since v14 the guard is mechanical for the declared case:**
 the engine refuses the declared combination — `add.py gate` will not complete (`PASS`/`RISK-ACCEPTED`) a task whose header
-carries `risk: high` without `
+carries `risk: high` without a lowered level — `conservative` or `manual` (error `unguarded_high_risk_auto`; `HARD-STOP`
 always records — stopping is never blocked), and `add.py audit` flags the same code on a finished
 record whose header was tampered or whose GATE RECORD reviewer is the auto-gate — which CI enforces
 (audit-ci). The honest limit mirrors the audit's: an **undeclared** high-risk scope passes; declaring
 is the human decision point, the engine enforces what was declared.
+
+**Autonomy is earned by goal-clarity — the auto-ready goal.** The level decides *who* resolves Verify;
+an **auto-ready goal** decides whether a self-verifying run is even *meaningful*. A milestone goal is
+auto-ready when **every exit criterion cites a verifier** — `(verify: <test | command | metric>)` — so the
+run can check its own result against the goal without human judgment. `add.py check` raises a
+`goal_not_auto_ready` WARN (never red, the active milestone only) while criteria are uncited, and `status`
+prints a `goal-ready:` line every session. It **measures, never blocks** — it changes neither the freeze
+gate nor the autonomy level. The lint forces a citation slot per criterion (raising the floor) but cannot
+prove the citation is honest (`(verify: it works)` passes) — that judgment stays the human's.
 </constraints>
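Reviewer note: the auto-ready lint added in the last hunk amounts to checking each exit criterion for a `(verify: ...)` citation. The sketch below assumes the criteria arrive as plain strings; the regex, the function names, and the treatment of an empty criteria list as "not ready" are assumptions for illustration, not the behaviour of `add.py check`.

```python
import re
from typing import List

# A citation slot with at least one non-blank character inside it.
VERIFY_CITATION = re.compile(r"\(verify:\s*[^)\s][^)]*\)")


def uncited(criteria: List[str]) -> List[str]:
    """Exit criteria that carry no (verify: ...) citation."""
    return [c for c in criteria if not VERIFY_CITATION.search(c)]


def goal_ready(criteria: List[str]) -> bool:
    """True when every criterion cites a verifier.

    An empty list is treated as not ready (an assumption of this sketch).
    """
    return bool(criteria) and not uncited(criteria)
```

As the hunk itself concedes, a check like this only enforces that the slot is filled: `(verify: it works)` passes, so whether the citation is honest stays a human judgment.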
package/skill/add/streams.md
CHANGED
@@ -42,9 +42,9 @@ How much concurrency you actually get is set by each task's `autonomy:` header
 
 | `autonomy` (TASK.md) | What serializes on the human | Concurrency |
 |----------------------|------------------------------|-------------|
-| `conservative` | bundle approval **+** every Verify | pure pipelining — builds overlap, both gates queue |
+| `conservative` / `manual` | bundle approval **+** every Verify | pure pipelining — builds overlap, both gates queue (`manual` is the strict floor; same streams behaviour) |
 | `auto` (default) | bundle approval **only**; Verify auto-PASSes on evidence | real concurrency — only the decision point + residue escalations queue |
-| `auto` but **high-risk** | refused →
+| `auto` but **high-risk** | refused → must lower to `conservative` / `manual` (`unguarded_high_risk_auto`) | back to pipelining, by design |
 
 The irreducible floor is **one human approval per task at the contract decision point** — the decision point
 never drops to zero (`run.md:22`). That floor is correct; do not engineer around it.
@@ -72,6 +72,16 @@ never drops to zero (`run.md:22`). That floor is correct; do not engineer around
 worktree forked from a stale base forces the worker to recreate the frozen artifacts by hand
 (the v10 dogfood hit exactly this). Before the worker starts, confirm `git -C <worktree>
 rev-parse HEAD` equals the orchestrator's `HEAD`; if it drifted, `git merge` the base in first.
+On a runner that creates each worktree **at spawn** from a pool (e.g. Claude Code), that pool can hand
+out a STALE base, so the pre-spawn `rev-parse` evidence cell is unsatisfiable. The `unverified_fork_base`
+check then **shifts** — it never skips: the worker's **step-0** syncs to base (`git merge` the orchestrator's
+`HEAD`) and re-echoes `rev-parse HEAD`, which the orchestrator verifies at **merge-time**, before merge-back.
+The pre-spawn check stays the DEFAULT for fresh-`HEAD`-worktree runners; the merge-time path is the additive
+ALTERNATIVE for spawn-time runners — never a replacement of the pre-spawn rule.
+**The engine executes this gate** (engine-merge-base-enforcement): run
+`python3 .add/tooling/add.py wave-verify` before the first merge-back — it refuses a mismatched or
+pending echo (`unverified_fork_base`) and an off-template ledger (`wave_ledger_malformed`, fail-closed);
+`add.py check` is the standing monitor (red at `status: merging`, `fork_base_pending` WARN at `live`).
 - **Lease + timeout** — record which worker holds which task (in the wave ledger, below);
 if a worker dies, release the claim back to READY (re-spawn, do not assume partial work is sound).
 - **Failure isolates** — a worker that hits a STOP-and-escalate (below) blocks only its
@@ -114,7 +124,10 @@ base: <orchestrator HEAD at spawn — the sha every fork must equal>
 `git -C <worktree> rev-parse HEAD`, and it must equal `base:`. A tick is not evidence; a row
 you can only fill by running the command is the fresh-worktree-base check EXECUTING — the
 v12-1 lesson (words-exist ≠ method-works) closed structurally. Spawning a worker whose roster
-row lacks that evidence is refused (`unverified_fork_base`).
+row lacks that evidence is refused (`unverified_fork_base`). On a spawn-time pool runner this
+PRE-spawn paste is unsatisfiable (the pooled base is stale until the worker syncs), so the cell
+instead holds the worker's **step-0** post-sync echo (still `== base:`) and the `unverified_fork_base`
+refusal **shifts to merge-time**, before merge-back — it shifts, it never lifts.
 
 **Lifecycle — open → consume → digest → delete.** Open the ledger when the first worker
 spawns. The serial integration Verify consumes it (the merge order is read from it, one
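Reviewer note: the fork-base evidence check in the two hunks above reduces to comparing the worker's echoed `rev-parse HEAD` against the ledger's `base:` sha. A hedged sketch follows: `rev_parse_head` wraps the `git -C <worktree> rev-parse HEAD` command quoted in the text, while `fork_base_verdict` and its treatment of a missing echo as pending are hypothetical names and assumptions, not the `wave-verify` implementation.

```python
import subprocess
from typing import Optional


def rev_parse_head(worktree: str) -> str:
    """Run `git -C <worktree> rev-parse HEAD` and return the sha."""
    out = subprocess.run(
        ["git", "-C", worktree, "rev-parse", "HEAD"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()


def fork_base_verdict(base: str, echoed: Optional[str]) -> Optional[str]:
    """Return the refusal code for a pending or mismatched echo, else None."""
    if not echoed or echoed.strip() != base.strip():
        return "unverified_fork_base"
    return None
```

The same comparison serves both paths described in the diff: pre-spawn it would run on the freshly created worktree, and on a spawn-time pool runner it would run at merge-time on the worker's step-0 post-sync echo.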
@@ -181,7 +194,7 @@ STOP-and-escalate (return your findings; do not decide):
 • a discovered scope/contract gap → backward-correction, reopen Specify (principle 4)
 • any SECURITY finding → HARD-STOP, always
 • a concurrency/timing OR architecture/layering risk the tests cannot exercise
-• [include this bullet
+• [include this bullet when autonomy is conservative OR manual — any lowered rung] the verify gate itself — STOP for the human
 Auto-PASS only if autonomy=auto AND: all tests green · coverage not decreased · no test weakened ·
 no contract edited · loops dry · completeness-critic clean · no residue above. Log it as
 auto-resolved, naming this run as owner — never forge a human signature.
@@ -204,7 +217,9 @@ ripgrep otherwise. Design every IO path for failure — timeouts, retries, rollb
 </tools>
 
 <return> <!-- the worker PROPOSES; the orchestrator RECORDS. A worker never runs add.py. -->
-End with a structured verdict AND write the same into SUMMARY.md in the task dir
+End with a structured verdict AND write the same into SUMMARY.md in the task dir, then
+**commit SUMMARY.md + deltas.md** in the worktree (uncommitted worktree files survive only by
+harness courtesy — commit them so the serial-integration merge-back carries your report):
 { task, outcome: PASS|RISK-ACCEPTED|HARD-STOP|ESCALATE, evidence: <tests+coverage>,
 residue: [security|concurrency|architecture findings], deltas: [open lessons learned] }.
 Do NOT touch add.py or any shared file — the orchestrator gates on your verdict.
@@ -222,7 +237,7 @@ The contract is identical whichever model runs it (the model is disposable, like
 | **top** | complex / ambiguous / cross-cutting / broad scope of impact | `opus` | the runner's strongest reasoning model |
 
 Two rules sit **above** model choice and never bend:
-- **High-risk ⇒ `conservative`
+- **High-risk ⇒ a lowered rung (`conservative` or `manual`), regardless of model** (`run.md` high-risk guard). A
 stronger model does not buy back the human gate.
 - **Security residue always escalates** — no tier and no model auto-passes it.
 