@pilotspace/add 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (53)
  1. package/GETTING-STARTED.md +238 -0
  2. package/LICENSE +20 -0
  3. package/README.md +106 -0
  4. package/bin/cli.js +131 -0
  5. package/docs/00-introduction.md +46 -0
  6. package/docs/01-principles.md +71 -0
  7. package/docs/02-the-flow.md +93 -0
  8. package/docs/03-step-1-specify.md +117 -0
  9. package/docs/04-step-2-scenarios.md +78 -0
  10. package/docs/05-step-3-contract.md +78 -0
  11. package/docs/06-step-4-tests.md +71 -0
  12. package/docs/07-step-5-build.md +80 -0
  13. package/docs/08-step-6-verify.md +63 -0
  14. package/docs/09-the-loop.md +43 -0
  15. package/docs/10-setup-and-stages.md +75 -0
  16. package/docs/11-governance.md +87 -0
  17. package/docs/12-roles.md +99 -0
  18. package/docs/13-adoption.md +67 -0
  19. package/docs/14-foundation.md +121 -0
  20. package/docs/README.md +70 -0
  21. package/docs/add-competencies.png +0 -0
  22. package/docs/add-flow.png +0 -0
  23. package/docs/add-foundation.png +0 -0
  24. package/docs/add-hierarchy.png +0 -0
  25. package/docs/appendix-a-templates.md +88 -0
  26. package/docs/appendix-b-prompts.md +119 -0
  27. package/docs/appendix-c-glossary.md +85 -0
  28. package/docs/appendix-d-worked-example.md +152 -0
  29. package/docs/appendix-e-checklists.md +80 -0
  30. package/docs/appendix-f-requirements-matrix.md +170 -0
  31. package/package.json +47 -0
  32. package/skill/add/SKILL.md +118 -0
  33. package/skill/add/deltas.md +69 -0
  34. package/skill/add/fold.md +66 -0
  35. package/skill/add/intake.md +49 -0
  36. package/skill/add/phases/0-setup.md +35 -0
  37. package/skill/add/phases/1-specify.md +55 -0
  38. package/skill/add/phases/2-scenarios.md +36 -0
  39. package/skill/add/phases/3-contract.md +41 -0
  40. package/skill/add/phases/4-tests.md +37 -0
  41. package/skill/add/phases/5-build.md +38 -0
  42. package/skill/add/phases/6-verify.md +39 -0
  43. package/skill/add/phases/7-observe.md +32 -0
  44. package/skill/add/run.md +152 -0
  45. package/skill/add/scope.md +58 -0
  46. package/tooling/add.py +1573 -0
  47. package/tooling/templates/CONVENTIONS.md.tmpl +8 -0
  48. package/tooling/templates/GLOSSARY.md.tmpl +3 -0
  49. package/tooling/templates/MILESTONE.md.tmpl +25 -0
  50. package/tooling/templates/MODEL_REGISTRY.md.tmpl +6 -0
  51. package/tooling/templates/PROJECT.md.tmpl +42 -0
  52. package/tooling/templates/TASK.md.tmpl +111 -0
  53. package/tooling/templates/dependencies.allowlist.tmpl +2 -0
@@ -0,0 +1,93 @@
1
+ # 02 · The flow, and what is disposable
2
+
3
+ [← 01 Core principles](./01-principles.md) · [Contents](./README.md) · Next: [03 Step 1 Specify →](./03-step-1-specify.md)
4
+
5
+ ---
6
+
7
+ ## The flow
8
+
9
+ AIDD is one repeatable flow of six steps, followed by an observation loop. People perform the first four steps (with AI assistance), the AI performs the fifth (under direction), and people perform the sixth.
10
+
11
+ ![The ADD flow — a solid forward spine Specify→Scenarios→Contract→Tests→Build→Verify→Observe, with dashed backward-correction arrows (any phase may return to an earlier one), a Tests⇄Build red/green engine, and Observe looping back to the next Specify](./add-flow.png)
12
+
13
+ ```mermaid
14
+ flowchart LR
15
+ S1["1 Specify<br/>the rules"] --> S2["2 Scenarios<br/>pass/fail cases"]
16
+ S2 --> S3["3 Contract<br/>freeze the shape"]
17
+ S3 --> S4["4 Tests<br/>safety net (red)"]
18
+ S4 --> S5["5 Build<br/>AI writes code"]
19
+ S5 --> S6["6 Verify<br/>evidence + checks"]
20
+ S6 --> OBS["Observe<br/>in production"]
21
+ S5 -. "red / green engine" .-> S4
22
+ S6 -. "evidence fails → back to Build" .-> S5
23
+ S5 -. "a missing rule → back to Specify" .-> S1
24
+ OBS -. "what you learn becomes the next spec" .-> S1
25
+ classDef human fill:#FAEEDA,stroke:#BA7517,color:#633806;
26
+ classDef seam fill:#E1F5EE,stroke:#0F6E56,color:#04342C;
27
+ classDef machine fill:#E6F1FB,stroke:#185FA5,color:#042C53;
28
+ class S1,S2 human;
29
+ class S3,S4 seam;
30
+ class S5,S6 machine;
31
+ ```
32
+
33
+ > **Solid arrows are the forward spine** — you never start a phase before its input exists (forward-skip forbidden). **Dashed arrows are backward correction** — any phase may return to an earlier one to repair its artifact (the long loop, Observe → Specify, is the same rule at milestone scale). The tight Tests ⇄ Build cycle is the per-feature red/green engine.
34
+
35
+ ```text
36
+ human-led ─────────────────►│◄─────────── machine-led ──► human verify
37
+ 1 Specify → 2 Scenarios → 3 Contract → 4 Tests ⇄ 5 Build → 6 Verify
38
+ ▲                          (freeze)      └red/green┘ (AI)  (people)
39
+ ╎                                                                  │
40
+ ╎╴╴ backward correction (dashed): any phase may return to       ╴╴╴┤
41
+ ╎   an earlier one — e.g. Build exposes a missing rule             │
42
+ │                                                                  │
43
+ │                         observe in production ◄──────────────────┘
44
+ │                         │
45
+ └─────────────────────────┘ becomes the next Specify
46
+ ```
47
+
48
+ The shape is deliberate: the human-led steps establish direction, a frozen contract forms the seam in the middle, and the AI-led build runs fast and safely on the far side because everything it needs is already fixed.
49
+
50
+ ## Why the order is the order
51
+
52
+ Each step produces exactly one artifact, and each artifact is the input to the next step. The order is not a preference; it is a dependency chain.
53
+
54
+ | Step | Produces | Which is needed by |
55
+ |------|----------|--------------------|
56
+ | 1 Specify | the rules | scenarios, and everything after |
57
+ | 2 Scenarios | pass/fail cases | the tests |
58
+ | 3 Contract | the fixed shape | the tests and the build |
59
+ | 4 Tests | the failing safety net | the build and the verification |
60
+ | 5 Build | the code | the verification |
61
+ | 6 Verify | a trusted, releasable change | the release and the next loop |
62
+
63
+ The single rule of discipline follows directly: **do not begin a step until the previous artifact exists.** Skipping forward means the AI builds against a guess.
64
+
65
+ The flow runs in two directions under two rules that never conflict. **Backward correction is always allowed:** any phase may send you back to an earlier one to repair its artifact — a failing Build that exposes a missing rule sends you back to Specify, and that is the loop working ([principle 4](./01-principles.md)), not a failure. **Forward-skipping is forbidden:** you never start a phase before its input artifact exists. Correct backward freely; never skip forward.
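The two rules can be stated as a small gate. The sketch below is illustrative only: it assumes each step's artifact is tracked as simply present or absent, and `STEPS` and `may_start` are hypothetical names, not part of the `add` tooling.

```python
from __future__ import annotations

# Hypothetical gate for the six-step flow (names are illustrative only).
STEPS = ["specify", "scenarios", "contract", "tests", "build", "verify"]


def may_start(step: str, current: str, artifacts: set[str]) -> bool:
    """Return True if `step` may be started while working on `current`.

    `artifacts` holds the names of the steps whose artifact already exists.
    """
    target, here = STEPS.index(step), STEPS.index(current)
    if target <= here:
        # Backward correction (or redoing the current step) is always allowed.
        return True
    # Forward move: every earlier step must already have produced its artifact.
    return all(earlier in artifacts for earlier in STEPS[:target])


done = {"specify", "scenarios"}
print(may_start("contract", "scenarios", done))  # True: its inputs exist
print(may_start("build", "scenarios", done))     # False: contract and tests missing
print(may_start("specify", "build", done))       # True: backward correction
```

Backward moves are never blocked; only forward moves consult the artifacts, which is the whole asymmetry of the two rules.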
66
+
67
+ ## Who does what
68
+
69
+ | Step | Person's job | AI's job |
70
+ |------|--------------|----------|
71
+ | 1 Specify | decide and confirm the rules | draft; list assumptions to confirm |
72
+ | 2 Scenarios | decide what "correct" looks like | draft scenarios |
73
+ | 3 Contract | approve and freeze the shape | generate the contract and mocks |
74
+ | 4 Tests | set the targets | generate failing tests |
75
+ | 5 Build | direct in small batches | implement until tests pass |
76
+ | 6 Verify | confirm via evidence + judgment | (none — this is the human check) |
77
+
78
+ ## What survives, and what is disposable
79
+
80
+ This is the idea that most distinguishes AIDD from older practice.
81
+
82
+ **The artifacts are the durable asset.** The specification, the scenarios, the contract, and the tests capture decisions and meaning. They are what you protect, version, and carry forward.
83
+
84
+ **The code is disposable.** It is one implementation that satisfies the artifacts. If a better approach appears, or the AI model improves, the code can be regenerated against the same artifacts without loss.
85
+
86
+ A practical test of whether a team has absorbed this: ask what they would be upset to lose. If the answer is "the code," they are still working the old way. If the answer is "the contracts and the tests," they are working in AIDD.
87
+
88
+ > **Do:** invest in clear, stable specs, contracts, and tests.
89
+ > **Don't:** measure progress by how much code was generated or reused — that counts the cheap, disposable thing.
90
+
91
+ ## How the rest of Part II is organized
92
+
93
+ Each of the next seven chapters takes one step (and then the loop) and gives it the same treatment: its purpose, who does it, the artifact it produces, the AI prompt that drives it, the exit check that says it is done, and what to do when that check fails. The running example continues throughout.
@@ -0,0 +1,117 @@
1
+ # 03 · Step 1 — Specify
2
+
3
+ [← 02 The flow](./02-the-flow.md) · [Contents](./README.md) · Next: [04 Step 2 Scenarios →](./04-step-2-scenarios.md)
4
+
5
+ > **Purpose:** state, in plain language, what the feature must do and what it must reject, with no ambiguity left for the AI to resolve by guessing.
6
+ > **Produces:** `SPEC.md` for the feature.
7
+ > **How it works — co-specification:** AI and human **brainstorm the shape together**; the AI drafts; the **human validates, with the AI's advice.** The decisive advice is a *least-sure flag* — the AI names the one or two things most likely to be wrong, so the human's attention lands where it matters. The human owns the decision; the AI owns surfacing what it does not yet know.
8
+
9
+ ---
10
+
11
+ ## Why this step is first
12
+
13
+ The specification is the description the AI will build from. Every other artifact descends from it. Anything vague here does not stay vague — it becomes a concrete wrong guess in the code, discovered late. The cheapest moment to remove an ambiguity is now, in a sentence, before anything depends on it.
14
+
15
+ There is also a diagnostic value: **if you cannot write the spec, you do not yet understand the feature well enough to build it.** The inability to specify is information, not an obstacle to push past.
16
+
17
+ ## Co-specification — how the spec gets made
18
+
19
+ A specification is not dictated by one side. It is made in three moves:
20
+
21
+ 1. **Diverge — brainstorm by both.** Before drafting, the AI surfaces the *decision space*: the two or three genuine ways to frame the feature, and the open questions it would otherwise resolve by guessing. You react — add, kill, redirect. This is the brainstorm, and it lives in the conversation, not in a new document.
22
+ 2. **Converge — the AI drafts, and ranks its own uncertainty.** The AI writes the spec in the four-part format below, then ranks what it is least sure about. It does not hand you a flat wall of equal-looking assumptions to nod through; it tells you *where it is most likely wrong, and what that would cost.*
23
+ 3. **Validate — you decide, with the AI's advice.** You read the ranked uncertainty first, then confirm, correct, or send it back. Your approval is real because your attention was aimed.
24
+
25
+ The brainstorm leaves a *light trace, not a document.* What you chose becomes a rule; what you weighed and dropped becomes a one-line **`Framings weighed:`** note; what stayed genuinely uncertain becomes a **least-sure flag**. Nothing new to maintain — the residue lands in the spec you were writing anyway.
26
+
27
+ ## What a good specification contains
28
+
29
+ Four parts, kept short:
30
+
31
+ 1. **Must** — the behaviors the feature is required to perform.
32
+ 2. **Reject** — the inputs or situations it must refuse, each paired with a named error.
33
+ 3. **After** — the state that is true once it succeeds (what changed).
34
+ 4. **Assumptions — least-sure first** — the things you are taking for granted, **ranked so the most-likely-wrong come first.** The top one or two carry a `⚠` flag with *why it is uncertain* and *what it costs if wrong*; the rest are the low-stakes tail. A spec with genuinely nothing uncertain still names its single biggest risk, however small — the AI never claims a blank mind.
35
+
36
+ Naming the errors matters. "Reject bad amounts" is an instruction to guess; `amount <= 0 -> "amount_invalid"` is a rule that produces a testable scenario and a defined contract response.
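Named codes are what make the Reject list mechanical. As a hedged sketch, the Reject rules from the transfer example in this chapter can be held as ordered (predicate, code) pairs. The `Transfer` record, its field names, and the first-match-wins ordering are assumptions for illustration; the spec itself does not say which code wins when several rules fail at once, which is exactly the kind of point to raise as an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Transfer:
    """Hypothetical request record, for illustration only."""
    source: str
    destination: str
    amount: int
    source_balance: int
    caller_owns_both: bool


# Each Reject rule from the spec becomes a (violated?, error_code) pair.
# Assumption: rules are checked in this order and the first match wins.
REJECT_RULES: list[tuple[Callable[[Transfer], bool], str]] = [
    (lambda t: not t.caller_owns_both, "forbidden"),
    (lambda t: t.amount <= 0, "amount_invalid"),
    (lambda t: t.source == t.destination, "same_account"),
    (lambda t: t.source_balance < t.amount, "insufficient_funds"),
]


def first_rejection(t: Transfer) -> Optional[str]:
    """Return the named error code of the first violated rule, or None."""
    for violated, code in REJECT_RULES:
        if violated(t):
            return code
    return None


print(first_rejection(Transfer("A", "B", 0, 100, True)))   # amount_invalid
print(first_rejection(Transfer("A", "A", 10, 100, True)))  # same_account
print(first_rejection(Transfer("A", "B", 50, 20, True)))   # insufficient_funds
print(first_rejection(Transfer("A", "B", 30, 100, True)))  # None
```

Each pair maps one-to-one onto a scenario and onto an error response in the contract, which a sentence like "reject bad amounts" cannot do.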
37
+
38
+ ## Template
39
+
40
+ ```
41
+ # SPEC.md
42
+ Feature: <name>
43
+ Framings weighed: <chosen> (chosen) · <alternative> · <alternative>
44
+ Must:
45
+ - <required behavior>
46
+ Reject:
47
+ - <bad input / situation> -> "<error_code>"
48
+ After:
49
+ - <what is true once it succeeds>
50
+ Assumptions — least-sure first:
51
+ ⚠ <most-likely-wrong assumption> — least sure because <why>; if wrong: <cost>
52
+ - [x] <confirmed / low-stakes assumption> — <one line>
53
+ ```
54
+
55
+ ## ▶ Example
56
+
57
+ ```
58
+ Feature: Transfer money between my own accounts
59
+ Framings weighed: synchronous single-currency transfer (chosen) · queued transfer · multi-currency with FX
60
+ Must:
61
+ - move an amount from one of my accounts to another of mine
62
+ - amount > 0
63
+ - source and destination are different accounts
64
+ - source has enough balance
65
+ Reject:
66
+ - amount <= 0 -> "amount_invalid"
67
+ - source == destination -> "same_account"
68
+ - balance < amount -> "insufficient_funds"
69
+ - account not mine -> "forbidden"
70
+ After:
71
+ - source balance -= amount, destination balance += amount
72
+ Assumptions — least-sure first:
73
+ ⚠ same currency only (no FX) in v1 — least sure because the ticket never said; if wrong: the whole amount/rounding model changes and this contract is wrong
74
+ - [x] no daily limit in v1 — confirmed: out of scope for v1
75
+ ```
76
+
77
+ The `Framings weighed:` line shows what was considered and dropped, so the chosen shape is a *decision*, not a default. The `⚠` line is the one the stakeholder reads first: the assumption most likely to be wrong and most expensive to get wrong. The flat `[x]` line is real but low-stakes. A reviewer can now spend their attention where it pays.
78
+
79
+ ## The AI's role here
80
+
81
+ Use the AI to **open the space and then narrow it honestly.** First it brainstorms the genuine framings with you (diverge). Then it drafts the spec from whatever raw material you have — a ticket, an interview, a contract document — listing every assumption it had to make, **ranked least-sure first**, and flagging the one or two it is least confident in with *why* and *what it costs if wrong*. Its instinct is to fill gaps silently and present a confident wall; the method forces those gaps into the open, and forces the confident wall to declare its own soft spots. See `playbook/1_specify.md` in [Appendix B](./appendix-b-prompts.md).
82
+
83
+ The defining instruction: *if a requirement is unclear, ask — do not resolve it by guessing — and of the things you must assume, say plainly which you are least sure about.*
84
+
85
+ ## Common mistakes
86
+
87
+ - **Stating only the happy path.** The "Reject" list is where most real complexity lives; an empty one usually means it has not been thought through.
88
+ - **Free-text errors.** Errors must be named codes, not sentences, so they can become scenarios and contract responses.
89
+ - **Hidden assumptions.** If an assumption is not written down, it is not confirmed — it is a future bug with a delay timer.
90
+ - **A flat wall of "confirmed" assumptions.** Eight equal-looking ticks invite a reflex approval. Rank them; flag the one or two that are load-bearing. An unranked list hides the risk inside the noise.
91
+
92
+ ## Exit check
93
+
94
+ A spec is done when:
95
+
96
+ - [ ] Every required behavior is stated explicitly.
97
+ - [ ] Every rejection has a named error code.
98
+ - [ ] The success state-change is described.
99
+ - [ ] The assumptions are ordered least-sure first, and the one or two `⚠` flags carry *why* + *cost* — or, for genuinely trivial scope, an honest "none material" that still names the single biggest risk.
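Most of this exit check is mechanical enough to script. The following is a minimal sketch, assuming the spec has already been parsed into the hypothetical `Spec` and `Assumption` records below and that error codes are lower_snake_case; it checks structure only and cannot judge whether the rules are the right ones.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

# Assumed naming style for error codes, e.g. "amount_invalid".
CODE = re.compile(r"^[a-z][a-z0-9_]*$")


@dataclass
class Assumption:
    text: str
    flagged: bool = False  # carries the ⚠ marker
    why: str = ""          # why it is uncertain
    cost: str = ""         # what it costs if wrong


@dataclass
class Spec:
    must: list[str]
    reject: dict[str, str]  # situation -> error code
    after: list[str]
    assumptions: list[Assumption] = field(default_factory=list)  # least-sure first


def exit_check(spec: Spec) -> list[str]:
    """Return the failed exit-check items; an empty list means the spec is done."""
    problems = []
    if not spec.must:
        problems.append("no required behavior stated")
    for situation, code in spec.reject.items():
        if not CODE.match(code):
            problems.append(f"rejection {situation!r} has no named error code")
    if not spec.after:
        problems.append("success state-change not described")
    flags = [a for a in spec.assumptions if a.flagged]
    if not 1 <= len(flags) <= 2:
        problems.append("need one or two ⚠ flags, even when the risk is small")
    if spec.assumptions[:len(flags)] != flags:
        problems.append("⚠ flags are not at the top of the ranking")
    for a in flags:
        if not (a.why and a.cost):
            problems.append(f"flag {a.text!r} is missing its why or its cost")
    return problems


spec = Spec(
    must=["move an amount from one of my accounts to another of mine"],
    reject={"amount <= 0": "amount_invalid", "account not mine": "forbidden"},
    after=["source balance -= amount, destination balance += amount"],
    assumptions=[
        Assumption("same currency only (no FX) in v1", flagged=True,
                   why="the ticket never said", cost="the amount model changes"),
        Assumption("no daily limit in v1"),
    ],
)
print(exit_check(spec))  # []
```

A script can confirm that the ranking and the flags exist; it cannot confirm that anyone read them.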
100
+
101
+ The shift from older practice: you no longer pre-confirm every assumption to advance. You confirm that the AI has *ranked* its uncertainty and that you have *engaged the top of the rank.* Stated honestly: the flag makes a genuine review cheap and a lazy one visibly negligent — it cannot force the read. That is the most a lightweight check can buy.
102
+
103
+ ## If the check fails
104
+
105
+ If you cannot state a rule clearly, the feature is not ready to build. Stop, take the question to whoever owns the requirement, and resolve it. Do not let the AI proceed on an unresolved point — that is the exact failure the whole method exists to prevent.
106
+
107
+ ---
108
+
109
+ ## The one approval, and where the flag really lands
110
+
111
+ In the one-approval front of the flow, you do not approve the spec on its own; you approve the whole frozen bundle (spec, scenarios, contract, tests) once, at the contract freeze. So the least-sure flag is **bundle-wide**: at that single seam the AI leads with *"of everything I'm asking you to freeze, these one or two points are most likely wrong"*, and a flag may point at an uncovered scenario or at the contract shape, not only at a spec assumption. The ranking you do here in Specify is the first input to that one gate. See [05 Contract](./05-step-3-contract.md) and the `add` skill's `run.md`.
112
+
113
+ ---
114
+
115
+ ## When the feature has a user interface
116
+
117
+ For anything with a UI, extend this step with a quick design: the **user flows** (the happy path and the main alternatives) and **every screen state** — loading, empty, error, and success. Correct logic behind a confusing or incomplete interface is still a poor product, and undesigned states are exactly where an AI will improvise something ugly. In the early **Prototype** stage, this design work is the main event and the code is throwaway (see [10 Stages](./10-setup-and-stages.md)).
@@ -0,0 +1,78 @@
1
+ # 04 · Step 2 — Scenarios
2
+
3
+ [← 03 Step 1 Specify](./03-step-1-specify.md) · [Contents](./README.md) · Next: [05 Step 3 Contract →](./05-step-3-contract.md)
4
+
5
+ > **Purpose:** rewrite each rule from the spec as a concrete, pass-or-fail scenario.
6
+ > **Produces:** `features/<name>.feature`.
7
+ > **Person's job:** decide what "correct" looks like in concrete situations. **AI's job:** draft the scenarios.
8
+
9
+ ---
10
+
11
+ ## Why turn rules into scenarios
12
+
13
+ A plain rule is still open to interpretation. "Source must have enough balance" leaves two questions open: enough for what, exactly? And what happens to the balances when it is *not* enough? A scenario removes the interpretation by pinning a specific situation to a specific expected result.
14
+
15
+ Scenarios occupy a unique position: they are **readable by people and checkable by machines at the same time.** A product owner can confirm a scenario is what they meant; a test can be generated directly from it. This makes them the bridge between the human-led front of the flow and the machine-led back. They are the single highest-leverage artifact in the method, because everything downstream (the tests, and through them the build's definition of success) is generated from them.
16
+
17
+ ## The form
18
+
19
+ Each scenario has three parts:
20
+
21
+ - **Given** — the starting situation.
22
+ - **When** — the action taken.
23
+ - **Then** — the result that must follow.
24
+
25
+ Where a rule also constrains what must *not* change, add an **And** clause to state it. Unwanted side effects are caught by what you assert stays the same, not only by what you assert changes.
26
+
27
+ ## Template
28
+
29
+ ```
30
+ Scenario: <short name>
31
+ Given <starting situation>
32
+ When <action>
33
+ Then <expected result>
34
+ And <what must remain unchanged> # when relevant
35
+ ```
36
+
37
+ ## ▶ Example
38
+
39
+ ```
40
+ Scenario: successful transfer
41
+ Given A has 100 and B has 0, both mine
42
+ When I transfer 30 from A to B
43
+ Then A has 70 and B has 30
44
+
45
+ Scenario: insufficient funds
46
+ Given A has 20 and B has 0, both mine
47
+ When I transfer 50 from A to B
48
+ Then it is rejected "insufficient_funds"
49
+ And no balance changes
50
+
51
+ Scenario: not my account
52
+ Given account C is not mine
53
+ When I transfer 10 from C to B
54
+ Then it is rejected "forbidden"
+ And no balance changes
55
+ ```
56
+
57
+ The `And no balance changes` line is doing real work: it specifies that a rejected transfer must leave the world untouched — a property the AI could easily violate by deducting before checking.
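The sketch below makes that concrete. It assumes balances live in a plain dict and uses two hypothetical implementations: one that deducts before checking and one that checks first. Both return the right error code, so the `Then` line alone passes for both; only the `And no balance changes` line tells them apart.

```python
# Two hypothetical implementations of the transfer, run against the
# "insufficient funds" scenario. Balances are a plain dict for illustration.

def transfer_buggy(balances: dict, src: str, dst: str, amount: int) -> str:
    balances[src] -= amount           # deducts BEFORE checking...
    if balances[src] < 0:
        return "insufficient_funds"   # ...and never restores the balance
    balances[dst] += amount
    return "ok"


def transfer_checked(balances: dict, src: str, dst: str, amount: int) -> str:
    if balances[src] < amount:        # check first; touch nothing on rejection
        return "insufficient_funds"
    balances[src] -= amount
    balances[dst] += amount
    return "ok"


def insufficient_funds_scenario(transfer) -> dict:
    """Given A has 20 and B has 0, When I transfer 50 from A to B."""
    before = {"A": 20, "B": 0}
    balances = dict(before)
    result = transfer(balances, "A", "B", 50)
    return {
        "then_rejected": result == "insufficient_funds",  # Then it is rejected
        "and_unchanged": balances == before,              # And no balance changes
    }


print(insufficient_funds_scenario(transfer_buggy))
# {'then_rejected': True, 'and_unchanged': False}
print(insufficient_funds_scenario(transfer_checked))
# {'then_rejected': True, 'and_unchanged': True}
```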
58
+
59
+ ## The AI's role here
60
+
61
+ Hand the AI the spec and have it draft a scenario for each rule, including the rejection rules. Then read them as the person who owns the requirement: do they describe what you actually meant? Correct any that drift. See `playbook/2_scenarios.md` in [Appendix B](./appendix-b-prompts.md).
62
+
63
+ ## Common mistakes
64
+
65
+ - **Only happy-path scenarios.** Every "Reject" rule in the spec needs its own scenario, or that rule will never be verified.
66
+ - **Vague results.** "Then it works" is not checkable. The result must be a specific, observable fact ("A has 70").
67
+ - **Forgetting the unchanged state.** For any rejection, assert that nothing changed; otherwise a partial, corrupting failure can pass.
68
+
69
+ ## Exit check
70
+
71
+ - [ ] Every "Must" rule has at least one scenario.
72
+ - [ ] Every "Reject" rule has at least one scenario.
73
+ - [ ] Each scenario's result is a specific, observable fact.
74
+ - [ ] Rejections assert what must stay unchanged.
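The first two items can be checked mechanically if each scenario records which spec rules it covers. This is a sketch under that assumption; rule ids such as `reject:forbidden` are hypothetical labels, not a format the method prescribes. Run against the three scenarios in the example above, which is an excerpt, it reports the two Reject rules that still need scenarios.

```python
# Hypothetical traceability data: spec rules by id, and the rules each scenario covers.
SPEC_RULES = {
    "must:move": "move an amount from one of my accounts to another of mine",
    "reject:amount_invalid": "amount <= 0",
    "reject:same_account": "source == destination",
    "reject:insufficient_funds": "balance < amount",
    "reject:forbidden": "account not mine",
}

SCENARIOS = {
    "successful transfer": ["must:move"],
    "insufficient funds": ["reject:insufficient_funds"],
    "not my account": ["reject:forbidden"],
}


def uncovered(rules: dict, scenarios: dict) -> list:
    """Return the ids of rules that no scenario covers, in spec order."""
    covered = {rule for ids in scenarios.values() for rule in ids}
    return [rule for rule in rules if rule not in covered]


print(uncovered(SPEC_RULES, SCENARIOS))
# ['reject:amount_invalid', 'reject:same_account']
```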
75
+
76
+ ## If the check fails
77
+
78
+ A rule with no scenario will never be tested, and therefore will never be verified — it is a rule in name only. Either write the missing scenario or remove the rule from the spec. Do not carry an unscenarioed rule into the contract.
@@ -0,0 +1,78 @@
1
+ # 05 · Step 3 — Contract
2
+
3
+ [← 04 Step 2 Scenarios](./04-step-2-scenarios.md) · [Contents](./README.md) · Next: [06 Step 4 Tests →](./06-step-4-tests.md)
4
+
5
+ > **Purpose:** fix the external shape of the feature — interfaces, data structures, names, and error cases — and freeze it.
6
+ > **Produces:** `contracts/<name>.md` (plus a mock and contract tests).
7
+ > **Person's job:** approve and freeze the shape. **AI's job:** generate the first draft, the mock, and the contract tests.
8
+
9
+ ---
10
+
11
+ ## The seam of the whole method
12
+
13
+ This step is the seam between the human-led and machine-led halves of the flow, and it is what makes everything after it safe.
14
+
15
+ The reasoning is simple. The AI is allowed to write and rewrite code quickly. That is only safe if there is a stable surface that the rest of the system depends on and that the AI is not allowed to disturb. The frozen contract is that surface. Below it, the code is disposable and can be regenerated freely; above it, nothing breaks, because the shape it depends on does not move.
16
+
17
+ Freezing the contract is therefore not bureaucracy — it is the precondition for granting the AI real autonomy in the build step. Without it, every regeneration risks silently changing an interface that another part of the system relies on.
18
+
19
+ ## What the contract contains
20
+
21
+ - **Interfaces** — the endpoints, functions, or messages, with their inputs and outputs.
22
+ - **Data structures** — the request and response shapes, and the persistent schema.
23
+ - **Names** — drawn from the project glossary, so the same concept has the same name everywhere.
24
+ - **Error cases** — the defined failures, using the error codes from the spec.
25
+
26
+ ## Template
27
+
28
+ ```
29
+ # contracts/<name>.md
30
+ <METHOD> <path> body: { <fields> }
31
+ 200 -> { <success fields> }
32
+ 4xx -> { error: "<code>" | "<code>" }
33
+ Schema: <tables/fields touched, and access pattern>
34
+ Status: FROZEN @ v<n>
35
+ ```
36
+
37
+ ## ▶ Example
38
+
39
+ ```
40
+ POST /transfers body: { fromAccountId, toAccountId, amount }
41
+ 200 -> { transferId, fromBalance, toBalance }
42
+ 400 -> { error: "amount_invalid" | "same_account" | "insufficient_funds" }
43
+ 403 -> { error: "forbidden" }
44
+ Schema: accounts.balance (read + write, must be transactional)
45
+ Status: FROZEN @ v1
46
+ ```
47
+
48
+ Every error code traces back to a rejection rule in the spec, and the schema note (`must be transactional`) flags the one place where correctness depends on more than shape — a hint the verification step will follow up.
49
+
50
+ ## The AI's role here
51
+
52
+ The AI generates the contract from the spec and design, and additionally produces two things that make the contract enforceable: a **mock server** that returns the contracted shapes, and **contract tests** that pin those shapes. With the mock in place, work that depends on this feature can proceed before the real code exists. See `playbook/3_contract.md` in [Appendix B](./appendix-b-prompts.md).
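A mock and its contract tests can be very small. The sketch below stands in for both, assuming the example contract above. A real mock would normally be an HTTP server, and the names `CANNED`, `mock_post_transfers`, `conforms`, and the `case` selector are hypothetical. Dependent work picks a canned case to exercise each branch, and the contract test fails as soon as any response drifts from the frozen shape.

```python
from __future__ import annotations

# Shapes transcribed from the example contract (POST /transfers, FROZEN @ v1).
REQUEST_FIELDS = {"fromAccountId", "toAccountId", "amount"}
SUCCESS_FIELDS = {"transferId", "fromBalance", "toBalance"}
ERROR_CODES = {
    400: {"amount_invalid", "same_account", "insufficient_funds"},
    403: {"forbidden"},
}

# Canned responses a hypothetical mock would serve, one per branch.
CANNED = {
    "ok": (200, {"transferId": "t-1", "fromBalance": 70, "toBalance": 30}),
    "insufficient_funds": (400, {"error": "insufficient_funds"}),
    "forbidden": (403, {"error": "forbidden"}),
}


def mock_post_transfers(body: dict, case: str = "ok") -> tuple[int, dict]:
    """Return the canned response for `case` after checking the request shape."""
    if set(body) != REQUEST_FIELDS:
        raise ValueError(f"request is off-contract: {sorted(body)}")
    return CANNED[case]


def conforms(status: int, payload: dict) -> bool:
    """Contract test: does this response match the frozen shape?"""
    if status == 200:
        return set(payload) == SUCCESS_FIELDS
    if status in ERROR_CODES:
        return set(payload) == {"error"} and payload["error"] in ERROR_CODES[status]
    return False  # any other status is not in the contract


body = {"fromAccountId": "A", "toAccountId": "B", "amount": 30}
print(all(conforms(*mock_post_transfers(body, case)) for case in CANNED))  # True
print(conforms(400, {"error": "not_enough_money"}))  # False: undefined code
```

The same `conforms` check can later be pointed at the real implementation, so the mock and the build are held to one shape.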
53
+
54
+ ## The change-request rule
55
+
56
+ Once frozen, a contract does not change casually. A needed change is a **change request**: you return to [Step 1](./03-step-1-specify.md), adjust the spec, re-freeze at a new version, and come forward again. The AI never alters a frozen contract on its own initiative.
57
+
58
+ This rule is what keeps the contract trustworthy as a foundation. If it could drift, nothing built on it would be safe.
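The rule can be backed by a cheap check in the pipeline. As a sketch, assuming the contract file ends with the `Status: FROZEN @ v<n>` line from the template and that a fingerprint is recorded at freeze time (the lock and the helper names are hypothetical), any edit that does not bump the version is flagged as a silent change:

```python
import hashlib
import re

# Matches the template's status line, e.g. "Status: FROZEN @ v1".
STATUS = re.compile(r"^Status:\s*FROZEN @ v(\d+)\s*$", re.MULTILINE)


def fingerprint(contract_text: str) -> tuple:
    """Return (frozen version, SHA-256 hex digest of the whole contract)."""
    match = STATUS.search(contract_text)
    if match is None:
        raise ValueError("contract is not marked FROZEN")
    digest = hashlib.sha256(contract_text.encode("utf-8")).hexdigest()
    return int(match.group(1)), digest


def classify_change(locked: tuple, current_text: str) -> str:
    """Compare the current contract with the fingerprint recorded at freeze time."""
    version, digest = fingerprint(current_text)
    if digest == locked[1]:
        return "unchanged"
    if version > locked[0]:
        return f"change request: re-approve at v{version}"
    return "silent edit: reject"


v1 = "POST /transfers body: { fromAccountId, toAccountId, amount }\nStatus: FROZEN @ v1\n"
lock = fingerprint(v1)  # recorded once, when the contract is frozen
edited = v1.replace("amount }", "amount, memo }")
bumped = edited.replace("@ v1", "@ v2")
print(classify_change(lock, v1))      # unchanged
print(classify_change(lock, edited))  # silent edit: reject
print(classify_change(lock, bumped))  # change request: re-approve at v2
```

A version bump only shows that the change was declared; it still needs the trip back through Step 1 and a fresh approval.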
59
+
60
+ > **Do:** version and freeze the contract before any implementation.
61
+ > **Don't:** let the build step quietly change an interface to make code easier — that breaks everything depending on it.
62
+
63
+ ## Common mistakes
64
+
65
+ - **Inconsistent names.** If the contract calls it `fromAccountId` and the schema calls it `src_acct`, the AI will produce subtle mismatches. Use the glossary everywhere.
66
+ - **Undefined errors.** Every failure the spec rejects must have a contracted response, or callers cannot handle it.
67
+ - **Freezing too early or too late.** Freeze once the spec and design are stable — not before they are agreed, and not after code has already been written against an unfrozen shape.
68
+
69
+ ## Exit check
70
+
71
+ - [ ] Contract is versioned and marked `FROZEN`.
72
+ - [ ] Contract tests pass against the mock.
73
+ - [ ] Every name matches the project glossary.
74
+ - [ ] Every spec rejection has a contracted error response.
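The last item can be checked by comparing the quoted codes on both sides. The sketch assumes the spec and contract follow the templates in Steps 1 and 3 (`-> "code"` on Reject lines, `error: "a" | "b"` on contract lines); the regular expressions are illustrative and would need adjusting for other layouts.

```python
from __future__ import annotations

import re

SPEC = '''Reject:
- amount <= 0 -> "amount_invalid"
- source == destination -> "same_account"
- balance < amount -> "insufficient_funds"
- account not mine -> "forbidden"
'''

CONTRACT = '''POST /transfers body: { fromAccountId, toAccountId, amount }
200 -> { transferId, fromBalance, toBalance }
400 -> { error: "amount_invalid" | "same_account" | "insufficient_funds" }
403 -> { error: "forbidden" }
Status: FROZEN @ v1
'''


def spec_codes(spec: str) -> set[str]:
    """Error codes named on the spec's `... -> "code"` Reject lines."""
    return set(re.findall(r'->\s*"([a-z_]+)"', spec))


def contract_codes(contract: str) -> set[str]:
    """Every quoted code on a contract line that contains `error:`."""
    codes = set()
    for line in contract.splitlines():
        if "error:" in line:
            codes.update(re.findall(r'"([a-z_]+)"', line))
    return codes


print(sorted(spec_codes(SPEC) - contract_codes(CONTRACT)))  # []
trimmed = CONTRACT.replace('403 -> { error: "forbidden" }\n', "")
print(sorted(spec_codes(SPEC) - contract_codes(trimmed)))  # ['forbidden']
```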
75
+
76
+ ## If the check fails
77
+
78
+ If the contract is not yet stable enough to freeze, the upstream artifacts are not settled — return to the spec or scenarios and resolve what is still open. If a frozen contract later needs to change, treat it as a change request rather than an edit; the discipline is the point.
@@ -0,0 +1,71 @@
1
+ # 06 · Step 4 — Tests
2
+
3
+ [← 05 Step 3 Contract](./05-step-3-contract.md) · [Contents](./README.md) · Next: [07 Step 5 Build →](./07-step-5-build.md)
4
+
5
+ > **Purpose:** turn the scenarios and contract into automated tests, and confirm they fail before any code exists.
6
+ > **Produces:** a failing (red) automated test suite.
7
+ > **Person's job:** set the targets and coverage. **AI's job:** generate the tests.
8
+
9
+ ---
10
+
11
+ ## Why tests come before code
12
+
13
+ This is the step that operationalizes the second principle, *trust through evidence, not inspection.* The tests written here are how you will judge the code the AI writes in [Step 5](./07-step-5-build.md). For that judgment to be honest, the tests must exist *before* the code.
14
+
15
+ The reason is mechanical. If code is written first and tests after, the tests are unconsciously shaped to match whatever the code happens to do — including its mistakes. Tests written first, from the scenarios, are shaped only by the agreed definition of correct. They are an independent standard the code must rise to meet, not a description of what the code already does.
16
+
17
+ ## The must-fail principle
18
+
19
+ After generating the tests, you run them — and they must **fail**, because no implementation exists yet. This sounds trivial and is not. A test that passes before any code is written is testing nothing; it is a false reassurance that will later wave bad code through. Confirming the suite is "red for the right reason" (a missing implementation, not a broken test) is what makes it a genuine safety net.
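Telling the right kind of red from the wrong kind can be partly automated. A minimal sketch, assuming a convention the method does not itself state: the unimplemented feature is a stub that raises `NotImplementedError`. A check that fails with that error is red for the right reason, a check that passes is fake, and any other error points at the test or the harness. The three example checks are hypothetical and deliberately not named `test_*`, so a test runner will not collect them.

```python
# Assumed convention: the unimplemented feature is a stub that raises
# NotImplementedError. The three checks below are hypothetical examples.

def transfer(src: str, dst: str, amount: int) -> str:
    raise NotImplementedError("transfer is not built yet")


def rejects_zero_amount():
    result = transfer("A", "B", 0)
    assert result == "amount_invalid"


def always_true():
    # Fake: never calls the feature, so it passes with no implementation.
    assert 1 + 1 == 2


def broken_harness():
    # Red for the wrong reason: a typo in the test itself.
    result = trnasfer("A", "B", 10)  # noqa: F821
    assert result == "ok"


def classify(check) -> str:
    """Run one check before any implementation exists and say what its result means."""
    try:
        check()
    except NotImplementedError:
        return "red (right reason)"
    except Exception as exc:  # AssertionError, NameError, fixture errors, ...
        return f"red (wrong reason: {type(exc).__name__})"
    return "green before build: fake test"


for check in (rejects_zero_amount, always_true, broken_harness):
    print(check.__name__, "->", classify(check))
# rejects_zero_amount -> red (right reason)
# always_true -> green before build: fake test
# broken_harness -> red (wrong reason: NameError)
```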
20
+
21
+ ## What to test
22
+
23
+ - **One test per scenario** — every scenario from [Step 2](./04-step-2-scenarios.md) becomes an executable test.
24
+ - **Contract conformance** — tests that pin the shapes and error responses from [Step 3](./05-step-3-contract.md).
25
+ - **Edge cases from the spec** — the boundary values implied by the "Reject" rules.
26
+ - **Behavior, not internals** — tests assert what the feature does (the observable result), never how it is implemented, so the code can be regenerated freely beneath them.
27
+
28
+ ## ▶ Example
29
+
30
+ ```python
31
+ def test_successful_transfer():
32
+     a = account(balance=100, owner=me); b = account(balance=0, owner=me)
33
+     r = transfer(a.id, b.id, 30)
34
+     assert r.status == 200
35
+     assert a.balance == 70 and b.balance == 30
36
+
37
+ def test_insufficient_funds():
38
+     a = account(balance=20, owner=me); b = account(balance=0, owner=me)
39
+     r = transfer(a.id, b.id, 50)
40
+     assert r.status == 400 and r.error == "insufficient_funds"
41
+     assert a.balance == 20  # unchanged: the side-effect assertion
42
+
43
+ def test_not_my_account():
44
+     c = account(balance=100, owner=someone_else); b = account(balance=0, owner=me)
45
+     r = transfer(c.id, b.id, 10)
46
+     assert r.status == 403 and r.error == "forbidden"
+     assert c.balance == 100 and b.balance == 0  # unchanged: nothing moved
47
+ ```
48
+
49
+ Run this now, with no implementation: all three fail. That is the correct, honest starting point for the build.
50
+
51
+ ## The AI's role here
52
+
53
+ The AI generates the test suite from the scenarios and contract. Your job is to confirm two things it cannot judge for itself: that each test asserts *behavior* rather than internal detail, and that none of them pass by accident before code exists. See `playbook/4_tests.md` in [Appendix B](./appendix-b-prompts.md).
54
+
55
+ ## Common mistakes
56
+
57
+ - **Tests that test the implementation.** Asserting on private internals couples the test to one version of the code and defeats disposability.
58
+ - **A green suite before the build.** This means the tests are not actually exercising the missing feature; fix them now.
59
+ - **Skipping the side-effect assertions.** Without `assert a.balance == 20` on the rejection path, a corrupting partial failure passes silently.
60
+ - **No coverage target.** Without a recorded target, coverage can quietly erode during the build.
61
+
62
+ ## Exit check
63
+
64
+ - [ ] One test exists per scenario.
65
+ - [ ] The suite runs in the pipeline and is **red for the right reason**.
66
+ - [ ] Tests assert observable behavior, not internals.
67
+ - [ ] A coverage target is recorded.
68
+
69
+ ## If the check fails
70
+
71
+ If a test passes before any implementation, it is a fake test — repair it before continuing, because it is your only independent check on the AI. If the suite is red for the wrong reason (a syntax or harness error), fix the harness first; a build cannot be judged against a broken net.
@@ -0,0 +1,80 @@
1
+ # 07 · Step 5 — Build
2
+
3
+ [← 06 Step 4 Tests](./06-step-4-tests.md) · [Contents](./README.md) · Next: [08 Step 6 Verify →](./08-step-6-verify.md)
4
+
5
+ > **Purpose:** have the AI implement the feature so that every failing test passes.
6
+ > **Produces:** working code, plus the evidence that the tests now pass.
7
+ > **Person's job:** direct, in small batches. **AI's job:** implement.
8
+
9
+ ---
10
+
11
+ ## The only step the AI leads
12
+
13
+ This is the step the AI is genuinely good at, and the only one where it should be doing the heavy lifting. It works precisely because the previous four steps removed the ambiguity. The AI is no longer guessing what to build; it has a spec, a set of scenarios, a frozen contract, and a suite of failing tests that define "done" exactly. Its task is narrow and checkable: turn the suite green.
14
+
15
+ This is the difference between AIDD and vague-prompt coding. The same agent that produces confident nonsense from "build me a transfer feature" produces correct, bounded code from "make these specific failing tests pass without changing them." The agent did not change; the direction did.
16
+
17
+ ## The build prompt
18
+
19
+ The instruction is explicit about constraints, because the constraints are what keep the speed safe.
20
+
21
+ ```
22
+ Read SPEC.md, contracts/<name>.md, and tests/<name>_test.py.
23
+ Implement the feature so that EVERY test passes.
24
+ Constraints:
25
+ - Do NOT change any test.
26
+ - Do NOT change the contract.
27
+ - <feature-specific safety rule>.
28
+ - Stop and ask if any requirement is unclear — do not guess.
29
+ - Use only packages listed in dependencies.allowlist.
30
+ Report which tests pass and exactly what you changed.
31
+ ```
32
+
33
+ For the running example, the feature-specific safety rule is *"make the balance update atomic — debit and credit occur in a single transaction."* This is the one correctness property the tests alone may not force, so it is named directly to the builder.
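To make the atomicity rule concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the `TransferRejected` exception, the `unknown_account` reason and the `open_ledger`/`balances` helpers are hypothetical, not part of the running example's contract. The debit is a conditional `UPDATE` that only succeeds if the balance covers the amount, and both updates run inside one transaction, so a failure at any point rolls the whole transfer back.

```python
import sqlite3


class TransferRejected(Exception):
    """Hypothetical rejection carrying a reason code."""


def open_ledger() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 20), ("b", 0)])
    conn.commit()
    return conn


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> None:
    if amount <= 0:
        raise TransferRejected("amount_invalid")
    if src == dst:
        raise TransferRejected("same_account")
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so the debit and the credit land together or not at all.
    with conn:
        # Check and debit in one statement: no gap between "is there enough?" and "take it".
        debit = conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ? AND balance >= ?",
            (amount, src, amount),
        )
        if debit.rowcount != 1:
            # In this sketch an unknown source account also lands here.
            raise TransferRejected("insufficient_funds")
        credit = conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
        )
        if credit.rowcount != 1:
            raise TransferRejected("unknown_account")  # rolls back the debit above


def balances(conn: sqlite3.Connection) -> dict:
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

The credit to a missing account is the interesting case: the debit has already run, and only the transaction keeps the money from vanishing.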
34
+
35
+ See `playbook/5_build.md` in [Appendix B](./appendix-b-prompts.md).
36
+
37
+ ## Work in small batches
38
+
39
+ Direct the AI one task at a time, and keep each task small enough that its result can be reviewed in full. This is a direct application of the principle *you cannot move faster than you can verify.* A single enormous change that turns the whole suite green at once is not a triumph — it is an unreviewable blob. Small batches keep the verification step (next chapter) tractable and keep a human genuinely in the loop.
40
+
41
+ ## The iteration loop
42
+
43
+ ```
44
+ AI writes code → pipeline runs the tests → some still fail
45
+ → AI iterates → ... → all green → hand to Verify
46
+ ```
47
+
48
+ The loop is tight and largely autonomous within a task: the AI runs the tests, sees what fails, and adjusts. Your attention is needed at the boundaries — defining the task going in, and reviewing the result coming out — not on each internal iteration.
49
+
50
+ ## The cardinal rule: never change the test to pass
51
+
52
+ An AI under pressure to make a suite green has an available shortcut: weaken or delete the failing test. This must be forbidden explicitly and caught reliably. A test changed to fit the code inverts the entire method — the code is now judging itself. If you find a test was altered during the build, reject the change outright and re-prompt with the constraint restated.
53
+
54
+ The same applies to the contract: the build implements *against* the frozen contract and may not edit it. A genuine need to change either is a change request that returns to an earlier step.
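One way a pipeline might catch a weakened test or an edited contract is to fingerprint the protected files before the build and compare them afterwards. The sketch below assumes the `tests/` and `contracts/` layout named in the build prompt; the `snapshot` and `protected_changes` helpers are hypothetical. It deliberately flags added and deleted files as well as edits, which is stricter than the rule strictly requires.

```python
import hashlib
from pathlib import Path

# Directories the build may not touch (assumed layout, matching the build prompt).
PROTECTED_DIRS = ("tests", "contracts")


def snapshot(root: Path) -> dict:
    """Map each protected file's relative path to the SHA-256 of its contents."""
    hashes = {}
    for name in PROTECTED_DIRS:
        base = root / name
        if not base.is_dir():
            continue
        for path in sorted(base.rglob("*")):
            if path.is_file():
                key = path.relative_to(root).as_posix()
                hashes[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return hashes


def protected_changes(before: dict, after: dict) -> list:
    """Paths that were edited, deleted, or added between the two snapshots."""
    return sorted(p for p in before.keys() | after.keys() if before.get(p) != after.get(p))
```

Take one snapshot when the task is handed to the AI and another when it reports green; a non-empty result means the change is rejected and the constraint restated.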
55
+
56
+ ## How much autonomy
57
+
58
+ The autonomy granted in this step should match the evidence and your review capacity (see [11 Governance](./11-governance.md)):
59
+
60
+ - Where the area is new or risky, the AI proposes and a person reviews every change.
61
+ - Where the contract and tests are solid, the AI generates freely and a person reviews each batch.
62
+ - Only in narrow, well-tested areas, with a full evidence bundle attached, may the AI integrate its own work.
63
+
64
+ ## Common mistakes
65
+
66
+ - **Batches too large to review.** They reduce verification to rubber-stamping.
67
+ - **Letting the AI add unknown dependencies.** The allow-list check in the pipeline should block this automatically; if it does not, the supply-chain risk is real (an AI may invent a plausible package name that an attacker has registered).
68
+ - **Accepting "all tests pass" without reading the change.** Passing tests are necessary, not sufficient — the next step exists for exactly this reason.
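A minimal sketch of the allow-list check, assuming (hypothetically) that both the project's `requirements.txt` and `dependencies.allowlist` list one package name per line. Names are normalized as PEP 503 specifies (lowercase, runs of `-`, `_` and `.` collapsed to `-`), so `PyYAML` and `pyyaml` count as the same package; anything left over is blocked. The helper names are illustrative.

```python
import re

# Leading distribution name of a requirement line, e.g. "requests" in "requests[socks]>=2.31".
_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def normalize(name: str) -> str:
    """PEP 503 name normalization."""
    return re.sub(r"[-_.]+", "-", name).lower()


def package_names(text: str) -> set:
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        match = _NAME.match(line)  # option lines such as "-r base.txt" do not match
        if match:
            names.add(normalize(match.group(0)))
    return names


def disallowed(requirements: str, allowlist: str) -> list:
    """Packages requested by the build that the allow-list does not approve."""
    return sorted(package_names(requirements) - package_names(allowlist))
```

A non-empty result should fail the pipeline step. A near-miss spelling such as `reqeusts-toolbelt` is exactly the kind of invented name the check exists to stop.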
69
+
70
+ ## Exit check
71
+
72
+ - [ ] All tests pass.
73
+ - [ ] Test coverage did not decrease.
74
+ - [ ] No test and no contract was modified by the AI.
75
+ - [ ] No dependency outside the allow-list was added.
76
+ - [ ] The change is small enough to review in full.
77
+
78
+ ## If the check fails
79
+
80
+ If the AI weakened a test, reject and re-prompt. If it added an out-of-allow-list package, the pipeline blocks it; have the AI find an approved alternative or raise the package for human approval. If the batch is too large to review, ask the AI to split the work and resubmit. Only once the exit check passes does the change proceed to verification.
@@ -0,0 +1,63 @@
1
+ # 08 · Step 6 — Verify
2
+
3
+ [← 07 Step 5 Build](./07-step-5-build.md) · [Contents](./README.md) · Next: [09 The loop →](./09-the-loop.md)
4
+
5
+ > **Purpose:** confirm the result is correct and safe to release.
6
+ > **Produces:** a reviewed change with a recorded outcome, ready to release.
7
+ > **Person's job:** this entire step. There is no AI role here — it is the human check.
8
+
9
+ ---
10
+
11
+ ## Where trust is actually established
12
+
13
+ The build produced passing tests. That is necessary but not sufficient. Verification is where a person establishes trust — and the principle governing it is *trust through evidence, not inspection.*
14
+
15
+ This needs care, because it is easy to misread. "Not by inspection" does not mean "do not look at the code." It means the *basis* of trust is the passing evidence plus a deliberate check of the specific things tests cannot easily catch — not a general impression that the code reads plausibly. Plausibility is exactly the trap: AI code is frequently plausible and wrong. So verification has two parts: confirm the evidence, then check the known blind spots.
16
+
17
+ ## Part one — confirm the evidence
18
+
19
+ - [ ] All tests pass.
20
+ - [ ] Coverage did not decrease.
21
+ - [ ] No test or contract was altered during the build.
22
+
23
+ If any of these is false, stop here and return to the build; there is nothing to verify yet.
24
+
25
+ ## Part two — check what tests miss
26
+
27
+ Automated tests are excellent at checking behavior on defined inputs and poor at a few specific things. Check those by hand, every time:
28
+
29
+ - **Concurrency and timing.** Is the operation correct when two of them happen at once? Tests usually run serially and miss races.
30
+ - ▶ *Example: the balance update must be one atomic transaction. Confirm that two simultaneous transfers from the same account cannot both pass the balance check and overdraw it.* This is the single most important check for this feature, and it is the reason the build prompt named atomicity explicitly.
31
+ - **Security.** Are there exposed secrets, injection openings, or unexpected dependencies? AI-generated code is known to hardcode secrets and to pull in packages by plausible-but-wrong names.
32
+ - **Architecture conformance.** Does the change respect the layering and dependency rules in `CONVENTIONS.md`? Speed with no architectural check produces a fast-growing tangle that quickly becomes hard to maintain.
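The sketch below shows what the concurrency check is looking for, using in-memory Python threads rather than a real database (an assumption made to keep it self-contained; all names are hypothetical). `naive_withdraw` separates the balance check from the update, and a `threading.Barrier` forces both requests past the check before either updates, reproducing the overdraft deterministically. `locked_withdraw` holds a per-account lock across check and update, so only one of two simultaneous 15-unit withdrawals from a balance of 20 can succeed.

```python
import threading


class Account:
    def __init__(self, balance: int) -> None:
        self.balance = balance
        self.lock = threading.Lock()


def naive_withdraw(acct: Account, amount: int, gap: threading.Barrier) -> bool:
    if acct.balance < amount:  # check ...
        return False
    gap.wait()  # ... let the other request arrive before acting ...
    acct.balance -= amount  # ... then act on a check that is now stale
    return True


def locked_withdraw(acct: Account, amount: int) -> bool:
    with acct.lock:  # check and update happen as one step
        if acct.balance < amount:
            return False
        acct.balance -= amount
        return True


def run_together(fn, n: int = 2) -> list:
    """Start n threads at the same moment and collect what each returned."""
    start = threading.Barrier(n)
    results = []

    def worker() -> None:
        start.wait()
        results.append(fn())

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Ordinary tests rarely construct this interleaving on their own, which is why the check stays on the manual list. In a database, the counterpart of the lock is a conditional update or row lock inside a single transaction.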
33
+
34
+ ## Recording the outcome
35
+
36
+ Every verification ends with exactly one recorded outcome, with an accountable owner — never a silent pass:
37
+
38
+ | Outcome | Meaning | Allowed when |
39
+ |---------|---------|--------------|
40
+ | `PASS` | all checks met | the normal path |
41
+ | `RISK-ACCEPTED` | proceed with a signed waiver: named owner, linked ticket, expiry date | a non-security gap only |
42
+ | `HARD-STOP` | cannot proceed | any failing test or any security finding |
43
+
44
+ A security finding is always a `HARD-STOP`; it is never waved through with a waiver. A `RISK-ACCEPTED` outcome is a deliberate, documented decision to ship a known, non-security limitation — not a way to skip the check.
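The rules in the table can be written down as a small decision function, sketched here in Python with hypothetical names (`Waiver`, `record_outcome`). One assumption goes beyond the table: a non-security gap without a valid waiver (missing owner or ticket, or past its expiry) is treated as a `HARD-STOP`, because nothing may proceed on an unrecorded decision.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Waiver:
    owner: str    # named, accountable person
    ticket: str   # linked ticket that tracks the gap
    expires: date


def record_outcome(
    failing_tests: int,
    security_findings: int,
    open_gaps: List[str],
    waiver: Optional[Waiver],
    today: date,
) -> str:
    # Any failing test or any security finding stops the change; no waiver applies.
    if failing_tests > 0 or security_findings > 0:
        return "HARD-STOP"
    if not open_gaps:
        return "PASS"
    # Only a non-security gap can be waived, and only with a complete, unexpired waiver.
    if waiver and waiver.owner and waiver.ticket and waiver.expires > today:
        return "RISK-ACCEPTED"
    return "HARD-STOP"
```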
45
+
46
+ ## The verification checklist
47
+
48
+ - [ ] All tests pass (the evidence).
49
+ - [ ] Concurrency/timing of the risky operation is safe.
50
+ - [ ] No exposed secrets, injection openings, or unexpected dependencies.
51
+ - [ ] Layering and dependencies follow `CONVENTIONS.md`.
52
+ - [ ] A person has reviewed and approved the change.
53
+ - [ ] An outcome is recorded (`PASS` / `RISK-ACCEPTED` / `HARD-STOP`).
54
+
55
+ ## Common mistakes
56
+
57
+ - **Shipping on plausibility.** Reading the diff, finding it reasonable, and approving — without the evidence and the blind-spot checks — is the precise failure the method exists to prevent.
58
+ - **Treating a security gap as acceptable risk.** It is a `HARD-STOP`, not a waiver.
59
+ - **Skipping the concurrency check** because the tests are green. Tests rarely exercise simultaneity; this is a manual check by design.
60
+
61
+ ## If the check fails
62
+
63
+ A failing test or a security finding returns the change to the build step ([Step 5](./07-step-5-build.md)). A non-security limitation may proceed only with a signed `RISK-ACCEPTED` record carrying a named owner, a linked ticket, and an expiry date, so the team can find and close it later. Nothing proceeds on an unrecorded decision.
@@ -0,0 +1,43 @@
1
+ # 09 · The loop — observe and learn
2
+
3
+ [← 08 Step 6 Verify](./08-step-6-verify.md) · [Contents](./README.md) · Next: [10 Setup and stages →](./10-setup-and-stages.md)
4
+
5
+ > **Purpose:** release the verified change, watch how it behaves in reality, and turn what you learn into the next specification.
6
+ > **Produces:** a running feature, observations, and the next `SPEC.md` delta.
7
+
8
+ ---
9
+
10
+ ## The flow is a loop, not a line
11
+
12
+ Older mental models end at "ship." That framing is the source of a common pathology: teams treat release as a finish line, and so they hide defects to protect the line rather than manage them in the open. In AIDD, release is not the end of the flow — it is the point where the most reliable information about the feature finally becomes available: how it behaves with real users, real data, and real load.
13
+
14
+ That information is the input to the next cycle. What you learn in production becomes the next specification, and the flow returns to [Step 1](./03-step-1-specify.md). The cycle is continuous.
15
+
16
+ ## Release deliberately
17
+
18
+ Release behind a mechanism that limits the blast radius of a mistake — a feature flag, a gradual rollout, or both. The verification step established that the feature is correct against everything you anticipated; a controlled release is your protection against what you did not anticipate. If something is wrong, you want to affect a few users and roll back, not affect everyone and scramble.
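A gradual rollout can be as simple as a deterministic percentage gate. The sketch hashes a (hypothetical) feature name together with the user id into one of 100 buckets, so each user stays consistently in or out as the percentage rises, and raising the percentage only ever adds users. It illustrates the idea; it is not the API of any particular feature-flag product.

```python
import hashlib


def rollout_bucket(feature: str, user_id: str) -> int:
    """Stable bucket in [0, 100) for this feature/user pair."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(feature: str, user_id: str, percent: int) -> bool:
    """True if the user falls inside the first `percent` buckets."""
    return rollout_bucket(feature, user_id) < percent
```

Rolling back is then a matter of setting the percentage to 0, which affects everyone at once without a deployment.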
19
+
20
+ ## Reuse the scenarios as monitors
21
+
22
+ The scenarios from [Step 2](./04-step-2-scenarios.md) have a second life here. They described the behavior you expected; in production they become the behavior you monitor. The same definition of "correct" that drove the tests now drives the alerts.
23
+
24
+ **What to watch (▶ example):**
25
+
26
+ - the overall transfer error rate;
27
+ - the rate of each individual rejection (`amount_invalid`, `same_account`, `insufficient_funds`, `forbidden`) — a sudden spike in one is a signal, not noise;
28
+ - latency, especially of the atomic balance update under load.
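Turning the rejection scenarios into monitors might look like the sketch below: compute the share of transfers ending in each rejection reason and flag any reason whose current rate jumps well above its baseline. The outcome strings reuse the rejection codes listed above; the `ok` outcome, the 3x `factor` and the 1% `floor` are hypothetical thresholds to tune, not recommendations.

```python
from collections import Counter

REJECTIONS = ("amount_invalid", "same_account", "insufficient_funds", "forbidden")


def rejection_rates(outcomes: list) -> dict:
    """Share of transfers that ended in each rejection reason."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {r: (counts[r] / total if total else 0.0) for r in REJECTIONS}


def spikes(baseline: dict, current: dict, factor: float = 3.0, floor: float = 0.01) -> list:
    """Reasons whose rate exceeds both `factor` times its baseline and an absolute floor."""
    return sorted(r for r in REJECTIONS if current[r] > max(baseline[r] * factor, floor))
```

The absolute floor keeps a reason that normally never fires (here `forbidden`) from alerting on a single occurrence.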
29
+
30
+ ## Turn observation into the next spec
31
+
32
+ Every defect, surprise, or new need is written up as a change to the specification — a delta that re-enters the flow at [Step 1](./03-step-1-specify.md). An error rate that is too high, a rejection that fires more than expected, a user behavior nobody designed for: each becomes a concrete, specified next step rather than a vague intention.
33
+
34
+ This is also where the AI returns to a useful role: summarizing telemetry, clustering errors into themes, and drafting the proposed spec delta for a person to review. But the production decisions — what to roll back, what to prioritize — remain human.
35
+
36
+ ## Re-entrancy: the loop is the whole point
37
+
38
+ Two principles converge here. *The flow is re-entrant* — any step can send you back to an earlier one — and *the flow is a loop* — production feeds the next specification. Together they mean the artifacts you built are never "finished"; they are living documents that the next cycle refines.
39
+
40
+ A team operating this way does not see changing requirements as a failure of planning. It sees them as the system working: reality is teaching the specification, and the specification is teaching the next build.
41
+
42
+ > **Do:** release small, watch the scenarios, and feed every learning back into the spec.
43
+ > **Don't:** treat shipping as the end. The most valuable information about a feature arrives *after* it ships.