@pilotspace/add 1.1.0 → 1.2.0
- package/CHANGELOG.md +40 -0
- package/GETTING-STARTED.md +165 -139
- package/README.md +13 -7
- package/bin/cli.js +13 -4
- package/docs/01-principles.md +3 -3
- package/docs/02-the-flow.md +15 -11
- package/docs/03-step-1-specify.md +13 -13
- package/docs/04-step-2-scenarios.md +2 -2
- package/docs/05-step-3-contract.md +3 -3
- package/docs/06-step-4-tests.md +2 -2
- package/docs/07-step-5-build.md +1 -1
- package/docs/08-step-6-verify.md +14 -5
- package/docs/09-the-loop.md +12 -6
- package/docs/10-setup-and-stages.md +27 -13
- package/docs/11-governance.md +2 -2
- package/docs/12-roles.md +3 -3
- package/docs/13-adoption.md +1 -1
- package/docs/14-foundation.md +15 -15
- package/docs/15-foundations-and-lineage.md +106 -0
- package/docs/README.md +4 -0
- package/docs/appendix-a-templates.md +3 -3
- package/docs/appendix-b-prompts.md +40 -5
- package/docs/appendix-c-glossary.md +42 -12
- package/docs/appendix-d-worked-example.md +2 -2
- package/docs/appendix-e-checklists.md +2 -2
- package/docs/appendix-f-requirements-matrix.md +8 -8
- package/docs/appendix-g-references.md +106 -0
- package/package.json +1 -1
- package/skill/add/SKILL.md +39 -37
- package/skill/add/adopt.md +13 -11
- package/skill/add/deltas.md +8 -6
- package/skill/add/fold.md +19 -17
- package/skill/add/graduate.md +74 -0
- package/skill/add/intake.md +22 -7
- package/skill/add/loop.md +59 -0
- package/skill/add/phases/0-setup.md +29 -24
- package/skill/add/phases/1-specify.md +23 -13
- package/skill/add/phases/2-scenarios.md +14 -4
- package/skill/add/phases/3-contract.md +24 -11
- package/skill/add/phases/4-tests.md +15 -5
- package/skill/add/phases/5-build.md +11 -4
- package/skill/add/phases/6-verify.md +24 -2
- package/skill/add/phases/7-observe.md +13 -5
- package/skill/add/report-template.md +65 -7
- package/skill/add/run.md +45 -34
- package/skill/add/scope.md +10 -6
- package/skill/add/setup-review.md +13 -10
- package/skill/add/streams.md +69 -19
- package/tooling/add.py +476 -34
- package/tooling/templates/CONVENTIONS.md.tmpl +1 -1
- package/tooling/templates/GLOSSARY.md.tmpl +23 -0
- package/tooling/templates/MILESTONE.md.tmpl +1 -0
- package/tooling/templates/PROJECT.md.tmpl +4 -3
- package/tooling/templates/TASK.md.tmpl +33 -12
|
@@ -0,0 +1,106 @@
|
|
|
1
|
+
# 15 · Foundations & Lineage
|
|
2
|
+
|
|
3
|
+
[← 14 The foundation](./14-foundation.md) · [Contents](./README.md) · Next: [Appendix A Templates →](./appendix-a-templates.md)
|
|
4
|
+
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
ADD did not appear from nowhere. It sits where four currents meet: the **recursive
|
|
8
|
+
self-improvement** thesis (AI that helps build the next AI), a decade of **autonomous and
|
|
9
|
+
agentic** research, the **spec-driven development** movement (the specification, not the
|
|
10
|
+
code, is the source of truth), and the **tests-first** discipline that constrains a
|
|
11
|
+
generate→check→refine loop with executable tests — turning fluent model output into
|
|
12
|
+
trustworthy software. This chapter tells that story; [Appendix G](./appendix-g-references.md)
|
|
13
|
+
is the verified source list it cites. Every `[Author Year]` here resolves to an entry
|
|
14
|
+
there.
|
|
15
|
+
|
|
16
|
+
## The frame — "closing the loop"
|
|
17
|
+
|
|
18
|
+
Anthropic's recursive-self-improvement picture runs from autonomous agents delegating to
|
|
19
|
+
workers *today* toward a future where Claude improves Claude — *closing the loop* on the
|
|
20
|
+
work of building AI itself [Favaro & Clark 2026]. That is the backdrop ADD is built for, and
|
|
21
|
+
its position inside that picture is deliberately narrow: ADD is a **human-gated,
|
|
22
|
+
evidence-trusted** instance of recursive self-improvement. The AI drives the whole inner
|
|
23
|
+
cycle — specify → build → verify → observe — but a human owns the frozen contract and the
|
|
24
|
+
verify gate, and trust comes from passing tests and re-resolved evidence, never from a
|
|
25
|
+
diff that merely reads plausibly. The argument is not that the loop should stay open
|
|
26
|
+
forever; it is that the loop should be *bounded by human direction* rather than left to run
|
|
27
|
+
unattended [Amodei 2024]. ADD is one concrete shape for that bound.
|
|
28
|
+
|
|
29
|
+
## The four currents
|
|
30
|
+
|
|
31
|
+
**Recursive self-improvement.** The mathematical anchor is the Gödel machine — a
|
|
32
|
+
self-modifying agent that rewrites itself *only when it can prove the rewrite helps*
|
|
33
|
+
[Schmidhuber 2003]. ADD enforces the same discipline socially rather than formally: the
|
|
34
|
+
never-weaken-a-test rule is "only change on proof" expressed as a gate. The algorithmic kin
|
|
35
|
+
arrived later — a scaffolding program that improves the code that improves code
|
|
36
|
+
[Zelikman et al. 2023], a generate→critique→refine micro-loop [Madaan et al. 2023], agents
|
|
37
|
+
that keep verbal reflections and retry [Shinn et al. 2023], an agent that grows a reusable
|
|
38
|
+
skill library over time [Wang et al. 2023], and an evolutionary coder that beat a
|
|
39
|
+
long-standing matrix-multiplication record under continuous checking
|
|
40
|
+
[Novikov et al. 2025]. And where a self-rewarding loop has the model judge its own reward
|
|
41
|
+
[Yuan et al. 2024], ADD diverges by design — it makes the tests and a human the reward
|
|
42
|
+
signal, not the model's own opinion.
|
|
43
|
+
|
|
44
|
+
**Autonomous and agentic workflows.** The architecture vocabulary comes from the canonical
|
|
45
|
+
taxonomy of prompt-chaining, routing, orchestrator-workers, and the evaluator-optimizer loop
|
|
46
|
+
[Schluntz & Zhang 2024] — where evaluator-optimizer *is* build→verify→refine and
|
|
47
|
+
orchestrator-workers is ADD's wave parallelism. Underneath it sit the base agent loop of
|
|
48
|
+
interleaved think→act→observe [Yao et al. 2022], the self-supervised tool use that lets an
|
|
49
|
+
agent run its own tests and builds [Schick et al. 2023], and the designed agent–computer
|
|
50
|
+
interface that materially lifts autonomous issue resolution [Yang et al. 2024] — the role
|
|
51
|
+
ADD's `add.py` engine plays for the method. The production reports close the gap from theory
|
|
52
|
+
to practice: checkpoints, subagents, and rollback for autonomous work [Anthropic 2025a], and
|
|
53
|
+
a lead orchestrating subagents under an LLM judge [Anthropic 2025b].
|
|
54
|
+
|
|
55
|
+
**Spec-driven development.** ADD's closest siblings are explicit specification systems.
|
|
56
|
+
GitHub's **spec-kit** runs `constitution` → `specify` → `plan` → `tasks` → `implement` with
|
|
57
|
+
the spec as the executable source of truth [GitHub 2025]; its launch framed task
|
|
58
|
+
decomposition as "TDD for your AI agent" [Delimarsky 2025], and its rationale named the
|
|
59
|
+
failure spec-driven work exists to solve — context degrading over a long session
|
|
60
|
+
[Vesely 2025]. The academic vocabulary followed, with a taxonomy of Spec-First,
|
|
61
|
+
Spec-Anchored, and Spec-as-Source rigor [Piskala 2026], and the pattern is converging across
|
|
62
|
+
vendors [InfoQ 2025]. Nearest of all is **GSD** — a spec-driven, context-engineering system
|
|
63
|
+
for the same Claude-Code niche [GSD 2025].
|
|
64
|
+
|
|
65
|
+
**Tests-first and verification.** The empirical backbone is direct: supplying tests
|
|
66
|
+
alongside the prompt measurably lifts pass rates [Mathews & Nagappan 2024], and the field's
|
|
67
|
+
yardstick judges a fix solely by whether the project's own tests pass [Jimenez et al. 2023].
|
|
68
|
+
"Done" means the tests pass — which is exactly how ADD gates a feature. The safety framing
|
|
69
|
+
completes the current: human control and transparency made concrete [Anthropic 2025c], under
|
|
70
|
+
a governance ceiling that grows *more* binding, not less, as the loop gets more capable
|
|
71
|
+
[Anthropic 2026b].
|
|
72
|
+
|
|
73
|
+
## Where ADD diverges
|
|
74
|
+
|
|
75
|
+
The shared lineage is real, but ADD is not a re-skin of its siblings. spec-kit stops at
|
|
76
|
+
`implement`; GSD ends at verify. ADD closes the loop past both by adding three things
|
|
77
|
+
neither spec-kit [GitHub 2025] nor GSD [GSD 2025] carries as a first-class gate:
|
|
78
|
+
|
|
79
|
+
- a **failing-tests-first gate** — no build starts until the tests are red for the right
|
|
80
|
+
reason, so the contract is proven executable before any code exists;
|
|
81
|
+
- an **observe → `fold`** step — confirmed lessons learned consolidate back into a versioned
|
|
82
|
+
foundation, so the method improves itself across loops (retrospective consolidation is the
|
|
83
|
+
recursive-self-improvement current turned inward on ADD);
|
|
84
|
+
- a **dynamic goal-loop** — the engine holds a milestone open and reopens tasks until its
|
|
85
|
+
exit criteria are met, rather than declaring done when a checklist empties.
|
|
86
|
+
|
|
87
|
+
ADD also deliberately targets **less doc-time than GSD** — a lean foundation and one human
|
|
88
|
+
approval per task instead of a document per phase. The tests-first gate, the `fold`, and the
|
|
89
|
+
goal-loop are ADD's contribution; everything beneath them is inherited.
|
|
90
|
+
|
|
91
|
+
## The evidence chain — the loop already runs
|
|
92
|
+
|
|
93
|
+
The case that this is not speculative rests on three measured facts. First, the task
|
|
94
|
+
time-horizon: the length of work models complete unaided keeps doubling [Favaro & Clark 2026].
|
|
95
|
+
Second, the authorship share: by 2026 more than 80% of the code merged at Anthropic was
|
|
96
|
+
Claude-authored [Favaro & Clark 2026]. Third, the **Automated Alignment Researchers** result:
|
|
97
|
+
nine parallel Claude agents recovered roughly 97% of the human-expert gap on an alignment task
|
|
98
|
+
in five days against the human team's seven [Anthropic 2026a] — parallel agents working under
|
|
99
|
+
review, which is precisely ADD's wave-plus-verify shape. The loop already runs.
|
|
100
|
+
|
|
101
|
+
What it does *not* yet supply is the discipline to trust the output. That is ADD's
|
|
102
|
+
contribution: the frozen contract, the never-weaken-a-test rule, the evidence-over-inspection
|
|
103
|
+
gate, and the security HARD-STOP that no autonomy level may auto-pass [Anthropic 2025c],
|
|
104
|
+
held beneath the responsible-scaling governance ceiling [Anthropic 2026b]. As the loop grows
|
|
105
|
+
more capable, those gates and the human-owned verify matter more, not less. ADD is the human-gated, evidence-trusted way to stand inside the
|
|
106
|
+
closing loop and still own the result.
|
package/docs/README.md
CHANGED
|
@@ -51,6 +51,9 @@ For every feature, before AI writes any code, you write four short artifacts in
|
|
|
51
51
|
- [13 · Adoption and onboarding](./13-adoption.md)
|
|
52
52
|
- [14 · The foundation: project context across milestones](./14-foundation.md)
|
|
53
53
|
|
|
54
|
+
**Lineage**
|
|
55
|
+
- [15 · Foundations & Lineage](./15-foundations-and-lineage.md)
|
|
56
|
+
|
|
54
57
|
**Part IV — Reference**
|
|
55
58
|
- [Appendix A · Templates](./appendix-a-templates.md)
|
|
56
59
|
- [Appendix B · Prompt library](./appendix-b-prompts.md)
|
|
@@ -58,6 +61,7 @@ For every feature, before AI writes any code, you write four short artifacts in
|
|
|
58
61
|
- [Appendix D · The worked example, end to end](./appendix-d-worked-example.md)
|
|
59
62
|
- [Appendix E · Checklists](./appendix-e-checklists.md)
|
|
60
63
|
- [Appendix F · Document requirements matrix (Project → Milestone → Task)](./appendix-f-requirements-matrix.md)
|
|
64
|
+
- [Appendix G · References & lineage](./appendix-g-references.md)
|
|
61
65
|
|
|
62
66
|
---
|
|
63
67
|
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
# Appendix A · Templates
|
|
2
2
|
|
|
3
|
-
[←
|
|
3
|
+
[← 15 Foundations & Lineage](./15-foundations-and-lineage.md) · [Contents](./README.md) · Next: [Appendix B Prompts →](./appendix-b-prompts.md)
|
|
4
4
|
|
|
5
5
|
Copy-paste blanks. Project-level templates are filled once at setup; feature-level templates are filled once per feature.
|
|
6
6
|
|
|
@@ -46,8 +46,8 @@ Reject:
|
|
|
46
46
|
- <bad input / situation> -> "<error_code>"
|
|
47
47
|
After:
|
|
48
48
|
- <state true once it succeeds>
|
|
49
|
-
Assumptions —
|
|
50
|
-
⚠ <most-likely-wrong assumption> —
|
|
49
|
+
Assumptions — lowest-confidence first:
|
|
50
|
+
⚠ <most-likely-wrong assumption> — lowest confidence because <why>; if wrong: <cost>
|
|
51
51
|
- [x] <confirmed / low-stakes assumption> — <one line>
|
|
52
52
|
```
|
|
53
53
|
|
|
@@ -7,6 +7,9 @@ The contents of the `playbook/` folder. Each prompt is plain text that names the
|
|
|
7
7
|
---
|
|
8
8
|
|
|
9
9
|
### `playbook/1_specify.md`
|
|
10
|
+
|
|
11
|
+
<prompt>
|
|
12
|
+
|
|
10
13
|
```
|
|
11
14
|
Role: a domain analyst who brainstorms, then asks rather than assumes.
|
|
12
15
|
Read first: ./PRD/* , ./GLOSSARY.md , ./inputs/ (tickets, interviews, contracts)
|
|
@@ -19,15 +22,20 @@ Steps:
|
|
|
19
22
|
giving each refusal a named error code.
|
|
20
23
|
# why: named errors become scenarios and contract responses; "handle bad input" does not.
|
|
21
24
|
2. State the success state-change (After).
|
|
22
|
-
3. List the assumptions you had to make, RANKED
|
|
23
|
-
|
|
24
|
-
# why: a flat all-equal list gets
|
|
25
|
-
Exit: a domain owner disputes none of it; assumptions ranked
|
|
25
|
+
3. List the assumptions you had to make, RANKED lowest-confidence first; flag the 1–2 where
|
|
26
|
+
your confidence is lowest as `⚠ <assumption> — lowest confidence because <why>; if wrong: <cost>`.
|
|
27
|
+
# why: a flat all-equal list gets approved without reading; a ranked one aims my attention at the risk.
|
|
28
|
+
Exit: a domain owner disputes none of it; assumptions ranked lowest-confidence first, the 1–2 ⚠ flags
|
|
26
29
|
carrying why + cost — or an honest "none material" that still names the single biggest risk.
|
|
27
|
-
Never: resolve an ambiguity by guessing — ask. Never a blank "none" or a flat
|
|
30
|
+
Never: resolve an ambiguity by guessing — ask. Never a blank "none" or a flat list of equal ticks.
|
|
28
31
|
```
|
|
29
32
|
|
|
33
|
+
</prompt>
|
|
34
|
+
|
|
30
35
|
### `playbook/2_scenarios.md`
|
|
36
|
+
|
|
37
|
+
<prompt>
|
|
38
|
+
|
|
31
39
|
```
|
|
32
40
|
Role: a specification tester.
|
|
33
41
|
Read first: ./SPEC.md , ./GLOSSARY.md
|
|
@@ -41,7 +49,12 @@ Exit: every rule has at least one scenario with an observable result.
|
|
|
41
49
|
Never: write a vague result ("then it works").
|
|
42
50
|
```
|
|
43
51
|
|
|
52
|
+
</prompt>
|
|
53
|
+
|
|
44
54
|
### `playbook/3_contract.md`
|
|
55
|
+
|
|
56
|
+
<prompt>
|
|
57
|
+
|
|
45
58
|
```
|
|
46
59
|
Role: an interface/contract architect; contracts are immutable once frozen.
|
|
47
60
|
Read first: ./SPEC.md , ./features/*.feature , ./GLOSSARY.md
|
|
@@ -57,7 +70,12 @@ Exit: contract tests pass against the mock; every spec rejection has a response.
|
|
|
57
70
|
Never: change a frozen contract — a change is a request that reopens Specify.
|
|
58
71
|
```
|
|
59
72
|
|
|
73
|
+
</prompt>
|
|
74
|
+
|
|
60
75
|
### `playbook/4_tests.md`
|
|
76
|
+
|
|
77
|
+
<prompt>
|
|
78
|
+
|
|
61
79
|
```
|
|
62
80
|
Role: a test author who writes tests before code.
|
|
63
81
|
Read first: ./features/*.feature , ./contracts/*
|
|
@@ -73,7 +91,12 @@ Exit: one test per scenario; suite red for the right reason; target recorded.
|
|
|
73
91
|
Never: assert on internals; write the implementation here.
|
|
74
92
|
```
|
|
75
93
|
|
|
94
|
+
</prompt>
|
|
95
|
+
|
|
76
96
|
### `playbook/5_build.md`
|
|
97
|
+
|
|
98
|
+
<prompt>
|
|
99
|
+
|
|
77
100
|
```
|
|
78
101
|
Role: an execution agent. The human commands; you implement and report.
|
|
79
102
|
Read first: ./SPEC.md , ./contracts/* , ./tests/* , ./CONVENTIONS.md
|
|
@@ -90,7 +113,12 @@ Never: change a test or the contract; add an unlisted dependency; exceed the tas
|
|
|
90
113
|
without escalating; guess when unclear — ask.
|
|
91
114
|
```
|
|
92
115
|
|
|
116
|
+
</prompt>
|
|
117
|
+
|
|
93
118
|
### `playbook/6_observe.md`
|
|
119
|
+
|
|
120
|
+
<prompt>
|
|
121
|
+
|
|
94
122
|
```
|
|
95
123
|
Role: a reliability analyst feeding the next cycle.
|
|
96
124
|
Read first: telemetry exports , service-objective definitions , incident tickets
|
|
@@ -104,9 +132,14 @@ Exit: a reviewed SPEC delta linked into the backlog.
|
|
|
104
132
|
Never: auto-roll back — recommend; a human owns the production decision.
|
|
105
133
|
```
|
|
106
134
|
|
|
135
|
+
</prompt>
|
|
136
|
+
|
|
107
137
|
---
|
|
108
138
|
|
|
109
139
|
### Master prompt skeleton
|
|
140
|
+
|
|
141
|
+
<prompt>
|
|
142
|
+
|
|
110
143
|
```
|
|
111
144
|
Role: <one line — who the agent is for this step>
|
|
112
145
|
Read first: <explicit repository paths — never chat memory>
|
|
@@ -117,3 +150,5 @@ Exit: <conditions a person or the pipeline can check>
|
|
|
117
150
|
Never: <what the agent must not do>
|
|
118
151
|
Evidence: <artifacts to attach for review>
|
|
119
152
|
```
|
|
153
|
+
|
|
154
|
+
</prompt>
|
|
@@ -10,31 +10,39 @@
|
|
|
10
10
|
|
|
11
11
|
**Artifact** — a durable work product: the spec, the scenarios, the contract, the tests. The artifacts survive; the code is disposable.
|
|
12
12
|
|
|
13
|
-
**
|
|
13
|
+
**Lesson learned** (formerly "competency delta") — a single learning a loop produces, tagged by which of the five competencies (`DDD · SDD · UDD · TDD · ADD`) it improves, written in a task's OBSERVE phase as `- [<COMPETENCY> · <status>] <learning> (evidence: …)`. Emitted `open` by the AI; the human folds it into a versioned `PROJECT.md` (`folded`) or declines it (`rejected`). The mechanism by which the foundation self-improves instead of drifting. See the `add` skill's `deltas.md`.
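A minimal sketch of how a lesson-learned line in the format above could be parsed. The competency tags and the `open` / `folded` / `rejected` statuses come from the glossary entry; the regular expression, the `Lesson` dataclass, and `parse_lesson` are hypothetical helpers for illustration and are not taken from `add.py`.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Values named in the glossary entry.
COMPETENCIES = {"DDD", "SDD", "UDD", "TDD", "ADD"}
STATUSES = {"open", "folded", "rejected"}

# Hypothetical pattern for: - [<COMPETENCY> · <status>] <learning> (evidence: …)
LESSON_LINE = re.compile(
    r"^- \[(?P<competency>\w+) · (?P<status>\w+)\] "
    r"(?P<learning>.+?) \(evidence: (?P<evidence>.+)\)$"
)


@dataclass
class Lesson:
    competency: str
    status: str
    learning: str
    evidence: str


def parse_lesson(line: str) -> Optional[Lesson]:
    """Return a Lesson for a well-formed line, or None if it does not match."""
    match = LESSON_LINE.match(line.strip())
    if match is None:
        return None
    if match["competency"] not in COMPETENCIES or match["status"] not in STATUSES:
        return None
    return Lesson(
        match["competency"], match["status"], match["learning"], match["evidence"]
    )
```

Returning `None` for an unknown competency or status, rather than guessing, keeps a malformed line visible to the person doing the consolidation.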
|
|
14
14
|
|
|
15
15
|
**Contract** — the fixed external shape of a feature: interfaces, data structures, names, and error cases. Frozen before the build, it is the surface the AI builds against.
|
|
16
16
|
|
|
17
|
-
**Co-specification** — how a spec is made in ADD: the AI and the human **brainstorm the shape together** (diverge), the AI **drafts** it, and the human **validates with the AI's advice** (validate). The AI's decisive advice is the *
|
|
17
|
+
**Co-specification** — how a spec is made in ADD: the AI and the human **brainstorm the shape together** (diverge), the AI **drafts** it, and the human **validates with the AI's advice** (validate). The AI's decisive advice is the *lowest-confidence flag*. It replaces dictation-by-one-side — the human owns the decision, the AI owns surfacing what it does not yet know. See [03 Specify](./03-step-1-specify.md).
|
|
18
18
|
|
|
19
19
|
**Disposable code** — the view that code is one regenerable implementation of the artifacts, not a durable asset to be preserved.
|
|
20
20
|
|
|
21
21
|
**Evidence bundle** — the proof attached to a change (passing tests, clean security scan, no coverage loss) that justifies trusting it and may unlock more AI autonomy.
|
|
22
22
|
|
|
23
|
-
**Foundation version** — a monotonic integer marker in `PROJECT.md` that advances by one each time confirmed
|
|
23
|
+
**Foundation version** — a monotonic integer marker in `PROJECT.md` that advances by one each time confirmed lessons learned are consolidated into the foundation. It makes the living documentation's evolution auditable: a rising version with fewer new deltas per milestone is the signal that a competency is converging rather than drifting. Bumped only by the retrospective consolidation (see the `add` skill's `fold.md`).
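As a rough illustration of "advances by one", the sketch below bumps an integer marker inside the text of a `PROJECT.md`. The glossary names the marker `foundation-version:`; the exact line layout assumed here (`foundation-version: <integer>` on its own line) and the function `bump_foundation_version` are assumptions for illustration, not the behavior of `add.py`.

```python
import re

# Assumed layout: a line of the form "foundation-version: 3".
# [ \t]* is used instead of \s* so the match never swallows a newline.
MARKER = re.compile(r"^(foundation-version:[ \t]*)(\d+)[ \t]*$", re.MULTILINE)


def bump_foundation_version(project_md: str) -> str:
    """Return the document text with the version marker increased by exactly one."""
    match = MARKER.search(project_md)
    if match is None:
        raise ValueError("no foundation-version marker found")
    new_version = int(match.group(2)) + 1
    return (
        project_md[: match.start()]
        + f"{match.group(1)}{new_version}"
        + project_md[match.end():]
    )
```

Because the only operation is "plus one", the marker stays monotonic as long as nothing else writes to it, which matches the entry's statement that only the retrospective consolidation bumps it.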
|
|
24
24
|
|
|
25
25
|
**Gate** — a checkpoint with an explicit pass/fail exit. Its outcome is `PASS`, `RISK-ACCEPTED`, or `HARD-STOP`.
|
|
26
26
|
|
|
27
27
|
**`HARD-STOP`** — a gate outcome meaning work cannot proceed; triggered by any failing test or security finding.
|
|
28
28
|
|
|
29
|
-
**Intake** — the step *before* a task: sizing a raw request into versioned scope by classifying it into one **request bucket**. The AI proposes `{bucket, rationale, command}`; the human confirms. Lives in the `add` skill's `intake.md` (the intake
|
|
29
|
+
**Intake** — the step *before* a task: sizing a raw request into versioned scope by classifying it into one **request bucket**. The AI proposes `{bucket, rationale, command}`; the human confirms. Lives in the `add` skill's `intake.md` (the intake level, above the per-task flow).
|
|
30
30
|
|
|
31
|
-
**
|
|
31
|
+
**Lowest-confidence flag** (formerly "least-sure flag") — the AI's ranked declaration of the **1–2 things most likely to be wrong** in what it is asking a human to approve, each carrying *why* it is uncertain and *what it costs if wrong* (`⚠ [spec|scenario|contract|test] … — because …; if wrong: …`). It reshapes the old flat assumptions list into a ranked one, so a single approval aims the reviewer's attention at the real risk instead of a flat list of equal-looking ticks. Bundle-wide at the contract-freeze decision point; the §1 assumptions are its first input. If nothing is materially uncertain it still names the single biggest risk — never a blank "none". It makes a genuine review cheap and a lazy one visibly negligent, but cannot *force* the read. The "AI advises" half of **co-specification**.
|
|
32
32
|
|
|
33
33
|
**Living document** — an artifact expected to change as the loop learns; never frozen forever (the one exception being a versioned contract, which changes only via a change request).
|
|
34
34
|
|
|
35
|
-
**
|
|
35
|
+
**Onboarding** (formerly "on-ramp") — the path a new user walks from install to their first milestone: install → `/add` → describe the goal → the agent runs intake (sizing the request into a milestone the human confirms) → the specification bundle → the self-driving run. The AI-first entry to the method; the human talks to the agent rather than hand-typing `add.py`.
|
|
36
36
|
|
|
37
|
-
**
|
|
37
|
+
**Decision point** (formerly "seam") — a place where the flow stops for human judgment: the contract-freeze approval (the one approval), an escalated verify gate, intake confirmation, milestone close. The machine layer keeps the legacy name: the `--json` owner enum `seam`, the decide-digest key `seam`, and the `seam-audit` CI job.
|
|
38
|
+
|
|
39
|
+
**The decision arc** — the three engine-sourced lines a gate report opens with at every **decision point**: `goal:` the milestone goal the work serves · `done:` the achievement, the proven progress toward it (the gate reports render this line as `done`) · `plan:` what comes next. What `done` reports adapts per gate (verify: tests + evidence · milestone close: exit-criteria met · intake: the request sized) while the three-part shape stays constant. Rendered first, above the report's summary, so the human confirms with sight of the whole trajectory, not a local snapshot. Engine-sourced like all evidence — goal · done · plan are pulled from `add.py` output, never re-typed. Presentation only: it never adds a gate or changes a `PASS` / `RISK-ACCEPTED` / `HARD-STOP` / freeze outcome. The report it opens is the chat report a person reads at a decision point — distinct from the three Test/Quality/Risk reports a verify gate produces ([11 Governance](./11-governance.md)). See the `add` skill's `report-template.md`.
|
|
40
|
+
|
|
41
|
+
**Specification bundle** (formerly "the one-approval front") — §1–§4 of a task (spec · scenarios · contract · failing tests) drafted by the AI as one piece and approved by a person **once**, at the contract freeze. Rejecting any part returns the whole bundle to draft. The single approval it carries is the bundle approval.
|
|
42
|
+
|
|
43
|
+
**Retrospective consolidation** (formerly "the fold / fold ritual") — the milestone-close (or on-demand) step where a person gathers `open` lessons learned, confirms each, and the AI writes them append-only into the versioned foundation, bumping `foundation-version:`. The AI never self-approves a consolidation. The machine names keep their names: `fold.md`, the `folded` delta status, and `add.py deltas`.
|
|
44
|
+
|
|
45
|
+
**Owner (of a phase)** — who drives a phase, exposed by `add.py … --json` as `human`, `seam`, or `ai` (machine enum values that keep their names; in prose the `seam` value's concept is now the decision point, formerly "seam"). It tells an autonomous harness where it may run (`ai`) and where it must checkpoint to a person (`human`/`seam`), following the who-does-what table (Verify is always `human`).
|
|
38
46
|
|
|
39
47
|
**Profile** — the intensity at which the method is run: Express, Standard, or Regulated.
|
|
40
48
|
|
|
@@ -48,17 +56,39 @@
|
|
|
48
56
|
|
|
49
57
|
**Spec (`SPEC.md`)** — the plain-language statement of what a feature must do, must reject, and assumes.
|
|
50
58
|
|
|
51
|
-
**
|
|
59
|
+
**Cross-cutting concern** (formerly "spine / continuous concern") — a concern that runs through every step rather than being one step: security, testing, observability, cost.
|
|
52
60
|
|
|
53
61
|
**Stage** — one pass through the flow at a chosen depth: Prototype, Proof of Concept, MVP, or Production-Ready.
|
|
54
62
|
|
|
55
|
-
**
|
|
63
|
+
**Stage graduation** — the orchestration loop that proposes the move to the next **stage** as a human-confirmed roadmap, never a bare flip; the 4th scope level after setup · intake · milestone-loop. The cue is every milestone `done` with the **stage-goal-criteria** all `[x]`; the flow is gather **graduation analytics** → interview *what production means here* → draft ≥1 production milestone → human confirms → `add.py stage production` as the final step. The →production flip is guarded: it refuses with `stage_no_roadmap` (a tally, not a readiness judgment) until ≥1 production milestone exists; `--force` overrides. Lives in the `add` skill's `graduate.md`.
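The guard on the production flip can be pictured as a simple tally check. This is a sketch under stated assumptions: the error code `stage_no_roadmap` and the `--force` override come from the entry above, while `StageRefused`, `flip_to_production`, and the idea of passing the milestone count as an integer are hypothetical and do not describe the internals of `add.py`.

```python
class StageRefused(Exception):
    """Raised when the stage flip is refused; carries a named error code."""

    def __init__(self, code: str) -> None:
        super().__init__(code)
        self.code = code


def flip_to_production(production_milestones: int, force: bool = False) -> str:
    """Allow the flip only if at least one production milestone exists.

    This is a tally, not a readiness judgment: it counts milestones and
    never inspects their quality. `force` models the --force override.
    """
    if production_milestones < 1 and not force:
        raise StageRefused("stage_no_roadmap")
    return "production"
```

Keeping the check to a count mirrors the entry's point that the engine gathers and tallies while the human decides whether the project is actually ready.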
|
|
64
|
+
|
|
65
|
+
**Graduation analytics** — the five record-sets `add.py graduation-report` clusters from the whole MVP loop for the graduation interview: open deltas by competency · open RISK-ACCEPTED waivers by expiry · RETRO records · verify residue · observe-loop coverage gaps. It gathers, never judges — there is no readiness verdict, only the records the human reasons from (gather-not-judge).
|
|
66
|
+
|
|
67
|
+
**Stage-goal-criteria** — the human-authored `[x]` checklist in `PROJECT.md` that defines "MVP covered" for this project; when every milestone is `done` and these are all checked, `add.py status` prints the graduation cue. Authored by the human (judgment), never inferred by the engine.
|
|
68
|
+
|
|
69
|
+
**Baseline approval** (formerly "the lock-down") — the single human gate ending autonomous setup: an explicit yes that freezes the foundation, first scope, and first contract together; runs as `add.py lock --by <name>`.
|
|
70
|
+
|
|
71
|
+
**Scope level** (formerly "altitude") — the granularity a decision lives at: intake level (request → versioned scope) · milestone level · setup/foundation level · task level. (A cross-stage decision lives one level out, at the **stage-graduation** loop — which `graduate.md` also numbers as a scope level; see **Stage graduation**.) One ⚠-assumption notation is shared across every scope level.
|
|
72
|
+
|
|
73
|
+
**Autonomy level** (formerly "autonomy dial") — the per-task setting (`autonomy: auto | conservative`) choosing who resolves Verify; high-risk scope refuses an unguarded `auto`.
|
|
74
|
+
|
|
75
|
+
**Automated quality gate** (formerly "evidence auto-gate") — the Verify resolver under `autonomy: auto`: a run may auto-PASS on complete evidence, recorded as *auto-resolved*; a security finding always escalates (`HARD-STOP`).
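One way to read the resolver rule is as a small decision function. The outcomes `PASS` and `HARD-STOP`, the `auto` / `conservative` settings, and the rule that a failing test or security finding always stops come from this glossary; the function `resolve_verify`, its parameters, and the `ESCALATE` return value (standing in for "hand the gate to a human") are hypothetical names chosen for this sketch.

```python
def resolve_verify(
    autonomy: str,
    failing_tests: int,
    security_findings: int,
    evidence_complete: bool,
) -> str:
    """Sketch of the Verify resolver; not the actual add.py logic."""
    # Any failing test or security finding is a HARD-STOP at every autonomy level.
    if failing_tests > 0 or security_findings > 0:
        return "HARD-STOP"
    # Only autonomy "auto" may pass on its own, and only on complete evidence.
    if autonomy == "auto" and evidence_complete:
        return "PASS"  # would be recorded as auto-resolved
    # Everything else goes to a person (hypothetical label).
    return "ESCALATE"
```

The ordering matters: the stop conditions are checked before the autonomy setting, so no configuration can route around them.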
|
|
76
|
+
|
|
77
|
+
**Change scope** (formerly "touch-boundary") — the hard boundary of a locked run: what it may edit (code, tests-to-green, evidence) and must not (the frozen contract, locked scope, any test weakening). The `<touch_boundary>` XML prompt tag keeps its name.
|
|
78
|
+
|
|
79
|
+
**Non-functional review** (formerly "blind-spot checks") — the deliberate verify-time check of the risks tests rarely catch: concurrency, security, architecture. Security findings always escalate.
|
|
80
|
+
|
|
81
|
+
**Failing-first suite** (formerly "red safety net") — the per-feature test suite written before any code and confirmed red for the right reason (a missing implementation, not a broken test); the TDD red phase at ADD step 4.
|
|
82
|
+
|
|
83
|
+
**Method rationale** (formerly "trust layer") — the *why* behind every rule: the AIDD book in `.add/docs/`, read on demand via each phase guide's chapter pointer, never auto-loaded.
|
|
84
|
+
|
|
85
|
+
**Working state** (formerly "state surface" — one of the two record surfaces) — everything an agent loads every session: the `add` skill (router `SKILL.md` + the active phase) and the lean operational docs — `PROJECT.md`, the active `MILESTONE.md` and `TASK.md`, and `state.json`. Kept small to avoid context rot. Contrast **audit trail**.
|
|
56
86
|
|
|
57
87
|
**Stop signal** — the boolean an autonomous harness reads from `add.py … --json` (`stop = owner != "ai"`): true means pause for a person before proceeding. The irreducible stops are the contract freeze and the Verify gate. See **Owner (of a phase)**.
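A harness-side reading of the rule `stop = owner != "ai"` might look like the sketch below. The enum values `human`, `seam`, and `ai` come from the **Owner (of a phase)** entry; the payload key name `owner` and the function `should_stop` are assumptions about the `--json` output made for illustration only.

```python
import json


def should_stop(status_json: str) -> bool:
    """Return True when the harness must pause for a person."""
    payload = json.loads(status_json)
    # Rule from the glossary: stop = owner != "ai".
    # Using .get() means a missing owner also stops, so the check fails closed.
    return payload.get("owner") != "ai"
```

For example, `should_stop('{"owner": "seam"}')` and `should_stop('{"owner": "human"}')` both return `True`, and only an explicit `"ai"` owner lets the run continue.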
|
|
58
88
|
|
|
59
|
-
**
|
|
89
|
+
**Audit trail** (formerly "story surface") — the book (`docs/*`): the whole method, read once by a person to trust ADD, then referenced by a pointer and **never auto-loaded** into agent context. Contrast **working state**.
|
|
60
90
|
|
|
61
|
-
**
|
|
91
|
+
**Living documentation** (formerly "survivor layer") — the set of durable artifacts (conventions, glossary, frozen contracts) that outlives any particular code.
|
|
62
92
|
|
|
63
93
|
**Trust ladder / autonomy ladder** — the graduated levels of AI autonomy, earned with evidence and verification capacity.
|
|
64
94
|
|
|
@@ -82,4 +112,4 @@ This book uses plain step names. Teams connecting it to a larger formal standard
|
|
|
82
112
|
| Verify | the review gate within the build |
|
|
83
113
|
| Observe (loop) | Operate and Learn |
|
|
84
114
|
|
|
85
|
-
The formal standard also names the *foundation* and *design* work as full phases in their own right; this book
|
|
115
|
+
The formal standard also names the *foundation* and *design* work as full phases in their own right; this book merges them into project setup and the Specify step (and the Prototype stage) to keep the flow to six memorable steps.
|
|
@@ -23,8 +23,8 @@ Reject:
|
|
|
23
23
|
- source == destination -> "same_account"
|
|
24
24
|
- balance < amount -> "insufficient_funds"
|
|
25
25
|
- account not mine -> "forbidden"
|
|
26
|
-
Assumptions —
|
|
27
|
-
⚠ same currency only (no FX) in v1 —
|
|
26
|
+
Assumptions — lowest-confidence first:
|
|
27
|
+
⚠ same currency only (no FX) in v1 — lowest confidence because the ticket never said; if wrong: the amount/rounding model changes and this contract is wrong
|
|
28
28
|
- [x] no daily limit in v1 — confirmed: out of scope for v1
|
|
29
29
|
```
|
|
30
30
|
|
|
@@ -18,7 +18,7 @@ Every exit check in the book, collected for quick use. Print this page.
|
|
|
18
18
|
- [ ] Every required behavior stated explicitly.
|
|
19
19
|
- [ ] Every rejection has a named error code.
|
|
20
20
|
- [ ] Success state-change described.
|
|
21
|
-
- [ ] Assumptions ranked
|
|
21
|
+
- [ ] Assumptions ranked lowest-confidence first; the 1–2 most-likely-wrong ⚠-flagged with why + cost (or an honest "none material" that still names the single biggest risk).
|
|
22
22
|
|
|
23
23
|
## Step 2 — Scenarios
|
|
24
24
|
|
|
@@ -70,7 +70,7 @@ Every exit check in the book, collected for quick use. Print this page.
|
|
|
70
70
|
|
|
71
71
|
A feature is shippable only when all are true:
|
|
72
72
|
|
|
73
|
-
- [ ] Spec complete: behavior stated, rejections named, assumptions ranked
|
|
73
|
+
- [ ] Spec complete: behavior stated, rejections named, assumptions ranked lowest-confidence first with the biggest risk flagged.
|
|
74
74
|
- [ ] Every rule has a scenario.
|
|
75
75
|
- [ ] Contract frozen; contract tests green.
|
|
76
76
|
- [ ] A test per scenario; suite was red before the build.
|
|
@@ -10,11 +10,11 @@ This appendix maps every AIDD document to a three-level project hierarchy, so th
|
|
|
10
10
|
|
|
11
11
|
| Level | What it is | AIDD meaning | Spans |
|
|
12
12
|
|-------|-----------|--------------|-------|
|
|
13
|
-
| **Project** | the whole product or engagement | the
|
|
13
|
+
| **Project** | the whole product or engagement | the living documentation — documents created once and kept for the life of the product | all milestones |
|
|
14
14
|
| **Milestone** | a stage or release | one pass of the flow at a chosen depth: Prototype, POC, MVP, or Production-Ready; groups many tasks | many tasks |
|
|
15
15
|
| **Task** | one feature through the flow | a single pass of Specify → … → Verify → Observe; the smallest unit with its own gate records | the seven steps |
|
|
16
16
|
|
|
17
|
-
A **project** sets up the
|
|
17
|
+
A **project** sets up the living documentation once. A **milestone** is a depth-bounded goal that groups tasks and has its own entry and exit document gates. A **task** is one feature, and it produces the per-feature artifacts.
|
|
18
18
|
|
|
19
19
|
## How the hierarchy decomposes
|
|
20
20
|
|
|
@@ -53,12 +53,12 @@ Which document lives at which level, who is accountable for it, and how long it
|
|
|
53
53
|
| `SLO.md` (objectives) | Milestone (MVP+) | from MVP | from MVP onward | DevOps / SRE |
|
|
54
54
|
| `SPEC.md` | Task | per feature | living | Product / Domain |
|
|
55
55
|
| `features/*.feature` | Task | per feature | living | QA / Test |
|
|
56
|
-
| `contracts/*.md` | Task → **Project** | per feature, then frozen |
|
|
56
|
+
| `contracts/*.md` | Task → **Project** | per feature, then frozen | living doc (promoted to project) | Architect / Lead |
|
|
57
57
|
| `tests/*` | Task | per feature | living | QA / Engineer |
|
|
58
58
|
| Source code | Task | per feature | **disposable** | Engineer |
|
|
59
59
|
| Gate outcome records | Task | per step | kept for audit | the reviewer |
|
|
60
60
|
|
|
61
|
-
> Note the one promotion: a **contract** is authored at task level but, once frozen, becomes part of the project's
|
|
61
|
+
> Note the one promotion: a **contract** is authored at task level but, once frozen, becomes part of the project's living documentation — other tasks depend on it. That promotion is why a contract change is a project-level change request, not a task-local edit.
|
|
62
62
|
|
|
63
63
|
---
|
|
64
64
|
|
|
@@ -93,13 +93,13 @@ Every task, regardless of milestone, produces this artifact chain. The depth var
|
|
|
93
93
|
|
|
94
94
|
| Step | Required document | Exit gate (the proof) | Detail |
|
|
95
95
|
|------|-------------------|------------------------|--------|
|
|
96
|
-
| 1 Specify | `SPEC.md` | rules + named rejections, assumptions ranked
|
|
96
|
+
| 1 Specify | `SPEC.md` | rules + named rejections, assumptions ranked lowest-confidence first (biggest risk ⚠-flagged) | [03](./03-step-1-specify.md) |
|
|
97
97
|
| 2 Scenarios | `features/<task>.feature` | one scenario per rule | [04](./04-step-2-scenarios.md) |
|
|
98
98
|
| 3 Contract | `contracts/<task>.md` | frozen + contract tests green | [05](./05-step-3-contract.md) |
|
|
99
99
|
| 4 Tests | `tests/<task>_*` | one test per scenario, red first | [06](./06-step-4-tests.md) |
|
|
100
100
|
| 5 Build | source code + evidence bundle | all tests green, nothing weakened | [07](./07-step-5-build.md) |
|
|
101
101
|
| 6 Verify | gate outcome record | `PASS` / `RISK-ACCEPTED` / `HARD-STOP` (auto-resolved on evidence under `autonomy: auto`; security always escalates) | [08](./08-step-6-verify.md) |
|
|
102
|
-
| 7 Observe | `TASK.md` §7 OBSERVE block | released behind a flag; scenario-monitors live; spec delta +
|
|
102
|
+
| 7 Observe | `TASK.md` §7 OBSERVE block | released behind a flag; scenario-monitors live; spec delta + lessons learned captured | [09](./09-the-loop.md) |
|
|
103
103
|
|
|
104
104
|
A task is **done** when the build's documents exist and the Verify record reads `PASS` (or a signed `RISK-ACCEPTED`); the seventh step — **Observe** (§7) — then runs in production and feeds the next loop's Specify. See the master shippable checklist in [Appendix E](./appendix-e-checklists.md).
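The done rule above can be sketched as a small check. This is an illustration only: `GateRecord`, its `signed_by` field, and `is_task_done` are hypothetical names, not part of the ADD engine. Only the three outcome strings come from the table.

```python
from dataclasses import dataclass
from typing import Optional

# The three Verify outcomes named in the step table.
OUTCOMES = {"PASS", "RISK-ACCEPTED", "HARD-STOP"}


@dataclass
class GateRecord:
    """Hypothetical shape of a Verify gate outcome record."""
    outcome: str
    signed_by: Optional[str] = None  # assumption: a signature is stored as a name


def is_task_done(record: GateRecord, documents_exist: bool) -> bool:
    """Return True when the build's documents exist and Verify passed.

    A RISK-ACCEPTED outcome only counts when someone has signed it.
    """
    if record.outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {record.outcome!r}")
    if not documents_exist:
        return False
    if record.outcome == "PASS":
        return True
    return record.outcome == "RISK-ACCEPTED" and bool(record.signed_by)
```

Under these assumptions, an unsigned `RISK-ACCEPTED` and any `HARD-STOP` both leave the task open.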
|
|
105
105
|
|
|
@@ -136,13 +136,13 @@ The tests are the source of truth; this table is their index. If a row here is e
|
|
|
136
136
|
|
|
137
137
|
## Worked example — the hierarchy filled in
|
|
138
138
|
|
|
139
|
-
- **Project:** *Mobile Banking App.*
|
|
139
|
+
- **Project:** *Mobile Banking App.* Living documentation: `CONVENTIONS.md`, `GLOSSARY.md` (defines *account*, *balance*, *transfer*), `MODEL_REGISTRY.md`, `dependencies.allowlist`, `playbook/`.
|
|
140
140
|
- **Milestone:** *MVP — core money movement.* Exit requires the full per-feature document set for each task below, plus a light `SLO.md` and a milestone exit report.
|
|
141
141
|
- **Task:** *Transfer between own accounts* → `SPEC.md`, `features/transfer.feature`, `contracts/transfer.md` (frozen at v1), `tests/transfer_test.py`, code, and a `PASS` gate record. (The full set is in [Appendix D](./appendix-d-worked-example.md).)
|
|
142
142
|
- **Task:** *View balance* → its own SPEC, feature, contract, tests, code, record.
|
|
143
143
|
- **Task:** *Transaction history* → its own set.
|
|
144
144
|
|
|
145
|
-
When all three tasks read `PASS` and the milestone documents exist, the MVP milestone exits — and the frozen `transfer` contract is now a project-level
|
|
145
|
+
When all three tasks read `PASS` and the milestone documents exist, the MVP milestone exits — and the frozen `transfer` contract is now a project-level living-documentation artifact the next milestone builds on.
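The milestone exit condition in the worked example can be sketched as follows. The function name `milestone_exits` and the file name `EXIT-REPORT.md` are hypothetical, chosen for illustration. The text names only `SLO.md` and "a milestone exit report".

```python
from typing import Dict, Iterable, List, Tuple


def milestone_exits(
    task_outcomes: Dict[str, str],
    required_docs: Iterable[str],
    present_docs: Iterable[str],
) -> Tuple[bool, List[str]]:
    """Return (exits, missing_docs) for a milestone.

    The milestone exits only when every task reads PASS and every
    required milestone document is present.
    """
    all_pass = bool(task_outcomes) and all(
        outcome == "PASS" for outcome in task_outcomes.values()
    )
    missing = sorted(set(required_docs) - set(present_docs))
    return all_pass and not missing, missing


# The three MVP tasks from the worked example.
tasks = {
    "transfer": "PASS",
    "view-balance": "PASS",
    "transaction-history": "PASS",
}
required = ["SLO.md", "EXIT-REPORT.md"]  # second name is an assumption
exits, missing = milestone_exits(tasks, required, ["SLO.md"])
# exits is False and missing is ["EXIT-REPORT.md"] until the report exists
```

This sketch follows the worked example literally and counts only `PASS`. A team that also accepts a signed `RISK-ACCEPTED` at milestone level would need to widen the check.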
|
|
146
146
|
|
|
147
147
|
---
|
|
148
148
|
|
|
@@ -0,0 +1,106 @@
|
|
|
1
|
+
# Appendix G — References & Lineage
|
|
2
|
+
|
|
3
|
+
ADD did not appear from nowhere. It sits at the meeting point of three currents:
|
|
4
|
+
the **recursive self-improvement** thesis (AI that helps build the next AI), the
|
|
5
|
+
**spec-driven development** movement (the specification, not the code, is the
|
|
6
|
+
source of truth), and a decade of **agentic + tests-first** research showing that
|
|
7
|
+
a generate→check→refine loop, constrained by executable tests, turns fluent model
|
|
8
|
+
output into trustworthy software. This appendix is the curated, verified grounding
|
|
9
|
+
for that lineage — every source below is reachable and annotated with a `↔ ADD:`
|
|
10
|
+
line saying exactly how it relates to the method.
|
|
11
|
+
|
|
12
|
+
**The frame — "closing the loop."** Anthropic's recursive-self-improvement picture
|
|
13
|
+
runs from autonomous agents delegating to workers *today* toward a future where
|
|
14
|
+
Claude improves Claude. ADD is a deliberately **human-gated, evidence-trusted**
|
|
15
|
+
instance of that loop: the AI drives spec→build→verify→observe, but a human owns the
|
|
16
|
+
frozen contract and the verify gate, and trust comes from passing tests and
|
|
17
|
+
re-resolved evidence — never from a plausible-looking diff. The sources here are
|
|
18
|
+
the shoulders that posture stands on.
|
|
19
|
+
|
|
20
|
+
The four numbered sections below trace those currents. The comparison table places ADD next
|
|
21
|
+
to its two closest peers — GitHub's **spec-kit** and **GSD (Get Shit Done)** — and
|
|
22
|
+
names where ADD diverges. Read "How to cite" first; the rest of the book cites into
|
|
23
|
+
the keys defined here.
|
|
24
|
+
|
|
25
|
+
## How to cite
|
|
26
|
+
|
|
27
|
+
The book uses one inline citation form — **author-year** — and every entry's lead
|
|
28
|
+
`(Author Year)` *is* its cite-key. Resolve any inline `[…]` to the matching entry below.
|
|
29
|
+
|
|
30
|
+
| Authors | Inline form | Example |
|
|
31
|
+
|---|---|---|
|
|
32
|
+
| one author | `[Surname Year]` | `[Schmidhuber 2003]` |
|
|
33
|
+
| two authors | `[Surname & Surname Year]` | `[Mathews & Nagappan 2024]` |
|
|
34
|
+
| three or more | `[Surname et al. Year]` | `[Zelikman et al. 2023]` |
|
|
35
|
+
| an organisation | `[Org Year]` | `[Anthropic 2026a]` · `[GitHub 2025]` |
|
|
36
|
+
| several at once | joined by `; ` | `[Schmidhuber 2003; Zelikman et al. 2023]` |
|
|
37
|
+
| same author, same year | append a letter to `Year` | `[Anthropic 2025a]` / `[Anthropic 2025b]` |
|
|
38
|
+
|
|
39
|
+
Three or more authors collapse to **et al.**; an organisation stands in as the author
|
|
40
|
+
when no individual is credited; and when two org-authored sources collide on a year
|
|
41
|
+
(several Anthropic 2025/2026 items do, below) a trailing letter disambiguates them.
|
|
42
|
+
There is exactly one entry per cite-key.
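The citation rules above are mechanical enough to sketch in code. The helper names `cite_key` and `inline` are hypothetical and not part of the package. They only restate the table: one author, two authors, three or more, an organisation as sole author, a year-letter suffix, and several keys joined by `; `.

```python
from typing import Sequence


def cite_key(authors: Sequence[str], year: int, suffix: str = "") -> str:
    """Build an author-year cite-key.

    `authors` holds surnames, or a single organisation name.
    `suffix` is the disambiguating letter for same-author, same-year entries.
    """
    if not authors:
        raise ValueError("at least one author or organisation is required")
    if len(authors) == 1:
        who = authors[0]
    elif len(authors) == 2:
        who = f"{authors[0]} & {authors[1]}"
    else:
        who = f"{authors[0]} et al."
    return f"{who} {year}{suffix}"


def inline(*keys: str) -> str:
    """Wrap one or more cite-keys in the bracketed inline form."""
    return "[" + "; ".join(keys) + "]"


print(inline(cite_key(["Mathews", "Nagappan"], 2024)))
# prints "[Mathews & Nagappan 2024]"
print(inline(cite_key(["Schmidhuber"], 2003),
             cite_key(["Zelikman", "Wu", "Lubana"], 2023)))
# prints "[Schmidhuber 2003; Zelikman et al. 2023]"
```

The co-author surnames passed in the second call are placeholders; only the first surname affects the output once there are three or more.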
|
|
43
|
+
|
|
44
|
+
## spec-kit ↔ ADD (and GSD)
|
|
45
|
+
|
|
46
|
+
ADD shares the spec-first DNA of GitHub's **spec-kit** and the Claude-Code,
|
|
47
|
+
context-rot-fighting niche of **GSD**. The phase models line up closely:
|
|
48
|
+
|
|
49
|
+
| ADD phase | spec-kit command | GSD phase |
|
|
50
|
+
|---|---|---|
|
|
51
|
+
| foundation · principles | `/speckit.constitution` → `constitution.md` | (project setup / `CLAUDE.md`-level) |
|
|
52
|
+
| §1 specify (what / why) | `/speckit.specify` → `spec.md` | **discuss** — capture decisions before planning |
|
|
53
|
+
| §3 contract (how, frozen) | `/speckit.plan` → `plan.md`, `contracts/` | **plan** — research, decompose, fit fresh context |
|
|
54
|
+
| milestone tasks / waves | `/speckit.tasks` → `tasks.md` | (phases → parallel waves) |
|
|
55
|
+
| §5 build | `/speckit.implement` | **execute** — parallel waves, fresh 200k-token context each |
|
|
56
|
+
| §6 verify | `/speckit.analyze` + `/speckit.checklist` | **verify** — walk what was built, fix before declaring done |
|
|
57
|
+
|
|
58
|
+
**Where ADD diverges.** spec-kit stops at `implement`; GSD ends at verify (GSD Core
|
|
59
|
+
adds a fifth *ship* phase). ADD closes the loop past both by adding three things
|
|
60
|
+
neither has as a first-class gate: a **failing-tests-first** gate (§4 — no build
|
|
61
|
+
starts until the tests are red for the right reason), an **observe→`fold`**
|
|
62
|
+
self-improvement step (§7 — confirmed learnings consolidate into a versioned foundation),
|
|
63
|
+
and an engine-tracked **dynamic goal-loop** that will hold a milestone open and
|
|
64
|
+
reopen tasks until its exit criteria are met. ADD also deliberately targets **less
|
|
65
|
+
doc-time than GSD** — a lean foundation and one human approval per task, rather than
|
|
66
|
+
a document per phase. The shared lineage is real; the tests-first gate, the `fold`,
|
|
67
|
+
and the goal-loop are ADD's contribution.
|
|
68
|
+
|
|
69
|
+
## 1. Recursive self-improvement
|
|
70
|
+
|
|
71
|
+
- **When AI builds itself** (Favaro & Clark 2026) — https://www.anthropic.com/institute/recursive-self-improvement — essay. The RSI thesis: by 2026 >80% of code merged at Anthropic was Claude-authored and the 50%-task time-horizon keeps doubling; recursive self-improvement would shift humans from builders to validators. ↔ ADD: the seed source — ADD is the human-gated, evidence-trusted way to run a spec→build→verify→observe loop while the human stays the validator.
|
|
72
|
+
- **Automated Alignment Researchers** (Anthropic 2026a) — https://www.anthropic.com/research/automated-alignment-researchers — research. Nine parallel Claude agents recovered ~97% of the human-expert gap on an alignment task in 5 days versus 7 for the human team. ↔ ADD: the strongest evidence the recursive loop is not speculative — parallel agents under review are exactly ADD's wave-plus-verify shape.
|
|
73
|
+
- **Machines of Loving Grace** (Amodei 2024) — https://www.darioamodei.com/essay/machines-of-loving-grace — essay. A "country of geniuses in a datacenter," argued with a measured, bounded position on recursive self-improvement. ↔ ADD: the intent framing behind milestoning — bound the loop with human direction rather than let it run open.
|
|
74
|
+
- **Gödel Machines: Self-Referential Universal Problem Solvers** (Schmidhuber 2003) — https://arxiv.org/abs/cs/0309048 — paper. A provably-optimal self-modifying agent that rewrites itself only when it can prove the rewrite helps. ↔ ADD: the mathematical anchor of the lineage — and a precedent for "only change on proof," which ADD enforces socially via the never-weaken-a-test rule.
|
|
75
|
+
- **STOP: Self-Taught Optimizer** (Zelikman et al. 2023) — https://arxiv.org/abs/2310.02304 — paper. A scaffolding program recursively improves the code that improves code. ↔ ADD: the algorithmic kin of the `fold` step — consolidate confirmed learnings back into the method that produced them.
|
|
76
|
+
- **Self-Refine: Iterative Refinement with Self-Feedback** (Madaan et al. 2023) — https://arxiv.org/abs/2303.17651 — paper. Generate→critique→refine with the same model lifts quality ~20% with no extra training. ↔ ADD: the micro-loop inside build→verify — produce, check against the contract, refine.
|
|
77
|
+
- **Self-Rewarding Language Models** (Yuan et al. 2024) — https://arxiv.org/abs/2401.10020 — paper. A model acts as its own reward judge to improve across iterations. ↔ ADD: the risk ADD answers — a self-judging loop needs an external gate; ADD makes tests and a human the reward signal, not the model's own opinion.
|
|
78
|
+
- **Reflexion: Language Agents with Verbal Reinforcement Learning** (Shinn et al. 2023) — https://arxiv.org/abs/2303.11366 — paper. Agents keep verbal reflections in episodic memory and retry, reaching 91% on HumanEval. ↔ ADD: the principle behind "reopen the task if criteria are unmet" — a failed check becomes feedback for the next attempt, not a dead end.
|
|
79
|
+
- **Voyager: An Open-Ended Embodied Agent with LLMs** (Wang et al. 2023) — https://arxiv.org/abs/2305.16291 — paper. An auto-curriculum agent that grows a reusable skill library over time. ↔ ADD: the growing foundation — each milestone's consolidated deltas are ADD's accumulating skill library.
|
|
80
|
+
- **AlphaEvolve: A Coding Agent for Scientific and Algorithmic Discovery** (Novikov et al. 2025) — https://arxiv.org/abs/2506.13131 — paper. An evolutionary coding agent that beat a long-standing matrix-multiplication record and shipped a production scheduler improvement. ↔ ADD: the end-state evidence — a generate-and-verify loop can exceed human baselines when every candidate is checked.
|
|
81
|
+
|
|
82
|
+
## 2. Autonomous & agentic workflows
|
|
83
|
+
|
|
84
|
+
- **Building Effective Agents** (Schluntz & Zhang 2024) — https://www.anthropic.com/research/building-effective-agents — blog. The canonical taxonomy: prompt-chaining, routing, orchestrator-workers, and the evaluator-optimizer loop. ↔ ADD: the architecture cite — evaluator-optimizer is build→verify→refine; orchestrator-workers is ADD's wave parallelism.
|
|
85
|
+
- **Enabling Claude Code to work more autonomously** (Anthropic 2025a) — https://www.anthropic.com/news/enabling-claude-code-to-work-more-autonomously — news. Checkpoints, subagents, hooks, background tasks, and `/rewind` rollback. ↔ ADD: checkpoint/rewind is the rollback strategy behind phase gates; hooks are where the engine enforces them.
|
|
86
|
+
- **How we built our multi-agent research system** (Anthropic 2025b) — https://www.anthropic.com/engineering/multi-agent-research-system — blog. An Opus lead orchestrating Sonnet subagents, with an LLM acting as judge, lifting task performance ~90%. ↔ ADD: the lead-plus-subagents-plus-judge pattern is exactly ADD's wave execution under a verify gate.
|
|
87
|
+
- **ReAct: Synergizing Reasoning and Acting in Language Models** (Yao et al. 2022) — https://arxiv.org/abs/2210.03629 — paper. Interleaving think→act→observe turns a model into an agent. ↔ ADD: the base loop every ADD phase runs on.
|
|
88
|
+
- **Toolformer: Language Models Can Teach Themselves to Use Tools** (Schick et al. 2023) — https://arxiv.org/abs/2302.04761 — paper. Self-supervised learning of when and how to call external tools. ↔ ADD: the capability that lets an agent run its own tests, linters, and builds — the evidence ADD trusts.
|
|
89
|
+
- **SWE-agent: Agent–Computer Interfaces Enable Automated Software Engineering** (Yang et al. 2024) — https://arxiv.org/abs/2405.15793 — paper. A designed agent–computer interface materially improves autonomous issue resolution. ↔ ADD: the structured agent↔environment contract — ADD's `add.py` engine is that interface for the method.
|
|
90
|
+
- **The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery** (Lu et al. 2024) — https://arxiv.org/abs/2408.06292 — paper. A full idea→experiment→write→review research loop at ~$15 per paper. ↔ ADD: the research analog of ADD's loop — and a reminder that an automated reviewer is the weak link a human gate protects.
|
|
91
|
+
|
|
92
|
+
## 3. Spec-driven development & spec-kit
|
|
93
|
+
|
|
94
|
+
- **GitHub Spec Kit** (GitHub 2025) — https://github.com/github/spec-kit — repo. The reference SDD toolkit: the phase model is `constitution` → `specify` → `plan` → `tasks` → `implement`, with the spec as the executable source of truth. ↔ ADD: the closest spec-first sibling — ADD's specify and contract phases map onto specify and plan; see the comparison table for the divergence.
|
|
95
|
+
- **Spec-driven development with AI: get started with a new open-source toolkit** (Delimarsky 2025) — https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/ — blog. The spec-kit launch post; frames `tasks` as "TDD for your AI agent." ↔ ADD: independent articulation of why decomposing a spec into checkable units beats one big prompt.
|
|
96
|
+
- **Spec-driven development: using Markdown as a programming language when building with AI** (Vesely 2025) — https://github.blog/ai-and-ml/generative-ai/spec-driven-development-using-markdown-as-a-programming-language-when-building-with-ai/ — blog. Spec-as-source, with context-rot named as the failure SDD exists to solve. ↔ ADD: the rationale for the frozen contract — a stable written spec is what survives when the model's context degrades.
|
|
97
|
+
- **Get Shit Done (GSD)** (GSD 2025) — https://github.com/open-gsd/gsd-core — repo. A meta-prompting, context-engineering, spec-driven system for Claude Code; its `discuss` → `plan` → `execute` → `verify` cycle runs each phase in a fresh subagent context to fight context-rot (originally `gsd-build/get-shit-done`, now continued as GSD Core). ↔ ADD: ADD's closest peer — same Claude-Code, context-rot niche; ADD diverges with the tests-first gate, the observe→`fold` step, and the dynamic goal-loop, and aims for less doc-time than GSD.
|
|
98
|
+
- **Beyond Vibe Coding: Amazon Introduces Kiro, the Spec-Driven Agentic IDE** (InfoQ 2025) — https://www.infoq.com/news/2025/08/aws-kiro-spec-driven-agent/ — blog. Kiro structures work as requirements→design→tasks with execution hooks. ↔ ADD: cross-vendor confirmation that spec-first is converging across the industry, not a single-tool idea.
|
|
99
|
+
- **Spec-Driven Development: From Code to Contract in the Age of AI Coding Assistants** (Piskala 2026) — https://arxiv.org/abs/2602.00180 — paper. A taxonomy of SDD rigor — Spec-First, Spec-Anchored, Spec-as-Source — reporting human-refined specs can cut LLM code errors substantially, with BDD as SDD's ancestor. ↔ ADD: places ADD as "Spec-Anchored" and gives the academic vocabulary for the contract-freeze decision.
|
|
100
|
+
|
|
101
|
+
## 4. Tests-first & verification
|
|
102
|
+
|
|
103
|
+
- **Test-Driven Development for Code Generation** (Mathews & Nagappan 2024) — https://arxiv.org/abs/2402.13521 — paper. Supplying tests alongside the prompt measurably lifts pass rates on MBPP and HumanEval. ↔ ADD: the empirical backbone of the failing-tests-first gate — tests as the constraint that makes generation verifiable.
|
|
104
|
+
- **SWE-bench: Can Language Models Resolve Real-World GitHub Issues?** (Jimenez et al. 2023) — https://arxiv.org/abs/2310.06770 — paper. 2,294 real issues judged by whether the project's own tests pass; <2% solved at release. ↔ ADD: the yardstick that proves the point — "done" means the tests pass, which is exactly how ADD gates a feature.
|
|
105
|
+
- **Our framework for developing safe and trustworthy agents** (Anthropic 2025c) — https://www.anthropic.com/news/our-framework-for-developing-safe-and-trustworthy-agents — news. Five principles: human control, transparency, alignment, privacy, and security. ↔ ADD: the frozen-contract gate and never-weaken-a-test rule are human control and transparency made concrete; the security HARD-STOP is the security principle.
|
|
106
|
+
- **Responsible Scaling Policy v3.0** (Anthropic 2026b) — https://www.anthropic.com/news/responsible-scaling-policy-v3 — policy. The AI Safety Level framework; ASL-3 governs autonomous R&D capability. ↔ ADD: the governance ceiling that makes ADD's discipline necessary — as the loop gets more capable, the gates and the human-owned verify matter more, not less.
|
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@pilotspace/add",
|
|
3
|
-
"version": "1.
|
|
3
|
+
"version": "1.2.0",
|
|
4
4
|
"description": "ADD (AI-Driven Development) — a minimal, state-tracked Claude Code skill that drives every feature through Specify → Scenarios → Contract → Tests → Build → Verify → Observe. Ships the AIDD book as its trust layer.",
|
|
5
5
|
"bin": {
|
|
6
6
|
"add": "bin/cli.js"
|