deepflow 0.1.87 → 0.1.89

@@ -10,36 +10,27 @@ description: Execute tasks from PLAN.md with agent spawning, ratchet health chec
10
10
  You are a coordinator. Spawn agents, run ratchet checks, update PLAN.md. Never implement code yourself.
11
11
 
12
12
  **NEVER:** Read source files, edit code, use TaskOutput, use EnterPlanMode, use ExitPlanMode
13
-
14
- **ONLY:** Read PLAN.md, read specs/doing-*.md, spawn background agents, run ratchet health checks after each agent completes, update PLAN.md, write `.deepflow/decisions.md` in the main tree
13
+ **ONLY:** Read PLAN.md, read specs/doing-*.md, spawn background agents, run ratchet health checks, update PLAN.md, write `.deepflow/decisions.md`
15
14
 
16
15
  ## Core Loop (Notification-Driven)
17
16
 
18
- Each task = one background agent. Completion notifications drive the loop.
19
-
20
- **NEVER use TaskOutput** — returns full transcripts (100KB+) that explode context.
17
+ Each task = one background agent. **NEVER use TaskOutput** (100KB+ transcripts explode context).
21
18
 
22
19
  ```
23
20
  1. Spawn ALL wave agents with run_in_background=true in ONE message
24
- 2. STOP. End your turn. Do NOT poll or monitor.
21
+ 2. STOP. End turn. Do NOT poll.
25
22
  3. On EACH notification:
26
- a. Run ratchet check (section 5.5)
27
- b. Passed → TaskUpdate(status: "completed"), update PLAN.md [x] + commit hash
28
- c. Failed → run partial salvage protocol (section 5.5). If salvaged → treat as passed. If not → git revert, TaskUpdate(status: "pending")
29
- d. Report ONE line: "✓ T1: ratchet passed (abc123)" or "⚕ T1: salvaged lint fix (abc124)" or "✗ T1: ratchet failed, reverted"
30
- e. NOT all done → end turn, wait | ALL done → next wave or finish
31
- 4. Between waves: check context %. If ≥50% → checkpoint and exit.
23
+ a. Ratchet check (§5.5)
24
+ b. Passed → wave test agent (§5.6). Tests pass → re-snapshot (§5.6) → TaskUpdate(status: "completed"), update PLAN.md [x] + commit hash
25
+ c. Failed → partial salvage (§5.5). Salvaged → wave test agent (§5.6). Not → git revert, TaskUpdate(status: "pending")
26
+ d. Wave test agent failed after max attempts → revert ALL task commits, TaskUpdate(status: "pending")
27
+ e. Report ONE line: "✓ T1: ratchet+tests passed (abc123)" or "⚕ T1: salvaged+tested (abc124)" or "✗ T1: reverted" or "✗ T1: test agent failed, reverted"
28
+ f. NOT all done → end turn, wait | ALL done → next wave or finish
29
+ 4. Between waves: context ≥50% → checkpoint and exit.
32
30
  5. Repeat until: all done, all blocked, or context ≥50%.
33
31
  ```
34
32
 
35
- ## Context Threshold
36
-
37
- Statusline writes `.deepflow/context.json`: `{"percentage": 45}`
38
-
39
- | Context % | Action |
40
- |-----------|--------|
41
- | < 50% | Full parallelism (up to 5 agents) |
42
- | ≥ 50% | Wait for running agents, checkpoint, exit |
33
+ **Context threshold:** Statusline writes `.deepflow/context.json`: `{"percentage": 45}`. <50% = full parallelism (up to 5). ≥50% = wait, checkpoint, exit.
43
34
 
44
35
  ---
45
36
 
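The context-threshold rule above can be sketched as a small check. This is a minimal illustration, not deepflow's implementation: the function name `context_action` and the `"unknown"` return value for an unreadable file are assumptions; only the file layout `{"percentage": 45}` and the 50% cutoff come from the text.

```python
import json
from pathlib import Path


def context_action(context_file: Path, threshold: int = 50) -> str:
    """Map the statusline's context percentage to a coordinator action.

    Returns "spawn" below the threshold, "checkpoint" at or above it,
    and "unknown" (an assumed fallback) when the file cannot be read.
    """
    try:
        percentage = json.loads(context_file.read_text())["percentage"]
    except (OSError, ValueError, KeyError, TypeError):
        return "unknown"
    return "checkpoint" if percentage >= threshold else "spawn"
```

A caller would pass `Path(".deepflow/context.json")` between waves and exit after checkpointing when the result is `"checkpoint"`.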
@@ -47,109 +38,70 @@ Statusline writes `.deepflow/context.json`: `{"percentage": 45}`
47
38
 
48
39
  ### 1. CHECK CHECKPOINT
49
40
 
50
- ```
51
- --continue → Load .deepflow/checkpoint.json from worktree
52
- → Verify worktree exists on disk (else error: "Use --fresh")
53
- → Skip completed tasks, resume execution
54
- --fresh → Delete checkpoint, start fresh
55
- checkpoint exists → Prompt: "Resume? (y/n)"
56
- else → Start fresh
57
- ```
58
-
59
- Shell injection (use output directly — no manual file reads needed):
60
- - `` !`cat .deepflow/checkpoint.json 2>/dev/null || echo 'NOT_FOUND'` ``
61
- - `` !`git diff --quiet && echo 'CLEAN' || echo 'DIRTY'` ``
41
+ `--continue` → load `.deepflow/checkpoint.json`, verify worktree exists (else error "Use --fresh"), skip completed. `--fresh` → delete checkpoint. Checkpoint exists → prompt "Resume? (y/n)".
42
+ Shell: `` !`cat .deepflow/checkpoint.json 2>/dev/null || echo 'NOT_FOUND'` `` / `` !`git diff --quiet && echo 'CLEAN' || echo 'DIRTY'` ``
62
43
 
63
44
  ### 1.5. CREATE WORKTREE
64
45
 
65
- Require clean HEAD (`git diff --quiet`). Derive SPEC_NAME from `specs/doing-*.md`.
66
- Create worktree: `.deepflow/worktrees/{spec}` on branch `df/{spec}`.
67
- Reuse if exists. `--fresh` deletes first.
68
-
69
- If `worktree.sparse_paths` is non-empty in config, enable sparse checkout:
70
- ```bash
71
- git worktree add --no-checkout -b df/{spec} .deepflow/worktrees/{spec}
72
- cd .deepflow/worktrees/{spec}
73
- git sparse-checkout set {sparse_paths...}
74
- git checkout df/{spec}
75
- ```
46
+ Require clean HEAD. Derive SPEC_NAME from `specs/doing-*.md`. Create `.deepflow/worktrees/{spec}` on branch `df/{spec}`. Reuse if exists; `--fresh` deletes first. If `worktree.sparse_paths` non-empty: `git worktree add --no-checkout`, `sparse-checkout set {paths}`, checkout.
76
47
 
77
48
  ### 1.6. RATCHET SNAPSHOT
78
49
 
79
- Snapshot pre-existing test files in worktree — only these count for ratchet (agent-created tests excluded):
80
-
50
+ Snapshot pre-existing test files — only these count for ratchet (agent-created excluded):
81
51
  ```bash
82
- cd ${WORKTREE_PATH}
83
- git ls-files | grep -E '\.(test|spec)\.[^/]+$|^test_|_test\.[^/]+$|^tests/|__tests__/' \
84
- > .deepflow/auto-snapshot.txt
52
+ git -C ${WORKTREE_PATH} ls-files | grep -E '\.(test|spec)\.[^/]+$|^test_|_test\.[^/]+$|^tests/|__tests__/' > .deepflow/auto-snapshot.txt
85
53
  ```
86
54
 
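To see which paths the snapshot filter keeps, the same extended regular expression can be exercised outside `grep`. The sketch below reuses the pattern verbatim; the helper name `is_snapshot_test` and the sample paths are illustrative assumptions.

```python
import re

# Same alternation as the grep -E pattern in the snapshot command.
SNAPSHOT_PATTERN = re.compile(
    r"\.(test|spec)\.[^/]+$|^test_|_test\.[^/]+$|^tests/|__tests__/"
)


def is_snapshot_test(path: str) -> bool:
    """Return True if a repo-relative path would land in auto-snapshot.txt."""
    return SNAPSHOT_PATTERN.search(path) is not None


# Hypothetical paths, one per alternative in the pattern.
samples = ["src/foo.test.ts", "pkg/foo_test.go", "tests/unit/a.py",
           "src/__tests__/a.js", "test_root.py", "src/test_foo.py", "src/main.rs"]
matched = [p for p in samples if is_snapshot_test(p)]
```

Because `^test_` is anchored to the start of the path, a nested file such as `src/test_foo.py` is not matched by this pattern unless it also sits under `tests/` or `__tests__/`.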
87
55
  ### 1.7. NO-TESTS BOOTSTRAP
88
56
 
89
- If snapshot has zero test files:
57
+ <!-- AC-1: zero test files triggers bootstrap before wave 1 -->
58
+ <!-- AC-2: bootstrap success re-snapshots auto-snapshot.txt; subsequent tasks use updated snapshot -->
59
+ <!-- AC-3: bootstrap failure with default model retries with Opus; double failure halts with specific message -->
90
60
 
91
- 1. Spawn ONE bootstrap agent (section 6 Bootstrap Task) to write tests for `edit_scope` files
92
- 2. On ratchet pass: re-snapshot, report `"bootstrap: completed"`, end cycle (no PLAN.md tasks this cycle)
93
- 3. On ratchet fail: revert, halt with "Bootstrap failed — manual intervention required"
61
+ **Gate:** After §1.6 snapshot, check `auto-snapshot.txt`:
62
+ ```bash
63
+ SNAPSHOT_COUNT=$(wc -l < .deepflow/auto-snapshot.txt | tr -d ' ')
64
+ ```
65
+ If `SNAPSHOT_COUNT` is `0` (zero test files found), MUST spawn bootstrap agent before wave 1. No implementation tasks may start until bootstrap completes successfully.
94
66
 
95
- Subsequent cycles use bootstrapped tests as ratchet baseline.
67
+ **Bootstrap flow:**
68
+ 1. Spawn `Agent(model="{default_model}", ...)` with Bootstrap prompt (§6). End turn, wait for notification.
69
+ 2. **On success (TASK_STATUS:pass):** Re-snapshot immediately:
70
+ ```bash
71
+ git -C ${WORKTREE_PATH} ls-files | grep -E '\.(test|spec)\.[^/]+$|^test_|_test\.[^/]+$|^tests/|__tests__/' > .deepflow/auto-snapshot.txt
72
+ ```
73
+ All subsequent tasks use this updated snapshot as their ratchet baseline. Proceed to wave 1.
74
+ 3. **On failure (TASK_STATUS:fail) with default model:** Retry ONCE with `Agent(model="opus", ...)` using the same Bootstrap prompt.
75
+ - Opus success → re-snapshot (same command above) → proceed to wave 1.
76
+ - Opus failure → halt with message: `"Bootstrap failed with both default and Opus — manual intervention required"`. Do not proceed.
96
77
 
97
78
  ### 2. LOAD PLAN
98
79
 
99
- ```
100
- Load: PLAN.md (required), specs/doing-*.md, .deepflow/config.yaml
101
- If missing: "No PLAN.md found. Run /df:plan first."
102
- ```
103
-
104
- Shell injection (use output directly — no manual file reads needed):
105
- - `` !`cat .deepflow/checkpoint.json 2>/dev/null || echo 'NOT_FOUND'` ``
106
- - `` !`git diff --quiet && echo 'CLEAN' || echo 'DIRTY'` ``
80
+ Load PLAN.md (required), specs/doing-*.md, .deepflow/config.yaml. Missing → "No PLAN.md found. Run /df:plan first."
107
81
 
108
82
  ### 2.5. REGISTER NATIVE TASKS
109
83
 
110
- For each `[ ]` task in PLAN.md: `TaskCreate(subject: "{task_id}: {description}", activeForm: "{gerund}", description: full block)`. Store task_id → native ID mapping. Set dependencies via `TaskUpdate(addBlockedBy: [...])`. On `--continue`: only register remaining `[ ]` items.
84
+ For each `[ ]` task: `TaskCreate(subject: "{task_id}: {description}", activeForm: "{gerund}", description: full block)`. Store task_id → native ID. Set deps via `TaskUpdate(addBlockedBy: [...])`. `--continue` → only remaining `[ ]` items.
111
85
 
112
- ### 3. CHECK FOR UNPLANNED SPECS
86
+ ### 3–4. READY TASKS
113
87
 
114
- Warn if `specs/*.md` (excluding doing-/done-) exist. Non-blocking.
115
-
116
- ### 4. IDENTIFY READY TASKS
117
-
118
- Ready = TaskList where status: "pending" AND blockedBy: empty.
88
+ Warn if unplanned `specs/*.md` (excluding doing-/done-) exist (non-blocking). Ready = TaskList where status: "pending" AND blockedBy: empty.
119
89
 
120
90
  ### 5. SPAWN AGENTS
121
91
 
122
- Context ≥50%: checkpoint and exit.
123
-
124
- Before spawning: `TaskUpdate(taskId: native_id, status: "in_progress")` — activates UI spinner.
125
-
126
- **Token tracking — record start:**
127
- ```
128
- start_percentage = !`grep -o '"percentage":[0-9]*' .deepflow/context.json 2>/dev/null | grep -o '[0-9]*' || echo ''`
129
- start_timestamp = !`date -u +%Y-%m-%dT%H:%M:%SZ`
130
- ```
131
- Store both values in memory (keyed by task_id) for use after ratchet completes. Omit if context.json unavailable.
132
-
133
- **NEVER use `isolation: "worktree"` on Task calls.** Deepflow manages a shared worktree so wave 2 sees wave 1 commits.
92
+ Context ≥50% → checkpoint and exit. Before spawning: `TaskUpdate(status: "in_progress")`.
134
93
 
135
- **Spawn ALL ready tasks in ONE message** EXCEPT file conflicts (see below).
94
+ **Token tracking start:** Store `start_percentage` (from context.json) and `start_timestamp` (ISO 8601) keyed by task_id. Omit if unavailable.
136
95
 
137
- **File conflict enforcement (1 file = 1 writer):**
138
- Before spawning, check `Files:` lists of all ready tasks. If two+ ready tasks share a file:
139
- 1. Sort conflicting tasks by task number (T1 < T2 < T3)
140
- 2. Spawn only the lowest-numbered task from each conflict group
141
- 3. Remaining tasks stay `pending` — they become ready once the spawned task completes
142
- 4. Log: `"⏳ T{N} deferred — file conflict with T{M} on {filename}"`
96
+ **NEVER use `isolation: "worktree"`.** Deepflow manages a shared worktree so wave 2 sees wave 1 commits. **Spawn ALL ready tasks in ONE message** except file conflicts.
143
97
 
144
- **≥2 [SPIKE] tasks for same problem:** Follow Parallel Spike Probes (section 5.7).
98
+ **File conflicts (1 file = 1 writer):** Check `Files:` lists. Overlap → spawn lowest-numbered only; rest stay pending. Log: `"⏳ T{N} deferred — file conflict with T{M} on {filename}"`
145
99
 
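The "1 file = 1 writer" rule can be sketched as a greedy pass over ready tasks in ascending task number. This is one possible reading, not the package's code: the function name is hypothetical, task IDs are assumed to look like `T<number>`, and files are claimed only by tasks that are actually spawned (the text does not say how chained conflicts should be grouped).

```python
from typing import Dict, List, Tuple


def split_by_file_conflict(
    ready: Dict[str, List[str]],
) -> Tuple[List[str], List[Tuple[str, str, str]]]:
    """Split ready tasks into (spawn, deferred).

    ready maps a task ID such as "T3" to its `Files:` list.
    Each deferred entry is (deferred_task, blocking_task, shared_file).
    """
    claimed: Dict[str, str] = {}  # file -> task that will write it this wave
    spawn: List[str] = []
    deferred: List[Tuple[str, str, str]] = []
    for task_id in sorted(ready, key=lambda t: int(t.lstrip("T"))):
        clash = next((f for f in ready[task_id] if f in claimed), None)
        if clash is None:
            spawn.append(task_id)
            for f in ready[task_id]:
                claimed[f] = task_id
        else:
            deferred.append((task_id, claimed[clash], clash))
    return spawn, deferred
```

Each deferred tuple carries the three values the deferral log line needs: the deferred task, the task it conflicts with, and the shared filename.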
146
- **[OPTIMIZE] tasks:** Follow Optimize Cycle (section 5.9). Only ONE optimize task runs at a time — defer others until the active one completes.
100
+ **≥2 [SPIKE] tasks same problem →** Parallel Spike Probes (§5.7). **[OPTIMIZE] tasks →** Optimize Cycle (§5.9), one at a time.
147
101
 
148
102
  ### 5.5. RATCHET CHECK
149
103
 
150
- After each agent completes, run health checks in the worktree.
151
-
152
- **Auto-detect commands:**
104
+ Run health checks in worktree after each agent completes.
153
105
 
154
106
  | File | Build | Test | Typecheck | Lint |
155
107
  |------|-------|------|-----------|------|
@@ -158,555 +110,323 @@ After each agent completes, run health checks in the worktree.
158
110
  | `Cargo.toml` | `cargo build` | `cargo test` | — | `cargo clippy` |
159
111
  | `go.mod` | `go build ./...` | `go test ./...` | — | `go vet ./...` |
160
112
 
161
- Run Build → Test → Typecheck → Lint (stop on first failure).
162
-
163
- **Edit scope validation** (if spec declares `edit_scope`): check `git diff HEAD~1 --name-only` against allowed globs. Violations → `git revert HEAD --no-edit`, report "Edit scope violation: {files}".
164
-
165
- **Impact completeness check** (if task has Impact block in PLAN.md):
166
- Compare `git diff HEAD~1 --name-only` against Impact callers/duplicates list.
167
- File listed but not modified → **advisory warning**: "Impact gap: {file} listed as {caller|duplicate} but not modified — verify manually". Not auto-revert (callers sometimes don't need changes), but flags the risk.
168
-
169
- **Metric gate (Optimize tasks only):**
170
-
171
- After ratchet passes, if the current task has an `Optimize:` block, run the metric gate:
172
-
173
- 1. Run the `metric` shell command in the worktree: `cd ${WORKTREE_PATH} && eval "${metric_command}"`
174
- 2. Parse output as float. Non-numeric output → cycle failure (revert, log "metric parse error: {raw output}")
175
- 3. Compare against previous measurement using `direction`:
176
- - `direction: higher` → new value must be > previous + (previous × min_improvement_threshold)
177
- - `direction: lower` → new value must be < previous - (previous × min_improvement_threshold)
178
- 4. Both ratchet AND metric improvement required → keep commit
179
- 5. Ratchet passes but metric did not improve → revert (log "ratchet passed but metric stagnant/regressed: {old} → {new}")
180
- 6. Run each `secondary_metrics` command, parse as float. If regression > `regression_threshold` (default 5%) compared to baseline: append WARNING to `.deepflow/auto-report.md`: `"WARNING: {name} regressed {delta}% ({baseline_val} → {new_val}) at cycle {N}"`. Do NOT auto-revert.
181
-
182
- **Output Truncation:**
183
-
184
- After ratchet checks complete, truncate command output for context efficiency:
185
-
186
- - **Success (all checks passed):** Suppress output entirely — do not include build/test/lint output in reports
187
- - **Build failure:** Include last 15 lines of build error only
188
- - **Test failure:** Include failed test name(s) + last 20 lines of test output
189
- - **Typecheck/lint failure:** Include error count + first 5 errors only
190
-
191
- **Token tracking — write result (on ratchet pass):**
192
-
193
- After all checks pass, compute and write the token block to `.deepflow/results/T{N}.yaml`:
113
+ Run Build → Test → Typecheck → Lint (stop on first failure). Ratchet uses ONLY pre-existing tests from `.deepflow/auto-snapshot.txt`.
194
114
 
195
- ```
196
- end_percentage = !`grep -o '"percentage":[0-9]*' .deepflow/context.json 2>/dev/null | grep -o '[0-9]*' || echo ''`
197
- ```
115
+ **Edit scope validation:** `git diff HEAD~1 --name-only` vs allowed globs. Violation → revert, report.
116
+ **Impact completeness:** diff vs Impact callers/duplicates. Gap → advisory warning (no revert).
198
117
 
199
- Parse `.deepflow/token-history.jsonl` to sum token fields for lines whose `timestamp` falls between `start_timestamp` and `end_timestamp` (ISO 8601 compare):
200
- ```bash
201
- awk -v start="REPLACE_start_timestamp" -v end="REPLACE_end_timestamp" '
202
- {
203
- ts=""; inp=0; cre=0; rd=0
204
- if (match($0, /"timestamp":"[^"]*"/)) { ts=substr($0, RSTART+13, RLENGTH-14) }
205
- if (ts >= start && ts <= end) {
206
- if (match($0, /"input_tokens":[0-9]+/)) inp=substr($0, RSTART+15, RLENGTH-15)
207
- if (match($0, /"cache_creation_input_tokens":[0-9]+/)) cre=substr($0, RSTART+30, RLENGTH-30)
208
- if (match($0, /"cache_read_input_tokens":[0-9]+/)) rd=substr($0, RSTART+26, RLENGTH-26)
209
- si+=inp; sc+=cre; sr+=rd
210
- }
211
- }
212
- END { printf "{\"input_tokens\":%d,\"cache_creation_input_tokens\":%d,\"cache_read_input_tokens\":%d}\n", si+0, sc+0, sr+0 }
213
- ' .deepflow/token-history.jsonl 2>/dev/null || echo '{}'
214
- ```
118
+ **Metric gate (Optimize only):** Run `eval "${metric_command}"` with cwd=`${WORKTREE_PATH}` (never `cd && eval`). Parse float (non-numeric → revert). Compare using `direction`+`min_improvement_threshold`. Both ratchet AND metric must pass → keep. Ratchet pass + metric stagnant → revert. Secondary metrics: regression > `regression_threshold` (5%) → WARNING in auto-report.md (no revert).
215
119
 
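The metric gate's comparison can be illustrated with the formula stated in the removed 0.1.87 text (new value must beat `previous ± previous × min_improvement_threshold`). The sketch below is an assumption-laden illustration: the function names are hypothetical, and the threshold is assumed to be a fraction (0.01 for 1%).

```python
from typing import Optional


def parse_metric(raw: str) -> Optional[float]:
    """Parse metric command output as a float; None signals a parse error."""
    try:
        return float(raw.strip())
    except ValueError:
        return None


def metric_improved(previous: float, new: float, direction: str,
                    min_improvement_threshold: float) -> bool:
    """Apply the direction-aware improvement test from the metric gate."""
    margin = previous * min_improvement_threshold
    if direction == "higher":
        return new > previous + margin
    if direction == "lower":
        return new < previous - margin
    raise ValueError(f"unknown direction: {direction!r}")
```

A `None` from `parse_metric` corresponds to the "non-numeric → revert" branch; a `False` from `metric_improved` after a passing ratchet corresponds to "metric stagnant → revert".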
216
- Append (or create) `.deepflow/results/T{N}.yaml` with the following block. Use shell injection to read the existing file first:
217
- ```
218
- !`cat .deepflow/results/T{N}.yaml 2>/dev/null || echo ''`
219
- ```
120
+ **Output truncation:** Success → suppress. Build fail → last 15 lines. Test fail → names + last 20 lines. Typecheck/lint → count + first 5 errors.
220
121
 
221
- Write the `tokens` block:
122
+ **Token tracking result (on pass):** Read `end_percentage`. Sum token fields from `.deepflow/token-history.jsonl` between start/end timestamps (awk ISO 8601 compare). Write to `.deepflow/results/T{N}.yaml`:
222
123
  ```yaml
223
124
  tokens:
224
- start_percentage: {start_percentage}
225
- end_percentage: {end_percentage}
226
- delta_percentage: {end_percentage - start_percentage}
227
- input_tokens: {sum from jsonl}
228
- cache_creation_input_tokens: {sum from jsonl}
229
- cache_read_input_tokens: {sum from jsonl}
230
- ```
231
-
232
- **Omit entirely if:** context.json was unavailable at start OR end, OR token-history.jsonl is missing, OR awk is unavailable. Never fail the ratchet due to token tracking errors.
233
-
234
- **Evaluate:** All pass + no violations → commit stands. Any failure → attempt partial salvage before reverting:
235
-
236
- **Partial salvage protocol:**
237
- 1. Run `git diff HEAD~1 --stat` to see what the agent changed
238
- 2. If failure is lint-only or typecheck-only (build + tests passed):
239
- - Spawn `Agent(model="haiku", subagent_type="general-purpose")` with prompt: `Fix the {lint|typecheck} errors in the worktree. Only fix what's broken, change nothing else. Files changed: {diff stat}. Error output: {error}`
240
- - Run ratchet again on the fix commit
241
- - If passes → both commits stand. If fails → `git revert HEAD --no-edit && git revert HEAD --no-edit` (revert both)
242
- 3. If failure is build or test → `git revert HEAD --no-edit` (no salvage, too risky)
243
-
244
- Ratchet uses ONLY pre-existing test files from `.deepflow/auto-snapshot.txt`.
125
+ start_percentage: {val}
126
+ end_percentage: {val}
127
+ delta_percentage: {end - start}
128
+ input_tokens: {sum}
129
+ cache_creation_input_tokens: {sum}
130
+ cache_read_input_tokens: {sum}
131
+ ```
132
+ Omit if context.json/token-history.jsonl/awk unavailable. Never fail ratchet for tracking errors.
133
+
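The token summation that 0.1.87 did with `awk` can be sketched with a JSON parser instead. This is an illustrative alternative, not what the package ships: the function name is hypothetical, and it assumes every `timestamp` uses the same UTC ISO 8601 format so that plain string comparison orders them correctly.

```python
import json
from typing import Dict, Iterable

TOKEN_FIELDS = ("input_tokens", "cache_creation_input_tokens",
                "cache_read_input_tokens")


def sum_tokens(lines: Iterable[str], start: str, end: str) -> Dict[str, int]:
    """Sum token fields over JSONL records with start <= timestamp <= end."""
    totals = dict.fromkeys(TOKEN_FIELDS, 0)
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip malformed lines; tracking must never fail the ratchet
        if not isinstance(record, dict):
            continue
        ts = record.get("timestamp", "")
        if isinstance(ts, str) and start <= ts <= end:
            for field in TOKEN_FIELDS:
                value = record.get(field, 0)
                if isinstance(value, int):
                    totals[field] += value
    return totals
```

Passing the open file object for `.deepflow/token-history.jsonl` as `lines` yields the three sums for the `tokens:` block.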
134
+ **Evaluate:** All pass → commit stands. Failure → partial salvage:
135
+ 1. Lint/typecheck-only (build+tests passed): spawn `Agent(model="haiku")` to fix. Re-ratchet. Fail → revert both.
136
+ 2. Build/test failure → `git revert HEAD --no-edit` (no salvage).
137
+
138
+ ### 5.6. WAVE TEST AGENT
139
+
140
+ <!-- AC-8: After wave ratchet passes, Opus test agent spawns and writes unit tests -->
141
+ <!-- AC-9: Test failures trigger implementer re-spawn with failure feedback; max 3 attempts then revert -->
142
+ <!-- AC-12: auto-snapshot.txt re-generated after wave test agent commits; wave N+1 ratchet includes wave N tests -->
143
+
144
+ **Trigger:** After ratchet check passes (or after successful salvage) for a task.
145
+
146
+ **Attempt tracking:** Initialize `attempt_count = 1` and `failure_feedback = ""` per task when first spawned. Max 3 total attempts (1 initial + 2 retries).
147
+
148
+ **Flow:**
149
+ 1. Capture the implementation diff: `git -C ${WORKTREE_PATH} diff HEAD~1` → store as `IMPL_DIFF`.
150
+ 2. Spawn `Agent(model="opus")` with Wave Test prompt (§6). `run_in_background=true`. End turn, wait.
151
+ 3. On notification:
152
+ a. Run ratchet check (§5.5) — all new + pre-existing tests must pass.
153
+ b. **Tests pass** → commit stands. **Re-snapshot** immediately so wave N+1 ratchet includes wave N tests:
154
+ ```bash
155
+ git -C ${WORKTREE_PATH} ls-files | grep -E '\.(test|spec)\.[^/]+$|^test_|_test\.[^/]+$|^tests/|__tests__/' > .deepflow/auto-snapshot.txt
156
+ ```
157
+ Task complete. Report: `"✓ T{n}: ratchet+tests passed ({hash})"`.
158
+ c. **Tests fail** →
159
+ - If `attempt_count < 3`:
160
+ - `git revert HEAD --no-edit` (revert test commit)
161
+ - `git revert HEAD --no-edit` (revert implementation commit)
162
+ - Accumulate failure output: `failure_feedback += "Attempt {N}: {truncated_test_output}\n"`
163
+ - `attempt_count += 1`
164
+ - Re-spawn implementer agent with original prompt + failure feedback appendix:
165
+ ```
166
+ PREVIOUS FAILURES (attempt {N-1} of 3):
167
+ {failure_feedback}
168
+ Fix the issues above. Do NOT repeat the same mistakes.
169
+ ```
170
+ - On implementer notification: ratchet check (§5.5). Passed → goto step 1 (spawn test agent again). Failed → same retry logic.
171
+ - If `attempt_count >= 3`:
172
+ - Revert ALL commits back to pre-task state: `git -C ${WORKTREE_PATH} reset --hard {pre_task_commit}`
173
+ - `TaskUpdate(status: "pending")`
174
+ - Report: `"✗ T{n}: test agent failed after 3 attempts, reverted"`
175
+
176
+ **Output truncation for failure feedback:** Test failures → test names + last 30 lines of output. Build failures → last 15 lines. Cap total `failure_feedback` at 200 lines.
245
177
 
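The failure-feedback accumulation in §5.6 can be sketched as follows. The helper name is hypothetical, and two details are assumptions because the text leaves them open: when the 200-line cap is exceeded the oldest lines are dropped, and only the "last 30 lines" half of the truncation rule is modeled (extracting failed test names is runner-specific).

```python
def append_feedback(feedback: str, attempt: int, test_output: str,
                    tail: int = 30, cap: int = 200) -> str:
    """Append one attempt's truncated output and enforce the overall line cap."""
    truncated = "\n".join(test_output.splitlines()[-tail:])
    combined = feedback + f"Attempt {attempt}: {truncated}\n"
    # Assumed policy: keep the most recent `cap` lines.
    return "\n".join(combined.splitlines()[-cap:]) + "\n"
```

The returned string is what would be substituted for `{failure_feedback}` when the implementer agent is re-spawned.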
246
178
  ### 5.7. PARALLEL SPIKE PROBES
247
179
 
248
- Trigger: ≥2 [SPIKE] tasks with same "Blocked by:" target or identical hypothesis.
249
-
250
- 1. **Baseline:** Record `BASELINE=$(git rev-parse HEAD)` in shared worktree
251
- 2. **Sub-worktrees:** Per spike: `git worktree add -b df/{spec}--probe-{SPIKE_ID} .deepflow/worktrees/{spec}/probe-{SPIKE_ID} ${BASELINE}`
252
- 3. **Spawn:** All probes in ONE message, each targeting its probe worktree. End turn.
253
- 4. **Ratchet:** Per notification, run standard ratchet (5.5) in probe worktree. Record: ratchet_passed, regressions, coverage_delta, files_changed, commit
254
- 5. **Select winner** (after ALL complete, no LLM judge):
255
- - Disqualify any with regressions
256
- - **Standard spikes**: Rank: fewer regressions > higher coverage_delta > fewer files_changed > first to complete
257
- - **Optimize probes**: Rank: best metric improvement (absolute delta toward target) > fewer regressions > fewer files_changed
258
- - No passes → reset all to pending for retry with debugger
259
- 6. **Preserve all worktrees.** Losers: rename branch + `-failed` suffix. Record in checkpoint.json under `"spike_probes"`
260
- 7. **Log ALL probe outcomes** to `.deepflow/auto-memory.yaml` (main tree):
261
- ```yaml
262
- spike_insights:
263
- - date: "YYYY-MM-DD"
264
- spec: "{spec_name}"
265
- spike_id: "SPIKE_A"
266
- hypothesis: "{from PLAN.md}"
267
- outcome: "winner"
268
- approach: "{one-sentence summary of what the winning probe chose}"
269
- ratchet_metrics: {regressions: N, coverage_delta: N, files_changed: N}
270
- branch: "df/{spec}--probe-SPIKE_A"
271
- - date: "YYYY-MM-DD"
272
- spec: "{spec_name}"
273
- spike_id: "SPIKE_B"
274
- hypothesis: "{from PLAN.md}"
275
- outcome: "failed" # or "passed-but-lost"
276
- failure_reason: "{first failed check + error summary}"
277
- ratchet_metrics: {regressions: N, coverage_delta: N, files_changed: N}
278
- worktree: ".deepflow/worktrees/{spec}/probe-SPIKE_B-failed"
279
- branch: "df/{spec}--probe-SPIKE_B-failed"
280
- probe_learnings: # read by /df:auto-cycle each start AND included in per-task preamble
281
- - spike: "SPIKE_A"
282
- probe: "probe-SPIKE_A"
283
- insight: "{one-sentence summary of winning approach — e.g. 'Use Node.js over Bun for Playwright'}"
284
- - spike: "SPIKE_B"
285
- probe: "probe-SPIKE_B"
286
- insight: "{one-sentence summary from failure_reason}"
287
- ```
288
- Create file if missing. Preserve existing keys when merging. Log BOTH winners and losers — downstream tasks need to know what was chosen, not just what failed.
289
- 8. **Promote winner:** Cherry-pick into shared worktree. Winner → `[x] [PROBE_WINNER]`, losers → `[~] [PROBE_FAILED]`. Resume standard loop.
290
-
291
- #### 5.7.1. PROBE DIVERSITY ENFORCEMENT (Optimize Probes)
180
+ Trigger: ≥2 [SPIKE] tasks with same blocker or identical hypothesis.
292
181
 
293
- When spawning probes for optimize plateau resolution, enforce diversity roles:
182
+ 1. `BASELINE=$(git rev-parse HEAD)` in shared worktree
183
+ 2. Sub-worktrees per spike: `git worktree add -b df/{spec}--probe-{ID} .deepflow/worktrees/{spec}/probe-{ID} ${BASELINE}`
184
+ 3. Spawn all probes in ONE message. End turn.
185
+ 4. Per notification: ratchet (§5.5). Record: ratchet_passed, regressions, coverage_delta, files_changed, commit.
186
+ 5. **Winner selection** (no LLM judge): disqualify regressions. Standard: fewer regressions > coverage > fewer files > first complete. Optimize: best metric delta > fewer regressions > fewer files. No passes → reset pending for debugger.
187
+ 6. Preserve all worktrees. Losers: branch + `-failed`. Record in checkpoint.json.
188
+ 7. Log all outcomes to `.deepflow/auto-memory.yaml` under `spike_insights`+`probe_learnings` (schema in auto-cycle.md). Both winners and losers.
189
+ 8. Cherry-pick winner into shared worktree. Winner → `[x] [PROBE_WINNER]`, losers → `[~] [PROBE_FAILED]`.
294
190
 
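The standard-spike winner selection in step 5 is a deterministic ranking, which a sort key expresses directly. A minimal sketch under stated assumptions: the `ProbeResult` type and its `completion_order` field are hypothetical, and a probe counts as eligible only if its ratchet passed with zero regressions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProbeResult:
    spike_id: str
    ratchet_passed: bool
    regressions: int
    coverage_delta: float
    files_changed: int
    completion_order: int  # hypothetical: 0 = first probe to finish


def select_winner(results: List[ProbeResult]) -> Optional[ProbeResult]:
    """Pick the winning probe, or None when no probe passes."""
    eligible = [r for r in results if r.ratchet_passed and r.regressions == 0]
    if not eligible:
        return None  # caller resets all spikes to pending
    # Fewer regressions > higher coverage > fewer files > first to complete.
    return min(eligible, key=lambda r: (r.regressions, -r.coverage_delta,
                                        r.files_changed, r.completion_order))
```

For optimize probes the key would instead lead with the metric delta toward the target, which this sketch does not model.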
295
- **Role definitions:**
296
- - **contextualizada**: Builds on the best approach so far — refines, extends, or combines what worked. Prompt includes: "Build on the best result so far: {best_approach_summary}. Refine or extend it."
297
- - **contraditoria**: Tries the opposite of the current best. Prompt includes: "The best approach so far is {best_approach_summary}. Try the OPPOSITE direction — if it cached, don't cache; if it optimized hot path, optimize cold path; etc."
298
- - **ingenua**: No prior context — naive fresh attempt. Prompt includes: "Ignore all prior attempts. Approach this from scratch with no assumptions about what works."
191
+ #### 5.7.1. PROBE DIVERSITY (Optimize Probes)
299
192
 
300
- **Auto-scaling by probe round:**
193
+ Roles: **contextualizada** (refine best), **contraditoria** (opposite of best), **ingenua** (fresh, no context).
301
194
 
302
- | Probe round | Count | Required roles |
303
- |-------------|-------|----------------|
195
+ | Round | Count | Roles |
196
+ |-------|-------|-------|
304
197
  | 1st plateau | 2 | 1 contraditoria + 1 ingenua |
305
198
  | 2nd plateau | 4 | 1 contextualizada + 2 contraditoria + 1 ingenua |
306
- | 3rd+ plateau | 6 | 2 contextualizada + 2 contraditoria + 2 ingenua |
199
+ | 3rd+ | 6 | 2 contextualizada + 2 contraditoria + 2 ingenua |
307
200
 
308
- **Rules:**
309
- - Every probe set MUST include ≥1 contraditoria and ≥1 ingenua (minimum diversity)
310
- - contextualizada only added from round 2+ (needs prior data to build on)
311
- - Each probe prompt includes its role label and role-specific instruction
312
- - Probe scale persists in `optimize_state.probe_scale` in `auto-memory.yaml`
201
+ Every set: ≥1 contraditoria + ≥1 ingenua. contextualizada from round 2+ only. Scale persists in `optimize_state.probe_scale`.
313
202
 
314
203
  ### 5.9. OPTIMIZE CYCLE
315
204
 
316
- Trigger: task has `Optimize:` block in PLAN.md. Runs instead of standard single-agent spawn.
317
-
318
- **Optimize is a distinct execution mode** — one optimize task at a time, spanning N cycles until a stop condition.
319
-
320
- #### 5.9.1. INITIALIZATION
321
-
322
- 1. Parse `Optimize:` block from PLAN.md task: `metric`, `target`, `direction`, `max_cycles`, `secondary_metrics`
323
- 2. Load or initialize `optimize_state` from `.deepflow/auto-memory.yaml`:
324
- ```yaml
325
- optimize_state:
326
- task_id: "T{n}"
327
- metric_command: "{shell command}"
328
- target: {number}
329
- direction: "higher|lower"
330
- baseline: null # set on first measure
331
- current_best: null # best metric value seen
332
- best_commit: null # commit hash of best value
333
- cycles_run: 0
334
- cycles_without_improvement: 0
335
- consecutive_reverts: 0
336
- probe_scale: 0 # 0=no probes yet, 2/4/6
337
- max_cycles: {number}
338
- history: [] # [{cycle, value, delta, kept, commit}]
339
- failed_hypotheses: [] # ["{description}"]
340
- ```
341
- 3. **Measure baseline**: `cd ${WORKTREE_PATH} && eval "${metric_command}"` → parse float → store as `baseline` and `current_best`
342
- 4. Measure each secondary metric → store as `secondary_baselines`
343
- 5. Check if target already met (`direction: higher` → baseline >= target; `lower` → baseline <= target). If met → mark task `[x]`, log "target already met: {baseline}", done.
344
-
345
- #### 5.9.2. CYCLE LOOP
205
+ Trigger: task has `Optimize:` block. One at a time, N cycles until stop condition.
346
206
 
347
- Each cycle = one agent spawn + measure + keep/revert decision.
207
+ **Init:** Parse metric/target/direction/max_cycles/secondary_metrics. Load or init `optimize_state` in auto-memory.yaml (fields: task_id, metric_command, target, direction, baseline, current_best, best_commit, cycles_run, cycles_without_improvement, consecutive_reverts, probe_scale, max_cycles, history[], failed_hypotheses[]). Measure baseline (`eval` with cwd=worktree) → store as baseline+current_best. Measure secondaries. Target met → mark `[x]`, done.
348
208
 
209
+ **Cycle loop:**
349
210
  ```
350
211
  REPEAT:
351
- 1. Check stop conditions (5.9.3) → if triggered, exit loop
352
- 2. Spawn ONE optimize agent (section 6, Optimize Task prompt) with run_in_background=true
353
- 3. STOP. End turn. Wait for notification.
354
- 4. On notification:
355
- a. Run ratchet check (section 5.5) — build/test/lint must pass
356
- b. If ratchet fails → git revert HEAD --no-edit, increment consecutive_reverts, log failed hypothesis, go to step 1
357
- c. Run metric gate (section 5.5 metric gate) → measure new value
358
- d. If metric parse error → git revert HEAD --no-edit, increment consecutive_reverts, log "metric parse error"
359
- e. Compute improvement:
360
- - direction: higher → improvement = (new - current_best) / |current_best| × 100
361
- - direction: lower → improvement = (current_best - new) / |current_best| × 100
362
- - current_best == 0 → use absolute delta
363
- f. If improvement >= min_improvement_threshold (default 1%):
364
- KEEP: update current_best, best_commit, reset cycles_without_improvement=0, reset consecutive_reverts=0
365
- g. If improvement < min_improvement_threshold:
366
- REVERT: git revert HEAD --no-edit, increment cycles_without_improvement
367
- h. Increment cycles_run
368
- i. Append to history: {cycle, value, delta_pct, kept: bool, commit}
369
- j. Measure secondary metrics, check regression (WARNING only, no revert)
370
- k. Persist optimize_state to auto-memory.yaml
371
- l. Report: "⟳ T{n} cycle {N}: {old} → {new} ({+/-delta}%) {kept|reverted} [best: {current_best}, target: {target}]"
372
- m. Check context %. If ≥50% → checkpoint and exit (auto-cycle resumes).
373
- ```
212
+ 1. Check stop conditions → if triggered, exit
213
+ 2. Spawn ONE optimize agent (§6) run_in_background=true. STOP, end turn.
214
+ 3. On notification:
215
+ a. Ratchet fail → revert, ++consecutive_reverts, log hypothesis, goto 1
216
+ b. Metric parse error → revert, ++consecutive_reverts
217
+ c. improvement = (new - best) / |best| × 100 (flip for lower; absolute if best==0)
218
+ d. >= 1% threshold → KEEP, update best, reset counters
219
+ e. < threshold → REVERT, ++cycles_without_improvement
220
+ f. ++cycles_run, append history, check secondaries, persist state
221
+ g. Report: "⟳ T{n} cycle {N}: {old}→{new} ({delta}%) {kept|reverted} [best: X, target: Y]"
222
+ h. Context ≥50% → checkpoint, exit
223
+ ```
224
+
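The improvement formula from the cycle loop, including the flip for `direction: lower` and the absolute-delta fallback when the best value is zero, can be sketched as a single function. The name `improvement_pct` is hypothetical.

```python
def improvement_pct(new: float, best: float, direction: str) -> float:
    """Improvement of `new` over `best`, positive when the metric got better.

    Returns a percentage of |best|, or the absolute delta when best == 0.
    """
    delta = new - best if direction == "higher" else best - new
    if best == 0:
        return delta  # percentage is undefined, fall back to absolute delta
    return delta / abs(best) * 100
```

The result is then compared against the 1% default threshold to choose between KEEP and REVERT.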
225
+ **Stop conditions:**
226
+
227
+ | Condition | Action |
228
+ |-----------|--------|
229
+ | Target reached | Mark `[x]` |
230
+ | cycles_run >= max_cycles | Mark `[x]`. If best is worse than baseline → `git reset --hard {best_commit}` |
231
+ | 3 cycles without improvement | Launch probes (plateau) |
232
+ | 3 consecutive reverts | Halt, task `[ ]`, requires human intervention |
374
233
 
375
- #### 5.9.3. STOP CONDITIONS
376
-
377
- | Condition | Detection | Action |
378
- |-----------|-----------|--------|
379
- | **Target reached** | `direction: higher` → value >= target; `lower` → value <= target | Mark task `[x]`, log "target reached: {value}" |
380
- | **Max cycles** | `cycles_run >= max_cycles` | Mark task `[x]` with note: "max cycles reached, best: {current_best}". If current_best worse than baseline → `git reset --hard {best_commit}`, log "reverted to best-known" |
381
- | **Plateau** | `cycles_without_improvement >= 3` | Pause normal cycle → launch probes (5.9.4) |
382
- | **Circuit breaker** | `consecutive_reverts >= 3` | Halt, task stays `[ ]`, log "circuit breaker: 3 consecutive reverts". Requires human intervention. |
383
-
384
- On **max cycles** with final value worse than baseline:
385
- 1. `git reset --hard {best_commit}` in worktree
386
- 2. Log: "final value {current} worse than baseline {baseline}, reverted to best-known commit {best_commit} (value: {current_best})"
387
-
388
- #### 5.9.4. PLATEAU → PROBE LAUNCH
389
-
390
- When plateau detected (3 cycles without ≥1% improvement):
391
-
392
- 1. Pause normal optimize cycle
393
- 2. Determine probe count from `probe_scale` (section 5.7.1 auto-scaling table): 0→2, 2→4, 4→6
394
- 3. Update `probe_scale` in optimize_state
395
- 4. Record `BASELINE=$(git rev-parse HEAD)` in shared worktree
396
- 5. Create sub-worktrees per probe: `git worktree add -b df/{spec}--opt-probe-{N} .deepflow/worktrees/{spec}/opt-probe-{N} ${BASELINE}`
397
- 6. Spawn ALL probes in ONE message using Optimize Probe prompt (section 6), each with its diversity role
398
- 7. End turn. Wait for all notifications.
399
- 8. Per notification: run ratchet + metric measurement in probe worktree
400
- 9. Select winner (section 5.7 step 5, optimize ranking): best metric improvement toward target
401
- 10. Winner → cherry-pick into shared worktree, update current_best, reset cycles_without_improvement=0
402
- 11. Losers → rename branch with `-failed` suffix, preserve worktrees
403
- 12. Log all probe outcomes to `auto-memory.yaml` under `spike_insights` (reuse existing format)
404
- 13. Log probe learnings: winning approach summary + each loser's failure reason
405
- 14. Resume normal optimize cycle from step 1
406
-
407
- #### 5.9.5. STATE PERSISTENCE (auto-memory.yaml)
408
-
409
- After every cycle, write `optimize_state` to `.deepflow/auto-memory.yaml` (main tree). This ensures:
410
- - Context exhaustion at 50% → auto-cycle resumes with full history
411
- - Failed hypotheses carry forward (agents won't repeat approaches)
412
- - Probe scale persists across context windows
413
-
414
- Also append cycle results to `.deepflow/auto-report.md`:
415
- ```
416
- ## Optimize: T{n} — {metric_name}
417
- | Cycle | Value | Delta | Kept | Commit |
418
- |-------|-------|-------|------|--------|
419
- | 1 | 72.3 | — | baseline | abc123 |
420
- | 2 | 74.1 | +2.5% | ✓ | def456 |
421
- | 3 | 73.8 | -0.4% | ✗ | (reverted) |
422
- ...
423
- Best: {current_best} | Target: {target} | Status: {in_progress|reached|max_cycles|circuit_breaker}
424
- ```
234
+ **Plateau → probes:** Scale 0→2, 2→4, 4→6 per §5.7.1. Create sub-worktrees, spawn all with diversity roles (§6 Optimize Probe). Per notification: ratchet + metric. Winner → cherry-pick, update best, reset counters. Losers → `-failed`. Log outcomes. Resume cycle.
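A rough sketch of the probe scaling and role assignment, under stated assumptions: only the 0→2, 2→4, 4→6 steps come from the text, so any other scale raises an error instead of guessing. Filling the remaining slots with `contextualizada` is also an assumption; the source only requires at least one `contraditoria` and one `ingenua`. Both function names are hypothetical.

```python
# Scale steps quoted in the text; §5.7.1 may define more, which are not assumed here.
SCALE_STEPS = {0: 2, 2: 4, 4: 6}


def next_probe_count(probe_scale: int) -> int:
    """Map the current probe_scale to the number of probes to launch."""
    if probe_scale not in SCALE_STEPS:
        raise ValueError(f"no scaling step known for probe_scale={probe_scale}")
    return SCALE_STEPS[probe_scale]


def assign_roles(count: int) -> list:
    """Guarantee the diversity minimum, then fill the rest (assumed) with contextualizada."""
    if count < 2:
        raise ValueError("need at least 2 probes to satisfy the diversity rule")
    return ["contraditoria", "ingenua"] + ["contextualizada"] * (count - 2)
```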
235
+
236
+ **State persistence:** Write `optimize_state` to auto-memory.yaml after every cycle. Append results table to `.deepflow/auto-report.md`.
425
237
 
426
238
  ---
427
239
 
428
240
  ### 6. PER-TASK (agent prompt)
429
241
 
430
- > **Context engineering rationale:** Prompt order follows the attention U-curve (start/end = high attention, middle = low).
431
- > Critical instructions go at start and end. Navigable data goes in the middle.
432
- > See: Chroma "Context Rot" (2025) — performance degrades ~2%/100K tokens; distractors and semantic ambiguity compound degradation.
242
+ **Common preamble (all):** `Working directory: {worktree_absolute_path}. All file ops use this path. Commit format: {type}({spec}): {desc}`
433
243
 
434
- **Common preamble (include in all agent prompts):**
244
+ **Standard Task** (`Agent(model="{Model}", ...)`):
435
245
  ```
436
- Working directory: {worktree_absolute_path}
437
- All file operations MUST use this absolute path as base. Do NOT write files to the main project directory.
438
- Commit format: {commit_type}({spec}): {description}
246
+ --- START ---
247
+ {task_id}: {description} Files: {files} Spec: {spec}
248
+ {If reverted: DO NOT repeat: - Cycle {N}: "{reason}"}
249
+ {If spike insights exist:
250
+ spike_results:
251
+ hypothesis: {hypothesis from spike_insights}
252
+ outcome: {outcome}
253
+ edge_cases: {edge_cases}
254
+ insight: {insight from probe_learnings}
255
+ }
256
+ Success criteria: {ACs from spec relevant to this task}
257
+ --- MIDDLE (omit for low effort; omit deps for medium) ---
258
+ Impact: Callers: {file} ({why}) | Duplicates: [active→consolidate] [dead→DELETE] | Data flow: {consumers}
259
+ Prior tasks: {dep_id}: {summary}
260
+ Steps: 1. chub search/get for APIs 2. LSP findReferences, add unlisted callers 3. Read all Impact files 4. Implement 5. Commit
261
+ --- END ---
262
+ Duplicates: [active]→consolidate [dead]→DELETE. ONLY job: code+commit. No merge/rename/checkout.
263
+ Last line of your response MUST be: TASK_STATUS:pass (if successful) or TASK_STATUS:fail (if failed) or TASK_STATUS:revert (if reverted)
439
264
  ```
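The `TASK_STATUS` trailer gives the coordinator a machine-readable result. A minimal parsing sketch, assuming the coordinator has the agent's final message as a string (how it obtains that text is not specified here); `parse_task_status` is a hypothetical helper.

```python
import re
from typing import Optional

# Accept only the three statuses named in the prompt templates.
_STATUS_RE = re.compile(r"^TASK_STATUS:(pass|fail|revert)$")


def parse_task_status(response: str) -> Optional[str]:
    """Return 'pass', 'fail', or 'revert' from the last non-empty line, else None."""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    if not lines:
        return None
    match = _STATUS_RE.match(lines[-1])
    return match.group(1) if match else None
```

Returning `None` for a missing or malformed trailer lets the caller treat the result as unknown instead of guessing a status.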
440
265
 
441
- **Standard Task** (spawn with `Agent(model="{Model from PLAN.md}", ...)`):
442
-
443
- Prompt sections in order (START = high attention, MIDDLE = navigable data, END = high attention):
266
+ **Bootstrap:** `BOOTSTRAP: Write tests for edit_scope files. Do NOT change implementation. Commit as test({spec}): bootstrap. Last line: TASK_STATUS:pass or TASK_STATUS:fail`
444
267
 
268
+ **Wave Test** (`Agent(model="opus")`):
445
269
  ```
446
- --- START (high attention zone) ---
447
-
448
- {task_id}: {description from PLAN.md}
449
- Files: {target files} Spec: {spec_name}
450
-
451
- {Prior failure context — include ONLY if task was previously reverted. Read from .deepflow/auto-memory.yaml revert_history for this task_id:}
452
- DO NOT repeat these approaches:
453
- - Cycle {N}: reverted — "{reason from revert_history}"
454
- {Omit this entire block if task has no revert history.}
455
-
456
- {Acceptance criteria excerpt — extract 2-3 key ACs from the spec file (specs/doing-*.md). Include only the criteria relevant to THIS task, not the full spec.}
457
- Success criteria:
458
- - {AC relevant to this task}
459
- - {AC relevant to this task}
460
- {Omit if spec has no structured ACs.}
461
-
462
- --- MIDDLE (navigable data zone) ---
463
-
464
- {Impact block from PLAN.md — include verbatim if present. Annotate each caller with WHY it's impacted:}
465
- Impact:
466
- - Callers: {file} ({why — e.g. "imports validateToken which you're changing"})
467
- - Duplicates:
468
- - {file} [active — consolidate]
469
- - {file} [dead — DELETE]
470
- - Data flow: {consumers}
471
- {Omit if no Impact in PLAN.md.}
472
-
473
- {Dependency context — for each completed blocker task, include a one-liner summary:}
474
- Prior tasks:
475
- - {dep_task_id}: {one-line summary of what changed — e.g. "refactored validateToken to async, changed signature (string) → (string, opts)"}
476
- {Omit if task has no dependencies or all deps are bootstrap/spike tasks.}
477
-
478
- Steps:
479
- 1. External APIs/SDKs → chub search "<library>" --json → chub get <id> --lang <lang> (skip if chub unavailable or internal code only)
480
- 2. LSP freshness check: run `findReferences` on each function/type you're about to change. If callers exist beyond the Impact list, add them to your scope before implementing.
481
- 3. Read ALL files in Impact (+ any new callers from step 2) before implementing — understand the full picture
482
- 4. Implement the task, updating all impacted files
483
- 5. Commit as feat({spec}): {description}
484
-
485
- --- END (high attention zone) ---
486
-
487
- {If .deepflow/auto-memory.yaml exists and has probe_learnings, include:}
488
- Spike results (follow these approaches):
489
- {each probe_learning with outcome "winner" → "- {insight}"}
490
- {Omit this block if no probe_learnings exist.}
491
-
492
- If Impact lists duplicates: [active] → consolidate into single source of truth. [dead] → DELETE entirely.
493
- Your ONLY job is to write code and commit. Orchestrator runs health checks after.
494
- STOP after committing. Do NOT merge branches, rename spec files, remove worktrees, or run git checkout on main.
495
- ```
270
+ --- START ---
271
+ You are a QA engineer. Write unit tests for the following code changes.
272
+ Use {test_framework}. Test behavioral correctness, not implementation details.
273
+ Spec: {spec}. Task: {task_id}.
496
274
 
497
- **Effort-aware context budget:** For `Effort: low` tasks, omit the MIDDLE section entirely (no Impact, no dependency context, no steps). For `Effort: medium`, include Impact but omit dependency context. For `Effort: high`, include everything.
275
+ Implementation diff:
276
+ {IMPL_DIFF}
498
277
 
499
- **Bootstrap Task:**
500
- ```
501
- BOOTSTRAP: Write tests for files in edit_scope
502
- Files: {edit_scope files} Spec: {spec_name}
503
-
504
- Write tests covering listed files. Do NOT change implementation files.
505
- Commit as test({spec}): bootstrap tests for edit_scope
506
- ```
278
+ --- MIDDLE ---
279
+ Files changed: {changed_files}
280
+ Existing test patterns: {test_file_examples from auto-snapshot.txt, first 3}
507
281
 
508
- **Spike Task:**
282
+ --- END ---
283
+ Write thorough unit tests covering: happy paths, edge cases, error handling.
284
+ Follow existing test conventions in the codebase.
285
+ Commit as: test({spec}): wave-{N} unit tests
286
+ Do NOT modify implementation files. ONLY add/edit test files.
287
+ Last line of your response MUST be: TASK_STATUS:pass or TASK_STATUS:fail
509
288
  ```
510
- {task_id} [SPIKE]: {hypothesis}
511
- Files: {target files} Spec: {spec_name}
512
289
 
513
- {Prior failure context: include ONLY if this spike was previously reverted. Read from .deepflow/auto-memory.yaml revert_history + spike_insights for this task_id:}
514
- DO NOT repeat these approaches:
515
- - Cycle {N}: reverted — "{reason}"
516
- {Omit this entire block if no revert history.}
290
+ **Spike:** `{task_id} [SPIKE]: {hypothesis}. Files+Spec. {reverted warnings}. Minimal spike. Commit as spike({spec}): {desc}. Last line: TASK_STATUS:pass or TASK_STATUS:fail`
517
291
 
518
- Implement minimal spike to validate hypothesis.
519
- Commit as spike({spec}): {description}
292
+ **Optimize Task** (`Agent(model="opus")`):
520
293
  ```
521
-
522
- **Optimize Task** (spawn with `Agent(model="opus", subagent_type="general-purpose")`):
523
-
524
- One agent per cycle. Agent makes ONE atomic change to improve the metric.
525
-
294
+ --- START ---
295
+ {task_id} [OPTIMIZE]: {metric} cycle {N}/{max}. Files+Spec.
296
+ Current: {val} (baseline: {b}, best: {best}). Target: {t} ({dir}). Metric: {cmd}
297
+ CONSTRAINT: ONE atomic change.
298
+ --- MIDDLE ---
299
+ Last 5 cycles + failed hypotheses + Impact/deps.
300
+ --- END ---
301
+ {Learnings}. ONE change + commit. No metric run, no multiple changes.
302
+ Last line of your response MUST be: TASK_STATUS:pass or TASK_STATUS:fail or TASK_STATUS:revert
526
303
  ```
527
- --- START (high attention zone) ---
528
-
529
- {task_id} [OPTIMIZE]: Improve {metric_name} — cycle {N}/{max_cycles}
530
- Files: {target files} Spec: {spec_name}
531
-
532
- Current metric: {current_value} (baseline: {baseline}, best: {current_best})
533
- Target: {target} ({direction})
534
- Improvement needed: {delta_to_target} ({direction})
535
-
536
- CONSTRAINT: Make exactly ONE atomic change. Do not refactor broadly.
537
- The metric is measured by: {metric_command}
538
- You succeed if the metric moves toward {target} after your change.
539
-
540
- --- MIDDLE (navigable data zone) ---
541
-
542
- Attempt history (last 5 cycles):
543
- {For each recent history entry:}
544
- - Cycle {N}: {value} ({+/-delta}%) — {kept|reverted} — "{one-line description of what was tried}"
545
- {Omit if cycle 1.}
546
-
547
- DO NOT repeat these failed approaches:
548
- {For each failed_hypothesis in optimize_state:}
549
- - "{hypothesis description}"
550
- {Omit if no failed hypotheses.}
551
304
 
552
- {Impact block from PLAN.md if present}
553
-
554
- {Dependency context if present}
555
-
556
- Steps:
557
- 1. Analyze the metric command to understand what's being measured
558
- 2. Read the target files and identify ONE specific improvement
559
- 3. Implement the change (ONE atomic modification)
560
- 4. Commit as feat({spec}): optimize {metric_name} — {what you changed}
561
-
562
- --- END (high attention zone) ---
563
-
564
- {Spike/probe learnings if any}
565
-
566
- Your ONLY job is to make ONE atomic change and commit. Orchestrator measures the metric after.
567
- Do NOT run the metric command yourself. Do NOT make multiple changes.
568
- STOP after committing. Do NOT merge branches, rename spec files, remove worktrees, or run git checkout on main.
305
+ **Optimize Probe** (`Agent(model="opus")`):
306
+ ```
307
+ --- START ---
308
+ {task_id} [OPTIMIZE PROBE]: {metric} — probe {id} ({role})
309
+ Current/Target. Role instruction:
310
+ contextualizada: "Build on best: {summary}. Refine."
311
+ contraditoria: "Best was: {summary}. Try OPPOSITE."
312
+ ingenua: "Ignore prior. Fresh approach."
313
+ --- MIDDLE ---
314
+ Full history + all failed hypotheses.
315
+ --- END ---
316
+ ONE atomic change. Commit. STOP.
317
+ Last line of your response MUST be: TASK_STATUS:pass or TASK_STATUS:fail or TASK_STATUS:revert
569
318
  ```
570
319
 
571
- **Optimize Probe Task** (spawn with `Agent(model="opus", subagent_type="general-purpose")`):
572
-
573
- Used during plateau resolution. Each probe has a diversity role.
574
-
320
+ **Final Test** (`Agent(model="opus")`):
575
321
  ```
576
- --- START (high attention zone) ---
322
+ --- START ---
323
+ You are an independent QA engineer. You have ONLY the spec and exported interfaces below.
324
+ You cannot read implementation files — you must treat the system as a black box.
325
+ Write integration tests that verify EACH acceptance criterion from the spec.
577
326
 
578
- {task_id} [OPTIMIZE PROBE]: {metric_name} — probe {probe_id} ({role_label})
579
- Files: {target files} Spec: {spec_name}
327
+ Spec:
328
+ {SPEC_CONTENT}
580
329
 
581
- Current metric: {current_value} (baseline: {baseline}, best: {current_best})
582
- Target: {target} ({direction})
330
+ Exported interfaces:
331
+ {EXPORTED_INTERFACES}
583
332
 
584
- Role: {role_label}
585
- {role_instruction, one of:}
586
- contextualizada: "Build on the best approach so far: {best_approach_summary}. Refine, extend, or combine what worked."
587
- contraditoria: "The best approach so far was: {best_approach_summary}. Try the OPPOSITE: if it optimized X, try Y instead. Challenge the current direction."
588
- ingenua: "Ignore all prior attempts. Approach this metric from scratch with no assumptions about what has or hasn't worked."
333
+ --- END ---
334
+ Write integration tests covering every AC in the spec.
335
+ Test through public interfaces only: no internal imports, no implementation details.
336
+ If an AC cannot be tested through exports alone, write a test stub with a TODO comment explaining why.
337
+ Commit as: test({spec}): integration tests
338
+ Do NOT read or modify implementation files. ONLY add/edit test files.
339
+ Last line of your response MUST be: TASK_STATUS:pass or TASK_STATUS:fail
340
+ ```
589
341
 
590
- --- MIDDLE (navigable data zone) ---
342
+ ### 8. COMPLETE SPECS
591
343
 
592
- Full attempt history:
593
- {ALL history entries from optimize_state}
594
- - Cycle {N}: {value} ({+/-delta}%) — {kept|reverted}
344
+ <!-- AC-10: After all waves, Opus black-box test agent spawns with spec + exports only (no implementation) -->
345
+ <!-- AC-11: Final integration tests must all pass before merge proceeds; failure blocks merge -->
595
346
 
596
- All failed approaches (DO NOT repeat):
597
- {ALL failed_hypotheses}
598
- - "{hypothesis description}"
347
+ All tasks done for `doing-*` spec:
599
348
 
600
- --- END (high attention zone) ---
349
+ **8.1. Final Test Agent (black-box integration tests):**
601
350
 
602
- Make ONE atomic change that moves the metric toward {target}.
603
- Commit as feat({spec}): optimize probe {probe_id} — {what you changed}
604
- STOP after committing.
605
- ```
351
+ Before merge, spawn an independent Opus QA agent that sees ONLY the spec and exported interfaces — never implementation source.
606
352
 
607
- ### 8. COMPLETE SPECS
608
-
609
- When all tasks done for a `doing-*` spec:
610
- 1. Run `/df:verify doing-{name}` via the Skill tool (`skill: "df:verify", args: "doing-{name}"`)
611
- - Verify runs quality gates (L0-L4), merges worktree branch to main, cleans up worktree, renames spec `doing-*` → `done-*`, and extracts decisions
612
- - If verify fails (adds fix tasks): stop here — `/df:execute --continue` will pick up the fix tasks
613
- - If verify passes: proceed to step 2
614
- 2. Remove spec's ENTIRE section from PLAN.md (header, tasks, summaries, fix tasks, separators)
615
- 3. Recalculate Summary table at top of PLAN.md
353
+ 1. Extract exported interfaces from the worktree (public API surface):
354
+ ```bash
355
+ # Collect exported symbols; adapt pattern to language
356
+ (cd "${WORKTREE_PATH}" && git diff main --name-only | xargs grep -h '^\(export\|pub \|func \|def \)' 2>/dev/null) | head -100
357
+ ```
358
+ Store result as `EXPORTED_INTERFACES`. Also load spec content: `cat specs/doing-{name}.md` → `SPEC_CONTENT`.
359
+
360
+ 2. Spawn `Agent(model="opus")` with Final Test prompt (§6). `run_in_background=true`. End turn, wait.
361
+
362
+ 3. On notification:
363
+ a. Run ratchet check (§5.5) — all integration tests must pass.
364
+ b. **Tests pass** → commit stands. Proceed to step 8.2 (merge).
365
+ c. **Tests fail** → **merge is blocked**. Do NOT retry. Report:
366
+ `"✗ Final integration tests failed for {spec} — merge blocked, requires human review"`
367
+ Leave worktree intact. Set all spec tasks back to `TaskUpdate(status: "pending")`.
368
+ Write failure details to `.deepflow/results/final-test-{spec}.yaml`:
369
+ ```yaml
370
+ spec: {spec}
371
+ status: blocked
372
+ reason: "Final integration tests failed"
373
+ output: |
374
+ {truncated test output — last 30 lines}
375
+ ```
376
+ STOP. Do not proceed to merge.
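The failure record in step 3c can be produced without a YAML library. The sketch below assumes the spec name is a plain token that needs no quoting and that the test output is available as a string; `final_test_report` is a hypothetical helper and only builds the text, leaving the write to `.deepflow/results/final-test-{spec}.yaml` to the caller.

```python
def final_test_report(spec: str, output: str, max_lines: int = 30) -> str:
    """Build the blocked-merge YAML, keeping only the last max_lines of test output."""
    tail = output.splitlines()[-max_lines:]
    # Indent each line so it sits inside the YAML literal block scalar ("|").
    body = "\n".join("  " + line for line in tail)
    return (
        f"spec: {spec}\n"
        "status: blocked\n"
        'reason: "Final integration tests failed"\n'
        "output: |\n"
        f"{body}\n"
    )
```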
377
+
378
+ **8.2. Merge and cleanup:**
379
+ 1. `skill: "df:verify", args: "doing-{name}"` — runs L0-L4 gates, merges, cleans worktree, renames doing→done, extracts decisions. Fail (fix tasks added) → stop; `--continue` picks them up.
380
+ 2. Remove spec's ENTIRE section from PLAN.md. Recalculate Summary table.
616
381
 
617
382
  ---
618
383
 
619
384
  ## Usage
620
385
 
621
386
  ```
622
- /df:execute # Execute all ready tasks
623
- /df:execute T1 T2 # Specific tasks only
624
- /df:execute --continue # Resume from checkpoint
387
+ /df:execute # All ready tasks
388
+ /df:execute T1 T2 # Specific tasks
389
+ /df:execute --continue # Resume checkpoint
625
390
  /df:execute --fresh # Ignore checkpoint
626
391
  /df:execute --dry-run # Show plan only
627
392
  ```
628
393
 
629
394
  ## Skills & Agents
630
395
 
631
- - Skill: `atomic-commits`: Clean commit protocol
632
- - Skill: `browse-fetch` — Fetch live web pages and external API docs via browser before coding
633
-
634
- | Agent | subagent_type | Purpose |
635
- |-------|---------------|---------|
636
- | Implementation | `general-purpose` | Task implementation |
637
- | Debugger | `reasoner` | Debugging failures |
396
+ Skills: `atomic-commits`, `browse-fetch`. Agents: Implementation (`general-purpose`), Debugger (`reasoner`).
638
397
 
639
- **Model + effort routing:** Read `Model:` and `Effort:` fields from each task block in PLAN.md. Pass `model:` parameter when spawning the agent. Prepend effort instruction to the agent prompt. Defaults: `Model: sonnet`, `Effort: medium`.
398
+ **Model+effort routing** (read from PLAN.md, defaults: sonnet/medium):
640
399
 
641
- | Task fields | Agent call | Prompt preamble |
642
- |-------------|-----------|-----------------|
643
- | `Model: haiku, Effort: low` | `Agent(model="haiku", ...)` | `You MUST be maximally efficient: skip explanations, minimize tool calls, go straight to implementation.` |
644
- | `Model: sonnet, Effort: medium` | `Agent(model="sonnet", ...)` | `Be direct and efficient. Explain only when the logic is non-obvious.` |
645
- | `Model: opus, Effort: high` | `Agent(model="opus", ...)` | _(no preamble — default behavior)_ |
646
- | (missing) | `Agent(model="sonnet", ...)` | `Be direct and efficient. Explain only when the logic is non-obvious.` |
400
+ | Fields | Agent | Preamble |
401
+ |--------|-------|----------|
402
+ | haiku/low | `Agent(model="haiku")` | `Maximally efficient: skip explanations, minimize tool calls, straight to implementation.` |
403
+ | sonnet/medium | `Agent(model="sonnet")` | `Direct and efficient. Explain only non-obvious logic.` |
404
+ | opus/high | `Agent(model="opus")` | _(none)_ |
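The routing table amounts to a lookup with defaults. A minimal sketch, assuming the task block has already been parsed into a dict with optional `Model` and `Effort` keys and that the preamble depends on effort alone; `route` and the dict shape are hypothetical, while the preamble strings are copied from the table.

```python
# Preamble text per effort level, taken from the routing table.
PREAMBLES = {
    "low": "Maximally efficient: skip explanations, minimize tool calls, straight to implementation.",
    "medium": "Direct and efficient. Explain only non-obvious logic.",
    "high": "",  # no preamble for high effort
}


def route(task_fields: dict) -> tuple:
    """Return (model, preamble) for a task, defaulting to sonnet/medium."""
    model = task_fields.get("Model", "sonnet")
    effort = task_fields.get("Effort", "medium")
    return model, PREAMBLES[effort]
```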
647
405
 
648
- **Effort preamble rules:**
649
- - `low` → Prepend efficiency instruction. Agent should make fewest possible tool calls.
650
- - `medium` → Prepend balanced instruction. Agent skips preamble but explains non-obvious decisions.
651
- - `high` → No preamble added. Agent uses full reasoning capabilities.
652
-
653
- **Checkpoint schema:** `.deepflow/checkpoint.json` in worktree:
654
- ```json
655
- {"completed_tasks": ["T1","T2"], "current_wave": 2, "worktree_path": ".deepflow/worktrees/upload", "worktree_branch": "df/upload"}
656
- ```
657
-
658
- ---
406
+ **Checkpoint:** `.deepflow/checkpoint.json`: `{"completed_tasks":["T1"],"current_wave":2,"worktree_path":"...","worktree_branch":"df/..."}`
659
407
 
660
408
  ## Failure Handling
661
409
 
662
- When task fails ratchet and is reverted:
663
- - `TaskUpdate(taskId: native_id, status: "pending")` — dependents remain blocked
664
- - Repeated failure → spawn `Task(subagent_type="reasoner", prompt="Debug failure: {ratchet output}")`
665
- - Leave worktree intact, keep checkpoint.json
666
- - Output: worktree path/branch, `cd {path}` to investigate, `--continue` to resume, `--fresh` to discard
410
+ Reverted task: `TaskUpdate(status: "pending")`, dependents stay blocked. Repeated failure → spawn reasoner debugger. Leave worktree+checkpoint intact. Output: path, `cd` command, `--continue`/`--fresh` options.
667
411
 
668
412
  ## Rules
669
413
 
670
414
  | Rule | Detail |
671
415
  |------|--------|
672
- | Zero test files → bootstrap first | Bootstrap is cycle's sole task when snapshot empty |
416
+ | Zero tests → bootstrap first | Sole task when snapshot empty |
673
417
  | 1 task = 1 agent = 1 commit | `atomic-commits` skill |
674
- | 1 file = 1 writer | Sequential if conflict |
675
- | Agent writes code, orchestrator measures | Ratchet is the judge |
676
- | No LLM evaluates LLM work | Health checks only |
677
- | ≥2 spikes same problem → parallel probes | Never run competing spikes sequentially |
678
- | All probe worktrees preserved | Losers renamed `-failed`; never deleted |
679
- | Machine-selected winner | Regressions > coverage > files changed; no LLM judge |
418
+ | 1 file = 1 writer | Sequential on conflict |
419
+ | Agent codes, orchestrator measures | Ratchet judges |
420
+ | No LLM evaluates LLM | Health checks only |
421
+ | ≥2 spikes → parallel probes | Never sequential |
422
+ | Probe worktrees preserved | Losers `-failed`, never deleted |
423
+ | Machine-selected winner | Regressions > coverage > files; no LLM judge |
680
424
  | External APIs → chub first | Skip if unavailable |
681
- | 1 optimize task at a time | Inherently sequential — no parallel optimize tasks |
682
- | Optimize = atomic changes only | One modification per cycle for diagnosability |
683
- | Ratchet + metric = both required | Optimize keeps commit only if ratchet AND metric improve |
684
- | Plateau → probes, not more cycles | 3 cycles without ≥1% improvement triggers probe launch |
685
- | Circuit breaker = 3 consecutive reverts | Halts optimize loop, requires human intervention |
686
- | Optimize probes need diversity | Every probe set: ≥1 contraditoria + ≥1 ingenua minimum |
687
-
688
- ## Example
689
-
690
- ```
691
- /df:execute (context: 12%)
692
-
693
- Loading PLAN.md... T1 ready, T2/T3 blocked by T1
694
- Ratchet snapshot: 24 pre-existing test files
695
-
696
- Wave 1: TaskUpdate(T1, in_progress)
697
- [Agent "T1" completed]
698
- Running ratchet: build ✓ | tests ✓ (24 passed) | typecheck ✓
699
- ✓ T1: ratchet passed (abc1234)
700
- TaskUpdate(T1, completed) → auto-unblocks T2, T3
701
-
702
- Wave 2: TaskUpdate(T2/T3, in_progress)
703
- [Agent "T2" completed] ✓ T2: ratchet passed (def5678)
704
- [Agent "T3" completed] ✓ T3: ratchet passed (ghi9012)
705
-
706
- Context: 35% — All tasks done for doing-upload.
707
- Running /df:verify doing-upload...
708
- ✓ L0 | ✓ L1 (3/3 files) | ⚠ L2 (no coverage tool) | ✓ L4 (24 tests)
709
- ✓ Merged df/upload to main
710
- ✓ Spec complete: doing-upload → done-upload
711
- Complete: 3/3
712
- ```
425
+ | 1 optimize at a time | Sequential |
426
+ | Optimize = atomic only | One change per cycle |
427
+ | Ratchet + metric both required | Keep only if both pass |
428
+ | Plateau → probes | 3 cycles <1% triggers probes |
429
+ | Circuit breaker = 3 reverts | Halts, needs human |
430
+ | Wave test after ratchet | Opus writes tests; 3 attempts then revert |
431
+ | Final test before merge | Opus black-box integration tests; failure blocks merge, no retry |
432
+ | Probe diversity | ≥1 contraditoria + ≥1 ingenua |