maestro-flow 0.4.2 → 0.4.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude/commands/maestro-analyze.md +1 -1
- package/.claude/commands/maestro-brainstorm.md +1 -1
- package/.claude/commands/maestro-collab.md +1 -1
- package/.claude/commands/maestro-execute.md +10 -1
- package/.claude/commands/maestro-guard.md +101 -0
- package/.claude/commands/maestro-impeccable.md +1 -1
- package/.claude/commands/maestro-plan.md +15 -2
- package/.claude/commands/maestro-ralph-execute.md +9 -2
- package/.claude/commands/maestro-ralph.md +8 -1
- package/.claude/commands/maestro-verify.md +15 -1
- package/.claude/commands/quality-auto-test.md +1 -1
- package/.claude/commands/quality-debug.md +1 -1
- package/.claude/commands/quality-refactor.md +1 -1
- package/.claude/commands/quality-retrospective.md +1 -1
- package/.claude/commands/quality-review.md +15 -1
- package/.claude/commands/quality-test.md +1 -1
- package/.claude/commands/security-audit.md +154 -0
- package/.claude/skills/maestro-help/index/catalog.json +2 -0
- package/.codex/skills/maestro-analyze/SKILL.md +18 -1
- package/.codex/skills/maestro-brainstorm/SKILL.md +17 -4
- package/.codex/skills/maestro-collab/SKILL.md +7 -1
- package/.codex/skills/maestro-execute/SKILL.md +365 -348
- package/.codex/skills/maestro-guard/SKILL.md +97 -0
- package/.codex/skills/maestro-impeccable/SKILL.md +1 -1
- package/.codex/skills/maestro-plan/SKILL.md +66 -7
- package/.codex/skills/maestro-ralph/SKILL.md +1 -1
- package/.codex/skills/maestro-verify/SKILL.md +18 -1
- package/.codex/skills/quality-auto-test/SKILL.md +13 -3
- package/.codex/skills/quality-debug/SKILL.md +362 -346
- package/.codex/skills/quality-refactor/SKILL.md +1 -1
- package/.codex/skills/quality-retrospective/SKILL.md +292 -292
- package/.codex/skills/quality-review/SKILL.md +374 -365
- package/.codex/skills/quality-test/SKILL.md +1 -1
- package/.codex/skills/security-audit/SKILL.md +154 -0
- package/bin/maestro-hook-runner.js +21 -1
- package/dashboard/dist-server/src/coordinator/output-parser.js +27 -0
- package/dashboard/dist-server/src/coordinator/output-parser.js.map +1 -1
- package/dist/src/commands/coordinate.d.ts.map +1 -1
- package/dist/src/commands/coordinate.js +2 -0
- package/dist/src/commands/coordinate.js.map +1 -1
- package/dist/src/commands/hooks.d.ts.map +1 -1
- package/dist/src/commands/hooks.js +39 -3
- package/dist/src/commands/hooks.js.map +1 -1
- package/dist/src/coordinator/output-parser.d.ts.map +1 -1
- package/dist/src/coordinator/output-parser.js +27 -0
- package/dist/src/coordinator/output-parser.js.map +1 -1
- package/dist/src/hooks/delegate-monitor.d.ts +1 -0
- package/dist/src/hooks/delegate-monitor.d.ts.map +1 -1
- package/dist/src/hooks/delegate-monitor.js +1 -1
- package/dist/src/hooks/delegate-monitor.js.map +1 -1
- package/dist/src/hooks/guards/workflow-guard.d.ts +15 -0
- package/dist/src/hooks/guards/workflow-guard.d.ts.map +1 -1
- package/dist/src/hooks/guards/workflow-guard.js +61 -1
- package/dist/src/hooks/guards/workflow-guard.js.map +1 -1
- package/dist/src/hooks/plugins/decision-log-plugin.d.ts +19 -0
- package/dist/src/hooks/plugins/decision-log-plugin.d.ts.map +1 -0
- package/dist/src/hooks/plugins/decision-log-plugin.js +28 -0
- package/dist/src/hooks/plugins/decision-log-plugin.js.map +1 -0
- package/dist/src/hooks/plugins/index.d.ts +2 -0
- package/dist/src/hooks/plugins/index.d.ts.map +1 -1
- package/dist/src/hooks/plugins/index.js +1 -0
- package/dist/src/hooks/plugins/index.js.map +1 -1
- package/dist/src/hooks/session-context.d.ts +1 -0
- package/dist/src/hooks/session-context.d.ts.map +1 -1
- package/dist/src/hooks/session-context.js +1 -1
- package/dist/src/hooks/session-context.js.map +1 -1
- package/dist/src/hooks/skill-context.d.ts +1 -0
- package/dist/src/hooks/skill-context.d.ts.map +1 -1
- package/dist/src/hooks/skill-context.js +1 -1
- package/dist/src/hooks/skill-context.js.map +1 -1
- package/dist/src/hooks/spec-injector.d.ts.map +1 -1
- package/dist/src/hooks/spec-injector.js +2 -0
- package/dist/src/hooks/spec-injector.js.map +1 -1
- package/package.json +1 -1
- package/workflows/debug.md +73 -0
- package/workflows/execute.md +27 -0
- package/workflows/plan.md +11 -0
- package/workflows/review.md +33 -1
- package/workflows/tdd.md +257 -0
- package/workflows/verify.md +57 -0
package/workflows/execute.md (CHANGED)

```diff
@@ -6,6 +6,33 @@ Core principle: **Execute per-plan, not per-phase.** Each plan's wave DAG runs i
 
 ---
 
+## Iron Law
+
+**VERIFY EACH TASK OUTPUT BEFORE MARKING COMPLETE.**
+
+Every task completion requires:
+1. Run convergence criteria checks (not just code review)
+2. Confirm output matches task definition expectations
+3. Evidence of verification in the task summary
+
+No task may be marked "completed" based on agent self-report alone.
+
+---
+
+## Red Flags — These Thoughts Mean STOP
+
+If you catch yourself thinking any of these, STOP and verify:
+
+- "The agent said it's done, so it must be done"
+- "I'll batch-verify all tasks at the end instead of per-task"
+- "This task is too simple to need verification"
+- "The warning isn't relevant, I'll ignore it"
+- "Let me mark it complete and fix the issue later"
+
+All of these mean: **run convergence criteria check NOW before marking the task complete**.
+
+---
+
 ## Prerequisites
 
 - Plan exists in scratch directory: `plan.json` + `.task/TASK-*.json`
```
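The execute.md Iron Law amounts to a gate in front of the `completed` status. Below is a minimal TypeScript sketch of such a gate. It assumes hypothetical `TaskRecord` and `CriterionCheck` shapes; nothing with these names appears in the diff. A task passes only when every `convergence.criteria[]` entry has a passing check with recorded evidence. The agent's self-report plays no part.

```typescript
// Hypothetical shapes: maestro-flow's real task and summary types may differ.
interface CriterionCheck {
  criterion: string; // one entry from convergence.criteria[]
  passed: boolean;   // result of actually running the check, not of reading the code
  evidence: string;  // command output, exit code, file:line, ...
}

interface TaskRecord {
  id: string;
  criteria: string[];       // convergence.criteria[] from .task/TASK-*.json
  agentSaysDone: boolean;   // agent self-report: recorded, but never consulted below
  checks: CriterionCheck[]; // checks run for THIS task (no batching across tasks)
}

type GateResult =
  | { ok: true; summary: string }
  | { ok: false; reasons: string[] };

// Allows "completed" only if every criterion has a passing check with evidence.
function canMarkComplete(task: TaskRecord): GateResult {
  const reasons: string[] = [];
  if (task.criteria.length === 0) {
    reasons.push(`${task.id}: no convergence criteria to verify against`);
  }
  for (const criterion of task.criteria) {
    const matches = task.checks.filter((c) => c.criterion === criterion);
    const check = matches.length > 0 ? matches[0] : undefined;
    if (check === undefined) {
      reasons.push(`${task.id}: no check run for "${criterion}"`);
    } else if (!check.passed) {
      reasons.push(`${task.id}: check failed for "${criterion}"`);
    } else if (check.evidence.trim() === "") {
      reasons.push(`${task.id}: no evidence recorded for "${criterion}"`);
    }
  }
  if (reasons.length > 0) {
    return { ok: false, reasons };
  }
  // Evidence lines suitable for the task summary.
  const summary = task.checks
    .map((c) => `- ${c.criterion}: PASS (${c.evidence})`)
    .join("\n");
  return { ok: true, summary };
}
```

Calling the gate once per task, right before the status change, rules out the "batch-verify at the end" shortcut. The returned `summary` is one possible way to carry the verification evidence into the task summary.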
package/workflows/plan.md (CHANGED)

````diff
@@ -47,6 +47,7 @@ OUTPUT_DIR = .workflow/scratch/plan-{PHASE_SLUG or milestone_slug}-{date}/
 | `--dir <path>` | Use arbitrary directory instead of phase resolution (skip roadmap validation) |
 | `--revise [instructions]` | Revise existing plan (skip P1-P3, load → modify → P4). Auto-discovers latest plan or use `--dir` |
 | `--check <plan-dir>` | Standalone plan verification (P4 only, read-only) |
+| `--tdd` | Generate TDD task chains (RED-GREEN-REFACTOR triplets). Load `@~/.maestro/workflows/tdd.md` for discipline and task structure |
 
 ---
 
@@ -55,9 +56,19 @@ OUTPUT_DIR = .workflow/scratch/plan-{PHASE_SLUG or milestone_slug}-{date}/
 ```
 --check <plan-dir> → Check Mode (P4 only, read-only)
 --revise → Revise Mode (load → modify → P4)
+--tdd → TDD Mode: P1 → P2 → P3 (with TDD task chain generation) → P4 → P4.5 → P5
 default → Create Mode: P1 → P2 → P3 → P4 → P4.5 → P5
 ```
 
+### TDD Mode
+
+When `--tdd` is active:
+1. Read `@~/.maestro/workflows/tdd.md` for TDD discipline, Iron Law, and task chain structure
+2. In P3 (Planning), decompose each behavior into RED-GREEN-REFACTOR triplets per `tdd.md § Task Chain Generation`
+3. Set `plan.json.tdd_mode = true` and include `tdd_groups[]`
+4. Wave assignment follows TDD dependency rules: `{N}a → {N}b → {N}c`
+5. Output is standard plan.json + .task/TASK-*.json — consumable by `maestro-execute` without modification
+
 ---
 
 ## P1: Context Collection
````
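The plan.md routing block reads as a first-match table over the invocation flags. Here is a small TypeScript sketch of that reading. `resolvePlanMode` and `ModeRoute` are hypothetical names. The check → revise → tdd → default precedence is assumed from the row order, because the workflow does not say what happens when flags are combined.

```typescript
// Hypothetical sketch of plan.md's mode-routing block. Treating the rows as
// "first match wins" is an assumption; the workflow does not state precedence.
type PlanMode = "check" | "revise" | "tdd" | "create";

interface ModeRoute {
  mode: PlanMode;
  phases: string[];
}

function resolvePlanMode(args: string[]): ModeRoute {
  const has = (flag: string): boolean => args.indexOf(flag) !== -1;
  if (has("--check")) {
    return { mode: "check", phases: ["P4"] }; // standalone, read-only
  }
  if (has("--revise")) {
    return { mode: "revise", phases: ["load", "modify", "P4"] };
  }
  if (has("--tdd")) {
    // Create Mode's pipeline; P3 additionally emits RED-GREEN-REFACTOR triplets.
    return { mode: "tdd", phases: ["P1", "P2", "P3+tdd", "P4", "P4.5", "P5"] };
  }
  return { mode: "create", phases: ["P1", "P2", "P3", "P4", "P4.5", "P5"] };
}
```

For example, `resolvePlanMode(["3", "--tdd"])` selects the TDD route. Its phases differ from Create Mode only in P3.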
package/workflows/review.md (CHANGED)

```diff
@@ -4,6 +4,37 @@ Tiered multi-dimensional code review with parallel agents, severity classificati
 
 ---
 
+## Spec Compliance Pre-Check (Phase 0)
+
+**Before any dimensional code quality review**, verify the implementation matches its spec:
+
+1. Load `convergence.criteria[]` from each `.task/TASK-{NNN}.json` in the phase
+2. For each criterion, check if the code implements it (grep for functions, endpoints, components named in the criterion)
+3. Classify each: **MET** (evidence found) | **UNMET** (not implemented) | **PARTIAL** (incomplete)
+
+| Result | Action |
+|--------|--------|
+| All MET | Proceed to Step 1 (dimensional review) |
+| Any UNMET | Report as spec_compliance_failures, add to findings with severity=critical, dimension="spec-compliance" |
+| Any PARTIAL | Report with severity=high |
+
+This prevents code that is well-written but doesn't meet requirements from passing review.
+
+---
+
+## Receiving Review Feedback
+
+When external review feedback is received (from human reviewers, PR comments, or other agents):
+
+1. **Verify before implementing** — Check each suggestion against codebase reality. Reviewer may lack full context.
+2. **Technical acknowledgment only** — No performative agreement ("You're absolutely right!", "Great point!"). Just state the fix or provide technical reasoning.
+3. **Push back when wrong** — If a suggestion would break existing functionality, violate architecture constraints, or is technically incorrect for this codebase, explain why with evidence.
+4. **YAGNI check** — If reviewer suggests adding a feature/abstraction, verify it's actually needed. Unused features should be questioned.
+5. **Implement one at a time** — Fix one item, test, then move to next. Never batch-implement all feedback at once.
+6. **Priority order** — Blocking issues (breaks, security) → Simple fixes (typos, imports) → Complex fixes (refactoring, logic).
+
+---
+
 ## Prerequisites
 
 - Phase execution completed (task summaries exist)
```
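Phase 0 starts from criteria that have already been classified. The grep-based classification itself depends on the codebase and is left out here. The TypeScript sketch below covers only the mapping from classified criteria to findings, and its type names are hypothetical. UNMET becomes a `critical` finding and PARTIAL a `high` one, both with `dimension: "spec-compliance"`. The table does not say whether the dimensional review still runs once failures are recorded, so the sketch only reports whether every criterion was MET.

```typescript
// Hypothetical types; statuses and severities follow the Phase 0 table.
type CriterionStatus = "MET" | "UNMET" | "PARTIAL";

interface CriterionResult {
  taskId: string;    // e.g. "TASK-001", from .task/TASK-{NNN}.json
  criterion: string; // one entry of convergence.criteria[]
  status: CriterionStatus;
}

interface SpecFinding {
  dimension: "spec-compliance";
  severity: "critical" | "high";
  taskId: string;
  criterion: string;
}

interface PreCheckOutcome {
  allMet: boolean; // true only for the "All MET" row
  findings: SpecFinding[];
}

function specCompliancePreCheck(results: CriterionResult[]): PreCheckOutcome {
  const findings: SpecFinding[] = [];
  for (const r of results) {
    if (r.status === "MET") continue;
    findings.push({
      dimension: "spec-compliance",
      // UNMET -> critical, PARTIAL -> high
      severity: r.status === "UNMET" ? "critical" : "high",
      taskId: r.taskId,
      criterion: r.criterion,
    });
  }
  return { allMet: findings.length === 0, findings };
}
```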
```diff
@@ -428,7 +459,8 @@ Next steps:
 |---------|------------|
 | PASS | Skill({ skill: "quality-test", args: "{phase}" }) for UAT, or Skill({ skill: "maestro-milestone-audit" }) if UAT already passed |
 | WARN | Review findings, then Skill({ skill: "quality-test", args: "{phase}" }) — acknowledge warnings before proceeding |
-| BLOCK
+| BLOCK (≤3 findings, all medium/low) | **Lightweight fix loop**: fix inline → re-run review on affected files only → repeat until PASS/WARN (max 2 iterations) |
+| BLOCK (>3 findings or any critical) | Full fix cycle: Skill({ skill: "maestro-plan", args: "{phase} --gaps" }) -> Skill({ skill: "maestro-execute", args: "{phase}" }) -> re-run Skill({ skill: "quality-review", args: "{phase}" }) |
 
 ---
 
```
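The revised BLOCK rows split on finding count and severity. The TypeScript sketch below shows that routing plus the bounded lightweight loop; the function and type names are hypothetical. The two rows do not cover a BLOCK with three or fewer findings that includes a `high` but no `critical`. The sketch sends that case to the full fix cycle, which is an assumption. The table also does not say what follows when the loop ends still in BLOCK, so `runLightweightLoop` simply reports the final verdict.

```typescript
type Severity = "critical" | "high" | "medium" | "low";
type Verdict = "PASS" | "WARN" | "BLOCK";
type NextStep =
  | "quality-test"          // PASS (or milestone audit if UAT already passed)
  | "acknowledge-warnings"  // WARN, then quality-test
  | "lightweight-fix-loop"  // BLOCK, <=3 findings, all medium/low
  | "full-fix-cycle";       // BLOCK, >3 findings or any critical

function routeReview(verdict: Verdict, severities: Severity[]): NextStep {
  if (verdict === "PASS") return "quality-test";
  if (verdict === "WARN") return "acknowledge-warnings";
  const smallAndMinor =
    severities.length <= 3 &&
    severities.every((s) => s === "medium" || s === "low");
  // Assumption: a BLOCK with <=3 findings that include a "high" (but no
  // "critical") matches neither row; it is sent to the full cycle here.
  return smallAndMinor ? "lightweight-fix-loop" : "full-fix-cycle";
}

const MAX_LIGHTWEIGHT_ITERATIONS = 2;

// Each iteration: fix inline, then re-review only the affected files.
function runLightweightLoop(
  fixAndReReview: (iteration: number) => Verdict
): { finalVerdict: Verdict; iterations: number } {
  let verdict: Verdict = "BLOCK";
  let iterations = 0;
  while (verdict === "BLOCK" && iterations < MAX_LIGHTWEIGHT_ITERATIONS) {
    iterations += 1;
    verdict = fixAndReReview(iterations);
  }
  return { finalVerdict: verdict, iterations };
}
```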
package/workflows/tdd.md (ADDED)

````diff
@@ -0,0 +1,257 @@
+# TDD Workflow
+
+Test-Driven Development discipline for plan generation and execution. Invoked when `maestro-plan --tdd` is used. Transforms feature requirements into RED-GREEN-REFACTOR task chains that `maestro-execute` can consume.
+
+---
+
+## Iron Law
+
+**NO PRODUCTION CODE WITHOUT A FAILING TEST FIRST.**
+
+Write code before the test? Delete it. Start over.
+- Don't keep it as "reference"
+- Don't "adapt" it while writing tests
+- Delete means delete
+
+---
+
+## Red-Green-Refactor Cycle
+
+Every feature/behavior follows this mandatory sequence:
+
+```
+RED: Write failing test → verify it fails correctly
+GREEN: Write minimal code to pass → verify ALL tests pass
+REFACTOR: Clean up → verify tests still pass
+```
+
+Each cycle produces exactly 3 tasks in the plan. No steps may be skipped or merged.
+
+---
+
+## Red Flags — These Thoughts Mean STOP
+
+- "This is too simple to need TDD"
+- "I'll write tests after to verify"
+- "Let me explore the implementation first, then add tests"
+- "I'll keep the code as reference and write tests first"
+- "TDD will slow me down"
+- "I already manually tested this"
+- "Tests after achieve the same goals"
+
+All of these mean: **follow the cycle anyway**.
+
+---
+
+## Rationalization Table
+
+| Excuse | Reality |
+|--------|---------|
+| "Too simple to test" | Simple code breaks. Test takes 30 seconds. |
+| "I'll test after" | Tests passing immediately prove nothing — you never saw them catch the bug. |
+| "Tests after achieve same goals" | Tests-after = "what does this do?" Tests-first = "what should this do?" |
+| "Already manually tested" | Ad-hoc != systematic. No record, can't re-run. |
+| "Deleting X hours is wasteful" | Sunk cost fallacy. Keeping unverified code is technical debt. |
+| "Need to explore first" | Fine. Throw away exploration, start fresh with TDD. |
+| "Test hard = design unclear" | Listen to the test. Hard to test = hard to use. Simplify the interface. |
+| "TDD will slow me down" | TDD is faster than debugging in production. |
+
+---
````
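Each phase of the cycle has a different exit condition, and RED in particular must fail rather than error. The TypeScript sketch below encodes those conditions. It assumes a hypothetical `TestRun` summary of one test-runner invocation; how the counts are obtained depends on the test framework. The warning check for GREEN comes from the GREEN task template's "No warnings or errors in test output" criterion.

```typescript
// Hypothetical summary of one test-runner invocation.
interface TestRun {
  errored: boolean; // the test could not run at all (syntax/import error, ...)
  passed: number;
  failed: number;
  warnings: number;
}

type CyclePhase = "red" | "green" | "refactor";

// Returns null when the phase's exit condition holds, otherwise the reason it does not.
function checkPhase(phase: CyclePhase, run: TestRun): string | null {
  if (run.errored) {
    // RED must fail, not error: an erroring test says nothing about the behavior.
    return phase === "red"
      ? "test errors instead of failing: fix the error and re-run"
      : "test run errored";
  }
  if (phase === "red") {
    return run.failed > 0 ? null : "test passes before implementation: wrong test";
  }
  if (run.failed > 0) {
    return `${run.failed} test(s) failing`;
  }
  if (phase === "green" && run.warnings > 0) {
    return `${run.warnings} warning(s) in test output`;
  }
  return null;
}
```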
````diff
+
+## Task Chain Generation
+
+When `maestro-plan --tdd` is active, each behavior/feature decomposes into a TDD triplet:
+
+### Structure
+
+For each behavior B (derived from requirements or convergence criteria):
+
+```
+TASK-{N}a: RED — Write failing test for B
+TASK-{N}b: GREEN — Implement minimal code to pass B
+TASK-{N}c: REFACTOR — Clean up B implementation (optional, skip if nothing to clean)
+```
+
+### TASK-{N}a: RED — Write Failing Test
+
+```json
+{
+  "id": "TASK-{N}a",
+  "title": "RED: Write failing test for {behavior}",
+  "type": "test",
+  "action": "Write test that describes expected behavior. Test MUST fail before implementation exists.",
+  "implementation": [
+    "Identify the behavior to test from requirement",
+    "Write one minimal test — one behavior per test, clear name",
+    "Use real code, not mocks (unless external dependency)",
+    "Run test: verify it FAILS (not errors) with expected failure message",
+    "If test passes: wrong test — testing existing behavior, fix test",
+    "If test errors: fix error, re-run until it fails correctly"
+  ],
+  "convergence": {
+    "criteria": [
+      "Test file exists at {test_path}",
+      "Test run exits non-zero (test fails, not errors)",
+      "Failure message matches expected behavior gap"
+    ],
+    "verification": "Run test command, confirm RED status (failure, not error)"
+  },
+  "meta": {
+    "tdd_phase": "red",
+    "tdd_group": "{N}"
+  }
+}
+```
+
+### TASK-{N}b: GREEN — Write Minimal Code
+
+```json
+{
+  "id": "TASK-{N}b",
+  "title": "GREEN: Implement minimal code for {behavior}",
+  "type": "feature",
+  "depends_on": ["TASK-{N}a"],
+  "action": "Write the simplest code that makes the failing test pass. No features beyond what the test requires.",
+  "implementation": [
+    "Read the failing test to understand exactly what is needed",
+    "Write minimal production code — just enough to pass",
+    "Do NOT add features, refactor other code, or improve beyond the test",
+    "Do NOT add options, configurability, or flexibility not required by test",
+    "Run test: verify it PASSES",
+    "Run full test suite: verify no regressions (all other tests still pass)",
+    "If test fails: fix code, NOT test",
+    "If other tests fail: fix now"
+  ],
+  "convergence": {
+    "criteria": [
+      "Test from TASK-{N}a passes (exit 0)",
+      "Full test suite passes (no regressions)",
+      "No warnings or errors in test output"
+    ],
+    "verification": "Run test command, confirm GREEN status (all pass, clean output)"
+  },
+  "meta": {
+    "tdd_phase": "green",
+    "tdd_group": "{N}"
+  }
+}
+```
+
+### TASK-{N}c: REFACTOR — Clean Up
+
+```json
+{
+  "id": "TASK-{N}c",
+  "title": "REFACTOR: Clean up {behavior} implementation",
+  "type": "refactor",
+  "depends_on": ["TASK-{N}b"],
+  "action": "Remove duplication, improve names, extract helpers. Keep tests green. Do NOT add behavior.",
+  "implementation": [
+    "Review code from TASK-{N}b for duplication, naming, structure",
+    "Apply refactoring while keeping ALL tests green",
+    "Remove duplication across the new and existing code",
+    "Improve variable and function names for clarity",
+    "Extract helpers only if reuse is immediate (not speculative)",
+    "Run full test suite after each refactoring step",
+    "If any test fails during refactoring: undo last change, re-run"
+  ],
+  "convergence": {
+    "criteria": [
+      "Full test suite passes (same as GREEN, no regressions)",
+      "No new behavior added beyond what tests cover"
+    ],
+    "verification": "Run full test suite, confirm still GREEN"
+  },
+  "meta": {
+    "tdd_phase": "refactor",
+    "tdd_group": "{N}"
+  }
+}
+```
+
+---
````
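The three task templates differ mainly in phase, dependency, and convergence criteria, so a planner can stamp them out per behavior. Below is a TypeScript sketch of that expansion. `makeTddTriplet` is a hypothetical helper, and the `action`/`implementation` fields are left out for brevity.

```typescript
// Hypothetical subset of the task fields shown in the three templates
// ("action" and "implementation" are omitted to keep the sketch short).
type TddPhase = "red" | "green" | "refactor";

interface TddTask {
  id: string;
  title: string;
  type: "test" | "feature" | "refactor";
  depends_on: string[]; // empty for RED, which has no dependency
  convergence: { criteria: string[]; verification: string };
  meta: { tdd_phase: TddPhase; tdd_group: string };
}

function makeTddTriplet(n: number, behavior: string, testPath: string, includeRefactor: boolean): TddTask[] {
  const group = String(n); // the templates quote "{N}", so it is kept as a string
  const red: TddTask = {
    id: `TASK-${n}a`,
    title: `RED: Write failing test for ${behavior}`,
    type: "test",
    depends_on: [],
    convergence: {
      criteria: [
        `Test file exists at ${testPath}`,
        "Test run exits non-zero (test fails, not errors)",
        "Failure message matches expected behavior gap",
      ],
      verification: "Run test command, confirm RED status (failure, not error)",
    },
    meta: { tdd_phase: "red", tdd_group: group },
  };
  const green: TddTask = {
    id: `TASK-${n}b`,
    title: `GREEN: Implement minimal code for ${behavior}`,
    type: "feature",
    depends_on: [red.id],
    convergence: {
      criteria: [
        `Test from ${red.id} passes (exit 0)`,
        "Full test suite passes (no regressions)",
        "No warnings or errors in test output",
      ],
      verification: "Run test command, confirm GREEN status (all pass, clean output)",
    },
    meta: { tdd_phase: "green", tdd_group: group },
  };
  const tasks: TddTask[] = [red, green];
  if (includeRefactor) {
    tasks.push({
      id: `TASK-${n}c`,
      title: `REFACTOR: Clean up ${behavior} implementation`,
      type: "refactor",
      depends_on: [green.id],
      convergence: {
        criteria: [
          "Full test suite passes (same as GREEN, no regressions)",
          "No new behavior added beyond what tests cover",
        ],
        verification: "Run full test suite, confirm still GREEN",
      },
      meta: { tdd_phase: "refactor", tdd_group: group },
    });
  }
  return tasks;
}
```

The Red-Green-Refactor section says each cycle produces exactly 3 tasks. The Structure block and the "When to Skip REFACTOR" section, however, allow dropping TASK-{N}c. The `includeRefactor` flag follows the latter.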
````diff
+
+## Wave Assignment
+
+TDD triplets are sequential within each group but groups can parallelize if independent:
+
+```
+Wave 1: TASK-1a (RED for feature A), TASK-2a (RED for feature B) — parallel if independent
+Wave 2: TASK-1b (GREEN for feature A), TASK-2b (GREEN for feature B) — parallel
+Wave 3: TASK-1c (REFACTOR for A), TASK-2c (REFACTOR for B) — parallel
+```
+
+Within a group, the dependency chain is always: `{N}a → {N}b → {N}c`.
+
+---
+
+## Integration with plan.json
+
+When `--tdd` is active, the plan.json output includes:
+
+```json
+{
+  "tdd_mode": true,
+  "tdd_groups": [
+    {
+      "group": 1,
+      "behavior": "User can login with email and password",
+      "tasks": ["TASK-1a", "TASK-1b", "TASK-1c"]
+    }
+  ]
+}
+```
+
+The standard `plan.json.waves[]` and `.task/TASK-*.json` structure is preserved — `maestro-execute` consumes it without modification.
+
+---
````
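The wave layout in the Wave Assignment example follows from the `depends_on` links. Each task goes one wave after its deepest dependency, so independent groups line up RED with RED and GREEN with GREEN. Below is a TypeScript sketch of that layering. The `assignWaves` helper and `WaveTask` shape are hypothetical, and the real `plan.json.waves[]` format is not shown in this diff.

```typescript
// Minimal shape: only ids and dependencies matter for wave assignment.
interface WaveTask {
  id: string;
  depends_on: string[];
}

// Places each task one wave after its deepest dependency (longest-path layering).
// Tasks with no dependency between them land in the same wave and may run in parallel.
// Assumes the dependency graph is acyclic.
function assignWaves(tasks: WaveTask[]): string[][] {
  const byId: Record<string, WaveTask> = Object.create(null);
  for (const t of tasks) byId[t.id] = t;
  const memo: Record<string, number> = Object.create(null);

  const waveOf = (id: string): number => {
    if (memo[id] !== undefined) return memo[id];
    const task = byId[id];
    if (task === undefined) throw new Error(`unknown dependency: ${id}`);
    let wave = 1;
    for (const dep of task.depends_on) wave = Math.max(wave, waveOf(dep) + 1);
    memo[id] = wave;
    return wave;
  };

  const waves: string[][] = [];
  for (const t of tasks) {
    const w = waveOf(t.id);
    while (waves.length < w) waves.push([]);
    waves[w - 1].push(t.id);
  }
  return waves;
}
```

With two independent groups this reproduces the three-wave example: `TASK-1a`/`TASK-2a`, then `TASK-1b`/`TASK-2b`, then `TASK-1c`/`TASK-2c`.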
```diff
+
+## Execution Enforcement
+
+When `maestro-execute` processes a TDD plan (detected by `plan.json.tdd_mode == true`):
+
+### RED task verification
+- After TASK-{N}a completes, verify test exists AND fails
+- If test passes: mark task BLOCKED with reason "Test passes before implementation — wrong test"
+
+### GREEN task verification
+- After TASK-{N}b completes, verify ALL tests pass
+- If the RED test still fails: mark task BLOCKED, provide failure output
+- If other tests regress: mark task BLOCKED, list regressed tests
+
+### REFACTOR task verification
+- After TASK-{N}c completes, verify ALL tests still pass
+- If any test fails: undo changes, mark as needing re-attempt
+
+---
```
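The three verification rules form a small decision table keyed on `meta.tdd_phase`. The TypeScript sketch below uses hypothetical names, and its reason strings are paraphrased. Treating a missing RED test file as BLOCKED is an assumption. REFACTOR failures return `retry`, as the Execution Enforcement section says, although the Error Handling table in the same file says to mark them BLOCKED.

```typescript
// Hypothetical observation gathered after a TDD task's agent reports done.
interface TddObservation {
  testFileExists: boolean;  // does the RED test file exist?
  redTestFails: boolean;    // does the group's RED test currently fail?
  regressedTests: string[]; // previously passing tests that now fail
  failureOutput: string;    // raw output from the RED test run
}

type TddPhase = "red" | "green" | "refactor";

type TaskOutcome =
  | { status: "completed" }
  | { status: "blocked"; reason: string }
  | { status: "retry"; reason: string }; // REFACTOR: changes undone, re-attempt

function enforceTddTask(phase: TddPhase, obs: TddObservation): TaskOutcome {
  if (phase === "red") {
    // Assumption: a missing test file is also treated as BLOCKED.
    if (!obs.testFileExists) return { status: "blocked", reason: "RED test file not found" };
    if (!obs.redTestFails) return { status: "blocked", reason: "Test passes before implementation: wrong test" };
    return { status: "completed" };
  }
  if (phase === "green") {
    if (obs.redTestFails) return { status: "blocked", reason: `RED test still fails:\n${obs.failureOutput}` };
    if (obs.regressedTests.length > 0) {
      return { status: "blocked", reason: `regressed: ${obs.regressedTests.join(", ")}` };
    }
    return { status: "completed" };
  }
  // REFACTOR: any failing test means the refactoring is undone and re-attempted.
  if (obs.redTestFails || obs.regressedTests.length > 0) {
    return { status: "retry", reason: "tests failed after refactoring; changes undone" };
  }
  return { status: "completed" };
}
```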
```diff
+
+## Good Tests
+
+| Quality | Good | Bad |
+|---------|------|-----|
+| **Minimal** | One thing per test. "and" in name? Split it. | `test('validates email and domain and whitespace')` |
+| **Clear** | Name describes behavior | `test('test1')`, `test('it works')` |
+| **Shows intent** | Demonstrates desired API | Obscures what code should do |
+| **Real code** | Uses actual implementations | Mocks everything, tests mock behavior |
+
+---
+
+## When to Skip REFACTOR
+
+TASK-{N}c (REFACTOR) may be omitted from the plan when:
+- GREEN code is already clean (no duplication, good names)
+- The change is truly trivial (single-line fix)
+
+When skipped, mark in plan.json: `"refactor_skipped": true, "reason": "GREEN code already clean"`
+
+---
+
+## Error Handling
+
+| Situation | Action |
+|-----------|--------|
+| No test framework detected | Abort: "No test infrastructure found. Set up testing first." |
+| RED test passes immediately | BLOCKED: "Test passes before implementation — rewrite test" |
+| GREEN test still fails after implementation | Retry once with more context, then BLOCKED |
+| REFACTOR breaks tests | Undo refactoring, mark as BLOCKED |
+| Cannot write meaningful test | AskUserQuestion: "Behavior '{B}' is hard to test. Should we: (1) simplify the interface, (2) skip TDD for this behavior, (3) use integration test instead?" |
```
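Two rows of this part of tdd.md translate directly into code: the single GREEN retry and the REFACTOR skip marker. The TypeScript sketch below uses hypothetical names. Recording `refactor_skipped`/`reason` on the `tdd_groups[]` entry is an assumption, since tdd.md only says to mark it in plan.json.

```typescript
// Hypothetical: one GREEN implementation attempt; returns true if the RED test now passes.
type GreenAttempt = (extraContext: string | null) => boolean;

// "Retry once with more context, then BLOCKED."
function runGreenWithRetry(attempt: GreenAttempt, retryContext: string): "completed" | "blocked" {
  if (attempt(null)) return "completed";
  return attempt(retryContext) ? "completed" : "blocked";
}

// Assumption: refactor_skipped/reason are recorded on the tdd_groups[] entry;
// tdd.md only says "mark in plan.json".
interface TddGroup {
  group: number;
  behavior: string;
  tasks: string[];
  refactor_skipped?: boolean;
  reason?: string;
}

// Drops TASK-{N}c from the group and records why it was skipped.
function skipRefactor(entry: TddGroup, reason: string): TddGroup {
  const refactorId = `TASK-${entry.group}c`;
  return {
    group: entry.group,
    behavior: entry.behavior,
    tasks: entry.tasks.filter((id) => id !== refactorId),
    refactor_skipped: true,
    reason: reason,
  };
}
```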
package/workflows/verify.md (CHANGED)

```diff
@@ -4,6 +4,63 @@ Dual verification: Goal-Backward structural verification + Nyquist test coverage
 
 ---
 
+## Iron Law
+
+**NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE IN THIS MESSAGE.**
+
+Before any success/completion claim:
+1. IDENTIFY — What command proves this claim?
+2. RUN — Execute the FULL command (fresh, in this message — never cite prior runs)
+3. READ — Read FULL output, check exit code, count failures
+4. VERIFY — Does the output actually confirm the claim?
+5. ONLY THEN — Make the claim, with evidence inline
+
+---
```
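The five steps map onto a claim that can only be produced from a fresh, successful run. The TypeScript sketch below assumes a hypothetical `CommandRun` record and models "fresh, in this message" as a start-time comparison. The claim text mirrors the evidence format verify.md gives in its Forbidden Wording section, `"Tests pass: 42/42 green (exit 0)"`.

```typescript
// Hypothetical record of a verification command the agent actually executed.
interface CommandRun {
  command: string;   // IDENTIFY: the command whose output proves the claim
  exitCode: number;  // READ: exit code
  passed: number;    // READ: counts parsed from the FULL output
  failed: number;
  startedAt: number; // ms timestamp of the RUN step
}

// VERIFY, then claim with the evidence inline; throws instead of making an unsupported claim.
function claimTestsPass(run: CommandRun, messageStartedAt: number): string {
  if (run.startedAt < messageStartedAt) {
    throw new Error(`stale evidence: "${run.command}" ran before this message`);
  }
  if (run.exitCode !== 0 || run.failed > 0) {
    throw new Error(`claim not supported: exit ${run.exitCode}, ${run.failed} failed`);
  }
  if (run.passed === 0) {
    // Extra guard (not in verify.md): a run with zero tests is not evidence.
    throw new Error(`claim not supported: "${run.command}" ran no tests`);
  }
  return `Tests pass: ${run.passed}/${run.passed + run.failed} green (exit ${run.exitCode})`;
}
```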
```diff
+
+## Forbidden Wording
+
+These phrases are BANNED in verification reports and completion claims:
+- "Should work now"
+- "Probably passes"
+- "Seems correct"
+- "Looks good"
+- "I'm confident that..."
+- "Based on my review, this is complete"
+- Any expression of satisfaction BEFORE running verification commands
+
+Replace with evidence: `"Tests pass: 42/42 green (exit 0)"`, `"All 5 truths VERIFIED with file:line evidence"`.
+
+---
+
+## Red Flags — These Thoughts Mean STOP
+
+If you catch yourself thinking any of these, STOP and run verification first:
+
+- "I just wrote this code, it definitely works"
+- "The changes are too small to break anything"
+- "I already verified this earlier in the conversation"
+- "The tests passed before, they'll pass now"
+- "I can see from the code that it's correct"
+- "Let me just mark this complete and move on"
+
+All of these mean: **run the verification command NOW, read the output, then report**.
+
+---
+
+## Rationalization Table
+
+| Excuse | Why It's Wrong |
+|--------|----------------|
+| "I just made a one-line change" | One-line changes cause the most insidious bugs |
+| "Tests passed earlier" | Code changed since then — earlier results are stale |
+| "I can read the code is correct" | Reading is not running — subtle runtime errors are invisible in code review |
+| "The build succeeded" | Build success ≠ functional correctness |
+| "It works for the happy path" | Edge cases, error paths, and boundary conditions need verification too |
+| "Verification would take too long" | Skipping verification costs more time when bugs surface later |
+| "The agent said it's done" | Agent reports are claims, not evidence — verify independently |
+
+---
+
 ## Prerequisites
 
 - Phase execution completed (or partially completed)
```
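The Forbidden Wording list lends itself to a simple lint over draft reports. The TypeScript sketch below uses hypothetical names. It matches case-insensitively and normalizes curly apostrophes. It cannot catch the last bullet (satisfaction expressed before verification), which depends on ordering rather than wording.

```typescript
// Phrases from the Forbidden Wording list, lower-cased for matching.
const FORBIDDEN_PHRASES: string[] = [
  "should work now",
  "probably passes",
  "seems correct",
  "looks good",
  "i'm confident that",
  "based on my review, this is complete",
];

// Returns the banned phrases found in a report (in list order); empty means none matched.
function findForbiddenWording(report: string): string[] {
  const text = report.toLowerCase().replace(/\u2019/g, "'"); // normalize curly apostrophes
  return FORBIDDEN_PHRASES.filter((phrase) => text.indexOf(phrase) !== -1);
}
```

A non-empty result means the report should be rewritten around evidence such as exit codes and pass counts.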