buildanything 1.6.0 → 1.7.0

Files changed (77)
  1. package/.claude-plugin/marketplace.json +2 -1
  2. package/.claude-plugin/plugin.json +10 -2
  3. package/agents/agentic-identity-trust.md +65 -311
  4. package/agents/data-consolidation-agent.md +3 -22
  5. package/agents/design-brand-guardian.md +52 -275
  6. package/agents/design-image-prompt-engineer.md +67 -196
  7. package/agents/design-ui-designer.md +37 -361
  8. package/agents/design-ux-architect.md +51 -434
  9. package/agents/design-ux-researcher.md +48 -299
  10. package/agents/design-whimsy-injector.md +58 -405
  11. package/agents/engineering-backend-architect.md +39 -202
  12. package/agents/engineering-data-engineer.md +41 -236
  13. package/agents/engineering-devops-automator.md +73 -258
  14. package/agents/engineering-frontend-developer.md +33 -206
  15. package/agents/engineering-mobile-app-builder.md +36 -446
  16. package/agents/engineering-rapid-prototyper.md +34 -428
  17. package/agents/engineering-security-engineer.md +44 -204
  18. package/agents/engineering-senior-developer.md +18 -138
  19. package/agents/engineering-technical-writer.md +40 -302
  20. package/agents/marketing-app-store-optimizer.md +63 -276
  21. package/agents/marketing-social-media-strategist.md +38 -87
  22. package/agents/project-management-experiment-tracker.md +62 -156
  23. package/agents/report-distribution-agent.md +4 -24
  24. package/agents/sales-data-extraction-agent.md +3 -22
  25. package/agents/specialized-cultural-intelligence-strategist.md +41 -62
  26. package/agents/specialized-developer-advocate.md +65 -234
  27. package/agents/support-analytics-reporter.md +76 -306
  28. package/agents/support-executive-summary-generator.md +26 -172
  29. package/agents/support-finance-tracker.md +67 -362
  30. package/agents/support-legal-compliance-checker.md +40 -497
  31. package/agents/support-support-responder.md +40 -532
  32. package/agents/testing-accessibility-auditor.md +67 -271
  33. package/agents/testing-api-tester.md +58 -274
  34. package/agents/testing-evidence-collector.md +48 -170
  35. package/agents/testing-performance-benchmarker.md +75 -236
  36. package/agents/testing-reality-checker.md +49 -192
  37. package/agents/testing-test-results-analyzer.md +70 -276
  38. package/agents/testing-tool-evaluator.md +52 -368
  39. package/agents/testing-workflow-optimizer.md +66 -415
  40. package/bin/setup.js +45 -0
  41. package/bin/sync-version.js +38 -0
  42. package/commands/add-feature.md +98 -0
  43. package/commands/build.md +156 -93
  44. package/commands/dogfood.md +43 -0
  45. package/commands/fix.md +89 -0
  46. package/commands/idea-sweep.md +19 -82
  47. package/commands/refactor.md +68 -0
  48. package/commands/ux-review.md +81 -0
  49. package/commands/verify.md +43 -0
  50. package/hooks/session-start +5 -10
  51. package/package.json +4 -1
  52. package/agents/agents-orchestrator.md +0 -365
  53. package/agents/data-analytics-reporter.md +0 -52
  54. package/agents/lsp-index-engineer.md +0 -312
  55. package/agents/macos-spatial-metal-engineer.md +0 -335
  56. package/agents/marketing-content-creator.md +0 -52
  57. package/agents/marketing-growth-hacker.md +0 -52
  58. package/agents/product-sprint-prioritizer.md +0 -152
  59. package/agents/product-trend-researcher.md +0 -157
  60. package/agents/project-management-project-shepherd.md +0 -192
  61. package/agents/project-management-studio-operations.md +0 -198
  62. package/agents/project-management-studio-producer.md +0 -201
  63. package/agents/project-manager-senior.md +0 -133
  64. package/agents/support-infrastructure-maintainer.md +0 -616
  65. package/agents/terminal-integration-specialist.md +0 -68
  66. package/agents/visionos-spatial-engineer.md +0 -52
  67. package/agents/xr-cockpit-interaction-specialist.md +0 -30
  68. package/agents/xr-immersive-developer.md +0 -30
  69. package/agents/xr-interface-architect.md +0 -30
  70. package/commands/protocols/brainstorm.md +0 -99
  71. package/commands/protocols/build-fix.md +0 -52
  72. package/commands/protocols/cleanup.md +0 -56
  73. package/commands/protocols/design.md +0 -287
  74. package/commands/protocols/eval-harness.md +0 -62
  75. package/commands/protocols/metric-loop.md +0 -94
  76. package/commands/protocols/planning.md +0 -56
  77. package/commands/protocols/verify.md +0 -63
@@ -0,0 +1,98 @@
+ ---
+ description: "Add a single feature to an existing project — lightweight build cycle using existing architecture, design system, and CLAUDE.md context"
+ argument-hint: "Describe the feature to add. --autonomous for unattended mode."
+ ---
+
+ <HARD-GATE>
+ YOU ARE AN ORCHESTRATOR. YOU COORDINATE AGENTS. YOU DO NOT WRITE CODE.
+
+ "Launch an agent" = call the Agent tool. For implementation agents, set mode: "bypassPermissions". For parallel work, put multiple Agent tool calls in ONE message.
+ </HARD-GATE>
+
+ Input: $ARGUMENTS
+
+ If the input contains `--autonomous` or `--auto`, skip user approval gates and log decisions to `docs/plans/build-log.md`.
+
+ ---
+
+ ## Phase 1: Context Gathering
+
+ Read these files directly (no agent needed — this is fast):
+
+ 1. `CLAUDE.md` — product context, tech stack, rules
+ 2. `docs/plans/architecture.md` — current architecture
+ 3. `docs/plans/sprint-tasks.md` — existing user journeys and scope
+
+ If any file is missing, proceed with what exists. If the codebase is unfamiliar or the feature touches unknown areas, spawn an Explore agent:
+
+ Call the Agent tool — description: "Explore codebase for [feature area]" — prompt: "Find all files related to [feature area]. Report: directory structure, key files, patterns used, relevant components/routes/APIs. Be concise."
+
+ ---
+
+ ## Phase 2: Plan the Feature
+
+ You do this yourself — no agent needed.
+
+ 1. **Break the feature into 1-5 tasks** (most features are 1-3). Each task should be one commit-sized unit of work.
+ 2. **Define behavioral acceptance criteria** for each task — what must be true when the task is done.
+ 3. **Define the user journey** — the end-to-end flow the user will experience with this feature.
+ 4. **Present the plan to the user for approval.** In autonomous mode, log the plan to `docs/plans/build-log.md` and proceed.
+
+ ---
+
+ ## Phase 3: Build
+
+ **For EACH task:**
+
+ ### Step 3.1 — Implement
+
+ Call the Agent tool — description: "[task name]" — mode: "bypassPermissions" — prompt: "TASK: [task description + acceptance criteria]. HANDOFF — Architecture context: [paste ONLY the relevant section from architecture.md]. Style guide: the living style guide at /design-system shows component styling — match it. Implement with real code and tests. Commit: 'feat: [task]'. Report what you built, files changed, and test results."
+
+ Set `[COMPLEXITY: S/M/L]` based on task scope.
+
+ ### Step 3.2 — Cleanup
+
+ Skip if trivial (< 20 lines, single file). Otherwise:
+
+ Call the Agent tool — description: "Cleanup [task name]" — mode: "bypassPermissions" — prompt: "Clean up these files: [list from implementation]. Fix: naming, dead code, unused imports, style, DRY. Do NOT add features, change architecture, or touch other files. If cleanup breaks acceptance criteria, revert."
+
+ ### Step 3.3 — Smoke Test
+
+ Skip if this task has no UI surface. Otherwise run the Smoke Test Protocol (`protocols/smoke-test.md`): open the affected route, execute behavioral acceptance criteria via agent-browser, collect evidence. On FAIL: spawn fix agent with evidence. Max 2 fix-and-retest cycles.
+
+ ### Step 3.4 — Verification
+
+ Run the Verification Protocol (`protocols/verify.md`). All 7 checks. If FAIL, fix before starting the next task.
+
+ ---
+
+ ## Phase 4: End-to-End Verification
+
+ ### Step 4.1 — Run the User Journey
+
+ Call the Agent tool — description: "E2E: [feature name]" — mode: "bypassPermissions" — prompt: "Verify the full user journey for [feature name]: [paste the user journey from Phase 2]. Use agent-browser to walk through each step. For each step: interact, verify the expected outcome, capture evidence. Report PASS/FAIL per step with screenshots."
+
+ ### Step 4.2 — Dogfood Affected Pages
+
+ Call the Agent tool — description: "Dogfood [feature area]" — prompt: "Open every page affected by [feature name]. Check for: broken layouts, console errors, missing data, dead links, regressions. Report issues with screenshots."
+
+ ### Step 4.3 — Fix Loop
+
+ If issues found in 4.1 or 4.2: spawn a fix agent with the evidence. Re-run the failing check. Max 2 fix-and-retest cycles. After 2 failures:
+ - **Interactive:** present evidence to the user.
+ - **Autonomous:** log to `docs/plans/build-log.md` and proceed with a warning.
+
+ ---
+
+ ## Phase 5: Done
+
+ Report to the user:
+
+ ```
+ FEATURE COMPLETE: [feature name]
+ Tasks: [done]/[total] | Tests: [count] passing
+ User journey: PASS/FAIL
+ Evidence: [paths to screenshots/logs]
+ ```
+
+ If the feature expands the product scope, update `CLAUDE.md` to reflect the new capability.
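
Reviewer note on Step 4.3 of the new command: the fix loop reads as a bounded retry with a mode-dependent fallback. The sketch below is only an illustration of that control flow. `runCheck` and `spawnFixAgent` are hypothetical callbacks standing in for the agent calls, and the `action` strings are made-up labels, not values from the package.

```javascript
// Hypothetical sketch of the Step 4.3 fix loop. `runCheck` and `spawnFixAgent`
// are caller-supplied stand-ins for agent calls, not functions from the package.
const MAX_FIX_CYCLES = 2; // "Max 2 fix-and-retest cycles" from the command text

function fixLoop(runCheck, spawnFixAgent, autonomous) {
  let result = runCheck();
  let cycles = 0;
  while (!result.pass && cycles < MAX_FIX_CYCLES) {
    spawnFixAgent(result.evidence); // hand the failure evidence to a fix agent
    result = runCheck();            // re-run the failing check
    cycles += 1;
  }
  if (result.pass) return { status: "PASS", cycles };
  // After the last failed cycle the behaviour depends on the mode.
  return {
    status: "FAIL",
    cycles,
    action: autonomous ? "log-and-proceed" : "present-to-user", // assumed labels
  };
}
```

Passing the mode as a plain boolean keeps the sketch independent of how `--autonomous` or `--auto` is actually parsed from `$ARGUMENTS`.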
package/commands/build.md CHANGED
@@ -51,6 +51,8 @@ If you catch yourself typing code or reading source files: STOP. You are wasting
  - `last_save: [Phase.Step]`
  Increment after each agent returns (parallel dispatch of 4 agents = +4). Reset to 0 after each compaction save.
 
+ **Compaction checkpoint format:** At every phase boundary, check `dispatches_since_save` in `docs/plans/.build-state.md`. If >= 8: save ALL state (current phase, task statuses, metric loop scores, decisions) to `docs/plans/.build-state.md`. Reset `dispatches_since_save` to 0. TodoWrite does NOT survive compaction — rebuild it from this state file on resume.
+
  Input: $ARGUMENTS
 
  ### Autonomous Mode
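
Reviewer note on the compaction checkpoint added in this hunk: the rule is a counter with a save threshold. A minimal sketch follows, assuming the state is held as a plain object. Only the field names `dispatches_since_save` and `last_save` come from the diff; the function names are hypothetical, and writing the state to `docs/plans/.build-state.md` is left out.

```javascript
// Hypothetical sketch of the checkpoint rule; only the two field names are
// taken from build.md. Persisting to the state file is intentionally omitted.
const SAVE_THRESHOLD = 8; // "If >= 8: save ALL state"

// Each returning agent counts as one dispatch (4 parallel agents = +4).
function recordDispatches(state, agentsReturned) {
  return {
    ...state,
    dispatches_since_save: state.dispatches_since_save + agentsReturned,
  };
}

// Called at a phase boundary: decide whether to save, and reset the counter.
function checkpoint(state, phaseStep) {
  if (state.dispatches_since_save < SAVE_THRESHOLD) {
    return { saved: false, state };
  }
  return {
    saved: true,
    state: { ...state, dispatches_since_save: 0, last_save: phaseStep },
  };
}
```

For example, a state with 5 dispatches that then receives 4 parallel agent results crosses the threshold at the next phase boundary.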
@@ -67,7 +69,7 @@ When combining `--resume` with `--autonomous`: the current invocation's flags ta
 
  ### Metric Loop
 
- Every phase uses a **metric-driven iteration loop** to drive quality. Read the full protocol at `commands/protocols/metric-loop.md`. Critical rules (survive compaction):
+ Every phase uses a **metric-driven iteration loop** to drive quality. Read the full protocol at `protocols/metric-loop.md`. Critical rules (survive compaction):
 
  1. YOU define a metric for this phase based on context (what you're building, what matters). The metric is NOT predefined.
  2. Spawn a **measurement agent** to score the artifact 0-100. Read its full output — it's analysis.
@@ -95,15 +97,7 @@ For implementation agents (Phase 5+): Do NOT paste the entire Design Document or
 
  ### Complexity Routing (Advisory)
 
- When composing agent prompts, prefix with `[COMPLEXITY: S/M/L]` to hint at the appropriate model tier:
-
- | Complexity | Task Types | Preferred Tier |
- |-----------|-----------|----------------|
- | S | Build-fix, cleanup, lint fix, single-error fix | Haiku-class (fastest) |
- | M | Measurement, eval, testing, single-feature impl | Sonnet-class (balanced) |
- | L | Architecture, research, multi-file impl, debugging | Opus-class (deepest reasoning) |
-
- For sprint tasks, use the Size field from `docs/plans/sprint-tasks.md`. This is advisory — the tag documents intent for future model routing support.
+ Tag agent prompts with `[COMPLEXITY: S/M/L]` based on task size from `docs/plans/sprint-tasks.md`. This is advisory — the tag documents intent for future model routing support.
 
  ---
 
@@ -112,7 +106,7 @@ For sprint tasks, use the Size field from `docs/plans/sprint-tasks.md`. This is
  **Resuming?** If the input contains `--resume` OR if context was just compacted (SessionStart hook fired with active state):
  1. Read `docs/plans/.build-state.md` — verify it exists and has a Resume Point section.
  If `docs/plans/.build-state.md` does not exist or has no Resume Point, warn the user: 'No previous build state found. Starting fresh.' Then proceed to Step 0.1 as a new build.
- 2. Re-read this file and all protocol files in `commands/protocols/`.
+ 2. Re-read this file and all protocol files in `protocols/`.
  3. Re-read `docs/plans/sprint-tasks.md`, `docs/plans/architecture.md`, and `CLAUDE.md`.
  4. Rebuild TodoWrite from the state file (TodoWrite does NOT survive compaction or session breaks).
  5. Reset `dispatches_since_save` to 0 (fresh context window).
@@ -183,7 +177,7 @@ Autonomous mode: Log checklist to `docs/plans/build-log.md`. Create `.env.exampl
 
  ### Step 1.1 — Brainstorming
 
- Follow the Brainstorm Protocol (`commands/protocols/brainstorm.md`).
+ Follow the Brainstorm Protocol (`protocols/brainstorm.md`).
 
  In interactive mode: this is a conversation. Ask questions one at a time, propose approaches with trade-offs, let the user decide. Output: Design Document saved to `docs/plans/`.
 
@@ -195,15 +189,15 @@ Skip if context level is "Decision brief" (research already done).
 
  Call the Agent tool 5 times in a single message. Pass each agent the build request AND the Design Document draft.
 
- 1. Description: "Market research" — Prompt: "Research market size (TAM/SAM/SOM), competitive landscape (5-10 players), timing, and market structure for: [build request]. Design context: [paste design doc]. Use web search extensively. Report with a Market Verdict: GREEN/AMBER/RED."
+ 1. Description: "Market research" — Prompt: "Research market size (TAM/SAM/SOM), competitive landscape (5-10 players), timing, and market structure for: [build request]. Design context: [paste design doc]. Report with a Market Verdict: GREEN/AMBER/RED."
 
- 2. Description: "Tech feasibility" — Prompt: "Evaluate hard technical problems (Solved/Hard/Unsolved), build-vs-buy decisions, MVP scope, and stack validation for: [build request]. Design context: [paste design doc]. Search for APIs and libraries mentioned in the design to verify they exist and are maintained. Report with a Technical Verdict."
+ 2. Description: "Tech feasibility" — Prompt: "Evaluate hard technical problems (Solved/Hard/Unsolved), build-vs-buy decisions, MVP scope, and stack validation for: [build request]. Design context: [paste design doc]. Verify APIs and libraries from the design exist and are maintained. Report with a Technical Verdict."
 
- 3. Description: "User research" — Prompt: "Analyze target persona, jobs-to-be-done, current alternatives, behavioral barriers to adoption for: [build request]. Design context: [paste design doc]. Search for real user complaints and communities discussing this problem. Report with a User Verdict."
+ 3. Description: "User research" — Prompt: "Analyze target persona, jobs-to-be-done, current alternatives, and behavioral barriers to adoption for: [build request]. Design context: [paste design doc]. Report with a User Verdict."
 
- 4. Description: "Business model" — Prompt: "Evaluate revenue models, unit economics, growth loops, first-1000-users strategy for: [build request]. Design context: [paste design doc]. Search for comparable pricing and growth data. Report with a Business Verdict."
+ 4. Description: "Business model" — Prompt: "Evaluate revenue models, unit economics, growth loops, and first-1000-users strategy for: [build request]. Design context: [paste design doc]. Report with a Business Verdict."
 
- 5. Description: "Risk analysis" — Prompt: "Adversarial review: regulatory risk, security concerns, dependency risks, competitive response, top 3 failure modes for: [build request]. Design context: [paste design doc]. Search for enforcement actions and comparable failures. Report with a Risk Verdict."
+ 5. Description: "Risk analysis" — Prompt: "Adversarial review: regulatory risk, security concerns, dependency risks, competitive response, top 3 failure modes for: [build request]. Design context: [paste design doc]. Report with a Risk Verdict."
 
  After all 5 return, synthesize a **Research Brief** with a verdict table. Save to `docs/plans/research-brief.md`.
 
@@ -218,17 +212,41 @@ Read the Design Document and Research Brief together. Check for contradictions:
 
  Update the Design Document with corrections. Save final version.
 
- ### Step 1.4 — Persist Decisions
+ ### Step 1.4 — Write CLAUDE.md
+
+ Create (or overwrite) the project's `CLAUDE.md`. This is the product brain — every agent spawned during the build reads it automatically. Write it from the Design Document and Research Brief. It must give any agent enough context to make smart product, UX, and technical decisions without needing the full design doc.
+
+ <HARD-GATE>
+ CLAUDE.md must be under 200 lines. It is not a wiki, not a conventions doc, not a dump of everything you know. It is the minimum context an agent needs to make correct decisions about this specific product.
+ </HARD-GATE>
 
- Append key decisions to the project's `CLAUDE.md` (create if needed) under `## Build Decisions`:
+ Structure:
 
- - Project name and one-line description
- - Primary user and core value prop
- - Tech stack (with rationale)
- - Key constraints or risks
- - MVP scope boundary (in vs. deferred)
+ ```
+ ## Product
+ [1-3 sentences: what this is, core value prop, what success looks like]
+
+ ## User
+ [Primary persona: who they are, what they care about, pain points,
+ technical sophistication. This drives every UX decision.]
+
+ ## Tech Stack
+ [Stack choices with 1-line rationale for each. Framework, DB, auth,
+ key libraries, deployment target.]
+
+ ## Scope
+ [What's in MVP vs. deferred. Hard boundaries. This prevents agents
+ from building features that aren't scoped.]
+
+ ## Rules
+ [Project-specific hard rules derived from the product and user context.
+ Examples: "All data must be real-time — no simulated/fake data",
+ "User must be able to pause/stop any automated process at any time",
+ "Every interactive element must have visible feedback within 200ms".
+ Only include rules this specific project needs — not generic best practices.]
+ ```
 
- This ensures decisions survive context compaction.
+ Keep it product-focused. An implementation agent reading this should understand WHO the user is and WHAT matters enough to make the right call when the handoff prompt doesn't cover an edge case.
 
  ### Quality Gate 1
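
Reviewer note on the rewritten Step 1.4 in this hunk: the hard gate (under 200 lines) and the five-section structure are both mechanically checkable. The sketch below is one possible check, not something the package ships. The function name and return shape are hypothetical; the section names and the 200-line limit are taken from the diff.

```javascript
// Hypothetical lint for a generated CLAUDE.md. Section names and the line
// limit come from build.md; everything else here is an illustrative assumption.
const REQUIRED_SECTIONS = ["Product", "User", "Tech Stack", "Scope", "Rules"];
const MAX_LINES = 200; // "CLAUDE.md must be under 200 lines"

function checkClaudeMd(text) {
  const lines = text.split("\n");
  // Collect level-2 headings such as "## Product".
  const headings = new Set(
    lines.filter((line) => line.startsWith("## ")).map((line) => line.slice(3).trim())
  );
  const missing = REQUIRED_SECTIONS.filter((name) => !headings.has(name));
  return {
    lineCount: lines.length,
    underLimit: lines.length < MAX_LINES, // strictly under, per the hard gate
    missing,
  };
}
```

A file with only `## Product` and `## User` would come back with `missing` listing the other three sections.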
 
@@ -238,7 +256,7 @@ This ensures decisions survive context compaction.
 
  Update TodoWrite and `docs/plans/.build-state.md`.
 
- **Compaction checkpoint:** Check `dispatches_since_save` in `docs/plans/.build-state.md`. If >= 8: save ALL state (current phase, task statuses, metric loop scores, decisions) to `docs/plans/.build-state.md`. Reset `dispatches_since_save` to 0. TodoWrite does NOT survive compaction — rebuild it from this state file on resume.
+ **Compaction checkpoint.** Update `docs/plans/.build-state.md` per the format above.
 
  ---
 
@@ -270,13 +288,13 @@ After all 4 return, YOU synthesize into one Architecture Document. Save to `docs
 
  ### Step 2.3 — Metric Loop: Architecture Quality
 
- Run the Metric Loop Protocol (`commands/protocols/metric-loop.md`) on the Architecture Document. Define a metric based on this project — coverage of design doc requirements, specificity, consistency between agents. Max 3 iterations.
+ Run the Metric Loop Protocol (`protocols/metric-loop.md`) on the Architecture Document. Define a metric based on: coverage of design doc requirements, specificity, consistency between agents, and **simplicity** — is this the simplest architecture that meets the requirements? Could any service, abstraction, or dependency be eliminated without losing functionality? Penalize over-engineering (microservices for a simple app, Kubernetes for a static site, complex state management for a 3-page app). Max 3 iterations.
 
  ### Step 2.4 — Sprint Planning
 
- Follow the Planning Protocol (`commands/protocols/planning.md`). Use 2 sequential Agent tool calls:
+ Follow the Planning Protocol (`protocols/planning.md`). Use 2 sequential Agent tool calls:
 
- Call the Agent tool — description: "Sprint breakdown" — prompt: "Break this architecture into ordered, atomic tasks. Each task needs: description, acceptance criteria, dependencies, size (S/M/L). ARCHITECTURE: [paste]. DESIGN DOC: [paste]. Scope to MVP only."
+ Call the Agent tool — description: "Sprint breakdown" — prompt: "Break this architecture into ordered, atomic tasks. Each task needs: description, acceptance criteria, dependencies, size (S/M/L). Include a `**Behavioral Test:**` field for every task that has UI — a concrete interaction test: 'Navigate to [page], click [element], verify [expected outcome]'. API-only tasks should have curl-based acceptance tests instead. ARCHITECTURE: [paste]. DESIGN DOC: [paste]. Scope to MVP only."
 
  Then call the Agent tool — description: "Validate task list" — prompt: "Validate this task list: [paste]. Check scope is realistic, no missing tasks, descriptions specific enough for a developer agent to execute, all tasks within MVP boundary."
 
@@ -290,7 +308,7 @@ Save to `docs/plans/sprint-tasks.md`.
 
  Update TodoWrite and `docs/plans/.build-state.md`.
 
- **Compaction checkpoint:** Check `dispatches_since_save` in `docs/plans/.build-state.md`. If >= 8: save ALL state (current phase, task statuses, metric loop scores, decisions) to `docs/plans/.build-state.md`. Reset `dispatches_since_save` to 0. TodoWrite does NOT survive compaction — rebuild it from this state file on resume.
+ **Compaction checkpoint.** Update `docs/plans/.build-state.md` per the format above.
 
  ---
 
@@ -301,14 +319,14 @@ Update TodoWrite and `docs/plans/.build-state.md`.
  **Skip if** the project has no user-facing frontend (CLI tools, pure APIs, backend services).
 
  <HARD-GATE>
- UI/UX IS THE PRODUCT. This phase is a full peer to Architecture and Build — not a footnote, not an afterthought, not a "nice to have." Do NOT skip, compress, or rush this phase for any reason. The agents must research real competitors and award-winning sites, make deliberate visual choices backed by that research, build proof screens, and iterate with Playwright-verified visual QA before a single line of product code is written.
+ UI/UX IS THE PRODUCT. This phase is a full peer to Architecture and Build — not a footnote, not an afterthought, not a "nice to have." Do NOT skip, compress, or rush this phase for any reason. The agents must research real competitors and award-winning sites, make deliberate visual choices backed by that research, build a living style guide with every component rendered and interactive, and iterate with Playwright-verified visual QA before a single line of product code is written.
 
  Phase 4 (Foundation) WILL NOT START without `docs/plans/visual-design-spec.md`. If it does not exist, return here.
  </HARD-GATE>
 
  ### Step 3.1 — Design Research (2 agents, parallel, both use Playwright)
 
- Follow the Design Protocol (`commands/protocols/design.md`), Step 3.1.
+ Follow the Design Protocol (`protocols/design.md`), Step 3.1.
 
  Call the Agent tool 2 times in one message:
 
@@ -320,21 +338,23 @@ After both return, synthesize a **Design Research Brief** to `docs/plans/design-
 
  ### Step 3.2 — Design Direction (2 agents, sequential)
 
- Follow the Design Protocol (`commands/protocols/design.md`), Step 3.2.
+ Follow the Design Protocol (`protocols/design.md`), Step 3.2.
 
  1. Call the Agent tool — description: "UX architecture" — Prompt: "Create structural design foundation. INPUTS: frontend architecture section from architecture.md [paste], Design Research Brief [paste], reference screenshot paths [list], user persona [paste]. OUTPUT: information architecture, layout strategy, component hierarchy, responsive approach, interaction patterns. Base decisions on competitive research, not generic patterns."
 
  2. Call the Agent tool — description: "Visual design spec" — Prompt: "Create the Visual Design Spec with AUTONOMOUS decisions — pick the single best direction, do not present options. INPUTS: UX foundation [paste previous output], Design Research Brief [paste], reference screenshot paths [list], user persona [paste]. OUTPUT: color system (with hex, light+dark), typography (Google Fonts, mathematical scale), 8px spacing system, tinted shadow system, border radius, animation/motion, component styles with ALL states. Every choice must cite the research. Apply anti-AI-template rules from the Design Protocol. Save to docs/plans/visual-design-spec.md."
 
- ### Step 3.3 — Proof Screens (1 implementation agent)
+ ### Step 3.3 — Living Style Guide (1 implementation agent)
+
+ Follow the Design Protocol (`protocols/design.md`), Step 3.3.
 
- Call the Agent tool — description: "Build proof screens" — mode: "bypassPermissions" — prompt: "[COMPLEXITY: L] Implement 2-3 proof screens (landing/hero, main app view, key form). INPUTS: Visual Design Spec [paste], UX foundation [paste relevant sections], reference screenshots [list paths — these are your visual targets]. Use EXACT colors, fonts, spacing from spec. Real styled responsive pages, not wireframes. Include hover/focus states, transitions. Commit: 'feat: proof screens for design validation'."
+ Call the Agent tool — description: "Build living style guide" — mode: "bypassPermissions" — prompt: "[COMPLEXITY: L] Build a living style guide page (/design-system route or standalone HTML). INPUTS: Visual Design Spec [paste], UX foundation [paste relevant sections], reference screenshots [list paths — these are your quality targets]. Must include rendered, interactive examples of: color swatches, typography scale, spacing scale, buttons (all states), form elements (all states), cards, navigation, feedback components (alerts, toasts, spinners, empty states), modals/overlays, and layout grid examples. Every component interactive (hover, focus, transitions work). Mobile-responsive. This ships with the product. Commit: 'feat: living style guide'."
 
  ### Step 3.4 — Visual QA Loop (Playwright + Metric Loop)
 
- Run the Metric Loop Protocol (`commands/protocols/metric-loop.md`) using the measurement criteria from the Design Protocol (`commands/protocols/design.md`, Step 3.4).
+ Run the Metric Loop Protocol (`protocols/metric-loop.md`) using the measurement criteria from the Design Protocol (`protocols/design.md`, Step 3.4).
 
- Measurement: Playwright screenshots of proof screens (desktop + mobile). Design critic agent scores 0-100 across 6 dimensions: spacing/alignment, typography hierarchy, color harmony, component polish, responsive quality, originality (anti-AI-template check). Receives screenshots + Visual Design Spec + reference screenshots.
+ Measurement: Playwright screenshots of the living style guide sections (desktop + mobile). Design critic agent scores 0-100 across 6 dimensions: spacing/alignment, typography hierarchy, color harmony, component polish, responsive quality, originality (anti-AI-template check). Receives screenshots + Visual Design Spec + reference screenshots.
 
  **Target: 80. Max 5 iterations.** On stall: accept if >= 65, log warning below 65.
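
Reviewer note on the Step 3.4 stop rule in this hunk: the thresholds can be read as a small decision function. The sketch below is an interpretation under stated assumptions. How a stall is detected lives in `protocols/metric-loop.md`, which this diff does not show, so `stalled` is taken as an input. Treating the fifth iteration like a stall is also an assumption, and the returned labels are hypothetical.

```javascript
// Hypothetical reading of the Visual QA thresholds. Stall detection is not
// shown in this diff, so it is passed in; hitting the iteration cap is
// ASSUMED to be handled the same way as a stall.
const TARGET = 80;
const MAX_ITERATIONS = 5;
const STALL_FLOOR = 65;

function visualQaDecision(score, iteration, stalled) {
  if (score >= TARGET) return "accept";
  if (stalled || iteration >= MAX_ITERATIONS) {
    return score >= STALL_FLOOR ? "accept" : "log-warning";
  }
  return "iterate";
}
```

Under this reading, a score of 70 keeps iterating while progress continues and is accepted once the loop stalls.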
 
@@ -342,7 +362,7 @@ Measurement: Playwright screenshots of proof screens (desktop + mobile). Design
 
  Log to `docs/plans/build-log.md`: final screenshot paths, score history table, design decisions, originality score. No user pause. Proceed to Phase 4.
 
- **Compaction checkpoint:** Check `dispatches_since_save` in `docs/plans/.build-state.md`. If >= 8: save ALL state (current phase, task statuses, metric loop scores, decisions) to `docs/plans/.build-state.md`. Reset `dispatches_since_save` to 0. TodoWrite does NOT survive compaction — rebuild it from this state file on resume.
+ **Compaction checkpoint.** Update `docs/plans/.build-state.md` per the format above.
 
  ---
 
@@ -360,7 +380,11 @@ Call the Agent tool — description: "Project scaffolding" — mode: "bypassPerm
 
  ### Step 4.2 — Design System (frontend only)
 
- Call the Agent tool — description: "Design system setup" — mode: "bypassPermissions" — prompt: "Implement the design system from the Visual Design Spec: [paste from docs/plans/visual-design-spec.md]. Create CSS tokens matching the spec's color system, typography scale, spacing system, shadow/elevation tokens, and base layout components. Reference the proof screens from Phase 3 as implementation targets. Commit: 'feat: design system'."
+ Call the Agent tool — description: "Design system setup" — mode: "bypassPermissions" — prompt: "Implement the design system from the Visual Design Spec: [paste from docs/plans/visual-design-spec.md]. Create CSS tokens matching the spec's color system, typography scale, spacing system, shadow/elevation tokens, and base layout components. The living style guide from Phase 3 is the reference implementation — components must match. Commit: 'feat: design system'."
+
+ ### Step 4.2b — Acceptance Test Scaffolding
+
+ Call the Agent tool — description: "Scaffold acceptance tests" — mode: "bypassPermissions" — prompt: "Read docs/plans/sprint-tasks.md. For every task with a Behavioral Test field, create a Playwright test stub in tests/e2e/acceptance/. Use Page Object Model. Each test should: navigate to the page, perform the interaction, assert the expected outcome. Tests should FAIL right now (features aren't built yet) — that's correct. Also ensure agent-browser is available (run `which agent-browser`). Commit: 'test: scaffold acceptance tests from sprint tasks'."
 
  ### Step 4.3 — Metric Loop: Scaffold Health
 
@@ -368,10 +392,10 @@ Run the Metric Loop Protocol. Define a metric: builds clean, tests pass, lint cl
368
392
 
369
393
  ### Step 4.4 — Verification Gate
370
394
 
371
- Run the Verification Protocol (`commands/protocols/verify.md`). Critical rules (survive compaction):
395
+ Run the Verification Protocol (`protocols/verify.md`). Critical rules (survive compaction):
372
396
  - ONE agent runs all 6 checks sequentially: Build → Type-Check → Lint → Test → Security → Diff Review. Stop on first FAIL.
373
397
  - Agent auto-detects stack from manifest files (package.json → Node, go.mod → Go, etc.).
374
- - On FAIL: for build/type/lint errors, use the Build-Fix Protocol (`commands/protocols/build-fix.md`) — fixes one error at a time with cascade detection. For test/security/diff failures, spawn a targeted fix agent. Re-verify. Max 3 fix attempts.
398
+ - On FAIL: for build/type/lint errors, use the Build-Fix Protocol (`protocols/build-fix.md`) — fixes one error at a time with cascade detection. For test/security/diff failures, spawn a targeted fix agent. Re-verify. Max 3 fix attempts.
375
399
  - On PASS: log `VERIFY: PASS (6/6)` to `docs/plans/.build-state.md`. Proceed.
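As an illustration of the "sequential, stop on first FAIL" rule in the bullets above, here is a minimal Python sketch. It is not taken from `protocols/verify.md`. The `run_verification` function, the callable-per-check shape, and the `VERIFY: FAIL (<check>)` detail format are assumptions made for the example; only the check order and the `VERIFY: PASS (6/6)` log line come from the text.

```python
from typing import Callable, Dict

# Check order as listed in the Verification Gate bullets.
CHECK_ORDER = ["Build", "Type-Check", "Lint", "Test", "Security", "Diff Review"]


def run_verification(checks: Dict[str, Callable[[], bool]]) -> str:
    """Run each check in order and stop at the first one that fails.

    `checks` maps a check name to a zero-argument callable returning True on
    success (hypothetical shape; a real agent would shell out to the
    project's build, lint, and test commands).
    """
    passed = 0
    for name in CHECK_ORDER:
        if not checks[name]():
            # Later checks are intentionally not run after a failure.
            return f"VERIFY: FAIL ({name})"
        passed += 1
    return f"VERIFY: PASS ({passed}/{len(CHECK_ORDER)})"
```

Stopping early keeps the fix loop focused on one failure at a time, which matches the "Max 3 fix attempts" budget described above.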
376
400
 
377
401
  Call the Agent tool — description: "Verify scaffolding" — mode: "bypassPermissions" — prompt: "Run the Verification Protocol. Execute all 6 checks sequentially, stop on first failure. Report: VERIFY: PASS or VERIFY: FAIL with details."
@@ -380,7 +404,7 @@ Do not proceed to Phase 5 until verification passes.
380
404
 
381
405
  Update TodoWrite and state.
382
406
 
383
- **Compaction checkpoint:** Check `dispatches_since_save` in `docs/plans/.build-state.md`. If >= 8: save ALL state (current phase, task statuses, metric loop scores, decisions) to `docs/plans/.build-state.md`. Reset `dispatches_since_save` to 0. TodoWrite does NOT survive compaction — rebuild it from this state file on resume.
407
+ **Compaction checkpoint.** Update `docs/plans/.build-state.md` per the format above.
384
408
 
385
409
  ---
386
410
 
@@ -396,13 +420,13 @@ Expand TodoWrite with each sprint task.
396
420
 
397
421
  ### Step 5.1 — Implement
398
422
 
399
- Call the Agent tool — description: "[task name]" — mode: "bypassPermissions" — prompt: "TASK: [task description + acceptance criteria]. HANDOFF — Architecture section: [paste ONLY the relevant section from architecture.md]. Design section: [paste ONLY the relevant section from the design doc]. Previous task output: [what the last completed task produced, if relevant]. Implement fully with real code and tests. Commit: 'feat: [task]'. Report what you built, files changed, and test results."
423
+ Call the Agent tool — description: "[task name]" — mode: "bypassPermissions" — prompt: "TASK: [task description + acceptance criteria]. HANDOFF — Architecture section: [paste ONLY the relevant section from architecture.md]. Design section: [paste ONLY the relevant section from the design doc]. Previous task output: [what the last completed task produced, if relevant]. For UI tasks: the living style guide at /design-system shows every component's exact styling and states — match it. Implement fully with real code and tests. Commit: 'feat: [task]'. Report what you built, files changed, and test results."
400
424
 
401
425
  Pick the right developer framing: frontend, backend, AI, etc. Set `[COMPLEXITY: S/M/L]` based on the task's Size from sprint-tasks.md.
402
426
 
403
427
  ### Step 5.1b — Cleanup (De-Sloppify)
404
428
 
405
- Follow the Cleanup Protocol (`commands/protocols/cleanup.md`). Critical rules (survive compaction):
429
+ Follow the Cleanup Protocol (`protocols/cleanup.md`). Critical rules (survive compaction):
406
430
  [COMPLEXITY: S]
407
431
  - Skip if trivial (< 20 lines, single file).
408
432
  - Cleanup agent is a SEPARATE agent from the implementer — no cleaning your own mess.
@@ -414,7 +438,7 @@ Call the Agent tool — description: "Cleanup [task name]" — mode: "bypassPerm
414
438
 
415
439
  ### Step 5.2 — Metric Loop: Task Quality
416
440
 
417
- Run the Metric Loop Protocol on the task implementation. Define a metric based on the task's acceptance criteria. Max 5 iterations.
441
+ Run the Metric Loop Protocol on the task implementation. Define a metric based on the task's acceptance criteria. For UI-facing tasks, include behavioral verification: the measurement agent should use agent-browser to verify the feature renders and responds to interaction, not just read the code. Max 5 iterations.
418
442
 
419
443
  ### Step 5.3 — Loop Exit
420
444
 
@@ -426,11 +450,23 @@ On stall or max iterations:
426
450
 
427
451
  After each task: update TodoWrite and `docs/plans/.build-state.md`.
428
452
 
453
+ ### Step 5.3b — Behavioral Smoke Test
454
+
455
+ Skip if this task has no Behavioral Test criteria (API-only, config, infrastructure tasks).
456
+
457
+ Run the Smoke Test Protocol (`protocols/smoke-test.md`). This uses agent-browser to open the app, execute the task's behavioral acceptance criteria, and verify the feature actually works.
458
+
459
+ Evidence saved to `docs/plans/evidence/[task-name]/`: annotated screenshot, snapshot diff, error log, network log, HAR file.
460
+
461
+ On FAIL: spawn fix agent with the evidence. The fix agent receives: what was expected (from acceptance criteria), what actually happened (snapshot diff + errors + screenshot), and the relevant source files. Max 2 fix-and-retest cycles.
462
+
463
+ On PASS: proceed to Step 5.4.
464
+
429
465
  ### Step 5.4 — Post-Task Verification
430
466
 
431
- Run the Verification Protocol (`commands/protocols/verify.md`) to catch regressions. If FAIL, fix before starting the next task.
467
+ Run the Verification Protocol (`protocols/verify.md`). If FAIL, fix before starting the next task.
432
468
 
433
- **Compaction checkpoint:** Check `dispatches_since_save` in `docs/plans/.build-state.md`. If >= 8: save ALL state (current phase, task statuses, metric loop scores, decisions) to `docs/plans/.build-state.md`. Reset `dispatches_since_save` to 0. TodoWrite does NOT survive compaction — rebuild it from this state file on resume.
469
+ **Compaction checkpoint.** Update `docs/plans/.build-state.md` per the format above.
434
470
 
435
471
  ---
436
472
 
@@ -438,23 +474,27 @@ Run the Verification Protocol (`commands/protocols/verify.md`) to catch regressi
438
474
 
439
475
  ### Step 6.0 — Pre-Hardening Verification
440
476
 
441
- Run the Verification Protocol (`commands/protocols/verify.md`). ONE agent, 6 sequential checks (Build → Type → Lint → Test → Security → Diff), stop on first FAIL. Max 3 fix attempts. All checks must pass before starting expensive audit agents — do not waste audit agents on code that doesn't build or pass tests.
477
+ Run the Verification Protocol (`protocols/verify.md`). All checks must pass before starting expensive audit agents.
478
+
479
+ ### Step 6.1 — Initial Audit (5 agents in parallel, ONE message)
442
480
 
443
- ### Step 6.1 — Initial Audit (4 agents in parallel, ONE message)
481
+ Read the NFRs from `docs/plans/sprint-tasks.md`. Pass the relevant NFR thresholds to each audit agent so they have concrete targets, not generic checks.
444
482
 
445
- Call the Agent tool 4 times in one message:
483
+ Call the Agent tool 5 times in one message:
446
484
 
447
- 1. Description: "API testing" — Prompt: "Comprehensive API validation: all endpoints, edge cases, error responses, auth flows. Report findings with counts."
485
+ 1. Description: "API testing" — Prompt: "Comprehensive API validation: all endpoints, edge cases, error responses, auth flows. NFR targets: [paste performance and reliability NFRs]. Report findings with counts."
448
486
 
449
- 2. Description: "Performance audit" — Prompt: "Measure response times, identify bottlenecks, flag performance issues. Report benchmarks."
487
+ 2. Description: "Performance audit" — Prompt: "Measure response times, identify bottlenecks, flag performance issues. NFR targets: [paste performance NFRs — e.g., API < 200ms, page load < 3s]. Report benchmarks AGAINST these targets."
450
488
 
451
- 3. Description: "Accessibility audit" — Prompt: "WCAG compliance audit on all interfaces. Check screen reader, keyboard nav, contrast. Report issues with counts."
489
+ 3. Description: "Accessibility audit" — Prompt: "WCAG compliance audit on all interfaces. NFR target: [paste accessibility NFR — e.g., WCAG AA]. Check screen reader, keyboard nav, contrast. Report issues with counts."
452
490
 
453
- 4. Description: "Security audit" — Prompt: "Security review: auth, input validation, data exposure, dependency vulnerabilities. Report findings with severity."
491
+ 4. Description: "Security audit" — Prompt: "Security review: auth, input validation, data exposure, dependency vulnerabilities. NFR targets: [paste security NFRs]. Report findings with severity."
492
+
493
+ 5. Description: "UX quality audit" — Prompt: "UX quality review of every user-facing page. NFR targets: [paste accessibility NFRs]. First, screenshot the living style guide at /design-system as your reference for how components should look. Then review every product page and check: loading states (every async action must show a loading indicator), error states (every form and API call must show user-friendly error feedback), empty states (every list/table must handle zero items gracefully), mobile responsiveness (test at 375px viewport — touch targets >= 44px, no horizontal scroll, readable text), form validation (inline feedback, not just alert()), transition smoothness (no layout shifts, no janky animations), visual consistency (compare each page's components against the style guide — buttons, inputs, cards, colors, spacing should match). Report issues with page, severity, and screenshot."
454
494
 
455
495
  ### Step 6.1b — Eval Harness
456
496
 
457
- Run the Eval Harness Protocol (`commands/protocols/eval-harness.md`). Define 8-15 concrete, executable eval cases from the audit findings and architecture doc. Run the eval agent. Record baseline pass rate. CRITICAL and HIGH failures feed into the metric loop in Step 6.2 as specific issues to fix.
497
+ Run the Eval Harness Protocol (`protocols/eval-harness.md`). Define 8-15 concrete, executable eval cases from the audit findings and architecture doc. For UI flows, eval cases should use agent-browser: "agent-browser open /dashboard -> agent-browser click @submit -> agent-browser wait --text Success -> expect text contains confirmation ID". Run the eval agent. Record baseline pass rate. CRITICAL and HIGH failures feed into the metric loop in Step 6.2 as specific issues to fix.
458
498
 
459
499
  ### Step 6.2 — Metric Loop: Hardening Quality
460
500
 
@@ -472,7 +512,7 @@ Re-run the Eval Harness after the metric loop exits. All CRITICAL eval cases mus
472
512
  ALL 3 ITERATIONS ARE MANDATORY. Do NOT stop after iteration 1 even if all tests pass. The purpose of 3 runs is to catch flaky tests, timing-dependent failures, and race conditions that only surface on repeated execution. Skip this step ONLY if the project has no user-facing frontend.
473
513
  </HARD-GATE>
474
514
 
475
- Generate and execute end-to-end tests using Playwright against the running application. Tests cover critical user journeys derived from the design doc and architecture.
515
+ Generate and execute end-to-end tests using Playwright against the running application. Tests cover the **User Journeys** defined in `docs/plans/sprint-tasks.md` (Step 0 of the Planning Protocol). Each journey = one E2E test file.
476
516
 
477
517
  **Iteration 1 — Generate & Run:**
478
518
 
@@ -481,12 +521,13 @@ Call the Agent tool — description: "E2E test generation" — mode: "bypassPerm
481
521
  "[COMPLEXITY: L] Generate and run end-to-end Playwright tests for this application.
482
522
 
483
523
  INPUTS:
484
- - Architecture doc (user flows and API contracts): [paste relevant sections from docs/plans/architecture.md]
485
- - Design doc (core user journeys): [paste relevant sections]
486
- - Visual Design Spec (component selectors and page structure): [paste relevant sections from docs/plans/visual-design-spec.md]
524
+ - User Journeys from docs/plans/sprint-tasks.md: [paste the User Journeys section — each journey becomes one E2E test]
525
+ - Architecture doc (API contracts): [paste relevant sections from docs/plans/architecture.md]
526
+ - NFRs from docs/plans/sprint-tasks.md: [paste — use performance thresholds as test assertions]
527
+ - Visual Design Spec (component selectors): [paste relevant sections from docs/plans/visual-design-spec.md]
487
528
 
488
529
  REQUIREMENTS:
489
- 1. Identify 5-10 critical user journeys from the design doc (auth flows, core feature flows, data entry, navigation)
530
+ 1. One E2E test per User Journey from sprint-tasks.md (each journey = one test file covering the full flow)
490
531
  2. Use Page Object Model pattern — one page object per major view
491
532
  3. Use data-testid selectors (add them to components if missing)
492
533
  4. Wait for API responses, NEVER use arbitrary timeouts (no waitForTimeout)
@@ -511,56 +552,67 @@ Record results: total tests, pass count, fail count, failure details. Log to `do
511
552
 
512
553
  **Iteration 2 — Fix & Re-run:**
513
554
 
514
- Call the Agent tool — description: "E2E fix iteration 2" — mode: "bypassPermissions" — prompt:
555
+ Call the Agent tool — description: "E2E fix iteration 2" — mode: "bypassPermissions" — prompt: "[COMPLEXITY: M] Fix E2E test failures from iteration 1: [paste failure details — test names, error messages, screenshot paths]. Diagnose each as real bug, flaky test, or missing selector. Fix accordingly — do NOT delete or skip tests. Re-run ALL tests. Commit: 'fix: e2e test failures iteration 2'."
515
556
 
516
- "[COMPLEXITY: M] Fix E2E test failures and re-run the full suite.
557
+ Record results in the E2E table. Identify flaky candidates (passed iter 1, failed iter 2 or vice versa).
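The flaky-candidate rule above (a test whose outcome differs between iteration 1 and iteration 2) can be sketched in a few lines of Python. The `flaky_candidates` function and the name-to-boolean result dictionaries are hypothetical; the plugin does not prescribe a result format.

```python
from typing import Dict, List


def flaky_candidates(iter1: Dict[str, bool], iter2: Dict[str, bool]) -> List[str]:
    """Return tests whose pass/fail outcome changed between two iterations.

    Each dict maps a test name to True (passed) or False (failed). Only tests
    present in both iterations are compared.
    """
    shared = iter1.keys() & iter2.keys()
    return sorted(name for name in shared if iter1[name] != iter2[name])
```

A test that fails in both iterations is not a flaky candidate under this rule; it is a consistent failure to be fixed in iteration 3.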
517
558
 
518
- ITERATION 1 RESULTS: [paste failure details — test names, error messages, screenshot paths]
559
+ **Iteration 3 — Final Stability Run:**
519
560
 
520
- For each failure:
521
- 1. Diagnose: Is this a real bug, a flaky test, or a missing data-testid?
522
- 2. Real bugs: Fix the application code
523
- 3. Flaky tests: Add proper waits, fix race conditions, improve selectors
524
- 4. Missing selectors: Add data-testid attributes to components
525
- 5. Do NOT delete or skip failing tests — fix them
561
+ Call the Agent tool — description: "E2E stability run" — mode: "bypassPermissions" — prompt: "[COMPLEXITY: M] Final E2E stability run (3 of 3). Previous results — Iter 1: [pass/fail counts], Iter 2: [pass/fail counts], Flaky candidates: [list]. Run ALL tests with --repeat-each=3. Quarantine inconsistent tests with test.fixme(). Fix remaining consistent failures. PASS CRITERIA: 95%+ pass rate (quarantined flaky tests excluded but logged). Commit: 'test: e2e stability fixes iteration 3'."
526
562
 
527
- Re-run ALL tests (not just previously failing ones). Report results.
528
- Commit fixes: 'fix: e2e test failures iteration 2'"
563
+ Record final results. Include in Reality Checker evidence.
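One way to read the iteration 3 pass criteria (95%+ pass rate, with quarantined flaky tests excluded but logged) is sketched below in Python. The `stability_verdict` function, its input shape (three sub-run outcomes per test, as `--repeat-each=3` would produce), and the choice to treat an empty stable set as a failed gate are all assumptions for illustration.

```python
from typing import Dict, List


def stability_verdict(results: Dict[str, List[bool]], threshold: float = 0.95) -> dict:
    """Compute the pass rate over stable tests and list quarantined ones.

    `results` maps a test name to its sub-run outcomes. A test with mixed
    outcomes is treated as flaky and quarantined; it is excluded from the
    pass rate but still reported.
    """
    quarantined = sorted(n for n, runs in results.items() if len(set(runs)) > 1)
    stable = {n: runs for n, runs in results.items() if n not in quarantined}
    passed = sum(1 for runs in stable.values() if all(runs))
    # Assumption: with no stable tests at all, the gate does not pass.
    rate = passed / len(stable) if stable else 0.0
    return {
        "pass_rate": rate,
        "quarantined": quarantined,
        "passed_gate": rate >= threshold,
    }
```

Reporting the quarantined names alongside the rate keeps them visible in the Reality Checker evidence rather than silently dropping them.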
529
564
 
530
- Record results in the E2E table. Identify any tests that passed in iteration 1 but failed in iteration 2 — these are flaky candidates.
565
+ ### Step 6.2d — Autonomous Dogfooding
531
566
 
532
- **Iteration 3 — Final Stability Run:**
567
+ Run the agent-browser dogfood skill against the running app. Unlike the per-task smoke tests (which verify specific acceptance criteria), dogfooding is **exploratory** — it autonomously navigates every reachable page, clicks buttons, fills forms, checks console errors, and finds issues we didn't think to test.
533
568
 
534
- Call the Agent tool — description: "E2E stability run" — mode: "bypassPermissions" — prompt:
569
+ Start the dev server if not running. Then invoke the dogfood skill:
535
570
 
536
- "[COMPLEXITY: M] Final E2E stability run — iteration 3 of 3.
571
+ Call the Agent tool — description: "Dogfood the app" — mode: "bypassPermissions" — prompt: "Run the agent-browser dogfood skill against the running app at http://localhost:[port]. Explore every reachable page. Click every button. Fill every form. Check console for errors. Report a structured list of issues with severity ratings (critical/high/medium/low), screenshots, and repro steps. If dogfood skill is not available, use agent-browser manually: snapshot each page, click all interactive elements, check errors and network requests. Also evaluate UX quality: missing loading states, poor error messages, broken mobile layouts (resize to 375px), visual inconsistencies, missing empty states, form validation gaps. Report UX issues separately from functional issues."
537
572
 
538
- PREVIOUS RESULTS:
539
- - Iteration 1: [pass/fail counts]
540
- - Iteration 2: [pass/fail counts]
541
- - Flaky candidates: [tests that had inconsistent results across iterations]
573
+ **Fix loop:** For each CRITICAL or HIGH issue found:
574
+ 1. Classify: is this a code bug (fix in Phase 5 style — spawn implementation fix agent) or a structural problem (needs architecture change — spawn architect agent to propose a fix plan, then implementation agent to execute)?
575
+ 2. Spawn the appropriate fix agent with: the issue description, repro steps, screenshot, affected page/component.
576
+ 3. After fixes, re-run dogfood on the affected pages only (not the full app). If new CRITICAL/HIGH issues appear, repeat. Max 3 fix cycles.
542
577
 
543
- REQUIREMENTS:
544
- 1. Run ALL tests with --repeat-each=3 to detect flakiness (each test runs 3 times within this iteration)
545
- 2. Any test failing inconsistently across the 3 sub-runs: quarantine with test.fixme() and file path + reason
546
- 3. Fix any remaining consistent failures
547
- 4. Generate final report with: total journeys, pass rate, flaky count, quarantined tests
548
- 5. Commit: 'test: e2e stability fixes iteration 3'
578
+ MEDIUM/LOW issues: log to `docs/plans/build-log.md` for the Reality Checker.
549
579
 
550
- PASS CRITERIA: 95%+ pass rate across all tests. Quarantined flaky tests do not count against pass rate but must be logged."
580
+ ### Step 6.2e — Fake Data Detector
551
581
 
552
- Record final results. Include in Reality Checker evidence.
582
+ Call the Agent tool — description: "Fake data audit" — mode: "bypassPermissions" — prompt: "Run the Fake Data Detector Protocol (protocols/fake-data-detector.md). Check for mock/hardcoded data in production paths. Static analysis: grep for Math.random() business data, hardcoded API responses, setTimeout faking async, placeholder text. Dynamic analysis: inspect HAR files from docs/plans/evidence/ for missing real API calls, static responses, absent WebSocket traffic. Report findings with file:line references and severity."
583
+
584
+ **Fix loop:** For each CRITICAL finding:
585
+ 1. Spawn a fix agent with: the finding (file:line, what's fake, what it should be), and the relevant source files.
586
+ 2. The fix agent replaces fake data with real API calls, real WebSocket connections, real data sources. If real data sources aren't available (missing API keys, no backend), the fix agent must flag this as a blocker — not paper over it with better-looking fake data.
587
+ 3. After fixes, re-run the fake data detector (static checks only — fast). Max 2 fix cycles.
553
588
 
554
- ### Step 6.3 — Reality Check
589
+ Remaining findings feed into the Reality Checker in Step 6.4.
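The static half of the fake data audit described above (grep for `Math.random()`, `setTimeout` used to fake async work, and placeholder text) could look roughly like the Python sketch below. This is not the contents of `protocols/fake-data-detector.md`. The pattern set, the labels, and the `scan_source` helper are hypothetical, and every hit is only a candidate that still needs review, since `setTimeout` and `Math.random()` have legitimate uses.

```python
import re
from typing import List

# Hypothetical pattern set, loosely based on the audit prompt's examples.
PATTERNS = {
    "random-data": re.compile(r"Math\.random\("),
    "fake-async": re.compile(r"setTimeout\("),
    "placeholder-text": re.compile(r"lorem ipsum", re.IGNORECASE),
}


def scan_source(path: str, text: str) -> List[str]:
    """Return candidate findings as 'file:line label' strings."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append(f"{path}:{lineno} {label}")
    return findings
```

The `file:line` prefix mirrors the reporting format the audit prompt asks for, so a fix agent can be pointed directly at each location.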
555
590
 
556
- Call the Agent tool description: "Final verdict" — prompt: "You are the Reality Checker. Default: NEEDS WORK. The hardening loop reached score [final_score] after [iterations] iterations. Score history: [paste table]. Review all evidence. Eval harness results: [baseline pass rate] → [final pass rate]. E2E test results: [paste E2E table — 3 iterations, final pass rate, quarantined count]. CRITICAL failures remaining: [list or none]. Verdict: PRODUCTION READY or NEEDS WORK with specifics."
591
+ ### Step 6.4 — Reality Check
592
+
593
+ Call the Agent tool — description: "Final verdict" — prompt: "You are the Reality Checker. Default: NEEDS WORK. The hardening loop reached score [final_score] after [iterations] iterations. Score history: [paste table]. Review all evidence. Eval harness results: [baseline pass rate] → [final pass rate]. E2E test results: [paste E2E table — 3 iterations, final pass rate, quarantined count]. Dogfood results: [paste issue count and any CRITICAL/HIGH findings, or 'clean — no issues found']. Fake data audit results: [paste findings or 'clean — no fake data detected']. CRITICAL failures remaining: [list or none]. Verdict: PRODUCTION READY or NEEDS WORK with specifics."
557
594
 
558
595
  <HARD-GATE>Do NOT self-approve. Reality Checker must give the verdict.</HARD-GATE>
559
596
 
560
- **Autonomous:** Log verdict to `docs/plans/build-log.md`. Continue.
561
- **Interactive:** Present score history + verdict to user. Update state.
597
+ **On PRODUCTION READY:** Log verdict. Proceed to Phase 7.
598
+
599
+ **On NEEDS WORK:** The Reality Checker returns specific issues. These must be fixed — not logged and ignored.
562
600
 
563
- **Compaction checkpoint:** Check `dispatches_since_save` in `docs/plans/.build-state.md`. If >= 8: save ALL state (current phase, task statuses, metric loop scores, decisions) to `docs/plans/.build-state.md`. Reset `dispatches_since_save` to 0. TodoWrite does NOT survive compaction — rebuild it from this state file on resume.
601
+ 1. Read the Reality Checker's specific findings. Classify each:
602
+ - **Code bug** (broken feature, failing test, fake data) → spawn implementation fix agent with the finding + affected files.
603
+ - **Structural issue** (missing feature, wrong architecture, data flow problem) → spawn architect agent to produce a fix plan, then implementation agent to execute it. This is a mini Phase 5 loop for the specific issue.
604
+ - **Blocker** (missing API key, no backend, needs human action) → log to `docs/plans/build-log.md` and present to user. Cannot be auto-fixed.
605
+ 2. After fixes, re-run verification (7 checks) + the specific failing gate (E2E, dogfood, or fake data — whichever surfaced the issue).
606
+ 3. Re-run the Reality Checker with updated evidence.
607
+
608
+ <HARD-GATE>
609
+ Max 2 NEEDS WORK cycles. If the Reality Checker returns NEEDS WORK a third time:
610
+ - **Interactive:** Present all remaining issues to user. Ask for direction.
611
+ - **Autonomous:** Log remaining issues to `docs/plans/build-log.md`. Proceed to Phase 7 with a warning in the completion report.
612
+ Do not loop forever.
613
+ </HARD-GATE>
614
+
615
+ **Compaction checkpoint.** Update `docs/plans/.build-state.md` per the format above.
564
616
 
565
617
  ---
566
618
 
@@ -568,7 +620,18 @@ Call the Agent tool — description: "Final verdict" — prompt: "You are the Re
568
620
 
569
621
  ### Step 7.0 — Pre-Ship Verification
570
622
 
571
- Final verification gate. Run the Verification Protocol (`commands/protocols/verify.md`). ONE agent, 6 sequential checks (Build → Type → Lint → Test → Security → Diff), stop on first FAIL. Max 3 fix attempts. All checks must pass before documenting and shipping. If FAIL persists, return to Phase 6 for targeted fixes.
623
+ Run the Verification Protocol (`protocols/verify.md`). All checks must pass before documenting and shipping. If FAIL persists after 3 fix attempts, return to Phase 6.
624
+
625
+ ### Step 7.0b — Requirements Coverage Report
626
+
627
+ Call the Agent tool — description: "Requirements coverage check" — prompt: "Re-read the original Design Document (docs/plans/*.md design doc) and the user journeys + NFRs from docs/plans/sprint-tasks.md. For EVERY feature listed in the MVP scope, verify: (1) it has a corresponding implemented task, (2) it has a passing test or behavioral verification, (3) it is reachable and functional in the running app. Produce a coverage table:
628
+
629
+ | MVP Feature | Task | Test | Verified | Status |
630
+ |-------------|------|------|----------|--------|
631
+
632
+ Mark each as COVERED, PARTIAL (implemented but untested), or MISSING. Any MISSING feature is a blocker — report it immediately."
633
+
634
+ If any features are MISSING: spawn implementation agents to build them, then re-run verification. This is the final safety net before shipping — it catches requirements that were planned but somehow never built.
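The COVERED / PARTIAL / MISSING classification in the coverage report can be expressed as a small decision function. The Python sketch below is one possible reading: the prompt defines PARTIAL only as "implemented but untested", so treating "tested but not verified in the running app" as PARTIAL is an assumption, as are the function names and the row tuple shape.

```python
from typing import List, Tuple


def coverage_status(has_task: bool, has_test: bool, verified: bool) -> str:
    """Classify one MVP feature for the coverage table."""
    if not has_task:
        return "MISSING"
    if has_test and verified:
        return "COVERED"
    # Assumption: anything implemented but lacking a test or live
    # verification counts as PARTIAL.
    return "PARTIAL"


def blockers(rows: List[Tuple[str, bool, bool, bool]]) -> List[str]:
    """Return the names of features classified as MISSING."""
    return [name for name, task, test, ok in rows
            if coverage_status(task, test, ok) == "MISSING"]
```

A non-empty `blockers` result corresponds to the case above where implementation agents are spawned before shipping.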
572
635
 
573
636
  ### Step 7.1 — Documentation
574
637