npm - buildanything - Versions diffs - 1.6.0 → 1.7.0 - Mend

buildanything 1.6.0 → 1.7.0

Files changed (77)

package/.claude-plugin/marketplace.json +2 -1
package/.claude-plugin/plugin.json +10 -2
package/agents/agentic-identity-trust.md +65 -311
package/agents/data-consolidation-agent.md +3 -22
package/agents/design-brand-guardian.md +52 -275
package/agents/design-image-prompt-engineer.md +67 -196
package/agents/design-ui-designer.md +37 -361
package/agents/design-ux-architect.md +51 -434
package/agents/design-ux-researcher.md +48 -299
package/agents/design-whimsy-injector.md +58 -405
package/agents/engineering-backend-architect.md +39 -202
package/agents/engineering-data-engineer.md +41 -236
package/agents/engineering-devops-automator.md +73 -258
package/agents/engineering-frontend-developer.md +33 -206
package/agents/engineering-mobile-app-builder.md +36 -446
package/agents/engineering-rapid-prototyper.md +34 -428
package/agents/engineering-security-engineer.md +44 -204
package/agents/engineering-senior-developer.md +18 -138
package/agents/engineering-technical-writer.md +40 -302
package/agents/marketing-app-store-optimizer.md +63 -276
package/agents/marketing-social-media-strategist.md +38 -87
package/agents/project-management-experiment-tracker.md +62 -156
package/agents/report-distribution-agent.md +4 -24
package/agents/sales-data-extraction-agent.md +3 -22
package/agents/specialized-cultural-intelligence-strategist.md +41 -62
package/agents/specialized-developer-advocate.md +65 -234
package/agents/support-analytics-reporter.md +76 -306
package/agents/support-executive-summary-generator.md +26 -172
package/agents/support-finance-tracker.md +67 -362
package/agents/support-legal-compliance-checker.md +40 -497
package/agents/support-support-responder.md +40 -532
package/agents/testing-accessibility-auditor.md +67 -271
package/agents/testing-api-tester.md +58 -274
package/agents/testing-evidence-collector.md +48 -170
package/agents/testing-performance-benchmarker.md +75 -236
package/agents/testing-reality-checker.md +49 -192
package/agents/testing-test-results-analyzer.md +70 -276
package/agents/testing-tool-evaluator.md +52 -368
package/agents/testing-workflow-optimizer.md +66 -415
package/bin/setup.js +45 -0
package/bin/sync-version.js +38 -0
package/commands/add-feature.md +98 -0
package/commands/build.md +156 -93
package/commands/dogfood.md +43 -0
package/commands/fix.md +89 -0
package/commands/idea-sweep.md +19 -82
package/commands/refactor.md +68 -0
package/commands/ux-review.md +81 -0
package/commands/verify.md +43 -0
package/hooks/session-start +5 -10
package/package.json +4 -1
package/agents/agents-orchestrator.md +0 -365
package/agents/data-analytics-reporter.md +0 -52
package/agents/lsp-index-engineer.md +0 -312
package/agents/macos-spatial-metal-engineer.md +0 -335
package/agents/marketing-content-creator.md +0 -52
package/agents/marketing-growth-hacker.md +0 -52
package/agents/product-sprint-prioritizer.md +0 -152
package/agents/product-trend-researcher.md +0 -157
package/agents/project-management-project-shepherd.md +0 -192
package/agents/project-management-studio-operations.md +0 -198
package/agents/project-management-studio-producer.md +0 -201
package/agents/project-manager-senior.md +0 -133
package/agents/support-infrastructure-maintainer.md +0 -616
package/agents/terminal-integration-specialist.md +0 -68
package/agents/visionos-spatial-engineer.md +0 -52
package/agents/xr-cockpit-interaction-specialist.md +0 -30
package/agents/xr-immersive-developer.md +0 -30
package/agents/xr-interface-architect.md +0 -30
package/commands/protocols/brainstorm.md +0 -99
package/commands/protocols/build-fix.md +0 -52
package/commands/protocols/cleanup.md +0 -56
package/commands/protocols/design.md +0 -287
package/commands/protocols/eval-harness.md +0 -62
package/commands/protocols/metric-loop.md +0 -94
package/commands/protocols/planning.md +0 -56
package/commands/protocols/verify.md +0 -63

package/commands/add-feature.md ADDED

@@ -0,0 +1,98 @@
+---
+description: "Add a single feature to an existing project — lightweight build cycle using existing architecture, design system, and CLAUDE.md context"
+argument-hint: "Describe the feature to add. --autonomous for unattended mode."
+---
+<HARD-GATE>
+YOU ARE AN ORCHESTRATOR. YOU COORDINATE AGENTS. YOU DO NOT WRITE CODE.
+"Launch an agent" = call the Agent tool. For implementation agents, set mode: "bypassPermissions". For parallel work, put multiple Agent tool calls in ONE message.
+</HARD-GATE>
+Input: $ARGUMENTS
+If the input contains `--autonomous` or `--auto`, skip user approval gates and log decisions to `docs/plans/build-log.md`.
+---
+## Phase 1: Context Gathering
+Read these files directly (no agent needed — this is fast):
+1. `CLAUDE.md` — product context, tech stack, rules
+2. `docs/plans/architecture.md` — current architecture
+3. `docs/plans/sprint-tasks.md` — existing user journeys and scope
+If any file is missing, proceed with what exists. If the codebase is unfamiliar or the feature touches unknown areas, spawn an Explore agent:
+Call the Agent tool — description: "Explore codebase for [feature area]" — prompt: "Find all files related to [feature area]. Report: directory structure, key files, patterns used, relevant components/routes/APIs. Be concise."
+---
+## Phase 2: Plan the Feature
+You do this yourself — no agent needed.
+1. **Break the feature into 1-5 tasks** (most features are 1-3). Each task should be one commit-sized unit of work.
+2. **Define behavioral acceptance criteria** for each task — what must be true when the task is done.
+3. **Define the user journey** — the end-to-end flow the user will experience with this feature.
+4. **Present the plan to the user for approval.** In autonomous mode, log the plan to `docs/plans/build-log.md` and proceed.
+---
+## Phase 3: Build
+**For EACH task:**
+### Step 3.1 — Implement
+Call the Agent tool — description: "[task name]" — mode: "bypassPermissions" — prompt: "TASK: [task description + acceptance criteria]. HANDOFF — Architecture context: [paste ONLY the relevant section from architecture.md]. Style guide: the living style guide at /design-system shows component styling — match it. Implement with real code and tests. Commit: 'feat: [task]'. Report what you built, files changed, and test results."
+Set `[COMPLEXITY: S/M/L]` based on task scope.
+### Step 3.2 — Cleanup
+Skip if trivial (< 20 lines, single file). Otherwise:
+Call the Agent tool — description: "Cleanup [task name]" — mode: "bypassPermissions" — prompt: "Clean up these files: [list from implementation]. Fix: naming, dead code, unused imports, style, DRY. Do NOT add features, change architecture, or touch other files. If cleanup breaks acceptance criteria, revert."
+### Step 3.3 — Smoke Test
+Skip if this task has no UI surface. Otherwise run the Smoke Test Protocol (`protocols/smoke-test.md`): open the affected route, execute behavioral acceptance criteria via agent-browser, collect evidence. On FAIL: spawn fix agent with evidence. Max 2 fix-and-retest cycles.
+### Step 3.4 — Verification
+Run the Verification Protocol (`protocols/verify.md`). All 7 checks. If FAIL, fix before starting the next task.
+---
+## Phase 4: End-to-End Verification
+### Step 4.1 — Run the User Journey
+Call the Agent tool — description: "E2E: [feature name]" — mode: "bypassPermissions" — prompt: "Verify the full user journey for [feature name]: [paste the user journey from Phase 2]. Use agent-browser to walk through each step. For each step: interact, verify the expected outcome, capture evidence. Report PASS/FAIL per step with screenshots."
+### Step 4.2 — Dogfood Affected Pages
+Call the Agent tool — description: "Dogfood [feature area]" — prompt: "Open every page affected by [feature name]. Check for: broken layouts, console errors, missing data, dead links, regressions. Report issues with screenshots."
+### Step 4.3 — Fix Loop
+If issues found in 4.1 or 4.2: spawn a fix agent with the evidence. Re-run the failing check. Max 2 fix-and-retest cycles. After 2 failures:
+- **Interactive:** present evidence to the user.
+- **Autonomous:** log to `docs/plans/build-log.md` and proceed with a warning.
+---
+## Phase 5: Done
+Report to the user:
+```
+FEATURE COMPLETE: [feature name]
+Tasks: [done]/[total] | Tests: [count] passing
+User journey: PASS/FAIL
+Evidence: [paths to screenshots/logs]
+```
+If the feature expands the product scope, update `CLAUDE.md` to reflect the new capability.
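
Note on the new command: Steps 3.3 and 4.3 both describe a bounded fix-and-retest loop (at most 2 cycles, then escalate differently in interactive and autonomous mode). A minimal sketch of that control flow follows. Every name in it (`fixLoop`, `runCheck`, `spawnFixAgent`, the return shape) is hypothetical and not part of the buildanything package. Matching the flags as whole tokens is also an assumption, since the command only says the input "contains" them.

```javascript
// Illustrative sketch only: all function names and return shapes here are
// hypothetical and do not come from the buildanything package.
const MAX_FIX_CYCLES = 2; // "Max 2 fix-and-retest cycles"

// Assumption: a flag counts only as a whole whitespace-delimited token.
function isAutonomous(input) {
  return /(^|\s)--(autonomous|auto)(\s|$)/.test(input);
}

// runCheck() returns { pass: boolean, evidence: any }.
// spawnFixAgent(evidence) stands in for dispatching a fix agent.
function fixLoop({ runCheck, spawnFixAgent, autonomous }) {
  let result = runCheck();
  let cycles = 0;
  while (!result.pass && cycles < MAX_FIX_CYCLES) {
    spawnFixAgent(result.evidence); // hand the failure evidence to the fixer
    cycles += 1;
    result = runCheck(); // re-run only the failing check
  }
  if (result.pass) {
    return { status: "PASS", cycles };
  }
  // After the budget is spent: interactive mode shows the evidence to the
  // user, autonomous mode logs to the build log and proceeds with a warning.
  return {
    status: "FAIL",
    cycles,
    escalation: autonomous ? "log-and-proceed" : "present-to-user",
  };
}
```

The loop re-checks after every fix, so a feature that passes on the first retest exits with `cycles` equal to 1 and never reaches the escalation branch.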

package/commands/build.md CHANGED

@@ -51,6 +51,8 @@ If you catch yourself typing code or reading source files: STOP. You are wasting
 - `last_save: [Phase.Step]`
 Increment after each agent returns (parallel dispatch of 4 agents = +4). Reset to 0 after each compaction save.
+**Compaction checkpoint format:** At every phase boundary, check `dispatches_since_save` in `docs/plans/.build-state.md`. If >= 8: save ALL state (current phase, task statuses, metric loop scores, decisions) to `docs/plans/.build-state.md`. Reset `dispatches_since_save` to 0. TodoWrite does NOT survive compaction — rebuild it from this state file on resume.
 Input: $ARGUMENTS
 ### Autonomous Mode
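
Note on this hunk: the consolidated checkpoint rule amounts to a counter with a threshold. A rough sketch, assuming the state file is modeled as a plain object and persistence is passed in as a callback. The function names and the `saveState` callback are hypothetical, and only the field names `dispatches_since_save` and `last_save` come from the diff.

```javascript
// Illustrative sketch only: function names and the saveState callback are
// hypothetical. The field names mirror those quoted in the hunk above.
const SAVE_THRESHOLD = 8; // "If >= 8: save ALL state"

// Called after agents return; a parallel dispatch of 4 agents adds 4.
function recordDispatches(state, agentsReturned) {
  return {
    ...state,
    dispatches_since_save: state.dispatches_since_save + agentsReturned,
  };
}

// Called at every phase boundary. Saves and resets only past the threshold.
function checkpointAtPhaseBoundary(state, currentStep, saveState) {
  if (state.dispatches_since_save < SAVE_THRESHOLD) {
    return state; // below threshold: nothing to persist yet
  }
  const saved = { ...state, dispatches_since_save: 0, last_save: currentStep };
  saveState(saved); // stands in for writing docs/plans/.build-state.md
  return saved;
}
```

Returning a fresh object instead of mutating keeps the sketch easy to test. A real orchestrator would read and write the markdown state file instead.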
@@ -67,7 +69,7 @@ When combining `--resume` with `--autonomous`: the current invocation's flags ta
 ### Metric Loop
-Every phase uses a **metric-driven iteration loop** to drive quality. Read the full protocol at `commands/protocols/metric-loop.md`. Critical rules (survive compaction):
+Every phase uses a **metric-driven iteration loop** to drive quality. Read the full protocol at `protocols/metric-loop.md`. Critical rules (survive compaction):
 1. YOU define a metric for this phase based on context (what you're building, what matters). The metric is NOT predefined.
 2. Spawn a **measurement agent** to score the artifact 0-100. Read its full output — it's analysis.
@@ -95,15 +97,7 @@ For implementation agents (Phase 5+): Do NOT paste the entire Design Document or
 ### Complexity Routing (Advisory)
-When composing agent prompts, prefix with `[COMPLEXITY: S/M/L]` to hint at the appropriate model tier:
-| Complexity | Task Types | Preferred Tier |
-|-----------|-----------|----------------|
-| S | Build-fix, cleanup, lint fix, single-error fix | Haiku-class (fastest) |
-| M | Measurement, eval, testing, single-feature impl | Sonnet-class (balanced) |
-| L | Architecture, research, multi-file impl, debugging | Opus-class (deepest reasoning) |
-For sprint tasks, use the Size field from `docs/plans/sprint-tasks.md`. This is advisory — the tag documents intent for future model routing support.
+Tag agent prompts with `[COMPLEXITY: S/M/L]` based on task size from `docs/plans/sprint-tasks.md`. This is advisory — the tag documents intent for future model routing support.
 ---
@@ -112,7 +106,7 @@ For sprint tasks, use the Size field from `docs/plans/sprint-tasks.md`. This is
 **Resuming?** If the input contains `--resume` OR if context was just compacted (SessionStart hook fired with active state):
 1. Read `docs/plans/.build-state.md` — verify it exists and has a Resume Point section.
    If `docs/plans/.build-state.md` does not exist or has no Resume Point, warn the user: 'No previous build state found. Starting fresh.' Then proceed to Step 0.1 as a new build.
-2. Re-read this file and all protocol files in `commands/protocols/`.
+2. Re-read this file and all protocol files in `protocols/`.
 3. Re-read `docs/plans/sprint-tasks.md`, `docs/plans/architecture.md`, and `CLAUDE.md`.
 4. Rebuild TodoWrite from the state file (TodoWrite does NOT survive compaction or session breaks).
 5. Reset `dispatches_since_save` to 0 (fresh context window).
@@ -183,7 +177,7 @@ Autonomous mode: Log checklist to `docs/plans/build-log.md`. Create `.env.exampl
 ### Step 1.1 — Brainstorming
-Follow the Brainstorm Protocol (`commands/protocols/brainstorm.md`).
+Follow the Brainstorm Protocol (`protocols/brainstorm.md`).
 In interactive mode: this is a conversation. Ask questions one at a time, propose approaches with trade-offs, let the user decide. Output: Design Document saved to `docs/plans/`.
@@ -195,15 +189,15 @@ Skip if context level is "Decision brief" (research already done).
 Call the Agent tool 5 times in a single message. Pass each agent the build request AND the Design Document draft.
-1. Description: "Market research" — Prompt: "Research market size (TAM/SAM/SOM), competitive landscape (5-10 players), timing, and market structure for: [build request]. Design context: [paste design doc]. Use web search extensively. Report with a Market Verdict: GREEN/AMBER/RED."
+1. Description: "Market research" — Prompt: "Research market size (TAM/SAM/SOM), competitive landscape (5-10 players), timing, and market structure for: [build request]. Design context: [paste design doc]. Report with a Market Verdict: GREEN/AMBER/RED."
-2. Description: "Tech feasibility" — Prompt: "Evaluate hard technical problems (Solved/Hard/Unsolved), build-vs-buy decisions, MVP scope, and stack validation for: [build request]. Design context: [paste design doc]. Search for APIs and libraries mentioned in the design to verify they exist and are maintained. Report with a Technical Verdict."
+2. Description: "Tech feasibility" — Prompt: "Evaluate hard technical problems (Solved/Hard/Unsolved), build-vs-buy decisions, MVP scope, and stack validation for: [build request]. Design context: [paste design doc]. Verify APIs and libraries from the design exist and are maintained. Report with a Technical Verdict."
-3. Description: "User research" — Prompt: "Analyze target persona, jobs-to-be-done, current alternatives, behavioral barriers to adoption for: [build request]. Design context: [paste design doc]. Search for real user complaints and communities discussing this problem. Report with a User Verdict."
+3. Description: "User research" — Prompt: "Analyze target persona, jobs-to-be-done, current alternatives, and behavioral barriers to adoption for: [build request]. Design context: [paste design doc]. Report with a User Verdict."
-4. Description: "Business model" — Prompt: "Evaluate revenue models, unit economics, growth loops, first-1000-users strategy for: [build request]. Design context: [paste design doc]. Search for comparable pricing and growth data. Report with a Business Verdict."
+4. Description: "Business model" — Prompt: "Evaluate revenue models, unit economics, growth loops, and first-1000-users strategy for: [build request]. Design context: [paste design doc]. Report with a Business Verdict."
-5. Description: "Risk analysis" — Prompt: "Adversarial review: regulatory risk, security concerns, dependency risks, competitive response, top 3 failure modes for: [build request]. Design context: [paste design doc]. Search for enforcement actions and comparable failures. Report with a Risk Verdict."
+5. Description: "Risk analysis" — Prompt: "Adversarial review: regulatory risk, security concerns, dependency risks, competitive response, top 3 failure modes for: [build request]. Design context: [paste design doc]. Report with a Risk Verdict."
 After all 5 return, synthesize a **Research Brief** with a verdict table. Save to `docs/plans/research-brief.md`.
@@ -218,17 +212,41 @@ Read the Design Document and Research Brief together. Check for contradictions:
 Update the Design Document with corrections. Save final version.
-### Step 1.4 — Persist Decisions
+### Step 1.4 — Write CLAUDE.md
+Create (or overwrite) the project's `CLAUDE.md`. This is the product brain — every agent spawned during the build reads it automatically. Write it from the Design Document and Research Brief. It must give any agent enough context to make smart product, UX, and technical decisions without needing the full design doc.
+<HARD-GATE>
+CLAUDE.md must be under 200 lines. It is not a wiki, not a conventions doc, not a dump of everything you know. It is the minimum context an agent needs to make correct decisions about this specific product.
+</HARD-GATE>
-Append key decisions to the project's `CLAUDE.md` (create if needed) under `## Build Decisions`:
+Structure:
-- Project name and one-line description
-- Primary user and core value prop
-- Tech stack (with rationale)
-- Key constraints or risks
-- MVP scope boundary (in vs. deferred)
+```
+## Product
+[1-3 sentences: what this is, core value prop, what success looks like]
+## User
+[Primary persona: who they are, what they care about, pain points,
+technical sophistication. This drives every UX decision.]
+## Tech Stack
+[Stack choices with 1-line rationale for each. Framework, DB, auth,
+key libraries, deployment target.]
+## Scope
+[What's in MVP vs. deferred. Hard boundaries. This prevents agents
+from building features that aren't scoped.]
+## Rules
+[Project-specific hard rules derived from the product and user context.
+Examples: "All data must be real-time — no simulated/fake data",
+"User must be able to pause/stop any automated process at any time",
+"Every interactive element must have visible feedback within 200ms".
+Only include rules this specific project needs — not generic best practices.]
+```
-This ensures decisions survive context compaction.
+Keep it product-focused. An implementation agent reading this should understand WHO the user is and WHAT matters enough to make the right call when the handoff prompt doesn't cover an edge case.
 ### Quality Gate 1
@@ -238,7 +256,7 @@ This ensures decisions survive context compaction.
 Update TodoWrite and `docs/plans/.build-state.md`.
-**Compaction checkpoint:** Check `dispatches_since_save` in `docs/plans/.build-state.md`. If >= 8: save ALL state (current phase, task statuses, metric loop scores, decisions) to `docs/plans/.build-state.md`. Reset `dispatches_since_save` to 0. TodoWrite does NOT survive compaction — rebuild it from this state file on resume.
+**Compaction checkpoint.** Update `docs/plans/.build-state.md` per the format above.
 ---
@@ -270,13 +288,13 @@ After all 4 return, YOU synthesize into one Architecture Document. Save to `docs
 ### Step 2.3 — Metric Loop: Architecture Quality
-Run the Metric Loop Protocol (`commands/protocols/metric-loop.md`) on the Architecture Document. Define a metric based on this project — coverage of design doc requirements, specificity, consistency between agents. Max 3 iterations.
+Run the Metric Loop Protocol (`protocols/metric-loop.md`) on the Architecture Document. Define a metric based on: coverage of design doc requirements, specificity, consistency between agents, and **simplicity** — is this the simplest architecture that meets the requirements? Could any service, abstraction, or dependency be eliminated without losing functionality? Penalize over-engineering (microservices for a simple app, Kubernetes for a static site, complex state management for a 3-page app). Max 3 iterations.
 ### Step 2.4 — Sprint Planning
-Follow the Planning Protocol (`commands/protocols/planning.md`). Use 2 sequential Agent tool calls:
+Follow the Planning Protocol (`protocols/planning.md`). Use 2 sequential Agent tool calls:
-Call the Agent tool — description: "Sprint breakdown" — prompt: "Break this architecture into ordered, atomic tasks. Each task needs: description, acceptance criteria, dependencies, size (S/M/L). ARCHITECTURE: [paste]. DESIGN DOC: [paste]. Scope to MVP only."
+Call the Agent tool — description: "Sprint breakdown" — prompt: "Break this architecture into ordered, atomic tasks. Each task needs: description, acceptance criteria, dependencies, size (S/M/L). Include a `**Behavioral Test:**` field for every task that has UI — a concrete interaction test: 'Navigate to [page], click [element], verify [expected outcome]'. API-only tasks should have curl-based acceptance tests instead. ARCHITECTURE: [paste]. DESIGN DOC: [paste]. Scope to MVP only."
 Then call the Agent tool — description: "Validate task list" — prompt: "Validate this task list: [paste]. Check scope is realistic, no missing tasks, descriptions specific enough for a developer agent to execute, all tasks within MVP boundary."
@@ -290,7 +308,7 @@ Save to `docs/plans/sprint-tasks.md`.
 Update TodoWrite and `docs/plans/.build-state.md`.
-**Compaction checkpoint:** Check `dispatches_since_save` in `docs/plans/.build-state.md`. If >= 8: save ALL state (current phase, task statuses, metric loop scores, decisions) to `docs/plans/.build-state.md`. Reset `dispatches_since_save` to 0. TodoWrite does NOT survive compaction — rebuild it from this state file on resume.
+**Compaction checkpoint.** Update `docs/plans/.build-state.md` per the format above.
 ---
@@ -301,14 +319,14 @@ Update TodoWrite and `docs/plans/.build-state.md`.
 **Skip if** the project has no user-facing frontend (CLI tools, pure APIs, backend services).
 <HARD-GATE>
-UI/UX IS THE PRODUCT. This phase is a full peer to Architecture and Build — not a footnote, not an afterthought, not a "nice to have." Do NOT skip, compress, or rush this phase for any reason. The agents must research real competitors and award-winning sites, make deliberate visual choices backed by that research, build proof screens, and iterate with Playwright-verified visual QA before a single line of product code is written.
+UI/UX IS THE PRODUCT. This phase is a full peer to Architecture and Build — not a footnote, not an afterthought, not a "nice to have." Do NOT skip, compress, or rush this phase for any reason. The agents must research real competitors and award-winning sites, make deliberate visual choices backed by that research, build a living style guide with every component rendered and interactive, and iterate with Playwright-verified visual QA before a single line of product code is written.
 Phase 4 (Foundation) WILL NOT START without `docs/plans/visual-design-spec.md`. If it does not exist, return here.
 </HARD-GATE>
 ### Step 3.1 — Design Research (2 agents, parallel, both use Playwright)
-Follow the Design Protocol (`commands/protocols/design.md`), Step 3.1.
+Follow the Design Protocol (`protocols/design.md`), Step 3.1.
 Call the Agent tool 2 times in one message:
@@ -320,21 +338,23 @@ After both return, synthesize a **Design Research Brief** to `docs/plans/design-
 ### Step 3.2 — Design Direction (2 agents, sequential)
-Follow the Design Protocol (`commands/protocols/design.md`), Step 3.2.
+Follow the Design Protocol (`protocols/design.md`), Step 3.2.
 1. Call the Agent tool — description: "UX architecture" — Prompt: "Create structural design foundation. INPUTS: frontend architecture section from architecture.md [paste], Design Research Brief [paste], reference screenshot paths [list], user persona [paste]. OUTPUT: information architecture, layout strategy, component hierarchy, responsive approach, interaction patterns. Base decisions on competitive research, not generic patterns."
 2. Call the Agent tool — description: "Visual design spec" — Prompt: "Create the Visual Design Spec with AUTONOMOUS decisions — pick the single best direction, do not present options. INPUTS: UX foundation [paste previous output], Design Research Brief [paste], reference screenshot paths [list], user persona [paste]. OUTPUT: color system (with hex, light+dark), typography (Google Fonts, mathematical scale), 8px spacing system, tinted shadow system, border radius, animation/motion, component styles with ALL states. Every choice must cite the research. Apply anti-AI-template rules from the Design Protocol. Save to docs/plans/visual-design-spec.md."
-### Step 3.3 — Proof Screens (1 implementation agent)
+### Step 3.3 — Living Style Guide (1 implementation agent)
+Follow the Design Protocol (`protocols/design.md`), Step 3.3.
-Call the Agent tool — description: "Build proof screens" — mode: "bypassPermissions" — prompt: "[COMPLEXITY: L] Implement 2-3 proof screens (landing/hero, main app view, key form). INPUTS: Visual Design Spec [paste], UX foundation [paste relevant sections], reference screenshots [list paths — these are your visual targets]. Use EXACT colors, fonts, spacing from spec. Real styled responsive pages, not wireframes. Include hover/focus states, transitions. Commit: 'feat: proof screens for design validation'."
+Call the Agent tool — description: "Build living style guide" — mode: "bypassPermissions" — prompt: "[COMPLEXITY: L] Build a living style guide page (/design-system route or standalone HTML). INPUTS: Visual Design Spec [paste], UX foundation [paste relevant sections], reference screenshots [list paths — these are your quality targets]. Must include rendered, interactive examples of: color swatches, typography scale, spacing scale, buttons (all states), form elements (all states), cards, navigation, feedback components (alerts, toasts, spinners, empty states), modals/overlays, and layout grid examples. Every component interactive (hover, focus, transitions work). Mobile-responsive. This ships with the product. Commit: 'feat: living style guide'."
 ### Step 3.4 — Visual QA Loop (Playwright + Metric Loop)
-Run the Metric Loop Protocol (`commands/protocols/metric-loop.md`) using the measurement criteria from the Design Protocol (`commands/protocols/design.md`, Step 3.4).
+Run the Metric Loop Protocol (`protocols/metric-loop.md`) using the measurement criteria from the Design Protocol (`protocols/design.md`, Step 3.4).
-Measurement: Playwright screenshots of proof screens (desktop + mobile). Design critic agent scores 0-100 across 6 dimensions: spacing/alignment, typography hierarchy, color harmony, component polish, responsive quality, originality (anti-AI-template check). Receives screenshots + Visual Design Spec + reference screenshots.
+Measurement: Playwright screenshots of the living style guide sections (desktop + mobile). Design critic agent scores 0-100 across 6 dimensions: spacing/alignment, typography hierarchy, color harmony, component polish, responsive quality, originality (anti-AI-template check). Receives screenshots + Visual Design Spec + reference screenshots.
 **Target: 80. Max 5 iterations.** On stall: accept if >= 65, log warning below 65.
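
Note on the stop rule above: the thresholds (target 80, at most 5 iterations, floor of 65 on stall) can be read as a small decision function. The sketch below is one interpretation, not the protocol itself. How a stall is detected is defined in the metric-loop protocol, which this diff does not show, so `stalled` is taken as an input. Treating the iteration cap the same way as a stall is an assumption, and the function name and return values are hypothetical.

```javascript
// Illustrative sketch only: the function name, its inputs, and the returned
// strings are hypothetical. Thresholds are the ones quoted in the hunk above.
const TARGET = 80;
const MAX_ITERATIONS = 5;
const STALL_FLOOR = 65;

function visualQaDecision({ score, iteration, stalled }) {
  if (score >= TARGET) {
    return "accept";
  }
  // Assumption: hitting the iteration cap is handled like a stall.
  const exhausted = iteration >= MAX_ITERATIONS;
  if (!stalled && !exhausted) {
    return "iterate";
  }
  // On stall: accept at or above the floor, otherwise proceed but log a warning.
  return score >= STALL_FLOOR ? "accept" : "proceed-with-warning";
}
```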
@@ -342,7 +362,7 @@ Measurement: Playwright screenshots of proof screens (desktop + mobile). Design
 Log to `docs/plans/build-log.md`: final screenshot paths, score history table, design decisions, originality score. No user pause. Proceed to Phase 4.
-**Compaction checkpoint:** Check `dispatches_since_save` in `docs/plans/.build-state.md`. If >= 8: save ALL state (current phase, task statuses, metric loop scores, decisions) to `docs/plans/.build-state.md`. Reset `dispatches_since_save` to 0. TodoWrite does NOT survive compaction — rebuild it from this state file on resume.
+**Compaction checkpoint.** Update `docs/plans/.build-state.md` per the format above.
 ---
@@ -360,7 +380,11 @@ Call the Agent tool — description: "Project scaffolding" — mode: "bypassPerm
 ### Step 4.2 — Design System (frontend only)
-Call the Agent tool — description: "Design system setup" — mode: "bypassPermissions" — prompt: "Implement the design system from the Visual Design Spec: [paste from docs/plans/visual-design-spec.md]. Create CSS tokens matching the spec's color system, typography scale, spacing system, shadow/elevation tokens, and base layout components. Reference the proof screens from Phase 3 as implementation targets. Commit: 'feat: design system'."
+Call the Agent tool — description: "Design system setup" — mode: "bypassPermissions" — prompt: "Implement the design system from the Visual Design Spec: [paste from docs/plans/visual-design-spec.md]. Create CSS tokens matching the spec's color system, typography scale, spacing system, shadow/elevation tokens, and base layout components. The living style guide from Phase 3 is the reference implementation — components must match. Commit: 'feat: design system'."
+### Step 4.2b — Acceptance Test Scaffolding
+Call the Agent tool — description: "Scaffold acceptance tests" — mode: "bypassPermissions" — prompt: "Read docs/plans/sprint-tasks.md. For every task with a Behavioral Test field, create a Playwright test stub in tests/e2e/acceptance/. Use Page Object Model. Each test should: navigate to the page, perform the interaction, assert the expected outcome. Tests should FAIL right now (features aren't built yet) — that's correct. Also ensure agent-browser is available (run `which agent-browser`). Commit: 'test: scaffold acceptance tests from sprint tasks'."
 ### Step 4.3 — Metric Loop: Scaffold Health
@@ -368,10 +392,10 @@ Run the Metric Loop Protocol. Define a metric: builds clean, tests pass, lint cl
 ### Step 4.4 — Verification Gate
-Run the Verification Protocol (`commands/protocols/verify.md`). Critical rules (survive compaction):
+Run the Verification Protocol (`protocols/verify.md`). Critical rules (survive compaction):
 - ONE agent runs all 6 checks sequentially: Build → Type-Check → Lint → Test → Security → Diff Review. Stop on first FAIL.
 - Agent auto-detects stack from manifest files (package.json → Node, go.mod → Go, etc.).
-- On FAIL: for build/type/lint errors, use the Build-Fix Protocol (`commands/protocols/build-fix.md`) — fixes one error at a time with cascade detection. For test/security/diff failures, spawn a targeted fix agent. Re-verify. Max 3 fix attempts.
+- On FAIL: for build/type/lint errors, use the Build-Fix Protocol (`protocols/build-fix.md`) — fixes one error at a time with cascade detection. For test/security/diff failures, spawn a targeted fix agent. Re-verify. Max 3 fix attempts.
 - On PASS: log `VERIFY: PASS (6/6)` to `docs/plans/.build-state.md`. Proceed.
 Call the Agent tool — description: "Verify scaffolding" — mode: "bypassPermissions" — prompt: "Run the Verification Protocol. Execute all 6 checks sequentially, stop on first failure. Report: VERIFY: PASS or VERIFY: FAIL with details."
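
Note on the verification gate: the rules in this hunk describe six checks run in order with a stop on the first failure, then a fix route that depends on which check failed. A minimal sketch under those rules follows. The `runners` map, the function name, and the exact text of the FAIL line are hypothetical. Only the check order, the routing split, and the `VERIFY: PASS (6/6)` format come from the diff.

```javascript
// Illustrative sketch only: runVerification and the runners map are
// hypothetical. The check order and routing follow the rules quoted above.
const CHECKS = ["Build", "Type-Check", "Lint", "Test", "Security", "Diff Review"];
const BUILD_FIX_CHECKS = new Set(["Build", "Type-Check", "Lint"]);

// runners maps each check name to a function returning true on success.
function runVerification(runners) {
  for (const name of CHECKS) {
    if (!runners[name]()) {
      // Stop on first FAIL; later checks are never started.
      return {
        line: `VERIFY: FAIL (${name})`, // assumed format for the failure line
        failedAt: name,
        fixRoute: BUILD_FIX_CHECKS.has(name)
          ? "build-fix-protocol"
          : "targeted-fix-agent",
      };
    }
  }
  return {
    line: `VERIFY: PASS (${CHECKS.length}/${CHECKS.length})`,
    failedAt: null,
    fixRoute: null,
  };
}
```

The caller would retry after the chosen fix route, up to the 3 attempts the rules allow.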
@@ -380,7 +404,7 @@ Do not proceed to Phase 5 until verification passes.
 Update TodoWrite and state.
-**Compaction checkpoint:** Check `dispatches_since_save` in `docs/plans/.build-state.md`. If >= 8: save ALL state (current phase, task statuses, metric loop scores, decisions) to `docs/plans/.build-state.md`. Reset `dispatches_since_save` to 0. TodoWrite does NOT survive compaction — rebuild it from this state file on resume.
+**Compaction checkpoint.** Update `docs/plans/.build-state.md` per the format above.
 ---
@@ -396,13 +420,13 @@ Expand TodoWrite with each sprint task.
 ### Step 5.1 — Implement
-Call the Agent tool — description: "[task name]" — mode: "bypassPermissions" — prompt: "TASK: [task description + acceptance criteria]. HANDOFF — Architecture section: [paste ONLY the relevant section from architecture.md]. Design section: [paste ONLY the relevant section from the design doc]. Previous task output: [what the last completed task produced, if relevant]. Implement fully with real code and tests. Commit: 'feat: [task]'. Report what you built, files changed, and test results."
+Call the Agent tool — description: "[task name]" — mode: "bypassPermissions" — prompt: "TASK: [task description + acceptance criteria]. HANDOFF — Architecture section: [paste ONLY the relevant section from architecture.md]. Design section: [paste ONLY the relevant section from the design doc]. Previous task output: [what the last completed task produced, if relevant]. For UI tasks: the living style guide at /design-system shows every component's exact styling and states — match it. Implement fully with real code and tests. Commit: 'feat: [task]'. Report what you built, files changed, and test results."
 Pick the right developer framing: frontend, backend, AI, etc. Set `[COMPLEXITY: S/M/L]` based on the task's Size from sprint-tasks.md.
 ### Step 5.1b — Cleanup (De-Sloppify)
-Follow the Cleanup Protocol (`commands/protocols/cleanup.md`). Critical rules (survive compaction):
+Follow the Cleanup Protocol (`protocols/cleanup.md`). Critical rules (survive compaction):
 [COMPLEXITY: S]
 - Skip if trivial (< 20 lines, single file).
 - Cleanup agent is a SEPARATE agent from the implementer — no cleaning your own mess.
@@ -414,7 +438,7 @@ Call the Agent tool — description: "Cleanup [task name]" — mode: "bypassPerm
 ### Step 5.2 — Metric Loop: Task Quality
-Run the Metric Loop Protocol on the task implementation. Define a metric based on the task's acceptance criteria. Max 5 iterations.
+Run the Metric Loop Protocol on the task implementation. Define a metric based on the task's acceptance criteria. For UI-facing tasks, include behavioral verification: the measurement agent should use agent-browser to verify the feature renders and responds to interaction, not just read the code. Max 5 iterations.
 ### Step 5.3 — Loop Exit
@@ -426,11 +450,23 @@ On stall or max iterations:
 After each task: update TodoWrite and `docs/plans/.build-state.md`.
+### Step 5.3b — Behavioral Smoke Test
+Skip if this task has no Behavioral Test criteria (API-only, config, infrastructure tasks).
+Run the Smoke Test Protocol (`protocols/smoke-test.md`). This uses agent-browser to open the app, execute the task's behavioral acceptance criteria, and verify the feature actually works.
+Evidence saved to `docs/plans/evidence/[task-name]/`: annotated screenshot, snapshot diff, error log, network log, HAR file.
+On FAIL: spawn fix agent with the evidence. The fix agent receives: what was expected (from acceptance criteria), what actually happened (snapshot diff + errors + screenshot), and the relevant source files. Max 2 fix-and-retest cycles.
+On PASS: proceed to Step 5.4.
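A minimal sketch of the Step 5.3b control flow, included to make the retry cap concrete. The callables `run_smoke_test` and `spawn_fix_agent` are hypothetical stand-ins (neither name appears in the protocol); only the limit of 2 fix-and-retest cycles comes from the text above.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SmokeResult:
    passed: bool
    # Paths to evidence files (screenshot, snapshot diff, logs, HAR); illustrative only.
    evidence: List[str] = field(default_factory=list)


def smoke_test_with_fixes(
    run_smoke_test: Callable[[], SmokeResult],       # hypothetical: runs the protocol once
    spawn_fix_agent: Callable[[SmokeResult], None],  # hypothetical: fixes using the evidence
    max_fix_cycles: int = 2,                         # "Max 2 fix-and-retest cycles"
) -> bool:
    """Return True on PASS, False if still failing after the allowed fix cycles."""
    result = run_smoke_test()
    cycles = 0
    while not result.passed and cycles < max_fix_cycles:
        spawn_fix_agent(result)    # hand over expected vs. actual evidence
        result = run_smoke_test()  # retest after the fix
        cycles += 1
    return result.passed
```

With the default cap, the smoke test runs at most three times: once initially and once after each of the two fix attempts.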
 ### Step 5.4 — Post-Task Verification
-Run the Verification Protocol (`commands/protocols/verify.md`) to catch regressions. If FAIL, fix before starting the next task.
+Run the Verification Protocol (`protocols/verify.md`). If FAIL, fix before starting the next task.
-**Compaction checkpoint:** Check `dispatches_since_save` in `docs/plans/.build-state.md`. If >= 8: save ALL state (current phase, task statuses, metric loop scores, decisions) to `docs/plans/.build-state.md`. Reset `dispatches_since_save` to 0. TodoWrite does NOT survive compaction — rebuild it from this state file on resume.
+**Compaction checkpoint.** Update `docs/plans/.build-state.md` per the format above.
 ---
@@ -438,23 +474,27 @@ Run the Verification Protocol (`commands/protocols/verify.md`) to catch regressi
 ### Step 6.0 — Pre-Hardening Verification
-Run the Verification Protocol (`commands/protocols/verify.md`). ONE agent, 6 sequential checks (Build → Type → Lint → Test → Security → Diff), stop on first FAIL. Max 3 fix attempts. All checks must pass before starting expensive audit agents — do not waste audit agents on code that doesn't build or pass tests.
+Run the Verification Protocol (`protocols/verify.md`). All checks must pass before starting expensive audit agents.
+### Step 6.1 — Initial Audit (5 agents in parallel, ONE message)
-### Step 6.1 — Initial Audit (4 agents in parallel, ONE message)
+Read the NFRs from `docs/plans/sprint-tasks.md`. Pass the relevant NFR thresholds to each audit agent so they have concrete targets, not generic checks.
-Call the Agent tool 4 times in one message:
+Call the Agent tool 5 times in one message:
-1. Description: "API testing" — Prompt: "Comprehensive API validation: all endpoints, edge cases, error responses, auth flows. Report findings with counts."
+1. Description: "API testing" — Prompt: "Comprehensive API validation: all endpoints, edge cases, error responses, auth flows. NFR targets: [paste performance and reliability NFRs]. Report findings with counts."
-2. Description: "Performance audit" — Prompt: "Measure response times, identify bottlenecks, flag performance issues. Report benchmarks."
+2. Description: "Performance audit" — Prompt: "Measure response times, identify bottlenecks, flag performance issues. NFR targets: [paste performance NFRs — e.g., API < 200ms, page load < 3s]. Report benchmarks AGAINST these targets."
-3. Description: "Accessibility audit" — Prompt: "WCAG compliance audit on all interfaces. Check screen reader, keyboard nav, contrast. Report issues with counts."
+3. Description: "Accessibility audit" — Prompt: "WCAG compliance audit on all interfaces. NFR target: [paste accessibility NFR — e.g., WCAG AA]. Check screen reader, keyboard nav, contrast. Report issues with counts."
-4. Description: "Security audit" — Prompt: "Security review: auth, input validation, data exposure, dependency vulnerabilities. Report findings with severity."
+4. Description: "Security audit" — Prompt: "Security review: auth, input validation, data exposure, dependency vulnerabilities. NFR targets: [paste security NFRs]. Report findings with severity."
+5. Description: "UX quality audit" — Prompt: "UX quality review of every user-facing page. NFR targets: [paste accessibility NFRs]. First, screenshot the living style guide at /design-system as your reference for how components should look. Then review every product page and check: loading states (every async action must show a loading indicator), error states (every form and API call must show user-friendly error feedback), empty states (every list/table must handle zero items gracefully), mobile responsiveness (test at 375px viewport — touch targets >= 44px, no horizontal scroll, readable text), form validation (inline feedback, not just alert()), transition smoothness (no layout shifts, no janky animations), visual consistency (compare each page's components against the style guide — buttons, inputs, cards, colors, spacing should match). Report issues with page, severity, and screenshot."
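To illustrate what reporting benchmarks "AGAINST these targets" could look like, here is a small sketch that compares measured values with NFR thresholds. The two example limits are the ones quoted in the performance audit prompt; the metric names, the function name, and the millisecond unit are assumptions made for this example.

```python
from typing import Dict, List

# Example targets from the prompt text ("API < 200ms, page load < 3s").
# The dictionary keys are hypothetical metric names.
NFR_TARGETS_MS: Dict[str, float] = {"api_response": 200.0, "page_load": 3000.0}


def check_against_nfrs(measured_ms: Dict[str, float],
                       targets_ms: Dict[str, float]) -> List[str]:
    """Return one finding per metric that is missing or not strictly under its target."""
    findings = []
    for metric, limit in targets_ms.items():
        value = measured_ms.get(metric)
        if value is None:
            findings.append(f"{metric}: no measurement (target < {limit:g}ms)")
        elif value >= limit:
            findings.append(f"{metric}: {value:g}ms does not meet target < {limit:g}ms")
    return findings
```

An empty list means every target was measured and met; a missing measurement is reported as a finding rather than silently passing.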
 ### Step 6.1b — Eval Harness
-Run the Eval Harness Protocol (`commands/protocols/eval-harness.md`). Define 8-15 concrete, executable eval cases from the audit findings and architecture doc. Run the eval agent. Record baseline pass rate. CRITICAL and HIGH failures feed into the metric loop in Step 6.2 as specific issues to fix.
+Run the Eval Harness Protocol (`protocols/eval-harness.md`). Define 8-15 concrete, executable eval cases from the audit findings and architecture doc. For UI flows, eval cases should use agent-browser: "agent-browser open /dashboard -> agent-browser click @submit -> agent-browser wait --text Success -> expect text contains confirmation ID". Run the eval agent. Record baseline pass rate. CRITICAL and HIGH failures feed into the metric loop in Step 6.2 as specific issues to fix.
 ### Step 6.2 — Metric Loop: Hardening Quality
@@ -472,7 +512,7 @@ Re-run the Eval Harness after the metric loop exits. All CRITICAL eval cases mus
 ALL 3 ITERATIONS ARE MANDATORY. Do NOT stop after iteration 1 even if all tests pass. The purpose of 3 runs is to catch flaky tests, timing-dependent failures, and race conditions that only surface on repeated execution. Skip this step ONLY if the project has no user-facing frontend.
 </HARD-GATE>
-Generate and execute end-to-end tests using Playwright against the running application. Tests cover critical user journeys derived from the design doc and architecture.
+Generate and execute end-to-end tests using Playwright against the running application. Tests cover the **User Journeys** defined in `docs/plans/sprint-tasks.md` (Step 0 of the Planning Protocol). Each journey = one E2E test file.
 **Iteration 1 — Generate & Run:**
@@ -481,12 +521,13 @@ Call the Agent tool — description: "E2E test generation" — mode: "bypassPerm
 "[COMPLEXITY: L] Generate and run end-to-end Playwright tests for this application.
 INPUTS:
-- Architecture doc (user flows and API contracts): [paste relevant sections from docs/plans/architecture.md]
-- Design doc (core user journeys): [paste relevant sections]
-- Visual Design Spec (component selectors and page structure): [paste relevant sections from docs/plans/visual-design-spec.md]
+- User Journeys from docs/plans/sprint-tasks.md: [paste the User Journeys section — each journey becomes one E2E test]
+- Architecture doc (API contracts): [paste relevant sections from docs/plans/architecture.md]
+- NFRs from docs/plans/sprint-tasks.md: [paste — use performance thresholds as test assertions]
+- Visual Design Spec (component selectors): [paste relevant sections from docs/plans/visual-design-spec.md]
 REQUIREMENTS:
-1. Identify 5-10 critical user journeys from the design doc (auth flows, core feature flows, data entry, navigation)
+1. One E2E test per User Journey from sprint-tasks.md (each journey = one test file covering the full flow)
 2. Use Page Object Model pattern — one page object per major view
 3. Use data-testid selectors (add them to components if missing)
 4. Wait for API responses, NEVER use arbitrary timeouts (no waitForTimeout)
@@ -511,56 +552,67 @@ Record results: total tests, pass count, fail count, failure details. Log to `do
 **Iteration 2 — Fix & Re-run:**
-Call the Agent tool — description: "E2E fix iteration 2" — mode: "bypassPermissions" — prompt:
+Call the Agent tool — description: "E2E fix iteration 2" — mode: "bypassPermissions" — prompt: "[COMPLEXITY: M] Fix E2E test failures from iteration 1: [paste failure details — test names, error messages, screenshot paths]. Diagnose each as real bug, flaky test, or missing selector. Fix accordingly — do NOT delete or skip tests. Re-run ALL tests. Commit: 'fix: e2e test failures iteration 2'."
-"[COMPLEXITY: M] Fix E2E test failures and re-run the full suite.
+Record results in the E2E table. Identify flaky candidates (passed iter 1, failed iter 2 or vice versa).
-ITERATION 1 RESULTS: [paste failure details — test names, error messages, screenshot paths]
+**Iteration 3 — Final Stability Run:**
-For each failure:
-1. Diagnose: Is this a real bug, a flaky test, or a missing data-testid?
-2. Real bugs: Fix the application code
-3. Flaky tests: Add proper waits, fix race conditions, improve selectors
-4. Missing selectors: Add data-testid attributes to components
-5. Do NOT delete or skip failing tests — fix them
+Call the Agent tool — description: "E2E stability run" — mode: "bypassPermissions" — prompt: "[COMPLEXITY: M] Final E2E stability run (3 of 3). Previous results — Iter 1: [pass/fail counts], Iter 2: [pass/fail counts], Flaky candidates: [list]. Run ALL tests with --repeat-each=3. Quarantine inconsistent tests with test.fixme(). Fix remaining consistent failures. PASS CRITERIA: 95%+ pass rate (quarantined flaky tests excluded but logged). Commit: 'test: e2e stability fixes iteration 3'."
-Re-run ALL tests (not just previously failing ones). Report results.
-Commit fixes: 'fix: e2e test failures iteration 2'"
+Record final results. Include in Reality Checker evidence.
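The bookkeeping behind iterations 2 and 3 can be sketched as follows, assuming per-test results are available as name-to-boolean mappings. The function names are hypothetical; the 95% threshold and the rule that quarantined tests are excluded from the rate come from the prompts above.

```python
from typing import Dict, Set


def flaky_candidates(iter1: Dict[str, bool], iter2: Dict[str, bool]) -> Set[str]:
    """Tests whose pass/fail outcome differs between iteration 1 and iteration 2."""
    return {name for name in iter1.keys() & iter2.keys() if iter1[name] != iter2[name]}


def pass_rate(results: Dict[str, bool], quarantined: Set[str]) -> float:
    """Pass rate over non-quarantined tests (quarantined ones are excluded but logged)."""
    counted = {name: ok for name, ok in results.items() if name not in quarantined}
    if not counted:
        # Assumption: a run with no countable tests cannot satisfy the gate.
        return 0.0
    return sum(counted.values()) / len(counted)


def meets_pass_criteria(results: Dict[str, bool], quarantined: Set[str],
                        threshold: float = 0.95) -> bool:
    """True when the non-quarantined pass rate reaches the 95% criterion."""
    return pass_rate(results, quarantined) >= threshold
```

Keeping the quarantined set separate from the results makes it easy to log those tests for the Reality Checker while leaving them out of the rate.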
-Record results in the E2E table. Identify any tests that passed in iteration 1 but failed in iteration 2 — these are flaky candidates.
+### Step 6.2d — Autonomous Dogfooding
-**Iteration 3 — Final Stability Run:**
+Run the agent-browser dogfood skill against the running app. Unlike the per-task smoke tests (which verify specific acceptance criteria), dogfooding is **exploratory** — it autonomously navigates every reachable page, clicks buttons, fills forms, checks console errors, and finds issues we didn't think to test.
-Call the Agent tool — description: "E2E stability run" — mode: "bypassPermissions" — prompt:
+Start the dev server if not running. Then invoke the dogfood skill:
-"[COMPLEXITY: M] Final E2E stability run — iteration 3 of 3.
+Call the Agent tool — description: "Dogfood the app" — mode: "bypassPermissions" — prompt: "Run the agent-browser dogfood skill against the running app at http://localhost:[port]. Explore every reachable page. Click every button. Fill every form. Check console for errors. Report a structured list of issues with severity ratings (critical/high/medium/low), screenshots, and repro steps. If dogfood skill is not available, use agent-browser manually: snapshot each page, click all interactive elements, check errors and network requests. Also evaluate UX quality: missing loading states, poor error messages, broken mobile layouts (resize to 375px), visual inconsistencies, missing empty states, form validation gaps. Report UX issues separately from functional issues."
-PREVIOUS RESULTS:
-- Iteration 1: [pass/fail counts]
-- Iteration 2: [pass/fail counts]
-- Flaky candidates: [tests that had inconsistent results across iterations]
+**Fix loop:** For each CRITICAL or HIGH issue found:
+1. Classify: is this a code bug (fix in Phase 5 style — spawn implementation fix agent) or a structural problem (needs architecture change — spawn architect agent to propose a fix plan, then implementation agent to execute)?
+2. Spawn the appropriate fix agent with: the issue description, repro steps, screenshot, affected page/component.
+3. After fixes, re-run dogfood on the affected pages only (not the full app). If new CRITICAL/HIGH issues appear, repeat. Max 3 fix cycles.
-REQUIREMENTS:
-1. Run ALL tests with --repeat-each=3 to detect flakiness (each test runs 3 times within this iteration)
-2. Any test failing inconsistently across the 3 sub-runs: quarantine with test.fixme() and file path + reason
-3. Fix any remaining consistent failures
-4. Generate final report with: total journeys, pass rate, flaky count, quarantined tests
-5. Commit: 'test: e2e stability fixes iteration 3'
+MEDIUM/LOW issues: log to `docs/plans/build-log.md` for the Reality Checker.
-PASS CRITERIA: 95%+ pass rate across all tests. Quarantined flaky tests do not count against pass rate but must be logged."
+### Step 6.2e — Fake Data Detector
-Record final results. Include in Reality Checker evidence.
+Call the Agent tool — description: "Fake data audit" — mode: "bypassPermissions" — prompt: "Run the Fake Data Detector Protocol (protocols/fake-data-detector.md). Check for mock/hardcoded data in production paths. Static analysis: grep for Math.random() business data, hardcoded API responses, setTimeout faking async, placeholder text. Dynamic analysis: inspect HAR files from docs/plans/evidence/ for missing real API calls, static responses, absent WebSocket traffic. Report findings with file:line references and severity."
+**Fix loop:** For each CRITICAL finding:
+1. Spawn a fix agent with: the finding (file:line, what's fake, what it should be), and the relevant source files.
+2. The fix agent replaces fake data with real API calls, real WebSocket connections, real data sources. If real data sources aren't available (missing API keys, no backend), the fix agent must flag this as a blocker — not paper over it with better-looking fake data.
+3. After fixes, re-run the fake data detector (static checks only — fast). Max 2 fix cycles.
-### Step 6.3 — Reality Check
+Remaining findings feed into the Reality Checker in Step 6.4.
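A rough sketch of the static half of the fake data audit, reporting findings as file:line references. The regular expressions are illustrative guesses based on the categories named in the prompt; the actual checks in `protocols/fake-data-detector.md` are not shown in this diff and may differ.

```python
import re
from typing import List, Tuple

# Hypothetical patterns, one per category mentioned in the audit prompt.
PATTERNS = {
    "random business data": re.compile(r"Math\.random\(\)"),
    "setTimeout faking async": re.compile(r"setTimeout\s*\("),
    "placeholder text": re.compile(r"lorem ipsum", re.IGNORECASE),
}


def scan_source(path: str, text: str) -> List[Tuple[str, str]]:
    """Return (file:line, category) pairs for every line matching a pattern."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for category, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((f"{path}:{lineno}", category))
    return findings
```

Matches are only candidates: `setTimeout` and `Math.random()` have legitimate uses, so each hit still needs review before it is rated by severity.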
-Call the Agent tool — description: "Final verdict" — prompt: "You are the Reality Checker. Default: NEEDS WORK. The hardening loop reached score [final_score] after [iterations] iterations. Score history: [paste table]. Review all evidence. Eval harness results: [baseline pass rate] → [final pass rate]. E2E test results: [paste E2E table — 3 iterations, final pass rate, quarantined count]. CRITICAL failures remaining: [list or none]. Verdict: PRODUCTION READY or NEEDS WORK with specifics."
+### Step 6.4 — Reality Check
+Call the Agent tool — description: "Final verdict" — prompt: "You are the Reality Checker. Default: NEEDS WORK. The hardening loop reached score [final_score] after [iterations] iterations. Score history: [paste table]. Review all evidence. Eval harness results: [baseline pass rate] → [final pass rate]. E2E test results: [paste E2E table — 3 iterations, final pass rate, quarantined count]. Dogfood results: [paste issue count and any CRITICAL/HIGH findings, or 'clean — no issues found']. Fake data audit results: [paste findings or 'clean — no fake data detected']. CRITICAL failures remaining: [list or none]. Verdict: PRODUCTION READY or NEEDS WORK with specifics."
 <HARD-GATE>Do NOT self-approve. Reality Checker must give the verdict.</HARD-GATE>
-**Autonomous:** Log verdict to `docs/plans/build-log.md`. Continue.
-**Interactive:** Present score history + verdict to user. Update state.
+**On PRODUCTION READY:** Log verdict. Proceed to Phase 7.
+**On NEEDS WORK:** The Reality Checker returns specific issues. These must be fixed — not logged and ignored.
-**Compaction checkpoint:** Check `dispatches_since_save` in `docs/plans/.build-state.md`. If >= 8: save ALL state (current phase, task statuses, metric loop scores, decisions) to `docs/plans/.build-state.md`. Reset `dispatches_since_save` to 0. TodoWrite does NOT survive compaction — rebuild it from this state file on resume.
+1. Read the Reality Checker's specific findings. Classify each:
+   - **Code bug** (broken feature, failing test, fake data) → spawn implementation fix agent with the finding + affected files.
+   - **Structural issue** (missing feature, wrong architecture, data flow problem) → spawn architect agent to produce a fix plan, then implementation agent to execute it. This is a mini Phase 5 loop for the specific issue.
+   - **Blocker** (missing API key, no backend, needs human action) → log to `docs/plans/build-log.md` and present to user. Cannot be auto-fixed.
+2. After fixes, re-run verification (7 checks) + the specific failing gate (E2E, dogfood, or fake data — whichever surfaced the issue).
+3. Re-run the Reality Checker with updated evidence.
+<HARD-GATE>
+Max 2 NEEDS WORK cycles. If the Reality Checker returns NEEDS WORK a third time:
+- **Interactive:** Present all remaining issues to user. Ask for direction.
+- **Autonomous:** Log remaining issues to `docs/plans/build-log.md`. Proceed to Phase 7 with a warning in the completion report.
+Do not loop forever.
+</HARD-GATE>
+**Compaction checkpoint.** Update `docs/plans/.build-state.md` per the format above.
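The bounded NEEDS WORK handling in Step 6.4 can be sketched as a loop with an explicit cycle cap. `run_reality_checker` and `fix_issues` are hypothetical callables standing in for the agent dispatches; the reading that a third NEEDS WORK verdict ends the loop is taken from the HARD-GATE text.

```python
from typing import Callable, List, Tuple

Verdict = Tuple[str, List[str]]  # ("PRODUCTION READY" or "NEEDS WORK", issues)


def reality_check_loop(
    run_reality_checker: Callable[[], Verdict],  # hypothetical: returns verdict + issues
    fix_issues: Callable[[List[str]], None],     # hypothetical: classify, fix, re-verify
    max_needs_work_cycles: int = 2,              # "Max 2 NEEDS WORK cycles"
) -> Tuple[str, List[str]]:
    """Return (outcome, remaining_issues); outcome is 'ready' or 'escalate'."""
    cycles = 0
    while True:
        verdict, issues = run_reality_checker()
        if verdict == "PRODUCTION READY":
            return "ready", []
        if cycles == max_needs_work_cycles:
            # Third NEEDS WORK: stop looping; ask the user or log and proceed.
            return "escalate", issues
        fix_issues(issues)
        cycles += 1
```

On "escalate", the caller would follow the Interactive or Autonomous branch described above; the sketch itself only guarantees that the loop terminates.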
 ---
@@ -568,7 +620,18 @@ Call the Agent tool — description: "Final verdict" — prompt: "You are the Re
 ### Step 7.0 — Pre-Ship Verification
-Final verification gate. Run the Verification Protocol (`commands/protocols/verify.md`). ONE agent, 6 sequential checks (Build → Type → Lint → Test → Security → Diff), stop on first FAIL. Max 3 fix attempts. All checks must pass before documenting and shipping. If FAIL persists, return to Phase 6 for targeted fixes.
+Run the Verification Protocol (`protocols/verify.md`). All checks must pass before documenting and shipping. If FAIL persists after 3 fix attempts, return to Phase 6.
+### Step 7.0b — Requirements Coverage Report
+Call the Agent tool — description: "Requirements coverage check" — prompt: "Re-read the original Design Document (docs/plans/*.md design doc) and the user journeys + NFRs from docs/plans/sprint-tasks.md. For EVERY feature listed in the MVP scope, verify: (1) it has a corresponding implemented task, (2) it has a passing test or behavioral verification, (3) it is reachable and functional in the running app. Produce a coverage table:
+| MVP Feature | Task | Test | Verified | Status |
+|-------------|------|------|----------|--------|
+Mark each as COVERED, PARTIAL (implemented but untested), or MISSING. Any MISSING feature is a blocker — report it immediately."
+If any features are MISSING: spawn implementation agents to build them, then re-run verification. This is the final safety net before shipping — it catches requirements that were planned but somehow never built.
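One way to derive the Status column of the coverage table is sketched below. The three boolean fields mirror checks (1) to (3) in the prompt; the class and function names are hypothetical, and treating "implemented and tested but not verified in the running app" as PARTIAL is an assumption, since the prompt only defines PARTIAL as implemented but untested.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FeatureRow:
    feature: str
    has_task: bool  # (1) a corresponding implemented task exists
    has_test: bool  # (2) a passing test or behavioral verification exists
    verified: bool  # (3) reachable and functional in the running app


def status(row: FeatureRow) -> str:
    """Map the three checks onto COVERED, PARTIAL, or MISSING."""
    if not row.has_task:
        return "MISSING"
    if row.has_test and row.verified:
        return "COVERED"
    return "PARTIAL"  # assumption: any implemented feature short of full coverage


def blockers(rows: List[FeatureRow]) -> List[str]:
    """Features that must be built before shipping."""
    return [r.feature for r in rows if status(r) == "MISSING"]
```

A non-empty `blockers` list corresponds to the case where implementation agents are spawned and verification is re-run.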
 ### Step 7.1 — Documentation