npm - buildanything - Versions diffs - 1.5.0 → 1.7.0 - Mend

buildanything 1.5.0 → 1.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (76)

package/.claude-plugin/marketplace.json +2 -1
package/.claude-plugin/plugin.json +10 -2
package/agents/agentic-identity-trust.md +65 -311
package/agents/data-consolidation-agent.md +3 -22
package/agents/design-brand-guardian.md +52 -275
package/agents/design-image-prompt-engineer.md +67 -196
package/agents/design-ui-designer.md +55 -351
package/agents/design-ux-architect.md +54 -427
package/agents/design-ux-researcher.md +48 -299
package/agents/design-whimsy-injector.md +58 -405
package/agents/engineering-backend-architect.md +39 -202
package/agents/engineering-data-engineer.md +41 -236
package/agents/engineering-devops-automator.md +73 -258
package/agents/engineering-frontend-developer.md +33 -206
package/agents/engineering-mobile-app-builder.md +36 -446
package/agents/engineering-rapid-prototyper.md +34 -428
package/agents/engineering-security-engineer.md +44 -204
package/agents/engineering-senior-developer.md +18 -138
package/agents/engineering-technical-writer.md +40 -302
package/agents/marketing-app-store-optimizer.md +63 -276
package/agents/marketing-social-media-strategist.md +38 -87
package/agents/project-management-experiment-tracker.md +62 -156
package/agents/report-distribution-agent.md +4 -24
package/agents/sales-data-extraction-agent.md +3 -22
package/agents/specialized-cultural-intelligence-strategist.md +41 -62
package/agents/specialized-developer-advocate.md +65 -234
package/agents/support-analytics-reporter.md +76 -306
package/agents/support-executive-summary-generator.md +26 -172
package/agents/support-finance-tracker.md +67 -362
package/agents/support-legal-compliance-checker.md +40 -497
package/agents/support-support-responder.md +40 -532
package/agents/testing-accessibility-auditor.md +67 -271
package/agents/testing-api-tester.md +58 -274
package/agents/testing-evidence-collector.md +48 -170
package/agents/testing-performance-benchmarker.md +75 -236
package/agents/testing-reality-checker.md +49 -192
package/agents/testing-test-results-analyzer.md +70 -276
package/agents/testing-tool-evaluator.md +52 -368
package/agents/testing-workflow-optimizer.md +66 -415
package/bin/setup.js +45 -0
package/bin/sync-version.js +38 -0
package/commands/add-feature.md +98 -0
package/commands/build.md +285 -79
package/commands/dogfood.md +43 -0
package/commands/fix.md +89 -0
package/commands/idea-sweep.md +19 -82
package/commands/refactor.md +68 -0
package/commands/ux-review.md +81 -0
package/commands/verify.md +43 -0
package/hooks/session-start +22 -14
package/package.json +4 -1
package/agents/agents-orchestrator.md +0 -365
package/agents/data-analytics-reporter.md +0 -52
package/agents/lsp-index-engineer.md +0 -312
package/agents/macos-spatial-metal-engineer.md +0 -335
package/agents/marketing-content-creator.md +0 -52
package/agents/marketing-growth-hacker.md +0 -52
package/agents/product-sprint-prioritizer.md +0 -152
package/agents/product-trend-researcher.md +0 -157
package/agents/project-management-project-shepherd.md +0 -192
package/agents/project-management-studio-operations.md +0 -198
package/agents/project-management-studio-producer.md +0 -201
package/agents/project-manager-senior.md +0 -133
package/agents/support-infrastructure-maintainer.md +0 -616
package/agents/terminal-integration-specialist.md +0 -68
package/agents/visionos-spatial-engineer.md +0 -52
package/agents/xr-cockpit-interaction-specialist.md +0 -30
package/agents/xr-immersive-developer.md +0 -30
package/agents/xr-interface-architect.md +0 -30
package/commands/protocols/brainstorm.md +0 -99
package/commands/protocols/build-fix.md +0 -52
package/commands/protocols/cleanup.md +0 -56
package/commands/protocols/eval-harness.md +0 -62
package/commands/protocols/metric-loop.md +0 -94
package/commands/protocols/planning.md +0 -56
package/commands/protocols/verify.md +0 -63

package/commands/build.md CHANGED

@@ -51,6 +51,8 @@ If you catch yourself typing code or reading source files: STOP. You are wasting
 - `last_save: [Phase.Step]`
 Increment after each agent returns (parallel dispatch of 4 agents = +4). Reset to 0 after each compaction save.
+**Compaction checkpoint format:** At every phase boundary, check `dispatches_since_save` in `docs/plans/.build-state.md`. If >= 8: save ALL state (current phase, task statuses, metric loop scores, decisions) to `docs/plans/.build-state.md`. Reset `dispatches_since_save` to 0. TodoWrite does NOT survive compaction — rebuild it from this state file on resume.
 Input: $ARGUMENTS
 ### Autonomous Mode
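Note on the hunk above: the added checkpoint rule amounts to a counter with a threshold. The sketch below illustrates that reading only. The `BuildState` class and its method names are hypothetical and are not part of the package; the threshold of 8, the "+4 for a parallel dispatch of 4" rule, and the reset to 0 come from the diffed text.

```python
from dataclasses import dataclass

SAVE_THRESHOLD = 8  # from the rule: save when dispatches_since_save >= 8


@dataclass
class BuildState:
    """Hypothetical in-memory mirror of docs/plans/.build-state.md."""

    dispatches_since_save: int = 0
    last_save: str = "Phase 0"

    def record_dispatch(self, agents_returned: int = 1) -> None:
        # Each returned agent counts once; a parallel dispatch of 4 adds 4.
        self.dispatches_since_save += agents_returned

    def checkpoint(self, phase_step: str) -> bool:
        """Call at a phase boundary. Returns True when a full save is due."""
        if self.dispatches_since_save < SAVE_THRESHOLD:
            return False
        # A real orchestrator would write ALL state to the file at this point.
        self.last_save = phase_step
        self.dispatches_since_save = 0
        return True


state = BuildState()
state.record_dispatch(5)           # five research agents returned
print(state.checkpoint("1.2"))     # below the threshold, no save
state.record_dispatch(4)           # four more agents returned
print(state.checkpoint("2.2"))     # threshold reached, save and reset
```

The counter is only checked at phase boundaries, so it can exceed 8 between checks; the sketch keeps that behavior rather than saving mid-phase.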
@@ -67,7 +69,7 @@ When combining `--resume` with `--autonomous`: the current invocation's flags ta
 ### Metric Loop
-Every phase uses a **metric-driven iteration loop** to drive quality. Read the full protocol at `commands/protocols/metric-loop.md`. Critical rules (survive compaction):
+Every phase uses a **metric-driven iteration loop** to drive quality. Read the full protocol at `protocols/metric-loop.md`. Critical rules (survive compaction):
 1. YOU define a metric for this phase based on context (what you're building, what matters). The metric is NOT predefined.
 2. Spawn a **measurement agent** to score the artifact 0-100. Read its full output — it's analysis.
@@ -91,19 +93,11 @@ When spawning agents in sequence (e.g., architect → implementer → reviewer),
 2. **Previous agent's output** — what the upstream agent produced (if any)
 3. **Acceptance criteria** — what "done" looks like for THIS agent
-For implementation agents (Phase 4+): Do NOT paste the entire Design Document or Architecture Document. Extract the relevant sections only. For research and architecture agents (Phases 1-2): pass the full document — these agents need complete context to do their analysis.
+For implementation agents (Phase 5+): Do NOT paste the entire Design Document or Architecture Document. Extract the relevant sections only. For research and architecture agents (Phases 1-2): pass the full document — these agents need complete context to do their analysis.
 ### Complexity Routing (Advisory)
-When composing agent prompts, prefix with `[COMPLEXITY: S/M/L]` to hint at the appropriate model tier:
-| Complexity | Task Types | Preferred Tier |
-|-----------|-----------|----------------|
-| S | Build-fix, cleanup, lint fix, single-error fix | Haiku-class (fastest) |
-| M | Measurement, eval, testing, single-feature impl | Sonnet-class (balanced) |
-| L | Architecture, research, multi-file impl, debugging | Opus-class (deepest reasoning) |
-For sprint tasks, use the Size field from `docs/plans/sprint-tasks.md`. This is advisory — the tag documents intent for future model routing support.
+Tag agent prompts with `[COMPLEXITY: S/M/L]` based on task size from `docs/plans/sprint-tasks.md`. This is advisory — the tag documents intent for future model routing support.
 ---
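Note on the hunk above: after this change the complexity tag is just a prefix derived from the task size. A minimal sketch of that tagging follows, assuming the size is already known as a string; the `tag_prompt` helper is hypothetical and does not exist in the package.

```python
VALID_SIZES = {"S", "M", "L"}


def tag_prompt(prompt: str, size: str) -> str:
    """Prefix an agent prompt with an advisory [COMPLEXITY: ...] tag."""
    normalized = size.strip().upper()
    if normalized not in VALID_SIZES:
        raise ValueError(f"size must be one of S, M, L; got {size!r}")
    return f"[COMPLEXITY: {normalized}] {prompt}"


print(tag_prompt("Set up the project from this architecture: [paste].", "M"))
```

The tag is advisory in the diffed text, so the sketch only formats it and makes no routing decision.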
@@ -112,7 +106,7 @@ For sprint tasks, use the Size field from `docs/plans/sprint-tasks.md`. This is
 **Resuming?** If the input contains `--resume` OR if context was just compacted (SessionStart hook fired with active state):
 1. Read `docs/plans/.build-state.md` — verify it exists and has a Resume Point section.
    If `docs/plans/.build-state.md` does not exist or has no Resume Point, warn the user: 'No previous build state found. Starting fresh.' Then proceed to Step 0.1 as a new build.
-2. Re-read this file and all protocol files in `commands/protocols/`.
+2. Re-read this file and all protocol files in `protocols/`.
 3. Re-read `docs/plans/sprint-tasks.md`, `docs/plans/architecture.md`, and `CLAUDE.md`.
 4. Rebuild TodoWrite from the state file (TodoWrite does NOT survive compaction or session breaks).
 5. Reset `dispatches_since_save` to 0 (fresh context window).
@@ -169,7 +163,7 @@ Autonomous mode: Log checklist to `docs/plans/build-log.md`. Create `.env.exampl
 ### Step 0.3 — Initialize
 0. Create `docs/plans/` directory if it doesn't exist (greenfield projects won't have it).
-1. Create a TodoWrite checklist with Phases 0-6.
+1. Create a TodoWrite checklist with Phases 0-7.
 2. Create `docs/plans/.build-state.md` as a single write with ALL of the following: phase and step (`Phase: 0 — Starting`), input (`[build request]`), context level (`[classification]`), prerequisites (`[status]`), dispatch counter (`dispatches_since_save: 0, last_save: Phase 0`), and a `## Resume Point` section with: phase, step, autonomous mode flag, completed tasks (none), git branch name.
 3. Go to Phase 1 (or Phase 2 if context level is "Full design").
@@ -183,7 +177,7 @@ Autonomous mode: Log checklist to `docs/plans/build-log.md`. Create `.env.exampl
 ### Step 1.1 — Brainstorming
-Follow the Brainstorm Protocol (`commands/protocols/brainstorm.md`).
+Follow the Brainstorm Protocol (`protocols/brainstorm.md`).
 In interactive mode: this is a conversation. Ask questions one at a time, propose approaches with trade-offs, let the user decide. Output: Design Document saved to `docs/plans/`.
@@ -195,15 +189,15 @@ Skip if context level is "Decision brief" (research already done).
 Call the Agent tool 5 times in a single message. Pass each agent the build request AND the Design Document draft.
-1. Description: "Market research" — Prompt: "Research market size (TAM/SAM/SOM), competitive landscape (5-10 players), timing, and market structure for: [build request]. Design context: [paste design doc]. Use web search extensively. Report with a Market Verdict: GREEN/AMBER/RED."
+1. Description: "Market research" — Prompt: "Research market size (TAM/SAM/SOM), competitive landscape (5-10 players), timing, and market structure for: [build request]. Design context: [paste design doc]. Report with a Market Verdict: GREEN/AMBER/RED."
-2. Description: "Tech feasibility" — Prompt: "Evaluate hard technical problems (Solved/Hard/Unsolved), build-vs-buy decisions, MVP scope, and stack validation for: [build request]. Design context: [paste design doc]. Search for APIs and libraries mentioned in the design to verify they exist and are maintained. Report with a Technical Verdict."
+2. Description: "Tech feasibility" — Prompt: "Evaluate hard technical problems (Solved/Hard/Unsolved), build-vs-buy decisions, MVP scope, and stack validation for: [build request]. Design context: [paste design doc]. Verify APIs and libraries from the design exist and are maintained. Report with a Technical Verdict."
-3. Description: "User research" — Prompt: "Analyze target persona, jobs-to-be-done, current alternatives, behavioral barriers to adoption for: [build request]. Design context: [paste design doc]. Search for real user complaints and communities discussing this problem. Report with a User Verdict."
+3. Description: "User research" — Prompt: "Analyze target persona, jobs-to-be-done, current alternatives, and behavioral barriers to adoption for: [build request]. Design context: [paste design doc]. Report with a User Verdict."
-4. Description: "Business model" — Prompt: "Evaluate revenue models, unit economics, growth loops, first-1000-users strategy for: [build request]. Design context: [paste design doc]. Search for comparable pricing and growth data. Report with a Business Verdict."
+4. Description: "Business model" — Prompt: "Evaluate revenue models, unit economics, growth loops, and first-1000-users strategy for: [build request]. Design context: [paste design doc]. Report with a Business Verdict."
-5. Description: "Risk analysis" — Prompt: "Adversarial review: regulatory risk, security concerns, dependency risks, competitive response, top 3 failure modes for: [build request]. Design context: [paste design doc]. Search for enforcement actions and comparable failures. Report with a Risk Verdict."
+5. Description: "Risk analysis" — Prompt: "Adversarial review: regulatory risk, security concerns, dependency risks, competitive response, top 3 failure modes for: [build request]. Design context: [paste design doc]. Report with a Risk Verdict."
 After all 5 return, synthesize a **Research Brief** with a verdict table. Save to `docs/plans/research-brief.md`.
@@ -218,17 +212,41 @@ Read the Design Document and Research Brief together. Check for contradictions:
 Update the Design Document with corrections. Save final version.
-### Step 1.4 — Persist Decisions
+### Step 1.4 — Write CLAUDE.md
+Create (or overwrite) the project's `CLAUDE.md`. This is the product brain — every agent spawned during the build reads it automatically. Write it from the Design Document and Research Brief. It must give any agent enough context to make smart product, UX, and technical decisions without needing the full design doc.
+<HARD-GATE>
+CLAUDE.md must be under 200 lines. It is not a wiki, not a conventions doc, not a dump of everything you know. It is the minimum context an agent needs to make correct decisions about this specific product.
+</HARD-GATE>
-Append key decisions to the project's `CLAUDE.md` (create if needed) under `## Build Decisions`:
+Structure:
-- Project name and one-line description
-- Primary user and core value prop
-- Tech stack (with rationale)
-- Key constraints or risks
-- MVP scope boundary (in vs. deferred)
+```
+## Product
+[1-3 sentences: what this is, core value prop, what success looks like]
+## User
+[Primary persona: who they are, what they care about, pain points,
+technical sophistication. This drives every UX decision.]
+## Tech Stack
+[Stack choices with 1-line rationale for each. Framework, DB, auth,
+key libraries, deployment target.]
+## Scope
+[What's in MVP vs. deferred. Hard boundaries. This prevents agents
+from building features that aren't scoped.]
+## Rules
+[Project-specific hard rules derived from the product and user context.
+Examples: "All data must be real-time — no simulated/fake data",
+"User must be able to pause/stop any automated process at any time",
+"Every interactive element must have visible feedback within 200ms".
+Only include rules this specific project needs — not generic best practices.]
+```
-This ensures decisions survive context compaction.
+Keep it product-focused. An implementation agent reading this should understand WHO the user is and WHAT matters enough to make the right call when the handoff prompt doesn't cover an edge case.
 ### Quality Gate 1
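Note on the hunk above: the new Step 1.4 sets two checkable constraints on `CLAUDE.md`: fewer than 200 lines, and the five headings from the template. The sketch below shows one way such a check could look. The `check_claude_md` function is hypothetical; the package as diffed states the gate in prose and may not enforce it in code.

```python
REQUIRED_SECTIONS = ("## Product", "## User", "## Tech Stack", "## Scope", "## Rules")
MAX_LINES = 200  # the hard gate requires strictly fewer than 200 lines


def check_claude_md(text: str) -> list[str]:
    """Return a list of problems; an empty list means the file passes."""
    lines = text.splitlines()
    problems = []
    if len(lines) >= MAX_LINES:
        problems.append(f"{len(lines)} lines; must be under {MAX_LINES}")
    # Compare whole lines so a heading mentioned inside prose does not count.
    headings = {line.strip() for line in lines}
    for section in REQUIRED_SECTIONS:
        if section not in headings:
            problems.append(f"missing section: {section}")
    return problems


draft = "## Product\nA habit tracker.\n## User\nBusy parents.\n## Tech Stack\nSQLite.\n## Scope\nMVP only.\n"
print(check_claude_md(draft))
```

A check like this covers structure only. Whether the content is "product-focused" still needs a reviewer or a measurement agent.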
@@ -238,7 +256,7 @@ This ensures decisions survive context compaction.
 Update TodoWrite and `docs/plans/.build-state.md`.
-**Compaction checkpoint:** Check `dispatches_since_save` in `docs/plans/.build-state.md`. If >= 8: save ALL state (current phase, task statuses, metric loop scores, decisions) to `docs/plans/.build-state.md`. Reset `dispatches_since_save` to 0. TodoWrite does NOT survive compaction — rebuild it from this state file on resume.
+**Compaction checkpoint.** Update `docs/plans/.build-state.md` per the format above.
 ---
@@ -270,13 +288,13 @@ After all 4 return, YOU synthesize into one Architecture Document. Save to `docs
 ### Step 2.3 — Metric Loop: Architecture Quality
-Run the Metric Loop Protocol (`commands/protocols/metric-loop.md`) on the Architecture Document. Define a metric based on this project — coverage of design doc requirements, specificity, consistency between agents. Max 3 iterations.
+Run the Metric Loop Protocol (`protocols/metric-loop.md`) on the Architecture Document. Define a metric based on: coverage of design doc requirements, specificity, consistency between agents, and **simplicity** — is this the simplest architecture that meets the requirements? Could any service, abstraction, or dependency be eliminated without losing functionality? Penalize over-engineering (microservices for a simple app, Kubernetes for a static site, complex state management for a 3-page app). Max 3 iterations.
 ### Step 2.4 — Sprint Planning
-Follow the Planning Protocol (`commands/protocols/planning.md`). Use 2 sequential Agent tool calls:
+Follow the Planning Protocol (`protocols/planning.md`). Use 2 sequential Agent tool calls:
-Call the Agent tool — description: "Sprint breakdown" — prompt: "Break this architecture into ordered, atomic tasks. Each task needs: description, acceptance criteria, dependencies, size (S/M/L). ARCHITECTURE: [paste]. DESIGN DOC: [paste]. Scope to MVP only."
+Call the Agent tool — description: "Sprint breakdown" — prompt: "Break this architecture into ordered, atomic tasks. Each task needs: description, acceptance criteria, dependencies, size (S/M/L). Include a `**Behavioral Test:**` field for every task that has UI — a concrete interaction test: 'Navigate to [page], click [element], verify [expected outcome]'. API-only tasks should have curl-based acceptance tests instead. ARCHITECTURE: [paste]. DESIGN DOC: [paste]. Scope to MVP only."
 Then call the Agent tool — description: "Validate task list" — prompt: "Validate this task list: [paste]. Check scope is realistic, no missing tasks, descriptions specific enough for a developer agent to execute, all tasks within MVP boundary."
@@ -290,61 +308,125 @@ Save to `docs/plans/sprint-tasks.md`.
 Update TodoWrite and `docs/plans/.build-state.md`.
-**Compaction checkpoint:** Check `dispatches_since_save` in `docs/plans/.build-state.md`. If >= 8: save ALL state (current phase, task statuses, metric loop scores, decisions) to `docs/plans/.build-state.md`. Reset `dispatches_since_save` to 0. TodoWrite does NOT survive compaction — rebuild it from this state file on resume.
+**Compaction checkpoint.** Update `docs/plans/.build-state.md` per the format above.
+---
+## Phase 3: Design & Visual Identity
+**Goal**: Transform architecture into a research-backed visual design system, proven with Playwright screenshots. Fully autonomous — agents research, decide, and iterate without user input.
+**Skip if** the project has no user-facing frontend (CLI tools, pure APIs, backend services).
+<HARD-GATE>
+UI/UX IS THE PRODUCT. This phase is a full peer to Architecture and Build — not a footnote, not an afterthought, not a "nice to have." Do NOT skip, compress, or rush this phase for any reason. The agents must research real competitors and award-winning sites, make deliberate visual choices backed by that research, build a living style guide with every component rendered and interactive, and iterate with Playwright-verified visual QA before a single line of product code is written.
+Phase 4 (Foundation) WILL NOT START without `docs/plans/visual-design-spec.md`. If it does not exist, return here.
+</HARD-GATE>
+### Step 3.1 — Design Research (2 agents, parallel, both use Playwright)
+Follow the Design Protocol (`protocols/design.md`), Step 3.1.
+Call the Agent tool 2 times in one message:
+1. Description: "Competitive visual audit" — Prompt: "Research the top 5-8 competitors/analogues for: [product description]. Use Playwright to screenshot each site (desktop 1920x1080 + mobile 375x812). Screenshot standout components (hero, cards, forms, nav, CTAs). Save to docs/plans/design-references/competitors/. Analyze visual language: colors, typography, spacing, what feels premium vs cheap. Rank by visual quality. DESIGN DOC: [paste]."
+2. Description: "Design inspiration mining" — Prompt: "Search Awwwards.com, Godly.website, SiteInspire for award-winning sites in category: [product category]. Use Playwright to screenshot top 5-8 results + standout components. Save to docs/plans/design-references/inspiration/. Identify visual trends, what separates best-in-class from generic. DESIGN DOC: [paste]."
+After both return, synthesize a **Design Research Brief** to `docs/plans/design-research.md`. Include all screenshot paths.
+### Step 3.2 — Design Direction (2 agents, sequential)
+Follow the Design Protocol (`protocols/design.md`), Step 3.2.
+1. Call the Agent tool — description: "UX architecture" — Prompt: "Create structural design foundation. INPUTS: frontend architecture section from architecture.md [paste], Design Research Brief [paste], reference screenshot paths [list], user persona [paste]. OUTPUT: information architecture, layout strategy, component hierarchy, responsive approach, interaction patterns. Base decisions on competitive research, not generic patterns."
+2. Call the Agent tool — description: "Visual design spec" — Prompt: "Create the Visual Design Spec with AUTONOMOUS decisions — pick the single best direction, do not present options. INPUTS: UX foundation [paste previous output], Design Research Brief [paste], reference screenshot paths [list], user persona [paste]. OUTPUT: color system (with hex, light+dark), typography (Google Fonts, mathematical scale), 8px spacing system, tinted shadow system, border radius, animation/motion, component styles with ALL states. Every choice must cite the research. Apply anti-AI-template rules from the Design Protocol. Save to docs/plans/visual-design-spec.md."
+### Step 3.3 — Living Style Guide (1 implementation agent)
+Follow the Design Protocol (`protocols/design.md`), Step 3.3.
+Call the Agent tool — description: "Build living style guide" — mode: "bypassPermissions" — prompt: "[COMPLEXITY: L] Build a living style guide page (/design-system route or standalone HTML). INPUTS: Visual Design Spec [paste], UX foundation [paste relevant sections], reference screenshots [list paths — these are your quality targets]. Must include rendered, interactive examples of: color swatches, typography scale, spacing scale, buttons (all states), form elements (all states), cards, navigation, feedback components (alerts, toasts, spinners, empty states), modals/overlays, and layout grid examples. Every component interactive (hover, focus, transitions work). Mobile-responsive. This ships with the product. Commit: 'feat: living style guide'."
+### Step 3.4 — Visual QA Loop (Playwright + Metric Loop)
+Run the Metric Loop Protocol (`protocols/metric-loop.md`) using the measurement criteria from the Design Protocol (`protocols/design.md`, Step 3.4).
+Measurement: Playwright screenshots of the living style guide sections (desktop + mobile). Design critic agent scores 0-100 across 6 dimensions: spacing/alignment, typography hierarchy, color harmony, component polish, responsive quality, originality (anti-AI-template check). Receives screenshots + Visual Design Spec + reference screenshots.
+**Target: 80. Max 5 iterations.** On stall: accept if >= 65, log warning below 65.
+### Step 3.5 — Autonomous Quality Gate
+Log to `docs/plans/build-log.md`: final screenshot paths, score history table, design decisions, originality score. No user pause. Proceed to Phase 4.
+**Compaction checkpoint.** Update `docs/plans/.build-state.md` per the format above.
 ---
-## Phase 3: Foundation
+## Phase 4: Foundation
+<HARD-GATE>
+Before starting Phase 4: Phase 2 must be approved AND Phase 3 must have produced `docs/plans/visual-design-spec.md`.
+If visual-design-spec.md does not exist, DO NOT PROCEED. Return to Phase 3.
+Step 4.2 (Design System) MUST implement from visual-design-spec.md — not generic architecture tokens.
+</HARD-GATE>
-### Step 3.1 — Scaffolding
+### Step 4.1 — Scaffolding
 Call the Agent tool — description: "Project scaffolding" — mode: "bypassPermissions" — prompt: "[COMPLEXITY: M] Set up the project from this architecture: [paste]. Create directory structure, dependencies, build tooling, linting config, test framework with one passing test, .gitignore, .env.example. Commit: 'feat: initial scaffolding'."
-### Step 3.2 — Design System (frontend only)
+### Step 4.2 — Design System (frontend only)
-Call the Agent tool — description: "Design system setup" — mode: "bypassPermissions" — prompt: "Implement design system foundation from this architecture: [paste frontend section]. Create CSS tokens, base layout components, core UI primitives. Commit: 'feat: design system'."
+Call the Agent tool — description: "Design system setup" — mode: "bypassPermissions" — prompt: "Implement the design system from the Visual Design Spec: [paste from docs/plans/visual-design-spec.md]. Create CSS tokens matching the spec's color system, typography scale, spacing system, shadow/elevation tokens, and base layout components. The living style guide from Phase 3 is the reference implementation — components must match. Commit: 'feat: design system'."
-### Step 3.3 — Metric Loop: Scaffold Health
+### Step 4.2b — Acceptance Test Scaffolding
+Call the Agent tool — description: "Scaffold acceptance tests" — mode: "bypassPermissions" — prompt: "Read docs/plans/sprint-tasks.md. For every task with a Behavioral Test field, create a Playwright test stub in tests/e2e/acceptance/. Use Page Object Model. Each test should: navigate to the page, perform the interaction, assert the expected outcome. Tests should FAIL right now (features aren't built yet) — that's correct. Also ensure agent-browser is available (run `which agent-browser`). Commit: 'test: scaffold acceptance tests from sprint tasks'."
+### Step 4.3 — Metric Loop: Scaffold Health
 Run the Metric Loop Protocol. Define a metric: builds clean, tests pass, lint clean, structure matches architecture. Max 3 iterations.
-### Step 3.4 — Verification Gate
+### Step 4.4 — Verification Gate
-Run the Verification Protocol (`commands/protocols/verify.md`). Critical rules (survive compaction):
+Run the Verification Protocol (`protocols/verify.md`). Critical rules (survive compaction):
 - ONE agent runs all 6 checks sequentially: Build → Type-Check → Lint → Test → Security → Diff Review. Stop on first FAIL.
 - Agent auto-detects stack from manifest files (package.json → Node, go.mod → Go, etc.).
-- On FAIL: for build/type/lint errors, use the Build-Fix Protocol (`commands/protocols/build-fix.md`) — fixes one error at a time with cascade detection. For test/security/diff failures, spawn a targeted fix agent. Re-verify. Max 3 fix attempts.
+- On FAIL: for build/type/lint errors, use the Build-Fix Protocol (`protocols/build-fix.md`) — fixes one error at a time with cascade detection. For test/security/diff failures, spawn a targeted fix agent. Re-verify. Max 3 fix attempts.
 - On PASS: log `VERIFY: PASS (6/6)` to `docs/plans/.build-state.md`. Proceed.
 Call the Agent tool — description: "Verify scaffolding" — mode: "bypassPermissions" — prompt: "Run the Verification Protocol. Execute all 6 checks sequentially, stop on first failure. Report: VERIFY: PASS or VERIFY: FAIL with details."
-Do not proceed to Phase 4 until verification passes.
+Do not proceed to Phase 5 until verification passes.
 Update TodoWrite and state.
-**Compaction checkpoint:** Check `dispatches_since_save` in `docs/plans/.build-state.md`. If >= 8: save ALL state (current phase, task statuses, metric loop scores, decisions) to `docs/plans/.build-state.md`. Reset `dispatches_since_save` to 0. TodoWrite does NOT survive compaction — rebuild it from this state file on resume.
+**Compaction checkpoint.** Update `docs/plans/.build-state.md` per the format above.
 ---
-## Phase 4: Build — Metric-Driven Dev Loops
+## Phase 5: Build — Metric-Driven Dev Loops
 <HARD-GATE>
-Before starting: Phase 2 must be approved, Phase 3 must pass. You MUST call the Agent tool for EVERY task. No exceptions.
+Before starting: Phase 2 must be approved, Phase 3 must produce docs/plans/visual-design-spec.md, Phase 4 must pass. You MUST call the Agent tool for EVERY task. No exceptions.
 </HARD-GATE>
 Expand TodoWrite with each sprint task.
 **For EACH task:**
-### Step 4.1 — Implement
+### Step 5.1 — Implement
-Call the Agent tool — description: "[task name]" — mode: "bypassPermissions" — prompt: "TASK: [task description + acceptance criteria]. HANDOFF — Architecture section: [paste ONLY the relevant section from architecture.md]. Design section: [paste ONLY the relevant section from the design doc]. Previous task output: [what the last completed task produced, if relevant]. Implement fully with real code and tests. Commit: 'feat: [task]'. Report what you built, files changed, and test results."
+Call the Agent tool — description: "[task name]" — mode: "bypassPermissions" — prompt: "TASK: [task description + acceptance criteria]. HANDOFF — Architecture section: [paste ONLY the relevant section from architecture.md]. Design section: [paste ONLY the relevant section from the design doc]. Previous task output: [what the last completed task produced, if relevant]. For UI tasks: the living style guide at /design-system shows every component's exact styling and states — match it. Implement fully with real code and tests. Commit: 'feat: [task]'. Report what you built, files changed, and test results."
 Pick the right developer framing: frontend, backend, AI, etc. Set `[COMPLEXITY: S/M/L]` based on the task's Size from sprint-tasks.md.
-### Step 4.1b — Cleanup (De-Sloppify)
+### Step 5.1b — Cleanup (De-Sloppify)
-Follow the Cleanup Protocol (`commands/protocols/cleanup.md`). Critical rules (survive compaction):
+Follow the Cleanup Protocol (`protocols/cleanup.md`). Critical rules (survive compaction):
 [COMPLEXITY: S]
 - Skip if trivial (< 20 lines, single file).
 - Cleanup agent is a SEPARATE agent from the implementer — no cleaning your own mess.
@@ -354,11 +436,11 @@ Follow the Cleanup Protocol (`commands/protocols/cleanup.md`). Critical rules (s
 Call the Agent tool — description: "Cleanup [task name]" — mode: "bypassPermissions" — with the list of files changed and the task's acceptance criteria.
-### Step 4.2 — Metric Loop: Task Quality
+### Step 5.2 — Metric Loop: Task Quality
-Run the Metric Loop Protocol on the task implementation. Define a metric based on the task's acceptance criteria. Max 5 iterations.
+Run the Metric Loop Protocol on the task implementation. Define a metric based on the task's acceptance criteria. For UI-facing tasks, include behavioral verification: the measurement agent should use agent-browser to verify the feature renders and responds to interaction, not just read the code. Max 5 iterations.
-### Step 4.3 — Loop Exit
+### Step 5.3 — Loop Exit
 On target met: mark task complete in TodoWrite, report "Task X/N: [name] — COMPLETE (score: [final], iterations: [count])".
@@ -368,74 +450,198 @@ On stall or max iterations:
 After each task: update TodoWrite and `docs/plans/.build-state.md`.
-### Step 4.4 — Post-Task Verification
+### Step 5.3b — Behavioral Smoke Test
+Skip if this task has no Behavioral Test criteria (API-only, config, infrastructure tasks).
+Run the Smoke Test Protocol (`protocols/smoke-test.md`). This uses agent-browser to open the app, execute the task's behavioral acceptance criteria, and verify the feature actually works.
+Evidence saved to `docs/plans/evidence/[task-name]/`: annotated screenshot, snapshot diff, error log, network log, HAR file.
-Run the Verification Protocol (`commands/protocols/verify.md`) to catch regressions. If FAIL, fix before starting the next task.
+On FAIL: spawn fix agent with the evidence. The fix agent receives: what was expected (from acceptance criteria), what actually happened (snapshot diff + errors + screenshot), and the relevant source files. Max 2 fix-and-retest cycles.
-**Compaction checkpoint:** Check `dispatches_since_save` in `docs/plans/.build-state.md`. If >= 8: save ALL state (current phase, task statuses, metric loop scores, decisions) to `docs/plans/.build-state.md`. Reset `dispatches_since_save` to 0. TodoWrite does NOT survive compaction — rebuild it from this state file on resume.
+On PASS: proceed to Step 5.4.
+### Step 5.4 — Post-Task Verification
+Run the Verification Protocol (`protocols/verify.md`). If FAIL, fix before starting the next task.
+**Compaction checkpoint.** Update `docs/plans/.build-state.md` per the format above.
 ---
-## Phase 5: Harden — Metric-Driven Hardening
+## Phase 6: Harden — Metric-Driven Hardening
-### Step 5.0 — Pre-Hardening Verification
+### Step 6.0 — Pre-Hardening Verification
-Run the Verification Protocol (`commands/protocols/verify.md`). ONE agent, 6 sequential checks (Build → Type → Lint → Test → Security → Diff), stop on first FAIL. Max 3 fix attempts. All checks must pass before starting expensive audit agents — do not waste audit agents on code that doesn't build or pass tests.
+Run the Verification Protocol (`protocols/verify.md`). All checks must pass before starting expensive audit agents.
-### Step 5.1 — Initial Audit (4 agents in parallel, ONE message)
+### Step 6.1 — Initial Audit (5 agents in parallel, ONE message)
-Call the Agent tool 4 times in one message:
+Read the NFRs from `docs/plans/sprint-tasks.md`. Pass the relevant NFR thresholds to each audit agent so they have concrete targets, not generic checks.
-1. Description: "API testing" — Prompt: "Comprehensive API validation: all endpoints, edge cases, error responses, auth flows. Report findings with counts."
+Call the Agent tool 5 times in one message:
-2. Description: "Performance audit" — Prompt: "Measure response times, identify bottlenecks, flag performance issues. Report benchmarks."
+1. Description: "API testing" — Prompt: "Comprehensive API validation: all endpoints, edge cases, error responses, auth flows. NFR targets: [paste performance and reliability NFRs]. Report findings with counts."
-3. Description: "Accessibility audit" — Prompt: "WCAG compliance audit on all interfaces. Check screen reader, keyboard nav, contrast. Report issues with counts."
+2. Description: "Performance audit" — Prompt: "Measure response times, identify bottlenecks, flag performance issues. NFR targets: [paste performance NFRs — e.g., API < 200ms, page load < 3s]. Report benchmarks AGAINST these targets."
-4. Description: "Security audit" — Prompt: "Security review: auth, input validation, data exposure, dependency vulnerabilities. Report findings with severity."
+3. Description: "Accessibility audit" — Prompt: "WCAG compliance audit on all interfaces. NFR target: [paste accessibility NFR — e.g., WCAG AA]. Check screen reader, keyboard nav, contrast. Report issues with counts."
-### Step 5.1b — Eval Harness
+4. Description: "Security audit" — Prompt: "Security review: auth, input validation, data exposure, dependency vulnerabilities. NFR targets: [paste security NFRs]. Report findings with severity."
-Run the Eval Harness Protocol (`commands/protocols/eval-harness.md`). Define 8-15 concrete, executable eval cases from the audit findings and architecture doc. Run the eval agent. Record baseline pass rate. CRITICAL and HIGH failures feed into the metric loop in Step 5.2 as specific issues to fix.
+5. Description: "UX quality audit" — Prompt: "UX quality review of every user-facing page. NFR targets: [paste accessibility NFRs]. First, screenshot the living style guide at /design-system as your reference for how components should look. Then review every product page and check: loading states (every async action must show a loading indicator), error states (every form and API call must show user-friendly error feedback), empty states (every list/table must handle zero items gracefully), mobile responsiveness (test at 375px viewport — touch targets >= 44px, no horizontal scroll, readable text), form validation (inline feedback, not just alert()), transition smoothness (no layout shifts, no janky animations), visual consistency (compare each page's components against the style guide — buttons, inputs, cards, colors, spacing should match). Report issues with page, severity, and screenshot."
-### Step 5.2 — Metric Loop: Hardening Quality
+### Step 6.1b — Eval Harness
+Run the Eval Harness Protocol (`protocols/eval-harness.md`). Define 8-15 concrete, executable eval cases from the audit findings and architecture doc. For UI flows, eval cases should use agent-browser: "agent-browser open /dashboard -> agent-browser click @submit -> agent-browser wait --text Success -> expect text contains confirmation ID". Run the eval agent. Record baseline pass rate. CRITICAL and HIGH failures feed into the metric loop in Step 6.2 as specific issues to fix.
+### Step 6.2 — Metric Loop: Hardening Quality
 Run the Metric Loop Protocol on the full codebase using audit findings as initial input. Define a composite metric based on what this project needs. Max 4 iterations.
 When fixing, dispatch to the RIGHT specialist. Security → security agent. Accessibility → frontend agent. Don't send everything to one agent.
-### Step 5.2b — Eval Re-run
+### Step 6.2b — Eval Re-run
 Re-run the Eval Harness after the metric loop exits. All CRITICAL eval cases must now pass. If any CRITICAL case still fails, include it as evidence for the Reality Checker.
-### Step 5.3 — Reality Check
+### Step 6.2c — E2E Testing (3 mandatory iterations)
+<HARD-GATE>
+ALL 3 ITERATIONS ARE MANDATORY. Do NOT stop after iteration 1 even if all tests pass. The purpose of 3 runs is to catch flaky tests, timing-dependent failures, and race conditions that only surface on repeated execution. Skip this step ONLY if the project has no user-facing frontend.
+</HARD-GATE>
+Generate and execute end-to-end tests using Playwright against the running application. Tests cover the **User Journeys** defined in `docs/plans/sprint-tasks.md` (Step 0 of the Planning Protocol). Each journey = one E2E test file.
+**Iteration 1 — Generate & Run:**
+Call the Agent tool — description: "E2E test generation" — mode: "bypassPermissions" — prompt:
+"[COMPLEXITY: L] Generate and run end-to-end Playwright tests for this application.
+INPUTS:
+- User Journeys from docs/plans/sprint-tasks.md: [paste the User Journeys section — each journey becomes one E2E test]
+- Architecture doc (API contracts): [paste relevant sections from docs/plans/architecture.md]
+- NFRs from docs/plans/sprint-tasks.md: [paste — use performance thresholds as test assertions]
+- Visual Design Spec (component selectors): [paste relevant sections from docs/plans/visual-design-spec.md]
+REQUIREMENTS:
+1. One E2E test per User Journey from sprint-tasks.md (each journey = one test file covering the full flow)
+2. Use Page Object Model pattern — one page object per major view
+3. Use data-testid selectors (add them to components if missing)
+4. Wait for API responses, NEVER use arbitrary timeouts (no waitForTimeout)
+5. Capture screenshots at critical verification points
+6. Configure multi-browser: Chromium + Firefox + WebKit
+7. Set up playwright.config.ts with: fullyParallel, retries: 0 (we handle retries ourselves), screenshot: 'only-on-failure', video: 'retain-on-failure', trace: 'on-first-retry'
+8. Run all tests. Report: total, passed, failed, with failure details and screenshot paths.
+9. Commit: 'test: e2e test suite for critical user journeys'
+Test priority:
+- CRITICAL: Auth, core feature happy path, data submission, payment/transaction flows
+- HIGH: Search, filtering, navigation, error states
+- MEDIUM: Responsive layout, animations, edge cases"
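For orientation, requirements 6 and 7 of the prompt above could translate into a `playwright.config.ts` roughly like the following. This is a sketch, not the package's own file: the `testDir` value is an assumption, and the project names are illustrative.

```typescript
// playwright.config.ts: a sketch of requirements 6 and 7, not the package's actual config.
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  testDir: './e2e',        // assumption: the source does not name a test directory
  fullyParallel: true,
  retries: 0,              // the build loop handles retries itself
  use: {
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
    trace: 'on-first-retry',
  },
  projects: [
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox', use: { ...devices['Desktop Firefox'] } },
    { name: 'webkit', use: { ...devices['Desktop Safari'] } },
  ],
});
```

One thing worth checking when reviewing this change: with `retries: 0` a test is never retried, so `trace: 'on-first-retry'` would likely never record a trace. `'retain-on-failure'` may be closer to the intent, though the source does not say.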
+Record results: total tests, pass count, fail count, failure details. Log to `docs/plans/.build-state.md` under `## E2E Testing`:
+```
+| Iter | Total | Passed | Failed | Flaky | Top Failure |
+|------|-------|--------|--------|-------|-------------|
+| 1    | ...   | ...    | ...    | ...   | ...         |
+```
+**Iteration 2 — Fix & Re-run:**
+Call the Agent tool — description: "E2E fix iteration 2" — mode: "bypassPermissions" — prompt: "[COMPLEXITY: M] Fix E2E test failures from iteration 1: [paste failure details — test names, error messages, screenshot paths]. Diagnose each as real bug, flaky test, or missing selector. Fix accordingly — do NOT delete or skip tests. Re-run ALL tests. Commit: 'fix: e2e test failures iteration 2'."
+Record results in the E2E table. Identify flaky candidates (passed iter 1, failed iter 2 or vice versa).
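The flaky-candidate rule in the line above (a test whose outcome differs between iteration 1 and iteration 2) can be sketched as a small comparison. The `RunResults` shape and the journey names are hypothetical; the source does not define how results are stored.

```typescript
// Hypothetical result shape: test name -> outcome for one iteration.
type Outcome = "passed" | "failed";
type RunResults = Record<string, Outcome>;

// A test is a flaky candidate when it appears in both runs with different outcomes.
function flakyCandidates(iter1: RunResults, iter2: RunResults): string[] {
  return Object.keys(iter1)
    .filter((name) => name in iter2 && iter1[name] !== iter2[name])
    .sort();
}

const iter1: RunResults = {
  "login journey": "passed",
  "checkout journey": "failed",
  "search journey": "passed",
};
const iter2: RunResults = {
  "login journey": "failed",
  "checkout journey": "passed",
  "search journey": "passed",
};

const candidates = flakyCandidates(iter1, iter2);
```

Note that a failed-then-passed test may simply have been fixed in iteration 2, so the list is a set of candidates for the stability run rather than a verdict.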
+**Iteration 3 — Final Stability Run:**
-Call the Agent tool — description: "Final verdict" — prompt: "You are the Reality Checker. Default: NEEDS WORK. The hardening loop reached score [final_score] after [iterations] iterations. Score history: [paste table]. Review all evidence. Eval harness results: [baseline pass rate] → [final pass rate]. CRITICAL failures remaining: [list or none]. Verdict: PRODUCTION READY or NEEDS WORK with specifics."
+Call the Agent tool — description: "E2E stability run" — mode: "bypassPermissions" — prompt: "[COMPLEXITY: M] Final E2E stability run (3 of 3). Previous results — Iter 1: [pass/fail counts], Iter 2: [pass/fail counts], Flaky candidates: [list]. Run ALL tests with --repeat-each=3. Quarantine inconsistent tests with test.fixme(). Fix remaining consistent failures. PASS CRITERIA: 95%+ pass rate (quarantined flaky tests excluded but logged). Commit: 'test: e2e stability fixes iteration 3'."
+Record final results. Include in Reality Checker evidence.
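The pass criterion for iteration 3 (95%+ with quarantined tests excluded) might be computed as below. The `StabilityRun` fields are assumptions, including the reading that `passed` counts only non-quarantined tests.

```typescript
// Hypothetical summary of the final stability run.
interface StabilityRun {
  total: number;        // all tests in the suite
  passed: number;       // passing tests, none of them quarantined (assumption)
  quarantined: number;  // tests marked test.fixme(), excluded from the rate
}

// Integer arithmetic avoids floating-point edge cases at exactly 95%.
function meetsPassCriteria(run: StabilityRun, thresholdPercent = 95): boolean {
  const counted = run.total - run.quarantined;
  if (counted <= 0) return false; // assumption: an empty counted set does not pass
  return run.passed * 100 >= thresholdPercent * counted;
}

const exactlyAtThreshold = meetsPassCriteria({ total: 42, passed: 38, quarantined: 2 });
const justBelow = meetsPassCriteria({ total: 42, passed: 37, quarantined: 2 });
```

Here 38 of 40 counted tests is exactly 95%, which meets the bar; 37 of 40 does not.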
+### Step 6.2d — Autonomous Dogfooding
+Run the agent-browser dogfood skill against the running app. Unlike the per-task smoke tests (which verify specific acceptance criteria), dogfooding is **exploratory** — it autonomously navigates every reachable page, clicks buttons, fills forms, checks console errors, and finds issues we didn't think to test.
+Start the dev server if not running. Then invoke the dogfood skill:
+Call the Agent tool — description: "Dogfood the app" — mode: "bypassPermissions" — prompt: "Run the agent-browser dogfood skill against the running app at http://localhost:[port]. Explore every reachable page. Click every button. Fill every form. Check console for errors. Report a structured list of issues with severity ratings (critical/high/medium/low), screenshots, and repro steps. If dogfood skill is not available, use agent-browser manually: snapshot each page, click all interactive elements, check errors and network requests. Also evaluate UX quality: missing loading states, poor error messages, broken mobile layouts (resize to 375px), visual inconsistencies, missing empty states, form validation gaps. Report UX issues separately from functional issues."
+**Fix loop:** For each CRITICAL or HIGH issue found:
+1. Classify: is this a code bug (fix in Phase 5 style — spawn implementation fix agent) or a structural problem (needs architecture change — spawn architect agent to propose a fix plan, then implementation agent to execute)?
+2. Spawn the appropriate fix agent with: the issue description, repro steps, screenshot, affected page/component.
+3. After fixes, re-run dogfood on the affected pages only (not the full app). If new CRITICAL/HIGH issues appear, repeat. Max 3 fix cycles.
+MEDIUM/LOW issues: log to `docs/plans/build-log.md` for the Reality Checker.
+### Step 6.2e — Fake Data Detector
+Call the Agent tool — description: "Fake data audit" — mode: "bypassPermissions" — prompt: "Run the Fake Data Detector Protocol (protocols/fake-data-detector.md). Check for mock/hardcoded data in production paths. Static analysis: grep for Math.random() business data, hardcoded API responses, setTimeout faking async, placeholder text. Dynamic analysis: inspect HAR files from docs/plans/evidence/ for missing real API calls, static responses, absent WebSocket traffic. Report findings with file:line references and severity."
+**Fix loop:** For each CRITICAL finding:
+1. Spawn a fix agent with: the finding (file:line, what's fake, what it should be), and the relevant source files.
+2. The fix agent replaces fake data with real API calls, real WebSocket connections, real data sources. If real data sources aren't available (missing API keys, no backend), the fix agent must flag this as a blocker — not paper over it with better-looking fake data.
+3. After fixes, re-run the fake data detector (static checks only — fast). Max 2 fix cycles.
+Remaining findings feed into the Reality Checker in Step 6.4.
+### Step 6.4 — Reality Check
+Call the Agent tool — description: "Final verdict" — prompt: "You are the Reality Checker. Default: NEEDS WORK. The hardening loop reached score [final_score] after [iterations] iterations. Score history: [paste table]. Review all evidence. Eval harness results: [baseline pass rate] → [final pass rate]. E2E test results: [paste E2E table — 3 iterations, final pass rate, quarantined count]. Dogfood results: [paste issue count and any CRITICAL/HIGH findings, or 'clean — no issues found']. Fake data audit results: [paste findings or 'clean — no fake data detected']. CRITICAL failures remaining: [list or none]. Verdict: PRODUCTION READY or NEEDS WORK with specifics."
 <HARD-GATE>Do NOT self-approve. Reality Checker must give the verdict.</HARD-GATE>
-**Autonomous:** Log verdict to `docs/plans/build-log.md`. Continue.
-**Interactive:** Present score history + verdict to user. Update state.
+**On PRODUCTION READY:** Log verdict. Proceed to Phase 7.
+**On NEEDS WORK:** The Reality Checker returns specific issues. These must be fixed — not logged and ignored.
+1. Read the Reality Checker's specific findings. Classify each:
+   - **Code bug** (broken feature, failing test, fake data) → spawn implementation fix agent with the finding + affected files.
+   - **Structural issue** (missing feature, wrong architecture, data flow problem) → spawn architect agent to produce a fix plan, then implementation agent to execute it. This is a mini Phase 5 loop for the specific issue.
+   - **Blocker** (missing API key, no backend, needs human action) → log to `docs/plans/build-log.md` and present to user. Cannot be auto-fixed.
+2. After fixes, re-run verification (7 checks) + the specific failing gate (E2E, dogfood, or fake data — whichever surfaced the issue).
+3. Re-run the Reality Checker with updated evidence.
+<HARD-GATE>
+Max 2 NEEDS WORK cycles. If the Reality Checker returns NEEDS WORK a third time:
+- **Interactive:** Present all remaining issues to user. Ask for direction.
+- **Autonomous:** Log remaining issues to `docs/plans/build-log.md`. Proceed to Phase 7 with a warning in the completion report.
+Do not loop forever.
+</HARD-GATE>
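The NEEDS WORK handling above amounts to a small decision table: which agents each finding type needs, and what happens once the two-cycle cap is exceeded. A sketch with hypothetical names (`FindingKind`, `nextStep`, and the step labels are not from the source):

```typescript
type FindingKind = "code-bug" | "structural" | "blocker";
type Mode = "interactive" | "autonomous";
type Verdict = "PRODUCTION READY" | "NEEDS WORK";
type NextStep = "proceed-to-phase-7" | "fix-and-recheck" | "ask-user" | "proceed-with-warning";

// Agents to spawn per finding type; a blocker needs human action, so none.
const AGENTS_FOR: Record<FindingKind, string[]> = {
  "code-bug": ["implementation"],
  "structural": ["architect", "implementation"],
  "blocker": [],
};

// needsWorkCount: number of NEEDS WORK verdicts received so far, including this one.
function nextStep(verdict: Verdict, needsWorkCount: number, mode: Mode): NextStep {
  if (verdict === "PRODUCTION READY") return "proceed-to-phase-7";
  if (needsWorkCount <= 2) return "fix-and-recheck"; // max 2 fix cycles
  return mode === "interactive" ? "ask-user" : "proceed-with-warning";
}

const thirdStrikeAutonomous = nextStep("NEEDS WORK", 3, "autonomous");
```

Keeping the cap in one function makes the "do not loop forever" rule easy to audit.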
-**Compaction checkpoint:** Check `dispatches_since_save` in `docs/plans/.build-state.md`. If >= 8: save ALL state (current phase, task statuses, metric loop scores, decisions) to `docs/plans/.build-state.md`. Reset `dispatches_since_save` to 0. TodoWrite does NOT survive compaction — rebuild it from this state file on resume.
+**Compaction checkpoint.** Update `docs/plans/.build-state.md` per the format above.
 ---
-## Phase 6: Ship
+## Phase 7: Ship
+### Step 7.0 — Pre-Ship Verification
+Run the Verification Protocol (`protocols/verify.md`). All checks must pass before documenting and shipping. If FAIL persists after 3 fix attempts, return to Phase 6.
+### Step 7.0b — Requirements Coverage Report
+Call the Agent tool — description: "Requirements coverage check" — prompt: "Re-read the original Design Document (docs/plans/*.md design doc) and the user journeys + NFRs from docs/plans/sprint-tasks.md. For EVERY feature listed in the MVP scope, verify: (1) it has a corresponding implemented task, (2) it has a passing test or behavioral verification, (3) it is reachable and functional in the running app. Produce a coverage table:
+| MVP Feature | Task | Test | Verified | Status |
+|-------------|------|------|----------|--------|
-### Step 6.0 — Pre-Ship Verification
+Mark each as COVERED, PARTIAL (implemented but untested), or MISSING. Any MISSING feature is a blocker — report it immediately."
-Final verification gate. Run the Verification Protocol (`commands/protocols/verify.md`). ONE agent, 6 sequential checks (Build → Type → Lint → Test → Security → Diff), stop on first FAIL. Max 3 fix attempts. All checks must pass before documenting and shipping. If FAIL persists, return to Phase 5 for targeted fixes.
+If any features are MISSING: spawn implementation agents to build them, then re-run verification. This is the final safety net before shipping — it catches requirements that were planned but somehow never built.
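The three statuses in the coverage report can be read as a function of the three checks in the prompt. The sketch below uses a hypothetical `FeatureEvidence` shape, and it assumes that a feature with a task but without both remaining checks counts as PARTIAL; the source only defines PARTIAL as "implemented but untested".

```typescript
// Hypothetical evidence record for one MVP feature.
interface FeatureEvidence {
  feature: string;
  hasTask: boolean;         // (1) a corresponding implemented task exists
  hasPassingTest: boolean;  // (2) a passing test or behavioral verification exists
  verifiedInApp: boolean;   // (3) reachable and functional in the running app
}

type CoverageStatus = "COVERED" | "PARTIAL" | "MISSING";

function coverageStatus(e: FeatureEvidence): CoverageStatus {
  if (!e.hasTask) return "MISSING";
  // Assumption: anything implemented but not fully checked is PARTIAL.
  return e.hasPassingTest && e.verifiedInApp ? "COVERED" : "PARTIAL";
}

// MISSING features are the blockers that must be reported immediately.
function blockers(rows: FeatureEvidence[]): string[] {
  return rows.filter((r) => coverageStatus(r) === "MISSING").map((r) => r.feature);
}

const report: FeatureEvidence[] = [
  { feature: "Sign up", hasTask: true, hasPassingTest: true, verifiedInApp: true },
  { feature: "Export CSV", hasTask: true, hasPassingTest: false, verifiedInApp: true },
  { feature: "Billing", hasTask: false, hasPassingTest: false, verifiedInApp: false },
];
const missing = blockers(report);
```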
-### Step 6.1 — Documentation
+### Step 7.1 — Documentation
 Call the Agent tool — description: "Documentation" — mode: "bypassPermissions" — prompt: "Write project docs: README with setup/architecture/usage, API docs if applicable, deployment notes. Commit: 'docs: project documentation'."
-### Step 6.2 — Metric Loop: Documentation Quality
+### Step 7.2 — Metric Loop: Documentation Quality
 Run the Metric Loop Protocol on documentation. Define a metric based on completeness and whether a new developer could follow the README. Max 3 iterations.
-### Step 6.3 — Record Learnings
+### Step 7.3 — Record Learnings
 Append to `docs/plans/learnings.md` (create if it doesn't exist). Review the build and record 3-5 learnings:
@@ -457,4 +663,4 @@ Metric loops run: [count] | Avg iterations: [N]
 Remaining: [any NEEDS WORK items]
 ```
-Mark all TodoWrite items complete. Update `docs/plans/.build-state.md`: "Phase: 6 COMPLETE."
+Mark all TodoWrite items complete. Update `docs/plans/.build-state.md`: "Phase: 7 COMPLETE."

package/commands/dogfood.md ADDED

@@ -0,0 +1,43 @@
+---
+description: "Autonomous exploratory testing — navigate your running app like a real user, find bugs, UX issues, and broken flows"
+argument-hint: "URL or page/route to test, e.g. 'http://localhost:3000' or '/settings'. Omit to dogfood the entire app."
+---
+# Dogfood
+You are a ruthless QA tester. Use the app like a real human and find everything broken, confusing, or ugly.
+## Step 1: Scope and Server
+- If the user provided a specific page/route: focus on that area only.
+- If no argument: dogfood the entire app — discover all routes and test each one.
+- Check if the app is already running at the target URL. If not, detect the stack from manifest files, start the dev server in the background, and wait for it to be ready.
+## Step 2: Exploratory Testing
+Use agent-browser (or Playwright MCP tools) for real user interactions:
+1. **Navigate** — visit every discoverable page/route. Click nav links, follow breadcrumbs.
+2. **Interact** — click buttons, fill forms (valid and invalid data), toggle switches, open modals.
+3. **Check console** — after each page, check for JS errors and warnings.
+4. **Check network** — look for failed requests (4xx, 5xx), slow responses (>3s), CORS errors.
+5. **Screenshot** each page for the final report.
+## Step 3: UX Checks
+For each page: check **loading states** (spinner/skeleton vs blank flash), **error states** (submit invalid forms, hit broken routes), **mobile layout** (resize to 375px — check overflow, readability, tap targets), and **empty states** (what happens with no data).
+## Step 4: Report
+Present findings as a severity-sorted table:
+| Severity | Page | Issue | Screenshot | Repro Steps |
+|----------|------|-------|------------|-------------|
+Severity: **CRITICAL** = crashes/data loss/security, **HIGH** = broken features/JS errors/failed requests, **MEDIUM** = UX confusion/layout issues, **LOW** = cosmetic polish.
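A severity-sorted report like the one described above needs a fixed ordering and a count of the issues that trigger the fix prompt. A minimal sketch, with a hypothetical `Finding` type that carries only some of the table's columns:

```typescript
// Severity levels in report order, highest first.
const SEVERITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM", "LOW"] as const;
type Severity = (typeof SEVERITY_ORDER)[number];

// Hypothetical finding record; screenshot and repro steps are omitted for brevity.
interface Finding {
  severity: Severity;
  page: string;
  issue: string;
}

// Returns a new array sorted by severity, leaving the input untouched.
function sortBySeverity(findings: Finding[]): Finding[] {
  return [...findings].sort(
    (a, b) => SEVERITY_ORDER.indexOf(a.severity) - SEVERITY_ORDER.indexOf(b.severity),
  );
}

// Number of issues that would be offered for fixing.
function criticalOrHighCount(findings: Finding[]): number {
  return findings.filter((f) => f.severity === "CRITICAL" || f.severity === "HIGH").length;
}

const findings: Finding[] = [
  { severity: "LOW", page: "/settings", issue: "Misaligned icon" },
  { severity: "CRITICAL", page: "/checkout", issue: "Crash on submit" },
  { severity: "HIGH", page: "/search", issue: "Request returns 500" },
];
const sorted = sortBySeverity(findings);
```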
+## Step 5: Offer Fixes
+For CRITICAL/HIGH issues, ask: "Found [N] critical/high issues. Fix now or just report?"
+If they want fixes: address the issues one at a time, re-verifying after each fix. Close the browser when done.