buildanything 1.2.1 → 1.6.0
- package/.claude-plugin/plugin.json +1 -1
- package/agents/design-ui-designer.md +28 -0
- package/agents/design-ux-architect.md +10 -0
- package/commands/build.md +463 -324
- package/commands/protocols/brainstorm.md +99 -0
- package/commands/protocols/build-fix.md +52 -0
- package/commands/protocols/cleanup.md +56 -0
- package/commands/protocols/design.md +287 -0
- package/commands/protocols/eval-harness.md +62 -0
- package/commands/protocols/metric-loop.md +94 -0
- package/commands/protocols/planning.md +56 -0
- package/commands/protocols/verify.md +63 -0
- package/hooks/hooks.json +2 -2
- package/hooks/session-start +65 -8
- package/package.json +1 -1
|
@@ -0,0 +1,99 @@
|
|
|
1
|
+
# Brainstorm Protocol
|
|
2
|
+
|
|
3
|
+
You are the orchestrator running a structured brainstorming session to turn a raw idea into a validated design document.
|
|
4
|
+
|
|
5
|
+
## How This Works
|
|
6
|
+
|
|
7
|
+
You ask questions one at a time, propose approaches with trade-offs, and converge on decisions. The output is a Design Document saved to `docs/plans/`.
|
|
8
|
+
|
|
9
|
+
This is a CONVERSATION, not a monologue. Each step involves the user.
|
|
10
|
+
|
|
11
|
+
## Step 1: Understand the Idea
|
|
12
|
+
|
|
13
|
+
Read the build request and any existing context (brainstorm docs, decision briefs, conversation history, existing code).
|
|
14
|
+
|
|
15
|
+
Ask the user 3-5 targeted questions to fill gaps. Ask ONE question at a time, wait for the answer, then ask the next. Do not dump all questions at once.
|
|
16
|
+
|
|
17
|
+
Focus on:
|
|
18
|
+
- **Who is the user?** Who will use this, and what's their primary pain point?
|
|
19
|
+
- **What's the core flow?** What does the user DO in the product? Walk through the main interaction.
|
|
20
|
+
- **What's the scope?** What's in the MVP vs. what's deferred?
|
|
21
|
+
- **What are the constraints?** Tech stack preferences, budget, timeline, existing systems to integrate with.
|
|
22
|
+
- **What does success look like?** How will you know this works?
|
|
23
|
+
|
|
24
|
+
Skip questions the user already answered in their build request or prior context.
|
|
25
|
+
|
|
26
|
+
## Step 2: Propose Approaches
|
|
27
|
+
|
|
28
|
+
For each major design decision, propose 2-3 approaches with trade-offs:
|
|
29
|
+
|
|
30
|
+
```
|
|
31
|
+
DECISION: [e.g., "Data storage approach"]
|
|
32
|
+
|
|
33
|
+
Option A: [approach] — [1-line trade-off summary]
|
|
34
|
+
Option B: [approach] — [1-line trade-off summary]
|
|
35
|
+
Option C: [approach] — [1-line trade-off summary]
|
|
36
|
+
|
|
37
|
+
My recommendation: [which and why, in 1 sentence]
|
|
38
|
+
```
|
|
39
|
+
|
|
40
|
+
Major decisions typically include:
|
|
41
|
+
- Tech stack (framework, language, database, hosting)
|
|
42
|
+
- Data model (what entities, how they relate)
|
|
43
|
+
- Primary user flow (step by step)
|
|
44
|
+
- Authentication approach
|
|
45
|
+
- External service integrations
|
|
46
|
+
- MVP scope boundary (in vs. out)
|
|
47
|
+
|
|
48
|
+
Let the user pick or modify. Do not force your recommendation.
|
|
49
|
+
|
|
50
|
+
## Step 3: Write the Design Document
|
|
51
|
+
|
|
52
|
+
After decisions converge, produce a Design Document and save to `docs/plans/YYYY-MM-DD-[topic]-design.md`:
|
|
53
|
+
|
|
54
|
+
```markdown
|
|
55
|
+
# [Project Name] — Design Document
|
|
56
|
+
|
|
57
|
+
## Vision
|
|
58
|
+
[1-2 sentences: what this is and who it's for]
|
|
59
|
+
|
|
60
|
+
## Primary User
|
|
61
|
+
[Who they are, what they need, why current alternatives fail them]
|
|
62
|
+
|
|
63
|
+
## Core User Flow
|
|
64
|
+
[Step-by-step: what the user does, numbered list]
|
|
65
|
+
|
|
66
|
+
## Tech Stack
|
|
67
|
+
[Each choice with 1-line rationale]
|
|
68
|
+
|
|
69
|
+
## Data Model
|
|
70
|
+
[Key entities and relationships — tables, fields, types]
|
|
71
|
+
|
|
72
|
+
## External Integrations
|
|
73
|
+
[APIs, services, and what they're used for]
|
|
74
|
+
|
|
75
|
+
## MVP Scope
|
|
76
|
+
**In:** [bulleted list of what's included]
|
|
77
|
+
**Deferred:** [bulleted list of what's explicitly NOT in v1]
|
|
78
|
+
|
|
79
|
+
## Key Decisions
|
|
80
|
+
[Numbered list of decisions made during brainstorming, with brief rationale]
|
|
81
|
+
|
|
82
|
+
## Open Questions
|
|
83
|
+
[Anything unresolved that architecture or research needs to answer]
|
|
84
|
+
```
|
|
85
|
+
|
|
86
|
+
Present the document to the user for approval before proceeding.
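As a minimal sketch of the naming convention in Step 3, the helper below builds a `docs/plans/YYYY-MM-DD-[topic]-design.md` path. The function name `design_doc_path` and the slug rule (lowercase, runs of non-alphanumeric characters collapsed to one hyphen) are assumptions for illustration; the protocol itself only fixes the directory and the filename pattern.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def design_doc_path(topic: str, on: date) -> str:
    """Build docs/plans/YYYY-MM-DD-[topic]-design.md (hypothetical helper)."""
    # Assumed slug rule: lowercase, non-alphanumeric runs become a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", topic.lower()).strip("-")
    return str(PurePosixPath("docs/plans") / f"{on.isoformat()}-{slug}-design.md")
```

For example, `design_doc_path("Recipe Sharing App", date(2025, 1, 31))` yields a path under `docs/plans/` that sorts chronologically alongside other plan documents.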
|
|
87
|
+
|
|
88
|
+
---
|
|
89
|
+
|
|
90
|
+
## Autonomous Mode (no user present)
|
|
91
|
+
|
|
92
|
+
If running in autonomous mode, you cannot ask questions interactively. Instead:
|
|
93
|
+
|
|
94
|
+
1. Read all available context (build request, existing docs, code).
|
|
95
|
+
2. For each major decision, pick the most pragmatic option and document your rationale.
|
|
96
|
+
3. Bias toward: proven tech, simpler architecture, smaller MVP scope.
|
|
97
|
+
4. Write the Design Document as above.
|
|
98
|
+
5. Log all decisions and rationale to `docs/plans/build-log.md`.
|
|
99
|
+
6. Proceed without user approval.
|
|
@@ -0,0 +1,52 @@
|
|
|
1
|
+
# Build-Fix Protocol (One Error at a Time)
|
|
2
|
+
|
|
3
|
+
You are the orchestrator. A build, type-check, or lint check has failed. Do NOT dump all errors on a fix agent. Most build errors cascade — fixing the root cause clears 5-10 downstream errors.
|
|
4
|
+
|
|
5
|
+
## When to Use
|
|
6
|
+
|
|
7
|
+
Use this protocol when the Verification Protocol reports FAIL on the Build, Type-Check, or Lint checks. It also applies during Phase 4 scaffolding or Phase 5 implementation when a build breaks.
|
|
8
|
+
|
|
9
|
+
## Step 1: Extract First Error
|
|
10
|
+
|
|
11
|
+
Parse the failure output from the verification agent. Extract the FIRST error only:
|
|
12
|
+
- File path
|
|
13
|
+
- Line number (if available)
|
|
14
|
+
- Error message
|
|
15
|
+
|
|
16
|
+
Ignore all other errors. They are likely cascading from this one.
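Step 1 might be sketched as follows. This assumes compiler output in the common `path:line:col: error: message` shape (as emitted by gcc or clang); other toolchains format errors differently and would need their own pattern. The names `BuildError` and `first_error` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class BuildError(NamedTuple):
    path: str
    line: Optional[int]  # None when the tool reports no line number
    message: str


# Assumed format: "path[:line[:col]]: error: message". Warnings are skipped.
_ERROR_RE = re.compile(
    r"^(?P<path>[^\s:]+)(?::(?P<line>\d+))?(?::\d+)?:\s*error\b:?\s*(?P<msg>.+)$"
)


def first_error(output: str) -> Optional[BuildError]:
    """Return only the first error in the log; later ones may be cascades."""
    for raw in output.splitlines():
        match = _ERROR_RE.match(raw.strip())
        if match:
            line = int(match["line"]) if match["line"] else None
            return BuildError(match["path"], line, match["msg"].strip())
    return None
```

Only the returned `BuildError` would be passed to the fix agent in Step 2, never the full log.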
|
|
17
|
+
|
|
18
|
+
## Step 2: Fix
|
|
19
|
+
|
|
20
|
+
Call the Agent tool — description: "Fix [error]" — mode: "bypassPermissions" — prompt:
|
|
21
|
+
|
|
22
|
+
"[COMPLEXITY: S] Fix this single build error. FILE: [path]. LINE: [number]. ERROR: [message]. Fix this specific error. Do not fix other errors. Do not refactor. Commit: 'fix: [error description]'."
|
|
23
|
+
|
|
24
|
+
> Pass ONLY the single error. Do not show the fix agent the full error log.
|
|
25
|
+
|
|
26
|
+
## Step 3: Rebuild
|
|
27
|
+
|
|
28
|
+
Re-run ONLY the failing check (not all 6 verification checks). Count errors in the new output.
|
|
29
|
+
|
|
30
|
+
## Step 4: Evaluate
|
|
31
|
+
|
|
32
|
+
- **0 errors:** DONE. Return FIXED to the calling protocol.
|
|
33
|
+
- **Error count decreased:** Log "CASCADE: fixed 1 error, resolved [N] total." Return to Step 1 with the new first error.
|
|
34
|
+
- **Error count same or increased:** The fix was bad. Revert: `git revert HEAD --no-edit`. Try the SECOND error from the original output instead. If already tried 2 different errors, return FAILED.
|
|
35
|
+
- **Iteration count >= 5:** Return PARTIAL with remaining error count.
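The Step 4 branches can be sketched as a small decision function. The protocol does not say which branch wins when two apply at once (for example, a bad fix on the fifth iteration), so the ordering below is an assumption, as are the names `Outcome`, `Decision`, and `evaluate`. A `revert=True` result stands for running `git revert HEAD --no-edit`.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    FIXED = "FIXED"
    PARTIAL = "PARTIAL"
    FAILED = "FAILED"
    CONTINUE = "CONTINUE"                      # back to Step 1, new first error
    RETRY_SECOND_ERROR = "RETRY_SECOND_ERROR"  # try the second original error


@dataclass(frozen=True)
class Decision:
    outcome: Outcome
    revert: bool  # True means the last fix commit should be reverted


def evaluate(prev_errors: int, new_errors: int, iteration: int,
             errors_tried: int, max_iterations: int = 5) -> Decision:
    """Map error counts to the next action (branch order is an assumption)."""
    if new_errors == 0:
        return Decision(Outcome.FIXED, revert=False)
    bad_fix = new_errors >= prev_errors  # same or increased
    if bad_fix and errors_tried >= 2:
        return Decision(Outcome.FAILED, revert=True)
    if iteration >= max_iterations:
        return Decision(Outcome.PARTIAL, revert=bad_fix)
    if bad_fix:
        return Decision(Outcome.RETRY_SECOND_ERROR, revert=True)
    return Decision(Outcome.CONTINUE, revert=False)
```

Keeping the revert flag separate from the outcome means a bad fix is always undone, even when the loop ends as PARTIAL or FAILED.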
|
|
36
|
+
|
|
37
|
+
## Step 5: Report
|
|
38
|
+
|
|
39
|
+
Return to the orchestrator one of:
|
|
40
|
+
- **FIXED** — all errors resolved
|
|
41
|
+
- **PARTIAL** — [N] errors remain after 5 iterations
|
|
42
|
+
- **FAILED** — could not make progress
|
|
43
|
+
|
|
44
|
+
---
|
|
45
|
+
|
|
46
|
+
## Rules
|
|
47
|
+
|
|
48
|
+
- ONE error per fix agent. Never show a fix agent multiple errors.
|
|
49
|
+
- Revert bad fixes immediately. Do not accumulate broken fixes.
|
|
50
|
+
- Max 5 fix iterations per build-fix invocation.
|
|
51
|
+
- The fix agent is a SEPARATE agent from the verification agent. Fresh context.
|
|
52
|
+
- Track iteration count and error count delta in `docs/plans/.build-state.md`.
|
|
@@ -0,0 +1,56 @@
|
|
|
1
|
+
# Cleanup Protocol (De-Sloppify)
|
|
2
|
+
|
|
3
|
+
You are the orchestrator. An implementation agent just finished a task. Before running the metric loop, you run a focused cleanup pass on the changed files.
|
|
4
|
+
|
|
5
|
+
The insight: two focused agents outperform one constrained agent. The implementer optimizes for "make it work." The cleaner optimizes for "make it right."
|
|
6
|
+
|
|
7
|
+
## When to Skip
|
|
8
|
+
|
|
9
|
+
If the implementation was trivial (for example, a single config file change or fewer than 20 lines changed in total), skip this protocol. The overhead isn't worth it.
|
|
10
|
+
|
|
11
|
+
## Step 1: Collect the Changeset
|
|
12
|
+
|
|
13
|
+
Get the authoritative list of files changed by running `git diff --name-only HEAD~1` (or checking the implementation agent's commit). Do not rely solely on the agent's self-reported file list — use git as the source of truth. This is the cleanup scope. Nothing outside this list gets touched.
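A minimal sketch of collecting the changeset from git rather than from the agent's self-report. `git diff --name-only HEAD~1` is the command named above; the function names and the `out_of_scope` check are illustrative assumptions, not part of the protocol.

```python
import subprocess
from typing import Iterable, List


def parse_name_only(output: str) -> List[str]:
    """Split `git diff --name-only` output into one path per entry."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def changed_files(repo_dir: str = ".", base: str = "HEAD~1") -> List[str]:
    """Ask git which files differ from `base`; this list is the cleanup scope."""
    result = subprocess.run(
        ["git", "diff", "--name-only", base],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return parse_name_only(result.stdout)


def out_of_scope(touched: Iterable[str], scope: Iterable[str]) -> List[str]:
    """Files the cleanup agent touched that were not in the changeset."""
    allowed = set(scope)
    return [path for path in touched if path not in allowed]
```

Running `out_of_scope` on the cleanup commit's file list against the original scope is one way to detect a scope violation before Step 3.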
|
|
14
|
+
|
|
15
|
+
## Step 2: Invoke the Cleanup Agent
|
|
16
|
+
|
|
17
|
+
Call the Agent tool — description: "Cleanup [task name]" — mode: "bypassPermissions" — prompt:
|
|
18
|
+
|
|
19
|
+
"You are a code quality cleanup agent. Your job is to improve code quality in the files listed below WITHOUT changing behavior.
|
|
20
|
+
|
|
21
|
+
FILES IN SCOPE:
|
|
22
|
+
[list of files changed by the implementer]
|
|
23
|
+
|
|
24
|
+
ACCEPTANCE CRITERIA (do not break these):
|
|
25
|
+
[paste the task's acceptance criteria]
|
|
26
|
+
|
|
27
|
+
FIX these issues if you find them:
|
|
28
|
+
- Naming inconsistencies (variables, functions, files)
|
|
29
|
+
- Dead code and unused imports
|
|
30
|
+
- Redundant or duplicate imports
|
|
31
|
+
- Unclear variable or function names
|
|
32
|
+
- Missing error handling
|
|
33
|
+
- Code style violations
|
|
34
|
+
- Obvious DRY violations within the changed files
|
|
35
|
+
|
|
36
|
+
DO NOT:
|
|
37
|
+
- Add features or change behavior
|
|
38
|
+
- Modify the architecture or file structure
|
|
39
|
+
- Touch files outside the list above
|
|
40
|
+
- Refactor code that wasn't part of this task
|
|
41
|
+
- Modify tests unless fixing a broken assertion caused by the implementer
|
|
42
|
+
|
|
43
|
+
When finished, commit: 'refactor: cleanup [task name]'."
|
|
44
|
+
|
|
45
|
+
## Step 3: Verify
|
|
46
|
+
|
|
47
|
+
After the cleanup agent finishes, spot-check that acceptance criteria still hold. If the cleanup agent broke something, revert its commit and log the issue to `docs/plans/build-log.md`. Then proceed to the metric loop without cleanup.
|
|
48
|
+
|
|
49
|
+
---
|
|
50
|
+
|
|
51
|
+
## Rules
|
|
52
|
+
|
|
53
|
+
- The cleanup agent is a SEPARATE Agent tool call from the implementer. No cleaning your own mess.
|
|
54
|
+
- Scope is sacred. Only files from the implementation changeset. Zero exceptions.
|
|
55
|
+
- This runs AFTER implementation, BEFORE the metric loop.
|
|
56
|
+
- If cleanup breaks acceptance criteria, revert and skip. Never block the metric loop on a cleanup failure.
|
|
@@ -0,0 +1,287 @@
|
|
|
1
|
+
# Design & Visual Identity Protocol
|
|
2
|
+
|
|
3
|
+
You are the orchestrator. Phase 2 (Architecture) is complete. Before building anything, you must establish a research-backed visual design system. This phase is a FULL PEER to Architecture and Build — not a footnote.
|
|
4
|
+
|
|
5
|
+
## Why This Phase Exists
|
|
6
|
+
|
|
7
|
+
UI/UX is the first thing a user experiences. A structurally sound app with ugly UI fails. A beautiful app with minor bugs succeeds. Design is not decoration — it is the product.
|
|
8
|
+
|
|
9
|
+
Top design firms (Pentagram, Work & Co, Clay, Metalab) treat design as its own phase with its own research, iteration, and quality gates. This protocol replicates that process: Discovery → Direction → Prototyping → Visual QA.
|
|
10
|
+
|
|
11
|
+
---
|
|
12
|
+
|
|
13
|
+
## Step 3.1 — Design Research (2 agents, parallel, both use Playwright)
|
|
14
|
+
|
|
15
|
+
Launch 2 agents in ONE message. Both MUST use Playwright to capture real screenshots — text descriptions of competitor sites are insufficient. Downstream agents need visual references.
|
|
16
|
+
|
|
17
|
+
**Agent 1: "Competitive visual audit"**
|
|
18
|
+
|
|
19
|
+
```
|
|
20
|
+
You are a senior visual design researcher. Find the top 5-8 competitors or analogues for: [product description from design doc].
|
|
21
|
+
|
|
22
|
+
For each competitor:
|
|
23
|
+
1. Use Playwright to navigate to their site
|
|
24
|
+
2. Take full-page screenshots (desktop 1920x1080 + mobile 375x812)
|
|
25
|
+
3. Screenshot standout components: hero sections, cards, forms, navigation, CTAs, footer
|
|
26
|
+
4. Save all screenshots to docs/plans/design-references/competitors/[site-name]/
|
|
27
|
+
|
|
28
|
+
Analyze each site's visual language:
|
|
29
|
+
- Color palette (extract dominant colors)
|
|
30
|
+
- Typography choices (font families, scale, weight usage)
|
|
31
|
+
- Spacing rhythm (generous vs compact, section padding)
|
|
32
|
+
- Component style (shadows, borders, radius, elevation)
|
|
33
|
+
- What makes it feel premium or cheap?
|
|
34
|
+
- What would you steal vs avoid?
|
|
35
|
+
|
|
36
|
+
Output: Ranked analysis by visual quality and relevance. Include screenshot paths.
|
|
37
|
+
```
|
|
38
|
+
|
|
39
|
+
**Agent 2: "Design inspiration mining"**
|
|
40
|
+
|
|
41
|
+
```
|
|
42
|
+
You are a senior visual design researcher. Search Awwwards.com, Godly.website, and SiteInspire for award-winning sites in the category: [product category — SaaS, developer tool, e-commerce, marketplace, etc.].
|
|
43
|
+
|
|
44
|
+
For the top 5-8 results:
|
|
45
|
+
1. Use Playwright to navigate and take full-page screenshots (desktop + mobile)
|
|
46
|
+
2. Screenshot standout components and interactions worth referencing
|
|
47
|
+
3. Save all screenshots to docs/plans/design-references/inspiration/[site-name]/
|
|
48
|
+
|
|
49
|
+
Identify cross-cutting patterns:
|
|
50
|
+
- What do the best-in-class sites have in common?
|
|
51
|
+
- What visual trends dominate this category right now?
|
|
52
|
+
- What separates "Awwwards worthy" from "generic template"?
|
|
53
|
+
- What specific techniques create the premium feel? (spacing, typography, animation, color)
|
|
54
|
+
|
|
55
|
+
Output: Trend analysis with specific adoptable patterns and anti-patterns to avoid. Include screenshot paths.
|
|
56
|
+
```
|
|
57
|
+
|
|
58
|
+
After both return, synthesize a **Design Research Brief** saved to `docs/plans/design-research.md`. Include all screenshot paths for downstream agent reference.
|
|
59
|
+
|
|
60
|
+
---
|
|
61
|
+
|
|
62
|
+
## Step 3.2 — Design Direction (2 agents, sequential)
|
|
63
|
+
|
|
64
|
+
The UI Designer makes ALL decisions autonomously. No "Direction A vs B" presentations. Pick the best based on the research.
|
|
65
|
+
|
|
66
|
+
**Agent 1: UX Architect**
|
|
67
|
+
|
|
68
|
+
```
|
|
69
|
+
You are the UX Architect. Create the structural design foundation.
|
|
70
|
+
|
|
71
|
+
INPUTS:
|
|
72
|
+
- Architecture doc (frontend section): [paste]
|
|
73
|
+
- Design Research Brief: [paste from docs/plans/design-research.md]
|
|
74
|
+
- Reference screenshots: [list paths from docs/plans/design-references/]
|
|
75
|
+
- User persona from Phase 1 research: [paste relevant section]
|
|
76
|
+
|
|
77
|
+
OUTPUT a UX Foundation document:
|
|
78
|
+
1. Information architecture and content hierarchy
|
|
79
|
+
2. User flow diagrams for core interactions
|
|
80
|
+
3. Layout strategy — which pages use which layout patterns, informed by what worked in the research
|
|
81
|
+
4. Component hierarchy — what components exist, how they compose
|
|
82
|
+
5. Responsive breakpoint strategy (mobile-first)
|
|
83
|
+
6. Navigation patterns
|
|
84
|
+
7. Interaction patterns: hover, focus, loading, error, empty, success states
|
|
85
|
+
|
|
86
|
+
Base layout and flow decisions on what performed best in the competitive analysis — not generic patterns.
|
|
87
|
+
```
|
|
88
|
+
|
|
89
|
+
**Agent 2: UI Designer**
|
|
90
|
+
|
|
91
|
+
```
|
|
92
|
+
You are the UI Designer. Create the Visual Design Spec.
|
|
93
|
+
|
|
94
|
+
INPUTS:
|
|
95
|
+
- UX Foundation from UX Architect: [paste full output]
|
|
96
|
+
- Design Research Brief: [paste from docs/plans/design-research.md]
|
|
97
|
+
- Reference screenshots: [list paths from docs/plans/design-references/]
|
|
98
|
+
- User persona: [paste relevant section]
|
|
99
|
+
|
|
100
|
+
Make AUTONOMOUS decisions. Do not present options. Pick the single best direction based on the research.
|
|
101
|
+
|
|
102
|
+
OUTPUT a Visual Design Spec covering:
|
|
103
|
+
|
|
104
|
+
1. **Color System** — Primary, secondary, accent, semantic (success/warning/error/info), neutral palette. Full hex values for light AND dark themes. Rationale tied to research: "competitor X uses muted blues; we differentiate with warm neutrals because our persona values approachability."
|
|
105
|
+
|
|
106
|
+
2. **Typography System** — Font families (from Google Fonts or system fonts), size scale using a mathematical ratio (Major Third 1.25 or Perfect Fourth 1.333), weights, line heights (body: 1.5-1.6x, headings: 1.1-1.3x), letter spacing adjustments. MAX 2 font families.
|
|
107
|
+
|
|
108
|
+
3. **Spacing System** — 8px base unit. Scale: 4, 8, 12, 16, 24, 32, 48, 64, 96, 128px. Rule: internal component padding MUST be less than external margin between components (Gestalt proximity principle).
|
|
109
|
+
|
|
110
|
+
4. **Shadow & Elevation** — Layered shadow system using tinted shadows (NOT pure black — e.g., rgba(0,0,50,0.08) instead of rgba(0,0,0,0.1)). Ambient shadow + key shadow per elevation level. Levels: flat, raised (cards), elevated (dropdowns), overlay (modals), top (tooltips).
|
|
111
|
+
|
|
112
|
+
5. **Border Radius** — ONE primary radius for the entire app (pick 4px, 6px, 8px, or 12px and justify). Pill radius for tags/badges only.
|
|
113
|
+
|
|
114
|
+
6. **Animation & Motion** — Easing functions (ease-out for entrances, ease-in for exits, ease-in-out for transitions). Duration scale: micro 150ms, normal 300ms, emphasis 500ms. Stagger timing for lists: 30-50ms between items. Respect prefers-reduced-motion.
|
|
115
|
+
|
|
116
|
+
7. **Component Styles** — For each component (buttons, inputs, cards, badges, navigation, modals, alerts, tables):
|
|
117
|
+
- ALL states: default, hover, active, focus-visible, disabled, loading
|
|
118
|
+
- Exact CSS properties: background, color, border, shadow, padding, font-size, font-weight, border-radius, transition
|
|
119
|
+
|
|
120
|
+
8. **Design Rationale** — For EVERY major decision, cite the research. "The top 3 Awwwards sites in this category use geometric sans-serifs with high x-heights. Competitor Y uses Inter which is ubiquitous. We chose Space Grotesk to differentiate while maintaining the same readability characteristics."
|
|
121
|
+
|
|
122
|
+
ANTI-AI-TEMPLATE RULES:
|
|
123
|
+
Your design MUST NOT fall into the generic AI aesthetic. Penalize yourself if 3+ of these appear together:
|
|
124
|
+
- Purple-to-blue or purple-to-pink gradient hero backgrounds
|
|
125
|
+
- Floating mesh/blob gradient decorative elements
|
|
126
|
+
- Inter or Plus Jakarta Sans as the font choice (unless research specifically justifies it)
|
|
127
|
+
- 3-column icon + heading + paragraph feature grids as the primary content pattern
|
|
128
|
+
- Glassmorphism/frosted glass as the primary design language
|
|
129
|
+
- Bento grid as default layout
|
|
130
|
+
- Dark mode + neon accents as the "premium" look
|
|
131
|
+
- Generic illustration pack imagery (Undraw, Humaaans style)
|
|
132
|
+
- Perfect symmetry everywhere with no visual tension or personality
|
|
133
|
+
|
|
134
|
+
ONE or two of these in isolation are fine IF the research supports them. THREE or more together = AI template smell. Every visual choice must be JUSTIFIED by the research, not by framework defaults.
|
|
135
|
+
|
|
136
|
+
Save output to docs/plans/visual-design-spec.md.
|
|
137
|
+
```
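The typography item in the UI Designer prompt asks for a size scale built on a mathematical ratio. A minimal sketch of such a modular scale follows; the 16px base and the function name `type_scale` are assumptions, while the 1.25 (Major Third) ratio comes from the prompt.

```python
from typing import List


def type_scale(base_px: float = 16.0, ratio: float = 1.25,
               steps_down: int = 0, steps_up: int = 3) -> List[float]:
    """Font sizes base * ratio**n for n in [-steps_down, steps_up]."""
    return [round(base_px * ratio ** n, 2)
            for n in range(-steps_down, steps_up + 1)]
```

Passing `ratio=1.333` would give the Perfect Fourth variant named in the same prompt.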
|
|
138
|
+
|
|
139
|
+
---
|
|
140
|
+
|
|
141
|
+
## Step 3.3 — Proof Screens (1 implementation agent)
|
|
142
|
+
|
|
143
|
+
```
|
|
144
|
+
[COMPLEXITY: L] Implement 2-3 proof screens — the most visually demanding pages in this product:
|
|
145
|
+
|
|
146
|
+
1. Landing page / hero section (the first impression)
|
|
147
|
+
2. Main app view (dashboard, feed, workspace — the core experience)
|
|
148
|
+
3. A form or interactive component (sign up, settings, creation flow)
|
|
149
|
+
|
|
150
|
+
INPUTS:
|
|
151
|
+
- Visual Design Spec: [paste from docs/plans/visual-design-spec.md]
|
|
152
|
+
- UX Foundation: [paste relevant layout and component sections]
|
|
153
|
+
- Reference screenshots: [list paths from docs/plans/design-references/ — these are your visual targets]
|
|
154
|
+
|
|
155
|
+
REQUIREMENTS:
|
|
156
|
+
- Real, styled, responsive pages. NOT wireframes or skeletons.
|
|
157
|
+
- Use the EXACT colors, fonts, spacing, shadows from the Visual Design Spec. Do not deviate.
|
|
158
|
+
- Include hover states, focus states, transitions, loading states.
|
|
159
|
+
- Mobile-responsive at 375px, 768px, 1024px, 1280px breakpoints.
|
|
160
|
+
- These screens PROVE the design system works. They must look like they belong next to the Awwwards references from the research.
|
|
161
|
+
|
|
162
|
+
Commit: 'feat: proof screens for design validation'
|
|
163
|
+
```
|
|
164
|
+
|
|
165
|
+
---
|
|
166
|
+
|
|
167
|
+
## Step 3.4 — Visual QA Loop (Playwright + Metric Loop)
|
|
168
|
+
|
|
169
|
+
Run the Metric Loop Protocol (`commands/protocols/metric-loop.md`).
|
|
170
|
+
|
|
171
|
+
**Metric definition for `.build-state.md`:**
|
|
172
|
+
|
|
173
|
+
```
|
|
174
|
+
## Active Metric Loop
|
|
175
|
+
Phase: 3
|
|
176
|
+
Artifact: Proof screens (landing page, main app view, form/interaction)
|
|
177
|
+
Metric: Visual design quality — implementation fidelity to Visual Design Spec + competitive quality relative to Awwwards/competitor references
|
|
178
|
+
How to measure: Playwright screenshots of proof screens (desktop 1920x1080 + mobile 375x812), scored by design critic agent across 6 dimensions
|
|
179
|
+
Target: 80
|
|
180
|
+
Max iterations: 5
|
|
181
|
+
```
|
|
182
|
+
|
|
183
|
+
**Measurement agent prompt:**
|
|
184
|
+
|
|
185
|
+
```
|
|
186
|
+
You are a senior design critic at a top-tier agency (Pentagram, Work & Co). You are reviewing a product's visual implementation for quality.
|
|
187
|
+
|
|
188
|
+
INPUTS:
|
|
189
|
+
- Screenshots of current proof screens: [Playwright captures — desktop + mobile]
|
|
190
|
+
- The Visual Design Spec the implementation should follow: [paste from docs/plans/visual-design-spec.md]
|
|
191
|
+
- Reference screenshots from competitors and Awwwards winners: [paths in docs/plans/design-references/]
|
|
192
|
+
|
|
193
|
+
Score 0-100 across these 6 dimensions (weight equally, average for final score):
|
|
194
|
+
|
|
195
|
+
1. **Spacing & Alignment (0-100)**
|
|
196
|
+
- Is the 8px grid respected consistently?
|
|
197
|
+
- Do elements breathe? Generous whitespace between sections (hero padding 120-200px, not 40px)?
|
|
198
|
+
- Internal component padding < external margin between components (Gestalt proximity)?
|
|
199
|
+
- Visual grouping through whitespace, not just borders?
|
|
200
|
+
|
|
201
|
+
2. **Typography Hierarchy (0-100)**
|
|
202
|
+
- Clear 3-4 levels of visual hierarchy?
|
|
203
|
+
- Consistent type scale from the spec applied?
|
|
204
|
+
- Proper line heights (body: 1.5-1.6x, headings: 1.1-1.3x)?
|
|
205
|
+
- Font weight contrast used effectively (not just size)?
|
|
206
|
+
- Letter spacing appropriate for context?
|
|
207
|
+
|
|
208
|
+
3. **Color Harmony (0-100)**
|
|
209
|
+
- Cohesive palette matching the spec?
|
|
210
|
+
- 60-30-10 rule (60% neutral, 30% secondary, 10% accent)?
|
|
211
|
+
- WCAG AA contrast ratios (4.5:1 body, 3:1 large text)?
|
|
212
|
+
- Shadows tinted not pure black?
|
|
213
|
+
- Colors slightly desaturated (refined, not garish)?
|
|
214
|
+
|
|
215
|
+
4. **Component Polish (0-100)**
|
|
216
|
+
- Hover states present and smooth?
|
|
217
|
+
- Focus-visible indicators for keyboard nav?
|
|
218
|
+
- Consistent border radius throughout?
|
|
219
|
+
- Shadow/elevation system applied per spec?
|
|
220
|
+
- Transitions feel intentional (not instant, not sluggish)?
|
|
221
|
+
- Loading/empty states considered?
|
|
222
|
+
|
|
223
|
+
5. **Responsive Quality (0-100)**
|
|
224
|
+
- Mobile layout functional and readable at 375px?
|
|
225
|
+
- No horizontal scroll on any breakpoint?
|
|
226
|
+
- Touch targets 44px+ on mobile?
|
|
227
|
+
- Layout ADAPTS (not just stacks) — different patterns per breakpoint?
|
|
228
|
+
- Images and media scale properly?
|
|
229
|
+
|
|
230
|
+
6. **Originality (0-100)**
|
|
231
|
+
- Does this look DESIGNED or GENERATED?
|
|
232
|
+
- Penalize heavily if 3+ of these appear together:
|
|
233
|
+
* Purple/blue gradient hero background
|
|
234
|
+
* Floating blob/mesh gradient decorations
|
|
235
|
+
* Inter or Plus Jakarta Sans as the only font
|
|
236
|
+
* 3-column icon+heading+paragraph feature grids
|
|
237
|
+
* Glassmorphism cards as primary style
|
|
238
|
+
* Bento grid as default layout
|
|
239
|
+
* Dark mode + neon accents aesthetic
|
|
240
|
+
* Generic illustration pack imagery
|
|
241
|
+
* Perfect symmetry everywhere, no visual tension
|
|
242
|
+
- One or two in isolation is fine. Three+ together = "AI template" smell.
|
|
243
|
+
- The test: would a human designer say "this was made by AI"?
|
|
244
|
+
- Does the design have personality and point of view?
|
|
245
|
+
|
|
246
|
+
Return format:
|
|
247
|
+
SCORE: [average of 6 dimensions, rounded to nearest integer]
|
|
248
|
+
DIMENSION SCORES: [list each dimension with its score]
|
|
249
|
+
TOP ISSUE: [the single highest-impact change that would most improve the overall score]
|
|
250
|
+
FINDINGS: [detailed list of specific issues, each with the file path and line/component where the fix should happen]
|
|
251
|
+
```
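The Color Harmony dimension cites WCAG AA contrast ratios (4.5:1 for body text, 3:1 for large text). The sketch below computes that ratio from two hex colors using the WCAG 2.x relative-luminance definition. The function names are hypothetical, and nothing in the protocol says the measurement agent computes contrast this way; it is shown only to make the threshold concrete.

```python
def _linearize(channel_8bit: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a '#rrggbb' color."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(color_a: str, color_b: str) -> float:
    """(L_lighter + 0.05) / (L_darker + 0.05), ranging from 1 to 21."""
    la, lb = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


def meets_wcag_aa(foreground: str, background: str, large_text: bool = False) -> bool:
    """AA requires 4.5:1 for body text and 3:1 for large text."""
    return contrast_ratio(foreground, background) >= (3.0 if large_text else 4.5)
```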
|
|
252
|
+
|
|
253
|
+
**Fix agent receives:** ONLY the top issue + relevant file paths + the relevant Visual Design Spec section. One fix per iteration. Commit each fix.
|
|
254
|
+
|
|
255
|
+
**Exit conditions (from metric-loop protocol):**
|
|
256
|
+
- Score >= 80 → proceed to Phase 4
|
|
257
|
+
- Stall (2 consecutive delta <= 0) → accept if score >= 65, log warning below 65
|
|
258
|
+
- Max 5 iterations → accept if score >= 65, log warning below 65
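A sketch of the scoring and exit logic for the Visual QA loop, assuming the six dimension scores arrive as a mapping and the loop keeps a score history. The names `LoopExit`, `final_score`, and `check_exit` are hypothetical. Whether the baseline measurement counts toward the five iterations is not stated in the protocol, so the iteration count is passed in explicitly.

```python
from enum import Enum
from typing import Dict, List


class LoopExit(Enum):
    CONTINUE = "continue"
    PROCEED = "proceed"  # target reached
    ACCEPT = "accept"    # stalled or capped, score at or above the floor
    WARN = "warn"        # stalled or capped, score below the floor: log a warning


def final_score(dimension_scores: Dict[str, int]) -> int:
    """Equal-weight average of the dimensions, rounded to the nearest integer."""
    mean = sum(dimension_scores.values()) / len(dimension_scores)
    return int(mean + 0.5)


def check_exit(history: List[int], iterations: int, target: int = 80,
               floor: int = 65, max_iterations: int = 5) -> LoopExit:
    """Apply the three exit conditions to the score history so far."""
    current = history[-1]
    if current >= target:
        return LoopExit.PROCEED
    deltas = [later - earlier for earlier, later in zip(history, history[1:])]
    stalled = len(deltas) >= 2 and all(d <= 0 for d in deltas[-2:])
    if stalled or iterations >= max_iterations:
        return LoopExit.ACCEPT if current >= floor else LoopExit.WARN
    return LoopExit.CONTINUE
```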
|
|
259
|
+
|
|
260
|
+
---
|
|
261
|
+
|
|
262
|
+
## Step 3.5 — Autonomous Quality Gate
|
|
263
|
+
|
|
264
|
+
Log to `docs/plans/build-log.md`:
|
|
265
|
+
- Final proof screen screenshot paths
|
|
266
|
+
- Score history table from the metric loop
|
|
267
|
+
- Key design decisions and their research rationale
|
|
268
|
+
- Anti-AI-template dimension score
|
|
269
|
+
|
|
270
|
+
No user pause. Proceed to Phase 4 (Foundation).
|
|
271
|
+
|
|
272
|
+
---
|
|
273
|
+
|
|
274
|
+
## Rules
|
|
275
|
+
|
|
276
|
+
<HARD-GATE>
|
|
277
|
+
DESIGN RESEARCH IS NOT OPTIONAL. Step 3.1 agents MUST use Playwright to capture real screenshots of real competitor and inspiration sites. Text-only descriptions of "what their site looks like" are INSUFFICIENT — downstream agents need visual references to make informed decisions and the Visual QA measurement agent needs them for comparison.
|
|
278
|
+
|
|
279
|
+
If Playwright is unavailable: log as blocker, use web search to find and describe competitors in maximum visual detail, proceed with degraded quality. But TRY Playwright first.
|
|
280
|
+
</HARD-GATE>
|
|
281
|
+
|
|
282
|
+
- The UI Designer agent makes ALL visual decisions autonomously. No "pick A or B" presentations. The research provides the evidence; the agent makes the call.
|
|
283
|
+
- The Visual Design Spec MUST include research rationale for every major decision. Unjustified defaults are a design failure.
|
|
284
|
+
- The anti-AI-template checklist is a SCORING DIMENSION (Originality), not a hard blocker. The goal is awareness and intentional differentiation, not rigid prohibition of any single element.
|
|
285
|
+
- Proof screens are REAL implementations with real CSS/components, not mockups or wireframes. They must work responsively.
|
|
286
|
+
- The Visual QA loop is the primary quality control — no human reviews the design. The 80/100 threshold IS the taste arbiter. Treat it seriously.
|
|
287
|
+
- Screenshot data stays in measurement agents' context (separate subprocess). Do NOT load screenshots into the orchestrator's context — receive only the SCORE and TOP ISSUE as text.
|
|
@@ -0,0 +1,62 @@
|
|
|
1
|
+
# Eval Harness Protocol
|
|
2
|
+
|
|
3
|
+
You are the orchestrator. Phase 6.1 audits are complete. Before running the metric loop, define formal eval cases that are concrete, executable, and reproducible. This replaces subjective narrative audits with deterministic pass/fail tests.
|
|
4
|
+
|
|
5
|
+
## How This Differs from the Metric Loop
|
|
6
|
+
|
|
7
|
+
The metric loop answers "how good is this?" (qualitative score 0-100, iterative improvement).
|
|
8
|
+
The eval harness answers "does this specific behavior work reliably?" (binary pass/fail, deterministic).
|
|
9
|
+
|
|
10
|
+
They are complementary: eval harness failures become specific issues for the metric loop to fix.
|
|
11
|
+
|
|
12
|
+
## Step 0: Define Eval Cases
|
|
13
|
+
|
|
14
|
+
YOU (the orchestrator) define eval cases based on:
|
|
15
|
+
- Audit findings from Phase 6.1 (highest-severity items first)
|
|
16
|
+
- Architecture doc (API contracts, auth model, data validation rules)
|
|
17
|
+
- Design doc (core user flows, edge cases)
|
|
18
|
+
|
|
19
|
+
Write eval cases to `docs/plans/.build-state.md` under `## Eval Harness`:
|
|
20
|
+
|
|
21
|
+
| # | Name | Action | Expected Result | pass@k | Severity |
|
|
22
|
+
|---|------|--------|-----------------|--------|----------|
|
|
23
|
+
|
|
24
|
+
**Severity thresholds (non-negotiable):**
|
|
25
|
+
- CRITICAL: pass@5 (must pass 5/5 — 100% reliability)
|
|
26
|
+
- HIGH: pass@4 (must pass 4/5 — 80% reliability)
|
|
27
|
+
- MEDIUM: pass@3 (must pass 3/5 — 60% reliability)
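The severity thresholds reduce to a small lookup. This sketch assumes every case is executed 5 times and that "pass@k" means "at least k of those 5 runs pass", which is how the list above defines it. `REQUIRED_PASSES` and `case_passes` are hypothetical names.

```python
REQUIRED_PASSES = {"CRITICAL": 5, "HIGH": 4, "MEDIUM": 3}
RUNS_PER_CASE = 5  # assumed: each case runs 5 times regardless of severity


def case_passes(severity: str, passed_runs: int, runs: int = RUNS_PER_CASE) -> bool:
    """True when the number of passing runs meets the severity threshold."""
    if not 0 <= passed_runs <= runs:
        raise ValueError("passed_runs must be between 0 and runs")
    return passed_runs >= REQUIRED_PASSES[severity]
```

An unknown severity raises `KeyError`, which fits the rule that thresholds are fixed per severity level.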
|
|
28
|
+
|
|
29
|
+
Aim for 8-15 eval cases. Cover: auth boundaries, input validation, error handling, core happy path, primary edge cases.
|
|
30
|
+
|
|
31
|
+
**Eval cases must be concrete and executable** — actual commands (curl, function calls, UI interactions), not descriptions. Bad: "Auth should work." Good: "curl -X GET /api/recipes without Authorization header → expect 401."
|
|
32
|
+
|
|
33
|
+
## Step 1: Run Eval
|
|
34
|
+
|
|
35
|
+
Call the Agent tool — description: "Run eval harness" — mode: "bypassPermissions" — prompt:
|
|
36
|
+
|
|
37
|
+
"[COMPLEXITY: M] Run these eval cases. For each case, execute the action the specified number of times (k). Report per case: PASS (N/k passed, meets threshold) or FAIL (N/k passed, below threshold). Include the actual result on failures. [paste eval case table]"
|
|
38
|
+
|
|
39
|
+
<HARD-GATE>
|
|
40
|
+
The eval agent RUNS cases. It does NOT define them. Case definition is the orchestrator's job.
|
|
41
|
+
</HARD-GATE>
|
|
42
|
+
|
|
43
|
+
## Step 2: Score
|
|
44
|
+
|
|
45
|
+
Count PASS cases / total cases. This is the eval baseline. Record to `docs/plans/.build-state.md`.
|
|
46
|
+
|
|
47
|
+
## Step 3: Feed into Metric Loop
|
|
48
|
+
|
|
49
|
+
Any FAIL case with severity CRITICAL or HIGH becomes a candidate issue for the Phase 6.2 metric loop. Pass the failure details (case name, action, expected vs actual) as context when defining the metric loop's metric.
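Steps 2 and 3 can be sketched together: count the passing cases for the baseline, then select the failures that feed the metric loop. The `CaseResult` shape and both function names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CaseResult:
    name: str
    severity: str  # "CRITICAL", "HIGH", or "MEDIUM"
    passed: bool   # True when the case met its pass@k threshold


def baseline(results: List[CaseResult]) -> Tuple[int, int]:
    """Eval baseline as (passing cases, total cases)."""
    return sum(1 for r in results if r.passed), len(results)


def metric_loop_candidates(results: List[CaseResult]) -> List[str]:
    """Names of failing CRITICAL or HIGH cases, in their original order."""
    return [r.name for r in results
            if not r.passed and r.severity in ("CRITICAL", "HIGH")]
```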
|
|
50
|
+
|
|
51
|
+
## Step 4: Re-evaluate After Metric Loop
|
|
52
|
+
|
|
53
|
+
After the Phase 6.2 metric loop exits, re-run the eval harness. All CRITICAL cases must now pass. If any CRITICAL case still fails, flag it for the Reality Checker in Step 6.3.
|
|
54
|
+
|
|
55
|
+
---
|
|
56
|
+
|
|
57
|
+
## Rules
|
|
58
|
+
|
|
59
|
+
- Eval cases are defined by the ORCHESTRATOR, not by the eval agent.
|
|
60
|
+
- pass@k thresholds are non-negotiable per severity level.
|
|
61
|
+
- Re-run eval after metric loop to verify fixes — this is the exit gate.
|
|
62
|
+
- Eval failures feed into the metric loop as specific, concrete issues — not vague audit findings.
|
|
@@ -0,0 +1,94 @@
|
|
|
1
|
+
# Metric Loop Protocol
|
|
2
|
+
|
|
3
|
+
You are the orchestrator. You are about to run a metric-driven iteration loop on an artifact (code, architecture, docs, etc.) to drive it toward a quality target.
|
|
4
|
+
|
|
5
|
+
## Step 0: Define Your Metric
|
|
6
|
+
|
|
7
|
+
Before iterating, YOU define the metric for this specific context. Consider:
|
|
8
|
+
- What is the artifact? (a task implementation, a security audit, an architecture doc, etc.)
|
|
9
|
+
- What does "good" look like? (all tests pass, zero critical vulns, all acceptance criteria met, etc.)
|
|
10
|
+
- Is the metric quantitative (test pass rate, vuln count, coverage %) or qualitative (architecture completeness, doc clarity)?
|
|
11
|
+
|
|
12
|
+
Write a **Metric Definition** block to `docs/plans/.build-state.md`:
|
|
13
|
+
|
|
14
|
+
```
|
|
15
|
+
## Active Metric Loop
|
|
16
|
+
Phase: [current phase]
|
|
17
|
+
Artifact: [what you're iterating on]
|
|
18
|
+
Metric: [what you're measuring, in one sentence]
|
|
19
|
+
How to measure: [what the measurement agent should do — run tests, audit code, check criteria, etc.]
|
|
20
|
+
Target: [score 0-100 at which you stop]
|
|
21
|
+
Max iterations: [hard cap, default 5]
|
|
22
|
+
```
|
|
23
|
+
|
|
24
|
+
Then create a score log table:
|
|
25
|
+
|
|
26
|
+
```
|
|
27
|
+
| Iter | Score | Delta | Top Issue | Files |
|
|
28
|
+
|------|-------|-------|-----------|-------|
|
|
29
|
+
```
|
|
30
|
+
|
|
31
|
+
When starting a new metric loop, REPLACE the previous Active Metric Loop section (if any). There is only ever ONE active metric loop. Previous loop results should already be recorded in their phase's section above. When the loop completes (Step 2 exit), rename the section header from `## Active Metric Loop` to `## Completed Metric Loop — [Phase N]` and leave it for historical reference.
|
|
32
|
+
|
|
33
|
+
If you are in Phase 5, also record the current sub-step for the overall task cycle (not all of these are within the metric loop itself):
|
|
34
|
+
```
|
|
35
|
+
Sub-step: [5.1 Implement | 5.1b Cleanup | 5.2 Metric Loop | 5.3 Loop Exit | 5.4 Verify]
|
|
36
|
+
```
|
|
37
|
+
This tells the orchestrator exactly where to resume after context compaction.
|
|
38
|
+
|
|
39
|
+
## Step 1: MEASURE
|
|
40
|
+
|
|
41
|
+
Call the Agent tool — description: "Measure [metric]" — prompt:
|
|
42
|
+
|
|
43
|
+
"[How to measure, from your metric definition]. Score the current state 0-100. Return your response with a clear SCORE: [number] line, a list of FINDINGS, and the single TOP ISSUE most likely to improve the score if fixed."
|
|
44
|
+
|
|
45
|
+
Read the agent's response. You need the SCORE, the TOP ISSUE, and the file paths for diagnosis in Step 3. The full findings list is useful for diagnosis but does NOT need to persist in your context across iterations: once you've picked the top issue, you can drop the lower-priority findings. Record the score by appending a row to the score log in `docs/plans/.build-state.md`:
|
|
46
|
+
|
|
47
|
+
| Iter | Score | Delta | Top Issue | Files |
|
|
48
|
+
|------|-------|-------|-----------|-------|
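As a rough illustration of the bookkeeping in this step, the sketch below pulls the `SCORE:` line out of a measurement response and formats one score log row. The function names, the `n/a` delta for the first iteration, and the comma-separated file list are assumptions, not requirements of the protocol.

```python
import re
from typing import List, Optional


def parse_score(response: str) -> int:
    """Extract the integer from the first 'SCORE: <number>' line of the response."""
    match = re.search(r"^\s*SCORE:\s*(\d{1,3})\b", response, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no SCORE line found in measurement response")
    score = int(match.group(1))
    if score > 100:
        raise ValueError("score must be in the range 0-100")
    return score


def score_log_row(iteration: int, score: int, previous: Optional[int],
                  top_issue: str, files: List[str]) -> str:
    """Format one row of the score log; the first iteration has no delta."""
    delta = "n/a" if previous is None else f"{score - previous:+d}"
    return f"| {iteration} | {score} | {delta} | {top_issue} | {', '.join(files)} |"
```

Keeping the delta in the row makes the stall check in Step 2 readable straight from the log.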
|
|
49
|
+
|
|
50
|
+
## Step 2: CHECK EXIT
|
|
51
|
+
|
|
52
|
+
Stop the loop if ANY of these:
|
|
53
|
+
|
|
54
|
+
- **Score >= target** → done. Log "Target met at iteration [N]."
|
|
55
|
+
- **Iteration >= max** → done. Log "Max iterations reached. Final score: [N]."
|
|
56
|
+
- **Stall: last 2 scores show no improvement** (delta <= 0 twice in a row) → done. Log "Stalled at score [N]."
|
|
57
|
+
|
|
58
|
+
On stall or max iterations:
|
|
59
|
+
- **Interactive mode:** present score history + top remaining issue to user. Ask for direction.
|
|
60
|
+
- **Autonomous mode:** if score >= 60% of target, accept with warning. Otherwise skip. Log to `docs/plans/build-log.md`.
|
|
61
|
+
|
|
62
|
+
If not exiting, continue to Step 3.
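The exit conditions can be sketched as a single check over the score history. This is one reading of the rules, under two assumptions: the iteration number equals the number of recorded scores, and "delta <= 0 twice in a row" needs at least three scores. The function names are hypothetical.

```python
from typing import List, Optional


def check_exit(scores: List[int], target: int, max_iterations: int) -> Optional[str]:
    """Return the log message if the loop should stop, or None to continue.

    `scores` holds one entry per completed MEASURE step, oldest first.
    """
    if not scores:
        return None
    current = scores[-1]
    if current >= target:
        return f"Target met at iteration {len(scores)}."
    if len(scores) >= max_iterations:
        return f"Max iterations reached. Final score: {current}."
    # Two consecutive non-positive deltas: s[n] <= s[n-1] <= s[n-2].
    if len(scores) >= 3 and scores[-1] <= scores[-2] <= scores[-3]:
        return f"Stalled at score {current}."
    return None


def autonomous_accepts(score: int, target: int) -> bool:
    """Autonomous mode: accept with a warning if score >= 60% of target."""
    return 10 * score >= 6 * target  # integer form of score >= 0.6 * target
```

With a target of 85, for example, autonomous mode would accept a stalled score of 51 or higher.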
|
|
63
|
+
|
|
64
|
+
## Step 3: DIAGNOSE
|
|
65
|
+
|
|
66
|
+
Look at the findings from Step 1. Pick the ONE highest-impact issue — the single fix most likely to move the score. Do not try to fix everything at once. This is the autoresearch insight: one targeted change per iteration, measured impact.
|
|
67
|
+
|
|
68
|
+
## Step 4: IMPROVE
|
|
69
|
+
|
|
70
|
+
Call the Agent tool — description: "Fix [top issue]" — mode: "bypassPermissions" — prompt:
|
|
71
|
+
|
|
72
|
+
"TARGETED FIX: [specific issue to fix, from diagnosis]. CONTEXT: [relevant architecture/criteria]. Make this specific change. Do not refactor unrelated code. Commit: 'fix: [description]'."
|
|
73
|
+
|
|
74
|
+
> **Do NOT pass the measurement agent's full findings to this agent. Only pass the single diagnosed issue and relevant file paths.**
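One way to enforce this is to build the fix prompt from a function whose signature has no parameter for the full findings. A minimal sketch follows; the function name and the `FILES:` label are assumptions, while the remaining wording follows the prompt template above.

```python
from typing import List


def build_fix_prompt(top_issue: str, context: str, file_paths: List[str],
                     commit_description: str) -> str:
    """Assemble the fix agent's prompt from the single diagnosed issue only."""
    files = ", ".join(file_paths)
    return (
        f"TARGETED FIX: {top_issue}. "
        f"CONTEXT: {context}. "
        f"FILES: {files}. "  # assumed label; the template does not name one
        "Make this specific change. Do not refactor unrelated code. "
        f"Commit: 'fix: {commit_description}'."
    )
```

Because the measurement output never enters this function, the fix agent cannot see the score breakdown or the other findings.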
|
|
75
|
+
|
|
76
|
+
## Step 5: LOOP
|
|
77
|
+
|
|
78
|
+
Return to Step 1. Re-measure the artifact after the fix.
|
|
79
|
+
|
|
80
|
+
---
|
|
81
|
+
|
|
82
|
+
## Rules
|
|
83
|
+
|
|
84
|
+
<HARD-GATE>
|
|
85
|
+
AUTHOR-BIAS ELIMINATION: The measurement agent and the fix agent must NEVER share context.
|
|
86
|
+
- They MUST be separate Agent tool calls (separate subprocesses, separate context windows).
|
|
87
|
+
- The fix agent receives ONLY: (a) the single top issue diagnosed in Step 3, (b) the relevant file paths, (c) the acceptance criteria. It does NOT receive the measurement agent's full findings, score breakdown, or other issues.
|
|
88
|
+
- The measurement agent in the next iteration does NOT know what the fix agent did — it measures the artifact fresh.
|
|
89
|
+
- Rationale: When a reviewer shares context with an implementer, the implementer unconsciously optimizes for the reviewer's framing rather than actual quality.
|
|
90
|
+
</HARD-GATE>
|
|
91
|
+
- One fix per iteration. Measure its impact before fixing the next thing.
|
|
92
|
+
- Track ALL scores in `docs/plans/.build-state.md` so the history survives context compaction.
|
|
93
|
+
- If context was compacted mid-loop: read `docs/plans/.build-state.md`, find the Active Metric Loop section, resume from the last recorded iteration.
|
|
94
|
+
- CONTEXT HYGIENE: Measurement agents are analysis agents — read their full output for diagnosis. But once you've picked the top issue (Step 3) and dispatched the fix (Step 4), the detailed findings from THAT iteration are spent. Don't accumulate findings across iterations — each measurement is fresh.
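Resuming after context compaction can be sketched as a small parser over the state file. This assumes the score log rows sit under the `## Active Metric Loop` heading and that the section ends at the next `## ` heading; the function name is hypothetical.

```python
from typing import Optional


def last_recorded_iteration(build_state: str) -> Optional[int]:
    """Return the last iteration number logged in the Active Metric Loop section."""
    in_section = False
    last = None
    for line in build_state.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            # Entering or leaving the active section.
            in_section = stripped == "## Active Metric Loop"
            continue
        if in_section and stripped.startswith("|"):
            first_cell = stripped.strip("|").split("|")[0].strip()
            if first_cell.isdigit():  # skips the header and separator rows
                last = int(first_cell)
    return last
```

A result of `None` means no iteration has been recorded yet, so the loop would start from Step 1.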
|