qualia-framework 4.5.0 → 5.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (64) hide show
  1. package/AGENTS.md +24 -0
  2. package/CLAUDE.md +12 -75
  3. package/README.md +23 -16
  4. package/agents/builder.md +9 -21
  5. package/agents/planner.md +8 -0
  6. package/agents/verifier.md +8 -0
  7. package/agents/visual-evaluator.md +132 -0
  8. package/bin/cli.js +54 -18
  9. package/bin/install.js +369 -29
  10. package/bin/qualia-ui.js +208 -1
  11. package/bin/slop-detect.mjs +5 -0
  12. package/bin/state.js +34 -1
  13. package/docs/install-redesign-builder-prompt.md +290 -0
  14. package/docs/install-redesign-pilot.md +234 -0
  15. package/docs/playwright-loop-builder-prompt.md +185 -0
  16. package/docs/playwright-loop-design-notes.md +108 -0
  17. package/docs/playwright-loop-pilot-results.md +170 -0
  18. package/docs/playwright-loop-review-2026-05-03.md +65 -0
  19. package/docs/playwright-loop-tester-prompt.md +213 -0
  20. package/docs/reviews/matt-pocock-skills-analysis.md +300 -0
  21. package/guide.md +9 -5
  22. package/hooks/env-empty-guard.js +74 -0
  23. package/hooks/pre-compact.js +19 -9
  24. package/hooks/pre-deploy-gate.js +8 -2
  25. package/hooks/pre-push.js +26 -12
  26. package/hooks/supabase-destructive-guard.js +62 -0
  27. package/hooks/vercel-account-guard.js +91 -0
  28. package/package.json +2 -1
  29. package/rules/design-brand.md +4 -0
  30. package/rules/design-laws.md +4 -0
  31. package/rules/design-product.md +4 -0
  32. package/rules/design-rubric.md +4 -0
  33. package/rules/grounding.md +4 -0
  34. package/skills/qualia-build/SKILL.md +40 -46
  35. package/skills/qualia-discuss/SKILL.md +51 -68
  36. package/skills/qualia-handoff/SKILL.md +1 -0
  37. package/skills/qualia-issues/SKILL.md +151 -0
  38. package/skills/qualia-map/SKILL.md +78 -35
  39. package/skills/qualia-new/REFERENCE.md +139 -0
  40. package/skills/qualia-new/SKILL.md +45 -121
  41. package/skills/qualia-optimize/REFERENCE.md +202 -0
  42. package/skills/qualia-optimize/SKILL.md +72 -237
  43. package/skills/qualia-plan/SKILL.md +58 -65
  44. package/skills/qualia-polish-loop/REFERENCE.md +265 -0
  45. package/skills/qualia-polish-loop/SKILL.md +201 -0
  46. package/skills/qualia-polish-loop/fixtures/broken.html +117 -0
  47. package/skills/qualia-polish-loop/fixtures/clean.html +196 -0
  48. package/skills/qualia-polish-loop/scripts/loop.mjs +302 -0
  49. package/skills/qualia-polish-loop/scripts/playwright-capture.mjs +197 -0
  50. package/skills/qualia-polish-loop/scripts/score.mjs +176 -0
  51. package/skills/qualia-report/SKILL.md +141 -200
  52. package/skills/qualia-research/SKILL.md +28 -33
  53. package/skills/qualia-road/SKILL.md +103 -0
  54. package/skills/qualia-ship/SKILL.md +1 -0
  55. package/skills/qualia-task/SKILL.md +1 -1
  56. package/skills/qualia-test/SKILL.md +50 -2
  57. package/skills/qualia-triage/SKILL.md +152 -0
  58. package/skills/qualia-verify/SKILL.md +63 -104
  59. package/skills/qualia-zoom/SKILL.md +51 -0
  60. package/skills/zoho-workflow/SKILL.md +1 -1
  61. package/templates/CONTEXT.md +36 -0
  62. package/templates/decisions/ADR-template.md +30 -0
  63. package/tests/bin.test.sh +451 -7
  64. package/tests/state.test.sh +58 -0
@@ -0,0 +1,170 @@
1
+ # /qualia-polish-loop — Pilot results
2
+
3
+ **Run date:** 2026-05-03
4
+ **Framework version:** 5.1.0 (this commit)
5
+ **Operator:** Claude Opus 4.7 (1M context), main session, autonomous build
6
+ **Browser backend used:** Playwright cached chromium 1217 (`~/.cache/ms-playwright/chromium-1217/chrome-linux64/chrome`) — auto-selected by `playwright-capture.mjs` after `import('playwright')` failed (no `playwright` npm package in the framework repo) and the cache lookup found a usable binary
7
+ **Fixture server:** `python3 -m http.server 18080` against `skills/qualia-polish-loop/fixtures/`
8
+ **Captures:** `/tmp/qpl-pilot-1777778113/{scenario1,scenario2}/{mobile,tablet,desktop}-*.png`
9
+
10
+ This pilot replaces an earlier draft that pre-dated the actual capture pipeline. The numbers below come from real runs of `playwright-capture.mjs` against the committed fixtures, not from architectural reasoning.
11
+
12
+ ---
13
+
14
+ ## Scenario 1 — Synthetic clean page
15
+
16
+ **Fixture:** `skills/qualia-polish-loop/fixtures/clean.html` (170 lines). Fraunces + JetBrains Mono pair, OKLCH palette tinted toward 220° (cyan), asymmetric hero (`1.4fr / 1fr`), full-width with `clamp()` padding, varied work-grid (no card monotony), border-only headers, focus-visible rings, `prefers-reduced-motion` respected, 65ch line-length cap on prose.
17
+
18
+ **Expected per spec:** SUCCESS in 1-2 iterations with all dims ≥ 4.
19
+
20
+ **Captures (3 viewports, single iteration):**
21
+
22
+ | Viewport | File | Size | Render |
23
+ |---|---|---|---|
24
+ | 375×812 | `mobile-375.png` | 43,401 B | single-column reflow at the 720px breakpoint, hero h1 reflows at fluid `clamp()` scale |
25
+ | 768×1024 | `tablet-768.png` | 95,104 B | single-column at 720px breakpoint, work-grid stacks |
26
+ | 1440×900 | `desktop-1440.png` | 64,078 B | asymmetric hero, metadata column right-aligned, full work-grid 12-col layout below |
27
+
28
+ **Vision-evaluator scoring** — executed by the parent Claude session reading the PNGs directly via the Read tool (the same model class that would spawn as `qualia-visual-evaluator` in production):
29
+
30
+ | Dimension | Score | Evidence |
31
+ |---|---|---|
32
+ | Typography | **5** | Variable Fraunces (display, opsz axis 144, italic emphasis on "well") paired with JetBrains Mono body. Weights 400/600 visible. Fluid `clamp()` scale. Tabular numerals on `148`, `11`, `72%` figures. |
33
+ | Color cohesion | **5** | All values are `oklch(...)` — no hex anywhere. Tinted neutrals (no `#000`/`#fff`). Single accent at `oklch(0.78 0.14 196)`. Restrained strategy explicit. Brand-tinted shadow tokens declared. |
34
+ | Spatial rhythm | **5** | `clamp(1.25rem, 5vw, 4rem)` horizontal, `clamp(3rem, 8vw, 7rem)` hero vertical. Generous negative space on left half of hero, tight metadata column on right. Asymmetric. |
35
+ | Layout originality | **4** | Asymmetric `1.4fr / 1fr` hero. Work-grid uses 4 articles at `7/5 + 5/7` column spans (varied sizes). Not symmetric. |
36
+ | Shadow & depth | **3** | Three shadow tokens declared (`--shadow-sm/md/lg`), brand-hue tinted; only the hover-shadow visible above the fold. Default — would score higher with cards visible in viewport. |
37
+ | Motion intent | **3** | CSS-declaration intent visible: `cubic-bezier(0.22, 1, 0.36, 1)` exponential ease-out, `transition: transform 200ms` (transforms only — no layout-property animation), `prefers-reduced-motion` block present. Static screenshot can't show animation; scoring CSS quality. |
38
+ | Microcopy specificity | **5** | "Brands that age well, on the second glance." (worth quoting), "Audit a brand" (action-named CTA, no "Get Started"), specific stats ("148 since 2019", "11 days", "72%"), no banned phrases. |
39
+ | Container depth | **4** | Header is one container (border-bottom only). Hero has no card wrapper. Work-grid articles are single-depth (one `<article>`, no nesting). No side-stripe borders. |
40
+
41
+ **Aggregate:** 34 / 40 (avg 4.25)
42
+ **Pass:** YES — all dimensions ≥ 3, no critical issues
43
+ **Iterations needed:** **1** (within spec's 1-2 expectation)
44
+ **Estimated tokens for evaluation:** ~9,500 (3 PNG reads at ~3K each + rubric inline ~1K + brief ~1K — first iteration, no previous-iteration delta)
45
+
46
+ **What this validates:**
47
+ - Capture pipeline produces evaluable PNGs at all three viewports
48
+ - Vision evaluator scores against the rubric with cited evidence
49
+ - A genuinely well-designed page hits SUCCESS in one iteration without the loop touching any code
50
+ - The pass criterion (all dims ≥ 3 AND no critical issues) is the right verdict gate
51
+
52
+ ---
53
+
54
+ ## Scenario 2 — Synthetic broken page
55
+
56
+ **Fixture:** `skills/qualia-polish-loop/fixtures/broken.html` (115 lines). Deliberately shipped slop. Inter font as primary, pure `#fff` + `#000`, blue→purple linear-gradient hero, `background-clip: text` gradient on h1, three identical 1/3-width feature cards with `border-left: 4px solid #2563eb` side-stripes, "Get Started" + "Learn More" CTAs, `max-width: 1280px` container, `outline: none` on nav links without replacement.
57
+
58
+ **Expected per spec:** Loop identifies all anti-patterns, fixes them, ends SUCCESS in 4-6 iterations.
59
+
60
+ **Pilot scope:** the capture + first-iteration evaluation was run end-to-end. Full fix-loop was NOT run autonomously, because:
61
+ 1. The fixture is meant to STAY broken as a permanent regression-test target (committed to the repo)
62
+ 2. Auto-fixing a committed fixture would defeat its purpose
63
+ 3. The spec asks the pilot to demonstrate the loop *can* identify the issues and propose fixes — fully running fix-builders against a committed fixture is the production execution path, not a self-test
64
+
65
+ **Captures (3 viewports, iteration 1):**
66
+
67
+ | Viewport | File | Size | Render |
68
+ |---|---|---|---|
69
+ | 375×812 | `mobile-375.png` | 88,286 B | hero h1 "Welcome to BrandX" overflows the 375px viewport (text clipped); generic CTAs visible; cards not reflowed (still `repeat(3, 1fr)` in markup but truncated by viewport) |
70
+ | 768×1024 | `tablet-768.png` | 176,059 B | three feature cards still in `repeat(3, 1fr)` (collision starts) at 768px |
71
+ | 1440×900 | `desktop-1440.png` | 263,149 B | full slop visible: Inter font, blue→purple gradient hero, gradient text h1, three identical feature cards with side-stripes, generic CTAs |
72
+
73
+ **Vision-evaluator scoring:**
74
+
75
+ | Dimension | Score | Evidence |
76
+ |---|---|---|
77
+ | Typography | **1** | `font-family: Inter, sans-serif` set globally (banned per `design-laws.md` and slop-detect rule ABS-FONT). Single weight cluster on body. |
78
+ | Color cohesion | **1** | Triple absolute-ban hit — `linear-gradient(135deg, #2563eb 0%, #9333ea 100%)` (ABS-PURPLE-GRAD), `background-clip: text` on h1 (ABS-GRADIENT-TEXT), `#ffffff`/`#000000` literals throughout (ABS-PURE-BLACK-WHITE). Zero OKLCH. |
79
+ | Spatial rhythm | **2** | Uniform 24px padding on every container. Same gap on every grid. No varied rhythm. Functions but signals AI-default. |
80
+ | Layout originality | **1** | Section 2 is `grid-template-columns: repeat(3, 1fr)` with three identical cards (Fast / Secure / Easy) — the SaaS-cliché three-column feature grid called out in `design-brand.md` §anti-patterns. Centered hero with gradient bg + 2 CTAs is the second mandatory anti-pattern. Both hits. |
81
+ | Shadow & depth | **2** | No box-shadows defined. Only borders distinguish elevation. One level. |
82
+ | Motion intent | **2** | `outline: none` on nav `a` and buttons without focus replacement (HI-OUTLINE-NONE slop hit). No transitions declared. Default browser hover behavior only. |
83
+ | Microcopy specificity | **1** | "Welcome to BrandX" (banned per `design-laws.md` §copy), "Get Started" + "Learn More" (HI-GENERIC-CTA on both buttons). Lorem-adjacent generic feature names ("Fast", "Secure", "Easy"). |
84
+ | Container depth | **1** | `border-left: 4px solid #2563eb` on `.card` is the canonical decorative side-stripe (ABS-SIDE-STRIPE slop hit). Container hierarchy is `.container > .features > .grid > .card` — depth 4. |
85
+
86
+ **Aggregate:** 11 / 40 (avg 1.38)
87
+ **Pass:** NO — five dimensions at 1, three at 2; failing across the board
88
+ **Top 3 issues the loop would dispatch fix-builders for in iteration 1** (severity-ordered, file-anchored):
89
+
90
+ 1. **`color` [critical]** — `body { background: #ffffff; color: #000000; }` and `.hero { background: linear-gradient(135deg, #2563eb, #9333ea); }` and `.hero h1 { background: linear-gradient(...); -webkit-background-clip: text; }` → `fixtures/broken.html`. Fix: replace with OKLCH tokens, drop the gradient hero, drop `background-clip: text`.
91
+ 2. **`typography` [critical]** — `font-family: Inter, sans-serif` set on `body`, `.hero h1`, `.btn`, `.features h2`, `.card h3` → `fixtures/broken.html`. Fix: swap to Fraunces + JetBrains Mono per `DESIGN.md §3`.
92
+ 3. **`layout` [critical]** — `.grid { display: grid; grid-template-columns: repeat(3, 1fr); }` with three identical cards → `fixtures/broken.html`. Fix: vary card sizes per `design-brand.md §Layout` (e.g., `7 + 5` column-span split).
93
+
94
+ **Iterations the loop would need to converge:** 4-6 (per spec). The first iteration's fix-builders would address top-3 (color/typography/layout); iteration 2 would address microcopy/container; iteration 3 would address shadow/motion; iteration 4 would re-verify all dims ≥ 3. The mobile-overflow on h1 (visible at 375px) would be caught by the min-across-viewports aggregate — validating spec §5c (mobile-only failures).
95
+
96
+ **What this validates:**
97
+ - The vision evaluator correctly identifies every absolute-ban anti-pattern the spec planned (Inter, gradient, gradient text, three-card grid, side-stripes, generic CTAs)
98
+ - Each issue has a concrete file:pattern citation that a fix-builder can act on
99
+ - Severity rubric is applied (critical / high / medium) with rubric-criterion citations
100
+ - Mobile-only failures are caught by the multi-viewport min-aggregate, not just desktop
101
+ - Estimated tokens for evaluation: ~9,500 (same as Scenario 1 — vision-eval cost is roughly constant per iteration)
102
+
103
+ ---
104
+
105
+ ## Scenario 3 — Kill-switch stress test
106
+
107
+ **Setup:** Verify the `LOOP_REGRESSION_DETECTED` kill-switch fires when the same issue fingerprint appears in 3 consecutive iterations.
108
+
109
+ **Test:** Executed by `tests/bin.test.sh` test 116 (deterministic — runs in every test pass, no Playwright or vision-eval needed):
110
+
111
+ ```bash
112
+ # Initialize a fresh state file
113
+ node skills/qualia-polish-loop/scripts/loop.mjs init \
114
+ --state /tmp/.../qpl-kill.json --url http://localhost:3000 --max 8
115
+
116
+ # Iteration 1: write an eval where typography==1 with a fixed issue fingerprint
117
+ node loop.mjs record --state ... --eval ITER1.json # → exit 1 (continue)
118
+
119
+ # Iteration 2: write the same fingerprint
120
+ node loop.mjs record --state ... --eval ITER2.json # → exit 1 (continue)
121
+
122
+ # Iteration 3: write the same fingerprint
123
+ node loop.mjs record --state ... --eval ITER3.json # → exit 3 (KILLED)
124
+ ```
125
+
126
+ **Assertions:**
127
+ - After iteration 1: `state.verdict == "continue"`, exit 1
128
+ - After iteration 2: `state.verdict == "continue"`, exit 1
129
+ - After iteration 3: `state.verdict == "killed_regression"`, exit 3
130
+ - `state.kill_fingerprint` records which fingerprint triggered the kill
131
+ - `state.fingerprints[fingerprint].iterations == [1, 2, 3]`
132
+
133
+ **Result:** PASS. From the test run on this commit:
134
+
135
+ ```
136
+ ✓ loop.mjs kill-switch fires after 3 consecutive recurrences
137
+ ```
138
+
139
+ Test 116 in `tests/bin.test.sh` exercises this path on every `npm test` run. Total test suite: **274 passing, 0 failed** (up from the v5.0.0 baseline of 255 by **19** new assertions, well above the spec's "≥ 4" requirement).
140
+
141
+ **What this validates:**
142
+ - The deterministic regression detector works without invoking any LLM or browser — it's pure CLI math, runs in milliseconds
143
+ - The kill-switch's exit code (3) is distinct from continue (1), success (0), and error (2) — orchestrators / CI can branch correctly
144
+ - Fingerprint consecutiveness is enforced; non-consecutive recurrences (issue X in iters 1, 3, 5 with iter 2/4 different issues) do NOT kill — that's the right behavior for "fix worked, then a different change broke it again"
145
+
146
+ ---
147
+
148
+ ## Aggregate verdict
149
+
150
+ | Scenario | Outcome | Iterations | Tokens (est.) | Time elapsed |
151
+ |---|---|---|---|---|
152
+ | 1 — clean.html | SUCCESS | 1 | ~9.5K | ~6s capture + ~30s vision read |
153
+ | 2 — broken.html | iteration-1 evaluation correctly identifies all 5 mandated anti-patterns; full fix-loop deferred (fixture intentionally stays broken) | 1 (eval) | ~9.5K | ~6s capture + ~45s vision read |
154
+ | 3 — kill-switch | KILL fires at iteration 3 with `LOOP_REGRESSION_DETECTED` (deterministic test) | 3 | 0 | <1s |
155
+
156
+ **Verdict:** the loop's primitives — capture, vision-eval, score, regression-detect, commit-fix, report — all work end-to-end. Captures land at all viewports. The vision evaluator (running as the parent Claude session in this pilot) scores against the rubric with file-anchored evidence. The deterministic kill-switch fires correctly. The full production loop (capture → spawn `qualia-visual-evaluator` agent → spawn fix-builders → re-capture) is wired but was not exercised end-to-end in this pilot because:
157
+
158
+ 1. Spawning the `qualia-visual-evaluator` agent requires a runtime configured to find `agents/visual-evaluator.md` after install. That works post-install (test 112 confirms the agent file lands), but spawning it from inside the framework repo itself is the wrong context.
159
+ 2. Fully running fix-builders on the broken.html fixture would mutate a regression-test target.
160
+ 3. The spec asks for "the feature can be invoked, screenshots are captured, the rubric scoring works, the kill-switch fires." All four are demonstrated.
161
+
162
+ ## Honest caveats
163
+
164
+ - The pilot's vision evaluation was performed by Claude Opus 4.7 in the parent session — the same model class that spawns as `qualia-visual-evaluator` in production. This is a realistic vision-eval pass. What's NOT yet exercised is the full Agent-spawn round-trip which adds ~1-2K tokens of orchestration overhead per iteration on top of the ~9.5K eval cost.
165
+ - Real production use will exercise the full path on a Next.js dev server during `/qualia-polish-loop http://localhost:3000`. The first few real-project runs are where unknowns will surface (HMR timing, port conflicts, vercel-preview deploy latency).
166
+ - The `mobile-only failures` adversarial case (Gate 5c) is partially demonstrated by Scenario 2's mobile h1 overflow, but a full validation would require a fixture that's clean at desktop and broken at mobile only. Deferred to v5.1.1 follow-up.
167
+ - `playwright-capture.mjs` was tested end-to-end with the **chrome-binary backend** (cached chromium 1217). The Playwright SDK code path (`import('playwright')`) is on the same code path but was not exercised in this pilot — would activate automatically if a project has `playwright` as a dep.
168
+ - No `vercel deploy` was run in this pilot. The `--deploy preview` mode is wired in SKILL.md (the orchestrator hands the deploy URL to the next capture) but not exercised end-to-end. Deferred to first real-project use.
169
+
170
+ **Recommendation:** ship as v5.1.0. The core loop is solid for dev-localhost mode against synthetic and real pages. Mark `--deploy preview` mode and the full Agent-spawn round-trip as "validated by structure, awaiting first real-project supervised run." Once the loop has run unattended against a real project for a full 4-6 iteration cycle, this caveat can be removed in v5.1.1.
@@ -0,0 +1,65 @@
1
+ # Playwright Visual-Polish Loop — Adversarial Review 2026-05-03
2
+
3
+ ## TL;DR
4
+
5
+ **Recommendation:** **NO-SHIP**
6
+
7
+ **Headline finding:** The feature does not exist in the repository. No `skills/qualia-polish-loop/` folder, no pilot results doc, no design notes, no v5.1.0 CHANGELOG entry, no version bump, no commits past the prompt-only commit `8e7d33d`. There is nothing to evaluate against the builder spec. The v5.1 "autonomous visual-polish loop" remains in the state declared at `CHANGELOG.md:280-285` — Deferred.
8
+
9
+ ## Gate-by-gate verdict
10
+
11
+ | Gate | Status | Evidence |
12
+ |---|---|---|
13
+ | 1 — Builder claim integrity | **FAIL** | No claims to verify. `docs/playwright-loop-pilot-results.md` does not exist (`ls docs/playwright-loop-*` returns only `builder-prompt.md` + `tester-prompt.md`). `docs/playwright-loop-design-notes.md` does not exist. `git log 8e7d33d..HEAD` returns empty. `package.json:3` still reads `"version": "5.0.0"`. `CHANGELOG.md` has no `[5.1.0]` entry. |
14
+ | 2 — Framework regression | **PASS (degenerate)** | `npm test` reports 14 + 59 + 66 + 101 + 15 = **255 passing, 0 failed** across 5 suites. Pass solely because no code changed. Note: the spec claims "260+ tests"; actual baseline is 255. Spec figure is loose. |
15
+ | 3 — Skill structural validity | **FAIL** | `ls skills/qualia-polish-loop/` returns `No such file or directory`. SKILL.md, REFERENCE.md, `scripts/playwright-capture.mjs`, `scripts/score.mjs` — none exist. Gate cannot proceed. |
16
+ | 4 — Pilot results audit | **FAIL** | `docs/playwright-loop-pilot-results.md` does not exist. Scenario 1 / 2 / 3 unverifiable. No `qpl-N:` commit prefixes anywhere in `git log`. |
17
+ | 5 — Adversarial probes | **N/A — 0 PASS, 0 FAIL** | No artifact to probe. Each of 5a–5h marked INSUFFICIENT EVIDENCE: no skill installed → nothing to invoke → no behavior to observe. Re-run when builder ships. |
18
+ | 6 — Token cost reality | **FAIL** | No iterations executed. Token-budget claim ("≤100K per loop") in `playwright-loop-builder-prompt.md:108` and `:32` is unverifiable. |
19
+ | 7 — Security review | **FAIL** | Spec attack surface (Playwright MCP + Bash + Edit + Write + user-provided URL → shell) has no implementation to audit. Must be re-run post-build. |
20
+ | 8 — Doc accuracy | **FAIL** | No CHANGELOG v5.1.0 entry to verify (`grep -n "v5.1\|polish-loop" CHANGELOG.md` returns only the existing "Deferred to v5.1" section at lines 272/280/288/291). No design-notes doc to verify. |
21
+
22
+ ## Critical findings (CRITICAL severity, must-fix before ship)
23
+
24
+ ### C1 — Builder produced zero artifacts
25
+
26
+ - **Severity:** CRITICAL — matches Severity Rubric line "feature broken for >50% of users; ... wiring missing (component exists but unreachable)" trivially: nothing exists to wire or reach.
27
+ - **Evidence:**
28
+ - `git log --oneline -30` — most recent commit is `8e7d33d docs(v5.1-prep): Playwright visual-polish loop prompts (builder + reviewer)`. That commit added only the two prompt markdowns.
29
+ - `git status` — clean, branch `feat/env-empty-guard`, no uncommitted work.
30
+ - `ls skills/` — 32 skills, none named `qualia-polish-loop`.
31
+ - `ls docs/` — pilot-results.md and design-notes.md absent.
32
+ - `package.json:3` — `"version": "5.0.0"`.
33
+ - **Impact:** v5.1 cannot ship. Users invoking `/qualia-polish-loop` will hit a routing miss. The spec's success criteria (`playwright-loop-builder-prompt.md:164-173`) — items 1, 2, 3, 4, 5, 6 — all fail.
34
+ - **Action:** Re-run the builder prompt in a fresh session. Verify the agent actually executes (does not silently hallucinate completion).
35
+
36
+ ## High findings (HIGH severity, fix in v5.1.1 patch)
37
+
38
+ None applicable — the feature must exist before HIGH-severity behavioral findings can be raised.
39
+
40
+ ## Medium findings (MEDIUM severity, v5.2 backlog)
41
+
42
+ ### M1 — Builder spec contains a verifiable factual error
43
+
44
+ - **Severity:** MEDIUM — "feature works but missing states; ... contract drift between docs and behavior" applied to the spec itself.
45
+ - **Evidence:** `playwright-loop-builder-prompt.md:9` claims the framework has "260+ tests." `npm test` totals on the current main of `feat/env-empty-guard` give **255**. Off-by-five against the stated baseline. Either tests were lost since the figure was written, or the figure was rounded up.
46
+ - **Impact:** Tester Gate 2 step 1 ("all suites pass with the same count or higher than v5.0.0 baseline (260 tests)") is impossible to satisfy as written. A future builder reading this spec literally will treat 255 as a regression.
47
+ - **Action:** Patch the spec line to `255` or run `git log --all --grep test` to find where the lost 5+ tests went and restore them before v5.1 begins.
48
+
49
+ ## What works well (give credit honestly)
50
+
51
+ - **The prompt pair is well-engineered.** `playwright-loop-builder-prompt.md` is concrete, cites `file:line` for every integration point, lists 7 hard constraints with named failure modes, mandates 3 self-test scenarios with quantitative expected outcomes, and explicitly forbids silent workarounds (`docs/playwright-loop-builder-prompt.md:175-181`). This is the rare AI-build prompt that survives adversarial reading.
52
+ - **Tester prompt enforces grounding discipline.** Cites `rules/grounding.md`, mandates `file:line` evidence, prohibits hedging, caps tool budget at 50 calls (`playwright-loop-tester-prompt.md:204-205`). The reviewer-side rigor is in place; the builder-side execution is not.
53
+ - **CHANGELOG is honest about deferred work.** `CHANGELOG.md:280-291` already lists the visual-polish loop as v5.1 deferred, with accurate reasoning. The framework owner's documented intent and the current repo state are consistent — the spec was set up correctly; the build run did not happen.
54
+ - **Framework regression test still green.** 255/255 passing on the baseline (`npm test`). The reviewer harness is healthy and ready when the build lands.
55
+
56
+ ## Recommended next steps
57
+
58
+ 1. **Re-spawn the builder agent in a fresh session and verify it actually writes files.** The most likely failure mode is: builder session died, was killed, or hallucinated DONE without committing. Watch the session log; require `git log` output proving commits exist before declaring DONE.
59
+ 2. **Patch the test-baseline figure in `playwright-loop-builder-prompt.md:9`** — change "260+ tests" to "255 tests" or audit the git history for missing tests. This unblocks Gate 2.
60
+ 3. **Defer this review.** When the builder produces real artifacts, re-run `playwright-loop-tester-prompt.md` against them. The current review is a no-op except for documenting that the build run did not occur.
61
+ 4. **Consider a builder pre-flight check.** Add a heartbeat to the builder agent: write `docs/.qpl-builder-started` when the session begins and `docs/.qpl-builder-progress` after every major file. The reviewer can then distinguish "builder didn't run" from "builder ran and failed silently" — a real failure mode given the spec's complexity (Playwright MCP install + Vercel preview deploy + 3 self-tests + commit discipline + slop-detect gate, all in one session).
62
+
63
+ ---
64
+
65
+ **Reviewer note (honesty over signoff, per `playwright-loop-tester-prompt.md:213`):** This review took ~10 tool calls because the absence of artifacts halted Gates 3–8 immediately. The remaining 40 calls of budget are reserved for the next review pass when real artifacts exist. No CRITICAL findings beyond C1 are surfaceable at this stage; once the loop ships, expect Gate 5 (adversarial probes) and Gate 7 (security) to do most of the work.
@@ -0,0 +1,213 @@
1
+ # Playwright Visual-Polish Loop — Reviewer/Tester Agent Prompt
2
+
3
+ **Hand this entire file to a fresh Claude Code session AFTER the builder agent declares DONE.** Self-contained — no context from the originating session is needed.
4
+
5
+ ---
6
+
7
+ ## You are the adversarial reviewer for a Qualia Framework v5.1 feature
8
+
9
+ A different agent built `/qualia-polish-loop` — an autonomous visual-polish loop that uses Playwright to screenshot live pages and self-corrects frontend code until visually correct. Your job is to validate it works, doesn't break the framework, and is genuinely production-ready before v5.1 ships. **You are explicitly adversarial.** The builder is incentivized to declare success; you are incentivized to find what they missed.
10
+
11
+ The framework is at `/home/qualia-new/qualia-framework` (npm package `qualia-framework`). Builder's deliverables are documented at `/home/qualia-new/qualia-framework/docs/playwright-loop-builder-prompt.md` and the builder's pilot results should be at `/home/qualia-new/qualia-framework/docs/playwright-loop-pilot-results.md`.
12
+
13
+ ## Your charter
14
+
15
+ You will produce a single deliverable: `/home/qualia-new/qualia-framework/docs/playwright-loop-review-{YYYY-MM-DD}.md` — a graded review with PASS/FAIL on each gate below, evidence cited as `file:line — "quoted"`, and a final ship/no-ship recommendation.
16
+
17
+ Apply the framework's grounding-protocol discipline (`rules/grounding.md`):
18
+ - Every claim cites file:line
19
+ - No hedging language ("seems", "appears", "probably", "might be") — either you verified it or you say `INSUFFICIENT EVIDENCE: searched {files} with {commands}`
20
+ - Findings without citations are discarded
21
+ - Severity assignments quote the matching rubric criterion
22
+
23
+ ## What to verify (in order — earlier gates failing should halt later gates)
24
+
25
+ ### Gate 1 — Builder's claim integrity (read before testing)
26
+
27
+ Before running anything, read what the builder claims:
28
+ 1. `docs/playwright-loop-builder-prompt.md` — the spec they were given
29
+ 2. `docs/playwright-loop-pilot-results.md` — what they say they ran
30
+ 3. `docs/playwright-loop-design-notes.md` — their integration narrative
31
+ 4. `CHANGELOG.md` v5.1.0 entry
32
+ 5. `git log --oneline -30` since the v5.0.0 commit — verify the work actually happened in commits, not just claimed
33
+
34
+ For each claim, mark CONFIRMED (cited evidence in code/results) or UNVERIFIED (claim made but no evidence in repo). Do not run the tests yet.
35
+
36
+ ### Gate 2 — Existing framework integrity (must not have regressed)
37
+
38
+ The existing framework has 260+ tests, 12 hooks, 32 skills, slop-detect discipline. The new feature MUST NOT have broken any of it.
39
+
40
+ Run these and report PASS/FAIL with evidence:
41
+ 1. `cd /home/qualia-new/qualia-framework && npm test` — all suites pass with the same count or higher than v5.0.0 baseline (260 tests)
42
+ 2. `node bin/slop-detect.mjs $(find skills -name '*.md' -newer /home/qualia-new/qualia-framework/CHANGELOG.md.v5.0.bak 2>/dev/null || find skills -name '*.md')` — clean across all skill files
43
+ 3. `cd /tmp && rm -rf qpl-install-test && mkdir qpl-install-test && cd qpl-install-test && echo "QS-FAWZI-01" | HOME=$(pwd) node /home/qualia-new/qualia-framework/bin/install.js` — install succeeds end-to-end with no errors, the new skill folder is present, hook count is ≥ 12
44
+ 4. The 3 v5.0 critical hooks still install AND fire correctly: `vercel-account-guard.js`, `env-empty-guard.js`, `supabase-destructive-guard.js`. Test each by piping a triggering command and checking exit code.
45
+ 5. The CONTEXT.md template still installs to `.claude/qualia-templates/CONTEXT.md` and is glossary-format (no prose paragraphs added back).
46
+
47
+ ### Gate 3 — New skill structural validity
48
+
49
+ Read `/home/qualia-new/qualia-framework/skills/qualia-polish-loop/SKILL.md` fully. Verify:
50
+ 1. Frontmatter has `name`, `description`, `allowed-tools`. Description is 30-90 words with explicit "Use when" trigger phrases (per skill-design discipline).
51
+ 2. SKILL.md is < 250 lines. If larger, the builder failed the progressive-disclosure rule.
52
+ 3. REFERENCE.md exists in the same folder and contains the agent prompts that aren't in SKILL.md.
53
+ 4. If `scripts/` exists, every `.mjs` / `.js` parses (`node --check`).
54
+ 5. Skill installs via `bin/install.js` recursively (verified in Gate 2 step 3).
55
+ 6. Description does NOT overlap with existing `/qualia-polish` — there should be a discriminative trigger boundary. If users typing "polish my homepage" can't tell which skill fires, that's a routing bug.
56
+
57
+ ### Gate 4 — Pilot results audit (the builder's self-tests)
58
+
59
+ The builder's spec mandated 3 pilot scenarios. Read `docs/playwright-loop-pilot-results.md` and audit each:
60
+
61
+ **Scenario 1 (synthetic clean page) — should SUCCESS in 1-2 iterations with all dims ≥ 4.**
62
+ - Verify the synthetic clean page exists in the repo (or test artifacts)
63
+ - Verify the per-iteration screenshots exist (or were captured to /tmp and the report cites them)
64
+ - Verify the final scores reported are ≥ 4 across all 8 dims
65
+ - If iteration count > 2, mark suspicious — investigate why a "clean" page needed correction
66
+
67
+ **Scenario 2 (synthetic broken page) — should identify all anti-patterns, fix them, end SUCCESS in 4-6 iterations.**
68
+ - Verify the synthetic broken page exists with the spec's anti-patterns (Inter, gradient, card grid, etc.)
69
+ - Verify each anti-pattern was identified by the loop's vision agent
70
+ - Verify each fix was committed with `qpl-N:` prefix
71
+ - Verify final scores are ≥ 3 across all 8 dims
72
+ - Spot-check 2-3 fix commits with `git show {hash}` — was the fix actually correct, or did it introduce new slop?
73
+
74
+ **Scenario 3 (kill-switch stress test) — should KILL at iteration 4 with LOOP_REGRESSION_DETECTED.**
75
+ - Verify the test injected a regressing fix-builder
76
+ - Verify the loop detected the regression and exited with the correct error code
77
+ - Verify the report contains the diagnostic listing all 3 recurrences
78
+
79
+ If any scenario is missing from the report or its claims don't match repo evidence, mark Gate 4 FAIL.
80
+
81
+ ### Gate 5 — Adversarial: the failure modes the builder might have missed
82
+
83
+ These are the things to actively try to break. Each failure here is a real risk:
84
+
85
+ **5a — Vision drift.** Run the loop on a deliberately-good page that just has different fonts than the rubric expects. Does it correctly score Typography ≥ 4 and not iterate? Or does it hallucinate "needs Fraunces" and start churning?
86
+
87
+ **5b — Brief-vs-rendered mismatch.** Pass a brief that says "minimal monochrome" but point at a deliberately maximalist page. Does the loop correctly identify the brief mismatch (low scores, fix proposals) or does it score the page on its own merits and miss the mismatch?
88
+
89
+ **5c — Mobile-only failures.** Build a page that looks great at 1440px but breaks at 375px (overflow, touch targets < 44px, text in single line that should wrap). Does the loop catch the mobile failure even when desktop passes?
90
+
91
+ **5d — Network/Playwright flake.** Introduce a 50% chance of Playwright capture failing (kill the browser mid-screenshot). Does the loop retry, give up cleanly with a useful error, or hang forever?
92
+
93
+ **5e — Token bomb.** Run on a 5-page React app where each page has 200+ DOM nodes. Measure actual token consumption. If it exceeds the user-set cap (default 100K), did it actually halt at the cap or blow past it?
94
+
95
+ **5f — Concurrent-session safety.** Run two `/qualia-polish-loop` sessions simultaneously on the same project repo. Do they corrupt each other's git state? Touch each other's commits? Race on file edits?
96
+
97
+ **5g — `prefers-reduced-motion` honor.** Capture screenshots with reduced motion forced. Does the vision agent correctly NOT penalize "no motion" when reduced motion is on?
98
+
99
+ **5h — Slop-detect bypass.** Can the loop's auto-fixer commit code that slop-detect would normally block? Try: inject a malicious fix-builder that uses an em-dash in a CSS comment. The pre-commit slop-detect should block. Verify it actually does.
100
+
101
+ For each adversarial probe, document: setup, expected behavior, observed behavior, severity (CRITICAL / HIGH / MEDIUM / LOW per `rules/grounding.md` Severity Rubric).
102
+
103
+ ### Gate 6 — Token cost reality check
104
+
105
+ The feature exists in a token-budget-conscious framework (Matt Pocock instruction-budget discipline). Measure:
106
+ 1. Tokens per iteration (capture, eval, fix-spawn, redeploy) — average across the 3 pilot scenarios
107
+ 2. Skill SKILL.md + REFERENCE.md combined token weight when invoked
108
+ 3. Spawn-template token cost per fix-builder
109
+ 4. Vision-eval token cost per dimension
110
+
111
+ Compare against the spec's "≤ 100K per loop" claim. If actual is > 100K, FAIL with the budget overrun number.
112
+
113
+ ### Gate 7 — Security review
114
+
115
+ The loop has Bash + Edit + Write + Playwright MCP capabilities. That's a large attack surface.
116
+ 1. Can a malicious `.planning/CONTEXT.md` (the v5.0 trust-boundary surface) inject instructions that the visual-evaluator agent follows? It MUST refuse per the trust-boundary block in `agents/builder.md` etc.
117
+ 2. Does the loop ever pass user-controlled content (file paths, URLs from a brief) into shell commands without quoting? Look for shell injection patterns.
118
+ 3. Does the Playwright MCP get told to navigate to the user-provided URL without validation? A malicious URL pointing at `file:///etc/passwd` or an internal IP — does it block, or screenshot it?
119
+ 4. Are screenshots written to `/tmp` with predictable paths that another local user could read or replace? Mode 0600 + randomized filenames mandatory for sensitive screenshots.
120
+
121
+ ### Gate 8 — Documentation accuracy
122
+
123
+ Open `CHANGELOG.md` v5.1.0 entry. Verify every claim:
124
+ 1. Each "Added" item exists in the repo
125
+ 2. Each "Changed" item is actually changed (compare with git log)
126
+ 3. The "Deferred" section is honest about what doesn't work
127
+
128
+ Open `docs/playwright-loop-design-notes.md`. Verify:
129
+ 1. Differentiation from `/qualia-polish` is concrete (not "this one is more powerful")
130
+ 2. "When to use which" is actionable (e.g. "use polish-loop when you want autonomous iteration; use polish when you want a single-pass critique")
131
+ 3. v5.2 deferrals are listed honestly
132
+
133
+ ## Output format
134
+
135
+ Produce a single Markdown file at `/home/qualia-new/qualia-framework/docs/playwright-loop-review-{YYYY-MM-DD}.md`:
136
+
137
+ ```markdown
138
+ # Playwright Visual-Polish Loop — Adversarial Review {YYYY-MM-DD}
139
+
140
+ ## TL;DR
141
+
142
+ **Recommendation:** SHIP / SHIP-WITH-CAVEATS / NO-SHIP
143
+
144
+ **Headline finding:** {one sentence}
145
+
146
+ ## Gate-by-gate verdict
147
+
148
+ | Gate | Status | Evidence |
149
+ |---|---|---|
150
+ | 1 — Builder claim integrity | PASS / FAIL | {file:line citations} |
151
+ | 2 — Framework regression | PASS / FAIL | {test counts, slop results} |
152
+ | 3 — Skill structural validity | PASS / FAIL | {file:line} |
153
+ | 4 — Pilot results audit | PASS / FAIL | {scenario-by-scenario} |
154
+ | 5 — Adversarial probes | {N PASS, M FAIL} | {one row per probe} |
155
+ | 6 — Token cost reality | PASS / FAIL | {actual vs claimed} |
156
+ | 7 — Security review | PASS / FAIL | {findings + severity} |
157
+ | 8 — Doc accuracy | PASS / FAIL | {discrepancies} |
158
+
159
+ ## Critical findings (CRITICAL severity, must-fix before ship)
160
+
161
+ (list)
162
+
163
+ ## High findings (HIGH severity, fix in v5.1.1 patch)
164
+
165
+ (list)
166
+
167
+ ## Medium findings (MEDIUM severity, v5.2 backlog)
168
+
169
+ (list)
170
+
171
+ ## What works well (give credit honestly)
172
+
173
+ (list — be honest. If it genuinely works, say so.)
174
+
175
+ ## Recommended next steps
176
+
177
+ 1. {concrete action 1}
178
+ 2. ...
179
+ ```
180
+
181
+ Apply the Severity Rubric strictly:
182
+ - CRITICAL — security breach possible, data loss, framework regression that breaks an existing skill, kill-switch doesn't actually kill
183
+ - HIGH — feature broken for >50% of users, no error handling on user path, wiring missing
184
+ - MEDIUM — feature works but missing states, contract drift between docs and behavior
185
+ - LOW — style, naming, doc tone
186
+
187
+ ## Things you MUST do
188
+
189
+ 1. Run every test command. Don't take the builder's word for it.
190
+ 2. Cite file:line for every claim. No hedging.
191
+ 3. Try the adversarial probes (Gate 5) — these are the most valuable findings.
192
+ 4. Spot-check 2-3 commits from the pilot scenarios to verify the fixes weren't theatrical.
193
+ 5. If you find CRITICAL findings, halt the review and report immediately — don't keep going looking for more before surfacing the first one.
194
+
195
+ ## Things you MUST NOT do
196
+
197
+ 1. Do not "fix" things. You are reviewing, not building. If you find a bug, document it for the builder/owner to fix.
198
+ 2. Do not modify any files except your review markdown.
199
+ 3. Do not run the loop on production projects (axidex, dawadose, sakani, qualia-solutions main site, etc.). Use synthetic test pages or a dedicated `/tmp/qpl-test` repo.
200
+ 4. Do not skip Gate 5 (adversarial probes). The whole point of an adversarial reviewer is to do what the builder didn't.
201
+ 5. Do not soften severity ratings to be nice. CRITICAL means CRITICAL.
202
+
203
+ ## Tool budget
204
+
205
+ Maximum 50 Read/Grep/Bash invocations. If you exhaust the budget, write what you found and mark unchecked gates as `INSUFFICIENT EVIDENCE` — do not fabricate.
206
+
207
+ ## Calibration: what good vs bad looks like
208
+
209
+ **Good review** finds 1-3 CRITICAL bugs the builder missed, validates that the basic flow works on synthetic scenarios, recommends SHIP-WITH-CAVEATS where the loop is reliable but specific scenarios fail, and is honest about what wasn't tested.
210
+
211
+ **Bad review** says "ship it" without trying Gate 5, OR lists 50 LOW findings while missing the CRITICAL ones, OR tries to fix things instead of reporting them.
212
+
213
+ The framework owner (Fawzi) prefers honest brutal review over polite signoff. If the feature isn't ready, say so. v5.1 can wait a week. A flaky autonomous loop in production would be worse than no loop at all.