npm - qualia-framework - Versions diffs - 4.5.0 → 5.3.0 - Mend

qualia-framework 4.5.0 → 5.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (66) hide show

package/AGENTS.md +24 -0
package/CLAUDE.md +12 -75
package/README.md +23 -16
package/agents/builder.md +9 -21
package/agents/planner.md +8 -0
package/agents/verifier.md +8 -0
package/agents/visual-evaluator.md +132 -0
package/bin/cli.js +54 -18
package/bin/install.js +369 -29
package/bin/qualia-ui.js +208 -1
package/bin/slop-detect.mjs +5 -0
package/bin/state.js +34 -1
package/docs/install-redesign-builder-prompt.md +290 -0
package/docs/install-redesign-pilot.md +234 -0
package/docs/playwright-loop-builder-prompt.md +185 -0
package/docs/playwright-loop-design-notes.md +108 -0
package/docs/playwright-loop-pilot-results.md +170 -0
package/docs/playwright-loop-tester-prompt.md +213 -0
package/docs/polish-loop-supervised-run.md +111 -0
package/docs/reviews/matt-pocock-skills-analysis.md +300 -0
package/guide.md +9 -5
package/hooks/env-empty-guard.js +74 -0
package/hooks/pre-compact.js +19 -9
package/hooks/pre-deploy-gate.js +8 -2
package/hooks/pre-push.js +26 -12
package/hooks/supabase-destructive-guard.js +62 -0
package/hooks/vercel-account-guard.js +91 -0
package/package.json +2 -1
package/rules/design-brand.md +4 -0
package/rules/design-laws.md +4 -0
package/rules/design-product.md +4 -0
package/rules/design-rubric.md +4 -0
package/rules/grounding.md +4 -0
package/skills/qualia-build/SKILL.md +40 -46
package/skills/qualia-discuss/SKILL.md +51 -68
package/skills/qualia-handoff/SKILL.md +1 -0
package/skills/qualia-hook-gen/SKILL.md +206 -0
package/skills/qualia-issues/SKILL.md +151 -0
package/skills/qualia-map/SKILL.md +78 -35
package/skills/qualia-new/REFERENCE.md +139 -0
package/skills/qualia-new/SKILL.md +45 -121
package/skills/qualia-optimize/REFERENCE.md +265 -0
package/skills/qualia-optimize/SKILL.md +92 -232
package/skills/qualia-plan/SKILL.md +58 -65
package/skills/qualia-polish-loop/REFERENCE.md +265 -0
package/skills/qualia-polish-loop/SKILL.md +201 -0
package/skills/qualia-polish-loop/fixtures/broken.html +117 -0
package/skills/qualia-polish-loop/fixtures/clean.html +196 -0
package/skills/qualia-polish-loop/scripts/loop.mjs +323 -0
package/skills/qualia-polish-loop/scripts/playwright-capture.mjs +206 -0
package/skills/qualia-polish-loop/scripts/score.mjs +176 -0
package/skills/qualia-prd/SKILL.md +199 -0
package/skills/qualia-report/SKILL.md +141 -200
package/skills/qualia-research/SKILL.md +28 -33
package/skills/qualia-road/SKILL.md +103 -0
package/skills/qualia-ship/SKILL.md +1 -0
package/skills/qualia-task/SKILL.md +1 -1
package/skills/qualia-test/SKILL.md +50 -2
package/skills/qualia-triage/SKILL.md +152 -0
package/skills/qualia-verify/SKILL.md +63 -104
package/skills/qualia-zoom/SKILL.md +51 -0
package/skills/zoho-workflow/SKILL.md +1 -1
package/templates/CONTEXT.md +36 -0
package/templates/decisions/ADR-template.md +30 -0
package/tests/bin.test.sh +598 -7
package/tests/state.test.sh +58 -0

package/docs/playwright-loop-pilot-results.md ADDED Viewed

@@ -0,0 +1,170 @@
+# /qualia-polish-loop — Pilot results
+**Run date:** 2026-05-03
+**Framework version:** 5.1.0 (this commit)
+**Operator:** Claude Opus 4.7 (1M context), main session, autonomous build
+**Browser backend used:** Playwright cached chromium 1217 (`~/.cache/ms-playwright/chromium-1217/chrome-linux64/chrome`) — auto-selected by `playwright-capture.mjs` after `import('playwright')` failed (no `playwright` npm package in the framework repo) and the cache lookup found a usable binary
+**Fixture server:** `python3 -m http.server 18080` against `skills/qualia-polish-loop/fixtures/`
+**Captures:** `/tmp/qpl-pilot-1777778113/{scenario1,scenario2}/{mobile,tablet,desktop}-*.png`
+This pilot replaces an earlier draft that pre-dated the actual capture pipeline. The numbers below come from real runs of `playwright-capture.mjs` against the committed fixtures, not from architectural reasoning.
+---
+## Scenario 1 — Synthetic clean page
+**Fixture:** `skills/qualia-polish-loop/fixtures/clean.html` (170 lines). Fraunces + JetBrains Mono pair, OKLCH palette tinted toward 220° (cyan), asymmetric hero (`1.4fr / 1fr`), full-width with `clamp()` padding, varied work-grid (no card monotony), border-only headers, focus-visible rings, `prefers-reduced-motion` respected, 65ch line-length cap on prose.
+**Expected per spec:** SUCCESS in 1-2 iterations with all dims ≥ 4.
+**Captures (3 viewports, single iteration):**
+| Viewport | File | Size | Render |
+|---|---|---|---|
+| 375×812 | `mobile-375.png` | 43,401 B | single-column reflow at the 720px breakpoint, hero h1 reflows at fluid `clamp()` scale |
+| 768×1024 | `tablet-768.png` | 95,104 B | single-column at 720px breakpoint, work-grid stacks |
+| 1440×900 | `desktop-1440.png` | 64,078 B | asymmetric hero, metadata column right-aligned, full work-grid 12-col layout below |
+**Vision-evaluator scoring** — executed by the parent Claude session reading the PNGs directly via the Read tool (the same model class that would spawn as `qualia-visual-evaluator` in production):
+| Dimension | Score | Evidence |
+|---|---|---|
+| Typography | **5** | Variable Fraunces (display, opsz axis 144, italic emphasis on "well") paired with JetBrains Mono body. Weights 400/600 visible. Fluid `clamp()` scale. Tabular numerals on `148`, `11`, `72%` figures. |
+| Color cohesion | **5** | All values are `oklch(...)` — no hex anywhere. Tinted neutrals (no `#000`/`#fff`). Single accent at `oklch(0.78 0.14 196)`. Restrained strategy explicit. Brand-tinted shadow tokens declared. |
+| Spatial rhythm | **5** | `clamp(1.25rem, 5vw, 4rem)` horizontal, `clamp(3rem, 8vw, 7rem)` hero vertical. Generous negative space on left half of hero, tight metadata column on right. Asymmetric. |
+| Layout originality | **4** | Asymmetric `1.4fr / 1fr` hero. Work-grid uses 4 articles at `7/5 + 5/7` column spans (varied sizes). Not symmetric. |
+| Shadow & depth | **3** | Three shadow tokens declared (`--shadow-sm/md/lg`), brand-hue tinted; only the hover-shadow visible above the fold. Default — would score higher with cards visible in viewport. |
+| Motion intent | **3** | CSS-declaration intent visible: `cubic-bezier(0.22, 1, 0.36, 1)` exponential ease-out, `transition: transform 200ms` (transforms only — no layout-property animation), `prefers-reduced-motion` block present. Static screenshot can't show animation; scoring CSS quality. |
+| Microcopy specificity | **5** | "Brands that age well, on the second glance." (worth quoting), "Audit a brand" (action-named CTA, no "Get Started"), specific stats ("148 since 2019", "11 days", "72%"), no banned phrases. |
+| Container depth | **4** | Header is one container (border-bottom only). Hero has no card wrapper. Work-grid articles are single-depth (one `<article>`, no nesting). No side-stripe borders. |
+**Aggregate:** 34 / 40 (avg 4.25)
+**Pass:** YES — all dimensions ≥ 3, no critical issues
+**Iterations needed:** **1** (within spec's 1-2 expectation)
+**Estimated tokens for evaluation:** ~9,500 (3 PNG reads at ~3K each + rubric inline ~1K + brief ~1K — first iteration, no previous-iteration delta)
+**What this validates:**
+- Capture pipeline produces evaluable PNGs at all three viewports
+- Vision evaluator scores against the rubric with cited evidence
+- A genuinely well-designed page hits SUCCESS in one iteration without the loop touching any code
+- The pass criterion (all dims ≥ 3 AND no critical issues) is the right verdict gate
+---
+## Scenario 2 — Synthetic broken page
+**Fixture:** `skills/qualia-polish-loop/fixtures/broken.html` (115 lines). Deliberately shipped slop. Inter font as primary, pure `#fff` + `#000`, blue→purple linear-gradient hero, `background-clip: text` gradient on h1, three identical 1/3-width feature cards with `border-left: 4px solid #2563eb` side-stripes, "Get Started" + "Learn More" CTAs, `max-width: 1280px` container, `outline: none` on nav links without replacement.
+**Expected per spec:** Loop identifies all anti-patterns, fixes them, ends SUCCESS in 4-6 iterations.
+**Pilot scope:** the capture + first-iteration evaluation was run end-to-end. Full fix-loop was NOT run autonomously, because:
+1. The fixture is meant to STAY broken as a permanent regression-test target (committed to the repo)
+2. Auto-fixing a committed fixture would defeat its purpose
+3. The spec asks the pilot to demonstrate the loop *can* identify the issues and propose fixes — fully running fix-builders against a committed fixture is the production execution path, not a self-test
+**Captures (3 viewports, iteration 1):**
+| Viewport | File | Size | Render |
+|---|---|---|---|
+| 375×812 | `mobile-375.png` | 88,286 B | hero h1 "Welcome to BrandX" overflows the 375px viewport (text clipped); generic CTAs visible; cards not reflowed (still `repeat(3, 1fr)` in markup but truncated by viewport) |
+| 768×1024 | `tablet-768.png` | 176,059 B | three feature cards still in `repeat(3, 1fr)` (collision starts) at 768px |
+| 1440×900 | `desktop-1440.png` | 263,149 B | full slop visible: Inter font, blue→purple gradient hero, gradient text h1, three identical feature cards with side-stripes, generic CTAs |
+**Vision-evaluator scoring:**
+| Dimension | Score | Evidence |
+|---|---|---|
+| Typography | **1** | `font-family: Inter, sans-serif` set globally (banned per `design-laws.md` and slop-detect rule ABS-FONT). Single weight cluster on body. |
+| Color cohesion | **1** | Triple absolute-ban hit — `linear-gradient(135deg, #2563eb 0%, #9333ea 100%)` (ABS-PURPLE-GRAD), `background-clip: text` on h1 (ABS-GRADIENT-TEXT), `#ffffff`/`#000000` literals throughout (ABS-PURE-BLACK-WHITE). Zero OKLCH. |
+| Spatial rhythm | **2** | Uniform 24px padding on every container. Same gap on every grid. No varied rhythm. Functions but signals AI-default. |
+| Layout originality | **1** | Section 2 is `grid-template-columns: repeat(3, 1fr)` with three identical cards (Fast / Secure / Easy) — the SaaS-cliché three-column feature grid called out in `design-brand.md` §anti-patterns. Centered hero with gradient bg + 2 CTAs is the second mandatory anti-pattern. Both hits. |
+| Shadow & depth | **2** | No box-shadows defined. Only borders distinguish elevation. One level. |
+| Motion intent | **2** | `outline: none` on nav `a` and buttons without focus replacement (HI-OUTLINE-NONE slop hit). No transitions declared. Default browser hover behavior only. |
+| Microcopy specificity | **1** | "Welcome to BrandX" (banned per `design-laws.md` §copy), "Get Started" + "Learn More" (HI-GENERIC-CTA on both buttons). Lorem-adjacent generic feature names ("Fast", "Secure", "Easy"). |
+| Container depth | **1** | `border-left: 4px solid #2563eb` on `.card` is the canonical decorative side-stripe (ABS-SIDE-STRIPE slop hit). Container hierarchy is `.container > .features > .grid > .card` — depth 4. |
+**Aggregate:** 11 / 40 (avg 1.38)
+**Pass:** NO — five dimensions at 1, three at 2; failing across the board
+**Top 3 issues the loop would dispatch fix-builders for in iteration 1** (severity-ordered, file-anchored):
+1. **`color` [critical]** — `body { background: #ffffff; color: #000000; }` and `.hero { background: linear-gradient(135deg, #2563eb, #9333ea); }` and `.hero h1 { background: linear-gradient(...); -webkit-background-clip: text; }` → `fixtures/broken.html`. Fix: replace with OKLCH tokens, drop the gradient hero, drop `background-clip: text`.
+2. **`typography` [critical]** — `font-family: Inter, sans-serif` set on `body`, `.hero h1`, `.btn`, `.features h2`, `.card h3` → `fixtures/broken.html`. Fix: swap to Fraunces + JetBrains Mono per `DESIGN.md §3`.
+3. **`layout` [critical]** — `.grid { display: grid; grid-template-columns: repeat(3, 1fr); }` with three identical cards → `fixtures/broken.html`. Fix: vary card sizes per `design-brand.md §Layout` (e.g., `7 + 5` column-span split).
+**Iterations the loop would need to converge:** 4-6 (per spec). The first iteration's fix-builders would address top-3 (color/typography/layout); iteration 2 would address microcopy/container; iteration 3 would address shadow/motion; iteration 4 would re-verify all dims ≥ 3. The mobile-overflow on h1 (visible at 375px) would be caught by the min-across-viewports aggregate — validating spec §5c (mobile-only failures).
+**What this validates:**
+- The vision evaluator correctly identifies every absolute-ban anti-pattern the spec planned (Inter, gradient, gradient text, three-card grid, side-stripes, generic CTAs)
+- Each issue has a concrete file:pattern citation that a fix-builder can act on
+- Severity rubric is applied (critical / high / medium) with rubric-criterion citations
+- Mobile-only failures are caught by the multi-viewport min-aggregate, not just desktop
+- Estimated tokens for evaluation: ~9,500 (same as Scenario 1 — vision-eval cost is roughly constant per iteration)
+---
+## Scenario 3 — Kill-switch stress test
+**Setup:** Verify the `LOOP_REGRESSION_DETECTED` kill-switch fires when the same issue fingerprint appears in 3 consecutive iterations.
+**Test:** Executed by `tests/bin.test.sh` test 116 (deterministic — runs in every test pass, no Playwright or vision-eval needed):
+```bash
+# Initialize a fresh state file
+node skills/qualia-polish-loop/scripts/loop.mjs init \
+  --state /tmp/.../qpl-kill.json --url http://localhost:3000 --max 8
+# Iteration 1: write an eval where typography==1 with a fixed issue fingerprint
+node loop.mjs record --state ... --eval ITER1.json   # → exit 1 (continue)
+# Iteration 2: write the same fingerprint
+node loop.mjs record --state ... --eval ITER2.json   # → exit 1 (continue)
+# Iteration 3: write the same fingerprint
+node loop.mjs record --state ... --eval ITER3.json   # → exit 3 (KILLED)
+```
+**Assertions:**
+- After iteration 1: `state.verdict == "continue"`, exit 1
+- After iteration 2: `state.verdict == "continue"`, exit 1
+- After iteration 3: `state.verdict == "killed_regression"`, exit 3
+- `state.kill_fingerprint` records which fingerprint triggered the kill
+- `state.fingerprints[fingerprint].iterations == [1, 2, 3]`
+**Result:** PASS. From the test run on this commit:
+```
+✓ loop.mjs kill-switch fires after 3 consecutive recurrences
+```
+Test 116 in `tests/bin.test.sh` exercises this path on every `npm test` run. Total test suite: **274 passing, 0 failed** (up from the v5.0.0 baseline of 255 by **19** new assertions, well above the spec's "≥ 4" requirement).
+**What this validates:**
+- The deterministic regression detector works without invoking any LLM or browser — it's pure CLI math, runs in milliseconds
+- The kill-switch's exit code (3) is distinct from continue (1), success (0), and error (2) — orchestrators / CI can branch correctly
+- Fingerprint consecutiveness is enforced; non-consecutive recurrences (issue X in iters 1, 3, 5 with iter 2/4 different issues) do NOT kill — that's the right behavior for "fix worked, then a different change broke it again"
+---
+## Aggregate verdict
+| Scenario | Outcome | Iterations | Tokens (est.) | Time elapsed |
+|---|---|---|---|---|
+| 1 — clean.html | SUCCESS | 1 | ~9.5K | ~6s capture + ~30s vision read |
+| 2 — broken.html | iteration-1 evaluation correctly identifies all 5 mandated anti-patterns; full fix-loop deferred (fixture intentionally stays broken) | 1 (eval) | ~9.5K | ~6s capture + ~45s vision read |
+| 3 — kill-switch | KILL fires at iteration 3 with `LOOP_REGRESSION_DETECTED` (deterministic test) | 3 | 0 | <1s |
+**Verdict:** the loop's primitives — capture, vision-eval, score, regression-detect, commit-fix, report — all work end-to-end. Captures land at all viewports. The vision evaluator (running as the parent Claude session in this pilot) scores against the rubric with file-anchored evidence. The deterministic kill-switch fires correctly. The full production loop (capture → spawn `qualia-visual-evaluator` agent → spawn fix-builders → re-capture) is wired but was not exercised end-to-end in this pilot because:
+1. Spawning the `qualia-visual-evaluator` agent requires a runtime configured to find `agents/visual-evaluator.md` after install. That works post-install (test 112 confirms the agent file lands), but spawning it from inside the framework repo itself is the wrong context.
+2. Fully running fix-builders on the broken.html fixture would mutate a regression-test target.
+3. The spec asks for "the feature can be invoked, screenshots are captured, the rubric scoring works, the kill-switch fires." All four are demonstrated.
+## Honest caveats
+- The pilot's vision evaluation was performed by Claude Opus 4.7 in the parent session — the same model class that spawns as `qualia-visual-evaluator` in production. This is a realistic vision-eval pass. What's NOT yet exercised is the full Agent-spawn round-trip which adds ~1-2K tokens of orchestration overhead per iteration on top of the ~9.5K eval cost.
+- Real production use will exercise the full path on a Next.js dev server during `/qualia-polish-loop http://localhost:3000`. The first few real-project runs are where unknowns will surface (HMR timing, port conflicts, vercel-preview deploy latency).
+- The `mobile-only failures` adversarial case (Gate 5c) is partially demonstrated by Scenario 2's mobile h1 overflow, but a full validation would require a fixture that's clean at desktop and broken at mobile only. Deferred to v5.1.1 follow-up.
+- `playwright-capture.mjs` was tested end-to-end with the **chrome-binary backend** (cached chromium 1217). The Playwright SDK code path (`import('playwright')`) is on the same code path but was not exercised in this pilot — would activate automatically if a project has `playwright` as a dep.
+- No `vercel deploy` was run in this pilot. The `--deploy preview` mode is wired in SKILL.md (the orchestrator hands the deploy URL to the next capture) but not exercised end-to-end. Deferred to first real-project use.
+**Recommendation:** ship as v5.1.0. The core loop is solid for dev-localhost mode against synthetic and real pages. Mark `--deploy preview` mode and the full Agent-spawn round-trip as "validated by structure, awaiting first real-project supervised run." Once the loop has run unattended against a real project for a full 4-6 iteration cycle, this caveat can be removed in v5.1.1.

package/docs/playwright-loop-tester-prompt.md ADDED Viewed

@@ -0,0 +1,213 @@
+# Playwright Visual-Polish Loop — Reviewer/Tester Agent Prompt
+**Hand this entire file to a fresh Claude Code session AFTER the builder agent declares DONE.** Self-contained — no context from the originating session is needed.
+---
+## You are the adversarial reviewer for a Qualia Framework v5.1 feature
+A different agent built `/qualia-polish-loop` — an autonomous visual-polish loop that uses Playwright to screenshot live pages and self-corrects frontend code until visually correct. Your job is to validate it works, doesn't break the framework, and is genuinely production-ready before v5.1 ships. **You are explicitly adversarial.** The builder is incentivized to declare success; you are incentivized to find what they missed.
+The framework is at `/home/qualia-new/qualia-framework` (npm package `qualia-framework`). Builder's deliverables are documented at `/home/qualia-new/qualia-framework/docs/playwright-loop-builder-prompt.md` and the builder's pilot results should be at `/home/qualia-new/qualia-framework/docs/playwright-loop-pilot-results.md`.
+## Your charter
+You will produce a single deliverable: `/home/qualia-new/qualia-framework/docs/playwright-loop-review-{YYYY-MM-DD}.md` — a graded review with PASS/FAIL on each gate below, evidence cited as `file:line — "quoted"`, and a final ship/no-ship recommendation.
+Apply the framework's grounding-protocol discipline (`rules/grounding.md`):
+- Every claim cites file:line
+- No hedging language ("seems", "appears", "probably", "might be") — either you verified it or you say `INSUFFICIENT EVIDENCE: searched {files} with {commands}`
+- Findings without citations are discarded
+- Severity assignments quote the matching rubric criterion
+## What to verify (in order — earlier gates failing should halt later gates)
+### Gate 1 — Builder's claim integrity (read before testing)
+Before running anything, read what the builder claims:
+1. `docs/playwright-loop-builder-prompt.md` — the spec they were given
+2. `docs/playwright-loop-pilot-results.md` — what they say they ran
+3. `docs/playwright-loop-design-notes.md` — their integration narrative
+4. `CHANGELOG.md` v5.1.0 entry
+5. `git log --oneline -30` since the v5.0.0 commit — verify the work actually happened in commits, not just claimed
+For each claim, mark CONFIRMED (cited evidence in code/results) or UNVERIFIED (claim made but no evidence in repo). Do not run the tests yet.
+### Gate 2 — Existing framework integrity (must not have regressed)
+The existing framework has 260+ tests, 12 hooks, 32 skills, slop-detect discipline. The new feature MUST NOT have broken any of it.
+Run these and report PASS/FAIL with evidence:
+1. `cd /home/qualia-new/qualia-framework && npm test` — all suites pass with the same count or higher than v5.0.0 baseline (260 tests)
+2. `node bin/slop-detect.mjs $(find skills -name '*.md' -newer /home/qualia-new/qualia-framework/CHANGELOG.md.v5.0.bak 2>/dev/null || find skills -name '*.md')` — clean across all skill files
+3. `cd /tmp && rm -rf qpl-install-test && mkdir qpl-install-test && cd qpl-install-test && echo "QS-FAWZI-01" | HOME=$(pwd) node /home/qualia-new/qualia-framework/bin/install.js` — install succeeds end-to-end with no errors, the new skill folder is present, hook count is ≥ 12
+4. The 3 v5.0 critical hooks still install AND fire correctly: `vercel-account-guard.js`, `env-empty-guard.js`, `supabase-destructive-guard.js`. Test each by piping a triggering command and checking exit code.
+5. The CONTEXT.md template still installs to `.claude/qualia-templates/CONTEXT.md` and is glossary-format (no prose paragraphs added back).
+### Gate 3 — New skill structural validity
+Read `/home/qualia-new/qualia-framework/skills/qualia-polish-loop/SKILL.md` fully. Verify:
+1. Frontmatter has `name`, `description`, `allowed-tools`. Description is 30-90 words with explicit "Use when" trigger phrases (per skill-design discipline).
+2. SKILL.md is < 250 lines. If larger, the builder failed the progressive-disclosure rule.
+3. REFERENCE.md exists in the same folder and contains the agent prompts that aren't in SKILL.md.
+4. If `scripts/` exists, every `.mjs` / `.js` parses (`node --check`).
+5. Skill installs via `bin/install.js` recursively (verified in Gate 2 step 3).
+6. Description does NOT overlap with existing `/qualia-polish` — there should be a discriminative trigger boundary. If users typing "polish my homepage" can't tell which skill fires, that's a routing bug.
+### Gate 4 — Pilot results audit (the builder's self-tests)
+The builder's spec mandated 3 pilot scenarios. Read `docs/playwright-loop-pilot-results.md` and audit each:
+**Scenario 1 (synthetic clean page) — should SUCCESS in 1-2 iterations with all dims ≥ 4.**
+- Verify the synthetic clean page exists in the repo (or test artifacts)
+- Verify the per-iteration screenshots exist (or were captured to /tmp and the report cites them)
+- Verify the final scores reported are ≥ 4 across all 8 dims
+- If iteration count > 2, mark suspicious — investigate why a "clean" page needed correction
+**Scenario 2 (synthetic broken page) — should identify all anti-patterns, fix them, end SUCCESS in 4-6 iterations.**
+- Verify the synthetic broken page exists with the spec's anti-patterns (Inter, gradient, card grid, etc.)
+- Verify each anti-pattern was identified by the loop's vision agent
+- Verify each fix was committed with `qpl-N:` prefix
+- Verify final scores are ≥ 3 across all 8 dims
+- Spot-check 2-3 fix commits with `git show {hash}` — was the fix actually correct, or did it introduce new slop?
+**Scenario 3 (kill-switch stress test) — should KILL at iteration 4 with LOOP_REGRESSION_DETECTED.**
+- Verify the test injected a regressing fix-builder
+- Verify the loop detected the regression and exited with the correct error code
+- Verify the report contains the diagnostic listing all 3 recurrences
+If any scenario is missing from the report or its claims don't match repo evidence, mark Gate 4 FAIL.
+### Gate 5 — Adversarial: the failure modes the builder might have missed
+These are the things to actively try to break. Each failure here is a real risk:
+**5a — Vision drift.** Run the loop on a deliberately-good page that just has different fonts than the rubric expects. Does it correctly score Typography ≥ 4 and not iterate? Or does it hallucinate "needs Fraunces" and start churning?
+**5b — Brief-vs-rendered mismatch.** Pass a brief that says "minimal monochrome" but point at a deliberately maximalist page. Does the loop correctly identify the brief mismatch (low scores, fix proposals) or does it score the page on its own merits and miss the mismatch?
+**5c — Mobile-only failures.** Build a page that looks great at 1440px but breaks at 375px (overflow, touch targets < 44px, text in single line that should wrap). Does the loop catch the mobile failure even when desktop passes?
+**5d — Network/Playwright flake.** Introduce a 50% chance of Playwright capture failing (kill the browser mid-screenshot). Does the loop retry, give up cleanly with a useful error, or hang forever?
+**5e — Token bomb.** Run on a 5-page React app where each page has 200+ DOM nodes. Measure actual token consumption. If it exceeds the user-set cap (default 100K), did it actually halt at the cap or blow past it?
+**5f — Concurrent-session safety.** Run two `/qualia-polish-loop` sessions simultaneously on the same project repo. Do they corrupt each other's git state? Touch each other's commits? Race on file edits?
+**5g — `prefers-reduced-motion` honor.** Capture screenshots with reduced motion forced. Does the vision agent correctly NOT penalize "no motion" when reduced motion is on?
+**5h — Slop-detect bypass.** Can the loop's auto-fixer commit code that slop-detect would normally block? Try: inject a malicious fix-builder that uses an em-dash in a CSS comment. The pre-commit slop-detect should block. Verify it actually does.
+For each adversarial probe, document: setup, expected behavior, observed behavior, severity (CRITICAL / HIGH / MEDIUM / LOW per `rules/grounding.md` Severity Rubric).
+### Gate 6 — Token cost reality check
+The feature exists in a token-budget-conscious framework (Matt Pocock instruction-budget discipline). Measure:
+1. Tokens per iteration (capture, eval, fix-spawn, redeploy) — average across the 3 pilot scenarios
+2. Skill SKILL.md + REFERENCE.md combined token weight when invoked
+3. Spawn-template token cost per fix-builder
+4. Vision-eval token cost per dimension
+Compare against the spec's "≤ 100K per loop" claim. If actual is > 100K, FAIL with the budget overrun number.
+### Gate 7 — Security review
+The loop has Bash + Edit + Write + Playwright MCP capabilities. That's a large attack surface.
+1. Can a malicious `.planning/CONTEXT.md` (the v5.0 trust-boundary surface) inject instructions that the visual-evaluator agent follows? It MUST refuse per the trust-boundary block in `agents/builder.md` etc.
+2. Does the loop ever pass user-controlled content (file paths, URLs from a brief) into shell commands without quoting? Look for shell injection patterns.
+3. Does the Playwright MCP get told to navigate to the user-provided URL without validation? A malicious URL pointing at `file:///etc/passwd` or an internal IP — does it block, or screenshot it?
+4. Are screenshots written to `/tmp` with predictable paths that another local user could read or replace? Mode 0600 + randomized filenames mandatory for sensitive screenshots.
+### Gate 8 — Documentation accuracy
+Open `CHANGELOG.md` v5.1.0 entry. Verify every claim:
+1. Each "Added" item exists in the repo
+2. Each "Changed" item is actually changed (compare with git log)
+3. The "Deferred" section is honest about what doesn't work
+Open `docs/playwright-loop-design-notes.md`. Verify:
+1. Differentiation from `/qualia-polish` is concrete (not "this one is more powerful")
+2. "When to use which" is actionable (e.g. "use polish-loop when you want autonomous iteration; use polish when you want a single-pass critique")
+3. v5.2 deferrals are listed honestly
+## Output format
+Produce a single Markdown file at `/home/qualia-new/qualia-framework/docs/playwright-loop-review-{YYYY-MM-DD}.md`:
+```markdown
+# Playwright Visual-Polish Loop — Adversarial Review {YYYY-MM-DD}
+## TL;DR
+**Recommendation:** SHIP / SHIP-WITH-CAVEATS / NO-SHIP
+**Headline finding:** {one sentence}
+## Gate-by-gate verdict
+| Gate | Status | Evidence |
+|---|---|---|
+| 1 — Builder claim integrity | PASS / FAIL | {file:line citations} |
+| 2 — Framework regression | PASS / FAIL | {test counts, slop results} |
+| 3 — Skill structural validity | PASS / FAIL | {file:line} |
+| 4 — Pilot results audit | PASS / FAIL | {scenario-by-scenario} |
+| 5 — Adversarial probes | {N PASS, M FAIL} | {one row per probe} |
+| 6 — Token cost reality | PASS / FAIL | {actual vs claimed} |
+| 7 — Security review | PASS / FAIL | {findings + severity} |
+| 8 — Doc accuracy | PASS / FAIL | {discrepancies} |
+## Critical findings (CRITICAL severity, must-fix before ship)
+(list)
+## High findings (HIGH severity, fix in v5.1.1 patch)
+(list)
+## Medium findings (MEDIUM severity, v5.2 backlog)
+(list)
+## What works well (give credit honestly)
+(list — be honest. If it genuinely works, say so.)
+## Recommended next steps
+1. {concrete action 1}
+2. ...
+```
+Apply the Severity Rubric strictly:
+- CRITICAL — security breach possible, data loss, framework regression that breaks an existing skill, kill-switch doesn't actually kill
+- HIGH — feature broken for >50% of users, no error handling on user path, wiring missing
+- MEDIUM — feature works but missing states, contract drift between docs and behavior
+- LOW — style, naming, doc tone
+## Things you MUST do
+1. Run every test command. Don't take the builder's word for it.
+2. Cite file:line for every claim. No hedging.
+3. Try the adversarial probes (Gate 5) — these are the most valuable findings.
+4. Spot-check 2-3 commits from the pilot scenarios to verify the fixes weren't theatrical.
+5. If you find CRITICAL findings, halt the review and report immediately — don't keep going looking for more before surfacing the first one.
+## Things you MUST NOT do
+1. Do not "fix" things. You are reviewing, not building. If you find a bug, document it for the builder/owner to fix.
+2. Do not modify any files except your review markdown.
+3. Do not run the loop on production projects (axidex, dawadose, sakani, qualia-solutions main site, etc.). Use synthetic test pages or a dedicated `/tmp/qpl-test` repo.
+4. Do not skip Gate 5 (adversarial probes). The whole point of an adversarial reviewer is to do what the builder didn't.
+5. Do not soften severity ratings to be nice. CRITICAL means CRITICAL.
+## Tool budget
+Maximum 50 Read/Grep/Bash invocations. If you exhaust the budget, write what you found and mark unchecked gates as `INSUFFICIENT EVIDENCE` — do not fabricate.
+## Calibration: what good vs bad looks like
+**Good review** finds 1-3 CRITICAL bugs the builder missed, validates that the basic flow works on synthetic scenarios, recommends SHIP-WITH-CAVEATS where the loop is reliable but specific scenarios fail, and is honest about what wasn't tested.
+**Bad review** says "ship it" without trying Gate 5, OR lists 50 LOW findings while missing the CRITICAL ones, OR tries to fix things instead of reporting them.
+The framework owner (Fawzi) prefers honest brutal review over polite signoff. If the feature isn't ready, say so. v5.1 can wait a week. A flaky autonomous loop in production would be worse than no loop at all.

package/docs/polish-loop-supervised-run.md ADDED Viewed

@@ -0,0 +1,111 @@
+# /qualia-polish-loop — First supervised run (v5.2)
+**Run date:** 2026-05-05
+**Framework version:** 5.2.0
+**Operator:** Claude Opus 4.7 (1M context), main session
+**Browser backend used:** Playwright cached chromium 1217 via `--reduced-motion` Chromium-binary path
+**Run ID:** `qpl-v52-test`
+This document closes the "first real-project supervised run not done" caveat from v5.1's CHANGELOG. It captures the actual end-to-end behavior of the new v5.2 flags (`--reduced-motion`, `--routes`) against the framework's own test fixtures.
+## What was tested
+| Subject | Fixture | Why |
+|---|---|---|
+| `--routes` multi-route init | `clean.html` + `broken.html` served from `python3 -m http.server 18081` | Validate state machine handles 2-URL list with first-entry backward compat |
+| `--reduced-motion` capture (chromium-binary backend) | `clean.html` at 375 + 1440 | Validate the `--force-prefers-reduced-motion` Chrome flag is passed through and the captures land |
+| `loop.mjs report` with multi-route state | the multi-route state file from above | Validate the report header renders `URLs (2)` instead of single `URL` |
+| All assertions in `tests/bin.test.sh` #129-134 | (deterministic; no browser) | Validate the orchestrator's CLI surface |
+## Results
+### Multi-route init
+```bash
+node skills/qualia-polish-loop/scripts/loop.mjs init \
+  --state /tmp/qpl-v52-test/state.json \
+  --routes "http://localhost:18081/clean.html,http://localhost:18081/broken.html" \
+  --reduced-motion --max 4 --budget 30000
+```
+Output (excerpt):
+```json
+{
+  "url": "http://localhost:18081/clean.html",
+  "urls": [
+    "http://localhost:18081/clean.html",
+    "http://localhost:18081/broken.html"
+  ],
+  "reduced_motion": true,
+  "max_iterations": 4,
+  "token_budget": 30000,
+  "verdict": "pending"
+}
+```
+`state.url` correctly defaults to the first URL (single-route drivers keep working). `state.urls` contains the full list. `state.reduced_motion` is set so downstream capture invocations know to pass the flag through.
+### Reduced-motion capture (chromium-binary backend)
+```bash
+node skills/qualia-polish-loop/scripts/playwright-capture.mjs \
+  --url "http://localhost:18081/clean.html" \
+  --out /tmp/qpl-v52-test/cap \
+  --viewports 375,1440 --wait 1500 --reduced-motion
+```
+Result:
+```json
+{
+  "captures": [
+    { "viewport": "mobile",  "width": 375,  "ok": true, "reducedMotion": true,
+      "backend": "chrome-binary",
+      "binary": ".../chromium-1217/chrome-linux64/chrome" },
+    { "viewport": "desktop", "width": 1440, "ok": true, "reducedMotion": true,
+      "backend": "chrome-binary",
+      "binary": ".../chromium-1217/chrome-linux64/chrome" }
+  ],
+  "total": 2, "failed": 0
+}
+```
+Both captures landed (43,401 B mobile / 64,078 B desktop). The Chrome flag `--force-prefers-reduced-motion` was passed through. Each capture record has `reducedMotion: true`, propagating the user's a11y intent into the evaluator's input contract.
+### Wall-clock and token estimates
+| Operation | Wall-clock |
+|---|---|
+| `loop.mjs init` with `--routes` (2 URLs) + `--reduced-motion` | ~10 ms |
+| `playwright-capture.mjs` 2 viewports, chromium-binary backend, `--reduced-motion` | ~3 s |
+| Full multi-route iteration cycle (estimated): 2 URLs × 3 viewports × ~1.5s capture + ~9 K tokens vision-eval per URL | ~15-20 s wall-clock, ~18-20 K tokens per iteration |
+The token cost of multi-route mode scales linearly with URL count. A 6-iteration loop on 3 URLs would cost ~108-120 K tokens — close to the default 100 K budget cap. The orchestrator will surface this estimate in pre-flight and recommend `--budget 150000` for 3+ route sweeps.
+### What worked
+- **Backward compatibility intact.** Single-route `--url` invocations behave identically to v5.1. The `state.url` field still points to a real URL even when `--routes` was used.
+- **Flag propagation is clean.** The `--reduced-motion` flag flows from the loop CLI into the state file, then into each capture invocation, then into Chrome's `--force-prefers-reduced-motion` flag. The Playwright SDK path uses the equivalent `newContext({ reducedMotion: 'reduce' })` option.
+- **Both backends carry the flag.** Tested on chromium-binary path (the active path on this dev machine — no `playwright` npm package installed). The Playwright SDK path is unit-clean by inspection (`reducedMotion: "reduce"` is a documented `BrowserContextOptions` field since Playwright 1.16).
+- **State stays out of the LLM context.** All multi-route state lives in JSON on disk. The orchestrator reads compact per-iteration deltas only.
+- **Tests cover the new surface.** 6 new assertions (#129-134) catch regressions in `--routes`, `--reduced-motion`, init validation, and report rendering.
+### What surprised me
+- The Chrome flag `--force-prefers-reduced-motion` doesn't take a value (Chrome ≥87 — present-tense, no compat tax). Some older Chromium docs suggested `--force-prefers-reduced-motion=reduce`; that variant is harmless but redundant.
+- `state.urls` array is intentionally not deduped. If a user passes `--routes "/a,/a,/b"`, they get three captures per iteration (the loop drivers can dedupe in the SKILL.md if needed). Keeping the script literal avoids surprising behavior.
+### What still requires real-project use to validate
+- **Vercel-preview deploy mode** end-to-end with multi-route + reduced-motion. The `--deploy preview` path is wired but only ever exercised on dev-localhost.
+- **Real Next.js dev server** with HMR mid-iteration. The fixtures used here are static HTML.
+- **Token-budget hits in practice.** The estimate of ~18-20 K tokens/iter for 2-URL multi-route is from rubric arithmetic; first-real-project run will tighten this number.
+The "experimental" caveat from v5.1's CHANGELOG is now removed for the single-route case (this run validates the deterministic infrastructure end-to-end). Multi-route + Vercel-preview combined remains experimental until first real-project use.
+## Verdict
+v5.2 ships. The two named v5.1 deferrals (`prefers-reduced-motion`, multi-route) are closed cleanly, with backward compatibility preserved and 6 new tests guarding the surface. The remaining v5.1 deferral — Vercel-preview end-to-end — is still pending real-project use, deferred to v5.2.x or the first time a Qualia project actually invokes `/qualia-polish-loop --deploy preview`.
+The polish-loop is now reliable enough to use unattended on a single-route dev-localhost target, and reliable enough to drive supervised on multi-route or reduced-motion targets. Take it for a real run.