@massu/core 0.5.0 → 0.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (118)
  1. package/README.md +40 -0
  2. package/agents/massu-architecture-reviewer.md +104 -0
  3. package/agents/massu-blast-radius-analyzer.md +84 -0
  4. package/agents/massu-competitive-scorer.md +126 -0
  5. package/agents/massu-help-sync.md +73 -0
  6. package/agents/massu-migration-writer.md +94 -0
  7. package/agents/massu-output-scorer.md +87 -0
  8. package/agents/massu-pattern-reviewer.md +84 -0
  9. package/agents/massu-plan-auditor.md +170 -0
  10. package/agents/massu-schema-sync-verifier.md +70 -0
  11. package/agents/massu-security-reviewer.md +98 -0
  12. package/agents/massu-ux-reviewer.md +106 -0
  13. package/commands/_shared-preamble.md +53 -23
  14. package/commands/_shared-references/auto-learning-protocol.md +71 -0
  15. package/commands/_shared-references/blast-radius-protocol.md +76 -0
  16. package/commands/_shared-references/security-pre-screen.md +64 -0
  17. package/commands/_shared-references/test-first-protocol.md +87 -0
  18. package/commands/_shared-references/verification-table.md +52 -0
  19. package/commands/massu-article-review.md +343 -0
  20. package/commands/massu-autoresearch/references/eval-runner.md +84 -0
  21. package/commands/massu-autoresearch/references/safety-rails.md +125 -0
  22. package/commands/massu-autoresearch/references/scoring-protocol.md +151 -0
  23. package/commands/massu-autoresearch.md +258 -0
  24. package/commands/massu-batch.md +44 -12
  25. package/commands/massu-bearings.md +42 -8
  26. package/commands/massu-checkpoint.md +588 -0
  27. package/commands/massu-ci-fix.md +2 -2
  28. package/commands/massu-command-health.md +132 -0
  29. package/commands/massu-command-improve.md +232 -0
  30. package/commands/massu-commit.md +205 -44
  31. package/commands/massu-create-plan.md +239 -57
  32. package/commands/massu-data/references/common-queries.md +79 -0
  33. package/commands/massu-data/references/table-guide.md +50 -0
  34. package/commands/massu-data.md +66 -0
  35. package/commands/massu-dead-code.md +29 -34
  36. package/commands/massu-debug/references/auto-learning.md +61 -0
  37. package/commands/massu-debug/references/codegraph-tracing.md +80 -0
  38. package/commands/massu-debug/references/common-shortcuts.md +98 -0
  39. package/commands/massu-debug/references/investigation-phases.md +294 -0
  40. package/commands/massu-debug/references/report-format.md +107 -0
  41. package/commands/massu-debug.md +105 -386
  42. package/commands/massu-docs.md +1 -1
  43. package/commands/massu-full-audit.md +61 -0
  44. package/commands/massu-gap-enhancement-analyzer.md +276 -16
  45. package/commands/massu-golden-path/references/approval-points.md +216 -0
  46. package/commands/massu-golden-path/references/competitive-mode.md +273 -0
  47. package/commands/massu-golden-path/references/error-handling.md +121 -0
  48. package/commands/massu-golden-path/references/phase-0-requirements.md +53 -0
  49. package/commands/massu-golden-path/references/phase-1-plan-creation.md +168 -0
  50. package/commands/massu-golden-path/references/phase-2-implementation.md +397 -0
  51. package/commands/massu-golden-path/references/phase-2.5-gap-analyzer.md +156 -0
  52. package/commands/massu-golden-path/references/phase-3-simplify.md +40 -0
  53. package/commands/massu-golden-path/references/phase-4-commit.md +94 -0
  54. package/commands/massu-golden-path/references/phase-5-push.md +116 -0
  55. package/commands/massu-golden-path/references/phase-5.5-production-verify.md +170 -0
  56. package/commands/massu-golden-path/references/phase-6-completion.md +113 -0
  57. package/commands/massu-golden-path/references/qa-evaluator-spec.md +137 -0
  58. package/commands/massu-golden-path/references/sprint-contract-protocol.md +117 -0
  59. package/commands/massu-golden-path/references/vr-visual-calibration.md +73 -0
  60. package/commands/massu-golden-path.md +114 -848
  61. package/commands/massu-guide.md +72 -69
  62. package/commands/massu-hooks.md +27 -12
  63. package/commands/massu-hotfix.md +221 -144
  64. package/commands/massu-incident.md +49 -20
  65. package/commands/massu-infra-audit.md +187 -0
  66. package/commands/massu-learning-audit.md +211 -0
  67. package/commands/massu-loop/references/auto-learning.md +49 -0
  68. package/commands/massu-loop/references/checkpoint-audit.md +40 -0
  69. package/commands/massu-loop/references/guardrails.md +17 -0
  70. package/commands/massu-loop/references/iteration-structure.md +115 -0
  71. package/commands/massu-loop/references/loop-controller.md +188 -0
  72. package/commands/massu-loop/references/plan-extraction.md +78 -0
  73. package/commands/massu-loop/references/vr-plan-spec.md +140 -0
  74. package/commands/massu-loop-playwright.md +9 -9
  75. package/commands/massu-loop.md +115 -670
  76. package/commands/massu-new-pattern.md +423 -0
  77. package/commands/massu-perf.md +422 -0
  78. package/commands/massu-plan-audit.md +1 -1
  79. package/commands/massu-plan.md +389 -122
  80. package/commands/massu-production-verify.md +433 -0
  81. package/commands/massu-push.md +62 -378
  82. package/commands/massu-recap.md +29 -3
  83. package/commands/massu-rollback.md +613 -0
  84. package/commands/massu-scaffold-hook.md +2 -4
  85. package/commands/massu-scaffold-page.md +2 -3
  86. package/commands/massu-scaffold-router.md +1 -2
  87. package/commands/massu-security.md +619 -0
  88. package/commands/massu-simplify.md +115 -85
  89. package/commands/massu-squirrels.md +2 -2
  90. package/commands/massu-tdd.md +38 -22
  91. package/commands/massu-test.md +3 -3
  92. package/commands/massu-type-mismatch-audit.md +469 -0
  93. package/commands/massu-ui-audit.md +587 -0
  94. package/commands/massu-verify-playwright.md +287 -32
  95. package/commands/massu-verify.md +150 -46
  96. package/dist/cli.js +146 -95
  97. package/package.json +6 -2
  98. package/patterns/build-patterns.md +302 -0
  99. package/patterns/component-patterns.md +246 -0
  100. package/patterns/display-patterns.md +185 -0
  101. package/patterns/form-patterns.md +890 -0
  102. package/patterns/integration-testing-checklist.md +445 -0
  103. package/patterns/security-patterns.md +219 -0
  104. package/patterns/testing-patterns.md +569 -0
  105. package/patterns/tool-routing.md +81 -0
  106. package/patterns/ui-patterns.md +371 -0
  107. package/protocols/plan-implementation.md +267 -0
  108. package/protocols/recovery.md +225 -0
  109. package/protocols/verification.md +404 -0
  110. package/reference/command-taxonomy.md +178 -0
  111. package/reference/cr-rules-reference.md +76 -0
  112. package/reference/hook-execution-order.md +148 -0
  113. package/reference/lessons-learned.md +175 -0
  114. package/reference/patterns-quickref.md +208 -0
  115. package/reference/standards.md +135 -0
  116. package/reference/subagents-reference.md +17 -0
  117. package/reference/vr-verification-reference.md +867 -0
  118. package/src/commands/install-commands.ts +149 -53

package/commands/massu-golden-path/references/qa-evaluator-spec.md
@@ -0,0 +1,137 @@
# QA Evaluator Specification

> Reference doc for `/massu-golden-path` Phase 2C. Return to `phase-2-implementation.md` for full Phase 2.

## Purpose

An adversarial functional QA agent that exercises the running application via Playwright MCP, tuned for skepticism. Catches "compiles but doesn't work" failures that self-evaluation and code review miss.

**Origin**: Adapted from Anthropic Labs' harness design. Key insight: separating generation from evaluation eliminates self-praise bias. Tuning a standalone evaluator to be skeptical is far more tractable than making a generator self-critical.

---

## Evaluation Dimensions

The QA evaluator grades each implemented plan item across 4 dimensions:

| # | Dimension | Weight | What It Checks |
|---|-----------|--------|----------------|
| 1 | **Functionality** | High | Can users complete the primary workflow? Navigate to feature, interact, verify result. |
| 2 | **Completeness** | High | Are ALL contract acceptance criteria met? No stubs, no mock data, no placeholder text. |
| 3 | **Data Integrity** | Medium | Do write→store→read→display roundtrips work? (Aligns with CR-47 VR-ROUNDTRIP) |
| 4 | **Design Compliance** | Medium | Does the UI follow the design system tokens and component specs? (Aligns with VR-TOKEN, VR-SPEC-MATCH) |

---

## Grading Rubric

Per plan item:

| Grade | Meaning | Gate Impact |
|-------|---------|-------------|
| **PASS** | All contract criteria met, feature works as intended | QA_GATE remains PASS |
| **PARTIAL** | Most criteria met but 1-2 minor gaps (e.g., missing empty state) | QA_GATE: FAIL — must fix |
| **FAIL** | Core functionality broken, mock data, unwired features | QA_GATE: FAIL — must fix |

**Failure threshold**: Any single FAIL or PARTIAL = QA_GATE: FAIL. The article emphasizes that a lenient evaluator defeats the purpose.

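The gate logic above is mechanical, so it can be sketched in a few lines of TypeScript (type and function names here are illustrative, not part of the package API):

```typescript
type ItemGrade = "PASS" | "PARTIAL" | "FAIL";

// Any single non-PASS grade fails the whole gate. PARTIAL is deliberately
// not half credit, per the anti-leniency rules. An empty list (no
// contracted items) passes vacuously, matching the skip behavior for
// backend-only plans.
function qaGate(grades: ItemGrade[]): "PASS" | "FAIL" {
  return grades.every((g) => g === "PASS") ? "PASS" : "FAIL";
}
```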
---

## Known Failure Patterns to Check

These patterns are derived from production incidents. The QA evaluator MUST actively check for each:

| Pattern | Source | How to Detect |
|---------|--------|---------------|
| **Mock/hardcoded data** | Common incident pattern | Data doesn't change when DB changes; look for hardcoded arrays in component files |
| **Write succeeds but read/display broken** | Data visibility incidents | Submit form successfully, navigate away and back, verify data persists and displays |
| **Feature stubs** | Multiple incidents | Component renders but onClick/onSubmit handlers are empty or log-only |
| **Invisible elements** | Visibility incidents | Elements exist in DOM but have `display:none`, `opacity:0`, or are behind other elements |
| **Missing query invalidation** | Common pattern | Create/update item, verify list updates without manual refresh |
| **Broken dark mode** | Design audit findings | Toggle theme, verify all text visible, no invisible-on-dark elements |

---

## Evaluation Protocol

For each plan item with a sprint contract:

```
1. NAVIGATE to the affected page using Playwright MCP
   - browser_navigate to the target URL
   - browser_snapshot to verify page loaded (not error/auth page)
   - browser_console_messages to check for React errors

2. EXERCISE the feature as a real user would
   - Follow the happy path described in contract criteria
   - browser_click, browser_fill_form, browser_select_option as needed
   - Wait 2-3 seconds after interactions for async operations

3. VERIFY against sprint contract acceptance criteria
   - Check each criterion explicitly
   - browser_snapshot after key interactions for evidence
   - browser_network_requests to verify API calls succeed

4. CHECK for known failure patterns
   - Look for hardcoded data (browser_evaluate to check DOM for static arrays)
   - Verify write→read roundtrip if applicable
   - Check empty/loading/error states by testing edge conditions

5. GRADE the item: PASS / PARTIAL / FAIL
   - Include specific evidence for any non-PASS grade
   - Reference the specific contract criterion that failed
```

---

## Conditional Activation

The QA evaluator only spawns when the plan touches UI files:
- `src/app/**/*.tsx`
- `src/components/**/*.tsx`

For backend-only plans (routers, crons, migrations with no UI), skip with log note:
```
QA Evaluator: SKIPPED (no UI files in plan)
```

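As a sketch, the activation check needs no glob library; prefix and extension tests cover the two patterns above (function names are hypothetical, not the package's actual API):

```typescript
// A plan "touches UI" when any changed file falls under the two glob
// roots above. Prefix + extension checks are sufficient here, so no
// glob library is assumed.
function touchesUi(changedFiles: string[]): boolean {
  const uiRoots = ["src/app/", "src/components/"];
  return changedFiles.some(
    (f) => uiRoots.some((root) => f.startsWith(root)) && f.endsWith(".tsx")
  );
}

// Backend-only plans skip the evaluator with the log note shown above.
function qaEvaluatorMode(changedFiles: string[]): string {
  return touchesUi(changedFiles)
    ? "QA Evaluator: ACTIVE"
    : "QA Evaluator: SKIPPED (no UI files in plan)";
}
```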
---

## Relationship to Phase 2G

| Aspect | QA Evaluator (Phase 2C) | Browser Verification (Phase 2G) |
|--------|------------------------|--------------------------------|
| **When** | After implementation, during review | After all reviews pass |
| **Focus** | Contract compliance, feature functionality | Load audit, performance, interactive inventory |
| **Scope** | Only plan items with contracts | All pages affected by changes |
| **Adversarial?** | Yes — tuned for skepticism | No — comprehensive but not adversarial |
| **Fixes** | Reports findings; main agent fixes | Fixes issues directly |

They are **complementary, not redundant**. QA evaluator asks "does it do what we agreed?" while Phase 2G asks "does everything still work correctly?"

---

## Evaluator Prompt Tuning

The article explicitly notes that Claude is "a poor QA agent" out of the box — it identifies issues then talks itself into approving. Effective QA evaluation requires iterative prompt tuning.

### Tuning Protocol

After each golden-path run:

1. **Review QA evaluator findings log** — did it catch real bugs?
2. **Check Phase 2G results** — did browser verification catch anything QA evaluator missed?
3. **Check production** — did anything break after deploy that should have been caught?
4. **Update evaluator prompt** if judgment divergences found:
   - Add the missed pattern to the "Known Failure Patterns" list above
   - Strengthen the prompt language for that failure mode
   - Document the tuning decision in memory

### Anti-Leniency Rules

The evaluator prompt includes these rules to prevent the natural tendency toward generosity:

1. **Never say "this is acceptable because..."** — if criteria aren't met, it's FAIL
2. **Never give benefit of the doubt** — if you can't verify it works, it's FAIL
3. **Partial credit is still failure** — PARTIAL means "not done yet"
4. **Evidence required** — every PASS must cite specific evidence (screenshot, DOM state, network response)

package/commands/massu-golden-path/references/sprint-contract-protocol.md
@@ -0,0 +1,117 @@
# Sprint Contract Protocol

> Reference doc for `/massu-golden-path` Phase 2A.5. Return to `phase-2-implementation.md` for full Phase 2.

## What Is a Sprint Contract?

A sprint contract is a **negotiated definition-of-done** established for each plan item **before implementation begins**. It bridges the gap between high-level plan items and testable implementation by defining specific, measurable acceptance criteria that both the implementing agent and the evaluator agree on.

**Origin**: Adapted from Anthropic Labs' harness design pattern where generator and evaluator agents negotiate per-sprint contracts before coding starts. This prevents the "I implemented something adjacent to the plan item" failure mode.

---

## Contract Template

For each plan item in the Phase 2A tracking table, add these columns:

| Column | Content |
|--------|---------|
| **Scope Boundary** | What is IN scope and what is explicitly OUT of scope for this item |
| **Implementation Approach** | High-level approach (which files, which patterns) |
| **Acceptance Criteria** | 3-5 testable statements per item (see quality bar below) |
| **VR-* Mapping** | Which verification types apply and their expected output |

### Example Contract

```
Plan Item: P3-001 — Add contact activity timeline to CRM detail page
Scope Boundary:
  IN: Activity timeline component on contact detail, showing calls/emails/meetings
  OUT: Creating new activities (that's P3-002), activity filtering, pagination
Implementation Approach:
  - New ActivityTimeline component in src/components/crm/
  - tRPC query in contacts router using 3-step pattern
  - Render in contact detail page right column
Acceptance Criteria:
  1. Timeline renders with most recent activity first
  2. Each activity shows type icon, timestamp (relative), and summary text
  3. Empty state shows "No activities recorded" when contact has zero activities
  4. Loading state shows 3 skeleton rows while fetching
  5. Clicking an activity row navigates to the activity detail (or opens Sheet)
VR-* Mapping:
  - VR-GREP: ActivityTimeline component exists
  - VR-RENDER: Component rendered in contact detail page
  - VR-VISUAL: Route passes weighted scoring >= 3.0
  - VR-ROUNDTRIP: Activity data flows from DB to display
```

---

## Contract Quality Bar

Acceptance criteria MUST be specific enough that **two independent evaluators would agree on PASS/FAIL**.

| Quality | Example | Verdict |
|---------|---------|---------|
| **BAD** | "UI looks good" | Subjective, unmeasurable |
| **BAD** | "Feature works correctly" | Too vague, no specific behavior |
| **OKAY** | "Table renders contact data" | Testable but not specific enough |
| **GOOD** | "DataTable renders with sortable columns for name, email, company. Clicking column header toggles sort direction. Empty state shows 'No contacts found' message." | Specific, testable, includes edge case |
| **GOOD** | "Form submits and shows toast.success('Contact created'). New contact appears in list without page refresh." | Verifiable behavior with specific UI feedback |

### Criteria Categories

Each contract should include criteria from at least 3 of these categories:

1. **Happy path**: Primary workflow completes successfully
2. **Data display**: Correct data appears in the right places
3. **Empty/loading/error states**: All states handled
4. **User feedback**: Success/error messages shown
5. **Edge cases**: Null values, long text, missing data

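If each criterion is tagged with its category (the tagging scheme here is an assumption for illustration, not something the protocol mandates), the coverage rule can be checked mechanically:

```typescript
type CriterionCategory =
  | "happy-path"   // primary workflow completes
  | "data-display" // correct data in the right places
  | "states"       // empty/loading/error states
  | "feedback"     // success/error messages
  | "edge-cases";  // null values, long text, missing data

interface Criterion {
  text: string;
  category: CriterionCategory;
}

// A contract needs criteria drawn from at least 3 of the 5 categories.
function hasCategoryCoverage(criteria: Criterion[]): boolean {
  return new Set(criteria.map((c) => c.category)).size >= 3;
}
```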
---

## Negotiation Rules

1. **Max 3 negotiation rounds per item.** If generator and auditor can't agree after 3 rounds, escalate to user via AskUserQuestion.
2. **Auditor challenges vague criteria.** If a criterion uses words like "good", "correct", "proper" without specifics, the auditor MUST push back.
3. **Generator can propose simpler criteria** if the auditor's demands exceed the plan item's scope.
4. **No gold-plating.** Criteria must match the plan item scope, not expand it. If the auditor wants more, that's a new plan item.

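Rule 2's pushback lends itself to a first-pass lint before the auditor even reads the contract. A minimal sketch; the vague-word list is illustrative and should grow as judgment divergences appear:

```typescript
// Flags criteria that lean on vague qualifiers without specifics.
// Illustrative word list -- not an exhaustive or official set.
const VAGUE_WORDS = ["good", "correct", "proper", "nice", "works well"];

function vagueCriteria(criteria: string[]): string[] {
  return criteria.filter((c) =>
    VAGUE_WORDS.some((w) => new RegExp(`\\b${w}\\b`, "i").test(c))
  );
}
```

A flagged criterion is not automatically rejected; it just forces the auditor's attention there first.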
---

## Contract Storage

Contracts are stored as **additional columns in the Phase 2A tracking table**, not as separate files. This avoids file proliferation and keeps contracts co-located with the items they describe.

```
| Item # | Type | Description | Location | Verification | Contract | Acceptance Criteria | Status |
|--------|------|-------------|----------|--------------|----------|---------------------|--------|
| P3-001 | COMPONENT | Activity timeline | src/components/crm/ | VR-RENDER | See above | 5 criteria | PENDING |
```

---

## When to Skip

Sprint contracts can be marked **N/A** for:
- Pure refactors with no user-facing behavior change (verified by VR-BUILD + VR-TYPE + VR-TEST)
- Documentation-only items
- Migration items where the SQL IS the contract (the migration either applies or it doesn't)

Mark as: `Contract: N/A — [reason]`

---

## Relationship to Existing Verification

Sprint contracts **ADD** acceptance criteria on top of existing VR-* checks. They do NOT replace them.

```
VR-* checks = "Does the code meet technical standards?"
Sprint contracts = "Does the code do what we agreed it would do?"
```

Both must pass. A plan item is NOT complete unless:
1. All VR-* checks pass
2. All contract acceptance criteria are met
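That two-part rule reduces to a single predicate. A sketch with hypothetical names:

```typescript
interface ItemVerification {
  vrChecksPassed: boolean[]; // one entry per applicable VR-* check
  criteriaMet: boolean[];    // one entry per contract acceptance criterion
  contractNA?: boolean;      // items marked "Contract: N/A -- [reason]"
}

// Complete only when every VR-* check passes AND every contract
// criterion is met (or the contract was legitimately marked N/A).
function itemComplete(item: ItemVerification): boolean {
  const vrOk = item.vrChecksPassed.every(Boolean);
  const contractOk = item.contractNA || item.criteriaMet.every(Boolean);
  return vrOk && contractOk;
}
```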

package/commands/massu-golden-path/references/vr-visual-calibration.md
@@ -0,0 +1,73 @@
# VR-VISUAL Calibration Examples

> Reference doc for VR-VISUAL 4-dimension weighted scoring.
> Used by `scripts/ui-review.sh` and golden-path Phase 2.5 gap analysis.

## Purpose

Few-shot calibration examples that anchor the LLM evaluator's scoring, reducing drift across runs. The article's author found that calibrating the evaluator with detailed score breakdowns ensured judgment aligned with preferences.

---

## Scoring Dimensions (Recap)

| Dimension | Weight | What It Measures |
|-----------|--------|------------------|
| Design Quality | 2x | Coherent visual identity, deliberate choices |
| Functionality | 2x | Usable, clear hierarchy, findable actions |
| Craft | 1x | Spacing, typography, color consistency |
| Completeness | 1x | All states handled, no broken elements |

**Weighted Score** = (DQ×2 + FN×2 + CR×1 + CO×1) / 6
**PASS threshold**: >= 3.0/5.0

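The formula and threshold can be sketched directly; the flag band for 3.0-3.2 scores comes from the Calibration Notes below, and the names are illustrative:

```typescript
interface VisualScores {
  dq: number; // Design Quality, 1-5
  fn: number; // Functionality, 1-5
  cr: number; // Craft, 1-5
  co: number; // Completeness, 1-5
}

// Weighted Score = (DQ*2 + FN*2 + CR*1 + CO*1) / 6
function weightedScore(s: VisualScores): number {
  return (s.dq * 2 + s.fn * 2 + s.cr + s.co) / 6;
}

// PASS at >= 3.0; scores of 3.0-3.2 pass but are flagged for
// improvement in post-build reflection (see Calibration Notes).
function vrVisualVerdict(s: VisualScores): "FAIL" | "PASS_FLAGGED" | "PASS" {
  const w = weightedScore(s);
  if (w < 3.0) return "FAIL";
  return w <= 3.2 ? "PASS_FLAGGED" : "PASS";
}
```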
---

## Score 5: Excellent

**Description**: A page that a senior designer would approve without changes. Distinct visual identity, not generic template output.

**Design Quality (5)**: Colors, typography, and spacing work together to create a cohesive mood. The page has personality — you could tell it's this product without seeing the logo. No purple gradients over white cards. No generic SaaS dashboard look.

**Functionality (5)**: Primary action is immediately obvious. Navigation is intuitive. Information hierarchy is clear — the most important data is most prominent. Secondary actions are accessible but don't compete with primary.

**Craft (5)**: Consistent spacing throughout. Typography has clear hierarchy (headings, body, captions). Color palette is harmonious with adequate contrast ratios. No misaligned elements, no inconsistent border radii, no orphaned text.

**Completeness (5)**: Loading states use appropriate skeletons (not spinners for tabular data). Empty states have helpful messaging and a call-to-action. Error states are specific and actionable. No broken images, no placeholder text, no "TODO" comments visible.

---

## Score 3: Acceptable

**Description**: Functional and usable but unremarkable. Would not embarrass the team but wouldn't impress either. Typical of competent but uninspired output.

**Design Quality (3)**: Uses the design system correctly but without distinction. Colors are from the palette but the combination doesn't create a strong identity. Layout works but doesn't guide the eye naturally. It's "fine."

**Functionality (3)**: Primary actions are findable but may require a moment of scanning. Information is present but hierarchy could be clearer. Navigation works but some paths feel one extra click too deep.

**Craft (3)**: Mostly consistent spacing with occasional irregularities. Typography hierarchy exists but some levels are too similar in weight/size. Colors pass contrast checks but some combinations feel muddy.

**Completeness (3)**: Most states handled. May be missing one of: loading skeleton for a secondary section, empty state for an uncommon scenario, or error messaging for a specific failure mode. Nothing broken, but gaps exist in edge cases.

---

## Score 1: Failing

**Description**: Visually broken or fundamentally confusing. Would be flagged in any review.

**Design Quality (1)**: Generic AI-generated appearance — default component library styling, no visual identity. Or worse: clashing colors, inconsistent themes, elements that look like they belong to different products.

**Functionality (1)**: Primary action unclear. User has to guess what to do. Important data is buried or missing. Navigation is confusing — back button doesn't go where expected, breadcrumbs are wrong.

**Craft (1)**: Overlapping elements. Cut-off text. Inconsistent spacing (some sections have gaps, others are cramped). Typography has no clear hierarchy. Colors clash or have poor contrast (text invisible on background).

**Completeness (1)**: Obvious missing states — raw error objects displayed, blank sections where data should be, spinner that never resolves, "undefined" or "null" visible in the UI. Broken images showing alt text or image-not-found icons.

---

## Calibration Notes

- **Design quality and functionality are weighted 2x** because Claude already scores well on craft and completeness by default. The emphasis pushes toward more distinctive, user-centric output.
- **"Museum quality"** language in criteria pushes toward convergent aesthetics (the article noted this effect). Massu's criteria use more practical language ("distinct visual identity", "deliberate design choices") to encourage diversity.
- **Score 3.0 is the minimum bar, not the target.** Golden path should aim for 3.5+ on weighted score. Pages scoring 3.0-3.2 pass but should be flagged for improvement in post-build reflection.
- **Scores are route-specific.** An admin settings page scoring 3.5 is fine; a customer-facing landing page scoring 3.5 should be improved.