npm - @tgoodington/intuition - Versions diffs - 8.1.3 → 9.2.1 - Mend

@tgoodington/intuition 8.1.3 → 9.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (154) hide show

package/docs/v9/test/phase5/regression-comparison.md ADDED Viewed

@@ -0,0 +1,197 @@
+# Code Production Regression Comparison
+## Summary Verdict
+**v9 EQUIVALENT** — The v9 output faithfully implements all 8 blueprint sections and satisfies all 8 acceptance criteria. It is more concise than v8 (460 vs 708 lines) with no structural omissions, though v8 provides greater detail in several areas (GPU options, error scenarios, report template). Neither output contains significant technical errors. The v9 architecture can replace v8 for code production without quality loss, though it produces leaner output.
+## Structural Comparison
+| Aspect | v8 | v9 | Winner |
+|--------|----|----|--------|
+| YAML frontmatter | Correct: name, description, model, tools | Identical frontmatter | EQUAL |
+| Section 1: Overview & Role | Present (lines 8-28) | Present (lines 8-16) | v8 slightly more detailed |
+| Section 2: Data Loading | Present (lines 30-86) | Present (lines 18-71) | v8 more detailed (lists all fields) |
+| Section 3: Question Flow | Present (lines 88-194) | Present (lines 73-158) | v8 more detailed (more GPU options) |
+| Section 4: Analysis | Present (lines 196-348) | Present (lines 160-244) | v8 more detailed |
+| Section 5: Benchmark Search | Present (lines 350-420) | Present (lines 246-293) | v8 more detailed |
+| Section 6: Report Template | Present (lines 422-596) | Present (lines 295-411) | v8 significantly more detailed |
+| Section 7: Error Handling | Present (lines 598-689) | Present (lines 413-444) | v8 significantly more detailed |
+| Section 8: Completion | Present (lines 691-707) | Present (lines 446-459) | EQUAL in substance |
+| Total line count | 708 lines | 460 lines | v8 is 54% longer |
+| Heading style | "# Hardware Vetter Skill" + "## N. Section" | "## Section N: Title" | Stylistic preference only |
+## Criterion-by-Criterion Analysis
+### 1. Structural Completeness
+**Rating: EQUAL**
+Both outputs contain all 8 required sections plus correct YAML frontmatter. The frontmatter blocks are identical:
+v8 (lines 1-6):
+```yaml
+---
+name: hardware-vetter
+description: Evaluate proposed hardware changes against the AI server's model lineup
+model: sonnet
+tools: Read, WebSearch, AskUserQuestion, Write, Glob
+---
+```
+v9 (lines 1-6): Identical block.
+Both include Data Loading, Question Flow, Analysis, Benchmark Search, Report Template, Error Handling, and Completion sections in the correct order.
+### 2. Specification Fidelity
+**Rating: EQUAL — both are faithful to their inputs, with minor differences**
+**v8 fidelity to code_specs:**
+- The code_specs (Task 2) specified 8 sections in order. v8 implements all 8.
+- The code_specs specified "Provide common options (RTX 4060/16GB, RTX 4090/24GB, A4000/16GB, A6000/48GB) plus free-text." v8 expanded this significantly to 8 GPU options (lines 141-148: RTX 4060 Ti, 4070, 4080, 4090, A4000, A5000, A6000, Custom). This is **invented content** — the code_specs only listed 4 common options. The additions (4070, 4080, A5000) are reasonable but were not specified.
+- The code_specs specified RAM options as "128GB, 256GB, 384GB, 512GB". v8 expanded to 8 options (128, 192, 256, 384, 512, 768, 1024, Custom) — lines 163-171. The additional options (192, 768, 1024) are **invented content**.
+- v8 added a "Scenario 6: User Proposes a Downgrade" error case (lines 659-667) and "Scenario 7: Partial Offload Complexity" (lines 669-676) that were not in the code_specs. The code_specs Section 7 listed 5 error cases; v8 has 7.
+**v9 fidelity to blueprint:**
+- The blueprint specified 8 sections. v9 implements all 8.
+- The blueprint (Section 5.4) specified GPU options as: "RTX 4060/16GB, RTX 4090/24GB, A4000/16GB, A6000/48GB". v9 lists exactly these 4 plus "Other (I'll specify)" — lines 112-116. Faithful to spec.
+- The blueprint specified RAM options as "128GB, 256GB, 384GB, 512GB". v9 lists exactly these 4 plus "Other" — lines 127-131. Faithful to spec.
+- The blueprint specified 5 error cases in Section 5.8. v9 implements all 5 plus adds a 6th: "If current hardware fields are missing from `hardware_profile`" (lines 441-443). This is a reasonable addition that handles a gap the blueprint didn't explicitly cover.
+- v9 adds a deduplication check (line 310-311): "You MUST NOT overwrite an existing report file. If a file at the target path already exists, append a numeric suffix..." The blueprint mentions "never overwrite existing reports" in Section 2 (Research Findings) and the approach matches, but the implementation detail of a numeric suffix is v9's own addition. Good defensive behavior.
+**Verdict:** Both outputs are faithful to their respective inputs. v8 invents more content (extra GPU/RAM options, extra error scenarios). v9 stays closer to its blueprint but adds one useful error case and a file collision guard.
+### 3. Technical Accuracy
+**Rating: EQUAL**
+**Feasibility tier thresholds:**
+- v8 (line 201-202): "runs_comfortably (>=40% headroom), runs_constrained (10-39% headroom), does_not_fit (<10% headroom or exceeds capacity)" — Correct.
+- v9 (lines 192-195): "Headroom >= 40% -> runs_comfortably, 10-39% -> runs_constrained, < 10% OR resource usage ratio > 1.0 -> does_not_fit" — Correct. The `> 1.0` explicit check is a clearer formulation.
+**Headroom formula:**
+- v8 (line 274): `(proposed_total_ram - model_ram_requirement) / proposed_total_ram * 100` — Correct.
+- v9 (line 179): `(proposed_total_ram - ram_gb_q4) / proposed_total_ram * 100` — Correct. Uses the specific field name rather than a generic variable, which is more precise.
+**Concurrent capacity:**
+- v8 (lines 284-286): `floor(proposed_total_ram / model_ram_requirement)`, capped by CPU cores (2 cores per instance). Correct.
+- v9 (lines 200-202): `floor(proposed_total_ram / ram_gb_q4)`, capped by `floor(proposed_cpu_cores / 2)`. Correct. Explicitly states the floor operation on the core cap.
+**Throughput formula:**
+- v8 (lines 291-297): `relative_throughput = (proposed_clock_speed / current_clock_speed) * sqrt(proposed_cores / current_cores)` using base clock speeds. Includes a worked example. Correct.
+- v9 (lines 210-213): `clock_speed_ratio * sqrt(core_count_ratio)` — uses turbo clock speeds as primary with base as fallback ("use base clock if turbo is unavailable for either"). Includes the geometric mean characterization. Correct formula, but **differs on clock source**: v8 says "Use base clock speeds for comparison (turbo is inconsistent under sustained load)" while v9 says use turbo as primary.
+This is a **minor discrepancy** — the v8 rationale (turbo is inconsistent under sustained load) is arguably more technically accurate for sustained inference workloads. The v9 approach using turbo is defensible but less conservative. Both the code_specs and blueprint say "clock speed ratio (proposed vs current)" without specifying base vs turbo, so neither output is contradicting its input — they made different reasonable choices.
+**GPU partial offload:**
+- v8 (lines 239-254): Detailed partial offload logic with VRAM spillover calculation, RAM headroom for spillover, loading strategy labels. Correct and thorough.
+- v9 (lines 185-188): Covers partial offload viability check. Correct but less detailed — doesn't specify spillover calculation or loading strategy labels.
+### 4. Instruction Quality
+**Rating: v8 SLIGHTLY BETTER**
+Both outputs use imperative directives to Claude, as required. Examples:
+v8 (line 32): "**CRITICAL:** You MUST complete data loading before proceeding to questions."
+v9 (line 68): "If `docs/model_catalog.json` is missing or unreadable, stop immediately and tell the user..."
+v8 uses more explicit "CRITICAL RULES" callout blocks (lines 32, 91-93, 198-202, 352-357, 424-427, 602-605) with MUST/NEVER language. v9 uses imperative voice throughout but with fewer explicit CRITICAL markers.
+v8 provides more procedural granularity. For example, in the question flow, v8 specifies exact suggested formats for user input (line 130: "Provide CPU details: Model, Core Count, Base Clock GHz, Turbo Clock GHz, Architecture (optional)") and an example (line 132). v9 provides the question text but not a suggested format or example.
+v8's report template (Section 6) includes the complete markdown structure with every field placeholder and key/legend text, spanning ~170 lines. v9's report template is ~100 lines with the same sections but less placeholder detail.
+For Claude following instructions reliably, v8's greater specificity reduces ambiguity. However, v9's instructions are clear enough that a capable model (sonnet) would produce equivalent results in practice.
+### 5. Content Depth
+**Rating: v8 BETTER in specific areas**
+| Section | v8 Lines | v9 Lines | Depth Comparison |
+|---------|----------|----------|-----------------|
+| Overview | ~20 | ~9 | v8 includes explicit "What you do" / "What you do NOT do" lists (7 do items, 5 don't items). v9 has 3 lines covering the same scope more tersely. |
+| Data Loading | ~57 | ~54 | Comparable. v8 lists every field to extract with bullet points. v9 lists the same fields with slightly different organization. |
+| Question Flow | ~107 | ~86 | v8 provides more GPU options (8 vs 5), more RAM options (8 vs 5), explicit table format for confirmation summary, detailed handling logic. v9 covers the same flow with fewer options. |
+| Analysis | ~153 | ~85 | v8 is significantly more detailed. Includes: worked throughput example (lines 293-296), explicit loading strategy labels for GPU path, detailed partial offload spillover math, explicit delta format specifications. v9 covers the same methodology but without worked examples. |
+| Benchmark Search | ~71 | ~48 | v8 includes alternative query template, explicit filtering criteria with 3 conditions, search tracking counter, search summary format. v9 covers the same logic more concisely. |
+| Report Template | ~175 | ~117 | v8 includes the full report markdown with all placeholders, legend, and formatting notes. v9 includes the same sections but with less placeholder text. Both specify identical section order. |
+| Error Handling | ~92 | ~32 | Largest gap. v8 has 7 named error scenarios with detailed response/remediation for each, plus a validation gates summary. v9 has 6 error cases in a more compressed format. v8 adds Downgrade and Partial Offload Complexity scenarios. |
+| Completion | ~17 | ~14 | Comparable. Both specify report path, summary, and next steps. |
+**Areas where v9 adds content v8 lacks:**
+- File collision guard with numeric suffix (v9 lines 310-311)
+- Missing hardware_profile fields error case (v9 lines 441-443)
+- Schema version awareness: v9 notes that version "1.1" indicates Task 1 completion (line 49)
+- Explicit instruction: "Perform all calculations in working memory. Do not write intermediate results to files." (v9 line 165)
+### 6. Acceptance Criteria Coverage
+## Acceptance Criteria Coverage
+| AC# | Criterion | v8 | v9 | Notes |
+|-----|-----------|----|----|-------|
+| 1 | Accepts all four hardware change types | PASS — Section 3, Step 3.1 offers all four with multi-select, Step 3.2 has per-component follow-ups | PASS — Section 3, Step 3a offers all four with multi-select, Step 3b has per-component follow-ups | Both handle "Full system" subsuming individual selections |
+| 2 | Reads from model_catalog.json and config.py automatically | PASS — Section 2, Steps 2.1 and 2.2 with Read tool instructions | PASS — Section 2, Steps 2.1 and 2.2 with Read tool instructions | Both extract hardware_profile, models, and config Settings fields |
+| 3 | Fit/no-fit verdict with resource estimates | PASS — Section 4, 3-tier system with resource usage % and headroom % | PASS — Section 4.2, same 3-tier system with resource usage ratio and headroom % | Both use identical thresholds (40%/10-39%/<10%) |
+| 4 | Before/after comparison per model | PASS — Section 4, Step 4.4 with 4 metrics (tier, headroom, throughput, concurrent capacity) and delta | PASS — Section 4.5 with same 4 metrics and delta | Both compare current vs proposed for registered models only |
+| 5 | Identifies newly feasible candidate models | PASS — Section 4, Step 4.5 identifies models moving from does_not_fit to feasible tiers | PASS — Section 4.6 identifies models moving from does_not_fit to feasible tiers | Both include explicit handling when no models become newly feasible |
+| 6 | Upgrade recommendation with rationale | PASS — Section 6 report template includes Executive Summary with verdict + rationale and Recommendation section | PASS — Section 6 report template includes Executive Summary with verdict + rationale | v8 has a separate "Recommendation" section at end of report; v9 puts recommendation in Executive Summary. Both satisfy the criterion. |
+| 7 | Spec estimates flagged as "projected" | PASS — Section 4 (line 203), Section 5 (lines 399-407), report template methodology section | PASS — Section 4.4 (line 214), Section 5.5 (lines 284-291), report template methodology section | Both implement Verified/Projected labeling consistently |
+| 8 | Timestamped markdown file output | PASS — Section 6, Step 6.1 with YYYY-MM-DD_slug format | PASS — Section 6.1 with YYYY-MM-DD_slug format | Both specify docs/reports/ directory creation and slug derivation |
+## Notable Differences
+### 1. Report Template: Recommendation Section
+v8 includes a separate **"Recommendation"** section at the end of the report (lines 578-584) in addition to the Executive Summary. v9 puts the recommendation in the Executive Summary only. The blueprint (Section 5.7) specifies: "Executive Summary [3-5 sentences: upgrade verdict (recommended/not recommended/conditional), key rationale...]" — it does not specify a separate Recommendation section. The code_specs also do not specify a separate Recommendation section. v8 invented this addition. It is useful but goes beyond spec.
+### 2. Throughput Clock Source
+v8 explicitly states: "Use base clock speeds for comparison (turbo is inconsistent under sustained load)" (line 292). v9 states: "use `proposed_cpu_clock_ghz_turbo` / `current_cpu_clock_ghz_turbo` (use base clock if turbo is unavailable for either)" (line 211). Neither input document specifies which clock to use. The v8 choice is arguably more technically sound for sustained inference workloads.
+### 3. GPU Option Breadth
+v8 provides 8 GPU options including RTX 4070, 4080, A5000 (lines 141-148). v9 provides 5 options (lines 112-116). The code_specs specified 4 common options; the blueprint specified the same 4. v8 expanded beyond spec; v9 stayed faithful.
+### 4. Error Handling Depth
+v8 has 7 named error scenarios with detailed response and remediation for each (lines 607-676), plus a Validation Gates Summary (lines 678-689). v9 has 6 error cases in a compressed format (lines 422-444). v8 adds "User Proposes a Downgrade" and "Partial Offload Complexity" scenarios. v9 adds "Current hardware fields missing from hardware_profile."
+### 5. File Collision Guard
+v9 (lines 310-311) explicitly instructs: "You MUST NOT overwrite an existing report file. If a file at the target path already exists, append a numeric suffix to the slug." v8 says "Each evaluation creates a new timestamped file — never overwrite existing reports" (line 429) but doesn't specify how to handle a collision. v9 is more defensive here.
+### 6. Working Memory Instruction
+v9 (line 165) includes: "Perform all calculations in working memory. Do not write intermediate results to files." v8 has no equivalent instruction. This is a useful guardrail for the executing model.
+### 7. Loading Strategy Labels
+v8 explicitly specifies loading strategy labels in the analysis: "Full GPU offload", "Partial GPU offload (layers split between GPU and CPU)", "Not feasible", "CPU-only" — and includes them in the report template. v9 does not include loading strategy as a tracked metric.
+## Issues Found
+### v8 Issues
+1. **Invented content beyond spec:** v8 adds GPU options (RTX 4070, 4080, A5000), RAM options (192, 768, 1024 GB), extra error scenarios (Downgrade, Partial Offload Complexity), and a separate Recommendation section that were not in the code_specs. While all additions are reasonable, they represent scope creep from the build system.
+2. **Redundancy in report template:** The v8 report template includes both an Executive Summary with verdict AND a separate Recommendation section (lines 578-584). These overlap in purpose. The code_specs did not specify a separate Recommendation section.
+### v9 Issues
+1. **Turbo clock preference:** v9 defaults to turbo clock for throughput comparison (line 211), which is less conservative than base clock for sustained inference. Minor technical choice difference, not an error.
+2. **Less GPU option breadth:** Only 5 GPU options vs v8's 8. While faithful to the blueprint, users evaluating RTX 4070/4080 or A5000 would need to use the free-text option. Not a functional gap, but a usability reduction.
+3. **Missing loading strategy labels:** v9 does not track or report "loading strategy" (Full GPU offload / Partial offload / CPU-only) as a per-model metric. v8 includes this in both analysis and report. The blueprint does not explicitly require it, so this is not a spec violation, but it is a useful piece of information that v8 provides.
+4. **Less detailed report template:** v9's report template section is less prescriptive about exact placeholder text and formatting. A capable model would fill in reasonable content, but there is more room for variation in output format.
+### Neither Output
+Both outputs correctly handle all core requirements. No critical errors or omissions found in either.
+## Conclusion
+The v9 architecture produces output that is functionally equivalent to v8 for this skill. All 8 acceptance criteria are satisfied by both outputs. The v9 output is 35% shorter (460 vs 708 lines), which is attributable to:
+1. Fewer invented additions (v9 stays closer to its blueprint spec)
+2. More compressed error handling (6 cases in 32 lines vs 7 cases in 92 lines)
+3. Less verbose report template (still complete, but with fewer placeholder details)
+The v8 output is richer in several areas: more GPU/RAM options for users, more detailed error scenarios, explicit loading strategy tracking, a worked throughput example, and a separate Recommendation section. However, most of these additions were **invented by the v8 build system** beyond what the code_specs specified, meaning v8's build process exhibited more creative expansion while v9's code-writer producer stayed more disciplined to its blueprint.
+**Can the v9 architecture replace v8 for code production without quality loss?** Yes. The v9 output would produce an equally functional skill when executed by Claude. The differences are in richness of detail, not correctness or completeness. If the goal is spec-faithful production (build what was specified, nothing more), v9 is arguably better disciplined. If the goal is maximum robustness through extra detail, v8's build process added useful material — but that material could also be added to the v9 blueprint if desired, which would then flow through to the producer output.
+The v9 architecture's advantage is that quality is controlled at the blueprint level. What the code-architect specifies, the code-writer produces. The v8 system's build step added value through creative expansion, but also added scope that was never reviewed by the engineer or architect — a tradeoff between richness and spec discipline.

package/docs/v9/test/phase5/review-5A-specialist.md ADDED Viewed

@@ -0,0 +1,213 @@
+# Specialist Review — Task 4: Landlord Registration & Compliance Checklist
+**Reviewer role:** Legal analyst
+**Blueprint:** `phase4/blueprints/legal-analyst-T4.md`
+**Deliverable:** `phase5/output/04_compliance_checklist.md`
+**Review date:** 2026-02-27
+---
+## Overall Result: PASS
+All six review criteria met. No invented content. No omitted content. All legal distinctions preserved. Minor formatting variance (section heading "Summary" vs. blueprint's "Summary Counts") is cosmetically inconsequential.
+---
+## Criterion-by-Criterion Findings
+### 1. "All regulatory requirements addressed — every registration/permit from the blueprint present?"
+**PASS**
+Every item specified in the blueprint's Section 5 (Deliverable Specification) appears in the deliverable with correct item numbers, issuing authorities, and content. Cross-check:
+| Blueprint Item | Present in Deliverable | Notes |
+|---|---|---|
+| 1.1 ZBA Variance | Yes — Item 1.1 | Exact content match |
+| 1.2 Building Permit | Yes — Item 1.2 | Address and phone number preserved |
+| 1.3 Electrical Permit | Yes — Item 1.3 | Correct |
+| 1.4 Plumbing Permit | Yes — Item 1.4 | Correct |
+| 1.5 Mechanical Permit | Yes — Item 1.5 | Correct |
+| 1.6 Sewage Disposal (RSA 485-A) | Yes — Item 1.6 | NH DES citation preserved |
+| 1.7 RRP Rule / EPA Certified Firm | Yes — Item 1.7 | 40 CFR 745 citation preserved |
+| 2.1–2.9 All construction inspections | Yes — Items 2.1–2.9 | All 9 intermediate inspections present |
+| 2.10 Certificate of Occupancy | Yes — Item 2.10 | Gating language preserved verbatim |
+| 3.1 Landlord/Agent Registration ($15) | Yes — Item 3.1 | Fee and address correct |
+| 3.2 Lead Paint Disclosure Form | Yes — Item 3.2 | Statutory citations (42 U.S.C. 4852d; RSA 540-B) present |
+| 3.3 EPA Lead Paint Pamphlet | Yes — Item 3.3 | Title and timing requirement preserved |
+| 3.4 Security Deposit Receipt | Yes — Item 3.4 | RSA 540-A citation and 30-day window correct |
+| 3.5 Landlord Identity Disclosure | Yes — Item 3.5 | Correct |
+| 3.6 Utility Responsibility Disclosure | Yes — Item 3.6 | Correct |
+| 3.7 Radon Disclosure | Yes — Item 3.7 | [VERIFY] flag preserved |
+| 4.1 Smoke Detectors | Yes — Item 4.1 | RSA citation and hardwired requirement preserved |
+| 4.2 CO Detectors | Yes — Item 4.2 | RSA 153:10a citation; garage-attachment exception correctly noted |
+| 4.3 Fire Separation | Yes — Item 4.3 | Interior door requirement preserved |
+| 4.4 Egress Windows | Yes — Item 4.4 | Correct |
+| 4.5 Fire Extinguisher | Yes — Item 4.5 | Correct |
+| 4.6 Fire Dept Inspection | Yes — Item 4.6 | [VERIFY] flag preserved; uncertainty correctly stated |
+| 5.1 Landlord Insurance | Yes — Item 5.1 | Correct; homeowner policy exclusion noted |
+| 5.2 Umbrella Policy | Yes — Item 5.2 | Correct |
+| 5.3 Renter's Insurance Requirement | Yes — Item 5.3 | Correct |
+| 5.4 Security Deposit Escrow Account | Yes — Item 5.4 | RSA 540-A citation; interest requirement preserved |
+| 5.5 Annual Lead Paint Renewal | Yes — Item 5.5 | Lease renewal trigger preserved |
+| 5.6 Housing Code Compliance | Yes — Item 5.6 | Chapter 27 citation and complaint-based enforcement noted |
+| 5.7 Landlord Registration Renewal | Yes — Item 5.7 | [VERIFY] flag preserved |
+All 33 items from the blueprint appear in the deliverable.
+---
+### 2. "Legal distinctions maintained accurately — required vs. recommended vs. optional correctly marked?"
+**PASS**
+Every status classification from the blueprint is reproduced exactly in the deliverable. No status was elevated, downgraded, or blurred.
+Critical distinctions verified:
+- Insurance (5.1, 5.2) correctly marked **RECOMMENDED**, not REQUIRED — consistent with blueprint's "No NH state mandate" finding and the decision to mark insurance as recommended.
+- Renter's insurance (5.3) correctly marked **OPTIONAL** — sole optional item in the entire checklist.
+- Radon disclosure (3.7) correctly marked **RECOMMENDED** — consistent with blueprint's "NH does not strictly require" language.
+- Fire department inspection (4.6) correctly marked **RECOMMENDED** with [VERIFY] qualifier — uncertainty about whether it is mandatory for CO is preserved, not resolved by the producer.
+- Security deposit receipt (3.4) and escrow account (5.4) both correctly marked **REQUIRED** — no blurring of these as merely advisable.
+REQUIRED count: 26. RECOMMENDED count: 6. OPTIONAL count: 1. Totals match blueprint specification exactly.
+---
+### 3. "Factual claims supported by research or flagged for verification — all [VERIFY] flags preserved?"
+**PASS**
+All nine categories of [VERIFY] items from the blueprint's [VERIFY] Items Summary table are reproduced identically in the deliverable:
+| [VERIFY] Item | Preserved in Deliverable |
+|---|---|
+| 1.1 ZBA variance fee | Yes |
+| 1.2–1.5 Building and trade permit fees | Yes |
+| 1.6 Sewage disposal approval process | Yes |
+| 2.6 Fire separation inspection fee | Yes |
+| 2.10 CO fee | Yes |
+| 3.7 Radon disclosure requirement | Yes |
+| 4.6 Fire dept inspection for CO | Yes |
+| 5.1, 5.2 Insurance premium estimates | Yes |
+| 5.7 Landlord registration renewal | Yes |
+In-cell [VERIFY] flags in the tables are also preserved exactly. No [VERIFY] flag was silently dropped or resolved with invented data.
+The blueprint's research findings that were confirmable (e.g., $15 landlord registration fee, RSA citations, 42 U.S.C. 4852d, specific addresses and phone numbers) are reproduced accurately without modification.
+---
+### 4. "Cross-references internally consistent — do item numbers, categories, and counts match?"
+**PASS**
+Item numbering is fully consistent across all sections of the deliverable:
+- Category tables use items 1.1–1.7, 2.1–2.10, 3.1–3.7, 4.1–4.6, 5.1–5.7 — matching the blueprint exactly.
+- Critical Path Items section references correct item numbers: (1.2–1.6, 2.1–2.9), (2.10), (3.1), (3.2, 3.3), (4.1, 4.2), (5.4). All cross-references are accurate.
+- [VERIFY] Items Summary table references item numbers that exist in the main tables. The range notation "1.2–1.5" is correct.
+- Summary count table shows 26 REQUIRED + 6 RECOMMENDED + 1 OPTIONAL = 33 total. This is arithmetically correct and matches the actual count of items in the five category tables.
+One minor formatting observation: the blueprint specifies the section heading "Summary Counts" (Section 5); the deliverable uses "Summary." This is a cosmetic variance with no legal or substantive consequence.
+---
+### 5. "Tone appropriate for the target audience — practical, actionable checklist format?"
+**PASS**
+The deliverable is well-calibrated for a landlord without legal training:
+- "How to Use This Checklist" section provides brief, direct instructions without jargon.
+- Tables use a checkbox column (`[ ]`) enabling physical tracking.
+- Notes column provides specific actionable information (e.g., "Schedule 24 hrs ahead," "Submit via online permit portal," "Check payable to City of Concord").
+- Critical Path section clearly identifies gating items — the most important framing for a first-time landlord.
+- Legal disclaimer at the end is appropriately scoped without being alarmist.
+- Status labels (REQUIRED / RECOMMENDED / OPTIONAL) are defined in plain English in the legend.
+- The warning on Item 2.10 ("CANNOT legally rent the ADU without a valid CO") is appropriately emphatic.
+The tone does not over-legalize or under-warn. It reads as a practitioner tool, consistent with blueprint intent.
+---
+### 6. "Required disclosures present and accurate — lead paint, security deposit, landlord ID all covered?"
+**PASS**
+All three named disclosures are present and accurate:
+**Lead paint:**
+- Item 1.7: RRP Rule — EPA-certified contractor for renovation of pre-1978 structure. 40 CFR 745 cited.
+- Item 3.2: Lead Paint Disclosure Form — 42 U.S.C. 4852d and RSA 540-B cited. 3-year record retention requirement stated.
+- Item 3.3: EPA Lead Paint Pamphlet — correct title, timing (before lease signing), and renewal trigger (each lease renewal) stated.
+- Item 5.5: Annual renewal obligation at lease renewal confirmed.
+**Security deposit:**
+- Item 3.4: Written receipt required within 30 days. NH bank name and account number required. RSA 540-A cited.
+- Item 5.4: Escrow account in separate NH bank account required. Interest obligation noted. RSA 540-A cited.
+- The split across two items (receipt obligation vs. account obligation) is accurate and not duplicative — they are distinct legal requirements.
+**Landlord identity:**
+- Item 3.5: Name and address of landlord or authorized agent required for service of process and notices. Timing correctly stated as "in lease or at tenancy start."
+No disclosure is omitted, conflated, or misattributed.
+---
+## Invented Content Check
+**None found.**
+Every item in the deliverable maps directly to a blueprint research finding, decision, or specification. The Key Contacts table (addresses, phone numbers) was specified in the blueprint's Section 5 (content block 12) and populated with contacts drawn from the blueprint's own research findings (Code Admin at 37 Green St (603) 225-8580; City Clerk at 41 Green St (603) 225-8610; Fire Dept via permit portal; NH DES at 29 Hazen Dr (603) 271-3503). The NH DES address and phone number in the Key Contacts table are the only data not explicitly stated in the blueprint text — they are standard agency contact information consistent with the NH DES citation in Item 1.6. This does not constitute problematic invention; it is a reasonable contact lookup consistent with the [VERIFY] guidance to "Contact NH DES." No legal claims were added beyond the blueprint's research.
+---
+## Omitted Content Check
+**One minor omission: Legal disclaimer scoping.**
+The blueprint (Section 9, content block 13) specifies the disclaimer text exactly: "This checklist is for informational planning purposes. Consult with a licensed attorney and/or the relevant municipal offices to confirm current requirements."
+The deliverable renders this as: "*This checklist is for informational planning purposes. Consult with a licensed attorney and/or the relevant municipal offices to confirm current requirements.*"
+Text is present and complete. Italicized rather than in a named block, which is a presentation choice, not an omission.
+**One structural omission: Integration Points section.**
+The blueprint's Section 7 (Integration Points) described cross-references to Task 1 (Lease Agreement) and Task 2. This section was not included in the deliverable. However, the Integration Points section was part of the blueprint's internal analysis, not listed as a required content block in Section 9 (Producer Handoff). The Section 9 specification lists 13 content blocks; Integration Points is not among them. This omission is appropriate — the producer correctly followed the handoff specification rather than including analyst-facing internal sections in the client-facing deliverable.
+**No substantive omissions.** All 13 content blocks specified in Section 9 are present.
+---
+## Legal Distinctions — Integrity Check
+No distinctions were blurred. Specific risk points examined:
+1. **RECOMMENDED vs. REQUIRED for insurance** — correctly maintained. The deliverable does not overstate the legal mandate.
+2. **CO as gating requirement** — flagged with emphatic language in both Item 2.10 notes and the Critical Path section. Not understated.
+3. **Lead paint as federal AND state requirement** — dual citation (42 U.S.C. 4852d; RSA 540-B) preserved in Item 3.2.
+4. **RRP Rule as applicable to renovation (not just rental)** — correctly scoped to the construction phase (Item 1.7), not conflated with the lease-time disclosure obligations.
+5. **Security deposit receipt vs. escrow account** — maintained as two separate items (3.4 and 5.4), each with its own statutory basis and timing requirement.
+6. **CO detector exemption** — the narrow exemption (no attached garage AND no fuel-fired appliances) is correctly applied with a property-specific note that this ADU is NOT exempt due to the attached garage. This is accurate and risk-appropriate.
+---
+## Summary
+| Criterion | Result | Notes |
+|---|---|---|
+| 1. All regulatory requirements present | PASS | All 33 items present; all issuing authorities named |
+| 2. Legal distinctions maintained | PASS | All REQUIRED/RECOMMENDED/OPTIONAL classifications correct |
+| 3. [VERIFY] flags preserved | PASS | All 9 [VERIFY] categories intact; no fabricated specifics |
+| 4. Cross-references consistent | PASS | Item numbers, counts, and Critical Path references all accurate |
+| 5. Tone appropriate | PASS | Practical, actionable, landlord-facing |
+| 6. Required disclosures present | PASS | Lead paint, security deposit, landlord ID all complete |
+| Invented content | NONE | NH DES contact details are standard lookup, not invented legal claims |
+| Omitted content | NONE (material) | Integration Points section correctly excluded per handoff spec |
+| Legal distinctions blurred | NONE | All critical distinctions preserved or strengthened |
+**Overall: PASS**
+The deliverable faithfully executes the blueprint specification. It is accurate, complete, appropriately scoped, and suitable for use as a working compliance tracking document.

package/docs/v9/test/phase5/specialist-test/TEST_PLAN.md ADDED Viewed

@@ -0,0 +1,60 @@
+# Phase 5 Specialist Test: Code-Architect End-to-End
+## Objective
+Validate that the code-architect specialist profile can produce a blueprint of equivalent quality to the hand-crafted blueprint through the Stage 1 → user gate → Stage 2 pipeline.
+## Test Case: Hardware Vetter Skill (Same task as Phase 5B)
+### Test S1: Stage 1 Quality — COMPLETE (PASS)
+- **Specialist**: code-architect
+- **Task**: Task 2 — Build the Hardware Vetter Claude Code Skill
+- **Depth**: Deep
+- **Input**: Plan task + access to District AI codebase
+- **Output**: `specialist-test/scratch/code-architect-stage1.md`
+- **Pass criteria**:
+  1. Research findings cover all relevant files (catalog, config, existing skill, existing report)
+  2. ECD analysis is thorough (elements, connections, dynamics)
+  3. Engineering questions are identified and self-answered from research
+  4. Issues/risks are surfaced
+- **Result**: PASS — thorough 428-line exploration, found all key files, identified 5 issues, self-answered 5 engineering questions
+### Test S2: Stage 2 Blueprint Quality — COMPLETE (PASS)
+- **Input**: stage1.md + user decisions (scope: full rewrite, accept self-answers, ignore existing skill)
+- **Output**: `specialist-test/blueprints/code-architect-hw-vetter.md`
+- **Reference**: `phase5/blueprints/code-architect-hw-vetter.md` (hand-crafted)
+- **Pass criteria**:
+  1. All 9 universal envelope sections present and non-empty
+  2. Deliverable specification detailed enough for code-writer producer (no architectural gaps)
+  3. All 8 acceptance criteria mapped
+  4. Quality comparable to hand-crafted blueprint
+- **Result**: PASS — 917-line blueprint, all 9 envelope sections present and substantive, all acceptance criteria mapped, specification depth sufficient for a code-writer producer. Divergences from hand-crafted are design choices, not gaps (see S3).
+- **Comparison dimensions evaluated**:
+  - Structural completeness: All sections present, auto-generated is more verbose (917 vs 376 lines)
+  - Specification depth: Auto-generated has more detail in some areas (error handling, architecture multipliers), less in others (predefined option lists)
+  - Research grounding: Most design choices traceable to Stage 1 findings; some re-derived from first principles (see S3)
+  - Practical executability: A code-writer producer could build from either blueprint
+- **Full comparison**: `specialist-test/blueprint-comparison.md`
+### Test S3: Divergence Analysis — COMPLETE (PASS with findings)
+- **Input**: Both blueprints side by side
+- **Check**: Where does the auto-generated blueprint diverge from the hand-crafted one?
+- **Key divergences found** (38 specific items documented in `blueprint-comparison.md`):
+  - Feasibility thresholds: auto 30%/5% vs hand-crafted 40%/10% — different but both defensible
+  - Throughput formula: auto invented architecture multiplier tables vs hand-crafted used relative ratio — auto is more ambitious
+  - Question flow: auto used free-text vs hand-crafted used predefined option lists — hand-crafted more constrained
+  - Tier naming: auto "Comfortable/Tight/Infeasible" vs hand-crafted "runs_comfortably/runs_constrained/does_not_fit" — style difference
+  - WebSearch cap: auto 5 vs hand-crafted 8 — minor
+  - Model: auto opus vs hand-crafted sonnet — should be pinned by plan, not decided by specialist
+- **Verdict**: ~70% structural equivalence. Divergences are a mix of (a) defensible alternative designs and (b) cases where Stage 2 re-derived from first principles instead of using Stage 1's documented research.
+- **Greenfield reframe**: This test compared against a hand-crafted blueprint that embodied v8 implementation decisions. Since Intuition projects are typically greenfield (no prior implementation to match), the "invention" behavior is actually the desired outcome. The relevant quality question is not "did it match the reference?" but "are the inventions grounded in Stage 1 research?"
+### Protocol Improvements Applied (from S2/S3 findings)
+1. **Research grounding rule added to Stage 2 protocol** (Section 3): Every design choice must be traceable to Stage 1 research, user decisions, or a named domain standard. Ungrounded choices go to Open Items.
+2. **D23 added to decisions log**: Greenfield-first design stance — Stage 2 invents grounded designs, not transcriptions.
+3. **Model pinning**: Model selection should come from the plan, not be invented by the specialist. (Recommendation 4 — applies regardless of greenfield/rewrite.)
+4. **Dropped recommendations**: Reference Behavior input, explicit deviation flags, and faithfulness checks were specific to rewrite scenarios and don't apply to greenfield projects.
+## Execution Order
+1. ~~Test S2 comparison~~ COMPLETE
+2. ~~Test S3 divergence analysis~~ COMPLETE