npm - buildanything - Versions diffs - 1.7.1 → 2.0.0 - Mend

buildanything 1.7.1 → 2.0.0


Files changed (633)

package/protocols/launch-readiness.md ADDED

@@ -0,0 +1,258 @@
+# Launch Readiness Review Protocol
+You are the orchestrator. You are about to run a Launch Readiness Review (LRR) — a multi-chapter, independent-verdict gate that sits between the Phase 6 Reality Check and Phase 7.
+## Purpose
+LRR replaces the monolithic Reality Checker verdict with five independent chapter verdicts plus a mechanical aggregator.
+## Chapters
+LRR runs **five chapters**: Eng-Quality, Security, SRE, A11y, and Brand Guardian.
+Requirements coverage is evaluated as a sub-input of the Eng-Quality chapter. The Eng-Quality chapter agent reads the Design Doc + `sprint-tasks.md` MVP scope directly alongside its other evidence and emits COVERED/PARTIAL/MISSING per feature inline on its own verdict (see the `requirements_coverage` field in the schema below). There is no separate PM chapter, no `pm.json` file, no separate Step 7.0 dispatch, and no Aggregator re-run: the LRR Aggregator runs exactly once.
+### Primary evidence inputs
+| Chapter | Primary evidence inputs |
+|---|---|
+| Eng-Quality | `architecture.md`, `task-outputs/`, `verify.md` check outputs, test results, eval results, Design Doc + `sprint-tasks.md` MVP scope (read directly for the Requirements Coverage sub-input) |
+| Security | `evidence/fake-data-audit.md`, Phase 5 security audit output, eval-harness security cases |
+| SRE | Phase 5 performance-audit outputs, Performance Benchmarker evidence, NFRs from `sprint-tasks.md`, reliability checks |
+| A11y | Phase 5 a11y audit output, Phase 3.7 `a11y-design-review.md`, WCAG 2.2 AA runtime findings, per-page accessibility findings |
+| Brand Guardian | `docs/plans/visual-design-spec.md`, `docs/plans/visual-dna.md`, `docs/plans/design-references.md`, Playwright screenshots under `docs/plans/evidence/` matching product pages |
+## Chapter verdict schema
+Each chapter agent runs fresh-context, reads its own slice of the evidence manifest, and writes a verdict file with this shape:
+```json
+{
+  "chapter": "eng-quality | security | sre | a11y | brand-guardian",
+  "verdict": "PASS | CONCERNS | BLOCK",
+  "override_blocks_launch": false,
+  "evidence_files_read": ["docs/plans/evidence/..."],
+  "findings": [
+    {"severity": "block|concern|info", "description": "...", "evidence_ref": "path", "related_decision_id": "D-2-03"}
+  ],
+  "follow_up_spawned": false,
+  "follow_up_findings": null
+}
+```
+The **Eng-Quality** chapter additionally carries the Requirements Coverage sub-input inline on its verdict:
+```json
+{
+  "requirements_coverage": [
+    {"feature": "string", "status": "COVERED | PARTIAL | MISSING", "note": "optional string"}
+  ]
+}
+```
+This field carries the PM coverage signal directly on the Eng-Quality verdict — there is no separate `pm.json` file and no separate PM dispatch. The Eng-Quality chapter agent reads the Design Doc + `sprint-tasks.md` MVP scope as part of its own evidence sweep and emits the coverage list alongside its code-quality judgment.
+<HARD-GATE>
+SCHEMA CONTRACT:
+- `verdict` MUST be one of `PASS | CONCERNS | BLOCK`. `CONCERNS` means "I have concerns but won't block" and is aggregated as NEEDS WORK. `BLOCK` means "this fails my chapter's criteria" and is aggregated as NEEDS WORK unless `override_blocks_launch: true`.
+- `override_blocks_launch: true` is a veto-of-vetoes — no other chapter's PASS can override it. Only legal when `verdict == BLOCK`.
+- `evidence_files_read` MUST be non-empty. A verdict file with empty `evidence_files_read` is treated as malformed. "Looks good to me" verdicts are not permitted.
+- `follow_up_spawned` is only legal for the **Security** and **SRE** chapters. Brand Guardian and A11y chapters CANNOT spawn follow-ups — they render their verdict on what they can read and write it directly.
+</HARD-GATE>
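The schema contract above can be checked mechanically. The following is a minimal sketch, not the package's own validator: `validate_verdict` and its error strings are hypothetical names, and it assumes the verdict file has already been parsed into a dict.

```python
VERDICTS = {"PASS", "CONCERNS", "BLOCK"}
FOLLOW_UP_CHAPTERS = {"security", "sre"}


def validate_verdict(doc: dict) -> list[str]:
    """Return a list of schema-contract violations (empty list means valid)."""
    errors = []
    if doc.get("verdict") not in VERDICTS:
        errors.append("verdict must be PASS, CONCERNS, or BLOCK")
    # override_blocks_launch is only legal alongside a BLOCK verdict.
    if doc.get("override_blocks_launch") and doc.get("verdict") != "BLOCK":
        errors.append("override_blocks_launch requires verdict == BLOCK")
    # An empty evidence list makes the verdict malformed.
    if not doc.get("evidence_files_read"):
        errors.append("evidence_files_read must be non-empty")
    # Only the Security and SRE chapters may spawn follow-ups.
    if doc.get("follow_up_spawned") and doc.get("chapter") not in FOLLOW_UP_CHAPTERS:
        errors.append("follow_up_spawned is only legal for security and sre")
    return errors
```

Returning every violation at once, rather than raising on the first, would let an aggregator log all problems with a malformed file in a single pass.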
+## A11y chapter rules
+The A11y chapter gates on WCAG 2.2 AA runtime findings from the Phase 5 accessibility audit and the Phase 3.7 a11y design review.
+- **PASS** if: zero Critical findings AND zero Serious findings AND zero failed WCAG 2.2 AA success criteria (runtime-measurable ones).
+- **CONCERNS** if: zero Critical AND 1-3 Serious findings (reviewable but not blocking).
+- **BLOCK** if: any Critical finding OR >3 Serious findings OR any failed WCAG 2.2 AA success criterion.
+Runtime measurement is required — spec-only compliance is not sufficient. Evidence must come from a Playwright + axe-core sweep (or equivalent) run against the actual built product pages, not the design system route alone.
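As an illustration, the three A11y thresholds above reduce to a small pure function. This is a sketch under the assumption that the audit has already been reduced to three counts; the function and parameter names are hypothetical.

```python
def a11y_verdict(critical: int, serious: int, failed_criteria: int) -> str:
    """Map runtime audit counts to a chapter verdict per the A11y rules."""
    # BLOCK: any Critical, more than 3 Serious, or any failed WCAG 2.2 AA criterion.
    if critical > 0 or serious > 3 or failed_criteria > 0:
        return "BLOCK"
    # CONCERNS: zero Critical and 1-3 Serious findings.
    if serious >= 1:
        return "CONCERNS"
    # PASS: zero Critical, zero Serious, zero failed criteria.
    return "PASS"
```

Checking BLOCK first means a failed success criterion always blocks, even when the Serious count alone would only raise CONCERNS.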
+## Brand Guardian chapter rules
+The Brand Guardian chapter gates on **DNA drift** — did the built product stay true to the 6-axis Visual DNA locked at Phase 3.0?
+Prepend this anti-sycophancy preamble **verbatim** to the chapter dispatch prompt (stolen from ECC `gan-evaluator`):
+> "Your natural tendency is to be encouraging. Fight it. Default verdict: NEEDS WORK. You are not here to validate — you are here to find the gap."
+The chapter agent reads: `docs/plans/visual-dna.md` (the locked DNA card), `docs/plans/visual-design-spec.md`, `docs/plans/design-references.md`, and Playwright screenshots under `docs/plans/evidence/` matching **product pages** (not just `/design-system`).
+Scoring — 6 DNA axes (20 pts each = 120) + 5 craft dimensions (20 pts each = 100) = **220 total**, target **≥ 180**.
+- **PASS** if: total score (DNA + craft) ≥ 180 AND craft score ≥ 75 AND no single axis scoring < 12/20.
+- **CONCERNS** if: total score 150-179 OR any single axis scoring 10-11/20.
+- **BLOCK** if: total score < 150 OR any axis scoring < 10/20.
+Every finding must cite a specific element with a `file:line` reference and reference either the DNA card or a `design-references.md` path. Vague findings ("the hero needs work") are not admissible — they are rejected by the Aggregator's schema check and the verdict defaults to CONCERNS. Brand Guardian is forbidden from rubber-stamping.
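A rough sketch of the Brand Guardian scoring gate follows. It rests on several assumptions that the protocol does not spell out: the 180 and 150 thresholds are read as applying to the 220-point total (the six DNA axes alone cap at 120), the per-axis floor is applied to the DNA axes only, and any score that is neither BLOCK nor PASS is treated as CONCERNS. The function name is hypothetical.

```python
def brand_guardian_verdict(dna_axes: list[int], craft_dims: list[int]) -> str:
    """Score 6 DNA axes and 5 craft dimensions, each out of 20."""
    if len(dna_axes) != 6 or len(craft_dims) != 5:
        raise ValueError("expected 6 DNA axes and 5 craft dimensions")
    craft = sum(craft_dims)
    total = sum(dna_axes) + craft  # out of 220
    lowest_axis = min(dna_axes)
    # BLOCK takes precedence over every other outcome.
    if total < 150 or lowest_axis < 10:
        return "BLOCK"
    if total >= 180 and craft >= 75 and lowest_axis >= 12:
        return "PASS"
    # Assumption: everything in between is reviewable but not blocking.
    return "CONCERNS"
```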
+## Follow-up investigation flow (Security and SRE only)
+Security and SRE — and only these chapters — may spawn one read-only follow-up investigation per LRR round. Eng-Quality, A11y, and Brand Guardian render their verdict on what they can read and rely on the existing NEEDS_WORK loop when concerns arise.
+The trigger is now tightened: follow-ups fire **only on a BLOCK verdict**. The previous "or suspicious, need to verify" escape hatch is gone — suspicion without a BLOCK is recorded as a CONCERNS verdict, not as a follow-up spawn.
+```
+Security/SRE chapter agent runs → reads evidence manifest + 2-3 targeted files
+  |
+  If verdict = PASS or CONCERNS → write lrr/{chapter}.json, done (1 dispatch)
+  |
+  If verdict = BLOCK → CAN spawn ONE follow-up (no longer legal on "suspicion")
+  |
+  Follow-up agent spawned with:
+    - Mode: read-only (no Write, no Edit, no Bash write ops)
+    - Allowed tools: Read, Grep, Glob
+    - Max tool calls: 15 (enforced via prompt + self-report)
+    - Scope: single named concern from parent chapter's findings
+    - Returns typed JSON (see below)
+  |
+  Parent chapter reads follow-up output + its own earlier findings,
+  validates follow-up schema, writes FINAL lrr/{chapter}.json with
+  follow_up_findings populated.
+```
+### Follow-up return schema
+```json
+{
+  "confirmed": true,
+  "evidence": ["path1", "path2"],
+  "findings": [
+    {"severity": "block|concern|info", "description": "...", "evidence_ref": "path"}
+  ],
+  "tool_calls_used": 12
+}
+```
+<HARD-GATE>
+HARD CAPS ON FOLLOW-UPS:
+- **Max 1 follow-up per chapter per LRR round.** No follow-ups spawning follow-ups — investigation chains are banned because that is where dispatch counts actually explode.
+- **Read-only.** Allowed tools: Read, Grep, Glob. No Write, no Edit, no Bash write ops. If the follow-up finds a real fix, it documents it — the fix is executed in a separate Phase 6 NEEDS_WORK cycle, never inline.
+- **15 tool call cap, self-reported via `tool_calls_used`.** Parent chapter validates `tool_calls_used <= 15` before accepting the follow-up output.
+- **Only Security and SRE have the power.** Eng-Quality, A11y, and Brand Guardian chapters with concerns write a CONCERNS verdict; they do not spawn follow-ups.
+- **BLOCK-only trigger.** A follow-up can only be spawned when the parent chapter's verdict is BLOCK. Concerns without a BLOCK are logged as CONCERNS; they do not justify a second dispatch.
+</HARD-GATE>
+## Fallback on malformed or timed-out follow-up
+<HARD-GATE>
+DEFAULT: HARD BLOCK WITH `override_blocks_launch: true`.
+If a Security/SRE follow-up times out, returns invalid JSON, or reports `tool_calls_used > 15`, the parent chapter writes:
+```json
+{
+  "chapter": "security|sre",
+  "verdict": "BLOCK",
+  "reason": "follow_up_malformed",
+  "override_blocks_launch": true,
+  "detail": "Follow-up returned {timeout|invalid_json|cap_violation} — cannot verify security/reliability claim, blocking per pessimistic default"
+}
+```
+A security or reliability check that cannot produce a typed result is itself a signal — pessimistic-block is the only safe default. You cannot ship a build whose security check is unverifiable.
+</HARD-GATE>
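The pessimistic fallback could be sketched as below. The helper names are hypothetical, a timeout is modeled by the caller passing `None`, a non-integer `tool_calls_used` is treated as invalid JSON by assumption, and the `detail` text is shortened relative to the template above.

```python
import json
from typing import Optional

MAX_TOOL_CALLS = 15


def fallback_verdict(chapter: str, cause: str) -> dict:
    """Build the hard-block verdict written when a follow-up cannot be trusted."""
    return {
        "chapter": chapter,
        "verdict": "BLOCK",
        "reason": "follow_up_malformed",
        "override_blocks_launch": True,
        "detail": f"Follow-up returned {cause}",
    }


def accept_follow_up(chapter: str, raw: Optional[str]) -> dict:
    """Return the parsed follow-up output, or the fallback BLOCK verdict."""
    if raw is None:  # assumption: the caller passes None on timeout
        return fallback_verdict(chapter, "timeout")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback_verdict(chapter, "invalid_json")
    if not isinstance(data, dict) or not isinstance(data.get("tool_calls_used"), int):
        return fallback_verdict(chapter, "invalid_json")
    # The 15-call cap is self-reported, so the parent validates it here.
    if data["tool_calls_used"] > MAX_TOOL_CALLS:
        return fallback_verdict(chapter, "cap_violation")
    return data
```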
+## LRR Aggregator
+The Reality Checker keeps its evidence-manifest sweep role but its verdict generation is replaced by LRR aggregation. The Aggregator now runs a **5-step flow**: file-completeness checkpoint → apply the 6 rules → BLOCK routing via the decision log → classification for NEEDS WORK findings → READY handoff.
+### Step 1: File-completeness checkpoint (NEW barrier)
+Before applying any aggregation rule, the Aggregator MUST Glob `docs/plans/evidence/lrr/*.json` and verify all 5 expected chapter files exist and parse as valid JSON:
+- `eng-quality.json`
+- `security.json`
+- `sre.json`
+- `a11y.json`
+- `brand-guardian.json`
+The Aggregator does **not** expect or read any `pm.json` file. Requirements coverage lives inline on the Eng-Quality verdict (`requirements_coverage` field) and is not a separate artifact.
+If any of the 5 required files are missing, OR any file fails to parse as valid JSON, OR any file is missing required schema fields (`chapter`, `verdict`, non-empty `evidence_files_read`), the Aggregator:
+1. Logs `LRR INCOMPLETE: missing [filename] / malformed [filename]` to `docs/plans/build-log.md`.
+2. Writes `docs/plans/evidence/lrr-aggregate.json` with `combined_verdict = INCOMPLETE` and the list of missing/malformed files.
+3. STOPS — does **NOT** proceed to the 6 aggregation rules.
+This closes the partial-glob race: with an explicit roster check, a chapter file that has not been written yet makes the Aggregator fail loudly instead of silently under-counting verdicts.
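One way to picture the roster check is the sketch below. It checks the five expected filenames by name instead of trusting whatever a glob returns; `check_roster` and the shape of its return value are assumptions for illustration, not the package's API.

```python
import json
from pathlib import Path

EXPECTED = ["eng-quality.json", "security.json", "sre.json", "a11y.json", "brand-guardian.json"]
REQUIRED_FIELDS = ("chapter", "verdict", "evidence_files_read")


def check_roster(lrr_dir: Path) -> dict:
    """Report which expected chapter files are missing or malformed."""
    missing, malformed = [], []
    for name in EXPECTED:
        path = lrr_dir / name
        if not path.is_file():
            missing.append(name)
            continue
        try:
            doc = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            malformed.append(name)
            continue
        # Required fields must exist and the evidence list must be non-empty.
        if (
            not isinstance(doc, dict)
            or any(field not in doc for field in REQUIRED_FIELDS)
            or not doc["evidence_files_read"]
        ):
            malformed.append(name)
    return {"complete": not missing and not malformed, "missing": missing, "malformed": malformed}
```

When `complete` is false, the caller would write `combined_verdict = INCOMPLETE` and stop before the aggregation rules.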
+### Step 2: Apply the 6 aggregation rules
+Once all 5 chapter files are present and parseable, the Aggregator applies these rules:
+1. **ANY `override_blocks_launch: true`** → `combined_verdict = BLOCKED`, regardless of other verdicts. This is the veto-of-vetoes rule.
+2. **ALL verdicts `PASS` AND zero follow-ups spawned** → `combined_verdict = PRODUCTION READY`.
+3. **ANY verdict `BLOCK` with `override_blocks_launch: false`** → `combined_verdict = NEEDS WORK`, with that chapter's findings routed into the existing fix-and-retest loop.
+4. **ANY verdict `CONCERNS`** → `combined_verdict = NEEDS WORK`, concerns logged to `build-log.md` for later triage.
+5. **Follow-up spawned AND `follow_up_findings.confirmed: true`** → treat the parent chapter's verdict as if it were `BLOCK` (since the follow-up confirmed the concern).
+6. **Contradictions between chapters on typed fields** → `combined_verdict = BLOCKED` with the specific finding `cross-chapter contradiction: {field} differs between {chapter_a} and {chapter_b}`.
+Rule 6 detects cross-chapter conflicts mechanically on typed fields only — no chapter reads another's draft, preserving fresh-context independence.
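A simplified sketch of rules 1 through 5 is shown here. Rule 6 is left out because the protocol does not say which typed fields are compared across chapters. The handling of an all-PASS panel where a follow-up was nonetheless spawned is an assumption (treated as NEEDS WORK), and the function name is hypothetical.

```python
def aggregate(verdicts: list[dict]) -> str:
    """Combine parsed chapter verdicts into a single combined verdict."""
    # Rule 1: any override is a veto-of-vetoes.
    if any(v.get("override_blocks_launch") for v in verdicts):
        return "BLOCKED"
    effective = []
    for v in verdicts:
        follow_up = v.get("follow_up_findings") or {}
        # Rule 5: a confirmed follow-up makes the parent count as BLOCK.
        if v.get("follow_up_spawned") and follow_up.get("confirmed"):
            effective.append("BLOCK")
        else:
            effective.append(v["verdict"])
    # Rules 3 and 4: any non-overriding BLOCK or any CONCERNS means NEEDS WORK.
    if "BLOCK" in effective or "CONCERNS" in effective:
        return "NEEDS WORK"
    # Rule 2: all PASS and zero follow-ups spawned.
    if not any(v.get("follow_up_spawned") for v in verdicts):
        return "PRODUCTION READY"
    # Assumption: PASS verdicts with a spawned follow-up are not launch-ready.
    return "NEEDS WORK"
```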
+### Step 3: BLOCK routing via `decisions.jsonl` `decided_by` lookup
+When the Aggregator determines `combined_verdict = BLOCKED` or `NEEDS WORK` via a BLOCK finding, it MUST NOT stop and wait. Instead:
+1. For each BLOCK finding in the aggregated output, read the `related_decision_id` field on the finding.
+2. Read `docs/plans/decisions.jsonl` and find the row with that `decision_id`.
+3. Read the `decided_by` field — this is the phase that authored the original decision (e.g., `architect` or `design-brand-guardian`). Cross-reference with the `phase` field on the row to disambiguate between phases that share a `decided_by` value.
+4. Route the finding BACKWARD to that phase as re-entry input. The build resumes at that phase with the BLOCK finding as a correction signal.
+If no `related_decision_id` is present on the finding (legacy finding, or a non-decision-backed issue such as a runtime crash or a fresh test failure), fall back to the legacy routing: classify by severity and route to Phase 4 (code-level) or Phase 2 (structural) per Step 4 below.
+This replaces "BLOCKED → return to failing step" with author-aware re-entry — a BLOCK on an auth model flaw routes to the Phase 2 architecture synthesizer who authored the decision, not the Phase 4 implementer.
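The `decided_by` lookup might look like the following sketch. It assumes each line of `decisions.jsonl` is a JSON object carrying `decision_id`, `decided_by`, and `phase`; the function name is hypothetical, and returning `None` for an unknown decision ID (so the caller falls back to legacy routing) is an assumption.

```python
import json
from typing import Optional


def route_block_finding(finding: dict, decisions_jsonl: str) -> Optional[dict]:
    """Find the authoring phase for a BLOCK finding, or None for legacy routing."""
    decision_id = finding.get("related_decision_id")
    if not decision_id:
        return None  # no decision behind this finding: classify by severity instead
    for line in decisions_jsonl.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the log
        row = json.loads(line)
        if row.get("decision_id") == decision_id:
            # phase disambiguates phases that share a decided_by value
            return {"decided_by": row.get("decided_by"), "phase": row.get("phase")}
    return None
```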
+### Step 4: Classification for NEEDS WORK findings
+For any NEEDS WORK findings that did not carry a `related_decision_id` (and therefore fell through Step 3's `decided_by` lookup), the Aggregator applies the existing legacy classification:
+- **code-level** findings (test failure, runtime error, lint/type error, per-file bug) → route to Phase 4 target task.
+- **structural** findings (architectural mismatch, API contract violation, persistence model error, DNA drift at the component-library layer) → route to Phase 2 or Phase 3 per the finding's domain.
+- **CONCERNS** entries (no BLOCK) → logged to `build-log.md` for triage, do not block launch on their own.
+### Step 5: READY handoff
+If Step 2 resolves to `combined_verdict = PRODUCTION READY`, the Aggregator writes `docs/plans/evidence/lrr-aggregate.json` with the combined verdict, per-chapter summaries, and forwards to Phase 7 (Launch). No backward routing is triggered — the build moves forward.
+## File paths
+- Chapter verdicts: `docs/plans/evidence/lrr/{eng-quality,security,sre,a11y,brand-guardian}.json`
+- Aggregator output: `docs/plans/evidence/lrr-aggregate.json`
+## Token budget
+~12.5-19K tokens per LRR cycle, net of PM fold-in: five chapter dispatches at 2-3K each (10-15K), plus the nested `pr-test-analyzer` sub-dispatch inside Eng-Quality at ~1.5-2K, plus one aggregator at 1-2K. The Aggregator runs exactly once per cycle.
+## Design Notes (non-operational)
+### Rationale: single-verdict failure mode
+The current Reality Checker collapses code quality, security, reliability, accessibility, and product completeness into a single verdict from a single agent — the exact failure mode matrix organizations exist to prevent. Independence matters most for Security and SRE: production incidents from those chapters are asymmetric in consequence — a security finding that goes unchallenged is a breach; a reliability finding is an outage. That asymmetry justifies the extra dispatch power and the pessimistic-block fallback.
+### Structural shifts from the prior 5-chapter panel
+- **Eng and QA merged into Eng-Quality.** More than half of their evidence overlapped (tests, verify outputs, task-level quality signals), and two nearly-identical verdicts produced two-thirds the signal of one coherent view.
+- **A11y is a new seat.** The WCAG gap was the biggest coverage hole in the prior panel — a mechanical contrast field on the old Design chapter is not a runtime accessibility check.
+- **Brand Guardian replaces the Design mechanical check.** The prior Design chapter was a 15-line threshold on a Phase 3 metric score — theater, not judgment.
+### Why each chapter cannot be folded
+- **Eng-Quality:** Merged from the previous Eng+QA chapters because >50% of their evidence overlapped, which produced two near-identical verdicts instead of one stronger one.
+- **Security:** Production breach risk is asymmetric, so an independent chapter runs fresh-context against security evidence with the power to veto launch outright.
+- **SRE:** Now explicitly reads Performance Benchmarker evidence (previously unclear which chapter owned perf NFRs), plus reliability checks and NFR thresholds from `sprint-tasks.md`.
+- **A11y:** The prior Design chapter had a mechanical contrast check and nothing runtime. A11y reads the Phase 5 runtime accessibility sweep and the Phase 3.7 design review, and gates on Critical/Serious finding counts.
+- **Brand Guardian:** Reads the locked DNA card + rendered screenshots + visual design spec + design references, and judges drift from the DNA. Taste judgment, not checklist theater.
+### Rule 6 — anchoring cost rationale
+A naive matrix-org design would have Security read Eng-Quality's draft to catch disagreements — that reintroduces the exact anchoring bias that fresh-context-per-chapter is designed to prevent (Madaan et al. Self-Refine; Gou et al. CRITIC). Instead, the aggregator does the cross-chapter check mechanically on typed fields only. Typed fields are mechanically diffable; free-form findings prose is not — and the prose is where the anchoring bias would come from.
+### Why under `evidence/`
+The existing evidence manifest sweep picks up anything under `docs/plans/evidence/` for free, and these files ARE evidence — typed attestations from independent chapters. Putting them elsewhere would create a second manifest path the aggregator has to know about separately.

package/protocols/metric-loop.md ADDED

@@ -0,0 +1,153 @@
+# Metric Loop Protocol
+You are the orchestrator. You are about to run a metric-driven iteration loop on an artifact (code, architecture, docs, etc.) to drive it toward a quality target.
+## Step 0: Define Your Metric
+Before iterating, YOU define the metric for this specific context. Consider:
+- What is the artifact? (a task implementation, a security audit, an architecture doc, etc.)
+- What does "good" look like? (all tests pass, zero critical vulns, all acceptance criteria met, etc.)
+- Is the metric quantitative (test pass rate, vuln count, coverage %) or qualitative (architecture completeness, doc clarity)?
+Write a **Metric Definition** block to `docs/plans/.build-state.md`:
+```
+## Active Metric Loop
+Phase: [current phase]
+Artifact: [what you're iterating on]
+Metric: [what you're measuring, in one sentence]
+How to measure: [what the measurement agent should do — run tests, audit code, check criteria, etc.]
+Target: [score 0-100 at which you stop]
+Max iterations: [hard cap, default 5]
+Scoring Criteria Checklist: [extracted in Step 0.5 — see .build-state.json]
+Extraction method: [mechanical | one-shot-dispatch | mixed]
+```
+Then create a score log table:
+```
+| Iter | Score | Delta | Top Issue | Files |
+|------|-------|-------|-----------|-------|
+```
+When starting a new metric loop, REPLACE the previous Active Metric Loop section (if any). There is only ever ONE active metric loop. Previous loop results should already be recorded in their phase's section above. When the loop completes (Step 2 exit), rename the section header from `## Active Metric Loop` to `## Completed Metric Loop — [Phase N]` and leave it for historical reference.
+If you are in Phase 5, also record the current sub-step for the overall task cycle (not all of these are within the metric loop itself):
+```
+Sub-step: [5.1 Implement | 5.1b Cleanup | 5.2 Metric Loop | 5.3 Loop Exit | 5.4 Verify]
+```
+This tells the orchestrator exactly where to resume after context compaction.
+## Step 0.5: Extract Scoring Criteria
+Before the first measurement, extract a **Scoring Criteria Checklist** from the stable reference docs. This checklist is the critic's scoring input for every iteration — it replaces raw doc injection.
+### Why
+Reference docs (DNA cards, design specs, architecture docs, acceptance criteria) do not change during the loop. Pre-injecting them into every critic prompt wastes ~20-30K tokens per iteration. The checklist extracts the exact scoring values once and passes ~1-2K tokens instead.
+### Extraction mechanism (per doc type)
+Use the cheapest mechanism that preserves fidelity:
+| Source doc type | Structure | Extraction mechanism | Cost |
+|-----------------|-----------|---------------------|------|
+| Structured (named fields with explicit values) — e.g., `visual-dna.md` DNA card, `sprint-tasks.md` Behavioral Test field, `ios-design-board.md` named sections | YAML/markdown with named axes, fields, or sections | **Mechanical** — orchestrator parses and copies the values directly. No LLM reasoning. | ~0 tokens |
+| Semi-structured (values spread across prose sections) — e.g., `visual-design-spec.md`, Phase 5 audit findings | Long-form with explicit values in multiple sections | **One-shot extractor dispatch** — single agent call reads the full doc once and outputs the structured checklist. | ~20-25K one-time |
+| Unstructured (visual references, screenshots, mood boards) — e.g., `design-references.md` | Screenshot URLs, visual comps | **Not extracted.** Referenced by path in the checklist. Iteration 1 MAY read on-demand; iteration 2+ MUST NOT unless diagnosis explicitly flags a visual-reference gap. | 0 tokens |
+**Rule: if the source doc has named fields with explicit values, extraction is mechanical (no dispatch). If values are spread across prose sections, use a one-shot extractor dispatch. Never use orchestrator LLM reasoning for extraction — it burns the tokens you're trying to save.**
+### Persist the checklist
+Write the checklist to `.build-state.json` under `active_metric_loop.scoring_criteria_checklist`. Record the extraction method in `active_metric_loop.extraction_method` (one of: `mechanical`, `one-shot-dispatch`, `mixed`). The rendered `.build-state.md` view reflects it automatically.
+### Checklist format
+```
+Scoring Criteria Checklist
+Source docs: [list of source doc paths]
+Extracted at: [timestamp]
+Extraction method: [mechanical | one-shot-dispatch | mixed]
+[Structured criteria with exact values, organized by scoring dimension]
+Reference Anchors:
+- [path] (iteration 1 MAY read; iteration 2+ MUST NOT unless diagnosis flags gap)
+```
+The format is flexible — adapt it to the phase and artifact. The protocol gives a template, not a rigid schema. What matters: exact values, not summaries; organized by scoring dimension; reference anchors for unstructured docs.
+## Step 1: MEASURE
+Call the Agent tool — description: "Measure [metric]" — prompt:
+"SCORING CRITERIA CHECKLIST: [paste the checklist from Step 0.5 — NOT the raw reference docs]. [How to measure, from your metric definition]. Score the current state 0-100 against the checklist criteria. Return your response with a clear SCORE: [number] line, a list of FINDINGS, and the single TOP ISSUE most likely to improve the score if fixed."
+> **Pass the Scoring Criteria Checklist (from Step 0.5) to the measurement agent. Do NOT paste full reference docs into the prompt. The agent retains Read access for on-demand lookups if the checklist doesn't cover a specific detail, but the prompt must not pre-inject stable docs.**
+Read the agent's response. You need: the SCORE, the TOP ISSUE, and the file paths for diagnosis in Step 3. Record the score to `docs/plans/.build-state.md`. The full findings list is useful for diagnosis but does NOT need to persist in your context across iterations — once you've picked the top issue, the details of lower-priority findings can go. Append a row to the score log in `docs/plans/.build-state.md`:
+| Iter | Score | Delta | Top Issue | Files |
+|------|-------|-------|-----------|-------|
+## Step 2: CHECK EXIT
+Stop the loop if ANY of these:
+- **Score >= target** → done. Log "Target met at iteration [N]."
+- **Iter-1 short-circuit: score >= target + 10 on the first measurement** → done. Log "Short-circuit at iteration 1. Score: [N]." and record `exit_reason: "short_circuit_iter1"` in the loop state.
+- **Iteration >= max** → done. Log "Max iterations reached. Final score: [N]."
+- **Stall: last 2 scores show no improvement** (delta <= 0 twice in a row) → done. Log "Stalled at score [N]."
+### Early exit — iter-1 short-circuit
+If the first measurement (iter-1) scores >= `target + 10`, exit the loop immediately and commit the iter-1 output. Log `exit_reason: "short_circuit_iter1"` in the loop state. The 10-point margin guards against measurement noise at the boundary; a genuinely excellent first pass does not need rework.
+On stall or max iterations:
+- **Interactive mode:** present score history + top remaining issue to user. Ask for direction.
+- **Autonomous mode:**
+  - If score >= target: done (this branch is already handled in Step 2's exit conditions; included here for completeness).
+  - If score >= 60% of target AND no CRITICAL issues remain in the measurement: accept with WARNING. Log to `docs/plans/build-log.md` with the warning text and the score history. The orchestrator may proceed to the next phase, but the warning MUST be surfaced in the Phase 7 Completion Report's "Verification Gap" section.
+  - If score < 60% of target OR any CRITICAL issue remains: HALT. Do NOT skip. Log "METRIC LOOP: BLOCKED" to `docs/plans/.build-state.md` with the score history and the unresolved issues. Either (a) re-dispatch the fix agent with the unresolved issues, OR (b) abort the build with a directive to the user. The orchestrator may NOT silently proceed past a metric loop that did not converge.
+If not exiting, continue to Step 3.
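The exit conditions and the autonomous-mode branches can be expressed as two small functions. This is a sketch with assumptions: a stall is read as two consecutive non-positive deltas (so it needs three recorded scores), and every return label other than `short_circuit_iter1` is a hypothetical name.

```python
from typing import Optional


def check_exit(scores: list[float], target: float, max_iterations: int = 5) -> Optional[str]:
    """Return an exit reason, or None to continue. scores is oldest first."""
    if not scores:
        return None
    latest = scores[-1]
    if len(scores) == 1 and latest >= target + 10:
        return "short_circuit_iter1"
    if latest >= target:
        return "target_met"
    if len(scores) >= max_iterations:
        return "max_iterations"
    # Stall: delta <= 0 twice in a row.
    if len(scores) >= 3 and scores[-1] <= scores[-2] <= scores[-3]:
        return "stalled"
    return None


def autonomous_outcome(score: float, target: float, critical_remaining: bool) -> str:
    """Decide what autonomous mode does after a stall or max-iterations exit."""
    if score >= target:
        return "done"
    if score >= 0.6 * target and not critical_remaining:
        return "accept_with_warning"
    return "halt"
```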
+## Step 3: DIAGNOSE
+Look at the findings from Step 1. Pick the ONE highest-impact issue — the single fix most likely to move the score. Do not try to fix everything at once. This is the autoresearch insight: one targeted change per iteration, measured impact.
+## Step 4: IMPROVE
+Call the Agent tool — description: "Fix [top issue]" — mode: "bypassPermissions" — prompt:
+"TARGETED FIX: [specific issue to fix, from diagnosis]. CONTEXT: [relevant architecture/criteria]. Make this specific change. Do not refactor unrelated code. Commit: 'fix: [description]'."
+> **Do NOT pass the measurement agent's full findings to this agent. Only pass the single diagnosed issue and relevant file paths.**
+### Iteration-aware context rule
+- **Iteration 1 generator** receives the full phase context header + task description (the generator needs full context for its first pass).
+- **Iteration 2+ generator** receives ONLY: (a) the single top issue from diagnosis, (b) the relevant file paths, (c) the specific criteria values from the Scoring Criteria Checklist that relate to the top issue. The full `[CONTEXT header above]` preamble is NOT re-injected. The generator already has the codebase from iteration 1 — it only needs the delta.
+## Step 5: LOOP
+Return to Step 1. Re-measure the artifact after the fix.
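The five steps form a loop that can be sketched with injected callables. This is a deliberately reduced illustration: `measure` and `fix` are hypothetical stand-ins for the two separate agent dispatches, and the stall and iter-1 short-circuit exits are omitted for brevity.

```python
from typing import Callable


def run_metric_loop(
    measure: Callable[[], dict],
    fix: Callable[[str, list], None],
    target: float,
    max_iterations: int = 5,
) -> list[float]:
    """Measure, check exit, then fix one issue per iteration. Returns the score history."""
    scores: list[float] = []
    while True:
        result = measure()  # fresh measurement with no knowledge of prior fixes
        scores.append(result["score"])
        if result["score"] >= target or len(scores) >= max_iterations:
            return scores
        # Pass only the single top issue and its files, never the full findings.
        fix(result["top_issue"], result["files"])
```

Keeping `measure` and `fix` as separate callables mirrors the requirement that the two agents never share context.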
+---
+## Rules
+<HARD-GATE>
+AUTHOR-BIAS ELIMINATION: The measurement agent and the fix agent must NEVER share context.
+- They MUST be separate Agent tool calls (separate subprocesses, separate context windows).
+- The fix agent receives ONLY: (a) the single top issue diagnosed in Step 3, (b) the relevant file paths, (c) the acceptance criteria. It does NOT receive the measurement agent's full findings, score breakdown, or other issues.
+- The measurement agent in the next iteration does NOT know what the fix agent did — it measures the artifact fresh.
+</HARD-GATE>
+- One fix per iteration. Measure its impact before fixing the next thing.
+- Track ALL scores in `docs/plans/.build-state.md` so the history survives context compaction.
+- If context was compacted mid-loop: read `docs/plans/.build-state.md`, find the Active Metric Loop section, resume from the last recorded iteration.
+- CONTEXT HYGIENE: Measurement agents are analysis agents — read their full output for diagnosis. But once you've picked the top issue (Step 3) and dispatched the fix (Step 4), the detailed findings from THAT iteration are spent. Don't accumulate findings across iterations — each measurement is fresh.
+<STABLE-CONTEXT-RULE>
+STABLE CONTEXT RULE: Reference docs that do not change during the loop (design specs, DNA cards, architecture docs, acceptance criteria) MUST be extracted into the Scoring Criteria Checklist at Step 0.5 and passed as the checklist — never re-injected as raw content into iteration 2+ prompts. Fresh artifacts (screenshots, test results, rendered output) are fetched each iteration. This is the primary token-saving mechanism: ~1-2K checklist vs ~20-30K raw docs per iteration.
+</STABLE-CONTEXT-RULE>
+- CHECKLIST FALLBACK: If no Scoring Criteria Checklist is provided (Step 0.5 was skipped or caller did not pass one), the critic falls back to raw doc reads. The orchestrator MUST log a WARN to `docs/plans/build-log.md`: "Metric loop iteration [N]: no scoring criteria checklist provided, falling back to raw doc reads." Callers SHOULD always provide a checklist. New callers that omit it will trigger the WARN, making silent regressions visible.

package/protocols/smoke-test.md ADDED

@@ -0,0 +1,118 @@
+# Smoke Test Protocol (Behavioral Verification via agent-browser)
+You are the orchestrator. You are about to verify that a completed task actually works in a browser by interacting with it and collecting machine-readable evidence.
+## When to Run
+After a UI-affecting task passes the Verification Protocol. Skip for non-UI tasks (API-only, config, infrastructure, CLI tools). If the task has no behavioral acceptance criteria or no affected page/route, skip.
+## Inputs
+- **Acceptance criteria**: the task's behavioral acceptance criteria (e.g., "clicking Submit saves the form and shows a success toast").
+- **Affected route**: the page or route to test (e.g., `/settings`).
+## Step 0: Preflight
+Check that `agent-browser` is available:
+```
+which agent-browser
+```
+**If installed:** proceed to Step 1.
+**If not installed AND the task has a non-N/A Behavioral Test field in `docs/plans/sprint-tasks.md`:**
+1. Dispatch an installer agent: "Install agent-browser via `npm install -g agent-browser`. Verify `which agent-browser` returns a path. Report PASS/FAIL." MAX 2 attempts.
+2. If installation succeeds, proceed to Step 1.
+3. If installation fails after 2 attempts, hard-fail the smoke test with directive: "agent-browser installation failed. Either (a) install manually and re-run, or (b) explicitly downgrade this task's Behavioral Test field to N/A in `sprint-tasks.md` to acknowledge the gap. The orchestrator may NOT proceed silently."
+4. Log "SMOKE: BLOCKED -- agent-browser unavailable, task [task-name] has Behavioral Test [field text]" to `docs/plans/.build-state.md`. Do NOT log SKIPPED. Do NOT log PASS. Return BLOCKED.
+**If not installed AND the task has Behavioral Test field = N/A:** log "SMOKE: SKIPPED -- task acknowledged as no-behavioral-coverage" with the task name. Return SKIPPED. (This is the only legal silent skip path, and it requires the spec to have explicitly declared N/A.)
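The preflight branching can be summarized as a small decision function. In this sketch `shutil.which` stands in for the `which agent-browser` check; the function name and the returned labels are hypothetical, and the installer dispatch with its two-attempt cap is left to the caller.

```python
import shutil


def preflight(behavioral_test_field: str, tool: str = "agent-browser") -> str:
    """Decide the first preflight action for a task."""
    if shutil.which(tool) is not None:
        return "PROCEED"  # tool is on PATH: go to Step 1
    if behavioral_test_field.strip().upper() == "N/A":
        return "SKIPPED"  # the only legal silent skip path
    # Tool missing and the task declares behavioral coverage:
    # the caller dispatches the installer, then returns BLOCKED if it fails twice.
    return "INSTALL_REQUIRED"
```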
+## Step 1: Start Dev Server
+Detect the dev script from `package.json` (`dev`, `start`, or `serve`). If the server is already running on the expected port, skip. Otherwise start it in the background and wait for the port to be listening.
+## Step 2: Capture Baseline
+```
+agent-browser open http://localhost:[port]/[affected-route]
+agent-browser wait --load networkidle
+agent-browser snapshot -i
+```
+Save the snapshot output as the "before" state. This is your baseline for diffing.
+## Step 3: Execute Acceptance Criteria
+For EACH behavioral acceptance criterion, sequentially:
+1. **Interact** -- execute the required action (`agent-browser click`, `fill`, `select`, `press`, etc.).
+2. **Diff** -- `agent-browser diff snapshot`. If the diff is empty after an interaction that should change the DOM, the feature is broken. Mark criterion FAIL.
+3. **Wait for outcome** -- `agent-browser wait --text "expected outcome"` with a 10s timeout. Timeout means FAIL.
+4. **Check network** -- `agent-browser network requests --status 4xx,5xx`. Any failed API call related to this criterion means FAIL.
+After each page navigation, re-snapshot. Element refs (`@e1`, `@e2`, etc.) invalidate on navigation.
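The per-criterion checks above amount to three independent failure conditions. The sketch below assumes the orchestrator has already reduced the `agent-browser` output to three observations; the `Observation` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    diff_is_empty: bool    # result of the snapshot diff after the interaction
    outcome_seen: bool     # expected text appeared within the 10s timeout
    failed_requests: int   # count of related 4xx/5xx responses


def criterion_result(obs: Observation) -> tuple[str, str]:
    """Return (status, reason) for one behavioral acceptance criterion."""
    if obs.diff_is_empty:
        return ("FAIL", "empty diff")
    if not obs.outcome_seen:
        return ("FAIL", "timeout")
    if obs.failed_requests > 0:
        return ("FAIL", "network error")
    return ("PASS", "")
```

The reason strings match the ACTUAL values the protocol's fix-agent prompt expects (empty diff, timeout, network error).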
+## Step 4: Collect Evidence
+After all criteria are tested:
+```
+agent-browser errors
+agent-browser screenshot --annotate
+agent-browser network har stop
+```
+For UI tasks, also capture a mobile viewport screenshot:
+```
+agent-browser resize 375 812
+agent-browser screenshot --annotate
+```
+Save the mobile screenshot to the evidence directory as `[task-name]-mobile.png`, then restore the desktop viewport:
+```
+agent-browser resize 1920 1080
+```
+If the mobile screenshot shows horizontal scrolling, overlapping elements, or text smaller than 14px, flag as a UX issue in the smoke test results.
+Save all evidence to `docs/plans/evidence/[task-name]/`:
+| File | Content | Format |
+|------|---------|--------|
+| `before.snapshot.txt` | Baseline DOM snapshot | text |
+| `after.snapshot.txt` | Final DOM snapshot | text |
+| `screenshot.png` | Annotated screenshot of final state | image |
+| `errors.txt` | Uncaught JS exceptions | text |
+| `session.har` | Full network trace | HAR |
+Start HAR capture (`agent-browser network har start`) at Step 2 and stop at Step 4. The HAR file is saved for Phase 6.2d fake data analysis -- do not parse it here.
+## Step 5: Verdict
+**PASS**: all criteria verified, zero uncaught exceptions, zero failed API calls. Log "SMOKE: PASS" to `docs/plans/.build-state.md`. Close the browser.
+**FAIL**: spawn a fix agent with this prompt:
+"Fix smoke test failure for [task-name]. EXPECTED: [criterion text]. ACTUAL: [what happened -- empty diff / timeout / network error]. Evidence: [annotated screenshot path], [snapshot diff], [error log], [failed network requests]. Fix the implementation to match the expected behavior."
+After the fix agent completes, re-run from Step 2.
+<HARD-GATE>
+MAX 2 RETRY CYCLES: smoke -> fix -> re-smoke -> fix -> re-smoke. If still failing after 2 fix attempts:
+- **Interactive mode:** present evidence to the user. Include the annotated screenshot, snapshot diff, and error log.
+- **Autonomous mode:** log "SMOKE: FAILED after 2 retries" to `docs/plans/build-log.md` and proceed with a warning.
+Do not loop further.
+</HARD-GATE>
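The retry cap can be sketched as a bounded loop: at most three smoke runs and two fix attempts. `run_smoke` and `run_fix` are hypothetical callables standing in for the smoke run (Steps 2 through 4) and the fix-agent dispatch.

```python
from typing import Callable

MAX_FIX_ATTEMPTS = 2


def smoke_with_retries(run_smoke: Callable[[], bool], run_fix: Callable[[], None]) -> str:
    """Run smoke -> fix -> re-smoke -> fix -> re-smoke, then stop."""
    fixes = 0
    while True:
        if run_smoke():
            return "PASS"
        if fixes == MAX_FIX_ATTEMPTS:
            return "FAILED after 2 retries"  # do not loop further
        run_fix()
        fixes += 1
```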
+## Rules
+- Use `agent-browser` CLI commands only. Not Playwright MCP, not Playwright directly.
+- Evidence is for the agent, not humans. No video recording, no dashboards.
+- Element refs (`@e1`, `@e2`) invalidate on every page change. Re-snapshot after any navigation.
+- One criterion at a time. Measure each interaction's impact before moving to the next.
+- HAR file is for downstream analysis (Phase 6.2d). This protocol does not parse it.
+- `agent-browser close` before returning, regardless of pass/fail.