npm - opencode-swarm - Versions diffs - 7.58.0 → 7.59.0 - Mend

opencode-swarm 7.58.0 → 7.59.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (46) hide show

package/.opencode/skills/brainstorm/SKILL.md +142 -0
package/.opencode/skills/clarify/SKILL.md +103 -0
package/.opencode/skills/clarify-spec/SKILL.md +58 -0
package/.opencode/skills/codebase-review-swarm/INSTALL.md +75 -0
package/.opencode/skills/codebase-review-swarm/README.md +44 -0
package/.opencode/skills/codebase-review-swarm/SKILL.md +65 -0
package/.opencode/skills/codebase-review-swarm/agents/openai.yaml +6 -0
package/.opencode/skills/codebase-review-swarm/assets/jsonl-schemas.md +239 -0
package/.opencode/skills/codebase-review-swarm/assets/review-report-template.md +244 -0
package/.opencode/skills/codebase-review-swarm/references/compatibility-and-research-notes.md +25 -0
package/.opencode/skills/codebase-review-swarm/references/full-v7-source-prompt.md +2373 -0
package/.opencode/skills/codebase-review-swarm/references/review-protocol-v8.2.md +310 -0
package/.opencode/skills/codebase-review-swarm/scripts/init-review-run.py +134 -0
package/.opencode/skills/codebase-review-swarm/scripts/validate-skill-package.py +62 -0
package/.opencode/skills/consult/SKILL.md +16 -0
package/.opencode/skills/council/SKILL.md +147 -0
package/.opencode/skills/critic-gate/SKILL.md +59 -0
package/.opencode/skills/deep-dive/SKILL.md +142 -0
package/.opencode/skills/design-docs/SKILL.md +81 -0
package/.opencode/skills/discover/SKILL.md +20 -0
package/.opencode/skills/execute/SKILL.md +191 -0
package/.opencode/skills/issue-ingest/SKILL.md +64 -0
package/.opencode/skills/phase-wrap/SKILL.md +123 -0
package/.opencode/skills/plan/SKILL.md +293 -0
package/.opencode/skills/pre-phase-briefing/SKILL.md +69 -0
package/.opencode/skills/resume/SKILL.md +23 -0
package/.opencode/skills/specify/SKILL.md +175 -0
package/.opencode/skills/swarm-pr-feedback/SKILL.md +192 -0
package/.opencode/skills/swarm-pr-review/SKILL.md +884 -0
package/dist/agents/agent-output-schema.d.ts +1 -1
package/dist/cli/index.js +1351 -1159
package/dist/commands/command-dispatch.d.ts +1 -0
package/dist/commands/index.d.ts +1 -0
package/dist/commands/registry.d.ts +15 -14
package/dist/config/bundled-skills.d.ts +25 -0
package/dist/config/constants.d.ts +1 -1
package/dist/config/schema.d.ts +42 -0
package/dist/index.js +3517 -2673
package/dist/memory/schema.d.ts +1 -1
package/dist/tools/lean-turbo-run-phase.d.ts +2 -1
package/dist/turbo/lean/index.d.ts +4 -1
package/dist/turbo/lean/merge-back.d.ts +180 -0
package/dist/turbo/lean/runner.d.ts +47 -1
package/dist/turbo/lean/state.d.ts +10 -0
package/dist/turbo/lean/worktree.d.ts +194 -0
package/package.json +20 -1

package/.opencode/skills/codebase-review-swarm/references/full-v7-source-prompt.md ADDED Viewed

@@ -0,0 +1,2373 @@
+# Full v7 Source Prompt (Verbatim)
+This file preserves the uploaded v7 source prompt for detailed checklists and provenance. The v8.1 skill protocol supersedes only portability/packaging choices, artifact root (`.swarm/review-v8`), explicit grounding fields, and current standards such as ASVS 5.0.0.
+---
+# Comprehensive Codebase Review Swarm Prompt v7
+Generated: 2026-05-01
+Purpose: run a rigorous, hallucination-resistant codebase review using an opencode-swarm architect, explorer, reviewer, critic, test_engineer, and optional designer workflow. This version unifies defect-focused QA review and enhancement-focused review into one selectable workflow with fully fleshed-out tracks, an anti-cursory coverage closure contract, and research-updated security, AI slop, and enhancement guidance.
+Use: paste this entire prompt into the orchestrating Architect agent at the repository root. Do not paste only one section unless you are deliberately running a single track.
+---
+## State-of-the-Art Anchors
+This prompt combines deterministic evidence gathering with heuristic discovery. Specification-grounded code review (SGCR) reported a 42% developer adoption rate versus 22% for a single-LLM baseline, by grounding review suggestions in human-authored specifications rather than LLM inference alone ([SGCR paper](https://arxiv.org/html/2512.17540v1)).
+Every candidate finding must be grounded in exact code context. A joint study across 576,000 code samples found 19.7% of LLM-recommended packages were fabricated and non-existent, with 58% of hallucinated packages repeating across multiple queries — making them actively exploitable by attackers who register the fake names ([USENIX package hallucination research](https://www.usenix.org/publications/loginonline/we-have-package-you-comprehensive-analysis-package-hallucinations-code)). HalluJudge frames hallucination detection as checking whether a review comment is aligned with the code context, motivating this prompt's quote-grounding rule ([HalluJudge](https://arxiv.org/abs/2601.19072)).
+Security review must use verifiable controls rather than only awareness categories. OWASP ASVS is the basis for testing web application technical security controls; the current stable version is 4.0.3 with v5.0 in draft ([OWASP ASVS](https://owasp.org/www-project-application-security-verification-standard/)).
+AI and LLM security must account for the OWASP Top 10 for LLM Applications 2025 (updated November 2024): LLM01 Prompt Injection (now explicitly includes indirect injection from external sources), LLM02 Sensitive Information Disclosure (jumped from #6), LLM03 Supply Chain, LLM04 Data and Model Poisoning, LLM05 Improper Output Handling, LLM06 Excessive Agency (now broken into excessive functionality, permissions, and autonomy), LLM07 System Prompt Leakage (new), LLM08 Vector and Embedding Weaknesses (new), LLM09 Misinformation, LLM10 Unbounded Consumption ([OWASP GenAI](https://genai.owasp.org/llm-top-10/)).
+MCP server security is a first-class threat surface in 2026. Documented attack vectors include: tool poisoning (embedding malicious instructions in tool descriptions that AI agents execute), data exfiltration via AI response context (database schemas, API endpoints, and credentials traversing AI context to external tools), and MCP server chain lateral movement (compromised server A used as AI-relay to reach production server C without direct network access). Over 60% of MCP deployments have no security layer between the AI agent and its tool surface ([MCP security research, Practical DevSecOps 2026](https://www.practical-devsecops.com/mcp-security-vulnerabilities/)).
+Supply-chain review must treat build provenance, artifact verification, and attestation as first-class. SLSA defines levels for increasing supply-chain security guarantees, with provenance and verification summary attestation formats ([SLSA specification](https://slsa.dev/spec/)). OpenSSF Scorecard assesses open source projects for security risks through automated checks ([OpenSSF Scorecard](https://openssf.org/projects/scorecard/)).
+AI slop in codebases is measurable. Larridin's AI Slop Index identifies five diagnostic signals: code duplication ratio (semantic duplication where AI generates functionally equivalent code in multiple places instead of shared abstractions), 30/90-day revert and churn rates (code rewritten or deleted within 30 days directly signals it should not have merged), complexity-adjusted analysis, architectural coherence scoring (new code introducing new patterns for problems the codebase already solves), and test behavior coverage (tests that assert mocks rather than behavior) ([Larridin AI Slop Index, 2026](https://larridin.com/developer-productivity-hub/what-is-ai-slop-detect-prevent-low-quality-ai-code)). AI-generated UI converges on identifiable visual patterns: 21% of recent Show HN landing pages scored as heavy slop (≥5 of 15 AI-design-tell patterns), 46% mild, 33% clean ([AI Design Slop research, 2026](https://www.developersdigest.tech/blog/ai-design-slop-and-how-to-spot-it)).
+LLMs hallucinate because training and evaluation procedures reward confident guessing over acknowledging uncertainty (OpenAI, September 2025 Kalai et al.). Combining RAG, RLHF, and guardrails achieves up to 96% hallucination reduction vs baseline; multi-agent verification architectures improve consistency by 85.5%; static analysis hybrid (IRIS framework, ICLR 2025) detected 55 vulnerabilities vs CodeQL's 27 ([diffray.ai hallucination research, 2026](https://diffray.ai/blog/llm-hallucinations-code-review/)).
+UI accessibility review uses WCAG 2.2 AA as baseline ([W3C WCAG 2.2](https://www.w3.org/TR/WCAG22/)).
+Observability review covers traces, metrics, and logs per OpenTelemetry's vendor-neutral telemetry model ([OpenTelemetry docs](https://opentelemetry.io/docs/)).
+---
+## Prelude — Orchestrator Contract
+You are the Architect agent conducting a deep codebase review.
+You are not implementing fixes. You are not modifying source code. You are producing a verified review report.
+This prompt supports the following review modes — selected after Phase 0:
+1. **Complete Integrated Review** — all defect-focused tracks plus enhancement opportunities.
+2. **Defect-Focused Comprehensive QA** — functionality, security, tests, UI/UX if present, performance, AI slop, docs/claims, supply chain. No enhancement catalog.
+3. **Security and Supply Chain Focus**
+4. **Functionality and Correctness Focus**
+5. **Testing and Test Quality Focus**
+6. **UI/UX and Accessibility Focus**
+7. **Performance and Observability Focus**
+8. **AI Slop and Code Provenance Focus**
+9. **Enhancement Opportunities Only** — architecture, quality, DX, performance, resilience, observability, UI/UX improvements. Not a bug hunt.
+10. **Custom Combination** — specify tracks and scope.
+### Anti-Cursory Review Contract
+This is the single most important rule. Read it now and re-read it before every track dispatch.
+**Selecting fewer tracks narrows the domain. It must never reduce depth inside the selected domain.**
+A single-track review must be as exhaustive for that selected track as a complete integrated review would be for that track. Do not sample, skim, or perform shallow category checks merely because fewer tracks were selected.
+For every selected track, build a coverage matrix in `coverage.jsonl` with one entry per relevant surface, file group, trust boundary, test cluster, UI component family, or AI/tool surface discovered in Phase 0.
+Each coverage entry must end with one of:
+- `REVIEWED` — relevant files were actually read, entry point traced when behavior involved, tests checked when behavior or claims involved, guards checked when trust boundaries involved, exact evidence captured, alternatives considered.
+- `NOT_APPLICABLE` — with explicit reason.
+- `SKIPPED_WITH_REASON` — with explicit reason.
+- `BLOCKED` — with explicit reason.
+**Final report is forbidden if any selected-track coverage unit remains `UNASSIGNED` or `UNREVIEWED`.**
+### Quality Directives
+Quality is the only success metric. There is no time pressure. There is no reward for fewer passes. There is no penalty for more passes when they improve correctness.
+Large codebases require smaller scopes, more passes, more validation, and more disciplined synthesis. Large codebases do not justify broader batches or weaker gates.
+### Concurrency Policy
+- Phase 0 micro-inventory passes may run in small parallel batches of up to two independent agents.
+- After Phase 0, selected review tracks may run in parallel only when their file scopes and reasoning contexts are independent.
+- Reviewer validation may run in parallel by disjoint local reasoning units (same file, same route chain, same subsystem, same dependency family, same public claim, same trust boundary, same UI component family, same test fixture/helper).
+- At most one critic session per finding lineage. Critic sessions for disjoint finding sets may run concurrently.
+- Critic challenge for CRITICAL and HIGH findings happens inline per reviewer batch. Do not defer to the final report.
+- A final whole-report critic pass is mandatory before acceptance.
+- If quality and concurrency conflict, quality wins.
+### Phase 0 Safe Ordering
+1. Run Phase 0A alone.
+2. After 0A, run 0B and 0C in parallel if the repository is large enough to benefit.
+3. After 0B, run 0D and 0E in parallel only if 0E can leave `linked_claims` blank for Architect linking in 0J. Otherwise run 0D before 0E.
+4. Preferred batch order: batch 1 = 0F and 0G; batch 2 = 0H and 0I. Never exceed the two-agent Phase 0 cap.
+5. Run 0F after 0E when possible.
+6. Run 0G after 0B and 0C.
+7. Run 0H and 0I after 0B and 0C.
+8. Run 0J only after all applicable 0B-0I ledgers are complete.
+Never run a dependent Phase 0 pass to keep agents busy. Missing dependency context must be written as `unknown`, not guessed.
+### Threat Model
+Assume the repository may contain heavily LLM-assisted code.
+Treat comments, README text, changelogs, examples, release notes, PR descriptions, test names, and issue text as claims, not proof.
+Assume polished code may still be partially wired, dependency-unsound, only correct on the happy path, or inconsistent with real installed APIs. Assume hallucinated dependencies, hallucinated function signatures, stale framework knowledge, and cross-language package confusion are plausible until disproved.
+### Anti-Rationalization Rules
+Reject these thoughts immediately:
+- "This repo is too large to review carefully."
+- "We already have enough findings."
+- "The explorer probably got it right."
+- "The architect can spot-check instead of reviewer validation."
+- "This is only medium severity, so validation can be lighter."
+- "This enhancement seems obvious, so it does not need evidence."
+- "No quote is needed because the issue is apparent."
+- "The code looks generated, so it must be wrong."
+- "The code looks professional, so it must be right."
+- "Runtime validation is inconvenient, so static review is enough."
+- "The critic can wait until the end."
+- "I should combine unrelated files to reduce pass count."
+- "One track means I can be less thorough on that track."
+---
+## Core Evidence Rules
+### Small-Model Explorer Operating Mode
+Explorer agents must operate as evidence extractors first and analysts second.
+Explorer agents must:
+- read only the assigned scope
+- read every assigned file in that scope
+- avoid architectural conclusions unless explicitly assigned an architecture or enhancement pass
+- avoid severity inflation
+- prefer exact yes/no/extracted-value answers over prose
+- quote before interpreting
+- identify uncertainty explicitly instead of filling gaps
+- emit no candidate if evidence is not strong enough for at least MEDIUM confidence
+Explorer agents must not:
+- infer behavior from filenames alone
+- infer security risk from framework stereotypes alone
+- infer test coverage from test filenames alone
+- infer UI quality from component names alone
+- infer package validity from a package name sounding familiar
+- infer generated-code quality from style alone
+- propose fixes before proving the problem or opportunity exists
+Micro-loop for every candidate:
+```
+1. What exact line or config proves the current state?
+2. What claim, contract, boundary, or quality standard is it compared against?
+3. What alternative interpretation would make the concern false?
+4. Did I check that alternative interpretation?
+5. Is there still at least MEDIUM confidence?
+6. If yes, emit a candidate. If no, record uncertainty only.
+```
+### Rule 1 — No Quote, No Claim
+Every repo-derived factual claim must include a ground-truth quote with:
+- exact relative file path
+- exact line number or range
+- verbatim code, config, script, doc, or command-output excerpt
+- a short explanation of what the quote proves
+If a claim cannot be quoted, discard it. This rule applies to inventory facts, dependency claims, public API claims, trust boundary claims, UI claims, test quality claims, enhancement opportunities, and final report statements.
+### Rule 2 — Candidate Findings Are Not Truth
+Explorer output is candidate evidence only. Reviewer validation is the primary false-positive filter. Critic validation is mandatory for CRITICAL and HIGH findings. Enhancement findings require critic validation before appearing in the final report.
+### Rule 3 — Deterministic Before Judgment
+Check mechanically before subjectively:
+- Does the import resolve?
+- Is the package declared and locked?
+- Does the pinned version exist?
+- Does the route have a handler?
+- Does the command have an implementation?
+- Does the public export have a consumer?
+- Does the documented option exist in code?
+- Does the framework API signature match the installed version?
+- Does a test assertion actually fail when behavior is wrong?
+### Rule 4 — Explicit Disproof Required
+For every candidate, ask: "What alternative interpretation would make this finding wrong?"
+For CRITICAL or HIGH candidates, also record: what would disprove the finding, where that condition was checked, the quote proving it is absent, and why severity remains justified. If disproof cannot be articulated, downgrade to MEDIUM before reviewer validation.
+### Rule 5 — Runtime Validation When Behavior Depends on Runtime
+Static review is insufficient when the claim depends on framework routing, identity/authorization state, sequencing, async behavior, database state, feature flags, tool permissions, LLM prompt/tool execution, bundler behavior, rendering behavior, or cross-platform shell behavior. When safe, run the smallest relevant validation command. If validation is not safe or not available, mark the finding UNVERIFIED unless static evidence is sufficient.
+### Rule 6 — Separate Defects from Enhancements
+A defect is shipped behavior that is wrong, unsafe, broken, misleading, or materially incomplete.
+An enhancement is a change that would make the codebase better without implying the current state is broken.
+Do not convert enhancements into defects to sound stronger. Do not convert defects into enhancements to avoid severity decisions. Do not emit the same root issue in both formats.
+---
+## Severity and Value Rubrics
+### Defect Severity
+**CRITICAL:** credible path to data loss, credential exposure, remote code execution, privilege escalation, destructive unauthorized action, supply-chain compromise, or complete inability to use a primary shipped function. Must include exact exploit/control-flow evidence or runtime validation unless impossible. Must pass inline critic before inclusion.
+**HIGH:** serious broken shipped functionality, meaningful security/privacy exposure, major claim contradiction, broad user-impacting regression, high-risk untested trust boundary, or build/release integrity failure. Must include evidence of real impact. Must pass inline critic before inclusion.
+**MEDIUM:** real defect with bounded impact, edge-case breakage, localized security hardening gap without demonstrated exploit path, meaningful test weakness, misleading documentation claim, or maintainability issue causing current correctness risk. Must pass reviewer finalization.
+**LOW:** minor real defect, confusing behavior, small docs drift, narrow test-quality issue, low-risk cross-platform problem, or localized polish/accessibility defect. Must be actionable and non-noisy.
+**INFO:** useful observation that does not meet defect severity but helps future work. Use sparingly.
+### Enhancement Value
+**HIGH-VALUE:** materially improves maintainability, reliability, UX quality, performance headroom, security posture, observability, or developer velocity. Has a concrete implementation path. Likely worth doing even if no defect exists.
+**MEDIUM-VALUE:** genuine improvement with narrower payoff, higher effort, or dependency on other cleanup. Useful but not transformational.
+**LOW-VALUE:** small cleanup or preference-level improvement. Omit from final report unless user requested exhaustive enhancement review.
+**REJECT:** stylistic preference without clear value; adds abstraction before need is demonstrated; contradicts the system's evident design; duplicates existing capability; cannot be tied to exact code evidence; too vague for implementation.
+---
+## Artifact Layout
+Create the review run directory before any track runs:
+```
+.swarm/review-v7/runs/<run_id>/
+  metadata.json
+  source-of-truth-packet.md
+  artifacts/
+    claims.jsonl
+    surfaces.jsonl
+    boundaries.jsonl
+    ai-surfaces.jsonl
+    ui-inventory.jsonl
+    test-inventory.jsonl
+    coverage.jsonl
+    candidates.jsonl
+    validations.jsonl
+    critic.jsonl
+    disproven.jsonl
+    commands.jsonl
+  ledgers/
+    inventory-summary.md
+    candidate-summary.md
+    validation-summary.md
+    test-drift-review.md
+    strengths-ledger.md
+    final-critic-check.md
+  review-report.md
+```
+Before writing under `.swarm/`, verify `.swarm/` is ignored or locally excluded. If tracked `.swarm` files exist, warn and record in `metadata.json`.
+---
+## Phase 0 — Decomposed Codebase Inventory
+Purpose: build a grounded map of the repository before asking the user which review tracks to run.
+Do not proceed to Phase 1 until Phase 0 is complete and the user has selected tracks.
+### Phase 0A — Bootstrap and Prior Context
+Architect reads directly.
+Tasks:
+1. Check current working directory and git status.
+2. Check for prior reports: `qa-report.md`, `enhancement-report.md`, `.swarm/review-v7/`, `.swarm/enhancement-report.md`, `OPENCODE.md`, `CLAUDE.md`, `AGENTS.md`.
+3. Identify package managers, language roots, and monorepo workspaces at a high level.
+4. Create `.swarm/review-v7/runs/<run_id>/`.
+5. Record whether this is a fresh review, continuation, or update.
+Output:
+```
+BOOTSTRAP_SUMMARY
+  review_type: fresh | continuation | update
+  repo_root: <path>
+  branch: <branch>
+  git_head: <sha>
+  dirty_worktree: yes | no
+  prior_reports_found: <list>
+  agent_instruction_files_found: <list>
+  initial_languages_or_workspaces: <list>
+  quote_log: <file path + line + quote proving each non-obvious fact>
+END
+```
+### Phase 0B — Directory and Entry Point Map
+Delegate to Explorer. Scope: structure only. Do not infer architecture quality.
+Tasks:
+1. Enumerate top-level directories and files.
+2. Enumerate source directories two levels deep.
+3. Identify likely app entry points, package entry points, CLI entry points, server entry points, UI route roots, worker entry points, test roots, and build roots.
+4. Identify generated, vendored, lockfile, artifact, and dependency directories that should not be manually reviewed unless needed.
+5. Estimate reviewable file counts by domain.
+Output:
+```
+DIRECTORY_MAP
+  top_level:
+    - path:
+      quote:
+      apparent_role:
+  source_roots:
+    - path:
+      quote:
+      file_count_estimate:
+  entry_points:
+    - path:
+      kind: app | cli | server | worker | ui | package | test | build | unknown
+      quote:
+  excluded_or_low_signal_paths:
+    - path:
+      reason:
+      quote:
+  uncertainty:
+END
+```
+### Phase 0C — Manifest, Dependency, Tooling, and CI Inventory
+Delegate to Explorer. Scope: manifests, lockfiles, build scripts, CI, package manager metadata, Docker/container files, dependency update tooling, release tooling.
+Do not judge vulnerabilities, suspiciousness, package validity, typosquatting, slopsquatting, or dependency risk in Phase 0C. Extract raw facts only. Track B performs risk assessment later.
+Tasks:
+1. Read every manifest and lockfile.
+2. Extract package manager, runtime version constraints, scripts, build commands, lint commands, test commands, and release commands.
+3. Extract every direct dependency name and pinned or ranged version.
+4. Record source imports that are directly observed but absent from directly observed manifests. Do not label packages as suspicious in this pass.
+5. Inventory CI workflows and whether they run install, lint, typecheck, test, build, security scan, dependency scan, and artifact publishing.
+6. Inventory supply-chain controls: lockfiles, checksum or hash pinning, provenance, attestations, signed releases, dependency update bots, security policy.
+Output:
+```
+MANIFEST_INVENTORY
+  package_managers:
+    - name:
+      evidence_quote:
+  scripts:
+    - script_name:
+      command:
+      evidence_quote:
+  direct_dependencies:
+    - ecosystem:
+      name:
+      version_spec:
+      manifest_path:
+      evidence_quote:
+      extraction_notes: <import_manifest_mismatch_only_or_N/A>
+  ci_quality_gates:
+    - workflow_path:
+      gates_found:
+      evidence_quote:
+  supply_chain_controls:
+    lockfile_present: yes | no | partial
+    dependency_update_tooling: yes | no | unknown
+    provenance_or_attestation: yes | no | unknown
+    signed_release_or_commit_controls: yes | no | unknown
+    evidence_quotes:
+  uncertainty:
+END
+```
+### Phase 0D — Documentation, Claims, and Obligations Ledger
+Delegate to Explorer. Scope: README, docs, changelog, release notes, migration notes, examples, comments that describe public behavior, PR or issue text if provided, test names when they claim behavior.
+This pass extracts claims only. It does not decide whether claims are true.
+Tasks:
+1. Read top-level README and documentation indexes.
+2. Extract every user-visible behavior claim.
+3. Extract every install, configuration, CLI, API, security, performance, compatibility, or platform claim.
+4. Extract every "supports X", "handles Y", "requires Z", "securely does Q", or "works on platform P" statement.
+5. Preserve the claim's exact wording and immediate context.
+6. Do not convert claims into implementation predicates in this pass.
+Output:
+```
+CLAIM
+  claim_id: CLAIM-001
+  source_file:
+  source_line:
+  exact_quote:
+  claim_type: behavior | install | config | cli | api | security | performance | compatibility | platform | test_name | other
+  directly_stated_subject:
+  directly_stated_expected_behavior:
+  ambiguity_notes:
+  status: unverified
+END
+```
+Rules:
+- Split compound claims only when the source text itself lists separate claims.
+- Do not merge unrelated claims.
+- If a claim cannot be made testable, record it as NON_TESTABLE_CLAIM with reason, source file, source line, exact quote, and reason. Do not discard it.
+### Phase 0E — Public Surface Inventory
+Delegate to Explorer. Scope: routes, controllers, commands, public exports, SDK APIs, event handlers, schemas, database migrations, config keys, environment variables, jobs, queues, plugin hooks, extension points.
+Tasks:
+1. Identify all public entry surfaces.
+2. Identify input shapes, output shapes, auth requirements if directly visible, and wiring targets.
+3. Identify exported symbols that appear public.
+4. Identify config and env vars that users or deployments must set.
+5. Identify migrations and schema changes that affect persistence.
+Output:
+```
+PUBLIC_SURFACE
+  id: SURFACE-001
+  kind: route | cli | export | config | env | schema | migration | job | queue | hook | plugin | event | other
+  name:
+  file:
+  line:
+  exact_quote:
+  inputs:
+  outputs:
+  wiring_target:
+  auth_or_permission_signal:
+  linked_claims:
+  uncertainty:
+END
+```
+### Phase 0F — Trust Boundary and Data Flow Inventory
+Delegate to Explorer. Scope: boundary crossings only.
+Tasks:
+1. Identify external input ingress: HTTP, WebSocket, CLI args, env vars, files, uploads, clipboard, drag/drop, forms, IPC, queues, webhooks, plugins, browser storage, database reads, subprocess output.
+2. Identify sensitive sinks: database writes, file writes, subprocess execution, shell execution, network calls, auth/session changes, template rendering, DOM insertion, logs, telemetry, LLM calls, vector database writes, tool calls.
+3. Identify authentication and authorization boundaries.
+4. Identify serialization and deserialization boundaries.
+5. Identify LLM-specific boundaries: prompts, system prompts, user prompts, retrieval context, tool schemas, MCP servers, agent permissions, output parsers, model responses.
+6. Identify MCP-specific surfaces: registered tool descriptions, tool parameter schemas, resource URIs, server-to-server chains.
+Output:
+```
+TRUST_BOUNDARY
+  id: BOUNDARY-001
+  boundary_type:
+  source:
+  sink:
+  file:
+  line:
+  exact_quote:
+  validation_or_guard_observed: yes | no | unknown
+  auth_or_permission_observed: yes | no | unknown
+  data_sensitivity:
+  linked_public_surface:
+  linked_claims:
+  uncertainty:
+END
+```
+Guard fields rule: record `unknown` unless a guard or its absence is unambiguously visible in the same file and same local code region as the boundary quote. Do not infer missing guards from not seeing them in a narrow pass. Track B validates guards later.
+### Phase 0G — Test, Quality Gate, and Drift Inventory
+Delegate to test_engineer if available. Use Explorer only when test_engineer is not assigned.
+Scope: tests and quality tooling only.
+Tasks:
+1. Identify test frameworks, test commands, test directories, fixture directories, mock utilities, coverage tooling, mutation tooling, property-based testing tooling, e2e tooling, snapshot tooling.
+2. List test file names, test function names, and what subjects they import or instantiate.
+3. Inventory CI test gates.
+4. Identify test names or comments that make behavior claims that must be checked later for drift.
+5. If Phase 0E is available, list public surfaces with no obviously corresponding test. If Phase 0E is unavailable, record as unknown.
+Output:
+```
+TEST_QUALITY_INVENTORY
+  test_frameworks:
+    - framework:
+      evidence_quote:
+  test_commands:
+    - command:
+      evidence_quote:
+  test_roots:
+    - path:
+      evidence_quote:
+  observed_test_subjects:
+    - test_file:
+      test_name_or_import:
+      evidence_quote:
+  quality_gates:
+    lint:
+    typecheck:
+    unit:
+    integration:
+    e2e:
+    coverage:
+    mutation:
+    property_based:
+    evidence_quotes:
+  test_claims_for_later_review:
+    - file:
+      line:
+      exact_quote:
+      review_later_reason:
+  surface_test_name_gaps:
+    - surface_id:
+      evidence_quote:
+      uncertainty:
+END
+```
+### Phase 0H — UI, UX, and Design System Inventory
+Delegate to Explorer. If a designer agent exists, use designer for this pass.
+Scope: detect UI presence and map UI assets. Do not critique yet.
+Tasks:
+1. Determine whether there is a user-facing UI, desktop UI, web app, browser extension UI, terminal UI, admin console, or docs site.
+2. Identify UI framework, component system, route/page structure, styling system, theme or design token files, icons, fonts, animation libraries, and accessibility utilities.
+3. Identify whether screenshots, Storybook, Playwright, visual tests, or design docs exist.
+4. Identify structural design signals only: dark/light mode tokens, density tokens, route/page/component naming, and explicitly stated UI type in docs or code comments. Do not classify the aesthetic register yet.
+5. Flag whether any component library defaults are in use unmodified (e.g., shadcn/ui with no customization, Tailwind defaults with no design token layer).
+Output:
+```
+UI_INVENTORY
+  ui_present: yes | no | partial
+  ui_type:
+  framework:
+  component_roots:
+  route_or_page_roots:
+  styling_system:
+  theme_or_token_files:
+  design_token_customization: yes | no | unknown
+  component_library_defaults_unmodified: yes | no | unknown
+  accessibility_tooling:
+  visual_test_tooling:
+  design_structural_signals:
+  evidence_quotes:
+  uncertainty:
+END
+```
+### Phase 0I — AI, Agent, and Model Surface Inventory
+Delegate to Explorer.
+Scope: AI/LLM/agent functionality only.
+Deterministic skip rule: skip only if Phase 0B found no AI-related file, directory, or symbol names (ai, llm, prompt, agent, model, openai, anthropic, embedding, vector, rag, mcp, tool, eval) AND Phase 0C found no AI-related packages. If either signal exists, run Phase 0I.
+Tasks:
+1. Identify model calls, prompt templates, system prompts, tool definitions, function-calling schemas, MCP servers, autonomous agent loops, memory, retrieval, embeddings, vector stores, evaluators, moderation, content filters, and output parsers.
+2. Identify any user-controllable content that enters prompts or tools.
+3. Identify any model output that flows into code execution, database writes, network calls, browser rendering, files, shell commands, or user-visible authoritative claims.
+4. Identify rate limits, token limits, budget limits, retries, timeouts, and circuit breakers if visible.
+5. Identify MCP-specific surfaces: registered tool descriptions that include prose the model will read, tool parameter schemas, server-to-server chains, and whether untrusted content from external sources can enter tool descriptions or resource outputs.
+Output:
+```
+AI_SURFACE
+  id: AI-001
+  kind: prompt | model_call | tool | agent_loop | mcp | mcp_tool_description | retrieval | embedding | vector_store | parser | evaluator | memory | moderation | other
+  file:
+  line:
+  exact_quote:
+  user_controlled_inputs:
+  model_outputs:
+  downstream_sinks:
+  permissions_or_limits:
+  linked_trust_boundaries:
+  mcp_chain_depth: <number of MCP servers in chain if applicable>
+  uncertainty:
+END
+```
+### Phase 0J — Architect Inventory Synthesis
+Architect synthesizes Phase 0 outputs. Do not add unquoted repo facts.
+Create `source-of-truth-packet.md` and `ledgers/inventory-summary.md`.
+Before writing the summary, verify every required Phase 0 ledger exists and is non-empty. If a ledger is not applicable, create it with an explicit `NOT_APPLICABLE` reason.
+Minimum adequacy gate: if fewer than five non-`NOT_APPLICABLE`, non-empty structured blocks exist across all applicable Phase 0 ledgers, or if the inventory is too sparse to support the selected review scope, stop and report the limitation.
+Claim synthesis duties:
+- Convert raw Phase 0D claims into testable predicates now, after having access to public surfaces, manifests, trust boundaries, tests, UI, and AI inventory.
+- Assign likely verification targets only when supported by Phase 0E-0I evidence.
+- Assign `risk_if_false` only after considering user impact, public surface exposure, and trust boundaries.
+- Summarize NON_TESTABLE_CLAIM entries under Unknowns.
+The source-of-truth packet must contain only Phase 0 facts and must include:
+```markdown
+# Source of Truth Packet
+## Repo Identity
+[repo name, branch, git HEAD SHA, review type]
+## Tech Stack
+[languages, runtimes, frameworks, package managers]
+## Commands
+[install, lint, typecheck, test, build, run commands with evidence]
+## Public Surfaces
+[IDs and one-line descriptions]
+## Trust Boundaries
+[IDs and one-line descriptions]
+## MCP and Agent Surfaces
+[IDs, descriptions, and chain depth]
+## Claims Needing Verification
+[top claim IDs and predicates]
+## Test and Quality Gates
+[test frameworks and CI gates]
+## UI Applicability
+[whether UI review applies and why; whether component library defaults appear unmodified]
+## AI/Agent Applicability
+[whether LLM/agent review applies and why]
+## Review Track Recommendation
+[architect recommendation]
+## Prohibited Assumptions
+- Do not assume facts not present in this packet or quoted from source.
+- Do not assume a dependency exists unless manifest/lock/import evidence proves it.
+- Do not assume a feature works because docs claim it.
+- Do not assume a UI exists unless Phase 0H says it does.
+- Do not assume MCP tool descriptions are trusted input.
+```
+---
+## Phase 0K — User Review Mode Gate
+Stop after Phase 0J. Ask the user which review track or tracks to run.
+Do not proceed until the user selects a scope, unless the user's original instruction explicitly already selected tracks and explicitly told you not to ask.
+Present the choices:
+```
+Phase 0 inventory is complete. Based on the repository shape, I recommend:
+[Architect recommendation grounded in Phase 0 evidence]
+Choose review scope:
+1. Complete Integrated Review — all defect-focused tracks plus enhancement opportunities.
+2. Defect-Focused Comprehensive QA — all defect tracks, no enhancement catalog.
+3. Security and Supply Chain Focus — AppSec, LLM/MCP security, dependency integrity, CI provenance.
+4. Functionality and Correctness Focus — claims-vs-shipped, wiring, edge cases, business logic.
+5. Testing and Test Quality Focus — behavioral coverage, test drift, mutation resilience, property-based gaps.
+6. UI/UX and Accessibility Focus — visual hierarchy, interaction design, WCAG 2.2 AA, typography, polish, performance, design system, AI-slop UI patterns.
+7. Performance and Observability Focus — runtime performance, resource use, startup, telemetry, logs, metrics, traces.
+8. AI Slop and Code Provenance Focus — hallucinated APIs, phantom dependencies, confident stubs, slopsquatting, context rot, stale API usage.
+9. Enhancement Opportunities Only — architecture, quality, DX, performance, resilience, observability, UI/UX improvements. Not a bug hunt.
+10. Custom Combination — specify any combination or narrower subsystem.
+Please select one or more options.
+```
+If the user selects a focused review, do not run unrelated tracks. Mention omitted tracks in coverage notes.
+---
+## Phase 1 — Selected Track Candidate Generation
+Phase 1 generates candidates, not truth. Phase 1 obeys the global concurrency policy.
+Every Phase 1 agent dispatch must include:
+- selected review track(s) for that dispatch
+- exact file list or public surface IDs in scope
+- `source-of-truth-packet.md`
+- relevant Phase 0 ledger excerpts for claims, surfaces, boundaries, tests, UI, or AI surfaces
+- the candidate output format
+- explicit instruction that out-of-scope issues should be recorded as `out_of_scope_note` rather than emitted as candidates
+- a reminder of the anti-cursory contract: selecting this track means exhaustive depth for it
+File-size rule:
+- `dense file` = a file over 300 logical lines, a file with multiple unrelated responsibilities, or a file with interleaved UI/state/network/security logic.
+- Default: no more than 15 files per deep pass; no more than 8 dense files per deep pass.
+- No sampling inside an assigned scope.
+Classification tiebreaker:
+- If a candidate could be either a defect or an enhancement, ask: would shipping the code as-is mislead a user, expose a security or privacy risk, lose data, break a documented/public behavior, or produce wrong behavior?
+- If yes, emit a `CANDIDATE_FINDING`.
+- If no, emit an `ENHANCEMENT_CANDIDATE`.
+- Do not emit the same root issue in both formats.
+### Candidate Finding Format
+```
+CANDIDATE_FINDING
+  id: <track>-<scope>-<sequence>
+  track: functionality | security | supply_chain | testing | ui_ux | performance | observability | ai_slop | docs_claims | cross_platform | cross_boundary
+  group: <short category>
+  provisional_severity: CRITICAL | HIGH | MEDIUM | LOW | INFO
+  confidence: HIGH | MEDIUM
+  file: <relative path>
+  line: <line or range>
+  exact_quote: <verbatim evidence>
+  title: <specific one-line title>
+  problem: <factual description>
+  impact: <why it matters>
+  likely_fix: <concrete likely remediation>
+  evidence_checked: <files, callers, configs, tests, docs, manifests, runtime paths checked>
+  alternative_interpretation: <what could make this wrong>
+  disproof_attempt: <required for CRITICAL/HIGH; recommended for all>
+  linked_claims: <claim ids or N/A>
+  linked_surfaces: <surface ids or N/A>
+  linked_boundaries: <boundary ids or N/A>
+  ai_pattern: <optional>
+  needs_runtime_validation: yes | no
+  size: S | M | L
+END
+```
+### Enhancement Candidate Format
+```
+ENHANCEMENT_CANDIDATE
+  id: ENH-<track>-<sequence>
+  track: enhancement | architecture | code_quality | testing | ui_ux | performance | observability | resilience | developer_experience
+  domain: <specific subsystem or component family>
+  category: architecture | code_quality | simplification | developer_experience | performance | resilience | observability | ui_hierarchy | ui_interaction | ui_accessibility | ui_typography | ui_performance | ui_consistency | testing
+  value_level: high | medium | low
+  confidence: HIGH | MEDIUM
+  file: <relative path>
+  line: <line or range>
+  exact_quote: <verbatim current-state evidence>
+  title: <specific one-line title>
+  current_state: <what exists now, without calling it broken>
+  confirms_current_code_is_working: yes | no
+  enhancement: <specific implementable improvement>
+  expected_impact: <what improves>
+  effort: S | M | L
+  dependencies: <other enhancement ids or N/A>
+  alternative_interpretation: <why the current design might be intentional>
+  disproof_attempt: <required for HIGH-confidence high-value candidates; recommended for all>
+  rejection_risk: <what would make this a bad suggestion>
+END
+```
+---
+### Track A — Functionality, Correctness, and Claims-vs-Shipped
+Run if user selected options 1, 2, 4, or a custom scope requiring behavior review.
+**Anti-cursory contract for Track A:** Build a coverage unit for every public surface from Phase 0E. Every surface must be traced from entry point to implementation. A surface marked REVIEWED must have had its entry point read, its implementation traced, its tests checked, and its claims from Phase 0D compared against the implementation. Closing the coverage matrix is required before synthesis.
+**Agent lens:** shipped behavior correctness. Does the code do what it claims and what it documents?
+**Required method for each surface:**
+1. Pick a public surface from Phase 0E.
+2. Link any claims from Phase 0D.
+3. Trace from entry point through routing/wiring to implementation.
+4. Extract obligations first (what docs/claims say should happen).
+5. Summarize implemented behavior second.
+6. Compare obligations to implementation third.
+7. Check tests for behavioral assertions on this surface.
+8. Emit only grounded candidates.
+**Check:**
+*Wiring and reachability:*
+- Route, command, job, hook, plugin, and export wiring — does the registered path lead to an actual handler?
+- Unreachable code and dead branches in public behavior paths
+- Exported symbols with no consumers and no documented extension intent
+- Handler registered but not called, called but wrong arguments, wrong return value forwarding
+*Claim vs. implementation:*
+- Documented feature claims versus actual code paths
+- "Supports X" claims with no supporting implementation
+- Default values in docs that differ from default values in code
+- Removed behavior still documented as present
+- Parameters, option names, env vars, schema fields, and response fields mismatched between docs and implementation
+*Logic correctness:*
+- Off-by-one logic and boundary conditions
+- Integer overflow or underflow where input is externally controlled
+- Floating-point comparison where equality is asserted
+- Signed/unsigned mismatch in comparisons or arithmetic
+- Wrong operator precedence in complex boolean expressions
+- Null/undefined not handled where the value may be absent
+- Early returns that skip required side effects
+*Async correctness:*
+- Missing awaits (promise returned but not awaited)
+- Ignored promise return values (fire-and-forget where failure matters)
+- Race conditions in shared state accessed by concurrent async paths
+- Sequential awaits where order matters but is not enforced
+- Error swallowed inside async then/catch when caller needs it
+- Unhandled promise rejections in event listeners or callbacks
+*Data model and persistence:*
+- Data model mismatches across persistence layer, API layer, and UI layer
+- Migration or schema drift (new column in docs but not in migration file, or vice versa)
+- Serialization and deserialization that silently drops fields
+- JSON parse/stringify round-trip loss
+- Feature flag or config behavior drift
+- State machine edge cases: missing transitions, invalid state combinations, missing final states
+*Cross-platform:*
+- Code claiming portability but using platform-specific APIs (path separators, signals, shell-isms)
+- Environment assumptions that break on Windows/macOS/Linux differences
+*Happy-path-only:*
+- Error handling that claims recovery but only logs or swallows
+- Input validation that accepts empty, null, oversized, or malformed values without handling them
+- Network timeout handling missing or set to unbounded
+---
+### Track B — Security, Privacy, LLM Security, and Supply Chain
+Run if user selected options 1, 2, 3, or a custom security scope.
+**Anti-cursory contract for Track B:** Build a coverage unit for every trust boundary from Phase 0F and every AI surface from Phase 0I. Every boundary and AI surface must be reviewed. A boundary marked REVIEWED must have had its source, guard, sink, and impact traced. An AI surface marked REVIEWED must have had its user-controlled input paths and downstream sinks traced.
+**Agent lens:** exploitable or protection-relevant risk.
+**Frameworks:**
+- OWASP ASVS 4.0.3 as the verifiable AppSec checklist baseline for web application controls
+- OWASP Top 10 for LLM Applications 2025: LLM01–LLM10 as listed in the State-of-the-Art Anchors
+- SLSA Version 1.2 for supply-chain provenance and verification
+- OpenSSF Scorecard for repository hygiene checks
+**Required method:**
+1. Start from Phase 0F trust boundaries and Phase 0I AI surfaces.
+2. For each candidate, identify: attacker-controlled input → insufficient guard → sensitive sink → impact.
+3. If exploitability depends on runtime behavior, run a safe minimal validation or mark UNVERIFIED.
+4. For dependency candidates, verify against manifests, lockfiles, imports, and registry evidence when safe.
+**Application security checks:**
+*Injection:*
+- SQL injection via string concatenation, template interpolation, or ORM raw query misuse
+- Command injection via unsanitized input in shell.exec, subprocess, eval, or dynamic code execution
+- Path traversal via unsanitized file paths (../../ attacks, null bytes, URL-encoded sequences)
+- SSRF via user-controlled URLs in fetch, HTTP client, redirect, webhook, or import
+- Template injection via unsanitized input in template engines (Handlebars, Jinja2, EJS, Pug)
+- DOM-based XSS via innerHTML, document.write, dangerouslySetInnerHTML, or eval with user input
+- LDAP, XML, XPath injection where those parsers are in use
+- Header injection via unsanitized values in response headers
+- Log injection via unsanitized user input in log statements that attackers could use to forge log entries
+*Authentication and authorization:*
+- Missing authentication on routes/handlers that claim or imply protection
+- Inconsistent authorization: enforced in one path but not in sibling or alternative path
+- Horizontal privilege escalation: user can access another user's resources by changing an ID
+- Vertical privilege escalation: lower-privileged user can invoke higher-privileged action
+- JWT algorithm confusion (none algorithm, RS256 vs HS256 confusion)
+- Token/session not invalidated on logout or password change
+- Authentication bypass via mass assignment, parameter pollution, or HTTP method override
+- Insecure direct object reference without ownership check
+- CSRF missing where state-changing operations use cookies or sessions
+- CORS misconfiguration: wildcard origin with credentials, or overly permissive allow-origin
+*Secrets and sensitive data:*
+- Hardcoded secrets, tokens, credentials, private keys, API keys, or passwords in source
+- Sensitive defaults (default admin/admin, empty string passwords)
+- Credentials or PII logged in plaintext (including in telemetry, error messages, or debug output)
+- API keys or tokens in client-side code, public assets, or URLs
+- Sensitive data in HTTP responses that should not be returned
+- Insecure cookie flags: missing HttpOnly, Secure, or SameSite attributes
+*Cryptography:*
+- Weak hashing for passwords (MD5, SHA1, unsalted SHA256; require bcrypt/argon2/scrypt)
+- Weak randomness for security-sensitive values (Math.random(), time-based seeds)
+- Insecure transport: HTTP used for security-sensitive operations, TLS version pinned to old versions
+- Predictable token generation or insufficient entropy for session IDs
+- Crypto misuse: ECB mode, fixed IVs, reused nonces, unauthenticated encryption
+*File and process security:*
+- Unsafe file upload: missing extension validation, missing content-type validation, missing size limits, files saved to web-accessible paths, archive extraction without path normalization (zip slip)
+- Unsafe subprocess: shell: true with user input, argument injection via array spreading
+- Symlink attacks in file handling
+*Input validation and output encoding:*
+- Inputs accepted without schema validation
+- Inputs validated but not sanitized before passing to sinks
+- Output not encoded for the context it is rendered in (HTML, SQL, shell, URL, JSON)
+*Prototype pollution and object merging:*
+- `Object.assign`, `_.merge`, `lodash.merge`, `deepmerge`, spread operators applied to untrusted input
+- JSON.parse result used as object keys without validation
+- `__proto__`, `constructor`, `prototype` keys not filtered from user input
+**LLM and agent security (OWASP LLM 2025):**
+*LLM01 — Prompt injection:*
+- Direct injection: user input processed as instructions without separation from system instructions
+- Indirect injection: content from external sources (web pages, documents, tool outputs, database records, emails) entering the prompt context where it could contain adversarial instructions
+- Injection via tool outputs: tool call results that contain embedded instructions processed by the model
+- Instruction override attempts via role-play, "ignore previous instructions", jailbreaks
+- System prompt extraction attempts via carefully constructed user queries
+*LLM02 — Sensitive information disclosure:*
+- System prompt contents exposed to users (directly or via extraction)
+- PII or proprietary data leaking through model completions
+- API keys, connection strings, or credentials present in system prompts or RAG context
+- Internal architecture details exposed through model responses
+*LLM03 — Supply chain:*
+- LLM provider or model version not pinned (model behavior can change on API side)
+- Third-party prompt templates or agent frameworks used without validation
+- Plugin or tool integrations from untrusted sources
+*LLM04 — Data and model poisoning:*
+- User-supplied content writing to training datasets, fine-tuning pipelines, or embedding stores
+- RAG documents sourced from user-controlled or untrusted content without sanitization
+- Embedding poisoning: adversarial content crafted to manipulate retrieval
+*LLM05 — Improper output handling:*
+- Model output used directly as shell commands, SQL queries, or code to execute
+- Model output rendered as HTML without sanitization
+- Model output trusted as authoritative fact without verification
+- Structured outputs (JSON, code) from models parsed without schema validation
+*LLM06 — Excessive agency:*
+- Agent tools with broader permissions than the task requires (excessive functionality)
+- Agent operating with system-level or production privileges for tasks that only need read access (excessive permissions)
+- High-impact actions (file deletion, email send, API calls, code deployment) proceeding without human-in-the-loop confirmation (excessive autonomy)
+- Agent has access to multiple systems when it only needs one
+*LLM07 — System prompt leakage:*
+- System prompt reconstruction via model introspection
+- System prompt stored in client-accessible locations
+- Sensitive instructions (internal logic, security rules, competitor names) embedded in system prompts without leakage controls
+*LLM08 — Vector and embedding weaknesses:*
+- Untrusted documents written to vector stores without sanitization
+- Vector similarity search results trusted without provenance verification
+- Embedding inversion risks for sensitive data stored in vector stores
+- RAG retrieval injection: crafting content to manipulate what gets retrieved
+*LLM09 — Misinformation:*
+- Model output presented as authoritative without hallucination detection or uncertainty signaling
+- Factual claims generated by models without grounding in retrieved or verified sources
+*LLM10 — Unbounded consumption:*
+- No rate limits on model API calls
+- Context flooding: user input that causes unbounded token usage
+- Recursive agent loops with no termination condition
+- Missing cost budgets or circuit breakers for AI operations
+**MCP-specific attack vectors (2026):**
+*Tool poisoning:*
+- MCP tool descriptions contain prose the model reads; if that prose is untrusted or externally loaded, it is an injection surface
+- Tool description metadata that instructs the model to prefer this tool over safer alternatives
+- Tool parameter descriptions that suggest unsafe parameter values
+- Hidden instructions in tool schema `description` fields
+*Data exfiltration via AI context:*
+- Sensitive data (DB schemas, API configs, PII) loaded into model context and then passed to external tool calls
+- MCP server logs that accumulate sensitive context from AI sessions
+- Context carryover between requests that should be isolated
+*MCP server chain lateral movement:*
+- Server A (lower-trust, e.g., code repo) chained to Server B (CI/CD) chained to Server C (production)
+- A compromise or injection in Server A can instruct the AI to make calls through the chain to higher-privilege servers
+- Inadequate isolation between MCP server identities in multi-server configurations
+- Missing per-server permission scoping (all servers share one permission set)
+*Missing MCP controls:*
+- No allow-list of approved MCP servers
+- MCP server connections accepted from arbitrary URLs without validation
+- No per-session or per-request permission scoping for MCP tool calls
+- No anomaly detection on MCP request/response patterns
+**Supply chain:**
+*Dependency integrity:*
+- Packages imported but not declared in manifest (phantom imports)
+- Packages declared but with version ranges that allow major version drift (`*`, `latest`, `^` on 0.x)
+- Packages that sound like well-known packages but are slightly different (typosquatting, dependency confusion)
+- Package names that appear in AI-generated code but do not exist in registries (slopsquatting) — check the USENIX research: 19.7% of LLM-recommended packages are fabricated
+- `postinstall`, `preinstall`, or `prepare` scripts in dependencies that execute arbitrary code
+- Binary downloads in install scripts from non-pinned or non-verified URLs
+- Native bindings or addons with privileged system access
+*Build and release integrity:*
+- CI that publishes artifacts without SLSA provenance attestation
+- Artifact signing absent or unverified at deployment
+- Build credentials (deploy keys, NPM tokens, signing keys) with excessive scope
+- Release process that runs untrusted input in privileged CI context
+- Workflow injection: `${{ github.event.pull_request.head.repo.full_name }}` or similar dynamic values in `run:` steps
+- Third-party actions used without pinning to commit SHA
+- Missing dependency update tooling (Dependabot, Renovate) for CVE response
+*Repository hygiene (OpenSSF Scorecard checks):*
+- Branch protection: no required reviews, no required status checks
+- Token permissions not explicitly scoped in workflow files
+- Dangerous workflow patterns: pull_request_target with checkout of untrusted PR code
+---
+### Track C — Testing and Test Quality
+Run if user selected options 1, 2, 5, or a custom testing scope.
+**Anti-cursory contract for Track C:** Build a coverage unit for every public surface and every high-risk trust boundary. Every unit must be reviewed for behavioral test coverage. A unit marked REVIEWED must have had its tests (or lack thereof) read, and the assertion quality assessed — not just whether a test file exists.
+**Agent lens:** whether tests would catch real regressions if the behavior changed.
+**Required method:**
+1. Link each testing candidate to a public surface, claim, trust boundary, or critical behavior from Phase 0.
+2. State what regression could escape with the current test.
+3. Identify the smallest test improvement that would catch it.
+4. If possible, run the relevant test command to observe what it actually asserts.
+**Coverage and behavioral assertions:**
+*Missing test coverage:*
+- Public behavior surfaces with no test at any level (unit, integration, e2e)
+- High-risk trust boundaries with no auth/authz test
+- Security-sensitive paths (auth, permissions, secrets handling) with no negative test
+- Migration/schema changes with no before/after state test
+- Config parsing with no test for missing, invalid, or boundary-value configs
+- Error handling paths with no test that the error is surfaced correctly
+- Critical background jobs, queues, or scheduled tasks with no integration test
+*Test quality — behavioral vs. implementation:*
+- Tests that only assert the mock was called rather than asserting the behavioral outcome
+- Tests that verify internal implementation details (private method called, specific log output emitted) rather than external behavior
+- Tests that pass as long as no exception is thrown, without asserting a meaningful return value or state change
+- Tests with assertions broad enough to pass even if behavior changes (e.g., `expect(result).toBeTruthy()`)
+- Snapshot tests that capture implementation artifacts rather than behavioral contracts — easy to update without understanding the change
+- Tests that import and directly call private/internal modules rather than the public API they are supposed to test
+*Fixture and schema drift:*
+- Test fixtures that no longer match current schema structure or default values
+- Mock return values that no longer represent what the real implementation returns
+- Hardcoded test data that encodes outdated business rules
+- Snapshot files out of sync with current component output
+- Database fixtures that assume old migration state
+*Test reliability:*
+- Time-dependent tests (assertions on exact timestamps, `Date.now()`, clock-dependent logic without mocking)
+- Path-dependent tests (hardcoded local paths, home directory assumptions)
+- Network-dependent tests without offline fallback or VCR cassettes
+- Order-dependent tests (later test depends on state left by earlier test)
+- Shared mutable state between tests without cleanup
+- Flaky concurrency patterns (sleep(N) as synchronization, untimed promise resolution)
+*Test completeness — missing negative and edge cases:*
+- No test for empty input where the function handles it
+- No test for the maximum or minimum valid value
+- No test for input at exactly the boundary (N and N+1 both tested)
+- No test for concurrent access where shared state could be corrupted
+- No test for partial success (operation succeeds for some items, fails for others)
+- No test for authentication failure (valid auth tested, missing invalid auth test)
+- No test for authorization boundary (owner tested, non-owner not tested)
+*Mutation resilience:*
+- Off-by-one mutations (`<` vs `<=`, `>` vs `>=`) that tests do not catch
+- Boolean condition flip mutations (missing `not` equivalent test)
+- Null vs non-null mutations (missing null path test)
+- Return value mutations (function returns wrong thing, but test only checks side effect)
+- Identify high-risk logic where a simple one-line mutation would not fail any test
+*Property-based testing opportunities:*
+- Input parsers and serializers (invariant: parse(serialize(x)) === x)
+- Data transformations with mathematical properties (commutativity, associativity, idempotency)
+- Permission systems (any combination of valid inputs should produce a consistent authz result)
+- State machines (transitions from valid states should never reach invalid states)
+- Fuzz-worthy trust boundary inputs (all inputs from Phase 0F that accept user-controlled data)
+*Framework misuse:*
+- `jest.mock()` or equivalent hoisted in ways that affect test isolation unexpectedly
+- `beforeAll` vs `beforeEach` misuse where state leaks between tests in the same suite
+- Async test without returning the promise or using `done` correctly
+- Testing a singleton or module with cached state that should be reset between tests
+Test drift rule: touched or discussed tests must be checked against current and intended behavior, not just syntax. A passing test is not enough if it asserts the wrong behavior.
+---
+### Track D — UI/UX and Accessibility
+Run if user selected options 1, 2, 6, or a custom UI scope, but only when Phase 0H found UI evidence.
+Skip if Phase 0H found no UI. Record the skip in coverage notes.
+**Anti-cursory contract for Track D:** Build a coverage unit for every UI component family from Phase 0H. All six passes must complete for each component family in scope. A unit marked REVIEWED must have had its component files actually read, not just inferred from filenames.
+If a designer agent exists, use designer for Passes D1, D2, D3, D4, and D6. Use explorer for Pass D5.
+**Accessibility baseline:** WCAG 2.2 AA.
+**AI-aesthetic baseline (applies to all UI passes):**
+Do not apply generic AI-generated-UI aesthetic tells as aesthetic criticism. Cite evidence, not vibes. However, flag when a UI exhibits these specific evidence-backed patterns that indicate unmodified AI-scaffold defaults:
+- "VibeCode Purple" (a specific lavender-purple in the range `hsl(250-270, 50-80%, 55-70%)`) as the primary brand color with no apparent intentional choice
+- Unmodified shadcn/ui or similar component library defaults with no design token customization layer (Phase 0H will have flagged this)
+- Gradients applied to more than 30% of UI surfaces without a coherent design rationale
+- All-caps headings and section labels as a dominant typographic pattern
+- Identical feature cards with icon-on-top layout as the sole layout primitive
+- Numbered "1, 2, 3" step sequences as the dominant content structure
+- Sidebar or nav with emoji icons as the primary navigational metaphor
+- Color-coded border-left or border-top on cards as the dominant differentiation pattern
+- Medium-grey body text on dark backgrounds that barely passes contrast but lacks intentionality
+The test is not "does this look AI-generated?" The test is: can you quote exact CSS values, class names, or component code that shows the pattern, and can you show the pattern is unintentional rather than designed? If yes, flag it with evidence.
+**Pass D1 — Visual Hierarchy and Layout:**
+Delegate to designer. Read every component file, every layout file, every page/route file.
+Format for each finding:
+```
+[UI-HIER-N] Title
+Screen/Component: [exact file path + component name]
+Current State: [what exists now — quote class names, styles, or structure]
+Enhancement: [specific, implementable improvement]
+User Impact: [how the user experience improves]
+Effort: [Low | Medium | High]
+```
+Evaluate:
+- Is there a clear primary action on every screen? Does it visually read as primary (weight, color, size, position)?
+- Do typographic heading levels (h1/h2/h3/font-size/font-weight) match the content hierarchy?
+- Is whitespace used intentionally to group related elements and separate unrelated ones?
+- Are layout patterns consistent across screens, or does each screen use a different structural approach?
+- What happens with realistic data extremes: very long strings, empty states, single-item lists, 1000-item lists?
+- Are empty states designed with messaging, guidance, and a call to action, or are they just blank/null?
+- Does the visual hierarchy change at different viewport sizes in a way that preserves content priority?
+- Are density and information architecture appropriate for the user's task complexity?
+**Pass D2 — Interaction Design and Feedback:**
+Delegate to designer. Read every component file, every interaction handler, every form.
+Format for each finding:
+```
+[UI-INT-N] Title
+Screen/Component: [exact file path + component name]
+Current State: [what exists now]
+Enhancement: [specific, implementable improvement]
+User Impact: [how the user experience improves]
+Effort: [Low | Medium | High]
+```
+Evaluate:
+- Do all interactive elements provide visual feedback for hover, active/pressed, focus, and disabled states?
+- Are loading states present for all async operations? Are they specific to the operation or generic spinners?
+- Are success and error states visually distinct and clearly communicated to the user?
+- Is there confirmation or undo opportunity before destructive actions?
+- Are form validation messages specific and actionable, or generic ("field is required", "invalid input")?
+- Are there interaction flows that could be fewer steps, have smarter defaults, or reordered for common paths?
+- Do transitions or animations help users understand what changed (state transitions, panel slides, expansion), or are they purely decorative?
+- Are there missing transitions that would help orient users during state changes?
+- Does the UI provide optimistic updates for operations that can be safely assumed to succeed?
+- Are there keyboard shortcuts for power-user workflows, and are they discoverable?
+- For forms: does the submit button become enabled/disabled correctly based on validity?
+**Pass D3 — Accessibility:**
+Delegate to designer. Read every component file, every stylesheet, every interactive element.
+Format for each finding:
+```
+[UI-A11Y-N] Title
+WCAG Criterion: [e.g., 1.4.3 Contrast Minimum, 2.1.1 Keyboard, 4.1.2 Name, Role, Value]
+Screen/Component: [exact file path + component name]
+Current State: [what exists now — quote the problematic code or style]
+Enhancement: [specific, implementable improvement]
+User Impact: [who benefits and how]
+Effort: [Low | Medium | High]
+```
+Evaluate:
+- Are all interactive elements reachable by keyboard alone? (Tab, Shift+Tab, Enter, Space, Arrow keys)
+- Is the tab order logical and predictable? Does it follow the visual reading order?
+- Do all images, icons, and non-text elements have meaningful alternative text (not just file names or empty alt="")?
+- Color contrast: body text 4.5:1, large text 3:1, UI components and graphics 3:1. Cite exact computed values where possible.
+- Are form inputs labeled with visible labels, not just placeholder text (which disappears on focus)?
+- Are error messages programmatically associated with their inputs (aria-describedby or aria-errormessage)?
+- Are dynamic state changes announced to screen readers (aria-live="polite", role="status", aria-live="assertive" for urgent)?
+- Are touch targets at least 44×44px for all interactive elements (WCAG 2.5.8 target size)?
+- Are there color-only indicators (error = red only) that need a secondary visual cue (icon, pattern, or text)?
+- Are modal dialogs, drawers, and menus trapping focus correctly (focus stays inside until closed)?
+- Is there a skip-to-main-content link for keyboard users on pages with repetitive navigation?
+- Are custom interactive widgets (sliders, tabs, accordions, comboboxes, date pickers) using correct ARIA roles and states?
+- Is prefers-reduced-motion respected for animations and transitions?
+- Does text resize to 200% without horizontal scrolling or loss of content? (WCAG 1.4.4)
+**Pass D4 — Typography and Visual Polish:**
+Delegate to designer. Read every component file, every stylesheet or theme file, every design token file.
+Format for each finding:
+```
+[UI-VIS-N] Title
+Category: [Typography | Color | Spacing | Polish]
+Screen/Component: [exact file path + component name]
+Current State: [quote exact values — font sizes, weights, colors, spacing]
+Enhancement: [specific, implementable improvement]
+User Impact: [how the experience improves]
+Effort: [Low | Medium | High]
+```
+Evaluate:
+- Is there a named, consistent type scale (e.g., 12/14/16/18/24/32px or a modular scale)? Or are font sizes arbitrary across components?
+- Is negative letter-spacing applied at display/heading sizes? (Headings generally need tighter tracking at large sizes; body text should not be tracked)
+- Are body text line lengths within 45–75 characters for comfortable reading?
+- Is line height appropriate for the font in use? (Body typically 1.4–1.6; display 1.0–1.2)
+- Is the font weight scale meaningful? Does it distinguish body (400), emphasis (500–600), and headings (600–700+)?
+- Is monospace type used consistently and only where appropriate (code, commands, IDs, data values)?
+- Is the same semantic element (e.g., card title, navigation item, inline code) styled consistently everywhere?
+- Is text truncation and overflow handled gracefully (ellipsis with title tooltip, explicit wrapping strategy)?
+- Is the color palette applied consistently — same semantic color for the same semantic meaning (error = red, always the same red)?
+- Are border radii, shadow depths, and spacing values from a token system or arbitrary per-component?
+- Are hardcoded hex values, spacing units, or radius values that could be design tokens cited for extraction?
+- Are there places where the visual polish diverges significantly between different sections of the UI, suggesting inconsistent generation sessions?
+**Pass D5 — UI Performance and Perceived Performance:**
+Delegate to explorer. Read every component file, every data-fetching hook, every list rendering pattern.
+Format for each finding:
+```
+[UI-PERF-N] Title
+Category: [Render Performance | Asset Optimization | Perceived Performance | Animation | Native/IPC]
+Screen/Component: [exact file path + component name]
+Current State: [quote code where helpful]
+Enhancement: [specific, implementable improvement]
+User Impact: [how the experience improves]
+Effort: [Low | Medium | High]
+```
+Evaluate:
+- Are there components re-rendering on every parent update that could be memoized (React.memo, useMemo, useCallback)?
+- Are expensive calculations (sorting, filtering, mapping large arrays) happening inline during render without caching?
+- Are large lists (>50 items) rendered unconditionally instead of virtualized?
+- Are images and assets loaded at correct sizes for their display context? Are they using modern formats (WebP, AVIF)?
+- Are perceived-performance patterns in use? (Optimistic updates, skeleton loaders, progressive disclosure, speculative prefetching)
+- Are any animations/transitions animating layout properties (width, height, top, left, margin) instead of transform/opacity (which cause reflow/repaint)?
+- Is the first meaningful content visible quickly, or is there a blank/spinner period before anything appears?
+- For Tauri/Electron/native apps: is expensive work offloaded from the main thread? Are IPC calls batched to reduce round-trips? Are large IPC payloads streamed rather than sent as one blob? Are native transitions handled with skeleton states rather than blocking?
+- Are code-splitting boundaries in place so the initial bundle only loads what is needed?
+- Are lazy imports used for heavy routes, modals, or features?
+**Pass D6 — Consistency and Design System Alignment:**
+Delegate to designer. Read every component file, every stylesheet, every shared UI utility.
+Format for each finding:
+```
+[UI-CON-N] Title
+Category: [Pattern Consistency | Design Token | Component Extraction | Mental Model | AI-Aesthetic]
+Screen/Component: [exact file path + component name]
+Current State: [what exists now]
+Enhancement: [specific, implementable improvement]
+User Impact: [how the experience improves]
+Effort: [Low | Medium | High]
+```
+Evaluate:
+- Are equivalent UI patterns implemented differently in different parts of the application (e.g., one list uses a table, another uses a card grid, another uses a custom layout — for the same data shape)?
+- Are there hardcoded style values (hex colors, px spacing, border-radius values) that should reference design tokens?
+- Are there component variants that diverge unnecessarily when they could share a base component?
+- Are there repeated UI patterns that could be extracted into reusable components but aren't?
+- Is the navigation structure consistent and predictable — does the same navigation pattern appear on all screens?
+- Are there places where the interface's mental model doesn't match how users think about the task (e.g., a "send" action that actually stages, or a "save" action that auto-publishes)?
+- AI-aesthetic audit: apply the AI-aesthetic baseline patterns listed in the Track D preamble. For each pattern found, cite exact file and code evidence, and assess whether it is an unintentional default or a deliberate design decision.
+---
+### Track E — Performance and Observability
+Run if user selected options 1, 2, 7, or a custom performance/observability scope.
+**Anti-cursory contract for Track E:** Build a coverage unit for every hot path and every operational path identified in Phase 0. Every path must be reviewed. A path marked REVIEWED must have had its implementation read, its resource usage assessed, and its telemetry coverage noted.
+**Agent lens:** runtime efficiency and production visibility.
+**Observability baseline:** OpenTelemetry traces, metrics, and logs as first-class signals.
+**Required method:**
+1. Identify the hot path or operational path.
+2. Quote the code causing repeated work, missing telemetry, or unsafe resource behavior.
+3. State whether the issue is proven, probable, or requires profiling.
+4. Do not invent performance impact. If impact is not measured, label it qualitative.
+**Performance checks:**
+*Computational:*
+- Loops iterating over data multiple times where a single pass would suffice
+- `O(n²)` or worse algorithms where the input can grow (nested loops over the same collection)
+- Repeated parsing, serialization, compilation, or IO in loops or hot paths
+- N+1 database, network, or filesystem access (fetching one-at-a-time inside a loop)
+- Missing memoization for expensive pure computations called repeatedly with same inputs
+- Synchronous critical-path work that blocks the event loop (sync file reads, sync crypto)
+- Regex recompilation on every call (creating `new RegExp()` inside a loop)
+- Unnecessary deep cloning of large objects where shallow copy or reference would suffice
+*Memory:*
+- Objects retained longer than their usage scope (closures capturing large contexts unnecessarily)
+- Missing cleanup for subscriptions, timers, event listeners, or file handles (memory/resource leaks)
+- Data structures mismatched to access patterns (array linear scan where Map/Set lookup is needed)
+- Growing unbounded collections (event logs, caches, in-memory queues without eviction)
+- Circular references preventing garbage collection
+*Async and concurrency:*
+- Sequential awaits in series where `Promise.all` or `Promise.allSettled` could parallelize safely
+- Missing caching for repeated network, filesystem, or database reads in the same request lifecycle
+- Unbounded concurrency fanout with no throttle (spawning N parallel requests without a concurrency limiter)
+- Missing backpressure for streaming operations or queue consumers
+- Blocking the main thread in Electron/Tauri with large computations (use worker threads or IPC to background)
+- IPC call-per-item patterns that could be batched into a single IPC call
+*Startup and bundle (if applicable):*
+- Heavy synchronous initialization in module scope that delays startup
+- Full library imports where only a small subset is used (import full lodash, full moment)
+- Missing tree-shaking-friendly export patterns
+- Synchronous filesystem reads at startup that could be deferred or cached
+- Missing code-splitting for large routes or features
+*AI/LLM performance:*
+- Unbounded model API calls with no concurrency limit
+- Context payloads that grow unboundedly with session length
+- Repeated embedding or completion calls for identical inputs without caching
+- Token budget not enforced, allowing unexpectedly large responses to accumulate cost
+**Observability checks:**
+*Logging:*
+- Key operations completing with no trace in logs (successful auth, data mutations, background job completion)
+- Error logs missing context (which entity, which user, which request, which operation)
+- Log messages noting what happened but not why it happened or what to do next
+- Sensitive data (PII, tokens, credentials, query parameters with secrets) in log statements
+- Debug-only visibility for production-critical failures (e.g., errors only logged at `console.debug`)
+- Missing correlation IDs or request/session/trace IDs that would link related log events
+*Metrics:*
+- Missing request latency metrics for externally-visible operations
+- Missing error rate metrics for critical paths
+- Missing queue depth, backlog, or processing rate for async workers
+- Missing cost metrics for AI/LLM API calls (token counts, call counts)
+- Missing retry count metrics that would reveal upstream instability
+- Missing saturation metrics (memory usage, connection pool usage, disk usage)
+*Traces:*
+- Missing spans across service boundaries (outgoing HTTP calls, database queries, queue publishes)
+- Missing spans for model/embedding API calls (duration, token count, model version)
+- Missing trace propagation (W3C Trace Context headers not forwarded across service boundaries)
+- Span attributes missing key identifiers (user ID, tenant ID, resource ID, feature flag state)
+*Operational visibility:*
+- Production-critical failures only visible by reading source code or log noise
+- No structured error taxonomy that would enable alerting rules
+- Missing operational runbook hooks or on-call documentation comments for critical paths
+- Alert thresholds not defined or documented for key metrics
+---
+### Track F — AI Slop and Code Provenance
+Run if user selected options 1, 2, 8, or a custom AI-slop/provenance scope.
+**Anti-cursory contract for Track F:** Build a coverage unit for every file group and every public surface. Every unit must be reviewed. A unit marked REVIEWED must have had its imports verified against the manifest/lockfile, its API signatures verified against an installed version, and its implementation reviewed for stub patterns.
+**Agent lens:** patterns statistically common in LLM-assisted code that look plausible but are weakly grounded.
+This is not permission to call code bad because it "looks AI-generated." Every finding still needs evidence.
+**Required method:**
+1. Prefer deterministic checks first: import existence, API signatures, wiring, docs vs. code.
+2. For subjective AI-slop patterns, require two pieces of evidence: exact quote plus a concrete consequence.
+3. Do not emit candidates based only on style.
+**Phantom dependencies and hallucinated APIs:**
+- Packages imported in source but not declared in any manifest
+- Package names that do not match any registered package in the expected ecosystem
+- Packages that sound like combinations of real packages (`react-fetch-hooks`, `express-validate-zod`) but may be fabricated — verify by checking the lock file for the exact name and version
+- Version numbers that do not exist for the declared package (check semver range resolution against the lockfile)
+- API function calls on a package where those functions do not exist in the declared version (check against the installed package's actual exports, not docs or LLM knowledge)
+- Calling internal/private APIs of a dependency that were not part of its public contract
+- Calling deprecated APIs of a dependency that were removed in the locked version
+- Cross-ecosystem imports (Python package imported in JavaScript, Node.js module imported in browser context, etc.)
+- Framework APIs from the wrong version (React 17 vs React 18 API differences, Next.js 13 vs 14 vs 15 differences, etc.)
+- Calling methods on types that don't exist at runtime (TypeScript type narrowing giving false confidence)
+**Stale library and framework usage:**
+- APIs that existed in older versions but were deprecated or removed in the pinned version
+- Import paths from old package structures (pre-restructuring imports that no longer resolve)
+- Using class-based APIs where the installed version is hook/function-based
+- Using callback-based APIs where the installed version is promise-based
+- Accessing config or environment APIs using old format that the current runtime ignores silently
+**Confident stubs and happy-path-only implementations:**
+- Functions with an impressive-looking signature and docstring but an implementation that is one or two lines, clearly insufficient for the stated purpose
+- Validation functions whose name suggests thoroughness (`validateSecureInput`, `sanitizeUserData`) but whose body only checks for null or trims whitespace
+- Security function names (`checkPermissions`, `isAuthorized`, `encryptPayload`) with trivially incorrect implementations
+- Error handlers that catch broad exception types and log a generic message, treating all errors identically
+- Retry or backoff functions that loop `N` times with `sleep(fixed_delay)` instead of implementing actual exponential backoff
+- Rate limiters that initialize a counter but never actually block or reject requests
+- Test files that import real modules but only call them with mocked return values, never actually testing the real behavior
+- Examples in docs that call non-existent functions or APIs with wrong argument shapes
+**Over-abstraction and premature generalization:**
+- Adapter, factory, or registry patterns implemented before there are two real use cases to abstract over (abstraction layer with exactly one implementation)
+- Generic interfaces with a single concrete implementation and no documented reason for the layer
+- Dependency injection containers or service locators added to simple scripts that have no runtime variation requirement
+- Configuration system with many options for which only one is ever set
+- Plugin or hook systems with registration infrastructure but no registrations
+- Abstraction cascades: function A calls function B calls function C which calls function D, where each wrapper does nothing except forward arguments
+**Copy-paste artifacts and inconsistent integration:**
+- Same logic block (3+ lines) duplicated in two or more files with minor variations instead of being extracted
+- Naming conventions that differ between files in the same module (camelCase in one file, snake_case in the sibling)
+- Error message strings that differ in style or capitalization for equivalent error conditions
+- Inconsistent parameter order for similar functions in the same module
+- Inconsistent return type patterns (some functions return `null` on error, others `undefined`, others throw)
+- Logging patterns that differ between files as if each was generated independently
+- Comments written in a different prose style from the surrounding codebase (suggesting multiple generation sessions)
+**Context rot:**
+- Comments that were accurate for an older version of the code but no longer match the current implementation
+- TODO/FIXME comments that reference issues, versions, or constraints that no longer apply
+- Test names that claim to test behavior the test no longer exercises
+- Changelog entries that describe features not present in the current code
+- Import aliases that no longer match the imported module's actual exports
+**Documentation for unwired features:**
+- README sections describing features (commands, flags, config options, APIs) with no corresponding implementation in source
+- JSDoc or TSDoc on exported functions describing parameters that don't exist in the function signature
+- Config documentation describing keys that are read and ignored, or never read at all
+- CLI help text describing flags or subcommands that have no handler
+**Security theater:**
+- Input validation that checks type or presence but not content (accepts any string as an email, any number as a valid ID)
+- Permission check function that always returns `true` or is bypassed on any non-trivial code path
+- Encryption function that Base64-encodes data and calls it "encrypted"
+- HTTPS check that only verifies the string starts with "https" but does not validate the certificate
+- Rate limiting that resets on every request instead of per time window
+- CSRF protection that checks for the header's presence but not its value
+**Slopsquatting exposure:**
+Per the USENIX research: 19.7% of LLM-recommended packages are fabricated and non-existent; 58% of hallucinated packages repeat across queries. Check:
+- Every package name in manifests against the lockfile. If a package is in the manifest but not in the lockfile, it may be unresolved or hallucinated.
+- Package names that are combinations of legitimate package names in a pattern that suggests AI generation
+- Package scopes (`@company/something`) where `@company` does not correspond to a known published scope
+---
+### Track G — Enhancement Opportunities
+Run if user selected options 1, 9, or a custom enhancement scope.
+**Anti-cursory contract for Track G:** Build a coverage unit for every enhancement domain (architecture, code quality, developer experience, performance, resilience, observability, testing, and UI/UX if applicable). Every domain must be reviewed. A domain marked REVIEWED must have had representative source files for that domain actually read and assessed.
+**Anti-defect-hunt rule:** This track is not a defect hunt.
+Do not report:
+- bugs or security vulnerabilities
+- broken claims or missing required tests
+- anything that implies the current code is wrong or unsafe
+Report only:
+- improvements that raise maintainability, clarity, resilience, performance, observability, developer experience, or UX quality
+- specific opportunities with exact file evidence
+- implementation ideas concrete enough for an engineer or agent to act on
+---
+#### Enhancement Pass G1 — Architecture and Structure
+Delegate to explorer. Read all source files.
+Format:
+```
+[ARCH-N] Title
+Category: [Abstraction | Cohesion | Interface Clarity | Dependency | Simplification]
+File(s): [exact path]
+Current State: [what exists now — quote specific code]
+Enhancement: [specific, implementable improvement]
+Impact: [what gets better — readability, testability, reuse, etc.]
+Effort: [Low | Medium | High]
+```
+Evaluate:
+*Abstraction opportunities:*
+- Functions doing more than one thing that could be cleanly separated (measure: function name contains "and", "or", "also")
+- Logic duplicated across three or more files that has stabilized enough to deserve a shared utility
+- Inline logic grown complex enough (≥10 lines of closely related computation) to deserve its own named abstraction
+- Modules with accumulated responsibilities spanning multiple unrelated concerns
+*Simplification opportunities:*
+- Premature abstractions: adapter, factory, or registry patterns with exactly one implementation and no near-term second
+- Abstraction cascades: A → B → C → D where each wrapper only forwards arguments
+- Over-engineered configuration systems with many options where only one is used
+- Dead compatibility layers kept for a version no longer in any manifest
+- Unused code paths: functions defined and exported but with no import in the codebase
+*Cohesion improvements:*
+- Cross-cutting concerns (logging, error handling, config access) scattered across modules instead of centralized
+- Inconsistent module grouping where related files are in unrelated directories
+- Business logic mixed with I/O, network, or presentation logic in the same module
+*Interface clarity:*
+- Function signatures with ≥4 positional parameters where an options object would be clearer
+- Overloaded return types that could be split into typed variants
+- Implicit contracts (side effects, required call order, mutability expectations) that could be made explicit
+*Dependency improvements:*
+- External dependencies used for one or two trivial functions that native language features now provide
+- Long dependency chains that could be simplified with a direct interface layer
+- Tight coupling to concrete implementations that limits testing or reuse
+Do not report items without an exact file path and code quote.
+---
+#### Enhancement Pass G2 — Code Quality and Elegance
+Delegate to explorer. Read all source files.
+Format:
+```
+[QUAL-N] Title
+Category: [Readability | Idiomatic | Test Quality | DX]
+File(s): [exact path]
+Current State: [what exists now — quote specific code]
+Enhancement: [specific, implementable improvement]
+Impact: [what gets better]
+Effort: [Low | Medium | High]
+```
+Evaluate:
+*Readability:*
+- Variable or function names that are accurate but not expressive (generic names like `data`, `result`, `item`, `temp` where a domain term exists)
+- Complex conditionals with 3+ conditions that could become a named predicate function
+- Deeply nested logic (≥3 levels) that could be flattened with early returns or guard clauses
+- Comments that describe what the code does instead of why it does it
+- Magic numbers or strings that should be named constants (what does `86400` mean in this context?)
+*Idiomatic improvements:*
+- Non-idiomatic patterns with cleaner modern equivalents:
+  - Manual for/while loops where `map`, `filter`, `reduce`, `find`, `every`, `some` apply
+  - `.then()` chains where `async/await` would be clearer
+  - `Object.assign({}, x)` where spread `{...x}` is idiomatic
+  - String concatenation in loops where template literals or join apply
+  - Index-based array access where destructuring is cleaner
+- TypeScript: `any` types that could be narrowed; missing generics; untyped event handlers; optional chaining opportunities; unnecessary type assertions; union types that should be discriminated unions
+- Patterns inconsistent with how the rest of the codebase does similar things (local idiosyncrasy vs. established pattern)
+- Defensive copying where reference sharing is both safe and intended
+*Test quality:*
+- Tests verifying implementation details instead of behavior
+- Test descriptions that don't communicate intent (test("works correctly", ...))
+- Setup/teardown duplication across test files that could be shared fixtures
+- Assertions too broad to fail on behavior changes
+- Missing test for the documented main use case of a public API
+*Developer experience:*
+- Exported public APIs with no JSDoc or TSDoc
+- Error messages lacking enough context to debug (what failed, what was the input, where to look)
+- Config validation that only fails at runtime when it could fail at startup with a clear message
+- Missing local scripts for common development workflows (setup, seed, reset, generate types)
+- Missing examples for non-obvious public API usage
+---
+#### Enhancement Pass G3 — Performance Enhancement
+Delegate to explorer. Read all source files.
+Format:
+```
+[PERF-N] Title
+Category: [Computational | Memory | Async | Bundle | Startup]
+File(s): [exact path]
+Current State: [what exists now — quote code]
+Enhancement: [specific, implementable improvement]
+Impact: [measurable or qualitative benefit]
+Effort: [Low | Medium | High]
+```
+Evaluate — enhancement framing only (the current code is correct; this makes it better):
+*Computational:*
+- Loops iterating over data multiple times where a single pass would suffice
+- Missing memoization for expensive pure computations called repeatedly (React renders, recursive computations)
+- N+1 patterns: repeated work per item that could be batched (opportunity to batch, not a broken behavior)
+- Synchronous critical-path work that could be deferred without correctness risk
+- Regex objects created inside loops that could be created once and reused
+*Memory:*
+- Large objects retained longer than needed (opportunity to scope more tightly)
+- Subscriptions, timers, or event listeners with no cleanup (opportunity to add lifecycle cleanup)
+- Data structure mismatches: array linear scan where Map/Set would improve lookup
+*Async:*
+- Sequential await chains where `Promise.all` would safely parallelize
+- Missing caching for repeated network or filesystem reads within the same request lifecycle
+- Unbounded concurrency fanout that could benefit from a concurrency limiter
+*Bundle and startup (if applicable):*
+- Full library imports where only a small subset is used
+- Synchronous initialization that could be lazy
+- Missing tree-shaking-friendly export patterns
+---
+#### Enhancement Pass G4 — Resilience and Observability Enhancement
+Delegate to explorer. Read all source files.
+Format:
+```
+[RES-N] Title
+Category: [Error Handling | Observability | Configuration | Retry | Graceful Degradation]
+File(s): [exact path]
+Current State: [what exists now — quote code]
+Enhancement: [specific, implementable improvement]
+Impact: [what gets better]
+Effort: [Low | Medium | High]
+```
+Evaluate — enhancement framing only:
+*Error handling:*
+- Errors caught and swallowed silently that could surface meaningful context to callers
+- Generic error messages that could include the specific context that caused the error
+- Operations that would benefit from retry with exponential backoff (currently: fail fast or no retry)
+- Binary success/crash outcomes that could degrade gracefully (return partial results, skip and continue)
+- Missing error differentiation: all exceptions treated the same when some should be retried, some reported, some fatal
+*Logging and observability:*
+- Key operations completing with no trace in logs (opportunity to add structured log at completion)
+- Log messages noting what happened but not why or what to do next
+- Missing structured fields (correlation IDs, user context, entity IDs) that would help correlate events
+- Debug information inaccessible without reading source (opportunity to surface via logs or metrics)
+- Missing metrics for operations that affect user experience, reliability, or cost
+*Configuration robustness:*
+- Config values accessed without validation that could be validated at startup
+- Missing sensible defaults for optional configuration
+- Sensitive config that could be better isolated (environment separation, secret management)
+---
+#### Enhancement Pass G5 — Testing Enhancement
+Delegate to test_engineer if available, otherwise explorer. Read all test files and source files.
+Format:
+```
+[TEST-N] Title
+Category: [Organization | Fixtures | Property-Based | Mutation | Behavior-Level]
+File(s): [exact path]
+Current State: [what exists now — quote test code]
+Enhancement: [specific, implementable improvement]
+Impact: [what gets better]
+Effort: [Low | Medium | High]
+```
+Evaluate — enhancement framing only (existing tests pass; this makes the test suite better):
+- Better test organization: grouping tests by behavior rather than by implementation unit
+- Shared fixtures or factory functions to eliminate test setup duplication
+- Property-based testing opportunities for invariants: parsers, serializers, transformations, state machines, permission matrices, fuzz-worthy trust boundaries
+- Mutation testing on high-risk core logic: identify the logic where a one-line flip would be catastrophic and where a mutation test would catch it
+- Behavior-level test assertions: replace implementation-asserting tests with behavior-asserting equivalents
+- Missing tests for documented edge cases or recently fixed bugs
+- Test performance: identify test suites taking disproportionate time and opportunities to speed them up
+---
+#### Enhancement Pass G6 — UI/UX Enhancement (Run only if UI is confirmed present)
+**Condition:** Only run if Phase 0H confirmed UI presence. If no UI, skip and record NOT_APPLICABLE in coverage.
+Run all six UI passes from Track D (D1 through D6), framing all findings as enhancement opportunities rather than defects.
+Use the same formats and evaluation criteria as Track D. The key framing difference:
+- Track D (defect mode): "This is broken, missing, or fails a compliance standard."
+- Track G Pass G6 (enhancement mode): "The current UI is working; this is how it could become better."
+Findings that would be LOW or INFO severity in Track D become genuine enhancement candidates here. In enhancement mode, all UI improvements are valuable — the bar is not "this is a defect" but "this would make the experience meaningfully better."
+Do not repeat Track D findings if Track D was also run. Reference them by ID in the enhancement catalog if relevant.
+---
+### Phase 1X — Cross-Boundary Review
+After selected track candidate generation completes, run one cross-boundary explorer pass.
+Skip rule: run Phase 1X only when two or more tracks ran and there is quoted cross-track evidence to compare. For single-track reviews, skip and record the skip in Coverage Notes.
+Purpose: find issues that isolated track passes miss.
+Check:
+- Caller and callee contract mismatches across module boundaries
+- UI/API/schema drift (what the UI sends vs. what the API expects vs. what the schema defines)
+- Docs/API/test drift (what docs claim vs. what the API does vs. what tests assert)
+- Auth assumptions across middleware and handlers (auth enforced in middleware but not in handler, or vice versa)
+- Config names across docs, env parsing, deployment config, and code
+- Shared state mutation across modules that assumes exclusive access
+- Package scripts calling files or commands that no longer exist
+- Generated types or schemas out of sync with their sources
+- AI prompt/tool boundaries crossing into security-sensitive sinks (identified in Track B but not surfaced in Track A)
+- Repeated candidate patterns in sibling files suggesting a systemic issue
+Output: additional `CANDIDATE_FINDING` entries only. Use the track of the most security-relevant finding. If no single track dominates, use `track: cross_boundary`. Link all involved claims, surfaces, boundaries, or prior candidates.
+---
+## Phase 2 — Reviewer Validation
+Reviewer validates candidates. Reviewer does not rediscover the whole repo.
+Reviewer receives small batches by local reasoning unit: same file, same route or handler chain, same subsystem, same dependency family, same public claim, same trust boundary, same UI component family, or same test fixture/helper.
+Do not hand Reviewer dozens of unrelated candidates in one batch.
+### Validation Status
+Reviewer must assign exactly one:
+- `CONFIRMED` — real in current code and supported by evidence
+- `DISPROVED` — not real in context
+- `UNVERIFIED` — plausible but not proven to required confidence
+- `PRE_EXISTING` — real but outside the target change scope
+### Reviewer Responsibilities
+For each candidate:
+1. Re-open exact file and line.
+2. Read the raw file independently before reading the explorer's `evidence_checked` field. Do not let the explorer's paraphrase prime validation.
+3. Re-read enough surrounding context.
+4. Check callers, callees, tests, manifests, configs, schemas, routes, generated files, and docs needed to validate.
+5. Check mitigating controls that could disprove the candidate.
+6. Run safe minimal runtime validation where behavior depends on runtime.
+7. Reclassify severity or value level if appropriate.
+8. Record exact disproof reason for rejected candidates.
+9. Mark UNVERIFIED rather than guessing when evidence is insufficient.
+### Defect Validation Format
+```
+VALIDATED_FINDING
+  candidate_id:
+  status: CONFIRMED | DISPROVED | UNVERIFIED | PRE_EXISTING
+  final_severity: CRITICAL | HIGH | MEDIUM | LOW | INFO
+  confidence: HIGH | MEDIUM
+  file:
+  line:
+  exact_quote:
+  title:
+  problem:
+  impact:
+  fix:
+  validation_evidence:
+  disproof_reason: <required if DISPROVED>
+  verification_mode: STATIC | STATIC_PLUS_RUNTIME
+  runtime_validation: <command or N/A>
+  linked_claims:
+  linked_surfaces:
+  linked_boundaries:
+  ai_pattern: <same value from candidate or N/A>
+  inline_routing: CRITIC_REQUIRED | REVIEWER_FINALIZED | REVIEWER_DOWNGRADED
+  finalization_status: FINALIZED | DOWNGRADED | N/A
+  size: S | M | L
+END
+```
+Rules:
+- CRITICAL/HIGH CONFIRMED or PRE_EXISTING requires `inline_routing: CRITIC_REQUIRED`.
+- MEDIUM/LOW CONFIRMED or PRE_EXISTING requires reviewer finalization before return.
+- DISPROVED and UNVERIFIED do not enter the main findings list.
+### Enhancement Validation Format
+```
+VALIDATED_ENHANCEMENT
+  candidate_id:
+  status: CONFIRMED_HIGH_VALUE | CONFIRMED_MEDIUM_VALUE | REJECTED | UNVERIFIED
+  track:
+  domain:
+  category:
+  confidence: HIGH | MEDIUM
+  file:
+  line:
+  exact_quote:
+  title:
+  current_state:
+  confirms_current_code_is_working: yes | no
+  enhancement:
+  expected_impact:
+  effort: S | M | L
+  validation_evidence:
+  dependency_map:
+  rejection_reason: <required if REJECTED>
+END
+```
+Enhancement rejection reasons include: already handled elsewhere; contradicts system intent; adds complexity without clear benefit; purely stylistic preference; too vague to implement; current design appears intentional and better; not grounded in exact evidence; `confirms_current_code_is_working` is not `yes`.
+---
+## Phase 2C — Inline Critic Challenge for CRITICAL and HIGH Defects
+Trigger immediately after each reviewer batch containing CRITICAL or HIGH CONFIRMED or PRE_EXISTING findings. Do not wait for all reviewer batches to complete.
+Critic receives only: the relevant validated findings, exact evidence quotes, minimal surrounding context, and any runtime validation notes.
+Critic checks:
+- Is the finding real at the cited location?
+- Did reviewer miss a mitigating control?
+- Is the severity justified?
+- Is runtime validation sufficient or required?
+- Is the fix actionable?
+- Does the finding overclaim beyond evidence?
+- Is this part of a repeated pattern requiring sibling coverage?
+```
+CRITIC_RESULT
+  finding_id:
+  verdict: UPHELD | REFINED | DOWNGRADED | OVERTURNED
+  original_severity: CRITICAL | HIGH
+  final_severity:
+  file:
+  line:
+  exact_quote:
+  title:
+  final_problem:
+  final_fix:
+  ai_pattern: <same value from validated finding or N/A>
+  verdict_reason:
+  coverage_gap:
+END
+```
+Only UPHELD, REFINED, and DOWNGRADED findings may enter the confirmed evidence set. OVERTURNED findings are dropped and logged.
+If Phase 2C downgrades a CRITICAL/HIGH to MEDIUM/LOW, route immediately through Phase 2M. Record `finalization_status: DOWNGRADED`.
+---
+## Phase 2M — Reviewer Finalization for MEDIUM and LOW Defects
+This is not a separate agent dispatch. Reviewer performs this before returning a validation batch.
+For every MEDIUM or LOW CONFIRMED or PRE_EXISTING finding:
+1. Re-read evidence.
+2. Check whether a mitigating control was missed.
+3. Confirm severity is not inflated.
+4. Confirm the finding is not style preference.
+5. Confirm actionability.
+6. Set `inline_routing: REVIEWER_FINALIZED` or `inline_routing: REVIEWER_DOWNGRADED`.
+7. Set `finalization_status: FINALIZED` or `finalization_status: DOWNGRADED`.
+Only FINALIZED and DOWNGRADED findings enter the confirmed evidence set.
+---
+## Phase 2E — Critic Validation for Enhancements
+Every report-eligible enhancement requires critic validation.
+Rationale for asymmetry with MEDIUM/LOW defects: enhancement value is more subjective. LOW-value enhancements are normally omitted unless the user requested exhaustive enhancement review. If a LOW-value enhancement is retained, critic validation is still required.
+Phase 2E may run concurrently with Phase 2C and Phase 2M only for disjoint findings and disjoint subsystems. If an enhancement and defect concern the same file or root cause, serialize validation to keep the defect/enhancement boundary clear.
+Critic receives batches by category and subsystem.
+Critic checks:
+- Is the current state quoted accurately?
+- Is the opportunity genuinely valuable?
+- Is the improvement concrete enough to implement?
+- Is the effort estimate plausible?
+- Would the suggestion add more complexity than value?
+- Does it conflict with codebase intent or style?
+- Does it duplicate another opportunity?
+- Should it be merged, split, downgraded, or rejected?
+```
+ENHANCEMENT_CRITIC_RESULT
+  enhancement_id:
+  verdict: UPHELD_HIGH_VALUE | UPHELD_MEDIUM_VALUE | REFINED | MERGED | DOWNGRADED | REJECTED
+  final_category:
+  final_title:
+  file:
+  line:
+  exact_quote:
+  final_enhancement:
+  expected_impact:
+  effort: S | M | L
+  dependencies:
+  verdict_reason:
+END
+```
+Only UPHELD_HIGH_VALUE, UPHELD_MEDIUM_VALUE, REFINED, MERGED, and DOWNGRADED enhancements enter the final report.
+---
+## Phase 3 — Test Validation and Drift Review
+Run this phase if any selected track touches functionality, testing, security, public claims, CI, or behavior.
+If Track C did not run, Phase 3 is limited to test-related drift arising from findings in other selected tracks.
+Use test_engineer where available.
+Tasks:
+1. Review every test-related finding and every claim that depends on tests.
+2. Confirm whether tests assert behavior or merely execute code.
+3. Confirm whether test fixtures match current schemas and defaults.
+4. Confirm whether mocked boundaries hide real integration failures.
+5. Confirm whether snapshot tests are masking meaningful changes.
+6. Identify property-based testing opportunities for invariants.
+7. Identify mutation resilience gaps for high-risk logic.
+8. Run safe focused test commands where needed.
+9. Record commands run and what they prove.
+```
+TEST_DRIFT_REVIEW
+  related_findings:
+  commands_run:
+  behavior_assertions_verified:
+  stale_tests_found:
+  weak_assertions_found:
+  property_based_opportunities:
+  mutation_resilience_gaps:
+  remaining_uncertainty:
+END
+```
+Write to `ledgers/test-drift-review.md`. If not applicable, write with `NOT_APPLICABLE` and reason.
+Rules:
+- Coverage percentage is not proof of test quality.
+- Passing tests are not proof of correct behavior.
+- Test names are claims.
+- A test that cannot fail for the bug it claims to prevent is a test-quality finding.
+---
+## Phase 4 — Architect Synthesis
+Architect synthesizes only validated evidence.
+Inputs: Phase 0 ledgers; candidate ledgers; reviewer validation ledgers; inline critic results; enhancement critic results; `ledgers/test-drift-review.md`.
+Synthesis tasks:
+1. Drop DISPROVED findings.
+2. Drop OVERTURNED critic findings.
+3. Keep UNVERIFIED findings only in Coverage Notes.
+4. Keep CONFIRMED and PRE_EXISTING defects only if they passed required routing.
+5. Keep enhancements only if critic upheld, refined, merged, or downgraded them.
+6. Deduplicate same-root-cause findings.
+7. Merge repeated pattern findings only when evidence supports the cluster.
+8. Separate defects from enhancements.
+9. Separate unsupported claims from code defects.
+10. Separate AI slop patterns from normal technical debt.
+11. Count rejected and unverified items so filtering is auditable.
+12. Identify systemic themes.
+13. Identify recommended remediation or enhancement order.
+14. Identify omitted tracks and coverage limitations.
+15. Create `ledgers/strengths-ledger.md` with only quoted codebase strengths. If no strengths can be quoted, write `NOT_APPLICABLE`.
+16. Verify coverage closure: every selected-track coverage unit must be REVIEWED, NOT_APPLICABLE, SKIPPED_WITH_REASON, or BLOCKED. If any unit is UNASSIGNED or UNREVIEWED, do not proceed to Phase 5. Return to Phase 1 for that unit.
+Claim ledger outcome definitions:
+- `supported` — implementation evidence confirms the claim.
+- `partially_supported` — evidence supports part but not all of the claim.
+- `unsupported` — no implementation evidence supports the claim.
+- `contradicted` — implementation evidence conflicts with the claim.
+- `stealth_change` — public behavior, API contract, config, or documented workflow appears to have changed without a corresponding documentation, migration, changelog, or test update.
+- `unverified` — evidence was insufficient to classify.
+### Required Counts Block
+```
+Defect Findings by Track:
+  functionality_correctness: C / H / M / L / I
+  security_privacy:         C / H / M / L / I
+  llm_ai_security:          C / H / M / L / I
+  supply_chain:             C / H / M / L / I
+  testing_quality:          C / H / M / L / I
+  ui_ux_accessibility:      C / H / M / L / I
+  performance:              C / H / M / L / I
+  observability:            C / H / M / L / I
+  ai_slop_provenance:       C / H / M / L / I
+  docs_claims_drift:        C / H / M / L / I
+  cross_platform:           C / H / M / L / I
+  cross_boundary:           C / H / M / L / I
+  total:                    C / H / M / L / I
+Validation Outcomes:
+  candidates_generated:
+  confirmed:
+  pre_existing:
+  disproved:
+  unverified:
+  reviewer_downgraded:
+  critic_upheld:
+  critic_refined:
+  critic_downgraded:
+  critic_overturned:
+Enhancement Outcomes:
+  candidates_generated:
+  upheld_high_value:
+  upheld_medium_value:
+  refined:
+  merged:
+  downgraded:
+  rejected:
+  unverified:
+Claim Ledger:
+  supported:
+  partially_supported:
+  unsupported:
+  contradicted:
+  stealth_change:
+  unverified:
+Coverage Closure:
+  total_coverage_units:
+  reviewed:
+  not_applicable:
+  skipped_with_reason:
+  blocked:
+  unreviewed: <must be 0 to proceed>
+AI Pattern Distribution:
+  phantom_dependency:
+  hallucinated_api:
+  stale_api_usage:
+  confident_stub:
+  happy_path_only:
+  over_abstraction:
+  context_rot:
+  security_theater:
+  generated_test_weakness:
+  mcp_tool_poisoning:
+  unsupported_claim:
+  other:
+```
+---
+## Phase 5 — Final Whole-Report Critic
+Before writing the final report, dispatch Critic with the planned synthesis.
+Critic checks:
+- Does every final defect have validation evidence?
+- Did every CRITICAL/HIGH pass inline critic?
+- Did every MEDIUM/LOW pass reviewer finalization?
+- Does every enhancement have critic validation?
+- Are defects and enhancements separated?
+- Are all codebase strengths quoted in `ledgers/strengths-ledger.md`?
+- Are unverified items excluded from main findings?
+- Are severities calibrated to the rubrics?
+- Are UI findings concrete and implementable?
+- Are security findings exploitability-grounded?
+- Are performance findings not overstated without measurement?
+- Are AI-slop findings evidence-based rather than vibe-based?
+- Are claims ledger conclusions supported?
+- Are coverage notes honest?
+- Are counts internally consistent?
+- Is the coverage closure count showing 0 UNREVIEWED?
+- Did the report omit any user-selected track?
+```
+FINAL_CRITIC_CHECK
+  verdict: PASS | REVISE
+  required_revisions:
+  severity_adjustments:
+  findings_to_drop:
+  findings_to_reclassify_as_enhancements:
+  enhancements_to_reclassify_as_defects:
+  unsupported_report_claims:
+  missing_or_empty_ledgers:
+  unsupported_strengths:
+  coverage_note_fixes:
+  count_mismatches:
+  coverage_closure_failures:
+END
+```
+If verdict is REVISE, revise the synthesis and rerun final critic until PASS.
+---
+## Phase 6 — Final Report
+Write to: `review-report.md` in the run directory.
+Use this structure:
+```markdown
+# Codebase Review Report
+Generated: [timestamp]
+Repository: [name/path]
+Git HEAD: [SHA]
+Selected Review Tracks: [tracks]
+Skipped Tracks: [tracks and why]
+Review Mode: [complete integrated | defect-focused | focused | enhancement-only | custom]
+## Executive Summary
+[2-5 sentences. Strongest confirmed themes only.]
+## Review Scope and Method
+- Phase 0 inventory completed: yes
+- User-selected tracks:
+- Explorer candidates generated:
+- Reviewer validation completed:
+- Inline critic used for CRITICAL/HIGH:
+- Reviewer finalization used for MEDIUM/LOW:
+- Enhancement critic used:
+- Final whole-report critic verdict:
+- Coverage closure verified: yes (N units reviewed)
+- Runtime validation commands run:
+## Findings Count
+[counts block]
+## Critical and High Confirmed Defect Findings
+[full details. Do not include PRE_EXISTING here.]
+## High-Severity Pre-Existing Findings
+[required if any CRITICAL/HIGH PRE_EXISTING findings exist]
+## Medium Defect Findings
+[full details or grouped details]
+## Low and Info Defect Findings
+[condensed but evidence-grounded]
+## Security, Privacy, and Supply Chain Notes
+[include only if selected or relevant]
+## Unsupported, Contradicted, or Partially Supported Claims
+[claim ledger outcomes]
+## AI Slop and Code Provenance Patterns
+[evidence-based patterns only. Never vibe-based.]
+## Testing and Test Drift Findings
+[test-quality and drift results]
+## UI/UX and Accessibility Findings
+[include only if selected and UI exists]
+## Performance and Observability Findings
+[include only if selected]
+## Systemic Themes
+[themes synthesized from validated findings only]
+## Enhancement Opportunities
+[include only if selected]
+### Top 10 Highest-Impact Enhancements
+[top validated high-value opportunities, ranked by impact]
+### Full Enhancement Catalog
+#### Architecture Enhancements (ARCH-*)
+#### Code Quality Enhancements (QUAL-*)
+#### Performance Enhancements (PERF-*)
+#### Resilience and Observability Enhancements (RES-*)
+#### Testing Enhancements (TEST-*)
+#### UI/UX — Visual Hierarchy and Layout (UI-HIER-*)
+#### UI/UX — Interaction Design and Feedback (UI-INT-*)
+#### UI/UX — Accessibility and Inclusivity (UI-A11Y-*)
+#### UI/UX — Typography and Visual Polish (UI-VIS-*)
+#### UI/UX — Performance and Perceived Performance (UI-PERF-*)
+#### UI/UX — Consistency and Design System Alignment (UI-CON-*)
+### Implementation Roadmap
+#### Phase 1 — Quick Wins
+Low effort, high clarity. List by ID with one-line description.
+#### Phase 2 — Meaningful Improvements
+Medium effort, clear payoff. List by ID with dependencies noted.
+#### Phase 3 — Architectural Investments
+High effort, transformational impact. List by ID.
+### Codebase Strengths
+[specific patterns worth preserving. Each strength must cite a file and line range and include exact quote evidence.]
+## Recommended Remediation Order
+1. Security, supply-chain, data-loss, and broken shipped functionality.
+2. Unsupported public claims and stealth behavior changes.
+3. Trust-boundary and authorization defects.
+4. Test gaps that allow confirmed defects to recur.
+5. Performance and observability gaps affecting production diagnosis.
+6. AI slop and provenance cleanup by repeated pattern.
+7. Validated enhancement opportunities by dependency order.
+## Coverage Notes
+- Tracks not run:
+- Areas inventoried but not deeply reviewed:
+- Runtime validations not run and why:
+- UNVERIFIED findings worth future attention:
+- Files or generated artifacts intentionally excluded:
+## Validation Notes
+- candidates generated:
+- reviewer confirmed:
+- reviewer disproved:
+- reviewer unverified:
+- critic upheld/refined/downgraded/overturned:
+- enhancements upheld/rejected:
+- final critic verdict:
+- coverage units: total / reviewed / not_applicable / skipped / blocked / unreviewed
+```
+### Per-Finding Final Format
+For every final defect:
+```markdown
+### [SEVERITY] [Title]
+Location: `path:line`
+Track: [track]
+Status: CONFIRMED | PRE_EXISTING
+Confidence: HIGH | MEDIUM
+Evidence:
+> [exact quote]
+Problem:
+[factual issue]
+Impact:
+[specific impact]
+Validation:
+[what reviewer checked, runtime command if any, critic outcome if high severity]
+Recommended Fix:
+[actionable remediation]
+```
+For every final enhancement:
+```markdown
+### [ENHANCEMENT-ID] [Title]
+Location: `path:line`
+Category: [category]
+Value: High | Medium
+Effort: S | M | L
+Current State:
+> [exact quote]
+Opportunity:
+[specific improvement]
+Expected Impact:
+[what improves]
+Validation:
+[critic result and any dependencies]
+```
+---
+## Completion Rules
+The review is complete only when:
+- Phase 0 inventory completed.
+- Every required ledger exists and is non-empty, or contains an explicit `NOT_APPLICABLE` reason.
+- User selected review tracks or preselected tracks were explicit.
+- Every selected track was run or explicitly skipped with reason.
+- Coverage closure verified: every selected-track coverage unit is REVIEWED, NOT_APPLICABLE, SKIPPED_WITH_REASON, or BLOCKED. Zero UNASSIGNED or UNREVIEWED units.
+- Every final defect has exact quote evidence.
+- Every final enhancement has exact quote evidence.
+- Every defect candidate was reviewer validated or logged as not validated.
+- Every CRITICAL/HIGH final finding passed inline critic.
+- Every MEDIUM/LOW final finding passed reviewer finalization.
+- Every enhancement in the final report passed enhancement critic.
+- Test drift review ran when behavior or tests were in scope.
+- Final whole-report critic returned PASS.
+- `review-report.md` was written.
+- The report was read back and checked for missing sections.
+Do not implement fixes. Do not modify source files.
+Stop after reporting the final review file path, selected tracks, counts summary, and any user questions that block remediation planning.
+---
+## Final Architect Response to User
+Do not fill in this template until Phase 5 final critic returns PASS.
+After the report is complete and the final critic verdict is PASS:
+```
+Review complete.
+Report: .swarm/review-v7/runs/<run_id>/review-report.md
+Selected tracks: [tracks]
+Coverage units closed: [n] (0 unreviewed)
+Confirmed defects: [counts by severity]
+Validated enhancements: [counts by value tier]
+Candidates filtered out: [counts]
+Final critic verdict: PASS
+Highest-risk confirmed findings:
+- [one-line list of CRITICAL/HIGH only]
+Highest-value enhancements:
+- [one-line list if enhancement track ran]
+Coverage limitations:
+- [brief list]
+No source files were modified.
+```
+If final critic verdict is not PASS, do not claim completion. Revise and rerun.