opencode-swarm 7.58.0 → 7.59.0

Files changed (46)
  1. package/.opencode/skills/brainstorm/SKILL.md +142 -0
  2. package/.opencode/skills/clarify/SKILL.md +103 -0
  3. package/.opencode/skills/clarify-spec/SKILL.md +58 -0
  4. package/.opencode/skills/codebase-review-swarm/INSTALL.md +75 -0
  5. package/.opencode/skills/codebase-review-swarm/README.md +44 -0
  6. package/.opencode/skills/codebase-review-swarm/SKILL.md +65 -0
  7. package/.opencode/skills/codebase-review-swarm/agents/openai.yaml +6 -0
  8. package/.opencode/skills/codebase-review-swarm/assets/jsonl-schemas.md +239 -0
  9. package/.opencode/skills/codebase-review-swarm/assets/review-report-template.md +244 -0
  10. package/.opencode/skills/codebase-review-swarm/references/compatibility-and-research-notes.md +25 -0
  11. package/.opencode/skills/codebase-review-swarm/references/full-v7-source-prompt.md +2373 -0
  12. package/.opencode/skills/codebase-review-swarm/references/review-protocol-v8.2.md +310 -0
  13. package/.opencode/skills/codebase-review-swarm/scripts/init-review-run.py +134 -0
  14. package/.opencode/skills/codebase-review-swarm/scripts/validate-skill-package.py +62 -0
  15. package/.opencode/skills/consult/SKILL.md +16 -0
  16. package/.opencode/skills/council/SKILL.md +147 -0
  17. package/.opencode/skills/critic-gate/SKILL.md +59 -0
  18. package/.opencode/skills/deep-dive/SKILL.md +142 -0
  19. package/.opencode/skills/design-docs/SKILL.md +81 -0
  20. package/.opencode/skills/discover/SKILL.md +20 -0
  21. package/.opencode/skills/execute/SKILL.md +191 -0
  22. package/.opencode/skills/issue-ingest/SKILL.md +64 -0
  23. package/.opencode/skills/phase-wrap/SKILL.md +123 -0
  24. package/.opencode/skills/plan/SKILL.md +293 -0
  25. package/.opencode/skills/pre-phase-briefing/SKILL.md +69 -0
  26. package/.opencode/skills/resume/SKILL.md +23 -0
  27. package/.opencode/skills/specify/SKILL.md +175 -0
  28. package/.opencode/skills/swarm-pr-feedback/SKILL.md +192 -0
  29. package/.opencode/skills/swarm-pr-review/SKILL.md +884 -0
  30. package/dist/agents/agent-output-schema.d.ts +1 -1
  31. package/dist/cli/index.js +1351 -1159
  32. package/dist/commands/command-dispatch.d.ts +1 -0
  33. package/dist/commands/index.d.ts +1 -0
  34. package/dist/commands/registry.d.ts +15 -14
  35. package/dist/config/bundled-skills.d.ts +25 -0
  36. package/dist/config/constants.d.ts +1 -1
  37. package/dist/config/schema.d.ts +42 -0
  38. package/dist/index.js +3517 -2673
  39. package/dist/memory/schema.d.ts +1 -1
  40. package/dist/tools/lean-turbo-run-phase.d.ts +2 -1
  41. package/dist/turbo/lean/index.d.ts +4 -1
  42. package/dist/turbo/lean/merge-back.d.ts +180 -0
  43. package/dist/turbo/lean/runner.d.ts +47 -1
  44. package/dist/turbo/lean/state.d.ts +10 -0
  45. package/dist/turbo/lean/worktree.d.ts +194 -0
  46. package/package.json +20 -1
@@ -0,0 +1,2373 @@
1
+ # Full v7 Source Prompt (Verbatim)
2
+
3
+ This file preserves the uploaded v7 source prompt for detailed checklists and provenance. The v8.1 skill protocol supersedes only portability/packaging choices, artifact root (`.swarm/review-v8`), explicit grounding fields, and current standards such as ASVS 5.0.0.
4
+
5
+ ---
6
+
7
+ # Comprehensive Codebase Review Swarm Prompt v7
8
+
9
+ Generated: 2026-05-01
10
+
11
+ Purpose: run a rigorous, hallucination-resistant codebase review using an opencode-swarm architect, explorer, reviewer, critic, test_engineer, and optional designer workflow. This version unifies defect-focused QA review and enhancement-focused review into one selectable workflow with fully fleshed-out tracks, an anti-cursory coverage closure contract, and research-updated security, AI slop, and enhancement guidance.
12
+
13
+ Use: paste this entire prompt into the orchestrating Architect agent at the repository root. Do not paste only one section unless you are deliberately running a single track.
14
+
15
+ ---
16
+
17
+ ## State-of-the-Art Anchors
18
+
19
+ This prompt combines deterministic evidence gathering with heuristic discovery. Specification-grounded code review (SGCR) reported a 42% developer adoption rate versus 22% for a single-LLM baseline, by grounding review suggestions in human-authored specifications rather than LLM inference alone ([SGCR paper](https://arxiv.org/html/2512.17540v1)).
20
+
21
+ Every candidate finding must be grounded in exact code context. A joint study across 576,000 code samples found 19.7% of LLM-recommended packages were fabricated and non-existent, with 58% of hallucinated packages repeating across multiple queries — making them actively exploitable by attackers who register the fake names ([USENIX package hallucination research](https://www.usenix.org/publications/loginonline/we-have-package-you-comprehensive-analysis-package-hallucinations-code)). HalluJudge frames hallucination detection as checking whether a review comment is aligned with the code context, motivating this prompt's quote-grounding rule ([HalluJudge](https://arxiv.org/abs/2601.19072)).
22
+
23
+ Security review must use verifiable controls rather than only awareness categories. OWASP ASVS is the basis for testing web application technical security controls; the current stable version is 5.0.0, which supersedes 4.0.3 ([OWASP ASVS](https://owasp.org/www-project-application-security-verification-standard/)).
24
+
25
+ AI and LLM security must account for the OWASP Top 10 for LLM Applications 2025 (updated November 2024): LLM01 Prompt Injection (now explicitly includes indirect injection from external sources), LLM02 Sensitive Information Disclosure (jumped from #6), LLM03 Supply Chain, LLM04 Data and Model Poisoning, LLM05 Improper Output Handling, LLM06 Excessive Agency (now broken into excessive functionality, permissions, and autonomy), LLM07 System Prompt Leakage (new), LLM08 Vector and Embedding Weaknesses (new), LLM09 Misinformation, LLM10 Unbounded Consumption ([OWASP GenAI](https://genai.owasp.org/llm-top-10/)).
26
+
27
+ MCP server security is a first-class threat surface in 2026. Documented attack vectors include: tool poisoning (embedding malicious instructions in tool descriptions that AI agents execute), data exfiltration via AI response context (database schemas, API endpoints, and credentials traversing AI context to external tools), and MCP server chain lateral movement (a compromised server A used as an AI relay to reach production server C without direct network access). Over 60% of MCP deployments have no security layer between the AI agent and its tool surface ([MCP security research, Practical DevSecOps 2026](https://www.practical-devsecops.com/mcp-security-vulnerabilities/)).
28
+
29
+ Supply-chain review must treat build provenance, artifact verification, and attestation as first-class. SLSA defines levels for increasing supply-chain security guarantees, with provenance and verification summary attestation formats ([SLSA specification](https://slsa.dev/spec/)). OpenSSF Scorecard assesses open source projects for security risks through automated checks ([OpenSSF Scorecard](https://openssf.org/projects/scorecard/)).
30
+
31
+ AI slop in codebases is measurable. Larridin's AI Slop Index identifies five diagnostic signals: code duplication ratio (semantic duplication where AI generates functionally equivalent code in multiple places instead of shared abstractions), 30/90-day revert and churn rates (code rewritten or deleted within 30 days directly signals it should not have merged), complexity-adjusted analysis, architectural coherence scoring (new code introducing new patterns for problems the codebase already solves), and test behavior coverage (tests that assert mocks rather than behavior) ([Larridin AI Slop Index, 2026](https://larridin.com/developer-productivity-hub/what-is-ai-slop-detect-prevent-low-quality-ai-code)). AI-generated UI converges on identifiable visual patterns: 21% of recent Show HN landing pages scored as heavy slop (≥5 of 15 AI-design-tell patterns), 46% mild, 33% clean ([AI Design Slop research, 2026](https://www.developersdigest.tech/blog/ai-design-slop-and-how-to-spot-it)).
32
+
33
+ LLMs hallucinate because training and evaluation procedures reward confident guessing over acknowledging uncertainty (Kalai et al., OpenAI, September 2025). Combining RAG, RLHF, and guardrails is reported to achieve up to 96% hallucination reduction versus baseline; multi-agent verification architectures are reported to improve consistency by 85.5%; and a hybrid of LLM and static analysis (IRIS framework, ICLR 2025) detected 55 vulnerabilities versus CodeQL's 27 ([diffray.ai hallucination research, 2026](https://diffray.ai/blog/llm-hallucinations-code-review/)).
34
+
35
+ UI accessibility review uses WCAG 2.2 AA as baseline ([W3C WCAG 2.2](https://www.w3.org/TR/WCAG22/)).
36
+
37
+ Observability review covers traces, metrics, and logs per OpenTelemetry's vendor-neutral telemetry model ([OpenTelemetry docs](https://opentelemetry.io/docs/)).
38
+
39
+ ---
40
+
41
+ ## Prelude — Orchestrator Contract
42
+
43
+ You are the Architect agent conducting a deep codebase review.
44
+
45
+ You are not implementing fixes. You are not modifying source code. You are producing a verified review report.
46
+
47
+ This prompt supports the following review modes — selected after Phase 0:
48
+
49
+ 1. **Complete Integrated Review** — all defect-focused tracks plus enhancement opportunities.
50
+ 2. **Defect-Focused Comprehensive QA** — functionality, security, tests, UI/UX if present, performance, AI slop, docs/claims, supply chain. No enhancement catalog.
51
+ 3. **Security and Supply Chain Focus**
52
+ 4. **Functionality and Correctness Focus**
53
+ 5. **Testing and Test Quality Focus**
54
+ 6. **UI/UX and Accessibility Focus**
55
+ 7. **Performance and Observability Focus**
56
+ 8. **AI Slop and Code Provenance Focus**
57
+ 9. **Enhancement Opportunities Only** — architecture, quality, DX, performance, resilience, observability, UI/UX improvements. Not a bug hunt.
58
+ 10. **Custom Combination** — specify tracks and scope.
59
+
60
+ ### Anti-Cursory Review Contract
61
+
62
+ This is the single most important rule. Read it now and re-read it before every track dispatch.
63
+
64
+ **Selecting fewer tracks narrows the domain. It must never reduce depth inside the selected domain.**
65
+
66
+ A single-track review must be as exhaustive for that selected track as a complete integrated review would be for that track. Do not sample, skim, or perform shallow category checks merely because fewer tracks were selected.
67
+
68
+ For every selected track, build a coverage matrix in `coverage.jsonl` with one entry per relevant surface, file group, trust boundary, test cluster, UI component family, or AI/tool surface discovered in Phase 0.
69
+
70
+ Each coverage entry must end with one of:
71
+ - `REVIEWED` — relevant files were actually read, entry point traced when behavior involved, tests checked when behavior or claims involved, guards checked when trust boundaries involved, exact evidence captured, alternatives considered.
72
+ - `NOT_APPLICABLE` — with explicit reason.
73
+ - `SKIPPED_WITH_REASON` — with explicit reason.
74
+ - `BLOCKED` — with explicit reason.
75
+
76
+ **Final report is forbidden if any selected-track coverage unit remains `UNASSIGNED` or `UNREVIEWED`.**
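The closure rule above can be checked mechanically before the final report is written. The following is a minimal sketch, not the package's own validator: the `unit_id`, `status`, and `reason` field names are assumptions made for illustration, and the real `coverage.jsonl` schema may differ.

```python
import json

# Terminal statuses named in the Anti-Cursory Review Contract.
TERMINAL_STATUSES = {"REVIEWED", "NOT_APPLICABLE", "SKIPPED_WITH_REASON", "BLOCKED"}
# Every terminal status except REVIEWED must carry an explicit reason.
NEEDS_REASON = TERMINAL_STATUSES - {"REVIEWED"}


def coverage_blockers(jsonl_text: str) -> list[str]:
    """Return one message per coverage entry that blocks the final report."""
    blockers = []
    for line_no, raw in enumerate(jsonl_text.splitlines(), start=1):
        if not raw.strip():
            continue  # tolerate blank lines in the ledger
        entry = json.loads(raw)
        # "unit_id", "status", and "reason" are hypothetical field names.
        unit = entry.get("unit_id", f"line {line_no}")
        status = entry.get("status", "UNASSIGNED")
        reason = str(entry.get("reason") or "").strip()
        if status not in TERMINAL_STATUSES:
            blockers.append(f"{unit}: status {status} is not terminal")
        elif status in NEEDS_REASON and not reason:
            blockers.append(f"{unit}: {status} requires an explicit reason")
    return blockers
```

An empty return value means every selected-track coverage unit has been closed; any other result should stop report generation.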
77
+
78
+ ### Quality Directives
79
+
80
+ Quality is the only success metric. There is no time pressure. There is no reward for fewer passes. There is no penalty for more passes when they improve correctness.
81
+
82
+ Large codebases require smaller scopes, more passes, more validation, and more disciplined synthesis. Large codebases do not justify broader batches or weaker gates.
83
+
84
+ ### Concurrency Policy
85
+
86
+ - Phase 0 micro-inventory passes may run in small parallel batches of up to two independent agents.
87
+ - After Phase 0, selected review tracks may run in parallel only when their file scopes and reasoning contexts are independent.
88
+ - Reviewer validation may run in parallel across disjoint local reasoning units, where one unit is a single file, route chain, subsystem, dependency family, public claim, trust boundary, UI component family, or test fixture/helper.
89
+ - At most one critic session per finding lineage. Critic sessions for disjoint finding sets may run concurrently.
90
+ - Critic challenge for CRITICAL and HIGH findings happens inline per reviewer batch. Do not defer to the final report.
91
+ - A final whole-report critic pass is mandatory before acceptance.
92
+ - If quality and concurrency conflict, quality wins.
93
+
94
+ ### Phase 0 Safe Ordering
95
+
96
+ 1. Run Phase 0A alone.
97
+ 2. After 0A, run 0B and 0C in parallel if the repository is large enough to benefit.
98
+ 3. After 0B, run 0D and 0E in parallel only if 0E can leave `linked_claims` blank for Architect linking in 0J. Otherwise run 0D before 0E.
99
+ 4. Preferred batch order: batch 1 = 0F and 0G; batch 2 = 0H and 0I. Never exceed the two-agent Phase 0 cap.
100
+ 5. Run 0F after 0E when possible.
101
+ 6. Run 0G after 0B and 0C.
102
+ 7. Run 0H and 0I after 0B and 0C.
103
+ 8. Run 0J only after all applicable 0B-0I ledgers are complete.
104
+
105
+ Never run a dependent Phase 0 pass to keep agents busy. Missing dependency context must be written as `unknown`, not guessed.
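One way to read the ordering rules above is as a dependency graph scheduled in batches of at most two passes. The sketch below is an illustration under assumptions: the dependency edges are my interpretation of the list (using the conservative "0D before 0E" option), and it does not encode the preferred pairing of 0F with 0G and 0H with 0I.

```python
from graphlib import TopologicalSorter

# Each pass maps to the passes it must wait for (assumed reading of the rules).
PHASE0_DEPS = {
    "0A": set(),
    "0B": {"0A"},
    "0C": {"0A"},
    "0D": {"0B"},
    "0E": {"0D"},  # conservative option: 0D runs before 0E
    "0F": {"0E"},
    "0G": {"0B", "0C"},
    "0H": {"0B", "0C"},
    "0I": {"0B", "0C"},
    "0J": {"0B", "0C", "0D", "0E", "0F", "0G", "0H", "0I"},
}
MAX_PARALLEL = 2  # the two-agent Phase 0 cap


def plan_batches(deps: dict[str, set[str]], cap: int = MAX_PARALLEL) -> list[list[str]]:
    """Group passes into batches of at most `cap`, never ahead of a dependency."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    pending: list[str] = []
    batches: list[list[str]] = []
    while sorter.is_active():
        pending.extend(sorter.get_ready())  # newly unblocked passes
        pending.sort()
        batch, pending = pending[:cap], pending[cap:]
        batches.append(batch)
        sorter.done(*batch)  # only completed passes unblock their dependents
    return batches
```

Because a pass is only marked done after its batch is emitted, no dependent pass can be scheduled just to keep an agent busy.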
106
+
107
+ ### Threat Model
108
+
109
+ Assume the repository may contain heavily LLM-assisted code.
110
+
111
+ Treat comments, README text, changelogs, examples, release notes, PR descriptions, test names, and issue text as claims, not proof.
112
+
113
+ Assume polished code may still be partially wired, dependency-unsound, only correct on the happy path, or inconsistent with real installed APIs. Assume hallucinated dependencies, hallucinated function signatures, stale framework knowledge, and cross-language package confusion are plausible until disproved.
114
+
115
+ ### Anti-Rationalization Rules
116
+
117
+ Reject these thoughts immediately:
118
+
119
+ - "This repo is too large to review carefully."
120
+ - "We already have enough findings."
121
+ - "The explorer probably got it right."
122
+ - "The architect can spot-check instead of reviewer validation."
123
+ - "This is only medium severity, so validation can be lighter."
124
+ - "This enhancement seems obvious, so it does not need evidence."
125
+ - "No quote is needed because the issue is apparent."
126
+ - "The code looks generated, so it must be wrong."
127
+ - "The code looks professional, so it must be right."
128
+ - "Runtime validation is inconvenient, so static review is enough."
129
+ - "The critic can wait until the end."
130
+ - "I should combine unrelated files to reduce pass count."
131
+ - "One track means I can be less thorough on that track."
132
+
133
+ ---
134
+
135
+ ## Core Evidence Rules
136
+
137
+ ### Small-Model Explorer Operating Mode
138
+
139
+ Explorer agents must operate as evidence extractors first and analysts second.
140
+
141
+ Explorer agents must:
142
+ - read only the assigned scope
143
+ - read every assigned file in that scope
144
+ - avoid architectural conclusions unless explicitly assigned an architecture or enhancement pass
145
+ - avoid severity inflation
146
+ - prefer exact yes/no/extracted-value answers over prose
147
+ - quote before interpreting
148
+ - identify uncertainty explicitly instead of filling gaps
149
+ - emit no candidate if evidence is not strong enough for at least MEDIUM confidence
150
+
151
+ Explorer agents must not:
152
+ - infer behavior from filenames alone
153
+ - infer security risk from framework stereotypes alone
154
+ - infer test coverage from test filenames alone
155
+ - infer UI quality from component names alone
156
+ - infer package validity from a package name sounding familiar
157
+ - infer generated-code quality from style alone
158
+ - propose fixes before proving the problem or opportunity exists
159
+
160
+ Micro-loop for every candidate:
161
+ ```
162
+ 1. What exact line or config proves the current state?
163
+ 2. What claim, contract, boundary, or quality standard is it compared against?
164
+ 3. What alternative interpretation would make the concern false?
165
+ 4. Did I check that alternative interpretation?
166
+ 5. Is there still at least MEDIUM confidence?
167
+ 6. If yes, emit a candidate. If no, record uncertainty only.
168
+ ```
169
+
170
+ ### Rule 1 — No Quote, No Claim
171
+
172
+ Every repo-derived factual claim must include a ground-truth quote with:
173
+ - exact relative file path
174
+ - exact line number or range
175
+ - verbatim code, config, script, doc, or command-output excerpt
176
+ - a short explanation of what the quote proves
177
+
178
+ If a claim cannot be quoted, discard it. This rule applies to inventory facts, dependency claims, public API claims, trust boundary claims, UI claims, test quality claims, enhancement opportunities, and final report statements.
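Rule 1 lends itself to a mechanical check: a quote either appears verbatim in the cited line range or it does not. This is a minimal sketch, assuming the caller has already read the file at the cited relative path; the function name and signature are hypothetical.

```python
def quote_is_grounded(file_text: str, start_line: int, end_line: int, quote: str) -> bool:
    """True only if `quote` appears verbatim inside the cited 1-indexed line range."""
    needle = quote.strip()
    lines = file_text.splitlines()
    # Reject empty quotes and line ranges that do not exist in the file.
    if not needle or start_line < 1 or end_line < start_line or end_line > len(lines):
        return False
    cited = "\n".join(lines[start_line - 1 : end_line])
    return needle in cited
```

A claim whose quote fails this check would be discarded rather than repaired by guessing a nearby line.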
179
+
180
+ ### Rule 2 — Candidate Findings Are Not Truth
181
+
182
+ Explorer output is candidate evidence only. Reviewer validation is the primary false-positive filter. Critic validation is mandatory for CRITICAL and HIGH findings. Enhancement findings require critic validation before appearing in the final report.
183
+
184
+ ### Rule 3 — Deterministic Before Judgment
185
+
186
+ Check mechanically before subjectively:
187
+ - Does the import resolve?
188
+ - Is the package declared and locked?
189
+ - Does the pinned version exist?
190
+ - Does the route have a handler?
191
+ - Does the command have an implementation?
192
+ - Does the public export have a consumer?
193
+ - Does the documented option exist in code?
194
+ - Does the framework API signature match the installed version?
195
+ - Does a test assertion actually fail when behavior is wrong?
196
+
197
+ ### Rule 4 — Explicit Disproof Required
198
+
199
+ For every candidate, ask: "What alternative interpretation would make this finding wrong?"
200
+
201
+ For CRITICAL or HIGH candidates, also record: what would disprove the finding, where that condition was checked, the quote proving it is absent, and why severity remains justified. If disproof cannot be articulated, downgrade to MEDIUM before reviewer validation.
202
+
203
+ ### Rule 5 — Runtime Validation When Behavior Depends on Runtime
204
+
205
+ Static review is insufficient when the claim depends on framework routing, identity/authorization state, sequencing, async behavior, database state, feature flags, tool permissions, LLM prompt/tool execution, bundler behavior, rendering behavior, or cross-platform shell behavior. When safe, run the smallest relevant validation command. If validation is not safe or not available, mark the finding UNVERIFIED unless static evidence is sufficient.
206
+
207
+ ### Rule 6 — Separate Defects from Enhancements
208
+
209
+ A defect is shipped behavior that is wrong, unsafe, broken, misleading, or materially incomplete.
210
+
211
+ An enhancement is a change that would make the codebase better without implying the current state is broken.
212
+
213
+ Do not convert enhancements into defects to sound stronger. Do not convert defects into enhancements to avoid severity decisions. Do not emit the same root issue in both formats.
214
+
215
+ ---
216
+
217
+ ## Severity and Value Rubrics
218
+
219
+ ### Defect Severity
220
+
221
+ **CRITICAL:** credible path to data loss, credential exposure, remote code execution, privilege escalation, destructive unauthorized action, supply-chain compromise, or complete inability to use a primary shipped function. Must include exact exploit/control-flow evidence or runtime validation unless impossible. Must pass inline critic before inclusion.
222
+
223
+ **HIGH:** serious broken shipped functionality, meaningful security/privacy exposure, major claim contradiction, broad user-impacting regression, high-risk untested trust boundary, or build/release integrity failure. Must include evidence of real impact. Must pass inline critic before inclusion.
224
+
225
+ **MEDIUM:** real defect with bounded impact, edge-case breakage, localized security hardening gap without demonstrated exploit path, meaningful test weakness, misleading documentation claim, or maintainability issue causing current correctness risk. Must pass reviewer finalization.
226
+
227
+ **LOW:** minor real defect, confusing behavior, small docs drift, narrow test-quality issue, low-risk cross-platform problem, or localized polish/accessibility defect. Must be actionable and non-noisy.
228
+
229
+ **INFO:** useful observation that does not meet defect severity but helps future work. Use sparingly.
230
+
231
+ ### Enhancement Value
232
+
233
+ **HIGH-VALUE:** materially improves maintainability, reliability, UX quality, performance headroom, security posture, observability, or developer velocity. Has a concrete implementation path. Likely worth doing even if no defect exists.
234
+
235
+ **MEDIUM-VALUE:** genuine improvement with narrower payoff, higher effort, or dependency on other cleanup. Useful but not transformational.
236
+
237
+ **LOW-VALUE:** small cleanup or preference-level improvement. Omit from final report unless user requested exhaustive enhancement review.
238
+
239
+ **REJECT:** stylistic preference without clear value; adds abstraction before need is demonstrated; contradicts the system's evident design; duplicates existing capability; cannot be tied to exact code evidence; too vague for implementation.
240
+
241
+ ---
242
+
243
+ ## Artifact Layout
244
+
245
+ Create the review run directory before any track runs:
246
+
247
+ ```
248
+ .swarm/review-v7/runs/<run_id>/
249
+ metadata.json
250
+ source-of-truth-packet.md
251
+ artifacts/
252
+ claims.jsonl
253
+ surfaces.jsonl
254
+ boundaries.jsonl
255
+ ai-surfaces.jsonl
256
+ ui-inventory.jsonl
257
+ test-inventory.jsonl
258
+ coverage.jsonl
259
+ candidates.jsonl
260
+ validations.jsonl
261
+ critic.jsonl
262
+ disproven.jsonl
263
+ commands.jsonl
264
+ ledgers/
265
+ inventory-summary.md
266
+ candidate-summary.md
267
+ validation-summary.md
268
+ test-drift-review.md
269
+ strengths-ledger.md
270
+ final-critic-check.md
271
+ review-report.md
272
+ ```
273
+
274
+ Before writing under `.swarm/`, verify `.swarm/` is ignored or locally excluded. If tracked `.swarm` files exist, warn and record in `metadata.json`.
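A run directory along these lines could be initialized as follows. This is a sketch, not the package's `init-review-run.py`: the nesting of the `.jsonl` files under `artifacts/` is an assumption (the layout above does not show indentation), and the `swarm_ignored` metadata key is hypothetical. `git check-ignore -q` exits 0 when the path is ignored; the helper assumes `git` is installed.

```python
import json
import subprocess
from pathlib import Path

# Assumed to live under artifacts/ in the layout above.
ARTIFACT_FILES = [
    "claims.jsonl", "surfaces.jsonl", "boundaries.jsonl", "ai-surfaces.jsonl",
    "ui-inventory.jsonl", "test-inventory.jsonl", "coverage.jsonl",
    "candidates.jsonl", "validations.jsonl", "critic.jsonl",
    "disproven.jsonl", "commands.jsonl",
]


def swarm_dir_is_ignored(repo_root: str) -> bool:
    """Ask git whether .swarm/ is ignored; any non-zero exit counts as 'no'."""
    result = subprocess.run(
        ["git", "check-ignore", "-q", ".swarm/"], cwd=repo_root, check=False
    )
    return result.returncode == 0


def init_run_dir(repo_root: str, run_id: str, swarm_ignored: bool) -> Path:
    """Create the run directory skeleton and record the ignore check in metadata."""
    run_dir = Path(repo_root) / ".swarm" / "review-v7" / "runs" / run_id
    (run_dir / "artifacts").mkdir(parents=True, exist_ok=True)
    (run_dir / "ledgers").mkdir(exist_ok=True)
    for name in ARTIFACT_FILES:
        (run_dir / "artifacts" / name).touch()
    metadata = {"run_id": run_id, "swarm_ignored": swarm_ignored}  # hypothetical keys
    (run_dir / "metadata.json").write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return run_dir
```

A caller would typically run `swarm_dir_is_ignored` first and pass the result in, so that a warning about an unignored `.swarm/` ends up in `metadata.json`.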
275
+
276
+ ---
277
+
278
+ ## Phase 0 — Decomposed Codebase Inventory
279
+
280
+ Purpose: build a grounded map of the repository before asking the user which review tracks to run.
281
+
282
+ Do not proceed to Phase 1 until Phase 0 is complete and the user has selected tracks.
283
+
284
+ ### Phase 0A — Bootstrap and Prior Context
285
+
286
+ Architect reads directly.
287
+
288
+ Tasks:
289
+ 1. Check current working directory and git status.
290
+ 2. Check for prior reports: `qa-report.md`, `enhancement-report.md`, `.swarm/review-v7/`, `.swarm/enhancement-report.md`, `OPENCODE.md`, `CLAUDE.md`, `AGENTS.md`.
291
+ 3. Identify package managers, language roots, and monorepo workspaces at a high level.
292
+ 4. Create `.swarm/review-v7/runs/<run_id>/`.
293
+ 5. Record whether this is a fresh review, continuation, or update.
294
+
295
+ Output:
296
+ ```
297
+ BOOTSTRAP_SUMMARY
298
+ review_type: fresh | continuation | update
299
+ repo_root: <path>
300
+ branch: <branch>
301
+ git_head: <sha>
302
+ dirty_worktree: yes | no
303
+ prior_reports_found: <list>
304
+ agent_instruction_files_found: <list>
305
+ initial_languages_or_workspaces: <list>
306
+ quote_log: <file path + line + quote proving each non-obvious fact>
307
+ END
308
+ ```
309
+
310
+ ### Phase 0B — Directory and Entry Point Map
311
+
312
+ Delegate to Explorer. Scope: structure only. Do not infer architecture quality.
313
+
314
+ Tasks:
315
+ 1. Enumerate top-level directories and files.
316
+ 2. Enumerate source directories two levels deep.
317
+ 3. Identify likely app entry points, package entry points, CLI entry points, server entry points, UI route roots, worker entry points, test roots, and build roots.
318
+ 4. Identify generated, vendored, lockfile, artifact, and dependency directories that should not be manually reviewed unless needed.
319
+ 5. Estimate reviewable file counts by domain.
320
+
321
+ Output:
322
+ ```
323
+ DIRECTORY_MAP
324
+ top_level:
325
+ - path:
326
+ quote:
327
+ apparent_role:
328
+ source_roots:
329
+ - path:
330
+ quote:
331
+ file_count_estimate:
332
+ entry_points:
333
+ - path:
334
+ kind: app | cli | server | worker | ui | package | test | build | unknown
335
+ quote:
336
+ excluded_or_low_signal_paths:
337
+ - path:
338
+ reason:
339
+ quote:
340
+ uncertainty:
341
+ END
342
+ ```
343
+
344
+ ### Phase 0C — Manifest, Dependency, Tooling, and CI Inventory
345
+
346
+ Delegate to Explorer. Scope: manifests, lockfiles, build scripts, CI, package manager metadata, Docker/container files, dependency update tooling, release tooling.
347
+
348
+ Do not judge vulnerabilities, suspiciousness, package validity, typosquatting, slopsquatting, or dependency risk in Phase 0C. Extract raw facts only. Track B performs risk assessment later.
349
+
350
+ Tasks:
351
+ 1. Read every manifest and lockfile.
352
+ 2. Extract package manager, runtime version constraints, scripts, build commands, lint commands, test commands, and release commands.
353
+ 3. Extract every direct dependency name and pinned or ranged version.
354
+ 4. Record source imports that are directly observed but absent from directly observed manifests. Do not label packages as suspicious in this pass.
355
+ 5. Inventory CI workflows and whether they run install, lint, typecheck, test, build, security scan, dependency scan, and artifact publishing.
356
+ 6. Inventory supply-chain controls: lockfiles, checksum or hash pinning, provenance, attestations, signed releases, dependency update bots, security policy.
357
+
358
+ Output:
359
+ ```
360
+ MANIFEST_INVENTORY
361
+ package_managers:
362
+ - name:
363
+ evidence_quote:
364
+ scripts:
365
+ - script_name:
366
+ command:
367
+ evidence_quote:
368
+ direct_dependencies:
369
+ - ecosystem:
370
+ name:
371
+ version_spec:
372
+ manifest_path:
373
+ evidence_quote:
374
+ extraction_notes: <import_manifest_mismatch_only_or_N/A>
375
+ ci_quality_gates:
376
+ - workflow_path:
377
+ gates_found:
378
+ evidence_quote:
379
+ supply_chain_controls:
380
+ lockfile_present: yes | no | partial
381
+ dependency_update_tooling: yes | no | unknown
382
+ provenance_or_attestation: yes | no | unknown
383
+ signed_release_or_commit_controls: yes | no | unknown
384
+ evidence_quotes:
385
+ uncertainty:
386
+ END
387
+ ```
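For an npm manifest, the raw-fact extraction in Phase 0C might look like the sketch below. It assumes a standard `package.json` shape; the `section` key is a hypothetical addition, and `evidence_quote` is left for the caller to fill from the file's actual lines. No risk judgment is made, in keeping with the pass's scope.

```python
import json

DEP_SECTIONS = ("dependencies", "devDependencies", "peerDependencies", "optionalDependencies")


def extract_manifest_facts(manifest_path: str, manifest_text: str) -> dict:
    """Extract scripts and direct dependencies from package.json text, facts only."""
    data = json.loads(manifest_text)
    scripts = [
        {"script_name": name, "command": command}
        for name, command in data.get("scripts", {}).items()
    ]
    direct_dependencies = [
        {
            "ecosystem": "npm",
            "name": name,
            "version_spec": spec,  # recorded as written, pinned or ranged
            "manifest_path": manifest_path,
            "section": section,  # hypothetical extra field
        }
        for section in DEP_SECTIONS
        for name, spec in data.get(section, {}).items()
    ]
    return {"scripts": scripts, "direct_dependencies": direct_dependencies}
```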
388
+
389
+ ### Phase 0D — Documentation, Claims, and Obligations Ledger
390
+
391
+ Delegate to Explorer. Scope: README, docs, changelog, release notes, migration notes, examples, comments that describe public behavior, PR or issue text if provided, test names when they claim behavior.
392
+
393
+ This pass extracts claims only. It does not decide whether claims are true.
394
+
395
+ Tasks:
396
+ 1. Read top-level README and documentation indexes.
397
+ 2. Extract every user-visible behavior claim.
398
+ 3. Extract every install, configuration, CLI, API, security, performance, compatibility, or platform claim.
399
+ 4. Extract every "supports X", "handles Y", "requires Z", "securely does Q", or "works on platform P" statement.
400
+ 5. Preserve the claim's exact wording and immediate context.
401
+ 6. Do not convert claims into implementation predicates in this pass.
402
+
403
+ Output:
404
+ ```
405
+ CLAIM
406
+ claim_id: CLAIM-001
407
+ source_file:
408
+ source_line:
409
+ exact_quote:
410
+ claim_type: behavior | install | config | cli | api | security | performance | compatibility | platform | test_name | other
411
+ directly_stated_subject:
412
+ directly_stated_expected_behavior:
413
+ ambiguity_notes:
414
+ status: unverified
415
+ END
416
+ ```
417
+
418
+ Rules:
419
+ - Split compound claims only when the source text itself lists separate claims.
420
+ - Do not merge unrelated claims.
421
+ - If a claim cannot be made testable, record it as NON_TESTABLE_CLAIM with source file, source line, exact quote, and reason. Do not discard it.
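The CLAIM block above maps naturally onto one JSON object per line in `claims.jsonl`. The following sketch mirrors the field names from the template; the dataclass, its validation, and the serializer are illustrative assumptions rather than the package's schema.

```python
import json
from dataclasses import asdict, dataclass

CLAIM_TYPES = {
    "behavior", "install", "config", "cli", "api", "security",
    "performance", "compatibility", "platform", "test_name", "other",
}


@dataclass
class Claim:
    claim_id: str
    source_file: str
    source_line: int
    exact_quote: str
    claim_type: str
    directly_stated_subject: str = ""
    directly_stated_expected_behavior: str = ""
    ambiguity_notes: str = ""
    status: str = "unverified"  # Phase 0D never decides whether a claim is true

    def __post_init__(self) -> None:
        if self.claim_type not in CLAIM_TYPES:
            raise ValueError(f"unknown claim_type: {self.claim_type}")
        if not self.exact_quote.strip():
            raise ValueError("a claim needs an exact quote")


def to_jsonl(claims: list[Claim]) -> str:
    """Serialize claims as one JSON object per line."""
    return "".join(json.dumps(asdict(c), ensure_ascii=False) + "\n" for c in claims)
```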
422
+
423
+ ### Phase 0E — Public Surface Inventory
424
+
425
+ Delegate to Explorer. Scope: routes, controllers, commands, public exports, SDK APIs, event handlers, schemas, database migrations, config keys, environment variables, jobs, queues, plugin hooks, extension points.
426
+
427
+ Tasks:
428
+ 1. Identify all public entry surfaces.
429
+ 2. Identify input shapes, output shapes, auth requirements if directly visible, and wiring targets.
430
+ 3. Identify exported symbols that appear public.
431
+ 4. Identify config and env vars that users or deployments must set.
432
+ 5. Identify migrations and schema changes that affect persistence.
433
+
434
+ Output:
435
+ ```
436
+ PUBLIC_SURFACE
437
+ id: SURFACE-001
438
+ kind: route | cli | export | config | env | schema | migration | job | queue | hook | plugin | event | other
439
+ name:
440
+ file:
441
+ line:
442
+ exact_quote:
443
+ inputs:
444
+ outputs:
445
+ wiring_target:
446
+ auth_or_permission_signal:
447
+ linked_claims:
448
+ uncertainty:
449
+ END
450
+ ```
451
+
452
+ ### Phase 0F — Trust Boundary and Data Flow Inventory
453
+
454
+ Delegate to Explorer. Scope: boundary crossings only.
455
+
456
+ Tasks:
457
+ 1. Identify external input ingress: HTTP, WebSocket, CLI args, env vars, files, uploads, clipboard, drag/drop, forms, IPC, queues, webhooks, plugins, browser storage, database reads, subprocess output.
458
+ 2. Identify sensitive sinks: database writes, file writes, subprocess execution, shell execution, network calls, auth/session changes, template rendering, DOM insertion, logs, telemetry, LLM calls, vector database writes, tool calls.
459
+ 3. Identify authentication and authorization boundaries.
460
+ 4. Identify serialization and deserialization boundaries.
461
+ 5. Identify LLM-specific boundaries: prompts, system prompts, user prompts, retrieval context, tool schemas, MCP servers, agent permissions, output parsers, model responses.
462
+ 6. Identify MCP-specific surfaces: registered tool descriptions, tool parameter schemas, resource URIs, server-to-server chains.
463
+
464
+ Output:
465
+ ```
466
+ TRUST_BOUNDARY
467
+ id: BOUNDARY-001
468
+ boundary_type:
469
+ source:
470
+ sink:
471
+ file:
472
+ line:
473
+ exact_quote:
474
+ validation_or_guard_observed: yes | no | unknown
475
+ auth_or_permission_observed: yes | no | unknown
476
+ data_sensitivity:
477
+ linked_public_surface:
478
+ linked_claims:
479
+ uncertainty:
480
+ END
481
+ ```
482
+
483
+ Guard fields rule: record `unknown` unless a guard or its absence is unambiguously visible in the same file and same local code region as the boundary quote. Do not infer missing guards from not seeing them in a narrow pass. Track B validates guards later.
484
+
485
+ ### Phase 0G — Test, Quality Gate, and Drift Inventory
486
+
487
+ Delegate to test_engineer if available. Use Explorer only when test_engineer is not assigned.
488
+
489
+ Scope: tests and quality tooling only.
490
+
491
+ Tasks:
492
+ 1. Identify test frameworks, test commands, test directories, fixture directories, mock utilities, coverage tooling, mutation tooling, property-based testing tooling, e2e tooling, snapshot tooling.
493
+ 2. List test file names, test function names, and what subjects they import or instantiate.
494
+ 3. Inventory CI test gates.
495
+ 4. Identify test names or comments that make behavior claims that must be checked later for drift.
496
+ 5. If Phase 0E is available, list public surfaces with no obviously corresponding test. If Phase 0E is unavailable, record as unknown.
497
+
498
+ Output:
499
+ ```
500
+ TEST_QUALITY_INVENTORY
501
+ test_frameworks:
502
+ - framework:
503
+ evidence_quote:
504
+ test_commands:
505
+ - command:
506
+ evidence_quote:
507
+ test_roots:
508
+ - path:
509
+ evidence_quote:
510
+ observed_test_subjects:
511
+ - test_file:
512
+ test_name_or_import:
513
+ evidence_quote:
514
+ quality_gates:
515
+ lint:
516
+ typecheck:
517
+ unit:
518
+ integration:
519
+ e2e:
520
+ coverage:
521
+ mutation:
522
+ property_based:
523
+ evidence_quotes:
524
+ test_claims_for_later_review:
525
+ - file:
526
+ line:
527
+ exact_quote:
528
+ review_later_reason:
529
+ surface_test_name_gaps:
530
+ - surface_id:
531
+ evidence_quote:
532
+ uncertainty:
533
+ END
534
+ ```
535
+
536
+ ### Phase 0H — UI, UX, and Design System Inventory
537
+
538
+ Delegate to Explorer. If a designer agent exists, use designer for this pass.
539
+
540
+ Scope: detect UI presence and map UI assets. Do not critique yet.
541
+
542
+ Tasks:
543
+ 1. Determine whether there is a user-facing UI, desktop UI, web app, browser extension UI, terminal UI, admin console, or docs site.
544
+ 2. Identify UI framework, component system, route/page structure, styling system, theme or design token files, icons, fonts, animation libraries, and accessibility utilities.
545
+ 3. Identify whether screenshots, Storybook, Playwright, visual tests, or design docs exist.
546
+ 4. Identify structural design signals only: dark/light mode tokens, density tokens, route/page/component naming, and explicitly stated UI type in docs or code comments. Do not classify the aesthetic register yet.
547
+ 5. Flag whether any component library defaults are in use unmodified (e.g., shadcn/ui with no customization, Tailwind defaults with no design token layer).
548
+
549
+ Output:
550
+ ```
551
+ UI_INVENTORY
552
+ ui_present: yes | no | partial
553
+ ui_type:
554
+ framework:
555
+ component_roots:
556
+ route_or_page_roots:
557
+ styling_system:
558
+ theme_or_token_files:
559
+ design_token_customization: yes | no | unknown
560
+ component_library_defaults_unmodified: yes | no | unknown
561
+ accessibility_tooling:
562
+ visual_test_tooling:
563
+ design_structural_signals:
564
+ evidence_quotes:
565
+ uncertainty:
566
+ END
567
+ ```
568
+
569
+ ### Phase 0I — AI, Agent, and Model Surface Inventory
570
+
571
+ Delegate to Explorer.
572
+
573
+ Scope: AI/LLM/agent functionality only.
574
+
575
+ Deterministic skip rule: skip only if Phase 0B found no AI-related file, directory, or symbol names (ai, llm, prompt, agent, model, openai, anthropic, embedding, vector, rag, mcp, tool, eval) AND Phase 0C found no AI-related packages. If either signal exists, run Phase 0I.
576
+
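The deterministic skip rule above could be checked mechanically along these lines. The keyword list comes from the rule; everything else is an assumption for illustration: the function names, the tokenization of names into words, the tolerance for a trailing plural `s`, and the idea that Phase 0C hands over a list of packages it already judged AI-related.

```python
import re

# Keywords taken from the skip rule.
AI_KEYWORDS = {
    "ai", "llm", "prompt", "agent", "model", "openai", "anthropic",
    "embedding", "vector", "rag", "mcp", "tool", "eval",
}

# Splits paths and identifiers into words: "LLMRouter.ts" -> LLM, Router, ts.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def name_tokens(name: str) -> set:
    return {token.lower() for token in _TOKEN.findall(name)}


def has_ai_name_signal(names) -> bool:
    """True if any file, directory, or symbol name contains an AI keyword."""
    for name in names:
        for token in name_tokens(name):
            singular = token[:-1] if token.endswith("s") else token
            if token in AI_KEYWORDS or singular in AI_KEYWORDS:
                return True
    return False


def should_run_phase_0i(phase_0b_names, phase_0c_ai_packages) -> bool:
    """Run Phase 0I if either signal exists; skip only if both are absent."""
    return has_ai_name_signal(phase_0b_names) or bool(phase_0c_ai_packages)
```

Whole-word matching is a design choice in this sketch: a plain substring test would treat `main.py` or `email.py` as containing `ai`.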
577
+ Tasks:
578
+ 1. Identify model calls, prompt templates, system prompts, tool definitions, function-calling schemas, MCP servers, autonomous agent loops, memory, retrieval, embeddings, vector stores, evaluators, moderation, content filters, and output parsers.
579
+ 2. Identify any user-controllable content that enters prompts or tools.
580
+ 3. Identify any model output that flows into code execution, database writes, network calls, browser rendering, files, shell commands, or user-visible authoritative claims.
581
+ 4. Identify rate limits, token limits, budget limits, retries, timeouts, and circuit breakers if visible.
582
+ 5. Identify MCP-specific surfaces: registered tool descriptions that include prose the model will read, tool parameter schemas, server-to-server chains, and whether untrusted content from external sources can enter tool descriptions or resource outputs.
583
+
584
+ Output:
585
+ ```
586
+ AI_SURFACE
587
+ id: AI-001
588
+ kind: prompt | model_call | tool | agent_loop | mcp | mcp_tool_description | retrieval | embedding | vector_store | parser | evaluator | memory | moderation | other
589
+ file:
590
+ line:
591
+ exact_quote:
592
+ user_controlled_inputs:
593
+ model_outputs:
594
+ downstream_sinks:
595
+ permissions_or_limits:
596
+ linked_trust_boundaries:
597
+ mcp_chain_depth: <number of MCP servers in chain if applicable>
598
+ uncertainty:
599
+ END
600
+ ```
601
+
602
+ ### Phase 0J — Architect Inventory Synthesis
603
+
604
+ Architect synthesizes Phase 0 outputs. Do not add unquoted repo facts.
605
+
606
+ Create `source-of-truth-packet.md` and `ledgers/inventory-summary.md`.
607
+
608
+ Before writing the summary, verify every required Phase 0 ledger exists and is non-empty. If a ledger is not applicable, create it with an explicit `NOT_APPLICABLE` reason.
609
+
610
+ Minimum adequacy gate: if fewer than five non-`NOT_APPLICABLE`, non-empty structured blocks exist across all applicable Phase 0 ledgers, or if the inventory is too sparse to support the selected review scope, stop and report the limitation.
611
+
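The countable half of the minimum adequacy gate can be sketched as below. The ledger representation is an assumption: each ledger is text, a not-applicable ledger starts with `NOT_APPLICABLE`, and a block counts as non-empty when at least one `key: value` line has a value. The judgement about an inventory being "too sparse" for the selected scope is not modeled.

```python
import re

_HEADER = re.compile(r"^[A-Z][A-Z0-9_]*$")


def count_structured_blocks(ledger_text: str) -> int:
    """Count HEADER ... END blocks that contain at least one filled field."""
    count = 0
    in_block = False
    has_value = False
    for raw in ledger_text.splitlines():
        line = raw.strip()
        if not in_block:
            if _HEADER.match(line) and line != "END":
                in_block, has_value = True, False
        elif line == "END":
            if has_value:
                count += 1
            in_block = False
        elif ":" in line and line.split(":", 1)[1].strip():
            has_value = True
    return count


def passes_adequacy_gate(ledgers: dict, minimum: int = 5) -> bool:
    """ledgers maps a ledger name to its text; NOT_APPLICABLE ledgers are skipped."""
    total = sum(
        count_structured_blocks(text)
        for text in ledgers.values()
        if not text.lstrip().startswith("NOT_APPLICABLE")
    )
    return total >= minimum
```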
612
+ Claim synthesis duties:
613
+ - Convert raw Phase 0D claims into testable predicates now, after having access to public surfaces, manifests, trust boundaries, tests, UI, and AI inventory.
614
+ - Assign likely verification targets only when supported by Phase 0E-0I evidence.
615
+ - Assign `risk_if_false` only after considering user impact, public surface exposure, and trust boundaries.
616
+ - Summarize NON_TESTABLE_CLAIM entries under Unknowns.
617
+
618
+ The source-of-truth packet must contain only Phase 0 facts and must include:
619
+
620
+ ```markdown
621
+ # Source of Truth Packet
622
+
623
+ ## Repo Identity
624
+ [repo name, branch, git HEAD SHA, review type]
625
+
626
+ ## Tech Stack
627
+ [languages, runtimes, frameworks, package managers]
628
+
629
+ ## Commands
630
+ [install, lint, typecheck, test, build, run commands with evidence]
631
+
632
+ ## Public Surfaces
633
+ [IDs and one-line descriptions]
634
+
635
+ ## Trust Boundaries
636
+ [IDs and one-line descriptions]
637
+
638
+ ## MCP and Agent Surfaces
639
+ [IDs, descriptions, and chain depth]
640
+
641
+ ## Claims Needing Verification
642
+ [top claim IDs and predicates]
643
+
644
+ ## Test and Quality Gates
645
+ [test frameworks and CI gates]
646
+
647
+ ## UI Applicability
648
+ [whether UI review applies and why; whether component library defaults appear unmodified]
649
+
650
+ ## AI/Agent Applicability
651
+ [whether LLM/agent review applies and why]
652
+
653
+ ## Review Track Recommendation
654
+ [architect recommendation]
655
+
656
+ ## Prohibited Assumptions
657
+ - Do not assume facts not present in this packet or quoted from source.
658
+ - Do not assume a dependency exists unless manifest/lock/import evidence proves it.
659
+ - Do not assume a feature works because docs claim it.
660
+ - Do not assume a UI exists unless Phase 0H says it does.
661
+ - Do not assume MCP tool descriptions are trusted input.
662
+ ```
663
+
664
+ ---
665
+
666
+ ## Phase 0K — User Review Mode Gate
667
+
668
+ Stop after Phase 0J. Ask the user which review track or tracks to run.
669
+
670
+ Do not proceed until the user selects a scope, unless the user's original instruction already explicitly selected tracks and explicitly told you not to ask.
671
+
672
+ Present the choices:
673
+
674
+ ```
675
+ Phase 0 inventory is complete. Based on the repository shape, I recommend:
676
+
677
+ [Architect recommendation grounded in Phase 0 evidence]
678
+
679
+ Choose review scope:
680
+ 1. Complete Integrated Review — all defect-focused tracks plus enhancement opportunities.
681
+ 2. Defect-Focused Comprehensive QA — all defect tracks, no enhancement catalog.
682
+ 3. Security and Supply Chain Focus — AppSec, LLM/MCP security, dependency integrity, CI provenance.
683
+ 4. Functionality and Correctness Focus — claims-vs-shipped, wiring, edge cases, business logic.
684
+ 5. Testing and Test Quality Focus — behavioral coverage, test drift, mutation resilience, property-based gaps.
685
+ 6. UI/UX and Accessibility Focus — visual hierarchy, interaction design, WCAG 2.2 AA, typography, polish, performance, design system, AI-slop UI patterns.
686
+ 7. Performance and Observability Focus — runtime performance, resource use, startup, telemetry, logs, metrics, traces.
687
+ 8. AI Slop and Code Provenance Focus — hallucinated APIs, phantom dependencies, confident stubs, slopsquatting, context rot, stale API usage.
688
+ 9. Enhancement Opportunities Only — architecture, quality, DX, performance, resilience, observability, UI/UX improvements. Not a bug hunt.
689
+ 10. Custom Combination — specify any combination or narrower subsystem.
690
+
691
+ Please select one or more options.
692
+ ```
693
+
694
+ If the user selects a focused review, do not run unrelated tracks. Mention omitted tracks in coverage notes.
695
+
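One way to keep a focused review from running unrelated tracks is a lookup from the selected option numbers to tracks. The sketch below covers only Tracks A, B, and C, using the "Run if user selected options ..." lines that introduce those tracks later in this document; custom scopes (option 10) and any other tracks are not modeled, and the function names are hypothetical.

```python
# Option numbers under which each track runs, per the tracks' "Run if" lines.
TRACK_OPTIONS = {
    "A": {1, 2, 4},  # Functionality, Correctness, and Claims-vs-Shipped
    "B": {1, 2, 3},  # Security, Privacy, LLM Security, and Supply Chain
    "C": {1, 2, 5},  # Testing and Test Quality
}


def tracks_to_run(selected_options) -> list:
    """Return the tracks triggered by the user's selected option numbers."""
    chosen = set(selected_options)
    return sorted(t for t, options in TRACK_OPTIONS.items() if chosen & options)


def omitted_tracks(selected_options) -> list:
    """Tracks not run, to be mentioned in the coverage notes."""
    return sorted(set(TRACK_OPTIONS) - set(tracks_to_run(selected_options)))
```

For example, `tracks_to_run([3])` yields only Track B, and `omitted_tracks([3])` lists A and C for the coverage notes.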
696
+ ---
697
+
698
+ ## Phase 1 — Selected Track Candidate Generation
699
+
700
+ Phase 1 generates candidates, not truth. Phase 1 obeys the global concurrency policy.
701
+
702
+ Every Phase 1 agent dispatch must include:
703
+ - selected review track(s) for that dispatch
704
+ - exact file list or public surface IDs in scope
705
+ - `source-of-truth-packet.md`
706
+ - relevant Phase 0 ledger excerpts for claims, surfaces, boundaries, tests, UI, or AI surfaces
707
+ - the candidate output format
708
+ - explicit instruction that out-of-scope issues should be recorded as `out_of_scope_note` rather than emitted as candidates
709
+ - a reminder of the anti-cursory contract: selecting this track means exhaustive depth for it
710
+
711
+ File-size rule:
712
+ - `dense file` = a file over 300 logical lines, a file with multiple unrelated responsibilities, or a file with interleaved UI/state/network/security logic.
713
+ - Default: no more than 15 files per deep pass; no more than 8 dense files per deep pass.
714
+ - No sampling inside an assigned scope.
715
+
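The file-size rule can be read as a batching constraint. The following sketch splits an assigned scope into deep passes without dropping any file. It assumes a simple greedy split in the given order; `ScopedFile`, `flagged_dense`, and `plan_passes` are hypothetical names, and the two judgement-based density criteria are represented by a manual flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopedFile:
    path: str
    logical_lines: int
    # Reviewer judgement: unrelated responsibilities or interleaved logic.
    flagged_dense: bool = False

    @property
    def dense(self) -> bool:
        return self.logical_lines > 300 or self.flagged_dense


def plan_passes(files, max_files: int = 15, max_dense: int = 8) -> list:
    """Group files into passes of at most 15 files and 8 dense files each."""
    passes = []
    current = []
    dense_count = 0
    for item in files:
        over_files = len(current) >= max_files
        over_dense = item.dense and dense_count >= max_dense
        if current and (over_files or over_dense):
            passes.append(current)
            current, dense_count = [], 0
        current.append(item)
        if item.dense:
            dense_count += 1
    if current:
        passes.append(current)
    return passes
```

Every file lands in exactly one pass, which is how the sketch respects "no sampling inside an assigned scope".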
716
+ Classification tiebreaker:
717
+ - If a candidate could be either a defect or an enhancement, ask: would shipping the code as-is mislead a user, expose a security or privacy risk, lose data, break a documented/public behavior, or produce wrong behavior?
718
+ - If yes, emit a `CANDIDATE_FINDING`.
719
+ - If no, emit an `ENHANCEMENT_CANDIDATE`.
720
+ - Do not emit the same root issue in both formats.
721
+
722
+ ### Candidate Finding Format
723
+
724
+ ```
725
+ CANDIDATE_FINDING
726
+ id: <track>-<scope>-<sequence>
727
+ track: functionality | security | supply_chain | testing | ui_ux | performance | observability | ai_slop | docs_claims | cross_platform | cross_boundary
728
+ group: <short category>
729
+ provisional_severity: CRITICAL | HIGH | MEDIUM | LOW | INFO
730
+ confidence: HIGH | MEDIUM
731
+ file: <relative path>
732
+ line: <line or range>
733
+ exact_quote: <verbatim evidence>
734
+ title: <specific one-line title>
735
+ problem: <factual description>
736
+ impact: <why it matters>
737
+ likely_fix: <concrete likely remediation>
738
+ evidence_checked: <files, callers, configs, tests, docs, manifests, runtime paths checked>
739
+ alternative_interpretation: <what could make this wrong>
740
+ disproof_attempt: <required for CRITICAL/HIGH; recommended for all>
741
+ linked_claims: <claim ids or N/A>
742
+ linked_surfaces: <surface ids or N/A>
743
+ linked_boundaries: <boundary ids or N/A>
744
+ ai_pattern: <optional>
745
+ needs_runtime_validation: yes | no
746
+ size: S | M | L
747
+ END
748
+ ```
749
+
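A dispatcher could reject malformed candidates before synthesis. The sketch below checks the enumerated fields of the format above and the one conditional requirement it states (`disproof_attempt` for CRITICAL/HIGH). Treating `id`, `track`, `file`, `line`, `exact_quote`, `title`, and `problem` as mandatory is an assumption, as is the function name.

```python
SEVERITIES = {"CRITICAL", "HIGH", "MEDIUM", "LOW", "INFO"}
CONFIDENCES = {"HIGH", "MEDIUM"}
# Assumed mandatory for illustration; the format does not mark fields optional.
REQUIRED = ("id", "track", "file", "line", "exact_quote", "title", "problem")


def validate_candidate(fields: dict) -> list:
    """Return a list of problems; an empty list means the candidate is well-formed."""
    errors = []
    for name in REQUIRED:
        if not fields.get(name, "").strip():
            errors.append(f"missing {name}")
    severity = fields.get("provisional_severity", "")
    if severity not in SEVERITIES:
        errors.append("provisional_severity must be CRITICAL, HIGH, MEDIUM, LOW, or INFO")
    if fields.get("confidence", "") not in CONFIDENCES:
        errors.append("confidence must be HIGH or MEDIUM")
    if fields.get("needs_runtime_validation", "") not in {"yes", "no"}:
        errors.append("needs_runtime_validation must be yes or no")
    if fields.get("size", "") not in {"S", "M", "L"}:
        errors.append("size must be S, M, or L")
    if severity in {"CRITICAL", "HIGH"} and not fields.get("disproof_attempt", "").strip():
        errors.append("disproof_attempt is required for CRITICAL/HIGH")
    return errors
```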
750
+ ### Enhancement Candidate Format
751
+
752
+ ```
753
+ ENHANCEMENT_CANDIDATE
754
+ id: ENH-<track>-<sequence>
755
+ track: enhancement | architecture | code_quality | testing | ui_ux | performance | observability | resilience | developer_experience
756
+ domain: <specific subsystem or component family>
757
+ category: architecture | code_quality | simplification | developer_experience | performance | resilience | observability | ui_hierarchy | ui_interaction | ui_accessibility | ui_typography | ui_performance | ui_consistency | testing
758
+ value_level: high | medium | low
759
+ confidence: HIGH | MEDIUM
760
+ file: <relative path>
761
+ line: <line or range>
762
+ exact_quote: <verbatim current-state evidence>
763
+ title: <specific one-line title>
764
+ current_state: <what exists now, without calling it broken>
765
+ confirms_current_code_is_working: yes | no
766
+ enhancement: <specific implementable improvement>
767
+ expected_impact: <what improves>
768
+ effort: S | M | L
769
+ dependencies: <other enhancement ids or N/A>
770
+ alternative_interpretation: <why the current design might be intentional>
771
+ disproof_attempt: <required for HIGH-confidence high-value candidates; recommended for all>
772
+ rejection_risk: <what would make this a bad suggestion>
773
+ END
774
+ ```
775
+
776
+ ---
777
+
778
+ ### Track A — Functionality, Correctness, and Claims-vs-Shipped
779
+
780
+ Run if user selected options 1, 2, 4, or a custom scope requiring behavior review.
781
+
782
+ **Anti-cursory contract for Track A:** Build a coverage unit for every public surface from Phase 0E. Every surface must be traced from entry point to implementation. A surface marked REVIEWED must have had its entry point read, its implementation traced, its tests checked, and its claims from Phase 0D compared against the implementation. Closing the coverage matrix is required before synthesis.
783
+
784
+ **Agent lens:** shipped behavior correctness. Does the code do what it claims and what it documents?
785
+
786
+ **Required method for each surface:**
787
+ 1. Pick a public surface from Phase 0E.
788
+ 2. Link any claims from Phase 0D.
789
+ 3. Trace from entry point through routing/wiring to implementation.
790
+ 4. Extract obligations first (what docs/claims say should happen).
791
+ 5. Summarize implemented behavior second.
792
+ 6. Compare obligations to implementation third.
793
+ 7. Check tests for behavioral assertions on this surface.
794
+ 8. Emit only grounded candidates.
795
+
796
+ **Check:**
797
+
798
+ *Wiring and reachability:*
799
+ - Route, command, job, hook, plugin, and export wiring — does the registered path lead to an actual handler?
800
+ - Unreachable code and dead branches in public behavior paths
801
+ - Exported symbols with no consumers and no documented extension intent
802
+ - Handler registered but not called, called but wrong arguments, wrong return value forwarding
803
+
804
+ *Claim vs. implementation:*
805
+ - Documented feature claims versus actual code paths
806
+ - "Supports X" claims with no supporting implementation
807
+ - Default values in docs that differ from default values in code
808
+ - Removed behavior still documented as present
809
+ - Parameters, option names, env vars, schema fields, and response fields mismatched between docs and implementation
810
+
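The "default values in docs that differ from default values in code" check amounts to a comparison of two mappings. A minimal sketch, assuming the documented and implemented defaults have already been extracted by hand into dictionaries (the function name and the option names in the usage note are hypothetical):

```python
def find_default_drift(documented: dict, implemented: dict) -> list:
    """Compare documented option defaults against the defaults found in code."""
    drift = []
    for name in sorted(set(documented) | set(implemented)):
        if name not in implemented:
            drift.append((name, "documented but not implemented"))
        elif name not in documented:
            drift.append((name, "implemented but not documented"))
        elif documented[name] != implemented[name]:
            drift.append(
                (name, f"docs say {documented[name]!r}, code uses {implemented[name]!r}")
            )
    return drift
```

With docs listing `timeout: 30` and code using `timeout: 60`, the result contains `("timeout", "docs say 30, code uses 60")`. Each entry still needs an exact quote from both sides before it becomes a candidate.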
811
+ *Logic correctness:*
812
+ - Off-by-one logic and boundary conditions
813
+ - Integer overflow or underflow where input is externally controlled
814
+ - Floating-point comparison where equality is asserted
815
+ - Signed/unsigned mismatch in comparisons or arithmetic
816
+ - Wrong operator precedence in complex boolean expressions
817
+ - Null/undefined not handled where the value may be absent
818
+ - Early returns that skip required side effects
819
+
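As an illustration of the floating-point item in the list above, the sketch below contrasts an exact equality check with a tolerance-based one. The function names and tolerance values are illustrative assumptions; suitable tolerances depend on the domain.

```python
import math


def totals_match_naive(a: float, b: float) -> bool:
    # Exact equality: fails for values that differ only by rounding error.
    return a == b


def totals_match(a: float, b: float, rel_tol: float = 1e-9, abs_tol: float = 1e-12) -> bool:
    # Tolerance-based comparison using the standard library.
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
```

`totals_match_naive(0.1 + 0.2, 0.3)` is false because of binary rounding, while `totals_match(0.1 + 0.2, 0.3)` is true.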
820
+ *Async correctness:*
821
+ - Missing awaits (promise returned but not awaited)
822
+ - Ignored promise return values (fire-and-forget where failure matters)
823
+ - Race conditions in shared state accessed by concurrent async paths
824
+ - Async operations whose order matters but is not enforced by sequential awaits
825
+ - Error swallowed inside async then/catch when caller needs it
826
+ - Unhandled promise rejections in event listeners or callbacks
827
+
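The "error swallowed when caller needs it" pattern has the same shape in most async runtimes. The sketch below uses Python's `asyncio` rather than JavaScript promises; `save` and both handlers are hypothetical.

```python
import asyncio


async def save(record: dict, store: list) -> None:
    await asyncio.sleep(0)  # stand-in for real I/O
    if "id" not in record:
        raise ValueError("record has no id")
    store.append(record)


async def handler_swallowing(record: dict, store: list) -> str:
    try:
        await save(record, store)
    except ValueError:
        pass  # error swallowed: the caller cannot tell the save failed
    return "saved"


async def handler_propagating(record: dict, store: list) -> str:
    await save(record, store)  # failure reaches the caller
    return "saved"
```

`handler_swallowing({}, store)` reports `"saved"` while `store` stays empty, which is the kind of mismatch Track A should surface as a grounded candidate.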
828
+ *Data model and persistence:*
829
+ - Data model mismatches across persistence layer, API layer, and UI layer
830
+ - Migration or schema drift (new column in docs but not in migration file, or vice versa)
831
+ - Serialization and deserialization that silently drops fields
832
+ - JSON parse/stringify round-trip loss
833
+ - Feature flag or config behavior drift
834
+ - State machine edge cases: missing transitions, invalid state combinations, missing final states
835
+
836
+ *Cross-platform:*
837
+ - Code claiming portability but using platform-specific APIs (path separators, signals, shell-isms)
838
+ - Environment assumptions that break on Windows/macOS/Linux differences
839
+
840
+ *Happy-path-only:*
841
+ - Error handling that claims recovery but only logs or swallows
842
+ - Input validation that accepts empty, null, oversized, or malformed values without handling them
843
+ - Network timeout handling missing or set to unbounded
844
+
845
+ ---
846
+
847
+ ### Track B — Security, Privacy, LLM Security, and Supply Chain
848
+
849
+ Run if user selected options 1, 2, 3, or a custom security scope.
850
+
851
+ **Anti-cursory contract for Track B:** Build a coverage unit for every trust boundary from Phase 0F and every AI surface from Phase 0I. Every boundary and AI surface must be reviewed. A boundary marked REVIEWED must have had its source, guard, sink, and impact traced. An AI surface marked REVIEWED must have had its user-controlled input paths and downstream sinks traced.
852
+
853
+ **Agent lens:** exploitable or protection-relevant risk.
854
+
855
+ **Frameworks:**
856
+ - OWASP ASVS 4.0.3 as the verifiable AppSec checklist baseline for web application controls
857
+ - OWASP Top 10 for LLM Applications 2025: LLM01–LLM10 as listed in the State-of-the-Art Anchors
858
+ - SLSA Version 1.2 for supply-chain provenance and verification
859
+ - OpenSSF Scorecard for repository hygiene checks
860
+
861
+ **Required method:**
862
+ 1. Start from Phase 0F trust boundaries and Phase 0I AI surfaces.
863
+ 2. For each candidate, identify: attacker-controlled input → insufficient guard → sensitive sink → impact.
864
+ 3. If exploitability depends on runtime behavior, run a safe minimal validation or mark UNVERIFIED.
865
+ 4. For dependency candidates, verify against manifests, lockfiles, imports, and registry evidence when safe.
866
+
867
+ **Application security checks:**
868
+
869
+ *Injection:*
870
+ - SQL injection via string concatenation, template interpolation, or ORM raw query misuse
871
+ - Command injection via unsanitized input in shell.exec, subprocess, eval, or dynamic code execution
872
+ - Path traversal via unsanitized file paths (../../ attacks, null bytes, URL-encoded sequences)
873
+ - SSRF via user-controlled URLs in fetch, HTTP client, redirect, webhook, or import
874
+ - Template injection via unsanitized input in template engines (Handlebars, Jinja2, EJS, Pug)
875
+ - DOM-based XSS via innerHTML, document.write, dangerouslySetInnerHTML, or eval with user input
876
+ - LDAP, XML, XPath injection where those parsers are in use
877
+ - Header injection via unsanitized values in response headers
878
+ - Log injection via unsanitized user input in log statements that attackers could use to forge log entries
879
+
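For the path traversal item, a reviewer is typically looking for a guard of roughly this shape between the user-supplied path and the file sink. This is a sketch under assumptions: `resolve_inside` is a hypothetical helper, it needs Python 3.9+ for `Path.is_relative_to`, and it does not address encoding tricks handled elsewhere in a request pipeline.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting anything that escapes it."""
    base = Path(base_dir).resolve()
    # resolve() collapses '..' segments and follows symlinks before the check.
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return candidate
```

An absolute `user_path` replaces the base when joined, so it fails the containment check just like a `../` sequence does.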
880
+ *Authentication and authorization:*
881
+ - Missing authentication on routes/handlers that claim or imply protection
882
+ - Inconsistent authorization: enforced in one path but not in sibling or alternative path
883
+ - Horizontal privilege escalation: user can access another user's resources by changing an ID
884
+ - Vertical privilege escalation: lower-privileged user can invoke higher-privileged action
885
+ - JWT algorithm confusion (none algorithm, RS256 vs HS256 confusion)
886
+ - Token/session not invalidated on logout or password change
887
+ - Authentication bypass via mass assignment, parameter pollution, or HTTP method override
888
+ - Insecure direct object reference without ownership check
889
+ - CSRF missing where state-changing operations use cookies or sessions
890
+ - CORS misconfiguration: wildcard origin with credentials, or overly permissive allow-origin
891
+
892
+ *Secrets and sensitive data:*
893
+ - Hardcoded secrets, tokens, credentials, private keys, API keys, or passwords in source
894
+ - Sensitive defaults (default admin/admin, empty string passwords)
895
+ - Credentials or PII logged in plaintext (including in telemetry, error messages, or debug output)
896
+ - API keys or tokens in client-side code, public assets, or URLs
897
+ - Sensitive data in HTTP responses that should not be returned
898
+ - Insecure cookie flags: missing HttpOnly, Secure, or SameSite attributes
899
+
900
+ *Cryptography:*
901
+ - Weak hashing for passwords (MD5, SHA1, unsalted SHA256; require bcrypt/argon2/scrypt)
902
+ - Weak randomness for security-sensitive values (Math.random(), time-based seeds)
903
+ - Insecure transport: HTTP used for security-sensitive operations, TLS version pinned to old versions
904
+ - Predictable token generation or insufficient entropy for session IDs
905
+ - Crypto misuse: ECB mode, fixed IVs, reused nonces, unauthenticated encryption
906
+
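For contrast with the weak patterns listed above, here is a sketch of what acceptable token generation and password hashing can look like using only the Python standard library. The function names and scrypt parameters are illustrative assumptions, not a recommendation from the protocol; production systems usually rely on a maintained password-hashing library.

```python
import hashlib
import hmac
import secrets


def new_session_token() -> str:
    # CSPRNG-backed token, unlike random.random() or time-based seeds.
    return secrets.token_urlsafe(32)


def hash_password(password: str, salt=None):
    """Return (salt, digest) using scrypt with a per-password random salt."""
    if salt is None:
        salt = secrets.token_bytes(16)
    digest = hashlib.scrypt(
        password.encode("utf-8"), salt=salt, n=2**14, r=8, p=1, dklen=32
    )
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    _, digest = hash_password(password, salt)
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(digest, expected)
```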
907
+ *File and process security:*
908
+ - Unsafe file upload: missing extension validation, missing content-type validation, missing size limits, files saved to web-accessible paths, archive extraction without path normalization (zip slip)
909
+ - Unsafe subprocess: shell: true with user input, argument injection via array spreading
910
+ - Symlink attacks in file handling
911
+
912
+ *Input validation and output encoding:*
913
+ - Inputs accepted without schema validation
914
+ - Inputs validated but not sanitized before passing to sinks
915
+ - Output not encoded for the context it is rendered in (HTML, SQL, shell, URL, JSON)
916
+
917
+ *Prototype pollution and object merging:*
918
+ - `Object.assign`, `_.merge`, `lodash.merge`, `deepmerge`, spread operators applied to untrusted input
919
+ - JSON.parse result used as object keys without validation
920
+ - `__proto__`, `constructor`, `prototype` keys not filtered from user input
921
+
922
+ **LLM and agent security (OWASP LLM 2025):**
923
+
924
+ *LLM01 — Prompt injection:*
925
+ - Direct injection: user input processed as instructions without separation from system instructions
926
+ - Indirect injection: content from external sources (web pages, documents, tool outputs, database records, emails) entering the prompt context where it could contain adversarial instructions
927
+ - Injection via tool outputs: tool call results that contain embedded instructions processed by the model
928
+ - Instruction override attempts via role-play, "ignore previous instructions", jailbreaks
929
+ - System prompt extraction attempts via carefully constructed user queries
930
+
931
+ *LLM02 — Sensitive information disclosure:*
932
+ - System prompt contents exposed to users (directly or via extraction)
933
+ - PII or proprietary data leaking through model completions
934
+ - API keys, connection strings, or credentials present in system prompts or RAG context
935
+ - Internal architecture details exposed through model responses
936
+
937
+ *LLM03 — Supply chain:*
938
+ - LLM provider or model version not pinned (model behavior can change on API side)
939
+ - Third-party prompt templates or agent frameworks used without validation
940
+ - Plugin or tool integrations from untrusted sources
941
+
942
+ *LLM04 — Data and model poisoning:*
943
+ - User-supplied content writing to training datasets, fine-tuning pipelines, or embedding stores
944
+ - RAG documents sourced from user-controlled or untrusted content without sanitization
945
+ - Embedding poisoning: adversarial content crafted to manipulate retrieval
946
+
947
+ *LLM05 — Improper output handling:*
948
+ - Model output used directly as shell commands, SQL queries, or code to execute
949
+ - Model output rendered as HTML without sanitization
950
+ - Model output trusted as authoritative fact without verification
951
+ - Structured outputs (JSON, code) from models parsed without schema validation
952
+
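The last LLM05 item (structured output parsed without schema validation) can be illustrated with a hand-written validator. The schema here is hypothetical: a JSON object with exactly an `action` from a small allow-list and a bounded `query` string.

```python
import json

ALLOWED_ACTIONS = {"search", "summarize"}  # hypothetical allow-list


def parse_model_action(raw: str) -> dict:
    """Parse model output and accept it only if it matches the expected shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("model output is not valid JSON") from exc
    if not isinstance(data, dict) or set(data) != {"action", "query"}:
        raise ValueError("model output must be an object with action and query")
    action, query = data["action"], data["query"]
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError("action is not on the allow-list")
    if not isinstance(query, str) or not 0 < len(query) <= 500:
        raise ValueError("query must be a non-empty string of at most 500 characters")
    return {"action": action, "query": query}
```

Code that passes the raw model string straight to a dispatcher, with no step like this in between, is the pattern this check is meant to catch.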
953
+ *LLM06 — Excessive agency:*
954
+ - Agent tools with broader permissions than the task requires (excessive functionality)
955
+ - Agent operating with system-level or production privileges for tasks that only need read access (excessive permissions)
956
+ - High-impact actions (file deletion, email send, API calls, code deployment) proceeding without human-in-the-loop confirmation (excessive autonomy)
957
+ - Agent has access to multiple systems when it only needs one
958
+
959
+ *LLM07 — System prompt leakage:*
960
+ - System prompt reconstruction via model introspection
961
+ - System prompt stored in client-accessible locations
962
+ - Sensitive instructions (internal logic, security rules, competitor names) embedded in system prompts without leakage controls
963
+
964
+ *LLM08 — Vector and embedding weaknesses:*
965
+ - Untrusted documents written to vector stores without sanitization
966
+ - Vector similarity search results trusted without provenance verification
967
+ - Embedding inversion risks for sensitive data stored in vector stores
968
+ - RAG retrieval injection: crafting content to manipulate what gets retrieved
969
+
970
+ *LLM09 — Misinformation:*
971
+ - Model output presented as authoritative without hallucination detection or uncertainty signaling
972
+ - Factual claims generated by models without grounding in retrieved or verified sources
973
+
974
+ *LLM10 — Unbounded consumption:*
975
+ - No rate limits on model API calls
976
+ - Context flooding: user input that causes unbounded token usage
977
+ - Recursive agent loops with no termination condition
978
+ - Missing cost budgets or circuit breakers for AI operations
979
+
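A bounded agent loop, the control whose absence LLM10 flags, can be sketched as follows. The `step` callable is a hypothetical stand-in for one model or tool round trip that returns an optional final answer and the tokens it consumed; the limits are arbitrary examples.

```python
from typing import Callable, Optional, Tuple


def run_agent_loop(
    step: Callable[[int], Tuple[Optional[str], int]],
    max_steps: int = 10,
    token_budget: int = 10_000,
) -> dict:
    """Run step() until it answers, the step limit is hit, or the budget is spent."""
    tokens_used = 0
    for index in range(max_steps):
        answer, cost = step(index)
        tokens_used += cost
        if tokens_used > token_budget:
            # Circuit breaker: stop even if the agent wants to continue.
            return {"status": "budget_exceeded", "steps": index + 1,
                    "tokens": tokens_used, "answer": None}
        if answer is not None:
            return {"status": "done", "steps": index + 1,
                    "tokens": tokens_used, "answer": answer}
    return {"status": "step_limit", "steps": max_steps,
            "tokens": tokens_used, "answer": None}
```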
980
+ **MCP-specific attack vectors (2026):**
981
+
982
+ *Tool poisoning:*
983
+ - MCP tool descriptions contain prose the model reads; if that prose is untrusted or externally loaded, it is an injection surface
984
+ - Tool description metadata that instructs the model to prefer this tool over safer alternatives
985
+ - Tool parameter descriptions that suggest unsafe parameter values
986
+ - Hidden instructions in tool schema `description` fields
987
+
988
+ *Data exfiltration via AI context:*
989
+ - Sensitive data (DB schemas, API configs, PII) loaded into model context and then passed to external tool calls
990
+ - MCP server logs that accumulate sensitive context from AI sessions
991
+ - Context carryover between requests that should be isolated
992
+
993
+ *MCP server chain lateral movement:*
994
+ - Server A (lower-trust, e.g., code repo) chained to Server B (CI/CD) chained to Server C (production)
995
+ - A compromise or injection in Server A can instruct the AI to make calls through the chain to higher-privilege servers
996
+ - Inadequate isolation between MCP server identities in multi-server configurations
997
+ - Missing per-server permission scoping (all servers share one permission set)
998
+
999
+ *Missing MCP controls:*
1000
+ - No allow-list of approved MCP servers
1001
+ - MCP server connections accepted from arbitrary URLs without validation
1002
+ - No per-session or per-request permission scoping for MCP tool calls
1003
+ - No anomaly detection on MCP request/response patterns
1004
+
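The first two controls above (an allow-list of approved servers, and validation of connection URLs) might look like this in outline. The host name, the HTTPS-only policy, and the function name are assumptions made for the example.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list of (host, port) pairs.
APPROVED_MCP_SERVERS = frozenset({("mcp.internal.example", 443)})


def is_approved_mcp_server(url: str, approved=APPROVED_MCP_SERVERS) -> bool:
    """Accept only HTTPS URLs whose exact host and port are on the allow-list."""
    try:
        parts = urlsplit(url)
        port = parts.port  # raises ValueError on a malformed port
    except ValueError:
        return False
    if parts.scheme != "https" or parts.hostname is None:
        return False
    if parts.username or parts.password:
        return False  # reject embedded credentials
    return (parts.hostname.lower(), port or 443) in approved
```

Matching the exact host, not a prefix or substring, keeps look-alike hosts such as `mcp.internal.example.evil.test` out.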
1005
+ **Supply chain:**
1006
+
1007
+ *Dependency integrity:*
1008
+ - Packages imported but not declared in manifest (phantom imports)
1009
+ - Packages declared but with version ranges that allow major version drift (`*`, `latest`, `^` on 0.x)
1010
+ - Packages that sound like well-known packages but are slightly different (typosquatting, dependency confusion)
1011
+ - Package names that appear in AI-generated code but do not exist in registries (slopsquatting); USENIX Security research on package hallucination reported that about 19.7% of the packages recommended by the tested LLMs did not exist
1012
+ - `postinstall`, `preinstall`, or `prepare` scripts in dependencies that execute arbitrary code
1013
+ - Binary downloads in install scripts from non-pinned or non-verified URLs
1014
+ - Native bindings or addons with privileged system access
1015
+
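Two of the dependency checks above (phantom imports and loose version ranges) reduce to set and string comparisons once the manifest and import list are in hand. A sketch, assuming the imports were collected separately and using hypothetical function names; the range patterns are the ones the list names.

```python
def phantom_imports(imported_packages, declared_packages) -> list:
    """Packages imported in source but missing from the manifest."""
    return sorted(set(imported_packages) - set(declared_packages))


def risky_version_ranges(dependencies: dict) -> dict:
    """Flag ranges named in the checklist: '*', 'latest', and '^' on 0.x."""
    risky = {}
    for name, spec in dependencies.items():
        cleaned = spec.strip()
        if cleaned in {"*", "latest"} or cleaned.startswith("^0."):
            risky[name] = cleaned
    return risky
```

A flagged entry is only a lead: the candidate still needs the manifest line as `exact_quote` and a lockfile check before it is emitted.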
1016
+ *Build and release integrity:*
1017
+ - CI that publishes artifacts without SLSA provenance attestation
1018
+ - Artifact signing absent or unverified at deployment
1019
+ - Build credentials (deploy keys, NPM tokens, signing keys) with excessive scope
1020
+ - Release process that runs untrusted input in privileged CI context
1021
+ - Workflow injection: `${{ github.event.pull_request.head.repo.full_name }}` or similar dynamic values in `run:` steps
1022
+ - Third-party actions used without pinning to commit SHA
1023
+ - Missing dependency update tooling (Dependabot, Renovate) for CVE response
1024
+
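The "actions used without pinning to commit SHA" check lends itself to a quick scan of workflow text. This sketch is line-based rather than a YAML parse, flags every non-local `uses:` reference regardless of who owns it, and uses a hypothetical function name.

```python
import re

_USES = re.compile(r"^\s*(?:-\s*)?uses:\s*['\"]?([^'\"\s#]+)", re.MULTILINE)
_FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list:
    """Return `uses:` references that are not pinned to a 40-character commit SHA."""
    found = []
    for ref in _USES.findall(workflow_text):
        if ref.startswith("./") or ref.startswith("docker://"):
            continue  # local actions and container images are out of scope here
        _, sep, version = ref.partition("@")
        if not sep or not _FULL_SHA.match(version):
            found.append(ref)
    return found
```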
1025
+ *Repository hygiene (OpenSSF Scorecard checks):*
1026
+ - Branch protection: no required reviews, no required status checks
1027
+ - Token permissions not explicitly scoped in workflow files
1028
+ - Dangerous workflow patterns: pull_request_target with checkout of untrusted PR code
1029
+
1030
+ ---
1031
+
1032
+ ### Track C — Testing and Test Quality
1033
+
1034
+ Run if user selected options 1, 2, 5, or a custom testing scope.
1035
+
1036
+ **Anti-cursory contract for Track C:** Build a coverage unit for every public surface and every high-risk trust boundary. Every unit must be reviewed for behavioral test coverage. A unit marked REVIEWED must have had its tests (or lack thereof) read, and the assertion quality assessed — not just whether a test file exists.
1037
+
1038
+ **Agent lens:** whether tests would catch real regressions if the behavior changed.
1039
+
1040
+ **Required method:**
1041
+ 1. Link each testing candidate to a public surface, claim, trust boundary, or critical behavior from Phase 0.
1042
+ 2. State what regression could escape with the current test.
1043
+ 3. Identify the smallest test improvement that would catch it.
1044
+ 4. If possible, run the relevant test command to observe what it actually asserts.
1045
+
1046
+ **Coverage and behavioral assertions:**
1047
+
1048
+ *Missing test coverage:*
1049
+ - Public behavior surfaces with no test at any level (unit, integration, e2e)
1050
+ - High-risk trust boundaries with no auth/authz test
1051
+ - Security-sensitive paths (auth, permissions, secrets handling) with no negative test
1052
+ - Migration/schema changes with no before/after state test
1053
+ - Config parsing with no test for missing, invalid, or boundary-value configs
1054
+ - Error handling paths with no test that the error is surfaced correctly
1055
+ - Critical background jobs, queues, or scheduled tasks with no integration test
1056
+
1057
+ *Test quality — behavioral vs. implementation:*
1058
+ - Tests that only assert the mock was called rather than asserting the behavioral outcome
1059
+ - Tests that verify internal implementation details (private method called, specific log output emitted) rather than external behavior
1060
+ - Tests that pass as long as no exception is thrown, without asserting a meaningful return value or state change
1061
+ - Tests with assertions broad enough to pass even if behavior changes (e.g., `expect(result).toBeTruthy()`)
1062
+ - Snapshot tests that capture implementation artifacts rather than behavioral contracts — easy to update without understanding the change
1063
+ - Tests that import and directly call private/internal modules rather than the public API they are supposed to test
1064
+
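The difference between asserting that a mock was called and asserting the behavioral outcome is easiest to see side by side. The sketch uses Python's `unittest.mock`; `checkout`, its 20% surcharge, and `buggy_checkout` are hypothetical.

```python
from unittest.mock import Mock


def checkout(cart_cents: int, gateway) -> int:
    total = cart_cents + cart_cents * 20 // 100  # hypothetical 20% surcharge
    gateway.charge(total)
    return total


def buggy_checkout(cart_cents: int, gateway) -> int:
    gateway.charge(cart_cents)  # regression: surcharge dropped
    return cart_cents


def weak_check(checkout_fn) -> None:
    gateway = Mock()
    checkout_fn(1000, gateway)
    gateway.charge.assert_called_once()  # passes for any charged amount


def behavioral_check(checkout_fn) -> None:
    gateway = Mock()
    total = checkout_fn(1000, gateway)
    assert total == 1200
    gateway.charge.assert_called_once_with(1200)
```

`weak_check` passes for both implementations, so the regression escapes; `behavioral_check` fails on `buggy_checkout`.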
1065
+ *Fixture and schema drift:*
1066
+ - Test fixtures that no longer match current schema structure or default values
1067
+ - Mock return values that no longer represent what the real implementation returns
1068
+ - Hardcoded test data that encodes outdated business rules
1069
+ - Snapshot files out of sync with current component output
1070
+ - Database fixtures that assume old migration state
1071
+
1072
+ *Test reliability:*
1073
+ - Time-dependent tests (assertions on exact timestamps, `Date.now()`, clock-dependent logic without mocking)
1074
+ - Path-dependent tests (hardcoded local paths, home directory assumptions)
1075
+ - Network-dependent tests without offline fallback or VCR cassettes
1076
+ - Order-dependent tests (later test depends on state left by earlier test)
1077
+ - Shared mutable state between tests without cleanup
1078
+ - Flaky concurrency patterns (sleep(N) as synchronization, untimed promise resolution)
1079
+
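A common remedy for the time-dependent case is to inject the clock so tests can pin it. A minimal sketch with a hypothetical `is_expired` function:

```python
import time
from typing import Callable


def is_expired(
    issued_at: float,
    ttl_seconds: float,
    now: Callable[[], float] = time.time,
) -> bool:
    """Expiry check whose clock can be replaced in tests."""
    return now() - issued_at >= ttl_seconds
```

A test can then call `is_expired(1000.0, 60, now=lambda: 1060.0)` and get a deterministic answer instead of depending on the wall clock.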
1080
+ *Test completeness — missing negative and edge cases:*
1081
+ - No test for empty input where the function handles it
1082
+ - No test for the maximum or minimum valid value
1083
+ - No test for input at exactly the boundary (both N and N+1 should be tested)
1084
+ - No test for concurrent access where shared state could be corrupted
1085
+ - No test for partial success (operation succeeds for some items, fails for others)
1086
+ - No test for authentication failure (valid auth is tested, but invalid or missing auth is not)
1087
+ - No test for authorization boundary (owner tested, non-owner not tested)
1088
+
1089
+ *Mutation resilience:*
1090
+ - Off-by-one mutations (`<` vs `<=`, `>` vs `>=`) that tests do not catch
1091
+ - Boolean condition flip mutations (no test exercises the negated condition)
1092
+ - Null vs non-null mutations (missing null path test)
1093
+ - Return value mutations (function returns wrong thing, but test only checks side effect)
1094
+ - Identify high-risk logic where a simple one-line mutation would not fail any test
1095
+
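The off-by-one case can be illustrated by running a hand-written mutant against two small suites. This is a sketch under stated assumptions: `qualifiesForBulk`, the threshold of 100, and the suite contents are all hypothetical, and a real mutation-testing tool would generate the mutant automatically.

```typescript
// Hypothetical rule: orders of 100 units or more qualify for bulk pricing.
const BULK_THRESHOLD = 100;

function qualifiesForBulk(quantity: number): boolean {
  return quantity >= BULK_THRESHOLD;
}

// The mutant an off-by-one mutation would produce (`>=` replaced by `>`).
function qualifiesForBulkMutant(quantity: number): boolean {
  return quantity > BULK_THRESHOLD;
}

// A test suite expressed as [input, expected] pairs.
type BoundaryCase = [number, boolean];
const farFromBoundary: BoundaryCase[] = [[1, false], [500, true]];
const atBoundary: BoundaryCase[] = [[99, false], [100, true], [101, true]];

function suitePasses(fn: (q: number) => boolean, cases: BoundaryCase[]): boolean {
  return cases.every(([input, expected]) => fn(input) === expected);
}

// The suite that avoids the boundary cannot tell the mutant from the original.
const weakSuiteKillsMutant = !suitePasses(qualifiesForBulkMutant, farFromBoundary); // false
// The suite that tests N-1, N, and N+1 catches it.
const boundarySuiteKillsMutant = !suitePasses(qualifiesForBulkMutant, atBoundary); // true
```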
1096
+ *Property-based testing opportunities:*
1097
+ - Input parsers and serializers (invariant: `parse(serialize(x)) === x`)
1098
+ - Data transformations with mathematical properties (commutativity, associativity, idempotency)
1099
+ - Permission systems (any combination of valid inputs should produce a consistent authz result)
1100
+ - State machines (transitions from valid states should never reach invalid states)
1101
+ - Fuzz-worthy trust boundary inputs (all inputs from Phase 0F that accept user-controlled data)
1102
+
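The round-trip invariant for parsers and serializers can be sketched without a dependency. In practice a property-based testing library would handle input generation and shrinking; the hand-rolled generator, the `key=value;key=value` format, and every function name below are assumptions made for illustration.

```typescript
// Hypothetical format: "key=value;key=value" with percent-encoded parts.
type Pair = [string, string];

function serialize(pairs: Pair[]): string {
  return pairs.map(([k, v]) => encodeURIComponent(k) + "=" + encodeURIComponent(v)).join(";");
}

function parse(text: string): Pair[] {
  if (text === "") return [];
  return text.split(";").map((part): Pair => {
    const idx = part.indexOf("=");
    return [decodeURIComponent(part.slice(0, idx)), decodeURIComponent(part.slice(idx + 1))];
  });
}

// Small seeded generator so failures are reproducible.
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Strings deliberately include the delimiters and the escape character.
function randomString(rng: () => number): string {
  const alphabet = "ab;=% ";
  const size = Math.floor(rng() * 5);
  let s = "";
  for (let i = 0; i < size; i++) s += alphabet.charAt(Math.floor(rng() * alphabet.length));
  return s;
}

function randomPairs(rng: () => number): Pair[] {
  const pairs: Pair[] = [];
  const count = Math.floor(rng() * 4);
  for (let i = 0; i < count; i++) pairs.push([randomString(rng), randomString(rng)]);
  return pairs;
}

// Property: parse(serialize(x)) is structurally equal to x for every generated x.
function roundTripHolds(runs: number, seed: number): boolean {
  const rng = makeRng(seed);
  for (let i = 0; i < runs; i++) {
    const original = randomPairs(rng);
    const restored = parse(serialize(original));
    if (JSON.stringify(restored) !== JSON.stringify(original)) return false;
  }
  return true;
}
```

The generator is biased toward `;`, `=`, and `%` on purpose: those are the inputs most likely to break a naive implementation that forgets to escape its own delimiters.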
1103
+ *Framework misuse:*
1104
+ - `jest.mock()` or equivalent hoisted in ways that affect test isolation unexpectedly
1105
+ - `beforeAll` vs `beforeEach` misuse where state leaks between tests in the same suite
1106
+ - Async tests that neither return or await the promise nor use `done` correctly
1107
+ - Testing a singleton or module with cached state that should be reset between tests
1108
+
1109
+ Test drift rule: touched or discussed tests must be checked against current and intended behavior, not just syntax. A passing test is not enough if it asserts the wrong behavior.
1110
+
1111
+ ---
1112
+
1113
+ ### Track D — UI/UX and Accessibility
1114
+
1115
+ Run if user selected options 1, 2, 6, or a custom UI scope, but only when Phase 0H found UI evidence.
1116
+
1117
+ Skip if Phase 0H found no UI. Record the skip in coverage notes.
1118
+
1119
+ **Anti-cursory contract for Track D:** Build a coverage unit for every UI component family from Phase 0H. All six passes must complete for each component family in scope. A unit marked REVIEWED must have had its component files actually read, not just inferred from filenames.
1120
+
1121
+ If a designer agent exists, use designer for Passes D1, D2, D3, D4, and D6. Use explorer for Pass D5.
1122
+
1123
+ **Accessibility baseline:** WCAG 2.2 AA.
1124
+
1125
+ **AI-aesthetic baseline (applies to all UI passes):**
1126
+
1127
+ Do not treat generic "looks AI-generated" impressions as aesthetic criticism. Cite evidence, not vibes. However, flag when a UI exhibits these specific evidence-backed patterns that indicate unmodified AI-scaffold defaults:
1128
+
1129
+ - "VibeCode Purple" (a specific lavender-purple in the range `hsl(250-270, 50-80%, 55-70%)`) as the primary brand color with no apparent intentional choice
1130
+ - Unmodified shadcn/ui or similar component library defaults with no design token customization layer (Phase 0H will have flagged this)
1131
+ - Gradients applied to more than 30% of UI surfaces without a coherent design rationale
1132
+ - All-caps headings and section labels as a dominant typographic pattern
1133
+ - Identical feature cards with icon-on-top layout as the sole layout primitive
1134
+ - Numbered "1, 2, 3" step sequences as the dominant content structure
1135
+ - Sidebar or nav with emoji icons as the primary navigational metaphor
1136
+ - Color-coded border-left or border-top on cards as the dominant differentiation pattern
1137
+ - Medium-grey body text on dark backgrounds that barely passes contrast but lacks intentionality
1138
+
1139
+ The test is not "does this look AI-generated?" The test is: can you quote exact CSS values, class names, or component code that shows the pattern, and can you show the pattern is unintentional rather than designed? If yes, flag it with evidence.
1140
+
1141
+ **Pass D1 — Visual Hierarchy and Layout:**
1142
+
1143
+ Delegate to designer. Read every component file, every layout file, every page/route file.
1144
+
1145
+ Format for each finding:
1146
+ ```
1147
+ [UI-HIER-N] Title
1148
+ Screen/Component: [exact file path + component name]
1149
+ Current State: [what exists now — quote class names, styles, or structure]
1150
+ Enhancement: [specific, implementable improvement]
1151
+ User Impact: [how the user experience improves]
1152
+ Effort: [Low | Medium | High]
1153
+ ```
1154
+
1155
+ Evaluate:
1156
+ - Is there a clear primary action on every screen? Does it visually read as primary (weight, color, size, position)?
1157
+ - Do typographic heading levels (h1/h2/h3/font-size/font-weight) match the content hierarchy?
1158
+ - Is whitespace used intentionally to group related elements and separate unrelated ones?
1159
+ - Are layout patterns consistent across screens, or does each screen use a different structural approach?
1160
+ - What happens with realistic data extremes: very long strings, empty states, single-item lists, 1000-item lists?
1161
+ - Are empty states designed with messaging, guidance, and a call to action, or are they just blank/null?
1162
+ - Does the visual hierarchy change at different viewport sizes in a way that preserves content priority?
1163
+ - Are density and information architecture appropriate for the user's task complexity?
1164
+
1165
+ **Pass D2 — Interaction Design and Feedback:**
1166
+
1167
+ Delegate to designer. Read every component file, every interaction handler, every form.
1168
+
1169
+ Format for each finding:
1170
+ ```
1171
+ [UI-INT-N] Title
1172
+ Screen/Component: [exact file path + component name]
1173
+ Current State: [what exists now]
1174
+ Enhancement: [specific, implementable improvement]
1175
+ User Impact: [how the user experience improves]
1176
+ Effort: [Low | Medium | High]
1177
+ ```
1178
+
1179
+ Evaluate:
1180
+ - Do all interactive elements provide visual feedback for hover, active/pressed, focus, and disabled states?
1181
+ - Are loading states present for all async operations? Are they specific to the operation or generic spinners?
1182
+ - Are success and error states visually distinct and clearly communicated to the user?
1183
+ - Is there confirmation or undo opportunity before destructive actions?
1184
+ - Are form validation messages specific and actionable, or generic ("field is required", "invalid input")?
1185
+ - Are there interaction flows that could take fewer steps, use smarter defaults, or be reordered for common paths?
1186
+ - Do transitions or animations help users understand what changed (state transitions, panel slides, expansion), or are they purely decorative?
1187
+ - Are there missing transitions that would help orient users during state changes?
1188
+ - Does the UI provide optimistic updates for operations that can be safely assumed to succeed?
1189
+ - Are there keyboard shortcuts for power-user workflows, and are they discoverable?
1190
+ - For forms: does the submit button become enabled/disabled correctly based on validity?
1191
+
1192
+ **Pass D3 — Accessibility:**
1193
+
1194
+ Delegate to designer. Read every component file, every stylesheet, every interactive element.
1195
+
1196
+ Format for each finding:
1197
+ ```
1198
+ [UI-A11Y-N] Title
1199
+ WCAG Criterion: [e.g., 1.4.3 Contrast Minimum, 2.1.1 Keyboard, 4.1.2 Name, Role, Value]
1200
+ Screen/Component: [exact file path + component name]
1201
+ Current State: [what exists now — quote the problematic code or style]
1202
+ Enhancement: [specific, implementable improvement]
1203
+ User Impact: [who benefits and how]
1204
+ Effort: [Low | Medium | High]
1205
+ ```
1206
+
1207
+ Evaluate:
1208
+ - Are all interactive elements reachable by keyboard alone? (Tab, Shift+Tab, Enter, Space, Arrow keys)
1209
+ - Is the tab order logical and predictable? Does it follow the visual reading order?
1210
+ - Do all images, icons, and non-text elements have meaningful alternative text (not just file names, and not empty alt="" unless the element is purely decorative)?
1211
+ - Color contrast: body text 4.5:1, large text 3:1, UI components and graphics 3:1. Cite exact computed values where possible.
1212
+ - Are form inputs labeled with visible labels, not just placeholder text (which disappears on focus)?
1213
+ - Are error messages programmatically associated with their inputs (aria-describedby or aria-errormessage)?
1214
+ - Are dynamic state changes announced to screen readers (aria-live="polite", role="status", aria-live="assertive" for urgent)?
1215
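+ - Are touch targets at least 24×24 CSS px for all interactive elements (WCAG 2.5.8 Target Size (Minimum), the AA requirement), and ideally 44×44 CSS px (WCAG 2.5.5 Target Size (Enhanced), AAA)?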
1216
+ - Are there color-only indicators (error = red only) that need a secondary visual cue (icon, pattern, or text)?
1217
+ - Are modal dialogs, drawers, and menus trapping focus correctly (focus stays inside until closed)?
1218
+ - Is there a skip-to-main-content link for keyboard users on pages with repetitive navigation?
1219
+ - Are custom interactive widgets (sliders, tabs, accordions, comboboxes, date pickers) using correct ARIA roles and states?
1220
+ - Is prefers-reduced-motion respected for animations and transitions?
1221
+ - Does text resize to 200% without loss of content or functionality (WCAG 1.4.4), and does content reflow without horizontal scrolling (WCAG 1.4.10)?
1222
+
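Where the checklist asks for exact computed contrast values, the WCAG contrast ratio can be calculated directly from two sRGB colors. The sketch below follows the relative-luminance definition (linearize each channel, weight by 0.2126/0.7152/0.0722, then compare with a 0.05 offset); the function names and the `Rgb` tuple type are illustrative assumptions, and it ignores alpha blending and anti-aliasing.

```typescript
// Colors as 8-bit sRGB triples, e.g. [255, 255, 255] for white.
type Rgb = [number, number, number];

// Relative luminance of an sRGB color, using the sRGB transfer-function threshold.
function relativeLuminance([r, g, b]: Rgb): number {
  const linearize = (value: number): number => {
    const c = value / 255;
    return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
  };
  return 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b);
}

// Contrast ratio is (lighter + 0.05) / (darker + 0.05), ranging from 1 to 21.
function contrastRatio(foreground: Rgb, background: Rgb): number {
  const a = relativeLuminance(foreground);
  const b = relativeLuminance(background);
  return (Math.max(a, b) + 0.05) / (Math.min(a, b) + 0.05);
}

// AA thresholds from the checklist: 4.5:1 for body text, 3:1 for large text.
function meetsTextContrastAA(foreground: Rgb, background: Rgb, largeText: boolean = false): boolean {
  return contrastRatio(foreground, background) >= (largeText ? 3 : 4.5);
}
```

A finding can then quote both the declared colors and the computed ratio, for example by calling `contrastRatio` on the text color and the background color taken from the stylesheet.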
1223
+ **Pass D4 — Typography and Visual Polish:**
1224
+
1225
+ Delegate to designer. Read every component file, every stylesheet or theme file, every design token file.
1226
+
1227
+ Format for each finding:
1228
+ ```
1229
+ [UI-VIS-N] Title
1230
+ Category: [Typography | Color | Spacing | Polish]
1231
+ Screen/Component: [exact file path + component name]
1232
+ Current State: [quote exact values — font sizes, weights, colors, spacing]
1233
+ Enhancement: [specific, implementable improvement]
1234
+ User Impact: [how the experience improves]
1235
+ Effort: [Low | Medium | High]
1236
+ ```
1237
+
1238
+ Evaluate:
1239
+ - Is there a named, consistent type scale (e.g., 12/14/16/18/24/32px or a modular scale)? Or are font sizes arbitrary across components?
1240
+ - Is negative letter-spacing applied at display/heading sizes? (Headings generally need tighter tracking at large sizes; body text should not be tracked)
1241
+ - Are body text line lengths within 45–75 characters for comfortable reading?
1242
+ - Is line height appropriate for the font in use? (Body typically 1.4–1.6; display 1.0–1.2)
1243
+ - Is the font weight scale meaningful? Does it distinguish body (400), emphasis (500–600), and headings (600–700+)?
1244
+ - Is monospace type used consistently and only where appropriate (code, commands, IDs, data values)?
1245
+ - Is the same semantic element (e.g., card title, navigation item, inline code) styled consistently everywhere?
1246
+ - Is text truncation and overflow handled gracefully (ellipsis with title tooltip, explicit wrapping strategy)?
1247
+ - Is the color palette applied consistently — same semantic color for the same semantic meaning (error = red, always the same red)?
1248
+ - Are border radii, shadow depths, and spacing values from a token system or arbitrary per-component?
1249
+ - Are there hardcoded hex values, spacing units, or radius values that could become design tokens? Cite each one for extraction.
1250
+ - Are there places where the visual polish diverges significantly between different sections of the UI, suggesting inconsistent generation sessions?
1251
+
1252
+ **Pass D5 — UI Performance and Perceived Performance:**
1253
+
1254
+ Delegate to explorer. Read every component file, every data-fetching hook, every list rendering pattern.
1255
+
1256
+ Format for each finding:
1257
+ ```
1258
+ [UI-PERF-N] Title
1259
+ Category: [Render Performance | Asset Optimization | Perceived Performance | Animation | Native/IPC]
1260
+ Screen/Component: [exact file path + component name]
1261
+ Current State: [quote code where helpful]
1262
+ Enhancement: [specific, implementable improvement]
1263
+ User Impact: [how the experience improves]
1264
+ Effort: [Low | Medium | High]
1265
+ ```
1266
+
1267
+ Evaluate:
1268
+ - Are there components re-rendering on every parent update that could be memoized (React.memo, useMemo, useCallback)?
1269
+ - Are expensive calculations (sorting, filtering, mapping large arrays) happening inline during render without caching?
1270
+ - Are large lists (>50 items) rendered unconditionally instead of virtualized?
1271
+ - Are images and assets loaded at correct sizes for their display context? Are they using modern formats (WebP, AVIF)?
1272
+ - Are perceived-performance patterns in use? (Optimistic updates, skeleton loaders, progressive disclosure, speculative prefetching)
1273
+ - Are any animations/transitions animating layout properties (width, height, top, left, margin), which trigger reflow/repaint, instead of transform/opacity?
1274
+ - Is the first meaningful content visible quickly, or is there a blank/spinner period before anything appears?
1275
+ - For Tauri/Electron/native apps: is expensive work offloaded from the main thread? Are IPC calls batched to reduce round-trips? Are large IPC payloads streamed rather than sent as one blob? Are native transitions handled with skeleton states rather than blocking?
1276
+ - Are code-splitting boundaries in place so the initial bundle only loads what is needed?
1277
+ - Are lazy imports used for heavy routes, modals, or features?
1278
+
1279
+ **Pass D6 — Consistency and Design System Alignment:**
1280
+
1281
+ Delegate to designer. Read every component file, every stylesheet, every shared UI utility.
1282
+
1283
+ Format for each finding:
1284
+ ```
1285
+ [UI-CON-N] Title
1286
+ Category: [Pattern Consistency | Design Token | Component Extraction | Mental Model | AI-Aesthetic]
1287
+ Screen/Component: [exact file path + component name]
1288
+ Current State: [what exists now]
1289
+ Enhancement: [specific, implementable improvement]
1290
+ User Impact: [how the experience improves]
1291
+ Effort: [Low | Medium | High]
1292
+ ```
1293
+
1294
+ Evaluate:
1295
+ - Are equivalent UI patterns implemented differently in different parts of the application (e.g., one list uses a table, another uses a card grid, another uses a custom layout — for the same data shape)?
1296
+ - Are there hardcoded style values (hex colors, px spacing, border-radius values) that should reference design tokens?
1297
+ - Are there component variants that diverge unnecessarily when they could share a base component?
1298
+ - Are there repeated UI patterns that could be extracted into reusable components but aren't?
1299
+ - Is the navigation structure consistent and predictable — does the same navigation pattern appear on all screens?
1300
+ - Are there places where the interface's mental model doesn't match how users think about the task (e.g., a "send" action that actually stages, or a "save" action that auto-publishes)?
1301
+ - AI-aesthetic audit: apply the AI-aesthetic baseline patterns listed in the Track D preamble. For each pattern found, cite exact file and code evidence, and assess whether it is an unintentional default or a deliberate design decision.
1302
+
1303
+ ---
1304
+
1305
+ ### Track E — Performance and Observability
1306
+
1307
+ Run if user selected options 1, 2, 7, or a custom performance/observability scope.
1308
+
1309
+ **Anti-cursory contract for Track E:** Build a coverage unit for every hot path and every operational path identified in Phase 0. Every path must be reviewed. A path marked REVIEWED must have had its implementation read, its resource usage assessed, and its telemetry coverage noted.
1310
+
1311
+ **Agent lens:** runtime efficiency and production visibility.
1312
+
1313
+ **Observability baseline:** OpenTelemetry traces, metrics, and logs as first-class signals.
1314
+
1315
+ **Required method:**
1316
+ 1. Identify the hot path or operational path.
1317
+ 2. Quote the code causing repeated work, missing telemetry, or unsafe resource behavior.
1318
+ 3. State whether the issue is proven, probable, or requires profiling.
1319
+ 4. Do not invent performance impact. If impact is not measured, label it qualitative.
1320
+
1321
+ **Performance checks:**
1322
+
1323
+ *Computational:*
1324
+ - Loops iterating over data multiple times where a single pass would suffice
1325
+ - `O(n²)` or worse algorithms where the input can grow (nested loops over the same collection)
1326
+ - Repeated parsing, serialization, compilation, or IO in loops or hot paths
1327
+ - N+1 database, network, or filesystem access (fetching one-at-a-time inside a loop)
1328
+ - Missing memoization for expensive pure computations called repeatedly with same inputs
1329
+ - Synchronous critical-path work that blocks the event loop (sync file reads, sync crypto)
1330
+ - Regex recompilation on every call (creating `new RegExp()` inside a loop)
1331
+ - Unnecessary deep cloning of large objects where shallow copy or reference would suffice
1332
+
1333
+ *Memory:*
1334
+ - Objects retained longer than their usage scope (closures capturing large contexts unnecessarily)
1335
+ - Missing cleanup for subscriptions, timers, event listeners, or file handles (memory/resource leaks)
1336
+ - Data structures mismatched to access patterns (array linear scan where Map/Set lookup is needed)
1337
+ - Growing unbounded collections (event logs, caches, in-memory queues without eviction)
1338
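+ - Circular references that keep objects reachable from long-lived roots (tracing collectors reclaim unreachable cycles, so confirm the cycle is actually retained before flagging)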
1339
+
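The nested-scan and mismatched-data-structure checks above often describe the same code. A minimal sketch, assuming hypothetical `User` and `Order` shapes with unique user IDs, shows the quadratic form next to an indexed form that produces the same result:

```typescript
interface User {
  id: number;
  name: string;
}

interface Order {
  id: number;
  userId: number;
}

// O(orders * users): scans the whole users array for every order.
function ownerNamesByScan(orders: Order[], users: User[]): string[] {
  return orders.map((order) => {
    const owner = users.find((user) => user.id === order.userId);
    return owner ? owner.name : "unknown";
  });
}

// O(orders + users): builds an index once, then does constant-time lookups.
// Assumes user IDs are unique; with duplicates the last entry would win here.
function ownerNamesByIndex(orders: Order[], users: User[]): string[] {
  const nameById = new Map<number, string>();
  for (const user of users) nameById.set(user.id, user.name);
  return orders.map((order) => {
    const name = nameById.get(order.userId);
    return name !== undefined ? name : "unknown";
  });
}
```

Whether the difference matters depends on input size, which is why the required method asks the reviewer to label the impact as proven, probable, or requiring profiling.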
1340
+ *Async and concurrency:*
1341
+ - Sequential awaits on independent operations where `Promise.all` or `Promise.allSettled` could parallelize safely
1342
+ - Missing caching for repeated network, filesystem, or database reads in the same request lifecycle
1343
+ - Unbounded concurrency fanout with no throttle (spawning N parallel requests without a concurrency limiter)
1344
+ - Missing backpressure for streaming operations or queue consumers
1345
+ - Blocking the main thread in Electron/Tauri with large computations (use worker threads or IPC to background)
1346
+ - IPC call-per-item patterns that could be batched into a single IPC call
1347
+
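Between fully sequential awaits and unbounded fanout sits a bounded worker pool. The helper below, `mapWithLimit`, is a hypothetical sketch rather than a library API: it starts at most `limit` runners, each of which pulls the next unclaimed index until the input is exhausted.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight at once.
// Results keep the input order. If any call rejects, the returned promise rejects,
// although runners that are already in flight are not cancelled.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let nextIndex = 0;

  async function runner(): Promise<void> {
    // Claiming an index is safe without locks because JavaScript runs this
    // synchronously between awaits.
    while (nextIndex < items.length) {
      const index = nextIndex++;
      results[index] = await worker(items[index]);
    }
  }

  const runnerCount = Math.min(Math.max(1, limit), items.length);
  const runners: Promise<void>[] = [];
  for (let i = 0; i < runnerCount; i++) runners.push(runner());
  await Promise.all(runners);
  return results;
}
```

A call such as `mapWithLimit(urls, 4, fetchOne)` (with `urls` and `fetchOne` supplied by the caller) would keep four requests in flight instead of one or all of them.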
1348
+ *Startup and bundle (if applicable):*
1349
+ - Heavy synchronous initialization in module scope that delays startup
1350
+ - Full library imports where only a small subset is used (import full lodash, full moment)
1351
+ - Missing tree-shaking-friendly export patterns
1352
+ - Synchronous filesystem reads at startup that could be deferred or cached
1353
+ - Missing code-splitting for large routes or features
1354
+
1355
+ *AI/LLM performance:*
1356
+ - Unbounded model API calls with no concurrency limit
1357
+ - Context payloads that grow unboundedly with session length
1358
+ - Repeated embedding or completion calls for identical inputs without caching
1359
+ - Token budget not enforced, allowing unexpectedly large responses to accumulate cost
1360
+
1361
+ **Observability checks:**
1362
+
1363
+ *Logging:*
1364
+ - Key operations completing with no trace in logs (successful auth, data mutations, background job completion)
1365
+ - Error logs missing context (which entity, which user, which request, which operation)
1366
+ - Log messages noting what happened but not why it happened or what to do next
1367
+ - Sensitive data (PII, tokens, credentials, query parameters with secrets) in log statements
1368
+ - Debug-only visibility for production-critical failures (e.g., errors only logged at `console.debug`)
1369
+ - Missing correlation IDs or request/session/trace IDs that would link related log events
1370
+
1371
+ *Metrics:*
1372
+ - Missing request latency metrics for externally-visible operations
1373
+ - Missing error rate metrics for critical paths
1374
+ - Missing queue depth, backlog, or processing rate for async workers
1375
+ - Missing cost metrics for AI/LLM API calls (token counts, call counts)
1376
+ - Missing retry count metrics that would reveal upstream instability
1377
+ - Missing saturation metrics (memory usage, connection pool usage, disk usage)
1378
+
1379
+ *Traces:*
1380
+ - Missing spans across service boundaries (outgoing HTTP calls, database queries, queue publishes)
1381
+ - Missing spans for model/embedding API calls (duration, token count, model version)
1382
+ - Missing trace propagation (W3C Trace Context headers not forwarded across service boundaries)
1383
+ - Span attributes missing key identifiers (user ID, tenant ID, resource ID, feature flag state)
1384
+
1385
+ *Operational visibility:*
1386
+ - Production-critical failures only visible by reading source code or log noise
1387
+ - No structured error taxonomy that would enable alerting rules
1388
+ - Missing operational runbook hooks or on-call documentation comments for critical paths
1389
+ - Alert thresholds not defined or documented for key metrics
1390
+
1391
+ ---
1392
+
1393
+ ### Track F — AI Slop and Code Provenance
1394
+
1395
+ Run if user selected options 1, 2, 8, or a custom AI-slop/provenance scope.
1396
+
1397
+ **Anti-cursory contract for Track F:** Build a coverage unit for every file group and every public surface. Every unit must be reviewed. A unit marked REVIEWED must have had its imports verified against the manifest/lockfile, its API signatures verified against an installed version, and its implementation reviewed for stub patterns.
1398
+
1399
+ **Agent lens:** patterns statistically common in LLM-assisted code that look plausible but are weakly grounded.
1400
+
1401
+ This is not permission to call code bad because it "looks AI-generated." Every finding still needs evidence.
1402
+
1403
+ **Required method:**
1404
+ 1. Prefer deterministic checks first: import existence, API signatures, wiring, docs vs. code.
1405
+ 2. For subjective AI-slop patterns, require two pieces of evidence: exact quote plus a concrete consequence.
1406
+ 3. Do not emit candidates based only on style.
1407
+
1408
+ **Phantom dependencies and hallucinated APIs:**
1409
+
1410
+ - Packages imported in source but not declared in any manifest
1411
+ - Package names that do not match any registered package in the expected ecosystem
1412
+ - Packages that sound like combinations of real packages (`react-fetch-hooks`, `express-validate-zod`) but may be fabricated — verify by checking the lockfile for the exact name and version
1413
+ - Version numbers that do not exist for the declared package (check semver range resolution against the lockfile)
1414
+ - API function calls on a package where those functions do not exist in the declared version (check against the installed package's actual exports, not docs or LLM knowledge)
1415
+ - Calling internal/private APIs of a dependency that were not part of its public contract
1416
+ - Calling deprecated APIs of a dependency that were removed in the locked version
1417
+ - Cross-ecosystem imports (Python package imported in JavaScript, Node.js module imported in browser context, etc.)
1418
+ - Framework APIs from the wrong version (React 17 vs React 18 API differences, Next.js 13 vs 14 vs 15 differences, etc.)
1419
+ - Calling methods on types that don't exist at runtime (TypeScript type narrowing giving false confidence)
1420
+
1421
+ **Stale library and framework usage:**
1422
+
1423
+ - APIs that existed in older versions but were deprecated or removed in the pinned version
1424
+ - Import paths from old package structures (pre-restructuring imports that no longer resolve)
1425
+ - Using class-based APIs where the installed version is hook/function-based
1426
+ - Using callback-based APIs where the installed version is promise-based
1427
+ - Accessing config or environment APIs using old format that the current runtime ignores silently
1428
+
1429
+ **Confident stubs and happy-path-only implementations:**
1430
+
1431
+ - Functions with an impressive-looking signature and docstring but an implementation that is one or two lines, clearly insufficient for the stated purpose
1432
+ - Validation functions whose name suggests thoroughness (`validateSecureInput`, `sanitizeUserData`) but whose body only checks for null or trims whitespace
1433
+ - Security function names (`checkPermissions`, `isAuthorized`, `encryptPayload`) with trivially incorrect implementations
1434
+ - Error handlers that catch broad exception types and log a generic message, treating all errors identically
1435
+ - Retry or backoff functions that loop `N` times with `sleep(fixed_delay)` instead of implementing actual exponential backoff
1436
+ - Rate limiters that initialize a counter but never actually block or reject requests
1437
+ - Test files that import real modules but only call them with mocked return values, never actually testing the real behavior
1438
+ - Examples in docs that call non-existent functions or APIs with wrong argument shapes
1439
+
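For the retry bullet, it helps to know what a non-stub delay schedule looks like. The following is a minimal sketch of capped exponential backoff with full jitter; `backoffDelayMs` and its parameters are hypothetical names, and the jitter source is injectable only so the schedule can be checked deterministically.

```typescript
// Delay before retry number `attempt` (0-based).
// The uncapped delay doubles each attempt: baseMs, 2*baseMs, 4*baseMs, ...
// It is then capped at maxMs and scaled by a random factor in [0, 1) ("full jitter")
// so that many clients do not retry in lockstep.
function backoffDelayMs(
  attempt: number,
  baseMs: number,
  maxMs: number,
  jitter: () => number = Math.random
): number {
  const capped = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  return Math.floor(jitter() * capped);
}

// With a fixed jitter factor of 0.5, base 100 ms, cap 1000 ms:
const halfJitter = (): number => 0.5;
const schedule = [0, 1, 2, 3, 4].map((attempt) => backoffDelayMs(attempt, 100, 1000, halfJitter));
// schedule is [50, 100, 200, 400, 500]
```

A loop that sleeps the same fixed delay on every attempt produces a flat schedule instead, which is the pattern the bullet asks reviewers to flag when the function's name or docstring claims backoff.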
1440
+ **Over-abstraction and premature generalization:**
1441
+
1442
+ - Adapter, factory, or registry patterns implemented before there are two real use cases to abstract over (abstraction layer with exactly one implementation)
1443
+ - Generic interfaces with a single concrete implementation and no documented reason for the layer
1444
+ - Dependency injection containers or service locators added to simple scripts that have no runtime variation requirement
1445
+ - Configuration system with many options for which only one is ever set
1446
+ - Plugin or hook systems with registration infrastructure but no registrations
1447
+ - Abstraction cascades: function A calls function B calls function C which calls function D, where each wrapper does nothing except forward arguments
1448
+
1449
+ **Copy-paste artifacts and inconsistent integration:**
1450
+
1451
+ - Same logic block (3+ lines) duplicated in two or more files with minor variations instead of being extracted
1452
+ - Naming conventions that differ between files in the same module (camelCase in one file, snake_case in the sibling)
1453
+ - Error message strings that differ in style or capitalization for equivalent error conditions
1454
+ - Inconsistent parameter order for similar functions in the same module
1455
+ - Inconsistent return type patterns (some functions return `null` on error, others `undefined`, others throw)
1456
+ - Logging patterns that differ between files as if each was generated independently
1457
+ - Comments written in a different prose style from the surrounding codebase (suggesting multiple generation sessions)
1458
+
1459
+ **Context rot:**
1460
+
1461
+ - Comments that were accurate for an older version of the code but no longer match the current implementation
1462
+ - TODO/FIXME comments that reference issues, versions, or constraints that no longer apply
1463
+ - Test names that claim to test behavior the test no longer exercises
1464
+ - Changelog entries that describe features not present in the current code
1465
+ - Import aliases that no longer match the imported module's actual exports
1466
+
1467
+ **Documentation for unwired features:**
1468
+
1469
+ - README sections describing features (commands, flags, config options, APIs) with no corresponding implementation in source
1470
+ - JSDoc or TSDoc on exported functions describing parameters that don't exist in the function signature
1471
+ - Config documentation describing keys that are read and ignored, or never read at all
1472
+ - CLI help text describing flags or subcommands that have no handler
1473
+
1474
+ **Security theater:**
1475
+
1476
+ - Input validation that checks type or presence but not content (accepts any string as an email, any number as a valid ID)
1477
+ - Permission check function that always returns `true` or is bypassed on any non-trivial code path
1478
+ - Encryption function that Base64-encodes data and calls it "encrypted"
1479
+ - HTTPS check that only verifies the string starts with "https" but does not validate the certificate
1480
+ - Rate limiting that resets on every request instead of per time window
1481
+ - CSRF protection that checks for the header's presence but not its value
1482
+
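As a reference point for the rate-limiting bullet, a limiter that actually enforces a per-window limit needs state that survives across requests. The class below is a simplified fixed-window sketch with an injectable clock; `FixedWindowLimiter` is a hypothetical name, and production limiters usually also need per-client keys and shared storage.

```typescript
// Allows at most `limit` calls per `windowMs`. The counter resets only when
// a full window has elapsed, not on every request.
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly now: () => number;
  private windowStart: number;
  private count = 0;

  constructor(limit: number, windowMs: number, now: () => number = Date.now) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now;
    this.windowStart = now();
  }

  allow(): boolean {
    const current = this.now();
    if (current - this.windowStart >= this.windowMs) {
      // A new window begins: reset the counter.
      this.windowStart = current;
      this.count = 0;
    }
    if (this.count >= this.limit) return false; // Over the limit: reject.
    this.count += 1;
    return true;
  }
}
```

The security-theater variant typically differs in one of two ways: the counter is reset inside `allow()` unconditionally, or the method increments the counter but never returns `false`.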
1483
+ **Slopsquatting exposure:**
1484
+
1485
+ Per the USENIX research: 19.7% of LLM-recommended packages are fabricated (they do not exist in the registry); 58% of hallucinated packages repeat across queries. Check:
1486
+ - Every package name in manifests against the lockfile. If a package is in the manifest but not in the lockfile, it may be unresolved or hallucinated.
1487
+ - Package names that are combinations of legitimate package names in a pattern that suggests AI generation
1488
+ - Package scopes (`@company/something`) where `@company` does not correspond to a known published scope
1489
+
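The manifest-versus-lockfile comparison is one of the deterministic checks this track prefers. The sketch below assumes the reviewer has already extracted the locked package names into a plain list; the `Manifest` shape mirrors the `dependencies` and `devDependencies` fields of a `package.json`, while `declaredButNotLocked` and the lockfile extraction step are hypothetical and deliberately format-agnostic.

```typescript
// Subset of a package manifest relevant to this check.
interface Manifest {
  dependencies?: { [name: string]: string };
  devDependencies?: { [name: string]: string };
}

// Returns declared package names that have no entry among the locked names.
// An empty result does not prove the packages are legitimate; it only shows
// that each one resolved to something when the lockfile was generated.
function declaredButNotLocked(manifest: Manifest, lockedNames: string[]): string[] {
  const locked = new Set(lockedNames);
  const declared = Object.keys(manifest.dependencies || {}).concat(
    Object.keys(manifest.devDependencies || {})
  );
  return declared.filter((name) => !locked.has(name)).sort();
}
```

Names returned by this check are candidates for manual verification against the registry, not confirmed findings.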
1490
+ ---
1491
+
1492
+ ### Track G — Enhancement Opportunities
1493
+
1494
+ Run if user selected options 1, 9, or a custom enhancement scope.
1495
+
1496
+ **Anti-cursory contract for Track G:** Build a coverage unit for every enhancement domain (architecture, code quality, developer experience, performance, resilience, observability, testing, and UI/UX if applicable). Every domain must be reviewed. A domain marked REVIEWED must have had representative source files for that domain actually read and assessed.
1497
+
1498
+ **Anti-defect-hunt rule:** This track is not a defect hunt.
1499
+
1500
+ Do not report:
1501
+ - bugs or security vulnerabilities
1502
+ - broken claims or missing required tests
1503
+ - anything that implies the current code is wrong or unsafe
1504
+
1505
+ Report only:
1506
+ - improvements that raise maintainability, clarity, resilience, performance, observability, developer experience, or UX quality
1507
+ - specific opportunities with exact file evidence
1508
+ - implementation ideas concrete enough for an engineer or agent to act on
1509
+
1510
+ ---
1511
+
1512
+ #### Enhancement Pass G1 — Architecture and Structure
1513
+
1514
+ Delegate to explorer. Read all source files.
1515
+
1516
+ Format:
1517
+ ```
1518
+ [ARCH-N] Title
1519
+ Category: [Abstraction | Cohesion | Interface Clarity | Dependency | Simplification]
1520
+ File(s): [exact path]
1521
+ Current State: [what exists now — quote specific code]
1522
+ Enhancement: [specific, implementable improvement]
1523
+ Impact: [what gets better — readability, testability, reuse, etc.]
1524
+ Effort: [Low | Medium | High]
1525
+ ```
1526
+
1527
+ Evaluate:
1528
+
1529
+ *Abstraction opportunities:*
1530
+ - Functions doing more than one thing that could be cleanly separated (heuristic: the function name contains "and", "or", or "also")
1531
+ - Logic duplicated across three or more files that has stabilized enough to deserve a shared utility
1532
+ - Inline logic grown complex enough (≥10 lines of closely related computation) to deserve its own named abstraction
1533
+ - Modules with accumulated responsibilities spanning multiple unrelated concerns
1534
+
1535
+ *Simplification opportunities:*
1536
+ - Premature abstractions: adapter, factory, or registry patterns with exactly one implementation and no near-term second
1537
+ - Abstraction cascades: A → B → C → D where each wrapper only forwards arguments
1538
+ - Over-engineered configuration systems with many options where only one is used
1539
+ - Dead compatibility layers kept for a version no longer in any manifest
1540
+ - Unused code paths: functions defined and exported but with no import in the codebase
1541
+
1542
+ *Cohesion improvements:*
1543
+ - Cross-cutting concerns (logging, error handling, config access) scattered across modules instead of centralized
1544
+ - Inconsistent module grouping where related files are in unrelated directories
1545
+ - Business logic mixed with I/O, network, or presentation logic in the same module
1546
+
1547
+ *Interface clarity:*
1548
+ - Function signatures with ≥4 positional parameters where an options object would be clearer
1549
+ - Overloaded return types that could be split into typed variants
1550
+ - Implicit contracts (side effects, required call order, mutability expectations) that could be made explicit
1551
+
1552
+ *Dependency improvements:*
1553
+ - External dependencies used for one or two trivial functions that native language features now provide
1554
+ - Long dependency chains that could be simplified with a direct interface layer
1555
+ - Tight coupling to concrete implementations that limits testing or reuse
1556
+
1557
+ Do not report items without an exact file path and code quote.
1558
+
1559
+ ---
1560
+
1561
+ #### Enhancement Pass G2 — Code Quality and Elegance
1562
+
1563
+ Delegate to explorer. Read all source files.
1564
+
1565
+ Format:
1566
+ ```
1567
+ [QUAL-N] Title
1568
+ Category: [Readability | Idiomatic | Test Quality | DX]
1569
+ File(s): [exact path]
1570
+ Current State: [what exists now — quote specific code]
1571
+ Enhancement: [specific, implementable improvement]
1572
+ Impact: [what gets better]
1573
+ Effort: [Low | Medium | High]
1574
+ ```
1575
+
1576
+ Evaluate:
1577
+
1578
+ *Readability:*
1579
+ - Variable or function names that are accurate but not expressive (generic names like `data`, `result`, `item`, `temp` where a domain term exists)
1580
+ - Complex conditionals with 3+ conditions that could become a named predicate function
1581
+ - Deeply nested logic (≥3 levels) that could be flattened with early returns or guard clauses
1582
+ - Comments that describe what the code does instead of why it does it
1583
+ - Magic numbers or strings that should be named constants (what does `86400` mean in this context?)
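As an illustration of the readability items above, here is a minimal sketch of flattening nested logic with guard clauses and naming magic numbers. The function names, the order shape, and the rates are hypothetical and exist only for this example.

```python
from typing import Optional

# Hypothetical rates, named so the reader does not have to guess what 5.0 means.
DOMESTIC_RATE = 5.0
INTERNATIONAL_RATE = 15.0
NO_CHARGE = 0.0


def shipping_cost_nested(order: Optional[dict]) -> float:
    # Three levels of nesting before the real decision is visible.
    if order is not None:
        if order.get("items"):
            if order.get("country") == "US":
                return DOMESTIC_RATE
            else:
                return INTERNATIONAL_RATE
        else:
            return NO_CHARGE
    else:
        return NO_CHARGE


def shipping_cost_flat(order: Optional[dict]) -> float:
    # Guard clauses dispose of the empty cases first; the main path reads top to bottom.
    if order is None:
        return NO_CHARGE
    if not order.get("items"):
        return NO_CHARGE
    return DOMESTIC_RATE if order.get("country") == "US" else INTERNATIONAL_RATE
```

Both functions return the same values for the same input; only the shape of the control flow differs, which is what keeps this an enhancement rather than a defect.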
1584
+
1585
+ *Idiomatic improvements:*
1586
+ - Non-idiomatic patterns with cleaner modern equivalents:
1587
+   - Manual for/while loops where `map`, `filter`, `reduce`, `find`, `every`, `some` apply
1588
+   - `.then()` chains where `async/await` would be clearer
1589
+   - `Object.assign({}, x)` where spread `{...x}` is idiomatic
1590
+   - String concatenation in loops where template literals or join apply
1591
+   - Index-based array access where destructuring is cleaner
1592
+ - TypeScript: `any` types that could be narrowed; missing generics; untyped event handlers; optional chaining opportunities; unnecessary type assertions; union types that should be discriminated unions
1593
+ - Patterns inconsistent with how the rest of the codebase does similar things (local idiosyncrasy vs. established pattern)
1594
+ - Defensive copying where reference sharing is both safe and intended
1595
+
1596
+ *Test quality:*
1597
+ - Tests verifying implementation details instead of behavior
1598
+ - Test descriptions that don't communicate intent (`test("works correctly", ...)`)
1599
+ - Setup/teardown duplication across test files that could be shared fixtures
1600
+ - Assertions too broad to fail on behavior changes
1601
+ - Missing test for the documented main use case of a public API
1602
+
1603
+ *Developer experience:*
1604
+ - Exported public APIs with no JSDoc or TSDoc
1605
+ - Error messages lacking enough context to debug (what failed, what was the input, where to look)
1606
+ - Config validation that only fails at runtime when it could fail at startup with a clear message
1607
+ - Missing local scripts for common development workflows (setup, seed, reset, generate types)
1608
+ - Missing examples for non-obvious public API usage
1609
+
1610
+ ---
1611
+
1612
+ #### Enhancement Pass G3 — Performance Enhancement
1613
+
1614
+ Delegate to explorer. Read all source files.
1615
+
1616
+ Format:
1617
+ ```
1618
+ [PERF-N] Title
1619
+ Category: [Computational | Memory | Async | Bundle | Startup]
1620
+ File(s): [exact path]
1621
+ Current State: [what exists now — quote code]
1622
+ Enhancement: [specific, implementable improvement]
1623
+ Impact: [measurable or qualitative benefit]
1624
+ Effort: [Low | Medium | High]
1625
+ ```
1626
+
1627
+ Evaluate — enhancement framing only (the current code is correct; this makes it better):
1628
+
1629
+ *Computational:*
1630
+ - Loops iterating over data multiple times where a single pass would suffice
1631
+ - Missing memoization for expensive pure computations called repeatedly (React renders, recursive computations)
1632
+ - N+1 patterns: repeated work per item that could be batched (opportunity to batch, not a broken behavior)
1633
+ - Synchronous critical-path work that could be deferred without correctness risk
1634
+ - Regex objects created inside loops that could be created once and reused
1635
+
1636
+ *Memory:*
1637
+ - Large objects retained longer than needed (opportunity to scope more tightly)
1638
+ - Subscriptions, timers, or event listeners with no cleanup (opportunity to add lifecycle cleanup)
1639
+ - Data structure mismatches: array linear scan where Map/Set would improve lookup
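The data structure mismatch item can be illustrated with a short sketch. It is written in Python, so a `set` stands in for the Map/Set mentioned above; the function names are hypothetical.

```python
def find_known_slow(candidates: list[str], known: list[str]) -> list[str]:
    # `in` on a list scans linearly, so the cost grows with len(candidates) * len(known).
    return [c for c in candidates if c in known]


def find_known_fast(candidates: list[str], known: list[str]) -> list[str]:
    # Build the set once; each membership test is then constant time on average.
    known_set = set(known)
    return [c for c in candidates if c in known_set]
```

The two functions return identical results, so a PERF entry of this kind should state the lookup sizes involved rather than imply the current code is wrong.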
1640
+
1641
+ *Async:*
1642
+ - Sequential await chains where `Promise.all` would safely parallelize
1643
+ - Missing caching for repeated network or filesystem reads within the same request lifecycle
1644
+ - Unbounded concurrency fanout that could benefit from a concurrency limiter
1645
+
1646
+ *Bundle and startup (if applicable):*
1647
+ - Full library imports where only a small subset is used
1648
+ - Synchronous initialization that could be lazy
1649
+ - Missing tree-shaking-friendly export patterns
1650
+
1651
+ ---
1652
+
1653
+ #### Enhancement Pass G4 — Resilience and Observability Enhancement
1654
+
1655
+ Delegate to explorer. Read all source files.
1656
+
1657
+ Format:
1658
+ ```
1659
+ [RES-N] Title
1660
+ Category: [Error Handling | Observability | Configuration | Retry | Graceful Degradation]
1661
+ File(s): [exact path]
1662
+ Current State: [what exists now — quote code]
1663
+ Enhancement: [specific, implementable improvement]
1664
+ Impact: [what gets better]
1665
+ Effort: [Low | Medium | High]
1666
+ ```
1667
+
1668
+ Evaluate — enhancement framing only:
1669
+
1670
+ *Error handling:*
1671
+ - Errors caught and swallowed silently that could surface meaningful context to callers
1672
+ - Generic error messages that could include the specific context that caused the error
1673
+ - Operations that would benefit from retry with exponential backoff (currently: fail fast or no retry)
1674
+ - Binary success/crash outcomes that could degrade gracefully (return partial results, skip and continue)
1675
+ - Missing error differentiation: all exceptions treated the same when some should be retried, some reported, some fatal
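A minimal sketch of two items from this list, retry with exponential backoff and error differentiation, might look like the following. `TransientError`, `retry_with_backoff`, and the default delays are assumptions made for illustration, not part of the protocol.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            # Only transient failures are retried; any other exception propagates at once.
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            sleep(base_delay * (2 ** attempt))  # Delay doubles after each failed attempt.
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper testable without real waiting, which also makes the enhancement easier for a reviewer to validate.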
1676
+
1677
+ *Logging and observability:*
1678
+ - Key operations completing with no trace in logs (opportunity to add structured log at completion)
1679
+ - Log messages noting what happened but not why or what to do next
1680
+ - Missing structured fields (correlation IDs, user context, entity IDs) that would help correlate events
1681
+ - Debug information inaccessible without reading source (opportunity to surface via logs or metrics)
1682
+ - Missing metrics for operations that affect user experience, reliability, or cost
1683
+
1684
+ *Configuration robustness:*
1685
+ - Config values accessed without validation that could be validated at startup
1686
+ - Missing sensible defaults for optional configuration
1687
+ - Sensitive config that could be better isolated (environment separation, secret management)
1688
+
1689
+ ---
1690
+
1691
+ #### Enhancement Pass G5 — Testing Enhancement
1692
+
1693
+ Delegate to test_engineer if available, otherwise explorer. Read all test files and source files.
1694
+
1695
+ Format:
1696
+ ```
1697
+ [TEST-N] Title
1698
+ Category: [Organization | Fixtures | Property-Based | Mutation | Behavior-Level]
1699
+ File(s): [exact path]
1700
+ Current State: [what exists now — quote test code]
1701
+ Enhancement: [specific, implementable improvement]
1702
+ Impact: [what gets better]
1703
+ Effort: [Low | Medium | High]
1704
+ ```
1705
+
1706
+ Evaluate — enhancement framing only (existing tests pass; this makes the test suite better):
1707
+
1708
+ - Better test organization: grouping tests by behavior rather than by implementation unit
1709
+ - Shared fixtures or factory functions to eliminate test setup duplication
1710
+ - Property-based testing opportunities for invariants: parsers, serializers, transformations, state machines, permission matrices, fuzz-worthy trust boundaries
1711
+ - Mutation testing on high-risk core logic: identify the logic where a one-line flip would be catastrophic and where a mutation test would catch it
1712
+ - Behavior-level test assertions: replace implementation-asserting tests with behavior-asserting equivalents
1713
+ - Missing tests for documented edge cases or recently fixed bugs
1714
+ - Test performance: identify test suites taking disproportionate time and opportunities to speed them up
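To make the property-based item concrete, here is a hand-rolled sketch of a round-trip invariant for a serializer. It uses only the standard library with a seeded random generator instead of a dedicated property-testing library, and the generator and helper names are hypothetical.

```python
import json
import random


def random_json_value(rng: random.Random, depth: int = 0):
    # Produce a small random JSON-compatible value (floats omitted to keep equality exact).
    kinds = ["int", "str", "bool", "none"]
    if depth < 3:
        kinds += ["list", "dict"]
    kind = rng.choice(kinds)
    if kind == "int":
        return rng.randint(-1000, 1000)
    if kind == "str":
        return "".join(rng.choice('ab "\\/') for _ in range(rng.randint(0, 8)))
    if kind == "bool":
        return rng.choice([True, False])
    if kind == "none":
        return None
    if kind == "list":
        return [random_json_value(rng, depth + 1) for _ in range(rng.randint(0, 4))]
    return {f"k{i}": random_json_value(rng, depth + 1) for i in range(rng.randint(0, 4))}


def check_round_trip(trials: int = 200, seed: int = 0) -> int:
    # Property: decoding an encoded value yields the original value.
    rng = random.Random(seed)
    for _ in range(trials):
        value = random_json_value(rng)
        assert json.loads(json.dumps(value)) == value, value
    return trials
```

A TEST entry proposing this kind of check should name the invariant (here, encode then decode is the identity) and the code it protects, not just the technique.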
1715
+
1716
+ ---
1717
+
1718
+ #### Enhancement Pass G6 — UI/UX Enhancement (Run only if UI is confirmed present)
1719
+
1720
+ **Condition:** Only run if Phase 0H confirmed UI presence. If no UI, skip and record NOT_APPLICABLE in coverage.
1721
+
1722
+ Run all six UI passes from Track D (D1 through D6), framing all findings as enhancement opportunities rather than defects.
1723
+
1724
+ Use the same formats and evaluation criteria as Track D. The key framing difference:
1725
+
1726
+ - Track D (defect mode): "This is broken, missing, or fails a compliance standard."
1727
+ - Track G Pass G6 (enhancement mode): "The current UI is working; this is how it could become better."
1728
+
1729
+ Findings that would be LOW or INFO severity in Track D become genuine enhancement candidates here. In enhancement mode, all UI improvements are valuable — the bar is not "this is a defect" but "this would make the experience meaningfully better."
1730
+
1731
+ Do not repeat Track D findings if Track D was also run. Reference them by ID in the enhancement catalog if relevant.
1732
+
1733
+ ---
1734
+
1735
+ ### Phase 1X — Cross-Boundary Review
1736
+
1737
+ After candidate generation for the selected tracks completes, run one cross-boundary explorer pass.
1738
+
1739
+ Skip rule: run Phase 1X only when two or more tracks ran and there is quoted cross-track evidence to compare. For single-track reviews, skip and record the skip in Coverage Notes.
1740
+
1741
+ Purpose: find issues that isolated track passes miss.
1742
+
1743
+ Check:
1744
+ - Caller and callee contract mismatches across module boundaries
1745
+ - UI/API/schema drift (what the UI sends vs. what the API expects vs. what the schema defines)
1746
+ - Docs/API/test drift (what docs claim vs. what the API does vs. what tests assert)
1747
+ - Auth assumptions across middleware and handlers (auth enforced in middleware but not in handler, or vice versa)
1748
+ - Config names across docs, env parsing, deployment config, and code
1749
+ - Shared state mutation across modules that assumes exclusive access
1750
+ - Package scripts calling files or commands that no longer exist
1751
+ - Generated types or schemas out of sync with their sources
1752
+ - AI prompt/tool boundaries crossing into security-sensitive sinks (identified in Track B but not surfaced in Track A)
1753
+ - Repeated candidate patterns in sibling files suggesting a systemic issue
1754
+
1755
+ Output: additional `CANDIDATE_FINDING` entries only. Use the track of the most security-relevant finding. If no single track dominates, use `track: cross_boundary`. Link all involved claims, surfaces, boundaries, or prior candidates.
1756
+
1757
+ ---
1758
+
1759
+ ## Phase 2 — Reviewer Validation
1760
+
1761
+ Reviewer validates candidates. Reviewer does not rediscover the whole repo.
1762
+
1763
+ Reviewer receives small batches by local reasoning unit: same file, same route or handler chain, same subsystem, same dependency family, same public claim, same trust boundary, same UI component family, or same test fixture/helper.
1764
+
1765
+ Do not hand Reviewer dozens of unrelated candidates in one batch.
1766
+
1767
+ ### Validation Status
1768
+
1769
+ Reviewer must assign exactly one:
1770
+ - `CONFIRMED` — real in current code and supported by evidence
1771
+ - `DISPROVED` — not real in context
1772
+ - `UNVERIFIED` — plausible but not proven to required confidence
1773
+ - `PRE_EXISTING` — real but outside the target change scope
1774
+
1775
+ ### Reviewer Responsibilities
1776
+
1777
+ For each candidate:
1778
+ 1. Re-open exact file and line.
1779
+ 2. Read the raw file independently before reading the explorer's `evidence_checked` field. Do not let the explorer's paraphrase prime validation.
1780
+ 3. Re-read enough surrounding context.
1781
+ 4. Check callers, callees, tests, manifests, configs, schemas, routes, generated files, and docs needed to validate.
1782
+ 5. Check mitigating controls that could disprove the candidate.
1783
+ 6. Run safe minimal runtime validation where behavior depends on runtime.
1784
+ 7. Reclassify severity or value level if appropriate.
1785
+ 8. Record exact disproof reason for rejected candidates.
1786
+ 9. Mark UNVERIFIED rather than guessing when evidence is insufficient.
1787
+
1788
+ ### Defect Validation Format
1789
+
1790
+ ```
1791
+ VALIDATED_FINDING
1792
+ candidate_id:
1793
+ status: CONFIRMED | DISPROVED | UNVERIFIED | PRE_EXISTING
1794
+ final_severity: CRITICAL | HIGH | MEDIUM | LOW | INFO
1795
+ confidence: HIGH | MEDIUM
1796
+ file:
1797
+ line:
1798
+ exact_quote:
1799
+ title:
1800
+ problem:
1801
+ impact:
1802
+ fix:
1803
+ validation_evidence:
1804
+ disproof_reason: <required if DISPROVED>
1805
+ verification_mode: STATIC | STATIC_PLUS_RUNTIME
1806
+ runtime_validation: <command or N/A>
1807
+ linked_claims:
1808
+ linked_surfaces:
1809
+ linked_boundaries:
1810
+ ai_pattern: <same value from candidate or N/A>
1811
+ inline_routing: CRITIC_REQUIRED | REVIEWER_FINALIZED | REVIEWER_DOWNGRADED
1812
+ finalization_status: FINALIZED | DOWNGRADED | N/A
1813
+ size: S | M | L
1814
+ END
1815
+ ```
1816
+
1817
+ Rules:
1818
+ - CRITICAL/HIGH CONFIRMED or PRE_EXISTING requires `inline_routing: CRITIC_REQUIRED`.
1819
+ - MEDIUM/LOW CONFIRMED or PRE_EXISTING requires reviewer finalization before return.
1820
+ - DISPROVED and UNVERIFIED do not enter the main findings list.
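The routing rules above can be sketched as a small lookup. The function name and the `REVIEWER_FINALIZATION_REQUIRED` label are hypothetical; the rules do not name a route for INFO, so the sketch leaves that case undecided rather than guessing.

```python
from typing import Optional

REPORTABLE_STATUSES = {"CONFIRMED", "PRE_EXISTING"}


def required_routing(status: str, severity: str) -> Optional[str]:
    # DISPROVED and UNVERIFIED never enter the main findings list, so no routing applies.
    if status not in REPORTABLE_STATUSES:
        return None
    if severity in {"CRITICAL", "HIGH"}:
        return "CRITIC_REQUIRED"
    if severity in {"MEDIUM", "LOW"}:
        # Hypothetical label for "reviewer finalization before return".
        return "REVIEWER_FINALIZATION_REQUIRED"
    # INFO is not covered by the rules above; this sketch returns None for it.
    return None
```

For example, `required_routing("PRE_EXISTING", "HIGH")` yields `CRITIC_REQUIRED`, matching the first rule.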
1821
+
1822
+ ### Enhancement Validation Format
1823
+
1824
+ ```
1825
+ VALIDATED_ENHANCEMENT
1826
+ candidate_id:
1827
+ status: CONFIRMED_HIGH_VALUE | CONFIRMED_MEDIUM_VALUE | REJECTED | UNVERIFIED
1828
+ track:
1829
+ domain:
1830
+ category:
1831
+ confidence: HIGH | MEDIUM
1832
+ file:
1833
+ line:
1834
+ exact_quote:
1835
+ title:
1836
+ current_state:
1837
+ confirms_current_code_is_working: yes | no
1838
+ enhancement:
1839
+ expected_impact:
1840
+ effort: S | M | L
1841
+ validation_evidence:
1842
+ dependency_map:
1843
+ rejection_reason: <required if REJECTED>
1844
+ END
1845
+ ```
1846
+
1847
+ Enhancement rejection reasons include: already handled elsewhere; contradicts system intent; adds complexity without clear benefit; purely stylistic preference; too vague to implement; current design appears intentional and better; not grounded in exact evidence; `confirms_current_code_is_working` is not `yes`.
1848
+
1849
+ ---
1850
+
1851
+ ## Phase 2C — Inline Critic Challenge for CRITICAL and HIGH Defects
1852
+
1853
+ Trigger immediately after each reviewer batch containing CRITICAL or HIGH CONFIRMED or PRE_EXISTING findings. Do not wait for all reviewer batches to complete.
1854
+
1855
+ Critic receives only: the relevant validated findings, exact evidence quotes, minimal surrounding context, and any runtime validation notes.
1856
+
1857
+ Critic checks:
1858
+ - Is the finding real at the cited location?
1859
+ - Did reviewer miss a mitigating control?
1860
+ - Is the severity justified?
1861
+ - Is runtime validation sufficient or required?
1862
+ - Is the fix actionable?
1863
+ - Does the finding overclaim beyond evidence?
1864
+ - Is this part of a repeated pattern requiring sibling coverage?
1865
+
1866
+ ```
1867
+ CRITIC_RESULT
1868
+ finding_id:
1869
+ verdict: UPHELD | REFINED | DOWNGRADED | OVERTURNED
1870
+ original_severity: CRITICAL | HIGH
1871
+ final_severity:
1872
+ file:
1873
+ line:
1874
+ exact_quote:
1875
+ title:
1876
+ final_problem:
1877
+ final_fix:
1878
+ ai_pattern: <same value from validated finding or N/A>
1879
+ verdict_reason:
1880
+ coverage_gap:
1881
+ END
1882
+ ```
1883
+
1884
+ Only UPHELD, REFINED, and DOWNGRADED findings may enter the confirmed evidence set. OVERTURNED findings are dropped and logged.
1885
+
1886
+ If Phase 2C downgrades a CRITICAL/HIGH to MEDIUM/LOW, route immediately through Phase 2M. Record `finalization_status: DOWNGRADED`.
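One way to read the two paragraphs above is as a small decision function over the critic verdict. This is a sketch under stated assumptions: the function name and the returned dictionary keys are hypothetical, and `finalization_status` is left unset wherever the text does not specify it.

```python
def apply_critic_verdict(verdict: str, final_severity: str) -> dict:
    if verdict == "OVERTURNED":
        # Dropped from the confirmed evidence set (and logged by the caller).
        return {"enters_evidence_set": False, "next_phase": None, "finalization_status": None}
    if verdict not in {"UPHELD", "REFINED", "DOWNGRADED"}:
        raise ValueError(f"unknown critic verdict: {verdict}")
    # A downgrade that lands on MEDIUM or LOW is routed through Phase 2M.
    needs_2m = verdict == "DOWNGRADED" and final_severity in {"MEDIUM", "LOW"}
    return {
        "enters_evidence_set": True,
        "next_phase": "2M" if needs_2m else None,
        "finalization_status": "DOWNGRADED" if needs_2m else None,
    }
```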
1887
+
1888
+ ---
1889
+
1890
+ ## Phase 2M — Reviewer Finalization for MEDIUM and LOW Defects
1891
+
1892
+ This is not a separate agent dispatch. Reviewer performs this before returning a validation batch.
1893
+
1894
+ For every MEDIUM or LOW CONFIRMED or PRE_EXISTING finding:
1895
+ 1. Re-read evidence.
1896
+ 2. Check whether a mitigating control was missed.
1897
+ 3. Confirm severity is not inflated.
1898
+ 4. Confirm the finding is not style preference.
1899
+ 5. Confirm actionability.
1900
+ 6. Set `inline_routing: REVIEWER_FINALIZED` or `inline_routing: REVIEWER_DOWNGRADED`.
1901
+ 7. Set `finalization_status: FINALIZED` or `finalization_status: DOWNGRADED`.
1902
+
1903
+ Only FINALIZED and DOWNGRADED findings enter the confirmed evidence set.
1904
+
1905
+ ---
1906
+
1907
+ ## Phase 2E — Critic Validation for Enhancements
1908
+
1909
+ Every report-eligible enhancement requires critic validation.
1910
+
1911
+ Rationale for asymmetry with MEDIUM/LOW defects: enhancement value is more subjective. LOW-value enhancements are normally omitted unless the user requested exhaustive enhancement review. If a LOW-value enhancement is retained, critic validation is still required.
1912
+
1913
+ Phase 2E may run concurrently with Phase 2C and Phase 2M only for disjoint findings and disjoint subsystems. If an enhancement and defect concern the same file or root cause, serialize validation to keep the defect/enhancement boundary clear.
1914
+
1915
+ Critic receives batches by category and subsystem.
1916
+
1917
+ Critic checks:
1918
+ - Is the current state quoted accurately?
1919
+ - Is the opportunity genuinely valuable?
1920
+ - Is the improvement concrete enough to implement?
1921
+ - Is the effort estimate plausible?
1922
+ - Would the suggestion add more complexity than value?
1923
+ - Does it conflict with codebase intent or style?
1924
+ - Does it duplicate another opportunity?
1925
+ - Should it be merged, split, downgraded, or rejected?
1926
+
1927
+ ```
1928
+ ENHANCEMENT_CRITIC_RESULT
1929
+ enhancement_id:
1930
+ verdict: UPHELD_HIGH_VALUE | UPHELD_MEDIUM_VALUE | REFINED | MERGED | DOWNGRADED | REJECTED
1931
+ final_category:
1932
+ final_title:
1933
+ file:
1934
+ line:
1935
+ exact_quote:
1936
+ final_enhancement:
1937
+ expected_impact:
1938
+ effort: S | M | L
1939
+ dependencies:
1940
+ verdict_reason:
1941
+ END
1942
+ ```
1943
+
1944
+ Only UPHELD_HIGH_VALUE, UPHELD_MEDIUM_VALUE, REFINED, MERGED, and DOWNGRADED enhancements enter the final report.
1945
+
1946
+ ---
1947
+
1948
+ ## Phase 3 — Test Validation and Drift Review
1949
+
1950
+ Run this phase if any selected track touches functionality, testing, security, public claims, CI, or behavior.
1951
+
1952
+ If Track C did not run, Phase 3 is limited to test-related drift arising from findings in other selected tracks.
1953
+
1954
+ Use test_engineer where available.
1955
+
1956
+ Tasks:
1957
+ 1. Review every test-related finding and every claim that depends on tests.
1958
+ 2. Confirm whether tests assert behavior or merely execute code.
1959
+ 3. Confirm whether test fixtures match current schemas and defaults.
1960
+ 4. Confirm whether mocked boundaries hide real integration failures.
1961
+ 5. Confirm whether snapshot tests are masking meaningful changes.
1962
+ 6. Identify property-based testing opportunities for invariants.
1963
+ 7. Identify mutation resilience gaps for high-risk logic.
1964
+ 8. Run safe focused test commands where needed.
1965
+ 9. Record commands run and what they prove.
1966
+
1967
+ ```
1968
+ TEST_DRIFT_REVIEW
1969
+ related_findings:
1970
+ commands_run:
1971
+ behavior_assertions_verified:
1972
+ stale_tests_found:
1973
+ weak_assertions_found:
1974
+ property_based_opportunities:
1975
+ mutation_resilience_gaps:
1976
+ remaining_uncertainty:
1977
+ END
1978
+ ```
1979
+
1980
+ Write to `ledgers/test-drift-review.md`. If not applicable, write with `NOT_APPLICABLE` and reason.
1981
+
1982
+ Rules:
1983
+ - Coverage percentage is not proof of test quality.
1984
+ - Passing tests are not proof of correct behavior.
1985
+ - Test names are claims.
1986
+ - A test that cannot fail for the bug it claims to prevent is a test-quality finding.
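The last rule is easiest to see with a deliberately buggy function and two checks against it. Everything in this sketch is hypothetical and exists only to show the difference between executing code and asserting behavior.

```python
def apply_discount(price: float, percent: float) -> float:
    # Deliberately buggy for the demo: the discount is added instead of subtracted.
    return price + price * percent / 100


def weak_test() -> bool:
    # Only checks that a float comes back, so it cannot fail for the sign bug.
    return isinstance(apply_discount(100.0, 10.0), float)


def behavior_test() -> bool:
    # Pins the expected value, so the sign bug makes this check fail.
    return apply_discount(100.0, 10.0) == 90.0
```

Here `weak_test()` passes against the buggy implementation while `behavior_test()` fails, which is the signal a test-quality finding should capture.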
1987
+
1988
+ ---
1989
+
1990
+ ## Phase 4 — Architect Synthesis
1991
+
1992
+ Architect synthesizes only validated evidence.
1993
+
1994
+ Inputs: Phase 0 ledgers; candidate ledgers; reviewer validation ledgers; inline critic results; enhancement critic results; `ledgers/test-drift-review.md`.
1995
+
1996
+ Synthesis tasks:
1997
+ 1. Drop DISPROVED findings.
1998
+ 2. Drop OVERTURNED critic findings.
1999
+ 3. Keep UNVERIFIED findings only in Coverage Notes.
2000
+ 4. Keep CONFIRMED and PRE_EXISTING defects only if they passed required routing.
2001
+ 5. Keep enhancements only if critic upheld, refined, merged, or downgraded them.
2002
+ 6. Deduplicate same-root-cause findings.
2003
+ 7. Merge repeated pattern findings only when evidence supports the cluster.
2004
+ 8. Separate defects from enhancements.
2005
+ 9. Separate unsupported claims from code defects.
2006
+ 10. Separate AI slop patterns from normal technical debt.
2007
+ 11. Count rejected and unverified items so filtering is auditable.
2008
+ 12. Identify systemic themes.
2009
+ 13. Identify recommended remediation or enhancement order.
2010
+ 14. Identify omitted tracks and coverage limitations.
2011
+ 15. Create `ledgers/strengths-ledger.md` with only quoted codebase strengths. If no strengths can be quoted, write `NOT_APPLICABLE`.
2012
+ 16. Verify coverage closure: every selected-track coverage unit must be REVIEWED, NOT_APPLICABLE, SKIPPED_WITH_REASON, or BLOCKED. If any unit is UNASSIGNED or UNREVIEWED, do not proceed to Phase 5. Return to Phase 1 for that unit.
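The coverage closure gate in task 16 can be sketched as a check over a mapping of coverage unit to status. The function name and the shape of the mapping are assumptions; the status names come from the task text.

```python
CLOSED_STATUSES = {"REVIEWED", "NOT_APPLICABLE", "SKIPPED_WITH_REASON", "BLOCKED"}


def open_coverage_units(units: dict[str, str]) -> list[str]:
    # Any unit whose status is not a closed one (for example UNASSIGNED or UNREVIEWED)
    # blocks Phase 5 and sends that unit back to Phase 1.
    return sorted(name for name, status in units.items() if status not in CLOSED_STATUSES)
```

An empty result means synthesis may proceed; a non-empty result lists the units that still need a Phase 1 pass.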
2013
+
2014
+ Claim ledger outcome definitions:
2015
+ - `supported` — implementation evidence confirms the claim.
2016
+ - `partially_supported` — evidence supports part but not all of the claim.
2017
+ - `unsupported` — no implementation evidence supports the claim.
2018
+ - `contradicted` — implementation evidence conflicts with the claim.
2019
+ - `stealth_change` — public behavior, API contract, config, or documented workflow appears to have changed without a corresponding documentation, migration, changelog, or test update.
2020
+ - `unverified` — evidence was insufficient to classify.
2021
+
2022
+ ### Required Counts Block
2023
+
2024
+ ```
2025
+ Defect Findings by Track:
2026
+ functionality_correctness: C / H / M / L / I
2027
+ security_privacy: C / H / M / L / I
2028
+ llm_ai_security: C / H / M / L / I
2029
+ supply_chain: C / H / M / L / I
2030
+ testing_quality: C / H / M / L / I
2031
+ ui_ux_accessibility: C / H / M / L / I
2032
+ performance: C / H / M / L / I
2033
+ observability: C / H / M / L / I
2034
+ ai_slop_provenance: C / H / M / L / I
2035
+ docs_claims_drift: C / H / M / L / I
2036
+ cross_platform: C / H / M / L / I
2037
+ cross_boundary: C / H / M / L / I
2038
+ total: C / H / M / L / I
2039
+
2040
+ Validation Outcomes:
2041
+ candidates_generated:
2042
+ confirmed:
2043
+ pre_existing:
2044
+ disproved:
2045
+ unverified:
2046
+ reviewer_downgraded:
2047
+ critic_upheld:
2048
+ critic_refined:
2049
+ critic_downgraded:
2050
+ critic_overturned:
2051
+
2052
+ Enhancement Outcomes:
2053
+ candidates_generated:
2054
+ upheld_high_value:
2055
+ upheld_medium_value:
2056
+ refined:
2057
+ merged:
2058
+ downgraded:
2059
+ rejected:
2060
+ unverified:
2061
+
2062
+ Claim Ledger:
2063
+ supported:
2064
+ partially_supported:
2065
+ unsupported:
2066
+ contradicted:
2067
+ stealth_change:
2068
+ unverified:
2069
+
2070
+ Coverage Closure:
2071
+ total_coverage_units:
2072
+ reviewed:
2073
+ not_applicable:
2074
+ skipped_with_reason:
2075
+ blocked:
2076
+ unreviewed: <must be 0 to proceed>
2077
+
2078
+ AI Pattern Distribution:
2079
+ phantom_dependency:
2080
+ hallucinated_api:
2081
+ stale_api_usage:
2082
+ confident_stub:
2083
+ happy_path_only:
2084
+ over_abstraction:
2085
+ context_rot:
2086
+ security_theater:
2087
+ generated_test_weakness:
2088
+ mcp_tool_poisoning:
2089
+ unsupported_claim:
2090
+ other:
2091
+ ```
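A per-track `C / H / M / L / I` line can be derived mechanically from the validated findings, which helps keep the counts internally consistent. This sketch assumes each finding is a dictionary with `track` and `final_severity` keys; the `track` key is an assumption carried over from the candidate record, and the function name is hypothetical.

```python
from collections import Counter

SEVERITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM", "LOW", "INFO"]


def severity_line(findings: list[dict], track: str) -> str:
    # Count final severities for one track and render them in C / H / M / L / I order.
    counts = Counter(f["final_severity"] for f in findings if f["track"] == track)
    return " / ".join(str(counts[severity]) for severity in SEVERITY_ORDER)
```

Missing severities render as `0` because `Counter` returns zero for absent keys.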
2092
+
2093
+ ---
2094
+
2095
+ ## Phase 5 — Final Whole-Report Critic
2096
+
2097
+ Before writing the final report, dispatch Critic with the planned synthesis.
2098
+
2099
+ Critic checks:
2100
+ - Does every final defect have validation evidence?
2101
+ - Did every CRITICAL/HIGH pass inline critic?
2102
+ - Did every MEDIUM/LOW pass reviewer finalization?
2103
+ - Does every enhancement have critic validation?
2104
+ - Are defects and enhancements separated?
2105
+ - Are all codebase strengths quoted in `ledgers/strengths-ledger.md`?
2106
+ - Are unverified items excluded from main findings?
2107
+ - Are severities calibrated to the rubrics?
2108
+ - Are UI findings concrete and implementable?
2109
+ - Are security findings exploitability-grounded?
2110
+ - Are performance findings stated no more strongly than the available measurements support?
2111
+ - Are AI-slop findings evidence-based rather than vibe-based?
2112
+ - Are claim ledger conclusions supported?
2113
+ - Are coverage notes honest?
2114
+ - Are counts internally consistent?
2115
+ - Does the coverage closure count show 0 UNREVIEWED?
2116
+ - Did the report omit any user-selected track?
2117
+
2118
+ ```
2119
+ FINAL_CRITIC_CHECK
2120
+ verdict: PASS | REVISE
2121
+ required_revisions:
2122
+ severity_adjustments:
2123
+ findings_to_drop:
2124
+ findings_to_reclassify_as_enhancements:
2125
+ enhancements_to_reclassify_as_defects:
2126
+ unsupported_report_claims:
2127
+ missing_or_empty_ledgers:
2128
+ unsupported_strengths:
2129
+ coverage_note_fixes:
2130
+ count_mismatches:
2131
+ coverage_closure_failures:
2132
+ END
2133
+ ```
2134
+
2135
+ If verdict is REVISE, revise the synthesis and rerun final critic until PASS.
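The revise-and-rerun rule can be sketched as a loop around two callables. `critic` and `revise` are hypothetical stand-ins for the Critic dispatch and the Architect revision, and the `max_rounds` guard is an assumption added so the sketch cannot loop forever; the protocol itself only says to repeat until PASS.

```python
from typing import Callable


def run_final_critic_loop(
    synthesis: dict,
    critic: Callable[[dict], dict],
    revise: Callable[[dict, dict], dict],
    max_rounds: int = 5,
) -> dict:
    for _ in range(max_rounds):
        check = critic(synthesis)  # Expected to carry a "verdict" of PASS or REVISE.
        if check["verdict"] == "PASS":
            return synthesis
        synthesis = revise(synthesis, check)  # Apply the required revisions, then re-check.
    raise RuntimeError("final critic did not return PASS within max_rounds")
```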
2136
+
2137
+ ---
2138
+
2139
+ ## Phase 6 — Final Report
2140
+
2141
+ Write to: `review-report.md` in the run directory.
2142
+
2143
+ Use this structure:
2144
+
2145
+ ```markdown
2146
+ # Codebase Review Report
2147
+
2148
+ Generated: [timestamp]
2149
+ Repository: [name/path]
2150
+ Git HEAD: [SHA]
2151
+ Selected Review Tracks: [tracks]
2152
+ Skipped Tracks: [tracks and why]
2153
+ Review Mode: [complete integrated | defect-focused | focused | enhancement-only | custom]
2154
+
2155
+ ## Executive Summary
2156
+ [2-5 sentences. Strongest confirmed themes only.]
2157
+
2158
+ ## Review Scope and Method
2159
+ - Phase 0 inventory completed: yes
2160
+ - User-selected tracks:
2161
+ - Explorer candidates generated:
2162
+ - Reviewer validation completed:
2163
+ - Inline critic used for CRITICAL/HIGH:
2164
+ - Reviewer finalization used for MEDIUM/LOW:
2165
+ - Enhancement critic used:
2166
+ - Final whole-report critic verdict:
2167
+ - Coverage closure verified: yes (N units reviewed)
2168
+ - Runtime validation commands run:
2169
+
2170
+ ## Findings Count
2171
+ [counts block]
2172
+
2173
+ ## Critical and High Confirmed Defect Findings
2174
+ [full details. Do not include PRE_EXISTING here.]
2175
+
2176
+ ## High-Severity Pre-Existing Findings
2177
+ [required if any CRITICAL/HIGH PRE_EXISTING findings exist]
2178
+
2179
+ ## Medium Defect Findings
2180
+ [full details or grouped details]
2181
+
2182
+ ## Low and Info Defect Findings
2183
+ [condensed but evidence-grounded]
2184
+
2185
+ ## Security, Privacy, and Supply Chain Notes
2186
+ [include only if selected or relevant]
2187
+
2188
+ ## Unsupported, Contradicted, or Partially Supported Claims
2189
+ [claim ledger outcomes]
2190
+
2191
+ ## AI Slop and Code Provenance Patterns
2192
+ [evidence-based patterns only. Never vibe-based.]
2193
+
2194
+ ## Testing and Test Drift Findings
2195
+ [test-quality and drift results]
2196
+
2197
+ ## UI/UX and Accessibility Findings
2198
+ [include only if selected and UI exists]
2199
+
2200
+ ## Performance and Observability Findings
2201
+ [include only if selected]
2202
+
2203
+ ## Systemic Themes
2204
+ [themes synthesized from validated findings only]
2205
+
2206
+ ## Enhancement Opportunities
2207
+ [include only if selected]
2208
+
2209
+ ### Top 10 Highest-Impact Enhancements
2210
+ [top validated high-value opportunities, ranked by impact]
2211
+
2212
+ ### Full Enhancement Catalog
2213
+
2214
+ #### Architecture Enhancements (ARCH-*)
2215
+ #### Code Quality Enhancements (QUAL-*)
2216
+ #### Performance Enhancements (PERF-*)
2217
+ #### Resilience and Observability Enhancements (RES-*)
2218
+ #### Testing Enhancements (TEST-*)
2219
+ #### UI/UX — Visual Hierarchy and Layout (UI-HIER-*)
2220
+ #### UI/UX — Interaction Design and Feedback (UI-INT-*)
2221
+ #### UI/UX — Accessibility and Inclusivity (UI-A11Y-*)
2222
+ #### UI/UX — Typography and Visual Polish (UI-VIS-*)
2223
+ #### UI/UX — Performance and Perceived Performance (UI-PERF-*)
2224
+ #### UI/UX — Consistency and Design System Alignment (UI-CON-*)
2225
+
2226
+ ### Implementation Roadmap
2227
+
2228
+ #### Phase 1 — Quick Wins
2229
+ Low effort, high clarity. List by ID with one-line description.
2230
+
2231
+ #### Phase 2 — Meaningful Improvements
2232
+ Medium effort, clear payoff. List by ID with dependencies noted.
2233
+
2234
+ #### Phase 3 — Architectural Investments
2235
+ High effort, transformational impact. List by ID.
2236
+
2237
+ ### Codebase Strengths
2238
+ [specific patterns worth preserving. Each strength must cite a file and line range and include exact quote evidence.]
2239
+
2240
+ ## Recommended Remediation Order
2241
+ 1. Security, supply-chain, data-loss, and broken shipped functionality.
2242
+ 2. Unsupported public claims and stealth behavior changes.
2243
+ 3. Trust-boundary and authorization defects.
2244
+ 4. Test gaps that allow confirmed defects to recur.
2245
+ 5. Performance and observability gaps affecting production diagnosis.
2246
+ 6. AI slop and provenance cleanup by repeated pattern.
2247
+ 7. Validated enhancement opportunities by dependency order.
2248
+
2249
+ ## Coverage Notes
2250
+ - Tracks not run:
2251
+ - Areas inventoried but not deeply reviewed:
2252
+ - Runtime validations not run and why:
2253
+ - UNVERIFIED findings worth future attention:
2254
+ - Files or generated artifacts intentionally excluded:
2255
+
2256
+ ## Validation Notes
2257
+ - candidates generated:
2258
+ - reviewer confirmed:
2259
+ - reviewer disproved:
2260
+ - reviewer unverified:
2261
+ - critic upheld/refined/downgraded/overturned:
2262
+ - enhancements upheld/rejected:
2263
+ - final critic verdict:
2264
+ - coverage units: total / reviewed / not_applicable / skipped / blocked / unreviewed
2265
+ ```
2266
+
2267
+ ### Per-Finding Final Format
2268
+
2269
+ For every final defect:
2270
+ ```markdown
2271
+ ### [SEVERITY] [Title]
2272
+
2273
+ Location: `path:line`
2274
+ Track: [track]
2275
+ Status: CONFIRMED | PRE_EXISTING
2276
+ Confidence: HIGH | MEDIUM
2277
+
2278
+ Evidence:
2279
+ > [exact quote]
2280
+
2281
+ Problem:
2282
+ [factual issue]
2283
+
2284
+ Impact:
2285
+ [specific impact]
2286
+
2287
+ Validation:
2288
+ [what reviewer checked, runtime command if any, critic outcome if high severity]
2289
+
2290
+ Recommended Fix:
2291
+ [actionable remediation]
2292
+ ```
2293
+
2294
+ For every final enhancement:
2295
+ ```markdown
2296
+ ### [ENHANCEMENT-ID] [Title]
2297
+
2298
+ Location: `path:line`
2299
+ Category: [category]
2300
+ Value: High | Medium
2301
+ Effort: S | M | L
2302
+
2303
+ Current State:
2304
+ > [exact quote]
2305
+
2306
+ Opportunity:
2307
+ [specific improvement]
2308
+
2309
+ Expected Impact:
2310
+ [what improves]
2311
+
2312
+ Validation:
2313
+ [critic result and any dependencies]
2314
+ ```
2315
+
2316
+ ---
2317
+
2318
+ ## Completion Rules
2319
+
2320
+ The review is complete only when:
2321
+
2322
+ - Phase 0 inventory completed.
2323
+ - Every required ledger exists and is non-empty, or contains an explicit `NOT_APPLICABLE` reason.
2324
+ - The user selected review tracks, or the preselected tracks were stated explicitly.
2325
+ - Every selected track was run or explicitly skipped with reason.
2326
+ - Coverage closure verified: every selected-track coverage unit is REVIEWED, NOT_APPLICABLE, SKIPPED_WITH_REASON, or BLOCKED. Zero UNASSIGNED or UNREVIEWED units.
2327
+ - Every final defect has exact quote evidence.
2328
+ - Every final enhancement has exact quote evidence.
2329
+ - Every defect candidate was reviewer validated or logged as not validated.
2330
+ - Every CRITICAL/HIGH final finding passed inline critic.
2331
+ - Every MEDIUM/LOW final finding passed reviewer finalization.
2332
+ - Every enhancement in the final report passed enhancement critic.
2333
+ - Test drift review ran when behavior or tests were in scope.
2334
+ - Final whole-report critic returned PASS.
2335
+ - `review-report.md` was written.
2336
+ - The report was read back and checked for missing sections.
2337
+
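The coverage-closure rule in the list above can be checked mechanically. The following is a minimal sketch, not part of the package: it assumes each coverage unit is a dict with hypothetical `id` and `status` keys (the real ledger schema is defined in the package's `jsonl-schemas.md` and may differ), and it treats a missing status as not closed.

```python
# Statuses the completion rules accept as "closed" for a coverage unit.
CLOSED_STATUSES = {"REVIEWED", "NOT_APPLICABLE", "SKIPPED_WITH_REASON", "BLOCKED"}


def open_units(units):
    """Return the ids of units whose status is not a closed status.

    `units` is a list of dicts with hypothetical keys "id" and "status".
    A unit with no "status" key counts as open (for example UNASSIGNED).
    """
    return [u["id"] for u in units if u.get("status") not in CLOSED_STATUSES]


def coverage_closed(units):
    """True only when every unit is closed, i.e. zero open units remain."""
    return not open_units(units)


if __name__ == "__main__":
    sample = [
        {"id": "u1", "status": "REVIEWED"},
        {"id": "u2", "status": "BLOCKED"},
        {"id": "u3", "status": "UNREVIEWED"},
        {"id": "u4"},  # no status recorded yet
    ]
    print("open units:", open_units(sample))
    print("coverage closed:", coverage_closed(sample))
```

A check like this would run before the final critic pass; any id it returns has to be reviewed, marked not applicable, skipped with a reason, or marked blocked before the review can be called complete.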
2338
+ Do not implement fixes. Do not modify source files.
2339
+
2340
+ Stop after reporting the final review file path, selected tracks, counts summary, and any user questions that block remediation planning.
2341
+
2342
+ ---
2343
+
2344
+ ## Final Architect Response to User
2345
+
2346
+ Do not fill in this template until Phase 5 final critic returns PASS.
2347
+
2348
+ After the report is complete and the final critic verdict is PASS:
2349
+
2350
+ ```
2351
+ Review complete.
2352
+
2353
+ Report: .swarm/review-v7/runs/<run_id>/review-report.md
2354
+ Selected tracks: [tracks]
2355
+ Coverage units closed: [n] (0 unreviewed)
2356
+ Confirmed defects: [counts by severity]
2357
+ Validated enhancements: [counts by value tier]
2358
+ Candidates filtered out: [counts]
2359
+ Final critic verdict: PASS
2360
+
2361
+ Highest-risk confirmed findings:
2362
+ - [one-line list of CRITICAL/HIGH only]
2363
+
2364
+ Highest-value enhancements:
2365
+ - [one-line list if enhancement track ran]
2366
+
2367
+ Coverage limitations:
2368
+ - [brief list]
2369
+
2370
+ No source files were modified.
2371
+ ```
2372
+
2373
+ If final critic verdict is not PASS, do not claim completion. Revise and rerun.