npm - specpipe - Versions diffs - 1.0.1 → 1.0.2 - Mend

specpipe 1.0.1 → 1.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (46) hide show

package/README.md +111 -311
package/package.json +2 -1
package/src/cli.js +16 -6
package/src/commands/diff.js +1 -1
package/src/commands/init-agents.js +40 -20
package/src/commands/init-global.js +88 -33
package/src/commands/init-interactive.js +71 -0
package/src/commands/init.js +61 -22
package/src/commands/remove.js +159 -49
package/src/commands/upgrade.js +21 -56
package/src/lib/agent-guards.js +34 -78
package/src/lib/agent-install.js +38 -25
package/src/lib/agents.js +53 -11
package/src/lib/claude-global.js +50 -77
package/src/lib/hooks.js +203 -0
package/src/lib/installer.js +73 -61
package/src/lib/reconcile.js +13 -8
package/templates/{.claude/hooks → hooks}/file-guard.js +26 -21
package/templates/hooks/specpipe-read-guard.sh +94 -21
package/templates/hooks/specpipe-shell-guard.sh +121 -29
package/templates/rules/specpipe-rules.md +77 -0
package/templates/skills/sp-build/SKILL.md +101 -1
package/templates/skills/sp-build-behavior-matrix/SKILL.md +876 -0
package/templates/skills/sp-challenge/SKILL.md +34 -0
package/templates/skills/sp-challenge-behavior-matrix/SKILL.md +289 -0
package/templates/skills/sp-explore/SKILL.md +132 -0
package/templates/skills/sp-explore-behavior-matrix/SKILL.md +862 -0
package/templates/skills/sp-fix/SKILL.md +73 -1
package/templates/skills/sp-fix-behavior-matrix/SKILL.md +338 -0
package/templates/skills/sp-investigate/SKILL.md +70 -0
package/templates/skills/sp-investigate-behavior-matrix/SKILL.md +718 -0
package/templates/skills/sp-plan/SKILL.md +90 -0
package/templates/skills/sp-plan-behavior-matrix/SKILL.md +1037 -0
package/templates/skills/sp-review/SKILL.md +29 -3
package/templates/skills/sp-review-behavior-matrix/SKILL.md +294 -0
package/templates/.claude/CLAUDE.md +0 -79
package/templates/.claude/hooks/path-guard.sh +0 -118
package/templates/.claude/hooks/self-review.sh +0 -27
package/templates/.claude/hooks/sensitive-guard.sh +0 -227
package/templates/.claude/settings.json +0 -68
package/templates/docs/WORKFLOW.md +0 -325
package/templates/docs/specs/.gitkeep +0 -0
package/templates/rules/specpipe-guards.md +0 -40
package/templates/scripts/test-hooks.sh +0 -66
/package/templates/{.claude/hooks → hooks}/comment-guard.js +0 -0
/package/templates/{.claude/hooks → hooks}/glob-guard.js +0 -0

package/templates/skills/sp-review/SKILL.md CHANGED Viewed

@@ -33,9 +33,11 @@ Before Phase 0:
    git log --oneline "$BASE"...HEAD
    ```
 2. Check for spec in `docs/specs/<feature>/<feature>.md` — review against INTENT.
-3. Read the diff: `git diff "$BASE"...HEAD`
-4. **Expand blast radius.** **If GA available (per Phase 0a):** run `ga_impact(diff=<full diff>)` (or `changed_files=[...]`) to get impacted files, affected tests, affected routes/configs, and a 4-dim risk score in one call — this is the flagship review tool, prefer it over any chain of grep + manual reading. Cross-check with `ga_architecture` for module/layer membership (auth, payment, core) and `ga_risk(changed_files=[...])` for a refactor-safety gate. **If GA unavailable:** grep for each changed function/type name across the rest of the tree to find affected files; identify sensitive paths (`auth/`, `payment/`, `core/`) by directory.
-5. **What already exists:** List any code/flows that already partially solve the problem in this diff. Flag if the diff rebuilds something that already exists.
+3. If the spec contains `## Behavior Matrix`, treat each `BM.AS-NNN.<surface>` row as review intent. Keep the matrix open while reading the diff.
+4. Use the invariant registry README/schema as base knowledge; README examples are not runtime entries. Then read project-local invariant entries if present: `docs/invariants/INV-*.md`. If none exist, continue and note "No invariant registry found" internally.
+5. Read the diff: `git diff "$BASE"...HEAD`
+6. **Expand blast radius.** **If GA available (per Phase 0a):** run `ga_impact(diff=<full diff>)` (or `changed_files=[...]`) to get impacted files, affected tests, affected routes/configs, and a 4-dim risk score in one call — this is the flagship review tool, prefer it over any chain of grep + manual reading. Cross-check with `ga_architecture` for module/layer membership (auth, payment, core) and `ga_risk(changed_files=[...])` for a refactor-safety gate. **If GA unavailable:** grep for each changed function/type name across the rest of the tree to find affected files; identify sensitive paths (`auth/`, `payment/`, `core/`) by directory.
+7. **What already exists:** List any code/flows that already partially solve the problem in this diff. Flag if the diff rebuilds something that already exists.
 If `$ARGUMENTS` provided → scope to those files only.
 If diff > 500 lines → review file-by-file, prioritize by smart focus below.
@@ -55,6 +57,8 @@ Auto-detect primary focus from diff content:
 | Test files only | Test quality (skip security deep-dive) |
 | Docs/comments only | Accuracy only (minimal review) |
 | Payment, billing, transaction | Correctness + idempotency |
+| Spec has `## Behavior Matrix`, or diff contains state/status/role/viewer/surface/read-model/feed/calendar/email/list/detail/worklist/dashboard | Lifecycle + parity + cascade |
+| Diff touches code named in invariant logs | Regression invariants for that component |
 Spend 60% of analysis on the primary focus. Cover all categories, but proportionally.
@@ -95,6 +99,28 @@ Spend 60% of analysis on the primary focus. Cover all categories, but proportion
   - Test referencing an AS-NNN that no longer exists in the spec → "Test references removed AS-NNN"
   Keep this lightweight — match on AS-NNN identifiers and story name substrings, not semantic analysis.
+### Behavior Matrix & Invariants (High)
+Use this section when the spec has `## Behavior Matrix`, `## Sibling Surface Map`, or project-local invariant logs match the diff.
+- **Cell-to-diff trace:** For each changed state transition, viewer rule, read surface, notification, queue, dashboard count, feed, calendar, or API projection, identify the corresponding `BM.AS-NNN.<surface>` row. If code changes behavior for a matrix cell but tests do not reference that AS ID or `BM.AS-NNN`, flag High.
+- **Surface parity:** If the diff updates one read surface for a state/viewer change, check matrix siblings for list/detail/worklist/dashboard/feed/API/email/calendar parity. Flag missing sibling updates unless the matrix marks them `N/A` with a concrete reason.
+- **Sibling candidate disposition:** If the spec has `## Sibling Surface Map`, every high/medium candidate must be `cover`, `GAP-NNN`, or `ignore(reason)`. Flag missing dispositions. If the diff changes a confirmed sibling surface but omits sibling tests/updates for the other confirmed surfaces, flag High unless a GAP/N/A covers it.
+- **Discovery drift:** If the diff introduces or modifies an entry-point whose name/evidence matches the operation (`create_from_*`, `*_from_*`, `send_*invite*`, `*_outcome*`, `reschedule*`, `book_next*`, etc.) but it is absent from the Sibling Surface Map and invariant registry, flag "Invariant candidate missing" as Medium/High depending on risk.
+- **Suspicious N/A/GAP:** If changed code exercises a matrix row marked `N/A` or `GAP`, flag the spec/code mismatch. `GAP` is not a test obligation, but it is a review concern when the diff implements or depends on that behavior.
+- **Viewer-relative behavior:** Confirm visibility, labels, allowed actions, recipients, and queue membership are derived from `state/status × viewer/role`, not only from owner/assignee shortcuts.
+- **Cascade propagation:** State transitions must update derived queues, counts, feeds, read models, notifications, calendars, and APIs named by the matrix. Flag partial propagation and stale read-path risks.
+- **Delete/orphan/incomplete/out-of-order:** For lifecycle changes, check delete/cancel/reschedule/reassign paths for orphaned rows, stale external events, incomplete rollback, and out-of-order async delivery.
+- **Timing/source parity:** When matrix rows imply sync/async/external-down timing, confirm tests assert the correct timing tier and user-visible source of truth.
+- **No-vacuous boundary tests:** If a test covers `BM.AS-NNN.<surface>` but mocks the exact boundary that surface depends on (API projection, calendar provider, email provider, read-model query, queue feed), flag it. Mock outside the boundary, not the behavior being claimed.
+- **Invariant registry check:** For each project-local `docs/invariants/INV-*.md` entry whose `component_keys`, `sibling_set`, `shared_anchor`, or keywords match the diff, verify code and tests preserve it. Status handling:
+  - `enforced` → hard review gate: if the diff touches one sibling/component in the invariant but does not update or run the `test_ref` / equivalent regression, flag High.
+  - `confirmed` → High risk advisory: flag missing sibling updates/tests unless the spec intentionally changes the invariant.
+  - `candidate` → Medium/High depending on evidence: flag as "Invariant candidate needs confirmation" if the diff repeats the class.
+  - `retired` → ignore unless the diff revives the retired component.
+  A repeated class such as carry-forward, viewer-relative labels, invite-on-reschedule, orphan cleanup, or stale dashboard count is High unless intentionally changed in the spec.
+- **Review fix direction:** Suggested fixes should usually update the spec/test/code triangle: add or correct the matrix cell, add a `BM.AS-NNN` regression test, then fix code. Do not suggest "add generic coverage" when a precise cell is available.
 ### Code Quality (Medium)
 - Dead code: removed functions still imported elsewhere?
 - Obvious duplication: copy-pasted blocks that should be shared?

package/templates/skills/sp-review-behavior-matrix/SKILL.md ADDED Viewed

@@ -0,0 +1,294 @@
+---
+description: |
+  Pre-merge code review — security, correctness, spec alignment. Reviews diff
+  against the base branch with smart focus by blast radius.
+  Use when asked to "review this PR", "review code", "review trước khi merge",
+  "kiểm tra code", "check my diff", "pre-merge review", or "review my changes".
+  Proactively suggest before /sp-commit or when the user is about to merge,
+  especially after /sp-build produces a non-trivial diff.
+  Catches: SQL safety issues, security gaps, spec drift, regressions in
+  modified-not-added lines, and changes to sensitive layers (auth, payment, core).
+allowed-tools: Read, Bash, Glob, Grep, AskUserQuestion, mcp__graphatlas__*
+---
+Pre-merge code review — security, correctness, spec alignment.
+## Phase 0a — Graphatlas probe (run once, silently)
+Before Phase 0:
+1. Call `mcp__graphatlas__ga_architecture` with `max_modules: 1`.
+2. Interpret:
+   - Returns `modules` → **GA available.** Use `ga_impact`, `ga_risk`, `ga_architecture` for blast-radius and layer checks below. Manual grep is fallback.
+   - Error `STALE_INDEX` → call `mcp__graphatlas__ga_reindex` (mode `"full"`), retry once, then treat as available. (Reviewing the diff against a stale index gives wrong impact — reindex matters here.)
+   - Tool not found / connection error / any other failure → **GA unavailable.** Skip `ga_*` steps and review the diff manually. Do not re-probe.
+3. Carry the outcome through Phase 0 - 4.
+---
+## Phase 0: Understand Intent
+1. Read commit messages:
+   ```
+   BASE=$(git symbolic-ref refs/remotes/origin/HEAD 2>/dev/null | sed 's|refs/remotes/origin/||') || BASE="main"
+   git log --oneline "$BASE"...HEAD
+   ```
+2. Check for spec in `docs/specs/<feature>/<feature>.md` — review against INTENT.
+3. If the spec contains `## Behavior Matrix`, treat each `BM.AS-NNN.<surface>` row as review intent. Keep the matrix open while reading the diff.
+4. Use the invariant registry README/schema as base knowledge; README examples are not runtime entries. Then read project-local invariant entries if present: `docs/invariants/INV-*.md`. If none exist, continue and note "No invariant registry found" internally.
+5. Read the diff: `git diff "$BASE"...HEAD`
+6. **Expand blast radius.** **If GA available (per Phase 0a):** run `ga_impact(diff=<full diff>)` (or `changed_files=[...]`) to get impacted files, affected tests, affected routes/configs, and a 4-dim risk score in one call — this is the flagship review tool, prefer it over any chain of grep + manual reading. Cross-check with `ga_architecture` for module/layer membership (auth, payment, core) and `ga_risk(changed_files=[...])` for a refactor-safety gate. **If GA unavailable:** grep for each changed function/type name across the rest of the tree to find affected files; identify sensitive paths (`auth/`, `payment/`, `core/`) by directory.
+7. **What already exists:** List any code/flows that already partially solve the problem in this diff. Flag if the diff rebuilds something that already exists.
+If `$ARGUMENTS` provided → scope to those files only.
+If diff > 500 lines → review file-by-file, prioritize by smart focus below.
+---
+## Phase 1: Smart Focus
+Auto-detect primary focus from diff content:
+| Diff contains | Focus heavily on |
+|--------------|-----------------|
+| auth, login, token, session, password, JWT | Security — full depth |
+| SQL, query, database, migration | Injection + data integrity |
+| API, endpoint, route, controller, handler | Input validation + error handling |
+| .env, config, secret, key, credential | Secret exposure |
+| Test files only | Test quality (skip security deep-dive) |
+| Docs/comments only | Accuracy only (minimal review) |
+| Payment, billing, transaction | Correctness + idempotency |
+| Spec has `## Behavior Matrix`, or diff contains state/status/role/viewer/surface/read-model/feed/calendar/email/list/detail/worklist/dashboard | Lifecycle + parity + cascade |
+| Diff touches code named in invariant logs | Regression invariants for that component |
+Spend 60% of analysis on the primary focus. Cover all categories, but proportionally.
+---
+## Phase 2: Checklist
+### Security (Critical)
+- **Injection:** Search diff for string concatenation in SQL/shell/HTML. Look for `${var}` in queries, `.innerHTML`, template literals in SQL. Flag any user input reaching a query without parameterization.
+- **Auth/Authz:** New endpoint → has auth middleware? Can user A access user B's data? ID in URL without ownership check?
+- **Secrets:** Hardcoded strings matching `sk-`, `ghp_`, `Bearer `, long base64. New env vars committed?
+- **Error exposure:** Catch blocks sending raw errors to users? Stack traces, file paths, DB schemas in responses?
+- **Dependencies:** New packages — maintained? >1000 weekly downloads? Known CVEs?
+### Correctness (High)
+- **Logic vs intent:** Does the code do what commits/spec claim? "Add validation" but code just logs?
+- **Edge cases:** null, empty, 0, negative, MAX_INT, unicode, very long strings — handled?
+- **Error handling:** For each try/catch — error logged with context? User shown safe message? Resources cleaned in finally?
+- **Concurrency:** Shared state without locks? Read-then-write without atomicity? Non-atomic DB updates?
+- **Null safety:** Optionals used without guards? `object!.property` without nil check?
+### API/Backend (High)
+- **Unvalidated input** — request body/params used without schema validation
+- **Missing rate limiting** — public endpoints without throttling
+- **Missing timeouts** — external HTTP calls without timeout configuration
+- **Missing CORS configuration** — APIs accessible from unintended origins
+- **Error message leakage** — stack traces, file paths, DB schemas in responses
+### Spec-Test Alignment (Medium)
+- Source changed but no spec update in `docs/specs/<feature>/`? → flag
+- Source changed but no test update? → flag
+- Spec changed but acceptance scenarios or tests not updated? → flag
+- Code removed but dead tests remain? → flag
+- Spec contains vague requirements without metrics ("fast", "secure", "easy", "scalable")? → flag with suggestion to add SC-NNN with concrete numbers
+- **AS-to-test name check:** Read the spec's `## Stories` section. For each AS-NNN, check if a test file contains a test named or described with that AS ID or its short description. Flag:
+  - AS in spec with no matching test → "AS-NNN: \<description\> has no corresponding test"
+  - Test referencing an AS-NNN that no longer exists in the spec → "Test references removed AS-NNN"
+  Keep this lightweight — match on AS-NNN identifiers and story name substrings, not semantic analysis.
+### Behavior Matrix & Invariants (High)
+Use this section when the spec has `## Behavior Matrix`, `## Sibling Surface Map`, or project-local invariant logs match the diff.
+- **Cell-to-diff trace:** For each changed state transition, viewer rule, read surface, notification, queue, dashboard count, feed, calendar, or API projection, identify the corresponding `BM.AS-NNN.<surface>` row. If code changes behavior for a matrix cell but tests do not reference that AS ID or `BM.AS-NNN`, flag High.
+- **Surface parity:** If the diff updates one read surface for a state/viewer change, check matrix siblings for list/detail/worklist/dashboard/feed/API/email/calendar parity. Flag missing sibling updates unless the matrix marks them `N/A` with a concrete reason.
+- **Sibling candidate disposition:** If the spec has `## Sibling Surface Map`, every high/medium candidate must be `cover`, `GAP-NNN`, or `ignore(reason)`. Flag missing dispositions. If the diff changes a confirmed sibling surface but omits sibling tests/updates for the other confirmed surfaces, flag High unless a GAP/N/A covers it.
+- **Discovery drift:** If the diff introduces or modifies an entry-point whose name/evidence matches the operation (`create_from_*`, `*_from_*`, `send_*invite*`, `*_outcome*`, `reschedule*`, `book_next*`, etc.) but it is absent from the Sibling Surface Map and invariant registry, flag "Invariant candidate missing" as Medium/High depending on risk.
+- **Suspicious N/A/GAP:** If changed code exercises a matrix row marked `N/A` or `GAP`, flag the spec/code mismatch. `GAP` is not a test obligation, but it is a review concern when the diff implements or depends on that behavior.
+- **Viewer-relative behavior:** Confirm visibility, labels, allowed actions, recipients, and queue membership are derived from `state/status × viewer/role`, not only from owner/assignee shortcuts.
+- **Cascade propagation:** State transitions must update derived queues, counts, feeds, read models, notifications, calendars, and APIs named by the matrix. Flag partial propagation and stale read-path risks.
+- **Delete/orphan/incomplete/out-of-order:** For lifecycle changes, check delete/cancel/reschedule/reassign paths for orphaned rows, stale external events, incomplete rollback, and out-of-order async delivery.
+- **Timing/source parity:** When matrix rows imply sync/async/external-down timing, confirm tests assert the correct timing tier and user-visible source of truth.
+- **No-vacuous boundary tests:** If a test covers `BM.AS-NNN.<surface>` but mocks the exact boundary that surface depends on (API projection, calendar provider, email provider, read-model query, queue feed), flag it. Mock outside the boundary, not the behavior being claimed.
+- **Invariant registry check:** For each project-local `docs/invariants/INV-*.md` entry whose `component_keys`, `sibling_set`, `shared_anchor`, or keywords match the diff, verify code and tests preserve it. Status handling:
+  - `enforced` → hard review gate: if the diff touches one sibling/component in the invariant but does not update or run the `test_ref` / equivalent regression, flag High.
+  - `confirmed` → High risk advisory: flag missing sibling updates/tests unless the spec intentionally changes the invariant.
+  - `candidate` → Medium/High depending on evidence: flag as "Invariant candidate needs confirmation" if the diff repeats the class.
+  - `retired` → ignore unless the diff revives the retired component.
+  A repeated class such as carry-forward, viewer-relative labels, invite-on-reschedule, orphan cleanup, or stale dashboard count is High unless intentionally changed in the spec.
+- **Review fix direction:** Suggested fixes should usually update the spec/test/code triangle: add or correct the matrix cell, add a `BM.AS-NNN` regression test, then fix code. Do not suggest "add generic coverage" when a precise cell is available.
+### Code Quality (Medium)
+- Dead code: removed functions still imported elsewhere?
+- Obvious duplication: copy-pasted blocks that should be shared?
+- Naming: consistent with codebase? Descriptive?
+- Complexity: functions > 40 lines or > 3 nesting levels?
+- **Diagram maintenance:** Diff touches code with ASCII diagrams in nearby comments? Check if those diagrams are still accurate. Stale diagrams are worse than no diagrams — they actively mislead. Flag even if outside immediate scope.
+### Performance (Low)
+- Flag N+1 queries, unbounded collections, redundant computation in loops.
+### When Reviewing AI-Generated Code
+Prioritize these concerns above standard checklist:
+- **Behavioral regressions** — does changed code break edge cases the AI didn't consider?
+- **Trust boundaries** — does the AI code implicitly trust external input it shouldn't?
+- **Architecture drift** — does it introduce hidden coupling or deviate from existing patterns?
+- **Model cost escalation** — flag workflows that escalate to higher-cost models without clear reasoning; recommend lower-cost tiers for deterministic operations.
+### Failure Mode Grid
+For each new codepath in the diff, evaluate 3 dimensions:
+| Codepath | Test covers it? | Error handling? | User sees clear error? |
+|----------|----------------|-----------------|----------------------|
+| (path)   | ✓/✗            | ✓/✗             | clear / silent        |
+**Critical gap** = all 3 are ✗ → flag as High severity, non-optional.
+---
+## Confidence Calibration
+Every finding MUST include a confidence score:
+| Score | Meaning | Display rule |
+|-------|---------|-------------|
+| 9–10 | Verified by reading code directly. Concrete bug demonstrated. | Show normally |
+| 7–8 | High-confidence pattern match. Very likely correct. | Show normally |
+| 5–6 | Possible false positive. | Show with caveat: "verify this" |
+| 3–4 | Low confidence. | Appendix only |
+| 1–2 | Speculation. | Only report if severity Critical |
+**Finding format:** `**[C-1] (confidence: 9/10) file.ts:42 — description**`
+---
+## Phase 3: TL;DR Output
+Print ONLY this block to terminal — concise, no full finding bodies yet. Keep all finding detail internal for Phase 5.
+```
+## Code Review: <branch>
+Scope: X files, +Y/-Z lines | Focus: <detected> | Verdict: APPROVE | REQUEST CHANGES | NEEDS DISCUSSION
+Counts: N Critical · N High · N Medium · N Low  (total: N)
+Top blockers (Critical + High only, one-liner each — cap 5):
+- [C-1] file.ts:42 — SQL injection (conf 9/10)
+- [H-1] api.ts:15 — empty catch swallows DB errors (conf 8/10)
+Positive: <1 line — reinforce one good pattern from the diff>
+Not in scope: <1 line, or "None identified.">
+```
+If total findings = 0 → print TL;DR with "No findings." and STOP. Skip Phase 4–6.
+After printing TL;DR, append one line:
+> 💡 Want a second opinion? Run `/sp-voices` on this diff for a multi-LLM cross-check before triaging — especially useful for security/payment changes or when most findings sit at confidence 5–7.
+---
+## Phase 4: Bulk triage
+Use `AskUserQuestion`. Recommendation logic for the question text:
+- Any Critical or High present → recommend **A (Review each)**
+- Only Medium/Low, majority confidence ≥7 → recommend **B (Accept all)**
+- Majority confidence ≤6 → recommend **C (Reject all)**
+Append `(Recommended)` to the matching option.
+```json
+{
+  "questions": [
+    {
+      "question": "<N> findings (<C>C / <H>H / <M>M / <L>L). How to triage? RECOMMENDATION: Choose <X> — <one-line reason based on severity/confidence mix>.",
+      "header": "Triage Mode",
+      "multiSelect": false,
+      "options": [
+        {"label": "A) Review each — walk through finding by finding with full details"},
+        {"label": "B) Accept all — add every finding to action list, skip per-item review"},
+        {"label": "C) Reject all — dismiss all findings, verdict stands, no action list"},
+        {"label": "D) Exit — keep the TL;DR above, stop here"}
+      ]
+    }
+  ]
+}
+```
+Routing: A → Phase 5. B → mark all Accepted, jump to Phase 6. C → mark all Rejected, jump to Phase 6. D → stop.
+---
+## Phase 5: Per-finding loop (only if A chosen)
+Iterate findings in order: Critical → High → Medium → Low. For EACH, print the full detail block:
+```
+[<ID>] <severity> | confidence: <N>/10 | <file:line>
+Title: <title>
+Description: <what's wrong — concrete>
+Evidence: <code snippet or direct quote from diff>
+Failure scenario: <step-by-step how this hits production>
+Suggested fix: <specific, actionable>
+```
+Then ask. Append `(Recommended)` to the matching option:
+- **Accept** if: severity ≥ High AND confidence ≥ 7
+- **Reject** if: confidence ≤ 6
+- **Defer** if: severity Medium/Low AND confidence ≥ 7
+```json
+{
+  "questions": [
+    {
+      "question": "Finding [<ID>]: <title>\n<1-line flaw summary>\nRECOMMENDATION: Choose <X> — <rationale: severity × confidence>.",
+      "header": "Finding <ID>",
+      "multiSelect": false,
+      "options": [
+        {"label": "A) Accept — add to action list"},
+        {"label": "B) Reject — false positive, dismiss"},
+        {"label": "C) Defer — note in PR description, don't fix now"}
+      ]
+    }
+  ]
+}
+```
+*(Move `(Recommended)` to whichever option matches the rule above.)*
+Escape hatch: if user hits Reject 3 times in a row on High/Critical items, ask once: "Skip remaining per-finding prompts? A) Continue B) Reject all remaining C) Accept all remaining" — avoids fatigue on noisy reviews.
+---
+## Phase 6: Summary
+Print final tally:
+```
+Triage complete.
+Accepted: N | Rejected: N | Deferred: N
+Action list (accepted):
+- [<ID>] file:line — <title>  →  /sp-fix "<title>"
+- ...
+Deferred (note in PR description):
+- [<ID>] file:line — <title>
+```
+If accepted = 0 → print "No action items. Verdict stands: <verdict>." and stop.
+Do **NOT** spawn `/sp-fix` automatically — user runs it per item.
+---
+## Rules
+1. **Never auto-fix.** Report only — triage classifies, doesn't edit code.
+2. **Specific.** Every finding has `file:line` and concrete description.
+3. **Severity matches impact.** Style nits = Low. Injection = Critical.
+4. **Positive notes mandatory.** Reviews aren't just about problems.
+5. **Review against intent.** Not just "clean code?" but "does this match spec/commits?"
+6. **Proportional.** 5-line doc change ≠ 500-line auth rewrite.
+7. **TL;DR first, details on demand.** Never dump all finding bodies to terminal upfront — reveal detail only inside Phase 5.
+8. **Recommendation mandatory.** Every `AskUserQuestion` includes `RECOMMENDATION:` in question text AND `(Recommended)` suffix on the matching option.

package/templates/.claude/CLAUDE.md DELETED Viewed

@@ -1,79 +0,0 @@
-# Project Rules
-## Spec-First Development
-Every change follows this cycle: **SPEC (with acceptance scenarios) → CODE + TESTS → BUILD PASS**.
-- Business logic specs live in `docs/specs/<feature>/<feature>.md`
-- Acceptance scenarios (Given/When/Then) are embedded in the spec under `## Stories`
-- Never write code before the spec exists. Never auto-modify specs from code.
-- Specs are the source of truth. If code contradicts the spec, the code is wrong.
-## Workflow Quick Reference
-| Trigger | Commands | Details |
-|---------|----------|---------|
-| New project (no codebase yet) | `/sp-explore` (greenfield) → `/sp-scaffold` → `/sp-plan` → `/sp-build` | Scaffolds a runnable skeleton + ARCHITECTURE/ADRs before the first spec; `/sp-build` Foundation Gate blocks the TDD loop until a runnable harness exists |
-| Feature unclear / complex | `/sp-explore` → `/sp-plan` | Clarify requirements before writing spec |
-| New feature | `/sp-plan` → `/sp-challenge` (optional) → code in chunks → `/sp-build` each chunk | Start with spec or description |
-| Update feature | `/sp-plan <spec-path> "changes"` → code → `/sp-build` | Do NOT manually edit spec before /sp-plan |
-| Bug (complex/outage) | `/sp-investigate "description"` → `/sp-fix <investigation-file>` | OPTIONAL: diagnose root cause + blast radius before fixing |
-| Bug fix | `/sp-fix "description"` | Test-first: write failing test → fix → green |
-| Remove feature | `/sp-plan <spec-path> "remove stories"` → delete code + tests → build pass | /sp-plan handles snapshot before removal |
-| Pre-merge check | `/sp-review` | Diff-based quality gate |
-| Commit changes | `/sp-commit` | Secret scan + conventional commit |
-| Render spec HTML | `/sp-spec-render <feature>` | Generates scannable `<feature>.html` (sidebar TOC, story cards). Run after `/sp-plan` if you want the HTML view, or to refresh a stale one |
-| Render any markdown HTML | `/sp-md-render <file.md>` | Generic counterpart to `/sp-spec-render` for non-spec markdown (investigation, explore, RFC, retro, README). Callouts, step cards, Mermaid, dark/light theme |
-| Multi-LLM review | `/sp-voices [target]` | Send material to 2–3 LLMs, synthesize consensus + disagreements |
-| Rephrase to human voice | `/sp-humanize [text]` | Turn plan/notes/AI output into natural, send-ready text. Infers format + audience + tone, strips AI tone. Not part of the dev cycle |
-For detailed workflow steps, templates, and decision trees, see `docs/WORKFLOW.md`.
-## Testing
-- **Run tests:** use the project's native test command (e.g. `npx vitest run`, `swift test`, `python3 -m pytest`, `cargo test`, `go test ./...`). `/sp-build` and `/sp-fix` auto-detect it from project markers.
-- **Compile/typecheck BEFORE running tests.** Catch syntax errors early.
-- **Max 3 fix loops** for test failures. If tests still fail after 3 attempts, stop and report.
-- **NEVER fix production code** to make a test pass — ask the user first.
-- **No mocks, fakes, stubs, or cheats** to pass builds. Real implementations only.
-  Test doubles are acceptable only when they replace external services (APIs, databases)
-  that cannot run locally.
-## Project Info
-> Fill these in when setting up the project (or let `setup.sh` do it automatically).
-- **Language:** [CUSTOMIZE]
-- **Test framework:** [CUSTOMIZE]
-- **Source directory:** [CUSTOMIZE]
-- **Test directory:** [CUSTOMIZE]
-## Conventions
-- **Commits:** Conventional format — `type(scope): description`
-  Types: `feat`, `fix`, `docs`, `refactor`, `test`, `chore`, `perf`, `build`, `ci`
-- **File naming:** Descriptive enough that AI tools understand the purpose from the path alone.
-  Prefer kebab-case for new files (e.g., `user-authentication-service.ts`).
-- **Dates in filenames:** Use `$(date +%Y-%m-%d)` — never guess dates.
-- **Spec naming:**
-  - kebab-case, lowercase: `user-auth/user-auth.md`, `file-sync/file-sync.md`
-  - Feature name, not module name: `user-auth/` not `AuthService/`
-  - Each feature gets its own directory: `docs/specs/<feature>/<feature>.md`
-  - Short (2-3 words): `payment-flow/` not `payment-processing-with-stripe-integration/`
-  - No prefix/suffix: `user-auth.md` not `spec-user-auth.md`
-## Forbidden
-These patterns are never acceptable in this project:
-- `any` / `Any` type without explicit justification in a comment
-- Force unwrap (`!`) or force cast (`as!`) without a preceding guard
-- Hardcoded secrets, API keys, tokens, or credentials in source files
-- Mocks or fake data used solely to make tests pass
-- `git push --force` to main or master branches
-- Editing generated files, vendor directories, or lock files
-- Committing `.env` files, certificates, or private keys
-- Ignoring compiler/linter warnings without documented reason
-- Replacing real code with placeholder comments like `// ... existing code ...`
-- Renaming parameters to `_param` instead of actually fixing unused parameter issues
-- Reading or writing `.env`, `.pem`, `.key`, or other sensitive files (use `.env.example` for templates)

package/templates/.claude/hooks/path-guard.sh DELETED Viewed

@@ -1,118 +0,0 @@
-#!/usr/bin/env bash
-# path-guard.sh — PreToolUse hook for Claude Code
-#
-# Blocks Bash commands that target directories known to be large and wasteful
-# to explore (node_modules, build artifacts, .git internals, etc.).
-#
-# Exit codes:
-#   0 — command allowed
-#   2 — command blocked (policy)
-#
-# Environment:
-#   PATH_GUARD_EXTRA — additional pipe-separated patterns to block
-#                      e.g. "\.terraform|\.vagrant"
-set -euo pipefail
-# Windows note: this hook requires bash (WSL or Git Bash).
-# On Windows without bash, Claude Code will fail to run this hook and skip it silently.
-# Install WSL or Git Bash and ensure `bash` is in PATH to activate protection.
-# ─── Read hook payload from stdin ───────────────────────────────────
-INPUT=$(cat)
-[[ -z "$INPUT" ]] && exit 0
-# Extract command from JSON — try node first, fall back to grep/sed
-extract_command() {
-    if command -v node &>/dev/null; then
-        printf '%s' "$1" | node -e "
-          try {
-            const d = JSON.parse(require('fs').readFileSync(0, 'utf-8'));
-            const cmd = d.tool_input?.command;
-            if (typeof cmd === 'string') process.stdout.write(cmd);
-          } catch {}
-        " 2>/dev/null
-    else
-        # Lightweight fallback: extract "command":"..." from JSON
-        printf '%s' "$1" | grep -o '"command":"[^"]*"' | head -1 | sed 's/^"command":"//;s/"$//' 2>/dev/null
-    fi
-}
-COMMAND=$(extract_command "$INPUT") || exit 0
-[[ -z "$COMMAND" ]] && exit 0
-# ─── Blocked directory patterns ─────────────────────────────────────
-# Use explicit path separators to avoid substring false positives.
-# [/\\] matches both forward slash (Unix/macOS) and backslash (Windows Git Bash).
-# e.g. "build/" should not match "rebuild/src" or "my-build-tool"
-SEP="[/\\\\]"
-BLOCKED="(^|[ /\\\\])node_modules(${SEP}|$| )"
-BLOCKED+="|(__pycache__)"
-BLOCKED+="|\.git${SEP}(objects|refs)"
-BLOCKED+="|(^|[ /\\\\])dist${SEP}"
-BLOCKED+="|(^|[ /\\\\])build${SEP}"
-BLOCKED+="|\.next${SEP}"
-BLOCKED+="|(^|[ /\\\\])vendor(${SEP}|$| )"
-BLOCKED+="|(^|[ /\\\\])Pods(${SEP}|$| )"
-BLOCKED+="|\.build${SEP}"
-BLOCKED+="|DerivedData"
-BLOCKED+="|\.gradle${SEP}"
-BLOCKED+="|(^|[ /\\\\])target${SEP}"
-BLOCKED+="|\.nuget"
-BLOCKED+="|\.cache(${SEP}|$| )"
-# Python
-BLOCKED+="|(^|[ /\\\\])\.venv${SEP}"
-BLOCKED+="|(^|[ /\\\\])venv${SEP}"
-BLOCKED+="|\.mypy_cache${SEP}"
-BLOCKED+="|\.pytest_cache${SEP}"
-BLOCKED+="|\.ruff_cache${SEP}"
-BLOCKED+="|\.egg-info(${SEP}|$| )"
-# C# .NET (match .NET-specific subdirs to avoid false positives on generic bin/)
-BLOCKED+="|(^|[ /\\\\])bin${SEP}(Debug|Release|net|x64|x86)"
-BLOCKED+="|(^|[ /\\\\])obj${SEP}(Debug|Release|net)"
-# Node.js frameworks
-BLOCKED+="|\.nuxt${SEP}"
-BLOCKED+="|\.svelte-kit${SEP}"
-BLOCKED+="|\.parcel-cache${SEP}"
-BLOCKED+="|\.turbo${SEP}"
-BLOCKED+="|(^|[ /\\\\])out${SEP}(server|static|_next)"
-# Ruby
-BLOCKED+="|\.bundle${SEP}"
-# Append project-specific patterns from env
-if [[ -n "${PATH_GUARD_EXTRA:-}" ]]; then
-    BLOCKED+="|$PATH_GUARD_EXTRA"
-fi
-# ─── Exploration-verb pre-check ─────────────────────────────────────
-# Only apply blocked-path rules when the command contains a file-reading
-# or directory-listing verb. This avoids false positives for commands that
-# merely REFERENCE a path through a blocked directory without exploring it
-# (e.g. variable assignments, [ -x "$B" ] binary existence checks, or
-# running a compiled binary whose path happens to include /dist/).
-#
-# Verbs that indicate directory/file exploration:
-EXPLORE_VERB_RE="(^|[[:space:]|;&\`(])(ls|ll|la|find|cat|head|tail|less|more|wc|stat|du|tree|bat|od|xxd|hexdump|nl)([[:space:]]|$)"
-if ! printf '%s\n' "$COMMAND" | grep -qE "$EXPLORE_VERB_RE"; then
-    exit 0
-fi
-# ─── Match and block ────────────────────────────────────────────────
-# Strip node_modules/.bin/<binary> references before checking blocked paths.
-# Executing an installed binary (e.g. node_modules/.bin/playwright) is not
-# directory exploration — only the binary itself is referenced, not the tree.
-COMMAND_FOR_CHECK=$(printf '%s\n' "$COMMAND" | sed -E "s|node_modules[/\\]\.bin[/\\][^[:space:]]*||g")
-if printf '%s\n' "$COMMAND_FOR_CHECK" | grep -qE "$BLOCKED"; then
-    # Extract which pattern matched for a useful error message
-    MATCHED=$(printf '%s\n' "$COMMAND" | grep -oE "$BLOCKED" | head -1)
-    echo "Blocked: command references '$MATCHED' — this directory is typically large and exploring it wastes tokens. Use Glob or Grep tools instead." >&2
-    exit 2
-fi
-exit 0

package/templates/.claude/hooks/self-review.sh DELETED Viewed

@@ -1,27 +0,0 @@
-#!/usr/bin/env bash
-# self-review.sh — Stop hook for Claude Code
-#
-# Injects a self-review checklist when Claude is about to finish.
-# Non-blocking: always exits 0, just adds context for Claude to consider.
-#
-# Environment:
-#   SELF_REVIEW_ENABLED — set to "false" to disable (default: true)
-# No set -euo pipefail — this hook must NEVER fail
-# Windows note: requires bash (WSL or Git Bash). Silently skipped on Windows native.
-# Check if disabled
-if [[ "${SELF_REVIEW_ENABLED:-true}" == "false" ]]; then
-    exit 0
-fi
-# Read stdin (Stop event payload — may be empty or minimal)
-cat > /dev/null 2>&1 || true
-# Inject self-review checklist as context
-cat <<'REVIEW_JSON'
-{
-  "continue": true,
-  "systemMessage": "Self-review before finishing:\n1. Did you leave any TODO/FIXME comments that should be resolved now?\n2. Did you create mock or fake implementations just to pass tests?\n3. Did you replace real code with placeholder comments like '// ... existing code'?\n4. Do all changed files compile and typecheck cleanly?\n5. Did you run the full test suite, not just the new tests?\n6. Are there any files you modified but forgot to include in the summary?"
-}
-REVIEW_JSON