@wazir-dev/cli 1.2.0 → 1.3.0
- package/CHANGELOG.md +39 -44
- package/README.md +13 -13
- package/assets/demo.cast +47 -0
- package/assets/demo.gif +0 -0
- package/docs/anti-patterns/AP-23-skipping-enabled-workflows.md +28 -0
- package/docs/anti-patterns/AP-24-clarifier-deciding-scope.md +34 -0
- package/docs/concepts/architecture.md +1 -1
- package/docs/concepts/why-wazir.md +1 -1
- package/docs/readmes/INDEX.md +1 -1
- package/docs/readmes/features/expertise/README.md +1 -1
- package/docs/readmes/features/hooks/pre-compact-summary.md +1 -1
- package/docs/reference/hooks.md +1 -0
- package/docs/reference/launch-checklist.md +3 -3
- package/docs/reference/review-loop-pattern.md +3 -2
- package/docs/reference/skill-tiers.md +2 -2
- package/expertise/antipatterns/process/ai-coding-antipatterns.md +117 -0
- package/exports/hosts/claude/.claude/commands/plan-review.md +3 -1
- package/exports/hosts/claude/.claude/commands/verify.md +30 -1
- package/exports/hosts/claude/export.manifest.json +2 -2
- package/exports/hosts/codex/export.manifest.json +2 -2
- package/exports/hosts/cursor/export.manifest.json +2 -2
- package/exports/hosts/gemini/export.manifest.json +2 -2
- package/llms-full.txt +48 -18
- package/package.json +2 -3
- package/schemas/phase-report.schema.json +9 -0
- package/skills/brainstorming/SKILL.md +14 -2
- package/skills/clarifier/SKILL.md +189 -35
- package/skills/executor/SKILL.md +67 -0
- package/skills/init-pipeline/SKILL.md +0 -1
- package/skills/reviewer/SKILL.md +86 -13
- package/skills/self-audit/SKILL.md +20 -0
- package/skills/skill-research/SKILL.md +188 -0
- package/skills/verification/SKILL.md +41 -3
- package/skills/wazir/SKILL.md +304 -38
- package/tooling/src/capture/command.js +17 -1
- package/tooling/src/capture/store.js +32 -0
- package/tooling/src/capture/user-input.js +66 -0
- package/tooling/src/checks/security-sensitivity.js +69 -0
- package/tooling/src/cli.js +28 -26
- package/tooling/src/guards/phase-prerequisite-guard.js +58 -0
- package/tooling/src/init/auto-detect.js +0 -2
- package/tooling/src/init/command.js +3 -95
- package/tooling/src/status/command.js +6 -1
- package/tooling/src/verify/proof-collector.js +299 -0
- package/workflows/plan-review.md +3 -1
- package/workflows/verify.md +30 -1
@@ -52,6 +52,14 @@ Read `context_mode` from `.wazir/state/config.json`:
 
 ## Sub-Workflow 1: Research (discover workflow)
 
+**Before starting this phase, output to the user:**
+
+> **Research** — About to scan the codebase and fetch external references to understand the existing architecture, tech stack, and any standards referenced in the briefing.
+>
+> **Why this matters:** Without research, I'd assume the wrong framework version, miss existing patterns in the codebase, and contradict established conventions. Every wrong assumption here cascades into a wrong spec and wrong implementation.
+>
+> **Looking for:** Existing code patterns, dependency versions, external standard definitions, architectural constraints
+
 Delegate to the discover workflow (`workflows/discover.md`):
 
 1. **Keyword extraction:** Read the briefing and extract concepts/terms that are vague, reference external standards, or use unfamiliar terminology.
@@ -68,22 +76,43 @@ Delegate to the discover workflow (`workflows/discover.md`):
 
 Save result to `.wazir/runs/latest/clarified/research-brief.md`.
 
+**After completing this phase, output to the user:**
+
+> **Research complete.**
+>
+> **Found:** [N] external sources fetched, [N] codebase patterns identified, [N] architectural constraints documented
+>
+> **Without this phase:** Spec would be built on assumptions instead of evidence — wrong framework APIs, missed existing utilities, contradicted naming conventions
+>
+> **Changed because of this work:** [List of key discoveries — e.g., "found existing auth middleware at src/middleware/auth.ts", "project uses Vitest not Jest"]
+
 ### Checkpoint: Research Review
 
 > **Research complete. Here's what I found:**
 >
 > [Summary of codebase state, relevant architecture, external context]
->
-> 1. **Looks good, continue** (Recommended)
-> 2. **Missing context** — let me add more information
-> 3. **Wrong direction** — let me clarify the intent
 
-
+Ask the user via AskUserQuestion:
+- **Question:** "Does the research look complete and accurate?"
+- **Options:**
+  1. "Looks good, continue" *(Recommended)*
+  2. "Missing context — let me add more information"
+  3. "Wrong direction — let me clarify the intent"
+
+Wait for the user's selection before continuing.
 
 ---
 
 ## Sub-Workflow 2: Clarify (clarify workflow)
 
+**Before starting this phase, output to the user:**
+
+> **Clarification** — About to transform the briefing and research into a precise scope document with explicit constraints, assumptions, and boundaries.
+>
+> **Why this matters:** Without explicit clarification, "add user auth" could mean OAuth, magic links, or username/password. Every ambiguity left here becomes a 50/50 coin flip during implementation that could produce the wrong feature.
+>
+> **Looking for:** Ambiguous requirements, implicit assumptions, missing constraints, scope boundaries, unresolved questions
+
 ### Input Preservation (before producing clarification)
 
 1. Glob `.wazir/input/tasks/*.md`. If files exist:
@@ -93,37 +122,88 @@ Save result to `.wazir/runs/latest/clarified/research-brief.md`.
    - Every API endpoint, color hex code, and UI dimension from input must appear in the relevant item section.
 2. If `.wazir/input/tasks/` is empty or missing, synthesize from `briefing.md` alone.
 
+### Informed Question Batching (after research, before producing clarification)
+
+Research has completed. You now have codebase context and external findings. Before producing the clarification, ask the user INFORMED questions — informed by the research, not guesses.
+
+**Rules:**
+1. **Research runs FIRST, questions come AFTER.** Never ask questions before research completes.
+2. **Batch questions:** 1-3 batches of 3-7 questions each. Never one-at-a-time.
+3. **Every scope exclusion must be explicitly confirmed by the user.** You MUST NOT decide that something is "out of scope" without asking. If the input doesn't mention docs, ask: "The input doesn't mention documentation — should we include API docs, or is that explicitly out of scope?" Do NOT assume.
+4. **If the input is clear and complete:** Zero questions is fine. State: "Input is clear and specific. No ambiguities detected. Proceeding with clarification."
+5. **In auto mode (`interaction_mode: auto`):** Questions go to the gating agent, not the user.
+6. **In interactive mode (`interaction_mode: interactive`):** More detailed questions, present research findings that informed each question.
+
+**Question format:**
+```
+Based on research, I have [N] questions before proceeding:
+
+**Scope & Intent**
+1. [Question informed by research finding]
+2. [Question about ambiguous requirement]
+
+**Technical Decisions**
+3. [Question about architecture choice discovered during research]
+4. [Question about dependency/framework preference]
+
+**Boundaries**
+5. [Explicit scope boundary question — "Should X be included or excluded?"]
+```
+
+Ask via AskUserQuestion with the full batch. Wait for answers. If answers introduce new ambiguity, ask a follow-up batch (max 3 batches total).
+
 ### Clarification Production
 
-Read the briefing, research brief, and codebase context. Produce:
+Read the briefing, research brief, user answers to questions, and codebase context. Produce:
 
 - **What** we're building — concrete deliverables
 - **Why** — the motivation and business value
 - **Constraints** — technical, timeline, dependencies
-- **Assumptions** — what we're taking as given
-- **Scope boundaries** — what's IN and what's explicitly OUT
-- **Unresolved questions** — anything ambiguous
+- **Assumptions** — what we're taking as given (each explicitly confirmed by user or clearly stated in input)
+- **Scope boundaries** — what's IN and what's explicitly OUT (every exclusion must reference the user's confirmation: "Out of scope per user confirmation in question batch 1, Q5")
+- **Unresolved questions** — anything still ambiguous after question batches
 
 Save to `.wazir/runs/latest/clarified/clarification.md`.
 
 Invoke `wz:reviewer --mode clarification-review`. Resolve findings before presenting to user.
 
+**After completing this phase, output to the user:**
+
+> **Clarification complete.**
+>
+> **Found:** [N] ambiguities resolved, [N] assumptions documented, [N] scope boundaries defined, [N] items explicitly marked out-of-scope
+>
+> **Without this phase:** Implementation would proceed with hidden assumptions, scope would creep mid-build, and acceptance criteria would be vague enough to pass any implementation
+>
+> **Changed because of this work:** [List of resolved ambiguities — e.g., "clarified auth means OAuth2 with Google provider only", "out-of-scope: mobile responsive for v1"]
+
 ### Checkpoint: Clarification Review
 
 > **Here's the clarified scope:**
 >
 > [Full clarification]
->
-> 1. **Approved — continue to spec hardening** (Recommended)
-> 2. **Needs changes** — [user provides corrections]
-> 3. **Missing important context** — [user adds information]
 
-
+Ask the user via AskUserQuestion:
+- **Question:** "Does the clarified scope accurately capture what you want to build?"
+- **Options:**
+  1. "Approved — continue to spec hardening" *(Recommended)*
+  2. "Needs changes — let me provide corrections"
+  3. "Missing important context — let me add information"
+
+Wait for the user's selection before continuing. Route feedback: plan corrections → `user-feedback.md`, new requirements → `briefing.md`.
 
 ---
 
 ## Sub-Workflow 3: Spec Harden (specify + spec-challenge workflows)
 
+**Before starting this phase, output to the user:**
+
+> **Spec Hardening** — About to convert the clarified scope into a measurable, testable specification and then run adversarial spec-challenge review to find gaps.
+>
+> **Why this matters:** Without hardening, acceptance criteria stay vague ("it should work well") instead of measurable ("response time under 200ms for 95th percentile"). Vague specs pass any implementation, making review meaningless.
+>
+> **Looking for:** Untestable criteria, missing error handling specs, undefined edge cases, performance requirements, security constraints
+
 Delegate to the specify workflow (`workflows/specify.md`):
 
 1. The **specifier role** produces a measurable spec from clarification + research.
@@ -132,6 +212,16 @@ Delegate to the specify workflow (`workflows/specify.md`):
 
 Save result to `.wazir/runs/latest/clarified/spec-hardened.md`.
 
+**After completing this phase, output to the user:**
+
+> **Spec Hardening complete.**
+>
+> **Found:** [N] acceptance criteria tightened, [N] edge cases added, [N] error handling requirements specified, [N] spec-challenge findings resolved
+>
+> **Without this phase:** Acceptance criteria would be subjective, review would have no concrete standard to measure against, and "done" would mean whatever the implementer decided
+>
+> **Changed because of this work:** [List of hardening changes — e.g., "added 404 handling spec for missing resources", "specified max payload size of 5MB", "added rate limit requirement of 100 req/min"]
+
 ### Content-Author Detection
 
 After spec hardening, scan the spec for content needs. Auto-enable the `author` workflow if the spec mentions any of:
@@ -151,17 +241,28 @@ If detected, set `workflow_policy.author.enabled = true` in the run config and n
 > **Spec hardened. Changes made:**
 >
 > [List of gaps found and how they were tightened]
->
-> 1. **Approved — continue to brainstorming** (Recommended)
-> 2. **Disagree with a change** — [user specifies]
-> 3. **Found more gaps** — [user adds]
 
-
+Ask the user via AskUserQuestion:
+- **Question:** "Are the spec hardening changes acceptable?"
+- **Options:**
+  1. "Approved — continue to brainstorming" *(Recommended)*
+  2. "Disagree with a change — let me specify"
+  3. "Found more gaps — let me add"
+
+Wait for the user's selection before continuing.
 
 ---
 
 ## Sub-Workflow 4: Brainstorm (design + design-review workflows)
 
+**Before starting this phase, output to the user:**
+
+> **Brainstorming** — About to propose 2-3 design approaches with explicit trade-offs, then run design-review on the approved choice.
+>
+> **Why this matters:** Without exploring alternatives, the first approach that comes to mind gets built — even if a simpler, more maintainable, or more performant option exists. This is where architectural mistakes get caught cheaply instead of discovered during implementation.
+>
+> **Looking for:** Architectural trade-offs, scalability implications, complexity vs. simplicity, alignment with existing codebase patterns
+
 Invoke the `brainstorming` skill (`wz:brainstorming`):
 
 1. Propose 2-3 viable approaches with explicit trade-offs
@@ -170,42 +271,74 @@ Invoke the `brainstorming` skill (`wz:brainstorming`):
 
 ### Checkpoint: Design Approval
 
-
-
-
-
-
-
+Ask the user via AskUserQuestion:
+- **Question:** "Which design approach should we implement?"
+- **Options:**
+  1. "Approach A — [one-line summary]" *(Recommended)*
+  2. "Approach B — [one-line summary]"
+  3. "Approach C — [one-line summary]"
+  4. "Modify an approach — let me specify changes"
 
-
+Wait for the user's selection before continuing. This is the most important checkpoint.
 
 Save approved design to `.wazir/runs/latest/clarified/design.md`.
 
+**After completing this phase, output to the user:**
+
+> **Brainstorming complete.**
+>
+> **Found:** [N] approaches evaluated, [N] trade-offs documented, [N] design-review findings resolved
+>
+> **Without this phase:** The first viable approach would be built without considering alternatives — potentially choosing a complex solution when a simple one exists, or an approach that conflicts with existing patterns
+>
+> **Changed because of this work:** [Selected approach and why, rejected alternatives and why, design-review adjustments made]
+
 After approval: design-review loop with `--mode design-review` (5 canonical dimensions: spec coverage, design-spec consistency, accessibility, visual consistency, exported-code fidelity).
 
 ---
 
 ## Sub-Workflow 5: Plan (plan + plan-review workflows)
 
+**Before starting this phase, output to the user:**
+
+> **Planning** — About to break the approved design into ordered, dependency-aware implementation tasks with a gap analysis against the original input.
+>
+> **Why this matters:** Without explicit planning, tasks get implemented in the wrong order (breaking dependencies), items from the input get silently dropped, and task granularity is either too coarse (monolithic changes that are hard to review) or too fine (overhead without value).
+>
+> **Looking for:** Correct dependency ordering, complete input coverage, appropriate task granularity, clear acceptance criteria per task
+
 Delegate to `wz:writing-plans`:
 
 1. Planner produces a SINGLE execution plan at `.wazir/runs/latest/clarified/execution-plan.md` in spec-kit format.
 2. **Gap analysis exit gate:** Compare original input against plan. Invoke `wz:reviewer --mode plan-review`.
 3. Loop until clean or cap reached.
 
+**After completing this phase, output to the user:**
+
+> **Planning complete.**
+>
+> **Found:** [N] tasks created, [N] dependencies mapped, [N] plan-review findings resolved, [N] gap analysis items addressed
+>
+> **Without this phase:** Tasks would be implemented in ad-hoc order breaking dependencies, input items would be silently dropped, and task sizes would vary wildly making review inconsistent
+>
+> **Changed because of this work:** [Task count, dependency chain summary, any items reordered or split during plan-review]
+
 ### Checkpoint: Plan Review
 
 > **Implementation plan: [N] tasks**
 >
 > | # | Task | Complexity | Dependencies | Description |
 > |---|------|-----------|--------------|-------------|
->
-> 1. **Approved — ready for execution** (Recommended)
-> 2. **Reorder or split tasks**
-> 3. **Missing tasks**
-> 4. **Too granular / too coarse**
 
-
+Ask the user via AskUserQuestion:
+- **Question:** "Does the implementation plan look correct and complete?"
+- **Options:**
+  1. "Approved — ready for execution" *(Recommended)*
+  2. "Reorder or split tasks"
+  3. "Missing tasks"
+  4. "Too granular / too coarse"
+
+Wait for the user's selection before continuing.
 
 ---
 
@@ -220,9 +353,12 @@ Before presenting the plan to the user, verify ALL input items are covered:
 > **Scope reduction detected.** The input contains [N] items but the plan only covers [M].
 >
 > Missing items: [list]
-
-
-
+
+Ask the user via AskUserQuestion:
+- **Question:** "The plan is missing [N-M] items from your input. How should we proceed?"
+- **Options:**
+  1. "Add missing items to the plan" *(Recommended)*
+  2. "Approve reduced scope — I confirm these items can be dropped"
 
 **The clarifier MUST NOT autonomously drop items into "future tiers", "deferred", or "out of scope" without explicit user approval. This is a hard rule.**
 
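The gap-analysis invariant above (`items_in_plan >= items_in_input` unless the user approves a reduction) can be sketched as a small check. This is a hypothetical helper for illustration only — it is not part of the package's tooling, and the `coversItems` field on tasks is an assumed shape:

```javascript
// Hypothetical sketch of the input-coverage invariant: a plan may only
// cover fewer items than the input when the user explicitly approved it.
// `coversItems` (IDs of input items a task covers) is an assumed field.
function checkInputCoverage(inputItems, planTasks, userApprovedReduction = false) {
  const covered = new Set(planTasks.flatMap((task) => task.coversItems));
  const missing = inputItems.filter((item) => !covered.has(item.id));
  if (missing.length === 0) return { ok: true, missing: [] };
  // Scope reduction detected: only valid with explicit user approval.
  return { ok: userApprovedReduction, missing: missing.map((item) => item.id) };
}
```

Note that one task may cover several input items, which is why coverage is counted by item ID rather than by comparing raw task and item counts.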
@@ -230,6 +366,24 @@ Invariant: `items_in_plan >= items_in_input` unless user explicitly approves red
 
 ---
 
+## Reasoning Output
+
+Throughout the clarifier phase, produce reasoning at two layers:
+
+**Conversation (Layer 1):** Before each sub-workflow, explain the trigger and why it matters. After each sub-workflow, state what was found and the counterfactual — what would have gone wrong without it.
+
+**File (Layer 2):** Write `.wazir/runs/<id>/reasoning/phase-clarifier-reasoning.md` with structured entries per decision:
+- **Trigger** — what prompted the decision
+- **Options considered** — alternatives evaluated
+- **Chosen** — selected option
+- **Reasoning** — why
+- **Confidence** — high/medium/low
+- **Counterfactual** — what would go wrong without this info
+
+Examples of clarifier reasoning entries:
+- "Trigger: input says 'auth' without specifying provider. Options: ask user, assume OAuth2, assume magic links. Chosen: ask user. Counterfactual: assuming OAuth2 when user wanted Supabase auth = wrong middleware, 2 days rework."
+- "Trigger: 13 items in input. Options: plan all 13, tier into must/should/could. Chosen: plan all 13 (user explicitly said 'do not tier'). Counterfactual: tiering would silently drop 5 items."
+
 ## Done
 
 When the plan is approved:
package/skills/executor/SKILL.md CHANGED
@@ -61,12 +61,29 @@ Run these checks before implementing:
 
 If either fails, surface the failure and do NOT proceed until resolved.
 
+> **Output to the user** before execution begins:
+> Each task is implemented with TDD (test first, then code) and reviewed before commit. This catches correctness bugs, missing tests, wiring errors, and spec drift at the task level — before they compound across tasks and become expensive to fix.
+
+## Security Awareness
+
+Before implementing each task, check if the task touches security-sensitive areas. Run `detectSecurityPatterns` (from `tooling/src/checks/security-sensitivity.js`) mentally against the planned changes. If security patterns are detected (auth, token, password, session, SQL, fetch, upload, secret, env, API key, cookie, CORS, CSRF, JWT, OAuth, encrypt, decrypt, hash, salt):
+
+- Load security expertise from the composition map for the relevant concern
+- Apply defense-in-depth: validate inputs, parameterize queries, escape outputs, use secure defaults
+- The per-task reviewer will automatically add security dimensions when patterns are detected — expect and address security findings
+
 ## Execute (execute workflow)
 
 Implement tasks in the order defined by the execution plan.
 
 For each task:
 
+**Before starting each task, output to the user:**
+
+> **Implementing Task [NNN]: [task title]** — This enables [what downstream tasks or user-facing features depend on this task].
+>
+> **Looking for:** [Key technical concerns for this specific task — e.g., "correct API contract", "database migration safety", "backwards compatibility"]
+
 1. **Read** the task from the execution plan
 2. **Implement** using TDD (write test first, make it pass, refactor)
 3. **Verify locally** — run tests, type checks, linting as appropriate
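For illustration, a security-sensitivity detector of the kind described above could look like the following. This is a minimal sketch, not the actual implementation in `tooling/src/checks/security-sensitivity.js` — only the keyword list and the `triggered` field are taken from what this diff describes; everything else is assumed:

```javascript
// Minimal sketch of a security-sensitivity check. The keyword list mirrors
// the patterns named in the skill text; the actual detectSecurityPatterns
// in tooling/src/checks/security-sensitivity.js may differ.
const SECURITY_PATTERNS = [
  'auth', 'token', 'password', 'session', 'sql', 'fetch', 'upload',
  'secret', 'env', 'api key', 'cookie', 'cors', 'csrf', 'jwt', 'oauth',
  'encrypt', 'decrypt', 'hash', 'salt',
];

function detectSecurityPatterns(text) {
  const lower = text.toLowerCase();
  // Collect every pattern that appears anywhere in the diff/plan text.
  const matches = SECURITY_PATTERNS.filter((pattern) => lower.includes(pattern));
  return { triggered: matches.length > 0, matches };
}
```

A substring match like this errs toward false positives (e.g., "author" contains "auth"), which is the safer direction for deciding when to load extra security review dimensions.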
@@ -82,15 +99,29 @@ For each task:
    - See `docs/reference/review-loop-pattern.md` for full protocol
    - NOTE: this is the per-task review (5 dims), not the final scored review (7 dims) which runs in Phase 4
 5. **Commit** — only after review passes, commit with conventional commit format: `<type>(<scope>): <description>`
+   - **HARD RULE: One task = one commit.** Commit after EACH task completes its review. Never batch multiple tasks into a single commit. If the reviewer detects multi-task batching, the commit is REJECTED.
 6. **CHANGELOG** — if user-facing change, update `CHANGELOG.md` under `[Unreleased]` using keepachangelog types: Added, Changed, Fixed, Removed, Deprecated, Security.
 7. **Record** evidence at `.wazir/runs/latest/artifacts/task-NNN/`
 
+**After completing each task, output to the user:**
+
+> **Completed Task [NNN]: [task title].**
+>
+> **Changed:** [List of files created/modified, tests added, key implementation decisions]
+>
+> **Without this task:** [Concrete risk — e.g., "no auth middleware means all routes are publicly accessible", "no migration means schema change would require manual DB intervention"]
+>
+> **Review result:** [N] findings in [N] review passes, [N] fixed before commit
+
 Review loops follow `docs/reference/review-loop-pattern.md`. Code review scoping: review uncommitted changes before commit. If changes are committed, use `--base <pre-task-sha>`.
 
 Tasks always run sequentially.
 
 **Standalone mode:** When no `.wazir/runs/latest/` exists, review logs go to `docs/plans/`.
 
+> **Output to the user** before verification:
+> Verification produces deterministic proof — actual command output, not claims. It confirms that tests pass, types check, linters are clean, and every acceptance criterion has evidence. This is the evidence gate that separates "I think it works" from "here is proof it works."
+
 ## Verify (verify workflow)
 
 After all tasks are complete, run deterministic verification:
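The conventional commit format required in step 5 can be checked mechanically. The sketch below is illustrative and assumes the usual conventional-commits type list; the reviewer's actual check (if any) is not shown in this diff:

```javascript
// Illustrative check for the conventional commit format
// `<type>(<scope>): <description>`. The scope is optional in the
// conventional-commits spec, and that is reflected here; the type list
// is the commonly used set, an assumption rather than wazir's own.
const COMMIT_RE =
  /^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)(\([\w.-]+\))?!?: .+/;

function isConventionalCommit(message) {
  // Only the subject line (first line) is validated.
  return COMMIT_RE.test(message.split('\n')[0]);
}
```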
@@ -110,6 +141,14 @@ This is NOT a review loop — it produces proof, not findings. If verification f
 - Use `wazir recall file <path> --tier L1` for files you need to understand but not modify
 - When dispatching subagents, include: "Use wazir index search-symbols before direct file reads."
 
+## Interaction Mode Awareness
+
+Read `interaction_mode` from run-config at the start of execution:
+
+- **`auto`:** Skip user checkpoints. On escalation, write reason to `.wazir/runs/<id>/escalations/` and STOP (do not proceed without user). Gating agent evaluates phase reports.
+- **`guided`:** Standard behavior — ask user on escalation, show per-task completion summaries.
+- **`interactive`:** Before implementing each task, briefly describe the approach and ask: "About to implement [task] using [approach] — sound right?" Show more detail in per-task summaries.
+
 ## Escalation
 
 Pause and ask the user when:
@@ -117,6 +156,34 @@ Pause and ask the user when:
 - Implementation would require unapproved scope change
 - A task's acceptance criteria can't be met
 
+When escalating, use this pattern:
+
+Ask the user via AskUserQuestion:
+- **Question:** "[Describe the specific blocker or conflict]"
+- **Options:**
+  1. "Adjust the plan to work around the blocker" *(Recommended)*
+  2. "Expand scope to handle the new requirement"
+  3. "Skip this task and continue with the rest"
+  4. "Abort the run"
+
+Wait for the user's selection before continuing.
+
+## Reasoning Output
+
+Throughout the executor phase, produce reasoning at two layers:
+
+**Conversation (Layer 1):** Before each task, explain what you're about to implement and why. After each task, state what would have gone wrong without this task.
+
+**File (Layer 2):** Write `.wazir/runs/<id>/reasoning/phase-executor-reasoning.md` with structured entries per implementation decision:
+- **Trigger** — what prompted the decision (e.g., "task spec requires auth middleware")
+- **Options considered** — implementation alternatives
+- **Chosen** — selected approach
+- **Reasoning** — why this approach over alternatives
+- **Confidence** — high/medium/low
+- **Counterfactual** — what would break without this decision
+
+Key executor reasoning moments: architecture choices, library selections, API design decisions, test strategy decisions, and any deviation from the plan.
+
 ## Done
 
 When all tasks are complete and verified:
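A structured reasoning entry of the kind described above could be rendered like this. The helper is hypothetical — only the six field names come from the skill text; serialization and file writing are left to the caller:

```javascript
// Hypothetical helper that renders one reasoning entry in the field order
// the skill specifies. Appending it to
// .wazir/runs/<id>/reasoning/phase-executor-reasoning.md is the caller's job.
function renderReasoningEntry({ trigger, options, chosen, reasoning, confidence, counterfactual }) {
  return [
    `- **Trigger** — ${trigger}`,
    `- **Options considered** — ${options.join(', ')}`,
    `- **Chosen** — ${chosen}`,
    `- **Reasoning** — ${reasoning}`,
    `- **Confidence** — ${confidence}`,
    `- **Counterfactual** — ${counterfactual}`,
  ].join('\n');
}
```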
@@ -45,7 +45,6 @@ Run `wazir init` (default: auto mode). This automatically:
    - `model_mode: "claude-only"` (override: `wazir config set model_mode multi-model`)
    - `default_depth: "standard"` (override per-run: `/wazir deep ...`)
    - `default_intent: "feature"` (inferred per-run from request text)
-   - `team_mode: "sequential"` (always)
 6. **Auto-exports** for the detected host
 
 **No questions asked.** The pipeline is ready to use immediately.
package/skills/reviewer/SKILL.md CHANGED
@@ -48,7 +48,7 @@ The reviewer operates in different modes depending on the phase. Mode MUST be pa
 | `final` | After execution + verification | Completed task artifacts, approved spec/plan/design | 7 final-review dims, scored 0-70 | Scored verdict (PASS/FAIL) |
 | `spec-challenge` | After specify | Draft spec artifact | 5 spec/clarification dims | Pass/fix loop, no score |
 | `design-review` | After design approval | Design artifact, approved spec | 5 design-review dims (canonical) | Pass/fix loop, no score |
-| `plan-review` | After planning | Draft plan artifact |
+| `plan-review` | After planning | Draft plan artifact | 8 plan dims (7 + input coverage) | Pass/fix loop, no score |
 | `task-review` | During execution, per task | Uncommitted changes or `--base` SHA | 5 task-execution dims (correctness, tests, wiring, drift, quality) | Pass/fix loop, no score |
 | `research-review` | During discover | Research artifact | 5 research dims | Pass/fix loop, no score |
 | `clarification-review` | During clarify | Clarification artifact | 5 spec/clarification dims | Pass/fix loop, no score |
@@ -88,24 +88,42 @@ If any file is missing:
 ### `task-review` mode
 1. Uncommitted changes exist for the current task, or a `--base` SHA is provided for committed changes.
 2. Read `.wazir/state/config.json` for depth and multi_tool settings.
+3. **Commit discipline check:** If uncommitted changes span work from multiple tasks (e.g., files from task N and task N+1 are both modified), REJECT immediately: "REJECTED: Multiple tasks in single commit. Split into per-task commits before review." This is a blocking finding — no other dimensions are evaluated until resolved.
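The commit-discipline rule above can be sketched in a few lines. This is a hypothetical illustration, not the real Wazir implementation: the `taskFiles` mapping (task id → declared file list) and the helper name are assumptions introduced for the example.

```javascript
// Hypothetical sketch: reject when the working tree mixes files from more
// than one task, per the commit-discipline check. The taskFiles shape is an
// assumption, not a real Wazir API.
function checkCommitDiscipline(changedFiles, taskFiles) {
  const touchedTasks = new Set();
  for (const file of changedFiles) {
    for (const [taskId, files] of Object.entries(taskFiles)) {
      if (files.includes(file)) touchedTasks.add(taskId);
    }
  }
  if (touchedTasks.size > 1) {
    return {
      verdict: "REJECTED",
      reason: `Multiple tasks in single commit: ${[...touchedTasks].join(", ")}. Split into per-task commits before review.`,
    };
  }
  return { verdict: "OK" };
}

// Files from task-1 and task-2 are both modified → blocking rejection.
const result = checkCommitDiscipline(
  ["src/a.js", "src/b.js"],
  { "task-1": ["src/a.js"], "task-2": ["src/b.js"] }
);
```

A single-task diff would pass the same check with `verdict: "OK"`.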
+4. **Security sensitivity check:** Run `detectSecurityPatterns` from `tooling/src/checks/security-sensitivity.js` against the diff. If `triggered === true`, add the 6 security review dimensions (injection, auth bypass, data exposure, CSRF/SSRF, XSS, secrets leakage) to the standard 5 task-execution dimensions for this review pass. Security findings use severity levels: critical (exploitable), high (likely exploitable), medium (defense-in-depth gap), low (best-practice deviation).
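The dimension-merging behavior could look like the following sketch. The real detector lives in `tooling/src/checks/security-sensitivity.js`; the stand-in below only assumes the contract stated above (it takes the diff text and returns an object with a `triggered` flag) and uses an invented keyword heuristic.

```javascript
// The 5 standard task-execution dims plus the 6 security dims named above.
const TASK_DIMS = ["correctness", "tests", "wiring", "drift", "quality"];
const SECURITY_DIMS = [
  "injection", "auth-bypass", "data-exposure", "csrf-ssrf", "xss", "secrets-leakage",
];

// Stand-in heuristic — the real check is tooling/src/checks/security-sensitivity.js.
function detectSecurityPatterns(diffText) {
  return { triggered: /\b(password|token|sql|exec|eval)\b/i.test(diffText) };
}

// Security-sensitive diffs get 11 dimensions instead of 5.
function dimensionsForReview(diffText) {
  const dims = [...TASK_DIMS];
  if (detectSecurityPatterns(diffText).triggered) dims.push(...SECURITY_DIMS);
  return dims;
}
```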

 ### `spec-challenge`, `design-review`, `plan-review`, `research-review`, `clarification-review` modes
 1. The appropriate input artifact for the mode exists.
 2. Read `.wazir/state/config.json` for depth and multi_tool settings.
+3. **`plan-review` additional dimension — Input Coverage:**
+   - Read the original input/briefing from `.wazir/input/briefing.md` and any `input/*.md` files
+   - Count distinct items/requirements in the input
+   - Count tasks in the execution plan
+   - If `tasks_in_plan < items_in_input` → **HIGH** finding: "Plan covers [N] of [M] input items. Missing: [list of uncovered items]"
+   - If `tasks_in_plan >= items_in_input` → dimension passes
+   - One task MAY cover multiple input items if justified in the task description
+   - This is the review-level enforcement of the "no scope reduction" rule
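The counting step can be sketched as follows. The parsing conventions are assumptions for illustration only: input items as markdown list entries, plan tasks as `### Task N` headings. A real pass would use the plan's actual structure.

```javascript
// Assumed convention: each markdown list entry in the briefing is one item.
function countInputItems(briefingMd) {
  return briefingMd.split("\n").filter((l) => /^\s*([-*]|\d+\.)\s+/.test(l)).length;
}

// Assumed convention: plan tasks appear as "### Task N" headings.
function countPlanTasks(planMd) {
  return planMd.split("\n").filter((l) => /^###\s+Task\s+\d+/i.test(l)).length;
}

// HIGH finding when the plan has fewer tasks than the input has items.
function inputCoverageFinding(briefingMd, planMd) {
  const items = countInputItems(briefingMd);
  const tasks = countPlanTasks(planMd);
  if (tasks < items) {
    return { severity: "HIGH", message: `Plan covers ${tasks} of ${items} input items.` };
  }
  return null; // dimension passes
}
```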

 ## Review Process (`final` mode)

+**Before starting this phase, output to the user:**
+
+> **Final Review** — About to run adversarial 7-dimension review comparing your implementation against the original input, not just the task specs. The executor's per-task reviewer already validated correctness per-task — this catches drift between what you asked for and what was actually built.
+>
+> **Why this matters:** Without this, implementation drift ships undetected. Per-task review confirms each task matches its spec, but cannot catch: tasks that collectively miss the original intent, scope creep that added unrequested features, or acceptance criteria that were rewritten to match implementation instead of input.
+>
+> **Looking for:** Logic errors, missing features, dead code, unsubstantiated "it works" claims, scope creep, security gaps, stale documentation
+
 **Input:** Read the ORIGINAL user input (`.wazir/input/briefing.md`, `input/` directory files) and compare against what was built. This catches intent drift that task-level review misses.

 Perform adversarial review across 7 dimensions:

-1. **Correctness** — Does the code do what the original input asked for?
-2. **Completeness** — Are all requirements from the original input met?
-3. **Wiring** — Are all paths connected end-to-end?
-4. **Verification** — Is there evidence (tests, type checks) for each claim?
-5. **Drift** — Does the implementation match what the user originally requested? (not just the plan — the INPUT)
-6. **Quality** — Code style, naming, error handling, security
-7. **Documentation** — Changelog entries, commit messages, comments
+1. **Correctness** — Does the code do what the original input asked for? *(catches: logic errors, wrong behavior, spec violations)*
+2. **Completeness** — Are all requirements from the original input met? *(catches: missing features, unimplemented acceptance criteria, partially delivered items)*
+3. **Wiring** — Are all paths connected end-to-end? *(catches: dead code, disconnected paths, missing imports, orphaned routes)*
+4. **Verification** — Is there evidence (tests, type checks) for each claim? *(catches: false claims of "it works" without evidence, untested code paths, missing type coverage)*
+5. **Drift** — Does the implementation match what the user originally requested? (not just the plan — the INPUT) *(catches: scope creep, plan deviations, unauthorized changes, gold-plating)*
+6. **Quality** — Code style, naming, error handling, security *(catches: security vulnerabilities, poor error handling, inconsistent naming, missing input validation)*
+7. **Documentation** — Changelog entries, commit messages, comments *(catches: missing changelogs, wrong commit messages, stale comments, undocumented breaking changes)*

 ## Context Retrieval

@@ -224,6 +242,24 @@ const recurring = getRecurringFindingHashes(db, 2);

 This is how Wazir evolves — findings that recur across runs become accepted learnings injected into future executor context, preventing the same mistakes.
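The recurrence idea behind `getRecurringFindingHashes(db, 2)` can be illustrated with a minimal sketch. The in-memory `db` shape and the `Sketch` suffix are assumptions; the real helper reads Wazir's findings store rather than a plain object.

```javascript
// A finding whose hash appears in at least minRuns distinct runs is
// considered recurring and eligible for promotion to a proposed learning.
function getRecurringFindingHashesSketch(db, minRuns) {
  const runsByHash = new Map();
  for (const { runId, hash } of db.findings) {
    if (!runsByHash.has(hash)) runsByHash.set(hash, new Set());
    runsByHash.get(hash).add(runId);
  }
  return [...runsByHash.entries()]
    .filter(([, runs]) => runs.size >= minRuns)
    .map(([hash]) => hash);
}

const db = {
  findings: [
    { runId: "run-1", hash: "missing-error-handler" },
    { runId: "run-2", hash: "missing-error-handler" },
    { runId: "run-2", hash: "one-off-typo" },
  ],
};
// Only the finding seen in two distinct runs recurs.
const recurring = getRecurringFindingHashesSketch(db, 2);
```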

+## Interaction Mode Awareness
+
+Read `interaction_mode` from run-config:
+
+- **`auto`:** No user checkpoints. Present verdict and let gating agent decide. On escalation, write reason and STOP.
+- **`guided`:** Standard behavior — present verdict, ask user how to proceed.
+- **`interactive`:** Discuss findings with user: "I found a potential auth bypass in `src/auth.js:42` — here's why I rated it high severity. Do you agree, or is there context I'm missing?" Show detailed reasoning for each dimension score.
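The three-way dispatch above could be sketched like this. The handler shape is hypothetical; only the mode names and behaviors come from the text.

```javascript
// Map interaction_mode (from run-config) to the reviewer's next step.
function nextStepForMode(interactionMode) {
  switch (interactionMode) {
    case "auto":
      // No user checkpoints: present verdict, gating agent decides.
      return { action: "present-verdict", askUser: false };
    case "guided":
      // Standard behavior: present verdict, ask the user how to proceed.
      return { action: "present-verdict", askUser: true };
    case "interactive":
      // Discuss each finding and its severity reasoning with the user.
      return { action: "discuss-findings", askUser: true };
    default:
      throw new Error(`Unknown interaction_mode: ${interactionMode}`);
  }
}
```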
+
+## CLI/Context-Mode Enforcement
+
+In ALL review modes, check for these violations:
+
+1. **Index usage enforcement:** If the agent performed >5 direct file reads (Read tool) without a preceding `wazir index search-symbols` query, flag as **[warning]** finding: "Agent performed [N] direct file reads without using wazir index. Use `wazir index search-symbols <query>` before reading files to reduce context consumption."
+
+2. **Context-mode enforcement:** If the agent ran a large-category command (test runners, builds, diffs, dependency trees, linting — as classified by `hooks/routing-matrix.json`) using native Bash instead of context-mode tools (when context-mode is available), flag as **[warning]** finding: "Large command `[cmd]` run without context-mode. Route through `mcp__plugin_context-mode_context-mode__execute` to reduce context usage."
+
+These are warnings, not blocking findings — they improve efficiency but don't affect correctness.
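The index-usage check can be sketched as a simple counter over the tool log. The log format here (entries with `tool` and `command` fields) is an assumed simplification of whatever transcript the reviewer actually inspects.

```javascript
// Warn when more than 5 Read calls occur without a preceding
// `wazir index search-symbols` query resetting the counter.
function indexUsageWarning(toolLog) {
  let readsSinceIndexQuery = 0;
  for (const entry of toolLog) {
    if (entry.tool === "Bash" && /wazir index search-symbols/.test(entry.command ?? "")) {
      readsSinceIndexQuery = 0;
    } else if (entry.tool === "Read") {
      readsSinceIndexQuery += 1;
    }
  }
  if (readsSinceIndexQuery > 5) {
    return {
      severity: "warning",
      message: `Agent performed ${readsSinceIndexQuery} direct file reads without using wazir index.`,
    };
  }
  return null; // no violation
}
```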
+
 ## Task-Review Log Filenames

 In `task-review` mode, use task-scoped log filenames and cap tracking:
@@ -238,6 +274,13 @@ Save review results to `.wazir/runs/latest/reviews/review.md` with:
 - Score breakdown
 - Verdict

+Run the phase report and display it to the user:
+```bash
+wazir report phase --run <run-id> --phase <review-mode>
+```
+
+Output the report content to the user in the conversation.
+
 ## Phase Report Generation

 After completing any review pass, generate a phase report following `schemas/phase-report.schema.json`:
@@ -406,8 +449,34 @@ Write to `.wazir/runs/<run-id>/handoff.md`:
 - Do NOT mutate `input/` — it belongs to the user
 - Do NOT auto-load proposed learnings into the next run

+## Reasoning Output
+
+Throughout the reviewer phase, produce reasoning at two layers:
+
+**Conversation (Layer 1):** Before each review pass, explain which dimensions are being checked and why. After findings, explain the reasoning behind severity assignments.
+
+**File (Layer 2):** Write `.wazir/runs/<id>/reasoning/phase-reviewer-reasoning.md` with structured entries:
+- **Trigger** — what prompted the finding (e.g., "diff adds SQL query without parameterization")
+- **Options considered** — severity options, fix approaches
+- **Chosen** — assigned severity and recommendation
+- **Reasoning** — why this severity level
+- **Confidence** — high/medium/low
+- **Counterfactual** — what would ship if this finding were missed
+
+Key reviewer reasoning moments: severity assignments, PASS/FAIL decisions, dimension score justifications, and escalation decisions.
+
 ## Done

+**After completing this phase, output to the user:**
+
+> **Final Review complete.**
+>
+> **Found:** [N] findings across 7 dimensions — [N] blocking, [N] warnings, [N] notes. Score: [score]/70 ([VERDICT]).
+>
+> **Without this phase:** [N] blocking issues would have shipped — including [specific examples: e.g., "missing error handler on /api/users endpoint", "auth middleware not wired to 3 routes", "CHANGELOG missing entry for breaking API change"]
+>
+> **Changed because of this work:** [List of issues caught and fixed during review passes, score improvement from first to final pass]
+
 Present the verdict and offer next steps:

 > **Review complete: [VERDICT] ([score]/70)**
@@ -416,8 +485,12 @@ Present the verdict and offer next steps:
 >
 > **Learnings proposed:** [count] (see `memory/learnings/proposed/`)
 > **Handoff:** `.wazir/runs/<run-id>/handoff.md`
-
-
-
-
-
+
+Ask the user via AskUserQuestion:
+- **Question:** "How would you like to proceed with the review results?"
+- **Options:**
+  1. "Create a PR" *(Recommended if PASS)*
+  2. "Auto-fix and re-review" *(Recommended if MINOR FIXES)*
+  3. "Review findings in detail"
+
+Wait for the user's selection before continuing.
@@ -185,6 +185,26 @@ Beyond CLI checks, inspect for:
 - Run `wazir export --check`
 - Any drift detected is a finding

+11. **Input Coverage** (run-scoped — only when a run directory exists)
+    - Read the original input file(s) from `.wazir/input/` or `.wazir/runs/<id>/sources/`
+    - Read the execution plan from `.wazir/runs/<id>/clarified/execution-plan.md`
+    - Read the actual commits on the branch: `git log --oneline main..HEAD`
+    - Build a coverage matrix: every distinct item in the input should map to:
+      - At least one task in the execution plan
+      - At least one commit in the git log
+    - **Missing items** (in input but not in plan AND not in commits) → **HIGH** severity finding
+    - **Partial items** (in plan but no corresponding commit) → **MEDIUM** severity finding
+    - **Fully covered items** (input → plan → commit) → pass
+    - Output the coverage matrix in the audit report:
+      ```
+      | Input Item | Plan Task | Commit | Status  |
+      |------------|-----------|--------|---------|
+      | Item 1     | Task 3    | abc123 | PASS    |
+      | Item 2     | Task 5    | —      | PARTIAL |
+      | Item 3     | —         | —      | MISSING |
+      ```
+    - This dimension catches scope reduction AFTER the fact — a safety net for when the clarifier or planner fails
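The matrix construction described above can be sketched as a small function. The item/task/commit shapes and the `covers` predicate are illustrative assumptions; a real pass would match input items against plan tasks and `git log` output however the audit actually parses them.

```javascript
// Build the input → plan → commit coverage matrix. `covers(item, text)` is a
// caller-supplied predicate deciding whether a task/commit covers an item.
function buildCoverageMatrix(inputItems, planTasks, commits, covers) {
  return inputItems.map((item) => {
    const task = planTasks.find((t) => covers(item, t)) ?? null;
    const commit = commits.find((c) => covers(item, c)) ?? null;
    let status;
    if (task && commit) status = "PASS";            // fully covered
    else if (!task && !commit) status = "MISSING";  // HIGH finding
    else status = "PARTIAL";                        // MEDIUM finding
    return { item, task, commit, status };
  });
}

// Example with a naive keyword predicate.
const matrix = buildCoverageMatrix(
  ["login", "logout", "signup"],
  ["Task 3: login form", "Task 5: logout button"],
  ["abc123 add login form"],
  (item, text) => text.includes(item)
);
```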
+
 ## Protected-Path Safety Rails

 Before applying ANY fix in Phase 3, check if the target file is in a protected path. The self-audit loop MUST NOT modify files in:
|