@wazir-dev/cli 1.2.0 → 1.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (46)
  1. package/CHANGELOG.md +39 -44
  2. package/README.md +13 -13
  3. package/assets/demo.cast +47 -0
  4. package/assets/demo.gif +0 -0
  5. package/docs/anti-patterns/AP-23-skipping-enabled-workflows.md +28 -0
  6. package/docs/anti-patterns/AP-24-clarifier-deciding-scope.md +34 -0
  7. package/docs/concepts/architecture.md +1 -1
  8. package/docs/concepts/why-wazir.md +1 -1
  9. package/docs/readmes/INDEX.md +1 -1
  10. package/docs/readmes/features/expertise/README.md +1 -1
  11. package/docs/readmes/features/hooks/pre-compact-summary.md +1 -1
  12. package/docs/reference/hooks.md +1 -0
  13. package/docs/reference/launch-checklist.md +3 -3
  14. package/docs/reference/review-loop-pattern.md +3 -2
  15. package/docs/reference/skill-tiers.md +2 -2
  16. package/expertise/antipatterns/process/ai-coding-antipatterns.md +117 -0
  17. package/exports/hosts/claude/.claude/commands/plan-review.md +3 -1
  18. package/exports/hosts/claude/.claude/commands/verify.md +30 -1
  19. package/exports/hosts/claude/export.manifest.json +2 -2
  20. package/exports/hosts/codex/export.manifest.json +2 -2
  21. package/exports/hosts/cursor/export.manifest.json +2 -2
  22. package/exports/hosts/gemini/export.manifest.json +2 -2
  23. package/llms-full.txt +48 -18
  24. package/package.json +2 -3
  25. package/schemas/phase-report.schema.json +9 -0
  26. package/skills/brainstorming/SKILL.md +14 -2
  27. package/skills/clarifier/SKILL.md +189 -35
  28. package/skills/executor/SKILL.md +67 -0
  29. package/skills/init-pipeline/SKILL.md +0 -1
  30. package/skills/reviewer/SKILL.md +86 -13
  31. package/skills/self-audit/SKILL.md +20 -0
  32. package/skills/skill-research/SKILL.md +188 -0
  33. package/skills/verification/SKILL.md +41 -3
  34. package/skills/wazir/SKILL.md +304 -38
  35. package/tooling/src/capture/command.js +17 -1
  36. package/tooling/src/capture/store.js +32 -0
  37. package/tooling/src/capture/user-input.js +66 -0
  38. package/tooling/src/checks/security-sensitivity.js +69 -0
  39. package/tooling/src/cli.js +28 -26
  40. package/tooling/src/guards/phase-prerequisite-guard.js +58 -0
  41. package/tooling/src/init/auto-detect.js +0 -2
  42. package/tooling/src/init/command.js +3 -95
  43. package/tooling/src/status/command.js +6 -1
  44. package/tooling/src/verify/proof-collector.js +299 -0
  45. package/workflows/plan-review.md +3 -1
  46. package/workflows/verify.md +30 -1
@@ -52,6 +52,14 @@ Read `context_mode` from `.wazir/state/config.json`:
 
  ## Sub-Workflow 1: Research (discover workflow)
 
+ **Before starting this phase, output to the user:**
+
+ > **Research** — About to scan the codebase and fetch external references to understand the existing architecture, tech stack, and any standards referenced in the briefing.
+ >
+ > **Why this matters:** Without research, I'd assume the wrong framework version, miss existing patterns in the codebase, and contradict established conventions. Every wrong assumption here cascades into a wrong spec and wrong implementation.
+ >
+ > **Looking for:** Existing code patterns, dependency versions, external standard definitions, architectural constraints
+
  Delegate to the discover workflow (`workflows/discover.md`):
 
  1. **Keyword extraction:** Read the briefing and extract concepts/terms that are vague, reference external standards, or use unfamiliar terminology.
@@ -68,22 +76,43 @@ Delegate to the discover workflow (`workflows/discover.md`):
 
  Save result to `.wazir/runs/latest/clarified/research-brief.md`.
 
+ **After completing this phase, output to the user:**
+
+ > **Research complete.**
+ >
+ > **Found:** [N] external sources fetched, [N] codebase patterns identified, [N] architectural constraints documented
+ >
+ > **Without this phase:** Spec would be built on assumptions instead of evidence — wrong framework APIs, missed existing utilities, contradicted naming conventions
+ >
+ > **Changed because of this work:** [List of key discoveries — e.g., "found existing auth middleware at src/middleware/auth.ts", "project uses Vitest not Jest"]
+
  ### Checkpoint: Research Review
 
  > **Research complete. Here's what I found:**
  >
  > [Summary of codebase state, relevant architecture, external context]
- >
- > 1. **Looks good, continue** (Recommended)
- > 2. **Missing context** — let me add more information
- > 3. **Wrong direction** — let me clarify the intent
 
- **Wait for user response before continuing.**
+ Ask the user via AskUserQuestion:
+ - **Question:** "Does the research look complete and accurate?"
+ - **Options:**
+ 1. "Looks good, continue" *(Recommended)*
+ 2. "Missing context — let me add more information"
+ 3. "Wrong direction — let me clarify the intent"
+
+ Wait for the user's selection before continuing.
 
  ---
 
  ## Sub-Workflow 2: Clarify (clarify workflow)
 
+ **Before starting this phase, output to the user:**
+
+ > **Clarification** — About to transform the briefing and research into a precise scope document with explicit constraints, assumptions, and boundaries.
+ >
+ > **Why this matters:** Without explicit clarification, "add user auth" could mean OAuth, magic links, or username/password. Every ambiguity left here becomes a 50/50 coin flip during implementation that could produce the wrong feature.
+ >
+ > **Looking for:** Ambiguous requirements, implicit assumptions, missing constraints, scope boundaries, unresolved questions
+
  ### Input Preservation (before producing clarification)
 
  1. Glob `.wazir/input/tasks/*.md`. If files exist:
@@ -93,37 +122,88 @@ Save result to `.wazir/runs/latest/clarified/research-brief.md`.
  - Every API endpoint, color hex code, and UI dimension from input must appear in the relevant item section.
  2. If `.wazir/input/tasks/` is empty or missing, synthesize from `briefing.md` alone.
 
+ ### Informed Question Batching (after research, before producing clarification)
+
+ Research has completed. You now have codebase context and external findings. Before producing the clarification, ask the user INFORMED questions — informed by the research, not guesses.
+
+ **Rules:**
+ 1. **Research runs FIRST, questions come AFTER.** Never ask questions before research completes.
+ 2. **Batch questions:** 1-3 batches of 3-7 questions each. Never one-at-a-time.
+ 3. **Every scope exclusion must be explicitly confirmed by the user.** You MUST NOT decide that something is "out of scope" without asking. If the input doesn't mention docs, ask: "The input doesn't mention documentation — should we include API docs, or is that explicitly out of scope?" Do NOT assume.
+ 4. **If the input is clear and complete:** Zero questions is fine. State: "Input is clear and specific. No ambiguities detected. Proceeding with clarification."
+ 5. **In auto mode (`interaction_mode: auto`):** Questions go to the gating agent, not the user.
+ 6. **In interactive mode (`interaction_mode: interactive`):** More detailed questions, present research findings that informed each question.
+
+ **Question format:**
+ ```
+ Based on research, I have [N] questions before proceeding:
+
+ **Scope & Intent**
+ 1. [Question informed by research finding]
+ 2. [Question about ambiguous requirement]
+
+ **Technical Decisions**
+ 3. [Question about architecture choice discovered during research]
+ 4. [Question about dependency/framework preference]
+
+ **Boundaries**
+ 5. [Explicit scope boundary question — "Should X be included or excluded?"]
+ ```
+
+ Ask via AskUserQuestion with the full batch. Wait for answers. If answers introduce new ambiguity, ask a follow-up batch (max 3 batches total).
+
  ### Clarification Production
 
- Read the briefing, research brief, and codebase context. Produce:
+ Read the briefing, research brief, user answers to questions, and codebase context. Produce:
 
  - **What** we're building — concrete deliverables
  - **Why** — the motivation and business value
  - **Constraints** — technical, timeline, dependencies
- - **Assumptions** — what we're taking as given
- - **Scope boundaries** — what's IN and what's explicitly OUT
- - **Unresolved questions** — anything ambiguous
+ - **Assumptions** — what we're taking as given (each explicitly confirmed by user or clearly stated in input)
+ - **Scope boundaries** — what's IN and what's explicitly OUT (every exclusion must reference the user's confirmation: "Out of scope per user confirmation in question batch 1, Q5")
+ - **Unresolved questions** — anything still ambiguous after question batches
 
  Save to `.wazir/runs/latest/clarified/clarification.md`.
 
  Invoke `wz:reviewer --mode clarification-review`. Resolve findings before presenting to user.
 
+ **After completing this phase, output to the user:**
+
+ > **Clarification complete.**
+ >
+ > **Found:** [N] ambiguities resolved, [N] assumptions documented, [N] scope boundaries defined, [N] items explicitly marked out-of-scope
+ >
+ > **Without this phase:** Implementation would proceed with hidden assumptions, scope would creep mid-build, and acceptance criteria would be vague enough to pass any implementation
+ >
+ > **Changed because of this work:** [List of resolved ambiguities — e.g., "clarified auth means OAuth2 with Google provider only", "out-of-scope: mobile responsive for v1"]
+
  ### Checkpoint: Clarification Review
 
  > **Here's the clarified scope:**
  >
  > [Full clarification]
- >
- > 1. **Approved — continue to spec hardening** (Recommended)
- > 2. **Needs changes** — [user provides corrections]
- > 3. **Missing important context** — [user adds information]
 
- **Wait for user response.** Route feedback: plan corrections → `user-feedback.md`, new requirements → `briefing.md`.
+ Ask the user via AskUserQuestion:
+ - **Question:** "Does the clarified scope accurately capture what you want to build?"
+ - **Options:**
+ 1. "Approved — continue to spec hardening" *(Recommended)*
+ 2. "Needs changes — let me provide corrections"
+ 3. "Missing important context — let me add information"
+
+ Wait for the user's selection before continuing. Route feedback: plan corrections → `user-feedback.md`, new requirements → `briefing.md`.
 
  ---
 
  ## Sub-Workflow 3: Spec Harden (specify + spec-challenge workflows)
 
+ **Before starting this phase, output to the user:**
+
+ > **Spec Hardening** — About to convert the clarified scope into a measurable, testable specification and then run adversarial spec-challenge review to find gaps.
+ >
+ > **Why this matters:** Without hardening, acceptance criteria stay vague ("it should work well") instead of measurable ("response time under 200ms for 95th percentile"). Vague specs pass any implementation, making review meaningless.
+ >
+ > **Looking for:** Untestable criteria, missing error handling specs, undefined edge cases, performance requirements, security constraints
+
  Delegate to the specify workflow (`workflows/specify.md`):
 
  1. The **specifier role** produces a measurable spec from clarification + research.
@@ -132,6 +212,16 @@ Delegate to the specify workflow (`workflows/specify.md`):
 
  Save result to `.wazir/runs/latest/clarified/spec-hardened.md`.
 
+ **After completing this phase, output to the user:**
+
+ > **Spec Hardening complete.**
+ >
+ > **Found:** [N] acceptance criteria tightened, [N] edge cases added, [N] error handling requirements specified, [N] spec-challenge findings resolved
+ >
+ > **Without this phase:** Acceptance criteria would be subjective, review would have no concrete standard to measure against, and "done" would mean whatever the implementer decided
+ >
+ > **Changed because of this work:** [List of hardening changes — e.g., "added 404 handling spec for missing resources", "specified max payload size of 5MB", "added rate limit requirement of 100 req/min"]
+
  ### Content-Author Detection
 
  After spec hardening, scan the spec for content needs. Auto-enable the `author` workflow if the spec mentions any of:
@@ -151,17 +241,28 @@ If detected, set `workflow_policy.author.enabled = true` in the run config and n
  > **Spec hardened. Changes made:**
  >
  > [List of gaps found and how they were tightened]
- >
- > 1. **Approved — continue to brainstorming** (Recommended)
- > 2. **Disagree with a change** — [user specifies]
- > 3. **Found more gaps** — [user adds]
 
- **Wait for user response.**
+ Ask the user via AskUserQuestion:
+ - **Question:** "Are the spec hardening changes acceptable?"
+ - **Options:**
+ 1. "Approved — continue to brainstorming" *(Recommended)*
+ 2. "Disagree with a change — let me specify"
+ 3. "Found more gaps — let me add"
+
+ Wait for the user's selection before continuing.
 
  ---
 
  ## Sub-Workflow 4: Brainstorm (design + design-review workflows)
 
+ **Before starting this phase, output to the user:**
+
+ > **Brainstorming** — About to propose 2-3 design approaches with explicit trade-offs, then run design-review on the approved choice.
+ >
+ > **Why this matters:** Without exploring alternatives, the first approach that comes to mind gets built — even if a simpler, more maintainable, or more performant option exists. This is where architectural mistakes get caught cheaply instead of discovered during implementation.
+ >
+ > **Looking for:** Architectural trade-offs, scalability implications, complexity vs. simplicity, alignment with existing codebase patterns
+
  Invoke the `brainstorming` skill (`wz:brainstorming`):
 
  1. Propose 2-3 viable approaches with explicit trade-offs
@@ -170,42 +271,74 @@ Invoke the `brainstorming` skill (`wz:brainstorming`):
 
  ### Checkpoint: Design Approval
 
- > **Which approach should we implement?**
- >
- > 1. **Approach A** — [one-line summary] (Recommended)
- > 2. **Approach B** — [one-line summary]
- > 3. **Approach C** — [one-line summary]
- > 4. **Modify an approach** — [user specifies changes]
+ Ask the user via AskUserQuestion:
+ - **Question:** "Which design approach should we implement?"
+ - **Options:**
+ 1. "Approach A — [one-line summary]" *(Recommended)*
+ 2. "Approach B — [one-line summary]"
+ 3. "Approach C — [one-line summary]"
+ 4. "Modify an approach — let me specify changes"
 
- **Wait for user response.** This is the most important checkpoint.
+ Wait for the user's selection before continuing. This is the most important checkpoint.
 
  Save approved design to `.wazir/runs/latest/clarified/design.md`.
 
+ **After completing this phase, output to the user:**
+
+ > **Brainstorming complete.**
+ >
+ > **Found:** [N] approaches evaluated, [N] trade-offs documented, [N] design-review findings resolved
+ >
+ > **Without this phase:** The first viable approach would be built without considering alternatives — potentially choosing a complex solution when a simple one exists, or an approach that conflicts with existing patterns
+ >
+ > **Changed because of this work:** [Selected approach and why, rejected alternatives and why, design-review adjustments made]
+
  After approval: design-review loop with `--mode design-review` (5 canonical dimensions: spec coverage, design-spec consistency, accessibility, visual consistency, exported-code fidelity).
 
  ---
 
  ## Sub-Workflow 5: Plan (plan + plan-review workflows)
 
+ **Before starting this phase, output to the user:**
+
+ > **Planning** — About to break the approved design into ordered, dependency-aware implementation tasks with a gap analysis against the original input.
+ >
+ > **Why this matters:** Without explicit planning, tasks get implemented in the wrong order (breaking dependencies), items from the input get silently dropped, and task granularity is either too coarse (monolithic changes that are hard to review) or too fine (overhead without value).
+ >
+ > **Looking for:** Correct dependency ordering, complete input coverage, appropriate task granularity, clear acceptance criteria per task
+
  Delegate to `wz:writing-plans`:
 
  1. Planner produces a SINGLE execution plan at `.wazir/runs/latest/clarified/execution-plan.md` in spec-kit format.
  2. **Gap analysis exit gate:** Compare original input against plan. Invoke `wz:reviewer --mode plan-review`.
  3. Loop until clean or cap reached.
 
+ **After completing this phase, output to the user:**
+
+ > **Planning complete.**
+ >
+ > **Found:** [N] tasks created, [N] dependencies mapped, [N] plan-review findings resolved, [N] gap analysis items addressed
+ >
+ > **Without this phase:** Tasks would be implemented in ad-hoc order breaking dependencies, input items would be silently dropped, and task sizes would vary wildly making review inconsistent
+ >
+ > **Changed because of this work:** [Task count, dependency chain summary, any items reordered or split during plan-review]
+
  ### Checkpoint: Plan Review
 
  > **Implementation plan: [N] tasks**
  >
  > | # | Task | Complexity | Dependencies | Description |
  > |---|------|-----------|--------------|-------------|
- >
- > 1. **Approved — ready for execution** (Recommended)
- > 2. **Reorder or split tasks**
- > 3. **Missing tasks**
- > 4. **Too granular / too coarse**
 
- **Wait for user response.**
+ Ask the user via AskUserQuestion:
+ - **Question:** "Does the implementation plan look correct and complete?"
+ - **Options:**
+ 1. "Approved — ready for execution" *(Recommended)*
+ 2. "Reorder or split tasks"
+ 3. "Missing tasks"
+ 4. "Too granular / too coarse"
+
+ Wait for the user's selection before continuing.
 
  ---
 
@@ -220,9 +353,12 @@ Before presenting the plan to the user, verify ALL input items are covered:
  > **Scope reduction detected.** The input contains [N] items but the plan only covers [M].
  >
  > Missing items: [list]
- >
- > 1. **Add missing items to the plan** (Required)
- > 2. **User explicitly approves reduced scope** only if user confirms
+
+ Ask the user via AskUserQuestion:
+ - **Question:** "The plan is missing [N-M] items from your input. How should we proceed?"
+ - **Options:**
+ 1. "Add missing items to the plan" *(Recommended)*
+ 2. "Approve reduced scope — I confirm these items can be dropped"
 
  **The clarifier MUST NOT autonomously drop items into "future tiers", "deferred", or "out of scope" without explicit user approval. This is a hard rule.**
 
@@ -230,6 +366,24 @@ Invariant: `items_in_plan >= items_in_input` unless user explicitly approves red
 
  ---
 
+ ## Reasoning Output
+
+ Throughout the clarifier phase, produce reasoning at two layers:
+
+ **Conversation (Layer 1):** Before each sub-workflow, explain the trigger and why it matters. After each sub-workflow, state what was found and the counterfactual — what would have gone wrong without it.
+
+ **File (Layer 2):** Write `.wazir/runs/<id>/reasoning/phase-clarifier-reasoning.md` with structured entries per decision:
+ - **Trigger** — what prompted the decision
+ - **Options considered** — alternatives evaluated
+ - **Chosen** — selected option
+ - **Reasoning** — why
+ - **Confidence** — high/medium/low
+ - **Counterfactual** — what would go wrong without this info
+
+ Examples of clarifier reasoning entries:
+ - "Trigger: input says 'auth' without specifying provider. Options: ask user, assume OAuth2, assume magic links. Chosen: ask user. Counterfactual: assuming OAuth2 when user wanted Supabase auth = wrong middleware, 2 days rework."
+ - "Trigger: 13 items in input. Options: plan all 13, tier into must/should/could. Chosen: plan all 13 (user explicitly said 'do not tier'). Counterfactual: tiering would silently drop 5 items."
+
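The entry structure above can be sketched as a small formatter. This is a hypothetical helper, not part of the published tooling; only the field names follow the bullet list:

```javascript
// Hypothetical helper: formats one structured reasoning entry for
// .wazir/runs/<id>/reasoning/phase-clarifier-reasoning.md.
// The heading style and field order here are illustrative assumptions.
function formatReasoningEntry(entry) {
  return [
    `### Trigger: ${entry.trigger}`,
    `- **Options considered:** ${entry.options.join('; ')}`,
    `- **Chosen:** ${entry.chosen}`,
    `- **Reasoning:** ${entry.reasoning}`,
    `- **Confidence:** ${entry.confidence}`, // high | medium | low
    `- **Counterfactual:** ${entry.counterfactual}`,
    '',
  ].join('\n');
}

const block = formatReasoningEntry({
  trigger: "input says 'auth' without specifying provider",
  options: ['ask user', 'assume OAuth2', 'assume magic links'],
  chosen: 'ask user',
  reasoning: 'research found no existing auth dependency to infer from',
  confidence: 'high',
  counterfactual: 'assuming OAuth2 when user wanted Supabase auth = rework',
});
// The clarifier would append `block` to the reasoning file for the run.
```

The fixed field list keeps Layer 2 entries grep-able across runs, which is the point of the file layer.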
  ## Done
 
  When the plan is approved:
@@ -61,12 +61,29 @@ Run these checks before implementing:
 
  If either fails, surface the failure and do NOT proceed until resolved.
 
+ > **Output to the user** before execution begins:
+ > Each task is implemented with TDD (test first, then code) and reviewed before commit. This catches correctness bugs, missing tests, wiring errors, and spec drift at the task level — before they compound across tasks and become expensive to fix.
+
+ ## Security Awareness
+
+ Before implementing each task, check if the task touches security-sensitive areas. Run `detectSecurityPatterns` (from `tooling/src/checks/security-sensitivity.js`) mentally against the planned changes. If security patterns are detected (auth, token, password, session, SQL, fetch, upload, secret, env, API key, cookie, CORS, CSRF, JWT, OAuth, encrypt, decrypt, hash, salt):
+
+ - Load security expertise from the composition map for the relevant concern
+ - Apply defense-in-depth: validate inputs, parameterize queries, escape outputs, use secure defaults
+ - The per-task reviewer will automatically add security dimensions when patterns are detected — expect and address security findings
+
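The check amounts to a case-insensitive keyword scan over the patterns listed above. A minimal sketch of what `detectSecurityPatterns` might look like; the actual `security-sensitivity.js` may differ, and only the `triggered` flag is documented in this diff (the `matches` field here is illustrative):

```javascript
// Sketch of a keyword-based security sensitivity check. The real
// tooling/src/checks/security-sensitivity.js may use different patterns
// or matching rules; `matches` is an assumed convenience field.
const SECURITY_PATTERNS = [
  'auth', 'token', 'password', 'session', 'sql', 'fetch', 'upload',
  'secret', 'env', 'api key', 'cookie', 'cors', 'csrf', 'jwt',
  'oauth', 'encrypt', 'decrypt', 'hash', 'salt',
];

function detectSecurityPatterns(text) {
  const haystack = text.toLowerCase();
  // Substring match, so "OAuth" also trips the broader "auth" pattern.
  const matches = SECURITY_PATTERNS.filter((p) => haystack.includes(p));
  return { triggered: matches.length > 0, matches };
}
```

A task like "add JWT validation to the login route" trips the check on `jwt`, while a pure copy change such as "fix typo in README" does not.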
  ## Execute (execute workflow)
 
  Implement tasks in the order defined by the execution plan.
 
  For each task:
 
+ **Before starting each task, output to the user:**
+
+ > **Implementing Task [NNN]: [task title]** — This enables [what downstream tasks or user-facing features depend on this task].
+ >
+ > **Looking for:** [Key technical concerns for this specific task — e.g., "correct API contract", "database migration safety", "backwards compatibility"]
+
  1. **Read** the task from the execution plan
  2. **Implement** using TDD (write test first, make it pass, refactor)
  3. **Verify locally** — run tests, type checks, linting as appropriate
@@ -82,15 +99,29 @@ For each task:
  - See `docs/reference/review-loop-pattern.md` for full protocol
  - NOTE: this is the per-task review (5 dims), not the final scored review (7 dims) which runs in Phase 4
  5. **Commit** — only after review passes, commit with conventional commit format: `<type>(<scope>): <description>`
+ - **HARD RULE: One task = one commit.** Commit after EACH task completes its review. Never batch multiple tasks into a single commit. If the reviewer detects multi-task batching, the commit is REJECTED.
  6. **CHANGELOG** — if user-facing change, update `CHANGELOG.md` under `[Unreleased]` using keepachangelog types: Added, Changed, Fixed, Removed, Deprecated, Security.
  7. **Record** evidence at `.wazir/runs/latest/artifacts/task-NNN/`
 
+ **After completing each task, output to the user:**
+
+ > **Completed Task [NNN]: [task title].**
+ >
+ > **Changed:** [List of files created/modified, tests added, key implementation decisions]
+ >
+ > **Without this task:** [Concrete risk — e.g., "no auth middleware means all routes are publicly accessible", "no migration means schema change would require manual DB intervention"]
+ >
+ > **Review result:** [N] findings in [N] review passes, [N] fixed before commit
+
  Review loops follow `docs/reference/review-loop-pattern.md`. Code review scoping: review uncommitted changes before commit. If changes are committed, use `--base <pre-task-sha>`.
 
  Tasks always run sequentially.
 
  **Standalone mode:** When no `.wazir/runs/latest/` exists, review logs go to `docs/plans/`.
 
+ > **Output to the user** before verification:
+ > Verification produces deterministic proof — actual command output, not claims. It confirms that tests pass, types check, linters are clean, and every acceptance criterion has evidence. This is the evidence gate that separates "I think it works" from "here is proof it works."
+
  ## Verify (verify workflow)
 
  After all tasks are complete, run deterministic verification:
@@ -110,6 +141,14 @@ This is NOT a review loop — it produces proof, not findings. If verification f
  - Use `wazir recall file <path> --tier L1` for files you need to understand but not modify
  - When dispatching subagents, include: "Use wazir index search-symbols before direct file reads."
 
+ ## Interaction Mode Awareness
+
+ Read `interaction_mode` from run-config at the start of execution:
+
+ - **`auto`:** Skip user checkpoints. On escalation, write reason to `.wazir/runs/<id>/escalations/` and STOP (do not proceed without user). Gating agent evaluates phase reports.
+ - **`guided`:** Standard behavior — ask user on escalation, show per-task completion summaries.
+ - **`interactive`:** Before implementing each task, briefly describe the approach and ask: "About to implement [task] using [approach] — sound right?" Show more detail in per-task summaries.
+
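The three modes collapse into a couple of behavior flags at startup. A hypothetical sketch (the function, field names, and the `guided` default are assumptions, not the published run-config contract):

```javascript
// Hypothetical sketch: resolve interaction_mode into executor behavior flags.
// `runConfig` is the parsed run-config object; default is an assumption.
function resolveInteractionMode(runConfig) {
  const mode = runConfig.interaction_mode ?? 'guided'; // assumed default
  return {
    mode,
    skipUserCheckpoints: mode === 'auto',          // escalations go to file, then STOP
    confirmBeforeEachTask: mode === 'interactive', // "sound right?" prompt per task
  };
}
```

Resolving the mode once up front keeps the per-task loop free of repeated config reads and branching.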
 
  ## Escalation
 
@@ -117,6 +156,34 @@ Pause and ask the user when:
  - Implementation would require unapproved scope change
  - A task's acceptance criteria can't be met
 
+ When escalating, use this pattern:
+
+ Ask the user via AskUserQuestion:
+ - **Question:** "[Describe the specific blocker or conflict]"
+ - **Options:**
+ 1. "Adjust the plan to work around the blocker" *(Recommended)*
+ 2. "Expand scope to handle the new requirement"
+ 3. "Skip this task and continue with the rest"
+ 4. "Abort the run"
+
+ Wait for the user's selection before continuing.
+
+ ## Reasoning Output
+
+ Throughout the executor phase, produce reasoning at two layers:
+
+ **Conversation (Layer 1):** Before each task, explain what you're about to implement and why. After each task, state what would have gone wrong without this task.
+
+ **File (Layer 2):** Write `.wazir/runs/<id>/reasoning/phase-executor-reasoning.md` with structured entries per implementation decision:
+ - **Trigger** — what prompted the decision (e.g., "task spec requires auth middleware")
+ - **Options considered** — implementation alternatives
+ - **Chosen** — selected approach
+ - **Reasoning** — why this approach over alternatives
+ - **Confidence** — high/medium/low
+ - **Counterfactual** — what would break without this decision
+
+ Key executor reasoning moments: architecture choices, library selections, API design decisions, test strategy decisions, and any deviation from the plan.
+
  ## Done
 
  When all tasks are complete and verified:
@@ -45,7 +45,6 @@ Run `wazir init` (default: auto mode). This automatically:
  - `model_mode: "claude-only"` (override: `wazir config set model_mode multi-model`)
  - `default_depth: "standard"` (override per-run: `/wazir deep ...`)
  - `default_intent: "feature"` (inferred per-run from request text)
- - `team_mode: "sequential"` (always)
  6. **Auto-exports** for the detected host
 
  **No questions asked.** The pipeline is ready to use immediately.
@@ -48,7 +48,7 @@ The reviewer operates in different modes depending on the phase. Mode MUST be pa
  | `final` | After execution + verification | Completed task artifacts, approved spec/plan/design | 7 final-review dims, scored 0-70 | Scored verdict (PASS/FAIL) |
  | `spec-challenge` | After specify | Draft spec artifact | 5 spec/clarification dims | Pass/fix loop, no score |
  | `design-review` | After design approval | Design artifact, approved spec | 5 design-review dims (canonical) | Pass/fix loop, no score |
- | `plan-review` | After planning | Draft plan artifact | 7 plan dims | Pass/fix loop, no score |
+ | `plan-review` | After planning | Draft plan artifact | 8 plan dims (7 + input coverage) | Pass/fix loop, no score |
  | `task-review` | During execution, per task | Uncommitted changes or `--base` SHA | 5 task-execution dims (correctness, tests, wiring, drift, quality) | Pass/fix loop, no score |
  | `research-review` | During discover | Research artifact | 5 research dims | Pass/fix loop, no score |
  | `clarification-review` | During clarify | Clarification artifact | 5 spec/clarification dims | Pass/fix loop, no score |
@@ -88,24 +88,42 @@ If any file is missing:
  ### `task-review` mode
  1. Uncommitted changes exist for the current task, or a `--base` SHA is provided for committed changes.
  2. Read `.wazir/state/config.json` for depth and multi_tool settings.
+ 3. **Commit discipline check:** If uncommitted changes span work from multiple tasks (e.g., files from task N and task N+1 are both modified), REJECT immediately: "REJECTED: Multiple tasks in single commit. Split into per-task commits before review." This is a blocking finding — no other dimensions are evaluated until resolved.
+ 4. **Security sensitivity check:** Run `detectSecurityPatterns` from `tooling/src/checks/security-sensitivity.js` against the diff. If `triggered === true`, add the 6 security review dimensions (injection, auth bypass, data exposure, CSRF/SSRF, XSS, secrets leakage) to the standard 5 task-execution dimensions for this review pass. Security findings use severity levels: critical (exploitable), high (likely exploitable), medium (defense-in-depth gap), low (best-practice deviation).
 
  ### `spec-challenge`, `design-review`, `plan-review`, `research-review`, `clarification-review` modes
  1. The appropriate input artifact for the mode exists.
  2. Read `.wazir/state/config.json` for depth and multi_tool settings.
+ 3. **`plan-review` additional dimension — Input Coverage:**
+ - Read the original input/briefing from `.wazir/input/briefing.md` and any `input/*.md` files
+ - Count distinct items/requirements in the input
+ - Count tasks in the execution plan
+ - If `tasks_in_plan < items_in_input` → **HIGH** finding: "Plan covers [N] of [M] input items. Missing: [list of uncovered items]"
+ - If `tasks_in_plan >= items_in_input` → dimension passes
+ - One task MAY cover multiple input items if justified in the task description
+ - This is the review-level enforcement of the "no scope reduction" rule
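The coverage rule reduces to a set-difference check. A hypothetical sketch, assuming each plan task records which input items it covers; the real reviewer would derive that mapping from the briefing and plan text, which is the hard part and is elided here:

```javascript
// Hypothetical sketch of the plan-review Input Coverage dimension.
// `inputItems` are distinct requirements parsed from .wazir/input/*;
// `coversItems` on each task is an assumed field (one task may cover several).
function checkInputCoverage(inputItems, planTasks) {
  const covered = new Set(planTasks.flatMap((t) => t.coversItems ?? []));
  const missing = inputItems.filter((item) => !covered.has(item));
  if (missing.length > 0) {
    return {
      severity: 'HIGH',
      finding: `Plan covers ${inputItems.length - missing.length} of ` +
        `${inputItems.length} input items. Missing: ${missing.join(', ')}`,
    };
  }
  return { severity: null, finding: 'input coverage: pass' };
}
```

Tracking the item-to-task mapping, rather than comparing raw counts, is what lets one justified task cover multiple input items without tripping the dimension.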

  ## Review Process (`final` mode)

+ **Before starting this phase, output to the user:**
+
+ > **Final Review** — About to run an adversarial 7-dimension review comparing your implementation against the original input, not just the task specs. The executor's per-task reviewer already validated correctness per-task — this catches drift between what you asked for and what was actually built.
+ >
+ > **Why this matters:** Without this, implementation drift ships undetected. Per-task review confirms each task matches its spec, but cannot catch: tasks that collectively miss the original intent, scope creep that added unrequested features, or acceptance criteria that were rewritten to match the implementation instead of the input.
+ >
+ > **Looking for:** Logic errors, missing features, dead code, unsubstantiated "it works" claims, scope creep, security gaps, stale documentation
+
  **Input:** Read the ORIGINAL user input (`.wazir/input/briefing.md`, `input/` directory files) and compare against what was built. This catches intent drift that task-level review misses.

  Perform adversarial review across 7 dimensions:

- 1. **Correctness** — Does the code do what the original input asked for?
- 2. **Completeness** — Are all requirements from the original input met?
- 3. **Wiring** — Are all paths connected end-to-end?
- 4. **Verification** — Is there evidence (tests, type checks) for each claim?
- 5. **Drift** — Does the implementation match what the user originally requested? (not just the plan — the INPUT)
- 6. **Quality** — Code style, naming, error handling, security
- 7. **Documentation** — Changelog entries, commit messages, comments
+ 1. **Correctness** — Does the code do what the original input asked for? *(catches: logic errors, wrong behavior, spec violations)*
+ 2. **Completeness** — Are all requirements from the original input met? *(catches: missing features, unimplemented acceptance criteria, partially delivered items)*
+ 3. **Wiring** — Are all paths connected end-to-end? *(catches: dead code, disconnected paths, missing imports, orphaned routes)*
+ 4. **Verification** — Is there evidence (tests, type checks) for each claim? *(catches: false claims of "it works" without evidence, untested code paths, missing type coverage)*
+ 5. **Drift** — Does the implementation match what the user originally requested? (not just the plan — the INPUT) *(catches: scope creep, plan deviations, unauthorized changes, gold-plating)*
+ 6. **Quality** — Code style, naming, error handling, security *(catches: security vulnerabilities, poor error handling, inconsistent naming, missing input validation)*
+ 7. **Documentation** — Changelog entries, commit messages, comments *(catches: missing changelogs, wrong commit messages, stale comments, undocumented breaking changes)*

  ## Context Retrieval

@@ -224,6 +242,24 @@ const recurring = getRecurringFindingHashes(db, 2);

  This is how Wazir evolves — findings that recur across runs become accepted learnings injected into future executor context, preventing the same mistakes.

+ ## Interaction Mode Awareness
+
+ Read `interaction_mode` from run-config:
+
+ - **`auto`:** No user checkpoints. Present the verdict and let the gating agent decide. On escalation, write the reason and STOP.
+ - **`guided`:** Standard behavior — present the verdict and ask the user how to proceed.
+ - **`interactive`:** Discuss findings with the user: "I found a potential auth bypass in `src/auth.js:42` — here's why I rated it high severity. Do you agree, or is there context I'm missing?" Show detailed reasoning for each dimension score.
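As a sketch, the dispatch on mode could look like the following; the `interaction_mode` values come from this document, while the handler shape and return values are illustrative:

```javascript
// Hypothetical dispatch on interaction_mode; only the three mode
// names and their described behaviors are taken from this document.
function checkpointBehavior(runConfig) {
  switch (runConfig.interaction_mode) {
    case 'auto':
      return { askUser: false, note: 'present verdict; gating agent decides' };
    case 'interactive':
      return { askUser: true, note: 'discuss each finding and dimension score' };
    case 'guided':
    default:
      // guided is the standard behavior, so it doubles as the fallback
      return { askUser: true, note: 'present verdict; ask how to proceed' };
  }
}
```

Treating `guided` as the default keeps the reviewer safe when run-config is missing the field.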
+
+ ## CLI/Context-Mode Enforcement
+
+ In ALL review modes, check for these violations:
+
+ 1. **Index usage enforcement:** If the agent performed >5 direct file reads (Read tool) without a preceding `wazir index search-symbols` query, flag a **[warning]** finding: "Agent performed [N] direct file reads without using the wazir index. Use `wazir index search-symbols <query>` before reading files to reduce context consumption."
+
+ 2. **Context-mode enforcement:** If the agent ran a large-category command (test runners, builds, diffs, dependency trees, linting — as classified by `hooks/routing-matrix.json`) using native Bash instead of context-mode tools (when context-mode is available), flag a **[warning]** finding: "Large command `[cmd]` run without context-mode. Route through `mcp__plugin_context-mode_context-mode__execute` to reduce context usage."
+
+ These are warnings, not blocking findings — they improve efficiency but don't affect correctness.
+

  ## Task-Review Log Filenames

  In `task-review` mode, use task-scoped log filenames and cap tracking:
@@ -238,6 +274,13 @@ Save review results to `.wazir/runs/latest/reviews/review.md` with:
  - Score breakdown
  - Verdict

+ Run the phase report command and display its output to the user:
+ ```bash
+ wazir report phase --run <run-id> --phase <review-mode>
+ ```
+
+ Output the report content to the user in the conversation.
+

  ## Phase Report Generation

  After completing any review pass, generate a phase report following `schemas/phase-report.schema.json`:
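For orientation, a generated report might look like the sketch below; every field name here is a guess for illustration, and `schemas/phase-report.schema.json` remains the authoritative shape:

```json
{
  "run_id": "2024-06-01-example",
  "phase": "final",
  "verdict": "MINOR FIXES",
  "score": 58,
  "max_score": 70,
  "findings": { "blocking": 1, "warnings": 3, "notes": 2 }
}
```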
@@ -406,8 +449,34 @@ Write to `.wazir/runs/<run-id>/handoff.md`:
  - Do NOT mutate `input/` — it belongs to the user
  - Do NOT auto-load proposed learnings into the next run

+ ## Reasoning Output
+
+ Throughout the reviewer phase, produce reasoning at two layers:
+
+ **Conversation (Layer 1):** Before each review pass, explain which dimensions are being checked and why. After findings, explain the reasoning behind severity assignments.
+
+ **File (Layer 2):** Write `.wazir/runs/<id>/reasoning/phase-reviewer-reasoning.md` with structured entries:
+ - **Trigger** — what prompted the finding (e.g., "diff adds SQL query without parameterization")
+ - **Options considered** — severity options, fix approaches
+ - **Chosen** — assigned severity and recommendation
+ - **Reasoning** — why this severity level was chosen
+ - **Confidence** — high/medium/low
+ - **Counterfactual** — what would ship if this finding were missed
+
+ Key reviewer reasoning moments: severity assignments, PASS/FAIL decisions, dimension score justifications, and escalation decisions.
+
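For illustration, one structured entry in `phase-reviewer-reasoning.md` might read as follows; the finding itself is invented:

```markdown
### Finding: SQL query built by string concatenation

- **Trigger** — diff adds `"SELECT * FROM users WHERE id = " + id` in `src/db.js`
- **Options considered** — critical (directly exploitable) vs. high (input partially validated upstream)
- **Chosen** — high; recommend a parameterized query
- **Reasoning** — the value passes through one validation layer, so it is not trivially exploitable, but the pattern invites injection
- **Confidence** — high
- **Counterfactual** — an injectable query would ship and pass per-task review, since the task spec never mentioned SQL safety
```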
  ## Done

+ **After completing this phase, output to the user:**
+
+ > **Final Review complete.**
+ >
+ > **Found:** [N] findings across 7 dimensions — [N] blocking, [N] warnings, [N] notes. Score: [score]/70 ([VERDICT]).
+ >
+ > **Without this phase:** [N] blocking issues would have shipped — including [specific examples: e.g., "missing error handler on /api/users endpoint", "auth middleware not wired to 3 routes", "CHANGELOG missing entry for breaking API change"]
+ >
+ > **Changed because of this work:** [List of issues caught and fixed during review passes, score improvement from first to final pass]
+

  Present the verdict and offer next steps:

  > **Review complete: [VERDICT] ([score]/70)**
@@ -416,8 +485,12 @@ Present the verdict and offer next steps:
  >
  > **Learnings proposed:** [count] (see `memory/learnings/proposed/`)
  > **Handoff:** `.wazir/runs/<run-id>/handoff.md`
- >
- > **What would you like to do?**
- > 1. **Create a PR** (if PASS)
- > 2. **Auto-fix and re-review** (if MINOR FIXES)
- > 3. **Review findings in detail**
+
+ Ask the user via AskUserQuestion:
+ - **Question:** "How would you like to proceed with the review results?"
+ - **Options:**
+   1. "Create a PR" *(Recommended if PASS)*
+   2. "Auto-fix and re-review" *(Recommended if MINOR FIXES)*
+   3. "Review findings in detail"
+
+ Wait for the user's selection before continuing.
@@ -185,6 +185,26 @@ Beyond CLI checks, inspect for:
  - Run `wazir export --check`
  - Any drift detected is a finding

+ 11. **Input Coverage** (run-scoped — only when a run directory exists)
+     - Read the original input file(s) from `.wazir/input/` or `.wazir/runs/<id>/sources/`
+     - Read the execution plan from `.wazir/runs/<id>/clarified/execution-plan.md`
+     - Read the actual commits on the branch: `git log --oneline main..HEAD`
+     - Build a coverage matrix: every distinct item in the input should map to:
+       - At least one task in the execution plan
+       - At least one commit in the git log
+     - **Missing items** (in input but not in plan AND not in commits) → **HIGH** severity finding
+     - **Partial items** (in plan but no corresponding commit) → **MEDIUM** severity finding
+     - **Fully covered items** (input → plan → commit) → pass
+     - Output the coverage matrix in the audit report:
+       ```
+       | Input Item | Plan Task | Commit | Status  |
+       |------------|-----------|--------|---------|
+       | Item 1     | Task 3    | abc123 | PASS    |
+       | Item 2     | Task 5    | —      | PARTIAL |
+       | Item 3     | —         | —      | MISSING |
+       ```
+     - This dimension catches scope reduction AFTER the fact — a safety net for when the clarifier or planner fails
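A sketch of how the matrix rows and severities could be derived; the data shapes here are hypothetical, since in practice the auditor assembles them by reading the briefing, the execution plan, and `git log` output:

```javascript
// Illustrative sketch of the input → plan → commit coverage matrix.
// Each item records which plan task and which commit (if any) cover it.
function coverageStatus(item) {
  if (!item.planTask) return 'MISSING'; // HIGH severity finding
  if (!item.commit) return 'PARTIAL';   // MEDIUM severity finding
  return 'PASS';
}

function coverageMatrix(items) {
  const rows = items.map(
    (i) =>
      `| ${i.name} | ${i.planTask ?? '—'} | ${i.commit ?? '—'} | ${coverageStatus(i)} |`
  );
  return [
    '| Input Item | Plan Task | Commit | Status |',
    '|------------|-----------|--------|--------|',
    ...rows,
  ].join('\n');
}
```

The rendered table drops straight into the audit report, making the MISSING and PARTIAL rows impossible to overlook.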
207
+
188
208
  ## Protected-Path Safety Rails
189
209
 
190
210
  Before applying ANY fix in Phase 3, check if the target file is in a protected path. The self-audit loop MUST NOT modify files in: