@wazir-dev/cli 1.1.0 → 1.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (124)
  1. package/CHANGELOG.md +73 -4
  2. package/README.md +6 -6
  3. package/docs/concepts/architecture.md +1 -1
  4. package/docs/concepts/roles-and-workflows.md +2 -0
  5. package/docs/concepts/why-wazir.md +59 -0
  6. package/docs/decisions/2026-03-19-deferred-items.md +564 -0
  7. package/docs/decisions/2026-03-19-enhancement-decisions.md +300 -0
  8. package/docs/readmes/INDEX.md +21 -5
  9. package/docs/readmes/features/expertise/README.md +2 -2
  10. package/docs/readmes/features/exports/README.md +2 -2
  11. package/docs/readmes/features/schemas/README.md +3 -0
  12. package/docs/readmes/features/skills/README.md +17 -0
  13. package/docs/readmes/features/skills/clarifier.md +5 -0
  14. package/docs/readmes/features/skills/claude-cli.md +5 -0
  15. package/docs/readmes/features/skills/codex-cli.md +5 -0
  16. package/docs/readmes/features/skills/dispatching-parallel-agents.md +5 -0
  17. package/docs/readmes/features/skills/executing-plans.md +5 -0
  18. package/docs/readmes/features/skills/executor.md +5 -0
  19. package/docs/readmes/features/skills/finishing-a-development-branch.md +5 -0
  20. package/docs/readmes/features/skills/gemini-cli.md +5 -0
  21. package/docs/readmes/features/skills/humanize.md +5 -0
  22. package/docs/readmes/features/skills/init-pipeline.md +5 -0
  23. package/docs/readmes/features/skills/receiving-code-review.md +5 -0
  24. package/docs/readmes/features/skills/requesting-code-review.md +5 -0
  25. package/docs/readmes/features/skills/reviewer.md +5 -0
  26. package/docs/readmes/features/skills/subagent-driven-development.md +5 -0
  27. package/docs/readmes/features/skills/using-git-worktrees.md +5 -0
  28. package/docs/readmes/features/skills/wazir.md +5 -0
  29. package/docs/readmes/features/skills/writing-skills.md +5 -0
  30. package/docs/readmes/features/workflows/prepare-next.md +1 -1
  31. package/docs/reference/configuration-reference.md +47 -6
  32. package/docs/reference/launch-checklist.md +4 -4
  33. package/docs/reference/review-loop-pattern.md +117 -8
  34. package/docs/reference/roles-reference.md +1 -0
  35. package/docs/reference/skill-tiers.md +147 -0
  36. package/docs/reference/tooling-cli.md +3 -1
  37. package/docs/truth-claims.yaml +12 -0
  38. package/expertise/antipatterns/process/ai-coding-antipatterns.md +97 -1
  39. package/exports/hosts/claude/.claude/settings.json +9 -0
  40. package/exports/hosts/claude/CLAUDE.md +1 -1
  41. package/exports/hosts/claude/export.manifest.json +4 -2
  42. package/exports/hosts/claude/host-package.json +3 -1
  43. package/exports/hosts/codex/AGENTS.md +1 -1
  44. package/exports/hosts/codex/export.manifest.json +4 -2
  45. package/exports/hosts/codex/host-package.json +3 -1
  46. package/exports/hosts/cursor/.cursor/hooks.json +4 -0
  47. package/exports/hosts/cursor/.cursor/rules/wazir-core.mdc +1 -1
  48. package/exports/hosts/cursor/export.manifest.json +4 -2
  49. package/exports/hosts/cursor/host-package.json +3 -1
  50. package/exports/hosts/gemini/GEMINI.md +1 -1
  51. package/exports/hosts/gemini/export.manifest.json +4 -2
  52. package/exports/hosts/gemini/host-package.json +3 -1
  53. package/hooks/context-mode-router +191 -0
  54. package/hooks/definitions/context_mode_router.yaml +19 -0
  55. package/hooks/hooks.json +31 -6
  56. package/hooks/protected-path-write-guard +8 -0
  57. package/hooks/routing-matrix.json +45 -0
  58. package/hooks/session-start +62 -1
  59. package/llms-full.txt +905 -132
  60. package/package.json +2 -3
  61. package/schemas/hook.schema.json +2 -1
  62. package/schemas/phase-report.schema.json +80 -0
  63. package/schemas/usage.schema.json +25 -1
  64. package/schemas/wazir-manifest.schema.json +19 -0
  65. package/skills/brainstorming/SKILL.md +18 -155
  66. package/skills/clarifier/SKILL.md +122 -98
  67. package/skills/claude-cli/SKILL.md +320 -0
  68. package/skills/codex-cli/SKILL.md +260 -0
  69. package/skills/debugging/SKILL.md +13 -0
  70. package/skills/design/SKILL.md +13 -0
  71. package/skills/dispatching-parallel-agents/SKILL.md +13 -0
  72. package/skills/executing-plans/SKILL.md +13 -0
  73. package/skills/executor/SKILL.md +72 -19
  74. package/skills/finishing-a-development-branch/SKILL.md +13 -0
  75. package/skills/gemini-cli/SKILL.md +260 -0
  76. package/skills/humanize/SKILL.md +13 -0
  77. package/skills/init-pipeline/SKILL.md +73 -164
  78. package/skills/prepare-next/SKILL.md +81 -10
  79. package/skills/receiving-code-review/SKILL.md +13 -0
  80. package/skills/requesting-code-review/SKILL.md +13 -0
  81. package/skills/reviewer/SKILL.md +287 -15
  82. package/skills/run-audit/SKILL.md +13 -0
  83. package/skills/scan-project/SKILL.md +13 -0
  84. package/skills/self-audit/SKILL.md +197 -16
  85. package/skills/subagent-driven-development/SKILL.md +13 -0
  86. package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +2 -0
  87. package/skills/subagent-driven-development/implementer-prompt.md +8 -0
  88. package/skills/subagent-driven-development/spec-reviewer-prompt.md +7 -0
  89. package/skills/tdd/SKILL.md +13 -0
  90. package/skills/using-git-worktrees/SKILL.md +13 -0
  91. package/skills/using-skills/SKILL.md +13 -0
  92. package/skills/verification/SKILL.md +13 -0
  93. package/skills/wazir/SKILL.md +194 -377
  94. package/skills/writing-plans/SKILL.md +14 -1
  95. package/skills/writing-skills/SKILL.md +13 -0
  96. package/templates/artifacts/implementation-plan.md +3 -0
  97. package/templates/artifacts/tasks-template.md +133 -0
  98. package/templates/examples/phase-report.example.json +48 -0
  99. package/tooling/src/adapters/composition-engine.js +256 -0
  100. package/tooling/src/adapters/model-router.js +84 -0
  101. package/tooling/src/capture/command.js +24 -1
  102. package/tooling/src/capture/run-config.js +3 -1
  103. package/tooling/src/capture/store.js +24 -0
  104. package/tooling/src/capture/usage.js +106 -0
  105. package/tooling/src/checks/ac-matrix.js +256 -0
  106. package/tooling/src/checks/command-registry.js +12 -0
  107. package/tooling/src/checks/docs-truth.js +1 -1
  108. package/tooling/src/checks/skills.js +111 -0
  109. package/tooling/src/cli.js +9 -0
  110. package/tooling/src/commands/stats.js +161 -0
  111. package/tooling/src/commands/validate.js +5 -1
  112. package/tooling/src/export/compiler.js +33 -37
  113. package/tooling/src/gating/agent.js +145 -0
  114. package/tooling/src/guards/phase-prerequisite-guard.js +127 -0
  115. package/tooling/src/hooks/routing-logic.js +69 -0
  116. package/tooling/src/init/auto-detect.js +260 -0
  117. package/tooling/src/init/command.js +95 -135
  118. package/tooling/src/input/scanner.js +46 -0
  119. package/tooling/src/reports/command.js +103 -0
  120. package/tooling/src/reports/phase-report.js +323 -0
  121. package/tooling/src/state/command.js +160 -0
  122. package/tooling/src/state/db.js +287 -0
  123. package/tooling/src/status/command.js +53 -1
  124. package/wazir.manifest.yaml +26 -14
@@ -0,0 +1,564 @@
# Deferred Work Items — 2026-03-19

Items agreed on during the brainstorming + pipeline session. Each needs discussion before implementation.

---

## PRIORITY: Restructure Pipeline to 4 Main Phases

The current pipeline has too many micro-phases. Restructure into 4 clear phases:

### Phase 1: Init
- Straightforward setup
- Read config, check prerequisites, create run directory
- No questions (depth/intent/teams inferred or defaulted — see items #4, #5)
- Scan `input/` directory automatically (see item #3)

### Phase 2: Clarifier
- Everything from reading input to having an implementation plan
- Includes: research, clarification, spec hardening, brainstorming/design, planning
- All review loops happen within this phase (research-review, clarification-review, spec-challenge, design-review, plan-review)
- Output: approved spec + design + execution plan with task specs

### Phase 3: Executor
- The simplest phase — pure implementation
- Should work with a lower-quality model (e.g., Sonnet for the executor, a medium-quality model for the per-task reviewer)
- Per-task TDD cycle: write test → implement → review against task spec
- The per-task reviewer checks implementation against the TASK spec (not input)
- Should still produce the same quality results with cheaper models

### Phase 4: Final Review
- Compare implementation against the ORIGINAL INPUT (not the task specs)
- The executor's per-task reviewer already validated against task specs — that's covered
- Final reviewer validates: does what we built actually match what the user asked for?
- Input-to-implementation comparison catches drift that task-level review misses
- After review: apply learnings (from the run)
- After review: prepare for the next task (compress/archive unneeded files, write handoff)
- This is where the learn + prepare-next phases get folded in

### Key Principle
- Phase 3 executor reviewer reviews: implementation vs. task spec
- Phase 4 final reviewer reviews: implementation vs. original user input
- These are different concerns — task-level correctness vs. intent alignment
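The four phases above could collapse today's `phase_policy` into something like the following hypothetical sketch; keys and names are illustrative, not the current run-config schema:

```yaml
# Hypothetical run-config sketch — not the shipped schema
phase_policy:
  init:
    enabled: true
    questions: none            # depth/intent/teams inferred or defaulted
    scan_input: true
  clarifier:
    enabled: true
    review_loops: [research-review, clarification-review, spec-challenge, design-review, plan-review]
    output: [spec, design, execution-plan]
  executor:
    enabled: true
    model: sonnet              # cheaper model; per-task reviewer checks vs. TASK spec
    task_cycle: [write-test, implement, review-vs-task-spec]
  final-review:
    enabled: true
    compare_against: original-input   # not the task specs
    then: [apply-learnings, prepare-next]
```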

### Online Research
- How MetaGPT structures its pipeline phases (ProductManager → Architect → Engineer → QA) — their phase boundaries and handoff contracts
- How OpenAI Symphony structures Poll → Dispatch → Resolve → Land — what happens between phases
- How CrewAI Flows handle phase transitions with routers and conditional branching
- How GitHub Spec-Kit structures spec → implement → verify — their phase model
- LangGraph's multi-step agent pipelines — how they define phase boundaries vs. node boundaries
- How Devin structures its planning → implementation → verification phases
- How Cursor Agent / Windsurf Cascade structure their internal pipeline phases

---

## 16. One-Command Install + Zero-Config Start

**Goal:** Wazir should be one of the easiest agent frameworks to install and use.

**Current experience (too many steps, and one of them breaks):**
```bash
git clone wazir
npm install
npx wazir export build # 4 host exports
npx wazir init # interactive arrows — FAILS in Claude Code
# then learn phases, roles, skills, composition...
```

**Target experience:**
```bash
npx wazir
```

One command. Everything else is automatic or deferred until needed.

**What `npx wazir` should do:**
1. Detect host (Claude Code? Codex? Cursor? Gemini?) from environment
2. Export the right files for that host automatically
3. Scan the project (language, framework, stack)
4. Ask "What do you want to build?" — one question, no config
5. Run the full pipeline

**No init questions.** Defaults for everything:
- Depth: standard (override with `/wazir deep ...`)
- Intent: inferred from request text
- Teams: sequential (always)
- Model mode: detected from available tools
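Step 1 could be sketched as an environment probe. The variable names below are assumptions to verify per host (`CLAUDECODE` is believed to be set by Claude Code; the others are purely hypothetical placeholders):

```javascript
// Sketch: detect the host CLI from environment markers.
// The variable names are assumptions; verify each against the host's docs.
function detectHost(env = process.env) {
  if (env.CLAUDECODE) return 'claude';          // believed set by Claude Code
  if (env.CODEX_SANDBOX || env.CODEX_HOME) return 'codex';   // hypothetical
  if (env.CURSOR_SESSION_ID) return 'cursor';   // hypothetical
  if (env.GEMINI_CLI) return 'gemini';          // hypothetical
  return 'unknown';                             // fall back to asking the user
}
```

An `'unknown'` result would route to a single fallback question rather than failing, keeping the zero-config promise.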

**Install paths (from easiest to most control):**

| Path | Command | Who |
|------|---------|-----|
| Plugin marketplace | `/plugin install wazir` | Claude Code users |
| npx (zero install) | `npx @wazir-dev/cli` | Any Node project |
| Global install | `npm i -g @wazir-dev/cli` | Power users |
| Clone + link | `git clone && npm link` | Contributors |

**First-run experience:**
- No CLAUDE.md editing required
- No config files to create
- No concepts to learn upfront
- Just: install → `/wazir build me a login page` → pipeline runs

**Deep when you need it:**
- `wazir config` for power users who want control
- `wazir doctor` to see what's configured
- `wazir stats` to see what Wazir saved you
- Expertise modules, review loops, composition — all there, all optional to understand

**Principle:** Like git — `git init` is one command, `git config` is deep. Instant start, deep when you need it.

### Online Research
- How the superpowers plugin handles install (`/plugin marketplace add`) — onboarding flow, zero-config
- How Cursor rules / .cursorrules get auto-detected and applied — zero-config pattern
- How `create-react-app`, `create-next-app`, `create-t3-app` handle project scaffolding — one-command patterns
- How Vercel CLI (`npx vercel`) auto-detects framework and deploys — zero-config detection
- How `degit` (Rich Harris) handles instant project scaffolding without git history
- How Yeoman generators and `npm create` handle interactive scaffolding vs. zero-config
- How the Claude Code plugin marketplace install flow works (behind the scenes)
- How OpenAI Codex handles first-run experience — what config does it need?
- How aider (Paul Gauthier) handles zero-config start — `aider` just works in any repo
- How bolt.new / v0.dev handle instant project start with no config

---

## 15. SQLite State Database for Learnings, Findings, and Trends

**What:** Extend the existing SQLite infrastructure (already used for the index) with a `state.sqlite` for persistent cross-run data.

**Why files don't work:** Can't query "show me all executor learnings for the node stack", can't detect recurring findings, can't trend across runs without reading every past report.

**New tables in `state.sqlite` (kept separate from `index.sqlite` so state survives index rebuilds):**
- `learnings` — id, source_run, category, scope_roles, scope_stacks, scope_concerns, confidence, recurrence_count, content, created_at, last_applied, expires_at
- `findings` — id, run_id, phase, source (internal/codex/self-audit), severity, description, resolved, finding_hash (for dedup/recurrence detection)
- `audit_history` — run_id, date, finding_count, fix_count, manual_count, quality_score_before, quality_score_after
- `usage_aggregate` — run_id, date, tokens_saved, bytes_avoided, savings_ratio, index_queries, routing_decisions

**Key queries this enables:**
- `SELECT * FROM findings WHERE finding_hash IN (SELECT finding_hash FROM findings GROUP BY finding_hash HAVING COUNT(*) >= 2)` — recurring findings → auto-propose as learnings
- `SELECT run_id, finding_count FROM audit_history ORDER BY date` — trend tracking
- `SELECT * FROM learnings WHERE scope_stacks LIKE '%node%' AND confidence = 'high'` — targeted learning injection
- `SELECT SUM(tokens_saved) FROM usage_aggregate` — cumulative savings

**No new dependency:** `node:sqlite` is already used by the index. Same infrastructure, new file.

### Online Research
- How CrewAI stores long-term memory (SQLite3 backend) — schema, query patterns, adaptive scoring
- How LangGraph Store uses PostgreSQL + pgvector for cross-session persistence — schema design
- How LangMem stores and retrieves memories with TTL and semantic search
- How Mem0 (formerly EmbedChain) structures its memory database — entities, relations, facts
- How OpenHands stores agent state and history across sessions
- SQLite FTS5 full-text search for finding content — already used by the context-mode plugin
- How Datasette (Simon Willison) exposes SQLite as a browsable interface — could be useful for `wazir stats`
- How Litestream handles SQLite replication — relevant if state.sqlite needs backup/sync
- SQLite WAL mode for concurrent reads during pipeline execution

---

## 14. Self-Audit V2 — Learnings + Trend Tracking + Severity

**What:** The current self-audit finds and fixes issues but doesn't learn from them. Needs a second pass.

**What's missing:**

1. **No learning** — findings should feed into `memory/learnings/proposed/`. If the same issue recurs across audits, it becomes an accepted learning injected into executor context to prevent the issue from being reintroduced.

2. **No comparison to previous audits** — each audit starts fresh with no awareness of what was found last time. Should track: "last audit found 12 issues, this time 8 — are they the same or new?" Trend detection across runs.

3. **Fixes are mechanical only** — can fix lint, stale exports, broken references. Can't fix architectural drift, skill degradation, logic issues. These become "manual required" findings that nobody acts on. Need: an escalation path for manual findings (create a task? file an issue?).

4. **No severity or prioritization** — all findings are equal. Should have severity levels: critical (abort loop), high (fix now), medium (fix if time), low (log and skip).

5. **Doesn't audit its own effectiveness** — after 5 loops, did the fixes improve anything? Need a before/after quality score per loop to measure convergence quality, not just convergence count.

**Connection to item #13:** Self-audit findings should feed the SAME learning pipeline as review findings. Both are signals about recurring quality issues.
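The severity levels in item 4 could map to loop actions roughly like this (a sketch: severity names follow the item, action strings are illustrative):

```javascript
// Sketch: map finding severity to the audit-loop action proposed above.
const SEVERITY_ACTIONS = {
  critical: 'abort-loop',    // stop the loop immediately
  high:     'fix-now',
  medium:   'fix-if-time',
  low:      'log-and-skip',
};

// Sort findings so critical ones surface first, and attach the action.
function triage(findings) {
  const order = ['critical', 'high', 'medium', 'low'];
  return [...findings]
    .sort((a, b) => order.indexOf(a.severity) - order.indexOf(b.severity))
    .map(f => ({ ...f, action: SEVERITY_ACTIONS[f.severity] ?? 'log-and-skip' }));
}
```

Unknown severities fall through to `log-and-skip` so a malformed finding can't abort the loop.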

### Online Research
- How SonarQube tracks code quality trends across builds — severity levels, technical debt measurement
- How CodeClimate maintains quality scores over time — GPA model, trend detection
- How GitHub Advanced Security tracks recurring vulnerabilities — alert lifecycle (open → dismissed → fixed → reopened)
- How ESLint's `--cache` tracks which files have been linted before — incremental audit pattern
- How Snyk tracks dependency vulnerabilities across versions — recurring vs. new findings
- How PagerDuty/OpsGenie handle alert fatigue and dedup — relevant for finding dedup
- How Netflix Chaos Monkey evolved its audit approach — game days, severity escalation
- Academic: "Technical Debt Detection and Management" survey papers — automated detection patterns

---

## 13. Internal Review First, Codex Second + Review Findings Feed Learning

**What:** Restructure the review loop: internal review skill first (fast, cheap, expertise-loaded), Codex second (slow, expensive, fresh eyes). Review findings must feed the learning system.

**Current problem:**
- Everything goes to Codex first — slow, expensive, overkill for most findings
- Review findings are consumed and forgotten — the same mistakes repeat across runs
- Wazir is not evolving from its own reviews

**New review flow:**
1. **Internal review** — `wz:reviewer` with full expertise modules composed in context (composition engine). Catches ~80% of issues. Uses Sonnet (not Opus). Fast and cheap.
2. **Fix cycle** — executor fixes internal findings; re-review internally until clean
3. **Codex review** — only AFTER internal review passes. Fresh eyes on clean code. Catches subtle issues the internal reviewer missed.
4. **Fix cycle** — executor fixes Codex findings
5. **Learning extraction** — ALL review findings (internal + Codex) feed into the learning system

**Review findings → Learning pipeline:**
- After each run, the learner role scans all review findings
- Findings that recur across 2+ runs → auto-proposed as learnings
- Recurring learnings get injected into future executor context
- Example: if Codex keeps finding "missing error handling in async functions" across 3 runs, that becomes an accepted learning injected into every future executor prompt
- This is how Wazir evolves — it stops making the same mistakes
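The recur-across-2+-runs rule might look like this in code; the `findingHash`/`runId` field names and the confidence heuristic are illustrative assumptions, not a settled design:

```javascript
// Sketch: promote findings that recur across 2+ distinct runs into
// proposed learnings (memory/learnings/proposed/).
function proposeLearnings(findings, threshold = 2) {
  // Group findings by hash, tracking the distinct runs each appeared in.
  const byHash = new Map();
  for (const f of findings) {
    const seen = byHash.get(f.findingHash) ?? { runs: new Set(), sample: f };
    seen.runs.add(f.runId);
    byHash.set(f.findingHash, seen);
  }
  const proposals = [];
  for (const [hash, { runs, sample }] of byHash) {
    if (runs.size >= threshold) {
      proposals.push({
        findingHash: hash,
        recurrenceCount: runs.size,
        content: sample.description,
        confidence: runs.size >= 3 ? 'high' : 'medium', // illustrative heuristic
      });
    }
  }
  return proposals;
}
```

The same function would serve self-audit findings (item #14), since both feed one learning pipeline.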

**Reference skills:**
- superpowers: `requesting-code-review` (dispatches review subagent)
- CodeRabbit: `code-reviewer` (autonomous review agent)
- Wazir: `wz:reviewer` (phase-aware, multi-mode)

**Decision needed:** Should internal review use a dedicated review subagent (like superpowers) or the current inline reviewer skill?

### Online Research
- How superpowers `requesting-code-review` dispatches review subagents — architecture and prompt design
- How the CodeRabbit AI review agent works — what it checks, how it structures findings
- How GitHub Copilot code review works — inline suggestions vs. PR-level review
- How Amazon CodeGuru Reviewer detects recurring patterns — ML-based finding detection
- How Google's Critique (internal code review tool) prioritizes findings
- How Facebook/Meta's Sapienz handles automated test generation from review findings
- How Sourcery.ai structures automated code review — rules vs. AI vs. hybrid
- How Qodo (formerly CodiumAI) generates review-driven test suggestions
- LangMem's prompt optimization from feedback — how review findings become prompt improvements
- Academic: "Learning from Code Reviews" — mining patterns from historical review data

---

## 12. Opus Orchestrator with Model-Aware Delegation

**What:** Opus acts as orchestrator and decides which model handles each sub-task. Not "Phase 3 uses Sonnet" — every sub-task across all phases gets the cheapest model that can do it.

**Examples:**
| Sub-task | Model | Why |
|----------|-------|-----|
| Orchestration / decisions | Opus | Needs judgment |
| Fetch URL content | Haiku | Mechanical, no reasoning |
| Read/summarize a file | Sonnet | Comprehension, not deep reasoning |
| Write implementation code | Sonnet | Good spec + plan = mechanical |
| Per-task code review | Sonnet | Diff review against a clear spec |
| Spec hardening / design | Opus | Creativity + adversarial thinking |
| Final review (input vs impl) | Opus | Holistic judgment |
| Apply learnings | Sonnet | Structured extraction |
| Write handoff / compress | Haiku | Mechanical file operations |

**How:** Claude Code supports a `model` field in agent frontmatter (`model: sonnet | opus | haiku | inherit`). Opus spawns subagents with explicit model overrides via the Agent tool.

**What needs to change:**
- Implement the `multi-model` mode in config (exists as an option, never built)
- Add model-routing logic: Opus decides the model per sub-task based on task complexity
- Composition engine dispatch includes model selection
- Skills/workflows annotate which sub-tasks are Haiku/Sonnet/Opus-grade

**Principle:** Not all clarifier sub-agents should be Opus. The agent that grabs data from the internet doesn't need Opus. The one that designs the architecture does.
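A minimal routing sketch for the table above; the `grade` annotation is a hypothetical field that skills/workflows would supply per sub-task:

```javascript
// Sketch: route each sub-task to the cheapest adequate model.
// Grades collapse the table above into three tiers.
const GRADE_TO_MODEL = {
  mechanical:    'haiku',   // fetch URLs, compress, file operations
  comprehension: 'sonnet',  // summarize, implement against a clear spec, diff review
  judgment:      'opus',    // orchestration, spec hardening, final review
};

function routeModel(subTask) {
  // 'inherit' matches the frontmatter fallback: use the parent agent's model.
  return GRADE_TO_MODEL[subTask.grade] ?? 'inherit';
}
```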

### Online Research
- How BAMAS (Budget-Aware Multi-Agent Structuring) models cost budgets in multi-agent systems — academic paper
- How AutoGen/AG2 handles model routing across agent roles — model selection per agent
- How CrewAI assigns different LLMs to different agents — per-agent model configuration
- How LangGraph routes to different models based on task complexity — conditional model selection
- How OpenRouter handles model routing and fallback — API-level model selection patterns
- How Anthropic's own Claude Agent SDK handles model selection for subagents
- How Martian's model router selects the optimal model per query — automated model routing
- Token cost comparison across models (Opus vs Sonnet vs Haiku) for different task types
- How the Mixture of Agents (MoA) pattern uses cheap models for drafting and expensive ones for refinement

---

## 11. Phase Reports Still Partially Prose-Based

**What:** The reviewer skill has instructions to generate phase reports per the new schema, but there's no executable code that auto-generates the JSON. It's skill prose telling the agent what fields to fill, not a function that produces the report deterministically.

**What's needed:**
- A `tooling/src/reports/phase-report.js` module that builds the JSON from actual data (test results, lint output, diff stats)
- Wiring into the review workflow so reports are generated automatically, not by agent interpretation
- Between-phase reports (from commit 6c84455) are separate and still thin — they need to be replaced or merged with the new schema

**Decision needed:** Should report generation be code (deterministic) or skill prose (agent-generated)? Or hybrid — code collects metrics, agent fills qualitative fields?
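The hybrid option might look like this: code fills the deterministic metrics and leaves qualitative fields null for the agent. Field names are a draft, not the shipped phase-report schema:

```javascript
// Sketch: deterministic half of a hybrid phase report.
// Inputs would come from real collectors (test runner JSON, lint JSON, git diff --numstat).
function buildPhaseReport({ phase, tests, lint, diff }) {
  return {
    phase,
    metrics: {
      tests: { passed: tests.passed, failed: tests.failed },
      lintErrors: lint.errorCount,
      filesChanged: diff.files,
      linesAdded: diff.added,
      linesRemoved: diff.removed,
    },
    // Agent-filled afterwards; code deliberately leaves these null.
    qualitative: { summary: null, risks: null, deviationsFromSpec: null },
  };
}
```

Keeping the qualitative keys present but null makes it easy to validate that the agent filled every field before the report is accepted.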

### Online Research
- How GitHub Actions produces structured job summaries — GITHUB_STEP_SUMMARY format
- How Jest/Vitest output structured test reports (JSON reporters) — machine-readable test results
- How Allure Report structures multi-dimensional test/quality reports
- How Datadog CI Visibility structures pipeline reports with metrics
- How OpenTelemetry structures observability data — spans, metrics, traces as structured data
- How CircleCI Test Insights tracks test results across runs with trend analysis
- How Buildkite annotations structure build reports with metadata

---

## 1. Continuous Learning Implementation (Decision #7)

**What:** Build the learning system. The design is done; implementation was deferred.

**Context:** The original Wazir (v0.1.0 initial commit) already had:
- `memory/learnings/proposed/` → `accepted/` → `archived/` promotion path
- Learner role with scope tags, evidence, confidence
- Explicit review required before acceptance (no auto-apply)
- Templates: `templates/artifacts/learning-proposal.md`, `templates/artifacts/accepted-learning.md`
- Guard: fresh runs do NOT auto-load learnings by default

**What the new design adds (from learning-system-design.md):**
- User input capture (all messages, corrections, approvals as ndjson)
- Cross-run injection (top-K learnings loaded into the next run's context)
- Drift budget (30% token ratio threshold per role)
- Feedback loop (auto-confidence adjustment based on recurrence)
- CrewAI-style adaptive scoring (semantic + recency + importance weights)

**Decision needed:** Build on the existing `memory/learnings/` structure or replace it with the new design?

### Online Research
- How the CrewAI Memory system works — 4-layer memory (short-term, long-term, entity, contextual), adaptive scoring
- How the LangMem SDK implements prompt optimization from feedback — gradient-based updates
- How LangGraph Store handles cross-session persistence with TTL and semantic search
- How Mem0 structures persistent memory — entities, relations, facts, temporal decay
- How Voyager (Minecraft agent) implements a skill library as persistent learning — learning through play
- How RLHF (Reinforcement Learning from Human Feedback) pipelines structure feedback collection
- How Constitutional AI implements self-improvement from principles
- How OpenHands stores and retrieves agent experiences across sessions
- Academic: "Experience Replay in Multi-Agent Systems" — how agents learn from past interactions
- Academic: "Reflexion: Language Agents with Verbal Reinforcement Learning" — self-reflective learning

---

## 2. Restore Apply Learning + Prepare Next Task

**What:** The original pipeline had learn + prepare-next as final phases. They are currently disabled.

**Context:**
- `workflows/learn.md` exists — learner role extracts learnings
- `workflows/prepare-next.md` exists — planner role creates a clean handoff
- `skills/prepare-next/SKILL.md` exists — writes the handoff using a template
- Both are `enabled: false` in `phase_policy` in every run config

**Decision needed:** Enable these phases by default? Or only for deep-depth runs?

### Online Research
- How OpenAI Symphony's proof-of-work system handles post-execution learning and handoff
- How Devin handles context handoff between sessions — what persists, what's compressed
- How Claude Code's auto-memory (`/memory`) handles session-to-session learning
- How Anthropic recommends structuring long-running agent state — the `claude-progress.txt` pattern
- How Cursor's .cursorrules evolve across sessions — do they learn?

---

## 3. Fix: Pipeline Doesn't Read `input/` Directory

**What:** `/wazir` and `/clarifier` don't scan `input/` for briefing materials automatically.

**Context:** The user placed enhancement decisions in `input/` but the pipeline didn't pick them up; a briefing had to be created manually.

**Fix:** Scan both `input/` (project-level) and `.wazir/input/` (state-level) at startup. Present what's found and ask the user which items to work on.

**Decision needed:** Should this be a skill fix or a CLI fix? Which takes priority — project `input/` or state `.wazir/input/`?

### Online Research
- How GitHub Spec-Kit reads the `specs/` directory to auto-discover specs for implementation
- How Cursor reads `.cursor/rules/` to auto-load project context
- How aider handles file discovery and auto-inclusion — `.aider.conf.yml` patterns
- How Anthropic recommends structuring project context in CLAUDE.md

---

## 4. Remove Agent Teams From Pipeline

**What:** Remove the teams question and `team_mode` from the pipeline. Always sequential.

**Context:** The user decided teams are not ready — too much overhead, experimental, Opus-only.

**Scope:**
- Remove the teams question from `skills/wazir/SKILL.md` Step 3
- Remove the teams question from `skills/init-pipeline/SKILL.md` Step 6
- Hard-default `team_mode: sequential` in the run config
- Remove or stub the Agent Teams Structured Dialogue section in `skills/brainstorming/SKILL.md`
- Keep the code paths but don't expose the choice to users

**Decision needed:** Remove entirely or just hide the option?

### Online Research
- How Claude Code Agent Teams actually work now (2026) — stability, token consumption, real-world results
- How superpowers handles agent teams — their v5.0 approach
- How OpenAI Symphony handles parallel agent execution — supervision trees vs. teams
- When to revisit: track Claude Code team stability announcements

---

## 5. Remove 3 Init Questions

**What:** Stop asking about depth, teams, and intent during pipeline init. Infer instead.

**Questions to remove:**
- "How thorough should this run be?" (depth) — default to standard, override via an inline modifier (`/wazir deep ...`)
- "Would you like to use Agent Teams?" (teams) — always sequential (see #4)
- "What kind of work does this project mostly involve?" (intent) — infer from the request text

**Context:** These questions slow down the pipeline start. The goal is fewer questions and smarter inference.

**Decision needed:** What's the inference logic for intent? Options:
- Keywords: "fix" → bugfix, "add/build/create" → feature, "refactor/clean" → refactor
- Or just default to feature and let inline modifiers override
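The keyword option could start as simple word lists (illustrative, with the default mirroring the second option):

```javascript
// Sketch: infer intent from the request text. Word lists are illustrative
// starting points; order matters (bugfix/refactor checked before feature).
const INTENT_KEYWORDS = [
  { intent: 'bugfix',   words: ['fix', 'bug', 'broken', 'error', 'crash'] },
  { intent: 'refactor', words: ['refactor', 'clean', 'cleanup', 'restructure'] },
  { intent: 'feature',  words: ['add', 'build', 'create', 'implement', 'new'] },
];

function inferIntent(request) {
  const text = request.toLowerCase();
  for (const { intent, words } of INTENT_KEYWORDS) {
    if (words.some(w => new RegExp(`\\b${w}\\b`).test(text))) return intent;
  }
  return 'feature'; // default; inline modifiers can still override
}
```

Falling back to `feature` keeps the two options compatible: start with the default, tighten the keyword lists as misclassifications show up.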
404
+
### Online Research
- How GitHub Copilot Workspace infers intent from issue descriptions — NLU for code intent
- How Linear auto-categorizes issues (bug/feature/improvement) from title/description
- How Jira Smart Commits parse commit messages for intent signals
- How aider infers what the user wants from natural language — no explicit intent selection

---

## 6. Deep Research: Codex CLI, Gemini CLI, Claude CLI

**What:** Research all three CLIs and create Wazir skills for each.

**Goal:** Make cross-tool orchestration a first-class Wazir capability.

**Current state:** Wazir uses Codex CLI for reviews (multi-tool mode with `codex exec`). No formal skill for it. Gemini CLI and Claude CLI are not integrated at all.

**Research scope:**
- Full command surface of each CLI
- How to use them programmatically (non-interactive mode)
- Integration patterns (review, second opinion, parallel execution)
- What skills would be useful for Wazir

**Decision needed:** Priority order? Start with Codex (already used) or research all three in parallel?

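As a starting point for the programmatic-use question, the non-interactive invocations can be captured as plain data. A hedged sketch: the flags shown (`codex exec`, `claude -p`/`--model`, `gemini -p`) reflect each CLI's documented surface at time of writing and should be re-verified against current docs before building a skill on them:

```javascript
// Builds argv arrays for each CLI's assumed non-interactive mode.
// Flag names are assumptions to verify, not a tested integration.
function buildCommands(prompt) {
  return {
    codex: ["codex", "exec", prompt],
    claude: ["claude", "-p", prompt, "--model", "sonnet"],
    gemini: ["gemini", "-p", prompt],
  };
}
// A caller would spawn these with child_process.execFile and compare
// the outputs for second-opinion or review workflows.
```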
### Online Research
- Codex CLI full documentation — `codex exec`, `codex review`, `codex exec review`, sandbox modes, model selection
- Gemini CLI documentation — `gemini` command surface, non-interactive mode, sandbox, model selection
- Claude CLI (Claude Code) documentation — `claude` command surface, `-p` print mode, `--model`, `--allowedTools`
- How OpenAI Symphony uses Codex CLI programmatically — the App-Server Protocol
- How superpowers integrates with multiple hosts (Claude Code, Codex, Gemini CLI, OpenCode, Cursor)
- How aider integrates with multiple LLM providers — model routing across providers
- Comparison: Codex sandbox vs. Claude Code sandbox vs. Gemini CLI sandbox — capabilities and limitations

---

440
+ ## 7. Interactive Claude — Arrow-Key Selection
441
+
442
+ **What:** When Claude asks the user a question with options, use arrow-key selection instead of typing numbers.
443
+
444
+ **Reference:** https://github.com/bladnman/ideation_team_skill — does this really well in Claude Code.
445
+
446
+ **Also:** Remove all attempts to use interactive CLI prompts (`wazir init --force` with `@inquirer/prompts`). Arrow-key prompts ALWAYS fail in Claude Code's Bash tool. Handle init questions in the skill/conversation instead.
447
+
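The failure mode above is detectable up front: arrow-key prompts need a TTY, and Claude Code's Bash tool does not provide one. A minimal sketch of a TTY check plus a numbered-list fallback (function names here are illustrative, not Wazir APIs):

```javascript
// Arrow-key prompts require a TTY on both ends; otherwise fall back.
function canUseArrowKeys() {
  return Boolean(process.stdin.isTTY && process.stdout.isTTY);
}

// Numbered-list fallback for non-TTY hosts (e.g. Claude Code's Bash tool).
function renderFallback(question, options) {
  const lines = options.map((opt, i) => `  ${i + 1}. ${opt}`);
  return [question, ...lines, "Reply with a number."].join("\n");
}
```

In practice the cleaner fix is the one this item proposes: ask in the skill/conversation and never spawn a prompt library at all.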
**Research needed:**
- How does bladnman/ideation_team_skill implement arrow-key selection?
- Is this a Claude Code native capability (AskUserQuestion with options)?
- Can we use it in skills without external dependencies?

**Decision needed:** Is this a Claude Code platform feature we can use, or does it require a custom implementation?

### Online Research
- https://github.com/bladnman/ideation_team_skill — how it implements interactive selection in Claude Code
- Claude Code `AskUserQuestion` tool — does it support structured options with selection?
- How Inquirer.js handles prompts in non-TTY environments — relevant for understanding why CLI prompts fail
- How superpowers handles user interaction in skills — numbered lists vs. native selection
- How Cursor handles inline user choices — dropdowns, quick-pick menus
- How VS Code extension API handles `showQuickPick` — native selection pattern
- How GitHub CLI (`gh`) handles interactive prompts with `--prompter` — cross-environment pattern

---

## 8. Skill Tier Re-evaluation

**What:** Currently all 25 skills are Own tier. Re-evaluate whether some can be Delegate.

**Approach:**
1. Move Command Routing + Codebase Exploration preambles from every SKILL.md to CLAUDE.md
2. Re-evaluate: without preambles, which skills are pure superpowers forks?
3. Candidates for Delegate: finishing-a-development-branch, dispatching-parallel-agents, using-git-worktrees, receiving-code-review, requesting-code-review

**Blocker:** Superpowers shadowing is full-override (R2 finding). Even if we delegate, we can't augment. It's all-or-nothing per skill.

**Decision needed:** Is it worth the effort to delegate 5 skills? Trade-off: less maintenance vs. less control.

### Online Research
- Superpowers v5.0.5 skill loading mechanism — `skills-core.js` resolveSkillPath() behavior
- Claude Code native skills directory (`~/.claude/skills/`) — how discovery and priority work
- How other Claude Code plugins handle skill conflicts with superpowers
- Superpowers roadmap — any planned extensibility/composition mechanism?

---

## 9. PreToolUse:Write and PreToolUse:Edit Hook Errors

**What:** These hooks are throwing errors. Needs investigation.

**Decision needed:** Priority? Is this blocking anything?

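One defensive pattern worth considering during the investigation: keep the hook's decision logic pure and make internal failures fail open, so a hook bug never blocks the user's edits. A sketch assuming Claude Code's documented hook exit-code convention (0 allows the tool call, 2 blocks it); verify against the current hooks contract:

```javascript
// Decide a hook exit code from the JSON event Claude Code pipes to stdin.
// Assumed convention: 0 = allow the tool call, 2 = block it.
function decideExitCode(rawEventJson) {
  try {
    const event = JSON.parse(rawEventJson);
    // ...real policy checks on event.tool_name / event.tool_input go here...
    if (!event.tool_name) return 0;
    return 0; // allow by default
  } catch {
    // A malformed event or an internal bug should fail open,
    // not block every Write/Edit.
    return 0;
  }
}
```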
### Online Research
- Claude Code hooks documentation — PreToolUse hook contract, exit codes, error handling
- How superpowers hooks handle errors — their error recovery patterns
- Claude Code GitHub issues related to hook errors — known bugs or limitations

---

## 10. Clarifier Overhaul Reconciliation

**What:** A 14-issue clarifier overhaul was in progress (paused at Phase 1A review pass 1) before this session. The changes we shipped may conflict.

**Decision needed:** Resume the overhaul or consider it superseded by the enhancement session changes?

### Online Research
- No online research needed — this is internal reconciliation. Compare the 14 issues against what was shipped.

---

## 18. ENFORCE Pipeline — Agent Must Never Skip Phases

**What:** The Wazir pipeline was completely bypassed in a real run. The agent saw a detailed input and skipped clarify/specify/design/plan/verify/review — it went straight to parallel implementation. This defeats the purpose of Wazir.

**Root cause:** The `/wazir` skill says "Run the entire pipeline end-to-end" but has no enforcement mechanism. The agent can rationalize "the spec is already clear" and jump to execution. There's no hard gate that prevents this.

**How to enforce:**
1. **Mandatory phase artifacts** — the execution phase MUST NOT start without clarification.md, spec-hardened.md, design.md, and execution-plan.md in the run directory. The executor skill should CHECK for these files and REFUSE to start if they're missing.
2. **Phase entry validation** — `wazir capture event --phase execute` should fail if the previous phases haven't completed (check events.ndjson for phase_exit events)
3. **Skill-level hard gates** — each phase's skill should verify the previous phase's artifact exists before proceeding. Not a suggestion — a file-existence check that blocks.
4. **Antipattern entry** — add "skipping pipeline phases because the input looks clear enough" to `expertise/antipatterns/process/ai-coding-antipatterns.md`
5. **No autonomous scope reduction** — the clarifier MUST NOT drop items from the user's input into "future tiers" or "deferred" categories without explicit user approval. If the user provides 10 items, all 10 must appear in the execution plan. The clarifier can suggest prioritization but CANNOT unilaterally cut scope. This is a hard rule: `items_in_plan >= items_in_input` unless the user explicitly approves the reduction.

**This is the most important enforcement in Wazir.** If the pipeline can be bypassed, nothing else matters.

### Online Research
- How CrewAI Task Guardrails enforce mandatory phase completion before advancing
- How LangGraph's `interrupt()` creates hard gates that can't be rationalized away
- How GitHub Actions enforces job dependencies — `needs:` field prevents out-of-order execution
- How Temporal.io enforces workflow step ordering — each step must complete before the next starts

---

## 17. Content Author Should Activate for Any Content Need

**What:** Currently the content-author phase is disabled by default and documented as "for content-heavy projects." That framing is wrong — it should activate whenever ANY task needs content of any kind.

**Examples where content-author should activate:**
- Database seeding (seed data, fixtures, sample records)
- Sample content for UI (placeholder text, demo data)
- Test fixtures (mock API responses, test data files)
- Translations / i18n strings
- Copy (button labels, error messages, onboarding text)
- Documentation content (user guides, API docs)
- Email templates, notification text

**Current state:**
- `workflows/author.md` exists — the content-author role writes non-code content
- `roles/content-author.md` exists — with i18n, editorial, humanize expertise
- `composition-map.yaml` has content-author entries for 15+ concerns
- The phase is `enabled: false` by default in phase_policy

**What needs to change:**
- The clarifier should detect content needs during spec hardening and auto-enable the author phase
- The author phase should run AFTER spec approval and BEFORE planning (so the plan includes content tasks)
- Content artifacts should be reviewed like any other artifact

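Content-need detection in the clarifier could start as a simple signal scan over the hardened spec. A sketch with an illustrative signal list drawn from the examples above (not a shipped heuristic), using a hypothetical `needsContentAuthor` helper:

```javascript
// Illustrative content signals from the examples above; not the shipped list.
const CONTENT_SIGNALS = [
  "seed data", "fixture", "placeholder", "demo data",
  "i18n", "translation", "error message", "onboarding",
  "user guide", "email template", "notification text",
];

// True if the spec mentions any kind of non-code content work.
function needsContentAuthor(specText) {
  const text = specText.toLowerCase();
  return CONTENT_SIGNALS.some((signal) => text.includes(signal));
}
```

A match would auto-enable the author phase for that run, per the first change listed above.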
### Online Research
- How Contentful / Sanity handle content modeling and structured content — content-as-data patterns
- How Storybook handles sample content / mock data for component development
- How i18n frameworks (i18next, react-intl) handle content authoring workflows
- How Faker.js / seed libraries structure realistic seed data generation
- How design systems handle content guidelines (voice/tone, placeholder standards)
+ - How design systems handle content guidelines (voice/tone, placeholder standards)