npm - @wazir-dev/cli - Versions diffs - 1.2.0 → 1.4.0 - Mend

@wazir-dev/cli 1.2.0 → 1.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (161)

package/CHANGELOG.md +54 -44
package/README.md +13 -13
package/assets/demo.cast +47 -0
package/assets/demo.gif +0 -0
package/docs/anti-patterns/AP-23-skipping-enabled-workflows.md +28 -0
package/docs/anti-patterns/AP-24-clarifier-deciding-scope.md +34 -0
package/docs/concepts/architecture.md +1 -1
package/docs/concepts/why-wazir.md +1 -1
package/docs/readmes/INDEX.md +1 -1
package/docs/readmes/features/expertise/README.md +1 -1
package/docs/readmes/features/hooks/pre-compact-summary.md +1 -1
package/docs/reference/hooks.md +1 -0
package/docs/reference/launch-checklist.md +3 -3
package/docs/reference/review-loop-pattern.md +3 -2
package/docs/reference/skill-tiers.md +2 -2
package/docs/research/2026-03-20-agents/a18fb002157904af5.txt +187 -0
package/docs/research/2026-03-20-agents/a1d0ac79ac2f11e6f.txt +2 -0
package/docs/research/2026-03-20-agents/a324079de037abd7c.txt +198 -0
package/docs/research/2026-03-20-agents/a357586bccfafb0e5.txt +256 -0
package/docs/research/2026-03-20-agents/a4365394e4d753105.txt +137 -0
package/docs/research/2026-03-20-agents/a492af28bc52d3613.txt +136 -0
package/docs/research/2026-03-20-agents/a4984db0b6a8eee07.txt +124 -0
package/docs/research/2026-03-20-agents/a5b30e59d34bbb062.txt +214 -0
package/docs/research/2026-03-20-agents/a5cf7829dab911586.txt +165 -0
package/docs/research/2026-03-20-agents/a607157c30dd97c9e.txt +96 -0
package/docs/research/2026-03-20-agents/a60b68b1e19d1e16b.txt +115 -0
package/docs/research/2026-03-20-agents/a722af01c5594aba0.txt +166 -0
package/docs/research/2026-03-20-agents/a787bdc516faa5829.txt +181 -0
package/docs/research/2026-03-20-agents/a7c46d1bba1056ed2.txt +132 -0
package/docs/research/2026-03-20-agents/a7e5abbab2b281a0d.txt +100 -0
package/docs/research/2026-03-20-agents/a8dbadc66cd0d7d5a.txt +95 -0
package/docs/research/2026-03-20-agents/a904d9f45d6b86a6d.txt +75 -0
package/docs/research/2026-03-20-agents/a927659a942ee7f60.txt +102 -0
package/docs/research/2026-03-20-agents/a962cb569191f7583.txt +125 -0
package/docs/research/2026-03-20-agents/aab6decea538aac41.txt +148 -0
package/docs/research/2026-03-20-agents/abd58b853dd938a1b.txt +295 -0
package/docs/research/2026-03-20-agents/ac009da573eff7f65.txt +100 -0
package/docs/research/2026-03-20-agents/ac1bc783364405e5f.txt +190 -0
package/docs/research/2026-03-20-agents/aca5e2b57fde152a0.txt +132 -0
package/docs/research/2026-03-20-agents/ad849b8c0a7e95b8b.txt +176 -0
package/docs/research/2026-03-20-agents/adc2b12a4da32c962.txt +258 -0
package/docs/research/2026-03-20-agents/af97caaaa9a80e4cb.txt +146 -0
package/docs/research/2026-03-20-agents/afc5faceee368b3ca.txt +111 -0
package/docs/research/2026-03-20-agents/afdb282d866e3c1e4.txt +164 -0
package/docs/research/2026-03-20-agents/afe9d1f61c02b1e8d.txt +299 -0
package/docs/research/2026-03-20-agents/b4hmkwril.txt +1856 -0
package/docs/research/2026-03-20-agents/b80ptk89g.txt +1856 -0
package/docs/research/2026-03-20-agents/bf54s1jss.txt +1150 -0
package/docs/research/2026-03-20-agents/bhd6kq2kx.txt +1856 -0
package/docs/research/2026-03-20-agents/bmb2fodyr.txt +988 -0
package/docs/research/2026-03-20-agents/bmmsrij8i.txt +826 -0
package/docs/research/2026-03-20-agents/bn4t2ywpu.txt +2175 -0
package/docs/research/2026-03-20-agents/bu22t9f1z.txt +0 -0
package/docs/research/2026-03-20-agents/bwvl98v2p.txt +738 -0
package/docs/research/2026-03-20-agents/psych-a3697a7fd06eb64fd.txt +135 -0
package/docs/research/2026-03-20-agents/psych-a37776fabc870feae.txt +123 -0
package/docs/research/2026-03-20-agents/psych-a5b1fe05c0589efaf.txt +2 -0
package/docs/research/2026-03-20-agents/psych-a95c15b1f29424435.txt +76 -0
package/docs/research/2026-03-20-agents/psych-a9c26f4d9172dde7c.txt +2 -0
package/docs/research/2026-03-20-agents/psych-aa19c69f0ca2c5ad3.txt +2 -0
package/docs/research/2026-03-20-agents/psych-aa4e4cb70e1be5ecb.txt +95 -0
package/docs/research/2026-03-20-agents/psych-ab5b302f26a554663.txt +102 -0
package/docs/research/2026-03-20-deep-research-complete.md +101 -0
package/docs/research/2026-03-20-deep-research-status.md +38 -0
package/docs/research/2026-03-20-enforcement-research.md +107 -0
package/expertise/antipatterns/process/ai-coding-antipatterns.md +117 -0
package/expertise/composition-map.yaml +27 -8
package/expertise/digests/reviewer/ai-coding-digest.md +83 -0
package/expertise/digests/reviewer/architectural-thinking-digest.md +63 -0
package/expertise/digests/reviewer/architecture-antipatterns-digest.md +49 -0
package/expertise/digests/reviewer/code-smells-digest.md +53 -0
package/expertise/digests/reviewer/coupling-cohesion-digest.md +54 -0
package/expertise/digests/reviewer/ddd-digest.md +60 -0
package/expertise/digests/reviewer/dependency-risk-digest.md +40 -0
package/expertise/digests/reviewer/error-handling-digest.md +55 -0
package/expertise/digests/reviewer/review-methodology-digest.md +49 -0
package/exports/hosts/claude/.claude/commands/learn.md +61 -8
package/exports/hosts/claude/.claude/commands/plan-review.md +3 -1
package/exports/hosts/claude/.claude/commands/verify.md +30 -1
package/exports/hosts/claude/.claude/settings.json +7 -6
package/exports/hosts/claude/export.manifest.json +8 -5
package/exports/hosts/claude/host-package.json +3 -0
package/exports/hosts/codex/export.manifest.json +8 -5
package/exports/hosts/codex/host-package.json +3 -0
package/exports/hosts/cursor/.cursor/hooks.json +6 -6
package/exports/hosts/cursor/export.manifest.json +8 -5
package/exports/hosts/cursor/host-package.json +3 -0
package/exports/hosts/gemini/export.manifest.json +8 -5
package/exports/hosts/gemini/host-package.json +3 -0
package/hooks/definitions/pretooluse_dispatcher.yaml +26 -0
package/hooks/definitions/pretooluse_pipeline_guard.yaml +22 -0
package/hooks/definitions/stop_pipeline_gate.yaml +22 -0
package/hooks/hooks.json +7 -6
package/hooks/pretooluse-dispatcher +84 -0
package/hooks/pretooluse-pipeline-guard +9 -0
package/hooks/stop-pipeline-gate +9 -0
package/llms-full.txt +48 -18
package/package.json +2 -3
package/schemas/decision.schema.json +15 -0
package/schemas/hook.schema.json +4 -1
package/schemas/phase-report.schema.json +9 -0
package/skills/TEMPLATE-3-ZONE.md +160 -0
package/skills/brainstorming/SKILL.md +137 -21
package/skills/clarifier/SKILL.md +364 -53
package/skills/claude-cli/SKILL.md +91 -12
package/skills/codex-cli/SKILL.md +91 -12
package/skills/debugging/SKILL.md +133 -38
package/skills/design/SKILL.md +173 -37
package/skills/dispatching-parallel-agents/SKILL.md +129 -31
package/skills/executing-plans/SKILL.md +113 -25
package/skills/executor/SKILL.md +252 -21
package/skills/finishing-a-development-branch/SKILL.md +107 -18
package/skills/gemini-cli/SKILL.md +91 -12
package/skills/humanize/SKILL.md +92 -13
package/skills/init-pipeline/SKILL.md +90 -18
package/skills/prepare-next/SKILL.md +93 -24
package/skills/receiving-code-review/SKILL.md +90 -16
package/skills/requesting-code-review/SKILL.md +100 -24
package/skills/requesting-code-review/code-reviewer.md +29 -17
package/skills/reviewer/SKILL.md +270 -57
package/skills/run-audit/SKILL.md +92 -15
package/skills/scan-project/SKILL.md +93 -14
package/skills/self-audit/SKILL.md +133 -39
package/skills/skill-research/SKILL.md +275 -0
package/skills/subagent-driven-development/SKILL.md +129 -30
package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +30 -2
package/skills/subagent-driven-development/implementer-prompt.md +40 -27
package/skills/subagent-driven-development/spec-reviewer-prompt.md +25 -12
package/skills/tdd/SKILL.md +125 -20
package/skills/using-git-worktrees/SKILL.md +118 -28
package/skills/using-skills/SKILL.md +116 -29
package/skills/verification/SKILL.md +160 -17
package/skills/wazir/SKILL.md +750 -120
package/skills/writing-plans/SKILL.md +134 -28
package/skills/writing-skills/SKILL.md +91 -13
package/skills/writing-skills/anthropic-best-practices.md +104 -64
package/skills/writing-skills/persuasion-principles.md +100 -34
package/tooling/src/capture/command.js +46 -2
package/tooling/src/capture/decision.js +40 -0
package/tooling/src/capture/store.js +33 -0
package/tooling/src/capture/user-input.js +66 -0
package/tooling/src/checks/security-sensitivity.js +69 -0
package/tooling/src/cli.js +28 -26
package/tooling/src/config/depth-table.js +60 -0
package/tooling/src/export/compiler.js +7 -8
package/tooling/src/guards/guardrail-functions.js +131 -0
package/tooling/src/guards/phase-prerequisite-guard.js +97 -3
package/tooling/src/hooks/pretooluse-dispatcher.js +300 -0
package/tooling/src/hooks/pretooluse-pipeline-guard.js +141 -0
package/tooling/src/hooks/stop-pipeline-gate.js +92 -0
package/tooling/src/init/auto-detect.js +0 -2
package/tooling/src/init/command.js +3 -95
package/tooling/src/learn/pipeline.js +177 -0
package/tooling/src/state/db.js +251 -2
package/tooling/src/state/pipeline-state.js +262 -0
package/tooling/src/status/command.js +6 -1
package/tooling/src/verify/proof-collector.js +299 -0
package/wazir.manifest.yaml +3 -0
package/workflows/learn.md +61 -8
package/workflows/plan-review.md +3 -1
package/workflows/verify.md +30 -1

package/docs/research/2026-03-20-deep-research-complete.md ADDED

@@ -0,0 +1,101 @@
+# Deep Research Complete — 2026-03-20
+## 25 Research Agents — ALL COMPLETED
+Total output: ~6.8MB across 25 agents. Full transcripts at:
+`/private/tmp/claude-501/-Users-mohamedabdallah-Work-Wazir/96398c18-9868-43bc-a4d6-d7f388880d4a/tasks/`
+## Executive Summary
+### The Architecture for Wazir v2
+**Three-layer enforcement pyramid:**
+1. **Hooks** (mechanical, can't bypass): Stop blocks completion, PreToolUse blocks writes/commits/pushes
+2. **Subagent isolation** (architectural, can't see full pipeline): one agent per phase, controller holds the loop
+3. **Persuasion engineering** (behavioral, won't bypass): superpowers-style rationalization tables, red flags, authority language
+### Key Findings
+**Hooks:**
+- Stop hook CAN block completion (`{"decision": "block"}`) — proven by ralph-loop (492+ iterations)
+- PreToolUse has 7 decision patterns: silent allow, advisory, systemMessage, modify, JSON deny, exit-code deny, echo-trick redirect
+- State tracking via `pipeline-state.json` — hooks read, CLI writes, atomic temp+rename
+- Critical limitations: hooks can block but not compel; "hook error" labels poison the model; SubagentStop is broken; agent can escape via AskUserQuestion
+- Must check `stop_hook_active` to prevent infinite loops; must allow context-limit and user-abort stops
+**Subagent Architecture:**
+- Controller-as-orchestrator: wz:wazir holds the loop, dispatches one subagent per phase
+- Each subagent gets fresh 200K context (~165K usable after overhead)
+- No nesting (depth=1) — controller dispatches ALL subagents directly
+- File-mediated handoff (MetaGPT pattern): artifacts on disk, not in context
+- Artifact dependency: each artifact has `requires` block with predecessor digest for staleness detection
+- Guardrail functions per phase boundary with concrete pass/fail criteria
+- Retry ladder: same-model×2 → model-escalation×1 → human escalation
+- Error classification: transient (retry), quality (retry+feedback), deterministic (escalate), resource (model-escalate)
+**Persuasion Engineering:**
+- Superpowers is 100% prompt engineering, zero mechanical enforcement — and agents STILL skip (issue #463)
+- Meincke et al. 2025: persuasion doubles compliance (33%→72%, N=28,000, p<.001)
+- Best combination: Authority + Commitment + Scarcity
+- CSO critical: skill descriptions must be triggers only, never process summaries
+- 47 rationalization entries across 5 superpowers skills — Wazir has ZERO
+- "Violating the letter is violating the spirit" — single most impactful sentence
+- TDD for skills: RED (observe baseline failures) → GREEN (write skill addressing those) → REFACTOR (close new loopholes)
+**Learning System:**
+- 4-stage pipeline: Tally → Candidate → Promote → Active
+- Findings classified by 8 categories × 4 severity levels
+- Recurrence detection via finding_hash dedup (PagerDuty pattern)
+- Semi-automatic promotion: auto-propose, human-approve (CodeGuru + Snyk model)
+- Drift prevention: 30 active project learnings max, 90-day TTL, 5% hit-rate demotion, principle consolidation at 25+ entries
+- Decision audit trail: v2 schema with category, alternatives, confidence, outcome_ref, supersedes
+- User feedback: capture corrections/approvals in ndjson, classify signal vs noise
+**Review Architecture:**
+- Two-tier: internal (Sonnet, expertise-loaded, pattern-matching) → external (Codex, fresh eyes, unknown-unknowns)
+- Critical finding: reviewer always-layer is 99K tokens against 50K ceiling — 5 of 8 modules are dropped
+- Fix: mode-specific reviewer composition (different modules per review mode)
+- Reviewer digest modules: 3-5K tokens each (not 12K originals)
+- Findings classified by 8 categories (correctness, security, completeness, wiring, verification, drift, performance, style)
+- Auto-classification rules per category with severity floors
+- Feedback-to-learning: 7-step loop, LLM-assisted clustering for pattern detection
+**Interactive UX:**
+- AskUserQuestion: 1-4 questions, 2-4 options each, arrow-key selection, multiSelect supported
+- Known bug: do NOT list AskUserQuestion in a skill's allowed-tools (causes empty answers)
+- Progressive disclosure: status line (what) → paragraph (why) → full report (everything)
+- Key formula: "Name the action. State the dependency. Omit the journey."
+- 5 progress patterns: phase map, meaningful updates, artifact previews, time estimates, heartbeat
+- Heartbeat: never >2min silence (standard), >90s (deep), >3min (quick)
+- Steerability: classify mutation level → show impact → selective regeneration → preserve completed work
+- Three modes: auto (gating agent steers) / guided (checkpoints steer) / interactive (continuous steer)
+## Agent Output Index
+| # | Agent | Key Deliverable |
+|---|-------|----------------|
+| 1 | Stop hook patterns | Complete blueprint for pipeline-gate Stop hook with 10 edge cases |
+| 2 | PreToolUse catalog | 7 decision patterns with code examples from 4 real plugins |
+| 3 | State machine design | pipeline-state.json schema with 30+ fields, update rules, session isolation |
+| 4 | Hook limitations | 13 limitations with workarounds, including "hook error" label poisoning |
+| 5 | Persuasion playbook | 10 patterns, 47 rationalization entries, CSO rules, implementation checklist |
+| 6 | Controller pattern | Hybrid architecture: flat orchestration with file-mediated handoff |
+| 7 | Artifact dependencies | Per-phase schemas with requires/digest, write-time validation |
+| 8 | Context isolation | 200K per subagent, no nesting, MCP tool caveats, MetaGPT pub-sub |
+| 9 | Guardrail validation | 6 guardrail functions with concrete pass/fail criteria per phase |
+| 10 | Failure + retry | 3-tier ladder (same-model→escalate→human), error classification |
+| 11 | AskUserQuestion API | Full schema, 2-4 options, multiSelect, known bugs, plugin examples |
+| 12 | Showing reasoning | Progressive disclosure templates at 3 levels with anti-patterns |
+| 13 | Depth parameters | (in bladnman analysis) 4 depth levels with per-parameter tables |
+| 14 | Steerability | Mutation classification, impact assessment, selective regeneration |
+| 15 | Progress reporting | 5 patterns (phase map, finding updates, previews, time estimates, heartbeat) |
+| 16 | Findings → antipatterns | 4-stage promotion pipeline, 3+ occurrence threshold, human gate |
+| 17 | Cumulative tracking | SQLite schema (5 tables), dedup algorithm, recurrence detection |
+| 18 | Drift prevention | 7 mechanisms with concrete limits (30 active, 90-day TTL, 5% demotion) |
+| 19 | Decision audit trail | v2 schema with alternatives, confidence, outcome correlation |
+| 20 | User feedback capture | Signal classification, correction weighting, ndjson format |
+| 21 | Two-tier review | Internal→external, critical asymmetry (known vs unknown unknowns) |
+| 22 | Reviewer composition | Mode-specific modules, 3-5K digests, 50K budget analysis |
+| 23 | Findings classification | 8 categories × 4 severities, auto-classification rules |
+| 24 | Feedback-to-learning | 7-step loop, LLM clustering, minimum viable phases A-D |
+| 25 | Proof-of-implementation | Per-type matrix (web/API/CLI/library), Playwright MCP, Symphony model |
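
The retry ladder and error classification summarized in this file (same-model×2, then model-escalation×1, then human; transient, quality, deterministic, resource) can be sketched roughly as follows. This is only an illustration, not the package's implementation: the function name, the returned step labels, and the reading of "escalate" for deterministic errors as "go straight to a human" are all assumptions.

```python
# Hypothetical sketch of the retry ladder described in the research summary.
# Limits come from the text: two same-model retries, one model escalation.
SAME_MODEL_RETRIES = 2
MODEL_ESCALATIONS = 1
ERROR_CLASSES = {"transient", "quality", "deterministic", "resource"}


def next_step(error_class, same_model_retries_used=0, escalations_used=0):
    """Pick the next rung of the ladder for a failed phase."""
    if error_class not in ERROR_CLASSES:
        raise ValueError(f"unknown error class: {error_class}")
    # Assumption: a deterministic failure will not change on retry,
    # so it skips the ladder and goes to a human.
    if error_class == "deterministic":
        return "human"
    # Transient and quality errors retry on the same model first;
    # quality errors carry reviewer feedback into the retry.
    if error_class in ("transient", "quality") and same_model_retries_used < SAME_MODEL_RETRIES:
        return "retry" if error_class == "transient" else "retry-with-feedback"
    # Resource errors, and exhausted same-model retries, move to a stronger model once.
    if escalations_used < MODEL_ESCALATIONS:
        return "model-escalate"
    return "human"
```

A controller would call `next_step` after each failed attempt and update the two counters itself; keeping the function pure makes the ladder easy to test in isolation.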

package/docs/research/2026-03-20-deep-research-status.md ADDED

@@ -0,0 +1,38 @@
+# Deep Research Status — 2026-03-20
+## 25 Research Agents — Progress
+### Completed (14/25)
+1. Hook: Stop hook patterns (ralph-loop analysis) ✅
+2. Hook: PreToolUse catalog (7 decision patterns) ✅
+3. Hook: State machine design (pipeline-state.json) ✅
+4. Subagent: Artifact dependencies (per-phase schemas) ✅
+5. Subagent: Guardrail validation (per-phase functions) ✅
+6. Subagent: Failure + retry (3-tier ladder) ✅
+7. Interactive: Showing reasoning (progressive disclosure) ✅
+8. Learning: Findings → antipatterns (4-stage pipeline) ✅
+9. Learning: Cumulative tracking (SQLite schema) ✅
+10. Learning: Drift prevention (7 mechanisms) ✅
+11. Learning: User feedback capture ✅
+12. Review: Feedback-to-learning pipeline (7-step loop) ✅
+13. Review: Proof-of-implementation (per-type matrix) ✅
+14. Hook: Persuasion engineering (superpowers analysis) — in first batch ✅
+### Pending (11/25)
+15. Hook: Limitations + workarounds
+16. Subagent: Controller pattern
+17. Subagent: Context isolation
+18. Interactive: AskUserQuestion API
+19. Interactive: Depth parameters
+20. Interactive: Steerability
+21. Interactive: Progress reporting
+22. Review: Two-tier architecture
+23. Review: Reviewer composition
+24. Review: Findings classification
+25. Learning: Decision audit trail
+## Key Findings So Far
+All research output files at: /private/tmp/claude-501/-Users-mohamedabdallah-Work-Wazir/96398c18-9868-43bc-a4d6-d7f388880d4a/tasks/
+Full synthesis will be compiled when all 25 agents complete.

package/docs/research/2026-03-20-enforcement-research.md ADDED

@@ -0,0 +1,107 @@
+# Enforcement Research — 2026-03-20
+## The Answer
+**Prose instructions don't work. The agent will always rationalize skipping them.** Every framework that achieves reliable enforcement uses the same pattern: **the framework holds the loop, not the agent.**
+## The Three-Layer Strategy
+### Layer 1: Mechanical Hooks (agent CANNOT bypass)
+**Stop hook** blocks completion: `{"decision": "block", "reason": "..."}` — proven by ralph-loop plugin (official marketplace). The agent literally cannot stop until all artifacts exist.
+**PreToolUse hooks** block actions:
+- `PreToolUse:Write|Edit` — blocks implementation code if no plan artifact exists
+- `PreToolUse:Bash` — blocks `git commit` if no tests run, blocks `git push` if no review
+- Returns `permissionDecision: "deny"` — the tool call is prevented entirely
+**State tracking** via `pipeline-state.json` — hooks READ state, CLI WRITES state. No race conditions.
+**Key: command hooks only, never prompt hooks.** Prompt hooks re-introduce the rationalization problem.
+### Layer 2: Subagent Isolation (agent CANNOT see full pipeline)
+From every framework (CrewAI, LangGraph, Symphony, ideation_team_skill): **give the agent a task, not a plan.**
+- Each phase is a separate subagent invocation
+- Phase N+1 receives phase N's artifact as input — if it doesn't exist, the call fails
+- The controller (wazir skill) holds the loop and decides what runs next
+- No single agent can rationalize skipping from research to code
+### Layer 3: Persuasion Engineering (agent WON'T bypass — 72% compliance)
+From superpowers (100K stars, backed by Meincke et al. 2025, N=28,000):
+- **Rationalization tables** — enumerate exact thoughts the agent has when skipping, with rebuttals
+- **"Violating the letter is violating the spirit"** — kills the #1 escape pattern
+- **Red flags lists** — specific phrases that mean STOP
+- **Authority + Commitment + Social Proof** — doubles compliance (33% → 72%)
+- **CSO (Claude Search Optimization)** — skill descriptions must be triggers, never process summaries
+## Key Findings Per Source
+### Claude Code Hooks
+- Stop hook CAN block (`{"decision": "block"}`) — proven by ralph-loop
+- PreToolUse CAN deny AND modify tool calls — proven by context-mode plugin
+- Hooks are stateless but can read/write files for state
+- Hooks loaded at session start, can't be added mid-session
+- **Limitation: hooks block actions but can't compel them**
+### Superpowers (100K stars)
+- 100% prompt engineering, zero mechanical enforcement
+- Single SessionStart hook injects meta-skill in `<EXTREMELY_IMPORTANT>` tags
+- **Issue #463: agents STILL skip reviews** — the author knows it's unsolved
+- Commenter: "The only reliable fix is making reviews structural, not instructional"
+- TDD skill is best-in-class prompt engineering but still fails sometimes
+- Persuasion research: authority language doubles compliance but doesn't reach 100%
+### Framework Enforcement Patterns
+- **CrewAI:** Python for-loop + guardrail functions. Agent produces output, framework validates.
+- **LangGraph:** Channel triggers + NamedBarrierValue. Node can't fire until inputs ready.
+- **Temporal:** `await` keyword is the enforcement. Language-level blocking.
+- **Symphony:** State machine + data dependencies. Each phase produces data the next requires.
+- **GitHub Actions:** `needs:` DAG. Scheduler prevents jobs from starting without dependencies.
+- **Universal pattern:** framework holds program counter, not agent.
+### UX / User Engagement
+- **bladnman/ideation_team_skill:** AskUserQuestion for pre-flight interview, depth-aware parameters, cognitive role separation across agents
+- **Devin:** PR-as-proof, screen recordings, conversational Slack updates, async delegation
+- **Copilot Workspace:** Spec → Plan → Code, each editable. Steerability = trust.
+- **Anthropic:** Show planning steps explicitly, programmatic checks at intermediate steps
+## What Wazir Must Build
+### 1. Pipeline State Machine (hooks + state file)
+```
+SessionStart → initialize pipeline-state.json
+PreToolUse:Write|Edit → deny if phase gate not passed
+PreToolUse:Bash → deny git commit/push without tests/review
+Stop → deny if any enabled workflow incomplete or proof missing
+```
+### 2. Subagent-Per-Phase Architecture
+The `/wazir` skill becomes a CONTROLLER that:
+- Spawns a clarifier subagent → receives clarification artifact
+- Spawns a spec subagent → receives spec artifact
+- Spawns a design subagent → receives design artifact
+- Spawns an executor subagent → receives implementation
+- Spawns a reviewer subagent → receives review verdict
+- Each subagent sees ONLY its phase, not the full pipeline
+### 3. Superpowers-Style Persuasion on Every Skill
+For each discipline rule:
+- Iron Law statement
+- Rationalization table (empirically derived)
+- Red flags list
+- "Violating the letter is violating the spirit"
+- `<EXTREMELY_IMPORTANT>` wrapper on session injection
+### 4. User Engagement Templates
+- Pre-flight interview via AskUserQuestion (batched, not serial)
+- Three-tier progress reporting (status line / key decisions / full record)
+- Artifacts as proof (self-describing, contain lineage and reasoning)
+- Steerability at phase boundaries (edit upstream, regenerate downstream)
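
The hook rules listed under "Pipeline State Machine" in this file could be expressed as two pure decision functions, sketched below. The `{"decision": "block"}` and `permissionDecision: "deny"` outputs and the `stop_hook_active` check come from the research notes. Everything else is assumed for illustration: the state keys (`plan_artifact_exists`, `tests_passed`, `review_passed`, `workflows`), the `tool_name` / `tool_input` input fields, and the `hookSpecificOutput` envelope, none of which is confirmed by the diff.

```python
# Hypothetical sketch: hook decisions derived from a pipeline-state dict.
# The state schema below is invented for illustration.

def deny(reason):
    """Build a PreToolUse denial (envelope shape is an assumption)."""
    return {
        "hookSpecificOutput": {
            "hookEventName": "PreToolUse",
            "permissionDecision": "deny",
            "permissionDecisionReason": reason,
        }
    }


def pretooluse_decision(hook_input, state):
    """Return a denial dict, or {} to let the tool call proceed."""
    tool = hook_input.get("tool_name")
    if tool in ("Write", "Edit") and not state.get("plan_artifact_exists"):
        return deny("No plan artifact exists yet")
    if tool == "Bash":
        command = hook_input.get("tool_input", {}).get("command", "")
        if "git commit" in command and not state.get("tests_passed"):
            return deny("Tests have not been run")
        if "git push" in command and not state.get("review_passed"):
            return deny("Review has not passed")
    return {}


def stop_decision(hook_input, state):
    """Block completion while any enabled workflow is incomplete."""
    # Let the stop through if a Stop hook already forced a continuation,
    # otherwise the agent could loop forever.
    if hook_input.get("stop_hook_active"):
        return {}
    missing = sorted(
        name
        for name, workflow in state.get("workflows", {}).items()
        if workflow.get("enabled") and not workflow.get("complete")
    )
    if missing:
        return {"decision": "block", "reason": "Incomplete workflows: " + ", ".join(missing)}
    return {}
```

A real hook script would read the hook input as JSON from stdin, load `pipeline-state.json`, and print the returned dict as JSON. The notes also require allowing context-limit and user-abort stops, which this sketch does not cover.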

package/expertise/antipatterns/process/ai-coding-antipatterns.md CHANGED

@@ -917,6 +917,121 @@ Plan: "10 tasks covering all 10 items. Suggested order: [...]"
 ---
+### AP-23: Stale Documentation Counts
+**Also known as:** Count Drift, Number Rot, Metric Desync
+**Frequency:** Very Common
+**Severity:** Medium
+**Detection difficulty:** Low (mechanical)
+**What it looks like:**
+Documentation claims "268 expertise modules" when the actual count is 315. README says "7 hooks" when 8 exist. Counts in multiple files diverge from each other and from reality. The numbers were correct when written but drifted as the project grew.
+**Why AI agents do it:**
+Agents update the source of truth (add a new hook, write new expertise modules) but do not grep for every downstream reference. Each file is edited in isolation. No automated check enforces that prose counts match filesystem reality.
+**What goes wrong:**
+Users see contradictory numbers across docs and lose trust. Reviewers waste time verifying which number is correct. Launch materials ship with wrong counts, creating a first impression of sloppiness.
+**Detection signals:**
+- `find expertise -name '*.md' | wc -l` disagrees with counts in README, architecture docs, and readmes
+- `ls hooks/definitions/ | wc -l` disagrees with hook count claims
+- Different files claim different counts for the same metric
+**The fix:**
+1. **Self-audit loop** — run `wazir validate docs` which cross-references prose claims against filesystem counts
+2. **Single source of truth** — reference manifest counts programmatically where possible; avoid hardcoding counts in prose
+3. **Grep sweep on every addition** — when adding a new module, hook, or skill, grep for the old count and update all references
+4. **CI enforcement** — `wazir validate docs` in CI catches drift before merge
+**Example:**
+Bad:
+```
+README.md: "268 expertise modules"
+architecture.md: "268 curated knowledge modules"
+expertise/README.md: "268 knowledge modules"
+Actual count: 315
+```
+Good:
+```
+README.md: "315 expertise modules"
+architecture.md: "315 curated knowledge modules"
+expertise/README.md: "315 knowledge modules"
+Actual count: 315
+All references match.
+```
+**Related:** AP-06 (Partial Updates — same root cause applied to code), `wazir validate docs`, self-audit skill
+---
+### AP-24: Silent Checkpoint Bypass
+**Also known as:** Gate Ghosting, Approval Amnesia, Review Skipping
+**Frequency:** Common
+**Severity:** Critical
+**Detection difficulty:** Moderate
+**What it looks like:**
+The agent reaches an approval gate (spec-challenge, plan-review, or final review) and proceeds without obtaining explicit reviewer approval. The gate exists in the workflow definition but the agent treats it as advisory, not blocking. Review artifacts are either missing or contain self-generated approvals.
+**Why AI agents do it:**
+The agent conflates "review" with "self-review." Without a hard external gate (different model, different session, or user confirmation), the agent reviews its own work and approves it. Optimism bias means self-review almost never rejects. The agent also optimizes for speed, and gates are the slowest part of the pipeline.
+**What goes wrong:**
+Spec errors propagate to implementation. Design flaws survive to production. The entire adversarial review structure becomes theater — gates exist on paper but provide no actual quality assurance. Bugs caught in final review could have been caught in spec-challenge at 10x lower cost.
+**Detection signals:**
+- Review pass files authored by the same agent that authored the reviewed artifact
+- Approval granted on the first pass with zero findings
+- Missing review artifacts in the run state directory
+- `wazir capture loop-check` shows 0 review iterations for a gate phase
+**The fix:**
+1. **External reviewer enforcement** — gate phases must invoke a different model or require user confirmation via `AskUserQuestion`
+2. **Minimum findings threshold** — first-pass reviews that report zero findings trigger a warning; real adversarial review almost always finds something
+3. **Artifact validation** — `wazir validate runtime` checks that review artifacts exist and were not authored by the same role as the reviewed artifact
+4. **Loop cap guard** — `hooks/loop-cap-guard` tracks review iterations; zero iterations at a gate phase is a validation failure
+**Example:**
+Bad:
+```
+# spec-challenge pass 1
+Reviewer: executor (same agent)
+Findings: 0
+Decision: APPROVED
+```
+Good:
+```
+# spec-challenge pass 1
+Reviewer: codex-cli (external model)
+Findings: 3 (ambiguous acceptance criteria, missing edge case, unclear priority)
+Decision: REVISE
+# spec-challenge pass 2
+Reviewer: codex-cli (external model)
+Findings: 0 (all 3 resolved)
+Decision: APPROVED
+```
+**Related:** AP-21 (Pipeline Phase Skipping — bypassing the gate entirely vs. rubber-stamping it), AP-08 (Test Theater — similar pattern of going through motions without rigor), `docs/reference/review-loop-pattern.md`, `hooks/loop-cap-guard`
+---
 ## Code Smell Quick Reference
 | Anti-Pattern | Severity | Frequency | Key Signal | First Action |
@@ -943,6 +1058,8 @@ Plan: "10 tasks covering all 10 items. Suggested order: [...]"
 | AP-20 Resumption Errors | High | Common | Mixed ID types across files | Architecture file in every session |
 | AP-21 Pipeline Phase Skipping | Critical | Common | Missing clarified/* artifacts | Enforce hard gates in skills + CLI |
 | AP-22 Autonomous Scope Reduction | Critical | Common | Plan has fewer tasks than input items | Scope coverage guard + user approval |
+| AP-23 Stale Documentation Counts | Medium | Very Common | Doc counts disagree with filesystem | Grep sweep + `wazir validate docs` |
+| AP-24 Silent Checkpoint Bypass | Critical | Common | Self-approved gate with 0 findings | External reviewer + minimum findings |
 ---
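
The AP-23 detection signal above (prose counts that disagree with the filesystem) lends itself to a mechanical check. The sketch below is a guess at how such a cross-reference might work, not a description of what `wazir validate docs` actually does. The function name and regular expression are assumptions, and the actual count is passed in rather than read from disk.

```python
import re


def find_count_drift(docs, noun, actual):
    """Return (filename, claimed_count) for every prose claim that differs from `actual`.

    docs   -- mapping of filename to file text
    noun   -- the counted thing as it appears in prose, e.g. "modules"
    actual -- the real count, e.g. from counting files on disk
    """
    # Matches "268 modules", "268 expertise modules", "268 curated knowledge modules".
    # Up to two alphabetic words may sit between the number and the noun.
    claim = re.compile(r"(\d+)\s+(?:[A-Za-z-]+\s+){0,2}" + re.escape(noun))
    drift = []
    for name, text in sorted(docs.items()):
        for match in claim.finditer(text):
            claimed = int(match.group(1))
            if claimed != actual:
                drift.append((name, claimed))
    return drift
```

In practice `actual` could come from something like `sum(1 for _ in pathlib.Path("expertise").rglob("*.md"))`, mirroring the `find expertise -name '*.md' | wc -l` signal, and a non-empty result would fail CI.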

package/expertise/composition-map.yaml CHANGED

@@ -27,19 +27,38 @@ always:
     - antipatterns/code/state-management-antipatterns.md
     - quality/evidence-based-verification.md
   reviewer:
-    - antipatterns/process/ai-coding-antipatterns.md
-    - antipatterns/code/code-smells.md
-    - antipatterns/process/code-review-antipatterns.md
-    - antipatterns/code/dependency-antipatterns.md
-    - architecture/foundations/architectural-thinking.md
-    - architecture/foundations/coupling-and-cohesion.md
-    - antipatterns/code/architecture-antipatterns.md
-    - architecture/foundations/domain-driven-design.md
+    # Mode-agnostic core — loaded for ALL review modes (~6K tokens)
+    - digests/reviewer/review-methodology-digest.md
+    - digests/reviewer/ai-coding-digest.md
   content-author:
     - i18n/content/translation-management.md
     - i18n/foundations/string-externalization.md
     - i18n/foundations/pluralization-and-gender.md
+# Mode-specific reviewer composition
+# Loaded ON TOP of always.reviewer based on the --mode flag
+# Total budget per mode: ~15-25K tokens (digests + auto + stack modules)
+reviewer_modes:
+  task-review:
+    - digests/reviewer/code-smells-digest.md
+    - digests/reviewer/error-handling-digest.md
+  spec-challenge:
+    - digests/reviewer/architectural-thinking-digest.md
+    - digests/reviewer/ddd-digest.md
+  design-review:
+    - digests/reviewer/architectural-thinking-digest.md
+    - digests/reviewer/coupling-cohesion-digest.md
+  plan-review:
+    - digests/reviewer/architectural-thinking-digest.md
+    - digests/reviewer/coupling-cohesion-digest.md
+    - digests/reviewer/ai-coding-digest.md
+  final:
+    - digests/reviewer/code-smells-digest.md
+    - digests/reviewer/architecture-antipatterns-digest.md
+    - digests/reviewer/dependency-risk-digest.md
+  research-review: []
+  clarification-review: []
 auto:
   all-stacks:
     all-roles:
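
The new layout loads `reviewer_modes[mode]` on top of `always.reviewer`. A minimal sketch of that resolution follows, with the module lists copied from the hunk above. The function name and the de-duplication step are assumptions; the diff does not show how the real compiler treats `ai-coding-digest.md` appearing in both the core list and `plan-review`.

```python
# Data copied from the composition-map.yaml hunk; resolution logic is hypothetical.
ALWAYS_REVIEWER = [
    "digests/reviewer/review-methodology-digest.md",
    "digests/reviewer/ai-coding-digest.md",
]

REVIEWER_MODES = {
    "task-review": [
        "digests/reviewer/code-smells-digest.md",
        "digests/reviewer/error-handling-digest.md",
    ],
    "spec-challenge": [
        "digests/reviewer/architectural-thinking-digest.md",
        "digests/reviewer/ddd-digest.md",
    ],
    "design-review": [
        "digests/reviewer/architectural-thinking-digest.md",
        "digests/reviewer/coupling-cohesion-digest.md",
    ],
    "plan-review": [
        "digests/reviewer/architectural-thinking-digest.md",
        "digests/reviewer/coupling-cohesion-digest.md",
        "digests/reviewer/ai-coding-digest.md",
    ],
    "final": [
        "digests/reviewer/code-smells-digest.md",
        "digests/reviewer/architecture-antipatterns-digest.md",
        "digests/reviewer/dependency-risk-digest.md",
    ],
    "research-review": [],
    "clarification-review": [],
}


def reviewer_modules(mode):
    """Core digests first, then mode-specific ones, each listed once."""
    if mode not in REVIEWER_MODES:
        raise ValueError(f"unknown review mode: {mode}")
    # dict.fromkeys preserves first-seen order while dropping duplicates.
    return list(dict.fromkeys(ALWAYS_REVIEWER + REVIEWER_MODES[mode]))
```

Under this reading, `plan-review` resolves to four digests rather than five, and the two modes with empty lists load only the mode-agnostic core.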

package/expertise/digests/reviewer/ai-coding-digest.md ADDED

@@ -0,0 +1,83 @@
+# AI Coding Antipatterns — Reviewer Digest
+> Detection-focused extract for reviewer context. For full analysis, see `antipatterns/process/ai-coding-antipatterns.md`.
+## Specification Drift (AP-01)
+- **Signal:** Implementation differs from stated requirements without documented reason
+- **Check:** Compare task spec acceptance criteria against actual code behavior
+- **Severity:** high
+## Hallucinated APIs (AP-02)
+- **Signal:** Import or call to function/class/module that doesn't exist in the dependency tree
+- **Check:** Verify every imported symbol resolves to an actual export
+- **Severity:** critical
+## Outdated Patterns (AP-03)
+- **Signal:** Deprecated APIs, class components in current (2025) React code, callback-based async where promises are standard
+- **Check:** Compare patterns against current library version best practices
+- **Severity:** high
+## Premature Abstraction (AP-04)
+- **Signal:** Generic utility/helper that is used exactly once
+- **Check:** Count call sites for each abstraction introduced
+- **Severity:** medium
+## Context Window Stuffing (AP-05)
+- **Signal:** Agent reads 10+ files without index queries; loads entire modules instead of targeted slices
+- **Check:** Review tool call patterns — excessive Read calls without preceding search
+- **Severity:** low (efficiency, not correctness)
+## Fake Testing (AP-06)
+- **Signal:** Tests that assert implementation details, use mocks that mirror the implementation, or test tautologies
+- **Check:** Would the test fail if the implementation had a real bug? If not, it's fake.
+- **Severity:** high
+## Scope Creep (AP-07)
+- **Signal:** Files modified or features added that were not in the task spec
+- **Check:** Diff includes changes outside the task's specified file scope
+- **Severity:** medium
+## Optimistic Error Handling (AP-08)
+- **Signal:** Missing try/catch around I/O operations, network calls, file operations, JSON parsing
+- **Check:** Every async operation and external call has error handling
+- **Severity:** high
+## Stale Dependency (AP-09)
+- **Signal:** Importing deprecated APIs, using outdated package versions with known CVEs
+- **Check:** Compare package versions against known vulnerability databases
+- **Severity:** medium-high
+## Cargo-Cult Patterns (AP-10)
+- **Signal:** Design patterns applied without the problem they solve (Factory for single type, Observer for single listener)
+- **Check:** Does the pattern's complexity serve a real need?
+- **Severity:** medium
+## Gold Plating (AP-11)
+- **Signal:** Extra configuration, extensibility points, or features not in the spec
+- **Check:** Is every public API/config option traceable to a requirement?
+- **Severity:** medium
+## Sycophantic Compliance (AP-12)
+- **Signal:** Agent implements exactly what was asked even when the request contains contradictions or obvious errors
+- **Check:** Look for requirements that conflict with each other or with the codebase's existing contracts
+- **Severity:** high
+## Phantom Error Handling (AP-13)
+- **Signal:** Error handling code that looks comprehensive but handles errors incorrectly (swallows, retries without backoff, logs without propagating)
+- **Check:** Trace each error path — does it actually reach a handler that does the right thing?
+- **Severity:** high
+## Inconsistent State After Failure (AP-14)
+- **Signal:** Multi-step operations where a failure in step N leaves steps 1..N-1 committed
+- **Check:** Are multi-step mutations wrapped in transactions or compensating actions?
+- **Severity:** high
+## Over-Confident Comments (AP-15)
+- **Signal:** Comments claiming "this handles all edge cases" or "this is thread-safe" without evidence
+- **Check:** Does the code actually handle what the comment claims?
+- **Severity:** medium
+## Training Data Leakage (AP-16)
+- **Signal:** Code that closely mirrors common training examples but doesn't fit the actual use case
+- **Check:** Does the implementation structure match the problem, or does it match a textbook example?
+- **Severity:** medium
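
The AP-02 check ("verify every imported symbol resolves to an actual export") can be partly automated. The sketch below is a Python-specific illustration, not part of the digest: the helper name `unresolved_imports` is hypothetical, it only covers absolute imports, and it actually imports the modules it inspects, so it should only be run against trusted code.

```python
import ast
import importlib


def unresolved_imports(source: str) -> list:
    """List imported modules or symbols that cannot be resolved at import time."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                try:
                    importlib.import_module(alias.name)
                except ImportError:
                    missing.append(alias.name)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            try:
                module = importlib.import_module(node.module)
            except ImportError:
                missing.append(node.module)
                continue
            for alias in node.names:
                if alias.name == "*" or hasattr(module, alias.name):
                    continue
                # The name may be a submodule rather than an attribute.
                try:
                    importlib.import_module(f"{node.module}.{alias.name}")
                except ImportError:
                    missing.append(f"{node.module}.{alias.name}")
    return missing
```

A reviewer could run this over each changed file and treat any non-empty result as a candidate AP-02 finding to confirm by hand, since conditional or platform-specific imports can produce false positives.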

package/expertise/digests/reviewer/architectural-thinking-digest.md ADDED

@@ -0,0 +1,63 @@
+# Architectural Thinking — Reviewer Digest
+> Evaluation-focused extract for reviewer context. For full guidance, see `architecture/foundations/architectural-thinking.md`.
+## Architecture Review Checklist
+### Separation of Concerns
+- Does each module/file have a single clear responsibility?
+- Are business logic, data access, and presentation in separate layers?
+- Can you describe what a module does in one sentence without "and"?
+### Dependency Direction
+- Do dependencies point inward (toward core domain), not outward?
+- Are infrastructure details (DB, HTTP, filesystem) behind abstractions?
+- Could you swap the database without changing business logic?
+### Interface Design
+- Are public APIs minimal (expose only what is needed)?
+- Are contracts (types, schemas, interfaces) explicit and documented?
+- Do functions have clear input/output contracts without hidden side effects?
+### Change Impact
+- Can you add a feature without modifying existing code (Open-Closed)?
+- Are changes localized (changing one feature doesn't cascade across modules)?
+- Is the dependency graph shallow (max 3-4 levels deep)?
+### Reversibility Assessment
+- Which decisions in this diff are hard to reverse?
+- Are irreversible decisions (data models, service boundaries, consistency models) justified with documented reasoning?
+- Are reversible decisions (naming, folder structure, library choices) made quickly without over-analysis?
+### Trade-off Reasoning
+Every architectural decision involves trade-offs. During review, check:
+- Is the trade-off acknowledged? ("We chose X because Y, accepting Z")
+- Is the trade-off appropriate for the context? (startup vs. enterprise, prototype vs. production)
+- Are rejected alternatives documented?
+## Architecture Smells (Quick Detection)
+| Smell | Signal | Severity |
+|-------|--------|----------|
+| **Big Ball of Mud** | No discernible module boundaries; any module calls any other | critical |
+| **Layering Violation** | UI code calling database directly; domain importing from infrastructure | high |
+| **Circular Module Dependency** | Module A depends on Module B, which depends on Module A | high |
+| **God Module** | One module >1000 LOC handling multiple concerns | medium |
+| **Leaky Abstraction** | Internal implementation details exposed in public interface | medium |
+| **Distributed Monolith** | Multiple services that must be deployed together | high |
+| **Accidental Complexity** | Architecture complexity not justified by problem complexity | medium |
+| **Architecture Astronaut** | Abstractions solving problems no one has yet | medium |
+| **Dead End Architecture** | Design choices that prevent future evolution (no extension points, hardcoded assumptions) | high |
+## Quality Attribute Checklist
+When reviewing architectural decisions, verify the relevant quality attributes are addressed:
+| Attribute | Review Question |
+|-----------|----------------|
+| **Performance** | Are there obvious bottlenecks? N+1 queries? Unbounded loops? |
+| **Scalability** | Can this handle 10x load without structural changes? |
+| **Security** | Are trust boundaries enforced? Input validated at boundaries? |
+| **Availability** | What happens when a dependency fails? Is there a fallback? |
+| **Modifiability** | How many files change to add a typical feature? |
+| **Testability** | Can components be tested in isolation without complex setup? |
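
The "Circular Module Dependency" smell in the table above lends itself to a mechanical check once a module dependency graph is available. The sketch below assumes the graph has already been extracted into a plain mapping of module name to dependencies; the function name `find_cycle` and that input shape are illustrative assumptions, not something the digest prescribes.

```python
from typing import Optional

WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored


def find_cycle(graph: dict) -> Optional[list]:
    """Return one dependency cycle as a list of module names, or None."""
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for dep in graph.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so the path loops back to it.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

For example, `find_cycle({"a": ["b"], "b": ["c"], "c": ["a"]})` returns `["a", "b", "c", "a"]`, while a strictly layered graph returns `None`. The traversal is recursive, so very deep graphs would need an iterative variant.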

package/expertise/digests/reviewer/architecture-antipatterns-digest.md ADDED

@@ -0,0 +1,49 @@
+# Architecture Antipatterns — Reviewer Digest
+> Detection-focused extract for reviewer context. For full analysis, see `antipatterns/code/architecture-antipatterns.md`.
+## Structural Antipatterns
+| Antipattern | Detection Signal | Severity |
+|-------------|-----------------|----------|
+| **Big Ball of Mud** | No discernible module boundaries; any module calls any other; package diagram is fully connected | critical |
+| **God Object / God Service** | Class/module with >10 public methods touching >3 concerns; single service handling unrelated domains | high |
+| **Golden Hammer** | Same pattern/library used for every problem regardless of fit (everything is a microservice, everything uses Redux) | medium |
+| **Architecture Astronaut** | Layers of abstraction solving problems no one has; meta-frameworks, plugin systems with zero plugins | medium |
+| **Dead Code / Lava Flow** | Unreachable code paths, unused exports, commented-out blocks; code preserved "because it might be needed" | medium |
+| **Copy-Paste Architecture** | Duplicated modules with minor variations instead of shared abstraction | high |
+| **Boat Anchor** | Unused infrastructure "for future use" (empty interfaces, unused config, skeleton services) | medium |
+| **Accidental Complexity** | System complexity far exceeds problem complexity; over-engineered for the actual requirements | medium |
+| **Stovepipe System** | Modules built in isolation with no integration architecture; each uses different patterns, different data formats | high |
+| **Swiss Army Knife** | One component tries to serve every use case; endlessly configurable but hard to use for any single purpose | medium |
+## Integration Antipatterns
+| Antipattern | Detection Signal | Severity |
+|-------------|-----------------|----------|
+| **Distributed Monolith** | Multiple services that must be deployed together; shared database; lock-step releases | critical |
+| **Chatty Interface** | >5 sequential API calls to complete one logical operation | medium |
+| **Shared Database** | Multiple services reading/writing the same database tables directly | critical |
+| **Circular Dependency** | Service A calls B calls C calls A (or module-level equivalent) | high |
+| **Hardcoded Endpoints** | URLs, hostnames, or ports as string literals in source code | medium |
+| **Missing Circuit Breaker** | External service calls without timeout or failure handling | high |
+| **Sinkhole** | Requests pass through multiple layers that add no value (pure pass-through) | medium |
+## Layering Antipatterns
+| Antipattern | Detection Signal | Severity |
+|-------------|-----------------|----------|
+| **Upward Dependency** | Core/domain module imports from UI/API layer | critical |
+| **Layer Bypass** | UI code calling database/repository directly, skipping service layer | high |
+| **Anemic Domain** | Domain objects are pure data holders; all logic in services | medium |
+| **Fat Controller** | Controller/handler contains business logic instead of delegating | high |
+| **Inner Platform Effect** | Building a general-purpose engine inside the application that reimplements what the platform already provides | high |
+## Root Cause Patterns
+Most architecture antipatterns share a few root causes:
+- **Shipping pressure:** Shortcuts that accumulate into structural debt
+- **Missing boundaries:** No enforced module boundaries in build tooling
+- **Conway's Law misalignment:** Architecture doesn't match team structure
+- **Premature optimization:** Distributed complexity without proven need
+- **BDUF (Big Design Up Front) backlash:** Avoiding all upfront design, resulting in no design
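
Some detection signals in the tables above are simple enough to approximate with a text scan. As one illustration, the "Hardcoded Endpoints" signal (URLs, hostnames, or ports as string literals) could be flagged roughly as follows. The regular expression and the helper name `find_hardcoded_endpoints` are assumptions made for this sketch; it only recognizes quoted `http(s)` URLs, `localhost:port`, and IPv4 literals, and it will miss endpoints assembled at runtime.

```python
import re

# Quoted http(s) URL, localhost:port, or IPv4 address with an optional port.
ENDPOINT_LITERAL = re.compile(
    r"""["'](https?://[^"'\s]+|localhost:\d+|\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?)["']"""
)


def find_hardcoded_endpoints(source: str) -> list:
    """Return (line number, literal) pairs for endpoint-like string literals."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in ENDPOINT_LITERAL.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits
```

Hits are review prompts rather than verdicts: URLs in test fixtures or documentation strings are usually acceptable, whereas the same literal in production request code is the medium-severity case the table describes.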