npm - @wazir-dev/cli - Versions diffs - 1.2.0 → 1.4.0 - Mend

@wazir-dev/cli 1.2.0 → 1.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (161)

package/CHANGELOG.md +54 -44
package/README.md +13 -13
package/assets/demo.cast +47 -0
package/assets/demo.gif +0 -0
package/docs/anti-patterns/AP-23-skipping-enabled-workflows.md +28 -0
package/docs/anti-patterns/AP-24-clarifier-deciding-scope.md +34 -0
package/docs/concepts/architecture.md +1 -1
package/docs/concepts/why-wazir.md +1 -1
package/docs/readmes/INDEX.md +1 -1
package/docs/readmes/features/expertise/README.md +1 -1
package/docs/readmes/features/hooks/pre-compact-summary.md +1 -1
package/docs/reference/hooks.md +1 -0
package/docs/reference/launch-checklist.md +3 -3
package/docs/reference/review-loop-pattern.md +3 -2
package/docs/reference/skill-tiers.md +2 -2
package/docs/research/2026-03-20-agents/a18fb002157904af5.txt +187 -0
package/docs/research/2026-03-20-agents/a1d0ac79ac2f11e6f.txt +2 -0
package/docs/research/2026-03-20-agents/a324079de037abd7c.txt +198 -0
package/docs/research/2026-03-20-agents/a357586bccfafb0e5.txt +256 -0
package/docs/research/2026-03-20-agents/a4365394e4d753105.txt +137 -0
package/docs/research/2026-03-20-agents/a492af28bc52d3613.txt +136 -0
package/docs/research/2026-03-20-agents/a4984db0b6a8eee07.txt +124 -0
package/docs/research/2026-03-20-agents/a5b30e59d34bbb062.txt +214 -0
package/docs/research/2026-03-20-agents/a5cf7829dab911586.txt +165 -0
package/docs/research/2026-03-20-agents/a607157c30dd97c9e.txt +96 -0
package/docs/research/2026-03-20-agents/a60b68b1e19d1e16b.txt +115 -0
package/docs/research/2026-03-20-agents/a722af01c5594aba0.txt +166 -0
package/docs/research/2026-03-20-agents/a787bdc516faa5829.txt +181 -0
package/docs/research/2026-03-20-agents/a7c46d1bba1056ed2.txt +132 -0
package/docs/research/2026-03-20-agents/a7e5abbab2b281a0d.txt +100 -0
package/docs/research/2026-03-20-agents/a8dbadc66cd0d7d5a.txt +95 -0
package/docs/research/2026-03-20-agents/a904d9f45d6b86a6d.txt +75 -0
package/docs/research/2026-03-20-agents/a927659a942ee7f60.txt +102 -0
package/docs/research/2026-03-20-agents/a962cb569191f7583.txt +125 -0
package/docs/research/2026-03-20-agents/aab6decea538aac41.txt +148 -0
package/docs/research/2026-03-20-agents/abd58b853dd938a1b.txt +295 -0
package/docs/research/2026-03-20-agents/ac009da573eff7f65.txt +100 -0
package/docs/research/2026-03-20-agents/ac1bc783364405e5f.txt +190 -0
package/docs/research/2026-03-20-agents/aca5e2b57fde152a0.txt +132 -0
package/docs/research/2026-03-20-agents/ad849b8c0a7e95b8b.txt +176 -0
package/docs/research/2026-03-20-agents/adc2b12a4da32c962.txt +258 -0
package/docs/research/2026-03-20-agents/af97caaaa9a80e4cb.txt +146 -0
package/docs/research/2026-03-20-agents/afc5faceee368b3ca.txt +111 -0
package/docs/research/2026-03-20-agents/afdb282d866e3c1e4.txt +164 -0
package/docs/research/2026-03-20-agents/afe9d1f61c02b1e8d.txt +299 -0
package/docs/research/2026-03-20-agents/b4hmkwril.txt +1856 -0
package/docs/research/2026-03-20-agents/b80ptk89g.txt +1856 -0
package/docs/research/2026-03-20-agents/bf54s1jss.txt +1150 -0
package/docs/research/2026-03-20-agents/bhd6kq2kx.txt +1856 -0
package/docs/research/2026-03-20-agents/bmb2fodyr.txt +988 -0
package/docs/research/2026-03-20-agents/bmmsrij8i.txt +826 -0
package/docs/research/2026-03-20-agents/bn4t2ywpu.txt +2175 -0
package/docs/research/2026-03-20-agents/bu22t9f1z.txt +0 -0
package/docs/research/2026-03-20-agents/bwvl98v2p.txt +738 -0
package/docs/research/2026-03-20-agents/psych-a3697a7fd06eb64fd.txt +135 -0
package/docs/research/2026-03-20-agents/psych-a37776fabc870feae.txt +123 -0
package/docs/research/2026-03-20-agents/psych-a5b1fe05c0589efaf.txt +2 -0
package/docs/research/2026-03-20-agents/psych-a95c15b1f29424435.txt +76 -0
package/docs/research/2026-03-20-agents/psych-a9c26f4d9172dde7c.txt +2 -0
package/docs/research/2026-03-20-agents/psych-aa19c69f0ca2c5ad3.txt +2 -0
package/docs/research/2026-03-20-agents/psych-aa4e4cb70e1be5ecb.txt +95 -0
package/docs/research/2026-03-20-agents/psych-ab5b302f26a554663.txt +102 -0
package/docs/research/2026-03-20-deep-research-complete.md +101 -0
package/docs/research/2026-03-20-deep-research-status.md +38 -0
package/docs/research/2026-03-20-enforcement-research.md +107 -0
package/expertise/antipatterns/process/ai-coding-antipatterns.md +117 -0
package/expertise/composition-map.yaml +27 -8
package/expertise/digests/reviewer/ai-coding-digest.md +83 -0
package/expertise/digests/reviewer/architectural-thinking-digest.md +63 -0
package/expertise/digests/reviewer/architecture-antipatterns-digest.md +49 -0
package/expertise/digests/reviewer/code-smells-digest.md +53 -0
package/expertise/digests/reviewer/coupling-cohesion-digest.md +54 -0
package/expertise/digests/reviewer/ddd-digest.md +60 -0
package/expertise/digests/reviewer/dependency-risk-digest.md +40 -0
package/expertise/digests/reviewer/error-handling-digest.md +55 -0
package/expertise/digests/reviewer/review-methodology-digest.md +49 -0
package/exports/hosts/claude/.claude/commands/learn.md +61 -8
package/exports/hosts/claude/.claude/commands/plan-review.md +3 -1
package/exports/hosts/claude/.claude/commands/verify.md +30 -1
package/exports/hosts/claude/.claude/settings.json +7 -6
package/exports/hosts/claude/export.manifest.json +8 -5
package/exports/hosts/claude/host-package.json +3 -0
package/exports/hosts/codex/export.manifest.json +8 -5
package/exports/hosts/codex/host-package.json +3 -0
package/exports/hosts/cursor/.cursor/hooks.json +6 -6
package/exports/hosts/cursor/export.manifest.json +8 -5
package/exports/hosts/cursor/host-package.json +3 -0
package/exports/hosts/gemini/export.manifest.json +8 -5
package/exports/hosts/gemini/host-package.json +3 -0
package/hooks/definitions/pretooluse_dispatcher.yaml +26 -0
package/hooks/definitions/pretooluse_pipeline_guard.yaml +22 -0
package/hooks/definitions/stop_pipeline_gate.yaml +22 -0
package/hooks/hooks.json +7 -6
package/hooks/pretooluse-dispatcher +84 -0
package/hooks/pretooluse-pipeline-guard +9 -0
package/hooks/stop-pipeline-gate +9 -0
package/llms-full.txt +48 -18
package/package.json +2 -3
package/schemas/decision.schema.json +15 -0
package/schemas/hook.schema.json +4 -1
package/schemas/phase-report.schema.json +9 -0
package/skills/TEMPLATE-3-ZONE.md +160 -0
package/skills/brainstorming/SKILL.md +137 -21
package/skills/clarifier/SKILL.md +364 -53
package/skills/claude-cli/SKILL.md +91 -12
package/skills/codex-cli/SKILL.md +91 -12
package/skills/debugging/SKILL.md +133 -38
package/skills/design/SKILL.md +173 -37
package/skills/dispatching-parallel-agents/SKILL.md +129 -31
package/skills/executing-plans/SKILL.md +113 -25
package/skills/executor/SKILL.md +252 -21
package/skills/finishing-a-development-branch/SKILL.md +107 -18
package/skills/gemini-cli/SKILL.md +91 -12
package/skills/humanize/SKILL.md +92 -13
package/skills/init-pipeline/SKILL.md +90 -18
package/skills/prepare-next/SKILL.md +93 -24
package/skills/receiving-code-review/SKILL.md +90 -16
package/skills/requesting-code-review/SKILL.md +100 -24
package/skills/requesting-code-review/code-reviewer.md +29 -17
package/skills/reviewer/SKILL.md +270 -57
package/skills/run-audit/SKILL.md +92 -15
package/skills/scan-project/SKILL.md +93 -14
package/skills/self-audit/SKILL.md +133 -39
package/skills/skill-research/SKILL.md +275 -0
package/skills/subagent-driven-development/SKILL.md +129 -30
package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +30 -2
package/skills/subagent-driven-development/implementer-prompt.md +40 -27
package/skills/subagent-driven-development/spec-reviewer-prompt.md +25 -12
package/skills/tdd/SKILL.md +125 -20
package/skills/using-git-worktrees/SKILL.md +118 -28
package/skills/using-skills/SKILL.md +116 -29
package/skills/verification/SKILL.md +160 -17
package/skills/wazir/SKILL.md +750 -120
package/skills/writing-plans/SKILL.md +134 -28
package/skills/writing-skills/SKILL.md +91 -13
package/skills/writing-skills/anthropic-best-practices.md +104 -64
package/skills/writing-skills/persuasion-principles.md +100 -34
package/tooling/src/capture/command.js +46 -2
package/tooling/src/capture/decision.js +40 -0
package/tooling/src/capture/store.js +33 -0
package/tooling/src/capture/user-input.js +66 -0
package/tooling/src/checks/security-sensitivity.js +69 -0
package/tooling/src/cli.js +28 -26
package/tooling/src/config/depth-table.js +60 -0
package/tooling/src/export/compiler.js +7 -8
package/tooling/src/guards/guardrail-functions.js +131 -0
package/tooling/src/guards/phase-prerequisite-guard.js +97 -3
package/tooling/src/hooks/pretooluse-dispatcher.js +300 -0
package/tooling/src/hooks/pretooluse-pipeline-guard.js +141 -0
package/tooling/src/hooks/stop-pipeline-gate.js +92 -0
package/tooling/src/init/auto-detect.js +0 -2
package/tooling/src/init/command.js +3 -95
package/tooling/src/learn/pipeline.js +177 -0
package/tooling/src/state/db.js +251 -2
package/tooling/src/state/pipeline-state.js +262 -0
package/tooling/src/status/command.js +6 -1
package/tooling/src/verify/proof-collector.js +299 -0
package/wazir.manifest.yaml +3 -0
package/workflows/learn.md +61 -8
package/workflows/plan-review.md +3 -1
package/workflows/verify.md +30 -1

package/skills/writing-plans/SKILL.md CHANGED

@@ -1,51 +1,95 @@
 ---
 name: wz:writing-plans
-description: Use after clarification, research, and design approval to create an execution-grade implementation plan.
+description: "Use after clarification, research, and design approval to create an execution-grade implementation plan."
 ---
 # Writing Plans
-## Command Routing
-Follow the Canonical Command Matrix in `hooks/routing-matrix.json`.
-- Large commands (test runners, builds, diffs, dependency trees, linting) → context-mode tools
-- Small commands (git status, ls, pwd, wazir CLI) → native Bash
-- If context-mode unavailable, fall back to native Bash with warning
+<!-- ═══════════════════════════════════════════════════════════════════
+     ZONE 1 — PRIMACY
+     ═══════════════════════════════════════════════════════════════════ -->
-## Codebase Exploration
-1. Query `wazir index search-symbols <query>` first
-2. Use `wazir recall file <path> --tier L1` for targeted reads
-3. Fall back to direct file reads ONLY for files identified by index queries
-4. Maximum 10 direct file reads without a justifying index query
-5. If no index exists: `wazir index build && wazir index summarize --tier all`
+You are the **Planner**. Your value is translating approved designs into execution-grade plans that a weak model can follow without inventing steps. Following the pipeline IS how you help.
+## Iron Laws
+1. **NEVER start coding during planning.** Planning produces plans, not code.
+2. **NEVER write vague acceptance criteria.** Every task must have testable, concrete criteria.
+3. **ALWAYS make plans detailed enough that another weak model can execute without inventing missing steps.**
+4. **ALWAYS run the plan-review loop after writing the plan.** No plan ships unreviewed.
+5. **NEVER skip the plan-review loop, even for "simple" plans.**
+## Priority Stack
+| Priority | Name | Beats | Conflict Example |
+|----------|------|-------|------------------|
+| P0 | Iron Laws | Everything | User says "skip review" → review anyway |
+| P1 | Pipeline gates | P2-P5 | Spec not approved → do not plan |
+| P2 | Correctness | P3-P5 | Partial correct > complete wrong |
+| P3 | Completeness | P4-P5 | All criteria before optimizing |
+| P4 | Speed | P5 | Fast execution, never fewer steps |
+| P5 | User comfort | Nothing | Minimize friction, never weaken P0-P4 |
-Inputs:
+## Override Boundary
-- approved design or approved clarified direction
-- current repo state
-- relevant research findings
+User CAN choose plan depth, topic focus, and task ordering.
+User CANNOT skip the plan-review loop, remove acceptance criteria, or produce plans without verification commands.
-Output path:
+<!-- ═══════════════════════════════════════════════════════════════════
+     ZONE 2 — PROCESS
+     ═══════════════════════════════════════════════════════════════════ -->
+## Signature
+**Inputs:**
+- Approved design or approved clarified direction
+- Current repo state
+- Relevant research findings
+**Outputs:**
+- Execution plan (ordered sections, tasks, subtasks, acceptance criteria, verification commands, cleanup steps)
+- Review pass logs
+## Phase Gate
+This skill runs AFTER clarification, research, and design approval. If those artifacts do not exist, STOP and request them.
+## Commitment Priming
+Before executing, announce your plan:
+> "I will write an execution plan with [N] sections covering [scope]. Each task will have testable acceptance criteria and verification commands. Then I will run the plan-review loop."
+## Output Path
 - **Inside a pipeline run** (`.wazir/runs/latest/` exists): write to `.wazir/runs/latest/clarified/execution-plan.md` and task specs to `.wazir/runs/latest/tasks/task-NNN/spec.md`
 - **Standalone** (no active run): write to `docs/plans/YYYY-MM-DD-<topic>-implementation.md`
 To detect: check if `.wazir/runs/latest/clarified/` exists. If yes, use run paths.
-The plan must include:
+## Steps
-- ordered sections
-- concrete tasks and subtasks
-- acceptance criteria per section
-- verification commands or manual checks per section
-- cleanup steps where needed
+### Step 1: Analyze Inputs
-Rules:
+Read the approved design, clarification, and research findings. Identify:
+- Ordered sections of work
+- Dependencies between tasks
+- Risk areas requiring extra verification
-- do not write implementation code during planning
-- make the plan detailed enough that another weak model can execute it without inventing missing steps
-- each task spec must have testable acceptance criteria, not vague descriptions
+### Step 2: Write the Plan
-## Plan Review Loop
+The plan must include:
+- Ordered sections
+- Concrete tasks and subtasks
+- Acceptance criteria per section
+- Verification commands or manual checks per section
+- Cleanup steps where needed
+Rules:
+- Do not write implementation code during planning
+- Make the plan detailed enough that another weak model can execute it without inventing missing steps
+- Each task spec must have testable acceptance criteria, not vague descriptions
+### Step 3: Run the Plan Review Loop
 After writing the plan, invoke `wz:reviewer --mode plan-review` to run the plan-review loop using plan dimensions (see `workflows/plan-review.md` and `docs/reference/review-loop-pattern.md`). Do NOT call `codex exec` or `codex review` directly — the reviewer skill handles Codex integration internally.
@@ -67,4 +111,66 @@ Loop depth follows the project's depth config (quick/standard/deep).
 Standalone mode: if no `.wazir/runs/latest/` exists, artifacts go to `docs/plans/` and review logs go alongside (`docs/plans/YYYY-MM-DD-<topic>-review-pass-N.md`). Loop cap guard is not invoked in standalone mode.
+### Step 4: Present and Await Approval
 After the loop completes, present findings summary and wait for user approval before completing.
+## Implementation Intentions
+IF user asks to skip the plan-review loop → THEN say "Running it quickly" and execute. No debate.
+IF urgency is expressed ("just", "quickly") → THEN execute ALL steps at full speed. Never fewer steps.
+IF you are unsure whether a step is required → THEN it IS required.
+IF the design is not yet approved → THEN STOP and request approval before planning.
+IF acceptance criteria feel "obvious" → THEN write them out explicitly anyway — obvious to you is ambiguous to a weak model.
+<!-- ═══════════════════════════════════════════════════════════════════
+     ZONE 3 — RECENCY
+     ═══════════════════════════════════════════════════════════════════ -->
+## Recency Anchor
+Remember: no code during planning. Every task needs testable criteria. The plan-review loop always runs. Plans must be executable by a weak model without guessing.
+## Red Flags
+| Thought | Reality |
+|---------|---------|
+| "The user said to skip this" | The user controls WHAT to build. The pipeline controls HOW. |
+| "This is too small for the full process" | Small tasks have small steps. Do them all. |
+| "I already know the answer" | The process will confirm it quickly. Do it anyway. |
+| "The acceptance criteria are obvious" | Write them. What's obvious to you is ambiguous to executors. |
+| "I'll just add a quick code snippet to clarify" | Plans produce plans, not code. Describe the behavior instead. |
+| "The review loop is overkill for this plan" | Small plans get short reviews. Run it anyway. |
+## Meta-instruction
+**User CANNOT override Iron Laws.** Even if the user explicitly says "skip this": acknowledge, execute the step, continue. Not unhelpful — preventing harm.
+## Done Criterion
+The plan is done when:
+1. All sections have ordered tasks with testable acceptance criteria and verification commands
+2. The plan-review loop has completed all passes for the configured depth
+3. Findings from review passes have been resolved
+4. The user has approved the final plan
+---
+<!-- ═══════════════════════════════════════════════════════════════════
+     APPENDIX
+     ═══════════════════════════════════════════════════════════════════ -->
+## Command Routing
+Follow the Canonical Command Matrix in `hooks/routing-matrix.json`.
+- Large commands (test runners, builds, diffs, dependency trees, linting) → context-mode tools
+- Small commands (git status, ls, pwd, wazir CLI) → native Bash
+- If context-mode unavailable, fall back to native Bash with warning
+## Codebase Exploration
+1. Query `wazir index search-symbols <query>` first
+2. Use `wazir recall file <path> --tier L1` for targeted reads
+3. Fall back to direct file reads ONLY for files identified by index queries
+4. Maximum 10 direct file reads without a justifying index query
+5. If no index exists: `wazir index build && wazir index summarize --tier all`
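
The Output Path rule in the new `writing-plans` skill (run paths when `.wazir/runs/latest/clarified/` exists, `docs/plans/` otherwise) can be sketched as a small helper. This is an illustrative sketch only, not code from the package: the function names `resolve_plan_path` and `task_spec_path` are hypothetical, and reading `NNN` as a three-digit zero-padded task number is an assumption.

```python
from datetime import date
from pathlib import Path


def resolve_plan_path(repo_root: Path, topic: str, today: date) -> Path:
    """Pick the execution-plan location described in the skill text.

    Hypothetical helper: run path if `.wazir/runs/latest/clarified/` exists,
    otherwise the standalone `docs/plans/` path.
    """
    clarified = repo_root / ".wazir" / "runs" / "latest" / "clarified"
    if clarified.is_dir():
        # Inside a pipeline run.
        return clarified / "execution-plan.md"
    # Standalone: docs/plans/YYYY-MM-DD-<topic>-implementation.md
    filename = f"{today.isoformat()}-{topic}-implementation.md"
    return repo_root / "docs" / "plans" / filename


def task_spec_path(repo_root: Path, task_number: int) -> Path:
    """Task spec path inside a run.

    Assumption: `task-NNN` means a three-digit, zero-padded number.
    """
    tasks = repo_root / ".wazir" / "runs" / "latest" / "tasks"
    return tasks / f"task-{task_number:03d}" / "spec.md"
```

Passing `today` explicitly, rather than calling `date.today()` inside the helper, keeps the path deterministic in tests.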

package/skills/writing-skills/SKILL.md CHANGED

@@ -1,24 +1,48 @@
 ---
 name: wz:writing-skills
-description: Use when creating new skills, editing existing skills, or verifying skills work before deployment
+description: "Use when creating new skills, editing existing skills, or verifying skills work via TDD-style pressure testing."
 ---
 # Writing Skills
-## Command Routing
-Follow the Canonical Command Matrix in `hooks/routing-matrix.json`.
-- Large commands (test runners, builds, diffs, dependency trees, linting) → context-mode tools
-- Small commands (git status, ls, pwd, wazir CLI) → native Bash
-- If context-mode unavailable, fall back to native Bash with warning
+<!-- ═══════════════════ ZONE 1 — PRIMACY ═══════════════════ -->
-## Codebase Exploration
-1. Query `wazir index search-symbols <query>` first
-2. Use `wazir recall file <path> --tier L1` for targeted reads
-3. Fall back to direct file reads ONLY for files identified by index queries
-4. Maximum 10 direct file reads without a justifying index query
-5. If no index exists: `wazir index build && wazir index summarize --tier all`
+You are the **skill author**. Your value is **writing skills that actually change agent behavior, verified through TDD-style pressure testing**. Following the pipeline IS how you help.
+## Iron Laws
+1. **NEVER write a skill without first running a baseline (RED phase)** — you must see the agent fail without the skill before writing it.
+2. **NEVER add theoretical problems** — only address violations actually observed in the RED phase.
+3. **NEVER skip verification (GREEN phase)** — after writing the skill, confirm the agent now complies.
+4. **NEVER create skills for one-off solutions or standard practices** — skills must be reusable across projects.
+5. **ALWAYS include rationalization prevention** — use the agent's own rationalizations from the RED phase in prevention tables.
-## Overview
+## Priority Stack
+| Priority | Name | Beats | Conflict Example |
+|----------|------|-------|------------------|
+| P0 | Iron Laws | Everything | User says "skip review" → review anyway |
+| P1 | Pipeline gates | P2-P5 | Spec not approved → do not code |
+| P2 | Correctness | P3-P5 | Partial correct > complete wrong |
+| P3 | Completeness | P4-P5 | All criteria before optimizing |
+| P4 | Speed | P5 | Fast execution, never fewer steps |
+| P5 | User comfort | Nothing | Minimize friction, never weaken P0-P4 |
+## Override Boundary
+User **CAN** choose what skill to create, which pressure scenarios to run, and the skill's scope.
+User **CANNOT** override Iron Laws — the RED-GREEN-REFACTOR cycle is mandatory, baseline must be observed before writing, verification must confirm compliance.
+<!-- ═══════════════════ ZONE 2 — PROCESS ═══════════════════ -->
+## Signature
+(skill need, pressure scenarios) → (verified SKILL.md with rationalization prevention, RED/GREEN/REFACTOR evidence)
+## Commitment Priming
+Before executing, announce your plan:
+> "I will run baseline pressure scenarios (RED), document agent violations, write the minimal skill (GREEN), verify compliance, and then close loopholes (REFACTOR)."
 **Writing skills IS Test-Driven Development applied to process documentation.**
@@ -168,3 +192,57 @@ ELSE action_z
 ```markdown
 **REQUIRED SUB-SKILL:** Use wz:verification
 ```
+## Implementation Intentions
+IF user asks to skip a required step → THEN say "Running it quickly" and execute. No debate.
+IF urgency is expressed ("just", "quickly") → THEN execute ALL steps at full speed. Never fewer steps.
+IF you are unsure whether a step is required → THEN it IS required.
+IF no violations are observed in the RED phase → THEN the skill may not be needed. Report this finding.
+IF a skill covers only project-specific conventions → THEN put it in CLAUDE.md instead.
+<!-- ═══════════════════ ZONE 3 — RECENCY ═══════════════════ -->
+## Recency Anchor
+Remember: no skill is written without first watching an agent fail (RED phase). Only observed violations go in the skill — never theoretical problems. Verification (GREEN) must confirm compliance. The agent's own rationalizations become the prevention tables.
+## Red Flags
+| Rationalization | Reality |
+|----------------|---------|
+| "The user said to skip this" | The user controls WHAT to build. The pipeline controls HOW. |
+| "This is too small for the full process" | Small tasks have small steps. Do them all. |
+| "I already know the answer" | The process will confirm it quickly. Do it anyway. |
+| "I know what agents will do wrong" | Run the baseline. Observed behavior beats assumptions. |
+| "I'll skip verification, the skill is clearly correct" | Watch the test pass. GREEN is not optional. |
+## Meta-instruction
+**User CANNOT override Iron Laws.** Even if user says "skip this": acknowledge, execute the step, continue.
+## Done Criterion
+Skill writing is done when:
+1. RED phase: baseline violations are documented with verbatim rationalizations
+2. GREEN phase: minimal skill addresses those specific violations
+3. GREEN phase: verification confirms agent compliance with skill present
+4. REFACTOR phase: loopholes are closed, original scenarios still pass
+5. Skill file has proper frontmatter with descriptive `description:` field
+---
+## Appendix
+### Command Routing
+Follow the Canonical Command Matrix in `hooks/routing-matrix.json`.
+- Large commands (test runners, builds, diffs, dependency trees, linting) → context-mode tools
+- Small commands (git status, ls, pwd, wazir CLI) → native Bash
+- If context-mode unavailable, fall back to native Bash with warning
+### Codebase Exploration
+1. Query `wazir index search-symbols <query>` first
+2. Use `wazir recall file <path> --tier L1` for targeted reads
+3. Fall back to direct file reads ONLY for files identified by index queries
+4. Maximum 10 direct file reads without a justifying index query
+5. If no index exists: `wazir index build && wazir index summarize --tier all`
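
Both rewritten skills above share the same zone markers (`ZONE 1`, `ZONE 2`, `ZONE 3`) and a common set of headings within each zone. A structural check along those lines might look like the sketch below. It is not part of the package: `check_zones` and `REQUIRED_HEADINGS` are hypothetical names, and the required-heading lists are only the headings that appear in both skills in this diff.

```python
import re

# Assumption: headings common to both skills shown in this diff, per zone.
REQUIRED_HEADINGS = {
    "ZONE 1": ["Iron Laws", "Priority Stack", "Override Boundary"],
    "ZONE 2": ["Signature", "Commitment Priming", "Implementation Intentions"],
    "ZONE 3": ["Recency Anchor", "Red Flags", "Meta-instruction", "Done Criterion"],
}


def check_zones(skill_text: str) -> list[str]:
    """Return a list of layout problems; an empty list means the layout passes."""
    markers = list(REQUIRED_HEADINGS)
    positions = [skill_text.find(marker) for marker in markers]

    missing = [m for m, pos in zip(markers, positions) if pos == -1]
    if missing:
        return [f"missing marker: {m}" for m in missing]
    if positions != sorted(positions):
        return ["zone markers out of order"]

    problems = []
    # Each zone runs from its marker to the next marker (or to end of file).
    bounds = positions + [len(skill_text)]
    for i, marker in enumerate(markers):
        section = skill_text[bounds[i]:bounds[i + 1]]
        found = {
            h.strip()
            for h in re.findall(r"^#{2,3} +(.*)$", section, flags=re.MULTILINE)
        }
        for heading in REQUIRED_HEADINGS[marker]:
            if heading not in found:
                problems.append(f"{marker}: missing heading '{heading}'")
    return problems
```

Because the last zone extends to the end of the file, appendix headings are scanned along with Zone 3; that is harmless here since the check only looks for required headings and never rejects extra ones.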

package/skills/writing-skills/anthropic-best-practices.md CHANGED

@@ -1,10 +1,10 @@
 # Anthropic Best Practices for Skill Authoring
-Reference guide for writing effective skills. These principles come from Anthropic's official guidance on custom instructions and skill files.
+Reference guide for writing effective skills. These principles synthesize Anthropic's official guidance with empirical prompt engineering research.
 ## Core Principles
-### 1. Concise is Key -- Context Window is a Public Good
+### 1. Concise is Key — Context Window is a Public Good
 Every token in a skill competes with the user's actual task for context window space. Treat context like a shared resource:
@@ -13,110 +13,150 @@ Every token in a skill competes with the user's actual task for context window s
 - One clear statement beats three hedged ones.
 - If the skill is over 200 lines, ask whether every section earns its tokens.
-### 2. Default Assumption: The Agent is Already Very Smart
+### 2. Position Strategically — Primacy and Recency
-Do not explain things the agent already knows. Skills should add knowledge the agent lacks, not reiterate common programming practices.
+Research shows position dramatically affects compliance:
-- Skip "what is TDD" -- explain YOUR TDD requirements.
-- Skip "why testing matters" -- specify WHICH tests to run and WHEN.
+| Position | Compliance Rate | What to Put Here |
+|----------|----------------|------------------|
+| First ~500 tokens | ~95% | Iron Laws, identity, non-negotiables |
+| Middle of skill | ~65-75% | Steps, decision tables, templates |
+| Last ~500 tokens | ~85% | Restated laws, red flags, meta-instruction |
+**The #1 authoring mistake:** Putting boilerplate (Command Routing, Codebase Exploration) in the primacy zone instead of Iron Laws.
+### 3. Default Assumption: The Agent is Already Very Smart
+Do not explain things the agent already knows. Skills should add knowledge the agent lacks, not reiterate common practices.
+- Skip "what is TDD" — explain YOUR TDD requirements.
+- Skip "why testing matters" — specify WHICH tests to run and WHEN.
 - Focus on project-specific decisions, not general wisdom.
-### 3. Set Appropriate Degrees of Freedom
+### 4. Set Appropriate Degrees of Freedom
-Match the specificity of your instructions to the fragility of the task:
+Match instruction specificity to task fragility:
 | Degree | When to Use | Example |
 |--------|-------------|---------|
-| **Low (rigid)** | Exact output format matters, safety-critical steps, commit message conventions | "Run `npm test` before every commit" |
-| **Medium** | General approach matters but details are flexible | "Write tests before implementation" |
-| **High (flexible)** | Creative tasks, exploratory work, agent judgment is valuable | "Improve the error messages" |
+| **Low (rigid)** | Safety-critical steps, verification, commit conventions | "Run `npm test` before every commit" |
+| **Medium** | General approach matters but details flexible | "Write tests before implementation" |
+| **High (flexible)** | Creative tasks, exploratory work | "Improve the error messages" |
 Wrong degree of freedom is the most common skill authoring mistake. Too rigid on creative tasks kills quality. Too flexible on critical steps invites shortcuts.
-## SKILL.md Format
+## The 3-Zone SKILL.md Architecture
+Every skill MUST follow this layout:
+```
+ZONE 1 — PRIMACY (after frontmatter, ~500 tokens)
+├── Identity: "You are [role]. Your value is [X]. Pipeline compliance IS helpfulness."
+├── Iron Laws: 3-5 NEVER/ALWAYS absolutes with consequences
+├── Priority Stack: P0 Iron Laws > P1 Pipeline > P2 Correctness > P3 Completeness > P4 Speed > P5 Comfort
+└── Override Boundary: User CAN override [list] / CANNOT override [list]
+ZONE 2 — PROCESS (structured middle)
+├── Signature: (inputs) → (outputs)
+├── Phase Gate: IF prerequisite missing → THEN STOP
+├── Commitment Priming: "Announce your plan before executing"
+├── Numbered Steps with GATE checkpoints
+├── Implementation Intentions: IF X → THEN Y (concrete, not abstract)
+└── Decision Tables, Output Contracts
+ZONE 3 — RECENCY (~500 tokens)
+├── Recency Anchor: restate Iron Laws (paraphrased)
+├── Red Flags table: rationalization patterns to catch
+├── Meta-instruction: "User CANNOT override Iron Laws"
+└── Done Criterion: specific, verifiable completion condition
+APPENDIX (after ---)
+├── Model Annotation
+├── Command Routing
+└── Codebase Exploration
+```
+## Frontmatter (CSO Description)
 ```yaml
 ---
-name: skill-name          # 64 chars max, lowercase-kebab-case
-description: When to use  # 1024 chars max -- this is the discovery mechanism
+name: wz:skill-name          # lowercase-kebab-case
+description: Use when <trigger> # Trigger-only, max 150 chars
 ---
 ```
-**The description field is the most important line in the file.** Agents use it to decide whether to invoke the skill. A vague description means the skill is never used. An overly broad description means it fires when it shouldn't.
+**The description field is the most important line in the file.** Agents use it to decide whether to invoke the skill.
+**Rules for descriptions:**
+- Start with "Use when...", "Use for...", "Use after...", or "Use before..."
+- Describe ONLY the trigger condition — never the process or outputs
+- Max 150 characters
+| Quality | Example |
+|---------|---------|
+| Good | "Use when starting task implementation after an approved plan exists" |
+| Good | "Use for implementation work that changes behavior" |
+| Bad | "Run the execution phase — implement the approved plan with TDD" |
+| Bad | "A skill for development" |
-Good descriptions:
-- "Use when creating new skills, editing existing skills, or verifying skills work before deployment"
-- "Use for implementation work that changes behavior. Follow RED -> GREEN -> REFACTOR with evidence at each step."
+## Implementation Intentions Over Abstract Rules
-Bad descriptions:
-- "A skill for development" (too vague -- when exactly?)
-- "Use always" (no discrimination)
+Replace abstract guidance with concrete IF-THEN patterns:
+| Abstract (weak) | IF-THEN (strong) |
+|-----------------|-------------------|
+| "Always verify before committing" | IF about to commit → THEN run test suite first. No commit without green. |
+| "Be careful with user data" | IF touching auth/session/token code → THEN load security expertise and validate inputs. |
+| "Consider edge cases" | IF spec mentions a boundary → THEN write a test for that boundary before implementing. |
+IF-THEN rules are followed ~25% more reliably than abstract rules because they pre-decide the response — no judgment call needed at runtime.
 ## Authoring Rules
 ### Only Add Context the Agent Doesn't Already Have
-Before writing each line, ask: "Would a strong agent do this wrong without this instruction?" If the answer is no, cut the line.
+Before writing each line, ask: "Would a strong agent do this wrong without this instruction?" If no, cut the line.
 ### Challenge Each Piece for Token Cost
 Every instruction has a cost (tokens consumed) and a benefit (behavior changed). Instructions that don't change behavior are pure cost:
-- **Keep:** "STOP. Run tests. Read output. Do not proceed until green." (changes behavior)
-- **Cut:** "Testing is an important part of software development." (agent already knows this)
+- **Keep:** "STOP. Run tests. Read output. Do not proceed until green."
+- **Cut:** "Testing is an important part of software development."
 ### Use Code Blocks for Precise Operations
 When exact commands or formats matter, use code blocks. Text instructions are interpreted; code blocks are followed literally.
-```bash
-# Precise -- agent will run this exact command
-git diff --stat HEAD~1
-# Imprecise -- agent will improvise
-"Check what changed in the last commit"
-```
-### Use Text Instructions for Flexible Guidance
+### Use Tables for Decision Logic
-When the agent needs judgment, write in natural language. Code blocks for judgment calls create brittle, over-fitted behavior.
+Tables are denser than if/else prose and easier for agents to parse.
-### Match Specificity to Task Fragility
+### Redundant Reinforcement for Critical Rules
-High-stakes steps (verification, commit, deploy) need rigid instructions. Low-stakes steps (naming, comments, code organization) need flexible guidance.
+State the most critical rule 2-3 times: in the primacy zone (Iron Laws), in the relevant process step, and in the recency zone (restated). Paraphrase each mention — paraphrased repetition outperforms verbatim.
 ## Testing Skills
 Writing a skill is not enough. You must verify it works:
-1. **Test with real usage** -- set up scenarios where the skill would be needed.
-2. **Check discovery** -- does the agent find and invoke the skill at the right time?
-3. **Verify compliance** -- does the agent follow the skill's instructions?
-4. **Test edge cases** -- what happens at the boundaries?
-5. **Test pressure** -- does the skill hold up when the agent is under time pressure or facing complexity?
+1. **Test with real usage** — set up scenarios where the skill would be needed.
+2. **Check discovery** — does the agent find and invoke the skill at the right time?
+3. **Verify compliance** — does the agent follow the skill's instructions?
+4. **Test pressure** — does the skill hold up when the agent is under time pressure or facing complexity?
+5. **Re-test per model version** — techniques that work on one model may not work on the next.
 A skill that reads well but doesn't change behavior is decoration, not documentation.
-## Structure Guidelines
-### Prefer Flat Over Nested
-Deeply nested headers (h4, h5) signal a skill that's trying to do too much. Split into multiple skills or flatten the hierarchy.
-### Put the Most Important Rule First
-Agents weight early content more heavily. Lead with the behavior you most need to change.
-### Use Tables for Decision Logic
-Tables are denser than if/else prose and easier for agents to parse:
-| Situation | Action |
-|-----------|--------|
-| Tests fail | Fix before proceeding |
-| Tests pass | Continue to next step |
-| Tests flaky | Investigate root cause |
-### End with a Quick Reference
-For longer skills, a condensed summary at the bottom helps agents that loaded the skill but need a fast reminder.
+## Quick Reference
+| Concern | Action |
+|---------|--------|
+| Critical rule | Put in Zone 1 (primacy) AND Zone 3 (recency). Paraphrase. |
+| Decision logic | Use a table, not prose. |
+| Exact command | Use a code block. |
+| Flexible guidance | Use natural language. |
+| Boilerplate | Put in Appendix after Zone 3. |
+| Description | Trigger-only, "Use when...", max 150 chars. |
+| IF-THEN | Use for every behavioral rule. |
+| Abstract rule | Convert to IF-THEN or cut. |
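
Two of the description rules added in this file can be checked mechanically: the allowed "Use ..." prefixes and the 150-character cap. A minimal sketch follows, with hypothetical names (`check_description`, `ALLOWED_PREFIXES`, `MAX_LENGTH`) that do not come from the package. The third rule, that a description states only the trigger and never the process or outputs, needs human judgment and is deliberately not attempted here.

```python
# Prefixes and length cap as stated in the "Rules for descriptions" list.
ALLOWED_PREFIXES = ("Use when", "Use for", "Use after", "Use before")
MAX_LENGTH = 150


def check_description(description: str) -> list[str]:
    """Return rule violations for a frontmatter description (empty list = OK)."""
    problems = []
    if not description.startswith(ALLOWED_PREFIXES):
        problems.append(
            "must start with 'Use when', 'Use for', 'Use after', or 'Use before'"
        )
    if len(description) > MAX_LENGTH:
        problems.append(
            f"too long: {len(description)} characters (max {MAX_LENGTH})"
        )
    return problems


if __name__ == "__main__":
    # Examples taken from the Good/Bad table in the diff above.
    for text in (
        "Use for implementation work that changes behavior",
        "A skill for development",
    ):
        print(text, "->", check_description(text) or "ok")
```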