npm - @alexandrealvaro/agentic - Versions diffs - 0.13.0-beta.1 → 0.15.0-beta.1 - Mend

@alexandrealvaro/agentic 0.13.0-beta.1 → 0.15.0-beta.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (76) hide show

package/README.md CHANGED Viewed

@@ -4,13 +4,13 @@ A starter kit for engineering production code with LLMs. Lean templates and init
 **The framing.** An LLM is the super-soldier serum; the engineer is Steve Rogers. The serum amplifies what the engineer already brings — solid bases, investigation, care for quality, architecture, clean code, observability, maintainability. The kit encodes those bases as skills, ADRs, and gates so the amplification compounds in the right direction. See [WORKFLOW.md](WORKFLOW.md) for the principles.
-The CLI installs nine universal skills (`agentic-bootstrap`, `agentic-philosophy`, `agentic-architecture`, `agentic-adr`, `agentic-spec`, `agentic-task`, `agentic-audit`, `agentic-review`, `agentic-ground`) plus three conditional ones (`agentic-design` for frontend, `agentic-subagent` for Claude Code, `agentic-skill` opt-in) into the agent's native location. Each skill produces its artifact or runs its operation via the agent's native conversational UI; `agentic update` keeps installed skills in sync with upstream kit changes via a state-aware three-way diff. Report rough edges via [GitHub Issues](https://github.com/alexandremendoncaalvaro/agentic-development/issues); current releases live under [GitHub Releases](https://github.com/alexandremendoncaalvaro/agentic-development/releases).
+The CLI installs nineteen universal skills at the default `team` profile (`ad-bootstrap`, `ad-philosophy`, `ad-architecture`, `ad-adr`, `ad-spec`, `ad-task`, `ad-audit`, `ad-review`, `ad-ground`, `ad-next`, `ad-spike`, `ad-tdg`, `ad-domain`, `ad-grill`, `ad-deepen`, `ad-diagnose`, `ad-commit`, `ad-pr`, `ad-merge`) plus four conditional ones (`ad-design` for frontend, `ad-subagent` for Claude Code, `ad-skill` opt-in, `ad-hooks` opt-in / recommended at `mature`) into the agent's native location. Lower profiles install fewer (`poc` = 9 universals, `solo` = 16, `team` / `mature` = 19; `ad-deepen` is excluded from `poc` / `solo` per [ADR-0020](doc/adr/0020-deep-modules-vocabulary.md) §4; `ad-commit` / `ad-pr` / `ad-merge` are excluded from `poc` per [ADR-0023](doc/adr/0023-agentic-commit-skill.md) / [ADR-0024](doc/adr/0024-agentic-pr-skill.md) / [ADR-0025](doc/adr/0025-agentic-merge-skill.md)). Each skill produces its artifact or runs its operation via the agent's native conversational UI; `agentic update` keeps installed skills in sync with upstream kit changes via a state-aware three-way diff. Report rough edges via [GitHub Issues](https://github.com/alexandremendoncaalvaro/agentic-development/issues); current releases live under [GitHub Releases](https://github.com/alexandremendoncaalvaro/agentic-development/releases).
 ## Prerequisites
 An agentic coding tool that reads markdown files. Examples here use **Claude Code** and **Codex CLI** (primary tools the author uses); the kit also works with [Antigravity](https://antigravity.google), [Gemini CLI](https://github.com/google-gemini/gemini-cli), Cursor, Continue, Aider, and any other tool that follows the [agents.md](https://agents.md) open standard.
-For the CLI path: Node.js 18+. The CLI is the recommended path. Paste-into-agent prompts (see [Manual prompts](#manual-prompts) below) remain as an alternative for users who don't want to run an installer — same artifacts, same patterns.
+For the CLI path: Node.js 20+ (Node 18 is past EOL and `@clack/prompts` 1.x requires `node:util` `styleText`, which only ships in Node 20+). The CLI is the recommended path. Paste-into-agent prompts (see [Manual prompts](#manual-prompts) below) remain as an alternative for users who don't want to run an installer — same artifacts, same patterns.
 For the philosophy and full reasoning behind the kit, see [WORKFLOW.md](WORKFLOW.md).
@@ -30,37 +30,55 @@ Two categories ([ADR-0007](doc/adr/0007-workflow-operational-skills.md)) and two
 | Skill | Category | Installs | What it does | Invoke |
 | --- | --- | --- | --- | --- |
-| `agentic-bootstrap` | spec-driven | universal | Scans the repo, writes `AGENTS.md` ≤150 lines | `/agentic-bootstrap` |
-| `agentic-architecture` | spec-driven | universal | Scans the code, writes `ARCHITECTURE.md` | `/agentic-architecture` |
-| `agentic-adr` | spec-driven | universal | Drafts `doc/adr/NNNN-<slug>.md` from the conversation | `/agentic-adr` |
-| `agentic-spec` | spec-driven | universal | Drafts `doc/specs/NNNN-<slug>.md` — feature-level spec (User Scenarios, Requirements, Success Criteria) layer 2 of the four-layer stack | `/agentic-spec` |
-| `agentic-task` | spec-driven | universal | Drafts `doc/tasks/NNNN-<slug>.md` (checkbox + Notes format; carries `Spec ref` to link the implementing spec) | `/agentic-task` |
-| `agentic-audit` | spec-driven | universal | Read-only drift report (AGENTS.md / ARCHITECTURE.md / ADRs) | `/agentic-audit` |
-| `agentic-philosophy` | workflow-operational | universal | Universal agent guardrails — auto-loads on non-trivial work | implicit |
-| `agentic-review` | workflow-operational | universal | Fresh-context code review per WORKFLOW §10; structured findings, no "approve" | `/agentic-review <range>` |
-| `agentic-ground` | workflow-operational | universal | Four-source pre-implementation research (docs / OSS / in-repo / git history) + happy-path synthesis + deviation gate per WORKFLOW §4 + §5 | `/agentic-ground` |
-| `agentic-next` | workflow-operational | universal | State-aware navigation aid (`flutter doctor` pattern) — surveys the four-layer artifact stack and recommends prioritized next actions; complements `agentic-audit` (drift) | `/agentic-next` |
-| `agentic-spike` | workflow-operational | universal | Staged spike with golden fixtures per WORKFLOW §14, for cases where the *technique* is uncertain across multiple plausible approaches; produces `spikes/NNNN-<slug>/` with discovery + fixture + pipeline-with-gates + two-layer evaluation | `/agentic-spike` |
-| `agentic-tdg` | workflow-operational | universal | Outcome-based prompting per WORKFLOW §9 — ground truth pair + Test Dependency Map + three approaches + single-criterion selection, for cases where the technique is known but the implementation strategy is uncertain | `/agentic-tdg` |
-| `agentic-design` | spec-driven | auto if frontend detected | Bootstrap `DESIGN.md` from existing tokens (Figma, tailwind.config, tokens.json, CSS custom props) | `/agentic-design` |
-| `agentic-subagent` | spec-driven | auto if installing for Claude Code | Drafts `.claude/agents/<name>.md` (Claude Code only — Codex has no subagent primitive) | `/agentic-subagent` |
-| `agentic-skill` | spec-driven | opt-in only | Drafts a new Claude Code or Codex skill at the appropriate path | `/agentic-skill` |
-| `agentic-hooks` | workflow-operational | opt-in only | Scaffolds deterministic quality gates per WORKFLOW §11 (pre-commit + pre-push); detects stack and recommends a runner (Husky / lefthook / pre-commit / native) | `/agentic-hooks` |
-A short TUI shows the detected mode, agent, and feature signals (frontend / `.claude/` / `.agents/` presence) and lets you toggle the conditional skills. Non-interactive flags: `--agent claude-code | codex | both`, `--yes` to skip confirmations — auto-checked conditionals (e.g., `agentic-design` if the project has React) install; `agentic-skill` stays opt-in. Re-running on an installed project is idempotent — unchanged files report `·`, divergent ones prompt to replace.
+| `ad-bootstrap` | spec-driven | universal | Scans the repo, writes `AGENTS.md` ≤150 lines | `/ad-bootstrap` |
+| `ad-architecture` | spec-driven | universal | Scans the code, writes `ARCHITECTURE.md` | `/ad-architecture` |
+| `ad-adr` | spec-driven | universal | Drafts `doc/adr/NNNN-<slug>.md` from the conversation | `/ad-adr` |
+| `ad-spec` | spec-driven | universal | Drafts `doc/specs/NNNN-<slug>.md` — feature-level spec (User Scenarios, Requirements, Success Criteria) layer 3 of the five-layer stack | `/ad-spec` |
+| `ad-task` | spec-driven | universal | Drafts `doc/tasks/NNNN-<slug>.md` (checkbox + Notes format; carries `Spec ref` to link the implementing spec) | `/ad-task` |
+| `ad-audit` | spec-driven | universal | Read-only drift report (AGENTS.md / ARCHITECTURE.md / ADRs) | `/ad-audit` |
+| `ad-philosophy` | workflow-operational | universal | Universal agent guardrails — auto-loads on non-trivial work | implicit |
+| `ad-review` | workflow-operational | universal | Fresh-context code review per WORKFLOW §10; structured findings, no "approve" | `/ad-review <range>` |
+| `ad-ground` | workflow-operational | universal | Four-source pre-implementation research (docs / OSS / in-repo / git history) + happy-path synthesis + deviation gate per WORKFLOW §4 + §5 | `/ad-ground` |
+| `ad-next` | workflow-operational | universal | State-aware navigation aid (`flutter doctor` pattern) — surveys the five-layer artifact stack and recommends prioritized next actions; complements `ad-audit` (drift) | `/ad-next` |
+| `ad-spike` | workflow-operational | universal | Staged spike with golden fixtures per WORKFLOW §14, for cases where the *technique* is uncertain across multiple plausible approaches; produces `spikes/NNNN-<slug>/` with discovery + fixture + pipeline-with-gates + two-layer evaluation | `/ad-spike` |
+| `ad-tdg` | workflow-operational | universal | Outcome-based prompting per WORKFLOW §9 — ground truth pair + Test Dependency Map + three approaches + single-criterion selection, for cases where the technique is known but the implementation strategy is uncertain | `/ad-tdg` |
+| `ad-domain` | spec-driven | universal | Lazy lifecycle owner of `CONTEXT.md` (Layer 2 — ubiquitous language per Evans 2003); single-context or `CONTEXT-MAP.md` multi-context | `/ad-domain` |
+| `ad-grill` | workflow-operational | universal | Interview-before-research grilling — one question at a time with recommendation, codebase-first, sharpens vocabulary against `CONTEXT.md`, captures terms via `ad-domain` and decisions via `ad-adr` (three-criteria rule); upstream of `ad-ground` | `/ad-grill` |
+| `ad-deepen` | workflow-operational | universal in `team` + `mature` only | Surface deepening opportunities using WORKFLOW §8 vocabulary (Module / Interface / Depth / Seam / Adapter / Leverage / Locality); three phases — explore, present numbered candidates with deletion-test framing, grill the chosen one; pairs with `ad-audit` | `/ad-deepen` |
+| `ad-diagnose` | workflow-operational | universal | Disciplined diagnosis loop for hard bugs and performance regressions per WORKFLOW §15; five phases — build a feedback loop (the skill itself), reproduce, hypothesise (3-5 ranked falsifiable), instrument (one variable at a time), fix + regression-test | `/ad-diagnose` |
+| `ad-commit` | workflow-operational | universal in `solo` / `team` / `mature` | Atomic Conventional Commits with DCO `Signed-off-by` sign-off per ADR-0023; four phases — scope intake, stage-split when concerns mix, draft message, sign + write. Identity from `git config`. No `Co-Authored-By`. Helper, not blocker | `/ad-commit` |
+| `ad-pr` | workflow-operational | universal in `solo` / `team` / `mature` | Open a GitHub PR with a uniform body shape (Summary / Test plan / Links) per ADR-0024; four phases — preflight (`gh` auth + branch pushed), scope assembly, draft body, open + report URL. Title format = Conventional Commits | `/ad-pr` |
+| `ad-merge` | workflow-operational | universal in `solo` / `team` / `mature` | Evaluate (CI / fresh-context review / linked task / unresolved comments / mergeability) and merge a GitHub PR per ADR-0025; CI green = hard gate (yields to explicit user override); others = warnings. Merge mode auto-detected from `gh repo view`; `--delete-branch` by default | `/ad-merge` |
+| `ad-design` | spec-driven | auto if frontend detected | Bootstrap `DESIGN.md` from existing tokens (Figma, tailwind.config, tokens.json, CSS custom props) | `/ad-design` |
+| `ad-subagent` | spec-driven | auto if installing for Claude Code | Drafts `.claude/agents/<name>.md` (Claude Code only — Codex has no subagent primitive) | `/ad-subagent` |
+| `ad-skill` | spec-driven | opt-in only | Drafts a new Claude Code or Codex skill at the appropriate path | `/ad-skill` |
+| `ad-hooks` | workflow-operational | opt-in only | Scaffolds deterministic quality gates per WORKFLOW §11 (pre-commit + pre-push); detects stack and recommends a runner (Husky / lefthook / pre-commit / native) | `/ad-hooks` |
+A short TUI shows the detected mode, agent, and feature signals (frontend / `.claude/` / `.agents/` presence) and lets you toggle the conditional skills. Non-interactive flags: `--agent claude-code | codex | both`, `--yes` to skip confirmations — auto-checked conditionals (e.g., `ad-design` if the project has React) install; `ad-skill` stays opt-in. Re-running on an installed project is idempotent — unchanged files report `·`, divergent ones prompt to replace.
 If your project already has an `AGENTS.md` (or `CLAUDE.md`), the installer appends a managed `Skills installed by agentic` section bracketed by `<!-- agentic-managed-skills:start -->` / `:end -->` markers. User content outside those markers is byte-preserved; re-runs update only the managed block.
+### v0.15 bundle
+The four skills accepted by ADR-0019 / 0020 / 0021 / 0022 (the mattpocock-absorption Phase-2 set) shipped together in v0.15.0-beta.1. Closes [task-0020](doc/tasks/0020-mattpocock-absorptions.md) Phase 2 in a single release rather than the originally-planned per-minor stack (v0.15 → v0.18). Rationale: 3 of 4 skills are direct mirrors of mature mattpocock prior art; bundling kept the WORKFLOW §15 / §8 / Layer-2 deltas coherent in one ship.
+| Skill | ADR | Operationalizes |
+| --- | --- | --- |
+| `ad-domain` | [ADR-0019](doc/adr/0019-domain-language-layer.md) | Layer 2 (ubiquitous language) — lazy `CONTEXT.md` lifecycle |
+| `ad-grill` | [ADR-0022](doc/adr/0022-agentic-grill-skill.md) | Interview-before-research, upstream of `ad-ground` |
+| `ad-deepen` | [ADR-0020](doc/adr/0020-deep-modules-vocabulary.md) | WORKFLOW §8 vocabulary applied to refactor proposals (team + mature only) |
+| `ad-diagnose` | [ADR-0021](doc/adr/0021-diagnose-discipline.md) | WORKFLOW §15 five-phase debugging |
 ## Project maturity profiles
 The kit ships four profiles that select which skills auto-install. Same WORKFLOW principles bind every profile; only the artifact set scales.
 | Profile | Universal install set | Conditional posture | Recommended for |
 | --- | --- | --- | --- |
-| `poc` | philosophy, ground, audit | all blocked | spike, hackathon, exploration |
-| `solo` | + bootstrap, spec, task, review | architecture / adr / hooks opt-in; design auto if frontend; subagent auto for Claude Code | solo developer shipping a real product |
-| `team` (default) | + architecture, adr | hooks opt-in; design / subagent / skill follow autoIf | team product, shared discipline |
-| `mature` | same as team | hooks **recommended**; future evals + spike skills land here | regulated / public-facing production |
+| `poc` | 9 — philosophy, ground, audit, next, spike, tdg, domain, grill, diagnose | all blocked | spike, hackathon, exploration |
+| `solo` | 16 — + bootstrap, spec, task, review, commit, pr, merge | architecture / adr / hooks opt-in; design auto if frontend; subagent auto for Claude Code | solo developer shipping a real product |
+| `team` (default) | 19 — + architecture, adr, deepen | hooks opt-in; design / subagent / skill follow autoIf | team product, shared discipline |
+| `mature` | 19 — same as team | hooks **recommended**; deepening surfaced via `ad-deepen` | regulated / public-facing production |
 Select at init time:
@@ -114,7 +132,7 @@ Useful flags:
 If the project was installed with a kit version older than v0.3 (no state file present), the first `update` falls back to today's byte-compare behavior, then writes the state file so subsequent runs use the three-way diff.
-The `agentic-review` skill writes the assembled WORKFLOW §10 handoff to `.agentic/reviews/<ISO-timestamp>-<scope>.md` before delegating to the fresh-context reviewer (Claude Code) or before instructing you to `/clear` and paste (Codex). These files are ephemeral audit artifacts — add `.agentic/reviews/` to your `.gitignore`.
+The `ad-review` skill writes the assembled WORKFLOW §10 handoff to `.agentic/reviews/<ISO-timestamp>-<scope>.md` before delegating to the fresh-context reviewer (Claude Code) or before instructing you to `/clear` and paste (Codex). These files are ephemeral audit artifacts — add `.agentic/reviews/` to your `.gitignore`.
 For persistent install:
@@ -125,41 +143,48 @@ agentic init
 ## Recommended daily sequence
-The kit ships nine universal skills plus three conditional ones — twelve discrete capabilities. The sequence below is a happy path through them for the three flows that cover most daily work. Skip steps that don't apply; the kit never enforces order.
+The kit ships nineteen universal skills in the full `team` / `mature` install set; `ad-deepen` is excluded from `poc` and `solo` per [ADR-0020](doc/adr/0020-deep-modules-vocabulary.md) §4, and `ad-commit` / `ad-pr` / `ad-merge` are excluded from `poc` per [ADR-0023](doc/adr/0023-agentic-commit-skill.md) / [ADR-0024](doc/adr/0024-agentic-pr-skill.md) / [ADR-0025](doc/adr/0025-agentic-merge-skill.md) (concrete counts: `poc` = 9 universals, `solo` = 16, `team` / `mature` = 19). Plus four conditional skills (`ad-design`, `ad-subagent`, `ad-skill`, `ad-hooks`). The sequence below is a happy path through them for the three flows that cover most daily work. Skip steps that don't apply; the kit never enforces order.
 **Greenfield project, first non-trivial feature:**
 1. `agentic init` — install skills.
-2. `/agentic-bootstrap` — produce `AGENTS.md` (operational guide).
-3. `/agentic-architecture` — produce `ARCHITECTURE.md` once load-bearing patterns emerge.
-4. `/agentic-spec` — feature-level spec at `doc/specs/NNNN-<slug>.md` (User Scenarios, Requirements, Success Criteria).
-5. `/agentic-adr` — only when the feature forces a binding architectural decision worth recording for posterity.
-6. `/agentic-task` — work-unit decomposition; reference the spec via `Spec ref`.
-7. `/agentic-ground` — four-source research before code (`agentic-philosophy` auto-loads in parallel).
-8. Implement.
-9. `/agentic-review main..HEAD` — fresh-context §10 review before merge.
-10. `/agentic-audit` — periodic drift check across operational docs and specs.
+2. `/ad-bootstrap` — produce `AGENTS.md` (operational guide).
+3. `/ad-architecture` — produce `ARCHITECTURE.md` once load-bearing patterns emerge.
+4. `/ad-grill` — interview-before-research when the ask is fuzzy; resolves vocabulary into `CONTEXT.md` via `/ad-domain` as terms surface.
+5. `/ad-spec` — feature-level spec at `doc/specs/NNNN-<slug>.md` (User Scenarios, Requirements, Success Criteria).
+6. `/ad-adr` — only when the feature forces a binding architectural decision worth recording for posterity (three-criteria rule: hard to reverse, surprising without context, real trade-off).
+7. `/ad-task` — work-unit decomposition; reference the spec via `Spec ref`.
+8. `/ad-ground` — four-source research before code (`ad-philosophy` auto-loads in parallel).
+9. Implement.
+10. `/ad-review main..HEAD` — fresh-context §10 review before merge.
+11. `/ad-audit` — periodic drift check across operational docs, specs, and the `CONTEXT.md` glossary.
 **Brownfield project, quick fix:**
 1. `agentic update` (only if you want upstream kit changes).
-2. Fix. `agentic-philosophy` auto-loads if the change is non-trivial.
-3. `/agentic-review` only if the fix is non-trivial. Trivial diffs skip the review.
+2. Fix. `ad-philosophy` auto-loads if the change is non-trivial.
+3. `/ad-review` only if the fix is non-trivial. Trivial diffs skip the review.
 4. Commit.
 **Brownfield project, research-only ("what's the best way to add X?"):**
-1. `/agentic-ground` — runs the four-source research pass and surfaces the happy path with citations.
-2. Decide whether the answer becomes a spec (`/agentic-spec`) or a one-off task (`/agentic-task`).
+1. `/ad-ground` — runs the four-source research pass and surfaces the happy path with citations.
+2. Decide whether the answer becomes a spec (`/ad-spec`) or a one-off task (`/ad-task`).
 3. Continue from step 6 of the greenfield flow.
-The kit's discipline scales with the project's maturity. A solo PoC may legitimately skip `/agentic-spec` and `/agentic-adr` (the WORKFLOW §1 prune principle applies — don't add an artifact that wouldn't change agent behavior). A team product running on this kit is expected to use the full sequence and additionally invoke `/agentic-hooks` once to scaffold the deterministic gates per WORKFLOW §11 (pre-commit lint / format / secret-scan; pre-push build / unit / integration). Project maturity profiles that automate the recommendation by stack are deferred — see the next planned release.
+The kit's discipline scales with the project's maturity. A solo PoC may legitimately skip `/ad-spec` and `/ad-adr` (the WORKFLOW §1 prune principle applies — don't add an artifact that wouldn't change agent behavior). A team product running on this kit is expected to use the full sequence and additionally invoke `/ad-hooks` once to scaffold the deterministic gates per WORKFLOW §11 (pre-commit lint / format / secret-scan; pre-push build / unit / integration). Project maturity profiles that automate the recommendation by stack are deferred — see the next planned release.
+**Lost mid-flow?** Invoke `/ad-next` at any time to survey the project's state across the five-layer artifact stack (Constitution → Domain → Spec → Plan/Decisions → Code) and get prioritized next-action recommendations. Read-only; complements `/ad-audit` (drift detection — different question).
+**Technique uncertain across multiple plausible approaches?** Invoke `/ad-spike` (per WORKFLOW §14) when the spec is clear but the *how* is unknown — library choice, multi-stage transformation, novel domain. The skill scaffolds a staged spike with golden fixtures + per-stage debug artifacts + two-layer evaluation under `spikes/NNNN-<slug>/`. The directory is throwaway by design; conclude with `/ad-adr` and delete.
+**Technique known but implementation strategy uncertain?** Invoke `/ad-tdg` (per WORKFLOW §9) when multiple algorithms could produce the expected output with different trade-offs along readability / performance / testability. The skill forces a ground-truth pair, lists the tests covering the file (Test Dependency Map), generates three implementation candidates, and commits to one by a single named criterion — refusing the "optimize for all three at once" failure mode. No file written; the verified implementation is the artifact, with the candidate set + criterion landing in the commit message body.
-**Lost mid-flow?** Invoke `/agentic-next` at any time to survey the project's state across the four-layer artifact stack (Constitution → Spec → Plan/Decisions → Code) and get prioritized next-action recommendations. Read-only; complements `/agentic-audit` (drift detection — different question).
+**Question is fuzzy?** Invoke `/ad-grill` (per [ADR-0022](doc/adr/0022-agentic-grill-skill.md)) before research. One question at a time with a recommended answer; codebase-first when the answer is in code; sharpens vocabulary against `CONTEXT.md`. Routes to `/ad-ground` (research-ready), `/ad-tdg` (implement-ready), `/ad-spike` (technique-uncertain), or `/ad-diagnose` (it turned out to be a bug) when the question is sharp.
-**Technique uncertain across multiple plausible approaches?** Invoke `/agentic-spike` (per WORKFLOW §14) when the spec is clear but the *how* is unknown — library choice, multi-stage transformation, novel domain. The skill scaffolds a staged spike with golden fixtures + per-stage debug artifacts + two-layer evaluation under `spikes/NNNN-<slug>/`. The directory is throwaway by design; conclude with `/agentic-adr` and delete.
+**Bug or performance regression?** Invoke `/ad-diagnose` (per WORKFLOW §15). Five phases — build a feedback loop (the skill itself), reproduce, hypothesise (3-5 ranked falsifiable), instrument (one variable at a time), fix + regression-test. The loop is the skill; everything else is mechanical.
-**Technique known but implementation strategy uncertain?** Invoke `/agentic-tdg` (per WORKFLOW §9) when multiple algorithms could produce the expected output with different trade-offs along readability / performance / testability. The skill forces a ground-truth pair, lists the tests covering the file (Test Dependency Map), generates three implementation candidates, and commits to one by a single named criterion — refusing the "optimize for all three at once" failure mode. No file written; the verified implementation is the artifact, with the candidate set + criterion landing in the commit message body.
+**Stable codebase friction?** Invoke `/ad-deepen` (per WORKFLOW §8, [ADR-0020](doc/adr/0020-deep-modules-vocabulary.md)) on `team` / `mature` profiles. Surfaces deepening opportunities using the Module / Interface / Depth / Seam / Adapter / Leverage / Locality vocabulary and the deletion test. Premature on `poc` / `solo` — auto-install excludes them.
 ## Manual prompts
@@ -186,19 +211,19 @@ Prompts reference templates by relative path. Two ways to give your agent access
 ## Workflows by scenario
-**New project (greenfield).** Initialize git and project structure, then run `agentic init` to install the universal skill set plus any auto-detected conditional skills. From inside Claude Code or Codex: `/agentic-bootstrap` produces `AGENTS.md`, `/agentic-architecture` produces `ARCHITECTURE.md`, `/agentic-adr` records each binding decision, `/agentic-task` opens trackable work items, `/agentic-audit` flags drift, `/agentic-review <range>` runs a fresh-context review of a diff, `/agentic-design` bootstraps `DESIGN.md` from your tokens (frontend projects), `/agentic-subagent` and `/agentic-skill` scaffold custom subagents and skills. `agentic-philosophy` auto-loads on non-trivial work.
+**New project (greenfield).** Initialize git and project structure, then run `agentic init` to install the universal skill set plus any auto-detected conditional skills. From inside Claude Code or Codex: `/ad-bootstrap` produces `AGENTS.md`, `/ad-architecture` produces `ARCHITECTURE.md`, `/ad-adr` records each binding decision, `/ad-task` opens trackable work items, `/ad-audit` flags drift, `/ad-review <range>` runs a fresh-context review of a diff, `/ad-design` bootstraps `DESIGN.md` from your tokens (frontend projects), `/ad-subagent` and `/ad-skill` scaffold custom subagents and skills. `ad-philosophy` auto-loads on non-trivial work.
-**Existing project (brownfield).** Same flow. The project-wide skills (`/agentic-bootstrap`, `/agentic-architecture`) follow a **scan-first pattern**: the agent reads the codebase first, pre-fills every placeholder it can verify, then asks you only about the genuine gaps and conflicts — no philosophical questions, no interview-by-section. The per-artifact skills (`/agentic-adr`, `/agentic-task`, `/agentic-design`, `/agentic-skill`, `/agentic-subagent`) work on a single decision or asset and don't need codebase-wide verification. Backfill ADRs only for decisions that matter going forward.
+**Existing project (brownfield).** Same flow. The project-wide skills (`/ad-bootstrap`, `/ad-architecture`) follow a **scan-first pattern**: the agent reads the codebase first, pre-fills every placeholder it can verify, then asks you only about the genuine gaps and conflicts — no philosophical questions, no interview-by-section. The per-artifact skills (`/ad-adr`, `/ad-task`, `/ad-design`, `/ad-skill`, `/ad-subagent`) work on a single decision or asset and don't need codebase-wide verification. Backfill ADRs only for decisions that matter going forward.
-**Revisiting / auditing existing specs.** Run `/agentic-audit` from inside Claude Code or Codex. The skill reads `AGENTS.md`, `ARCHITECTURE.md`, and `doc/adr/` and produces a drift list — one finding per line, format `[file or section]: spec says X, code says Y. Suggested resolution: change spec / change code / discuss.` Read-only — never rewrites specs. Apply judgment manually before changing anything.
+**Revisiting / auditing existing specs.** Run `/ad-audit` from inside Claude Code or Codex. The skill reads `AGENTS.md`, `ARCHITECTURE.md`, and `doc/adr/` and produces a drift list — one finding per line, format `[file or section]: spec says X, code says Y. Suggested resolution: change spec / change code / discuss.` Read-only — never rewrites specs. Apply judgment manually before changing anything.
-**Reviewing your own diff.** Run `/agentic-review <range>` (e.g. `/agentic-review main..HEAD` or `/agentic-review PR#42`). The skill assembles the diff, the relevant spec slice (`AGENTS.md`, applicable ADRs, the task's Acceptance Criteria), and delegates to a fresh-context reviewer subagent — no inherited bias from the session that wrote the code (WORKFLOW §10). Returns structured findings grouped Blocker / Concern / Note. Codex variant uses `/clear` + paste handoff since Codex has no subagent primitive.
+**Reviewing your own diff.** Run `/ad-review <range>` (e.g. `/ad-review main..HEAD` or `/ad-review PR#42`). The skill assembles the diff, the relevant spec slice (`AGENTS.md`, applicable ADRs, the task's Acceptance Criteria), and delegates to a fresh-context reviewer subagent — no inherited bias from the session that wrote the code (WORKFLOW §10). Returns structured findings grouped Blocker / Concern / Note. Codex variant uses `/clear` + paste handoff since Codex has no subagent primitive.
-**Researching before implementation.** Run `/agentic-ground` (or let it auto-trigger on non-trivial work). The skill runs a four-source research pass — official docs, validated open-source examples, in-repo patterns, and git history — synthesizes a happy path with citations from each source, and gates any deviation behind an irrefutable justification before code is written (WORKFLOW §4 + §5). Output is the input to whatever produces the implementation plan; the skill does not write code.
+**Researching before implementation.** Run `/ad-ground` (or let it auto-trigger on non-trivial work). The skill runs a four-source research pass — official docs, validated open-source examples, in-repo patterns, and git history — synthesizes a happy path with citations from each source, and gates any deviation behind an irrefutable justification before code is written (WORKFLOW §4 + §5). Output is the input to whatever produces the implementation plan; the skill does not write code.
-**Specifying a feature.** Run `/agentic-spec` (or `/agentic-spec` with a feature name). The skill scaffolds `doc/specs/NNNN-<slug>.md` with industry-aligned mandatory sections (User Scenarios, Functional / Non-functional Requirements, Success Criteria, Edge Cases, Out of Scope, Open Questions, Related). Specs are layer 2 of the four-layer artifact stack — Constitution → Spec → Plan/Decisions → Code. One spec per feature; multiple tasks (`/agentic-task`) implement one spec; the task template carries a `Spec ref` field linking back to the spec.
+**Specifying a feature.** Run `/ad-spec` (or `/ad-spec` with a feature name). The skill scaffolds `doc/specs/NNNN-<slug>.md` with industry-aligned mandatory sections (User Scenarios, Functional / Non-functional Requirements, Success Criteria, Edge Cases, Out of Scope, Open Questions, Related). Specs are layer 3 of the five-layer artifact stack — Constitution → Domain → Spec → Plan/Decisions → Code. One spec per feature; multiple tasks (`/ad-task`) implement one spec; the task template carries a `Spec ref` field linking back to the spec.
-**Project already built with agents.** Treat missing artifacts as brownfield (run the relevant skill) and existing artifacts as audit (`/agentic-audit`).
+**Project already built with agents.** Treat missing artifacts as brownfield (run the relevant skill) and existing artifacts as audit (`/ad-audit`).
 ## What ends up in your target project
@@ -219,11 +244,11 @@ your-project/
 │   └── reviews/                (gitignored — ephemeral §10 review handoffs)
 ├── .claude/                    (Claude Code targets)
 │   ├── agentic-state.json      (kit install state — committed)
-│   ├── skills/agentic-*/SKILL.md
+│   ├── skills/ad-*/SKILL.md
 │   └── agents/fresh-context-reviewer.md
 └── .agents/                    (Codex targets, cc-sdd convention)
     ├── agentic-state.json      (kit install state — committed)
-    └── skills/agentic-*/{SKILL.md, agents/openai.yaml}
+    └── skills/ad-*/{SKILL.md, agents/openai.yaml}
 ```
 The pattern matches how [GitHub's spec-kit](https://github.com/github/spec-kit) and [cookiecutter](https://cookiecutter.readthedocs.io/) handle distribution — templates in one place, outputs in another, never mixed.

package/WORKFLOW.md CHANGED Viewed

@@ -4,7 +4,7 @@ Engineering production code with LLMs. Agentic, not vibe coding.
 **The Steve Rogers framing.** The LLM is the super-soldier serum. The engineer is Steve Rogers. The serum amplifies what the engineer already brings — solid bases, organization, investigation, care for quality, architecture, clean code, documentation, observability, maintainability. Add the serum to a disciplined engineer and you get Captain America. Add it to an undisciplined one and you get faster sloppy at scale. This document is those bases written down as principles. The discipline is the input; the LLM is the amplifier; the kit (skills, ADRs, audits, gates) is the scaffolding that keeps the discipline intact across sessions, agents, and projects.
-**The principle behind the rest:** context engineering beats prompt engineering. Context is finite and decays as it fills — aim for the smallest set of high-signal tokens that gets the outcome.
+**The principle behind the rest:** context engineering beats prompt engineering. Context is finite and decays as it fills — aim for the smallest set of high-signal tokens that gets the outcome. Operationally: one task per session, reset rather than extend. A long-running conversation crosses from the model's high-precision zone into a low-precision one as the cross-references multiply; smaller, deliberately-loaded contexts beat larger, accreted ones.
 ## TL;DR
@@ -27,7 +27,10 @@ What to keep in mind:
 13. **Automation needs rails.** Hooks, tests, lint, CI, sandboxing, and permissions matter more than advisory text the agent can forget.
 14. **Autonomy requires observability.** If the agent makes decisions, log the trajectory: tool calls, intermediate outputs, failures.
 15. **Staged spikes when the technique is uncertain.** When the *how* is unknown — a library choice, a CV technique, a multi-stage transformation — break the problem into staged spikes against golden fixtures with per-stage debug artifacts.
-16. **Discipline scales with project maturity.** Same principles bind every project; the artifact set scales. A spike runs posture + research + audit; a regulated product adds spec / ADR / hooks / evals. Add ceremony only where it changes agent behavior; configure at init and reconfigure as the project matures.
+16. **Diagnose with discipline.** For hard bugs and performance regressions: build a fast, deterministic feedback loop *before* hypothesising. Reproduce, then generate three to five ranked falsifiable hypotheses, then change one variable at a time. The feedback loop is the skill; everything else is mechanical.
+17. **One task per session.** Context decays as it fills. Reset and reload only what the next task needs rather than extending a long conversation across many features. Smaller, deliberately-loaded contexts beat large, accreted ones.
+18. **Slice vertically, not horizontally.** Decompose a spec into thin end-to-end paths through every layer (schema, API, UI, tests) rather than one layer at a time. Each slice ships demonstrable behavior; horizontal layer-stacks ship nothing on their own.
+19. **Discipline scales with project maturity.** Same principles bind every project; the artifact set scales. A spike runs posture + research + audit; a regulated product adds spec / ADR / hooks / evals. Add ceremony only where it changes agent behavior; configure at init and reconfigure as the project matures.
 > Working with agents means trading typing for technical direction. The value is in giving the right context, setting boundaries, validating the result, and keeping "almost right" out of production.
@@ -37,14 +40,15 @@ Define the rules before the agent writes a line. The temptation is to dump every
 There are two complementary frames for the artifacts the kit produces. The first is **purpose** — what each artifact is *for*. The second is **loading mechanism** — when each artifact reaches the agent's context.
-### Four-layer artifact stack (purpose)
+### Five-layer artifact stack (purpose)
 1. **Constitution** — `AGENTS.md` (operational guide) and `WORKFLOW.md` (engineering philosophy). Tells the agent how the project works and how the team thinks. Read every session.
-2. **Spec** — `doc/specs/NNNN-<slug>.md`. Feature-level requirements: who the feature is for, what it must do, the measurable success criteria, the explicit non-goals. One spec per feature; multiple tasks implement one spec; ADRs may be driven by spec constraints. Industry-aligned with [GitHub Spec Kit](https://github.com/github/spec-kit).
-3. **Plan / Decisions** — `ARCHITECTURE.md` (system patterns and boundaries), `doc/adr/NNNN-*.md` (binding architectural decisions in Michael Nygard's pattern), `doc/tasks/NNNN-*.md` (per-work-unit plan with checkbox acceptance criteria). The *how* of building what the spec asked for.
-4. **Code** — the implementation. Code is the primary documentation of behavior; comments justify non-obvious choices.
+2. **Domain** — `CONTEXT.md` at the repo root (or `CONTEXT-MAP.md` plus per-context `CONTEXT.md` files for multi-context repos). The project's ubiquitous language: canonical nouns, the aliases to avoid, the relationships between them, and the ambiguities that have already been resolved. Direct application of Domain-Driven Design (Evans, 2003) — when an agent and a human share the project's vocabulary, the agent uses fewer tokens to say more, and the code, tests, and conversation all converge on the same names. Created lazily — first term resolved triggers the file. ADR-0019 records the layer.
+3. **Spec** — `doc/specs/NNNN-<slug>.md`. Feature-level requirements: who the feature is for, what it must do, the measurable success criteria, the explicit non-goals. One spec per feature; multiple tasks implement one spec; ADRs may be driven by spec constraints. Industry-aligned with [GitHub Spec Kit](https://github.com/github/spec-kit).
+4. **Plan / Decisions** — `ARCHITECTURE.md` (system patterns and boundaries), `doc/adr/NNNN-*.md` (binding architectural decisions in Michael Nygard's pattern), `doc/tasks/NNNN-*.md` (per-work-unit plan with checkbox acceptance criteria). The *how* of building what the spec asked for.
+5. **Code** — the implementation. Code is the primary documentation of behavior; comments justify non-obvious choices.
-The four layers scale with project maturity (TL;DR #16). A spike or PoC profile may legitimately ship only Layers 1 and 4 — adding Layers 2 and 3 to a 200-line experiment is ceremony that does not change agent behavior. A team or regulated product runs all four. The kit's profiles (`poc`, `solo`, `team`, `mature`) configure which layers auto-install per project and are changeable as the project matures; the principles in this document bind every profile, only the artifact set differs.
+The five layers scale with project maturity (TL;DR #19 — Discipline scales). A spike or PoC profile may legitimately ship only Layers 1, 2, and 5 — adding Layers 3 and 4 to a 200-line experiment is ceremony that does not change agent behavior. (Domain — Layer 2 — earns its keep even at PoC because vocabulary drift starts on day one.) A team or regulated product runs all five. The kit's profiles (`poc`, `solo`, `team`, `mature`) configure which layers auto-install per project and are changeable as the project matures; the principles in this document bind every profile, only the artifact set differs.
 ### Three context types (loading mechanism)
@@ -52,9 +56,10 @@ The four layers scale with project maturity (TL;DR #16). A spike or PoC profile
 - **Canonical specs are constraints, not advice.** `DESIGN.md` (the visual contract — YAML tokens plus Markdown rationale, per Google Labs' open standard), `ARCHITECTURE.md`, ADRs in `doc/adr/*.md`, and feature specs in `doc/specs/*.md` are facts the agent must obey. If a token, pattern, or success criterion isn't declared here, it doesn't exist. The agent must never invent one.
 - **On-demand context is `SKILL.md`.** Description loads at session start (the listing is capped at 1,536 characters per the spec) and body loads only when the skill is invoked. Use it for repeatable workflows or domain knowledge that shouldn't pay a token cost on every turn.
-Two rules apply across all of the above:
+Three rules apply across all of the above:
 - **Acceptance criteria must be measurable.** "Build a dashboard" fails. "Loads in under 2 seconds, shows 6 months of history, passes axe accessibility" succeeds.
+- **Acceptance criteria must be durable, not procedural.** Describe the behavior and the interfaces — the contracts that survive a rename. Avoid file paths, line numbers, and "open file X and add line Y" wording in the criteria themselves; those rot the moment the implementation moves. Procedural execution steps belong in a separate section of the task file, not in the criteria.
 - **Prune.** If removing a line wouldn't make the agent fail, cut it.
 ## 2. Docs vs. Code
@@ -67,7 +72,7 @@ Comments are exceptions. They justify *why* a non-obvious choice was made — ne
 ### Documentation Discipline
-The agent's authoritative copy of the eight-rule documentation discipline lives in the `agentic-philosophy` skill (`Documentation Discipline` section). The rules are summarized below for reference; the skill carries the full text agents read at session time. ADR-0008 records the canonical decision and the reconciliations against ADR-0004 (file-based task tracking) and ADR-0005 (universal agent behavior as a skill).
+The agent's authoritative copy of the eight-rule documentation discipline lives in the `ad-philosophy` skill (`Documentation Discipline` section). The rules are summarized below for reference; the skill carries the full text agents read at session time. ADR-0008 records the canonical decision and the reconciliations against ADR-0004 (file-based task tracking) and ADR-0005 (universal agent behavior as a skill).
 1. **Definitions and decisions only.** No speculation, history, or unfounded plans.
 2. **No dates, version stamps, `DRAFT` markers, or changelogs in narrative documents.** Decision-record artifacts under `doc/adr/`, `doc/tasks/`, `doc/specs/` are exempt — their lifecycle fields are the auditability primitive.
@@ -78,7 +83,7 @@ The agent's authoritative copy of the eight-rule documentation discipline lives
 7. **No commented-out code; no orphan `TODO` / `FIXME` in source.** Every deferred item references a GitHub Issue or a `doc/tasks/NNNN-*.md` task.
 8. **Tests are living documentation of behavior.**
-The skill body explains the rationale per rule, lists the failure modes the rules counter (bloated `AGENTS.md`, README pages drifting into changelogs, decision artifacts diluted by speculation), and walks through the reconciliations. Generator skills (`agentic-bootstrap`, `agentic-architecture`, `agentic-spec`, `agentic-task`, `agentic-adr`, `agentic-design`) reject violations of these rules at write time; `agentic-audit` flags drift across narrative docs and decision-record artifacts on demand.
+The skill body explains the rationale per rule, lists the failure modes the rules counter (bloated `AGENTS.md`, README pages drifting into changelogs, decision artifacts diluted by speculation), and walks through the reconciliations. Generator skills (`ad-bootstrap`, `ad-architecture`, `ad-spec`, `ad-task`, `ad-adr`, `ad-design`) reject violations of these rules at write time; `ad-audit` flags drift across narrative docs and decision-record artifacts on demand.
 ## 3. Format by Evidence
@@ -97,7 +102,7 @@ No format is universally best. **An observation from my practice, not benchmarke
 ## 4–5. Research Before Implementation
-Combines Find the Happy Path (canonical / idiomatic baseline) and Ground in Real Patterns (anchoring in project-specific examples). The kit treats both as one indivisible flow via `agentic-ground`; two prose sections would frame one operation as two separate practices.
+Combines Find the Happy Path (canonical / idiomatic baseline) and Ground in Real Patterns (anchoring in project-specific examples). The kit treats both as one indivisible flow via `ad-ground`; two prose sections would frame one operation as two separate practices.
 Two sub-practices, joined into one indivisible pass.
@@ -105,14 +110,14 @@ Two sub-practices, joined into one indivisible pass.
 **Ground in real patterns.** Don't dump the codebase into context. Anchor the model in a specific, project-relevant example: *"Find an existing example of [similar feature]; use that exact structure."* Cite specific files, not "the codebase." Use just-in-time retrieval — pass paths or IDs and let the agent fetch via tools.
-The kit ships `agentic-ground` as the workflow-operational implementation of both. It runs a four-source research pass — official docs, validated open-source examples, in-repo patterns, git history — joined by AND not OR, synthesizes the happy path with citations from each source, and gates any deviation behind an irrefutable justification before code is written. Splitting the two sub-practices into separate skills would force two invocations with overlapping research outputs and fragment the synthesis context (ADR-0010).
+The kit ships `ad-ground` as the workflow-operational implementation of both. It runs a four-source research pass — official docs, validated open-source examples, in-repo patterns, git history — joined by AND not OR, synthesizes the happy path with citations from each source, and gates any deviation behind an irrefutable justification before code is written. Splitting the two sub-practices into separate skills would force two invocations with overlapping research outputs and fragment the synthesis context (ADR-0010).
 ## 6. Explore → Plan → Implement → Commit
 For non-trivial changes, four phases:
 1. **Explore (read-only).** Plan mode in your agent. Read, build a mental model, no edits.
-2. **Plan.** Agent writes a Markdown plan. You edit before approving. For non-trivial multi-step work, structure the plan as a per-task file (`doc/tasks/<NNNN>-<slug>.md`) with checkbox acceptance criteria and execution steps — the agent toggles checkboxes as it works rather than rewriting paragraphs, keeping edits cheap and resumable across sessions.
+2. **Plan.** Agent writes a Markdown plan. You edit before approving. For non-trivial multi-step work, structure the plan as a per-task file (`doc/tasks/<NNNN>-<slug>.md`) with checkbox acceptance criteria and execution steps — the agent toggles checkboxes as it works rather than rewriting paragraphs, keeping edits cheap and resumable across sessions. When a spec yields more than one task, **slice vertically**: each task is a thin end-to-end path through every layer the change touches (schema, API, UI, tests), not one layer at a time. The anti-pattern is *horizontal slicing* — "first the schema task, then the API task, then the UI task" — because nothing is shippable until the last task lands. Tracer-bullet vertical slices each ship a demonstrable behavior on their own ([Hunt & Thomas, *Pragmatic Programmer*, 1999](https://en.wikipedia.org/wiki/The_Pragmatic_Programmer)). Tag each task **AFK** (specified completely enough that an autonomous agent can land it) or **HITL** (needs human judgment, taste, design review, or external access) — the dimension is orthogonal to the lifecycle status and tells parallel agents which work is theirs to take.
 3. **Implement.** Execute the approved plan; verify each step before moving to the next.
 4. **Commit.** One logical change per commit.
@@ -135,6 +140,27 @@ Lock the load-bearing decisions into `AGENTS.md` or `CLAUDE.md` so the agent doe
 The agent will follow what's specified and invent what isn't. Prefer specifying.
+### Architectural vocabulary
+Architectural drift accelerates with the agent's typing speed; the counter is shared vocabulary that names the shapes that matter. The kit adopts the canonical terms from John Ousterhout's *A Philosophy of Software Design* (2018) and Michael Feathers's *Working Effectively with Legacy Code* (2004), and uses them in `ARCHITECTURE.md`, ADRs, and architecture-touching skills (ADR-0020).
+- **Module** — anything with an interface and an implementation; deliberately scale-agnostic (function, class, package, vertical slice).
+- **Interface** — everything a caller must know to use the module correctly: types, invariants, ordering constraints, error modes, configuration, performance characteristics. *Not* just the type signature.
+- **Implementation** — what's inside the module.
+- **Depth** — leverage at the interface. A module is **deep** when a large amount of behavior sits behind a small interface; **shallow** when the interface is nearly as complex as the implementation. Depth is a property of the interface, not of line counts (rejected framing: depth-as-implementation-to-interface line ratio rewards padding).
+- **Seam** (Feathers) — a place where behavior can be altered without editing in place; the *location* of an interface. Distinct from DDD's *bounded context*; the kit avoids "boundary" for this reason.
+- **Adapter** — a concrete thing that satisfies an interface at a seam; a role, not a substance.
+- **Leverage** — what callers get from depth: more capability per unit of interface they have to learn.
+- **Locality** — what maintainers get from depth: change, bugs, knowledge concentrated at one place rather than spread across callers.
+Three principles fall out of those terms:
+- **Deletion test.** Imagine deleting the module. If complexity vanishes, the module was a pass-through (delete it). If complexity reappears across N callers, it was earning its keep.
+- **The interface is the test surface.** Callers and tests cross the same seam. If you want to test *past* the interface, the module is probably the wrong shape.
+- **One adapter is a hypothetical seam; two adapters make it real.** Don't introduce a seam unless something actually varies across it.
+Skills that touch architecture (`ad-architecture`, `ad-adr`, the planned `ad-deepen`) use these terms verbatim, so suggestions and reviews land in a single language.
 ## 9. Outcome-Based Prompting (TDG)
 Give the finish line first, not the path:
@@ -172,6 +198,8 @@ The takeaway: §10 (Reviewer) and §11 (Quality Gates) are not optional. Skippin
 Per the Steve Rogers framing in the preamble: the serum cannot manufacture discrimination — it amplifies whatever discrimination the engineer already brings. The kit's job is to encode discrimination into the agent's context (specs, ADRs, fresh-context reviews, deterministic gates) so the amplification compounds in the disciplined direction even when the engineer is sleepy, rushed, or handing off to another collaborator.
+**Two roles of judgment, not one.** Agent review (§10 fresh-context) and the deterministic gates (§11) catch the *mechanical* failures: bugs, coupling, edge cases, broken contracts, missed branches. They do not — and should not be expected to — catch what is left after that: taste, product judgment, visual feel, whether the feature actually solves the user's problem. Those require a human to look at the running thing and form an opinion. Skipping fresh-context review because "the engineer will catch it" wastes engineer attention on mechanical failures the agent should have caught. Skipping the human pass because "the agent reviewed it" ships features that compile, pass, and feel wrong. The two are complements, not substitutes.
 ## 13. Evals for Anything Autonomous
 If your agent is making decisions on its own, you need evals. A few principles:
@@ -185,6 +213,8 @@ If your agent is making decisions on its own, you need evals. A few principles:
 Sometimes the spec is clear but the *technique* is uncertain — you don't know which library, which CV approach, which decomposition. Don't ask the agent to solve it end-to-end. Break the problem into staged spikes and validate each one against curated ground truth.
+**Spike vs. prototype.** Use a *spike* (this section) when the unknown is *how* — the technique itself is uncertain across multiple plausible approaches and validation needs golden fixtures with per-stage debug. Use a *prototype* when the unknown is *what should this feel like* — UI/UX direction, the shape of an interaction, whether a state model holds up under play. Different question, different artifact, different success criterion.
 The flow has four parts:
 1. **Discovery first.** Ask the agent to list canonical approaches grounded in official docs and real examples. Pick one by an explicit criterion. The output of this step is information, not code.
@@ -198,6 +228,53 @@ The flow has four parts:
 This is a combination of established practices, not new terminology: spike (XP), golden datasets, stage-segmented error analysis, trajectory evaluation, and visual debugging in CV pipelines.
+## 15. Diagnose With Discipline
+For hard bugs and performance regressions, the failure mode is jumping to hypotheses before there is a way to check them. The discipline below is the counter; it owes its shape to standard debugging practice (Kernighan & Pike, *The Practice of Programming*, 1999) and the kit ships `ad-diagnose` as the operational implementation (ADR-0021).
+### Phase 1 — Build a feedback loop
+This is *the* skill. Everything else is mechanical. A fast, deterministic, agent-runnable pass/fail signal for the bug is what makes bisection, hypothesis-testing, and instrumentation effective; without one, no amount of staring at code converges. Spend disproportionate effort here.
+Loop-construction techniques, in roughly increasing cost:
+1. Failing test at whatever seam reaches the bug — unit, integration, e2e.
+2. Curl / HTTP script against a running dev server.
+3. CLI invocation with a fixture input, diffing stdout against a known-good snapshot.
+4. Headless browser script (Playwright / Puppeteer) — drives the UI, asserts on DOM, console, network.
+5. Replay a captured trace — saved network request, payload, event log — through the code path in isolation.
+6. Throwaway harness — a minimal subset of the system that exercises the bug code path with one function call.
+7. Property / fuzz loop — when the bug is "sometimes wrong output", run many random inputs and look for the failure mode.
+8. Bisection harness — automate "boot at state X, check, repeat" so `git bisect run` works.
+9. Differential loop — run the same input through two versions or two configs and diff outputs.
+10. HITL bash script — last resort. Structure the human's clicks so the loop is still repeatable.
+Treat the loop as a product: faster, sharper signal, more deterministic, every iteration. A two-second deterministic loop is a debugging superpower; a thirty-second flaky loop is barely better than no loop.
+For non-deterministic bugs, the goal is not a clean repro but a *higher reproduction rate* — loop the trigger, parallelize, narrow timing windows, inject sleeps. A 50%-flake is debuggable; 1% is not.
+If a loop genuinely cannot be built, stop and say so — do not proceed to hypotheses.
+### Phase 2 — Reproduce
+Run the loop. Confirm the failure matches the user's description (not a different failure that happens to be nearby), reproduces consistently (or at a high enough rate), and the exact symptom is captured for later phases to verify the fix against. Wrong bug = wrong fix.
+### Phase 3 — Hypothesise
+Generate **three to five ranked hypotheses** before testing any of them. Single-hypothesis generation anchors on the first plausible idea.
+Each hypothesis must be **falsifiable**: state the prediction it makes — *"if X is the cause, then changing Y will make the bug disappear / changing Z will make it worse."* If you cannot state the prediction, the hypothesis is a vibe; sharpen it or discard it.
+Show the ranked list to the user before testing. Domain knowledge often re-ranks instantly ("we just deployed a change to #3"), or marks hypotheses already ruled out — a cheap checkpoint, big time saver.
+### Phase 4 — Instrument
+Each probe maps to a specific prediction from Phase 3. Change one variable at a time. Log enough that the result is unambiguous — a single instrument that confirms two predictions at once tends to confirm whichever one you were rooting for.
+### Phase 5 — Fix and regression-test
+Apply the fix. Re-run the Phase-1 loop and confirm the captured symptom is gone. Promote the loop's check into a permanent test that lives next to the code so the same failure mode cannot return silently.
 ---
 These are starting points. Prune what doesn't fit your codebase.
@@ -214,6 +291,8 @@ This is not theory I read and copied. Most of the practices here come from years
 **§14 (Staged Spikes With Golden Fixtures) is my own working technique.** I haven't seen it documented end-to-end as a single named pattern, but each component (spike, golden dataset, stage-segmented error analysis, trajectory evaluation, visual CV debugging) has its own lineage in the literature listed under Sources. The combination — discovery → fixture → staged pipeline with debug artifacts → two-layer evaluation — is how I attack problems where the *technique* itself is uncertain.
+**Practices added through cross-pollination.** A second pass over the guide compared this kit's coverage against [Matt Pocock's `mattpocock/skills`](https://github.com/mattpocock/skills) — a separate body of agent-engineering practice grounded in the same canonical literature (DDD, *Pragmatic Programmer*, Ousterhout, Feathers, Beck). The comparison surfaced principles that earned their place in this document on independent merits: the **Domain layer** (§1 Layer 2, ADR-0019) from DDD's ubiquitous-language pattern; the **architectural vocabulary** (§8, ADR-0020) from Ousterhout's depth and Feathers's seams; **Diagnose with discipline** (§15, ADR-0021) from standard debugging practice; **vertical slicing** and **HITL/AFK tagging** (§6) from PragProg's tracer-bullet metaphor; **AI mechanical / human judgment** (§12 second paragraph). Where Pocock's framing sharpened our own — vocabulary or principle that we already gestured at without naming — the borrowed phrasing is acknowledged inline; everything else stays kit-original.
 External claims (specific percentages, named frameworks) are cited under Sources. Everything else is operational guidance from practice or synthesis across that material — a working model, refined over time, not academic claim.
 ## Sources
@@ -221,6 +300,14 @@ External claims (specific percentages, named frameworks) are cited under Sources
 **§1 — Spec-Driven Design**
 - DESIGN.md spec (Google Labs): https://github.com/google-labs-code/design.md
 - SKILL.md spec (Anthropic): https://code.claude.com/docs/en/skills
+- Domain-Driven Design (Evans, 2003) — *Domain-Driven Design: Tackling Complexity in the Heart of Software*. Source for the Domain layer (`CONTEXT.md`) and the ubiquitous-language discipline.
+**§6 — Explore → Plan → Implement → Commit**
+- *The Pragmatic Programmer* (Hunt & Thomas, 1999), tracer-bullet metaphor — source for the vertical-slicing principle.
+**§8 — Architectural Boundaries**
+- *A Philosophy of Software Design* (Ousterhout, 2018) — Module / Interface / Depth vocabulary; rejected framing of depth-as-line-ratio.
+- *Working Effectively with Legacy Code* (Feathers, 2004) — Seam vocabulary.
 **§12 — The Bottleneck Is Discrimination, Not Generation**
 - JetBrains *DevEcosystem 2025*: https://devecosystem-2025.jetbrains.com/artificial-intelligence
@@ -232,3 +319,7 @@ External claims (specific percentages, named frameworks) are cited under Sources
 - Stage-segmented error analysis — Hamel Husain's evals FAQ: https://hamel.dev/blog/posts/evals-faq/
 - Trajectory evaluation — LangSmith docs: https://docs.langchain.com/langsmith/trajectory-evals
 - Visual CV debugging — OpenCV cvv tutorial: https://docs.opencv.org/3.4/d7/dcf/tutorial_cvv_introduction.html
+**§15 — Diagnose With Discipline**
+- *The Practice of Programming* (Kernighan & Pike, 1999) — chapters on debugging and testing.
+- Falsifiability framing — Karl Popper, *The Logic of Scientific Discovery* (1959).

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "@alexandrealvaro/agentic",
-  "version": "0.13.0-beta.1",
+  "version": "0.15.0-beta.1",
   "description": "Bootstrap and audit AGENTS.md, ARCHITECTURE.md, ADRs, skills, and subagents for engineering production code with LLMs",
   "type": "module",
   "bin": {
@@ -16,7 +16,7 @@
     "LICENSE"
   ],
   "engines": {
-    "node": ">=18"
+    "node": ">=20"
   },
   "scripts": {
     "start": "node bin/agentic.js",