@alexandrealvaro/agentic 0.15.2-beta.1 → 0.16.0-beta.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (52) hide show
  1. package/README.md +40 -44
  2. package/WORKFLOW.md +83 -36
  3. package/package.json +1 -1
  4. package/src/commands/init.js +3 -0
  5. package/src/lib/profiles.js +20 -4
  6. package/src/skills/claude-code/ad-architecture/SKILL.md +1 -8
  7. package/src/skills/claude-code/ad-archive/SKILL.md +117 -0
  8. package/src/skills/claude-code/ad-audit/SKILL.md +13 -3
  9. package/src/skills/claude-code/ad-bootstrap/SKILL.md +44 -7
  10. package/src/skills/claude-code/ad-commit/SKILL.md +7 -7
  11. package/src/skills/claude-code/ad-deepen/SKILL.md +5 -5
  12. package/src/skills/claude-code/ad-diagnose/SKILL.md +2 -2
  13. package/src/skills/claude-code/ad-domain/SKILL.md +4 -4
  14. package/src/skills/claude-code/ad-grill/SKILL.md +3 -3
  15. package/src/skills/claude-code/ad-ground/SKILL.md +11 -7
  16. package/src/skills/claude-code/ad-guidelines/SKILL.md +200 -0
  17. package/src/skills/claude-code/ad-merge/SKILL.md +6 -6
  18. package/src/skills/claude-code/ad-next/SKILL.md +21 -14
  19. package/src/skills/claude-code/ad-philosophy/SKILL.md +6 -3
  20. package/src/skills/claude-code/ad-pr/SKILL.md +6 -6
  21. package/src/skills/claude-code/ad-prd/SKILL.md +115 -0
  22. package/src/skills/claude-code/ad-spec/SKILL.md +16 -15
  23. package/src/skills/claude-code/ad-spike/SKILL.md +4 -4
  24. package/src/skills/claude-code/ad-tdd/SKILL.md +117 -0
  25. package/src/skills/claude-code/ad-tdg/SKILL.md +1 -1
  26. package/src/skills/codex/ad-architecture/SKILL.md +1 -8
  27. package/src/skills/codex/ad-archive/SKILL.md +98 -0
  28. package/src/skills/codex/ad-archive/agents/openai.yaml +5 -0
  29. package/src/skills/codex/ad-audit/SKILL.md +9 -2
  30. package/src/skills/codex/ad-bootstrap/SKILL.md +44 -7
  31. package/src/skills/codex/ad-commit/SKILL.md +6 -6
  32. package/src/skills/codex/ad-deepen/SKILL.md +5 -5
  33. package/src/skills/codex/ad-diagnose/SKILL.md +2 -2
  34. package/src/skills/codex/ad-domain/SKILL.md +3 -3
  35. package/src/skills/codex/ad-grill/SKILL.md +2 -2
  36. package/src/skills/codex/ad-ground/SKILL.md +10 -6
  37. package/src/skills/codex/ad-guidelines/SKILL.md +114 -0
  38. package/src/skills/codex/ad-guidelines/agents/openai.yaml +5 -0
  39. package/src/skills/codex/ad-merge/SKILL.md +4 -4
  40. package/src/skills/codex/ad-next/SKILL.md +18 -13
  41. package/src/skills/codex/ad-philosophy/SKILL.md +6 -3
  42. package/src/skills/codex/ad-pr/SKILL.md +5 -5
  43. package/src/skills/codex/ad-prd/SKILL.md +81 -0
  44. package/src/skills/codex/ad-prd/agents/openai.yaml +5 -0
  45. package/src/skills/codex/ad-spec/SKILL.md +13 -13
  46. package/src/skills/codex/ad-spike/SKILL.md +3 -3
  47. package/src/skills/codex/ad-tdd/SKILL.md +86 -0
  48. package/src/skills/codex/ad-tdd/agents/openai.yaml +5 -0
  49. package/templates/agents-project.md +1 -6
  50. package/templates/architecture.md +0 -7
  51. package/templates/guidelines.md +388 -0
  52. package/templates/prd.md +73 -0
package/README.md CHANGED
@@ -4,7 +4,7 @@ A starter kit for engineering production code with LLMs. Lean templates and init
4
4
 
5
5
  **The framing.** An LLM is the super-soldier serum; the engineer is Steve Rogers. The serum amplifies what the engineer already brings — solid bases, investigation, care for quality, architecture, clean code, observability, maintainability. The kit encodes those bases as skills, ADRs, and gates so the amplification compounds in the right direction. See [WORKFLOW.md](WORKFLOW.md) for the principles.
6
6
 
7
- The CLI installs nineteen universal skills at the default `team` profile (`ad-bootstrap`, `ad-philosophy`, `ad-architecture`, `ad-adr`, `ad-spec`, `ad-task`, `ad-audit`, `ad-review`, `ad-ground`, `ad-next`, `ad-spike`, `ad-tdg`, `ad-domain`, `ad-grill`, `ad-deepen`, `ad-diagnose`, `ad-commit`, `ad-pr`, `ad-merge`) plus four conditional ones (`ad-design` for frontend, `ad-subagent` for Claude Code, `ad-skill` opt-in, `ad-hooks` opt-in / recommended at `mature`) into the agent's native location. Lower profiles install fewer (`poc` = 9 universals, `solo` = 16, `team` / `mature` = 19; `ad-deepen` is excluded from `poc` / `solo` per [ADR-0020](doc/adr/0020-deep-modules-vocabulary.md) §4; `ad-commit` / `ad-pr` / `ad-merge` are excluded from `poc` per [ADR-0023](doc/adr/0023-agentic-commit-skill.md) / [ADR-0024](doc/adr/0024-agentic-pr-skill.md) / [ADR-0025](doc/adr/0025-agentic-merge-skill.md)). Each skill produces its artifact or runs its operation via the agent's native conversational UI; `agentic update` keeps installed skills in sync with upstream kit changes via a state-aware three-way diff. Report rough edges via [GitHub Issues](https://github.com/alexandremendoncaalvaro/agentic-development/issues); current releases live under [GitHub Releases](https://github.com/alexandremendoncaalvaro/agentic-development/releases).
7
+ The CLI installs a universal skill set at the default `team` profile plus four conditional skills (`ad-design` for frontend, `ad-subagent` for Claude Code, `ad-skill` opt-in, `ad-hooks` opt-in / recommended at `mature`) into the agent's native location. The full skill table is below; profile counts and exclusions are summarized under [Project maturity profiles](#project-maturity-profiles). Each skill produces its artifact or runs its operation via the agent's native conversational UI; `agentic update` keeps installed skills in sync with upstream kit changes via a state-aware three-way diff. Report rough edges via [GitHub Issues](https://github.com/alexandremendoncaalvaro/agentic-development/issues); releases live under [GitHub Releases](https://github.com/alexandremendoncaalvaro/agentic-development/releases).
8
8
 
9
9
  ## Prerequisites
10
10
 
@@ -33,22 +33,25 @@ Two categories ([ADR-0007](doc/adr/0007-workflow-operational-skills.md)) and two
33
33
  | `ad-bootstrap` | spec-driven | universal | Scans the repo, writes `AGENTS.md` ≤150 lines | `/ad-bootstrap` |
34
34
  | `ad-architecture` | spec-driven | universal | Scans the code, writes `ARCHITECTURE.md` | `/ad-architecture` |
35
35
  | `ad-adr` | spec-driven | universal | Drafts `doc/adr/NNNN-<slug>.md` from the conversation | `/ad-adr` |
36
- | `ad-spec` | spec-driven | universal | Drafts `doc/specs/NNNN-<slug>.md` — feature-level spec (User Scenarios, Requirements, Success Criteria) layer 3 of the five-layer stack | `/ad-spec` |
36
+ | `ad-prd` | spec-driven | universal in `solo` / `team` / `mature` | Drafts or updates `doc/product/PRD.md` (or per-product `<slug>.md`)Layer 3, product-level scope (target user, problem, success metrics, multi-feature roadmap) feature specs inherit from. Distinct from `ad-spec` (feature-level). Lazy lifecycle. | `/ad-prd` |
37
+ | `ad-guidelines` | spec-driven | universal in `solo` / `team` / `mature` | Drafts or updates `GUIDELINES.md` — the project's full engineering reference (Clean Architecture, SOLID, Object Calisthenics loose/moderate/strict per Bay 2008, code standards, complexity discipline, testing strategy, security). Layer 1 Constitution trinity member alongside `WORKFLOW.md` and `AGENTS.md`. Scan-first; pre-suggested defaults from canon. Lazy lifecycle. | `/ad-guidelines` |
38
+ | `ad-spec` | spec-driven | universal | Drafts `doc/specs/NNNN-<slug>.md` — feature-level spec (User Scenarios, Requirements, Success Criteria) Layer 4 of the six-layer stack; references parent PRD for product-scope inheritance | `/ad-spec` |
37
39
  | `ad-task` | spec-driven | universal | Drafts `doc/tasks/NNNN-<slug>.md` (checkbox + Notes format; carries `Spec ref` to link the implementing spec) | `/ad-task` |
38
40
  | `ad-audit` | spec-driven | universal | Read-only drift report (AGENTS.md / ARCHITECTURE.md / ADRs) | `/ad-audit` |
39
41
  | `ad-philosophy` | workflow-operational | universal | Universal agent guardrails — auto-loads on non-trivial work | implicit |
40
42
  | `ad-review` | workflow-operational | universal | Fresh-context code review per WORKFLOW §10; structured findings, no "approve" | `/ad-review <range>` |
41
- | `ad-ground` | workflow-operational | universal | Four-source pre-implementation research (docs / OSS / in-repo / git history) + happy-path synthesis + deviation gate per WORKFLOW §4 + §5 | `/ad-ground` |
42
- | `ad-next` | workflow-operational | universal | State-aware navigation aid (`flutter doctor` pattern) — surveys the five-layer artifact stack and recommends prioritized next actions; complements `ad-audit` (drift) | `/ad-next` |
43
+ | `ad-ground` | workflow-operational | universal | Four-source pre-implementation research (docs / impl-refs / in-repo / git history) + happy-path synthesis + deviation gate per WORKFLOW §4 + §5 | `/ad-ground` |
44
+ | `ad-next` | workflow-operational | universal | State-aware navigation aid (`flutter doctor` pattern) — surveys the six-layer artifact stack and recommends prioritized next actions; complements `ad-audit` (drift) | `/ad-next` |
43
45
  | `ad-spike` | workflow-operational | universal | Staged spike with golden fixtures per WORKFLOW §14, for cases where the *technique* is uncertain across multiple plausible approaches; produces `spikes/NNNN-<slug>/` with discovery + fixture + pipeline-with-gates + two-layer evaluation | `/ad-spike` |
44
46
  | `ad-tdg` | workflow-operational | universal | Outcome-based prompting per WORKFLOW §9 — ground truth pair + Test Dependency Map + three approaches + single-criterion selection, for cases where the technique is known but the implementation strategy is uncertain | `/ad-tdg` |
47
+ | `ad-tdd` | workflow-operational | universal | Test-Driven Development per WORKFLOW §16 — red-green-refactor as deterministic LLM guardrail. Five phases — confirm regime, plan vertically, tracer bullet, incremental loop, refactor while green. Distinct from `ad-tdg`; routes to it for strategy selection inside the GREEN phase | `/ad-tdd` |
45
48
  | `ad-domain` | spec-driven | universal | Lazy lifecycle owner of `CONTEXT.md` (Layer 2 — ubiquitous language per Evans 2003); single-context or `CONTEXT-MAP.md` multi-context | `/ad-domain` |
46
49
  | `ad-grill` | workflow-operational | universal | Interview-before-research grilling — one question at a time with recommendation, codebase-first, sharpens vocabulary against `CONTEXT.md`, captures terms via `ad-domain` and decisions via `ad-adr` (three-criteria rule); upstream of `ad-ground` | `/ad-grill` |
47
50
  | `ad-deepen` | workflow-operational | universal in `team` + `mature` only | Surface deepening opportunities using WORKFLOW §8 vocabulary (Module / Interface / Depth / Seam / Adapter / Leverage / Locality); three phases — explore, present numbered candidates with deletion-test framing, grill the chosen one; pairs with `ad-audit` | `/ad-deepen` |
48
51
  | `ad-diagnose` | workflow-operational | universal | Disciplined diagnosis loop for hard bugs and performance regressions per WORKFLOW §15; five phases — build a feedback loop (the skill itself), reproduce, hypothesise (3-5 ranked falsifiable), instrument (one variable at a time), fix + regression-test | `/ad-diagnose` |
49
- | `ad-commit` | workflow-operational | universal in `solo` / `team` / `mature` | Atomic Conventional Commits with DCO `Signed-off-by` sign-off per ADR-0023; four phases — scope intake, stage-split when concerns mix, draft message, sign + write. Identity from `git config`. No `Co-Authored-By`. Helper, not blocker | `/ad-commit` |
50
- | `ad-pr` | workflow-operational | universal in `solo` / `team` / `mature` | Open a GitHub PR with a uniform body shape (Summary / Test plan / Links) per ADR-0024; four phases — preflight (`gh` auth + branch pushed), scope assembly, draft body, open + report URL. Title format = Conventional Commits | `/ad-pr` |
51
- | `ad-merge` | workflow-operational | universal in `solo` / `team` / `mature` | Evaluate (CI / fresh-context review / linked task / unresolved comments / mergeability) and merge a GitHub PR per ADR-0025; CI green = hard gate (yields to explicit user override); others = warnings. Merge mode auto-detected from `gh repo view`; `--delete-branch` by default | `/ad-merge` |
52
+ | `ad-commit` | workflow-operational | universal in `solo` / `team` / `mature` | Atomic Conventional Commits with DCO `Signed-off-by` sign-off; four phases — scope intake, stage-split when concerns mix, draft message, sign + write. Identity from `git config`. No `Co-Authored-By`. Helper, not blocker | `/ad-commit` |
53
+ | `ad-pr` | workflow-operational | universal in `solo` / `team` / `mature` | Open a GitHub PR with a uniform body shape (Summary / Test plan / Links); four phases — preflight (`gh` auth + branch pushed), scope assembly, draft body, open + report URL. Title format = Conventional Commits | `/ad-pr` |
54
+ | `ad-merge` | workflow-operational | universal in `solo` / `team` / `mature` | Evaluate (CI / fresh-context review / linked task / unresolved comments / mergeability) and merge a GitHub PR; CI green = hard gate (yields to explicit user override); others = warnings. Merge mode auto-detected from `gh repo view`; `--delete-branch` by default | `/ad-merge` |
52
55
  | `ad-design` | spec-driven | auto if frontend detected | Bootstrap `DESIGN.md` from existing tokens (Figma, tailwind.config, tokens.json, CSS custom props) | `/ad-design` |
53
56
  | `ad-subagent` | spec-driven | auto if installing for Claude Code | Drafts `.claude/agents/<name>.md` (Claude Code only — Codex has no subagent primitive) | `/ad-subagent` |
54
57
  | `ad-skill` | spec-driven | opt-in only | Drafts a new Claude Code or Codex skill at the appropriate path | `/ad-skill` |
@@ -58,27 +61,16 @@ A short TUI shows the detected mode, agent, and feature signals (frontend / `.cl
58
61
 
59
62
  If your project already has an `AGENTS.md` (or `CLAUDE.md`), the installer appends a managed `Skills installed by agentic` section bracketed by `<!-- agentic-managed-skills:start -->` / `:end -->` markers. User content outside those markers is byte-preserved; re-runs update only the managed block.
60
63
 
61
- ### v0.15 bundle
62
-
63
- The four skills accepted by ADR-0019 / 0020 / 0021 / 0022 (the mattpocock-absorption Phase-2 set) shipped together in v0.15.0-beta.1. Closes [task-0020](doc/tasks/0020-mattpocock-absorptions.md) Phase 2 in a single release rather than the originally-planned per-minor stack (v0.15 → v0.18). Rationale: 3 of 4 skills are direct mirrors of mature mattpocock prior art; bundling kept the WORKFLOW §15 / §8 / Layer-2 deltas coherent in one ship.
64
-
65
- | Skill | ADR | Operationalizes |
66
- | --- | --- | --- |
67
- | `ad-domain` | [ADR-0019](doc/adr/0019-domain-language-layer.md) | Layer 2 (ubiquitous language) — lazy `CONTEXT.md` lifecycle |
68
- | `ad-grill` | [ADR-0022](doc/adr/0022-agentic-grill-skill.md) | Interview-before-research, upstream of `ad-ground` |
69
- | `ad-deepen` | [ADR-0020](doc/adr/0020-deep-modules-vocabulary.md) | WORKFLOW §8 vocabulary applied to refactor proposals (team + mature only) |
70
- | `ad-diagnose` | [ADR-0021](doc/adr/0021-diagnose-discipline.md) | WORKFLOW §15 five-phase debugging |
71
-
72
64
  ## Project maturity profiles
73
65
 
74
66
  The kit ships four profiles that select which skills auto-install. Same WORKFLOW principles bind every profile; only the artifact set scales.
75
67
 
76
68
  | Profile | Universal install set | Conditional posture | Recommended for |
77
69
  | --- | --- | --- | --- |
78
- | `poc` | 9 — philosophy, ground, audit, next, spike, tdg, domain, grill, diagnose | all blocked | spike, hackathon, exploration |
79
- | `solo` | 16 — + bootstrap, spec, task, review, commit, pr, merge | architecture / adr / hooks opt-in; design auto if frontend; subagent auto for Claude Code | solo developer shipping a real product |
80
- | `team` (default) | 19 — + architecture, adr, deepen | hooks opt-in; design / subagent / skill follow autoIf | team product, shared discipline |
81
- | `mature` | 19 — same as team | hooks **recommended**; deepening surfaced via `ad-deepen` | regulated / public-facing production |
70
+ | `poc` | 10 — philosophy, ground, audit, next, spike, tdg, tdd, domain, grill, diagnose | all blocked; `ad-prd` + `ad-guidelines` blocked | spike, hackathon, exploration |
71
+ | `solo` | 19 — + bootstrap, prd, guidelines, spec, task, review, commit, pr, merge | architecture / adr / hooks opt-in; design auto if frontend; subagent auto for Claude Code | solo developer shipping a real product |
72
+ | `team` (default) | 22 — + architecture, adr, deepen | hooks opt-in; design / subagent / skill follow autoIf | team product, shared discipline |
73
+ | `mature` | 22 — same as team | hooks **recommended**; deepening surfaced via `ad-deepen` | regulated / public-facing production |
82
74
 
83
75
  Select at init time:
84
76
 
@@ -96,8 +88,6 @@ npx @alexandrealvaro/agentic@beta profile list # list all profiles
96
88
 
97
89
  `profile set` runs the equivalent of `update` after writing the new profile, so the install set matches. The state-aware three-way diff prompts before overwriting user-edited files.
98
90
 
99
- Existing v0.7 installs with no profile field migrate to `team` automatically — same install set as before, no user action needed.
100
-
101
91
  ## Updating an existing project
102
92
 
103
93
  To pull upstream kit changes into a project that already has agentic skills installed:
@@ -130,8 +120,6 @@ Useful flags:
130
120
  * `--agent claude-code | codex | both` — restrict the update to one agent.
131
121
  * `--yes` — non-interactive, accepts defaults (skip on conflict, keep orphans). Combine with `--force` if you want overwrites in CI.
132
122
 
133
- If the project was installed with a kit version older than v0.3 (no state file present), the first `update` falls back to today's byte-compare behavior, then writes the state file so subsequent runs use the three-way diff.
134
-
135
123
  The `ad-review` skill writes the assembled WORKFLOW §10 handoff to `.agentic/reviews/<ISO-timestamp>-<scope>.md` before delegating to the fresh-context reviewer (Claude Code) or before instructing you to `/clear` and paste (Codex). These files are ephemeral audit artifacts — add `.agentic/reviews/` to your `.gitignore`.
136
124
 
137
125
  For persistent install:
@@ -143,21 +131,23 @@ agentic init
143
131
 
144
132
  ## Recommended daily sequence
145
133
 
146
- The kit ships nineteen universal skills in the full `team` / `mature` install set; `ad-deepen` is excluded from `poc` and `solo` per [ADR-0020](doc/adr/0020-deep-modules-vocabulary.md) §4, and `ad-commit` / `ad-pr` / `ad-merge` are excluded from `poc` per [ADR-0023](doc/adr/0023-agentic-commit-skill.md) / [ADR-0024](doc/adr/0024-agentic-pr-skill.md) / [ADR-0025](doc/adr/0025-agentic-merge-skill.md) (concrete counts: `poc` = 9 universals, `solo` = 16, `team` / `mature` = 19). Plus four conditional skills (`ad-design`, `ad-subagent`, `ad-skill`, `ad-hooks`). The sequence below is a happy path through them for the three flows that cover most daily work. Skip steps that don't apply; the kit never enforces order.
134
+ The sequence below is a happy path for the three flows that cover most daily work. Skip steps that don't apply; the kit never enforces order. Profile-specific install sets and conditional skills are summarized under [Project maturity profiles](#project-maturity-profiles).
147
135
 
148
136
  **Greenfield project, first non-trivial feature:**
149
137
 
150
138
  1. `agentic init` — install skills.
151
- 2. `/ad-bootstrap` — produce `AGENTS.md` (operational guide).
152
- 3. `/ad-architecture` — produce `ARCHITECTURE.md` once load-bearing patterns emerge.
153
- 4. `/ad-grill` — interview-before-research when the ask is fuzzy; resolves vocabulary into `CONTEXT.md` via `/ad-domain` as terms surface.
154
- 5. `/ad-spec` — feature-level spec at `doc/specs/NNNN-<slug>.md` (User Scenarios, Requirements, Success Criteria).
155
- 6. `/ad-adr` — only when the feature forces a binding architectural decision worth recording for posterity (three-criteria rule: hard to reverse, surprising without context, real trade-off).
156
- 7. `/ad-task` — work-unit decomposition; reference the spec via `Spec ref`.
157
- 8. `/ad-ground` — four-source research before code (`ad-philosophy` auto-loads in parallel).
158
- 9. Implement.
159
- 10. `/ad-review main..HEAD` — fresh-context §10 review before merge.
160
- 11. `/ad-audit` periodic drift check across operational docs, specs, and the `CONTEXT.md` glossary.
139
+ 2. `/ad-bootstrap` — produce `AGENTS.md` (distilled session-load rules).
140
+ 3. `/ad-guidelines` — produce `GUIDELINES.md` (full engineering reference: Clean Architecture, SOLID, Object Calisthenics tier, testing strategy, security). Excluded from `poc`. After writing, refresh `AGENTS.md` so its engineering sections point to `GUIDELINES.md`.
141
+ 4. `/ad-architecture` — produce `ARCHITECTURE.md` once load-bearing patterns emerge.
142
+ 5. `/ad-prd` — scope the product (target user, problem, success metrics, multi-feature roadmap) at `doc/product/PRD.md`. Excluded from `poc` profile.
143
+ 6. `/ad-grill` — interview-before-research when the ask is fuzzy; resolves vocabulary into `CONTEXT.md` via `/ad-domain` as terms surface.
144
+ 7. `/ad-spec` — feature-level spec at `doc/specs/NNNN-<slug>.md`; inherits target user, success metrics, and constraints from the PRD.
145
+ 8. `/ad-adr` — only when the feature forces a binding architectural decision worth recording for posterity (three-criteria rule: hard to reverse, surprising without context, real trade-off).
146
+ 9. `/ad-task` — work-unit decomposition; reference the spec via `Spec ref`.
147
+ 10. `/ad-ground` — four-source research before code (`ad-philosophy` auto-loads in parallel).
148
+ 11. Implement — `/ad-tdd` when the behavior is test-expressible up front (red-green-refactor); `/ad-tdg` when the technique is known but the implementation strategy is uncertain.
149
+ 12. `/ad-review main..HEAD` — fresh-context §10 review before merge.
150
+ 13. `/ad-audit` — periodic drift check across operational docs, PRD, guidelines, specs, and the `CONTEXT.md` glossary.
161
151
 
162
152
  **Brownfield project, quick fix:**
163
153
 
@@ -170,17 +160,19 @@ The kit ships nineteen universal skills in the full `team` / `mature` install se
170
160
 
171
161
  1. `/ad-ground` — runs the four-source research pass and surfaces the happy path with citations.
172
162
  2. Decide whether the answer becomes a spec (`/ad-spec`) or a one-off task (`/ad-task`).
173
- 3. Continue from step 6 of the greenfield flow.
163
+ 3. Continue from step 8 of the greenfield flow.
174
164
 
175
- The kit's discipline scales with the project's maturity. A solo PoC may legitimately skip `/ad-spec` and `/ad-adr` (the WORKFLOW §1 prune principle applies — don't add an artifact that wouldn't change agent behavior). A team product running on this kit is expected to use the full sequence and additionally invoke `/ad-hooks` once to scaffold the deterministic gates per WORKFLOW §11 (pre-commit lint / format / secret-scan; pre-push build / unit / integration). Project maturity profiles that automate the recommendation by stack are deferred — see the next planned release.
165
+ The kit's discipline scales with the project's maturity. A solo PoC may legitimately skip `/ad-spec` and `/ad-adr` (the WORKFLOW §1 prune principle applies — don't add an artifact that wouldn't change agent behavior). A team product running on this kit is expected to use the full sequence and additionally invoke `/ad-hooks` once to scaffold the deterministic gates per WORKFLOW §11 (pre-commit lint / format / secret-scan; pre-push build / unit / integration).
176
166
 
177
- **Lost mid-flow?** Invoke `/ad-next` at any time to survey the project's state across the five-layer artifact stack (Constitution → Domain → Spec → Plan/Decisions → Code) and get prioritized next-action recommendations. Read-only; complements `/ad-audit` (drift detection — different question).
167
+ **Lost mid-flow?** Invoke `/ad-next` at any time to survey the project's state across the six-layer artifact stack (Constitution → Domain → Product → Spec → Plan/Decisions → Code) and get prioritized next-action recommendations. Read-only; complements `/ad-audit` (drift detection — different question).
178
168
 
179
169
  **Technique uncertain across multiple plausible approaches?** Invoke `/ad-spike` (per WORKFLOW §14) when the spec is clear but the *how* is unknown — library choice, multi-stage transformation, novel domain. The skill scaffolds a staged spike with golden fixtures + per-stage debug artifacts + two-layer evaluation under `spikes/NNNN-<slug>/`. The directory is throwaway by design; conclude with `/ad-adr` and delete.
180
170
 
181
171
  **Technique known but implementation strategy uncertain?** Invoke `/ad-tdg` (per WORKFLOW §9) when multiple algorithms could produce the expected output with different trade-offs along readability / performance / testability. The skill forces a ground-truth pair, lists the tests covering the file (Test Dependency Map), generates three implementation candidates, and commits to one by a single named criterion — refusing the "optimize for all three at once" failure mode. No file written; the verified implementation is the artifact, with the candidate set + criterion landing in the commit message body.
182
172
 
183
- **Question is fuzzy?** Invoke `/ad-grill` (per [ADR-0022](doc/adr/0022-agentic-grill-skill.md)) before research. One question at a time with a recommended answer; codebase-first when the answer is in code; sharpens vocabulary against `CONTEXT.md`. Routes to `/ad-ground` (research-ready), `/ad-tdg` (implement-ready), `/ad-spike` (technique-uncertain), or `/ad-diagnose` (it turned out to be a bug) when the question is sharp.
173
+ **Behavior is test-expressible up front?** Invoke `/ad-tdd` (per WORKFLOW §16) when the change has a clear behavior to express as a test — a new API capability, a bug-driven regression test, a refactor with a known contract. Red-green-refactor as deterministic LLM guardrail: one test → red → minimum code green repeat. Horizontal-slicing anti-pattern (bulk-write tests, then bulk-write code) is rejected. Distinct from `/ad-tdg`; route to `/ad-tdg` inside the GREEN phase when the strategy for that test cycle is uncertain.
174
+
175
+ **Question is fuzzy?** Invoke `/ad-grill` before research. One question at a time with a recommended answer; codebase-first when the answer is in code; sharpens vocabulary against `CONTEXT.md`. Routes to `/ad-ground` (research-ready), `/ad-tdg` (implement-ready), `/ad-spike` (technique-uncertain), or `/ad-diagnose` (it turned out to be a bug) when the question is sharp.
184
176
 
185
177
  **Bug or performance regression?** Invoke `/ad-diagnose` (per WORKFLOW §15). Five phases — build a feedback loop (the skill itself), reproduce, hypothesise (3-5 ranked falsifiable), instrument (one variable at a time), fix + regression-test. The loop is the skill; everything else is mechanical.
186
178
 
@@ -219,9 +211,13 @@ Prompts reference templates by relative path. Two ways to give your agent access
219
211
 
220
212
  **Reviewing your own diff.** Run `/ad-review <range>` (e.g. `/ad-review main..HEAD` or `/ad-review PR#42`). The skill assembles the diff, the relevant spec slice (`AGENTS.md`, applicable ADRs, the task's Acceptance Criteria), and delegates to a fresh-context reviewer subagent — no inherited bias from the session that wrote the code (WORKFLOW §10). Returns structured findings grouped Blocker / Concern / Note. Codex variant uses `/clear` + paste handoff since Codex has no subagent primitive.
221
213
 
222
- **Researching before implementation.** Run `/ad-ground` (or let it auto-trigger on non-trivial work). The skill runs a four-source research pass — official docs, validated open-source examples, in-repo patterns, and git history — synthesizes a happy path with citations from each source, and gates any deviation behind an irrefutable justification before code is written (WORKFLOW §4 + §5). Output is the input to whatever produces the implementation plan; the skill does not write code.
214
+ **Researching before implementation.** Run `/ad-ground` (or let it auto-trigger on non-trivial work). The skill runs a four-source research pass — official docs, validated implementation references (open-source repos, Stack Overflow / forum answers, blog posts, gists), in-repo patterns, and git history — synthesizes a happy path with citations from each source, and gates any deviation behind an irrefutable justification before code is written (WORKFLOW §4 + §5). Output is the input to whatever produces the implementation plan; the skill does not write code.
215
+
216
+ **Scoping a product.** Run `/ad-prd` when you are scoping a product (not a single feature) — target user, problem, success metrics across multiple features, roadmap, cross-feature constraints. The skill scaffolds `doc/product/PRD.md` (single-product) or `doc/product/<slug>.md` plus `doc/product/PRODUCT-MAP.md` (multi-product). Layer 3 of the six-layer stack; feature specs (`/ad-spec`) reference back to it for product-scope inheritance. Excluded from `poc` profile.
217
+
218
+ **Defining engineering standards.** Run `/ad-guidelines` to write `GUIDELINES.md` — the project's full engineering reference: Clean Architecture binding, SOLID, Object Calisthenics tier (loose / moderate / strict per Bay 2008), naming conventions, error handling, complexity discipline, testing strategy, security policy. Scan-first — reads the language toolchain, existing test/lint/format config, and existing `AGENTS.md` engineering sections; pre-fills detected fields; asks only the genuine gaps and preference questions. Layer 1 Constitution trinity member alongside `WORKFLOW.md` (kit-shipped philosophy) and `AGENTS.md` (distilled session-load rules). Excluded from `poc` profile.
223
219
 
224
- **Specifying a feature.** Run `/ad-spec` (or `/ad-spec` with a feature name). The skill scaffolds `doc/specs/NNNN-<slug>.md` with industry-aligned mandatory sections (User Scenarios, Functional / Non-functional Requirements, Success Criteria, Edge Cases, Out of Scope, Open Questions, Related). Specs are layer 3 of the five-layer artifact stack — Constitution → Domain → Spec → Plan/Decisions → Code. One spec per feature; multiple tasks (`/ad-task`) implement one spec; the task template carries a `Spec ref` field linking back to the spec.
220
+ **Specifying a feature.** Run `/ad-spec` (or `/ad-spec` with a feature name). The skill scaffolds `doc/specs/NNNN-<slug>.md` with industry-aligned mandatory sections (User Scenarios, Functional / Non-functional Requirements, Success Criteria, Edge Cases, Out of Scope, Open Questions, Related). Specs are Layer 4 of the six-layer artifact stack — Constitution → Domain → Product → Spec → Plan/Decisions → Code. One spec per feature; multiple tasks (`/ad-task`) implement one spec; the task template carries a `Spec ref` field linking back to the spec; the spec references its parent PRD for product-scope inheritance.
225
221
 
226
222
  **Project already built with agents.** Treat missing artifacts as brownfield (run the relevant skill) and existing artifacts as audit (`/ad-audit`).
227
223
 
@@ -266,7 +262,7 @@ node bin/agentic.js init
266
262
 
267
263
  Branch layout:
268
264
  - `main` — manual workflow source of truth (no CLI code; the npm package gets promoted here when mature).
269
- - `cli` — CLI development (you're here). Beta releases are published from this branch.
265
+ - `cli` — CLI development. Beta releases are published from this branch.
270
266
 
271
267
  ## License
272
268
 
package/WORKFLOW.md CHANGED
@@ -28,27 +28,29 @@ What to keep in mind:
28
28
  14. **Autonomy requires observability.** If the agent makes decisions, log the trajectory: tool calls, intermediate outputs, failures.
29
29
  15. **Staged spikes when the technique is uncertain.** When the *how* is unknown — a library choice, a CV technique, a multi-stage transformation — break the problem into staged spikes against golden fixtures with per-stage debug artifacts.
30
30
  16. **Diagnose with discipline.** For hard bugs and performance regressions: build a fast, deterministic feedback loop *before* hypothesising. Reproduce, then generate three to five ranked falsifiable hypotheses, then change one variable at a time. The feedback loop is the skill; everything else is mechanical.
31
- 17. **One task per session.** Context decays as it fills. Reset and reload only what the next task needs rather than extending a long conversation across many features. Smaller, deliberately-loaded contexts beat large, accreted ones.
32
- 18. **Slice vertically, not horizontally.** Decompose a spec into thin end-to-end paths through every layer (schema, API, UI, tests) rather than one layer at a time. Each slice ships demonstrable behavior; horizontal layer-stacks ship nothing on their own.
33
- 19. **Discipline scales with project maturity.** Same principles bind every project; the artifact set scales. A spike runs posture + research + audit; a regulated product adds spec / ADR / hooks / evals. Add ceremony only where it changes agent behavior; configure at init and reconfigure as the project matures.
31
+ 17. **TDD as deterministic guardrail.** When behavior is test-expressible up front: tracer-bullet test minimum code green refactor. One test at a time; tests verify behavior through public interfaces, not implementation. Pairs with TDG inside the GREEN phase of any cycle where multiple implementation strategies are plausible.
32
+ 18. **One task per session.** Reset rather than extend; reload only what the next task needs.
33
+ 19. **Slice vertically, not horizontally.** Decompose a spec into thin end-to-end paths through every layer (schema, API, UI, tests) rather than one layer at a time. Each slice ships demonstrable behavior; horizontal layer-stacks ship nothing on their own.
34
+ 20. **Discipline scales with project maturity.** Same principles bind every project; the artifact set scales. A spike runs posture + research + audit; a regulated product adds spec / ADR / hooks / evals. Add ceremony only where it changes agent behavior; configure at init and reconfigure as the project matures.
34
35
 
35
36
  > Working with agents means trading typing for technical direction. The value is in giving the right context, setting boundaries, validating the result, and keeping "almost right" out of production.
36
37
 
37
38
  ## 1. Spec-Driven Design
38
39
 
39
- Define the rules before the agent writes a line. The temptation is to dump everything into `AGENTS.md` and hope it works — but bloat causes the model to ignore the file. Keep one topic per Markdown file: lean and focused.
40
+ Define the rules before the agent writes a line. The temptation is to dump everything into `AGENTS.md` and hope it works — but bloat causes the model to ignore the file. One topic per file.
40
41
 
41
42
  There are two complementary frames for the artifacts the kit produces. The first is **purpose** — what each artifact is *for*. The second is **loading mechanism** — when each artifact reaches the agent's context.
42
43
 
43
- ### Five-layer artifact stack (purpose)
44
+ ### Six-layer artifact stack (purpose)
44
45
 
45
- 1. **Constitution** — `AGENTS.md` (operational guide) and `WORKFLOW.md` (engineering philosophy). Tells the agent how the project works and how the team thinks. Read every session.
46
- 2. **Domain** — `CONTEXT.md` at the repo root (or `CONTEXT-MAP.md` plus per-context `CONTEXT.md` files for multi-context repos). The project's ubiquitous language: canonical nouns, the aliases to avoid, the relationships between them, and the ambiguities that have already been resolved. Direct application of Domain-Driven Design (Evans, 2003) — when an agent and a human share the project's vocabulary, the agent uses fewer tokens to say more, and the code, tests, and conversation all converge on the same names. Created lazily — first term resolved triggers the file. ADR-0019 records the layer.
47
- 3. **Spec** — `doc/specs/NNNN-<slug>.md`. Feature-level requirements: who the feature is for, what it must do, the measurable success criteria, the explicit non-goals. One spec per feature; multiple tasks implement one spec; ADRs may be driven by spec constraints. Industry-aligned with [GitHub Spec Kit](https://github.com/github/spec-kit).
48
- 4. **Plan / Decisions** — `ARCHITECTURE.md` (system patterns and boundaries), `doc/adr/NNNN-*.md` (binding architectural decisions in Michael Nygard's pattern), `doc/tasks/NNNN-*.md` (per-work-unit plan with checkbox acceptance criteria). The *how* of building what the spec asked for.
49
- 5. **Code** — the implementation. Code is the primary documentation of behavior; comments justify non-obvious choices.
46
+ 1. **Constitution** — `WORKFLOW.md` (universal engineering philosophy, kit-shipped), `AGENTS.md` (project-specific compressed rules read every session), and `GUIDELINES.md` (project-specific full engineering reference: Clean Architecture binding, SOLID application, Object Calisthenics tier, code standards, complexity discipline, API rules, performance standards, build system, static analysis, quality gates, testing strategy, git workflow, documentation, security). The trinity answers "how this project is built" at three compression levels — principles, distilled rules, full reference. `AGENTS.md` and `GUIDELINES.md` are project-owned and lazy; `WORKFLOW.md` is kit-shipped.
47
+ 2. **Domain** — `CONTEXT.md` at the repo root (or `CONTEXT-MAP.md` plus per-context `CONTEXT.md` files for multi-context repos). The project's ubiquitous language: canonical nouns, the aliases to avoid, the relationships between them, and the ambiguities that have already been resolved. Direct application of Domain-Driven Design (Evans, 2003) — when an agent and a human share the project's vocabulary, the agent uses fewer tokens to say more, and the code, tests, and conversation all converge on the same names. Created lazily — first term resolved triggers the file.
48
+ 3. **Product** — `doc/product/PRD.md` (single-product) or `doc/product/<slug>.md` plus `doc/product/PRODUCT-MAP.md` (multi-product). Product-level scope: target user, problem, goals, non-goals, success metrics, multi-feature roadmap, cross-feature constraints. One PRD per product; feature specs (Layer 4) inherit target user, success metrics, and constraints from the PRD. Created lazily when a product is being scoped.
49
+ 4. **Spec** — `doc/specs/NNNN-<slug>.md`. Feature-level requirements: who the feature is for, what it must do, the measurable success criteria, the explicit non-goals. One spec per feature; multiple tasks implement one spec; specs reference their parent PRD for product-scope inheritance. Industry-aligned with [GitHub Spec Kit](https://github.com/github/spec-kit).
50
+ 5. **Plan / Decisions** — `ARCHITECTURE.md` (system patterns and boundaries), `doc/adr/NNNN-*.md` (binding architectural decisions in Michael Nygard's pattern), `doc/tasks/NNNN-*.md` (per-work-unit plan with checkbox acceptance criteria). The *how* of building what the spec asked for. Layer 5 spans three document roles — `ARCHITECTURE.md` is definition, ADRs are decision-record, tasks are tracking — write each per its role, not as a single class.
51
+ 6. **Code** — the implementation. Code is the primary documentation of behavior; comments justify non-obvious choices.
50
52
 
51
- The five layers scale with project maturity (TL;DR #19 — Discipline scales). A spike or PoC profile may legitimately ship only Layers 1, 2, and 5 — adding Layers 3 and 4 to a 200-line experiment is ceremony that does not change agent behavior. (Domain — Layer 2 — earns its keep even at PoC because vocabulary drift starts on day one.) A team or regulated product runs all five. The kit's profiles (`poc`, `solo`, `team`, `mature`) configure which layers auto-install per project and are changeable as the project matures; the principles in this document bind every profile, only the artifact set differs.
53
+ The six layers scale with project maturity (TL;DR #20 — Discipline scales). A spike or PoC profile may legitimately ship only Layers 1, 2, and 6 — adding Layers 3, 4, and 5 to a 200-line experiment is ceremony that does not change agent behavior. (Domain — Layer 2 — earns its keep even at PoC because vocabulary drift starts on day one.) A team or regulated product runs all six. The kit's profiles (`poc`, `solo`, `team`, `mature`) configure which layers auto-install per project and are changeable as the project matures; the principles in this document bind every profile, only the artifact set differs.
52
54
 
53
55
  ### Three context types (loading mechanism)
54
56
 
@@ -72,18 +74,22 @@ Comments are exceptions. They justify *why* a non-obvious choice was made — ne
72
74
 
73
75
  ### Documentation Discipline
74
76
 
75
- The agent's authoritative copy of the eight-rule documentation discipline lives in the `ad-philosophy` skill (`Documentation Discipline` section). The rules are summarized below for reference; the skill carries the full text agents read at session time. ADR-0008 records the canonical decision and the reconciliations against ADR-0004 (file-based task tracking) and ADR-0005 (universal agent behavior as a skill).
77
+ The rules below are canonical.
76
78
 
77
79
  1. **Definitions and decisions only.** No speculation, history, or unfounded plans.
78
- 2. **No dates, version stamps, `DRAFT` markers, or changelogs in narrative documents.** Decision-record artifacts under `doc/adr/`, `doc/tasks/`, `doc/specs/` are exempt — their lifecycle fields are the auditability primitive.
80
+ 2. **No dates, version stamps, `DRAFT` markers, or changelogs in narrative documents.** Decision-record artifacts under `doc/adr/`, `doc/tasks/`, `doc/specs/`, `doc/product/` are exempt — their lifecycle fields are the auditability primitive.
79
81
  3. **No emoji anywhere.**
80
82
  4. **Business context first.**
81
83
  5. **One scope per document. No duplication.**
82
84
  6. **Code is the primary documentation of behavior.**
83
85
  7. **No commented-out code; no orphan `TODO` / `FIXME` in source.** Every deferred item references a GitHub Issue or a `doc/tasks/NNNN-*.md` task.
84
86
  8. **Tests are living documentation of behavior.**
87
+ 9. **Single responsibility per document, named by layer.** Each document plays exactly one role — **definition** (pillar Layers 1, 2, 3 plus `ARCHITECTURE.md`; read-mostly after defined; no per-item tracking UI), **decision-record** (ADRs, Specs; single `Status:` field; mostly immutable after acceptance), or **tracking** (Tasks; full checkbox / append-only-Notes UI is their job). A document does not take on adjacent layers' responsibilities.
88
+ 10. **Each layer owns its directory index. No duplication across docs.** `doc/adr/` is the canonical ADR index; `doc/tasks/`, `doc/specs/`, `doc/product/` likewise own their layers. Other documents do not list / digest / re-state these indices.
89
+ 11. **Cross-references must be load-bearing.** If you can delete the reference and the surrounding statement still stands, the reference was decoration — drop it.
90
+ 12. **Universal-vs-kit-state separation.** `WORKFLOW.md` ships to downstream projects and carries universal principles only — it does not cite kit-specific ADR numbers. The kit's adoption of each principle is recorded in `doc/adr/` (kit-internal). Literature citations remain (they are universal load-bearing references).
85
91
 
86
- The skill body explains the rationale per rule, lists the failure modes the rules counter (bloated `AGENTS.md`, README pages drifting into changelogs, decision artifacts diluted by speculation), and walks through the reconciliations. Generator skills (`ad-bootstrap`, `ad-architecture`, `ad-spec`, `ad-task`, `ad-adr`, `ad-design`) reject violations of these rules at write time; `ad-audit` flags drift across narrative docs and decision-record artifacts on demand.
92
+ The twelve rules above are the authoritative Documentation Discipline contract. They counter recurring failure modes: session-load files bloated past relevance, README pages drifting into changelogs, decision artifacts diluted by speculation, definition documents accumulating per-item tracking UI, and pillar documents duplicating adjacent layers' indices.
87
93
 
88
94
  ## 3. Format by Evidence
89
95
 
@@ -96,21 +102,19 @@ Structure reduces ambiguity, but format isn't magic. Pick the right one for the
96
102
 
97
103
  Use XML when the prompt mixes instructions, retrieved context, examples, user input, and expected output — the separation pays off when there's noise to fight. Skip it for simple prompts; if Markdown headings or plain text are clear enough, use them.
98
104
 
99
- No format is universally best. **An observation from my practice, not benchmarked:** I've seen consistent gains when shifting prompts to XML — most noticeably with autonomous agents, where the prompt has to land alone without conversational refinement. Direct interactive use (Claude Code, Codex) tolerates loose Markdown; unattended agents don't. Claude in particular seems to respond well to XML, which I attribute to its training, but I haven't benchmarked it. Treat this as a starting hypothesis worth testing on your own target model and task before standardizing.
105
+ No format is universally best. XML separation pays off most for autonomous agents, where the prompt has to land alone without conversational refinement; interactive use (Claude Code, Codex) tolerates loose Markdown. Claude appears to respond well to XML, plausibly an artifact of training. Treat this as a working hypothesis worth testing on your own target model and task before standardizing.
100
106
 
101
- **Host-aware structured prompts.** Hosts that expose structured-prompt primitives — Claude Code's `AskUserQuestion` (multi-choice cards) and Plan Mode (plan-approval cards) — reduce ambiguity at confirmation gates more reliably than inline text. Prefer the structured primitive when the host supports it; fall back to numbered text otherwise. Codex has no equivalent today; its skills stay on numbered text. Skills carrying confirmation gates or multi-choice interview steps prescribe this preference (ADR-0014).
107
+ **Host-aware structured prompts.** Hosts that expose structured-prompt primitives — Claude Code's `AskUserQuestion` (multi-choice cards) and Plan Mode (plan-approval cards) — reduce ambiguity at confirmation gates more reliably than inline text. Prefer the structured primitive when the host supports it; fall back to numbered text otherwise. Codex has no equivalent today; its skills stay on numbered text.
102
108
 
103
109
  ## 4–5. Research Before Implementation
104
110
 
105
- Combines Find the Happy Path (canonical / idiomatic baseline) and Ground in Real Patterns (anchoring in project-specific examples). The kit treats both as one indivisible flow via `ad-ground`; two prose sections would frame one operation as two separate practices.
106
-
107
- Two sub-practices, joined into one indivisible pass.
111
+ Two sub-practices, joined into one indivisible pass: find the canonical baseline (Happy Path) and anchor it in project-specific examples (Ground in Real Patterns).
108
112
 
109
113
  **Find the happy path.** Before implementing, ask: *"What is the canonical, idiomatic way to implement [X] in [stack]? Cite official docs. List common deviations and why people take them."* Mid-implementation: *"Are we still on the happy path? If we deviated, was it deliberate?"* Sometimes you can't follow the happy path — that's fine. Always know where it is and why you left it.
110
114
 
111
115
  **Ground in real patterns.** Don't dump the codebase into context. Anchor the model in a specific, project-relevant example: *"Find an existing example of [similar feature]; use that exact structure."* Cite specific files, not "the codebase." Use just-in-time retrieval — pass paths or IDs and let the agent fetch via tools.
112
116
 
113
- The kit ships `ad-ground` as the workflow-operational implementation of both. It runs a four-source research pass official docs, validated open-source examples, in-repo patterns, git history — joined by AND not OR, synthesizes the happy path with citations from each source, and gates any deviation behind an irrefutable justification before code is written. Splitting the two sub-practices into separate skills would force two invocations with overlapping research outputs and fragment the synthesis context (ADR-0010).
117
+ A research pass joining four sources official docs, validated implementation references (public repos, Q&A forums, blog posts, gists), in-repo patterns, git history — by AND not OR, synthesizing the happy path with citations from each source, and gating any deviation behind an irrefutable justification before code is written, is the operational shape.
114
118
 
115
119
  ## 6. Explore → Plan → Implement → Commit
116
120
 
@@ -142,7 +146,7 @@ The agent will follow what's specified and invent what isn't. Prefer specifying.
142
146
 
143
147
  ### Architectural vocabulary
144
148
 
145
- Architectural drift accelerates with the agent's typing speed; the counter is shared vocabulary that names the shapes that matter. The kit adopts the canonical terms from John Ousterhout's *A Philosophy of Software Design* (2018) and Michael Feathers's *Working Effectively with Legacy Code* (2004), and uses them in `ARCHITECTURE.md`, ADRs, and architecture-touching skills (ADR-0020).
149
+ Architectural drift accelerates with the agent's typing speed; the counter is shared vocabulary that names the shapes that matter. The canonical terms come from John Ousterhout's *A Philosophy of Software Design* (2018) and Michael Feathers's *Working Effectively with Legacy Code* (2004).
146
150
 
147
151
  - **Module** — anything with an interface and an implementation; deliberately scale-agnostic (function, class, package, vertical slice).
148
152
  - **Interface** — everything a caller must know to use the module correctly: types, invariants, ordering constraints, error modes, configuration, performance characteristics. *Not* just the type signature.
@@ -159,7 +163,7 @@ Three principles fall out of those terms:
159
163
  - **The interface is the test surface.** Callers and tests cross the same seam. If you want to test *past* the interface, the module is probably the wrong shape.
160
164
  - **One adapter is a hypothetical seam; two adapters make it real.** Don't introduce a seam unless something actually varies across it.
161
165
 
162
- Skills that touch architecture (`ad-architecture`, `ad-adr`, the planned `ad-deepen`) use these terms verbatim, so suggestions and reviews land in a single language.
166
+ Architectural skills should use these terms verbatim, so suggestions and reviews land in a single language.
163
167
 
164
168
  ## 9. Outcome-Based Prompting (TDG)
165
169
 
@@ -168,7 +172,7 @@ Give the finish line first, not the path:
168
172
  1. **Ground truth.** Raw input plus exact expected output.
169
173
  2. **Command the implementation.** The algorithm that connects the two.
170
174
  3. **Iterate by criterion.** Ask for three approaches; pick by *one* explicit criterion (readability, performance, *or* testability — not all three at once).
171
- 4. **Test Dependency Map, not procedural TDD.** Don't tell the agent "do TDD" — tell it *which* tests cover the file. *"Before modifying X.ts, list which tests cover it. Run. Modify. Run. If none cover it, write one first."*
175
+ 4. **Test Dependency Map as pre-flight.** Before any change, tell the agent *which* tests cover the file. *"Before modifying X.ts, list which tests cover it. Run. Modify. Run. If none cover it, write one first."* TDM is orthogonal to the implementation regime: it pairs with TDG (this section) when the implementation strategy is the uncertain axis, and with TDD (§16) when the behavior is test-expressible up front. TDM rejects cargo-cult test-first ceremony, not test-first development itself.
172
176
 
173
177
  ## 10. Reviewer With Fresh Context
174
178
 
@@ -192,7 +196,7 @@ In Claude Code, this means a subagent (the `Task` tool, or a custom `.claude/age
192
196
 
193
197
  Modern agents handle most routine implementation. The work has shifted to catching what they got wrong.
194
198
 
195
- Two 2025 industry surveys point at the same wall. JetBrains' DevEcosystem 2025 reports that only **44%** of developers have AI fully or partially integrated into their workflow. Stack Overflow's 2025 Developer Survey adds: **66%** of developers cite "AI solutions that are almost right, but not quite" as their top frustration, and **45%** say debugging AI-generated code is more time-consuming.
199
+ Industry data underlines the wall. Recent JetBrains and Stack Overflow developer surveys show a majority frustration with "AI solutions that are almost right, but not quite," and a near-majority report that debugging AI-generated code costs more time than writing it from scratch. See Sources for the surveys.
196
200
 
197
201
  The takeaway: §10 (Reviewer) and §11 (Quality Gates) are not optional. Skipping them is where bug density grows.
198
202
 
@@ -226,11 +230,9 @@ The flow has four parts:
226
230
 
227
231
  **When to use it:** the unknown is *how* — a library choice, a CV technique, a multi-stage transformation. Skip it when the *how* is routine.
228
232
 
229
- This is a combination of established practices, not new terminology: spike (XP), golden datasets, stage-segmented error analysis, trajectory evaluation, and visual debugging in CV pipelines.
230
-
231
233
  ## 15. Diagnose With Discipline
232
234
 
233
- For hard bugs and performance regressions, the failure mode is jumping to hypotheses before there is a way to check them. The discipline below is the counter; it owes its shape to standard debugging practice (Kernighan & Pike, *The Practice of Programming*, 1999) and the kit ships `ad-diagnose` as the operational implementation (ADR-0021).
235
+ For hard bugs and performance regressions, the failure mode is jumping to hypotheses before there is a way to check them. The discipline below is the counter, grounded in standard debugging practice (Kernighan & Pike, *The Practice of Programming*).
234
236
 
235
237
  ### Phase 1 — Build a feedback loop
236
238
 
@@ -275,23 +277,61 @@ Each probe maps to a specific prediction from Phase 3. Change one variable at a
275
277
 
276
278
  Apply the fix. Re-run the Phase-1 loop and confirm the captured symptom is gone. Promote the loop's check into a permanent test that lives next to the code so the same failure mode cannot return silently.
277
279
 
278
- ---
280
+ ## 16. Test-Driven Development (TDD)
279
281
 
280
- These are starting points. Prune what doesn't fit your codebase.
282
+ TDG (§9) gives the agent the finish line and asks it to find the path. TDD asks the agent to express one behavior as a test, drive the minimum implementation that makes the test pass, then deepen. The two are distinct LLM disciplines; the Test Dependency Map (§9, item 4) is a pre-flight that pairs with either.
283
+
284
+ TDD is the cleanest **deterministic guardrail** when the change has a clear behavior to express up front: a failing test is unambiguous, so the "almost right but not quite" failure mode (§12) cannot ship silently. **Good tests read like a specification** — *"user can checkout with a valid cart"* tells you the capability exists. **Bad tests couple to implementation** — mock internal collaborators, assert on private state, or query the database directly when the public interface is what the caller uses. A test that breaks on a rename but not on a behavior change was testing implementation, not behavior.
285
+
286
+ ### Phase 1 — Plan vertically
287
+
288
+ Before writing a test or code:
289
+
290
+ 1. Read `CONTEXT.md` so test names and interface vocabulary land in the project's ubiquitous language.
291
+ 2. Confirm the public interface — what does the caller need to know? Types, ordering, error modes. Interface design *is* testability design.
292
+ 3. Identify deepening opportunities (Layer 2 vocabulary per §8): small interface, deep implementation.
293
+ 4. List the behaviors to test, not the implementation steps. Pick the **first** behavior — the one that proves end-to-end the path works. The rest defer until the tracer bullet lands.
294
+ 5. Establish the green baseline. For existing code, list the tests already covering the surface (TDM, §9.4) and run them. For new code, write the first test fresh.
295
+ 6. Get the user's approval on the plan in one sentence before any test or code is written.
296
+
297
+ ### Phase 2 — Tracer bullet
298
+
299
+ Write ONE test that confirms ONE behavior through the public interface. `RED → GREEN → end-to-end path proven`. The fail-reason matters: a test failing because the function is undefined is not the same as a test failing because the assertion is wrong; only the latter proves the test verifies behavior rather than the existence of a symbol.
300
+
301
+ Do not write a second test until this one is green.
302
+
303
+ ### Phase 3 — Incremental loop
281
304
 
282
- ## How this guide was built
305
+ For each remaining behavior: `RED → minimum code → GREEN`. Three rules:
283
306
 
284
- This is not theory I read and copied. Most of the practices here come from years of shipping production code, with and without LLMs.
307
+ - **One test at a time.** Bulk-writing all tests first then all implementation is *horizontal slicing* the named anti-pattern. Bulk-written tests verify *imagined* behavior, not actual behavior; the suite becomes insensitive to real changes and you outrun your headlights, committing to test structure before understanding the implementation.
308
+ - **Only enough code to pass the current test.** Anticipating the next test bloats the implementation and couples it to assumptions not yet verified.
309
+ - **Tests verify behavior through public interfaces.** No private-method tests, no internal-collaborator mocks, no direct database/file-system assertions when the public interface is what the caller uses. A test that breaks on a rename but not on a behavior change was testing implementation, not behavior.
285
310
 
286
- **Patterns I was already using when I drafted this guide.** Several of them I used before knowing they had established names; once the industry converged on a label, I adopted it to make the conversation easier: **Spec-Driven Design** (§1), **Docs-vs-Code separation** (§2), **pattern matching by real examples** (§5), **explicit Action Commands** (§7), **Architectural Boundaries** (§8), **Outcome-Based Prompting / TDG** (§9), the **senior-reviewer technique** (§10), and **deterministic Quality Gates** (§11).
311
+ ### Phase 4 Refactor
287
312
 
288
- **The XML observation in §3** is also drawn from practice, not benchmarks. It's a hypothesis worth testing on your own setup, not a settled finding.
313
+ Once all planned tests pass: extract duplication, deepen modules (move complexity behind smaller interfaces), apply SOLID where natural. Run tests after each refactor step. **Never refactor while RED** — get to green first, then refactor with the green baseline as the safety net.
289
314
 
290
- **Practices that came in through iteration on this guide.** They weren't in my original draft, but each matches a problem I'd already encountered or a habit I'd only formalized loosely: **Find the Happy Path** 4), **Explore Plan Implement Commit** (§6), **The Bottleneck Is Discrimination, Not Generation** (§12 the 2025 industry statistics ground the principle, they didn't generate it), and **Evals for Anything Autonomous** (§13).
315
+ The refactor phase is where deepening8) happens. Treat tests as a fixed contract; the implementation is free to change shape as long as the contract holds.
291
316
 
292
- **§14 (Staged Spikes With Golden Fixtures) is my own working technique.** I haven't seen it documented end-to-end as a single named pattern, but each component (spike, golden dataset, stage-segmented error analysis, trajectory evaluation, visual CV debugging) has its own lineage in the literature listed under Sources. The combination — discovery → fixture → staged pipeline with debug artifacts → two-layer evaluation — is how I attack problems where the *technique* itself is uncertain.
317
+ ### TDD vs TDG when to use which
293
318
 
294
- **Practices added through cross-pollination.** A second pass over the guide compared this kit's coverage against [Matt Pocock's `mattpocock/skills`](https://github.com/mattpocock/skills) — a separate body of agent-engineering practice grounded in the same canonical literature (DDD, *Pragmatic Programmer*, Ousterhout, Feathers, Beck). The comparison surfaced principles that earned their place in this document on independent merits: the **Domain layer** (§1 Layer 2, ADR-0019) from DDD's ubiquitous-language pattern; the **architectural vocabulary** (§8, ADR-0020) from Ousterhout's depth and Feathers's seams; **Diagnose with discipline** (§15, ADR-0021) from standard debugging practice; **vertical slicing** and **HITL/AFK tagging** (§6) from PragProg's tracer-bullet metaphor; **AI mechanical / human judgment** (§12 second paragraph). Where Pocock's framing sharpened our own — vocabulary or principle that we already gestured at without naming — the borrowed phrasing is acknowledged inline; everything else stays kit-original.
319
+ - **TDD** behavior is known and test-expressible up front. The test is the contract; implementation follows.
320
+ - **TDG** — behavior may not yet be testable as a single pair, but the outcome is known. Three candidate implementations + one criterion picks the path.
321
+
322
+ When both apply (test-expressible AND multiple implementation strategies are plausible), use TDD as the outer loop and TDG inside the GREEN phase to select the strategy for that cycle.
323
+
324
+ ---
325
+
326
+ These are starting points. Prune what doesn't fit your codebase.
327
+
328
+ ## Provenance
329
+
330
+ This guide is operational practice, not theory. Most principles come from years of shipping production code, with and without LLMs; some were in use before the industry converged on labels for them, and once a label landed the kit adopted it to make the conversation easier.
331
+
332
+ §14 (Staged Spikes With Golden Fixtures) is the author's own working technique — each component (spike, golden dataset, stage-segmented error analysis, trajectory evaluation, visual CV debugging) has its own lineage in the literature under Sources; the combination — discovery → fixture → staged pipeline with debug artifacts → two-layer evaluation — is original to this kit.
333
+
334
+ A cross-pollination pass against [Matt Pocock's `mattpocock/skills`](https://github.com/mattpocock/skills) — a separate body of agent-engineering practice grounded in the same canonical literature (DDD, *Pragmatic Programmer*, Ousterhout, Feathers, Beck) — surfaced principles that earned their place on independent merits: the **Domain layer** (§1 Layer 2), the **architectural vocabulary** (§8), **Diagnose with discipline** (§15), **vertical slicing** and **HITL/AFK tagging** (§6), and **AI mechanical / human judgment** (§12). Where Pocock's framing sharpened our own, the borrowed phrasing is acknowledged inline; everything else stays kit-original.
295
335
 
296
336
  External claims (specific percentages, named frameworks) are cited under Sources. Everything else is operational guidance from practice or synthesis across that material — a working model, refined over time, not academic claim.
297
337
 
@@ -323,3 +363,10 @@ External claims (specific percentages, named frameworks) are cited under Sources
323
363
  **§15 — Diagnose With Discipline**
324
364
  - *The Practice of Programming* (Kernighan & Pike, 1999) — chapters on debugging and testing.
325
365
  - Falsifiability framing — Karl Popper, *The Logic of Scientific Discovery* (1959).
366
+
367
+ **§16 — Test-Driven Development (TDD)**
368
+ - *Test-Driven Development: By Example* (Kent Beck, 2002) — canonical red-green-refactor framing.
369
+ - *Working Effectively with Legacy Code* (Feathers, 2004) — seams as test surfaces.
370
+ - *Unit Testing Principles, Practices, and Patterns* (Khorikov, 2020) — behavior-vs-implementation test classification.
371
+ - [`mattpocock/skills` engineering/tdd](https://github.com/mattpocock/skills/blob/main/skills/engineering/tdd/SKILL.md) — vertical-tracer-bullet framing adopted with attribution.
372
+
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@alexandrealvaro/agentic",
3
- "version": "0.15.2-beta.1",
3
+ "version": "0.16.0-beta.1",
4
4
  "description": "Bootstrap and audit AGENTS.md, ARCHITECTURE.md, ADRs, skills, and subagents for engineering production code with LLMs",
5
5
  "type": "module",
6
6
  "bin": {
@@ -370,6 +370,7 @@ export async function initCommand(opts) {
370
370
  '/ad-review (WORKFLOW §10)',
371
371
  '/ad-ground (WORKFLOW §4 + §5)',
372
372
  '/ad-next (state survey + recommendations)',
373
+ '/ad-archive (sweep done tasks / shipped specs / superseded ADRs into git history)',
373
374
  '/ad-spike (WORKFLOW §14 — staged spike with golden fixtures)',
374
375
  '/ad-tdg (WORKFLOW §9 — outcome-based prompting + TDM)',
375
376
  '/ad-domain (CONTEXT.md — Layer 2 ubiquitous language)',
@@ -397,6 +398,8 @@ export async function initCommand(opts) {
397
398
  'ad-spec': '/ad-spec (doc/specs/)',
398
399
  'ad-task': '/ad-task',
399
400
  'ad-audit': '/ad-audit',
401
+ 'ad-archive':
402
+ '/ad-archive (sweep done tasks / shipped specs / superseded ADRs into git history)',
400
403
  'ad-review': '/ad-review (WORKFLOW §10)',
401
404
  'ad-ground': '/ad-ground (WORKFLOW §4 + §5)',
402
405
  'ad-deepen': '/ad-deepen (WORKFLOW §8 — deepening opportunities)',
@@ -25,19 +25,23 @@ export const PROFILES = {
25
25
  'ad-ground',
26
26
  'ad-audit',
27
27
  'ad-next',
28
+ 'ad-archive',
28
29
  'ad-spike',
29
30
  'ad-tdg',
30
31
  'ad-domain',
31
32
  'ad-grill',
32
33
  'ad-diagnose',
34
+ 'ad-tdd',
33
35
  ],
34
36
  conditional: {
35
37
  'ad-design': 'blocked',
36
38
  'ad-subagent': 'blocked',
37
39
  'ad-skill': 'blocked',
38
40
  'ad-hooks': 'blocked',
41
+ 'ad-prd': 'blocked',
42
+ 'ad-guidelines': 'blocked',
39
43
  },
40
- note: 'PoC / spike / experiment. Nine universals — posture (philosophy), research (ground), drift (audit), navigation (next), spike + tdg + grill + domain + diagnose process scaffolds. No mandatory artifact-producing skills (no bootstrap / spec / task / adr / architecture). Adds discipline you can grow into; never pre-imposes ceremony.',
44
+ note: 'PoC / spike / experiment. Ten universals — posture (philosophy), research (ground), drift (audit), navigation (next), spike + tdg + tdd + grill + domain + diagnose process scaffolds. No mandatory artifact-producing skills (no bootstrap / spec / task / adr / architecture / prd / guidelines). Adds discipline you can grow into; never pre-imposes ceremony.',
41
45
  },
42
46
  solo: {
43
47
  universal: [
@@ -45,9 +49,13 @@ export const PROFILES = {
45
49
  'ad-ground',
46
50
  'ad-audit',
47
51
  'ad-next',
52
+ 'ad-archive',
48
53
  'ad-spike',
49
54
  'ad-tdg',
55
+ 'ad-tdd',
50
56
  'ad-bootstrap',
57
+ 'ad-prd',
58
+ 'ad-guidelines',
51
59
  'ad-spec',
52
60
  'ad-task',
53
61
  'ad-review',
@@ -66,7 +74,7 @@ export const PROFILES = {
66
74
  'ad-skill': false,
67
75
  'ad-hooks': false,
68
76
  },
69
- note: 'Solo developer shipping a real product. Specs and tasks are universal; ADRs and architecture are opt-in for binding decisions only.',
77
+ note: 'Solo developer shipping a real product. PRD, engineering guidelines, specs and tasks are universal; ADRs and architecture are opt-in for binding decisions only.',
70
78
  },
71
79
  team: {
72
80
  universal: [
@@ -74,14 +82,18 @@ export const PROFILES = {
74
82
  'ad-philosophy',
75
83
  'ad-architecture',
76
84
  'ad-adr',
85
+ 'ad-prd',
86
+ 'ad-guidelines',
77
87
  'ad-spec',
78
88
  'ad-task',
79
89
  'ad-audit',
80
90
  'ad-review',
81
91
  'ad-ground',
82
92
  'ad-next',
93
+ 'ad-archive',
83
94
  'ad-spike',
84
95
  'ad-tdg',
96
+ 'ad-tdd',
85
97
  'ad-domain',
86
98
  'ad-grill',
87
99
  'ad-deepen',
@@ -96,7 +108,7 @@ export const PROFILES = {
96
108
  'ad-skill': false,
97
109
  'ad-hooks': false,
98
110
  },
99
- note: 'Team product. Full universal stack including deepening (architectural refactor surfacing) and the v0.15 grill/domain/diagnose trio. Conditional skills auto-detect by signal. This was the v0.7 default and is the migration target for existing installs.',
111
+ note: 'Team product. Full universal stack including PRD, engineering guidelines, TDD, deepening, and the grill/domain/diagnose trio. Conditional skills auto-detect by signal.',
100
112
  },
101
113
  mature: {
102
114
  universal: [
@@ -104,14 +116,18 @@ export const PROFILES = {
104
116
  'ad-philosophy',
105
117
  'ad-architecture',
106
118
  'ad-adr',
119
+ 'ad-prd',
120
+ 'ad-guidelines',
107
121
  'ad-spec',
108
122
  'ad-task',
109
123
  'ad-audit',
110
124
  'ad-review',
111
125
  'ad-ground',
112
126
  'ad-next',
127
+ 'ad-archive',
113
128
  'ad-spike',
114
129
  'ad-tdg',
130
+ 'ad-tdd',
115
131
  'ad-domain',
116
132
  'ad-grill',
117
133
  'ad-deepen',
@@ -126,7 +142,7 @@ export const PROFILES = {
126
142
  'ad-skill': false,
127
143
  'ad-hooks': true,
128
144
  },
129
- note: 'Mature / regulated product. Recommends ad-hooks alongside the team stack for deterministic gates per WORKFLOW §11. Includes deepening + the v0.15 grill/domain/diagnose trio.',
145
+ note: 'Mature / regulated product. Recommends ad-hooks alongside the team stack for deterministic gates per WORKFLOW §11. Full universal stack including PRD, engineering guidelines, TDD, deepening, and the grill/domain/diagnose trio.',
130
146
  },
131
147
  };
132
148