@alexandrealvaro/agentic 0.12.0-beta.1 → 0.14.0-beta.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +18 -4
- package/WORKFLOW.md +100 -9
- package/package.json +1 -1
- package/src/commands/init.js +1 -0
- package/src/lib/profiles.js +4 -1
- package/src/lib/rootdoc.js +4 -2
- package/src/skills/claude-code/agentic-bootstrap/SKILL.md +1 -1
- package/src/skills/claude-code/agentic-next/SKILL.md +17 -10
- package/src/skills/claude-code/agentic-spec/SKILL.md +2 -2
- package/src/skills/claude-code/agentic-tdg/SKILL.md +126 -0
- package/src/skills/codex/agentic-bootstrap/SKILL.md +1 -1
- package/src/skills/codex/agentic-next/SKILL.md +15 -10
- package/src/skills/codex/agentic-next/agents/openai.yaml +1 -1
- package/src/skills/codex/agentic-spec/SKILL.md +2 -2
- package/src/skills/codex/agentic-tdg/SKILL.md +109 -0
- package/src/skills/codex/agentic-tdg/agents/openai.yaml +5 -0
package/README.md
CHANGED
@@ -33,14 +33,15 @@ Two categories ([ADR-0007](doc/adr/0007-workflow-operational-skills.md)) and two
 | `agentic-bootstrap` | spec-driven | universal | Scans the repo, writes `AGENTS.md` ≤150 lines | `/agentic-bootstrap` |
 | `agentic-architecture` | spec-driven | universal | Scans the code, writes `ARCHITECTURE.md` | `/agentic-architecture` |
 | `agentic-adr` | spec-driven | universal | Drafts `doc/adr/NNNN-<slug>.md` from the conversation | `/agentic-adr` |
-| `agentic-spec` | spec-driven | universal | Drafts `doc/specs/NNNN-<slug>.md` — feature-level spec (User Scenarios, Requirements, Success Criteria) layer
+| `agentic-spec` | spec-driven | universal | Drafts `doc/specs/NNNN-<slug>.md` — feature-level spec (User Scenarios, Requirements, Success Criteria) layer 3 of the five-layer stack | `/agentic-spec` |
 | `agentic-task` | spec-driven | universal | Drafts `doc/tasks/NNNN-<slug>.md` (checkbox + Notes format; carries `Spec ref` to link the implementing spec) | `/agentic-task` |
 | `agentic-audit` | spec-driven | universal | Read-only drift report (AGENTS.md / ARCHITECTURE.md / ADRs) | `/agentic-audit` |
 | `agentic-philosophy` | workflow-operational | universal | Universal agent guardrails — auto-loads on non-trivial work | implicit |
 | `agentic-review` | workflow-operational | universal | Fresh-context code review per WORKFLOW §10; structured findings, no "approve" | `/agentic-review <range>` |
 | `agentic-ground` | workflow-operational | universal | Four-source pre-implementation research (docs / OSS / in-repo / git history) + happy-path synthesis + deviation gate per WORKFLOW §4 + §5 | `/agentic-ground` |
-| `agentic-next` | workflow-operational | universal | State-aware navigation aid (`flutter doctor` pattern) — surveys the
+| `agentic-next` | workflow-operational | universal | State-aware navigation aid (`flutter doctor` pattern) — surveys the five-layer artifact stack and recommends prioritized next actions; complements `agentic-audit` (drift) | `/agentic-next` |
 | `agentic-spike` | workflow-operational | universal | Staged spike with golden fixtures per WORKFLOW §14, for cases where the *technique* is uncertain across multiple plausible approaches; produces `spikes/NNNN-<slug>/` with discovery + fixture + pipeline-with-gates + two-layer evaluation | `/agentic-spike` |
+| `agentic-tdg` | workflow-operational | universal | Outcome-based prompting per WORKFLOW §9 — ground truth pair + Test Dependency Map + three approaches + single-criterion selection, for cases where the technique is known but the implementation strategy is uncertain | `/agentic-tdg` |
 | `agentic-design` | spec-driven | auto if frontend detected | Bootstrap `DESIGN.md` from existing tokens (Figma, tailwind.config, tokens.json, CSS custom props) | `/agentic-design` |
 | `agentic-subagent` | spec-driven | auto if installing for Claude Code | Drafts `.claude/agents/<name>.md` (Claude Code only — Codex has no subagent primitive) | `/agentic-subagent` |
 | `agentic-skill` | spec-driven | opt-in only | Drafts a new Claude Code or Codex skill at the appropriate path | `/agentic-skill` |
@@ -50,6 +51,17 @@ A short TUI shows the detected mode, agent, and feature signals (frontend / `.cl

 If your project already has an `AGENTS.md` (or `CLAUDE.md`), the installer appends a managed `Skills installed by agentic` section bracketed by `<!-- agentic-managed-skills:start -->` / `:end -->` markers. User content outside those markers is byte-preserved; re-runs update only the managed block.

+### Planned skills
+
+Skills accepted by ADR but not yet implemented; tracked under [task-0020](doc/tasks/0020-mattpocock-absorptions.md). Each ships in its own minor release.
+
+| Skill | Target | Operationalizes | ADR |
+| --- | --- | --- | --- |
+| `agentic-domain` | v0.15.x | Lazy-creates / updates `CONTEXT.md` (Layer 2 — ubiquitous language per Evans 2003) | [ADR-0019](doc/adr/0019-domain-language-layer.md) |
+| `agentic-grill` | v0.16.x | Interview-before-research session, upstream of `agentic-ground` | (forthcoming) |
+| `agentic-deepen` | v0.17.x | Surfaces deepening opportunities using the Ousterhout/Feathers vocabulary | [ADR-0020](doc/adr/0020-deep-modules-vocabulary.md) |
+| `agentic-diagnose` | v0.18.x | Five-phase debugging discipline per WORKFLOW §15 | [ADR-0021](doc/adr/0021-diagnose-discipline.md) |
+
 ## Project maturity profiles

 The kit ships four profiles that select which skills auto-install. Same WORKFLOW principles bind every profile; only the artifact set scales.
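The byte-preserving managed-block behavior described in the hunk above can be pictured in a few lines. This is an illustrative sketch only — not the package's actual `init` implementation — and it assumes the end marker abbreviated as `:end -->` in the README is the full `<!-- agentic-managed-skills:end -->` string:

```javascript
// Sketch: rewrite only the managed block between markers, byte-preserving
// every user byte outside them. Hypothetical helper, not the kit's code.
const START = "<!-- agentic-managed-skills:start -->";
const END = "<!-- agentic-managed-skills:end -->";

function upsertManagedBlock(doc, blockBody) {
  const managed = `${START}\n${blockBody}\n${END}`;
  const start = doc.indexOf(START);
  const end = doc.indexOf(END);
  if (start !== -1 && end !== -1) {
    // Re-run: replace only the managed span; user content stays untouched.
    return doc.slice(0, start) + managed + doc.slice(end + END.length);
  }
  // First run: append the managed section to the existing file.
  return doc + (doc.endsWith("\n") ? "" : "\n") + "\n" + managed + "\n";
}
```

Running the function twice on the same file updates only the bracketed block, which is the re-run behavior the README promises.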
@@ -154,10 +166,12 @@ The kit ships nine universal skills plus three conditional ones — twelve discr

 The kit's discipline scales with the project's maturity. A solo PoC may legitimately skip `/agentic-spec` and `/agentic-adr` (the WORKFLOW §1 prune principle applies — don't add an artifact that wouldn't change agent behavior). A team product running on this kit is expected to use the full sequence and additionally invoke `/agentic-hooks` once to scaffold the deterministic gates per WORKFLOW §11 (pre-commit lint / format / secret-scan; pre-push build / unit / integration). Project maturity profiles that automate the recommendation by stack are deferred — see the next planned release.

-**Lost mid-flow?** Invoke `/agentic-next` at any time to survey the project's state across the
+**Lost mid-flow?** Invoke `/agentic-next` at any time to survey the project's state across the five-layer artifact stack (Constitution → Domain → Spec → Plan/Decisions → Code) and get prioritized next-action recommendations. Read-only; complements `/agentic-audit` (drift detection — different question).

 **Technique uncertain across multiple plausible approaches?** Invoke `/agentic-spike` (per WORKFLOW §14) when the spec is clear but the *how* is unknown — library choice, multi-stage transformation, novel domain. The skill scaffolds a staged spike with golden fixtures + per-stage debug artifacts + two-layer evaluation under `spikes/NNNN-<slug>/`. The directory is throwaway by design; conclude with `/agentic-adr` and delete.

+**Technique known but implementation strategy uncertain?** Invoke `/agentic-tdg` (per WORKFLOW §9) when multiple algorithms could produce the expected output with different trade-offs along readability / performance / testability. The skill forces a ground-truth pair, lists the tests covering the file (Test Dependency Map), generates three implementation candidates, and commits to one by a single named criterion — refusing the "optimize for all three at once" failure mode. No file written; the verified implementation is the artifact, with the candidate set + criterion landing in the commit message body.
+
 ## Manual prompts

 If you prefer to skip the installer, the same artifacts can be generated by pasting prompts directly into your agent. Each prompt file has the literal text to copy, plus the matching template structure:
@@ -193,7 +207,7 @@ Prompts reference templates by relative path. Two ways to give your agent access

 **Researching before implementation.** Run `/agentic-ground` (or let it auto-trigger on non-trivial work). The skill runs a four-source research pass — official docs, validated open-source examples, in-repo patterns, and git history — synthesizes a happy path with citations from each source, and gates any deviation behind an irrefutable justification before code is written (WORKFLOW §4 + §5). Output is the input to whatever produces the implementation plan; the skill does not write code.

-**Specifying a feature.** Run `/agentic-spec` (or `/agentic-spec` with a feature name). The skill scaffolds `doc/specs/NNNN-<slug>.md` with industry-aligned mandatory sections (User Scenarios, Functional / Non-functional Requirements, Success Criteria, Edge Cases, Out of Scope, Open Questions, Related). Specs are layer
+**Specifying a feature.** Run `/agentic-spec` (or `/agentic-spec` with a feature name). The skill scaffolds `doc/specs/NNNN-<slug>.md` with industry-aligned mandatory sections (User Scenarios, Functional / Non-functional Requirements, Success Criteria, Edge Cases, Out of Scope, Open Questions, Related). Specs are layer 3 of the five-layer artifact stack — Constitution → Domain → Spec → Plan/Decisions → Code. One spec per feature; multiple tasks (`/agentic-task`) implement one spec; the task template carries a `Spec ref` field linking back to the spec.

 **Project already built with agents.** Treat missing artifacts as brownfield (run the relevant skill) and existing artifacts as audit (`/agentic-audit`).
package/WORKFLOW.md
CHANGED
@@ -4,7 +4,7 @@ Engineering production code with LLMs. Agentic, not vibe coding.

 **The Steve Rogers framing.** The LLM is the super-soldier serum. The engineer is Steve Rogers. The serum amplifies what the engineer already brings — solid bases, organization, investigation, care for quality, architecture, clean code, documentation, observability, maintainability. Add the serum to a disciplined engineer and you get Captain America. Add it to an undisciplined one and you get faster sloppy at scale. This document is those bases written down as principles. The discipline is the input; the LLM is the amplifier; the kit (skills, ADRs, audits, gates) is the scaffolding that keeps the discipline intact across sessions, agents, and projects.

-**The principle behind the rest:** context engineering beats prompt engineering. Context is finite and decays as it fills — aim for the smallest set of high-signal tokens that gets the outcome.
+**The principle behind the rest:** context engineering beats prompt engineering. Context is finite and decays as it fills — aim for the smallest set of high-signal tokens that gets the outcome. Operationally: one task per session, reset rather than extend. A long-running conversation crosses from the model's high-precision zone into a low-precision one as the cross-references multiply; smaller, deliberately-loaded contexts beat larger, accreted ones.

 ## TL;DR
@@ -27,7 +27,10 @@ What to keep in mind:
 13. **Automation needs rails.** Hooks, tests, lint, CI, sandboxing, and permissions matter more than advisory text the agent can forget.
 14. **Autonomy requires observability.** If the agent makes decisions, log the trajectory: tool calls, intermediate outputs, failures.
 15. **Staged spikes when the technique is uncertain.** When the *how* is unknown — a library choice, a CV technique, a multi-stage transformation — break the problem into staged spikes against golden fixtures with per-stage debug artifacts.
-16. **
+16. **Diagnose with discipline.** For hard bugs and performance regressions: build a fast, deterministic feedback loop *before* hypothesising. Reproduce, then generate three to five ranked falsifiable hypotheses, then change one variable at a time. The feedback loop is the skill; everything else is mechanical.
+17. **One task per session.** Context decays as it fills. Reset and reload only what the next task needs rather than extending a long conversation across many features. Smaller, deliberately-loaded contexts beat large, accreted ones.
+18. **Slice vertically, not horizontally.** Decompose a spec into thin end-to-end paths through every layer (schema, API, UI, tests) rather than one layer at a time. Each slice ships demonstrable behavior; horizontal layer-stacks ship nothing on their own.
+19. **Discipline scales with project maturity.** Same principles bind every project; the artifact set scales. A spike runs posture + research + audit; a regulated product adds spec / ADR / hooks / evals. Add ceremony only where it changes agent behavior; configure at init and reconfigure as the project matures.

 > Working with agents means trading typing for technical direction. The value is in giving the right context, setting boundaries, validating the result, and keeping "almost right" out of production.
@@ -37,14 +40,15 @@ Define the rules before the agent writes a line. The temptation is to dump every

 There are two complementary frames for the artifacts the kit produces. The first is **purpose** — what each artifact is *for*. The second is **loading mechanism** — when each artifact reaches the agent's context.

-###
+### Five-layer artifact stack (purpose)

 1. **Constitution** — `AGENTS.md` (operational guide) and `WORKFLOW.md` (engineering philosophy). Tells the agent how the project works and how the team thinks. Read every session.
-2. **
-3. **
-4. **
+2. **Domain** — `CONTEXT.md` at the repo root (or `CONTEXT-MAP.md` plus per-context `CONTEXT.md` files for multi-context repos). The project's ubiquitous language: canonical nouns, the aliases to avoid, the relationships between them, and the ambiguities that have already been resolved. Direct application of Domain-Driven Design (Evans, 2003) — when an agent and a human share the project's vocabulary, the agent uses fewer tokens to say more, and the code, tests, and conversation all converge on the same names. Created lazily — first term resolved triggers the file. ADR-0019 records the layer.
+3. **Spec** — `doc/specs/NNNN-<slug>.md`. Feature-level requirements: who the feature is for, what it must do, the measurable success criteria, the explicit non-goals. One spec per feature; multiple tasks implement one spec; ADRs may be driven by spec constraints. Industry-aligned with [GitHub Spec Kit](https://github.com/github/spec-kit).
+4. **Plan / Decisions** — `ARCHITECTURE.md` (system patterns and boundaries), `doc/adr/NNNN-*.md` (binding architectural decisions in Michael Nygard's pattern), `doc/tasks/NNNN-*.md` (per-work-unit plan with checkbox acceptance criteria). The *how* of building what the spec asked for.
+5. **Code** — the implementation. Code is the primary documentation of behavior; comments justify non-obvious choices.

-The
+The five layers scale with project maturity (TL;DR #19 — Discipline scales). A spike or PoC profile may legitimately ship only Layers 1, 2, and 5 — adding Layers 3 and 4 to a 200-line experiment is ceremony that does not change agent behavior. (Domain — Layer 2 — earns its keep even at PoC because vocabulary drift starts on day one.) A team or regulated product runs all five. The kit's profiles (`poc`, `solo`, `team`, `mature`) configure which layers auto-install per project and are changeable as the project matures; the principles in this document bind every profile, only the artifact set differs.

 ### Three context types (loading mechanism)
@@ -52,9 +56,10 @@ The four layers scale with project maturity (TL;DR #16). A spike or PoC profile
 - **Canonical specs are constraints, not advice.** `DESIGN.md` (the visual contract — YAML tokens plus Markdown rationale, per Google Labs' open standard), `ARCHITECTURE.md`, ADRs in `doc/adr/*.md`, and feature specs in `doc/specs/*.md` are facts the agent must obey. If a token, pattern, or success criterion isn't declared here, it doesn't exist. The agent must never invent one.
 - **On-demand context is `SKILL.md`.** Description loads at session start (the listing is capped at 1,536 characters per the spec) and body loads only when the skill is invoked. Use it for repeatable workflows or domain knowledge that shouldn't pay a token cost on every turn.

-
+Three rules apply across all of the above:

 - **Acceptance criteria must be measurable.** "Build a dashboard" fails. "Loads in under 2 seconds, shows 6 months of history, passes axe accessibility" succeeds.
+- **Acceptance criteria must be durable, not procedural.** Describe the behavior and the interfaces — the contracts that survive a rename. Avoid file paths, line numbers, and "open file X and add line Y" wording in the criteria themselves; those rot the moment the implementation moves. Procedural execution steps belong in a separate section of the task file, not in the criteria.
 - **Prune.** If removing a line wouldn't make the agent fail, cut it.

 ## 2. Docs vs. Code
@@ -112,7 +117,7 @@ The kit ships `agentic-ground` as the workflow-operational implementation of bot
 For non-trivial changes, four phases:

 1. **Explore (read-only).** Plan mode in your agent. Read, build a mental model, no edits.
-2. **Plan.** Agent writes a Markdown plan. You edit before approving. For non-trivial multi-step work, structure the plan as a per-task file (`doc/tasks/<NNNN>-<slug>.md`) with checkbox acceptance criteria and execution steps — the agent toggles checkboxes as it works rather than rewriting paragraphs, keeping edits cheap and resumable across sessions.
+2. **Plan.** Agent writes a Markdown plan. You edit before approving. For non-trivial multi-step work, structure the plan as a per-task file (`doc/tasks/<NNNN>-<slug>.md`) with checkbox acceptance criteria and execution steps — the agent toggles checkboxes as it works rather than rewriting paragraphs, keeping edits cheap and resumable across sessions. When a spec yields more than one task, **slice vertically**: each task is a thin end-to-end path through every layer the change touches (schema, API, UI, tests), not one layer at a time. The anti-pattern is *horizontal slicing* — "first the schema task, then the API task, then the UI task" — because nothing is shippable until the last task lands. Tracer-bullet vertical slices each ship a demonstrable behavior on their own ([Hunt & Thomas, *Pragmatic Programmer*, 1999](https://en.wikipedia.org/wiki/The_Pragmatic_Programmer)). Tag each task **AFK** (specified completely enough that an autonomous agent can land it) or **HITL** (needs human judgment, taste, design review, or external access) — the dimension is orthogonal to the lifecycle status and tells parallel agents which work is theirs to take.
 3. **Implement.** Execute the approved plan; verify each step before moving to the next.
 4. **Commit.** One logical change per commit.
@@ -135,6 +140,27 @@ Lock the load-bearing decisions into `AGENTS.md` or `CLAUDE.md` so the agent doe

 The agent will follow what's specified and invent what isn't. Prefer specifying.

+### Architectural vocabulary
+
+Architectural drift accelerates with the agent's typing speed; the counter is shared vocabulary that names the shapes that matter. The kit adopts the canonical terms from John Ousterhout's *A Philosophy of Software Design* (2018) and Michael Feathers's *Working Effectively with Legacy Code* (2004), and uses them in `ARCHITECTURE.md`, ADRs, and architecture-touching skills (ADR-0020).
+
+- **Module** — anything with an interface and an implementation; deliberately scale-agnostic (function, class, package, vertical slice).
+- **Interface** — everything a caller must know to use the module correctly: types, invariants, ordering constraints, error modes, configuration, performance characteristics. *Not* just the type signature.
+- **Implementation** — what's inside the module.
+- **Depth** — leverage at the interface. A module is **deep** when a large amount of behavior sits behind a small interface; **shallow** when the interface is nearly as complex as the implementation. Depth is a property of the interface, not of line counts (rejected framing: depth as the implementation-to-interface line ratio, which rewards padding).
+- **Seam** (Feathers) — a place where behavior can be altered without editing in place; the *location* of an interface. Distinct from DDD's *bounded context*; the kit avoids "boundary" for this reason.
+- **Adapter** — a concrete thing that satisfies an interface at a seam; a role, not a substance.
+- **Leverage** — what callers get from depth: more capability per unit of interface they have to learn.
+- **Locality** — what maintainers get from depth: change, bugs, and knowledge concentrated in one place rather than spread across callers.
+
+Three principles fall out of those terms:
+
+- **Deletion test.** Imagine deleting the module. If complexity vanishes, the module was a pass-through (delete it). If complexity reappears across N callers, it was earning its keep.
+- **The interface is the test surface.** Callers and tests cross the same seam. If you want to test *past* the interface, the module is probably the wrong shape.
+- **One adapter is a hypothetical seam; two adapters make it real.** Don't introduce a seam unless something actually varies across it.
+
+Skills that touch architecture (`agentic-architecture`, `agentic-adr`, the planned `agentic-deepen`) use these terms verbatim, so suggestions and reviews land in a single language.
+
 ## 9. Outcome-Based Prompting (TDG)

 Give the finish line first, not the path:

@@ -172,6 +198,8 @@ The takeaway: §10 (Reviewer) and §11 (Quality Gates) are not optional. Skippin

 Per the Steve Rogers framing in the preamble: the serum cannot manufacture discrimination — it amplifies whatever discrimination the engineer already brings. The kit's job is to encode discrimination into the agent's context (specs, ADRs, fresh-context reviews, deterministic gates) so the amplification compounds in the disciplined direction even when the engineer is sleepy, rushed, or handing off to another collaborator.

+**Two roles of judgment, not one.** Agent review (§10 fresh-context) and the deterministic gates (§11) catch the *mechanical* failures: bugs, coupling, edge cases, broken contracts, missed branches. They do not — and should not be expected to — catch what is left after that: taste, product judgment, visual feel, whether the feature actually solves the user's problem. Those require a human to look at the running thing and form an opinion. Skipping fresh-context review because "the engineer will catch it" wastes engineer attention on mechanical failures the agent should have caught. Skipping the human pass because "the agent reviewed it" ships features that compile, pass, and feel wrong. The two are complements, not substitutes.
+
 ## 13. Evals for Anything Autonomous

 If your agent is making decisions on its own, you need evals. A few principles:
@@ -185,6 +213,8 @@ If your agent is making decisions on its own, you need evals. A few principles:

 Sometimes the spec is clear but the *technique* is uncertain — you don't know which library, which CV approach, which decomposition. Don't ask the agent to solve it end-to-end. Break the problem into staged spikes and validate each one against curated ground truth.

+**Spike vs. prototype.** Use a *spike* (this section) when the unknown is *how* — the technique itself is uncertain across multiple plausible approaches and validation needs golden fixtures with per-stage debug. Use a *prototype* when the unknown is *what should this feel like* — UI/UX direction, the shape of an interaction, whether a state model holds up under play. Different question, different artifact, different success criterion.
+
 The flow has four parts:

 1. **Discovery first.** Ask the agent to list canonical approaches grounded in official docs and real examples. Pick one by an explicit criterion. The output of this step is information, not code.
@@ -198,6 +228,53 @@ The flow has four parts:

 This is a combination of established practices, not new terminology: spike (XP), golden datasets, stage-segmented error analysis, trajectory evaluation, and visual debugging in CV pipelines.

+## 15. Diagnose With Discipline
+
+For hard bugs and performance regressions, the failure mode is jumping to hypotheses before there is a way to check them. The discipline below is the counter; it owes its shape to standard debugging practice (Kernighan & Pike, *The Practice of Programming*, 1999), and the kit ships `agentic-diagnose` as the operational implementation (ADR-0021).
+
+### Phase 1 — Build a feedback loop
+
+This is *the* skill. Everything else is mechanical. A fast, deterministic, agent-runnable pass/fail signal for the bug is what makes bisection, hypothesis-testing, and instrumentation effective; without one, no amount of staring at code converges. Spend disproportionate effort here.
+
+Loop-construction techniques, in roughly increasing cost:
+
+1. Failing test at whatever seam reaches the bug — unit, integration, e2e.
+2. Curl / HTTP script against a running dev server.
+3. CLI invocation with a fixture input, diffing stdout against a known-good snapshot.
+4. Headless browser script (Playwright / Puppeteer) — drives the UI, asserts on DOM, console, network.
+5. Replay a captured trace — saved network request, payload, event log — through the code path in isolation.
+6. Throwaway harness — a minimal subset of the system that exercises the bug's code path with one function call.
+7. Property / fuzz loop — when the bug is "sometimes wrong output", run many random inputs and look for the failure mode.
+8. Bisection harness — automate "boot at state X, check, repeat" so `git bisect run` works.
+9. Differential loop — run the same input through two versions or two configs and diff outputs.
+10. HITL bash script — last resort. Structure the human's clicks so the loop is still repeatable.
+
+Treat the loop as a product: faster, sharper signal, more deterministic, every iteration. A two-second deterministic loop is a debugging superpower; a thirty-second flaky loop is barely better than no loop.
+
+For non-deterministic bugs, the goal is not a clean repro but a *higher reproduction rate* — loop the trigger, parallelize, narrow timing windows, inject sleeps. A 50% flake is debuggable; a 1% flake is not.
+
+If a loop genuinely cannot be built, stop and say so — do not proceed to hypotheses.
+
+### Phase 2 — Reproduce
+
+Run the loop. Confirm the failure matches the user's description (not a different failure that happens to be nearby), reproduces consistently (or at a high enough rate), and that the exact symptom is captured for later phases to verify the fix against. Wrong bug = wrong fix.
+
+### Phase 3 — Hypothesise
+
+Generate **three to five ranked hypotheses** before testing any of them. Single-hypothesis generation anchors on the first plausible idea.
+
+Each hypothesis must be **falsifiable**: state the prediction it makes — *"if X is the cause, then changing Y will make the bug disappear / changing Z will make it worse."* If you cannot state the prediction, the hypothesis is a vibe; sharpen it or discard it.
+
+Show the ranked list to the user before testing. Domain knowledge often re-ranks instantly ("we just deployed a change to #3"), or marks hypotheses already ruled out — a cheap checkpoint and a big time saver.
+
+### Phase 4 — Instrument
+
+Each probe maps to a specific prediction from Phase 3. Change one variable at a time. Log enough that the result is unambiguous — a single instrument that confirms two predictions at once tends to confirm whichever one you were rooting for.
+
+### Phase 5 — Fix and regression-test
+
+Apply the fix. Re-run the Phase-1 loop and confirm the captured symptom is gone. Promote the loop's check into a permanent test that lives next to the code so the same failure mode cannot return silently.
+
 ---

 These are starting points. Prune what doesn't fit your codebase.
@@ -214,6 +291,8 @@ This is not theory I read and copied. Most of the practices here come from years

 **§14 (Staged Spikes With Golden Fixtures) is my own working technique.** I haven't seen it documented end-to-end as a single named pattern, but each component (spike, golden dataset, stage-segmented error analysis, trajectory evaluation, visual CV debugging) has its own lineage in the literature listed under Sources. The combination — discovery → fixture → staged pipeline with debug artifacts → two-layer evaluation — is how I attack problems where the *technique* itself is uncertain.

+**Practices added through cross-pollination.** A second pass over the guide compared this kit's coverage against [Matt Pocock's `mattpocock/skills`](https://github.com/mattpocock/skills) — a separate body of agent-engineering practice grounded in the same canonical literature (DDD, *Pragmatic Programmer*, Ousterhout, Feathers, Beck). The comparison surfaced principles that earned their place in this document on independent merits: the **Domain layer** (§1 Layer 2, ADR-0019) from DDD's ubiquitous-language pattern; the **architectural vocabulary** (§8, ADR-0020) from Ousterhout's depth and Feathers's seams; **Diagnose with discipline** (§15, ADR-0021) from standard debugging practice; **vertical slicing** and **HITL/AFK tagging** (§6) from PragProg's tracer-bullet metaphor; and the **AI-mechanical / human-judgment split** (§12, second paragraph). Where Pocock's framing sharpened our own — vocabulary or a principle we already gestured at without naming — the borrowed phrasing is acknowledged inline; everything else stays kit-original.
+
 External claims (specific percentages, named frameworks) are cited under Sources. Everything else is operational guidance from practice or synthesis across that material — a working model, refined over time, not an academic claim.

 ## Sources
@@ -221,6 +300,14 @@ External claims (specific percentages, named frameworks) are cited under Sources
|
|
|
221
300
|
**§1 — Spec-Driven Design**
|
|
222
301
|
- DESIGN.md spec (Google Labs): https://github.com/google-labs-code/design.md
|
|
223
302
|
- SKILL.md spec (Anthropic): https://code.claude.com/docs/en/skills
|
|
303
|
+
- Domain-Driven Design (Evans, 2003) — *Domain-Driven Design: Tackling Complexity in the Heart of Software*. Source for the Domain layer (`CONTEXT.md`) and the ubiquitous-language discipline.
|
|
304
|
+
|
|
305
|
+
**§6 — Explore → Plan → Implement → Commit**
|
|
306
|
+
- *The Pragmatic Programmer* (Hunt & Thomas, 1999), tracer-bullet metaphor — source for the vertical-slicing principle.
|
|
307
|
+
|
|
308
|
+
**§8 — Architectural Boundaries**
|
|
309
|
+
- *A Philosophy of Software Design* (Ousterhout, 2018) — Module / Interface / Depth vocabulary; rejected framing of depth-as-line-ratio.
|
|
310
|
+
- *Working Effectively with Legacy Code* (Feathers, 2004) — Seam vocabulary.
|
|
224
311
|
|
|
225
312
|
**§12 — The Bottleneck Is Discrimination, Not Generation**
|
|
226
313
|
- JetBrains *DevEcosystem 2025*: https://devecosystem-2025.jetbrains.com/artificial-intelligence
|
|
@@ -232,3 +319,7 @@ External claims (specific percentages, named frameworks) are cited under Sources
|
|
|
232
319
|
- Stage-segmented error analysis — Hamel Husain's evals FAQ: https://hamel.dev/blog/posts/evals-faq/
|
|
233
320
|
- Trajectory evaluation — LangSmith docs: https://docs.langchain.com/langsmith/trajectory-evals
|
|
234
321
|
- Visual CV debugging — OpenCV cvv tutorial: https://docs.opencv.org/3.4/d7/dcf/tutorial_cvv_introduction.html
|
|
322
|
+
|
|
323
|
+
**§15 — Diagnose With Discipline**
|
|
324
|
+
- *The Practice of Programming* (Kernighan & Pike, 1999) — chapters on debugging and testing.
|
|
325
|
+
- Falsifiability framing — Karl Popper, *The Logic of Scientific Discovery* (1959).
|
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@alexandrealvaro/agentic",
-  "version": "0.12.0-beta.1",
+  "version": "0.14.0-beta.1",
   "description": "Bootstrap and audit AGENTS.md, ARCHITECTURE.md, ADRs, skills, and subagents for engineering production code with LLMs",
   "type": "module",
   "bin": {
package/src/commands/init.js CHANGED
@@ -371,6 +371,7 @@ export async function initCommand(opts) {
       '/agentic-ground (WORKFLOW §4 + §5)',
       '/agentic-next (state survey + recommendations)',
       '/agentic-spike (WORKFLOW §14 — staged spike with golden fixtures)',
+      '/agentic-tdg (WORKFLOW §9 — outcome-based prompting + TDM)',
       ...(optedSkills.includes('agentic-design') ? ['/agentic-design (DESIGN.md)'] : []),
       ...(optedSkills.includes('agentic-subagent') && agents.includes('claude-code')
         ? ['/agentic-subagent']
package/src/lib/profiles.js CHANGED
@@ -20,7 +20,7 @@ export const PROFILE_NAMES = ['poc', 'solo', 'team', 'mature'];
 
 export const PROFILES = {
   poc: {
-    universal: ['agentic-philosophy', 'agentic-ground', 'agentic-audit', 'agentic-next', 'agentic-spike'],
+    universal: ['agentic-philosophy', 'agentic-ground', 'agentic-audit', 'agentic-next', 'agentic-spike', 'agentic-tdg'],
     conditional: {
       'agentic-design': 'blocked',
       'agentic-subagent': 'blocked',
@@ -36,6 +36,7 @@ export const PROFILES = {
       'agentic-audit',
       'agentic-next',
       'agentic-spike',
+      'agentic-tdg',
       'agentic-bootstrap',
       'agentic-spec',
       'agentic-task',
@@ -64,6 +65,7 @@ export const PROFILES = {
       'agentic-ground',
       'agentic-next',
       'agentic-spike',
+      'agentic-tdg',
     ],
     conditional: {
       'agentic-design': 'frontend',
@@ -86,6 +88,7 @@ export const PROFILES = {
       'agentic-ground',
       'agentic-next',
       'agentic-spike',
+      'agentic-tdg',
     ],
     conditional: {
       'agentic-design': 'frontend',
package/src/lib/rootdoc.js CHANGED
@@ -18,7 +18,7 @@ export const SKILL_DESCRIPTIONS = {
   'agentic-architecture': 'Generate or audit `ARCHITECTURE.md` at the repo root.',
   'agentic-adr': 'Draft a new ADR at `doc/adr/NNNN-<slug>.md`.',
   'agentic-spec':
-    'Draft a feature spec at `doc/specs/NNNN-<slug>.md` (Spec Kit-aligned mandatory sections). Layer
+    'Draft a feature spec at `doc/specs/NNNN-<slug>.md` (Spec Kit-aligned mandatory sections). Layer 3 of the five-layer artifact stack.',
   'agentic-task': 'Draft a new task at `doc/tasks/NNNN-<slug>.md`.',
   'agentic-audit':
     'Read-only drift report comparing AGENTS.md / ARCHITECTURE.md / ADRs against the code.',
@@ -27,9 +27,11 @@ export const SKILL_DESCRIPTIONS = {
   'agentic-ground':
     'Four-source pre-implementation research (docs / OSS / in-repo / git history) + happy-path synthesis + deviation gate. WORKFLOW §4 + §5.',
   'agentic-next':
-    'State survey + prioritized next-action recommendations across the
+    'State survey + prioritized next-action recommendations across the five-layer artifact stack. Read-only navigation aid (`flutter doctor` pattern).',
   'agentic-spike':
     'Staged spike with golden fixtures per WORKFLOW §14. Discovery + fixture + pipeline-with-gates + two-layer evaluation, when the *technique* is uncertain across multiple plausible approaches.',
+  'agentic-tdg':
+    'Outcome-based prompting per WORKFLOW §9. Ground truth pair + Test Dependency Map + three approaches + single-criterion selection, when the technique is known but the implementation strategy is uncertain.',
   'agentic-design': 'Bootstrap `DESIGN.md` from existing tokens (frontend projects).',
   'agentic-subagent': 'Draft a new Claude Code subagent at `.claude/agents/<name>.md`.',
   'agentic-skill': 'Draft a new Claude Code or Codex skill at the appropriate path.',
package/src/skills/claude-code/agentic-bootstrap/SKILL.md CHANGED
@@ -165,6 +165,6 @@ A single `AGENTS.md` at the repo root, ≤150 lines, every line operational. No
 ## Next
 
 - In `team` / `mature`: run `/agentic-architecture` once load-bearing patterns emerge in the code.
-- When you start your first feature: `/agentic-spec` (Layer
+- When you start your first feature: `/agentic-spec` (Layer 3 of the five-layer artifact stack).
 - Skip both above in `poc` / `solo` until the project genuinely needs them — the WORKFLOW §1 prune principle applies.
 - `agentic-philosophy` auto-loads on non-trivial work; no explicit invocation needed.
package/src/skills/claude-code/agentic-next/SKILL.md CHANGED
@@ -1,6 +1,6 @@
 ---
 name: agentic-next
-description: Survey the project's state across the
+description: Survey the project's state across the five-layer artifact stack and recommend prioritized next actions, modeled on `flutter doctor`. Use when the user asks "what's next", "next step", "where am I", "project status", "doctor", "what should I do", "audit my workflow", or whenever a navigation aid is needed mid-flow. Read-only; complements `agentic-audit` (drift detection, a different question). Profile-aware — `poc` suppresses Layer 3 / 4 noise, `team` / `mature` run the full survey.
 allowed-tools: Read, Glob, Grep, Bash
 ---
 
@@ -23,7 +23,7 @@ Do not parse skill bodies. Do not run tests. Do not invoke other skills. The sur
 
 ## Step 1 — Layer-by-layer status
 
-Render
+Render five sections in this exact order. For each section, list what is present, what is in flight, what is missing or stale. Use words for status (`present`, `in flight`, `missing`, `stale`) — no emoji.
 
 **Layer 1 — Constitution.**
 - `AGENTS.md` (or `CLAUDE.md`) present?
@@ -31,7 +31,11 @@ Render four sections in this exact order. For each section, list what is present
 - `ARCHITECTURE.md` present?
 - `DESIGN.md` present? (frontend projects only)
 
-**Layer 2 —
+**Layer 2 — Domain (`CONTEXT.md`).**
+- `CONTEXT.md` present at repo root, *or* `CONTEXT-MAP.md` plus per-context `CONTEXT.md` files for multi-context repos? (Lazy-created per [ADR-0019](../adr/0019-domain-language-layer.md) — `missing` is a valid state for projects whose first domain term has not been resolved yet, not a finding to flag in `poc` / `solo`.)
+- For each present `CONTEXT.md`, report whether the Language section has at least one term with an `_Avoid_:` line — an empty glossary is worse than no glossary.
+
+**Layer 3 — Specs (`doc/specs/`).**
 
 For each spec file, report `Status` and the count of tasks whose `Spec ref` field points at it:
 
@@ -42,13 +46,13 @@ For each spec file, report `Status` and the count of tasks whose `Spec ref` fiel
 
 Flag specs with `Status: accepted` and zero implementing tasks — that is the most common stuck state.
 
-**Layer
+**Layer 4 — Plans / Decisions.**
 
 `doc/adr/` — count by status: `proposed`, `accepted`, `deprecated`, `superseded`. Flag any `proposed` ADRs explicitly with their slug — they need a decision.
 
 `doc/tasks/` — count by status: `proposed`, `in-progress`, `blocked`, `done`. List in-progress and blocked tasks with their slugs and `Spec ref`. Flag tasks with no `Spec ref` and no `Board ref` as orphans (no clear scope tie).
 
-**Layer
+**Layer 5 — Code.**
 - Branch: `<name>` (`<n>` commits ahead of `main` if applicable).
 - Tests: wired? (presence of `npm test` script / `pytest` / `cargo test` / `go test ./...`).
 - Hooks: wired? (presence of `.husky/`, `lefthook.yml`, `.pre-commit-config.yaml`, or active `.git/hooks/` scripts).
@@ -81,8 +85,8 @@ If nothing actionable surfaces, say so explicitly — empty output is real signa
 
 Apply per-profile rules at the end so the user sees output matched to their maturity:
 
-- **`poc`:** suppress Layer
-- **`solo`:** Layer
+- **`poc`:** suppress Layer 3 (specs) and Layer 4 (ADRs / tasks) sections entirely if those directories do not exist. Show Layer 1 + Layer 2 + Layer 5 only. Layer 2 (Domain) renders informationally — `CONTEXT.md` missing is *not* a finding at `poc` (the file is lazy-created when the first term is resolved). Recommendation set: `/agentic-ground` for research, `/agentic-audit` for drift, `/agentic-update` for staleness.
+- **`solo`:** Layer 3 / Layer 4 render but ADR / `ARCHITECTURE.md` absence is informational — no "needs action" flag. Specs are universal at this profile; spec-without-tasks remains a real finding. Layer 2 — same lazy-creation rule as `poc`.
 - **`team`:** full survey. Default profile.
 - **`mature`:** additionally flag hooks-not-wired louder ("WORKFLOW §11 binding for `mature` profile — `/agentic-hooks` recommended").
 
@@ -99,13 +103,16 @@ A single Markdown message structured as:
 
 ### Layer 1 — Constitution
 <one-line status per artifact>
 
-### Layer 2 —
+### Layer 2 — Domain (CONTEXT.md)
+<present / lazy-missing per ADR-0019; glossary-empty flag if file exists but has no terms>
+
+### Layer 3 — Specs (doc/specs/)
 <spec list with status + task count, or "no specs">
 
-### Layer
+### Layer 4 — Plans / Decisions
 <ADR + task summaries with explicit flags>
 
-### Layer
+### Layer 5 — Code
 <branch / tests / hooks / CI status>
 
 ### Recommended next (priority)
package/src/skills/claude-code/agentic-spec/SKILL.md CHANGED
@@ -1,12 +1,12 @@
 ---
 name: agentic-spec
-description: Draft a feature-level specification at doc/specs/NNNN-<slug>.md following the kit's
+description: Draft a feature-level specification at doc/specs/NNNN-<slug>.md following the kit's five-layer artifact stack (Constitution → Domain → Spec → Plan/Decisions → Code). Adapts GitHub Spec Kit's mandatory sections (User Scenarios, Requirements, Success Criteria) to the kit's documentation discipline. Use when the user wants to write, draft, scaffold, or open a feature spec, PRD, product requirements, feature brief, user stories, or success criteria for a feature multiple tasks will implement. Status starts at draft; the file is the binding feature contract once accepted.
 allowed-tools: Read, Write, Glob, Bash
 ---
 
 # /agentic-spec
 
-Drafts `doc/specs/<NNNN>-<short-slug>.md` for one feature. Status lifecycle: `draft` → `accepted` → `shipped` | `superseded by SPEC-NNNN`. Spec is the layer-
+Drafts `doc/specs/<NNNN>-<short-slug>.md` for one feature. Status lifecycle: `draft` → `accepted` → `shipped` | `superseded by SPEC-NNNN`. Spec is the layer-3 artifact in the kit's five-layer stack — Constitution (`AGENTS.md` + `WORKFLOW.md`) → Domain (`CONTEXT.md`) → **Spec (this skill)** → Plan/Decisions (`ARCHITECTURE.md` + `doc/adr/` + `doc/tasks/`) → Code. Multiple tasks implement one spec; ADRs may be driven by spec constraints. The Domain layer (`CONTEXT.md`, ubiquitous language per Evans 2003) is the source of canonical nouns the spec must use; if the spec introduces a new noun, resolve it through `CONTEXT.md` first.
 
 ## Step 1 — Determine NNNN and slug
 
package/src/skills/claude-code/agentic-tdg/SKILL.md CHANGED
@@ -0,0 +1,126 @@
+---
+name: agentic-tdg
+description: Outcome-based prompting per WORKFLOW.md §9. Give the agent the finish line first, not the path. Five steps — confirm regime, ground truth pair, Test Dependency Map, three approaches, pick by one criterion, implement and verify. Triggers on "outcome-based", "TDG", "ground truth", "expected output", "three approaches", "pick by criterion", "test dependency map", "TDM", "before modifying", "tests covering this file", "give the finish line". Routes to `agentic-spike` if the technique itself is uncertain. No file written; output is the verified implementation that lands through normal commits.
+allowed-tools: Read, Glob, Grep, Bash
+---
+
+# /agentic-tdg
+
+Implements WORKFLOW.md §9 (Outcome-Based Prompting / Test Dependency Map) end-to-end. Process scaffold for the implementation phase when the canonical technique is known and multiple implementation strategies are plausible. No file written — output is the verified implementation that lands through normal commits; ground-truth pair, candidate set, and selection criterion go into the commit message body or task `Notes`.
+
+## Step 0 — Confirm regime
+
+TDG is for the *technique-known, implementation-strategy-uncertain* regime. Run the skill only when:
+
+* The canonical approach is settled (via `agentic-ground` or prior knowledge), AND
+* Multiple implementation strategies could produce the expected output with different trade-offs along readability / performance / testability axes.
+
+Route elsewhere when:
+
+* The *technique itself* is uncertain across multiple plausible approaches → `/agentic-spike` (WORKFLOW §14). Spike validates technique with golden fixture + per-stage debug; TDG validates implementation with ground-truth pair + TDM.
+* The path is fully obvious (one-line fix, mechanical refactor, byte-for-byte port) → TDG is overkill; proceed directly without the skill.
+* No tests cover the surface and the project does not yet have a test runner wired → consider `/agentic-hooks` for project gates first; TDG depends on a verifiable green baseline.
+
+## Step 1 — Ground truth pair
+
+State raw input + exact expected output before any code is written. Concrete, not aspirational. The pair is the contract the implementation must satisfy.
+
+Format depends on the surface:
+
+* For pure functions: input arguments + expected return value, as code-block or JSON.
+* For data transformations: source data + transformed data, side-by-side.
+* For CLI / API surfaces: request payload + response payload, as paste-ready examples.
+* For UI changes: pre-state DOM / screenshot + post-state DOM / screenshot.
+
+Example (pure function):
+
+```
+Input: parse('--agent claude-code --yes')
+Expected output: { agent: 'claude-code', yes: true, dryRun: false, force: false }
+```
+
+The pair is small, explicit, and verifiable. Aspirational language ("loads fast", "handles all edge cases") is not ground truth — concrete examples are.
+
+## Step 2 — Test Dependency Map (TDM)
+
+List the tests covering the file(s) the change will touch. Run them to establish the green baseline. If no tests cover the surface, write one first that exercises the *current* behavior before any modification.
+
+Process:
+
+1. Grep for tests referencing the file by import path: `grep -r 'from.*<file>' test/ tests/`.
+2. Grep for tests referencing the function or symbol: `grep -r '<symbol>' test/ tests/`.
+3. Run the matched tests in isolation to confirm green: `npx node --test test/<matched>.test.js` or equivalent for the language.
+4. If empty: write a test that asserts the current behavior. Run. Confirm green. *Then* proceed to Step 3.
+
+The TDM is the verification surface — Step 5's "implement + verify" loop runs these tests, not the full suite, so the feedback loop stays under a few seconds. Full-suite runs happen at commit-time via the project's pre-push hook (`agentic-hooks`).
+
+## Step 3 — Three approaches
+
+Generate three implementation candidates that produce the ground-truth pair. Each candidate names trade-offs along the §9 axes:
+
+```
+Approach A: <name / one-line description>
+- readability: <high / medium / low + reason>
+- performance: <O(...), bytes allocated, etc.>
+- testability: <surface area, mockability, isolation>
+
+Approach B: <name / one-line description>
+- readability: ...
+- performance: ...
+- testability: ...
+
+Approach C: <name / one-line description>
+- readability: ...
+- performance: ...
+- testability: ...
+```
+
+No premature optimization across all three axes — each candidate names its trade-offs honestly. A candidate that is "high on every axis" is suspect; surface what it actually trades against.
+
+Three is a soft target. If the surface is small and only two approaches are plausible, two is fine. If the third candidate is contrived to fill a slot, drop it; the criterion-based selection in Step 4 handles two as easily as three.
+
+## Step 4 — Pick by one criterion
+
+The user names the criterion explicitly — readability *or* performance *or* testability, **not all three at once**. The skill commits to the selected candidate. Alternatives get one-line rejection notes for the commit message body or task `Notes`:
+
+```
+Criterion: testability
+Picked: Approach B (smaller mock surface, no global state read)
+Rejected:
+- Approach A — one fewer allocation but mocks four singletons (testability lower)
+- Approach C — clearer at the call site but writes shared state (testability lower)
+```
+
+The discipline: refuse "optimize for readability AND performance AND testability". Trade-offs are real — naming the criterion makes them visible.
+
+When the host exposes `AskUserQuestion` (per ADR-0014), render the three candidates as a structured multi-choice card with the criterion as the framing question.
+
+## Step 5 — Implement + verify
+
+Modify the file. Run the TDM tests from Step 2. If green, the change is done — proceed to commit with the ground-truth pair + criterion + rejection notes in the commit message body. If red, iterate against the same ground-truth pair until green.
+
+Verification rules:
+
+* Do **not** edit the TDM tests to make them green. The tests are the verification surface; modifying them while implementing is the hidden form of "almost right" that WORKFLOW §12 names. If a test is wrong, fix the test first as a separate diff with its own ground-truth pair.
+* Do **not** declare done before the TDM tests pass. Type-check passing is not the gate; the test gate is.
+* If iteration loops past three attempts and the implementation still fails, **stop and re-examine the ground-truth pair**. The pair may be ambiguous, or the canonical approach may not actually map to the expected output. Routing back to `agentic-ground` (re-research) or `agentic-spike` (technique uncertain) is preferable to forcing a wrong implementation through.
+
+## Output contract
+
+No file written. Output is structured conversation:
+
+1. Confirmation of regime (Step 0 — TDG or routed elsewhere).
+2. Ground-truth pair (Step 1).
+3. TDM list with green-baseline confirmation (Step 2).
+4. Three candidates with trade-off table (Step 3).
+5. Selection by criterion + rejection notes (Step 4).
+6. Implementation + verification result (Step 5).
+
+When the change is committed, the ground-truth pair, criterion, and rejection notes go into the commit message body. When a task file exists, the same content also lands in the task's `Notes` log under a dated entry.
+
+## Next
+
+- After Step 5 verification passes: commit the change with the ground-truth pair + criterion + rejection notes in the body.
+- `/agentic-review main..HEAD` (or current scope) before merge — WORKFLOW §10. The TDM tests verify the implementation matches ground truth; §10 review checks coupling, edge cases, spec drift the pair did not cover.
+- If the work spans multiple sessions: `/agentic-task` for explicit decomposition (Spec ref the original spec; cite this TDG run in the task `Notes`).
+- If iteration stalled at Step 5 and routing back was needed: `/agentic-ground` (re-research) or `/agentic-spike` (technique uncertain).
package/src/skills/codex/agentic-bootstrap/SKILL.md CHANGED
@@ -147,6 +147,6 @@ A single `AGENTS.md` at the repo root, ≤150 lines, every line operational. No
 ## Next
 
 - In `team` / `mature`: run `/agentic-architecture` once load-bearing patterns emerge in the code.
-- When you start your first feature: `/agentic-spec` (Layer
+- When you start your first feature: `/agentic-spec` (Layer 3 of the five-layer artifact stack).
 - Skip both above in `poc` / `solo` until the project genuinely needs them — the WORKFLOW §1 prune principle applies.
 - `agentic-philosophy` auto-loads on non-trivial work.
package/src/skills/codex/agentic-next/SKILL.md CHANGED
@@ -1,6 +1,6 @@
 ---
 name: agentic-next
-description: Survey the project's state across the
+description: Survey the project's state across the five-layer artifact stack and recommend prioritized next actions, modeled on `flutter doctor`. Use when the user asks "what's next", "next step", "where am I", "project status", "doctor", "what should I do", "audit my workflow". Read-only; complements `agentic-audit` (drift detection, a different question). Profile-aware — `poc` suppresses Layer 3/4 noise, `team`/`mature` run the full survey.
 ---
 
 <background_information>
@@ -20,15 +20,17 @@ Step 0 — read state. Detect baseline:
 
 Do not parse skill bodies. Do not run tests. Do not invoke other skills.
 
-Step 1 — layer-by-layer status. Render
+Step 1 — layer-by-layer status. Render five sections in this exact order. Use words for status (`present`, `in flight`, `missing`, `stale`) — no emoji.
 
 Layer 1 — Constitution: AGENTS.md / CLAUDE.md, WORKFLOW.md, ARCHITECTURE.md, DESIGN.md (frontend only).
 
-Layer 2 —
+Layer 2 — Domain (CONTEXT.md): present at repo root, *or* CONTEXT-MAP.md plus per-context CONTEXT.md files? Lazy-created per ADR-0019 — `missing` is valid for projects whose first domain term has not been resolved yet, not a finding to flag in poc / solo. For each present file, flag empty-glossary (Language section with zero terms).
 
-Layer 3 —
+Layer 3 — Specs (doc/specs/): for each file, report Status + count of tasks whose Spec ref points at it. Flag specs with Status: accepted and zero implementing tasks.
 
-Layer 4 —
+Layer 4 — Plans / Decisions: doc/adr/ counts by status, flag proposed ADRs with their slug. doc/tasks/ counts by status, list in-progress + blocked with slug and Spec ref. Flag tasks with no Spec ref and no Board ref as orphans.
+
+Layer 5 — Code: branch + ahead count, tests wired? (npm test / pytest / cargo test / go test), hooks wired? (.husky / lefthook.yml / .pre-commit-config.yaml / .git/hooks/), CI wired? (.github/workflows / .gitlab-ci.yml / .circleci/).
 
 Step 2 — cross-cut signals:
 - Pending fresh-context review: branch ≥1 commits ahead of main with no .agentic/reviews/<ts>-*.md for the current range → recommend agentic-review.
@@ -48,8 +50,8 @@ Priority heuristic:
 If nothing actionable surfaces, say so: "No urgent next action. Continue current work or invoke `/agentic-audit` for a full drift check."
 
 Step 4 — profile-aware filtering. Apply at the end:
-- poc: suppress Layer
-- solo: Layer
+- poc: suppress Layer 3/4 sections if those directories do not exist. Show Layer 1 + Layer 2 + Layer 5 only. Layer 2 (Domain) renders informationally — CONTEXT.md missing is *not* a finding (lazy-created). Recommendation set: `/agentic-ground`, `/agentic-audit`, `agentic update`.
+- solo: Layer 3/4 render; ADR / ARCHITECTURE.md absence is informational, not a flag. Specs are universal; spec-without-tasks remains a real finding. Layer 2 — same lazy-creation rule as poc.
 - team: full survey (default).
 - mature: additionally flag hooks-not-wired louder (WORKFLOW §11 binding for mature profile).
 </instructions>
@@ -66,13 +68,16 @@ A single Markdown message structured as:
 
 ### Layer 1 — Constitution
 <one-line status per artifact>
 
-### Layer 2 —
+### Layer 2 — Domain (CONTEXT.md)
+<present / lazy-missing per ADR-0019; glossary-empty flag if file exists but has no terms>
+
+### Layer 3 — Specs (doc/specs/)
 <spec list with status + task count, or "no specs">
 
-### Layer
+### Layer 4 — Plans / Decisions
 <ADR + task summaries with explicit flags>
 
-### Layer
+### Layer 5 — Code
 <branch / tests / hooks / CI status>
 
 ### Recommended next (priority)
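The Layer 5 checks the survey describes (tests / hooks / CI wired?) reduce to file-presence probes; a minimal sketch, assuming the marker paths the instructions list and a `surveyLayer5` helper name that is illustrative only:

```javascript
import { existsSync } from 'node:fs';
import { join } from 'node:path';

// Presence probes behind the Layer 5 survey; a group counts as "wired"
// if any of its marker paths exists under the repo root.
function surveyLayer5(root = '.') {
  const any = (paths) => paths.some((p) => existsSync(join(root, p)));
  return {
    hooks: any(['.husky', 'lefthook.yml', '.pre-commit-config.yaml']),
    ci: any(['.github/workflows', '.gitlab-ci.yml', '.circleci']),
  };
}
```

Branch state and the ahead-of-main count come from `git` rather than file probes, which is why the skill's allowed tools include Bash.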
package/src/skills/codex/agentic-next/agents/openai.yaml CHANGED
@@ -1,5 +1,5 @@
 interface:
   display_name: agentic-next
-  short_description: State-aware navigation aid (flutter doctor pattern). Surveys
+  short_description: State-aware navigation aid (flutter doctor pattern). Surveys five-layer artifact stack + recommends prioritized next actions. Read-only.
 policy:
   allow_implicit_invocation: false
package/src/skills/codex/agentic-spec/SKILL.md CHANGED
@@ -1,10 +1,10 @@
 ---
 name: agentic-spec
-description: Draft a feature-level specification at doc/specs/NNNN-<slug>.md following the
+description: Draft a feature-level specification at doc/specs/NNNN-<slug>.md following the five-layer artifact stack (Constitution → Domain → Spec → Plan/Decisions → Code). Adapts GitHub Spec Kit's mandatory sections (User Scenarios, Requirements, Success Criteria) to the kit's documentation discipline. Use when the user wants to write, draft, scaffold, or open a feature spec, PRD, product requirements, feature brief, user stories, or success criteria. Status starts at draft.
 ---
 
 <background_information>
-Drafts `doc/specs/<NNNN>-<short-slug>.md` for one feature. Status lifecycle: draft → accepted → shipped | superseded by SPEC-NNNN. Spec is the layer-
+Drafts `doc/specs/<NNNN>-<short-slug>.md` for one feature. Status lifecycle: draft → accepted → shipped | superseded by SPEC-NNNN. Spec is the layer-3 artifact in the kit's five-layer stack — Constitution (`AGENTS.md` + `WORKFLOW.md`) → Domain (`CONTEXT.md`) → Spec (this skill) → Plan/Decisions (`ARCHITECTURE.md` + `doc/adr/` + `doc/tasks/`) → Code. Multiple tasks implement one spec; ADRs may be driven by spec constraints. The Domain layer (ubiquitous language per Evans 2003) is the source of canonical nouns the spec must use; if the spec introduces a new noun, resolve it through `CONTEXT.md` first.
 
 Codex auto-trigger on description keywords is less mature than Claude Code's. If auto-invocation does not fire when the user asks to "draft a spec" or "write a PRD" on a non-trivial feature, invoke this skill manually.
 </background_information>
@@ -0,0 +1,109 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: agentic-tdg
|
|
3
|
+
description: Outcome-based prompting per WORKFLOW.md §9. Give the agent the finish line first, not the path. Five steps — confirm regime, ground truth pair, Test Dependency Map, three approaches, pick by one criterion, implement and verify. Triggers on "outcome-based", "TDG", "ground truth", "expected output", "three approaches", "pick by criterion", "test dependency map", "TDM", "before modifying", "tests covering this file", "give the finish line". Routes to `agentic-spike` if the technique itself is uncertain. No file written; output is the verified implementation that lands through normal commits.
|
|
4
|
+
---
<background_information>
Implements WORKFLOW.md §9 (Outcome-Based Prompting / Test Dependency Map) end-to-end. The skill is for the implementation phase when the canonical technique is known and multiple implementation strategies are plausible. WORKFLOW §14 (`agentic-spike`) covers the technique-unknown regime; §9 (this skill) covers the implementation-strategy-uncertain regime.

No file is written. The output of the skill is the verified implementation that lands in the repo through normal commits. The ground-truth pair, candidate set, selection criterion, and TDM list go into the commit message body — or the task's `Notes` log when one exists — not into a separate `doc/` artifact.

Codex auto-trigger on description keywords is less mature than Claude Code's. If auto-invocation does not fire when the user mentions outcome-based prompting, ground truth, three approaches, or a Test Dependency Map, invoke this skill manually.
</background_information>
<instructions>
Step 0 — confirm regime. TDG is for the *technique-known, implementation-strategy-uncertain* regime. Run the skill only when the canonical approach is settled (via `agentic-ground` or prior knowledge) AND multiple implementation strategies could produce the expected output with different trade-offs along the readability / performance / testability axes.

Route elsewhere when:
- The *technique itself* is uncertain across multiple plausible approaches → `agentic-spike` (WORKFLOW §14). Spike validates technique with golden fixture + per-stage debug; TDG validates implementation with ground-truth pair + TDM.
- The path is fully obvious (one-line fix, mechanical refactor, byte-for-byte port) → TDG is overkill; proceed directly without the skill.
- No tests cover the surface and the project does not yet have a test runner wired → consider `agentic-hooks` for project gates first; TDG depends on a verifiable green baseline.
Step 1 — ground truth pair. State raw input + exact expected output before any code is written. Concrete, not aspirational. The pair is the contract the implementation must satisfy.

Format depends on the surface:
- For pure functions: input arguments + expected return value, as a code block or JSON.
- For data transformations: source data + transformed data, side by side.
- For CLI / API surfaces: request payload + response payload, as paste-ready examples.
- For UI changes: pre-state DOM / screenshot + post-state DOM / screenshot.

Example (pure function):
```
Input: parse('--agent claude-code --yes')
Expected output: { agent: 'claude-code', yes: true, dryRun: false, force: false }
```

The pair is small, explicit, and verifiable. Aspirational language ("loads fast", "handles all edge cases") is not ground truth — concrete examples are.
Step 2 — Test Dependency Map (TDM). List the tests covering the file(s) the change will touch. Run them to establish the green baseline. If no tests cover the surface, write one first that exercises the *current* behavior before any modification.

Process:
1. Grep for tests referencing the file by import path: `grep -r 'from.*<file>' test/ tests/`.
2. Grep for tests referencing the function or symbol: `grep -r '<symbol>' test/ tests/`.
3. Run the matched tests in isolation to confirm green: `node --test test/<matched>.test.js` or equivalent for the language.
4. If empty: write a test that asserts the current behavior. Run it. Confirm green. *Then* proceed to Step 3.

The TDM is the verification surface — Step 5's "implement + verify" loop runs these tests, not the full suite, so the feedback loop stays under a few seconds. Full-suite runs happen at commit time via the project's pre-push hook (`agentic-hooks`).
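Run against a throwaway fixture, the first two steps of the process look like the following. The directory layout, file name, and `parse` symbol are illustrative, not part of this package:

```shell
# Build the TDM for a hypothetical change to src/lib/flags.js.
# Runs in a temp directory so every command here is checkable.
set -eu
dir=$(mktemp -d)
mkdir -p "$dir/test"
cat > "$dir/test/flags.test.js" <<'EOF'
const { parse } = require('../src/lib/flags.js');
EOF

# 1. Tests referencing the file under change by import path
grep -rl "require(.*flags" "$dir/test"

# 2. Tests referencing the symbol under change
grep -rl "parse" "$dir/test"
```

Each matched file goes into the TDM list; step 3 then runs exactly those files to confirm the green baseline.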
Step 3 — three approaches. Generate three implementation candidates that produce the ground-truth pair. Each candidate names trade-offs along the §9 axes:

```
Approach A: <name / one-line description>
- readability: <high / medium / low + reason>
- performance: <O(...), bytes allocated, etc.>
- testability: <surface area, mockability, isolation>

Approach B: <name / one-line description>
- readability: ...
- performance: ...
- testability: ...

Approach C: <name / one-line description>
- readability: ...
- performance: ...
- testability: ...
```

No premature optimization across all three axes — each candidate names its trade-offs honestly. A candidate that is "high on every axis" is suspect; surface what it actually trades against.

Three is a soft target. If the surface is small and only two approaches are plausible, two is fine. If the third candidate is contrived to fill a slot, drop it; the criterion-based selection in Step 4 handles two as easily as three.
Step 4 — pick by one criterion. The user names the criterion explicitly — readability *or* performance *or* testability, **not all three at once**. The skill commits to the selected candidate. Alternatives get one-line rejection notes for the commit message body or task `Notes`:

```
Criterion: testability
Picked: Approach B (smaller mock surface, no global state read)
Rejected:
- Approach A — one fewer allocation but mocks four singletons (testability lower)
- Approach C — clearer at the call site but writes shared state (testability lower)
```

The discipline: refuse "optimize for readability AND performance AND testability". Trade-offs are real — naming the criterion makes them visible.
Step 5 — implement + verify. Modify the file. Run the TDM tests from Step 2. If green, the change is done — proceed to commit with the ground-truth pair + criterion + rejection notes in the commit message body. If red, iterate against the same ground-truth pair until green.

Verification rules:
- Do **not** edit the TDM tests to make them green. The tests are the verification surface; modifying them while implementing is the hidden form of "almost right" that WORKFLOW §12 names. If a test is wrong, fix the test first as a separate diff with its own ground-truth pair.
- Do **not** declare done before the TDM tests pass. Type-check passing is not the gate; the test gate is.
- If iteration loops past three attempts and the implementation still fails, **stop and re-examine the ground-truth pair**. The pair may be ambiguous, or the canonical approach may not actually map to the expected output. Routing back to `agentic-ground` (re-research) or `agentic-spike` (technique uncertain) is preferable to forcing a wrong implementation through.
</instructions>
<output_contract>
No file written. Output is structured conversation:

1. Confirmation of regime (Step 0 — TDG or routed elsewhere).
2. Ground-truth pair (Step 1).
3. TDM list with green-baseline confirmation (Step 2).
4. Three candidates with trade-off table (Step 3).
5. Selection by criterion + rejection notes (Step 4).
6. Implementation + verification result (Step 5).

When the change is committed, the ground-truth pair, criterion, and rejection notes go into the commit message body. When a task file exists, the same content also lands in the task's `Notes` log under a dated entry.
</output_contract>
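The commit record the contract describes can be sketched end to end in a throwaway repository; the file, message, and rejection notes below are illustrative:

```shell
# Hypothetical sketch of the commit that lands a TDG run, with the
# criterion and rejection notes carried in the message body.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "tdg@example.com"
git config user.name "tdg"
echo "module.exports = {};" > flags.js
git add flags.js
git commit -q -m "fix(cli): parse --agent flag per ground-truth pair" -m "Criterion: testability
Picked: Approach B (smaller mock surface, no global state read)
Rejected:
- Approach A — one fewer allocation but mocks four singletons
- Approach C — clearer at the call site but writes shared state"
```

A later `git log` on the file then recovers why the alternatives were rejected without consulting any separate artifact.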
## Next

- After Step 5 verification passes: commit the change with the ground-truth pair + criterion + rejection notes in the body.
- `/agentic-review main..HEAD` (or the current scope) before merge — WORKFLOW §10. The TDM tests verify the implementation matches ground truth; §10 review checks coupling, edge cases, and spec drift the pair did not cover.
- If the work spans multiple sessions: `/agentic-task` for explicit decomposition (`Spec ref` the original spec; cite this TDG run in the task `Notes`).
- If iteration stalled at Step 5 and routing back was needed: `/agentic-ground` (re-research) or `/agentic-spike` (technique uncertain).
@@ -0,0 +1,5 @@
interface:
  display_name: agentic-tdg
  short_description: Outcome-based prompting per WORKFLOW §9. Ground truth pair + Test Dependency Map + three approaches + single-criterion selection. For implementation-strategy uncertainty when the technique is known; routes to agentic-spike if the technique itself is uncertain.
policy:
  allow_implicit_invocation: false