@alexandrealvaro/agentic 0.11.3-beta.1 → 0.13.0-beta.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +6 -0
- package/package.json +1 -1
- package/src/commands/init.js +2 -0
- package/src/lib/profiles.js +7 -1
- package/src/lib/rootdoc.js +4 -0
- package/src/skills/claude-code/agentic-spike/SKILL.md +220 -0
- package/src/skills/claude-code/agentic-tdg/SKILL.md +126 -0
- package/src/skills/codex/agentic-spike/SKILL.md +89 -0
- package/src/skills/codex/agentic-spike/agents/openai.yaml +5 -0
- package/src/skills/codex/agentic-tdg/SKILL.md +109 -0
- package/src/skills/codex/agentic-tdg/agents/openai.yaml +5 -0
package/README.md
CHANGED
@@ -40,6 +40,8 @@ Two categories ([ADR-0007](doc/adr/0007-workflow-operational-skills.md)) and two
 | `agentic-review` | workflow-operational | universal | Fresh-context code review per WORKFLOW §10; structured findings, no "approve" | `/agentic-review <range>` |
 | `agentic-ground` | workflow-operational | universal | Four-source pre-implementation research (docs / OSS / in-repo / git history) + happy-path synthesis + deviation gate per WORKFLOW §4 + §5 | `/agentic-ground` |
 | `agentic-next` | workflow-operational | universal | State-aware navigation aid (`flutter doctor` pattern) — surveys the four-layer artifact stack and recommends prioritized next actions; complements `agentic-audit` (drift) | `/agentic-next` |
+| `agentic-spike` | workflow-operational | universal | Staged spike with golden fixtures per WORKFLOW §14, for cases where the *technique* is uncertain across multiple plausible approaches; produces `spikes/NNNN-<slug>/` with discovery + fixture + pipeline-with-gates + two-layer evaluation | `/agentic-spike` |
+| `agentic-tdg` | workflow-operational | universal | Outcome-based prompting per WORKFLOW §9 — ground truth pair + Test Dependency Map + three approaches + single-criterion selection, for cases where the technique is known but the implementation strategy is uncertain | `/agentic-tdg` |
 | `agentic-design` | spec-driven | auto if frontend detected | Bootstrap `DESIGN.md` from existing tokens (Figma, tailwind.config, tokens.json, CSS custom props) | `/agentic-design` |
 | `agentic-subagent` | spec-driven | auto if installing for Claude Code | Drafts `.claude/agents/<name>.md` (Claude Code only — Codex has no subagent primitive) | `/agentic-subagent` |
 | `agentic-skill` | spec-driven | opt-in only | Drafts a new Claude Code or Codex skill at the appropriate path | `/agentic-skill` |
@@ -155,6 +157,10 @@ The kit's discipline scales with the project's maturity. A solo PoC may legitima
 
 **Lost mid-flow?** Invoke `/agentic-next` at any time to survey the project's state across the four-layer artifact stack (Constitution → Spec → Plan/Decisions → Code) and get prioritized next-action recommendations. Read-only; complements `/agentic-audit` (drift detection — different question).
 
+**Technique uncertain across multiple plausible approaches?** Invoke `/agentic-spike` (per WORKFLOW §14) when the spec is clear but the *how* is unknown — library choice, multi-stage transformation, novel domain. The skill scaffolds a staged spike with golden fixtures + per-stage debug artifacts + two-layer evaluation under `spikes/NNNN-<slug>/`. The directory is throwaway by design; conclude with `/agentic-adr` and delete.
+
+**Technique known but implementation strategy uncertain?** Invoke `/agentic-tdg` (per WORKFLOW §9) when multiple algorithms could produce the expected output with different trade-offs along readability / performance / testability. The skill forces a ground-truth pair, lists the tests covering the file (Test Dependency Map), generates three implementation candidates, and commits to one by a single named criterion — refusing the "optimize for all three at once" failure mode. No file written; the verified implementation is the artifact, with the candidate set + criterion landing in the commit message body.
+
 ## Manual prompts
 
 If you prefer to skip the installer, the same artifacts can be generated by pasting prompts directly into your agent. Each prompt file has the literal text to copy, plus the matching template structure:
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@alexandrealvaro/agentic",
-  "version": "0.11.3-beta.1",
+  "version": "0.13.0-beta.1",
   "description": "Bootstrap and audit AGENTS.md, ARCHITECTURE.md, ADRs, skills, and subagents for engineering production code with LLMs",
   "type": "module",
   "bin": {
package/src/commands/init.js
CHANGED
@@ -370,6 +370,8 @@ export async function initCommand(opts) {
       '/agentic-review (WORKFLOW §10)',
       '/agentic-ground (WORKFLOW §4 + §5)',
       '/agentic-next (state survey + recommendations)',
+      '/agentic-spike (WORKFLOW §14 — staged spike with golden fixtures)',
+      '/agentic-tdg (WORKFLOW §9 — outcome-based prompting + TDM)',
       ...(optedSkills.includes('agentic-design') ? ['/agentic-design (DESIGN.md)'] : []),
       ...(optedSkills.includes('agentic-subagent') && agents.includes('claude-code')
         ? ['/agentic-subagent']
package/src/lib/profiles.js
CHANGED
@@ -20,7 +20,7 @@ export const PROFILE_NAMES = ['poc', 'solo', 'team', 'mature'];
 
 export const PROFILES = {
   poc: {
-    universal: ['agentic-philosophy', 'agentic-ground', 'agentic-audit', 'agentic-next'],
+    universal: ['agentic-philosophy', 'agentic-ground', 'agentic-audit', 'agentic-next', 'agentic-spike', 'agentic-tdg'],
     conditional: {
       'agentic-design': 'blocked',
       'agentic-subagent': 'blocked',
@@ -35,6 +35,8 @@ export const PROFILES = {
       'agentic-ground',
       'agentic-audit',
       'agentic-next',
+      'agentic-spike',
+      'agentic-tdg',
       'agentic-bootstrap',
       'agentic-spec',
       'agentic-task',
@@ -62,6 +64,8 @@ export const PROFILES = {
       'agentic-review',
       'agentic-ground',
       'agentic-next',
+      'agentic-spike',
+      'agentic-tdg',
     ],
     conditional: {
       'agentic-design': 'frontend',
@@ -83,6 +87,8 @@ export const PROFILES = {
       'agentic-review',
       'agentic-ground',
       'agentic-next',
+      'agentic-spike',
+      'agentic-tdg',
     ],
     conditional: {
       'agentic-design': 'frontend',
package/src/lib/rootdoc.js
CHANGED
@@ -28,6 +28,10 @@ export const SKILL_DESCRIPTIONS = {
     'Four-source pre-implementation research (docs / OSS / in-repo / git history) + happy-path synthesis + deviation gate. WORKFLOW §4 + §5.',
   'agentic-next':
     'State survey + prioritized next-action recommendations across the four-layer artifact stack. Read-only navigation aid (`flutter doctor` pattern).',
+  'agentic-spike':
+    'Staged spike with golden fixtures per WORKFLOW §14. Discovery + fixture + pipeline-with-gates + two-layer evaluation, when the *technique* is uncertain across multiple plausible approaches.',
+  'agentic-tdg':
+    'Outcome-based prompting per WORKFLOW §9. Ground truth pair + Test Dependency Map + three approaches + single-criterion selection, when the technique is known but the implementation strategy is uncertain.',
   'agentic-design': 'Bootstrap `DESIGN.md` from existing tokens (frontend projects).',
   'agentic-subagent': 'Draft a new Claude Code subagent at `.claude/agents/<name>.md`.',
   'agentic-skill': 'Draft a new Claude Code or Codex skill at the appropriate path.',
package/src/skills/claude-code/agentic-spike/SKILL.md
ADDED
@@ -0,0 +1,220 @@
+---
+name: agentic-spike
+description: Scaffold a staged spike with golden fixtures per WORKFLOW.md §14, for cases where the spec is clear but the technique is uncertain across multiple plausible approaches. Four stages — discovery, golden fixture, pipeline with gates, two-layer evaluation. Use when the unknown is *how*, not *what*. Triggers on "spike", "uncertain technique", "which library", "CV pipeline", "evaluate approaches", "ground truth", "golden fixture", "staged pipeline", "debug per stage". Routes to `agentic-ground` if the *how* is routine and a single happy path is obvious. Read-and-write — creates `spikes/NNNN-<slug>/` with fixtures, debug per-stage artifacts, eval results.
+allowed-tools: Read, Write, Glob, Grep, Bash, WebFetch, WebSearch
+---
+
+# /agentic-spike
+
+Implements WORKFLOW.md §14 (Staged Spikes With Golden Fixtures) end-to-end. The skill is for cases where the spec is clear but the *technique* is uncertain across multiple plausible approaches — library choice, CV approach, multi-stage transformation. WORKFLOW §9 (TDG) assumes the path is known and validates end-to-end; §14 assumes the path is unknown and validates per stage. Different uncertainty regimes; this skill is for the unknown one.
+
+The skill creates a working directory under `spikes/NNNN-<slug>/` and fills it stage-by-stage. The directory is throwaway by design — when the spike concludes, an ADR records the decision (`/agentic-adr`) and the spike directory is deleted. See ADR-0017 for the promote-or-delete lifecycle rationale.
+
+## Step 0 — Confirm uncertainty
+
+The skill is for *unknown technique* across multiple plausible approaches, not for *non-trivial work* in general. If a single happy path is obvious, **do not start a spike**. If the *how* is knowable from official docs / OSS examples / in-repo patterns / git history, route to `agentic-ground` and stop.
+
+Concrete tests to run before starting:
+
+* Could `agentic-ground`'s four-source research surface a single happy path with a defensible deviation gate? If yes, run that instead.
+* Are there ≥2 candidate techniques with materially different trade-offs that no source resolves? If no, this is not a spike.
+* Is end-to-end validation against expected outputs feasible without per-stage debug? If yes, this is `agentic-task` + `agentic-philosophy` Goal-Driven Execution territory, not a spike.
+
+If the spike is warranted, confirm with the user the *recorte* (the specific surface where uncertainty sits — not the whole feature) and proceed.
+
+## Step 1 — Discovery
+
+List canonical approaches grounded in **official docs and real examples**. Pick one (or a small set, ≤3) by an **explicit criterion**.
+
+Candidate-listing process:
+
+1. Search official documentation for the language / library / domain in question. Cite URL + version.
+2. Search public OSS for repos that solve the same technical recorte. Cite `<repo>:<path>:<line-range>` and fetch via tools — never paraphrase from training memory.
+3. Survey in-repo for analogous patterns the codebase already uses. Cite `<file>:<line>` or "no analog found".
+4. Survey git history for prior attempts at the same problem. Cite `<commit-sha>` or "no prior attempt".
+
+Output format:
+
+```markdown
+## Discovery — <recorte>
+
+### Candidate techniques
+1. **<name>** — <one-line description>. Source: <URL or repo:path>. Trade-offs: <pros / cons>.
+2. **<name>** — ...
+3. **<name>** — ...
+
+### Selection criterion
+<one-line criterion: latency / accuracy / readability / dependencies / etc>
+
+### Picked
+<technique X>, picked by criterion <Y>. Alternatives held in reserve: <list>.
+```
+
+The output of this step is **information, not code**. No spike directory is created yet. The user reviews the candidate list and confirms the picked approach (or revises) before Step 2.
+
+## Step 2 — Golden fixture
+
+Curate inputs with rich expected outputs. The fixture is the ground truth the staged pipeline validates against; richer fixtures catch more failure modes.
+
+Create the spike directory:
+
+```bash
+mkdir -p spikes/NNNN-<slug>/{fixtures,debug,eval}
+```
+
+Where `NNNN` is the next available 4-digit number (mirrors ADR / task / spec numbering). List `spikes/` and pick the next slot.
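Picking the next slot is also scriptable; a minimal sketch (the function name and the assumption that spike directories match `NNNN-<slug>` are ours, not the skill's):

```python
from pathlib import Path

def next_spike_slot(root: str = "spikes") -> str:
    """Return the next available 4-digit spike number, '0001' if none exist."""
    dirs = Path(root).glob("[0-9][0-9][0-9][0-9]-*")
    nums = [int(p.name[:4]) for p in dirs if p.is_dir()]
    return f"{max(nums, default=0) + 1:04d}"
```

Gaps in the sequence are skipped over, mirroring how ADR numbering behaves: the slot after the highest existing number wins.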
+
+The fixture format is JSON keyed by input path (recommended) or whatever shape the domain demands. For computer vision: bounding boxes, sizes, lighting condition, difficulty tag, edge case markers. For multi-stage transformations: intermediate states. For library choice: representative inputs covering typical and edge cases.
+
+Example fixture file (`spikes/0001-detect-circles/fixtures/golden.json`):
+
+```json
+{
+  "inputs/easy-01.jpg": {
+    "expected": [
+      { "bbox": [120, 80, 240, 200], "label": "circle", "size": "large", "lighting": "even" }
+    ],
+    "difficulty": "easy",
+    "edge_cases": []
+  },
+  "inputs/hard-01.jpg": {
+    "expected": [
+      { "bbox": [50, 60, 90, 100], "label": "circle", "size": "small", "lighting": "low" },
+      { "bbox": [200, 80, 260, 140], "label": "circle", "size": "medium", "lighting": "even", "occluded": true }
+    ],
+    "difficulty": "hard",
+    "edge_cases": ["low-light", "partial-occlusion", "multiple-objects"]
+  }
+}
+```
+
+Curation principles:
+
+* Include **edge cases** (low light, partial occlusion, malformed inputs, large inputs, empty inputs) — not just "happy path" examples. The fixture's job is to surface *where* a technique fails, not just whether it succeeds on easy cases.
+* Include **difficulty tags** so per-stage evaluation can report performance segmented by difficulty.
+* Keep the fixture as data, not code. JSON / YAML / CSV — anything that diffs cleanly and survives a refactor.
+
+The fixture is the contract the pipeline validates against. Treat it like spec text — it should not change once the spike runs unless ground truth itself changes.
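A loader that fails fast when an entry lacks what evaluation depends on could look like this sketch (`load_fixture` and the easy/medium/hard tag set are illustrative assumptions mirroring the example fixture above, not a schema the skill prescribes):

```python
import json
from pathlib import Path

def load_fixture(path):
    """Load a golden fixture and fail fast if an entry is missing the pieces
    evaluation depends on: a list of expected outputs and a difficulty tag."""
    fixture = json.loads(Path(path).read_text())
    for name, entry in fixture.items():
        if not isinstance(entry.get("expected"), list):
            raise ValueError(f"{name}: 'expected' must be a list of ground-truth outputs")
        if entry.get("difficulty") not in {"easy", "medium", "hard"}:
            raise ValueError(f"{name}: missing or unknown 'difficulty' tag")
    return fixture
```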
+
+## Step 3 — Pipeline with gates
+
+One technique per stage. Each stage emits a **debug artifact** that makes its output inspectable.
+
+Pipeline structure:
+
+```
+spikes/NNNN-<slug>/
+├── README.md          # spike framing (Step 1 output)
+├── fixtures/          # golden inputs + expected outputs
+│   └── golden.json
+├── pipeline/          # one file per stage
+│   ├── 01-preprocess.<ext>
+│   ├── 02-detect.<ext>
+│   └── 03-postprocess.<ext>
+├── debug/             # per-stage debug artifacts
+│   ├── 01-preprocess/
+│   ├── 02-detect/
+│   └── 03-postprocess/
+└── eval/              # evaluation results (Step 4)
+```
+
+Each stage's debug artifact format depends on the domain:
+
+* CV pipelines: image saved to `debug/NN-<stage>/<input-name>.png` showing the stage's output.
+* Multi-stage transformations: intermediate JSON saved to `debug/NN-<stage>/<input-name>.json`.
+* Library evaluation: log row per (input, library) saved to `debug/NN-<stage>/log.csv`.
+
+The discipline: **each stage's output must be inspectable independently**. End-to-end output alone tells you *that* it failed; per-stage debug tells you *where*.
+
+Implementation pattern (any language):
+
+* Stage takes (input, context) and returns (output, debug-record). Debug-record is written to `debug/NN-<stage>/`.
+* Pipeline runs stages sequentially. Failure at any stage halts the pipeline and reports the stage where divergence happened.
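A minimal sketch of that stage-runner pattern (names like `run_pipeline` are illustrative, not prescribed by the skill; debug records are written as JSON, matching the multi-stage-transformation case above):

```python
import json
from pathlib import Path

def run_pipeline(stages, input_name, data, debug_root="debug"):
    """Run stages in order. Each stage is a (name, fn) pair where fn returns
    (output, debug_record); the record is written to debug/NN-<stage>/ so every
    stage's output stays independently inspectable."""
    for i, (name, stage_fn) in enumerate(stages, start=1):
        data, record = stage_fn(data)
        stage_dir = Path(debug_root) / f"{i:02d}-{name}"
        stage_dir.mkdir(parents=True, exist_ok=True)
        (stage_dir / f"{input_name}.json").write_text(json.dumps(record, indent=2))
    return data
```

Halting-on-divergence is then a matter of comparing each stage's output against the fixture's expected intermediate before moving on.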
+
+## Step 4 — Two-layer evaluation
+
+Run the pipeline against the fixture and emit two layers of results:
+
+* **End-to-end:** how many fixture inputs produced expected outputs? Reported as pass / fail per input, plus aggregate pass rate.
+* **Per-stage:** for each fixture input, where did the pipeline diverge? Stage NN's output vs the expected intermediate. Reported as pass / fail per (input, stage).
+
+Output to `spikes/NNNN-<slug>/eval/results.json`:
+
+```json
+{
+  "fixture": "fixtures/golden.json",
+  "pipeline_version": "<commit-sha or timestamp>",
+  "end_to_end": {
+    "total": 10,
+    "passed": 7,
+    "failed": 3
+  },
+  "per_stage": {
+    "01-preprocess": { "passed": 10, "failed": 0 },
+    "02-detect": { "passed": 8, "failed": 2 },
+    "03-postprocess": { "passed": 7, "failed": 1 }
+  },
+  "failures": [
+    {
+      "input": "inputs/hard-02.jpg",
+      "diverged_at": "02-detect",
+      "expected": [...],
+      "actual": [...],
+      "debug_artifact": "debug/02-detect/hard-02.png"
+    }
+  ]
+}
+```
+
+The per-stage layer is what makes the spike actionable. End-to-end says *that* it failed; per-stage + debug artifact says *where* and *why*.
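Aggregating the two layers is mechanical once per-stage expected intermediates exist; a sketch under the assumption that each input's results arrive as (stage, expected, actual) triples in stage order (the field names follow the results.json example):

```python
def two_layer_eval(results):
    """results: {input_name: [(stage, expected, actual), ...]} in stage order.
    Returns the end-to-end and per-stage layers plus first-divergence records."""
    per_stage, failures = {}, []
    for input_name, stage_rows in results.items():
        diverged_at = None
        for stage, expected, actual in stage_rows:
            ok = expected == actual
            tally = per_stage.setdefault(stage, {"passed": 0, "failed": 0})
            tally["passed" if ok else "failed"] += 1
            if not ok and diverged_at is None:
                diverged_at = stage
        if diverged_at is not None:
            failures.append({"input": input_name, "diverged_at": diverged_at})
    total = len(results)
    return {
        "end_to_end": {"total": total, "passed": total - len(failures), "failed": len(failures)},
        "per_stage": per_stage,
        "failures": failures,
    }
```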
+
+## Step 5 — Conclude (promote or delete)
+
+When the spike concludes — either the picked technique works or it does not — record the outcome via `/agentic-adr` and delete the spike directory. The ADR is the persistent artifact; the spike code is throwaway.
+
+ADR template for spike outcomes:
+
+```markdown
+# ADR-NNNN: We will use technique X for <recorte>
+
+## Context
+
+<why the spike was needed — what was uncertain>
+
+## Decision
+
+We will use technique X. The spike at `spikes/NNNN-<slug>/` (now deleted) showed:
+- End-to-end pass rate: <%>
+- Failures concentrated at stage <NN>, root cause <Y>
+- Mitigation: <Z>
+
+Alternatives held in reserve and rejected:
+- Technique A: rejected because <reason from spike eval>
+- Technique B: rejected because <reason from spike eval>
+
+## Consequences
+
+<follow-on work this decision unblocks; rails to maintain>
+```
+
+Then:
+
+```bash
+rm -rf spikes/NNNN-<slug>/
+git add doc/adr/NNNN-<slug>.md
+git commit -m "feat: adopt technique X for <recorte> per spike NNNN"
+```
+
+Spikes that conclude inconclusively get an ADR too — `Decision: defer; the spike at NNNN was inconclusive because Y` — and the directory is deleted. Inconclusive spikes are real signal; preserving the framing in an ADR prevents re-litigation.
+
+## Output contract
+
+A spike directory at `spikes/NNNN-<short-slug>/` with the four-stage layout above (discovery README, fixtures, pipeline, debug per stage, eval results). The directory is throwaway by design — promote-or-delete lifecycle per ADR-0017 §4. No `Status: shipped` lifecycle; spikes do not "ship" — they conclude with an ADR.
+
+When the host exposes `AskUserQuestion` (per ADR-0014), use it for the Step 1 selection criterion confirmation and the Step 5 promote/delete decision.
+
+## Next
+
+- After Step 1 (discovery output reviewed): proceed to Step 2 to create the spike directory + fixture, or abort if the discovery surfaced a single happy path (route to `agentic-ground`).
+- After Step 4 (eval results): `/agentic-adr` to record the outcome, then delete the spike directory.
+- If the spike succeeds and production work follows: `/agentic-task` for the work units to apply the spike's findings to production code (Spec ref the original spec if applicable; cite the ADR in the task `Notes`).
package/src/skills/claude-code/agentic-tdg/SKILL.md
ADDED
@@ -0,0 +1,126 @@
+---
+name: agentic-tdg
+description: Outcome-based prompting per WORKFLOW.md §9. Give the agent the finish line first, not the path. Six steps — confirm regime, ground truth pair, Test Dependency Map, three approaches, pick by one criterion, implement and verify. Triggers on "outcome-based", "TDG", "ground truth", "expected output", "three approaches", "pick by criterion", "test dependency map", "TDM", "before modifying", "tests covering this file", "give the finish line". Routes to `agentic-spike` if the technique itself is uncertain. No file written; output is the verified implementation that lands through normal commits.
+allowed-tools: Read, Glob, Grep, Bash
+---
+
+# /agentic-tdg
+
+Implements WORKFLOW.md §9 (Outcome-Based Prompting / Test Dependency Map) end-to-end. Process scaffold for the implementation phase when the canonical technique is known and multiple implementation strategies are plausible. No file written — output is the verified implementation that lands through normal commits; ground-truth pair, candidate set, and selection criterion go into the commit message body or task `Notes`.
+
+## Step 0 — Confirm regime
+
+TDG is for the *technique-known, implementation-strategy-uncertain* regime. Run the skill only when:
+
+* The canonical approach is settled (via `agentic-ground` or prior knowledge), AND
+* Multiple implementation strategies could produce the expected output with different trade-offs along readability / performance / testability axes.
+
+Route elsewhere when:
+
+* The *technique itself* is uncertain across multiple plausible approaches → `/agentic-spike` (WORKFLOW §14). Spike validates technique with golden fixture + per-stage debug; TDG validates implementation with ground-truth pair + TDM.
+* The path is fully obvious (one-line fix, mechanical refactor, byte-for-byte port) → TDG is overkill; proceed directly without the skill.
+* No tests cover the surface and the project does not yet have a test runner wired → consider `/agentic-hooks` for project gates first; TDG depends on a verifiable green baseline.
+
+## Step 1 — Ground truth pair
+
+State raw input + exact expected output before any code is written. Concrete, not aspirational. The pair is the contract the implementation must satisfy.
+
+Format depends on the surface:
+
+* For pure functions: input arguments + expected return value, as code-block or JSON.
+* For data transformations: source data + transformed data, side-by-side.
+* For CLI / API surfaces: request payload + response payload, as paste-ready examples.
+* For UI changes: pre-state DOM / screenshot + post-state DOM / screenshot.
+
+Example (pure function):
+
+```
+Input: parse('--agent claude-code --yes')
+Expected output: { agent: 'claude-code', yes: true, dryRun: false, force: false }
+```
+
+The pair is small, explicit, and verifiable. Aspirational language ("loads fast", "handles all edge cases") is not ground truth — concrete examples are.
+
+## Step 2 — Test Dependency Map (TDM)
+
+List the tests covering the file(s) the change will touch. Run them to establish the green baseline. If no tests cover the surface, write one first that exercises the *current* behavior before any modification.
+
+Process:
+
+1. Grep for tests referencing the file by import path: `grep -r 'from.*<file>' test/ tests/`.
+2. Grep for tests referencing the function or symbol: `grep -r '<symbol>' test/ tests/`.
+3. Run the matched tests in isolation to confirm green: `node --test test/<matched>.test.js` or equivalent for the language.
+4. If empty: write a test that asserts the current behavior. Run. Confirm green. *Then* proceed to Step 3.
+
+The TDM is the verification surface — Step 5's "implement + verify" loop runs these tests, not the full suite, so the feedback loop stays under a few seconds. Full-suite runs happen at commit-time via the project's pre-push hook (`agentic-hooks`).
+
+## Step 3 — Three approaches
+
+Generate three implementation candidates that produce the ground-truth pair. Each candidate names trade-offs along the §9 axes:
+
+```
+Approach A: <name / one-line description>
+- readability: <high / medium / low + reason>
+- performance: <O(...), bytes allocated, etc.>
+- testability: <surface area, mockability, isolation>
+
+Approach B: <name / one-line description>
+- readability: ...
+- performance: ...
+- testability: ...
+
+Approach C: <name / one-line description>
+- readability: ...
+- performance: ...
+- testability: ...
+```
+
+No premature optimization across all three axes — each candidate names its trade-offs honestly. A candidate that is "high on every axis" is suspect; surface what it actually trades against.
+
+Three is a soft target. If the surface is small and only two approaches are plausible, two is fine. If the third candidate is contrived to fill a slot, drop it; the criterion-based selection in Step 4 handles two as easily as three.
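As a toy illustration of the shape, here are three candidates for a word-frequency count, each with honest trade-off notes (the example is ours; the skill does not prescribe any of these):

```python
from collections import Counter, defaultdict

# Approach A: plain dict loop. Readability high for newcomers; no imports;
# performance is a Python-level loop.
def freq_a(words):
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

# Approach B: collections.Counter. Shortest and typically fastest in CPython,
# but the Counter type leaks into the return value unless converted.
def freq_b(words):
    return dict(Counter(words))

# Approach C: defaultdict. Middle ground; the mutation pattern is explicit,
# and converting back to dict keeps the default-factory from leaking.
def freq_c(words):
    counts = defaultdict(int)
    for w in words:
        counts[w] += 1
    return dict(counts)
```

All three satisfy the same ground-truth pair; only the named criterion decides between them.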
+
+## Step 4 — Pick by one criterion
+
+The user names the criterion explicitly — readability *or* performance *or* testability, **not all three at once**. The skill commits to the selected candidate. Alternatives get one-line rejection notes for the commit message body or task `Notes`:
+
+```
+Criterion: testability
+Picked: Approach B (smaller mock surface, no global state read)
+Rejected:
+- Approach A — one fewer allocation but mocks four singletons (testability lower)
+- Approach C — clearer at the call site but writes shared state (testability lower)
+```
+
+The discipline: refuse "optimize for readability AND performance AND testability". Trade-offs are real — naming the criterion makes them visible.
+
+When the host exposes `AskUserQuestion` (per ADR-0014), render the three candidates as a structured multi-choice card with the criterion as the framing question.
+
+## Step 5 — Implement + verify
+
+Modify the file. Run the TDM tests from Step 2. If green, the change is done — proceed to commit with the ground-truth pair + criterion + rejection notes in the commit message body. If red, iterate against the same ground-truth pair until green.
+
+Verification rules:
+
+* Do **not** edit the TDM tests to make them green. The tests are the verification surface; modifying them while implementing is the hidden form of "almost right" that WORKFLOW §12 names. If a test is wrong, fix the test first as a separate diff with its own ground-truth pair.
+* Do **not** declare done before the TDM tests pass. Type-check passing is not the gate; the test gate is.
+* If iteration loops past three attempts and the implementation still fails, **stop and re-examine the ground-truth pair**. The pair may be ambiguous, or the canonical approach may not actually map to the expected output. Routing back to `agentic-ground` (re-research) or `agentic-spike` (technique uncertain) is preferable to forcing a wrong implementation through.
+
+## Output contract
+
+No file written. Output is structured conversation:
+
+1. Confirmation of regime (Step 0 — TDG or routed elsewhere).
+2. Ground-truth pair (Step 1).
+3. TDM list with green-baseline confirmation (Step 2).
+4. Three candidates with trade-off table (Step 3).
+5. Selection by criterion + rejection notes (Step 4).
+6. Implementation + verification result (Step 5).
+
+When the change is committed, the ground-truth pair, criterion, and rejection notes go into the commit message body. When a task file exists, the same content also lands in the task's `Notes` log under a dated entry.
+
+## Next
+
+- After Step 5 verification passes: commit the change with the ground-truth pair + criterion + rejection notes in the body.
+- `/agentic-review main..HEAD` (or current scope) before merge — WORKFLOW §10. The TDM tests verify the implementation matches ground truth; §10 review checks coupling, edge cases, spec drift the pair did not cover.
+- If the work spans multiple sessions: `/agentic-task` for explicit decomposition (Spec ref the original spec; cite this TDG run in the task `Notes`).
+- If iteration stalled at Step 5 and routing back was needed: `/agentic-ground` (re-research) or `/agentic-spike` (technique uncertain).
|
@@ -0,0 +1,89 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: agentic-spike
|
|
3
|
+
description: Scaffold a staged spike with golden fixtures per WORKFLOW.md §14, for cases where the spec is clear but the technique is uncertain across multiple plausible approaches. Four stages — discovery, golden fixture, pipeline with gates, two-layer evaluation. Use when the unknown is *how*, not *what*. Triggers on "spike", "uncertain technique", "which library", "CV pipeline", "evaluate approaches", "ground truth", "golden fixture", "staged pipeline", "debug per stage". Routes to `agentic-ground` if the *how* is routine and a single happy path is obvious.
|
|
4
|
+
---
|
|
5
|
+
|
|
6
|
+
<background_information>
Implements WORKFLOW.md §14 (Staged Spikes With Golden Fixtures) end-to-end. The skill is for cases where the spec is clear but the technique is uncertain across multiple plausible approaches. WORKFLOW §9 (TDG) assumes the path is known and validates end-to-end; §14 assumes the path is unknown and validates per stage.

The skill creates `spikes/NNNN-<slug>/` and fills it stage-by-stage. The directory is throwaway by design — when the spike concludes, an ADR records the decision and the spike directory is deleted (promote-or-delete lifecycle per ADR-0017 §4).

Codex auto-trigger on description keywords is less mature than Claude Code's. If auto-invocation does not fire when the user mentions an uncertain technique or asks to evaluate approaches, invoke this skill manually.
</background_information>

<instructions>
Step 0 — confirm uncertainty. Skill is for unknown technique across multiple plausible approaches, not non-trivial work in general. If a single happy path is obvious, do NOT start a spike — route to `agentic-ground` and stop.

Tests:
- Could `agentic-ground`'s four-source research surface a single happy path with a defensible deviation gate? If yes, run that instead.
- Are there ≥2 candidate techniques with materially different trade-offs that no source resolves? If no, this is not a spike.
- Is end-to-end validation feasible without per-stage debug? If yes, this is `agentic-task` + `agentic-philosophy` Goal-Driven Execution territory.

If a spike is warranted, confirm the framing with the user and proceed.

Step 1 — discovery. List canonical approaches grounded in official docs and real examples. Pick one (or ≤3) by an explicit criterion.

Process:
- Search official documentation. Cite URL + version.
- Search OSS for repos solving the same problem. Cite `<repo>:<path>:<line-range>`; fetch via tools, never paraphrase from training memory.
- Survey in-repo for analogous patterns. Cite `<file>:<line>` or "no analog found".
- Survey git history for prior attempts. Cite `<commit-sha>` or "no prior attempt".

Output: candidate-list markdown with techniques, sources, trade-offs, selection criterion, picked technique. NO code yet. User reviews before Step 2.

Step 2 — golden fixture. Curate inputs with rich expected outputs. JSON keyed by input path (recommended). Include edge cases (low light, partial occlusion, malformed inputs, large inputs, empty inputs) and difficulty tags.

Create the spike directory:
```
mkdir -p spikes/NNNN-<slug>/{fixtures,debug,eval}
```
NNNN = the next 4-digit number after the highest existing under `spikes/`.

The fixture is the contract the pipeline validates against. Treat it like spec text — it should not change once the spike runs unless the ground truth itself changes.
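A minimal fixture sketch in this shape (the paths, expected fields, and tags below are illustrative, not prescribed by §14):

```
{
  "fixtures/inputs/receipt-001.png": {
    "expected": { "total": "42.90", "currency": "BRL" },
    "tags": ["easy"]
  },
  "fixtures/inputs/receipt-lowlight-002.png": {
    "expected": { "total": "13.50", "currency": "BRL" },
    "tags": ["hard", "low-light", "partial-occlusion"]
  },
  "fixtures/inputs/empty.png": {
    "expected": null,
    "tags": ["edge", "empty-input"]
  }
}
```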

Step 3 — pipeline with gates. One technique per stage. Each stage emits a debug artifact making its output inspectable.

Layout:
```
spikes/NNNN-<slug>/
├── README.md    # spike framing (Step 1 output)
├── fixtures/    # golden inputs + expected outputs
├── pipeline/    # one file per stage (01-preprocess, 02-detect, etc.)
├── debug/       # per-stage debug artifacts (image / JSON / log row)
└── eval/        # evaluation results (Step 4)
```

Each stage takes (input, context), returns (output, debug-record). Debug-record written to `debug/NN-<stage>/`. Pipeline halts and reports stage on first divergence.
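A sketch of that stage contract and gate loop, with two toy string-processing stages standing in for real ones (the runner shape and names are assumptions for illustration, not prescribed by §14):

```javascript
// Each stage returns { output, debug }; the runner records the debug
// artifact, gates the output against the fixture's per-stage expectation,
// and halts at the first divergence.
function runPipeline(stages, input, expectedByStage = {}) {
  const debugRecords = {};
  let current = input;
  for (const { name, run } of stages) {
    const { output, debug } = run(current);
    debugRecords[name] = debug; // in a real spike: written to debug/NN-<stage>/
    const expected = expectedByStage[name];
    if (expected !== undefined && JSON.stringify(output) !== JSON.stringify(expected)) {
      return { status: 'diverged', divergedAt: name, debugRecords };
    }
    current = output;
  }
  return { status: 'passed', output: current, debugRecords };
}

// Two toy stages standing in for real pipeline stages.
const stages = [
  { name: '01-normalize', run: (s) => ({ output: s.trim().toLowerCase(), debug: { raw: s } }) },
  { name: '02-tokenize', run: (s) => ({ output: s.split(' '), debug: { length: s.length } }) },
];
const result = runPipeline(stages, '  Hello World ', { '01-normalize': 'hello world' });
// result.status is 'passed'; a wrong expectation would report divergedAt: '01-normalize'
```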

Step 4 — two-layer evaluation:
- End-to-end: pass rate against fixture inputs.
- Per-stage: for each input, where did the pipeline diverge?

Output to `spikes/NNNN-<slug>/eval/results.json`:
```
{
  "fixture": "fixtures/golden.json",
  "end_to_end": { "total": 10, "passed": 7, "failed": 3 },
  "per_stage": { "01-preprocess": { "passed": 10, "failed": 0 }, ... },
  "failures": [{ "input": "...", "diverged_at": "02-detect", "debug_artifact": "..." }]
}
```

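The two layers can be tallied from per-input run records with a small aggregator; this sketch assumes each record carries the stages it ran and where (if anywhere) it diverged. The field names mirror the `results.json` shape above but are an assumption, not a prescribed API:

```javascript
// Aggregate per-input run records into the two evaluation layers:
// an end-to-end pass rate and a per-stage divergence tally.
function summarize(runs) {
  const perStage = {};
  const failures = [];
  let passed = 0;
  for (const run of runs) {
    if (run.divergedAt == null) passed++;
    else failures.push({ input: run.input, diverged_at: run.divergedAt, debug_artifact: run.debugArtifact });
    for (const stage of run.stagesRun) {
      perStage[stage] = perStage[stage] || { passed: 0, failed: 0 };
      if (stage === run.divergedAt) perStage[stage].failed++;
      else perStage[stage].passed++;
    }
  }
  return {
    end_to_end: { total: runs.length, passed, failed: runs.length - passed },
    per_stage: perStage,
    failures,
  };
}

const summary = summarize([
  { input: 'a.png', stagesRun: ['01-preprocess', '02-detect'], divergedAt: null },
  { input: 'b.png', stagesRun: ['01-preprocess', '02-detect'], divergedAt: '02-detect', debugArtifact: 'debug/02-detect/b.json' },
]);
// summary.end_to_end is { total: 2, passed: 1, failed: 1 }
```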
The per-stage layer is what makes the spike actionable: it points at *where* to debug, not just whether the technique passed.

Step 5 — conclude (promote or delete). When the spike concludes:
- Record the outcome via `/agentic-adr` (the ADR is the persistent artifact).
- Delete the spike directory: `rm -rf spikes/NNNN-<slug>/`.

The ADR captures: which technique was picked, which alternatives are held in reserve, the end-to-end pass rate, failures and their root causes, and the mitigation. Inconclusive spikes get ADRs too — this preserves the framing and prevents re-litigation.
</instructions>

<output_contract>
A spike directory at `spikes/NNNN-<short-slug>/` with the four-stage layout (discovery README, fixtures, pipeline, debug per stage, eval results). The directory is throwaway by design — promote-or-delete lifecycle per ADR-0017 §4. No `Status: shipped` lifecycle; spikes conclude with an ADR.
</output_contract>

## Next

- After Step 1: proceed to Step 2, or abort if discovery surfaced a single happy path (route to `agentic-ground`).
- After Step 4: `/agentic-adr` to record the outcome, then delete the spike directory.
- If the spike succeeds and production work follows: `/agentic-task` for work units (Spec ref the original spec; cite the ADR in task Notes).
@@ -0,0 +1,5 @@
interface:
  display_name: agentic-spike
  short_description: Staged spike with golden fixtures per WORKFLOW §14. Discovery + fixture + pipeline-with-gates + two-layer evaluation. For unknown-technique uncertainty, not non-trivial work in general.
policy:
  allow_implicit_invocation: false
@@ -0,0 +1,109 @@
---
name: agentic-tdg
description: Outcome-based prompting per WORKFLOW.md §9. Give the agent the finish line first, not the path. Five steps — confirm regime, ground truth pair, Test Dependency Map, three approaches, pick by one criterion, implement and verify. Triggers on "outcome-based", "TDG", "ground truth", "expected output", "three approaches", "pick by criterion", "test dependency map", "TDM", "before modifying", "tests covering this file", "give the finish line". Routes to `agentic-spike` if the technique itself is uncertain. No file written; output is the verified implementation that lands through normal commits.
---

<background_information>
Implements WORKFLOW.md §9 (Outcome-Based Prompting / Test Dependency Map) end-to-end. The skill is for the implementation phase when the canonical technique is known and multiple implementation strategies are plausible. WORKFLOW §14 (`agentic-spike`) covers the technique-unknown regime; §9 (this skill) covers the implementation-strategy-uncertain regime.

No file is written. The output of the skill is the verified implementation that lands in the repo through normal commits. The ground-truth pair, candidate set, selection criterion, and TDM list go into the commit message body — or the task's `Notes` log when one exists — not into a separate `doc/` artifact.

Codex auto-trigger on description keywords is less mature than Claude Code's. If auto-invocation does not fire when the user mentions outcome-based prompting, ground truth, three approaches, or a Test Dependency Map, invoke this skill manually.
</background_information>

<instructions>
Step 0 — confirm regime. TDG is for the *technique-known, implementation-strategy-uncertain* regime. Run the skill only when the canonical approach is settled (via `agentic-ground` or prior knowledge) AND multiple implementation strategies could produce the expected output with different trade-offs along readability / performance / testability axes.

Route elsewhere when:
- The *technique itself* is uncertain across multiple plausible approaches → `agentic-spike` (WORKFLOW §14). Spike validates technique with golden fixture + per-stage debug; TDG validates implementation with ground-truth pair + TDM.
- The path is fully obvious (one-line fix, mechanical refactor, byte-for-byte port) → TDG is overkill; proceed directly without the skill.
- No tests cover the surface and the project does not yet have a test runner wired → consider `agentic-hooks` for project gates first; TDG depends on a verifiable green baseline.

Step 1 — ground truth pair. State raw input + exact expected output before any code is written. Concrete, not aspirational. The pair is the contract the implementation must satisfy.

Format depends on the surface:
- For pure functions: input arguments + expected return value, as code-block or JSON.
- For data transformations: source data + transformed data, side-by-side.
- For CLI / API surfaces: request payload + response payload, as paste-ready examples.
- For UI changes: pre-state DOM / screenshot + post-state DOM / screenshot.

Example (pure function):
```
Input: parse('--agent claude-code --yes')
Expected output: { agent: 'claude-code', yes: true, dryRun: false, force: false }
```

The pair is small, explicit, and verifiable. Aspirational language ("loads fast", "handles all edge cases") is not ground truth — concrete examples are.

Step 2 — Test Dependency Map (TDM). List the tests covering the file(s) the change will touch. Run them to establish the green baseline. If no tests cover the surface, write one first that exercises the *current* behavior before any modification.

Process:
1. Grep for tests referencing the file by import path: `grep -r 'from.*<file>' test/ tests/`.
2. Grep for tests referencing the function or symbol: `grep -r '<symbol>' test/ tests/`.
3. Run the matched tests in isolation to confirm green: `node --test test/<matched>.test.js` or equivalent for the language.
4. If empty: write a test that asserts the current behavior. Run. Confirm green. *Then* proceed to Step 3.

The TDM is the verification surface — Step 5's "implement + verify" loop runs these tests, not the full suite, so the feedback loop stays under a few seconds. Full-suite runs happen at commit time via the project's pre-push hook (`agentic-hooks`).

Step 3 — three approaches. Generate three implementation candidates that produce the ground-truth pair. Each candidate names trade-offs along the §9 axes:

```
Approach A: <name / one-line description>
- readability: <high / medium / low + reason>
- performance: <O(...), bytes allocated, etc.>
- testability: <surface area, mockability, isolation>

Approach B: <name / one-line description>
- readability: ...
- performance: ...
- testability: ...

Approach C: <name / one-line description>
- readability: ...
- performance: ...
- testability: ...
```

No premature optimization across all three axes — each candidate names its trade-offs honestly. A candidate that is "high on every axis" is suspect; surface what it actually trades against.

Three is a soft target. If the surface is small and only two approaches are plausible, two is fine. If the third candidate is contrived to fill a slot, drop it; the criterion-based selection in Step 4 handles two as easily as three.

Step 4 — pick by one criterion. The user names the criterion explicitly — readability *or* performance *or* testability, **not all three at once**. The skill commits to the selected candidate. Alternatives get one-line rejection notes for the commit message body or task `Notes`:

```
Criterion: testability
Picked: Approach B (smaller mock surface, no global state read)
Rejected:
- Approach A — one fewer allocation but mocks four singletons (testability lower)
- Approach C — clearer at the call site but writes shared state (testability lower)
```

The discipline: refuse "optimize for readability AND performance AND testability". Trade-offs are real — naming the criterion makes them visible.

Step 5 — implement + verify. Modify the file. Run the TDM tests from Step 2. If green, the change is done — proceed to commit with the ground-truth pair + criterion + rejection notes in the commit message body. If red, iterate against the same ground-truth pair until green.

Verification rules:
- Do **not** edit the TDM tests to make them green. The tests are the verification surface; modifying them while implementing is the hidden form of "almost right" that WORKFLOW §12 names. If a test is wrong, fix the test first as a separate diff with its own ground-truth pair.
- Do **not** declare done before the TDM tests pass. Type-check passing is not the gate; the test gate is.
- If iteration loops past three attempts and the implementation still fails, **stop and re-examine the ground-truth pair**. The pair may be ambiguous, or the canonical approach may not actually map to the expected output. Routing back to `agentic-ground` (re-research) or `agentic-spike` (technique uncertain) is preferable to forcing a wrong implementation through.
</instructions>

<output_contract>
No file written. Output is structured conversation:

1. Confirmation of regime (Step 0 — TDG or routed elsewhere).
2. Ground-truth pair (Step 1).
3. TDM list with green-baseline confirmation (Step 2).
4. Three candidates with trade-off table (Step 3).
5. Selection by criterion + rejection notes (Step 4).
6. Implementation + verification result (Step 5).

When the change is committed, the ground-truth pair, criterion, and rejection notes go into the commit message body. When a task file exists, the same content also lands in the task's `Notes` log under a dated entry.
</output_contract>

## Next

- After Step 5 verification passes: commit the change with the ground-truth pair + criterion + rejection notes in the body.
- `/agentic-review main..HEAD` (or current scope) before merge — WORKFLOW §10. The TDM tests verify the implementation matches ground truth; §10 review checks coupling, edge cases, spec drift the pair did not cover.
- If the work spans multiple sessions: `/agentic-task` for explicit decomposition (Spec ref the original spec; cite this TDG run in the task `Notes`).
- If iteration stalled at Step 5 and routing back was needed: `/agentic-ground` (re-research) or `/agentic-spike` (technique uncertain).
@@ -0,0 +1,5 @@
interface:
  display_name: agentic-tdg
  short_description: Outcome-based prompting per WORKFLOW §9. Ground truth pair + Test Dependency Map + three approaches + single-criterion selection. For implementation-strategy uncertainty when the technique is known; routes to agentic-spike if the technique itself is uncertain.
policy:
  allow_implicit_invocation: false