@alexandrealvaro/agentic 0.11.2-beta.1 → 0.12.0-beta.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -40,6 +40,7 @@ Two categories ([ADR-0007](doc/adr/0007-workflow-operational-skills.md)) and two
40
40
  | `agentic-review` | workflow-operational | universal | Fresh-context code review per WORKFLOW §10; structured findings, no "approve" | `/agentic-review <range>` |
41
41
  | `agentic-ground` | workflow-operational | universal | Four-source pre-implementation research (docs / OSS / in-repo / git history) + happy-path synthesis + deviation gate per WORKFLOW §4 + §5 | `/agentic-ground` |
42
42
  | `agentic-next` | workflow-operational | universal | State-aware navigation aid (`flutter doctor` pattern) — surveys the four-layer artifact stack and recommends prioritized next actions; complements `agentic-audit` (drift) | `/agentic-next` |
43
+ | `agentic-spike` | workflow-operational | universal | Staged spike with golden fixtures per WORKFLOW §14, for cases where the *technique* is uncertain across multiple plausible approaches; produces `spikes/NNNN-<slug>/` with discovery + fixture + pipeline-with-gates + two-layer evaluation | `/agentic-spike` |
43
44
  | `agentic-design` | spec-driven | auto if frontend detected | Bootstrap `DESIGN.md` from existing tokens (Figma, tailwind.config, tokens.json, CSS custom props) | `/agentic-design` |
44
45
  | `agentic-subagent` | spec-driven | auto if installing for Claude Code | Drafts `.claude/agents/<name>.md` (Claude Code only — Codex has no subagent primitive) | `/agentic-subagent` |
45
46
  | `agentic-skill` | spec-driven | opt-in only | Drafts a new Claude Code or Codex skill at the appropriate path | `/agentic-skill` |
@@ -155,6 +156,8 @@ The kit's discipline scales with the project's maturity. A solo PoC may legitima
155
156
 
156
157
  **Lost mid-flow?** Invoke `/agentic-next` at any time to survey the project's state across the four-layer artifact stack (Constitution → Spec → Plan/Decisions → Code) and get prioritized next-action recommendations. Read-only; complements `/agentic-audit` (drift detection — different question).
157
158
 
159
+ **Technique uncertain across multiple plausible approaches?** Invoke `/agentic-spike` (per WORKFLOW §14) when the spec is clear but the *how* is unknown — library choice, multi-stage transformation, novel domain. The skill scaffolds a staged spike with golden fixtures + per-stage debug artifacts + two-layer evaluation under `spikes/NNNN-<slug>/`. The directory is throwaway by design; conclude with `/agentic-adr` and delete.
160
+
158
161
  ## Manual prompts
159
162
 
160
163
  If you prefer to skip the installer, the same artifacts can be generated by pasting prompts directly into your agent. Each prompt file has the literal text to copy, plus the matching template structure:
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@alexandrealvaro/agentic",
3
- "version": "0.11.2-beta.1",
3
+ "version": "0.12.0-beta.1",
4
4
  "description": "Bootstrap and audit AGENTS.md, ARCHITECTURE.md, ADRs, skills, and subagents for engineering production code with LLMs",
5
5
  "type": "module",
6
6
  "bin": {
@@ -94,6 +94,28 @@ export const CONDITIONAL_SKILLS = [
94
94
  hintWhenAuto: 'opt-in',
95
95
  hintWhenManual: 'WORKFLOW §11 hooks scaffolder (pre-commit, pre-push)',
96
96
  },
97
+ // The next two skills are universal in `team` / `mature` profiles
98
+ // (declared in PROFILES['team' / 'mature'].universal in src/lib/profiles.js)
99
+ // and conditional/opt-in in `solo`. They must appear in this catalog so
100
+ // `availableConditionalsForProfile('solo')` lookups in `pickConditionalAuto`
101
+ // succeed — without these entries, `if (!def) continue` silently skipped
102
+ // them and a `solo` user could not opt-in to either (review B1, v0.11.3).
103
+ // The autoIf rule here is the universal-default; per-profile overrides
104
+ // come from `availableConditionalsForProfile`'s rule field.
105
+ {
106
+ name: 'agentic-architecture',
107
+ autoIf: () => true,
108
+ agents: ['claude-code', 'codex'],
109
+ hintWhenAuto: 'system patterns + boundaries',
110
+ hintWhenManual: 'opt-in (recommended once load-bearing patterns emerge)',
111
+ },
112
+ {
113
+ name: 'agentic-adr',
114
+ autoIf: () => true,
115
+ agents: ['claude-code', 'codex'],
116
+ hintWhenAuto: 'binding architectural decisions (Nygard pattern)',
117
+ hintWhenManual: 'opt-in (recommended for binding decisions worth recording)',
118
+ },
97
119
  ];
98
120
 
99
121
  const CONDITIONAL_BY_NAME = Object.fromEntries(
@@ -295,16 +317,23 @@ export async function initCommand(opts) {
295
317
  confirmReplace,
296
318
  previousStates,
297
319
  kitVersion: pkg.version,
320
+ profile: profileName,
298
321
  });
299
322
  allActions.push(...actions);
300
- const next = nextStates[agent];
301
- next.profile = profileName;
302
- saveState(cwd, agent, next);
323
+ // installSkills now stamps `profile` into nextStates per review C3.
324
+ // No post-hoc injection.
325
+ saveState(cwd, agent, nextStates[agent]);
303
326
  }
304
327
 
328
+ // Dedup: agentic-architecture and agentic-adr are universal at team /
329
+ // mature (in REQUIRED_SKILLS) AND conditional at solo (in
330
+ // CONDITIONAL_SKILLS) per review B1 (v0.11.3). Without the Set, the
331
+ // managed-skills section would list those rows twice.
305
332
  const skillDisplayOrder = [
306
- ...REQUIRED_SKILLS,
307
- ...CONDITIONAL_SKILLS.map((s) => s.name),
333
+ ...new Set([
334
+ ...REQUIRED_SKILLS,
335
+ ...CONDITIONAL_SKILLS.map((s) => s.name),
336
+ ]),
308
337
  ].filter((s) => installedSkillSet.has(s));
309
338
 
310
339
  const confirmAppend = interactive
@@ -341,6 +370,7 @@ export async function initCommand(opts) {
341
370
  '/agentic-review (WORKFLOW §10)',
342
371
  '/agentic-ground (WORKFLOW §4 + §5)',
343
372
  '/agentic-next (state survey + recommendations)',
373
+ '/agentic-spike (WORKFLOW §14 — staged spike with golden fixtures)',
344
374
  ...(optedSkills.includes('agentic-design') ? ['/agentic-design (DESIGN.md)'] : []),
345
375
  ...(optedSkills.includes('agentic-subagent') && agents.includes('claude-code')
346
376
  ? ['/agentic-subagent']
@@ -17,6 +17,10 @@ const AGENT_LABEL = {
17
17
  };
18
18
 
19
19
  function readProjectProfile(cwd) {
20
+ // Returns { agent: profileName }. Use `loadProjectStates(cwd)` instead
21
+ // when you need both the profile and the full state objects in a single
22
+ // pass (avoids the TOCTOU window where state files could be deleted
23
+ // between two reads — review C2, v0.11.3).
20
24
  const perAgent = {};
21
25
  for (const agent of VALID_AGENTS) {
22
26
  const state = loadState(cwd, agent);
@@ -25,6 +29,23 @@ function readProjectProfile(cwd) {
25
29
  return perAgent;
26
30
  }
27
31
 
32
+ function loadProjectStates(cwd) {
33
+ // Single-pass load of every per-agent state file. Returns
34
+ // { statesByAgent, profilesByAgent }. Both objects share the same agent
35
+ // keys so callers can iterate one and look up the other without a
36
+ // second filesystem read.
37
+ const statesByAgent = {};
38
+ const profilesByAgent = {};
39
+ for (const agent of VALID_AGENTS) {
40
+ const state = loadState(cwd, agent);
41
+ if (state) {
42
+ statesByAgent[agent] = state;
43
+ profilesByAgent[agent] = state.profile ?? DEFAULT_PROFILE;
44
+ }
45
+ }
46
+ return { statesByAgent, profilesByAgent };
47
+ }
48
+
28
49
  function showProfile(cwd) {
29
50
  const perAgent = readProjectProfile(cwd);
30
51
  if (Object.keys(perAgent).length === 0) {
@@ -64,8 +85,12 @@ function formatRule(rule) {
64
85
  async function setProfile(cwd, name, opts) {
65
86
  validateProfile(name);
66
87
 
67
- const perAgent = readProjectProfile(cwd);
68
- if (Object.keys(perAgent).length === 0) {
88
+ // Single load — reuse below to write. Avoids the TOCTOU window where
89
+ // state files could be deleted between read and re-read (review C2,
90
+ // v0.11.3). Previous implementation called readProjectProfile then
91
+ // loadState again per agent in the write loop.
92
+ const { statesByAgent, profilesByAgent } = loadProjectStates(cwd);
93
+ if (Object.keys(statesByAgent).length === 0) {
69
94
  throw new Error(
70
95
  'no agentic install detected. Run `agentic init --profile <name>` first.'
71
96
  );
@@ -73,7 +98,7 @@ async function setProfile(cwd, name, opts) {
73
98
 
74
99
  const interactive = process.stdout.isTTY && !opts.yes;
75
100
 
76
- const currentProfiles = [...new Set(Object.values(perAgent))];
101
+ const currentProfiles = [...new Set(Object.values(profilesByAgent))];
77
102
  if (currentProfiles.length === 1 && currentProfiles[0] === name) {
78
103
  process.stdout.write(`Profile already \`${name}\` for all installed agents. No change.\n`);
79
104
  return;
@@ -82,7 +107,7 @@ async function setProfile(cwd, name, opts) {
82
107
  if (interactive) {
83
108
  p.intro(`agentic profile set ${name}`);
84
109
  p.note(
85
- Object.entries(perAgent)
110
+ Object.entries(profilesByAgent)
86
111
  .map(([agent, profile]) => `${AGENT_LABEL[agent]}: ${profile} → ${name}`)
87
112
  .join('\n'),
88
113
  'Profile change'
@@ -101,9 +126,9 @@ async function setProfile(cwd, name, opts) {
101
126
  }
102
127
 
103
128
  // Write the new profile to each installed agent's state file before
104
- // running update, so update reads the new profile.
105
- for (const agent of Object.keys(perAgent)) {
106
- const state = loadState(cwd, agent);
129
+ // running update, so update reads the new profile. Reuses the in-memory
130
+ // states loaded above — no second filesystem read.
131
+ for (const [agent, state] of Object.entries(statesByAgent)) {
107
132
  state.profile = name;
108
133
  saveState(cwd, agent, state);
109
134
  }
@@ -128,18 +128,27 @@ function previouslyOptedConditional(previousStates, currentAgents, profileName)
128
128
  return [...opted];
129
129
  }
130
130
 
131
- function profileFromStates(previousStates, currentAgents) {
132
- // If multiple agents disagree on profile, surface and bail. Profile is
133
- // expected to match across agents in the same project.
131
+ function profileFromStates(statesByAgent, currentAgents) {
132
+ // Profile must match across every installed agent in the project — not
133
+ // only across the agents the current invocation targets. Without this,
134
+ // `--agent claude-code` on a project where codex was installed with a
135
+ // different profile masks the disagreement and produces inconsistent
136
+ // installs. Per review B2 (v0.11.3): always inspect the FULL set of
137
+ // loaded states, not the narrowed slice.
134
138
  const seen = new Set();
135
- for (const agent of currentAgents) {
136
- const prev = previousStates[agent];
137
- if (prev?.profile) seen.add(prev.profile);
139
+ for (const [agent, state] of Object.entries(statesByAgent)) {
140
+ if (state?.profile) seen.add(state.profile);
141
+ }
142
+ if (seen.size === 0) {
143
+ // No state on disk for any agent. Fall back to the default; current
144
+ // invocation is a fresh / legacy install handled by the legacy path.
145
+ return DEFAULT_PROFILE;
138
146
  }
139
- if (seen.size === 0) return DEFAULT_PROFILE;
140
147
  if (seen.size > 1) {
141
148
  throw new Error(
142
- `state files disagree on profile (${[...seen].join(', ')}). Run \`agentic profile set <name>\` to reconcile.`
149
+ `state files disagree on profile (${[...seen].join(
150
+ ', '
151
+ )}). Run \`agentic profile set <name>\` to reconcile across all installed agents before re-running update.`
143
152
  );
144
153
  }
145
154
  return [...seen][0];
@@ -172,7 +181,10 @@ export async function updateCommand(opts) {
172
181
  previousStates[agent] = statesByAgent[agent] ?? null;
173
182
  }
174
183
 
175
- const profileName = profileFromStates(previousStates, agents);
184
+ // Pass the FULL loaded set, not the narrowed slice. profileFromStates
185
+ // surfaces cross-agent disagreement even when the current invocation
186
+ // targets only one agent (review B2, v0.11.3).
187
+ const profileName = profileFromStates(statesByAgent, agents);
176
188
  const previousOpted = previouslyOptedConditional(
177
189
  previousStates,
178
190
  agents,
@@ -269,13 +281,14 @@ export async function updateCommand(opts) {
269
281
  confirmReplace,
270
282
  previousStates: { [agent]: previousStates[agent] ?? null },
271
283
  kitVersion: pkg.version,
284
+ profile: profileName,
272
285
  dryRun,
273
286
  force,
274
287
  });
275
288
  allActions.push(...result.actions);
276
- const next = result.nextStates[agent];
277
- next.profile = profileName;
278
- nextStates[agent] = next;
289
+ // installSkills now stamps `profile` into nextStates per review C3.
290
+ // No post-hoc injection.
291
+ nextStates[agent] = result.nextStates[agent];
279
292
  }
280
293
 
281
294
  if (!dryRun) {
@@ -284,9 +297,15 @@ export async function updateCommand(opts) {
284
297
  }
285
298
  }
286
299
 
300
+ // Dedup: agentic-architecture and agentic-adr are universal at team /
301
+ // mature (in REQUIRED_SKILLS) AND conditional at solo (in
302
+ // CONDITIONAL_SKILLS) per review B1 (v0.11.3). Without the Set, the
303
+ // managed-skills section would list those rows twice.
287
304
  const skillDisplayOrder = [
288
- ...REQUIRED_SKILLS,
289
- ...CONDITIONAL_SKILLS.map((s) => s.name),
305
+ ...new Set([
306
+ ...REQUIRED_SKILLS,
307
+ ...CONDITIONAL_SKILLS.map((s) => s.name),
308
+ ]),
290
309
  ].filter((s) => installedSkillSet.has(s));
291
310
 
292
311
  const confirmAppend = interactive
package/src/index.js CHANGED
@@ -36,12 +36,25 @@ export async function run(argv) {
36
36
  .option('--force', 'overwrite user-edited files on conflict (non-interactive default: no)')
37
37
  .action(updateCommand);
38
38
 
39
+ // Profile command accepts two positionals so `agentic profile set <name>`
40
+ // captures the name natively. Per review C1 (v0.11.3): the prior single-
41
+ // positional form had Commander swallow the second arg, leaving the
42
+ // documented `Usage: agentic profile set <name>` error message misleading.
43
+ // All forms work now:
44
+ // agentic profile → show
45
+ // agentic profile show → show
46
+ // agentic profile list → list
47
+ // agentic profile set <name> → set
48
+ // agentic profile <name> → shorthand for `set <name>`
49
+ // agentic profile set --name <name> → flag form (back-compat)
39
50
  program
40
- .command('profile [subcommand]')
51
+ .command('profile [subcommand] [name]')
41
52
  .description('Show, list, or set the project maturity profile (poc | solo | team | mature)')
42
- .option('-n, --name <name>', 'profile name when used with `set` subcommand')
53
+ .option('-n, --name <name>', 'profile name (alternative to positional, for `set` subcommand)')
43
54
  .option('-y, --yes', 'skip confirmation prompts (non-interactive)')
44
- .action(profileCommand);
55
+ .action((subcommand, name, opts) =>
56
+ profileCommand(subcommand, { ...opts, name: opts.name ?? name })
57
+ );
45
58
 
46
59
  await program.parseAsync(argv);
47
60
  }
@@ -12,6 +12,7 @@ import {
12
12
  import { fileURLToPath } from 'node:url';
13
13
  import { basename, dirname, join, relative, sep as PATH_SEP } from 'node:path';
14
14
  import { SCHEMA_VERSION } from './state.js';
15
+ import { DEFAULT_PROFILE, validateProfile } from './profiles.js';
15
16
 
16
17
  const __dirname = dirname(fileURLToPath(import.meta.url));
17
18
  const KIT_ROOT = join(__dirname, '..', '..');
@@ -190,9 +191,11 @@ export async function installSkills({
190
191
  confirmReplace = async () => false,
191
192
  previousStates = {},
192
193
  kitVersion = null,
194
+ profile = null,
193
195
  dryRun = false,
194
196
  force = false,
195
197
  }) {
198
+ if (profile !== null) validateProfile(profile);
196
199
  const actions = [];
197
200
  const nextStates = {};
198
201
 
@@ -302,10 +305,15 @@ export async function installSkills({
302
305
  };
303
306
  }
304
307
 
308
+ // Profile resolution order: explicit `profile` arg > prior state's
309
+ // profile > DEFAULT_PROFILE. installSkills is the single owner of the
310
+ // returned nextStates' shape; callers no longer inject `profile`
311
+ // post-hoc per review C3 (v0.11.3).
305
312
  nextStates[agent] = {
306
313
  schemaVersion: SCHEMA_VERSION,
307
314
  kitVersion: kitVersion ?? prev?.kitVersion ?? null,
308
315
  agent,
316
+ profile: profile ?? prev?.profile ?? DEFAULT_PROFILE,
309
317
  skills: nextSkills,
310
318
  };
311
319
  }
@@ -20,7 +20,7 @@ export const PROFILE_NAMES = ['poc', 'solo', 'team', 'mature'];
20
20
 
21
21
  export const PROFILES = {
22
22
  poc: {
23
- universal: ['agentic-philosophy', 'agentic-ground', 'agentic-audit', 'agentic-next'],
23
+ universal: ['agentic-philosophy', 'agentic-ground', 'agentic-audit', 'agentic-next', 'agentic-spike'],
24
24
  conditional: {
25
25
  'agentic-design': 'blocked',
26
26
  'agentic-subagent': 'blocked',
@@ -35,6 +35,7 @@ export const PROFILES = {
35
35
  'agentic-ground',
36
36
  'agentic-audit',
37
37
  'agentic-next',
38
+ 'agentic-spike',
38
39
  'agentic-bootstrap',
39
40
  'agentic-spec',
40
41
  'agentic-task',
@@ -62,6 +63,7 @@ export const PROFILES = {
62
63
  'agentic-review',
63
64
  'agentic-ground',
64
65
  'agentic-next',
66
+ 'agentic-spike',
65
67
  ],
66
68
  conditional: {
67
69
  'agentic-design': 'frontend',
@@ -83,6 +85,7 @@ export const PROFILES = {
83
85
  'agentic-review',
84
86
  'agentic-ground',
85
87
  'agentic-next',
88
+ 'agentic-spike',
86
89
  ],
87
90
  conditional: {
88
91
  'agentic-design': 'frontend',
@@ -28,6 +28,8 @@ export const SKILL_DESCRIPTIONS = {
28
28
  'Four-source pre-implementation research (docs / OSS / in-repo / git history) + happy-path synthesis + deviation gate. WORKFLOW §4 + §5.',
29
29
  'agentic-next':
30
30
  'State survey + prioritized next-action recommendations across the four-layer artifact stack. Read-only navigation aid (`flutter doctor` pattern).',
31
+ 'agentic-spike':
32
+ 'Staged spike with golden fixtures per WORKFLOW §14. Discovery + fixture + pipeline-with-gates + two-layer evaluation, when the *technique* is uncertain across multiple plausible approaches.',
31
33
  'agentic-design': 'Bootstrap `DESIGN.md` from existing tokens (frontend projects).',
32
34
  'agentic-subagent': 'Draft a new Claude Code subagent at `.claude/agents/<name>.md`.',
33
35
  'agentic-skill': 'Draft a new Claude Code or Codex skill at the appropriate path.',
@@ -0,0 +1,220 @@
1
+ ---
2
+ name: agentic-spike
3
+ description: Scaffold a staged spike with golden fixtures per WORKFLOW.md §14, for cases where the spec is clear but the technique is uncertain across multiple plausible approaches. Four stages — discovery, golden fixture, pipeline with gates, two-layer evaluation. Use when the unknown is *how*, not *what*. Triggers on "spike", "uncertain technique", "which library", "CV pipeline", "evaluate approaches", "ground truth", "golden fixture", "staged pipeline", "debug per stage". Routes to `agentic-ground` if the *how* is routine and a single happy path is obvious. Read-and-write — creates `spikes/NNNN-<slug>/` with fixtures, per-stage debug artifacts, and eval results.
4
+ allowed-tools: Read, Write, Glob, Grep, Bash, WebFetch, WebSearch
5
+ ---
6
+
7
+ # /agentic-spike
8
+
9
+ Implements WORKFLOW.md §14 (Staged Spikes With Golden Fixtures) end-to-end. The skill is for cases where the spec is clear but the *technique* is uncertain across multiple plausible approaches — library choice, CV approach, multi-stage transformation. WORKFLOW §9 (TDG) assumes the path is known and validates end-to-end; §14 assumes the path is unknown and validates per stage. Different uncertainty regimes; this skill is for the unknown one.
10
+
11
+ The skill creates a working directory under `spikes/NNNN-<slug>/` and fills it stage-by-stage. The directory is throwaway by design — when the spike concludes, an ADR records the decision (`/agentic-adr`) and the spike directory is deleted. See ADR-0017 for the promote-or-delete lifecycle rationale.
12
+
13
+ ## Step 0 — Confirm uncertainty
14
+
15
+ The skill is for *unknown technique* across multiple plausible approaches, not for *non-trivial work* in general. If a single happy path is obvious, **do not start a spike**. If the *how* is knowable from official docs / OSS examples / in-repo patterns / git history, route to `agentic-ground` and stop.
16
+
17
+ Concrete tests to run before starting:
18
+
19
+ * Could `agentic-ground`'s four-source research surface a single happy path with a defensible deviation gate? If yes, run that instead.
20
+ * Are there ≥2 candidate techniques with materially different trade-offs that no source resolves? If no, this is not a spike.
21
+ * Is end-to-end validation against expected outputs feasible without per-stage debug? If yes, this is `agentic-task` + `agentic-philosophy` Goal-Driven Execution territory, not a spike.
22
+
23
+ If the spike is warranted, confirm with the user the *scope* (the specific surface where uncertainty sits — not the whole feature) and proceed.
24
+
25
+ ## Step 1 — Discovery
26
+
27
+ List canonical approaches grounded in **official docs and real examples**. Pick one (or a small set, ≤3) by an **explicit criterion**.
28
+
29
+ Candidate-listing process:
30
+
31
+ 1. Search official documentation for the language / library / domain in question. Cite URL + version.
32
+ 2. Search public OSS for repos that solve the same technical scope. Cite `<repo>:<path>:<line-range>` and fetch via tools — never paraphrase from training memory.
33
+ 3. Survey in-repo for analogous patterns the codebase already uses. Cite `<file>:<line>` or "no analog found".
34
+ 4. Survey git history for prior attempts at the same problem. Cite `<commit-sha>` or "no prior attempt".
35
+
36
+ Output format:
37
+
38
+ ```markdown
39
+ ## Discovery — <scope>
40
+
41
+ ### Candidate techniques
42
+ 1. **<name>** — <one-line description>. Source: <URL or repo:path>. Trade-offs: <pros / cons>.
43
+ 2. **<name>** — ...
44
+ 3. **<name>** — ...
45
+
46
+ ### Selection criterion
47
+ <one-line criterion: latency / accuracy / readability / dependencies / etc>
48
+
49
+ ### Picked
50
+ <technique X>, picked by criterion <Y>. Alternatives held in reserve: <list>.
51
+ ```
52
+
53
+ The output of this step is **information, not code**. No spike directory is created yet. The user reviews the candidate list and confirms the picked approach (or revises) before Step 2.
54
+
55
+ ## Step 2 — Golden fixture
56
+
57
+ Curate inputs with rich expected outputs. The fixture is the ground truth the staged pipeline validates against; richer fixtures catch more failure modes.
58
+
59
+ Create the spike directory:
60
+
61
+ ```bash
62
+ mkdir -p spikes/NNNN-<slug>/{fixtures,debug,eval}
63
+ ```
64
+
65
+ Where `NNNN` is the next available 4-digit number (mirrors ADR / task / spec numbering). List `spikes/` and pick the next slot.
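As a sketch, the next-slot computation might look like this in Python (the helper name and zero-padded return are illustrative, not part of the kit):

```python
import re
from pathlib import Path

def next_spike_slot(spikes_dir="spikes"):
    """Return the next available 4-digit NNNN under spikes/.

    Illustrative helper: scans spikes/ for `NNNN-<slug>` entries and
    zero-pads max + 1, falling back to 0001 when spikes/ is empty or absent.
    """
    nums = [
        int(m.group(1))
        for p in Path(spikes_dir).glob("[0-9]*-*")
        if (m := re.match(r"(\d{4})-", p.name))
    ]
    return f"{max(nums, default=0) + 1:04d}"
```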
66
+
67
+ The fixture format is JSON keyed by input path (recommended) or whatever shape the domain demands. For computer vision: bounding boxes, sizes, lighting condition, difficulty tag, edge case markers. For multi-stage transformations: intermediate states. For library choice: representative inputs covering typical and edge cases.
68
+
69
+ Example fixture file (`spikes/0001-detect-circles/fixtures/golden.json`):
70
+
71
+ ```json
72
+ {
73
+ "inputs/easy-01.jpg": {
74
+ "expected": [
75
+ { "bbox": [120, 80, 240, 200], "label": "circle", "size": "large", "lighting": "even" }
76
+ ],
77
+ "difficulty": "easy",
78
+ "edge_cases": []
79
+ },
80
+ "inputs/hard-01.jpg": {
81
+ "expected": [
82
+ { "bbox": [50, 60, 90, 100], "label": "circle", "size": "small", "lighting": "low" },
83
+ { "bbox": [200, 80, 260, 140], "label": "circle", "size": "medium", "lighting": "even", "occluded": true }
84
+ ],
85
+ "difficulty": "hard",
86
+ "edge_cases": ["low-light", "partial-occlusion", "multiple-objects"]
87
+ }
88
+ }
89
+ ```
90
+
91
+ Curation principles:
92
+
93
+ * Include **edge cases** (low light, partial occlusion, malformed inputs, large inputs, empty inputs) — not just "happy path" examples. The fixture's job is to surface *where* a technique fails, not just whether it succeeds on easy cases.
94
+ * Include **difficulty tags** so per-stage evaluation can report performance segmented by difficulty.
95
+ * Keep the fixture as data, not code. JSON / YAML / CSV — anything that diffs cleanly and survives a refactor.
96
+
97
+ The fixture is the contract the pipeline validates against. Treat it like spec text — it should not change once the spike runs unless ground truth itself changes.
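A minimal loader sketch, assuming the JSON shape of the example above (the helper and its key checks are illustrative, not prescribed by the skill):

```python
import json
from pathlib import Path

def load_fixture(path):
    """Load golden.json and do a light schema sanity check.

    Illustrative helper: the skill does not prescribe a loader. This just
    enforces the shape the example fixture implies (a dict keyed by input
    path, each entry carrying 'expected', 'difficulty', 'edge_cases').
    """
    fixture = json.loads(Path(path).read_text())
    for input_path, entry in fixture.items():
        missing = {"expected", "difficulty", "edge_cases"} - entry.keys()
        if missing:
            raise ValueError(f"{input_path}: missing keys {sorted(missing)}")
    return fixture
```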
98
+
99
+ ## Step 3 — Pipeline with gates
100
+
101
+ One technique per stage. Each stage emits a **debug artifact** that makes its output inspectable.
102
+
103
+ Pipeline structure:
104
+
105
+ ```
106
+ spikes/NNNN-<slug>/
107
+ ├── README.md # spike framing (Step 1 output)
108
+ ├── fixtures/ # golden inputs + expected outputs
109
+ │ └── golden.json
110
+ ├── pipeline/ # one file per stage
111
+ │ ├── 01-preprocess.<ext>
112
+ │ ├── 02-detect.<ext>
113
+ │ └── 03-postprocess.<ext>
114
+ ├── debug/ # per-stage debug artifacts
115
+ │ ├── 01-preprocess/
116
+ │ ├── 02-detect/
117
+ │ └── 03-postprocess/
118
+ └── eval/ # evaluation results (Step 4)
119
+ ```
120
+
121
+ Each stage's debug artifact format depends on the domain:
122
+
123
+ * CV pipelines: image saved to `debug/NN-<stage>/<input-name>.png` showing the stage's output.
124
+ * Multi-stage transformations: intermediate JSON saved to `debug/NN-<stage>/<input-name>.json`.
125
+ * Library evaluation: log row per (input, library) saved to `debug/NN-<stage>/log.csv`.
126
+
127
+ The discipline: **each stage's output must be inspectable independently**. End-to-end output alone tells you *that* it failed; per-stage debug tells you *where*.
128
+
129
+ Implementation pattern (any language):
130
+
131
+ * Stage takes (input, context) and returns (output, debug-record). Debug-record is written to `debug/NN-<stage>/`.
132
+ * Pipeline runs stages sequentially. Failure at any stage halts the pipeline and reports the stage where divergence happened.
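That stage contract can be sketched as follows (names illustrative; any language works):

```python
import json
from pathlib import Path

def run_pipeline(stages, item, debug_root):
    """Run (name, fn) stages sequentially; each fn returns (output, debug_record).

    Illustrative sketch of the stage contract above: every stage's debug
    record is written to <debug_root>/<stage>/ so its output is inspectable
    on its own, and a stage raising halts the pipeline, reporting the stage
    where divergence happened.
    """
    value = item
    for name, fn in stages:
        stage_dir = Path(debug_root) / name
        stage_dir.mkdir(parents=True, exist_ok=True)
        try:
            value, debug_record = fn(value)
        except Exception as exc:
            (stage_dir / "error.json").write_text(json.dumps({"error": str(exc)}))
            raise RuntimeError(f"pipeline diverged at stage {name}") from exc
        (stage_dir / "record.json").write_text(json.dumps(debug_record))
    return value
```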
133
+
134
+ ## Step 4 — Two-layer evaluation
135
+
136
+ Run the pipeline against the fixture and emit two layers of results:
137
+
138
+ * **End-to-end:** how many fixture inputs produced expected outputs? Reported as pass / fail per input, plus aggregate pass rate.
139
+ * **Per-stage:** for each fixture input, where did the pipeline diverge? Stage NN's output vs the expected intermediate. Reported as pass / fail per (input, stage).
140
+
141
+ Output to `spikes/NNNN-<slug>/eval/results.json`:
142
+
143
+ ```json
144
+ {
145
+ "fixture": "fixtures/golden.json",
146
+ "pipeline_version": "<commit-sha or timestamp>",
147
+ "end_to_end": {
148
+ "total": 10,
149
+ "passed": 7,
150
+ "failed": 3
151
+ },
152
+ "per_stage": {
153
+ "01-preprocess": { "passed": 10, "failed": 0 },
154
+ "02-detect": { "passed": 8, "failed": 2 },
155
+ "03-postprocess": { "passed": 7, "failed": 1 }
156
+ },
157
+ "failures": [
158
+ {
159
+ "input": "inputs/hard-02.jpg",
160
+ "diverged_at": "02-detect",
161
+ "expected": [...],
162
+ "actual": [...],
163
+ "debug_artifact": "debug/02-detect/hard-02.png"
164
+ }
165
+ ]
166
+ }
167
+ ```
168
+
169
+ The per-stage layer is what makes the spike actionable. End-to-end says *that* it failed; per-stage + debug artifact says *where* and *why*.
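One way the two layers might be aggregated from per-(input, stage) outcomes, as an illustrative Python sketch (only the `results.json` shape above is the fixed contract; the helper is an assumption):

```python
def two_layer_summary(runs):
    """Aggregate two-layer results from per-(input, stage) outcomes.

    `runs` maps input path -> ordered list of (stage, passed) pairs, e.g.
    produced by comparing each stage's output to the fixture's expected
    intermediates. Returns end-to-end counts, per-stage counts, and the
    first stage each failing input diverged at.
    """
    end_to_end = {"total": len(runs), "passed": 0, "failed": 0}
    per_stage = {}
    failures = []
    for input_path, stage_results in runs.items():
        diverged_at = None
        for stage, passed in stage_results:
            bucket = per_stage.setdefault(stage, {"passed": 0, "failed": 0})
            bucket["passed" if passed else "failed"] += 1
            if not passed and diverged_at is None:
                diverged_at = stage
        if diverged_at is None:
            end_to_end["passed"] += 1
        else:
            end_to_end["failed"] += 1
            failures.append({"input": input_path, "diverged_at": diverged_at})
    return {"end_to_end": end_to_end, "per_stage": per_stage, "failures": failures}
```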
170
+
171
+ ## Step 5 — Conclude (promote or delete)
172
+
173
+ When the spike concludes — either the picked technique works or it does not — record the outcome via `/agentic-adr` and delete the spike directory. The ADR is the persistent artifact; the spike code is throwaway.
174
+
175
+ ADR template for spike outcomes:
176
+
177
+ ```markdown
178
+ # ADR-NNNN: We will use technique X for <scope>
179
+
180
+ ## Context
181
+
182
+ <why the spike was needed — what was uncertain>
183
+
184
+ ## Decision
185
+
186
+ We will use technique X. The spike at `spikes/NNNN-<slug>/` (now deleted) showed:
187
+ - End-to-end pass rate: <%>
188
+ - Failures concentrated at stage <NN>, root cause <Y>
189
+ - Mitigation: <Z>
190
+
191
+ Alternatives held in reserve and rejected:
192
+ - Technique A: rejected because <reason from spike eval>
193
+ - Technique B: rejected because <reason from spike eval>
194
+
195
+ ## Consequences
196
+
197
+ <follow-on work this decision unblocks; rails to maintain>
198
+ ```
199
+
200
+ Then:
201
+
202
+ ```bash
203
+ rm -rf spikes/NNNN-<slug>/
204
+ git add doc/adr/NNNN-<slug>.md
205
+ git commit -m "feat: adopt technique X for <scope> per spike NNNN"
206
+ ```
207
+
208
+ Spikes that end inconclusively get an ADR too — `Decision: defer; the spike at NNNN was inconclusive because Y` — and the directory is deleted. Inconclusive spikes are real signal; preserving the framing in an ADR prevents re-litigation.
209
+
210
+ ## Output contract
211
+
212
+ A spike directory at `spikes/NNNN-<short-slug>/` with the four-stage layout above (discovery README, fixtures, pipeline, debug per stage, eval results). The directory is throwaway by design — promote-or-delete lifecycle per ADR-0017 §4. No `Status: shipped` lifecycle; spikes do not "ship" — they conclude with an ADR.
213
+
214
+ When the host exposes `AskUserQuestion` (per ADR-0014), use it for the Step 1 selection criterion confirmation and the Step 5 promote/delete decision.
215
+
216
+ ## Next
217
+
218
+ - After Step 1 (discovery output reviewed): proceed to Step 2 to create the spike directory + fixture, or abort if the discovery surfaced a single happy path (route to `agentic-ground`).
219
+ - After Step 4 (eval results): `/agentic-adr` to record the outcome, then delete the spike directory.
220
+ - If the spike succeeds and production work follows: `/agentic-task` for the work units to apply the spike's findings to production code (set `Spec ref` to the original spec if applicable; cite the ADR in the task `Notes`).
@@ -0,0 +1,89 @@
1
+ ---
2
+ name: agentic-spike
3
+ description: Scaffold a staged spike with golden fixtures per WORKFLOW.md §14, for cases where the spec is clear but the technique is uncertain across multiple plausible approaches. Four stages — discovery, golden fixture, pipeline with gates, two-layer evaluation. Use when the unknown is *how*, not *what*. Triggers on "spike", "uncertain technique", "which library", "CV pipeline", "evaluate approaches", "ground truth", "golden fixture", "staged pipeline", "debug per stage". Routes to `agentic-ground` if the *how* is routine and a single happy path is obvious.
4
+ ---
5
+
6
+ <background_information>
7
+ Implements WORKFLOW.md §14 (Staged Spikes With Golden Fixtures) end-to-end. The skill is for cases where the spec is clear but the technique is uncertain across multiple plausible approaches. WORKFLOW §9 (TDG) assumes the path is known and validates end-to-end; §14 assumes the path is unknown and validates per stage.
8
+
9
+ The skill creates `spikes/NNNN-<slug>/` and fills it stage-by-stage. The directory is throwaway by design — when the spike concludes, an ADR records the decision and the spike directory is deleted (promote-or-delete lifecycle per ADR-0017 §4).
10
+
11
+ Codex auto-trigger on description keywords is less mature than Claude Code's. If auto-invocation does not fire when the user mentions an uncertain technique or asks to evaluate approaches, invoke this skill manually.
12
+ </background_information>
13
+
+ <instructions>
+ Step 0 — confirm uncertainty. The skill is for an unknown technique across multiple plausible approaches, not for non-trivial work in general. If a single happy path is obvious, do NOT start a spike — route to `agentic-ground` and stop.
+
+ Tests:
+ - Could `agentic-ground`'s four-source research surface a single happy path with a defensible deviation gate? If yes, run that instead.
+ - Are there ≥2 candidate techniques with materially different trade-offs that no source resolves? If no, this is not a spike.
+ - Is end-to-end validation feasible without per-stage debug? If yes, this is `agentic-task` + `agentic-philosophy` Goal-Driven Execution territory.
+
+ If a spike is warranted, confirm the scope with the user and proceed.
+
+ Step 1 — discovery. List canonical approaches grounded in official docs and real examples. Pick one (or ≤3) by an explicit criterion.
+
+ Process:
+ - Search official documentation. Cite URL + version.
+ - Search OSS for repos solving the same problem. Cite `<repo>:<path>:<line-range>`; fetch via tools, never paraphrase from training memory.
+ - Survey in-repo for analogous patterns. Cite `<file>:<line>` or "no analog found".
+ - Survey git history for prior attempts. Cite `<commit-sha>` or "no prior attempt".
+
+ Output: candidate-list markdown with techniques, sources, trade-offs, the selection criterion, and the picked technique. NO code yet. User reviews before Step 2.
+
+ Step 2 — golden fixture. Curate inputs with rich expected outputs, as JSON keyed by input path (recommended). Include edge cases (low light, partial occlusion, malformed inputs, large inputs, empty inputs) and difficulty tags.
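A golden fixture for a detection-style spike might be seeded like this; the input paths, expected-output fields, and difficulty values are all hypothetical illustrations, not prescribed by the workflow:

```python
import json

# Hypothetical golden fixture: keyed by input path, with a rich expected
# output per entry and a difficulty tag for later failure triage.
GOLDEN = {
    "fixtures/inputs/well-lit-01.png": {
        "expected": {"boxes": [[12, 40, 96, 180]], "label": "card"},
        "difficulty": "easy",
    },
    "fixtures/inputs/occluded-02.png": {
        "expected": {"boxes": [[5, 8, 60, 90]], "label": "card"},
        "difficulty": "hard",  # partial-occlusion edge case
    },
    "fixtures/inputs/empty-03.png": {
        "expected": {"boxes": [], "label": None},
        "difficulty": "edge",  # empty-input edge case
    },
}

with open("golden.json", "w") as f:
    json.dump(GOLDEN, f, indent=2)
```

Keying by input path keeps each expected output next to its difficulty tag, so the Step 4 evaluation can slice failures by difficulty as well as by stage.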
+
+ Create the spike directory:
+ ```
+ mkdir -p spikes/NNNN-<slug>/{fixtures,debug,eval}
+ ```
+ NNNN = next 4-digit number after highest existing under `spikes/`.
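The numbering rule can be sketched as a small helper; the `next_spike_number` name and the `NNNN-<slug>` regex are illustrative assumptions:

```python
import re
from pathlib import Path

def next_spike_number(spikes_dir: str = "spikes") -> str:
    """Return the next 4-digit spike number after the highest existing one."""
    highest = 0
    root = Path(spikes_dir)
    if root.is_dir():
        for entry in root.iterdir():
            m = re.match(r"(\d{4})-", entry.name)  # matches NNNN-<slug>
            if m:
                highest = max(highest, int(m.group(1)))
    return f"{highest + 1:04d}"
```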
+
+ The fixture is the contract the pipeline validates against. Treat it like spec text — it should not change once the spike runs unless the ground truth changes.
+
+ Step 3 — pipeline with gates. One technique per stage. Each stage emits a debug artifact making its output inspectable.
+
+ Layout:
+ ```
+ spikes/NNNN-<slug>/
+ ├── README.md   # spike framing (Step 1 output)
+ ├── fixtures/   # golden inputs + expected outputs
+ ├── pipeline/   # one file per stage (01-preprocess, 02-detect, etc)
+ ├── debug/      # per-stage debug artifacts (image / JSON / log row)
+ └── eval/       # evaluation results (Step 4)
+ ```
+
+ Each stage takes (input, context) and returns (output, debug-record). The debug-record is written to `debug/NN-<stage>/`. The pipeline halts and reports the stage on first divergence.
+
+ Step 4 — two-layer evaluation:
+ - End-to-end: pass rate against fixture inputs.
+ - Per-stage: for each input, where did the pipeline diverge?
+
+ Output to `spikes/NNNN-<slug>/eval/results.json`:
+ ```
+ {
+   "fixture": "fixtures/golden.json",
+   "end_to_end": { "total": 10, "passed": 7, "failed": 3 },
+   "per_stage": { "01-preprocess": { "passed": 10, "failed": 0 }, ... },
+   "failures": [{ "input": "...", "diverged_at": "02-detect", "debug_artifact": "..." }]
+ }
+ ```
+
+ The per-stage layer is what makes the spike actionable.
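One way to fold per-input run records into the two layers is sketched below; the run-record shape is an assumption, while the output fields follow the `results.json` sketch above:

```python
from collections import defaultdict

def summarize(runs: list[dict]) -> dict:
    """Fold per-input run records into end-to-end and per-stage layers.

    Assumed record shape:
    {"input": str, "stages": {stage_name: bool}, "diverged_at": str | None}
    """
    end_to_end = {"total": len(runs), "passed": 0, "failed": 0}
    per_stage: dict = defaultdict(lambda: {"passed": 0, "failed": 0})
    failures = []
    for run in runs:
        for stage, ok in run["stages"].items():
            per_stage[stage]["passed" if ok else "failed"] += 1
        if run["diverged_at"] is None:
            end_to_end["passed"] += 1
        else:
            end_to_end["failed"] += 1
            failures.append({"input": run["input"],
                             "diverged_at": run["diverged_at"]})
    return {"end_to_end": end_to_end,
            "per_stage": dict(per_stage),
            "failures": failures}
```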
+
+ Step 5 — conclude (promote or delete). When the spike concludes:
+ - Record the outcome via `/agentic-adr` (the ADR is the persistent artifact).
+ - Delete the spike directory: `rm -rf spikes/NNNN-<slug>/`.
+
+ The ADR captures: which technique was picked, alternatives held in reserve, end-to-end pass rate, failures and root causes, mitigation. Inconclusive spikes get ADRs too — this preserves the framing and prevents re-litigation.
+ </instructions>
+
+ <output_contract>
+ A spike directory at `spikes/NNNN-<short-slug>/` with the four-stage layout (discovery README, fixtures, pipeline, debug per stage, eval results). The directory is throwaway by design — promote-or-delete lifecycle per ADR-0017 §4. No `Status: shipped` lifecycle; spikes conclude with an ADR.
+ </output_contract>
+
+ ## Next
+
+ - After Step 1: proceed to Step 2, or abort if discovery surfaced a single happy path (route to `agentic-ground`).
+ - After Step 4: `/agentic-adr` to record the outcome, then delete the spike directory.
+ - If the spike succeeds and production work follows: `/agentic-task` for the work units (reference the original spec in `Spec ref`; cite the ADR in the task `Notes`).
@@ -0,0 +1,5 @@
+ interface:
+   display_name: agentic-spike
+   short_description: Staged spike with golden fixtures per WORKFLOW §14. Discovery + fixture + pipeline-with-gates + two-layer evaluation. For unknown-technique uncertainty, not non-trivial work in general.
+ policy:
+   allow_implicit_invocation: false