npm - valent-pipeline - Versions diffs - 0.6.3 → 0.7.0 - Mend

valent-pipeline 0.6.3 → 0.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (55) hide show

package/bin/cli.js +15 -0
package/knowledge-seed/README.md +9 -0
package/knowledge-seed/curated/pitfalls-api-express-prisma.md +19 -0
package/knowledge-seed/curated/pitfalls-data-pipeline.md +19 -0
package/knowledge-seed/curated/pitfalls-document-generation.md +19 -0
package/knowledge-seed/curated/pitfalls-iac.md +19 -0
package/knowledge-seed/curated/pitfalls-library.md +19 -0
package/knowledge-seed/curated/pitfalls-mcp-server.md +19 -0
package/knowledge-seed/curated/pitfalls-mobile.md +19 -0
package/knowledge-seed/curated/pitfalls-ui-react-vite.md +19 -0
package/package.json +3 -2
package/pipeline/docs/multi-provider-runners.md +185 -0
package/pipeline/orchestrators/claude-code/README.md +3 -2
package/pipeline/orchestrators/claude-code/plan.workflow.js +207 -116
package/pipeline/orchestrators/claude-code/retro.workflow.js +103 -15
package/pipeline/orchestrators/claude-code/sprint.workflow.js +564 -54
package/pipeline/prompts/judge.md +3 -3
package/pipeline/prompts/pmcp.md +71 -0
package/pipeline/schemas/verdict.schema.json +36 -2
package/pipeline/steps/bend/handoff.md +3 -0
package/pipeline/steps/bend/write-tests.md +5 -0
package/pipeline/steps/critic/triage.md +7 -4
package/pipeline/steps/critic/write-verdict.md +2 -2
package/pipeline/steps/data/handoff.md +3 -0
package/pipeline/steps/docgen/handoff.md +3 -0
package/pipeline/steps/fend/handoff.md +3 -0
package/pipeline/steps/fend/write-tests.md +5 -0
package/pipeline/steps/iac/handoff.md +3 -0
package/pipeline/steps/judge/evidence-review.md +65 -11
package/pipeline/steps/judge/ship-decision.md +11 -1
package/pipeline/steps/libdev/handoff.md +3 -0
package/pipeline/steps/mcp-dev/handoff.md +3 -0
package/pipeline/steps/mobile/handoff.md +3 -0
package/pipeline/steps/orchestration/bug-intake.md +39 -0
package/pipeline/steps/orchestration/update-backlog-status.md +35 -4
package/pipeline/steps/orchestration/validate-story-inputs.md +8 -0
package/pipeline/steps/qa-a/write-spec.md +24 -0
package/pipeline/steps/qa-b/execute-tests.md +15 -8
package/pipeline/steps/qa-b/write-report.md +6 -1
package/pipeline/templates/critic-review.template.md +26 -18
package/pipeline/templates/judge-decision.template.md +14 -4
package/pipeline/templates/traceability-matrix.template.md +33 -3
package/skills/valent-configure/SKILL.md +3 -1
package/skills/valent-run-epic-workflow/SKILL.md +20 -10
package/skills/valent-run-project-workflow/SKILL.md +47 -18
package/skills/valent-run-story-workflow/SKILL.md +4 -4
package/src/commands/calibrate.js +11 -4
package/src/commands/check-spec-conformance.js +99 -0
package/src/commands/init.js +29 -0
package/src/commands/upgrade.js +1 -1
package/src/lib/config-schema.js +44 -4
package/src/lib/handoff.js +19 -4
package/src/lib/obsolete-manifest.js +3 -1
package/src/lib/spec-conformance.js +187 -0
package/src/lib/sprint.js +152 -48

package/bin/cli.js CHANGED Viewed

@@ -155,6 +155,8 @@ program
   .option('--velocity-history <path>', 'Explicit velocity history (YAML/JSON) to pair with --data')
   .option('--deviation-threshold <n>', 'Pairwise deviation flag threshold (default 0.5)')
   .option('--cv-threshold <n>', 'Velocity coefficient-of-variation instability threshold (default 0.3)')
+  .option('--sma-window <n>', 'Trailing window (sprints) for the velocity simple moving average (default 3)')
+  .option('--seed-velocity <n>', 'Initial/config velocity that seeds the SMA so early sprints ramp smoothly')
   .action(async (options) => {
     const { calibrateCmd } = await import('../src/commands/calibrate.js');
     await calibrateCmd(options);
@@ -188,6 +190,19 @@ program
     await rejectionCapCmd(options);
   });
+// check-spec-conformance command (deterministic pre-CRITIC gate for assertion-weakening / spec-drift)
+program
+  .command('check-spec-conformance')
+  .description('Deterministically diff the QA-A spec manifest against the implemented test suite (exits non-zero on a dropped assertion target, missing spec\'d case, or un-waived skip)')
+  .requiredOption('--story <id>', 'Story identifier')
+  .option('--root <dir>', 'Project root to inspect (defaults to the current directory)')
+  .option('--manifest <path>', 'Spec manifest path (defaults to stories/<id>/output/qa-test-spec.manifest.json)')
+  .option('--tests-glob <regex>', 'Filename regex for test files (default: \\.(test|spec)\\.(js|ts|jsx|tsx|mjs|cjs)$)')
+  .action(async (options) => {
+    const { checkSpecConformanceCmd } = await import('../src/commands/check-spec-conformance.js');
+    await checkSpecConformanceCmd(options);
+  });
 // db commands
 const dbCmd = program
   .command('db')

package/knowledge-seed/README.md ADDED Viewed

@@ -0,0 +1,9 @@
+# Knowledge Seed
+Version-controlled knowledge that ships with the package and is **seeded into each project at `init`**.
+- `curated/pitfalls-<profile>-<stack>.md` — short, high-cost-of-miss "pitfalls primers" per profile/stack. At `init`, `curated/` is copied into the project's `knowledge/curated/` (the configured `curated_files_path`). Dev agents read the primer for their stack during read-inputs (Step 3b) and re-scan their diff against it in the pre-handoff smudge check before writing their handoff.
+These are **primers, not exhaustive checklists** — a few examples to trigger recall, with an explicit "also handle obvious things like these that aren't listed" clause. They are the durable home for invariants the retro promotes out of project-local `correction-directives.yaml` when a pitfall recurs across projects.
+To add coverage: drop a new `pitfalls-<profile>-<stack>.md` here. It seeds automatically on the next `init`.

package/knowledge-seed/curated/pitfalls-api-express-prisma.md ADDED Viewed

@@ -0,0 +1,19 @@
+# Pitfalls Primer — api / Express + Prisma
+**Scope:** stories with the `api` profile on an Express + Prisma stack.
+**How to use:** This is a *primer, not a checklist to satisfy.* Scan your diff against each item and fix gaps now. It is deliberately short — also handle obvious things *like* these that aren't listed. CRITIC is still the grader; this is just wiping the obvious smudges off first.
+## The usual suspects
+1. **CORS + preflight.** If the endpoint is called cross-origin, CORS is configured and the `OPTIONS` preflight short-circuits *before* auth/body middleware — preflight carries no credentials or body and must not 401/400.
+2. **Status-code precedence.** Existence is checked before validity: a missing resource is `404` before a malformed body is `400`; auth (`401`) before authz (`403`). Don't 400 a request for a record that was never going to be found.
+3. **Body parsing is wired.** `express.json()` (and `urlencoded` if used) is registered before any handler reads `req.body`, and the handler tolerates a missing/empty body instead of throwing on `undefined`.
+4. **Prisma `P2025` → 404.** "Record to update/delete does not exist" is caught and mapped to `404`, not a bare `500`. Same for other expected Prisma error codes (`P2002` unique-violation → `409`).
+5. **Enum ↔ wire mapping.** Prisma/DB enums are mapped to/from their wire (API) representation explicitly — don't leak internal enum casing/values straight to the client, and validate inbound strings against the enum before querying.
+6. **null vs undefined.** Intended JSON `null` vs an omitted field is deliberate, not accidental — check what actually serializes. Optional/absent ≠ explicitly null in responses and in Prisma `where`/`data`.
+7. **Error-shape consistency.** Every error path returns the same envelope shape the success paths' consumers expect (same content-type, same `{ error: ... }` contract). No mix of HTML error pages and JSON.
+8. **Input validation before DB.** Untrusted input is validated/coerced before it reaches a query — no unbounded `take`, no user-controlled `orderBy`/field names hitting Prisma raw.
+9. **Test-env `listen()` guard.** The app module exports the Express app without unconditionally binding a port, so importing it under Vitest doesn't `EADDRINUSE` or leave a hanging handle. `listen()` is guarded (e.g. `if (require.main === module)` / `NODE_ENV !== 'test'`).
+10. **Migrations are real.** Schema changes ship as a Prisma migration (not just a `schema.prisma` edit), and startup/CI runs `migrate deploy` — no silent drift between the model and the DB.
+*Not exhaustive. If you know an obvious one for this stack that isn't here, handle it anyway.*

package/knowledge-seed/curated/pitfalls-data-pipeline.md ADDED Viewed

@@ -0,0 +1,19 @@
+# Pitfalls Primer — data-pipeline
+**Scope:** stories with the `data-pipeline` profile (ETL, data transformation, batch/stream processing).
+**How to use:** This is a *primer, not a checklist to satisfy.* Scan your diff against each item and fix gaps now. It is deliberately short — also handle obvious things *like* these that aren't listed. CRITIC is still the grader; this is just wiping the obvious smudges off first.
+## The usual suspects
+1. **Idempotent re-runs.** Re-running the same batch doesn't double-write or duplicate — use upserts/dedup keys or transactional checkpoints. A retried job converges, it doesn't compound.
+2. **Validate at ingest.** Input rows are validated against the expected schema; bad rows are rejected/quarantined to a dead-letter path, not silently coerced into garbage downstream.
+3. **Partial-failure safety.** A failure mid-batch doesn't leave half-written state — checkpointing/transactions make the run resumable, and in-flight work rolls back or replays cleanly.
+4. **Null / missing / type coercion is explicit.** Nulls, empty strings, and type mismatches are handled deliberately — a null doesn't silently become `0` or `""`, and a string `"NaN"` doesn't poison an aggregate.
+5. **Timezones & encoding.** Timestamps carry/normalize to UTC (no naive local time), and text is explicitly UTF-8 — no mojibake, no DST off-by-one in windowing.
+6. **Numeric precision.** Money/decimals use exact types (not floats); aggregation and rounding don't leak precision.
+7. **Determinism & ordering.** Output is deterministic regardless of input partition/arrival order where required; sorts are stable; nothing relies on dict/hash iteration order.
+8. **Bounded memory.** Large datasets stream or batch — no "load the whole table into a list." Memory stays bounded as volume grows.
+9. **Watermarks & late data.** Incremental loads track a high-water mark and handle late/out-of-order records — no windows silently missed or reprocessed.
+10. **Reconciliation & observability.** Row counts in vs out (and rejects) are reconciled and logged. A silent drop of rows is a bug, not a warning you skip.
+*Not exhaustive. If you know an obvious one for this stack that isn't here, handle it anyway.*

package/knowledge-seed/curated/pitfalls-document-generation.md ADDED Viewed

@@ -0,0 +1,19 @@
+# Pitfalls Primer — document-generation
+**Scope:** stories with the `document-generation` profile (templated documents/reports — PDF, HTML, DOCX, CSV/XLSX, etc.).
+**How to use:** This is a *primer, not a checklist to satisfy.* Scan your diff against each item and fix gaps now. It is deliberately short — also handle obvious things *like* these that aren't listed. CRITIC is still the grader; this is just wiping the obvious smudges off first.
+## The usual suspects
+1. **Every declared format produces valid output.** Each target format (PDF/HTML/DOCX/…) actually opens and validates — not just the one you eyeballed.
+2. **Escaping & injection.** Data values are escaped for the target format: HTML-escaped in HTML, no template injection, and spreadsheet cells starting with `= + - @` are neutralized against CSV/formula injection.
+3. **Missing/null data renders gracefully.** Absent fields don't emit literal `undefined`/`null` or break layout — explicit placeholders or omission.
+4. **Encoding & special characters.** UTF-8 throughout; accents, emoji, CJK, and RTL render correctly — no mojibake or missing-glyph boxes.
+5. **Deterministic output.** Same input → same document. No embedded timestamps/random IDs that break diffing/caching unless explicitly intended.
+6. **Pagination & overflow.** Long tables/content paginate cleanly — no clipped rows, no content running off the page, sane page breaks.
+7. **Locale formatting.** Dates, numbers, and currency are formatted per locale, not hardcoded to one convention.
+8. **Assets embedded/resolvable.** Referenced fonts and images are embedded or reliably resolvable — no broken image links, no font-fallback surprises.
+9. **Computed numbers are correct.** Totals/subtotals in the document match the source data; rounding is consistent across the doc.
+10. **Structure & accessibility.** Headings, alt text, and tagged structure are present where required — not just visually-formatted soup that fails a screen reader.
+*Not exhaustive. If you know an obvious one for this stack that isn't here, handle it anyway.*

package/knowledge-seed/curated/pitfalls-iac.md ADDED Viewed

@@ -0,0 +1,19 @@
+# Pitfalls Primer — iac
+**Scope:** stories with the `iac` profile (Terraform / CloudFormation / Kubernetes / CI-CD).
+**How to use:** This is a *primer, not a checklist to satisfy.* Scan your diff against each item and fix gaps now. It is deliberately short — also handle obvious things *like* these that aren't listed. CRITIC is still the grader; this is just wiping the obvious smudges off first.
+## The usual suspects
+1. **Remote, locked state.** State lives in a remote backend with locking — not local/committed state. Concurrent applies can't corrupt it.
+2. **No plaintext secrets.** No credentials, keys, or tokens hardcoded in `.tf`/templates/`tfvars` or committed. Secrets come from a secret manager / CI secret store, and sensitive outputs are marked `sensitive`.
+3. **Plan before apply.** The change is reviewed as a `plan`/changeset, and destructive replacements (resource deletion/recreation) are noticed and intended — not a surprise from an in-place-impossible attribute change.
+4. **Least-privilege IAM.** Policies scope actions and resources — no `"*"` action on `"*"` resource, no wildcard principals, unless explicitly justified.
+5. **Pinned providers/versions.** Provider and module versions are pinned (not floating `latest`) so applies are reproducible.
+6. **Network exposure is intentional.** Security groups / firewall rules / ingress don't open `0.0.0.0/0` on sensitive ports unless that's the explicit requirement. Default-deny, then allow.
+7. **Deletion protection on stateful resources.** Databases, buckets, and volumes have deletion/`prevent_destroy` protection and a backup story — accidental destroy isn't one `terraform apply` away.
+8. **Tagging + naming conventions.** Resources carry the required tags (owner, env, cost) and follow the naming scheme — for cost attribution and ops.
+9. **Idempotent + drift-aware.** Re-applying with no change is a no-op; the config doesn't fight manual drift or rely on click-ops side state.
+10. **Outputs don't leak.** Module/stack outputs don't expose secrets or internal endpoints to wider scopes than intended.
+*Not exhaustive. If you know an obvious one for this stack that isn't here, handle it anyway.*

package/knowledge-seed/curated/pitfalls-library.md ADDED Viewed

@@ -0,0 +1,19 @@
+# Pitfalls Primer — library
+**Scope:** stories with the `library` profile (shared, reusable packages/modules consumed by other code).
+**How to use:** This is a *primer, not a checklist to satisfy.* Scan your diff against each item and fix gaps now. It is deliberately short — also handle obvious things *like* these that aren't listed. CRITIC is still the grader; this is just wiping the obvious smudges off first.
+## The usual suspects
+1. **Intentional public surface.** Only what's meant to be public is exported; internals stay internal. No accidental export that becomes a de-facto API you can't change later.
+2. **CJS + ESM both resolve.** Dual entry points actually work — `exports`/`main`/`module`/`types` fields are correct and consumers can import under both module systems.
+3. **Type declarations are complete and accurate.** `.d.ts` ship and match the runtime — consumers get correct types, not `any` or stale signatures.
+4. **No side effects on import.** Importing the module runs no code/IO; `sideEffects` is set honestly so tree-shaking works.
+5. **Semver discipline.** Breaking changes bump major. Signatures/behavior don't silently change in a minor/patch.
+6. **Peer vs direct deps.** Shared frameworks (React, etc.) are `peerDependencies`, not bundled — no duplicate-instance bugs in the consumer.
+7. **Runtime-neutral or guarded.** Code targeting both Node and browser guards env-specific globals; no unguarded `window`/`process`/`fs` that explodes in the other runtime.
+8. **Library-appropriate error handling.** Throws typed/documented errors at boundaries; never swallows silently and never calls `process.exit` — that's the consumer's decision, not the library's.
+9. **Backward compatibility.** Deprecations are additive with warnings, not silent removals — existing call sites keep working across the change.
+10. **Lean footprint.** No heavy transitive dependency pulled in for a one-liner; bundle/install size stays reasonable.
+*Not exhaustive. If you know an obvious one for this stack that isn't here, handle it anyway.*

package/knowledge-seed/curated/pitfalls-mcp-server.md ADDED Viewed

@@ -0,0 +1,19 @@
+# Pitfalls Primer — mcp-server
+**Scope:** stories with the `mcp-server` profile (Model Context Protocol tool/handler servers).
+**How to use:** This is a *primer, not a checklist to satisfy.* Scan your diff against each item and fix gaps now. It is deliberately short — also handle obvious things *like* these that aren't listed. CRITIC is still the grader; this is just wiping the obvious smudges off first.
+## The usual suspects
+1. **stdio is sacred on stdout.** On the stdio transport, *nothing* but protocol JSON-RPC goes to stdout — all logging/diagnostics go to stderr. A stray `console.log` corrupts the stream and breaks the client.
+2. **Tool input schemas validate.** Each tool's `inputSchema` declares required fields and types, and the handler validates/coerces input before use — don't trust the model to send well-formed args.
+3. **Errors use the MCP contract.** Handlers return proper MCP error responses (or `isError` tool results), not raw thrown exceptions that crash the connection. Failures are legible to the calling model.
+4. **Descriptions are accurate.** Tool/parameter descriptions match actual behavior — the model selects and fills tools purely from these. A misleading description is a functional bug, not a doc nit.
+5. **Side-effects are gated.** Destructive or external-mutation tools are clearly named, described as such, and (where relevant) require explicit confirmation/parameters — no silent irreversible actions.
+6. **No secret leakage in results.** Tool outputs and error messages don't echo API keys, tokens, full env, or internal paths back to the model/client.
+7. **Init handshake + capabilities.** The server declares only capabilities it implements, and the initialize handshake/protocol version is handled — don't advertise resources/prompts you didn't wire.
+8. **Long-running + cancellation.** Slow tools handle timeouts/cancellation and don't block the event loop or hang the connection indefinitely.
+9. **Stable resource URIs.** If exposing resources, URIs are stable and resolvable; reads handle the missing/changed case instead of throwing.
+10. **Deterministic, idempotent reads.** Read/query tools don't mutate state; repeated identical calls return consistent results.
+*Not exhaustive. If you know an obvious one for this stack that isn't here, handle it anyway.*

package/knowledge-seed/curated/pitfalls-mobile.md ADDED Viewed

@@ -0,0 +1,19 @@
+# Pitfalls Primer — mobile
+**Scope:** stories with the `mobile` profile (Android/iOS apps — React Native or native).
+**How to use:** This is a *primer, not a checklist to satisfy.* Scan your diff against each item and fix gaps now. It is deliberately short — also handle obvious things *like* these that aren't listed. CRITIC is still the grader; this is just wiping the obvious smudges off first.
+## The usual suspects
+1. **Lifecycle & process death.** Background/foreground transitions and OS-initiated process death/restore are handled — state is saved/restored, no leaks, no crash on resume.
+2. **Network resilience.** Offline and slow networks are handled — timeouts, retries, and real error UI instead of an infinite spinner. No assumption of connectivity.
+3. **Permissions.** Requested at the right moment with rationale, and denial is handled gracefully — a missing permission degrades, it doesn't crash.
+4. **Platform parity.** The feature works on both Android and iOS (or is explicitly gated). No silent platform-only assumption (paths, intents, gestures).
+5. **Safe areas & screen variety.** Layout respects safe areas/notches, rotation, and a range of screen sizes/densities — nothing clipped behind the notch or off-screen.
+6. **Memory & lists.** Long lists are virtualized and large images downsampled — no OOM/jank from rendering everything at once.
+7. **Secure storage.** Tokens/secrets live in Keychain/Keystore, not plaintext prefs/AsyncStorage. Nothing sensitive logged.
+8. **Navigation & deep links.** Back stack behaves, deep links resolve to the right screen, no duplicate/dead screens.
+9. **Keyboard handling.** Inputs aren't obscured by the keyboard; focus, scroll-into-view, and dismissal behave.
+10. **Release/build config.** Each build variant points at the right env — no debug endpoints, dev keys, or verbose logging shipped in release; signing configured.
+*Not exhaustive. If you know an obvious one for this stack that isn't here, handle it anyway.*

package/knowledge-seed/curated/pitfalls-ui-react-vite.md ADDED Viewed

@@ -0,0 +1,19 @@
+# Pitfalls Primer — ui / React + Vite
+**Scope:** stories with the `ui` profile on a React + Vite stack.
+**How to use:** This is a *primer, not a checklist to satisfy.* Scan your diff against each item and fix gaps now. It is deliberately short — also handle obvious things *like* these that aren't listed. CRITIC is still the grader; this is just wiping the obvious smudges off first.
+## The usual suspects
+1. **Real env, right origin.** Client config uses actual `VITE_*` vars (prefix required, inlined at build time — not read from `process.env` at runtime). The API base URL points at the real origin per environment, not a hardcoded `localhost`.
+2. **Loading / empty / error are real states.** Every async view renders distinct loading, empty, and error UI — not a blank flash or a perpetual spinner. These are spec states, not afterthoughts.
+3. **Centralized fetch + error handling.** Network calls go through one place that checks `res.ok`, parses errors consistently, and surfaces them — not scattered `fetch().then(r => r.json())` that swallows non-2xx.
+4. **No secrets in the bundle.** Only public `VITE_*` values reach the client. No API keys, tokens, or server-only config compiled into the front-end bundle.
+5. **Effect cleanup.** Effects that subscribe/fetch clean up on unmount (abort the request, clear timers/listeners) — no `setState` after unmount, no leaked intervals, no duplicate listeners on re-render.
+6. **Stable list keys.** Lists use stable, unique keys (not array index when items reorder/insert/delete) to avoid state bleed across rows.
+7. **Controlled, double-submit-safe forms.** Inputs are controlled, the submit button disables while pending, and double-submit is prevented. Validation errors render inline.
+8. **Accessibility basics.** Inputs have associated labels, interactive elements are real buttons/links (keyboard-reachable), images have `alt`. Don't ship a `div` onClick as the only affordance.
+9. **Dev proxy vs prod CORS.** If using Vite's dev proxy, confirm the prod build talks to the real API origin with credentials configured correctly — the thing that works in dev via proxy must work in prod via CORS.
+10. **No stale closures / missing deps.** Effect/callback dependency arrays are correct — no stale values captured, no infinite re-render from an unstable dependency.
+*Not exhaustive. If you know an obvious one for this stack that isn't here, handle it anyway.*

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "valent-pipeline",
-  "version": "0.6.3",
+  "version": "0.7.0",
   "description": "v3 multi-agent AI pipeline for software development lifecycle",
   "type": "module",
   "bin": {
@@ -13,7 +13,8 @@
     "bin/",
     "src/",
     "pipeline/",
-    "skills/"
+    "skills/",
+    "knowledge-seed/"
   ],
   "scripts": {
     "test": "node scripts/test-local.js && node scripts/test-workflow.js",

package/pipeline/docs/multi-provider-runners.md ADDED Viewed

@@ -0,0 +1,185 @@
+# Multi-Provider Runners — Design Doc
+> **Status:** Design / proposal
+> **Implements:** Lever 5 / Phase 7 of [`cost-and-resumability-improvements.md`](./cost-and-resumability-improvements.md) — the one cost lever still on paper.
+> **Scope:** `sprint.workflow.js` agent spawning. Generalize `modelFor(role)` → `runnerFor(role)` so the bulk dev line can run on a flat-rate subscription CLI (Codex / Grok) instead of the per-token Anthropic API.
+> **Target subscriptions (confirmed in hand):** OpenAI **Codex** (`codex exec`, ChatGPT Plus/Pro) and **Grok CLI** (X/Grok sub). Cursor is out of scope until/unless a sub exists.
+> **Thesis (unchanged):** the pipeline is gated, so the Opus judgment gates are the quality floor. That makes provider variance *safe to absorb* on the bulk find/build work — weak output just gets caught and reworked. We trade a little rework risk for near-zero marginal token cost on the bulk.
+---
+## 1. The one axis that must not be conflated
+There are **two independent provider concepts**, and the existing code only has the first:
+| Axis | What it picks | Today | Stays / changes |
+|------|---------------|-------|-----------------|
+| `runtime.provider` (`config-schema.js:138`, validated against `['claude-code']`) | The **orchestrator harness** — who runs `Workflow`, spawns agents, owns the journal | `claude-code` | **Unchanged.** Always `claude-code`. The orchestrator is Claude Code regardless of who does the labor. |
+| **`runners` (new)** | Per-role: who does the **labor inside a role** — a native Claude subagent, or a shelled-out vendor CLI | (does not exist) | **New axis.** This whole doc. |
+Keeping these separate is the core design decision. A Codex-driven `BEND` agent is still spawned *by the Claude Code orchestrator as a native `agent()`* — that native agent is a thin **driver** that shells out to `codex exec` and re-emits the handoff schema. The orchestrator never itself shells out (it can't — see §4).
+---
+## 2. Where it plugs in — one chokepoint
+Every dev role becomes a subagent through exactly one helper, `spawn()` at `sprint.workflow.js:615`:
+```js
+const spawn = (role, promptFile, taskSubject, opts = {}) =>
+  agent(buildPrompt({ role, promptFile, storyId, taskSubject, ...opts }), {
+    label: opts.label || `${role.toLowerCase()}:${storyId}`,
+    phase: opts.phase,
+    schema: opts.schema || HANDOFF_SCHEMA,
+    model: opts.model || modelFor(role),   // ← the only routing knob today
+  })
+```
+`modelFor(role)` (`sprint.workflow.js:395`) returns a Claude tier string (`opus`/`sonnet`/`haiku`/`undefined`) and it always lands on a native `agent()`. The feature is: add a sibling resolver `runnerFor(role)`. When it says `claude` (the default), nothing changes. When it names an external runner, `spawn()` builds a **driver prompt** instead of the role prompt and routes through a wrapper agent.
+The gates are untouched. CRITIC / JUDGE / READINESS consume `HANDOFF_SCHEMA` (`:100`) and `VERDICT_SCHEMA` (`:114`) and **do not care who produced them** — that schema normalization is the entire reason this is safe to bolt on.
+---
+## 3. Config surface
+Mirror the existing `models` map (`config-schema.js:187`). Two new blocks, both overlay-on-default and journal-replay-safe (static + args only):
+```yaml
+# pipeline-config.yaml
+# Runner catalog — declares the external CLIs available this run. `claude` is implicit (native).
+runners:
+  codex:
+    cmd: "codex exec --json"          # headless, non-interactive; --json => parseable stdout
+    auth: subscription                # informational; means "needs interactive login, not an API key"
+    timeout_s: 1200                   # long build runs use run_in_background + Monitor past this
+  grok:
+    cmd: "grok --prompt"
+    auth: subscription
+    timeout_s: 900
+# Per-role runner assignment. Unlisted roles default to { runner: claude } and keep their `models` tier.
+# `fallback` is MANDATORY for any external runner (see §6) — a missing/failed CLI degrades, never kills.
+roles:
+  BEND:   { runner: codex, fallback: claude:sonnet }   # build the backend on the flat-rate sub
+  FEND:   { runner: codex, fallback: claude:sonnet }
+  DATA:   { runner: codex, fallback: claude:sonnet }
+  DOCGEN: { runner: grok,  fallback: claude:sonnet }    # low-risk first target (see §7 spike)
+  # Judgment gates stay native Claude — never route these to an external runner:
+  CRITIC-TRIAGE: { runner: claude, model: opus }
+  JUDGE-DECIDE:  { runner: claude, model: opus }
+  READINESS:     { runner: claude, model: opus }
+```
+`runnerFor(role)` resolution order, parallel to `buildModelMap` (`:373`):
+1. explicit `roles.<ROLE>.runner` → use it;
+2. else `claude` with `modelFor(role)` as the tier (today's behavior, byte-for-byte).
+**Guardrail baked into the resolver:** the judgment roles (`READINESS`, `CRITIC-TRIAGE`, `JUDGE-DECIDE`, `INTEGRATION`) are a hard-coded deny-list for external runners. A config that tries to route them to `codex`/`grok` is a **validation error** in `config-schema.js`, not a silent override. The gates are the floor; they never run on a subsidized model.
+---
+## 4. Why a wrapper agent — and what it costs
+**The orchestrator cannot shell out.** The `Workflow` runtime has no Bash/FS access, and even if it did, an external CLI call from the script body would break journal-replay determinism (`Date.now`-class non-determinism; the resume-safety linter at `scripts/test-workflow.js` enforces a pure script body). So all external IO must happen **inside an agent**, where it's captured in the journal as that agent's cached result.
+That means an external runner is not the doc's literal "$0 marginal." It's:
+> **flat-rate-sub tokens (the build itself, $0 marginal) + a cheap Claude driver agent that babysits the CLI and normalizes output.**
+The driver is a `haiku` native agent. Its job is purely mechanical — run a command, wait, parse, validate, retry — so haiku is right. Pennies, not Opus. Still a massive win when it moves the entire Sonnet dev line off the per-token API, but the doc should be honest that the babysitter exists.
+### Driver-agent contract
+`spawn()`, when `runnerFor(role)` is external, builds this driver prompt (haiku, `schema: HANDOFF_SCHEMA`):
+```
+You are a SHELL DRIVER. Do not reason about the task. Execute these steps exactly:
+1. Write the following role brief to a temp file <brief.md>:
+   <<the normal buildPrompt() output for this role + story>>
+2. Run:  <runners.codex.cmd> "$(cat <brief.md>)"
+   - If it exceeds <timeout_s>, use run_in_background:true + Monitor; do not block the slot.
+   - The CLI works in the story worktree and writes code there directly.
+3. Capture stdout. The role brief instructs the CLI to end with a fenced ```json block
+   matching the handoff schema. Extract that block.
+4. If extraction or schema validation fails, retry step 2 ONCE with an explicit
+   "emit ONLY the JSON handoff" suffix.
+5. If it still fails, OR the CLI exits non-zero, OR the CLI is not installed:
+   return { "status": "runner_failed", "reason": "<short>" } so the orchestrator falls back.
+6. On success, return the validated handoff JSON as your final message.
+```
+Two things make this hold together:
+- **The role brief is reused verbatim.** `buildPrompt()` (`:447`) already produces the full role instructions; the driver just relays them to the external CLI plus a "end with this JSON schema" coda. We are not maintaining a second set of role prompts per vendor.
+- **The handoff schema is the normalization layer.** Codex stdout and a native Opus handoff are interchangeable to every downstream gate because both end as schema-valid JSON. This is the same property that lets `models` tiering work; runners just extend it across vendors.
+---
+## 5. The file-collision problem the original plan missed
+Native dev agents in the Build phase run in **`parallel`** and write to the same story worktree; the Claude Code harness coordinates them. **Raw `codex exec` / `grok` processes do not coordinate** — two external drivers mutating the same tree concurrently is exactly the merge-conflict / lost-write case.
+**Resolution:** any role whose `runner` is external **and** that runs in the parallel Build barrier must spawn with `isolation: 'worktree'` (the driver agent gets its own git worktree), followed by a merge step. Options, cheapest first:
+- **A — Serialize external build roles.** If only one role per story is external (e.g. just `BEND` on Codex while `FEND` stays native), no isolation needed — the single external writer and the native writers are harness-coordinated only among themselves, but a lone external writer touching distinct files is usually fine. *Start here for the spike.*
+- **B — Worktree-per-external-driver + merge.** Each external dev driver runs `isolation: 'worktree'`; a follow-up native agent merges the worktrees back. Correct for N concurrent external writers, but `isolation: 'worktree'` is ~200–500ms + disk per agent and adds a merge gate. Only pay this when ≥2 roles are external **and** concurrent.
+The doc's §3 example deliberately routes multiple dev roles to Codex — that triggers Option B. The **spike (§7) routes exactly one low-risk role**, staying in Option A and proving the path before paying for worktree orchestration.
+---
+## 6. Headless auth & fallback — load-bearing, not polish
+Subscription CLIs (`codex`, `grok`) assume an **interactive login**. In an unattended / cron / fresh-host sprint they may be unauthenticated or absent. Therefore:
+- `fallback` is **mandatory** for every external role (validated in `config-schema.js`). Form: `claude:<tier>`.
+- The driver's step 5 (`runner_failed`) makes the orchestrator **transparently re-spawn the role as a native Claude agent at the fallback tier**. A missing CLI costs one wasted haiku driver turn, not a dead 2–4M-token sprint.
+- Add a **one-time preflight** at sprint start: a haiku agent runs `codex --version` / `grok --version` for every external runner referenced in config and `log()`s which are live. If a runner is dead, log it loudly and pin those roles to fallback for the whole run (don't re-probe per story). This makes the degradation visible instead of silent.
+---
+## 7. Determinism / journal safety
+- All shelling stays **inside the driver agent**; the orchestrator script body stays pure. Resume replays the driver agent's **cached result** (the validated handoff), and never re-invokes `codex`. So `codex`'s own run-to-run non-determinism is irrelevant to replay — exactly like a native agent's sampling variance is today.
+- No `Date.now` / `Math.random` enters the script. Temp-file names inside the driver derive from `storyId`+`role` (already unique), not a clock.
+- `scripts/test-workflow.js` (resume-safety linter) must stay green; nothing here adds script-body IO.
+---
+## 8. Changes, by file
+| File | Change |
+|------|--------|
+| `sprint.workflow.js:351` `DEFAULT_MODELS` / `:373` `buildModelMap` | Unchanged. Tiers still apply to `claude` runner and to every `fallback`. |
+| `sprint.workflow.js:395` `modelFor` | Joined by a new `runnerFor(role)` + `buildRunnerMap(a.runners, a.roles)`, same overlay pattern. |
+| `sprint.workflow.js:615` `spawn()` | Branch: `runnerFor(role) === 'claude'` → today's path; else build the §4 driver prompt (haiku, handoff schema), apply §5 isolation rule, wire §6 fallback on `runner_failed`. |
+| `sprint.workflow.js:447` `buildPrompt` | Add an optional `runnerCoda` (the "end with this JSON" suffix) appended only for external drivers. Role content reused as-is. |
+| `src/lib/config-schema.js:137` runtime block | Add validation for `runners` (each needs `cmd`; `timeout_s` numeric) and `roles` (`runner` ∈ catalog ∪ `claude`; external roles **require** `fallback: claude:<tier>`; judgment roles **reject** external runners). |
+| `src/lib/config-schema.js:187` `defaults` | Add `runners: {}` and `roles: {}` (empty ⇒ feature off, behavior identical to today). |
+| Sprint start | Add the §6 preflight probe agent. |
+**Off by default.** Empty `runners`/`roles` ⇒ `runnerFor` always returns `claude` ⇒ byte-identical to current behavior. No prompt-cache disruption for anyone not opting in.
+---
+## 9. Phased rollout
+| Phase | Change | Why this order |
+|-------|--------|----------------|
+| **A — Resolver + schema, no external calls** | Add `runnerFor`/`buildRunnerMap`, config validation, defaults. All roles still resolve to `claude`. | Pure plumbing; provably a no-op; lands the config surface and the judgment-role deny-list first. |
+| **B — Spike: one low-risk role on Grok** | Route only `DOCGEN` → `grok` (single external writer, Option A, no worktree merge). Driver agent + handoff parse + fallback. | Proves the driver contract, output normalization, and fallback end-to-end on the least-critical, lowest-collision role. Measure with `valent-review-cost` before widening. |
+| **C — Codex on the build line** | Route `BEND` (then `FEND`/`DATA`) → `codex`. Introduces Option B worktree isolation + merge gate. | The actual prize — moves the Sonnet dev bulk onto the flat-rate sub. Gated behind a working spike and the worktree machinery. |
+| **D — Calibrate** | Track per-runner rejection-rate / rework-cycle counts in the existing `calibrate` data. If an external runner drives rework up, the savings evaporate → pin that role back to `claude`. | The thesis says variance is safe *because gates catch it* — but only if we **measure** the rework tax. Make tiering measured, not assumed. |
+Phase A is safe to merge immediately. Phase B is the go/no-go: if Grok-built docs sail through the gates at materially lower cost, Phase C is justified; if they thrash the rejection loop, we learned it cheaply on the role that matters least.
+---
+## 10. Open questions
+1. **Codex `--json` handoff fidelity.** Does `codex exec` reliably emit a clean trailing JSON block when instructed, or does it wrap/narrate? If narration leaks, the driver's retry-with-strict-suffix (step 4) is the guard — but confirm during the Phase B spike with Grok first, since DOCGEN output is forgiving.
+2. **Merge-gate ownership (Phase C).** When ≥2 external drivers use `isolation: 'worktree'`, who merges — a dedicated native agent, or does the existing INTEGRATION gate absorb it? Leaning: a small dedicated merge agent before CRITIC, so CRITIC always sees one coherent tree.
+3. **Per-runner model pinning.** Codex/Grok may expose model choice (e.g. `codex --model`). Worth surfacing as `runners.codex.model`? Defer until Phase C; the sub's default model is fine for bulk build under a gate.
+4. **Cost attribution.** Flat-rate sub usage doesn't show up in the Anthropic token audit at all. `valent-review-cost` will show the *reduction* in Claude spend but is blind to sub consumption — note this so "savings" aren't overstated against a sub's own rate limits.

package/pipeline/orchestrators/claude-code/README.md CHANGED Viewed

@@ -12,7 +12,7 @@ provider keeps the markdown-skill Lead. Both consume the same shared substrate
 |---|---|---|
 | `plan.workflow.js` | 7 | Groom → size → pack → validate a set of pending stories into a planned sprint batch. Emits a batch shaped to feed straight into `sprint.workflow.js`. |
 | `sprint.workflow.js` | 4 + 6 | Execute a planned batch sequentially through the per-story pipeline with schema-validated gates. |
-| `retro.workflow.js` | 7 | Learn from a shipped batch: calibrate, loop-until-dry aggregate review, gated directives, embed. |
+| `retro.workflow.js` | 7 | Learn from a shipped batch: record + calibrate estimation data, synthesize gated correction directives (impact + invariant + stale-reissue guards), embed. |
 They compose as `plan → sprint → retro`. The per-story pipeline is kept **inline** in
 `sprint.workflow.js` (not a nested `workflow()`), so the single `workflow()` nesting level
@@ -43,6 +43,7 @@ incl. a resume-safety lint), but:
 | Rejection cap | JS `while` loop, code-owned counter | model-counted circuit breaker |
 | Dev fan-out | `parallel()` barrier before CRITIC | wave/spawn_trigger overlay |
 | 3b CRITIC | `parallel([blind, edge, acceptance])` independent agents → triage barrier | one CRITIC context, passes anchored on each other |
+| Visual evidence | a `ui`-profile story runs a PMCP stage that drives the browser MCP over QA-A's checklist → `pmcp-evidence.md` (the artifact JUDGE requires) | the orphaned dependency: JUDGE demanded PMCP evidence no stage produced |
 | Spawn context | `buildPrompt()` mirrors `spawn.template.md` (Setup/Task/Trigger/Completion) | terse inline instructions |
 | Roll-over | a rejected story is recorded and the batch continues | — |
 | Empty-graph guard | a resolved graph with zero dev agents throws a diagnostic before Build | silent empty Build → CRITIC looping on an empty diff |
@@ -104,7 +105,7 @@ are invoked *through* an agent rather than imported.
   yet; for now run the three workflows in sequence (the plan output feeds the sprint input).
 - Per-story dev fan-out re-runs ALL dev agents on a CRITIC rejection; routing rework to only
   the agent(s) CRITIC targeted (via `rejectionTarget`) is a refinement once run live.
-- No PMCP / visual-validation stage yet; no PM/program-loop workflow (left agent-driven per §5b).
+- No PM/program-loop workflow yet (left agent-driven per §5b).
 ## Runtime constraint that shaped the design