valent-pipeline 0.6.1 → 0.6.3

package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "valent-pipeline",
3
- "version": "0.6.1",
3
+ "version": "0.6.3",
4
4
  "description": "v3 multi-agent AI pipeline for software development lifecycle",
5
5
  "type": "module",
6
6
  "bin": {
@@ -0,0 +1,313 @@
1
+ # Workflow Improvement Plan — Cost & Resumability
2
+
3
+ > **Status:** Proposal / design doc
4
+ > **Scope:** `valent-pipeline` orchestrators (`plan` / `sprint` / `retro`), gate model assignment, and the path to a multi-provider runner.
5
+ > **Author's thesis (yours, and it's correct):** *Because the pipeline is gated, cheap models can do far more of the work.* The gates are the quality backstop, so the expensive model only needs to show up where **judgment** happens — not where **finding** or **building** happens. Everything below is an elaboration of that one idea, plus a resumability layer so a 2–4M-token sprint never has to start over.
6
+
7
+ ---
8
+
9
+ ## 1. The two problems, stated precisely
10
+
11
+ | # | Problem | Root cause in the current design |
12
+ |---|---------|----------------------------------|
13
+ | 1 | **Resumability** after a network glitch or context exhaustion | Resume relies on the Workflow journal + `resumeFromRunId`, which is **same-session only** and **in-memory**. If the session itself dies (context exhaustion → new session) or the host process is killed, the journal/runId is gone and there is no durable, cross-session ledger of "which stories already shipped." A single transient agent failure inside a sequential gate **throws and kills the whole run** rather than retrying. |
14
+ | 2 | **Cost** — ~1M tokens per planning run, 2–4M per sprint | Opus is the default for **every gate** (`READINESS`, `CRITIC`, `JUDGE`, `INTEGRATION`), and the most expensive gate — `CRITIC` — runs **3 full Opus passes + an Opus triage**, and **re-runs all of that** on every rejection cycle (up to `maxRejectionCycles`, default 5). The rejection loop tail is the single largest cost center. Nothing deterministic (linters, type-checkers, static analysis) runs *before* Opus, so Opus pays tokens to find mechanical issues a free tool would catch. |
15
+
16
+ These interact: a robust **idempotent checkpoint** layer (Problem 1) *also* saves cost, because a resumed sprint skips already-shipped stories instead of re-billing them.
17
+
18
+ ---
19
+
20
+ ## 2. Cost anatomy — where the tokens actually go
21
+
22
+ **Current model assignment** (`sprint.workflow.js:224-230`):
23
+
24
+ ```js
25
+ const DEFAULT_MODELS = {
26
+ READINESS: 'opus', CRITIC: 'opus', JUDGE: 'opus', INTEGRATION: 'opus', // ← gates: all Opus
27
+ REQS: 'sonnet', UXA: 'sonnet', 'QA-A': 'sonnet', 'QA-B': 'sonnet',
28
+ BEND: 'sonnet', FEND: 'sonnet', DATA: 'sonnet', 'MCP-DEV': 'sonnet',
29
+ LIBDEV: 'sonnet', DOCGEN: 'sonnet', IAC: 'sonnet', MOBILE: 'sonnet',
30
+ RESOLVE: 'haiku',
31
+ }
32
+ ```
33
+
34
+ **Current API prices (2026):**
35
+
36
+ | Model | Input $/M | Output $/M | Cached input $/M (≈90% off) | Batch (−50%) |
37
+ |-------|----------:|-----------:|----------------------------:|-------------:|
38
+ | Opus 4.8 | $5 | $25 | ~$0.50 | $2.50 / $12.50 |
39
+ | Sonnet 4.6 | $3 | $15 | ~$0.30 | $1.50 / $7.50 |
40
+ | Haiku 4.5 | $1 | $5 | ~$0.10 | $0.50 / $2.50 |
41
+
42
+ **Per-story Opus cost, directional, *zero rework*** (input largely uncached on first pass):
43
+
44
+ | Gate | Opus calls | Rough $ / story |
45
+ |------|-----------:|----------------:|
46
+ | READINESS (1 pass) | 1 | ~$0.28 |
47
+ | **CRITIC (3 passes + triage)** | **4** | **~$1.25** |
48
+ | JUDGE (1 pass) | 1 | ~$0.33 |
49
+ | **Total Opus (no rework)** | **6** | **~$1.86** |
50
+
51
+ Now add the loop. **Each CRITIC rejection cycle re-runs all 3 passes + triage** (`runCriticGate`, `sprint.workflow.js:519-566`). Two cycles ≈ **+$2.50**, which lifts CRITIC from ~67% of the story's Opus spend ($1.25 of $1.86) to ~86% ($3.75 of $4.36). **The rejection-loop tail is the thing to attack first.**
52
+
53
+ > **Takeaway:** Of the 6 Opus calls per clean story, only **3 are genuine judgment calls** (the READINESS decision, the CRITIC *triage*, the JUDGE *ship decision*). The other 3 (the CRITIC *hunting* passes) are *finding* work that a cheaper model — or a free deterministic tool — can do most of. That gap is the savings.
54
+
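The loop arithmetic can be checked in a few lines. This is a minimal sketch, assuming the rough per-gate figures from the table above and assuming that only CRITIC re-runs on a rejection; `GATE_COST` and `opusCostPerStory` are illustrative names, not identifiers from `sprint.workflow.js`.

```javascript
// Directional per-story Opus cost, using the rough $ figures from the table above.
// Names are hypothetical; they do not exist in sprint.workflow.js.
const GATE_COST = { READINESS: 0.28, CRITIC: 1.25, JUDGE: 0.33 } // $ per full gate run

// Assumption: each rejection cycle re-runs the whole CRITIC gate (3 passes + triage)
// and nothing else.
function opusCostPerStory(rejectionCycles) {
  const critic = GATE_COST.CRITIC * (1 + rejectionCycles)
  const total = GATE_COST.READINESS + critic + GATE_COST.JUDGE
  return { critic, total, criticShare: critic / total }
}

for (const cycles of [0, 1, 2]) {
  const { total, criticShare } = opusCostPerStory(cycles)
  console.log(`${cycles} cycle(s): $${total.toFixed(2)} total, CRITIC ${(criticShare * 100).toFixed(0)}%`)
}
```

With these inputs the CRITIC share comes out at roughly 67%, 80%, and 86% for 0, 1, and 2 rejection cycles.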
55
+ ---
56
+
57
+ ## 3. Cost levers, highest ROI first
58
+
59
+ ### Lever 1 — Deterministic pre-gates (free, non-LLM) before any Opus runs
60
+
61
+ Run linters, type-checkers, and static analysis as a **blocking gate between Build and CRITIC**. Dev agents must produce a clean lint/type/compile/test-build before a single Opus token is spent reviewing.
62
+
63
+ | Tool | Cost | Catches (deterministically) |
64
+ |------|------|------------------------------|
65
+ | **Semgrep** | Free ≤10 contributors (OSS engine free) | Injection, secrets, taint flows, unsafe APIs, ~security rules |
66
+ | **ESLint / tsc** | Free | Type errors, null/undefined, unused, unreachable, many real bugs |
67
+ | **Ruff / Bandit** (Python) | Free | Lint + Python security |
68
+ | Compiler / `tsc --noEmit` | Free | Whole class of "doesn't even type-check" rejections |
69
+
70
+ **Why this is #1:** It directly shrinks the **rejection-loop tail** — the most expensive thing in the system. Every mechanical defect a linter catches is a CRITIC rejection cycle (3 Opus passes + triage) that *never happens*. It also lets CRITIC's prompt explicitly say *"lint/types/security-static already passed; focus only on semantic correctness and AC coverage,"* which shrinks each remaining Opus pass.
71
+
72
+ **Where:** new `pre-critic-gate` step invoked in `runStory` between the Build barrier and `runCriticGate` (`sprint.workflow.js`, around the Build→Critic transition). The dev agents already run tests; add the static suite to their `handoff` step and make CRITIC refuse to start until the pre-gate verdict is green.
73
+
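One possible shape for the pre-gate verdict, as a sketch only. It assumes the dev agent runs the static checks itself (for example through its Bash tool) and reports one `{ command, exitCode }` entry per check; that result shape and the function names `preCriticVerdict` and `canStartCritic` are hypothetical.

```javascript
// Fold the exit codes of the static checks into one blocking verdict.
// `results` is assumed to be collected inside the dev agent; the orchestrator
// script itself performs no IO.
function preCriticVerdict(results) {
  const failed = results.filter((r) => r.exitCode !== 0).map((r) => r.command)
  // An empty result list counts as red: "no checks ran" must not pass the gate.
  const green = results.length > 0 && failed.length === 0
  return { verdict: green ? 'green' : 'red', failed }
}

// CRITIC should only start once the pre-gate is green.
function canStartCritic(preGate) {
  return preGate.verdict === 'green'
}
```

Treating an empty result list as red is a deliberate choice here: a misconfigured gate that runs nothing should block rather than silently wave the diff through to Opus.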
74
+ ### Lever 2 — Split *find* from *judge*; make Opus the exception-handler, not the default
75
+
76
+ Re-tier `CRITIC` so the **hunting** passes run cheap and only **judgment** runs on Opus:
77
+
78
+ ```yaml
79
+ # pipeline-config.yaml (proposed `models` override — already supported by buildModelMap)
80
+ models:
81
+ haiku: [RESOLVE, CRITIC_BLIND] # blind-hunt is mechanical: naming, dead code, copy-paste
82
+ sonnet: [REQS, UXA, QA-A, QA-B, BEND, FEND, ...,
83
+ CRITIC_EDGE, CRITIC_ACCEPT, # edge-case + acceptance-audit on Sonnet
84
+ JUDGE_EVIDENCE] # evidence cross-referencing is mechanical counting
85
+ opus: [CRITIC_TRIAGE, # adjudication = judgment → keep Opus
86
+ JUDGE_DECIDE, READINESS] # the actual ship/ready decisions
87
+ ```
88
+
89
+ This requires splitting the monolithic `CRITIC` and `JUDGE` roles into sub-roles the model map can target (the 3-pass structure already exists as separate agents at `sprint.workflow.js:519-566` — give each pass its own role key). The triage step stays Opus because that's where duplicates collapse and severity is decided — the genuine judgment.
90
+
91
+ **Escalation rule (the gate-thesis, made literal):** run the cheap pass; **escalate a finding to Opus only when** the cheap model marks it High-severity *or* flags low confidence. A clean diff that passes lint + Sonnet review + full AC coverage may need **zero Opus** — Opus becomes an adjudicator invoked on exceptions, not the default reviewer. Projected CRITIC cut: **~40–60%** with no loss of the Opus backstop on actual decisions.
92
+
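The escalation rule could be expressed as a small predicate over the cheap pass's findings. This is a sketch under assumptions: the finding shape (`severity`, `confidence`) and the `CONFIDENCE_FLOOR` threshold are hypothetical and are not taken from `verdict.schema.json`.

```javascript
// Assumed threshold below which the cheap model's finding counts as "low confidence".
const CONFIDENCE_FLOOR = 0.7

// Escalate when the cheap pass marks the finding High-severity or is unsure about it.
function needsOpus(finding) {
  return finding.severity === 'high' || finding.confidence < CONFIDENCE_FLOOR
}

// Split findings into those Opus must adjudicate and those the cheap pass settles.
function partitionFindings(findings) {
  return {
    escalate: findings.filter(needsOpus),
    settled: findings.filter((f) => !needsOpus(f)),
  }
}
```

If `escalate` comes back empty, the story needs no Opus call for this gate, which is the "exception-handler" behaviour described above.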
93
+ Apply the same split to `JUDGE`: Sonnet does the *evidence cross-reference* (does the test count match the spec, is every AC traced — mechanical counting), Opus makes only the final SHIP/REJECT call, and only when evidence is ambiguous.
94
+
95
+ ### Lever 3 — Targeted re-review on rejection (kill the loop tail)
96
+
97
+ Today a rejection re-runs **all 3 CRITIC passes + triage** (`runCriticGate`). Instead: **persist the open findings**, and on rework, run a **verify-only pass** that checks *just the previously-failed findings* against the new diff. Full re-hunt only if the dev's fix touched files outside the flagged set.
98
+
99
+ - Re-review cost drops from ~100% of a full CRITIC to ~20%.
100
+ - Combined with Lever 1 (fewer rejections to begin with) this collapses the cost tail.
101
+ - Lower `maxRejectionCycles` from 5 → 2–3 and **escalate to a human/Opus-once** instead of looping a 4th time; the 4th–5th cycles are almost never productive and are pure Opus burn.
102
+
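The verify-only versus full re-hunt decision can be sketched as a pure function. It assumes each persisted finding records the `file` it was raised against and that the list of files changed by the rework is available; `planReReview` and both shapes are hypothetical.

```javascript
// Decide how much of CRITIC to re-run after a rework.
// openFindings: findings persisted from the rejected cycle, each with a `file`.
// changedFiles: files touched by the dev's fix.
function planReReview(openFindings, changedFiles) {
  const flagged = new Set(openFindings.map((f) => f.file))
  const outside = changedFiles.filter((file) => !flagged.has(file))
  if (outside.length > 0) {
    // The fix reached beyond the flagged set, so earlier passes no longer cover it.
    return { mode: 'full-rehunt', outside }
  }
  return { mode: 'verify-only', findings: openFindings }
}
```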
103
+ ### Lever 4 — Offload review to flat-rate / free external reviewers
104
+
105
+ Use an external reviewer as the **blind-hunt pass** (or a pre-CRITIC filter), so that pass costs a flat fee or $0 instead of Opus tokens:
106
+
107
+ | Tool | Pricing | Local/CLI | Needs your LLM key? | Catches |
108
+ |------|---------|-----------|---------------------|---------|
109
+ | **CodeRabbit CLI** | **$0.25 / file** reviewed (credit add-on); SaaS Pro $24/dev/mo, Pro+ $48 | CLI client → their cloud (needs API key, not offline) | No — flat per-file | Race conditions, memory leaks, security vulns, actionable fixes |
110
+ | **Qodo Merge** (ex-PR-Agent) | **Free tier 250 credits/mo**; Teams $30/user/mo (2500 credits). PR-Agent OSS is **self-hostable, BYO key** | Yes — IDE/local review + self-host/air-gapped | OSS path: yes (your key) | PR review, context-aware; Opus=5 credits/req, Grok-4=4 credits/req on managed tier |
111
+ | **Greptile** | $30/seat/mo, 50 reviews incl., then $1/review | Cloud; enterprise self-host | Managed | Codebase-context review |
112
+ | **Sourcery** | Free for OSS; Team $24/seat/mo (**BYO LLM** on Team) | Yes — CLI | Team tier: yes (your key) | **Rule-based refactoring** (deterministic) + AI review |
113
+ | **Native Claude Code `/code-review`** | In-session tokens only | Yes — runs on the local diff | Uses your Claude session | Correctness bugs + reuse/simplification at chosen effort |
114
+
115
+ **Recommendation:** CodeRabbit CLI ($0.25/file, predictable) or self-hosted **PR-Agent** (free, your key) as the **blind-hunt replacement**. They run on the diff, return structured findings, and slot straight into the existing `verdict.schema.json` gate machinery — the orchestrator doesn't care who produced the verdict as long as it matches the schema. The native `/code-review` skill is the zero-integration option for a first experiment.
116
+
117
+ ### Lever 5 — Multi-provider CLI arbitrage (Codex / Cursor / Grok) — *subsidized usage across providers*
118
+
119
+ This is the big structural play, and it's feasible **because Workflow `agent()` subagents have Bash**. An agent can shell out to another vendor's headless CLI, wait, and parse the result back into the handoff schema. The orchestrator stays pure (all IO happens *inside* the agent, which is journal-replay-safe); the subagent becomes a thin **"shell driver."**
120
+
121
+ **Subsidized runners to target:**
122
+ - `codex exec "<prompt>"` — OpenAI **Codex CLI**, non-interactive/headless, billable against a **ChatGPT Plus/Pro/Team subscription** (flat-rate, not per-token API).
123
+ - `cursor-agent` — **Cursor CLI** headless, subsidized by a Cursor subscription.
124
+ - **Grok CLI** — xAI, subsidized by an X/Grok subscription.
125
+
126
+ **The arbitrage:** route the cheap-but-bulky work (dev/build passes, blind-hunt, test scaffolding, doc lookups) to whichever flat-rate subscription you're *already paying for*, so those tokens are **$0 marginal**. Keep Claude **Opus only on the irreducible judgment gates**. If Codex/Cursor does the *building* on a flat subscription, the entire Sonnet dev line can leave the per-token API bill.
127
+
128
+ **Architecture — generalize `modelFor(role)` into `runnerFor(role)`:**
129
+
130
+ ```yaml
131
+ # pipeline-config.yaml (proposed)
132
+ runners:
133
+ codex: { cmd: "codex exec", auth: subscription } # flat-rate
134
+ cursor: { cmd: "cursor-agent", auth: subscription }
135
+ grok: { cmd: "grok", auth: subscription }
136
+ claude: { cmd: native } # in-session subagent
137
+ roles:
138
+ BEND: { runner: codex, fallback: claude:sonnet } # build on flat-rate sub
139
+ CRITIC_BLIND: { runner: cursor, fallback: claude:haiku } # blind hunt on flat-rate sub
140
+ CRITIC_TRIAGE: { runner: claude, model: opus } # judgment stays Opus
141
+ JUDGE_DECIDE: { runner: claude, model: opus }
142
+ ```
143
+
144
+ The handoff/verdict schemas (`schemas/handoff.schema.json`, `schemas/verdict.schema.json`) are the **normalization layer** — a Codex-produced review and an Opus-produced review are interchangeable to the gate logic as long as both emit the schema. Long external runs use `run_in_background: true` + `Monitor`.
145
+
146
+ **Guardrails (this is exactly your gate-thesis):** quality variance across providers is *fine* precisely because the Opus gate is the backstop. A cheaper/subsidized provider that produces weaker work just gets caught and rejected by CRITIC/JUDGE — you trade a little more rework risk for near-zero marginal token cost on the bulk. Keep a `fallback` so a missing/failing CLI degrades to a native Claude agent rather than killing the run.
147
+
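A rough sketch of how `runnerFor(role)` might resolve a runner and degrade to its fallback. The config object mirrors the proposed YAML above; the `isAvailable` callback (for instance, a check an agent performs for the CLI on the host) and the returned shape are assumptions, not existing code.

```javascript
// Parse a fallback spec such as "claude:sonnet" into { runner, model }.
function parseFallback(spec) {
  const [runner, model] = spec.split(':')
  return { runner, model }
}

// Resolve which runner executes a role. `isAvailable(name)` is a hypothetical
// callback reporting whether an external CLI is usable in this environment.
function runnerFor(role, config, isAvailable) {
  const entry = config.roles[role]
  if (!entry) return { runner: 'claude' } // no override: defer to the existing model map
  if (entry.runner === 'claude') return { runner: 'claude', model: entry.model }
  if (isAvailable(entry.runner)) {
    return { runner: entry.runner, cmd: config.runners[entry.runner].cmd }
  }
  if (entry.fallback) return parseFallback(entry.fallback)
  throw new Error(`no usable runner for ${role}`)
}
```

Throwing only when there is neither a usable CLI nor a fallback keeps the failure loud in exactly the case the guardrail warns about.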
148
+ ### Lever 6 — Ref MCP to cut rework *and* context bloat
149
+
150
+ Wire **Ref MCP** (`ref_search_documentation`, `ref_read_url`) into the dev agents (BEND/FEND/DATA/…) and CRITIC. Two distinct savings:
151
+
152
+ 1. **Fewer rejection cycles.** A large share of CRITIC/JUDGE rejections are wrong API usage / hallucinated signatures. Ref lets the dev agent verify the real API *before* writing code, so the diff is right the first time → fewer (expensive, Opus) rework loops. This attacks the same cost tail as Lever 1, from the front.
153
+ 2. **Smaller context.** Without Ref, agents either guess or paste entire doc pages into context. Ref returns **only the relevant snippet**, so you pay for the 500 tokens you need, not the 40K-token page. CRITIC can use Ref to confirm "is this really the right call signature?" instead of being handed the whole SDK.
154
+
155
+ **Where:** add a "verify external APIs via Ref before implementing" step to each dev `read-inputs`/`implement` step file, and a "confirm uncertain API usage via Ref" note to `critic/edge-case-hunt.md`.
156
+
157
+ ### Lever 7 — Cache hygiene, distilled handoffs, batch API
158
+
159
+ - **Prompt-cache hygiene.** Caching is already in use, but cache hits require the static prefix to be **byte-identical** across spawns. Audit `buildPrompt` (`sprint.workflow.js:273-298`) so the **static** content (role prompt path, pipeline-context, handoff contract) comes first and the **dynamic** content (storyId, task subject, trigger) comes last. Verify realized cache-hit rate with the existing `audit` command (`src/commands/audit.js`) / `valent-review-cost` skill. Cached input is ~90% cheaper — a few percent of hit-rate is real money on a 4M-token sprint.
160
+ - **Distilled handoffs.** `distilled-handoff-format.md` already exists; make sure CRITIC/JUDGE read **distilled** artifacts, not raw full files, wherever the decision doesn't need the raw text.
161
+ - **Batch API (−50%)** for the latency-insensitive, non-interactive work: estimation/sizing in `plan`, retrospective synthesis, KB embeddings. These don't block a human, so the 50% batch discount is free money.
162
+
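The static-first ordering can be illustrated with a toy prompt builder. This is not the real `buildPrompt`; the field names and `buildPromptSketch` are hypothetical and only show the principle that two spawns for different stories should share a byte-identical prefix.

```javascript
// Static content first, dynamic content last, so the cacheable prefix is identical
// across spawns of the same role.
function buildPromptSketch(staticParts, dynamicParts) {
  const { rolePromptPath, pipelineContext, handoffContract } = staticParts
  const { storyId, taskSubject, trigger } = dynamicParts
  const staticPrefix = [`Role prompt: ${rolePromptPath}`, pipelineContext, handoffContract].join('\n')
  const dynamicSuffix = [`Story: ${storyId}`, `Task: ${taskSubject}`, `Trigger: ${trigger}`].join('\n')
  return `${staticPrefix}\n${dynamicSuffix}`
}

// Length of the common leading run of two prompts: a crude proxy for the cacheable span.
function sharedPrefixLength(a, b) {
  let i = 0
  while (i < a.length && i < b.length && a[i] === b[i]) i++
  return i
}
```

Comparing `sharedPrefixLength` across two real prompts for the same role is a cheap way to spot a dynamic field that has crept into the prefix before checking the realized hit rate with `audit`.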
163
+ ---
164
+
165
+ ## 4. Projected impact
166
+
167
+ Directional, stacking the levers (no loss of the Opus decision-backstop):
168
+
169
+ | Lever | Mechanism | Est. cost reduction |
170
+ |-------|-----------|---------------------|
171
+ | 1 — Deterministic pre-gates | Removes ~30–50% of CRITIC rejection cycles | **High** |
172
+ | 2 — Find vs. judge tiering | 3 CRITIC hunting passes → Sonnet (edge, acceptance) / Haiku (blind), triage stays Opus; JUDGE evidence → Sonnet | ~40–60% of CRITIC |
173
+ | 3 — Targeted re-review | Re-review ~20% of a full pass instead of 100% | High on rework tail |
174
+ | 4 — External reviewer | Blind-hunt → $0.25/file or free, off the Opus bill | Moderate |
175
+ | 5 — Provider arbitrage | Dev/blind work → subsidized flat-rate subs ($0 marginal) | **High** (can move the entire Sonnet dev line off API) |
176
+ | 6 — Ref MCP | Fewer rework loops + smaller context | Moderate, compounding |
177
+ | 7 — Cache/distill/batch | ~90% off cached input, 50% off batch | Moderate, broad |
178
+
179
+ **Realistic combined target: 50–70% reduction** in per-sprint API spend, with the irreducible Opus judgment (CRITIC triage, JUDGE ship decision, READINESS) preserved as the quality floor.
180
+
181
+ ---
182
+
183
+ ## 5. Resumability redesign
184
+
185
+ ### Current state and the gap
186
+
187
+ Resume = relaunch `Workflow({ scriptPath, resumeFromRunId })`; the journal replays the unchanged `agent()` prefix at ~100% cache hit. This is elegant **but**:
188
+ - **Same-session only / in-memory.** Context exhaustion that forces a new session, or a killed host process, loses the journal and the runId.
189
+ - **No durable ledger.** Nothing on disk says "story KANBAN-014 already SHIPPED in this sprint," so a from-scratch restart re-bills shipped stories.
190
+ - **A transient failure is fatal to sequential gates.** In a `for`-loop story (`sprint.workflow.js:312-315`), a network blip inside an awaited gate `throw`s and ends the run (unlike `parallel`, which drops to `null`).
191
+
192
+ ### Fix A — Idempotent, on-disk phase checkpoints (cross-session resume)
193
+
194
+ Persist a durable run ledger in the existing SQLite DB (extends the `artifacts`/`calibration` tables in `src/lib/db.js`). Add a CLI command:
195
+
196
+ ```
197
+ node .valent-pipeline/bin/cli.js checkpoint --sprint <id> --story <id> --phase <spec|build|critic|qa|judge> --verdict <pass|ship|reject>
198
+ ```
199
+
200
+ An `agent()` at the end of each phase writes the checkpoint (IO inside an agent = journal-safe). Then **make each phase idempotent at the top of `runStory`**: before running a phase, query the ledger; if a terminal verdict already exists for `(sprint, story, phase)`, **skip and reuse the on-disk artifact**. This is exactly the `groomed`-bypass pattern (`sprint.workflow.js` skips Spec+Readiness when `groomed:true`) — generalized to every phase. Result: a fresh run **after total session loss** naturally fast-forwards to the first incomplete phase, with no journal required.
201
+
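The skip-if-done check might look like the following sketch. An in-memory `Map` stands in for the SQLite ledger, and treating only `pass` and `ship` as terminal is an assumption (a `reject` would still need rework); `createLedger` and `runPhaseOnce` are hypothetical names.

```javascript
// In-memory stand-in for the durable ledger, keyed by (sprint, story, phase).
function createLedger() {
  const rows = new Map()
  const key = (sprint, story, phase) => `${sprint}/${story}/${phase}`
  return {
    get: (sprint, story, phase) => rows.get(key(sprint, story, phase)),
    put: (sprint, story, phase, verdict) => rows.set(key(sprint, story, phase), verdict),
  }
}

// Assumed terminal verdicts: a phase that passed or shipped is never re-run.
const TERMINAL = new Set(['pass', 'ship'])

// Run a phase only if the ledger holds no terminal verdict for it.
async function runPhaseOnce(ledger, { sprint, story, phase }, runPhase) {
  const prior = ledger.get(sprint, story, phase)
  if (TERMINAL.has(prior)) return { verdict: prior, skipped: true }
  const verdict = await runPhase()
  ledger.put(sprint, story, phase, verdict)
  return { verdict, skipped: false }
}
```

In the real design the `get` and `put` calls would happen inside agents via the `checkpoint` CLI command, since the orchestrator script cannot do IO directly.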
202
+ ### Fix B — Retry-with-backoff wrapper for transient failures
203
+
204
+ Wrap every gate/agent call in a bounded retry so a network glitch doesn't kill a 4M-token run:
205
+
206
+ ```js
207
+ async function withRetry(fn, { tries = 3, label } = {}) {
208
+ let lastErr
209
+ for (let i = 0; i < tries; i++) {
210
+ try { return await fn() }
211
+ catch (e) { lastErr = e; log(`retry ${label} (${i + 1}/${tries}): ${e.message}`) }
212
+ }
213
+ // exhausted: checkpoint progress and surface the runId so the user can resume cross-session
214
+ log(`PAUSED at ${label}. Resume with resumeFromRunId, or restart — completed phases will be skipped via the ledger.`)
215
+ throw lastErr
216
+ }
217
+ ```
218
+
219
+ (Keep it deterministic — no `Date.now`/`Math.random` in the script body; the resume-safety linter at `scripts/test-workflow.js` enforces this.)
220
+
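The wrapper above retries immediately. If a delay between attempts is wanted, one option is a fixed exponential schedule with no jitter, which keeps the script free of `Math.random`. This is a sketch under assumptions: `backoffDelays` and `withRetryBackoff` are hypothetical, `sleep` is injected by the caller, and whether timers are permitted inside a Workflow script is not something the source confirms.

```javascript
// Deterministic exponential schedule: baseMs, 2*baseMs, 4*baseMs, ... capped at capMs.
// One delay per gap between attempts, so `tries` attempts yield `tries - 1` delays.
function backoffDelays(tries, baseMs = 1000, capMs = 30000) {
  const delays = []
  for (let i = 0; i < tries - 1; i++) delays.push(Math.min(capMs, baseMs * 2 ** i))
  return delays
}

// Default sleep uses setTimeout; callers can inject their own (assumption: the
// host environment allows timers here).
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

async function withRetryBackoff(fn, { tries = 3, label, sleep = defaultSleep, log = console.log } = {}) {
  const delays = backoffDelays(tries)
  let lastErr
  for (let i = 0; i < tries; i++) {
    try {
      return await fn()
    } catch (e) {
      lastErr = e
      log(`retry ${label} (${i + 1}/${tries}): ${e.message}`)
      if (i < delays.length) await sleep(delays[i]) // no wait after the final attempt
    }
  }
  throw lastErr
}
```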
221
+ ### Fix C — Make the runId loud and durable
222
+
223
+ On every story boundary, `log()` the `runId` and write it into the SQLite ledger so it survives the session. The resume instruction becomes a one-liner the user can always recover, and even if the runId is lost, **Fix A** makes a cold restart safe.
224
+
225
+ **Net:** three independent safety nets — journal replay (fast path, same session), the idempotent ledger (cross-session/cold-start), and bounded retry (transient blips) — so no single failure mode forces a 2–4M-token redo.
226
+
227
+ ---
228
+
229
+ ## 6. Proposed `pipeline-config.yaml` surface (consolidated)
230
+
231
+ ```yaml
232
+ runtime:
233
+ provider: claude-code
234
+
235
+ # Lever 2 — find vs. judge tiering (CRITIC/JUDGE split into sub-roles)
236
+ models:
237
+ haiku: [RESOLVE, CRITIC_BLIND]
238
+ sonnet: [REQS, UXA, QA-A, QA-B, BEND, FEND, DATA, IAC, MCP-DEV, LIBDEV, DOCGEN, MOBILE,
239
+ CRITIC_EDGE, CRITIC_ACCEPT, JUDGE_EVIDENCE]
240
+ opus: [READINESS, CRITIC_TRIAGE, JUDGE_DECIDE, INTEGRATION]
241
+
242
+ # Lever 5 — multi-provider runners (subsidized arbitrage)
243
+ runners:
244
+ codex: { cmd: "codex exec", auth: subscription }
245
+ cursor: { cmd: "cursor-agent", auth: subscription }
246
+ grok: { cmd: "grok", auth: subscription }
247
+ roles:
248
+ BEND: { runner: codex, fallback: claude:sonnet }
249
+ CRITIC_BLIND: { runner: cursor, fallback: claude:haiku }
250
+ CRITIC_TRIAGE: { runner: claude, model: opus }
251
+ JUDGE_DECIDE: { runner: claude, model: opus }
252
+
253
+ # Lever 1 — deterministic pre-gate (blocks CRITIC until green)
254
+ pre_critic_gate:
255
+ blocking: true
256
+ commands: ["npm run lint", "npm run typecheck", "semgrep --config auto --error"]
257
+
258
+ # Lever 4 — external reviewer as blind-hunt
259
+ external_review:
260
+ blind_hunt: coderabbit # coderabbit | pr-agent | native | claude
261
+
262
+ # Lever 3 — bounded, targeted rework
263
+ rejection:
264
+ max_cycles: 2 # was 5
265
+ targeted_reverify: true # re-check only previously-failed findings
266
+
267
+ # Lever 6 — Ref MCP for dev + critic
268
+ ref_mcp: { enabled: true, roles: [BEND, FEND, DATA, CRITIC_EDGE] }
269
+
270
+ # Resumability
271
+ resume:
272
+ ledger: sqlite # durable cross-session phase checkpoints
273
+ retry: { tries: 3 }
274
+ ```
275
+
276
+ All of this is overlay-on-default and journal-replay-safe (static + args only), consistent with `buildModelMap` (`sprint.workflow.js:231-241`).
277
+
278
+ ---
279
+
280
+ ## 7. Phased rollout
281
+
282
+ | Phase | Change | Effort | Risk | Payoff |
283
+ |-------|--------|--------|------|--------|
284
+ | **0** | Turn on **prompt-cache hygiene** + verify hit-rate via `audit`; add **retry wrapper**; **lower `max_cycles` to 3** | S | Low | Quick cost + stability win |
285
+ | **1** | **Deterministic pre-critic gate** (lint/types/semgrep) | S–M | Low | Biggest single cost cut (kills rework cycles) |
286
+ | **2** | **Idempotent SQLite checkpoint ledger** (cross-session resume) | M | Low | Fixes Problem 1 fully |
287
+ | **3** | **Split CRITIC/JUDGE** into sub-roles + **find-vs-judge tiering** | M | Med (validate quality holds) | ~40–60% CRITIC cut |
288
+ | **4** | **Targeted re-verify** on rejection | M | Med | Collapses the cost tail |
289
+ | **5** | **External reviewer** (CodeRabbit CLI or self-hosted PR-Agent) as blind-hunt | M | Low | Moves a pass off the Opus bill |
290
+ | **6** | **Ref MCP** wired into dev + critic | S | Low | Fewer reworks, smaller context |
291
+ | **7** | **`runnerFor()` provider abstraction** → Codex/Cursor/Grok arbitrage | L | Med–High | Largest structural saving; multi-harness future |
292
+
293
+ Phases 0–2 are low-risk and address both problems immediately. Phase 7 is the strategic bet that unlocks the "many subsidized providers" future you described — and it's safe to attempt *because the gates already protect quality*, which is the whole premise.
294
+
295
+ ---
296
+
297
+ ## 8. Risks & guardrails
298
+
299
+ - **Quality regression from cheaper/foreign models.** Mitigated by design: the Opus judgment gates are the floor. Track rejection-rate and rework-cycle counts per tier in the existing `calibrate` data — if a Sonnet/Codex pass drives rework up, the savings evaporate and you escalate that role back up. Make the tiering **measured, not assumed.**
300
+ - **Determinism / journal safety.** All external-CLI and DB calls must stay **inside agents**; the orchestrator script stays pure (no `Date.now`/`Math.random`/IO) so resume replay holds. The linter at `scripts/test-workflow.js` already enforces this — keep it green.
301
+ - **External-tool auth in headless/cron runs.** Codex/Cursor/Grok CLIs and CodeRabbit need their own auth; subscription-based CLIs may not be present in unattended runs. Always define a **`fallback` to a native Claude agent** so a missing CLI degrades instead of failing.
302
+ - **Cost monitoring must lead.** Before/after every phase, run `valent-review-cost` / `audit` on the Workflow journals so each lever's real savings is measured, not guessed.
303
+
304
+ ---
305
+
306
+ ## 9. TL;DR
307
+
308
+ 1. **Stop paying Opus to find mechanical bugs.** Put free linters/Semgrep/Ref in front of CRITIC, and move CRITIC's *hunting* passes to Sonnet/Haiku (or a flat-rate external reviewer / subsidized CLI). Keep Opus only for the *triage* and *ship* decisions.
309
+ 2. **Kill the rejection-loop tail** — targeted re-verify + lower cap. It's the #1 cost center.
310
+ 3. **Arbitrage subsidized providers** (Codex/Cursor/Grok via headless CLI) for the bulk build/find work; the Opus gate makes the quality variance safe. This is also the on-ramp to the multi-harness future.
311
+ 4. **Make resume durable** with an idempotent SQLite phase ledger + retry wrapper, so a glitch or context wipe never re-bills a 2–4M-token sprint.
312
+
313
+ Combined target: **50–70% lower spend** and **no full restarts**, with the quality backstop untouched.
@@ -148,6 +148,15 @@ if (!stories.length || !sprintId || typeof velocity !== 'number') {
148
148
  throw new Error('args must include { stories:[{storyId,projectType}], sprintId, velocity }')
149
149
  }
150
150
 
151
+ // A candidate arrives EITHER ungroomed (needs the full groom -> size pass below) OR already groomed
152
+ // — a leftover from a PRIOR sprint's grooming that this sprint only needs to PACK. The caller flags
153
+ // the latter `groomed: true` and carries its `profiles` from the backlog (resolveEligibleStories
154
+ // surfaces these as `groomedBuffer`). We groom only the ungroomed set; the buffer is folded straight
155
+ // into the pack + return so it executes WITHOUT being re-specced or re-gated. Skipping this is what
156
+ // stranded the buffer before: nothing drained the groomed overflow once a sprint over-groomed.
157
+ const toGroom = stories.filter((s) => !s.groomed)
158
+ const buffer = stories.filter((s) => s.groomed)
159
+
151
160
  // --- per-agent model tiers ----------------------------------------------------
152
161
  // Tiers come from pipeline-config.yaml `models` (a tier->roles map), passed in as
153
162
  // args.models by the invoking skill — a Workflow script can't read files. We invert it
@@ -230,36 +239,44 @@ phase('Groom')
230
239
  // story bounces off the profile gate and eats a re-tag + full re-review cycle. So derive once here,
231
240
  // write the backlog in a SINGLE agent (one writer => no race on the shared YAML), and reuse the same
232
241
  // profiles for the in-memory flow below so the backlog tag, UXA-skip, and sizing can never diverge.
233
- const tag = await agent(
234
- [
235
- 'You are the grooming orchestrator performing **Step 0: Pre-Grooming Profile Tagging**, before any spec agent runs.',
236
- '',
237
- 'For EACH story below: read `stories/{storyId}/story.md` (its ACs + scope) and derive its `testing_profiles` using these criteria (multiple may apply):',
238
- '- `api` owns API endpoints, backend routes, business logic, or database changes',
239
- '- `ui` — owns UI components, pages, or visual elements',
240
- '- `data-pipeline` ETL, data transformation, or batch processing',
241
- '- `mcp-server` — MCP server tools, handlers, or protocol work',
242
- '- `library` shared library/package (exports, packaging, versioning)',
243
- '- `document-generation` — document/report template or generation pipeline work',
244
- '- `iac` — infrastructure (Terraform, CloudFormation, Kubernetes, CI/CD)',
245
- "Tag a profile only when the story OWNS that surface. A story that merely CONSUMES another story's API endpoint (no endpoint/DB change of its own) is NOT `api`.",
246
- '',
247
- `Then write \`testing_profiles: [...]\` onto each story's entry in \`${backlogPath}\` preserve every other field and the item order. If an entry already has testing_profiles, only correct it when a required profile is missing.`,
248
- '',
249
- `Stories: ${JSON.stringify(stories.map((s) => s.storyId))}.`,
250
- '',
251
- 'Return ONLY { stories: [{ storyId, testing_profiles:[...] }, ...] } as JSON, covering every story above.',
252
- ].join('\n'),
253
- { label: 'pre-groom-profile-tag', phase: 'Groom', schema: PROFILE_TAG_SCHEMA, model: modelFor('PERSIST') },
254
- )
255
- const profileById = new Map(
256
- (tag.stories || []).map((s) => [s.storyId, Array.isArray(s.testing_profiles) ? s.testing_profiles : []]),
257
- )
258
- log(`tagged testing_profiles on ${profileById.size}/${stories.length} stories before the readiness gate`)
242
+ // Tag only the ungroomed set — buffer stories already carry `testing_profiles` on the backlog (and
243
+ // in args), so re-deriving them is wasted work. A pure buffer-drain sprint (nothing to groom) skips
244
+ // this agent entirely.
245
+ const profileById = new Map()
246
+ if (toGroom.length) {
247
+ const tag = await agent(
248
+ [
249
+ 'You are the grooming orchestrator performing **Step 0: Pre-Grooming Profile Tagging**, before any spec agent runs.',
250
+ '',
251
+ 'For EACH story below: read `stories/{storyId}/story.md` (its ACs + scope) and derive its `testing_profiles` using these criteria (multiple may apply):',
252
+ '- `api` — owns API endpoints, backend routes, business logic, or database changes',
253
+ '- `ui` — owns UI components, pages, or visual elements',
254
+ '- `data-pipeline` ETL, data transformation, or batch processing',
255
+ '- `mcp-server` — MCP server tools, handlers, or protocol work',
256
+ '- `library` shared library/package (exports, packaging, versioning)',
257
+ '- `document-generation` — document/report template or generation pipeline work',
258
+ '- `iac` — infrastructure (Terraform, CloudFormation, Kubernetes, CI/CD)',
259
+ "Tag a profile only when the story OWNS that surface. A story that merely CONSUMES another story's API endpoint (no endpoint/DB change of its own) is NOT `api`.",
260
+ '',
261
+ `Then write \`testing_profiles: [...]\` onto each story's entry in \`${backlogPath}\` — preserve every other field and the item order. If an entry already has testing_profiles, only correct it when a required profile is missing.`,
262
+ '',
263
+ `Stories: ${JSON.stringify(toGroom.map((s) => s.storyId))}.`,
264
+ '',
265
+ 'Return ONLY { stories: [{ storyId, testing_profiles:[...] }, ...] } as JSON, covering every story above.',
266
+ ].join('\n'),
267
+ { label: 'pre-groom-profile-tag', phase: 'Groom', schema: PROFILE_TAG_SCHEMA, model: modelFor('PERSIST') },
268
+ )
269
+ for (const s of tag.stories || []) {
270
+ profileById.set(s.storyId, Array.isArray(s.testing_profiles) ? s.testing_profiles : [])
271
+ }
272
+ log(`tagged testing_profiles on ${profileById.size}/${toGroom.length} stories before the readiness gate`)
273
+ }
274
+ if (buffer.length) log(`draining ${buffer.length} groomed buffer stor${buffer.length === 1 ? 'y' : 'ies'} (specced in a prior sprint — pack only)`)
259
275
 
260
276
  // Pipelined: spec agents don't touch code, so stories flow assembly-line through the stages.
277
+ // Only the ungroomed set is groomed here; the already-groomed buffer skips straight to packing.
261
278
  const groomed = await pipeline(
262
- stories,
279
+ toGroom,
263
280
  // Stage 1: REQS produces reqs-brief.md. Profiles are already derived + on the backlog (Step 0);
264
281
  // we pass them in so REQS selects the right profile-applicable sections (draft-brief.md).
265
282
  async (story) => {
@@ -314,12 +331,30 @@ const groomed = await pipeline(
314
331
  buildPrompt({ role: target, promptFile: `${target.toLowerCase()}.md`, storyId: g.storyId, taskSubject: 'Address the READINESS rejection and rewrite the affected spec.' }),
  { label: `rework:${target.toLowerCase()}:${g.storyId}`, phase: 'Groom', schema: HANDOFF_SCHEMA, model: modelFor(target) },
  )
+ // Cascade re-derivation downstream. Grooming is a derivation chain (REQS -> UXA -> QA-A): when an
+ // upstream spec is reworked, the specs derived from it are now stale and CONTRADICT the correction.
+ // Re-running only the rejection target leaves that staleness in place, so the next gate just rejects
+ // again on the downstream artifact — a guaranteed wasted reject/rework cycle per contract fix. Re-derive
+ // every dependent spec here, before re-gating, so the whole chain re-syncs in one pass.
+ const up = target.toUpperCase()
+ if (up === 'REQS' && g.profiles.includes('ui')) {
+ await agent(
+ buildPrompt({ role: 'UXA', promptFile: 'uxa.md', storyId: g.storyId, taskSubject: 'Re-derive uxa-spec.md from the reworked reqs-brief.md — the upstream brief changed during readiness rework, so re-sync the spec to it.' }),
+ { label: `cascade:uxa:${g.storyId}`, phase: 'Groom', schema: HANDOFF_SCHEMA, model: modelFor('UXA') },
+ )
+ }
+ if (up === 'REQS' || up === 'UXA') {
+ await agent(
+ buildPrompt({ role: 'QA-A', promptFile: 'qa-a.md', storyId: g.storyId, taskSubject: 'Re-derive qa-test-spec.md from the reworked upstream spec — reqs-brief/uxa-spec changed during readiness rework, so re-sync the test spec to the corrected contract.' }),
+ { label: `cascade:qa-a:${g.storyId}`, phase: 'Groom', schema: HANDOFF_SCHEMA, model: modelFor('QA-A') },
+ )
+ }
  }
  },
  )
 
  const readyStories = groomed.filter(Boolean).filter((g) => g.groomedStatus === 'groomed')
- log(`groomed ${readyStories.length}/${stories.length} stories`)
+ log(`groomed ${readyStories.length}/${toGroom.length} stories`)
 
  phase('Size')
  // Each story is sized by every estimator whose profile is present; story_points = the sum.
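
The cascade rule added in the hunk above (REQS -> UXA -> QA-A) can be read as a small pure function. The following is a minimal sketch, not code from the package: `cascadeTargets` is a hypothetical helper name, and it only mirrors the two `if` conditions shown in the diff, assuming `profiles` is an array of profile strings.

```javascript
// Hypothetical helper (not part of valent-pipeline): given the spec that a
// READINESS rejection targeted and the story's testing profiles, list the
// downstream specs that would need re-deriving under the cascade rule.
function cascadeTargets(target, profiles) {
  const up = String(target).toUpperCase()
  const downstream = []
  // uxa-spec.md derives from reqs-brief.md, but only stories tagged `ui` have one.
  if (up === 'REQS' && profiles.includes('ui')) downstream.push('UXA')
  // qa-test-spec.md derives from both upstream specs, so either rework makes it stale.
  if (up === 'REQS' || up === 'UXA') downstream.push('QA-A')
  return downstream
}

console.log(cascadeTargets('reqs', ['ui', 'api'])) // UXA, then QA-A
console.log(cascadeTargets('REQS', ['api']))       // QA-A only
console.log(cascadeTargets('uxa', ['ui']))         // QA-A only
console.log(cascadeTargets('qa-a', ['ui']))        // nothing downstream
```

The order of the returned list matters: UXA is re-derived before QA-A so that the test spec is rebuilt against the already-corrected UX spec.
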
@@ -395,11 +430,26 @@ if (!validation.valid) {
  // `groomed: true` — these stories were fully specced + passed the READINESS gate during grooming
  // above, and their reqs-brief/uxa-spec/qa-test-spec are on disk. sprint.workflow.js reads this flag
  // and SKIPS its Spec + Readiness phases, so execution doesn't redundantly re-spec and re-gate.
- const sizedById = new Map(sizedStories.map((s) => [s.storyId, s]))
+ // Metadata for EVERY packable story: the ones groomed this run (sizedStories) AND the pre-existing
+ // groomed buffer (from args). sprint-pack reads the whole backlog, so `pack.sprint_stories` can name
+ // buffer stories too — they must be returned with their projectType/profiles or sprint.workflow.js
+ // has nothing to execute. (Building this only from sizedStories is exactly what silently dropped the
+ // buffer before: packed + tagged sprint-planned, but never handed downstream.)
+ const metaById = new Map(sizedStories.map((s) => [s.storyId, { projectType: s.projectType, profiles: s.profiles }]))
+ for (const b of buffer) {
+ if (!metaById.has(b.storyId)) {
+ metaById.set(b.storyId, { projectType: b.projectType, profiles: Array.isArray(b.profiles) ? b.profiles : [] })
+ }
+ }
+ const missingMeta = pack.sprint_stories.filter((id) => !metaById.has(id))
+ if (missingMeta.length) {
+ log(`⚠ packed but NOT executable — no projectType/profiles passed for: ${missingMeta.join(', ')}. ` +
+ `These are groomed in the backlog but were not handed to plan as candidates or buffer; they will be ` +
+ `tagged sprint-planned yet skipped this sprint. Pass them (with profiles) via groomedBuffer.`)
+ }
  const plannedStories = pack.sprint_stories
- .map((id) => sizedById.get(id))
- .filter(Boolean)
- .map((s) => ({ storyId: s.storyId, projectType: s.projectType, profiles: s.profiles, groomed: true }))
+ .filter((id) => metaById.has(id))
+ .map((id) => ({ storyId: id, projectType: metaById.get(id).projectType, profiles: metaById.get(id).profiles, groomed: true }))
 
  return {
  sprintId,