valent-pipeline 0.6.1 → 0.6.2

package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "valent-pipeline",
- "version": "0.6.1",
+ "version": "0.6.2",
  "description": "v3 multi-agent AI pipeline for software development lifecycle",
  "type": "module",
  "bin": {
@@ -0,0 +1,313 @@
1
+ # Workflow Improvement Plan — Cost & Resumability
2
+
3
+ > **Status:** Proposal / design doc
4
+ > **Scope:** `valent-pipeline` orchestrators (`plan` / `sprint` / `retro`), gate model assignment, and the path to a multi-provider runner.
5
+ > **Author's thesis (yours, and it's correct):** *Because the pipeline is gated, cheap models can do far more of the work.* The gates are the quality backstop, so the expensive model only needs to show up where **judgment** happens — not where **finding** or **building** happens. Everything below is an elaboration of that one idea, plus a resumability layer so a 2–4M-token sprint never has to start over.
6
+
7
+ ---
8
+
9
+ ## 1. The two problems, stated precisely
10
+
11
+ | # | Problem | Root cause in the current design |
12
+ |---|---------|----------------------------------|
13
+ | 1 | **Resumability** after a network glitch or context exhaustion | Resume relies on the Workflow journal + `resumeFromRunId`, which is **same-session only** and **in-memory**. If the session itself dies (context exhaustion → new session) or the host process is killed, the journal/runId is gone and there is no durable, cross-session ledger of "which stories already shipped." A single transient agent failure inside a sequential gate **throws and kills the whole run** rather than retrying. |
14
+ | 2 | **Cost** — ~1M tokens per planning run, 2–4M per sprint | Opus is the default for **every gate** (`READINESS`, `CRITIC`, `JUDGE`, `INTEGRATION`), and the most expensive gate — `CRITIC` — runs **3 full Opus passes + an Opus triage**, and **re-runs all of that** on every rejection cycle (up to `maxRejectionCycles`, default 5). The rejection loop tail is the single largest cost center. Nothing deterministic (linters, type-checkers, static analysis) runs *before* Opus, so Opus pays tokens to find mechanical issues a free tool would catch. |
15
+
16
+ These interact: a robust **idempotent checkpoint** layer (Problem 1) *also* saves cost, because a resumed sprint skips already-shipped stories instead of re-billing them.
17
+
18
+ ---
19
+
20
+ ## 2. Cost anatomy — where the tokens actually go
21
+
22
+ **Current model assignment** (`sprint.workflow.js:224-230`):
23
+
24
+ ```js
25
+ const DEFAULT_MODELS = {
26
+ READINESS: 'opus', CRITIC: 'opus', JUDGE: 'opus', INTEGRATION: 'opus', // ← gates: all Opus
27
+ REQS: 'sonnet', UXA: 'sonnet', 'QA-A': 'sonnet', 'QA-B': 'sonnet',
28
+ BEND: 'sonnet', FEND: 'sonnet', DATA: 'sonnet', 'MCP-DEV': 'sonnet',
29
+ LIBDEV: 'sonnet', DOCGEN: 'sonnet', IAC: 'sonnet', MOBILE: 'sonnet',
30
+ RESOLVE: 'haiku',
31
+ }
32
+ ```
33
+
34
+ **Current API prices (2026):**
35
+
36
+ | Model | Input $/M | Output $/M | Cached input $/M (≈90% off) | Batch (−50%) |
37
+ |-------|----------:|-----------:|----------------------------:|-------------:|
38
+ | Opus 4.8 | $5 | $25 | ~$0.50 | $2.50 / $12.50 |
39
+ | Sonnet 4.6 | $3 | $15 | ~$0.30 | $1.50 / $7.50 |
40
+ | Haiku 4.5 | $1 | $5 | ~$0.10 | $0.50 / $2.50 |
41
+
42
+ **Per-story Opus cost, directional, *zero rework*** (input largely uncached on first pass):
43
+
44
+ | Gate | Opus calls | Rough $ / story |
45
+ |------|-----------:|----------------:|
46
+ | READINESS (1 pass) | 1 | ~$0.28 |
47
+ | **CRITIC (3 passes + triage)** | **4** | **~$1.25** |
48
+ | JUDGE (1 pass) | 1 | ~$0.33 |
49
+ | **Total Opus (no rework)** | **6** | **~$1.86** |
50
+
51
+ Now add the loop. **Each CRITIC rejection cycle re-runs all 3 passes + triage** (`runCriticGate`, `sprint.workflow.js:519-566`). Two cycles ≈ **+$2.50**, which lifts CRITIC from ~67% of the story's Opus spend (no rework) to ~86%. **The rejection-loop tail is the thing to attack first.**
52
+
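The loop arithmetic can be checked directly from the per-story table. The sketch below is illustrative only: `opusSpend` is a hypothetical helper, the dollar figures are the table's rough estimates, and it assumes each rejection cycle costs exactly one more full CRITIC run.

```javascript
// Rough per-story Opus dollars, copied from the "Per-story Opus cost" table.
const GATE_COST = { READINESS: 0.28, CRITIC: 1.25, JUDGE: 0.33 }

// Hypothetical helper: every rejection cycle re-runs one full CRITIC (3 passes + triage).
function opusSpend(rejectionCycles) {
  const critic = GATE_COST.CRITIC * (1 + rejectionCycles)
  const total = GATE_COST.READINESS + critic + GATE_COST.JUDGE
  return { critic, total, criticShare: critic / total }
}

for (const cycles of [0, 2, 5]) {
  const { total, criticShare } = opusSpend(cycles)
  const pct = (criticShare * 100).toFixed(0)
  console.log(`${cycles} rejection cycles: $${total.toFixed(2)} total, CRITIC ${pct}% of Opus spend`)
}
```

Because CRITIC is the only term that scales with the cycle count, its share of Opus spend can only rise as rework accumulates.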
53
+ > **Takeaway:** Of the 6 Opus calls per clean story, only **3 are genuine judgment calls** (the READINESS decision, the CRITIC *triage*, the JUDGE *ship decision*). The other 3 (the CRITIC *hunting* passes) are *finding* work that a cheaper model — or a free deterministic tool — can do most of. That gap is the savings.
54
+
55
+ ---
56
+
57
+ ## 3. Cost levers, highest ROI first
58
+
59
+ ### Lever 1 — Deterministic pre-gates (free, non-LLM) before any Opus runs
60
+
61
+ Run linters, type-checkers, and static analysis as a **blocking gate between Build and CRITIC**. Dev agents must produce a clean lint/type/compile/test-build before a single Opus token is spent reviewing.
62
+
63
+ | Tool | Cost | Catches (deterministically) |
64
+ |------|------|------------------------------|
65
+ | **Semgrep** | Free ≤10 contributors (OSS engine free) | Injection, secrets, taint flows, unsafe APIs, and other security rules |
66
+ | **ESLint / tsc** | Free | Type errors, null/undefined, unused, unreachable, many real bugs |
67
+ | **Ruff / Bandit** (Python) | Free | Lint + Python security |
68
+ | Compiler / `tsc --noEmit` | Free | Whole class of "doesn't even type-check" rejections |
69
+
70
+ **Why this is #1:** It directly shrinks the **rejection-loop tail** — the most expensive thing in the system. Every mechanical defect a linter catches is a CRITIC rejection cycle (3 Opus passes + triage) that *never happens*. It also lets CRITIC's prompt explicitly say *"lint/types/security-static already passed; focus only on semantic correctness and AC coverage,"* which shrinks each remaining Opus pass.
71
+
72
+ **Where:** new `pre-critic-gate` step invoked in `runStory` between the Build barrier and `runCriticGate` (`sprint.workflow.js`, around the Build→Critic transition). The dev agents already run tests; add the static suite to their `handoff` step and make CRITIC refuse to start until the pre-gate verdict is green.
73
+
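A minimal sketch of how the pre-gate verdict could be derived, assuming a shell-driving agent hands back one `{ command, exitCode, output }` record per configured command. That record shape and the name `summarizePreGate` are assumptions; the returned object follows the `PRE_GATE_SCHEMA` fields that appear in the workflow source in this diff.

```javascript
// Hypothetical reducer: the exit code is the verdict, no model judgment involved.
function summarizePreGate(story, results) {
  const checks = results.map(({ command, exitCode, output = '' }) => ({
    command,
    ok: exitCode === 0,
    // Keep only the first output line so the verdict stays small in context.
    summary: String(output).split('\n')[0].slice(0, 200),
  }))
  return {
    schema: 1,
    story,
    // With no commands configured, every() is true, so the gate is a no-op pass.
    verdict: checks.every((check) => check.ok) ? 'pass' : 'fail',
    checks,
  }
}

const preGate = summarizePreGate('KANBAN-014', [
  { command: 'npm run lint', exitCode: 0, output: 'no problems' },
  { command: 'npm run typecheck', exitCode: 2, output: 'error TS2322: type mismatch\n  at src/x.ts' },
])
console.log(preGate.verdict, preGate.checks.filter((check) => !check.ok).map((check) => check.command))
```

CRITIC would start only when `verdict` is `pass`; a `fail` routes the failing commands back to the dev agent without spending an Opus pass.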
74
+ ### Lever 2 — Split *find* from *judge*; make Opus the exception-handler, not the default
75
+
76
+ Re-tier `CRITIC` so the **hunting** passes run cheap and only **judgment** runs on Opus:
77
+
78
+ ```yaml
79
+ # pipeline-config.yaml (proposed `models` override — already supported by buildModelMap)
80
+ models:
81
+ haiku: [RESOLVE, CRITIC_BLIND] # blind-hunt is mechanical: naming, dead code, copy-paste
82
+ sonnet: [REQS, UXA, QA-A, QA-B, BEND, FEND, ...,
83
+ CRITIC_EDGE, CRITIC_ACCEPT, # edge-case + acceptance-audit on Sonnet
84
+ JUDGE_EVIDENCE] # evidence cross-referencing is mechanical counting
85
+ opus: [CRITIC_TRIAGE, # adjudication = judgment → keep Opus
86
+ JUDGE_DECIDE, READINESS] # the actual ship/ready decisions
87
+ ```
88
+
89
+ This requires splitting the monolithic `CRITIC` and `JUDGE` roles into sub-roles the model map can target (the 3-pass structure already exists as separate agents at `sprint.workflow.js:519-566` — give each pass its own role key). The triage step stays Opus because that's where duplicates collapse and severity is decided — the genuine judgment.
90
+
91
+ **Escalation rule (the gate-thesis, made literal):** run the cheap pass; **escalate a finding to Opus only when** the cheap model marks it High-severity *or* flags low confidence. A clean diff that passes lint + Sonnet review + full AC coverage may need **zero Opus** — Opus becomes an adjudicator invoked on exceptions, not the default reviewer. Projected CRITIC cut: **~40–60%** with no loss of the Opus backstop on actual decisions.
92
+
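The escalation rule reduces to a small predicate. This is a sketch under stated assumptions: the finding fields (`severity`, `confidence`), the 0.6 threshold, and both function names are hypothetical and not taken from the pipeline's schemas.

```javascript
// Assumed finding shape: { id, severity: 'low' | 'medium' | 'high', confidence: 0..1 }.
function shouldEscalate(finding, { minConfidence = 0.6 } = {}) {
  // A missing confidence score is treated as low confidence (conservative choice).
  const confidence = finding.confidence ?? 0
  return finding.severity === 'high' || confidence < minConfidence
}

// Split cheap-pass findings into those Opus must adjudicate and those accepted as-is.
function partitionFindings(findings, opts) {
  const escalate = []
  const accept = []
  for (const finding of findings) {
    const bucket = shouldEscalate(finding, opts) ? escalate : accept
    bucket.push(finding)
  }
  return { escalate, accept }
}

const triageInput = partitionFindings([
  { id: 'F1', severity: 'high', confidence: 0.9 },
  { id: 'F2', severity: 'low', confidence: 0.3 },
  { id: 'F3', severity: 'medium', confidence: 0.8 },
])
console.log('to Opus:', triageInput.escalate.map((f) => f.id), 'kept:', triageInput.accept.map((f) => f.id))
```

An empty `escalate` list is the "zero Opus" case described above: nothing was severe or uncertain enough to need adjudication.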
93
+ Apply the same split to `JUDGE`: Sonnet does the *evidence cross-reference* (does the test count match the spec, is every AC traced — mechanical counting), Opus makes only the final SHIP/REJECT call, and only when evidence is ambiguous.
94
+
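One way to make "ambiguous evidence" concrete, sketched against the `JUDGE_EVIDENCE_SCHEMA` fields that appear in the workflow source in this diff (`recommendation`, `testsFailed`, `openP1toP3`, `discrepancies`). The definition of ambiguity used here and the name `needsOpusDecision` are assumptions, not the pipeline's rule.

```javascript
// Hypothetical predicate: should the evidence pass hand off to the Opus decision step?
function needsOpusDecision(evidence) {
  const { recommendation, testsFailed = 0, openP1toP3 = 0, discrepancies = [] } = evidence
  if (recommendation === 'unsure') return true
  if (discrepancies.length > 0) return true
  // A "ship" recommendation that contradicts its own counts is treated as ambiguous.
  return recommendation === 'ship' && (testsFailed > 0 || openP1toP3 > 0)
}

console.log(needsOpusDecision({ recommendation: 'ship', testsPassed: 42, testsFailed: 0 }))
console.log(needsOpusDecision({ recommendation: 'ship', testsFailed: 1 }))
console.log(needsOpusDecision({ recommendation: 'unsure' }))
```

A stricter policy could route every case to Opus and use the evidence only as context; the predicate marks where that trade-off would be configured.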
95
+ ### Lever 3 — Targeted re-review on rejection (kill the loop tail)
96
+
97
+ Today a rejection re-runs **all 3 CRITIC passes + triage** (`runCriticGate`). Instead: **persist the open findings**, and on rework, run a **verify-only pass** that checks *just the previously-failed findings* against the new diff. Full re-hunt only if the dev's fix touched files outside the flagged set.
98
+
99
+ - Re-review cost drops from ~100% of a full CRITIC to ~20%.
100
+ - Combined with Lever 1 (fewer rejections to begin with) this collapses the cost tail.
101
+ - Lower `maxRejectionCycles` from 5 → 2–3 and **escalate to a human/Opus-once** instead of looping a 4th time; the 4th–5th cycles are almost never productive and are pure Opus burn.
102
+
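The verify-only decision above can be sketched as a set comparison between the files named in open findings and the files the rework touched. The finding shape (`{ id, file }`) and the name `planReReview` are hypothetical.

```javascript
// Decide between a cheap verify-only pass and a full 3-pass re-hunt after rework.
function planReReview(openFindings, changedFiles) {
  const flagged = new Set(openFindings.map((finding) => finding.file))
  const outside = changedFiles.filter((file) => !flagged.has(file))
  return outside.length === 0
    ? { mode: 'verify-only', findings: openFindings }
    : { mode: 'full-hunt', unflaggedFiles: outside }
}

const openFindings = [
  { id: 'F1', file: 'src/api/orders.js' },
  { id: 'F2', file: 'src/api/cart.js' },
]
console.log(planReReview(openFindings, ['src/api/orders.js']).mode)
console.log(planReReview(openFindings, ['src/api/orders.js', 'src/db/schema.js']).mode)
```

Comparing whole files is the coarsest safe granularity; a line-range comparison would trigger fewer full re-hunts but needs diff hunks as input.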
103
+ ### Lever 4 — Offload review to flat-rate / free external reviewers
104
+
105
+ Use an external reviewer as the **blind-hunt pass** (or a pre-CRITIC filter), so that pass costs a flat fee or $0 instead of Opus tokens:
106
+
107
+ | Tool | Pricing | Local/CLI | Needs your LLM key? | Catches |
108
+ |------|---------|-----------|---------------------|---------|
109
+ | **CodeRabbit CLI** | **$0.25 / file** reviewed (credit add-on); SaaS Pro $24/dev/mo, Pro+ $48 | CLI client → their cloud (needs API key, not offline) | No — flat per-file | Race conditions, memory leaks, security vulns, actionable fixes |
110
+ | **Qodo Merge** (ex-PR-Agent) | **Free tier 250 credits/mo**; Teams $30/user/mo (2500 credits). PR-Agent OSS is **self-hostable, BYO key** | Yes — IDE/local review + self-host/air-gapped | OSS path: yes (your key) | PR review, context-aware; Opus=5 credits/req, Grok-4=4 credits/req on managed tier |
111
+ | **Greptile** | $30/seat/mo, 50 reviews incl., then $1/review | Cloud; enterprise self-host | Managed | Codebase-context review |
112
+ | **Sourcery** | Free for OSS; Team $24/seat/mo (**BYO LLM** on Team) | Yes — CLI | Team tier: yes (your key) | **Rule-based refactoring** (deterministic) + AI review |
113
+ | **Native Claude Code `/code-review`** | In-session tokens only | Yes — runs on the local diff | Uses your Claude session | Correctness bugs + reuse/simplification at chosen effort |
114
+
115
+ **Recommendation:** CodeRabbit CLI ($0.25/file, predictable) or self-hosted **PR-Agent** (free, your key) as the **blind-hunt replacement**. They run on the diff, return structured findings, and slot straight into the existing `verdict.schema.json` gate machinery — the orchestrator doesn't care who produced the verdict as long as it matches the schema. The native `/code-review` skill is the zero-integration option for a first experiment.
116
+
117
+ ### Lever 5 — Multi-provider CLI arbitrage (Codex / Cursor / Grok) — *subsidized usage across providers*
118
+
119
+ This is the big structural play, and it's feasible **because Workflow `agent()` subagents have Bash**. An agent can shell out to another vendor's headless CLI, wait, and parse the result back into the handoff schema. The orchestrator stays pure (all IO happens *inside* the agent, which is journal-replay-safe); the subagent becomes a thin **"shell driver."**
120
+
121
+ **Subsidized runners to target:**
122
+ - `codex exec "<prompt>"` — OpenAI **Codex CLI**, non-interactive/headless, billable against a **ChatGPT Plus/Pro/Team subscription** (flat-rate, not per-token API).
123
+ - `cursor-agent` — **Cursor CLI** headless, subsidized by a Cursor subscription.
124
+ - **Grok CLI** — xAI, subsidized by an X/Grok subscription.
125
+
126
+ **The arbitrage:** route the cheap-but-bulky work (dev/build passes, blind-hunt, test scaffolding, doc lookups) to whichever flat-rate subscription you're *already paying for*, so those tokens are **$0 marginal**. Keep Claude **Opus only on the irreducible judgment gates**. If Codex/Cursor does the *building* on a flat subscription, the entire Sonnet dev line can leave the per-token API bill.
127
+
128
+ **Architecture — generalize `modelFor(role)` into `runnerFor(role)`:**
129
+
130
+ ```yaml
131
+ # pipeline-config.yaml (proposed)
132
+ runners:
133
+ codex: { cmd: "codex exec", auth: subscription } # flat-rate
134
+ cursor: { cmd: "cursor-agent", auth: subscription }
135
+ grok: { cmd: "grok", auth: subscription }
136
+ claude: { cmd: native } # in-session subagent
137
+ roles:
138
+ BEND: { runner: codex, fallback: claude:sonnet } # build on flat-rate sub
139
+ CRITIC_BLIND: { runner: cursor, fallback: claude:haiku } # blind hunt on flat-rate sub
140
+ CRITIC_TRIAGE: { runner: claude, model: opus } # judgment stays Opus
141
+ JUDGE_DECIDE: { runner: claude, model: opus }
142
+ ```
143
+
144
+ The handoff/verdict schemas (`schemas/handoff.schema.json`, `schemas/verdict.schema.json`) are the **normalization layer** — a Codex-produced review and an Opus-produced review are interchangeable to the gate logic as long as both emit the schema. Long external runs use `run_in_background: true` + `Monitor`.
145
+
146
+ **Guardrails (this is exactly your gate-thesis):** quality variance across providers is *fine* precisely because the Opus gate is the backstop. A cheaper/subsidized provider that produces weaker work just gets caught and rejected by CRITIC/JUDGE — you trade a little more rework risk for near-zero marginal token cost on the bulk. Keep a `fallback` so a missing/failing CLI degrades to a native Claude agent rather than killing the run.
147
+
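A sketch of the `runnerFor(role)` resolution with fallback, assuming the proposed `runners`/`roles` config has already been parsed into a plain object. The `isAvailable` predicate is a hypothetical injection point: probing for a CLI is IO, so the result would have to come from inside an agent rather than the orchestrator script.

```javascript
// Mirrors part of the proposed pipeline-config.yaml, already parsed.
const exampleConfig = {
  runners: {
    codex: { cmd: 'codex exec' },
    cursor: { cmd: 'cursor-agent' },
    claude: { cmd: 'native' },
  },
  roles: {
    BEND: { runner: 'codex', fallback: 'claude:sonnet' },
    CRITIC_BLIND: { runner: 'cursor', fallback: 'claude:haiku' },
    CRITIC_TRIAGE: { runner: 'claude', model: 'opus' },
  },
}

// Resolve a role to a runner; degrade to the fallback when the preferred CLI is missing.
function runnerFor(role, cfg, isAvailable) {
  const entry = cfg.roles[role] ?? { runner: 'claude' } // unlisted roles stay native
  const cmdOf = (name) => cfg.runners[name]?.cmd ?? 'native'
  if (entry.runner === 'claude' || isAvailable(entry.runner)) {
    return { runner: entry.runner, cmd: cmdOf(entry.runner), model: entry.model ?? null, degraded: false }
  }
  if (!entry.fallback) {
    throw new Error(`runner "${entry.runner}" is unavailable for ${role} and no fallback is set`)
  }
  const [runner, model = null] = entry.fallback.split(':') // "claude:sonnet" -> runner + model
  return { runner, cmd: cmdOf(runner), model, degraded: true }
}

console.log(runnerFor('BEND', exampleConfig, () => true))
console.log(runnerFor('BEND', exampleConfig, () => false))
```

Returning a `degraded` flag lets the run log when it silently fell back, which matters for cost tracking: a degraded role is back on the per-token bill.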
148
+ ### Lever 6 — Ref MCP to cut rework *and* context bloat
149
+
150
+ Wire **Ref MCP** (`ref_search_documentation`, `ref_read_url`) into the dev agents (BEND/FEND/DATA/…) and CRITIC. Two distinct savings:
151
+
152
+ 1. **Fewer rejection cycles.** A large share of CRITIC/JUDGE rejections are wrong API usage / hallucinated signatures. Ref lets the dev agent verify the real API *before* writing code, so the diff is right the first time → fewer (expensive, Opus) rework loops. This attacks the same cost tail as Lever 1, from the front.
153
+ 2. **Smaller context.** Without Ref, agents either guess or paste entire doc pages into context. Ref returns **only the relevant snippet**, so you pay for the 500 tokens you need, not the 40K-token page. CRITIC can use Ref to confirm "is this really the right call signature?" instead of being handed the whole SDK.
154
+
155
+ **Where:** add a "verify external APIs via Ref before implementing" step to each dev `read-inputs`/`implement` step file, and a "confirm uncertain API usage via Ref" note to `critic/edge-case-hunt.md`.
156
+
157
+ ### Lever 7 — Cache hygiene, distilled handoffs, batch API
158
+
159
+ - **Prompt-cache hygiene.** Caching is already in use, but cache hits require the static prefix to be **byte-identical** across spawns. Audit `buildPrompt` (`sprint.workflow.js:273-298`) so the **static** content (role prompt path, pipeline-context, handoff contract) comes first and the **dynamic** content (storyId, task subject, trigger) comes last. Verify realized cache-hit rate with the existing `audit` command (`src/commands/audit.js`) / `valent-review-cost` skill. Cached input is ~90% cheaper — a few percent of hit-rate is real money on a 4M-token sprint.
160
+ - **Distilled handoffs.** `distilled-handoff-format.md` already exists; make sure CRITIC/JUDGE read **distilled** artifacts, not raw full files, wherever the decision doesn't need the raw text.
161
+ - **Batch API (−50%)** for the latency-insensitive, non-interactive work: estimation/sizing in `plan`, retrospective synthesis, KB embeddings. These don't block a human, so the 50% batch discount is free money.
162
+
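The static-first ordering from the cache-hygiene bullet can be sketched as follows. `buildPromptParts` is a hypothetical stand-in, since the real `buildPrompt` signature is not shown here, and the path values are placeholders.

```javascript
// Static content first (identical across spawns of a role), dynamic content last.
function buildPromptParts({ rolePromptPath, pipelineContext, handoffContract }, { storyId, subject, trigger }) {
  const staticPrefix = [
    `Role prompt: ${rolePromptPath}`,
    `Pipeline context: ${pipelineContext}`,
    `Handoff contract: ${handoffContract}`,
  ].join('\n')
  const dynamicSuffix = [`Story: ${storyId}`, `Task: ${subject}`, `Trigger: ${trigger}`].join('\n')
  return { staticPrefix, prompt: `${staticPrefix}\n\n${dynamicSuffix}` }
}

// Number of leading characters two prompts share; only this shared run can hit the cache.
function sharedPrefixLength(a, b) {
  let i = 0
  while (i < a.length && i < b.length && a[i] === b[i]) i++
  return i
}

const bendStatic = {
  rolePromptPath: 'agents/bend.md', // placeholder path
  pipelineContext: 'pipeline-context.md', // placeholder path
  handoffContract: 'schemas/handoff.schema.json',
}
const promptA = buildPromptParts(bendStatic, { storyId: 'KANBAN-014', subject: 'add endpoint', trigger: 'build' })
const promptB = buildPromptParts(bendStatic, { storyId: 'KANBAN-015', subject: 'fix pagination', trigger: 'rework' })
console.log(sharedPrefixLength(promptA.prompt, promptB.prompt) >= promptA.staticPrefix.length)
```

If a dynamic value such as the story ID were interpolated into the first line, the shared run would end there and the rest of the static content would be billed uncached on every spawn.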
163
+ ---
164
+
165
+ ## 4. Projected impact
166
+
167
+ Directional, stacking the levers (no loss of the Opus decision-backstop):
168
+
169
+ | Lever | Mechanism | Est. cost reduction |
170
+ |-------|-----------|---------------------|
171
+ | 1 — Deterministic pre-gates | Removes ~30–50% of CRITIC rejection cycles | **High** |
172
+ | 2 — Find vs. judge tiering | All 3 CRITIC hunting passes → Sonnet/Haiku (triage stays Opus); JUDGE evidence → Sonnet | ~40–60% of CRITIC |
173
+ | 3 — Targeted re-review | Re-review ~20% of a full pass instead of 100% | High on rework tail |
174
+ | 4 — External reviewer | Blind-hunt → $0.25/file or free, off the Opus bill | Moderate |
175
+ | 5 — Provider arbitrage | Dev/blind work → subsidized flat-rate subs ($0 marginal) | **High** (can move the entire Sonnet dev line off API) |
176
+ | 6 — Ref MCP | Fewer rework loops + smaller context | Moderate, compounding |
177
+ | 7 — Cache/distill/batch | ~90% off cached input, 50% off batch | Moderate, broad |
178
+
179
+ **Realistic combined target: 50–70% reduction** in per-sprint API spend, with the irreducible Opus judgment (CRITIC triage, JUDGE ship decision, READINESS) preserved as the quality floor.
180
+
181
+ ---
182
+
183
+ ## 5. Resumability redesign
184
+
185
+ ### Current state and the gap
186
+
187
+ Resume = relaunch `Workflow({ scriptPath, resumeFromRunId })`; the journal replays the unchanged `agent()` prefix at ~100% cache hit. This is elegant **but**:
188
+ - **Same-session only / in-memory.** Context exhaustion that forces a new session, or a killed host process, loses the journal and the runId.
189
+ - **No durable ledger.** Nothing on disk says "story KANBAN-014 already SHIPPED in this sprint," so a from-scratch restart re-bills shipped stories.
190
+ - **A transient failure is fatal to sequential gates.** In a `for`-loop story (`sprint.workflow.js:312-315`), a network blip inside an awaited gate `throw`s and ends the run (unlike `parallel`, which drops to `null`).
191
+
192
+ ### Fix A — Idempotent, on-disk phase checkpoints (cross-session resume)
193
+
194
+ Persist a durable run ledger in the existing SQLite DB (extends the `artifacts`/`calibration` tables in `src/lib/db.js`). Add a CLI command:
195
+
196
+ ```
197
+ node .valent-pipeline/bin/cli.js checkpoint --sprint <id> --story <id> --phase <spec|build|critic|qa|judge> --verdict <pass|ship|reject>
198
+ ```
199
+
200
+ An `agent()` at the end of each phase writes the checkpoint (IO inside an agent = journal-safe). Then **make each phase idempotent at the top of `runStory`**: before running a phase, query the ledger; if a terminal verdict already exists for `(sprint, story, phase)`, **skip and reuse the on-disk artifact**. This is exactly the `groomed`-bypass pattern (`sprint.workflow.js` skips Spec+Readiness when `groomed:true`) — generalized to every phase. Result: a fresh run **after total session loss** naturally fast-forwards to the first incomplete phase, with no journal required.
201
+
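The fast-forward behavior can be sketched with an in-memory `Map` standing in for the SQLite ledger. Everything here is illustrative: `makeLedger` and `firstIncompletePhase` are hypothetical names, and treating `pass` and `ship` as the verdicts that allow a skip (with `reject` counting as incomplete) is an assumption.

```javascript
// Phase names follow the --phase options of the proposed checkpoint command.
const PHASES = ['spec', 'build', 'critic', 'qa', 'judge']
const COMPLETE = new Set(['pass', 'ship']) // assumption: reject does not allow a skip

// Stand-in for the durable ledger; the real one would be a SQLite table.
function makeLedger() {
  const rows = new Map()
  const key = (sprint, story, phase) => `${sprint}/${story}/${phase}`
  return {
    checkpoint(sprint, story, phase, verdict) {
      rows.set(key(sprint, story, phase), verdict)
    },
    verdict(sprint, story, phase) {
      return rows.get(key(sprint, story, phase)) ?? null
    },
  }
}

// First phase without a completing verdict, or null when the story is finished.
function firstIncompletePhase(ledger, sprint, story) {
  return PHASES.find((phase) => !COMPLETE.has(ledger.verdict(sprint, story, phase))) ?? null
}

const ledger = makeLedger()
ledger.checkpoint('sprint-7', 'KANBAN-014', 'spec', 'pass')
ledger.checkpoint('sprint-7', 'KANBAN-014', 'build', 'pass')
ledger.checkpoint('sprint-7', 'KANBAN-014', 'critic', 'reject')
console.log(firstIncompletePhase(ledger, 'sprint-7', 'KANBAN-014'))
```

Keying on `(sprint, story, phase)` makes the write idempotent: re-running a checkpoint after a retry overwrites the same row instead of adding a duplicate.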
202
+ ### Fix B — Retry-with-backoff wrapper for transient failures
203
+
204
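+ Wrap every gate/agent call in a bounded retry so a network glitch doesn't kill a 4M-token run. The wrapper as written retries immediately; a delay between attempts still has to be added for it to back off: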
205
+
206
+ ```js
207
+ async function withRetry(fn, { tries = 3, label } = {}) {
208
+ let lastErr
209
+ for (let i = 0; i < tries; i++) {
210
+ try { return await fn() }
211
+ catch (e) { lastErr = e; log(`retry ${label} (${i + 1}/${tries}): ${e.message}`) }
212
+ }
213
+ // exhausted: checkpoint progress and surface the runId so the user can resume cross-session
214
+ log(`PAUSED at ${label}. Resume with resumeFromRunId, or restart — completed phases will be skipped via the ledger.`)
215
+ throw lastErr
216
+ }
217
+ ```
218
+
219
+ (Keep it deterministic — no `Date.now`/`Math.random` in the script body; the resume-safety linter at `scripts/test-workflow.js` enforces this.)
220
+
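A runnable variant of the wrapper with an actual backoff schedule, offered as a sketch rather than the pipeline's implementation. The `sleep` and `log` parameters and the `baseDelayMs` default are assumptions: both functions are injected so the script body itself uses no timers, `Date.now`, or `Math.random`, and the schedule is fixed exponential with no jitter for the same reason.

```javascript
// Bounded retry with deterministic exponential backoff (500ms, 1000ms, 2000ms, ...).
async function withRetry(fn, options = {}) {
  const {
    tries = 3,
    label = 'step',
    baseDelayMs = 500,
    sleep = async () => {}, // injected; the default does not wait
    log = console.log,
  } = options
  let lastErr
  for (let attempt = 1; attempt <= tries; attempt++) {
    try {
      return await fn()
    } catch (err) {
      lastErr = err
      if (attempt === tries) break // no retry follows the final attempt
      const delayMs = baseDelayMs * 2 ** (attempt - 1)
      log(`retry ${label} (${attempt}/${tries}) in ${delayMs}ms: ${err.message}`)
      await sleep(delayMs)
    }
  }
  log(`PAUSED at ${label} after ${tries} attempts.`)
  throw lastErr
}
```

A caller would wrap a gate as `withRetry(() => runGate(story), { label: 'CRITIC', sleep })`, where `runGate` and `sleep` are whatever the host environment provides; both names are placeholders.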
221
+ ### Fix C — Make the runId loud and durable
222
+
223
+ On every story boundary, `log()` the `runId` and write it into the SQLite ledger so it survives the session. The resume instruction becomes a one-liner the user can always recover, and even if the runId is lost, **Fix A** makes a cold restart safe.
224
+
225
+ **Net:** three independent safety nets — journal replay (fast path, same session), the idempotent ledger (cross-session/cold-start), and bounded retry (transient blips) — so no single failure mode forces a 2–4M-token redo.
226
+
227
+ ---
228
+
229
+ ## 6. Proposed `pipeline-config.yaml` surface (consolidated)
230
+
231
+ ```yaml
232
+ runtime:
233
+ provider: claude-code
234
+
235
+ # Lever 2 — find vs. judge tiering (CRITIC/JUDGE split into sub-roles)
236
+ models:
237
+ haiku: [RESOLVE, CRITIC_BLIND]
238
+ sonnet: [REQS, UXA, QA-A, QA-B, BEND, FEND, DATA, IAC, MCP-DEV, LIBDEV, DOCGEN, MOBILE,
239
+ CRITIC_EDGE, CRITIC_ACCEPT, JUDGE_EVIDENCE]
240
+ opus: [READINESS, CRITIC_TRIAGE, JUDGE_DECIDE, INTEGRATION]
241
+
242
+ # Lever 5 — multi-provider runners (subsidized arbitrage)
243
+ runners:
244
+ codex: { cmd: "codex exec", auth: subscription }
245
+ cursor: { cmd: "cursor-agent", auth: subscription }
246
+ grok: { cmd: "grok", auth: subscription }
247
+ roles:
248
+ BEND: { runner: codex, fallback: claude:sonnet }
249
+ CRITIC_BLIND: { runner: cursor, fallback: claude:haiku }
250
+ CRITIC_TRIAGE: { runner: claude, model: opus }
251
+ JUDGE_DECIDE: { runner: claude, model: opus }
252
+
253
+ # Lever 1 — deterministic pre-gate (blocks CRITIC until green)
254
+ pre_critic_gate:
255
+ blocking: true
256
+ commands: ["npm run lint", "npm run typecheck", "semgrep --config auto --error"]
257
+
258
+ # Lever 4 — external reviewer as blind-hunt
259
+ external_review:
260
+ blind_hunt: coderabbit # coderabbit | pr-agent | native | claude
261
+
262
+ # Lever 3 — bounded, targeted rework
263
+ rejection:
264
+ max_cycles: 2 # was 5
265
+ targeted_reverify: true # re-check only previously-failed findings
266
+
267
+ # Lever 6 — Ref MCP for dev + critic
268
+ ref_mcp: { enabled: true, roles: [BEND, FEND, DATA, CRITIC_EDGE] }
269
+
270
+ # Resumability
271
+ resume:
272
+ ledger: sqlite # durable cross-session phase checkpoints
273
+ retry: { tries: 3 }
274
+ ```
275
+
276
+ All of this is overlay-on-default and journal-replay-safe (static + args only), consistent with `buildModelMap` (`sprint.workflow.js:231-241`).
277
+
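The overlay-on-default behavior can be sketched as below. This is not the actual `buildModelMap`: `overlayModels`, the default subset, and the sub-role table are illustrative assumptions. The hyphenated sub-role names and the rule that a bare `CRITIC`/`JUDGE` pin propagates to its sub-roles follow the workflow source comment in this diff; the `JUDGE-EVIDENCE` default of `sonnet` follows the proposal text.

```javascript
// Illustrative subset of defaults (role -> tier).
const DEFAULT_TIERS = {
  BEND: 'sonnet', RESOLVE: 'haiku', CRITIC: 'opus', JUDGE: 'opus',
  'CRITIC-BLIND': 'haiku', 'CRITIC-EDGE': 'sonnet', 'CRITIC-ACCEPT': 'sonnet', 'CRITIC-TRIAGE': 'opus',
  'JUDGE-EVIDENCE': 'sonnet', 'JUDGE-DECIDE': 'opus',
}
const SUB_ROLES = {
  CRITIC: ['CRITIC-BLIND', 'CRITIC-EDGE', 'CRITIC-ACCEPT', 'CRITIC-TRIAGE'],
  JUDGE: ['JUDGE-EVIDENCE', 'JUDGE-DECIDE'],
}

// Invert the config's tier -> roles map and lay it over the defaults.
function overlayModels(defaults, models = {}) {
  const map = { ...defaults }
  const explicit = new Set()
  for (const [tier, roles] of Object.entries(models)) {
    for (const role of roles ?? []) {
      map[role] = tier
      explicit.add(role)
    }
  }
  // A pinned head propagates to every sub-role the config did not list itself.
  for (const [head, subs] of Object.entries(SUB_ROLES)) {
    if (!explicit.has(head)) continue
    for (const sub of subs) {
      if (!explicit.has(sub)) map[sub] = map[head]
    }
  }
  return map
}

console.log(overlayModels(DEFAULT_TIERS, { sonnet: ['CRITIC'], opus: ['CRITIC-TRIAGE'] }))
```

The function takes only static data and arguments and performs no IO, which is the property that keeps it safe under journal replay.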
278
+ ---
279
+
280
+ ## 7. Phased rollout
281
+
282
+ | Phase | Change | Effort | Risk | Payoff |
283
+ |-------|--------|--------|------|--------|
284
+ | **0** | Turn on **prompt-cache hygiene** + verify hit-rate via `audit`; add **retry wrapper**; **lower `max_cycles` to 3** | S | Low | Quick cost + stability win |
285
+ | **1** | **Deterministic pre-critic gate** (lint/types/semgrep) | S–M | Low | Biggest single cost cut (kills rework cycles) |
286
+ | **2** | **Idempotent SQLite checkpoint ledger** (cross-session resume) | M | Low | Fixes Problem 1 fully |
287
+ | **3** | **Split CRITIC/JUDGE** into sub-roles + **find-vs-judge tiering** | M | Med (validate quality holds) | ~40–60% CRITIC cut |
288
+ | **4** | **Targeted re-verify** on rejection | M | Med | Collapses the cost tail |
289
+ | **5** | **External reviewer** (CodeRabbit CLI or self-hosted PR-Agent) as blind-hunt | M | Low | Moves a pass off the Opus bill |
290
+ | **6** | **Ref MCP** wired into dev + critic | S | Low | Fewer reworks, smaller context |
291
+ | **7** | **`runnerFor()` provider abstraction** → Codex/Cursor/Grok arbitrage | L | Med–High | Largest structural saving; multi-harness future |
292
+
293
+ Phases 0–2 are low-risk and address both problems immediately. Phase 7 is the strategic bet that unlocks the "many subsidized providers" future you described — and it's safe to attempt *because the gates already protect quality*, which is the whole premise.
294
+
295
+ ---
296
+
297
+ ## 8. Risks & guardrails
298
+
299
+ - **Quality regression from cheaper/foreign models.** Mitigated by design: the Opus judgment gates are the floor. Track rejection-rate and rework-cycle counts per tier in the existing `calibrate` data — if a Sonnet/Codex pass drives rework up, the savings evaporate and you escalate that role back up. Make the tiering **measured, not assumed.**
300
+ - **Determinism / journal safety.** All external-CLI and DB calls must stay **inside agents**; the orchestrator script stays pure (no `Date.now`/`Math.random`/IO) so resume replay holds. The linter at `scripts/test-workflow.js` already enforces this — keep it green.
301
+ - **External-tool auth in headless/cron runs.** Codex/Cursor/Grok CLIs and CodeRabbit need their own auth; subscription-based CLIs may not be present in unattended runs. Always define a **`fallback` to a native Claude agent** so a missing CLI degrades instead of failing.
302
+ - **Cost monitoring must lead.** Before/after every phase, run `valent-review-cost` / `audit` on the Workflow journals so each lever's real savings is measured, not guessed.
303
+
304
+ ---
305
+
306
+ ## 9. TL;DR
307
+
308
+ 1. **Stop paying Opus to find mechanical bugs.** Put free linters/Semgrep/Ref in front of CRITIC, and move CRITIC's *hunting* passes to Sonnet/Haiku (or a flat-rate external reviewer / subsidized CLI). Keep Opus only for the *triage* and *ship* decisions.
309
+ 2. **Kill the rejection-loop tail** — targeted re-verify + lower cap. It's the #1 cost center.
310
+ 3. **Arbitrage subsidized providers** (Codex/Cursor/Grok via headless CLI) for the bulk build/find work; the Opus gate makes the quality variance safe. This is also the on-ramp to the multi-harness future.
311
+ 4. **Make resume durable** with an idempotent SQLite phase ledger + retry wrapper, so a glitch or context wipe never re-bills a 2–4M-token sprint.
312
+
313
+ Combined target: **50–70% lower spend** and **no full restarts**, with the quality backstop untouched.
@@ -46,11 +46,36 @@
46
46
  * `models` is the pipeline-config.yaml `models` tier->roles map (e.g. { opus:[...], sonnet:[...],
47
47
  * haiku:[...] }); the invoking skill passes it through so per-agent model tiers stay config-driven
48
48
  * and editable via `valent configure`. Omit it to use the baked-in default assignment.
49
+ * CRITIC and JUDGE are split into find-vs-judge sub-roles (Lever 2): CRITIC-BLIND/-EDGE/-ACCEPT do
50
+ * the hunting on cheaper tiers while CRITIC-TRIAGE (the adjudication) stays opus; JUDGE-EVIDENCE
51
+ * measures + recommends on a cheaper tier while JUDGE-DECIDE (the binding ship call) stays opus.
52
+ * A config that pins only the bare CRITIC/JUDGE head propagates that tier to its sub-roles, so
53
+ * existing configs are unchanged; list the sub-roles directly to tier them à la carte.
49
54
  *
50
55
  * `reasoning` is the pipeline-config.yaml `reasoning` level->roles map (e.g. { ultrathink:[...],
51
56
  * 'think-harder':[...], 'think-hard':[...], think:[...] }); it injects a thinking-effort trigger
52
57
  * into a role's prompt. BLANK by default — omit it (or leave the levels empty) and nothing is
53
58
  * injected, behavior unchanged. It is a config-driven control surface, parallel to `models`.
59
+ *
60
+ * `skipCompleted` (alias `resume`) enables cross-session resume: each story is first checked for an
61
+ * existing terminal verdict on disk and skipped if already finalized in a prior run, so re-running an
62
+ * interrupted sprint does not re-bill shipped stories. Set by the /valent-resume command. Off => a
63
+ * fresh run pays nothing extra and behaves exactly as before.
64
+ *
65
+ * `ref` (from pipeline-config.yaml `ref`: { enabled?, roles? }) toggles Lever 6: appends a short
66
+ * "verify third-party APIs against current docs via the Ref tools IF available" hint to the listed
67
+ * roles' spawn prompts. Conditional on runtime tool availability => a no-op without Ref configured,
68
+ * so it defaults ON for dev agents + CRITIC. Set enabled:false to suppress, or override roles.
69
+ *
70
+ * `targetedReverify` (from pipeline-config.yaml `quality.targeted_reverify`) toggles Lever 3: after a
71
+ * CRITIC rejection + rework, re-review only the previously-flagged findings (one cheap CRITIC-REVERIFY
72
+ * pass) rather than re-running the full 3-pass hunt. Defaults to false (full re-hunt) for back-compat.
73
+ *
74
+ * `preCriticGate` (from pipeline-config.yaml `pre_critic_gate`) is the deterministic pre-CRITIC
75
+ * gate (Lever 1): { commands:[shell strings], blocking?:bool=true, maxCycles?:int }. A cheap STATIC
76
+ * agent runs the commands (lint/type-check/static analysis) after Build and before CRITIC; failures
77
+ * route back to dev and re-run (bounded), so mechanical defects are fixed for free instead of
78
+ * burning an Opus CRITIC rejection cycle. BLANK by default — no commands => the gate is a no-op.
54
79
  */
55
80
 
56
81
  export const meta = {
@@ -61,9 +86,10 @@ export const meta = {
61
86
  { title: 'Spec', detail: 'reqs -> uxa -> qa-a' },
62
87
  { title: 'Readiness', detail: 'pre-dev quality gate' },
63
88
  { title: 'Build', detail: 'dev agents in parallel (barrier before CRITIC)' },
64
- { title: 'Critic', detail: 'three independent passes in parallel -> triage -> rejection loop (code-owned cap)' },
89
+ { title: 'Static', detail: 'deterministic pre-CRITIC gate lint/type/static checks (cheap; blank by default)' },
90
+ { title: 'Critic', detail: 'three independent passes (tiered: blind=haiku, edge/acceptance=sonnet) -> opus triage -> rejection loop (targeted re-verify, code-owned cap)' },
65
91
  { title: 'QA', detail: 'execute tests against real infra' },
66
- { title: 'Judge', detail: 'evidence-based ship decision' },
92
+ { title: 'Judge', detail: 'evidence pass (cheaper tier, measures + recommends) -> opus ship decision' },
67
93
  { title: 'Integration', detail: 'single cross-story seam review — only when >1 story touched overlapping files' },
68
94
  ],
69
95
  }
@@ -129,6 +155,53 @@ const FINDINGS_SCHEMA = {
129
155
  },
130
156
  }
131
157
 
158
+ // JUDGE split (Lever 2): the evidence pass measures + cross-references the QA artifacts (cheaper
159
+ // tier) and RECOMMENDS; the binding SHIP/REJECT is the separate decision step (opus). Loose by
160
+ // design — the decision step consumes this as context to reason over, not as the verdict itself.
161
+ const JUDGE_EVIDENCE_SCHEMA = {
162
+ type: 'object',
163
+ required: ['schema', 'agent', 'story', 'recommendation'],
164
+ additionalProperties: true,
165
+ properties: {
166
+ schema: { const: 1 },
167
+ agent: { type: 'string' },
168
+ story: { type: 'string' },
169
+ recommendation: { enum: ['ship', 'reject', 'unsure'] },
170
+ testsPassed: { type: 'integer' },
171
+ testsFailed: { type: 'integer' },
172
+ openP1toP3: { type: 'integer' },
173
+ discrepancies: { type: 'array', items: { type: 'string' } },
174
+ notes: { type: 'string' },
175
+ },
176
+ }
177
+
178
+ // Deterministic pre-CRITIC gate (Lever 1). A cheap agent runs the configured lint/type/static
179
+ // commands and reports pass/fail per command — exit code is the verdict, no model judgment.
180
+ const PRE_GATE_SCHEMA = {
181
+ type: 'object',
182
+ required: ['schema', 'story', 'verdict', 'checks'],
183
+ additionalProperties: true,
184
+ properties: {
185
+ schema: { const: 1 },
186
+ agent: { type: 'string' },
187
+ story: { type: 'string' },
188
+ verdict: { enum: ['pass', 'fail'] },
189
+ checks: {
190
+ type: 'array',
191
+ items: {
192
+ type: 'object',
193
+ required: ['command', 'ok'],
194
+ properties: {
195
+ command: { type: 'string' },
196
+ ok: { type: 'boolean' },
197
+ summary: { type: 'string' },
198
+ },
199
+ },
200
+ },
201
+ filesToFix: { type: 'array', items: { type: 'string' } },
202
+ },
203
+ }
204
+
132
205
  // Sprint-end cross-story seam review. Advisory: stories already passed JUDGE, so this does not
133
206
  // re-gate them — it surfaces integration findings to be filed as bugs against the affected stories.
134
207
  const INTEGRATION_SCHEMA = {
@@ -176,14 +249,32 @@ const RESOLVED_GRAPH_SCHEMA = {
176
249
  },
177
250
  }
178
251
 
252
+ // Resume check (cross-session resumability): a cheap agent reports whether a story already reached a
253
+ // terminal verdict on disk in a prior run, so a resumed sprint skips finalized stories instead of
254
+ // re-billing them. Only consulted when skipCompleted is set (the /valent-resume path).
255
+ const RESUME_CHECK_SCHEMA = {
256
+ type: 'object',
257
+ required: ['schema', 'story', 'done'],
258
+ additionalProperties: true,
259
+ properties: {
260
+ schema: { const: 1 },
261
+ story: { type: 'string' },
262
+ done: { type: 'boolean' },
263
+ shipped: { type: 'boolean' },
264
+ },
265
+ }
266
+
179
267
  const DEV_AGENTS = new Set(['BEND', 'FEND', 'IAC', 'DATA', 'DOCGEN', 'LIBDEV', 'MCP-DEV', 'MOBILE'])
180
268
 
181
269
  // CRITIC's three independent passes (step 3b). Each reads ONLY its own pass step file and
182
270
  // the diff/artifacts it is told to — never another pass's output — so they cannot anchor.
271
+ // `modelRole` keys the per-pass model tier (Lever 2): blind hunting is mechanical (haiku), edge +
272
+ // acceptance need spec context but are still *finding* (sonnet); the agent identity stays CRITIC
273
+ // (it reads critic.md + its pass step), only the tier differs. Triage — the judgment — stays opus.
183
274
  const CRITIC_PASSES = [
184
- { pass: 'blind', step: 'blind-hunt.md', reads: 'ONLY the git diff (do NOT read reqs-brief or qa-test-spec)' },
185
- { pass: 'edge', step: 'edge-case-hunt.md', reads: 'the diff plus reqs-brief.md (hunt boundary/error/concurrency cases)' },
186
- { pass: 'acceptance', step: 'acceptance-audit.md', reads: 'the diff plus qa-test-spec.md and reqs-brief.md (audit every AC)' },
275
+ { pass: 'blind', step: 'blind-hunt.md', modelRole: 'CRITIC-BLIND', reads: 'ONLY the git diff (do NOT read reqs-brief or qa-test-spec)' },
276
+ { pass: 'edge', step: 'edge-case-hunt.md', modelRole: 'CRITIC-EDGE', reads: 'the diff plus reqs-brief.md (hunt boundary/error/concurrency cases)' },
277
+ { pass: 'acceptance', step: 'acceptance-audit.md', modelRole: 'CRITIC-ACCEPT', reads: 'the diff plus qa-test-spec.md and reqs-brief.md (audit every AC)' },
187
278
  ]
188
279
 
189
280
  // --- arg normalization: accept a batch or a single story ---------------------
@@ -209,6 +300,42 @@ const batch = Array.isArray(a.stories) && a.stories.length
209
300
 
210
301
  const maxRejectionCycles = a.maxRejectionCycles ?? 5
211
302
 
303
+ // Lever 3: targeted re-review on rejection. When true, CRITIC re-reviews ONLY the previously-flagged
304
+ // findings against the reworked diff (one cheap CRITIC-REVERIFY pass) instead of re-running the full
305
+ // 3-pass hunt every cycle — collapsing the rejection-loop cost tail. From pipeline-config.yaml
306
+ // `quality.targeted_reverify`, passed through as args.targetedReverify. Defaults to FALSE so an
307
+ // existing config (without the key) keeps the full re-hunt; new projects ship it on.
308
+ const targetedReverify = a.targetedReverify === true
309
+ || (!!a.rejection && (a.rejection.targetedReverify === true || a.rejection.targeted_reverify === true))
310
+
311
+ // Resumability: when true, each story is first checked for an existing terminal verdict on disk and
312
+ // skipped if already finalized in a prior run — so re-running an interrupted sprint does not re-bill
313
+ // shipped stories. Set by the /valent-resume command (valent-resume skill). Off by default, so a
314
+ // fresh run pays nothing and behaves exactly as before.
315
+ const skipCompleted = a.skipCompleted === true || a.resume === true
316
+
317
+ // --- deterministic pre-CRITIC gate config (Lever 1) ---------------------------
318
+ // A cheap gate with no model judgment: a low-tier agent runs the project's lint / type-check / static-analysis commands
319
+ // BEFORE the Opus CRITIC. Mechanical defects caught here never become a CRITIC rejection cycle
320
+ // (the most expensive thing in the loop). BLANK by default — with no commands configured the gate
321
+ // is a no-op and behavior is unchanged, mirroring the `reasoning` control surface. Comes from
322
+ // pipeline-config.yaml `pre_critic_gate`, passed through by the invoking skill. Accepts either the
323
+ // camelCase arg (preCriticGate) or the raw snake_case section (pre_critic_gate); within it, the
324
+ // cycle cap may be spelled maxCycles or max_cycles. Static + args only => journal-replay safe.
325
+ const preCfg = (() => {
326
+ const c = a.preCriticGate || a.pre_critic_gate
327
+ return c && typeof c === 'object' && !Array.isArray(c) ? c : {}
328
+ })()
329
+ const preCriticChecks = Array.isArray(preCfg.commands)
330
+ ? preCfg.commands.filter((c) => typeof c === 'string' && c.trim())
331
+ : []
332
+ const preCriticBlocking = preCfg.blocking !== false // default: blocking (rework loop). false => advisory.
333
+ const preCriticMaxCycles = Number.isInteger(preCfg.maxCycles)
334
+ ? preCfg.maxCycles
335
+ : Number.isInteger(preCfg.max_cycles)
336
+ ? preCfg.max_cycles
337
+ : maxRejectionCycles
338
+
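The config normalization above can be exercised in isolation. The following is a minimal sketch under stated assumptions, not the workflow's own code: `normalizePreCriticGate` and its `fallbackMaxCycles` parameter are hypothetical names, and the two commands are only example values for `pre_critic_gate.commands`.

```js
// Hypothetical standalone helper mirroring the normalization rules in the hunk above.
function normalizePreCriticGate(args, fallbackMaxCycles = 5) {
  // Accept the camelCase arg or the raw snake_case section; anything that is not a plain object is ignored.
  const raw = args.preCriticGate || args.pre_critic_gate
  const cfg = raw && typeof raw === 'object' && !Array.isArray(raw) ? raw : {}
  // Keep only non-blank string commands.
  const commands = Array.isArray(cfg.commands)
    ? cfg.commands.filter((c) => typeof c === 'string' && c.trim())
    : []
  // maxCycles wins over max_cycles; otherwise fall back to the rejection-cycle cap.
  const maxCycles = Number.isInteger(cfg.maxCycles)
    ? cfg.maxCycles
    : Number.isInteger(cfg.max_cycles)
      ? cfg.max_cycles
      : fallbackMaxCycles
  return { commands, blocking: cfg.blocking !== false, maxCycles, enabled: commands.length > 0 }
}

// Example values only: a snake_case section with one blank and one non-string entry.
const fromYaml = normalizePreCriticGate({
  pre_critic_gate: { commands: ['npm run lint', '   ', 42, 'npx tsc --noEmit'], max_cycles: 2 },
})
const blank = normalizePreCriticGate({})
console.log(fromYaml.commands.length, fromYaml.maxCycles, fromYaml.blocking) // 2 2 true
console.log(blank.enabled, blank.maxCycles) // false 5
```

With nothing configured the result is `enabled: false`, which corresponds to the no-op default described in the comment.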
212
339
  for (const s of batch) {
213
340
  if (!s.storyId || !s.projectType) {
214
341
  throw new Error('each story needs { storyId, projectType }; profiles[] optional')
@@ -222,21 +349,45 @@ for (const s of batch) {
222
349
  // assignment even when args.models is absent. Static + args only => journal-replay safe.
223
350
  // gates -> opus (judgment), spec/build -> sonnet, CLI-runners/IO -> haiku.
224
351
  const DEFAULT_MODELS = {
225
- READINESS: 'opus', CRITIC: 'opus', JUDGE: 'opus', INTEGRATION: 'opus',
352
+ READINESS: 'opus', INTEGRATION: 'opus',
353
+ // CRITIC / JUDGE are split into find-vs-judge sub-roles (Lever 2): the *finding* work runs on a
354
+ // cheaper tier and only the *judgment* step (triage / ship decision) stays on opus. The bare
355
+ // CRITIC / JUDGE keys are the family heads — kept so an old-style config that pins only the head
356
+ // still propagates its tier to every sub-role (see buildModelMap), preserving prior behavior.
357
+ CRITIC: 'opus', 'CRITIC-BLIND': 'haiku', 'CRITIC-EDGE': 'sonnet', 'CRITIC-ACCEPT': 'sonnet',
358
+ 'CRITIC-TRIAGE': 'opus', 'CRITIC-REVERIFY': 'sonnet', // REVERIFY: Lever 3 targeted re-review (cheap)
359
+ JUDGE: 'opus', 'JUDGE-EVIDENCE': 'sonnet', 'JUDGE-DECIDE': 'opus',
226
360
  REQS: 'sonnet', UXA: 'sonnet', 'QA-A': 'sonnet', 'QA-B': 'sonnet',
227
361
  BEND: 'sonnet', FEND: 'sonnet', DATA: 'sonnet', 'MCP-DEV': 'sonnet',
228
362
  LIBDEV: 'sonnet', DOCGEN: 'sonnet', IAC: 'sonnet', MOBILE: 'sonnet',
229
- RESOLVE: 'haiku',
363
+ RESOLVE: 'haiku', STATIC: 'haiku', RESUME: 'haiku',
364
+ }
365
+ // Family heads -> their sub-roles. A config that explicitly tiers a head (e.g. `opus: [CRITIC]`)
366
+ // but not the sub-roles propagates the head's tier to each unconfigured sub-role, so EXISTING
367
+ // configs behave exactly as before (no silent rigor change on upgrade); configs that list the
368
+ // sub-roles directly (the new default) tier them à la carte.
369
+ const ROLE_FAMILIES = {
370
+ CRITIC: ['CRITIC-BLIND', 'CRITIC-EDGE', 'CRITIC-ACCEPT', 'CRITIC-TRIAGE', 'CRITIC-REVERIFY'],
371
+ JUDGE: ['JUDGE-EVIDENCE', 'JUDGE-DECIDE'],
230
372
  }
231
373
  function buildModelMap(cfg) {
232
374
  const map = { ...DEFAULT_MODELS }
375
+ const explicit = new Set()
233
376
  if (cfg && typeof cfg === 'object' && !Array.isArray(cfg)) {
234
377
  for (const tier of ['opus', 'sonnet', 'haiku']) {
235
378
  for (const role of cfg[tier] || []) {
236
- if (typeof role === 'string') map[role.toUpperCase()] = tier
379
+ if (typeof role === 'string') {
380
+ const r = role.toUpperCase()
381
+ map[r] = tier
382
+ explicit.add(r)
383
+ }
237
384
  }
238
385
  }
239
386
  }
387
+ // Back-compat: propagate an explicitly-configured family head to its unconfigured sub-roles.
388
+ for (const [head, subs] of Object.entries(ROLE_FAMILIES)) {
389
+ if (explicit.has(head)) for (const sub of subs) if (!explicit.has(sub)) map[sub] = map[head]
390
+ }
240
391
  return map
241
392
  }
242
393
  const MODELS = buildModelMap(a.models)
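The back-compat rule for family heads is easiest to see on a small input. Below is a minimal sketch, assuming a trimmed role table; `resolveTiers`, `DEFAULTS`, and `FAMILIES` are hypothetical stand-ins for `buildModelMap`, `DEFAULT_MODELS`, and `ROLE_FAMILIES`, not copies of them.

```js
// Hypothetical, trimmed-down defaults; the real table lists many more roles.
const DEFAULTS = { CRITIC: 'opus', 'CRITIC-BLIND': 'haiku', 'CRITIC-EDGE': 'sonnet', 'CRITIC-TRIAGE': 'opus' }
const FAMILIES = { CRITIC: ['CRITIC-BLIND', 'CRITIC-EDGE', 'CRITIC-TRIAGE'] }

function resolveTiers(cfg) {
  const map = { ...DEFAULTS }
  const explicit = new Set() // roles the config named directly
  for (const tier of ['opus', 'sonnet', 'haiku']) {
    for (const role of (cfg && cfg[tier]) || []) {
      if (typeof role !== 'string') continue
      const r = role.toUpperCase()
      map[r] = tier
      explicit.add(r)
    }
  }
  // An explicitly configured head pushes its tier onto every sub-role the config did not name.
  for (const [head, subs] of Object.entries(FAMILIES)) {
    if (!explicit.has(head)) continue
    for (const sub of subs) if (!explicit.has(sub)) map[sub] = map[head]
  }
  return map
}

const oldStyle = resolveTiers({ opus: ['critic'] })                          // head only
const mixed = resolveTiers({ opus: ['CRITIC'], haiku: ['CRITIC-BLIND'] })    // head plus one sub-role
const noConfig = resolveTiers(undefined)                                     // defaults untouched
console.log(oldStyle['CRITIC-BLIND'], mixed['CRITIC-BLIND'], mixed['CRITIC-EDGE'], noConfig['CRITIC-EDGE'])
// prints: opus haiku opus sonnet
```

The head-only config keeps every CRITIC pass on opus, which is the "no silent rigor change on upgrade" behavior; the cheaper per-pass tiers apply only when the head is left unconfigured or the sub-roles are listed directly.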
@@ -266,6 +417,29 @@ const REASONING = buildReasoningMap(a.reasoning)
266
417
  // undefined => inject no thinking trigger for this role.
267
418
  const reasoningFor = (role) => REASONING[String(role).toUpperCase()]
268
419
 
420
+ // --- Ref MCP documentation-verification hint (Lever 6) ------------------------
421
+ // When a role is Ref-enabled, buildPrompt appends a short instruction telling the agent to verify
422
+ // third-party APIs against current docs via the Ref tools IF those tools are available to it. The
423
+ // instruction is purely conditional on runtime tool availability, so it is a no-op for any session
424
+ // without Ref configured — which is why it can default ON: it only ever helps, never blocks. From
425
+ // pipeline-config.yaml `ref` ({ enabled?, roles? }); set enabled:false to suppress it, or override
426
+ // `roles` to change which agents get it. Static + args only => journal-replay safe.
427
+ const DEFAULT_REF_ROLES = ['BEND', 'FEND', 'DATA', 'MCP-DEV', 'LIBDEV', 'DOCGEN', 'IAC', 'MOBILE', 'CRITIC']
428
+ function buildRefRoleSet(cfg) {
429
+ if (cfg && cfg.enabled === false) return new Set()
430
+ const roles = cfg && Array.isArray(cfg.roles) && cfg.roles.length ? cfg.roles : DEFAULT_REF_ROLES
431
+ return new Set(roles.map((r) => String(r).toUpperCase()))
432
+ }
433
+ const REF_ROLES = buildRefRoleSet(a.ref)
434
+ const refFor = (role) => REF_ROLES.has(String(role).toUpperCase())
435
+ const REF_HINT =
436
+ 'When you work with a third-party library, framework, or external API, FIRST check whether the Ref ' +
437
+ 'documentation tools (`ref_search_documentation`, `ref_read_url`) are available to you. If they are, ' +
438
+ 'use them to confirm exact signatures, parameters, and current usage against the real docs before ' +
439
+ 'writing or reviewing that code — do not trust memory for unfamiliar, niche, or fast-moving APIs. ' +
440
+ 'If those tools are not available, just proceed with your existing knowledge — this step is optional ' +
441
+ 'and best-effort, never a blocker.'
442
+
269
443
  // --- prompt builder: mirrors providers/claude-code/spawn.template.md so spawned agents
270
444
  // get full pipeline context (core prompt + shared context + step-at-execution + the
271
445
  // handoff contract), not a terse one-liner. ------------------------------------------
@@ -281,6 +455,7 @@ function buildPrompt({ role, promptFile, storyId, taskRef, taskSubject, trigger,
281
455
  `1. Read your core prompt: \`.valent-pipeline/prompts/${promptFile}\` — identity, protocols, step sequence.`,
282
456
  `2. Read shared context: \`${outputDir}/pipeline-context.md\` (and correction directives if present).`,
283
457
  '3. Read each step file at the point of execution, not before. Check decision gates first.',
458
+ ...(refFor(role) ? ['', '## External docs (optional)', REF_HINT] : []),
284
459
  '',
285
460
  '## Task Assignment',
286
461
  `${taskRef ? `Task ${taskRef}: ` : ''}${taskSubject}`,
@@ -385,6 +560,26 @@ async function runStory(story) {
385
560
  const { storyId, projectType, profiles, groomed } = story
386
561
  const profilesCsv = profiles.join(',')
387
562
 
563
+ // Resumability: on a resumed run, skip any story already finalized on disk in a prior run. The
564
+ // check is a single cheap agent (the script can't read files itself); it returns done/shipped from
565
+ // the story's judge-decision.md / story-report.md. No-op unless skipCompleted is set.
566
+ if (skipCompleted) {
567
+ const prior = await agent(
568
+ `Resume check for story ${storyId}. Determine whether this story ALREADY reached a terminal verdict ` +
569
+ `in a PRIOR run by reading, if present, \`stories/${storyId}/output/judge-decision.md\` and ` +
570
+ `\`stories/${storyId}/output/story-report.md\`. Return ONLY ` +
571
+ `{ schema:1, story:"${storyId}", done:boolean, shipped:boolean }: done:true ONLY if a final SHIP or ` +
572
+ `REJECT verdict is already recorded on disk (shipped reflects a SHIP). If neither file exists or no ` +
573
+ `final verdict is present, done:false.`,
574
+ { label: `resume-check:${storyId}`, phase: 'Resolve', schema: RESUME_CHECK_SCHEMA, model: modelFor('RESUME') },
575
+ )
576
+ if (prior.done) {
577
+ log(`${storyId}: already finalized in a prior run (${prior.shipped ? 'shipped' : 'rejected'}) — skipping (resume)`)
578
+ return { storyId, shipped: prior.shipped === true, verdict: prior.shipped ? 'pass' : 'fail', resumedSkip: true, skipped: [], files: [] }
579
+ }
580
+ log(`${storyId}: no terminal verdict on disk — running normally (resume)`)
581
+ }
582
+
388
583
  phase('Resolve')
389
584
  // The script cannot run the CLI itself; an agent runs resolve-graph and returns its JSON.
390
585
  const graph = await agent(
@@ -471,6 +666,10 @@ async function runStory(story) {
471
666
  return { storyId, shipped: false, verdict: 'blocked', reason: 'no-dev-output', skipped: graph.skipped, files: [] }
472
667
  }
473
668
 
669
+ // Deterministic pre-CRITIC gate (Lever 1): run lint/type/static checks (cheap, mechanical) and
670
+ // fix what they catch BEFORE the Opus CRITIC sees the diff. No-op when no commands are configured.
671
+ await runPreCriticGate(storyId, devTasks)
672
+
474
673
  phase('Critic')
475
674
  await runCriticGate(storyId, devTasks)
476
675
 
@@ -478,13 +677,104 @@ async function runStory(story) {
478
677
  await spawn('QA-B', 'qa-b.md', 'Execute the full test suite, file bugs, build the traceability matrix.', { phase: 'QA' })
479
678
 
480
679
  phase('Judge')
481
- const decision = await runGate(storyId, 'JUDGE', 'judge.md',
482
- 'Review evidence (tests, traceability, bugs) and make the ship decision.', 'Judge', null)
680
+ const decision = await runJudgeGate(storyId)
483
681
 
484
682
  return { storyId, shipped: decision.verdict === 'pass', verdict: decision.verdict, skipped: graph.skipped, files: devFiles }
485
683
 
486
684
  // --- per-story closures over storyId/devTasks ----------------------------
487
685
 
686
+ // runPreCriticGate (Lever 1): a deterministic pre-review gate. A cheap STATIC agent runs the
687
+ // configured lint/type/static-analysis commands; the exit code is the verdict (no model
688
+ // judgment). On failure in blocking mode, route the fixes to the owning dev agents and re-run,
689
+ // bounded by preCriticMaxCycles. On cap-trip it logs and FALLS THROUGH to CRITIC (the real gate)
690
+ // rather than rolling the story over — a story should not be blocked on mechanical checks alone,
691
+ // and CRITIC/JUDGE remain the quality backstop. No-op (returns immediately) when nothing is
692
+ // configured, so the default pipeline behavior is unchanged.
693
+ async function runPreCriticGate(sid, devs) {
694
+ if (!preCriticChecks.length) return { skipped: true } // blank control surface: nothing configured
695
+ phase('Static')
696
+ let rejections = 0
697
+ while (true) {
698
+ const verdict = await agent(
699
+ [
700
+ `You are **STATIC**, the deterministic pre-review gate for story ${sid} in the valent-pipeline.`,
701
+ `Run each command below from the project root, capturing its exit code and a short tail of its output.`,
702
+ `A command PASSES iff it exits 0. Do NOT fix anything and do NOT make judgment calls — the exit code is the verdict.`,
703
+ ``,
704
+ `Commands:`,
705
+ ...preCriticChecks.map((c, i) => ` ${i + 1}. ${c}`),
706
+ ``,
707
+ `Set verdict:"pass" only if EVERY command exited 0; otherwise verdict:"fail". For each FAILED command,`,
708
+ `summarize the errors and list the offending file paths (relative to the project root) in filesToFix.`,
709
+ `Return ONLY { schema:1, agent:"static", story:"${sid}", verdict:"pass"|"fail", ` +
710
+ `checks:[{command, ok:boolean, summary}], filesToFix:[paths] } as JSON.`,
711
+ ].join('\n'),
712
+ { label: `gate:static:${sid}`, phase: 'Static', schema: PRE_GATE_SCHEMA, model: modelFor('STATIC') },
713
+ )
714
+ if (verdict.verdict === 'pass') {
715
+ log(`${sid}/STATIC: pre-critic checks green (${(verdict.checks || []).length} command(s)) — handing a clean diff to CRITIC`)
716
+ return verdict
717
+ }
718
+ const failed = (verdict.checks || []).filter((c) => !c.ok)
719
+ if (!preCriticBlocking) {
720
+ log(`${sid}/STATIC: ${failed.length} check(s) failed — advisory mode, proceeding to CRITIC`)
721
+ return { ...verdict, advisory: true }
722
+ }
723
+ rejections += 1
724
+ if (rejections >= preCriticMaxCycles) {
725
+ log(`${sid}/STATIC: still failing after ${rejections} cycle(s) — escalating to CRITIC (real gate) instead of blocking on mechanical checks`)
726
+ return { ...verdict, escalated: true }
727
+ }
728
+ log(`${sid}/STATIC: ${failed.length} check(s) failed — rejection ${rejections}/${preCriticMaxCycles}, routing fixes to dev`)
729
+ const summary = failed.map((c) => `- ${c.command}: ${c.summary || 'failed'}`).join('\n')
730
+ await parallel(
731
+ devs.map((t) => () =>
732
+ spawn(t.agent, `${t.agent.toLowerCase()}.md`,
733
+ `The deterministic pre-review gate (lint/type/static analysis) failed. Fix every reported failure that ` +
734
+ `falls in your area, then re-run the checks locally to confirm they pass:\n${summary}`, {
735
+ label: `rework:static:${t.ref}:${sid}`,
736
+ phase: 'Static',
737
+ })),
738
+ )
739
+ }
740
+ }
741
+
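The loop's exit conditions can be traced without spawning agents. This is a simplified synchronous sketch, assuming each gate attempt is reduced to a boolean; `simulateGate` is a hypothetical name and the real function awaits agent calls instead of reading an array.

```js
// Hypothetical simulation of the control flow in runPreCriticGate.
// `attempts` holds one boolean per STATIC run: true means every command exited 0.
function simulateGate(attempts, { blocking = true, maxCycles = 5 } = {}) {
  let rejections = 0
  let reworks = 0 // number of times fixes were routed to the dev agents
  for (const ok of attempts) {
    if (ok) return { outcome: 'pass', reworks }
    if (!blocking) return { outcome: 'advisory', reworks } // failed, but proceed to CRITIC
    rejections += 1
    if (rejections >= maxCycles) return { outcome: 'escalated', reworks } // fall through to CRITIC
    reworks += 1
  }
  return { outcome: 'input-exhausted', reworks } // only reachable in this simulation
}

console.log(simulateGate([false, false, true]))                  // { outcome: 'pass', reworks: 2 }
console.log(simulateGate([false, false, false], { maxCycles: 2 })) // { outcome: 'escalated', reworks: 1 }
console.log(simulateGate([false], { blocking: false }))          // { outcome: 'advisory', reworks: 0 }
```

Because the cap is checked before fixes are routed, a cap of N allows at most N - 1 rework rounds, and a cap of 1 escalates on the first failure without any rework.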
742
+ // runJudgeGate (Lever 2): the terminal ship gate, split find-vs-judge. The EVIDENCE pass measures
743
+ // and cross-references the QA artifacts (bug review + evidence review) on a cheaper tier and emits
744
+ // a recommendation; the binding SHIP/REJECT DECISION is a separate opus step that reasons over the
745
+ // distilled evidence (and may spot-check a flagged artifact on disk) rather than re-reading every
746
+ // raw report. Net: the bulk reading/counting is cheap, opus is spent only on the judgment — and on
747
+ // a compact input. The decision keeps the `gate:judge:${sid}` label so it remains the gate of record.
748
+ async function runJudgeGate(sid) {
749
+ const evidence = await spawn('JUDGE', 'judge.md',
750
+ 'Run the EVIDENCE pass only: bug review (`.valent-pipeline/steps/judge/bug-review.md`) then ' +
751
+ 'evidence review (`.valent-pipeline/steps/judge/evidence-review.md`). Cross-check ' +
752
+ 'execution-report / traceability-matrix / bugs against qa-test-spec: count tests passed/failed, ' +
753
+ 'verify every AC is covered, validate bug priorities, and FLAG any claim the artifacts do not ' +
754
+ 'support. Write judge-review.md. Do NOT make the ship decision — only gather the evidence and ' +
755
+ 'give a recommendation.',
756
+ {
757
+ label: `judge:evidence:${sid}`,
758
+ phase: 'Judge',
759
+ schema: JUDGE_EVIDENCE_SCHEMA,
760
+ model: modelFor('JUDGE-EVIDENCE'),
761
+ returnContract:
762
+ 'Return ONLY { schema:1, agent:"judge", story, recommendation:"ship"|"reject"|"unsure", ' +
763
+ 'testsPassed, testsFailed, openP1toP3, discrepancies:[...], notes } as JSON.',
764
+ })
765
+
766
+ return assertGate(
767
+ await spawn('JUDGE', 'judge.md',
768
+ 'Make the BINDING ship decision per `.valent-pipeline/steps/judge/ship-decision.md`. Your primary ' +
769
+ 'input is the evidence summary below from the evidence pass; if it flags any discrepancy or a ' +
770
+ 'non-"ship" recommendation or low confidence, spot-check the named artifact on disk before ' +
771
+ 'deciding. The verdict is terminal SHIP(pass)/REJECT(fail) — no partial ships. Write ' +
772
+ 'judge-decision.md (and story-report.md on SHIP).\n\nEVIDENCE:\n' + JSON.stringify(evidence),
773
+ { label: `gate:judge:${sid}`, phase: 'Judge', schema: VERDICT_SCHEMA, model: modelFor('JUDGE-DECIDE') }),
774
+ 'JUDGE',
775
+ )
776
+ }
777
+
488
778
  // runGate: a schema-validated verification stage with a code-owned rejection loop.
489
779
  // `reworkThunk` (or null for terminal gates) produces the fix work before re-gating.
490
780
  async function runGate(sid, role, promptFile, instruction, gatePhase, reworkThunk) {
@@ -513,37 +803,20 @@ async function runStory(story) {
513
803
  }
514
804
  }
515
805
 
516
- // runCriticGate (step 3b): three INDEPENDENT passes in parallel, then a triage barrier
517
- // that dedups and writes the verdict. The whole thing is wrapped in the code-owned
518
- // rejection loop; on reject, the routed dev agents rework and the passes re-run.
806
+ // runCriticGate (step 3b): three INDEPENDENT passes in parallel, then a triage barrier that dedups
807
+ // and writes the verdict. Wrapped in the code-owned rejection loop; on reject, the routed dev agents
808
+ // rework. The FIRST review is always the full 3-pass hunt (whole-diff coverage). Re-reviews after a
809
+ // rework are TARGETED (Lever 3) when enabled — one cheap pass that confirms only the previously-
810
+ // flagged findings are resolved — instead of re-running the full hunt, which collapses the rejection
811
+ // loop's cost tail. Disabled (default for existing configs) => every cycle re-runs the full hunt.
519
812
  async function runCriticGate(sid, devs) {
520
813
  let rejections = 0
814
+ let verdict = await fullCriticReview(sid)
521
815
  while (true) {
522
- // Independent perspective-diverse verify: each pass reads only its own inputs.
523
- await parallel(
524
- CRITIC_PASSES.map((p) => () =>
525
- spawn('CRITIC', 'critic.md',
526
- `Run pass ${p.pass} per \`.valent-pipeline/steps/critic/${p.step}\`. Read ${p.reads}. ` +
527
- `Do NOT read any other pass's output. Record findings only — do NOT deduplicate or set a verdict.`,
528
- {
529
- label: `critic:${p.pass}:${sid}`,
530
- phase: 'Critic',
531
- schema: FINDINGS_SCHEMA,
532
- returnContract: 'Return ONLY { schema:1, agent:"critic", story, pass, findings:[...] } as JSON.',
533
- })),
534
- )
535
- // Triage barrier: the single point of deduplication; produces the schema-validated verdict.
536
- const verdict = assertGate(
537
- await spawn('CRITIC', 'critic.md',
538
- 'Triage per `.valent-pipeline/steps/critic/triage.md`: gather findings from ALL three passes, ' +
539
- 'collapse duplicates (same root cause) into one, classify final severity, then write the verdict.',
540
- { label: `gate:critic:${sid}`, phase: 'Critic', schema: VERDICT_SCHEMA }),
541
- 'CRITIC',
542
- )
543
816
  if (verdict.verdict === 'pass') return verdict
544
817
  // 'needs-review' from triage means CRITIC could not complete a pass/fail review — almost
545
- // always an empty/unchanged diff (the reported failure mode). Re-running the three passes +
546
- // triage (4 agents) + dev rework would reproduce it every cycle until the cap, so bail now.
818
+ // always an empty/unchanged diff (the reported failure mode). Re-running passes + triage + dev
819
+ // rework would reproduce it every cycle until the cap, so bail now.
547
820
  if (verdict.verdict === 'needs-review') {
548
821
  log(`${sid}/CRITIC: needs-review (no reviewable diff) — escalating without re-running passes`)
549
822
  return { ...verdict, escalated: true }
@@ -553,8 +826,8 @@ async function runStory(story) {
553
826
  log(`${sid}/CRITIC: circuit breaker tripped after ${rejections} rejections — escalating`)
554
827
  return { ...verdict, escalated: true }
555
828
  }
556
- log(`${sid}/CRITIC: rejection ${rejections}/${maxRejectionCycles} — reworking`)
557
- // Route fixes to the owning dev agent(s), then re-run the passes.
829
+ log(`${sid}/CRITIC: rejection ${rejections}/${maxRejectionCycles} — reworking${targetedReverify ? ' (targeted re-verify)' : ''}`)
830
+ // Route fixes to the owning dev agent(s), then re-evaluate.
558
831
  await parallel(
559
832
  devs.map((t) => () =>
560
833
  spawn(t.agent, `${t.agent.toLowerCase()}.md`, 'Fix every High finding CRITIC routed to you.', {
@@ -562,6 +835,51 @@ async function runStory(story) {
562
835
  phase: 'Critic',
563
836
  })),
564
837
  )
838
+ verdict = targetedReverify ? await targetedCriticReverify(sid, rejections) : await fullCriticReview(sid)
839
+ }
840
+
841
+ // --- review strategies ---------------------------------------------------
842
+ // Full 3-pass hunt + triage barrier. Each pass reads only its own inputs (perspective-diverse,
843
+ // cannot anchor); triage is the single dedup point and the judgment step (stays on CRITIC-TRIAGE).
844
+ async function fullCriticReview(s) {
845
+ await parallel(
846
+ CRITIC_PASSES.map((p) => () =>
847
+ spawn('CRITIC', 'critic.md',
848
+ `Run pass ${p.pass} per \`.valent-pipeline/steps/critic/${p.step}\`. Read ${p.reads}. ` +
849
+ `Do NOT read any other pass's output. Record findings only — do NOT deduplicate or set a verdict.`,
850
+ {
851
+ label: `critic:${p.pass}:${s}`,
852
+ phase: 'Critic',
853
+ schema: FINDINGS_SCHEMA,
854
+ model: modelFor(p.modelRole), // Lever 2: per-pass tier (blind=haiku, edge/acceptance=sonnet)
855
+ returnContract: 'Return ONLY { schema:1, agent:"critic", story, pass, findings:[...] } as JSON.',
856
+ })),
857
+ )
858
+ return assertGate(
859
+ await spawn('CRITIC', 'critic.md',
860
+ 'Triage per `.valent-pipeline/steps/critic/triage.md`: gather findings from ALL three passes, ' +
861
+ 'collapse duplicates (same root cause) into one, classify final severity, then write the verdict.',
862
+ { label: `gate:critic:${s}`, phase: 'Critic', schema: VERDICT_SCHEMA, model: modelFor('CRITIC-TRIAGE') }),
863
+ 'CRITIC',
864
+ )
865
+ }
866
+
867
+ // Lever 3: targeted re-review. ONE cheap pass that re-checks only the previously-flagged findings
868
+ // against the updated diff (plus obvious regressions the fix introduced in those files), instead
869
+ // of re-running the full 3-pass hunt. Keeps the `gate:critic` label so it remains the verdict of
870
+ // record and the pass-invariant still applies. QA-B + JUDGE stay the whole-diff backstops.
871
+ async function targetedCriticReverify(s, cycle) {
872
+ return assertGate(
873
+ await spawn('CRITIC', 'critic.md',
874
+ `Targeted RE-VERIFY (rejection cycle ${cycle}). The dev has reworked the findings from your ` +
875
+ `previous review. Read your prior verdict in critic-review.md (the open findings) and the ` +
876
+ `UPDATED git diff for the files those findings touched. For EACH prior finding, confirm it is ` +
877
+ `now resolved; also flag any obvious regression the fix introduced in those same files. Do NOT ` +
878
+ `re-hunt the whole diff — the first review already covered it. Write the verdict: pass only if ` +
879
+ `every prior High finding is resolved and the fix introduced no new High.`,
880
+ { label: `gate:critic:${s}`, phase: 'Critic', schema: VERDICT_SCHEMA, model: modelFor('CRITIC-REVERIFY') }),
881
+ 'CRITIC',
882
+ )
565
883
  }
566
884
  }
567
885
  }
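A rough way to see what targeted re-verify changes is to count review-agent calls per story. The sketch below is an illustration under assumptions: `criticReviewCalls` is a hypothetical helper, it counts only CRITIC calls (three passes plus triage for a full review, one call for a targeted re-verify), it ignores dev rework spawns, and call counts are not token costs.

```js
// Hypothetical call-count model for the CRITIC gate.
const FULL_REVIEW_CALLS = 4   // blind + edge + acceptance passes, then triage
const TARGETED_CALLS = 1      // single CRITIC-REVERIFY pass

// `reworks` = number of times the dev agents reworked and CRITIC reviewed again.
function criticReviewCalls(reworks, targeted) {
  const perReReview = targeted ? TARGETED_CALLS : FULL_REVIEW_CALLS
  return FULL_REVIEW_CALLS + reworks * perReReview // the first review is always the full hunt
}

for (const reworks of [0, 1, 3]) {
  console.log(reworks, criticReviewCalls(reworks, false), criticReviewCalls(reworks, true))
}
// prints:
// 0 4 4
// 1 8 5
// 3 16 7
```

The two strategies are identical when the first review passes; the difference grows only with the number of rejection cycles, which is why the flag targets the loop's tail rather than the first review.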
@@ -39,6 +39,9 @@ Read these as needed to answer questions:
39
39
  **"How do I run an epic or the whole project?"**
40
40
  → `/valent-run-epic-workflow EPIC-ID` runs all stories tagged with that epic; `/valent-run-project-workflow` runs the whole backlog across epics. Both plan and execute sprints automatically.
41
41
 
42
+ **"A run got interrupted (crash / context reset / new session) — how do I continue?"**
43
+ → `/valent-resume`. It reads `.valent-pipeline/run-state.json`, shows what already shipped, and picks up where it stopped — no `runId` needed, and finished stories are not re-run.
44
+
42
45
  **"How do I configure the pipeline?"**
43
46
  → `/valent-configure` for interactive wizard, or edit `.valent-pipeline/pipeline-config.yaml` directly.
44
47
 
@@ -0,0 +1,72 @@
1
+ ---
2
+ name: valent-resume
3
+ description: 'Resume an interrupted pipeline run (epic, project, sprint, or single story) after a crash, context reset, network glitch, or new session — with no runId required. Use when the user says "resume", "continue the run", "pick up where it left off", "resume the pipeline", or runs /valent-resume.'
4
+ argument-hint: '(none — auto-detects; or a run key like an epic id)'
5
+ ---
6
+
7
+ # valent-resume
8
+
9
+ One command to continue an interrupted pipeline run. The user types `/valent-resume` and you figure out the rest: what was running, what already shipped, and how to continue **without re-billing finished work**. No `runId` to remember.
10
+
11
+ The pipeline records a durable pointer at `.valent-pipeline/run-state.json` around every Workflow call (the run skills maintain it). This skill reads that pointer plus the on-disk progress artifacts and resumes via the cheapest safe path.
12
+
13
+ ## Step 1: Locate the run
14
+
15
+ 1. Read `.valent-pipeline/run-state.json`.
16
+ - **Missing?** Fall back: read `{epic_progress_path}` (`./epic-progress.md` by default). If it exists, treat it as an epic/project run to resume (its `epic_id` is the key). If neither exists → tell the user there is nothing to resume and offer to start a fresh run (`/valent-run-epic-workflow`, `/valent-run-project-workflow`, or `/valent-run-story-workflow`). STOP.
17
+ 2. Parse the pointer. Its shape:
18
+ ```json
19
+ { "schema": 1, "kind": "epic|project|story", "key": "<epic id | 'project' | story id>",
20
+ "scriptPath": "<plan|sprint|retro workflow path>", "phase": "plan|sprint|retro",
21
+ "sprintId": "<id or null>", "args": { /* the exact args last passed */ },
22
+ "runId": "<id or null>", "status": "running|completed" }
23
+ ```
24
+ 3. If `status` is `"completed"` → the last run already finished. Tell the user, summarize, and offer a fresh run. STOP (unless they explicitly want to re-run).
25
+
26
+ ## Step 2: Show what's already done (then continue without asking, per the original opt-in)
27
+
28
+ Before relaunching, give the user a 3–5 line summary so they trust nothing is being redone:
29
+ - From `epic-progress.md` (epic/project): stories completed, stories remaining, current sprint.
30
+ - From disk: scan `stories/*/output/` for `story-report.md` (shipped) and `judge-decision.md` (finalized) to confirm which stories are done.
31
+ - State the resume path you're about to take (below).
32
+
33
+ ## Step 3: Resume by the cheapest safe path
34
+
35
+ Pick the first applicable:
36
+
37
+ ### A. Same-session journal fast-path (only if a `runId` is present)
38
+ The journal `runId` is **same-session only**. If this is the same session the run started in, relaunch the *exact* in-flight workflow:
39
+ ```js
40
+ Workflow({ scriptPath: <pointer.scriptPath>, resumeFromRunId: <pointer.runId> })
41
+ ```
42
+ The journal replays the unchanged prefix of `agent()` calls instantly and re-runs only from the interruption point. If the Workflow tool reports the run is **unknown / not resumable** (i.e. a different session), fall through to B.
43
+
44
+ ### B. Cross-session idempotent resume (runId gone, or fast-path failed)
45
+ Relaunch the in-flight phase from disk state; finished work is detected and skipped:
46
+
47
+ - **`phase: "sprint"`** (the common, expensive case): relaunch `sprint.workflow.js` with the pointer's saved `args` **plus `skipCompleted: true`**. Each story is first checked for a terminal verdict on disk (`judge-decision.md` / `story-report.md`) and **already-finalized stories are skipped** — only unfinished stories run. This is the key to not re-billing a partially-completed sprint.
48
+ ```js
49
+ Workflow({ scriptPath: '.valent-pipeline/orchestrators/claude-code/sprint.workflow.js',
50
+ args: { ...<pointer.args>, skipCompleted: true } })
51
+ ```
52
+ - **`phase: "plan"`**: relaunch `plan.workflow.js` with the saved `args` (grooming is cheap and idempotent — re-grooming an already-groomed story just re-confirms it).
53
+ - **`phase: "retro"`**: relaunch `retro.workflow.js` with the saved `args` (learning only; safe to repeat).
54
+
55
+ ### C. Hand back to the run loop (epic/project)
56
+ After the in-flight workflow completes, **continue the remaining work** so the whole epic/project finishes:
57
+ - `kind: "epic"` → invoke `/valent-run-epic-workflow <key>`. Its Step 3 detects `epic-progress.md` and resumes at the next unfinished sprint; shipped stories are not re-run.
58
+ - `kind: "project"` → invoke `/valent-run-project-workflow`. Same resume behavior.
59
+ - `kind: "story"` → nothing more to loop; report the single story's outcome.
60
+
61
+ When relaunching a `sprint`/`plan`/`retro` directly (not via the run skill), update `.valent-pipeline/run-state.json` yourself: set `status: "running"` (and the new `runId` once it returns) before, `status: "completed"` after.
62
+
63
+ ## Step 4: Report
64
+
65
+ Summarize: what was resumed, how (journal fast-path vs. idempotent re-run), which stories were skipped as already-done, and the final outcome. If anything was rolled over or blocked, surface it.
66
+
67
+ ## Notes
68
+
69
+ - **Never restart from scratch.** A fresh `/valent-run-*` without resume would re-bill shipped stories. Always go through the pointer + `skipCompleted` path.
70
+ - **The pointer is a hint, the artifacts are truth.** If `run-state.json` is stale or inconsistent with `epic-progress.md` / story artifacts, trust the artifacts (a shipped `story-report.md` means shipped) and reconcile.
71
+ - **Do not hand-edit a journal.** Resume is either `resumeFromRunId` (same session) or `skipCompleted` (cross-session) — both are safe and idempotent.
72
+ - This skill is read-mostly: it inspects state and relaunches workflows. It does not modify story code or gate verdicts.
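The resume decision described in this skill can be sketched as a small pure function. This is a minimal illustration, not the skill's actual implementation: `chooseResume` and the `sameSession` flag are hypothetical names, and how a session detects that it is "the same session" is not specified here. The pointer fields (`scriptPath`, `phase`, `args`, `runId`, `status`) are the ones the skill describes.

```javascript
// Hypothetical sketch: pick a resume path from the run-state pointer.
// `sameSession` is an assumed input; the real skill decides this itself.
function chooseResume(pointer, { sameSession }) {
  // Nothing to do when there is no pointer or the run already finished.
  if (!pointer || pointer.status === 'completed') return { mode: 'none' };

  // Journal fast-path: only valid while the runId is still alive in this session.
  if (sameSession && pointer.runId) {
    return {
      mode: 'journal',
      call: { scriptPath: pointer.scriptPath, resumeFromRunId: pointer.runId },
    };
  }

  // Cross-session path: idempotent re-run. Only the sprint phase needs skipCompleted;
  // plan and retro are relaunched with the saved args unchanged.
  const args =
    pointer.phase === 'sprint' ? { ...pointer.args, skipCompleted: true } : pointer.args;
  return { mode: 'rerun', call: { scriptPath: pointer.scriptPath, args } };
}

// Example pointer (values are illustrative only).
const pointer = {
  schema: 1,
  kind: 'epic',
  key: 'E-7',
  scriptPath: '.valent-pipeline/orchestrators/claude-code/sprint.workflow.js',
  phase: 'sprint',
  sprintId: 'E-7-sprint-2',
  args: { stories: ['S-1', 'S-2'] },
  runId: null,
  status: 'running',
};

const decision = chooseResume(pointer, { sameSession: false });
console.log(decision.mode, decision.call.args);
```

The `call` object is what would be passed to `Workflow(...)`; keeping the decision separate from the call makes it easy to report which path was taken in Step 4.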
@@ -29,7 +29,7 @@ Use the standard 200k context window. Workflow `agent()` calls run in their own
29
29
 
30
30
  ### Step 1: Load Pipeline Config
31
31
 
32
- Read and follow `.valent-pipeline/steps/orchestration/load-pipeline-config.md`. Set `{epic_id}` from the argument. Also capture the entire `models` section (the `{ opus:[...], sonnet:[...], haiku:[...] }` tier→roles map) — pass it as the `models` arg to every Workflow call below so per-agent model tiers stay config-driven (editable via `/valent-configure`). If the config has no `models` section, omit the arg and the workflows use their baked-in default assignment. Likewise capture the `reasoning` section (level→roles thinking-effort map) and pass it as the `reasoning` arg to every Workflow call; it is blank by default (injects nothing). Omit it if absent or all levels are empty.
32
+ Read and follow `.valent-pipeline/steps/orchestration/load-pipeline-config.md`. Set `{epic_id}` from the argument. Also capture the entire `models` section (the `{ opus:[...], sonnet:[...], haiku:[...] }` tier→roles map) — pass it as the `models` arg to every Workflow call below so per-agent model tiers stay config-driven (editable via `/valent-configure`). If the config has no `models` section, omit the arg and the workflows use their baked-in default assignment. Likewise capture the `reasoning` section (level→roles thinking-effort map) and pass it as the `reasoning` arg to every Workflow call; it is blank by default (injects nothing). Omit it if absent or all levels are empty. Also capture the `pre_critic_gate` section (deterministic lint/type/static checks) and pass it as the `preCriticGate` arg to the **sprint** Workflow call; omit it if absent or its `commands` list is empty (the gate is then a no-op).
33
33
 
34
34
  ### Step 2: Validate Epic
35
35
 
@@ -48,6 +48,8 @@ Read `{epic_progress_path}`.
48
48
 
49
49
  ### Step 4: Sprint Loop (plan → sprint → retro)
50
50
 
51
+ > **Run-state pointer (powers `/valent-resume`).** Maintain `.valent-pipeline/run-state.json` around every Workflow call so the run can be resumed after a crash, context reset, or new session. **Immediately before** each `plan`/`sprint`/`retro` call, write `{ schema: 1, kind: "epic", key: "{epic_id}", scriptPath: "<the script>", phase: "plan"|"sprint"|"retro", sprintId: "{epic_id}-sprint-{n}", args: <the exact args object>, runId: null, status: "running" }`. **The moment the call returns**, set `runId` to the returned id. On epic completion (Step 5) set `status: "completed"`. This single file is what `/valent-resume` reads — keep it current.
52
+
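The pointer lifecycle above (write before the call, set `runId` on return, mark completed at the end) can be sketched with three small helpers. The helper names are hypothetical and the sketch deliberately leaves file I/O to the caller; only the field names come from the pointer description.

```javascript
// Hypothetical helpers for the run-state pointer lifecycle. Each returns a new
// object; the caller is assumed to serialize it to .valent-pipeline/run-state.json.
function beginRun({ kind, key, scriptPath, phase, sprintId, args }) {
  return { schema: 1, kind, key, scriptPath, phase, sprintId, args, runId: null, status: 'running' };
}

function recordRunId(state, runId) {
  return { ...state, runId };
}

function completeRun(state) {
  return { ...state, status: 'completed' };
}

// Illustrative values only.
let state = beginRun({
  kind: 'epic',
  key: 'E-7',
  scriptPath: '.valent-pipeline/orchestrators/claude-code/sprint.workflow.js',
  phase: 'sprint',
  sprintId: 'E-7-sprint-1',
  args: { stories: ['S-1'], maxRejectionCycles: 5 },
});
const beforeCall = JSON.stringify(state, null, 2); // written immediately before the Workflow call

state = recordRunId(state, 'run-123'); // written the moment the call returns
state = completeRun(state); // written on epic completion
console.log(beforeCall);
console.log(state.runId, state.status);
```

Writing the pointer before the call matters: if the session dies mid-run, the file still names the script, phase, and exact args needed for a `skipCompleted` re-run.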
51
53
  Loop sprints until no pending epic stories with met dependencies remain. Each iteration:
52
54
 
53
55
  #### 4a. Re-read state from disk
@@ -74,11 +76,11 @@ Feed the planned batch straight into `sprint.workflow.js`:
74
76
  ```js
75
77
  Workflow({
76
78
  scriptPath: '.valent-pipeline/orchestrators/claude-code/sprint.workflow.js',
77
- args: { stories: <plan output .stories>, maxRejectionCycles: {quality.max_rejection_cycles or 5}, models: <config.models or omit>, reasoning: <config.reasoning or omit> }
79
+ args: { stories: <plan output .stories>, maxRejectionCycles: {quality.max_rejection_cycles or 5}, targetedReverify: {quality.targeted_reverify or false}, models: <config.models or omit>, reasoning: <config.reasoning or omit>, preCriticGate: <config.pre_critic_gate or omit>, ref: <config.ref or omit> }
78
80
  })
79
81
  ```
80
82
 
81
- It runs each story through the per-story pipeline (deterministic graph resolve, READINESS/CRITIC/JUDGE gates, 3-pass parallel CRITIC, code-owned rejection cap) **sequentially on a shared branch**, rolling over any story that JUDGE rejects or that trips the cap. Returns `{ shipped, stories_shipped, stories_rolled_over, results: [{ storyId, shipped, verdict, skipped }] }`. Record its `runId`.
83
+ It runs each story through the per-story pipeline (deterministic graph resolve, READINESS gate, the deterministic pre-CRITIC static gate, CRITIC/JUDGE gates, 3-pass parallel CRITIC, code-owned rejection cap) **sequentially on a shared branch**, rolling over any story that JUDGE rejects or that trips the cap. Returns `{ shipped, stories_shipped, stories_rolled_over, results: [{ storyId, shipped, verdict, skipped }] }`. Record its `runId`.
82
84
 
83
85
  #### 4e. Update progress
84
86
  For each shipped story: move it from `stories_remaining` to `stories_completed` in `epic-progress.md` with a compact one-line outcome; update `total_completed` and `last_updated`; keep the file under 50 lines. Then read and follow `.valent-pipeline/steps/orchestration/update-backlog-status.md`. Rolled-over stories stay `pending` for a later sprint (or are surfaced as blocked if they keep failing).
@@ -104,12 +106,15 @@ When no pending epic stories with met dependencies remain, write `epic-report.md
104
106
 
105
107
  ## Resume (do NOT restart from scratch)
106
108
 
107
- Two layers of durability:
109
+ **Easiest path: just run `/valent-resume`.** It reads `.valent-pipeline/run-state.json`, shows what already shipped, and continues from where the run stopped — journal fast-path if it's the same session, idempotent re-run otherwise. The user never needs a `runId`.
110
+
111
+ Under the hood there are three layers of durability:
108
112
 
109
- 1. **Within a single Workflow run** (a plan, a sprint, or a retro): every invocation returns a `runId`. If one is interrupted, relaunch **that** workflow with `Workflow({ scriptPath, resumeFromRunId: <runId> })` — the journal replays the unchanged prefix instantly and re-runs only from the first changed/new `agent()` call. Already-shipped stories and passed gates are not redone.
110
- 2. **Across the sprint loop**: `epic-progress.md` + `{backlog_path}` record which stories shipped and which sprint is current. On a full conversation reset, re-run `/valent-run-epic-workflow {epic_id}`; Step 3 detects the existing progress file and resumes at the next unfinished sprint. Shipped stories are not re-run.
113
+ 1. **Within a single Workflow run** (a plan, a sprint, or a retro): every invocation returns a `runId` (recorded in `run-state.json`). In the **same session**, relaunch **that** workflow with `Workflow({ scriptPath, resumeFromRunId: <runId> })` — the journal replays the unchanged prefix instantly and re-runs only from the first changed/new `agent()` call.
114
+ 2. **Across sessions (runId is gone)**: relaunch the recorded `sprint` workflow with `skipCompleted: true`. Each story is checked for a terminal verdict on disk and finalized stories are skipped, so the resumed sprint does not re-bill shipped stories.
115
+ 3. **Across the sprint loop**: `epic-progress.md` + `{backlog_path}` record which stories shipped and which sprint is current. Re-running `/valent-run-epic-workflow {epic_id}` resumes at the next unfinished sprint.
111
116
 
112
- Do **not** hand-edit state files to "resume" a Workflow — pass `resumeFromRunId`.
117
+ Do **not** hand-edit state files to "resume" a Workflow — use `/valent-resume` (or pass `resumeFromRunId` / `skipCompleted`).
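The on-disk check behind `skipCompleted` (layer 2) can be sketched as follows. The artifact names `judge-decision.md` and `story-report.md` are the ones the resume skill names; everything else is an assumption for illustration: the `partitionStories` name, the per-story directory layout, and the injected `exists` callback (in a real script this could be `fs.existsSync`).

```javascript
// Hypothetical sketch of the skipCompleted check: a story counts as finalized
// when a terminal-verdict artifact exists in its directory.
const TERMINAL_ARTIFACTS = ['judge-decision.md', 'story-report.md'];

function partitionStories(storyIds, storyDir, exists) {
  const skipped = [];
  const pending = [];
  for (const storyId of storyIds) {
    const finalized = TERMINAL_ARTIFACTS.some((name) => exists(`${storyDir(storyId)}/${name}`));
    (finalized ? skipped : pending).push(storyId);
  }
  return { skipped, pending };
}

// In-memory stand-in for the filesystem; the paths are illustrative only.
const onDisk = new Set(['stories/S-1/story-report.md']);
const result = partitionStories(
  ['S-1', 'S-2'],
  (id) => `stories/${id}`,
  (path) => onDisk.has(path)
);
console.log(result);
```

Only the `pending` list would be run again, which is why a resumed sprint does not re-bill stories that already shipped.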
113
118
 
114
119
  ## Notes
115
120
 
@@ -24,7 +24,7 @@ Use the standard 200k context window. Workflow `agent()` calls run in their own
24
24
 
25
25
  ### Step 1: Load Pipeline Config
26
26
 
27
- Read and follow `.valent-pipeline/steps/orchestration/load-pipeline-config.md`. Also capture the entire `models` section (the `{ opus:[...], sonnet:[...], haiku:[...] }` tier→roles map) — pass it as the `models` arg to every Workflow call below so per-agent model tiers stay config-driven (editable via `/valent-configure`). If the config has no `models` section, omit the arg and the workflows use their baked-in default assignment. Likewise capture the `reasoning` section (level→roles thinking-effort map) and pass it as the `reasoning` arg to every Workflow call; it is blank by default (injects nothing). Omit it if absent or all levels are empty.
27
+ Read and follow `.valent-pipeline/steps/orchestration/load-pipeline-config.md`. Also capture the entire `models` section (the `{ opus:[...], sonnet:[...], haiku:[...] }` tier→roles map) — pass it as the `models` arg to every Workflow call below so per-agent model tiers stay config-driven (editable via `/valent-configure`). If the config has no `models` section, omit the arg and the workflows use their baked-in default assignment. Likewise capture the `reasoning` section (level→roles thinking-effort map) and pass it as the `reasoning` arg to every Workflow call; it is blank by default (injects nothing). Omit it if absent or all levels are empty. Also capture the `pre_critic_gate` section (deterministic lint/type/static checks) and pass it as the `preCriticGate` arg to the **sprint** Workflow call; omit it if absent or its `commands` list is empty (the gate is then a no-op).
28
28
 
29
29
  ### Step 2: Build Cross-Epic Dependency Map
30
30
 
@@ -46,6 +46,8 @@ Read `{epic_progress_path}`.
46
46
 
47
47
  ### Step 4: Sprint Loop (plan → sprint → retro)
48
48
 
49
+ > **Run-state pointer (powers `/valent-resume`).** Maintain `.valent-pipeline/run-state.json` around every Workflow call. **Immediately before** each `plan`/`sprint`/`retro` call, write `{ schema: 1, kind: "project", key: "project", scriptPath: "<the script>", phase: "plan"|"sprint"|"retro", sprintId: "project-sprint-{n}", args: <the exact args object>, runId: null, status: "running" }`. **The moment the call returns**, set `runId`. On project completion (Step 5) set `status: "completed"`. `/valent-resume` reads this file.
50
+
49
51
  Loop sprints until all stories are shipped, blocked, or cancelled. Each iteration:
50
52
 
51
53
  #### 4a. Re-read state from disk
@@ -92,7 +94,7 @@ Feed the planned batch straight into `sprint.workflow.js`:
92
94
  ```js
93
95
  Workflow({
94
96
  scriptPath: '.valent-pipeline/orchestrators/claude-code/sprint.workflow.js',
95
- args: { stories: <plan output .stories>, maxRejectionCycles: {quality.max_rejection_cycles or 5}, models: <config.models or omit>, reasoning: <config.reasoning or omit> }
97
+ args: { stories: <plan output .stories>, maxRejectionCycles: {quality.max_rejection_cycles or 5}, targetedReverify: {quality.targeted_reverify or false}, models: <config.models or omit>, reasoning: <config.reasoning or omit>, preCriticGate: <config.pre_critic_gate or omit>, ref: <config.ref or omit> }
96
98
  })
97
99
  ```
98
100
 
@@ -128,12 +130,13 @@ If the loop finds no unblocked stories but pending stories remain, list each blo
128
130
 
129
131
  ## Resume (do NOT restart from scratch)
130
132
 
131
- Two layers of durability:
133
+ **Easiest path: run `/valent-resume`.** It reads `.valent-pipeline/run-state.json`, shows what already shipped, and continues — no `runId` needed. Three layers of durability:
132
134
 
133
- 1. **Within a single Workflow run** (a plan, sprint, or retro): every invocation returns a `runId`. If one is interrupted, relaunch **that** workflow with `Workflow({ scriptPath, resumeFromRunId: <runId> })` — the journal replays the unchanged prefix instantly and re-runs only from the first changed/new `agent()` call.
134
- 2. **Across the sprint loop**: the progress file (`epic_id: "project"`) + `{backlog_path}` record what shipped and which sprint is current. On a full conversation reset, re-run `/valent-run-project-workflow`; Step 3 detects the existing progress file and resumes at the next unfinished sprint. Shipped stories are not re-run.
135
+ 1. **Within a single Workflow run** (same session): relaunch **that** workflow with `Workflow({ scriptPath, resumeFromRunId: <runId from run-state.json> })` — the journal replays the unchanged prefix instantly.
136
+ 2. **Across sessions (runId gone)**: relaunch the recorded `sprint` workflow with `skipCompleted: true`. Finalized stories are detected on disk and skipped, so the resumed sprint does not re-bill shipped stories.
137
+ 3. **Across the sprint loop**: the progress file (`epic_id: "project"`) + `{backlog_path}` record what shipped and which sprint is current. Re-running `/valent-run-project-workflow` resumes at the next unfinished sprint.
135
138
 
136
- Do **not** hand-edit state files to "resume" a Workflow — pass `resumeFromRunId`.
139
+ Do **not** hand-edit state files to "resume" a Workflow — use `/valent-resume` (or pass `resumeFromRunId` / `skipCompleted`).
137
140
 
138
141
  ## Notes
139
142
 
@@ -29,7 +29,7 @@ If no argument is provided, resolve the next work item from the backlog (see Ste
29
29
 
30
30
  ### Step 1: Load Pipeline Config
31
31
 
32
- Read and follow `.valent-pipeline/steps/orchestration/load-pipeline-config.md`. Capture `project.type` and the story's `testing_profiles` — these become the Workflow's `projectType` and `profiles` args. Also capture the entire `models` section (the `{ opus:[...], sonnet:[...], haiku:[...] }` tier→roles map) — pass it as the `models` arg so per-agent model tiers stay config-driven (editable via `/valent-configure`). If the config has no `models` section, omit the arg and the workflow uses its baked-in default assignment. Likewise capture the `reasoning` section (the `{ ultrathink:[...], 'think-harder':[...], 'think-hard':[...], think:[...] }` level→roles map) and pass it as the `reasoning` arg; it is blank by default (injects nothing). Omit the arg if the section is absent or all levels are empty.
32
+ Read and follow `.valent-pipeline/steps/orchestration/load-pipeline-config.md`. Capture `project.type` and the story's `testing_profiles` — these become the Workflow's `projectType` and `profiles` args. Also capture the entire `models` section (the `{ opus:[...], sonnet:[...], haiku:[...] }` tier→roles map) — pass it as the `models` arg so per-agent model tiers stay config-driven (editable via `/valent-configure`). If the config has no `models` section, omit the arg and the workflow uses its baked-in default assignment. Likewise capture the `reasoning` section (the `{ ultrathink:[...], 'think-harder':[...], 'think-hard':[...], think:[...] }` level→roles map) and pass it as the `reasoning` arg; it is blank by default (injects nothing). Omit the arg if the section is absent or all levels are empty. Also capture the `pre_critic_gate` section (deterministic lint/type/static checks) and pass it as the `preCriticGate` arg; omit it if absent or its `commands` list is empty (the gate is then a no-op).
33
33
 
34
34
  ### Step 1b: Resolve Next Work Item (when no argument provided)
35
35
 
@@ -52,7 +52,7 @@ Invoke the Workflow at `.valent-pipeline/orchestrators/claude-code/sprint.workfl
52
52
  ```js
53
53
  Workflow({
54
54
  scriptPath: '.valent-pipeline/orchestrators/claude-code/sprint.workflow.js',
55
- args: { storyId: '<resolved story id>', projectType: '<project.type>', profiles: [/* testing_profiles */], maxRejectionCycles: <quality.max_rejection_cycles or 5>, models: <config.models or omit>, reasoning: <config.reasoning or omit> }
55
+ args: { storyId: '<resolved story id>', projectType: '<project.type>', profiles: [/* testing_profiles */], maxRejectionCycles: <quality.max_rejection_cycles or 5>, targetedReverify: <quality.targeted_reverify or false>, models: <config.models or omit>, reasoning: <config.reasoning or omit>, preCriticGate: <config.pre_critic_gate or omit>, ref: <config.ref or omit> }
56
56
  })
57
57
  ```
58
58
 
@@ -64,9 +64,11 @@ It resolves the story's task graph deterministically (`resolve-graph`), runs the
64
64
 
65
65
  Report the returned verdict and the `runId` to the user. A story that JUDGE rejects (or that trips the rejection cap) is recorded as rolled-over — surface that outcome; do not silently retry.
66
66
 
67
+ **Run-state pointer (powers `/valent-resume`).** Immediately before the Workflow call, write `.valent-pipeline/run-state.json`: `{ schema: 1, kind: "story", key: "<story id>", scriptPath: "<the script>", phase: "sprint", sprintId: null, args: <the exact args object>, runId: null, status: "running" }`. The moment the call returns, set `runId`; when the story finalizes (ship or roll-over), set `status: "completed"`.
68
+
67
69
  ### Step 5: Resume (do NOT restart from scratch)
68
70
 
69
- Every Workflow invocation returns a `runId`. If a run is interrupted (context limit, crash, manual stop, or a mid-run script edit), **relaunch with `Workflow({ scriptPath, resumeFromRunId: <that runId> })`**, not a fresh invocation. The journal replays the unchanged prefix of `agent()` calls instantly (same script + same args gives a 100% cache hit) and re-runs only from the first changed/new call onward, so already-shipped stories and passed gates are not redone. Do **not** hand-edit state files to "resume"; pass `resumeFromRunId`.
71
+ **Easiest path: run `/valent-resume`.** It reads `.valent-pipeline/run-state.json` and continues with no `runId` needed. Manually: in the **same session**, relaunch with `Workflow({ scriptPath, resumeFromRunId: <runId from run-state.json> })`; the journal replays the unchanged prefix instantly. **Across sessions** (runId gone), relaunch the same single-story args with `skipCompleted: true`; if the story already finalized on disk it is detected and skipped, otherwise it runs cleanly. Do **not** hand-edit state files to "resume"; use `/valent-resume` (or pass `resumeFromRunId` / `skipCompleted`).
70
72
 
71
73
  ## Notes
72
74
 
@@ -346,6 +346,11 @@ tech_stack:
346
346
  browser_automation_mcp: "${config.tech_stack.browser_automation_mcp || 'playwright-mcp'}"
347
347
  state_management: "${config.tech_stack.state_management || 'React Context'}"
348
348
 
349
+ # Per-agent model tiers (tier -> role list). CRITIC and JUDGE are split find-vs-judge: only the
350
+ # judgment steps are on opus (CRITIC-TRIAGE = adjudication, JUDGE-DECIDE = the binding ship call),
351
+ # while the finding work is cheaper (CRITIC-BLIND = haiku; CRITIC-EDGE/-ACCEPT and JUDGE-EVIDENCE =
352
+ # sonnet). To force the whole family onto one tier, list the bare head (e.g. "CRITIC") under a tier
353
+ # and it propagates to every sub-role.
349
354
  models:
350
355
  opus: [${(config.models.opus || defaults.models.opus).map(a => `"${a}"`).join(', ')}]
351
356
  sonnet: [${(config.models.sonnet || defaults.models.sonnet).map(a => `"${a}"`).join(', ')}]
@@ -362,10 +367,34 @@ reasoning:
362
367
  think-hard: [${(config.reasoning?.['think-hard'] || []).map(a => `"${a}"`).join(', ')}]
363
368
  think: [${(config.reasoning?.think || []).map(a => `"${a}"`).join(', ')}]
364
369
 
370
+ # Deterministic pre-CRITIC gate: cheap, non-LLM checks (lint / type-check / static analysis) run
371
+ # after Build and BEFORE the Opus CRITIC. Mechanical defects caught here are fixed for free instead
372
+ # of burning an expensive CRITIC rejection cycle. BLANK by default (empty commands => no-op). Fill
373
+ # commands with your project's checks, e.g.:
374
+ # commands: ["npm run lint", "npm run typecheck", "semgrep --config auto --error"]
375
+ # blocking: true => reworks-and-retries (bounded by max_cycles, default = quality.max_rejection_cycles);
376
+ # blocking: false => runs once as advisory and always proceeds to CRITIC.
377
+ pre_critic_gate:
378
+ commands: [${(config.pre_critic_gate?.commands || []).map(c => `"${c}"`).join(', ')}]
379
+ blocking: ${config.pre_critic_gate?.blocking ?? true}
380
+
381
+ # Ref MCP documentation verification. When the Ref tools (ref_search_documentation, ref_read_url)
382
+ # are available in your session, the listed agents are told to verify third-party APIs against
383
+ # current docs before writing/reviewing that code — cutting wrong-API rework and doc-page context
384
+ # bloat. Conditional on the tools being present, so it is a harmless no-op if you have not installed
385
+ # Ref. Set enabled: false to suppress. Install Ref as an MCP server to benefit.
386
+ ref:
387
+ enabled: ${config.ref?.enabled ?? true}
388
+ roles: [${(config.ref?.roles || defaults.ref.roles).map(r => `"${r}"`).join(', ')}]
389
+
365
390
  quality:
366
391
  max_rejection_cycles: ${config.quality.max_rejection_cycles}
367
392
  retrospective_every_n_stories: ${config.quality.retrospective_every_n_stories}
368
393
  stall_threshold_minutes: ${config.quality.stall_threshold_minutes}
394
+ # On a CRITIC rejection, re-review ONLY the previously-flagged findings against the reworked diff
395
+ # (one cheap pass) instead of re-running the full 3-pass hunt every cycle. QA-B + JUDGE remain the
396
+ # whole-diff backstops. Set false for a full re-hunt on every rejection (the pre-Lever-3 behavior).
397
+ targeted_reverify: ${config.quality.targeted_reverify ?? true}
369
398
 
370
399
  git:
371
400
  target_branch: "${config.git.target_branch}"
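The tier-to-roles map and the "bare head propagates to every sub-role" rule described in the `models` comment can be illustrated with a small lookup. This is a sketch under stated assumptions, not the package's resolver: `tierFor` is a hypothetical name, the precedence (an exact sub-role entry wins over the family head) is an assumption, and so is the `sonnet` fallback for unlisted roles.

```javascript
// Hypothetical tier lookup for a role such as "CRITIC-BLIND".
// Assumed precedence: exact role entry first, then the family head ("CRITIC").
function tierFor(role, models, fallback = 'sonnet') {
  const tiers = Object.entries(models);
  const exact = tiers.find(([, roles]) => roles.includes(role));
  if (exact) return exact[0];

  const head = role.split('-')[0]; // "CRITIC-BLIND" -> "CRITIC"
  const family = tiers.find(([, roles]) => roles.includes(head));
  return family ? family[0] : fallback;
}

// Split find-vs-judge assignment, using role names from the defaults.
const split = {
  opus: ['READINESS', 'CRITIC-TRIAGE', 'JUDGE-DECIDE'],
  sonnet: ['CRITIC-EDGE', 'CRITIC-ACCEPT', 'JUDGE-EVIDENCE'],
  haiku: ['CRITIC-BLIND'],
};

// Whole CRITIC family forced onto one tier by listing only the bare head.
const forced = { opus: ['CRITIC'], sonnet: [], haiku: [] };

console.log(tierFor('CRITIC-BLIND', split)); // haiku
console.log(tierFor('CRITIC-TRIAGE', split)); // opus
console.log(tierFor('CRITIC-BLIND', forced)); // opus
```

With the split map only the adjudication and ship-call roles land on the top tier; with the forced map every `CRITIC-*` sub-role inherits the head's tier.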
@@ -41,6 +41,11 @@ export function validateConfig(config) {
41
41
  errors.push(`quality.${field} must be a number, got: ${typeof config.quality[field]}`);
42
42
  }
43
43
  }
44
+ // Lever 3: targeted re-review on rejection (boolean). When true, a CRITIC rejection re-reviews only
45
+ // the prior findings instead of re-running the full 3-pass hunt. Optional; defaults applied elsewhere.
46
+ if (config.quality?.targeted_reverify !== undefined && typeof config.quality.targeted_reverify !== 'boolean') {
47
+ errors.push(`quality.targeted_reverify must be a boolean, got: ${typeof config.quality.targeted_reverify}`);
48
+ }
44
49
 
45
50
  // Reasoning section (optional — per-agent thinking-effort control surface). Blank by default.
46
51
  if (config.reasoning) {
@@ -54,6 +59,40 @@ export function validateConfig(config) {
54
59
  }
55
60
  }
56
61
 
62
+ // Pre-CRITIC gate section (optional — deterministic lint/type/static checks run before CRITIC).
63
+ // Blank by default: no `commands` => the gate is a no-op and behavior is unchanged.
64
+ if (config.pre_critic_gate) {
65
+ const g = config.pre_critic_gate;
66
+ if (g.commands !== undefined) {
67
+ if (!Array.isArray(g.commands)) {
68
+ errors.push('pre_critic_gate.commands must be an array of shell-command strings');
69
+ } else if (!g.commands.every((c) => typeof c === 'string')) {
70
+ errors.push('pre_critic_gate.commands must contain only strings');
71
+ }
72
+ }
73
+ if (g.blocking !== undefined && typeof g.blocking !== 'boolean') {
74
+ errors.push(`pre_critic_gate.blocking must be a boolean, got: ${typeof g.blocking}`);
75
+ }
76
+ if (g.max_cycles !== undefined && typeof g.max_cycles !== 'number') {
77
+ errors.push(`pre_critic_gate.max_cycles must be a number, got: ${typeof g.max_cycles}`);
78
+ }
79
+ }
80
+
81
+ // Ref MCP section (optional — documentation-verification hint, Lever 6). Conditional on the Ref
82
+ // tools being available to the agent at runtime, so it is harmless when Ref is not installed.
83
+ if (config.ref) {
84
+ if (config.ref.enabled !== undefined && typeof config.ref.enabled !== 'boolean') {
85
+ errors.push(`ref.enabled must be a boolean, got: ${typeof config.ref.enabled}`);
86
+ }
87
+ if (config.ref.roles !== undefined) {
88
+ if (!Array.isArray(config.ref.roles)) {
89
+ errors.push('ref.roles must be an array of agent role names');
90
+ } else if (!config.ref.roles.every((r) => typeof r === 'string')) {
91
+ errors.push('ref.roles must contain only strings');
92
+ }
93
+ }
94
+ }
95
+
57
96
  // Communication section
58
97
  const validFormats = ['distilled', 'verbose'];
59
98
  if (config.communication?.handoff_format && !validFormats.includes(config.communication.handoff_format)) {
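To see what the new `pre_critic_gate` checks report, the sketch below lifts them into a standalone function and runs it on a deliberately malformed section. `validatePreCriticGate` is a hypothetical extraction for illustration; in the package the checks live inline in `validateConfig`.

```javascript
// Standalone copy of the pre_critic_gate checks, for illustration only.
function validatePreCriticGate(g) {
  const errors = [];
  if (!g) return errors; // section is optional

  if (g.commands !== undefined) {
    if (!Array.isArray(g.commands)) {
      errors.push('pre_critic_gate.commands must be an array of shell-command strings');
    } else if (!g.commands.every((c) => typeof c === 'string')) {
      errors.push('pre_critic_gate.commands must contain only strings');
    }
  }
  if (g.blocking !== undefined && typeof g.blocking !== 'boolean') {
    errors.push(`pre_critic_gate.blocking must be a boolean, got: ${typeof g.blocking}`);
  }
  if (g.max_cycles !== undefined && typeof g.max_cycles !== 'number') {
    errors.push(`pre_critic_gate.max_cycles must be a number, got: ${typeof g.max_cycles}`);
  }
  return errors;
}

// A single string instead of a list, and a string instead of a boolean.
const bad = validatePreCriticGate({ commands: 'npm run lint', blocking: 'yes' });
const good = validatePreCriticGate({ commands: ['npm run lint'], blocking: false, max_cycles: 2 });
console.log(bad);
console.log(good);
```

The malformed section yields two errors (one for `commands`, one for `blocking`); the well-formed one yields none, and an absent section is accepted because the gate is optional.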
@@ -146,14 +185,18 @@ export const defaults = {
146
185
  state_management: 'React Context',
147
186
  },
148
187
  models: {
149
- // Quality gates + high-judgment roles: judgment is the whole point, so top tier.
150
- // RETRO-REVIEW is the Workflow retrospective's loop-until-dry aggregate review.
151
- opus: ['READINESS', 'CRITIC', 'JUDGE', 'Retrospective', 'RETRO-REVIEW'],
152
- // Spec + build agents (and lighter retro reasoning) — mid tier.
153
- sonnet: ['REQS', 'UXA', 'QA-A', 'QA-B', 'BEND', 'FEND', 'DATA', 'MCP-DEV', 'LIBDEV', 'DOCGEN', 'IAC', 'MOBILE', 'RETRO'],
188
+ // Quality JUDGMENT steps: the decisions are the whole point, so top tier. CRITIC and JUDGE are
189
+ // split find-vs-judge (Lever 2): only the adjudication (CRITIC-TRIAGE) and the binding ship call
190
+ // (JUDGE-DECIDE) are opus. RETRO-REVIEW is the Workflow retrospective's loop-until-dry review.
191
+ opus: ['READINESS', 'CRITIC-TRIAGE', 'JUDGE-DECIDE', 'Retrospective', 'RETRO-REVIEW'],
192
+ // Spec + build agents, the *finding* half of the gates (CRITIC edge/acceptance hunting, JUDGE
193
+ // evidence cross-referencing), and lighter retro reasoning — mid tier.
194
+ sonnet: ['REQS', 'UXA', 'QA-A', 'QA-B', 'BEND', 'FEND', 'DATA', 'MCP-DEV', 'LIBDEV', 'DOCGEN', 'IAC', 'MOBILE', 'RETRO', 'CRITIC-EDGE', 'CRITIC-ACCEPT', 'CRITIC-REVERIFY', 'JUDGE-EVIDENCE'],
154
195
  // Mechanical retrieval / CLI-runner / IO steps — no reasoning, cheapest tier.
155
196
  // RESOLVE/PACK/VALIDATE/CALIBRATE/PERSIST are the Workflow orchestrators' CLI-runner agents.
156
- haiku: ['Knowledge', 'Embed', 'RESOLVE', 'PACK', 'VALIDATE', 'CALIBRATE', 'PERSIST'],
197
+ // STATIC runs the deterministic pre-CRITIC gate (lint/type/static analysis) — pure command runner.
198
+ // CRITIC-BLIND is the no-context blind hunt — mechanical (naming, dead code, obvious bugs).
199
+ haiku: ['Knowledge', 'Embed', 'RESOLVE', 'PACK', 'VALIDATE', 'CALIBRATE', 'PERSIST', 'STATIC', 'CRITIC-BLIND'],
157
200
  },
158
201
  // Per-agent reasoning effort (thinking budget). Mirrors `models`: list a role under an effort
159
202
  // level to inject that thinking trigger into the agent's prompt at spawn. BLANK by default —
@@ -166,10 +209,34 @@ export const defaults = {
166
209
  'think-hard': [],
167
210
  think: [],
168
211
  },
212
+ // Deterministic pre-CRITIC gate (Lever 1): cheap, non-LLM checks run after Build and before the
213
+ // Opus CRITIC. Mechanical defects caught here never become a CRITIC rejection cycle. BLANK by
214
+ // default (no commands => no-op). Populate `commands` with your project's lint / type-check /
215
+ // static-analysis invocations, e.g. ['npm run lint', 'npm run typecheck', 'semgrep --config auto --error'].
216
+ // `blocking: true` reworks-and-retries up to `max_cycles` (default: quality.max_rejection_cycles);
217
+ // `blocking: false` runs the checks once as advisory and always proceeds to CRITIC.
218
+ pre_critic_gate: {
219
+ commands: [],
220
+ blocking: true,
221
+ },
222
+ // Ref MCP documentation verification (Lever 6). When the Ref tools (ref_search_documentation,
223
+ // ref_read_url) are available in the session, the listed roles are told to verify third-party
224
+ // APIs against current docs before writing/reviewing that code. Purely conditional on tool
225
+ // availability — a no-op for sessions without Ref — so it defaults ON. Set enabled:false to
226
+ // suppress, or trim `roles` to scope it. Install Ref as an MCP server to benefit.
227
+ ref: {
228
+ enabled: true,
229
+ roles: ['BEND', 'FEND', 'DATA', 'MCP-DEV', 'LIBDEV', 'DOCGEN', 'IAC', 'MOBILE', 'CRITIC'],
230
+ },
169
231
  quality: {
170
232
  max_rejection_cycles: 5,
171
233
  retrospective_every_n_stories: 5,
172
234
  stall_threshold_minutes: 15,
235
+ // Lever 3: on a CRITIC rejection, re-review ONLY the previously-flagged findings against the
236
+ // reworked diff (one cheap pass) instead of re-running the full 3-pass hunt every cycle. New
237
+ // projects ship this on; QA-B + JUDGE remain the whole-diff backstops. Set false for a full
238
+ // re-hunt on every rejection cycle (the pre-Lever-3 behavior).
239
+ targeted_reverify: true,
173
240
  },
174
241
  git: {
175
242
  target_branch: '',
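The `pre_critic_gate` semantics described in the defaults (no commands is a no-op, `blocking: true` reworks and retries up to `max_cycles`, `blocking: false` runs once as advisory) can be sketched as a loop. This is a minimal illustration, not the workflow's implementation: `runPreCriticGate`, the injected `runCommand` and `rework` callbacks, the status strings, and the `exhausted` outcome when the cap is hit are all assumptions.

```javascript
// Hypothetical gate loop. `runCommand(cmd)` is assumed to return an exit code and
// `rework(failures)` to trigger a fix attempt; both are injected so the sketch stays pure.
function runPreCriticGate(gate, { runCommand, rework, defaultMaxCycles = 5 }) {
  const commands = gate?.commands ?? [];
  if (commands.length === 0) return { status: 'skipped', cycles: 0, failures: [] };

  const blocking = gate.blocking ?? true;
  const maxCycles = gate.max_cycles ?? defaultMaxCycles;
  const failing = () => commands.filter((cmd) => runCommand(cmd) !== 0);

  if (!blocking) {
    // Advisory mode: run once, report, and always proceed to CRITIC.
    const failures = failing();
    return { status: failures.length ? 'advisory-failures' : 'passed', cycles: 1, failures };
  }

  let failures = [];
  for (let cycle = 1; cycle <= maxCycles; cycle++) {
    failures = failing();
    if (failures.length === 0) return { status: 'passed', cycles: cycle, failures };
    if (cycle < maxCycles) rework(failures);
  }
  return { status: 'exhausted', cycles: maxCycles, failures };
}

// Stubbed run: lint fails until the first rework, typecheck always passes.
let fixed = false;
const outcome = runPreCriticGate(
  { commands: ['npm run lint', 'npm run typecheck'], blocking: true },
  {
    runCommand: (cmd) => (cmd === 'npm run lint' && !fixed ? 1 : 0),
    rework: () => { fixed = true; },
  }
);
console.log(outcome);
```

In a real runner `runCommand` could wrap Node's `child_process.spawnSync(cmd, { shell: true }).status`; the point of the sketch is only that mechanical failures are resolved in this loop before any CRITIC pass is spent on them.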