npm - valent-pipeline - Versions diffs - 0.6.1 → 0.6.2 - Mend

valent-pipeline 0.6.1 → 0.6.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (10) hide show

package/package.json +1 -1
package/pipeline/docs/cost-and-resumability-improvements.md +313 -0
package/pipeline/orchestrators/claude-code/sprint.workflow.js +356 -38
package/skills/valent-help/SKILL.md +3 -0
package/skills/valent-resume/SKILL.md +72 -0
package/skills/valent-run-epic-workflow/SKILL.md +12 -7
package/skills/valent-run-project-workflow/SKILL.md +9 -6
package/skills/valent-run-story-workflow/SKILL.md +5 -3
package/src/commands/init.js +29 -0
package/src/lib/config-schema.js +73 -6

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "valent-pipeline",
-  "version": "0.6.1",
+  "version": "0.6.2",
   "description": "v3 multi-agent AI pipeline for software development lifecycle",
   "type": "module",
   "bin": {

package/pipeline/docs/cost-and-resumability-improvements.md ADDED Viewed

@@ -0,0 +1,313 @@
+# Workflow Improvement Plan — Cost & Resumability
+> **Status:** Proposal / design doc
+> **Scope:** `valent-pipeline` orchestrators (`plan` / `sprint` / `retro`), gate model assignment, and the path to a multi-provider runner.
+> **Author's thesis (yours, and it's correct):** *Because the pipeline is gated, cheap models can do far more of the work.* The gates are the quality backstop, so the expensive model only needs to show up where **judgment** happens — not where **finding** or **building** happens. Everything below is an elaboration of that one idea, plus a resumability layer so a 2–4M-token sprint never has to start over.
+---
+## 1. The two problems, stated precisely
+| # | Problem | Root cause in the current design |
+|---|---------|----------------------------------|
+| 1 | **Resumability** after a network glitch or context exhaustion | Resume relies on the Workflow journal + `resumeFromRunId`, which is **same-session only** and **in-memory**. If the session itself dies (context exhaustion → new session) or the host process is killed, the journal/runId is gone and there is no durable, cross-session ledger of "which stories already shipped." A single transient agent failure inside a sequential gate **throws and kills the whole run** rather than retrying. |
+| 2 | **Cost** — ~1M tokens per planning run, 2–4M per sprint | Opus is the default for **every gate** (`READINESS`, `CRITIC`, `JUDGE`, `INTEGRATION`), and the most expensive gate — `CRITIC` — runs **3 full Opus passes + an Opus triage**, and **re-runs all of that** on every rejection cycle (up to `maxRejectionCycles`, default 5). The rejection loop tail is the single largest cost center. Nothing deterministic (linters, type-checkers, static analysis) runs *before* Opus, so Opus pays tokens to find mechanical issues a free tool would catch. |
+These interact: a robust **idempotent checkpoint** layer (Problem 1) *also* saves cost, because a resumed sprint skips already-shipped stories instead of re-billing them.
+---
+## 2. Cost anatomy — where the tokens actually go
+**Current model assignment** (`sprint.workflow.js:224-230`):
+```js
+const DEFAULT_MODELS = {
+  READINESS: 'opus', CRITIC: 'opus', JUDGE: 'opus', INTEGRATION: 'opus',   // ← gates: all Opus
+  REQS: 'sonnet', UXA: 'sonnet', 'QA-A': 'sonnet', 'QA-B': 'sonnet',
+  BEND: 'sonnet', FEND: 'sonnet', DATA: 'sonnet', 'MCP-DEV': 'sonnet',
+  LIBDEV: 'sonnet', DOCGEN: 'sonnet', IAC: 'sonnet', MOBILE: 'sonnet',
+  RESOLVE: 'haiku',
+}
+```
+**Current API prices (2026):**
+| Model | Input $/M | Output $/M | Cached input $/M (≈90% off) | Batch (−50%) |
+|-------|----------:|-----------:|----------------------------:|-------------:|
+| Opus 4.8   | $5  | $25 | ~$0.50 | $2.50 / $12.50 |
+| Sonnet 4.6 | $3  | $15 | ~$0.30 | $1.50 / $7.50 |
+| Haiku 4.5  | $1  | $5  | ~$0.10 | $0.50 / $2.50 |
+**Per-story Opus cost, directional, *zero rework*** (input largely uncached on first pass):
+| Gate | Opus calls | Rough $ / story |
+|------|-----------:|----------------:|
+| READINESS (1 pass) | 1 | ~$0.28 |
+| **CRITIC (3 passes + triage)** | **4** | **~$1.25** |
+| JUDGE (1 pass) | 1 | ~$0.33 |
+| **Total Opus (no rework)** | **6** | **~$1.86** |
+Now add the loop. **Each CRITIC rejection cycle re-runs all 3 passes + triage** (`runCriticGate`, `sprint.workflow.js:519-566`). Two cycles ≈ **+$2.50**, and CRITIC alone becomes ~70% of the story's Opus spend. **The rejection-loop tail is the thing to attack first.**
+> **Takeaway:** Of the 6 Opus calls per clean story, only **3 are genuine judgment calls** (the READINESS decision, the CRITIC *triage*, the JUDGE *ship decision*). The other 3 (the CRITIC *hunting* passes) are *finding* work that a cheaper model — or a free deterministic tool — can do most of. That gap is the savings.
+---
+## 3. Cost levers, highest ROI first
+### Lever 1 — Deterministic pre-gates (free, non-LLM) before any Opus runs
+Run linters, type-checkers, and static analysis as a **blocking gate between Build and CRITIC**. Dev agents must produce a clean lint/type/compile/test-build before a single Opus token is spent reviewing.
+| Tool | Cost | Catches (deterministically) |
+|------|------|------------------------------|
+| **Semgrep** | Free ≤10 contributors (OSS engine free) | Injection, secrets, taint flows, unsafe APIs, ~security rules |
+| **ESLint / tsc** | Free | Type errors, null/undefined, unused, unreachable, many real bugs |
+| **Ruff / Bandit** (Python) | Free | Lint + Python security |
+| Compiler / `tsc --noEmit` | Free | Whole class of "doesn't even type-check" rejections |
+**Why this is #1:** It directly shrinks the **rejection-loop tail** — the most expensive thing in the system. Every mechanical defect a linter catches is a CRITIC rejection cycle (3 Opus passes + triage) that *never happens*. It also lets CRITIC's prompt explicitly say *"lint/types/security-static already passed; focus only on semantic correctness and AC coverage,"* which shrinks each remaining Opus pass.
+**Where:** new `pre-critic-gate` step invoked in `runStory` between the Build barrier and `runCriticGate` (`sprint.workflow.js`, around the Build→Critic transition). The dev agents already run tests; add the static suite to their `handoff` step and make CRITIC refuse to start until the pre-gate verdict is green.
+### Lever 2 — Split *find* from *judge*; make Opus the exception-handler, not the default
+Re-tier `CRITIC` so the **hunting** passes run cheap and only **judgment** runs on Opus:
+```yaml
+# pipeline-config.yaml (proposed `models` override — already supported by buildModelMap)
+models:
+  haiku:  [RESOLVE, CRITIC_BLIND]        # blind-hunt is mechanical: naming, dead code, copy-paste
+  sonnet: [REQS, UXA, QA-A, QA-B, BEND, FEND, ...,
+           CRITIC_EDGE, CRITIC_ACCEPT,   # edge-case + acceptance-audit on Sonnet
+           JUDGE_EVIDENCE]               # evidence cross-referencing is mechanical counting
+  opus:   [CRITIC_TRIAGE,                # adjudication = judgment → keep Opus
+           JUDGE_DECIDE, READINESS]      # the actual ship/ready decisions
+```
+This requires splitting the monolithic `CRITIC` and `JUDGE` roles into sub-roles the model map can target (the 3-pass structure already exists as separate agents at `sprint.workflow.js:519-566` — give each pass its own role key). The triage step stays Opus because that's where duplicates collapse and severity is decided — the genuine judgment.
+**Escalation rule (the gate-thesis, made literal):** run the cheap pass; **escalate a finding to Opus only when** the cheap model marks it High-severity *or* flags low confidence. A clean diff that passes lint + Sonnet review + full AC coverage may need **zero Opus** — Opus becomes an adjudicator invoked on exceptions, not the default reviewer. Projected CRITIC cut: **~40–60%** with no loss of the Opus backstop on actual decisions.
+Apply the same split to `JUDGE`: Sonnet does the *evidence cross-reference* (does the test count match the spec, is every AC traced — mechanical counting), Opus makes only the final SHIP/REJECT call, and only when evidence is ambiguous.
+### Lever 3 — Targeted re-review on rejection (kill the loop tail)
+Today a rejection re-runs **all 3 CRITIC passes + triage** (`runCriticGate`). Instead: **persist the open findings**, and on rework, run a **verify-only pass** that checks *just the previously-failed findings* against the new diff. Full re-hunt only if the dev's fix touched files outside the flagged set.
+- Re-review cost drops from ~100% of a full CRITIC to ~20%.
+- Combined with Lever 1 (fewer rejections to begin with) this collapses the cost tail.
+- Lower `maxRejectionCycles` from 5 → 2–3 and **escalate to a human/Opus-once** instead of looping a 4th time; the 4th–5th cycles are almost never productive and are pure Opus burn.
+### Lever 4 — Offload review to flat-rate / free external reviewers
+Use an external reviewer as the **blind-hunt pass** (or a pre-CRITIC filter), so that pass costs a flat fee or $0 instead of Opus tokens:
+| Tool | Pricing | Local/CLI | Needs your LLM key? | Catches |
+|------|---------|-----------|---------------------|---------|
+| **CodeRabbit CLI** | **$0.25 / file** reviewed (credit add-on); SaaS Pro $24/dev/mo, Pro+ $48 | CLI client → their cloud (needs API key, not offline) | No — flat per-file | Race conditions, memory leaks, security vulns, actionable fixes |
+| **Qodo Merge** (ex-PR-Agent) | **Free tier 250 credits/mo**; Teams $30/user/mo (2500 credits). PR-Agent OSS is **self-hostable, BYO key** | Yes — IDE/local review + self-host/air-gapped | OSS path: yes (your key) | PR review, context-aware; Opus=5 credits/req, Grok-4=4 credits/req on managed tier |
+| **Greptile** | $30/seat/mo, 50 reviews incl., then $1/review | Cloud; enterprise self-host | Managed | Codebase-context review |
+| **Sourcery** | Free for OSS; Team $24/seat/mo (**BYO LLM** on Team) | Yes — CLI | Team tier: yes (your key) | **Rule-based refactoring** (deterministic) + AI review |
+| **Native Claude Code `/code-review`** | In-session tokens only | Yes — runs on the local diff | Uses your Claude session | Correctness bugs + reuse/simplification at chosen effort |
+**Recommendation:** CodeRabbit CLI ($0.25/file, predictable) or self-hosted **PR-Agent** (free, your key) as the **blind-hunt replacement**. They run on the diff, return structured findings, and slot straight into the existing `verdict.schema.json` gate machinery — the orchestrator doesn't care who produced the verdict as long as it matches the schema. The native `/code-review` skill is the zero-integration option for a first experiment.
+### Lever 5 — Multi-provider CLI arbitrage (Codex / Cursor / Grok) — *subsidized usage across providers*
+This is the big structural play, and it's feasible **because Workflow `agent()` subagents have Bash**. An agent can shell out to another vendor's headless CLI, wait, and parse the result back into the handoff schema. The orchestrator stays pure (all IO happens *inside* the agent, which is journal-replay-safe); the subagent becomes a thin **"shell driver."**
+**Subsidized runners to target:**
+- `codex exec "<prompt>"` — OpenAI **Codex CLI**, non-interactive/headless, billable against a **ChatGPT Plus/Pro/Team subscription** (flat-rate, not per-token API).
+- `cursor-agent` — **Cursor CLI** headless, subsidized by a Cursor subscription.
+- **Grok CLI** — xAI, subsidized by an X/Grok subscription.
+**The arbitrage:** route the cheap-but-bulky work (dev/build passes, blind-hunt, test scaffolding, doc lookups) to whichever flat-rate subscription you're *already paying for*, so those tokens are **$0 marginal**. Keep Claude **Opus only on the irreducible judgment gates**. If Codex/Cursor does the *building* on a flat subscription, the entire Sonnet dev line can leave the per-token API bill.
+**Architecture — generalize `modelFor(role)` into `runnerFor(role)`:**
+```yaml
+# pipeline-config.yaml (proposed)
+runners:
+  codex:   { cmd: "codex exec",       auth: subscription }   # flat-rate
+  cursor:  { cmd: "cursor-agent",     auth: subscription }
+  grok:    { cmd: "grok",             auth: subscription }
+  claude:  { cmd: native }                                   # in-session subagent
+roles:
+  BEND:          { runner: codex,  fallback: claude:sonnet }  # build on flat-rate sub
+  CRITIC_BLIND:  { runner: cursor, fallback: claude:haiku }   # blind hunt on flat-rate sub
+  CRITIC_TRIAGE: { runner: claude, model: opus }              # judgment stays Opus
+  JUDGE_DECIDE:  { runner: claude, model: opus }
+```
+The handoff/verdict schemas (`schemas/handoff.schema.json`, `schemas/verdict.schema.json`) are the **normalization layer** — a Codex-produced review and an Opus-produced review are interchangeable to the gate logic as long as both emit the schema. Long external runs use `run_in_background: true` + `Monitor`.
+**Guardrails (this is exactly your gate-thesis):** quality variance across providers is *fine* precisely because the Opus gate is the backstop. A cheaper/subsidized provider that produces weaker work just gets caught and rejected by CRITIC/JUDGE — you trade a little more rework risk for near-zero marginal token cost on the bulk. Keep a `fallback` so a missing/failing CLI degrades to a native Claude agent rather than killing the run.
+### Lever 6 — Ref MCP to cut rework *and* context bloat
+Wire **Ref MCP** (`ref_search_documentation`, `ref_read_url`) into the dev agents (BEND/FEND/DATA/…) and CRITIC. Two distinct savings:
+1. **Fewer rejection cycles.** A large share of CRITIC/JUDGE rejections are wrong API usage / hallucinated signatures. Ref lets the dev agent verify the real API *before* writing code, so the diff is right the first time → fewer (expensive, Opus) rework loops. This attacks the same cost tail as Lever 1, from the front.
+2. **Smaller context.** Without Ref, agents either guess or paste entire doc pages into context. Ref returns **only the relevant snippet**, so you pay for the 500 tokens you need, not the 40K-token page. CRITIC can use Ref to confirm "is this really the right call signature?" instead of being handed the whole SDK.
+**Where:** add a "verify external APIs via Ref before implementing" step to each dev `read-inputs`/`implement` step file, and a "confirm uncertain API usage via Ref" note to `critic/edge-case-hunt.md`.
+### Lever 7 — Cache hygiene, distilled handoffs, batch API
+- **Prompt-cache hygiene.** Caching is already in use, but cache hits require the static prefix to be **byte-identical** across spawns. Audit `buildPrompt` (`sprint.workflow.js:273-298`) so the **static** content (role prompt path, pipeline-context, handoff contract) comes first and the **dynamic** content (storyId, task subject, trigger) comes last. Verify realized cache-hit rate with the existing `audit` command (`src/commands/audit.js`) / `valent-review-cost` skill. Cached input is ~90% cheaper — a few percent of hit-rate is real money on a 4M-token sprint.
+- **Distilled handoffs.** `distilled-handoff-format.md` already exists; make sure CRITIC/JUDGE read **distilled** artifacts, not raw full files, wherever the decision doesn't need the raw text.
+- **Batch API (−50%)** for the latency-insensitive, non-interactive work: estimation/sizing in `plan`, retrospective synthesis, KB embeddings. These don't block a human, so the 50% batch discount is free money.
+---
+## 4. Projected impact
+Directional, stacking the levers (no loss of the Opus decision-backstop):
+| Lever | Mechanism | Est. cost reduction |
+|-------|-----------|---------------------|
+| 1 — Deterministic pre-gates | Removes ~30–50% of CRITIC rejection cycles | **High** |
+| 2 — Find vs. judge tiering | 2 of 3 CRITIC passes → Sonnet/Haiku; JUDGE evidence → Sonnet | ~40–60% of CRITIC |
+| 3 — Targeted re-review | Re-review ~20% of a full pass instead of 100% | High on rework tail |
+| 4 — External reviewer | Blind-hunt → $0.25/file or free, off the Opus bill | Moderate |
+| 5 — Provider arbitrage | Dev/blind work → subsidized flat-rate subs ($0 marginal) | **High** (can move the entire Sonnet dev line off API) |
+| 6 — Ref MCP | Fewer rework loops + smaller context | Moderate, compounding |
+| 7 — Cache/distill/batch | ~90% off cached input, 50% off batch | Moderate, broad |
+**Realistic combined target: 50–70% reduction** in per-sprint API spend, with the irreducible Opus judgment (CRITIC triage, JUDGE ship decision, READINESS) preserved as the quality floor.
+---
+## 5. Resumability redesign
+### Current state and the gap
+Resume = relaunch `Workflow({ scriptPath, resumeFromRunId })`; the journal replays the unchanged `agent()` prefix at ~100% cache hit. This is elegant **but**:
+- **Same-session only / in-memory.** Context exhaustion that forces a new session, or a killed host process, loses the journal and the runId.
+- **No durable ledger.** Nothing on disk says "story KANBAN-014 already SHIPPED in this sprint," so a from-scratch restart re-bills shipped stories.
+- **A transient failure is fatal to sequential gates.** In a `for`-loop story (`sprint.workflow.js:312-315`), a network blip inside an awaited gate `throw`s and ends the run (unlike `parallel`, which drops to `null`).
+### Fix A — Idempotent, on-disk phase checkpoints (cross-session resume)
+Persist a durable run ledger in the existing SQLite DB (extends the `artifacts`/`calibration` tables in `src/lib/db.js`). Add a CLI command:
+```
+node .valent-pipeline/bin/cli.js checkpoint --sprint <id> --story <id> --phase <spec|build|critic|qa|judge> --verdict <pass|ship|reject>
+```
+An `agent()` at the end of each phase writes the checkpoint (IO inside an agent = journal-safe). Then **make each phase idempotent at the top of `runStory`**: before running a phase, query the ledger; if a terminal verdict already exists for `(sprint, story, phase)`, **skip and reuse the on-disk artifact**. This is exactly the `groomed`-bypass pattern (`sprint.workflow.js` skips Spec+Readiness when `groomed:true`) — generalized to every phase. Result: a fresh run **after total session loss** naturally fast-forwards to the first incomplete phase, with no journal required.
+### Fix B — Retry-with-backoff wrapper for transient failures
+Wrap every gate/agent call in a bounded retry so a network glitch doesn't kill a 4M-token run:
+```js
+async function withRetry(fn, { tries = 3, label } = {}) {
+  let lastErr
+  for (let i = 0; i < tries; i++) {
+    try { return await fn() }
+    catch (e) { lastErr = e; log(`retry ${label} (${i + 1}/${tries}): ${e.message}`) }
+  }
+  // exhausted: checkpoint progress and surface the runId so the user can resume cross-session
+  log(`PAUSED at ${label}. Resume with resumeFromRunId, or restart — completed phases will be skipped via the ledger.`)
+  throw lastErr
+}
+```
+(Keep it deterministic — no `Date.now`/`Math.random` in the script body; the resume-safety linter at `scripts/test-workflow.js` enforces this.)
+### Fix C — Make the runId loud and durable
+On every story boundary, `log()` the `runId` and write it into the SQLite ledger so it survives the session. The resume instruction becomes a one-liner the user can always recover, and even if the runId is lost, **Fix A** makes a cold restart safe.
+**Net:** three independent safety nets — journal replay (fast path, same session), the idempotent ledger (cross-session/cold-start), and bounded retry (transient blips) — so no single failure mode forces a 2–4M-token redo.
+---
+## 6. Proposed `pipeline-config.yaml` surface (consolidated)
+```yaml
+runtime:
+  provider: claude-code
+# Lever 2 — find vs. judge tiering (CRITIC/JUDGE split into sub-roles)
+models:
+  haiku:  [RESOLVE, CRITIC_BLIND]
+  sonnet: [REQS, UXA, QA-A, QA-B, BEND, FEND, DATA, IAC, MCP-DEV, LIBDEV, DOCGEN, MOBILE,
+           CRITIC_EDGE, CRITIC_ACCEPT, JUDGE_EVIDENCE]
+  opus:   [READINESS, CRITIC_TRIAGE, JUDGE_DECIDE, INTEGRATION]
+# Lever 5 — multi-provider runners (subsidized arbitrage)
+runners:
+  codex:  { cmd: "codex exec",   auth: subscription }
+  cursor: { cmd: "cursor-agent", auth: subscription }
+  grok:   { cmd: "grok",         auth: subscription }
+roles:
+  BEND:          { runner: codex,  fallback: claude:sonnet }
+  CRITIC_BLIND:  { runner: cursor, fallback: claude:haiku }
+  CRITIC_TRIAGE: { runner: claude, model: opus }
+  JUDGE_DECIDE:  { runner: claude, model: opus }
+# Lever 1 — deterministic pre-gate (blocks CRITIC until green)
+pre_critic_gate:
+  blocking: true
+  commands: ["npm run lint", "npm run typecheck", "semgrep --config auto --error"]
+# Lever 4 — external reviewer as blind-hunt
+external_review:
+  blind_hunt: coderabbit          # coderabbit | pr-agent | native | claude
+# Lever 3 — bounded, targeted rework
+rejection:
+  max_cycles: 2                   # was 5
+  targeted_reverify: true         # re-check only previously-failed findings
+# Lever 6 — Ref MCP for dev + critic
+ref_mcp: { enabled: true, roles: [BEND, FEND, DATA, CRITIC_EDGE] }
+# Resumability
+resume:
+  ledger: sqlite                  # durable cross-session phase checkpoints
+  retry: { tries: 3 }
+```
+All of this is overlay-on-default and journal-replay-safe (static + args only), consistent with `buildModelMap` (`sprint.workflow.js:231-241`).
+---
+## 7. Phased rollout
+| Phase | Change | Effort | Risk | Payoff |
+|-------|--------|--------|------|--------|
+| **0** | Turn on **prompt-cache hygiene** + verify hit-rate via `audit`; add **retry wrapper**; **lower `max_cycles` to 3** | S | Low | Quick cost + stability win |
+| **1** | **Deterministic pre-critic gate** (lint/types/semgrep) | S–M | Low | Biggest single cost cut (kills rework cycles) |
+| **2** | **Idempotent SQLite checkpoint ledger** (cross-session resume) | M | Low | Fixes Problem 1 fully |
+| **3** | **Split CRITIC/JUDGE** into sub-roles + **find-vs-judge tiering** | M | Med (validate quality holds) | ~40–60% CRITIC cut |
+| **4** | **Targeted re-verify** on rejection | M | Med | Collapses the cost tail |
+| **5** | **External reviewer** (CodeRabbit CLI or self-hosted PR-Agent) as blind-hunt | M | Low | Moves a pass off the Opus bill |
+| **6** | **Ref MCP** wired into dev + critic | S | Low | Fewer reworks, smaller context |
+| **7** | **`runnerFor()` provider abstraction** → Codex/Cursor/Grok arbitrage | L | Med–High | Largest structural saving; multi-harness future |
+Phases 0–2 are low-risk and address both problems immediately. Phase 7 is the strategic bet that unlocks the "many subsidized providers" future you described — and it's safe to attempt *because the gates already protect quality*, which is the whole premise.
+---
+## 8. Risks & guardrails
+- **Quality regression from cheaper/foreign models.** Mitigated by design: the Opus judgment gates are the floor. Track rejection-rate and rework-cycle counts per tier in the existing `calibrate` data — if a Sonnet/Codex pass drives rework up, the savings evaporate and you escalate that role back up. Make the tiering **measured, not assumed.**
+- **Determinism / journal safety.** All external-CLI and DB calls must stay **inside agents**; the orchestrator script stays pure (no `Date.now`/`Math.random`/IO) so resume replay holds. The linter at `scripts/test-workflow.js` already enforces this — keep it green.
+- **External-tool auth in headless/cron runs.** Codex/Cursor/Grok CLIs and CodeRabbit need their own auth; subscription-based CLIs may not be present in unattended runs. Always define a **`fallback` to a native Claude agent** so a missing CLI degrades instead of failing.
+- **Cost monitoring must lead.** Before/after every phase, run `valent-review-cost` / `audit` on the Workflow journals so each lever's real savings is measured, not guessed.
+---
+## 9. TL;DR
+1. **Stop paying Opus to find mechanical bugs.** Put free linters/Semgrep/Ref in front of CRITIC, and move CRITIC's *hunting* passes to Sonnet/Haiku (or a flat-rate external reviewer / subsidized CLI). Keep Opus only for the *triage* and *ship* decisions.
+2. **Kill the rejection-loop tail** — targeted re-verify + lower cap. It's the #1 cost center.
+3. **Arbitrage subsidized providers** (Codex/Cursor/Grok via headless CLI) for the bulk build/find work; the Opus gate makes the quality variance safe. This is also the on-ramp to the multi-harness future.
+4. **Make resume durable** with an idempotent SQLite phase ledger + retry wrapper, so a glitch or context wipe never re-bills a 2–4M-token sprint.
+Combined target: **50–70% lower spend** and **no full restarts**, with the quality backstop untouched.

package/pipeline/orchestrators/claude-code/sprint.workflow.js CHANGED Viewed

@@ -46,11 +46,36 @@
  * `models` is the pipeline-config.yaml `models` tier->roles map (e.g. { opus:[...], sonnet:[...],
  * haiku:[...] }); the invoking skill passes it through so per-agent model tiers stay config-driven
  * and editable via `valent configure`. Omit it to use the baked-in default assignment.
+ *   CRITIC and JUDGE are split into find-vs-judge sub-roles (Lever 2): CRITIC-BLIND/-EDGE/-ACCEPT do
+ *   the hunting on cheaper tiers while CRITIC-TRIAGE (the adjudication) stays opus; JUDGE-EVIDENCE
+ *   measures + recommends on a cheaper tier while JUDGE-DECIDE (the binding ship call) stays opus.
+ *   A config that pins only the bare CRITIC/JUDGE head propagates that tier to its sub-roles, so
+ *   existing configs are unchanged; list the sub-roles directly to tier them à la carte.
  *
  * `reasoning` is the pipeline-config.yaml `reasoning` level->roles map (e.g. { ultrathink:[...],
  * 'think-harder':[...], 'think-hard':[...], think:[...] }); it injects a thinking-effort trigger
  * into a role's prompt. BLANK by default — omit it (or leave the levels empty) and nothing is
  * injected, behavior unchanged. It is a config-driven control surface, parallel to `models`.
+ *
+ * `skipCompleted` (alias `resume`) enables cross-session resume: each story is first checked for an
+ * existing terminal verdict on disk and skipped if already finalized in a prior run, so re-running an
+ * interrupted sprint does not re-bill shipped stories. Set by the /valent-resume command. Off => a
+ * fresh run pays nothing extra and behaves exactly as before.
+ *
+ * `ref` (from pipeline-config.yaml `ref`: { enabled?, roles? }) toggles Lever 6: appends a short
+ * "verify third-party APIs against current docs via the Ref tools IF available" hint to the listed
+ * roles' spawn prompts. Conditional on runtime tool availability => a no-op without Ref configured,
+ * so it defaults ON for dev agents + CRITIC. Set enabled:false to suppress, or override roles.
+ *
+ * `targetedReverify` (from pipeline-config.yaml `quality.targeted_reverify`) toggles Lever 3: after a
+ * CRITIC rejection + rework, re-review only the previously-flagged findings (one cheap CRITIC-REVERIFY
+ * pass) rather than re-running the full 3-pass hunt. Defaults to false (full re-hunt) for back-compat.
+ *
+ * `preCriticGate` (from pipeline-config.yaml `pre_critic_gate`) is the deterministic pre-CRITIC
+ * gate (Lever 1): { commands:[shell strings], blocking?:bool=true, maxCycles?:int }. A cheap STATIC
+ * agent runs the commands (lint/type-check/static analysis) after Build and before CRITIC; failures
+ * route back to dev and re-run (bounded), so mechanical defects are fixed for free instead of
+ * burning an Opus CRITIC rejection cycle. BLANK by default — no commands => the gate is a no-op.
  */
 export const meta = {
@@ -61,9 +86,10 @@ export const meta = {
     { title: 'Spec', detail: 'reqs -> uxa -> qa-a' },
     { title: 'Readiness', detail: 'pre-dev quality gate' },
     { title: 'Build', detail: 'dev agents in parallel (barrier before CRITIC)' },
-    { title: 'Critic', detail: 'three independent passes in parallel -> triage -> rejection loop (code-owned cap)' },
+    { title: 'Static', detail: 'deterministic pre-CRITIC gate — lint/type/static checks (cheap; blank by default)' },
+    { title: 'Critic', detail: 'three independent passes (tiered: blind=haiku, edge/acceptance=sonnet) -> opus triage -> rejection loop (targeted re-verify, code-owned cap)' },
     { title: 'QA', detail: 'execute tests against real infra' },
-    { title: 'Judge', detail: 'evidence-based ship decision' },
+    { title: 'Judge', detail: 'evidence pass (cheaper tier, measures + recommends) -> opus ship decision' },
     { title: 'Integration', detail: 'single cross-story seam review — only when >1 story touched overlapping files' },
   ],
 }
@@ -129,6 +155,53 @@ const FINDINGS_SCHEMA = {
   },
 }
+// JUDGE split (Lever 2): the evidence pass measures + cross-references the QA artifacts (cheaper
+// tier) and RECOMMENDS; the binding SHIP/REJECT is the separate decision step (opus). Loose by
+// design — the decision step consumes this as context to reason over, not as the verdict itself.
+const JUDGE_EVIDENCE_SCHEMA = {
+  type: 'object',
+  required: ['schema', 'agent', 'story', 'recommendation'],
+  additionalProperties: true,
+  properties: {
+    schema: { const: 1 },
+    agent: { type: 'string' },
+    story: { type: 'string' },
+    recommendation: { enum: ['ship', 'reject', 'unsure'] },
+    testsPassed: { type: 'integer' },
+    testsFailed: { type: 'integer' },
+    openP1toP3: { type: 'integer' },
+    discrepancies: { type: 'array', items: { type: 'string' } },
+    notes: { type: 'string' },
+  },
+}
+// Deterministic pre-CRITIC gate (Lever 1). A cheap agent runs the configured lint/type/static
+// commands and reports pass/fail per command — exit code is the verdict, no model judgment.
+const PRE_GATE_SCHEMA = {
+  type: 'object',
+  required: ['schema', 'story', 'verdict', 'checks'],
+  additionalProperties: true,
+  properties: {
+    schema: { const: 1 },
+    agent: { type: 'string' },
+    story: { type: 'string' },
+    verdict: { enum: ['pass', 'fail'] },
+    checks: {
+      type: 'array',
+      items: {
+        type: 'object',
+        required: ['command', 'ok'],
+        properties: {
+          command: { type: 'string' },
+          ok: { type: 'boolean' },
+          summary: { type: 'string' },
+        },
+      },
+    },
+    filesToFix: { type: 'array', items: { type: 'string' } },
+  },
+}
 // Sprint-end cross-story seam review. Advisory: stories already passed JUDGE, so this does not
 // re-gate them — it surfaces integration findings to be filed as bugs against the affected stories.
 const INTEGRATION_SCHEMA = {
@@ -176,14 +249,32 @@ const RESOLVED_GRAPH_SCHEMA = {
   },
 }
+// Resume check (cross-session resumability): a cheap agent reports whether a story already reached a
+// terminal verdict on disk in a prior run, so a resumed sprint skips finalized stories instead of
+// re-billing them. Only consulted when skipCompleted is set (the /valent-resume path).
+const RESUME_CHECK_SCHEMA = {
+  type: 'object',
+  required: ['schema', 'story', 'done'],
+  additionalProperties: true,
+  properties: {
+    schema: { const: 1 },
+    story: { type: 'string' },
+    done: { type: 'boolean' },
+    shipped: { type: 'boolean' },
+  },
+}
 const DEV_AGENTS = new Set(['BEND', 'FEND', 'IAC', 'DATA', 'DOCGEN', 'LIBDEV', 'MCP-DEV', 'MOBILE'])
 // CRITIC's three independent passes (step 3b). Each reads ONLY its own pass step file and
 // the diff/artifacts it is told to — never another pass's output — so they cannot anchor.
+// `modelRole` keys the per-pass model tier (Lever 2): blind hunting is mechanical (haiku), edge +
+// acceptance need spec context but are still *finding* (sonnet); the agent identity stays CRITIC
+// (it reads critic.md + its pass step), only the tier differs. Triage — the judgment — stays opus.
 const CRITIC_PASSES = [
-  { pass: 'blind', step: 'blind-hunt.md', reads: 'ONLY the git diff (do NOT read reqs-brief or qa-test-spec)' },
-  { pass: 'edge', step: 'edge-case-hunt.md', reads: 'the diff plus reqs-brief.md (hunt boundary/error/concurrency cases)' },
-  { pass: 'acceptance', step: 'acceptance-audit.md', reads: 'the diff plus qa-test-spec.md and reqs-brief.md (audit every AC)' },
+  { pass: 'blind', step: 'blind-hunt.md', modelRole: 'CRITIC-BLIND', reads: 'ONLY the git diff (do NOT read reqs-brief or qa-test-spec)' },
+  { pass: 'edge', step: 'edge-case-hunt.md', modelRole: 'CRITIC-EDGE', reads: 'the diff plus reqs-brief.md (hunt boundary/error/concurrency cases)' },
+  { pass: 'acceptance', step: 'acceptance-audit.md', modelRole: 'CRITIC-ACCEPT', reads: 'the diff plus qa-test-spec.md and reqs-brief.md (audit every AC)' },
 ]
 // --- arg normalization: accept a batch or a single story ---------------------
@@ -209,6 +300,42 @@ const batch = Array.isArray(a.stories) && a.stories.length
 const maxRejectionCycles = a.maxRejectionCycles ?? 5
+// Lever 3: targeted re-review on rejection. When true, CRITIC re-reviews ONLY the previously-flagged
+// findings against the reworked diff (one cheap CRITIC-REVERIFY pass) instead of re-running the full
+// 3-pass hunt every cycle — collapsing the rejection-loop cost tail. From pipeline-config.yaml
+// `quality.targeted_reverify`, passed through as args.targetedReverify. Defaults to FALSE so an
+// existing config (without the key) keeps the full re-hunt; new projects ship it on.
+const targetedReverify = a.targetedReverify === true
+  || (!!a.rejection && (a.rejection.targetedReverify === true || a.rejection.targeted_reverify === true))
+// Resumability: when true, each story is first checked for an existing terminal verdict on disk and
+// skipped if already finalized in a prior run — so re-running an interrupted sprint does not re-bill
+// shipped stories. Set by the /valent-resume command (valent-resume skill). Off by default, so a
+// fresh run pays nothing and behaves exactly as before.
+const skipCompleted = a.skipCompleted === true || a.resume === true
+// --- deterministic pre-CRITIC gate config (Lever 1) ---------------------------
+// A cheap, non-LLM gate that runs the project's lint / type-check / static-analysis commands
+// BEFORE the Opus CRITIC. Mechanical defects caught here never become a CRITIC rejection cycle
+// (the most expensive thing in the loop). BLANK by default — with no commands configured the gate
+// is a no-op and behavior is unchanged, mirroring the `reasoning` control surface. Comes from
+// pipeline-config.yaml `pre_critic_gate`, passed through by the invoking skill. Accepts either the
+// camelCase arg (preCriticGate) or the raw snake_case section (pre_critic_gate); within it, both
+// maxCycles / max_cycles. Static + args only => journal-replay safe.
+const preCfg = (() => {
+  const c = a.preCriticGate || a.pre_critic_gate
+  return c && typeof c === 'object' && !Array.isArray(c) ? c : {}
+})()
+const preCriticChecks = Array.isArray(preCfg.commands)
+  ? preCfg.commands.filter((c) => typeof c === 'string' && c.trim())
+  : []
+const preCriticBlocking = preCfg.blocking !== false // default: blocking (rework loop). false => advisory.
+const preCriticMaxCycles = Number.isInteger(preCfg.maxCycles)
+  ? preCfg.maxCycles
+  : Number.isInteger(preCfg.max_cycles)
+    ? preCfg.max_cycles
+    : maxRejectionCycles
 for (const s of batch) {
   if (!s.storyId || !s.projectType) {
     throw new Error('each story needs { storyId, projectType }; profiles[] optional')
@@ -222,21 +349,45 @@ for (const s of batch) {
 // assignment even when args.models is absent. Static + args only => journal-replay safe.
 //   gates -> opus (judgment), spec/build -> sonnet, CLI-runners/IO -> haiku.
 const DEFAULT_MODELS = {
-  READINESS: 'opus', CRITIC: 'opus', JUDGE: 'opus', INTEGRATION: 'opus',
+  READINESS: 'opus', INTEGRATION: 'opus',
+  // CRITIC / JUDGE are split into find-vs-judge sub-roles (Lever 2): the *finding* work runs on a
+  // cheaper tier and only the *judgment* step (triage / ship decision) stays on opus. The bare
+  // CRITIC / JUDGE keys are the family heads — kept so an old-style config that pins only the head
+  // still propagates its tier to every sub-role (see buildModelMap), preserving prior behavior.
+  CRITIC: 'opus', 'CRITIC-BLIND': 'haiku', 'CRITIC-EDGE': 'sonnet', 'CRITIC-ACCEPT': 'sonnet',
+  'CRITIC-TRIAGE': 'opus', 'CRITIC-REVERIFY': 'sonnet', // REVERIFY: Lever 3 targeted re-review (cheap)
+  JUDGE: 'opus', 'JUDGE-EVIDENCE': 'sonnet', 'JUDGE-DECIDE': 'opus',
   REQS: 'sonnet', UXA: 'sonnet', 'QA-A': 'sonnet', 'QA-B': 'sonnet',
   BEND: 'sonnet', FEND: 'sonnet', DATA: 'sonnet', 'MCP-DEV': 'sonnet',
   LIBDEV: 'sonnet', DOCGEN: 'sonnet', IAC: 'sonnet', MOBILE: 'sonnet',
-  RESOLVE: 'haiku',
+  RESOLVE: 'haiku', STATIC: 'haiku', RESUME: 'haiku',
+}
+// Family heads -> their sub-roles. A config that explicitly tiers a head (e.g. `opus: [CRITIC]`)
+// but not the sub-roles propagates the head's tier to each unconfigured sub-role, so EXISTING
+// configs behave exactly as before (no silent rigor change on upgrade); configs that list the
+// sub-roles directly (the new default) tier them à la carte.
+const ROLE_FAMILIES = {
+  CRITIC: ['CRITIC-BLIND', 'CRITIC-EDGE', 'CRITIC-ACCEPT', 'CRITIC-TRIAGE', 'CRITIC-REVERIFY'],
+  JUDGE: ['JUDGE-EVIDENCE', 'JUDGE-DECIDE'],
 }
 function buildModelMap(cfg) {
   const map = { ...DEFAULT_MODELS }
+  const explicit = new Set()
   if (cfg && typeof cfg === 'object' && !Array.isArray(cfg)) {
     for (const tier of ['opus', 'sonnet', 'haiku']) {
       for (const role of cfg[tier] || []) {
-        if (typeof role === 'string') map[role.toUpperCase()] = tier
+        if (typeof role === 'string') {
+          const r = role.toUpperCase()
+          map[r] = tier
+          explicit.add(r)
+        }
       }
     }
   }
+  // Back-compat: propagate an explicitly-configured family head to its unconfigured sub-roles.
+  for (const [head, subs] of Object.entries(ROLE_FAMILIES)) {
+    if (explicit.has(head)) for (const sub of subs) if (!explicit.has(sub)) map[sub] = map[head]
+  }
   return map
 }
 const MODELS = buildModelMap(a.models)
@@ -266,6 +417,29 @@ const REASONING = buildReasoningMap(a.reasoning)
 // undefined => inject no thinking trigger for this role.
 const reasoningFor = (role) => REASONING[String(role).toUpperCase()]
+// --- Ref MCP documentation-verification hint (Lever 6) ------------------------
+// When a role is Ref-enabled, buildPrompt appends a short instruction telling the agent to verify
+// third-party APIs against current docs via the Ref tools IF those tools are available to it. The
+// instruction is purely conditional on runtime tool availability, so it is a no-op for any session
+// without Ref configured — which is why it can default ON: it only ever helps, never blocks. From
+// pipeline-config.yaml `ref` ({ enabled?, roles? }); set enabled:false to suppress it, or override
+// `roles` to change which agents get it. Static + args only => journal-replay safe.
+const DEFAULT_REF_ROLES = ['BEND', 'FEND', 'DATA', 'MCP-DEV', 'LIBDEV', 'DOCGEN', 'IAC', 'MOBILE', 'CRITIC']
+function buildRefRoleSet(cfg) {
+  if (cfg && cfg.enabled === false) return new Set()
+  const roles = cfg && Array.isArray(cfg.roles) && cfg.roles.length ? cfg.roles : DEFAULT_REF_ROLES
+  return new Set(roles.map((r) => String(r).toUpperCase()))
+}
+const REF_ROLES = buildRefRoleSet(a.ref)
+const refFor = (role) => REF_ROLES.has(String(role).toUpperCase())
+const REF_HINT =
+  'When you work with a third-party library, framework, or external API, FIRST check whether the Ref ' +
+  'documentation tools (`ref_search_documentation`, `ref_read_url`) are available to you. If they are, ' +
+  'use them to confirm exact signatures, parameters, and current usage against the real docs before ' +
+  'writing or reviewing that code — do not trust memory for unfamiliar, niche, or fast-moving APIs. ' +
+  'If those tools are not available, just proceed with your existing knowledge — this step is optional ' +
+  'and best-effort, never a blocker.'
 // --- prompt builder: mirrors providers/claude-code/spawn.template.md so spawned agents
 //     get full pipeline context (core prompt + shared context + step-at-execution + the
 //     handoff contract), not a terse one-liner. ------------------------------------------
@@ -281,6 +455,7 @@ function buildPrompt({ role, promptFile, storyId, taskRef, taskSubject, trigger,
     `1. Read your core prompt: \`.valent-pipeline/prompts/${promptFile}\` — identity, protocols, step sequence.`,
     `2. Read shared context: \`${outputDir}/pipeline-context.md\` (and correction directives if present).`,
     '3. Read each step file at the point of execution, not before. Check decision gates first.',
+    ...(refFor(role) ? ['', '## External docs (optional)', REF_HINT] : []),
     '',
     '## Task Assignment',
     `${taskRef ? `Task ${taskRef}: ` : ''}${taskSubject}`,
@@ -385,6 +560,26 @@ async function runStory(story) {
   const { storyId, projectType, profiles, groomed } = story
   const profilesCsv = profiles.join(',')
+  // Resumability: on a resumed run, skip any story already finalized on disk in a prior run. The
+  // check is a single cheap agent (the script can't read files itself); it returns done/shipped from
+  // the story's judge-decision.md / story-report.md. No-op unless skipCompleted is set.
+  if (skipCompleted) {
+    const prior = await agent(
+      `Resume check for story ${storyId}. Determine whether this story ALREADY reached a terminal verdict ` +
+        `in a PRIOR run by reading, if present, \`stories/${storyId}/output/judge-decision.md\` and ` +
+        `\`stories/${storyId}/output/story-report.md\`. Return ONLY ` +
+        `{ schema:1, story:"${storyId}", done:boolean, shipped:boolean }: done:true ONLY if a final SHIP or ` +
+        `REJECT verdict is already recorded on disk (shipped reflects a SHIP). If neither file exists or no ` +
+        `final verdict is present, done:false.`,
+      { label: `resume-check:${storyId}`, phase: 'Resolve', schema: RESUME_CHECK_SCHEMA, model: modelFor('RESUME') },
+    )
+    if (prior.done) {
+      log(`${storyId}: already finalized in a prior run (${prior.shipped ? 'shipped' : 'rejected'}) — skipping (resume)`)
+      return { storyId, shipped: prior.shipped === true, verdict: prior.shipped ? 'pass' : 'fail', resumedSkip: true, skipped: [], files: [] }
+    }
+    log(`${storyId}: no terminal verdict on disk — running normally (resume)`)
+  }
   phase('Resolve')
   // The script cannot run the CLI itself; an agent runs resolve-graph and returns its JSON.
   const graph = await agent(
@@ -471,6 +666,10 @@ async function runStory(story) {
     return { storyId, shipped: false, verdict: 'blocked', reason: 'no-dev-output', skipped: graph.skipped, files: [] }
   }
+  // Deterministic pre-CRITIC gate (Lever 1): run lint/type/static checks (free, mechanical) and
+  // fix what they catch BEFORE the Opus CRITIC sees the diff. No-op when no commands are configured.
+  await runPreCriticGate(storyId, devTasks)
   phase('Critic')
   await runCriticGate(storyId, devTasks)
@@ -478,13 +677,104 @@ async function runStory(story) {
   await spawn('QA-B', 'qa-b.md', 'Execute the full test suite, file bugs, build the traceability matrix.', { phase: 'QA' })
   phase('Judge')
-  const decision = await runGate(storyId, 'JUDGE', 'judge.md',
-    'Review evidence (tests, traceability, bugs) and make the ship decision.', 'Judge', null)
+  const decision = await runJudgeGate(storyId)
   return { storyId, shipped: decision.verdict === 'pass', verdict: decision.verdict, skipped: graph.skipped, files: devFiles }
   // --- per-story closures over storyId/devTasks ----------------------------
+  // runPreCriticGate (Lever 1): a deterministic, non-LLM gate. A cheap STATIC agent runs the
+  // configured lint/type/static-analysis commands; the exit code is the verdict (no model
+  // judgment). On failure in blocking mode, route the fixes to the owning dev agents and re-run,
+  // bounded by preCriticMaxCycles. On cap-trip it logs and FALLS THROUGH to CRITIC (the real gate)
+  // rather than rolling the story over — a story should not be blocked on mechanical checks alone,
+  // and CRITIC/JUDGE remain the quality backstop. No-op (returns immediately) when nothing is
+  // configured, so the default pipeline behavior is unchanged.
+  async function runPreCriticGate(sid, devs) {
+    if (!preCriticChecks.length) return { skipped: true } // blank control surface: nothing configured
+    phase('Static')
+    let rejections = 0
+    while (true) {
+      const verdict = await agent(
+        [
+          `You are **STATIC**, the deterministic pre-review gate for story ${sid} in the valent-pipeline.`,
+          `Run each command below from the project root, capturing its exit code and a short tail of its output.`,
+          `A command PASSES iff it exits 0. Do NOT fix anything and do NOT make judgment calls — the exit code is the verdict.`,
+          ``,
+          `Commands:`,
+          ...preCriticChecks.map((c, i) => `  ${i + 1}. ${c}`),
+          ``,
+          `Set verdict:"pass" only if EVERY command exited 0; otherwise verdict:"fail". For each FAILED command,`,
+          `summarize the errors and list the offending file paths (relative to the project root) in filesToFix.`,
+          `Return ONLY { schema:1, agent:"static", story:"${sid}", verdict:"pass"|"fail", ` +
+            `checks:[{command, ok:boolean, summary}], filesToFix:[paths] } as JSON.`,
+        ].join('\n'),
+        { label: `gate:static:${sid}`, phase: 'Static', schema: PRE_GATE_SCHEMA, model: modelFor('STATIC') },
+      )
+      if (verdict.verdict === 'pass') {
+        log(`${sid}/STATIC: pre-critic checks green (${(verdict.checks || []).length} command(s)) — handing a clean diff to CRITIC`)
+        return verdict
+      }
+      const failed = (verdict.checks || []).filter((c) => !c.ok)
+      if (!preCriticBlocking) {
+        log(`${sid}/STATIC: ${failed.length} check(s) failed — advisory mode, proceeding to CRITIC`)
+        return { ...verdict, advisory: true }
+      }
+      rejections += 1
+      if (rejections >= preCriticMaxCycles) {
+        log(`${sid}/STATIC: still failing after ${rejections} cycle(s) — escalating to CRITIC (real gate) instead of blocking on mechanical checks`)
+        return { ...verdict, escalated: true }
+      }
+      log(`${sid}/STATIC: ${failed.length} check(s) failed — rejection ${rejections}/${preCriticMaxCycles}, routing fixes to dev`)
+      const summary = failed.map((c) => `- ${c.command}: ${c.summary || 'failed'}`).join('\n')
+      await parallel(
+        devs.map((t) => () =>
+          spawn(t.agent, `${t.agent.toLowerCase()}.md`,
+            `The deterministic pre-review gate (lint/type/static analysis) failed. Fix every reported failure that ` +
+              `falls in your area, then re-run the checks locally to confirm they pass:\n${summary}`, {
+              label: `rework:static:${t.ref}:${sid}`,
+              phase: 'Static',
+            })),
+      )
+    }
+  }
+  // runJudgeGate (Lever 2): the terminal ship gate, split find-vs-judge. The EVIDENCE pass measures
+  // and cross-references the QA artifacts (bug review + evidence review) on a cheaper tier and emits
+  // a recommendation; the binding SHIP/REJECT DECISION is a separate opus step that reasons over the
+  // distilled evidence (and may spot-check a flagged artifact on disk) rather than re-reading every
+  // raw report. Net: the bulk reading/counting is cheap, opus is spent only on the judgment — and on
+  // a compact input. The decision keeps the `gate:judge:${sid}` label so it remains the gate of record.
+  async function runJudgeGate(sid) {
+    const evidence = await spawn('JUDGE', 'judge.md',
+      'Run the EVIDENCE pass only: bug review (`.valent-pipeline/steps/judge/bug-review.md`) then ' +
+        'evidence review (`.valent-pipeline/steps/judge/evidence-review.md`). Cross-check ' +
+        'execution-report / traceability-matrix / bugs against qa-test-spec: count tests passed/failed, ' +
+        'verify every AC is covered, validate bug priorities, and FLAG any claim the artifacts do not ' +
+        'support. Write judge-review.md. Do NOT make the ship decision — only gather the evidence and ' +
+        'give a recommendation.',
+      {
+        label: `judge:evidence:${sid}`,
+        phase: 'Judge',
+        schema: JUDGE_EVIDENCE_SCHEMA,
+        model: modelFor('JUDGE-EVIDENCE'),
+        returnContract:
+          'Return ONLY { schema:1, agent:"judge", story, recommendation:"ship"|"reject"|"unsure", ' +
+          'testsPassed, testsFailed, openP1toP3, discrepancies:[...], notes } as JSON.',
+      })
+    return assertGate(
+      await spawn('JUDGE', 'judge.md',
+        'Make the BINDING ship decision per `.valent-pipeline/steps/judge/ship-decision.md`. Your primary ' +
+          'input is the evidence summary below from the evidence pass; if it flags any discrepancy or a ' +
+          'non-"ship" recommendation or low confidence, spot-check the named artifact on disk before ' +
+          'deciding. The verdict is terminal SHIP(pass)/REJECT(fail) — no partial ships. Write ' +
+          'judge-decision.md (and story-report.md on SHIP).\n\nEVIDENCE:\n' + JSON.stringify(evidence),
+        { label: `gate:judge:${sid}`, phase: 'Judge', schema: VERDICT_SCHEMA, model: modelFor('JUDGE-DECIDE') }),
+      'JUDGE',
+    )
+  }
   // runGate: a schema-validated verification stage with a code-owned rejection loop.
   // `reworkThunk` (or null for terminal gates) produces the fix work before re-gating.
   async function runGate(sid, role, promptFile, instruction, gatePhase, reworkThunk) {
@@ -513,37 +803,20 @@ async function runStory(story) {
     }
   }
-  // runCriticGate (step 3b): three INDEPENDENT passes in parallel, then a triage barrier
-  // that dedups and writes the verdict. The whole thing is wrapped in the code-owned
-  // rejection loop; on reject, the routed dev agents rework and the passes re-run.
+  // runCriticGate (step 3b): three INDEPENDENT passes in parallel, then a triage barrier that dedups
+  // and writes the verdict. Wrapped in the code-owned rejection loop; on reject, the routed dev agents
+  // rework. The FIRST review is always the full 3-pass hunt (whole-diff coverage). Re-reviews after a
+  // rework are TARGETED (Lever 3) when enabled — one cheap pass that confirms only the previously-
+  // flagged findings are resolved — instead of re-running the full hunt, which collapses the rejection
+  // loop's cost tail. Disabled (default for existing configs) => every cycle re-runs the full hunt.
   async function runCriticGate(sid, devs) {
     let rejections = 0
+    let verdict = await fullCriticReview(sid)
     while (true) {
-      // Independent perspective-diverse verify: each pass reads only its own inputs.
-      await parallel(
-        CRITIC_PASSES.map((p) => () =>
-          spawn('CRITIC', 'critic.md',
-            `Run pass ${p.pass} per \`.valent-pipeline/steps/critic/${p.step}\`. Read ${p.reads}. ` +
-              `Do NOT read any other pass's output. Record findings only — do NOT deduplicate or set a verdict.`,
-            {
-              label: `critic:${p.pass}:${sid}`,
-              phase: 'Critic',
-              schema: FINDINGS_SCHEMA,
-              returnContract: 'Return ONLY { schema:1, agent:"critic", story, pass, findings:[...] } as JSON.',
-            })),
-      )
-      // Triage barrier: the single point of deduplication; produces the schema-validated verdict.
-      const verdict = assertGate(
-        await spawn('CRITIC', 'critic.md',
-          'Triage per `.valent-pipeline/steps/critic/triage.md`: gather findings from ALL three passes, ' +
-            'collapse duplicates (same root cause) into one, classify final severity, then write the verdict.',
-          { label: `gate:critic:${sid}`, phase: 'Critic', schema: VERDICT_SCHEMA }),
-        'CRITIC',
-      )
       if (verdict.verdict === 'pass') return verdict
       // 'needs-review' from triage means CRITIC could not complete a pass/fail review — almost
-      // always an empty/unchanged diff (the reported failure mode). Re-running the three passes +
-      // triage (4 agents) + dev rework would reproduce it every cycle until the cap, so bail now.
+      // always an empty/unchanged diff (the reported failure mode). Re-running passes + triage + dev
+      // rework would reproduce it every cycle until the cap, so bail now.
       if (verdict.verdict === 'needs-review') {
         log(`${sid}/CRITIC: needs-review (no reviewable diff) — escalating without re-running passes`)
         return { ...verdict, escalated: true }
@@ -553,8 +826,8 @@ async function runStory(story) {
         log(`${sid}/CRITIC: circuit breaker tripped after ${rejections} rejections — escalating`)
         return { ...verdict, escalated: true }
       }
-      log(`${sid}/CRITIC: rejection ${rejections}/${maxRejectionCycles} — reworking`)
-      // Route fixes to the owning dev agent(s), then re-run the passes.
+      log(`${sid}/CRITIC: rejection ${rejections}/${maxRejectionCycles} — reworking${targetedReverify ? ' (targeted re-verify)' : ''}`)
+      // Route fixes to the owning dev agent(s), then re-evaluate.
       await parallel(
         devs.map((t) => () =>
           spawn(t.agent, `${t.agent.toLowerCase()}.md`, 'Fix every High finding CRITIC routed to you.', {
@@ -562,6 +835,51 @@ async function runStory(story) {
             phase: 'Critic',
           })),
       )
+      verdict = targetedReverify ? await targetedCriticReverify(sid, rejections) : await fullCriticReview(sid)
+    }
+    // --- review strategies ---------------------------------------------------
+    // Full 3-pass hunt + triage barrier. Each pass reads only its own inputs (perspective-diverse,
+    // cannot anchor); triage is the single dedup point and the judgment step (stays on CRITIC-TRIAGE).
+    async function fullCriticReview(s) {
+      await parallel(
+        CRITIC_PASSES.map((p) => () =>
+          spawn('CRITIC', 'critic.md',
+            `Run pass ${p.pass} per \`.valent-pipeline/steps/critic/${p.step}\`. Read ${p.reads}. ` +
+              `Do NOT read any other pass's output. Record findings only — do NOT deduplicate or set a verdict.`,
+            {
+              label: `critic:${p.pass}:${s}`,
+              phase: 'Critic',
+              schema: FINDINGS_SCHEMA,
+              model: modelFor(p.modelRole), // Lever 2: per-pass tier (blind=haiku, edge/acceptance=sonnet)
+              returnContract: 'Return ONLY { schema:1, agent:"critic", story, pass, findings:[...] } as JSON.',
+            })),
+      )
+      return assertGate(
+        await spawn('CRITIC', 'critic.md',
+          'Triage per `.valent-pipeline/steps/critic/triage.md`: gather findings from ALL three passes, ' +
+            'collapse duplicates (same root cause) into one, classify final severity, then write the verdict.',
+          { label: `gate:critic:${s}`, phase: 'Critic', schema: VERDICT_SCHEMA, model: modelFor('CRITIC-TRIAGE') }),
+        'CRITIC',
+      )
+    }
+    // Lever 3: targeted re-review. ONE cheap pass that re-checks only the previously-flagged findings
+    // against the updated diff (plus obvious regressions the fix introduced in those files), instead
+    // of re-running the full 3-pass hunt. Keeps the `gate:critic` label so it remains the verdict of
+    // record and the pass-invariant still applies. QA-B + JUDGE stay the whole-diff backstops.
+    async function targetedCriticReverify(s, cycle) {
+      return assertGate(
+        await spawn('CRITIC', 'critic.md',
+          `Targeted RE-VERIFY (rejection cycle ${cycle}). The dev has reworked the findings from your ` +
+            `previous review. Read your prior verdict in critic-review.md (the open findings) and the ` +
+            `UPDATED git diff for the files those findings touched. For EACH prior finding, confirm it is ` +
+            `now resolved; also flag any obvious regression the fix introduced in those same files. Do NOT ` +
+            `re-hunt the whole diff — the first review already covered it. Write the verdict: pass only if ` +
+            `every prior High finding is resolved and the fix introduced no new High.`,
+          { label: `gate:critic:${s}`, phase: 'Critic', schema: VERDICT_SCHEMA, model: modelFor('CRITIC-REVERIFY') }),
+        'CRITIC',
+      )
     }
   }
 }

package/skills/valent-help/SKILL.md CHANGED Viewed

@@ -39,6 +39,9 @@ Read these as needed to answer questions:
 **"How do I run an epic or the whole project?"**
 → `/valent-run-epic-workflow EPIC-ID` runs all stories tagged with that epic; `/valent-run-project-workflow` runs the whole backlog across epics. Both plan and execute sprints automatically.
+**"A run got interrupted (crash / context reset / new session) — how do I continue?"**
+→ `/valent-resume`. It reads `.valent-pipeline/run-state.json`, shows what already shipped, and picks up where it stopped — no `runId` needed, and finished stories are not re-run.
 **"How do I configure the pipeline?"**
 → `/valent-configure` for interactive wizard, or edit `.valent-pipeline/pipeline-config.yaml` directly.

package/skills/valent-resume/SKILL.md ADDED Viewed

@@ -0,0 +1,72 @@
+---
+name: valent-resume
+description: 'Resume an interrupted pipeline run (epic, project, sprint, or single story) after a crash, context reset, network glitch, or new session — with no runId required. Use when the user says "resume", "continue the run", "pick up where it left off", "resume the pipeline", or runs /valent-resume.'
+argument-hint: '(none — auto-detects; or a run key like an epic id)'
+---
+# valent-resume
+One command to continue an interrupted pipeline run. The user types `/valent-resume` and you figure out the rest: what was running, what already shipped, and how to continue **without re-billing finished work**. No `runId` to remember.
+The pipeline records a durable pointer at `.valent-pipeline/run-state.json` around every Workflow call (the run skills maintain it). This skill reads that pointer plus the on-disk progress artifacts and resumes via the cheapest safe path.
+## Step 1: Locate the run
+1. Read `.valent-pipeline/run-state.json`.
+   - **Missing?** Fall back: read `{epic_progress_path}` (`./epic-progress.md` by default). If it exists, treat it as an epic/project run to resume (its `epic_id` is the key). If neither exists → tell the user there is nothing to resume and offer to start a fresh run (`/valent-run-epic-workflow`, `/valent-run-project-workflow`, or `/valent-run-story-workflow`). STOP.
+2. Parse the pointer. Its shape:
+   ```json
+   { "schema": 1, "kind": "epic|project|story", "key": "<epic id | 'project' | story id>",
+     "scriptPath": "<plan|sprint|retro workflow path>", "phase": "plan|sprint|retro",
+     "sprintId": "<id or null>", "args": { /* the exact args last passed */ },
+     "runId": "<id or null>", "status": "running|completed" }
+   ```
+3. If `status` is `"completed"` → the last run already finished. Tell the user, summarize, and offer a fresh run. STOP (unless they explicitly want to re-run).
+## Step 2: Show what's already done (then continue without asking, per the original opt-in)
+Before relaunching, give the user a 3–5 line summary so they trust nothing is being redone:
+- From `epic-progress.md` (epic/project): stories completed, stories remaining, current sprint.
+- From disk: scan `stories/*/output/` for `story-report.md` (shipped) and `judge-decision.md` (finalized) to confirm which stories are done.
+- State the resume path you're about to take (below).
+## Step 3: Resume by the cheapest safe path
+Pick the first applicable:
+### A. Same-session journal fast-path (only if a `runId` is present)
+The journal `runId` is **same-session only**. If this is the same session the run started in, relaunch the *exact* in-flight workflow:
+```js
+Workflow({ scriptPath: <pointer.scriptPath>, resumeFromRunId: <pointer.runId> })
+```
+The journal replays the unchanged prefix of `agent()` calls instantly and re-runs only from the interruption point. If the Workflow tool reports the run is **unknown / not resumable** (i.e. a different session), fall through to B.
+### B. Cross-session idempotent resume (runId gone, or fast-path failed)
+Relaunch the in-flight phase from disk state; finished work is detected and skipped:
+- **`phase: "sprint"`** (the common, expensive case): relaunch `sprint.workflow.js` with the pointer's saved `args` **plus `skipCompleted: true`**. Each story is first checked for a terminal verdict on disk (`judge-decision.md` / `story-report.md`) and **already-finalized stories are skipped** — only unfinished stories run. This is the key to not re-billing a partially-completed sprint.
+  ```js
+  Workflow({ scriptPath: '.valent-pipeline/orchestrators/claude-code/sprint.workflow.js',
+             args: { ...<pointer.args>, skipCompleted: true } })
+  ```
+- **`phase: "plan"`**: relaunch `plan.workflow.js` with the saved `args` (grooming is cheap and idempotent — re-grooming an already-groomed story just re-confirms it).
+- **`phase: "retro"`**: relaunch `retro.workflow.js` with the saved `args` (learning only; safe to repeat).
+### C. Hand back to the run loop (epic/project)
+After the in-flight workflow completes, **continue the remaining work** so the whole epic/project finishes:
+- `kind: "epic"` → invoke `/valent-run-epic-workflow <key>`. Its Step 3 detects `epic-progress.md` and resumes at the next unfinished sprint; shipped stories are not re-run.
+- `kind: "project"` → invoke `/valent-run-project-workflow`. Same resume behavior.
+- `kind: "story"` → nothing more to loop; report the single story's outcome.
+When relaunching a `sprint`/`plan`/`retro` directly (not via the run skill), update `.valent-pipeline/run-state.json` yourself: set `status: "running"` (and the new `runId` once it returns) before, `status: "completed"` after.
+## Step 4: Report
+Summarize: what was resumed, how (journal fast-path vs. idempotent re-run), which stories were skipped as already-done, and the final outcome. If anything was rolled over or blocked, surface it.
+## Notes
+- **Never restart from scratch.** A fresh `/valent-run-*` without resume would re-bill shipped stories. Always go through the pointer + `skipCompleted` path.
+- **The pointer is a hint, the artifacts are truth.** If `run-state.json` is stale or inconsistent with `epic-progress.md` / story artifacts, trust the artifacts (a shipped `story-report.md` means shipped) and reconcile.
+- **Do not hand-edit a journal.** Resume is either `resumeFromRunId` (same session) or `skipCompleted` (cross-session) — both are safe and idempotent.
+- This skill is read-mostly: it inspects state and relaunches workflows. It does not modify story code or gate verdicts.

package/skills/valent-run-epic-workflow/SKILL.md CHANGED Viewed

@@ -29,7 +29,7 @@ Use the standard 200k context window. Workflow `agent()` calls run in their own
 ### Step 1: Load Pipeline Config
-Read and follow `.valent-pipeline/steps/orchestration/load-pipeline-config.md`. Set `{epic_id}` from the argument. Also capture the entire `models` section (the `{ opus:[...], sonnet:[...], haiku:[...] }` tier→roles map) — pass it as the `models` arg to every Workflow call below so per-agent model tiers stay config-driven (editable via `/valent-configure`). If the config has no `models` section, omit the arg and the workflows use their baked-in default assignment. Likewise capture the `reasoning` section (level→roles thinking-effort map) and pass it as the `reasoning` arg to every Workflow call; it is blank by default (injects nothing). Omit it if absent or all levels are empty.
+Read and follow `.valent-pipeline/steps/orchestration/load-pipeline-config.md`. Set `{epic_id}` from the argument. Also capture the entire `models` section (the `{ opus:[...], sonnet:[...], haiku:[...] }` tier→roles map) — pass it as the `models` arg to every Workflow call below so per-agent model tiers stay config-driven (editable via `/valent-configure`). If the config has no `models` section, omit the arg and the workflows use their baked-in default assignment. Likewise capture the `reasoning` section (level→roles thinking-effort map) and pass it as the `reasoning` arg to every Workflow call; it is blank by default (injects nothing). Omit it if absent or all levels are empty. Also capture the `pre_critic_gate` section (deterministic lint/type/static checks) and pass it as the `preCriticGate` arg to the **sprint** Workflow call; omit it if absent or its `commands` list is empty (the gate is then a no-op).
 ### Step 2: Validate Epic
@@ -48,6 +48,8 @@ Read `{epic_progress_path}`.
 ### Step 4: Sprint Loop (plan → sprint → retro)
+> **Run-state pointer (powers `/valent-resume`).** Maintain `.valent-pipeline/run-state.json` around every Workflow call so the run can be resumed after a crash, context reset, or new session. **Immediately before** each `plan`/`sprint`/`retro` call, write `{ schema: 1, kind: "epic", key: "{epic_id}", scriptPath: "<the script>", phase: "plan"|"sprint"|"retro", sprintId: "{epic_id}-sprint-{n}", args: <the exact args object>, runId: null, status: "running" }`. **The moment the call returns**, set `runId` to the returned id. On epic completion (Step 5) set `status: "completed"`. This single file is what `/valent-resume` reads — keep it current.
 Loop sprints until no pending epic stories with met dependencies remain. Each iteration:
 #### 4a. Re-read state from disk
@@ -74,11 +76,11 @@ Feed the planned batch straight into `sprint.workflow.js`:
 ```js
 Workflow({
   scriptPath: '.valent-pipeline/orchestrators/claude-code/sprint.workflow.js',
-  args: { stories: <plan output .stories>, maxRejectionCycles: {quality.max_rejection_cycles or 5}, models: <config.models or omit>, reasoning: <config.reasoning or omit> }
+  args: { stories: <plan output .stories>, maxRejectionCycles: {quality.max_rejection_cycles or 5}, targetedReverify: {quality.targeted_reverify or false}, models: <config.models or omit>, reasoning: <config.reasoning or omit>, preCriticGate: <config.pre_critic_gate or omit>, ref: <config.ref or omit> }
 })
 ```
-It runs each story through the per-story pipeline (deterministic graph resolve, READINESS/CRITIC/JUDGE gates, 3-pass parallel CRITIC, code-owned rejection cap) **sequentially on a shared branch**, rolling over any story that JUDGE rejects or that trips the cap. Returns `{ shipped, stories_shipped, stories_rolled_over, results: [{ storyId, shipped, verdict, skipped }] }`. Record its `runId`.
+It runs each story through the per-story pipeline (deterministic graph resolve, READINESS gate, the deterministic pre-CRITIC static gate, CRITIC/JUDGE gates, 3-pass parallel CRITIC, code-owned rejection cap) **sequentially on a shared branch**, rolling over any story that JUDGE rejects or that trips the cap. Returns `{ shipped, stories_shipped, stories_rolled_over, results: [{ storyId, shipped, verdict, skipped }] }`. Record its `runId`.
 #### 4e. Update progress
 For each shipped story: move it from `stories_remaining` to `stories_completed` in `epic-progress.md` with a compact one-line outcome; update `total_completed` and `last_updated`; keep the file under 50 lines. Then read and follow `.valent-pipeline/steps/orchestration/update-backlog-status.md`. Rolled-over stories stay `pending` for a later sprint (or are surfaced as blocked if they keep failing).
@@ -104,12 +106,15 @@ When no pending epic stories with met dependencies remain, write `epic-report.md
 ## Resume (do NOT restart from scratch)
-Two layers of durability:
+**Easiest path: just run `/valent-resume`.** It reads `.valent-pipeline/run-state.json`, shows what already shipped, and continues from where the run stopped — journal fast-path if it's the same session, idempotent re-run otherwise. The user never needs a `runId`.
+Under the hood there are three layers of durability:
-1. **Within a single Workflow run** (a plan, a sprint, or a retro): every invocation returns a `runId`. If one is interrupted, relaunch **that** workflow with `Workflow({ scriptPath, resumeFromRunId: <runId> })` — the journal replays the unchanged prefix instantly and re-runs only from the first changed/new `agent()` call. Already-shipped stories and passed gates are not redone.
-2. **Across the sprint loop**: `epic-progress.md` + `{backlog_path}` record which stories shipped and which sprint is current. On a full conversation reset, re-run `/valent-run-epic-workflow {epic_id}`; Step 3 detects the existing progress file and resumes at the next unfinished sprint. Shipped stories are not re-run.
+1. **Within a single Workflow run** (a plan, a sprint, or a retro): every invocation returns a `runId` (recorded in `run-state.json`). In the **same session**, relaunch **that** workflow with `Workflow({ scriptPath, resumeFromRunId: <runId> })` — the journal replays the unchanged prefix instantly and re-runs only from the first changed/new `agent()` call.
+2. **Across sessions (runId is gone)**: relaunch the recorded `sprint` workflow with `skipCompleted: true` — each story is checked for a terminal verdict on disk and finalized stories are skipped, so the resumed sprint does not re-bill shipped stories.
+3. **Across the sprint loop**: `epic-progress.md` + `{backlog_path}` record which stories shipped and which sprint is current. Re-running `/valent-run-epic-workflow {epic_id}` resumes at the next unfinished sprint.
-Do **not** hand-edit state files to "resume" a Workflow — pass `resumeFromRunId`.
+Do **not** hand-edit state files to "resume" a Workflow — use `/valent-resume` (or pass `resumeFromRunId` / `skipCompleted`).
 ## Notes

package/skills/valent-run-project-workflow/SKILL.md CHANGED Viewed

@@ -24,7 +24,7 @@ Use the standard 200k context window. Workflow `agent()` calls run in their own
 ### Step 1: Load Pipeline Config
-Read and follow `.valent-pipeline/steps/orchestration/load-pipeline-config.md`. Also capture the entire `models` section (the `{ opus:[...], sonnet:[...], haiku:[...] }` tier→roles map) — pass it as the `models` arg to every Workflow call below so per-agent model tiers stay config-driven (editable via `/valent-configure`). If the config has no `models` section, omit the arg and the workflows use their baked-in default assignment. Likewise capture the `reasoning` section (level→roles thinking-effort map) and pass it as the `reasoning` arg to every Workflow call; it is blank by default (injects nothing). Omit it if absent or all levels are empty.
+Read and follow `.valent-pipeline/steps/orchestration/load-pipeline-config.md`. Also capture the entire `models` section (the `{ opus:[...], sonnet:[...], haiku:[...] }` tier→roles map) — pass it as the `models` arg to every Workflow call below so per-agent model tiers stay config-driven (editable via `/valent-configure`). If the config has no `models` section, omit the arg and the workflows use their baked-in default assignment. Likewise capture the `reasoning` section (level→roles thinking-effort map) and pass it as the `reasoning` arg to every Workflow call; it is blank by default (injects nothing). Omit it if absent or all levels are empty. Also capture the `pre_critic_gate` section (deterministic lint/type/static checks) and pass it as the `preCriticGate` arg to the **sprint** Workflow call; omit it if absent or its `commands` list is empty (the gate is then a no-op).
 ### Step 2: Build Cross-Epic Dependency Map
@@ -46,6 +46,8 @@ Read `{epic_progress_path}`.
 ### Step 4: Sprint Loop (plan → sprint → retro)
+> **Run-state pointer (powers `/valent-resume`).** Maintain `.valent-pipeline/run-state.json` around every Workflow call. **Immediately before** each `plan`/`sprint`/`retro` call, write `{ schema: 1, kind: "project", key: "project", scriptPath: "<the script>", phase: "plan"|"sprint"|"retro", sprintId: "project-sprint-{n}", args: <the exact args object>, runId: null, status: "running" }`. **The moment the call returns**, set `runId`. On project completion (Step 5) set `status: "completed"`. `/valent-resume` reads this file.
 Loop sprints until all stories are shipped, blocked, or cancelled. Each iteration:
 #### 4a. Re-read state from disk
@@ -92,7 +94,7 @@ Feed the planned batch straight into `sprint.workflow.js`:
 ```js
 Workflow({
   scriptPath: '.valent-pipeline/orchestrators/claude-code/sprint.workflow.js',
-  args: { stories: <plan output .stories>, maxRejectionCycles: {quality.max_rejection_cycles or 5}, models: <config.models or omit>, reasoning: <config.reasoning or omit> }
+  args: { stories: <plan output .stories>, maxRejectionCycles: {quality.max_rejection_cycles or 5}, targetedReverify: {quality.targeted_reverify or false}, models: <config.models or omit>, reasoning: <config.reasoning or omit>, preCriticGate: <config.pre_critic_gate or omit>, ref: <config.ref or omit> }
 })
 ```
@@ -128,12 +130,13 @@ If the loop finds no unblocked stories but pending stories remain, list each blo
 ## Resume (do NOT restart from scratch)
-Two layers of durability:
+**Easiest path: run `/valent-resume`.** It reads `.valent-pipeline/run-state.json`, shows what already shipped, and continues — no `runId` needed. Three layers of durability:
-1. **Within a single Workflow run** (a plan, sprint, or retro): every invocation returns a `runId`. If one is interrupted, relaunch **that** workflow with `Workflow({ scriptPath, resumeFromRunId: <runId> })` — the journal replays the unchanged prefix instantly and re-runs only from the first changed/new `agent()` call.
-2. **Across the sprint loop**: the progress file (`epic_id: "project"`) + `{backlog_path}` record what shipped and which sprint is current. On a full conversation reset, re-run `/valent-run-project-workflow`; Step 3 detects the existing progress file and resumes at the next unfinished sprint. Shipped stories are not re-run.
+1. **Within a single Workflow run** (same session): relaunch **that** workflow with `Workflow({ scriptPath, resumeFromRunId: <runId from run-state.json> })` — the journal replays the unchanged prefix instantly.
+2. **Across sessions (runId gone)**: relaunch the recorded `sprint` workflow with `skipCompleted: true` — finalized stories are detected on disk and skipped, so the resumed sprint does not re-bill shipped stories.
+3. **Across the sprint loop**: the progress file (`epic_id: "project"`) + `{backlog_path}` record what shipped and which sprint is current. Re-running `/valent-run-project-workflow` resumes at the next unfinished sprint.
-Do **not** hand-edit state files to "resume" a Workflow — pass `resumeFromRunId`.
+Do **not** hand-edit state files to "resume" a Workflow — use `/valent-resume` (or pass `resumeFromRunId` / `skipCompleted`).
 ## Notes

package/skills/valent-run-story-workflow/SKILL.md CHANGED Viewed

@@ -29,7 +29,7 @@ If no argument is provided, resolve the next work item from the backlog (see Ste
 ### Step 1: Load Pipeline Config
-Read and follow `.valent-pipeline/steps/orchestration/load-pipeline-config.md`. Capture `project.type` and the story's `testing_profiles` — these become the Workflow's `projectType` and `profiles` args. Also capture the entire `models` section (the `{ opus:[...], sonnet:[...], haiku:[...] }` tier→roles map) — pass it as the `models` arg so per-agent model tiers stay config-driven (editable via `/valent-configure`). If the config has no `models` section, omit the arg and the workflow uses its baked-in default assignment. Likewise capture the `reasoning` section (the `{ ultrathink:[...], 'think-harder':[...], 'think-hard':[...], think:[...] }` level→roles map) and pass it as the `reasoning` arg; it is blank by default (injects nothing). Omit the arg if the section is absent or all levels are empty.
+Read and follow `.valent-pipeline/steps/orchestration/load-pipeline-config.md`. Capture `project.type` and the story's `testing_profiles` — these become the Workflow's `projectType` and `profiles` args. Also capture the entire `models` section (the `{ opus:[...], sonnet:[...], haiku:[...] }` tier→roles map) — pass it as the `models` arg so per-agent model tiers stay config-driven (editable via `/valent-configure`). If the config has no `models` section, omit the arg and the workflow uses its baked-in default assignment. Likewise capture the `reasoning` section (the `{ ultrathink:[...], 'think-harder':[...], 'think-hard':[...], think:[...] }` level→roles map) and pass it as the `reasoning` arg; it is blank by default (injects nothing). Omit the arg if the section is absent or all levels are empty. Also capture the `pre_critic_gate` section (deterministic lint/type/static checks) and pass it as the `preCriticGate` arg; omit it if absent or its `commands` list is empty (the gate is then a no-op).
 ### Step 1b: Resolve Next Work Item (when no argument provided)
@@ -52,7 +52,7 @@ Invoke the Workflow at `.valent-pipeline/orchestrators/claude-code/sprint.workfl
 ```js
 Workflow({
   scriptPath: '.valent-pipeline/orchestrators/claude-code/sprint.workflow.js',
-  args: { storyId: '<resolved story id>', projectType: '<project.type>', profiles: [/* testing_profiles */], maxRejectionCycles: <quality.max_rejection_cycles or 5>, models: <config.models or omit>, reasoning: <config.reasoning or omit> }
+  args: { storyId: '<resolved story id>', projectType: '<project.type>', profiles: [/* testing_profiles */], maxRejectionCycles: <quality.max_rejection_cycles or 5>, targetedReverify: <quality.targeted_reverify or false>, models: <config.models or omit>, reasoning: <config.reasoning or omit>, preCriticGate: <config.pre_critic_gate or omit>, ref: <config.ref or omit> }
 })
 ```
@@ -64,9 +64,11 @@ It resolves the story's task graph deterministically (`resolve-graph`), runs the
 Report the returned verdict and the `runId` to the user. A story that JUDGE rejects (or that trips the rejection cap) is recorded as rolled-over — surface that outcome; do not silently retry.
+**Run-state pointer (powers `/valent-resume`).** Immediately before the Workflow call, write `.valent-pipeline/run-state.json`: `{ schema: 1, kind: "story", key: "<story id>", scriptPath: "<the script>", phase: "sprint", sprintId: null, args: <the exact args object>, runId: null, status: "running" }`. The moment the call returns, set `runId`; when the story finalizes (ship or roll-over), set `status: "completed"`.
 ### Step 5: Resume (do NOT restart from scratch)
-Every Workflow invocation returns a `runId`. If a run is interrupted — context limit, crash, manual stop, or a mid-run script edit — **relaunch with `Workflow({ scriptPath, resumeFromRunId: <that runId> })`**, not a fresh invocation. The journal replays the unchanged prefix of `agent()` calls instantly (same script + same args → 100% cache hit) and re-runs only from the first changed/new call onward, so already-shipped stories and passed gates are not redone. Do **not** hand-edit state files to "resume" — pass `resumeFromRunId`.
+**Easiest path: run `/valent-resume`** — it reads `.valent-pipeline/run-state.json` and continues with no `runId` needed. Manually: in the **same session**, relaunch with `Workflow({ scriptPath, resumeFromRunId: <runId from run-state.json> })` — the journal replays the unchanged prefix instantly. **Across sessions** (runId gone), relaunch the same single-story args with `skipCompleted: true`; if the story already finalized on disk it is detected and skipped, otherwise it runs cleanly. Do **not** hand-edit state files to "resume" — use `/valent-resume` (or pass `resumeFromRunId` / `skipCompleted`).
 ## Notes

package/src/commands/init.js CHANGED Viewed

@@ -346,6 +346,11 @@ tech_stack:
   browser_automation_mcp: "${config.tech_stack.browser_automation_mcp || 'playwright-mcp'}"
   state_management: "${config.tech_stack.state_management || 'React Context'}"
+# Per-agent model tiers (tier -> role list). CRITIC and JUDGE are split find-vs-judge: only the
+# judgment steps are on opus (CRITIC-TRIAGE = adjudication, JUDGE-DECIDE = the binding ship call),
+# while the finding work is cheaper (CRITIC-BLIND = haiku; CRITIC-EDGE/-ACCEPT and JUDGE-EVIDENCE =
+# sonnet). To force the whole family onto one tier, list the bare head (e.g. "CRITIC") under a tier
+# and it propagates to every sub-role.
 models:
   opus: [${(config.models.opus || defaults.models.opus).map(a => `"${a}"`).join(', ')}]
   sonnet: [${(config.models.sonnet || defaults.models.sonnet).map(a => `"${a}"`).join(', ')}]
@@ -362,10 +367,34 @@ reasoning:
   think-hard: [${(config.reasoning?.['think-hard'] || []).map(a => `"${a}"`).join(', ')}]
   think: [${(config.reasoning?.think || []).map(a => `"${a}"`).join(', ')}]
+# Deterministic pre-CRITIC gate: cheap, non-LLM checks (lint / type-check / static analysis) run
+# after Build and BEFORE the Opus CRITIC. Mechanical defects caught here are fixed for free instead
+# of burning an expensive CRITIC rejection cycle. BLANK by default (empty commands => no-op). Fill
+# commands with your project's checks, e.g.:
+#   commands: ["npm run lint", "npm run typecheck", "semgrep --config auto --error"]
+# blocking: true reworks-and-retries (bounded by max_cycles, default = quality.max_rejection_cycles);
+# blocking: false runs once as advisory and always proceeds to CRITIC.
+pre_critic_gate:
+  commands: [${(config.pre_critic_gate?.commands || []).map(c => `"${c}"`).join(', ')}]
+  blocking: ${config.pre_critic_gate?.blocking ?? true}
+# Ref MCP documentation verification. When the Ref tools (ref_search_documentation, ref_read_url)
+# are available in your session, the listed agents are told to verify third-party APIs against
+# current docs before writing/reviewing that code — cutting wrong-API rework and doc-page context
+# bloat. Conditional on the tools being present, so it is a harmless no-op if you have not installed
+# Ref. Set enabled: false to suppress. Install Ref as an MCP server to benefit.
+ref:
+  enabled: ${config.ref?.enabled ?? true}
+  roles: [${(config.ref?.roles || defaults.ref.roles).map(r => `"${r}"`).join(', ')}]
 quality:
   max_rejection_cycles: ${config.quality.max_rejection_cycles}
   retrospective_every_n_stories: ${config.quality.retrospective_every_n_stories}
   stall_threshold_minutes: ${config.quality.stall_threshold_minutes}
+  # On a CRITIC rejection, re-review ONLY the previously-flagged findings against the reworked diff
+  # (one cheap pass) instead of re-running the full 3-pass hunt every cycle. QA-B + JUDGE remain the
+  # whole-diff backstops. Set false for a full re-hunt on every rejection (the pre-Lever-3 behavior).
+  targeted_reverify: ${config.quality.targeted_reverify ?? true}
 git:
   target_branch: "${config.git.target_branch}"

package/src/lib/config-schema.js CHANGED Viewed

@@ -41,6 +41,11 @@ export function validateConfig(config) {
       errors.push(`quality.${field} must be a number, got: ${typeof config.quality[field]}`);
     }
   }
+  // Lever 3: targeted re-review on rejection (boolean). When true, a CRITIC rejection re-reviews only
+  // the prior findings instead of re-running the full 3-pass hunt. Optional; defaults applied elsewhere.
+  if (config.quality?.targeted_reverify !== undefined && typeof config.quality.targeted_reverify !== 'boolean') {
+    errors.push(`quality.targeted_reverify must be a boolean, got: ${typeof config.quality.targeted_reverify}`);
+  }
   // Reasoning section (optional — per-agent thinking-effort control surface). Blank by default.
   if (config.reasoning) {
@@ -54,6 +59,40 @@ export function validateConfig(config) {
     }
   }
+  // Pre-CRITIC gate section (optional — deterministic lint/type/static checks run before CRITIC).
+  // Blank by default: no `commands` => the gate is a no-op and behavior is unchanged.
+  if (config.pre_critic_gate) {
+    const g = config.pre_critic_gate;
+    if (g.commands !== undefined) {
+      if (!Array.isArray(g.commands)) {
+        errors.push('pre_critic_gate.commands must be an array of shell-command strings');
+      } else if (!g.commands.every((c) => typeof c === 'string')) {
+        errors.push('pre_critic_gate.commands must contain only strings');
+      }
+    }
+    if (g.blocking !== undefined && typeof g.blocking !== 'boolean') {
+      errors.push(`pre_critic_gate.blocking must be a boolean, got: ${typeof g.blocking}`);
+    }
+    if (g.max_cycles !== undefined && typeof g.max_cycles !== 'number') {
+      errors.push(`pre_critic_gate.max_cycles must be a number, got: ${typeof g.max_cycles}`);
+    }
+  }
+  // Ref MCP section (optional — documentation-verification hint, Lever 6). Conditional on the Ref
+  // tools being available to the agent at runtime, so it is harmless when Ref is not installed.
+  if (config.ref) {
+    if (config.ref.enabled !== undefined && typeof config.ref.enabled !== 'boolean') {
+      errors.push(`ref.enabled must be a boolean, got: ${typeof config.ref.enabled}`);
+    }
+    if (config.ref.roles !== undefined) {
+      if (!Array.isArray(config.ref.roles)) {
+        errors.push('ref.roles must be an array of agent role names');
+      } else if (!config.ref.roles.every((r) => typeof r === 'string')) {
+        errors.push('ref.roles must contain only strings');
+      }
+    }
+  }
   // Communication section
   const validFormats = ['distilled', 'verbose'];
   if (config.communication?.handoff_format && !validFormats.includes(config.communication.handoff_format)) {
@@ -146,14 +185,18 @@ export const defaults = {
     state_management: 'React Context',
   },
   models: {
-    // Quality gates + high-judgment roles — judgment is the whole point, so top tier.
-    // RETRO-REVIEW is the Workflow retrospective's loop-until-dry aggregate review.
-    opus: ['READINESS', 'CRITIC', 'JUDGE', 'Retrospective', 'RETRO-REVIEW'],
-    // Spec + build agents (and lighter retro reasoning) — mid tier.
-    sonnet: ['REQS', 'UXA', 'QA-A', 'QA-B', 'BEND', 'FEND', 'DATA', 'MCP-DEV', 'LIBDEV', 'DOCGEN', 'IAC', 'MOBILE', 'RETRO'],
+    // Quality JUDGMENT steps — the decisions are the whole point, so top tier. CRITIC and JUDGE are
+    // split find-vs-judge (Lever 2): only the adjudication (CRITIC-TRIAGE) and the binding ship call
+    // (JUDGE-DECIDE) are opus. RETRO-REVIEW is the Workflow retrospective's loop-until-dry review.
+    opus: ['READINESS', 'CRITIC-TRIAGE', 'JUDGE-DECIDE', 'Retrospective', 'RETRO-REVIEW'],
+    // Spec + build agents, the *finding* half of the gates (CRITIC edge/acceptance hunting, JUDGE
+    // evidence cross-referencing), and lighter retro reasoning — mid tier.
+    sonnet: ['REQS', 'UXA', 'QA-A', 'QA-B', 'BEND', 'FEND', 'DATA', 'MCP-DEV', 'LIBDEV', 'DOCGEN', 'IAC', 'MOBILE', 'RETRO', 'CRITIC-EDGE', 'CRITIC-ACCEPT', 'CRITIC-REVERIFY', 'JUDGE-EVIDENCE'],
     // Mechanical retrieval / CLI-runner / IO steps — no reasoning, cheapest tier.
     // RESOLVE/PACK/VALIDATE/CALIBRATE/PERSIST are the Workflow orchestrators' CLI-runner agents.
-    haiku: ['Knowledge', 'Embed', 'RESOLVE', 'PACK', 'VALIDATE', 'CALIBRATE', 'PERSIST'],
+    // STATIC runs the deterministic pre-CRITIC gate (lint/type/static analysis) — pure command runner.
+    // CRITIC-BLIND is the no-context blind hunt — mechanical (naming, dead code, obvious bugs).
+    haiku: ['Knowledge', 'Embed', 'RESOLVE', 'PACK', 'VALIDATE', 'CALIBRATE', 'PERSIST', 'STATIC', 'CRITIC-BLIND'],
   },
   // Per-agent reasoning effort (thinking budget). Mirrors `models`: list a role under an effort
   // level to inject that thinking trigger into the agent's prompt at spawn. BLANK by default —
@@ -166,10 +209,34 @@ export const defaults = {
     'think-hard': [],
     think: [],
   },
+  // Deterministic pre-CRITIC gate (Lever 1): cheap, non-LLM checks run after Build and before the
+  // Opus CRITIC. Mechanical defects caught here never become a CRITIC rejection cycle. BLANK by
+  // default (no commands => no-op). Populate `commands` with your project's lint / type-check /
+  // static-analysis invocations, e.g. ['npm run lint', 'npm run typecheck', 'semgrep --config auto --error'].
+  // `blocking: true` reworks-and-retries up to `max_cycles` (default: quality.max_rejection_cycles);
+  // `blocking: false` runs the checks once as advisory and always proceeds to CRITIC.
+  pre_critic_gate: {
+    commands: [],
+    blocking: true,
+  },
+  // Ref MCP documentation verification (Lever 6). When the Ref tools (ref_search_documentation,
+  // ref_read_url) are available in the session, the listed roles are told to verify third-party
+  // APIs against current docs before writing/reviewing that code. Purely conditional on tool
+  // availability — a no-op for sessions without Ref — so it defaults ON. Set enabled:false to
+  // suppress, or trim `roles` to scope it. Install Ref as an MCP server to benefit.
+  ref: {
+    enabled: true,
+    roles: ['BEND', 'FEND', 'DATA', 'MCP-DEV', 'LIBDEV', 'DOCGEN', 'IAC', 'MOBILE', 'CRITIC'],
+  },
   quality: {
     max_rejection_cycles: 5,
     retrospective_every_n_stories: 5,
     stall_threshold_minutes: 15,
+    // Lever 3: on a CRITIC rejection, re-review ONLY the previously-flagged findings against the
+    // reworked diff (one cheap pass) instead of re-running the full 3-pass hunt every cycle. New
+    // projects ship this on; QA-B + JUDGE remain the whole-diff backstops. Set false for a full
+    // re-hunt on every rejection cycle (the pre-Lever-3 behavior).
+    targeted_reverify: true,
   },
   git: {
     target_branch: '',