npm - valent-pipeline - Versions diffs - 0.5.2 → 0.5.4 - Mend

valent-pipeline 0.5.2 → 0.5.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (11)

package/package.json +1 -1
package/pipeline/orchestrators/claude-code/retro.workflow.js +42 -141
package/pipeline/orchestrators/claude-code/sprint.workflow.js +84 -3
package/pipeline/prompts/qa-a.md +1 -0
package/pipeline/scripts/query-kb.ts +7 -1
package/pipeline/steps/common/agent-protocol.md +2 -2
package/pipeline/steps/common/quality-standards.md +14 -0
package/pipeline/steps/critic/test-review.md +3 -0
package/pipeline/steps/qa-a/write-spec.md +20 -0
package/pipeline/steps/retrospective/directives.md +7 -2
package/skills/valent-run-project-workflow/SKILL.md +4 -2

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "valent-pipeline",
-  "version": "0.5.2",
+  "version": "0.5.4",
   "description": "v3 multi-agent AI pipeline for software development lifecycle",
   "type": "module",
   "bin": {

package/pipeline/orchestrators/claude-code/retro.workflow.js CHANGED

@@ -5,15 +5,19 @@
  * by scripts/test-workflow.js. Opt-in, not the default. The Codex provider keeps the
  * markdown-skill Lead (hybrid, R3).
  *
- * The retrospective is the ONE place the meta-loop adds genuine *quality*, not just reliable
- * structure (R5): the prose retro is a fixed single pass; here the aggregate review runs
- * LOOP-UNTIL-DRY (keep reviewing until K consecutive rounds surface nothing new) followed by a
- * COMPLETENESS-CRITIC ("which pattern did we not check?"). That is the same rigor that makes
- * CRITIC's 3-pass and JUDGE the strongest existing features, applied to the learning loop.
+ * A retrospective LEARNS from what the sprint already produced — it does NOT re-review the code.
+ * Finding bugs is CRITIC/JUDGE's job, and they already did it as a gate DURING the sprint; by the
+ * time the retro runs the code has shipped, so a fresh review here is both too late to gate and a
+ * duplication of work. (An earlier design ran a loop-until-dry aggregate review + completeness-
+ * critic on opus — that bolted the pipeline's most expensive pattern onto the stage that should be
+ * the cheapest, for little value. The one genuine cross-story blind spot — seams between stories
+ * that per-story CRITIC can't see — now lives where it can still gate: the sprint-end integration
+ * gate in sprint.workflow.js.)
  *
- * Flow: calibrate (CLI) -> analyze -> aggregate-review (loop-until-dry) -> completeness-critic
- *       -> directives (agent proposes; CODE enforces impact gating + the architectural-invariant
- *          guard) -> embed (CLI).
+ * Flow: calibrate (CLI) -> synthesize (mine the sprint's OWN artifacts — CRITIC reviews, JUDGE
+ *       rejections, QA bugs, rejection-cycle/cost data — into correction directives) -> directives
+ *       gating (agent proposes; CODE enforces impact gating + the architectural-invariant guard)
+ *       -> embed (CLI). Bounded and cheap: no opus review loop.
  *
  * The deterministic pieces are NOT in this script: calibration arithmetic is
  * `node .valent-pipeline/bin/cli.js calibrate` (src/lib/sprint.js); embedding is `node .valent-pipeline/bin/cli.js db embed`.
@@ -21,21 +25,19 @@
  * GATING and INVARIANT GUARD are deterministic policy, so they are enforced HERE in code —
  * the agent only proposes; the script decides what gets applied vs. surfaced for approval.
  *
- * args: { batchNumber, sprintId?, storyOutputDirs?: string[], dryRounds?: number, maxRounds?: number, models? }
- *   sprintId present => sprint-mode (calibration runs). dryRounds = consecutive empty rounds
- *   that end the loop-until-dry (default 2). maxRounds caps it (default 5). `models` is the
- *   pipeline-config.yaml `models` tier->roles map, passed through by the invoking skill so
- *   per-agent model tiers stay config-driven (editable via `valent configure`). Omit it to use
- *   the baked-in default. See sprint.workflow.js for the full rationale.
+ * args: { batchNumber, sprintId?, storyOutputDirs?: string[], models?, reasoning? }
+ *   sprintId present => sprint-mode (calibration runs). `models` is the pipeline-config.yaml
+ *   `models` tier->roles map, passed through by the invoking skill so per-agent model tiers stay
+ *   config-driven (editable via `valent configure`). Omit it to use the baked-in default. See
+ *   sprint.workflow.js for the full rationale.
  */
 export const meta = {
   name: 'valent-retro',
-  description: 'Retrospective: calibrate, loop-until-dry aggregate review, gated directives, embed (Workflow)',
+  description: 'Retrospective: calibrate, synthesize directives from sprint artifacts, gate, embed (Workflow)',
   phases: [
     { title: 'Calibrate', detail: 'node .valent-pipeline/bin/cli.js calibrate (estimation accuracy, in code) — sprint mode' },
-    { title: 'Analyze', detail: 'CRITIC/QA/JUDGE batch outputs + cost' },
-    { title: 'Aggregate', detail: 'loop-until-dry 3-pass aggregate review + completeness critic (R5)' },
+    { title: 'Synthesize', detail: 'mine CRITIC/QA/JUDGE/cost artifacts -> propose correction directives' },
     { title: 'Directives', detail: 'agent proposes; code enforces impact gating + invariant guard' },
     { title: 'Embed', detail: 'node .valent-pipeline/bin/cli.js db embed (persist curated patterns)' },
   ],
@@ -43,39 +45,6 @@ export const meta = {
 // --- schemas (inlined) ---
-const FINDINGS_SCHEMA = {
-  type: 'object',
-  required: ['schema', 'findings'],
-  additionalProperties: true,
-  properties: {
-    schema: { const: 1 },
-    findings: {
-      type: 'array',
-      items: {
-        type: 'object',
-        required: ['id', 'summary'],
-        properties: {
-          id: { type: 'string' },
-          summary: { type: 'string' },
-          severity: { type: 'string' },
-          stories: { type: 'array', items: { type: 'string' } },
-        },
-      },
-    },
-  },
-}
-const COMPLETENESS_SCHEMA = {
-  type: 'object',
-  required: ['schema', 'gaps'],
-  additionalProperties: true,
-  // gaps = review angles NOT yet covered (e.g. "no security-boundary scan run"). Empty => complete.
-  properties: {
-    schema: { const: 1 },
-    gaps: { type: 'array', items: { type: 'string' } },
-  },
-}
 const DIRECTIVES_SCHEMA = {
   type: 'object',
   required: ['schema', 'directives'],
@@ -124,8 +93,6 @@ function parseArgs(x) {
 const a = parseArgs(args)
 const batchNumber = a.batchNumber
 const sprintId = a.sprintId || null
-const dryRounds = a.dryRounds ?? 2
-const maxRounds = a.maxRounds ?? 5
 if (batchNumber == null) throw new Error('args must include { batchNumber }')
 // --- per-agent model tiers ----------------------------------------------------
@@ -133,12 +100,11 @@ if (batchNumber == null) throw new Error('args must include { batchNumber }')
 // args.models by the invoking skill — a Workflow script can't read files. We invert it
 // to role->tier and overlay it on a baked-in default so the workflow self-hosts a sane
 // assignment even when args.models is absent. Static + args only => journal-replay safe.
-// Retro stages map to synthetic role keys (not the single RETROSPECTIVE persona) so each
-// stage can be tuned independently: the loop-until-dry aggregate review + completeness
-// critic are the genuine quality work (RETRO-REVIEW -> opus); analyze/directives are
-// lighter (RETRO -> sonnet); calibrate/embed/IO are mechanical (haiku).
+// Retro stages map to synthetic role keys (not the single RETROSPECTIVE persona) so each stage can
+// be tuned independently. The retro is learning, not review, so there is no opus tier here:
+// synthesis/judgment over existing artifacts is RETRO -> sonnet; calibrate/embed/IO are mechanical
+// (haiku). The cross-story review that used to justify opus now lives in the sprint-end gate.
 const DEFAULT_MODELS = {
-  'RETRO-REVIEW': 'opus',
   RETRO: 'sonnet',
   CALIBRATE: 'haiku', EMBED: 'haiku', PERSIST: 'haiku',
 }
@@ -185,9 +151,6 @@ const retroPrompt = (instruction, returnContract) => {
     (returnContract || 'Return your findings as the JSON object specified.')
 }
-// A stable de-dup key so loop-until-dry converges (don't re-count the same finding).
-const findingKey = (f) => `${(f.summary || '').toLowerCase().trim().slice(0, 80)}`
 // ---------------------------------------------------------------------------
 let calibration = null
@@ -202,93 +165,33 @@ if (sprintId) {
   log(`calibration: ${(calibration.flagged_pairs || []).length} flagged pair(s); velocity unstable=${calibration.velocity?.unstable}`)
 }
-phase('Analyze')
-await agent(
-  retroPrompt(
-    'Run analyze.md: read all CRITIC reviews, QA-B bug reports, JUDGE rejections, and cost data; categorize rejection/bug patterns.',
-    'Return ONLY { schema:1, findings:[{id,summary,severity,stories}] } as JSON.',
-  ),
-  { label: 'analyze', phase: 'Analyze', schema: FINDINGS_SCHEMA, model: modelFor('RETRO') },
-)
-phase('Aggregate')
-// LOOP-UNTIL-DRY (R5): re-run the 3-pass aggregate review until `dryRounds` consecutive
-// rounds surface nothing new, deduping against everything already seen. A simple
-// fixed-pass review (the prose behavior) misses the tail; this does not.
-const seen = new Set()
-const confirmed = []
-let dry = 0
-let round = 0
-while (dry < dryRounds && round < maxRounds) {
-  round += 1
-  const r = await agent(
-    retroPrompt(
-      `Run aggregate-review.md (round ${round}): 3-pass CRITIC-style review of the aggregate diff (last retro tag to HEAD) — ` +
-        `correctness across story boundaries, convention/pattern drift, architecture/integration. ` +
-        `Report ONLY findings not already reported in earlier rounds.`,
-      'Return ONLY { schema:1, findings:[{id,summary,severity,stories}] } as JSON.',
-    ),
-    { label: `aggregate:round-${round}`, phase: 'Aggregate', schema: FINDINGS_SCHEMA, model: modelFor('RETRO-REVIEW') },
-  )
-  const fresh = (r.findings || []).filter((f) => !seen.has(findingKey(f)))
-  if (!fresh.length) {
-    dry += 1
-    log(`aggregate round ${round}: dry (${dry}/${dryRounds})`)
-    continue
-  }
-  dry = 0
-  for (const f of fresh) seen.add(findingKey(f))
-  confirmed.push(...fresh)
-  log(`aggregate round ${round}: +${fresh.length} new finding(s) (${confirmed.length} total)`)
-}
-// COMPLETENESS-CRITIC (R5): ask what review angle we never ran. Each named gap gets one
-// targeted review round; anything it surfaces joins the confirmed set.
-const critic = await agent(
-  retroPrompt(
-    `We ran ${round} aggregate-review round(s) and found ${confirmed.length} finding(s). ` +
-      `What review angle was NOT covered (e.g. a modality, a security boundary, a contract surface)? ` +
-      `List only genuine gaps — empty if coverage is complete.`,
-    'Return ONLY { schema:1, gaps:["..."] } as JSON.',
-  ),
-  { label: 'completeness-critic', phase: 'Aggregate', schema: COMPLETENESS_SCHEMA, model: modelFor('RETRO-REVIEW') },
-)
-if ((critic.gaps || []).length) {
-  log(`completeness-critic surfaced ${critic.gaps.length} gap(s) — running targeted reviews`)
-  const extra = await parallel(
-    critic.gaps.map((gap, i) => () =>
-      agent(
-        retroPrompt(`Targeted aggregate review for the previously-uncovered angle: "${gap}". Report only findings not already reported.`,
-          'Return ONLY { schema:1, findings:[{id,summary,severity,stories}] } as JSON.'),
-        { label: `aggregate:gap-${i + 1}`, phase: 'Aggregate', schema: FINDINGS_SCHEMA, model: modelFor('RETRO-REVIEW') },
-      )),
-  )
-  for (const r of extra.filter(Boolean)) {
-    for (const f of (r.findings || [])) {
-      if (!seen.has(findingKey(f))) { seen.add(findingKey(f)); confirmed.push(f) }
-    }
-  }
-}
-log(`aggregate review complete: ${confirmed.length} confirmed finding(s)`)
-phase('Directives')
-// The agent PROPOSES directives (with impact_level + a touchesInvariant flag). The CODE
-// enforces the policy — deterministic, uncheatable — per the §5b determinism map:
+phase('Synthesize')
+// Learning, not review: one pass mines the artifacts the sprint ALREADY produced (CRITIC reviews,
+// QA-B bug reports, JUDGE rejections, rejection-cycle counts, cost data) per analyze.md, then drafts
+// correction directives per directives.md — no fresh code review, no opus loop. The agent proposes
+// directives (with impact_level + a touchesInvariant flag); the CODE below enforces the policy —
+// deterministic, uncheatable — per the §5b determinism map:
 //   - touchesInvariant      -> ARCHITECTURE-CONFLICT: never auto-applied, surfaced to the user
 //   - impact_level 'high'   -> proposal only, requires user approval
 //   - 'low' / 'medium'      -> auto-applied (medium also notifies the Lead)
 const drafted = await agent(
   retroPrompt(
-    `Run directives.md against the ${confirmed.length} confirmed finding(s)` +
-      (calibration ? ' and the calibration metrics' : '') +
-      `. For EACH proposed directive set impact_level (low|medium|high) and touchesInvariant=true if it would skip test ` +
-      `execution, allow shipping without evidence, weaken a quality gate, or exempt mandatory tests. Do NOT self-censor — ` +
-      `propose it and flag it; the orchestrator decides what gets applied.`,
+    `Run analyze.md then directives.md: read all CRITIC reviews, QA-B bug reports, JUDGE rejections, ` +
+      `rejection-cycle counts, and cost data from this sprint's story outputs; categorize the recurring ` +
+      `rejection/bug patterns; then propose correction directives for those patterns` +
+      (calibration ? ' and for the calibration metrics' : '') +
+      `. Do NOT re-review the shipped code for new bugs — that was CRITIC/JUDGE's job during the sprint; ` +
+      `your job is to learn from what they already found. For EACH proposed directive set impact_level ` +
+      `(low|medium|high) and touchesInvariant=true if it would skip test execution, allow shipping without ` +
+      `evidence, weaken a quality gate, or exempt mandatory tests. Do NOT self-censor — propose it and flag ` +
+      `it; the orchestrator decides what gets applied.`,
     'Return ONLY { schema:1, directives:[{target_agent,directive,reason,impact_level,touchesInvariant,category}] } as JSON.',
   ),
-  { label: 'draft-directives', phase: 'Directives', schema: DIRECTIVES_SCHEMA, model: modelFor('RETRO') },
+  { label: 'synthesize', phase: 'Synthesize', schema: DIRECTIVES_SCHEMA, model: modelFor('RETRO') },
 )
+phase('Directives')
 const all = drafted.directives || []
 const conflicts = all.filter((d) => d.touchesInvariant)
 const highImpact = all.filter((d) => !d.touchesInvariant && d.impact_level === 'high')
@@ -328,9 +231,7 @@ const embed = await agent(
 return {
   batchNumber,
   sprintId,
-  aggregate_findings: confirmed.length,
-  aggregate_rounds: round,
-  completeness_gaps: (critic.gaps || []).length,
+  directives_proposed: all.length,
   directives_applied: applied.length,
   directives_pending_approval: proposals.length,
   architecture_conflicts: conflicts.length,
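
The gating policy described in the `Synthesize` comments above (invariant-touching directives are never auto-applied, `high` impact needs approval, `low`/`medium` are auto-applied with `medium` also notifying the Lead) can be sketched as a pure partition. This is a minimal illustration only: `gateDirectives` and the `notifyLead` bucket are hypothetical names, and the diff does not show how the real script builds its `applied` and `proposals` lists.

```javascript
// Hypothetical helper (not from the package): partition proposed directives
// according to the policy stated in the retro.workflow.js comments.
function gateDirectives(all) {
  // Anything touching an architectural invariant is surfaced, never applied.
  const conflicts = all.filter((d) => d.touchesInvariant)
  // High impact (and not an invariant conflict) stays a proposal for approval.
  const highImpact = all.filter((d) => !d.touchesInvariant && d.impact_level === 'high')
  // Low and medium impact are auto-applied.
  const applied = all.filter(
    (d) => !d.touchesInvariant && (d.impact_level === 'low' || d.impact_level === 'medium'),
  )
  // Medium impact is applied AND reported to the Lead.
  const notifyLead = applied.filter((d) => d.impact_level === 'medium')
  return { conflicts, highImpact, applied, notifyLead }
}

// Illustrative input; the directive text is made up for this example.
const sampleDirectives = [
  { directive: 'name every P0 test', impact_level: 'low', touchesInvariant: false },
  { directive: 'add warm-up to perf tests', impact_level: 'medium', touchesInvariant: false },
  { directive: 'rework the build graph', impact_level: 'high', touchesInvariant: false },
  { directive: 'skip QA on docs stories', impact_level: 'low', touchesInvariant: true },
]

const gated = gateDirectives(sampleDirectives)
console.log(gated.applied.length, gated.highImpact.length, gated.conflicts.length)
```

Note that the invariant check runs first in every filter, so a directive flagged `touchesInvariant` lands in `conflicts` regardless of its `impact_level`.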

package/pipeline/orchestrators/claude-code/sprint.workflow.js CHANGED

@@ -59,6 +59,7 @@ export const meta = {
     { title: 'Critic', detail: 'three independent passes in parallel -> triage -> rejection loop (code-owned cap)' },
     { title: 'QA', detail: 'execute tests against real infra' },
     { title: 'Judge', detail: 'evidence-based ship decision' },
+    { title: 'Integration', detail: 'single cross-story seam review — only when >1 story touched overlapping files' },
   ],
 }
@@ -123,6 +124,31 @@ const FINDINGS_SCHEMA = {
   },
 }
+// Sprint-end cross-story seam review. Advisory: stories already passed JUDGE, so this does not
+// re-gate them — it surfaces integration findings to be filed as bugs against the affected stories.
+const INTEGRATION_SCHEMA = {
+  type: 'object',
+  required: ['schema', 'verdict', 'findings'],
+  additionalProperties: true,
+  properties: {
+    schema: { const: 1 },
+    verdict: { enum: ['clean', 'findings'] },
+    findings: {
+      type: 'array',
+      items: {
+        type: 'object',
+        required: ['summary'],
+        properties: {
+          summary: { type: 'string' },
+          severity: { enum: ['High', 'Med', 'Low'] },
+          files: { type: 'array', items: { type: 'string' } },
+          stories: { type: 'array', items: { type: 'string' } },
+        },
+      },
+    },
+  },
+}
 const RESOLVED_GRAPH_SCHEMA = {
   type: 'object',
   required: ['tasks', 'skipped'],
@@ -190,7 +216,7 @@ for (const s of batch) {
 // assignment even when args.models is absent. Static + args only => journal-replay safe.
 //   gates -> opus (judgment), spec/build -> sonnet, CLI-runners/IO -> haiku.
 const DEFAULT_MODELS = {
-  READINESS: 'opus', CRITIC: 'opus', JUDGE: 'opus',
+  READINESS: 'opus', CRITIC: 'opus', JUDGE: 'opus', INTEGRATION: 'opus',
   REQS: 'sonnet', UXA: 'sonnet', 'QA-A': 'sonnet', 'QA-B': 'sonnet',
   BEND: 'sonnet', FEND: 'sonnet', DATA: 'sonnet', 'MCP-DEV': 'sonnet',
   LIBDEV: 'sonnet', DOCGEN: 'sonnet', IAC: 'sonnet', MOBILE: 'sonnet',
@@ -284,11 +310,66 @@ for (let i = 0; i < batch.length; i++) {
 const shippedCount = results.filter((r) => r.shipped).length
 log(`sprint complete: ${shippedCount}/${results.length} shipped`)
+// Sprint-end integration gate: per-story CRITIC reviews each diff in ISOLATION and never sees two
+// stories together, so cross-story SEAMS (a module touched by >1 story, mismatched integration
+// points) are its one structural blind spot. Cover it with a SINGLE bounded review — but only when
+// it could find something: >1 story shipped AND at least two of them touched an overlapping file.
+// No loop, no completeness-critic (that disproportionate apparatus was removed from the retro for
+// the same reason). Advisory only — stories already passed JUDGE, so findings are surfaced as bugs
+// to file, not a re-gate. Overlap is computed from the dev handoff `files`, so a disjoint sprint
+// (every story in its own corner of the tree) skips the gate entirely.
+const integration = await runIntegrationGate(results.filter((r) => r.shipped))
 return {
   shipped: results.every((r) => r.shipped),
   stories_shipped: shippedCount,
   stories_rolled_over: results.length - shippedCount,
   results,
+  integration,
+}
+// runIntegrationGate: returns null when not warranted (≤1 shipped story or no file overlap), else
+// the single review's structured result ({ verdict, findings }) for the orchestrator to file as bugs.
+async function runIntegrationGate(shipped) {
+  if (shipped.length < 2) return null
+  // Files touched by more than one shipped story = the seams worth reviewing.
+  const owners = new Map() // file -> Set(storyId)
+  for (const r of shipped) {
+    for (const f of r.files || []) {
+      if (!owners.has(f)) owners.set(f, new Set())
+      owners.get(f).add(r.storyId)
+    }
+  }
+  const overlapFiles = [...owners.entries()].filter(([, s]) => s.size > 1).map(([f]) => f)
+  if (!overlapFiles.length) {
+    log(`integration gate: skipped — ${shipped.length} stories shipped but no overlapping files`)
+    return null
+  }
+  log(`integration gate: ${overlapFiles.length} file(s) touched by >1 story — running single cross-story review`)
+  const result = await agent(
+    [
+      `You are **INTEGRATION**, the sprint-end cross-story seam reviewer for this batch.`,
+      ``,
+      `Per-story CRITIC reviewed each story's diff in isolation and CANNOT see two stories together.`,
+      `Review ONLY the cross-story SEAMS — the files below were each modified by more than one story`,
+      `this sprint, so that is where integration mismatches hide (incompatible signatures, ordering`,
+      `assumptions, duplicated/diverging logic, broken shared invariants).`,
+      ``,
+      `Overlapping files: ${JSON.stringify(overlapFiles)}`,
+      `Shipped stories: ${JSON.stringify(shipped.map((r) => r.storyId))}`,
+      ``,
+      `Do a SINGLE focused pass. Do NOT re-review within-story correctness (CRITIC already did) and do`,
+      `NOT hunt for unrelated issues. Report only genuine cross-story seam problems; empty if the seams`,
+      `are clean. Return ONLY { schema:1, verdict:"clean"|"findings", findings:[{summary,severity,files,stories}] } as JSON.`,
+    ].join('\n'),
+    { label: 'gate:integration', phase: 'Integration', schema: INTEGRATION_SCHEMA, model: modelFor('INTEGRATION') },
+  )
+  const findings = result.findings || []
+  log(`integration gate: ${findings.length} cross-story finding(s)`)
+  return { verdict: result.verdict, findings, overlapFiles }
 }
 // ===========================================================================
@@ -373,7 +454,7 @@ async function runStory(story) {
   const devFiles = builds.filter(Boolean).flatMap((b) => (Array.isArray(b.files) ? b.files : []))
   if (devFiles.length === 0) {
     log(`${storyId}: dev phase reported no files — nothing to review; skipping CRITIC/QA/JUDGE, rolling over`)
-    return { storyId, shipped: false, verdict: 'blocked', reason: 'no-dev-output', skipped: graph.skipped }
+    return { storyId, shipped: false, verdict: 'blocked', reason: 'no-dev-output', skipped: graph.skipped, files: [] }
   }
   phase('Critic')
@@ -386,7 +467,7 @@ async function runStory(story) {
   const decision = await runGate(storyId, 'JUDGE', 'judge.md',
     'Review evidence (tests, traceability, bugs) and make the ship decision.', 'Judge', null)
-  return { storyId, shipped: decision.verdict === 'pass', verdict: decision.verdict, skipped: graph.skipped }
+  return { storyId, shipped: decision.verdict === 'pass', verdict: decision.verdict, skipped: graph.skipped, files: devFiles }
   // --- per-story closures over storyId/devTasks ----------------------------
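
To make the skip condition of the integration gate concrete, the overlap computation from `runIntegrationGate` can be exercised on its own. The sketch below lifts the same Map-of-Sets logic into a standalone function; the name `overlappingFiles`, the story ids, and the file paths are assumptions made up for illustration.

```javascript
// Standalone sketch of the overlap logic shown in runIntegrationGate.
// Returns the files touched by more than one shipped story.
function overlappingFiles(shipped) {
  const owners = new Map() // file -> Set(storyId)
  for (const r of shipped) {
    for (const f of r.files || []) {
      if (!owners.has(f)) owners.set(f, new Set())
      owners.get(f).add(r.storyId)
    }
  }
  return [...owners.entries()].filter(([, s]) => s.size > 1).map(([f]) => f)
}

// Hypothetical sprint results: S-1 and S-2 both touch src/db.js.
const overlapping = overlappingFiles([
  { storyId: 'S-1', files: ['src/auth.js', 'src/db.js'] },
  { storyId: 'S-2', files: ['src/db.js', 'src/ui.js'] },
  { storyId: 'S-3', files: ['docs/readme.md'] },
])

// A disjoint sprint: no shared files, so the gate would be skipped.
const disjoint = overlappingFiles([
  { storyId: 'S-4', files: ['a.js', 'a.js'] }, // a duplicate within one story is not a seam
  { storyId: 'S-5', files: ['b.js'] },
])

console.log(overlapping, disjoint)
```

Because owners are tracked in a `Set`, a file listed twice by the same story does not count as overlap; only distinct story ids do.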

package/pipeline/prompts/qa-a.md CHANGED

@@ -73,6 +73,7 @@ Before finalizing, verify:
 - No behavior is spec'd at multiple tiers
 - Every error test case has a specific error code and message pattern
 - Every P0 test case has DB state verification
+- Every P0/P1 AC on an interactive surface (`api`, `ui`) has a human-readable acceptance scenario realized as a Live Smoke Test (api) or UI Integration Smoke Test (ui) row for QA-B to execute against live infrastructure — never left to unit tests alone
 - Seed data uses factory patterns, not raw SQL
 - NFR requirements are tagged ([NFR-PERF], [NFR-SEC], [NFR-REL])
 - Visual validation checkpoints cover all 5 page states for each page (if UI story)

package/pipeline/scripts/query-kb.ts CHANGED

@@ -78,10 +78,16 @@ switch (mode) {
   case 'directives': {
     const agent = flags['agent'];
+    // ALL-DEV is a cross-cutting directive that applies to every developer agent. Include it when
+    // the queried agent is one of them, so a per-agent query still sees shared dev directives.
+    const DEV_AGENTS = ['BEND', 'FEND', 'DATA', 'MCP-DEV', 'LIBDEV', 'DOCGEN', 'IAC', 'MOBILE'];
     let rows;
     if (agent) {
+      const includeAllDev = DEV_AGENTS.includes(agent.toUpperCase());
       rows = db.prepare(
-        "SELECT id, directive, reason FROM correction_directives WHERE target_agent = ? AND status = 'active'"
+        includeAllDev
+          ? "SELECT id, target_agent, directive, reason FROM correction_directives WHERE (target_agent = ? OR target_agent = 'ALL-DEV') AND status = 'active'"
+          : "SELECT id, directive, reason FROM correction_directives WHERE target_agent = ? AND status = 'active'"
       ).all(agent) as { id: string; directive: string; reason: string }[];
     } else {
       rows = db.prepare(

package/pipeline/steps/common/agent-protocol.md CHANGED

@@ -42,7 +42,7 @@ When `signal_delivery` is `thread`: Lead steers you with the Design Council ques
 When you need information about project conventions, architectural patterns, existing code structure, or known pitfalls: self-serve from the knowledge base before exploring the codebase directly. Curated knowledge and correction directives answer in seconds what codebase exploration takes minutes to discover.
 **How to self-serve:**
-1. Read correction directives from `{correction_directives}` — filter for `status: active` entries targeting your agent role.
+1. Read correction directives from `{correction_directives}` — filter for `status: active` entries whose `target_agent` is your agent role (or `ALL-DEV` if you are a developer agent).
 2. Read curated knowledge files in `{curated_files_path}` — scan file names and section headers for entries relevant to your current task.
 3. If `{knowledge_mode}` is `sqlite`: query the database via CLI (see your read-inputs step for commands).
 4. If `{story_output_dir}/knowledge-context.md` exists: read it directly — this is a pre-compiled reference containing all relevant knowledge for the current story.
@@ -51,7 +51,7 @@ Reserve direct codebase exploration for when curated knowledge does not have the
 ## Correction Directives
-Read active correction directives from `{correction_directives}`. If the file does not exist or is empty, proceed without directives -- this is expected for new pipelines. Apply ALL directives targeting your agent role. If a directive conflicts with these instructions, the directive takes precedence. Log each applied directive in your YAML frontmatter under `correctionsApplied`.
+Read active correction directives from `{correction_directives}`. If the file does not exist or is empty, proceed without directives -- this is expected for new pipelines. Apply ALL directives whose `target_agent` matches your agent role **OR** is `ALL-DEV` when you are a developer agent (BEND, FEND, DATA, MCP-DEV, LIBDEV, DOCGEN, IAC, MOBILE) — `ALL-DEV` is a cross-cutting directive that applies to every developer agent, not one surface. If a directive conflicts with these instructions, the directive takes precedence. Log each applied directive in your YAML frontmatter under `correctionsApplied`.
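
The `ALL-DEV` matching rule stated above, which is also what the `query-kb.ts` change implements in SQL, can be sketched as an in-memory filter. This is an illustrative sketch only: `directivesFor` is a hypothetical helper, the sample rows are invented, and it assumes `target_agent` values are stored in upper case.

```javascript
// Developer roles, copied from the list in the diff.
const DEV_AGENTS = ['BEND', 'FEND', 'DATA', 'MCP-DEV', 'LIBDEV', 'DOCGEN', 'IAC', 'MOBILE']

// Hypothetical helper: select the active directives that apply to a role.
// Developer agents additionally receive the cross-cutting ALL-DEV directives.
function directivesFor(role, directives) {
  const key = role.toUpperCase() // assumption: stored roles are upper case
  const isDev = DEV_AGENTS.includes(key)
  return directives.filter(
    (d) =>
      d.status === 'active' &&
      (d.target_agent === key || (isDev && d.target_agent === 'ALL-DEV')),
  )
}

// Invented sample rows for illustration.
const sampleRows = [
  { id: 'D1', target_agent: 'BEND', status: 'active' },
  { id: 'D2', target_agent: 'CRITIC', status: 'active' },
  { id: 'D3', target_agent: 'ALL-DEV', status: 'active' },
  { id: 'D4', target_agent: 'ALL-DEV', status: 'retired' },
]

const forBend = directivesFor('bend', sampleRows).map((d) => d.id)
const forCritic = directivesFor('CRITIC', sampleRows).map((d) => d.id)
console.log(forBend, forCritic)
```

A non-developer role such as CRITIC sees only its own directives, which mirrors the second branch of the SQL in `query-kb.ts`.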
 ## YAML Frontmatter

package/pipeline/steps/common/quality-standards.md CHANGED

@@ -10,8 +10,22 @@ These are non-negotiable. CRITIC and QA-B enforce them. Every developer agent (B
 - **<1.5 minutes per test** -- any test exceeding this is a design problem, not a timeout problem.
 - **Self-cleaning via fixture auto-teardown** -- tests must not leave state behind. Use framework teardown hooks, not manual cleanup.
 - **Explicit assertions in test bodies** -- never hide assertions in helpers. Every test body must contain at least one visible `expect`/`assert`.
+- **Falsifiable assertions** -- every assertion must be able to FAIL when the behavior is wrong. No unfalsifiable disjunctive gates like `expect(a === 0 || b === 0 || fallback).toBe(true)` — a disjunction with a catch-all that is almost always true asserts nothing. Assert the real invariant directly (e.g. "exactly one container exists AND `SELECT 1` succeeds AND no corruption log"), not a condition that passes regardless of outcome.
+- **Bind to the real artifact under test** -- resolve the actual artifact dynamically (built image id, running service/container name, the file the build emits) and assert against THAT. Never assert against a hardcoded tag/name or an artifact that merely happens to exist on your machine. The test: would it still pass on a clean/CI checkout with no leftover state? If it passes only because of an ambient/side-channel artifact, it is broken. (Prove the artifact exists — `expect(imageId).toBeTruthy()` — before inspecting it.)
+- **Statistical assertions match the spec** -- when the spec states a percentile or sample count (e.g. NFR-PERF "P95 < 200ms"), assert exactly that over real samples with warm-up where appropriate. A single-shot `elapsed < 200` is not a P95 and is flaky — it does not satisfy the spec.
 - **Parallel-safe** -- no shared mutable state between tests. Must run cleanly with `--workers=4`.
+## Pre-Handoff Self-Check — MANDATORY before declaring done
+CRITIC enforces every item below and will reject on any High finding, costing a full rework cycle. Run this checklist against your own tests BEFORE writing your handoff; fix anything that fails first. (Each recurred across real stories — this is the gap that buys a guaranteed rework cycle if skipped.)
+- [ ] **Every P0 qa-test-spec case has a named, implemented test.** Count the spec's P0 cases; count your `it(...)`/`test(...)` blocks; a P0 case with no implementation is a High finding. Do not treat a spec note (e.g. "idempotency") as covered unless there is a named test for it.
+- [ ] **No `if`/`switch`/ternary in any test body** (same path every run).
+- [ ] **No unfalsifiable assertions** — no disjunctive/catch-all gates; each assert can fail.
+- [ ] **Tests bind to the real artifact** — no hardcoded tag/name; would pass on a clean/CI machine with no leftover state.
+- [ ] **Statistical/perf assertions match the spec's wording** (percentile + samples, not single-shot).
+- [ ] **No assertion weakening, no infra mocking on happy paths, no hard waits, assertions visible in test bodies.**
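
As an illustration of the "statistical assertions match the spec" rule, here is a minimal sketch of measuring a P95 over real samples with warm-up instead of asserting on a single timing. The helper names `percentile` and `measureP95`, the nearest-rank method, and the default warm-up and sample counts are all assumptions; the standards text does not prescribe a particular percentile method.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it. p is an integer percent (e.g. 95).
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('percentile needs at least one sample')
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.ceil((p * sorted.length) / 100)
  return sorted[Math.max(rank, 1) - 1]
}

// Run `fn` a few times unmeasured (warm-up), then time `samples` runs and
// return the P95 in milliseconds. Counts are illustrative defaults.
async function measureP95(fn, { warmup = 5, samples = 50 } = {}) {
  for (let i = 0; i < warmup; i++) await fn()
  const timings = []
  for (let i = 0; i < samples; i++) {
    const start = performance.now()
    await fn()
    timings.push(performance.now() - start)
  }
  return percentile(timings, 95)
}

// Deterministic check of the percentile helper on the values 1..100.
const oneToHundred = Array.from({ length: 100 }, (_, i) => i + 1)
console.log(percentile(oneToHundred, 95))
```

In a test body the assertion would then be made against the measured value, for example `expect(await measureP95(() => callEndpoint())).toBeLessThan(200)`, where `callEndpoint` stands in for whatever request the spec covers.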
 ## Live Infrastructure Standards
 - **Live tests against running infrastructure** -- tests hit real systems. No mocking databases, APIs, pipelines, servers, or external services for happy-path verification.

package/pipeline/steps/critic/test-review.md CHANGED

@@ -21,6 +21,9 @@
 - **No test deletion** -- all qa-test-spec test cases must have corresponding tests. A missing test is a High finding.
 - **No hard waits** -- `sleep()`, `setTimeout()`, `page.waitForTimeout()` in tests is a High finding.
 - **No conditionals** -- `if` statements in test bodies is a High finding.
+- **Falsifiable assertions** -- a disjunctive/catch-all gate that is almost always true (e.g. `expect(a === 0 || b === 0 || transient).toBe(true)`) asserts nothing. A test that cannot fail when the behavior is wrong is a High finding. Require the real invariant.
+- **Artifact binding** -- a test that asserts against a hardcoded tag/name, or only passes because of an ambient/leftover artifact on the dev's machine (would fail on a clean/CI checkout), is a High finding. The artifact under test must be resolved dynamically and proven to exist before inspection.
+- **Statistical assertions match spec** -- when the spec states a percentile/sample count (NFR-PERF "P95 < Nms"), a single-shot threshold instead is a Med finding (the assertion does not match the spec).
 - **Explicit assertions** -- assertions hidden inside helper functions is a Med finding. Every test body must contain visible `expect`/`assert`.
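
The difference between a disjunctive catch-all gate and a real invariant is easiest to see by evaluating both against a state that is clearly broken. The sketch below is an assumption-laden illustration: the state fields (`containerCount`, `selectOneOk`, `corruptionLogLines`, `fallback`) are hypothetical names loosely modeled on the example invariant quoted in the standards.

```javascript
// Unfalsifiable: any one branch being true makes the whole gate pass,
// and the catch-all `fallback` is true in practically every run.
const disjunctiveGate = (s) => s.containerCount === 1 || s.selectOneOk || s.fallback

// Falsifiable: every part of the real invariant must hold at once.
const realInvariant = (s) =>
  s.containerCount === 1 && s.selectOneOk && s.corruptionLogLines === 0

// Hypothetical observed states.
const healthy = { containerCount: 1, selectOneOk: true, corruptionLogLines: 0, fallback: true }
const broken = { containerCount: 2, selectOneOk: false, corruptionLogLines: 3, fallback: true }

console.log(disjunctiveGate(healthy), realInvariant(healthy))
console.log(disjunctiveGate(broken), realInvariant(broken))
```

On the broken state the disjunctive gate still evaluates to `true`, which is exactly why a reviewer would treat it as asserting nothing. In real test code each conjunct would normally be its own visible `expect`, so a failure names the part of the invariant that broke.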
 ## Output

package/pipeline/steps/qa-a/write-spec.md CHANGED

@@ -31,6 +31,17 @@ Structure per AC:
 - **Edge cases** (P0: required, P1: key boundaries, P2: optional, P3: omit)
 - **Concurrency** (P0: required, P1: if applicable, P2-P3: omit)
+### The Given-When-Then cases are human-readable acceptance scenarios — QA-B executes them
+Your Given-When-Then cases are not just developer test stubs; they are the **human-readable acceptance contract** for the story, written in plain user/consumer language (what a person does and observes), independent of implementation. They describe behavior the way a real developer or QA would verify it by hand.
+For **interactive surfaces this is mandatory and first-class** — it is how the pipeline tests like a real-world human developer, not just by trusting unit tests:
+- **`api` profile →** realize each acceptance scenario as a row in the **Live Smoke Tests** table (see `api.md`): real HTTP method/URL/body → expected status + response, plus a verification request for every mutation. QA-B starts the real server and drives these with real requests.
+- **`ui` profile →** realize each user-facing scenario as a row in the **UI Integration Smoke Tests** table (see `ui.md`): user action → real API call → expected UI state → DB verification, backed by a real-browser E2E test. QA-B runs these against the live UI + API + DB (and PMCP validates the visual checkpoints).
+These acceptance scenarios are **owned and executed by QA-B against live infrastructure** — they are the real-world, black-box evidence JUDGE relies on, distinct from (and never replaced by) the developer's own unit tests. Every P0/P1 AC on an interactive surface MUST have one. The Step 9b quality bar below governs the *automated test code*; it complements these acceptance scenarios — it does not replace them.
 ## Step 4: Database State Verification
 Per-risk DB verification:
@@ -72,6 +83,15 @@ For each NFR-sensitive path: `[NFR-PERF]` response time + load patterns; `[NFR-S
 - If `ui` in `{testing_profiles}` → **MANDATORY.** Read `uxa-spec.md`. If `uxa-spec.md` is missing, send `[BLOCKER]` to Lead — do NOT proceed without it. For each page state define: Checkpoint ID (VV-{NNN}), Page/Route, State (Default/Loading/Empty/Error/Success or custom), AC Reference, Area labels in scope, Screenshot filename (`{story_id}_VV-{NNN}_{page}_{state}.png`), Expected visual elements, Setup instructions, Pass criteria. Write to `{story_output_dir}/visual-validation-checklist.md`.
 - If `ui` NOT in `{testing_profiles}` → skip, note "N/A — no UI profile."
+## Step 9b: Test Quality Bar — make every case implementable AND falsifiable
+The dev agents implement exactly what you specify, and CRITIC rejects tests that are weak, unfalsifiable, or bound to the wrong artifact. Spec the bar IN so it is built right the first time (each rule below traces to a real rework cycle):
+- **Every P0 case is a named, must-implement test** — never leave a P0 (or an idempotency/concurrency requirement for stateful resources) as a prose note. If it matters, give it its own case ID and assertion target so the dev agent implements it directly. For stateful resources (volumes, DBs), enumerate an explicit idempotency case (apply twice → assert no-recreate / no re-init / data intact).
+- **Falsifiable assertion target** — state the real invariant and the exact expected value (status code, row/column value, count, error code + message regex). Forbid disjunctive/catch-all expectations: never spec "passes if A or B or fallback." The assertion must be able to fail when the behavior is wrong.
+- **Bind to the real artifact** — when a case inspects a built artifact (image, container, file), specify that the test resolves it dynamically (the compose-built image id, the actual service/container name) and proves it exists before inspecting it — never a hardcoded tag. Note it must pass on a clean/CI machine with no leftover state.
+- **NFR-PERF wording is statistical** — specify the percentile AND sample count + warm-up (e.g. "P95 < 200ms over 20 sequential samples after one warm-up request"), not a single-shot threshold.
 ## Step 10: Write Final Outputs
 - Write to `{story_output_dir}/qa-test-spec.md` and `{story_output_dir}/visual-validation-checklist.md` (if applicable)
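
The statistical NFR-PERF wording in Step 9b ("P95 < 200ms over 20 sequential samples after one warm-up request") could be implemented along these lines. This is a sketch under assumptions, not the package's own helper: `percentile` and `sampleLatenciesMs` are hypothetical names, the nearest-rank definition of a percentile is one common choice among several, and `Date.now()` gives only millisecond resolution.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of all
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("percentile of an empty sample set");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// Issues `warmUp` unmeasured requests, then `count` sequential timed requests,
// and returns the measured latencies in milliseconds.
async function sampleLatenciesMs(
  request: () => Promise<unknown>,
  count: number,
  warmUp: number = 1,
): Promise<number[]> {
  for (let i = 0; i < warmUp; i++) {
    await request();
  }
  const latencies: number[] = [];
  for (let i = 0; i < count; i++) {
    const start = Date.now();
    await request();
    latencies.push(Date.now() - start);
  }
  return latencies;
}

// Usage inside a test body (`callEndpoint` is a hypothetical request function):
//   const latencies = await sampleLatenciesMs(callEndpoint, 20, 1);
//   expect(percentile(latencies, 95)).toBeLessThan(200);
```

With 20 samples, the nearest-rank P95 is the 19th-smallest value, so a single slow outlier does not fail the assertion, which is the difference from a single-shot threshold.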

package/pipeline/steps/retrospective/directives.md CHANGED

@@ -12,13 +12,18 @@ If pattern touches an invariant, do NOT write a CD. Instead:
 For patterns NOT conflicting with invariants:
+**Scope the directive to the failure mode, not the agent that tripped it.** Before setting `target_agent`, ask: *could another agent hit this same pattern?* Many failure modes are cross-cutting (test quality, such as missing P0 cases, conditional or unfalsifiable assertions, artifact binding, and hard waits; handoff-format violations; infra-mocking) and apply to **every** developer agent, not just the one that happened to surface it this sprint. For those:
+- Set `target_agent: ALL-DEV` (BEND, FEND, DATA, MCP-DEV, LIBDEV, DOCGEN, IAC, MOBILE) so the lesson generalizes; do NOT pin a cross-cutting test-quality directive to the single agent that tripped it.
+- Prefer reinforcing the shared contract: if the pattern is a standing rule, the durable fix is `.valent-pipeline/steps/common/quality-standards.md` (the dev self-check) or the QA-A spec / CRITIC checklist — note that in the directive's `reason`. A per-agent directive that restates a shared standard is noise.
+- Only use a single `target_agent` when the pattern is genuinely specific to that agent's surface (e.g. an IAC-only compose idiom, a FEND-only component pattern).
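
One way a consumer of these directives might interpret `ALL-DEV` is sketched below. The agent list is taken from the bullet above; the `Directive` shape is trimmed to three fields, and `appliesTo` is a hypothetical helper, so how the pipeline itself resolves the target is an assumption.

```typescript
// Developer agents covered by ALL-DEV, as listed in the directive guidance.
const ALL_DEV: readonly string[] = [
  "BEND", "FEND", "DATA", "MCP-DEV", "LIBDEV", "DOCGEN", "IAC", "MOBILE",
];

// Trimmed, hypothetical view of a correction directive.
interface Directive {
  id: string;
  target_agent: string; // a single agent name or "ALL-DEV"
  directive: string;
}

// True when `agent` should receive the directive: either it is named
// directly, or the directive is cross-cutting and `agent` is a dev agent.
function appliesTo(directive: Directive, agent: string): boolean {
  if (directive.target_agent === "ALL-DEV") {
    return ALL_DEV.includes(agent);
  }
  return directive.target_agent === agent;
}
```

Under this reading, a cross-cutting test-quality directive reaches every developer agent, while non-developer agents such as CRITIC are unaffected.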
 **ADD** new directives:
 ```yaml
 - id: CD-{batch_number}-{seq}
   status: active
-  target_agent: {agent}
+  target_agent: {agent | ALL-DEV}
   directive: "{what to do differently}"
-  reason: "{evidence from batch}"
+  reason: "{evidence from batch; if it restates a shared standard, name the file it belongs in}"
   impact_level: low | medium | high
   created_batch: {batch_number}
   last_reinforced_batch: {batch_number}

package/skills/valent-run-project-workflow/SKILL.md CHANGED

@@ -96,11 +96,13 @@ Workflow({
 })
 ```
-Runs each story sequentially through the per-story pipeline on a shared branch, rolling over any story JUDGE rejects or that trips the cap. Returns `{ shipped, stories_shipped, stories_rolled_over, results: [...] }`. Record its `runId`.
+Runs each story sequentially through the per-story pipeline on a shared branch, rolling over any story JUDGE rejects or that trips the cap. After the batch, a **sprint-end integration gate** runs a single cross-story seam review, but ONLY when more than one story shipped and at least two of them touched the same file (otherwise it is skipped). Returns `{ shipped, stories_shipped, stories_rolled_over, results: [...], integration }`, where `integration` is `null` (gate not warranted) or `{ verdict, findings, overlapFiles }`. Record its `runId`.
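
The condition for running the sprint-end integration gate can be sketched as an overlap check over the shipped stories. This is an illustration only: the per-story `StoryResult` shape (including the `files` field) and both function names are hypothetical, since the diff does not show what each entry of `results` contains or how `sprint.workflow.js` computes `overlapFiles`.

```typescript
// Hypothetical per-story result; the real `results` entries may differ.
interface StoryResult {
  story_id: string;
  shipped: boolean;
  files: string[]; // files the story touched (assumed field)
}

// Returns the files touched by at least two shipped stories, sorted by name.
function overlapFiles(results: StoryResult[]): string[] {
  const touchedBy = new Map<string, Set<string>>();
  for (const result of results) {
    if (!result.shipped) continue; // rolled-over stories do not count
    for (const file of result.files) {
      let stories = touchedBy.get(file);
      if (stories === undefined) {
        stories = new Set<string>();
        touchedBy.set(file, stories);
      }
      stories.add(result.story_id);
    }
  }
  const overlaps: string[] = [];
  touchedBy.forEach((stories, file) => {
    if (stories.size >= 2) overlaps.push(file);
  });
  return overlaps.sort();
}

// The gate is warranted only if some file is shared by two shipped stories,
// which also implies that more than one story shipped.
function integrationGateWarranted(results: StoryResult[]): boolean {
  return overlapFiles(results).length > 0;
}
```

When the check returns `false`, the sprint result would carry `integration: null`, matching the "not warranted" case described above.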
 #### 4e. Update progress + re-resolve dependencies
 For each shipped story: move it to `stories_completed` with a compact one-line outcome tagged with its epic; update `total_completed` / `last_updated`; **FIFO at 80 lines** (evict oldest). Read and follow `.valent-pipeline/steps/orchestration/update-backlog-status.md`. Then **re-check whether any previously blocked stories are now unblocked** (their `depends_on` / `blocked_by_bugs` may have been resolved by what just shipped) — these become eligible for the next sprint's candidate list.
+If the sprint result's `integration.findings` is non-empty, file each as a `bug` backlog item (per `update-backlog-status.md`'s conditional-ship bug format) against the affected stories' epic, so the cross-story seam issue is tracked and prioritized like any other bug.
 #### 4f. Retrospective (mandatory, blocking)
 Invoke `retro.workflow.js`:
@@ -111,7 +113,7 @@ Workflow({
 })
 ```
-Record its `runId`. Retro runs every sprint and tightens future sizing.
+Record its `runId`. Retro runs every sprint and tightens future sizing. It **learns from the sprint's own artifacts** — calibration (CLI), one synthesis pass that mines CRITIC/JUDGE/QA/cost data into correction directives, then embed — and is bounded and cheap (no fresh code review; that is CRITIC/JUDGE's job during the sprint, and cross-story seams are covered by the sprint-end integration gate in 4d).
 #### 4g. Continue
 Increment `{n}` and return to Step 4a. **Do NOT ask permission to continue between sprints** — the project loop is autonomous.