@exaudeus/workrail 3.74.3 → 3.76.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dist/console-ui/assets/index-DFZjlsUM.js +28 -0
- package/dist/console-ui/index.html +1 -1
- package/dist/coordinators/adaptive-pipeline.d.ts +8 -0
- package/dist/coordinators/context-assembly.d.ts +4 -0
- package/dist/coordinators/context-assembly.js +156 -0
- package/dist/coordinators/modes/full-pipeline.d.ts +1 -1
- package/dist/coordinators/modes/full-pipeline.js +140 -27
- package/dist/coordinators/modes/implement-shared.d.ts +3 -2
- package/dist/coordinators/modes/implement-shared.js +16 -6
- package/dist/coordinators/modes/implement.js +49 -3
- package/dist/coordinators/pipeline-run-context.d.ts +1811 -0
- package/dist/coordinators/pipeline-run-context.js +114 -0
- package/dist/infrastructure/storage/schema-validating-workflow-storage.js +25 -2
- package/dist/manifest.json +54 -30
- package/dist/trigger/coordinator-deps.js +131 -0
- package/dist/v2/durable-core/domain/artifact-contract-validator.js +99 -0
- package/dist/v2/durable-core/schemas/artifacts/discovery-handoff.d.ts +39 -0
- package/dist/v2/durable-core/schemas/artifacts/discovery-handoff.js +10 -1
- package/dist/v2/durable-core/schemas/artifacts/index.d.ts +2 -1
- package/dist/v2/durable-core/schemas/artifacts/index.js +12 -1
- package/dist/v2/durable-core/schemas/artifacts/phase-handoff.d.ts +89 -0
- package/dist/v2/durable-core/schemas/artifacts/phase-handoff.js +56 -0
- package/docs/authoring-v2.md +12 -0
- package/docs/ideas/backlog.md +409 -1
- package/package.json +1 -1
- package/workflows/coding-task-workflow-agentic.json +9 -6
- package/workflows/mr-review-workflow.agentic.v2.json +2 -2
- package/workflows/routines/tension-driven-design.json +12 -12
- package/workflows/workflow-for-workflows.json +5 -11
- package/workflows/wr.discovery.json +20 -17
- package/workflows/wr.shaping.json +7 -4
- package/dist/console-ui/assets/index-ByqIsoyt.js +0 -28
package/docs/ideas/backlog.md
CHANGED
|
@@ -192,6 +192,98 @@ The delivery pipeline was extracted into `delivery-pipeline.ts` with explicit st
|
|
|
192
192
|
|
|
193
193
|
## WorkTrain Daemon
|
|
194
194
|
|
|
195
|
+
### Phase quality gate policy: partial vs escalate (May 5, 2026)
|
|
196
|
+
|
|
197
|
+
**Status: idea** | Priority: medium
|
|
198
|
+
|
|
199
|
+
**Score: 9** | Cor:2 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no
|
|
200
|
+
|
|
201
|
+
The current phase quality gate policy (implemented in living work context, PR #939) is: `fallback` → escalate, `partial` → proceed with warning, `full` → proceed normally. The `partial` path is a deliberate judgment call that favors progress over quality: the agent ran for 25-65 minutes and produced partial output, and retrying might also produce partial output.
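A minimal sketch of that routing, assuming a `PhaseQuality` union and illustrative `proceed`/`escalate` outcomes (the real coordinator API may differ):

```typescript
// Hypothetical sketch of the quality-gate routing described above.
// `PhaseQuality`, `routePhaseResult`, and the outcome strings are
// illustrative names, not the actual coordinator API.
type PhaseQuality = 'full' | 'partial' | 'fallback';

function routePhaseResult(quality: PhaseQuality, summary: string): 'proceed' | 'escalate' {
  switch (quality) {
    case 'fallback':
      return 'escalate'; // fallback always escalates
    case 'partial':
      console.warn(`proceeding with partial context: ${summary}`);
      return 'proceed'; // deliberate: progress over quality
    case 'full':
      return 'proceed';
  }
}
```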
|
|
202
|
+
|
|
203
|
+
The open question: should `partial` also escalate, or is "proceed with warning" the right default? This requires observability data to answer. If `partial` phases regularly produce wrong downstream output (review catches issues caused by missing upstream context, fix loops triggered by context gaps), the policy should shift to escalate-on-partial. If `partial` phases produce acceptable output, the current policy is correct.
|
|
204
|
+
|
|
205
|
+
**Things to hash out:**
|
|
206
|
+
- What metric determines whether `partial` downstream output is "wrong enough" to justify policy change? Review findings that cite missing upstream context? Fix loop iteration count?
|
|
207
|
+
- Should the policy be configurable per-trigger (some pipelines tolerate partial, others don't)?
|
|
208
|
+
- Should the `partial` warning in `assembledContextSummary` be structured enough that the downstream agent can flag "I was working with incomplete context" in its handoff artifact, making the degradation chain traceable?
|
|
209
|
+
- Is there a smarter policy -- e.g. retry the prior phase once before escalating?
|
|
210
|
+
|
|
211
|
+
**Note:** This is not a correctness problem with the current implementation. `fallback` correctly escalates. `partial` correctly proceeds with an explicit warning. The question is whether the `partial` threshold is in the right place. Revisit after observing real pipeline runs.
|
|
212
|
+
|
|
213
|
+
---
|
|
214
|
+
|
|
215
|
+
### Lifecycle integration tests: assert each workflow emits expected handoff artifact (May 5, 2026)
|
|
216
|
+
|
|
217
|
+
**Status: idea** | Priority: medium
|
|
218
|
+
|
|
219
|
+
**Score: 8** | Cor:2 Cap:1 Eff:2 Lev:2 Con:3 | Blocked: no
|
|
220
|
+
|
|
221
|
+
Issue #934 (living work context) required "lifecycle integration tests asserting each of the 4 workflows (wr.shaping, wr.coding-task, wr.discovery, mr-review) emits expected artifact at final step." PR #939 shipped the adversarial behavioral test (proves the chain works end-to-end) and the `contractRef` validation test (proves no unregistered refs ship), but did not ship per-workflow lifecycle harness tests that run each workflow through compilation + stepping and assert the final step emits the correct artifact kind.
|
|
222
|
+
|
|
223
|
+
Without these tests, a workflow prompt change that removes or breaks the artifact emission instruction would pass all existing tests (the smoke test only checks compilation, not artifact emission) and would only be caught by a real pipeline run.
|
|
224
|
+
|
|
225
|
+
**Done looks like:** `tests/lifecycle/` tests that run `wr.shaping`, `wr.coding-task`, `wr.discovery`, and `mr-review-workflow` through the lifecycle harness and assert the final step produces an artifact with the expected `kind` field.
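A hypothetical shape for one of these tests, assuming a `runWorkflowThroughLifecycle` harness entry point and illustrative artifact `kind` strings (neither is the real API):

```typescript
// Hypothetical vitest sketch of the missing lifecycle tests. The harness
// entry point and the `kind` strings are assumptions that illustrate the
// assertion shape only.
import { describe, expect, it } from 'vitest';

declare function runWorkflowThroughLifecycle(
  workflowId: string,
): Promise<{ finalStep: { artifacts: Array<{ kind: string }> } }>;

const cases = [
  { workflow: 'wr.shaping', kind: 'wr.shaping_handoff' },
  { workflow: 'wr.coding-task', kind: 'wr.coding_handoff' },
  { workflow: 'wr.discovery', kind: 'wr.discovery_handoff' },
  { workflow: 'mr-review-workflow', kind: 'wr.review_verdict' },
];

describe('lifecycle: final step emits the expected handoff artifact', () => {
  for (const { workflow, kind } of cases) {
    it(`${workflow} emits ${kind}`, async () => {
      const run = await runWorkflowThroughLifecycle(workflow);
      expect(run.finalStep.artifacts.map((a) => a.kind)).toContain(kind);
    });
  }
});
```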
|
|
226
|
+
|
|
227
|
+
---
|
|
228
|
+
|
|
229
|
+
### Slack/Teams/chat integration for pipeline completion alerts (May 4, 2026)
|
|
230
|
+
|
|
231
|
+
**Status: idea** | Priority: medium
|
|
232
|
+
|
|
233
|
+
**Score: 9** | Cor:1 Cap:3 Eff:2 Lev:1 Con:2 | Blocked: no
|
|
234
|
+
|
|
235
|
+
When WorkTrain completes a pipeline run -- whether it produced a PR, escalated, timed out, or failed -- the operator currently has no push notification. They have to poll the console or check their email. For overnight-safe autonomous operation (the vision's stated success condition), the operator needs to know when work is ready for their attention without having to check. Beyond the individual operator, the team that will review the PR also needs to know it exists and is ready. Neither is addressed today.
|
|
236
|
+
|
|
237
|
+
The use case has two layers: (1) operator-facing -- "your pipeline finished, here's the PR URL and outcome summary," sent to the operator's Slack/Teams DM or a dedicated channel; (2) team-facing -- "a PR is ready for review," sent to the team's review channel with enough context for a reviewer to triage without navigating to GitHub.
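If Slack turns out to be just a specialized callback, the operator-facing layer could be as small as this sketch -- the payload fields are assumptions, not a settled schema:

```typescript
// Sketch: pipeline completion posted to a Slack incoming webhook.
// `PipelineOutcome` and its fields are illustrative assumptions.
interface PipelineOutcome {
  runId: string;
  status: 'pr_opened' | 'escalated' | 'timed_out' | 'failed';
  prUrl?: string;
  summary: string;
}

async function notifySlack(webhookUrl: string, o: PipelineOutcome): Promise<void> {
  const text = o.prUrl
    ? `Pipeline ${o.runId}: ${o.status} -- ${o.prUrl}\n${o.summary}`
    : `Pipeline ${o.runId}: ${o.status}\n${o.summary}`;
  await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text }), // standard incoming-webhook payload
  });
}
```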
|
|
238
|
+
|
|
239
|
+
**Things to hash out:**
|
|
240
|
+
- Is this a WorkTrain daemon concern (coordinator sends notification after pipeline completion) or a trigger-layer concern (configured alongside the trigger)? The `callbackUrl` mechanism already exists for HTTP POST on completion -- is Slack/Teams just a specialized callback, or does it need first-class support?
|
|
241
|
+
- What is the configuration model? Per-trigger (`notifyOnComplete: { slack: { channel: "#pr-reviews", token: "$SLACK_TOKEN" } }`) or workspace-level (`~/.workrail/config.json`)?
|
|
242
|
+
- How does the team-facing notification avoid becoming noise? If WorkTrain opens 10 PRs in a day, each triggering a Slack message, the channel becomes unusable. Is there a batching, threading, or filtering mechanism?
|
|
243
|
+
- What is the authentication and secret management story? Same `$ENV_VAR_NAME` resolution as trigger HMAC secrets, or a separate credentials store?
|
|
244
|
+
|
|
245
|
+
**See also:** Daemon working hours / dispatch scheduling (below) -- notifications sent outside working hours are noise.
|
|
246
|
+
|
|
247
|
+
---
|
|
248
|
+
|
|
249
|
+
### Daemon working hours and dispatch scheduling (May 4, 2026)
|
|
250
|
+
|
|
251
|
+
**Status: idea** | Priority: medium
|
|
252
|
+
|
|
253
|
+
**Score: 9** | Cor:1 Cap:3 Eff:2 Lev:2 Con:2 | Blocked: no
|
|
254
|
+
|
|
255
|
+
WorkTrain is designed for overnight-safe autonomous operation, but "overnight-safe" currently means the daemon keeps working through the night without human oversight -- not that it respects the operator's or team's working hours. A PR opened at 2am sits unreviewed until morning. Slack/Teams notifications at 3am are noise. Triggers that fire from monitoring alerts at midnight might not be appropriate to dispatch.
|
|
256
|
+
|
|
257
|
+
There is no current mechanism to configure when the daemon dispatches new sessions, when it sends notifications, or when it holds work for the next business day.
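A sketch of what such a gate could look like, assuming the per-workspace config shape from the bullet below (field names are hypothetical):

```typescript
// Sketch of a working-hours gate. Config fields mirror the hypothetical
// per-workspace shape discussed below; timezone handling via Intl.
interface WorkingHours {
  timezone: string; // e.g. "America/New_York"
  days: string[];   // e.g. ["Mon","Tue","Wed","Thu","Fri"]
  start: string;    // "09:00"
  end: string;      // "18:00"
}

function isWithinWorkingHours(cfg: WorkingHours, now = new Date()): boolean {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone: cfg.timezone,
    weekday: 'short',
    hour: '2-digit',
    minute: '2-digit',
    hour12: false,
  }).formatToParts(now);
  const get = (t: string) => parts.find((p) => p.type === t)?.value ?? '';
  const day = get('weekday');                     // "Mon" ... "Sun"
  const hhmm = `${get('hour')}:${get('minute')}`; // zero-padded, compares lexically
  return cfg.days.includes(day) && hhmm >= cfg.start && hhmm < cfg.end;
}
```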
|
|
258
|
+
|
|
259
|
+
**Things to hash out:**
|
|
260
|
+
- What is the scope? Working hours could affect: (a) trigger dispatch (hold incoming triggers until working hours), (b) notifications (send alerts only during working hours), (c) both. These may need separate configuration.
|
|
261
|
+
- What is the configuration model? Per-workspace (`~/.workrail/config.json: { workingHours: { timezone: "America/New_York", days: ["Mon", "Tue", "Wed", "Thu", "Fri"], start: "09:00", end: "18:00" } }`) or per-trigger (some triggers are critical and should dispatch any time)?
|
|
262
|
+
- How does "critical" work? An on-call incident trigger probably should not be gated by working hours. What is the mechanism for a trigger to opt out? A `priority: critical` flag, or explicit `ignoreWorkingHours: true`?
|
|
263
|
+
- What happens to triggers that fire outside working hours? Queue and dispatch at next working-hours start, discard, or dispatch anyway but suppress notifications?
|
|
264
|
+
- How does this interact with multi-timezone teams?
|
|
265
|
+
|
|
266
|
+
**See also:** Slack/Teams notification integration (above) -- the two features are designed to be used together.
|
|
267
|
+
|
|
268
|
+
---
|
|
269
|
+
|
|
270
|
+
### Assumption resolution before acting: agents should fill information gaps with available tools (May 4, 2026)
|
|
271
|
+
|
|
272
|
+
**Status: idea** | Priority: high
|
|
273
|
+
|
|
274
|
+
**Score: 11** | Cor:3 Cap:3 Eff:2 Lev:2 Con:1 | Blocked: no
|
|
275
|
+
|
|
276
|
+
Pipeline agents currently have two options when they hit an information gap: proceed with an explicit assumption, or get stuck. Neither is optimal. The coding agent might assume a function signature, proceed with the wrong implementation, and only discover the error in review. Any phase agent might miss context that was resolvable with a two-second tool call (gh, glab, jira, glean, codebase search, MCP tools). There is currently no structured mechanism in the workflow engine or in individual workflows that asks agents to explicitly audit their open assumptions and use available tools to close them before committing to an approach.
|
|
277
|
+
|
|
278
|
+
**Things to hash out:**
|
|
279
|
+
- Is this a workflow-level concern (each workflow author decides when and where to add assumption resolution) or an engine-level concern (the engine injects it automatically)?
|
|
280
|
+
- Is the right mechanism a routine (injected via `templateCall`, creating a visible dedicated step with notes output), a feature (engine-injected constraint on every step), or both?
|
|
281
|
+
- Should assumption resolution happen once per workflow (front-loaded as the first step) or opportunistically (at any step where the agent identifies a gap)?
|
|
282
|
+
- What tools should the agent be expected to use? The set varies by workspace (some have Jira, some have GitLab, some have Glean). A generic routine can only say "use whatever tools are available" -- is that specific enough to be useful?
|
|
283
|
+
- How does this interact with the task-scoped rules idea and the ephemeral per-turn injection idea? All three are trying to get the right context to the agent at the right time.
|
|
284
|
+
|
|
285
|
+
---
|
|
286
|
+
|
|
195
287
|
### Intent gap: agent builds what it understood, not what the user meant (Apr 30, 2026)
|
|
196
288
|
|
|
197
289
|
**Status: idea** | Priority: medium
|
|
@@ -223,6 +315,239 @@ This is categorically different from bugs (the agent implemented the right thing
|
|
|
223
315
|
|
|
224
316
|
---
|
|
225
317
|
|
|
318
|
+
### Intent resolution: tiered context harvest to close the intent gap before coding starts (May 4, 2026)
|
|
319
|
+
|
|
320
|
+
**Status: designing -- not ready for shaping or implementation** | Priority: high
|
|
321
|
+
|
|
322
|
+
**Score: 13** | Cor:3 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no
|
|
323
|
+
|
|
324
|
+
The intent gap entry names the failure mode. This entry is the resolution design. The root cause is not that agents misread tickets -- it is that agents form interpretations without access to the context that would resolve ambiguity. The fix is a structured, tiered context harvest during discovery, a mid-discovery interpretation checkpoint, and a configurable escalation ladder when ambiguity survives the harvest.
|
|
325
|
+
|
|
326
|
+
**The core insight:** a ticket description is almost never the most authoritative source of intent. The epic it belongs to, the design doc it references, the Slack thread where the feature was scoped, the vision doc that defines what the project is trying to become -- these carry far more signal. An agent that only reads the ticket is working with the thinnest slice of available context. Importantly, none of these sources need to live in the codebase -- they can be in Confluence, Notion, Slack, Google Docs, or a GitHub wiki. The tool layer is the access mechanism regardless of where the content lives.
|
|
327
|
+
|
|
328
|
+
**Two distinct failure subtypes that require different responses:**
|
|
329
|
+
|
|
330
|
+
Research (AmbiEval 2026, Orchid 2026, AskBench 2026) distinguishes two failure modes:
|
|
331
|
+
- **Subtype A -- vagueness/ambiguity:** the ticket is underspecified or has multiple valid interpretations. "Delete the record" -- soft-delete or hard-delete? The tiered harvest + council addresses this.
|
|
332
|
+
- **Subtype B -- wrong prior:** the ticket is clear, but the agent has a systematically wrong prior about what tickets like this mean in this codebase. "Fix the auth issue" -- agent knows what auth issues usually mean, but this codebase does it differently. No amount of context harvest resolves this; it requires challenging the agent's assumptions explicitly.
|
|
333
|
+
|
|
334
|
+
**Critical decision before building: measure which subtype dominates your actual failure distribution.** Retrospectively classify 10-20 past wrong-implementation cases as Subtype A vs B. If Subtype B is significant, the council and detection scaffold are insufficient -- Subtype B requires assumption-logging and adversarial plan review, not just ambiguity detection. The entire design below addresses Subtype A well and Subtype B only partially.
|
|
335
|
+
|
|
336
|
+
**Tiered context harvest:**
|
|
337
|
+
|
|
338
|
+
Tier 0 -- project identity (always injected, not searched):
|
|
339
|
+
- Vision doc, active backlog items, design locks/ADRs for the affected area, coding philosophy
|
|
340
|
+
- Not searched for relevance -- injected unconditionally because they constrain every interpretation
|
|
341
|
+
- Source locations are workspace-configured and can be anywhere: local files (`docs/vision.md`), Confluence, Notion, Google Docs, GitHub wiki -- resolved via the same tool layer as other tiers
|
|
342
|
+
- `ContextLoader` resolves Tier 0 sources before session start using whatever tools the workspace has configured
|
|
343
|
+
- If Tier 0 is empty (no project identity configured), the minimum `ambiguityLevel` floor is `'uncertain'` regardless of agent self-report
|
|
344
|
+
|
|
345
|
+
Tier 1 -- structured task sources (highest signal, deterministic):
|
|
346
|
+
- Jira/Linear: linked epics, acceptance criteria, parent ticket, comments, attachments
|
|
347
|
+
- GitHub/GitLab: linked PRs, prior implementations of the same feature, commit history on affected files, related issues
|
|
348
|
+
- The ticket's own epic/milestone context -- a vague ticket is often disambiguated by the epic it belongs to
|
|
349
|
+
|
|
350
|
+
Tier 2 -- conversational sources (high signal, noisier):
|
|
351
|
+
- Slack: the thread where the ticket was discussed, the channel where the feature was scoped, off-ticket decisions
|
|
352
|
+
- Notion/Confluence/Google Docs: design docs linked in the ticket or epic, ADRs for the affected area
|
|
353
|
+
- Hard retrieval budget: top 2-3 most relevant sources, 4K token cap total. Beyond budget, sources logged as "available but not injected" and added to `unresolvedAssumptions[]`
|
|
354
|
+
- Conflict resolution: when a lower-tier source contradicts a higher-tier source, the higher tier wins and the contradiction surfaces in `unresolvedAssumptions[]`. Priority: Tier 0 ADRs > Tier 1 acceptance criteria > Tier 1 linked epic > Tier 2 design docs > Tier 2 Slack threads
|
|
355
|
+
|
|
356
|
+
Tier 3 -- codebase itself:
|
|
357
|
+
- How similar features were implemented previously
|
|
358
|
+
- Naming conventions, existing abstractions, patterns that constrain valid interpretations
|
|
359
|
+
- Tests that describe current behavior of the affected area
|
|
360
|
+
|
|
361
|
+
**"Enough context" checklist (harvest stops when satisfied, not when budget is full):**
|
|
362
|
+
1. Tier 0 was injected or confirmed unavailable
|
|
363
|
+
2. Tier 1 structured sources were queried (epic, acceptance criteria, linked issues)
|
|
364
|
+
3. If Tier 1 returned ambiguity-relevant signal, Tier 2 search attempted for the most specific query
|
|
365
|
+
4. Agent can articulate at least one rival interpretation with evidence
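The checklist reduces to a small stop condition -- a sketch, assuming boolean signals collected during the harvest (names are illustrative):

```typescript
// Sketch of the "enough context" stop condition above.
interface HarvestState {
  tier0Resolved: boolean;        // injected or confirmed unavailable
  tier1Queried: boolean;         // epic, acceptance criteria, linked issues
  tier1AmbiguitySignal: boolean; // Tier 1 surfaced ambiguity-relevant signal
  tier2Attempted: boolean;       // most specific Tier 2 query was run
  rivalInterpretationWithEvidence: boolean;
}

function harvestSatisfied(s: HarvestState): boolean {
  return (
    s.tier0Resolved &&
    s.tier1Queried &&
    (!s.tier1AmbiguitySignal || s.tier2Attempted) &&
    s.rivalInterpretationWithEvidence
  );
}
```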
|
|
366
|
+
|
|
367
|
+
**Ticket quality pre-flight (lightweight, independent of the full harvest):**
|
|
368
|
+
Before dispatch, run 5 INVEST-based quality checks on the ticket: unambiguous, testable, non-compound, has acceptance criteria, scoped. USeR (arxiv 2503.02049) provides 34 automated RE quality metrics; these 5 are the highest-signal subset. Deployable independently of the full detection scaffold. A ticket that fails multiple quality checks is routed to Subtype A treatment immediately without spending turns on harvest.
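A sketch of the pre-flight as cheap heuristics -- the regexes are stand-ins, and a real implementation would be considerably more careful:

```typescript
// Illustrative pre-flight: the 5 INVEST-derived checks as crude heuristics.
// All patterns here are assumptions for the sketch, not validated detectors.
interface TicketQuality { check: string; passed: boolean }

function preflight(ticket: string): TicketQuality[] {
  const vague = /\b(should|may|might|fast|large|improve|better)\b/i;
  return [
    { check: 'unambiguous', passed: !vague.test(ticket) },
    { check: 'testable', passed: /\b(expect|assert|verify|accept)/i.test(ticket) },
    { check: 'non-compound', passed: !/\band\b.*\band\b/i.test(ticket) },
    { check: 'has acceptance criteria', passed: /acceptance criteria/i.test(ticket) },
    { check: 'scoped', passed: ticket.length < 2000 },
  ];
}

// A ticket failing multiple checks routes to Subtype A treatment immediately.
const failedCount = (t: string) => preflight(t).filter((q) => !q.passed).length;
```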
|
|
369
|
+
|
|
370
|
+
**Tool graceful degradation:** tool failure never blocks session start. When a configured source is unreachable, log the error, treat as empty, include in `unresolvedAssumptions[]`, and elevate `ambiguityLevel` accordingly. When a tool is not configured, skip silently.
|
|
371
|
+
|
|
372
|
+
**Mid-discovery interpretation checkpoint:**
|
|
373
|
+
|
|
374
|
+
Not pre-discovery (too low signal) and not post-discovery (too expensive to correct). The right spot is early in discovery after the agent has read the file structure, recent git history, relevant modules, and harvested Tier 0-1 context. Roughly turns 3-5.
|
|
375
|
+
|
|
376
|
+
The checkpoint first classifies task type, then produces the interpretation artifact:
|
|
377
|
+
|
|
378
|
+
`taskType: 'targeted_fix' | 'feature' | 'refactor' | 'architectural'`
|
|
379
|
+
- `targeted_fix`: well-scoped, additive, low ambiguity risk -- council can be skipped
|
|
380
|
+
- `feature`: new behavior, moderate ambiguity risk
|
|
381
|
+
- `refactor`: structural change, high ambiguity risk
|
|
382
|
+
- `architectural`: systemic change -- always requires council, minimum `ambiguityLevel` is `'uncertain'`
|
|
383
|
+
|
|
384
|
+
Interpretation artifact:
|
|
385
|
+
- `interpretation`: "I understand this task as X"
|
|
386
|
+
- `rivalInterpretations[]`: genuine alternative readings -- must be architecturally different, not minor variations. Use falsification forcing: "What is the single most important word or phrase that, if read differently, leads to a substantially different implementation? Describe both implementations."
|
|
387
|
+
- `unresolvedAssumptions[]`: what would have to be true for the primary interpretation to be wrong
|
|
388
|
+
- `ambiguityLevel: 'clear' | 'uncertain' | 'ambiguous'` -- self-reported, used as floor only
|
|
389
|
+
- `confidenceBreakdown`: `{ tier0Injected, tier1Complete, tier2Retrieved, rivalInterpretationStrength: 'weak' | 'plausible' | 'strong', unresolvedAssumptionCount, overallAmbiguityLevel }`
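The artifact written out as a type -- a sketch of the proposed shape, not a shipped schema:

```typescript
// Sketch of the interpretation artifact described above.
type TaskType = 'targeted_fix' | 'feature' | 'refactor' | 'architectural';
type AmbiguityLevel = 'clear' | 'uncertain' | 'ambiguous';

interface InterpretationArtifact {
  taskType: TaskType;
  interpretation: string;         // "I understand this task as X"
  rivalInterpretations: string[]; // architecturally different readings
  unresolvedAssumptions: string[]; // what would make the primary reading wrong
  ambiguityLevel: AmbiguityLevel;  // self-reported, used as floor only
  confidenceBreakdown: {
    tier0Injected: boolean;
    tier1Complete: boolean;
    tier2Retrieved: boolean;
    rivalInterpretationStrength: 'weak' | 'plausible' | 'strong';
    unresolvedAssumptionCount: number;
    overallAmbiguityLevel: AmbiguityLevel;
  };
}
```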
|
|
390
|
+
|
|
391
|
+
**Add: clarification question generation as an independent signal.** Ask: "What is the one question you would most want answered before implementing this?" A specific high-stakes question ("Does 'delete' mean soft-delete or hard-delete?") = ambiguous. Inability to generate a meaningful question = likely clear. The specificity and number of non-trivial questions generated form an independent ambiguity meter (KC et al. 2025).
|
|
392
|
+
|
|
393
|
+
**Critical: self-reported ambiguity is untrustworthy.** RLHF trains models to provide confident, forward-moving responses (Sharma et al. 2023). Use `max(introspective, structural)` as the effective level:
|
|
394
|
+
|
|
395
|
+
Structural pre-filter signals (fast, no LLM, computed before checkpoint):
|
|
396
|
+
- Presence of weak modals ("should", "may"), vague quantifiers ("fast", "large"), passive without agent, undefined pronouns, no acceptance criteria -- RE literature, 70-89% precision on formal requirements
|
|
397
|
+
- `taskType` is `'architectural'` or `'refactor'`
|
|
398
|
+
- Tier 0 is empty
|
|
399
|
+
- `unresolvedAssumptionCount > 2`
|
|
400
|
+
- Tier 1 returned empty
|
|
401
|
+
|
|
402
|
+
Semantic entropy sampling (behavioral, no self-report):
|
|
403
|
+
Sample the interpretation step 5-7 times at temperature ~0.8. Cluster semantically equivalent outputs. Compute Shannon entropy over clusters. High entropy = model is generating genuinely different interpretations, independent of self-report. Well-established (Wang et al. ICLR 2023, Kuhn et al. ICLR 2023 Spotlight). Cost: ~6-8x single inference, fully parallelizable.
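A sketch of the computation, assuming `sampleInterpretation` (one checkpoint run at temperature ~0.8) and a semantic-equivalence judge as primitives:

```typescript
// Sketch: Shannon entropy over clustered interpretation samples.
// `sampleInterpretation` and `sameMeaning` are assumed primitives.
async function semanticEntropy(
  sampleInterpretation: () => Promise<string>,
  sameMeaning: (a: string, b: string) => boolean,
  n = 6,
): Promise<number> {
  // Fully parallelizable: ~6-8x the cost of a single inference
  const samples = await Promise.all(Array.from({ length: n }, () => sampleInterpretation()));
  const clusters: string[][] = [];
  for (const s of samples) {
    const cluster = clusters.find((c) => sameMeaning(c[0], s));
    if (cluster) cluster.push(s);
    else clusters.push([s]);
  }
  // High entropy = genuinely different interpretations, independent of self-report
  return -clusters.reduce((h, c) => {
    const p = c.length / samples.length;
    return h + p * Math.log2(p);
  }, 0);
}
```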
|
|
404
|
+
|
|
405
|
+
**Escalation ladder (coordinator routes deterministically on effective ambiguity level):**
|
|
406
|
+
|
|
407
|
+
1. `'clear'` → proceed to full discovery automatically
|
|
408
|
+
2. `'uncertain'` → council of agents (see below). Re-evaluate on council output.
|
|
409
|
+
3. Still `'uncertain'` after council + `requireIntentConfirmation: 'uncertain'` on trigger → structured clarification request to operator. Structured options: "A / B / proceed with best judgment / abandon" + default-if-no-reply timeout (e.g. 4 hours → proceed with A). Delivered via configured channel (Slack > webhook > console outbox). Correction injected as `steer`; agent re-orients mid-discovery without restarting.
|
|
410
|
+
4. `'ambiguous'` + `requireIntentConfirmation: 'always'` → pause for human approval.
|
|
411
|
+
5. Genuinely unanswerable → escalate to outbox with full context packet.
|
|
412
|
+
|
|
413
|
+
`requireIntentConfirmation: 'never' | 'uncertain' | 'always'` per trigger, defaulting to `'uncertain'`. Global workspace default overridable per trigger.
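The deterministic routing, sketched for steps 1-4 (step 5, genuinely unanswerable, is a judgment the coordinator cannot compute); route names are illustrative:

```typescript
// Sketch of the escalation ladder as a pure routing function.
type ConfirmPolicy = 'never' | 'uncertain' | 'always';
type Route =
  | 'full_discovery'     // step 1
  | 'council'            // step 2
  | 'clarify_operator'   // step 3
  | 'pause_for_approval' // step 4
  | 'escalate_outbox';   // fallback when clarification is disabled

function routeOnAmbiguity(
  effective: 'clear' | 'uncertain' | 'ambiguous',
  policy: ConfirmPolicy,
  councilAlreadyRan: boolean,
): Route {
  if (effective === 'clear') return 'full_discovery';
  if (effective === 'ambiguous' && policy === 'always') return 'pause_for_approval';
  if (!councilAlreadyRan) return 'council';
  if (policy !== 'never') return 'clarify_operator'; // A / B / best judgment / abandon
  return 'escalate_outbox';
}
```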
|
|
414
|
+
|
|
415
|
+
**Vagueness vs. ambiguity routing:**
|
|
416
|
+
|
|
417
|
+
- **Vague ticket** (underspecified -- doesn't say enough): clarification request to operator. Only the operator can add missing information. Council will not help -- both challengers fill the same gap the same way.
|
|
418
|
+
- **Ambiguous ticket** (multiple valid interpretations): council of agents, then operator if unresolved.
|
|
419
|
+
|
|
420
|
+
The detection layer classifies which failure mode applies before routing.
|
|
421
|
+
|
|
422
|
+
**Council of agents -- cross-family comparison, not same-model debate:**
|
|
423
|
+
|
|
424
|
+
The council handles ambiguous tickets. Its purpose is detecting interpretation error, not resolving genuine ambiguity (that requires the operator).
|
|
425
|
+
|
|
426
|
+
**Critical research findings:**
|
|
427
|
+
- "When Two LLMs Debate" (2025, 10-model study): both agents escalate to ~83% stated confidence by round 3 regardless of correctness. Never use stated confidence from a council -- compare interpretation content only.
|
|
428
|
+
- "Persona Collapse / Chameleon's Limit" (2026): same-model instances with different personas converge to a narrow behavioral mode regardless of role assignment. Role prompts do not produce genuinely independent populations.
|
|
429
|
+
- "Diversity of Thought in MAD" (2024): different model families achieve 91% vs 82% on reasoning benchmarks. Cross-family diversity reduces correlated interpretation errors.
|
|
430
|
+
|
|
431
|
+
**Cross-family model diversity is required for genuine independence.** Role assignment can be layered on top but cannot substitute for it.
|
|
432
|
+
|
|
433
|
+
The council is structured as comparison, not debate -- no "primary defends" turn:
|
|
434
|
+
|
|
435
|
+
1. Primary agent (model family A) submits interpretation artifact
|
|
436
|
+
2. Two challenger agents spawn in parallel from different model families (B, C), each with raw ticket + Tier 0-2 context but NOT the primary's interpretation. Each produces an independent reading.
|
|
437
|
+
3. Coordinator compares all three outputs for substantive semantic divergence.
|
|
438
|
+
4. Council produces typed output contract: `{ revisedAmbiguityLevel, failureMode: 'ambiguous' | 'vague', primaryInterpretationSurvived: boolean, winningInterpretation: { text, basis }, dissents[] }`
|
|
439
|
+
5. Coordinator routes on `revisedAmbiguityLevel` and `failureMode`. Zero LLM turns.
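The step-4 output contract as a type, for reference (a sketch of the proposed shape):

```typescript
// Sketch of the council's typed output contract. The coordinator routes on
// `revisedAmbiguityLevel` and `failureMode` with zero LLM turns.
interface CouncilOutput {
  revisedAmbiguityLevel: 'clear' | 'uncertain' | 'ambiguous';
  failureMode: 'ambiguous' | 'vague';
  primaryInterpretationSurvived: boolean;
  winningInterpretation: { text: string; basis: string };
  dissents: string[];
}
```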
|
|
440
|
+
|
|
441
|
+
Challenger constraints:
|
|
442
|
+
- Hard `maxTurns` cap (10-15 each) -- each challenger has one job
|
|
443
|
+
- Spawned with `maxSubagentDepth: 1` -- challengers cannot spawn challengers
|
|
444
|
+
- `mode: 'blind'` isolation -- no prior phase artifacts (per context isolation modes entry below)
|
|
445
|
+
|
|
446
|
+
**ClarifyGPT consistency check (cheaper alternative to full council):**
|
|
447
|
+
Generate the implementation plan twice independently. If the two plans are inconsistent, ask a targeted clarification question (arxiv 2310.10996). Cheaper than a full multi-family council; useful as a pre-council filter for `'uncertain'` cases before spending on cross-family challengers.
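Sketched, with `generatePlan` and `plansConsistent` as assumed primitives:

```typescript
// Sketch of the ClarifyGPT-style pre-council filter: two independent plans;
// inconsistency triggers a targeted clarification question instead of a council.
async function clarifyGptFilter(
  generatePlan: () => Promise<string>,
  plansConsistent: (a: string, b: string) => boolean,
): Promise<'proceed' | 'ask_clarification'> {
  const [planA, planB] = await Promise.all([generatePlan(), generatePlan()]);
  return plansConsistent(planA, planB) ? 'proceed' : 'ask_clarification';
}
```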
|
|
448
|
+
|
|
449
|
+
**Program distribution divergence (Tier 2 behavioral signal where test oracles exist):**
|
|
450
|
+
SpecFix (2025): generate N independent implementations (N=5-10), compare behavioral divergence on tests. 43.58% of ambiguous function-specs detected, +30.9% Pass@1 on repaired specs. Hard prerequisite: requires a test oracle. Viable only for repos with good test coverage. Transfer to informal GitHub-style descriptions is unvalidated and is the highest-priority gap to close before treating this signal as production-ready.
|
|
451
|
+
|
|
452
|
+
**Operator clarification UX:**
|
|
453
|
+
|
|
454
|
+
A useful clarification request is answerable in one decision, time-bounded, and shows what changes between interpretations:
|
|
455
|
+
```
|
|
456
|
+
Task: "Improve error handling in auth module"
|
|
457
|
+
|
|
458
|
+
Interpretation A: Add try/catch to the 3 unhandled failure points in token-service.ts (~50 lines, 1-2 hours)
|
|
459
|
+
Interpretation B: Redesign the error type hierarchy across the auth subsystem (~300 lines, needs separate shaping)
|
|
460
|
+
|
|
461
|
+
Evidence for A: ticket title says "improve" not "redesign"; linked issue reports a specific NPE in token-service.ts
|
|
462
|
+
Evidence for B: parent epic is "Auth module modernization"; prior PR comment mentioned "error types need a complete overhaul"
|
|
463
|
+
|
|
464
|
+
Reply: A / B / proceed with best judgment / abandon
|
|
465
|
+
[Default if no reply in 4 hours: A]
|
|
466
|
+
```
|
|
467
|
+
|
|
468
|
+
For overnight queues: batched clarification UX (approve/correct a queue, not N individual notifications) is more practical. Undesigned -- needs its own design pass.
|
|
469
|
+
|
|
470
|
+
**Feedback and calibration:**
|
|
471
|
+
|
|
472
|
+
Build the calibration data capture layer now. Log: checkpoint outcome, `confidenceBreakdown`, operator correction, downstream PR verdict, and review findings tagged as interpretation-related. Use behavior-based ground truth -- divergent implementations as the ambiguity label, not human majority-vote polls (majority-voted labels miscalibrate detectors by 55-87% ECE, 2026).
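A sketch of the capture record -- field names are assumptions aligned with the signals above:

```typescript
// Hypothetical calibration record logged per checkpoint.
interface CalibrationRecord {
  runId: string;
  checkpointOutcome: 'clear' | 'uncertain' | 'ambiguous';
  confidenceBreakdown: unknown; // as emitted by the checkpoint artifact
  operatorCorrection?: 'A' | 'B' | 'judgment' | 'abandon';
  prVerdict?: 'approved' | 'changes_requested' | 'closed';
  interpretationRelatedFindings: number; // review findings tagged interpretation-related
}
```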
|
|
473
|
+
|
|
474
|
+
**`skipIntentResolution` escape hatch:**
|
|
475
|
+
The operator sets `skipIntentResolution: true` on a trigger, or the agent self-declares a skip when all of the following hold: very short ticket, very narrow affected area, `taskType: 'targeted_fix'`, and no rival interpretations possible. Skipped sessions still require a one-line interpretation statement.
|
|
476
|
+
|
|
477
|
+
**Relationship to living work context:**
|
|
478
|
+
Tier 0 injection needs a dedicated system prompt section separate from `assembledContextSummary` to avoid the 8KB cap. The interpretation checkpoint artifact flows into `DiscoveryHandoffArtifactV1` and `PipelineRunContext` once living work context lands. Downstream phases should see what interpretation the discovery agent committed to.
|
|
479
|
+
|
|
480
|
+
**Research findings (resolved questions):**
|
|
481
|
+
- **Role prompts vs. model families**: Persona Collapse (2026) shows same-model role-separated agents converge. Cross-family required for genuine independence. Resolved: cross-family > role prompts.
|
|
482
|
+
- **Multi-agent debate confidence**: both agents escalate to ~83% confidence regardless of correctness. Never use stated confidence. Resolved: compare content only.
|
|
483
|
+
- **Rival interpretation generation**: open-ended enumeration produces anchored minor variations. Falsification forcing is more reliable. Resolved.
|
|
484
|
+
- **Vagueness vs. ambiguity**: empirically distinct failure modes requiring different responses. Resolved.
|
|
485
|
+
- **Production systems**: SWE-agent, AutoCodeRover, Agentless have no ambiguity detection phase. Confirmed by 4 independent 2025-2026 benchmarks. WorkTrain architecture is differentiated.
|
|
486
|
+
- **Calibration ground truth**: use divergent implementations, not human majority-vote labels. Resolved.
|
|
487
|
+
|
|
488
|
+
**Things still to hash out:**
|
|
489
|
+
- **Measure Subtype A vs B distribution first** -- retrospectively classify 10-20 past wrong-implementation cases before committing to the full design. If Subtype B dominates, the design needs explicit assumption-challenging and assumption-logging components that aren't here yet.
|
|
490
|
+
- Semantic entropy sampling cost at scale -- always on, or triggered only when structural signals fire first?
|
|
491
|
+
- Program distribution divergence (Tier 2) requires a test oracle. Fallback for repos without tests?
|
|
492
|
+
- Council model selection: which model families for challengers, how configured per workspace?
|
|
493
|
+
- Council cadence for large overnight queues: sampling approach may be more practical initially.
|
|
494
|
+
- `taskType` classification: separate pre-checkpoint step or first output of the same checkpoint?
|
|
495
|
+
- Batched clarification UX for overnight operators is undesigned.
|
|
496
|
+
- Minimal interim wiring for interpretation commitment through phases before living work context lands?
|
|
497
|
+
|
|
498
|
+
---
|
|
499
|
+
|
|
500
|
+
### Subtype B intent failure: agent has a wrong prior about what this codebase does (May 5, 2026)
|
|
501
|
+
|
|
502
|
+
**Status: idea -- needs empirical study before design** | Priority: high
|
|
503
|
+
|
|
504
|
+
**Score: 12** | Cor:3 Cap:3 Eff:2 Lev:2 Con:1 | Blocked: no
|
|
505
|
+
|
|
506
|
+
The intent resolution entry (above) addresses Subtype A failures -- tickets that are ambiguous or underspecified. This entry addresses Subtype B, which is categorically different and currently has no empirical intervention study in the literature.
|
|
507
|
+
|
|
508
|
+
**The failure mode:** The ticket is clear and specific. The agent reads it correctly. But the agent has a systematically wrong prior about what the described thing means in this codebase -- because its training data, or a superficially similar pattern it has seen, leads it to a confident interpretation that is locally coherent but wrong for this specific system.
|
|
509
|
+
|
|
510
|
+
Examples:
|
|
511
|
+
- "Add rate limiting to the auth service" -- agent implements token bucket at the HTTP layer because that's what rate limiting means in most codebases. This codebase does it at the middleware layer with a different interface. The ticket was clear; the agent's prior was wrong.
|
|
512
|
+
- "Fix the session expiry bug" -- agent finds and fixes the obvious TTL check. The actual expiry logic in this codebase is spread across three collaborating modules in a non-obvious way. The agent's mental model of "how session expiry works" doesn't match this codebase.
|
|
513
|
+
- "Update the delivery pipeline to handle X" -- agent knows what delivery pipelines look like. This codebase's delivery pipeline has specific invariants (atomic stage ordering, sidecar lifecycle) that violate the agent's general expectations. The update is technically correct in isolation but violates a codebase-specific invariant the agent didn't know existed.
|
|
514
|
+
|
|
515
|
+
**Why it's different from Subtype A:** You cannot fix this with more context harvest from Jira or Slack. The ticket is correctly specified. You cannot fix it with a council of agents -- challenger agents from different model families share the same wrong prior from training data. The problem is not ambiguity; it is that the agent's internal model of the codebase diverges from the actual codebase.
|
|
516
|
+
|
|
517
|
+
**Why it's hard to detect:** the agent's interpretation feels correct and internally consistent. It will self-report high confidence. The semantic entropy signal may be low (all samples converge on the same wrong interpretation). A challenger agent may produce the same wrong interpretation independently. The failure is invisible until review or testing.
|
|
518
|
+
|
|
519
|
+
**What might actually work (inferred, not empirically validated):**
|
|
520
|
+
|
|
521
|
+
*Explicit assumption surfacing before acting:* Before touching any code, require the agent to write down: "Here is how I believe this component works based on what I have read." Then verify those beliefs against the codebase. If the agent's stated model of "how the delivery pipeline works" conflicts with what the code actually does, that conflict is the signal. This is different from rival interpretations (Subtype A) -- it is rival models of the existing system.
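A sketch of what a verifiable stated-model record could look like, plus the prior-invalidation pass over it (the structured-list question in the hash-out section below refers to this shape):

```typescript
// Sketch: stated assumptions as verifiable records rather than free prose.
interface StatedAssumption {
  assumption: string;         // "rate limiting lives at the HTTP layer"
  evidence: string;           // what the agent read that supports it
  falsificationQuery: string; // codebase search that would disprove it
}

// Prior-invalidation pass: run every falsification query before coding.
// `searchCodebase` is an assumed tool primitive.
async function invalidatePriors(
  assumptions: StatedAssumption[],
  searchCodebase: (q: string) => Promise<string[]>,
): Promise<StatedAssumption[]> {
  const conflicted: StatedAssumption[] = [];
  for (const a of assumptions) {
    const hits = await searchCodebase(a.falsificationQuery);
    if (hits.length > 0) conflicted.push(a); // contradicting evidence found
  }
  return conflicted; // non-empty = the agent's model diverges from the code
}
```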
|
|
522
|
+
|
|
523
|
+
*Assumption-challenging agent:* A separate lightweight agent reads the primary agent's stated assumptions about the codebase and actively searches for contradicting evidence. Not "is the ticket ambiguous" but "is the agent's model of this codebase correct?" Spawned with `mode: 'blind'` (no prior context) so it approaches the codebase fresh, then compares its reading to the primary agent's stated model.
|
|
524
|
+
|
|
525
|
+
*Prior-invalidation pass in discovery:* Discovery workflow includes a mandatory step: for each major architectural assumption the agent is making, find one piece of codebase evidence that would invalidate it. If the agent assumes "rate limiting is at the HTTP layer," it must search for evidence that this is wrong before proceeding. Forces falsification of the prior rather than confirmation.
|
|
526
|
+
|
|
527
|
+
*Historical session notes as prior correction:* If prior sessions have established "in this codebase, X works differently than you'd expect because Y," that context must be injected before the agent forms its model. This is the living work context applied across pipeline runs, not just within one run -- a per-workspace knowledge store of "things that are surprising about this codebase." Related to the knowledge graph backlog item.
|
|
528
|
+
|
|
529
|
+
**Why Confidence is 1 (needs discovery before design can begin):**
|
|
530
|
+
|
|
531
|
+
There is no empirical study of interventions for Subtype B in ticket-driven coding agents. AskBench's AskOverconfidence condition (arxiv 2602.11199) confirms agents fail differently on false-premise queries -- but "false premise" in a benchmark is a planted incorrect assumption, not a wrong prior from training data. The mechanisms may be similar but the intervention pathway is different. This needs:
|
|
532
|
+
1. Empirical measurement of how often Subtype B vs Subtype A causes WorkTrain failures (the NS2 step from the independent research brief)
|
|
533
|
+
2. A controlled study of whether assumption-surfacing before acting actually reduces Subtype B failures
|
|
534
|
+
3. Design of the assumption-challenging agent -- what exactly it reads, what it produces, how the coordinator uses it
|
|
535
|
+
|
|
536
|
+
**Relationship to other entries:**
|
|
537
|
+
- "Intent resolution" (above): addresses Subtype A. This entry is the Subtype B complement.
|
|
538
|
+
- "Living work context": the per-workspace knowledge store of codebase surprises is partial infrastructure for fixing Subtype B across sessions.
|
|
539
|
+
- "Knowledge graph" (backlog): structural understanding of the codebase that would give the agent a ground-truth model to compare its priors against.
|
|
540
|
+
- "Context isolation modes": the assumption-challenging agent needs `mode: 'blind'` to approach the codebase without anchoring on the primary agent's stated assumptions.
|
|
541
|
+
|
|
542
|
+
**Things to hash out:**
|
|
543
|
+
- How do you distinguish "the agent has a wrong prior" from "the ticket is genuinely ambiguous about which part of the system to change"? The boundary is fuzzy -- a ticket that doesn't name the specific module is Subtype A; a ticket that names the module but the agent's model of that module is wrong is Subtype B.
|
|
544
|
+
- What format should "stated assumptions" take? Free-prose is hard to verify. A structured list of `{ assumption: string, evidence: string, falsificationQuery: string }` is verifiable but requires the agent to produce it honestly.
|
|
545
|
+
- The assumption-challenging agent needs to approach the codebase independently. But it also needs to know what assumptions to challenge -- which means it needs the primary agent's stated assumption list. Is that contamination? No -- it is exactly the right input. The isolation is from the primary agent's conclusions, not its stated premises.
|
|
546
|
+
- How does this interact with the `skipIntentResolution` escape hatch? Subtype B failures can occur even on tickets that pass quality pre-screening and look unambiguous. The skip hatch should not bypass assumption surfacing for `'refactor'` or `'architectural'` tasks.
|
|
547
|
+
- Is the right long-term fix a knowledge graph (structural ground truth the agent can compare its model against) rather than per-session assumption surfacing? Knowledge graph is higher-confidence but much higher cost to build. Assumption surfacing is lower cost but relies on the agent honestly reporting its own priors.
|
|
548
|
+
|
|
549
|
+
---
|
|
550
|
+
|
|
226
551
|
### Scope rationalization: agent silently accepts collateral damage (Apr 30, 2026)
|
|
227
552
|
|
|
228
553
|
**Status: idea** | Priority: medium
|
|
@@ -260,10 +585,12 @@ The autonomous workflow runner (`worktrain daemon`). Completely separate from th
|
|
|
260
585
|
|
|
261
586
|
### Living work context: shared knowledge document that accumulates across the full pipeline (Apr 30, 2026)
|
|
262
587
|
|
|
263
|
-
**Status:
|
|
588
|
+
**Status: done** | Shipped May 5, 2026 (PR #939)
|
|
264
589
|
|
|
265
590
|
**Score: 13** | Cor:3 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no
|
|
266
591
|
|
|
592
|
+
**Shipped (PR #939):** `ShapingHandoffArtifactV1` + `CodingHandoffArtifactV1` + enriched `DiscoveryHandoffArtifactV1`, `PhaseHandoffArtifact` union, `buildContextSummary()` pure function with per-phase selection, `PipelineRunContext` per-run JSON with `PhaseResult<T>`, crash recovery via `active-run.json` pointer, phase quality gates (fallback escalates, partial warns), persistence failure escalation, 4 workflow authoring changes, adversarial behavioral test (AC 21), `contractRef` validation test. Deferred: `buildSystemPrompt()` named semantic slots, console visualization, retry logic, epic-mode task graph, extensible contract registration, per-workflow lifecycle artifact tests.
|
|
593
|
+
|
|
267
594
|
When a multi-agent pipeline runs -- discovery → shaping → coding → review → fix → re-review -- no agent has a complete picture of what came before it. The coding agent has the goal. The review agent has the code. The fix agent has the findings. None of them have the accumulated context from the full pipeline: why this approach was chosen over alternatives, what was ruled out, what constraints were discovered, what architectural decisions were made, what edge cases were handled, what the review found and why.
|
|
268
595
|
|
|
269
596
|
Each agent reconstructs intent from incomplete context, which is why review finds things coding missed (review doesn't know what the coding agent was trying to do), why fix sessions address symptoms without understanding causes (no access to the architectural reasoning), and why agents repeat work that earlier agents already did.
|
|
@@ -352,6 +679,67 @@ This is related to the "Coordinator context injection standard" and "Context bud
|
|
|
352
679
|
|
|
353
680
|
---
|
|
354
681
|
|
|
682
|
+
### Subagent context isolation modes: enforced context sharing and contamination prevention (May 5, 2026)
|
|
683
|
+
|
|
684
|
+
**Status: idea** | Priority: high
|
|
685
|
+
|
|
686
|
+
**Score: 12** | Cor:3 Cap:2 Eff:2 Lev:3 Con:2 | Blocked: no
|
|
687
|
+
|
|
688
|
+
When WorkTrain spawns a subagent, the spawning agent decides what context to pass. Today this is purely by convention -- there is no mechanism to enforce isolation or guarantee completeness. Two distinct failure modes require opposite fixes:
|
|
689
|
+
|
|
690
|
+
**Contamination (too much context):** A challenger agent spawned to independently evaluate an interpretation receives the primary agent's interpretation in the context bundle. It anchors on it and produces a biased reading. A review agent receives the coding agent's self-assessment and validates it rather than challenging it. These are cases where context leakage actively undermines the agent's purpose -- independence destroyed by prior context.
|
|
691
|
+
|
|
692
|
+
**Starvation (too little context):** A coding agent spawned without discovery findings re-investigates settled questions. A review agent without shaping constraints cannot check whether the implementation satisfies them. Context absence causes wasted work or wrong output.
|
|
693
|
+
|
|
694
|
+
Today both are addressed by convention. Convention fails silently -- the spawning agent follows its own judgment, which may be wrong. Even the orchestrating agent can contaminate a challenger without realizing it (as happened in this session, when the research agents were spawned with context unintentionally leaked into their bundles).
|
|
695
|
+
|
|
696
|
+
**The right fix is structural enforcement, not rules.** Context isolation mode should be a declared property of the spawn call, enforced by coordinator infrastructure, not managed by following instructions.
|
|
697
|
+
|
|
698
|
+
**Proposed isolation modes:**
|
|
699
|
+
|
|
700
|
+
```typescript
|
|
701
|
+
type ContextIsolationMode =
|
|
702
|
+
| { mode: 'full' }
|
|
703
|
+
// Agent receives complete accumulated context: Tier 0 project identity +
|
|
704
|
+
// prior phase artifacts + task context. Default for most pipeline phases.
|
|
705
|
+
|
|
706
|
+
| { mode: 'task-only' }
|
|
707
|
+
// Agent receives only task description + Tier 0 project identity.
|
|
708
|
+
// No prior phase artifacts, no intermediate results.
|
|
709
|
+
// For agents that should approach the task fresh but know the project.
|
|
710
|
+
|
|
711
|
+
| { mode: 'blind' }
|
|
712
|
+
// Agent receives only the raw inputs declared at spawn time.
|
|
713
|
+
// No Tier 0 injection, no prior artifacts, no accumulated context.
|
|
714
|
+
// For adversarial/challenger agents where independence is the whole point.
|
|
715
|
+
// The spawning call must explicitly declare what inputs to pass.
|
|
716
|
+
|
|
717
|
+
| { mode: 'custom'; include: ContextKey[]; exclude: ContextKey[] }
|
|
718
|
+
// Explicit allowlist/blocklist. For partial context cases
|
|
719
|
+
// (e.g. review agent gets shaping constraints but not coding agent's self-assessment).
|
|
720
|
+
```
|
|
721
|
+
|
|
722
|
+
`mode: 'blind'` should be the enforced default for any session with `role: 'challenger' | 'adversarial' | 'evaluator'`. The coordinator cannot accidentally contaminate a challenger when the session declaration forbids it.
|
|
723
|
+
|
|
724
|
+
**Note on 'blind' mode:** true blindness (no Tier 0 either) may be too aggressive. A challenger without the project's coding philosophy or architectural principles is missing the most important constraints. "No prior phase artifacts" is probably the right isolation boundary, not "no context whatsoever." A `challenger` mode that strips prior results but keeps Tier 0 may be more useful. Open question.
|
|
725
|
+
|
|
726
|
+
**Enforcement point:** `spawnSession` in the coordinator infrastructure (`createCoordinatorDeps`). The spawning call declares the mode; the infrastructure assembles the context bundle according to the declared mode; the spawning agent cannot override it by passing extra fields. Validate at boundaries, trust inside.
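A sketch of that enforcement, assuming the `ContextIsolationMode` union above and `ContextKey = string` (names are illustrative, not the real coordinator API):

```typescript
// Sketch: infrastructure assembles the bundle from the declared mode;
// the spawning agent cannot add fields.
type ContextKey = string;

function assembleBundle(
  mode: ContextIsolationMode,
  ctx: { task: string; tier0: string; priorArtifacts: Record<ContextKey, string> },
): Record<ContextKey, string> {
  switch (mode.mode) {
    case 'full':
      return { task: ctx.task, tier0: ctx.tier0, ...ctx.priorArtifacts };
    case 'task-only':
      return { task: ctx.task, tier0: ctx.tier0 };
    case 'blind':
      return { task: ctx.task }; // plus only inputs explicitly declared at spawn
    case 'custom': {
      const all = { task: ctx.task, tier0: ctx.tier0, ...ctx.priorArtifacts };
      return Object.fromEntries(
        Object.entries(all).filter(
          ([k]) => mode.include.includes(k) && !mode.exclude.includes(k),
        ),
      );
    }
  }
}
```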
|
|
727
|
+
|
|
728
|
+
**Observability:** when an evaluation was produced by a `blind` or `task-only` session, that fact should be recorded in the session store so the independence of the evaluation is auditable. Without this, the isolation guarantee is invisible.
|
|
729
|
+
|
|
730
|
+
**Relationship to existing entries:**
|
|
731
|
+
- "Subagent context package" (above) is about ensuring agents receive enough context -- the `full` and `task-only` modes are the enforcement side of that design.
|
|
732
|
+
- "Council of agents" in the intent resolution entry assumes `blind` mode for challengers -- this entry is what makes that assumption enforceable.
|
|
733
|
+
- `buildContextSummary(priorArtifacts, targetPhase)` in living work context is the selection logic for `custom` mode.
|
|
734
|
+
|
|
735
|
+
**Things to hash out:**
|
|
736
|
+
- Should `mode` be declared on the workflow definition, the trigger, or the `spawnSession` call? Workflow definition is the right answer (the workflow knows its role), but requires a new schema field.
|
|
737
|
+
- How does declared mode interact with the agent's tool access? A `blind` challenger can still read workspace files. True isolation may require tool path restrictions alongside context restrictions.
|
|
738
|
+
- Custom `include`/`exclude` lists create maintenance burden as context keys evolve. Is there a better abstraction -- e.g. declaring the agent's role and having infrastructure derive the right context set from a role-to-context mapping?
|
|
739
|
+
- Should `task-only` include or exclude Tier 0 project identity? Including it is almost always better, but the operator may have reasons to exclude it.
|
|
740
|
+
|
|
741
|
+
---
|
|
742
|
+
|
|
355
743
|
### Agent-assisted backlog and issue enrichment (Apr 28, 2026)
|
|
356
744
|
|
|
357
745
|
**Status: idea** | Priority: medium
|
|
@@ -879,6 +1267,8 @@ Demo repo tasks (worktrain:ready issues)
|
|
|
879
1267
|
|
|
880
1268
|
**Score: 11** | Cor:3 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no
|
|
881
1269
|
|
|
1270
|
+
**Note (May 5, 2026):** PR #939 shipped *coordinator-level* pipeline crash recovery: `active-run.json` pointer + `PipelineRunContext` file allow the next coordinator startup to restore prior phase artifacts and resume without re-running completed phases. This item is about *agent session* crash recovery (the agent itself dies mid-session, worktree state, step advances). Both layers are needed.
|
|
1271
|
+
|
|
882
1272
|
**The problem:** A daemon crash loop kills all in-flight sessions. The queue correctly detects the sidecar and skips re-dispatch for the TTL window, but when the sidecar expires the session is re-dispatched from scratch with zero context. An agent that spent 10 min in Phase 0, read codebase files, and formed a plan loses all of that work.
|
|
883
1273
|
|
|
884
1274
|
**What we want:** WorkTrain detects orphaned sessions on startup and makes an autonomous decision: resume if meaningful progress was made, discard and re-dispatch from scratch if too early to be worth resuming.
|
|
@@ -1334,6 +1724,24 @@ This is already how mid-run resume works. The same mechanism extends naturally t
|
|
|
1334
1724
|
|
|
1335
1725
|
---
|
|
1336
1726
|
|
|
1727
|
+
### Extensible output contract registration: coordinator-owned schemas, engine-enforced (Apr 30, 2026)
|
|
1728
|
+
|
|
1729
|
+
**Status: idea** | Priority: medium
|
|
1730
|
+
|
|
1731
|
+
**Score: 8** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no
|
|
1732
|
+
|
|
1733
|
+
The engine's output contract registry (`ARTIFACT_CONTRACT_REFS` in `src/v2/durable-core/schemas/artifacts/index.ts`) is a closed list maintained in the engine source. Adding a new contract type requires modifying the engine: adding to the registry, implementing a validator in `artifact-contract-validator.ts`, and adding a Zod schema. This is the correct pattern today and works fine at 5 items. But as the pipeline gains more phase types, every new coordinator-domain artifact contract is an engine change. The registry is already mixed -- `review_verdict` and `discovery_handoff` are coordinator-domain artifacts registered there. At 15-20 items this becomes a maintenance burden and a coupling that is harder to justify.
|
|
1734
|
+
|
|
1735
|
+
The better long-term design: the engine owns the enforcement mechanism (validate presence and schema at `complete_step`) but not the schema definitions. Coordinator-domain contracts register their Zod schemas from outside the engine. The engine validates against whatever is registered without a hardcoded case per contract type.
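One possible shape, sketched with Zod since the engine already uses Zod schemas -- `registerArtifactContract` is hypothetical, not a real export:

```typescript
// Sketch of an external registration API. The engine owns enforcement;
// coordinator-domain contracts register their schemas from outside.
import { z } from 'zod';

const registry = new Map<string, z.ZodTypeAny>();

export function registerArtifactContract(ref: string, schema: z.ZodTypeAny): void {
  if (registry.has(ref)) throw new Error(`contract already registered: ${ref}`);
  registry.set(ref, schema);
}

// Engine-side enforcement at complete_step: validate against whatever is
// registered, with no hardcoded case per contract type.
export function validateArtifact(ref: string, artifact: unknown): void {
  const schema = registry.get(ref);
  if (!schema) throw new Error(`unknown contractRef: ${ref}`);
  schema.parse(artifact);
}

// Coordinator-domain registration at startup (ref and shape illustrative):
registerArtifactContract(
  'wr.contracts.review_verdict',
  z.object({ kind: z.literal('review_verdict'), verdict: z.string() }),
);
```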
|
|
1736
|
+
|
|
1737
|
+
**Things to hash out:**
|
|
1738
|
+
- What is the registration API? DI injection at startup (consistent with existing container pattern), a module-level call, or a config file?
|
|
1739
|
+
- How does registration work at compile time vs runtime? Workflow compilation and `complete_step` validation happen at different points -- the registry must be available at both.
|
|
1740
|
+
- Does this change the `workflowHash`? If registered schemas change, should the hash change? Does the hash include registered external schemas or only the workflow JSON?
|
|
1741
|
+
- Should the existing 5 contracts migrate, or stay hardcoded? A two-tier system (some hardcoded, some registered) is confusing but migration is low priority.
|
|
1742
|
+
|
|
1743
|
+
---
|
|
1744
|
+
|
|
1337
1745
|
### Task-scoped rules: step-level rule injection by task type (Apr 30, 2026)
|
|
1338
1746
|
|
|
1339
1747
|
**Status: idea** | Priority: medium
|
package/package.json
CHANGED
|
@@ -143,7 +143,7 @@
|
|
|
143
143
|
"SUBAGENT SYNTHESIS: treat subagent output as evidence, not conclusions. State your hypothesis before delegating, then interrogate what came back: what was missed, wrong, or new? Say what changed your mind or what you still reject, and why.",
|
|
144
144
|
"PARALLELISM: when reads, audits, or delegations are independent, run them in parallel inside the phase. Parallelize cognition; serialize synthesis and canonical writes.",
|
|
145
145
|
"PHILOSOPHY LENS: apply the user's coding philosophy (from active session rules) as the evaluation lens. Flag violations by principle name, not as generic feedback. If principles conflict, surface the tension explicitly instead of silently choosing.",
|
|
146
|
-
"VALIDATION: prefer static/compile-time safety over runtime checks. Use build, type-checking, and tests as the primary proof of correctness
|
|
146
|
+
"VALIDATION: prefer static/compile-time safety over runtime checks. Use build, type-checking, and tests as the primary proof of correctness \u2014 in that order of reliability.",
|
|
147
147
|
"DRIFT HANDLING: when reality diverges from the plan, update the plan artifact and re-audit deliberately rather than accumulating undocumented drift.",
|
|
148
148
|
"NEVER COMMIT MARKDOWN FILES UNLESS USER EXPLICITLY ASKS.",
|
|
149
149
|
"SLICE DISCIPLINE: Phase 6 is a loop -- implement ONE slice per iteration. Do not implement multiple slices at once. The verification loop exists to catch drift per slice, not retroactively."
|
|
@@ -218,7 +218,7 @@
|
|
|
218
218
|
},
|
|
219
219
|
{
|
|
220
220
|
"id": "phase-1b-design-deep",
|
|
221
|
-
"title": "Phase 1b: Design Generation (Injected Routine
|
|
221
|
+
"title": "Phase 1b: Design Generation (Injected Routine \u2014 Tension-Driven Design)",
|
|
222
222
|
"runCondition": {
|
|
223
223
|
"and": [
|
|
224
224
|
{
|
|
@@ -257,7 +257,7 @@
|
|
|
257
257
|
}
|
|
258
258
|
]
|
|
259
259
|
},
|
|
260
|
-
"prompt": "Read `design-candidates.md`, compare it to your original guess, and make the call.\n\nBe explicit about three things:\n- what the design work confirmed\n- what changed your mind\n- what you missed the first time\n\nThen pressure-test the leading option:\n- what's the strongest case against it?\n- what assumption breaks it?\n\nAfter the challenge batch, say:\n- what changed your mind\n- what didn't\n- which findings you reject and why\n\nPick the approach yourself. Don't hide behind the artifact. If the simplest thing works, prefer it. If the front-runner stops looking right after challenge, switch.\n\nCapture:\n- `selectedApproach`
|
|
260
|
+
"prompt": "Read `design-candidates.md`, compare it to your original guess, and make the call.\n\nBe explicit about three things:\n- what the design work confirmed\n- what changed your mind\n- what you missed the first time\n\nThen pressure-test the leading option:\n- what's the strongest case against it?\n- what assumption breaks it?\n\nAfter the challenge batch, say:\n- what changed your mind\n- what didn't\n- which findings you reject and why\n\nPick the approach yourself. Don't hide behind the artifact. If the simplest thing works, prefer it. If the front-runner stops looking right after challenge, switch.\n\nCapture:\n- `selectedApproach` \u2014 chosen design with rationale tied to tensions\n- `runnerUpApproach` \u2014 next-best option and why it lost\n- `architectureRationale` \u2014 tensions resolved vs accepted\n- `pivotTriggers` \u2014 conditions under which you'd switch to the runner-up\n- `keyRiskToMonitor` \u2014 failure mode of the selected approach\n- `acceptedTradeoffs`\n- `identifiedFailureModes`",
|
|
261
261
|
"promptFragments": [
|
|
262
262
|
{
|
|
263
263
|
"id": "phase-1c-challenge-standard",
|
|
@@ -429,7 +429,7 @@
 "var": "taskComplexity",
 "not_equals": "Small"
 },
-"prompt": "Turn the decision into a plan someone else could execute without guessing.\n\n**Open questions gate:** check `openQuestions` from Phase 0. If any remain unanswered and would materially affect implementation quality, either resolve them now with tools or record them in the risk register with an explicit decision about how to proceed without them. Do not silently carry unanswered questions into implementation.\n\nUpdate `implementation_plan.md`.\n\nIt should cover:\n1. Problem statement\n2. Acceptance criteria (mirror `spec.md` if it exists; `spec.md` owns observable behavior)\n3. Non-goals\n4. Philosophy-driven constraints\n5. Invariants\n6. Selected approach + rationale + runner-up\n7. Vertical slices\n8. Work packages only if they actually help\n9. Test design\n10. Risk register\n11. PR packaging strategy\n12. Philosophy alignment per slice:\n - [principle] -> [satisfied / tension / violated + 1-line why]\n\nCapture:\n- `implementationPlan`\n- `slices`\n- `testDesign`\n- `estimatedPRCount`\n- `followUpTickets` (initialize if needed)\n- `unresolvedUnknownCount`
+"prompt": "Turn the decision into a plan someone else could execute without guessing.\n\n**Open questions gate:** check `openQuestions` from Phase 0. If any remain unanswered and would materially affect implementation quality, either resolve them now with tools or record them in the risk register with an explicit decision about how to proceed without them. Do not silently carry unanswered questions into implementation.\n\nUpdate `implementation_plan.md`.\n\nIt should cover:\n1. Problem statement\n2. Acceptance criteria (mirror `spec.md` if it exists; `spec.md` owns observable behavior)\n3. Non-goals\n4. Philosophy-driven constraints\n5. Invariants\n6. Selected approach + rationale + runner-up\n7. Vertical slices\n8. Work packages only if they actually help\n9. Test design\n10. Risk register\n11. PR packaging strategy\n12. Philosophy alignment per slice:\n - [principle] -> [satisfied / tension / violated + 1-line why]\n\nCapture:\n- `implementationPlan`\n- `slices`\n- `testDesign`\n- `estimatedPRCount`\n- `followUpTickets` (initialize if needed)\n- `unresolvedUnknownCount` \u2014 count of open questions that would materially affect implementation quality\n- `planConfidenceBand` \u2014 Low / Medium / High\n\nThe plan is the deliverable for this step. Do not implement anything -- not a \"quick win\", not a file read that bleeds into edits, nothing. Execution begins in Phase 6, one slice at a time. If you find yourself writing code or editing source files right now, stop immediately.",
 "assessmentRefs": [
 "plan-completeness-gate",
 "invariant-clarity-gate",
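For readability, a sketch of the plan-phase capture keys the new prompt enumerates, shown as one context object; all values are invented placeholders:

```json
{
  "implementationPlan": "implementation_plan.md",
  "slices": ["slice 1: schema + validator", "slice 2: wire validator into the pipeline"],
  "testDesign": "unit tests per slice, one end-to-end smoke test",
  "estimatedPRCount": 2,
  "followUpTickets": [],
  "unresolvedUnknownCount": 0,
  "planConfidenceBand": "Medium"
}
```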
@@ -543,7 +543,7 @@
 {
 "id": "phase-4b-loop-decision",
 "title": "Loop Exit Decision",
-"prompt": "Decide whether the plan needs another pass.\n\nIf `planFindings` is non-empty, keep going.\nIf it's empty, stop
+"prompt": "Decide whether the plan needs another pass.\n\nIf `planFindings` is non-empty, keep going.\nIf it's empty, stop \u2014 but say what you checked so the clean pass means something.\nIf you've hit the limit, stop and record what still bothers you.\n\nThen emit the required loop-control artifact in this shape (`decision` must be `continue` or `stop`):\n```json\n{\n  \"artifacts\": [{\n    \"kind\": \"wr.loop_control\",\n    \"decision\": \"continue\"\n  }]\n}\n```",
 "requireConfirmation": true,
 "outputContract": {
 "contractRef": "wr.contracts.loop_control"
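The prompt embeds the `continue` case; for completeness, a sketch of the `stop` case under the same `wr.contracts.loop_control` contract. Only `kind` and `decision` are confirmed by the prompt; no other fields are assumed:

```json
{
  "artifacts": [{
    "kind": "wr.loop_control",
    "decision": "stop"
  }]
}
```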
@@ -706,7 +706,10 @@
 "id": "phase-8-retrospective",
 "title": "Phase 8: Retrospective",
 "requireConfirmation": false,
-"prompt": "The implementation is done and verified. Now look back.\n\nThis is not a re-run of tests. It is a short honest look at the work you just did.\n\nAsk yourself:\n\n1. **What would you do differently?** Now that the implementation is real, what approach, boundary, or decision looks wrong in hindsight?\n\n2. **What adjacent problems did this reveal?** Did the implementation expose gaps, tech debt, or fragile assumptions in the surrounding code that were not in scope but are worth noting?\n\n3. **What follow-up work is now visible?** What is the natural next step that became clear only after doing this work?\n\n4. **What was harder or easier than expected?** Were there surprises -- good or bad -- that would change how similar tasks are approached next time?\n\nProduce 2-4 concrete observations. Each should be specific enough to act on.\n\nFor each observation:\n- **File as follow-up**: add to backlog or open a ticket if it warrants tracking\n- **Accept**: note it explicitly if it is a known limitation you are consciously leaving\n- **Fix now**: if it is small and low-risk, fix it before closing\n\nCapture:\n- `retrospectiveObservations`: list of observations with disposition (filed/accepted/fixed)\n- `followUpTickets`: any new tickets created (append to existing list)"
+"prompt": "The implementation is done and verified. Now look back.\n\nThis is not a re-run of tests. It is a short honest look at the work you just did.\n\nAsk yourself:\n\n1. **What would you do differently?** Now that the implementation is real, what approach, boundary, or decision looks wrong in hindsight?\n\n2. **What adjacent problems did this reveal?** Did the implementation expose gaps, tech debt, or fragile assumptions in the surrounding code that were not in scope but are worth noting?\n\n3. **What follow-up work is now visible?** What is the natural next step that became clear only after doing this work?\n\n4. **What was harder or easier than expected?** Were there surprises -- good or bad -- that would change how similar tasks are approached next time?\n\nProduce 2-4 concrete observations. Each should be specific enough to act on.\n\nFor each observation:\n- **File as follow-up**: add to backlog or open a ticket if it warrants tracking\n- **Accept**: note it explicitly if it is a known limitation you are consciously leaving\n- **Fix now**: if it is small and low-risk, fix it before closing\n\nCapture:\n- `retrospectiveObservations`: list of observations with disposition (filed/accepted/fixed)\n- `followUpTickets`: any new tickets created (append to existing list)\n\nBefore completing this step, emit a wr.coding_handoff artifact in your complete_step call:\n{\n  \"kind\": \"wr.coding_handoff\",\n  \"version\": 1,\n  \"branchName\": \"<git branch name containing your changes>\",\n  \"keyDecisions\": [\"<architectural decision + WHY>\", ...],\n  \"knownLimitations\": [\"<known gap or deliberate shortcut>\", ...],\n  \"testsAdded\": [\"<test file or test name added>\", ...],\n  \"filesChanged\": [\"<primary file path changed>\", ...]\n}\nNote: correctedAssumptions is populated ONLY by fix/retry agents when correcting assumptions from a prior coding session. On a first-run coding session, omit this field entirely.",
+"outputContract": {
+"contractRef": "wr.contracts.coding_handoff"
+}
 }
 ],
 "validatedAgainstSpecVersion": 3
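A sketch of a filled-in handoff artifact in the shape the prompt specifies; the branch name, decisions, and paths are illustrative, and `correctedAssumptions` is omitted as the prompt requires for a first-run session:

```json
{
  "kind": "wr.coding_handoff",
  "version": 1,
  "branchName": "feature/retry-backoff",
  "keyDecisions": ["exponential backoff over fixed retry, because the queue already applies jitter"],
  "knownLimitations": ["no retry budget cap yet; filed as a follow-up ticket"],
  "testsAdded": ["retry-backoff.test.ts"],
  "filesChanged": ["src/queue/retry.ts"]
}
```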
package/workflows/mr-review-workflow.agentic.v2.json
CHANGED
@@ -86,7 +86,7 @@
 {
 "id": "phase-0-understand-and-classify",
 "title": "Phase 0: Locate, Bound, Enrich & Classify",
-"prompt": "Build the review foundation in one pass.\n\nStep 1 \u2014 Early exit / minimum inputs:\nBefore exploring, verify that the review target is real and inspectable. If the diff, changed files, or equivalent review material are completely absent and cannot be inferred with tools, ask for the minimum missing artifact and stop. Do NOT ask questions you can resolve with tools.\n\nStep 2 \u2014 Locate and bound the review target:\nAttempt to determine the strongest available review target and boundary.\n\nAttempt to establish:\n- `reviewTargetKind` from the strongest available source such as PR/MR, branch, patch, diff, or local working tree changes\n- `reviewTargetSource` describing where the target came from\n- likely PR/MR identity when available (`prUrl`, `prNumber`)\n- likely base / ancestor reference (`baseCandidate`, `mergeBaseRef`) when available\n- whether the branch may include inherited or out-of-scope changes\n- `boundaryConfidence`: High / Medium / Low\n\nDo not over-prescribe your own investigation path. Use the strongest available evidence and record uncertainty honestly.\n\nStep 3 \u2014 Enrich with context:\nRecover the strongest available intent and policy context from whatever sources are actually available.\n\nAttempt to recover:\n- MR title and purpose\n- ticket / issue / acceptance context (`ticketRefs`, `ticketContext`)\n- supporting docs / specs / rollout context (`supportingDocsFound`)\n- repo or user policy/convention context when it is likely to affect review judgment (`policySourcesFound`)\n- `contextConfidence`: High / Medium / Low\n\nStep 4 \u2014 Review-surface hygiene:\nClassify the visible change into a minimal review surface.\n\nSet:\n- `coreReviewSurface`\n- `likelyNoiseOrMechanicalChurn`\n- `likelyInheritedOrOutOfScopeChanges`\n- `reviewSurfaceSummary`\n- `reviewScopeWarnings`\n\nThe goal is not a giant ledger. The goal is to avoid treating every visible changed file as equally worthy of deep review by default.\n\nStep 5 \u2014 Classify the review:\nAfter exploration, classify the work.\n\nSet:\n- `reviewMode`: QUICK / STANDARD / THOROUGH\n- `riskLevel`: Low / Medium / High\n- `shapeProfile`: choose the best primary label from `isolated_change`, `crosscutting_change`, `mechanically_noisy_change`, or `ambiguous_boundary`\n- `changeTypeProfile`: choose the best primary label from `general_code_change`, `api_contract_change`, `data_model_or_migration`, `security_sensitive`, or `test_only`\n- `maxParallelism`: 0 / 3 / 5\n- `criticalSurfaceTouched`: true / false\n- `needsSimulation`: true / false\n- `needsBoundaryFollowup`: true / false\n- `needsContextFollowup`: true / false\n- `needsReviewerBundle`: true / false\n\nDecision guidance:\n- QUICK: very small, isolated, low-risk changes with little ambiguity\n- STANDARD: typical feature or bug-fix reviews with moderate ambiguity or moderate risk\n- THOROUGH: critical surfaces, architectural novelty, high risk, broad change sets, or strong need for independent reviewer perspectives\n\nMinimal routing guidance:\n- if `boundaryConfidence = Low`, bias toward boundary/context follow-up before strong recommendation confidence\n- if `changeTypeProfile = api_contract_change`, bias toward contract/consumer/backward-compatibility scrutiny\n- if `changeTypeProfile = data_model_or_migration`, bias toward rollout / compatibility / simulation scrutiny\n- if `changeTypeProfile = security_sensitive`, bias toward adversarial/runtime-risk scrutiny and lower tolerance for weak evidence\n- if `changeTypeProfile = test_only`, bias toward stronger false-positive suppression\n- if `shapeProfile = mechanically_noisy_change`, bias toward stronger noise filtering and lower appetite for style-only findings\n\nStep 6 \u2014 Optional deeper context:\nIf `reviewMode` is STANDARD or THOROUGH and context remains incomplete, and delegation is available, spawn TWO WorkRail Executors SIMULTANEOUSLY running `routine-context-gathering` with focus=COMPLETENESS and focus=DEPTH. Synthesize both outputs before finishing this step.\n\nStep 7 \u2014 Human-facing artifact:\nChoose `reviewDocPath` only if a live artifact will materially improve human readability. Default suggestion: `mr-review.md` at the project root. This artifact is optional and never canonical workflow state.\n\nFallback behavior:\n- if PR/MR is not found but a branch/diff is inspectable, continue with downgraded context confidence and disclose missing PR context later\n- if the branch is inspectable but merge-base / ancestor remains ambiguous, continue with downgraded boundary confidence, set `needsBoundaryFollowup = true`, and disclose the uncertainty later\n- if ticket or supporting docs are missing, continue with downgraded context confidence and avoid overclaiming intent-sensitive findings\n- if only a patch/diff is available, continue if it is inspectable, but keep lower confidence on intent/boundary-dependent conclusions\n- if the review target itself is missing, ask only for that missing artifact and stop\n\nSet these keys in the next `continue_workflow` call's `context` object:\n- `reviewTargetKind`\n- `reviewTargetSource`\n- `prUrl`\n- `prNumber`\n- `baseCandidate`\n- `mergeBaseRef`\n- `boundaryConfidence`\n- `contextConfidence`\n- `mrTitle`\n- `mrPurpose`\n- `ticketRefs`\n- `ticketContext`\n- `supportingDocsFound`\n- `policySourcesFound`\n- `accessibleContextSources`\n- `missingContextSources`\n- `focusAreas`\n- `changedFileCount`\n- `criticalSurfaceTouched`\n- `reviewMode`\n- `riskLevel`\n- `shapeProfile`\n- `changeTypeProfile`\n- `maxParallelism`\n- `reviewDocPath`\n- `contextSummary`\n- `candidateFiles`\n- `moduleRoots`\n- `contextUnknownCount`\n- `coverageGapCount`\n- `authorIntentUnclear`\n- `needsSimulation`\n- `needsBoundaryFollowup`\n- `needsContextFollowup`\n- `needsReviewerBundle`\n- `coreReviewSurface`\n- `likelyNoiseOrMechanicalChurn`\n- `likelyInheritedOrOutOfScopeChanges`\n- `reviewSurfaceSummary`\n- `reviewScopeWarnings`\n- `openQuestions`\n\nRules:\n- answer your own questions with tools whenever possible\n- only keep true human-decision questions in `openQuestions`\n- keep `openQuestions` bounded to the minimum necessary\n- classify AFTER exploring, not before\n- before leaving this phase, either establish the likely review boundary or explicitly record why you could not\n\nAlso set in the context object: one sentence describing what you are trying to accomplish (e.g. \"implement OAuth refresh token rotation\", \"review PR #47 before merge\"). This populates the session title in the Workspace console immediately.",
+"prompt": "Build the review foundation in one pass.\n\nStep 1 \u2014 Early exit / minimum inputs:\nBefore exploring, verify that the review target is real and inspectable. If the diff, changed files, or equivalent review material are completely absent and cannot be inferred with tools, ask for the minimum missing artifact and stop. Do NOT ask questions you can resolve with tools.\n\nStep 2 \u2014 Locate and bound the review target:\nAttempt to determine the strongest available review target and boundary.\n\nAttempt to establish:\n- `reviewTargetKind` from the strongest available source such as PR/MR, branch, patch, diff, or local working tree changes\n- `reviewTargetSource` describing where the target came from\n- likely PR/MR identity when available (`prUrl`, `prNumber`)\n- likely base / ancestor reference (`baseCandidate`, `mergeBaseRef`) when available\n- whether the branch may include inherited or out-of-scope changes\n- `boundaryConfidence`: High / Medium / Low\n\nDo not over-prescribe your own investigation path. Use the strongest available evidence and record uncertainty honestly.\n\nStep 3 \u2014 Enrich with context:\nRecover the strongest available intent and policy context from whatever sources are actually available.\n\nAttempt to recover:\n- MR title and purpose\n- ticket / issue / acceptance context (`ticketRefs`, `ticketContext`)\n- supporting docs / specs / rollout context (`supportingDocsFound`)\n- repo or user policy/convention context when it is likely to affect review judgment (`policySourcesFound`)\n- `contextConfidence`: High / Medium / Low\n\nStep 4 \u2014 Review-surface hygiene:\nClassify the visible change into a minimal review surface.\n\nSet:\n- `coreReviewSurface`\n- `likelyNoiseOrMechanicalChurn`\n- `likelyInheritedOrOutOfScopeChanges`\n- `reviewSurfaceSummary`\n- `reviewScopeWarnings`\n\nThe goal is not a giant ledger. The goal is to avoid treating every visible changed file as equally worthy of deep review by default.\n\nStep 5 \u2014 Classify the review:\nAfter exploration, classify the work.\n\nSet:\n- `reviewMode`: QUICK / STANDARD / THOROUGH\n- `riskLevel`: Low / Medium / High\n- `shapeProfile`: choose the best primary label from `isolated_change`, `crosscutting_change`, `mechanically_noisy_change`, or `ambiguous_boundary`\n- `changeTypeProfile`: choose the best primary label from `general_code_change`, `api_contract_change`, `data_model_or_migration`, `security_sensitive`, or `test_only`\n- `maxParallelism`: 0 / 3 / 5\n- `criticalSurfaceTouched`: true / false\n- `needsSimulation`: true / false\n- `needsBoundaryFollowup`: true / false\n- `needsContextFollowup`: true / false\n- `needsReviewerBundle`: true / false\n\nDecision guidance:\n- QUICK: very small, isolated, low-risk changes with little ambiguity\n- STANDARD: typical feature or bug-fix reviews with moderate ambiguity or moderate risk\n- THOROUGH: critical surfaces, architectural novelty, high risk, broad change sets, or strong need for independent reviewer perspectives\n\nMinimal routing guidance:\n- if `boundaryConfidence = Low`, bias toward boundary/context follow-up before strong recommendation confidence\n- if `changeTypeProfile = api_contract_change`, bias toward contract/consumer/backward-compatibility scrutiny\n- if `changeTypeProfile = data_model_or_migration`, bias toward rollout / compatibility / simulation scrutiny\n- if `changeTypeProfile = security_sensitive`, bias toward adversarial/runtime-risk scrutiny and lower tolerance for weak evidence\n- if `changeTypeProfile = test_only`, bias toward stronger false-positive suppression\n- if `shapeProfile = mechanically_noisy_change`, bias toward stronger noise filtering and lower appetite for style-only findings\n\nStep 6 \u2014 Optional deeper context:\nIf `reviewMode` is STANDARD or THOROUGH and context remains incomplete, and delegation is available, spawn TWO WorkRail Executors SIMULTANEOUSLY running `routine-context-gathering` with focus=COMPLETENESS and focus=DEPTH. Synthesize both outputs before finishing this step.\n\nStep 7 \u2014 Human-facing artifact:\nChoose `reviewDocPath` only if a live artifact will materially improve human readability. Default suggestion: `mr-review.md` at the project root. This artifact is optional and never canonical workflow state.\n\nFallback behavior:\n- if PR/MR is not found but a branch/diff is inspectable, continue with downgraded context confidence and disclose missing PR context later\n- if the branch is inspectable but merge-base / ancestor remains ambiguous, continue with downgraded boundary confidence, set `needsBoundaryFollowup = true`, and disclose the uncertainty later\n- if ticket or supporting docs are missing, continue with downgraded context confidence and avoid overclaiming intent-sensitive findings\n- if only a patch/diff is available, continue if it is inspectable, but keep lower confidence on intent/boundary-dependent conclusions\n- if the review target itself is missing, ask only for that missing artifact and stop\n\nSet these keys in the next `continue_workflow` call's `context` object:\n- `reviewTargetKind`\n- `reviewTargetSource`\n- `prUrl`\n- `prNumber`\n- `baseCandidate`\n- `mergeBaseRef`\n- `boundaryConfidence`\n- `contextConfidence`\n- `mrTitle`\n- `mrPurpose`\n- `ticketRefs`\n- `ticketContext`\n- `supportingDocsFound`\n- `policySourcesFound`\n- `accessibleContextSources`\n- `missingContextSources`\n- `focusAreas`\n- `changedFileCount`\n- `criticalSurfaceTouched`\n- `reviewMode`\n- `riskLevel`\n- `shapeProfile`\n- `changeTypeProfile`\n- `maxParallelism`\n- `reviewDocPath`\n- `contextSummary`\n- `candidateFiles`\n- `moduleRoots`\n- `contextUnknownCount`\n- `coverageGapCount`\n- `authorIntentUnclear`\n- `needsSimulation`\n- `needsBoundaryFollowup`\n- `needsContextFollowup`\n- `needsReviewerBundle`\n- `coreReviewSurface`\n- `likelyNoiseOrMechanicalChurn`\n- `likelyInheritedOrOutOfScopeChanges`\n- `reviewSurfaceSummary`\n- `reviewScopeWarnings`\n- `openQuestions`\n\nRules:\n- answer your own questions with tools whenever possible\n- only keep true human-decision questions in `openQuestions`\n- keep `openQuestions` bounded to the minimum necessary\n- classify AFTER exploring, not before\n- before leaving this phase, either establish the likely review boundary or explicitly record why you could not\n\nAlso set in the context object: one sentence describing what you are trying to accomplish (e.g. \"implement OAuth refresh token rotation\", \"review PR #47 before merge\"). This populates the session title in the Workspace console immediately.\n\nIf `validationChecklist` is provided in context (from the shaping phase), verify each item explicitly before proceeding to deeper review:\n- Each item is an acceptance criterion declared during shaping\n- A failing checklist item is a blocking finding regardless of other review depth\n- Record: which items passed, which failed, which could not be verified\n- Example: if checklist says \"Auth middleware is not modified\" and auth files changed, flag it as blocking\n\nThis is step 1b in your review process.",
 "requireConfirmation": {
 "or": [
 {
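A sketch of the `validationChecklist` input this appended passage consumes, as it might arrive from the shaping phase; the first item echoes the prompt's own example, the second is invented:

```json
{
  "validationChecklist": [
    "Auth middleware is not modified",
    "New endpoints return 403 for unauthenticated callers"
  ]
}
```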
@@ -103,7 +103,7 @@
 {
 "id": "phase-0b-scope-and-completeness-gate",
 "title": "Phase 0b: Scope & Completeness Gate",
-"prompt": "Verify that the PR delivers what was asked and nothing more.\n\nThis step runs after context is established (Phase 0) and before forming a review hypothesis. Its output feeds the fact packet in Phase 2.\n\nStep 1
+"prompt": "Verify that the PR delivers what was asked and nothing more.\n\nThis step runs after context is established (Phase 0) and before forming a review hypothesis. Its output feeds the fact packet in Phase 2.\n\nStep 1 \u2014 Enumerate acceptance criteria:\nFrom the ticket/issue/PR description recovered in Phase 0, extract a flat list of acceptance criteria. If no explicit criteria exist, infer them from the stated goal and the PR title/description. Mark each as `explicit` (stated in ticket/issue) or `inferred` (derived from goal).\n\nIf no ticket, issue, or PR description is available, record `acceptanceCriteriaSource: none` and set `scopeCheckConfidence: Low`. Continue with downgraded confidence -- do not block the review.\n\nStep 2 \u2014 Check each criterion against the diff:\nFor each acceptance criterion, examine the diff and determine:\n- `met`: the diff clearly addresses this criterion\n- `partial`: the diff partially addresses it but something appears missing\n- `missing`: the diff does not appear to address this criterion at all\n- `unclear`: insufficient context to judge\n\nCite specific files or functions for `met` and `partial` judgments. Be concrete.\n\nStep 3 \u2014 Check for scope creep:\nLook for changes in the diff that go beyond what any acceptance criterion requires. Flag any change that:\n- modifies behavior not mentioned in the ticket/goal\n- touches files unrelated to the stated purpose\n- introduces new abstractions or refactors not required by the task\n\nDistinguish necessary implementation details (e.g. extracting a helper to implement the feature) from genuine scope creep (e.g. rewriting unrelated logic while here).\n\nStep 4 \u2014 Set context keys:\nSet these keys in the next `continue_workflow` call's `context` object:\n- `acceptanceCriteria`: array of `{ criterion, source: 'explicit'|'inferred', status: 'met'|'partial'|'missing'|'unclear', evidence? }`\n- `acceptanceCriteriaSource`: `'ticket'` | `'pr_description'` | `'inferred'` | `'none'`\n- `missingCriteriaCount`: number of criteria with status `missing` or `partial`\n- `scopeCreepFlags`: array of specific out-of-scope changes found (empty array if none)\n- `scopeCreepCount`: length of `scopeCreepFlags`\n- `scopeCheckConfidence`: `High` | `Medium` | `Low`\n\nRules:\n- do not block the review on unclear criteria -- record uncertainty and continue\n- a criterion is only `missing` if you can confirm the behavior is absent from the diff, not just absent from a single file\n- scope creep findings feed into the reviewer families as potential `patterns_architecture` or `philosophy_alignment` concerns -- do not duplicate them as standalone findings here",
 "requireConfirmation": false
 },
 {
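A sketch of one `acceptanceCriteria` entry in the `{ criterion, source, status, evidence? }` shape the prompt defines; the criterion text and evidence are illustrative:

```json
{
  "criterion": "Expired tokens are rejected with a 401",
  "source": "explicit",
  "status": "met",
  "evidence": "src/auth/verify.ts adds an expiry check, covered by verify.test.ts"
}
```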