@exaudeus/workrail 3.75.0 → 3.77.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dist/console-ui/assets/index-D9pYbwS0.js +28 -0
- package/dist/console-ui/index.html +1 -1
- package/dist/coordinators/adaptive-pipeline.d.ts +8 -0
- package/dist/coordinators/context-assembly.d.ts +4 -0
- package/dist/coordinators/context-assembly.js +156 -0
- package/dist/coordinators/modes/full-pipeline.d.ts +1 -1
- package/dist/coordinators/modes/full-pipeline.js +140 -27
- package/dist/coordinators/modes/implement-shared.d.ts +3 -2
- package/dist/coordinators/modes/implement-shared.js +16 -6
- package/dist/coordinators/modes/implement.js +49 -3
- package/dist/coordinators/pipeline-run-context.d.ts +1811 -0
- package/dist/coordinators/pipeline-run-context.js +114 -0
- package/dist/daemon/context-loader.d.ts +1 -1
- package/dist/daemon/core/agent-client.d.ts +7 -0
- package/dist/daemon/core/agent-client.js +31 -0
- package/dist/daemon/core/index.d.ts +6 -0
- package/dist/daemon/core/index.js +19 -0
- package/dist/daemon/core/session-context.d.ts +14 -0
- package/dist/daemon/core/session-context.js +24 -0
- package/dist/daemon/core/session-result.d.ts +10 -0
- package/dist/daemon/core/session-result.js +92 -0
- package/dist/daemon/core/system-prompt.d.ts +6 -0
- package/dist/daemon/core/system-prompt.js +151 -0
- package/dist/daemon/io/conversation-log.d.ts +2 -0
- package/dist/daemon/io/conversation-log.js +45 -0
- package/dist/daemon/io/execution-stats.d.ts +7 -0
- package/dist/daemon/io/execution-stats.js +86 -0
- package/dist/daemon/io/index.d.ts +5 -0
- package/dist/daemon/io/index.js +24 -0
- package/dist/daemon/io/session-notes-loader.d.ts +4 -0
- package/dist/daemon/io/session-notes-loader.js +45 -0
- package/dist/daemon/io/soul-loader.d.ts +3 -0
- package/dist/daemon/io/soul-loader.js +68 -0
- package/dist/daemon/io/workspace-context-loader.d.ts +17 -0
- package/dist/daemon/io/workspace-context-loader.js +137 -0
- package/dist/daemon/runner/agent-loop-runner.d.ts +28 -0
- package/dist/daemon/runner/agent-loop-runner.js +250 -0
- package/dist/daemon/runner/construct-tools.d.ts +5 -0
- package/dist/daemon/runner/construct-tools.js +30 -0
- package/dist/daemon/runner/finalize-session.d.ts +3 -0
- package/dist/daemon/runner/finalize-session.js +75 -0
- package/dist/daemon/runner/index.d.ts +8 -0
- package/dist/daemon/runner/index.js +18 -0
- package/dist/daemon/runner/pre-agent-session.d.ts +7 -0
- package/dist/daemon/runner/pre-agent-session.js +227 -0
- package/dist/daemon/runner/runner-types.d.ts +73 -0
- package/dist/daemon/runner/runner-types.js +39 -0
- package/dist/daemon/runner/tool-schemas.d.ts +1 -0
- package/dist/daemon/runner/tool-schemas.js +151 -0
- package/dist/daemon/session-scope.d.ts +1 -1
- package/dist/daemon/startup-recovery.d.ts +20 -0
- package/dist/daemon/startup-recovery.js +323 -0
- package/dist/daemon/state/index.d.ts +6 -0
- package/dist/daemon/state/index.js +14 -0
- package/dist/daemon/state/session-state.d.ts +23 -0
- package/dist/daemon/state/session-state.js +44 -0
- package/dist/daemon/state/stuck-detection.d.ts +22 -0
- package/dist/daemon/state/stuck-detection.js +25 -0
- package/dist/daemon/state/terminal-signal.d.ts +9 -0
- package/dist/daemon/state/terminal-signal.js +10 -0
- package/dist/daemon/tools/file-tools.d.ts +1 -1
- package/dist/daemon/turn-end/detect-stuck.d.ts +2 -2
- package/dist/daemon/turn-end/detect-stuck.js +2 -2
- package/dist/daemon/turn-end/step-injector.d.ts +1 -1
- package/dist/daemon/types.d.ts +105 -0
- package/dist/daemon/types.js +11 -0
- package/dist/daemon/workflow-enricher.d.ts +16 -0
- package/dist/daemon/workflow-enricher.js +58 -0
- package/dist/daemon/workflow-runner.d.ts +13 -277
- package/dist/daemon/workflow-runner.js +63 -1421
- package/dist/manifest.json +280 -56
- package/dist/trigger/coordinator-deps.d.ts +1 -1
- package/dist/trigger/coordinator-deps.js +131 -0
- package/dist/trigger/delivery-client.d.ts +1 -1
- package/dist/trigger/delivery-pipeline.d.ts +1 -1
- package/dist/trigger/notification-service.d.ts +1 -1
- package/dist/trigger/trigger-listener.js +6 -2
- package/dist/trigger/trigger-router.d.ts +2 -2
- package/dist/v2/durable-core/domain/artifact-contract-validator.js +99 -0
- package/dist/v2/durable-core/schemas/artifacts/discovery-handoff.d.ts +39 -0
- package/dist/v2/durable-core/schemas/artifacts/discovery-handoff.js +10 -1
- package/dist/v2/durable-core/schemas/artifacts/index.d.ts +2 -1
- package/dist/v2/durable-core/schemas/artifacts/index.js +12 -1
- package/dist/v2/durable-core/schemas/artifacts/phase-handoff.d.ts +89 -0
- package/dist/v2/durable-core/schemas/artifacts/phase-handoff.js +56 -0
- package/docs/authoring-v2.md +12 -0
- package/docs/ideas/backlog.md +639 -25
- package/docs/reference/worktrain-daemon-invariants.md +33 -49
- package/docs/vision.md +5 -15
- package/package.json +2 -2
- package/workflows/coding-task-workflow-agentic.json +9 -6
- package/workflows/mr-review-workflow.agentic.v2.json +2 -2
- package/workflows/wr.discovery.json +2 -1
- package/workflows/wr.shaping.json +7 -4
- package/dist/console-ui/assets/index-BvBihscd.js +0 -28
package/docs/ideas/backlog.md
CHANGED
@@ -192,6 +192,200 @@ The delivery pipeline was extracted into `delivery-pipeline.ts` with explicit st

## WorkTrain Daemon

### Context injection bugs: double-injection, byte-slice truncation, workspaceRules[0] drop (Apr 30, 2026)

**Status: idea** | Priority: high

**Score: 13** | Cor:3 Cap:1 Eff:3 Lev:3 Con:3 | Blocked: no

Three active bugs in the context injection pipeline that waste tokens, produce incorrect truncation, and silently discard workspace context. Confirmed by codebase audit (Apr 30, 2026).

1. **Double-injection (`session-context.ts:117-119`):** `trigger.context` is JSON-serialized in full into the initial user message. Since coordinators write `assembledContextSummary` *into* `trigger.context`, the assembled context appears twice -- once in the system prompt (8KB cap applied) and once in the initial user message (uncapped). These diverge when the content exceeds 8KB.

2. **Byte-slice truncation (`system-prompt.ts:200-202`):** `assembledContextSummary` is truncated by raw byte index (`ctxStr.slice(0, 8192)`), which splits mid-sentence, mid-section, and can produce malformed UTF-8. The section-aware `buildBudgetedOutput()` pattern already exists in `src/coordinators/context-assembly.ts` and handles this correctly.

3. **`workspaceRules[0]` silent drop (`session-context.ts:106`):** `ContextBundle.workspaceRules` is typed as `ContextRule[]` but only `[0]` is consumed. All additional workspace context rules are silently dropped. The type implies per-file rules are supported; the consumer silently ignores them.

**Also in scope:** introduce `WorkflowContextSlots` typed fields on `WorkflowTrigger` (or a companion type) for system-managed context fields (`assembledContextSummary`, `priorSessionNotes`, `gitDiffStat`). This eliminates the stringly-typed `trigger.context['assembledContextSummary']` access pattern and is a prerequisite for the universal enricher (see next item). Scope Phase 0 changes to consumption sites only (`buildSystemPrompt`, `buildSessionContext`); coordinator write sites migrate in Phase 1.

**Done looks like:** no `trigger.context` JSON dump in `initialPrompt`; `assembledContextSummary` truncated at section boundaries; all `workspaceRules` entries injected; `WorkflowContextSlots` typed fields replace stringly-typed access in consumption sites.
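A minimal sketch of the `WorkflowContextSlots` shape and the double-injection fix, assuming the three field names listed above; the `slots` placement and the exclusion helper are illustrative, not the shipped design:

```typescript
// Sketch only: system-managed context slots, typed instead of stringly-keyed.
// Field names come from this entry; everything else is assumption.
interface WorkflowContextSlots {
  /** Section-structured summary assembled by the coordinator (8KB budget). */
  assembledContextSummary?: string;
  /** Prior workspace session notes, newest-first. */
  priorSessionNotes?: string[];
  /** Output of `git diff HEAD~1 --stat`. */
  gitDiffStat?: string;
}

interface WorkflowTrigger {
  context?: Record<string, unknown>;
  /** Phase 0: consumption sites read from here instead of trigger.context. */
  slots?: WorkflowContextSlots;
}

// Double-injection fix sketch: the initial user message serializes only
// operator-provided context, never the system-managed slot keys.
function serializeTriggerContext(trigger: WorkflowTrigger): string {
  const { context = {} } = trigger;
  const systemManaged = new Set([
    'assembledContextSummary',
    'priorSessionNotes',
    'gitDiffStat',
  ]);
  const operatorOnly = Object.fromEntries(
    Object.entries(context).filter(([key]) => !systemManaged.has(key)),
  );
  return JSON.stringify(operatorOnly, null, 2);
}
```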

---

### Universal context enricher for all session entry points (Apr 30, 2026)

**Status: idea** | Priority: high

**Score: 11** | Cor:1 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: yes (needs context injection bugs fixed first)

Today 4 of 6 session entry points receive zero assembled context: raw webhook triggers, direct dispatch, `spawn_agent` children, and crash-recovered sessions never get cross-session notes or git diff state. Only coordinator-spawned sessions (via `pr-review.ts` or the adaptive pipeline) get assembled context -- and even then only through opt-in coordinator logic, not structural injection.

There is no single layer that all dispatch paths share where assembly can run universally. Coordinators that care must call assembly explicitly; everything else gets nothing. This means every new entry point or coordinator is another opportunity to forget assembly.

**Design (from Apr 30 discovery):** A `WorkflowEnricher` service injected into `runWorkflow()` that fires for root sessions only (`spawnDepth === 0`). Provides prior workspace session notes (max 3, newest-first, workspace-scoped) and `git diff HEAD~1 --stat` to all entry points. Injected via `WorkflowContextSlots` typed fields (see context injection bugs item). When a coordinator has already set `assembledContextSummary`, the enricher skips prior-notes injection (coordinator's richer context takes precedence) but still provides git diff stat if absent.
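A sketch of that precedence logic, reusing the `WorkflowContextSlots` shape from the context injection bugs item; the `SessionNotesSource` and `GitStatSource` ports are hypothetical names:

```typescript
// Hypothetical ports; names are illustrative, not the daemon's real modules.
interface SessionNotesSource {
  listRecentNotes(workspace: string, limit: number): Promise<string[]>;
}
interface GitStatSource {
  diffStat(workspace: string): Promise<string>; // `git diff HEAD~1 --stat`
}

class WorkflowEnricher {
  constructor(
    private readonly notes: SessionNotesSource,
    private readonly git: GitStatSource,
  ) {}

  /** Root sessions only; writes into the typed slots, never trigger.context. */
  async enrich(
    workspace: string,
    spawnDepth: number,
    slots: WorkflowContextSlots,
  ): Promise<void> {
    if (spawnDepth !== 0) return; // children would re-assemble redundantly

    // Coordinator-assembled context wins over generic prior notes.
    if (!slots.assembledContextSummary && !slots.priorSessionNotes) {
      slots.priorSessionNotes = await this.notes.listRecentNotes(workspace, 3);
    }
    if (!slots.gitDiffStat) {
      slots.gitDiffStat = await this.git.diffStat(workspace);
    }
  }
}
```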

**Critical gate:** before this ships, run a pilot test -- one session with `assembledContextSummary` injected, inspect turn-1 reasoning for citation. If agents don't reference pre-loaded context, the investment in universal enrichment adds tokens without improving outcomes.

**Things to hash out:**

- Where exactly does the enricher inject: inside `runWorkflow()` before `buildPreAgentSession()`, or inside `buildPreAgentSession()` itself? The latter is cleaner but changes the pre-agent phase boundary.

- `listRecentSessions` must have a 1s wall-clock timeout with partial-result fallback (see the sketch after this list). Without it, large session stores silently slow all session startups. This is a spec requirement, not optional.

- `spawn_agent` children don't get enriched (they'd trigger redundant assembly for deeply nested trees). Is there a case where children should optionally enrich? Candidate: an `inheritParentContext: boolean` flag in the `spawn_agent` tool schema.
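A deadline-with-partial-results sketch for that timeout requirement, assuming the session lister can yield incrementally (the async-iterator shape is an assumption, not the current API):

```typescript
// Sketch: stop scanning at the deadline and return whatever was collected.
async function listRecentSessionsWithDeadline(
  scan: AsyncIterable<string>, // yields session ids/paths, newest-first
  limit: number,
  deadlineMs = 1000,
): Promise<{ results: string[]; partial: boolean }> {
  const started = Date.now();
  const results: string[] = [];
  for await (const entry of scan) {
    results.push(entry);
    if (results.length >= limit) return { results, partial: false };
    if (Date.now() - started > deadlineMs) {
      return { results, partial: true }; // partial-result fallback
    }
  }
  return { results, partial: false };
}
```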

---

### MemoryStore: indexed session history and mid-session query_memory tool (Apr 30, 2026)

**Status: idea** | Priority: medium

**Score: 10** | Cor:1 Cap:3 Eff:1 Lev:3 Con:2 | Blocked: yes (needs universal enricher first)

The session event log is rich -- it records goals, step notes, artifacts, delivered commits, git state, and phase handoffs. But querying it requires a full directory scan and per-session event projection on every call. `LocalSessionSummaryProviderV2` does this today and is used in exactly one place (the PR-review coordinator). Every other consumer either skips it or re-implements a slower version.

There is no mid-session memory query capability at all. An agent mid-session cannot ask "what did we decide about this module last week" and get an answer from persistent memory -- it can only use what was pre-loaded at session start.

**Design (from Apr 30 discovery):** A `MemoryStore` port backed by `~/.workrail/memory.db` (SQLite, WAL mode) indexed by `finalizeSession()` as fire-and-forget after each session completes. Query kinds v1: `recent_sessions` (by workspace path hash), `sessions_by_goal_keywords`. A `query_memory` tool added to the daemon tool set. Replaces the slow `listRecentSessions` scan in the universal enricher.
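A sketch of the port with the v1 query kinds as a typed union (this also speaks to the discriminated-union question raised below); the row shape and method names are assumptions:

```typescript
// Sketch: typed query kinds keep agents from inventing unsupported shapes.
type MemoryQuery =
  | { kind: 'recent_sessions'; workspaceHash: string; limit: number }
  | { kind: 'sessions_by_goal_keywords'; workspaceHash: string; keywords: string[] };

interface SessionSummaryRow {
  sessionId: string;
  workspaceHash: string;
  goal: string;
  completedAt: string; // ISO timestamp
  notes: string;
}

interface MemoryStore {
  /** Fire-and-forget from finalizeSession(); failure must not fail the session. */
  indexSession(row: SessionSummaryRow): Promise<void>;
  query(q: MemoryQuery): Promise<SessionSummaryRow[]>;
}
```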

Phase 2b (separate): index phase artifacts via a new `phase_artifact_appended` session event kind -- bridges the current PipelineRunContext silo into the session event log so phase artifacts are queryable alongside session notes. Requires engine schema review before implementation.

**Things to hash out:**

- SQLite native compilation may fail in some deployment environments (Docker, Alpine Linux). Mitigation: use `@sqlite.org/sqlite-wasm` (pure WASM) or make `MemoryStore` fully optional -- daemon works without it, just no indexed queries.

- `phase_artifact_appended` event schema change is the highest-risk part of Phase 2b. Should it reuse the existing artifact channel with a new content type, or be a new event kind? Each has different backward-compatibility implications.

- Should `query_memory` be a general-purpose tool or typed with specific query kinds? A typed discriminated union (as in the sketch above) prevents agents from inventing unsupported query shapes.

---

### worktrain session analyze: verify agents actually use pre-loaded context (Apr 30, 2026)

**Status: idea** | Priority: medium

**Score: 8** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no

There is no way to verify whether agents actually use pre-loaded context (soul, workspace context, `assembledContextSummary`, session notes) in their reasoning. The entire memory architecture investment (universal enricher, MemoryStore, knowledge graph) assumes agents reference pre-loaded context at turn 1 -- but this assumption is unvalidated. If agents receive 32KB of workspace context and `assembledContextSummary` but don't cite them in their reasoning before acting, richer pre-loading adds token cost without improving outcomes.

Today, validating this requires manually reading raw session transcripts, which is impractical at scale. A `worktrain session analyze <sessionId>` command that reads the agent turn events and reports whether any pre-loaded context fields were cited in turn-1 reasoning would make this automatable and support data-driven decisions about context loading investment.

**Done looks like:** `worktrain session analyze <sessionId>` reads the session event log, extracts turn-1 assistant message content, checks for citations of injected fields (workspace context file names, goal text, prior step note content), and reports a structured summary: fields injected, fields cited, fields ignored.
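A first-cut citation check is plain substring matching over the turn-1 assistant text; the field and report shapes here are assumptions, and paraphrase detection would need more than this:

```typescript
// Sketch: report which injected fields appear verbatim in turn-1 reasoning.
interface InjectedField {
  name: string;    // e.g. 'workspaceRules:docs/style.md'
  excerpt: string; // a distinctive snippet of the injected content
}

interface ContextAuditReport {
  injected: string[];
  cited: string[];
  ignored: string[];
}

function auditTurnOne(
  turnOneText: string,
  fields: InjectedField[],
): ContextAuditReport {
  const haystack = turnOneText.toLowerCase();
  const cited = fields
    .filter((f) => haystack.includes(f.excerpt.toLowerCase()))
    .map((f) => f.name);
  return {
    injected: fields.map((f) => f.name),
    cited,
    ignored: fields.map((f) => f.name).filter((n) => !cited.includes(n)),
  };
}
```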

**Things to hash out:**

- "Citation" is hard to define precisely -- the agent might paraphrase rather than quote. Does substring matching (as in the sketch above) suffice, or does this need an LLM similarity check?

- Should this be a CLI command or a console feature? The console already reads session data; this could be a "context audit" view.

- The primary use case is a one-time validation gate (before shipping the universal enricher). Does this justify a permanent command, or is it a one-off script?

---

### Per-run retrospective: structured learning from pipeline outcomes (Apr 30, 2026)

**Status: idea** | Priority: medium

**Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no

After a pipeline run completes -- whether it merged, escalated, or failed -- there is no structured mechanism for WorkTrain to record what it learned. Mistakes that occurred in one run (wrong interpretation, missed edge case, collateral damage rationalized as a tradeoff) are not surfaced to future sessions. Each run starts with the same baseline.

A per-run retrospective is a lightweight post-completion step that answers: what went wrong or unexpectedly, what assumption turned out to be false, what should the next session starting on this codebase know that this session didn't? The output would be a structured record written to the session store and made available as Tier 0 context for future sessions on the same workspace.

This is distinct from the per-step `report_issue` mechanism (which records obstacles mid-session) and from the `wr.coding-task` phase-8 retrospective workflow (which is an agent-facing step prompt). This is a coordinator-level mechanism that runs after the pipeline exits, regardless of which workflows ran.

**Things to hash out:**

- Who runs the retrospective -- the coordinator (deterministic, reads phase results and produces structured output), a lightweight LLM step, or the agent in a final workflow phase?

- What is the output format? A structured `RetrospectiveArtifactV1` that feeds Tier 0 context injection, or freeform notes that accumulate in a `workspace-knowledge.md` file?

- Where does the output live? Per-run (alongside `PipelineRunContext`), per-workspace (accumulated knowledge store), or per-session in the session store?

- When a retrospective records "assumption X was wrong," how does that fact reach future sessions? It needs to be injected as Tier 0 context -- which requires the context loading path to know where to look.

- Should the retrospective run on every pipeline outcome (merge, escalate, timeout, error), or only on non-merge outcomes where something went wrong?

---

### Phase quality gate policy: partial vs escalate (May 5, 2026)

**Status: idea** | Priority: medium

**Score: 9** | Cor:2 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no

The current phase quality gate policy (implemented in living work context, PR #939) is: `fallback` → escalate, `partial` → proceed with warning, `full` → proceed normally. The `partial` path is a deliberate judgment call that favors progress over quality: the agent ran for 25-65 minutes and produced partial output, and retrying might also produce partial output.
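The policy as a routing table (a sketch; the `PhaseQuality` and `GateAction` names are assumptions and PR #939's actual identifiers may differ):

```typescript
// Sketch of the shipped policy: fallback escalates, partial proceeds with a
// warning recorded into the context summary, full proceeds normally.
type PhaseQuality = 'full' | 'partial' | 'fallback';
type GateAction =
  | { action: 'proceed'; warning?: string }
  | { action: 'escalate'; reason: string };

function phaseQualityGate(phase: string, quality: PhaseQuality): GateAction {
  switch (quality) {
    case 'full':
      return { action: 'proceed' };
    case 'partial':
      return {
        action: 'proceed',
        warning: `${phase} produced partial output; downstream context may have gaps`,
      };
    case 'fallback':
      return { action: 'escalate', reason: `${phase} fell back; operator review needed` };
  }
}
```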

The open question: should `partial` also escalate, or is "proceed with warning" the right default? This requires observability data to answer. If `partial` phases regularly produce wrong downstream output (review catches issues caused by missing upstream context, fix loops triggered by context gaps), the policy should shift to escalate-on-partial. If `partial` phases produce acceptable output, the current policy is correct.

**Things to hash out:**

- What metric determines whether `partial` downstream output is "wrong enough" to justify policy change? Review findings that cite missing upstream context? Fix loop iteration count?

- Should the policy be configurable per-trigger (some pipelines tolerate partial, others don't)?

- Should the `partial` warning in `assembledContextSummary` be structured enough that the downstream agent can flag "I was working with incomplete context" in its handoff artifact, making the degradation chain traceable?

- Is there a smarter policy -- e.g. retry the prior phase once before escalating?

**Note:** This is not a correctness problem with the current implementation. `fallback` correctly escalates. `partial` correctly proceeds with an explicit warning. The question is whether the `partial` threshold is in the right place. Revisit after observing real pipeline runs.

---

### Lifecycle integration tests: assert each workflow emits expected handoff artifact (May 5, 2026)

**Status: idea** | Priority: medium

**Score: 8** | Cor:2 Cap:1 Eff:2 Lev:2 Con:3 | Blocked: no

Issue #934 (living work context) required "lifecycle integration tests asserting each of the 4 workflows (wr.shaping, wr.coding-task, wr.discovery, mr-review) emits expected artifact at final step." PR #939 shipped the adversarial behavioral test (proves the chain works end-to-end) and the `contractRef` validation test (proves no unregistered refs ship), but did not ship per-workflow lifecycle harness tests that run each workflow through compilation + stepping and assert the final step emits the correct artifact kind.

Without these tests: a workflow prompt change that removes or breaks the artifact emission instruction would pass all existing tests (smoke test only checks compilation, not artifact emission) until a real pipeline run catches it.

**Done looks like:** `tests/lifecycle/` tests that run `wr.shaping`, `wr.coding-task`, `wr.discovery`, and `mr-review-workflow` through the lifecycle harness and assert the final step produces an artifact with the expected `kind` field.
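A sketch of one such test, assuming a vitest-style runner; the `runWorkflowLifecycle` helper and the `kind` strings are placeholders, not the shipped harness API:

```typescript
import { describe, expect, it } from 'vitest';
// Hypothetical harness helper: compiles the workflow, steps it to completion,
// and returns the artifacts emitted by the final step.
import { runWorkflowLifecycle } from './lifecycle-harness';

const cases = [
  { workflow: 'wr.shaping', expectedKind: 'shaping-handoff' },
  { workflow: 'wr.coding-task', expectedKind: 'coding-handoff' },
  { workflow: 'wr.discovery', expectedKind: 'discovery-handoff' },
  { workflow: 'mr-review-workflow', expectedKind: 'review-handoff' },
];

describe('lifecycle: final step emits the expected handoff artifact', () => {
  for (const { workflow, expectedKind } of cases) {
    it(`${workflow} emits ${expectedKind}`, async () => {
      const { finalArtifacts } = await runWorkflowLifecycle(workflow);
      expect(finalArtifacts.map((a) => a.kind)).toContain(expectedKind);
    });
  }
});
```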

---

### Slack/Teams/chat integration for pipeline completion alerts (May 4, 2026)

**Status: idea** | Priority: medium

**Score: 9** | Cor:1 Cap:3 Eff:2 Lev:1 Con:2 | Blocked: no

When WorkTrain completes a pipeline run -- whether it produced a PR, escalated, timed out, or failed -- the operator currently has no push notification. They have to poll the console or check their email. For overnight-safe autonomous operation (the vision's stated success condition), the operator needs to know when work is ready for their attention without having to check. Beyond the individual operator, the team that will review the PR also needs to know it exists and is ready. Neither is addressed today.

The use case has two layers: (1) operator-facing -- "your pipeline finished, here's the PR URL and outcome summary," sent to the operator's Slack/Teams DM or a dedicated channel; (2) team-facing -- "a PR is ready for review," sent to the team's review channel with enough context for a reviewer to triage without navigating to GitHub.

**Things to hash out:**

- Is this a WorkTrain daemon concern (coordinator sends notification after pipeline completion) or a trigger-layer concern (configured alongside the trigger)? The `callbackUrl` mechanism already exists for HTTP POST on completion -- is Slack/Teams just a specialized callback, or does it need first-class support?

- What is the configuration model? Per-trigger (`notifyOnComplete: { slack: { channel: "#pr-reviews", token: "$SLACK_TOKEN" } }`) or workspace-level (`~/.workrail/config.json`)? One possible per-trigger shape is sketched at the end of this entry.

- How does the team-facing notification avoid becoming noise? If WorkTrain opens 10 PRs in a day, each triggering a Slack message, the channel becomes unusable. Is there a batching, threading, or filtering mechanism?

- What is the authentication and secret management story? Same `$ENV_VAR_NAME` resolution as trigger HMAC secrets, or a separate credentials store?

**See also:** Daemon working hours / dispatch scheduling (below) -- notifications sent outside working hours are noise.
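A sketch of the per-trigger configuration shape floated above; every field beyond `notifyOnComplete` is an assumption, and the `$ENV_VAR` values mirror the existing trigger HMAC secret convention:

```typescript
// Sketch: per-trigger completion notification config.
interface NotifyOnComplete {
  slack?: {
    channel: string;      // e.g. "#pr-reviews"
    token: string;        // "$SLACK_TOKEN" -- resolved like trigger HMAC secrets
    dmOperator?: boolean; // layer 1: operator-facing DM
  };
  teams?: {
    webhookUrl: string;   // "$TEAMS_WEBHOOK_URL"
  };
  /** Outcomes that fire a notification; default could be all of them. */
  on?: Array<'merged' | 'escalated' | 'timeout' | 'error'>;
}
```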

---

### Daemon working hours and dispatch scheduling (May 4, 2026)

**Status: idea** | Priority: medium

**Score: 9** | Cor:1 Cap:3 Eff:2 Lev:2 Con:2 | Blocked: no

WorkTrain is designed for overnight-safe autonomous operation, but "overnight-safe" currently means the daemon keeps working through the night without human oversight -- not that it respects the operator's or team's working hours. A PR opened at 2am sits unreviewed until morning. Slack/Teams notifications at 3am are noise. Triggers that fire from monitoring alerts at midnight might not be appropriate to dispatch.

There is no current mechanism to configure when the daemon dispatches new sessions, when it sends notifications, or when it holds work for the next business day.

**Things to hash out:**

- What is the scope? Working hours could affect: (a) trigger dispatch (hold incoming triggers until working hours), (b) notifications (send alerts only during working hours), (c) both. These may need separate configuration.

- What is the configuration model? Per-workspace (`~/.workrail/config.json: { workingHours: { timezone: "America/New_York", days: ["Mon","Tue","Wed","Thu","Fri"], start: "09:00", end: "18:00" } }`) or per-trigger (some triggers are critical and should dispatch any time)? A dispatch-gate sketch over this shape closes this entry.

- How does "critical" work? An on-call incident trigger probably should not be gated by working hours. What is the mechanism for a trigger to opt out? A `priority: critical` flag, or explicit `ignoreWorkingHours: true`?

- What happens to triggers that fire outside working hours? Queue and dispatch at next working-hours start, discard, or dispatch anyway but suppress notifications?

- How does this interact with multi-timezone teams?

**See also:** Slack/Teams notification integration (above) -- the two features are designed to be used together.
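A dispatch-gate sketch over that configuration shape, using `Intl.DateTimeFormat` for timezone-local evaluation; `ignoreWorkingHours` is one of the opt-out candidates above, not a decided mechanism:

```typescript
interface WorkingHours {
  timezone: string; // e.g. "America/New_York"
  days: string[];   // e.g. ["Mon", "Tue", "Wed", "Thu", "Fri"]
  start: string;    // "09:00"
  end: string;      // "18:00"
}

function isWithinWorkingHours(cfg: WorkingHours, now = new Date()): boolean {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone: cfg.timezone,
    weekday: 'short',
    hour: '2-digit',
    minute: '2-digit',
    hour12: false,
  }).formatToParts(now);
  const get = (type: string) => parts.find((p) => p.type === type)?.value ?? '';
  const day = get('weekday');                     // "Mon" .. "Sun"
  const time = `${get('hour')}:${get('minute')}`; // "HH:mm" compares lexicographically
  return cfg.days.includes(day) && time >= cfg.start && time < cfg.end;
}

// Sketch: hold the trigger unless it opts out of working-hours gating.
function shouldDispatchNow(
  cfg: WorkingHours,
  trigger: { ignoreWorkingHours?: boolean },
): boolean {
  return trigger.ignoreWorkingHours === true || isWithinWorkingHours(cfg);
}
```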

---

### Assumption resolution before acting: agents should fill information gaps with available tools (May 4, 2026)

**Status: idea** | Priority: high

**Score: 11** | Cor:3 Cap:3 Eff:2 Lev:2 Con:1 | Blocked: no

Pipeline agents currently have two options when they hit an information gap: proceed with an explicit assumption, or get stuck. Neither is optimal. The coding agent might assume a function signature, proceed with the wrong implementation, and only discover the error in review. Any phase agent might miss context that was resolvable with a two-second tool call (gh, glab, jira, glean, codebase search, MCP tools). There is currently no structured mechanism in the workflow engine or in individual workflows that asks agents to explicitly audit their open assumptions and use available tools to close them before committing to an approach.

**Things to hash out:**

- Is this a workflow-level concern (each workflow author decides when and where to add assumption resolution) or an engine-level concern (the engine injects it automatically)?

- Is the right mechanism a routine (injected via `templateCall`, creating a visible dedicated step with notes output), a feature (engine-injected constraint on every step), or both?

- Should assumption resolution happen once per workflow (front-loaded as the first step) or opportunistically (at any step where the agent identifies a gap)?

- What tools should the agent be expected to use? The set varies by workspace (some have Jira, some have GitLab, some have Glean). A generic routine can only say "use whatever tools are available" -- is that specific enough to be useful?

- How does this interact with the task-scoped rules idea and the ephemeral per-turn injection idea? All three are trying to get the right context to the agent at the right time.

---

### Intent gap: agent builds what it understood, not what the user meant (Apr 30, 2026)

**Status: idea** | Priority: medium

@@ -223,6 +417,239 @@ This is categorically different from bugs (the agent implemented the right thing

---

### Intent resolution: tiered context harvest to close the intent gap before coding starts (May 4, 2026)

**Status: designing -- not ready for shaping or implementation** | Priority: high

**Score: 13** | Cor:3 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no

The intent gap entry names the failure mode. This entry is the resolution design. The root cause is not that agents misread tickets -- it is that agents form interpretations without access to the context that would resolve ambiguity. The fix is a structured, tiered context harvest during discovery, a mid-discovery interpretation checkpoint, and a configurable escalation ladder when ambiguity survives the harvest.

**The core insight:** a ticket description is almost never the most authoritative source of intent. The epic it belongs to, the design doc it references, the Slack thread where the feature was scoped, the vision doc that defines what the project is trying to become -- these carry far more signal. An agent that only reads the ticket is working with the thinnest slice of available context. Importantly, none of these sources need to live in the codebase -- they can be in Confluence, Notion, Slack, Google Docs, or a GitHub wiki. The tool layer is the access mechanism regardless of where the content lives.

**Two distinct failure subtypes -- require different responses:**

Research (AmbiEval 2026, Orchid 2026, AskBench 2026) distinguishes two failure modes:

- **Subtype A -- vagueness/ambiguity:** the ticket is underspecified or has multiple valid interpretations. "Delete the record" -- soft-delete or hard-delete? The tiered harvest + council addresses this.

- **Subtype B -- wrong prior:** the ticket is clear, but the agent has a systematically wrong prior about what tickets like this mean in this codebase. "Fix the auth issue" -- agent knows what auth issues usually mean, but this codebase does it differently. No amount of context harvest resolves this; it requires challenging the agent's assumptions explicitly.

**Critical decision before building: measure which subtype dominates your actual failure distribution.** Retrospectively classify 10-20 past wrong-implementation cases as Subtype A vs B. If Subtype B is significant, the council and detection scaffold are insufficient -- Subtype B requires assumption-logging and adversarial plan review, not just ambiguity detection. The entire design below addresses Subtype A well and Subtype B only partially.

**Tiered context harvest:**

Tier 0 -- project identity (always injected, not searched):

- Vision doc, active backlog items, design locks/ADRs for the affected area, coding philosophy

- Not searched for relevance -- injected unconditionally because they constrain every interpretation

- Source locations are workspace-configured and can be anywhere: local files (`docs/vision.md`), Confluence, Notion, Google Docs, GitHub wiki -- resolved via the same tool layer as other tiers

- `ContextLoader` resolves Tier 0 sources before session start using whatever tools the workspace has configured

- If Tier 0 is empty (no project identity configured), the minimum `ambiguityLevel` floor is `'uncertain'` regardless of agent self-report

Tier 1 -- structured task sources (highest signal, deterministic):

- Jira/Linear: linked epics, acceptance criteria, parent ticket, comments, attachments

- GitHub/GitLab: linked PRs, prior implementations of the same feature, commit history on affected files, related issues

- The ticket's own epic/milestone context -- a vague ticket is often disambiguated by the epic it belongs to

Tier 2 -- conversational sources (high signal, noisier):

- Slack: the thread where the ticket was discussed, the channel where the feature was scoped, off-ticket decisions

- Notion/Confluence/Google Docs: design docs linked in the ticket or epic, ADRs for the affected area

- Hard retrieval budget: top 2-3 most relevant sources, 4K token cap total. Beyond budget, sources logged as "available but not injected" and added to `unresolvedAssumptions[]`

- Conflict resolution: when a lower-tier source contradicts a higher-tier source, the higher tier wins and the contradiction surfaces in `unresolvedAssumptions[]`. Priority: Tier 0 ADRs > Tier 1 acceptance criteria > Tier 1 linked epic > Tier 2 design docs > Tier 2 Slack threads

Tier 3 -- codebase itself:

- How similar features were implemented previously

- Naming conventions, existing abstractions, patterns that constrain valid interpretations

- Tests that describe current behavior of the affected area

**"Enough context" checklist (harvest stops when satisfied, not when budget is full):**

1. Tier 0 was injected or confirmed unavailable

2. Tier 1 structured sources were queried (epic, acceptance criteria, linked issues)

3. If Tier 1 returned ambiguity-relevant signal, Tier 2 search attempted for the most specific query

4. Agent can articulate at least one rival interpretation with evidence

**Ticket quality pre-flight (lightweight, independent of the full harvest):**

Before dispatch, run 5 INVEST-based quality checks on the ticket: unambiguous, testable, non-compound, has acceptance criteria, scoped. USeR (arxiv 2503.02049) provides 34 automated RE quality metrics; these 5 are the highest-signal subset. Deployable independently of the full detection scaffold. A ticket that fails multiple quality checks is routed to Subtype A treatment immediately without spending turns on harvest.
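A sketch of the five pre-flight checks as cheap heuristics; the regexes are rough stand-ins for USeR-style metrics, not a validated implementation:

```typescript
// Sketch: INVEST-based ticket pre-flight. Heuristic regexes only.
interface TicketQualityReport {
  failed: string[];
  passed: string[];
}

const checks: Array<[name: string, ok: (t: string) => boolean]> = [
  // Unambiguous: no weak modals or vague quantifiers.
  ['unambiguous', (t) => !/\b(should|may|might|fast|slow|large|small)\b/i.test(t)],
  // Testable: mentions observable behavior or verification.
  ['testable', (t) => /\b(test|verify|assert|expect|returns?|fails?)\b/i.test(t)],
  // Non-compound: no multi-task markers.
  ['non-compound', (t) => !/\b(and also|additionally|while you're at it)\b/i.test(t)],
  // Has an acceptance criteria section.
  ['has-acceptance-criteria', (t) => /acceptance criteria/i.test(t)],
  // Scoped: names at least one concrete file or path.
  ['scoped', (t) => /[\w-]+\.(ts|js|py|go|rs)|\/[\w/-]+/.test(t)],
];

function preflight(ticket: string): TicketQualityReport {
  const failed: string[] = [];
  const passed: string[] = [];
  for (const [name, ok] of checks) (ok(ticket) ? passed : failed).push(name);
  return { failed, passed };
}
```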

**Tool graceful degradation:** tool failure never blocks session start. When a configured source is unreachable, log the error, treat as empty, include in `unresolvedAssumptions[]`, and elevate `ambiguityLevel` accordingly. When a tool is not configured, skip silently.

**Mid-discovery interpretation checkpoint:**

Not pre-discovery (too low signal) and not post-discovery (too expensive to correct). The right spot is early in discovery after the agent has read the file structure, recent git history, relevant modules, and harvested Tier 0-1 context. Roughly turns 3-5.

The checkpoint first classifies task type, then produces the interpretation artifact:

`taskType: 'targeted_fix' | 'feature' | 'refactor' | 'architectural'`

- `targeted_fix`: well-scoped, additive, low ambiguity risk -- council can be skipped

- `feature`: new behavior, moderate ambiguity risk

- `refactor`: structural change, high ambiguity risk

- `architectural`: systemic change -- always requires council, minimum `ambiguityLevel` is `'uncertain'`

Interpretation artifact:

- `interpretation`: "I understand this task as X"

- `rivalInterpretations[]`: genuine alternative readings -- must be architecturally different, not minor variations. Use falsification forcing: "What is the single most important word or phrase that, if read differently, leads to a substantially different implementation? Describe both implementations."

- `unresolvedAssumptions[]`: what would have to be true for the primary interpretation to be wrong

- `ambiguityLevel: 'clear' | 'uncertain' | 'ambiguous'` -- self-reported, used as floor only

- `confidenceBreakdown`: `{ tier0Injected, tier1Complete, tier2Retrieved, rivalInterpretationStrength: 'weak' | 'plausible' | 'strong', unresolvedAssumptionCount, overallAmbiguityLevel }`
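Collecting the fields above into one shape (a sketch; only the field names listed in this entry come from the design, the typing around them is illustrative):

```typescript
type TaskType = 'targeted_fix' | 'feature' | 'refactor' | 'architectural';
type AmbiguityLevel = 'clear' | 'uncertain' | 'ambiguous';

interface ConfidenceBreakdown {
  tier0Injected: boolean;
  tier1Complete: boolean;
  tier2Retrieved: boolean;
  rivalInterpretationStrength: 'weak' | 'plausible' | 'strong';
  unresolvedAssumptionCount: number;
  overallAmbiguityLevel: AmbiguityLevel;
}

interface InterpretationArtifact {
  taskType: TaskType;
  interpretation: string;          // "I understand this task as X"
  rivalInterpretations: string[];  // architecturally different readings
  unresolvedAssumptions: string[]; // what would make the primary reading wrong
  ambiguityLevel: AmbiguityLevel;  // self-reported, used as a floor only
  confidenceBreakdown: ConfidenceBreakdown;
}
```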

**Add: clarification question generation as an independent signal.** Ask: "What is the one question you would most want answered before implementing this?" A specific high-stakes question ("Does 'delete' mean soft-delete or hard-delete?") = ambiguous. Inability to generate a meaningful question = likely clear. Specificity and number of non-trivial questions generated is an independent ambiguity meter (KC et al. 2025).

**Critical: self-reported ambiguity is untrustworthy.** RLHF trains models to provide confident, forward-moving responses (Sharma et al. 2023). Use `max(introspective, structural)` as the effective level -- combined in the sketch after the signal list below:

Structural pre-filter signals (fast, no LLM, computed before checkpoint):

- Presence of weak modals ("should", "may"), vague quantifiers ("fast", "large"), passive without agent, undefined pronouns, no acceptance criteria -- RE literature, 70-89% precision on formal requirements

- `taskType` is `'architectural'` or `'refactor'`

- Tier 0 is empty

- `unresolvedAssumptionCount > 2`

- Tier 1 returned empty
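A sketch of the effective-level computation over those signals, reusing `AmbiguityLevel` and `TaskType` from the artifact sketch above; the thresholds are assumptions, not tuned values:

```typescript
// Sketch: effective ambiguity = max(self-reported, structurally derived).
const order: Record<AmbiguityLevel, number> = { clear: 0, uncertain: 1, ambiguous: 2 };

interface StructuralSignals {
  weakLanguageHits: number; // weak modals, vague quantifiers, etc.
  taskType: TaskType;
  tier0Empty: boolean;
  tier1Empty: boolean;
  unresolvedAssumptionCount: number;
}

function structuralLevel(s: StructuralSignals): AmbiguityLevel {
  const riskyType = s.taskType === 'architectural' || s.taskType === 'refactor';
  if (s.weakLanguageHits > 2 || (s.tier0Empty && s.tier1Empty)) return 'ambiguous';
  if (riskyType || s.tier0Empty || s.tier1Empty || s.unresolvedAssumptionCount > 2) {
    return 'uncertain';
  }
  return 'clear';
}

function effectiveAmbiguity(
  introspective: AmbiguityLevel,
  signals: StructuralSignals,
): AmbiguityLevel {
  const structural = structuralLevel(signals);
  return order[introspective] >= order[structural] ? introspective : structural;
}
```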

Semantic entropy sampling (behavioral, no self-report):

Sample the interpretation step 5-7 times at temperature ~0.8. Cluster semantically equivalent outputs. Compute Shannon entropy over clusters. High entropy = model is generating genuinely different interpretations, independent of self-report. Well-established (Wang et al. ICLR 2023, Kuhn et al. ICLR 2023 Spotlight). Cost: ~6-8x single inference, fully parallelizable.
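A sketch of the entropy computation, with the semantic-equivalence judge left abstract (in practice an NLI model or LLM judge, per Kuhn et al.); the greedy clustering is an assumption:

```typescript
// Sketch: cluster sampled interpretations by semantic equivalence, then
// compute Shannon entropy over cluster sizes.
type EquivalenceJudge = (a: string, b: string) => Promise<boolean>;

async function semanticEntropy(
  samples: string[], // 5-7 interpretations sampled at temperature ~0.8
  equivalent: EquivalenceJudge,
): Promise<number> {
  const clusters: string[][] = [];
  for (const s of samples) {
    let placed = false;
    for (const cluster of clusters) {
      if (await equivalent(cluster[0], s)) {
        cluster.push(s);
        placed = true;
        break;
      }
    }
    if (!placed) clusters.push([s]);
  }
  // High entropy = many distinct interpretation clusters.
  const n = samples.length;
  return -clusters.reduce((h, c) => {
    const p = c.length / n;
    return h + p * Math.log2(p);
  }, 0);
}
```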

**Escalation ladder (coordinator routes deterministically on effective ambiguity level):**

1. `'clear'` → proceed to full discovery automatically

2. `'uncertain'` → council of agents (see below). Re-evaluate on council output.

3. Still `'uncertain'` after council + `requireIntentConfirmation: 'uncertain'` on trigger → structured clarification request to operator. Structured options: "A / B / proceed with best judgment / abandon" + default-if-no-reply timeout (e.g. 4 hours → proceed with A). Delivered via configured channel (Slack > webhook > console outbox). Correction injected as `steer`; agent re-orients mid-discovery without restarting.

4. `'ambiguous'` + `requireIntentConfirmation: 'always'` → pause for human approval.

5. Genuinely unanswerable → escalate to outbox with full context packet.

`requireIntentConfirmation: 'never' | 'uncertain' | 'always'` per trigger, defaulting to `'uncertain'`. Global workspace default overridable per trigger.
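A deterministic router sketch over the ladder, reusing `AmbiguityLevel` from the artifact sketch; the route names and the exact rung boundaries are illustrative:

```typescript
type ConfirmationPolicy = 'never' | 'uncertain' | 'always';

type RouteDecision =
  | { route: 'proceed' }
  | { route: 'council' }
  | { route: 'clarify_operator' } // structured A/B/best-judgment/abandon request
  | { route: 'pause_for_approval' }
  | { route: 'escalate_outbox' };

function routeOnAmbiguity(
  level: AmbiguityLevel,      // effective level, post-council if one ran
  policy: ConfirmationPolicy, // requireIntentConfirmation, default 'uncertain'
  councilRan: boolean,
): RouteDecision {
  if (policy === 'always' && level !== 'clear') return { route: 'pause_for_approval' };
  if (level === 'clear' || policy === 'never') return { route: 'proceed' };
  if (!councilRan) return { route: 'council' };
  // Still uncertain or ambiguous after the council ran.
  return level === 'uncertain'
    ? { route: 'clarify_operator' }
    : { route: 'escalate_outbox' };
}
```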

**Vagueness vs. ambiguity routing:**

- **Vague ticket** (underspecified -- doesn't say enough): clarification request to operator. Only the operator can add missing information. Council will not help -- both challengers fill the same gap the same way.

- **Ambiguous ticket** (multiple valid interpretations): council of agents, then operator if unresolved.

The detection layer classifies which failure mode before routing.

**Council of agents -- cross-family comparison, not same-model debate:**

The council handles ambiguous tickets. Its purpose is detecting interpretation error, not resolving genuine ambiguity (that requires the operator).

**Critical research findings:**

- "When Two LLMs Debate" (2025, 10-model study): both agents escalate to ~83% stated confidence by round 3 regardless of correctness. Never use stated confidence from a council -- compare interpretation content only.

- "Persona Collapse / Chameleon's Limit" (2026): same-model instances with different personas converge to a narrow behavioral mode regardless of role assignment. Role prompts do not produce genuinely independent populations.

- "Diversity of Thought in MAD" (2024): different model families achieve 91% vs 82% on reasoning benchmarks. Cross-family diversity reduces correlated interpretation errors.

**Cross-family model diversity is required for genuine independence.** Role assignment can be layered on top but cannot substitute for it.

The council is structured as comparison, not debate -- no "primary defends" turn:

1. Primary agent (model family A) submits interpretation artifact

2. Two challenger agents spawn in parallel from different model families (B, C), each with raw ticket + Tier 0-2 context but NOT the primary's interpretation. Each produces an independent reading.

3. Coordinator compares all three outputs for substantive semantic divergence.

4. Council produces typed output contract: `{ revisedAmbiguityLevel, failureMode: 'ambiguous' | 'vague', primaryInterpretationSurvived: boolean, winningInterpretation: { text, basis }, dissents[] }` (typed out in the sketch after this list)

5. Coordinator routes on `revisedAmbiguityLevel` and `failureMode`. Zero LLM turns.
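The step-4 contract as a TypeScript sketch, reusing `AmbiguityLevel` from the artifact sketch; the field names are from this entry, but the `dissents` element shape is an assumption:

```typescript
interface CouncilOutput {
  revisedAmbiguityLevel: AmbiguityLevel;
  failureMode: 'ambiguous' | 'vague';
  primaryInterpretationSurvived: boolean;
  winningInterpretation: {
    text: string;
    basis: string; // evidence the coordinator can surface to the operator
  };
  // Element shape assumed: which challenger dissented, and on what evidence.
  dissents: Array<{ challenger: string; reading: string; evidence: string }>;
}
```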

Challenger constraints:

- Hard `maxTurns` cap (10-15 each) -- each challenger has one job

- Spawned with `maxSubagentDepth: 1` -- challengers cannot spawn challengers

- `mode: 'blind'` isolation -- no prior phase artifacts (per context isolation modes entry below)

**ClarifyGPT consistency check (cheaper alternative to full council):**

Generate the implementation plan twice independently. If the two plans are inconsistent, ask a targeted clarification question (arxiv 2310.10996). Cheaper than a full multi-family council; useful as a pre-council filter for `'uncertain'` cases before spending on cross-family challengers.

**Program distribution divergence (Tier 2 behavioral signal where test oracles exist):**

SpecFix (2025): generate N independent implementations (N=5-10), compare behavioral divergence on tests. 43.58% of ambiguous function-specs detected, +30.9% Pass@1 on repaired specs. Hard prerequisite: requires a test oracle. Viable only for repos with good test coverage. Transfer to informal GitHub-style descriptions is the highest-priority unvalidated gap before treating as production-ready.

**Operator clarification UX:**

A useful clarification request is answerable in one decision, time-bounded, and shows what changes between interpretations:

```
Task: "Improve error handling in auth module"

Interpretation A: Add try/catch to the 3 unhandled failure points in token-service.ts (~50 lines, 1-2 hours)
Interpretation B: Redesign the error type hierarchy across the auth subsystem (~300 lines, needs separate shaping)

Evidence for A: ticket title says "improve" not "redesign"; linked issue reports a specific NPE in token-service.ts
Evidence for B: parent epic is "Auth module modernization"; prior PR comment mentioned "error types need a complete overhaul"

Reply: A / B / proceed with best judgment / abandon
[Default if no reply in 4 hours: A]
```

For overnight queues: batched clarification UX (approve/correct a queue, not N individual notifications) is more practical. Undesigned -- needs its own design pass.

**Feedback and calibration:**

Build the calibration data capture layer now. Log: checkpoint outcome, `confidenceBreakdown`, operator correction, downstream PR verdict, and review findings tagged as interpretation-related. Use behavior-based ground truth -- divergent implementations as the ambiguity label, not human majority-vote polls (majority-voted labels miscalibrate detectors by 55-87% ECE, 2026).

**`skipIntentResolution` escape hatch:**

Operator sets `skipIntentResolution: true` on a trigger, or agent self-declares skip for: very short ticket + very narrow affected area + `taskType: 'targeted_fix'` + no rival interpretations possible. Skipped sessions still require a one-line interpretation statement.

**Relationship to living work context:**

Tier 0 injection needs a dedicated system prompt section separate from `assembledContextSummary` to avoid the 8KB cap. The interpretation checkpoint artifact flows into `DiscoveryHandoffArtifactV1` and `PipelineRunContext` once living work context lands. Downstream phases should see what interpretation the discovery agent committed to.

**Research findings (resolved questions):**

- **Role prompts vs. model families**: Persona Collapse (2026) shows same-model role-separated agents converge. Cross-family required for genuine independence. Resolved: cross-family > role prompts.

- **Multi-agent debate confidence**: both agents escalate to ~83% confidence regardless of correctness. Never use stated confidence. Resolved: compare content only.

- **Rival interpretation generation**: open-ended enumeration produces anchored minor variations. Falsification forcing is more reliable. Resolved.

- **Vagueness vs. ambiguity**: empirically distinct failure modes requiring different responses. Resolved.

- **Production systems**: SWE-agent, AutoCodeRover, Agentless have no ambiguity detection phase. Confirmed by 4 independent 2025-2026 benchmarks. WorkTrain architecture is differentiated.

- **Calibration ground truth**: use divergent implementations, not human majority-vote labels. Resolved.

**Things still to hash out:**

- **Measure Subtype A vs B distribution first** -- retrospectively classify 10-20 past wrong-implementation cases before committing to the full design. If Subtype B dominates, the design needs explicit assumption-challenging and assumption-logging components that aren't here yet.

- Semantic entropy sampling cost at scale -- always on, or triggered only when structural signals fire first?

- Program distribution divergence (Tier 2) requires a test oracle. Fallback for repos without tests?

- Council model selection: which model families for challengers, how configured per workspace?

- Council cadence for large overnight queues: sampling approach may be more practical initially.

- `taskType` classification: separate pre-checkpoint step or first output of the same checkpoint?

- Batched clarification UX for overnight operators is undesigned.

- Minimal interim wiring for interpretation commitment through phases before living work context lands?

---

### Subtype B intent failure: agent has a wrong prior about what this codebase does (May 5, 2026)

**Status: idea -- needs empirical study before design** | Priority: high

**Score: 12** | Cor:3 Cap:3 Eff:2 Lev:2 Con:1 | Blocked: no

The intent resolution entry (above) addresses Subtype A failures -- tickets that are ambiguous or underspecified. This entry addresses Subtype B, which is categorically different and currently has no empirical intervention study in the literature.

**The failure mode:** The ticket is clear and specific. The agent reads it correctly. But the agent has a systematically wrong prior about what the described thing means in this codebase -- because its training data, or a superficially similar pattern it has seen, leads it to a confident interpretation that is locally coherent but wrong for this specific system.

Examples:

- "Add rate limiting to the auth service" -- agent implements token bucket at the HTTP layer because that's what rate limiting means in most codebases. This codebase does it at the middleware layer with a different interface. The ticket was clear; the agent's prior was wrong.

- "Fix the session expiry bug" -- agent finds and fixes the obvious TTL check. The actual expiry logic in this codebase is spread across three collaborating modules in a non-obvious way. The agent's mental model of "how session expiry works" doesn't match this codebase.

- "Update the delivery pipeline to handle X" -- agent knows what delivery pipelines look like. This codebase's delivery pipeline has specific invariants (atomic stage ordering, sidecar lifecycle) that violate the agent's general expectations. The update is technically correct in isolation but violates a codebase-specific invariant the agent didn't know existed.

**Why it's different from Subtype A:** You cannot fix this with more context harvest from Jira or Slack. The ticket is correctly specified. You cannot fix it with a council of agents -- challenger agents from different model families share the same wrong prior from training data. The problem is not ambiguity; it is that the agent's internal model of the codebase diverges from the actual codebase.

**Why it's hard to detect:** the agent's interpretation feels correct and internally consistent. It will self-report high confidence. The semantic entropy signal may be low (all samples converge on the same wrong interpretation). A challenger agent may produce the same wrong interpretation independently. The failure is invisible until review or testing.

**What might actually work (inferred, not empirically validated):**

*Explicit assumption surfacing before acting:* Before touching any code, require the agent to write down: "Here is how I believe this component works based on what I have read." Then verify those beliefs against the codebase. If the agent's stated model of "how the delivery pipeline works" conflicts with what the code actually does, that conflict is the signal. This is different from rival interpretations (Subtype A) -- it is rival models of the existing system.

*Assumption-challenging agent:* A separate lightweight agent reads the primary agent's stated assumptions about the codebase and actively searches for contradicting evidence. Not "is the ticket ambiguous" but "is the agent's model of this codebase correct?" Spawned with `mode: 'blind'` (no prior context) so it approaches the codebase fresh, then compares its reading to the primary agent's stated model.

*Prior-invalidation pass in discovery:* Discovery workflow includes a mandatory step: for each major architectural assumption the agent is making, find one piece of codebase evidence that would invalidate it. If the agent assumes "rate limiting is at the HTTP layer," it must search for evidence that this is wrong before proceeding. Forces falsification of the prior rather than confirmation.

*Historical session notes as prior correction:* If prior sessions have established "in this codebase, X works differently than you'd expect because Y," that context must be injected before the agent forms its model. This is the living work context applied across pipeline runs, not just within one run -- a per-workspace knowledge store of "things that are surprising about this codebase." Related to the knowledge graph backlog item.

**Why Confidence is 1 (needs discovery before design can begin):**

There is no empirical study of interventions for Subtype B in ticket-driven coding agents. AskBench's AskOverconfidence condition (arxiv 2602.11199) confirms agents fail differently on false-premise queries -- but "false premise" in a benchmark is a planted incorrect assumption, not a wrong prior from training data. The mechanisms may be similar but the intervention pathway is different. This needs:

1. Empirical measurement of how often Subtype B vs Subtype A causes WorkTrain failures (the NS2 step from the independent research brief)

2. A controlled study of whether assumption-surfacing before acting actually reduces Subtype B failures

3. Design of the assumption-challenging agent -- what exactly it reads, what it produces, how the coordinator uses it

**Relationship to other entries:**

- "Intent resolution" (above): addresses Subtype A. This entry is the Subtype B complement.

- "Living work context": the per-workspace knowledge store of codebase surprises is partial infrastructure for fixing Subtype B across sessions.

- "Knowledge graph" (backlog): structural understanding of the codebase that would give the agent a ground-truth model to compare its priors against.

- "Context isolation modes": the assumption-challenging agent needs `mode: 'blind'` to approach the codebase without anchoring on the primary agent's stated assumptions.

**Things to hash out:**

- How do you distinguish "the agent has a wrong prior" from "the ticket is genuinely ambiguous about which part of the system to change"? The boundary is fuzzy -- a ticket that doesn't name the specific module is Subtype A; a ticket that names the module but the agent's model of that module is wrong is Subtype B.

- What format should "stated assumptions" take? Free-prose is hard to verify. A structured list of `{ assumption: string, evidence: string, falsificationQuery: string }` is verifiable but requires the agent to produce it honestly.

- The assumption-challenging agent needs to approach the codebase independently. But it also needs to know what assumptions to challenge -- which means it needs the primary agent's stated assumption list. Is that contamination? No -- it is exactly the right input. The isolation is from the primary agent's conclusions, not its stated premises.

- How does this interact with the `skipIntentResolution` escape hatch? Subtype B failures can occur even on tickets that pass quality pre-screening and look unambiguous. The skip hatch should not bypass assumption surfacing for `'refactor'` or `'architectural'` tasks.

- Is the right long-term fix a knowledge graph (structural ground truth the agent can compare its model against) rather than per-session assumption surfacing? Knowledge graph is higher-confidence but much higher cost to build. Assumption surfacing is lower cost but relies on the agent honestly reporting its own priors.

---
|
|
652
|
+
|
|
226
653
|
### Scope rationalization: agent silently accepts collateral damage (Apr 30, 2026)
|
|
227
654
|
|
|
228
655
|
**Status: idea** | Priority: medium
|
|
@@ -260,10 +687,20 @@ The autonomous workflow runner (`worktrain daemon`). Completely separate from th
|
|
|
260
687
|
|
|
261
688
|
### Living work context: shared knowledge document that accumulates across the full pipeline (Apr 30, 2026)
|
|
262
689
|
|
|
263
|
-
**Status:
|
|
690
|
+
**Status: partial** | Core infra shipped May 5, 2026 (PR #939). Three gaps remain.
|
|
264
691
|
|
|
265
692
|
**Score: 13** | Cor:3 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no

+**Shipped (PR #939):** `ShapingHandoffArtifactV1` + `CodingHandoffArtifactV1` + enriched `DiscoveryHandoffArtifactV1`, `PhaseHandoffArtifact` union, `buildContextSummary()` pure function with per-phase selection, `PipelineRunContext` per-run JSON with `PhaseResult<T>`, crash recovery via `active-run.json` pointer, phase quality gates (fallback escalates, partial warns), persistence-failure escalation, 4 workflow authoring changes, adversarial behavioral test (AC 21), `contractRef` validation test. Deferred: `buildSystemPrompt()` named semantic slots, console visualization, retry logic, epic-mode task graph, extensible contract registration, per-workflow lifecycle artifact tests.
+
+**Remaining gaps (not tracked elsewhere):**
+
+1. **No end-to-end validation that context reaches downstream agents.** The `assembledContextSummary` is wired through `trigger.context` → `buildSystemPrompt()` → the system prompt, but no test runs a full pipeline (discovery → shaping → coding) and asserts that the coding agent's system prompt actually contains the discovery context. The adversarial behavioral test (AC 21) proves the pipeline structure -- it does not prove the context content is meaningful to the downstream agent.
+
+2. **Not all coordinator pipeline modes populate `assembledContextSummary`.** Some modes (e.g. quick-review) may exit without writing a full `PipelineRunContext`. When context is absent, `buildSystemPrompt()` silently injects nothing -- the downstream agent gets no prior context and no warning. Nothing checks that the coordinator always writes context before dispatching a downstream session (see the guard sketch after this list).
+
+3. **No operator visibility into injected context.** The "Prior Context" section in an agent's system prompt is invisible from the console. An operator has no way to see what context was injected into a session without reading raw conversation logs. The console should surface this -- at minimum, whether the session had prior context and how many bytes.
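+
+One way to close gap 2 without touching every mode: a guard at the dispatch boundary. This is a sketch under assumed shapes -- `DispatchArgs` and the logger signature are not the shipped API:
+
+```typescript
+interface DispatchArgs {
+  workflowId: string;
+  assembledContextSummary?: string;  // produced by the coordinator's context assembly
+}
+
+/** Warn loudly instead of silently injecting nothing when prior context is missing. */
+function assertContextOrWarn(args: DispatchArgs, log: (msg: string) => void): void {
+  if (!args.assembledContextSummary || args.assembledContextSummary.trim() === '') {
+    log(`dispatch(${args.workflowId}): no assembledContextSummary -- downstream agent will run without prior context`);
+  }
+}
+```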
+
 When a multi-agent pipeline runs -- discovery → shaping → coding → review → fix → re-review -- no agent has a complete picture of what came before it. The coding agent has the goal. The review agent has the code. The fix agent has the findings. None of them has the accumulated context from the full pipeline: why this approach was chosen over alternatives, what was ruled out, what constraints were discovered, what architectural decisions were made, what edge cases were handled, what the review found and why.

 Each agent reconstructs intent from incomplete context, which is why review finds things coding missed (review doesn't know what the coding agent was trying to do), why fix sessions address symptoms without understanding causes (no access to the architectural reasoning), and why agents repeat work that earlier agents already did.
@@ -352,6 +789,67 @@ This is related to the "Coordinator context injection standard" and "Context bud

 ---

+### Subagent context isolation modes: enforced context sharing and contamination prevention (May 5, 2026)
+
+**Status: idea** | Priority: high
+
+**Score: 12** | Cor:3 Cap:2 Eff:2 Lev:3 Con:2 | Blocked: no
+
+When WorkTrain spawns a subagent, the spawning agent decides what context to pass. Today this is purely by convention -- there is no mechanism to enforce isolation or to guarantee completeness. Two distinct failure modes require opposite fixes:
+
+**Contamination (too much context):** A challenger agent spawned to independently evaluate an interpretation receives the primary agent's interpretation in the context bundle. It anchors on that interpretation and produces a biased reading. A review agent receives the coding agent's self-assessment and validates it rather than challenging it. These are cases where context leakage actively undermines the agent's purpose -- independence destroyed by prior context.
+
+**Starvation (too little context):** A coding agent spawned without discovery findings re-investigates settled questions. A review agent without shaping constraints cannot check whether the implementation satisfies them. Context absence causes wasted work or wrong output.
+
+Today both are addressed by convention, and convention fails silently -- the spawning agent follows its own judgment, which may be wrong. Even the orchestrating agent can contaminate a challenger without realizing it (as happened when the research agents in this session were spawned without anyone noticing that context was being leaked).
+
+**The right fix is structural enforcement, not rules.** Context isolation mode should be a declared property of the spawn call, enforced by coordinator infrastructure, not managed by following instructions.
+
+**Proposed isolation modes:**
+
+```typescript
+type ContextIsolationMode =
+  // Agent receives complete accumulated context: Tier 0 project identity +
+  // prior phase artifacts + task context. Default for most pipeline phases.
+  | { mode: 'full' }
+
+  // Agent receives only the task description + Tier 0 project identity.
+  // No prior phase artifacts, no intermediate results.
+  // For agents that should approach the task fresh but know the project.
+  | { mode: 'task-only' }
+
+  // Agent receives only the raw inputs declared at spawn time.
+  // No Tier 0 injection, no prior artifacts, no accumulated context.
+  // For adversarial/challenger agents where independence is the whole point.
+  // The spawning call must explicitly declare what inputs to pass.
+  | { mode: 'blind' }
+
+  // Explicit allowlist/blocklist, for partial-context cases (e.g. a review
+  // agent gets shaping constraints but not the coding agent's self-assessment).
+  | { mode: 'custom'; include: ContextKey[]; exclude: ContextKey[] };
+```
+
+`mode: 'blind'` should be the enforced default for any session with `role: 'challenger' | 'adversarial' | 'evaluator'`. The coordinator cannot accidentally contaminate a challenger when the session declaration forbids it.
+
+**Note on 'blind' mode:** true blindness (no Tier 0 either) may be too aggressive. A challenger without the project's coding philosophy or architectural principles is missing the most important constraints. "No prior phase artifacts" is probably the right isolation boundary, not "no context whatsoever." A `challenger` mode that strips prior results but keeps Tier 0 may be more useful. Open question.
+
+**Enforcement point:** `spawnSession` in the coordinator infrastructure (`createCoordinatorDeps`). The spawning call declares the mode; the infrastructure assembles the context bundle according to the declared mode; the spawning agent cannot override it by passing extra fields. Validate at boundaries, trust inside. A sketch of the enforcement seam follows.
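+
+A minimal sketch of how that seam could look. `SpawnRequest`, `ContextKey`, and `assembleContext` are assumed shapes, not the shipped `createCoordinatorDeps` API:
+
+```typescript
+type ContextKey = 'tier0' | 'discoveryArtifact' | 'shapingArtifact' | 'codingSelfAssessment';
+
+interface SpawnRequest {
+  workflowId: string;
+  taskDescription: string;
+  isolation: ContextIsolationMode;  // declared, not advisory
+}
+
+/** Infrastructure, not the spawning agent, decides what lands in the bundle. */
+function assembleContext(req: SpawnRequest, all: Record<ContextKey, string>): Partial<Record<ContextKey, string>> {
+  const iso = req.isolation;
+  switch (iso.mode) {
+    case 'full':      return all;
+    case 'task-only': return { tier0: all.tier0 };
+    case 'blind':     return {};  // raw spawn-time inputs only, passed elsewhere
+    case 'custom': {
+      const out: Partial<Record<ContextKey, string>> = {};
+      for (const k of iso.include) {
+        if (!iso.exclude.includes(k)) out[k] = all[k];
+      }
+      return out;
+    }
+  }
+}
+```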
+
+**Observability:** when an evaluation was produced by a `blind` or `task-only` session, that fact should be recorded in the session store so the independence of the evaluation is auditable. Without this, the isolation guarantee is invisible.
+
+**Relationship to existing entries:**
+- "Subagent context package" (above) is about ensuring agents receive enough context -- the `full` and `task-only` modes are the enforcement side of that design.
+- "Council of agents" in the intent resolution entry assumes `blind` mode for challengers -- this entry is what makes that assumption enforceable.
+- `buildContextSummary(priorArtifacts, targetPhase)` in the living work context entry is the selection logic for `custom` mode.
+
+**Things to hash out:**
+- Should `mode` be declared on the workflow definition, the trigger, or the `spawnSession` call? The workflow definition is the right answer (the workflow knows its role), but it requires a new schema field.
+- How does the declared mode interact with the agent's tool access? A `blind` challenger can still read workspace files. True isolation may require tool path restrictions alongside context restrictions.
+- Custom `include`/`exclude` lists create maintenance burden as context keys evolve. Is there a better abstraction -- e.g. declaring the agent's role and having infrastructure derive the right context set from a role-to-context mapping (see the sketch after this list)?
+- Should `task-only` include or exclude Tier 0 project identity? Including it is almost always better, but the operator may have reasons to exclude it.
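+
+The role-derived alternative could be as small as a lookup table, reusing the `ContextKey` values from the enforcement sketch above -- the role names and mappings are illustrative, not an existing WorkRail schema:
+
+```typescript
+type AgentRole = 'coder' | 'reviewer' | 'challenger';
+
+/** Hypothetical role-to-context mapping: infrastructure derives the bundle from the role. */
+const contextForRole: Record<AgentRole, ContextIsolationMode> = {
+  coder:      { mode: 'full' },
+  reviewer:   { mode: 'custom', include: ['tier0', 'shapingArtifact'], exclude: ['codingSelfAssessment'] },
+  challenger: { mode: 'blind' },
+};
+```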
+
+---
+
 ### Agent-assisted backlog and issue enrichment (Apr 28, 2026)

 **Status: idea** | Priority: medium
@@ -619,6 +1117,25 @@ The daemon reads `triggers.yml` once at startup. Any change requires a full daem

 ---

+### External task tracker integrations: Jira, Linear, Notion, and beyond (Apr 30, 2026)
+
+**Status: idea** | Priority: medium
+
+**Score: 11** | Cor:1 Cap:3 Eff:1 Lev:3 Con:2 | Blocked: no
+
+WorkTrain currently picks up work from GitHub and GitLab. Most engineering teams track work in Jira, Linear, Notion, or similar systems -- not in GitHub issues. Without native trigger adapters for these systems, WorkTrain cannot become the default development workflow for teams that don't use GitHub Issues as their primary tracker.
+
+The vision says WorkTrain picks up tasks "from external systems (GitHub issues, GitLab MRs, Jira tickets, webhooks)." The webhook trigger (`provider: generic`) handles anything with a POST endpoint, but it requires the operator to wire up field extraction manually and provides no assignee filtering, label filtering, or status-transition detection out of the box. A first-class adapter for each tracker would handle the integration details and give operators a clean configuration surface.
+
+**Things to hash out:**
+- What is the right abstraction boundary? A generic polling adapter with per-tracker field mapping (same pattern as `github_issues_poll` / `gitlab_poll`) vs. a more opinionated per-tracker adapter that understands Jira workflow states, Linear priorities, etc. (see the interface sketch after this list).
+- Jira's API requires OAuth or an API token; Linear uses API keys; Notion uses integration tokens. Is secret resolution via `$ENV_VAR_NAME` sufficient, or is a richer credentials model needed?
+- For Jira specifically: issue assignment events are not available via webhook without Jira admin access to configure webhooks. Does WorkTrain need a polling adapter (`jira_poll`) as the primary path, with the webhook as an optional enhancement?
+- What context does each tracker inject into the workflow session? Jira issues have epics, acceptance criteria, sprint context, labels. Linear issues have priority, team, estimate, project. The context mapping needs to capture what's useful without overwhelming the session.
+- How does deduplication work across tracker adapters? A Jira issue that was already picked up and is in-flight must not be dispatched again on the next poll cycle, even if it was updated.
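+
+One possible shape for the generic-polling option -- `TrackerAdapter` and `ExternalTask` are hypothetical sketches, not existing WorkRail interfaces:
+
+```typescript
+/** A tracker-agnostic task, normalized from Jira/Linear/Notion fields. */
+interface ExternalTask {
+  dedupeKey: string;   // stable per-issue key, e.g. "jira:PROJ-123", for in-flight suppression
+  title: string;
+  body: string;
+  labels: string[];
+  assignee?: string;
+}
+
+/** Per-tracker adapter: polls the tracker and maps its fields into ExternalTask. */
+interface TrackerAdapter {
+  name: 'jira_poll' | 'linear_poll' | 'notion_poll';
+  poll(sinceIso: string): Promise<ExternalTask[]>;
+}
+```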
+
+---
+
 ### GitHub webhook trigger with assignee/event filtering (Apr 20, 2026)

 **Status: idea** | Priority: medium-high
@@ -879,6 +1396,8 @@ Demo repo tasks (worktrain:ready issues)

 **Score: 11** | Cor:3 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no

+**Note (May 5, 2026):** PR #939 shipped *coordinator-level* pipeline crash recovery: the `active-run.json` pointer plus the `PipelineRunContext` file let the next coordinator startup restore prior phase artifacts and resume without re-running completed phases. This item is about *agent session* crash recovery (the agent itself dies mid-session: worktree state, step advances). Both layers are needed.
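+
+For reference, the coordinator-level half can be pictured as a pointer-file check at startup; `ACTIVE_RUN_PATH` and the pointer shape here are assumptions, not the shipped `startup-recovery` implementation:
+
+```typescript
+import { readFile } from 'node:fs/promises';
+
+const ACTIVE_RUN_PATH = '.workrail/active-run.json';  // assumed location of the pointer file
+
+/** Returns the prior run's context file path if a run was in flight when the daemon died. */
+async function findInterruptedRun(): Promise<string | null> {
+  try {
+    const pointer = JSON.parse(await readFile(ACTIVE_RUN_PATH, 'utf8')) as { contextPath: string };
+    return pointer.contextPath;
+  } catch {
+    return null;  // no pointer file -- nothing to resume
+  }
+}
+```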
+
 **The problem:** A daemon crash loop kills all in-flight sessions. The queue correctly detects the sidecar and skips re-dispatch for the TTL window, but when the sidecar expires the session is re-dispatched from scratch with zero context. An agent that spent 10 minutes in Phase 0, read codebase files, and formed a plan loses all of that work.

 **What we want:** WorkTrain detects orphaned sessions on startup and makes an autonomous decision: resume if meaningful progress was made, discard and re-dispatch from scratch if the session died too early to be worth resuming.
@@ -1501,7 +2020,7 @@ Each file is injected only into sessions running the matching pipeline phase. Re

 **Status: idea** | Priority: medium

-**Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked:
+**Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no (unblocked by Apr 30 discovery -- context assembly does not require the knowledge graph)

 **Problem:** `src/coordinators/pr-review.ts` is already ~500 LOC doing session dispatch, result aggregation, finding classification, merge routing, message queue drain, and outbox writes. Adding knowledge graph queries, context bundle assembly, and prior session lookups would create a god class.

@@ -1509,18 +2028,17 @@ Each file is injected only into sessions running the matching pipeline phase. Re
 ```
 Trigger layer        src/trigger/           receives events, validates, enqueues
 Dispatch layer       (TBD)                  decides which workflow + what goal
-Context assembly
+Context assembly     src/daemon/            enriches trigger before runWorkflow() fires
 Orchestration layer  src/coordinators/      spawns, awaits, routes, retries, escalates
 Delivery layer       src/trigger/delivery   posts results back to origin systems
 ```

-**Context assembly
+**Resolution from Apr 30 discovery:** Context assembly does NOT require the knowledge graph as a prerequisite. The universal enricher (Phase 1 of the memory architecture) provides a structural context assembly layer via a `WorkflowEnricher` injected into `runWorkflow()` -- this IS the missing layer. The orchestration scripts (coordinators) continue to add task-specific richer context on top (phase artifacts, git diff for PRs) via the existing `assembledContextSummary` mechanism. The two layers compose: the universal enricher provides the floor, coordinators provide the ceiling.

-**
-
-
--
-- Who owns the context assembly API contract -- the engine (as a new primitive), the daemon (as an infrastructure capability), or user-authored scripts?
+**The Dispatch layer question** is resolved by the adaptive pipeline coordinator (`src/coordinators/adaptive-pipeline.ts`) -- it IS the dispatch layer for queue-polled tasks. For webhook-triggered tasks, `TriggerRouter.route()` performs dispatch. The layering is already present; it just isn't documented as such.
+
+**Remaining open question:**
+- When a coordinator calls `spawnSession()` with an `assembledContextSummary`, should the universal enricher's prior-notes injection be suppressed (the coordinator already covered it) or additive (both run)? The discovery recommends suppression -- the enricher skips prior notes when `assembledContextSummary` is already set. A sketch of that check follows.
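+
+A minimal sketch of the suppression rule, assuming an enricher input shape that is not the shipped `WorkflowEnricher` API:
+
+```typescript
+interface EnrichInput {
+  assembledContextSummary?: string;  // set when a coordinator already assembled context
+  priorNotes: string[];              // what the universal enricher would inject
+}
+
+/** Suppression over addition: skip prior-notes injection when the coordinator covered it. */
+function selectPriorNotes(input: EnrichInput): string[] {
+  return input.assembledContextSummary ? [] : input.priorNotes;
+}
+```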

 ---

@@ -1949,6 +2467,42 @@ When an MR review session (run by a WorkTrain agent) finds issues in a coding se

 ---

+### wr.discovery lacks domain-specific ideation guidance (May 6, 2026)
+
+**Status: idea** | Priority: medium
+
+**Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no
+
+`wr.discovery` classifies `problemDomain` (software / product / ux / personal / general) and uses it for a few things -- philosophy source lookup, vision doc location, and `decisionCriteria` examples. But candidate generation, challenge framing, and resolution path guidance do not adapt to domain at all. A personal career decision, a product strategy question, and a software architecture problem have meaningfully different ideation patterns, different failure modes in candidate generation, different challenge rubrics, and different resolution artifacts. The workflow currently treats them all identically after `problemDomain` is set.
+
+The result is that `problemDomain` is a classification that carries almost no behavioral weight past phase 0 and phase 2. It reads well but does not change the actual work.
+
+**Things to hash out:**
+- Where is domain-specific guidance most needed? Candidate generation (different ideation patterns per domain) and challenge framing (different adversarial angles) are the clearest gaps. Are there others -- resolution mode selection, confidence dimensions, handoff format?
+- What is the right mechanism -- `promptFragments` conditioned on `problemDomain`, a domain-specific routine injected via `templateCall`, or richer domain context blocks injected at workflow start (a table-driven sketch follows this list)? The answer probably varies by where in the workflow the guidance applies.
+- How much domain specificity is enough? Software vs. non-software is the biggest gap. Within non-software, personal vs. product vs. ux are also meaningfully different. Is a two-level split (software / general) sufficient for now, or is the full five-way split worth tackling immediately?
+- Are there domain-specific output formats worth considering? A personal decision probably ends with a different handoff shape than a software architecture decision -- different fields, different confidence dimensions, different "next actions" structure.
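+
+The table-driven mechanism could be as simple as a domain-keyed record of prompt fragments. The fragment text below is illustrative only -- none of it is shipped workflow content:
+
+```typescript
+type ProblemDomain = 'software' | 'product' | 'ux' | 'personal' | 'general';
+
+/** Hypothetical per-domain guidance injected into candidate generation and challenge steps. */
+const domainGuidance: Record<ProblemDomain, { ideation: string; challenge: string }> = {
+  software: { ideation: 'Generate candidates that restructure, not just extend, the current design.',
+              challenge: 'Attack hidden coupling, migration cost, and failure modes under load.' },
+  product:  { ideation: 'Generate candidates per user segment and willingness-to-pay tier.',
+              challenge: 'Attack the demand assumption: who asked for this, and how do we know?' },
+  ux:       { ideation: 'Generate candidates from observed user behavior, not intended flows.',
+              challenge: 'Attack discoverability and error recovery, not just the happy path.' },
+  personal: { ideation: 'Generate candidates across time horizons (1 month / 1 year / 5 years).',
+              challenge: 'Attack the assumption that current preferences are stable.' },
+  general:  { ideation: 'Generate at least one candidate that rejects the problem framing.',
+              challenge: 'Attack the framing before attacking the candidates.' },
+};
+```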
+
+---
+
+### wr.discovery anchors candidates to existing infrastructure instead of the ideal solution (Apr 30, 2026)
+
+**Status: idea** | Priority: high
+
+**Score: 11** | Cor:1 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no
+
+`wr.discovery` produces candidates bounded by what already exists. The landscape step grounds the agent in the current codebase, which anchors candidate generation to what is buildable today rather than what would be best. On a discovery run for context-passing, for example, candidates were shaped by the current pre-load architecture instead of questioning whether pre-load is the right model at all. Decisions that should be challenged by the discovery process are instead silently inherited from it.
+
+The result is that discovery optimizes within the current design space rather than finding the edge of it. Problems that require restructuring existing code -- not just adding to it -- tend to produce timid candidates that paper over the root cause instead of addressing it. Discovery is supposed to find the best answer; it is currently finding the best answer that doesn't require changing much.
+
+**Things to hash out:**
+- Should the ideal-first reasoning happen before or after the landscape pass? Before risks ignoring hard constraints; after risks being anchored by them. What is the right sequencing, and is it always the same or does it depend on the problem type?
+- How do non-negotiable constraints (e.g. "must not change the engine API", "must work without a running daemon") get introduced without becoming the excuse for avoiding the best answer? There is a real difference between a hard constraint and an inherited assumption that could be challenged.
+- Is "what would the ideal look like, and what's the migration path from here?" a step inside discovery, or does it belong in `wr.shaping`? Shaping already produces an appetite and a scope cut -- is ideal-first reasoning a discovery concern or a shaping concern, or does each need it independently?
+- When the ideal requires multi-sprint groundwork (e.g. "first build the KG, then build context assembly on top of it"), how should discovery represent that? As a sequenced multi-phase candidate? As a separate "phase 1" item that gets its own discovery?
+
+---
+
 ### Workflow previewer for compiled and runtime behavior

 **Status: idea** | Priority: medium
@@ -2780,33 +3334,33 @@ openclaw is worth studying deeply before building out the platform layer. Draw i

 **Status: idea** | Priority: medium

-**Score: 10** | Cor:1 Cap:3 Eff:1 Lev:3 Con:2 | Blocked:
+**Score: 10** | Cor:1 Cap:3 Eff:1 Lev:3 Con:2 | Blocked: yes (needs MemoryStore first as the Phase 2 prerequisite)

-**Problem:** Every session starts with a full repo sweep. Context gathering subagents re-read the same files, re-trace the same call chains, re-identify the same invariants.
+**Problem:** Every session starts with a full repo sweep. Context-gathering subagents re-read the same files, re-trace the same call chains, re-identify the same invariants. And cross-session semantic queries ("what did we find about this module last week?") cannot be answered without a vector index.
+
+**Position in the phased memory architecture (from Apr 30 discovery):** This is Phase 3 in a four-phase sequence: Phase 0 (bug fixes) → Phase 1 (universal enricher) → Phase 2 (MemoryStore SQLite) → Phase 3 (knowledge graph). The Phase 2 MemoryStore SQLite answers 6 of 8 memory queries without a vector model. The knowledge graph adds the remaining two: code-structure traversal (Q8) and semantic similarity ("what is related to X"). Phase 3a (structural layer) extends the existing spike; Phase 3b (vector layer) ships behind a feature flag.

 **Design -- two-layer hybrid:**

-**Layer 1: Structural graph (hard edges, deterministic)**
-
+**Layer 1: Structural graph (hard edges, deterministic) -- Phase 3a**
+Extends the existing `src/knowledge-graph/` spike (DuckDB + ts-morph, already in `dependencies`). New node kinds: `session`, `pipeline_run`, `workspace_convention`. New edge kinds: `produced_by` (session → file), `applies_to_workspace`. The current spike only tracks import edges and CLI commands; session data from the Phase 2 MemoryStore migrates here. Answers: "what imports trigger-router.ts?", "what files did session X touch?", "what sessions ran in this workspace?"

-**Layer 2: Vector similarity (soft weights, semantic)**
-
+**Layer 2: Vector similarity (soft weights, semantic) -- Phase 3b (feature flag)**
+LanceDB (embedded, TypeScript-native, local-first). Embeddings over session recaps and workspace conventions. Off by default (`WORKRAIL_VECTOR_SEARCH=1` to enable). Answers: "what sessions are semantically related to this bug?", "what workspace conventions mention authentication?"

 **Technology:**
-- Structural: `ts-morph` + DuckDB
-- Vector: LanceDB + local embedding model
-- Unified query: `
-
-**Build order:** Structural layer spike first (1-day). Vector layer after spike proves the foundation. Incremental update: re-index only files in `filesChanged` after each session.
+- Structural: `ts-morph` + DuckDB (existing spike, already in dependencies)
+- Vector: LanceDB + a local embedding model -- `@xenova/transformers` (in-process, no external dependency) preferred over Ollama (better quality but requires an external process)
+- Unified query: `query_knowledge(intent, workspacePath)` replaces the `query_memory` tool when Phase 3a lands (see the signature sketch after this list)
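+
+What the unified query seam could look like -- only the `query_knowledge(intent, workspacePath)` signature comes from the bullet above; the result shape is an assumption:
+
+```typescript
+interface KnowledgeHit {
+  kind: 'file' | 'session' | 'pipeline_run' | 'workspace_convention';
+  id: string;
+  score: number;   // 1.0 for structural (exact) hits; cosine similarity for vector hits
+  source: 'structural' | 'vector';
+}
+
+/** Routes structural intents to DuckDB and semantic intents to the vector index. */
+declare function query_knowledge(intent: string, workspacePath: string): Promise<KnowledgeHit[]>;
+```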

 **Build decision (from Apr 15 research):** ts-morph + DuckDB wins. Cognee: Python-only. GraphRAG/LightRAG: use LLMs to build the graph (violates scripts-over-agent). Mem0/Zep: conversational memory, not code graphs. Sourcegraph: enterprise weight, overkill.

 **Things to hash out:**
--
--
-- The
-- DuckDB is in-process --
-- Is the KG per-workspace or global?
+- Phase 3a scope: should the structural layer replace the Phase 2 SQLite MemoryStore (same data, different engine) or exist alongside it? Replacing is cleaner; coexisting avoids a migration.
+- `@xenova/transformers` vs. Ollama for Phase 3b: `@xenova/transformers` runs in-process (no setup friction) but has lower embedding quality; Ollama has better quality but adds an external process dependency. Which matters more for the target user base?
+- The incremental update strategy (re-index only `filesChanged` after each session) requires accurate change tracking. What is the fallback when `filesChanged` is unavailable?
+- DuckDB is in-process -- WAL mode handles read concurrency but writes are serialized. Is the concurrency story acceptable when 3 sessions complete simultaneously?
+- Is the KG per-workspace or global? Per-workspace is simpler; global enables cross-workspace queries but adds federation complexity.

 ---

@@ -4292,3 +4846,63 @@ WorkTrain has no tooling to surface the state of worktrees and branches relative
 - Common-ground `make sync` distributing the script reliably

 **Priority:** Medium. The shared scripts work and have been tested. Main remaining work is the shell wrapper, token storage, and integration with common-ground's team config.
+
+---
+
+### Cross-system blind benchmark: compare AI coding tools/models on the same tasks (May 6, 2026)
+
+**Status: idea** | Priority: medium
+
+**Score: 9** | Cor:1 Cap:3 Eff:1 Lev:2 Con:2 | Blocked: no
+
+There is no reproducible way to compare WorkTrain against other AI coding systems (Cursor, Copilot, raw Claude Code, competing agent frameworks), or to compare model families within WorkTrain on the same real tasks. Without this, claims about WorkTrain's quality are anecdotal, and there is no principled way to understand where WorkTrain adds value versus where it falls short.
+
+**Things to hash out:**
+- What constitutes a valid "task" for comparison? Real GitHub issues from a well-understood repo are higher quality than synthetic benchmarks, but they may not reproduce cleanly across different tool setups. What is the minimum reproducibility requirement?
+- How do you grade fairly? A grader that can see code style, comments, or formatting may infer which system produced the output. What does true blind evaluation look like here, and how blind is "blind enough"?
+- Should the rubric be global (the same for all task types) or per-task-type (refactor vs. feature vs. bug fix)?
+- Token usage comparison requires accurate per-system accounting, and not all tools expose it. Is a cost-adjusted comparison feasible, or does this reduce to a quality-only benchmark?
+- Is this a one-time study or a continuous regression benchmark? The demo-repo benchmark entry covers regression -- this entry is specifically about cross-system comparative evaluation.
+
+**Relationship to existing entries:** the demo-repo benchmark (existing entry) runs the same tasks after each WorkRail release to track regression. This entry is about comparing WorkTrain vs. other systems, not WorkTrain past vs. present.
+
+---
+
+### WorkTrain as a full software team: design, PM, data science, opex, and everything in between (May 6, 2026)
+
+**Status: idea** | Priority: high
+
+**Score: 13** | Cor:2 Cap:3 Eff:1 Lev:3 Con:2 | Blocked: no
+
+The current vision defines WorkTrain as an autonomous *software development* system. But shipping software requires more than coding -- product management, design, data science, operations, release engineering, and the feedback loop from production back into ideas are all necessary to deliver something that works and keeps working. WorkTrain currently handles only the coding-and-review slice of this. Everything before "write the code" (discovering what to build, analyzing what users actually need) and everything after "merge the PR" (instrumentation, metrics analysis, idea generation, rollout management, incident response) is done manually.
+
+The result is that the value loop -- PR → metrics → insight → idea → spec → PR -- is only partially automated. Humans still have to bridge the analysis → idea and metrics → iteration gaps. An autonomous system that stops at "ship a PR" requires continuous human intervention to keep it pointed at the right work.
+
+The constraint on idea generation specifically: ideas grounded in vague intuition are not useful. The gap is not that WorkTrain can't generate suggestions -- it can. The gap is that those suggestions are not grounded in specific, verifiable facts about the actual system and its users. An idea like "23% of users who reach step 3 abandon, the median time on that step is 47 seconds, and here is what the error logs show" is categorically different from "users might want X."
+
+**Relationship to existing entries:** Many existing backlog entries are partial implementations of this broader capability -- monitoring loops, analytics integration, feature flag management, opex, the blind benchmark entry. This entry captures the full frame so those entries can be understood as steps toward it rather than as isolated features.
+
+**Things to hash out:**
+- The vision.md defines WorkTrain as "autonomous software development." Does this require a vision revision, or is design/PM/data science/opex a natural extension of "everything that ships software"?
+- Design and PM work requires product domain knowledge, not just technical knowledge. There is no obvious equivalent of AGENTS.md for product context. What is the right mechanism for WorkTrain to acquire and maintain that context?
+- Data science work requires access to event logs, metrics stores, and potentially sensitive user data. What is the authorization model? What is the minimum access needed to produce useful insights without exposing sensitive data?
+- Release management requires write access to production systems (feature flag platforms, deployment infrastructure). What safeguards are necessary before WorkTrain can act autonomously there?
+- Opex (incident response, SLO management) has a different urgency profile than coding work. How does it fit into the existing pipeline model, which is designed for hours-to-days timescales?
+
+---
+
+### Task completion enforcement: detect and prevent deferred work within tasks (May 6, 2026)
+
+**Status: idea** | Priority: high
+
+**Score: 12** | Cor:3 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no
+
+Agents routinely defer work within tasks rather than completing it. Common patterns: "I'll file a ticket for this later," "this is out of scope, leaving it for a follow-up," "TODO: handle this edge case," "I noticed X but didn't address it to stay focused." These deferral patterns are individually plausible but collectively mean tasks are never actually finished -- they transition from "in progress" to "apparently done" while work accumulates in a long tail of unfiled tickets and unresolved TODOs.
+
+There is no mechanism to distinguish "this genuinely needs a separate session with different scope" from "I could have done this but chose not to." There is no enforcement that deferred items are tracked and eventually completed. There is no way to prove a task is actually done versus merely claimed done. A task that leaves TODOs in the code, or that defers 3 of its 5 acceptance criteria, is not done -- but the system currently has no way to detect or prevent this. A detection sketch follows.
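+
+A crude first-pass detector could scan the session's final report and newly added diff lines for deferral language -- the patterns below are illustrative assumptions, not a shipped heuristic:
+
+```typescript
+/** Phrases that often signal work deferred rather than done. Illustrative, not exhaustive. */
+const DEFERRAL_PATTERNS: RegExp[] = [
+  /\bfile (a|the) ticket\b/i,
+  /\bout of scope\b/i,
+  /\bfollow[- ]up\b/i,
+  /\bTODO:/,
+];
+
+/** Counts deferral signals across the agent's final report and added diff lines. */
+function countDeferralSignals(finalReport: string, addedDiffLines: string[]): number {
+  const texts = [finalReport, ...addedDiffLines];
+  return texts.reduce(
+    (n, t) => n + DEFERRAL_PATTERNS.filter((p) => p.test(t)).length,
+    0,
+  );
+}
+```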
+
+**Things to hash out:**
+- What does "done" mean in a provable sense? What evidence would allow a coordinator to conclude that a task is complete, rather than merely that an agent has stopped working on it?
+- How do you distinguish legitimate scope decisions from avoidance? A session on a performance bug that surfaces an unrelated security issue is right to defer the security issue. A session that addresses only 2 of 3 acceptance criteria is not. What is the principled distinction?
+- TODO comments in code are not always deferred work -- some are architectural notes, some are pre-existing. How do you identify TODOs that represent deferred task-scope work versus incidental notes?
+- How does this interact with the existing stuck detection system? A stuck agent and a "done-claiming but not actually done" agent are different failure modes. How does the system tell them apart?
|