@exaudeus/workrail 3.76.0 → 3.77.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (71)
  1. package/dist/console-ui/assets/{index-DFZjlsUM.js → index-D9pYbwS0.js} +1 -1
  2. package/dist/console-ui/index.html +1 -1
  3. package/dist/daemon/context-loader.d.ts +1 -1
  4. package/dist/daemon/core/agent-client.d.ts +7 -0
  5. package/dist/daemon/core/agent-client.js +31 -0
  6. package/dist/daemon/core/index.d.ts +6 -0
  7. package/dist/daemon/core/index.js +19 -0
  8. package/dist/daemon/core/session-context.d.ts +14 -0
  9. package/dist/daemon/core/session-context.js +24 -0
  10. package/dist/daemon/core/session-result.d.ts +10 -0
  11. package/dist/daemon/core/session-result.js +92 -0
  12. package/dist/daemon/core/system-prompt.d.ts +6 -0
  13. package/dist/daemon/core/system-prompt.js +151 -0
  14. package/dist/daemon/io/conversation-log.d.ts +2 -0
  15. package/dist/daemon/io/conversation-log.js +45 -0
  16. package/dist/daemon/io/execution-stats.d.ts +7 -0
  17. package/dist/daemon/io/execution-stats.js +86 -0
  18. package/dist/daemon/io/index.d.ts +5 -0
  19. package/dist/daemon/io/index.js +24 -0
  20. package/dist/daemon/io/session-notes-loader.d.ts +4 -0
  21. package/dist/daemon/io/session-notes-loader.js +45 -0
  22. package/dist/daemon/io/soul-loader.d.ts +3 -0
  23. package/dist/daemon/io/soul-loader.js +68 -0
  24. package/dist/daemon/io/workspace-context-loader.d.ts +17 -0
  25. package/dist/daemon/io/workspace-context-loader.js +137 -0
  26. package/dist/daemon/runner/agent-loop-runner.d.ts +28 -0
  27. package/dist/daemon/runner/agent-loop-runner.js +250 -0
  28. package/dist/daemon/runner/construct-tools.d.ts +5 -0
  29. package/dist/daemon/runner/construct-tools.js +30 -0
  30. package/dist/daemon/runner/finalize-session.d.ts +3 -0
  31. package/dist/daemon/runner/finalize-session.js +75 -0
  32. package/dist/daemon/runner/index.d.ts +8 -0
  33. package/dist/daemon/runner/index.js +18 -0
  34. package/dist/daemon/runner/pre-agent-session.d.ts +7 -0
  35. package/dist/daemon/runner/pre-agent-session.js +227 -0
  36. package/dist/daemon/runner/runner-types.d.ts +73 -0
  37. package/dist/daemon/runner/runner-types.js +39 -0
  38. package/dist/daemon/runner/tool-schemas.d.ts +1 -0
  39. package/dist/daemon/runner/tool-schemas.js +151 -0
  40. package/dist/daemon/session-scope.d.ts +1 -1
  41. package/dist/daemon/startup-recovery.d.ts +20 -0
  42. package/dist/daemon/startup-recovery.js +323 -0
  43. package/dist/daemon/state/index.d.ts +6 -0
  44. package/dist/daemon/state/index.js +14 -0
  45. package/dist/daemon/state/session-state.d.ts +23 -0
  46. package/dist/daemon/state/session-state.js +44 -0
  47. package/dist/daemon/state/stuck-detection.d.ts +22 -0
  48. package/dist/daemon/state/stuck-detection.js +25 -0
  49. package/dist/daemon/state/terminal-signal.d.ts +9 -0
  50. package/dist/daemon/state/terminal-signal.js +10 -0
  51. package/dist/daemon/tools/file-tools.d.ts +1 -1
  52. package/dist/daemon/turn-end/detect-stuck.d.ts +2 -2
  53. package/dist/daemon/turn-end/detect-stuck.js +2 -2
  54. package/dist/daemon/turn-end/step-injector.d.ts +1 -1
  55. package/dist/daemon/types.d.ts +105 -0
  56. package/dist/daemon/types.js +11 -0
  57. package/dist/daemon/workflow-enricher.d.ts +16 -0
  58. package/dist/daemon/workflow-enricher.js +58 -0
  59. package/dist/daemon/workflow-runner.d.ts +13 -277
  60. package/dist/daemon/workflow-runner.js +63 -1421
  61. package/dist/manifest.json +231 -31
  62. package/dist/trigger/coordinator-deps.d.ts +1 -1
  63. package/dist/trigger/delivery-client.d.ts +1 -1
  64. package/dist/trigger/delivery-pipeline.d.ts +1 -1
  65. package/dist/trigger/notification-service.d.ts +1 -1
  66. package/dist/trigger/trigger-listener.js +6 -2
  67. package/dist/trigger/trigger-router.d.ts +2 -2
  68. package/docs/ideas/backlog.md +249 -25
  69. package/docs/reference/worktrain-daemon-invariants.md +33 -49
  70. package/docs/vision.md +5 -15
  71. package/package.json +2 -2
@@ -192,6 +192,108 @@ The delivery pipeline was extracted into `delivery-pipeline.ts` with explicit st
 
  ## WorkTrain Daemon
 
+ ### Context injection bugs: double-injection, byte-slice truncation, workspaceRules[0] drop (Apr 30, 2026)
+
+ **Status: idea** | Priority: high
+
+ **Score: 13** | Cor:3 Cap:1 Eff:3 Lev:3 Con:3 | Blocked: no
+
+ Three active bugs in the context injection pipeline waste tokens, produce incorrect truncation, and silently discard workspace context. Confirmed by codebase audit (Apr 30, 2026).
+
+ 1. **Double-injection (`session-context.ts:117-119`):** `trigger.context` is JSON-serialized in full into the initial user message. Since coordinators write `assembledContextSummary` *into* `trigger.context`, the assembled context appears twice -- once in the system prompt (8KB cap applied) and once in the initial user message (uncapped). These diverge when the content exceeds 8KB.
+
+ 2. **Byte-slice truncation (`system-prompt.ts:200-202`):** `assembledContextSummary` is truncated by raw byte index (`ctxStr.slice(0, 8192)`), which splits mid-sentence, mid-section, and can produce malformed UTF-8. The section-aware `buildBudgetedOutput()` pattern already exists in `src/coordinators/context-assembly.ts` and handles this correctly.
+
+ 3. **`workspaceRules[0]` silent drop (`session-context.ts:106`):** `ContextBundle.workspaceRules` is typed as `ContextRule[]` but only `[0]` is consumed. The type implies per-file rules are supported; the consumer silently drops every rule after the first.
+
+ **Also in scope:** introduce `WorkflowContextSlots` typed fields on `WorkflowTrigger` (or a companion type) for system-managed context fields (`assembledContextSummary`, `priorSessionNotes`, `gitDiffStat`). This eliminates the stringly-typed `trigger.context['assembledContextSummary']` access pattern and is a prerequisite for the universal enricher (see next item). Scope Phase 0 changes to consumption sites only (`buildSystemPrompt`, `buildSessionContext`); coordinator write sites migrate in Phase 1.
+
+ **Done looks like:** no `trigger.context` JSON dump in `initialPrompt`; `assembledContextSummary` truncated at section boundaries; all `workspaceRules` entries injected; `WorkflowContextSlots` typed fields replace stringly-typed access in consumption sites.
+
+ ---
+
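The section-aware cut that bug 2 calls for could look like the following minimal sketch. `truncateAtSectionBoundary` is a hypothetical name for illustration only; the shipped `buildBudgetedOutput()` in `src/coordinators/context-assembly.ts` is the real reference, and `##` markdown headings are an assumed section marker.

```typescript
// Hypothetical sketch: cut at the last section heading that fits the budget,
// falling back to the last newline, so no section is split mid-body and the
// cut never lands inside a multi-byte character (newlines are ASCII).
function truncateAtSectionBoundary(text: string, maxChars: number): string {
  if (text.length <= maxChars) return text;
  const head = text.slice(0, maxChars);
  // Prefer dropping the section that straddles the budget entirely.
  const lastSection = head.lastIndexOf("\n## ");
  const cut = lastSection > 0 ? lastSection : head.lastIndexOf("\n");
  return cut > 0 ? head.slice(0, cut) : head;
}
```

Compared with `ctxStr.slice(0, 8192)`, the output always ends on a whole line, and a partially-included section is dropped rather than emitted half-finished.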
+ ### Universal context enricher for all session entry points (Apr 30, 2026)
+
+ **Status: idea** | Priority: high
+
+ **Score: 11** | Cor:1 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: yes (needs context injection bugs fixed first)
+
+ Today 4 of 6 session entry points receive zero assembled context: raw webhook triggers, direct dispatch, `spawn_agent` children, and crash-recovered sessions never get cross-session notes or git diff state. Only coordinator-spawned sessions (via `pr-review.ts` or the adaptive pipeline) get assembled context -- and even then only through opt-in coordinator logic, not structural injection.
+
+ There is no single layer that all dispatch paths share where assembly can run universally. Coordinators that care must call assembly explicitly; everything else gets nothing. This means every new entry point or coordinator is another opportunity to forget assembly.
+
+ **Design (from Apr 30 discovery):** A `WorkflowEnricher` service injected into `runWorkflow()` that fires for root sessions only (`spawnDepth === 0`). Provides prior workspace session notes (max 3, newest-first, workspace-scoped) and `git diff HEAD~1 --stat` to all entry points. Injected via `WorkflowContextSlots` typed fields (see context injection bugs item). When a coordinator has already set `assembledContextSummary`, the enricher skips prior-notes injection (coordinator's richer context takes precedence) but still provides git diff stat if absent.
+
+ **Critical gate:** before this ships, run a pilot test -- one session with `assembledContextSummary` injected, inspect turn-1 reasoning for citation. If agents don't reference pre-loaded context, the investment in universal enrichment adds tokens without improving outcomes.
+
+ **Things to hash out:**
+ - Where exactly does the enricher inject: inside `runWorkflow()` before `buildPreAgentSession()`, or inside `buildPreAgentSession()` itself? The latter is cleaner but changes the pre-agent phase boundary.
+ - `listRecentSessions` must have a 1s wall-clock timeout with partial-result fallback. Without it, large session stores silently slow all session startups. This is a spec requirement, not optional.
+ - `spawn_agent` children don't get enriched (they'd trigger redundant assembly for deeply nested trees). Is there a case where children should optionally enrich? Candidate: an `inheritParentContext: boolean` flag in the `spawn_agent` tool schema.
+
+ ---
+
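The design's precedence rules can be sketched as follows. `WorkflowContextSlots`, the field names, and `enrichRootSession` are proposals from this item, not shipped API; the loaders stand in for `listRecentSessions` and the `git diff HEAD~1 --stat` call.

```typescript
// Illustrative sketch of the proposed enricher precedence rules.
interface WorkflowContextSlots {
  assembledContextSummary?: string;
  priorSessionNotes?: string[];
  gitDiffStat?: string;
}

function enrichRootSession(
  slots: WorkflowContextSlots,
  spawnDepth: number,
  loadNotes: () => string[],   // stand-in for listRecentSessions
  loadDiffStat: () => string,  // stand-in for `git diff HEAD~1 --stat`
): WorkflowContextSlots {
  if (spawnDepth !== 0) return slots; // children are never enriched
  const enriched = { ...slots };
  // Coordinator-assembled context takes precedence over generic prior notes.
  if (enriched.assembledContextSummary === undefined) {
    enriched.priorSessionNotes = loadNotes().slice(0, 3); // max 3, newest-first
  }
  // Diff stat is provided either way, if absent.
  enriched.gitDiffStat ??= loadDiffStat();
  return enriched;
}
```

The suppression rule ("enricher skips prior notes when `assembledContextSummary` is already set") is the one recommended by the Apr 30 discovery; an additive variant would drop the `undefined` check.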
+ ### MemoryStore: indexed session history and mid-session query_memory tool (Apr 30, 2026)
+
+ **Status: idea** | Priority: medium
+
+ **Score: 10** | Cor:1 Cap:3 Eff:1 Lev:3 Con:2 | Blocked: yes (needs universal enricher first)
+
+ The session event log is rich -- it records goals, step notes, artifacts, delivered commits, git state, and phase handoffs. But querying it requires a full directory scan and per-session event projection on every call. `LocalSessionSummaryProviderV2` does this today and is used in exactly one place (the PR-review coordinator). Every other consumer either skips it or re-implements a slower version.
+
+ There is no mid-session memory query capability at all. An agent mid-session cannot ask "what did we decide about this module last week" and get an answer from persistent memory -- it can only use what was pre-loaded at session start.
+
+ **Design (from Apr 30 discovery):** A `MemoryStore` port backed by `~/.workrail/memory.db` (SQLite, WAL mode) indexed by `finalizeSession()` as fire-and-forget after each session completes. Query kinds v1: `recent_sessions` (by workspace path hash), `sessions_by_goal_keywords`. A `query_memory` tool added to the daemon tool set. Replaces the slow `listRecentSessions` scan in the universal enricher.
+
+ Phase 2b (separate): index phase artifacts via a new `phase_artifact_appended` session event kind -- bridges the current PipelineRunContext silo into the session event log so phase artifacts are queryable alongside session notes. Requires engine schema review before implementation.
+
+ **Things to hash out:**
+ - SQLite native compilation may fail in some deployment environments (Docker, Alpine Linux). Mitigation: use `@sqlite.org/sqlite-wasm` (pure WASM) or make `MemoryStore` fully optional -- daemon works without it, just no indexed queries.
+ - `phase_artifact_appended` event schema change is the highest-risk part of Phase 2b. Should it reuse the existing artifact channel with a new content type, or be a new event kind? Each has different backward-compatibility implications.
+ - Should `query_memory` be a general-purpose tool or typed with specific query kinds? A typed discriminated union prevents agents from inventing unsupported query shapes.
+
+ ---
+
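The typed-discriminated-union option from the last bullet could look like this sketch. The query kinds mirror the v1 list named in the design; the validator and all names are illustrative, not existing daemon code.

```typescript
// Sketch: query_memory restricted to a closed set of query shapes.
type MemoryQuery =
  | { kind: "recent_sessions"; workspacePathHash: string; limit?: number }
  | { kind: "sessions_by_goal_keywords"; keywords: string[]; limit?: number };

// A validator the tool handler could run before touching the store,
// so agents cannot invent unsupported query shapes.
function parseMemoryQuery(input: unknown): MemoryQuery | null {
  if (typeof input !== "object" || input === null) return null;
  const q = input as Record<string, unknown>;
  switch (q.kind) {
    case "recent_sessions":
      return typeof q.workspacePathHash === "string" ? (q as unknown as MemoryQuery) : null;
    case "sessions_by_goal_keywords":
      return Array.isArray(q.keywords) ? (q as unknown as MemoryQuery) : null;
    default:
      return null; // unsupported kinds are rejected, not guessed at
  }
}
```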
+ ### worktrain session analyze: verify agents actually use pre-loaded context (Apr 30, 2026)
+
+ **Status: idea** | Priority: medium
+
+ **Score: 8** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no
+
+ There is no way to verify whether agents actually use pre-loaded context (soul, workspace context, `assembledContextSummary`, session notes) in their reasoning. The entire memory architecture investment (universal enricher, MemoryStore, knowledge graph) assumes agents reference pre-loaded context at turn 1 -- but this assumption is unvalidated. If agents receive 32KB of workspace context and `assembledContextSummary` but don't cite them in their reasoning before acting, richer pre-loading adds token cost without improving outcomes.
+
+ Today, validating this requires manually reading raw session transcripts, which is impractical at scale. A `worktrain session analyze <sessionId>` command that reads the agent turn events and reports whether any pre-loaded context fields were cited in turn-1 reasoning would make this automatable and support data-driven decisions about context loading investment.
+
+ **Done looks like:** `worktrain session analyze <sessionId>` reads the session event log, extracts turn-1 assistant message content, checks for citations of injected fields (workspace context file names, goal text, prior step note content), and reports a structured summary: fields injected, fields cited, fields ignored.
+
+ **Things to hash out:**
+ - "Citation" is hard to define precisely -- the agent might paraphrase rather than quote. Does substring matching suffice, or does this need an LLM similarity check?
+ - Should this be a CLI command or a console feature? The console already reads session data; this could be a "context audit" view.
+ - The primary use case is a one-time validation gate (before shipping the universal enricher). Does this justify a permanent command, or is it a one-off script?
+
+ ---
+
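The naive substring variant of the citation check could be sketched like this. Field names and the report shape are illustrative; as the first bullet notes, paraphrases are missed entirely, which is exactly the open question.

```typescript
// Sketch: report which injected fields appear verbatim in turn-1 reasoning.
interface CitationReport { injected: string[]; cited: string[]; ignored: string[] }

function auditTurnOneCitations(
  turnOneText: string,
  injectedFields: Record<string, string>, // e.g. { goal: "...", soul: "..." }
): CitationReport {
  const normalized = turnOneText.toLowerCase();
  const injected = Object.keys(injectedFields);
  const cited = Object.entries(injectedFields)
    .filter(([, value]) => {
      const v = value.toLowerCase();
      // Cited if the field's opening fragment appears verbatim (naive check;
      // a paraphrasing agent would be scored as "ignored").
      return v.length >= 12 && normalized.includes(v.slice(0, 12));
    })
    .map(([name]) => name);
  const ignored = injected.filter((n) => !cited.includes(n));
  return { injected, cited, ignored };
}
```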
+ ### Per-run retrospective: structured learning from pipeline outcomes (Apr 30, 2026)
+
+ **Status: idea** | Priority: medium
+
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no
+
+ After a pipeline run completes -- whether it merged, escalated, or failed -- there is no structured mechanism for WorkTrain to record what it learned. Mistakes that occurred in one run (wrong interpretation, missed edge case, collateral damage rationalized as a tradeoff) are not surfaced to future sessions. Each run starts with the same baseline.
+
+ A per-run retrospective is a lightweight post-completion step that answers: what went wrong or unexpectedly, what assumption turned out to be false, and what should the next session starting on this codebase know that this session didn't? The output would be a structured record written to the session store and made available as Tier 0 context for future sessions on the same workspace.
+
+ This is distinct from the per-step `report_issue` mechanism (which records obstacles mid-session) and from the `wr.coding-task` phase-8 retrospective workflow (which is an agent-facing step prompt). This is a coordinator-level mechanism that runs after the pipeline exits, regardless of which workflows ran.
+
+ **Things to hash out:**
+ - Who runs the retrospective -- the coordinator (deterministic, reads phase results and produces structured output), a lightweight LLM step, or the agent in a final workflow phase?
+ - What is the output format? A structured `RetrospectiveArtifactV1` that feeds Tier 0 context injection, or freeform notes that accumulate in a `workspace-knowledge.md` file?
+ - Where does the output live? Per-run (alongside `PipelineRunContext`), per-workspace (accumulated knowledge store), or per-session in the session store?
+ - When a retrospective records "assumption X was wrong," how does that fact reach future sessions? It needs to be injected as Tier 0 context -- which requires the context loading path to know where to look.
+ - Should the retrospective run on every pipeline outcome (merge, escalate, timeout, error), or only on non-merge outcomes where something went wrong?
+
+ ---
+
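If the structured-record option wins, `RetrospectiveArtifactV1` might look roughly like this. The item names the type but it does not exist yet; every field here is a speculative shape, not a shipped contract.

```typescript
// Speculative shape for the structured retrospective record.
interface RetrospectiveArtifactV1 {
  version: 1;
  runId: string;
  outcome: "merge" | "escalate" | "timeout" | "error";
  falseAssumptions: string[];     // "assumption X was wrong"
  surprises: string[];            // what went wrong or unexpectedly
  adviceForNextSession: string[]; // what the next session should know
}

function buildRetrospective(
  runId: string,
  outcome: RetrospectiveArtifactV1["outcome"],
  notes: Partial<Omit<RetrospectiveArtifactV1, "version" | "runId" | "outcome">>,
): RetrospectiveArtifactV1 {
  return {
    version: 1,
    runId,
    outcome,
    falseAssumptions: notes.falseAssumptions ?? [],
    surprises: notes.surprises ?? [],
    adviceForNextSession: notes.adviceForNextSession ?? [],
  };
}
```

A versioned, array-of-strings shape keeps the record queryable (it can feed Tier 0 injection field by field) while staying cheap for a deterministic coordinator to produce.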
  ### Phase quality gate policy: partial vs escalate (May 5, 2026)
 
  **Status: idea** | Priority: medium
@@ -585,12 +687,20 @@ The autonomous workflow runner (`worktrain daemon`). Completely separate from th
 
  ### Living work context: shared knowledge document that accumulates across the full pipeline (Apr 30, 2026)
 
- **Status: done** | Shipped May 5, 2026 (PR #939)
+ **Status: partial** | Core infra shipped May 5, 2026 (PR #939). Three gaps remain.
 
  **Score: 13** | Cor:3 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no
 
  **Shipped (PR #939):** `ShapingHandoffArtifactV1` + `CodingHandoffArtifactV1` + enriched `DiscoveryHandoffArtifactV1`, `PhaseHandoffArtifact` union, `buildContextSummary()` pure function with per-phase selection, `PipelineRunContext` per-run JSON with `PhaseResult<T>`, crash recovery via `active-run.json` pointer, phase quality gates (fallback escalates, partial warns), persistence failure escalation, 4 workflow authoring changes, adversarial behavioral test (AC 21), `contractRef` validation test. Deferred: `buildSystemPrompt()` named semantic slots, console visualization, retry logic, epic-mode task graph, extensible contract registration, per-workflow lifecycle artifact tests.
 
+ **Remaining gaps (not tracked elsewhere):**
+
+ 1. **No end-to-end validation that context reaches downstream agents.** The `assembledContextSummary` is wired through `trigger.context` → `buildSystemPrompt()` → system prompt, but there is no test that runs a full pipeline (discovery → shaping → coding) and asserts that the coding agent's system prompt actually contains the discovery context. The adversarial behavioral test (AC 21) proves the pipeline structure -- it does not prove the context content is meaningful to the downstream agent.
+
+ 2. **Not all coordinator pipeline modes populate `assembledContextSummary`.** Some modes (e.g. quick-review) may exit without writing a full `PipelineRunContext`. When context is absent, `buildSystemPrompt()` silently injects nothing -- the downstream agent gets no prior context and no warning. There is no check that the coordinator always writes context before dispatching a downstream session.
+
+ 3. **No operator visibility into injected context.** The "Prior Context" section in an agent's system prompt is invisible from the console. An operator has no way to see what context was injected into a session without reading raw conversation logs. The console should surface this -- at minimum, whether the session had prior context and how many bytes.
+
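Gap 2 could be closed with a guard along these lines. This is a sketch; `assertContextBeforeDispatch` and the trigger shape are assumptions for illustration, not existing daemon code.

```typescript
// Sketch: surface the silent-injection gap instead of dispatching quietly.
type TriggerLike = { context?: Record<string, string> };

function assertContextBeforeDispatch(
  trigger: TriggerLike,
  warn: (msg: string) => void,
): boolean {
  const summary = trigger.context?.["assembledContextSummary"];
  if (summary === undefined || summary.length === 0) {
    warn("dispatching downstream session without assembledContextSummary");
    return false;
  }
  return true;
}
```

Whether a missing summary should warn, block, or escalate is a policy choice; the point is that the absence becomes observable.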
  When a multi-agent pipeline runs -- discovery → shaping → coding → review → fix → re-review -- no agent has a complete picture of what came before it. The coding agent has the goal. The review agent has the code. The fix agent has the findings. None of them have the accumulated context from the full pipeline: why this approach was chosen over alternatives, what was ruled out, what constraints were discovered, what architectural decisions were made, what edge cases were handled, what the review found and why.
 
  Each agent reconstructs intent from incomplete context, which is why review finds things coding missed (review doesn't know what the coding agent was trying to do), why fix sessions address symptoms without understanding causes (no access to the architectural reasoning), and why agents repeat work that earlier agents already did.
@@ -1007,6 +1117,25 @@ The daemon reads `triggers.yml` once at startup. Any change requires a full daem
 
  ---
 
+ ### External task tracker integrations: Jira, Linear, Notion, and beyond (Apr 30, 2026)
+
+ **Status: idea** | Priority: medium
+
+ **Score: 11** | Cor:1 Cap:3 Eff:1 Lev:3 Con:2 | Blocked: no
+
+ WorkTrain currently picks up work from GitHub and GitLab. Most engineering teams track work in Jira, Linear, Notion, or similar systems -- not in GitHub issues. Without native trigger adapters for these systems, WorkTrain cannot be used as the default development workflow for teams that don't use GitHub Issues as their primary tracker.
+
+ The vision says WorkTrain picks up tasks "from external systems (GitHub issues, GitLab MRs, Jira tickets, webhooks)." The webhook trigger (`provider: generic`) handles anything with a POST endpoint, but it requires the operator to wire up field extraction manually and provides no assignee filtering, label filtering, or status-transition detection out of the box. A first-class adapter for each tracker would handle the integration details and give operators a clean configuration surface.
+
+ **Things to hash out:**
+ - What is the right abstraction boundary? A generic polling adapter with per-tracker field mapping (same pattern as `github_issues_poll` / `gitlab_poll`) vs. a more opinionated per-tracker adapter that understands Jira workflow states, Linear priorities, etc.
+ - Jira's API requires OAuth or API token; Linear uses API keys; Notion uses integration tokens. Is secret resolution via `$ENV_VAR_NAME` sufficient, or is a richer credentials model needed?
+ - For Jira specifically: issue assignment events are not available via webhook without Jira admin access to configure webhooks. Does WorkTrain need a polling adapter (`jira_poll`) as the primary path, with webhook as an optional enhancement?
+ - What context does each tracker inject into the workflow session? Jira issues have epics, acceptance criteria, sprint context, labels. Linear issues have priority, team, estimate, project. The context mapping needs to capture what's useful without overwhelming the session.
+ - How does deduplication work across tracker adapters? A Jira issue that was already picked up and is in-flight should not be dispatched again on the next poll cycle, even if it was updated.
+
+ ---
+
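One way to answer the deduplication question is to key on tracker plus issue id and deliberately ignore updated-at, so an edited in-flight ticket is not re-dispatched. A sketch with illustrative names; where the in-flight set persists (it would need to survive daemon restarts) is left open.

```typescript
// Sketch: dedup key scoped by tracker so "PROJ-42" in Jira and Linear differ.
function dedupKey(tracker: "jira" | "linear" | "notion", issueId: string): string {
  return `${tracker}:${issueId}`;
}

function shouldDispatch(
  inFlight: Set<string>,
  tracker: "jira" | "linear" | "notion",
  issueId: string,
): boolean {
  const key = dedupKey(tracker, issueId);
  if (inFlight.has(key)) return false; // already picked up, even if updated
  inFlight.add(key);
  return true;
}
```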
  ### GitHub webhook trigger with assignee/event filtering (Apr 20, 2026)
 
  **Status: idea** | Priority: medium-high
 
@@ -1891,7 +2020,7 @@ Each file is injected only into sessions running the matching pipeline phase. Re
 
  **Status: idea** | Priority: medium
 
- **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: yes (needs knowledge graph for context assembly)
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no (unblocked by Apr 30 discovery -- context assembly does not require the knowledge graph)
 
  **Problem:** `src/coordinators/pr-review.ts` is already ~500 LOC doing session dispatch, result aggregation, finding classification, merge routing, message queue drain, and outbox writes. Adding knowledge graph queries, context bundle assembly, and prior session lookups would create a god class.
 
@@ -1899,18 +2028,17 @@ Each file is injected only into sessions running the matching pipeline phase. Re
  ```
  Trigger layer src/trigger/ receives events, validates, enqueues
  Dispatch layer (TBD) decides which workflow + what goal
- Context assembly (TBD) gathers and packages context before spawning
+ Context assembly src/daemon/ enriches trigger before runWorkflow() fires
  Orchestration layer src/coordinators/ spawns, awaits, routes, retries, escalates
  Delivery layer src/trigger/delivery posts results back to origin systems
  ```
 
- **Context assembly** is the missing layer. Before dispatching a coding session, `assembleContext(task, workspace)` runs: knowledge graph query, upstream pitch/PRD fetch, relevant prior session notes, returns a structured context bundle. The orchestration script should call this, not own it.
+ **Resolution from Apr 30 discovery:** Context assembly does NOT require the knowledge graph as a prerequisite. The universal enricher (Phase 1 of the memory architecture) provides a structural context assembly layer via `WorkflowEnricher` injected into `runWorkflow()` -- this IS the missing layer. The orchestration scripts (coordinators) continue to add task-specific richer context on top (phase artifacts, git diff for PRs) via the existing `assembledContextSummary` mechanism. The two layers compose: the universal enricher provides the floor, coordinators provide the ceiling.
 
- **Things to hash out:**
- - The right layering puts "Dispatch layer (TBD)" between Trigger and Orchestration. What exactly does the dispatch layer decide, and how does it relate to the adaptive pipeline coordinator concept elsewhere in the backlog?
- - Context assembly requires the knowledge graph. What is the fallback when the KG is not yet built for a workspace -- does context assembly simply return empty, or does it fall back to a slower manual search?
- - Should context assembly run synchronously before dispatch (blocking the trigger listener) or asynchronously (session starts with partial context while assembly continues)?
- - Who owns the context assembly API contract -- the engine (as a new primitive), the daemon (as an infrastructure capability), or user-authored scripts?
+ **The Dispatch layer question** is resolved by the adaptive pipeline coordinator (`src/coordinators/adaptive-pipeline.ts`) -- it IS the dispatch layer for queue-polled tasks. For webhook-triggered tasks, `TriggerRouter.route()` performs dispatch. The layering is already present; it just isn't documented as such.
+
+ **Remaining open question:**
+ - When a coordinator calls `spawnSession()` with an `assembledContextSummary`, should the universal enricher's prior-notes injection be suppressed (coordinator already covered it) or additive (both run)? The discovery recommends suppression -- the enricher skips prior notes when `assembledContextSummary` is already set.
 
  ---
 
@@ -2339,6 +2467,42 @@ When an MR review session (run by a WorkTrain agent) finds issues in a coding se
 
  ---
 
+ ### wr.discovery lacks domain-specific ideation guidance (May 6, 2026)
+
+ **Status: idea** | Priority: medium
+
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no
+
+ `wr.discovery` classifies `problemDomain` (software / product / ux / personal / general) and uses it for a few things -- philosophy source lookup, vision doc location, and `decisionCriteria` examples. But candidate generation, challenge framing, and resolution path guidance do not adapt to domain at all. A personal career decision, a product strategy question, and a software architecture problem have meaningfully different ideation patterns, different failure modes in candidate generation, different challenge rubrics, and different resolution artifacts. The workflow currently treats them all identically after `problemDomain` is set.
+
+ The result is that `problemDomain` is a classification that carries almost no behavioral weight past phase-0 and phase-2. It reads well but does not change the actual work.
+
+ **Things to hash out:**
+ - Where is domain-specific guidance most needed? Candidate generation (different ideation patterns per domain) and challenge framing (different adversarial angles) are the clearest gaps. Are there others -- resolution mode selection, confidence dimensions, handoff format?
+ - What is the right mechanism -- `promptFragments` conditioned on `problemDomain`, a domain-specific routine injected via `templateCall`, or richer domain context blocks injected at workflow start? The answer probably varies by where in the workflow the guidance applies.
+ - How much domain specificity is enough? Software vs non-software is the biggest gap. Within non-software, personal vs product vs ux are also meaningfully different. Is a two-level split (software / general) sufficient for now, or is the full five-way split worth tackling immediately?
+ - Are there domain-specific output formats worth considering? A personal decision probably ends with a different handoff shape than a software architecture decision -- different fields, different confidence dimensions, different "next actions" structure.
+
+ ---
+
+ ### wr.discovery anchors candidates to existing infrastructure instead of the ideal solution (Apr 30, 2026)
+
+ **Status: idea** | Priority: high
+
+ **Score: 11** | Cor:1 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no
+
+ `wr.discovery` produces candidates bounded by what already exists. The landscape step grounds the agent in the current codebase, which anchors candidate generation to what is buildable today rather than what would be best. On a discovery run for context-passing, for example, candidates are shaped by the current pre-load architecture instead of questioning whether pre-load is the right model at all. Decisions that should be challenged by the discovery process are instead silently inherited from it.
+
+ The result is that discovery optimizes within the current design space rather than finding the edge of it. Problems that require restructuring existing code -- not just adding to it -- tend to produce timid candidates that paper over the root cause instead of addressing it. Discovery is supposed to find the best answer; it is currently finding the best answer that doesn't require changing much.
+
+ **Things to hash out:**
+ - Should the ideal-first reasoning happen before or after the landscape pass? Before risks ignoring hard constraints; after risks being anchored by them. What is the right sequencing, and is it always the same or does it depend on the problem type?
+ - How do non-negotiable constraints (e.g. "must not change the engine API", "must work without a running daemon") get introduced without becoming the excuse for avoiding the best answer? There's a real difference between a hard constraint and an inherited assumption that could be challenged.
+ - Is "what would the ideal look like, and what's the migration path from here?" a step inside discovery, or does it belong in `wr.shaping`? Shaping already produces an appetite and scope cut -- is ideal-first reasoning a discovery concern or a shaping concern, or does each need it independently?
+ - When the ideal requires multi-sprint groundwork (e.g. "first build the KG, then build context assembly on top of it"), how should discovery represent that? As a sequenced multi-phase candidate? As a separate "phase 1" item that gets its own discovery?
+
+ ---
+
  ### Workflow previewer for compiled and runtime behavior
 
  **Status: idea** | Priority: medium
 
@@ -3170,33 +3334,33 @@ openclaw is worth studying deeply before building out the platform layer. Draw i
 
  **Status: idea** | Priority: medium
 
- **Score: 10** | Cor:1 Cap:3 Eff:1 Lev:3 Con:2 | Blocked: no
+ **Score: 10** | Cor:1 Cap:3 Eff:1 Lev:3 Con:2 | Blocked: yes (needs MemoryStore first as Phase 2 prerequisite)
+
+ **Problem:** Every session starts with a full repo sweep. Context gathering subagents re-read the same files, re-trace the same call chains, re-identify the same invariants. And cross-session semantic queries ("what did we find about this module last week") cannot be answered without a vector index.
 
- **Problem:** Every session starts with a full repo sweep. Context gathering subagents re-read the same files, re-trace the same call chains, re-identify the same invariants.
+ **Position in the phased memory architecture (from Apr 30 discovery):** This is Phase 3 in a four-phase sequence. Phase 0 (bug fixes) → Phase 1 (universal enricher) → Phase 2 (MemoryStore SQLite) → Phase 3 (knowledge graph). The MemoryStore SQLite from Phase 2 answers 6 of 8 memory queries without a vector model. The knowledge graph adds the remaining two: code-structure traversal (Q8) and semantic similarity ("what is related to X"). Phase 3a (structural layer) extends the existing spike; Phase 3b (vector layer) is a feature flag.
 
  **Design -- two-layer hybrid:**
 
- **Layer 1: Structural graph (hard edges, deterministic)**
- Built by `ts-morph` (TypeScript Compiler API) + DuckDB. Captures: `imports`, `calls`, `exports`, `implements`, `extends`, `registers_in`, `tested_by`. Answers precise questions with certainty: "what imports trigger-router.ts?", "what CLI commands are registered?"
+ **Layer 1: Structural graph (hard edges, deterministic) -- Phase 3a**
+ Extends existing `src/knowledge-graph/` spike (DuckDB + ts-morph, already in `dependencies`). New node kinds: `session`, `pipeline_run`, `workspace_convention`. New edge kinds: `produced_by` (session → file), `applies_to_workspace`. Current spike only tracks import edges and CLI commands; session data from Phase 2 MemoryStore migrates here. Answers: "what imports trigger-router.ts?", "what files did session X touch?", "what sessions ran in this workspace?"
 
- **Layer 2: Vector similarity (soft weights, semantic)**
- Every node gets an embedding. Answers fuzzy questions: "what is conceptually related to this?", "what past sessions are relevant to this bug?" Built with LanceDB (embedded, TypeScript-native, local-first).
+ **Layer 2: Vector similarity (soft weights, semantic) -- Phase 3b (feature flag)**
+ LanceDB (embedded, TypeScript-native, local-first). Embeddings over session recaps and workspace conventions. Off by default (`WORKRAIL_VECTOR_SEARCH=1` to enable). Answers: "what sessions are semantically related to this bug?", "what workspace conventions mention authentication?"
 
  **Technology:**
- - Structural: `ts-morph` + DuckDB
- - Vector: LanceDB + local embedding model (Ollama or `@xenova/transformers`)
- - Unified query: `query_knowledge_graph(intent)` returns merged structural + semantic results
-
- **Build order:** Structural layer spike first (1-day). Vector layer after spike proves the foundation. Incremental update: re-index only files in `filesChanged` after each session.
3352
+ - Structural: `ts-morph` + DuckDB (existing spike, already in dependencies)
3353
+ - Vector: LanceDB + local embedding model -- `@xenova/transformers` (in-process, no external dependency) preferred over Ollama (higher embedding quality, but requires an external process)
3354
+ - Unified query: `query_knowledge(intent, workspacePath)` replaces `query_memory` tool when Phase 3a lands
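Since `query_knowledge(intent, workspacePath)` is only named here, a minimal sketch of what merging the two layers could look like -- every name and the ranking rule below are assumptions, not a committed design:

```typescript
// Hypothetical result shape for the proposed query_knowledge tool.
interface KnowledgeHit {
  id: string;
  source: 'structural' | 'vector';
  score: number;
}

// Merge hits from both layers: highest score first, and on ties the
// deterministic structural edges outrank soft semantic similarity.
function mergeHits(
  structural: KnowledgeHit[],
  vector: KnowledgeHit[],
  limit = 10,
): KnowledgeHit[] {
  return [...structural, ...vector]
    .sort(
      (a, b) =>
        b.score - a.score ||
        Number(b.source === 'structural') - Number(a.source === 'structural'),
    )
    .slice(0, limit);
}
```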
3191
3355
 
3192
3356
  **Build decision (from Apr 15 research):** ts-morph + DuckDB wins. Cognee: Python-only. GraphRAG/LightRAG: use LLMs to build graph (violates scripts-over-agent). Mem0/Zep: conversational memory, not code graphs. Sourcegraph: enterprise weight, overkill.
3193
3357
 
3194
3358
  **Things to hash out:**
3195
- - How large does a typical workspace KG get? For a medium-sized TypeScript monorepo, what are the expected node and edge counts for the structural layer?
3196
- - The incremental update strategy (re-index only `filesChanged`) requires accurate change tracking. What is the fallback when `filesChanged` is unavailable (e.g. for manually triggered sessions)?
3197
- - The embedding model (Ollama or `@xenova/transformers`) needs to be running locally. What is the setup story for a new workspace -- is it expected to already have an embedding model, or does WorkTrain set one up?
3198
- - DuckDB is in-process -- what is the concurrency story when multiple daemon sessions try to query or update it simultaneously?
3199
- - Is the KG per-workspace or global? If per-workspace, cross-workspace queries (multi-project WorkTrain) require a federation layer.
3359
+ - Phase 3a scope: should the structural layer replace the Phase 2 SQLite MemoryStore (same data, different engine) or exist alongside it? Replacing is cleaner; coexisting avoids a migration.
3360
+ - `@xenova/transformers` vs Ollama for Phase 3b: @xenova runs in-process (no setup friction) but has lower embedding quality. Ollama is better quality but adds an external process dependency. Which matters more for the target user base?
3361
+ - The incremental update strategy (re-index only `filesChanged` after each session) requires accurate change tracking. What is the fallback when `filesChanged` is unavailable?
3362
+ - DuckDB is in-process -- WAL mode handles read concurrency but writes are serialized. Is the concurrency story acceptable when 3 sessions complete simultaneously?
3363
+ - Is the KG per-workspace or global? Per-workspace is simpler; global enables cross-workspace queries but adds federation complexity.
3200
3364
 
3201
3365
  ---
3202
3366
 
@@ -4682,3 +4846,63 @@ WorkTrain has no tooling to surface the state of worktrees and branches relative
4682
4846
  - Common-ground `make sync` distributing the script reliably
4683
4847
 
4684
4848
  **Priority:** Medium. The shared scripts work and have been tested. Main remaining work is the shell wrapper, token storage, and integration with common-ground's team config.
4849
+
4850
+ ---
4851
+
4852
+ ### Cross-system blind benchmark: compare AI coding tools/models on the same tasks (May 6, 2026)
4853
+
4854
+ **Status: idea** | Priority: medium
4855
+
4856
+ **Score: 9** | Cor:1 Cap:3 Eff:1 Lev:2 Con:2 | Blocked: no
4857
+
4858
+ There is no reproducible way to compare WorkTrain against other AI coding systems (Cursor, Copilot, raw Claude Code, competing agent frameworks) or to compare model families within WorkTrain on the same real tasks. Without this, claims about WorkTrain's quality are anecdotal and there is no principled way to understand where WorkTrain adds value versus where it falls short.
4859
+
4860
+ **Things to hash out:**
4861
+ - What constitutes a valid "task" for comparison? Real GitHub issues from a well-understood repo are higher quality than synthetic benchmarks, but may not reproduce cleanly across different tool setups. What is the minimum reproducibility requirement?
4862
+ - How do you grade fairly? A grader that can see code style, comments, or formatting may infer which system produced the output. What does true blind evaluation look like here, and how blind is "blind enough"?
4863
+ - Should the rubric be global (same for all task types) or per-task-type (refactor vs feature vs bug fix)?
4864
+ - Token usage comparison requires accurate per-system accounting. Not all tools expose this. Is a cost-adjusted comparison feasible, or does this reduce to a quality-only benchmark?
4865
+ - Is this a one-time study or a continuous regression benchmark? The demo-repo benchmark entry covers regression -- this is specifically about cross-system comparative evaluation.
4866
+
4867
+ **Relationship to existing entries:** the demo-repo benchmark (existing entry) runs the same tasks after each WorkRail release to track regression. This entry is about comparing WorkTrain vs other systems, not WorkTrain past vs present.
4868
+
4869
+ ---
4870
+
4871
+ ### WorkTrain as a full software team: design, PM, data science, opex, and everything in between (May 6, 2026)
4872
+
4873
+ **Status: idea** | Priority: high
4874
+
4875
+ **Score: 13** | Cor:2 Cap:3 Eff:1 Lev:3 Con:2 | Blocked: no
4876
+
4877
+ The current vision defines WorkTrain as an autonomous *software development* system. But shipping software requires more than coding -- product management, design, data science, operations, release engineering, and the feedback loop from production back into ideas are all necessary to deliver something that works and keeps working. WorkTrain currently handles only the coding-and-review slice of this. Everything before "write the code" (discovering what to build, analyzing what users actually need) and everything after "merge the PR" (instrumentation, metrics analysis, idea generation, rollout management, incident response) is done manually.
4878
+
4879
+ The result is that the value loop -- PR → metrics → insight → idea → spec → PR -- is only partially automated. Humans still have to bridge the analysis → idea and metrics → iteration gaps. An autonomous system that stops at "ship a PR" requires continuous human intervention to keep it pointed at the right work.
4880
+
4881
+ The constraint on idea generation specifically: ideas grounded in vague intuition are not useful. The gap is not that WorkTrain can't generate suggestions -- it can. The gap is that those suggestions are not grounded in specific, verifiable facts about the actual system and its users. An idea like "23% of users who reach step 3 abandon, and the median time on that step is 47 seconds, and here is what the error logs show" is categorically different from "users might want X."
4882
+
4883
+ **Relationship to existing entries:** Many existing backlog entries are partial implementations of this broader capability -- monitoring loops, analytics integration, feature flag management, opex, the blind benchmark entry. This entry captures the full frame so those entries can be understood as steps toward it rather than isolated features.
4884
+
4885
+ **Things to hash out:**
4886
+ - The vision.md defines WorkTrain as "autonomous software development." Does this require a vision revision, or is design/PM/data science/opex a natural extension of "everything that ships software"?
4887
+ - Design and PM work requires product domain knowledge -- not just technical knowledge. There is no obvious equivalent of AGENTS.md for product context. What is the right mechanism for WorkTrain to acquire and maintain that context?
4888
+ - Data science work requires access to event logs, metrics stores, and potentially sensitive user data. What is the authorization model? What is the minimum access needed to produce useful insights without exposing sensitive data?
4889
+ - Release management requires write access to production systems (feature flag platforms, deployment infrastructure). What safeguards are necessary before WorkTrain can act autonomously there?
4890
+ - Opex (incident response, SLO management) has a different urgency profile than coding work. How does it fit into the existing pipeline model, which is designed for hours-to-days timescales?
4891
+
4892
+ ---
4893
+
4894
+ ### Task completion enforcement: detect and prevent deferred work within tasks (May 6, 2026)
4895
+
4896
+ **Status: idea** | Priority: high
4897
+
4898
+ **Score: 12** | Cor:3 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no
4899
+
4900
+ Agents routinely defer work within tasks rather than completing it. Common patterns: "I'll file a ticket for this later," "this is out of scope, leaving for a follow-up," "TODO: handle this edge case," "I noticed X but didn't address it to stay focused." These deferral patterns are individually plausible but collectively mean tasks are never actually finished -- they transition from "in progress" to "apparently done" while work accumulates in a long tail of unfiled tickets and unresolved TODOs.
4901
+
4902
+ There is no mechanism to distinguish "this genuinely needs a separate session with different scope" from "I could have done this but chose not to." There is no enforcement that deferred items are tracked and eventually completed. There is no way to prove a task is actually done versus claimed done. A task that leaves TODOs in the code, or that defers 3 of its 5 acceptance criteria, is not done -- but the system currently has no way to detect or prevent this.
4903
+
4904
+ **Things to hash out:**
4905
+ - What does "done" mean in a provable sense? What evidence would allow a coordinator to conclude that a task is complete rather than merely that an agent has stopped working on it?
4906
+ - How do you distinguish legitimate scope decisions from avoidance? A session on a performance bug that surfaces an unrelated security issue is right to defer the security issue. A session that addresses only 2 of 3 acceptance criteria is not. What is the principled distinction?
4907
+ - TODO comments in code are not always deferred work -- some are architectural notes, some are pre-existing. How do you identify TODOs that represent deferred task-scope work versus incidental notes?
4908
+ - How does this interact with the existing stuck detection system? A stuck agent and a "done-claiming but not actually done" agent are different failure modes. How does the system tell them apart?
@@ -14,7 +14,7 @@ See also: `tests/unit/workflow-runner-outcome-invariants.test.ts` -- the test fi
14
14
 
15
15
  **Why:** `'unknown'` in `execution-stats.jsonl` is silent data loss. Operators calibrate session timeouts and monitor health from this data.
16
16
 
17
- **How it breaks:** The `writeExecutionStats()` helper takes `outcome` by value. If called with a variable that hasn't been assigned yet, it silently records `'unknown'`. Every result path must call `writeExecutionStats()` with the correct outcome at the call site, not via a shared variable captured in a closure.
17
+ **How it breaks:** `writeExecutionStats()` takes `outcome` by value. If called with an unassigned variable, it silently records `'unknown'`. All result paths go through `finalizeSession()`, which calls `tagToStatsOutcome()` to derive the outcome -- there are no direct `writeExecutionStats()` calls outside `finalizeSession()`.
18
18
 
19
19
  ### 1.2 `delivery_failed` is never returned by `runWorkflow()` directly
20
20
 
@@ -32,13 +32,13 @@ See also: `tests/unit/workflow-runner-outcome-invariants.test.ts` -- the test fi
32
32
  | `'stuck'` | `'stuck'` |
33
33
  | `'delivery_failed'` | `'success'` (workflow succeeded; only the POST failed) |
34
34
 
35
- This mapping must be exhaustive. When `tagToStatsOutcome()` is extracted as a pure function (planned in the functional-core/imperative-shell refactor), it must use `assertNever` on the default case so the compiler enforces exhaustiveness.
35
+ This mapping is exhaustive. `tagToStatsOutcome()` is a pure function in `workflow-runner.ts` that uses `assertNever` on the default case -- the compiler enforces exhaustiveness when new `_tag` variants are added.
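A minimal sketch of the `assertNever` pattern this relies on. The `'stuck'` and `'delivery_failed'` rows come from the mapping table above; the other variant names are illustrative, not the real union:

```typescript
// Illustrative union; only 'stuck' and 'delivery_failed' mappings are
// confirmed by the table above.
type ResultTag = 'success' | 'agent_error' | 'timeout' | 'stuck' | 'delivery_failed';
type StatsOutcome = 'success' | 'error' | 'timeout' | 'stuck';

function assertNever(x: never): never {
  throw new Error(`Unhandled variant: ${JSON.stringify(x)}`);
}

function tagToStatsOutcome(tag: ResultTag): StatsOutcome {
  switch (tag) {
    case 'success': return 'success';
    case 'agent_error': return 'error';
    case 'timeout': return 'timeout';
    case 'stuck': return 'stuck';
    case 'delivery_failed': return 'success'; // workflow succeeded; only the POST failed
    // If a new ResultTag variant is added without a case, `tag` is no longer
    // `never` here and the compiler rejects this call.
    default: return assertNever(tag);
  }
}
```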
36
36
 
37
37
  ### 1.4 Outcome priority when multiple signals fire
38
38
 
39
- If both `stuckReason` and `timeoutReason` are non-null at the same time (same turn), `stuck` takes priority over `timeout`. This is intentional: stuck is the more specific signal (the agent is looping, not just slow), and fires before the wall-clock limit.
39
+ `stuck` takes priority over `timeout`. This is enforced structurally by `TerminalSignal` and `setTerminalSignal()`: `setTerminalSignal()` is first-writer-wins -- the first signal to set `state.terminalSignal` wins, and subsequent calls are silent no-ops. Because stuck detection fires inside the turn-end subscriber (which runs before the wall-clock timeout handler), stuck always sets `terminalSignal` first when both conditions are present in the same turn.
40
40
 
41
- **Code location:** The `if (stuckReason !== null)` check precedes `if (timeoutReason !== null)` in `runWorkflow()`.
41
+ **Code location:** `setTerminalSignal()` in `workflow-runner.ts`. `buildSessionResult()` reads `state.terminalSignal` after the loop exits.
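The first-writer-wins mechanism can be sketched as follows (field and variant shapes are assumed; the real types live in `workflow-runner.ts`):

```typescript
// Hypothetical shapes illustrating the first-writer-wins invariant.
type TerminalSignal =
  | { kind: 'stuck'; reason: string }
  | { kind: 'timeout'; reason: string };

interface RunState {
  terminalSignal: TerminalSignal | null;
}

// The first signal to set state.terminalSignal wins; later calls are
// silent no-ops, so a dual stuck+timeout state is structurally impossible.
function setTerminalSignal(state: RunState, signal: TerminalSignal): void {
  if (state.terminalSignal !== null) return;
  state.terminalSignal = signal;
}
```

Because stuck detection runs in the turn-end subscriber, it reaches `setTerminalSignal()` before the wall-clock timeout handler when both fire in the same turn.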
42
42
 
43
43
  ### 1.5 stepCount reflects agent-loop advances only
44
44
 
@@ -56,7 +56,7 @@ Each `runWorkflow()` call writes a per-session sidecar file at `~/.workrail/daem
56
56
 
57
57
  `persistTokens()` returns `Promise<Result<void, PersistTokensError>>` (not throws). Callers in the setup phase treat `err` as fatal (abort); callers inside tool closures treat `err` as degraded-but-continue (log and still call `onAdvance`/`onTokenUpdate` -- see invariant 4.3).
58
58
 
59
- **Exception:** If `continueToken` is undefined (instant single-step completion, or `_preAllocatedStartResponse` with no token), `persistTokens()` is skipped. There is nothing to recover.
59
+ **Exception:** If `continueToken` is undefined (instant single-step completion, or a `pre_allocated` `SessionSource` with no token), `persistTokens()` is skipped. There is nothing to recover.
60
60
 
61
61
  ### 2.2 Sidecar is deleted on every non-worktree terminal path
62
62
 
@@ -88,33 +88,34 @@ Since Phase B crash recovery (PR #811), `persistTokens()` also writes `workflowI
88
88
 
89
89
  ## 3. Registry invariants
90
90
 
91
- Three registries track in-flight daemon sessions:
91
+ Two registries track in-flight daemon sessions:
92
92
 
93
93
  | Registry | Key | Value | Purpose |
94
94
  |---|---|---|---|
95
95
  | `DaemonRegistry` | `workrailSessionId` | `{ workflowId, lastHeartbeatMs }` | Console `isLive` display |
96
- | `SteerRegistry` | `workrailSessionId` | `(text: string) => void` | Mid-session coordinator injection |
97
- | `AbortRegistry` | `workrailSessionId` | `() => void` | SIGTERM graceful shutdown |
96
+ | `ActiveSessionSet` | `workrailSessionId` | `SessionHandle` | Steer injection + SIGTERM abort |
97
+
98
+ `ActiveSessionSet` + `SessionHandle` (in `src/daemon/active-sessions.ts`) replaced the former separate `SteerRegistry` and `AbortRegistry` maps. A `SessionHandle` exposes `steer()`, `setAgent()`, `abort()`, and `dispose()` -- all session lifecycle operations on a single object.
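The abort half of the handle can be sketched as follows -- internals are assumed, but the null-checked `abort()` is the property the invariants below depend on:

```typescript
// Hypothetical internals; the real SessionHandleImpl lives in
// src/daemon/active-sessions.ts and also exposes steer() and dispose().
interface AgentLike {
  abort(): void;
}

class SessionHandleImpl {
  private _agent: AgentLike | null = null;

  // Wires in abort capability once the AgentLoop exists.
  setAgent(agent: AgentLike): void {
    this._agent = agent;
  }

  // Null check makes abort() before setAgent() a safe no-op, not a crash.
  abort(): void {
    this._agent?.abort();
  }
}
```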
98
99
 
99
100
  ### 3.1 Registry registration and deregistration
100
101
 
101
- **Registration** happens in two places:
102
+ **Registration** happens in two phases:
102
103
 
103
- - `steerRegistry` and `DaemonRegistry` are registered inside `buildPreAgentSession()` -- AFTER all potentially-failing I/O (executeStartWorkflow, persistTokens, worktree creation). Error paths that return before registration have nothing to clean up. The single-step completion path (which returns success without running an agent loop) explicitly calls `steerRegistry.delete()` and `daemonRegistry.unregister()` before returning.
104
+ - `DaemonRegistry` and `ActiveSessionSet` are registered inside `buildPreAgentSession()` -- AFTER all potentially-failing I/O (executeStartWorkflow, persistTokens, worktree creation). Error paths that return before this point have nothing to clean up. The single-step completion path goes through `finalizeSession()` (which calls `daemonRegistry.unregister()`) and returns the handle via `PreAgentSessionResult` so the caller (`runWorkflow()`) can call `handle.dispose()`.
104
105
 
105
- - `abortRegistry` is registered in `runWorkflow()` immediately after `const agent = new AgentLoop(...)`. The closure `() => agent.abort()` references `agent` -- registering before agent construction would be a TDZ hazard.
106
+ - `handle.setAgent(agent)` is called in `buildAgentReadySession()` immediately after `const agent = new AgentLoop(...)`. This wires in abort capability. `abort()` before `setAgent()` is a safe no-op -- the TDZ hazard is eliminated by the null check inside `SessionHandleImpl.abort()`.
106
107
 
107
108
  **Deregistration**:
108
109
 
109
- - `steerRegistry.delete()` and `abortRegistry.delete()` are called in the `finally` block of `runWorkflow()`. This ensures cleanup happens even if an exception is thrown in the agent loop.
110
+ - `handle.dispose()` is called in the `finally` block of `runAgentLoop()`. This removes the handle from `ActiveSessionSet` so `size` decrements correctly and shutdown drain terminates.
110
111
 
111
- - `daemonRegistry.unregister()` is called at each result path (success, error, timeout, stuck) via `finalizeSession()`. It is NOT in `finally` because the completion status ('completed' vs 'failed') differs by path.
112
+ - `daemonRegistry.unregister()` is called via `finalizeSession()` at both result paths (early-exit and post-agent-loop). It is NOT in `finally` because the completion status ('completed' vs 'failed') differs by result.
112
113
 
113
- **Why stale entries are bugs:** A stale steer callback on a dead session makes `POST /sessions/:id/steer` return 200 (calling the closed-over callback) instead of 404. A stale abort callback makes the shutdown handler call `abort()` on an already-exited session. Both are silent correctness bugs.
114
+ **Why stale entries are bugs:** A stale steer handle on a dead session makes `POST /sessions/:id/steer` return 200 instead of 404. A stale abort handle makes the shutdown handler call `abort()` on an already-exited session. Both are silent correctness bugs.
114
115
 
115
116
  ### 3.2 `DaemonRegistry` is unregistered at every result path
116
117
 
117
- `daemonRegistry.unregister(workrailSessionId, 'completed' | 'failed')` is called at each of the four result paths (success, error, timeout, stuck). It is NOT in the `finally` block because the completion status ('completed' vs 'failed') differs by path.
118
+ `daemonRegistry.unregister(workrailSessionId, 'completed' | 'failed')` is called via `finalizeSession()` at both the early-exit path and the post-agent-loop path. It is NOT in `finally` because the completion status differs by result.
118
119
 
119
120
  ### 3.3 `workrailSessionId` is available before registry operations
120
121
 
@@ -124,11 +125,11 @@ If `parseContinueTokenOrFail()` fails (unusual -- the token just came from `exec
124
125
 
125
126
  ### 3.4 Registration gap is documented
126
127
 
127
- **SteerRegistry gap (~50ms):** There is a ~50ms window between `executeStartWorkflow()` returning and `steerRegistry.set()` being called (after `parseContinueTokenOrFail()` completes). A `POST /sessions/:id/steer` call in this window receives 404. Coordinators should retry once on 404 during session startup.
128
+ **Steer gap (~50ms):** There is a ~50ms window between `executeStartWorkflow()` returning and `activeSessionSet.register()` being called (after `parseContinueTokenOrFail()` completes). A `POST /sessions/:id/steer` call in this window receives 404. Coordinators should retry once on 404 during session startup.
128
129
 
129
- **AbortRegistry gap (~200-500ms):** `abortRegistry.set()` is registered _after_ `const agent = new AgentLoop(...)` is constructed, which happens after the context-loading phase (`loadDaemonSoul`, `loadWorkspaceContext`, `loadSessionNotes` in parallel). This means there is a ~200-500ms window where SIGTERM will not abort an in-flight session. Sessions in this window run to completion or hit the wall-clock timeout.
130
+ **Abort gap (~200-500ms):** `handle.setAgent(agent)` is called after `const agent = new AgentLoop(...)` is constructed, which happens after the context-loading phase (`loadDaemonSoul`, `loadWorkspaceContext`, `loadSessionNotes` in parallel). During this window, `handle.abort()` is a safe no-op -- SIGTERM will not abort the session. Sessions in this window run to completion or hit the wall-clock timeout.
130
131
 
131
- **Why the abort gap is wider than the steer gap:** `abortRegistry.set` registers `() => agent.abort()` which closes over `agent`. Registering this callback before `agent` is constructed would be a TDZ (Temporal Dead Zone) hazard -- `agent` is declared with `const` and would not yet be initialized if the shutdown handler fired on an early-exit path. Registering after `agent` construction eliminates the hazard at the cost of a wider registration window. The accepted tradeoff is the same as for the steer gap.
132
+ **Why the abort gap is wider than the steer gap:** `setAgent()` must be called after `agent` construction. Calling it before would be a TDZ hazard. The `SessionHandleImpl.abort()` null-checks `_agent`, making pre-`setAgent()` abort a safe no-op rather than a crash.
132
133
 
133
134
  ---
134
135
 
@@ -160,7 +161,7 @@ Both are guarded by the sequential tool execution invariant (no concurrent token
160
161
 
161
162
  All three stuck detection signals (`repeated_tool_call`, `no_progress`, `timeout_imminent`) emit `agent_stuck` events via `emitter?.emit()`, which is fire-and-forget. An event write failure never affects the session.
162
163
 
163
- Signals 1 and 2 abort the session (set `stuckReason`) subject to `stuckAbortPolicy`. Signal 3 (`timeout_imminent`) is purely observational -- the abort has already been triggered by the timeout handler.
164
+ Signals 1 and 2 call `setTerminalSignal(state, { kind: 'stuck', reason: ... })` subject to `stuckAbortPolicy`. Signal 3 (`timeout_imminent`) is purely observational -- the abort has already been triggered by the timeout handler.
164
165
 
165
166
  ### 4.5 `spawn_agent` depth is enforced at the call site
166
167
 
@@ -204,7 +205,7 @@ On failure/timeout/stuck paths, the worktree is left in place for debugging. `ru
204
205
 
205
206
  ### 6.3 Sessions with >= 1 step advance are resumed if sidecar has trigger context
206
207
 
207
- `evaluateRecovery({ stepAdvances: >= 1 })` returns `'resume'`. If the sidecar contains `workflowId` and `workspacePath`, `runStartupRecovery()` calls `executeContinueWorkflow({ intent: 'rehydrate' })` to get the current step prompt, builds a minimal `WorkflowTrigger` with `_preAllocatedStartResponse`, and calls `runWorkflow()` fire-and-forget.
208
+ `evaluateRecovery({ stepAdvances: >= 1 })` returns `'resume'`. If the sidecar contains `workflowId` and `workspacePath`, `runStartupRecovery()` calls `executeContinueWorkflow({ intent: 'rehydrate' })` to get the current step prompt, builds a minimal `WorkflowTrigger` and a `pre_allocated` `SessionSource`, and calls `runWorkflow()` fire-and-forget.
208
209
 
209
210
  **Old-format sidecars** (missing `workflowId`/`workspacePath`) fall through to discard regardless of step count.
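The recovery decision described above can be sketched as a pure function. Field names are assumed, and the treatment of zero-advance sessions as discard follows the surrounding section rather than code inspection:

```typescript
// Hypothetical sidecar shape; the real schema carries more fields (see §2).
interface SidecarInfo {
  stepAdvances: number;
  workflowId?: string;
  workspacePath?: string;
}

function evaluateRecovery(sidecar: SidecarInfo): 'resume' | 'discard' {
  // Old-format sidecars (missing trigger context) fall through to discard
  // regardless of step count.
  if (!sidecar.workflowId || !sidecar.workspacePath) return 'discard';
  return sidecar.stepAdvances >= 1 ? 'resume' : 'discard';
}
```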
210
211
 
@@ -214,32 +215,15 @@ Worktree sessions that are resumed set `branchStrategy: 'none'` and use the pers
214
215
 
215
216
  ---
216
217
 
217
- ## 7. Planned refactor: functional core / imperative shell
218
-
219
- The invariants above are currently enforced by convention (comments, code structure) rather than by the type system. The planned refactor will make them structurally enforced:
220
-
221
- **Core (pure functions, no I/O):**
222
- - `buildSessionConfig(trigger) SessionConfig` -- model, tools, limits, prompts
223
- - `evaluateAgentExitState(exitState) → WorkflowRunResult` -- replaces 4 scattered return sites
224
- - `tagToStatsOutcome(tag) → StatsOutcome` -- exhaustive via `assertNever`
225
- - `evaluateStuck(signals) StuckSignal | null` -- already nearly pure
226
-
227
- **Shell (one cleanup site for all I/O):**
228
- ```typescript
229
- async function runWorkflow(trigger, ctx, apiKey, ...): Promise<WorkflowRunResult> {
230
- const startMs = Date.now();
231
- const result = await _runWorkflowCore(trigger, ctx, apiKey, ...);
232
- // All I/O in one place:
233
- writeExecutionStats(statsDir, ..., tagToStatsOutcome(result._tag), result.stepCount);
234
- await cleanupSidecar(sessionId, result._tag, trigger.branchStrategy);
235
- emitSessionCompleted(emitter, sessionId, result._tag);
236
- daemonRegistry?.unregister(workrailSessionId, result._tag === 'success' ? 'completed' : 'failed');
237
- return result.workflowRunResult;
238
- }
239
- ```
240
-
241
- After the refactor, adding a new result path requires:
242
- 1. Adding it to the `WorkflowRunResult` union (compiler enforces exhaustiveness in `tagToStatsOutcome` via `assertNever`)
243
- 2. Returning the new variant from `_runWorkflowCore` (no I/O to add at the return site)
244
-
245
- The current pattern requires manually adding `writeExecutionStats()`, sidecar deletion, event emission, and registry deregistration at each new return site -- easily forgotten.
218
+ ## 7. Structural enforcement summary
219
+
220
+ The invariants above are enforced by a combination of type system guarantees and code structure:
221
+
222
+ - `tagToStatsOutcome()` -- pure function with `assertNever` default; compiler error on unhandled `_tag`
223
+ - `sidecardLifecycleFor()` -- pure function with `assertNever` default; compiler error on unhandled `_tag`
224
+ - `buildSessionResult()` -- pure function; reads `state.terminalSignal` after loop exits
225
+ - `finalizeSession()` -- single cleanup site for all result paths (event emission, registry cleanup, stats, sidecar deletion)
226
+ - `setTerminalSignal()` -- first-writer-wins; structurally prevents dual stuck+timeout state
227
+ - `SessionHandle` -- encapsulates steer/abort lifecycle; `abort()` before `setAgent()` is a safe no-op
228
+
229
+ Adding a new `WorkflowRunResult` variant requires updating `tagToStatsOutcome()` and `sidecardLifecycleFor()` -- the compiler enforces both via `assertNever`. No I/O needs to be added at the new return site.
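A stubbed sketch of the single-cleanup-site shape -- every helper here is a stand-in and the real `finalizeSession()` takes richer arguments; the point is only that all exit I/O funnels through one function, so new return sites have nothing to forget:

```typescript
// Hypothetical, fully stubbed; illustrates the funnel, not the real signatures.
type ResultTag = 'success' | 'agent_error' | 'timeout' | 'stuck';

const calls: string[] = []; // records cleanup steps for illustration

function writeExecutionStats(outcome: string): void { calls.push(`stats:${outcome}`); }
function deleteSidecar(): void { calls.push('sidecar'); }
function emitSessionCompleted(tag: ResultTag): void { calls.push(`event:${tag}`); }
function unregister(status: 'completed' | 'failed'): void { calls.push(`registry:${status}`); }

// One function owns stats, sidecar deletion, event emission, and registry
// cleanup; every result path calls it exactly once.
function finalizeSession(tag: ResultTag): void {
  writeExecutionStats(tag === 'agent_error' ? 'error' : tag);
  deleteSidecar();
  emitSessionCompleted(tag);
  unregister(tag === 'success' ? 'completed' : 'failed');
}
```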
package/docs/vision.md CHANGED
@@ -14,7 +14,7 @@ WorkTrain runs the workrail repository as one of its own workspaces. It picks up
14
14
 
15
15
  This creates a direct feedback loop: if WorkTrain's development pipeline is flawed, it will produce flawed changes to itself and catch them in review. If its context injection is thin, it will miss things in its own codebase that a well-briefed agent would catch. The quality of WorkTrain's output is the quality of WorkTrain.
16
16
 
17
- The self-improvement loop is not fully operational today. The pieces -- coordinator session chaining, full development pipeline, spec as ground truth, living work context -- are being built. But it is the north star. If WorkTrain cannot build WorkTrain well, it cannot be trusted to build anything else.
17
+ The self-improvement loop is not fully operational today, but it is the north star. If WorkTrain cannot build WorkTrain well, it cannot be trusted to build anything else.
18
18
 
19
19
  ## What success looks like
20
20
 
@@ -34,7 +34,7 @@ WorkTrain earns trust over time by doing this correctly, repeatedly, at scale --
34
34
 
35
35
  **Zero LLM turns for routing.** Coordinator decisions -- what workflow to run next, whether findings are blocking, when to merge -- are deterministic TypeScript code. LLM turns are used for cognitive work: understanding code, writing code, evaluating findings. Never for deciding "what do I do next?".
36
36
 
37
- **Structured outputs at every boundary.** Each phase produces a typed result. The next phase reads that result. Free-text scraping between phases is a design smell. `ChildSessionResult`, `wr.coordinator_result`, `wr.review_verdict` are the contracts that make phases composable without a main agent holding context.
37
+ **Structured outputs at every boundary.** Each phase produces a typed result. The next phase reads that result. Free-text scraping between phases is a design smell. Typed contracts at phase boundaries are what make phases composable without a main agent holding context.
38
38
 
39
39
  **Correctness over speed.** WorkTrain does not merge changes it is not confident in. Review findings are addressed. Tests pass. The right next step is not always the fastest one.
40
40
 
@@ -88,18 +88,6 @@ WorkTrain does not pause for: implementation decisions within a well-specified t
88
88
 
89
89
  This boundary is still being tested and refined through real usage. Where exactly "genuine ambiguity" begins is an open question.
90
90
 
91
- ## What is still being built
92
-
93
- WorkTrain is not finished. The vision above is where it is going, not where it is today. Key pieces still in progress:
94
-
95
- - **Living work context** -- shared knowledge store that accumulates across all phases so every agent starts informed (`docs/ideas/backlog.md`: "Living work context")
96
- - **Coordinator pipeline templates** -- actual coordinator scripts for full development pipeline, bug-fix, grooming (`docs/ideas/backlog.md`: "Scripts-first coordinator")
97
- - **`worktrain spawn`/`await` CLI** -- CLI surface for coordinator scripts
98
- - **Knowledge graph** -- per-workspace structural understanding so agents skip discovery on repeated tasks
99
- - **Spec as ground truth** -- wiring `wr.shaping` output into coordinator dispatch so coding/review agents work from the same spec
100
-
101
- For the current prioritized list, see `npm run backlog` or `docs/ideas/backlog.md`.
102
-
103
91
  ## Open questions
104
92
 
105
93
  These are genuinely unresolved. Any agent operating in this system should know they exist and not assume they are answered.
@@ -112,4 +100,6 @@ These are genuinely unresolved. Any agent operating in this system should know t
112
100
 
113
101
  - **What is the right granularity of tasks?** WorkTrain is being designed for ticket-sized work. Whether it handles epics (by decomposing them), hotfixes (by moving fast and deferring thoroughness), and architectural changes (which may require multiple sessions across multiple days) the same way is untested.
114
102
 
115
- - **Is "document" the right abstraction for the living work context?** A flat document implies agents read it linearly. Agents need to query it selectively -- the coding agent wants constraints relevant to a specific decision, the review agent wants what the coding agent said about a specific module. A structured knowledge store (typed facts, queryable by topic) may be more useful than a document. See `docs/ideas/backlog.md`: "Living work context".
103
+ - **Is typed-artifact-per-phase the right abstraction for inter-phase context?** The current model threads structured handoff artifacts between pipeline phases. Whether this is sufficient long-term, or whether a queryable per-workspace knowledge store (indexed by topic, accessible across pipeline runs and across tasks) is needed for things like codebase-specific priors and accumulated project memory, is an open question. See `docs/ideas/backlog.md`: "Knowledge graph".
104
+
105
+ For current priorities and status, run `npm run backlog` or read `docs/ideas/backlog.md`.