@exaudeus/workrail 3.73.2 → 3.74.1

@@ -3,6 +3,8 @@
3
3
  Workflow and feature ideas worth capturing but not yet planned or designed.
4
4
  For historical narrative and sprint journals, see `docs/history/worktrain-journal.md`.
5
5
 
6
+ **Before reading this backlog, read the vision:** `docs/vision.md` -- what WorkTrain is, what success looks like, and the principles every decision is held against. Every item in this backlog should serve that vision. If it doesn't, it shouldn't be here.
7
+
6
8
  **To see a sorted priority view, run:**
7
9
  ```bash
8
10
  npm run backlog # full list, grouped by blocked/unblocked
@@ -12,88 +14,77 @@ npm run backlog -- --help # all options
12
14
  ```
13
15
 
14
16
  Each item has a score line: `**Score: N** | Cor:N Cap:N Eff:N Lev:N Con:N | Blocked: ...`
15
- See the scoring rubric in the "Agent-assisted backlog prioritization" entry (WorkTrain Daemon section).
16
17
 
17
- ---
18
+ **When adding a new backlog item, score it using this rubric.** Five dimensions, each 1-3. Score = sum (max 15).
18
19
 
19
- ## P0 / Critical (blocks WorkTrain from working correctly)
20
+ | Dimension | 3 | 2 | 1 |
21
+ |---|---|---|---|
22
+ | **Correctness** | Silent wrong output, crash, or skipped safety gate | Degraded behavior, misleading output, test coverage gap | No effect on correctness |
23
+ | **Capability** | Meaningfully expands what WorkTrain can do or who can use it | Reduces friction for an *active* use case today | Polish, internal quality, or nothing anyone is actively blocked by right now |
24
+ | **Effort** (inverted) | Hours to a day or two | A few days to a week | Weeks or longer, significant design work needed first |
25
+ | **Leverage** | Prerequisite for multiple other items | Enables one or two downstream items | Standalone, nothing depends on it |
26
+ | **Confidence** | Clear problem, clear direction, just needs implementation | Problem is clear, but has open questions to hash out first | Still needs discovery or design before work can begin |
20
27
 
21
- ### wr.coding-task implementation loop does not exit when slices complete (Apr 30, 2026)
28
+ **Blocked flag:** annotate with *what* the item is blocked by -- "Blocked: needs knowledge graph" vs "Blocked: needs dispatchCondition" carry very different timelines. Blocked items are listed separately regardless of score.
22
29
 
23
- **Status: bug** | Priority: high
30
+ **Scoring notes:**
31
+ - Score the first actionable phase, not the full vision. A Phase 1 that is two days of work should not score Effort 1 just because Phase 3 is months away.
32
+ - Tiebreaker at equal score: prefer the item that makes the next item easier to execute.
33
+ - Capability 2 = reduces friction for an *active* use case today (not something hypothetical).
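+
+ A minimal sketch of how a score line in this format could be parsed and checked against the rubric -- illustrative only, not the actual `npm run backlog` implementation:
+
+ ```typescript
+ // Illustrative sketch -- not the real `npm run backlog` script.
+ // Parses a line of the form:
+ //   **Score: 13** | Cor:3 Cap:1 Eff:2 Lev:2 Con:3 | Blocked: no
+ interface ScoreLine {
+   total: number;
+   dims: Record<'cor' | 'cap' | 'eff' | 'lev' | 'con', number>;
+   blocked: boolean;
+ }
+
+ function parseScoreLine(line: string): ScoreLine | null {
+   const m = line.match(
+     /\*\*Score: (\d+)\*\* \| Cor:(\d) Cap:(\d) Eff:(\d) Lev:(\d) Con:(\d) \| Blocked: (.+)/
+   );
+   if (!m) return null;
+   const [, total, cor, cap, eff, lev, con, blocked] = m;
+   const dims = { cor: +cor, cap: +cap, eff: +eff, lev: +lev, con: +con };
+   // Per the rubric, the stated total should equal the sum of the five dimensions (max 15).
+   const sum = Object.values(dims).reduce((a, b) => a + b, 0);
+   if (sum !== +total) console.warn(`Score mismatch: stated ${total}, dimensions sum to ${sum}`);
+   return { total: +total, dims, blocked: !/^no\b/.test(blocked.trim()) };
+ }
+ ```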
24
34
 
25
- **Score: 13** | Cor:3 Cap:1 Eff:2 Lev:2 Con:3 | Blocked: no
35
+ ---
26
36
 
27
- The `wr.coding-task` workflow's implementation loop (up to 20 passes) does not exit when all slices are complete. The `wr.loop_control` stop artifact is emitted correctly but the loop decision gate never fires because `currentSlice.name` remains `[unset]` -- the engine is not tracking which slice is current across passes. The loop ran 8 passes before eventually exiting on its own.
37
+ **How to write a backlog item.** Every entry should follow this shape:
28
38
 
29
- This means: (1) every coding task session wastes passes doing no work, (2) the agent cannot confidently signal completion, (3) total session turn count is inflated, increasing cost and timeout risk.
39
+ ```
40
+ ### Title (Date)
30
41
 
31
- **Root cause**: the `slices` array is stored in context but the engine does not advance a `currentSliceIndex` counter -- or the counter is not being surfaced to the step as `currentSlice.name`. The `wr.loop_control` artifact is evaluated at the loop decision step, but that step only fires when the engine recognizes it's at the end of a pass. With `currentSlice.name = [unset]`, the recognition fails.
42
+ **Status: idea | bug | partial | done** | Priority: high/medium/low
32
43
 
33
- **Things to hash out:**
34
- - Is the bug in the workflow JSON (slices not wired to currentSlice tracking), in the engine (loop_control artifact evaluation), or in the way context variables are threaded between passes?
35
- - Does the issue affect all loops with `wr.loop_control`, or only the implementation loop in `wr.coding-task` specifically?
36
- - Is there a workaround agents can use today (e.g. setting a specific context variable that the loop decision gate does check)?
37
- - Should the loop decision gate fire after every pass regardless of `currentSlice.name` state, or only when the slice tracking is valid?
44
+ **Score: N** | Cor:N Cap:N Eff:N Lev:N Con:N | Blocked: no / yes (blocked by X)
38
45
 
39
- ---
40
-
41
- ### Intent gap: agent builds what it understood, not what the user meant (Apr 30, 2026)
46
+ [2-4 sentences stating the problem plainly. What is wrong or missing? Why does it matter?
47
+ No proposed solutions here -- just the problem.]
42
48
 
43
- **Status: idea** | Priority: high
49
+ **Things to hash out:**
50
+ - [Open question that needs a decision before design can begin]
51
+ - [Another open question -- constraint, tradeoff, interaction with other systems]
52
+ - [Keep these honest -- don't fill this section with questions you already know the answer to]
53
+ ```
44
54
 
45
- **Score: 13** | Cor:3 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no
55
+ **Rules for writing entries:**
56
+ - **State the problem, not the solution.** "There is no way to invoke a routine directly" not "We should add a `worktrain invoke` command."
57
+ - **No steering.** Don't tell future implementers how to build it. Capture what needs to exist, not how to make it exist.
58
+ - **Things to hash out = genuine open questions.** Only include questions that actually need to be answered before design can start. If you know the answer, state it in the problem description.
59
+ - **Relationships matter.** If this item depends on another, or would be superseded by another, name it explicitly.
60
+ - **Be specific about what "done" looks like** when it's not obvious -- e.g. "done means an operator can invoke any routine by name from the CLI without writing a workflow."
46
61
 
47
- This is one of the most fundamental failure modes for autonomous WorkTrain sessions and a blocker for production viability. An agent receives a task description, forms an interpretation of what's needed, and executes flawlessly against that interpretation -- but the interpretation was wrong. The code is correct for what the agent thought was asked. It is not what the user actually wanted. The user only discovers this after reviewing the PR, sometimes after it has already merged.
62
+ ---
48
63
 
49
- This is categorically different from bugs (the agent implemented the right thing incorrectly) and scope creep (the agent did extra things). This is the agent solving the wrong problem well.
64
+ ## P0 / Critical (blocks WorkTrain from working correctly)
50
65
 
51
- **Why it's hard:** the agent's interpretation feels reasonable from the task description. The user's description was ambiguous, underspecified, or relied on context the agent didn't have. Neither party made an obvious mistake -- the gap is structural.
66
+ ### wr.coding-task forEach loop exposes broken agent-facing state (Apr 30, 2026)
52
67
 
53
- **Known manifestations:**
54
- - Agent fixes the symptom instead of the root cause because the task description named the symptom
55
- - Agent implements feature X when the user wanted feature Y that happens to use X
56
- - Agent interprets "add support for Z" as extending the existing system when the user wanted a new abstraction
57
- - Agent makes a local fix when the user wanted an architectural change
58
- - Agent's implementation is technically correct but violates unstated invariants the user assumed were obvious
59
-
60
- **Things to hash out:**
61
- - Where in the workflow should intent validation happen? Before the agent writes any code (Phase 0), the agent should be required to state its interpretation back in plain English. The user (or a validation step) confirms or corrects it before implementation begins. But this requires a human confirmation gate -- does that break the autonomous use case?
62
- - For fully autonomous sessions (no human in the loop), is there a way to detect a likely intent gap before the agent commits? Signals might include: the task description is short or vague, the agent's interpretation involves a significant architectural decision, the agent is about to delete or restructure existing code.
63
- - What is the right escalation path when the agent detects ambiguity itself? Currently `report_issue` handles task obstacles; there is no structured way for the agent to surface "I am not sure I understood this correctly" before acting.
64
- - The `wr.shaping` workflow exists precisely to close this gap for planned features -- the issue is urgent/reactive tasks that skip shaping entirely. How do we get intent validation without requiring a full shaping pass for every small task?
65
- - Can historical session notes help? If previous sessions have established what "X" means in this codebase (design decisions, naming conventions, architectural invariants), injecting that context before Phase 0 reduces the gap. This points toward the knowledge graph and persistent project memory as partial solutions.
66
- - Should WorkTrain have an explicit "confirm interpretation" step as a configurable option per trigger? A `requireIntentConfirmation: true` flag on the trigger that blocks autonomous start until the operator approves the agent's stated interpretation via the console or CLI.
68
+ **Status: bug** | Priority: high
67
69
 
68
- ---
70
+ **Score: 13** | Cor:3 Cap:1 Eff:2 Lev:2 Con:3 | Blocked: no
69
71
 
70
- ### Scope rationalization: agent silently accepts collateral damage (Apr 30, 2026)
72
+ The `phase-6-implement-slices` loop (forEach over `slices`) was mechanically correct -- it iterated all 8 slices and stopped. But the agent-facing representation was broken in ways that violate WorkRail's promise of consistency and determinism:
71
73
 
72
- **Status: idea** | Priority: high
74
+ 1. **`currentSlice.name` showed `[unset]`** -- the agent was inside a forEach loop over `slices` with `itemVar: "currentSlice"`, but the template variable wasn't being projected into sessionContext before rendering. The agent couldn't see which slice it was on. This is an engine rendering issue in `buildLoopRenderContext` / `prompt-renderer.ts`.
73
75
 
74
- **Score: 13** | Cor:3 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no
76
+ 2. **Agent emitted `wr.loop_control` artifacts that had no effect** -- the forEach loop silently ignores these. The agent did useless work the engine discarded without signaling that this was happening. A correct system should either prevent the agent from emitting artifacts that can't affect the loop, or tell the agent explicitly that artifact-based exit isn't available in this loop type.
75
77
 
76
- When an agent makes a change that breaks or degrades something outside its immediate task scope, it often recognizes the impact but rationalizes it as acceptable because "that's not in scope for this task." The reasoning feels locally valid -- the agent was asked to do X, X is done correctly, the side effect on Y is noted but deprioritized. This produces a PR that is correct for X and silently broken for Y.
78
+ 3. **Loop presented as "Pass N of 20" not "Slice N of 8"** -- the framing confused the agent about what was happening. The agent should be told it's iterating over concrete slices, not burning through a budget.
77
79
 
78
- This is exactly what happened with the commit SHA change: setting `agentCommitShas` to always empty correctly fixes the faked SHA bug, but degrades the console's SHA display for all sessions going forward. A scoped agent might note "this makes the console show empty SHAs" and proceed anyway because fixing the console display is "a separate ticket."
80
+ The forEach loop *worked* but the agent experience was wrong. This matters because WorkRail's value is that agents should not be confused about their own loop state. An agent that emits useless artifacts, can't see its own iteration variable, and misunderstands whether the loop is progress-based or budget-based is not operating under the deterministic, correct framework WorkRail promises.
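+
+ For illustration, the projection the engine would need to perform before rendering a forEach step looks roughly like this -- the shapes are assumptions sketched from this entry, not the actual engine types in `prompt-renderer.ts`:
+
+ ```typescript
+ // Sketch only: assumed shapes, not the real WorkRail engine API.
+ interface ForEachFrame {
+   items: unknown[]; // e.g. the `slices` array from sessionContext
+   index: number;    // 0-based position within items
+   itemVar: string;  // e.g. "currentSlice"
+ }
+
+ function buildLoopRenderContext(
+   sessionContext: Record<string, unknown>,
+   frame: ForEachFrame
+ ): Record<string, unknown> {
+   return {
+     ...sessionContext,
+     // Project the current item under its declared itemVar so templates like
+     // {{currentSlice.name}} resolve instead of falling back to "[unset]".
+     [frame.itemVar]: frame.items[frame.index],
+     // Surface progress as "slice 3 of 8" rather than a pass budget.
+     loopProgress: { current: frame.index + 1, total: frame.items.length },
+   };
+ }
+ ```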
79
81
 
80
- **Why this is insidious:** the agent's reasoning is locally coherent. It did not make a mistake within its scope. The problem is that autonomous agents operating in isolation cannot always see when a locally correct change has unacceptable global consequences -- and even when they can see it, they lack a good mechanism to stop, escalate, and surface the impact rather than proceeding.
81
-
82
- **Known manifestations:**
83
- - Agent correctly fixes a bug but the fix changes a public API contract, breaking callers it didn't check
84
- - Agent refactors a module for clarity but silently changes behavior in an edge case it considered minor
85
- - Agent adds a feature but disables or degrades an existing feature as a side effect, judging the tradeoff acceptable on its own
86
- - Agent's change passes all tests but the tests don't cover the degraded behavior
87
- - Agent notes a downstream impact in session notes but does not block, escalate, or file a follow-up ticket
88
- - **Agent reframes a bug as "a key tradeoff to document."** This is a specific and common failure: the agent detects a real problem it caused, correctly identifies that it's a problem, and instead of filing it as a bug or escalating, reclassifies it as an "accepted design decision" or "known limitation" in documentation. The bug is real. Documenting it is not fixing it. This pattern actively buries bugs.
82
+ **GitHub issue:** https://github.com/EtienneBBeaulac/workrail/issues/920
89
83
 
90
84
  **Things to hash out:**
91
- - How does an agent distinguish "acceptable tradeoff within scope" from "collateral damage that must be escalated"? The line is fuzzy and context-dependent. A hard rule ("never degrade existing behavior") is too strict for refactors; a soft heuristic ("if it affects other code, escalate") is too broad.
92
- - Should the agent be required to enumerate side effects as part of the verification phase, and should the coordinator review that list before merging? This is the proof record concept applied to impact assessment rather than just correctness.
93
- - What is the right mechanism for the agent to pause and escalate? Currently `report_issue` is for task obstacles; `signal_coordinator` is for coordinator events. There is no structured "I need a decision on whether this tradeoff is acceptable" signal.
94
- - Test coverage is the obvious mitigation -- if Y has tests, the agent's change would fail them. But not everything has tests, and agents can rationalize skipping test runs for "unrelated" paths.
95
- - Is there a way to detect likely collateral damage statically before the agent acts? A pre-commit check that measures what changed beyond the declared `filesChanged` list, for example, could surface unexpected side effects automatically.
96
- - The knowledge graph and architectural invariant rules (pattern and architecture validation) are partial solutions -- they can flag when a change violates a declared constraint. But they only work for constraints that have been explicitly codified.
85
+ - Is `currentSlice.name = [unset]` a bug in `buildLoopRenderContext` (engine fix needed), or is it a workflow authoring issue (the slices array items don't have a `name` property)?
86
+ - Should the engine prevent agents from emitting `wr.loop_control` artifacts inside forEach loops, or simply document that they have no effect?
87
+ - Should forEach loops surface iteration progress ("slice 3 of 8") differently than while loops ("pass 3 of 20") in the step header text?
97
88
 
98
89
  ---
99
90
 
@@ -177,9 +168,101 @@ The delivery pipeline was extracted into `delivery-pipeline.ts` with explicit st
177
168
 
178
169
  ## WorkTrain Daemon
179
170
 
171
+ ### Intent gap: agent builds what it understood, not what the user meant (Apr 30, 2026)
172
+
173
+ **Status: idea** | Priority: medium
174
+
175
+ **Score: 13** | Cor:3 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no
176
+
177
+ This is one of the most fundamental failure modes for autonomous WorkTrain sessions and a blocker for production viability. An agent receives a task description, forms an interpretation of what's needed, and executes flawlessly against that interpretation -- but the interpretation was wrong. The code is correct for what the agent thought was asked. It is not what the user actually wanted. The user only discovers this after reviewing the PR, sometimes after it has already merged.
178
+
179
+ This is categorically different from bugs (the agent implemented the right thing incorrectly) and scope creep (the agent did extra things). This is the agent solving the wrong problem well.
180
+
181
+ **Why it's hard:** the agent's interpretation feels reasonable from the task description. The user's description was ambiguous, underspecified, or relied on context the agent didn't have. Neither party made an obvious mistake -- the gap is structural.
182
+
183
+ **Known manifestations:**
184
+ - Agent fixes the symptom instead of the root cause because the task description named the symptom
185
+ - Agent implements feature X when the user wanted feature Y that happens to use X
186
+ - Agent interprets "add support for Z" as extending the existing system when the user wanted a new abstraction
187
+ - Agent makes a local fix when the user wanted an architectural change
188
+ - Agent's implementation is technically correct but violates unstated invariants the user assumed were obvious
189
+
190
+ **Done looks like:** a WorkTrain session that receives an ambiguous or underspecified task either (a) states its interpretation explicitly before acting so the coordinator can gate on approval, or (b) has access to enough prior context (from the knowledge graph or living work context) that the interpretation is reliably correct. A session that builds the wrong thing well should be detectable before it merges, not after.
191
+
192
+ **Things to hash out:**
193
+ - Where in the workflow should intent validation happen? Before the agent writes any code (Phase 0), the agent should be required to state its interpretation back in plain English. The user (or a validation step) confirms or corrects it before implementation begins. But this requires a human confirmation gate -- does that break the autonomous use case?
194
+ - For fully autonomous sessions (no human in the loop), is there a way to detect a likely intent gap before the agent commits? Signals might include: the task description is short or vague, the agent's interpretation involves a significant architectural decision, the agent is about to delete or restructure existing code.
195
+ - What is the right escalation path when the agent detects ambiguity itself? Currently `report_issue` handles task obstacles; there is no structured way for the agent to surface "I am not sure I understood this correctly" before acting.
196
+ - The `wr.shaping` workflow exists precisely to close this gap for planned features -- the issue is urgent/reactive tasks that skip shaping entirely. How do we get intent validation without requiring a full shaping pass for every small task?
197
+ - Can historical session notes help? If previous sessions have established what "X" means in this codebase (design decisions, naming conventions, architectural invariants), injecting that context before Phase 0 reduces the gap. This points toward the knowledge graph and persistent project memory as partial solutions.
198
+ - Should WorkTrain have an explicit "confirm interpretation" step as a configurable option per trigger? A `requireIntentConfirmation: true` flag on the trigger that blocks autonomous start until the operator approves the agent's stated interpretation via the console or CLI.
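+
+ To make the last question concrete, a trigger-level option could look something like this -- a hypothetical sketch, not an existing config surface:
+
+ ```typescript
+ // Hypothetical trigger configuration -- every field name here is illustrative.
+ interface TriggerConfig {
+   workflowId: string;                  // e.g. "wr.coding-task"
+   requireIntentConfirmation?: boolean; // hold the session until the operator approves the stated interpretation
+ }
+
+ // Hypothetical gate the daemon would apply before autonomous start.
+ function canStartAutonomously(trigger: TriggerConfig, intentApproved: boolean): boolean {
+   if (!trigger.requireIntentConfirmation) return true;
+   return intentApproved;
+ }
+ ```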
199
+
200
+ ---
201
+
202
+ ### Scope rationalization: agent silently accepts collateral damage (Apr 30, 2026)
203
+
204
+ **Status: idea** | Priority: medium
205
+
206
+ **Score: 13** | Cor:3 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no
207
+
208
+ When an agent makes a change that breaks or degrades something outside its immediate task scope, it often recognizes the impact but rationalizes it as acceptable because "that's not in scope for this task." The reasoning feels locally valid -- the agent was asked to do X, X is done correctly, the side effect on Y is noted but deprioritized. This produces a PR that is correct for X and silently broken for Y.
209
+
210
+ This is exactly what happened with the commit SHA change: setting `agentCommitShas` to always empty correctly fixes the faked SHA bug, but degrades the console's SHA display for all sessions going forward. A scoped agent might note "this makes the console show empty SHAs" and proceed anyway because fixing the console display is "a separate ticket."
211
+
212
+ **Why this is insidious:** the agent's reasoning is locally coherent. It did not make a mistake within its scope. The problem is that autonomous agents operating in isolation cannot always see when a locally correct change has unacceptable global consequences -- and even when they can see it, they lack a good mechanism to stop, escalate, and surface the impact rather than proceeding.
213
+
214
+ **Known manifestations:**
215
+ - Agent correctly fixes a bug but the fix changes a public API contract, breaking callers it didn't check
216
+ - Agent refactors a module for clarity but silently changes behavior in an edge case it considered minor
217
+ - Agent adds a feature but disables or degrades an existing feature as a side effect, judging the tradeoff acceptable on its own
218
+ - Agent's change passes all tests but the tests don't cover the degraded behavior
219
+ - Agent notes a downstream impact in session notes but does not block, escalate, or file a follow-up ticket
220
+ - **Agent reframes a bug as "a key tradeoff to document."** This is a specific and common failure: the agent detects a real problem it caused, correctly identifies that it's a problem, and instead of filing it as a bug or escalating, reclassifies it as an "accepted design decision" or "known limitation" in documentation. The bug is real. Documenting it is not fixing it. This pattern actively buries bugs.
221
+
222
+ **Done looks like:** when an agent makes a change that degrades something outside its scope, it surfaces the degradation explicitly before the PR merges -- either by blocking (filing a follow-up issue as a condition of the current PR merging) or escalating to the coordinator for a decision. A PR that silently buries a regression in a comment or documentation should not pass review.
223
+
224
+ **Things to hash out:**
225
+ - How does an agent distinguish "acceptable tradeoff within scope" from "collateral damage that must be escalated"? The line is fuzzy and context-dependent. A hard rule ("never degrade existing behavior") is too strict for refactors; a soft heuristic ("if it affects other code, escalate") is too broad.
226
+ - Should the agent be required to enumerate side effects as part of the verification phase, and should the coordinator review that list before merging? This is the proof record concept applied to impact assessment rather than just correctness.
227
+ - What is the right mechanism for the agent to pause and escalate? Currently `report_issue` is for task obstacles; `signal_coordinator` is for coordinator events. There is no structured "I need a decision on whether this tradeoff is acceptable" signal.
228
+ - Test coverage is the obvious mitigation -- if Y has tests, the agent's change would fail them. But not everything has tests, and agents can rationalize skipping test runs for "unrelated" paths.
229
+ - Is there a way to detect likely collateral damage statically before the agent acts? A pre-commit check that measures what changed beyond the declared `filesChanged` list, for example, could surface unexpected side effects automatically.
230
+ - The knowledge graph and architectural invariant rules (pattern and architecture validation) are partial solutions -- they can flag when a change violates a declared constraint. But they only work for constraints that have been explicitly codified.
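+
+ One of the questions above mentions a pre-commit check against the declared `filesChanged` list. A minimal sketch of that idea, assuming the declared list is available to the check:
+
+ ```typescript
+ import { execSync } from 'node:child_process';
+
+ // Sketch: report files the agent touched but did not declare in `filesChanged`.
+ // How the declared list reaches this check is an assumption.
+ function undeclaredChanges(declared: string[]): string[] {
+   const actual = execSync('git diff --name-only HEAD', { encoding: 'utf8' })
+     .split('\n')
+     .filter(Boolean);
+   const declaredSet = new Set(declared);
+   return actual.filter((file) => !declaredSet.has(file));
+ }
+
+ // A non-empty result is a signal to escalate rather than proceed silently.
+ ```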
231
+
232
+ ---
233
+
180
234
  The autonomous workflow runner (`worktrain daemon`). Completely separate from the MCP server -- calls the engine directly in-process.
181
235
 
182
236
 
237
+ ### Subagent context package: project vision and task goal baked into spawning (Apr 30, 2026)
238
+
239
+ **Status: idea** | Priority: high
240
+
241
+ **Score: 12** | Cor:2 Cap:3 Eff:2 Lev:3 Con:3 | Blocked: no
242
+
243
+ When WorkTrain spawns a subagent today, the operator (or the main agent) must manually write out all context: what the project is, what WorkTrain's vision is, what the task is trying to accomplish, what documents exist, what the end goal is. Subagents know nothing -- no conversation history, no project familiarity, no awareness of the vision. If the context briefing is thin or missing, the subagent works in the dark and produces generic output.
244
+
245
+ Two things need to be baked into the spawning infrastructure:
246
+
247
+ 1. **Project-level context package**: every spawned subagent automatically receives a synthesized briefing about the WorkTrain project -- what it is, what it is trying to become, the architectural layers (daemon vs MCP server vs console), the coding philosophy, and pointers to key docs (AGENTS.md, backlog.md, relevant design docs). This should not require the spawning agent to manually write it out each time.
248
+
249
+ 2. **Task-level context package**: every spawned subagent automatically receives the vision and end goal of the specific task -- not just the technical instructions, but WHY the task matters, what it enables, and how it fits into the larger picture. A subagent that understands the goal can adapt when it hits unexpected situations; one that only has instructions cannot.
250
+
251
+ This is related to the "Coordinator context injection standard" and "Context budget per spawned agent" backlog entries, but is broader -- it applies to all subagent spawning, not just coordinator-spawned child sessions.
252
+
253
+ **Critical design constraint:** WorkTrain may not always have a "main" agent assembling context dynamically. A pure coordinator pipeline is deterministic TypeScript code -- it knows the goal it was given and the results it gets back, but has no ambient understanding of the project vision and cannot synthesize what context a subagent needs at runtime. This means context packages cannot be assembled dynamically by the spawning agent; they must be **pre-built and attached as structured data**, assembled by the daemon from configured sources before the session starts. This is closer to the trigger-derived knowledge configuration idea than to runtime context assembly.
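+
+ As a sketch of what "pre-built and attached as structured data" could mean in practice -- all field names below are assumptions, not a designed schema:
+
+ ```typescript
+ // Hypothetical shape for a context package the daemon assembles from
+ // configured sources and attaches to every spawned subagent.
+ interface ContextPackage {
+   project: {
+     visionDoc: string;           // e.g. path to docs/vision.md
+     keyDocs: string[];           // AGENTS.md, backlog.md, relevant design docs
+     architectureSummary: string; // daemon vs MCP server vs console, coding philosophy
+   };
+   task: {
+     goal: string;                // why the task matters and what it enables
+     successCriteria: string[];
+     relatedSessions: string[];   // prior session notes worth reading
+   };
+ }
+ ```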
254
+
255
+ **Things to hash out:**
256
+ - Where does the project-level context package live and how is it kept current? A static template in `~/.workrail/daemon-soul.md` covers behavioral rules but not project vision -- these are different concerns.
257
+ - In a pure coordinator pipeline (no main agent), who decides what goes in the context package for each session type? Must be declared configuration, not runtime synthesis.
258
+ - Should context profiles be declared per workflow, per trigger type, or per session role (coding vs review vs discovery)?
259
+ - What is the right size for an auto-injected context package? Too small loses signal; too large crowds out the actual task prompt.
260
+ - Should the package be structured (JSON/YAML) for programmatic injection, or prose for human readability?
261
+ - How does this interact with the existing workspace context injection (CLAUDE.md, AGENTS.md, daemon-soul.md)?
262
+ - Is a "main" orchestrating agent needed at all, or are pure coordinator scripts plus well-configured context packages sufficient? This is an open question that requires real pipeline testing to answer.
263
+
264
+ ---
265
+
183
266
  ### Agent-assisted backlog and issue enrichment (Apr 28, 2026)
184
267
 
185
268
  **Status: idea** | Priority: medium
@@ -248,6 +331,40 @@ Five dimensions, each scored 1-3. Score = sum (max 15). Items marked **Blocked**
248
331
 
249
332
  ---
250
333
 
334
+ ### `delivery_failed` unreachable in `getChildSessionResult` -- type promises more than code delivers (Apr 30, 2026)
335
+
336
+ **Status: bug** | Priority: medium
337
+
338
+ **Score: 10** | Cor:3 Cap:1 Eff:2 Lev:2 Con:2 | Blocked: no
339
+
340
+ `ChildSessionResult` has `reason: 'delivery_failed'` as a variant of `kind: 'failed'`. However, `fetchChildSessionResult` in `coordinator-deps.ts` reads session status through `ConsoleService.getSessionDetail`, which returns statuses like `complete`/`blocked`/`in_progress` -- it never returns a `delivery_failed` status. `delivery_failed` is a `TriggerRouter`-level concept (callbackUrl POST failure) that is not stored as a session status in the event log. Child sessions spawned via `spawnSession`/`spawnAndAwait` have no `callbackUrl` and cannot produce it through this code path.
341
+
342
+ The result: coordinators using `getChildSessionResult` can never observe `reason: 'delivery_failed'`, even though the type says they might. This violates the "make illegal states unrepresentable" principle -- the type union promises a variant the implementation cannot produce on this path.
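+
+ For reference, the shape in question looks roughly like this -- a simplified sketch, with variants other than the one named in this entry assumed:
+
+ ```typescript
+ // Simplified sketch of the discriminated union; only the `delivery_failed`
+ // variant is confirmed by this entry, the rest is assumed for illustration.
+ type ChildSessionResult =
+   | { kind: 'completed' }
+   | { kind: 'failed'; reason: 'delivery_failed' | 'error' }
+   | { kind: 'timed_out' };
+
+ // fetchChildSessionResult maps ConsoleService statuses (complete / blocked /
+ // in_progress) into this union, and no status on that path ever maps to
+ // reason: 'delivery_failed' -- which is why the variant is unreachable.
+ ```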
343
+
344
+ **Architectural fix (not a comment):** surface `delivery_failed` through session status. When `TriggerRouter` records a `delivery_failed` outcome, write a corresponding session event or status that `ConsoleService.getSessionDetail` returns. Then `fetchChildSessionResult` can map it correctly. This closes the gap between what the type promises and what the infrastructure delivers.
345
+
346
+ Alternative: if `spawnSession`/`spawnAndAwait` child sessions genuinely cannot have `delivery_failed` outcomes by design, remove `reason: 'delivery_failed'` from `ChildSessionResult` entirely and document that it only exists in `spawn_agent`'s direct outcome mapping.
347
+
348
+ **Things to hash out:**
349
+ - Should `delivery_failed` be surfaced through ConsoleService (requires touching session status storage), or removed from `ChildSessionResult` since the `spawnSession` path provably cannot produce it?
350
+ - If surfaced: what event or field in the session store carries this status, and how does ConsoleService project it?
351
+
352
+ ---
353
+
354
+ ### `spawnAndAwait` duplicates ~90 lines of polling logic from `awaitSessions` (Apr 30, 2026)
355
+
356
+ **Status: tech debt** | Priority: low
357
+
358
+ **Score: 8** | Cor:1 Cap:1 Eff:2 Lev:1 Con:3 | Blocked: no
359
+
360
+ `spawnAndAwait` in `coordinator-deps.ts` contains an inline polling loop (~90 lines) that duplicates the logic in `awaitSessions`. The WHY comment explains a real construction-time constraint: object literals cannot reference sibling methods by name during construction. But this constraint applies to methods on the returned object -- it does not apply to closure-level functions, which are already used for `fetchAgentResult` and `fetchChildSessionResult`.
361
+
362
+ **Fix:** extract a `pollUntilTerminal(handles: string[], timeoutMs: number): Promise<'completed' | 'timed_out' | 'degraded'>` closure-level function (before the `return {}` block). Have both `awaitSessions` and `spawnAndAwait` call it. This eliminates the duplication without violating the construction-time constraint.
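+
+ A sketch of the proposed extraction; the status lookup is passed in explicitly here to keep the example self-contained, and the status values are assumptions:
+
+ ```typescript
+ type SessionStatus = 'complete' | 'blocked' | 'failed' | 'in_progress';
+
+ // Closure-level helper defined before the `return {}` block, so both
+ // awaitSessions and spawnAndAwait can call it without referencing sibling
+ // methods during construction.
+ async function pollUntilTerminal(
+   handles: string[],
+   timeoutMs: number,
+   getStatus: (handle: string) => Promise<SessionStatus>,
+   pollIntervalMs = 5_000
+ ): Promise<'completed' | 'timed_out' | 'degraded'> {
+   const deadline = Date.now() + timeoutMs;
+   while (Date.now() < deadline) {
+     const statuses = await Promise.all(handles.map(getStatus));
+     if (statuses.every((s) => s === 'complete')) return 'completed';
+     if (statuses.some((s) => s === 'blocked' || s === 'failed')) return 'degraded';
+     await new Promise((resolve) => setTimeout(resolve, pollIntervalMs));
+   }
+   return 'timed_out';
+ }
+ ```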
363
+
364
+ **GitHub issue:** https://github.com/EtienneBBeaulac/workrail/issues/921
365
+
366
+ ---
367
+
251
368
  ### Daemon architecture: remaining migrations (Apr 29, 2026)
252
369
 
253
370
  **Status: partial** | A9 shipped Apr 29, 2026.
@@ -922,6 +1039,31 @@ Combined with the `DEFAULT_MAX_TURNS` cap, this provides defense-in-depth agains
922
1039
 
923
1040
  The durable session store, v2 engine, and workflow authoring features shared by all three systems.
924
1041
 
1042
+ ### WorkTrain as the canonical workflow author -- MCP as a derived runtime (Apr 30, 2026)
1043
+
1044
+ **Status: idea** | Priority: high
1045
+
1046
+ **Score: 13** | Cor:2 Cap:3 Eff:1 Lev:3 Con:2 | Blocked: no
1047
+
1048
+ Today workflows are authored once and expected to work identically in both runtimes: the WorkRail MCP server (human-in-the-loop, Claude Code) and the WorkTrain daemon (fully autonomous, coordinator-driven). In practice they don't -- a workflow authored for human use has `requireConfirmation` gates that block autonomous execution, step prompts that assume the human is reading them, and phase structures that assume a single continuous session. Conversely, a workflow good for autonomous use has no natural pause points, produces typed structured outputs that humans find hard to read mid-session, and chains phases that a human might want to interrupt.
1049
+
1050
+ The current response is to author separate "agentic variants" (`wr.coding-task` vs `coding-task-workflow.agentic.v2`). This is the wrong direction: it creates duplicate maintenance burden, improvements to one don't propagate to the other, and it means there is no single source of truth for what a workflow does.
1051
+
1052
+ There should be one version of each workflow, not two. Improvements to one should benefit the other automatically. The self-improvement loop WorkTrain runs on its own workflows should produce better workflows for everyone, not just daemon sessions. The question is how to structure authorship and any adaptation layer so this is possible without forcing workflows into an awkward compromise that works poorly in both contexts.
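+
+ Purely as an illustration of one possible shape for that adaptation layer -- structural only, and every type below is an assumption, not the real workflow schema:
+
+ ```typescript
+ // Assumed, simplified step shape.
+ interface WorkflowStep {
+   id: string;
+   prompt: string;
+   requireConfirmation?: boolean;
+ }
+
+ type Runtime = 'mcp' | 'daemon';
+
+ // One authored workflow, structurally adapted per runtime: the daemon
+ // auto-satisfies confirmation gates, MCP keeps them as human pause points.
+ function adaptForRuntime(steps: WorkflowStep[], runtime: Runtime): WorkflowStep[] {
+   if (runtime === 'mcp') return steps;
+   return steps.map((step) => ({ ...step, requireConfirmation: false }));
+ }
+ ```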
1053
+
1054
+ **What this enables:** WorkTrain can autonomously improve workflows using `wr.workflow-for-workflows`, and those improvements automatically benefit MCP users. The self-improvement loop produces better workflows for everyone, not just daemon sessions. Workflow quality compounds because there is only one version to improve.
1055
+
1056
+ **Relationship to existing entries:**
1057
+ - "Workflow runtime adapter: one spec, two runtimes" (Shared/Engine) is a narrower version of this idea focused on parallelism and `requireConfirmation` gates. This entry is about the authoring philosophy and source-of-truth question, not just the adapter mechanics.
1058
+ - `wr.workflow-for-workflows` is how WorkTrain improves workflows autonomously -- this entry determines what it improves toward.
1059
+
1060
+ **Things to hash out:**
1061
+ - What does the MCP conversion layer actually do? Adding pause points is straightforward. Adapting output formats (structured JSON → human-readable prose) may require active LLM translation, not just structural transformation.
1062
+ - Some workflow steps are genuinely different between runtimes -- a step that spawns parallel child sessions in the daemon doesn't have a clean MCP equivalent. Does the conversion layer skip those, simulate them sequentially, or require the author to declare a fallback?
1063
+ - If WorkTrain is the authoring target, existing workflows authored for MCP need migration. What is the migration path and who does it -- the author, WorkTrain itself, or a one-time script?
1064
+ - How do `requireConfirmation` gates fit? In the daemon they are removed or auto-satisfied by the coordinator. In MCP they pause for the human. Does the workflow declare them or does the conversion layer infer them?
1065
+ - Is the conversion layer purely structural (rearranging/omitting steps) or does it require understanding the semantic intent of each step?
1066
+
925
1067
 
926
1068
  ### Improve commit SHA gathering consistency in wr.coding-task
927
1069
 
@@ -1356,7 +1498,7 @@ Routing by `finding.category` from `wr.review_verdict`:
1356
1498
 
1357
1499
  ### Workflow execution time tracking and prediction
1358
1500
 
1359
- **Status: idea** | Priority: medium
1501
+ **Status: partial** | Tracking shipped; prediction/calibration layer not yet built
1360
1502
 
1361
1503
  **Score: 11** | Cor:1 Cap:2 Eff:3 Lev:2 Con:3 | Blocked: no
1362
1504
 
@@ -1834,10 +1976,14 @@ A proof record contains: `prNumber`, `goal`, `verificationChain` (array of `{ ki
1834
1976
 
1835
1977
  ### Scripts-first coordinator: avoid the main agent wherever possible (Apr 15, 2026)
1836
1978
 
1837
- **Status: idea** | Priority: high
1979
+ **Status: partial** | Foundation shipped PR #908 (Apr 30, 2026)
1838
1980
 
1839
1981
  **Score: 12** | Cor:1 Cap:3 Eff:2 Lev:3 Con:3 | Blocked: no
1840
1982
 
1983
+ **What shipped:** `ChildSessionResult` discriminated union, `getChildSessionResult()`, `spawnAndAwait()`, `parentSessionId` threading, `wr.coordinator_result` artifact schema. The typed coordinator primitives that enable in-process coordinator scripts are now available.
1984
+
1985
+ **What's still needed:** the actual coordinator scripts (full development pipeline, bug-fix coordinator, grooming coordinator) and the `worktrain spawn`/`await` CLI commands that wrap these primitives for shell scripts.
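+
+ A rough sketch of the kind of coordinator script these primitives are meant to enable -- the signatures below are assumptions for illustration, not the shipped API:
+
+ ```typescript
+ // Every name here is assumed from this entry, not copied from coordinator-deps.ts.
+ interface CoordinatorDeps {
+   spawnAndAwait(req: { workflowId: string; goal: string }): Promise<{ sessionId: string }>;
+   getChildSessionResult(sessionId: string): Promise<{ kind: 'completed' | 'failed' | 'timed_out' }>;
+ }
+
+ // A deterministic bug-fix pipeline: spawn a coding session, check its typed
+ // result, and only proceed to review if it completed. No LLM in the loop.
+ async function bugFixPipeline(deps: CoordinatorDeps, issueUrl: string): Promise<'ok' | 'degraded'> {
+   const fix = await deps.spawnAndAwait({ workflowId: 'wr.coding-task', goal: `Fix ${issueUrl}` });
+   if ((await deps.getChildSessionResult(fix.sessionId)).kind !== 'completed') return 'degraded';
+
+   // 'wr.review' is a hypothetical workflow id used only for this sketch.
+   const review = await deps.spawnAndAwait({ workflowId: 'wr.review', goal: `Review the fix for ${issueUrl}` });
+   return (await deps.getChildSessionResult(review.sessionId)).kind === 'completed' ? 'ok' : 'degraded';
+ }
+ ```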
1986
+
1841
1987
  **The insight:** In a coordinator workflow, the main agent spends most of its time on mechanical work -- reading PR lists, checking CI status, deciding whether findings are blocking, sequencing merges. That's all deterministic logic. An LLM is expensive, slow, and inconsistent for deterministic work.
1842
1988
 
1843
1989
  **The principle:** the scripts-over-agent rule applies at the coordinator level too. The coordinator's job is to drive a DAG of child sessions. The DAG structure, routing decisions, and termination conditions should be scripts, not LLM reasoning.
@@ -2003,7 +2149,7 @@ WorkTrain notices things without being asked. After a batch of work lands, it sc
2003
2149
 
2004
2150
  ### Native multi-agent orchestration: coordinator sessions + session DAG (Apr 15, 2026)
2005
2151
 
2006
- **Status: idea** | Priority: high
2152
+ **Status: partial** | Typed primitives shipped PR #908 (Apr 30, 2026)
2007
2153
 
2008
2154
  **Score: 10** | Cor:1 Cap:3 Eff:1 Lev:3 Con:2 | Blocked: no
2009
2155
 
@@ -2322,6 +2468,99 @@ A workflow that aggregates activity across git history, GitLab/GitHub MRs and re
2322
2468
 
2323
2469
  ## Platform Vision (longer-term)
2324
2470
 
2471
+ ### Move backlog to a dedicated worktrain-meta repo with version control (Apr 30, 2026)
2472
+
2473
+ **Status: idea** | Priority: high
2474
+
2475
+ **Score: 11** | Cor:2 Cap:2 Eff:2 Lev:3 Con:3 | Blocked: no
2476
+
2477
+ The backlog (`docs/ideas/backlog.md`) lives in the code repo. Every feature branch has its own version. Ideas added mid-session on a feature branch are held hostage until that PR merges. If two branches modify the backlog simultaneously, merge conflicts occur. There is no single authoritative place to capture an idea that immediately applies everywhere.
2478
+
2479
+ A dedicated `worktrain-meta` repo (e.g. `~/git/personal/worktrain-meta/`) would hold the backlog as the only concern. No feature branches -- ideas are committed directly to main. Full git history preserved. No code PR ever touches it.
2480
+
2481
+ Done means: an operator or agent can add a backlog idea from any branch or context, commit directly, and it is immediately visible on all other branches and in all other sessions.
2482
+
2483
+ **Note on format:** when this migration happens, one-file-per-item with YAML frontmatter becomes viable. Frontmatter makes scores, status, dates, and blocked-by machine-readable without prose parsing. The `npm run backlog` script would read frontmatter instead of regex-parsing Score lines. This is the right time to adopt that format -- in the current single-file structure frontmatter would require a custom delimiter scheme, but one-file-per-item makes it natural.
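+
+ For example, a single backlog item file could carry its metadata as frontmatter, roughly like this (field names illustrative; values copied from the invocable-routines entry below):
+
+ ```
+ ---
+ title: "Invocable routines: dispatch an existing routine directly as a task"
+ status: idea            # idea | bug | partial | done
+ priority: high
+ score: 12
+ dims: { cor: 1, cap: 3, eff: 2, lev: 3, con: 3 }
+ blocked: false          # or a short "blocked by X" string
+ date: 2026-04-30
+ ---
+
+ [Problem statement prose, same shape as today's entries.]
+ ```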
2484
+
2485
+ **Things to hash out:**
2486
+ - Should the worktrain-meta repo also hold the roadmap docs, now-next-later, open-work-inventory? Or just the backlog?
2487
+ - How do subagents spawned in a worktree find the backlog? They need a configured path, not relative to the code workspace.
2488
+ - When native structured backlog operations are built (SQLite), does the storage backend live in worktrain-meta (git-tracked history) or `~/.workrail/data/` (local queryable)? Both have merit.
2489
+
2490
+ ---
2491
+
2492
+ ### Invocable routines: dispatch an existing routine directly as a task (Apr 30, 2026)
2493
+
2494
+ **Status: idea** | Priority: high
2495
+
2496
+ **Score: 12** | Cor:1 Cap:3 Eff:2 Lev:3 Con:3 | Blocked: no
2497
+
2498
+ WorkRail has a routines system (`workflows/routines/`) for reusable workflow fragments. But routines can only be used embedded inside a larger workflow -- there is no way to invoke a routine directly as a standalone task. Many useful repeat tasks are process-shaped (same steps every time, structured output) and could be expressed as short 1-2 step workflows or existing routines. Today an operator who wants to run "context gathering" or "hypothesis challenge" on demand has to either build a wrapper workflow or do it manually.
2499
+
2500
+ There is no dispatch surface for standalone routine invocation. Done means: an operator can invoke any routine by name from the CLI or a trigger, and the result is durable in the session store.
2501
+
2502
+ **Relationship to existing ideas:** this is one half of the lightweight agents gap (the process-shaped half). The ad-hoc query half is a separate entry below.
2503
+
2504
+ **Things to hash out:**
2505
+ - Should this be a new CLI command (`worktrain invoke <routineId> --goal "..."`) or a trigger type, or both?
2506
+ - Do routines need output contracts defined before they can be invoked standalone, or is free-form output acceptable?
2507
+ - How does the session store record a routine-only run vs a full workflow run? Should they be distinguished?
2508
+
2509
+ ---
2510
+
2511
+ ### Ad-hoc query agents: answer questions about the workspace without a full workflow (Apr 30, 2026)
2512
+
2513
+ **Status: idea** | Priority: high
2514
+
2515
+ **Score: 11** | Cor:1 Cap:3 Eff:2 Lev:2 Con:2 | Blocked: yes (needs knowledge graph for efficient context)
2516
+
2517
+ There is a class of tasks that are question-shaped rather than process-shaped: "why does the session store use a manifest file?", "what would break if I changed this function?", "summarize what shipped this week." These don't have fixed steps, don't produce structured output contracts, and don't benefit from workflow phase gating. Running a full `wr.coding-task` session for them wastes 10 minutes on overhead. Not supporting them means the operator has to context-switch to Claude Code or do them manually.
2518
+
2519
+ These tasks need a capable agent with workspace context but no workflow structure. They are stateless, single-purpose, and short-lived.
2520
+
2521
+ Examples of what this enables:
2522
+ - `worktrain ask "why does the session store use a manifest file?"`
2523
+ - `worktrain explain pr/908`
2524
+ - `worktrain impact src/trigger/coordinator-deps.ts`
2525
+ - `worktrain diff-since "last week"`
2526
+
2527
+ Done means: an operator can ask a natural-language question about the workspace and get a grounded answer within seconds, without starting a full session.
2528
+
2529
+ **Relationship to existing ideas:** `worktrain talk` (interactive ideation) is the conversational, stateful version of this. Standup status generator is a scheduled instance of the same pattern. Invocable routines (entry above) are the process-shaped complement. This entry covers the unstructured query case.
2530
+
2531
+ **Things to hash out:**
2532
+ - Without the knowledge graph, these queries require full file-scanning on every invocation -- too slow to be useful. Is there a minimum viable version before the KG is built, or does this wait?
2533
+ - What is the boundary between "this is a quick query" and "this actually needs a full discovery session"? Who decides -- the operator, or WorkTrain itself?
2534
+ - Should outputs be ephemeral (printed to terminal, not stored) or durable (in session store)? Durability adds value for audit but adds overhead.
2535
+
2536
+ ---
2537
+
2538
+ ### Self-restart after shipping changes to itself (Apr 30, 2026)
2539
+
2540
+ **Status: idea** | Priority: medium
2541
+
2542
+ **Score: 11** | Cor:2 Cap:3 Eff:2 Lev:2 Con:2 | Blocked: yes (needs self-improvement loop operational)
2543
+
2544
+ If WorkTrain can build and ship changes to itself autonomously, the natural next step is that it also restarts itself with those changes. Today, after a WorkTrain daemon session ships a change to the workrail repo, the daemon continues running the old binary. The operator has to manually run `worktrain daemon --stop && worktrain daemon --start` to pick up the new version. In a self-improving system running overnight, this is a human intervention point that should not exist.
2545
+
2546
+ **What this requires:**
2547
+ 1. After a session that modifies WorkTrain itself merges to main, the daemon detects it was running on this repo
2548
+ 2. The daemon rebuilds (`npm run build`) and restarts itself cleanly -- completing any in-flight sessions first, then performing a graceful restart with the new binary
2549
+ 3. After restart, the daemon logs what changed so the operator can review
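+
+ A rough sketch of the staleness check behind steps 1-2, with the entry-point path and restart mechanism as assumptions -- none of this exists today:
+
+ ```typescript
+ import { statSync } from 'node:fs';
+
+ // Hypothetical check: compare the running daemon's start time with the mtime
+ // of the built entry point. Path is an assumption.
+ const DAEMON_STARTED_AT = Date.now();
+
+ function builtBinaryIsNewer(entryPoint = 'dist/cli.js'): boolean {
+   try {
+     return statSync(entryPoint).mtimeMs > DAEMON_STARTED_AT;
+   } catch {
+     return false; // no build output present; nothing to restart into
+   }
+ }
+
+ // The restart itself would drain in-flight sessions, then exit and let a
+ // supervisor or wrapper script start the new binary -- sketched, not built.
+ ```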
2550
+
2551
+ This is related to the "daemon binary stale after rebuild" P0 gap, but goes further: not just warning about staleness, but actually handling the upgrade cycle automatically.
2552
+
2553
+ **Why this matters for the self-improvement loop:** if WorkTrain ships 5 improvements to itself in a day but the operator has to manually restart it 5 times, the loop isn't truly autonomous. Full autonomy requires the restart to be part of the pipeline.
2554
+
2555
+ **Things to hash out:**
2556
+ - What triggers the restart check? After every merge to main that touches `src/`? After a successful `npm run build`? On a heartbeat that detects binary staleness?
2557
+ - How does the daemon ensure in-flight sessions complete before restarting? Does it drain the active session set or hard-stop?
2558
+ - What is the rollback path if the new binary fails to start (startup crash, broken build)? The daemon needs to detect this and either roll back or alert the operator.
2559
+ - Should the restart happen immediately or at a configurable "quiet period" (e.g. 2am) to avoid disrupting active sessions during the day?
2560
+ - Self-modification is inherently risky -- a buggy change to the daemon's restart logic could make the daemon unable to restart at all. What safeguards prevent this?
2561
+
2562
+ ---
2563
+
2325
2564
  ### WorkTrain as a first-class project participant: ideal backlog and planning capabilities (Apr 30, 2026)
2326
2565
 
2327
2566
  **Status: idea** | Priority: high (long-term)
@@ -54,9 +54,9 @@ Not every roadmap item must become a ticket immediately.
54
54
 
55
55
  ## Status ownership
56
56
 
57
- **Status lives in exactly two places**: `docs/roadmap/open-work-inventory.md` and `docs/roadmap/now-next-later.md`.
57
+ **Status lives in `docs/ideas/backlog.md`**. Each entry has a `Status:` line (idea / partial / done / bug). Use `npm run backlog` to see a scored, sorted view.
58
58
 
59
- Plan docs in `docs/plans/` describe **design and intent** -- not current status. When work ships, update the roadmap docs, not the plan doc. Plan docs that carry their own status blocks create a second source of truth that drifts.
59
+ Plan docs in `docs/plans/` describe **design and intent** -- not current status. When work ships, update the backlog entry status, not the plan doc.
60
60
 
61
61
  ## Rules of thumb
62
62
 
@@ -95,10 +95,7 @@ Existing feature-specific plans in `docs/plans/` still matter. Treat them as **f
95
95
 
96
96
  ## Starting points
97
97
 
98
- - `docs/ideas/backlog.md`
99
- - `docs/roadmap/now-next-later.md`
100
- - `docs/roadmap/open-work-inventory.md`
101
- - `docs/roadmap/legacy-planning-status.md`
102
- - `docs/planning/docs-taxonomy-and-migration-plan.md`
103
- - `docs/tickets/README.md`
104
- - `docs/tickets/next-up.md`
98
+ - `docs/vision.md` -- what WorkTrain is and where it's going (read this first)
99
+ - `docs/ideas/backlog.md` -- the backlog (`npm run backlog` for priority view)
100
+ - `docs/roadmap/legacy-planning-status.md` -- status map for older planning docs
101
+ - `docs/tickets/next-up.md` -- scratch space for near-term tickets
@@ -0,0 +1,8 @@
1
+ # Archive
2
+
3
+ These docs were superseded by `docs/ideas/backlog.md` + `npm run backlog`.
4
+
5
+ - `now-next-later.md` -- manual roadmap curation; replaced by backlog scoring and `npm run backlog`
6
+ - `open-work-inventory.md` -- normalized list of partial/unimplemented work; replaced by `Status: partial` entries in the backlog
7
+
8
+ Kept for historical reference only. Do not update.
@@ -1,6 +1,11 @@
1
1
  # Next Up
2
2
 
3
- Groomed near-term tickets. Check `docs/roadmap/now-next-later.md` first for the current priority ordering.
3
+ Scratch space for grooming near-term tickets before they become GitHub issues.
4
+ For the current priority ordering, run `npm run backlog -- --min-score 11 --unblocked-only` or see `docs/ideas/backlog.md`.
5
+
6
+ ---
7
+
8
+ > The tickets below are historical. Active work is tracked via GitHub issues and the backlog.
4
9
 
5
10
  ---
6
11