npm - @exaudeus/workrail - Versions diffs - 3.74.0 → 3.74.2 - Mend

@exaudeus/workrail 3.74.0 → 3.74.2

Files changed (17)

package/dist/application/services/workflow-interpreter.js +22 -0
package/dist/console-ui/assets/{index-CfU3va8H.js → index-CK8Zux9a.js} +1 -1
package/dist/console-ui/index.html +1 -1
package/dist/coordinators/types.d.ts +1 -1
package/dist/daemon/session-scope.d.ts +7 -1
package/dist/daemon/workflow-runner.d.ts +11 -4
package/dist/daemon/workflow-runner.js +92 -66
package/dist/manifest.json +15 -15
package/dist/v2/durable-core/domain/context-template-resolver.js +34 -9
package/docs/ideas/backlog.md +242 -18
package/docs/planning/README.md +6 -9
package/docs/roadmap/archive/README.md +8 -0
package/docs/tickets/next-up.md +6 -1
package/docs/vision.md +115 -0
package/package.json +1 -1
package/docs/roadmap/{now-next-later.md → archive/now-next-later.md} +0 -0
package/docs/roadmap/{open-work-inventory.md → archive/open-work-inventory.md} +0 -0

package/docs/ideas/backlog.md CHANGED

@@ -3,6 +3,8 @@
 Workflow and feature ideas worth capturing but not yet planned or designed.
 For historical narrative and sprint journals, see `docs/history/worktrain-journal.md`.
+**Before reading this backlog, read the vision:** `docs/vision.md` -- what WorkTrain is, what success looks like, and the principles every decision is held against. Every item in this backlog should serve that vision. If it doesn't, it shouldn't be here.
 **To see a sorted priority view, run:**
 ```bash
 npm run backlog                                       # full list, grouped by blocked/unblocked
@@ -12,35 +14,101 @@ npm run backlog -- --help                            # all options
 ```
 Each item has a score line: `**Score: N** | Cor:N Cap:N Eff:N Lev:N Con:N | Blocked: ...`
-See the scoring rubric in the "Agent-assisted backlog prioritization" entry (WorkTrain Daemon section).
+**When adding a new backlog item, score it using this rubric.** Five dimensions, each 1-3. Score = sum (max 15).
+| Dimension | 3 | 2 | 1 |
+|---|---|---|---|
+| **Correctness** | Silent wrong output, crash, or skipped safety gate | Degraded behavior, misleading output, test coverage gap | No effect on correctness |
+| **Capability** | Meaningfully expands what WorkTrain can do or who can use it | Reduces friction for an *active* use case today | Polish, internal quality, or nothing anyone is actively blocked by right now |
+| **Effort** (inverted) | Hours to a day or two | A few days to a week | Weeks or longer, significant design work needed first |
+| **Leverage** | Prerequisite for multiple other items | Enables one or two downstream items | Standalone, nothing depends on it |
+| **Confidence** | Clear problem, clear direction, just needs implementation | Problem is clear, but has open questions to hash out first | Still needs discovery or design before work can begin |
+**Blocked flag:** annotate with *what* the item is blocked by -- "Blocked: needs knowledge graph" vs "Blocked: needs dispatchCondition" carry very different timelines. Blocked items are listed separately regardless of score.
+**Scoring notes:**
+- Score the first actionable phase, not the full vision. Phase 1 = two days of work should not score Effort 1 just because Phase 3 is months away.
+- Tiebreaker at equal score: prefer the item that makes the next item easier to execute.
+- Capability 2 = reduces friction for an *active* use case today (not something hypothetical).
+---
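
The rubric above makes each score line mechanically checkable: five dimensions in the range 1-3, and a score equal to their sum. A minimal TypeScript sketch of such a check follows, assuming the score-line format shown in this hunk. `parseScoreLine` and `checkScoreLine` are hypothetical names for illustration; they are not taken from the package's `npm run backlog` script.

```typescript
// Hypothetical helpers, not part of the package: parse a backlog score line
// and check it against the rubric (five dimensions, each 1-3, Score = sum).
const DIMENSIONS = ["Cor", "Cap", "Eff", "Lev", "Con"] as const;
type Dimension = (typeof DIMENSIONS)[number];

interface ScoreLine {
  score: number;
  dims: Record<Dimension, number>;
  blocked: string;
}

// Matches e.g. "**Score: 10** | Cor:2 Cap:3 Eff:1 Lev:2 Con:2 | Blocked: no"
const SCORE_RE =
  /^\*\*Score: (\d+)\*\* \| Cor:(\d) Cap:(\d) Eff:(\d) Lev:(\d) Con:(\d) \| Blocked: (.+)$/;

function parseScoreLine(line: string): ScoreLine | null {
  const m = SCORE_RE.exec(line.trim());
  if (m === null) return null;
  const dims = {} as Record<Dimension, number>;
  for (let i = 0; i < DIMENSIONS.length; i++) {
    // Capture groups 2..6 hold the five dimension values, in rubric order.
    dims[DIMENSIONS[i]] = Number(m[i + 2]);
  }
  return { score: Number(m[1]), dims: dims, blocked: m[7] };
}

function checkScoreLine(parsed: ScoreLine): string[] {
  const problems: string[] = [];
  let sum = 0;
  for (let i = 0; i < DIMENSIONS.length; i++) {
    const name = DIMENSIONS[i];
    const value = parsed.dims[name];
    if (value < 1 || value > 3) {
      problems.push(`${name} must be 1-3, got ${value}`);
    }
    sum += value;
  }
  if (sum !== parsed.score) {
    problems.push(`Score is ${parsed.score} but dimensions sum to ${sum}`);
  }
  return problems;
}
```

Under this check, a line such as `**Score: 13** | Cor:3 Cap:1 Eff:2 Lev:2 Con:3 | Blocked: no` would be reported, because its dimensions sum to 11.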
+**How to write a backlog item.** Every entry should follow this shape:
+```
+### Title (Date)
+**Status: idea | bug | partial | done** | Priority: high/medium/low
+**Score: N** | Cor:N Cap:N Eff:N Lev:N Con:N | Blocked: no / yes (blocked by X)
+[2-4 sentences stating the problem plainly. What is wrong or missing? Why does it matter?
+No proposed solutions here -- just the problem.]
+**Things to hash out:**
+- [Open question that needs a decision before design can begin]
+- [Another open question -- constraint, tradeoff, interaction with other systems]
+- [Keep these honest -- don't fill this section with questions you already know the answer to]
+```
+**Rules for writing entries:**
+- **State the problem, not the solution.** "There is no way to invoke a routine directly" not "We should add a `worktrain invoke` command."
+- **No steering.** Don't tell future implementers how to build it. Capture what needs to exist, not how to make it exist.
+- **Things to hash out = genuine open questions.** Only include questions that actually need to be answered before design can start. If you know the answer, state it in the problem description.
+- **Relationships matter.** If this item depends on another, or would be superseded by another, name it explicitly.
+- **Be specific about what "done" looks like** when it's not obvious -- e.g. "done means an operator can invoke any routine by name from the CLI without writing a workflow."
 ---
 ## P0 / Critical (blocks WorkTrain from working correctly)
-### wr.coding-task implementation loop does not exit when slices complete (Apr 30, 2026)
+### wr.coding-task forEach loop exposes broken agent-facing state (Apr 30, 2026)
-**Status: bug** | Priority: high
+**Status: done** | Shipped May 1, 2026 (PR #926)
 **Score: 13** | Cor:3 Cap:1 Eff:2 Lev:2 Con:3 | Blocked: no
-The `wr.coding-task` workflow's implementation loop (up to 20 passes) does not exit when a `wr.loop_control` stop artifact is emitted. The loop ran 8 passes before stopping -- not because of the artifact, but because it exhausted its slice array.
+**Root cause (diagnosed Apr 30, 2026):** The agent wrote `slices` as an array of plain strings (`["1: slice name", ...]`) instead of objects (`[{name: "...", ...}]`). The engine accepted the array (it was an array), entered the loop, and `{{currentSlice.name}}` silently resolved to `[unset]` on every iteration because strings don't have a `.name` property.
+**Shipped (PR #926):**
+1. **forEach shape guard** (`workflow-interpreter.ts`): at iteration 0, if the body uses `{{itemVar.field}}` dot-path access but the items array contains primitives, returns `LOOP_MISSING_CONTEXT` with a message naming the actual type and a preview of the bad value. The loop never enters with broken state.
+2. **Diagnostic `[unset]` messages** (`context-template-resolver.ts`): when dot-path navigation fails mid-path due to a type mismatch (e.g. `currentSlice` is a string), the rendered prompt now shows `[unset: currentSlice.name -- 'currentSlice' is string ("1: Auth..."), not object]` instead of just `[unset: currentSlice.name]`.
+**Remaining open (separate items):** context contract enforcement (systemic fix), `todoList` abstraction, `wr.loop_control` shown in forEach prompts.
+**GitHub issue:** https://github.com/EtienneBBeaulac/workrail/issues/920
+---
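
The two changes described in the PR #926 entry above can be sketched together. This is a hedged illustration, not the package's implementation: only the `LOOP_MISSING_CONTEXT` code and the `[unset: ...]` message format come from the entry. The function names, the `GuardResult` type, and the 7-character preview length are assumptions chosen to reproduce the example message.

```typescript
// Illustrative sketch. Names other than LOOP_MISSING_CONTEXT are assumptions.
type GuardResult =
  | { ok: true }
  | { ok: false; code: "LOOP_MISSING_CONTEXT"; message: string };

// Short preview of a value; the 7-character cutoff is an assumption.
function preview(value: unknown, max = 7): string {
  const text = String(value);
  return text.length > max ? text.slice(0, max) + "..." : text;
}

// Names the actual type of a value, with a preview of its content.
function describe(value: unknown): string {
  if (value === null) return "null";
  if (typeof value === "string") return `string ("${preview(value)}")`;
  return `${typeof value} (${preview(value)})`;
}

// True if the body contains {{itemVar.someField}} style access.
function usesDotPath(body: string, itemVar: string): boolean {
  const escaped = itemVar.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("\\{\\{\\s*" + escaped + "\\.[\\w.]+\\s*\\}\\}").test(body);
}

// Shape guard: refuse to enter the loop when the body needs object items
// but the array holds primitives.
function guardForEachShape(body: string, itemVar: string, items: unknown[]): GuardResult {
  if (!usesDotPath(body, itemVar)) return { ok: true };
  for (let i = 0; i < items.length; i++) {
    const item = items[i];
    if (item === null || typeof item !== "object") {
      return {
        ok: false,
        code: "LOOP_MISSING_CONTEXT",
        message: `forEach body reads {{${itemVar}.<field>}} but items[${i}] is ${describe(item)}, not object`,
      };
    }
  }
  return { ok: true };
}

// Resolver: when navigation fails mid-path on a non-object, say why.
function resolveDotPath(path: string, context: Record<string, unknown>): string {
  const parts = path.split(".");
  let current: unknown = context;
  for (let i = 0; i < parts.length; i++) {
    if (current === null || typeof current !== "object") {
      const parent = parts.slice(0, i).join(".");
      return `[unset: ${path} -- '${parent}' is ${describe(current)}, not object]`;
    }
    current = (current as Record<string, unknown>)[parts[i]];
    if (current === undefined) return `[unset: ${path}]`;
  }
  return String(current);
}
```

With `slices` written as plain strings, the guard returns the error before the first iteration, and the resolver (if reached) renders the type mismatch rather than a bare `[unset: currentSlice.name]`.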
+### Context contract: steps must declare required and produced context keys (Apr 30, 2026)
-**Root cause (confirmed by investigation)**: `phase-6-implement-slices` is a `forEach` loop, not a `while`/`until` loop with `artifact_contract`. The `wr.loop_control` stop artifact mechanism **only works for `while`/`until` loops** that declare `conditionSource.kind = artifact_contract`. For `forEach` loops, `shouldEnterIteration` checks only `iteration < slices.length` -- artifacts passed to `interpreter.next()` are never consulted. Confirmed in `workflow-interpreter.ts:254-273` and verified by a direct test (3-slice forEach with stop artifact on every call ran all 3 iterations to completion).
+**Status: tentative** | Priority: medium
-**Why the loop stopped at pass 8**: the loop exhausted its `slices` array which had exactly 8 elements. `metrics_outcome = success` appearing at pass 8 was a coincidence.
+**Score: 12** | Cor:3 Cap:2 Eff:1 Lev:3 Con:2 | Blocked: no
-**`currentSlice.name` showing `[unset]`**: secondary issue. `buildLoopRenderContext` in `prompt-renderer.ts:190-197` requires `sessionContext['slices']` to be an array at render time. If the `slices` context had not yet been projected into `sessionContext`, or if the slice objects lacked a `name` property, templates render as `[unset: currentSlice.name]`.
+The engine has no mechanism to enforce context between steps. `Capture:` instructions in step prompts are prose -- the engine accepts `continue_workflow` with empty context on every advance, silently. This is the systemic root of the forEach `[unset]` bug: the agent wrote planning output as notes, not as context, and the engine accepted every advance without complaint. The same failure can happen in any workflow that passes state between steps.
-**Three fix directions:**
-1. **Authoring fix**: change `phase-6-implement-slices` from `forEach` to a `while` with `artifact_contract` and add an explicit exit-decision step -- agents can then signal completion via `wr.loop_control`
-2. **Engine feature**: add early-exit support to `forEach` loops when a `wr.loop_control` stop artifact is emitted
-3. **Prompt fix**: if forEach-exhausts-all-slices is the intent, remove the instruction that tells the agent to emit `wr.loop_control` artifacts
+**Things to hash out:**
+- What schema format should `contextContract` use -- JSON Schema subset or a simpler workrail-specific type DSL?
+- Should validation be blocking (engine rejects the advance) or advisory (engine warns in the next step prompt)?
+- Does context contract cover loop entry preconditions, or does the separate forEach guard item handle that?
+---
+### `todoList` step type: ergonomic abstraction over forEach (Apr 30, 2026)
+**Status: idea** | Priority: medium
+**Score: 10** | Cor:2 Cap:3 Eff:1 Lev:2 Con:2 | Blocked: no
+Workflow authors using forEach must manually wire a prior step to populate the items array, understand iteration variables, avoid emitting `wr.loop_control` artifacts (which have no effect in forEach), and explain the loop framing to the agent. The forEach shape guard (PR #926) now catches primitive-item arrays loudly at loop entry, but the wiring between "the step that produces items" and "the loop that consumes them" remains implicit and invisible to the engine. The `todoList` abstraction would make this wiring structural.
 **Things to hash out:**
-- Which fix direction is correct depends on the intended behavior: should the agent be able to stop the loop early (fix 1 or 2), or should it always run all slices (fix 3)?
-- If fix 2 (engine feature), does early-exit from forEach affect the `currentSlice` render context in a way that could cause confusion?
-- Does fix 1 require re-authoring the workflow through `wr.workflow-for-workflows`, or is it a targeted JSON edit?
+- Should `todoList` compile to a forEach loop at the engine layer, or be a new execution primitive?
+- How does the setup step that produces the items array get authored -- inline prompt, routine reference, or both?
+- What does the agent-facing presentation look like: "Item 3 of 8" with item content injected, or something else?
+- Should `wr.loop_control` artifacts be stripped from the step prompt entirely in a `todoList`, or does the agent still need an explicit completion signal?
 ---
@@ -143,6 +211,8 @@ This is categorically different from bugs (the agent implemented the right thing
 - Agent makes a local fix when the user wanted an architectural change
 - Agent's implementation is technically correct but violates unstated invariants the user assumed were obvious
+**Done looks like:** a WorkTrain session that receives an ambiguous or underspecified task either (a) states its interpretation explicitly before acting and the coordinator can gate on approval, or (b) has access to enough prior context (from the knowledge graph or living work context) that the interpretation is reliably correct. A session that builds the wrong thing well should be detectable before it merges, not after.
 **Things to hash out:**
 - Where in the workflow should intent validation happen? Before the agent writes any code (Phase 0), the agent should be required to state its interpretation back in plain English. The user (or a validation step) confirms or corrects it before implementation begins. But this requires a human confirmation gate -- does that break the autonomous use case?
 - For fully autonomous sessions (no human in the loop), is there a way to detect a likely intent gap before the agent commits? Signals might include: the task description is short or vague, the agent's interpretation involves a significant architectural decision, the agent is about to delete or restructure existing code.
@@ -173,6 +243,8 @@ This is exactly what happened with the commit SHA change: setting `agentCommitSh
 - Agent notes a downstream impact in session notes but does not block, escalate, or file a follow-up ticket
 - **Agent reframes a bug as "a key tradeoff to document."** This is a specific and common failure: the agent detects a real problem it caused, correctly identifies that it's a problem, and instead of filing it as a bug or escalating, reclassifies it as an "accepted design decision" or "known limitation" in documentation. The bug is real. Documenting it is not fixing it. This pattern actively buries bugs.
+**Done looks like:** when an agent makes a change that degrades something outside its scope, it surfaces the degradation explicitly before the PR merges -- either by blocking (filing a follow-up issue as a condition of the current PR merging) or escalating to the coordinator for a decision. A PR that silently buries a regression in a comment or documentation should not pass review.
 **Things to hash out:**
 - How does an agent distinguish "acceptable tradeoff within scope" from "collateral damage that must be escalated"? The line is fuzzy and context-dependent. A hard rule ("never degrade existing behavior") is too strict for refactors; a soft heuristic ("if it affects other code, escalate") is too broad.
 - Should the agent be required to enumerate side effects as part of the verification phase, and should the coordinator review that list before merging? This is the proof record concept applied to impact assessment rather than just correctness.
@@ -283,21 +355,49 @@ Five dimensions, each scored 1-3. Score = sum (max 15). Items marked **Blocked**
 ---
+### `delivery_failed` unreachable in `getChildSessionResult` -- type promises more than code delivers (Apr 30, 2026)
+**Status: done** | Fixed in `cd8aaeb8` -- `delivery_failed` removed from `ChildSessionResult` entirely. The `spawnSession`/`spawnAndAwait` path cannot produce it by design; it only exists in `spawn_agent`'s direct outcome mapping.
+---
+### `spawnAndAwait` duplicates ~90 lines of polling logic from `awaitSessions` (Apr 30, 2026)
+**Status: tech debt** | Priority: low
+**Score: 8** | Cor:1 Cap:1 Eff:2 Lev:1 Con:3 | Blocked: no
+`spawnAndAwait` in `coordinator-deps.ts` contains an inline polling loop (~90 lines) that duplicates the logic in `awaitSessions`. The WHY comment explains a real construction-time constraint: object literals cannot reference sibling methods by name during construction. But this constraint applies to methods on the returned object -- it does not apply to closure-level functions, which are already used for `fetchAgentResult` and `fetchChildSessionResult`.
+**Fix:** extract a `pollUntilTerminal(handles: string[], timeoutMs: number): Promise<'completed' | 'timed_out' | 'degraded'>` closure-level function (before the `return {}` block). Have both `awaitSessions` and `spawnAndAwait` call it. This eliminates the duplication without violating the construction-time constraint.
+**GitHub issue:** https://github.com/EtienneBBeaulac/workrail/issues/921
+---
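
The extraction proposed in the entry above relies on the difference between sibling methods of an object literal and functions in the enclosing closure. The sketch below shows that shape. Only the `pollUntilTerminal` signature comes from the entry; `createCoordinatorDepsSketch`, `SessionStatus`, `spawn`, `getStatus`, `pollIntervalMs`, and the meaning given to `degraded` (a status that could not be determined) are assumptions for illustration.

```typescript
// Sketch only: everything except the pollUntilTerminal signature is hypothetical.
type PollOutcome = "completed" | "timed_out" | "degraded";
type SessionStatus = "running" | "completed" | "unknown";

interface CoordinatorDepsSketch {
  awaitSessions(handles: string[], timeoutMs: number): Promise<PollOutcome>;
  spawnAndAwait(goal: string, timeoutMs: number): Promise<PollOutcome>;
}

function createCoordinatorDepsSketch(
  spawn: (goal: string) => Promise<string>,
  getStatus: (handle: string) => Promise<SessionStatus>,
  pollIntervalMs = 10,
): CoordinatorDepsSketch {
  // Closure-level function, declared before the returned literal, so both
  // methods can call it without referencing each other.
  async function pollUntilTerminal(
    handles: string[],
    timeoutMs: number,
  ): Promise<PollOutcome> {
    const deadline = Date.now() + timeoutMs;
    while (true) {
      const statuses = await Promise.all(handles.map((h) => getStatus(h)));
      // Assumption: an undeterminable status ends the wait as "degraded".
      if (statuses.some((s) => s === "unknown")) return "degraded";
      if (statuses.every((s) => s === "completed")) return "completed";
      if (Date.now() >= deadline) return "timed_out";
      await new Promise<void>((resolve) => setTimeout(resolve, pollIntervalMs));
    }
  }

  return {
    awaitSessions: (handles, timeoutMs) => pollUntilTerminal(handles, timeoutMs),
    spawnAndAwait: async (goal, timeoutMs) => {
      const handle = await spawn(goal);
      return pollUntilTerminal([handle], timeoutMs);
    },
  };
}
```

Neither method refers to the other, so the construction-time constraint on the object literal never comes into play, and the polling logic exists once.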
 ### Daemon architecture: remaining migrations (Apr 29, 2026)
-**Status: partial** | A9 shipped Apr 29, 2026.
+**Status: partial** | A9 shipped Apr 29, 2026. FC/IS follow-on shipped Apr 30 -- May 1, 2026.
 **Score: 8** | Cor:1 Cap:1 Eff:2 Lev:1 Con:3 | Blocked: no
 Track A (A1-A9) shipped and the `SessionSource` migration is complete. `WorkflowTrigger._preAllocatedStartResponse` is gone.
+**Shipped Apr 30 -- May 1, 2026 (PR #925):**
+- `TerminalSignal` union replaces `stuckReason` + `timeoutReason`. Illegal state (stuck AND timeout simultaneously) now structurally impossible. Stall overwrite bug fixed. `Readonly<SessionState>` at pure read sites.
+- `SessionScope` capability boundary complete: `onTokenUpdate`, `onIssueReported`, `onSteer`, `getCurrentToken`, `sessionWorkspacePath`, spawn depths all named scope fields. `constructTools` signature is `(ctx, apiKey, schemas, scope)` -- zero direct `state.X` references.
+- Early-exit paths unified through `finalizeSession`. `SteerRegistry`/`AbortRegistry` dead exports removed.
+- Architecture tests enforce `state.terminalSignal` write restriction and `constructTools` state-access restriction in CI.
+- `persistTokens` failure early-exit path covered by new outcome invariants tests.
 **Remaining items:**
 - `CriticalEffect<T>` / `ObservabilityEffect` type distinction -- categorize side effects in `runAgentLoop` and finalization as either crash-relevant or observability-only
-- `StateRef` mutation wrapper -- replace direct `state.pendingSteerParts.push()` mutations with an explicit mutation API
 - Zod tool param validation -- replace manual `typeof` checks in tool factories with Zod schema validation (requires `zodToJsonSchema` or maintaining two sources of truth for param schemas)
 - `createCoordinatorDeps` unit tests -- extraction in B3 improved testability; cover `spawnSession`, `awaitSessions`, `getAgentResult` at minimum
 - ~~Wire `AllocatedSession.triggerSource` to the `run_started` event for session attribution~~ -- **done**, PR #899 (Apr 30, 2026)
+- ~~`SessionStateWriter` capability interfaces~~ -- **done** as part of PR #925 (`SessionScope` now owns all mutation callbacks)
+- ~~Architecture test: forbid `state.terminalSignal =` direct writes outside `setTerminalSignal()`~~ -- **done**, PR #925
 ---
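
The `TerminalSignal` item in this hunk describes a standard discriminated-union refactor: two independent optional fields become one field that holds at most one variant. A minimal sketch follows, under stated assumptions. The diff names only `TerminalSignal`, `state.terminalSignal`, `setTerminalSignal()`, `stuckReason`, and `timeoutReason`; the variant shapes, the `LegacySessionState` and `describeOutcome` names, and the first-write-wins rule are illustrative guesses.

```typescript
// Before (as described in the entry): both fields could be set at once.
interface LegacySessionState {
  stuckReason?: string;
  timeoutReason?: string;
}

// After: one field, one variant at a time. Variant shapes are assumptions.
type TerminalSignal =
  | { kind: "stuck"; reason: string }
  | { kind: "timeout"; reason: string };

interface SessionState {
  terminalSignal: TerminalSignal | null;
}

// Single write path. Assumption: the first signal wins, so a later signal
// cannot overwrite the original cause.
function setTerminalSignal(state: SessionState, signal: TerminalSignal): boolean {
  if (state.terminalSignal !== null) return false;
  state.terminalSignal = signal;
  return true;
}

// Pure read site: takes Readonly<SessionState> and switches on the variant.
function describeOutcome(state: Readonly<SessionState>): string {
  const signal = state.terminalSignal;
  if (signal === null) return "running";
  switch (signal.kind) {
    case "stuck":
      return `stuck: ${signal.reason}`;
    case "timeout":
      return `timeout: ${signal.reason}`;
  }
}
```

Because `terminalSignal` holds a single value, "stuck and timed out at the same time" cannot be represented, and routing every write through one function gives an architecture test a single place to allow.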
@@ -362,6 +462,8 @@ Phase 3 (PRs #835, #837): `buildTurnEndSubscriber`, `buildAgentCallbacks`, `buil
 **Total workflow-runner.ts reduction: ~4,955 → ~2,800 lines (44%).**
+**FC/IS follow-on (PR #925, Apr 30 -- May 1, 2026):** `TerminalSignal` union, `SessionScope` capability boundary completion, early-exit unification through `finalizeSession`, architecture tests. See "Daemon architecture: remaining migrations" entry for full details.
 **Follow-on:** `wr.refactoring` workflow (see backlog entry above). Remaining items in "Daemon architecture: remaining migrations" entry below.
 ---
@@ -957,6 +1059,31 @@ Combined with the `DEFAULT_MAX_TURNS` cap, this provides defense-in-depth agains
 The durable session store, v2 engine, and workflow authoring features shared by all three systems.
+### WorkTrain as the canonical workflow author -- MCP as a derived runtime (Apr 30, 2026)
+**Status: idea** | Priority: high
+**Score: 13** | Cor:2 Cap:3 Eff:1 Lev:3 Con:2 | Blocked: no
+Today workflows are authored once and expected to work identically in both runtimes: the WorkRail MCP server (human-in-the-loop, Claude Code) and the WorkTrain daemon (fully autonomous, coordinator-driven). In practice they don't -- a workflow authored for human use has `requireConfirmation` gates that block autonomous execution, step prompts that assume the human is reading them, and phase structures that assume a single continuous session. Conversely, a workflow good for autonomous use has no natural pause points, produces typed structured outputs that humans find hard to read mid-session, and chains phases that a human might want to interrupt.
+The current response is to author separate "agentic variants" (`wr.coding-task` vs `coding-task-workflow.agentic.v2`). This is the wrong direction: it creates duplicate maintenance burden, improvements to one don't propagate to the other, and it means there is no single source of truth for what a workflow does.
+There should be one version of each workflow, not two. Improvements to one should benefit the other automatically. The self-improvement loop WorkTrain runs on its own workflows should produce better workflows for everyone, not just daemon sessions. The question is how to structure authorship and any adaptation layer so this is possible without forcing workflows into an awkward compromise that works poorly in both contexts.
+**What this enables:** WorkTrain can autonomously improve workflows using `wr.workflow-for-workflows`, and those improvements automatically benefit MCP users. The self-improvement loop produces better workflows for everyone, not just daemon sessions. Workflow quality compounds because there is only one version to improve.
+**Relationship to existing entries:**
+- "Workflow runtime adapter: one spec, two runtimes" (Shared/Engine) is a narrower version of this idea focused on parallelism and `requireConfirmation` gates. This entry is about the authoring philosophy and source-of-truth question, not just the adapter mechanics.
+- `wr.workflow-for-workflows` is how WorkTrain improves workflows autonomously -- this entry determines what it improves toward.
+**Things to hash out:**
+- What does the MCP conversion layer actually do? Adding pause points is straightforward. Adapting output formats (structured JSON → human-readable prose) may require active LLM translation, not just structural transformation.
+- Some workflow steps are genuinely different between runtimes -- a step that spawns parallel child sessions in the daemon doesn't have a clean MCP equivalent. Does the conversion layer skip those, simulate them sequentially, or require the author to declare a fallback?
+- If WorkTrain is the authoring target, existing workflows authored for MCP need migration. What is the migration path and who does it -- the author, WorkTrain itself, or a one-time script?
+- How do `requireConfirmation` gates fit? In the daemon they are removed or auto-satisfied by the coordinator. In MCP they pause for the human. Does the workflow declare them or does the conversion layer infer them?
+- Is the conversion layer purely structural (rearranging/omitting steps) or does it require understanding the semantic intent of each step?
 ### Improve commit SHA gathering consistency in wr.coding-task
@@ -1869,10 +1996,14 @@ A proof record contains: `prNumber`, `goal`, `verificationChain` (array of `{ ki
 ### Scripts-first coordinator: avoid the main agent wherever possible (Apr 15, 2026)
-**Status: idea** | Priority: high
+**Status: partial** | Foundation shipped PR #908 (Apr 30, 2026)
 **Score: 12** | Cor:1 Cap:3 Eff:2 Lev:3 Con:3 | Blocked: no
+**What shipped:** `ChildSessionResult` discriminated union, `getChildSessionResult()`, `spawnAndAwait()`, `parentSessionId` threading, `wr.coordinator_result` artifact schema. The typed coordinator primitives that enable in-process coordinator scripts are now available.
+**What's still needed:** the actual coordinator scripts (full development pipeline, bug-fix coordinator, grooming coordinator) and the `worktrain spawn`/`await` CLI commands that wrap these primitives for shell scripts.
 **The insight:** In a coordinator workflow, the main agent spends most of its time on mechanical work -- reading PR lists, checking CI status, deciding whether findings are blocking, sequencing merges. That's all deterministic logic. An LLM is expensive, slow, and inconsistent for deterministic work.
 **The principle:** the scripts-over-agent rule applies at the coordinator level too. The coordinator's job is to drive a DAG of child sessions. The DAG structure, routing decisions, and termination conditions should be scripts, not LLM reasoning.
@@ -2038,7 +2169,7 @@ WorkTrain notices things without being asked. After a batch of work lands, it sc
 ### Native multi-agent orchestration: coordinator sessions + session DAG (Apr 15, 2026)
-**Status: idea** | Priority: high
+**Status: partial** | Typed primitives shipped PR #908 (Apr 30, 2026)
 **Score: 10** | Cor:1 Cap:3 Eff:1 Lev:3 Con:2 | Blocked: no
@@ -2357,6 +2488,99 @@ A workflow that aggregates activity across git history, GitLab/GitHub MRs and re
 ## Platform Vision (longer-term)
+### Move backlog to a dedicated worktrain-meta repo with version control (Apr 30, 2026)
+**Status: idea** | Priority: high
+**Score: 11** | Cor:2 Cap:2 Eff:2 Lev:3 Con:3 | Blocked: no
+The backlog (`docs/ideas/backlog.md`) lives in the code repo. Every feature branch has its own version. Ideas added mid-session on a feature branch are held hostage until that PR merges. If two branches modify the backlog simultaneously, merge conflicts occur. There is no single authoritative place to capture an idea that immediately applies everywhere.
+A dedicated `worktrain-meta` repo (e.g. `~/git/personal/worktrain-meta/`) would hold the backlog as the only concern. No feature branches -- ideas are committed directly to main. Full git history preserved. No code PR ever touches it.
+Done means: an operator or agent can add a backlog idea from any branch or context, commit directly, and it is immediately visible on all other branches and in all other sessions.
+**Note on format:** when this migration happens, one-file-per-item with YAML frontmatter becomes viable. Frontmatter makes scores, status, dates, and blocked-by machine-readable without prose parsing. The `npm run backlog` script would read frontmatter instead of regex-parsing Score lines. This is the right time to adopt that format -- in the current single-file structure frontmatter would require a custom delimiter scheme, but one-file-per-item makes it natural.
+**Things to hash out:**
+- Should the worktrain-meta repo also hold the roadmap docs, now-next-later, open-work-inventory? Or just the backlog?
+- How do subagents spawned in a worktree find the backlog? They need a configured path, not relative to the code workspace.
+- When native structured backlog operations are built (SQLite), does the storage backend live in worktrain-meta (git-tracked history) or `~/.workrail/data/` (local queryable)? Both have merit.
+---
+### Invocable routines: dispatch an existing routine directly as a task (Apr 30, 2026)
+**Status: idea** | Priority: high
+**Score: 12** | Cor:1 Cap:3 Eff:2 Lev:3 Con:3 | Blocked: no
+WorkRail has a routines system (`workflows/routines/`) for reusable workflow fragments. But routines can only be used embedded inside a larger workflow -- there is no way to invoke a routine directly as a standalone task. Many useful repeat tasks are process-shaped (same steps every time, structured output) and could be expressed as short 1-2 step workflows or existing routines. Today an operator who wants to run "context gathering" or "hypothesis challenge" on demand has to either build a wrapper workflow or do it manually.
+There is no dispatch surface for standalone routine invocation. Done means: an operator can invoke any routine by name from the CLI or a trigger, and the result is durable in the session store.
+**Relationship to existing ideas:** this is one half of the lightweight agents gap (the process-shaped half). The ad-hoc query half is a separate entry below.
+**Things to hash out:**
+- Should this be a new CLI command (`worktrain invoke <routineId> --goal "..."`) or a trigger type, or both?
+- Do routines need output contracts defined before they can be invoked standalone, or is free-form output acceptable?
+- How does the session store record a routine-only run vs a full workflow run? Should they be distinguished?
+---
+### Ad-hoc query agents: answer questions about the workspace without a full workflow (Apr 30, 2026)
+**Status: idea** | Priority: high
+**Score: 11** | Cor:1 Cap:3 Eff:2 Lev:2 Con:2 | Blocked: yes (needs knowledge graph for efficient context)
+There is a class of tasks that are question-shaped rather than process-shaped: "why does the session store use a manifest file?", "what would break if I changed this function?", "summarize what shipped this week." These don't have fixed steps, don't produce structured output contracts, and don't benefit from workflow phase gating. Running a full `wr.coding-task` session for them wastes 10 minutes on overhead. Not supporting them means the operator has to context-switch to Claude Code or do them manually.
+These tasks need a capable agent with workspace context but no workflow structure. They are stateless, single-purpose, and short-lived.
+Examples of what this enables:
+- `worktrain ask "why does the session store use a manifest file?"`
+- `worktrain explain pr/908`
+- `worktrain impact src/trigger/coordinator-deps.ts`
+- `worktrain diff-since "last week"`
+Done means: an operator can ask a natural-language question about the workspace and get a grounded answer within seconds, without starting a full session.
+**Relationship to existing ideas:** `worktrain talk` (interactive ideation) is the conversational, stateful version of this. Standup status generator is a scheduled instance of the same pattern. Invocable routines (entry above) are the process-shaped complement. This entry covers the unstructured query case.
+**Things to hash out:**
+- Without the knowledge graph, these queries require full file-scanning on every invocation -- too slow to be useful. Is there a minimum viable version before the KG is built, or does this wait?
+- What is the boundary between "this is a quick query" and "this actually needs a full discovery session"? Who decides -- the operator, or WorkTrain itself?
+- Should outputs be ephemeral (printed to terminal, not stored) or durable (in session store)? Durability adds value for audit but adds overhead.
+---
+### Self-restart after shipping changes to itself (Apr 30, 2026)
+**Status: idea** | Priority: medium
+**Score: 11** | Cor:2 Cap:3 Eff:2 Lev:2 Con:2 | Blocked: yes (needs self-improvement loop operational)
+If WorkTrain can build and ship changes to itself autonomously, the natural next step is that it also restarts itself with those changes. Today, after a WorkTrain daemon session ships a change to the workrail repo, the daemon continues running the old binary. The operator has to manually run `worktrain daemon --stop && worktrain daemon --start` to pick up the new version. In a self-improving system running overnight, this is a human intervention point that should not exist.
+**What this requires:**
+1. After a session that modifies WorkTrain itself merges to main, the daemon detects it was running on this repo
+2. The daemon rebuilds (`npm run build`) and restarts itself cleanly -- completing any in-flight sessions first, then performing a graceful restart with the new binary
+3. After restart, the daemon logs what changed so the operator can review
+This is related to the "daemon binary stale after rebuild" P0 gap, but goes further: not just warning about staleness, but actually handling the upgrade cycle automatically.
+**Why this matters for the self-improvement loop:** if WorkTrain ships 5 improvements to itself in a day but the operator has to manually restart it 5 times, the loop isn't truly autonomous. Full autonomy requires the restart to be part of the pipeline.
+**Things to hash out:**
+- What triggers the restart check? After every merge to main that touches `src/`? After a successful `npm run build`? On a heartbeat that detects binary staleness?
+- How does the daemon ensure in-flight sessions complete before restarting? Does it drain the active session set or hard-stop?
+- What is the rollback path if the new binary fails to start (startup crash, broken build)? The daemon needs to detect this and either roll back or alert the operator.
+- Should the restart happen immediately or at a configurable "quiet period" (e.g. 2am) to avoid disrupting active sessions during the day?
+- Self-modification is inherently risky -- a buggy change to the daemon's restart logic could make the daemon unable to restart at all. What safeguards prevent this?
+---
 ### WorkTrain as a first-class project participant: ideal backlog and planning capabilities (Apr 30, 2026)
 **Status: idea** | Priority: high (long-term)

package/docs/planning/README.md CHANGED

@@ -54,9 +54,9 @@ Not every roadmap item must become a ticket immediately.
 ## Status ownership
-**Status lives in exactly two places**: `docs/roadmap/open-work-inventory.md` and `docs/roadmap/now-next-later.md`.
+**Status lives in `docs/ideas/backlog.md`**. Each entry has a `Status:` line (idea / partial / done / bug). Use `npm run backlog` to see a scored, sorted view.
-Plan docs in `docs/plans/` describe **design and intent** -- not current status. When work ships, update the roadmap docs, not the plan doc. Plan docs that carry their own status blocks create a second source of truth that drifts.
+Plan docs in `docs/plans/` describe **design and intent** -- not current status. When work ships, update the backlog entry status, not the plan doc.
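To illustrate what "status lives in the backlog" means mechanically, here is a minimal sketch of reading `Status:` lines out of backlog markdown. It assumes entries start with a `###` heading followed by a line beginning `**Status: <value>**`, as the backlog excerpt in this diff suggests. `parseBacklog` and its types are hypothetical, and this does not reproduce whatever scoring `npm run backlog` actually performs.

```typescript
type BacklogStatus = "idea" | "partial" | "done" | "bug";
type BacklogEntry = { title: string; status: BacklogStatus | null };

const STATUSES: readonly BacklogStatus[] = ["idea", "partial", "done", "bug"];

// Collect one entry per "###" heading and attach the first Status line found under it.
function parseBacklog(markdown: string): BacklogEntry[] {
  const entries: BacklogEntry[] = [];
  let current: BacklogEntry | null = null;
  for (const line of markdown.split("\n")) {
    const heading = /^###\s+(.+)$/.exec(line);
    if (heading) {
      current = { title: (heading[1] ?? "").trim(), status: null };
      entries.push(current);
      continue;
    }
    const status = /^\*\*Status:\s*([a-z]+)\*\*/.exec(line);
    if (status && current && current.status === null) {
      const value = (status[1] ?? "") as BacklogStatus;
      // Unknown status words are left as null rather than guessed.
      current.status = STATUSES.indexOf(value) !== -1 ? value : null;
    }
  }
  return entries;
}
```

An entry whose status comes back as `null` is a candidate for cleanup, since the backlog is now the only place status is recorded.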
 ## Rules of thumb
@@ -95,10 +95,7 @@ Existing feature-specific plans in `docs/plans/` still matter. Treat them as **f
 ## Starting points
-- `docs/ideas/backlog.md`
-- `docs/roadmap/now-next-later.md`
-- `docs/roadmap/open-work-inventory.md`
-- `docs/roadmap/legacy-planning-status.md`
-- `docs/planning/docs-taxonomy-and-migration-plan.md`
-- `docs/tickets/README.md`
-- `docs/tickets/next-up.md`
+- `docs/vision.md` -- what WorkTrain is and where it's going (read this first)
+- `docs/ideas/backlog.md` -- the backlog (`npm run backlog` for priority view)
+- `docs/roadmap/legacy-planning-status.md` -- status map for older planning docs
+- `docs/tickets/next-up.md` -- scratch space for near-term tickets

package/docs/roadmap/archive/README.md ADDED

@@ -0,0 +1,8 @@
+# Archive
+These docs were superseded by `docs/ideas/backlog.md` + `npm run backlog`.
+- `now-next-later.md` -- manual roadmap curation; replaced by backlog scoring and `npm run backlog`
+- `open-work-inventory.md` -- normalized list of partial/unimplemented work; replaced by `Status: partial` entries in the backlog
+Kept for historical reference only. Do not update.

package/docs/tickets/next-up.md CHANGED

@@ -1,6 +1,11 @@
 # Next Up
-Groomed near-term tickets. Check `docs/roadmap/now-next-later.md` first for the current priority ordering.
+Scratch space for grooming near-term tickets before they become GitHub issues.
+For the current priority ordering, run `npm run backlog -- --min-score 11 --unblocked-only` or see `docs/ideas/backlog.md`.
+---
+> The tickets below are historical. Active work is tracked via GitHub issues and the backlog.
 ---

package/docs/vision.md ADDED

@@ -0,0 +1,115 @@
+# WorkTrain Vision
+## What WorkTrain is
+WorkTrain is an autonomous software development daemon. It runs continuously in the background, picks up tasks from external systems (GitHub issues, GitLab MRs, Jira tickets, webhooks), and drives them through the full development lifecycle -- discovery, shaping, implementation, review, fix, merge -- without human intervention between phases.
+The operator's job is to configure what WorkTrain works on and what rules it follows. WorkTrain's job is to do the actual work, autonomously, reliably, and correctly.
+## The self-improvement loop
+WorkTrain builds WorkTrain. This is not a metaphor -- it is the intended operating mode and the ultimate test of whether the system works.
+WorkTrain runs the workrail repository as one of its own workspaces. It picks up tickets from the workrail GitHub issue queue, runs the full pipeline (discovery, shaping, coding, review, fix, merge), and ships improvements to itself. Every feature built into WorkTrain is a feature WorkTrain could have built using its own infrastructure. Every bug fixed in WorkTrain is a bug WorkTrain found in itself.
+This creates a direct feedback loop: if WorkTrain's development pipeline is flawed, it will produce flawed changes to itself and catch them in review. If its context injection is thin, it will miss things in its own codebase that a well-briefed agent would catch. The quality of WorkTrain's output is the quality of WorkTrain.
+The self-improvement loop is not fully operational today. The pieces -- coordinator session chaining, full development pipeline, spec as ground truth, living work context -- are being built. But it is the north star. If WorkTrain cannot build WorkTrain well, it cannot be trusted to build anything else.
+## What success looks like
+An operator assigns a ticket to WorkTrain in the morning. By the time they check in, there is a merged PR, a closed ticket, and a summary of what was done and why. They did not intervene between phases. Nothing surprising happened that required their attention.
+WorkTrain earns trust over time by doing this correctly, repeatedly, at scale -- not just for one-off tasks but as the default mode of software development. The ultimate expression of this: WorkTrain builds and ships improvements to itself, autonomously, using the same pipeline it uses for everything else.
+## What WorkTrain is not
+- **Not a chatbot or copilot.** WorkTrain does not assist humans doing development. It does development. The human is the operator, not the pair programmer.
+- **Not the WorkRail MCP server.** The WorkRail engine and MCP server are infrastructure WorkTrain uses. They are separate systems. Do not conflate them.
+- **Not a replacement for judgment.** WorkTrain surfaces decisions to humans when it hits genuine ambiguity. It does not pretend to understand things it does not, and it does not merge changes it is not confident in.
+## How WorkTrain thinks about work
+**Phases, not turns.** A task is a pipeline of phases: discovery, shaping, coding, review, fix, re-review, merge. Each phase is a session with a typed output contract. The coordinator decides what phase to run next based on the previous phase's structured result -- not on natural language reasoning.
+**Zero LLM turns for routing.** Coordinator decisions -- what workflow to run next, whether findings are blocking, when to merge -- are deterministic TypeScript code. LLM turns are used for cognitive work: understanding code, writing code, evaluating findings. Never for deciding "what do I do next?".
+**Structured outputs at every boundary.** Each phase produces a typed result. The next phase reads that result. Free-text scraping between phases is a design smell. `ChildSessionResult`, `wr.coordinator_result`, `wr.review_verdict` are the contracts that make phases composable without a main agent holding context.
+**Correctness over speed.** WorkTrain does not merge changes it is not confident in. Review findings are addressed. Tests pass. The right next step is not always the fastest one.
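The "zero LLM turns for routing" and "structured outputs" ideas above can be sketched as a plain function over a discriminated union. This is a hedged illustration, not WorkTrain's coordinator: the `PhaseResult` fields, the `route` function, and the phase-to-phase table are assumptions made for the example, and the real contracts (`ChildSessionResult`, `wr.review_verdict`) may look quite different.

```typescript
type Phase = "discovery" | "shaping" | "coding" | "review" | "fix" | "merge";
type WorkPhase = "discovery" | "shaping" | "coding" | "fix";

// Hypothetical typed results, one shape per kind of phase.
type PhaseResult =
  | { phase: WorkPhase; ok: boolean }
  | { phase: "review"; blockingFindings: number; testsPass: boolean }
  | { phase: "merge"; merged: boolean };

type Next =
  | { kind: "run"; phase: Phase }
  | { kind: "escalate"; reason: string }
  | { kind: "done" };

// Which phase follows a successful work phase. A fix always goes back to review.
const NEXT_AFTER_SUCCESS = {
  discovery: "shaping",
  shaping: "coding",
  coding: "review",
  fix: "review",
} as const;

// Deterministic routing: reads only the structured result, never free text.
function route(result: PhaseResult): Next {
  if (result.phase === "review") {
    const mustFix = result.blockingFindings > 0 || !result.testsPass;
    return { kind: "run", phase: mustFix ? "fix" : "merge" };
  }
  if (result.phase === "merge") {
    return result.merged ? { kind: "done" } : { kind: "escalate", reason: "merge failed" };
  }
  if (!result.ok) return { kind: "escalate", reason: `${result.phase} failed` };
  return { kind: "run", phase: NEXT_AFTER_SUCCESS[result.phase] };
}
```

Because `route` has no hidden state, the same result always produces the same next step, which is the property the text asks of coordinator decisions.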
+## What makes WorkTrain different from other autonomous coding agents
+Most autonomous coding agents are single-session: they get a task, they work on it, they produce output. WorkTrain is a pipeline system: each phase is isolated, typed, and observable. The coordinator has no implicit memory -- it only knows what the typed outputs of previous phases told it. This makes pipelines:
+- **Reproducible**: given the same typed phase results, the coordinator takes the same path
+- **Observable**: every phase, every result, every decision is in the session store
+- **Recoverable**: a crashed phase is retried with the same inputs
+- **Auditable**: no black box; you can see exactly what each phase decided and why
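The "recoverable" property in the list above, retrying a crashed phase with the same inputs, might be sketched as follows. `runPhaseWithRetry` and `Attempt` are hypothetical names; the sketch is synchronous for brevity, whereas a real phase runner would presumably be asynchronous, and the freeze is shallow.

```typescript
type Attempt<O> =
  | { ok: true; output: O; attempts: number }
  | { ok: false; error: string; attempts: number };

// Run a phase up to maxAttempts times, handing every attempt the same frozen input.
function runPhaseWithRetry<I extends object, O>(
  phase: (input: Readonly<I>) => O,
  input: I,
  maxAttempts: number,
): Attempt<O> {
  // Shallow copy + freeze so a crashed attempt cannot mutate what the retry sees.
  const frozen: Readonly<I> = Object.freeze({ ...input });
  let attempts = 0;
  let lastError = "phase was never attempted";
  while (attempts < maxAttempts) {
    attempts += 1;
    try {
      return { ok: true, output: phase(frozen), attempts };
    } catch (err) {
      // Record the failure and fall through to the next attempt.
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  return { ok: false, error: lastError, attempts };
}
```

When the attempts are exhausted the caller gets a structured failure rather than an exception, which fits the idea that the coordinator only acts on typed outputs.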
+## Principles that guide every decision
+1. **Zero LLM turns for routing** -- coordinator logic is code, not reasoning
+2. **Typed contracts at phase boundaries** -- structured results, not free-text
+3. **The spec is the source of truth** -- every agent in a pipeline reads the same spec
+4. **Correctness over speed** -- do it right, not just done
+5. **Observable by default** -- every decision visible in the session store and console
+6. **Overnight-safe** -- the system must work while the operator is asleep
+## Quality standards WorkTrain holds itself to
+WorkTrain does not ship work it is not confident in. Specifically:
+- Review findings are addressed before merge -- no "I'll file a ticket for this later" on findings that block
+- Tests pass. If tests were broken before the task started, that is noted explicitly, not silently ignored
+- A PR that triggered the escalating review chain (Critical finding → re-review → re-review) never auto-merges without human approval
+- If WorkTrain makes a change that degrades something outside its immediate scope, it surfaces that -- it does not document collateral damage as "a known tradeoff" and move on
+- When WorkTrain is wrong about something, it acknowledges it explicitly in the session notes so the next session starts with accurate context
+## How WorkTrain handles uncertainty and mistakes
+WorkTrain will make mistakes. The system is designed around this:
+- When an agent is uncertain about the task intent, it states its interpretation explicitly before acting. The coordinator can pause and surface this to the operator rather than proceeding on a wrong assumption.
+- Mistakes produce structured findings in the session store. The demo repo feedback loop and per-run retrospective are how WorkTrain learns from patterns of failure and improves its workflows over time.
+- "That's out of scope for this task" is not a valid reason to proceed past something that is genuinely wrong. Scope is for routing work, not for suppressing correctness.
+## The operator relationship
+The operator configures what WorkTrain works on (triggers, workflows, workspace rules) and sets the boundaries within which it operates. WorkTrain decides autonomously how to do the work.
+WorkTrain pauses and surfaces to the operator when:
+- It encounters genuine ambiguity about what the task is asking for
+- A finding is Critical and requires explicit human approval before merging
+- A child session fails in a way that exhausts automated retries
+- Something unexpected happened that the coordinator's routing logic does not cover
+WorkTrain does not pause for: implementation decisions within a well-specified task, routine review findings it can fix autonomously, or any decision that fits within the rules the operator already configured.
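The pause rules above could be expressed as a single pure function over coordinator events. The sketch below is an assumption-heavy illustration: `CoordinatorEvent`, `PauseDecision`, and `decidePause` are hypothetical, and the two-level `severity` is a simplification of whatever finding categories WorkTrain actually uses.

```typescript
// Hypothetical events a coordinator might have to react to.
type CoordinatorEvent =
  | { kind: "ambiguous-intent"; interpretation: string }
  | { kind: "finding"; severity: "critical" | "routine" }
  | { kind: "child-failed"; attempts: number; maxAttempts: number }
  | { kind: "implementation-decision" }
  | { kind: "unrecognized"; detail: string };

type PauseDecision = { pause: true; reason: string } | { pause: false };

// Mirrors the four pause conditions and the "does not pause for" cases.
function decidePause(event: CoordinatorEvent): PauseDecision {
  switch (event.kind) {
    case "ambiguous-intent":
      return { pause: true, reason: `confirm interpretation: ${event.interpretation}` };
    case "finding":
      return event.severity === "critical"
        ? { pause: true, reason: "critical finding needs human approval" }
        : { pause: false }; // routine findings are fixed autonomously
    case "child-failed":
      return event.attempts >= event.maxAttempts
        ? { pause: true, reason: "automated retries exhausted" }
        : { pause: false }; // retries remain
    case "implementation-decision":
      return { pause: false };
    case "unrecognized":
      return { pause: true, reason: `no routing rule covers: ${event.detail}` };
  }
}
```

The `ambiguous-intent` case is where the open question about "genuine ambiguity" lives: this sketch assumes something upstream has already classified the event, which is exactly the part that is still unresolved.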
+This boundary is still being tested and refined through real usage. Where exactly "genuine ambiguity" begins is an open question.
+## What is still being built
+WorkTrain is not finished. The vision above is where it is going, not where it is today. Key pieces still in progress:
+- **Living work context** -- shared knowledge store that accumulates across all phases so every agent starts informed (`docs/ideas/backlog.md`: "Living work context")
+- **Coordinator pipeline templates** -- actual coordinator scripts for full development pipeline, bug-fix, grooming (`docs/ideas/backlog.md`: "Scripts-first coordinator")
+- **`worktrain spawn`/`await` CLI** -- CLI surface for coordinator scripts
+- **Knowledge graph** -- per-workspace structural understanding so agents skip discovery on repeated tasks
+- **Spec as ground truth** -- wiring `wr.shaping` output into coordinator dispatch so coding/review agents work from the same spec
+For the current prioritized list, see `npm run backlog` or `docs/ideas/backlog.md`.
+## Open questions
+These are genuinely unresolved. Any agent operating in this system should know they exist and not assume they are answered.
+- **Does WorkTrain need a main orchestrating agent?** The vision calls for pure coordinator scripts with zero LLM routing turns. But when something unexpected happens mid-pipeline -- a child session returns an ambiguous result, a finding doesn't fit expected categories -- a deterministic script either fails or ignores it. Whether a thin "judgment agent" is needed at the coordinator level, or whether well-designed typed contracts make it unnecessary, is an empirical question that requires real pipeline testing to answer.
+- **Where exactly is the operator boundary?** The rules above are directionally right but have fuzzy edges. "Genuine ambiguity" is not yet precisely defined. This will sharpen through real usage and failure modes, not through upfront design.
+- **How does WorkTrain know when it doesn't understand something?** An agent that misunderstands a task produces code that's correct for its interpretation but wrong for the operator's intent. Detecting this before implementation begins -- via explicit intent confirmation, pattern matching against prior sessions, or something else -- is an open problem. See `docs/ideas/backlog.md`: "Intent gap".
+- **What is the right granularity of tasks?** WorkTrain is being designed for ticket-sized work. Whether it handles epics (by decomposing them), hotfixes (by moving fast and deferring thoroughness), and architectural changes (which may require multiple sessions across multiple days) the same way is untested.
+- **Is "document" the right abstraction for the living work context?** A flat document implies agents read it linearly. Agents need to query it selectively -- the coding agent wants constraints relevant to a specific decision, the review agent wants what the coding agent said about a specific module. A structured knowledge store (typed facts, queryable by topic) may be more useful than a document. See `docs/ideas/backlog.md`: "Living work context".
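To make the last open question concrete, a "structured knowledge store" of typed facts could look roughly like the sketch below. This is one possible shape offered for discussion, not a resolution of the question: `Fact`, `WorkContext`, and the `kind` values are all hypothetical.

```typescript
// A single typed fact recorded by some phase about some topic.
type Fact = {
  phase: string;                             // which phase recorded it, e.g. "coding"
  topic: string;                             // what it is about, e.g. a module name
  kind: "constraint" | "decision" | "note";  // hypothetical fact categories
  text: string;
};

type FactFilter = Partial<Pick<Fact, "phase" | "topic" | "kind">>;

class WorkContext {
  private readonly facts: Fact[] = [];

  add(fact: Fact): void {
    this.facts.push(fact);
  }

  // Return facts matching every field present in the filter, in insertion order.
  query(filter: FactFilter): Fact[] {
    return this.facts.filter(
      (f) =>
        (filter.phase === undefined || f.phase === filter.phase) &&
        (filter.topic === undefined || f.topic === filter.topic) &&
        (filter.kind === undefined || f.kind === filter.kind),
    );
  }
}
```

With this shape, a review agent could ask for `{ phase: "coding", topic: "auth" }` and receive only what the coding agent said about that module, instead of reading a flat document from top to bottom.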

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@exaudeus/workrail",
-  "version": "3.74.0",
+  "version": "3.74.2",
   "description": "Step-by-step workflow enforcement for AI agents via MCP",
   "license": "MIT",
   "repository": {

/package/docs/roadmap/{now-next-later.md → archive/now-next-later.md} RENAMED

File without changes

/package/docs/roadmap/{open-work-inventory.md → archive/open-work-inventory.md} RENAMED

File without changes