@exaudeus/workrail 3.74.0 → 3.74.2

@@ -3,6 +3,8 @@
3
3
  Workflow and feature ideas worth capturing but not yet planned or designed.
4
4
  For historical narrative and sprint journals, see `docs/history/worktrain-journal.md`.
5
5
 
6
+ **Before reading this backlog, read the vision:** `docs/vision.md` -- what WorkTrain is, what success looks like, and the principles every decision is held against. Every item in this backlog should serve that vision. If it doesn't, it shouldn't be here.
7
+
6
8
  **To see a sorted priority view, run:**
7
9
  ```bash
8
10
  npm run backlog # full list, grouped by blocked/unblocked
@@ -12,35 +14,101 @@ npm run backlog -- --help # all options
12
14
  ```
13
15
 
14
16
  Each item has a score line: `**Score: N** | Cor:N Cap:N Eff:N Lev:N Con:N | Blocked: ...`
15
- See the scoring rubric in the "Agent-assisted backlog prioritization" entry (WorkTrain Daemon section).
17
+
18
+ **When adding a new backlog item, score it using this rubric.** Five dimensions, each 1-3. Score = sum (max 15).
19
+
20
+ | Dimension | 3 | 2 | 1 |
21
+ |---|---|---|---|
22
+ | **Correctness** | Silent wrong output, crash, or skipped safety gate | Degraded behavior, misleading output, test coverage gap | No effect on correctness |
23
+ | **Capability** | Meaningfully expands what WorkTrain can do or who can use it | Reduces friction for an *active* use case today | Polish, internal quality, or nothing anyone is actively blocked by right now |
24
+ | **Effort** (inverted) | Hours to a day or two | A few days to a week | Weeks or longer, significant design work needed first |
25
+ | **Leverage** | Prerequisite for multiple other items | Enables one or two downstream items | Standalone, nothing depends on it |
26
+ | **Confidence** | Clear problem, clear direction, just needs implementation | Problem is clear, but has open questions to hash out first | Still needs discovery or design before work can begin |
27
+
28
+ **Blocked flag:** annotate with *what* the item is blocked by -- "Blocked: needs knowledge graph" vs "Blocked: needs dispatchCondition" carry very different timelines. Blocked items are listed separately regardless of score.
29
+
30
+ **Scoring notes:**
31
+ - Score the first actionable phase, not the full vision. If Phase 1 is two days of work, it should not score Effort 1 just because Phase 3 is months away.
32
+ - Tiebreaker at equal score: prefer the item that makes the next item easier to execute.
33
+ - Capability 2 = reduces friction for an *active* use case today (not something hypothetical).
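The rubric arithmetic can be sketched in a few lines. This is a minimal illustration, not the code behind `npm run backlog`: the `Dimensions` type and both function names are hypothetical, and only the score-line shape and the "sum of five dimensions, each 1-3" rule come from the rubric itself.

```typescript
// Hypothetical types for illustration; not taken from the backlog tooling.
type Rating = 1 | 2 | 3;

interface Dimensions {
  cor: Rating; // Correctness
  cap: Rating; // Capability
  eff: Rating; // Effort (inverted: 3 = hours to a day or two)
  lev: Rating; // Leverage
  con: Rating; // Confidence
}

// Score = sum of the five dimensions (minimum 5, maximum 15).
function totalScore(d: Dimensions): number {
  return d.cor + d.cap + d.eff + d.lev + d.con;
}

// Render a score line in the shape the backlog uses.
function formatScoreLine(d: Dimensions, blockedBy?: string): string {
  const blocked = blockedBy ? `yes (blocked by ${blockedBy})` : "no";
  return (
    `**Score: ${totalScore(d)}** | ` +
    `Cor:${d.cor} Cap:${d.cap} Eff:${d.eff} Lev:${d.lev} Con:${d.con} | ` +
    `Blocked: ${blocked}`
  );
}

const example: Dimensions = { cor: 1, cap: 3, eff: 2, lev: 3, con: 3 };
console.log(formatScoreLine(example));
// prints: **Score: 12** | Cor:1 Cap:3 Eff:2 Lev:3 Con:3 | Blocked: no
```

Deriving the total from the dimensions, rather than typing it by hand, keeps the headline score and the breakdown from drifting apart.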
34
+
35
+ ---
36
+
37
+ **How to write a backlog item.** Every entry should follow this shape:
38
+
39
+ ```
40
+ ### Title (Date)
41
+
42
+ **Status: idea | bug | partial | done** | Priority: high/medium/low
43
+
44
+ **Score: N** | Cor:N Cap:N Eff:N Lev:N Con:N | Blocked: no / yes (blocked by X)
45
+
46
+ [2-4 sentences stating the problem plainly. What is wrong or missing? Why does it matter?
47
+ No proposed solutions here -- just the problem.]
48
+
49
+ **Things to hash out:**
50
+ - [Open question that needs a decision before design can begin]
51
+ - [Another open question -- constraint, tradeoff, interaction with other systems]
52
+ - [Keep these honest -- don't fill this section with questions you already know the answer to]
53
+ ```
54
+
55
+ **Rules for writing entries:**
56
+ - **State the problem, not the solution.** "There is no way to invoke a routine directly" not "We should add a `worktrain invoke` command."
57
+ - **No steering.** Don't tell future implementers how to build it. Capture what needs to exist, not how to make it exist.
58
+ - **Things to hash out = genuine open questions.** Only include questions that actually need to be answered before design can start. If you know the answer, state it in the problem description.
59
+ - **Relationships matter.** If this item depends on another, or would be superseded by another, name it explicitly.
60
+ - **Be specific about what "done" looks like** when it's not obvious -- e.g. "done means an operator can invoke any routine by name from the CLI without writing a workflow."
16
61
 
17
62
  ---
18
63
 
19
64
  ## P0 / Critical (blocks WorkTrain from working correctly)
20
65
 
21
- ### wr.coding-task implementation loop does not exit when slices complete (Apr 30, 2026)
66
+ ### wr.coding-task forEach loop exposes broken agent-facing state (Apr 30, 2026)
22
67
 
23
- **Status: bug** | Priority: high
68
+ **Status: done** | Shipped May 1, 2026 (PR #926)
24
69
 
25
70
  **Score: 13** | Cor:3 Cap:1 Eff:2 Lev:2 Con:3 | Blocked: no
26
71
 
27
- The `wr.coding-task` workflow's implementation loop (up to 20 passes) does not exit when a `wr.loop_control` stop artifact is emitted. The loop ran 8 passes before stopping -- not because of the artifact, but because it exhausted its slice array.
72
+ **Root cause (diagnosed Apr 30, 2026):** The agent wrote `slices` as an array of plain strings (`["1: slice name", ...]`) instead of objects (`[{name: "...", ...}]`). The engine accepted the array (it was an array), entered the loop, and `{{currentSlice.name}}` silently resolved to `[unset]` on every iteration because strings don't have a `.name` property.
73
+
74
+ **Shipped (PR #926):**
75
+ 1. **forEach shape guard** (`workflow-interpreter.ts`): at iteration 0, if the body uses `{{itemVar.field}}` dot-path access but the items array contains primitives, returns `LOOP_MISSING_CONTEXT` with a message naming the actual type and a preview of the bad value. The loop never enters with broken state.
76
+ 2. **Diagnostic `[unset]` messages** (`context-template-resolver.ts`): when dot-path navigation fails mid-path due to a type mismatch (e.g. `currentSlice` is a string), the rendered prompt now shows `[unset: currentSlice.name -- 'currentSlice' is string ("1: Auth..."), not object]` instead of just `[unset: currentSlice.name]`.
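The shape guard described in item 1 might look roughly like the sketch below. It is an assumption-laden illustration, not the code in `workflow-interpreter.ts`: `GuardResult`, `usesDotPath`, `checkForEachShape`, and the exact message wording are all hypothetical. Only the `LOOP_MISSING_CONTEXT` code and the idea of naming the actual type plus a preview of the bad value come from the entry.

```typescript
// Hypothetical result type; only the error code name comes from the entry above.
type GuardResult =
  | { ok: true }
  | { ok: false; code: "LOOP_MISSING_CONTEXT"; message: string };

// True when the template contains a dot-path reference such as {{currentSlice.name}}.
function usesDotPath(template: string, itemVar: string): boolean {
  const escaped = itemVar.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("\\{\\{\\s*" + escaped + "\\.[\\w.]+\\s*\\}\\}").test(template);
}

// Reject the loop before iteration 0 if the body needs object fields
// but the items array contains primitives (or null).
function checkForEachShape(
  items: unknown[],
  itemVar: string,
  bodyTemplate: string
): GuardResult {
  if (!usesDotPath(bodyTemplate, itemVar)) return { ok: true };
  for (let i = 0; i < items.length; i++) {
    const item = items[i];
    if (typeof item !== "object" || item === null) {
      const actual = item === null ? "null" : typeof item;
      const raw = typeof item === "string" ? JSON.stringify(item) : String(item);
      const preview = raw.slice(0, 24);
      return {
        ok: false,
        code: "LOOP_MISSING_CONTEXT",
        message:
          `forEach body reads ${itemVar}.<field> but items[${i}] is ` +
          `${actual} (${preview}), not object`,
      };
    }
  }
  return { ok: true };
}

const result = checkForEachShape(
  ["1: Auth", "2: API"],
  "currentSlice",
  "Implement {{currentSlice.name}}"
);
console.log(result);
```

Note that this sketch treats arrays as objects and only inspects the item variable named in the call; a real guard may need stricter checks.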
77
+
78
+ **Remaining open (separate items):** context contract enforcement (systemic fix), `todoList` abstraction, `wr.loop_control` shown in forEach prompts.
79
+
80
+ **GitHub issue:** https://github.com/EtienneBBeaulac/workrail/issues/920
81
+
82
+ ---
83
+
84
+ ### Context contract: steps must declare required and produced context keys (Apr 30, 2026)
28
85
 
29
- **Root cause (confirmed by investigation)**: `phase-6-implement-slices` is a `forEach` loop, not a `while`/`until` loop with `artifact_contract`. The `wr.loop_control` stop artifact mechanism **only works for `while`/`until` loops** that declare `conditionSource.kind = artifact_contract`. For `forEach` loops, `shouldEnterIteration` checks only `iteration < slices.length` -- artifacts passed to `interpreter.next()` are never consulted. Confirmed in `workflow-interpreter.ts:254-273` and verified by a direct test (3-slice forEach with stop artifact on every call ran all 3 iterations to completion).
86
+ **Status: tentative** | Priority: medium
30
87
 
31
- **Why the loop stopped at pass 8**: the loop exhausted its `slices` array which had exactly 8 elements. `metrics_outcome = success` appearing at pass 8 was a coincidence.
88
+ **Score: 12** | Cor:3 Cap:2 Eff:1 Lev:3 Con:2 | Blocked: no
32
89
 
33
- **`currentSlice.name` showing `[unset]`**: secondary issue. `buildLoopRenderContext` in `prompt-renderer.ts:190-197` requires `sessionContext['slices']` to be an array at render time. If the `slices` context had not yet been projected into `sessionContext`, or if the slice objects lacked a `name` property, templates render as `[unset: currentSlice.name]`.
90
+ The engine has no mechanism to enforce context between steps. `Capture:` instructions in step prompts are prose -- the engine accepts `continue_workflow` with empty context on every advance, silently. This is the systemic root of the forEach `[unset]` bug: the agent wrote planning output as notes, not as context, and the engine accepted every advance without complaint. The same failure can happen in any workflow that passes state between steps.
34
91
 
35
- **Three fix directions:**
36
- 1. **Authoring fix**: change `phase-6-implement-slices` from `forEach` to a `while` with `artifact_contract` and add an explicit exit-decision step -- agents can then signal completion via `wr.loop_control`
37
- 2. **Engine feature**: add early-exit support to `forEach` loops when a `wr.loop_control` stop artifact is emitted
38
- 3. **Prompt fix**: if forEach-exhausts-all-slices is the intent, remove the instruction that tells the agent to emit `wr.loop_control` artifacts
92
+ **Things to hash out:**
93
+ - What schema format should `contextContract` use -- JSON Schema subset or a simpler workrail-specific type DSL?
94
+ - Should validation be blocking (engine rejects the advance) or advisory (engine warns in the next step prompt)?
95
+ - Does context contract cover loop entry preconditions, or does the separate forEach guard item handle that?
96
+
97
+ ---
98
+
99
+ ### `todoList` step type: ergonomic abstraction over forEach (Apr 30, 2026)
100
+
101
+ **Status: idea** | Priority: medium
102
+
103
+ **Score: 10** | Cor:2 Cap:3 Eff:1 Lev:2 Con:2 | Blocked: no
104
+
105
+ Workflow authors using forEach must manually wire a prior step to populate the items array, understand iteration variables, avoid emitting `wr.loop_control` artifacts (which have no effect in forEach), and explain the loop framing to the agent. The forEach shape guard (PR #926) now catches primitive-item arrays loudly at loop entry, but the wiring between "the step that produces items" and "the loop that consumes them" remains implicit and invisible to the engine. The `todoList` abstraction would make this wiring structural.
39
106
 
40
107
  **Things to hash out:**
41
- - Which fix direction is correct depends on the intended behavior: should the agent be able to stop the loop early (fix 1 or 2), or should it always run all slices (fix 3)?
42
- - If fix 2 (engine feature), does early-exit from forEach affect the `currentSlice` render context in a way that could cause confusion?
43
- - Does fix 1 require re-authoring the workflow through `wr.workflow-for-workflows`, or is it a targeted JSON edit?
108
+ - Should `todoList` compile to a forEach loop at the engine layer, or be a new execution primitive?
109
+ - How does the setup step that produces the items array get authored -- inline prompt, routine reference, or both?
110
+ - What does the agent-facing presentation look like: "Item 3 of 8" with item content injected, or something else?
111
+ - Should `wr.loop_control` artifacts be stripped from the step prompt entirely in a `todoList`, or does the agent still need an explicit completion signal?
44
112
 
45
113
  ---
46
114
 
@@ -143,6 +211,8 @@ This is categorically different from bugs (the agent implemented the right thing
143
211
  - Agent makes a local fix when the user wanted an architectural change
144
212
  - Agent's implementation is technically correct but violates unstated invariants the user assumed were obvious
145
213
 
214
+ **Done looks like:** a WorkTrain session that receives an ambiguous or underspecified task either (a) states its interpretation explicitly before acting and the coordinator can gate on approval, or (b) has access to enough prior context (from the knowledge graph or living work context) that the interpretation is reliably correct. A session that builds the wrong thing well should be detectable before it merges, not after.
215
+
146
216
  **Things to hash out:**
147
217
  - Where in the workflow should intent validation happen? Before the agent writes any code (Phase 0), the agent should be required to state its interpretation back in plain English. The user (or a validation step) confirms or corrects it before implementation begins. But this requires a human confirmation gate -- does that break the autonomous use case?
148
218
  - For fully autonomous sessions (no human in the loop), is there a way to detect a likely intent gap before the agent commits? Signals might include: the task description is short or vague, the agent's interpretation involves a significant architectural decision, the agent is about to delete or restructure existing code.
@@ -173,6 +243,8 @@ This is exactly what happened with the commit SHA change: setting `agentCommitSh
173
243
  - Agent notes a downstream impact in session notes but does not block, escalate, or file a follow-up ticket
174
244
  - **Agent reframes a bug as "a key tradeoff to document."** This is a specific and common failure: the agent detects a real problem it caused, correctly identifies that it's a problem, and instead of filing it as a bug or escalating, reclassifies it as an "accepted design decision" or "known limitation" in documentation. The bug is real. Documenting it is not fixing it. This pattern actively buries bugs.
175
245
 
246
+ **Done looks like:** when an agent makes a change that degrades something outside its scope, it surfaces the degradation explicitly before the PR merges -- either by blocking (filing a follow-up issue as a condition of the current PR merging) or escalating to the coordinator for a decision. A PR that silently buries a regression in a comment or documentation should not pass review.
247
+
176
248
  **Things to hash out:**
177
249
  - How does an agent distinguish "acceptable tradeoff within scope" from "collateral damage that must be escalated"? The line is fuzzy and context-dependent. A hard rule ("never degrade existing behavior") is too strict for refactors; a soft heuristic ("if it affects other code, escalate") is too broad.
178
250
  - Should the agent be required to enumerate side effects as part of the verification phase, and should the coordinator review that list before merging? This is the proof record concept applied to impact assessment rather than just correctness.
@@ -283,21 +355,49 @@ Five dimensions, each scored 1-3. Score = sum (max 15). Items marked **Blocked**
283
355
 
284
356
  ---
285
357
 
358
+ ### `delivery_failed` unreachable in `getChildSessionResult` -- type promises more than code delivers (Apr 30, 2026)
359
+
360
+ **Status: done** | Fixed in `cd8aaeb8` -- `delivery_failed` removed from `ChildSessionResult` entirely. The `spawnSession`/`spawnAndAwait` path cannot produce it by design; it only exists in `spawn_agent`'s direct outcome mapping.
361
+
362
+ ---
363
+
364
+ ### `spawnAndAwait` duplicates ~90 lines of polling logic from `awaitSessions` (Apr 30, 2026)
365
+
366
+ **Status: tech debt** | Priority: low
367
+
368
+ **Score: 8** | Cor:1 Cap:1 Eff:2 Lev:1 Con:3 | Blocked: no
369
+
370
+ `spawnAndAwait` in `coordinator-deps.ts` contains an inline polling loop (~90 lines) that duplicates the logic in `awaitSessions`. The WHY comment explains a real construction-time constraint: object literals cannot reference sibling methods by name during construction. But this constraint applies to methods on the returned object -- it does not apply to closure-level functions, which are already used for `fetchAgentResult` and `fetchChildSessionResult`.
371
+
372
+ **Fix:** extract a `pollUntilTerminal(handles: string[], timeoutMs: number): Promise<'completed' | 'timed_out' | 'degraded'>` closure-level function (before the `return {}` block). Have both `awaitSessions` and `spawnAndAwait` call it. This eliminates the duplication without violating the construction-time constraint.
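A minimal sketch of the extraction follows, assuming injected dependencies so it runs without a daemon. Only the `pollUntilTerminal` signature comes from the fix description; `PollDeps`, `SessionStatus`, `makeDepsSketch`, the 10 ms poll interval, and the rule that an `"unknown"` status maps to `'degraded'` are hypothetical choices made for illustration.

```typescript
type PollOutcome = "completed" | "timed_out" | "degraded";
type SessionStatus = "running" | "completed" | "unknown";

// Hypothetical dependencies, injected so the sketch is testable in isolation.
interface PollDeps {
  getStatus: (handle: string) => Promise<SessionStatus>;
  sleep: (ms: number) => Promise<void>;
  now: () => number;
}

function makeDepsSketch(deps: PollDeps) {
  // Closure-level function: declared before the returned object literal,
  // so both methods below can call it without referencing a sibling method.
  async function pollUntilTerminal(
    handles: string[],
    timeoutMs: number
  ): Promise<PollOutcome> {
    const deadline = deps.now() + timeoutMs;
    while (deps.now() < deadline) {
      const statuses = await Promise.all(handles.map((h) => deps.getStatus(h)));
      // Assumption: an unknown status means the result cannot be trusted.
      if (statuses.some((s) => s === "unknown")) return "degraded";
      if (statuses.every((s) => s === "completed")) return "completed";
      await deps.sleep(10);
    }
    return "timed_out";
  }

  return {
    awaitSessions: (handles: string[], timeoutMs: number) =>
      pollUntilTerminal(handles, timeoutMs),
    spawnAndAwait: async (spawn: () => Promise<string>, timeoutMs: number) => {
      const handle = await spawn();
      return pollUntilTerminal([handle], timeoutMs);
    },
  };
}
```

Because `pollUntilTerminal` lives in the enclosing closure rather than on the returned object, the object literal never has to refer to itself during construction.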
373
+
374
+ **GitHub issue:** https://github.com/EtienneBBeaulac/workrail/issues/921
375
+
376
+ ---
377
+
286
378
  ### Daemon architecture: remaining migrations (Apr 29, 2026)
287
379
 
288
- **Status: partial** | A9 shipped Apr 29, 2026.
380
+ **Status: partial** | A9 shipped Apr 29, 2026. FC/IS follow-on shipped Apr 30 -- May 1, 2026.
289
381
 
290
382
  **Score: 8** | Cor:1 Cap:1 Eff:2 Lev:1 Con:3 | Blocked: no
291
383
 
292
384
  Track A (A1-A9) shipped and the `SessionSource` migration is complete. `WorkflowTrigger._preAllocatedStartResponse` is gone.
293
385
 
386
+ **Shipped Apr 30 -- May 1, 2026 (PR #925):**
387
+ - `TerminalSignal` union replaces `stuckReason` + `timeoutReason`. Illegal state (stuck AND timeout simultaneously) now structurally impossible. Stall overwrite bug fixed. `Readonly<SessionState>` at pure read sites.
388
+ - `SessionScope` capability boundary complete: `onTokenUpdate`, `onIssueReported`, `onSteer`, `getCurrentToken`, `sessionWorkspacePath`, spawn depths all named scope fields. `constructTools` signature is `(ctx, apiKey, schemas, scope)` -- zero direct `state.X` references.
389
+ - Early-exit paths unified through `finalizeSession`. `SteerRegistry`/`AbortRegistry` dead exports removed.
390
+ - Architecture tests enforce `state.terminalSignal` write restriction and `constructTools` state-access restriction in CI.
391
+ - `persistTokens` failure early-exit path covered by new outcome invariants tests.
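To illustrate why a union makes the "stuck AND timeout" state unrepresentable, here is a small sketch. The variant names, the `reason` field, `SessionStateSketch`, and the first-writer-wins policy in this `setTerminalSignal` are assumptions for illustration; the entry only states that a `TerminalSignal` union replaced `stuckReason` + `timeoutReason` and that writes go through `setTerminalSignal()`.

```typescript
// Before (as described above): two independent optional fields,
// so "stuck AND timed out" could be represented by setting both.
interface LegacyStateSketch {
  stuckReason?: string;
  timeoutReason?: string;
}

// After: one field holding a discriminated union, so at most one signal exists.
// Variant and field names are hypothetical.
type TerminalSignal =
  | { kind: "stuck"; reason: string }
  | { kind: "timeout"; reason: string };

interface SessionStateSketch {
  terminalSignal: TerminalSignal | null;
}

// Assumed policy: the first signal wins and later ones cannot overwrite it.
function setTerminalSignal(state: SessionStateSketch, signal: TerminalSignal): void {
  if (state.terminalSignal === null) {
    state.terminalSignal = signal;
  }
}

// Pure read site: takes a Readonly view and switches on the discriminant.
function summarize(state: Readonly<SessionStateSketch>): string {
  const signal = state.terminalSignal;
  if (signal === null) return "running";
  switch (signal.kind) {
    case "stuck":
      return `stuck: ${signal.reason}`;
    case "timeout":
      return `timeout: ${signal.reason}`;
  }
}

const state: SessionStateSketch = { terminalSignal: null };
setTerminalSignal(state, { kind: "stuck", reason: "no progress" });
setTerminalSignal(state, { kind: "timeout", reason: "wall clock" }); // ignored
console.log(summarize(state));
```

Funnelling every write through one function is also what makes an architecture test for stray `state.terminalSignal =` assignments practical.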
392
+
294
393
  **Remaining items:**
295
394
 
296
395
  - `CriticalEffect<T>` / `ObservabilityEffect` type distinction -- categorize side effects in `runAgentLoop` and finalization as either crash-relevant or observability-only
297
- - `StateRef` mutation wrapper -- replace direct `state.pendingSteerParts.push()` mutations with an explicit mutation API
298
396
  - Zod tool param validation -- replace manual `typeof` checks in tool factories with Zod schema validation (requires `zodToJsonSchema` or maintaining two sources of truth for param schemas)
299
397
  - `createCoordinatorDeps` unit tests -- extraction in B3 improved testability; cover `spawnSession`, `awaitSessions`, `getAgentResult` at minimum
300
398
  - ~~Wire `AllocatedSession.triggerSource` to the `run_started` event for session attribution~~ -- **done**, PR #899 (Apr 30, 2026)
399
+ - ~~`SessionStateWriter` capability interfaces~~ -- **done** as part of PR #925 (`SessionScope` now owns all mutation callbacks)
400
+ - ~~Architecture test: forbid `state.terminalSignal =` direct writes outside `setTerminalSignal()`~~ -- **done**, PR #925
301
401
 
302
402
  ---
303
403
 
@@ -362,6 +462,8 @@ Phase 3 (PRs #835, #837): `buildTurnEndSubscriber`, `buildAgentCallbacks`, `buil
362
462
 
363
463
  **Total workflow-runner.ts reduction: ~4,955 → ~2,800 lines (44%).**
364
464
 
465
+ **FC/IS follow-on (PR #925, Apr 30 -- May 1, 2026):** `TerminalSignal` union, `SessionScope` capability boundary completion, early-exit unification through `finalizeSession`, architecture tests. See "Daemon architecture: remaining migrations" entry for full details.
466
+
365
467
  **Follow-on:** `wr.refactoring` workflow (see backlog entry above). Remaining items in "Daemon architecture: remaining migrations" entry above.
366
468
 
367
469
  ---
@@ -957,6 +1059,31 @@ Combined with the `DEFAULT_MAX_TURNS` cap, this provides defense-in-depth agains
957
1059
 
958
1060
  The durable session store, v2 engine, and workflow authoring features shared by all three systems.
959
1061
 
1062
+ ### WorkTrain as the canonical workflow author -- MCP as a derived runtime (Apr 30, 2026)
1063
+
1064
+ **Status: idea** | Priority: high
1065
+
1066
+ **Score: 13** | Cor:2 Cap:3 Eff:1 Lev:3 Con:2 | Blocked: no
1067
+
1068
+ Today workflows are authored once and expected to work identically in both runtimes: the WorkRail MCP server (human-in-the-loop, Claude Code) and the WorkTrain daemon (fully autonomous, coordinator-driven). In practice they don't -- a workflow authored for human use has `requireConfirmation` gates that block autonomous execution, step prompts that assume the human is reading them, and phase structures that assume a single continuous session. Conversely, a workflow good for autonomous use has no natural pause points, produces typed structured outputs that humans find hard to read mid-session, and chains phases that a human might want to interrupt.
1069
+
1070
+ The current response is to author separate "agentic variants" (`wr.coding-task` vs `coding-task-workflow.agentic.v2`). This is the wrong direction: it duplicates the maintenance burden, improvements to one don't propagate to the other, and there is no single source of truth for what a workflow does.
1071
+
1072
+ There should be one version of each workflow, not two. Improvements to one should benefit the other automatically. The self-improvement loop WorkTrain runs on its own workflows should produce better workflows for everyone, not just daemon sessions. The question is how to structure authorship and any adaptation layer so this is possible without forcing workflows into an awkward compromise that works poorly in both contexts.
1073
+
1074
+ **What this enables:** WorkTrain can autonomously improve workflows using `wr.workflow-for-workflows`, and those improvements automatically benefit MCP users. The self-improvement loop produces better workflows for everyone, not just daemon sessions. Workflow quality compounds because there is only one version to improve.
1075
+
1076
+ **Relationship to existing entries:**
1077
+ - "Workflow runtime adapter: one spec, two runtimes" (Shared/Engine) is a narrower version of this idea focused on parallelism and `requireConfirmation` gates. This entry is about the authoring philosophy and source-of-truth question, not just the adapter mechanics.
1078
+ - `wr.workflow-for-workflows` is how WorkTrain improves workflows autonomously -- this entry determines what it improves toward.
1079
+
1080
+ **Things to hash out:**
1081
+ - What does the MCP conversion layer actually do? Adding pause points is straightforward. Adapting output formats (structured JSON → human-readable prose) may require active LLM translation, not just structural transformation.
1082
+ - Some workflow steps are genuinely different between runtimes -- a step that spawns parallel child sessions in the daemon doesn't have a clean MCP equivalent. Does the conversion layer skip those, simulate them sequentially, or require the author to declare a fallback?
1083
+ - If WorkTrain is the authoring target, existing workflows authored for MCP need migration. What is the migration path and who does it -- the author, WorkTrain itself, or a one-time script?
1084
+ - How do `requireConfirmation` gates fit? In the daemon they are removed or auto-satisfied by the coordinator. In MCP they pause for the human. Does the workflow declare them or does the conversion layer infer them?
1085
+ - Is the conversion layer purely structural (rearranging/omitting steps) or does it require understanding the semantic intent of each step?
1086
+
960
1087
 
961
1088
  ### Improve commit SHA gathering consistency in wr.coding-task
962
1089
 
@@ -1869,10 +1996,14 @@ A proof record contains: `prNumber`, `goal`, `verificationChain` (array of `{ ki
1869
1996
 
1870
1997
  ### Scripts-first coordinator: avoid the main agent wherever possible (Apr 15, 2026)
1871
1998
 
1872
- **Status: idea** | Priority: high
1999
+ **Status: partial** | Foundation shipped PR #908 (Apr 30, 2026)
1873
2000
 
1874
2001
  **Score: 12** | Cor:1 Cap:3 Eff:2 Lev:3 Con:3 | Blocked: no
1875
2002
 
2003
+ **What shipped:** `ChildSessionResult` discriminated union, `getChildSessionResult()`, `spawnAndAwait()`, `parentSessionId` threading, `wr.coordinator_result` artifact schema. The typed coordinator primitives that enable in-process coordinator scripts are now available.
2004
+
2005
+ **What's still needed:** the actual coordinator scripts (full development pipeline, bug-fix coordinator, grooming coordinator) and the `worktrain spawn`/`await` CLI commands that wrap these primitives for shell scripts.
2006
+
1876
2007
  **The insight:** In a coordinator workflow, the main agent spends most of its time on mechanical work -- reading PR lists, checking CI status, deciding whether findings are blocking, sequencing merges. That's all deterministic logic. An LLM is expensive, slow, and inconsistent for deterministic work.
1877
2008
 
1878
2009
  **The principle:** the scripts-over-agent rule applies at the coordinator level too. The coordinator's job is to drive a DAG of child sessions. The DAG structure, routing decisions, and termination conditions should be scripts, not LLM reasoning.
@@ -2038,7 +2169,7 @@ WorkTrain notices things without being asked. After a batch of work lands, it sc
2038
2169
 
2039
2170
  ### Native multi-agent orchestration: coordinator sessions + session DAG (Apr 15, 2026)
2040
2171
 
2041
- **Status: idea** | Priority: high
2172
+ **Status: partial** | Typed primitives shipped PR #908 (Apr 30, 2026)
2042
2173
 
2043
2174
  **Score: 10** | Cor:1 Cap:3 Eff:1 Lev:3 Con:2 | Blocked: no
2044
2175
 
@@ -2357,6 +2488,99 @@ A workflow that aggregates activity across git history, GitLab/GitHub MRs and re
2357
2488
 
2358
2489
  ## Platform Vision (longer-term)
2359
2490
 
2491
+ ### Move backlog to a dedicated worktrain-meta repo with version control (Apr 30, 2026)
2492
+
2493
+ **Status: idea** | Priority: high
2494
+
2495
+ **Score: 11** | Cor:2 Cap:2 Eff:2 Lev:3 Con:3 | Blocked: no
2496
+
2497
+ The backlog (`docs/ideas/backlog.md`) lives in the code repo. Every feature branch has its own version. Ideas added mid-session on a feature branch are held hostage until that PR merges. If two branches modify the backlog simultaneously, merge conflicts occur. There is no single authoritative place to capture an idea that immediately applies everywhere.
2498
+
2499
+ A dedicated `worktrain-meta` repo (e.g. `~/git/personal/worktrain-meta/`) would hold the backlog as the only concern. No feature branches -- ideas are committed directly to main. Full git history preserved. No code PR ever touches it.
2500
+
2501
+ Done means: an operator or agent can add a backlog idea from any branch or context, commit directly, and it is immediately visible on all other branches and in all other sessions.
2502
+
2503
+ **Note on format:** when this migration happens, one-file-per-item with YAML frontmatter becomes viable. Frontmatter makes scores, status, dates, and blocked-by machine-readable without prose parsing. The `npm run backlog` script would read frontmatter instead of regex-parsing Score lines. This is the right time to adopt that format -- in the current single-file structure, frontmatter would require a custom delimiter scheme, but one-file-per-item makes it natural.
2504
+
2505
+ **Things to hash out:**
2506
+ - Should the worktrain-meta repo also hold the roadmap docs, now-next-later, open-work-inventory? Or just the backlog?
2507
+ - How do subagents spawned in a worktree find the backlog? They need a configured path, not relative to the code workspace.
2508
+ - When native structured backlog operations are built (SQLite), does the storage backend live in worktrain-meta (git-tracked history) or `~/.workrail/data/` (local queryable)? Both have merit.
2509
+
2510
+ ---
2511
+
2512
+ ### Invocable routines: dispatch an existing routine directly as a task (Apr 30, 2026)
2513
+
2514
+ **Status: idea** | Priority: high
2515
+
2516
+ **Score: 12** | Cor:1 Cap:3 Eff:2 Lev:3 Con:3 | Blocked: no
2517
+
2518
+ WorkRail has a routines system (`workflows/routines/`) for reusable workflow fragments. But routines can only be used embedded inside a larger workflow -- there is no way to invoke a routine directly as a standalone task. Many useful repeat tasks are process-shaped (same steps every time, structured output) and could be expressed as short 1-2 step workflows or existing routines. Today an operator who wants to run "context gathering" or "hypothesis challenge" on demand has to either build a wrapper workflow or do it manually.
2519
+
2520
+ There is no dispatch surface for standalone routine invocation. Done means: an operator can invoke any routine by name from the CLI or a trigger, and the result is durable in the session store.
2521
+
2522
+ **Relationship to existing ideas:** this is one half of the lightweight agents gap (the process-shaped half). The ad-hoc query half is a separate entry below.
2523
+
2524
+ **Things to hash out:**
2525
+ - Should this be a new CLI command (`worktrain invoke <routineId> --goal "..."`) or a trigger type, or both?
2526
+ - Do routines need output contracts defined before they can be invoked standalone, or is free-form output acceptable?
2527
+ - How does the session store record a routine-only run vs a full workflow run? Should they be distinguished?
2528
+
2529
+ ---
2530
+
2531
+ ### Ad-hoc query agents: answer questions about the workspace without a full workflow (Apr 30, 2026)
2532
+
2533
+ **Status: idea** | Priority: high
2534
+
2535
+ **Score: 11** | Cor:1 Cap:3 Eff:2 Lev:2 Con:2 | Blocked: yes (needs knowledge graph for efficient context)
2536
+
2537
+ There is a class of tasks that are question-shaped rather than process-shaped: "why does the session store use a manifest file?", "what would break if I changed this function?", "summarize what shipped this week." These don't have fixed steps, don't produce structured output contracts, and don't benefit from workflow phase gating. Running a full `wr.coding-task` session for them wastes 10 minutes on overhead. Not supporting them means the operator has to context-switch to Claude Code or do them manually.
2538
+
2539
+ These tasks need a capable agent with workspace context but no workflow structure. They are stateless, single-purpose, and short-lived.
2540
+
2541
+ Examples of what this enables:
2542
+ - `worktrain ask "why does the session store use a manifest file?"`
2543
+ - `worktrain explain pr/908`
2544
+ - `worktrain impact src/trigger/coordinator-deps.ts`
2545
+ - `worktrain diff-since "last week"`
2546
+
2547
+ Done means: an operator can ask a natural-language question about the workspace and get a grounded answer within seconds, without starting a full session.
2548
+
2549
+ **Relationship to existing ideas:** `worktrain talk` (interactive ideation) is the conversational, stateful version of this. Standup status generator is a scheduled instance of the same pattern. Invocable routines (entry above) are the process-shaped complement. This entry covers the unstructured query case.
2550
+
2551
+ **Things to hash out:**
2552
+ - Without the knowledge graph, these queries require full file-scanning on every invocation -- too slow to be useful. Is there a minimum viable version before the KG is built, or does this wait?
2553
+ - What is the boundary between "this is a quick query" and "this actually needs a full discovery session"? Who decides -- the operator, or WorkTrain itself?
2554
+ - Should outputs be ephemeral (printed to terminal, not stored) or durable (in session store)? Durability adds value for audit but adds overhead.
2555
+
2556
+ ---
2557
+
2558
+ ### Self-restart after shipping changes to itself (Apr 30, 2026)
2559
+
2560
+ **Status: idea** | Priority: medium
2561
+
2562
+ **Score: 11** | Cor:2 Cap:3 Eff:2 Lev:2 Con:2 | Blocked: yes (needs self-improvement loop operational)
2563
+
2564
+ If WorkTrain can build and ship changes to itself autonomously, the natural next step is that it also restarts itself with those changes. Today, after a WorkTrain daemon session ships a change to the workrail repo, the daemon continues running the old binary. The operator has to manually run `worktrain daemon --stop && worktrain daemon --start` to pick up the new version. In a self-improving system running overnight, this is a human intervention point that should not exist.
2565
+
2566
+ **What this requires:**
2567
+ 1. After a session that modifies WorkTrain itself merges to main, the daemon detects it was running on this repo
2568
+ 2. The daemon rebuilds (`npm run build`) and restarts itself cleanly -- completing any in-flight sessions first, then performing a graceful restart with the new binary
2569
+ 3. After restart, the daemon logs what changed so the operator can review
2570
+
2571
+ This is related to the "daemon binary stale after rebuild" P0 gap, but goes further: not just warning about staleness, but actually handling the upgrade cycle automatically.
2572
+
2573
+ **Why this matters for the self-improvement loop:** if WorkTrain ships 5 improvements to itself in a day but the operator has to manually restart it 5 times, the loop isn't truly autonomous. Full autonomy requires the restart to be part of the pipeline.
2574
+
2575
+ **Things to hash out:**
2576
+ - What triggers the restart check? After every merge to main that touches `src/`? After a successful `npm run build`? On a heartbeat that detects binary staleness?
2577
+ - How does the daemon ensure in-flight sessions complete before restarting? Does it drain the active session set or hard-stop?
2578
+ - What is the rollback path if the new binary fails to start (startup crash, broken build)? The daemon needs to detect this and either roll back or alert the operator.
2579
+ - Should the restart happen immediately or during a configurable "quiet period" (e.g. 2am) to avoid disrupting active sessions during the day?
2580
+ - Self-modification is inherently risky -- a buggy change to the daemon's restart logic could make the daemon unable to restart at all. What safeguards prevent this?
2581
+
2582
+ ---
2583
+
2360
2584
  ### WorkTrain as a first-class project participant: ideal backlog and planning capabilities (Apr 30, 2026)
2361
2585
 
2362
2586
  **Status: idea** | Priority: high (long-term)
@@ -54,9 +54,9 @@ Not every roadmap item must become a ticket immediately.
54
54
 
55
55
  ## Status ownership
56
56
 
57
- **Status lives in exactly two places**: `docs/roadmap/open-work-inventory.md` and `docs/roadmap/now-next-later.md`.
57
+ **Status lives in `docs/ideas/backlog.md`**. Each entry has a `Status:` line (idea / partial / done / bug). Use `npm run backlog` to see a scored, sorted view.
58
58
 
59
- Plan docs in `docs/plans/` describe **design and intent** -- not current status. When work ships, update the roadmap docs, not the plan doc. Plan docs that carry their own status blocks create a second source of truth that drifts.
59
+ Plan docs in `docs/plans/` describe **design and intent** -- not current status. When work ships, update the backlog entry status, not the plan doc.
60
60
 
61
61
  ## Rules of thumb
62
62
 
@@ -95,10 +95,7 @@ Existing feature-specific plans in `docs/plans/` still matter. Treat them as **f
95
95
 
96
96
  ## Starting points
97
97
 
98
- - `docs/ideas/backlog.md`
99
- - `docs/roadmap/now-next-later.md`
100
- - `docs/roadmap/open-work-inventory.md`
101
- - `docs/roadmap/legacy-planning-status.md`
102
- - `docs/planning/docs-taxonomy-and-migration-plan.md`
103
- - `docs/tickets/README.md`
104
- - `docs/tickets/next-up.md`
98
+ - `docs/vision.md` -- what WorkTrain is and where it's going (read this first)
99
+ - `docs/ideas/backlog.md` -- the backlog (`npm run backlog` for priority view)
100
+ - `docs/roadmap/legacy-planning-status.md` -- status map for older planning docs
101
+ - `docs/tickets/next-up.md` -- scratch space for near-term tickets
@@ -0,0 +1,8 @@
1
+ # Archive
2
+
3
+ These docs were superseded by `docs/ideas/backlog.md` + `npm run backlog`.
4
+
5
+ - `now-next-later.md` -- manual roadmap curation; replaced by backlog scoring and `npm run backlog`
6
+ - `open-work-inventory.md` -- normalized list of partial/unimplemented work; replaced by `Status: partial` entries in the backlog
7
+
8
+ Kept for historical reference only. Do not update.
@@ -1,6 +1,11 @@
1
1
  # Next Up
2
2
 
3
- Groomed near-term tickets. Check `docs/roadmap/now-next-later.md` first for the current priority ordering.
3
+ Scratch space for grooming near-term tickets before they become GitHub issues.
4
+ For the current priority ordering, run `npm run backlog -- --min-score 11 --unblocked-only` or see `docs/ideas/backlog.md`.
5
+
6
+ ---
7
+
8
+ > The tickets below are historical. Active work is tracked via GitHub issues and the backlog.
4
9
 
5
10
  ---
6
11
 
package/docs/vision.md ADDED
@@ -0,0 +1,115 @@
1
+ # WorkTrain Vision
2
+
3
+ ## What WorkTrain is
4
+
5
+ WorkTrain is an autonomous software development daemon. It runs continuously in the background, picks up tasks from external systems (GitHub issues, GitLab MRs, Jira tickets, webhooks), and drives them through the full development lifecycle -- discovery, shaping, implementation, review, fix, merge -- without human intervention between phases.
6
+
7
+ The operator's job is to configure what WorkTrain works on and what rules it follows. WorkTrain's job is to do the actual work, autonomously, reliably, and correctly.
8
+
9
+ ## The self-improvement loop
10
+
11
+ WorkTrain builds WorkTrain. This is not a metaphor -- it is the intended operating mode and the ultimate test of whether the system works.
12
+
13
+ WorkTrain runs the workrail repository as one of its own workspaces. It picks up tickets from the workrail GitHub issue queue, runs the full pipeline (discovery, shaping, coding, review, fix, merge), and ships improvements to itself. Every feature built into WorkTrain is a feature WorkTrain could have built using its own infrastructure. Every bug fixed in WorkTrain is a bug WorkTrain found in itself.
14
+
15
+ This creates a direct feedback loop: if WorkTrain's development pipeline is flawed, it will produce flawed changes to itself and catch them in review. If its context injection is thin, it will miss things in its own codebase that a well-briefed agent would catch. The quality of WorkTrain's output is the quality of WorkTrain.
16
+
17
+ The self-improvement loop is not fully operational today. The pieces -- coordinator session chaining, full development pipeline, spec as ground truth, living work context -- are being built. But it is the north star. If WorkTrain cannot build WorkTrain well, it cannot be trusted to build anything else.
18
+
19
+ ## What success looks like
20
+
21
+ An operator assigns a ticket to WorkTrain in the morning. By the time they check in, there is a merged PR, a closed ticket, and a summary of what was done and why. They did not intervene between phases. Nothing surprising happened that required their attention.
22
+
23
+ WorkTrain earns trust over time by doing this correctly, repeatedly, at scale -- not just for one-off tasks but as the default mode of software development. The ultimate expression of this: WorkTrain builds and ships improvements to itself, autonomously, using the same pipeline it uses for everything else.
24
+
25
+ ## What WorkTrain is not
26
+
27
+ - **Not a chatbot or copilot.** WorkTrain does not assist humans doing development. It does development. The human is the operator, not the pair programmer.
28
+ - **Not the WorkRail MCP server.** The WorkRail engine and MCP server are infrastructure WorkTrain uses. They are separate systems. Do not conflate them.
29
+ - **Not a replacement for judgment.** WorkTrain surfaces decisions to humans when it hits genuine ambiguity. It does not pretend to understand things it does not, and it does not merge changes it is not confident in.
30
+
31
+ ## How WorkTrain thinks about work
32
+
33
+ **Phases, not turns.** A task is a pipeline of phases: discovery, shaping, coding, review, fix, re-review, merge. Each phase is a session with a typed output contract. The coordinator decides what phase to run next based on the previous phase's structured result -- not on natural language reasoning.
34
+
35
+ **Zero LLM turns for routing.** Coordinator decisions -- what workflow to run next, whether findings are blocking, when to merge -- are deterministic TypeScript code. LLM turns are used for cognitive work: understanding code, writing code, evaluating findings. Never for deciding "what do I do next?".
36
+
37
+ **Structured outputs at every boundary.** Each phase produces a typed result. The next phase reads that result. Free-text scraping between phases is a design smell. `ChildSessionResult`, `wr.coordinator_result`, `wr.review_verdict` are the contracts that make phases composable without a main agent holding context.
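
To illustrate deterministic routing on typed results, here is a minimal sketch. Every type and function in it is hypothetical: the real shapes of `ChildSessionResult`, `wr.coordinator_result`, and `wr.review_verdict` are not shown in this document, and the escalation rules are assumptions chosen for illustration.

```typescript
// Hypothetical shapes for illustration only; not the real WorkTrain contracts.
type WorkPhase = "discovery" | "shaping" | "coding" | "fix";
type ReviewPhase = "review" | "re-review";
type Phase = WorkPhase | ReviewPhase | "merge";

type PhaseResult =
  | { kind: "work"; phase: WorkPhase; ok: boolean }
  | { kind: "verdict"; phase: ReviewPhase; blockingFindings: number }
  | { kind: "merge"; merged: boolean };

type Decision =
  | { action: "run"; phase: Phase }
  | { action: "escalate"; reason: string }
  | { action: "done" };

// Which phase follows each successful work phase.
const AFTER_WORK: Record<WorkPhase, Phase> = {
  discovery: "shaping",
  shaping: "coding",
  coding: "review",
  fix: "re-review",
};

// Pure routing: the decision depends only on the typed result, never on free text.
function route(result: PhaseResult): Decision {
  switch (result.kind) {
    case "work":
      return result.ok
        ? { action: "run", phase: AFTER_WORK[result.phase] }
        : { action: "escalate", reason: `${result.phase} failed` };
    case "verdict":
      if (result.blockingFindings === 0) return { action: "run", phase: "merge" };
      return result.phase === "review"
        ? { action: "run", phase: "fix" }
        : { action: "escalate", reason: "blocking findings remain after re-review" };
    case "merge":
      return result.merged
        ? { action: "done" }
        : { action: "escalate", reason: "merge failed" };
  }
}
```

Because `route` is a pure function of its input, the same typed result always yields the same decision, and no LLM turn is spent on choosing the next phase.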
38
+
39
+ **Correctness over speed.** WorkTrain does not merge changes it is not confident in. Review findings are addressed. Tests pass. The right next step is not always the fastest one.
40
+
41
+ ## What makes WorkTrain different from other autonomous coding agents
42
+
43
+ Most autonomous coding agents are single-session: they get a task, they work on it, they produce output. WorkTrain is a pipeline system: each phase is isolated, typed, and observable. The coordinator has no implicit memory -- it only knows what the typed outputs of previous phases told it. This makes pipelines:
44
+
45
+ - **Reproducible**: the same task run twice takes the same path
46
+ - **Observable**: every phase, every result, every decision is in the session store
47
+ - **Recoverable**: a crashed phase is retried with the same inputs
48
+ - **Auditable**: no black box; you can see exactly what each phase decided and why
49
+
50
+ ## Principles that guide every decision
51
+
52
+ 1. **Zero LLM turns for routing** -- coordinator logic is code, not reasoning
53
+ 2. **Typed contracts at phase boundaries** -- structured results, not free-text
54
+ 3. **The spec is the source of truth** -- every agent in a pipeline reads the same spec
55
+ 4. **Correctness over speed** -- do it right, not just done
56
+ 5. **Observable by default** -- every decision visible in the session store and console
57
+ 6. **Overnight-safe** -- the system must work while the operator is asleep
58
+
59
+ ## Quality standards WorkTrain holds itself to
60
+
61
+ WorkTrain does not ship work it is not confident in. Specifically:
62
+
63
+ - Review findings are addressed before merge -- no "I'll file a ticket for this later" on findings that block
64
+ - Tests pass. If tests were broken before the task started, that is noted explicitly, not silently ignored
65
+ - A PR that triggered the escalating review chain (Critical finding → re-review → re-review) never auto-merges without human approval
66
+ - If WorkTrain makes a change that degrades something outside its immediate scope, it surfaces that -- it does not document collateral damage as "a known tradeoff" and move on
67
+ - When WorkTrain is wrong about something, it acknowledges it explicitly in the session notes so the next session starts with accurate context
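
The merge-related standards above could be expressed as a single gate function. This is a sketch under assumptions: `MergeCandidate` and `mergeGate` are hypothetical names, and the handling of tests that were already broken before the task started is deliberately left out.

```typescript
// Hypothetical sketch; field and function names are not from the WorkTrain codebase.
interface MergeCandidate {
  // Review findings that block merge and are still unaddressed.
  openBlockingFindings: number;
  testsPass: boolean;
  // True when the PR triggered the escalating review chain.
  escalatedReviewChain: boolean;
  humanApproved: boolean;
}

type MergeGate = "auto-merge" | "needs-human-approval" | "blocked";

function mergeGate(c: MergeCandidate): MergeGate {
  // Blocking findings and failing tests stop the merge outright.
  if (c.openBlockingFindings > 0 || !c.testsPass) return "blocked";
  // An escalated PR never auto-merges without explicit human approval.
  if (c.escalatedReviewChain && !c.humanApproved) return "needs-human-approval";
  return "auto-merge";
}
```

For example, a PR with passing tests and no open findings that went through the escalating chain would return `"needs-human-approval"` until an operator signs off.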
68
+
69
+ ## How WorkTrain handles uncertainty and mistakes
70
+
71
+ WorkTrain will make mistakes. The system is designed around this:
72
+
73
+ - When an agent is uncertain about the task intent, it states its interpretation explicitly before acting. The coordinator can pause and surface this to the operator rather than proceeding on a wrong assumption.
74
+ - Mistakes produce structured findings in the session store. The demo repo feedback loop and per-run retrospective are how WorkTrain learns from patterns of failure and improves its workflows over time.
75
+ - "That's out of scope for this task" is not a valid reason to proceed past something that is genuinely wrong. Scope is for routing work, not for suppressing correctness.
76
+
77
+ ## The operator relationship
78
+
79
+ The operator configures what WorkTrain works on (triggers, workflows, workspace rules) and sets the boundaries within which it operates. WorkTrain decides autonomously how to do the work.
80
+
81
+ WorkTrain pauses and surfaces to the operator when:
82
+ - It encounters genuine ambiguity about what the task is asking for
83
+ - A finding is Critical and requires explicit human approval before merging
84
+ - A child session fails in a way that exhausts automated retries
85
+ - Something unexpected happened that the coordinator's routing logic does not cover
86
+
87
+ WorkTrain does not pause for: implementation decisions within a well-specified task, routine review findings it can fix autonomously, or any decision that fits within the rules the operator already configured.
88
+
89
+ This boundary is still being tested and refined through real usage. Where exactly "genuine ambiguity" begins is an open question.
90
+
91
+ ## What is still being built
92
+
93
+ WorkTrain is not finished. The vision above is where it is going, not where it is today. Key pieces still in progress:
94
+
95
+ - **Living work context** -- shared knowledge store that accumulates across all phases so every agent starts informed (`docs/ideas/backlog.md`: "Living work context")
96
+ - **Coordinator pipeline templates** -- actual coordinator scripts for full development pipeline, bug-fix, grooming (`docs/ideas/backlog.md`: "Scripts-first coordinator")
97
+ - **`worktrain spawn`/`await` CLI** -- CLI surface for coordinator scripts
98
+ - **Knowledge graph** -- per-workspace structural understanding so agents skip discovery on repeated tasks
99
+ - **Spec as ground truth** -- wiring `wr.shaping` output into coordinator dispatch so coding/review agents work from the same spec
100
+
101
+ For the current prioritized list, see `npm run backlog` or `docs/ideas/backlog.md`.
102
+
103
+ ## Open questions
104
+
105
+ These are genuinely unresolved. Any agent operating in this system should know they exist and not assume they are answered.
106
+
107
+ - **Does WorkTrain need a main orchestrating agent?** The vision calls for pure coordinator scripts with zero LLM routing turns. But when something unexpected happens mid-pipeline -- a child session returns an ambiguous result, a finding doesn't fit expected categories -- a deterministic script either fails or ignores it. Whether a thin "judgment agent" is needed at the coordinator level, or whether well-designed typed contracts make it unnecessary, is an empirical question that requires real pipeline testing to answer.
108
+
109
+ - **Where exactly is the operator boundary?** The rules above are directionally right but have fuzzy edges. "Genuine ambiguity" is not yet precisely defined. This will sharpen through real usage and failure modes, not through upfront design.
110
+
111
+ - **How does WorkTrain know when it doesn't understand something?** An agent that misunderstands a task produces code that's correct for its interpretation but wrong for the operator's intent. Detecting this before implementation begins -- via explicit intent confirmation, pattern matching against prior sessions, or something else -- is an open problem. See `docs/ideas/backlog.md`: "Intent gap".
112
+
113
+ - **What is the right granularity of tasks?** WorkTrain is being designed for ticket-sized work. Whether it handles epics (by decomposing them), hotfixes (by moving fast and deferring thoroughness), and architectural changes (which may require multiple sessions across multiple days) the same way is untested.
114
+
115
+ - **Is "document" the right abstraction for the living work context?** A flat document implies agents read it linearly. Agents need to query it selectively -- the coding agent wants constraints relevant to a specific decision, the review agent wants what the coding agent said about a specific module. A structured knowledge store (typed facts, queryable by topic) may be more useful than a document. See `docs/ideas/backlog.md`: "Living work context".
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@exaudeus/workrail",
3
- "version": "3.74.0",
3
+ "version": "3.74.2",
4
4
  "description": "Step-by-step workflow enforcement for AI agents via MCP",
5
5
  "license": "MIT",
6
6
  "repository": {