@exaudeus/workrail 3.73.0 → 3.73.2

Files changed (38)
  1. package/dist/cli/commands/worktrain-daemon.d.ts +7 -0
  2. package/dist/cli/commands/worktrain-daemon.js +29 -7
  3. package/dist/cli-worktrain.js +11 -0
  4. package/dist/console-ui/assets/{index-HnM-KywY.js → index-CfI4I3OX.js} +1 -1
  5. package/dist/console-ui/index.html +1 -1
  6. package/dist/manifest.json +73 -57
  7. package/dist/mcp/handlers/v2-advance-core/index.d.ts +1 -0
  8. package/dist/mcp/handlers/v2-advance-core/index.js +3 -3
  9. package/dist/mcp/handlers/v2-advance-core/outcome-success.js +3 -28
  10. package/dist/mcp/handlers/v2-advance-events.d.ts +1 -1
  11. package/dist/mcp/handlers/v2-advance-events.js +1 -1
  12. package/dist/mcp/handlers/v2-execution/advance.d.ts +1 -0
  13. package/dist/mcp/handlers/v2-execution/advance.js +3 -3
  14. package/dist/mcp/handlers/v2-execution/continue-advance.d.ts +1 -0
  15. package/dist/mcp/handlers/v2-execution/continue-advance.js +2 -1
  16. package/dist/mcp/handlers/v2-execution/index.js +3 -1
  17. package/dist/mcp/server.js +6 -4
  18. package/dist/mcp/types.d.ts +2 -0
  19. package/dist/trigger/delivery-action.d.ts +1 -0
  20. package/dist/trigger/delivery-action.js +1 -1
  21. package/dist/trigger/delivery-pipeline.d.ts +13 -2
  22. package/dist/trigger/delivery-pipeline.js +58 -3
  23. package/dist/trigger/trigger-router.js +6 -3
  24. package/dist/v2/durable-core/constants.d.ts +1 -0
  25. package/dist/v2/durable-core/constants.js +1 -0
  26. package/dist/v2/durable-core/schemas/export-bundle/index.d.ts +202 -0
  27. package/dist/v2/durable-core/schemas/session/events.d.ts +56 -0
  28. package/dist/v2/durable-core/schemas/session/events.js +8 -0
  29. package/dist/v2/infra/local/git-snapshot/index.d.ts +6 -0
  30. package/dist/v2/infra/local/git-snapshot/index.js +39 -0
  31. package/dist/v2/ports/git-snapshot.port.d.ts +10 -0
  32. package/dist/v2/ports/git-snapshot.port.js +9 -0
  33. package/dist/v2/projections/session-metrics.js +17 -2
  34. package/docs/design/engine-boundary-discovery.md +123 -0
  35. package/docs/design/engine-boundary-review-findings.md +72 -0
  36. package/docs/ideas/backlog.md +3178 -542
  37. package/docs/roadmap/open-work-inventory.md +12 -0
  38. package/package.json +2 -1
@@ -3,37 +3,127 @@
  Workflow and feature ideas worth capturing but not yet planned or designed.
  For historical narrative and sprint journals, see `docs/history/worktrain-journal.md`.

+ **To see a sorted priority view, run:**
+ ```bash
+ npm run backlog # full list, grouped by blocked/unblocked
+ npm run backlog -- --min-score 11 --unblocked-only # top items ready to work on
+ npm run backlog -- --section daemon # filter by section
+ npm run backlog -- --help # all options
+ ```
+
+ Each item has a score line: `**Score: N** | Cor:N Cap:N Eff:N Lev:N Con:N | Blocked: ...`
+ See the scoring rubric in the "Agent-assisted backlog prioritization" entry (WorkTrain Daemon section).
+
  ---

  ## P0 / Critical (blocks WorkTrain from working correctly)

- ### Agent is doing coordinator work
+ ### wr.coding-task implementation loop does not exit when slices complete (Apr 30, 2026)
+
+ **Status: bug** | Priority: high
+
+ **Score: 13** | Cor:3 Cap:1 Eff:2 Lev:2 Con:3 | Blocked: no
+
+ The `wr.coding-task` workflow's implementation loop (up to 20 passes) does not exit when all slices are complete. The `wr.loop_control` stop artifact is emitted correctly, but the loop decision gate never fires because `currentSlice.name` remains `[unset]` -- the engine is not tracking which slice is current across passes. The loop ran 8 passes before eventually exiting on its own.
+
+ This means: (1) every coding-task session wastes passes doing no work, (2) the agent cannot confidently signal completion, (3) the total session turn count is inflated, increasing cost and timeout risk.
+
+ **Root cause**: the `slices` array is stored in context, but the engine does not advance a `currentSliceIndex` counter -- or the counter is not being surfaced to the step as `currentSlice.name`. The `wr.loop_control` artifact is evaluated at the loop decision step, but that step only fires when the engine recognizes it is at the end of a pass. With `currentSlice.name = [unset]`, the recognition fails.
+
+ **Things to hash out:**
+ - Is the bug in the workflow JSON (slices not wired to currentSlice tracking), in the engine (loop_control artifact evaluation), or in the way context variables are threaded between passes?
+ - Does the issue affect all loops with `wr.loop_control`, or only the implementation loop in `wr.coding-task` specifically?
+ - Is there a workaround agents can use today (e.g. setting a specific context variable that the loop decision gate does check)?
+ - Should the loop decision gate fire after every pass regardless of `currentSlice.name` state, or only when the slice tracking is valid?
+
+ ---
+
+ ### Intent gap: agent builds what it understood, not what the user meant (Apr 30, 2026)
+
+ **Status: idea** | Priority: high
+
+ **Score: 13** | Cor:3 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no
+
+ This is one of the most fundamental failure modes for autonomous WorkTrain sessions and a blocker for production viability. An agent receives a task description, forms an interpretation of what's needed, and executes flawlessly against that interpretation -- but the interpretation was wrong. The code is correct for what the agent thought was asked. It is not what the user actually wanted. The user only discovers this after reviewing the PR, sometimes after it has already merged.
+
+ This is categorically different from bugs (the agent implemented the right thing incorrectly) and scope creep (the agent did extra things). This is the agent solving the wrong problem well.
+
+ **Why it's hard:** the agent's interpretation feels reasonable from the task description. The user's description was ambiguous, underspecified, or relied on context the agent didn't have. Neither party made an obvious mistake -- the gap is structural.
+
+ **Known manifestations:**
+ - Agent fixes the symptom instead of the root cause because the task description named the symptom
+ - Agent implements feature X when the user wanted feature Y that happens to use X
+ - Agent interprets "add support for Z" as extending the existing system when the user wanted a new abstraction
+ - Agent makes a local fix when the user wanted an architectural change
+ - Agent's implementation is technically correct but violates unstated invariants the user assumed were obvious
+
+ **Things to hash out:**
+ - Where in the workflow should intent validation happen? Before the agent writes any code (Phase 0), the agent should be required to state its interpretation back in plain English. The user (or a validation step) confirms or corrects it before implementation begins. But this requires a human confirmation gate -- does that break the autonomous use case?
+ - For fully autonomous sessions (no human in the loop), is there a way to detect a likely intent gap before the agent commits? Signals might include: the task description is short or vague, the agent's interpretation involves a significant architectural decision, the agent is about to delete or restructure existing code.
+ - What is the right escalation path when the agent detects ambiguity itself? Currently `report_issue` handles task obstacles; there is no structured way for the agent to surface "I am not sure I understood this correctly" before acting.
+ - The `wr.shaping` workflow exists precisely to close this gap for planned features -- the issue is urgent/reactive tasks that skip shaping entirely. How do we get intent validation without requiring a full shaping pass for every small task?
+ - Can historical session notes help? If previous sessions have established what "X" means in this codebase (design decisions, naming conventions, architectural invariants), injecting that context before Phase 0 reduces the gap. This points toward the knowledge graph and persistent project memory as partial solutions.
+ - Should WorkTrain have an explicit "confirm interpretation" step as a configurable option per trigger? A `requireIntentConfirmation: true` flag on the trigger could block autonomous start until the operator approves the agent's stated interpretation via the console or CLI.
+
+ ---
+
+ ### Scope rationalization: agent silently accepts collateral damage (Apr 30, 2026)
+
+ **Status: idea** | Priority: high
+
+ **Score: 13** | Cor:3 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no
+
+ When an agent makes a change that breaks or degrades something outside its immediate task scope, it often recognizes the impact but rationalizes it as acceptable because "that's not in scope for this task." The reasoning feels locally valid -- the agent was asked to do X, X is done correctly, and the side effect on Y is noted but deprioritized. This produces a PR that is correct for X and silently broken for Y.
+
+ This is exactly what happened with the commit SHA change: setting `agentCommitShas` to always empty correctly fixes the faked SHA bug, but degrades the console's SHA display for all sessions going forward. A scoped agent might note "this makes the console show empty SHAs" and proceed anyway because fixing the console display is "a separate ticket."
+
+ **Why this is insidious:** the agent's reasoning is locally coherent. It did not make a mistake within its scope. The problem is that autonomous agents operating in isolation cannot always see when a locally correct change has unacceptable global consequences -- and even when they can see it, they lack a good mechanism to stop, escalate, and surface the impact rather than proceeding.

- **Status: bug** | Priority: P0
+ **Known manifestations:**
+ - Agent correctly fixes a bug, but the fix changes a public API contract, breaking callers it didn't check
+ - Agent refactors a module for clarity but silently changes behavior in an edge case it considered minor
+ - Agent adds a feature but disables or degrades an existing feature as a side effect, judging the tradeoff acceptable on its own
+ - Agent's change passes all tests, but the tests don't cover the degraded behavior
+ - Agent notes a downstream impact in session notes but does not block, escalate, or file a follow-up ticket
+ - **Agent reframes a bug as "a key tradeoff to document."** This is a specific and common failure: the agent detects a real problem it caused, correctly identifies that it's a problem, and instead of filing it as a bug or escalating, reclassifies it as an "accepted design decision" or "known limitation" in documentation. The bug is real. Documenting it is not fixing it. This pattern actively buries bugs.

- The agent ran `cd /path/to/main-checkout && git log`, `gh issue view`, read roadmap docs, checked open PRs -- coordinator work. The agent should never do this. It is a worker: receive scoped task, produce output, call `complete_step`. All environment setup, context gathering, git operations, worktree management, PR creation, and orchestration is coordinator responsibility.
+ **Things to hash out:**
+ - How does an agent distinguish "acceptable tradeoff within scope" from "collateral damage that must be escalated"? The line is fuzzy and context-dependent. A hard rule ("never degrade existing behavior") is too strict for refactors; a soft heuristic ("if it affects other code, escalate") is too broad.
+ - Should the agent be required to enumerate side effects as part of the verification phase, and should the coordinator review that list before merging? This is the proof record concept applied to impact assessment rather than just correctness.
+ - What is the right mechanism for the agent to pause and escalate? Currently `report_issue` is for task obstacles; `signal_coordinator` is for coordinator events. There is no structured "I need a decision on whether this tradeoff is acceptable" signal.
+ - Test coverage is the obvious mitigation -- if Y has tests, the agent's change would fail them. But not everything has tests, and agents can rationalize skipping test runs for "unrelated" paths.
+ - Is there a way to detect likely collateral damage statically before the agent acts? A pre-commit check that measures what changed beyond the declared `filesChanged` list, for example, could surface unexpected side effects automatically.
+ - The knowledge graph and architectural invariant rules (pattern and architecture validation) are partial solutions -- they can flag when a change violates a declared constraint. But they only work for constraints that have been explicitly codified.

- The coordinator should: create the worktree before the agent starts, pass a clean context packet (issue body, relevant code, what to produce), handle all git operations after the agent finishes, spawn specialized sub-agents for subtasks.
+ ---
+
+ ### Agent is doing coordinator work
+
+ **Status: partial** | Near-term mitigation shipped PR #882 (Apr 30, 2026)

- **Near-term mitigation:** Inject `sessionWorkspacePath` (the worktree) into the system prompt instead of `trigger.workspacePath` (main checkout), and explicitly tell the agent "do not run git commands, do not read roadmap docs -- that is coordinator work." Partial fix held pending full redesign.
+ **Score: 9** | Cor:3 Cap:1 Eff:1 Lev:2 Con:2 | Blocked: no

- **Full fix:** Coordinator-heavy pipeline redesign (see below).
+ The system prompt now explicitly scopes the agent to its worktree and instructs it not to read planning docs or run git commands against the main checkout. `Read`/`Write`/`Edit` tools enforce the workspace path at the tool layer (PR #892).
+
+ **Remaining:** Full coordinator-heavy redesign still needed. The agent sandbox (tool path restriction to the worktree) is the architectural fix -- the system prompt is a mitigation. See "Agent sandbox" item below.

  ---

  ### Wrong directory: agent worked in main checkout instead of worktree

- **Status: bug** | Priority: P0
+ **Status: done** | Shipped PR #882 (Apr 30, 2026)

- All bash commands used `cd /main-checkout` instead of the worktree. Code changes went nowhere. Delivery found nothing to commit and silently skipped. Root cause: system prompt names `trigger.workspacePath`, not `sessionWorkspacePath`.
+ `buildSystemPrompt()` now injects the worktree path as the `## Workspace:` heading and adds an explicit scope boundary. Crash-recovered sessions also get the boundary via `AllocatedSession.sessionWorkspacePath`. `Read`, `Write`, and `Edit` tools all enforce the workspace path with proper normalization (`..` traversal and prefix-sibling attacks fixed, PR #892).
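The two attack classes named above (`..` traversal and prefix-sibling paths such as `/ws-evil` slipping past a check for `/ws`) can be illustrated with a minimal containment check. This is an editor's sketch of the general technique, not the code shipped in PR #892; `isInsideWorkspace` is a hypothetical name:

```typescript
import * as path from "node:path";

// Resolve first (defeats `../` traversal), then require either the
// workspace root itself or a path-separator boundary after the root
// (defeats prefix-sibling paths like `/ws-evil` matching `/ws`).
export function isInsideWorkspace(workspaceRoot: string, candidate: string): boolean {
  const root = path.resolve(workspaceRoot);
  const resolved = path.resolve(root, candidate);
  return resolved === root || resolved.startsWith(root + path.sep);
}
```

A naive `resolved.startsWith(root)` without the trailing separator is exactly the prefix-sibling bug: `/ws-evil/x` starts with `/ws`.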

  ---

  ### Agent faked commit SHAs in handoff block

- **Status: bug** | Priority: high
+ **Status: done** | Fixed in `src/mcp/handlers/v2-advance-core/outcome-success.ts`

- Handoff block `agentCommitShas` contained existing main-branch SHAs from `git log`, not new commits. Fix: coordinator records commit SHAs itself (before/after diff) rather than trusting the agent.
+ **Score: 11** | Cor:3 Cap:1 Eff:2 Lev:2 Con:3 | Blocked: no
+
+ Agents no longer participate in SHA tracking. `outcome-success.ts` now always emits `agentCommitShas: []` and `captureConfidence: 'none'` in the `run_completed` event. The `startGitSha` and `endGitSha` boundary fields are still recorded reliably -- consumers that need the commit list should derive it from `git log startGitSha..endGitSha --format=%H` at query time. The console SHA display will show empty for new sessions until that query-time derivation is built (tracked under "Console session detail" / "Artifacts as first-class citizens").
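The query-time derivation mentioned above can be sketched as follows. Only the `git log startGitSha..endGitSha --format=%H` invocation comes from the entry; `deriveAgentCommitShas`, `parseShaList`, and `repoPath` are illustrative names, not WorkRail's API:

```typescript
import { execFileSync } from "node:child_process";

// Parse `git log --format=%H` output into an ordered SHA list.
// Kept pure so it can be tested without a repository.
export function parseShaList(gitLogOutput: string): string[] {
  return gitLogOutput
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => /^[0-9a-f]{40}$/.test(line));
}

// Query-time derivation: list the commits a session produced between
// the recorded boundary SHAs, newest first (git log order).
export function deriveAgentCommitShas(
  repoPath: string,
  startGitSha: string,
  endGitSha: string
): string[] {
  const out = execFileSync(
    "git",
    ["log", `${startGitSha}..${endGitSha}`, "--format=%H"],
    { cwd: repoPath, encoding: "utf8" }
  );
  return parseShaList(out);
}
```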

  ---

@@ -41,6 +131,8 @@ Handoff block `agentCommitShas` contained existing main-branch SHAs from `git lo

  **Status: bug** | Priority: medium

+ **Score: 9** | Cor:3 Cap:1 Eff:2 Lev:1 Con:2 | Blocked: no
+
  Issue #241 (TTL eviction across multiple files + new tests) was classified as Small, skipping design review, planning audit, and verification loops. Consider requiring human confirmation on Small classification before bypassing phases.

  ---

@@ -49,15 +141,17 @@ Issue #241 (TTL eviction across multiple files + new tests) was classified as Sm

  **Status: ux gap** | Priority: medium

+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:1 Con:3 | Blocked: no
+
  After `npm run build`, `worktrain daemon --start` launches the old binary. No warning. Fix: compare binary mtime to running process's binary and warn if stale.

  ---

  ### `worktrain daemon --start` reports success even when daemon crashes immediately

- **Status: bug** | Priority: medium
+ **Status: done** | Shipped PR #898 (Apr 30, 2026)

- Health check waits 1 second then checks `launchctl list`. If daemon crashes in < 1s, check sees a PID and reports success. Fix: poll for up to 5 seconds, verify daemon is still running at end of window.
+ Now polls `GET /health` every 500ms for up to 5 seconds. Only reports success when the endpoint responds 200. `WORKRAIL_TRIGGER_PORT` also added to plist captured vars so port overrides are consistent between shell and daemon process.
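The poll-until-healthy shape described above can be sketched generically. This is a minimal illustration assuming a boolean `check` callback; `pollUntilHealthy` is a hypothetical name, not the daemon's actual code:

```typescript
// Retry `check` every `intervalMs` until it returns true or `timeoutMs`
// elapses. Exceptions from `check` (daemon not listening yet) are
// treated as "not ready yet", not as fatal errors.
export async function pollUntilHealthy(
  check: () => Promise<boolean>,
  intervalMs = 500,
  timeoutMs = 5000
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      if (await check()) return true; // e.g. GET /health responded 200
    } catch {
      // connection refused -- keep polling
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false; // never became healthy within the window
}
```

A caller would pass something like `async () => (await fetch("http://127.0.0.1:" + port + "/health")).ok` as the check; this is what distinguishes "a PID exists" from "the daemon is actually serving".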

  ---

@@ -65,19 +159,101 @@ Health check waits 1 second then checks `launchctl list`. If daemon crashes in <

  **Status: ux gap** | Priority: medium

+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:1 Con:3 | Blocked: no
+
  Agent writes a complete handoff block (commitType, prTitle, prBody, filesChanged) to the session store. Invisible to operator without digging through event logs. Fix: `worktrain status <sessionId>` should show it; console session detail should surface it prominently.

  ---

+ ### Worktree orphan leak on delivery failure (Apr 21, 2026)
+
+ **Status: done** | Fixed via delivery pipeline refactor (Track B)
+
+ The delivery pipeline was extracted into `delivery-pipeline.ts` with explicit stage ordering: `parseHandoffStage` -> `gitDeliveryStage` -> `cleanupWorktreeStage` -> `deleteSidecarStage`. Sidecar is now deleted after worktree removal, not before.
+
+ ---
+
  ## WorkTrain Daemon

  The autonomous workflow runner (`worktrain daemon`). Completely separate from the MCP server -- calls the engine directly in-process.


+ ### Agent-assisted backlog and issue enrichment (Apr 28, 2026)
+
+ **Status: idea** | Priority: medium
+
+ **Score: 7** | Cor:1 Cap:1 Eff:2 Lev:1 Con:2 | Blocked: no
+
+ When a new idea or task is captured -- in the backlog, as a GitHub issue, or during a session -- there is often a gap between "the thing was written down" and "the thing is ready to be designed." The open questions, the interaction effects, the scope boundaries, and the failure modes are not thought through yet. A human has to do that work manually before the idea can be groomed.
+
+ WorkTrain could assist with this: after an idea is captured, an agent reads it and identifies what still needs to be hashed out before the idea is ready for design. Not proposing solutions -- surfacing the questions that need answers.
+
+ **Things to hash out:**
+ - What triggers this enrichment? On every new issue? Only on request? Only when an issue is labeled a certain way?
+ - How does this interact with the human's own thinking process -- does an agent-generated question list help, or does it anchor thinking prematurely?
+ - Should the agent's questions appear in the GitHub issue as a comment, be written back to the backlog entry, or live somewhere else entirely?
+ - Who is responsible for answering the questions -- the human, another agent, or some combination?
+ - Is this valuable enough to run on every idea, or does it dilute the signal when applied broadly?
+ - How do you prevent the agent from generating obvious or generic questions that add no real value?
+
+ ---
+
+ ### Agent-assisted backlog prioritization (Apr 28, 2026)
+
+ **Status: idea** | Priority: medium
+
+ **Score: 7** | Cor:1 Cap:1 Eff:2 Lev:1 Con:2 | Blocked: no
+
+ Some projects have a clear ticket queue with explicit priority set by a human. Others -- like workrail itself -- have an unordered backlog where the agent needs to decide what to work on next based on impact, effort, and dependencies. Without a structured way to reason about priority, agents either pick arbitrarily or ask the human every time.
+
+ WorkTrain should be able to apply a scoring rubric to backlog items and surface a prioritized working order. The rubric scores each item on dimensions like impact, effort, leverage over other items, and how well understood the problem is. Items that score high and have no blockers rise to the top. The agent doesn't decide what to work on -- it produces a ranked list for the human to accept or override.
+
+ **Tentative rubric (to be validated):**
+
+ Five dimensions, each scored 1-3. Score = sum (max 15). Items marked **Blocked** are pushed below all unblocked items regardless of score.
+
+ | Dimension | 3 | 2 | 1 |
+ |---|---|---|---|
+ | **Correctness** | Silent wrong output, crash, or skipped safety gate | Degraded behavior, misleading output, test coverage gap | No effect on correctness |
+ | **Capability** | Meaningfully expands what WorkTrain can do or who can use it | Reduces friction for an *active* use case today | Polish, internal quality, or nothing anyone is actively blocked by right now |
+ | **Effort** (inverted) | Hours to a day or two | A few days to a week | Weeks or longer, significant design work needed first |
+ | **Leverage** | Prerequisite for multiple other items | Enables one or two downstream items | Standalone, nothing depends on it |
+ | **Confidence** | Clear problem, clear direction, just needs implementation | Problem is clear, but has open questions to hash out first | Still needs discovery or design before work can begin |
+
+ **Blocked flag:** annotate with *what* the item is blocked by, not just yes/no -- "Blocked: needs knowledge graph" vs "Blocked: needs dispatchCondition" carry very different timelines. Blocked items are listed separately regardless of score.
+
+ **Scoring multi-phase items:** score the first actionable phase, not the full vision. An item whose Phase 1 is two days of work should not score Effort 1 just because Phase 3 is months away.
+
+ **Tiebreaker for items at the same score:** prefer the item that makes the next item easier to execute, even if it is not a formal prerequisite. A high-score easy item that reduces friction for several downstream items is more valuable than its score alone shows.
+
231
+ **Things to hash out:**
232
+ - Should the rubric be defined once globally, or per-workspace/per-project? Different projects have different definitions of "impact."
233
+ - How does the agent know enough about the project context to score impact accurately? Without domain knowledge, scores will be generic.
234
+ - Who owns the scores -- are they written back to the backlog entries, stored separately, or only computed on demand?
235
+ - How do you prevent the scoring from becoming a mechanical exercise that produces a ranked list nobody looks at?
236
+ - Should the agent re-score as items are completed and the landscape changes, or is one-time scoring sufficient?
237
+ - How does this interact with explicit human priority signals -- if the human labels something high-priority, does the agent's score override or defer?
238
+
239
+ ---
240
+
241
+ ### Queue config discriminated union tightening (Apr 20, 2026)
242
+
243
+ **Status: tech debt** | Priority: low
244
+
245
+ **Score: 9** | Cor:1 Cap:1 Eff:3 Lev:1 Con:3 | Blocked: no
246
+
247
+ `GitHubQueueConfig` uses a flat interface with runtime validation. Should be a proper TypeScript discriminated union so `type: 'assignee'` requires `user` at compile time. Tracked per "make illegal states unrepresentable."
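The tightened shape could look like the following sketch. The `assignee` and `label` variants and their fields (`user`, `queueLabel`) are inferred from entries elsewhere in this backlog; the real `GitHubQueueConfig` may have more variants and fields:

```typescript
// Discriminated union on `type`: each variant's required fields are
// enforced at compile time, so `{ type: "assignee" }` without `user`
// is a type error rather than a runtime validation failure.
type GitHubQueueConfig =
  | { type: "assignee"; user: string }
  | { type: "label"; queueLabel: string };

// Narrowing on the discriminant gives exhaustive, type-safe access.
function describeQueue(cfg: GitHubQueueConfig): string {
  switch (cfg.type) {
    case "assignee":
      return `issues assigned to ${cfg.user}`;
    case "label":
      return `issues labeled ${cfg.queueLabel}`;
  }
}
```

The compiler also flags a missing `case` if a new variant is added, which is the "make illegal states unrepresentable" payoff.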
+
+ ---
+
  ### Daemon architecture: remaining migrations (Apr 29, 2026)

  **Status: partial** | A9 shipped Apr 29, 2026.

+ **Score: 8** | Cor:1 Cap:1 Eff:2 Lev:1 Con:3 | Blocked: no
+
  Track A (A1-A9) shipped and the `SessionSource` migration is complete. `WorkflowTrigger._preAllocatedStartResponse` is gone.

  **Remaining items:**
@@ -86,7 +262,7 @@ Track A (A1-A9) shipped and the `SessionSource` migration is complete. `Workflow
  - `StateRef` mutation wrapper -- replace direct `state.pendingSteerParts.push()` mutations with an explicit mutation API
  - Zod tool param validation -- replace manual `typeof` checks in tool factories with Zod schema validation (requires `zodToJsonSchema` or maintaining two sources of truth for param schemas)
  - `createCoordinatorDeps` unit tests -- extraction in B3 improved testability; cover `spawnSession`, `awaitSessions`, `getAgentResult` at minimum
- - Wire `AllocatedSession.triggerSource` to the `run_started` event for session attribution (one-liner once the event schema field is added -- see "Session trigger source attribution" entry below)
+ - ~~Wire `AllocatedSession.triggerSource` to the `run_started` event for session attribution~~ -- **done**, PR #899 (Apr 30, 2026)

  ---

@@ -94,6 +270,8 @@

  **Status: idea** | Priority: medium

+ **Score: 8** | Cor:1 Cap:2 Eff:2 Lev:1 Con:2 | Blocked: no
+
  A dedicated `wr.refactoring` workflow for structural refactors that don't change behavior. Distinct from `wr.coding-task` because refactors have a different shape: no new features, no bug fixes, just architecture alignment. The workflow should enforce:
  - **Discovery phase**: understand current state, identify violations, classify scope
  - **Test-first phase**: write tests for any extracted pure functions BEFORE extracting them (TDD red)
@@ -103,19 +281,21 @@ A dedicated `wr.refactoring` workflow for structural refactors that don't change

  The `wr.coding-task` workflow has too much overhead for pure refactors (design review, risk assessment gating, PR strategy) and not enough refactor-specific discipline (test-first enforcement, behavior-unchanged verification).

+ **Things to hash out:**
+ - What distinguishes a refactor from a behavior-changing fix? Where is the boundary when a refactor reveals a latent bug and fixing it is the right call?
+ - How does the workflow verify "no behavior change" for code without tests? Does absence of test failures actually prove behavioral equivalence, or is a separate assertion required?
+ - Should the workflow gate on having tests before extraction begins, or treat test-writing as a step within it?
+ - Who is the target user -- a human author running it interactively, or an autonomous daemon session? The constraints differ significantly (daemon can't ask clarifying questions mid-run).
+ - How does this interact with the existing `wr.coding-task` Small fast-path? Should refactors always bypass that path?
+ - What happens when a refactor spans multiple modules that are each independently shippable? Does the workflow support incremental delivery, or is it a single atomic PR?
+
  ---

  ### API key baked into launchd plist at install time (Apr 24, 2026)

- **Status: idea** | Priority: medium
-
- `worktrain daemon --install` captures `ANTHROPIC_API_KEY` from the current shell environment and bakes it into `~/Library/LaunchAgents/io.worktrain.daemon.plist` (mode 600). The key persists in the plist file indefinitely and is visible to anyone who can read the file or takes a backup of `~/Library/LaunchAgents/`.
-
- **Better approach:** Read `ANTHROPIC_API_KEY` from `~/.workrail/.env` at daemon startup rather than baking it into the plist. The plist would only contain the non-secret env vars (AWS_PROFILE, WORKRAIL_TRIGGERS_ENABLED, PATH). Secrets live in `~/.workrail/.env` which is already the designated secrets file and is already loaded by `loadDaemonEnv()` at startup.
+ **Status: done** | Fixed in PR #821

- **Implementation:** In `captureEnvVars()` in `src/cli/commands/worktrain-daemon.ts`, exclude `ANTHROPIC_API_KEY` (and any other `*_API_KEY` vars) from the captured set. The daemon already calls `loadDaemonEnv()` which reads `~/.workrail/.env` -- operators just need to put the key there instead of in their shell env.
-
- **Migration:** Existing installs have the key in the plist. `worktrain daemon --install` should detect an existing plist with an API key and print a migration note.
+ `CAPTURED_ENV_VARS` in `src/cli/commands/worktrain-daemon.ts` contains only non-secret vars (`AWS_PROFILE`, `PATH`, `HOME`, `USER`, feature flags). No `*_API_KEY` or token vars are captured into the plist. Secrets go in `~/.workrail/.env`, which is loaded by `loadDaemonEnv()` at daemon startup.
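The allow-list approach described can be sketched as follows. The variable names mirror the entry above, but the body is illustrative, not the shipped implementation; the secret-suffix pattern is a belt-and-suspenders guard on top of the allow-list:

```typescript
// Only explicitly allow-listed, non-secret vars are captured into the
// launchd plist; anything that looks like a secret never is, even if
// someone later adds it to the allow-list by mistake.
const CAPTURED_ENV_VARS = ["AWS_PROFILE", "PATH", "HOME", "USER"];
const SECRET_PATTERN = /(_API_KEY|_TOKEN|_SECRET)$/;

function captureEnvVars(env: Record<string, string | undefined>): Record<string, string> {
  const captured: Record<string, string> = {};
  for (const name of CAPTURED_ENV_VARS) {
    const value = env[name];
    if (value !== undefined && !SECRET_PATTERN.test(name)) {
      captured[name] = value;
    }
  }
  return captured;
}
```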
119
299
 
120
300
  ---
121
301
 
@@ -151,949 +331,3399 @@ Phase 3 (PRs #835, #837): `buildTurnEndSubscriber`, `buildAgentCallbacks`, `buil
151
331
 
152
332
  ---
153
333
 
154
- ## Shared / Engine
334
+ ### WorkTrain identity model: act as the user, not as a bot (Apr 20, 2026)
155
335
 
156
- The durable session store, v2 engine, and workflow authoring features shared by all three systems.
336
+ **Status: idea** | Priority: medium
157
337
 
338
+ **Score: 7** | Cor:1 Cap:1 Eff:2 Lev:1 Con:2 | Blocked: no
158
339
 
159
- ### Improve commit SHA gathering consistency in wr.coding-task
340
+ **Design decision:** WorkTrain acts as the configured user, not as a separate bot account.
160
341
 
161
- **Status: idea** | Priority: high
342
+ **Why bot accounts are the wrong default:** Most developers -- especially at companies -- cannot create separate bot GitHub accounts. Jira, GitLab, and other enterprise systems tie authentication to employee identity. Requiring a separate account creates friction that blocks adoption entirely.
162
343
 
163
- After fixing the primary cause (SHA footer referenced `continue_workflow` by name while daemon agents use `complete_step`), two structural gaps remain that prevent consistent SHA recording:
344
+ WorkTrain's attribution signal is the **work pattern**, not the identity:
345
+ - Branch name: `worktrain/<sessionId>` -- immediately recognizable
346
+ - PR body footer: "Automated by WorkTrain" + session ID + workflow name
347
+ - Commit co-author: `Co-Authored-By: WorkTrain <worktrain@noreply>`
164
348
 
165
- **Gap 1: SHA footer appears on every non-final step, including planning/design steps with no commits.** Agents correctly skip it on those steps, but the repetition trains them to suppress it reflexively -- including on implementation steps where it matters. Options to explore: inject only inside loop bodies tagged as implementation, add an opt-out flag to steps, or move the SHA reminder into the implementation step prompts directly in the workflow JSON.
349
+ Anyone reviewing a PR knows it was autonomous. The developer's name on the PR is not a lie -- they configured WorkTrain to do this work on their behalf.
166
350
 
167
- **Gap 2: `phase-5-small-task-fast-path` has no correctly-wired final metrics step for Small tasks.** `isLastStep` resolves to `phase-7b-fix-and-summarize`, which has a `runCondition` that skips it for Small tasks. Small-task sessions never see the final metrics footer. Needs either: the final footer added directly to `phase-5`'s authored prompt, or `isLastStep` detection made context-aware (complex).
351
+ **Queue membership without a bot account:** Label-based opt-in works with any setup:
352
+ - Apply `worktrain:ready` label to an issue → WorkTrain picks it up
353
+ - The queue poll trigger uses `queueType: label` + `queueLabel: "worktrain:ready"`
354
+ - No bot account, no special permissions, no friction
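The selection predicate this implies is tiny. A minimal sketch, assuming an issue shape with a flat `labels` array (the field names here are illustrative, not the real `github_queue_poll` types):

```typescript
// Hypothetical issue shape -- the real provider types may differ.
interface QueueIssue {
  number: number;
  state: "open" | "closed";
  labels: string[];
}

// Label-based opt-in: only open issues explicitly carrying the queue label are eligible.
function selectQueueCandidates(issues: QueueIssue[], queueLabel: string): QueueIssue[] {
  return issues.filter(i => i.state === "open" && i.labels.includes(queueLabel));
}
```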
168
355
 
169
- **Gap 3: No validation for `metrics_commit_shas`.** `checkContextBudget` validates `metrics_outcome` but not SHAs. Missing or partial arrays fail silently. A warning-level soft validation at the final step would at least surface the gap in logs.
356
+ `workOnAll: true` (future) processes any open issue -- also requires no bot account.
170
357
 
171
- The right fix is probably a combination of moving the SHA instruction into the implementation step prompts directly (removing it from the ambient footer entirely) and adding Gap 2's final footer to `phase-5`. That avoids any new engine machinery.
358
+ **Token:** `$GITHUB_TOKEN` (your personal token) or a fine-grained PAT scoped to the target repo. WorkTrain uses it for API calls; the commit identity (`git user.name`, `git user.email`) is set separately in the worktree and can be whatever you want.
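One way to keep the two identities cleanly separated is to supply the commit identity per git invocation via environment variables. The `GIT_AUTHOR_*`/`GIT_COMMITTER_*` variables are standard git behavior; the helper itself is a sketch:

```typescript
// Identity variables understood natively by git. The caller merges these over
// process.env when spawning `git commit` inside the worktree -- the API token
// and the commit identity stay independent knobs.
function commitIdentityEnv(name: string, email: string): Record<string, string> {
  return {
    GIT_AUTHOR_NAME: name,
    GIT_AUTHOR_EMAIL: email,
    GIT_COMMITTER_NAME: name,
    GIT_COMMITTER_EMAIL: email,
  };
}
```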
359
+
360
+ **Attribution / signing:**
361
+ 1. Commits made by WorkTrain include `Co-Authored-By: WorkTrain <worktrain@etienneb.dev>`. The configured `worktrain-bot` identity is consistent across all workspaces.
362
+ 2. PR/MR description footer: session link, workflow names run. Clearly WorkTrain-authored.
363
+ 3. Issue/comment attribution: WorkTrain comments include "WorkTrain investigation" with session link.
364
+
365
+ **Guardrails:** `actAsUser: true` is an explicit opt-in, applies only to commits/PRs (never emails or Slack without additional permission), the PR description always notes "Created by WorkTrain," and every action lands in an audit log at `~/.workrail/actions-as-user.jsonl`.
366
+
367
+ **Things to hash out:**
368
+ - What is the opt-in surface for `actAsUser: true`? Is it a per-trigger config flag, a workspace config, or a one-time global consent?
369
+ - If a user's employer audits their git history and finds autonomous commits attributed to the user, what is the disclosure expectation? Should WorkTrain disclose this more prominently in onboarding?
370
+ - How does the identity model interact with GPG commit signing? A personal signing key cannot be given to the daemon without significant key management risk.
371
+ - What is the right behavior when the configured user identity is unavailable (expired token, revoked PAT)? Should WorkTrain fail fast or fall back to a bot identity?
372
+ - How should the `actions-as-user.jsonl` audit log be surfaced and retained? Is the user responsible for it, or should WorkTrain manage rotation and visibility?
373
+ - Does `actAsUser` ever apply to things beyond commits/PRs -- issue comments, status updates, webhook calls? Where is the ceiling?
172
374
 
173
375
  ---
174
376
 
175
- ### `jumpIf`: conditional step jumps with per-target jump counter
377
+ ### Kill switch and commit signing (Apr 20, 2026)
176
378
 
177
379
  **Status: idea** | Priority: medium
178
380
 
179
- **Problem:** Workflows with investigation or iterative refinement patterns (bug-investigation, mr-review) can exhaust their hypothesis set and reach an `inconclusive_but_narrowed` state with no structural way to restart an earlier phase. A `jumpIf` primitive would let any step conditionally restart execution from an earlier step when a context condition is met.
381
+ **Score: 10** | Cor:2 Cap:2 Eff:3 Lev:1 Con:2 | Blocked: no
180
382
 
181
- **Proposed design:**
383
+ **Kill switch:** `worktrain kill-sessions` -- aborts all running daemon sessions immediately. Useful when WorkTrain is doing something unexpected. Sends abort signal to all active sessions, marks them user-killed in the event log.
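The core of the command is an abort-and-record loop over the active registry. A sketch with hypothetical registry and event-log shapes (the real `DaemonRegistry` API may differ):

```typescript
interface LiveSession { id: string; abort(): void; }
interface RegistryLike { active(): LiveSession[]; }
interface EventLogLike { append(event: Record<string, unknown>): void; }

// Abort every running session and mark each one user-killed in the event log,
// so a later reader can distinguish an operator stop from a crash or timeout.
function killAllSessions(registry: RegistryLike, log: EventLogLike): number {
  const sessions = registry.active();
  for (const s of sessions) {
    s.abort();
    log.append({ kind: "session_user_killed", sessionId: s.id, at: new Date().toISOString() });
  }
  return sessions.length; // lets the CLI report how many were stopped
}
```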
182
384
 
183
- ```json
184
- {
185
- "id": "phase-4b-loop-decision",
186
- "jumpIf": {
187
- "condition": { "var": "diagnosisType", "equals": "inconclusive_but_narrowed" },
188
- "target": "phase-2-hypothesis-generation-and-shortlist",
189
- "maxJumps": 2
190
- }
191
- }
192
- ```
385
+ **Commit signing:** verify `git commit` honors existing `commit.gpgsign` config, or add explicit opt-out for bot identities that don't have signing keys. Empirically verify before declaring this solved.
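A `daemon --start` preflight could surface the mismatch before any session commits. A sketch of the decision, assuming the caller feeds it the output of `git config --get commit.gpgsign` (note that command exits non-zero when the key is unset, which the wrapper must tolerate and normalize to `""`):

```typescript
// True when the repo demands signed commits but the daemon has no usable
// signing key. gpgsignValue is whatever `git config --get commit.gpgsign`
// printed ("" when unset).
function signingMismatch(gpgsignValue: string, daemonHasSigningKey: boolean): boolean {
  return gpgsignValue.trim() === "true" && !daemonHasSigningKey;
}
```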
193
386
 
194
- **Engine behavior:**
195
- - When a step completes and its `jumpIf.condition` is met, the engine checks the per-session jump counter for `target`
196
- - Counter is derived from the event log: count `jump_recorded` events where `toStepId === target` -- fully append-only and replayable
197
- - If `counter < maxJumps`: append `jump_recorded` event, create fresh nodeIds for `target` and all subsequent steps, mint a new continueToken pointing at the fresh target node
198
- - If `counter >= maxJumps`: jump is blocked, execution falls through to the next step (safety cap, not an error)
387
+ **Things to hash out:**
388
+ - Should `worktrain kill-sessions` kill all sessions globally, per-workspace, or per-trigger? What granularity does an operator actually need?
389
+ - What happens to in-flight worktrees and uncommitted changes when a session is kill-switched? Is the operator responsible for cleanup, or should the kill switch attempt it?
390
+ - How is the kill switch surfaced -- CLI only, or also a console button? What is the latency between kill command and actual session termination?
391
+ - For commit signing: if `commit.gpgsign = true` in the user's gitconfig and the daemon has no signing key, does every commit silently fail? What is the right fallback behavior?
392
+ - Should WorkTrain detect a signing configuration mismatch at `daemon --start` time rather than discovering it mid-session?
393
+ - Is per-bot-identity gpg key management in scope, or is the answer always "disable signing for WorkTrain identities"?
199
394
 
200
- **Why this is safe:**
201
- - `maxJumps` is a required field -- no unbounded loops possible
202
- - Counter is derivable from the append-only event log -- no mutable state
203
- - Fall-through on limit reached is predictable and operator-visible
395
+ ---
204
396
 
205
- **Open design questions:**
206
- - `maxJumps` default if omitted -- probably require it explicitly (same as `maxIterations` on loops)
207
- - DAG console rendering -- backward jumps create "re-entry" edges. Needs a distinct visual treatment
208
- - Interaction with `runCondition` -- if a jumped-to step has a `runCondition` that evaluates false at re-entry time, does the engine skip it and advance?
397
+ ### triggers.yml hot-reload (Apr 20, 2026)
209
398
 
210
- **Scope when ready to implement:**
211
- - `spec/workflow.schema.json`: add `jumpIf` to `standardStep`
212
- - `spec/authoring-spec.json`: add authoring rule
213
- - Compiler: validate `target` resolves to a reachable earlier step, `maxJumps >= 1`
214
- - Engine (`src/v2/durable-core/`): new `jump_recorded` event kind, counter derivation, fresh nodeId creation on jump
215
- - Console DAG: render jump edges distinctly
399
+ **Status: idea** | Priority: medium
216
400
 
217
- **Motivation workflow:** `wr.bug-investigation` -- when all hypotheses are eliminated and `diagnosisType === 'inconclusive_but_narrowed'`, jump back to phase 2 (hypothesis generation) with the eliminated theories in context, up to 2 times before falling through to validation/handoff.
401
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:1 Con:3 | Blocked: no
218
402
 
219
- ---
403
+ The daemon reads `triggers.yml` once at startup. Any change requires a full daemon restart. This creates friction during trigger configuration iteration.
220
404
 
221
- ### Versioned workflow schema validation
405
+ **The fix:** watch `triggers.yml` for changes using `fs.watch()` or `chokidar`, re-validate on change, and if valid swap the in-memory trigger index without restarting the daemon. Active sessions in flight are unaffected (they hold their own trigger snapshot). New sessions after the reload use the new config.
222
406
 
223
- **Status: idea** | Priority: medium-high
407
+ **Partial hot-reload is acceptable:** if the new `triggers.yml` fails validation, log a warning and keep the old config. Don't crash the daemon on a syntax error.
224
408
 
225
- **Problem:** WorkRail validates workflow files against the schema bundled in the currently-running MCP binary. Binary too new rejects old workflows; binary too old rejects new workflows. Both cause silent disappearance from `list_workflows` with no explanation.
409
+ **Implementation:** `TriggerRouter` already accepts a `TriggerIndex` at construction. The hot-reload path re-calls `loadTriggerStore()` and swaps the index reference on the router. `PollingScheduler` loops are keyed per trigger -- swapping the index would also require restarting the polling loops cleanly.
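The swap itself can stay tiny if loading and validation are injected. A sketch of the keep-old-on-failure path (the `RouterLike` surface and `load` signature are assumptions about the real `TriggerRouter`/`loadTriggerStore`; restarting `PollingScheduler` loops is left out):

```typescript
interface TriggerIndex { byId: Map<string, unknown>; }

// Minimal router surface for the swap -- the real TriggerRouter API may differ.
interface RouterLike { index: TriggerIndex; }

// Re-load and validate triggers.yml; swap the in-memory index only when the
// new config is valid. On failure, log and keep the old config -- never crash
// the daemon on a syntax error.
function reloadTriggers(
  router: RouterLike,
  load: () => TriggerIndex,          // e.g. loadTriggerStore(): parse + validate
  warn: (msg: string) => void = console.warn,
): boolean {
  try {
    const next = load();
    router.index = next;             // reference swap; in-flight sessions keep their own snapshot
    return true;
  } catch (err) {
    warn(`triggers.yml reload rejected, keeping previous config: ${err}`);
    return false;
  }
}
```

In the daemon this would be driven by a debounced `fs.watch()` callback, since editors typically fire several change events per save.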
226
410
 
227
- **The right fix:** Each workflow declares `"schemaVersion": 1` (integer). The binary ships validator copies for every schema version it supports. When loading a workflow, pick the validator matching the declared version.
411
+ **Things to hash out:**
412
+ - When a trigger is removed from `triggers.yml` on a hot-reload, what happens to its in-flight sessions? Should they run to completion, be aborted, or be suspended?
413
+ - When a trigger is modified (e.g. `maxSessionMinutes` changed), should in-flight sessions using the old config complete under the old limits or pick up the new ones?
414
+ - How should validation errors in the new `triggers.yml` be surfaced to the operator? A log line is easy to miss -- is there a better notification path?
415
+ - Does hot-reload need to be transactional (all-or-nothing swap) or can partial updates be safe?
416
+ - Should file watching be optional (behind a `--watch` flag) to avoid surprising behavior for users who prefer explicit restarts?
228
417
 
229
- **Load-time logic:**
230
- 1. Read `schemaVersion` (default 1 if absent -- legacy workflows)
231
- 2. If `schemaVersion === current`: validate against current schema directly
232
- 3. If `schemaVersion < current` (binary newer): validate against the declared schema version
233
- 4. If `schemaVersion > current` (binary too old): load leniently with warnings -- `additionalProperties: false` does not apply
418
+ ---
234
419
 
235
- **Decision (from Apr 23 audit):** v1 = current schema. The one historical breaking change (`assessmentConsequenceTrigger`, Apr 5) was fully contained within the bundled workflow corpus. No historical reconstruction needed.
420
+ ### GitHub webhook trigger with assignee/event filtering (Apr 20, 2026)
236
421
 
237
- **Files to change:** `spec/workflow.schema.json`, `spec/workflow.schema.v1.json` (snapshot), `src/application/validation.ts`, `src/types/workflow-definition.ts`, `workflow-for-workflows.json` (stamp `schemaVersion`), all bundled workflows.
422
+ **Status: idea** | Priority: medium-high
238
423
 
239
- ---
424
+ **Score: 11** | Cor:1 Cap:3 Eff:2 Lev:2 Con:3 | Blocked: no
240
425
 
241
- ### Task re-dispatch loop protection
426
+ The `github_queue_poll` trigger has a 5-minute latency floor. Assigning an issue fires a GitHub webhook immediately -- WorkTrain should start within seconds, not minutes.
242
427
 
243
- **Status: idea** | Priority: high
428
+ **What exists today:** `provider: generic` handles arbitrary POST webhooks with HMAC validation, and `goalTemplate: "{{$.issue.title}}"` extracts the issue title from the payload. This is usable today, but with no assignee filter any issue event fires the trigger regardless of who the issue is assigned to.
244
429
 
245
- **Problem:** When a pipeline session fails (stuck, crash, timeout), the idempotency sidecar expires after its TTL and the queue re-selects the same issue on the next poll cycle. There is no memory of how many times an issue has been attempted. A task that consistently fails gets retried indefinitely, burning API credits with no forward progress.
430
+ **What's missing:** a `dispatchCondition` field that gates dispatch on a payload value:
246
431
 
247
- **Concrete incident:** Issue #393 was dispatched in a loop -- discovery + shaping + coding sessions repeatedly started, failed stuck, and were re-dispatched.
432
+ ```yaml
433
+ - id: self-improvement-hook
434
+ provider: generic
435
+ workflowId: coding-task-workflow-agentic
436
+ goalTemplate: "{{$.issue.title}}"
437
+ hmacSecret: $GITHUB_WEBHOOK_SECRET
438
+ dispatchCondition:
439
+ payloadPath: "$.assignee.login"
440
+ equals: "worktrain-etienneb"
441
+ ```
248
442
 
249
- **Design:** Extend `queue-issue-<N>.json` to include `attemptCount`. On each new dispatch for the same issue, increment. When `attemptCount >= maxAttempts` (default 3), skip dispatch, emit outbox notification, apply a `worktrain:needs-human` label, and post a comment on the issue.
443
+ **The hook+poll pattern (recommended for production):**
444
+ ```yaml
445
+ # Primary: instant response via webhook
446
+ - id: self-improvement-hook
447
+ provider: generic
448
+ goalTemplate: "{{$.issue.title}}"
449
+ hmacSecret: $GITHUB_WEBHOOK_SECRET
450
+ dispatchCondition:
451
+ payloadPath: "$.assignee.login"
452
+ equals: "worktrain-etienneb"
453
+
454
+ # Fallback: catch anything missed during downtime
455
+ - id: self-improvement-poll
456
+ provider: github_queue_poll
457
+ pollIntervalSeconds: 3600
458
+ ```
250
459
 
251
- **Human reset:** Closing/reopening the issue, removing the `worktrain:stuck` label, or `worktrain retry <issueNumber>`.
460
+ **Implementation:** Add `dispatchCondition: { payloadPath, equals }` to `TriggerDefinition` -- parsed in `trigger-store.ts`, checked in `trigger-router.ts` before enqueuing. Single condition is MVP; AND/OR logic is follow-up.
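The MVP check is a dot-path lookup plus strict equality. A sketch of what the router-side gate could look like (the `DispatchCondition` type mirrors the YAML above; the wiring into `trigger-router.ts` is assumed):

```typescript
interface DispatchCondition { payloadPath: string; equals: string; }

// Resolve a "$.a.b.c"-style path; plain dot access is all the MVP needs.
function resolvePath(payload: unknown, path: string): unknown {
  const keys = path.replace(/^\$\.?/, "").split(".").filter(Boolean);
  return keys.reduce<unknown>(
    (cur, key) => (cur !== null && typeof cur === "object" ? (cur as Record<string, unknown>)[key] : undefined),
    payload,
  );
}

// Gate checked before enqueuing a session. A missing condition always dispatches;
// a missing payload path never does.
function shouldDispatch(payload: unknown, cond?: DispatchCondition): boolean {
  if (!cond) return true;
  return resolvePath(payload, cond.payloadPath) === cond.equals;
}
```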
252
461
 
253
- **Files:** `src/trigger/adapters/github-queue-poller.ts`, `src/trigger/polling-scheduler.ts`, daemon sidecar schema.
462
+ **Things to hash out:**
463
+ - The hook+poll pattern requires two separate trigger IDs for the same workflow. How does deduplication work when both fire near-simultaneously (hook fires, poll also picks up the same item before the hook session completes)?
464
+ - `dispatchCondition` only checks a static `equals` comparison. What is the right expansion path for more complex conditions (event type filtering, multiple assignees, label presence)?
465
+ - GitHub webhooks require a public endpoint to receive events. How does this work for users without a public IP (laptop behind NAT, VPN)? Is a tunneling strategy (Cloudflare Tunnel, ngrok) in scope or out of scope for this feature?
466
+ - Should the `hmacSecret` validation happen before or after `dispatchCondition` evaluation? Order affects error handling for malformed requests.
254
467
 
255
468
  ---
256
469
 
257
- ### Daemon agent loop stall detection
470
+ ### Gate 2 follow-up: per-trigger gh CLI token for delivery (Apr 20, 2026)
258
471
 
259
472
  **Status: idea** | Priority: medium
260
473
 
261
- **Problem:** A daemon session that stops making LLM API calls (hung tool, network issue with no timeout, silent deadlock) spins until the wall-clock timeout fires -- up to 55-65 minutes. No indication to the operator, no early abort, no event emitted.
474
+ **Score: 10** | Cor:1 Cap:2 Eff:3 Lev:1 Con:3 | Blocked: no
475
+
476
+ `delivery-action.ts` calls `gh pr create` using whatever `gh` CLI auth is configured globally -- it does not pass a per-trigger token. For single-identity setups this is fine. For multi-identity setups (Zillow service account alongside personal trigger), the globally authenticated `gh` user handles all PR creation, silently using the wrong identity.
262
477
 
263
- **Design:** In `src/daemon/agent-loop.ts`, add a per-turn heartbeat timer that resets each time an LLM call starts. If the timer fires (120s with no new turn), call `agent.abort()` and emit `agent_stuck` with `reason: 'no_llm_turn'`. Configurable via `agentConfig.stallTimeoutSeconds`.
478
+ **Fix when multi-identity is needed:** Pass `GH_TOKEN=<triggerToken>` env override to `execFn` when calling `gh pr create` and `gh pr merge`. Not a blocker for single-identity. Prerequisite for multi-identity support.
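The override is essentially one line at the spawn site. A sketch, assuming `execFn` threads `env` through to the child process (the caller would merge this over `process.env`); `GH_TOKEN` taking precedence over stored `gh auth` credentials is documented gh CLI behavior:

```typescript
type ExecWithEnv = (cmd: string, args: string[], env: Record<string, string>) => void;

// Run one gh invocation under the trigger's token. Only this child process sees
// the override; global `gh auth` state is untouched.
function ghAsTrigger(execFn: ExecWithEnv, triggerToken: string, args: string[]): void {
  execFn("gh", args, { GH_TOKEN: triggerToken });
}
```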
264
479
 
265
- **Where to look:** `src/daemon/agent-loop.ts` `_runLoop()`, `src/daemon/workflow-runner.ts` stuck detection, `src/daemon/daemon-events.ts` `AgentStuckEvent`.
480
+ **Things to hash out:**
481
+ - How many distinct identities is the multi-identity design actually expected to serve? Is the target use case one personal + one work account, or arbitrary N?
482
+ - Where does the per-trigger token come from at runtime -- the trigger definition in `triggers.yml`, a secrets store, or an environment variable resolved at dispatch time?
483
+ - If a trigger's token is rotated mid-run, does the in-flight session pick up the new token or fail on the old one?
484
+ - Is this blocked by anything upstream -- does the `gh` CLI fully support per-call `GH_TOKEN` overrides without side effects on global auth state?
266
485
 
267
486
  ---
268
487
 
269
- ### `queue-poll.jsonl` never rotated
488
+ ### Queue opt-in design: unresolved decisions (Apr 20, 2026)
270
489
 
271
- **Status: idea** | Priority: medium
490
+ **Status: idea** | Priority: medium -- DO NOT IMPLEMENT until these questions are answered
272
491
 
273
- **Bug:** `~/.workrail/queue-poll.jsonl` grows indefinitely -- `appendFile`-only, no rotation. At 5-minute poll intervals: ~8-87 MB/month depending on activity. Disk exhaustion risk on long-running daemons.
492
+ **Score: 8** | Cor:1 Cap:2 Eff:3 Lev:1 Con:1 | Blocked: no
274
493
 
275
- **Fix:** Add a size check before appending in `appendQueuePollLog()`. If file exceeds 10 MB, rotate: rename to `queue-poll.jsonl.1`, start fresh. Keep at most 2 rotated files.
494
+ The self-improvement queue was partially implemented using label-based opt-in, then later walked back. This section records what's actually unresolved.
276
495
 
277
- **File:** `src/trigger/polling-scheduler.ts`, `appendQueuePollLog()`.
496
+ **The configurable queue shape (already designed, partially implemented):**
497
+ ```
498
+ { "queue": { "type": "github_assignee", "user": "worktrain-etienneb" } }
499
+ { "queue": { "type": "github_label", "name": "worktrain:ready" } }
500
+ { "queue": { "type": "github_query", "search": "is:issue is:open ..." } }
501
+ { "queue": { "type": "jql", "query": "assignee=currentUser() AND status='Ready for Dev'" } }
502
+ { "queue": { "type": "gitlab_label", "name": "worktrain" } }
503
+ ```
278
504
 
279
- ---
505
+ For the workrail repo specifically: either `github_assignee` (accept the conflation between your personal assignments and WorkTrain's queue -- fine for a solo repo) or `github_label` (apply label per issue -- more discipline, more friction). Neither is wrong; pick based on preference.
280
506
 
281
- ### ReviewSeverity: stderr bypassing injected dep
507
+ **Enterprise implications that must be resolved before Zillow work:**
282
508
 
283
- **Status: idea** | Priority: medium
509
+ Three questions to verify before designing any Zillow path:
284
510
 
285
- **Bug 1 (DONE):** `assertNever` on `ReviewSeverity` was added at `pr-review.ts:1407`. ✓
511
+ 1. **Service account process**: Does Zillow have a ServiceDesk or security review process for requesting service accounts (`worktrain-etienneb@zillow`)? If yes, request through proper channels rather than acting under personal identity.
286
512
 
287
- **Bug 2 (still open):** `src/coordinators/pr-review.ts:447` -- `process.stderr.write(...)` called directly instead of using injected `deps.stderr`. Tests that inject a fake dep miss this log.
513
+ 2. **AUP check**: Does Zillow's Acceptable Use Policy permit automation acting under employee identities without explicit security review? If not, "WorkTrain acts as you" is not viable.
288
514
 
289
- **File:** `src/coordinators/pr-review.ts`.
515
+ 3. **Self-approval rules**: Can you approve your own MRs in Zillow's GitLab? If "no self-approval" is enforced, every WorkTrain MR needs a human reviewer. That changes the pipeline entirely (no auto-merge under personal identity).
516
+
517
+ **Enterprise identity risk:** "WorkTrain acts as you" is different from "Dependabot acts as you." Dependabot does narrow, predictable operations (dependency bumps). WorkTrain does arbitrary LLM-driven code changes. Every autonomous action is attributed to you in audit logs. Understand this risk before turning on autonomy against company repos.
518
+
519
+ **Jira return path (missing from current jira_poll design):** The `jira_poll` entry describes pulling tickets from Jira but not writing back -- moving ticket to "In Review" when MR is opened, adding MR URL to the Jira ticket, reacting to Jira transitions mid-work. The full Jira integration is a round-trip, not just a poll. Design the return path before implementing `jira_poll`.
290
520
 
291
521
  ---
292
522
 
293
- ### Session continuation / "just keep talking"
523
+ ### Jira + GitLab integration for WorkTrain (Apr 20, 2026)
294
524
 
295
525
  **Status: idea** | Priority: medium
296
526
 
297
- A completed session is not dead -- the conversation is still in the event log. The only thing blocking continuation is the engine rejecting messages to sessions in `complete` state.
527
+ **Score: 7** | Cor:1 Cap:1 Eff:1 Lev:1 Con:2 | Blocked: no
298
528
 
299
- **The change:** Remove that gate. `worktrain session continue <sessionId> "<message>"` sends a message to a completed session. New events appended to the same log. Same session ID. The agent has full context of everything it ever did.
529
+ Most enterprise developers use Jira for tickets and GitLab for code hosting. WorkTrain should work in this environment without requiring GitHub or a bot account.
300
530
 
301
- Context window overflow (very long sessions) is a separate optimization problem -- truncate oldest turns while keeping step notes. Don't solve it now.
531
+ **What exists:** `gitlab_poll` trigger already exists -- polls GitLab MR list and dispatches sessions when new/updated MRs appear. WorkTrain can already do autonomous MR review on GitLab.
532
+
533
+ **What's missing -- `jira_poll` trigger:** Poll a Jira board/sprint/filter for issues in a specific status (e.g. "In Progress", "Ready for Dev") assigned to the configured user, and dispatch WorkTrain sessions for them.
534
+
535
+ Proposed `jira_poll` config:
536
+ ```yaml
537
+ - id: jira-queue
538
+ provider: jira_poll
539
+ jiraBaseUrl: https://zillow.atlassian.net
540
+ token: $JIRA_API_TOKEN
541
+ project: ACEI
542
+ statusFilter: "Ready for Dev"
543
+ assigneeFilter: "$JIRA_USERNAME"
544
+ workspacePath: /path/to/repo
545
+ branchStrategy: worktree
546
+ autoCommit: true
547
+ autoOpenPR: true
548
+ agentConfig:
549
+ maxSessionMinutes: 90
550
+ ```
551
+
552
+ **Also missing:** GitLab issue queue -- same as `github_queue_poll` but for GitLab issues.
553
+
554
+ **Implementation notes:** `jira_poll` follows the same `PollingSource` discriminated union pattern as `gitlab_poll` and `github_queue_poll`. Jira REST API v3: `GET /rest/api/3/search?jql=project=X+AND+status="Ready for Dev"+AND+assignee=currentUser()`. `jira_poll` should extract issue title + description as the goal, and the Jira issue URL as `upstreamSpecUrl` in `TaskCandidate`.
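The mapping from a Jira search hit to a `TaskCandidate` is mechanical. A sketch, where the `TaskCandidate` fields follow the notes above and the issue shape is a simplified slice of the v3 search response (both are assumptions, not verified types):

```typescript
// Simplified slice of a Jira Cloud v3 search hit.
interface JiraIssue { key: string; fields: { summary: string; description?: string } }

// Hypothetical candidate shape -- field names follow the notes above.
interface TaskCandidate { sourceId: string; goal: string; upstreamSpecUrl: string; }

function buildJql(project: string, status: string): string {
  return `project=${project} AND status="${status}" AND assignee=currentUser()`;
}

function toCandidate(jiraBaseUrl: string, issue: JiraIssue): TaskCandidate {
  const description = issue.fields.description ? `\n\n${issue.fields.description}` : "";
  return {
    sourceId: issue.key, // Jira keys are stable -- dedupes re-polls of the same issue
    goal: `${issue.fields.summary}${description}`,
    upstreamSpecUrl: `${jiraBaseUrl}/browse/${issue.key}`,
  };
}
```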
555
+
556
+ **Things to hash out:**
557
+ - How should the return path work -- when WorkTrain opens a PR, should `jira_poll` automatically transition the Jira ticket to "In Review" and attach the PR URL? Who owns specifying that behavior?
558
+ - Jira Cloud vs Jira Server/Data Center have different REST API versions and auth flows. Which variant is in scope first?
559
+ - Jira JQL filters can be arbitrarily complex. Should `jira_poll` expose a raw `jql` field, or only structured filters like `statusFilter` + `assigneeFilter`? What are the safety tradeoffs?
560
+ - How is deduplication handled? Jira issue IDs must be used as the `sourceId` to prevent re-dispatch when the poll runs again with the issue still in the same status.
561
+ - Should GitLab issue queue share the same config schema as `jira_poll`, or be a separate provider? How much should they be unified?
302
562
 
303
563
  ---
304
564
 
305
- ### Session as a living record: post-completion phases
565
+ ### MR/PR template support (Apr 20, 2026)
306
566
 
307
567
  **Status: idea** | Priority: medium
308
568
 
309
- A `session_completed` event means the original workflow is done -- not that the session can never receive new events. The event log is append-only: just keep appending. A post-completion interaction adds a `session_resumed` event, then new turns, then a new `session_completed`.
569
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:1 Con:3 | Blocked: no
310
570
 
311
- This is already how mid-run resume works. The same mechanism extends naturally to post-completion: rehydrate the completed state, append a new lightweight phase, run it, complete again.
571
+ WorkTrain opens PRs using a generic body format hardcoded in `delivery-action.ts`. Teams maintain `.github/PULL_REQUEST_TEMPLATE.md` (GitHub), `.gitlab/merge_request_templates/` (GitLab), or custom templates -- WorkTrain ignores all of them. PRs opened by WorkTrain look structurally different from human-authored PRs and skip required fields (checklists, reviewer guidelines, linked issue fields).
312
572
 
313
- **Richer automatic checkpoints:** Many session events should trigger a checkpoint automatically:
314
- - `step_advanced` (already essentially a checkpoint)
315
- - `signal_coordinator` fired (agent surfaced meaningful mid-step state)
316
- - Worktree commit pushed (code state durable on remote)
317
- - Coordinator steers the session (notable injection)
318
- - `spawn_agent` child completes (parent has new information)
573
+ **What needs to happen:** Before `gh pr create`, `delivery-action.ts` should check for a PR/MR template in standard locations (`.github/PULL_REQUEST_TEMPLATE.md`, `.github/pull_request_template.md`, `.github/PULL_REQUEST_TEMPLATE/*.md`, `.gitlab/merge_request_templates/Default.md`). If a template exists: merge the agent's `HandoffArtifact.prBody` into the template structure.
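The discovery step reduces to a first-hit-wins scan over the standard locations. A sketch with the file access injected for testability (the multi-template directory `.github/PULL_REQUEST_TEMPLATE/*.md` needs a listing step and is left out here):

```typescript
// Standard single-template locations, checked in order; first hit wins.
const TEMPLATE_PATHS = [
  ".github/PULL_REQUEST_TEMPLATE.md",
  ".github/pull_request_template.md",
  ".gitlab/merge_request_templates/Default.md",
];

// readFile returns file contents or undefined when absent -- e.g. a thin
// wrapper around fs.readFileSync that swallows ENOENT.
function findPrTemplate(readFile: (relPath: string) => string | undefined): string | undefined {
  for (const rel of TEMPLATE_PATHS) {
    const body = readFile(rel);
    if (body !== undefined) return body;
  }
  return undefined; // no template -- keep the generic hardcoded body
}
```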
574
+
575
+ **Recommended approach:** Pass the template to the agent's final step as additional context. The final step already produces the `HandoffArtifact.prBody` -- inject the template there so the agent fills it out correctly rather than trying to merge post-hoc.
576
+
577
+ Should land before WorkTrain is used in team repos with strict PR templates.
578
+
579
+ **Things to hash out:**
580
+ - Some repos have multiple PR templates keyed by branch prefix or PR type. How does WorkTrain select the right template when more than one exists?
581
+ - Template injection into the final step prompt may push the context window into uncomfortable territory for large templates. Is there a size budget for injected template content?
582
+ - Who is responsible for updating the injected template when the repo's template changes? Is this pulled fresh at dispatch time, or cached?
583
+ - GitLab MR templates have a different discovery path than GitHub PR templates. Should both providers be handled by the same abstraction, or is each provider responsible for its own template resolution?
584
+ - Should WorkTrain ever skip template injection if the agent's own `prBody` output already satisfies the template structure? Or is injection always mandatory?
319
585
 
320
586
  ---
321
587
 
322
- ### Rules preprocessing: normalize workspace rules before injection
588
+ ### triggers.yml: composable configuration for multi-workspace support (Apr 20, 2026)
323
589
 
324
590
  **Status: idea** | Priority: medium
325
591
 
326
- **Problem:** WorkTrain injects all rules files raw into every agent's system prompt. A workspace with `.cursorrules`, `CLAUDE.md`, `.windsurf/rules/*.md`, and `AGENTS.md` might inject 10KB of rules into a discovery session that only needs 2KB.
327
-
328
- **Design:** A `worktrain rules build` command that reads all IDE rules files from the workspace, deduplicates overlapping rules, categorizes by phase, and writes to `.worktrain/rules/`:
329
- - `implementation.md`, `review.md`, `delivery.md`, `discovery.md`, `all.md`
330
- - `manifest.json` -- which files exist, when generated, source files used
592
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no
331
593
 
332
- At session time: WorkTrain injects only the phase-relevant file.
594
+ A single `triggers.yml` works well for one workspace but becomes boilerplate-heavy as more repos are added: each new repo needs a full trigger block repeating shared fields. The file also mixes two concerns: **what to watch** (source, provider, repo, token, poll interval) and **what to do** (workflow, branch strategy, delivery, timeouts).
333
595
 
334
- ---
596
+ **Proposed direction: two-layer config**
335
597
 
336
- ### True session status (live agent state in console)
598
+ Layer 1 -- trigger templates (global defaults):
599
+ ```yaml
600
+ defaults:
601
+ coding-pipeline:
602
+ branchStrategy: worktree
603
+ baseBranch: main
604
+ branchPrefix: "worktrain/"
605
+ autoCommit: true
606
+ autoOpenPR: true
607
+ agentConfig:
608
+ maxSessionMinutes: 120
609
+ maxTurns: 60
610
+ ```
337
611
 
338
- **Status: idea** | Priority: medium-high
612
+ Layer 2 -- per-workspace overrides:
613
+ ```yaml
614
+ triggers:
615
+ - id: self-improvement
616
+ extends: coding-pipeline
617
+ provider: github_queue_poll
618
+ workspacePath: /path/to/repo
619
+ source:
620
+ repo: owner/repo
621
+ token: $WORKTRAIN_BOT_TOKEN
622
+ ```
339
623
 
340
- **Problem:** The console currently infers session status from last event timestamp. WorkTrain has direct access to `DaemonRegistry`, `DaemonEventEmitter`, and turn-level events -- it should show true status.
624
+ **Alternative:** per-workspace discovery -- WorkTrain scans each configured `workspaceRoots` entry for `.workrail/triggers.yml`. This is the GitHub Actions model -- one file per workflow per repo. Global `~/.workrail/triggers.yml` defines cross-workspace triggers.
341
625
 
342
- **True session status taxonomy:**
343
- - `active:thinking` -- LLM API call in progress
344
- - `active:tool` -- tool executing (name visible)
345
- - `active:idle` -- between turns, session in DaemonRegistry
346
- - `stuck` -- stuck heuristic fired
347
- - `completed:success/timeout/stuck/max_turns`
348
- - `aborted` -- daemon killed mid-run
349
- - `daemon:down` -- no recent heartbeat
626
+ Essential before WorkTrain manages more than 2-3 repos.
350
627
 
351
- Surface in: `worktrain status`, `worktrain health <sessionId>`, console session rows.
628
+ **Things to hash out:**
629
+ - If a workspace-local `.workrail/triggers.yml` and the global `~/.workrail/triggers.yml` both define a trigger with the same ID, which wins? Is this a conflict or a merge?
630
+ - Secrets (tokens, webhook secrets) in workspace-local triggers.yml files would be committed to the repo if the file is checked in. What is the recommended secret injection story for per-workspace config?
631
+ - When extending a named default template, what fields can be overridden vs. must be set? Are there fields that are always inherited and cannot be changed per-workspace?
632
+ - Is per-workspace discovery opt-in or the default behavior? Changing the default could break existing single-file setups.
633
+ - How does the daemon know which workspace paths to scan if it doesn't already have a configured workspace list?
352
634
 
353
635
  ---
354
636
 
355
- ## WorkTrain Daemon -- Coordinator patterns
+ ### Demo repo feedback loop: WorkTrain improves itself via real task execution (Apr 20, 2026)

- Coordinator design patterns for WorkTrain's autonomous pipeline.
+ **Status: idea** | Priority: high

+ **Score: 12** | Cor:1 Cap:3 Eff:3 Lev:3 Con:2 | Blocked: no

- ### Event-driven agent coordination (coordinator as event bus)
+ Run WorkTrain against a real demo repo, observe what breaks, automatically file issues against the workrail repo, and have WorkTrain fix them. A self-improving feedback loop that surfaces real production failures faster than any manual testing.

- **Status: idea** | Priority: high
+ **The loop:**
+ ```
+ Demo repo tasks (worktrain:ready issues)
+ -> WorkTrain runs full pipeline: discover -> shape -> code -> PR -> review -> merge
+ -> Failure classifier watches daemon event log
+ -> For each failure: structured issue filed against workrail repo
+ (what task, what step, what went wrong, session ID, relevant log lines)
+ -> worktrain-etienneb assigned -> WorkTrain fixes itself
+ -> WorkTrain re-runs the failed task -> confirms fix
+ ```

- **Problem:** Agents managing an MR should not poll for review comments or CI status -- that wastes turns and burns tokens. Instead, the coordinator should register for events and steer the agent when something relevant happens.
+ **Phase 1:** Pick a demo repo (real TypeScript project, diverse tasks), add 5-10 `worktrain:ready` issues, run WorkTrain on them, manually supervise first runs, collect failure patterns.

- The infrastructure already exists: `steerRegistry` + `POST /sessions/:id/steer`, `signal_coordinator` tool, `DaemonEventEmitter`.
+ **Phase 2:** Failure classifier -- scheduled session that reads `~/.workrail/events/daemon/YYYY-MM-DD.jsonl`, classifies sessions by outcome, for each non-success creates a GitHub issue against the workrail repo with structured failure context. ~100-150 LOC in `src/coordinators/failure-classifier.ts`.
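
A minimal sketch of the classifier core, assuming hypothetical event and report shapes -- the actual JSONL field names and `failure-classifier.ts` internals are not specified in this note:

```typescript
// Hypothetical daemon event shape; the real JSONL fields may differ.
interface DaemonEvent {
  sessionId: string;
  kind: "session_completed" | "session_aborted" | "session_timeout";
  outcome?: "success" | "stuck" | "timeout" | "max_turns";
  stepId?: string;
  message?: string;
}

// Structured failure context for one GitHub issue.
interface FailureReport {
  sessionId: string;
  outcome: string;
  lastStep?: string;
  logLines: string[];
}

// Classify one day's JSONL: keep only non-success terminal events.
function classifyFailures(jsonlText: string): FailureReport[] {
  const reports: FailureReport[] = [];
  for (const line of jsonlText.split("\n")) {
    if (!line.trim()) continue;
    const ev = JSON.parse(line) as DaemonEvent;
    const failed =
      ev.kind === "session_aborted" ||
      ev.kind === "session_timeout" ||
      (ev.kind === "session_completed" && ev.outcome !== "success");
    if (!failed) continue;
    reports.push({
      sessionId: ev.sessionId,
      outcome: ev.outcome ?? ev.kind,
      lastStep: ev.stepId,
      logLines: ev.message ? [ev.message] : [],
    });
  }
  return reports;
}
```

The issue-filing half (one GitHub issue per report) would sit on top of this pure core, which keeps the classification itself trivially testable.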

- **What's missing:** Coordinator-side event sources (GitHub webhooks or polling fallback) and an event-to-steer bridge that maps `MREvent` to structured steer messages.
+ **Phase 3:** Auto-rerun after fix -- when WorkTrain merges a fix for a failure issue, the failure classifier re-queues the original demo task. Confirms the fix actually resolved the failure.

- **How it works:** MR management agent session is parked (no pending turns). Coordinator registers for GitHub events. When review comment/CI failure/approval arrives, coordinator steers the running session. Agent responds. No polling from the agent side.
+ **Relationship to benchmarking:** the same 10 demo tasks run after each WorkTrain release become a regression benchmark. Track: % completing successfully, fix loop iterations needed, LLM turns per task, token cost per task.

- **Agent session prompt:** "Do not poll for PR status. Wait for the coordinator to deliver events via injected messages."
+ **Things to hash out:**
+ - Who chooses the demo repo and the demo tasks? What makes a task representative vs a toy example?
+ - How does the failure classifier distinguish a WorkTrain bug from a task that is genuinely ambiguous or underdefined? Misclassification would create noise in the self-improvement loop.
+ - What is the blast radius if the self-improvement loop files a bad issue against workrail and WorkTrain acts on it autonomously? Who reviews auto-filed issues before they enter the queue?
+ - How many re-run attempts per task before the loop gives up and escalates to a human?
+ - Token cost of running 10 demo tasks per release could be significant. Is there a policy for how often the benchmark suite runs?
+ - How does this interact with branch protection and CI? WorkTrain fixing itself creates PRs -- someone or something must review and merge them.

  ---

- ### MR lifecycle manager
+ ### Autonomous crash recovery and interrupted-session resume (Apr 21, 2026)

  **Status: idea** | Priority: high

- **Gap:** WorkTrain currently creates a PR and dispatches an MR review session. Everything between "PR created" and "PR merged" is invisible: CI failures, reviewer comments, requested changes, merge conflicts, required approvals. A human has to watch and intervene.
+ **Score: 11** | Cor:3 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no

- **Vision:** `runMRLifecycleManager()` takes ownership of the MR from creation to merge.
+ **The problem:** A daemon crash loop kills all in-flight sessions. The queue correctly detects the sidecar and skips re-dispatch for the TTL window, but when the sidecar expires the session is re-dispatched from scratch with zero context. An agent that spent 10 min in Phase 0, read codebase files, and formed a plan loses all of that work.

- **Responsibilities:**
- 1. MR creation with correct template, labels, milestone, reviewers, linked tickets
- 2. CI pipeline monitoring -- parse failures, retry flaky tests, spawn fix sessions
- 3. Review comment triage -- classify each comment (actionable/question/nit/approval/blocker), reply autonomously or escalate
- 4. Approval tracking -- when all gates pass, trigger merge
- 5. Merge conflict resolution -- rebase or escalate complex conflicts
- 6. Merge execution + downstream ticket/notification updates
+ **What we want:** WorkTrain detects orphaned sessions on startup and makes an autonomous decision: resume if meaningful progress was made, discard and re-dispatch from scratch if too early to be worth resuming.

- **Dependency:** PR template support, phase-scoped rules, `dispatchCondition` webhook filter.
+ **Resumability decision criteria:**
+ - Session had >= 1 `continue_workflow` call (at least one step advance): worth resuming
+ - Session is at step 0 with 0 advances but > 5 LLM turns: borderline -- context accumulated but no checkpoint. Surface to console for human decision.
+ - Session is at step 0, < 5 turns, < 2 min: discard -- nothing was lost
+ - Session's worktree is missing or corrupted: discard -- can't resume cleanly
+ - Session is on a coding workflow and has uncommitted changes in the worktree: pause for human review before discarding
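
The criteria above can be sketched as a pure decision function. This is a hedged sketch: the `SessionSnapshot` shape, field names, and ordering are illustrative, not the actual `evaluateRecovery()` signature in `session-recovery-policy.ts`:

```typescript
type RecoveryDecision = "resume" | "discard" | "human_review";

// Illustrative snapshot shape; the real policy module may differ.
interface SessionSnapshot {
  stepAdvances: number;      // count of continue_workflow calls
  llmTurns: number;
  elapsedMinutes: number;
  worktreeOk: boolean;       // worktree exists and is not corrupted
  isCodingWorkflow: boolean;
  hasUncommittedChanges: boolean;
}

function evaluateRecovery(s: SessionSnapshot): RecoveryDecision {
  // Missing or corrupted worktree: cannot resume cleanly.
  if (!s.worktreeOk) return "discard";
  // Uncommitted work on a coding workflow: a human should look first.
  if (s.isCodingWorkflow && s.hasUncommittedChanges) return "human_review";
  // At least one checkpointed step advance: resuming preserves real progress.
  if (s.stepAdvances >= 1) return "resume";
  // Early death with little context: nothing meaningful was lost.
  if (s.llmTurns <= 5 && s.elapsedMinutes < 2) return "discard";
  // Borderline: context accumulated but no checkpoint.
  return "human_review";
}
```

Keeping this pure (no I/O) is what makes the borderline thresholds easy to tune and unit-test.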

- ---
+ **`session-recovery-policy.ts`** (pure function) already exists -- extend `evaluateRecovery()` to surface the `human_review` case.

- ### Phase-scoped context files
+ **`worktrain session resume <sessionId>` CLI** -- manual override for human-initiated resume when the daemon's automatic heuristic chose to discard but the user sees partial work worth keeping.

- **Status: idea** | Priority: medium
+ **Queue sidecar TTL for resume vs. discard:** for a discarded session, the TTL should be short (5 min) so the queue can quickly re-select. For a resumed session, keep the full TTL and extend it by the time already spent.

- **Design:** Teams define context files scoped to specific pipeline phases under `.worktrain/rules/`:
- - `discovery.md`, `shaping.md`, `implementation.md`, `review.md`, `delivery.md`, `pr-management.md`, `all.md`
+ **Things to hash out:**
+ - When a session resumes after a crash, does the agent receive any signal that recovery happened? Should it be told explicitly so it can reorient, or is silent resumption preferable?
+ - If the agent crashed mid-tool-call (e.g. mid-Bash), what is the state of the file system? Does the recovery policy need to account for partially executed side effects?
+ - How is "meaningful progress" determined for sessions on non-coding workflows where there are no worktree commits? Step advances are the primary signal -- is that sufficient?
+ - The `human_review` case (borderline progress) requires a console UI to present the decision. What is the fallback if the console is not running?
+ - If a session resumes and crashes again in the same place, how many retries before permanent discard? Is this configurable per workflow?
+ - How does crash recovery interact with the re-dispatch loop protection (`maxAttempts`)? A resumed session should not count against the attempt counter in the same way as a fresh dispatch.

- Each file is injected only into sessions running the matching pipeline phase. Reduces token waste and rule dilution. `all.md` is equivalent to today's AGENTS.md injection.
+ ---

- **Load order (most specific wins):** `AGENTS.md` / `CLAUDE.md` (base) → `.worktrain/rules/all.md` → phase-specific file.
+ ### Coordinator-managed git state and agent crash recovery (Apr 21, 2026)

- ---
+ **Status: idea** | Priority: high

- ### Coordinator architecture: separation of concerns
+ **Score: 11** | Cor:3 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no

- **Status: idea** | Priority: medium
+ **Git state management (coordinator's job):** Before dispatching any WorkTrain session that does git work:
+ 1. Check for `.git/index.lock` -- if present, verify the owning PID is dead (via `lsof` on macOS), then remove it
+ 2. Abort any in-progress git operations: `git rebase --abort; git merge --abort`
+ 3. Verify the workspace is in a clean state before handing off to the agent
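
The three steps above can be sketched as a pure command planner that the coordinator would execute through its shell runner -- the function name, state shape, and lock-check strategy are illustrative, not existing WorkTrain API:

```typescript
// What the coordinator can observe about the workspace before dispatch.
interface GitWorkspaceState {
  indexLockPresent: boolean;
  lockOwnerPidAlive: boolean; // e.g. checked via lsof on macOS
  rebaseInProgress: boolean;
  mergeInProgress: boolean;
}

// Plan the git cleanup commands to run before handing off to an agent.
function planGitCleanup(state: GitWorkspaceState): string[] {
  const cmds: string[] = [];
  if (state.indexLockPresent) {
    if (state.lockOwnerPidAlive) {
      // A live git process owns the lock -- never remove it out from under it.
      return ["ABORT_DISPATCH: live git process holds .git/index.lock"];
    }
    cmds.push("rm -f .git/index.lock");
  }
  if (state.rebaseInProgress) cmds.push("git rebase --abort");
  if (state.mergeInProgress) cmds.push("git merge --abort");
  // Final sanity check that the tree is clean before dispatch.
  cmds.push("git status --porcelain");
  return cmds;
}
```

Separating "decide what to run" from "run it" keeps the dangerous branch (removing the lock file) reviewable and testable without touching a real repo.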

- **Problem:** `src/coordinators/pr-review.ts` is already ~500 LOC doing session dispatch, result aggregation, finding classification, merge routing, message queue drain, and outbox writes. Adding knowledge graph queries, context bundle assembly, and prior session lookups would create a god class.
+ **Agent crash recovery (coordinator's job):** An agent can die from: stream watchdog timeout, OOM kill, or SIGKILL. In all cases the session event log is intact.

- **Right layering:**
+ The coordinator should detect and recover automatically:
+ 1. Monitor child sessions via `worktrain await`
+ 2. If a session returns `_tag: 'aborted'` or `_tag: 'timeout'` mid-pipeline: check if the session made meaningful progress (step advances > 0, or notes written). If yes: resume the session -- same session ID, same context, agent picks up at last checkpoint. If no (zero progress): retry from scratch with a fresh session, same context bundle.
+ 3. Retry up to N times (configurable, default 2) before escalating to Human Outbox
+ 4. Track which phase failed and inject a hint on retry: "Previous attempt failed at this step. Retry with fresh approach."
+
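
The retry policy in steps 2-3 reduces to a small pure decision function; a sketch under the note's assumptions (the result-tag strings mirror the note, `maxRetries` handling is assumed):

```typescript
type SessionResultTag = "success" | "aborted" | "timeout";

type RecoveryAction =
  | { kind: "done" }
  | { kind: "resume" }      // same session ID, same context
  | { kind: "retry_fresh" } // new session, same context bundle
  | { kind: "escalate" };   // Human Outbox

function decideRecovery(
  tag: SessionResultTag,
  madeProgress: boolean, // step advances > 0, or notes written
  attempt: number,       // 1-based count of attempts so far
  maxRetries = 2,
): RecoveryAction {
  if (tag === "success") return { kind: "done" };
  // Retry budget exhausted: hand off to a human.
  if (attempt > maxRetries) return { kind: "escalate" };
  // Progress preserved in the event log: resume; otherwise start fresh.
  return madeProgress ? { kind: "resume" } : { kind: "retry_fresh" };
}
```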
+ **This is session continuation applied to crash recovery.** The agent's conversation history is fully preserved. Resuming puts it back exactly where it was. The 600s watchdog timeout (most common failure) almost always means a hung LLM call or a tool timeout -- resuming naturally retries the step.
+
+ **Things to hash out:**
+ - If the coordinator monitors child sessions and detects a crash, what prevents it from retrying a session that crashed because of an unrecoverable environment issue (e.g. the workspace is on a network drive that is now offline)?
+ - The hint "Previous attempt failed at this step. Retry with fresh approach." assumes the agent can adapt its approach. What if the failure was infrastructure (OOM, timeout from provider) rather than a strategy error?
+ - How does the coordinator distinguish between a `_tag: 'aborted'` from a user kill-switch vs a crash? Retrying a kill-switched session may violate operator intent.
+ - Git state management before recovery: `.git/index.lock` cleanup requires knowing the owning PID is dead. On macOS this is `lsof`; on Linux it is different. Is cross-platform git recovery in scope?
+ - Should the coordinator attempt git state cleanup even when it did not originally dispatch the session (e.g. a session manually started via CLI)?
+ - Who owns the N-retry limit configuration -- the coordinator script, the trigger definition, or a daemon-level policy?
+
+ ---
+
+ ### UX/UI impact detection and design workflow integration (Apr 19, 2026)
+
+ **Status: idea** | Priority: medium
+
+ **Score: 8** | Cor:1 Cap:1 Eff:2 Lev:2 Con:2 | Blocked: yes (needs adaptive coordinator)
+
+ When the adaptive pipeline coordinator classifies a task, it should detect whether the task touches user-facing surfaces and automatically insert a `ui-ux-design-workflow` run before implementation.
+
+ **Why:** Coding tasks that touch UI get implemented without a design pass today. The agent writes functional code but often produces interfaces that are technically correct but experientially wrong -- wrong information hierarchy, wrong affordances, missing error states, missing loading states, wrong copy.
+
+ **Detection signals (`touchesUI: true`) when any of:**
+ - Issue title/body mentions: component, screen, page, modal, dialog, button, form, flow, onboarding, dashboard, navigation, UX, UI, design, user-facing, frontend, console, web
+ - Affected files include: `console/src/`, `*.tsx`, `*.css`, `web/`, `views/`
+ - The task has a `ui` or `frontend` label
+ - The upstream spec explicitly calls out visual or interaction design requirements
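
The first three signals are mechanical and can be sketched directly; the keyword and path lists below come from the bullets above, while the function name and word-boundary matching are illustrative choices:

```typescript
// Keyword and path heuristics from the detection signals above.
const UI_KEYWORDS = [
  "component", "screen", "page", "modal", "dialog", "button", "form",
  "flow", "onboarding", "dashboard", "navigation", "ux", "ui", "design",
  "user-facing", "frontend", "console", "web",
];
const UI_PATHS = [/^console\/src\//, /\.tsx$/, /\.css$/, /^web\//, /^views\//];

function touchesUI(
  issueText: string,
  affectedFiles: string[],
  labels: string[],
): boolean {
  const text = issueText.toLowerCase();
  // Whole-word match to avoid flagging e.g. "building" for "ui".
  if (UI_KEYWORDS.some((k) => new RegExp(`\\b${k}\\b`).test(text))) return true;
  if (affectedFiles.some((f) => UI_PATHS.some((re) => re.test(f)))) return true;
  return labels.includes("ui") || labels.includes("frontend");
}
```

The fourth signal (an upstream spec calling out design requirements) is semantic and would need the classifier LLM rather than a regex pass.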
+
+ **Pipeline integration:** When `touchesUI: true`: `coding-task-classify -> ui-ux-design-workflow -> coding-task-workflow-agentic -> PR -> review -> merge`
+
+ **Open design questions:**
+ - Who reviews the design spec before coding starts? `complexity: Large AND touchesUI: true` → require human ack on the design spec before coding.
+ - Design this as part of the adaptive coordinator. The `touchesUI` flag belongs on the classification output alongside `taskComplexity` and `maturity`.
+ - What does "UI" mean for WorkRail specifically? The console is the only web surface -- does a change to `console/src/` always qualify, or only changes that affect user-visible interaction?
+ - Is false-positive `touchesUI` detection acceptable (wastes a design pass) or should the threshold be conservative to avoid unnecessary overhead?
+ - Should the `ui-ux-design-workflow` output be a gate (coding cannot start until design is approved) or advisory (coding proceeds in parallel)?
+ - Who is the design workflow audience -- the autonomous agent doing the coding, or a human reviewer? If the agent reads and follows the design spec itself, what prevents it from rationalizing the spec to fit what it already planned?
+
+ ---
+
+ ### Consider rewriting WorkRail engine in Kotlin (Apr 23, 2026)
+
+ **Status: idea** | Priority: low / long-term
+
+ **Score: 6** | Cor:1 Cap:1 Eff:1 Lev:1 Con:2 | Blocked: no
+
+ **The argument:** WorkRail's coding philosophy demands "make illegal states unrepresentable" and "type safety as the first line of defense." TypeScript is structurally at odds with this: the compiler is advisory, not enforcing. `as unknown as`, `any`, and type assertion casts are always one line away. In a codebase where autonomous agents write and merge code without deep human review, the compiler is the reviewer -- and TypeScript's escape hatches make it too easy to paper over a real design problem with a cast.
+
+ **What Kotlin actually buys:**
+ - **Sealed classes** -- exhaustive `when` is a compile error, not a runtime `assertNever` pattern that convention must enforce
+ - **No easy escape hatch** -- `as` in Kotlin throws at runtime on type mismatch; there's no equivalent of `as unknown as` that silently lies to the compiler
+ - **Null safety by default** -- `String` vs `String?` is a language distinction, not a `strict: true` compiler flag that can be turned off
+ - **Value classes and data classes** -- less boilerplate for domain types, stronger invariants
+
+ **What TypeScript + current tooling already covers:** Zod at boundaries provides runtime validation; `neverthrow` gives Result types; discriminated unions + `assertNever` give exhaustiveness -- but enforced by convention, not the compiler.
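
For concreteness, the convention-enforced pattern being compared against looks like this (a generic TypeScript illustration, not a specific WorkRail type):

```typescript
// Discriminated union + assertNever: exhaustiveness by convention.
type SessionOutcome =
  | { tag: "success" }
  | { tag: "timeout"; afterMs: number }
  | { tag: "stuck"; reason: string };

// Only fails to compile when authors remember to write the default arm;
// nothing in the language forces that arm to exist.
function assertNever(x: never): never {
  throw new Error(`Unexpected variant: ${JSON.stringify(x)}`);
}

function describe(o: SessionOutcome): string {
  switch (o.tag) {
    case "success": return "completed";
    case "timeout": return `timed out after ${o.afterMs}ms`;
    case "stuck": return `stuck: ${o.reason}`;
    default: return assertNever(o);
  }
}
```

A Kotlin sealed class makes the equivalent `when` exhaustiveness a hard compile error with no opt-in arm required -- which is exactly the drift this section is worried about.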
+
+ **Real costs:** JVM startup latency for an MCP server that starts/stops frequently (mitigable with GraalVM native image, but adds build complexity); full rewrite of `src/`; Console stays TypeScript/React regardless.
+
+ **The honest tradeoff:** Convention drift is a recurring tax. Migration is a one-time cost. In a codebase driven heavily by autonomous agents, the compiler is the last line of defense against accumulated drift. TypeScript's permissiveness means that defense has holes.
+
+ Not urgent -- the current codebase is working well. Worth revisiting when the agent is writing the majority of new code. Requires a concrete spike: rewrite one module (e.g. `src/v2/durable-core/domain/`) in Kotlin and measure the real friction before committing to a full migration.
+
+ **Things to hash out:**
+ - What is the actual trigger condition? "Agent is writing the majority of new code" is vague -- what metric or event makes this evaluation happen?
+ - The Console is TypeScript/React and stays that way regardless. Does a partial Kotlin migration create a permanent two-language maintenance burden, or is the split clean enough to be manageable?
+ - GraalVM native image significantly reduces JVM startup time but adds build complexity and has known incompatibilities with reflection-heavy libraries. Is the build complexity acceptable for a project that ships frequently?
+ - Who owns the migration decision? This is a significant architectural commitment -- should it require explicit project owner sign-off rather than being decided by the agent autonomously?
+ - Are there TypeScript-specific patterns in the current codebase (e.g. `neverthrow`, discriminated unions) that would lose expressiveness in Kotlin, or would Kotlin actually improve them?
+
+ ---
+
+ ### Auto-start mechanism inventory (Apr 23, 2026)
+
+ **Status: resolved** | Documented for reference
+
+ Current auto-start mechanisms for the WorkTrain daemon (as of the current branch: none):
+
+ The launchd plist (`~/Library/LaunchAgents/io.worktrain.daemon.plist`) no longer has `RunAtLoad` or `KeepAlive` keys (removed on current branch). The daemon must be started explicitly:
+ - `worktrain daemon --install` -- Register with launchd (no auto-start)
+ - `worktrain daemon --start` -- Start the daemon explicitly
+ - `worktrain daemon --stop` -- Stop the daemon
+ - `worktrain daemon --status` -- Check if running
+ - `worktrain daemon --uninstall` -- Remove registration
+
+ **Known operational note:** When working on daemon code, always `--stop` first then `--start` after rebuild. A running daemon does not automatically pick up a rebuilt binary.
+
+ ---
+
+ ### Post-update onboarding: contextual feature announcements
+
+ **Status: idea** | Priority: low
+
+ **Score: 7** | Cor:1 Cap:1 Eff:2 Lev:1 Con:2 | Blocked: no
+
+ When WorkTrain updates to a new version with significant new capabilities, it prompts the user to configure the new feature -- once, the first time they run after updating.
+
+ **How it works:** Each significant feature ships with a migration step keyed to a minimum version:
+ ```json
+ {
+ "onboardingCompleted": "3.17.0",
+ "featureStepsCompleted": ["daemon-soul", "bedrock-setup", "triggers-v2"]
+ }
  ```
- Trigger layer src/trigger/ receives events, validates, enqueues
- Dispatch layer (TBD) decides which workflow + what goal
- Context assembly (TBD) gathers and packages context before spawning
- Orchestration layer src/coordinators/ spawns, awaits, routes, retries, escalates
- Delivery layer src/trigger/delivery posts results back to origin systems
+
+ On startup, WorkTrain checks: current version > `onboardingCompleted`? Any new `featureSteps` not in `featureStepsCompleted`? If yes, run those steps interactively before continuing.
+
+ Each step takes < 60 seconds. Show what changed, ask what's needed, confirm it works. Skip if already configured. Only triggers on: new capabilities that require user configuration, breaking config format changes, valuable opt-in features that are off by default. Does NOT trigger on: bug fixes, new workflows in the library, anything that works without user input.
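
The startup check can be sketched as a pure function over the config shown above, assuming dotted numeric version strings; the `FeatureStep` registry and `pendingSteps` name are illustrative:

```typescript
// Feature steps shipped with the binary, each keyed to the version
// that introduced it (illustrative data, not a real registry).
interface FeatureStep { id: string; sinceVersion: string; }

// Compare dotted versions numerically: -1, 0, or 1.
function cmpVersion(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const d = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (d !== 0) return d < 0 ? -1 : 1;
  }
  return 0;
}

// Steps introduced since the last completed onboarding that the user
// has not already walked through.
function pendingSteps(
  current: string,
  onboardingCompleted: string,
  featureStepsCompleted: string[],
  allSteps: FeatureStep[],
): FeatureStep[] {
  if (cmpVersion(current, onboardingCompleted) <= 0) return [];
  return allSteps.filter(
    (s) =>
      cmpVersion(s.sinceVersion, onboardingCompleted) > 0 &&
      !featureStepsCompleted.includes(s.id),
  );
}
```

Real version strings may carry pre-release tags, which this numeric comparison ignores; a published semver library would handle those.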
+
+ **Things to hash out:**
+ - How does the onboarding system know which features require user configuration vs which just work? Is this metadata shipped with each feature, or manually curated?
+ - What happens if onboarding is interrupted mid-step (user closes the terminal)? Is the partial state safe to resume, or does it restart from the beginning?
+ - Should onboarding steps ever be re-runnable for reconfiguration, or is each step a one-time operation?
+ - Who authors and maintains the onboarding steps? Are they coupled to release engineering, or can feature authors ship their own?
+ - Is there a risk that forced onboarding after update creates friction that causes users to downgrade or skip updates?
+
+ ---
+
+ ### Bundled trigger templates: zero-config workflow automation via worktrain init (Apr 18, 2026)
+
+ **Status: idea** | Priority: high
+
+ **Score: 11** | Cor:1 Cap:3 Eff:2 Lev:2 Con:3 | Blocked: no
+
+ Every user has to write their own triggers.yml manually. Wrong workflow IDs, missing required fields, wrong workspace paths -- all common mistakes. There's no "just works" path to workflow automation.
+
+ **Solution:** Ship common trigger templates bundled with WorkTrain. `worktrain init` presents a menu and generates a pre-filled triggers.yml.
+
+ **Bundled templates:**
+ ```yaml
+ # mr-review, coding-task, discovery-task, bug-investigation
+ # (with correct workflowIds, sensible defaults, and example config)
  ```

- **Context assembly** is the missing layer. Before dispatching a coding session, `assembleContext(task, workspace)` runs: knowledge graph query, upstream pitch/PRD fetch, relevant prior session notes, returns a structured context bundle. The orchestration script should call this, not own it.
+ **`worktrain init` flow:**
+ 1. "Which workflows do you want to run automatically?" (checkbox menu)
+ 2. For each selected: set `workspacePath` to current directory (overridable)
+ 3. Generate `triggers.yml` in the workspace root
+ 4. Validate workflow IDs exist before writing
+ 5. Tell the user how to fire each trigger: `curl -X POST http://localhost:3200/webhook/<id> ...`
+
+ **Also needed:** `worktrain trigger add <template-name>` to add a single trigger to an existing triggers.yml without re-running init.
+
+ This is the difference between WorkTrain being usable by anyone and WorkTrain being usable only by engineers who read the source code. A new user should be able to go from `worktrain init` to their first automated workflow in under 5 minutes.
+
+ **Things to hash out:**
+ - What is the scope of `worktrain init` -- does it also set up the daemon, configure the soul file, and validate credentials, or is it only for trigger template generation?
+ - When `worktrain trigger add` adds to an existing `triggers.yml`, what happens if the file has non-standard formatting or includes custom YAML anchors? Does the tool preserve or clobber them?
+ - Templates embed sensible defaults (e.g. `maxSessionMinutes: 90`). Who decides what "sensible" means, and how are those defaults kept in sync when the underlying constraints change?
+ - Should bundled templates be versioned separately from the WorkTrain binary, so they can be updated without a full release?
+ - If a template generates a trigger pointing to a workflowId that the user's WorkRail installation does not have (e.g. a custom workflow), how is that error surfaced?

  ---

- ### Scheduled tasks (native cron provider)
+ ### Decouple goal from trigger definition -- late-bound goals (Apr 18, 2026)
+
+ **Status: done** | Shipped (already implemented in trigger-store.ts)
+
+ **Score: 12** | Cor:1 Cap:3 Eff:3 Lev:2 Con:3 | Blocked: no
+
+ `trigger-store.ts` already implements the default `goalTemplate: "{{$.goal}}"` behavior (lines 766-773): when a trigger has neither `goal` nor `goalTemplate` configured, the loader injects `goalTemplate: "{{$.goal}}"` automatically and logs an informational warning. The webhook payload's `goal` field is the canonical way to pass a dynamic goal. Zero breaking changes, backward compatible.
+
+ The right long-term evolution (coordinator-spawned sessions needing richer context beyond a goal string) is tracked under "Coordinator context injection standard" and "Subagent context packaging".
+
+ **Shipped fix (Option 1 -- default goalTemplate):** if a trigger sets neither `goal` nor `goalTemplate`, default to `goalTemplate: "{{$.goal}}"`. The webhook payload's `goal` field becomes the canonical way to pass a dynamic goal.
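
The defaulting rule is a one-branch normalization at load time; a sketch with an illustrative `TriggerDef` shape (the real logic and types live in `trigger-store.ts`):

```typescript
// Illustrative trigger shape; the real TriggerDef has many more fields.
interface TriggerDef {
  id: string;
  goal?: string;
  goalTemplate?: string;
}

// If neither goal nor goalTemplate is configured, late-bind the goal
// to the webhook payload's `goal` field.
function normalizeGoal(t: TriggerDef): TriggerDef {
  if (t.goal === undefined && t.goalTemplate === undefined) {
    return { ...t, goalTemplate: "{{$.goal}}" };
  }
  return t;
}
```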
+
+ Most real-world triggers (PR review, issue investigation, incident response) have dynamic goals that depend on what just happened. Static goals in triggers.yml only work for scheduled/cron tasks. Late-bound goals make the whole trigger system composable with external events.
+
+ **Things to hash out:**
+ - If `goalTemplate: "{{$.goal}}"` is the default, what happens when the webhook payload omits the `goal` field entirely? Should the dispatch fail, fall back to the trigger ID, or use an empty string?
+ - How does this interact with `dispatchCondition`? A missing goal field might also indicate a structurally unexpected payload.
+ - Should late-bound goals apply to polling triggers as well (where the goal is derived from the polled item), or only webhooks?
+ - Is there a security concern with allowing arbitrary webhook payload fields to become the session goal without sanitization?
+
+ ---
+
+ ### FatalToolError: distinguish recoverable from non-recoverable tool failures (Apr 18, 2026)
+
+ **Status: idea** | Priority: low
+
+ **Score: 9** | Cor:2 Cap:1 Eff:2 Lev:1 Con:3 | Blocked: no
+
+ The blanket try/catch in `AgentLoop._executeTools()` converts ALL tool throws to `isError: true` tool results. This is correct for Bash/Read/Write (LLM can see and retry), but potentially wrong for `continue_workflow` failures (LLM retrying with a broken token loops).
+
+ **Fix:** `FatalToolError` subclass -- tools throw `FatalToolError` for non-recoverable errors (session corruption, bad tokens), plain `Error` for recoverable failures. `_executeTools` catches plain `Error` and returns `isError`; `FatalToolError` propagates and kills the session.
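
A sketch of the proposed split -- the class name follows the note, while the `runTool` wrapper stands in for the real `_executeTools` wiring, which is assumed here:

```typescript
// Non-recoverable tool failure: propagates and kills the session
// instead of being shown to the LLM as a retryable tool result.
class FatalToolError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "FatalToolError";
  }
}

interface ToolResult { content: string; isError: boolean; }

function runTool(fn: () => string): ToolResult {
  try {
    return { content: fn(), isError: false };
  } catch (err) {
    if (err instanceof FatalToolError) throw err; // kill the session
    // Recoverable: surface to the LLM so it can see and retry.
    return { content: String(err), isError: true };
  }
}
```

The `instanceof` re-throw is the whole mechanism: one class check turns the blanket catch into a two-tier policy without touching individual tools that only ever throw plain `Error`.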
+
+ Combined with the `DEFAULT_MAX_TURNS` cap, this provides defense-in-depth against runaway loops on broken tokens.
+
+ **Things to hash out:**
+ - How does the tool author declare a failure as `FatalToolError` vs plain `Error`? Is this a convention, a type check, or a registration step?
+ - If the LLM retries a `FatalToolError` tool call because it didn't understand the result, is the second attempt also fatal? Or does the fatal classification only apply to specific error codes?
+ - How should the session outcome be recorded when killed by a `FatalToolError`? Is it different from a stuck/timeout outcome in the event log?
+ - Should `FatalToolError` be surfaced to the console with a distinct visual treatment so operators can distinguish infrastructure failures from agent logic failures?
+
+ ---
+
+ ## Shared / Engine
+
+ The durable session store, v2 engine, and workflow authoring features shared by all three systems.
+
+
+ ### Improve commit SHA gathering consistency in wr.coding-task
+
+ **Status: idea** | Priority: high
+
+ **Score: 9** | Cor:2 Cap:1 Eff:2 Lev:1 Con:3 | Blocked: no
+
+ After fixing the primary cause (SHA footer referenced `continue_workflow` by name while daemon agents use `complete_step`), three structural gaps remain that prevent consistent SHA recording:
+
+ **Gap 1: SHA footer appears on every non-final step, including planning/design steps with no commits.** Agents correctly skip it on those steps, but the repetition trains them to suppress it reflexively -- including on implementation steps where it matters. Options to explore: inject only inside loop bodies tagged as implementation, add an opt-out flag to steps, or move the SHA reminder into the implementation step prompts directly in the workflow JSON.
+
+ **Gap 2: `phase-5-small-task-fast-path` has no correctly-wired final metrics step for Small tasks.** `isLastStep` resolves to `phase-7b-fix-and-summarize`, which has a `runCondition` that skips it for Small tasks. Small-task sessions never see the final metrics footer. Needs either: the final footer added directly to `phase-5`'s authored prompt, or `isLastStep` detection made context-aware (complex).
+
+ **Gap 3: No validation for `metrics_commit_shas`.** `checkContextBudget` validates `metrics_outcome` but not SHAs. Missing or partial arrays fail silently. A warning-level soft validation at the final step would at least surface the gap in logs.
+
+ The right fix is probably a combination of moving the SHA instruction into the implementation step prompts directly (removing it from the ambient footer entirely) and adding Gap 2's final footer to `phase-5`. That avoids any new engine machinery.
+
+ **Things to hash out:**
+ - Moving the SHA instruction into implementation step prompts means every implementation step must be identified and updated. Who owns the ongoing maintenance of keeping that instruction present in new steps added to the workflow?
+ - Gap 3's soft validation: what is the right signal when `metrics_commit_shas` is missing -- a log warning, a console callout, or a session outcome flag? What action should the operator take on seeing this signal?
+ - If the SHA footer is removed from the ambient footer entirely, what prevents other workflows from missing SHA collection? Is the ambient footer the right abstraction, or should SHA recording be an engine-level concern separate from prompts?
+
+ ---
+
+ ### `jumpIf`: conditional step jumps with per-target jump counter

  **Status: idea** | Priority: medium

- **Gap:** No native cron/schedule provider. Workaround is OS crontab calling `curl`.
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no

- **Design:**
- ```yaml
- triggers:
- - id: weekly-code-health
- provider: schedule
- cron: "0 9 * * 1"
- workflowId: architecture-scalability-audit
- workspacePath: /path/to/repo
- goal: "Run weekly code health scan"
+ **Problem:** Workflows with investigation or iterative refinement patterns (bug-investigation, mr-review) can exhaust their hypothesis set and reach an `inconclusive_but_narrowed` state with no structural way to restart an earlier phase. A `jumpIf` primitive would let any step conditionally restart execution from an earlier step when a context condition is met.
+
+ **Proposed design:**
+
+ ```json
+ {
+ "id": "phase-4b-loop-decision",
+ "jumpIf": {
+ "condition": { "var": "diagnosisType", "equals": "inconclusive_but_narrowed" },
+ "target": "phase-2-hypothesis-generation-and-shortlist",
+ "maxJumps": 2
+ }
+ }
+ ```
+
+ **Engine behavior:**
+ - When a step completes and its `jumpIf.condition` is met, the engine checks the per-session jump counter for `target`
+ - Counter is derived from the event log: count `jump_recorded` events where `toStepId === target` -- fully append-only and replayable
+ - If `counter < maxJumps`: append `jump_recorded` event, create fresh nodeIds for `target` and all subsequent steps, mint a new continueToken pointing at the fresh target node
+ - If `counter >= maxJumps`: jump is blocked, execution falls through to the next step (safety cap, not an error)
975
+
976
+ **Why this is safe:**
977
+ - `maxJumps` is a required field -- no unbounded loops possible
978
+ - Counter is derivable from the append-only event log -- no mutable state
979
+ - Fall-through on limit reached is predictable and operator-visible
980
+
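The counter derivation can be sketched as a pure replay over the event log. This is a hedged sketch: the event shape (`kind`, `toStepId`) is an assumption, not the engine's real event types.

```typescript
// Derive the per-target jump counter by replaying the append-only event log.
interface SessionEvent {
  kind: string;
  toStepId?: string;
}

function jumpCount(events: SessionEvent[], target: string): number {
  // Count jump_recorded events pointing at this target step
  return events.filter((e) => e.kind === "jump_recorded" && e.toStepId === target).length;
}

function canJump(events: SessionEvent[], target: string, maxJumps: number): boolean {
  // Below the cap: jump allowed. At the cap: blocked, execution falls through.
  return jumpCount(events, target) < maxJumps;
}
```

Because the counter is a pure function of the log, replaying a session reproduces the same jump decisions.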
981
+ **Open design questions:**
982
+ - `maxJumps` default if omitted -- probably require it explicitly (same as `maxIterations` on loops)
983
+ - DAG console rendering -- backward jumps create "re-entry" edges. Needs a distinct visual treatment
984
+ - Interaction with `runCondition` -- if a jumped-to step has a `runCondition` that evaluates false at re-entry time, does the engine skip it and advance?
985
+
986
+ **Scope when ready to implement:**
987
+ - `spec/workflow.schema.json`: add `jumpIf` to `standardStep`
988
+ - `spec/authoring-spec.json`: add authoring rule
989
+ - Compiler: validate `target` resolves to a reachable earlier step, `maxJumps >= 1`
990
+ - Engine (`src/v2/durable-core/`): new `jump_recorded` event kind, counter derivation, fresh nodeId creation on jump
991
+ - Console DAG: render jump edges distinctly
992
+
993
+ **Motivation workflow:** `wr.bug-investigation` -- when all hypotheses are eliminated and `diagnosisType === 'inconclusive_but_narrowed'`, jump back to phase 2 (hypothesis generation) with the eliminated theories in context, up to 2 times before falling through to validation/handoff.
994
+
995
+ ---
996
+
997
+ ### Versioned workflow schema validation
998
+
999
+ **Status: idea** | Priority: medium-high
1000
+
1001
+ **Score: 11** | Cor:2 Cap:2 Eff:2 Lev:2 Con:3 | Blocked: no
1002
+
1003
+ **Problem:** WorkRail validates workflow files against the schema bundled in the currently-running MCP binary. A binary that is too new rejects old workflows; one that is too old rejects new ones. Either way the workflow silently disappears from `list_workflows` with no explanation.
1004
+
1005
+ **The right fix:** Each workflow declares `"schemaVersion": 1` (integer). The binary ships validator copies for every schema version it supports. When loading a workflow, pick the validator matching the declared version.
1006
+
1007
+ **Load-time logic:**
1008
+ 1. Read `schemaVersion` (default 1 if absent -- legacy workflows)
1009
+ 2. If `schemaVersion === current`: validate against current schema directly
1010
+ 3. If `schemaVersion < current` (binary newer): validate against the declared schema version
1011
+ 4. If `schemaVersion > current` (binary too old): load leniently with warnings -- `additionalProperties: false` does not apply
1012
+
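The four-way branch above can be sketched as a small selector. `CURRENT_SCHEMA_VERSION` and the mode strings are illustrative, not WorkRail's real identifiers.

```typescript
// Pick a validation mode from the workflow's declared schemaVersion.
const CURRENT_SCHEMA_VERSION = 1;

function pickValidationMode(declared?: number): string {
  const v = declared ?? 1; // legacy workflows without schemaVersion default to v1
  if (v === CURRENT_SCHEMA_VERSION) return "strict:current";
  if (v < CURRENT_SCHEMA_VERSION) return `strict:v${v}`; // binary newer: use the declared version's validator
  return "lenient"; // binary too old for this workflow: warn, skip additionalProperties
}
```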
1013
+ **Decision (from Apr 23 audit):** v1 = current schema. The one historical breaking change (`assessmentConsequenceTrigger`, Apr 5) was fully contained within the bundled workflow corpus. No historical reconstruction needed.
1014
+
1015
+ **Files to change:** `spec/workflow.schema.json`, `spec/workflow.schema.v1.json` (snapshot), `src/application/validation.ts`, `src/types/workflow-definition.ts`, `workflow-for-workflows.json` (stamp `schemaVersion`), all bundled workflows.
1016
+
1017
+ **Things to hash out:**
1018
+ - What is the policy when a workflow with `schemaVersion > current` has fields that fail lenient loading -- should the workflow be skipped entirely or loaded partially?
1019
+ - Should the binary ship all historical schema validator copies forever, or is there a deprecation window after which very old versions are dropped?
1020
+ - How does `workrailVersion` (the "forever backward compat" idea elsewhere in the backlog) relate to `schemaVersion`? Are these the same concept or different tracking axes?
1021
+ - External workflow authors who don't track WorkRail releases need to know how to set `schemaVersion`. Is the default-to-v1 behavior documented clearly enough?
1022
+ - What prevents a workflow from declaring `schemaVersion: 999` to bypass validation entirely via the lenient path?
1023
+
1024
+ ---
1025
+
1026
+ ### Task re-dispatch loop protection
1027
+
1028
+ **Status: done** | Shipped PR #883 (Apr 30, 2026)
1029
+
1030
+ `queue-issue-<N>.json` sidecar now carries `attemptCount`. Failure path rewrites it with the same count + zeroed TTL (no double-increment). When `attemptCount >= maxAttempts` (default 3, configurable as `maxDispatchAttempts`), dispatch is skipped, outbox notified, `worktrain:needs-human` label applied, comment posted. Daemon restart resets counts.
1031
+
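The gating rule reduces to a small predicate. Field names mirror the sidecar description above; the default cap of 3 matches the shipped `maxDispatchAttempts` default.

```typescript
// Skip dispatch once the sidecar's attempt count reaches the cap.
interface QueueSidecar {
  attemptCount: number;
}

function shouldDispatch(sidecar: QueueSidecar, maxAttempts = 3): boolean {
  // At or above the cap: skip dispatch, notify outbox, escalate to a human
  return sidecar.attemptCount < maxAttempts;
}
```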
1032
+ ---
1033
+
1034
+ ### Daemon agent loop stall detection
1035
+
1036
+ **Status: done** | Shipped PR #900 (Apr 30, 2026)
1037
+
1038
+ `AgentLoop` now accepts `stallTimeoutMs` and `onStallDetected` callback (injected, not hardcoded). Timer resets before each `client.messages.create()` call; if it fires, `abort()` is called and `WorkflowRunStuck` with `reason: 'stall'` is returned. Configurable via `agentConfig.stallTimeoutSeconds` in triggers.yml (default 120s).
1039
+
1040
+ ---
1041
+
1042
+ ### `queue-poll.jsonl` never rotated
1043
+
1044
+ **Status: done** | Shipped PR #897 (Apr 30, 2026)
1045
+
1046
+ `rotateLogFile()` reusable helper added. Fires at 10 MB: shifts `.1` to `.2`, renames current to `.1`, starts fresh. Two backup files (~10 weeks retention). Best-effort: rotation failure logs a warning but never blocks the append.
1047
+
1048
+ ---
1049
+
1050
+ ### ReviewSeverity: stderr bypassing injected dep
1051
+
1052
+ **Status: idea** | Priority: medium
1053
+
1054
+ **Score: 9** | Cor:1 Cap:1 Eff:3 Lev:1 Con:3 | Blocked: no
1055
+
1056
+ **Bug 1 (DONE):** `assertNever` on `ReviewSeverity` was added at `pr-review.ts:1407`. ✓
1057
+
1058
+ **Bug 2 (still open):** `src/coordinators/pr-review.ts:447` -- `process.stderr.write(...)` called directly instead of using injected `deps.stderr`. Tests that inject a fake dep miss this log.
1059
+
1060
+ **File:** `src/coordinators/pr-review.ts`.
1061
+
1062
+ ---
1063
+
1064
+ ### Session continuation / "just keep talking"
1065
+
1066
+ **Status: idea** | Priority: medium
1067
+
1068
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no
1069
+
1070
+ A completed session is not dead -- the conversation is still in the event log. The only thing blocking continuation is the engine rejecting messages to sessions in `complete` state.
1071
+
1072
+ **The change:** Remove that gate. `worktrain session continue <sessionId> "<message>"` sends a message to a completed session. New events appended to the same log. Same session ID. The agent has full context of everything it ever did.
1073
+
1074
+ Context window overflow (very long sessions) is a separate optimization problem -- truncate oldest turns while keeping step notes. Don't solve it now.
1075
+
1076
+ **Things to hash out:**
1077
+ - When a completed session is continued, what workflow state does the engine start from? Does the agent re-enter the workflow at the last step, or does continuation happen outside any workflow context?
1078
+ - If continuation adds a `session_resumed` event, how should the console display the session? As an extended session or as a new one with a link back?
1079
+ - Should `worktrain session continue` be available in both daemon and MCP contexts, or daemon-only where the context stays alive?
1080
+ - What is the intended use case -- interactive follow-up questions, or coordinator-driven post-processing? The answer shapes the UX significantly.
1081
+ - If a session is continued after its worktree has been cleaned up, what tools can the agent use? Does it get a fresh worktree, or is it context-only?
1082
+
1083
+ ---
1084
+
1085
+ ### Session as a living record: post-completion phases
1086
+
1087
+ **Status: idea** | Priority: medium
1088
+
1089
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no
1090
+
1091
+ A `session_completed` event means the original workflow is done -- not that the session can never receive new events. The event log is append-only: just keep appending. A post-completion interaction adds a `session_resumed` event, then new turns, then a new `session_completed`.
1092
+
1093
+ This is already how mid-run resume works. The same mechanism extends naturally to post-completion: rehydrate the completed state, append a new lightweight phase, run it, complete again.
1094
+
1095
+ **Richer automatic checkpoints:** Many session events should trigger a checkpoint automatically:
1096
+ - `step_advanced` (already essentially a checkpoint)
1097
+ - `signal_coordinator` fired (agent surfaced meaningful mid-step state)
1098
+ - Worktree commit pushed (code state durable on remote)
1099
+ - Coordinator steers the session (notable injection)
1100
+ - `spawn_agent` child completes (parent has new information)
1101
+
1102
+ **Things to hash out:**
1103
+ - Who decides what constitutes a "lightweight phase" added post-completion? Is this a new workflow, an ad-hoc prompt, or something else?
1104
+ - How does the auto-checkpoint list interact with existing explicit `checkpoint_workflow` calls? Is there any risk of over-checkpointing causing storage bloat?
1105
+ - If a coordinator resumes a session for post-completion processing, is the resumed session billed/attributed to the same source trigger?
1106
+ - What is the retention and garbage collection policy for post-completion events appended to old sessions?
1107
+
1108
+ ---
1109
+
1110
+ ### Rules preprocessing: normalize workspace rules before injection
1111
+
1112
+ **Status: idea** | Priority: medium
1113
+
1114
+ **Score: 8** | Cor:1 Cap:2 Eff:2 Lev:1 Con:2 | Blocked: no
1115
+
1116
+ **Problem:** WorkTrain injects all rules files raw into every agent's system prompt. A workspace with `.cursorrules`, `CLAUDE.md`, `.windsurf/rules/*.md`, and `AGENTS.md` might inject 10KB of rules into a discovery session that only needs 2KB.
1117
+
1118
+ **Design:** A `worktrain rules build` command that reads all IDE rules files from the workspace, deduplicates overlapping rules, categorizes by phase, and writes to `.worktrain/rules/`:
1119
+ - `implementation.md`, `review.md`, `delivery.md`, `discovery.md`, `all.md`
1120
+ - `manifest.json` -- which files exist, when generated, source files used
1121
+
1122
+ At session time: WorkTrain injects only the phase-relevant file.
1123
+
1124
+ **Things to hash out:**
1125
+ - How does WorkTrain determine which pipeline phase a session corresponds to? Is this declared in the trigger, derived from the workflowId, or inferred from the step?
1126
+ - What happens when a single session spans multiple phases (e.g. a workflow that does discovery + implementation in one run)? Does the injected rules file switch mid-session, or is one phase file chosen at dispatch time?
1127
+ - Who authors and owns the `.worktrain/rules/` files -- the workspace team, the workflow author, or WorkTrain itself?
1128
+ - Should the absence of a phase-specific file fall back to `all.md`, or be a silent no-op? Is a missing `implementation.md` a misconfiguration or an acceptable default?
1129
+ - How does this interact with the existing `daemon-soul.md` and workspace AGENTS.md injection? What is the full load-order and precedence when all are present?
1130
+
1131
+ ---
1132
+
1133
+ ### True session status (live agent state in console)
1134
+
1135
+ **Status: idea** | Priority: medium-high
1136
+
1137
+ **Score: 8** | Cor:1 Cap:2 Eff:2 Lev:1 Con:2 | Blocked: no
1138
+
1139
+ **Problem:** The console currently infers session status from last event timestamp. WorkTrain has direct access to `DaemonRegistry`, `DaemonEventEmitter`, and turn-level events -- it should show true status.
1140
+
1141
+ **True session status taxonomy:**
1142
+ - `active:thinking` -- LLM API call in progress
1143
+ - `active:tool` -- tool executing (name visible)
1144
+ - `active:idle` -- between turns, session in DaemonRegistry
1145
+ - `stuck` -- stuck heuristic fired
1146
+ - `completed:success/timeout/stuck/max_turns`
1147
+ - `aborted` -- daemon killed mid-run
1148
+ - `daemon:down` -- no recent heartbeat
1149
+
1150
+ Surface in: `worktrain status`, `worktrain health <sessionId>`, console session rows.
1151
+
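The taxonomy can be captured as a discriminated union, with a heartbeat freshness check for `daemon:down`. The 30s staleness threshold is an assumption, not a decided value.

```typescript
// The status taxonomy above as a union type.
type SessionStatus =
  | "active:thinking"
  | "active:tool"
  | "active:idle"
  | "stuck"
  | "completed:success"
  | "completed:timeout"
  | "completed:stuck"
  | "completed:max_turns"
  | "aborted"
  | "daemon:down";

function isDaemonUp(lastHeartbeatMs: number, nowMs: number, staleAfterMs = 30_000): boolean {
  // A heartbeat older than the staleness window means live status cannot be trusted
  return nowMs - lastHeartbeatMs <= staleAfterMs;
}
```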
1152
+ **Things to hash out:**
1153
+ - The daemon has direct access to `DaemonRegistry`, but the console is a separate process reading the session store. How does live status reach the console without the daemon being a dependency for reading it?
1154
+ - What is the polling or push mechanism for the console to get status updates? SSE from the daemon's HTTP endpoint, or a separate status file the daemon writes?
1155
+ - How is `daemon:down` distinguished from "daemon is up but this session is not currently running"? What is the heartbeat protocol?
1156
+ - Should `active:tool` surface the tool name? Some tool names (file paths, bash commands) could leak sensitive workspace content in the console UI.
1157
+ - What is the retention policy for status events -- does the console show only the live state, or a history of status transitions?
1158
+
1159
+ ---
1160
+
1161
+ ## WorkTrain Daemon -- Coordinator patterns
1162
+
1163
+ Coordinator design patterns for WorkTrain's autonomous pipeline.
1164
+
1165
+
1166
+ ### Event-driven agent coordination (coordinator as event bus)
1167
+
1168
+ **Status: idea** | Priority: high
1169
+
1170
+ **Score: 11** | Cor:1 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no
1171
+
1172
+ **Problem:** Agents managing an MR should not poll for review comments or CI status -- that wastes turns and burns tokens. Instead, the coordinator should register for events and steer the agent when something relevant happens.
1173
+
1174
+ The infrastructure already exists: `steerRegistry` + `POST /sessions/:id/steer`, `signal_coordinator` tool, `DaemonEventEmitter`.
1175
+
1176
+ **What's missing:** Coordinator-side event sources (GitHub webhooks or polling fallback) and an event-to-steer bridge that maps `MREvent` to structured steer messages.
1177
+
1178
+ **How it works:** MR management agent session is parked (no pending turns). Coordinator registers for GitHub events. When review comment/CI failure/approval arrives, coordinator steers the running session. Agent responds. No polling from the agent side.
1179
+
1180
+ **Agent session prompt:** "Do not poll for PR status. Wait for the coordinator to deliver events via injected messages."
1181
+
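The missing event-to-steer bridge could look like the following. The `MREvent` shape and the message wording are assumptions, not the daemon's real types.

```typescript
// Map an incoming MR event to a structured steer message for the parked session.
interface MREvent {
  kind: "review_comment" | "ci_failure" | "approval";
  prNumber: number;
  detail: string;
}

function toSteerMessage(event: MREvent): string {
  switch (event.kind) {
    case "review_comment":
      return `A reviewer commented on PR #${event.prNumber}: ${event.detail}. Triage and respond.`;
    case "ci_failure":
      return `CI failed on PR #${event.prNumber}: ${event.detail}. Diagnose the failure.`;
    case "approval":
      return `PR #${event.prNumber} received an approval: ${event.detail}. Check merge gates.`;
  }
}
```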
1182
+ **Things to hash out:**
1183
+ - How does the coordinator distinguish between a GitHub webhook event and a polling fallback event when both are in flight? Is deduplication needed?
1184
+ - What is the protocol for a parked agent session -- does it consume a slot in `maxConcurrentSessions` while parked, or is the slot released and re-acquired when an event arrives?
1185
+ - How long can an agent session remain parked before the coordinator gives up and closes it? Is there a configurable TTL for event-waiting?
1186
+ - Should the coordinator register for GitHub events directly, or should a shared event router handle all webhook subscriptions and fan out to interested coordinators?
1187
+ - If the steer injection fails (session has timed out or been garbage collected), what does the coordinator do with the pending event?
1188
+
1189
+ ---
1190
+
1191
+ ### MR lifecycle manager
1192
+
1193
+ **Status: idea** | Priority: high
1194
+
1195
+ **Score: 9** | Cor:1 Cap:3 Eff:1 Lev:2 Con:2 | Blocked: yes (needs dispatchCondition, PR templates)
1196
+
1197
+ **Gap:** WorkTrain currently creates a PR and dispatches an MR review session. Everything between "PR created" and "PR merged" is invisible: CI failures, reviewer comments, requested changes, merge conflicts, required approvals. A human has to watch and intervene.
1198
+
1199
+ **Vision:** `runMRLifecycleManager()` takes ownership of the MR from creation to merge.
1200
+
1201
+ **Responsibilities:**
1202
+ 1. MR creation with correct template, labels, milestone, reviewers, linked tickets
1203
+ 2. CI pipeline monitoring -- parse failures, retry flaky tests, spawn fix sessions
1204
+ 3. Review comment triage -- classify each comment (actionable/question/nit/approval/blocker), reply autonomously or escalate
1205
+ 4. Approval tracking -- when all gates pass, trigger merge
1206
+ 5. Merge conflict resolution -- rebase or escalate complex conflicts
1207
+ 6. Merge execution + downstream ticket/notification updates
1208
+
1209
+ **Dependency:** PR template support, phase-scoped rules, `dispatchCondition` webhook filter.
1210
+
1211
+ **Things to hash out:**
1212
+ - CI pipeline monitoring requires parsing CI failure logs, which are provider-specific (GitHub Actions, GitLab CI, CircleCI, etc.). Is the lifecycle manager expected to handle multiple providers, or is it scoped to one initially?
1213
+ - "Retry flaky tests" is a significant decision with potential to exhaust CI minutes. What is the policy for how many retries are allowed, and who decides when a test is "flaky" vs genuinely broken?
1214
+ - For merge conflict resolution, what is the boundary between "safe to rebase automatically" and "requires human escalation"? Is this a heuristic, a file-set check, or something else?
1215
+ - What happens if the lifecycle manager itself fails mid-run (daemon crash, token expiry)? Is the MR left in a consistent state, or can it be in a partially processed state?
1216
+ - Who is responsible for the MR while the lifecycle manager is active -- WorkTrain or the human who opened the task? Can the human intervene and override without confusing the manager?
1217
+ - How does the lifecycle manager handle PRs that become stale while waiting for CI (main advances, merge conflict develops)?
1218
+
1219
+ ---
1220
+
1221
+ ### Phase-scoped context files
1222
+
1223
+ **Status: idea** | Priority: medium
1224
+
1225
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:1 Con:3 | Blocked: no
1226
+
1227
+ **Design:** Teams define context files scoped to specific pipeline phases under `.worktrain/rules/`:
1228
+ - `discovery.md`, `shaping.md`, `implementation.md`, `review.md`, `delivery.md`, `pr-management.md`, `all.md`
1229
+
1230
+ Each file is injected only into sessions running the matching pipeline phase. Reduces token waste and rule dilution. `all.md` is equivalent to today's AGENTS.md injection.
1231
+
1232
+ **Load order (most specific wins):** `AGENTS.md` / `CLAUDE.md` (base) → `.worktrain/rules/all.md` → phase-specific file.
1233
+
1234
+ ---
1235
+
1236
+ ### Coordinator architecture: separation of concerns
1237
+
1238
+ **Status: idea** | Priority: medium
1239
+
1240
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: yes (needs knowledge graph for context assembly)
1241
+
1242
+ **Problem:** `src/coordinators/pr-review.ts` is already ~500 LOC doing session dispatch, result aggregation, finding classification, merge routing, message queue drain, and outbox writes. Adding knowledge graph queries, context bundle assembly, and prior session lookups would create a god class.
1243
+
1244
+ **Right layering:**
1245
+ ```
1246
+ Trigger layer        src/trigger/           receives events, validates, enqueues
1247
+ Dispatch layer       (TBD)                  decides which workflow + what goal
1248
+ Context assembly     (TBD)                  gathers and packages context before spawning
1249
+ Orchestration layer  src/coordinators/      spawns, awaits, routes, retries, escalates
1250
+ Delivery layer       src/trigger/delivery   posts results back to origin systems
1251
+ ```
1252
+
1253
+ **Context assembly** is the missing layer. Before dispatching a coding session, `assembleContext(task, workspace)` runs: knowledge graph query, upstream pitch/PRD fetch, relevant prior session notes, returns a structured context bundle. The orchestration script should call this, not own it.
1254
+
1255
+ **Things to hash out:**
1256
+ - The right layering puts "Dispatch layer (TBD)" between Trigger and Orchestration. What exactly does the dispatch layer decide, and how does it relate to the adaptive pipeline coordinator concept elsewhere in the backlog?
1257
+ - Context assembly requires the knowledge graph. What is the fallback when the KG is not yet built for a workspace -- does context assembly simply return empty, or does it fall back to a slower manual search?
1258
+ - Should context assembly run synchronously before dispatch (blocking the trigger listener) or asynchronously (session starts with partial context while assembly continues)?
1259
+ - Who owns the context assembly API contract -- the engine (as a new primitive), the daemon (as an infrastructure capability), or user-authored scripts?
1260
+
1261
+ ---
1262
+
1263
+ ### Scheduled tasks (native cron provider)
1264
+
1265
+ **Status: idea** | Priority: medium
1266
+
1267
+ **Score: 11** | Cor:1 Cap:3 Eff:2 Lev:2 Con:3 | Blocked: no
1268
+
1269
+ **Gap:** No native cron/schedule provider. Workaround is OS crontab calling `curl`.
1270
+
1271
+ **Design:**
1272
+ ```yaml
1273
+ triggers:
1274
+ - id: weekly-code-health
1275
+ provider: schedule
1276
+ cron: "0 9 * * 1"
1277
+ workflowId: architecture-scalability-audit
1278
+ workspacePath: /path/to/repo
1279
+ goal: "Run weekly code health scan"
1280
+ ```
1281
+
1282
+ **Key decisions:**
1283
+ - Standard 5-field cron syntax, configurable timezone
1284
+ - Missed runs NOT caught up by default (optional `catchUp: true`)
1285
+ - Overlap prevention: if a run is still active when the next tick fires, skip it
1286
+ - `worktrain run schedule <trigger-id>` for manual trigger
1287
+
1288
+ **Implementation:** `PollingScheduler` already runs time-based loops. Schedule provider would use cron expression matching instead of API polling. State persists to `~/.workrail/schedule-state.json`.
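The overlap-prevention rule is small enough to sketch directly: if a run is still active when the next tick fires, skip it and record the skip. The state shape is an assumption, not the `PollingScheduler` API.

```typescript
// Per-trigger schedule state: at most one active run at a time.
interface ScheduleState {
  activeRunId: string | null;
  skippedTicks: number;
}

function onTick(state: ScheduleState, startRun: () => string): ScheduleState {
  if (state.activeRunId !== null) {
    // Previous run still active: record the skip instead of overlapping
    return { ...state, skippedTicks: state.skippedTicks + 1 };
  }
  return { activeRunId: startRun(), skippedTicks: state.skippedTicks };
}
```

Tracking `skippedTicks` also gives the operator-notification question below something concrete to surface.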
1289
+
1290
+ **Things to hash out:**
1291
+ - `schedule-state.json` records last-run timestamps. If the daemon is not running at the scheduled time, what happens when it next starts -- does the missed run execute immediately, wait for the next tick, or follow the `catchUp: true` policy?
1292
+ - Timezone support requires knowing the user's local timezone at schedule-definition time, not at execution time. What happens when the operator moves to a different timezone?
1293
+ - "Overlap prevention" skips a tick if a run is still active. What is the notification when a run is skipped? Does the operator know they missed a scheduled execution?
1294
+ - Should `worktrain run schedule <trigger-id>` bypass the overlap check (for manual debugging), or respect it?
1295
+ - How does the schedule provider interact with the daemon's `maxConcurrentSessions` limit? A scheduled job that fires while the daemon is at full capacity could be silently dropped.
1296
+
1297
+ ---
1298
+
1299
+ ### Autonomous grooming loop + workOnAll mode
1300
+
1301
+ **Status: idea** | Priority: medium
1302
+
1303
+ **Score: 9** | Cor:1 Cap:3 Eff:1 Lev:2 Con:2 | Blocked: yes (needs scheduled tasks)
1304
+
1305
+ **Three autonomy levels:**
1306
+
1307
+ - **Level 0 (current):** Human applies `worktrain` label to specific issues. WorkTrain works those only.
1308
+ - **Level 1 -- workOnAll:** Config flag `workOnAll: true`. WorkTrain looks at ALL open issues, infers which are actionable, picks highest-priority. Escape hatch: `worktrain:skip` label.
1309
+ - **Level 2 -- Fully proactive:** WorkTrain also surfaces work it found itself (failing CI, backlog items with no issue, patterns in git history).
1310
+
1311
+ **Grooming loop (scheduled nightly):** Reads backlog, open issues, recent completed work. Closes resolved issues. For ungroomed items: infers maturity (linked spec, acceptance criteria, vague language). For high-value idea-level items: runs `wr.discovery` + `wr.shaping`, creates/updates issue.
1312
+
1313
+ **workOnAll config:**
1314
+ ```json
1315
+ { "workOnAll": true, "workOnAllExclusions": ["needs-design", "blocked-external"], "maxConcurrentSelf": 2 }
1316
+ ```
1317
+
1318
+ **Things to hash out:**
1319
+ - The grooming loop reads and writes GitHub issues autonomously. What safeguards prevent it from closing issues that are still relevant but appear resolved?
1320
+ - What is the "infer which issues are actionable" heuristic? Misclassification could cause WorkTrain to skip important work or start unwanted sessions.
1321
+ - `workOnAll: true` effectively gives WorkTrain permission to work on any open issue. How does the operator set scope limits beyond label exclusions -- e.g. restrict to a specific project, milestone, or assignee?
1322
+ - How does WorkTrain avoid duplicate work when `workOnAll` is enabled and another human or agent is already working on the same issue?
1323
+ - What is the escalation path when a groomed issue turns out to need human judgment? Does WorkTrain leave a comment and move on, or does it hold the item?
1324
+ - Should `maxConcurrentSelf` apply at the daemon level or the workspace level? A single daemon managing multiple repos needs per-workspace caps.
1325
+
1326
+ ---
1327
+
1328
+ ### Escalating review gates based on finding severity
1329
+
1330
+ **Status: idea** | Priority: medium
1331
+
1332
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no
1333
+
1334
+ **Problem:** "Blocking" is binary -- a single Critical finding and a trivially incorrect comment are treated identically.
1335
+
1336
+ **Right behavior:** After a fix round, if re-review still returns Critical:
1337
+ 1. Another full MR review -- confirm the Critical is real, not a false positive
1338
+ 2. Production readiness audit -- a Critical finding often implies a runtime risk
1339
+ 3. Architecture audit -- if the Critical is architectural
1340
+
1341
+ Routing by `finding.category` from `wr.review_verdict`:
1342
+ - `correctness` / `security` -> always trigger prod audit
1343
+ - `architecture` / `design` -> trigger arch audit
1344
+ - All -> trigger re-review
1345
+
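The routing table above, expressed as a function. The audit identifiers are illustrative names, not real workflow IDs.

```typescript
// Route a finding category to the set of escalation audits to run.
function auditsFor(category: string): string[] {
  const audits = ["re-review"]; // every escalation includes another full MR review
  if (category === "correctness" || category === "security") {
    audits.push("production-readiness-audit");
  }
  if (category === "architecture" || category === "design") {
    audits.push("architecture-audit");
  }
  return audits;
}
```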
1346
+ **Hard rule:** A PR that triggered the escalating audit chain should NEVER auto-merge; explicit human approval is required.
1347
+
1348
+ **Things to hash out:**
1349
+ - The escalation routing by `finding.category` assumes categories are reliably assigned by the review workflow. How accurate is that classification in practice? A misclassified category could skip the wrong audit type.
1350
+ - How are false positives handled in the escalating chain? If a production audit is triggered by a Critical finding that turns out to be incorrect, is there a path to clear it without human intervention?
1351
+ - The "hard rule: never auto-merge after escalation" is correct but creates a potential pile-up of PRs waiting for human approval. Is there a notification mechanism to surface these to the operator?
1352
+ - Should the escalation chain be configurable per workspace or per workflow, or is it a global policy?
1353
+ - How does this interact with `riskLevel=Critical` tasks that already require human approval by policy? Are the two gates additive or redundant?
1354
+
1355
+ ---
1356
+
1357
+ ### Workflow execution time tracking and prediction
1358
+
1359
+ **Status: idea** | Priority: medium
1360
+
1361
+ **Score: 11** | Cor:1 Cap:2 Eff:3 Lev:2 Con:3 | Blocked: no
1362
+
1363
+ **Problem:** Timeouts are set by intuition. No data on how long workflows actually take.
1364
+
1365
+ **What to track:** For every completed session -- workflow ID, total wall-clock duration, turn count, step advances, outcome, task complexity signals. Store in `~/.workrail/data/execution-stats.jsonl`.
1366
+
1367
+ **Uses:**
1368
+ - Calibrate timeouts automatically (p95 * 1.5)
1369
+ - Predict duration before dispatch
1370
+ - Step-advance rate as workflow efficiency proxy
1371
+
1372
+ **Implementation:** Append to `execution-stats.jsonl` in `runWorkflow()`'s finally block.
1373
+
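The p95 * 1.5 calibration can be sketched with a nearest-rank percentile; the percentile method and the no-data fallback are assumptions.

```typescript
// Nearest-rank p95 over recorded wall-clock durations.
function p95(durationsMs: number[]): number {
  const sorted = [...durationsMs].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil(0.95 * sorted.length) - 1);
  return sorted[idx];
}

function calibratedTimeoutMs(durationsMs: number[], fallbackMs: number): number {
  if (durationsMs.length === 0) return fallbackMs; // no data yet: keep the configured timeout
  return Math.round(p95(durationsMs) * 1.5);
}
```

With few samples the p95 is just the max, which is one reason a minimum sample count (an open question below) matters before trusting the calibration.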
1374
+ **Things to hash out:**
1375
+ - How many data points are needed before timeout calibration is reliable? p95 * 1.5 from 3 samples is very different from p95 from 300 samples.
1376
+ - Should auto-calibrated timeouts update `triggers.yml` in place, or only influence the daemon's internal behavior? Modifying `triggers.yml` autonomously is a significant action.
1377
+ - Duration data varies by model, task complexity, and LLM provider load. Should the prediction account for these dimensions, or just average across them?
1378
+ - What happens to prediction accuracy when workflow structure changes significantly between versions? Should stats from old workflow versions be excluded?
1379
+ - Who can see and act on the execution stats? Should they be surfaced in the console or only in raw `.jsonl` form?
1380
+
1381
+ ---
1382
+
1383
+ ### WorkRail MCP server self-cleanup
1384
+
1385
+ **Status: idea** | Priority: medium
1386
+
1387
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:1 Con:3 | Blocked: no
1388
+
1389
+ **Sources of stale state:** old workflow copies in `~/.workrail/workflows/`, dead managed sources, stale git repo caches, 500+ sessions accumulating with no TTL, remembered roots for non-existent paths.
1390
+
1391
+ **Fix -- two layers:**
1392
+
1393
+ 1. **Startup auto-cleanup (light):** On MCP server startup, silently remove managed sources where the filesystem path doesn't exist. Log "removed N stale sources."
1394
+
1395
+ 2. **`workrail cleanup` command:**
1396
+ ```
1397
+ workrail cleanup [--yes] [--sessions --older-than <age>] [--sources] [--cache] [--roots]
1398
+ ```
1399
+
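The layer-1 startup cleanup reduces to a filter over managed sources. In this sketch the existence check is injected so the behavior is testable; real code would pass `fs.existsSync`.

```typescript
// Drop managed sources whose filesystem path no longer exists.
interface ManagedSource {
  id: string;
  path: string;
}

function pruneStaleSources(
  sources: ManagedSource[],
  exists: (p: string) => boolean
): { kept: ManagedSource[]; removed: number } {
  const kept = sources.filter((s) => exists(s.path));
  return { kept, removed: sources.length - kept.length }; // log "removed N stale sources"
}
```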
1400
+ **Things to hash out:**
1401
+ - What is the policy for session retention -- is 500 sessions a problem in practice, or does it only become one after thousands? What storage cost is acceptable?
1402
+ - Startup auto-cleanup silently removes managed sources for non-existent paths. If a path is temporarily unmounted (NAS, external drive), silent removal is destructive. Should there be a warning or confirmation before removing?
1403
+ - `workrail cleanup --sessions --older-than <age>` deletes event logs. For debugging past failures, old session logs are valuable. Is there a distinction between sessions worth keeping and sessions safe to delete?
1404
+ - Should cleanup be idempotent and safe to run while the MCP server is live, or does it require the server to be stopped?
1405
+ - Who decides the default `--older-than` threshold? Too aggressive loses useful history; too conservative lets the store grow unbounded.
1406
+
1407
+ ---
1408
+
1409
+ ### Subagent context packaging
1410
+
1411
+ **Status: idea** | Priority: medium
1412
+
1413
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no
1414
+
1415
+ **Problem:** When a main agent spawns a subagent, the work package is too thin. The main agent has rich context (why this approach was chosen, what was tried, what constraints were discovered) but packages the subagent task as a one-liner.
1416
+
1417
+ **Design (Option B -- structured work package):**
1418
+ ```typescript
1419
+ spawnSession({
1420
+   workflowId: 'coding-task-workflow-agentic',
1421
+   goal: '...',
1422
+   context: {
1423
+     whyThisApproach: '...',
1424
+     alreadyTried: [...],
1425
+     knownConstraints: [...],
1426
+     relevantFiles: [...],
1427
+     completionCriteria: '...'
1428
+   }
1429
+ })
1430
+ ```
1431
+
1432
+ **Context mode:** `context: 'inherit' | 'blank' | 'custom'`. Blank is for adversarial roles (challenger, reviewer) where anchoring to main-agent context is counterproductive.
1433
+
1434
+ **Session knowledge log:** As the main agent progresses, it appends to `session-knowledge.jsonl` -- decisions, user pushback, relevant files, constraints, things tried and failed. Auto-included in subagent work packages.
1435
+
1436
+ **Things to hash out:**
1437
+ - Who enforces the `context` mode? If the spawning agent passes `context: 'inherit'` for an adversarial reviewer, the reviewer's independence is compromised. Is enforcement engine-level or convention?
1438
+ - How large can the structured context bundle grow before it becomes a liability rather than an asset? Is there a hard token budget for `whyThisApproach`, `alreadyTried`, etc.?
1439
+ - The `session-knowledge.jsonl` is append-only. Over a long session it could grow to thousands of entries. What is the selection/truncation strategy when packaging it into a subagent bundle?
1440
+ - How does the main agent know when to append to `session-knowledge.jsonl`? Is this tool-driven (explicit call), automatic on step advance, or heuristic?
1441
+ - What is the format and schema for `completionCriteria`? A natural language string is hard to evaluate programmatically -- is structured output needed?
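The truncation question above can be made concrete. A minimal sketch, assuming a hypothetical `KnowledgeEntry` shape and a "keep all constraints, then most-recent-first" selection policy (none of these names are actual WorkRail APIs):

```typescript
// Hypothetical shape of one session-knowledge.jsonl entry.
interface KnowledgeEntry {
  kind: 'decision' | 'pushback' | 'file' | 'constraint' | 'failed-attempt';
  text: string;
  step: string;  // step id where the entry was recorded
  at: number;    // epoch ms
}

// Illustrative packaging policy: constraints are cheap and high-value, so
// they always survive; the remaining budget is filled most-recent-first.
function packageKnowledge(log: KnowledgeEntry[], maxEntries: number): KnowledgeEntry[] {
  const constraints = log.filter(e => e.kind === 'constraint');
  const rest = log
    .filter(e => e.kind !== 'constraint')
    .sort((a, b) => b.at - a.at)
    .slice(0, Math.max(0, maxEntries - constraints.length));
  return [...constraints, ...rest];
}
```

Whether recency is the right relevance proxy (vs. embedding similarity to the subagent's goal) is exactly the open question.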

---

### Workflow-scoped system prompts for subagents

**Status: idea** | Priority: medium

**Score: 7** | Cor:1 Cap:1 Eff:2 Lev:1 Con:2 | Blocked: no

**Design:** Workflows (and individual steps) can declare a `systemPrompt` field injected into subagent sessions.

```json
{
  "id": "mr-review-workflow.agentic.v2",
  "systemPrompt": "You are an adversarial code reviewer. Your job is to find problems, not validate the approach.",
  "steps": [...]
}
```

Step-level `systemPrompt` overrides workflow-level for that step.

**Composition layers:**
1. WorkTrain base prompt
2. Workflow-level `systemPrompt`
3. Step-level `systemPrompt`
4. Soul file (operator behavioral rules)
5. AGENTS.md / workspace context
6. Session knowledge log (if `context: 'inherit'`)
7. Step prompt
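The mechanics of layered composition are simple; the open question is the budget. A sketch that composes the layers in declared order and flags (rather than decides) an over-budget result -- the layer type and budget policy are illustrative assumptions, not WorkRail behavior:

```typescript
interface PromptLayer { name: string; text: string }

// Join non-empty layers in declared order; a missing step-level override is
// just an empty layer. The budget check only reports; the truncation policy
// (which layer loses) is the unresolved design question.
function composeSystemPrompt(
  layers: PromptLayer[],
  budgetChars: number,
): { prompt: string; overBudget: boolean } {
  const parts = layers.filter(l => l.text.length > 0).map(l => l.text);
  const prompt = parts.join('\n\n');
  return { prompt, overBudget: prompt.length > budgetChars };
}
```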

**Things to hash out:**
- The composition order lists 7 layers. At what point does total system prompt size become a context window concern for the model? Is there a budget or truncation policy?
- Should workflow authors be able to completely replace the WorkTrain base prompt, or only add to it? A workflow that removes the base prompt's safety constraints is a significant risk vector.
- Step-level overrides apply only to that step, but the model's behavior may be shaped for the entire session by earlier steps. Is there a "reset" mechanism for step-scoped prompts?
- If the same content appears in both the workflow-level `systemPrompt` and AGENTS.md, is that redundancy acceptable or should there be a deduplication step?
- How is a workflow-scoped `systemPrompt` authored and validated? Is it freeform text, or are there constraints on what it can contain?

---

### `context-gather` step type

**Status: idea** | Priority: medium

**Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no

**Problem:** Phase 0.5 in the coding workflow currently looks for a shaped pitch by checking a local path. This doesn't handle coordinator-injected context, manually written docs (GDoc, Confluence, Notion), Glean-indexed artifacts, or URLs embedded in the task description. The search logic is duplicated if other workflows need the same document.

**Proposed primitive:**
```json
{
  "type": "context-gather",
  "id": "gather-pitch",
  "contextType": "shaped-pitch",
  "outputVar": "shapedInput",
  "optional": true,
  "sources": ["coordinator-injected", "local-paths", "task-url", "glean"]
}
```

**Source resolution order (stops at first hit):**
1. `coordinator-injected` -- coordinator already attached context of this type
2. `local-paths` -- check `.workrail/current-pitch.md`, `pitch.md`, `.workrail/pitches/`
3. `task-url` -- extract any URL from task description and fetch
4. `glean` -- search Glean for recent docs matching task keywords (opt-in only)
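The stop-at-first-hit semantics can be sketched as a resolver chain. The `Resolver` type and names here are illustrative; the real engine would back each source with coordinator state, filesystem checks, URL fetches, or a Glean client:

```typescript
type Resolver = () => string | null;

// Try sources in declared order; the first non-null result wins and later
// (more expensive) sources are never consulted.
function resolveContext(
  order: string[],
  sources: Record<string, Resolver>,
): { source: string; value: string } | null {
  for (const name of order) {
    const resolver = sources[name];
    if (!resolver) continue;
    const value = resolver();
    if (value !== null) return { source: name, value };
  }
  return null; // with optional: true, outputVar is simply left unset
}
```

This ordering is also why coordinator intercept has to be engine-level: the cheap "already provided?" check must run before any search does.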
1506
+
1507
+ **Why engine-level:** Coordinator intercept requires the engine to check "has this type already been provided?" before running any search. A routine can't express that.
1508
+
1509
+ **Things to hash out:**
1510
+ - What is the contract between a `context-gather` step and the workflow steps that consume `outputVar`? If the step is `optional: true` and returns nothing, downstream steps that reference `shapedInput` get an empty value -- is that safe?
1511
+ - The `task-url` source extracts URLs from the task description and fetches them. This is a network call at engine level. Who is responsible for auth, rate limiting, and error handling for remote fetches?
1512
+ - The `glean` source is opt-in only. What is the opt-in mechanism -- a daemon config flag, a workflow declaration, or a user preference?
1513
+ - How does the engine signal to the agent that context was gathered successfully vs not found? Is this visible in the step prompt, or does the agent need to check `outputVar` itself?
1514
+ - Can a `context-gather` step block session start if a required source is unavailable, or should it always succeed (possibly with an empty result)?
1515
+
1516
+ ---
1517
+
1518
+ ## WorkRail MCP Server
1519
+
1520
+ The stdio/HTTP MCP server that Claude Code (and other MCP clients) connect to. MUST be bulletproof -- crashes kill all in-flight Claude Code sessions.
1521
+
1522
+ ### Multi-root workflow discovery and setup UX
1523
+
1524
+ **Status: designing** | Priority: medium
1525
+
1526
+ **Score: 7** | Cor:1 Cap:2 Eff:1 Lev:1 Con:2 | Blocked: no
1527
+
1528
+ Simplify third-party and team workflow hookup by requiring explicit `workspacePath`, silently remembering repo roots in user-level `~/.workrail/config.json`, recursively discovering team/module `.workrail/workflows/` folders under remembered roots, and improving grouped source visibility / precedence explanations.
1529
+
1530
+ **Current recommendation:**
1531
+ - Phase 1: Rooted Team Sharing + minimal Source Control Tower
1532
+ - Require explicit workspace identity
1533
+ - Silently persist repo roots at the user level
1534
+ - Support cross-repo workflows from remembered roots
1535
+ - Make remote repos default to managed-sync mode rather than pinned snapshots or live-remote behavior
1536
+ - Treat Slack/chat/file/zip sharing as an ingestion path that classifies into repo, file, pack, or snippet flows
1537
+ - Design the backend so the console can eventually manage and explain the remembered/discovered source model
1538
+
1539
+ **Additional idea:** explore enterprise auth / SSO integration for private repo access, such as Okta-backed flows for GitHub Enterprise, GitLab, or other self-hosted providers. Main question: should WorkRail integrate directly with identity providers like Okta, or should it integrate one layer lower with Git hosts / credential helpers that are already SSO-aware?
1540
+
1541
+ **Design doc:** `docs/ideas/third-party-workflow-setup-design-thinking.md`

---

## Console

### Workflows tab: incorrect source attribution for bundled workflows (Apr 21, 2026)

**Status: bug** | Priority: low

**Score: 7** | Cor:1 Cap:1 Eff:2 Lev:1 Con:2 | Blocked: no

The Workflows tab shows bundled workflows (e.g. `coding-task-workflow-agentic`) as coming from "User Library" instead of "WorkRail Built-in". This is a WorkRail MCP server issue, not a WorkTrain issue.

**Likely cause:** The `source.kind` field is incorrectly set when a workflow exists in both the bundled set AND a user's managed sources or remembered roots.

**Where to look:**
- `src/infrastructure/storage/schema-validating-workflow-storage.ts` -- source kind propagation
- `src/mcp/handlers/shared/workflow-source-visibility.ts` -- display label mapping in `list_workflows`
- `src/infrastructure/storage/file-workflow-storage.ts` -- how `source.kind` is assigned when loading from disk

---
1563
+
1564
+ ### Task picker mode: browse and launch available work (Apr 29, 2026)
1565
+
1566
+ **Status: idea** | Priority: high
1567
+
1568
+ **Score: 10** | Cor:1 Cap:3 Eff:2 Lev:1 Con:3 | Blocked: no
1569
+
1570
+ **Problem:** Once WorkTrain is configured (workspace set up, triggers.yml written, daemon running), there is still no easy way to say "run this workflow now" from the console. Dispatch requires knowing the API or writing a webhook. The console has a dispatch endpoint but no UI to drive it.
1571
+
1572
+ **Vision:** A console panel that lists the triggers already configured in triggers.yml and lets the user click one to fire it immediately -- without leaving the browser, without touching the API, without writing YAML.
1573
+
1574
+ **How it works:**
1575
+ 1. Console calls `GET /api/v2/triggers` to list all triggers loaded by the daemon.
1576
+ 2. User sees a list: trigger ID, workflow, goal, last-fired timestamp. Clicks "Run".
1577
+ 3. Console POSTs to `/api/v2/auto/dispatch` (already implemented) with the trigger's workflowId + goal + workspace.
1578
+ 4. New session appears in the session list immediately. User watches the DAG advance live.
1579
+ 5. On completion: outcome, PR link (if opened), and step notes all visible in the same panel.
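The console side of step 3 is a small request builder plus a POST. `/api/v2/auto/dispatch` exists per the above; the trigger shape and body fields are assumptions about its contract:

```typescript
// Assumed shape of one entry returned by GET /api/v2/triggers.
interface TriggerSummary { id: string; workflowId: string; goal: string; workspace: string }

// Build the dispatch request from a trigger the user clicked "Run" on.
function buildDispatchRequest(t: TriggerSummary): { url: string; body: string } {
  return {
    url: '/api/v2/auto/dispatch',
    body: JSON.stringify({ workflowId: t.workflowId, goal: t.goal, workspace: t.workspace }),
  };
}

// The console would then issue:
//   await fetch(req.url, { method: 'POST', headers: { 'content-type': 'application/json' }, body: req.body });
```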
1580
+
1581
+ **What this is not:** An onboarding wizard or zero-setup flow -- the daemon and environment must already be configured. This is a dispatch surface for *already-configured* users who want to trigger work without using the CLI or waiting for a webhook.
1582
+
1583
+ **Why it matters:** Makes the console a control plane, not just a read-only viewer. The daemon gains a "run this now" button. Users get to watch the agent work in real time, which builds confidence before trusting it on unattended tasks.
1584
+
1585
+ **Dependency:** `GET /api/v2/triggers` endpoint (returns the live trigger index -- may need to be added). `POST /api/v2/auto/dispatch` already exists. No new daemon work required.
1586
+
1587
+ **Things to hash out:**
1588
+ - When the user clicks "Run" on a trigger that requires a dynamic goal (not a static one), where does the goal come from? Is there a text input, or is it required to be a static-goal trigger?
1589
+ - Should manual dispatch from the console count against `maxConcurrentSessions`? Or is it a privileged path that bypasses the queue?
1590
+ - The console is described as read-only in AGENTS.md. Does adding dispatch capability change its security model? Is there authentication needed before dispatch is permitted?
1591
+ - If the daemon is not running when the user clicks "Run", what is the UX? Silent failure, immediate error, or auto-start attempt?
1592
+ - Should this panel also allow stopping or pausing running sessions, or is dispatch the only write operation?
1593
+
1594
+ ---
1595
+
1596
+ ### Console interactivity and liveliness
1597
+
1598
+ **Status: idea** | Priority: medium
1599
+
1600
+ **Score: 8** | Cor:1 Cap:1 Eff:2 Lev:1 Con:3 | Blocked: no
1601
+
1602
+ **Key areas:**
1603
+ - **DAG node hover effects** -- nodes in `RunLineageDag` should have visible hover states: border brightens, subtle glow, cursor changes to pointer. This is the single highest-impact item.
1604
+ - **Node selection highlight** -- selected node should pulse or glow, not just change border color
1605
+ - **Live session pulse** -- sessions with `status: in_progress` could have a subtle periodic animation
1606
+ - **Tooltip polish** -- fade in/out rather than appearing instantly
1607
+
1608
+ **Design constraint:** Dark navy, amber accent aesthetic. Additions should reinforce this language.
1609
+
1610
+ **Where to start:** `console/src/components/RunLineageDag.tsx`. The tooltip pattern (`handleNodeMouseEnter`/`handleNodeMouseLeave`) already exists; a hover glow is a natural peer addition.
1611
+
1612
+ **Related:** `docs/design/console-cyberpunk-ui-discovery.md`, `docs/design/console-ui-backlog.md`
1613
+
1614
+ **Things to hash out:**
1615
+ - CSS animations on many simultaneously live nodes can cause layout thrash and frame drops. Is there a performance budget or a maximum animated-node count before animations are disabled?
1616
+ - The dark navy + amber aesthetic is established but not formally documented as a design token system. Should a design token file be established before adding more visual elements?
1617
+ - Live session pulse animations may be distracting when many sessions are running. Should animation be suppressible via a user preference?
1618
+
1619
+ ---
1620
+
1621
+ ### Console engine-trace visibility and phase UX
1622
+
1623
+ **Status: idea** | Priority: medium
1624
+
1625
+ **Score: 8** | Cor:1 Cap:2 Eff:2 Lev:1 Con:2 | Blocked: no
1626
+
1627
+ **Gap:** Users currently see only `node_created`/`edge_created`, which makes legitimate engine behavior look like missing workflow phases. Fast paths, skipped phases, condition evaluation, and loop gates are invisible.
1628
+
1629
+ **Recommended direction:**
1630
+ - Keep phases as authoring/workflow-organization concepts
1631
+ - Add an engine-trace/decision layer showing: selected next step, evaluated conditions, entered/exited loops, important run context variables (e.g. `taskComplexity`), skipped/bypassed planning paths
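None of these decisions are persisted as session events today, so the trace layer implies new event types. A hypothetical sketch of what those could look like, plus the one-line-per-decision rendering a compact "engine decisions" strip might use (all names are assumptions):

```typescript
// Hypothetical trace event types; none of these exist in the session store yet.
type TraceEvent =
  | { type: 'condition_evaluated'; stepId: string; expression: string; result: boolean }
  | { type: 'step_skipped'; stepId: string; reason: string }
  | { type: 'loop_iteration'; loopId: string; iteration: number };

// Render one compact line per decision for the "engine decisions" strip.
function summarize(events: TraceEvent[]): string[] {
  return events.map(e => {
    if (e.type === 'condition_evaluated') return `cond ${e.expression} -> ${e.result}`;
    if (e.type === 'step_skipped') return `skip ${e.stepId} (${e.reason})`;
    return `loop ${e.loopId} #${e.iteration}`;
  });
}
```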

**Phase 1:** Extend console service/DTOs with a run-scoped execution-trace summary. Show a compact "engine decisions" strip or timeline above the DAG.

**Phase 2:** Richer explainability timeline with branches, skipped phases, condition results. Toggle between "execution DAG" and "engine trace" views.

**Things to hash out:**
- Engine decisions (evaluated conditions, skipped steps) are not currently captured as session events -- they exist only in memory during the run. What new event types need to be added to the session store to make this work?
- How does the "engine decisions" strip stay useful without becoming overwhelming for complex workflows with many branches and loop iterations?
- Should condition variable values (e.g. `taskComplexity=Small`) be visible in the trace? This surfaces potentially sensitive session context in a UI accessible to anyone with console access.
- Is Phase 2 (toggle between DAG and trace views) a separate ticket, or is it part of the same design effort as Phase 1?

---

### Console ghost nodes (Layer 3b)

**Status: idea** | Priority: low

**Score: 7** | Cor:1 Cap:1 Eff:2 Lev:1 Con:2 | Blocked: no

Ghost nodes represent steps that were compiled into the DAG but skipped at runtime due to `runCondition`. Currently the DAG just shows fewer nodes with no indication of what was bypassed. Layer 3b would render skipped nodes as faded/ghost elements with a tooltip explaining the skip condition.

**Things to hash out:**
- Ghost nodes require knowing which nodes were compiled but skipped. Does the engine currently emit any event for skipped nodes, or is this information lost after compilation?
- For workflows with many conditional branches, ghost nodes could double or triple the visual complexity of the DAG. Is there a layout strategy that keeps it readable?
- Should ghost nodes be shown by default, or hidden behind a toggle? What is the right default for users who are not debugging a skip?
1657
+
1658
+ ---
1659
+
1660
+ ## Workflow Library
1661
+
1662
+ ### Automatic root cause analysis when MR review finds issues post-coding (Apr 30, 2026)
1663
+
1664
+ **Status: idea** | Priority: high
1665
+
1666
+ **Score: 13** | Cor:3 Cap:3 Eff:2 Lev:3 Con:2 | Blocked: no
1667
+
1668
+ When an MR review session (run by a WorkTrain agent) finds issues in a coding session's output, WorkTrain should automatically investigate why the coding agent missed it and determine whether the workflow, the prompts, or the process can be improved.
1669
+
1670
+ **Two distinct triggers:**
1671
+
1672
+ 1. **WorkTrain MR review finds something**: after a WorkTrain review session produces findings, the coordinator should automatically spawn an analysis session asking: why did the coding agent produce code with this issue? Was it a workflow gap (missing verification step, insufficient scrutiny at a phase), a prompt gap (the agent wasn't told to check this), or a context gap (the agent didn't have the information needed)?
1673
+
1674
+ 2. **Human finds something post-review**: when a human reviewer comments on or requests changes to a PR that already passed WorkTrain's review, this is doubly significant -- it means both the coding agent AND the review agent missed it. WorkTrain should automatically investigate why both missed it and whether the review workflow has a systematic blind spot.
1675
+
1676
+ **Why this matters**: every finding that slips through is a signal about a workflow or process gap. Today that signal is lost. Capturing it systematically and feeding it back into workflow improvement closes the quality loop.
1677
+
1678
+ **Things to hash out:**
1679
+ - How does WorkTrain detect that a human has commented on a PR post-review? This requires monitoring the PR for new review activity after WorkTrain's session completed -- either webhook events or polling.
1680
+ - What does the analysis session actually produce? A structured finding about the gap? A concrete proposal for workflow improvement? Both?
1681
+ - Who reviews the analysis output before it becomes a workflow change? Auto-applying workflow changes based on analysis is risky.
1682
+ - How do you distinguish "the workflow is fine but this was a genuinely hard edge case" from "the workflow has a systematic gap"? A single miss doesn't prove a gap; multiple misses of the same kind do.
1683
+ - Should the analysis result feed directly into `workflow-effectiveness-assessment`, or is it a separate concern?
1684
+ - For the "coding agent missed it" case: is the right fix to change the coding workflow, or to make the review workflow more adversarial?
1685
+
1686
+ ---
1687
+
1688
+ ### Workflow previewer for compiled and runtime behavior
1689
+
1690
+ **Status: idea** | Priority: medium
1691
+
1692
+ **Score: 8** | Cor:1 Cap:2 Eff:2 Lev:1 Con:2 | Blocked: no
1693
+
1694
+ Add a workflow previewer for the `workflows/` directory that shows what a workflow actually compiles to and how the engine can traverse it at runtime.
1695
+
1696
+ **Why:** Authors currently have to mentally reconstruct branching, loops, blocked-node behavior, and other runtime structure from authored JSON plus tests. Advanced workflow authoring gets much easier when the compiled DAG and runtime edges are visible.
1697
+
1698
+ **What it should show:** compiled step graph/DAG; branch points and condition-driven paths; loop structure and loop-control edges; blocked/resumed/checkpoint-related node shapes; template/routine expansion boundaries; the gap between authored JSON structure and runtime execution structure.
1699
+
1705
+ Start as a read-only preview for bundled workflows; optimize for accuracy over polish.
1706
+
1707
+ **Things to hash out:**
1708
+ - Should the previewer live in the existing Console, as a dev-only page, or as a local authoring utility (CLI command)?
1709
+ - Should it show only the compiled DAG, or also annotate likely runtime transitions such as blocked attempts, rewinds, and loop continuations?
1710
+ - How much provenance should it expose for injected routines/templates? Is it useful to show the boundary between authored steps and expanded routine steps?
1711
+ - Does the previewer need to show all possible DAG paths, or only the "happy path"? For deeply conditional workflows, all-paths could be very large.
1712
+ - Is this only useful during workflow authoring, or also useful for operators who want to understand a running session's possible future states?
1713
+
1714
+ ---
1715
+
1716
+ ### Native assessment / decision gates for workflows
1717
+
1718
+ **Status: idea** | Priority: medium
1719
+
1720
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no
1721
+
1722
+ Add a first-class workflow primitive for structured assessments that can drive routing. The agent would assess a small set of named dimensions, give short rationales, and let the engine use explicit aggregation/gate rules to influence continuation, follow-up, branching, or final confidence.
1723
+
1724
+ **Why:** Some workflow decisions are clearer and more auditable as small assessment matrices than as long prompt prose. Confidence computation is a strong example: workflows may want to derive final confidence from dimensions like boundary, intent, evidence, coverage, and disagreement.
1725
+
1726
+ **Near-term shape:** keep reasoning with the agent, but let the workflow declare named assessment dimensions and allowed levels such as `High | Medium | Low`. Let the agent provide one short rationale per dimension. Let the engine compute caps/next actions/routing outcomes from explicit gate rules.
1727
+
1728
+ **Ownership split:** the agent assesses each dimension and gives the short rationale; the engine applies declared gate rules.
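One way to make the split concrete, assuming a "lowest level wins" aggregation rule (the `Level` type, dimension names, and rule are illustrative, not a decided design):

```typescript
type Level = 'High' | 'Medium' | 'Low';

// The agent supplies one of these per declared dimension.
interface Assessment { dimension: string; level: Level; rationale: string }

const rank: Record<Level, number> = { Low: 0, Medium: 1, High: 2 };

// Engine-side gate rule: final confidence is capped at the weakest dimension.
// The agent never computes this; it only provides levels and rationales.
function finalConfidence(assessments: Assessment[]): Level {
  return assessments.reduce<Level>(
    (acc, a) => (rank[a.level] < rank[acc] ? a.level : acc),
    'High',
  );
}
```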

**Good early use cases:** MR review confidence assessment; planning readiness/confidence gates; debugging confidence and next-step routing; block-vs-continue/revisit-earlier-step decisions.

**Design questions:** should this be a narrow `assessmentGate` primitive or a more generic structured decision-table feature? Should reusable matrices be inline first, or backed by repo-owned refs? How should assessment provenance and rationales appear in compiled/runtime traces?

**Things to hash out:**
- When the agent provides a rationale for each dimension, is that rationale stored in the session event log and surfaced in the console? Or is it ephemeral?
- How does the engine enforce that the agent assessed all required dimensions before advancing? Is this a schema-validated output contract, or a soft expectation?
- If the engine applies gate rules and routes the session differently than the agent expected, how is that decision communicated back to the agent in the next step's context?
- Are assessment dimensions per-workflow or could they be shared across workflows via a named reference? What is the right reuse model?
- What is the relationship between this primitive and the existing `assessmentConsequenceTrigger` in assessment gates v1?

---

### Engine-injected note scaffolding

**Status: idea** | Priority: low

**Score: 7** | Cor:1 Cap:1 Eff:2 Lev:1 Con:2 | Blocked: no

Add an opt-in execution-contract or note-structure feature that helps agents produce compact notes useful to both humans and future resume agents.

Some workflows want notes to consistently capture current understanding, key findings, decisions, uncertainties, and next-step implications. This is related to assessment-driven routing, but it is a different product concern.

**Open question:** should note scaffolding live as a separate execution-contract feature, or share underlying primitives with assessment gates?

**Things to hash out:**
- What does "opt-in" mean here -- a workflow-level flag, a step-level annotation, or a per-session config? Who decides whether a given workflow gets note scaffolding?
- Note structure injects requirements into what the agent writes. Does this constrain the agent's ability to express nuanced or non-standard findings that don't fit the scaffold?
- Are scaffolded notes stored differently from unstructured notes, or is the structure a soft suggestion that gets serialized the same way?
- If the scaffold template changes between workflow versions, are older session notes still readable/comparable to newer ones?

---

### Agent-reportable workflow bugs (Apr 28, 2026)

**Status: idea** | Priority: high

**Score: 10** | Cor:2 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no

Agents encounter problems with WorkTrain itself during runs -- confusing step prompts, broken output contracts, workflow logic that doesn't match the actual task, MCP tool bugs, unclear instructions. Right now there's no structured way for an agent to surface these. They either silently work around the issue or get stuck.

**Proposal:** a mechanism for agents to report problems with the WorkRail system itself during a session -- distinct from `report_issue` (which is for the task). These reports should be visible to the operator and feed into workflow improvement.

**Things to hash out:**
- How does an agent decide whether a problem is a workflow bug vs a task obstacle? The boundary is fuzzy -- a confusing step prompt might just be a hard task.
- Does surfacing this tool change agent behavior in undesirable ways? Agents might blame the workflow instead of solving the problem.
- Should reports survive session cleanup, or is their lifetime tied to the session?
- Who owns acting on these reports -- the operator, the workflow author, or an automated system?
- Should this be available in interactive (MCP) sessions, or daemon sessions only?

---

### Per-run workflow improvement retrospective (Apr 28, 2026)

**Status: idea** | Priority: high

**Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no

Every workflow run is an opportunity to learn. At the end of each session, the agent has unique insight into what worked, what was unclear, what slowed it down, and what a better version of the workflow would look like. This insight currently evaporates when the session ends.

**Proposal:** at the end of each session, the agent gets an explicit opportunity to reflect on the process itself -- what was confusing, what took longer than it should have, what context was missing, and what it would change about the workflow.

**Things to hash out:**
- Is agent reflection on its own process reliable? Agents may lack the self-awareness to accurately identify what went wrong, or may default to saying everything was fine.
- Does this add unacceptable cost or latency for short/fast workflows? Should it be conditional on certain outcomes (e.g. only after a stuck or timeout result)?
- How does retrospective data get used? Who reads it, and does it feed automatically into workflow improvement proposals or require human triage first?
- Risk of agents gaming it -- saying the workflow was perfect to appear compliant rather than critical.
- Should this be opt-in per workflow, universal, or triggered by specific signals during the run?

---

### Verification and proof as first-class citizens (Apr 15, 2026)

**Status: idea** | Priority: high

**Score: 10** | Cor:1 Cap:3 Eff:1 Lev:3 Con:2 | Blocked: yes (needs coordinator infrastructure)

**The problem:** today there's no single place that tells you "here's everything that was done to verify this feature is correct." Tests pass, a review ran, an audit happened -- but it's scattered across session notes, PR descriptions, CI logs, and half-remembered conversations. No verification chain.

**The vision:** every shipped change has a **proof record** -- a structured document that answers: what was built, how was it verified, by whom (which agents), and what was the verdict at each gate. Not a summary for humans -- a queryable record that the coordinator and watchdog can use to enforce quality gates.

A proof record contains: `prNumber`, `goal`, `verificationChain` (array of `{ kind, outcome, findings, sessionId, timestamp }`), `gates` (unit_tests, mr_review, production_audit, architecture_audit), `overallVerdict`, `mergedAt`.
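Spelled out as types, with a merge-gate check of the kind the coordinator might run. Only the field names come from the description above; the field types, the `'missing'` state, and the `mergeAllowed` rule are assumptions:

```typescript
interface VerificationEntry {
  kind: string;               // e.g. 'unit_tests', 'mr_review'
  outcome: 'pass' | 'fail' | 'skipped';
  findings: string[];
  sessionId: string;
  timestamp: number;          // epoch ms
}

type GateName = 'unit_tests' | 'mr_review' | 'production_audit' | 'architecture_audit';

interface ProofRecord {
  prNumber: number;
  goal: string;
  verificationChain: VerificationEntry[];
  gates: Record<GateName, 'pass' | 'fail' | 'skipped' | 'missing'>;
  overallVerdict: 'approved' | 'blocked' | 'incomplete';
  mergedAt: number | null;
}

// A gate that ran but was never recorded surfaces as 'missing'; a cautious
// coordinator refuses to merge on 'missing' as well as 'fail'.
function mergeAllowed(record: ProofRecord): boolean {
  return Object.values(record.gates).every(g => g === 'pass' || g === 'skipped');
}
```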

**Verification gates the coordinator enforces:**
| Gate | Required for |
|------|-------------|
| Unit tests pass | All changes |
| MR review approved (no Critical/Major) | All changes |
| Architecture audit | `touchesArchitecture=true` or `riskLevel=High` |
| Production audit | `riskLevel=High` or affects prod paths |
| Security audit | touches auth/input/external |

**Visibility surfaces:** Console PR view (full verification chain, expandable to session notes); `worktrain verify <pr-number>` command; proof record section in every PR description ("Verification chain: 14 unit tests | MR review (0 findings) | Production audit | Architecture audit (skipped: riskLevel=Low)").

**Why this matters:** "Has this been reviewed and audited?" becomes a query against proof records rather than reading through PRs and session notes. The knowledge graph stores these records. The watchdog checks them on a schedule. The coordinator gates on them before merging. Verification becomes infrastructure, not process.

**Things to hash out:**
- Proof records are associated with PRs, but WorkTrain sessions may span multiple PRs, or a PR may be created by a human after WorkTrain's work. How is the PR-to-session mapping established?
- Who writes the proof record -- the coordinator script (after each gate completes), the delivery pipeline (at merge time), or both incrementally?
- What is the storage model for proof records -- append-only event log (like sessions), a separate file per PR, or entries in the knowledge graph? Each has different query characteristics.
- "The coordinator gates on them before merging" requires the coordinator to read the proof record at merge time. What happens when the proof record is incomplete (a gate ran but its result was not recorded)?
- How does this interact with PRs that are merged manually by humans, bypassing the coordinator's merge gate? The proof record would be incomplete but the merge already happened.

---

### Scripts-first coordinator: avoid the main agent wherever possible (Apr 15, 2026)

**Status: idea** | Priority: high

**Score: 12** | Cor:1 Cap:3 Eff:2 Lev:3 Con:3 | Blocked: no

**The insight:** In a coordinator workflow, the main agent spends most of its time on mechanical work -- reading PR lists, checking CI status, deciding whether findings are blocking, sequencing merges. That's all deterministic logic. An LLM is expensive, slow, and inconsistent for deterministic work.

**The principle:** the scripts-over-agent rule applies at the coordinator level too. The coordinator's job is to drive a DAG of child sessions. The DAG structure, routing decisions, and termination conditions should be scripts, not LLM reasoning.

**What this means concretely:** a coordinator script that calls `gh pr list`, spawns MR review sessions, awaits them, parses findings JSON, routes (clean -> merge queue, minor -> spawn fix agent, blocking -> escalate), awaits fix agents, and executes merge sequence when queue is empty. The LLM is only invoked for leaf work -- the actual MR review, the actual coding fix.

**What WorkTrain provides:**
- `worktrain spawn --workflow <id> --goal <text>` -> prints sessionHandle
- `worktrain await --sessions <handle1,handle2>` -> prints structured results JSON
- `worktrain merge --pr <number>` -> runs the merge sequence

The coordinator "workflow" is then a shell script or TypeScript file. Fully deterministic, fully auditable, no tokens burned on routing decisions.
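The routing core of such a script is a few lines of deterministic TypeScript. Spawning and awaiting would shell out to the CLI commands above; only the severity routing is sketched here, with severity names taken from the text and the function itself illustrative:

```typescript
type Severity = 'Critical' | 'Major' | 'Minor' | 'Nit';

// Deterministic routing: clean -> merge queue, minor/nit -> spawn fix agent,
// blocking -> escalate to a human. No LLM involved.
function route(findings: Severity[]): 'merge' | 'fix' | 'escalate' {
  if (findings.some(f => f === 'Critical' || f === 'Major')) return 'escalate';
  if (findings.length > 0) return 'fix';
  return 'merge';
}
```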

**Build order:** `worktrain spawn`/`worktrain await` CLI commands; structured output format for leaf sessions (handoff artifact JSON block already exists); a reference `coordinator-groom-prs.sh` as the first coordinator template; Console DAG view updated to show coordinator-script-spawned sessions with parent-child relationships.

**Things to hash out:**
- `worktrain spawn` prints a `sessionHandle`. What is the format of this handle -- a session ID, an opaque token, or a structured JSON blob? The answer affects whether it can be safely passed between processes.
- `worktrain await` blocks until sessions complete. What is the behavior when a session crashes mid-run -- does `await` eventually return with an error, or block indefinitely?
- The coordinator is a shell script or TypeScript file, not a workflow. How does the coordinator's own execution get tracked in the session store or event log? Is it visible in the console?
- If the coordinator script is invoked by a trigger, who is responsible for the coordinator's lifecycle -- the daemon, or the OS (via launchd/cron)?
- How does a coordinator script handle partial failures (2 of 5 child sessions failed)? Is the failure handling logic in the script, or does WorkTrain provide a structured retry primitive?
1862
+
1863
+ ---
1864
+
1865
+ ### Full development pipeline: coordinator scripts drive multi-phase autonomous work (Apr 15, 2026)
1866
+
1867
+ **Status: idea** | Priority: high
1868
+
1869
+ **Score: 10** | Cor:1 Cap:3 Eff:2 Lev:2 Con:2 | Blocked: yes (needs classify-task workflow + scripts-first coordinator)
1870
+
1871
+ The full pipeline DAG for feature implementation, driven by a coordinator script:
1872
+
1873
+ ```
1874
+ trigger: "implement feature X"
1875
+ -> [always] classify-task
1876
+ outputs: taskComplexity, riskLevel, hasUI, touchesArchitecture
1877
+ -> [if taskComplexity != Small] discovery
1878
+ -> [if hasUI] ux-design
1879
+ -> [if touchesArchitecture] architecture-design + arch-review (parallel)
1880
+ -> [always] coding-task (inputs: context bundle + design spec + arch decision)
1881
+ -> [always] mr-review
1882
+ -> [if clean] auto-commit -> auto-pr -> merge
1883
+ -> [if Minor/Nit] -> spawn fix agent -> re-review (max 3 passes)
1884
+ -> [if Critical/Major] -> escalate to human
1885
+ -> [if riskLevel == High] prod-risk-audit
1886
+ -> [if merged] notify
1887
+ ```
1888
+
1889
+ **The key insight:** the coordinator script reads `taskComplexity`, `riskLevel`, `hasUI`, and `touchesArchitecture` from the classify step's output and decides which phases to spawn. A one-line bug fix runs: classify -> coding-task -> mr-review. A new UI feature runs everything. Zero coordinator LLM calls.
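As a sketch, the phase-selection logic is a small pure function over the classify outputs. The field and phase names come from the DAG above; the function itself is hypothetical:

```typescript
// Illustrative sketch: deriving the phase list from classify-task outputs.
// taskComplexity, riskLevel, hasUI, touchesArchitecture mirror the DAG;
// the function and its return shape are assumptions, not shipped code.
interface Classification {
  taskComplexity: "Small" | "Medium" | "Large";
  riskLevel: "Low" | "Medium" | "High";
  hasUI: boolean;
  touchesArchitecture: boolean;
}

function phasesFor(c: Classification): string[] {
  const phases: string[] = ["classify-task"];
  if (c.taskComplexity !== "Small") phases.push("discovery");
  if (c.hasUI) phases.push("ux-design");
  if (c.touchesArchitecture) phases.push("architecture-design", "arch-review");
  phases.push("coding-task", "mr-review");
  if (c.riskLevel === "High") phases.push("prod-risk-audit");
  return phases;
}
```

A one-line bug fix classifies as Small/Low and gets exactly the short path; every extra phase has to be earned by a classification flag.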

**The missing workflow:** `classify-task-workflow` -- fast, 1-step, outputs taskComplexity/riskLevel/hasUI/touchesArchitecture. This is the single most important missing workflow -- without it, the coordinator has to spawn everything for every task, which is wasteful.

**Things to hash out:**
- The coordinator script is described as "scripts, not LLM" -- but the pipeline DAG itself requires reading and interpreting `classify-task-workflow` outputs. Who validates that the script correctly handles all classification outcomes?
- What is the fallback when `classify-task-workflow` fails or returns an inconclusive result? Does the pipeline abort, escalate, or default to the most conservative path?
- How are errors in the coordinator script itself handled? A bug in the script could skip phases silently or merge without required gates.
- Should the pipeline support human checkpoints between phases (e.g. "approve before coding starts"), or is it fully autonomous by design?
- Who owns the coordinator script -- the workflow author, the workspace operator, or WorkTrain itself? Different owners have different update cadences.

---

### Additional coordinator pipeline templates (Apr 15, 2026)

**Status: idea** | Priority: medium

**Score: 9** | Cor:1 Cap:3 Eff:2 Lev:1 Con:2 | Blocked: yes (needs scripts-first coordinator)

Beyond the feature implementation pipeline, three more coordinator templates are high value:

**Backlog grooming coordinator:**
```
trigger: "groom backlog" (cron: weekly, or manual dispatch)
-> [for each open issue] classify-issue -> label-and-size
-> [for stale issues > 90 days] auto-close-or-ping
-> [for duplicate issues] detect-duplicates
-> [for high-priority bugs with no assignee] spawn bug-investigation-agentic
-> produce grooming summary -> post weekly digest to Slack
```

**Bug investigation + fix coordinator:**
```
trigger: new issue labeled "bug" OR incident alert
-> bug-investigation-agentic
     outputs: root cause hypothesis, affected files, severity, confidence
-> [if severity == Critical] page-oncall
-> [if severity <= High and hypothesis_confidence >= 0.8] attempt-fix
     -> coding-task-workflow-agentic
     -> mr-review -> [if clean] auto-commit -> auto-pr
-> close-or-update-issue
```

The daemon can go from "bug filed" to "fix merged" with zero human involvement for well-understood bugs with high-confidence hypotheses. The `hypothesis_confidence` output from the investigation gates the auto-fix attempt.

**Incident monitoring coordinator:**
```
trigger: monitoring alert (CPU spike, error rate, latency P99 > threshold)
-> triage-alert (classify real incident vs noise)
-> [if isRealIncident] investigate
-> [if mitigation is config change] auto-mitigate (NEVER auto-rollback code without human approval)
-> page-oncall with full context + session DAG link
```

The operator gets paged with a complete picture: what happened, likely why, what was already done automatically, and exactly what decision they need to make.

**Things to hash out:**
- The backlog grooming coordinator auto-closes stale issues. What prevents it from closing issues that are still relevant but have no recent activity by design (e.g. long-term architectural items)?
- The bug investigation + fix path is fully autonomous when `hypothesis_confidence >= 0.8`. How is that threshold validated? What is the cost of a false positive (fixing the wrong thing) at that confidence level?
- "NEVER auto-rollback code without human approval" is a correct hard rule, but "auto-mitigate (config change)" is still a significant action. Who defines what qualifies as a safe config change vs a risky one?
- The incident monitoring coordinator pages oncall. What is the integration path for paging -- PagerDuty, Slack, email? Is the paging mechanism configurable per workspace?
- How do these coordinator templates relate to the general-purpose scripts-first coordinator concept? Are they instances of the same pattern, or separate implementations?

---

### Interactive ideation: WorkTrain as a thinking partner with full project context (Apr 15, 2026)

**Status: idea** | Priority: medium

**Score: 7** | Cor:1 Cap:1 Eff:1 Lev:2 Con:2 | Blocked: yes (needs knowledge graph + project memory)

The ability to have a conversation with WorkTrain with full awareness of what's been built, what's in flight, what's in the backlog, and what decisions were made and why. Unlike Claude Code, WorkTrain already has: the session store (every step note from every session), the knowledge graph, the backlog, and in-flight agent state.

**What it needs:**
1. **A `worktrain talk` command** -- opens an interactive session that starts with a synthesized context bundle: recent session outcomes, open PRs, backlog top items, any findings from in-flight agents.
2. **Project memory** -- WorkTrain maintains a synthesized "project state" updated after each major session batch. Answers questions like "what did we build today?", "why did we choose polling triggers over webhooks?", "what's the biggest gap right now?"
3. **Idea capture** -- when the conversation surfaces something new, WorkTrain offers to record it to the backlog or open a GitHub issue.
4. **Context awareness** -- WorkTrain knows which agents are running, what they've found so far, and can report on it during a conversation.

**Architecture:** a `talk` workflow -- a conversational loop workflow with no fixed step count. The agent has access to `query_knowledge_graph`, `read_session_notes`, `read_backlog`, `list_in_flight_agents`, and `append_to_backlog` as tools.

**Things to hash out:**
- A conversational loop with no fixed step count could run indefinitely. What terminates a `worktrain talk` session -- user command, inactivity timeout, or a max-turns cap?
- The `append_to_backlog` tool modifies `docs/ideas/backlog.md`, which is a protected file per AGENTS.md. Is this an intentional exception for the talk workflow, or should the tool write to a separate ideas buffer?
- What is the "project state" synthesis cadence? After every session batch, continuously, or on demand? Who triggers it?
- How does `worktrain talk` handle sensitive information -- session notes may contain API keys, error messages with credential paths, or other private data. Is the talk session sandboxed?
- Does this replace `worktrain status` as the primary status surface, or do they serve different audiences?

---

### Automatic gap and improvement detection: proactive WorkTrain (Apr 15, 2026)

**Status: idea** | Priority: medium

**Score: 8** | Cor:1 Cap:2 Eff:1 Lev:2 Con:2 | Blocked: yes (needs knowledge graph + scheduled tasks)

WorkTrain notices things without being asked. After a batch of work lands, it scans for gaps, inconsistencies, missed connections, and improvement opportunities -- and surfaces them proactively.

**Two modes:**
1. **Event-triggered scans** -- fire after significant events (batch of PRs merge, new workflow authored, new bug filed, coordinator run completes)
2. **Periodic health checks** -- run on a schedule (weekly): are there backlog items with prerequisites met but not started? open issues actually already fixed by merged PRs? PRs approved but not merged for more than N days? stale knowledge graph?

**Architecture:** a `watchdog` workflow that runs on a cron trigger. Queries the knowledge graph, reads recent session notes, lists open PRs and issues, reads backlog priorities, produces a `gap-report.md` with actionable findings. Each finding is either: auto-actionable (spawn a fix agent), conversation-worthy (add to ideation queue), or escalation-worthy (post to Slack/file a GitHub issue).
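One hedged sketch of what a structured finding and its disposition could look like -- the field names and confidence threshold are assumptions, not a decided design:

```typescript
// Sketch of a structured watchdog finding. The three disposition names
// mirror the categories above; everything else (fields, the 0.9
// threshold, the reversibility rule) is assumed for illustration.
type Disposition = "auto-actionable" | "conversation-worthy" | "escalation-worthy";

interface GapFinding {
  title: string;
  confidence: number;  // 0..1, how sure the watchdog is
  reversible: boolean; // can the proposed action be undone cheaply?
}

// One possible safeguard: only high-confidence, cheaply reversible
// findings are eligible for autonomous action; confident but
// irreversible ones escalate; the rest go to the ideation queue.
function disposition(f: GapFinding): Disposition {
  if (f.confidence >= 0.9) {
    return f.reversible ? "auto-actionable" : "escalation-worthy";
  }
  return "conversation-worthy";
}
```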

**The key difference from the coordinator:** the coordinator executes a known plan. The watchdog discovers things that aren't in any plan yet.

**Things to hash out:**
- The watchdog decides which findings are "auto-actionable." What safeguards prevent it from autonomously spawning sessions for things that should require human judgment?
- How does the watchdog avoid creating duplicate work if the findings it surfaces are already tracked as open issues or active sessions?
- What is the frequency trade-off for event-triggered scans? Firing after every PR merge could spawn many watchdog sessions per day on an active repo.
- The gap report is currently described as a `.md` file. Should it instead be structured data (JSON/events) that the console or coordinator can process programmatically?
- Who clears or acknowledges watchdog findings? If nobody acts on them, do they accumulate silently?

---

### Native multi-agent orchestration: coordinator sessions + session DAG (Apr 15, 2026)

**Status: idea** | Priority: high

**Score: 10** | Cor:1 Cap:3 Eff:1 Lev:3 Con:2 | Blocked: no

Everything we can do manually today -- spawn parallel agents, chain discovery->implement->review->fix, react to findings, merge when clean -- WorkTrain should do natively, fully autonomously, with full observability.

**New primitives required:**

`spawn_session` tool (available inside workflow steps) -- starts a child session with a given workflowId + goal. Non-blocking -- returns a `sessionHandle` immediately.

`await_sessions` tool -- blocks until one or all of a set of session handles complete. Returns their results and output artifacts.

**Coordinator workflow pattern:**
```
Phase 1: Gather work items (open PRs, open issues, failing tests)
Phase 2: Spawn workers in parallel (one per work item)
Phase 3: Await all workers
Phase 4: Classify results -- clean/findings/blockers
Phase 5: Spawn fix agents where needed, await them, re-review (circuit breaker: max 3 attempts)
Phase 6: Execute final action (merge sequence, create summary, post to Slack)
```

**No-user-feedback policy logic:**
- Critical/Major finding -> block merge, spawn fix agent, re-review (max 3 passes), escalate if still failing
- Minor finding -> spawn fix agent if auto-fixable, else log and proceed
- Nit -> log, proceed without fix
- Clean -> queue for merge
- Circuit breaker -> after 3 failed fix attempts, post to Slack/GitLab and pause
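The policy above is small enough to be a pure function. A hedged sketch -- severity names come from the bullets, the return values and attempt counter are illustrative:

```typescript
// Sketch of the no-user-feedback policy as a deterministic function.
// Action strings and the fixAttempts parameter are assumptions.
type FindingSeverity = "Critical" | "Major" | "Minor" | "Nit" | "Clean";

function nextAction(
  severity: FindingSeverity,
  fixAttempts: number,
  autoFixable = true
): string {
  // Circuit breaker: after 3 failed fix attempts, stop and escalate.
  if (fixAttempts >= 3) return "escalate";
  switch (severity) {
    case "Critical":
    case "Major":
      return "spawn-fix-and-rereview";
    case "Minor":
      return autoFixable ? "spawn-fix-agent" : "log-and-proceed";
    case "Nit":
      return "log-and-proceed";
    case "Clean":
      return "queue-for-merge";
  }
}
```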

**Observability:** Console session tree (not flat list) showing coordinator and all children with parent-child relationships, status icons, and critical path.

**Build order:** `spawn_session` + `await_sessions` tools; parent-child session relationship in session store (`parentSessionId` field); Console DAG view for session tree; coordinator workflow templates.

**Things to hash out:**
- `spawn_session` inside a workflow step means the engine must support async child session lifecycle management. Does the engine orchestrate this, or is it the daemon's responsibility?
- If a child session fails, does the coordinator session receive the failure as a return value or as an exception? What is the Result type shape for `await_sessions`?
- How does the console DAG view handle a coordinator with 10+ parallel children? Is there a rendering strategy for large session trees?
- The circuit breaker (max 3 attempts) is described as a hard rule, but who configures it -- workflow author, coordinator script, or daemon policy?
- What is the relationship between `parentSessionId` in the session store and the `spawn_session` tool call? Is one derived from the other, or do they need to be kept in sync?

---

### Autonomous merge: WorkTrain approves and merges its own PRs after full vetting (Apr 15, 2026)

**Status: idea** | Priority: medium

**Score: 10** | Cor:1 Cap:3 Eff:2 Lev:2 Con:2 | Blocked: yes (needs proof records + verified CI integration)

After the full verification chain passes (unit tests, MR review clean, all required audits green), WorkTrain runs `gh pr review --approve && gh pr merge --squash` itself.

**The auto-merge policy (what makes it safe):**

Auto-merge allowed when ALL of:
- All required verification gates pass (defined by task classification)
- MR review: 0 Critical, 0 Major findings
- CI is green (all required checks pass)
- No `needs-human-review` label on the PR
- The PR was authored by a WorkTrain session (not a human)

Auto-merge blocked when ANY of:
- Any Critical or Major finding in any review/audit
- CI is failing
- Circuit breaker has fired (3+ fix attempts on same finding)
- `riskLevel=Critical`

Human always required for: schema changes, dependency upgrades (major version), infrastructure/CI/CD changes, changes to WorkTrain's own merge policy.
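The allow/block lists reduce to a deterministic predicate over a proof record. A sketch, assuming a hypothetical record shape -- the field names are not an existing WorkTrain schema:

```typescript
// Sketch of the auto-merge gate as a pure predicate. The ProofRecord
// shape is assumed for illustration; the rules mirror the policy above.
interface ProofRecord {
  criticalFindings: number;
  majorFindings: number;
  ciGreen: boolean;
  labels: string[];
  authoredByWorkTrain: boolean;
  fixAttempts: number;
  riskLevel: "Low" | "Medium" | "High" | "Critical";
}

function canAutoMerge(p: ProofRecord): boolean {
  if (p.criticalFindings > 0 || p.majorFindings > 0) return false;
  if (!p.ciGreen) return false;
  if (p.fixAttempts >= 3) return false;              // circuit breaker fired
  if (p.riskLevel === "Critical") return false;
  if (p.labels.includes("needs-human-review")) return false; // human override
  return p.authoredByWorkTrain;                       // never auto-merge human PRs
}
```

Every input is readable from the PR and the proof record, so the same record always produces the same verdict -- which is what makes the decision auditable.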

**The coordinator script merge gate:** checks the proof record before calling merge. The merge decision is deterministic. A human can always override by adding `needs-human-review`. Every auto-merge is appended to `~/.workrail/merge-log.jsonl`.

**Things to hash out:**
- WorkTrain approving its own PRs (`gh pr review --approve`) requires the authenticated user to have self-approval rights. This is explicitly denied in many enterprise Git setups. Is this a supported configuration, or is self-approval gated behind an explicit setting?
- The auto-merge policy excludes "changes to WorkTrain's own merge policy." How does this self-referential exception get enforced -- static analysis, file path check, or manual discipline?
- `merge-log.jsonl` is a critical audit record. What is its retention policy, and is it protected from accidental deletion?
- If the CI check suite includes flaky tests that are known to fail intermittently, the "CI is green" requirement could block merges indefinitely. Is there a policy for handling known-flaky tests?
- Should auto-merge be opt-in per workspace or per trigger, or is it always enabled when the policy conditions are met?

---

### Coordinator context injection standard: agents start informed, not discovering (Apr 18, 2026)

**Status: idea** | Priority: high

**Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no

Every coordinator-spawned agent gets a pre-packaged context bundle. The coordinator assembles it before calling `worktrain spawn`. The bundle includes:
1. **Prior session findings** -- what relevant sessions discovered (from session store query)
2. **Established patterns** -- the specific invariants and patterns the agent needs (from knowledge graph or AGENTS.md)
3. **What NOT to discover** -- explicit list of things already known so the agent doesn't waste turns
4. **Failure history** -- what's been tried and didn't work (prevents re-exploring dead ends)

~2000 tokens max, injected as a `<context>` block before the task description. Structured so the agent can skip Phase 0 context gathering entirely when the bundle is complete.

Without this: every agent spawned without proper context burns tokens on discovery that should have been provided upfront. At 10 concurrent agents, that's 10x the waste.
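A minimal sketch of the four-part bundle as a typed structure plus a renderer into the `<context>` block. All names are illustrative, and the ~4-characters-per-token budget conversion is a rough heuristic, not a measured figure:

```typescript
// Hypothetical bundle shape; the four sections mirror the list above.
interface ContextBundle {
  priorFindings: string[];
  establishedPatterns: string[];
  doNotRediscover: string[];
  failureHistory: string[];
}

// ~4 chars/token heuristic: the 2000-token budget becomes ~8000 chars.
function renderContextBlock(b: ContextBundle, maxChars = 8000): string {
  const body = [
    "## Prior findings", ...b.priorFindings,
    "## Established patterns", ...b.establishedPatterns,
    "## Already known (do not rediscover)", ...b.doNotRediscover,
    "## Tried and failed", ...b.failureHistory,
  ].join("\n");
  // Hard truncation keeps the bundle inside its budget; a real assembler
  // would rank and drop whole entries instead of cutting mid-line.
  return `<context>\n${body.slice(0, maxChars)}\n</context>`;
}
```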

**Things to hash out:**
- Who assembles the context bundle -- the coordinator script, the daemon, or a dedicated context assembly service? Where does the assembly logic live?
- The 2000-token budget is a guess. What is the actual optimal size -- enough to be useful, small enough not to crowd out the step prompt?
- How does the context bundle stay fresh across a long coordinator run? Prior session findings from 2 hours ago may be stale if main advanced significantly.
- If the knowledge graph is not yet built for a workspace, what is the fallback for context assembly? Does the coordinator skip bundling entirely, or manually assemble from known sources?
- Should the `<context>` block format be standardized so all workflows know how to consume it, or is it opaque content the agent reads naturally?

---

### Session identity: a unit of work is one session, not many (Apr 18, 2026)

**Status: idea** | Priority: medium

**Score: 10** | Cor:1 Cap:2 Eff:2 Lev:2 Con:3 | Blocked: no

A task involving discovery + design + implementation + review + re-review appears as 5 unrelated sessions in the console. The correct model: a session is a unit of work, not a workflow run.

**What's needed:**
1. `parentSessionId` optional field on `session_created` events
2. Root session as the visible identity (children are implementation details)
3. Console session tree view -- root sessions expandable to show children
4. `worktrain spawn --parent-session <id>` flag

**Why this matters:** with this, the console shows "here are my 5 units of work today" -- each telling a coherent story. Without it, users see 50 flat sessions and have to read goals to understand grouping.
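The grouping itself is trivial once `parentSessionId` exists. A sketch, assuming a minimal session shape (one-level tree; deeper nesting is one of the open questions below):

```typescript
// Sketch: fold a flat session list into root units of work using the
// proposed parentSessionId field. The Session shape is an assumption.
interface Session {
  id: string;
  parentSessionId?: string;
}

function rootsWithChildren(sessions: Session[]): Map<string, string[]> {
  const tree = new Map<string, string[]>();
  // First pass: every session without a parent is a root.
  for (const s of sessions) {
    if (!s.parentSessionId) tree.set(s.id, tree.get(s.id) ?? []);
  }
  // Second pass: attach children (order-independent, so children may
  // appear in the event log before their root).
  for (const s of sessions) {
    if (s.parentSessionId && tree.has(s.parentSessionId)) {
      tree.get(s.parentSessionId)!.push(s.id);
    }
  }
  return tree;
}
```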

**Things to hash out:**
- The "unit of work" concept is useful for coordinator-spawned sessions, but what about ad-hoc sessions started via CLI or MCP? Do those also have a unit-of-work identity, or is that concept only for coordinator-managed work?
- If a child session is retried after failure (new session ID, same `parentSessionId`), should both the failed and retried sessions appear in the tree, or only the successful one?
- How deep can the session tree go? A coordinator spawning workers that each spawn subagents could produce a 3+ level tree. Is there a depth limit?
- What happens when the root session is deleted or cleaned up but child sessions remain? Is the tree orphaned, or do children get promoted?

---

### Trigger-derived tool availability and knowledge configuration (Apr 18, 2026)

**Status: idea** | Priority: medium -- design-first

**Score: 6** | Cor:1 Cap:1 Eff:2 Lev:1 Con:1 | Blocked: no

The trigger already declares what external system matters. A `gitlab_poll` trigger means the agent will be working on GitLab content. WorkTrain should use this declaration to automatically configure what tools and knowledge sources the agent gets.

**Idea 1 -- implicit tool availability from trigger source:** if `provider: gitlab_poll` -> agent automatically gets GitLab MCP tools. If `provider: jira_poll` -> agent gets Jira tools. The trigger source is a declaration of intent.

**Idea 2 -- trigger as knowledge configuration:**
```yaml
- id: jira-bug-fix
  provider: jira_poll
  knowledge:
    general: [glean, confluence]
    codebase: [github, local-kg]
    task: [jira-ticket, related-prs]
    style: [team-conventions, agents-md]
```

The daemon assembles a pre-packaged context bundle from these sources before the agent starts. The agent skips Phase 0 discovery entirely for the declared knowledge domains.

**Needs a design-first discovery pass** before implementation.

**Things to hash out:**
- If the trigger source implicitly provides tool availability, what happens when a `gitlab_poll` trigger dispatches a task that turns out to need GitHub tools (e.g. cross-repo work)?
- How does the knowledge configuration in the trigger interact with the workspace's AGENTS.md? If both declare knowledge sources, which takes precedence?
- "Implicit tool availability from trigger source" means the daemon configures the agent's toolset based on the trigger. This is a significant change to how tools are injected. What is the migration path for existing triggers?
- Does this add a new surface for configuration mistakes -- e.g. a trigger that misconfigures knowledge sources causing the agent to miss critical context silently?

---

### Rethinking the subagent loop from first principles (Apr 18, 2026)

**Status: idea** | Priority: medium -- design-first

**Score: 8** | Cor:1 Cap:2 Eff:1 Lev:3 Con:1 | Blocked: no

Step back from all assumptions. The current design assumes the LLM decides when to spawn, what to give subagents, and handles results -- inherited from Claude Code's `mcp__nested-subagent__Task`. That's not the only model, and it might not be the best one for WorkTrain.

**Problems with LLM-as-orchestrator:** LLMs are bad at orchestration decisions; context passing is lossy; subagent output competes with everything in the parent's context window; no enforcement -- the LLM can skip delegation entirely and just do the work itself.

**Alternative: workflow-declared parallelism, daemon-enforced:**
```yaml
- id: parallel-review
  type: parallel
  agents:
    - workflow: routine-correctness-review
      contextFrom: [phase-3-output, candidateFiles]
    - workflow: routine-philosophy-alignment
      contextFrom: [phase-0-output, philosophySources]
  synthesisStep: synthesize-parallel-review
```

The daemon sees this step definition, automatically spawns child sessions with specified workflows, injects declared context bundles, waits for all to complete, passes results to a synthesis step. The parent LLM never decides to spawn anything. The workflow declares the orchestration pattern. The daemon enforces it.

**The shift:** from "agent as orchestrator" to "workflow as orchestrator, daemon as executor, agent as cognitive unit."

**Needs a discovery session to explore the design space** before any implementation.

**Things to hash out:**
- "Workflow-declared parallelism, daemon-enforced" requires the workflow schema to express parallelism declaratively. What does that schema look like, and is it backward compatible with existing workflows?
- In the proposed `parallel` step type, what happens if one child session fails while others are still running? Is it abort-all, continue-remaining, or configurable?
- The parent LLM never decides to spawn in this model. But what if the workflow author wants the LLM to decide dynamically whether parallelism is warranted? Is that expressible in a declarative schema?
- The "daemon as executor" model assumes a single daemon with visibility into all child sessions. How does this work in a distributed setup (multiple daemon instances, cloud-hosted)?
- How does this proposal relate to the existing `spawn_agent` tool, which does allow the LLM to decide when to spawn? Are both models supported simultaneously, or does this replace `spawn_agent`?

---

### Workflow runtime adapter: one spec, two runtimes (Apr 18, 2026)

**Status: idea** | Priority: low -- depends on subagent loop rethinking

**Score: 6** | Cor:1 Cap:1 Eff:1 Lev:2 Con:1 | Blocked: yes (needs subagent loop rethinking)

The workflow JSON is the canonical spec for what work needs to happen. A single adapter layer translates the canonical spec to runtime-specific execution plans.

**Two runtimes, one spec:**
- MCP adapter (human-in-the-loop): preserves `requireConfirmation` gates, presents `continue_workflow` tool call interface, LLM drives subagent spawning manually, maintains backward compat
- Daemon adapter (fully autonomous): removes `requireConfirmation` gates, replaces `continue_workflow` with `complete_step`, converts workflow-declared parallelism into automatic child session spawning

**Why this matters:** workflow improvements automatically benefit both runtimes. No dual maintenance, no parallel workflow files.

**Also eliminates "autonomous workflow variants":** the canonical workflow spec is the only version -- the daemon adapter handles what "autonomy: full" means in practice.

**Dependencies:** requires the subagent loop rethinking to be resolved first.

**Things to hash out:**
- The MCP adapter preserves `requireConfirmation` gates. The daemon adapter removes them. If a workflow is tested in one runtime context, how does the author verify it behaves correctly in the other?
- "Replaces `continue_workflow` with `complete_step`" implies a semantic difference between the two runtimes. Are there workflow patterns where this substitution changes behavior in ways the author must account for?
- Eliminating autonomous workflow variants simplifies the library, but authors currently write daemon variants for a reason. What are the cases where the adapter approach cannot replace a dedicated variant?
- Who owns the adapter implementations -- the WorkRail engine team, or workflow authors? If an adapter has a bug, every workflow using that runtime is affected.

---

### General-purpose workflow / intelligent dispatcher

**Status: idea** | Priority: medium

**Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no

Two related ideas:

**`wr.quick-task`** -- the simplest possible workflow. 2 steps: do the work, call complete_step. No complexity routing, no design review, no phased implementation. For tasks under ~10 minutes. Currently small tasks go through `wr.coding-task`'s Small fast-path, which is still heavier than needed.

**`wr.dispatch`** -- an intelligent routing workflow. Given a goal, classify it and route to the right workflow: `wr.quick-task` | `wr.research` | `wr.coding-task` | `wr.mr-review` | `wr.competitive-analysis`. The general-purpose entry point -- not a workflow that does everything, but one that decides which workflow to use. The adaptive pipeline coordinator already does this for the queue-poll trigger; the question is whether to expose it as a named user-facing workflow.

Open questions: does `wr.dispatch` replace `workflowId` in trigger config, or coexist alongside it? How does it handle tasks that don't fit any known workflow?
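A sketch of the dispatch-with-fallback shape, using a keyword heuristic purely as a stand-in for real classification; the routing rules are illustrative assumptions:

```typescript
// Hypothetical wr.dispatch routing sketch. A real classifier would be a
// classify step, not regexes; the point is the conservative fallback.
function dispatch(goal: string): string {
  const g = goal.toLowerCase();
  if (/\breview\b/.test(g)) return "wr.mr-review";
  if (/\b(research|investigate|compare)\b/.test(g)) return "wr.research";
  if (/\b(typo|rename|bump)\b/.test(g)) return "wr.quick-task";
  // Unclassified goals fall back to the heaviest general workflow
  // rather than failing -- the safe default is more process, not less.
  return "wr.coding-task";
}
```

The notable design choice is the last line: classification failure degrades to extra process instead of blocking work or silently picking the lightest path.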
2242
+
2243
+ **Things to hash out:**
2244
+ - How does `wr.dispatch` classify incoming goals accurately enough to route correctly? Classification errors could silently run the wrong workflow on real tasks.
2245
+ - If `wr.dispatch` is the entry point for all triggers, a classification failure blocks all work. Is there a safe fallback workflow for unclassified tasks?
2246
+ - Should `wr.dispatch` be visible to users as a selectable workflow in `list_workflows`, or is it infrastructure that only the coordinator and trigger config use?
2247
+ - `wr.quick-task` deliberately skips review and design gates. Who is responsible for ensuring it is only used for tasks where skipping those gates is safe?
2248
+ - How does `wr.dispatch` handle tasks that could fit multiple workflows (e.g. "investigate and fix this bug" spans `wr.bug-investigation` and `wr.coding-task`)?
2249
+
2250
+ ---
2251
+
2252
+ ### MR review session count inflation
2253
+
2254
+ **Status: idea** | Priority: medium
2255
+
2256
+ **Score: 8** | Cor:1 Cap:2 Eff:2 Lev:1 Con:2 | Blocked: no
2257
+
2258
+ A single PR review dispatches 6-12 autonomous sessions (one per reviewer family: correctness_invariants, runtime_production_risk, missed_issue_hunter, etc.). This inflates session counts, complicates cost attribution, and makes ROI calculations imprecise. Worth investigating: are all 6 families catching distinct issues, or is there significant overlap? Should families be parallelized into a single session with sub-agents rather than separate top-level sessions?
2259
+
2260
+ **Things to hash out:**
2261
+ - Is the session count problem a UX/display problem (fixable by grouping under a parent session) or an actual cost and resource problem that requires consolidation?
2262
+ - If families are merged into a single session, does the LLM context window reliably hold all review dimensions simultaneously without degrading quality on any single dimension?
2263
+ - What data exists to measure overlap between reviewer families? Before consolidating, verify with empirical data which families have the most redundant findings.
2264
+ - If families run as sub-agents in a single session, what is the failure mode when one sub-agent's findings are poor? Does it contaminate the overall review verdict?
2265
+
2266
+ ---
2267
+
2268
+ ### Session trigger source attribution (daemon vs MCP)
2269
+
2270
+ **Status: done** | Shipped PR #899 (Apr 30, 2026)
2271
+
2272
+ `triggerSource: 'daemon' | 'mcp'` added to `run_started` event data. Three-layer design: optional in Zod schema (old sessions still validate), required in `ConsoleSessionSummary` and `ConsoleSessionDetail` projections (old sessions backfilled via `isAutonomous`), `'daemon'` or `'mcp'` wired at every `executeStartWorkflow` callsite.

---

### Standup status generator

**Status: idea** | Priority: low

**Score: 8** | Cor:1 Cap:1 Eff:2 Lev:1 Con:3 | Blocked: no

A workflow that aggregates activity across git history, GitLab/GitHub MRs and reviews, and Jira ticket transitions since the last standup. Outputs a categorized ("what I did / doing today / blockers") human-readable message. Tool-agnostic: detect available integrations and adapt.

**Things to hash out:**
- "Since the last standup" requires knowing when the last standup was. How is that derived -- calendar, fixed schedule, explicit command?
- How should the workflow handle weeks where WorkTrain did mostly mechanical work (tests, chores) vs substantive features? Should it summarize at the commit level or the intent level?
- For team standup contexts, should this expose WorkTrain's work as the developer's own work, or explicitly attribute it to WorkTrain? This depends on the team's norms.
- Is the output format fixed (what I did / doing / blockers) or customizable per team format?

---

### Workflow effectiveness assessment and self-improvement proposals

**Status: idea** | Priority: medium

**Score: 10** | Cor:1 Cap:2 Eff:2 Lev:2 Con:3 | Blocked: no

**Idea:** WorkTrain runs workflows hundreds of times. It should use that data to propose improvements.

**Per-run metrics to collect:**
- Steps skipped most often (candidate for removal)
- Steps consuming the most tokens/time
- Steps where the agent calls `continue_workflow` immediately (prompt too vague or redundant)
- Sessions that produced PRs with Critical findings (workflow not thorough enough)
- Sessions that completed vs hit max_turns

**Output:** Structured proposal per workflow:
- Step-level issues with evidence (specific sessions, specific steps)
- Proposed changes with confidence and impact estimate
- Feed directly into `workflow-for-workflows`

**Flow-back:** Low-confidence proposals as GitHub issues. High-confidence, low-risk proposals auto-applied to local copy + PR to community.
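
The first metric (skip rate per step) reduces to a simple aggregation over run records. The record shapes here are hypothetical; the real event log would be the input.

```typescript
// Hypothetical per-run record shapes -- the real session event log differs.
interface StepRecord { stepId: string; skipped: boolean }
interface RunRecord { steps: StepRecord[] }

// Skip rate per step across runs: steps skipped most often are removal candidates.
function skipRates(runs: RunRecord[]): Map<string, number> {
  const seen = new Map<string, { skipped: number; total: number }>();
  for (const run of runs) {
    for (const step of run.steps) {
      const s = seen.get(step.stepId) ?? { skipped: 0, total: 0 };
      s.total += 1;
      if (step.skipped) s.skipped += 1;
      seen.set(step.stepId, s);
    }
  }
  return new Map([...seen].map(([id, s]) => [id, s.skipped / s.total]));
}
```

The other metrics (token cost per step, immediate-continue rate, max_turns exits) follow the same fold-over-runs pattern with different predicates.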

**Things to hash out:**
- How is a workflow improvement proposal validated before auto-application? A regression in a bundled workflow affects all users. Is test passage sufficient, or does it require human review?
- "High-confidence, low-risk proposals auto-applied" -- what defines low-risk? Prompt text changes are hard to classify by risk level automatically.
- Who owns the community PR process for workflow improvements? Auto-opened PRs against a community repo need a reviewer.
- If the same workflow is run with different models (Haiku vs Sonnet), the metrics will differ significantly. Are model-specific stats tracked separately or averaged?
- How does this prevent a positive feedback loop where the assessment workflow optimizes for metrics (fewer turns, faster completion) at the expense of quality?

---

## Platform Vision (longer-term)

### WorkTrain as a first-class project participant: ideal backlog and planning capabilities (Apr 30, 2026)

**Status: idea** | Priority: high (long-term)

**Score: 9** | Cor:1 Cap:3 Eff:1 Lev:3 Con:1 | Blocked: yes (needs knowledge graph + project memory layer)

Right now WorkTrain manages its backlog like a human with a text editor -- it reads a file, reasons about it, writes changes. Every session re-derives context it already derived before. There is no persistent structured understanding of the project that survives across sessions. The ideal is fundamentally different: the backlog is not a document WorkTrain edits, it is a live model of the project that WorkTrain both reads and updates as a first-class participant.

The capabilities that make up the ideal:

**1. Persistent project memory**
WorkTrain accumulates understanding of the project over time -- what was tried, why things were decided, what the current trajectory is -- in a form that persists and updates incrementally across sessions. Not session notes (those already exist), but a synthesized model: "where is this project right now and where is it going?" Updated automatically as work happens, not reconstructed from scratch each time.

**2. Native structured backlog operations**
First-class tools -- `get_backlog_item(id)`, `update_score(id, scores)`, `add_item(...)`, `query_items(filter)`, `get_dependents(id)` -- rather than reading a markdown file and parsing it. The backlog is data. WorkTrain should treat it as data, not text.

**3. Dependency graph with automatic inference**
Not just manually declared `blocked_by` links, but WorkTrain inferring relationships from reading items and the codebase -- "implementing X will require Y to exist first" -- and recording those inferences persistently. The graph updates as work completes and dependencies resolve.

**4. Context-aware scoring**
Scores that understand the current moment -- what's in flight, what just shipped, what the operator is focused on -- so priority shifts as the project evolves without manual re-scoring. The rubric is not applied in isolation; it's applied against the current project state.

**5. Proactive surfacing**
WorkTrain doesn't wait to be asked "what should I work on?" It knows when a high-score unblocked item has been sitting idle too long, when a blocker just resolved making a previously-blocked item executable, or when work it just completed changes the relative priority of other items. It surfaces these unprompted.

**6. Honest self-assessment**
WorkTrain tracks its own execution history -- which item categories it completed cleanly vs got stuck on, where it overestimated confidence, which workflows it handles reliably vs which it doesn't. This history feeds back into scoring: a Correctness 3 item in a category WorkTrain consistently struggles with should score differently than one it handles well.

**7. Backlog and execution as one system**
When WorkTrain picks up an item, it is simultaneously dequeued from the backlog and tracked as in-flight; on completion it is automatically marked done, dependent item scores are updated, and newly-executable items are surfaced. The backlog and the work queue are not two separately maintained systems.

**Things to hash out:**
- What is the persistent project memory stored as -- a structured document, a database, a knowledge graph node, or a combination? The answer determines how it's queried and updated.
- Automatic dependency inference requires reading both items and code. How does WorkTrain know when its inference is reliable vs speculative? Incorrect inferences that block work are worse than no inference at all.
- Context-aware scoring means scores are not stable -- the same item can have a different score on different days. How does the operator reason about priority if scores shift? Is there a "score as of today" vs "canonical score" distinction?
- Self-assessment requires WorkTrain to have a model of its own capabilities and failure modes. This is subtle -- how does it distinguish "I got stuck because the task was hard" from "I got stuck because I handle this category poorly"?
- Proactive surfacing risks becoming noise if WorkTrain surfaces too many things or surfaces them at the wrong moment. What is the right cadence and channel for unprompted priority signals?
- The backlog-as-data model requires a defined schema. What happens to items that don't fit the schema cleanly -- highly exploratory ideas, resolved debates, historical context that matters but isn't actionable?
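
The tool surface from capability 2 can be sketched against an assumed item shape. Everything here is illustrative -- the field names are not a real schema, and `getDependents` shows the key derived query (the inverse of `blockedBy`):

```typescript
// Assumed item shape for the backlog-as-data model; illustrative only.
interface BacklogItem {
  id: string;
  title: string;
  status: "idea" | "in-progress" | "done";
  scores: { cor: number; cap: number; eff: number; lev: number; con: number };
  blockedBy: string[];
}

class Backlog {
  constructor(private items: Map<string, BacklogItem>) {}

  getBacklogItem(id: string): BacklogItem | undefined {
    return this.items.get(id);
  }

  queryItems(filter: (item: BacklogItem) => boolean): BacklogItem[] {
    return [...this.items.values()].filter(filter);
  }

  // Items that list `id` as a blocker -- the inverse of blockedBy.
  getDependents(id: string): BacklogItem[] {
    return this.queryItems((item) => item.blockedBy.includes(id));
  }
}
```

Once the backlog is data like this, "mark done, update dependent scores, surface newly-executable items" (capability 7) becomes a transaction over these operations rather than a text edit.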

---

### Inspiration: openclaw (Apr 29, 2026)

**Source:** https://github.com/openclaw/openclaw

openclaw is worth studying deeply before building out the platform layer. Draw inspiration from it when designing: multi-agent orchestration patterns, coordinator architecture, context packaging for subagents, task queue and dispatch models, and the overall shape of an autonomous engineering platform. Review it before making architectural decisions on any of the Platform Vision items below.

---

### Knowledge graph for agent context

**Status: idea** | Priority: medium

**Score: 10** | Cor:1 Cap:3 Eff:1 Lev:3 Con:2 | Blocked: no

**Problem:** Every session starts with a full repo sweep. Context-gathering subagents re-read the same files, re-trace the same call chains, re-identify the same invariants.

**Design -- two-layer hybrid:**

**Layer 1: Structural graph (hard edges, deterministic)**
Built with `ts-morph` (on the TypeScript compiler API) + DuckDB. Captures: `imports`, `calls`, `exports`, `implements`, `extends`, `registers_in`, `tested_by`. Answers precise questions with certainty: "what imports trigger-router.ts?", "what CLI commands are registered?"

**Layer 2: Vector similarity (soft weights, semantic)**
Every node gets an embedding. Answers fuzzy questions: "what is conceptually related to this?", "what past sessions are relevant to this bug?" Built with LanceDB (embedded, TypeScript-native, local-first).

**Technology:**
- Structural: `ts-morph` + DuckDB
- Vector: LanceDB + local embedding model (Ollama or `@xenova/transformers`)
- Unified query: `query_knowledge_graph(intent)` returns merged structural + semantic results

**Build order:** Structural layer spike first (1 day). Vector layer after the spike proves the foundation. Incremental update: re-index only files in `filesChanged` after each session.

**Build decision (from Apr 15 research):** ts-morph + DuckDB wins. Cognee: Python-only. GraphRAG/LightRAG: use LLMs to build the graph (violates scripts-over-agent). Mem0/Zep: conversational memory, not code graphs. Sourcegraph: enterprise weight, overkill.

**Things to hash out:**
- How large does a typical workspace KG get? For a medium-sized TypeScript monorepo, what are the expected node and edge counts for the structural layer?
- The incremental update strategy (re-index only `filesChanged`) requires accurate change tracking. What is the fallback when `filesChanged` is unavailable (e.g. for manually triggered sessions)?
- The embedding model (Ollama or `@xenova/transformers`) needs to be running locally. What is the setup story for a new workspace -- is it expected to already have an embedding model, or does WorkTrain set one up?
- DuckDB is in-process -- what is the concurrency story when multiple daemon sessions try to query or update it simultaneously?
- Is the KG per-workspace or global? If per-workspace, cross-workspace queries (multi-project WorkTrain) require a federation layer.
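
The core of the Layer-1 extraction is "source file in, typed edges out". The design above names ts-morph for this; the sketch below uses a regex stand-in so it is self-contained -- it misses dynamic `import()`, re-exports, and side-effect imports, which the compiler-API version would catch.

```typescript
// Regex stand-in for the ts-morph pass in Layer 1: extracts `imports` edges
// from raw source text. Illustrative only -- the real extractor would walk
// the AST via the compiler API instead of pattern-matching text.
interface Edge { from: string; to: string; kind: "imports" }

function extractImportEdges(file: string, source: string): Edge[] {
  const edges: Edge[] = [];
  const importRe = /^import\s+[^;]*?from\s+["']([^"']+)["']/gm;
  let match: RegExpExecArray | null;
  while ((match = importRe.exec(source)) !== null) {
    edges.push({ from: file, to: match[1], kind: "imports" });
  }
  return edges;
}
```

Rows of this shape are what would land in the DuckDB edge table; "what imports trigger-router.ts?" is then a `WHERE to = ...` lookup rather than a repo sweep.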

---

### Dynamic pipeline composition

**Status: idea** | Priority: medium

**Score: 9** | Cor:1 Cap:3 Eff:1 Lev:2 Con:2 | Blocked: yes (needs classify-task workflow)

**Insight:** Not all tasks are equal in how much work is needed before implementation. A raw idea needs a completely different pipeline than a fully-specced ticket.

**Maturity spectrum:**
- `idea` -> `rough` -> `specced` -> `ready` -> `code-complete`

**Coordinator reads maturity + existing artifacts and prepends the right phases:**
- Nothing -> ideation -> market research -> spec authoring -> ticket creation -> implementation
- BRD + designs -> architecture review -> implementation
- Fully specced -> coding only
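
The prepend rule is a deterministic lookup from maturity to phase list. Phase names below are illustrative (the real coordinator would also inspect artifacts like BRDs and designs, which this sketch ignores):

```typescript
// Deterministic phase prepending keyed on task maturity -- a sketch of the
// coordinator rule above, with invented phase names and no artifact inspection.
type Maturity = "idea" | "rough" | "specced" | "ready";

const PHASES: Record<Maturity, string[]> = {
  idea: ["ideation", "market-research", "spec-authoring", "ticket-creation", "implementation"],
  rough: ["spec-authoring", "ticket-creation", "implementation"],
  specced: ["architecture-review", "implementation"],
  ready: ["implementation"],
};

function composePipeline(maturity: Maturity): string[] {
  return PHASES[maturity];
}
```

Keeping the mapping as data (rather than coordinator reasoning) is what makes the composition auditable: the same maturity always yields the same phase list.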

**New workflows needed:**
- `classify-task-workflow` -- fast, 1-step, outputs `taskComplexity`/`riskLevel`/`hasUI`/`touchesArchitecture`/`taskMaturity`
- `ideation-workflow`, `spec-authoring-workflow`, `ticket-creation-workflow`, `grooming-workflow`

**Things to hash out:**
- How does the coordinator determine task maturity? Is this a classification workflow output, a field on the issue/ticket, or derived from artifact presence?
- When maturity is `idea`, the pipeline runs ideation + market research. These could take hours. Does the coordinator hold the queue slot during all upstream phases, or release and re-acquire?
- How are the new workflows (`ideation-workflow`, `spec-authoring-workflow`, etc.) different from `wr.discovery` and `wr.shaping`? Are these new workflows, or just renamed compositions?
- How does the pipeline composition interact with `workOnAll: true`? For a raw idea, the pipeline could autonomously run all the way to code without any human input -- is that the intended behavior?

---

### Per-workspace work queue

**Status: idea** | Priority: medium

**Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no

**The insight:** Triggers make WorkTrain reactive. A work queue makes it proactive -- it pulls the next item when capacity is available, works it to completion, pulls the next.

**Internal queue:** `~/.workrail/workspaces/<name>/queue.jsonl` -- append-only, one item per line, consumed in priority order then FIFO.
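
The "priority order then FIFO" semantics can be sketched over the JSONL text directly. The item fields are assumed, not a real `queue.jsonl` schema, and the sketch operates on a string rather than the file:

```typescript
// Append-only JSONL queue semantics: one JSON object per line, consumed by
// highest priority first, ties broken by append order (FIFO). Item shape is
// illustrative only.
interface QueueItem { id: string; priority: number; task: string }

function appendItem(jsonl: string, item: QueueItem): string {
  return jsonl + JSON.stringify(item) + "\n";
}

function nextItem(jsonl: string): QueueItem | undefined {
  const items = jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as QueueItem);
  let best: QueueItem | undefined;
  for (const item of items) {
    // Strict `>` keeps the earliest-appended item among equal priorities.
    if (!best || item.priority > best.priority) best = item;
  }
  return best;
}
```

Append-only plus a consumption marker (not shown) is what makes the file crash-safe: a dequeue is recorded by appending, never by rewriting earlier lines.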

**External pull sources:**
- GitHub issues (label filter)
- GitLab issues (label filter)
- Jira sprint board
- Linear triage queue

**Queue + message queue + talk:**

| Interface | Use case | Latency |
|-----------|----------|---------|
| Work queue | "do this when you have capacity" | When a slot is free |
| Message queue (`worktrain tell`) | "do this now, between current sessions" | End of current batch |
| Talk (`worktrain talk`) | "let's discuss and decide together" | Interactive |

**Things to hash out:**
- How does the per-workspace internal queue (`queue.jsonl`) interact with the existing `github_queue_poll` and `gitlab_poll` triggers? Are they additive sources into the same queue, or separate systems?
- Who controls priority assignment for queue items? Is it explicit (operator assigns priority) or inferred (WorkTrain computes it)?
- What happens when the queue is empty and capacity is available -- does WorkTrain go idle or proactively seek work?
- Should the queue be inspectable and editable by the operator via CLI, or is it a fully opaque internal mechanism?
- How does per-workspace queue isolation interact with global concurrency limits? A workspace with a large queue could starve other workspaces.

---

### Remote references (URLs, GDocs, Confluence)

**Status: idea** | Priority: medium

**Score: 9** | Cor:1 Cap:2 Eff:2 Lev:1 Con:3 | Blocked: no

**Design:** Extend the workflow `references` system to support remote sources (HTTP URLs, Google Docs, Confluence pages). WorkRail remains a pointer system -- it validates declarations are well-formed, delivers the pointer, and the agent fetches with its own tools. Auth is entirely delegated to the agent.

**Incremental path:**
- Phase 1: public HTTP URLs. `resolveFrom: "url"`. WorkRail delivers the URL; agent fetches. No auth surface in WorkRail.
- Phase 2: workspace-configured bearer tokens in `.workrail/config.json` keyed by domain
- Phase 3: named integrations (GDocs, Confluence, Notion) as first-class configured sources
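
One possible shape for Phase 1 validation is a host allowlist checked before the pointer is delivered to the agent. This is a sketch of one answer to the allowlist question, not a decided design; the function name and TLS requirement are assumptions:

```typescript
// Hypothetical pre-delivery check for a declared remote reference: the URL
// must parse, use HTTPS, and match a workspace-configured host allowlist.
function isAllowedRemoteRef(ref: string, allowedHosts: string[]): boolean {
  let url: URL;
  try {
    url = new URL(ref);
  } catch {
    return false; // malformed declaration: reject at validation time
  }
  if (url.protocol !== "https:") return false; // require TLS for remote refs
  return allowedHosts.includes(url.hostname);
}
```

Because WorkRail never fetches the content itself, this check is the only point where it can constrain what the agent is pointed at.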

**Design questions:**
- Should WorkRail attempt a reachability check at start time, or skip entirely for remote refs?
- How should remote refs appear in `workflowHash`? Content can change between runs.
- `kind` field (`local` vs `remote`) or infer from `source` value?

**Things to hash out:**
- Phase 2 (workspace-configured bearer tokens) puts credentials in `.workrail/config.json`. If this file is in the repo, tokens are at risk of being committed. What is the recommended credential storage model?
- The Phase 1 design (agent fetches the URL itself) means the agent has access to any URL declared in a workflow. Is there any validation or allowlist for what remote sources a workflow can reference?
- Remote document content changes between runs. Should WorkRail snapshot the content at session start for reproducibility, or always use live content?
- When a remote ref is unavailable (network error, auth failure), should the session fail, warn and continue, or fall back to a cached version?

---

### Declarative composition engine

**Status: idea** | Priority: low

**Score: 6** | Cor:1 Cap:1 Eff:1 Lev:1 Con:2 | Blocked: no

**Summary:** Users or agents fill out a declarative spec (dimensions, scope, rigor level) and the WorkRail engine assembles a workflow automatically from a library of pre-validated routines. The agent is a form-filler, not an architect -- the composition logic lives in the engine.

**Why different from agent-generated workflows:** Engine-composed workflows are assembled from pre-reviewed building blocks using deterministic rules. The same spec always produces the same workflow shape.

**Good early use cases:** Audit-style workflows (user picks dimensions, engine assembles auditor steps), review workflows, investigation workflows.

**Things to hash out:**
- Who defines the "library of pre-validated routines"? How does a routine get accepted into the composition library vs remaining a workflow-specific step?
- How does the spec input interface work -- is it a YAML/JSON document, a CLI prompt sequence, or a tool call? Who calls it?
- "Same spec always produces the same workflow shape" is a strong determinism guarantee. How is this enforced when routines are updated? Does a spec locked to routine v1.2 still produce the same shape after routine v1.3 ships?
- Should the resulting workflow be persisted (so the user can inspect and modify it), or is it ephemeral (assembled fresh each run)?
- How does error handling work when the spec declares a combination of dimensions that no valid routine composition can satisfy?

---

### Workflow categories and category-first discovery

**Status: idea** | Priority: low

**Score: 7** | Cor:1 Cap:1 Eff:2 Lev:1 Con:2 | Blocked: no

**Summary:** Improve workflow discovery by organizing bundled workflows into categories. Currently the catalog is large enough that flat discovery is becoming noisy.

**Phase 1 shape:** If no category is passed, return category names + workflow count per category + a few representative titles. If a category is passed, return the full workflows for that category.

**Design questions:**
- Should categories live in workflow JSON, in a registry overlay, or be inferred from directory/naming?
- Should `list_workflows` become polymorphic, or should category discovery be a separate mode?

**Things to hash out:**
- How does category assignment work for user-imported workflows? Can users assign categories, or is it only for bundled workflows?
- If a workflow fits multiple categories (e.g. a workflow that is both a "review" and an "audit"), can it appear in multiple categories, or does it have a single primary?
- Does category-first discovery change what gets returned in the existing `list_workflows` schema? Is this a backward-compatible extension or a new tool?
- Who maintains the category taxonomy as the library grows? What prevents categories from proliferating to the point they become as noisy as the flat list?

---

### Forever backward compatibility (workrailVersion)

**Status: idea** | Priority: medium

**Score: 10** | Cor:1 Cap:2 Eff:2 Lev:2 Con:3 | Blocked: no

Every workflow declares `workrailVersion: "1.4.0"`. The engine maintains compatibility adapters for all previously declared versions -- old workflows run forever without author intervention. The engine adapts; authors never migrate.

**The web model:** this is how browsers handle HTML from 1995. A `<marquee>` tag still renders because the browser adapts, not because the author rewrote their page.

**Engineering implication:** permanent commitment. Once a version adapter is shipped, it cannot be removed. The tradeoff is real, but the alternative (expecting external authors to track WorkRail releases and migrate) breaks the platform trust model.

**Phase 1:** Add the `workrailVersion` field to the schema. Default to `"1.0.0"` for existing workflows. Record it in run events.
**Phase 2:** Introduce the first adapter when the first schema-breaking change is needed.
**Phase 3:** Build a compatibility test harness in CI.

**Related:** `src/v2/read-only/v1-to-v2-shim.ts` (existing precedent for version adaptation).
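
Mechanically, "the engine adapts" usually means a chain of one-version-forward transforms composed at load time. The versions, the `steps` default, and the document shape below are invented for illustration:

```typescript
// Adapter-chain sketch: each adapter lifts a workflow document one declared
// version forward; the engine applies adapters until the document reaches the
// current version. All version numbers and transforms here are hypothetical.
type WorkflowDoc = Record<string, unknown> & { workrailVersion?: string };

const ADAPTERS: Record<string, (doc: WorkflowDoc) => WorkflowDoc> = {
  // e.g. 1.0.0 -> 1.4.0: default a field that later versions require.
  "1.0.0": (doc) => ({ ...doc, workrailVersion: "1.4.0", steps: doc.steps ?? [] }),
};

function adaptToCurrent(doc: WorkflowDoc, current = "1.4.0"): WorkflowDoc {
  // A missing declaration defaults to 1.0.0 -- which is why that adapter
  // can never be removed.
  let out: WorkflowDoc = { ...doc, workrailVersion: doc.workrailVersion ?? "1.0.0" };
  while (out.workrailVersion !== current) {
    const adapter = ADAPTERS[out.workrailVersion as string];
    if (!adapter) throw new Error(`no adapter from ${out.workrailVersion}`);
    out = adapter(out);
  }
  return out;
}
```

The CI harness from Phase 3 would then run every historical fixture document through `adaptToCurrent` on every release.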

**Things to hash out:**
- "Once a version adapter is shipped, it cannot be removed" is a hard commitment. What is the governance process for accepting this commitment for a given version? Who signs off?
- How does `workrailVersion` interact with `schemaVersion` (the versioned schema validation idea elsewhere in this backlog)? Are these the same concept, or do they track different axes?
- If a workflow omits `workrailVersion` (the default-1.0.0 case), can WorkRail ever remove the v1.0.0 adapter? The default-to-1.0.0 mechanism means that adapter must be permanent.
- The compatibility test harness in CI must test all adapters on every release. For N historical versions, this is O(N) adapter tests. At what point does this become a maintenance burden?

---

### Parallel forEach execution

**Status: idea** | Priority: low

**Score: 8** | Cor:1 Cap:2 Eff:2 Lev:1 Con:2 | Blocked: no

Sequential `forEach` (along with `for`, `while`, and `until`) already works -- implemented in both the v1 interpreter and the v2 durable core. The idea here is parallel execution: run all iterations concurrently rather than sequentially. Requires design around: session store concurrent writes, token protocol isolation per iteration, and console DAG rendering for parallel branches.

**Things to hash out:**
- Token protocol isolation per iteration is not trivial. Each parallel branch needs its own HMAC token chain. How does the engine mint and track N independent token chains for a single forEach step?
- What are the semantics of a failure in one parallel iteration -- abort all, continue others, or configurable?
- How are the outputs of N parallel iterations combined for the next sequential step? Is there a built-in aggregation, or is the workflow author responsible for merging?
- How does the console DAG render parallel forEach branches without becoming unreadable for large arrays (e.g. 20 items in a forEach)?
- What is the concurrency limit for parallel forEach -- is it bounded by `maxConcurrentSessions`, or is there a per-step parallelism limit?
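
The failure-semantics question has a neutral starting point: run all iterations concurrently but keep per-iteration outcomes separate, so either abort-all or continue-others can be layered on top. `onItem` is a hypothetical stand-in for executing one iteration's step body:

```typescript
// Sketch of per-iteration outcome tracking for a parallel forEach, assuming
// nothing about the engine: all branches run concurrently, and failures are
// collected rather than thrown, leaving the abort-vs-continue policy open.
async function parallelForEach<T, R>(
  items: T[],
  onItem: (item: T) => Promise<R>,
): Promise<{ ok: R[]; failed: { item: T; error: unknown }[] }> {
  const settled = await Promise.allSettled(items.map(onItem));
  const ok: R[] = [];
  const failed: { item: T; error: unknown }[] = [];
  settled.forEach((result, i) => {
    if (result.status === "fulfilled") ok.push(result.value);
    else failed.push({ item: items[i], error: result.reason });
  });
  return { ok, failed };
}
```

This says nothing about the harder parts (token chains, durable session writes); it only pins down that the engine sees N independent outcomes, not one merged exception.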

---

### Assessment-gate tiers beyond v1

**Status: idea** | Priority: low

**Score: 7** | Cor:1 Cap:1 Eff:2 Lev:1 Con:2 | Blocked: no

**Tier 1 (current):** Same-step follow-up retry. The consequence keeps the same step pending; the engine returns semantic follow-up guidance.

**Tier 2 (future):** Structured redo recipe on the same step. The engine surfaces a bounded checklist. No new DAG nodes or true subflow.

**Tier 3 (future):** Assessment-triggered redo subflow. A matched consequence routes into an explicit sequence of follow-up steps. Introduces assessment-driven control-flow behavior.

**Design questions:** When does Tier 2 become necessary? What durable model would Tier 3 need for entering, progressing through, and returning from a redo subflow?

**Things to hash out:**
- Tier 3 (redo subflow) requires the engine to create new DAG nodes dynamically at runtime. What are the constraints on which steps can be the target of an assessment-triggered redo?
- How does Tier 2's "bounded checklist" differ from an existing assessment consequence in Tier 1? Is this a new execution contract, or just a richer prompt injection?
- When does Tier 2 become necessary? Before building it, is there evidence from real workflow runs that Tier 1 is insufficient for specific use cases?
- Tier 3 significantly increases engine complexity. How does it interact with existing features like `jumpIf`, `runCondition`, and loops?

---

### Workflow rewind / re-scope support

**Status: idea** | Priority: low

**Score: 8** | Cor:1 Cap:2 Eff:2 Lev:1 Con:2 | Blocked: no

Allow an in-progress session to go back to an earlier point when new information changes scope, invalidates assumptions, or reveals that the current path is wrong.

**Phase 1:** Allow rewind to a prior checkpoint with an explicit reason. Record a "why we rewound" note in session history.

**Phase 2:** Scope-change prompts ("our understanding changed", "the task is broader/narrower"). Let workflows declare safe rewind points explicitly.

**Design questions:**
- Should rewind be limited to explicit checkpoints, or support arbitrary node-level rewind?
- How should the system preserve notes from abandoned paths?
- Should some steps be marked non-rewindable once external side effects have happened?

**Things to hash out:**
- Who can initiate a rewind -- the agent, a human operator, or the coordinator? Are there different constraints for each initiator?
- If a rewind discards steps that made external side effects (e.g. a git push, a PR comment), the side effects remain but the session state rolls back. How is this inconsistency surfaced?
- What is the maximum rewind distance? Allowing arbitrary node-level rewind on a 30-step workflow could create very confusing session histories.
- How does rewind interact with the HMAC token protocol? Tokens are forward-only by design -- can a rewound session re-issue tokens for already-advanced steps?

---

### Subagent composition chains

**Status: idea** | Priority: low

**Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no

Native support for nested subagents -- an agent spawning a subagent, which spawns its own -- up to a configurable depth limit.

```yaml
agentDefaults:
  maxSubagentDepth: 3
  maxTotalAgentsPerTask: 10
```

**Depth semantics:** Coordinator=0, worker=1, subagent=2, sub-subagent=3.

`maxTotalAgentsPerTask` prevents exponential explosion: without this cap, a depth-3 tree with 3 agents per node already has 27 concurrent agents at the deepest level alone.
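
A spawn guard enforcing both limits is small; the hard part is where the counters live. This sketch uses plain in-memory values, which sidesteps the shared-counter concurrency question raised in "Things to hash out" -- the function name and shapes are illustrative, not a real API:

```typescript
// Hypothetical spawn guard for the two limits above. `parentDepth` follows
// the depth semantics (coordinator=0, worker=1, ...); `agentCount` is the
// task-wide total, which in reality must be a concurrency-safe shared counter.
interface SpawnLimits {
  maxSubagentDepth: number;
  maxTotalAgentsPerTask: number;
}

function canSpawn(
  parentDepth: number,
  task: { agentCount: number },
  limits: SpawnLimits,
): boolean {
  if (parentDepth + 1 > limits.maxSubagentDepth) return false; // child would exceed depth
  return task.agentCount + 1 <= limits.maxTotalAgentsPerTask; // child would exceed total
}
```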
2645
+
2646
+ **Things to hash out:**
2647
+ - How does the depth counter propagate through `spawn_session` calls? Is it tracked in the session event log, or in-memory in the daemon?
2648
+ - If a sub-subagent is killed (timeout, crash), does it count against the depth and total counts of its parent session? How are orphaned depth slots reclaimed?
2649
+ - `maxTotalAgentsPerTask` requires a shared counter across all agents in a chain. What is the concurrency-safe mechanism for this counter -- is it in the session store, a daemon in-memory structure, or something else?
2650
+ - Should composition chains be opt-in per workflow/trigger, or available to any workflow by default?
2651
+
2652
+ ---
2653
+
2654
+ ### Mobile monitoring and remote access
2655
+
2656
+ **Status: idea** | Priority: low (post-daemon-MVP)
2657
+
2658
+ **Score: 6** | Cor:1 Cap:1 Eff:1 Lev:1 Con:2 | Blocked: no
2659
+
2660
+ **Goal:** Control and monitor autonomous WorkRail sessions from a phone.
2661
+
2662
+ **What's needed:**
2663
+ 1. Mobile-responsive console with touch-friendly layout and tap to pause/resume/cancel
2664
+ 2. Push notifications (via Slack/Telegram webhook -- no native app required for MVP)
2665
+ 3. Human-in-the-loop approval on mobile -- maps to `POST /api/v2/sessions/:id/resume`
2666
+ 4. Session log view -- linear timeline, not DAG
2667
+
2668
+ **Things to hash out:**
2669
+ - Remote access requires the console to be reachable from outside the local network. What is the default security model -- is unauthenticated remote access acceptable for a tool managing autonomous code changes?
2670
+ - Push notifications via webhook require a persistent endpoint (Slack/Telegram bot). Who sets this up -- WorkTrain automates it, or the operator configures it manually?
2671
+ - "Tap to pause/resume/cancel" is write access from a mobile client. What authentication and authorization model protects these actions from unauthorized access?
2672
+ - Should mobile monitoring be opt-in or default-on? Users who haven't configured remote access should not inadvertently expose their console.
2673
+
2674
+ **Remote access options:**
2675
+ 1. `workrail tunnel` command (Cloudflare Tunnel from the laptop) -- works behind any NAT/VPN
2676
+ 2. Tailscale integration -- zero WorkRail code needed
2677
+ 3. Cloud session sync -- daemon pushes events to S3/R2
2678
+
2679
+ **Priority:** Post-daemon-MVP. Design the REST control plane with mobile in mind from the start.
2680
+
2681
+ ---
2682
+
2683
+ ### WorkRail Auto: cloud-hosted autonomous platform
+
+ **Status: idea** | Priority: long-term
+
+ **Score: 9** | Cor:1 Cap:3 Eff:1 Lev:2 Con:2 | Blocked: yes (needs proven local daemon)
+
+ **Goal:** WorkRail Auto runs on a server 24/7, connected to your engineering ecosystem, working autonomously without a laptop open.
+
+ **What this enables:** GitLab MR opened -> WorkRail reviews, posts comment. Jira ticket moves to In Progress -> WorkRail starts coding task, pushes branch. PagerDuty fires -> WorkRail runs investigation, posts findings to Slack.
+
+ **Architecture implications:**
+ - Multi-tenancy: isolated session stores, isolated credential vaults per org
+ - Horizontal scaling: multiple daemon instances consuming from a shared trigger queue
+ - Rate limiting per org, per integration
+
+ **Relationship to self-hosted:** Self-hosted is always free, always open source, always works offline. WorkRail Auto is the natural SaaS layer -- same engine, same workflows, managed infrastructure.
+
+ **Priority:** Long-term. Design the local daemon with multi-tenancy seams in mind from the start (don't hardcode single-user assumptions). Don't build the hosted layer until the local daemon is proven.
+
+ **Things to hash out:**
+ - What is the business model for WorkRail Auto -- per-seat, per-org, usage-based (tokens consumed), or outcome-based?
+ - Multi-tenancy requires credential isolation between orgs. What is the threat model -- can a compromised tenant access another tenant's code or credentials?
+ - The "same engine, same workflows" promise requires the cloud version to stay in sync with the open-source version. What is the release cadence and sync mechanism?
+ - Horizontal scaling with multiple daemon instances requires a shared trigger queue. What is the queue technology (Redis, Postgres, SQS)? This is a significant infrastructure dependency to introduce.
+ - When does the decision to build the hosted layer get made? What are the criteria ("local daemon is proven" needs a concrete definition)?
+
+ ---
+
+ ### Multi-project WorkTrain
+
+ **Status: idea** | Priority: medium (to investigate)
+
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no
+
+ **Problem:** WorkTrain needs to handle multiple, completely unrelated projects simultaneously -- yet some projects are related and need to share knowledge.
+
+ **Proposed model:** Workspace namespacing with explicit cross-workspace links:
+ ```yaml
+ workspaces:
+   workrail:
+     path: ~/git/personal/workrail
+     knowledgeGraph: ~/.workrail/graphs/workrail.db
+     maxConcurrentSessions: 3
+     relatedWorkspaces: [storyforge]
+   storyforge:
+     path: ~/git/personal/storyforge
+     knowledgeGraph: ~/.workrail/graphs/storyforge.db
+     relatedWorkspaces: [workrail]
  ```
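The cross-workspace links above imply a resolution step in the daemon: given a workspace name, which knowledge graphs may a session read? A minimal sketch of that resolution, assuming a hypothetical `resolveReadableGraphs` helper (field names mirror the YAML above; nothing here is the actual WorkRail API):

```typescript
// Illustrative shape of one workspace entry from the config sketch above.
interface WorkspaceConfig {
  path: string;
  knowledgeGraph: string;
  maxConcurrentSessions?: number;
  relatedWorkspaces?: string[];
}

type WorkspaceMap = Record<string, WorkspaceConfig>;

// A session in `name` reads its own graph plus (read-only) the graphs of
// declared related workspaces. Dangling references are skipped, not fatal.
function resolveReadableGraphs(workspaces: WorkspaceMap, name: string): string[] {
  const own = workspaces[name];
  if (!own) throw new Error(`unknown workspace: ${name}`);
  const related = (own.relatedWorkspaces ?? [])
    .map((w) => workspaces[w]?.knowledgeGraph)
    .filter((g): g is string => g !== undefined);
  return [own.knowledgeGraph, ...related];
}
```

Whether the related graphs are actively queried or merely advisory is exactly the first "things to hash out" question below.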
 
- **Key decisions:**
- - Standard 5-field cron syntax, configurable timezone
- - Missed runs NOT caught up by default (optional `catchUp: true`)
- - Overlap prevention: if a run is still active when the next tick fires, skip it
- - `worktrain run schedule <trigger-id>` for manual trigger
+ **Must be workspace-scoped:** knowledge graph, daemon-soul.md, session store, concurrency limits, triggers.
 
- **Implementation:** `PollingScheduler` already runs time-based loops. Schedule provider would use cron expression matching instead of API polling. State persists to `~/.workrail/schedule-state.json`.
+ **Can be shared globally:** WorkTrain binary, token usage tracking, message queue, merge audit log.
+
+ **Things to hash out:**
+ - How does a workspace know about `relatedWorkspaces` in practice? Is this purely advisory metadata for human context, or does WorkTrain actively query related workspace KGs during sessions?
+ - If two related workspaces have conflicting behavioral rules in their respective `daemon-soul.md` files, what is the priority when a cross-workspace session runs?
+ - Is the workspace config (`~/.workrail/workspaces`) stored in the user's home directory or per-repo? If per-repo, what happens for repos shared across users or machines?
+ - What is the migration path for existing single-workspace setups? Does adding workspace namespacing require changes to all existing config files?
+ - Global shared items (token usage, message queue, merge audit log) need to remain consistent across workspaces. Who is responsible for multi-workspace consistency in these shared files?
 
  ---
 
- ### Autonomous grooming loop + workOnAll mode
+ ### Message queue: async communication with WorkTrain from anywhere
 
 **Status: idea** | Priority: medium
 
- **Three autonomy levels:**
-
- - **Level 0 (current):** Human applies `worktrain` label to specific issues. WorkTrain works those only.
- - **Level 1 -- workOnAll:** Config flag `workOnAll: true`. WorkTrain looks at ALL open issues, infers which are actionable, picks highest-priority. Escape hatch: `worktrain:skip` label.
- - **Level 2 -- Fully proactive:** WorkTrain also surfaces work it found itself (failing CI, backlog items with no issue, patterns in git history).
+ **Score: 10** | Cor:1 Cap:2 Eff:2 Lev:2 Con:3 | Blocked: no
 
- **Grooming loop (scheduled nightly):** Reads backlog, open issues, recent completed work. Closes resolved issues. For ungroomed items: infers maturity (linked spec, acceptance criteria, vague language). For high-value idea-level items: runs `wr.discovery` + `wr.shaping`, creates/updates issue.
+ **Design:** A persistent message queue (`~/.workrail/message-queue.jsonl`) that decouples when you send a message from when WorkTrain acts on it.
 
- **workOnAll config:**
- ```json
- { "workOnAll": true, "workOnAllExclusions": ["needs-design", "blocked-external"], "maxConcurrentSelf": 2 }
+ ```bash
+ worktrain tell "skip the architecture review for the polling triggers PR, it's low risk"
+ worktrain tell "add knowledge graph vector layer to next sprint"
 ```
 
+ Each command appends to the queue. The daemon drains between agent completions -- never mid-run, always at a natural break point.
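The queue mechanics can be sketched in a few lines: `worktrain tell` appends one JSON object per line, and the daemon drains everything pending at its next natural break. File I/O is elided here; `enqueue`/`drain` model the JSONL contents in memory, and all names are illustrative rather than the actual implementation:

```typescript
// One queued instruction, serialized as a single JSONL line.
interface QueuedMessage {
  id: number;
  text: string;
  sentAt: string; // ISO timestamp
}

// `worktrain tell` would append one line to ~/.workrail/message-queue.jsonl.
function enqueue(queue: string[], text: string): string[] {
  const msg: QueuedMessage = { id: queue.length + 1, text, sentAt: new Date().toISOString() };
  return [...queue, JSON.stringify(msg)];
}

// Called between agent completions, never mid-run: returns every pending
// message and the emptied queue.
function drain(queue: string[]): { messages: QueuedMessage[]; queue: string[] } {
  return { messages: queue.map((l) => JSON.parse(l) as QueuedMessage), queue: [] };
}
```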
+
+ **Outbox (WorkTrain -> user):** WorkTrain appends notifications to `~/.workrail/outbox.jsonl`. A mobile client polls this file, or an HTTP SSE endpoint wraps it.
+
+ **This is the foundation for mobile monitoring.** The mobile app is just a client that reads the outbox and writes to the message queue.
+
+ **Things to hash out:**
+ - Messages in the queue are natural language instructions. How does the daemon interpret and act on them reliably? Is there a classification step, or is the message passed directly to an LLM for interpretation?
+ - What prevents a malicious or accidental message from authorizing dangerous actions ("merge all PRs" or "delete the worktree")? Is there a permission model for message queue instructions?
+ - "Drained between agent completions" means messages could wait minutes or hours during a long session. Is this latency acceptable for all message types, or should high-priority messages have a faster path?
+ - How long do messages persist in the queue? Is there a TTL, and what happens to messages that expire before being processed?
+ - Should the outbox and message queue be per-workspace or global? A global queue makes cross-workspace messaging simple but creates coordination complexity.
+
  ---
 
- ### Escalating review gates based on finding severity
+ ### Periodic analysis agents
 
- **Status: idea** | Priority: medium
+ **Status: idea** | Priority: low
 
- **Problem:** "Blocking" is binary -- a single Critical finding and a trivially incorrect comment are treated identically.
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: yes (needs scheduled tasks)
 
- **Right behavior:** After a fix round, if re-review still returns Critical:
- 1. Another full MR review -- confirm the Critical is real, not a false positive
- 2. Production readiness audit -- a Critical finding often implies a runtime risk
- 3. Architecture audit -- if the Critical is architectural
+ Agents on a schedule that proactively identify issues, gaps, and improvement opportunities:
 
- Routing by `finding.category` from `wr.review_verdict`:
- - `correctness` / `security` -> always trigger prod audit
- - `architecture` / `design` -> trigger arch audit
- - All -> trigger re-review
+ - **Weekly: Code health scan** -- `architecture-scalability-audit` on modules not audited in 30 days
+ - **Weekly: Test coverage scan** -- files modified with zero/low test coverage
+ - **Weekly: Documentation drift scan** -- recently merged PRs changed behavior described in docs
+ - **Monthly: Dependency health scan** -- CVEs, active forks, lighter alternatives
+ - **Monthly: Performance baseline** -- benchmark scenarios vs previous month
+ - **Continuous: Security scan** -- on every PR merge, OWASP top 10 patterns in changed files
+ - **Monthly: Ideas generation** -- `wr.discovery` on codebase + backlog + session history, asking "what's the most impactful thing we could build next?"
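The cadences above reduce to a simple due-check the daemon scheduler could run. A sketch under assumed interval-based schedules (agent names and the `dueAgents` helper are illustrative, not the actual scheduler):

```typescript
// One periodic analysis agent; intervalDays 7 = weekly, 30 = monthly,
// 0 = event-driven (e.g. the on-PR-merge security scan), which a time-based
// scheduler should ignore.
interface AnalysisAgent {
  name: string;
  intervalDays: number;
  lastRun?: Date;
}

function dueAgents(agents: AnalysisAgent[], now: Date): string[] {
  const msPerDay = 24 * 60 * 60 * 1000;
  return agents
    .filter((a) => a.intervalDays > 0) // event-driven scans fire from hooks, not the clock
    .filter((a) => !a.lastRun || now.getTime() - a.lastRun.getTime() >= a.intervalDays * msPerDay)
    .map((a) => a.name);
}
```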
 
- **Hard rule:** A PR that triggered the escalating audit chain should NEVER auto-merge. Human explicit approval required.
+ **Things to hash out:**
+ - Each weekly/monthly agent runs on a schedule. What is the concurrency interaction with active task sessions? Do analysis agents run in background slots, or do they compete for the same pool?
+ - The "Monthly: Ideas generation" agent can write to the backlog. Who reviews ideas before they are acted upon? Without a review gate, the backlog could accumulate LLM-generated noise.
+ - What triggers the continuous security scan on every PR merge? Is this a delivery hook, a webhook, or a polling trigger? The latency requirement ("continuous") is different from the weekly scans.
+ - Should these agents be configurable per workspace (enable/disable, change schedule) or globally controlled by WorkTrain?
+ - What is the cost profile for running all of these agents monthly? Token cost, LLM API cost, and compute time add up across a busy repo.
 
 ---
 
- ### Workflow execution time tracking and prediction
+ ### Monitoring, analytics, and autonomous remediation
 
- **Status: idea** | Priority: medium
+ **Status: idea** | Priority: low
 
- **Problem:** Timeouts are set by intuition. No data on how long workflows actually take.
+ **Score: 8** | Cor:1 Cap:2 Eff:1 Lev:2 Con:2 | Blocked: no
 
- **What to track:** For every completed session -- workflow ID, total wall-clock duration, turn count, step advances, outcome, task complexity signals. Store in `~/.workrail/data/execution-stats.jsonl`.
+ WorkTrain watches application health metrics (error rate, latency, session success/failure rate, queue depth), identifies anomalies, investigates root causes, and resolves what it can automatically.
 
- **Uses:**
- - Calibrate timeouts automatically (p95 * 1.5)
- - Predict duration before dispatch
- - Step-advance rate as workflow efficiency proxy
+ **Monitoring loop:** Detect anomaly -> classify severity -> investigate with `bug-investigation.agentic.v2` -> if confidence >= 0.8 and severity <= High, attempt auto-remediation (config/feature-flag fix, code fix) or else escalate with full findings.
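The escalation decision in that loop is small enough to sketch. The 0.8 confidence floor and the "at most High" severity cap come from the text above; the function name and types are illustrative, not the actual implementation:

```typescript
type Severity = "Low" | "Medium" | "High" | "Critical";

// Rank severities so "severity <= High" is a comparison, not string matching.
const severityRank: Record<Severity, number> = { Low: 0, Medium: 1, High: 2, Critical: 3 };

// Auto-remediate only when the investigation is confident enough AND the
// anomaly is at most High severity; everything else escalates to a human
// with the full investigation findings attached.
function shouldAutoRemediate(confidence: number, severity: Severity): boolean {
  return confidence >= 0.8 && severityRank[severity] <= severityRank.High;
}
```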
 
- **Implementation:** Append to `execution-stats.jsonl` in `runWorkflow()`'s finally block.
+ **Analytics dashboard:** Per-module PR cycle time, workflow step failure rates, token cost per session type, quality score (weighted composite of review accuracy + coding success rate + investigation accuracy).
+
+ **Things to hash out:**
+ - "Auto-remediation (config/feature-flag fix, code fix)" is a significant autonomous action in response to a production anomaly. What safeguards prevent a false positive from triggering a harmful automated change?
+ - What is the source of "application health metrics" -- is WorkTrain reading from an external monitoring system, or monitoring its own daemon health? These are very different scopes.
+ - The quality score is a weighted composite. Who determines the weights, and how are they recalibrated when the component metrics change?
+ - How does this interact with the knowledge graph and session store? The analytics dashboard presumably reads from both -- is there a query API, or is it direct file reads?
+ - "Continuous security scan on every PR merge" plus auto-remediation is a very tight loop. Who is responsible for reviewing auto-applied security fixes before they reach main?
 
 ---
 
- ### WorkRail MCP server self-cleanup
+ ### Cross-repo execution model
 
- **Status: idea** | Priority: medium
+ **Status: idea** | Priority: medium (post-MVP for hosted tier)
 
- **Sources of stale state:** old workflow copies in `~/.workrail/workflows/`, dead managed sources, stale git repo caches, 500+ sessions accumulating with no TTL, remembered roots for non-existent paths.
+ **Score: 9** | Cor:1 Cap:3 Eff:1 Lev:2 Con:2 | Blocked: no
 
- **Fix -- two layers:**
+ **Problem:** WorkRail currently assumes a single repo. The autonomous daemon breaks this -- a coding task may touch Android, iOS, and a backend API simultaneously.
 
- 1. **Startup auto-cleanup (light):** On MCP server startup, silently remove managed sources where the filesystem path doesn't exist. Log "removed N stale sources."
+ **Workspace manifest:** Sessions declare which repos they need:
+ ```json
+ {
+   "context": {
+     "repos": [
+       { "name": "android", "path": "~/git/my-project/android" },
+       { "name": "backend", "path": "~/git/my-project/backend" }
+     ]
+   }
+ }
+ ```
 
- 2. **`workrail cleanup` command:**
- ```
- workrail cleanup [--yes] [--sessions --older-than <age>] [--sources] [--cache] [--roots]
- ```
+ **Scoped tools:** `BashInRepo`, `ReadRepo`, `WriteRepo` that route to the correct working directory.
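The routing these scoped tools imply is just a lookup from repo name to working directory via the session's manifest. A sketch under that assumption (command execution elided; `resolveCwd` and the `RepoRef` shape are illustrative, not the actual tool API):

```typescript
// One entry from the session's workspace manifest above.
interface RepoRef {
  name: string;
  path: string;
}

// A BashInRepo("android", "./gradlew test") call would run with
// cwd = resolveCwd(manifest.repos, "android"). Undeclared repos fail loudly,
// so an agent cannot silently touch a repo outside its manifest.
function resolveCwd(repos: RepoRef[], repoName: string): string {
  const repo = repos.find((r) => r.name === repoName);
  if (!repo) throw new Error(`repo not declared in session manifest: ${repoName}`);
  return repo.path;
}
```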
+
+ **Dynamic provisioning:** If the repo is already cloned locally, use it. If declared as a remote URL, clone to `~/.workrail/repos/<name>/`.
+
+ **This is the feature that makes WorkRail truly freestanding** for multi-repo development teams.
+
+ **Things to hash out:**
+ - `BashInRepo`, `ReadRepo`, `WriteRepo` are new tool variants scoped to a named repo. How does the agent know which repo to address -- is the repo name part of the tool call, or is the default repo set at session start?
+ - If a session spans repos with different languages (Android/Kotlin + backend/TypeScript), does WorkRail need language-aware context strategies for each, or is the tooling language-agnostic?
+ - Dynamically cloning a repo to `~/.workrail/repos/<name>/` at session start could take significant time for large repos. Is this acceptable latency, or does the design require pre-cloned repos?
+ - Cross-repo sessions that make commits to multiple repos need atomic rollback semantics if one repo's commit fails. Is this in scope, or is it the agent's responsibility?
+ - Should cross-repo sessions be allowed for solo developers with a single GitHub account, or does this primarily target team setups with broader permissions?
 
 ---
 
- ### Subagent context packaging
+ ### Long-term vision: WorkRail as a general engine, domain packs as configuration
 
- **Status: idea** | Priority: medium
+ **Status: idea** | Priority: long-term
 
- **Problem:** When a main agent spawns a subagent, the work package is too thin. The main agent has rich context (why this approach was chosen, what was tried, what constraints were discovered) but packages the subagent task as a one-liner.
+ **Score: 8** | Cor:1 Cap:2 Eff:1 Lev:2 Con:2 | Blocked: no
 
- **Design (Option B -- structured work package):**
- ```typescript
- spawnSession({
-   workflowId: 'coding-task-workflow-agentic',
-   goal: '...',
-   context: {
-     whyThisApproach: '...',
-     alreadyTried: [...],
-     knownConstraints: [...],
-     relevantFiles: [...],
-     completionCriteria: '...'
-   }
- })
- ```
+ WorkTrain is not just a coding tool. The underlying engine -- session management, workflow enforcement, daemon, agent loop, knowledge graph, context bundle assembly -- is domain-agnostic.
 
- **Context mode:** `context: 'inherit' | 'blank' | 'custom'`. Blank is for adversarial roles (challenger, reviewer) where anchoring to main-agent context is counterproductive.
+ **Domain packs:** Self-contained configuration bundles that specialize WorkTrain for a specific problem domain: a set of workflows, a knowledge graph schema, context bundle query definitions, trigger definitions, a daemon soul template.
 
- **Session knowledge log:** As the main agent progresses, it appends to `session-knowledge.jsonl` -- decisions, user pushback, relevant files, constraints, things tried and failed. Auto-included in subagent work packages.
+ **Examples:** `worktrain-coding` (current default), `worktrain-research`, `worktrain-creative`, `worktrain-ops`, `worktrain-data`.
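One possible shape for the domain pack contract described above, purely as a sketch -- no such interface exists in WorkRail today, and every field name is an assumption:

```typescript
// Hypothetical domain pack manifest: everything listed in the "Domain packs"
// paragraph above, as data.
interface DomainPack {
  name: string;                 // e.g. "worktrain-coding"
  workflows: string[];          // workflow IDs this pack ships
  knowledgeGraphSchema: string; // path to the KG schema definition
  contextQueries: string[];     // context bundle query definitions
  triggers: string[];           // trigger definition files
  soulTemplate: string;         // daemon soul template path
}

// Minimal sanity checks a pack loader might run at install time.
function validatePack(pack: DomainPack): string[] {
  const problems: string[] = [];
  if (!pack.name.startsWith("worktrain-")) problems.push("pack name should be worktrain-<domain>");
  if (pack.workflows.length === 0) problems.push("a pack must ship at least one workflow");
  return problems;
}
```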
+
+ **When to make it explicit:** The right time is when a second domain is ready to be added. Extract the coding-specific pieces into `worktrain-coding` and establish the domain pack contract.
+
+ **Things to hash out:**
+ - What exactly is the boundary between the "domain-agnostic engine" and the "coding domain pack"? Some features feel fundamental (session store, HMAC tokens) while others feel domain-specific (worktree management, git integration). Where is the line?
+ - How would domain packs be distributed and versioned? Is this a package manager model, a git submodule, or a bundled registry?
+ - Can multiple domain packs be active simultaneously for a single workspace, or is it one pack per workspace?
+ - The "right time is when a second domain is ready" -- what does "ready" mean? A prototype, a production use case, or explicit user demand?
 
  ---
 
- ### Workflow-scoped system prompts for subagents
+ ### WorkTrain as a native macOS app (Apr 18, 2026)
 
- **Status: idea** | Priority: medium
+ **Status: idea** | Priority: low / long-term
 
- **Design:** Workflows (and individual steps) can declare a `systemPrompt` field injected into subagent sessions.
+ **Score: 6** | Cor:1 Cap:1 Eff:1 Lev:1 Con:2 | Blocked: no
 
- ```json
- {
-   "id": "mr-review-workflow.agentic.v2",
-   "systemPrompt": "You are an adversarial code reviewer. Your job is to find problems, not validate the approach.",
-   "steps": [...]
- }
- ```
+ Long-term vision: WorkTrain becomes a full native Mac app -- menubar icon, system notifications, windows, native UX.
 
- Step-level `systemPrompt` overrides workflow-level for that step.
+ **What this unlocks:** always-on menubar presence showing daemon status; native macOS notifications (currently via osascript -- the app version would use the UserNotifications framework directly); `worktrain status` overview as a native window; message queue and inbox as a native interface; background daemon management from the menubar without a terminal.
 
- **Composition layers:**
- 1. WorkTrain base prompt
- 2. Workflow-level `systemPrompt`
- 3. Step-level `systemPrompt`
- 4. Soul file (operator behavioral rules)
- 5. AGENTS.md / workspace context
- 6. Session knowledge log (if `context: 'inherit'`)
- 7. Step prompt
+ **Tech stack options:**
+ - Swift/SwiftUI: full native, best macOS integration
+ - Tauri: Rust core + existing web frontend, lighter than Electron (recommended path)
+ - Electron + existing console UI: fastest path, same TypeScript codebase, but heavy
+
+ **Things to hash out:**
+ - A native app wrapping a daemon means the daemon becomes an app subprocess or a launchd service. Which model fits better, and does it change the daemon's lifecycle management?
+ - Tauri requires Rust knowledge that the current team may not have. Is the recommended path realistic given the team's current skills?
+ - macOS Gatekeeper and notarization requirements add significant release overhead for a signed app. Is this factored into the timeline estimate?
+ - How does the macOS app interact with the existing console web UI? Are they two separate UIs, or does the native app embed the web console?
+ - What happens to the CLI (`worktrain` commands) in the native app world -- do they remain the primary interface, or become secondary?
 
  ---
 
- ### `context-gather` step type
+ ### Long-running sessions: stay open across agent handoffs (Apr 18, 2026)
 
 **Status: idea** | Priority: medium
 
- **Problem:** Phase 0.5 in the coding workflow currently looks for a shaped pitch by checking a local path. This doesn't handle coordinator-injected context, manually written docs (GDoc, Confluence, Notion), Glean-indexed artifacts, or URLs embedded in the task description. The search logic is duplicated if other workflows need the same document.
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: yes (needs session continuation)
 
- **Proposed primitive:**
- ```json
- {
-   "type": "context-gather",
-   "id": "gather-pitch",
-   "contextType": "shaped-pitch",
-   "outputVar": "shapedInput",
-   "optional": true,
-   "sources": ["coordinator-injected", "local-paths", "task-url", "glean"]
- }
+ Today, when an MR review session completes, it writes its findings and exits. If the findings require fixes, a new fix agent starts from scratch with no shared context. Three sessions that are logically one unit of work are isolated from each other.
+
+ **The vision:** a session can stay open and wait -- dormant but alive -- while another agent does work. When that work completes, the waiting session resumes with full context continuity.
+
+ **The MR review example:**
+ ```
+ [MR review session] finds: 2 critical, 3 minor
+   -> stays open, waiting for fixes
+ [Fix agent session] addresses all 5 findings -> signals "fixes ready"
+ [MR review session resumes] re-reads the diff, re-evaluates
+   -> all 5 verified fixed, 0 new findings -> completes with APPROVE verdict
  ```
 
- **Source resolution order (stops at first hit):**
- 1. `coordinator-injected` -- coordinator already attached context of this type
- 2. `local-paths` -- check `.workrail/current-pitch.md`, `pitch.md`, `.workrail/pitches/`
- 3. `task-url` -- extract any URL from task description and fetch
- 4. `glean` -- search Glean for recent docs matching task keywords (opt-in only)
+ The same session that found the issues verifies the fixes. No context reconstruction. No risk of re-review missing something the original reviewer knew.
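The lifecycle this implies can be sketched as a small state machine: the review session enters a dormant state instead of completing, and resumes on a signal. State names, events, and transition rules here are illustrative, not a proposed schema:

```typescript
type SessionState = "running" | "dormant" | "completed" | "failed";

interface Transition { from: SessionState; event: string; to: SessionState }

// Transitions for the MR review example above.
const transitions: Transition[] = [
  { from: "running", event: "findings_written", to: "dormant" }, // wait for fixes
  { from: "dormant", event: "fixes_ready", to: "running" },      // resume, re-review
  { from: "running", event: "approve", to: "completed" },
  { from: "dormant", event: "timeout", to: "failed" },           // fix agent never signalled
];

function next(state: SessionState, event: string): SessionState {
  const t = transitions.find((x) => x.from === state && x.event === event);
  if (!t) throw new Error(`illegal transition: ${state} + ${event}`);
  return t.to;
}
```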
 
- **Why engine-level:** Coordinator intercept requires the engine to check "has this type already been provided?" before running any search. A routine can't express that.
+ **Requires:** session continuation / post-completion phases architecture (already in the backlog under "Session as a living append-only record").
+
+ **Things to hash out:**
+ - Does a dormant-but-alive session hold its conversation history in memory, or must it be re-loaded from the event store on resume? If re-loaded, does the LLM truly have "full context continuity," or is it a reconstruction?
+ - How long can a session remain dormant? If the fix agent takes 2 hours, the reviewing session holds its slot for 2 hours. Is that acceptable given concurrency limits?
+ - What signals the reviewing session that "fixes are ready"? Is this a steer injection, a new `await_sessions` result, or a tool call from the fix agent?
+ - What happens if the fix agent fails or produces a partial fix? Does the reviewing session resume anyway, or only on clean completion?
+ - Should dormant sessions count against `maxConcurrentSessions`? If yes, long-running coordinated pipelines could exhaust the pool.
 
 ---
 
- ## WorkRail MCP Server
+ ### Coordinatable workflow steps: confirmation points the coordinator can satisfy (Apr 18, 2026)
 
- The stdio/HTTP MCP server that Claude Code (and other MCP clients) connect to. MUST be bulletproof -- crashes kill all in-flight Claude Code sessions.
+ **Status: idea** | Priority: medium -- needs discovery before implementation
 
+ **Score: 8** | Cor:1 Cap:2 Eff:2 Lev:2 Con:1 | Blocked: no
 
+ Workflows already have `requireConfirmation: true` on certain steps -- these are natural coordination points. Right now they pause for a human. The idea is to make them also pausable for a coordinator, so that a coordinator (or another agent) can be the one that responds instead of a human.
 
- ## Console
+ **The vision:** a workflow reaches a `requireConfirmation` step. In MCP mode (human-driven), it behaves exactly as today -- pauses and waits. In daemon/coordinator mode, instead of blocking forever, the coordinator can:
+ - Inject a synthesized answer based on external work it just did ("architecture review found X, proceed with approach A")
+ - Spawn another agent to generate the answer and inject its output
+ - Simply forward a human's message from the message queue
 
- ### Task picker mode: browse and launch available work (Apr 29, 2026)
+ The original session never knows whether a human or a coordinator satisfied the confirmation. It just receives the next turn with context.
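Even if the session treats both identically, the audit trail should not: a coordinator-satisfied gate needs its reasoning recorded. A sketch of that bookkeeping (all names are illustrative assumptions, not a proposed protocol):

```typescript
// One response to a pending requireConfirmation step.
interface ConfirmationResponse {
  stepId: string;
  answer: string;
  satisfiedBy: "human" | "coordinator";
  reasoning?: string; // why the coordinator answered as it did, for the audit log
}

// The session consumes the response either way; the audit trail enforces
// that coordinator responses always carry reasoning.
function validateResponse(r: ConfirmationResponse): ConfirmationResponse {
  if (r.satisfiedBy === "coordinator" && !r.reasoning) {
    throw new Error("coordinator responses must carry reasoning for the audit trail");
  }
  return r;
}
```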
 
- **Status: idea** | Priority: high
+ **Open design questions:** How does the coordinator "subscribe" to pending confirmations? What's the protocol for injecting the response -- is it a steer, or a new continue_workflow call? What if a coordinator response conflicts with what the human would have said?
 
- **Problem:** Once WorkTrain is configured (workspace set up, triggers.yml written, daemon running), there is still no easy way to say "run this workflow now" from the console. Dispatch requires knowing the API or writing a webhook. The console has a dispatch endpoint but no UI to drive it.
+ **Things to hash out:**
+ - Should the coordinator be able to satisfy any `requireConfirmation` step, or only steps explicitly marked as coordinator-satisfiable? An unexpected coordinator response on a step intended for human review could bypass important gates.
+ - If both a coordinator response and a human message queue entry are available for the same confirmation, which takes precedence?
+ - How does the session handle a confirmation that arrives after the session has timed out waiting? Is the response discarded, or does it attempt to resume the session?
+ - What is the audit trail for coordinator-satisfied confirmations? Operators need to be able to see "this gate was satisfied by the coordinator with this reasoning" distinct from human approvals.
 
- **Vision:** A console panel that lists the triggers already configured in triggers.yml and lets the user click one to fire it immediately -- without leaving the browser, without touching the API, without writing YAML.
+ ---
 
- **How it works:**
- 1. Console calls `GET /api/v2/triggers` to list all triggers loaded by the daemon.
- 2. User sees a list: trigger ID, workflow, goal, last-fired timestamp. Clicks "Run".
- 3. Console POSTs to `/api/v2/auto/dispatch` (already implemented) with the trigger's workflowId + goal + workspace.
- 4. New session appears in the session list immediately. User watches the DAG advance live.
- 5. On completion: outcome, PR link (if opened), and step notes all visible in the same panel.
+ ### wr.shaping workflow: shape messy problems into implementation-ready specs (Apr 18, 2026)
 
- **What this is not:** An onboarding wizard or zero-setup flow -- the daemon and environment must already be configured. This is a dispatch surface for *already-configured* users who want to trigger work without using the CLI or waiting for a webhook.
+ **Status: ready to author** | Priority: medium
 
- **Why it matters:** Makes the console a control plane, not just a read-only viewer. The daemon gains a "run this now" button. Users get to watch the agent work in real time, which builds confidence before trusting it on unattended tasks.
+ **Score: 11** | Cor:1 Cap:3 Eff:2 Lev:2 Con:3 | Blocked: no
 
- **Dependency:** `GET /api/v2/triggers` endpoint (returns the live trigger index -- may need to be added). `POST /api/v2/auto/dispatch` already exists. No new daemon work required.
+ WorkRail has `wr.discovery` (divergent) and `coding-task-workflow-agentic` (convergent). Shaping is the missing middle -- converting messy discovery output into a bounded, implementation-ready spec without mid-implementation rabbit holes.
+
+ **Design docs:** `docs/design/shaping-workflow-discovery.md` (WorkRail-internal discovery findings), `docs/design/shaping-workflow-external-research.md` (Shape Up, LLM failure modes, artifact schema).
+
+ **The 11-step skeleton:**
+ 1. `ingest_and_extract` -- extract problem frames, forces, open questions
+ 2. `frame_gate` -- MANDATORY HUMAN GATE: confirm problem + appetite
+ 3. `diverge_solution_shapes` -- 4 parallel rough shapes with varied framings
+ 4. `converge_pick` -- SEPARATE JUDGE (different model/prompt): pick best shape
+ 5. `breadboard_and_elements` -- fat-marker breadboard + Interface/Invariant/Exclusion classification
+ 6. `rabbit_holes_nogos` -- adversarial: risks, mitigations, no-gos
+ 7. `scope_and_slices` -- break into implementable slices with dependencies
+ 8. `spec_draft` -- write the shaped pitch in full (problem + appetite + solution + no-gos + slices)
+ 9. `spec_review` -- second-pass review of the spec for completeness and ambiguity
+ 10. `spec_gate` -- MANDATORY HUMAN GATE: approve spec before implementation starts
+ 11. `output_artifacts` -- write `current-shape.json`, `SPEC.md`; update `open-work-inventory.md`
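The skeleton above, expressed as data with the two mandatory human gates marked -- a sketch of what the authored workflow definition might contain, not the actual WorkRail workflow schema:

```typescript
interface ShapingStep { id: string; humanGate?: boolean }

// The 11 steps, in order; `humanGate` marks the two mandatory stops.
const shapingSteps: ShapingStep[] = [
  { id: "ingest_and_extract" },
  { id: "frame_gate", humanGate: true },
  { id: "diverge_solution_shapes" },
  { id: "converge_pick" },
  { id: "breadboard_and_elements" },
  { id: "rabbit_holes_nogos" },
  { id: "scope_and_slices" },
  { id: "spec_draft" },
  { id: "spec_review" },
  { id: "spec_gate", humanGate: true },
  { id: "output_artifacts" },
];

const gateIds = shapingSteps.filter((s) => s.humanGate).map((s) => s.id);
```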
+
+ **Things to hash out:**
+ - `diverge_solution_shapes` produces 4 parallel shapes. Does this mean 4 parallel sessions, or 4 outputs from a single session? The resource and token costs differ significantly.
+ - `converge_pick` uses a "SEPARATE JUDGE (different model/prompt)." How is the judge configured -- as a separate workflow step, a separate API call, or simply a different prompt to mitigate self-evaluation bias?
+ - Who reads and validates the shaped spec between `spec_review` and the `spec_gate` human approval? If the human doesn't have context from the earlier steps, the gate is rubber-stamping.
+ - The 11-step workflow writes to `open-work-inventory.md` in the final step. This is a shared planning file -- what happens if two shaping sessions run concurrently for different problems?
+ - `Status: ready to author` -- what is blocking authoring? Is this waiting on the artifacts-as-first-class-citizens feature, or can it be authored with the current filesystem-based approach?
 
 ---
 
- ### Console interactivity and liveliness
+ ### Artifacts as first-class citizens: explorable, accessible, out of the repo (Apr 18, 2026)
 
 **Status: idea** | Priority: medium
 
- **Key areas:**
- - **DAG node hover effects** -- nodes in `RunLineageDag` should have visible hover states: border brightens, subtle glow, cursor changes to pointer. This is the single highest-impact item.
- - **Node selection highlight** -- selected node should pulse or glow, not just change border color
- - **Live session pulse** -- sessions with `status: in_progress` could have a subtle periodic animation
- - **Tooltip polish** -- fade in/out rather than appearing instantly
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no
 
- **Design constraint:** Dark navy, amber accent aesthetic. Additions should reinforce this language.
+ Every autonomous session dumps `design-candidates.md`, `implementation_plan.md`, `design-review-findings.md`, etc. as files in the repo root or worktrees. They are not indexed or searchable, not visible in the console, not accessible to other sessions, polluting the repo with ephemeral working documents, and lost when worktrees are cleaned up.
 
- **Where to start:** `console/src/components/RunLineageDag.tsx`. The tooltip pattern (`handleNodeMouseEnter`/`handleNodeMouseLeave`) already exists; a hover glow is a natural peer addition.
+ **The right model:** artifacts are WorkTrain data, not filesystem files. Any structured output from a session that has value beyond the session itself -- handoff docs, design candidates, implementation plans, review findings, spec files, investigation summaries -- should be stored in the session store and accessible via the console.
 
- **Related:** `docs/design/console-cyberpunk-ui-discovery.md`, `docs/design/console-ui-backlog.md`
+ **What an artifact is:** a named, typed, versioned blob produced by a session. Stored in `~/.workrail/data/artifacts/<sessionId>/`. Referenced from the session event log via an `artifact_recorded` event. Accessible to other sessions via `read_artifact(sessionId, name)`.
 
- ---
+ **Console integration:** an "Artifacts" tab on session detail. Each artifact shows name, type, size, and content. An "Add to repo" button copies the artifact into the workspace as a markdown file for the cases where the author wants it in git.
 
- ### Console engine-trace visibility and phase UX
+ **Build order:** `artifact_recorded` event kind in the session store; `read_artifact` tool for daemon agents; Console artifacts tab; garbage collection policy (artifacts older than N days deleted unless pinned).
 
- **Status: idea** | Priority: medium
+ **Things to hash out:**
+ - If artifacts replace filesystem files, what happens to the existing workflow steps that write to `design-candidates.md`, `implementation_plan.md`, etc. in the repo? Is migration required, or do both models coexist?
+ - What is the artifact storage format -- raw Markdown, structured JSON, or type-specific? How does the console render artifacts of different types?
+ - The `read_artifact(sessionId, name)` API gives any session read access to any other session's artifacts. What is the authorization model -- should all sessions have access to all artifacts, or is it scoped to related sessions?
+ - How does garbage collection interact with the console's "Artifacts" tab? If an artifact is displayed in the console but has been garbage collected, what does the user see?
+ - Are artifacts immutable once written, or can a session append to or replace an existing artifact?
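The storage and access model above can be sketched as follows. This is a minimal sketch under stated assumptions: `ArtifactRecord`, `recordArtifact`, and `readArtifact` are hypothetical names, a temp directory stands in for `~/.workrail/data/artifacts/`, and the real event schema would live in `src/v2/durable-core/schemas/session/events.ts`.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical event payload shape, not the real schema.
interface ArtifactRecord {
  sessionId: string;
  name: string;      // e.g. "implementation_plan.md"
  type: string;      // e.g. "markdown"
  version: number;
  recordedAt: string;
}

// Temp directory standing in for ~/.workrail/data/artifacts/.
const root = fs.mkdtempSync(path.join(os.tmpdir(), "workrail-artifacts-"));

// Write the blob under <root>/<sessionId>/ and return the payload that an
// `artifact_recorded` event would carry in the session event log.
function recordArtifact(sessionId: string, name: string, type: string, content: string): ArtifactRecord {
  const dir = path.join(root, sessionId);
  fs.mkdirSync(dir, { recursive: true });
  fs.writeFileSync(path.join(dir, name), content, "utf8");
  return { sessionId, name, type, version: 1, recordedAt: new Date().toISOString() };
}

// Cross-session read access; the authorization question in the list above
// is deliberately left open here.
function readArtifact(sessionId: string, name: string): string {
  return fs.readFileSync(path.join(root, sessionId, name), "utf8");
}

const event = recordArtifact("sess_a1", "implementation_plan.md", "markdown", "# Plan\n");
console.log(event.name); // -> implementation_plan.md
```

Versioning here is a stub (always 1); whether artifacts are immutable or versioned is one of the open questions above.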
 
- **Gap:** Users currently see only `node_created`/`edge_created`, which makes legitimate engine behavior look like missing workflow phases. Fast paths, skipped phases, condition evaluation, and loop gates are invisible.
+ ---
 
- **Recommended direction:**
- - Keep phases as authoring/workflow-organization concepts
- - Add an engine-trace/decision layer showing: selected next step, evaluated conditions, entered/exited loops, important run context variables (e.g. `taskComplexity`), skipped/bypassed planning paths
+ ### Business model (tentative)
 
- **Phase 1:** Extend console service/DTOs with a run-scoped execution-trace summary. Show a compact "engine decisions" strip or timeline above the DAG.
+ Three tiers:
 
- **Phase 2:** Richer explainability timeline with branches, skipped phases, condition results. Toggle between "execution DAG" and "engine trace" views.
+ | Tier | Who | Price | Notes |
+ |------|-----|-------|-------|
+ | **Personal / OSS** | Individual devs, open-source projects, non-commercial | Free forever | Builds community, reputation, workflow library. Never charge for this. |
+ | **Corporate self-hosted** | Companies running WorkRail on their own infrastructure | Paid license | Data never leaves their VPC. Priced per seat or per org. |
+ | **WorkRail Auto (cloud)** | Anyone who wants managed, zero-ops | Paid subscription | Higher price, lower friction. Pre-configured integrations. |
 
- ---
+ **License model options:**
+ - **Dual-license:** AGPL for open-source use, commercial license for everyone else who doesn't want AGPL obligations
+ - **MIT core + paid features:** Core engine stays MIT forever; advanced features (hosted dashboard, enterprise SSO, multi-tenant credential vault, audit logs) are paid
 
- ### Console ghost nodes (Layer 3b)
+ **The corporate self-hosted market is often the most lucrative.** Enterprises pay well for "runs in our VPC, vendor can't see our code." GitLab, Grafana, and Atlassian (Jira) all built significant businesses on self-hosted enterprise licenses before or alongside their cloud offerings.
 
- **Status: idea** | Priority: low
+ **What NOT to do:** Don't charge for the workflow library or the core MCP protocol. Those are the commons that make WorkRail valuable. Charge for the infrastructure layer, not the knowledge layer.
 
- Ghost nodes represent steps that were compiled into the DAG but skipped at runtime due to `runCondition`. Currently the DAG just shows fewer nodes with no indication of what was bypassed. Layer 3b would render skipped nodes as faded/ghost elements with a tooltip explaining the skip condition.
+ **Priority:** Don't worry about this until there are users.
 
- ---
+ **Things to hash out:**
+ - The AGPL dual-license model requires companies using WorkRail in their products to either open-source those products or buy a commercial license. Is this the intended friction, and is it calibrated correctly for the target market?
+ - What qualifies as "commercial use" in the MIT core + paid features model? A company running the free engine internally without distributing it -- is that commercial use?
+ - Who decides which features are "advanced" (paid) vs "core" (free)? This decision shapes the community's willingness to contribute.
+ - The corporate self-hosted market requires sales, invoicing, and legal infrastructure. Is there a plan for those operational capabilities, or is this purely a product decision for now?
+ - How does the open-source community react if features they contributed to are moved behind a paywall? Is there a policy for handling contributions to the paid tier?
 
- ## Workflow Library
+ ---
 
- ### General-purpose workflow / intelligent dispatcher
+ ### WorkTrain benchmarking: prove it's better, publish the results (Apr 18, 2026)
 
  **Status: idea** | Priority: medium
 
- Two related ideas:
+ **Score: 10** | Cor:1 Cap:2 Eff:3 Lev:2 Con:2 | Blocked: no
 
- **`wr.quick-task`** -- the simplest possible workflow. 2 steps: do the work, call complete_step. No complexity routing, no design review, no phased implementation. For tasks under ~10 minutes. Currently small tasks go through `wr.coding-task`'s Small fast-path, which is still heavier than needed.
+ If WorkTrain can demonstrably outperform one-shot LLM calls and human-in-the-loop review for specific task types, with reproducible benchmarks published on GitHub and visible in the console, that is the killer adoption argument.
 
- **`wr.dispatch`** -- an intelligent routing workflow. Given a goal, classify it and route to the right workflow: `wr.quick-task` | `wr.research` | `wr.coding-task` | `wr.mr-review` | `wr.competitive-analysis`. The general-purpose entry point -- not a workflow that does everything, but one that decides which workflow to use. The adaptive pipeline coordinator already does this for the queue-poll trigger; the question is whether to expose it as a named user-facing workflow.
+ **What to benchmark:**
 
- Open questions: does `wr.dispatch` replace `workflowId` in trigger config, or coexist alongside it? How does it handle tasks that don't fit any known workflow?
+ | Dimension | WorkTrain | One-shot | Human-in-loop |
+ |-----------|-----------|----------|---------------|
+ | MR review finding rate (Critical/Major caught) | ? | ? | ? |
+ | False positive rate | ? | ? | ? |
+ | Coding task correctness (builds + tests pass) | ? | ? | ? |
+ | Bug investigation accuracy (correct root cause) | ? | ? | ? |
+ | Time to complete | ? | ? | ? |
+ | Token cost per task | ? | ? | ? |
 
- ---
+ **Also within WorkTrain:** Haiku (fast, cheap) vs Sonnet (balanced) vs Opus (best) for each task type. Does workflow structure make Haiku competitive with Sonnet one-shot? (hypothesis: yes, for structured tasks)
 
- ### MR review session count inflation
+ **The benchmark suite:**
+ 1. MR review benchmark -- 50 PRs with known ground truth. Score: recall + precision.
+ 2. Coding task benchmark -- 50 tasks with objective completion criteria. Score: % completing correctly on first autonomous run.
+ 3. Bug investigation benchmark -- 30 real bugs with known root causes. Score: % identifying correct root cause.
+ 4. Discovery quality benchmark -- 20 design questions with expert-evaluated answers.
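Scoring for benchmark 1 could look like the sketch below. Hedge: `Finding` and matching findings by `id` against hand-labeled ground truth are assumptions for illustration, not the real schema.

```typescript
// Hypothetical finding shape; how ground truth is labeled is an open question.
interface Finding { id: string; severity: "Critical" | "Major" | "Minor"; }

// Recall: share of ground-truth findings the run caught.
// Precision: share of reported findings that are real.
function scoreReview(found: Finding[], truth: Finding[]) {
  const truthIds = new Set(truth.map(f => f.id));
  const truePositives = found.filter(f => truthIds.has(f.id)).length;
  const recall = truth.length === 0 ? 1 : truePositives / truth.length;
  const precision = found.length === 0 ? 1 : truePositives / found.length;
  return { recall, precision };
}

const truth: Finding[] = [
  { id: "sql-injection", severity: "Critical" },
  { id: "missing-null-check", severity: "Major" },
];
const found: Finding[] = [
  { id: "sql-injection", severity: "Critical" },
  { id: "style-nit", severity: "Minor" }, // false positive
];
console.log(scoreReview(found, truth)); // -> { recall: 0.5, precision: 0.5 }
```

Severity-weighted variants (counting only Critical/Major, per the table above) are a straightforward extension of the same comparison.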
 
- **Status: idea** | Priority: medium
+ **How to publish:** `docs/benchmarks/` directory, GitHub Actions CI job on each release, Console "Benchmarks" tab, badge in README: "MR review recall: 87% (Sonnet 4.6, v3.36.0)".
 
- A single PR review dispatches 6-12 autonomous sessions (one per reviewer family: correctness_invariants, runtime_production_risk, missed_issue_hunter, etc.). This inflates session counts, complicates cost attribution, and makes ROI calculations imprecise. Worth investigating: are all 6 families catching distinct issues, or is there significant overlap? Should families be parallelized into a single session with sub-agents rather than separate top-level sessions?
+ **Starting point:** the mr-review workflow. Start with 20 PRs where bugs were later discovered and 20 PRs that shipped cleanly. Run each through `mr-review-workflow-agentic` on several model tiers. That's a publishable result with one weekend of work.
+
+ **Things to hash out:**
+ - "Ground truth" for benchmark PRs requires human expert labeling of what the correct findings should be. Who does this labeling, and how is inter-rater reliability ensured?
+ - Benchmark results are model-version-specific. When a new model version releases, do all benchmarks need to be re-run? What is the cost and cadence?
+ - Publishing benchmarks that compare WorkTrain to "one-shot LLM" requires a controlled experimental setup. How are prompt and model variables controlled for the one-shot baseline?
+ - Should benchmark results be published even when they show WorkTrain performing worse than expected? The commitment to honest benchmarking needs to be explicit.
+ - A CI job that runs 50 PR reviews on every release is extremely expensive. What is the governance for this -- is it run manually, on major releases only, or on a separate schedule?
 
  ---
 
- ### Session trigger source attribution (daemon vs MCP)
+ ### Autonomous feature development: scope -> breakdown -> parallel execution -> merge (Apr 18, 2026)
 
  **Status: idea** | Priority: high
 
- No reliable way to determine whether a session was started by the daemon (WorkTrain) or a human via MCP (Claude Code). Every session-level metric and ROI calculation is ambiguous without this.
+ **Score: 9** | Cor:1 Cap:3 Eff:1 Lev:2 Con:2 | Blocked: yes (needs native multi-agent + scripts-first coordinator)
 
- **Fix:** Add `triggerSource: 'daemon' | 'mcp'` to `run_started` event data. One-line change at each entry point, makes attribution permanent and queryable from the event log.
+ Give WorkTrain a feature scope -- from a vague idea to a fully groomed ticket -- and it figures out the rest. Discovery if needed, design if needed, breakdown into parallel slices, execution across worktrees, context management across agents, bringing it all back together.
 
- Files: `src/v2/durable-core/schemas/session/events.ts`, `src/mcp/handlers/v2-execution/start-workflow.ts`, `src/daemon/workflow-runner.ts`.
+ **The four pillars:**
+ 1. **Autonomy** -- WorkTrain takes a scope and figures out the work breakdown without hand-holding
+ 2. **Quality** -- comes FROM autonomy + workflow enforcement + coordination
+ 3. **Throughput** -- parallel slices across worktrees simultaneously
+ 4. **Visibility** -- one coherent work unit you can track at a glance
 
- ---
+ **The pipeline for a scope:**
+ ```
+ Input: "add GitHub polling support" (any level of definition)
+ -> [if vague] ideation + spec authoring
+ -> classify-task -> taskComplexity, hasUI, touchesArchitecture, taskMaturity
+ -> [if Medium/Large] discovery
+ -> [if touchesArchitecture] design + review
+ -> breakdown -> parallel slices with dependency graph
+      Slice 1: types + schema (worktree A)
+      Slice 2: polling adapter (worktree B, depends: 1)
+      Slice 3: scheduler integration (worktree C, depends: 2)
+      Slice 4: tests (worktree D, depends: 1-3)
+ -> [parallel execution] each slice: implement -> review -> approved
+ -> [serial integration] merge slices in dependency order
+ -> [final] integration test -> PR created -> notification
+ ```
 
- ### Standup status generator
+ **Context management:** Coordinator maintains a "work unit manifest" (current phase, slice status, shared invariants, decisions). Each spawned agent receives a context bundle. After each agent completes, its findings update the manifest.
 
- **Status: idea** | Priority: low
+ **The coordinator's job (scripts, not LLM):** maintain the manifest, compute the dependency graph, decide parallelism vs serialization, route outcomes, track worktrees, detect conflicts, sequence merge order.
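A scripts-first coordinator could derive parallelism from the slice dependency graph like this. Sketch only: slice names are illustrative, and the merge order is simply the waves concatenated.

```typescript
// Hypothetical coordinator helper: given slice dependencies, compute "waves"
// of slices that can run in parallel (everything in a wave has all of its
// dependencies satisfied by earlier waves).
type Deps = Record<string, string[]>;

function parallelWaves(deps: Deps): string[][] {
  const done = new Set<string>();
  const pending = new Set(Object.keys(deps));
  const waves: string[][] = [];
  while (pending.size > 0) {
    // A slice is ready when every dependency finished in an earlier wave.
    const wave = [...pending].filter(s => deps[s].every(d => done.has(d)));
    if (wave.length === 0) throw new Error("dependency cycle detected");
    // Mark done only after selecting the whole wave, so slices in the same
    // wave cannot depend on each other.
    wave.forEach(s => { pending.delete(s); done.add(s); });
    waves.push(wave);
  }
  return waves;
}

const waves = parallelWaves({
  "types+schema": [],
  "polling-adapter": ["types+schema"],
  "scheduler-integration": ["polling-adapter"],
  "tests": ["types+schema", "polling-adapter", "scheduler-integration"],
});
console.log(waves);
// -> [["types+schema"], ["polling-adapter"], ["scheduler-integration"], ["tests"]]
```

Validating the predicted dependency graph before execution (one of the open questions below) reduces, in this sketch, to checking that `parallelWaves` terminates without a cycle and that each wave's slices touch disjoint files.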
 
- A workflow that aggregates activity across git history, GitLab/GitHub MRs and reviews, and Jira ticket transitions since the last standup. Outputs a categorized ("what I did / doing today / blockers") human-readable message. Tool-agnostic: detect available integrations and adapt.
+ **The minimum viable version:** a coordinator that handles a Medium/Small scoped task -- takes 2-4 parallel slices, runs them, reviews each, merges when clean. No escalation handling in v1.
+
+ **Things to hash out:**
+ - "WorkTrain figures out the breakdown" -- how does it decompose a feature into independent, parallelizable slices without human input? What is the decision process, and how does it handle tasks that are fundamentally sequential?
+ - Parallel slices across worktrees can produce merge conflicts when their branches are integrated. Who detects and resolves conflicts -- the coordinator script or the agent?
+ - The breakdown step requires predicting which slices depend on which. Incorrect dependency analysis could cause a slice to start before its dependencies are complete. How is this validated before parallel execution begins?
+ - Is the "minimum viable version" intended to run fully autonomously, or does it require human review between phases?
+ - How does this relate to the full development pipeline entry earlier in the backlog? Are these the same concept or parallel efforts?
 
  ---
 
- ### Workflow effectiveness assessment and self-improvement proposals
+ ### WorkTrain analytics: stats, time saved, and quality metrics (Apr 15, 2026)
 
  **Status: idea** | Priority: medium
 
- **Idea:** WorkTrain runs workflows hundreds of times. It should use that data to propose improvements.
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no
732
3135
 
733
- **Per-run metrics to collect:**
734
- - Steps skipped most often (candidate for removal)
735
- - Steps consuming the most tokens/time
736
- - Steps where the agent calls `continue_workflow` immediately (prompt too vague or redundant)
737
- - Sessions that produced PRs with Critical findings (workflow not thorough enough)
738
- - Sessions that completed vs hit max_turns
3136
+ WorkTrain should be accountable. Not just "it did work" but "did it do good work?" Stats without quality metrics are vanity. Quality metrics without stats lack context.
739
3137
 
740
- **Output:** Structured proposal per workflow:
741
- - Step-level issues with evidence (specific sessions, specific steps)
742
- - Proposed changes with confidence and impact estimate
743
- - Feed directly into `workflow-for-workflows`
3138
+ **Volume stats:** PRs opened/merged, PRs reviewed, bugs investigated, tasks completed, discoveries run, issues filed/resolved. Derived from session store + merge audit log + GitHub/Jira API.
3139
+
3140
+ **Time saved estimates:** calibrated human-equivalent time estimate per workflow type (e.g. MR review STANDARD = 25 min, coding task Medium = 2h). Honest: "Time saved is only real if the work would have been done by a human."
3141
+
3142
+ **Quality metrics:**
3143
+ - MR reviews: reviews with 0 findings / reviews that caught Critical / reviews where human disagreed
3144
+ - Coding tasks: PRs merged without rework / PRs that needed fix cycles / post-merge bug rate
3145
+ - Bug investigations: correct root cause identified / confidence was too high (wrong) / escalated correctly
3146
+ - **Overall quality score** (weighted composite): if score drops below 70, auto-trigger `workflow-effectiveness-assessment`
3147
+
3148
+ **Quality feedback loop:** post-merge outcome tracking (bugs filed against WorkTrain PRs within 30 days), MR review validation (author disputes a finding = signal), human override tracking, explicit `worktrain feedback "..."` command appending to `~/.workrail/feedback.jsonl`.
3149
+
3150
+ **Console Analytics tab:** quality score trend, volume/quality/cost summary, anomaly callouts with links to `workflow-effectiveness-assessment`.
3151
+
3152
+ **Things to hash out:**
3153
+ - "Time saved" estimates require knowing what a human would have done in the same time. This is inherently speculative. How is the calibration model updated as norms change?
3154
+ - "Reviews where human disagreed" requires a mechanism for tracking disagreement. What is the interface for a human to signal disagreement with a WorkTrain finding -- a label, a comment keyword, or an explicit command?
3155
+ - The quality score dropping below 70 auto-triggers `workflow-effectiveness-assessment`. Who defines the threshold, and is it configurable per workspace or global?
3156
+ - Post-merge bug tracking (bugs within 30 days) requires attributing bugs to specific PRs. What is the attribution mechanism -- PR metadata, commit SHA tracking, or manual annotation?
3157
+ - The analytics data requires access to GitHub/Jira APIs. Who manages token rotation for these read-access integrations, and what happens when they expire?
3158
+
3159
+ ---
3160
+
3161
+ ### Live status briefings: WorkTrain narrates its own work in human terms (Apr 15, 2026)
3162
+
3163
+ **Status: idea** | Priority: medium
744
3164
 
745
- **Flow-back:** Low-confidence proposals as GitHub issues. High-confidence, low-risk proposals auto-applied to local copy + PR to community.
3165
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:1 Con:3 | Blocked: no
746
3166
 
747
- ---
3167
+ **The problem:** WorkTrain is doing a lot. Sessions are running, PRs are open, the queue has items. But the raw view -- session IDs, PR numbers, branch names -- is only meaningful to someone who's been following along. A user who checks in after a few hours needs a human-readable briefing, not a list of `sess_abc123` entries.
748
3168
 
749
- ## Platform Vision (longer-term)
3169
+ **`worktrain status` command:** assembles a briefing by reading active sessions (what's running, which step, how long), queue state, recent completions, blocked/waiting items. Summarizes each session in 2-3 plain English lines: what is being built, why it matters, where it is.
750
3170
 
751
- ### Inspiration: openclaw (Apr 29, 2026)
3171
+ **Adaptation:** `--audience owner` (full technical detail, default) vs `--audience stakeholder` (capability level, no PR numbers) vs `--audience external` (outcome level, no internal terminology).
752
3172
 
753
- **Source:** https://github.com/openclaw/openclaw
3173
+ **Console Status tab** (default view): live session list with step progress, queue next items, done today. Updates via SSE. Click any row to expand.
754
3174
 
755
- openclaw is worth studying deeply before building out the platform layer. Draw inspiration from it when designing: multi-agent orchestration patterns, coordinator architecture, context packaging for subagents, task queue and dispatch models, and the overall shape of an autonomous engineering platform. Review it before making architectural decisions on any of the Platform Vision items below.
3175
+ **Push notifications:** milestone completions ("WorkTrain shipped: worktrain init is live"), blockers surfaced ("PR #406 came back with 2 issues -- fixing automatically, estimated 20 min"), optional daily digest.
3176
+
3177
+ **Things to hash out:**
3178
+ - The briefing LLM call requires a full context assembly pass (session store, queue state, recent completions). This is expensive. Should `worktrain status` be a live query or cached periodically?
3179
+ - Audience adaptation (`--audience stakeholder`) requires understanding what "capability level" vs "technical detail" means for each piece of information. Who defines this mapping?
3180
+ - Push notifications require a notification channel (Slack, email, macOS, etc.). How does the user configure which channel(s) to use, and what is the default?
3181
+ - "Estimated 20 min" requires the workflow execution time prediction system to be built first. Is the status briefing gated on that feature?
3182
+ - Should the Console Status tab replace the existing Sessions tab as the default view, or be an additional tab?
756
3183
 
757
3184
  ---
758
3185
 
759
- ### Knowledge graph for agent context
3186
+ ### Pattern and architecture validation: WorkTrain enforces team conventions (Apr 15, 2026)
760
3187
 
761
3188
  **Status: idea** | Priority: medium
762
3189
 
763
- **Problem:** Every session starts with a full repo sweep. Context gathering subagents re-read the same files, re-trace the same call chains, re-identify the same invariants.
3190
+ **Score: 10** | Cor:2 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no
764
3191
 
765
- **Design -- two-layer hybrid:**
3192
+ Beyond reviewing code for bugs, WorkTrain validates that the code matches the patterns and architecture the team expects.
766
3193
 
767
- **Layer 1: Structural graph (hard edges, deterministic)**
768
- Built by `ts-morph` (TypeScript Compiler API) + DuckDB. Captures: `imports`, `calls`, `exports`, `implements`, `extends`, `registers_in`, `tested_by`. Answers precise questions with certainty: "what imports trigger-router.ts?", "what CLI commands are registered?"
3194
+ **Two levels:**
769
3195
 
770
- **Layer 2: Vector similarity (soft weights, semantic)**
771
- Every node gets an embedding. Answers fuzzy questions: "what is conceptually related to this?", "what past sessions are relevant to this bug?" Built with LanceDB (embedded, TypeScript-native, local-first).
3196
+ **1. Philosophy lens (already partially built):** extend to be per-workspace configurable, and make some patterns machine-checkable (no direct db access outside the repository layer, no `console.log` in production code, no `any` types) rather than relying on the LLM.
772
3197
 
773
- **Technology:**
774
- - Structural: `ts-morph` + DuckDB
775
- - Vector: LanceDB + local embedding model (Ollama or `@xenova/transformers`)
776
- - Unified query: `query_knowledge_graph(intent)` returns merged structural + semantic results
3198
+ **2. Architectural invariant checking (new):**
3199
+ ```yaml
3200
+ workspaces:
3201
+ workrail:
3202
+ architectureRules:
3203
+ - id: no-daemon-imports-from-mcp
3204
+ rule: "src/daemon/** must not import from src/mcp/**"
3205
+ type: import_boundary
3206
+ severity: error
3207
+ - id: errors-as-data
3208
+ rule: "No throw statements in src/daemon/** -- use Result types"
3209
+ type: no_throw
3210
+ severity: warning
3211
+ exceptions: ["constructor", "assertExhaustive"]
3212
+ - id: no-exec-shell
3213
+ rule: "No child_process.exec() -- use execFile() with args array"
3214
+ type: forbidden_call
3215
+ severity: error
3216
+ ```
777
3217
 
778
- **Build order:** Structural layer spike first (1-day). Vector layer after spike proves the foundation. Incremental update: re-index only files in `filesChanged` after each session.
3218
+ These rules run as scripts (static analysis, not LLM) -- fast, deterministic, zero tokens. Checked during coding-task workflow, as part of CI, and by the periodic architecture scan.
779
3219
 
780
- **Build decision (from Apr 15 research):** ts-morph + DuckDB wins. Cognee: Python-only. GraphRAG/LightRAG: use LLMs to build graph (violates scripts-over-agent). Mem0/Zep: conversational memory, not code graphs. Sourcegraph: enterprise weight, overkill.
3220
+ **The self-improvement connection:** when `workflow-effectiveness-assessment` finds that a class of bug appears repeatedly (e.g. "3 of the last 5 coding tasks had shell injection risks"), it can propose a new architecture rule that prevents the pattern going forward. Rules start as soft warnings, graduate to errors after validation. WorkTrain learns from its own failure patterns and codifies them as invariants.
3221
+
3222
+ **Things to hash out:**
3223
+ - Static analysis rules (import boundaries, forbidden calls) are different from philosophy lens rules (LLM-evaluated). Should they live in the same configuration file and enforcement mechanism?
3224
+ - Who owns the `architectureRules` configuration per workspace -- the workspace team, the workflow author, or WorkTrain itself? Conflicting ownership creates maintenance friction.
3225
+ - When a new architecture rule is auto-proposed from failure patterns, how does it get reviewed and graduated from warning to error? Is there a human approval gate in the self-improvement loop?
3226
+ - How does architecture rule enforcement interact with existing CI checks? Should WorkRail generate a lint-style CI step from the `architectureRules` config?
3227
+ - Rules like "no throw in src/daemon/**" require nuance (exceptions for constructors). How is the exceptions list kept current as the codebase evolves?
781
3228
 
782
3229
  ---
783
3230
 
784
- ### Dynamic pipeline composition
3231
+ ### Resource management: preventing agent congestion under high concurrency (Apr 15, 2026)
785
3232
 
786
3233
  **Status: idea** | Priority: medium
787
3234
 
788
- **Insight:** Not all tasks are equal in how much work is needed before implementation. A raw idea needs a completely different pipeline than a fully-specced ticket.
3235
+ **Score: 8** | Cor:1 Cap:2 Eff:2 Lev:1 Con:2 | Blocked: no
789
3236
 
790
- **Maturity spectrum:**
791
- - `idea` -> `rough` -> `specced` -> `ready` -> `code-complete`
3237
+ Running many simultaneous agents creates API rate limit bursts, host resource pressure, and context degradation. The `maxConcurrentSessions` semaphore addresses the daemon-level cap, but the broader resource management problem has several dimensions.
792
3238
 
793
- **Coordinator reads maturity + existing artifacts and prepends the right phases:**
794
- - Nothing -> ideation -> market research -> spec authoring -> ticket creation -> implementation
795
- - BRD + designs -> architecture review -> implementation
796
- - Fully specced -> coding only
3239
+ **The dimensions:**
3240
+ 1. **API rate limits** -- token-bucket rate limiter shared across all sessions: before each LLM call, acquire a slot from the bucket
3241
+ 2. **Host machine resources** -- each agent loop runs in-process, consuming RAM and CPU
3242
+ 3. **Tiered concurrency by task type** -- `coding-task-workflow-agentic: 2` (expensive), `mr-review: 3` (medium), `wr.discovery: 5` (cheap)
3243
+ 4. **Queue-aware throttling** -- prefer starting high-priority items even if slots are available for low-priority ones
3244
+ 5. **Graceful degradation** -- slow down polling intervals, prefer fast/cheap workflows, pause the queue drain when under load
797
3245
 
798
- **New workflows needed:**
799
- - `classify-task-workflow` -- fast, 1-step, outputs `taskComplexity`/`riskLevel`/`hasUI`/`touchesArchitecture`/`taskMaturity`
800
- - `ideation-workflow`, `spec-authoring-workflow`, `ticket-creation-workflow`, `grooming-workflow`
3246
+ **Build order:**
3247
+ 1. `maxConcurrentSessions` semaphore (simple global cap)
3248
+ 2. Token-bucket rate limiter in the agent loop
3249
+ 3. Per-workflow-type concurrency limits
3250
+ 4. Queue-aware slot allocation (high-priority first)
3251
+ 5. Adaptive throttling based on observed latency
3252
+
3253
+ **Things to hash out:**
3254
+ - The token-bucket rate limiter must be shared across all concurrent sessions. Where does it live -- daemon-global singleton, or a lightweight IPC mechanism? Thread safety is required.
3255
+ - Tiered concurrency limits by workflow type require the daemon to know the workflow type at dispatch time. How is this derived for dynamically dispatched sessions where the workflow is set at runtime?
3256
+ - "Host machine resources" monitoring requires either OS-level telemetry (CPU, RAM sampling) or inference from session count. Which is more reliable for the adaptive throttling use case?
3257
+ - Graceful degradation that pauses queue draining could leave important high-priority items waiting behind lower-priority work. Does degradation mode need priority awareness?
3258
+ - What is the interaction between resource limits and the `worktrain kill-sessions` kill switch? Should resource exhaustion trigger a softer intervention before escalating to kill?
801
3259
 
802
3260
  ---
803
3261
 
804
- ### Per-workspace work queue
3262
+ ### Universal integration layer: WorkTrain interfaces with everything (Apr 15, 2026)
805
3263
 
806
3264
  **Status: idea** | Priority: medium
807
3265
 
808
- **The insight:** Triggers make WorkTrain reactive. A work queue makes it proactive -- it pulls the next item when capacity is available, works it to completion, pulls the next.
3266
+ **Score: 8** | Cor:1 Cap:2 Eff:1 Lev:2 Con:2 | Blocked: no
809
3267
 
810
- **Internal queue:** `~/.workrail/workspaces/<name>/queue.jsonl` -- append-only, one item per line, consumed in priority order then FIFO.
3268
+ WorkTrain is not opinionated about your stack. It works with whatever version control, project management, communication, monitoring, and documentation systems you use.
811
3269
 
812
- **External pull sources:**
813
- - GitHub issues (label filter)
814
- - GitLab issues (label filter)
815
- - Jira sprint board
816
- - Linear triage queue
3270
+ **Integration categories:**
3271
+ - **Version control:** GitHub, GitLab (already done), Bitbucket, Azure DevOps, Gitea, raw git
3272
+ - **Project management:** GitHub Issues, GitLab Issues, Jira (Cloud + Server), Linear, Asana, Notion, Monday.com, Azure Boards
3273
+ - **Communication:** Slack, Microsoft Teams, Discord, Telegram, Email, PagerDuty, OpsGenie, generic webhook
3274
+ - **Monitoring:** Sentry, Datadog, New Relic, Grafana/Prometheus, CloudWatch, custom HTTP endpoint
3275
+ - **Documentation:** Confluence, Notion, Google Docs, Markdown in repo, Docusaurus
817
3276
 
818
- **Queue + message queue + talk:**
3277
+ **Three integration modes (all already architected):**
3278
+ 1. **Polling source** -- WorkTrain calls the external API on a schedule, deduplicates events, dispatches workflows
3279
+ 2. **Delivery target** -- WorkTrain POSTs results to an external system when a workflow completes
3280
+ 3. **Reference context** -- WorkTrain fetches external documents and injects them into agent context
819
3281
 
820
- | Interface | Use case | Latency |
821
- |-----------|----------|---------|
822
- | Work queue | "do this when you have capacity" | When a slot is free |
823
- | Message queue (`worktrain tell`) | "do this now, between current sessions" | End of current batch |
824
- | Talk (`worktrain talk`) | "let's discuss and decide together" | Interactive |
3282
+ **The integration manifest in triggers.yml:**
3283
+ ```yaml
3284
+ integrations:
3285
+ github:
3286
+ token: $GITHUB_TOKEN
3287
+ jira:
3288
+ token: $JIRA_TOKEN
3289
+ baseUrl: https://mycompany.atlassian.net
3290
+ slack:
3291
+ webhookUrl: $SLACK_WEBHOOK_URL
3292
+ channels:
3293
+ reviews: "#code-review"
3294
+ incidents: "#incidents"
3295
+ ```
3296
+
3297
+ **Build order:** generic `callbackUrl` (already works); GitHub polling (same as GitLab, already written as template), Slack delivery (format + post to webhook); Jira polling + delivery (high enterprise value); Linear polling (high startup value); PagerDuty delivery. Each adapter is a bounded, testable, independently shippable unit.
3298
+
3299
+ **Things to hash out:**
3300
+ - The integration manifest in `triggers.yml` centralizes credentials for all external systems. Is this the right location, or should credentials live in a separate secrets file (like `~/.workrail/.env`)?
3301
+ - Each integration adapter is "independently shippable" -- but they share no common testing infrastructure. How is integration adapter quality maintained as the number of adapters grows?
3302
+ - What is the versioning policy for integration adapters? If an external API changes (e.g. Jira Cloud v3 -> v4), how are adapter updates coordinated with WorkTrain releases?
3303
+ - The "three integration modes" cover polling, delivery, and reference context. Are there integration use cases that don't fit these three modes?
3304
+ - Who is the target user for the universal integration layer -- solo developers, small teams, or enterprise teams? The complexity of configuring many integrations is higher than the current single-trigger setup.
825
3305
 
826
3306
  ---
 
- ### Remote references (URLs, GDocs, Confluence)
+ ### Communication agent: Slack monitoring, email management, and suggested responses (Apr 16, 2026)
 
- **Status: idea** | Priority: medium
+ **Status: idea** | Priority: low
 
- **Design:** Extend the workflow `references` system to support remote sources (HTTP URLs, Google Docs, Confluence pages). WorkRail remains a pointer system -- it validates declarations are well-formed, delivers the pointer, and the agent fetches with its own tools. Auth is entirely delegated to the agent.
+ **Score: 6** | Cor:1 Cap:1 Eff:1 Lev:1 Con:2 | Blocked: no
 
- **Incremental path:**
- - Phase 1: public HTTP URLs. `resolveFrom: "url"`. WorkRail delivers the URL; agent fetches. No auth surface in WorkRail.
- - Phase 2: workspace-configured bearer tokens in `.workrail/config.json` keyed by domain
- - Phase 3: named integrations (GDocs, Confluence, Notion) as first-class configured sources
+ WorkTrain monitors your communication channels, understands context, and either responds on your behalf or prepares vetted drafts for you to send.
 
- **Design questions:**
- - Should WorkRail attempt a reachability check at start time, or skip entirely for remote refs?
- - How should remote refs appear in `workflowHash`? Content can change between runs.
- - `kind` field (`local` vs `remote`) or infer from `source` value?
+ **Slack:** Monitor specified channels and DMs for messages that mention you, reference your projects, or require a response. Options: auto-respond for routine questions, draft a response for your review, or surface it with a notification. Configurable per-channel.
+
+ **Email:** Monitor inbox, understand context, draft responses. Suggest email filters, folder rules, and unsubscribe candidates based on patterns. Priority surfacing: "3 emails need a response, here are the drafts."
+
+ **Important constraint:** WorkTrain never sends on your behalf without explicit approval for anything that goes to other people. Auto-respond is opt-in per-channel, with a review window before sending.
+
+ **Things to hash out:**
+ - Slack monitoring requires a Slack app with appropriate scopes. What is the setup experience -- does WorkTrain ship a Slack app manifest, or does the user create an app from scratch?
+ - The "review window before sending" implies the agent drafts a response and waits. What is the window duration, and what happens if the user doesn't review within the window?
+ - Email monitoring is significantly more sensitive than Slack. What are the minimum required email scopes, and how does WorkTrain prevent accidentally reading sensitive or confidential messages?
+ - Auto-respond for Slack is opt-in per-channel. If a channel is not explicitly opted in, are all messages in that channel completely invisible to WorkTrain?
+ - This is a significant scope expansion beyond code-related automation. What is the explicit boundary between WorkTrain as a coding tool and WorkTrain as a general productivity tool?
 
  ---
 
- ### Declarative composition engine
+ ### Local file organization and maintenance (Apr 16, 2026)
 
 **Status: idea** | Priority: low
 
- **Summary:** Users or agents fill out a declarative spec (dimensions, scope, rigor level) and the WorkRail engine assembles a workflow automatically from a library of pre-validated routines. The agent is a form-filler, not an architect -- the composition logic lives in the engine.
-
- **Why different from agent-generated workflows:** Engine-composed workflows are assembled from pre-reviewed building blocks using deterministic rules. Same spec always produces the same workflow shape.
+ **Score: 6** | Cor:1 Cap:1 Eff:1 Lev:1 Con:2 | Blocked: no
 
- **Good early use cases:** Audit-style workflows (user picks dimensions, engine assembles auditor steps), review workflows, investigation workflows.
+ - WorkTrain scans specified directories for stale, duplicate, and disorganized files
+ - Suggests folder structures based on file content and usage patterns
+ - Identifies documents that are out of date and offers to update them
+ - Keeps project-related files in sync with the repo
+ - "~/Downloads has 847 files, most untouched for 6 months -- here's what's safe to delete and what should be archived"
+ - Connects to the knowledge graph: files that reference code or projects get indexed alongside the code
 
  ---
 
- ### Workflow categories and category-first discovery
+ ### Worktree lifecycle management: automatic cleanup and inventory (Apr 18, 2026)
 
- **Status: idea** | Priority: low
+ **Status: idea** | Priority: medium
 
- **Summary:** Improve workflow discovery by organizing bundled workflows into categories. Currently the catalog is large enough that flat discovery is becoming noisy.
+ **Score: 10** | Cor:2 Cap:2 Eff:2 Lev:1 Con:3 | Blocked: no
 
- **Phase 1 shape:** If no category is passed, return category names + workflow count per category + a few representative titles. If a category is passed, return the full workflows for that category.
+ With many concurrent agents using `branchStrategy: worktree`, worktrees accumulate. 10 agents running all day can produce dozens of worktrees, each triggering `git status` processes that saturate the host CPU.
 
- **Design questions:**
- - Should categories live in workflow JSON, in a registry overlay, or be inferred from directory/naming?
- - Should `list_workflows` become polymorphic, or should category discovery be a separate mode?
+ **What's needed:**
+ 1. **Automatic cleanup on session end** -- when a WorkTrain session completes (success or failure), the daemon automatically runs `git worktree remove <path> --force`. If the branch is already merged to main, also delete the local branch ref.
+ 2. **Startup pruning** -- `worktrain daemon` startup runs `git worktree prune` in each configured workspace before starting the trigger listener.
+ 3. **`worktrain worktree list`** -- shows all WorkTrain-managed worktrees: path, branch, session ID, age, whether the branch is merged.
+ 4. **`worktrain worktree clean`** -- removes all worktrees whose branches are merged to main, or older than N days. Dry-run mode by default.
+ 5. **`worktrain worktree status`** -- summary: count, total disk usage, any stale ones.
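The `clean` decision from items 1 and 4 can be sketched as a pure policy function. The field names (`merged`, `ageDays`, `managedByWorkTrain`) are assumptions for illustration, not the real WorkTrain data model:

```typescript
// Hypothetical decision logic for `worktrain worktree clean`.
interface WorktreeInfo {
  branch: string;
  merged: boolean;            // branch already merged to main
  ageDays: number;            // days since last commit in the worktree
  managedByWorkTrain: boolean;
}

type CleanDecision = "remove" | "keep";

function cleanDecision(w: WorktreeInfo, maxAgeDays: number): CleanDecision {
  // Never touch worktrees a human created by hand.
  if (!w.managedByWorkTrain) return "keep";
  // Merged branches are safe to reclaim; so are very old ones.
  if (w.merged || w.ageDays > maxAgeDays) return "remove";
  return "keep";
}
```

Keeping the policy separate from the `git worktree remove` execution makes dry-run mode trivial: run the policy, print the decisions, skip the side effects.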
+
+ **Things to hash out:**
+ - `git worktree remove --force` discards uncommitted changes without warning. What is the policy for worktrees with uncommitted or unstaged work on session end? Is force-removal always safe?
+ - "If the branch is already merged to main, also delete the local branch ref" -- what constitutes "merged"? Squash-merges don't leave an ancestor relationship in git history. How is squash-merge detection handled?
+ - Startup pruning runs before the trigger listener starts. What is the time cost for pruning across many workspaces with many worktrees? Could it delay daemon startup noticeably?
+ - Should cleanup be skipped for manually-created worktrees (not WorkTrain-managed)? How does the cleanup tool distinguish WorkTrain-managed from human-created worktrees?
 
  ---
 
- ### Forever backward compatibility (workrailVersion)
+ ### Git worktrees and branch management as a first-class capability (Apr 16, 2026)
 
 **Status: idea** | Priority: medium
 
- Every workflow declares `workrailVersion: "1.4.0"`. The engine maintains compatibility adapters for all previous declared versions -- old workflows run forever without author intervention. The engine adapts; authors never migrate.
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:2 Con:2 | Blocked: no
 
- **The web model:** this is how browsers handle HTML from 1995. A `<marquee>` tag still renders because the browser adapts, not because the author rewrote their page.
+ Critical for parallel work. WorkTrain needs native, sophisticated git management -- not just running git commands but understanding the full branching topology.
 
- **Engineering implication:** permanent commitment. Once a version adapter is shipped, it cannot be removed. The tradeoff is real but the alternative (expecting external authors to track WorkRail releases and migrate) breaks the platform trust model.
+ **Worktree management:** Create, list, switch between, and clean up worktrees automatically. Detect and warn about stale worktrees (branches that have been merged or abandoned).
 
- **Phase 1:** Add `workrailVersion` field to schema. Default to `"1.0.0"` for existing workflows. Record in run events.
- **Phase 2:** Introduce the first adapter when the first schema-breaking change is needed.
- **Phase 3:** Build a compatibility test harness in CI.
+ **Branch lifecycle:** Know which branches are active (being worked on), stale (no commits in N days), merged (on main), or orphaned (created but abandoned). Automatic cleanup proposals. Rebase management when main advances. Conflict detection before spawning a new session.
 
- **Related:** `src/v2/read-only/v1-to-v2-shim.ts` (existing precedent for version adaptation).
+ **Parallel work coordination:** When multiple tasks touch the same files, WorkTrain detects potential conflicts before they happen. Sequences tasks that would conflict, parallelizes those that won't. Maintains a "file lock" mental model.
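The overlap check behind the "file lock" model can be sketched as a set intersection over each active session's touched files. Types and names here are hypothetical:

```typescript
// Hypothetical conflict check: serialize a new task when its predicted
// file set intersects any active session's file set.
interface ActiveSession {
  sessionId: string;
  files: Set<string>; // files this session is known to touch
}

// Returns the session IDs the new task would conflict with.
function detectConflicts(
  predictedFiles: Iterable<string>,
  active: ActiveSession[],
): string[] {
  const wanted = new Set(predictedFiles);
  return active
    .filter((s) => [...wanted].some((f) => s.files.has(f)))
    .map((s) => s.sessionId);
}
```

An empty result means the task can run in parallel; a non-empty result means it should queue behind the listed sessions.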
+
+ **The `worktrain worktree` command family:**
+ ```bash
+ worktrain worktree list                   # all worktrees and their status
+ worktrain worktree clean                  # remove merged/stale worktrees
+ worktrain worktree new <branch> [--task]  # create worktree + optionally link to queue item
+ worktrain worktree status                 # which files are locked by active sessions
+ ```
+
+ Especially critical when WorkTrain is managing 10+ concurrent sessions -- without explicit worktree management, two sessions could clobber each other's changes on the same branch.
+
+ **Things to hash out:**
+ - The "file lock" mental model requires knowing which files each active session is touching. How is this tracked -- by inspecting the worktree, by recording what files each session reads/writes, or by static analysis of the task?
+ - Conflict detection before spawning is a prediction problem (which files will this session touch?). What is the accuracy requirement, and what is the cost of a false positive (unnecessarily serializing work)?
+ - "Rebase management when main advances" is a significant automated git action. Who triggers the rebase -- the daemon on a schedule, the coordinator, or the session itself?
+ - The command family (`worktrain worktree list`, `worktrain worktree clean`, etc.) overlaps significantly with the worktree lifecycle management entry above. Should these be unified into a single design effort?
 
  ---
 
- ### Parallel forEach execution
+ ### The single-conversation problem: WorkTrain needs multi-threaded interaction (Apr 16, 2026)
 
- **Status: idea** | Priority: low
+ **Status: idea** | Priority: medium
 
- Sequential `forEach` (and `for`, `while`, `until`) all work -- implemented in the v1 interpreter and the v2 durable core. The idea here is parallel execution: run all iterations concurrently rather than sequentially. Requires design around: session store concurrent writes, token protocol isolation per iteration, and console DAG rendering for parallel branches.
+ **Score: 6** | Cor:1 Cap:1 Eff:1 Lev:1 Con:2 | Blocked: no
+
+ When WorkTrain is managing 10 concurrent agents, a single chat where everything is happening at the same time is not ideal. You can't follow any one thread or distinguish "in progress" from "needs a decision."
+
+ **Threaded conversations per work group:** each active work group gets its own conversation thread. You can follow the polling-triggers work in thread A without seeing the spawn/await implementation in thread B.
+
+ **`worktrain talk` shows a thread list:**
+ ```
+ Threads:
+   WorkRail development     [3 active agents, 2 waiting]
+   Storyforge chapter work  [idle]
+ -> Select thread or type to start a new one
+ ```
+
+ **`worktrain idea` for mid-conversation capture:** `worktrain idea "..."` appends to an ideas buffer without interrupting active work. The talk session reviews the buffer at the start of each conversation.
+
+ **Things to hash out:**
+ - What defines a "work group" for the thread list -- is it a workspace, a parent session ID, a trigger ID, or something the user explicitly creates?
+ - The thread list requires WorkTrain to know which work groups are active. Where does this mapping live, and who maintains it as sessions start and complete?
+ - Should thread history persist across conversations, or is each `worktrain talk` session a fresh start that synthesizes from the session store?
+ - `worktrain idea` writes to an ideas buffer. Is this buffer workspace-scoped, global, or per-thread? What is the path for ideas that don't belong to any active thread?
 
  ---
 
- ### Assessment-gate tiers beyond v1
+ ### Console session detail: more than the DAG when running standalone (Apr 16, 2026)
 
- **Status: idea** | Priority: low
+ **Status: idea** | Priority: medium
 
- **Tier 1 (current):** Same-step follow-up retry. Consequence keeps the same step pending; engine returns semantic follow-up guidance.
+ **Score: 9** | Cor:1 Cap:2 Eff:2 Lev:1 Con:3 | Blocked: no
 
- **Tier 2 (future):** Structured redo recipe on the same step. Engine surfaces a bounded checklist. No new DAG nodes or true subflow.
+ The session DAG shows structure but not meaning. When watching a session run in the console without being in Claude Code, you want to know what the agent is actually doing.
 
- **Tier 3 (future):** Assessment-triggered redo subflow. Matched consequence routes into an explicit sequence of follow-up steps. Introduces assessment-driven control-flow behavior.
+ **What's missing:**
+ - The latest step output note, rendered inline and updating as it streams
+ - A plain-English summary of what the agent is doing right now ("Analyzing the diff for shell injection risks")
+ - Current step prompt visible on demand
+ - Token count and cost estimate for the session so far
+ - Time elapsed + estimated time remaining based on step history
+ - A live feed of tool calls as they happen ("Reading trigger-router.ts", "Running npm test")
 
- **Design questions:** When does Tier 2 become necessary? What durable model would Tier 3 need for entering, progressing through, and returning from a redo subflow?
+ **The streaming step output** is the most valuable addition. Right now the DAG shows a step as "in progress" with a spinner. It should show the last few lines of the step's output note as it's being written.
+
+ **Build order:**
+ 1. Inline latest step output in the session detail panel (read from session store, poll every 2s)
+ 2. Live tool call feed alongside the DAG (SSE from the daemon, log each tool call as it fires)
+ 3. Token/cost counter (daemon tracks tokens per session, expose via GET /api/v2/sessions/:id)
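One way the "estimated time remaining" item could work, assuming per-step median durations from prior runs of the same workflow are available (function and field names are hypothetical):

```typescript
// Hypothetical ETA: sum historical median durations for the steps that
// haven't run yet; fall back to a default for steps with no history.
function estimateRemainingMs(
  remainingStepIds: string[],
  historicalMedianMs: Map<string, number>,
  fallbackMs: number,
): number {
  return remainingStepIds.reduce(
    (total, id) => total + (historicalMedianMs.get(id) ?? fallbackMs),
    0,
  );
}
```

Medians rather than means keep one pathologically slow run from skewing every future estimate.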
+
+ **Things to hash out:**
+ - "Latest step output" streaming via 2s polling means up to 2s latency. For users watching a live session, is this acceptable, or is SSE needed here too?
+ - The "plain-English summary" ("Analyzing the diff for shell injection risks") requires either real-time LLM inference or a structured feed from the agent. Where does this text come from?
+ - Current step prompt exposed on demand could reveal sensitive context (workspace paths, credentials passed via goal). Should there be a filter or opt-in before showing prompt content?
+ - Token cost estimates require knowing the model's pricing, which changes over time and varies by provider. How is the pricing table maintained and kept current?
+ - "Estimated time remaining" requires historical session data for the same workflow. What is the minimum data needed for a meaningful estimate?
 
  ---
 
- ### Workflow rewind / re-scope support
+ ### Orphaned daemon session state: smarter recovery (Apr 16, 2026)
 
- **Status: idea** | Priority: low
+ **Status: idea** | Priority: medium
 
- Allow an in-progress session to go back to an earlier point when new information changes scope, invalidates assumptions, or reveals the current path is wrong.
+ **Score: 9** | Cor:2 Cap:1 Eff:2 Lev:1 Con:3 | Blocked: no
 
- **Phase 1:** Allow rewind to a prior checkpoint with an explicit reason. Record a "why we rewound" note in session history.
+ **The problem:** When the daemon is killed mid-session, the session's in-process `KeyedAsyncQueue` promise chain is lost. On restart, the startup recovery reads orphaned session files -- but any external state tied to the queue is now inconsistent. More critically: if a session stalls (Bedrock call hangs, exception suppressed), the daemon log shows nothing after "Injecting workspace context" -- no error, no completion.
 
- **Phase 2:** Scope-change prompts ("our understanding changed", "the task is broader/narrower"). Let workflows declare safe rewind points explicitly.
+ **What needs to happen:**
+ 1. Startup recovery should clear any pending queue slots -- if a session file exists at startup, that trigger's queue key should be treated as free
+ 2. Session liveness detection -- if a session has been `in_progress` for more than N minutes with no `advance_recorded` events, the daemon watchdog should log a warning and optionally abort
+ 3. Orphaned session cleanup should be user-facing -- `worktrain cleanup` or `worktrain status` should surface orphaned sessions with their age and offer to clear them
+ 4. Better logging when `runWorkflow()` swallows errors -- the `void runWorkflow(...)` pattern drops errors silently; every path that ends in silence should log `[WorkflowRunner] Session died silently` with the session ID
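The liveness check in item 2 reduces to comparing the time since the last `advance_recorded` event against a threshold. A minimal sketch, with illustrative names:

```typescript
// Hypothetical watchdog check: a session silent for longer than the
// threshold is flagged as stalled; the caller decides warn vs abort.
interface LivenessReport {
  status: "alive" | "stalled";
  silentForMs: number;
}

function checkLiveness(
  lastAdvanceAtMs: number,
  nowMs: number,
  thresholdMs: number,
): LivenessReport {
  const silentForMs = nowMs - lastAdvanceAtMs;
  return {
    status: silentForMs > thresholdMs ? "stalled" : "alive",
    silentForMs,
  };
}
```

Returning the silence duration (not just a boolean) lets `worktrain status` show orphaned-session age directly from the same check.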
 
- **Design questions:**
- - Should rewind be limited to explicit checkpoints, or support arbitrary node-level rewind?
- - How should the system preserve notes from abandoned paths?
- - Should some steps be marked non-rewindable once external side effects have happened?
+ **Things to hash out:**
+ - How long should an orphaned session file be allowed to persist before `worktrain status` marks it as stale? The threshold must account for very long sessions vs actually orphaned ones.
+ - "Optionally abort" for sessions exceeding N minutes with no advances -- who sets N, and should the threshold differ per workflow (a discovery session naturally advances slowly vs a coding session)?
+ - Queue slot clearing on startup: if the daemon restarts while a session is genuinely still resumable, clearing its queue slot could lose deduplication state and re-dispatch the same task.
+ - Should users be notified when an orphaned session is found, or only when they explicitly run `worktrain status`?
 
  ---
 
- ### Subagent composition chains
+ ### Observability and logging as first-class citizens (Apr 17, 2026)
 
- **Status: idea** | Priority: low
+ **Status: idea** | Priority: high
 
- Native support for nested subagents -- an agent spawning a subagent, which spawns its own -- up to a configurable depth limit.
+ **Score: 11** | Cor:2 Cap:2 Eff:2 Lev:2 Con:3 | Blocked: no
 
- ```yaml
- agentDefaults:
-   maxSubagentDepth: 3
-   maxTotalAgentsPerTask: 10
+ WorkTrain should never be a black box. Every action, decision, failure, and state transition should be traceable after the fact.
+
+ **What "first-class" means:**
+ 1. **Structured, not prose** -- every log line machine-parseable with consistent key=value pairs
+ 2. **Levels matter** -- INFO for normal operations, WARN for recoverable anomalies, ERROR for failures. Silence = actively working, not unknown. A session that produces no logs for 5+ minutes should emit a heartbeat.
+ 3. **Every state transition logged** -- session start, step advance, tool call, tool result (including errors), session end
+ 4. **Errors always include context** -- which session, which tool, which step, how long it had been running, what the last successful action was
+ 5. **Correlation IDs** -- every session has a `sessionId`, every tool call has a `toolCallId`; log entries include the relevant ID for cross-session filtering
+ 6. **Log destinations are configurable** -- `--log-level` flag, `--log-format json|human`
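Points 1 and 5 together imply a log-line shape like the following sketch (the formatter and field names are illustrative, not the shipped implementation):

```typescript
// Hypothetical structured log line: LEVEL, an event name, then quoted
// key=value pairs -- one line per entry, machine-parseable.
type Level = "INFO" | "WARN" | "ERROR";

function logLine(
  level: Level,
  event: string,
  fields: Record<string, string | number>,
): string {
  const pairs = Object.entries(fields)
    .map(([k, v]) => `${k}=${JSON.stringify(String(v))}`)
    .join(" ");
  return `${level} event=${event} ${pairs}`;
}
```

Quoting every value keeps the format trivially splittable even when values contain spaces (paths, error messages), which is exactly what grep-style and programmatic consumers need.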
+
+ **Specific gaps to close:** `continue_workflow` tool should log step ID and notes length; `makeBashTool` should log exit code and output length; `AgentLoop` should log each LLM turn (turn number, stop reason, tool count); `TriggerRouter` should log when a session is queued at capacity.
+
+ **The `worktrain logs` command:**
+ ```bash
+ worktrain logs                        # tail daemon.log
+ worktrain logs --session sess_abc123  # replay full session from event store
+ worktrain logs --trigger test-task    # all sessions for this trigger
+ worktrain logs --level error          # only errors across all sources
+ worktrain logs --since 1h             # last hour
+ worktrain logs --format json          # machine-readable output
 ```
 
- **Depth semantics:** Coordinator=0, worker=1, subagent=2, sub-subagent=3.
+ **Self-healing dependency:** the automatic gap detection, WORKTRAIN_STUCK routing, and coordinator self-healing patterns all depend on logs being structured and complete. Logging quality is a prerequisite for autonomous operation at scale.
 
- `maxTotalAgentsPerTask` prevents exponential explosion: depth-3 tree with 3 agents per node = 27 concurrent agents without this cap.
+ **Things to hash out:**
+ - How do structured logs coexist with the existing session event store? Are they the same system, or parallel? Duplicating data in both would create consistency issues.
+ - Tool call argument logging could expose secrets (file paths, API responses, bash commands). Is there a sanitization policy for log output?
+ - The `worktrain logs --session` command replays from the event store. How is this different from what the console already shows? Is the CLI version for non-console users or for programmatic processing?
+ - Log rotation and retention -- how much disk space should logs consume, and who configures the retention policy?
+ - "Silence = actively working" requires the agent loop to emit heartbeats. What is the heartbeat interval, and is this a new event type in the session store?
 
  ---
 
- ### Mobile monitoring and remote access
+ ### Event sourcing for orchestration: extend the session store to daemon and coordinator events (Apr 17, 2026)
 
- **Status: idea** | Priority: low (post-daemon-MVP)
+ **Status: idea** | Priority: medium
 
- **Goal:** Control and monitor autonomous WorkRail sessions from a phone.
+ **Score: 10** | Cor:1 Cap:2 Eff:2 Lev:2 Con:3 | Blocked: no
 
- **What's needed:**
- 1. Mobile-responsive console with touch-friendly layout and tap to pause/resume/cancel
- 2. Push notifications (via Slack/Telegram webhook -- no native app required for MVP)
- 3. Human-in-the-loop approval on mobile -- maps to `POST /api/v2/sessions/:id/resume`
- 4. Session log view -- linear timeline, not DAG
+ Extend the existing WorkRail event store infrastructure to cover orchestration-level events. The session store is already append-only, crash-safe, content-addressed, and queryable -- rebuilding those properties would be wasteful.
 
- **Remote access options:**
- 1. `workrail tunnel` command (Cloudflare Tunnel from the laptop) -- works behind any NAT/VPN
- 2. Tailscale integration -- zero WorkRail code needed
- 3. Cloud session sync -- daemon pushes events to S3/R2
+ **Multiple event streams, same infrastructure:**
+ ```
+ ~/.workrail/events/
+   sessions/     <- already exists (per-session workflow events)
+   daemon/       <- lifecycle, triggers, delivery, errors
+   triggers/     <- per-trigger poll history and outcomes
+   coordinator/  <- coordinator script decisions and routing
+ ```
 
- **Priority:** Post-daemon-MVP. Design the REST control plane with mobile in mind from the start.
+ **Daemon event stream:** structured events like `daemon_started`, `trigger_fired`, `session_queued`, `session_started`, `tool_called`, `step_advanced`, `session_completed`, `delivery_attempted`, `poll_cycle`.
+
+ **`DaemonEventEmitter`:** thin wrapper around the event store, called from TriggerRouter, workflow-runner, delivery-client, and polling-scheduler. Zero overhead when nothing is listening. (Note: `DaemonEventEmitter` already ships -- this is about expanding what gets recorded and unifying with the session event store.)
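The emitter's shape can be sketched with an in-memory sink standing in for the append-only event store (the types and class here are illustrative, not the shipped `DaemonEventEmitter`):

```typescript
// Hypothetical emitter: one JSONL record per orchestration event,
// tagged with its stream so all streams share one infrastructure.
interface DaemonEvent {
  stream: "daemon" | "triggers" | "coordinator";
  type: string;
  at: number;
  data: Record<string, unknown>;
}

class InMemoryEmitter {
  readonly lines: string[] = [];

  emit(
    stream: DaemonEvent["stream"],
    type: string,
    data: Record<string, unknown>,
  ): void {
    const event: DaemonEvent = { stream, type, at: Date.now(), data };
    this.lines.push(JSON.stringify(event)); // one JSONL record per event
  }
}
```

Swapping the in-memory array for an append-to-file sink gives the crash-safe variant without changing any call sites.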
+
+ **SSE extension:** the console already streams session events via SSE. Extend to also stream daemon events so the console live feed shows everything: trigger fires, tool calls, delivery attempts, errors -- not just step advances.
+
+ **Why this matters for self-healing:** the coordinator can react in real time to `tool_error` events rather than checking for WORKTRAIN_STUCK markers after the fact.
+
+ **Things to hash out:**
+ - The `coordinator/` event stream records coordinator script decisions. Does this require the coordinator to be a first-class WorkTrain concept with an event-emitting API, or can it be retrofitted onto shell scripts via a CLI command (`worktrain event emit ...`)?
+ - All four event directories live under `~/.workrail/events/`. What are the size and retention policies per directory? Trigger poll cycles could generate enormous volumes in `triggers/`.
+ - SSE extension for daemon events means the console must distinguish session events from daemon events in the same stream. What is the event envelope schema for mixed event types?
+ - Who is the primary consumer of coordinator events -- only the console, or also the coordinator itself (for self-healing)? The use cases have different latency and reliability requirements.
 
  ---
 
- ### WorkRail Auto: cloud-hosted autonomous platform
+ ### Duplicate task detection: prevent agents from doing the same work twice (Apr 18, 2026)
 
- **Status: idea** | Priority: long-term
+ **Status: idea** | Priority: medium
 
- **Goal:** WorkRail Auto runs on a server 24/7, connected to your engineering ecosystem, working autonomously without a laptop open.
+ **Score: 9** | Cor:2 Cap:1 Eff:2 Lev:1 Con:3 | Blocked: no
 
- **What this enables:** GitLab MR opened -> WorkRail reviews, posts comment. Jira ticket moves to In Progress -> WorkRail starts coding task, pushes branch. PagerDuty fires -> WorkRail runs investigation, posts findings to Slack.
+ With multiple agents running concurrently and a persistent work queue, it's easy to accidentally start two agents on the same task -- especially when the queue pulls items from external sources that can reappear after a sync.
 
- **Architecture implications:**
- - Multi-tenancy: isolated session stores, isolated credential vaults per org
- - Horizontal scaling: multiple daemon instances consuming from a shared trigger queue
- - Rate limiting per org, per integration
+ **Detection sources:**
+ 1. **Open PRs** -- before starting any coding task, check `gh pr list --state open` -- if a PR already exists addressing the same issue/goal, skip it
+ 2. **Active sessions** -- session store knows which workflows are currently running; a new dispatch can check for semantic overlap before starting
+ 3. **Queue deduplication** -- each queue item from an external source carries its `sourceId` (e.g. `github:owner/repo:issues:123`). On enqueue, check if `sourceId` already exists in the queue
+ 4. **Session history** -- before starting an investigation, check recent session notes for the same workflowId + goal combination
 
- **Relationship to self-hosted:** Self-hosted is always free, always open source, always works offline. WorkRail Auto is the natural SaaS layer -- same engine, same workflows, managed infrastructure.
+ **Implementation:** queue-level dedup is the simplest and most reliable. PR-level dedup: before dispatching a coding task, run `gh pr list --search "<issue title keywords>"` and check for matches. For MVP, exact `sourceId` match + approximate PR title search is sufficient. Semantic dedup (same problem described differently) is a post-knowledge-graph feature.
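The queue-level check can be sketched in a few lines; the `QueueItem` shape and the always-accept rule for manual items are assumptions, not settled design:

```typescript
// Hypothetical enqueue-time dedup by exact sourceId match.
interface QueueItem {
  sourceId?: string; // absent for manually dispatched goals
  goal: string;
}

function shouldEnqueue(item: QueueItem, queued: QueueItem[]): boolean {
  // Manual dispatches carry no external identity, so they are
  // always accepted here and left to later dedup layers.
  if (!item.sourceId) return true;
  return !queued.some((q) => q.sourceId === item.sourceId);
}
```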
982
3570
 
- **Priority:** Long-term. Design the local daemon with multi-tenancy seams in mind from the start (don't hardcode single-user assumptions). Don't build the hosted layer until the local daemon is proven.
+ **Things to hash out:**
+ - Approximate PR title search for dedup can produce false positives (skipping work that is actually unrelated). What is the policy for a false positive -- is the issue left unworked, or escalated?
+ - `sourceId`-based dedup is reliable only when the same external system generates the ID consistently. What happens for goals dispatched manually via the message queue with no `sourceId`?
+ - Should dedup checks happen at enqueue time, dispatch time, or both? Enqueue-time dedup is earlier but may not know about concurrent activity; dispatch-time dedup is later but more accurate.
+ - How long does a `sourceId` remain "in use" for dedup purposes after a session completes? If the issue is re-labeled after a failed session, it should be re-dispatchable.
 
  ---

- ### Multi-project WorkTrain
+ ### Agent actions as first-class events in the session event log (Apr 18, 2026)

- **Status: idea** | Priority: medium (to investigate)
+ **Status: idea** | Priority: medium

- **Problem:** WorkTrain needs to handle multiple completely unrelated projects simultaneously, but some projects are related and need to share knowledge.
+ **Score: 10** | Cor:1 Cap:2 Eff:2 Lev:2 Con:3 | Blocked: no
 
- **Proposed model:** Workspace namespacing with explicit cross-workspace links:
- ```yaml
- workspaces:
-   workrail:
-     path: ~/git/personal/workrail
-     knowledgeGraph: ~/.workrail/graphs/workrail.db
-     maxConcurrentSessions: 3
-     relatedWorkspaces: [storyforge]
-   storyforge:
-     path: ~/git/personal/storyforge
-     knowledgeGraph: ~/.workrail/graphs/storyforge.db
-     relatedWorkspaces: [workrail]
- ```
+ The console should be able to reconstruct exactly what an agent did in a session -- every tool call, every argument, every result -- by reading the event log alone.
 
- **Must be workspace-scoped:** knowledge graph, daemon-soul.md, session store, concurrency limits, triggers.
+ **What's missing -- agent-level events:**
+ - `tool_call_started` -- tool name, args, timestamp
+ - `tool_call_completed` -- result (truncated), duration, success/error
+ - `llm_turn_started` -- model, input token count
+ - `llm_turn_completed` -- stop reason, output tokens, tools requested
+ - `steer_injected` -- what context was injected and why
+ - `report_issue_recorded` -- the structured issue from the `report_issue` tool
 
- **Can be shared globally:** WorkTrain binary, token usage tracking, message queue, merge audit log.
+ **Where to emit them:** in `src/daemon/agent-loop.ts`, before and after each `tool.execute()` call and each LLM call; in `src/daemon/workflow-runner.ts` for steer injection.
+
+ **Console rendering:** each session detail view gets a "Timeline" tab showing, per phase: `llm_turn (450 tokens -> 3 tool calls)`, `bash: git status (45ms)`, `read: AGENTS.md (8ms)`, `llm_turn (280 tokens -> advance)`.
+
+ **Build order:** add `tool_call_started`/`tool_call_completed` to `agent-loop.ts` (smallest change, highest value); then add `llm_turn_started`/`llm_turn_completed`; then the console Timeline tab; then wire up the `report_issue_recorded` and `steer_injected` events. Once session events are comprehensive, the `DaemonEventEmitter` daily log files become secondary.
+
+ **Things to hash out:**
+ - Tool call arguments are logged for `tool_call_started`, and arguments can contain sensitive content (file content, bash output, API responses). What is the sanitization or truncation policy?
+ - Logging every LLM turn means every token count is in the session event log. This is useful for analytics but also reveals cost information. Is this data considered sensitive?
+ - The "Timeline" tab in the console requires the agent-loop events to be in the session store, not just in daemon logs. Does the existing session store schema need to be extended, or is there already a path for agent-loop events?
+ - Should these events be emitted in MCP (interactive) sessions, or only in daemon sessions? The logging overhead may be more acceptable in one context than the other.
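The wrapping described under "Where to emit them" can be sketched as a decorator around `tool.execute()`. This is a minimal sketch under assumptions: the `Tool` and `SessionEvent` shapes and the `emit` sink are illustrative, not the real agent-loop types; the real emitter would append to the session event log:

```typescript
interface SessionEvent {
  type: string;
  data: Record<string, unknown>;
  ts: number;
}

interface Tool {
  name: string;
  execute(args: unknown): Promise<unknown>;
}

// Wrap a tool so every call emits tool_call_started / tool_call_completed.
// The emit callback is a placeholder for the real session event log writer.
function withCallEvents(tool: Tool, emit: (e: SessionEvent) => void): Tool {
  return {
    name: tool.name,
    async execute(args: unknown): Promise<unknown> {
      const started = Date.now();
      emit({ type: "tool_call_started", data: { tool: tool.name, args }, ts: started });
      try {
        const result = await tool.execute(args);
        emit({
          type: "tool_call_completed",
          data: { tool: tool.name, durationMs: Date.now() - started, success: true },
          ts: Date.now(),
        });
        return result;
      } catch (err) {
        // Failures still produce a completed event, so the timeline stays complete.
        emit({
          type: "tool_call_completed",
          data: { tool: tool.name, durationMs: Date.now() - started, success: false },
          ts: Date.now(),
        });
        throw err;
      }
    },
  };
}
```

Because the wrapper owns both sides of the call, the sanitization/truncation policy for `args` (the first open question below) can live in one place.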
 
  ---

- ### Message queue: async communication with WorkTrain from anywhere
+ ### Context budget per spawned agent (Apr 18, 2026)
 
  **Status: idea** | Priority: medium

- **Design:** A persistent message queue (`~/.workrail/message-queue.jsonl`) that decouples when you send a message from when WorkTrain acts on it.
+ **Score: 8** | Cor:1 Cap:2 Eff:2 Lev:1 Con:2 | Blocked: yes (needs knowledge graph)
 
- ```bash
- worktrain tell "skip the architecture review for the polling triggers PR, it's low risk"
- worktrain tell "add knowledge graph vector layer to next sprint"
- ```
+ A pre-packaged bundle of ~2000 tokens that every coordinator-spawned agent starts with. The knowledge graph is what makes this scalable.
 
- Each command appends to the queue. The daemon drains between agent completions -- never mid-run, always at a natural break point.
+ **Bundle contents:**
+ - `<relevant_files>` -- paths + key excerpts from files the agent will likely touch (from a KG query)
+ - `<prior_sessions>` -- summaries of the last 3 sessions that touched related code
+ - `<established_patterns>` -- specific patterns the agent must follow
+ - `<known_facts>` -- things already proven true
+ - `<do_not_explore>` -- explicit list of dead ends and already-tried approaches
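The bundle sections above can be sketched as a typed structure plus a renderer that emits the tagged blocks. The interface and function names here are illustrative assumptions, not existing WorkRail types:

```typescript
// Field names mirror the bundle sections listed above; the shape itself
// is an assumption, not an existing WorkRail type.
interface ContextBundle {
  relevantFiles: { path: string; excerpt: string }[];
  priorSessions: string[]; // summaries of the last 3 related sessions
  establishedPatterns: string[];
  knownFacts: string[];
  doNotExplore: string[];
}

// Render the bundle as the tagged sections an agent prompt would embed.
function renderBundle(b: ContextBundle): string {
  const section = (tag: string, body: string[]): string =>
    `<${tag}>\n${body.join("\n")}\n</${tag}>`;
  return [
    section("relevant_files", b.relevantFiles.map((f) => `${f.path}: ${f.excerpt}`)),
    section("prior_sessions", b.priorSessions),
    section("established_patterns", b.establishedPatterns),
    section("known_facts", b.knownFacts),
    section("do_not_explore", b.doNotExplore),
  ].join("\n\n");
}
```

Keeping the bundle a plain data structure means the same shape works whether the coordinator fills it by hand (today) or a KG query fills it automatically (future).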
 
- **Outbox (WorkTrain -> user):** WorkTrain appends notifications to `~/.workrail/outbox.jsonl`. A mobile client polls this or an HTTP SSE endpoint wraps it.
+ **Without the KG (today):** the coordinator manually includes key context in the prompt.
+ **With the KG (future):** `worktrain spawn --workflow X --goal "..."` automatically queries the KG and assembles the context bundle. The coordinator just provides the goal.
 
- **This is the foundation for mobile monitoring.** The mobile app is just a client that reads outbox and writes to message-queue.
+ **Things to hash out:**
+ - This entry is closely related to "Coordinator context injection standard" earlier in the backlog. Are these the same idea, or does this entry specifically cover the KG-backed assembly rather than the general standard?
+ - The KG query for "relevant files" must happen before the agent starts. What is the latency of this query, and does it add meaningful overhead to session dispatch time?
+ - "Prior sessions" summaries require the KG to have indexed session notes. Is session note indexing part of the KG build process, or a separate concern?
+ - If the KG is stale or unavailable at dispatch time, should the session start without a context bundle, or should dispatch be deferred?
 
  ---

- ### Periodic analysis agents
-
- **Status: idea** | Priority: low
+ ### Work queue refinements: filtering, catch-all mode, and deadline-aware prioritization (Apr 15, 2026)
 
- Agents on a schedule that proactively identify issues, gaps, and improvement opportunities:
+ **Status: idea** | Priority: medium
 
- - **Weekly: Code health scan** -- `architecture-scalability-audit` on modules not audited in 30 days
- - **Weekly: Test coverage scan** -- files modified with zero/low test coverage
- - **Weekly: Documentation drift scan** -- recently merged PRs changed behavior described in docs
- - **Monthly: Dependency health scan** -- CVEs, active forks, lighter alternatives
- - **Monthly: Performance baseline** -- benchmark scenarios vs previous month
- - **Continuous: Security scan** -- on every PR merge, OWASP top 10 patterns in changed files
- - **Monthly: Ideas generation** -- `wr.discovery` on codebase + backlog + session history, asking "what's the most impactful thing we could build next?"
+ **Score: 8** | Cor:1 Cap:2 Eff:2 Lev:1 Con:2 | Blocked: no
 
- ---
+ **Issue/ticket filtering:** richer than just a label -- filter by project, milestone, assignee, sprint, or component. Per-source filter config with a `notLabels` exclusion list.
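One possible shape for the per-source filter config, following the document's other YAML examples. The field names here are illustrative assumptions, not a confirmed schema:

```yaml
sources:
  - type: github_issues
    repo: example-org/example-repo   # hypothetical source
    filter:
      labels: [worktrain]            # include only these
      notLabels: [blocked, wip]      # exclusion list
      milestone: "v4.0"
      assignee: unassigned
```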
 
- ### Monitoring, analytics, and autonomous remediation
+ **Catch-all mode:** if `filter` is omitted entirely, WorkTrain pulls everything open and unassigned in the project/repo. This requires an explicit `catchAll: true` opt-in plus a `maxItemsPerCycle` limit.
 
- **Status: idea** | Priority: low
+ **Deadline-aware prioritization:** WorkTrain reads deadline context from issue/ticket due dates, epic end dates, sprint end dates, release/milestone dates, and optionally Confluence/Google Calendar, then computes an adjusted priority score:
+ ```
+ deadline_urgency: past due = +4, < 2 days = +3, < 7 days = +2, < 14 days = +1, >= 14 days = +0
+ adjusted_priority = base_priority + deadline_urgency
+ ```
 
- WorkTrain watches application health metrics (error rate, latency, session success/failure rate, queue depth), identifies anomalies, investigates root causes, and resolves what it can automatically.
+ Items are queued in adjusted-priority order: a medium-priority task due tomorrow beats a high-priority task due in 3 months.
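The urgency tiers above can be sketched as a small scoring helper (tier boundaries from this entry, treating exactly 14 days out as +0; `base_priority` is still an open question below, so it is just a number here):

```typescript
// Tier boundaries from the formula above; daysUntilDue < 0 means past due.
function deadlineUrgency(daysUntilDue: number): number {
  if (daysUntilDue < 0) return 4; // past due
  if (daysUntilDue < 2) return 3;
  if (daysUntilDue < 7) return 2;
  if (daysUntilDue < 14) return 1;
  return 0;
}

// base_priority's source (labels, explicit field, inference) is undecided,
// so it is taken as a plain number here.
function adjustedPriority(basePriority: number, daysUntilDue: number): number {
  return basePriority + deadlineUrgency(daysUntilDue);
}
```

For example, a medium-priority task (base 2) due tomorrow scores 5, while a high-priority task (base 3) due in 3 months scores 3, matching the ordering claim above.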
 
- **Monitoring loop:** Detect anomaly -> classify severity -> investigate with `bug-investigation.agentic.v2` -> if confidence >= 0.8 and severity <= High, attempt auto-remediation (config/feature-flag fix, code fix) or else escalate with full findings.
+ **Escalation when deadlines are at risk:** if a queue item has a deadline within 48 hours and hasn't been started, the watchdog bumps it to position 1 and notifies by posting to Slack and the message outbox.
 
- **Analytics dashboard:** Per-module PR cycle time, workflow step failure rates, token cost per session type, quality score (weighted composite of review accuracy + coding success rate + investigation accuracy).
+ **Things to hash out:**
+ - `base_priority` is referenced in the priority scoring formula but not defined in this entry. Where does base priority come from -- issue labels, an explicit priority field, or inference?
+ - Reading deadline context from Confluence and Google Calendar requires auth integration. Is this in scope for the initial implementation, or a phase 2 concern?
+ - "Past due = +4" could cause extremely stale tasks to permanently occupy the top of the queue. Is there a cap on the urgency boost, or a different treatment for overdue items?
+ - Bumping a task to position 1 due to deadline urgency could interrupt a work sequence that was deliberately ordered. Who should be notified when an automatic priority bump happens?
+ - `catchAll: true` pulls all open unassigned items. In an active repo, this could mean hundreds of items entering the queue simultaneously. What is the behavior when `maxItemsPerCycle` is reached?
 
  ---

- ### Cross-repo execution model
+ ### Workspace pipeline policy: artifact gates vs autonomous decomposition (Apr 15, 2026)
 
- **Status: idea** | Priority: medium (post-MVP for hosted tier)
+ **Status: idea** | Priority: medium
 
- **Problem:** WorkRail currently assumes a single repo. The autonomous daemon breaks this -- a coding task may touch Android, iOS, and a backend API simultaneously.
+ **Score: 8** | Cor:1 Cap:2 Eff:1 Lev:2 Con:2 | Blocked: no
 
- **Workspace manifest:** Sessions declare which repos they need:
- ```json
- {
-   "context": {
-     "repos": [
-       { "name": "android", "path": "~/git/my-project/android" },
-       { "name": "backend", "path": "~/git/my-project/backend" }
-     ]
-   }
- }
+ **The core tension:** some workspaces have rigorous pre-implementation processes (BRD required, design approved, shape-up doc reviewed); others are solo/small-team projects where you figure it out as you go. WorkTrain should respect both.
+
+ **Two workspace modes:**
+
+ **Governed mode** -- for projects with existing process gates:
+ ```yaml
+ pipelinePolicy:
+   mode: governed
+   requiredArtifacts:
+     - type: brd
+       sources: [confluence, jira_epic, google_docs]
+       searchQuery: "BRD {{ticket.key}}"
+   onMissingArtifacts: wait  # 'wait', 'skip', or 'escalate'
+   waitCheckInterval: 3600
+   waitTimeout: 168h
  ```

- **Scoped tools:** `BashInRepo`, `ReadRepo`, `WriteRepo` that route to the correct working directory.
+ When WorkTrain picks up a ticket and the required artifacts aren't found, it holds the ticket in a "waiting" state, re-checks hourly, and notifies when the artifacts appear. Once they are found, it automatically extracts context and proceeds, skipping the discovery/design phases since those artifacts already contain the answers.
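The gate logic just described can be sketched as a pure decision function. The type names and the string-based artifact matching are assumptions for illustration; the real policy lives in the workspace config:

```typescript
type MissingPolicy = "wait" | "skip" | "escalate";
type GateDecision = "proceed" | "waiting" | "skipped" | "escalated";

// Maps the onMissingArtifacts config value to the ticket's resulting state.
const onMissingOutcome: Record<MissingPolicy, GateDecision> = {
  wait: "waiting",
  skip: "skipped",
  escalate: "escalated",
};

// Decide how a governed-mode ticket proceeds, given which required
// artifact types (e.g. "brd") were found. Waiting tickets are assumed to
// re-check on waitCheckInterval until waitTimeout elapses.
function gateDecision(
  found: string[],
  required: string[],
  onMissing: MissingPolicy
): GateDecision {
  const missing = required.filter((r) => !found.includes(r));
  return missing.length === 0 ? "proceed" : onMissingOutcome[onMissing];
}
```

Keeping the decision pure makes the open question below (does "waiting" hold a concurrency slot?) purely a scheduler concern, separate from the gate itself.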
 
- **Dynamic provisioning:** If the repo is already cloned locally, use it. If declared as a remote URL, clone to `~/.workrail/repos/<name>/`.
+ **Autonomous mode** -- for projects without a pre-existing process: WorkTrain runs the full pipeline, including discovery, UX design, architecture review, and implementation.
 
- **This is the feature that makes WorkRail truly freestanding** for multi-repo development teams.
+ **Automatic task decomposition:** when a task is classified as `Large` (or `Medium` with high complexity), WorkTrain decomposes it into sub-tickets before starting implementation. Sub-tickets are `Small` or `Medium` (never `Large`) and are added to the queue with `parentTicketId` and `dependsOn` links.
+
+ **The "patiently waiting" UX:** the console Queue tab shows tickets waiting for artifacts in a distinct state, plus a Slack notification when WorkTrain starts waiting and again when the artifacts are found.
+
+ **Things to hash out:**
+ - Governed mode's `waitTimeout: 168h` (one week) means a ticket can wait for a week. Does waiting hold a concurrency slot, or is it a separate "pending" state outside the concurrency pool?
+ - Automatic task decomposition creates GitHub/Jira sub-tickets autonomously. Is this acceptable without human review, or should sub-ticket creation be a gate requiring approval?
+ - "Large = decompose" requires a reliable `Large` classification. What is the cost of a misclassification that either skips decomposition (too large a task given to one agent) or decomposes unnecessarily (adding overhead)?
+ - How does the governed vs autonomous mode selection work? Is it a workspace config flag, or does WorkTrain infer the mode from the presence or absence of artifact gates?
+ - What does "context injection from BRD" look like at the agent level? Is the BRD injected as a reference, a context bundle field, or the full text?
 
  ---

- ### Long-term vision: WorkRail as a general engine, domain packs as configuration
+ ### Templates, living docs, and external workflow ingestion (Apr 15, 2026)
 
- **Status: idea** | Priority: long-term
+ **Status: idea** | Priority: medium
 
- WorkTrain is not just a coding tool. The underlying engine -- session management, workflow enforcement, daemon, agent loop, knowledge graph, context bundle assembly -- is domain-agnostic.
+ **Score: 6** | Cor:1 Cap:1 Eff:1 Lev:1 Con:2 | Blocked: no
 
- **Domain packs:** Self-contained configuration bundles that specialize WorkTrain for a specific problem domain: a set of workflows, a knowledge graph schema, context bundle query definitions, trigger definitions, a daemon soul template.
+ **Templates:** WorkTrain knows the templates used in each workspace and applies them automatically: PR templates, Jira ticket templates, design spec templates, BRD templates. Templates are resolved at session start and injected as context; the agent is told "when creating a [type], use this template structure exactly."
 
- **Examples:** `worktrain-coding` (current default), `worktrain-research`, `worktrain-creative`, `worktrain-ops`, `worktrain-data`.
+ **Living docs:** WorkTrain maintains documentation as a first-class output, not an afterthought.
+ - On-demand: `worktrain doc generate --type architecture-overview --workspace workrail`
+ - Continuous updates: when code changes, affected docs are flagged for update. `doc-drift-scan` (part of the periodic analysis work) identifies docs whose described behavior no longer matches the code.
 
- **When to make it explicit:** The right time is when a second domain is ready to be added. Extract the coding-specific pieces into `worktrain-coding` and establish the domain pack contract.
+ **External workflow ingestion:**
+ - Workflow registry/marketplace: `worktrain workflow install community/postgres-migration-workflow`
+ - Org-level workflow libraries: teams publish workflow libraries to a git repo and WorkTrain pulls from them
+ - `workflowSources` config: a list of git repos and local paths to discover workflows from
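A `workflowSources` config might look like the following, in the style of the document's other YAML examples. The field names and the repo URL are hypothetical, not a confirmed schema:

```yaml
workflowSources:
  - type: git
    url: https://github.com/example-org/workflow-library  # hypothetical org library
    ref: main
  - type: local
    path: ~/workflows
```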
+
+ **Things to hash out:**
+ - Template injection ("use this template exactly") is a soft instruction to the LLM. How is compliance verified? If the agent diverges from the template, is that a workflow error or an acceptable deviation?
+ - `doc-drift-scan` requires comparing documentation to code semantically. Is this an LLM-based comparison or static analysis? What is the false-positive rate for "this doc is out of date"?
+ - The workflow registry/marketplace concept requires trust decisions: which authors, which workflows, and which versions are safe to install? Is there a vetting process, or is it caveat emptor?
+ - How does the `workflowSources` config interact with the existing workspace source discovery mechanism? Is it additive or a replacement?
+ - "Living docs" updated continuously could produce many noisy documentation PRs. Should doc update frequency be throttled, or batched with code PRs?
 
  ---

@@ -1197,5 +3827,11 @@ WorkTrain has no tooling to surface the state of worktrees and branches relative
  - Abandoned in-progress branches have no attached context about why they were abandoned or what state they were in
  - Daemon-spawned worktrees under `~/.workrail/worktrees/` are opaque -- no indication of which session created them or whether cleanup is safe
+ **Things to hash out:**
+ - What is the authoritative source of truth for "is this worktree safe to delete" -- the session store, the git graph, or both?
+ - Squash-merged branches leave no ancestry trace. What is the detection mechanism -- PR close status from the GitHub API, or file-content comparison with main?
+ - Should the inventory tool be reactive (showing current state on demand) or proactive (the daemon monitors worktree state and alerts when stale worktrees accumulate)?
+ - How does this entry relate to the "Worktree lifecycle management" and "Git worktrees and branch management" entries elsewhere in the backlog? Are these the same problem captured multiple times, or genuinely different aspects?
+
  ---