@exaudeus/workrail 3.41.0 → 3.43.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (63) hide show
  1. package/dist/cli-worktrain.js +40 -11
  2. package/dist/console-ui/assets/{index-CQt4UhPB.js → index-Sb57DW4B.js} +1 -1
  3. package/dist/console-ui/index.html +1 -1
  4. package/dist/context-assembly/deps.d.ts +8 -0
  5. package/dist/context-assembly/deps.js +2 -0
  6. package/dist/context-assembly/index.d.ts +6 -0
  7. package/dist/context-assembly/index.js +50 -0
  8. package/dist/context-assembly/infra.d.ts +3 -0
  9. package/dist/context-assembly/infra.js +154 -0
  10. package/dist/context-assembly/types.d.ts +30 -0
  11. package/dist/context-assembly/types.js +2 -0
  12. package/dist/coordinators/pr-review.d.ts +3 -1
  13. package/dist/coordinators/pr-review.js +25 -4
  14. package/dist/daemon/workflow-runner.d.ts +11 -1
  15. package/dist/daemon/workflow-runner.js +82 -9
  16. package/dist/domain/execution/state.d.ts +6 -6
  17. package/dist/manifest.json +76 -44
  18. package/dist/mcp/handlers/v2-workflow.d.ts +2 -2
  19. package/dist/mcp/output-schemas.d.ts +234 -234
  20. package/dist/mcp/tools.d.ts +2 -2
  21. package/dist/mcp/v2/tools.d.ts +24 -24
  22. package/dist/trigger/delivery-action.d.ts +2 -0
  23. package/dist/trigger/delivery-action.js +24 -0
  24. package/dist/trigger/trigger-router.js +24 -1
  25. package/dist/trigger/trigger-store.js +42 -0
  26. package/dist/trigger/types.d.ts +3 -0
  27. package/dist/v2/durable-core/schemas/artifacts/assessment.d.ts +2 -2
  28. package/dist/v2/durable-core/schemas/artifacts/coordinator-signal.d.ts +2 -2
  29. package/dist/v2/durable-core/schemas/artifacts/loop-control.d.ts +6 -6
  30. package/dist/v2/durable-core/schemas/artifacts/review-verdict.d.ts +6 -6
  31. package/dist/v2/durable-core/schemas/compiled-workflow/index.d.ts +56 -56
  32. package/dist/v2/durable-core/schemas/execution-snapshot/blocked-snapshot.d.ts +83 -83
  33. package/dist/v2/durable-core/schemas/execution-snapshot/execution-snapshot.v1.d.ts +1024 -1024
  34. package/dist/v2/durable-core/schemas/export-bundle/index.d.ts +2336 -2336
  35. package/dist/v2/durable-core/schemas/session/dag-topology.d.ts +6 -6
  36. package/dist/v2/durable-core/schemas/session/events.d.ts +339 -339
  37. package/dist/v2/durable-core/schemas/session/gaps.d.ts +30 -30
  38. package/dist/v2/durable-core/schemas/session/manifest.d.ts +6 -6
  39. package/dist/v2/durable-core/schemas/session/outputs.d.ts +8 -8
  40. package/dist/v2/durable-core/schemas/session/validation-event.d.ts +3 -3
  41. package/docs/design/adaptive-coordinator-context-candidates.md +265 -0
  42. package/docs/design/adaptive-coordinator-context-review.md +101 -0
  43. package/docs/design/adaptive-coordinator-context.md +504 -0
  44. package/docs/design/adaptive-coordinator-routing-candidates.md +340 -0
  45. package/docs/design/adaptive-coordinator-routing-design-review.md +135 -0
  46. package/docs/design/adaptive-coordinator-routing-review.md +156 -0
  47. package/docs/design/adaptive-coordinator-routing.md +660 -0
  48. package/docs/design/context-assembly-design-candidates.md +199 -0
  49. package/docs/design/context-assembly-implementation-plan.md +211 -0
  50. package/docs/design/context-assembly-layer-design-review.md +110 -0
  51. package/docs/design/context-assembly-layer.md +622 -0
  52. package/docs/design/context-assembly-review-findings.md +112 -0
  53. package/docs/design/stuck-escalation-candidates.md +176 -0
  54. package/docs/design/stuck-escalation-design-review.md +70 -0
  55. package/docs/design/stuck-escalation.md +326 -0
  56. package/docs/design/worktrain-task-queue-candidates.md +252 -0
  57. package/docs/design/worktrain-task-queue-design-review.md +109 -0
  58. package/docs/design/worktrain-task-queue.md +443 -0
  59. package/docs/design/worktree-review-findings-candidates.md +101 -0
  60. package/docs/design/worktree-review-findings-design-review.md +65 -0
  61. package/docs/design/worktree-review-findings-implementation-plan.md +153 -0
  62. package/docs/ideas/backlog.md +212 -0
  63. package/package.json +3 -3
@@ -0,0 +1,326 @@
1
+ # Design: Stuck Escalation for Overnight-Autonomous WorkTrain Sessions
2
+
3
+ ## Context / Ask
4
+
5
+ **Stated goal:** Design automatic escalation when WorkTrain sessions get stuck, so overnight-autonomous runs don't burn their full 30-minute wall clock on a broken tool call.
6
+
7
+ **This goal is a solution statement.** The underlying problem is:
8
+
9
+ > Overnight-autonomous sessions have no early-exit path when the agent enters a stuck loop. The system must detect the stuck state, abort cleanly, and produce structured diagnostic output without requiring a human to check the logs after the full wall-clock budget is consumed.
10
+
11
+ **Scope:** `src/daemon/workflow-runner.ts` and `src/trigger/trigger-router.ts` only. Do NOT touch `src/mcp/`.
12
+
13
+ **Prior art:** `docs/design/daemon-stuck-detection-discovery.md` -- discovery that defined the three stuck heuristics, added `AgentStuckEvent` to `daemon-events.ts`, and wired `repeated_tool_call`, `no_progress`, and `timeout_imminent` into the `turn_end` subscriber. Those signals exist and fire today. They are advisory-only -- no abort, no outbox, no notification.
14
+
15
+ ## Path Recommendation
16
+
17
+ **`full_spectrum`** -- both landscape grounding (where to hook abort/outbox in the existing code) and concept shaping (abort policy, result variant design) are real risks. The proposed approach is well-matched to the problem, but the integration point and the result type design need careful analysis.
18
+
19
+ ## Constraints / Anti-goals
20
+
21
+ **Constraints:**
22
+ - Abort only within `src/daemon/` and `src/trigger/`. No changes to `src/mcp/`.
23
+ - `WorkflowRunResult` is a discriminated union consumed by `assertNever` in `TriggerRouter.route()` and `dispatch()` -- any new variant must be handled exhaustively in both callers.
24
+ - Outbox write must be best-effort (non-fatal) -- same contract as `DaemonEventEmitter.emit()`.
25
+ - Notifications are fire-and-forget -- same contract as `NotificationService.notify()`.
26
+ - `stuckAbortPolicy` must be opt-in, defaulting to `'abort'` for new triggers and to `'notify_only'` for any existing trigger that doesn't specify it (to avoid breaking existing behavior).
27
+
28
+ **Anti-goals:**
29
+ - Do not build a watchdog daemon or restart mechanism.
30
+ - Do not implement coordinator retry logic in this design (that is a future fix-coordinator concern).
31
+ - Do not change the `turn_end` subscriber's detection thresholds -- this design wires actions to existing signals, not new signals.
32
+ - Do not make `outbox.jsonl` the sole escalation path -- it has no automated consumer.
33
+
34
+ ## Challenged Assumptions
35
+
36
+ 1. **Assumption: The three heuristics reliably distinguish stuck from legitimately slow.**
37
+ - Risk: `make all` called 3x triggers `repeated_tool_call` but is not stuck. `no_progress` at 80% turns fires during valid deep research.
38
+ - Mitigation: `stuckAbortPolicy: 'notify_only'` available per trigger. For `repeated_tool_call`, the abort is justified because the exact same `argsSummary` (200-char JSON) repeating 3x is a stronger signal than mere tool-name repetition.
39
+
40
+ 2. **Assumption: Aborting on first heuristic trigger is better than one warning turn.**
41
+ - Risk: a transient 500 error resolves on the next turn; abort kills a recovering session.
42
+ - Mitigation: allow `stuckAbortThreshold` (future extension) to increase the repeat count before abort. For the initial design, threshold stays at 3 (existing `STUCK_REPEAT_THRESHOLD`).
43
+
44
+ 3. **Assumption: `outbox.jsonl` is the right escalation target.**
45
+ - Reality: `outbox.jsonl` has no automated consumer -- only `worktrain-inbox` (manual CLI) and `pr-review.ts` `drainMessageQueue` read it. The primary actionable signal is the macOS/webhook notification.
46
+ - Resolution: write to both outbox (for coordinator scripts) AND fire notification (for human overnight use). Neither is the sole path.
47
+
48
+ ## Landscape Packet
49
+
50
+ ### Integration point: where abort already happens
51
+
52
+ In `workflow-runner.ts` the `turn_end` subscriber already has two abort paths:
53
+
54
+ ```
55
+ // max_turns path (line 3088-3104)
56
+ if (maxTurns > 0 && turnCount >= maxTurns && timeoutReason === null) {
57
+ timeoutReason = 'max_turns';
58
+ emitter?.emit({ kind: 'agent_stuck', reason: 'timeout_imminent', ... });
59
+ agent.abort();
60
+ return;
61
+ }
62
+
63
+ // stuck detection heuristics (lines 3106-3165)
64
+ // -- repeated_tool_call: emit only, no abort
65
+ // -- no_progress: emit only, no abort
66
+ // -- timeout_imminent: emit only (abort already fired in setTimeout callback)
67
+ ```
68
+
69
+ The stuck abort must be inserted immediately after each heuristic `emitter?.emit()` call, guarded by `stuckAbortPolicy`.
70
+
71
+ ### WorkflowRunResult union (current)
72
+
73
+ ```typescript
74
+ type WorkflowRunResult =
75
+ | WorkflowRunSuccess // _tag: 'success'
76
+ | WorkflowRunError // _tag: 'error'
77
+ | WorkflowRunTimeout // _tag: 'timeout'
78
+ | WorkflowDeliveryFailed // _tag: 'delivery_failed'
79
+ ```
80
+
81
+ `TriggerRouter.route()` and `dispatch()` both have `assertNever(result)` guards. Adding a new `_tag: 'stuck'` variant requires handling in both.
82
+
83
+ ### outbox.jsonl write pattern (from pr-review.ts)
84
+
85
+ ```typescript
86
+ const outboxPath = path.join(os.homedir(), '.workrail', 'outbox.jsonl');
87
+ await fs.mkdir(workrailDir, { recursive: true });
88
+ await fs.appendFile(outboxPath, JSON.stringify(entry) + '\n', 'utf8');
89
+ ```
90
+
91
+ No shared utility -- each writer does it directly. The stuck-escalation writer should follow the same pattern, injected via `WorkflowTrigger` deps or called as a module-level helper.
92
+
93
+ ### NotificationService (notification-service.ts)
94
+
95
+ `notify(result: WorkflowRunResult, goal: string)` -- already dispatches on `result._tag`. Adding a `stuck` variant requires a new case in `buildNotificationBody()`, `buildOutcome()`, and `buildDetail()`.
96
+
97
+ ### agentConfig in WorkflowTrigger (workflow-runner.ts line 214)
98
+
99
+ ```typescript
100
+ readonly agentConfig?: {
101
+ readonly model?: string;
102
+ readonly maxSessionMinutes?: number;
103
+ readonly maxTurns?: number;
104
+ // NEW:
105
+ readonly stuckAbortPolicy?: 'abort' | 'notify_only';
106
+ readonly noProgressAbortEnabled?: boolean;
107
+ };
108
+ ```
109
+
110
+ This is the correct location for both new fields. They are session-behavior knobs, same as `maxSessionMinutes` and `maxTurns` -- not trigger-routing knobs.
111
+
112
+ ## Problem Frame Packet
113
+
114
+ The three stuck heuristics (`repeated_tool_call`, `no_progress`, `timeout_imminent`) fire today as advisory events. There is no action. For overnight-autonomous use, the invariant must change:
115
+
116
+ - `repeated_tool_call` is a **hard abort signal** -- same tool+args 3x is definitionally broken.
117
+ - `no_progress` at 80% turns is a **soft abort signal** -- may have false positives; the policy controls whether to abort or just notify.
118
+ - `timeout_imminent` already has an abort (the wall-clock timer fires `agent.abort()` independently) -- no new abort needed here, but a stuck-escalation outbox entry and notification should fire.
119
+
120
+ After abort, `runWorkflow()` must return a new `WorkflowRunResult` variant that carries enough structured data for a human and a future fix-coordinator to understand the failure without reading logs.
121
+
122
+ ## Recommended Design
123
+
124
+ ### 1. Abort Policy
125
+
126
+ | Signal | Default action | With `stuckAbortPolicy: 'notify_only'` | Gating flag |
127
+ |---|---|---|---|
128
+ | `repeated_tool_call` | Abort immediately | Emit event only, no abort | Always active |
129
+ | `no_progress` | Emit + notify only | Emit + notify only | `noProgressAbortEnabled: true` required to abort |
130
+ | `timeout_imminent` | No new abort (already aborting) | Same | N/A |
131
+
132
+ **`stuckAbortPolicy: 'abort' | 'notify_only'`** (default: `'abort'`): controls whether a stuck signal triggers an abort or only an event emission and notification. Per-trigger, lives in `agentConfig`.
133
+
134
+ **`noProgressAbortEnabled: boolean`** (default: `false`): separately gates whether the `no_progress` heuristic (80% turns, 0 step advances) can trigger an abort. When false, `no_progress` only emits the `agent_stuck` event and fires a notification -- it never aborts. This flag exists because `no_progress` has a meaningful false-positive rate on legitimate deep-research sessions (e.g. a wr.discovery run spending 50 turns before its first step advance).
135
+
136
+ **Rationale for the default `noProgressAbortEnabled: false`:** The primary overnight-autonomous failure mode is `repeated_tool_call` (the same failing command called 15x). The `no_progress` heuristic is a secondary signal that benefits from explicit opt-in after the false-positive rate is observed in production.
137
+
138
+ ### 2. New `WorkflowRunResult` Variant
139
+
140
+ ```typescript
141
+ /** Workflow aborted by stuck detection before the wall-clock timeout fired. */
142
+ export interface WorkflowRunStuck {
143
+ readonly _tag: 'stuck';
144
+ readonly workflowId: string;
145
+ /**
146
+ * Which heuristic triggered the abort.
147
+ * Matches AgentStuckEvent.reason to allow correlation with daemon event log.
148
+ */
149
+ readonly stuckReason: 'repeated_tool_call' | 'no_progress';
150
+ /** Human-readable description of why stuck was detected. */
151
+ readonly detail: string;
152
+ /** The tool name that was called repeatedly (present for repeated_tool_call). */
153
+ readonly toolName?: string;
154
+ /** The argsSummary of the repeated call (present for repeated_tool_call). */
155
+ readonly argsSummary?: string;
156
+ /** Total LLM turns consumed at the time of abort. */
157
+ readonly turnCount: number;
158
+ /** Number of workflow step advances at the time of abort. */
159
+ readonly stepAdvanceCount: number;
160
+ /**
161
+ * Wall-clock milliseconds elapsed from session start to abort.
162
+ * Lets a coordinator compute wall-clock savings vs. a full timeout.
163
+ * Requires `sessionStartMs = Date.now()` to be added before the agent loop.
164
+ */
165
+ readonly elapsedMs: number;
166
+ /**
167
+ * Summaries of all issue_reported calls during this session (if any).
168
+ * Populated from the `issueSummaries` ring tracked by the onIssueSummary callback.
169
+ * Provides additional context for a fix-coordinator without requiring log parsing.
170
+ */
171
+ readonly issueSummaries?: readonly string[];
172
+ }
173
+
174
+ // Updated union:
175
+ export type WorkflowRunResult =
176
+ | WorkflowRunSuccess
177
+ | WorkflowRunError
178
+ | WorkflowRunTimeout
179
+ | WorkflowDeliveryFailed
180
+ | WorkflowRunStuck; // NEW
181
+ ```
182
+
183
+ **Why a new variant instead of reusing `WorkflowRunError` or `WorkflowRunTimeout`:**
184
+ - `WorkflowRunError` means a tool or engine error, not a stuck loop. Conflating them forces consumers to parse message strings.
185
+ - `WorkflowRunTimeout` means the wall-clock fired -- stuck abort happens _before_ the wall clock fires.
186
+ - A distinct `_tag: 'stuck'` lets `TriggerRouter`, `NotificationService`, and future coordinators distinguish the case at compile time with `assertNever` exhaustiveness.
187
+
188
+ **`ChildWorkflowRunResult` must also be updated** to include `WorkflowRunStuck` since `runWorkflow()` now produces it directly (not via `TriggerRouter`).
189
+
190
+ ### 3. Outbox Entry Schema
191
+
192
+ Written to `~/.workrail/outbox.jsonl` as a single JSONL line immediately after abort:
193
+
194
+ ```json
195
+ {
196
+ "id": "<uuid-v4>",
197
+ "kind": "stuck_session",
198
+ "sessionId": "<local-uuid>",
199
+ "workrailSessionId": "<sess_...>",
200
+ "workflowId": "<workflow-id>",
201
+ "stuckReason": "repeated_tool_call",
202
+ "detail": "Same tool+args called 3 times: Bash",
203
+ "toolName": "Bash",
204
+ "argsSummary": "{\"command\":\"npm test\"}",
205
+ "turnCount": 12,
206
+ "stepAdvanceCount": 0,
207
+ "elapsedMs": 94000,
208
+ "issueSummaries": ["Tool call failed: ENOENT /path/to/file"],
209
+ "timestamp": "2026-04-19T03:14:15.926Z"
210
+ }
211
+ ```
212
+
213
+ **Fields needed by a future fix-coordinator:**
214
+ - `stuckReason` + `toolName` + `argsSummary` -- to classify the failure and decide whether to retry
215
+ - `turnCount` + `stepAdvanceCount` -- to understand how much progress was lost
216
+ - `workrailSessionId` -- to correlate with WorkRail session store for checkpoint resumption
217
+ - `elapsedMs` -- to measure wall-clock savings vs. a full timeout
218
+
219
+ ### 4. Integration Point
220
+
221
+ **Location: inside the `turn_end` subscriber in `runWorkflow()`, immediately after each stuck heuristic `emitter?.emit()` call.**
222
+
223
+ Rationale:
224
+ - This is where `agent.abort()` already lives for `max_turns`.
225
+ - `turnCount`, `stepAdvanceCount`, `sessionStartMs`, `toolName`, `argsSummary` are all in scope as closures.
226
+ - Post-run handling in `TriggerRouter.route()` is the wrong layer: by the time `runWorkflow()` returns, the outbox write should already have happened (it's diagnostic context for the abort, not post-hoc delivery).
227
+
228
+ **Abort sequence (for `repeated_tool_call`):**
229
+ 1. Check: `stuckAbortPolicy !== 'notify_only'` (default: abort)
230
+ 2. Set `stuckReason` and `stuckContext` closure variables
231
+ 3. Call `agent.abort()`
232
+ 4. Write outbox entry (best-effort, non-fatal, fire-and-forget pattern)
233
+ 5. Return from `turn_end` subscriber (same as `max_turns` path)
234
+ 6. `runWorkflow()` catches the abort and returns `WorkflowRunResult` with `_tag: 'stuck'`
235
+
236
+ **`runWorkflow()` return path:**
237
+ The existing error-catch block needs a branch for the stuck abort. The `stuckContext` closure variable (set in step 2 above) distinguishes stuck abort from other aborts:
238
+
239
+ ```typescript
240
+ // Existing error catch in runWorkflow():
241
+ } catch (err) {
242
+ if (stuckContext !== null) {
243
+ return {
244
+ _tag: 'stuck',
245
+ workflowId: trigger.workflowId,
246
+ ...stuckContext,
247
+ };
248
+ }
249
+ // ... existing error handling
250
+ }
251
+ ```
252
+
253
+ ### 5. TriggerRouter Changes
254
+
255
+ Both `route()` and `dispatch()` need a new branch before the `assertNever` guard:
256
+
257
+ ```typescript
258
+ } else if (result._tag === 'stuck') {
259
+ console.log(
260
+ `[TriggerRouter] Workflow stuck: triggerId=${trigger.id} ` +
261
+ `workflowId=${trigger.workflowId} reason=${result.stuckReason} ` +
262
+ `tool=${result.toolName ?? 'n/a'} turns=${result.turnCount}`,
263
+ );
264
+ }
265
+ ```
266
+
267
+ Delivery (`maybeRunDelivery`) should NOT run for stuck results -- there is no successful output to commit.
268
+
269
+ ### 6. NotificationService Changes
270
+
271
+ Three pure functions need new cases:
272
+
273
+ ```typescript
274
+ // buildNotificationBody:
275
+ case 'stuck':
276
+ return `Session aborted (stuck loop): ${truncated}`;
277
+
278
+ // buildOutcome: NotificationPayload['outcome'] needs 'stuck' added
279
+ // buildDetail:
280
+ case 'stuck':
281
+ return `stuckReason: ${result.stuckReason}; tool: ${result.toolName ?? 'n/a'}; ` +
282
+ `turns: ${result.turnCount}; stepAdvances: ${result.stepAdvanceCount}`;
283
+ ```
284
+
285
+ `NotificationPayload.outcome` currently is `'success' | 'error' | 'timeout' | 'delivery_failed'`. Add `'stuck'`.
286
+
287
+ ## 5-File Change Estimate
288
+
289
+ | File | Change |
290
+ |---|---|
291
+ | `src/daemon/workflow-runner.ts` | (1) Add `WorkflowRunStuck` interface and update `WorkflowRunResult` union. (2) **CRITICAL: Also update `ChildWorkflowRunResult` type alias** -- if missed, the cast at line 2014 silently allows `_tag: 'stuck'` to reach `makeSpawnAgentTool`'s assertNever, causing a runtime crash in child sessions. (3) Add `stuckAbortPolicy` and `noProgressAbortEnabled` to `WorkflowTrigger.agentConfig`. (4) Add `sessionStartMs = Date.now()` alongside `turnCount`. (5) Wire abort + fire-and-forget outbox write in `turn_end` subscriber. (6) Add `stuckContext` closure variable + branch in catch block to return `_tag: 'stuck'`. (7) Add `stuck` case to `makeSpawnAgentTool` switch. |
292
+ | `src/trigger/trigger-router.ts` | Add `stuck` branch in `route()` and `dispatch()` before `assertNever`; `maybeRunDelivery` already skips non-success results (no change needed). |
293
+ | `src/trigger/notification-service.ts` | Add `stuck` case to `buildNotificationBody`, `buildOutcome`, `buildDetail`; add `'stuck'` to `NotificationPayload.outcome`. |
294
+ | `src/trigger/types.ts` | Add `stuckAbortPolicy?: 'abort' | 'notify_only'` and `noProgressAbortEnabled?: boolean` to `TriggerDefinition.agentConfig`. |
295
+ | `src/daemon/daemon-events.ts` | No changes required -- `AgentStuckEvent` already exists and fires. |
296
+
297
+ **Total: 4 files changed** (`daemon-events.ts` is unchanged). `workflow-runner.ts` has multiple edit locations -- all must be done in a single commit to avoid TypeScript exhaustiveness errors at intermediate states.
298
+
299
+ ## Decision Log
300
+
301
+ - **`full_spectrum` path chosen** because both the integration point (landscape) and the result type design (concept) are genuine risks.
302
+ - **New `_tag: 'stuck'` variant** preferred over reusing `WorkflowRunTimeout` because stuck abort fires _before_ the wall clock -- conflating them loses diagnostic precision.
303
+ - **Outbox write in `turn_end` subscriber** (not in `TriggerRouter`) because the outbox entry is diagnostic context for the abort, not post-hoc delivery.
304
+ - **`stuckAbortPolicy` in `agentConfig`** (not a top-level `TriggerDefinition` field) because it belongs with `maxSessionMinutes` and `maxTurns` -- all are session-behavior knobs, not trigger-routing knobs.
305
+ - **`timeout_imminent` does not get a new abort** -- the wall-clock timer already calls `agent.abort()` independently; adding a second abort would be redundant and could race.
306
+ - **Default `stuckAbortPolicy: 'abort'`** for all triggers. Overnight-autonomous is the primary use case. Human-supervised users who want softer behavior must opt in to `'notify_only'`.
307
+ - **`noProgressAbortEnabled: false` default** -- `no_progress` has a real false-positive rate on deep-research sessions. Separating the gate from `stuckAbortPolicy` allows independent control: a trigger can have `stuckAbortPolicy: 'abort'` (abort on `repeated_tool_call`) without also aborting on `no_progress`.
308
+ - **`issueSummaries` field borrowed from Candidate C** -- the `issueSummaries` array is already tracked in session closures at zero additional collection cost. Adding it to `WorkflowRunStuck` now avoids a future breaking interface change when a fix-coordinator needs it.
309
+ - **outbox is best-effort** (fire-and-forget, errors swallowed) -- consistent with the `DaemonEventEmitter` contract. A failed write must never affect the `WorkflowRunResult`.
310
+ - **`sessionStartMs = Date.now()`** must be added before the agent loop -- it does not currently exist in workflow-runner.ts. Trivial one-line addition.
311
+ - **Candidate A rejected** -- adding `reason: 'stuck_loop'` to `WorkflowRunTimeout` conflates stuck-abort with wall-clock timeout, violating 'make illegal states unrepresentable'. The 5-file vs. 2-file maintenance advantage does not justify this semantic violation.
312
+ - **Candidate C deferred** -- `issue_reported severity=fatal` abort is the correct Phase 2 extension once `repeated_tool_call` abort is validated in production. The `issueSummaries` field on `WorkflowRunStuck` provides partial Candidate C value at near-zero cost.
313
+
314
+ ## Residual Concerns
315
+
316
+ 1. **`repeated_tool_call` false-positive rate in production is unvalidated.** The `stuckAbortPolicy: 'notify_only'` escape hatch mitigates. If false-positive rate exceeds ~20%, consider flipping the default to `'notify_only'` and making `'abort'` opt-in.
317
+
318
+ 2. **Outbox write may be lost on rapid daemon shutdown.** The fire-and-forget write initiates before `runWorkflow()` returns, but completes asynchronously. On SIGKILL, the entry may be lost. Same risk as `DaemonEventEmitter` -- accepted by design.
319
+
320
+ 3. **No shadow-mode validation.** Ideally, heuristics would run in shadow mode (emit only) for 20+ production sessions before enabling abort. `stuckAbortPolicy: 'notify_only'` serves as a manual shadow mode.
321
+
322
+ ## Final Summary
323
+
324
+ The design adds `WorkflowRunStuck` as a new `WorkflowRunResult` variant (`_tag: 'stuck'`), wires abort into the `repeated_tool_call` heuristic (unconditional, subject to `stuckAbortPolicy`) and optionally into `no_progress` (gated by `noProgressAbortEnabled: true`) in the `turn_end` subscriber, writes a structured outbox entry on abort, and extends `NotificationService` to distinguish stuck from timeout. The result carries `issueSummaries` for coordinator context. The policy is per-trigger via `agentConfig.stuckAbortPolicy` and `agentConfig.noProgressAbortEnabled`. The change touches 4 files and follows all existing patterns (fire-and-forget outbox, best-effort notification, `assertNever` exhaustiveness, max_turns abort template). A future fix-coordinator can read the outbox entry and extract all fields needed to classify and potentially retry the stuck session without parsing log text.
325
+
326
+ **Critical implementation note:** `ChildWorkflowRunResult` must be updated in the same commit as `WorkflowRunResult`. Failure to do so causes a runtime `assertNever` crash in child sessions spawned by `makeSpawnAgentTool`.
@@ -0,0 +1,252 @@
1
+ # WorkTrain Task Queue: Design Candidates
2
+
3
+ _Raw investigative material for main agent synthesis. Not a final decision._
4
+
5
+ ---
6
+
7
+ ## Problem Understanding
8
+
9
+ ### Core Tensions
10
+
11
+ **T1: Type safety vs. GitHub's stringly-typed surface**
12
+ The repo philosophy requires discriminated unions and exhaustive switches. GitHub issues are strings all the way down -- labels are strings, body is a string. The coordinator must parse and validate at the boundary, then work with typed values internally. The schema defines the valid value set (closed enum), not GitHub's type system.
13
+
14
+ **T2: Minimum required fields vs. routing ambiguity**
15
+ YAGNI says avoid speculative fields. But if the minimum schema (maturity only) is ambiguous in even one routing case (ready bug vs ready feature), it fails the primary success criterion. Adding `type` as a second required label resolves the ambiguity without speculative over-engineering.
16
+
17
+ **T3: Labels as transport vs. labels as schema**
18
+ Labels are the most natural routing mechanism (existing labelFilter infrastructure). But they can only carry enumerated values -- no URLs, no file paths. The schema needs two layers: labels for routing (read at dispatch time by coordinator) and body for enrichment (read at runtime by workflows). These are different contracts for different consumers.
19
+
20
+ **T4: Schema stability vs. coordinator evolution**
21
+ If the schema over-specifies coordinator behavior (e.g. `auto_merge: true` requires the coordinator to implement auto-merge), the schema becomes a coordinator implementation spec. The schema must define only issue observable properties, not coordinator behavior.
22
+
23
+ ### The Real Seam
24
+
25
+ The seam is at the coordinator, not the trigger. The trigger dispatches "there is a worktrain issue." The coordinator reads the issue, parses maturity+type from labels, and decides the pipeline. The schema is a contract between the issue and the coordinator's parser -- not between the issue and the trigger.
26
+
27
+ ### What Makes This Hard
28
+
29
+ The transport gap: the github poller does not fetch body content. All routing-time signals must come from labels. Body content requires a separate API call. This means the schema has two separate concerns:
30
+ 1. Routing labels -- read by the coordinator at dispatch time, free from labels
31
+ 2. Enrichment fields -- read by workflows at runtime, require a body fetch
32
+
33
+ A junior developer would treat these as one schema and put everything in labels or everything in the body. The right design separates them explicitly with different consumer contracts.
34
+
35
+ ---
36
+
37
+ ## Philosophy Constraints
38
+
39
+ From CLAUDE.md and observed repo patterns:
40
+ - **Exhaustiveness everywhere** -- maturity and type values must be closed enums; coordinator switch must be exhaustive
41
+ - **Make illegal states unrepresentable** -- validate maturity/type at the coordinator boundary; invalid values are routing errors, not silently ignored
42
+ - **YAGNI with discipline** -- only include fields the routing decision actually needs; no speculative coordinator fields
43
+ - **Validate at boundaries, trust inside** -- coordinator parses GitHub labels into typed values at entry; internal routing logic works with typed domain values
44
+ - **Immutability by default** -- routing table is a pure function of maturity+type; no mutable routing state
45
+
46
+ No conflicts between stated philosophy and observed patterns.
47
+
48
+ ---
49
+
50
+ ## Impact Surface
51
+
52
+ If the schema changes after coordinator implementation:
53
+ - `src/coordinators/pr-review.ts` pattern: the new issue coordinator will follow the same `CoordinatorDeps` injection pattern
54
+ - `src/trigger/adapters/github-poller.ts`: `GitHubIssue.labels` field is the only routing input at dispatch time; schema change must not require body fetch at routing time
55
+ - `src/trigger/types.ts` `GitHubPollingSource.labelFilter`: the label filter used in `triggers.yml` must match whatever label the coordinator uses to identify queue items
56
+ - `triggers.yml`: will need a new entry for the worktrain issue coordinator workflow
57
+
58
+ Nearby contracts that must stay consistent:
59
+ - The `worktrain` label is the queue membership signal -- must not be confused with lifecycle labels (`worktrain:in-progress`, `worktrain:done`)
60
+ - `maturity` label values must match the `taskMaturity` values in backlog.md (idea / rough / specced / ready) -- the coordinator bridges these two vocabularies
61
+
62
+ ---
63
+
64
+ ## Candidates
65
+
66
+ ### Candidate A: Labels-only, maturity + type (minimum viable routing)
67
+
68
+ **Summary:** Three required labels (`worktrain` + `worktrain:maturity:<value>` + `worktrain:type:<value>`); all routing signals live in labels; body is free-form prose only; no body parsing at any point.
69
+
70
+ **Required labels:**
71
+ - `worktrain` -- marks issue as in the queue
72
+ - `worktrain:maturity:idea | rough | specced | ready` (closed enum)
73
+ - `worktrain:type:feature | bug | chore | refactor` (closed enum)
74
+
75
+ **Optional labels:**
76
+ - `worktrain:complexity:small | medium | large`
77
+ - `worktrain:auto-merge` (presence = true)
78
+
79
+ **Routing table (exhaustive):**
80
+ ```
81
+ maturity=idea | rough => wr.discovery -> wr.shaping -> coding-task-workflow-agentic
82
+ maturity=specced => wr.discovery -> coding-task-workflow-agentic
83
+ maturity=ready => coding-task-workflow-agentic (Phase 0.5 searches for upstream spec at runtime)
84
+ ```
85
+
86
+ Type refines within the coding workflow (bug => skip hypothesis, chore => skip design phases) but does not change pipeline selection.
87
+
88
+ **Tensions resolved:** T1 (labels are closed enums, coordinator validates at boundary), T2 (maturity+type removes ambiguity), T4 (schema carries no coordinator behavior)
89
+ **Tensions accepted:** T3 (no enrichment path for upstream_spec URL -- body is unstructured prose)
90
+
91
+ **Boundary solved at:** Labels only. Coordinator never touches body.
92
+
93
+ **Why this boundary:** Zero extra API calls at routing time. `github_issues_poll` already delivers labels in the GitHubIssue object. Coordinator can route synchronously from dispatch payload.
94
+
95
+ **Failure mode:** Human omits required label (no maturity or no type). Coordinator detects missing label, logs warning, skips issue or routes conservatively to full pipeline. Not a hard failure.
96
+
97
+ **Repo-pattern relationship:** Follows GitHubPollingSource.labelFilter + notLabels, discriminated union pattern from ReviewSeverity and PollingSource.
98
+
99
+ **Gains:** Minimum filing friction, zero extra API call, cleanest routing logic.
100
+ **Losses:** No standard location for upstream spec URL. Developers put it in prose body where it is machine-invisible unless Phase 0.5 finds it by search.
101
+
102
+ **Scope judgment:** Best-fit for MVP. Routing is correct and deterministic. Enrichment gap is real but manageable.
103
+
104
+ **Philosophy fit:** Honors YAGNI, validate-at-boundary, exhaustiveness, immutability. No conflicts.
105
+
106
+ ---
107
+
108
+ ### Candidate B: Hybrid -- labels for routing, structured body section for enrichment
109
+
110
+ **Summary:** Same three required labels as A for routing; an optional `## WorkTrain` key-value section in the issue body carries enrichment fields consumed lazily at runtime by workflows (not by the coordinator at routing time).
111
+
112
+ **Required labels (routing, same as A):**
113
+ - `worktrain`
114
+ - `worktrain:maturity:idea | rough | specced | ready`
115
+ - `worktrain:type:feature | bug | chore | refactor`
116
+
117
+ **Optional body section (enrichment, read at runtime by downstream workflows):**
118
+ ```markdown
119
+ ## WorkTrain
120
+ upstream_spec: https://docs.example.com/pitch-feature-x
121
+ affected_files: src/foo.ts src/bar.ts
122
+ ```
123
+
124
+ **Routing mechanism:** Identical to A -- coordinator uses labels only. Body section is not read at routing time. Downstream workflows (coding-task-workflow-agentic Phase 0.5, wr.shaping) call GitHub API to fetch body and parse the `## WorkTrain` section when they need enrichment.
125
+
126
+ **Tensions resolved:** T1, T2, T3 (two-layer schema: labels for routing, body for enrichment), T4
127
+ **Tensions accepted:** None materially -- the body section is optional and gracefully absent
128
+
129
+ **Boundary solved at:** Labels for routing contract, body for workflow enrichment contract. Two separate consumers, two separate read times.
130
+
131
+ **Why this boundary:** Keeps routing fast (no extra API call), provides a standard machine-readable location for upstream_spec, and stays within GitHub without external tools.
132
+
133
+ **Failure mode:** Malformed `## WorkTrain` section (typo in key name, missing value). Workflows fall back to Phase 0.5's format-agnostic search. Graceful degradation, not a routing failure.
134
+
135
+ **Repo-pattern relationship:** Labels routing follows existing pattern. Body section is a new convention -- no repo precedent, but consistent with the "validate at boundaries" principle (workflow parses at its own entry).
136
+
137
+ **Gains:** Routing as fast as A. Standard location for upstream_spec. Compatible with future grooming coordinator that sets enrichment fields automatically.
138
+
139
+ **Losses:** Small added ceremony (optional body section). Developers must know the section exists to use it. Slightly more complex documentation.
140
+
141
+ **Scope judgment:** Best-fit. The added body section has minimal cost and real benefit for the ~30% of issues that have upstream specs.
142
+
143
+ **Philosophy fit:** Honors YAGNI (section is optional), validate-at-boundary (each consumer validates its own inputs), exhaustiveness (routing is still label-driven). No conflicts.
144
+
145
+ ---
146
+
147
+ ### Candidate C: YAML frontmatter in body -- all metadata centralized
148
+
149
+ **Summary:** Issue body starts with a YAML frontmatter block containing all metadata including routing signals; labels used only for lifecycle state.
150
+
151
+ **Frontmatter (required fields: maturity, type):**
152
+ ```yaml
153
+ ---
154
+ maturity: specced
155
+ type: feature
156
+ complexity: medium
157
+ auto_merge: false
158
+ upstream_spec: https://...
159
+ affected_files:
160
+ - src/foo.ts
161
+ - src/bar.ts
162
+ ---
163
+ ```
164
+
165
+ **Lifecycle labels only:**
166
+ - `worktrain` (queue membership)
167
+ - `worktrain:in-progress` (being worked)
168
+ - `worktrain:done` (completed)
169
+
170
+ **Routing mechanism:** Coordinator calls `GET /repos/:owner/:repo/issues/:number` for every issue, parses YAML frontmatter, extracts maturity+type for pipeline selection.
171
+
172
+ **Tensions resolved:** T3 (single schema location), all metadata in one place
173
+ **Tensions accepted:** T2 (higher filing friction -- YAML format unfamiliar to most developers), routing-time API call adds latency
174
+
175
+ **Boundary solved at:** Body only.
176
+
177
+ **Failure mode (HIGH SEVERITY):** (1) Malformed YAML blocks routing entirely for that issue. (2) `labelFilter` cannot filter by maturity -- ALL worktrain issues are dispatched to the coordinator, even unready ones. (3) Mandatory body API call adds 1 HTTP round-trip per issue per poll cycle.
178
+
179
+ **Repo-pattern relationship:** Departs from existing labelFilter pattern. Does not follow the label-driven routing used by other trigger configurations.
180
+
181
+ **Gains:** Single source of truth for all metadata. No label taxonomy to maintain.
182
+ **Losses:** YAML is unusual in GitHub issues. Parse failures block routing. API call required always. `labelFilter` loses routing power.
183
+
184
+ **Scope judgment:** Too broad. Adds body-parsing requirement for ALL issues even when enrichment is not needed (bugs, chores).
185
+
186
+ **Philosophy fit:** Conflicts with YAGNI (always fetches body even for simple routing), validate-at-boundary (harder to validate YAML string than label enum), determinism (malformed YAML = routing failure).
187
+
188
+ ---
189
+
190
+ ### Candidate D: Three-trigger architecture (routing as trigger configuration)
191
+
192
+ **Summary:** Instead of a coordinator that routes, create three triggers with different `labelFilter` values -- one per pipeline path. Routing is implicit in which trigger fires.
193
+
194
+ **Proposed triggers:**
195
+ - `worktrain-coding` trigger: labelFilter=`worktrain:maturity:ready` => `coding-task-workflow-agentic`
196
+ - `worktrain-discovery` trigger: labelFilter=`worktrain:maturity:specced` => `wr.discovery`
197
+ - `worktrain-full` trigger: labelFilter=`worktrain:maturity:idea` => `wr.discovery` (with full-pipeline flag)
198
+
199
+ **CRITICAL DEFECT:** GitHub's Issues API `labels=` parameter matches issues with ANY of the listed labels (OR semantics), not ALL (AND semantics). A trigger with `labelFilter: 'worktrain:maturity:ready'` would only work if `worktrain:maturity:ready` is the ONLY label filter. Confirmed from `github-poller.ts`: the `labels=` parameter is passed directly to the GitHub API without client-side AND enforcement. This means a `worktrain:maturity:idea` issue would also be dispatched by the `worktrain-coding` trigger if it happens to be in the same poll batch, because the API returns all issues with ANY matching label.
200
+
201
+ **Note:** This defect is architectural, not fixable by schema changes. The three-trigger approach is included as a reframe candidate to illustrate what would be needed if GitHub supported AND label filters (it does not).
202
+
203
+ **Tensions resolved:** T4 (routing logic in config, not code)
204
+ **Tensions accepted:** T1, T2 -- broken by GitHub API semantics
205
+
206
+ **Scope judgment:** Broken. Included to surface the reframe, not as a viable candidate.
207
+
208
+ ---
209
+
210
+ ## Comparison and Recommendation
211
+
212
+ ### Matrix
213
+
214
+ | | A: Labels-only | B: Hybrid | C: Frontmatter | D: Three-trigger |
215
+ |---|---|---|---|---|
216
+ | Routing at dispatch time, no body fetch | Yes | Yes | No | Yes |
217
+ | Handles upstream_spec | No | Yes | Yes | No |
218
+ | Filing friction | Low | Low-medium | High | Low |
219
+ | Failure mode severity | Low | Low | High | Critical (broken) |
220
+ | Follows repo patterns | Yes | Yes+new | No | Broken |
221
+ | YAGNI | Best | Good | Poor | N/A |
222
+ | Schema stability | Best | Good | Poor | N/A |
223
+
224
+ ### Recommendation: Candidate B (Hybrid)
225
+
226
+ A is minimal and correct for routing. B is everything A is, plus a standard location for `upstream_spec`. The cost is an optional body section that ~70-80% of issues won't use. But the benefit -- a machine-readable upstream spec location for the ~20-30% that have one -- is real and immediate (Phase 0.5 of the coding workflow already consumes it).
227
+
228
+ The routing path in B is identical to A: label-only, no body fetch, fast. The body section is consumed lazily by downstream workflows, not by the routing coordinator. This keeps the two contracts separate and the routing cost identical to A.
229
+
230
+ ---
231
+
232
+ ## Self-Critique
233
+
234
+ **Strongest counter-argument against B:** Most issues (bugs, chores, rough ideas) will never have a `## WorkTrain` section. Defining a convention that 80% of issues ignore adds documentation weight without value to those issues. A is simpler to explain and equally correct.
235
+
236
+ **Narrower option (A) that lost:** A loses because `upstream_spec` is a first-class concept in the Three-Workflow Pipeline ADR and Phase 0.5 of the coding workflow. Omitting a standard location forces developers to put it in prose where it may or may not be found by the workflow's format-agnostic search. The cost of defining the optional section is documentation-only; the cost of omitting it is that Phase 0.5's search quality degrades on issues that have specs.
237
+
238
+ **Pivot condition:** If the coordinator is built and `upstream_spec` is consistently found by Phase 0.5's format-agnostic search regardless of body format (i.e. developers naturally put the URL in the issue title or first paragraph), then the `## WorkTrain` section can be removed from the schema. A becomes correct. This is an empirical question that can only be answered after some usage.
239
+
240
+ **Assumption that if wrong would invalidate B:** `upstream_spec` is optional enrichment, not a routing gate. If the coordinator needs to know whether an upstream spec exists BEFORE selecting a pipeline (e.g. `maturity=ready + no upstream spec => run wr.discovery first`), then the body must be fetched at routing time and B's routing-vs-enrichment split breaks. Current evidence (Phase 0.5 runs inside the coding workflow, after pipeline selection) suggests this is not the case.
241
+
242
+ ---
243
+
244
+ ## Open Questions for Main Agent
245
+
246
+ 1. Should `type` be required or optional? If all `ready` issues always go to coding-only regardless of type, then type is optional enrichment. If type affects which phases run inside the coding workflow (e.g. `bug` skips hypothesis), is that different enough to be optional?
247
+
248
+ 2. What is the lifecycle label model? When the coordinator picks up an issue, what label does it add (in-progress)? When the workflow completes, what label? Is `worktrain:done` added to the issue, or is the issue closed?
249
+
250
+ 3. Backlog promotion: does the coordinator automatically add `worktrain:maturity:ready` to a `worktrain:maturity:specced` issue after running `wr.discovery`? Or is promotion always manual?
251
+
252
+ 4. Should `complexity` be a routing signal or just workflow enrichment? If `complexity:small` maps to a QUICK rigor path inside the coding workflow, it affects workflow behavior but not pipeline selection -- which makes it optional enrichment, not a required routing label.