@exaudeus/workrail 3.42.0 → 3.44.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (36) hide show
  1. package/dist/console-ui/assets/{index-DwfWMKvv.js → index-Bi38ITiQ.js} +1 -1
  2. package/dist/console-ui/index.html +1 -1
  3. package/dist/daemon/workflow-runner.d.ts +15 -1
  4. package/dist/daemon/workflow-runner.js +86 -9
  5. package/dist/manifest.json +39 -23
  6. package/dist/trigger/adapters/github-queue-poller.d.ts +34 -0
  7. package/dist/trigger/adapters/github-queue-poller.js +200 -0
  8. package/dist/trigger/delivery-action.d.ts +2 -0
  9. package/dist/trigger/delivery-action.js +24 -0
  10. package/dist/trigger/github-queue-config.d.ts +18 -0
  11. package/dist/trigger/github-queue-config.js +155 -0
  12. package/dist/trigger/polling-scheduler.d.ts +1 -0
  13. package/dist/trigger/polling-scheduler.js +185 -6
  14. package/dist/trigger/trigger-router.js +24 -1
  15. package/dist/trigger/trigger-store.js +77 -2
  16. package/dist/trigger/types.d.ts +19 -0
  17. package/docs/design/adaptive-coordinator-context-candidates.md +265 -0
  18. package/docs/design/adaptive-coordinator-context-review.md +101 -0
  19. package/docs/design/adaptive-coordinator-context.md +504 -0
  20. package/docs/design/adaptive-coordinator-routing-candidates.md +340 -0
  21. package/docs/design/adaptive-coordinator-routing-design-review.md +135 -0
  22. package/docs/design/adaptive-coordinator-routing-review.md +156 -0
  23. package/docs/design/adaptive-coordinator-routing.md +660 -0
  24. package/docs/design/context-assembly-layer-design-review.md +110 -0
  25. package/docs/design/context-assembly-layer.md +622 -0
  26. package/docs/design/stuck-escalation-candidates.md +176 -0
  27. package/docs/design/stuck-escalation-design-review.md +70 -0
  28. package/docs/design/stuck-escalation.md +326 -0
  29. package/docs/design/worktrain-task-queue-candidates.md +252 -0
  30. package/docs/design/worktrain-task-queue-design-review.md +109 -0
  31. package/docs/design/worktrain-task-queue.md +443 -0
  32. package/docs/design/worktree-review-findings-candidates.md +101 -0
  33. package/docs/design/worktree-review-findings-design-review.md +65 -0
  34. package/docs/design/worktree-review-findings-implementation-plan.md +153 -0
  35. package/docs/ideas/backlog.md +148 -0
  36. package/package.json +3 -3
@@ -0,0 +1,176 @@
1
+ # Design Candidates: Stuck Escalation for Overnight-Autonomous WorkTrain Sessions
2
+
3
+ > Raw investigative material for main-agent review. Not a final decision.
4
+
5
+ ## Problem Understanding
6
+
7
+ ### Core Tensions
8
+
9
+ 1. **Early certainty vs. false positives.** Aborting at threshold 3 for `repeated_tool_call` saves up to 27 minutes of a 30-minute session. But a legitimate retry loop (transient network error, idempotent file read called 3x) also triggers at 3. Higher threshold = fewer false positives but less wall-clock savings; lower = more savings but more false positives.
10
+
11
+ 2. **Structural correctness vs. maintenance surface.** Adding `_tag: 'stuck'` to `WorkflowRunResult` is structurally clean (make illegal states unrepresentable) but widens the maintenance surface: every switch on the union must be updated. The naive alternative (add `reason: 'stuck_loop'` to `WorkflowRunTimeout`) avoids the new variant but conflates stuck abort with wall-clock timeout -- these have categorically different implications for retry logic.
12
+
13
+ 3. **Abort power vs. future recoverability.** `agent.abort()` is terminal. A `steer()` injection could warn the agent and let it self-correct. Aborting closes the door to LLM-driven self-recovery. For overnight-autonomous use, abort is deterministic and saves resources. For supervised use, steer-and-warn might be preferred.
14
+
15
+ 4. **Outbox write timing vs. fire-and-forget contract.** The outbox write must happen as close as possible to the abort moment. But `turn_end` is synchronous, and any blocking `await` would stall the abort path. Resolution: initiate outbox write as a detached fire-and-forget Promise in `turn_end`, same contract as `DaemonEventEmitter.emit()`.
16
+
17
+ ### Likely Seam
18
+
19
+ `turn_end` subscriber in `workflow-runner.ts`. Confirmed -- not just where the symptom appears, but where all relevant state exists (`turnCount`, `stepAdvanceCount`, `lastNToolCalls`, `timeoutReason`). The `max_turns` abort at lines 3088-3104 is the exact template: set closure variable, emit event, call `agent.abort()`, return.
20
+
21
+ ### What Makes This Hard
22
+
23
+ 1. `ChildWorkflowRunResult` type alias (line 396) -- must be updated alongside `WorkflowRunResult`. If missed, the cast at line 2014 silently hides the new variant from `makeSpawnAgentTool`'s switch, producing a runtime `assertNever` error in production.
24
+
25
+ 2. The fire-and-forget contract -- any `await` in the `turn_end` subscriber blocks the abort path. The outbox write must be a detached Promise.
26
+
27
+ 3. Double-emit of `timeout_imminent` -- the max_turns path emits it AND the `timeoutReason !== null` check at line 3157 would also emit it. The design must not add a third abort here.
28
+
29
+ 4. `maybeRunDelivery` gate in TriggerRouter -- must exclude `stuck` results from autoCommit delivery (there is no successful output to commit).
30
+
31
+ ## Philosophy Constraints
32
+
33
+ From `CLAUDE.md` and codebase patterns:
34
+
35
+ - **Make illegal states unrepresentable** -- stuck and timeout are categorically different; a new discriminant is required.
36
+ - **Exhaustiveness everywhere** -- all `assertNever` guards must be updated when the union grows.
37
+ - **Errors are data** -- `WorkflowRunStuck` as a result value, not an exception.
38
+ - **Fire-and-forget observability** -- `DaemonEventEmitter.emit()` and `NotificationService.notify()` both return void and swallow errors. Outbox write must follow this contract.
39
+ - **YAGNI with discipline** -- do not add `issue_reported severity=fatal` abort without production evidence.
40
+ - **Pure functions for message building** -- `buildNotificationBody`, `buildOutcome`, `buildDetail` are pure switch-dispatch functions; new cases extend them cleanly.
41
+ - **WHY comments** -- every non-obvious decision must have an inline rationale comment.
42
+
43
+ No conflicts between stated philosophy and repo patterns.
44
+
45
+ ## Impact Surface
46
+
47
+ Changes required beyond the immediate task if `WorkflowRunResult` is widened:
48
+
49
+ | Location | File | Required Change |
50
+ |---|---|---|
51
+ | `WorkflowRunResult` type alias | `workflow-runner.ts` | Add `WorkflowRunStuck` variant |
52
+ | `ChildWorkflowRunResult` type alias | `workflow-runner.ts` | Add `WorkflowRunStuck` variant |
53
+ | `makeSpawnAgentTool` switch | `workflow-runner.ts` | Add `stuck` case (assertNever guard) |
54
+ | `turn_end` subscriber | `workflow-runner.ts` | Add abort logic for `repeated_tool_call` and `no_progress` |
55
+ | `runWorkflow()` catch block | `workflow-runner.ts` | Add branch: if `stuckContext !== null` return `WorkflowRunStuck` |
56
+ | `TriggerRouter.route()` | `trigger-router.ts` | Add `stuck` log branch before `assertNever` |
57
+ | `TriggerRouter.dispatch()` | `trigger-router.ts` | Add `stuck` log branch before `assertNever` |
58
+ | `maybeRunDelivery` gate | `trigger-router.ts` | Exclude `stuck` from delivery |
59
+ | `buildNotificationBody` | `notification-service.ts` | Add `stuck` case |
60
+ | `buildOutcome` | `notification-service.ts` | Add `stuck` to return type |
61
+ | `buildDetail` | `notification-service.ts` | Add `stuck` case |
62
+ | `NotificationPayload.outcome` | `notification-service.ts` | Add `'stuck'` to union |
63
+ | `TriggerDefinition.agentConfig` | `types.ts` | Add `stuckAbortPolicy?: 'abort' | 'notify_only'` |
64
+
65
+ ## Candidates
66
+
67
+ ### Candidate A: Minimal -- Abort with existing result types
68
+
69
+ **Summary:** Wire `agent.abort()` after `repeated_tool_call` emit, add `reason: 'stuck_loop'` to `WorkflowRunTimeout` (new string value), skip outbox write and notification extension.
70
+
71
+ **Tensions resolved:** Maintenance surface (2 files changed).
72
+
73
+ **Tensions accepted:** Structural correctness (conflates stuck with wall-clock timeout), coordinator readiness (no toolName/argsSummary in result), notification distinctness (NotificationService cannot distinguish stuck from timeout).
74
+
75
+ **Boundary solved at:** `turn_end` subscriber + `WorkflowRunTimeout` extension. This is a symptom-level fix -- it stops the waste but provides no diagnostic value.
76
+
77
+ **Failure mode:** A coordinator script reading `result._tag === 'timeout'` has no way to distinguish stuck abort from wall-clock timeout without parsing `result.reason`. This is string parsing -- exactly what discriminated unions are designed to prevent.
78
+
79
+ **Repo pattern relationship:** Adapts max_turns abort template. Does NOT follow the `WorkflowRunTimeout` vs. `WorkflowRunError` precedent of 'one variant per categorically distinct outcome.'
80
+
81
+ **Gains:** 2 files changed. Minimal assertNever surface.
82
+
83
+ **Gives up:** Semantic precision (stuck != timeout), coordinator readiness, notification distinctness.
84
+
85
+ **Scope judgment:** Too narrow. Violates 'make illegal states unrepresentable.'
86
+
87
+ **Philosophy:** Honors YAGNI. Conflicts with 'make illegal states unrepresentable', 'exhaustiveness everywhere', 'errors are data.'
88
+
89
+ ---
90
+
91
+ ### Candidate B: Full -- New `WorkflowRunStuck` variant + outbox + notification (RECOMMENDED)
92
+
93
+ **Summary:** Add `WorkflowRunStuck` to `WorkflowRunResult` and `ChildWorkflowRunResult`; abort on `repeated_tool_call` and `no_progress`; write a 10-field outbox entry as a fire-and-forget Promise; extend `NotificationService` with a `stuck` case; add `stuckAbortPolicy: 'abort' | 'notify_only'` to `WorkflowTrigger.agentConfig`.
94
+
95
+ **Tensions resolved:** All four. Structural correctness (new variant), coordinator readiness (toolName/argsSummary/turnCount/stepAdvanceCount on the result), notification distinctness (new message body), policy expressiveness (stuckAbortPolicy with abort default).
96
+
97
+ **Tensions accepted:** Maintenance surface (5 files, all switch statements widened), policy granularity (no per-signal policy within a trigger -- only per-trigger).
98
+
99
+ **Boundary solved at:**
100
+ - `turn_end` subscriber: abort + fire-and-forget outbox write
101
+ - `runWorkflow()` catch block: construct `WorkflowRunStuck` from `stuckContext` closure variable
102
+ - `TriggerRouter.route()`/`dispatch()`: log and skip delivery
103
+ - `NotificationService.notify()`: new message body
104
+
105
+ **Failure mode:** The outbox write initiates in `turn_end` as a detached Promise but the `WorkflowRunStuck` result is returned synchronously from the catch block. The outbox write may complete AFTER the result reaches TriggerRouter. Acceptable (outbox is diagnostic, not delivery) but must be documented.
106
+
107
+ **Repo pattern relationship:** Follows the `WorkflowRunTimeout` precedent exactly. Uses `DaemonEventEmitter` fire-and-forget pattern for outbox write. Uses max_turns abort template. Extends `NotificationService` pure-function pattern.
108
+
109
+ **Gains:** Full semantic precision, coordinator-ready structured data, notification distinctness, type-safe exhaustiveness.
110
+
111
+ **Gives up:** 5-file maintenance surface, `ChildWorkflowRunResult` and `makeSpawnAgentTool` must be updated.
112
+
113
+ **Scope judgment:** Best-fit. Directly addresses all 4 decision criteria. No speculative abstractions.
114
+
115
+ **Philosophy:** Honors all core principles. Minor YAGNI pressure vs. Candidate A, but the added files are necessary consequences of correct union design.
116
+
117
+ ---
118
+
119
+ ### Candidate C: Extended -- Candidate B + `issue_reported severity=fatal` abort trigger
120
+
121
+ **Summary:** All of Candidate B, plus: abort when the `onIssueSummary` callback receives `severity: 'fatal'`, implemented via a `fatalIssueAbortPending` closure flag set by the callback and checked/cleared in `turn_end`. Adds `stuckReason: 'fatal_issue_report'` to `WorkflowRunStuck` and an optional `issueSummary` field.
122
+
123
+ **Tensions resolved:** All of Candidate B's tensions, plus the primary framing risk (agent self-report is more reliable than heuristics per session ea2de6e5).
124
+
125
+ **Tensions accepted:** Higher implementation complexity, one-turn abort latency for fatal issues (the flag is checked in `turn_end`, not inline in the callback).
126
+
127
+ **Boundary solved at:** All of Candidate B's boundaries, plus `onIssueSummary` callback wiring.
128
+
129
+ **Failure mode:** One-turn latency -- the abort fires on the turn AFTER `report_issue` calls, not immediately. For a fatal issue, one extra LLM turn is acceptable but must be documented.
130
+
131
+ **Repo pattern relationship:** Extends Candidate B. Adapts the existing `onIssueSummary` callback infrastructure.
132
+
133
+ **Gains:** Catches the most reliable real-world stuck signal. Directly addresses primary framing risk.
134
+
135
+ **Gives up:** Higher initial complexity. `stuckReason` union grows to 3 values. Requires production evidence to justify over Candidate B.
136
+
137
+ **Scope judgment:** Slightly broad for the initial design. The primary use case (blind tool loop) is covered by Candidate B. Candidate C is the correct Phase 2 extension.
138
+
139
+ **Philosophy:** Fully honors all principles. Marginal YAGNI pressure vs. Candidate B -- grounded in real log evidence (session ea2de6e5) but requires more than one data point to justify the added complexity upfront.
140
+
141
+ ## Comparison and Recommendation
142
+
143
+ | Criterion | A | B | C |
144
+ |---|---|---|---|
145
+ | Structural correctness | FAIL | PASS | PASS |
146
+ | Coordinator readiness | FAIL | PASS | PASS |
147
+ | Notification distinctness | FAIL | PASS | PASS |
148
+ | Policy expressiveness | PARTIAL | PASS | PASS |
149
+ | Maintenance surface | Best | Medium | Highest |
150
+ | Covers primary framing risk | No | No | Yes |
151
+ | YAGNI compliance | Best | Good | Marginal |
152
+ | Reversibility | Hard | Easy | Easy |
153
+
154
+ **Recommendation: Candidate B.**
155
+
156
+ Reasoning: Structural correctness is non-negotiable (CLAUDE.md: 'make illegal states unrepresentable'). Candidate A fails this criterion regardless of its maintenance advantage. Candidate C is architecturally correct but the `issue_reported severity=fatal` trigger requires production evidence beyond session ea2de6e5. The 5-file surface of Candidate B is manageable because all changes are additive switch-case additions, and TypeScript exhaustiveness enforcement catches any missed location at compile time.
157
+
158
+ ## Self-Critique
159
+
160
+ **Strongest counter-argument against Candidate B:** The `repeated_tool_call` heuristic may have an unacceptable false-positive rate in production. If it aborts legitimate sessions frequently, operators will set `stuckAbortPolicy: 'notify_only'` everywhere, negating the feature. A minimal Candidate A approach would have caused less collateral damage in this scenario.
161
+
162
+ **Response:** The `stuckAbortPolicy: 'notify_only'` escape hatch directly addresses this. The structural correctness argument still stands -- conflating stuck with timeout is a design debt that compounds over time.
163
+
164
+ **Pivot to Candidate A:** If there is a hard constraint against widening `WorkflowRunResult` (e.g., a serialization layer or cross-process protocol that can't handle new variants). No such constraint exists.
165
+
166
+ **Pivot to Candidate C:** If production logs show `repeated_tool_call` false-positive rate exceeds 20% while `issue_reported severity=fatal` false-positive rate is under 5%.
167
+
168
+ ## Open Questions for the Main Agent
169
+
170
+ 1. **Verify abort propagation:** Does `agent.abort()` called in `turn_end` correctly propagate to the `runWorkflow()` catch block without clearing closure state? The `stuckContext` variable must be readable in the catch block after abort. Confirm by reading `AgentLoop.abort()` implementation.
171
+
172
+ 2. **`sessionStartMs` presence:** Is `const sessionStartMs = Date.now()` already set before the agent loop in `runWorkflow()`? If not, it must be added to support `elapsedMs` in the outbox entry.
173
+
174
+ 3. **`no_progress` false-positive rate:** The 80%-turns threshold fires even on a legitimate research session. Should the initial design wire `no_progress` abort, or start with only `repeated_tool_call` abort and add `no_progress` in a follow-on?
175
+
176
+ 4. **`stuckAbortPolicy` placement:** Should the policy live in `WorkflowTrigger.agentConfig` (as proposed), or in a separate top-level `TriggerDefinition.stuckPolicy` field to distinguish session-behavior knobs from routing knobs?
@@ -0,0 +1,70 @@
1
+ # Design Review: Stuck Escalation for Overnight-Autonomous WorkTrain Sessions
2
+
3
+ > Findings from adversarial review of the selected direction (Candidate B, adjusted).
4
+
5
+ ## Tradeoff Review
6
+
7
+ | Tradeoff | Verdict | Condition That Invalidates It |
8
+ |---|---|---|
9
+ | 5-file maintenance surface for type safety | ACCEPTABLE | Union grows to 20+ consumers without a code-gen layer |
10
+ | `sessionStartMs` must be added | TRIVIALLY ACCEPTABLE | None |
11
+ | `no_progress` abort gated by `noProgressAbortEnabled` (default false) | ACCEPTABLE | Production evidence shows `no_progress` is the dominant failure mode |
12
+
13
+ ## Failure Mode Review
14
+
15
+ | Failure Mode | Design Handling | Missing Mitigation | Risk |
16
+ |---|---|---|---|
17
+ | `ChildWorkflowRunResult` not updated | Called out in design doc as required update | Add explicit WARNING comment at cast site (line 2014) | HIGH -- silent compile-time pass, runtime crash |
18
+ | `await` in `turn_end` subscriber blocks abort | Fire-and-forget pattern specified explicitly | Add WHY comment at the outbox write call | MEDIUM -- junior devs may add await by analogy |
19
+ | `maybeRunDelivery` not updated | Existing gate `if (result._tag !== 'success') return` already handles it | None needed | LOW -- already handled |
20
+
21
+ ## Runner-Up / Simpler Alternative Review
22
+
23
+ **Candidate C element borrowed:** `issueSummaries?: readonly string[]` added to `WorkflowRunStuck`. The `issueSummaries` array is already tracked in session closures (zero new collection code). Adds coordinator-readiness at near-zero cost.
24
+
25
+ **Simplest alternative:** Abort on `repeated_tool_call` only, no `no_progress` gate, no `issueSummaries`. Satisfies all 6 acceptance criteria. Excluded because `issueSummaries` and `noProgressAbortEnabled` add meaningful value at minimal cost.
26
+
27
+ **Hybrid result:** Candidate B adjusted = Candidate B + `issueSummaries` field (borrowed from C) + `noProgressAbortEnabled: boolean` gate (default false) for `no_progress` abort.
28
+
29
+ ## Philosophy Alignment
30
+
31
+ | Principle | Alignment |
32
+ |---|---|
33
+ | Make illegal states unrepresentable | FULL -- new discriminant, not reused timeout |
34
+ | Exhaustiveness everywhere | FULL -- all assertNever guards updated |
35
+ | Errors are data | FULL -- result value, not exception |
36
+ | Immutability by default | FULL -- all new fields readonly |
37
+ | Fire-and-forget observability | FULL -- outbox write is void+detached+swallowed |
38
+ | YAGNI | TENSION -- 5-file surface; resolved in favor of structural correctness |
39
+ | Determinism over cleverness | TENSION -- no_progress is a heuristic; resolved by gating it behind explicit opt-in |
40
+
41
+ ## Findings
42
+
43
+ ### RED (Blocking)
44
+ None. No design-correctness violations found.
45
+
46
+ ### ORANGE (High Risk)
47
+ **Finding O1: `ChildWorkflowRunResult` update is easy to miss.**\nThe coding agent must update `ChildWorkflowRunResult` alongside `WorkflowRunResult`. If missed, a `_tag: 'stuck'` result from a child session spawned via `makeSpawnAgentTool` will reach `assertNever` at runtime and crash. The design doc mentions it but the failure mode is severe enough to warrant a WARNING comment in the code at the cast site (line 2014 in workflow-runner.ts).\n\n**Recommended action:** Add to design doc: 'CRITICAL: Update ChildWorkflowRunResult on the SAME commit as WorkflowRunResult. Do not split across commits.'
48
+
49
+ ### YELLOW (Medium Risk)
50
+ **Finding Y1: `no_progress` false-positive rate is unvalidated.**\nThe 80%-turns threshold with 0 step advances can fire on legitimate deep-research sessions. The `noProgressAbortEnabled: false` default mitigates this but means the feature is effectively inactive until explicitly enabled. Users who expect `no_progress` to work out-of-the-box will be surprised.\n\n**Recommended action:** Document the default explicitly in the trigger YAML schema comment and in the CLI help text.
51
+
52
+ **Finding Y2: Fire-and-forget outbox write timing.**\nThe outbox write initiates in `turn_end` as a detached Promise but the `WorkflowRunStuck` result reaches TriggerRouter before the write completes. If the process exits immediately after TriggerRouter logs the result (e.g. during a rapid daemon shutdown), the outbox write may be lost.\n\n**Recommended action:** Accept this risk -- it is the same risk accepted by `DaemonEventEmitter`. Document it in the code with a WHY comment.
53
+
54
+ ## Recommended Revisions to Design Doc
55
+
56
+ 1. Add a **CRITICAL** callout in the '5-File Change Estimate' section: 'ChildWorkflowRunResult must be updated in the same commit as WorkflowRunResult. A cast at line 2014 allows a stuck result to bypass the makeSpawnAgentTool switch's assertNever if ChildWorkflowRunResult is not updated.'
57
+
58
+ 2. Add `issueSummaries?: readonly string[]` to the `WorkflowRunStuck` interface definition. Update the outbox entry schema to include `issueSummaries`.
59
+
60
+ 3. Add `noProgressAbortEnabled?: boolean` (default: false) to `WorkflowTrigger.agentConfig` as a separate field from `stuckAbortPolicy`. The abort policy controls whether to abort or notify; this flag controls whether `no_progress` is an active trigger at all.
61
+
62
+ 4. Update the '5-File Change Estimate' table to show 5 files clearly (the confusion is that workflow-runner.ts has multiple edit locations).
63
+
64
+ ## Residual Concerns
65
+
66
+ 1. **`repeated_tool_call` false-positive rate**: Not empirically validated. The `stuckAbortPolicy: 'notify_only'` escape hatch is the mitigation. If the false-positive rate is high in production, the recommended path is: set `stuckAbortPolicy: 'notify_only'` by default and make `'abort'` opt-in, reversing the current default.
67
+
68
+ 2. **Coordinator consumption of outbox.jsonl**: `outbox.jsonl` has no automated consumer today. The stuck entry will persist in the file until a human or coordinator reads it. This is a 'build it now, connect it later' tradeoff -- acceptable for the initial design.
69
+
70
+ 3. **No shadow-mode validation**: Ideally, the heuristics would run in shadow mode (emit but never abort) for 20+ production sessions before enabling abort. This design does not include shadow mode. The `stuckAbortPolicy: 'notify_only'` setting can serve as a manual shadow mode.
@@ -0,0 +1,326 @@
1
+ # Design: Stuck Escalation for Overnight-Autonomous WorkTrain Sessions
2
+
3
+ ## Context / Ask
4
+
5
+ **Stated goal:** Design automatic escalation when WorkTrain sessions get stuck, so overnight-autonomous runs don't burn their full 30-minute wall clock on a broken tool call.
6
+
7
+ **This goal is a solution statement.** The underlying problem is:
8
+
9
+ > Overnight-autonomous sessions have no early-exit path when the agent enters a stuck loop. The system must detect the stuck state, abort cleanly, and produce structured diagnostic output without requiring a human to check the logs after the full wall-clock budget is consumed.
10
+
11
+ **Scope:** `src/daemon/workflow-runner.ts` and `src/trigger/trigger-router.ts` only. Do NOT touch `src/mcp/`.
12
+
13
+ **Prior art:** `docs/design/daemon-stuck-detection-discovery.md` -- discovery that defined the three stuck heuristics, added `AgentStuckEvent` to `daemon-events.ts`, and wired `repeated_tool_call`, `no_progress`, and `timeout_imminent` into the `turn_end` subscriber. Those signals exist and fire today. They are advisory-only -- no abort, no outbox, no notification.
14
+
15
+ ## Path Recommendation
16
+
17
+ **`full_spectrum`** -- both landscape grounding (where to hook abort/outbox in the existing code) and concept shaping (abort policy, result variant design) are real risks. The proposed approach is well-matched to the problem, but the integration point and the result type design need careful analysis.
18
+
19
+ ## Constraints / Anti-goals
20
+
21
+ **Constraints:**
22
+ - Abort only within `src/daemon/` and `src/trigger/`. No changes to `src/mcp/`.
23
+ - `WorkflowRunResult` is a discriminated union consumed by `assertNever` in `TriggerRouter.route()` and `dispatch()` -- any new variant must be handled exhaustively in both callers.
24
+ - Outbox write must be best-effort (non-fatal) -- same contract as `DaemonEventEmitter.emit()`.
25
+ - Notifications are fire-and-forget -- same contract as `NotificationService.notify()`.
26
+ - `stuckAbortPolicy` must be opt-in, defaulting to `'abort'` for new triggers and to `'notify_only'` for any existing trigger that doesn't specify it (to avoid breaking existing behavior).
27
+
28
+ **Anti-goals:**
29
+ - Do not build a watchdog daemon or restart mechanism.
30
+ - Do not implement coordinator retry logic in this design (that is a future fix-coordinator concern).
31
+ - Do not change the `turn_end` subscriber's detection thresholds -- this design wires actions to existing signals, not new signals.
32
+ - Do not make `outbox.jsonl` the sole escalation path -- it has no automated consumer.
33
+
34
+ ## Challenged Assumptions
35
+
36
+ 1. **Assumption: The three heuristics reliably distinguish stuck from legitimately slow.**
37
+ - Risk: `make all` called 3x triggers `repeated_tool_call` but is not stuck. `no_progress` at 80% turns fires during valid deep research.
38
+ - Mitigation: `stuckAbortPolicy: 'notify_only'` available per trigger. For `repeated_tool_call`, the abort is justified because the exact same `argsSummary` (200-char JSON) repeating 3x is a stronger signal than mere tool-name repetition.
39
+
40
+ 2. **Assumption: Aborting on first heuristic trigger is better than one warning turn.**
41
+ - Risk: a transient 500 error resolves on the next turn; abort kills a recovering session.
42
+ - Mitigation: allow `stuckAbortThreshold` (future extension) to increase the repeat count before abort. For the initial design, threshold stays at 3 (existing `STUCK_REPEAT_THRESHOLD`).
43
+
44
+ 3. **Assumption: `outbox.jsonl` is the right escalation target.**
45
+ - Reality: `outbox.jsonl` has no automated consumer -- only `worktrain-inbox` (manual CLI) and `pr-review.ts` `drainMessageQueue` read it. The primary actionable signal is the macOS/webhook notification.
46
+ - Resolution: write to both outbox (for coordinator scripts) AND fire notification (for human overnight use). Neither is the sole path.
47
+
48
+ ## Landscape Packet
49
+
50
+ ### Integration point: where abort already happens
51
+
52
+ In `workflow-runner.ts` the `turn_end` subscriber already has two abort paths:
53
+
54
+ ```
55
+ // max_turns path (line 3088-3104)
56
+ if (maxTurns > 0 && turnCount >= maxTurns && timeoutReason === null) {
57
+ timeoutReason = 'max_turns';
58
+ emitter?.emit({ kind: 'agent_stuck', reason: 'timeout_imminent', ... });
59
+ agent.abort();
60
+ return;
61
+ }
62
+
63
+ // stuck detection heuristics (lines 3106-3165)
64
+ // -- repeated_tool_call: emit only, no abort
65
+ // -- no_progress: emit only, no abort
66
+ // -- timeout_imminent: emit only (abort already fired in setTimeout callback)
67
+ ```
68
+
69
+ The stuck abort must be inserted immediately after each heuristic `emitter?.emit()` call, guarded by `stuckAbortPolicy`.
70
+
71
+ ### WorkflowRunResult union (current)
72
+
73
+ ```typescript
74
+ type WorkflowRunResult =
75
+ | WorkflowRunSuccess // _tag: 'success'
76
+ | WorkflowRunError // _tag: 'error'
77
+ | WorkflowRunTimeout // _tag: 'timeout'
78
+ | WorkflowDeliveryFailed // _tag: 'delivery_failed'
79
+ ```
80
+
81
+ `TriggerRouter.route()` and `dispatch()` both have `assertNever(result)` guards. Adding a new `_tag: 'stuck'` variant requires handling in both.
82
+
83
+ ### outbox.jsonl write pattern (from pr-review.ts)
84
+
85
+ ```typescript
86
+ const outboxPath = path.join(os.homedir(), '.workrail', 'outbox.jsonl');
87
+ await fs.mkdir(workrailDir, { recursive: true });
88
+ await fs.appendFile(outboxPath, JSON.stringify(entry) + '\n', 'utf8');
89
+ ```
90
+
91
+ No shared utility -- each writer does it directly. The stuck-escalation writer should follow the same pattern, injected via `WorkflowTrigger` deps or called as a module-level helper.
92
+
93
+ ### NotificationService (notification-service.ts)
94
+
95
+ `notify(result: WorkflowRunResult, goal: string)` -- already dispatches on `result._tag`. Adding a `stuck` variant requires a new case in `buildNotificationBody()`, `buildOutcome()`, and `buildDetail()`.
96
+
97
+ ### agentConfig in WorkflowTrigger (workflow-runner.ts line 214)
98
+
99
+ ```typescript
100
+ readonly agentConfig?: {
101
+ readonly model?: string;
102
+ readonly maxSessionMinutes?: number;
103
+ readonly maxTurns?: number;
104
+ // NEW:
105
+ readonly stuckAbortPolicy?: 'abort' | 'notify_only';
106
+ readonly noProgressAbortEnabled?: boolean;
107
+ };
108
+ ```
109
+
110
+ This is the correct location for both new fields. They are session-behavior knobs, same as `maxSessionMinutes` and `maxTurns` -- not trigger-routing knobs.
111
+
112
+ ## Problem Frame Packet
113
+
114
+ The three stuck heuristics (`repeated_tool_call`, `no_progress`, `timeout_imminent`) fire today as advisory events. There is no action. For overnight-autonomous use, the invariant must change:
115
+
116
+ - `repeated_tool_call` is a **hard abort signal** -- same tool+args 3x is definitionally broken.
117
+ - `no_progress` at 80% turns is a **soft abort signal** -- may have false positives; the policy controls whether to abort or just notify.
118
+ - `timeout_imminent` already has an abort (the wall-clock timer fires `agent.abort()` independently) -- no new abort needed here, but a stuck-escalation outbox entry and notification should fire.
119
+
120
+ After abort, `runWorkflow()` must return a new `WorkflowRunResult` variant that carries enough structured data for a human and a future fix-coordinator to understand the failure without reading logs.
121
+
122
+ ## Recommended Design
123
+
124
+ ### 1. Abort Policy
125
+
126
+ | Signal | Default action | With `stuckAbortPolicy: 'notify_only'` | Gating flag |
127
+ |---|---|---|---|
128
+ | `repeated_tool_call` | Abort immediately | Emit event only, no abort | Always active |
129
+ | `no_progress` | Emit + notify only | Emit + notify only | `noProgressAbortEnabled: true` required to abort |
130
+ | `timeout_imminent` | No new abort (already aborting) | Same | N/A |
131
+
132
+ **`stuckAbortPolicy: 'abort' | 'notify_only'`** (default: `'abort'`): controls whether a stuck signal triggers an abort or only an event emission and notification. Per-trigger, lives in `agentConfig`.
133
+
134
+ **`noProgressAbortEnabled: boolean`** (default: `false`): separately gates whether the `no_progress` heuristic (80% turns, 0 step advances) can trigger an abort. When false, `no_progress` only emits the `agent_stuck` event and fires a notification -- it never aborts. This flag exists because `no_progress` has a meaningful false-positive rate on legitimate deep-research sessions (e.g. a wr.discovery run spending 50 turns before its first step advance).
135
+
136
+ **Rationale for the default `noProgressAbortEnabled: false`:** The primary overnight-autonomous failure mode is `repeated_tool_call` (the same failing command called 15x). The `no_progress` heuristic is a secondary signal that benefits from explicit opt-in after the false-positive rate is observed in production.
137
+
138
+ ### 2. New `WorkflowRunResult` Variant
139
+
140
+ ```typescript
141
+ /** Workflow aborted by stuck detection before the wall-clock timeout fired. */
142
+ export interface WorkflowRunStuck {
143
+ readonly _tag: 'stuck';
144
+ readonly workflowId: string;
145
+ /**
146
+ * Which heuristic triggered the abort.
147
+ * Matches AgentStuckEvent.reason to allow correlation with daemon event log.
148
+ */
149
+ readonly stuckReason: 'repeated_tool_call' | 'no_progress';
150
+ /** Human-readable description of why stuck was detected. */
151
+ readonly detail: string;
152
+ /** The tool name that was called repeatedly (present for repeated_tool_call). */
153
+ readonly toolName?: string;
154
+ /** The argsSummary of the repeated call (present for repeated_tool_call). */
155
+ readonly argsSummary?: string;
156
+ /** Total LLM turns consumed at the time of abort. */
157
+ readonly turnCount: number;
158
+ /** Number of workflow step advances at the time of abort. */
159
+ readonly stepAdvanceCount: number;
160
+ /**
161
+ * Wall-clock milliseconds elapsed from session start to abort.
162
+ * Lets a coordinator compute wall-clock savings vs. a full timeout.
163
+ * Requires `sessionStartMs = Date.now()` to be added before the agent loop.
164
+ */
165
+ readonly elapsedMs: number;
166
+ /**
167
+ * Summaries of all issue_reported calls during this session (if any).
168
+ * Populated from the `issueSummaries` ring tracked by the onIssueSummary callback.
169
+ * Provides additional context for a fix-coordinator without requiring log parsing.
170
+ */
171
+ readonly issueSummaries?: readonly string[];
172
+ }
173
+
174
+ // Updated union:
175
+ export type WorkflowRunResult =
176
+ | WorkflowRunSuccess
177
+ | WorkflowRunError
178
+ | WorkflowRunTimeout
179
+ | WorkflowDeliveryFailed
180
+ | WorkflowRunStuck; // NEW
181
+ ```
182
+
183
+ **Why a new variant instead of reusing `WorkflowRunError` or `WorkflowRunTimeout`:**
184
+ - `WorkflowRunError` means a tool or engine error, not a stuck loop. Conflating them forces consumers to parse message strings.
185
+ - `WorkflowRunTimeout` means the wall-clock fired -- stuck abort happens _before_ the wall clock fires.
186
+ - A distinct `_tag: 'stuck'` lets `TriggerRouter`, `NotificationService`, and future coordinators distinguish the case at compile time with `assertNever` exhaustiveness.
187
+
188
+ **`ChildWorkflowRunResult` must also be updated** to include `WorkflowRunStuck` since `runWorkflow()` now produces it directly (not via `TriggerRouter`).
189
+
190
+ ### 3. Outbox Entry Schema
191
+
192
+ Written to `~/.workrail/outbox.jsonl` as a single JSONL line immediately after abort:
193
+
194
+ ```json
195
+ {
196
+ "id": "<uuid-v4>",
197
+ "kind": "stuck_session",
198
+ "sessionId": "<local-uuid>",
199
+ "workrailSessionId": "<sess_...>",
200
+ "workflowId": "<workflow-id>",
201
+ "stuckReason": "repeated_tool_call",
202
+ "detail": "Same tool+args called 3 times: Bash",
203
+ "toolName": "Bash",
204
+ "argsSummary": "{\"command\":\"npm test\"}",
205
+ "turnCount": 12,
206
+ "stepAdvanceCount": 0,
207
+ "elapsedMs": 94000,
208
+ "issueSummaries": ["Tool call failed: ENOENT /path/to/file"],
209
+ "timestamp": "2026-04-19T03:14:15.926Z"
210
+ }
211
+ ```
212
+
213
+ **Fields needed by a future fix-coordinator:**
214
+ - `stuckReason` + `toolName` + `argsSummary` -- to classify the failure and decide whether to retry
215
+ - `turnCount` + `stepAdvanceCount` -- to understand how much progress was lost
216
+ - `workrailSessionId` -- to correlate with WorkRail session store for checkpoint resumption
217
+ - `elapsedMs` -- to measure wall-clock savings vs. a full timeout
218
+
219
+ ### 4. Integration Point
220
+
221
+ **Location: inside the `turn_end` subscriber in `runWorkflow()`, immediately after each stuck heuristic `emitter?.emit()` call.**
222
+
223
+ Rationale:
224
+ - This is where `agent.abort()` already lives for `max_turns`.
225
+ - `turnCount`, `stepAdvanceCount`, `sessionStartMs`, `toolName`, `argsSummary` are all in scope as closures.
226
+ - Post-run handling in `TriggerRouter.route()` is the wrong layer: by the time `runWorkflow()` returns, the outbox write should already have happened (it's diagnostic context for the abort, not post-hoc delivery).
227
+
228
+ **Abort sequence (for `repeated_tool_call`):**
229
+ 1. Check: `stuckAbortPolicy !== 'notify_only'` (default: abort)
230
+ 2. Set `stuckReason` and `stuckContext` closure variables
231
+ 3. Call `agent.abort()`
232
+ 4. Write outbox entry (best-effort, non-fatal, fire-and-forget pattern)
233
+ 5. Return from `turn_end` subscriber (same as `max_turns` path)
234
+ 6. `runWorkflow()` catches the abort and returns `WorkflowRunResult` with `_tag: 'stuck'`
235
+
236
+ **`runWorkflow()` return path:**
237
+ The existing error-catch block needs a branch for the stuck abort. The `stuckContext` closure variable (set in step 2 above) distinguishes stuck abort from other aborts:
238
+
239
+ ```typescript
240
+ // Existing error catch in runWorkflow():
241
+ } catch (err) {
242
+ if (stuckContext !== null) {
243
+ return {
244
+ _tag: 'stuck',
245
+ workflowId: trigger.workflowId,
246
+ ...stuckContext,
247
+ };
248
+ }
249
+ // ... existing error handling
250
+ }
251
+ ```
252
+
253
+ ### 5. TriggerRouter Changes
254
+
255
+ Both `route()` and `dispatch()` need a new branch before the `assertNever` guard:
256
+
257
+ ```typescript
258
+ } else if (result._tag === 'stuck') {
259
+ console.log(
260
+ `[TriggerRouter] Workflow stuck: triggerId=${trigger.id} ` +
261
+ `workflowId=${trigger.workflowId} reason=${result.stuckReason} ` +
262
+ `tool=${result.toolName ?? 'n/a'} turns=${result.turnCount}`,
263
+ );
264
+ }
265
+ ```
266
+
267
+ Delivery (`maybeRunDelivery`) should NOT run for stuck results -- there is no successful output to commit.
268
+
269
+ ### 6. NotificationService Changes
270
+
271
+ Three pure functions need new cases:
272
+
273
+ ```typescript
274
+ // buildNotificationBody:
275
+ case 'stuck':
276
+ return `Session aborted (stuck loop): ${truncated}`;
277
+
278
+ // buildOutcome: NotificationPayload['outcome'] needs 'stuck' added
279
+ // buildDetail:
280
+ case 'stuck':
281
+ return `stuckReason: ${result.stuckReason}; tool: ${result.toolName ?? 'n/a'}; ` +
282
+ `turns: ${result.turnCount}; stepAdvances: ${result.stepAdvanceCount}`;
283
+ ```
284
+
285
+ `NotificationPayload.outcome` currently is `'success' | 'error' | 'timeout' | 'delivery_failed'`. Add `'stuck'`.
286
+
287
+ ## 5-File Change Estimate
288
+
289
+ | File | Change |
290
+ |---|---|
291
+ | `src/daemon/workflow-runner.ts` | (1) Add `WorkflowRunStuck` interface and update `WorkflowRunResult` union. (2) **CRITICAL: Also update `ChildWorkflowRunResult` type alias** -- if missed, the cast at line 2014 silently allows `_tag: 'stuck'` to reach `makeSpawnAgentTool`'s assertNever, causing a runtime crash in child sessions. (3) Add `stuckAbortPolicy` and `noProgressAbortEnabled` to `WorkflowTrigger.agentConfig`. (4) Add `sessionStartMs = Date.now()` alongside `turnCount`. (5) Wire abort + fire-and-forget outbox write in `turn_end` subscriber. (6) Add `stuckContext` closure variable + branch in catch block to return `_tag: 'stuck'`. (7) Add `stuck` case to `makeSpawnAgentTool` switch. |
292
+ | `src/trigger/trigger-router.ts` | Add `stuck` branch in `route()` and `dispatch()` before `assertNever`; `maybeRunDelivery` already skips non-success results (no change needed). |
293
+ | `src/trigger/notification-service.ts` | Add `stuck` case to `buildNotificationBody`, `buildOutcome`, `buildDetail`; add `'stuck'` to `NotificationPayload.outcome`. |
294
+ | `src/trigger/types.ts` | Add `stuckAbortPolicy?: 'abort' | 'notify_only'` and `noProgressAbortEnabled?: boolean` to `TriggerDefinition.agentConfig`. |
295
+ | `src/daemon/daemon-events.ts` | No changes required -- `AgentStuckEvent` already exists and fires. |
296
+
297
+ **Total: 4 files changed** (`daemon-events.ts` is unchanged). `workflow-runner.ts` has multiple edit locations -- all must be done in a single commit to avoid TypeScript exhaustiveness errors at intermediate states.
298
+
299
+ ## Decision Log
300
+
301
+ - **`full_spectrum` path chosen** because both the integration point (landscape) and the result type design (concept) are genuine risks.
302
+ - **New `_tag: 'stuck'` variant** preferred over reusing `WorkflowRunTimeout` because stuck abort fires _before_ the wall clock -- conflating them loses diagnostic precision.
303
+ - **Outbox write in `turn_end` subscriber** (not in `TriggerRouter`) because the outbox entry is diagnostic context for the abort, not post-hoc delivery.
304
+ - **`stuckAbortPolicy` in `agentConfig`** (not a top-level `TriggerDefinition` field) because it belongs with `maxSessionMinutes` and `maxTurns` -- all are session-behavior knobs, not trigger-routing knobs.
305
+ - **`timeout_imminent` does not get a new abort** -- the wall-clock timer already calls `agent.abort()` independently; adding a second abort would be redundant and could race.
306
+ - **Default `stuckAbortPolicy: 'abort'`** for all triggers. Overnight-autonomous is the primary use case. Human-supervised users who want softer behavior must opt in to `'notify_only'`.
307
+ - **`noProgressAbortEnabled: false` default** -- `no_progress` has a real false-positive rate on deep-research sessions. Separating the gate from `stuckAbortPolicy` allows independent control: a trigger can have `stuckAbortPolicy: 'abort'` (abort on `repeated_tool_call`) without also aborting on `no_progress`.
308
+ - **`issueSummaries` field borrowed from Candidate C** -- the `issueSummaries` array is already tracked in session closures at zero additional collection cost. Adding it to `WorkflowRunStuck` now avoids a future breaking interface change when a fix-coordinator needs it.
309
+ - **outbox is best-effort** (fire-and-forget, errors swallowed) -- consistent with the `DaemonEventEmitter` contract. A failed write must never affect the `WorkflowRunResult`.
310
+ - **`sessionStartMs = Date.now()`** must be added before the agent loop -- it does not currently exist in workflow-runner.ts. Trivial one-line addition.
311
+ - **Candidate A rejected** -- adding `reason: 'stuck_loop'` to `WorkflowRunTimeout` conflates stuck-abort with wall-clock timeout, violating 'make illegal states unrepresentable'. The 5-file vs. 2-file maintenance advantage does not justify this semantic violation.
312
+ - **Candidate C deferred** -- `issue_reported severity=fatal` abort is the correct Phase 2 extension once `repeated_tool_call` abort is validated in production. The `issueSummaries` field on `WorkflowRunStuck` provides partial Candidate C value at near-zero cost.
313
+
314
+ ## Residual Concerns
315
+
316
+ 1. **`repeated_tool_call` false-positive rate in production is unvalidated.** The `stuckAbortPolicy: 'notify_only'` escape hatch mitigates. If false-positive rate exceeds ~20%, consider flipping the default to `'notify_only'` and making `'abort'` opt-in.
317
+
318
+ 2. **Outbox write may be lost on rapid daemon shutdown.** The fire-and-forget write initiates before `runWorkflow()` returns, but completes asynchronously. On SIGKILL, the entry may be lost. Same risk as `DaemonEventEmitter` -- accepted by design.
319
+
320
+ 3. **No shadow-mode validation.** Ideally, heuristics would run in shadow mode (emit only) for 20+ production sessions before enabling abort. `stuckAbortPolicy: 'notify_only'` serves as a manual shadow mode.
321
+
322
+ ## Final Summary
323
+
324
+ The design adds `WorkflowRunStuck` as a new `WorkflowRunResult` variant (`_tag: 'stuck'`), wires abort into the `repeated_tool_call` heuristic (unconditional, subject to `stuckAbortPolicy`) and optionally into `no_progress` (gated by `noProgressAbortEnabled: true`) in the `turn_end` subscriber, writes a structured outbox entry on abort, and extends `NotificationService` to distinguish stuck from timeout. The result carries `issueSummaries` for coordinator context. The policy is per-trigger via `agentConfig.stuckAbortPolicy` and `agentConfig.noProgressAbortEnabled`. The change touches 4 files and follows all existing patterns (fire-and-forget outbox, best-effort notification, `assertNever` exhaustiveness, max_turns abort template). A future fix-coordinator can read the outbox entry and extract all fields needed to classify and potentially retry the stuck session without parsing log text.
325
+
326
+ **Critical implementation note:** `ChildWorkflowRunResult` must be updated in the same commit as `WorkflowRunResult`. Failure to do so causes a runtime `assertNever` crash in child sessions spawned by `makeSpawnAgentTool`.