@exaudeus/workrail 3.44.0 → 3.45.0
- package/dist/console-ui/assets/{index-Bi38ITiQ.js → index-BpanIvmi.js} +1 -1
- package/dist/console-ui/index.html +1 -1
- package/dist/daemon/workflow-runner.d.ts +12 -2
- package/dist/daemon/workflow-runner.js +96 -13
- package/dist/manifest.json +17 -17
- package/dist/trigger/notification-service.d.ts +1 -1
- package/dist/trigger/notification-service.js +4 -0
- package/dist/trigger/trigger-router.js +8 -0
- package/dist/trigger/types.d.ts +2 -0
- package/dist/v2/usecases/console-routes.js +3 -0
- package/docs/design/design-candidates-stuck-escalation.md +183 -0
- package/docs/design/design-review-findings-stuck-escalation.md +93 -0
- package/docs/design/implementation-plan-stuck-escalation.md +172 -0
- package/package.json +1 -1
@@ -0,0 +1,172 @@

# Implementation Plan: WorkTrain Stuck-Escalation

*2026-04-19 | Pitch: .workrail/current-pitch.md*

---

## 1. Problem Statement

When a WorkTrain daemon session enters a `repeated_tool_call` loop, the session
currently burns turns until wall-clock or max-turn timeout. The result is
`_tag: 'timeout'`, indistinguishable from a legitimate slow session. Automated
routing is impossible without string-parsing.

---

## 2. Acceptance Criteria

1. `WorkflowRunStuck` interface exported from `workflow-runner.ts` with fields:
   `_tag: 'stuck'`, `workflowId`, `reason`, `message`, `stopReason`, `issueSummaries?`
2. `WorkflowRunResult` union includes `WorkflowRunStuck`.
3. `ChildWorkflowRunResult` union includes `WorkflowRunStuck` (SAME COMMIT as #2).
4. `WorkflowTrigger.agentConfig` has `stuckAbortPolicy?` and `noProgressAbortEnabled?`.
5. `TriggerDefinition.agentConfig` has the same two fields.
6. When `repeated_tool_call` fires and `stuckAbortPolicy !== 'notify_only'`:
   outbox entry written, `agent.abort()` called, `stuckReason = 'repeated_tool_call'`.
7. When `notify_only` is set: outbox written, abort NOT called.
8. When `noProgressAbortEnabled: true` and `no_progress` fires with `stuckAbortPolicy !== 'notify_only'`:
   same abort + outbox write.
9. Return path returns `{ _tag: 'stuck', ... }` before `timeoutReason` check.
10. `trigger-router.ts` `route()` and `dispatch()` handle `stuck` without assertNever fallthrough.
11. `notification-service.ts` `buildNotificationBody()` and `buildDetail()` handle `stuck`.
12. `NotificationPayload.outcome` union includes `'stuck'`.
13. `makeSpawnAgentTool` handles `stuck` child result, returns `outcome: 'stuck'`.
14. All 6 test cases in `workflow-runner-stuck-escalation.test.ts` pass.
15. `npm run build` clean. `npx vitest run` no regressions.
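
As a rough illustration of criteria 1 to 5, the new variant and config fields might look like the sketch below. The field names come from the criteria above; the field types (for example `stopReason: string`), the `'abort'` policy literal (taken from the test design), the trimmed `WorkflowRunTimeout` shape, and the two-member unions are assumptions made for this example, not the contents of `workflow-runner.ts`.

```typescript
// Sketch only: field types and the sibling variant are assumptions.
type StuckReason = 'repeated_tool_call' | 'no_progress';
type StuckAbortPolicy = 'abort' | 'notify_only';

interface WorkflowRunStuck {
  readonly _tag: 'stuck';
  readonly workflowId: string;
  readonly reason: StuckReason;
  readonly message: string;
  readonly stopReason: string; // assumed to be a plain string
  readonly issueSummaries?: readonly string[];
}

// Placeholder for the existing timeout variant (shape assumed).
interface WorkflowRunTimeout {
  readonly _tag: 'timeout';
  readonly workflowId: string;
  readonly reason: string;
}

// The real unions have more members; only two are shown here.
type WorkflowRunResult = WorkflowRunTimeout | WorkflowRunStuck;
type ChildWorkflowRunResult = WorkflowRunTimeout | WorkflowRunStuck;

// The two optional fields added to both agentConfig shapes.
interface AgentConfig {
  readonly stuckAbortPolicy?: StuckAbortPolicy;
  readonly noProgressAbortEnabled?: boolean;
}

// Callers can branch on the tag instead of parsing message strings.
function isStuck(result: WorkflowRunResult): result is WorkflowRunStuck {
  return result._tag === 'stuck';
}

const example: WorkflowRunResult = {
  _tag: 'stuck',
  workflowId: 'wf-example',
  reason: 'repeated_tool_call',
  message: 'Same tool call repeated without progress',
  stopReason: 'aborted',
};

const config: AgentConfig = { stuckAbortPolicy: 'notify_only' };
```

With a dedicated tag, a router can distinguish a stuck session from a slow one by checking `_tag` alone.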

---

## 3. Non-Goals

- No `onStuck:` hook in TriggerDefinition (follow-up)
- No console live panel stuck indicator
- No `worktrain logs` formatting changes
- No automatic retry on stuck
- No Signal 5 (wall-clock at 80%) wiring
- No new heuristics beyond Signal 1 and 2
- No changes to `src/mcp/`
- No `trigger-store.ts` parser changes

---

## 4. Philosophy-Driven Constraints

- All new fields `readonly`
- `issueSummaries` spread to new readonly array when included in return value
- `writeStuckOutboxEntry` is fire-and-forget (void + catch)
- `stuckReason` flag: first-writer-wins (same as `timeoutReason`)
- Outbox write and abort are independent effects (write before abort gate check)
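
The "void + catch" constraint can be sketched as follows. This is a minimal illustration, not the helper from `workflow-runner.ts`: the `StuckOutboxEntry` shape, the injected `OutboxSink`, and the `warn` parameter are hypothetical names introduced so the example is self-contained.

```typescript
// Hypothetical entry shape; the real outbox format is not shown in the plan.
interface StuckOutboxEntry {
  readonly workflowId: string;
  readonly reason: 'repeated_tool_call' | 'no_progress';
  readonly message: string;
}

// Hypothetical async writer (for example, an append to an outbox file).
type OutboxSink = (entry: StuckOutboxEntry) => Promise<void>;

function writeStuckOutboxEntry(
  sink: OutboxSink,
  entry: StuckOutboxEntry,
  warn: (message: string) => void = console.warn,
): void {
  const report = (err: unknown): void => {
    const detail = err instanceof Error ? err.message : String(err);
    warn(`stuck outbox write failed: ${detail}`);
  };
  try {
    // `void` marks the promise as intentionally not awaited;
    // `.catch` keeps a rejection from becoming an unhandled rejection.
    void sink(entry).catch(report);
  } catch (err) {
    // Covers a sink that throws synchronously before returning a promise.
    report(err);
  }
}
```

Because the function returns `void` and never throws, a failed write cannot block or fail the abort that follows it.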

---

## 5. Invariants

- **I1**: `ChildWorkflowRunResult` and `WorkflowRunResult` updates ship in the same commit.
- **I2**: `stuckReason` is checked BEFORE `timeoutReason` in the return path.
- **I3**: Outbox write fires regardless of `stuckAbortPolicy`.
- **I4**: `no_progress` never aborts unless `noProgressAbortEnabled: true`.
- **I5**: `repeated_tool_call` abort fires on the same turn as detection.
- **I6**: First writer wins on `stuckReason` (guard: `stuckReason === null && timeoutReason === null`).
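
One way to read I3 to I6 together is the gating function below. It is a sketch under stated assumptions: `onStuckSignal`, `SessionFlags`, and `StuckEffects` are hypothetical names, and two behaviours are guesses where the plan is silent: a `no_progress` signal with the flag disabled is treated as a no-op, and `stuckReason` is only set on the abort path.

```typescript
type StuckReason = 'repeated_tool_call' | 'no_progress';
type StuckAbortPolicy = 'abort' | 'notify_only';

interface StuckConfig {
  readonly stuckAbortPolicy?: StuckAbortPolicy;
  readonly noProgressAbortEnabled?: boolean;
}

// Mutable per-session flags, mirroring the stuckReason/timeoutReason pair.
interface SessionFlags {
  stuckReason: StuckReason | null;
  timeoutReason: string | null;
}

// Hypothetical effect handles, injected so the logic stays testable.
interface StuckEffects {
  writeOutbox(reason: StuckReason): void;
  abort(): void;
}

function onStuckSignal(
  signal: StuckReason,
  config: StuckConfig,
  flags: SessionFlags,
  effects: StuckEffects,
): void {
  // I4: no_progress is opt-in (assumed to skip the outbox as well).
  if (signal === 'no_progress' && config.noProgressAbortEnabled !== true) return;

  // I3: the outbox write happens before, and independently of, the abort gate.
  effects.writeOutbox(signal);

  if (config.stuckAbortPolicy === 'notify_only') return;

  // I6: first writer wins; never overwrite an earlier stuck or timeout reason.
  if (flags.stuckReason === null && flags.timeoutReason === null) {
    flags.stuckReason = signal;
    effects.abort(); // I5: same turn as detection
  }
}
```

Calling this from the `turn_end` subscriber right after each signal is emitted would keep detection and abort on the same turn.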

---

## 6. Selected Approach

New `_tag: 'stuck'` discriminated union variant. Wire abort in `turn_end` subscriber
after Signal 1 and Signal 2 emitter calls. Return stuck result before `timeoutReason`
check. Update both union types atomically. Add `writeStuckOutboxEntry` module-level
helper. Propagate to trigger-router, notification-service, makeSpawnAgentTool.

**Runner-up rejected**: Extend `WorkflowRunTimeout.reason` -- violates make-illegal-states-unrepresentable.

---

## 7. Vertical Slices

### Slice 1: Core types (workflow-runner.ts)
- Add `WorkflowRunStuck` interface after `WorkflowRunTimeout`
- Add to `WorkflowRunResult` union
- Add to `ChildWorkflowRunResult` union (ATOMIC with above)
- Add `stuckAbortPolicy?` and `noProgressAbortEnabled?` to `WorkflowTrigger.agentConfig`
- **Done when**: `npm run build` clean after this slice

### Slice 2: TriggerDefinition.agentConfig (types.ts)
- Add `stuckAbortPolicy?` and `noProgressAbortEnabled?` after `maxTurns`
- **Done when**: `npm run build` clean

### Slice 3: Runtime wiring (workflow-runner.ts)
- Add `sessionStartMs` constant after `maxTurns` resolution
- Add `stuckReason` flag after `timeoutReason` flag
- Add `writeStuckOutboxEntry` module-level helper
- Wire abort after Signal 1 emitter call in `turn_end`
- Wire abort after Signal 2 emitter call in `turn_end`
- Add stuck return path before `timeoutReason` check
- Update `makeSpawnAgentTool` resultObj type + add `stuck` arm before `assertNever`
- **Done when**: `npm run build` clean
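
The ordering of the return path in Slice 3 could look roughly like this. The sketch assumes a hypothetical `RunState` snapshot and a trimmed `RunResult` union (the `message` and `stopReason` fields are left out for brevity); only the check order and the `issueSummaries` copy are taken from the plan.

```typescript
type StuckReason = 'repeated_tool_call' | 'no_progress';

// Hypothetical snapshot of the runner's flags at the end of a session.
interface RunState {
  readonly workflowId: string;
  readonly stuckReason: StuckReason | null;
  readonly timeoutReason: string | null;
  readonly issueSummaries: readonly string[];
}

// Trimmed result union; the real variants carry more fields.
type RunResult =
  | { readonly _tag: 'stuck'; readonly workflowId: string; readonly reason: StuckReason; readonly issueSummaries?: readonly string[] }
  | { readonly _tag: 'timeout'; readonly workflowId: string; readonly reason: string }
  | { readonly _tag: 'success'; readonly workflowId: string };

function resolveResult(state: RunState): RunResult {
  // Checked first, so a stuck abort is never reported as a timeout.
  if (state.stuckReason !== null) {
    return {
      _tag: 'stuck',
      workflowId: state.workflowId,
      reason: state.stuckReason,
      // Spread into a fresh array so the result does not alias runner state.
      issueSummaries: [...state.issueSummaries],
    };
  }
  if (state.timeoutReason !== null) {
    return { _tag: 'timeout', workflowId: state.workflowId, reason: state.timeoutReason };
  }
  return { _tag: 'success', workflowId: state.workflowId };
}
```

If both flags were ever set despite the first-writer-wins guard, this ordering still reports `stuck`.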

### Slice 4: Caller propagation (trigger-router.ts, notification-service.ts)
- Add `stuck` arm in `route()` exhaustive chain
- Add `stuck` arm in `dispatch()` exhaustive chain
- Add `'stuck'` to `NotificationPayload.outcome` union
- Add `stuck` case in `buildNotificationBody()`
- Add `stuck` case in `buildDetail()`
- **Done when**: `npm run build` clean
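
The pattern behind the Slice 4 changes is an exhaustive match guarded by `assertNever`. The sketch below uses a hypothetical `describeOutcome` function and made-up message strings; it is not the body of `buildNotificationBody()` or `route()`, and the variant shapes are assumptions.

```typescript
// Assumed variant shapes, reduced to what the example needs.
type Outcome =
  | { readonly _tag: 'success' }
  | { readonly _tag: 'error'; readonly message: string }
  | { readonly _tag: 'timeout'; readonly reason: string }
  | { readonly _tag: 'stuck'; readonly reason: 'repeated_tool_call' | 'no_progress'; readonly message: string };

// Accepts only `never`, so an unhandled variant is a compile error at the call site.
function assertNever(value: never): never {
  throw new Error(`Unhandled variant: ${JSON.stringify(value)}`);
}

function describeOutcome(result: Outcome): string {
  switch (result._tag) {
    case 'success':
      return 'Workflow completed';
    case 'error':
      return `Workflow failed: ${result.message}`;
    case 'timeout':
      return `Workflow timed out (${result.reason})`;
    case 'stuck':
      // Removing this arm makes `result` non-never in the default branch,
      // which the compiler rejects.
      return `Workflow stuck (${result.reason}): ${result.message}`;
    default:
      return assertNever(result);
  }
}
```

This is why adding `'stuck'` to the union forces every guarded caller to be updated before `npm run build` passes.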

### Slice 5: Tests
- Write `tests/unit/workflow-runner-stuck-escalation.test.ts` with 6 test cases
- **Done when**: all 6 tests pass, no regressions

---

## 8. Test Design

File: `tests/unit/workflow-runner-stuck-escalation.test.ts`

Pattern: replicate `turn_end` subscriber logic (same as `workflow-runner-stuck-detection.test.ts`).

**Test 1**: `stuckAbortPolicy: 'abort'` default -- `repeated_tool_call` fires, `stuckReason` set, abort called, would return `_tag: 'stuck'`
**Test 2**: `stuckAbortPolicy: 'notify_only'` -- abort NOT called, emitter still fires
**Test 3**: `noProgressAbortEnabled: false` default -- `no_progress` does NOT set `stuckReason`
**Test 4**: `noProgressAbortEnabled: true` -- `no_progress` sets `stuckReason = 'no_progress'`, abort called
**Test 5**: Compile-time assignability test: `WorkflowRunStuck` is assignable to `ChildWorkflowRunResult`
**Test 6**: trigger-router exhaustive switch handles `'stuck'` (import trigger-router, verify no assertNever path hit)
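
Test 5 is a type-level check, so it can be expressed without any runtime assertion library. The sketch below is one possible form, assuming trimmed versions of the two types; `IsAssignable` and `childOutcome` are hypothetical helpers, and the real test would import the types from `workflow-runner.ts` instead of redeclaring them.

```typescript
// Trimmed stand-ins for the real types (shapes assumed).
interface WorkflowRunStuck {
  readonly _tag: 'stuck';
  readonly workflowId: string;
  readonly reason: 'repeated_tool_call' | 'no_progress';
}

interface WorkflowRunTimeout {
  readonly _tag: 'timeout';
  readonly workflowId: string;
  readonly reason: string;
}

type ChildWorkflowRunResult = WorkflowRunTimeout | WorkflowRunStuck;

// Resolves to `true` only when From is assignable to To.
// The tuple wrapper stops the conditional from distributing over unions.
type IsAssignable<From, To> = [From] extends [To] ? true : false;

// If the child union omitted WorkflowRunStuck, this annotation would be
// the type `false` and assigning `true` would fail type checking.
const stuckIsChildResult: IsAssignable<WorkflowRunStuck, ChildWorkflowRunResult> = true;

// Runtime counterpart: a stuck value flows through a child-result parameter.
function childOutcome(result: ChildWorkflowRunResult): string {
  return result._tag;
}
```

The guarantee comes from the type checker (`tsc` or the build step), not from executing the file, so runners that skip type checking would not catch a regression here.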

---

## 9. Risk Register

| Risk | Likelihood | Severity | Mitigation |
|------|------------|----------|------------|
| ChildWorkflowRunResult not updated atomically | Low | High | Single-PR, Slice 1 includes both updates, Test 5 catches gap |
| NotificationPayload.outcome union gap | Low | Medium | Slice 4 adds 'stuck'; build catches it |
| stuckReason/timeoutReason race | Low | Low | Guard condition (both null check) |
| writeStuckOutboxEntry silent failure | Low | Low | Fire-and-forget with console.warn |

---

## 10. PR Packaging Strategy

Single PR: `feat/stuck-escalation`
Single atomic commit with all 4 source files + test file.
PR title: `feat(daemon): WorkflowRunStuck result variant with abort and outbox notification`

---

## 11. Philosophy Alignment Per Slice

| Slice | Principle | Status |
|-------|-----------|--------|
| Slice 1 (types) | Make illegal states unrepresentable | Satisfied |
| Slice 1 (types) | Exhaustiveness everywhere | Satisfied |
| Slice 1 (types) | Type safety as first line of defense | Satisfied (ChildWorkflowRunResult updated) |
| Slice 3 (runtime) | Errors are data | Satisfied |
| Slice 3 (runtime) | Determinism over cleverness | Satisfied (simple flag) |
| Slice 3 (runtime) | Fire-and-forget side effects | Satisfied (outbox write) |
| Slice 4 (callers) | Exhaustiveness everywhere | Satisfied (all assertNever guards updated) |
| Slice 5 (tests) | Prefer fakes over mocks | Satisfied (replicate subscriber logic, not vi.mock) |

---

**unresolvedUnknownCount**: 0
**planConfidenceBand**: High
**estimatedPRCount**: 1