@exaudeus/workrail 3.43.0 → 3.45.0

@@ -0,0 +1,183 @@
1
+ # Design Candidates: WorkTrain Stuck-Escalation
2
+
3
+ *Generated: 2026-04-19 | Pitch: .workrail/current-pitch.md*
4
+
5
+ ---
6
+
7
+ ## Problem Understanding
8
+
9
+ ### Core Tensions
10
+
11
+ 1. **Stuck vs timeout conflation**: When `repeated_tool_call` fires, the session
12
+ currently runs until wall-clock or max-turns timeout. The result is
13
+ `_tag: 'timeout'`, which is indistinguishable from a legitimate slow session.
14
+ Automated routing requires a distinct discriminant.
15
+
16
+ 2. **Abort vs notify-only independence**: Outbox notification and `agent.abort()`
17
+ are two separate effects. `notify_only` policy suppresses the abort but must
18
+ not suppress the outbox write. These effects must not be coupled.
19
+
20
+ 3. **ChildWorkflowRunResult atomic update**: The `as ChildWorkflowRunResult` cast
21
+ at line 2172 in `makeSpawnAgentTool` suppresses any compile-time error from a
22
+ missing union update. Only the `assertNever(childResult)` at line 2212 catches
23
+ the omission -- at runtime, crashing the parent session.
24
+
25
+ 4. **no_progress false-positive risk**: The no_progress heuristic fires on
26
+ legitimate research workflows that spend many turns reading before advancing.
27
+ It must be opt-in (default: false) to avoid breaking existing sessions.
28
+
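The trap described in tension 3 can be reproduced in a few lines. The following is a minimal sketch, not the code in `makeSpawnAgentTool`: the interface names and fields are simplified assumptions, and only the cast-then-`assertNever` shape is taken from the description above.

```typescript
// Hypothetical, simplified shapes; the real interfaces carry more fields.
interface RunSuccess { readonly _tag: 'success' }
interface RunError { readonly _tag: 'error'; readonly message: string }
interface RunTimeout { readonly _tag: 'timeout'; readonly reason: string }
interface RunStuck { readonly _tag: 'stuck'; readonly reason: string }

type WorkflowRunResult = RunSuccess | RunError | RunTimeout | RunStuck;
// The child union was NOT widened: this is the omission under discussion.
type ChildWorkflowRunResult = RunSuccess | RunError | RunTimeout;

function assertNever(value: never): never {
  throw new Error(`Unexpected variant: ${JSON.stringify(value)}`);
}

function mapChildResult(result: WorkflowRunResult): string {
  // The cast narrows the union by assertion, so the compiler stops checking it.
  const childResult = result as ChildWorkflowRunResult;
  if (childResult._tag === 'success') return 'success';
  if (childResult._tag === 'error') return 'error';
  if (childResult._tag === 'timeout') return 'timeout';
  // Type-checks because childResult is `never` here, yet a stuck value
  // can still arrive at runtime and trigger the throw.
  return assertNever(childResult);
}
```

In this sketch, adding `RunStuck` to `ChildWorkflowRunResult` makes the final `assertNever` call a compile error until a `stuck` arm is written, which is the behavior the atomic update is meant to bring back.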
29
+ ### Likely Seam
30
+
31
+ The `turn_end` subscriber in `runWorkflow()` is the correct location. All
32
+ required state (lastNToolCalls, stepAdvanceCount, timeoutReason, issueSummaries)
33
+ is available there as closure variables. Detection fires at the right moment
34
+ (after each turn, synchronously, before the next step is injected).
35
+
36
+ ### What Makes This Hard
37
+
38
+ - The `as ChildWorkflowRunResult` cast is a type-safety trap: it silences
39
+ TypeScript while leaving a runtime crash. Only careful reading of the pitch
40
+ reveals the issue.
41
+ - `buildOutcome()` in notification-service.ts has return type
42
+ `NotificationPayload['outcome']`. Adding 'stuck' to WorkflowRunResult causes
43
+ a compile error there unless the outcome union is also widened.
44
+
45
+ ---
46
+
47
+ ## Philosophy Constraints
48
+
49
+ From CLAUDE.md:
50
+
51
+ - **Make illegal states unrepresentable**: the stuck discriminant prevents
52
+ conflating stuck with timeout at the type level.
53
+ - **Exhaustiveness everywhere**: assertNever guards in trigger-router and
54
+ makeSpawnAgentTool enforce this -- adding a `stuck` arm is required.
55
+ - **Errors are data**: WorkflowRunResult is a Result type; WorkflowRunStuck is
56
+ a new variant, not an exception.
57
+ - **Type safety as first line of defense**: updating ChildWorkflowRunResult in the
58
+ same commit restores the compile-time invariant that the cast broke.
59
+ - **Fire-and-forget for side effects**: outbox write uses void + catch, same
60
+ as DaemonEventEmitter and issue recording.
61
+
62
+ No conflicts between stated philosophy and repo patterns.
63
+
64
+ ---
65
+
66
+ ## Impact Surface
67
+
68
+ Paths that must stay consistent when WorkflowRunResult gains a new variant:
69
+
70
+ 1. `makeSpawnAgentTool` -- `assertNever(childResult)` at line 2212; requires
71
+ ChildWorkflowRunResult update and a new `stuck` arm in the result mapping.
72
+ 2. `trigger-router.ts` `route()` -- exhaustive if-else chain ending in
73
+ `assertNever(result)` at line ~689.
74
+ 3. `trigger-router.ts` `dispatch()` -- same exhaustive chain at line ~770.
75
+ 4. `notification-service.ts` `buildNotificationBody()` -- exhaustive switch.
76
+ 5. `notification-service.ts` `buildDetail()` -- exhaustive switch.
77
+ 6. `notification-service.ts` `buildOutcome()` -- return type
78
+ `NotificationPayload['outcome']`; 'stuck' must be added to that union.
79
+ 7. `NotificationPayload.outcome` union -- currently
80
+ `'success' | 'error' | 'timeout' | 'delivery_failed'`; must add `'stuck'`.
81
+
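Items 6 and 7 can be illustrated with a small type-level sketch. The types below are trimmed-down assumptions (in particular, the variant list of `WorkflowRunResult` is guessed from the outcome union quoted above); only the `NotificationPayload['outcome']` return annotation is taken from the text.

```typescript
// Hypothetical, trimmed-down result union for illustration only.
type WorkflowRunResult =
  | { readonly _tag: 'success' }
  | { readonly _tag: 'error' }
  | { readonly _tag: 'timeout' }
  | { readonly _tag: 'delivery_failed' }
  | { readonly _tag: 'stuck' };

interface NotificationPayload {
  // Widened with 'stuck'. Removing it makes buildOutcome fail to compile,
  // because 'stuck' would no longer be assignable to the return type.
  readonly outcome: 'success' | 'error' | 'timeout' | 'delivery_failed' | 'stuck';
}

function buildOutcome(result: WorkflowRunResult): NotificationPayload['outcome'] {
  // Returns the discriminant directly, so the two unions must stay in step.
  return result._tag;
}
```

Because the function body is unchanged, the only edit needed in this sketch is the extra member in the `outcome` union.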
82
+ ---
83
+
84
+ ## Candidates
85
+
86
+ ### Candidate A: New `_tag: 'stuck'` discriminated union variant (SELECTED)
87
+
88
+ **Summary**: Add `WorkflowRunStuck` interface with `_tag: 'stuck'`, wire abort
89
+ in turn_end subscriber after Signal 1 and Signal 2 emitter calls, return stuck
90
+ result before timeout check, update both `WorkflowRunResult` and
91
+ `ChildWorkflowRunResult` unions atomically, add `writeStuckOutboxEntry` helper.
92
+
93
+ **Tensions resolved**:
94
+ - Stuck/timeout conflation: separate discriminant, separate return path.
95
+ - Abort/notify independence: outbox write fires before the abort gate check.
96
+ - ChildWorkflowRunResult crash: atomic update with assertNever arm added.
97
+ - no_progress false-positive: gated by `noProgressAbortEnabled: false` default.
98
+
99
+ **Boundary solved at**: `turn_end` subscriber (detection + abort), result
100
+ construction (return), 4 files for propagation to callers.
101
+
102
+ **Why best-fit boundary**: The turn_end subscriber is the only location with
103
+ access to all required state. The result construction is the canonical output
104
+ boundary for runWorkflow(). Propagation to callers follows the existing
105
+ WorkflowRunResult variant fan-out pattern.
106
+
107
+ **Failure mode**: Forgetting to update `NotificationPayload.outcome` union --
108
+ caught by `npm run build` (TypeScript compile error in `buildOutcome()`).
109
+
110
+ **Repo-pattern relationship**: Mirrors `timeoutReason` flag pattern exactly.
111
+ Mirrors `WorkflowRunTimeout` interface field shape. Follows assertNever guard
112
+ pattern already established in trigger-router and makeSpawnAgentTool.
113
+
114
+ **Gains**: Distinct routing for stuck sessions, type-safe callers, clean
115
+ separation of abort and notification effects.
116
+
117
+ **Losses**: One more variant in the union (minor cognitive load increase).
118
+
119
+ **Scope judgment**: Best-fit. 4 files, mechanical wiring, all design resolved.
120
+
121
+ **Philosophy fit**: Honors all relevant CLAUDE.md principles. No conflicts.
122
+
123
+ ---
124
+
125
+ ### Candidate B: Extend `WorkflowRunTimeout.reason` with stuck sub-values
126
+
127
+ **Summary**: Add `'stuck_repeated_tool_call' | 'stuck_no_progress'` to
128
+ `WorkflowRunTimeout.reason` -- reuse the timeout discriminant.
129
+
130
+ **Tensions resolved**: None of the core ones. Stuck and timeout still share
131
+ `_tag: 'timeout'`, requiring callers to inspect reason to distinguish them.
132
+
133
+ **Failure mode**: Violates make-illegal-states-unrepresentable. Callers using
134
+ `result._tag === 'timeout'` would silently handle stuck sessions as timeouts.
135
+
136
+ **Repo-pattern relationship**: Departs from the exhaustiveness-everywhere
137
+ pattern. The assertNever guard pattern exists precisely to avoid this.
138
+
139
+ **Scope judgment**: Too narrow -- preserves the routing problem this pitch
140
+ exists to solve.
141
+
142
+ **Rejected because**: Violates philosophy, does not resolve the core tension,
143
+ and the pitch explicitly rejects conflating stuck with timeout.
144
+
145
+ ---
146
+
147
+ ## Comparison and Recommendation
148
+
149
+ Candidate A is the only viable candidate. All analysis converges.
150
+
151
+ The core recommendation is to implement Candidate A exactly as specified in
152
+ `.workrail/current-pitch.md`, with one addition not noted in the pitch:
153
+ update `NotificationPayload.outcome` union to include `'stuck'` (required for
154
+ `buildOutcome()` to compile).
155
+
156
+ ---
157
+
158
+ ## Self-Critique
159
+
160
+ **Strongest counter-argument**: Adding a 5th variant to WorkflowRunResult
161
+ increases cognitive load for callers. Counter: assertNever guards make missing
162
+ cases compile errors, which is the correct safeguard. The complexity cost is
163
+ paid once (at implementation) and enforced automatically.
164
+
165
+ **Narrower option that lost**: Update only WorkflowRunResult, skip
166
+ ChildWorkflowRunResult. Lost because: runtime crash in makeSpawnAgentTool
167
+ when a child hits stuck-abort. The cast at line 2172 provides no protection.
168
+
169
+ **Broader option not justified**: Adding `onStuck:` hook to TriggerDefinition.
170
+ Explicitly deferred per pitch No-Gos. Would require trigger-store.ts parser
171
+ changes -- outside the 4-file scope.
172
+
173
+ **Pivot condition**: If `assertNever(childResult)` were removed in favor of a
174
+ logged fallback, ChildWorkflowRunResult update would be less critical. It is
175
+ not removed, so the atomic update is required.
176
+
177
+ ---
178
+
179
+ ## Open Questions for the Main Agent
180
+
181
+ None. All design decisions are resolved in the pitch. The only implementation
182
+ detail requiring attention is the `NotificationPayload.outcome` union widening
183
+ (add 'stuck') -- verify this compiles before finalizing.
@@ -0,0 +1,93 @@
1
+ # Design Review Findings: WorkTrain Stuck-Escalation
2
+
3
+ *Generated: 2026-04-19 | Pitch: .workrail/current-pitch.md*
4
+
5
+ ---
6
+
7
+ ## Tradeoff Review
8
+
9
+ | Tradeoff | Status | Condition for Failure |
10
+ |----------|--------|-----------------------|
11
+ | One more union variant in WorkflowRunResult | Acceptable | All callers use assertNever guards -- compile error enforces handling |
12
+ | ChildWorkflowRunResult atomic update relies on discipline | Managed | Fails only if commit is split; mitigated by single-PR implementation and compile-time test |
13
+ | NotificationPayload.outcome union widening (gap, not tradeoff) | Resolved | Add 'stuck' to outcome union; caught by npm run build |
14
+
15
+ ---
16
+
17
+ ## Failure Mode Review
18
+
19
+ | Failure Mode | Severity | Design Handling | Missing Mitigation |
20
+ |--------------|----------|-----------------|--------------------|
21
+ | ChildWorkflowRunResult not updated | High | Atomic commit, compile-time assignability test | None beyond discipline |
22
+ | stuckReason / timeoutReason race | Low | First-writer-wins guard; max_turns early return prevents race | None needed |
23
+ | writeStuckOutboxEntry fails | Low | Fire-and-forget, console.warn on error | None -- intentional |
24
+ | no_progress fires on research workflow | Low | noProgressAbortEnabled defaults to false | None needed |
25
+ | NotificationPayload.outcome compile error | Medium | Add 'stuck' to union | None -- caught at build |
26
+
27
+ ---
28
+
29
+ ## Runner-Up / Simpler Alternative Review
30
+
31
+ - **Candidate B** (extend WorkflowRunTimeout.reason): No elements worth borrowing.
32
+ Does not resolve the core routing tension.
33
+ - **Skip ChildWorkflowRunResult**: Not acceptable -- runtime crash in parent session.
34
+ - **Skip sessionStartMs**: Not recommended -- pitch explicitly adds it for Signal 5 follow-up
35
+ to avoid future restructuring.
36
+ - **Inline outbox write**: Works but reduces turn_end subscriber readability. Not worth it.
37
+
38
+ No hybrid opportunities identified.
39
+
40
+ ---
41
+
42
+ ## Philosophy Alignment
43
+
44
+ | Principle | Status |
45
+ |-----------|--------|
46
+ | Make illegal states unrepresentable | Satisfied |
47
+ | Exhaustiveness everywhere | Satisfied |
48
+ | Errors are data | Satisfied |
49
+ | Immutability by default | Satisfied |
50
+ | Type safety as first line of defense | Under tension (pre-existing cast; improved but not fully resolved) |
51
+ | Fire-and-forget for side effects | Satisfied |
52
+
53
+ ---
54
+
55
+ ## Findings
56
+
57
+ ### Yellow: NotificationPayload.outcome union widening not specified in pitch
58
+
59
+ The pitch states 'buildOutcome() returns result._tag directly -- no change needed'.
60
+ However, the return type annotation `NotificationPayload['outcome']` will cause a
61
+ TypeScript compile error when 'stuck' is added to WorkflowRunResult but not to the
62
+ outcome union. **Resolution**: add `'stuck'` to `NotificationPayload.outcome` union
63
+ in notification-service.ts. This is a mechanical fix, not a design change.
64
+
65
+ ### Yellow: Pre-existing `as ChildWorkflowRunResult` cast at line 2172
66
+
67
+ The cast suppresses TypeScript's compile-time check that would otherwise catch a
68
+ missing ChildWorkflowRunResult update. This PR updates the union and adds a
69
+ compile-time assignability test to partially compensate. Removing the cast is
70
+ out of scope. **Residual concern**: the cast still hides future union additions from
71
+ the compiler, and the new test covers only the `stuck` variant.
72
+
73
+ ---
74
+
75
+ ## Recommended Revisions
76
+
77
+ 1. Add `'stuck'` to `NotificationPayload.outcome` union (not in pitch, required for compile).
78
+ 2. Add compile-time assignability test for `ChildWorkflowRunResult` in the test file.
79
+ 3. Document the `as ChildWorkflowRunResult` cast issue in a code comment at line 2172
80
+ (or verify existing comment is sufficient).
81
+
82
+ ---
83
+
84
+ ## Residual Concerns
85
+
86
+ - The `as ChildWorkflowRunResult` cast remains. Future contributors adding a new
87
+ WorkflowRunResult variant may forget to update ChildWorkflowRunResult. The
88
+ compile-time test in the stuck-escalation test file partially mitigates this,
89
+ but only for the stuck variant. A broader structural fix (removing the cast)
90
+ is a follow-up.
91
+ - Webhook consumers reading `outcome: 'stuck'` must handle the new value.
92
+ This is a new feature, not a breaking change, but operators consuming the
93
+ webhook should be aware.
@@ -0,0 +1,172 @@
1
+ # Implementation Plan: WorkTrain Stuck-Escalation
2
+
3
+ *2026-04-19 | Pitch: .workrail/current-pitch.md*
4
+
5
+ ---
6
+
7
+ ## 1. Problem Statement
8
+
9
+ When a WorkTrain daemon session enters a `repeated_tool_call` loop, the session
10
+ currently burns turns until the wall-clock or max-turns timeout. The result is
11
+ `_tag: 'timeout'`, indistinguishable from a legitimate slow session. Automated
12
+ routing is impossible without string-parsing.
13
+
14
+ ---
15
+
16
+ ## 2. Acceptance Criteria
17
+
18
+ 1. `WorkflowRunStuck` interface exported from `workflow-runner.ts` with fields:
19
+ `_tag: 'stuck'`, `workflowId`, `reason`, `message`, `stopReason`, `issueSummaries?`
20
+ 2. `WorkflowRunResult` union includes `WorkflowRunStuck`.
21
+ 3. `ChildWorkflowRunResult` union includes `WorkflowRunStuck` (SAME COMMIT as #2).
22
+ 4. `WorkflowTrigger.agentConfig` has `stuckAbortPolicy?` and `noProgressAbortEnabled?`.
23
+ 5. `TriggerDefinition.agentConfig` has the same two fields.
24
+ 6. When `repeated_tool_call` fires and `stuckAbortPolicy !== 'notify_only'`:
25
+ outbox entry written, `agent.abort()` called, `stuckReason = 'repeated_tool_call'`.
26
+ 7. When `notify_only` is set: outbox written, abort NOT called.
27
+ 8. When `noProgressAbortEnabled: true` and `no_progress` fires with `stuckAbortPolicy !== 'notify_only'`:
28
+ same abort + outbox write.
29
+ 9. Return path returns `{ _tag: 'stuck', ... }` before `timeoutReason` check.
30
+ 10. `trigger-router.ts` `route()` and `dispatch()` handle `stuck` without assertNever fallthrough.
31
+ 11. `notification-service.ts` `buildNotificationBody()` and `buildDetail()` handle `stuck`.
32
+ 12. `NotificationPayload.outcome` union includes `'stuck'`.
33
+ 13. `makeSpawnAgentTool` handles `stuck` child result, returns `outcome: 'stuck'`.
34
+ 14. All 6 test cases in `workflow-runner-stuck-escalation.test.ts` pass.
35
+ 15. `npm run build` clean. `npx vitest run` no regressions.
36
+
37
+ ---
38
+
39
+ ## 3. Non-Goals
40
+
41
+ - No `onStuck:` hook in TriggerDefinition (follow-up)
42
+ - No console live panel stuck indicator
43
+ - No `worktrain logs` formatting changes
44
+ - No automatic retry on stuck
45
+ - No Signal 5 (wall-clock at 80%) wiring
46
+ - No new heuristics beyond Signal 1 and 2
47
+ - No changes to `src/mcp/`
48
+ - No `trigger-store.ts` parser changes
49
+
50
+ ---
51
+
52
+ ## 4. Philosophy-Driven Constraints
53
+
54
+ - All new fields `readonly`
55
+ - `issueSummaries` spread to new readonly array when included in return value
56
+ - `writeStuckOutboxEntry` is fire-and-forget (void + catch)
57
+ - `stuckReason` flag: first-writer-wins (same as `timeoutReason`)
58
+ - Outbox write and abort are independent effects (write before abort gate check)
59
+
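The "void + catch" constraint can be sketched as follows. The signature is hypothetical: the plan does not show the real parameters of `writeStuckOutboxEntry`, so the write operation and the warning sink are injected here purely for illustration.

```typescript
// Hypothetical signature: the real helper's parameters are not shown in this plan.
function writeStuckOutboxEntry(
  write: () => Promise<void>,
  warn: (message: string) => void,
): void {
  // `void` marks the promise as intentionally not awaited; `.catch` keeps a
  // failed write from surfacing as an unhandled rejection.
  void write().catch((err: unknown) => {
    warn(`stuck outbox write failed: ${String(err)}`);
  });
}
```

The caller never awaits the write, so a slow or failing outbox cannot delay the abort decision. One limitation of this sketch: if `write` throws synchronously instead of returning a rejected promise, the exception still propagates to the caller.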
60
+ ---
61
+
62
+ ## 5. Invariants
63
+
64
+ - **I1**: `ChildWorkflowRunResult` and `WorkflowRunResult` updates ship in the same commit.
65
+ - **I2**: `stuckReason` is checked BEFORE `timeoutReason` in the return path.
66
+ - **I3**: Outbox write fires regardless of `stuckAbortPolicy`.
67
+ - **I4**: `no_progress` never aborts unless `noProgressAbortEnabled: true`.
68
+ - **I5**: `repeated_tool_call` abort fires on the same turn as detection.
69
+ - **I6**: First writer wins on `stuckReason` (guard: `stuckReason === null && timeoutReason === null`).
70
+
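Invariants I2, I3, I4, and I6 can be read together as one small decision procedure. The sketch below is an assumption-laden illustration, not the `turn_end` subscriber itself: `onStuckSignal`, `resolveOutcomeTag`, and the `Effects` callbacks are hypothetical names, and it assumes that a disabled `no_progress` signal writes no outbox entry, which the plan does not state explicitly.

```typescript
type StuckReason = 'repeated_tool_call' | 'no_progress';

interface StuckConfig {
  readonly stuckAbortPolicy?: 'abort' | 'notify_only';
  readonly noProgressAbortEnabled?: boolean;
}

interface Effects {
  readonly writeOutbox: (reason: StuckReason) => void; // stand-in for the outbox helper
  readonly abort: () => void; // stand-in for agent.abort()
}

interface SessionState {
  stuckReason: StuckReason | null;
  timeoutReason: string | null;
}

function onStuckSignal(
  signal: StuckReason,
  config: StuckConfig,
  state: SessionState,
  effects: Effects,
): void {
  // I4: no_progress is opt-in (assumed here to skip the outbox write as well).
  if (signal === 'no_progress' && config.noProgressAbortEnabled !== true) return;
  // I3: the outbox write happens before, and independently of, the abort gate.
  effects.writeOutbox(signal);
  if (config.stuckAbortPolicy === 'notify_only') return;
  // I6: first writer wins; a reason that is already set is never overwritten.
  if (state.stuckReason === null && state.timeoutReason === null) {
    state.stuckReason = signal;
    effects.abort();
  }
}

// I2: the return path checks stuckReason before timeoutReason.
function resolveOutcomeTag(state: SessionState): 'stuck' | 'timeout' | 'success' {
  if (state.stuckReason !== null) return 'stuck';
  if (state.timeoutReason !== null) return 'timeout';
  return 'success';
}
```

Passing the effects in as callbacks keeps the sketch testable with plain fakes, in line with the "prefer fakes over mocks" principle cited in this plan.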
71
+ ---
72
+
73
+ ## 6. Selected Approach
74
+
75
+ New `_tag: 'stuck'` discriminated union variant. Wire abort in `turn_end` subscriber
76
+ after Signal 1 and Signal 2 emitter calls. Return stuck result before `timeoutReason`
77
+ check. Update both union types atomically. Add `writeStuckOutboxEntry` module-level
78
+ helper. Propagate to trigger-router, notification-service, makeSpawnAgentTool.
79
+
80
+ **Runner-up rejected**: Extend `WorkflowRunTimeout.reason` -- violates make-illegal-states-unrepresentable.
81
+
82
+ ---
83
+
84
+ ## 7. Vertical Slices
85
+
86
+ ### Slice 1: Core types (workflow-runner.ts)
87
+ - Add `WorkflowRunStuck` interface after `WorkflowRunTimeout`
88
+ - Add to `WorkflowRunResult` union
89
+ - Add to `ChildWorkflowRunResult` union (ATOMIC with above)
90
+ - Add `stuckAbortPolicy?` and `noProgressAbortEnabled?` to `WorkflowTrigger.agentConfig`
91
+ - **Done when**: `npm run build` clean after this slice
92
+
93
+ ### Slice 2: TriggerDefinition.agentConfig (types.ts)
94
+ - Add `stuckAbortPolicy?` and `noProgressAbortEnabled?` after `maxTurns`
95
+ - **Done when**: `npm run build` clean
96
+
97
+ ### Slice 3: Runtime wiring (workflow-runner.ts)
98
+ - Add `sessionStartMs` constant after `maxTurns` resolution
99
+ - Add `stuckReason` flag after `timeoutReason` flag
100
+ - Add `writeStuckOutboxEntry` module-level helper
101
+ - Wire abort after Signal 1 emitter call in `turn_end`
102
+ - Wire abort after Signal 2 emitter call in `turn_end`
103
+ - Add stuck return path before `timeoutReason` check
104
+ - Update `makeSpawnAgentTool` resultObj type + add `stuck` arm before `assertNever`
105
+ - **Done when**: `npm run build` clean
106
+
107
+ ### Slice 4: Caller propagation (trigger-router.ts, notification-service.ts)
108
+ - Add `stuck` arm in `route()` exhaustive chain
109
+ - Add `stuck` arm in `dispatch()` exhaustive chain
110
+ - Add `'stuck'` to `NotificationPayload.outcome` union
111
+ - Add `stuck` case in `buildNotificationBody()`
112
+ - Add `stuck` case in `buildDetail()`
113
+ - **Done when**: `npm run build` clean
114
+
115
+ ### Slice 5: Tests
116
+ - Write `tests/unit/workflow-runner-stuck-escalation.test.ts` with 6 test cases
117
+ - **Done when**: all 6 tests pass, no regressions
118
+
119
+ ---
120
+
121
+ ## 8. Test Design
122
+
123
+ File: `tests/unit/workflow-runner-stuck-escalation.test.ts`
124
+
125
+ Pattern: replicate turn_end subscriber logic (same as workflow-runner-stuck-detection.test.ts).
126
+
127
+ - **Test 1**: `stuckAbortPolicy: 'abort'` (default) -- repeated_tool_call fires, stuckReason set, abort called, would return `_tag: 'stuck'`
128
+ - **Test 2**: `stuckAbortPolicy: 'notify_only'` -- abort NOT called, emitter still fires
129
+ - **Test 3**: `noProgressAbortEnabled: false` (default) -- no_progress does NOT set stuckReason
130
+ - **Test 4**: `noProgressAbortEnabled: true` -- no_progress sets stuckReason = 'no_progress', abort called
131
+ - **Test 5**: Compile-time assignability test: `WorkflowRunStuck` is assignable to `ChildWorkflowRunResult`
132
+ - **Test 6**: trigger-router exhaustive chain handles 'stuck' (import trigger-router, verify no assertNever path hit)
133
+
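One possible shape for the compile-time assignability check in Test 5 is sketched below. The field types on `WorkflowRunStuck` and the other members of `ChildWorkflowRunResult` are assumptions; only the field names come from the acceptance criteria, and the `IsAssignable` helper is a hypothetical utility, not something the repository is known to contain.

```typescript
// Field names follow the acceptance criteria; the field types are assumed.
interface WorkflowRunStuck {
  readonly _tag: 'stuck';
  readonly workflowId: string;
  readonly reason: 'repeated_tool_call' | 'no_progress';
  readonly message: string;
  readonly stopReason: string;
  readonly issueSummaries?: readonly string[];
}

// Hypothetical stand-in for the real child union.
type ChildWorkflowRunResult =
  | { readonly _tag: 'success' }
  | { readonly _tag: 'timeout' }
  | WorkflowRunStuck;

// Resolves to `true` only when From is assignable to To.
type IsAssignable<From, To> = [From] extends [To] ? true : false;

// If WorkflowRunStuck is dropped from ChildWorkflowRunResult, the annotation
// becomes `false` and this line stops compiling.
const stuckIsChildResult: IsAssignable<WorkflowRunStuck, ChildWorkflowRunResult> = true;
```

The check costs nothing at runtime; its value is that the build fails, rather than the parent session crashing, when the two unions drift apart.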
134
+ ---
135
+
136
+ ## 9. Risk Register
137
+
138
+ | Risk | Likelihood | Severity | Mitigation |
139
+ |------|------------|----------|------------|
140
+ | ChildWorkflowRunResult not updated atomically | Low | High | Single-PR, Slice 1 includes both updates, Test 5 catches gap |
141
+ | NotificationPayload.outcome union gap | Low | Medium | Slice 4 adds 'stuck'; build catches it |
142
+ | stuckReason/timeoutReason race | Low | Low | Guard condition (both null check) |
143
+ | writeStuckOutboxEntry silent failure | Low | Low | Fire-and-forget with console.warn |
144
+
145
+ ---
146
+
147
+ ## 10. PR Packaging Strategy
148
+
149
+ Single PR: `feat/stuck-escalation`
150
+ Single atomic commit with all 4 source files + test file.
151
+ PR title: `feat(daemon): WorkflowRunStuck result variant with abort and outbox notification`
152
+
153
+ ---
154
+
155
+ ## 11. Philosophy Alignment Per Slice
156
+
157
+ | Slice | Principle | Status |
158
+ |-------|-----------|--------|
159
+ | Slice 1 (types) | Make illegal states unrepresentable | Satisfied |
160
+ | Slice 1 (types) | Exhaustiveness everywhere | Satisfied |
161
+ | Slice 1 (types) | Type safety as first line of defense | Satisfied (ChildWorkflowRunResult updated) |
162
+ | Slice 3 (runtime) | Errors are data | Satisfied |
163
+ | Slice 3 (runtime) | Determinism over cleverness | Satisfied (simple flag) |
164
+ | Slice 3 (runtime) | Fire-and-forget side effects | Satisfied (outbox write) |
165
+ | Slice 4 (callers) | Exhaustiveness everywhere | Satisfied (all assertNever guards updated) |
166
+ | Slice 5 (tests) | Prefer fakes over mocks | Satisfied (replicate subscriber logic, not vi.mock) |
167
+
168
+ ---
169
+
170
+ **unresolvedUnknownCount**: 0
171
+ **planConfidenceBand**: High
172
+ **estimatedPRCount**: 1
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@exaudeus/workrail",
3
- "version": "3.43.0",
3
+ "version": "3.45.0",
4
4
  "description": "Step-by-step workflow enforcement for AI agents via MCP",
5
5
  "license": "MIT",
6
6
  "repository": {