@exaudeus/workrail 3.44.0 → 3.46.0
- package/dist/cli/commands/index.d.ts +1 -0
- package/dist/cli/commands/index.js +3 -1
- package/dist/cli/commands/worktrain-pipeline.d.ts +17 -0
- package/dist/cli/commands/worktrain-pipeline.js +121 -0
- package/dist/console-ui/assets/{index-Bi38ITiQ.js → index-BQFhoMcY.js} +1 -1
- package/dist/console-ui/index.html +1 -1
- package/dist/coordinators/adaptive-pipeline.d.ts +57 -0
- package/dist/coordinators/adaptive-pipeline.js +104 -0
- package/dist/coordinators/modes/full-pipeline.d.ts +4 -0
- package/dist/coordinators/modes/full-pipeline.js +256 -0
- package/dist/coordinators/modes/implement-shared.d.ts +4 -0
- package/dist/coordinators/modes/implement-shared.js +201 -0
- package/dist/coordinators/modes/implement.d.ts +3 -0
- package/dist/coordinators/modes/implement.js +108 -0
- package/dist/coordinators/modes/quick-review.d.ts +3 -0
- package/dist/coordinators/modes/quick-review.js +37 -0
- package/dist/coordinators/modes/review-only.d.ts +2 -0
- package/dist/coordinators/modes/review-only.js +28 -0
- package/dist/coordinators/routing/route-task.d.ts +21 -0
- package/dist/coordinators/routing/route-task.js +55 -0
- package/dist/daemon/workflow-runner.d.ts +12 -2
- package/dist/daemon/workflow-runner.js +96 -13
- package/dist/manifest.json +101 -29
- package/dist/mcp/output-schemas.d.ts +16 -16
- package/dist/trigger/notification-service.d.ts +1 -1
- package/dist/trigger/notification-service.js +4 -0
- package/dist/trigger/trigger-router.d.ts +3 -0
- package/dist/trigger/trigger-router.js +17 -0
- package/dist/trigger/types.d.ts +2 -0
- package/dist/v2/durable-core/schemas/artifacts/discovery-handoff.d.ts +29 -0
- package/dist/v2/durable-core/schemas/artifacts/discovery-handoff.js +26 -0
- package/dist/v2/durable-core/schemas/artifacts/index.d.ts +2 -1
- package/dist/v2/durable-core/schemas/artifacts/index.js +7 -1
- package/dist/v2/durable-core/schemas/compiled-workflow/index.d.ts +8 -8
- package/dist/v2/usecases/console-routes.js +3 -0
- package/docs/design/design-candidates-stuck-escalation.md +183 -0
- package/docs/design/design-review-findings-stuck-escalation.md +93 -0
- package/docs/design/implementation-plan-stuck-escalation.md +172 -0
- package/docs/ideas/backlog.md +86 -0
- package/package.json +1 -1
|
@@ -0,0 +1,183 @@
|
|
|
1
|
+
# Design Candidates: WorkTrain Stuck-Escalation
|
|
2
|
+
|
|
3
|
+
*Generated: 2026-04-19 | Pitch: .workrail/current-pitch.md*
|
|
4
|
+
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## Problem Understanding
|
|
8
|
+
|
|
9
|
+
### Core Tensions
|
|
10
|
+
|
|
11
|
+
1. **Stuck vs timeout conflation**: When `repeated_tool_call` fires, the session
|
|
12
|
+
currently runs until wall-clock or max-turns timeout. The result is
|
|
13
|
+
`_tag: 'timeout'`, which is indistinguishable from a legitimate slow session.
|
|
14
|
+
Automated routing requires a distinct discriminant.
|
|
15
|
+
|
|
16
|
+
2. **Abort vs notify-only independence**: Outbox notification and `agent.abort()`
|
|
17
|
+
are two separate effects. `notify_only` policy suppresses the abort but must
|
|
18
|
+
not suppress the outbox write. These effects must not be coupled.
|
|
19
|
+
|
|
20
|
+
3. **ChildWorkflowRunResult atomic update**: The `as ChildWorkflowRunResult` cast
|
|
21
|
+
at line 2172 in `makeSpawnAgentTool` suppresses any compile-time error from a
|
|
22
|
+
missing union update. Only the `assertNever(childResult)` at line 2212 catches
|
|
23
|
+
the omission -- at runtime, crashing the parent session.
|
|
24
|
+
|
|
25
|
+
4. **no_progress false-positive risk**: The no_progress heuristic fires on
|
|
26
|
+
legitimate research workflows that spend many turns reading before advancing.
|
|
27
|
+
It must be opt-in (default: false) to avoid breaking existing sessions.
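
Tension 3 can be reproduced in isolation. The sketch below is a minimal illustration, not the code in `makeSpawnAgentTool`: the variant shapes and the `mapChildResult` helper are hypothetical, and `ChildWorkflowRunResult` is deliberately left without the `stuck` member to show that the cast still compiles.

```typescript
// Hypothetical, simplified shapes; the real interfaces carry more fields.
type WorkflowRunResult =
  | { readonly _tag: "success" }
  | { readonly _tag: "error"; readonly message: string }
  | { readonly _tag: "timeout" }
  | { readonly _tag: "stuck"; readonly reason: string }; // newly added variant

// Deliberately NOT updated, to reproduce the omission described above.
type ChildWorkflowRunResult =
  | { readonly _tag: "success" }
  | { readonly _tag: "error"; readonly message: string }
  | { readonly _tag: "timeout" };

function assertNever(value: never): never {
  throw new Error(`Unexpected variant: ${JSON.stringify(value)}`);
}

function mapChildResult(result: WorkflowRunResult): string {
  // The assertion narrows the static type without any runtime check, so the
  // compiler believes 'stuck' can never reach the chain below.
  const childResult = result as ChildWorkflowRunResult;
  if (childResult._tag === "success") return "success";
  if (childResult._tag === "error") return "error";
  if (childResult._tag === "timeout") return "timeout";
  // Type-checks (childResult is `never` here) but throws for a 'stuck' value.
  return assertNever(childResult);
}
```

Adding the `stuck` member to `ChildWorkflowRunResult` turns the final `assertNever` call into a compile error until a `stuck` arm is written, which is the behaviour the atomic update is meant to restore.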
|
|
28
|
+
|
|
29
|
+
### Likely Seam
|
|
30
|
+
|
|
31
|
+
The `turn_end` subscriber in `runWorkflow()` is the correct location for stuck detection and abort. All
|
|
32
|
+
required state (lastNToolCalls, stepAdvanceCount, timeoutReason, issueSummaries)
|
|
33
|
+
is available there as closure variables. Detection fires at the right moment
|
|
34
|
+
(after each turn, synchronously before next step injection).
|
|
35
|
+
|
|
36
|
+
### What Makes This Hard
|
|
37
|
+
|
|
38
|
+
- The `as ChildWorkflowRunResult` cast is a type-safety trap: it silences
|
|
39
|
+
TypeScript while leaving a runtime crash. Only careful reading of the pitch
|
|
40
|
+
reveals the issue.
|
|
41
|
+
- `buildOutcome()` in notification-service.ts has return type
|
|
42
|
+
`NotificationPayload['outcome']`. Adding 'stuck' to WorkflowRunResult causes
|
|
43
|
+
a compile error there unless the outcome union is also widened.
|
|
44
|
+
|
|
45
|
+
---
|
|
46
|
+
|
|
47
|
+
## Philosophy Constraints
|
|
48
|
+
|
|
49
|
+
From CLAUDE.md:
|
|
50
|
+
|
|
51
|
+
- **Make illegal states unrepresentable**: the stuck discriminant prevents
|
|
52
|
+
conflating stuck with timeout at the type level.
|
|
53
|
+
- **Exhaustiveness everywhere**: assertNever guards in trigger-router and
|
|
54
|
+
makeSpawnAgentTool enforce this -- adding a `stuck` arm is required.
|
|
55
|
+
- **Errors are data**: WorkflowRunResult is a Result type; WorkflowRunStuck is
|
|
56
|
+
a new variant, not an exception.
|
|
57
|
+
- **Type safety as first line of defense**: updating ChildWorkflowRunResult in the
|
|
58
|
+
same commit restores the compile-time invariant that the cast broke.
|
|
59
|
+
- **Fire-and-forget for side effects**: outbox write uses void + catch, same
|
|
60
|
+
as DaemonEventEmitter and issue recording.
|
|
61
|
+
|
|
62
|
+
No conflicts between stated philosophy and repo patterns.
|
|
63
|
+
|
|
64
|
+
---
|
|
65
|
+
|
|
66
|
+
## Impact Surface
|
|
67
|
+
|
|
68
|
+
Paths that must stay consistent when WorkflowRunResult gains a new variant:
|
|
69
|
+
|
|
70
|
+
1. `makeSpawnAgentTool` -- `assertNever(childResult)` at line 2212; requires
|
|
71
|
+
ChildWorkflowRunResult update and a new `stuck` arm in the result mapping.
|
|
72
|
+
2. `trigger-router.ts` `route()` -- exhaustive if-else chain ending in
|
|
73
|
+
`assertNever(result)` at line ~689.
|
|
74
|
+
3. `trigger-router.ts` `dispatch()` -- same exhaustive chain at line ~770.
|
|
75
|
+
4. `notification-service.ts` `buildNotificationBody()` -- exhaustive switch.
|
|
76
|
+
5. `notification-service.ts` `buildDetail()` -- exhaustive switch.
|
|
77
|
+
6. `notification-service.ts` `buildOutcome()` -- return type
|
|
78
|
+
`NotificationPayload['outcome']`; 'stuck' must be added to that union.
|
|
79
|
+
7. `NotificationPayload.outcome` union -- currently
|
|
80
|
+
`'success' | 'error' | 'timeout' | 'delivery_failed'`; must add `'stuck'`.
|
|
81
|
+
|
|
82
|
+
---
|
|
83
|
+
|
|
84
|
+
## Candidates
|
|
85
|
+
|
|
86
|
+
### Candidate A: New `_tag: 'stuck'` discriminated union variant (SELECTED)
|
|
87
|
+
|
|
88
|
+
**Summary**: Add `WorkflowRunStuck` interface with `_tag: 'stuck'`, wire abort
|
|
89
|
+
in turn_end subscriber after Signal 1 and Signal 2 emitter calls, return stuck
|
|
90
|
+
result before timeout check, update both `WorkflowRunResult` and
|
|
91
|
+
`ChildWorkflowRunResult` unions atomically, add `writeStuckOutboxEntry` helper.
|
|
92
|
+
|
|
93
|
+
**Tensions resolved**:
|
|
94
|
+
- Stuck/timeout conflation: separate discriminant, separate return path.
|
|
95
|
+
- Abort/notify independence: outbox write fires before the abort gate check.
|
|
96
|
+
- ChildWorkflowRunResult crash: atomic update with assertNever arm added.
|
|
97
|
+
- no_progress false-positive: gated by `noProgressAbortEnabled: false` default.
|
|
98
|
+
|
|
99
|
+
**Boundary solved at**: `turn_end` subscriber (detection + abort), result
|
|
100
|
+
construction (return), 4 files for propagation to callers.
|
|
101
|
+
|
|
102
|
+
**Why best-fit boundary**: The turn_end subscriber is the only location with
|
|
103
|
+
access to all required state. The result construction is the canonical output
|
|
104
|
+
boundary for runWorkflow(). Propagation to callers follows the existing
|
|
105
|
+
WorkflowRunResult variant fan-out pattern.
|
|
106
|
+
|
|
107
|
+
**Failure mode**: Forgetting to update `NotificationPayload.outcome` union --
|
|
108
|
+
caught by `npm run build` (TypeScript compile error in `buildOutcome()`).
|
|
109
|
+
|
|
110
|
+
**Repo-pattern relationship**: Mirrors `timeoutReason` flag pattern exactly.
|
|
111
|
+
Mirrors `WorkflowRunTimeout` interface field shape. Follows assertNever guard
|
|
112
|
+
pattern already established in trigger-router and makeSpawnAgentTool.
|
|
113
|
+
|
|
114
|
+
**Gains**: Distinct routing for stuck sessions, type-safe callers, clean
|
|
115
|
+
separation of abort and notification effects.
|
|
116
|
+
|
|
117
|
+
**Losses**: One more variant in the union (minor cognitive load increase).
|
|
118
|
+
|
|
119
|
+
**Scope judgment**: Best-fit. 4 files, mechanical wiring, all design resolved.
|
|
120
|
+
|
|
121
|
+
**Philosophy fit**: Honors all relevant CLAUDE.md principles. No conflicts.
|
|
122
|
+
|
|
123
|
+
---
|
|
124
|
+
|
|
125
|
+
### Candidate B: Extend `WorkflowRunTimeout.reason` with stuck sub-values
|
|
126
|
+
|
|
127
|
+
**Summary**: Add `'stuck_repeated_tool_call' | 'stuck_no_progress'` to
|
|
128
|
+
`WorkflowRunTimeout.reason` -- reuse the timeout discriminant.
|
|
129
|
+
|
|
130
|
+
**Tensions resolved**: None of the core ones. Stuck and timeout still share
|
|
131
|
+
`_tag: 'timeout'`, requiring callers to inspect reason to distinguish them.
|
|
132
|
+
|
|
133
|
+
**Failure mode**: Violates make-illegal-states-unrepresentable. Callers using
|
|
134
|
+
`result._tag === 'timeout'` would silently handle stuck sessions as timeouts.
|
|
135
|
+
|
|
136
|
+
**Repo-pattern relationship**: Departs from the exhaustiveness-everywhere
|
|
137
|
+
pattern. The assertNever guard pattern exists precisely to avoid this.
|
|
138
|
+
|
|
139
|
+
**Scope judgment**: Too narrow -- preserves the routing problem this pitch
|
|
140
|
+
exists to solve.
|
|
141
|
+
|
|
142
|
+
**Rejected because**: Violates philosophy, does not resolve the core tension,
|
|
143
|
+
and the pitch explicitly rejects conflating stuck with timeout.
|
|
144
|
+
|
|
145
|
+
---
|
|
146
|
+
|
|
147
|
+
## Comparison and Recommendation
|
|
148
|
+
|
|
149
|
+
Candidate A is the only viable candidate. All analysis converges.
|
|
150
|
+
|
|
151
|
+
The core recommendation is to implement Candidate A exactly as specified in
|
|
152
|
+
`.workrail/current-pitch.md`, with one addition not noted in the pitch:
|
|
153
|
+
update `NotificationPayload.outcome` union to include `'stuck'` (required for
|
|
154
|
+
`buildOutcome()` to compile).
|
|
155
|
+
|
|
156
|
+
---
|
|
157
|
+
|
|
158
|
+
## Self-Critique
|
|
159
|
+
|
|
160
|
+
**Strongest counter-argument**: Adding a 5th variant to WorkflowRunResult
|
|
161
|
+
increases cognitive load for callers. Counter: assertNever guards make missing
|
|
162
|
+
cases compile errors, which is the correct safeguard. The complexity cost is
|
|
163
|
+
paid once (at implementation) and enforced automatically.
|
|
164
|
+
|
|
165
|
+
**Narrower option that lost**: Update only WorkflowRunResult, skip
|
|
166
|
+
ChildWorkflowRunResult. Lost because: runtime crash in makeSpawnAgentTool
|
|
167
|
+
when a child hits stuck-abort. The cast at line 2172 provides no protection.
|
|
168
|
+
|
|
169
|
+
**Broader option not justified**: Adding `onStuck:` hook to TriggerDefinition.
|
|
170
|
+
Explicitly deferred per pitch No-Gos. Would require trigger-store.ts parser
|
|
171
|
+
changes -- outside the 4-file scope.
|
|
172
|
+
|
|
173
|
+
**Pivot condition**: If `assertNever(childResult)` were removed in favor of a
|
|
174
|
+
logged fallback, ChildWorkflowRunResult update would be less critical. It is
|
|
175
|
+
not removed, so the atomic update is required.
|
|
176
|
+
|
|
177
|
+
---
|
|
178
|
+
|
|
179
|
+
## Open Questions for the Main Agent
|
|
180
|
+
|
|
181
|
+
None. All design decisions are resolved in the pitch. The only implementation
|
|
182
|
+
detail requiring attention is the `NotificationPayload.outcome` union widening
|
|
183
|
+
(add 'stuck') -- verify this compiles before finalizing.
|
|
@@ -0,0 +1,93 @@
|
|
|
1
|
+
# Design Review Findings: WorkTrain Stuck-Escalation
|
|
2
|
+
|
|
3
|
+
*Generated: 2026-04-19 | Pitch: .workrail/current-pitch.md*
|
|
4
|
+
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## Tradeoff Review
|
|
8
|
+
|
|
9
|
+
| Tradeoff | Status | Condition for Failure |
|----------|--------|-----------------------|
| One more union variant in WorkflowRunResult | Acceptable | All callers use assertNever guards -- compile error enforces handling |
| ChildWorkflowRunResult atomic update relies on discipline | Managed | Fails only if commit is split; mitigated by single-PR implementation and compile-time test |
| NotificationPayload.outcome union widening (gap, not tradeoff) | Resolved | Add 'stuck' to outcome union; caught by npm run build |
|
|
14
|
+
|
|
15
|
+
---
|
|
16
|
+
|
|
17
|
+
## Failure Mode Review
|
|
18
|
+
|
|
19
|
+
| Failure Mode | Severity | Design Handling | Missing Mitigation |
|--------------|----------|-----------------|--------------------|
| ChildWorkflowRunResult not updated | High | Atomic commit, compile-time assignability test | None beyond discipline |
| stuckReason / timeoutReason race | Low | First-writer-wins guard; max_turns early return prevents race | None needed |
| writeStuckOutboxEntry fails | Low | Fire-and-forget, console.warn on error | None -- intentional |
| no_progress fires on research workflow | Low | noProgressAbortEnabled defaults to false | None needed |
| NotificationPayload.outcome compile error | Medium | Add 'stuck' to union | None -- caught at build |
|
|
26
|
+
|
|
27
|
+
---
|
|
28
|
+
|
|
29
|
+
## Runner-Up / Simpler Alternative Review
|
|
30
|
+
|
|
31
|
+
- **Candidate B** (extend WorkflowRunTimeout.reason): No elements worth borrowing.
|
|
32
|
+
Does not resolve the core routing tension.
|
|
33
|
+
- **Skip ChildWorkflowRunResult**: Not acceptable -- runtime crash in parent session.
|
|
34
|
+
- **Skip sessionStartMs**: Not recommended -- pitch explicitly adds it for Signal 5 follow-up
|
|
35
|
+
to avoid future restructuring.
|
|
36
|
+
- **Inline outbox write**: Works but reduces turn_end subscriber readability. Not worth it.
|
|
37
|
+
|
|
38
|
+
No hybrid opportunities identified.
|
|
39
|
+
|
|
40
|
+
---
|
|
41
|
+
|
|
42
|
+
## Philosophy Alignment
|
|
43
|
+
|
|
44
|
+
| Principle | Status |
|-----------|--------|
| Make illegal states unrepresentable | Satisfied |
| Exhaustiveness everywhere | Satisfied |
| Errors are data | Satisfied |
| Immutability by default | Satisfied |
| Type safety as first line of defense | Under tension (pre-existing cast; improved but not fully resolved) |
| Fire-and-forget for side effects | Satisfied |
|
|
52
|
+
|
|
53
|
+
---
|
|
54
|
+
|
|
55
|
+
## Findings
|
|
56
|
+
|
|
57
|
+
### Yellow: NotificationPayload.outcome union widening not specified in pitch
|
|
58
|
+
|
|
59
|
+
The pitch states 'buildOutcome() returns result._tag directly -- no change needed'.
|
|
60
|
+
However, the return type annotation `NotificationPayload['outcome']` will cause a
|
|
61
|
+
TypeScript compile error when 'stuck' is added to WorkflowRunResult but not to the
|
|
62
|
+
outcome union. **Resolution**: add `'stuck'` to `NotificationPayload.outcome` union
|
|
63
|
+
in notification-service.ts. This is a mechanical fix, not a design change.
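
The compile error behind this finding can be seen in a reduced form. This is a sketch under assumptions: the types are simplified stand-ins that keep only `_tag`, and it assumes the existing `_tag` values match the four existing outcome strings, which the pitch's "returns result._tag directly" remark suggests but does not confirm.

```typescript
// Simplified stand-ins for the real types; fields other than _tag are omitted.
type WorkflowRunResult =
  | { readonly _tag: "success" }
  | { readonly _tag: "error" }
  | { readonly _tag: "timeout" }
  | { readonly _tag: "delivery_failed" }
  | { readonly _tag: "stuck" };

interface NotificationPayload {
  // Removing 'stuck' from this union makes buildOutcome() below fail to
  // compile, because result._tag is no longer assignable to the return type.
  readonly outcome: "success" | "error" | "timeout" | "delivery_failed" | "stuck";
}

function buildOutcome(result: WorkflowRunResult): NotificationPayload["outcome"] {
  return result._tag;
}
```

With the union widened, the function body itself needs no change, which is consistent with calling this a mechanical fix.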
|
|
64
|
+
|
|
65
|
+
### Yellow: Pre-existing `as ChildWorkflowRunResult` cast at line 2172
|
|
66
|
+
|
|
67
|
+
The cast suppresses TypeScript's compile-time check that would otherwise catch a
|
|
68
|
+
missing ChildWorkflowRunResult update. This PR updates the union and adds a
|
|
69
|
+
compile-time assignability test to partially compensate. Removing the cast is
|
|
70
|
+
out of scope. **Residual concern**: future union additions must be caught by the
|
|
71
|
+
test rather than the compiler.
|
|
72
|
+
|
|
73
|
+
---
|
|
74
|
+
|
|
75
|
+
## Recommended Revisions
|
|
76
|
+
|
|
77
|
+
1. Add `'stuck'` to `NotificationPayload.outcome` union (not in pitch, required for compile).
|
|
78
|
+
2. Add compile-time assignability test for `ChildWorkflowRunResult` in the test file.
|
|
79
|
+
3. Document the `as ChildWorkflowRunResult` cast issue in a code comment at line 2172
|
|
80
|
+
(or verify existing comment is sufficient).
|
|
81
|
+
|
|
82
|
+
---
|
|
83
|
+
|
|
84
|
+
## Residual Concerns
|
|
85
|
+
|
|
86
|
+
- The `as ChildWorkflowRunResult` cast remains. Future contributors adding a new
|
|
87
|
+
WorkflowRunResult variant may forget to update ChildWorkflowRunResult. The
|
|
88
|
+
compile-time test in the stuck-escalation test file partially mitigates this,
|
|
89
|
+
but only for the stuck variant. A broader structural fix (removing the cast)
|
|
90
|
+
is a follow-up.
|
|
91
|
+
- Webhook consumers reading `outcome: 'stuck'` must handle the new value.
|
|
92
|
+
This is a new feature, not a breaking change, but operators consuming the
|
|
93
|
+
webhook should be aware.
|
|
@@ -0,0 +1,172 @@
|
|
|
1
|
+
# Implementation Plan: WorkTrain Stuck-Escalation
|
|
2
|
+
|
|
3
|
+
*2026-04-19 | Pitch: .workrail/current-pitch.md*
|
|
4
|
+
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
## 1. Problem Statement
|
|
8
|
+
|
|
9
|
+
When a WorkTrain daemon session enters a `repeated_tool_call` loop, the session
|
|
10
|
+
currently burns turns until wall-clock or max-turn timeout. The result is
|
|
11
|
+
`_tag: 'timeout'`, indistinguishable from a legitimate slow session. Automated
|
|
12
|
+
routing is impossible without string-parsing.
|
|
13
|
+
|
|
14
|
+
---
|
|
15
|
+
|
|
16
|
+
## 2. Acceptance Criteria
|
|
17
|
+
|
|
18
|
+
1. `WorkflowRunStuck` interface exported from `workflow-runner.ts` with fields:
|
|
19
|
+
`_tag: 'stuck'`, `workflowId`, `reason`, `message`, `stopReason`, `issueSummaries?`
|
|
20
|
+
2. `WorkflowRunResult` union includes `WorkflowRunStuck`.
|
|
21
|
+
3. `ChildWorkflowRunResult` union includes `WorkflowRunStuck` (SAME COMMIT as #2).
|
|
22
|
+
4. `WorkflowTrigger.agentConfig` has `stuckAbortPolicy?` and `noProgressAbortEnabled?`.
|
|
23
|
+
5. `TriggerDefinition.agentConfig` has the same two fields.
|
|
24
|
+
6. When `repeated_tool_call` fires and `stuckAbortPolicy !== 'notify_only'`:
|
|
25
|
+
outbox entry written, `agent.abort()` called, `stuckReason = 'repeated_tool_call'`.
|
|
26
|
+
7. When `notify_only` is set: outbox written, abort NOT called.
|
|
27
|
+
8. When `noProgressAbortEnabled: true` and `no_progress` fires with `stuckAbortPolicy !== 'notify_only'`:
|
|
28
|
+
same abort + outbox write.
|
|
29
|
+
9. Return path returns `{ _tag: 'stuck', ... }` before `timeoutReason` check.
|
|
30
|
+
10. `trigger-router.ts` `route()` and `dispatch()` handle `stuck` without assertNever fallthrough.
|
|
31
|
+
11. `notification-service.ts` `buildNotificationBody()` and `buildDetail()` handle `stuck`.
|
|
32
|
+
12. `NotificationPayload.outcome` union includes `'stuck'`.
|
|
33
|
+
13. `makeSpawnAgentTool` handles `stuck` child result, returns `outcome: 'stuck'`.
|
|
34
|
+
14. All 6 test cases in `workflow-runner-stuck-escalation.test.ts` pass.
|
|
35
|
+
15. `npm run build` clean. `npx vitest run` no regressions.
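
Criteria 1, 4 and 5 can be pictured with the type sketch below. Only the field names and the `'abort' | 'notify_only'` policy values come from the plan; the field types, the `StuckAgentConfig` name, the `makeStuckResult` helper and its message and `stopReason` values are assumptions made for illustration.

```typescript
type StuckReason = "repeated_tool_call" | "no_progress";

interface WorkflowRunStuck {
  readonly _tag: "stuck";
  readonly workflowId: string; // type assumed
  readonly reason: StuckReason;
  readonly message: string; // type assumed
  readonly stopReason: string; // type assumed
  readonly issueSummaries?: readonly string[]; // element type assumed
}

// Hypothetical name for the two new optional agentConfig fields.
interface StuckAgentConfig {
  readonly stuckAbortPolicy?: "abort" | "notify_only";
  readonly noProgressAbortEnabled?: boolean;
}

// Hypothetical helper: builds the result and copies issueSummaries into a
// new array so later mutation of the caller's array cannot leak in.
function makeStuckResult(
  workflowId: string,
  reason: StuckReason,
  issueSummaries: string[],
): WorkflowRunStuck {
  return {
    _tag: "stuck",
    workflowId,
    reason,
    message: `Session aborted: ${reason}`, // illustrative wording
    stopReason: "aborted", // illustrative value
    ...(issueSummaries.length > 0 ? { issueSummaries: [...issueSummaries] } : {}),
  };
}
```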
|
|
36
|
+
|
|
37
|
+
---
|
|
38
|
+
|
|
39
|
+
## 3. Non-Goals
|
|
40
|
+
|
|
41
|
+
- No `onStuck:` hook in TriggerDefinition (follow-up)
|
|
42
|
+
- No console live panel stuck indicator
|
|
43
|
+
- No `worktrain logs` formatting changes
|
|
44
|
+
- No automatic retry on stuck
|
|
45
|
+
- No Signal 5 (wall-clock at 80%) wiring
|
|
46
|
+
- No new heuristics beyond Signal 1 and 2
|
|
47
|
+
- No changes to `src/mcp/`
|
|
48
|
+
- No `trigger-store.ts` parser changes
|
|
49
|
+
|
|
50
|
+
---
|
|
51
|
+
|
|
52
|
+
## 4. Philosophy-Driven Constraints
|
|
53
|
+
|
|
54
|
+
- All new fields `readonly`
|
|
55
|
+
- `issueSummaries` spread to new readonly array when included in return value
|
|
56
|
+
- `writeStuckOutboxEntry` is fire-and-forget (void + catch)
|
|
57
|
+
- `stuckReason` flag: first-writer-wins (same as `timeoutReason`)
|
|
58
|
+
- Outbox write and abort are independent effects (write before abort gate check)
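
The fire-and-forget constraint might look like the sketch below. The entry shape, the `AppendLine` type and the injected `warn` callback are hypothetical; the plan names only the helper and the `void` + `catch` pattern. The append function is injected so the sketch stays free of file-system imports.

```typescript
// Hypothetical entry shape; the real outbox schema is not shown in this plan.
interface StuckOutboxEntry {
  readonly kind: "stuck";
  readonly workflowId: string;
  readonly reason: "repeated_tool_call" | "no_progress";
  readonly timestamp: string;
}

// Assumed abstraction over "append one line to the outbox file".
type AppendLine = (line: string) => Promise<void>;

function writeStuckOutboxEntry(
  append: AppendLine,
  entry: StuckOutboxEntry,
  warn: (message: string) => void, // the plan uses console.warn here
): void {
  // Fire-and-forget: the promise is intentionally not awaited, and a failed
  // write is reduced to a warning so it can never fail the session.
  void append(JSON.stringify(entry) + "\n").catch((err: unknown) => {
    warn(`stuck outbox write failed: ${String(err)}`);
  });
}
```

Because the helper returns `void` rather than a promise, the caller cannot accidentally make the abort wait on, or depend on, the outbox write.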
|
|
59
|
+
|
|
60
|
+
---
|
|
61
|
+
|
|
62
|
+
## 5. Invariants
|
|
63
|
+
|
|
64
|
+
- **I1**: `ChildWorkflowRunResult` and `WorkflowRunResult` updates ship in the same commit.
|
|
65
|
+
- **I2**: `stuckReason` is checked BEFORE `timeoutReason` in the return path.
|
|
66
|
+
- **I3**: Outbox write fires regardless of `stuckAbortPolicy`.
|
|
67
|
+
- **I4**: `no_progress` never aborts unless `noProgressAbortEnabled: true`.
|
|
68
|
+
- **I5**: `repeated_tool_call` abort fires on the same turn as detection.
|
|
69
|
+
- **I6**: First writer wins on `stuckReason` (guard: `stuckReason === null && timeoutReason === null`).
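
One way to read invariants I2 through I6 together is the sketch below. It is not the `turn_end` subscriber itself: `onStuckSignal`, `resolveOutcomeTag`, `SessionFlags` and `StuckEffects` are hypothetical names, and two points are assumptions the plan does not settle, namely that a disabled `no_progress` signal has no effect at all, and that `stuckReason` is only set when the session is actually aborted.

```typescript
type StuckReason = "repeated_tool_call" | "no_progress";

interface StuckConfig {
  readonly stuckAbortPolicy?: "abort" | "notify_only"; // default: 'abort'
  readonly noProgressAbortEnabled?: boolean; // default: false
}

// Closure state of the subscriber, modelled as a plain object (hypothetical).
interface SessionFlags {
  stuckReason: StuckReason | null;
  timeoutReason: string | null;
}

// Injected effects so the sketch runs without a real agent or outbox.
interface StuckEffects {
  writeOutbox(reason: StuckReason): void;
  abort(): void;
}

function onStuckSignal(
  signal: StuckReason,
  config: StuckConfig,
  flags: SessionFlags,
  effects: StuckEffects,
): void {
  // I4: no_progress is opt-in (assumed: no effect at all when disabled).
  if (signal === "no_progress" && config.noProgressAbortEnabled !== true) return;
  // I6: first writer wins; skip if a stuck or timeout reason is already set.
  if (flags.stuckReason !== null || flags.timeoutReason !== null) return;
  // I3: the outbox write happens before, and independently of, the abort gate.
  effects.writeOutbox(signal);
  if (config.stuckAbortPolicy === "notify_only") return;
  // I5: set the flag and abort on the same turn as detection.
  flags.stuckReason = signal;
  effects.abort();
}

// I2: stuckReason is consulted before timeoutReason when building the result.
function resolveOutcomeTag(flags: SessionFlags): "stuck" | "timeout" | "other" {
  if (flags.stuckReason !== null) return "stuck";
  if (flags.timeoutReason !== null) return "timeout";
  return "other";
}
```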
|
|
70
|
+
|
|
71
|
+
---
|
|
72
|
+
|
|
73
|
+
## 6. Selected Approach
|
|
74
|
+
|
|
75
|
+
New `_tag: 'stuck'` discriminated union variant. Wire abort in `turn_end` subscriber
|
|
76
|
+
after Signal 1 and Signal 2 emitter calls. Return stuck result before `timeoutReason`
|
|
77
|
+
check. Update both union types atomically. Add `writeStuckOutboxEntry` module-level
|
|
78
|
+
helper. Propagate to trigger-router, notification-service, makeSpawnAgentTool.
|
|
79
|
+
|
|
80
|
+
**Runner-up rejected**: Extend `WorkflowRunTimeout.reason` -- violates make-illegal-states-unrepresentable.
|
|
81
|
+
|
|
82
|
+
---
|
|
83
|
+
|
|
84
|
+
## 7. Vertical Slices
|
|
85
|
+
|
|
86
|
+
### Slice 1: Core types (workflow-runner.ts)
|
|
87
|
+
- Add `WorkflowRunStuck` interface after `WorkflowRunTimeout`
|
|
88
|
+
- Add to `WorkflowRunResult` union
|
|
89
|
+
- Add to `ChildWorkflowRunResult` union (ATOMIC with above)
|
|
90
|
+
- Add `stuckAbortPolicy?` and `noProgressAbortEnabled?` to `WorkflowTrigger.agentConfig`
|
|
91
|
+
- **Done when**: `npm run build` clean after this slice
|
|
92
|
+
|
|
93
|
+
### Slice 2: TriggerDefinition.agentConfig (types.ts)
|
|
94
|
+
- Add `stuckAbortPolicy?` and `noProgressAbortEnabled?` after `maxTurns`
|
|
95
|
+
- **Done when**: `npm run build` clean
|
|
96
|
+
|
|
97
|
+
### Slice 3: Runtime wiring (workflow-runner.ts)
|
|
98
|
+
- Add `sessionStartMs` constant after `maxTurns` resolution
|
|
99
|
+
- Add `stuckReason` flag after `timeoutReason` flag
|
|
100
|
+
- Add `writeStuckOutboxEntry` module-level helper
|
|
101
|
+
- Wire abort after Signal 1 emitter call in `turn_end`
|
|
102
|
+
- Wire abort after Signal 2 emitter call in `turn_end`
|
|
103
|
+
- Add stuck return path before `timeoutReason` check
|
|
104
|
+
- Update `makeSpawnAgentTool` resultObj type + add `stuck` arm before `assertNever`
|
|
105
|
+
- **Done when**: `npm run build` clean
|
|
106
|
+
|
|
107
|
+
### Slice 4: Caller propagation (trigger-router.ts, notification-service.ts)
|
|
108
|
+
- Add `stuck` arm in `route()` exhaustive chain
|
|
109
|
+
- Add `stuck` arm in `dispatch()` exhaustive chain
|
|
110
|
+
- Add `'stuck'` to `NotificationPayload.outcome` union
|
|
111
|
+
- Add `stuck` case in `buildNotificationBody()`
|
|
112
|
+
- Add `stuck` case in `buildDetail()`
|
|
113
|
+
- **Done when**: `npm run build` clean
|
|
114
|
+
|
|
115
|
+
### Slice 5: Tests
|
|
116
|
+
- Write `tests/unit/workflow-runner-stuck-escalation.test.ts` with 6 test cases
|
|
117
|
+
- **Done when**: all 6 tests pass, no regressions
|
|
118
|
+
|
|
119
|
+
---
|
|
120
|
+
|
|
121
|
+
## 8. Test Design
|
|
122
|
+
|
|
123
|
+
File: `tests/unit/workflow-runner-stuck-escalation.test.ts`
|
|
124
|
+
|
|
125
|
+
Pattern: replicate turn_end subscriber logic (same as workflow-runner-stuck-detection.test.ts).
|
|
126
|
+
|
|
127
|
+
**Test 1**: `stuckAbortPolicy: 'abort'` default -- repeated_tool_call fires, stuckReason set, abort called, would return _tag:'stuck'
|
|
128
|
+
**Test 2**: `stuckAbortPolicy: 'notify_only'` -- abort NOT called, emitter still fires
|
|
129
|
+
**Test 3**: `noProgressAbortEnabled: false` default -- no_progress does NOT set stuckReason
|
|
130
|
+
**Test 4**: `noProgressAbortEnabled: true` -- no_progress sets stuckReason = 'no_progress', abort called
|
|
131
|
+
**Test 5**: Compile-time assignability test: `WorkflowRunStuck` is assignable to `ChildWorkflowRunResult`
|
|
132
|
+
**Test 6**: trigger-router exhaustive chain handles 'stuck' (import trigger-router, verify the assertNever path is not hit)
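
Test 5 has no runtime behaviour of its own; it only has to fail the build when the union is missing a member. A minimal sketch of one such check follows, assuming simplified local types in place of the real exports and a hypothetical `IsAssignable` helper type.

```typescript
// Simplified stand-ins; the real test would import these from workflow-runner.ts.
interface WorkflowRunStuck {
  readonly _tag: "stuck";
  readonly reason: "repeated_tool_call" | "no_progress";
}

type ChildWorkflowRunResult =
  | { readonly _tag: "success" }
  | { readonly _tag: "timeout" }
  | WorkflowRunStuck;

// Evaluates to the literal type `true` only when From is assignable to To.
// The tuple wrapping prevents distribution over union members.
type IsAssignable<From, To> = [From] extends [To] ? true : false;

// If WorkflowRunStuck is dropped from ChildWorkflowRunResult, the annotation
// becomes `false` and this line stops compiling.
const stuckIsChildResult: IsAssignable<WorkflowRunStuck, ChildWorkflowRunResult> = true;

// A direct assignment exercises the same check with a concrete value.
const sample: WorkflowRunStuck = { _tag: "stuck", reason: "repeated_tool_call" };
const asChild: ChildWorkflowRunResult = sample;
```

This kind of check only guards the variants it names, so it covers `stuck` but not future additions to `WorkflowRunResult`.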
|
|
133
|
+
|
|
134
|
+
---
|
|
135
|
+
|
|
136
|
+
## 9. Risk Register
|
|
137
|
+
|
|
138
|
+
| Risk | Likelihood | Severity | Mitigation |
|------|------------|----------|------------|
| ChildWorkflowRunResult not updated atomically | Low | High | Single-PR, Slice 1 includes both updates, Test 5 catches gap |
| NotificationPayload.outcome union gap | Low | Medium | Slice 4 adds 'stuck'; build catches it |
| stuckReason/timeoutReason race | Low | Low | Guard condition (both null check) |
| writeStuckOutboxEntry silent failure | Low | Low | Fire-and-forget with console.warn |
|
|
144
|
+
|
|
145
|
+
---
|
|
146
|
+
|
|
147
|
+
## 10. PR Packaging Strategy
|
|
148
|
+
|
|
149
|
+
Single PR: `feat/stuck-escalation`
|
|
150
|
+
Single atomic commit with all 4 source files + test file.
|
|
151
|
+
PR title: `feat(daemon): WorkflowRunStuck result variant with abort and outbox notification`
|
|
152
|
+
|
|
153
|
+
---
|
|
154
|
+
|
|
155
|
+
## 11. Philosophy Alignment Per Slice
|
|
156
|
+
|
|
157
|
+
| Slice | Principle | Status |
|-------|-----------|--------|
| Slice 1 (types) | Make illegal states unrepresentable | Satisfied |
| Slice 1 (types) | Exhaustiveness everywhere | Satisfied |
| Slice 1 (types) | Type safety as first line of defense | Satisfied (ChildWorkflowRunResult updated) |
| Slice 3 (runtime) | Errors are data | Satisfied |
| Slice 3 (runtime) | Determinism over cleverness | Satisfied (simple flag) |
| Slice 3 (runtime) | Fire-and-forget side effects | Satisfied (outbox write) |
| Slice 4 (callers) | Exhaustiveness everywhere | Satisfied (all assertNever guards updated) |
| Slice 5 (tests) | Prefer fakes over mocks | Satisfied (replicate subscriber logic, not vi.mock) |
|
|
167
|
+
|
|
168
|
+
---
|
|
169
|
+
|
|
170
|
+
**unresolvedUnknownCount**: 0
|
|
171
|
+
**planConfidenceBand**: High
|
|
172
|
+
**estimatedPRCount**: 1
|
package/docs/ideas/backlog.md
CHANGED
|
@@ -6395,3 +6395,89 @@ When a post-implementation MR review finds a UI/UX finding (wrong affordance, mi
|
|
|
6395
6395
|
### Priority
|
|
6396
6396
|
|
|
6397
6397
|
Design this as part of the adaptive coordinator (#3). The `touchesUI` flag belongs on the classification output alongside `taskComplexity` and `maturity`. The UI detection logic and the design workflow insertion are both coordinator-level concerns, not engine-level.
|
|
6398
|
+
|
|
6399
|
+
---
|
|
6400
|
+
|
|
6401
|
+
## Current state update (Apr 20, 2026)
|
|
6402
|
+
|
|
6403
|
+
**npm version: v3.45.0**
|
|
6404
|
+
|
|
6405
|
+
### What shipped in this session (Apr 19-20, 2026)
|
|
6406
|
+
|
|
6407
|
+
All five top-priority autonomous pipeline items shipped:
|
|
6408
|
+
|
|
6409
|
+
- ✅ **#1 -- Worktree isolation + auto-commit** (PR #630) -- Each WorkTrain coding session now runs in an isolated git worktree (`~/.workrail/worktrees/<sessionId>`). `trigger.workspacePath` is never mutated; all tool factories receive `sessionWorkspacePath`. Crash recovery sidecar persists `worktreePath` for orphan cleanup. `delivery-action.ts` asserts HEAD branch before push. `test-task` trigger: `branchStrategy: worktree`, `autoCommit: true`, `autoOpenPR: true`.
|
|
6410
|
+
|
|
6411
|
+
- ✅ **#2 -- Stuck detection escalation** (PR #636) -- New `WorkflowRunResult._tag: 'stuck'` discriminant. When `repeated_tool_call` heuristic fires and `stuckAbortPolicy !== 'notify_only'` (default: `'abort'`), daemon aborts the session immediately instead of burning the 30-min wall clock. Writes structured entry to `~/.workrail/outbox.jsonl`. `stuckAbortPolicy` and `noProgressAbortEnabled` configurable per trigger in `agentConfig`. `ChildWorkflowRunResult` updated atomically.
|
|
6412
|
+
|
|
6413
|
+
- ✅ **#3 -- Adaptive pipeline coordinator** (PR #639) -- `worktrain run pipeline --issue N --workspace path` routes tasks to the right pipeline via pure static routing:
|
|
6414
|
+
- dep-bump + PR number → QUICK_REVIEW (delegates to `runPrReviewCoordinator`)
|
|
6415
|
+
- PR/MR number → REVIEW_ONLY
|
|
6416
|
+
- `current-pitch.md` exists → IMPLEMENT (coding + PR + review + merge)
|
|
6417
|
+
- Default → FULL (discovery → shaping → coding → PR → review → merge)
|
|
6418
|
+
- Fix loop cap: 2 iterations max. Escalating audit chain for Critical findings. UX gate for UI-touching tasks. 6 hardcoded timeout constants. Pitch archived after IMPLEMENT/FULL completes.
|
|
6419
|
+
|
|
6420
|
+
- ✅ **#4 -- GitHub issue queue poll trigger** (PR #637) -- New `github_queue_poll` trigger provider. Polls GitHub issues matching `GitHubQueueConfig` (assignee-based MVP, `label`/`mention`/`query` typed but `not_implemented`). Maturity inference from 3 deterministic heuristics. Idempotency check (conservative: parse errors = active). JSONL decision log at `~/.workrail/queue-poll.jsonl`. `maxTotalConcurrentSessions` cap. Bot identity config (`botName`, `botEmail`).
|
|
6421
|
+
|
|
6422
|
+
- ✅ **#5 -- Context assembly layer** (PR #624, shipped earlier) -- `ContextAssembler` injects git diff summary + prior session notes before turn 1. Feeds into coordinator pre-dispatch.
|
|
6423
|
+
|
|
6424
|
+
- ✅ **Performance sweep** (all 10 issues #248-257 -- already confirmed complete)
|
|
6425
|
+
- ✅ **Console session tree** (PR #607 -- parentSessionId rendered in UI)
|
|
6426
|
+
- ✅ **Daemon file-nav tools** (PR #619) -- Glob, Grep, Edit + upgraded Read/Write with staleness guard
|
|
6427
|
+
- ✅ **`spawn_agent` artifacts** (PR #613) -- `lastStepArtifacts` surfaced through spawn_agent return
|
|
6428
|
+
- ✅ **`wr.shaping` workflow** (PR #610) -- faithful Shape Up shaping, 9 steps
|
|
6429
|
+
- ✅ **Coding workflow Phase 0.5** (PR #610) -- upstream context detection, three-workflow pipeline
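
The static routing listed under #3 above can be pictured as a pure function over a few task signals. This is a sketch, not the contents of `route-task.js`: the `TaskSignals` shape and the `routeTask` name are hypothetical, and it assumes the four rules are evaluated in the order they are listed.

```typescript
type PipelineMode = "QUICK_REVIEW" | "REVIEW_ONLY" | "IMPLEMENT" | "FULL";

// Hypothetical input; the real router's inputs are not shown in this entry.
interface TaskSignals {
  readonly isDependencyBump: boolean;
  readonly prNumber?: number; // PR or MR number, when the task targets one
  readonly pitchExists: boolean; // whether current-pitch.md is present
}

function routeTask(signals: TaskSignals): PipelineMode {
  if (signals.prNumber !== undefined) {
    // A dependency bump with a PR gets the lighter review path.
    return signals.isDependencyBump ? "QUICK_REVIEW" : "REVIEW_ONLY";
  }
  if (signals.pitchExists) return "IMPLEMENT";
  return "FULL"; // discovery through merge
}
```

Keeping the router free of I/O means every routing decision can be unit-tested from plain inputs.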
|
|
6430
|
+
|
|
6431
|
+
### WorkTrain current capabilities (v3.45.0)
|
|
6432
|
+
|
|
6433
|
+
**Autonomous workflow execution -- confirmed working:**
|
|
6434
|
+
- `worktrain run pipeline --issue N` routes to the right pipeline and runs it end-to-end
|
|
6435
|
+
- `worktrain run pr-review` autonomous PR review with structured verdicts and auto-merge
|
|
6436
|
+
- Coding sessions run in isolated worktrees, auto-commit, auto-open PR
|
|
6437
|
+
- Sessions abort when stuck (instead of burning 30-min wall clock)
|
|
6438
|
+
- GitHub issue queue polling: assign issue to `worktrain-etienneb` → daemon picks it up automatically
|
|
6439
|
+
- All sessions start with git diff + prior session notes injected (ContextAssembler)
|
|
6440
|
+
- Daemon file-nav tools: Glob, Grep, Edit, Read (paginated), Write (staleness guard)
|
|
6441
|
+
- Escalating audit chain: Critical findings → prod audit → re-review → escalate if still Critical
|
|
6442
|
+
- Fix loop: minor findings → max 2 fix iterations before escalation
|
|
6443
|
+
|
|
6444
|
+
**WorkTrain agent tool set (v3.45.0):**
|
|
6445
|
+
`complete_step`, `continue_workflow` (deprecated), `Bash`, `Read`, `Write`, `Glob`, `Grep`, `Edit`, `report_issue`, `spawn_agent`, `signal_coordinator`
|
|
6446
|
+
|
|
6447
|
+
**Trigger system:**
|
|
6448
|
+
- Generic webhook, GitLab MR polling, GitHub Issues polling, GitHub PR polling
|
|
6449
|
+
- **NEW: `github_queue_poll`** -- assignee-based issue queue with maturity inference
|
|
6450
|
+
- `branchStrategy: worktree` -- isolated worktree per session
|
|
6451
|
+
- `autoCommit: true` / `autoOpenPR: true` -- full delivery pipeline
|
|
6452
|
+
- `stuckAbortPolicy: 'abort' | 'notify_only'`
|
|
6453
|
+
- `goalTemplate`, `referenceUrls`, `contextMapping`, `agentConfig`
|
|
6454
|
+
|
|
6455
|
+
### Accurate limitations (v3.45.0)
|
|
6456
|
+
|
|
6457
|
+
1. **`dispatchAdaptivePipeline()` not yet connected** -- `TriggerRouter.dispatchAdaptivePipeline()` exists but `polling-scheduler.ts` still calls `router.dispatch()`. Queue poll sessions run as generic sessions, not routed through the adaptive coordinator. Cross-PR gap documented with TODO.
|
|
6458
|
+
|
|
6459
|
+
2. **`findingCategory` not on review-verdict** -- Audit chain always dispatches `production-readiness-audit` for Critical findings regardless of finding type. `findingCategory` field on `findings[]` items needs to be added to `wr.review_verdict` schema as a follow-up so architecture findings can route to `architecture-scalability-audit` correctly.
|
|
6460
|
+
|
|
6461
|
+
3. **Bot account setup required before first queue run** -- `worktrain-etienneb` GitHub account must be created, PAT generated with `repo:read` scope, stored as `WORKTRAIN_BOT_TOKEN`, and added as repo collaborator. Commit identity: `worktrain-etienneb@users.noreply.github.com`. Without this, `github_queue_poll` trigger has no bot identity.
|
|
6462
|
+
|
|
6463
|
+
4. **No auto-merge setting in `worktrain init`** -- Auto-merge policy is hardcoded in the coordinator. Should be a `~/.workrail/config.json` setting exposed during `worktrain init`.
|
|
6464
|
+
|
|
6465
|
+
5. **Grooming loop not built** -- Three open design decisions must be settled before building (human-ack boundary, compute budget, priority signal source). Deferred until Level 1 usage data exists.
|
|
6466
|
+
|
|
6467
|
+
6. **Knowledge graph not wired** -- `src/knowledge-graph/` module exists (DuckDB + ts-morph), `ts-morph` in devDependencies. No daemon tool yet. Architecture decision: belongs in context assembly layer, not as a daemon tool.
|
|
6468
|
+
|
|
6469
|
+
7. **`worktrain inbox --watch` stub** -- Prints "not yet implemented." The outbox mechanism exists; just needs a polling loop.
|
|
6470
|
+
|
|
6471
|
+
8. **Artifact store not built** -- Agents dump markdown in the repo. `~/.workrail/artifacts/` not created.
|
|
6472
|
+
|
|
6473
|
+
### Next priorities (groomed Apr 20)
|
|
6474
|
+
|
|
6475
|
+
1. **Connect `dispatchAdaptivePipeline()`** -- Wire `polling-scheduler.ts` to call `TriggerRouter.dispatchAdaptivePipeline()` when `context.taskCandidate` is present. Small change, unlocks the full autonomous queue → pipeline connection.
|
|
6476
|
+
|
|
6477
|
+
2. **`findingCategory` on review-verdict schema** -- Add `findingCategory: 'correctness' | 'security' | 'architecture' | 'ux' | 'performance' | 'testing'` to `findings[]` in `ReviewVerdictArtifactV1Schema`. Update `mr-review-workflow-agentic` final step to emit it. Unlocks correct audit routing.
|
|
6478
|
+
|
|
6479
|
+
3. **Bot account setup + `worktrain init` overhaul** -- Create `worktrain-etienneb`, add `worktrain daemon --check` command (API key + git fetch dry run), expose auto-merge policy in `worktrain init`.
|
|
6480
|
+
|
|
6481
|
+
4. **Level 1 usage: run WorkTrain on its own backlog** -- Create `worktrain:ready` issues for the top 10 ready tasks, assign to `worktrain-etienneb`, observe one full queue → pipeline run. Collect data on misclassifications and weak PRs before designing the grooming loop.
|
|
6482
|
+
|
|
6483
|
+
5. **`worktrain inbox --watch`** -- Close the notification loop. Outbox exists, just needs the polling implementation.
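
For priority 2 above, the category union and the audit routing it would unlock might look like the following. The category values and the two audit workflow names come from this backlog; the `Finding` shape, the `auditWorkflowFor` name, and the choice to keep `production-readiness-audit` as the fallback for every non-architecture category are assumptions.

```typescript
type FindingCategory =
  | "correctness"
  | "security"
  | "architecture"
  | "ux"
  | "performance"
  | "testing";

// Hypothetical reduced finding; the real schema has more fields.
interface Finding {
  readonly findingCategory?: FindingCategory; // absent on older verdicts
}

function auditWorkflowFor(finding: Finding): string {
  if (finding.findingCategory === "architecture") {
    return "architecture-scalability-audit";
  }
  // Assumed fallback: matches today's behaviour for all other categories.
  return "production-readiness-audit";
}
```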
|