@exaudeus/workrail 3.59.3 → 3.59.5
- package/dist/console-ui/assets/{index-C8iMtnPv.js → index-Ctoxo1z6.js} +1 -1
- package/dist/console-ui/index.html +1 -1
- package/dist/coordinators/modes/full-pipeline.js +43 -11
- package/dist/coordinators/modes/implement-shared.js +84 -17
- package/dist/coordinators/modes/implement.js +18 -1
- package/dist/coordinators/pr-review.d.ts +1 -1
- package/dist/manifest.json +15 -15
- package/dist/trigger/trigger-listener.js +83 -72
- package/dist/trigger/trigger-router.js +4 -1
- package/docs/design/coordinator-in-process-await-candidates.md +128 -0
- package/docs/design/coordinator-in-process-await-design-review.md +93 -0
- package/docs/design/coordinator-io-error-handling-candidates.md +199 -0
- package/docs/design/coordinator-io-error-handling-design-review.md +120 -0
- package/docs/design/dispatch-dedup-prealloc-bypass-candidates.md +187 -0
- package/docs/design/dispatch-dedup-prealloc-bypass-design-review.md +100 -0
- package/docs/design/dispatch-dedup-prealloc-bypass-implementation-plan.md +218 -0
- package/docs/ideas/backlog.md +52 -0
- package/package.json +1 -1
@@ -0,0 +1,100 @@
# Design Review: Bypass Dispatch Dedup for Pre-Allocated Sessions

**Date:** 2026-04-19
**Reviewer:** Claude (automated design review pass)
**Selected approach:** Wrap dedup block in `if (workflowTrigger._preAllocatedStartResponse === undefined)`

---

## Tradeoff Review

### Shared dedup map stays shared

The `_recentAdaptiveDispatches` map remains shared across all dispatch paths. Pre-alloc calls
bypass the check but do not update the map.

**Assessment:** Sound. The map entry from `dispatchAdaptivePipeline()` correctly blocks duplicate
top-level pipelines for 30s. Pre-alloc calls are child sessions -- they should not add new map
entries. All cross-path scenarios analyzed; no violations found.

**Condition for unacceptability:** If a legitimate retry of the same goal+workspace (without
pre-alloc) must fire within 30s of the original dispatch. Unlikely in practice; TTL is 30s.
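
The cross-path behavior described above can be sketched as follows. This is a minimal illustration, not the actual `TriggerRouter` code: `SketchRouter`, `DispatchRequest`, the injected `now` clock, and the `enqueued` array are hypothetical stand-ins. The sketch also assumes that only `dispatchAdaptivePipeline()` writes the map and that `dispatch()` only reads it.

```typescript
// Hypothetical stand-ins for illustration; not the real TriggerRouter types.
interface DispatchRequest {
  goal: string;
  workspacePath: string;
  _preAllocatedStartResponse?: object;
}

const DEDUP_TTL_MS = 30_000;

class SketchRouter {
  // Shared across all dispatch paths, keyed by `goal::workspacePath`.
  private readonly recent = new Map<string, number>();
  private readonly now: () => number;
  readonly enqueued: DispatchRequest[] = [];

  constructor(now: () => number) {
    this.now = now;
  }

  // Top-level entry point. In this sketch it is the only writer of the map.
  dispatchAdaptivePipeline(goal: string, workspacePath: string): void {
    this.recent.set(`${goal}::${workspacePath}`, this.now());
  }

  dispatch(req: DispatchRequest): "enqueued" | "deduplicated" {
    // Pre-allocated session: the session already exists in the store, so
    // dropping this dispatch would zombie it. The dedup check is skipped
    // and the map is left untouched.
    if (req._preAllocatedStartResponse === undefined) {
      const last = this.recent.get(`${req.goal}::${req.workspacePath}`);
      if (last !== undefined && this.now() - last < DEDUP_TTL_MS) {
        return "deduplicated";
      }
    }
    this.enqueued.push(req); // single enqueue point for both paths
    return "enqueued";
  }
}
```

With the map primed by `dispatchAdaptivePipeline()`, a pre-alloc `dispatch()` for the same goal and workspace is enqueued, while a plain `dispatch()` inside the 30s window is still dropped.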

### Comment is the only regression protection

The guard `if (_preAllocatedStartResponse === undefined)` wrapping the dedup block has no
compile-time enforcement; the only automated protection against its removal is the unit test.

**Assessment:** Acceptable. The unit test `dispatch() with _preAllocatedStartResponse bypasses
dedup and calls runWorkflowFn` catches any regression. The JSDoc on the field and the guard
comment provide documentation-level protection.

---

## Failure Mode Review

| Mode | Handled? | Mitigation |
|---|---|---|
| Guard removed in refactor | Yes | Unit test catches it |
| Falsy check instead of `!== undefined` | Non-issue | Type is always an object when present |
| Semaphore deadlock | Not new risk | Pre-existing FIFO semaphore handles this |
| Session completes before enqueue | Not real | executeStartWorkflow doesn't run agent loop |

**Highest-risk:** Guard removed in refactor. Mitigated by unit test.

---

## Runner-Up / Simpler Alternative Review

- **Runner-up (early-return guard A):** Structurally equivalent. Loses because B keeps a single
  enqueue block, reducing duplication risk. No elements worth borrowing beyond the guard comment.
- **Simpler alternative (extract `_enqueueDispatch`):** Would be cleaner but is out of scope for
  this targeted fix. No correctness benefit.

---

## Philosophy Alignment

All relevant CLAUDE.md principles are satisfied:
- **Architectural fixes over patches** -- the guard models the root invariant
- **Make illegal states unrepresentable** -- compile-time discriminator
- **YAGNI with discipline** -- minimal change, no speculation
- **Document why, not what** -- guard comment explains invariant

Pre-existing tensions (mutable shared map) are not introduced by this fix.

---

## Findings

**No RED findings.** No blocking issues detected.

**ORANGE (advisory):**
1. The guard comment must be explicit about WHY dedup is bypassed, not just THAT it is bypassed.
   A vague comment like `// skip dedup for pre-alloc` is insufficient. The comment must state:
   'executeStartWorkflow already created the session in the store; dropping this dispatch would
   zombie it.'

**YELLOW (notes):**
1. The unit test for the bypass case should assert that `runWorkflowFn` is called exactly once
   (not just called), to verify the session actually starts.
2. Consider adding a log line in the bypass path: `console.log('[TriggerRouter] Pre-allocated session dispatched: workflowId=...')` for observability.

---

## Recommended Revisions

1. Write the guard comment to match the invariant stated in the design doc:
   ```typescript
   // Pre-allocated session: executeStartWorkflow already created the session in the store.
   // Deduplication must not apply here -- dropping this dispatch would zombie the session.
   ```
2. In the test: assert `calls` has exactly 1 entry (`toHaveLength(1)`) after the bypass dispatch.
3. Optional: add a `console.log` in the bypass path for daemon observability.

---

## Residual Concerns

None. The fix is sound, minimal, and directly models the documented invariant. All failure modes
are either handled or pre-existing. The design is ready for implementation.
@@ -0,0 +1,218 @@
# Implementation Plan: Bypass Dispatch Dedup for Pre-Allocated Sessions

**Date:** 2026-04-19
**Branch:** fix/dispatch-dedup-prealloc-bypass
**Scope:** src/trigger/trigger-router.ts only (src/mcp/ excluded)

---

## 1. Problem Statement

`TriggerRouter.dispatch()` has a 30-second deduplication guard that compares incoming
`goal::workspacePath` against `_recentAdaptiveDispatches`. When `dispatchAdaptivePipeline()`
runs, it writes this key. Milliseconds later, `spawnSession()` calls `dispatch()` with the same
goal and workspace (plus `_preAllocatedStartResponse`). The dedup guard fires, `queue.enqueue()`
is never called, and the session that was already written to the store by `executeStartWorkflow()`
zombies permanently.

---

## 2. Acceptance Criteria

1. `dispatch()` called with `_preAllocatedStartResponse` set bypasses the dedup check and calls
   `runWorkflowFn` exactly once.
2. `dispatch()` called WITHOUT `_preAllocatedStartResponse` still deduplicates correctly within 30s.
3. `npm run build` exits clean (no TypeScript errors).
4. `npx vitest run tests/unit/trigger-router.test.ts` -- all tests pass, including the new bypass test.
5. `npx vitest run` -- no regressions in any other test file.
6. PR merged to main via `gh pr merge <N> --squash`.
7. Daemon rebuilt and reinstalled (`npm run build && node dist/cli-worktrain.js daemon --install`).
8. `node dist/cli-worktrain.js trigger poll self-improvement` starts a session and `session_started`
   appears in the event log within 30s.

---

## 3. Non-Goals

- Do NOT touch `src/mcp/` (any file).
- Do NOT implement Option B (remove dedup from dispatch() entirely).
- Do NOT implement Option C (separate dedup maps).
- Do NOT touch `route()` or `dispatchAdaptivePipeline()`.
- Do NOT change the dedup TTL or map key format.

---

## 4. Philosophy-Driven Constraints

- Guard comment must explain WHY, not what (CLAUDE.md: 'Document why, not what').
- No code duplication: both the pre-alloc path and the normal path reach the same single
  `queue.enqueue()` call (CLAUDE.md: 'Compose with small, pure functions').
- Guard uses `!== undefined` not a falsy check (CLAUDE.md: 'Type safety as the first line of defense').

---

## 5. Invariants

1. When `_preAllocatedStartResponse !== undefined`, `queue.enqueue()` MUST be called.
2. When `_preAllocatedStartResponse === undefined`, the existing dedup check runs unchanged.
3. `_recentAdaptiveDispatches` is not updated for pre-alloc dispatch calls (the entry from
   `dispatchAdaptivePipeline()` remains and is the correct TTL anchor for top-level dedup).
4. The `assertNever` exhaustiveness guard in the enqueue callback remains intact.
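
Invariant 4 refers to the common TypeScript exhaustiveness pattern. A minimal sketch of that pattern follows. The `RunOutcome` union and `describeOutcome` are hypothetical and chosen only to show how `assertNever` turns an unhandled variant into a compile error; the real callback's result type is not shown in this plan.

```typescript
// Hypothetical result union; the real enqueue callback's type may differ.
type RunOutcome =
  | { _tag: "success" }
  | { _tag: "aborted"; reason: string }
  | { _tag: "timeout"; afterMs: number };

// If a variant is added to RunOutcome and not handled in the switch below,
// the assertNever call stops type-checking because `outcome` is no longer
// narrowed to `never` in the default branch.
function assertNever(value: never): never {
  throw new Error(`Unhandled variant: ${JSON.stringify(value)}`);
}

function describeOutcome(outcome: RunOutcome): string {
  switch (outcome._tag) {
    case "success":
      return "completed";
    case "aborted":
      return `aborted: ${outcome.reason}`;
    case "timeout":
      return `timed out after ${outcome.afterMs}ms`;
    default:
      return assertNever(outcome);
  }
}
```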

---

## 6. Selected Approach + Rationale

**Approach:** Wrap the dedup block in `dispatch()` with:
```typescript
if (workflowTrigger._preAllocatedStartResponse === undefined) {
  // ... existing dedup block ...
}
```
Both the pre-alloc path and the normal (post-dedup) path fall through to the same single
`void this.queue.enqueue(...)` call.

**Rationale:** Minimal blast radius. Single guard. No code duplication. Directly models the
documented invariant in `WorkflowTrigger._preAllocatedStartResponse` JSDoc.

**Runner-up:** Early-return guard before the dedup block (Candidate A from design review).
Lost because it risks duplicating the enqueue callback body. Structurally equivalent otherwise.

---

## 7. Vertical Slices

### Slice 1: Implementation fix in trigger-router.ts

**Scope:** `src/trigger/trigger-router.ts`, `dispatch()` method only (lines 847-933).

**Change:**
1. Add a guard comment before the dedup block explaining the pre-alloc invariant.
2. Wrap the scoped dedup block `{...}` in `if (workflowTrigger._preAllocatedStartResponse === undefined)`.
3. Add an optional `console.log` in the pre-alloc path for daemon observability.
4. The existing `void this.queue.enqueue(...)` call remains after the if-block.

**Acceptance criterion:** TypeScript compiles clean. The guard is visible and commented correctly.

---

### Slice 2: Unit tests in trigger-router.test.ts

**Scope:** `tests/unit/trigger-router.test.ts`, new describe block or additions to existing
'TriggerRouter.route and dispatch deduplication' describe block.

**Two tests (one new, one existing):**

**Test 1: dispatch() with _preAllocatedStartResponse bypasses dedup and calls runWorkflowFn**
- Call `dispatchAdaptivePipeline(goal, workspace)` to prime the dedup map.
- Then call `dispatch({ workflowId, goal, workspacePath: workspace, context: {}, _preAllocatedStartResponse: <fake> })`.
- Flush the async queue.
- Assert `expect(calls).toHaveLength(1)` -- runWorkflowFn was called exactly once.

**Test 2: dispatch() WITHOUT _preAllocatedStartResponse still deduplicates within 30s**
- This test already exists at line 1604. Verify it still passes after the change.
- No new test needed for this case; the existing test is the regression guard.

**Acceptance criterion:** Both tests pass (`vitest run tests/unit/trigger-router.test.ts`).

---

### Slice 3: Build, test, PR, CI, merge

**Steps:**
1. `npm run build` -- clean
2. `npx vitest run tests/unit/trigger-router.test.ts` -- all pass
3. `npx vitest run` -- no regressions
4. Create branch `fix/dispatch-dedup-prealloc-bypass`
5. Commit: `fix(trigger): bypass dispatch dedup for pre-allocated sessions to prevent zombie sessions`
6. Push + open PR
7. Wait for CI
8. Merge: `gh pr merge <N> --squash`

---

### Slice 4: Daemon reinstall and smoke test

**Steps:**
1. `npm run build && node dist/cli-worktrain.js daemon --install`
2. `node dist/cli-worktrain.js trigger poll self-improvement`
3. Watch the event log for at least 30s; `session_started` should appear within that window
4. Confirm the session is not a zombie (status completes or progresses past `run_started`)

---

## 8. Test Design

### New Test 1: Bypass case (primary regression test for this fix)

```typescript
it('dispatch(): bypasses dedup and calls runWorkflowFn when _preAllocatedStartResponse is set', async () => {
  vi.useFakeTimers();
  const { fn, calls } = makeFakeRunWorkflow();
  const trigger = makeTrigger();
  const router = new TriggerRouter(
    makeIndex(trigger), FAKE_CTX, FAKE_API_KEY, fn,
    undefined, undefined, undefined, undefined, undefined,
    FAKE_DEPS, executors,
  );

  const goal = trigger.goal;
  const workspace = trigger.workspacePath;

  // Prime the dedup map via dispatchAdaptivePipeline
  await router.dispatchAdaptivePipeline(goal, workspace);

  // Now dispatch with _preAllocatedStartResponse set -- must bypass dedup
  router.dispatch({
    workflowId: trigger.workflowId,
    goal,
    workspacePath: workspace,
    context: {},
    _preAllocatedStartResponse: {} as any, // non-undefined value triggers bypass
  });

  // Flush
  await new Promise((r) => setImmediate(r));

  // runWorkflowFn must have been called exactly once
  expect(calls).toHaveLength(1);
  vi.useRealTimers();
});
```

### Existing Test (regression guard): dispatch() dedup still works without _preAllocatedStartResponse

The test at lines 1604-1644 of trigger-router.test.ts covers this. It will serve as the
regression guard after the change.

---

## 9. Risk Register

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Guard removed in future refactor | Low | High | Unit test catches it in CI |
| Comment becomes stale | Low | Medium | Comment explains invariant, not mechanics -- less likely to go stale |
| _preAllocatedStartResponse type changes | Very low | Low | TypeScript would catch it at compile time |

---

## 10. PR Packaging Strategy

Single PR. One commit. No breaking changes.

Branch: `fix/dispatch-dedup-prealloc-bypass`
Commit: `fix(trigger): bypass dispatch dedup for pre-allocated sessions to prevent zombie sessions`

---

## 11. Philosophy Alignment

| Principle | Status | Why |
|---|---|---|
| Architectural fixes over patches | Satisfied | Guard models the root invariant, not a special-case |
| Make illegal states unrepresentable | Satisfied | `_preAllocatedStartResponse !== undefined` is a compile-time discriminator |
| YAGNI with discipline | Satisfied | Minimal change, no speculative abstractions |
| Document why, not what | Satisfied | Guard comment explains invariant |
| Type safety as first line of defense | Satisfied | `!== undefined` check, typed optional field |
| Immutability by default | Tension (pre-existing) | Shared mutable map; not introduced by this fix |

package/docs/ideas/backlog.md
CHANGED
@@ -7299,3 +7299,55 @@ The daemon heartbeat + DaemonRegistry membership is the key insight: if the sess
### Priority

Medium-high. True session status makes WorkTrain trustworthy as an autonomous system -- operators can see exactly what's happening without guessing. Especially important as session durations get longer (55-minute discovery sessions, 120-minute pipeline runs).

---

## Workflows tab: incorrect source attribution for bundled workflows (Apr 21, 2026)

**The bug:** The Workflows tab in the console shows bundled workflows (e.g. `coding-task-workflow-agentic`, `workflow-for-workflows`) as coming from "User Library" instead of "WorkRail Built-in". This is a WorkRail MCP server issue, not a WorkTrain issue.

**Likely cause:** The source attribution logic reads from the workflow's loaded source (the `WorkflowSource` type). When a workflow exists in both the bundled set AND a user's managed sources or remembered roots, the source returned is the one that "wins" in the storage layer -- which may be the user path rather than the bundled path. Alternatively, the `source.kind` field may be incorrectly set to `'personal'` for workflows that were loaded from the bundled workflows directory.

**Where to look:**
- `src/infrastructure/storage/schema-validating-workflow-storage.ts` -- source kind propagation
- `src/mcp/handlers/shared/workflow-source-visibility.ts` -- how source is mapped to display label in `list_workflows`
- `src/infrastructure/storage/file-workflow-storage.ts` -- how `source.kind` is assigned when loading from disk

**Expected behavior:** Workflows in the `workflows/` directory of the workrail package should always display as "WorkRail Built-in" regardless of whether the user also has a managed source that happens to include the same directory.
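
One way to express the expected behavior is to let the bundled source take precedence whenever a workflow ID resolves to more than one source. The sketch below is illustrative only: `SourceKind`, `LoadedWorkflow`, `displayLabels`, and the `'bundled'` kind name are assumptions, not the actual `WorkflowSource` type or the real mapping in `list_workflows`.

```typescript
// Assumed shape; the real WorkflowSource type is not shown in this entry.
type SourceKind = "bundled" | "personal";

interface LoadedWorkflow {
  id: string;
  sourceKind: SourceKind;
}

const LABELS: Record<SourceKind, string> = {
  bundled: "WorkRail Built-in",
  personal: "User Library",
};

// Resolve one display label per workflow ID. If any copy of a workflow was
// loaded from the bundled directory, the bundled label wins regardless of
// which copy the storage layer returned first.
function displayLabels(loaded: readonly LoadedWorkflow[]): Map<string, string> {
  const kinds = new Map<string, SourceKind>();
  for (const wf of loaded) {
    if (wf.sourceKind === "bundled" || !kinds.has(wf.id)) {
      kinds.set(wf.id, wf.sourceKind);
    }
  }
  const labels = new Map<string, string>();
  kinds.forEach((kind, id) => labels.set(id, LABELS[kind]));
  return labels;
}
```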

**Priority:** Low for WorkTrain (doesn't affect functionality). Medium for WorkRail MCP (misleading UI, users may think they accidentally modified bundled workflows).

---

## Coordinator-managed git state and agent crash recovery (Apr 21, 2026)

### Git state management (coordinator's job)

Before dispatching any WorkTrain session that does git work:
1. Check for `.git/index.lock` -- if present, verify the owning PID is dead (via `lsof` on macOS), then remove it
2. Abort any in-progress git operations: `git rebase --abort; git merge --abort`
3. Verify the workspace is in a clean state before handing off to the agent
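
The pre-dispatch checks above can be sketched as a small routine. Everything here is a hypothetical shape for illustration: `GitPreflightDeps` and its methods are injected stand-ins for the real filesystem, `lsof`, and `git` calls, which this entry does not specify. Using `git status --porcelain` as the cleanliness check is likewise an assumption.

```typescript
// Injected effects keep the sketch independent of any real process API.
interface GitPreflightDeps {
  lockExists(): boolean; // does .git/index.lock exist?
  lockOwnerAlive(): boolean; // e.g. derived from `lsof` output
  removeLock(): void;
  runGit(args: string[]): { ok: boolean; stdout: string };
}

type PreflightResult =
  | { ready: true }
  | { ready: false; reason: string };

function gitPreflight(deps: GitPreflightDeps): PreflightResult {
  // 1. Stale index.lock: remove it only if the owning process is gone.
  if (deps.lockExists()) {
    if (deps.lockOwnerAlive()) {
      return { ready: false, reason: "index.lock held by a live process" };
    }
    deps.removeLock();
  }

  // 2. Abort in-progress operations. Both commands fail harmlessly when
  //    there is nothing to abort, so their results are ignored.
  deps.runGit(["rebase", "--abort"]);
  deps.runGit(["merge", "--abort"]);

  // 3. Verify a clean workspace (assumed check: empty porcelain status).
  const status = deps.runGit(["status", "--porcelain"]);
  if (!status.ok || status.stdout.trim() !== "") {
    return { ready: false, reason: "workspace is not clean" };
  }
  return { ready: true };
}
```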

Every session that touches files gets a worktree (already implemented). The coordinator ensures worktrees are created cleanly and removed after session completion. The orphan TTL cleanup (24h) handles crash cases.

### Agent crash recovery (coordinator's job)

An agent can die from: stream watchdog timeout (600s no progress), OOM kill, or SIGKILL. In all cases the session event log is intact -- the full conversation history is preserved.

**The coordinator should detect and recover automatically:**

1. Monitor child sessions via `worktrain await`
2. If a session returns `_tag: 'aborted'` or `_tag: 'timeout'` mid-pipeline:
   - Check if the session made meaningful progress (step advances > 0, or notes written)
   - If yes: resume the session -- same session ID, same context, agent picks up at last checkpoint
   - If no (zero progress): retry from scratch with a fresh session, same context bundle
3. Retry up to N times (configurable, default 2) before escalating to Human Outbox
4. Track which phase failed and inject a hint on retry: "Previous attempt failed at this step. Retry with fresh approach."
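
The decision rules in the list above reduce to a small pure function. This is a sketch under assumptions: `SessionOutcome`, `Progress`, `RecoveryAction`, and `decideRecovery` are hypothetical names. Only the `_tag` values `'aborted'` and `'timeout'`, the progress test, and the default of 2 retries are taken from the text.

```typescript
// Hypothetical types; the real session result shape is not shown here.
type SessionOutcome =
  | { _tag: "success" }
  | { _tag: "aborted" }
  | { _tag: "timeout" };

interface Progress {
  stepAdvances: number;
  notesWritten: boolean;
}

type RecoveryAction = "continue" | "resume" | "retry_fresh" | "escalate";

function decideRecovery(
  outcome: SessionOutcome,
  progress: Progress,
  attemptsSoFar: number, // retries already spent on this phase
  maxRetries: number = 2,
): RecoveryAction {
  if (outcome._tag === "success") return "continue";
  // Retry budget exhausted: hand off to the Human Outbox.
  if (attemptsSoFar >= maxRetries) return "escalate";
  // Meaningful progress: resume the same session from its last checkpoint.
  if (progress.stepAdvances > 0 || progress.notesWritten) return "resume";
  // Zero progress: start a fresh session with the same context bundle.
  return "retry_fresh";
}
```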

**This is session continuation applied to crash recovery.** The agent's conversation history is fully preserved. Resuming puts it back exactly where it was. The 600s watchdog timeout (the most common failure) almost always means a hung LLM call or a tool timeout -- resuming naturally retries the step.

**Implementation:** `runFullPipeline` and `runImplementPipeline` already have per-phase error handling. Extend each `awaitSessions` call: on non-success outcome, attempt resume before returning escalated. The resume logic is `worktrain session continue <sessionId>` (once that command exists) or `dispatchAdaptivePipeline` with the existing session's context.

### Priority

High. Agent crash recovery makes the overnight-autonomous bar achievable. Without it, any hung LLM call or tool timeout fails the entire pipeline silently. With it, transient failures are automatically retried and the pipeline continues.