@exaudeus/workrail 3.59.2 → 3.59.4

@@ -0,0 +1,128 @@
1
+ # Design Candidates: In-Process awaitSessions and getAgentResult
2
+
3
+ **Date:** 2026-04-19
4
+ **Task:** Replace HTTP-to-self `awaitSessions` and `getAgentResult` in `src/trigger/trigger-listener.ts` with in-process `ConsoleService` calls.
5
+
6
+ ---
7
+
8
+ ## Problem Understanding
9
+
10
+ ### Tensions
11
+
12
+ 1. **Construction order vs. dependency injection**: `coordinatorDeps` must be constructed before `TriggerRouter` (it is a constructor argument), but `ConsoleService` needs to be available inside the closure. The `routerRef` forward-reference pattern already solves a similar ordering problem -- `consoleService` can be constructed before the closure and captured by the closure.
13
+
14
+ 2. **Graceful degradation vs. correctness**: If `ctx.v2.dataDir` or `ctx.v2.directoryListing` is null (the daemon-console.ts path guards against this), should `awaitSessions` degrade to returning all-failed or crash? The design doc says construct before `coordinatorDeps` -- the guard should produce a logged warning and graceful fallback since the coordinator handles `allSucceeded: false`.
15
+
16
+ 3. **Session visibility race**: Sessions created in-process by `spawnSession()` may not be immediately readable via `getSessionDetail()`. This is why `SESSION_LOAD_FAILED` must be treated as "not ready yet" (retry), not as failure.
17
+
18
+ 4. **Interface cleanliness vs. minimal scope**: `CoordinatorDeps.port` is a required field (`readonly port: number`) that is never read by coordinator logic (`grep deps.port` returns nothing). Removing `port: DAEMON_CONSOLE_PORT` from the trigger-listener deps object requires either (a) making `port` optional in the interface, or (b) using a `0` sentinel.
19
+
20
+ ### Likely Seam
21
+
22
+ The composition root `startTriggerListener()` in `src/trigger/trigger-listener.ts` is the correct and only seam. This is where all other deps are wired, `ctx.v2.*` ports are available, and `ConsoleService` can be constructed.
23
+
24
+ ### What Makes This Hard
25
+
26
+ - `CoordinatorDeps.port` is required but unused by coordinator logic. The design doc says to remove it, but the interface requires it. TypeScript will reject omitting a required field.
27
+ - The `error` terminal status mentioned in the task description does not exist in `ConsoleRunStatus` (`'in_progress' | 'complete' | 'complete_with_gaps' | 'blocked'`). The design doc is authoritative.
28
+ - The dynamic import pattern is required to avoid circular dependency (same as `daemon-console.ts:113`).
29
+
30
+ ---
31
+
32
+ ## Philosophy Constraints
33
+
34
+ From `/Users/etienneb/CLAUDE.md`:
35
+ - **Architectural fixes over patches** -- this fix IS the architectural fix (in-process instead of HTTP)
36
+ - **Immutability by default** -- `pending Set` mutation is minimal and contained in the polling loop
37
+ - **Errors are data** -- use `.isOk()` / `.isErr()` on `ResultAsync`, not try/catch
38
+ - **Validate at boundaries, trust inside** -- guard `ctx.v2.dataDir` at construction time
39
+ - **Document "why", not "what"** -- add WHY comments on new implementations
40
+
41
+ No philosophy conflicts with repo patterns -- `daemon-console.ts` already uses the exact same construction approach.
42
+
43
+ ---
44
+
45
+ ## Impact Surface
46
+
47
+ - **`AdaptiveCoordinatorDeps` interface**: No change (extends `CoordinatorDeps`, no new fields needed)
48
+ - **`CoordinatorDeps` interface** (`src/coordinators/pr-review.ts:210`): `port` is required (`readonly port: number`) -- must be made optional if it is removed from the trigger-listener deps
49
+ - **CLI path** (`src/cli-worktrain.ts:1549`): sets `port` in deps object for out-of-process coordinator -- remains correct either way
50
+ - **Pipeline coordinators** (`full-pipeline.ts`, `implement.ts`, `pr-review.ts`): call `awaitSessions`/`getAgentResult` by interface -- behavior change is transparent
51
+ - **`src/mcp/`**: Not touched (explicit out-of-scope)
52
+
53
+ ---
54
+
55
+ ## Candidates
56
+
57
+ ### Candidate A: Design doc implementation with `port` made optional
58
+
59
+ **Summary**: Implement exactly per design doc. Construct `ConsoleService` locally in `startTriggerListener()`. Replace `awaitSessions` and `getAgentResult` with in-process calls. Make `readonly port: number` optional (`readonly port?: number`) in `CoordinatorDeps` to allow removing it from the trigger-listener deps object.
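
A minimal sketch of the interface change described above, assuming a trimmed-down `CoordinatorDeps`. Only `port` and `awaitSessions` are shown; the `AwaitResult` shape and the port value `3456` are illustrative and not taken from the repo.

```typescript
// Hypothetical, trimmed-down shapes for illustration only.
interface AwaitResult {
  readonly allSucceeded: boolean;
}

interface CoordinatorDeps {
  // Optional: the in-process trigger-listener path no longer supplies a port.
  readonly port?: number;
  readonly awaitSessions: (handles: readonly string[]) => Promise<AwaitResult>;
}

// In-process wiring: `port` is simply omitted, no `0` sentinel needed.
const inProcessDeps: CoordinatorDeps = {
  awaitSessions: async () => ({ allSucceeded: true }),
};

// Out-of-process (CLI-style) wiring can still pass a port; 3456 is a made-up value.
const cliDeps: CoordinatorDeps = {
  port: 3456,
  awaitSessions: async () => ({ allSucceeded: true }),
};
```

With the field optional, any future reader of `deps.port` gets `number | undefined` and has to handle the absent case explicitly, which is the property the sentinel approach lacks.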
60
+
61
+ **Tensions resolved**: All four tensions resolved cleanly. `SESSION_LOAD_FAILED` = retry. Guard for `ctx.v2` nulls. `port` field made honest (optional, unused by logic).
62
+
63
+ **Tension accepted**: Requires touching `src/coordinators/pr-review.ts` for one-char interface change.
64
+
65
+ **Boundary**: `startTriggerListener()` composition root -- correct seam.
66
+
67
+ **Why best fit**: Fully executes design doc intent. `deps.port` confirmed unused by all coordinator logic.
68
+
69
+ **Failure mode**: None identified. `grep deps.port` confirmed zero usages in coordinator logic.
70
+
71
+ **Repo pattern**: Follows `daemon-console.ts` construction pattern exactly. Follows `spawnSession` in-process migration pattern.
72
+
73
+ **Gains**: Clean interface, no dead required field, full design doc compliance, no `0` sentinel.
74
+
75
+ **Losses**: Touches `src/coordinators/pr-review.ts` (one-char change).
76
+
77
+ **Scope judgment**: Best-fit. The scope restriction says "do not touch `src/mcp/`" not "do not touch `src/coordinators/`".
78
+
79
+ **Philosophy fit**: Honors "architectural fixes over patches", "make illegal states unrepresentable" (no sentinel), "errors are data".
80
+
81
+ ---
82
+
83
+ ### Candidate B: Design doc implementation, keep `port: 0` sentinel
84
+
85
+ **Summary**: Same `ConsoleService` construction and awaitSessions/getAgentResult replacement, but set `port: 0` in the trigger-listener deps object instead of making the interface field optional.
86
+
87
+ **Tensions resolved**: Removes HTTP-to-self bugs. Zero interface changes.
88
+
89
+ **Tension accepted**: Leaves dead required field with a misleading `0` value.
90
+
91
+ **Failure mode**: Future code reads `deps.port` and uses `0` as a real port, producing silent bugs.
92
+
93
+ **Repo pattern**: Departs from design doc intent ("remove the constant and port from coordinatorDeps").
94
+
95
+ **Gains**: No interface touch; purely local change.
96
+
97
+ **Losses**: Interface stays polluted with unused required field; violates design doc intent; `0` sentinel is an illegal state that can be constructed.
98
+
99
+ **Scope judgment**: Too narrow (doesn't fully execute design doc intent).
100
+
101
+ **Philosophy fit**: Conflicts with "make illegal states unrepresentable" and "architectural fixes over patches".
102
+
103
+ ---
104
+
105
+ ## Comparison and Recommendation
106
+
107
+ **Recommendation: Candidate A.**
108
+
109
+ `deps.port` is confirmed unused by any coordinator logic (exhaustive grep). Making it optional is a one-character change that eliminates a dead field and fully executes the design doc intent. The scope restriction is explicitly "do not touch `src/mcp/`" -- `src/coordinators/pr-review.ts` is in scope. Candidate A is the correct architectural fix.
110
+
111
+ The `0` sentinel in Candidate B is precisely the kind of "patch over architectural fix" that CLAUDE.md's philosophy warns against.
112
+
113
+ ---
114
+
115
+ ## Self-Critique
116
+
117
+ **Strongest argument against Candidate A**: A conservative interpretation of "only touch trigger-listener.ts" would favor the sentinel. If the reviewer intended zero interface changes, Candidate B is the safe choice.
118
+
119
+ **Pivot condition**: If touching `pr-review.ts` causes unexpected test failures (e.g., tests construct `CoordinatorDeps` with `port` required and would need to add `port: undefined`), fall back to `port: 0` sentinel or make it optional with a default. But since `port` is already unused, this risk is low.
120
+
121
+ **Assumption that would invalidate**: If some test or code path actually reads `deps.port` and would break if `0` is used or the field is absent. The grep confirms this does not exist.
122
+
123
+ ---
124
+
125
+ ## Open Questions for Main Agent
126
+
127
+ 1. Should the guard for `ctx.v2.dataDir === undefined` cause a process.stderr warning only, or should it cause `startTriggerListener` to return an `err`? (Daemon-console.ts returns `err` -- but trigger-listener has already started by this point. Recommendation: warn + let `awaitSessions`/`getAgentResult` return degraded results.)
128
+ 2. Should the `consoleService` local variable be constructed inside a try/catch or guarded more defensively? (No -- the constructor is synchronous and cannot throw given valid inputs.)
@@ -0,0 +1,93 @@
1
+ # Design Review: In-Process awaitSessions and getAgentResult
2
+
3
+ **Date:** 2026-04-19
4
+ **Candidate reviewed:** Candidate A from `coordinator-in-process-await-candidates.md`
5
+
6
+ ---
7
+
8
+ ## Tradeoff Review
9
+
10
+ **Tradeoff: `port` field made optional in `CoordinatorDeps`**
11
+ - Verified: `deps.port` is unused by all coordinator logic (grep returns zero results)
12
+ - Condition that invalidates: TypeScript errors showing `port` required elsewhere
13
+ - Mitigation: Run `npm run build` immediately; pivot to `port: 0` sentinel if errors appear
14
+ - **Verdict: Acceptable**
15
+
16
+ **Tradeoff: `null consoleService` fallback for missing `ctx.v2` ports**
17
+ - Verified: `createToolContext()` always provides non-null values in production
18
+ - Condition that invalidates: Not applicable in production path
19
+ - Mitigation: Warn on stderr + degrade gracefully to same all-failed behavior as current HTTP failure
20
+ - **Verdict: Acceptable**
21
+
22
+ **Tradeoff: Pending-Set polling over check-all-handles-every-poll**
23
+ - Verified: Terminal state transitions are monotonic (event log is append-only)
24
+ - Condition that invalidates: Not possible given ConsoleRunStatus projection semantics
25
+ - **Verdict: Correct and more efficient**
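
The pending-Set polling tradeoff above can be sketched roughly as follows. This is an illustration under assumptions, not the shipped implementation: the `DetailResult` shape, the `getDetail` callback, and the choice to count `complete` and `complete_with_gaps` as success are all hypothetical. Only the four status names and the `SESSION_LOAD_FAILED` retry rule come from the design docs.

```typescript
// Hypothetical shapes; the real ConsoleService API may differ.
type RunStatus = 'in_progress' | 'complete' | 'complete_with_gaps' | 'blocked';
type DetailResult =
  | { ok: true; status: RunStatus }
  | { ok: false; error: 'SESSION_LOAD_FAILED' };
type GetDetail = (handle: string) => Promise<DetailResult>;

async function awaitSessionsSketch(
  handles: readonly string[],
  getDetail: GetDetail,
  maxPolls: number,
  sleepMs: number,
): Promise<{ allSucceeded: boolean; statuses: Map<string, RunStatus> }> {
  const pending = new Set(handles);
  const statuses = new Map<string, RunStatus>();

  for (let poll = 0; poll < maxPolls && pending.size > 0; poll++) {
    // Only handles still pending are re-checked; terminal states never revert.
    for (const handle of Array.from(pending)) {
      const detail = await getDetail(handle);
      // SESSION_LOAD_FAILED means "not visible yet": leave it pending and retry.
      if (!detail.ok || detail.status === 'in_progress') continue;
      statuses.set(handle, detail.status);
      pending.delete(handle);
    }
    if (pending.size > 0) {
      await new Promise<void>((resolve) => setTimeout(resolve, sleepMs));
    }
  }

  // Assumption: anything still pending, or ending in 'blocked', counts as not succeeded.
  const allSucceeded =
    pending.size === 0 &&
    Array.from(statuses.values()).every((s) => s === 'complete' || s === 'complete_with_gaps');
  return { allSucceeded, statuses };
}
```

Because a handle leaves `pending` the first time it reports a terminal status, each later poll touches fewer sessions than a check-all-handles loop would.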
26
+
27
+ ---
28
+
29
+ ## Failure Mode Review
30
+
31
+ | Failure Mode | Handled? | Risk |
+ |---|---|---|
+ | `ctx.v2.dataDir`/`directoryListing` null | Yes -- null guard + graceful fallback | Low |
+ | Session not yet visible after spawnSession | Yes -- SESSION_LOAD_FAILED treated as retry | Low |
+ | TypeScript error from port optional | Partially -- pivot to sentinel if needed | Low |
+ | ConsoleService circular dependency | Yes -- dynamic import pattern | Low |
+ | getSessionDetail/getNodeDetail unexpected throw | Yes -- ResultAsync + outer try/catch | Low |
38
+
39
+ **Highest-risk**: Making `port` optional could cause a TypeScript build failure. The pivot (the `port: 0` sentinel) is defined and trivial.
40
+
41
+ ---
42
+
43
+ ## Runner-Up / Simpler Alternative Review
44
+
45
+ **Candidate B** (port: 0 sentinel): Nothing worth borrowing. The only advantage was zero interface changes, but it introduces a dead required field with a misleading sentinel value.
46
+
47
+ **Simpler variant** (keep `port: DAEMON_CONSOLE_PORT`): Insufficient -- leaves dead code after removing the HTTP deps that needed it. Design doc explicitly lists port removal as required.
48
+
49
+ **No hybrid needed.**
50
+
51
+ ---
52
+
53
+ ## Philosophy Alignment
54
+
55
+ | Principle | Status |
+ |---|---|
+ | Architectural fixes over patches | SATISFIED -- root-cause fix, not workaround |
+ | Make illegal states unrepresentable | SATISFIED -- no sentinel, optional instead |
+ | Dependency injection for boundaries | SATISFIED -- ConsoleService gets ports injected |
+ | Immutability by default | SATISFIED -- minimal mutable state (pending Set only) |
+ | Errors are data | SATISFIED -- ResultAsync.isOk()/isErr() throughout |
+ | YAGNI with discipline | SATISFIED -- no speculative abstractions |
+ | Document "why", not "what" | REQUIRED -- add WHY comments in implementation |
64
+
65
+ One acceptable tension: null consoleService uses nullable variable rather than Result type. Acceptable at initialization boundary (not domain logic).
66
+
67
+ ---
68
+
69
+ ## Findings
70
+
71
+ **No Red (blocking) findings.**
72
+
73
+ **Orange (should address before shipping):**
74
+ - None identified.
75
+
76
+ **Yellow (advisory):**
77
+ - Y1: The `ctx.v2` null guard creates a nullable `consoleService` variable. If `ctx.v2` is ever null in production, the stderr warning may be missed. Recommend making the warning prominent: `[CRITICAL trigger-listener:reason=consoleService_unavailable]` prefix.
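
One way the Y1 guard and its degraded fallback might look, as a sketch only. The `V2Ports` and `ConsoleServiceLike` shapes, both function names, and the warning text after the prefix are hypothetical; the log prefix and the all-failed fallback come from the review.

```typescript
// Hypothetical shapes; names other than the log prefix are illustrative.
interface V2Ports {
  readonly dataDir: string | null;
  readonly directoryListing: object | null;
}
interface ConsoleServiceLike {
  readonly dataDir: string;
}

function buildConsoleService(
  v2: V2Ports,
  warn: (line: string) => void,
): ConsoleServiceLike | null {
  if (v2.dataDir === null || v2.directoryListing === null) {
    // Prominent prefix so the degraded mode is easy to find in logs.
    warn(
      '[CRITICAL trigger-listener:reason=consoleService_unavailable] ' +
        'ctx.v2 ports missing; awaitSessions will report failure',
    );
    return null;
  }
  return { dataDir: v2.dataDir };
}

// Degraded path: with no service, report all-failed instead of throwing.
async function awaitSessionsOrDegrade(
  service: ConsoleServiceLike | null,
  handles: readonly string[],
): Promise<{ allSucceeded: boolean; failed: readonly string[] }> {
  if (service === null) return { allSucceeded: false, failed: handles };
  return { allSucceeded: true, failed: [] }; // real polling elided in this sketch
}
```

The nullable variable is checked once per call, so the coordinator sees the same `allSucceeded: false` result it already handles today.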
78
+
79
+ ---
80
+
81
+ ## Recommended Revisions
82
+
83
+ 1. Add `[CRITICAL]` prefix to the `ctx.v2` null guard warning to make it visible in logs.
84
+ 2. Add WHY comment on new `awaitSessions` explaining the in-process approach (mirrors the spawnSession WHY comment pattern).
85
+ 3. Add WHY comment on `ConsoleService` construction explaining it avoids the HTTP race condition.
86
+
87
+ ---
88
+
89
+ ## Residual Concerns
90
+
91
+ - **None blocking.** The design is sound, the tradeoffs are acceptable, and the failure modes are covered.
92
+ - Build verification (`npm run build`) will immediately catch any TypeScript issues from the `port` optional change.
93
+ - The implementation is a near-direct transcription of the design doc pseudocode, reducing creative risk.
@@ -0,0 +1,199 @@
1
+ # Coordinator I/O Error Handling -- Design Candidates
2
+
3
+ Generated: 2026-04-19
4
+
5
+ ## Problem Understanding
6
+
7
+ ### Core Tensions
8
+
9
+ 1. **Crash-safety vs. DI purity**: The coordinator declares "all phase failures produce
10
+ `PipelineOutcome { kind: 'escalated' }` -- never thrown" as a design invariant, but three
11
+ injected dep functions (`getAgentResult`, `postToOutbox`, `pollForPR`) are called without
12
+ try/catch in the mode files. Any throw from these functions crashes the coordinator silently
13
+ instead of returning a structured `PipelineOutcome`. The fix must enforce the invariant at
14
+ the right boundary.
15
+
16
+ 2. **Verbosity vs. DRY**: `postToOutbox` is called at 8+ critical escalation points across
17
+ `implement-shared.ts` and `full-pipeline.ts`. Each call site needs individual protection.
18
+ Inline try/catch at 8 sites is repetitive; a helper would reduce duplication but adds
19
+ abstraction not in the existing codebase pattern.
20
+
21
+ 3. **`process.stderr.write()` vs. `deps.stderr()`**: The prescribed pattern uses
22
+ `process.stderr.write()` in catch blocks, but the rest of the coordinator uses the injected
23
+ `deps.stderr()`. The tension is minor -- catch blocks represent unexpected I/O failures,
24
+ so using `process.stderr.write()` signals this is an emergency log path, not a normal
25
+ operational log.
26
+
27
+ ### Likely Seam
28
+
29
+ The mode files are the correct seam. `implement-shared.ts`, `full-pipeline.ts`, and
30
+ `implement.ts` are the callers of the three unsafe deps. The coordinator owns the
31
+ escalation-first invariant -- not the injectors (`trigger-listener.ts`, `cli-worktrain.ts`).
32
+
33
+ ### What Makes This Hard
34
+
35
+ - `postToOutbox` calls are immediately followed by `return { kind: 'escalated', ... }`. The
36
+ try/catch must wrap ONLY the `postToOutbox` call, not the return statement. Careful
37
+ placement required.
38
+ - `pollForPR` is called in BOTH `implement.ts` (explicitly mentioned in task) AND
39
+ `full-pipeline.ts` line 454 (not mentioned but equally unsafe). Both must be wrapped.
40
+ - UX gate zombie detection: `implement.ts` line 144 assigns `uxHandle` from `uxSpawnResult.value`
41
+ without a null/empty-string guard before passing to `awaitSessions`. This is the only session
42
+ handle in the coordinator without the guard -- a separate gap alongside the I/O error handling.
43
+
44
+ ---
45
+
46
+ ## Philosophy Constraints
47
+
48
+ **From `CLAUDE.md`:**
49
+ - "Errors are data -- represent failure as values (Result/Either), not exceptions as control flow"
50
+ - "Type safety as the first line of defense"
51
+
52
+ **From `adaptive-pipeline.ts` header (design invariant):**
53
+ - "All phase failures produce PipelineOutcome { kind: 'escalated' } -- never thrown."
54
+ - "All I/O is injected via AdaptiveCoordinatorDeps. Zero direct fs/fetch/exec imports."
55
+
56
+ **Repo precedent:**
57
+ - `archiveFile` (in `implement.ts` and `full-pipeline.ts`): try/catch inline in finally block, log-and-continue. This is the exact model for `postToOutbox`.
58
+ - `writeFile` routing log (in `adaptive-pipeline.ts`): try/catch inline, log-and-continue.
59
+
60
+ **Conflicts:** None material. The stated philosophy (errors as data) and the repo pattern (inline try/catch for non-Result deps) are consistent.
61
+
62
+ ---
63
+
64
+ ## Impact Surface
65
+
66
+ - `runReviewAndVerdictCycle` is called from both `implement.ts` and `full-pipeline.ts`. Fixing
67
+ `getAgentResult` in `implement-shared.ts` protects both callers automatically.
68
+ - `runAuditChain` (also in `implement-shared.ts`) calls both `getAgentResult` and `postToOutbox`
69
+ at multiple points.
70
+ - `adaptive-pipeline.ts` line 362 calls `postToOutbox` in the `ESCALATE` routing case -- this is
71
+ OUT OF SCOPE for this task (task restricts changes to the 3 mode files).
72
+ - No callers outside these files change signature or return type.
73
+
74
+ ---
75
+
76
+ ## Candidates
77
+
78
+ ### Candidate 1: Inline try/catch at each call site (prescribed pattern)
79
+
80
+ **Summary:** Wrap each `getAgentResult`, `postToOutbox`, and `pollForPR` call individually in
81
+ a try/catch block in the 3 mode files.
82
+
83
+ **Tensions resolved:** Crash-safety fully addressed. Accepts: slight verbosity from 8+
84
+ `postToOutbox` sites.
85
+
86
+ **Boundary:** At the mode file call sites -- the correct boundary. The coordinator owns the
87
+ escalation-first invariant; the mode files are where the invariant must be enforced.
88
+
89
+ **Failure mode:** Missing the `pollForPR` call in `full-pipeline.ts` (not explicitly called out
90
+ in task description but confirmed unsafe by code analysis). Must be systematic.
91
+
92
+ **Repo-pattern relationship:** Follows `archiveFile` try/catch pattern exactly. Adapts
93
+ `writeFile` routing-log pattern from `adaptive-pipeline.ts`.
94
+
95
+ **Gains:** Zero risk to happy path. Locally visible -- reviewer can see exactly what is
96
+ protected at each call site. No new abstractions.
97
+
98
+ **Losses:** Mildly repetitive for `postToOutbox` sites. Functions grow slightly.
99
+
100
+ **Scope:** Best-fit. The 3 files are exactly the seam.
101
+
102
+ **Philosophy fit:** Honors "errors are data", "escalation-first invariant", "DI for boundaries".
103
+ Minor: uses `process.stderr.write()` in catch blocks rather than `deps.stderr()`, consistent
104
+ with prescribed pattern and emergency-log semantics.
105
+
106
+ ---
107
+
108
+ ### Candidate 2: Wrap at injection site (safe wrapper functions)
109
+
110
+ **Summary:** Wrap `getAgentResult`, `postToOutbox`, `pollForPR` in safe adapter functions at
111
+ the injection sites (`trigger-listener.ts`, `cli-worktrain.ts`) so the deps never throw from
112
+ the coordinator's perspective.
113
+
114
+ **Tensions resolved:** DI purity -- the mode files stay clean. Accepts: changes to 2 files
115
+ outside the permitted scope.
116
+
117
+ **Boundary:** At the injection layer. Wrong boundary for this task -- the coordinator owns the
118
+ escalation invariant, not the injectors. Injectors wire up the real implementation; they are
119
+ not responsible for the coordinator's recovery behavior.
120
+
121
+ **Failure mode:** Wrapping at injection site catches throws but cannot return `PipelineOutcome`
122
+ -- would need to return null or a sentinel, which the mode files then check. Adds complexity
123
+ at both ends, solving neither fully.
124
+
125
+ **Repo-pattern relationship:** Departs -- existing injected deps use Result types for
126
+ error-returning deps; plain-promise deps are not wrapped at injection sites.
127
+
128
+ **Scope:** Too broad -- reaches outside the permitted 3 files.
129
+
130
+ **Verdict: Rejected.** Out of scope and wrong seam.
131
+
132
+ ---
133
+
134
+ ### Candidate 3: Private helper `safePostToOutbox(deps, msg, meta)`
135
+
136
+ **Summary:** Extract a private helper that wraps `deps.postToOutbox` in try/catch, reducing
137
+ repetition at the 8+ `postToOutbox` call sites.
138
+
139
+ **Tensions resolved:** DRY for `postToOutbox`. Accepts: new abstraction not prescribed by task.
140
+
141
+ **Boundary:** Same 3 mode files, plus a local helper function in `implement-shared.ts`.
142
+
143
+ **Failure mode:** Helper abstraction obscures the try/catch from reviewers; may hide future
144
+ misuse (e.g., someone using the helper for a call that SHOULD escalate on failure).
145
+
146
+ **Repo-pattern relationship:** No precedent for dep-wrapper helpers in the mode files. `archiveFile`
147
+ try/catch is inline without a helper.
148
+
149
+ **Scope:** Best-fit only if `postToOutbox` had 15+ sites. At 8, YAGNI says no.
150
+
151
+ **Verdict: Skipped.** Task spec gives explicit inline pattern. YAGNI applies.
152
+
153
+ ---
154
+
155
+ ## Comparison and Recommendation
156
+
157
+ **Candidate 1 is the clear choice.**
158
+
159
+ All three candidates converge on the same underlying mechanism (try/catch). The only real
160
+ alternatives differ in location (injection site -- wrong boundary) or DRY abstraction (helper --
161
+ not warranted at 8 sites). Convergence is honest here.
162
+
163
+ Candidate 1:
164
+ - Follows the prescribed pattern from the task description exactly
165
+ - Follows the repo precedent (`archiveFile`, `writeFile` try/catch)
166
+ - Is locally visible and reviewable
167
+ - Carries zero happy-path risk
168
+ - Can be applied systematically to all confirmed call sites
169
+
170
+ ---
171
+
172
+ ## Self-Critique
173
+
174
+ **Strongest argument against:** The 8+ `postToOutbox` call sites produce repetitive code. If
175
+ the count grew to 20+, a helper would be clearly warranted. At 8, the verbosity is manageable.
176
+
177
+ **Narrower option that might work:** Only fix `getAgentResult` and `pollForPR` (HIGH severity),
178
+ skip `postToOutbox` wrapping (MEDIUM severity). Would reduce scope. Loses: `postToOutbox` crash
179
+ at escalation decision points is still a real failure mode that kills the pipeline silently.
180
+
181
+ **Broader option:** Candidate 2 (wrap at injection). Would be justified only if there were a
182
+ precedent of wrapping injected deps at the injection layer. No such precedent exists.
183
+
184
+ **Invalidating assumption:** If `postToOutbox` is guaranteed never to throw in production (e.g.,
185
+ the real impl is in-memory rather than disk-based). The audit doc confirms it uses
186
+ `fs.promises.appendFile` -- can fail on disk full or permission error. Assumption holds.
187
+
188
+ ---
189
+
190
+ ## Open Questions for Main Agent
191
+
192
+ No blocking questions. The problem, solution, and scope are fully specified, and implementation is mechanical. Three points to confirm during implementation:
193
+
194
+ - Confirm `pollForPR` in `full-pipeline.ts` line 454 also needs wrapping (not explicitly in
195
+ task description but confirmed unsafe -- include it).
196
+ - For `postToOutbox`: the task says "log a warning and continue". Use `process.stderr.write()`
197
+ as prescribed, not `deps.stderr()`.
198
+ - UX gate zombie detection in `implement.ts`: add `if (!uxHandle || uxHandle.trim() === '')` guard
199
+ after line 144, consistent with all 9 other session handle checks in the coordinator.
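
The UX gate guard in the last bullet could be sketched as below. The `Escalated` shape mirrors the outcome object used elsewhere in these docs, but the helper name `guardSessionHandle` and the reason string are hypothetical; the condition itself is the one quoted above.

```typescript
// Hypothetical outcome shape, trimmed to what the guard needs.
type Escalated = {
  kind: 'escalated';
  escalationReason: { phase: string; reason: string };
};

// Returns an escalation when the handle is missing or blank, otherwise null.
function guardSessionHandle(
  handle: string | null | undefined,
  phase: string,
): Escalated | null {
  if (!handle || handle.trim() === '') {
    return {
      kind: 'escalated',
      escalationReason: { phase, reason: 'spawn returned an empty session handle' },
    };
  }
  return null;
}
```

Checking before `awaitSessions` means a blank handle escalates immediately rather than being polled until timeout.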
@@ -0,0 +1,120 @@
1
+ # Coordinator I/O Error Handling -- Design Review Findings
2
+
3
+ Generated: 2026-04-19
4
+
5
+ ## Tradeoff Review
6
+
7
+ ### Verbosity (8+ `postToOutbox` inline try/catch sites)
8
+
9
+ - **Verdict:** Acceptable. At 8 sites, YAGNI wins. The `archiveFile` precedent in the same files
10
+ shows inline try/catch is the established pattern.
11
+ - **Break condition:** If `postToOutbox` call sites grow to 15+, extract a private helper.
12
+ - **Hidden assumption:** `postToOutbox` call count stays roughly constant in the near term.
13
+
14
+ ### `deps.stderr()` vs. `process.stderr.write()` in catch blocks
15
+
16
+ - **Verdict:** Use `deps.stderr()` to match the existing `archiveFile` pattern in `implement.ts`
17
+ and `full-pipeline.ts`. The task spec example uses `process.stderr.write()` but the actual repo
18
+ uses `deps.stderr()` for the identical use case. `deps.stderr()` is more consistent and testable.
19
+ - **Break condition:** None. `deps.stderr()` is strictly better here.
20
+
21
+ ### `pollForPR` in `full-pipeline.ts` not in task description but included
22
+
23
+ - **Verdict:** Include it. It's the same unsafe call pattern, same dep, same risk. Excluding it
24
+ would leave the fix incomplete.
25
+ - **Hidden assumption:** Both `pollForPR` call sites use the same real implementation -- confirmed.
26
+
27
+ ---
28
+
29
+ ## Failure Mode Review
30
+
31
+ | Mode | Handled? | Risk | Notes |
+ |------|----------|------|-------|
+ | Missing a call site | Mitigated | Medium | Grep check after implementation |
+ | `postToOutbox` throw during escalation sequence | Yes | Low | `return` is on next line after try/catch |
+ | `getAgentResult` throw with non-Error | Yes | Low | `e instanceof Error ? e.message : String(e)` |
+ | `pollForPR` throw leaving `prUrl` uninitialized | Yes (with care) | Medium | Must use `let prUrl; try {...} catch -> return escalated` |
+ | UX gate empty `uxHandle` zombie | Yes (after fix) | Low | Same 4-line guard as 9 other handles |
38
+
39
+ **Highest-risk failure mode:** `pollForPR` catch block structure. If written incorrectly (catch
40
+ logs but falls through), `prUrl` would be undefined and the subsequent `if (!prUrl)` check would
41
+ catch it -- but the `prUrl` variable would need to be declared with `let` outside the try block.
42
+ The fix requires care in the variable declaration pattern.
43
+
44
+ ---
45
+
46
+ ## Runner-Up / Simpler Alternative Review
47
+
48
+ **Runner-up:** Private `safePostToOutbox` helper.
49
+ - **Strength worth borrowing:** Standardized log message format across all `postToOutbox` sites.
50
+ - **Adopted:** Standardize the log message format inline (consistent `[WARN coordinator] postToOutbox failed: ...` prefix across all sites).
51
+ - **Rejected:** Full helper extraction. YAGNI at 8 sites. No precedent in repo.
52
+
53
+ **Simpler alternative:** Skip `postToOutbox` wrapping (only fix `getAgentResult` and `pollForPR`).
54
+ - **Rejected:** `postToOutbox` crashes at critical escalation points. Medium severity is still a
55
+ production crash path that must be fixed.
56
+
57
+ ---
58
+
59
+ ## Philosophy Alignment
60
+
61
+ | Principle | Status |
+ |-----------|--------|
+ | Errors are data | Fully satisfied -- throws become `PipelineOutcome` values |
+ | Escalation-first invariant | Enforced -- no throw-exit paths remain after fix |
+ | Make illegal states unrepresentable | Satisfied -- coordinator now always returns a value |
+ | DI for boundaries | Satisfied -- no new imports, changes are in mode files only |
+ | Compose with small functions | Under acceptable tension -- functions grow slightly |
+ | Document why not what | Needs 1-line comment per postToOutbox catch explaining non-fatal rationale |
+ | YAGNI with discipline | Satisfied -- no speculative helper |
70
+
71
+ ---
72
+
73
+ ## Findings
74
+
75
+ ### Yellow: `pollForPR` variable declaration pattern
76
+
77
+ The `let prUrl` declaration must be placed BEFORE the try/catch block (not inside it) so that
78
+ the catch block can `return` an escalated outcome and the variable remains in scope after. If
79
+ the variable is declared inside `try`, it is block-scoped and the later `if (!prUrl)` check will not compile. This is a known TypeScript
80
+ pattern but worth flagging explicitly.
81
+
82
+ **Fix:** Use the explicit two-step pattern from the task spec:
83
+ ```typescript
+ let prUrl: string | null;
+ try {
+   prUrl = await deps.pollForPR(branchPattern, PR_POLL_TIMEOUT_MS);
+ } catch (e) {
+   const msg = e instanceof Error ? e.message : String(e);
+   deps.stderr(`[WARN coordinator] pollForPR threw: ${msg}`);
+   return { kind: 'escalated', escalationReason: { phase: 'pr-detection', reason: `pollForPR threw: ${msg}` } };
+ }
+ if (!prUrl) { ... }
+ ```
94
+
95
+ ### Yellow: `deps.stderr()` vs. `process.stderr.write()`
96
+
97
+ Use `deps.stderr()` in catch blocks. The task spec example uses `process.stderr.write()` but the
98
+ repo's `archiveFile` catch blocks use `deps.stderr()`. Consistency with repo pattern wins.
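
A sketch of the `postToOutbox` catch shape this finding implies, under assumptions: `OutboxDeps` is a trimmed, hypothetical stand-in for the real deps, and `postToOutbox` is shown with a single string argument that may not match its real signature. It is wrapped in a function here only so it can be run on its own; the design applies the same try/catch inline at each call site.

```typescript
// Hypothetical, trimmed-down deps and outcome shapes for illustration.
interface OutboxDeps {
  readonly postToOutbox: (message: string) => Promise<void>;
  readonly stderr: (line: string) => void;
}
type Escalated = {
  kind: 'escalated';
  escalationReason: { phase: string; reason: string };
};

async function escalate(
  deps: OutboxDeps,
  phase: string,
  reason: string,
): Promise<Escalated> {
  try {
    await deps.postToOutbox(`Escalation in ${phase}: ${reason}`);
  } catch (e) {
    // postToOutbox write failure is non-fatal -- escalation still returns below
    const msg = e instanceof Error ? e.message : String(e);
    deps.stderr(`[WARN coordinator] postToOutbox failed: ${msg}`);
  }
  // The return sits outside the try/catch, so it runs whether or not the post succeeded.
  return { kind: 'escalated', escalationReason: { phase, reason } };
}
```

Routing the warning through the injected `stderr` lets a test capture it with a plain array instead of patching `process.stderr`.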
99
+
100
+ ---
101
+
102
+ ## Recommended Revisions
103
+
104
+ 1. Use `deps.stderr()` (not `process.stderr.write()`) in all catch blocks.
105
+ 2. Use `let prUrl: string | null` declared before the try block for `pollForPR` calls.
106
+ 3. Add a one-line comment in each `postToOutbox` catch explaining non-fatal policy:
107
+ `// postToOutbox write failure is non-fatal -- escalation still returns below`
108
+ 4. Include `pollForPR` in `full-pipeline.ts` even though task description only names `implement.ts`.
109
+ 5. Include UX gate zombie detection fix in `implement.ts` line 144.
110
+
111
+ ---
112
+
113
+ ## Residual Concerns
114
+
115
+ - **No tests for throw injection:** This PR fixes the runtime behavior but adds no tests for
116
+ the throw paths. Tests are a planned follow-up (per the audit doc). The absence of tests means
117
+ a regression in this fix would not be caught by CI. Low concern for this PR -- the fix is
118
+ mechanical and the pattern is simple.
119
+ - **`adaptive-pipeline.ts` line 362 `postToOutbox` is unguarded** but is explicitly out of scope
120
+ for this task. Should be addressed in a follow-up.
@@ -7251,3 +7251,103 @@ Option C (hybrid): Heuristic pass first, LLM for anything ambiguous or conflicti
7251
7251
  This IS the implementation of that feature. The manually-authored `.worktrain/rules/*.md` files are the canonical output. The preprocessing is how you get there automatically without having to write them by hand.
7252
7252
 
7253
7253
  **Priority:** Medium. Useful for any workspace with multiple tools. Essential for team repos where Cursor rules, AGENTS.md, and custom conventions might conflict. Build the categorized injection path (phase-scoped rules) first; add the automated preprocessing as a follow-up.
7254
+
7255
+ ---
7256
+
7257
+ ## True session status for WorkTrain: live agent state, not activity inference (Apr 21, 2026)
7258
+
7259
+ **The problem:** The console currently infers session status from last event timestamp. A session with no events for >1 hour becomes "dormant." For the WorkRail MCP server this is unavoidable -- it doesn't have access to the agent loop state. But WorkTrain does. It should show true status.
7260
+
7261
+ ### What WorkTrain knows that the MCP server doesn't
7262
+
7263
+ WorkTrain has direct access to:
7264
+ - `DaemonRegistry` -- tracks every active session by workrailSessionId in memory. If a session is in the registry, it's running. If not, it's not.
7265
+ - `DaemonEventEmitter` -- structured events fired synchronously by the agent loop
7266
+ - `daemon_heartbeat` -- fires every 30s. If last heartbeat >90s ago, daemon is down.
7267
+ - Turn-level events -- `llm_turn_started/completed`, `tool_call_started/completed` -- know exactly what the agent is doing RIGHT NOW
7268
+ - `agent_stuck` -- stuck heuristic fired, session may be stalled
7269
+ - `session_aborted` -- daemon killed mid-session (now emitted on SIGTERM)
7270
+
7271
+ ### True session status taxonomy
7272
+
7273
+ | Status | Meaning | Detection |
7274
+ |---|---|---|
7275
+ | `active:thinking` | LLM API call in progress | `llm_turn_started` without `llm_turn_completed` |
7276
+ | `active:tool` | Tool executing | `tool_call_started` without `tool_call_completed`, name=Bash/Read/etc. |
7277
+ | `active:idle` | Between turns | Last event was `llm_turn_completed`, session in DaemonRegistry |
7278
+ | `stuck` | Stuck heuristic fired | `agent_stuck` event, session still in DaemonRegistry |
7279
+ | `completed:success` | Done successfully | `session_completed` outcome=success |
7280
+ | `completed:timeout` | Hit wall-clock limit | `session_completed` outcome=timeout, detail other than max_turns |
7281
+ | `completed:stuck` | Aborted by stuck policy | `session_completed` outcome=stuck |
7282
+ | `completed:max_turns` | Hit turn limit | `session_completed` outcome=timeout detail=max_turns |
7283
+ | `aborted` | Daemon killed mid-run | `session_aborted` event |
7284
+ | `daemon:down` | No recent heartbeat | Last `daemon_heartbeat` >90s ago |
7285
+
7286
+ ### Where to surface this
7287
+
7288
+ 1. **`worktrain status`** -- already shows session overview. Replace activity-inferred "RUNNING" with true status labels.
7289
+ 2. **`worktrain health <sessionId>`** -- already shows per-session summary. Add "Current state: active:tool (Bash, 23s)" line.
7290
+ 3. **Console workspace view** -- each session row should show true status badge, not just "live" vs "dormant."
7291
+ 4. **`worktrain logs --follow`** -- prefix each event with the derived session state so the log stream is self-explanatory.
7292
+
7293
+ ### Implementation
7294
+
7295
+ The status is derivable from the event log in O(N), where N is the number of events for this session: scan from the end and stop at the most recent relevant event. For live sessions, also check DaemonRegistry membership.
7296
+
7297
+ The daemon heartbeat + DaemonRegistry membership is the key insight: if the session's workrailSessionId is in DaemonRegistry AND the daemon is alive (recent heartbeat), the session is definitely running. If it's not in DaemonRegistry, the session is done regardless of what the event log says.
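
The derivation described above can be sketched as a single pure function. This is a minimal sketch, not the WorkTrain implementation: the `SessionEvent` shape, the `deriveStatus` name, and the `"unknown"` fallback are assumptions made for illustration, and treating `tool_call_completed` as `active:idle` is a guess, since the taxonomy only names `llm_turn_completed` for that state.

```typescript
// Hypothetical event shape; the real WorkTrain event schema may differ.
type SessionEvent = { kind: string; outcome?: string; detail?: string };

type SessionStatus =
  | "active:thinking" | "active:tool" | "active:idle" | "stuck"
  | "completed:success" | "completed:timeout" | "completed:stuck"
  | "completed:max_turns" | "aborted" | "daemon:down"
  | "unknown"; // assumption: fallback not present in the taxonomy

// The taxonomy treats a heartbeat older than 90s as "daemon down".
const HEARTBEAT_STALE_MS = 90_000;

const RELEVANT_KINDS = new Set<string>([
  "llm_turn_started", "llm_turn_completed",
  "tool_call_started", "tool_call_completed",
  "agent_stuck", "session_completed", "session_aborted",
]);

function deriveStatus(
  events: readonly SessionEvent[],
  inRegistry: boolean,
  lastHeartbeatAtMs: number,
  nowMs: number,
): SessionStatus {
  // Scan from the end and stop at the most recent relevant event.
  let latest: SessionEvent | undefined;
  for (let i = events.length - 1; i >= 0; i--) {
    const e = events[i];
    if (e !== undefined && RELEVANT_KINDS.has(e.kind)) {
      latest = e;
      break;
    }
  }

  // Terminal events are authoritative even if the daemon has since gone away.
  if (latest !== undefined && latest.kind === "session_aborted") return "aborted";
  if (latest !== undefined && latest.kind === "session_completed") {
    if (latest.outcome === "success") return "completed:success";
    if (latest.outcome === "stuck") return "completed:stuck";
    if (latest.outcome === "timeout") {
      return latest.detail === "max_turns" ? "completed:max_turns" : "completed:timeout";
    }
    return "unknown";
  }

  // No terminal event: liveness decides.
  if (nowMs - lastHeartbeatAtMs > HEARTBEAT_STALE_MS) return "daemon:down";
  if (!inRegistry) return "unknown"; // done, but the log does not say how

  switch (latest?.kind) {
    case "llm_turn_started": return "active:thinking";
    case "tool_call_started": return "active:tool";
    case "agent_stuck": return "stuck";
    default: return "active:idle"; // llm_turn_completed, tool_call_completed, or no events yet
  }
}
```

Keeping the function pure (registry membership and heartbeat time are passed in rather than looked up) would let `worktrain status`, `worktrain health`, and the console share one derivation and test it without a running daemon.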
7298
+
7299
+ ### Priority
7300
+
7301
+ Medium-high. True session status makes WorkTrain trustworthy as an autonomous system -- operators can see exactly what's happening without guessing. Especially important as session durations get longer (55-minute discovery sessions, 120-minute pipeline runs).
7302
+
7303
+ ---
7304
+
7305
+ ## Workflows tab: incorrect source attribution for bundled workflows (Apr 21, 2026)
7306
+
7307
+ **The bug:** The Workflows tab in the console shows bundled workflows (e.g. `coding-task-workflow-agentic`, `workflow-for-workflows`) as coming from "User Library" instead of "WorkRail Built-in". This is a WorkRail MCP server issue, not a WorkTrain issue.
7308
+
7309
+ **Likely cause:** The source attribution logic reads from the workflow's loaded source (the `WorkflowSource` type). When a workflow exists in both the bundled set AND a user's managed sources or remembered roots, the source returned is the one that "wins" in the storage layer -- which may be the user path rather than the bundled path. Alternatively, the `source.kind` field may be incorrectly set to `'personal'` for workflows that were loaded from the bundled workflows directory.
7310
+
7311
+ **Where to look:**
7312
+ - `src/infrastructure/storage/schema-validating-workflow-storage.ts` -- source kind propagation
7313
+ - `src/mcp/handlers/shared/workflow-source-visibility.ts` -- how source is mapped to display label in `list_workflows`
7314
+ - `src/infrastructure/storage/file-workflow-storage.ts` -- how `source.kind` is assigned when loading from disk
7315
+
7316
+ **Expected behavior:** Workflows in the `workflows/` directory of the workrail package should always display as "WorkRail Built-in" regardless of whether the user also has a managed source that happens to include the same directory.
7317
+
7318
+ **Priority:** Low for WorkTrain (doesn't affect functionality). Medium for WorkRail MCP (misleading UI, users may think they accidentally modified bundled workflows).
7319
+
7320
+ ---
7321
+
7322
+ ## Coordinator-managed git state and agent crash recovery (Apr 21, 2026)
7323
+
7324
+ ### Git state management (coordinator's job)
7325
+
7326
+ Before dispatching any WorkTrain session that does git work:
7327
+ 1. Check for `.git/index.lock` -- if present, verify the owning PID is dead (via `lsof` on macOS), then remove it
7328
+ 2. Abort any in-progress git operations: `git rebase --abort; git merge --abort`
7329
+ 3. Verify the workspace is in a clean state before handing off to the agent
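
The pre-dispatch checks above can be sketched as a planning step that is separate from command execution. This is a hedged sketch: `WorkspaceProbe`, `planGitPrep`, and `isWorkspaceClean` are hypothetical names, and the probe values are assumed to be gathered elsewhere (for example, the lock-owner check via `lsof`). The two abort commands exit non-zero when no rebase or merge is in progress, so whoever runs the plan would need to tolerate those failures.

```typescript
// Hypothetical probe result, gathered by the coordinator before dispatch.
type WorkspaceProbe = {
  indexLockPresent: boolean; // does .git/index.lock exist?
  lockOwnerAlive: boolean;   // is the process holding the lock still running?
};

type PrepPlan =
  | { ok: true; commands: string[] }
  | { ok: false; reason: string };

// Steps 1 and 2: decide which cleanup commands to run, or refuse to proceed.
function planGitPrep(probe: WorkspaceProbe): PrepPlan {
  const commands: string[] = [];
  if (probe.indexLockPresent) {
    if (probe.lockOwnerAlive) {
      // A live git process owns the lock; removing it could corrupt the index.
      return { ok: false, reason: "index.lock is held by a live process" };
    }
    commands.push("rm -f .git/index.lock");
  }
  // Abort any half-finished operation; failures here mean "nothing to abort".
  commands.push("git rebase --abort", "git merge --abort");
  return { ok: true, commands };
}

// Step 3: `git status --porcelain` prints nothing when the tree is clean.
function isWorkspaceClean(porcelainOutput: string): boolean {
  return porcelainOutput.trim() === "";
}
```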
7330
+
7331
+ Every session that touches files gets a worktree (already implemented). The coordinator ensures worktrees are created cleanly and removed after session completion. The orphan TTL cleanup (24h) handles crash cases.
7332
+
7333
+ ### Agent crash recovery (coordinator's job)
7334
+
7335
+ An agent can die from: stream watchdog timeout (600s no progress), OOM kill, or SIGKILL. In all cases the session event log is intact -- the full conversation history is preserved.
7336
+
7337
+ **The coordinator should detect and recover automatically:**
7338
+
7339
+ 1. Monitor child sessions via `worktrain await`
7340
+ 2. If a session returns `_tag: 'aborted'` or `_tag: 'timeout'` mid-pipeline:
7341
+ - Check if the session made meaningful progress (step advances > 0, or notes written)
7342
+ - If yes: resume the session -- same session ID, same context, agent picks up at last checkpoint
7343
+ - If no (zero progress): retry from scratch with a fresh session, same context bundle
7344
+ 3. Retry up to N times (configurable, default 2) before escalating to Human Outbox
7345
+ 4. Track which phase failed and inject a hint on retry: "Previous attempt failed at this step. Retry with fresh approach."
7346
+
7347
+ **This is session continuation applied to crash recovery.** The agent's conversation history is fully preserved. Resuming puts it back exactly where it was. The 600s watchdog timeout (the most common failure) almost always means a hung LLM call or a tool timeout -- resuming naturally retries the step.
7348
+
7349
+ **Implementation:** `runFullPipeline` and `runImplementPipeline` already have per-phase error handling. Extend each `awaitSessions` call: on a non-success outcome, attempt a resume before returning an escalated result. The resume logic is `worktrain session continue <sessionId>` (once that command exists) or `dispatchAdaptivePipeline` with the existing session's context.
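
The recovery policy above reduces to a small decision function that the per-phase error handling could call after each `awaitSessions` result. This is a sketch under stated assumptions: `AwaitOutcome`, `Progress`, `RecoveryAction`, and `decideRecovery` are hypothetical names, and only the `_tag` values, the progress criteria, the default of 2 retries, and the hint text come from the notes above.

```typescript
// Hypothetical outcome shape; only the `_tag` values come from the notes above.
type AwaitOutcome = { _tag: "success" } | { _tag: "aborted" } | { _tag: "timeout" };

// "Meaningful progress" per the notes: step advances > 0, or notes written.
type Progress = { stepAdvances: number; notesWritten: number };

type RecoveryAction =
  | { kind: "proceed" }                     // phase succeeded, continue the pipeline
  | { kind: "resume"; hint: string }        // same session ID, same context
  | { kind: "retry_fresh"; hint: string }   // new session, same context bundle
  | { kind: "escalate" };                   // hand off to the Human Outbox

const RETRY_HINT = "Previous attempt failed at this step. Retry with fresh approach.";

function decideRecovery(
  outcome: AwaitOutcome,
  progress: Progress,
  retriesUsed: number,
  maxRetries: number = 2,
): RecoveryAction {
  if (outcome._tag === "success") return { kind: "proceed" };
  // Retry budget exhausted: stop retrying and escalate.
  if (retriesUsed >= maxRetries) return { kind: "escalate" };
  const madeProgress = progress.stepAdvances > 0 || progress.notesWritten > 0;
  return madeProgress
    ? { kind: "resume", hint: RETRY_HINT }
    : { kind: "retry_fresh", hint: RETRY_HINT };
}
```

For example, a session that timed out after advancing three steps would be resumed on the first two failures and escalated on the third.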
7350
+
7351
+ ### Priority
7352
+
7353
+ High. Agent crash recovery makes the overnight-autonomous bar achievable. Without it, any hung LLM call or tool timeout fails the entire pipeline silently. With it, transient failures are automatically retried and the pipeline continues.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@exaudeus/workrail",
3
- "version": "3.59.2",
3
+ "version": "3.59.4",
4
4
  "description": "Step-by-step workflow enforcement for AI agents via MCP",
5
5
  "license": "MIT",
6
6
  "repository": {