@exaudeus/workrail 3.36.0 → 3.37.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (45)
  1. package/dist/config/config-file.js +2 -0
  2. package/dist/console-ui/assets/{index-n8cJrS4v.js → index-t8Wi304z.js} +1 -1
  3. package/dist/console-ui/index.html +1 -1
  4. package/dist/daemon/workflow-runner.d.ts +1 -0
  5. package/dist/daemon/workflow-runner.js +3 -6
  6. package/dist/infrastructure/session/SessionManager.js +17 -4
  7. package/dist/manifest.json +25 -17
  8. package/dist/trigger/notification-service.d.ts +42 -0
  9. package/dist/trigger/notification-service.js +164 -0
  10. package/dist/trigger/trigger-listener.js +7 -1
  11. package/dist/trigger/trigger-router.d.ts +3 -1
  12. package/dist/trigger/trigger-router.js +4 -1
  13. package/docs/design/agent-behavior-patterns-discovery.md +312 -0
  14. package/docs/design/agent-engine-communication-discovery.md +390 -0
  15. package/docs/design/agent-loop-architecture-alternatives-discovery.md +531 -0
  16. package/docs/design/agent-loop-error-handling-contract.md +238 -0
  17. package/docs/design/complete-step-approach-validation-discovery.md +344 -0
  18. package/docs/design/daemon-stuck-detection-discovery.md +174 -0
  19. package/docs/design/mcp-server-disconnect-discovery.md +245 -0
  20. package/docs/design/mcp-server-epipe-crash.md +198 -0
  21. package/docs/design/notification-design-candidates.md +131 -0
  22. package/docs/design/notification-design-review.md +84 -0
  23. package/docs/design/notification-implementation-plan.md +181 -0
  24. package/docs/design/spawn-agent-failure-modes.md +161 -0
  25. package/docs/design/spawn-agent-result-handling-implementation-plan.md +186 -0
  26. package/docs/design/stdio-simplification-design-candidates.md +341 -0
  27. package/docs/design/stdio-simplification-design-review.md +93 -0
  28. package/docs/design/stdio-simplification-implementation-plan.md +317 -0
  29. package/docs/design/structured-output-tools-coexist-findings.md +288 -0
  30. package/docs/discovery/coordinator-script-design.md +745 -0
  31. package/docs/discovery/coordinator-ux-discovery.md +471 -0
  32. package/docs/discovery/spawn-agent-failure-modes.md +309 -0
  33. package/docs/discovery/workflow-selection-for-discovery-tasks.md +336 -0
  34. package/docs/discovery/worktrain-status-briefing.md +325 -0
  35. package/docs/discovery/worktrain-status-design-candidates.md +202 -0
  36. package/docs/discovery/worktrain-status-design-review-findings.md +86 -0
  37. package/docs/ideas/backlog.md +608 -0
  38. package/docs/ideas/daemon-structured-output-vs-tool-calls.md +344 -0
  39. package/docs/ideas/design-candidates-backlog-consolidation.md +85 -0
  40. package/docs/ideas/design-review-findings-backlog-consolidation.md +39 -0
  41. package/docs/ideas/implementation_plan_backlog_consolidation.md +117 -0
  42. package/docs/plans/authoring-doc-staleness-enforcement-candidates.md +251 -0
  43. package/docs/plans/authoring-doc-staleness-enforcement-review.md +99 -0
  44. package/docs/plans/authoring-doc-staleness-enforcement.md +463 -0
  45. package/package.json +1 -1
package/docs/design/daemon-stuck-detection-discovery.md (new file)
@@ -0,0 +1,174 @@
# Discovery: Daemon Stuck Detection and Visibility

## Context / Ask

Identify what a "stuck" daemon agent looks like in the logs and define the definitive signals for detecting it. This is the prerequisite for implementing visibility improvements (stuck detection events, improved `worktrain logs`, session health summary, console live panel indicator, and richer WORKTRAIN_STUCK markers).

## Path Recommendation

**landscape_first** -- the codebase and event log are fully readable; no architectural reframing needed. The signals are observable facts, not design decisions.

**Rationale:** The ask is empirical: what does stuck look like? The event log for 2026-04-18 has real `issue_reported` events showing actual stuck patterns. The code is fully readable. Discovery + landscaping is sufficient; no candidate comparison needed.

## Constraints / Anti-goals

- Anti-goal: Do not redesign the session event store or merge daemon events into it (separate backlog item #4315).
- Anti-goal: Do not build a watchdog daemon or restart mechanism (out of scope for this phase).
- Constraint: All new events must follow the discriminated union pattern in `daemon-events.ts`.
- Constraint: Stuck detection runs in the `turn_end` subscriber in `runWorkflow()`, not as a new thread.

## Landscape Packet

### Source files reviewed

- `src/daemon/workflow-runner.ts` -- agent loop, timeout logic, turn counter, `report_issue` tool
- `src/daemon/daemon-events.ts` -- all current event kinds (15 events in the union)
- `src/daemon/agent-loop.ts` -- turn_end subscriber, steer, _runLoop
- `~/.workrail/events/daemon/2026-04-18.jsonl` -- 2000+ events from real sessions today
- `docs/ideas/backlog.md` -- relevant sections at lines 3912-3972, 4315-4380

### Current event kinds in DaemonEvent union

1. `daemon_started` -- daemon boot
2. `trigger_fired` -- incoming webhook
3. `session_queued` -- queue entry
4. `session_started` -- agent loop about to begin
5. `tool_called` (coarse stream) -- from inside each tool's execute()
6. `tool_error` -- isError=true tool result in turn_end subscriber
7. `step_advanced` -- onAdvance() fired (continue_workflow succeeded)
8. `session_completed` -- outcome: success|error|timeout
9. `delivery_attempted` -- HTTP callback POST
10. `issue_reported` -- agent called report_issue; has severity + issueKind
11. `llm_turn_started` -- before client.messages.create()
12. `llm_turn_completed` -- after API response; has stopReason, inputTokens, outputTokens, toolNamesRequested
13. `tool_call_started` -- fine-grained; has argsSummary (200 chars JSON)
14. `tool_call_completed` -- fine-grained; has durationMs, resultSummary
15. `tool_call_failed` -- fine-grained; has durationMs, errorMessage

**Missing:** No `agent_stuck` event kind currently exists.
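To make the gap concrete, here is a minimal sketch of how an `agent_stuck` kind could join the existing discriminated union. The field names (`signal`, `detail`, etc.) are illustrative assumptions, not the actual WorkRail types in `daemon-events.ts`:

```typescript
// Hypothetical sketch: an agent_stuck member for the DaemonEvent union.
// All field names below are assumptions for illustration only.
type StuckSignal =
  | 'repeated_tool_call'
  | 'fatal_issue_reported'
  | 'no_step_advance';

interface AgentStuckEvent {
  kind: 'agent_stuck';          // discriminant, like the 15 existing kinds
  sessionId: string;
  signal: StuckSignal;          // which detector fired
  turnCount: number;
  stepAdvanceCount: number;
  detail: string;               // human-readable summary for logs/console
}

// The discriminant lets a subscriber narrow without casts:
function describe(e: AgentStuckEvent | { kind: 'step_advanced' }): string {
  return e.kind === 'agent_stuck'
    ? `stuck(${e.signal}) at turn ${e.turnCount}`
    : 'advanced';
}

console.log(describe({
  kind: 'agent_stuck',
  sessionId: 's1',
  signal: 'no_step_advance',
  turnCount: 40,
  stepAdvanceCount: 0,
  detail: '0 advances after 40 turns',
})); // → "stuck(no_step_advance) at turn 40"
```

The union-member shape is the only structural requirement the constraints above impose; the payload fields would be settled during implementation.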

### Real stuck patterns in 2026-04-18.jsonl (session ea2de6e5)

Session `ea2de6e5` (workrailSessionId: `sess_5hb25pdpq2jqhciznto7vmgaue`) shows a clear real-world stuck pattern:

1. Agent tried to submit `wr.assessment` artifacts 6+ times via `continue_workflow`
2. Each attempt was blocked (the daemon's `continue_workflow` tool lacks an `artifacts` field)
3. Agent escalated: `issue_reported` severity=warn (line 1779) → severity=error (line 1795) → severity=fatal (line 1825) → severity=fatal again (line 1915)
4. The session ended with `session_completed outcome=timeout detail=max_turns` -- it hit the turn limit while still stuck

This is the canonical "blocked_at_assessment_gate" stuck pattern. The agent knows it's stuck and signals it clearly via `report_issue`, but there's no `agent_stuck` event the console or coordinator can watch for.

### Stuck signal taxonomy (from code analysis + log evidence)

**Signal 1: Repeated tool call (same tool + same args)**
- In `tool_call_started` events, `argsSummary` is JSON params truncated to 200 chars.
- If the last 3 `tool_call_started` events for the session have identical `toolName` AND `argsSummary`, the agent is looping.
- Observable from the event log by comparing consecutive `tool_call_started.argsSummary` values.
- Detection point: `turn_end` subscriber in `runWorkflow()`.

**Signal 2: issue_reported with severity=fatal**
- Agent self-diagnoses as stuck and explicitly calls `report_issue` with `severity='fatal'`.
- Confirmed by real log: 2x fatal reports in session ea2de6e5 before max_turns.
- This is the most reliable signal because the agent knows its state.
- Detection point: already emitted as `issue_reported` event; just needs to be surfaced.

**Signal 3: No step advances after N LLM turns**
- `step_advanced` events increment `stepAdvanceCount` (tracked in `onAdvance()` closure).
- `llm_turn_completed` events tracked by `turnCount` variable.
- If `turnCount >= maxTurns * 0.8` AND `stepAdvanceCount == 0`, the session will time out without completing a single step.
- Detection point: `turn_end` subscriber (already has access to `turnCount` and can track `stepAdvanceCount`).

**Signal 4: Tool call failure rate > 50% over last 5 turns**
- `tool_call_failed` events tracked by the `turn_end` subscriber.
- If >50% of tool calls in the last 5 turns failed, something systematic is broken.
- Less reliable than signals 1-3 because short bursts of failure are normal (grep exit 1, missing files).
- Detection point: `turn_end` subscriber with a rolling 5-turn failure rate tracker.

**Signal 5: Wall-clock approaching maxSessionMinutes with < 2 advances**
- `timeoutReason !== null` is set when the wall-clock timeout fires, but that's too late -- the abort already happened.
- Better: check `(Date.now() - sessionStartMs) > sessionTimeoutMs * 0.8` AND `stepAdvanceCount < 2`.
- Detection point: `turn_end` subscriber.

**Signal 6: Blocked attempt chain (assessment gates)**
- When `continue_workflow` returns `kind: 'blocked'`, the tool returns feedback to the LLM.
- PR #554 capped `blocked_attempt` chains at 3. Beyond that it's a fatal block.
- The `issue_reported` events are the reliable proxy for this pattern.

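Signal 1 reduces to a small pure helper: a fixed-size buffer of recent calls plus an equality check. The names here (`RecentToolCalls`, `isLooping`) are illustrative, not the actual workflow-runner internals:

```typescript
// Sketch of Signal 1: keep the last 3 tool calls in a ring buffer and
// flag a loop when toolName + argsSummary repeat 3 times in a row.
interface ToolCall {
  toolName: string;
  argsSummary: string; // JSON params truncated to 200 chars
}

class RecentToolCalls {
  private buf: ToolCall[] = [];
  constructor(private readonly size = 3) {}

  push(call: ToolCall): void {
    this.buf.push(call);
    if (this.buf.length > this.size) this.buf.shift(); // drop oldest
  }

  // True when the buffer is full and every entry is identical.
  isLooping(): boolean {
    if (this.buf.length < this.size) return false;
    const [first, ...rest] = this.buf;
    return rest.every(
      (c) => c.toolName === first.toolName && c.argsSummary === first.argsSummary,
    );
  }
}

const recent = new RecentToolCalls();
recent.push({ toolName: 'continue_workflow', argsSummary: '{"step":"assess"}' });
recent.push({ toolName: 'continue_workflow', argsSummary: '{"step":"assess"}' });
console.log(recent.isLooping()); // false -- only 2 identical calls so far
recent.push({ toolName: 'continue_workflow', argsSummary: '{"step":"assess"}' });
console.log(recent.isLooping()); // true -- 3 identical calls
```

Because `argsSummary` is already truncated to 200 chars, comparing the summaries is cheap and avoids hashing full tool arguments.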
### turn_end subscriber and tracking variables (workflow-runner.ts)

The `turn_end` subscriber (lines 1725-1755) currently:
- Emits `tool_error` events for `isError=true` tool results
- Increments `turnCount`
- Checks `maxTurns` limit and calls `agent.abort()` if hit
- Calls `agent.steer()` with pending step text

State variables accessible in the subscriber via closures:
- `turnCount` -- LLM turn count (already tracked)
- `isComplete` -- workflow complete flag
- `pendingSteerText` -- next step text (null until continue_workflow advances)
- `stepAdvanceCount` -- NOT currently tracked (needs to be added)
- `sessionStartMs` -- NOT currently tracked (needs to be added)
- The event history for "last N tool_call_started" -- NOT currently tracked (needs a ring buffer)

### WORKTRAIN_STUCK current fields (workflow-runner.ts lines 1837-1843)

```json
{
  "reason": "session_error",
  "error": "<first 500 chars of error message>",
  "workflowId": "<id>",
  "sessionId": "<process-local UUID>"
}
```

Missing (per implementation spec): `turnCount`, `stepAdvanceCount`, `lastToolCalled`, `issueSummaries`.
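A sketch of the enriched marker with the four missing fields added. The shape follows the spec text above; the builder function and all values are illustrative assumptions:

```typescript
// Hypothetical enriched WORKTRAIN_STUCK marker. The four new fields come
// straight from the spec list above; everything else matches the current
// marker shape. Values are illustrative.
interface StuckMarker {
  reason: string;
  error: string;            // first 500 chars of the error message
  workflowId: string;
  sessionId: string;
  // New fields per the implementation spec:
  turnCount: number;
  stepAdvanceCount: number;
  lastToolCalled: string | null;
  issueSummaries: string[]; // one entry per issue_reported event
}

function buildStuckMarker(
  base: Pick<StuckMarker, 'reason' | 'error' | 'workflowId' | 'sessionId'>,
  state: Pick<StuckMarker, 'turnCount' | 'stepAdvanceCount' | 'lastToolCalled' | 'issueSummaries'>,
): StuckMarker {
  return { ...base, ...state };
}

const marker = buildStuckMarker(
  { reason: 'session_error', error: 'blocked at assessment gate', workflowId: 'wf1', sessionId: 'uuid' },
  { turnCount: 50, stepAdvanceCount: 0, lastToolCalled: 'continue_workflow',
    issueSummaries: ['warn: gate blocked', 'fatal: cannot submit artifacts'] },
);
console.log(marker.turnCount, marker.issueSummaries.length); // 50 2
```

As the Resolution Notes observe, all four values are already available in closures at the marker's emission point, so this is a pure plumbing change.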

### Backlog references

- Line 3923: "Session liveness detection. If a session has been in_progress for more than N minutes with no advance_recorded events, the daemon watchdog should log a warning and optionally abort the session."
- Line 4229: "report_issue tool -- WORKTRAIN_STUCK marker in WorkflowRunResult"
- Lines 4315-4332: "Agent actions as first-class events" -- worktrain_stuck as a session event kind
- Line 3972: "WORKTRAIN_STUCK routing and coordinator self-healing patterns all depend on logs being structured and complete"

## Problem Frame Packet

**The stuck agent is currently invisible.** An agent can loop for 50 turns hitting assessment gates, call `report_issue` 4 times at increasing severity, and the only external signal is that `session_completed outcome=timeout` eventually fires. There is no `agent_stuck` event. The `worktrain logs` output doesn't distinguish fatal issues from warnings. The console live panel shows no stuck indicator. The WORKTRAIN_STUCK marker has no context about turns used or issues reported.

**Root cause:** stuck detection was deferred to "after the fact" (WORKTRAIN_STUCK in final notes), but the signals exist in real-time (turn_end subscriber has turnCount, tool results, and the onAdvance closure tracks advances).

## Candidate Directions

### Direction A: Minimal -- just emit agent_stuck events (no UI changes)
Add `AgentStuckEvent` to daemon-events.ts, emit in turn_end subscriber on 3 signals. Low effort, observable via raw JSONL.

### Direction B: Full -- 5 improvements as specified
Add stuck events + improve worktrain logs + add `worktrain status` + console panel + richer WORKTRAIN_STUCK. Full visibility stack.

**Recommendation: Direction B.** The 5 improvements are cohesive and each addresses a different visibility gap. The stuck event alone (Direction A) isn't actionable if nothing surfaces it to humans.

## Resolution Notes

All 5 implementation items are well-scoped. The key implementation details:

1. **Stuck detection (workflow-runner.ts):** Add `stepAdvanceCount` and `sessionStartMs` variables alongside `turnCount`. Add a `lastNToolCalls` ring buffer (last 3 `tool_call_started` events). In `turn_end`, check the 3 signals and emit `agent_stuck`.

2. **worktrain logs formatting (cli-worktrain.ts):** `formatDaemonEventLine()` exists and can be extended with new cases. Currently minimal (no step_advanced or llm_turn_completed formatting found -- needs verification).

3. **worktrain status command (cli-worktrain.ts):** New subcommand. Reads the daemon JSONL for a sessionId, aggregates counts, prints health summary. Pure reads, no daemon state required.

4. **Console liveActivity (console-service.ts):** `readLiveActivity()` already reads `tool_called` events. Add `agent_stuck` to the filter.

5. **WORKTRAIN_STUCK enrichment (workflow-runner.ts):** The stuckMarker JSON at line 1837 needs 4 new fields. The data is all available in closures at that point.

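Item 3 amounts to a pure fold over the daemon JSONL. A sketch, with event shapes assumed from the kinds listed earlier in this document (not the actual WorkRail types):

```typescript
// Sketch of the `worktrain status` aggregation: read daemon JSONL lines
// for one session and fold them into a health summary. Event shapes are
// assumptions based on the event-kind list above.
interface DaemonEventLine {
  kind: string;
  sessionId: string;
  severity?: string; // present on issue_reported
}

interface SessionHealth {
  turns: number;
  stepAdvances: number;
  toolFailures: number;
  fatalIssues: number;
}

function summarize(jsonlLines: string[], sessionId: string): SessionHealth {
  const health: SessionHealth = { turns: 0, stepAdvances: 0, toolFailures: 0, fatalIssues: 0 };
  for (const line of jsonlLines) {
    if (!line.trim()) continue;
    const e = JSON.parse(line) as DaemonEventLine;
    if (e.sessionId !== sessionId) continue; // pure read, filter by session
    switch (e.kind) {
      case 'llm_turn_completed': health.turns++; break;
      case 'step_advanced': health.stepAdvances++; break;
      case 'tool_call_failed': health.toolFailures++; break;
      case 'issue_reported': if (e.severity === 'fatal') health.fatalIssues++; break;
    }
  }
  return health;
}

const lines = [
  '{"kind":"llm_turn_completed","sessionId":"ea2de6e5"}',
  '{"kind":"issue_reported","sessionId":"ea2de6e5","severity":"fatal"}',
  '{"kind":"step_advanced","sessionId":"other"}',
];
console.log(summarize(lines, 'ea2de6e5'));
```

Since it only reads the JSONL, the command works on live and completed sessions alike, with no daemon RPC.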
## Decision Log

- Chose `landscape_first` path: the signals are observable facts from code + logs, not design decisions.
- Session ea2de6e5 from 2026-04-18.jsonl confirmed that `issue_reported` severity escalation is the primary real-world stuck signal.
- Ring buffer approach (last 3 `tool_call_started`) preferred over full history scan for repeated-tool detection.
- `stepAdvanceCount` must be tracked separately from `turnCount` (both exist in the subscriber but only `turnCount` is currently maintained).

## Final Summary

A stuck daemon agent currently looks like: many `llm_turn_completed` events with `toolNamesRequested` containing only failed tools, escalating `issue_reported` severity levels (warn → error → fatal), zero `step_advanced` events, and finally `session_completed outcome=timeout detail=max_turns`. The session `ea2de6e5` in today's log is the canonical example: 6 blocked `continue_workflow` attempts, 4 escalating `issue_reported` calls, terminated by max_turns.

The 5 definitive stuck signals are: (1) same tool+args called 3+ times, (2) issue_reported severity=fatal, (3) 0 step advances after 80%+ of turns used, (4) tool failure rate >50% over last 5 turns, (5) wall-clock at 80%+ with <2 advances. These are all detectable in the `turn_end` subscriber using existing state variables plus 2 new counters and a 3-element ring buffer.
package/docs/design/mcp-server-disconnect-discovery.md (new file)
@@ -0,0 +1,245 @@
# MCP Server Disconnect Discovery

## Context / Ask

The WorkRail MCP server keeps dying/disconnecting during active Claude Code sessions. Bridges report repeated reconnects. Claude Code sessions are interrupted. The symptom: "it restarts constantly."

Goal: identify root cause, supporting evidence, and recommended investigation path.

---

## Path Recommendation

**landscape_first** -- the problem is primarily one of understanding what is happening (reading logs, reading code, tracing the crash chain), not of reframing a problem definition. The code and logs give direct evidence of what is actually crashing.

Rationale over alternatives:
- `design_first` would be appropriate if the problem statement were ambiguous. It is not: we have crash logs with stack traces.
- `full_spectrum` would be appropriate if the landscape might be hiding a deeper structural question. The landscape already reveals a clear structural flaw (EPIPE crashing the server before the stdio guard fires), so spectrum-wide reframing is not needed.

---

## Constraints / Anti-goals

- Do not change the MCP client protocol (Claude Code).
- Do not affect session data integrity.
- Do not change the bridge/primary topology for now.
- Anti-goal: do not add speculative defenses if the real crash path is already identified.

---

## Landscape Packet (landscape_first pass)

### What runs

When Claude Code uses WorkRail there are two possible server topologies:

**Single process (first session):**
```
Claude Code (IDE) --stdio--> WorkRail stdio server (stdio-entry.ts)
             + embedded HttpServer (dashboard / MCP HTTP port 3100)
```

**Multi-session (bridge mode):**
```
Claude Code --stdio--> WorkRail bridge (bridge-entry.ts)
             --> HTTP --> WorkRail primary (http-entry.ts, port 3100)
```

### What the crash.log shows

Every entry in `~/.workrail/crash.log` (Apr 16-18, 2026) is the same pattern:

```json
{
  "transport": "stdio",
  "uptimeMs": 750-2100,
  "label": "Uncaught exception",
  "message": "write EPIPE",
  "stack": "Error: write EPIPE\n at afterWriteDispatched...\n at console.error (node:internal/console/constructor:444:26)\n at HttpServer.<method> ..."
}
```

**Key observations:**
1. Every crash is `transport=stdio` -- the MCP stdio primary is dying, not a bridge.
2. Uptime is 750-2100ms -- these processes live less than 2 seconds.
3. The crash is always `write EPIPE` surfacing from `console.error()` inside `HttpServer`.
4. The offending call sites across different crashes:
   - `HttpServer.reclaimStaleLock` (lines 432, 436 in compiled dist)
   - `HttpServer.printBanner` (line 577 in compiled dist)
   - `HttpServer.start` (line 348)
5. All crashes point to the **installed npm global** (`/opt/homebrew/lib/node_modules/@exaudeus/workrail/dist/...`), not the local dev build.

**This means Claude Code is running the globally-installed WorkRail binary, not the local dev build.**

### What the bridge.log shows

The bridge.log for Apr 18 shows a repeating pattern:
```
reconnected -> budget_exhausted (budgetUsed: 8) -> spawn_lock_acquired -> spawn_primary
```

This happens across PIDs 76729, 83046, 96836, 28962 -- multiple bridges repeatedly exhausting their full 8-reconnect budget before entering spawn mode. Each cycle takes ~90 seconds, matching the "constant restarts" symptom.

The bridges are correctly detecting primary death and attempting to respawn. The problem is that the primary keeps dying immediately after spawning.

### Root cause chain

1. Claude Code (IDE) spawns a WorkRail stdio process (the global npm binary).
2. The process starts, begins the `HttpServer.start()` / `tryBecomePrimary()` / `reclaimStaleLock()` path.
3. **Before** `wireStdoutShutdown()` fires (or even before `server.connect(transport)` is called), the HttpServer attempts writes via `console.error()`.
4. If Claude Code has already closed the stdio pipe (e.g. rapid reconnect, MCP restart), stdout/stderr are already broken.
5. `console.error()` writes to the stderr socket via Node internals; with the pipe broken, the write fails with EPIPE.
6. The EPIPE is delivered as a stream `'error'` event with no listener attached, so Node.js escalates it to an uncaught exception.
7. `registerFatalHandlers()` was already called, so the `uncaughtException` handler fires and calls `fatalExit()`.
8. `fatalExit()` writes to crash.log and calls `process.exit(1)`.
9. The primary dies. The bridge detects the death and restarts it. Loop.

### Why `wireStdoutShutdown()` doesn't save it

`wireStdoutShutdown()` guards `process.stdout` against EPIPE. But:
- The EPIPE is on `process.stderr` (from `console.error()`), not `process.stdout`.
- The crash happens inside `HttpServer.start()`, which is called from `composeServer()`, which is called before `wireStdoutShutdown()` is even registered.

The guard only covers `stdout` and is registered after `composeServer()` completes. HttpServer's `console.error()` calls during startup (in `printBanner`, `reclaimStaleLock`, `start`) race against a broken stderr pipe.

### Why it affects only the global install

The local dev build has some functions already converted to `process.stderr.write()` with try/catch (e.g. `printBanner` at line 918 in the source). But the global npm binary (`/opt/homebrew/lib/node_modules/@exaudeus/workrail`) is an older compiled version that still uses `console.error()` in those paths -- confirmed by the crash stack pointing to line numbers that don't match the current source.

**This is a version skew issue.** The local source has partially fixed the pattern but the globally-installed binary has not been rebuilt/reinstalled.

### Secondary finding: `console.error()` in `setupPrimaryCleanup`

Even in the current source, `setupPrimaryCleanup()` still uses `console.error()` at line 799 (`[Dashboard] Primary shutting down (sync cleanup)`) and line 810. These are inside signal handlers that can fire when stderr is already broken. They are not yet guarded.

---

## Problem Frame Packet

**The real question**: Is this a deployment/version issue (global binary is stale) or a latent code bug that will resurface even after updating?

**Answer**: Both.

- **Immediate cause**: The globally-installed `@exaudeus/workrail` is an older build that has not received the `process.stderr.write()` hardening applied to the source. Reinstalling from the current source would eliminate the known crashes.

- **Latent bug**: Even in the current source, `setupPrimaryCleanup()` still uses `console.error()` in synchronous signal/exit handlers. Any signal arriving while stderr is broken will crash the process. This was not caught because `setupPrimaryCleanup()` runs after `wireStdoutShutdown()` -- but `wireStdoutShutdown()` only covers stdout, not stderr.

- **Deeper structural issue**: The stdio-entry.ts + HttpServer pattern calls `HttpServer.start()` as part of `composeServer()`, which happens before any I/O guards are installed. Any `console.error()` call inside that synchronous startup path is unguarded against a pre-broken stderr pipe.

---

## Candidate Directions

### Direction A -- Rebuild and reinstall the global binary (immediate fix)

Rebuild the compiled dist from the current source (which has most `process.stderr.write()` hardening applied) and reinstall with `npm install -g`. This eliminates the known crash paths in `printBanner`, `reclaimStaleLock`, and `start()`.

**Risk**: Does not address the remaining `console.error()` calls in `setupPrimaryCleanup`.
**Effort**: Low (5 min).

### Direction B -- Audit and convert all remaining `console.error()` calls in HttpServer startup/signal paths

A systematic grep for `console.error` in `HttpServer.ts` and `shutdown-hooks.ts` inside paths that run before the process is fully started or inside signal handlers. Convert each to `try { process.stderr.write(...); } catch { /* ignore */ }`.

**Risk**: Low. Well-established pattern already used elsewhere in the codebase.
**Effort**: Medium (1-2 hours).

### Direction C -- Guard stderr itself against EPIPE (parallel to stdout guard)

Add a `process.stderr.on('error', ...)` handler in `registerFatalHandlers()` or `wireStdoutShutdown()` that swallows EPIPE without crashing. This would make the guard transport-agnostic and prevent any future `console.error()` from causing a fatal crash.

**Risk**: Must be careful not to suppress genuine stderr errors. Only EPIPE/ERR_STREAM_DESTROYED should be swallowed (same as the stdout guard).
**Effort**: Low (30 min). High leverage.

**Recommended direction**: C first (systemic fix, low effort), then B (belt-and-suspenders for existing calls), then A (deploy).

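Direction C can be sketched in a few lines. The function name mirrors the `wireStdoutShutdown` pattern described above, but this is an illustrative sketch, not the actual WorkRail source; the callback parameter is a hypothetical hook:

```typescript
// Sketch of a stderr EPIPE guard mirroring wireStdoutShutdown().
// Only broken-pipe-style errors are swallowed; anything else is
// re-thrown so genuine stderr failures still reach the fatal handlers.
const SWALLOWED_CODES = new Set(['EPIPE', 'ERR_STREAM_DESTROYED']);

export function wireStderrShutdown(
  onBrokenPipe: (err: NodeJS.ErrnoException) => void = () => {},
): void {
  process.stderr.on('error', (err: NodeJS.ErrnoException) => {
    if (err.code && SWALLOWED_CODES.has(err.code)) {
      // Pipe is gone: stop treating diagnostics as fatal; optionally
      // request an orderly shutdown via the hook.
      onBrokenPipe(err);
      return;
    }
    throw err; // genuine stderr error: let uncaughtException handlers see it
  });
}
```

Called before `composeServer()`, this makes every later stderr write (including those made internally by `console.error()`) safe against a pre-broken pipe, because the stream's `'error'` events now have a listener and never escalate to `uncaughtException`.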
---

## Challenge Notes

- The symptom ("restarts constantly") was actually the bridge correctly doing its job (reconnecting + respawning). The bridge is healthy. The primary is the patient.
- The crash.log is the primary diagnostic tool here. Without it, this would have been very hard to trace.
- The version skew between the global install and the local source obscured the fact that some of these fixes were already partially applied.

---

## Resolution Notes

**Root cause (precise)**: `process.stderr.write()` emits an `'error'` event (not a thrown exception) when the stderr pipe is broken (EPIPE). No `'error'` event listener exists on `process.stderr`. Node.js escalates the unhandled stream error to `uncaughtException`. `registerFatalHandlers()` catches it and calls `fatalExit()`. The process exits within 750-2100ms of every spawn.

**Why try/catch is ineffective**: The `try { process.stderr.write(...); } catch {}` wrappers already present in `HttpServer.ts` do NOT prevent the crash. Stream error events are asynchronous events, not synchronous throws. A try/catch only catches thrown exceptions. The only effective protection is `process.stderr.on('error', ...)`.

**Why Claude Code cannot receive a local fix**: `claude_desktop_config.json` uses `npx -y @exaudeus/workrail` which fetches the published npm latest version. Any fix must be published to npm. Immediate workaround: change the config to point to the local build.

---

## Decision Log

- Chose `landscape_first` because crash logs give direct evidence, not ambiguity.
- Did not delegate to subagents because all source files and logs fit in context.
- No web access needed; all evidence is local.
- Selected Candidate B (wireStderrShutdown extracted function) over A (inline guard) because: testable via DI, consistent with wireStdoutShutdown pattern, architecturally principled.
- Candidate C (converting console.error calls in HttpServer) is now confirmed OPTIONAL: the stderr event listener protects all stderr writes including those from console.error. try/catch wrappers are security theater for stream errors.
- Key insight from challenge phase: try/catch does not intercept stream 'error' events. This was validated by inspecting line 436 of the global binary (inside try/catch, yet still in the crash.log stack trace).

---

## Final Summary

**The MCP server keeps restarting because the stderr pipe has no error listener.**

When Claude Code reconnects rapidly, the stdio pipe closes before `HttpServer.start()` completes. When `HttpServer.start()` (or `reclaimStaleLock`, `printBanner`, `setupPrimaryCleanup`) writes to stderr while the pipe is broken, `process.stderr` emits an `'error'` event. No listener handles it. Node.js converts it to `uncaughtException`. `registerFatalHandlers()` calls `fatalExit()`. Process exits. Bridge detects death, respawns. Loop.

**The fix** (Candidate B): Add `wireStderrShutdown()` to `shutdown-hooks.ts` (mirror of `wireStdoutShutdown()`), call it in `stdio-entry.ts` BEFORE `composeServer()`. This is 15-20 lines following an existing pattern.

**Immediate workaround**: Change `~/Library/Application Support/Claude/claude_desktop_config.json` `command` from `npx` / args `["-y", "@exaudeus/workrail"]` to point at the local dev build's `dist/index.js` while the fix is being prepared for publish.

**Confidence**: High. The crash mechanism is precisely confirmed by crash.log stack traces, source inspection, and the Node.js stream error event model.

**Files to change**:
1. `src/mcp/transports/shutdown-hooks.ts` -- add `wireStderrShutdown()`
2. `src/mcp/transports/stdio-entry.ts` -- call `wireStderrShutdown()` before `composeServer()`
3. Publish to npm as a patch release

**Supporting artifacts**:
- `docs/design/mcp-server-disconnect-candidates.md` -- full candidate analysis
- `docs/design/mcp-server-disconnect-review.md` -- review findings and residual concerns
package/docs/design/mcp-server-epipe-crash.md (new file)
@@ -0,0 +1,198 @@
# Bug Handoff: WorkRail MCP Server Disconnections

**Date:** 2026-04-18
**Severity:** High (production-impacting, kills live sessions)
**Diagnosis type:** `root_plus_downstream`

---

## Bug Summary

The WorkRail MCP stdio server crashes within 2 seconds of startup due to an unhandled asynchronous EPIPE error on `process.stderr`. When Claude Code closes an MCP connection quickly (e.g., during `/mcp` reconnect), both `stdout` and `stderr` pipes break. The code has `try/catch` guards around all `process.stderr.write()` calls, but those guards only protect against *synchronous* exceptions. The actual EPIPE is delivered as an async `'error'` event on the stderr Socket after the call frame exits. Since no `process.stderr.on('error')` listener exists, Node.js promotes it to `uncaughtException`, which `registerFatalHandlers` catches and routes to `fatalExit()` -> `process.exit(1)`.

The user experiences this as a "mid-session disconnect" because:
1. User does `/mcp` reconnect (or Claude Code auto-reconnects).
2. New MCP server starts, begins `HttpServer.start()` / `tryBecomePrimary()` / `reclaimStaleLock()`.
3. The startup sequence writes to stderr (status messages, lock reclaim notification).
4. If Claude Code's reconnect was fast (common during rapid `/mcp` retries), stderr is already broken when the write is queued.
5. Async EPIPE on stderr -> crash -> user must do `/mcp` again.
6. Repeat until a clean startup window is hit.

---

## Repro Summary

- **Symptom:** Claude Code requires `/mcp` reconnects multiple times per day.
- **Environment:** macOS, Claude Code, WorkRail installed from npm (`/opt/homebrew/bin/workrail`, v3.32.0), MCP server in stdio mode.
- **Trigger:** User does `/mcp` reconnect, or Claude Code auto-reconnects. If the reconnect is fast enough that Claude Code moves on while the new server is still initializing (~0-2 seconds window), the next `process.stderr.write()` call in the new server crashes it.
- **Evidence:** `~/.workrail/crash.log` contains 15 production crash entries:
  - 100% have `message: "write EPIPE"`
  - 100% have `transport: "stdio"`
  - 100% have `uptimeMs < 2200ms` (all crash during startup)
  - Crash sites: `HttpServer.reclaimStaleLock` (line 436), `HttpServer.printBanner`, `HttpServer.start`, `shutdown-hooks.js:50` -- all inside `try/catch` blocks

---

## Diagnosis: Confirmed Root Cause

**The specific bug:** `process.stderr` has no `'error'` event listener anywhere in the MCP transport layer. This is verifiable:

```
grep -r "stderr.*\.on" src/mcp/transports/ # zero matches
grep -r "stderr.*\.on" src/infrastructure/ # zero matches
node -e "console.log(process.stderr.listenerCount('error'))" # outputs: 0
```
58
+
59
+ `stdout` has protection via `wireStdoutShutdown()` which registers:
60
+ ```ts
61
+ process.stdout.on('error', (err) => { ... shutdownEvents.emit({kind:'shutdown_requested',...}) })
62
+ ```
63
+
64
+ `stderr` has *no equivalent*. The fix is to add one.
65
+
66
+ **Why the `try/catch` does not protect:**
+ Node.js `Socket.write()` (and by extension `process.stderr.write()`) does NOT throw
+ synchronously on EPIPE on macOS. It enqueues the write and returns a boolean. The OS-level
+ EPIPE error arrives asynchronously and is delivered as an `'error'` event on the Socket
+ *outside* the current JavaScript call frame -- beyond the reach of any `try/catch`.
+
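This asynchronous delivery is easy to demonstrate with a plain `Writable` whose sink fails only after the write has been queued (a minimal sketch; `brokenPipe` is a stand-in for `process.stderr`, not WorkRail code):

```ts
import { Writable } from 'node:stream';

// Stand-in for process.stderr: the underlying sink reports EPIPE only
// after the write has been queued and the calling frame has exited.
const brokenPipe = new Writable({
  write(_chunk: unknown, _encoding: BufferEncoding, callback: (error?: Error | null) => void) {
    setImmediate(() => callback(new Error('write EPIPE')));
  },
});

let caughtSynchronously = false;
let asyncErrorSeen = false;

try {
  brokenPipe.write('status message\n'); // returns a boolean; does not throw
} catch {
  caughtSynchronously = true; // never reached -- the failure is not synchronous
}

// Without this listener, Node.js promotes the 'error' event to
// uncaughtException -- the same crash path described above.
brokenPipe.on('error', () => {
  asyncErrorSeen = true;
});
```

With the `'error'` listener in place the process exits cleanly; remove it and the same write crashes the process, mirroring the stderr behavior.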
+ **Cascade after crash:**
+ - `fatalExit()` runs, tries to write to stderr (fails silently), writes crash.log entry.
+ - `process.exit(1)` kills the MCP server.
+ - Claude Code loses the MCP connection.
+ - User does `/mcp` -> new server spawns -> may crash again if the retry is fast.
+ - Eventually a stable startup window is hit and the server persists indefinitely
+   (PID 90392: 1+ day runtime, 46MB RSS, 38 fds -- completely healthy once started).
+
+ ---
+
+ ## Secondary Finding: Bridge Spawn Loop Resilience Regression (H2)
+
+ **Separate issue, not the primary user-facing bug.**
+
+ `~/.workrail/bridge.log` shows 109 bridge sessions (sessions using the
+ bridge/HTTP-primary architecture), all following this exact pattern:
+
+ ```
+ reconnected(attempt:0) -> budget_exhausted(budgetUsed:8, respawnBudget:0) ->
+ spawn_lock_acquired -> spawn_lock_skipped -> spawn_primary x3
+ ```
+
+ **Mechanism:** When the primary dies and the bridge reconnects:
+ 1. Reconnect Loop A starts (state: reconnecting, budget=3).
+ 2. `detect(attempt=0)` finds the primary immediately -> Loop A returns `'reconnected'`.
+ 3. `buildConnectedTransport()` sets state to `'connected'`.
+ 4. Primary dies again immediately (`t.onclose` fires).
+ 5. `t.onclose` sets state to `'reconnecting'` (new state, budget=3) and starts Loop B.
+ 6. Loop A's `.then()` fires: it sees Loop B's `reconnecting` state, logs `reconnected(attempt:0)`
+    (using Loop B's `attempt` field, which is 0), and returns.
+ 7. Loop B runs 8 reconnect attempts with ECONNREFUSED (<1ms each), exhausts budget,
+    cycles through 3 spawn attempts (budget 3->2->1->0), then hits `budget_exhausted`.
+
+ **Effect:** Bridges spend their entire spawn budget in one rapid burst when the primary
+ dies at exactly the wrong moment. The complete absence of `'waiting_for_primary'` events
+ in the log suggests bridges are dying (likely via EPIPE crash) before the wait-loop log
+ entry is written, or that the spawned HTTP primaries are not maintaining stable connections.
+
+ This is a resilience regression but does NOT directly affect current stdio-mode MCP sessions.
+
+ ---
+
+ ## Alternatives Ruled Out
+
+ | Hypothesis | Ruling |
+ |---|---|
+ | Standalone console competing for port 3456 | Graceful fallback to port 3457+, not a crash |
+ | File watchers from standalone console interfering | Different process, no crash mechanism |
+ | Memory exhaustion from conversation logging (PR #528) | Daemon-only feature; MCP server RSS=46MB healthy |
+ | Daemon crash corrupting shared DI singletons | Daemon and MCP server are separate processes with separate DI containers |
+ | EPIPE from daemon writing to MCP stdio | Daemon on port 3200 has no stdio connection to the MCP server |
+
+ ---
+
+ ## High-Level Fix Direction
+
+ ### Fix 1 (Primary -- Critical): Add stderr error listener
+
+ **File:** `src/mcp/transports/fatal-exit.ts`
+ **Where:** Inside `registerFatalHandlers()`, BEFORE any async work, at the very top.
+
+ Add a no-op error handler on `process.stderr` to absorb async EPIPE events:
+ ```ts
+ process.stderr.on('error', () => { /* absorb async EPIPE -- see wireStdoutShutdown for pattern */ });
+ ```
+
+ This mirrors the `wireStdoutShutdown()` pattern that already protects `process.stdout`.
+ The no-op is sufficient because `process.stderr` is write-only diagnostics -- there is
+ nothing to recover. The goal is only to prevent Node.js from promoting the unhandled
+ error event to `uncaughtException`.
+
+ `registerFatalHandlers()` is called first in every entry point (stdio-entry.ts, http-entry.ts,
+ bridge-entry.ts), so registering here protects all transport types.
+
+ **Alternative placement:** Could also go in each entry point before `composeServer()`, but
+ `registerFatalHandlers()` is the single earliest call and the most defensible location.
+
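Sketched in context, the ordering looks roughly like this (hedged: only `registerFatalHandlers` and `fatalExit` are names from this codebase; the body below is illustrative, not the real implementation):

```ts
// Illustrative ordering inside registerFatalHandlers(): guard stderr
// before installing fatal handling, so an async EPIPE on stderr can
// never reach the uncaughtException path at all.
function registerFatalHandlers(): void {
  // 1. Absorb async EPIPE first. stderr carries write-only diagnostics,
  //    so a no-op is sufficient -- there is nothing to recover.
  process.stderr.on('error', () => { /* absorb async EPIPE */ });

  // 2. Existing fatal handling stays as-is: genuine crashes still log
  //    and exit (stand-in for the real fatalExit()).
  process.on('uncaughtException', (err) => {
    console.error('fatal:', err); // real code writes crash.log first
    process.exit(1);
  });
}
```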
+ ### Fix 2 (Secondary -- Medium): Bridge reconnect race condition
+
+ **File:** `src/mcp/transports/bridge-entry.ts`
+ **Issue:** When the primary dies immediately after the bridge connects, the bridge can
+ have two concurrent reconnect loops (A and B). Loop A's outcome handler reads Loop B's
+ state snapshot, causing `reconnected` to be logged for what is actually Loop B's first
+ attempt, and Loop B's budget to be consumed rapidly.
+
+ **Direction:** The `handleReconnectOutcome` guard
+ (`if (stateAtOutcome.kind !== 'reconnecting') return`) should use the state snapshot
+ captured at *loop start*, not re-read at outcome time. Or: ensure `startReconnectLoop()`
+ is idempotent when called from `t.onclose` while a loop is already completing.
+
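One way to realize the loop-start snapshot is a generation counter, sketched below (all names are hypothetical; this is not the real `bridge-entry.ts` API):

```ts
// Each reconnect loop captures a generation token when it starts. An
// outcome whose token is stale belongs to a superseded loop and must do
// nothing: no 'reconnected' log line, no touching the newer loop's budget.
type Outcome = 'reconnected' | 'stale';

let generation = 0;

async function runReconnectLoop(connect: () => Promise<void>): Promise<Outcome> {
  const myGeneration = ++generation; // snapshot at loop start, not at outcome time
  await connect();
  return generation === myGeneration ? 'reconnected' : 'stale';
}
```

If a second loop starts while the first is still connecting, the first resolves `'stale'` and stays silent instead of reporting the second loop's first attempt as its own success.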
+ ---
+
+ ## Likely Files Involved
+
+ - `src/mcp/transports/fatal-exit.ts` -- **primary fix location** (`registerFatalHandlers`)
+ - `src/mcp/transports/shutdown-hooks.ts` -- existing stdout protection pattern to reference
+ - `src/mcp/transports/bridge-entry.ts` -- secondary fix (reconnect loop race)
+ - `src/infrastructure/session/HttpServer.ts` -- call sites that use `process.stderr.write()`
+   (no code change needed here; they work correctly once stderr has an error listener)
+
+ ---
+
+ ## Verification Recommendations
+
+ 1. **Unit test:** Add a test in `fatal-exit.test.ts` or a new `stderr-epipe.test.ts`:
+    - Call `registerFatalHandlers()` on a mock stderr with an `'error'` listener count check.
+    - Confirm that emitting `'error'` on stderr does NOT trigger `uncaughtException`.
+
+ 2. **Manual repro before fix:** Run the `workrail` stdio server and immediately close the pipe
+    (e.g., via `workrail | head -0`) -- this should produce a crash.log EPIPE entry.
+    After the fix, the same command should NOT produce a crash.log entry and should exit cleanly.
+
+ 3. **Crash log regression:** After deploy, confirm `~/.workrail/crash.log` stops receiving
+    `write EPIPE` + `transport: stdio` entries during normal `/mcp` reconnect cycles.
+
+ 4. **Bridge resilience:** Observe `~/.workrail/bridge.log` -- after the bridge fix, expect to see
+    `waiting_for_primary` events appear when the budget is exhausted (currently never logged).
+
+ ---
+
+ ## Residual Uncertainty
+
+ - **Why HTTP primaries spawned by bridges disconnect immediately** (H2 sub-cause): Confirmed
+   they are not crashing (no crash.log entries). Root sub-cause of rapid disconnect (wrong
+   port, StreamableHTTP handshake issue, or another process claiming port 3100) was not
+   directly observable without live instrumentation. This is a secondary issue and does not
+   affect the primary crash fix.