@exaudeus/workrail 3.36.0 → 3.37.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (45)
  1. package/dist/config/config-file.js +2 -0
  2. package/dist/console-ui/assets/{index-n8cJrS4v.js → index-t8Wi304z.js} +1 -1
  3. package/dist/console-ui/index.html +1 -1
  4. package/dist/daemon/workflow-runner.d.ts +1 -0
  5. package/dist/daemon/workflow-runner.js +3 -6
  6. package/dist/infrastructure/session/SessionManager.js +17 -4
  7. package/dist/manifest.json +25 -17
  8. package/dist/trigger/notification-service.d.ts +42 -0
  9. package/dist/trigger/notification-service.js +164 -0
  10. package/dist/trigger/trigger-listener.js +7 -1
  11. package/dist/trigger/trigger-router.d.ts +3 -1
  12. package/dist/trigger/trigger-router.js +4 -1
  13. package/docs/design/agent-behavior-patterns-discovery.md +312 -0
  14. package/docs/design/agent-engine-communication-discovery.md +390 -0
  15. package/docs/design/agent-loop-architecture-alternatives-discovery.md +531 -0
  16. package/docs/design/agent-loop-error-handling-contract.md +238 -0
  17. package/docs/design/complete-step-approach-validation-discovery.md +344 -0
  18. package/docs/design/daemon-stuck-detection-discovery.md +174 -0
  19. package/docs/design/mcp-server-disconnect-discovery.md +245 -0
  20. package/docs/design/mcp-server-epipe-crash.md +198 -0
  21. package/docs/design/notification-design-candidates.md +131 -0
  22. package/docs/design/notification-design-review.md +84 -0
  23. package/docs/design/notification-implementation-plan.md +181 -0
  24. package/docs/design/spawn-agent-failure-modes.md +161 -0
  25. package/docs/design/spawn-agent-result-handling-implementation-plan.md +186 -0
  26. package/docs/design/stdio-simplification-design-candidates.md +341 -0
  27. package/docs/design/stdio-simplification-design-review.md +93 -0
  28. package/docs/design/stdio-simplification-implementation-plan.md +317 -0
  29. package/docs/design/structured-output-tools-coexist-findings.md +288 -0
  30. package/docs/discovery/coordinator-script-design.md +745 -0
  31. package/docs/discovery/coordinator-ux-discovery.md +471 -0
  32. package/docs/discovery/spawn-agent-failure-modes.md +309 -0
  33. package/docs/discovery/workflow-selection-for-discovery-tasks.md +336 -0
  34. package/docs/discovery/worktrain-status-briefing.md +325 -0
  35. package/docs/discovery/worktrain-status-design-candidates.md +202 -0
  36. package/docs/discovery/worktrain-status-design-review-findings.md +86 -0
  37. package/docs/ideas/backlog.md +608 -0
  38. package/docs/ideas/daemon-structured-output-vs-tool-calls.md +344 -0
  39. package/docs/ideas/design-candidates-backlog-consolidation.md +85 -0
  40. package/docs/ideas/design-review-findings-backlog-consolidation.md +39 -0
  41. package/docs/ideas/implementation_plan_backlog_consolidation.md +117 -0
  42. package/docs/plans/authoring-doc-staleness-enforcement-candidates.md +251 -0
  43. package/docs/plans/authoring-doc-staleness-enforcement-review.md +99 -0
  44. package/docs/plans/authoring-doc-staleness-enforcement.md +463 -0
  45. package/package.json +1 -1
package/docs/design/daemon-stuck-detection-discovery.md (new file)
@@ -0,0 +1,174 @@
# Discovery: Daemon Stuck Detection and Visibility

## Context / Ask

Identify what a "stuck" daemon agent looks like in the logs and define the definitive signals for detecting it. This is the prerequisite for implementing visibility improvements (stuck detection events, improved `worktrain logs`, session health summary, console live panel indicator, and richer WORKTRAIN_STUCK markers).

## Path Recommendation

**landscape_first** -- the codebase and event log are fully readable; no architectural reframing needed. The signals are observable facts, not design decisions.

**Rationale:** The ask is empirical: what does stuck look like? The event log for 2026-04-18 has real `issue_reported` events showing actual stuck patterns. The code is fully readable. Discovery + landscaping is sufficient; no candidate comparison needed.

## Constraints / Anti-goals

- Anti-goal: Do not redesign the session event store or merge daemon events into it (separate backlog item #4315).
- Anti-goal: Do not build a watchdog daemon or restart mechanism (out of scope for this phase).
- Constraint: All new events must follow the discriminated union pattern in `daemon-events.ts`.
- Constraint: Stuck detection runs in the `turn_end` subscriber in `runWorkflow()`, not as a new thread.

## Landscape Packet

### Source files reviewed

- `src/daemon/workflow-runner.ts` -- agent loop, timeout logic, turn counter, `report_issue` tool
- `src/daemon/daemon-events.ts` -- all current event kinds (15 events in the union)
- `src/daemon/agent-loop.ts` -- turn_end subscriber, steer, _runLoop
- `~/.workrail/events/daemon/2026-04-18.jsonl` -- 2000+ events from real sessions today
- `docs/ideas/backlog.md` -- relevant sections at lines 3912-3972, 4315-4380

### Current event kinds in DaemonEvent union

1. `daemon_started` -- daemon boot
2. `trigger_fired` -- incoming webhook
3. `session_queued` -- queue entry
4. `session_started` -- agent loop about to begin
5. `tool_called` (coarse stream) -- from inside each tool's execute()
6. `tool_error` -- isError=true tool result in turn_end subscriber
7. `step_advanced` -- onAdvance() fired (continue_workflow succeeded)
8. `session_completed` -- outcome: success|error|timeout
9. `delivery_attempted` -- HTTP callback POST
10. `issue_reported` -- agent called report_issue; has severity + issueKind
11. `llm_turn_started` -- before client.messages.create()
12. `llm_turn_completed` -- after API response; has stopReason, inputTokens, outputTokens, toolNamesRequested
13. `tool_call_started` -- fine-grained; has argsSummary (200 chars JSON)
14. `tool_call_completed` -- fine-grained; has durationMs, resultSummary
15. `tool_call_failed` -- fine-grained; has durationMs, errorMessage

**Missing:** No `agent_stuck` event kind currently exists.
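To make the gap concrete, here is a minimal sketch of how an `agent_stuck` kind could join the existing discriminated union. The field names (`signal`, `detail`, etc.) are illustrative assumptions, not the actual WorkRail types in `daemon-events.ts`:

```typescript
// Hypothetical sketch: an agent_stuck member for the DaemonEvent union.
// All field names below are assumptions for illustration only.
type StuckSignal =
  | 'repeated_tool_call'
  | 'fatal_issue_reported'
  | 'no_step_advance';

interface AgentStuckEvent {
  kind: 'agent_stuck';          // discriminant, like the 15 existing kinds
  sessionId: string;
  signal: StuckSignal;          // which detector fired
  turnCount: number;
  stepAdvanceCount: number;
  detail: string;               // human-readable summary for logs/console
}

// The discriminant lets a subscriber narrow without casts:
function describe(e: AgentStuckEvent | { kind: 'step_advanced' }): string {
  return e.kind === 'agent_stuck'
    ? `stuck(${e.signal}) at turn ${e.turnCount}`
    : 'advanced';
}

console.log(describe({
  kind: 'agent_stuck',
  sessionId: 's1',
  signal: 'no_step_advance',
  turnCount: 40,
  stepAdvanceCount: 0,
  detail: '0 advances after 40 turns',
})); // → "stuck(no_step_advance) at turn 40"
```

The union-member shape is the only structural requirement the constraints above impose; the payload fields would be settled during implementation.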

### Real stuck patterns in 2026-04-18.jsonl (session ea2de6e5)

Session `ea2de6e5` (workrailSessionId: `sess_5hb25pdpq2jqhciznto7vmgaue`) shows a clear real-world stuck pattern:

1. Agent tried to submit `wr.assessment` artifacts 6+ times via `continue_workflow`
2. Each attempt was blocked (the daemon's `continue_workflow` tool lacks an `artifacts` field)
3. Agent escalated: `issue_reported` severity=warn (line 1779) → severity=error (line 1795) → severity=fatal (line 1825) → severity=fatal again (line 1915)
4. The session ended with `session_completed outcome=timeout detail=max_turns` -- it hit the turn limit while still stuck

This is the canonical "blocked_at_assessment_gate" stuck pattern. The agent knows it's stuck and signals it clearly via `report_issue`, but there's no `agent_stuck` event the console or coordinator can watch for.

### Stuck signal taxonomy (from code analysis + log evidence)

**Signal 1: Repeated tool call (same tool + same args)**
- In `tool_call_started` events, `argsSummary` is JSON params truncated to 200 chars.
- If the last 3 `tool_call_started` events for the session have identical `toolName` AND `argsSummary`, the agent is looping.
- Observable from the event log by comparing consecutive `tool_call_started.argsSummary` values.
- Detection point: `turn_end` subscriber in `runWorkflow()`.

**Signal 2: issue_reported with severity=fatal**
- Agent self-diagnoses as stuck and explicitly calls `report_issue` with `severity='fatal'`.
- Confirmed by real log: 2x fatal reports in session ea2de6e5 before max_turns.
- This is the most reliable signal because the agent knows its state.
- Detection point: already emitted as `issue_reported` event; just needs to be surfaced.

**Signal 3: No step advances after N LLM turns**
- `step_advanced` events increment `stepAdvanceCount` (tracked in `onAdvance()` closure).
- `llm_turn_completed` events tracked by `turnCount` variable.
- If `turnCount >= maxTurns * 0.8` AND `stepAdvanceCount == 0`, the session will time out without completing a single step.
- Detection point: `turn_end` subscriber (already has access to `turnCount` and can track `stepAdvanceCount`).

**Signal 4: Tool call failure rate > 50% over last 5 turns**
- `tool_call_failed` events tracked by the `turn_end` subscriber.
- If >50% of tool calls in the last 5 turns failed, something systematic is broken.
- Less reliable than signals 1-3 because short bursts of failure are normal (grep exit 1, missing files).
- Detection point: `turn_end` subscriber with a rolling 5-turn failure rate tracker.

**Signal 5: Wall-clock approaching maxSessionMinutes with < 2 advances**
- `timeoutReason !== null` is set when the wall-clock timeout fires, but that's too late -- the abort already happened.
- Better: check `(Date.now() - sessionStartMs) > sessionTimeoutMs * 0.8` AND `stepAdvanceCount < 2`.
- Detection point: `turn_end` subscriber.

**Signal 6: Blocked attempt chain (assessment gates)**
- When `continue_workflow` returns `kind: 'blocked'`, the tool returns feedback to the LLM.
- PR #554 capped `blocked_attempt` chains at 3. Beyond that it's a fatal block.
- The `issue_reported` events are the reliable proxy for this pattern.

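Signal 1 reduces to a small pure helper: a fixed-size buffer of recent calls plus an equality check. The names here (`RecentToolCalls`, `isLooping`) are illustrative, not the actual workflow-runner internals:

```typescript
// Sketch of Signal 1: keep the last 3 tool calls in a ring buffer and
// flag a loop when toolName + argsSummary repeat 3 times in a row.
interface ToolCall {
  toolName: string;
  argsSummary: string; // JSON params truncated to 200 chars
}

class RecentToolCalls {
  private buf: ToolCall[] = [];
  constructor(private readonly size = 3) {}

  push(call: ToolCall): void {
    this.buf.push(call);
    if (this.buf.length > this.size) this.buf.shift(); // drop oldest
  }

  // True when the buffer is full and every entry is identical.
  isLooping(): boolean {
    if (this.buf.length < this.size) return false;
    const [first, ...rest] = this.buf;
    return rest.every(
      (c) => c.toolName === first.toolName && c.argsSummary === first.argsSummary,
    );
  }
}

const recent = new RecentToolCalls();
recent.push({ toolName: 'continue_workflow', argsSummary: '{"step":"assess"}' });
recent.push({ toolName: 'continue_workflow', argsSummary: '{"step":"assess"}' });
console.log(recent.isLooping()); // false -- only 2 identical calls so far
recent.push({ toolName: 'continue_workflow', argsSummary: '{"step":"assess"}' });
console.log(recent.isLooping()); // true -- 3 identical calls
```

Because `argsSummary` is already truncated to 200 chars, comparing the summaries is cheap and avoids hashing full tool arguments.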
### turn_end subscriber and tracking variables (workflow-runner.ts)

The `turn_end` subscriber (lines 1725-1755) currently:
- Emits `tool_error` events for `isError=true` tool results
- Increments `turnCount`
- Checks `maxTurns` limit and calls `agent.abort()` if hit
- Calls `agent.steer()` with pending step text

State variables accessible in the subscriber via closures:
- `turnCount` -- LLM turn count (already tracked)
- `isComplete` -- workflow complete flag
- `pendingSteerText` -- next step text (null until continue_workflow advances)
- `stepAdvanceCount` -- NOT currently tracked (needs to be added)
- `sessionStartMs` -- NOT currently tracked (needs to be added)
- The event history for "last N tool_call_started" -- NOT currently tracked (needs a ring buffer)

### WORKTRAIN_STUCK current fields (workflow-runner.ts lines 1837-1843)

```json
{
  "reason": "session_error",
  "error": "<first 500 chars of error message>",
  "workflowId": "<id>",
  "sessionId": "<process-local UUID>"
}
```

Missing (per implementation spec): `turnCount`, `stepAdvanceCount`, `lastToolCalled`, `issueSummaries`.
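A sketch of the enriched marker with the four missing fields added. The shape follows the spec text above; the builder function and all values are illustrative assumptions:

```typescript
// Hypothetical enriched WORKTRAIN_STUCK marker. The four new fields come
// straight from the spec list above; everything else matches the current
// marker shape. Values are illustrative.
interface StuckMarker {
  reason: string;
  error: string;            // first 500 chars of the error message
  workflowId: string;
  sessionId: string;
  // New fields per the implementation spec:
  turnCount: number;
  stepAdvanceCount: number;
  lastToolCalled: string | null;
  issueSummaries: string[]; // one entry per issue_reported event
}

function buildStuckMarker(
  base: Pick<StuckMarker, 'reason' | 'error' | 'workflowId' | 'sessionId'>,
  state: Pick<StuckMarker, 'turnCount' | 'stepAdvanceCount' | 'lastToolCalled' | 'issueSummaries'>,
): StuckMarker {
  return { ...base, ...state };
}

const marker = buildStuckMarker(
  { reason: 'session_error', error: 'blocked at assessment gate', workflowId: 'wf1', sessionId: 'uuid' },
  { turnCount: 50, stepAdvanceCount: 0, lastToolCalled: 'continue_workflow',
    issueSummaries: ['warn: gate blocked', 'fatal: cannot submit artifacts'] },
);
console.log(marker.turnCount, marker.issueSummaries.length); // 50 2
```

As the Resolution Notes observe, all four values are already available in closures at the marker's emission point, so this is a pure plumbing change.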

### Backlog references

- Line 3923: "Session liveness detection. If a session has been in_progress for more than N minutes with no advance_recorded events, the daemon watchdog should log a warning and optionally abort the session."
- Line 4229: "report_issue tool -- WORKTRAIN_STUCK marker in WorkflowRunResult"
- Lines 4315-4332: "Agent actions as first-class events" -- worktrain_stuck as a session event kind
- Line 3972: "WORKTRAIN_STUCK routing and coordinator self-healing patterns all depend on logs being structured and complete"

## Problem Frame Packet

**The stuck agent is currently invisible.** An agent can loop for 50 turns hitting assessment gates, call `report_issue` 4 times at increasing severity, and the only external signal is that `session_completed outcome=timeout` eventually fires. There is no `agent_stuck` event. The `worktrain logs` output doesn't distinguish fatal issues from warnings. The console live panel shows no stuck indicator. The WORKTRAIN_STUCK marker has no context about turns used or issues reported.

**Root cause:** stuck detection was deferred to "after the fact" (WORKTRAIN_STUCK in final notes), but the signals exist in real-time (turn_end subscriber has turnCount, tool results, and the onAdvance closure tracks advances).

## Candidate Directions

### Direction A: Minimal -- just emit agent_stuck events (no UI changes)
Add `AgentStuckEvent` to daemon-events.ts, emit in turn_end subscriber on 3 signals. Low effort, observable via raw JSONL.

### Direction B: Full -- 5 improvements as specified
Add stuck events + improve worktrain logs + add `worktrain status` + console panel + richer WORKTRAIN_STUCK. Full visibility stack.

**Recommendation: Direction B.** The 5 improvements are cohesive and each addresses a different visibility gap. The stuck event alone (Direction A) isn't actionable if nothing surfaces it to humans.

## Resolution Notes

All 5 implementation items are well-scoped. The key implementation details:

1. **Stuck detection (workflow-runner.ts):** Add `stepAdvanceCount` and `sessionStartMs` variables alongside `turnCount`. Add a `lastNToolCalls` ring buffer (last 3 `tool_call_started` events). In `turn_end`, check the 3 signals and emit `agent_stuck`.

2. **worktrain logs formatting (cli-worktrain.ts):** `formatDaemonEventLine()` exists and can be extended with new cases. Currently minimal (no step_advanced or llm_turn_completed formatting found -- needs verification).

3. **worktrain status command (cli-worktrain.ts):** New subcommand. Reads the daemon JSONL for a sessionId, aggregates counts, prints health summary. Pure reads, no daemon state required.

4. **Console liveActivity (console-service.ts):** `readLiveActivity()` already reads `tool_called` events. Add `agent_stuck` to the filter.

5. **WORKTRAIN_STUCK enrichment (workflow-runner.ts):** The stuckMarker JSON at line 1837 needs 4 new fields. The data is all available in closures at that point.

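Item 3 amounts to a pure fold over the daemon JSONL. A sketch, with event shapes assumed from the kinds listed earlier in this document (not the actual WorkRail types):

```typescript
// Sketch of the `worktrain status` aggregation: read daemon JSONL lines
// for one session and fold them into a health summary. Event shapes are
// assumptions based on the event-kind list above.
interface DaemonEventLine {
  kind: string;
  sessionId: string;
  severity?: string; // present on issue_reported
}

interface SessionHealth {
  turns: number;
  stepAdvances: number;
  toolFailures: number;
  fatalIssues: number;
}

function summarize(jsonlLines: string[], sessionId: string): SessionHealth {
  const health: SessionHealth = { turns: 0, stepAdvances: 0, toolFailures: 0, fatalIssues: 0 };
  for (const line of jsonlLines) {
    if (!line.trim()) continue;
    const e = JSON.parse(line) as DaemonEventLine;
    if (e.sessionId !== sessionId) continue; // pure read, filter by session
    switch (e.kind) {
      case 'llm_turn_completed': health.turns++; break;
      case 'step_advanced': health.stepAdvances++; break;
      case 'tool_call_failed': health.toolFailures++; break;
      case 'issue_reported': if (e.severity === 'fatal') health.fatalIssues++; break;
    }
  }
  return health;
}

const lines = [
  '{"kind":"llm_turn_completed","sessionId":"ea2de6e5"}',
  '{"kind":"issue_reported","sessionId":"ea2de6e5","severity":"fatal"}',
  '{"kind":"step_advanced","sessionId":"other"}',
];
console.log(summarize(lines, 'ea2de6e5'));
```

Since it only reads the JSONL, the command works on live and completed sessions alike, with no daemon RPC.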
## Decision Log

- Chose `landscape_first` path: the signals are observable facts from code + logs, not design decisions.
- Session ea2de6e5 from 2026-04-18.jsonl confirmed that `issue_reported` severity escalation is the primary real-world stuck signal.
- Ring buffer approach (last 3 `tool_call_started`) preferred over full history scan for repeated-tool detection.
- `stepAdvanceCount` must be tracked separately from `turnCount` (both exist in the subscriber but only `turnCount` is currently maintained).

## Final Summary

A stuck daemon agent currently looks like: many `llm_turn_completed` events with `toolNamesRequested` containing only failed tools, escalating `issue_reported` severity levels (warn → error → fatal), zero `step_advanced` events, and finally `session_completed outcome=timeout detail=max_turns`. The session `ea2de6e5` in today's log is the canonical example: 6 blocked `continue_workflow` attempts, 4 escalating `issue_reported` calls, terminated by max_turns.

The 5 definitive stuck signals are: (1) same tool+args called 3+ times, (2) issue_reported severity=fatal, (3) 0 step advances after 80%+ of turns used, (4) tool failure rate >50% over last 5 turns, (5) wall-clock at 80%+ with <2 advances. These are all detectable in the `turn_end` subscriber using existing state variables plus 2 new counters and a 3-element ring buffer.
package/docs/design/mcp-server-disconnect-discovery.md (new file)
@@ -0,0 +1,245 @@
# MCP Server Disconnect Discovery

## Context / Ask

The WorkRail MCP server keeps dying/disconnecting during active Claude Code sessions. Bridges report repeated reconnects. Claude Code sessions are interrupted. The symptom: "it restarts constantly."

Goal: identify root cause, supporting evidence, and recommended investigation path.

---

## Path Recommendation

**landscape_first** -- the problem is primarily one of understanding what is happening (reading logs, reading code, tracing the crash chain), not of reframing a problem definition. The code and logs give direct evidence of what is actually crashing.

Rationale over alternatives:
- `design_first` would be appropriate if the problem statement were ambiguous. It is not: we have crash logs with stack traces.
- `full_spectrum` would be appropriate if the landscape might be hiding a deeper structural question. The landscape already reveals a clear structural flaw (EPIPE crashing the server before the stdio guard fires), so spectrum-wide reframing is not needed.

---

## Constraints / Anti-goals

- Do not change the MCP client protocol (Claude Code).
- Do not affect session data integrity.
- Do not change the bridge/primary topology for now.
- Anti-goal: do not add speculative defenses if the real crash path is already identified.

---

## Landscape Packet (landscape_first pass)

### What runs

When Claude Code uses WorkRail there are two possible server topologies:

**Single process (first session):**
```
Claude Code (IDE) --stdio--> WorkRail stdio server (stdio-entry.ts)
             + embedded HttpServer (dashboard / MCP HTTP port 3100)
```

**Multi-session (bridge mode):**
```
Claude Code --stdio--> WorkRail bridge (bridge-entry.ts)
             --> HTTP --> WorkRail primary (http-entry.ts, port 3100)
```

### What the crash.log shows

Every entry in `~/.workrail/crash.log` (Apr 16-18, 2026) is the same pattern:

```json
{
  "transport": "stdio",
  "uptimeMs": 750-2100,
  "label": "Uncaught exception",
  "message": "write EPIPE",
  "stack": "Error: write EPIPE\n at afterWriteDispatched...\n at console.error (node:internal/console/constructor:444:26)\n at HttpServer.<method> ..."
}
```

**Key observations:**
1. Every crash is `transport=stdio` -- the MCP stdio primary is dying, not a bridge.
2. Uptime is 750-2100ms -- these processes live less than 2 seconds.
3. The crash is always `write EPIPE` surfacing from `console.error()` inside `HttpServer`.
4. The offending call sites across different crashes:
   - `HttpServer.reclaimStaleLock` (lines 432, 436 in compiled dist)
   - `HttpServer.printBanner` (line 577 in compiled dist)
   - `HttpServer.start` (line 348)
5. All crashes point to the **installed npm global** (`/opt/homebrew/lib/node_modules/@exaudeus/workrail/dist/...`), not the local dev build.

**This means Claude Code is running the globally-installed WorkRail binary, not the local dev build.**

### What the bridge.log shows

The bridge.log for Apr 18 shows a repeating pattern:
```
reconnected -> budget_exhausted (budgetUsed: 8) -> spawn_lock_acquired -> spawn_primary
```

This happens across PIDs 76729, 83046, 96836, 28962 -- multiple bridges repeatedly exhausting their full 8-reconnect budget before entering spawn mode. Each cycle takes ~90 seconds, matching the "constant restarts" symptom.

The bridges are correctly detecting primary death and attempting to respawn. The problem is that the primary keeps dying immediately after spawning.

### Root cause chain

1. Claude Code (IDE) spawns a WorkRail stdio process (the global npm binary).
2. The process starts, begins the `HttpServer.start()` / `tryBecomePrimary()` / `reclaimStaleLock()` path.
3. **Before** `wireStdoutShutdown()` fires (or even before `server.connect(transport)` is called), the HttpServer attempts writes via `console.error()`.
4. If Claude Code has already closed the stdio pipe (e.g. rapid reconnect, MCP restart), stdout/stderr are already broken.
5. `console.error()` writes to the stderr socket via Node internals; with the pipe broken, the write fails with EPIPE.
6. The EPIPE is delivered as a stream `'error'` event with no listener attached, so Node.js escalates it to an uncaught exception.
7. `registerFatalHandlers()` was already called, so the `uncaughtException` handler fires and calls `fatalExit()`.
8. `fatalExit()` writes to crash.log and calls `process.exit(1)`.
9. The primary dies. The bridge detects the death and restarts it. Loop.

### Why `wireStdoutShutdown()` doesn't save it

`wireStdoutShutdown()` guards `process.stdout` against EPIPE. But:
- The EPIPE is on `process.stderr` (from `console.error()`), not `process.stdout`.
- The crash happens inside `HttpServer.start()`, which is called from `composeServer()`, which is called before `wireStdoutShutdown()` is even registered.

The guard only covers `stdout` and is registered after `composeServer()` completes. HttpServer's `console.error()` calls during startup (in `printBanner`, `reclaimStaleLock`, `start`) race against a broken stderr pipe.

### Why it affects only the global install

The local dev build has some functions already converted to `process.stderr.write()` with try/catch (e.g. `printBanner` at line 918 in the source). But the global npm binary (`/opt/homebrew/lib/node_modules/@exaudeus/workrail`) is an older compiled version that still uses `console.error()` in those paths -- confirmed by the crash stack pointing to line numbers that don't match the current source.

**This is a version skew issue.** The local source has partially fixed the pattern but the globally-installed binary has not been rebuilt/reinstalled.

### Secondary finding: `console.error()` in `setupPrimaryCleanup`

Even in the current source, `setupPrimaryCleanup()` still uses `console.error()` at line 799 (`[Dashboard] Primary shutting down (sync cleanup)`) and line 810. These are inside signal handlers that can fire when stderr is already broken. They are not yet guarded.

---

## Problem Frame Packet

**The real question**: Is this a deployment/version issue (global binary is stale) or a latent code bug that will resurface even after updating?

**Answer**: Both.

- **Immediate cause**: The globally-installed `@exaudeus/workrail` is an older build that has not received the `process.stderr.write()` hardening applied to the source. Reinstalling from the current source would eliminate the known crashes.

- **Latent bug**: Even in the current source, `setupPrimaryCleanup()` still uses `console.error()` in synchronous signal/exit handlers. Any signal arriving while stderr is broken will crash the process. This was not caught because `setupPrimaryCleanup()` runs after `wireStdoutShutdown()` -- but `wireStdoutShutdown()` only covers stdout, not stderr.

- **Deeper structural issue**: The stdio-entry.ts + HttpServer pattern calls `HttpServer.start()` as part of `composeServer()`, which happens before any I/O guards are installed. Any `console.error()` call inside that synchronous startup path is unguarded against a pre-broken stderr pipe.

---

## Candidate Directions

### Direction A -- Rebuild and reinstall the global binary (immediate fix)

Rebuild the compiled dist from the current source (which has most `process.stderr.write()` hardening applied) and reinstall with `npm install -g`. This eliminates the known crash paths in `printBanner`, `reclaimStaleLock`, and `start()`.

**Risk**: Does not address the remaining `console.error()` calls in `setupPrimaryCleanup`.
**Effort**: Low (5 min).

### Direction B -- Audit and convert all remaining `console.error()` calls in HttpServer startup/signal paths

A systematic grep for `console.error` in `HttpServer.ts` and `shutdown-hooks.ts` inside paths that run before the process is fully started or inside signal handlers. Convert each to `try { process.stderr.write(...); } catch { /* ignore */ }`.

**Risk**: Low. Well-established pattern already used elsewhere in the codebase.
**Effort**: Medium (1-2 hours).

### Direction C -- Guard stderr itself against EPIPE (parallel to stdout guard)

Add a `process.stderr.on('error', ...)` handler in `registerFatalHandlers()` or `wireStdoutShutdown()` that swallows EPIPE without crashing. This would make the guard transport-agnostic and prevent any future `console.error()` from causing a fatal crash.

**Risk**: Must be careful not to suppress genuine stderr errors. Only EPIPE/ERR_STREAM_DESTROYED should be swallowed (same as the stdout guard).
**Effort**: Low (30 min). High leverage.

**Recommended direction**: C first (systemic fix, low effort), then B (belt-and-suspenders for existing calls), then A (deploy).

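Direction C can be sketched in a few lines. The function name mirrors the `wireStdoutShutdown` pattern described above, but this is an illustrative sketch, not the actual WorkRail source; the callback parameter is a hypothetical hook:

```typescript
// Sketch of a stderr EPIPE guard mirroring wireStdoutShutdown().
// Only broken-pipe-style errors are swallowed; anything else is
// re-thrown so genuine stderr failures still reach the fatal handlers.
const SWALLOWED_CODES = new Set(['EPIPE', 'ERR_STREAM_DESTROYED']);

export function wireStderrShutdown(
  onBrokenPipe: (err: NodeJS.ErrnoException) => void = () => {},
): void {
  process.stderr.on('error', (err: NodeJS.ErrnoException) => {
    if (err.code && SWALLOWED_CODES.has(err.code)) {
      // Pipe is gone: stop treating diagnostics as fatal; optionally
      // request an orderly shutdown via the hook.
      onBrokenPipe(err);
      return;
    }
    throw err; // genuine stderr error: let uncaughtException handlers see it
  });
}
```

Called before `composeServer()`, this makes every later stderr write (including those made internally by `console.error()`) safe against a pre-broken pipe, because the stream's `'error'` events now have a listener and never escalate to `uncaughtException`.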
---

## Challenge Notes

- The symptom ("restarts constantly") was actually the bridge correctly doing its job (reconnecting + respawning). The bridge is healthy. The primary is the patient.
- The crash.log is the primary diagnostic tool here. Without it, this would have been very hard to trace.
- The version skew between the global install and the local source obscured the fact that some of these fixes were already partially applied.

---

## Resolution Notes

**Root cause (precise)**: `process.stderr.write()` emits an `'error'` event (not a thrown exception) when the stderr pipe is broken (EPIPE). No `'error'` event listener exists on `process.stderr`. Node.js escalates the unhandled stream error to `uncaughtException`. `registerFatalHandlers()` catches it and calls `fatalExit()`. The process exits within 750-2100ms of every spawn.

**Why try/catch is ineffective**: The `try { process.stderr.write(...); } catch {}` wrappers already present in `HttpServer.ts` do NOT prevent the crash. Stream error events are asynchronous events, not synchronous throws. A try/catch only catches thrown exceptions. The only effective protection is `process.stderr.on('error', ...)`.

**Why Claude Code cannot receive a local fix**: `claude_desktop_config.json` uses `npx -y @exaudeus/workrail` which fetches the published npm latest version. Any fix must be published to npm. Immediate workaround: change the config to point to the local build.

---

## Decision Log

- Chose `landscape_first` because crash logs give direct evidence, not ambiguity.
- Did not delegate to subagents because all source files and logs fit in context.
- No web access needed; all evidence is local.
- Selected Candidate B (wireStderrShutdown extracted function) over A (inline guard) because: testable via DI, consistent with wireStdoutShutdown pattern, architecturally principled.
- Candidate C (converting console.error calls in HttpServer) is now confirmed OPTIONAL: the stderr event listener protects all stderr writes including those from console.error. try/catch wrappers are security theater for stream errors.
- Key insight from challenge phase: try/catch does not intercept stream 'error' events. This was validated by inspecting line 436 of the global binary (inside try/catch, yet still in the crash.log stack trace).

---

## Final Summary

**The MCP server keeps restarting because the stderr pipe has no error listener.**

When Claude Code reconnects rapidly, the stdio pipe closes before `HttpServer.start()` completes. When `HttpServer.start()` (or `reclaimStaleLock`, `printBanner`, `setupPrimaryCleanup`) writes to stderr while the pipe is broken, `process.stderr` emits an `'error'` event. No listener handles it. Node.js converts it to `uncaughtException`. `registerFatalHandlers()` calls `fatalExit()`. Process exits. Bridge detects death, respawns. Loop.

**The fix** (Candidate B): Add `wireStderrShutdown()` to `shutdown-hooks.ts` (mirror of `wireStdoutShutdown()`), call it in `stdio-entry.ts` BEFORE `composeServer()`. This is 15-20 lines following an existing pattern.

**Immediate workaround**: Change `~/Library/Application Support/Claude/claude_desktop_config.json` `command` from `npx` / args `["-y", "@exaudeus/workrail"]` to point at the local dev build's `dist/index.js` while the fix is being prepared for publish.

**Confidence**: High. The crash mechanism is precisely confirmed by crash.log stack traces, source inspection, and the Node.js stream error event model.

**Files to change**:
1. `src/mcp/transports/shutdown-hooks.ts` -- add `wireStderrShutdown()`
2. `src/mcp/transports/stdio-entry.ts` -- call `wireStderrShutdown()` before `composeServer()`
3. Publish to npm as a patch release

**Supporting artifacts**:
- `docs/design/mcp-server-disconnect-candidates.md` -- full candidate analysis
- `docs/design/mcp-server-disconnect-review.md` -- review findings and residual concerns
package/docs/design/mcp-server-epipe-crash.md (new file)
@@ -0,0 +1,198 @@
# Bug Handoff: WorkRail MCP Server Disconnections

**Date:** 2026-04-18
**Severity:** High (production-impacting, kills live sessions)
**Diagnosis type:** `root_plus_downstream`

---

## Bug Summary

The WorkRail MCP stdio server crashes within 2 seconds of startup due to an unhandled asynchronous EPIPE error on `process.stderr`. When Claude Code closes an MCP connection quickly (e.g., during `/mcp` reconnect), both `stdout` and `stderr` pipes break. The code has `try/catch` guards around all `process.stderr.write()` calls, but those guards only protect against *synchronous* exceptions. The actual EPIPE is delivered as an async `'error'` event on the stderr Socket after the call frame exits. Since no `process.stderr.on('error')` listener exists, Node.js promotes it to `uncaughtException`, which `registerFatalHandlers` catches and routes to `fatalExit()` -> `process.exit(1)`.

The user experiences this as a "mid-session disconnect" because:
1. User does `/mcp` reconnect (or Claude Code auto-reconnects).
2. New MCP server starts, begins `HttpServer.start()` / `tryBecomePrimary()` / `reclaimStaleLock()`.
3. The startup sequence writes to stderr (status messages, lock reclaim notification).
4. If Claude Code's reconnect was fast (common during rapid `/mcp` retries), stderr is already broken when the write is queued.
5. Async EPIPE on stderr -> crash -> user must do `/mcp` again.
6. Repeat until a clean startup window is hit.

---

## Repro Summary

- **Symptom:** Claude Code requires `/mcp` reconnects multiple times per day.
- **Environment:** macOS, Claude Code, WorkRail installed from npm (`/opt/homebrew/bin/workrail`, v3.32.0), MCP server in stdio mode.
- **Trigger:** User does `/mcp` reconnect, or Claude Code auto-reconnects. If the reconnect is fast enough that Claude Code moves on while the new server is still initializing (~0-2 seconds window), the next `process.stderr.write()` call in the new server crashes it.
- **Evidence:** `~/.workrail/crash.log` contains 15 production crash entries:
  - 100% have `message: "write EPIPE"`
  - 100% have `transport: "stdio"`
  - 100% have `uptimeMs < 2200ms` (all crash during startup)
  - Crash sites: `HttpServer.reclaimStaleLock` (line 436), `HttpServer.printBanner`, `HttpServer.start`, `shutdown-hooks.js:50` -- all inside `try/catch` blocks

---

## Diagnosis: Confirmed Root Cause

**The specific bug:** `process.stderr` has no `'error'` event listener anywhere in the MCP transport layer. This is verifiable:

```
grep -r "stderr.*\.on" src/mcp/transports/ # zero matches
grep -r "stderr.*\.on" src/infrastructure/ # zero matches
node -e "console.log(process.stderr.listenerCount('error'))" # outputs: 0
```
58
+
59
+ `stdout` has protection via `wireStdoutShutdown()` which registers:
60
+ ```ts
61
+ process.stdout.on('error', (err) => { ... shutdownEvents.emit({kind:'shutdown_requested',...}) })
62
+ ```
63
+
64
+ `stderr` has *no equivalent*. The fix is to add one.
65
+
66
+ **Why the `try/catch` does not protect:**
+ Node.js `Socket.write()` (and by extension `process.stderr.write()`) does NOT throw
+ synchronously on EPIPE on macOS. It enqueues the write and returns a boolean. The OS-level
+ EPIPE error arrives asynchronously and is delivered as an `'error'` event on the Socket
+ *outside* the current JavaScript call frame -- beyond the reach of any `try/catch`.
+
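This asynchronous delivery is easy to demonstrate with a plain `Writable` whose sink fails only after the write has been queued (a minimal sketch; `brokenPipe` is a stand-in for `process.stderr`, not WorkRail code):

```ts
import { Writable } from 'node:stream';

// Stand-in for process.stderr: the underlying sink reports EPIPE only
// after the write has been queued and the calling frame has exited.
const brokenPipe = new Writable({
  write(_chunk: unknown, _encoding: BufferEncoding, callback: (error?: Error | null) => void) {
    setImmediate(() => callback(new Error('write EPIPE')));
  },
});

let caughtSynchronously = false;
let asyncErrorSeen = false;

try {
  brokenPipe.write('status message\n'); // returns a boolean; does not throw
} catch {
  caughtSynchronously = true; // never reached -- the failure is not synchronous
}

// Without this listener, Node.js promotes the 'error' event to
// uncaughtException -- the same crash path described above.
brokenPipe.on('error', () => {
  asyncErrorSeen = true;
});
```

With the `'error'` listener in place the process exits cleanly; remove it and the same write crashes the process, mirroring the stderr behavior.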
+ **Cascade after crash:**
+ - `fatalExit()` runs, tries to write to stderr (fails silently), writes crash.log entry.
+ - `process.exit(1)` kills the MCP server.
+ - Claude Code loses the MCP connection.
+ - User does `/mcp` -> new server spawns -> may crash again if the retry is fast.
+ - Eventually a stable startup window is hit and the server persists indefinitely
+   (PID 90392: 1+ day runtime, 46MB RSS, 38 fds -- completely healthy once started).
+
+ ---
+
+ ## Secondary Finding: Bridge Spawn Loop Resilience Regression (H2)
+
+ **Separate issue, not the primary user-facing bug.**
+
+ `~/.workrail/bridge.log` shows 109 bridge sessions (sessions using the
+ bridge/HTTP-primary architecture), all following this exact pattern:
+
+ ```
+ reconnected(attempt:0) -> budget_exhausted(budgetUsed:8, respawnBudget:0) ->
+ spawn_lock_acquired -> spawn_lock_skipped -> spawn_primary x3
+ ```
+
+ **Mechanism:** When the primary dies and the bridge reconnects:
+ 1. Reconnect Loop A starts (state: reconnecting, budget=3).
+ 2. `detect(attempt=0)` finds the primary immediately -> Loop A returns `'reconnected'`.
+ 3. `buildConnectedTransport()` sets state to `'connected'`.
+ 4. Primary dies again immediately (`t.onclose` fires).
+ 5. `t.onclose` sets state to `'reconnecting'` (new state, budget=3) and starts Loop B.
+ 6. Loop A's `.then()` fires: it sees Loop B's `reconnecting` state, logs `reconnected(attempt:0)`
+    (using Loop B's `attempt` field, which is 0), and returns.
+ 7. Loop B runs 8 reconnect attempts with ECONNREFUSED (<1ms each), exhausts budget,
+    cycles through 3 spawn attempts (budget 3->2->1->0), then hits `budget_exhausted`.
+
+ **Effect:** Bridges spend their entire spawn budget in one rapid burst when the primary
+ dies at exactly the wrong moment. The complete absence of `'waiting_for_primary'` events
+ in the log suggests bridges are dying (likely via EPIPE crash) before the wait-loop log
+ entry is written, or that the spawned HTTP primaries are not maintaining stable connections.
+
+ This is a resilience regression but does NOT directly affect current stdio-mode MCP sessions.
+
+ ---
+
+ ## Alternatives Ruled Out
+
+ | Hypothesis | Ruling |
+ |---|---|
+ | Standalone console competing for port 3456 | Graceful fallback to port 3457+, not a crash |
+ | File watchers from standalone console interfering | Different process, no crash mechanism |
+ | Memory exhaustion from conversation logging (PR #528) | Daemon-only feature; MCP server RSS=46MB healthy |
+ | Daemon crash corrupting shared DI singletons | Daemon and MCP server are separate processes with separate DI containers |
+ | EPIPE from daemon writing to MCP stdio | Daemon on port 3200 has no stdio connection to the MCP server |
+
+ ---
+
+ ## High-Level Fix Direction
+
+ ### Fix 1 (Primary -- Critical): Add stderr error listener
+
+ **File:** `src/mcp/transports/fatal-exit.ts`
+ **Where:** Inside `registerFatalHandlers()`, BEFORE any async work, at the very top.
+
+ Add a no-op error handler on `process.stderr` to absorb async EPIPE events:
+ ```ts
+ process.stderr.on('error', () => { /* absorb async EPIPE -- see wireStdoutShutdown for pattern */ });
+ ```
+
+ This mirrors the `wireStdoutShutdown()` pattern that already protects `process.stdout`.
+ The no-op is sufficient because `process.stderr` is write-only diagnostics -- there is
+ nothing to recover. The goal is only to prevent Node.js from promoting the unhandled
+ error event to `uncaughtException`.
+
+ `registerFatalHandlers()` is called first in every entry point (stdio-entry.ts, http-entry.ts,
+ bridge-entry.ts), so registering here protects all transport types.
+
+ **Alternative placement:** Could also go in each entry point before `composeServer()`, but
+ `registerFatalHandlers()` is the single earliest call and the most defensible location.
+
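Sketched in context, the ordering looks roughly like this (hedged: only `registerFatalHandlers` and `fatalExit` are names from this codebase; the body below is illustrative, not the real implementation):

```ts
// Illustrative ordering inside registerFatalHandlers(): guard stderr
// before installing fatal handling, so an async EPIPE on stderr can
// never reach the uncaughtException path at all.
function registerFatalHandlers(): void {
  // 1. Absorb async EPIPE first. stderr carries write-only diagnostics,
  //    so a no-op is sufficient -- there is nothing to recover.
  process.stderr.on('error', () => { /* absorb async EPIPE */ });

  // 2. Existing fatal handling stays as-is: genuine crashes still log
  //    and exit (stand-in for the real fatalExit()).
  process.on('uncaughtException', (err) => {
    console.error('fatal:', err); // real code writes crash.log first
    process.exit(1);
  });
}
```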
+ ### Fix 2 (Secondary -- Medium): Bridge reconnect race condition
+
+ **File:** `src/mcp/transports/bridge-entry.ts`
+ **Issue:** When the primary dies immediately after the bridge connects, the bridge can
+ have two concurrent reconnect loops (A and B). Loop A's outcome handler reads Loop B's
+ state snapshot, causing `reconnected` to be logged for what is actually Loop B's first
+ attempt, and Loop B's budget to be consumed rapidly.
+
+ **Direction:** The `handleReconnectOutcome` guard
+ (`if (stateAtOutcome.kind !== 'reconnecting') return`) should use the state snapshot
+ captured at *loop start*, not re-read at outcome time. Or: ensure `startReconnectLoop()`
+ is idempotent when called from `t.onclose` while a loop is already completing.
+
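One way to realize the loop-start snapshot is a generation counter, sketched below (all names are hypothetical; this is not the real `bridge-entry.ts` API):

```ts
// Each reconnect loop captures a generation token when it starts. An
// outcome whose token is stale belongs to a superseded loop and must do
// nothing: no 'reconnected' log line, no touching the newer loop's budget.
type Outcome = 'reconnected' | 'stale';

let generation = 0;

async function runReconnectLoop(connect: () => Promise<void>): Promise<Outcome> {
  const myGeneration = ++generation; // snapshot at loop start, not at outcome time
  await connect();
  return generation === myGeneration ? 'reconnected' : 'stale';
}
```

If a second loop starts while the first is still connecting, the first resolves `'stale'` and stays silent instead of reporting the second loop's first attempt as its own success.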
+ ---
+
+ ## Likely Files Involved
+
+ - `src/mcp/transports/fatal-exit.ts` -- **primary fix location** (`registerFatalHandlers`)
+ - `src/mcp/transports/shutdown-hooks.ts` -- existing stdout protection pattern to reference
+ - `src/mcp/transports/bridge-entry.ts` -- secondary fix (reconnect loop race)
+ - `src/infrastructure/session/HttpServer.ts` -- call sites that use `process.stderr.write()`
+   (no code change needed here; they work correctly once stderr has an error listener)
+
+ ---
+
+ ## Verification Recommendations
+
+ 1. **Unit test:** Add a test in `fatal-exit.test.ts` or a new `stderr-epipe.test.ts`:
+    - Call `registerFatalHandlers()` on a mock stderr with an `'error'` listener count check.
+    - Confirm that emitting `'error'` on stderr does NOT trigger `uncaughtException`.
+
+ 2. **Manual repro before fix:** Run the `workrail` stdio server and immediately close the pipe
+    (e.g., via `workrail | head -0`) -- this should produce a crash.log EPIPE entry.
+    After the fix, the same command should NOT produce a crash.log entry and should exit cleanly.
+
+ 3. **Crash log regression:** After deploy, confirm `~/.workrail/crash.log` stops receiving
+    `write EPIPE` + `transport: stdio` entries during normal `/mcp` reconnect cycles.
+
+ 4. **Bridge resilience:** Observe `~/.workrail/bridge.log` -- after the bridge fix, expect to see
+    `waiting_for_primary` events appear when the budget is exhausted (currently never logged).
+
+ ---
+
+ ## Residual Uncertainty
+
+ - **Why HTTP primaries spawned by bridges disconnect immediately** (H2 sub-cause): Confirmed
+   they are not crashing (no crash.log entries). Root sub-cause of rapid disconnect (wrong
+   port, StreamableHTTP handshake issue, or another process claiming port 3100) was not
+   directly observable without live instrumentation. This is a secondary issue and does not
+   affect the primary crash fix.