@exaudeus/workrail 3.27.0 → 3.29.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dist/console/assets/{index-FtTaDku8.js → index-BZ6HkxGf.js} +1 -1
- package/dist/console/index.html +1 -1
- package/dist/manifest.json +3 -3
- package/docs/README.md +57 -0
- package/docs/adrs/001-hybrid-storage-backend.md +38 -0
- package/docs/adrs/002-four-layer-context-classification.md +38 -0
- package/docs/adrs/003-checkpoint-trigger-strategy.md +35 -0
- package/docs/adrs/004-opt-in-encryption-strategy.md +36 -0
- package/docs/adrs/005-agent-first-workflow-execution-tokens.md +105 -0
- package/docs/adrs/006-append-only-session-run-event-log.md +76 -0
- package/docs/adrs/007-resume-and-checkpoint-only-sessions.md +51 -0
- package/docs/adrs/008-blocked-nodes-architectural-upgrade.md +178 -0
- package/docs/adrs/009-bridge-mode-single-instance-mcp.md +195 -0
- package/docs/adrs/010-release-pipeline.md +89 -0
- package/docs/architecture/README.md +7 -0
- package/docs/architecture/refactor-audit.md +364 -0
- package/docs/authoring-v2.md +527 -0
- package/docs/authoring.md +873 -0
- package/docs/changelog-recent.md +201 -0
- package/docs/configuration.md +505 -0
- package/docs/ctc-mcp-proposal.md +518 -0
- package/docs/design/README.md +22 -0
- package/docs/design/agent-cascade-protocol.md +96 -0
- package/docs/design/autonomous-console-design-candidates.md +253 -0
- package/docs/design/autonomous-console-design-review.md +111 -0
- package/docs/design/autonomous-platform-mvp-discovery.md +525 -0
- package/docs/design/claude-code-source-deep-dive.md +713 -0
- package/docs/design/console-cyberpunk-ui-discovery.md +504 -0
- package/docs/design/console-execution-trace-candidates-final.md +160 -0
- package/docs/design/console-execution-trace-candidates.md +211 -0
- package/docs/design/console-execution-trace-design-candidates-v2.md +113 -0
- package/docs/design/console-execution-trace-design-review.md +74 -0
- package/docs/design/console-execution-trace-discovery.md +394 -0
- package/docs/design/console-execution-trace-final-review.md +77 -0
- package/docs/design/console-execution-trace-review.md +92 -0
- package/docs/design/console-performance-discovery.md +415 -0
- package/docs/design/console-ui-backlog.md +280 -0
- package/docs/design/daemon-architecture-discovery.md +853 -0
- package/docs/design/daemon-design-candidates.md +318 -0
- package/docs/design/daemon-design-review-findings.md +119 -0
- package/docs/design/daemon-engine-design-candidates.md +210 -0
- package/docs/design/daemon-engine-design-review.md +131 -0
- package/docs/design/daemon-execution-engine-discovery.md +280 -0
- package/docs/design/daemon-gap-analysis.md +554 -0
- package/docs/design/daemon-owns-console-plan.md +168 -0
- package/docs/design/daemon-owns-console-review.md +91 -0
- package/docs/design/daemon-owns-console.md +195 -0
- package/docs/design/data-model-erd.md +11 -0
- package/docs/design/design-candidates-consolidate-dev-staleness.md +98 -0
- package/docs/design/design-candidates-walk-cache-depth-limit.md +80 -0
- package/docs/design/design-review-consolidate-dev-staleness.md +54 -0
- package/docs/design/design-review-walk-cache-depth-limit.md +48 -0
- package/docs/design/implementation-plan-consolidate-dev-staleness.md +142 -0
- package/docs/design/implementation-plan-walk-cache-depth-limit.md +141 -0
- package/docs/design/layer3b-ghost-nodes-design-candidates.md +229 -0
- package/docs/design/layer3b-ghost-nodes-design-review.md +93 -0
- package/docs/design/layer3b-ghost-nodes-implementation-plan.md +219 -0
- package/docs/design/list-workflows-latency-fix-plan.md +128 -0
- package/docs/design/list-workflows-latency-fix-review.md +55 -0
- package/docs/design/list-workflows-latency-fix.md +109 -0
- package/docs/design/native-context-management-api.md +11 -0
- package/docs/design/performance-sweep-2026-04.md +96 -0
- package/docs/design/routines-guide.md +219 -0
- package/docs/design/sequence-diagrams.md +11 -0
- package/docs/design/subagent-design-principles.md +220 -0
- package/docs/design/temporal-patterns-design-candidates.md +312 -0
- package/docs/design/temporal-patterns-design-review-findings.md +163 -0
- package/docs/design/test-isolation-from-config-file.md +335 -0
- package/docs/design/v2-core-design-locks.md +2746 -0
- package/docs/design/v2-lock-registry.json +734 -0
- package/docs/design/workflow-authoring-v2.md +1044 -0
- package/docs/design/workflow-docs-spec.md +218 -0
- package/docs/design/workflow-extension-points.md +687 -0
- package/docs/design/workrail-auto-trigger-system.md +359 -0
- package/docs/design/workrail-config-file-discovery.md +513 -0
- package/docs/docker.md +110 -0
- package/docs/generated/v2-lock-closure-plan.md +26 -0
- package/docs/generated/v2-lock-coverage.json +797 -0
- package/docs/generated/v2-lock-coverage.md +177 -0
- package/docs/ideas/backlog.md +3927 -0
- package/docs/ideas/design-candidates-mcp-resilience.md +208 -0
- package/docs/ideas/design-review-findings-mcp-resilience.md +119 -0
- package/docs/ideas/implementation_plan.md +249 -0
- package/docs/ideas/third-party-workflow-setup-design-thinking.md +1948 -0
- package/docs/implementation/02-architecture.md +316 -0
- package/docs/implementation/04-testing-strategy.md +124 -0
- package/docs/implementation/09-simple-workflow-guide.md +835 -0
- package/docs/implementation/13-advanced-validation-guide.md +874 -0
- package/docs/implementation/README.md +21 -0
- package/docs/integrations/claude-code.md +300 -0
- package/docs/integrations/firebender.md +315 -0
- package/docs/migration/v0.1.0.md +147 -0
- package/docs/naming-conventions.md +45 -0
- package/docs/planning/README.md +104 -0
- package/docs/planning/github-ticketing-playbook.md +195 -0
- package/docs/plans/README.md +24 -0
- package/docs/plans/agent-managed-ticketing-design.md +605 -0
- package/docs/plans/agentic-orchestration-roadmap.md +112 -0
- package/docs/plans/assessment-gates-engine-handoff.md +536 -0
- package/docs/plans/content-coherence-and-references.md +151 -0
- package/docs/plans/library-extraction-plan.md +340 -0
- package/docs/plans/mr-review-workflow-redesign.md +1451 -0
- package/docs/plans/native-context-management-epic.md +11 -0
- package/docs/plans/perf-fixes-design-candidates.md +225 -0
- package/docs/plans/perf-fixes-design-review-findings.md +61 -0
- package/docs/plans/perf-fixes-new-issues-candidates.md +264 -0
- package/docs/plans/perf-fixes-new-issues-review.md +110 -0
- package/docs/plans/prompt-fragments.md +53 -0
- package/docs/plans/ui-ux-workflow-design-candidates.md +120 -0
- package/docs/plans/ui-ux-workflow-discovery.md +100 -0
- package/docs/plans/ui-ux-workflow-review.md +48 -0
- package/docs/plans/v2-followup-enhancements.md +587 -0
- package/docs/plans/workflow-categories-candidates.md +105 -0
- package/docs/plans/workflow-categories-discovery.md +110 -0
- package/docs/plans/workflow-categories-review.md +51 -0
- package/docs/plans/workflow-discovery-model-candidates.md +94 -0
- package/docs/plans/workflow-discovery-model-discovery.md +74 -0
- package/docs/plans/workflow-discovery-model-review.md +48 -0
- package/docs/plans/workflow-source-setup-phase-1.md +245 -0
- package/docs/plans/workflow-source-setup-phase-2.md +361 -0
- package/docs/plans/workflow-staleness-detection-candidates.md +104 -0
- package/docs/plans/workflow-staleness-detection-review.md +58 -0
- package/docs/plans/workflow-staleness-detection.md +80 -0
- package/docs/plans/workflow-v2-design.md +69 -0
- package/docs/plans/workflow-v2-roadmap.md +74 -0
- package/docs/plans/workflow-validation-design.md +98 -0
- package/docs/plans/workflow-validation-roadmap.md +108 -0
- package/docs/plans/workrail-platform-vision.md +420 -0
- package/docs/reference/agent-context-cleaner-snippet.md +94 -0
- package/docs/reference/agent-context-guidance.md +140 -0
- package/docs/reference/context-optimization.md +284 -0
- package/docs/reference/example-workflow-repository-template/.github/workflows/validate.yml +125 -0
- package/docs/reference/example-workflow-repository-template/README.md +268 -0
- package/docs/reference/example-workflow-repository-template/workflows/example-workflow.json +80 -0
- package/docs/reference/external-workflow-repositories.md +916 -0
- package/docs/reference/feature-flags-architecture.md +472 -0
- package/docs/reference/feature-flags.md +349 -0
- package/docs/reference/god-tier-workflow-validation.md +272 -0
- package/docs/reference/loop-optimization.md +209 -0
- package/docs/reference/loop-validation.md +176 -0
- package/docs/reference/loops.md +465 -0
- package/docs/reference/mcp-platform-constraints.md +59 -0
- package/docs/reference/recovery.md +88 -0
- package/docs/reference/releases.md +177 -0
- package/docs/reference/troubleshooting.md +105 -0
- package/docs/reference/workflow-execution-contract.md +998 -0
- package/docs/roadmap/README.md +22 -0
- package/docs/roadmap/legacy-planning-status.md +103 -0
- package/docs/roadmap/now-next-later.md +70 -0
- package/docs/roadmap/open-work-inventory.md +389 -0
- package/docs/tickets/README.md +39 -0
- package/docs/tickets/next-up.md +76 -0
- package/docs/workflow-management.md +317 -0
- package/docs/workflow-templates.md +423 -0
- package/docs/workflow-validation.md +184 -0
- package/docs/workflows.md +254 -0
- package/package.json +3 -1
- package/spec/authoring-spec.json +61 -16
- package/workflows/workflow-for-workflows.json +252 -93
- package/workflows/workflow-for-workflows.v2.json +188 -77
@@ -0,0 +1,208 @@

# Design Candidates: MCP Server Resilience

> Raw investigative material for the implementation agent. Honest analysis over polished presentation.

---

## Problem Understanding

### Core Tensions

1. **Catch-all vs. let-it-crash**: A server should catch tool handler exceptions and keep running (MCP server = long-lived service). But truly corrupt state (startup failure, unrecoverable invariant violation) should exit. The current `registerFatalHandlers()` handles both cases identically with `process.exit(1)`. The tension: distinguishing "exception inside one tool call" from "exception in global process state".

2. **Graceful shutdown vs. simplicity**: Adding a graceful shutdown path to `fatalExit()` adds complexity (timeout, async path, potential for the shutdown itself to hang). The current synchronous path is simple and guaranteed to terminate. The tension: correctness (clean teardown) vs. reliability (always exits).

3. **Spawn storm prevention vs. reconnect latency**: Increasing jitter reduces the probability of spawn storms at the cost of longer reconnect delays after a real crash. With 2s jitter, a crashed primary means the user waits ~2s longer per bridge before the first spawn attempt. But with 300ms, all bridges race simultaneously.

4. **Tool handler catch vs. SDK internals**: Adding try/catch at the `setRequestHandler` callback level may interact with how the MCP SDK dispatches errors. If the SDK already catches rejected promises and handles them as protocol errors, the wrapper is redundant. If it doesn't, the try/catch is essential.

### Likely Seam / Real Problem Location

- **Tool handler exceptions**: the seam is `server.ts` line 437, the `setRequestHandler(CallToolRequestSchema, ...)` callback. The symptom (`process.exit`) is in `fatal-exit.ts`, but the fix belongs at the dispatch boundary. The `createHandler()` try/catch is a second inner layer -- good to have, but the outer layer is missing.
- **`registerFatalHandlers()` aggression**: the seam is `fatal-exit.ts` lines 143-145. The fix is not to remove the handler but to make it less catastrophic for exceptions that could have been caught earlier.
- **Spawn storm**: the seam is `bridge-entry.ts` line 190. The fix is surgical: increase the sleep duration and the post-jitter detection retries.
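As a hedged sketch of that surgical fix (helper names here are hypothetical, not the package's actual exports), widening the random window looks like:

```ts
// Illustrative sketch only -- helper names are hypothetical, not the
// package's actual exports. The fix widens the random window before a
// bridge attempts to spawn the primary.
function jitterMs(maxMs: number): number {
  return Math.random() * maxMs; // uniform in [0, maxMs)
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Before the fix: await sleep(jitterMs(300));  -- bridges race within 300ms
// After the fix:  await sleep(jitterMs(2000)); -- attempts spread over 2s
```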
### What Makes This Hard

1. The MCP SDK's error handling behavior for async handler rejections is not documented -- it may or may not convert them to protocol errors.
2. The graceful shutdown path in `fatalExit()` introduces a new failure mode: if `shutdown()` hangs, the process never exits. The timeout must be hard.
3. The spawn storm is a distributed coordination problem -- purely local fixes (jitter) reduce it statistically but can't eliminate it without cross-process coordination (a spawn lock file).
4. A junior developer would add try/catch in `createHandler()` (already done) and declare victory, missing the outer `setRequestHandler` boundary and the `withToolCallTiming()` gap.

---

## Philosophy Constraints

**From `~/CLAUDE.md`:**
- **Errors are data** -- represent failure as values, not exceptions. `createHandler()` already does this; the gap is at the MCP SDK dispatch layer.
- **Validate at boundaries, trust inside** -- the `CallToolRequestSchema` handler is the outermost boundary. Add the catch there.
- **Dependency injection for boundaries** -- the graceful shutdown callback registered via `registerGracefulShutdown()` follows this; transport entry points own their teardown logic.
- **YAGNI with discipline** -- don't add complexity that isn't needed (e.g., a spawn lock file is out of scope).
- **Surface information, don't hide it** -- if something unexpected happens, log it to stderr and crash.log.

**Conflicts:**
- `fatalExit()` uses `process.exit(1)` immediately -- this is at odds with "errors are data", but the comments explain why (re-entrancy risk, sync crash log). For truly uncaught process-level exceptions, crash is correct. The conflict: should EVERY uncaught exception crash? No -- tool handler exceptions should not.
- The mutable module state in `fatal-exit.ts` (`fatalHandlerActive`, `registeredTransport`) conflicts with "immutability by default" but is documented as intentional (last-resort handlers). Any new mutable state must follow the same documented pattern.

---

## Impact Surface

**Files that must change:**
- `src/mcp/server.ts` -- add try/catch around the `CallToolRequestSchema` handler
- `src/mcp/transports/fatal-exit.ts` -- add `registerGracefulShutdown()` + async path
- `src/mcp/transports/bridge-entry.ts` -- increase jitter, increase post-jitter retries

**Files that must stay consistent:**
- `src/mcp/transports/stdio-entry.ts` -- should call `registerGracefulShutdown()` to register `ctx.httpServer?.stop()`
- `src/mcp/transports/http-entry.ts` -- same, register `listener.stop()` + `ctx.httpServer?.stop()`
- `src/mcp/transports/bridge-entry.ts` -- no graceful shutdown needed (the bridge has its own `performShutdown()`)
- `tests/unit/mcp/transports/fatal-exit.test.ts` -- needs updating for new exports and the async path

**Contracts that must remain consistent:**
- `fatalExit(label, reason)` signature -- unchanged
- `registerFatalHandlers(transport)` -- unchanged
- `logStartup(transport, extra?)` -- unchanged
- `McpCallToolResult` shape returned by the new outer catch -- must match `{content: [{type: 'text', text: '...'}], isError: true}`
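As a hedged illustration of that last contract (the interface mirrors the shape stated above; the helper function is hypothetical, not the package's actual code), the outer catch's return value could be built like this:

```ts
// Hypothetical helper illustrating the required result shape; the interface
// mirrors the contract above, not the package's actual type definitions.
interface McpCallToolResult {
  content: Array<{ type: 'text'; text: string }>;
  isError?: boolean;
}

function toInternalErrorResult(err: unknown): McpCallToolResult {
  const message = err instanceof Error ? err.message : String(err);
  return {
    content: [{ type: 'text', text: JSON.stringify({ code: 'INTERNAL_ERROR', message }) }],
    isError: true,
  };
}
```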
---

## Candidates

### Candidate A: Minimal Surgical Fix

**Summary:** Add a single try/catch around the `CallToolRequestSchema` handler body in `server.ts`, increase bridge jitter from `Math.random() * 300` to `Math.random() * 2000`, and increase post-jitter detection retries from 1 to 3. Do not change `fatalExit()`.

**Tensions resolved:** Catches tool handler exceptions before they become unhandled rejections. Reduces spawn storm probability (~6x reduction).
**Tensions accepted:** `fatalExit()` still calls `process.exit(1)` immediately for non-handler exceptions. No graceful shutdown.

**Boundary solved at:** the `server.ts` `CallToolRequestSchema` async callback -- the outermost handler boundary.

**Why this boundary is the best fit:** This is where unhandled rejections originate for tool calls. Catching here prevents the rejection from ever reaching the `process.on('unhandledRejection')` handler.

**Failure mode:** The MCP SDK may have its own error-catching logic that this interferes with (unlikely). If the exception happens in `withToolCallTiming()` itself (not the handler), the catch still fires but returns a generic error with no timing observation -- acceptable.

**Repo-pattern relationship:** Directly adapts `createHandler()`'s try/catch pattern (lines 174-186 of `handler-factory.ts`) one level up. Same pattern, same boundary philosophy.
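A minimal sketch of that outer layer, assuming the SDK accepts an async callback; `dispatch` stands in for the existing handler body and all names are illustrative, not the repo's actual identifiers:

```ts
// Illustrative only: any throw or rejection from the handler body becomes a
// structured error result instead of an unhandled rejection. `dispatch`
// stands in for the existing handler body; names are hypothetical.
type ToolResult = { content: Array<{ type: 'text'; text: string }>; isError?: boolean };

async function guardedCall(dispatch: () => Promise<ToolResult>): Promise<ToolResult> {
  try {
    return await dispatch();
  } catch (err) {
    // The real server would also log the failure to stderr here.
    const message = err instanceof Error ? err.message : String(err);
    return {
      content: [{ type: 'text', text: JSON.stringify({ code: 'INTERNAL_ERROR', message }) }],
      isError: true,
    };
  }
}
```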
**Gains:** Minimal diff, minimal risk, directly addresses the primary failure mode.
**Gives up:** No graceful shutdown improvement. Task requirement 2 ("improve fatal-exit to attempt graceful shutdown") is not satisfied.

**Scope judgment:** Too narrow -- satisfies 2/3 task requirements.

**Philosophy fit:** Honors "errors are data" and "validate at boundaries". Does not honor the explicit graceful shutdown request.

---

### Candidate B: Full Task Coverage -- Outer Catch + Graceful Shutdown + Jitter

**Summary:** Same outer catch as A. Additionally: export `registerGracefulShutdown(fn: () => Promise<void>, timeoutMs: number): void` from `fatal-exit.ts`. `fatalExit()` becomes: write crash log (sync) -> write stderr (sync) -> if a fn is registered, `Promise.race([fn().catch(() => {}), sleep(timeoutMs)]).finally(() => process.exit(1))`, else `process.exit(1)`. Transport entry points (stdio, http) call `registerGracefulShutdown()` after composing the server. The bridge doesn't register one (it has its own `performShutdown()`). Increase jitter as in A.

**Tensions resolved:** All three. Catches tool handler exceptions. Adds bounded graceful shutdown (a 2s timeout guarantees termination). Reduces spawn storm.
**Tensions accepted:** Adds complexity (timeout, new mutable state, new export). The async path in `fatalExit()` is new territory vs. the existing synchronous design.

**Boundary solved at:**
- Outer tool handler catch: `server.ts` (same as A)
- Graceful shutdown: `fatal-exit.ts` -- module-level `let gracefulShutdownFn: (() => Promise<void>) | null = null`; new export `registerGracefulShutdown(fn, timeoutMs)`. The `fatalExit()` body gains an async branch protected by `Promise.race()`.

**Why this boundary is the best fit:** `fatal-exit.ts` is explicitly the last-resort handler for all transports. Adding shutdown registration here means all transports benefit. The re-entrancy guard already prevents double entry into `fatalExit()`.

**Failure mode:** If the graceful shutdown fn throws synchronously (before the promise chain starts), it escapes the `Promise.race()`. Fix: route the `fn()` call through the promise chain: `Promise.race([Promise.resolve().then(() => fn()).catch(() => {}), sleep(timeoutMs)])`. The outer `finally` with `process.exit(1)` is the ultimate guarantee.

**Repo-pattern relationship:** Extends `fatal-exit.ts` with the injection pattern seen in `bridge-entry.ts` (injectable deps). The new module-level mutable state follows the existing documented pattern (`fatalHandlerActive`, `registeredTransport`).

**Gains:** Satisfies all task requirements. Graceful shutdown means the HTTP server closes cleanly on crash, the lock file gets released, and the dashboard doesn't leave stale state.
**Gives up:** More complex. Risk of the async path behaving unexpectedly under the V8 inspector (mitigated by the `Promise.race()` + timeout guarantee).

**Scope judgment:** Best fit -- matches all three explicit asks in the task.

**Philosophy fit:** Honors "errors are data", "dependency injection for boundaries", and "determinism" (the timeout guarantees termination). Minor tension with "immutability by default" (new mutable state -- documented and necessary).

---

### Candidate C: Spawn Lock File for Coordination

**Summary:** Add a `~/.workrail/spawn.lock` file written atomically (`wx` flag) by the first bridge that attempts a spawn. Other bridges check for this lock and skip spawning if it is < 5s old. Same jitter increase as A/B. Adapted from `HttpServer.tryBecomePrimary()` / `reclaimStaleLock()`.

**Tensions resolved:** Eliminates spawn storms by construction. Even with very short jitter, only one bridge holds the spawn lock at a time.
**Tensions accepted:** Adds a new file-system artifact and a new cleanup path; more complex.

**Boundary solved at:** `bridge-entry.ts` `spawnPrimary()` -- before the post-jitter check.

**Failure mode:** If the spawn lock is never cleaned up (the spawner crashes before cleanup), subsequent spawns are blocked for 5s. Mitigated by the TTL check.
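A hedged illustration of the mechanism (the lock path and 5s TTL come from the summary above; the function name and parameterization are hypothetical). Atomic acquisition with the `wx` flag might look like:

```ts
// Illustrative sketch of 'wx'-flag lock acquisition with a TTL check.
// The real lock path would be ~/.workrail/spawn.lock; here it is a parameter.
import { statSync, unlinkSync, writeFileSync } from 'node:fs';

function tryAcquireSpawnLock(lockPath: string, ttlMs = 5000): boolean {
  try {
    // 'wx' fails with EEXIST if the file already exists -- atomic creation.
    writeFileSync(lockPath, String(process.pid), { flag: 'wx' });
    return true;
  } catch {
    // Lock exists: honor it while fresh; reclaim it once stale (TTL expired).
    const ageMs = Date.now() - statSync(lockPath).mtimeMs;
    if (ageMs < ttlMs) return false;
    unlinkSync(lockPath);
    writeFileSync(lockPath, String(process.pid), { flag: 'wx' });
    return true;
  }
}
```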
**Scope judgment:** Too broad -- the task says "increase bridge jitter to prevent spawn storms". C solves a broader coordination problem not specified in the task. Current bridge.log data shows 3-4 spawns within ~500ms; 2s jitter is sufficient to prevent this pattern.

**Philosophy fit:** Honors "make illegal states unrepresentable" (the spawn storm becomes impossible). Violates YAGNI for this task scope.

---

## Comparison and Recommendation

### Comparison Matrix

| Criterion | A (minimal) | B (full) | C (lock file) |
|---|---|---|---|
| Catches tool handler exceptions | Yes | Yes | No |
| Graceful shutdown on fatal exit | No | Yes | No |
| Reduces spawn storm | Yes (~6x) | Yes (~6x) | Eliminates |
| Task requirements satisfied | 2/3 | 3/3 | 0/3 |
| Complexity | Low | Medium | High |
| Risk | Low | Low-medium | Medium |
| Reversibility | Easy | Easy | Harder |
| Repo pattern consistency | Direct | Extended | Adapted |

### Recommendation: Candidate B

B satisfies all three explicit task requirements. The graceful shutdown addition is bounded, safe, and reversible. The `Promise.race()` + hard `process.exit(1)` in `finally` preserves the termination guarantee. The new mutable state in `fatal-exit.ts` follows the existing documented pattern.

**Concrete implementation:**

1. `src/mcp/server.ts`, `CallToolRequestSchema` handler: wrap entire async body in try/catch; on catch, log to stderr and return `{content: [{type: 'text', text: JSON.stringify({code: 'INTERNAL_ERROR', message: '...'})}], isError: true}`.

2. `src/mcp/transports/fatal-exit.ts`:
   - Add `let gracefulShutdownFn: (() => Promise<void>) | null = null` and `let gracefulShutdownTimeoutMs = 2000` at module level.
   - Export `registerGracefulShutdown(fn: () => Promise<void>, timeoutMs?: number): void`.
   - In `fatalExit()`, after the crash log write + stderr write, before `process.exit(1)`:

   ```ts
   if (gracefulShutdownFn !== null) {
     const fn = gracefulShutdownFn;
     Promise.race([
       Promise.resolve().then(() => fn()).catch(() => {}),
       new Promise<void>(resolve => setTimeout(resolve, gracefulShutdownTimeoutMs)),
     ]).finally(() => process.exit(1));
   } else {
     process.exit(1);
   }
   ```

3. `src/mcp/transports/stdio-entry.ts`: after `composeServer()`, call `registerGracefulShutdown(async () => { await ctx.httpServer?.stop(); }, 2000)`.

4. `src/mcp/transports/http-entry.ts`: after `composeServer()`, call `registerGracefulShutdown(async () => { await listener.stop(); await ctx.httpServer?.stop(); }, 2000)`.

5. `src/mcp/transports/bridge-entry.ts`: increase jitter from `Math.random() * 300` to `Math.random() * 2000`. Increase post-jitter detection: `detectHealthyPrimary(port, { retries: 3, baseDelayMs: 500, fetch: deps.fetch })`.
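A self-contained sketch of step 2's mechanics (with the exit side effect injected as a callback so the termination guarantee is observable; all names are illustrative, not the package's actual exports):

```ts
// Illustrative sketch of step 2's mechanics, with the exit side effect
// injected as a callback so the termination guarantee is observable in tests.
type ShutdownFn = () => Promise<void>;

let gracefulShutdownFn: ShutdownFn | null = null;
let gracefulShutdownTimeoutMs = 2000;

function registerGracefulShutdown(fn: ShutdownFn | null, timeoutMs = 2000): void {
  gracefulShutdownFn = fn; // last-writer-wins; null clears (test isolation)
  gracefulShutdownTimeoutMs = timeoutMs;
}

function runFatalTail(exit: () => void): void {
  if (gracefulShutdownFn !== null) {
    const fn = gracefulShutdownFn;
    void Promise.race([
      // Promise.resolve().then(...) converts a synchronous throw from fn()
      // into a rejection the .catch can absorb.
      Promise.resolve().then(() => fn()).catch(() => {}),
      new Promise<void>((resolve) => setTimeout(resolve, gracefulShutdownTimeoutMs)),
    ]).finally(() => exit());
  } else {
    exit();
  }
}
```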
---

## Self-Critique

**Strongest counter-argument against B:**
The existing `fatalExit()` comments explicitly explain why it's synchronous (V8 inspector re-entrancy). Adding an async `Promise.race()` path means Node.js's event loop continues running during the grace period, which could allow other callbacks to fire (including re-entrant `fatalExit()` calls). The re-entrancy guard (`fatalHandlerActive`) is set synchronously at the top, so this is safe -- a second call returns immediately. But the concern is valid: the async path is new territory in a module explicitly designed to be synchronous.

**Why the narrower option (A) lost:**
It satisfies 2/3 requirements. The task description explicitly asks to "improve fatal-exit to attempt graceful shutdown". Leaving this out would be an incomplete implementation.

**What evidence would be required for the broader option (C):**
C would be justified if bridge.log showed many `spawn_primary` events at the same timestamp even with 2s jitter. Current data shows 3-4 spawns within ~500ms -- 2s jitter prevents this. Only justify C if post-B bridge.log still shows storms.

**Assumption that would invalidate B:**
If the MCP SDK already wraps async `setRequestHandler` callbacks and converts rejections to protocol errors, the outer try/catch in `server.ts` is redundant (but harmless). More critically: if the actual production crashes come from somewhere outside tool handler context (timer callbacks, startup code), then neither A nor B prevents them. The crash.log shows only `fatal-exit.test.ts` interference in the visible entries -- we don't have evidence of production handler crashes. The fixes are defensive and correct regardless.

---

## Open Questions for the Main Agent

1. Should `registerGracefulShutdown()` in `fatal-exit.ts` allow _replacing_ a previously registered fn (last-writer-wins), or should it throw on double registration? The transport entry points call `composeServer()` once, so double registration shouldn't happen in production -- but for tests, last-writer-wins is safer.

2. The bridge process has its own `performShutdown()` that handles cleanup. Should the bridge also call `registerGracefulShutdown()`, or does its own shutdown path make this redundant? Recommendation: the bridge should NOT register -- its `performShutdown()` is already called before `process.exit(0)`.

3. For the outer try/catch in `server.ts`: should it use the same `errNotRetryable('INTERNAL_ERROR', ...)` pattern as `createHandler()`, or a simpler static error response? Recommendation: match `createHandler()` for consistency.
@@ -0,0 +1,119 @@

# Design Review Findings: MCP Server Resilience

> Concise, actionable findings for the implementation agent.

---

## Tradeoff Review

| Tradeoff | Verdict | Condition That Changes It |
|---|---|---|
| 300ms->2s jitter + reconnect latency | Acceptable | Only revisit if bridge.log shows >5s reconnect times impacting users |
| New mutable state in fatal-exit.ts | Acceptable | Must be documented with the same comment pattern as the existing mutable state |
| Async event loop active during 3s shutdown | Acceptable | Bounded by the hard exit timer; no correctness risk |
| Graceful shutdown timeout (2s) vs. the HTTP server's own 5s timeout | **Needs fix** | Increase to 3s so the shutdown fn has a real chance to complete |

---

## Failure Mode Review

| Failure Mode | Coverage | Gap | Severity |
|---|---|---|---|
| Shutdown fn hangs | Hard exit timer | None | Covered |
| SDK already handles handler rejections | Try/catch is harmless redundancy | None | Covered |
| 2s jitter + primary startup >2s | 3-retry post-jitter detection | If startup >5.5s, a second spawn is still possible (exits cleanly with EADDRINUSE) | Low |
| Shutdown fn throws synchronously | `Promise.resolve().then(() => fn())` | Must be implemented correctly (not a bare `fn()`) | Must not miss |
| Double registration in tests | Last-writer-wins + null-clear support | `registerGracefulShutdown(null)` needed for test reset | Medium |
| `withToolCallTiming()` throws | Outer try/catch covers it | One timing observation lost (observability only) | Low |

---

## Runner-Up / Simpler Alternative Review

**Candidate A (no graceful shutdown):** Satisfies 2/3 task requirements. Not recommended -- the task explicitly asks for the graceful shutdown improvement.

**Simplified async path:** Replace `Promise.race()` with a dual-path `setTimeout` + promise chain. This is strictly simpler and achieves the same semantics:

```ts
if (gracefulShutdownFn !== null) {
  const fn = gracefulShutdownFn;
  const hardExit = setTimeout(() => process.exit(1), gracefulShutdownTimeoutMs);
  void Promise.resolve()
    .then(() => fn())
    .catch(() => {})
    .finally(() => {
      clearTimeout(hardExit);
      process.exit(1);
    });
} else {
  process.exit(1);
}
```

**Recommendation:** Use the simplified dual-path approach. It is easier to reason about than `Promise.race()` and makes the `clearTimeout` + `process.exit(1)` sequencing explicit.
|
|
54
|
+
|
|
55
|
+
---
|
|
56
|
+
|
|
57
|
+
## Philosophy Alignment
|
|
58
|
+
|
|
59
|
+
| Principle | Status |
|
|
60
|
+
|---|---|
|
|
61
|
+
| Errors are data | Satisfied -- outer try/catch converts throws to structured error values |
|
|
62
|
+
| Validate at boundaries | Satisfied -- catch at outermost dispatch boundary |
|
|
63
|
+
| Dependency injection | Satisfied -- shutdown fn is injected via `registerGracefulShutdown()` |
|
|
64
|
+
| Surface information | Satisfied -- crash log + stderr write before any async work |
|
|
65
|
+
| YAGNI | Satisfied -- no speculative abstractions |
|
|
66
|
+
| Immutability by default | Acceptable tension -- mutable state follows existing documented pattern |
|
|
67
|
+
| Determinism | Acceptable tension -- bounded by hard exit timer (O(seconds)) |
|
|
68
|
+
| Small pure functions | Acceptable tension -- last-resort handlers are inherently impure |
|
|
69
|
+
|
|
70
|
+
---

## Findings

### Red (must fix before implementing)

None.

### Orange (must address, affects correctness)

**O1: Graceful shutdown timeout must be 3s, not 2s.**
`HttpServer.stop()` has an internal 5s `server.close()` timeout. A 2s outer timeout expires inside that window, so the hard exit can fire before `server.close()` ever completes. Use 3s to give the shutdown fn a real chance, while still guaranteeing process exit within a bounded window.

**O2: `registerGracefulShutdown()` must accept `null` to clear the registered fn.**
Signature: `registerGracefulShutdown(fn: (() => Promise<void>) | null, timeoutMs?: number): void`. Required for test isolation. Without this, tests that call `fatalExit()` after a previous test registered a fn will invoke the stale fn.

**O3: Shutdown fn must be called via `Promise.resolve().then(() => fn())`, not `fn()` directly.**
Synchronous throws from `fn()` must be converted to rejected promises before `.catch(() => {})` can handle them. A bare `fn()` call that throws synchronously escapes the catch and propagates as an uncaught exception. This would re-enter `fatalExit()`, which the re-entrancy guard handles -- but it also means the hard exit timer fires without the cleanup completing. Wrapping the call in `Promise.resolve().then(() => fn())` converts sync throws into rejections the chain can handle.
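
A small self-contained demonstration of the difference (illustrative only; `throwingFn` stands in for a registered shutdown fn):

```ts
// Illustrative only: contrast a bare fn() call with the wrapped form.
const throwingFn = (): Promise<void> => {
  throw new Error('sync boom'); // synchronous throw, before any promise exists
};

let escaped = false;
try {
  // Bare call: the throw happens before .catch() can attach, so it propagates.
  void throwingFn().catch(() => {});
} catch {
  escaped = true; // the synchronous throw reached the caller
}

let handled = false;
// Wrapped call: the sync throw becomes a rejected promise that .catch() sees.
void Promise.resolve()
  .then(() => throwingFn())
  .catch(() => {
    handled = true;
  });
```

The bare call sets `escaped`; the wrapped call routes the same throw into the `.catch()` handler.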

### Yellow (should address, affects quality)

**Y1: Update `design-candidates.md` graceful shutdown timeout from 2s to 3s.**
The design document says 2s; the implementation should use 3s per finding O1.

**Y2: Add test-reset documentation to the `fatal-exit.ts` module comment.**
The existing tests manage `fatalHandlerActive` state -- document that tests must also manage `gracefulShutdownFn` state via `registerGracefulShutdown(null)` after each test that registers a fn.

**Y3: Log a warning in `fatalExit()` when graceful shutdown is attempted.**
Something like `[FatalExit] Attempting graceful shutdown (${timeoutMs}ms timeout)` on stderr before starting the async path. This makes the behavior visible in crash scenarios.

---

## Recommended Revisions

1. Use a 3s timeout (not 2s) as the `gracefulShutdownTimeoutMs` default.
2. `registerGracefulShutdown(fn: (() => Promise<void>) | null, timeoutMs?: number): void` -- `null` clears.
3. Async path: `Promise.resolve().then(() => fn()).catch(() => {}).finally(...)` -- not a bare `fn()`.
4. Use the dual-path `setTimeout` + `Promise.then` approach (simpler than `Promise.race()`).
5. Add a stderr log line in `fatalExit()` when entering the graceful shutdown path.
6. Update `fatal-exit.test.ts` to call `registerGracefulShutdown(null)` in `afterEach`.

---

## Residual Concerns

1. **SDK error handling behavior**: We don't know whether the MCP SDK converts async handler rejections to protocol errors. The outer try/catch is defensive and harmless if redundant. No action needed, but worth confirming empirically after implementation.

2. **Bridge spawn storm with >3 bridges**: Increasing jitter to 2s handles the observed 3-4 bridge case. If the deployment grows to 10+ bridge processes, the probabilistic coordination may break down. The spawn lock file (Candidate C) would be the fix. File this for the future if needed.

3. **Test isolation for module-level state in `fatal-exit.ts`**: Both the existing `fatalHandlerActive`/`registeredTransport` state and the new `gracefulShutdownFn` require test cleanup. The test file should use `vi.resetModules()` or explicit `registerGracefulShutdown(null)` calls.

# Implementation Plan: MCP Server Resilience

## 1. Problem Statement

The MCP server crashes silently from uncaught exceptions in tool handlers. The current architecture:
1. `registerFatalHandlers()` installs `process.on('uncaughtException')` -> `fatalExit()` -> `process.exit(1)`
2. Tool handler exceptions that escape `createHandler()`'s inner try/catch become unhandled rejections
3. `fatalExit()` calls `process.exit(1)` immediately -- no graceful shutdown, no HTTP server cleanup
4. Multiple bridges detect the crash simultaneously, all try to spawn a new primary within 300ms (shorter than startup time), causing a spawn storm

**Three root causes to fix:**
1. No outer try/catch at the MCP `CallToolRequestSchema` handler boundary
2. `fatalExit()` exits immediately without attempting graceful shutdown
3. The bridge jitter window (300ms) is shorter than primary startup time (~500-1000ms)

---

## 2. Acceptance Criteria

- [ ] An uncaught exception inside a tool handler does NOT crash the MCP server process
- [ ] The handler returns an MCP error response (`isError: true`, `code: INTERNAL_ERROR`) instead of killing the process
- [ ] The exception is logged to stderr before the error response is returned
- [ ] `fatalExit()` attempts graceful shutdown (HTTP server stop) before `process.exit(1)`
- [ ] Graceful shutdown has a hard timeout: `process.exit(1)` fires after at most 3s regardless of shutdown state
- [ ] Bridge jitter is 0-2000ms (was 0-300ms)
- [ ] Post-jitter health check uses 3 retries with a 500ms base delay (was 1 retry)
- [ ] All existing tests pass
- [ ] New tests verify the outer try/catch returns `isError: true` for handler exceptions
- [ ] New tests verify `fatalExit()` calls the registered graceful shutdown fn and exits after it completes
- [ ] New tests verify `fatalExit()` still exits after 3s if the shutdown fn hangs

---

## 3. Non-Goals

- Daemon-owns-the-console refactor (separate backlog item)
- Spawn lock file for cross-bridge coordination (probabilistic jitter reduction is sufficient for now)
- Automatic zombie cleanup (separate backlog item)
- Changing the primary election lock file mechanism
- Modifying the MCP SDK or its error handling behavior
- Making the process indestructible (truly fatal startup failures should still crash)

---

## 4. Philosophy-Driven Constraints

- **Errors are data**: Tool handler exceptions must be converted to `McpCallToolResult` values, not left as thrown exceptions.
- **Validate at boundaries**: The catch must sit at the outermost dispatch boundary (the `CallToolRequestSchema` handler), not buried inside individual handlers.
- **Dependency injection**: The graceful shutdown fn is injected into `fatal-exit.ts` via `registerGracefulShutdown()`. Transport entry points own their cleanup logic.
- **Determinism**: Graceful shutdown must have a bounded timeout. `process.exit(1)` must always fire, no exceptions.
- **Surface information**: Log to stderr before returning an error response or starting async shutdown.
- **YAGNI**: No speculative abstractions. No spawn lock file. No retry framework.

---

## 5. Invariants

- `fatalExit()` ALWAYS calls `process.exit(1)` eventually. The graceful shutdown path cannot prevent exit -- it can only delay it by at most `gracefulShutdownTimeoutMs` milliseconds.
- `fatalExit()` is re-entrant safe. A second call while shutdown is in progress is a no-op.
- The crash log write and the stderr write happen SYNCHRONOUSLY, before any async work starts. They survive process death even if the async shutdown hangs.
- The outer try/catch in `server.ts` NEVER re-throws. It always returns a valid `McpCallToolResult`.
- `registerGracefulShutdown(null)` is always valid and clears the registered fn.

---

## 6. Selected Approach

**Candidate B (revised):** Outer try/catch at the `CallToolRequestSchema` boundary + `registerGracefulShutdown()` in `fatal-exit.ts` + bridge jitter increase.

**Runner-up:** Candidate A (no graceful shutdown change) -- rejected because it satisfies only 2/3 task requirements.

**Why B:** The task explicitly asks for three things: (1) catch exceptions in tool handlers, (2) improve fatal-exit graceful shutdown, (3) increase bridge jitter. B satisfies all three. The added complexity (the async path in `fatalExit()`) is bounded and safe due to the hard exit timer.

**Key design decisions (confirmed during review):**
- Graceful shutdown timeout: 3s (not 2s) -- `HttpServer.stop()` has an internal 5s `server.close()` timeout; 2s races with it
- Async pattern: dual-path `setTimeout` + `Promise.then` (not `Promise.race()`) -- simpler, equivalent semantics
- `registerGracefulShutdown()` accepts `null` to clear the fn (for test isolation)
- Shutdown fn called via `Promise.resolve().then(() => fn())` -- converts sync throws to rejected promises

---

## 7. Vertical Slices

### Slice 1: Outer try/catch in `server.ts`

**Scope:** `src/mcp/server.ts` only.

**Change:** Wrap the body of the `CallToolRequestSchema` handler (lines 437-468) in a try/catch. On catch: log to stderr, then return `{content: [{type: 'text', text: JSON.stringify({code: 'INTERNAL_ERROR', message: '...'})}], isError: true}`.

**Acceptance:** A test that throws inside `withToolCallTiming()` returns `isError: true` without crashing the process.

**Philosophy:** Errors are data / Validate at boundaries.
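
A minimal sketch of the outer catch (hedged: `dispatchToolCall` and its wiring are illustrative stand-ins; the real handler body lives in `src/mcp/server.ts`):

```ts
// Illustrative shape of an MCP tool-call result (not the SDK's exact type).
type McpCallToolResult = {
  content: Array<{ type: 'text'; text: string }>;
  isError?: boolean;
};

// Outermost dispatch boundary: never re-throws, always returns a result.
async function dispatchToolCall(
  run: () => Promise<McpCallToolResult>,
): Promise<McpCallToolResult> {
  try {
    return await run();
  } catch (err) {
    // Surface information: log to stderr before returning the structured error.
    process.stderr.write(`[McpServer] tool handler threw: ${String(err)}\n`);
    return {
      content: [
        {
          type: 'text',
          text: JSON.stringify({
            code: 'INTERNAL_ERROR',
            message: err instanceof Error ? err.message : 'unknown error',
          }),
        },
      ],
      isError: true,
    };
  }
}
```

Because the catch converts the throw into a value, a throwing handler produces an `isError: true` result instead of an unhandled rejection.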

---

### Slice 2: `registerGracefulShutdown()` in `fatal-exit.ts`

**Scope:** `src/mcp/transports/fatal-exit.ts` only.

**Change:**
- Add module-level mutable state: `let gracefulShutdownFn: (() => Promise<void>) | null = null` and `let gracefulShutdownTimeoutMs = 3000`
- Export `registerGracefulShutdown(fn: (() => Promise<void>) | null, timeoutMs?: number): void`
- Modify the `fatalExit()` exit path:

```ts
if (gracefulShutdownFn !== null) {
  process.stderr.write(`[FatalExit] Attempting graceful shutdown (${gracefulShutdownTimeoutMs}ms timeout)\n`);
  const fn = gracefulShutdownFn;
  const timeout = gracefulShutdownTimeoutMs;
  const hardExit = setTimeout(() => process.exit(1), timeout);
  void Promise.resolve()
    .then(() => fn())
    .catch(() => { /* shutdown errors must not block exit */ })
    .finally(() => {
      clearTimeout(hardExit);
      process.exit(1);
    });
} else {
  process.exit(1);
}
```

**Acceptance:**
- `fatalExit()` still exits with code 1 when no fn is registered (existing behavior)
- `fatalExit()` calls the registered fn when one is registered
- `fatalExit()` exits after 3s even if the fn hangs
- `registerGracefulShutdown(null)` clears the fn
- Existing `fatal-exit.test.ts` tests still pass (module state is reset via `vi.resetModules()`)

**Philosophy:** Dependency injection / Determinism (bounded timeout).
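
One hedged way to unit-test the hang-timeout invariant without mocking global timers is to dependency-inject the timer and exit calls. `fatalExitCore` below is an illustrative refactor sketch, not the real module API:

```ts
// Injected dependencies: a timer that returns a cancel fn, and an exit fn.
type FatalDeps = {
  setTimer: (cb: () => void, ms: number) => () => void;
  exit: (code: number) => void;
};

// Core exit path with the same shape as the plan's fatalExit() snippet,
// but with timers and process.exit injected for testability.
function fatalExitCore(
  shutdownFn: (() => Promise<void>) | null,
  timeoutMs: number,
  deps: FatalDeps,
): void {
  if (shutdownFn === null) {
    deps.exit(1); // no fn registered: existing immediate-exit behavior
    return;
  }
  const cancelHardExit = deps.setTimer(() => deps.exit(1), timeoutMs);
  void Promise.resolve()
    .then(() => shutdownFn())
    .catch(() => { /* shutdown errors must not block exit */ })
    .finally(() => {
      cancelHardExit();
      deps.exit(1);
    });
}
```

With a fake `setTimer` that fires its callback immediately, a shutdown fn that never resolves still produces exactly one `exit(1)` call.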

---

### Slice 3: Register graceful shutdown in transport entry points

**Scope:** `src/mcp/transports/stdio-entry.ts` and `src/mcp/transports/http-entry.ts`.

**Change:**
- In `stdio-entry.ts` `startStdioServer()`: after `composeServer()`, add:

```ts
import { registerGracefulShutdown } from './fatal-exit.js';
registerGracefulShutdown(async () => { await ctx.httpServer?.stop(); });
```

- In `http-entry.ts` `startHttpServer()`: after `composeServer()`, add:

```ts
registerGracefulShutdown(async () => {
  await listener.stop();
  await ctx.httpServer?.stop();
});
```

- The bridge does NOT register -- it has its own `performShutdown()` path

**Acceptance:** If `fatalExit()` fires in the stdio or http transport, `ctx.httpServer?.stop()` is called before process exit.

---

### Slice 4: Bridge jitter increase

**Scope:** `src/mcp/transports/bridge-entry.ts` only.

**Changes:**
- Line 190: `await sleep(Math.random() * 300)` -> `await sleep(Math.random() * 2000)`
- `spawnPrimary()` post-jitter detection call: `detectHealthyPrimary(port, { retries: 1, fetch: deps.fetch })` -> `detectHealthyPrimary(port, { retries: 3, baseDelayMs: 500, fetch: deps.fetch })`
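
Sketched end to end, the jitter-then-recheck flow looks roughly like the following (hedged: `sleep`, the health-check callback, and the retry schedule are illustrative stand-ins; `detectHealthyPrimary`'s real retry semantics live in the bridge code):

```ts
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// jitterMs is a parameter here so the flow is easy to exercise; in
// bridge-entry.ts the 0-2000ms jitter window is hardcoded.
async function maybeSpawnPrimary(
  jitterMs: number,
  isPrimaryHealthy: () => Promise<boolean>,
  spawn: () => void,
): Promise<void> {
  // Spread bridges across a window longer than primary startup (~500-1000ms),
  // so most bridges see a healthy primary on their post-jitter recheck.
  await sleep(Math.random() * jitterMs);
  for (let attempt = 0; attempt < 3; attempt++) {
    if (await isPrimaryHealthy()) return; // another bridge won the race
    if (attempt < 2) await sleep(500 * (attempt + 1)); // 500ms base delay
  }
  spawn(); // no healthy primary after retries: this bridge spawns one
}
```

The wider jitter plus the retried health check is what turns "every bridge spawns at once" into "one bridge spawns, the rest observe".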

**Acceptance:**
- `DEFAULT_BRIDGE_CONFIG` is unchanged (the jitter is not in the config; it is hardcoded in `spawnPrimary()`)
- Existing bridge tests pass
- bridge.log should show fewer simultaneous `spawn_primary` events after a real crash

---

## 8. Test Design

### Slice 1 tests (new)

File: `tests/unit/mcp/server.test.ts` (create if it does not exist, or add to the existing file)

- **"CallToolRequestSchema handler catches exceptions and returns INTERNAL_ERROR"**: mock a handler that throws; verify the handler returns `{content: [...], isError: true}` without a process crash
- **"CallToolRequestSchema handler catches exceptions and logs to stderr"**: verify `process.stderr.write` is called with the error info

### Slice 2 tests (add to `fatal-exit.test.ts`)

- **"registerGracefulShutdown registers a fn called by fatalExit"**: register a fn, call `fatalExit()`, verify the fn was called (mock the fn as a spy, mock setTimeout as immediate)
- **"registerGracefulShutdown(null) clears the fn"**: register a fn, then null, verify the fn is not called
- **"fatalExit exits after timeout if shutdown fn hangs"**: register a fn that never resolves; mock setTimeout to fire immediately; verify `process.exit(1)` is called
- **"fatalExit handles sync throws in shutdown fn"**: register a fn that throws synchronously; verify `process.exit(1)` is still called
- Note: `vi.resetModules()` + dynamic import already resets module state between tests (confirmed by reading the test file)

### Slice 4 tests (verify existing pass)

No new tests needed -- the jitter value is not observable in unit tests (it uses `Math.random()`). The change is verified by bridge.log observation in production.

---

## 9. Risk Register

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| SDK already handles async handler rejections (outer try/catch redundant) | Medium | Low (harmless) | No action; defensive depth is acceptable |
| Graceful shutdown fn hangs; test mocking of setTimeout fails | Low | Medium | Use `vi.useFakeTimers()` to advance timers in tests |
| Module state not reset in fatal-exit tests for the new state | Low | Low | `vi.resetModules()` already used -- new state is reset automatically |
| 3s timeout races with HttpServer's 5s timeout in certain scenarios | Low | Low | Acceptable: 3s gives enough time for fast closes; the hard exit fires for slow ones |
| Bridge jitter increase causes noticeable reconnect delay for users | Low | Low | 2s extra wait is acceptable for a dev tool; the primary starts in <1s normally |

---

## 10. PR Packaging Strategy

**Single PR on branch `fix/mcp-server-resilience`.**

All 4 slices are related (MCP server resilience), small in scope (4 files changed, ~40 lines net), and have no unresolved dependencies between them. A single PR is cleaner and easier to review.

Commit sequence (logical order for review):
1. `feat(mcp): add registerGracefulShutdown to fatal-exit for clean teardown on crash`
2. `fix(mcp): catch unhandled tool handler exceptions at CallToolRequest boundary`
3. `fix(mcp): register graceful shutdown in stdio and http transport entry points`
4. `fix(mcp): increase bridge jitter to 2s to prevent spawn storms`

---

## 11. Philosophy Alignment Per Slice

### Slice 1 (outer try/catch in server.ts)
- **Errors are data** -> Satisfied: exceptions converted to `McpCallToolResult` values
- **Validate at boundaries** -> Satisfied: catch at outermost dispatch boundary
- **Surface information** -> Satisfied: log to stderr before returning the error

### Slice 2 (registerGracefulShutdown in fatal-exit.ts)
- **Dependency injection** -> Satisfied: shutdown fn is injected, not hardcoded
- **Determinism** -> Satisfied: hard timeout guarantees bounded exit time
- **Immutability by default** -> Acceptable tension: mutable module state is documented and follows the existing pattern
- **Compose with small pure functions** -> Acceptable tension: last-resort handlers are inherently impure

### Slice 3 (register shutdown in entry points)
- **Dependency injection** -> Satisfied: entry points own their teardown logic
- **YAGNI** -> Satisfied: minimal addition (one line per entry point)

### Slice 4 (bridge jitter increase)
- **YAGNI** -> Satisfied: surgical change to an existing constant
- **Determinism** -> Neutral: jitter is random by design

---

## Estimated PR Count: 1

## Plan Confidence: High

All implementation details are fully specified. There are no unresolved unknowns that would materially affect implementation quality.

`unresolvedUnknownCount`: 0