@swarmclawai/swarmclaw 1.5.64 → 1.5.66

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -399,6 +399,25 @@ Operational docs: https://swarmclaw.ai/docs/observability
399
399
 
400
400
  ## Releases
401
401
 
402
+ ### v1.5.66 Highlights
403
+
404
+ Fixes a runaway-token-burn bug in the orchestrator-wake and heartbeat loops. The root cause was hidden in the success/failure classification: a session run can resolve its promise successfully while still carrying an `error` on the result (e.g. a provider 429 swallowed into persisted output), and the wake trackers only incremented their failure counters on a rejected promise. So the backoff never engaged, the auto-disable-after-N-failures gate never tripped, and the wake kept firing at its configured interval indefinitely — with every firing spending tokens on a full prompt against a provider that was already cooling down.
405
+
406
+ - **`classifyWakeOutcome` (`src/lib/server/runtime/heartbeat-service.ts`)** — new pure helper, extracted for unit testing, that maps a resolved run result into `null` (success) or a short failure reason. A run counts as a failure when `result.error` is a non-empty string, *or* when `result.text` is empty/whitespace-only. Both the orchestrator-wake and heartbeat outcome handlers now feed through this helper, so silent-failure runs tick the failure counter and the exponential backoff (10s → 5min) kicks in normally.
407
+ - **Auto-disable gate now trips for provider 429 / silent-wake loops.** The existing `MAX_CONSECUTIVE_FAILURES = 10` threshold was already in place but unreachable for the most common failure mode (429 errors that still persisted a run). After the fix, ten consecutive dud wakes auto-disable the orchestrator/heartbeat for that agent/session and post an explicit notification instead of grinding indefinitely.
408
+ - **Regression coverage.** `heartbeat-service.test.ts` now has 5 targeted cases on `classifyWakeOutcome` — the 429 regression, empty-output detection, non-string error fields, whitespace-only errors, and the happy path. `test:runtime` now runs 104 cases.
409
+
410
+ ### v1.5.65 Highlights
411
+
412
+ Follow-up hardening on the v1.5.64 work after live-testing the chat-header flows, the MCP connection pool, and the MCP Registry browser. Six concrete bugs fixed in the clear/undo, MCP pool eviction, and registry-browser code paths.
413
+
414
+ - **`clearChatMessages` now resets `opencodeWebSessionId` too.** The snapshot/undo pair already captured and restored it, but `clear` itself left the stale identifier in place — so a fresh opencode-web turn would resume the conversation the user intended to drop. Paired with a matching default in `storage-normalization.ts` so older session records load with `opencodeWebSessionId: null` instead of `undefined`. Regression covered by `clear-route.test.ts`.
415
+ - **Undo toast no longer writes to the wrong chat.** If the user navigated away after clicking Clear, clicking Undo in the toast would inject restored messages into whatever chat was currently open. `chat-area.tsx` now gates the `setMessages` calls on `selectActiveSessionId === targetSessionId`; same guard added to the compact-complete path.
416
+ - **Background MCP status probes no longer evict the connection pool.** Visiting `/mcp-servers` auto-called `POST /api/mcp-servers/:id/test` for every server, which force-disconnected pooled clients that running agents were using mid-turn. Eviction is now gated behind `?reset=1`, which only the explicit **Re-test** button sends. Regression added to `src/app/api/mcp-servers/route.test.ts`.
417
+ - **SwarmDock MCP Registry browser actually works now.** The upstream `swarmdock-api.onrender.com` endpoint emits no CORS headers, so the in-browser `RegistryBrowser` component always failed with `Failed to fetch`. Added `GET /api/mcp-registry` and `GET /api/mcp-registry/:slug` as server-side proxies and rewired the component to call them. Verified in Chrome: 20 servers load, selecting one prefills the New MCP Server sheet with its recommended install command.
418
+ - **`mcp-registry` CLI group.** New commands `swarmclaw mcp-registry search` and `swarmclaw mcp-registry get <slug>` so CLI workflows can pull from the same proxy.
419
+ - **Prior release's MCP tool-evict-on-transport-failure fix** (cherry-picked from a user's local branch): connection-class errors from downstream MCP tools now evict the pool entry for the originating server, so the next turn reconnects fresh instead of retrying through a half-broken transport.
420
+
402
421
  ### v1.5.64 Highlights
403
422
 
404
423
  Two themes this release. First, **context-window management reaches the chat UI**: a live token-usage meter in every chat header, a one-click LLM-backed compaction that keeps the session alive without nuking history, and a redesigned clear flow with a 30-second undo that restores both transcripts and CLI resume IDs. Second, **MCP token spend is now controllable**: per-server `alwaysExpose` policy, per-agent eager-tool overrides, an in-session `mcp_tool_search` promoter, a long-lived connection pool, a token-cost endpoint per server, and a built-in browser for the public SwarmDock MCP registry.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@swarmclawai/swarmclaw",
3
- "version": "1.5.64",
3
+ "version": "1.5.66",
4
4
  "description": "Build and run autonomous AI agents with OpenClaw, Hermes, multiple model providers, orchestration, delegation, memory, skills, schedules, and chat connectors.",
5
5
  "main": "electron-dist/main.js",
6
6
  "license": "MIT",
@@ -0,0 +1,31 @@
1
+ import { NextResponse } from 'next/server'
2
+
3
+ const REGISTRY_API = 'https://swarmdock-api.onrender.com/api/v1/mcp/servers'
4
+
5
+ export async function GET(_req: Request, { params }: { params: Promise<{ slug: string }> }) {
6
+ const { slug } = await params
7
+ if (!slug.trim()) {
8
+ return NextResponse.json({ error: 'slug is required' }, { status: 400 })
9
+ }
10
+ try {
11
+ const upstream = await fetch(`${REGISTRY_API}/${encodeURIComponent(slug)}`, {
12
+ headers: { accept: 'application/json' },
13
+ })
14
+ if (upstream.status === 404) {
15
+ return NextResponse.json({ error: 'Registry server not found' }, { status: 404 })
16
+ }
17
+ if (!upstream.ok) {
18
+ return NextResponse.json(
19
+ { error: `Server detail returned ${upstream.status}` },
20
+ { status: 502 },
21
+ )
22
+ }
23
+ const data = await upstream.json()
24
+ return NextResponse.json(data)
25
+ } catch (err: unknown) {
26
+ return NextResponse.json(
27
+ { error: err instanceof Error ? err.message : 'Registry unreachable' },
28
+ { status: 502 },
29
+ )
30
+ }
31
+ }
@@ -0,0 +1,36 @@
1
+ import { NextResponse } from 'next/server'
2
+
3
+ // Server-side proxy for the public SwarmDock MCP Registry. The upstream API
4
+ // does not emit CORS headers, so the RegistryBrowser component in the browser
5
+ // cannot fetch it directly. This route forwards the search request and its
6
+ // JSON response untouched.
7
+
8
+ const REGISTRY_API = 'https://swarmdock-api.onrender.com/api/v1/mcp/servers'
9
+
10
+ export async function GET(req: Request) {
11
+ const url = new URL(req.url)
12
+ const q = url.searchParams.get('q') ?? ''
13
+ const limitRaw = url.searchParams.get('limit') ?? '20'
14
+ const limit = Math.max(1, Math.min(Number.parseInt(limitRaw, 10) || 20, 50))
15
+ const qs = new URLSearchParams({ limit: String(limit) })
16
+ if (q.trim()) qs.set('q', q.trim())
17
+
18
+ try {
19
+ const upstream = await fetch(`${REGISTRY_API}?${qs.toString()}`, {
20
+ headers: { accept: 'application/json' },
21
+ })
22
+ if (!upstream.ok) {
23
+ return NextResponse.json(
24
+ { error: `Registry returned ${upstream.status}` },
25
+ { status: 502 },
26
+ )
27
+ }
28
+ const data = await upstream.json()
29
+ return NextResponse.json(data)
30
+ } catch (err: unknown) {
31
+ return NextResponse.json(
32
+ { error: err instanceof Error ? err.message : 'Registry unreachable' },
33
+ { status: 502 },
34
+ )
35
+ }
36
+ }
@@ -5,15 +5,21 @@ import { connectMcpServer, mcpToolsToLangChain, disconnectMcpServer } from '@/li
5
5
  import { evictMcpClient } from '@/lib/server/mcp-connection-pool'
6
6
  import { errorMessage } from '@/lib/shared-utils'
7
7
 
8
- export async function POST(_req: Request, { params }: { params: Promise<{ id: string }> }) {
8
+ export async function POST(req: Request, { params }: { params: Promise<{ id: string }> }) {
9
9
  const { id } = await params
10
10
  const servers = loadMcpServers()
11
11
  const server = servers[id]
12
12
  if (!server) return notFound()
13
13
 
14
- // Force a fresh connection for the test if a pooled client is in a weird
15
- // state, the test button is the user's signal to rebuild it.
16
- await evictMcpClient(id)
14
+ // Only evict the pool when the caller explicitly asks for a reset (e.g. the
15
+ // "Re-test" button). Background probes from the server list view skip this
16
+ // so they don't disconnect pooled clients that running agents are using
17
+ // mid-turn. Pool eviction on config change is handled by the PUT route.
18
+ const url = new URL(req.url)
19
+ const reset = url.searchParams.get('reset') === '1' || url.searchParams.get('reset') === 'true'
20
+ if (reset) {
21
+ await evictMcpClient(id)
22
+ }
17
23
 
18
24
  try {
19
25
  const { client, transport } = await connectMcpServer(server)
@@ -61,6 +61,16 @@ test('MCP server routes exercise a live stdio server end to end', async () => {
61
61
  assert.equal(health.ok, true)
62
62
  assert.deepEqual(health.tools, ['mcp_smoke_ping', 'mcp_smoke_echo', 'mcp_smoke_cwd_check'])
63
63
 
64
+ // `reset=1` still works and succeeds — used by the explicit "Re-test" button
65
+ // to force pool eviction. Default (no query) path skips eviction so
66
+ // auto-probes don't disrupt in-flight agent MCP calls.
67
+ const resetHealthResponse = await testMcpServer(new Request(`http://local/api/mcp-servers/${serverId}/test?reset=1`, {
68
+ method: 'POST',
69
+ }), routeParams(serverId))
70
+ assert.equal(resetHealthResponse.status, 200)
71
+ const resetHealth = await resetHealthResponse.json() as Record<string, unknown>
72
+ assert.equal(resetHealth.ok, true)
73
+
64
74
  const toolsResponse = await listMcpTools(new Request(`http://local/api/mcp-servers/${serverId}/tools`), routeParams(serverId))
65
75
  assert.equal(toolsResponse.status, 200)
66
76
  const tools = await toolsResponse.json() as Array<Record<string, unknown>>
package/src/cli/index.js CHANGED
@@ -372,6 +372,14 @@ const COMMAND_GROUPS = [
372
372
  cmd('invoke', 'POST', '/mcp-servers/:id/invoke', 'Invoke an MCP tool on a server', { expectsJsonBody: true }),
373
373
  ],
374
374
  },
375
+ {
376
+ name: 'mcp-registry',
377
+ description: 'Browse the public SwarmDock MCP Registry',
378
+ commands: [
379
+ cmd('search', 'GET', '/mcp-registry', 'Search registry servers (supports --query q=postgres,limit=20)'),
380
+ cmd('get', 'GET', '/mcp-registry/:slug', 'Get registry server detail by slug'),
381
+ ],
382
+ },
375
383
  {
376
384
  name: 'memories',
377
385
  description: 'Alias of memory command group',
@@ -188,7 +188,7 @@ export function McpServerList({ inSidebar }: { inSidebar?: boolean }) {
188
188
  e.stopPropagation()
189
189
  setStatuses((prev) => ({ ...prev, [id]: { ok: false, loading: true } }))
190
190
  try {
191
- const res = await api<{ ok: boolean; tools?: string[]; error?: string }>('POST', `/mcp-servers/${id}/test`)
191
+ const res = await api<{ ok: boolean; tools?: string[]; error?: string }>('POST', `/mcp-servers/${id}/test?reset=1`)
192
192
  if (!mountedRef.current) return
193
193
  setStatuses((prev) => ({ ...prev, [id]: { ok: res.ok, tools: res.tools, error: res.error, loading: false } }))
194
194
  if (res.ok) toast.success('Connection test passed')
@@ -11,8 +11,7 @@
11
11
  */
12
12
 
13
13
  import { useEffect, useState } from 'react'
14
-
15
- const REGISTRY_API = 'https://swarmdock-api.onrender.com/api/v1/mcp/servers'
14
+ import { api } from '@/lib/app/api-client'
16
15
 
17
16
  export interface RegistryPrefill {
18
17
  name: string
@@ -100,10 +99,8 @@ export function RegistryBrowser({
100
99
  setError(null)
101
100
  try {
102
101
  const qs = query ? `?q=${encodeURIComponent(query)}&limit=20` : '?limit=20'
103
- const res = await fetch(`${REGISTRY_API}${qs}`)
104
- if (!res.ok) throw new Error(`Registry returned ${res.status}`)
105
- const data = await res.json() as { servers: RegistryServer[] }
106
- if (!cancelled) setServers(data.servers)
102
+ const data = await api<{ servers: RegistryServer[] }>('GET', `/mcp-registry${qs}`)
103
+ if (!cancelled) setServers(data.servers ?? [])
107
104
  } catch (err) {
108
105
  if (!cancelled) setError(err instanceof Error ? err.message : 'Failed to load registry')
109
106
  } finally {
@@ -120,9 +117,7 @@ export function RegistryBrowser({
120
117
  const handleSelect = async (slug: string) => {
121
118
  setSelecting(slug)
122
119
  try {
123
- const res = await fetch(`${REGISTRY_API}/${encodeURIComponent(slug)}`)
124
- if (!res.ok) throw new Error(`Server detail returned ${res.status}`)
125
- const detail = await res.json() as RegistryDetail
120
+ const detail = await api<RegistryDetail>('GET', `/mcp-registry/${encodeURIComponent(slug)}`)
126
121
  const prefill = installToPrefill(detail)
127
122
  if (!prefill) {
128
123
  setError('This server has no installation method SwarmClaw can consume yet.')
@@ -6,6 +6,7 @@ import {
6
6
  evictAllMcpClients,
7
7
  evictMcpClient,
8
8
  getOrConnectMcpClient,
9
+ isConnectionLikeError,
9
10
  isPooled,
10
11
  poolSize,
11
12
  } from './mcp-connection-pool'
@@ -96,3 +97,31 @@ describe('mcp-connection-pool', () => {
96
97
  assert.equal(poolSize(), 0)
97
98
  })
98
99
  })
100
+
101
+ describe('isConnectionLikeError', () => {
102
+ it('returns true for known transport-level error codes', () => {
103
+ const err = Object.assign(new Error('epipe'), { code: 'EPIPE' })
104
+ assert.equal(isConnectionLikeError(err), true)
105
+ const err2 = Object.assign(new Error('reset'), { code: 'ECONNRESET' })
106
+ assert.equal(isConnectionLikeError(err2), true)
107
+ })
108
+
109
+ it('returns true on connection-closed messages', () => {
110
+ assert.equal(isConnectionLikeError(new Error('Connection closed')), true)
111
+ assert.equal(isConnectionLikeError(new Error('MCP server not connected')), true)
112
+ assert.equal(isConnectionLikeError(new Error('child process exited')), true)
113
+ assert.equal(isConnectionLikeError(new Error('socket hang up')), true)
114
+ })
115
+
116
+ it('returns false for ordinary tool-level errors', () => {
117
+ assert.equal(isConnectionLikeError(new Error('GitHub token is invalid')), false)
118
+ assert.equal(isConnectionLikeError(new Error('File not found: /nope')), false)
119
+ assert.equal(isConnectionLikeError(new Error('schema validation failed')), false)
120
+ })
121
+
122
+ it('returns false for non-error inputs', () => {
123
+ assert.equal(isConnectionLikeError(null), false)
124
+ assert.equal(isConnectionLikeError(undefined), false)
125
+ assert.equal(isConnectionLikeError(''), false)
126
+ })
127
+ })
@@ -132,3 +132,19 @@ async function safeDisconnect(entry: PoolEntry): Promise<void> {
132
132
  /* ignore — we're tearing down anyway */
133
133
  }
134
134
  }
135
+
136
+ /**
137
+ * Heuristic: does this error look like the pooled connection is dead (vs. a
138
+ * normal tool-level error the caller should surface)? Conservative by design —
139
+ * we only evict on well-known transport-level signatures so a "your API key is
140
+ * wrong" error from an MCP tool doesn't force a reconnect storm.
141
+ */
142
+ export function isConnectionLikeError(err: unknown): boolean {
143
+ if (!err) return false
144
+ const code = typeof err === 'object' && err && 'code' in err ? String((err as { code: unknown }).code ?? '') : ''
145
+ if (code && /^(ECONNREFUSED|ECONNRESET|EPIPE|EHOSTUNREACH|ETIMEDOUT|ENOTFOUND|ECONNABORTED)$/i.test(code)) {
146
+ return true
147
+ }
148
+ const msg = err instanceof Error ? err.message : String(err)
149
+ return /connection closed|transport closed|server has closed|process exited|child exited|mcp server not connected|read ECONN|write EPIPE|socket hang up|stream closed|unexpected end of (?:json|input|stream)/i.test(msg)
150
+ }
@@ -450,3 +450,35 @@ describe('heartbeatConfigForSession lightContext', () => {
450
450
  assert.equal(cfg.lightContext, false)
451
451
  })
452
452
  })
453
+
454
+ describe('classifyWakeOutcome (runaway-loop guard)', () => {
455
+ it('returns null for a run with visible text and no error', () => {
456
+ assert.equal(mod.classifyWakeOutcome({ text: 'all good', error: null }), null)
457
+ assert.equal(mod.classifyWakeOutcome({ text: 'ORCHESTRATOR_OK' }), null)
458
+ })
459
+
460
+ it('treats a resolved-but-errored result as failure (the 429 regression)', () => {
461
+ const out = mod.classifyWakeOutcome({
462
+ text: '',
463
+ error: '429 All credentials for model gpt-5.4 are cooling down via provider codex',
464
+ })
465
+ assert.equal(out, '429 All credentials for model gpt-5.4 are cooling down via provider codex')
466
+ })
467
+
468
+ it('counts empty visible output as failure so silent wakes trigger backoff', () => {
469
+ assert.equal(mod.classifyWakeOutcome({ text: '' }), 'empty wake response')
470
+ assert.equal(mod.classifyWakeOutcome({ text: ' \n\t' }), 'empty wake response')
471
+ assert.equal(mod.classifyWakeOutcome({}), 'empty wake response')
472
+ assert.equal(mod.classifyWakeOutcome(null), 'empty wake response')
473
+ assert.equal(mod.classifyWakeOutcome(undefined), 'empty wake response')
474
+ })
475
+
476
+ it('ignores a non-string error field and falls back to text check', () => {
477
+ assert.equal(mod.classifyWakeOutcome({ text: 'hi', error: 42 }), null)
478
+ assert.equal(mod.classifyWakeOutcome({ text: '', error: 42 }), 'empty wake response')
479
+ })
480
+
481
+ it('ignores an empty-string error so whitespace errors do not double-count', () => {
482
+ assert.equal(mod.classifyWakeOutcome({ text: 'fine', error: ' ' }), null)
483
+ })
484
+ })
@@ -54,6 +54,23 @@ const ORCHESTRATOR_MIN_INTERVAL_SEC = 60
54
54
  const ORCHESTRATOR_MAX_INTERVAL_SEC = 86400 // 24h
55
55
  const ORCHESTRATOR_MAX_PROMPT_CHARS = 4000
56
56
 
57
+ /**
58
+ * Classify a resolved session-run result as success or failure for the
59
+ * heartbeat/orchestrator outcome tracker. A resolved promise can still
60
+ * carry an error on `result.error` (e.g. a provider 429 that was swallowed
61
+ * into persisted output) or resolve with empty text, and both cases must
62
+ * count as failures — otherwise a stuck wake loop never ticks the
63
+ * failure counter, never backs off, and never auto-disables.
64
+ */
65
+ export function classifyWakeOutcome(result: unknown): string | null {
66
+ if (!result || typeof result !== 'object') return 'empty wake response'
67
+ const obj = result as { error?: unknown; text?: unknown }
68
+ if (typeof obj.error === 'string' && obj.error.trim()) return obj.error
69
+ const text = typeof obj.text === 'string' ? obj.text : ''
70
+ if (!text.trim()) return 'empty wake response'
71
+ return null
72
+ }
73
+
57
74
  interface FailureRecord {
58
75
  count: number
59
76
  lastFailedAt: number
@@ -782,24 +799,28 @@ export async function tickHeartbeats() {
782
799
  state.lastBySession.set(session.id, now)
783
800
 
784
801
  const sid = session.id as string
785
- enqueue.promise.then(() => {
786
- const prev = state.failures.get(sid)
787
- if (prev?.recoveryAttempts) {
788
- log.info('heartbeat', `Recovery successful for session ${sid} after ${prev.recoveryAttempts} attempt(s)`)
802
+ // A session run can "resolve" with an error in result.error (e.g. provider
803
+ // 429 swallowed into the persisted failure) or with empty text. Treat both
804
+ // as failures so backoff and auto-disable trigger, otherwise a stuck
805
+ // heartbeat keeps re-firing at the configured interval and burning tokens.
806
+ const handleHeartbeatOutcome = (failure: string | null) => {
807
+ if (!failure) {
808
+ const prev = state.failures.get(sid)
809
+ if (prev?.recoveryAttempts) {
810
+ log.info('heartbeat', `Recovery successful for session ${sid} after ${prev.recoveryAttempts} attempt(s)`)
811
+ }
812
+ state.failures.delete(sid)
813
+ patchSession(sid, (s) => {
814
+ if (!s) return s
815
+ s.lastDeliveryStatus = 'ok'
816
+ s.lastDeliveredAt = Date.now()
817
+ return s
818
+ })
819
+ return
789
820
  }
790
- state.failures.delete(sid)
791
- // Track successful delivery
792
- patchSession(sid, (s) => {
793
- if (!s) return s
794
- s.lastDeliveryStatus = 'ok'
795
- s.lastDeliveredAt = Date.now()
796
- return s
797
- })
798
- }).catch((err: unknown) => {
799
821
  const prev = state.failures.get(sid)
800
822
  const newCount = (prev?.count ?? 0) + 1
801
823
  const record: FailureRecord = { count: newCount, lastFailedAt: Date.now() }
802
- // Auto-disable heartbeat after too many consecutive failures to prevent resource waste
803
824
  if (newCount >= MAX_CONSECUTIVE_FAILURES) {
804
825
  record.autoDisabledAt = Date.now()
805
826
  log.warn('heartbeat', `Auto-disabling heartbeat for session ${sid} after ${newCount} consecutive failures`)
@@ -821,17 +842,20 @@ export async function tickHeartbeats() {
821
842
  })
822
843
  }
823
844
  state.failures.set(sid, record)
824
- const msg = errorMessage(err)
825
- log.warn('heartbeat', `Heartbeat run failed for session ${sid} (${newCount}/${MAX_CONSECUTIVE_FAILURES})`, msg)
826
- // Track failed delivery
845
+ log.warn('heartbeat', `Heartbeat run failed for session ${sid} (${newCount}/${MAX_CONSECUTIVE_FAILURES})`, failure)
827
846
  patchSession(sid, (s) => {
828
847
  if (!s) return s
829
848
  s.lastDeliveryStatus = 'error'
830
- s.lastDeliveryError = msg
849
+ s.lastDeliveryError = failure
831
850
  s.lastDeliveredAt = Date.now()
832
851
  return s
833
852
  })
834
- })
853
+ }
854
+ enqueue.promise
855
+ .then((result) => handleHeartbeatOutcome(classifyWakeOutcome(result)))
856
+ .catch((err: unknown) => {
857
+ handleHeartbeatOutcome(errorMessage(err) || 'heartbeat rejected')
858
+ })
835
859
  }
836
860
  }
837
861
 
@@ -1118,10 +1142,15 @@ export async function tickOrchestratorAgents() {
1118
1142
 
1119
1143
  log.info('orchestrator', `Woke orchestrator agent ${agent.name} (${agent.id}), cycle #${(agent.orchestratorCycleCount || 0) + 1}`)
1120
1144
 
1121
- // Track success/failure
1122
- enqueue.promise.then(() => {
1123
- orchestratorState.failures.delete(agent.id)
1124
- }).catch((err: unknown) => {
1145
+ // Track success/failure. A run can "resolve" but still carry an error
1146
+ // on the result (e.g. provider 429 that was caught and persisted), so we
1147
+ // inspect the resolved result as well as the rejected path — otherwise
1148
+ // a stuck wake loop never ticks the failure counter and never backs off.
1149
+ const handleWakeOutcome = (failure: string | null) => {
1150
+ if (!failure) {
1151
+ orchestratorState.failures.delete(agent.id)
1152
+ return
1153
+ }
1125
1154
  const prev = orchestratorState.failures.get(agent.id)
1126
1155
  const newCount = (prev?.count ?? 0) + 1
1127
1156
  const record: FailureRecord = { count: newCount, lastFailedAt: Date.now() }
@@ -1146,8 +1175,13 @@ export async function tickOrchestratorAgents() {
1146
1175
  })
1147
1176
  }
1148
1177
  orchestratorState.failures.set(agent.id, record)
1149
- log.warn('orchestrator', `Orchestrator wake failed for agent ${agent.id} (${newCount}/${MAX_CONSECUTIVE_FAILURES})`, errorMessage(err))
1150
- })
1178
+ log.warn('orchestrator', `Orchestrator wake failed for agent ${agent.id} (${newCount}/${MAX_CONSECUTIVE_FAILURES})`, failure)
1179
+ }
1180
+ enqueue.promise
1181
+ .then((result) => handleWakeOutcome(classifyWakeOutcome(result)))
1182
+ .catch((err: unknown) => {
1183
+ handleWakeOutcome(errorMessage(err) || 'wake rejected')
1184
+ })
1151
1185
  } catch (err) {
1152
1186
  log.warn('orchestrator', `Error ticking orchestrator agent ${agent.id}:`, errorMessage(err))
1153
1187
  }
@@ -62,7 +62,7 @@ import {
62
62
  shouldExposeMcpTool,
63
63
  type DiscoveredTool,
64
64
  } from '../mcp-gateway-runtime'
65
- import { getOrConnectMcpClient } from '../mcp-connection-pool'
65
+ import { getOrConnectMcpClient, evictMcpClient, isConnectionLikeError } from '../mcp-connection-pool'
66
66
  import {
67
67
  getEnabledCapabilitySelection,
68
68
  isExternalExtensionId,
@@ -94,6 +94,37 @@ function inferBareName(langChainName: string, serverName: string): string {
94
94
  return langChainName.startsWith(prefix) ? langChainName.slice(prefix.length) : langChainName
95
95
  }
96
96
 
97
+ /**
98
+ * Wraps an MCP-sourced LangChain tool so connection-class failures (stdio pipe
99
+ * closed, HTTP reset, etc.) evict the pool entry, letting the next turn
100
+ * rebuild the client fresh. Non-connection errors (validation, tool logic,
101
+ * auth) propagate unchanged — we trust the downstream's isError signal.
102
+ */
103
+ function wrapMcpToolWithPoolEviction(
104
+ inner: StructuredToolInterface,
105
+ serverId: string,
106
+ ): StructuredToolInterface {
107
+ const wrappedCallback = async (args: unknown): Promise<unknown> => {
108
+ try {
109
+ return await inner.invoke(args as Record<string, unknown>)
110
+ } catch (err: unknown) {
111
+ if (isConnectionLikeError(err)) {
112
+ void evictMcpClient(serverId).catch(() => undefined)
113
+ log.warn('session-tools', `MCP tool "${inner.name}" connection error — evicted pool entry for ${serverId}`, {
114
+ error: errorMessage(err),
115
+ })
116
+ }
117
+ throw err
118
+ }
119
+ }
120
+ return tool(wrappedCallback, {
121
+ name: inner.name,
122
+ description: inner.description,
123
+ // Re-use the inner tool's zod schema so shape/validation is identical.
124
+ schema: (inner as unknown as { schema: z.ZodType }).schema,
125
+ })
126
+ }
127
+
97
128
  export async function buildSessionTools(cwd: string, enabledExtensions: string[], ctx?: ToolContext): Promise<SessionToolsResult> {
98
129
  const tools: StructuredToolInterface[] = []
99
130
  const cleanupFns: (() => Promise<void>)[] = []
@@ -354,7 +385,7 @@ export async function buildSessionTools(cwd: string, enabledExtensions: string[]
354
385
  })
355
386
  if (!shouldBind) continue
356
387
  toolToExtensionMap[t.name] = `mcp:${serverId}`
357
- tools.push(t)
388
+ tools.push(wrapMcpToolWithPoolEviction(t, serverId))
358
389
  }
359
390
  } catch (err: unknown) {
360
391
  log.warn('session-tools', `Failed to connect MCP server "${config.name}"`, { serverId, error: errorMessage(err) })