npm - @poncho-ai/harness - Versions diffs - 0.52.0 → 0.52.2 - Mend

@poncho-ai/harness 0.52.0 → 0.52.2

Files changed (9)

package/.turbo/turbo-build.log +5 -5
package/CHANGELOG.md +80 -0
package/dist/index.d.ts +42 -16
package/dist/index.js +177 -80
package/package.json +1 -1
package/src/default-agent.ts +11 -4
package/src/harness.ts +7 -2
package/src/orchestrator/orchestrator.ts +142 -35
package/src/storage/postgres-engine.ts +83 -41

package/.turbo/turbo-build.log CHANGED

@@ -1,5 +1,5 @@
-> @poncho-ai/harness@0.52.0 build /home/runner/work/poncho-ai/poncho-ai/packages/harness
+> @poncho-ai/harness@0.52.2 build /home/runner/work/poncho-ai/poncho-ai/packages/harness
 > node scripts/embed-docs.js && tsup src/index.ts --format esm --dts
 [embed-docs] Generated poncho-docs.ts with 4 topics
@@ -8,9 +8,9 @@
 CLI tsup v8.5.1
 CLI Target: es2022
 ESM Build start
-ESM dist/index.js            536.18 KB
+ESM dist/index.js            540.60 KB
 ESM dist/isolate-F2PPSUL6.js 53.82 KB
-ESM ⚡️ Build success in 234ms
+ESM ⚡️ Build success in 204ms
 DTS Build start
-DTS ⚡️ Build success in 7615ms
-DTS dist/index.d.ts 92.18 KB
+DTS ⚡️ Build success in 7242ms
+DTS dist/index.d.ts 93.60 KB

package/CHANGELOG.md CHANGED

@@ -1,5 +1,85 @@
 # @poncho-ai/harness
+## 0.52.2
+### Patch Changes
+- [#124](https://github.com/cesr/poncho-ai/pull/124) [`4ae26e0`](https://github.com/cesr/poncho-ai/commit/4ae26e0d8d2788f57411f9c17e10766769514f9b) Thanks [@cesr](https://github.com/cesr)! - harness: postgres retry covers exec/transaction + 3 attempts + tighter idle
+  Follow-up to the previous `idle_timeout`/`max_lifetime`/retry patch.
+  Live testing on Railway showed the previous values weren't tight
+  enough — `write CONNECTION_ENDED postgres.railway.internal:5432`
+  still surfaced both during user-facing chat turns and during
+  subagent auto-callback reruns, despite the new config and the
+  one-shot retry.
+  Two failure modes the previous version didn't cover:
+  1. The retry only wrapped `private query()` (executor.run/get/all),
+     but `executor.exec` (`sql.unsafe`) and `executor.transaction`
+     (`sql.begin`) called the postgres.js client directly. A pg drop
+     inside a transaction or migration write threw straight through.
+  2. After an idle period the pool can have multiple stale sockets;
+     a single retry can checkout a second stale socket from the pool
+     and fail again. One-shot retry exhausted into an error visible
+     to the caller.
+  Fixes:
+  - All three executor paths (`run/get/all`, `exec`, `transaction`)
+    now go through the same `runWithRetry` wrapper. Transactions
+    only retry the connection-level `CONNECTION_ENDED` reject from
+    the postgres.js client — actual SQL errors mid-transaction
+    surface as a different error class and bypass the retry,
+    preserving atomic semantics.
+  - Three attempts with light exponential backoff (0, 50ms, 200ms).
+    Enough to ride out a typical staleness wave; if all three fail
+    the network is genuinely broken.
+  - `CONNECT_TIMEOUT` and `ECONNRESET` added to the retry-eligible
+    error codes.
+  Config knobs tightened:
+  - `idle_timeout: 5` (was 20). Empirically Railway's pg drops
+    sockets well before 20s; 5s wins the race in practice while
+    staying long enough for bursty workloads to reuse connections.
+  - `max_lifetime: 300` (was 600). Same reasoning — recycle more
+    aggressively.
+  - `connect_timeout: 10` (was 30 default). Faster failure during
+    incidents lets callers shed load instead of stacking up.
+- [#144](https://github.com/cesr/poncho-ai/pull/144) [`28d640b`](https://github.com/cesr/poncho-ai/commit/28d640b2f82ea780f8e0be90965972d9903c01d7) Thanks [@cesr](https://github.com/cesr)! - orchestrator: make subagent result delivery reliable
+  Subagent results could silently never reach the parent agent. Several
+  plumbing bugs in `runSubagent` / `runSubagentContinuation`:
+  - **Emit-before-persist race.** `subagent:completed` / `subagent:error`
+    were emitted to the parent's event stream _before_ the result was
+    written to the store, so a consumer reacting to the event (the parent
+    callback, the streaming client) could race the write. Now the result
+    is persisted first, then the event is emitted.
+  - **Silently swallowed writes.** Two `appendSubagentResult(...).catch(() => {})`
+    call sites (the error path and the continuation-error path) dropped the
+    result with no trace on a transient store failure. Replaced with a
+    shared `appendSubagentResultReliable` helper that retries once and then
+    logs loudly — a dropped result is the worst failure mode (the parent
+    waits forever on a subagent it thinks is still running).
+  - **Un-awaited eventSink.** The subagent-callback run path was the lone
+    `this.eventSink(...)` call site that didn't `await` (every other site
+    does), so callback-turn events could interleave out of order. Now awaited.
+  - **Spawn rejections went to a bare `console.error`.** A background
+    `runSubagent` that rejected outside its own try/catch left the parent
+    hanging. Both fire-and-forget spawn paths now route to a
+    `handleSpawnFailure` that marks the child errored and hands the parent
+    an error result so the turn can resume.
+  - **`recoverStaleSubagents` now also drains undelivered results.** It
+    previously only rescued children stuck in `running`; it now also
+    re-triggers the parent callback for any parent that has results sitting
+    in the store with no active run (e.g. a result persisted just before a
+    process restart, whose in-memory callback trigger was lost).
+## 0.52.1
+### Patch Changes
+- [`0e8fff1`](https://github.com/cesr/poncho-ai/commit/0e8fff12aed9d5efe1821ed3560ead48a16113c1) Thanks [@cesr](https://github.com/cesr)! - Only send `temperature` to the model when the agent explicitly sets one. The harness previously defaulted to `temperature: 0.2` and always passed it to `streamText`, which returns a 400 ("`temperature` is deprecated for this model") on models that removed sampling params (Fable 5, Opus 4.7+). `temperature` is now omitted from the request when undefined — the same treatment `maxTokens` already had — and `defaultAgentDefinition` no longer hard-codes a `temperature` line into the generated frontmatter (pass `temperature` explicitly to set one).
 ## 0.52.0
 ### Minor Changes
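
The retry behavior described in the 0.52.2 changelog entry above can be exercised in isolation. The following is a minimal TypeScript sketch under stated assumptions, not the package's code: `retryOnConnectionError` and `makeFlakyOperation` are hypothetical names introduced here, while the error codes and the (0, 50ms, 200ms) schedule are copied from the changelog text.

```typescript
// Illustrative sketch only; the names below are hypothetical, not harness exports.
// Error codes and backoff schedule are taken from the changelog entry.
const RETRYABLE_CODES = new Set([
  "CONNECTION_ENDED",
  "CONNECTION_CLOSED",
  "CONNECTION_DESTROYED",
  "CONNECT_TIMEOUT",
  "ECONNRESET",
]);

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function retryOnConnectionError<T>(
  fn: () => Promise<T>,
  backoffsMs: readonly number[] = [0, 50, 200],
): Promise<T> {
  let lastErr: unknown;
  for (const delay of backoffsMs) {
    if (delay > 0) await sleep(delay);
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      const code = (err as { code?: unknown } | null | undefined)?.code;
      // Anything that is not a connection-level failure surfaces immediately.
      if (typeof code !== "string" || !RETRYABLE_CODES.has(code)) throw err;
    }
  }
  // Every attempt hit a connection-level failure: give up with the last error.
  throw lastErr;
}

// Hypothetical stand-in for a pool that holds `staleSockets` dead connections.
function makeFlakyOperation(staleSockets: number) {
  let calls = 0;
  const run = async (): Promise<string> => {
    calls += 1;
    if (calls <= staleSockets) {
      throw Object.assign(new Error("write CONNECTION_ENDED"), {
        code: "CONNECTION_ENDED",
      });
    }
    return "ok";
  };
  return { run, callCount: () => calls };
}
```

With `makeFlakyOperation(2)`, the first two attempts fail and the third succeeds, which is the case a one-shot retry could not cover. An error without one of the listed codes (an ordinary SQL error, for instance) is rethrown on the first attempt.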

package/dist/index.d.ts CHANGED

@@ -737,7 +737,12 @@ interface DefaultAgentDefinitionOptions {
     modelProvider?: "anthropic" | "openai" | "openai-codex";
     /** Model name. Default: "claude-opus-4-5". */
     modelName?: string;
-    /** Sampling temperature. Default: 0.2. */
+    /**
+     * Sampling temperature. When unset, it is omitted from the generated
+     * frontmatter entirely and the harness sends no temperature (provider
+     * default). Newer models (Fable 5, Opus 4.7+) reject `temperature` — leave
+     * this unset for them.
+     */
     temperature?: number;
     /** Max tool-call steps per run. Default: 20. */
     maxSteps?: number;
@@ -1797,22 +1802,27 @@ declare class PostgresEngine extends SqlStorageEngine {
     private patchVfs;
     private query;
     /**
-     * Single retry on a transient connection-layer failure. The
-     * `idle_timeout` / `max_lifetime` config above prevents *most*
-     * stale-connection cases, but a query can still race a
-     * provider-initiated drop in flight — the postgres.js client
-     * rejects with `code: "CONNECTION_ENDED"` and the next attempt
-     * checks out a fresh connection from the pool. One retry is
-     * enough; if it fails again the host-side network is genuinely
-     * broken and the caller should see the error.
+     * Retry on transient connection-layer failures. Three attempts
+     * with exponential-ish backoff (0, 50ms, 200ms) — the pool may
+     * have multiple stale sockets accumulated during an idle period
+     * (especially on managed Postgres after boot when no traffic
+     * has flowed for a while), so a single retry can land on a
+     * second stale socket and still fail. Three attempts virtually
+     * always exhausts the staleness wave; if all three throw, the
+     * failure is real and the caller should see it.
      *
-     * Only retries reads + the standard exec/run paths in `query`;
-     * `sql.unsafe(sql)` calls in `executeRaw` (migration DDL) and
-     * `sql.begin(...)` transactions are unwrapped — those are
-     * idempotent-by-construction (DDL is `IF NOT EXISTS`) or
-     * atomically scoped (transactions roll back cleanly), and adding
-     * a retry around them would complicate the transaction
-     * semantics.
+     * Applied to every pg path the executor exposes:
+     *  - `query()` (run/get/all)  — natural retry: queries are
+     *    idempotent at the connection-failure boundary because the
+     *    server-side rollback runs cleanly on socket close.
+     *  - `exec(sql)` for DDL      — `CREATE TABLE IF NOT EXISTS` and
+     *    friends are idempotent by construction.
+     *  - `transaction(fn)`        — only retried when the
+     *    CONNECTION_ENDED reject arrives *before* the transaction
+     *    body started executing on the connection; if it errors
+     *    mid-transaction, the postgres.js client surfaces a
+     *    different error class (the inner SQL error) and bypasses
+     *    this retry, preserving the all-or-nothing semantics.
      */
     private runWithRetry;
     private addToPathCache;
@@ -2140,6 +2150,22 @@ declare class AgentOrchestrator {
     processSubagentCallback(conversationId: string, skipLockCheck?: boolean): Promise<void>;
     runSubagentContinuation(conversationId: string, conversation: Conversation, continuationMessages: Message[]): AsyncGenerator<AgentEvent>;
     createSubagentManager(): SubagentManager;
+    /**
+     * Append a subagent result to its parent, retrying once on a transient
+     * store failure before giving up loudly. A silently dropped result is the
+     * worst subagent failure mode — the parent waits forever on a subagent it
+     * thinks is still running — so this never swallows the error the way the
+     * old `.catch(() => {})` call sites did. Returns whether the result landed.
+     */
+    private appendSubagentResultReliable;
+    /**
+     * A subagent's fire-and-forget background run rejected outside its own
+     * error handling (e.g. it threw before entering its try block, or the
+     * catch block itself threw). Without this the parent is left waiting on a
+     * subagent that will never report back. Record the failure on the child
+     * and hand the parent an error result so the turn can resume.
+     */
+    private handleSpawnFailure;
     recoverStaleSubagents(): Promise<void>;
 }
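
The ordering that the `appendSubagentResultReliable` doc comment and the changelog describe (persist the result, retrying once, and only then emit the event) can be sketched on its own. This is a simplified TypeScript illustration, not the orchestrator's implementation: `ResultStore`, `EventSink`, `appendWithOneRetry`, and `deliverResult` are hypothetical names introduced for this example.

```typescript
// Hypothetical sketch; these types and functions are illustrative only.
interface SubagentResult {
  subagentId: string;
  status: "completed" | "error";
}

interface ResultStore {
  append(parentId: string, result: SubagentResult): Promise<void>;
}

type EventSink = (
  parentId: string,
  event: { type: string; subagentId: string },
) => Promise<void>;

// Try the write twice; log loudly instead of swallowing the second failure.
async function appendWithOneRetry(
  store: ResultStore,
  parentId: string,
  result: SubagentResult,
): Promise<boolean> {
  let firstErr: unknown;
  for (let attempt = 1; attempt <= 2; attempt++) {
    try {
      await store.append(parentId, result);
      return true;
    } catch (err) {
      if (attempt === 1) {
        firstErr = err;
        continue;
      }
      console.error(
        `failed to persist result for ${result.subagentId} after 2 attempts:`,
        err,
        "(first attempt:",
        firstErr,
        ")",
      );
    }
  }
  return false;
}

// Persist first, then emit, so a consumer reacting to the event can already
// read the result from the store.
async function deliverResult(
  store: ResultStore,
  sink: EventSink,
  parentId: string,
  result: SubagentResult,
): Promise<boolean> {
  const landed = await appendWithOneRetry(store, parentId, result);
  await sink(parentId, {
    type: `subagent:${result.status}`,
    subagentId: result.subagentId,
  });
  return landed;
}
```

Returning a boolean rather than throwing lets the caller decide whether to trigger the parent callback, which mirrors the "returns whether the result landed" wording in the doc comment.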

package/dist/index.js CHANGED

@@ -588,7 +588,8 @@ var defaultAgentDefinition = (opts = {}) => {
   const description = opts.description ?? DEFAULT_AGENT_DESCRIPTION;
   const modelProvider = opts.modelProvider ?? DEFAULT_MODEL_PROVIDER;
   const modelName = opts.modelName ?? DEFAULT_MODEL_NAME;
-  const temperature = opts.temperature ?? DEFAULT_TEMPERATURE;
+  const temperatureLine = opts.temperature !== void 0 ? `
+  temperature: ${opts.temperature}` : "";
   const maxSteps = opts.maxSteps ?? DEFAULT_MAX_STEPS;
   const timeout = opts.timeout ?? DEFAULT_TIMEOUT;
   return `---
@@ -597,8 +598,7 @@ id: ${id}
 description: ${description}
 model:
   provider: ${modelProvider}
-  name: ${modelName}
-  temperature: ${temperature}
+  name: ${modelName}${temperatureLine}
 limits:
   maxSteps: ${maxSteps}
   timeout: ${timeout}
@@ -4415,12 +4415,12 @@ var PostgresEngine = class extends SqlStorageEngine {
         return rows;
       },
       exec: async (sql) => {
-        await this.sql.unsafe(sql);
+        await this.runWithRetry(() => this.sql.unsafe(sql));
       },
       transaction: async (fn) => {
-        await this.sql.begin(async () => {
+        await this.runWithRetry(() => this.sql.begin(async () => {
           await fn();
-        });
+        }));
       }
     };
   }
@@ -4438,25 +4438,34 @@ var PostgresEngine = class extends SqlStorageEngine {
       prepare: false,
       // Connection-pool resilience. Managed Postgres providers
       // (Railway, Neon, Heroku, etc.) routinely drop idle TCP
-      // connections server-side after a few minutes. Without these
-      // knobs, porsager/postgres keeps stale sockets in the pool;
-      // the next query on one rejects with
-      // `write CONNECTION_ENDED <host>:5432` at `durMs=0`, surfacing
-      // as a hard failure to the caller. Two complementary settings:
+      // connections server-side after a few minutes — and on
+      // Railway in particular, mid-stream drops within a few
+      // seconds of inactivity are common. Without these knobs,
+      // porsager/postgres keeps stale sockets in the pool; the
+      // next query on one rejects with
+      // `write CONNECTION_ENDED <host>:5432` at `durMs=0`,
+      // surfacing as a hard failure to the caller.
       //
-      //   - `idle_timeout: 20` closes idle connections client-side
-      //     after 20s, before any reasonable provider-side timer
-      //     fires. Fresh connection on next checkout = no stale
-      //     socket race.
-      //   - `max_lifetime: 600` (10 min) recycles long-lived
-      //     connections defensively even if they've stayed busy,
-      //     which sidesteps a separate class of provider-side
-      //     "max connection age" limits.
+      //   - `idle_timeout: 5` closes idle connections client-side
+      //     aggressively. Empirically Railway's pg drops sockets
+      //     well before the 20s value that managed-provider docs
+      //     suggest; 5s is short enough to win the race in
+      //     practice while staying long enough that bursty
+      //     workloads still get connection reuse.
+      //   - `max_lifetime: 300` (5 min) recycles long-lived
+      //     connections defensively. Even with idle_timeout, a
+      //     connection that's been actively serving small queries
+      //     for an hour can hit provider-side max-age limits.
+      //   - `connect_timeout: 10` — slightly less patient on
+      //     initial connect than the 30s default. Combined with
+      //     the retry below, "connection refused" surfaces faster
+      //     during incidents and the caller can shed load instead
+      //     of stacking up.
       //
-      // Defaults remain `max: 10`, `connect_timeout: 30` — leaving
-      // pool size + initial connect behavior unchanged.
-      idle_timeout: 20,
-      max_lifetime: 60 * 10
+      // Pool size (`max: 10`) unchanged.
+      idle_timeout: 5,
+      max_lifetime: 60 * 5,
+      connect_timeout: 10
     });
   }
   async initialize() {
@@ -4505,33 +4514,47 @@ var PostgresEngine = class extends SqlStorageEngine {
     );
   }
   /**
-   * Single retry on a transient connection-layer failure. The
-   * `idle_timeout` / `max_lifetime` config above prevents *most*
-   * stale-connection cases, but a query can still race a
-   * provider-initiated drop in flight — the postgres.js client
-   * rejects with `code: "CONNECTION_ENDED"` and the next attempt
-   * checks out a fresh connection from the pool. One retry is
-   * enough; if it fails again the host-side network is genuinely
-   * broken and the caller should see the error.
+   * Retry on transient connection-layer failures. Three attempts
+   * with exponential-ish backoff (0, 50ms, 200ms) — the pool may
+   * have multiple stale sockets accumulated during an idle period
+   * (especially on managed Postgres after boot when no traffic
+   * has flowed for a while), so a single retry can land on a
+   * second stale socket and still fail. Three attempts virtually
+   * always exhausts the staleness wave; if all three throw, the
+   * failure is real and the caller should see it.
    *
-   * Only retries reads + the standard exec/run paths in `query`;
-   * `sql.unsafe(sql)` calls in `executeRaw` (migration DDL) and
-   * `sql.begin(...)` transactions are unwrapped — those are
-   * idempotent-by-construction (DDL is `IF NOT EXISTS`) or
-   * atomically scoped (transactions roll back cleanly), and adding
-   * a retry around them would complicate the transaction
-   * semantics.
+   * Applied to every pg path the executor exposes:
+   *  - `query()` (run/get/all)  — natural retry: queries are
+   *    idempotent at the connection-failure boundary because the
+   *    server-side rollback runs cleanly on socket close.
+   *  - `exec(sql)` for DDL      — `CREATE TABLE IF NOT EXISTS` and
+   *    friends are idempotent by construction.
+   *  - `transaction(fn)`        — only retried when the
+   *    CONNECTION_ENDED reject arrives *before* the transaction
+   *    body started executing on the connection; if it errors
+   *    mid-transaction, the postgres.js client surfaces a
+   *    different error class (the inner SQL error) and bypasses
+   *    this retry, preserving the all-or-nothing semantics.
    */
   async runWithRetry(fn) {
-    try {
-      return await fn();
-    } catch (err) {
-      const code = err?.code;
-      if (code === "CONNECTION_ENDED" || code === "CONNECTION_CLOSED" || code === "CONNECTION_DESTROYED") {
+    const backoffs = [0, 50, 200];
+    let lastErr;
+    for (let attempt = 0; attempt < backoffs.length; attempt++) {
+      if (backoffs[attempt] > 0) {
+        await new Promise((r) => setTimeout(r, backoffs[attempt]));
+      }
+      try {
         return await fn();
+      } catch (err) {
+        lastErr = err;
+        const code = err?.code;
+        if (code === "CONNECTION_ENDED" || code === "CONNECTION_CLOSED" || code === "CONNECTION_DESTROYED" || code === "CONNECT_TIMEOUT" || code === "ECONNRESET") {
+          continue;
+        }
+        throw err;
       }
-      throw err;
     }
+    throw lastErr;
   }
   addToPathCache(tenantId, path) {
     const paths = this.pathCache.get(tenantId);
@@ -10832,7 +10855,7 @@ ${textContent}` };
           cachedCoreMessages = [...cachedCoreMessages, ...newCoreMessages];
           convertedUpTo = messages.length;
           const coreMessages = cachedCoreMessages;
-          const temperature = agent.frontmatter.model?.temperature ?? 0.2;
+          const temperature = agent.frontmatter.model?.temperature;
           const maxTokens = agent.frontmatter.model?.maxTokens;
           const cachedMessages = skipTailCache ? coreMessages : addPromptCacheBreakpoints(
             coreMessages,
@@ -10860,7 +10883,7 @@ ${textContent}` };
             ...useStaticCache ? {} : { system: systemPrompt },
             messages: messagesForStep,
             tools: toolsForStep,
-            temperature,
+            ...typeof temperature === "number" ? { temperature } : {},
             abortSignal: input.abortSignal,
             ...typeof maxTokens === "number" ? { maxTokens } : {},
             experimental_telemetry: {
@@ -13077,12 +13100,6 @@ var AgentOrchestrator = class {
         };
         await this.conversationStore.update(conv);
       }
-      this.hooks?.onStreamEnd?.(childConversationId);
-      await this.eventSink(parentConversationId, {
-        type: "subagent:completed",
-        subagentId: childConversationId,
-        conversationId: childConversationId
-      });
       let gathered = realResponseText(runResult?.response) || realResponseText(draft.assistantResponse);
       if (!gathered) {
         const freshSubConv = await this.conversationStore.get(childConversationId);
@@ -13104,7 +13121,13 @@ var AgentOrchestrator = class {
         ...abnormal ? { error: { code: runError?.code ?? "SUBAGENT_INCOMPLETE", message: runError?.message ?? "subagent ended without a result" } } : {},
         timestamp: Date.now()
       };
-      await this.conversationStore.appendSubagentResult(parentConversationId, pendingResult);
+      await this.appendSubagentResultReliable(parentConversationId, pendingResult);
+      this.hooks?.onStreamEnd?.(childConversationId);
+      await this.eventSink(parentConversationId, {
+        type: "subagent:completed",
+        subagentId: childConversationId,
+        conversationId: childConversationId
+      });
       this.triggerParentCallback(parentConversationId).catch(
         (err) => console.error(`[poncho][subagent] Parent callback failed:`, err instanceof Error ? err.message : err)
       );
@@ -13121,13 +13144,6 @@ var AgentOrchestrator = class {
         conv.updatedAt = Date.now();
         await this.conversationStore.update(conv);
       }
-      this.hooks?.onStreamEnd?.(childConversationId);
-      await this.eventSink(parentConversationId, {
-        type: "subagent:error",
-        subagentId: childConversationId,
-        conversationId: childConversationId,
-        error: errMsg
-      });
       const pendingResult = {
         subagentId: childConversationId,
         task,
@@ -13135,7 +13151,13 @@ var AgentOrchestrator = class {
         error: { code: "SUBAGENT_ERROR", message: errMsg },
         timestamp: Date.now()
       };
-      await this.conversationStore.appendSubagentResult(parentConversationId, pendingResult).catch(() => {
+      await this.appendSubagentResultReliable(parentConversationId, pendingResult);
+      this.hooks?.onStreamEnd?.(childConversationId);
+      await this.eventSink(parentConversationId, {
+        type: "subagent:error",
+        subagentId: childConversationId,
+        conversationId: childConversationId,
+        error: errMsg
       });
       this.triggerParentCallback(parentConversationId).catch(
         (err2) => console.error(`[poncho][subagent] Parent callback failed:`, err2 instanceof Error ? err2.message : err2)
@@ -13251,12 +13273,12 @@ ${resultBody}`,
         },
         initialContextTokens: conversation.contextTokens ?? 0,
         initialContextWindow: conversation.contextWindow ?? 0,
-        onEvent: (event) => {
+        onEvent: async (event) => {
           if (event.type === "run:started") {
             const active = this.activeConversationRuns.get(conversationId);
             if (active) active.runId = event.runId;
           }
-          this.eventSink(conversationId, event);
+          await this.eventSink(conversationId, event);
         }
       });
       flushTurnDraft(execution.draft);
@@ -13442,11 +13464,6 @@ ${resultBody}`,
         await this.conversationStore.update(conv);
       }
       this.activeSubagentRuns.delete(conversationId);
-      await this.eventSink(parentConversationId, {
-        type: "subagent:completed",
-        subagentId: conversationId,
-        conversationId
-      });
       let gathered = realResponseText(runResult?.response) || realResponseText(draft.assistantResponse);
       if (!gathered) {
         const freshSubConv = await this.conversationStore.get(conversationId);
@@ -13464,7 +13481,14 @@ ${resultBody}`,
           ...abnormal ? { error: { code: runError?.code ?? "SUBAGENT_INCOMPLETE", message: runError?.message ?? "subagent ended without a result" } } : {},
           timestamp: Date.now()
         };
-        await this.conversationStore.appendSubagentResult(parentConversationId, result);
+        await this.appendSubagentResultReliable(parentConversationId, result);
+      }
+      await this.eventSink(parentConversationId, {
+        type: "subagent:completed",
+        subagentId: conversationId,
+        conversationId
+      });
+      if (parentConv) {
         if (this.isServerless) {
           this.hooks.dispatchBackground("subagent-callback", parentConversationId);
         } else {
@@ -13492,11 +13516,6 @@ ${resultBody}`,
         conv.updatedAt = Date.now();
         await this.conversationStore.update(conv);
       }
-      await this.eventSink(conversation.parentConversationId, {
-        type: "subagent:completed",
-        subagentId: conversationId,
-        conversationId
-      });
       const parentConv = await this.conversationStore.get(conversation.parentConversationId);
       if (parentConv) {
         const result = {
@@ -13506,12 +13525,20 @@ ${resultBody}`,
           error: { code: "CONTINUATION_ERROR", message: err instanceof Error ? err.message : String(err) },
           timestamp: Date.now()
         };
-        await this.conversationStore.appendSubagentResult(conversation.parentConversationId, result);
+        await this.appendSubagentResultReliable(conversation.parentConversationId, result);
+      }
+      await this.eventSink(conversation.parentConversationId, {
+        type: "subagent:completed",
+        subagentId: conversationId,
+        conversationId
+      });
+      if (parentConv) {
         if (this.isServerless) {
           this.hooks.dispatchBackground("subagent-callback", conversation.parentConversationId);
         } else {
-          this.processSubagentCallback(conversation.parentConversationId).catch(() => {
-          });
+          this.processSubagentCallback(conversation.parentConversationId).catch(
+            (err2) => console.error(`[poncho][subagent] Continuation-error callback failed:`, err2 instanceof Error ? err2.message : err2)
+          );
         }
       }
     }
@@ -13555,7 +13582,7 @@ ${resultBody}`,
             opts.parentConversationId,
             opts.task,
             opts.ownerId
-          ).catch((err) => console.error(`[poncho][subagent] Background spawn failed:`, err instanceof Error ? err.message : err));
+          ).catch((err) => this.handleSpawnFailure(conversation.conversationId, opts.parentConversationId, opts.task, err));
         }
         return { subagentId: conversation.conversationId };
       },
@@ -13588,7 +13615,7 @@ ${resultBody}`,
             conversation.parentConversationId,
             message,
             conversation.ownerId
-          ).catch((err) => console.error(`[poncho][subagent] Background sendMessage failed:`, err instanceof Error ? err.message : err));
+          ).catch((err) => this.handleSpawnFailure(subagentId, conversation.parentConversationId, message, err));
         }
         return { subagentId };
       },
@@ -13667,6 +13694,67 @@ ${resultBody}`,
     };
   }
   // ── Stale subagent recovery ──
+  /**
+   * Append a subagent result to its parent, retrying once on a transient
+   * store failure before giving up loudly. A silently dropped result is the
+   * worst subagent failure mode — the parent waits forever on a subagent it
+   * thinks is still running — so this never swallows the error the way the
+   * old `.catch(() => {})` call sites did. Returns whether the result landed.
+   */
+  async appendSubagentResultReliable(parentConversationId, result) {
+    try {
+      await this.conversationStore.appendSubagentResult(parentConversationId, result);
+      return true;
+    } catch (firstErr) {
+      try {
+        await this.conversationStore.appendSubagentResult(parentConversationId, result);
+        return true;
+      } catch (secondErr) {
+        console.error(
+          `[poncho][subagent] FAILED to persist result for subagent ${result.subagentId} to parent ${parentConversationId} after 2 attempts \u2014 the parent will not see this result:`,
+          secondErr instanceof Error ? secondErr.message : secondErr,
+          `(first attempt: ${firstErr instanceof Error ? firstErr.message : firstErr})`
+        );
+        return false;
+      }
+    }
+  }
+  /**
+   * A subagent's fire-and-forget background run rejected outside its own
+   * error handling (e.g. it threw before entering its try block, or the
+   * catch block itself threw). Without this the parent is left waiting on a
+   * subagent that will never report back. Record the failure on the child
+   * and hand the parent an error result so the turn can resume.
+   */
+  async handleSpawnFailure(childConversationId, parentConversationId, task, err) {
+    const message = err instanceof Error ? err.message : String(err);
+    console.error(`[poncho][subagent] Background run failed for ${childConversationId}:`, message);
+    try {
+      const conv = await this.conversationStore.get(childConversationId);
+      if (conv?.subagentMeta && conv.subagentMeta.status === "running") {
+        conv.subagentMeta = {
+          ...conv.subagentMeta,
+          status: "error",
+          error: { code: "SUBAGENT_SPAWN_FAILED", message }
+        };
+        conv.updatedAt = Date.now();
+        await this.conversationStore.update(conv);
+      }
+    } catch {
+    }
+    const appended = await this.appendSubagentResultReliable(parentConversationId, {
+      subagentId: childConversationId,
+      task,
+      status: "error",
+      error: { code: "SUBAGENT_SPAWN_FAILED", message },
+      timestamp: Date.now()
+    });
+    if (appended) {
+      this.triggerParentCallback(parentConversationId).catch(
+        (e) => console.error(`[poncho][subagent] Parent callback failed after spawn failure:`, e instanceof Error ? e.message : e)
+      );
+    }
+  }
   async recoverStaleSubagents() {
     const allSummaries = await this.conversationStore.listSummaries();
     const subagentSummaries = allSummaries.filter((s) => s.parentConversationId);
@@ -13692,11 +13780,20 @@ ${resultBody}`,
             error: conv.subagentMeta.error,
             timestamp: Date.now()
           };
-          await this.conversationStore.appendSubagentResult(conv.parentConversationId, pendingResult);
+          await this.appendSubagentResultReliable(conv.parentConversationId, pendingResult);
           parentsToCallback.add(conv.parentConversationId);
         }
       }
     }
+    const parentIds = new Set(
+      subagentSummaries.map((s) => s.parentConversationId).filter((id) => !!id)
+    );
+    for (const parentId of parentIds) {
+      if (parentsToCallback.has(parentId)) continue;
+      if (this.activeConversationRuns.has(parentId)) continue;
+      const parent = await this.conversationStore.get(parentId);
+      if (parent?.pendingSubagentResults?.length) parentsToCallback.add(parentId);
+    }
     for (const parentId of parentsToCallback) {
       this.processSubagentCallback(parentId).catch(
         (err) => console.error(`[poncho][subagent] Recovery callback failed for ${parentId}:`, err instanceof Error ? err.message : err)

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@poncho-ai/harness",
-  "version": "0.52.0",
+  "version": "0.52.2",
   "description": "Agent execution runtime - conversation loop, tool dispatch, streaming",
   "repository": {
     "type": "git",

package/src/default-agent.ts CHANGED

@@ -26,7 +26,12 @@ export interface DefaultAgentDefinitionOptions {
   modelProvider?: "anthropic" | "openai" | "openai-codex";
   /** Model name. Default: "claude-opus-4-5". */
   modelName?: string;
-  /** Sampling temperature. Default: 0.2. */
+  /**
+   * Sampling temperature. When unset, it is omitted from the generated
+   * frontmatter entirely and the harness sends no temperature (provider
+   * default). Newer models (Fable 5, Opus 4.7+) reject `temperature` — leave
+   * this unset for them.
+   */
   temperature?: number;
   /** Max tool-call steps per run. Default: 20. */
   maxSteps?: number;
@@ -55,7 +60,10 @@ export const defaultAgentDefinition = (
   const description = opts.description ?? DEFAULT_AGENT_DESCRIPTION;
   const modelProvider = opts.modelProvider ?? DEFAULT_MODEL_PROVIDER;
   const modelName = opts.modelName ?? DEFAULT_MODEL_NAME;
-  const temperature = opts.temperature ?? DEFAULT_TEMPERATURE;
+  // Opt-in: only emit a `temperature:` line when explicitly provided, so the
+  // harness sends no temperature otherwise (newer models reject it).
+  const temperatureLine =
+    opts.temperature !== undefined ? `\n  temperature: ${opts.temperature}` : "";
   const maxSteps = opts.maxSteps ?? DEFAULT_MAX_STEPS;
   const timeout = opts.timeout ?? DEFAULT_TIMEOUT;
@@ -65,8 +73,7 @@ id: ${id}
 description: ${description}
 model:
   provider: ${modelProvider}
-  name: ${modelName}
-  temperature: ${temperature}
+  name: ${modelName}${temperatureLine}
 limits:
   maxSteps: ${maxSteps}
   timeout: ${timeout}
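
Note on the hunk above: the `temperature:` line is now emitted only when a value is explicitly passed. A minimal sketch of that pattern follows, assuming a trimmed-down options type; `ModelOptions` and `renderModelBlock` are hypothetical names for illustration and are not part of the package.

```typescript
// Hypothetical, reduced options type; the real interface has more fields.
interface ModelOptions {
  modelProvider: string;
  modelName: string;
  temperature?: number;
}

// Render the `model:` block of the frontmatter. The `temperature:` line is
// appended only when a value was explicitly provided.
function renderModelBlock(opts: ModelOptions): string {
  const temperatureLine =
    opts.temperature !== undefined ? `\n  temperature: ${opts.temperature}` : "";
  return `model:\n  provider: ${opts.modelProvider}\n  name: ${opts.modelName}${temperatureLine}`;
}

// No temperature given: the block has no `temperature:` line at all.
const withoutTemp = renderModelBlock({
  modelProvider: "anthropic",
  modelName: "claude-opus-4-5",
});

// An explicit 0 is still emitted, because the check is `!== undefined`
// rather than a truthiness test.
const withTemp = renderModelBlock({
  modelProvider: "anthropic",
  modelName: "claude-opus-4-5",
  temperature: 0,
});
```

Checking against `undefined` rather than truthiness matters here: a caller who deliberately sets `temperature: 0` still gets the line written.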

package/src/harness.ts CHANGED

@@ -2907,7 +2907,12 @@ Code is wrapped in an async IIFE — use \`return\` to return a value to the too
         convertedUpTo = messages.length;
         const coreMessages = cachedCoreMessages;
-        const temperature = agent.frontmatter.model?.temperature ?? 0.2;
+        // Only send temperature when the agent explicitly set one. Newer
+        // models (Fable 5, Opus 4.7+) removed sampling params entirely and
+        // return a 400 ("`temperature` is deprecated for this model") on any
+        // value — forcing a default here broke them. Treated like maxTokens
+        // below: omitted from the request when undefined.
+        const temperature = agent.frontmatter.model?.temperature;
         const maxTokens = agent.frontmatter.model?.maxTokens;
         // Place the tail breakpoint before any untruncated tool-result so
         // we cache only the stable prefix when prior-run tool results are
@@ -2971,7 +2976,7 @@ Code is wrapped in an async IIFE — use \`return\` to return a value to the too
           ...(useStaticCache ? {} : { system: systemPrompt }),
           messages: messagesForStep,
           tools: toolsForStep,
-          temperature,
+          ...(typeof temperature === "number" ? { temperature } : {}),
           abortSignal: input.abortSignal,
           ...(typeof maxTokens === "number" ? { maxTokens } : {}),
           experimental_telemetry: {
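
Note on the hunk above: `temperature` is now spread into the request only when it is a number, the same way `maxTokens` already was. The sketch below isolates that conditional-spread idiom; `StepRequest` and `buildStepRequest` are hypothetical names, and the request shape is an assumption, not the harness's actual call.

```typescript
// Hypothetical request shape, reduced to the fields relevant here.
interface StepRequest {
  messages: string[];
  temperature?: number;
  maxTokens?: number;
}

interface ModelSettings {
  temperature?: number;
  maxTokens?: number;
}

// Build the request so that unset sampling options are absent from the
// object entirely, rather than present with an `undefined` value.
function buildStepRequest(messages: string[], model?: ModelSettings): StepRequest {
  const temperature = model?.temperature;
  const maxTokens = model?.maxTokens;
  return {
    messages,
    ...(typeof temperature === "number" ? { temperature } : {}),
    ...(typeof maxTokens === "number" ? { maxTokens } : {}),
  };
}

// Nothing configured: neither key exists on the request.
const bare = buildStepRequest(["hi"], {});

// Only temperature configured: it is forwarded, maxTokens stays absent.
const tuned = buildStepRequest(["hi"], { temperature: 0.2 });
```

Leaving the key off completely, instead of sending `temperature: undefined`, avoids depending on how a downstream client serializes undefined fields.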

package/src/orchestrator/orchestrator.ts CHANGED

@@ -1012,13 +1012,6 @@ export class AgentOrchestrator {
         await this.conversationStore.update(conv);
       }
-      this.hooks?.onStreamEnd?.(childConversationId);
-      await this.eventSink(parentConversationId, {
-        type: "subagent:completed",
-        subagentId: childConversationId,
-        conversationId: childConversationId,
-      });
       // Recover the subagent's real output: prefer the run response, then the
       // streamed draft, then walk the transcript — discarding the synthetic
       // "[Error: ...]" placeholder at each step.
@@ -1051,7 +1044,18 @@ export class AgentOrchestrator {
           : {}),
         timestamp: Date.now(),
       };
-      await this.conversationStore.appendSubagentResult(parentConversationId, pendingResult);
+      // Persist the result BEFORE emitting subagent:completed: a consumer
+      // reacting to the event (the parent callback, the streaming client)
+      // must find the result already durable in the store, not race its write.
+      await this.appendSubagentResultReliable(parentConversationId, pendingResult);
+      this.hooks?.onStreamEnd?.(childConversationId);
+      await this.eventSink(parentConversationId, {
+        type: "subagent:completed",
+        subagentId: childConversationId,
+        conversationId: childConversationId,
+      });
       this.triggerParentCallback(parentConversationId).catch(err =>
         console.error(`[poncho][subagent] Parent callback failed:`, err instanceof Error ? err.message : err),
       );
@@ -1070,6 +1074,16 @@ export class AgentOrchestrator {
         await this.conversationStore.update(conv);
       }
+      const pendingResult: PendingSubagentResult = {
+        subagentId: childConversationId,
+        task,
+        status: "error",
+        error: { code: "SUBAGENT_ERROR", message: errMsg },
+        timestamp: Date.now(),
+      };
+      // Persist before emitting (see the success path); never swallow.
+      await this.appendSubagentResultReliable(parentConversationId, pendingResult);
       this.hooks?.onStreamEnd?.(childConversationId);
       await this.eventSink(parentConversationId, {
         type: "subagent:error",
@@ -1078,14 +1092,6 @@ export class AgentOrchestrator {
         error: errMsg,
       });
-      const pendingResult: PendingSubagentResult = {
-        subagentId: childConversationId,
-        task,
-        status: "error",
-        error: { code: "SUBAGENT_ERROR", message: errMsg },
-        timestamp: Date.now(),
-      };
-      await this.conversationStore.appendSubagentResult(parentConversationId, pendingResult).catch(() => {});
       this.triggerParentCallback(parentConversationId).catch(err2 =>
         console.error(`[poncho][subagent] Parent callback failed:`, err2 instanceof Error ? err2.message : err2),
       );
@@ -1221,12 +1227,15 @@ export class AgentOrchestrator {
         },
         initialContextTokens: conversation.contextTokens ?? 0,
         initialContextWindow: conversation.contextWindow ?? 0,
-        onEvent: (event) => {
+        onEvent: async (event) => {
           if (event.type === "run:started") {
             const active = this.activeConversationRuns.get(conversationId);
             if (active) active.runId = event.runId;
           }
-          this.eventSink(conversationId, event);
+          // Await so the event is fully sunk before the next step's events,
+          // matching every other eventSink call site (the callback run path
+          // was the lone fire-and-forget exception).
+          await this.eventSink(conversationId, event);
         },
       });
       flushTurnDraft(execution.draft);
@@ -1436,11 +1445,6 @@ export class AgentOrchestrator {
       }
       this.activeSubagentRuns.delete(conversationId);
-      await this.eventSink(parentConversationId, {
-        type: "subagent:completed",
-        subagentId: conversationId,
-        conversationId,
-      });
       let gathered = realResponseText(runResult?.response) || realResponseText(draft.assistantResponse);
       if (!gathered) {
@@ -1464,8 +1468,17 @@ export class AgentOrchestrator {
             : {}),
           timestamp: Date.now(),
         };
-        await this.conversationStore.appendSubagentResult(parentConversationId, result);
+        // Persist before emitting completion (see runSubagent).
+        await this.appendSubagentResultReliable(parentConversationId, result);
+      }
+      await this.eventSink(parentConversationId, {
+        type: "subagent:completed",
+        subagentId: conversationId,
+        conversationId,
+      });
+      if (parentConv) {
         if (this.isServerless) {
           this.hooks!.dispatchBackground!("subagent-callback", parentConversationId);
         } else {
@@ -1490,12 +1503,6 @@ export class AgentOrchestrator {
         await this.conversationStore.update(conv);
       }
-      await this.eventSink(conversation.parentConversationId!, {
-        type: "subagent:completed",
-        subagentId: conversationId,
-        conversationId,
-      });
       const parentConv = await this.conversationStore.get(conversation.parentConversationId!);
       if (parentConv) {
         const result: PendingSubagentResult = {
@@ -1505,11 +1512,23 @@ export class AgentOrchestrator {
           error: { code: "CONTINUATION_ERROR", message: err instanceof Error ? err.message : String(err) },
           timestamp: Date.now(),
         };
-        await this.conversationStore.appendSubagentResult(conversation.parentConversationId!, result);
+        // Persist before emitting; never swallow (was `.catch(() => {})`).
+        await this.appendSubagentResultReliable(conversation.parentConversationId!, result);
+      }
+      await this.eventSink(conversation.parentConversationId!, {
+        type: "subagent:completed",
+        subagentId: conversationId,
+        conversationId,
+      });
+      if (parentConv) {
         if (this.isServerless) {
           this.hooks!.dispatchBackground!("subagent-callback", conversation.parentConversationId!);
         } else {
-          this.processSubagentCallback(conversation.parentConversationId!).catch(() => {});
+          this.processSubagentCallback(conversation.parentConversationId!).catch(err2 =>
+            console.error(`[poncho][subagent] Continuation-error callback failed:`, err2 instanceof Error ? err2.message : err2),
+          );
         }
       }
     }
@@ -1559,7 +1578,7 @@ export class AgentOrchestrator {
             opts.parentConversationId,
             opts.task,
             opts.ownerId,
-          ).catch(err => console.error(`[poncho][subagent] Background spawn failed:`, err instanceof Error ? err.message : err));
+          ).catch(err => this.handleSpawnFailure(conversation.conversationId, opts.parentConversationId, opts.task, err));
         }
         return { subagentId: conversation.conversationId };
@@ -1596,7 +1615,7 @@ export class AgentOrchestrator {
             conversation.parentConversationId,
             message,
             conversation.ownerId,
-          ).catch(err => console.error(`[poncho][subagent] Background sendMessage failed:`, err instanceof Error ? err.message : err));
+          ).catch(err => this.handleSpawnFailure(subagentId, conversation.parentConversationId!, message, err));
         }
         return { subagentId };
@@ -1684,6 +1703,79 @@ export class AgentOrchestrator {
   // ── Stale subagent recovery ──
+  /**
+   * Append a subagent result to its parent, retrying once on a transient
+   * store failure before giving up loudly. A silently dropped result is the
+   * worst subagent failure mode — the parent waits forever on a subagent it
+   * thinks is still running — so this never swallows the error the way the
+   * old `.catch(() => {})` call sites did. Returns whether the result landed.
+   */
+  private async appendSubagentResultReliable(
+    parentConversationId: string,
+    result: PendingSubagentResult,
+  ): Promise<boolean> {
+    try {
+      await this.conversationStore.appendSubagentResult(parentConversationId, result);
+      return true;
+    } catch (firstErr) {
+      try {
+        await this.conversationStore.appendSubagentResult(parentConversationId, result);
+        return true;
+      } catch (secondErr) {
+        console.error(
+          `[poncho][subagent] FAILED to persist result for subagent ${result.subagentId} ` +
+            `to parent ${parentConversationId} after 2 attempts — the parent will not see this result:`,
+          secondErr instanceof Error ? secondErr.message : secondErr,
+          `(first attempt: ${firstErr instanceof Error ? firstErr.message : firstErr})`,
+        );
+        return false;
+      }
+    }
+  }
+  /**
+   * A subagent's fire-and-forget background run rejected outside its own
+   * error handling (e.g. it threw before entering its try block, or the
+   * catch block itself threw). Without this the parent is left waiting on a
+   * subagent that will never report back. Record the failure on the child
+   * and hand the parent an error result so the turn can resume.
+   */
+  private async handleSpawnFailure(
+    childConversationId: string,
+    parentConversationId: string,
+    task: string,
+    err: unknown,
+  ): Promise<void> {
+    const message = err instanceof Error ? err.message : String(err);
+    console.error(`[poncho][subagent] Background run failed for ${childConversationId}:`, message);
+    try {
+      const conv = await this.conversationStore.get(childConversationId);
+      if (conv?.subagentMeta && conv.subagentMeta.status === "running") {
+        conv.subagentMeta = {
+          ...conv.subagentMeta,
+          status: "error",
+          error: { code: "SUBAGENT_SPAWN_FAILED", message },
+        };
+        conv.updatedAt = Date.now();
+        await this.conversationStore.update(conv);
+      }
+    } catch {
+      // best-effort: the result append below is what the parent actually needs
+    }
+    const appended = await this.appendSubagentResultReliable(parentConversationId, {
+      subagentId: childConversationId,
+      task,
+      status: "error",
+      error: { code: "SUBAGENT_SPAWN_FAILED", message },
+      timestamp: Date.now(),
+    });
+    if (appended) {
+      this.triggerParentCallback(parentConversationId).catch(e =>
+        console.error(`[poncho][subagent] Parent callback failed after spawn failure:`, e instanceof Error ? e.message : e),
+      );
+    }
+  }
   async recoverStaleSubagents(): Promise<void> {
     const allSummaries = await this.conversationStore.listSummaries();
     const subagentSummaries = allSummaries.filter((s) => s.parentConversationId);
@@ -1711,11 +1803,26 @@ export class AgentOrchestrator {
             error: conv.subagentMeta.error,
             timestamp: Date.now(),
           };
-          await this.conversationStore.appendSubagentResult(conv.parentConversationId, pendingResult);
+          await this.appendSubagentResultReliable(conv.parentConversationId, pendingResult);
           parentsToCallback.add(conv.parentConversationId);
         }
       }
     }
+    // Also drain parents that already have results sitting in the store but
+    // no active run to deliver them — e.g. a result persisted just before a
+    // process restart, whose in-memory callback trigger was lost. Without
+    // this the parent stays stuck even though its result landed durably.
+    const parentIds = new Set(
+      subagentSummaries.map(s => s.parentConversationId).filter((id): id is string => !!id),
+    );
+    for (const parentId of parentIds) {
+      if (parentsToCallback.has(parentId)) continue;
+      if (this.activeConversationRuns.has(parentId)) continue;
+      const parent = await this.conversationStore.get(parentId);
+      if (parent?.pendingSubagentResults?.length) parentsToCallback.add(parentId);
+    }
     for (const parentId of parentsToCallback) {
       this.processSubagentCallback(parentId).catch(err =>
         console.error(`[poncho][subagent] Recovery callback failed for ${parentId}:`, err instanceof Error ? err.message : err),
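
Note on the recovery hunk above: besides stale subagents, `recoverStaleSubagents` now also drains parents that have persisted results but no active run. The selection rule can be sketched in isolation as below. This is an illustration under assumptions: it uses in-memory arrays and a plain record in place of the real conversation store and run map, and `ParentState` and `selectParentsToDrain` are hypothetical names.

```typescript
// Hypothetical, reduced view of a parent conversation record.
interface ParentState {
  pendingSubagentResults?: unknown[];
}

// Pick the parents that still need a callback: skip those already queued,
// skip those with an active run, and keep only parents that actually have
// undelivered results in the store.
function selectParentsToDrain(
  candidateParentIds: string[],
  alreadyQueued: string[],
  activeRuns: string[],
  store: Record<string, ParentState | undefined>,
): string[] {
  const toDrain: string[] = [];
  for (const parentId of candidateParentIds) {
    if (toDrain.indexOf(parentId) !== -1) continue; // de-duplicate candidates
    if (alreadyQueued.indexOf(parentId) !== -1) continue;
    if (activeRuns.indexOf(parentId) !== -1) continue;
    const parent = store[parentId];
    if (parent?.pendingSubagentResults?.length) toDrain.push(parentId);
  }
  return toDrain;
}

const drained = selectParentsToDrain(
  ["p1", "p2", "p3", "p3", "p4"],
  ["p1"], // p1 is already queued for a callback
  ["p2"], // p2 has an active run that will deliver its own results
  {
    p1: { pendingSubagentResults: [{}] },
    p2: { pendingSubagentResults: [{}] },
    p3: { pendingSubagentResults: [{}] }, // stuck: results landed, no run
    p4: { pendingSubagentResults: [] },   // nothing pending
  },
);
```

In this sample only `p3` is selected, which matches the case the diff comment describes: a result persisted just before a restart whose in-memory callback trigger was lost.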

package/src/storage/postgres-engine.ts CHANGED

@@ -36,12 +36,25 @@ export class PostgresEngine extends SqlStorageEngine {
         return rows as T[];
       },
       exec: async (sql: string): Promise<void> => {
-        await this.sql.unsafe(sql);
+        // DDL is idempotent in our migrations (`CREATE TABLE IF NOT
+        // EXISTS`, etc.), so retrying on a stale-socket drop is
+        // safe — same idempotency as `query()` reads/writes.
+        await this.runWithRetry(() => this.sql.unsafe(sql));
       },
       transaction: async (fn: () => Promise<void>): Promise<void> => {
-        await this.sql.begin(async () => {
+        // Transactions are inherently retry-safe at the
+        // CONNECTION_ENDED boundary: if the connection dies before
+        // BEGIN takes effect server-side, no work was committed and
+        // re-running `fn` produces the correct end state. The retry
+        // only catches the connection-level reject from the
+        // postgres.js client; a partial-commit + drop scenario
+        // surfaces as a different error code and bypasses the
+        // retry, preserving the caller's expectation that a
+        // returned transaction either fully committed or fully
+        // rolled back.
+        await this.runWithRetry(() => this.sql.begin(async () => {
           await fn();
-        });
+        }));
       },
     };
   }
@@ -59,25 +72,34 @@ export class PostgresEngine extends SqlStorageEngine {
       prepare: false,
       // Connection-pool resilience. Managed Postgres providers
       // (Railway, Neon, Heroku, etc.) routinely drop idle TCP
-      // connections server-side after a few minutes. Without these
-      // knobs, porsager/postgres keeps stale sockets in the pool;
-      // the next query on one rejects with
-      // `write CONNECTION_ENDED <host>:5432` at `durMs=0`, surfacing
-      // as a hard failure to the caller. Two complementary settings:
+      // connections server-side after a few minutes — and on
+      // Railway in particular, mid-stream drops within a few
+      // seconds of inactivity are common. Without these knobs,
+      // porsager/postgres keeps stale sockets in the pool; the
+      // next query on one rejects with
+      // `write CONNECTION_ENDED <host>:5432` at `durMs=0`,
+      // surfacing as a hard failure to the caller.
       //
-      //   - `idle_timeout: 20` closes idle connections client-side
-      //     after 20s, before any reasonable provider-side timer
-      //     fires. Fresh connection on next checkout = no stale
-      //     socket race.
-      //   - `max_lifetime: 600` (10 min) recycles long-lived
-      //     connections defensively even if they've stayed busy,
-      //     which sidesteps a separate class of provider-side
-      //     "max connection age" limits.
+      //   - `idle_timeout: 5` closes idle connections client-side
+      //     aggressively. Empirically Railway's pg drops sockets
+      //     well before the 20s value that managed-provider docs
+      //     suggest; 5s is short enough to win the race in
+      //     practice while staying long enough that bursty
+      //     workloads still get connection reuse.
+      //   - `max_lifetime: 300` (5 min) recycles long-lived
+      //     connections defensively. Even with idle_timeout, a
+      //     connection that's been actively serving small queries
+      //     for an hour can hit provider-side max-age limits.
+      //   - `connect_timeout: 10` — slightly less patient on
+      //     initial connect than the 30s default. Combined with
+      //     the retry below, "connection refused" surfaces faster
+      //     during incidents and the caller can shed load instead
+      //     of stacking up.
       //
-      // Defaults remain `max: 10`, `connect_timeout: 30` — leaving
-      // pool size + initial connect behavior unchanged.
-      idle_timeout: 20,
-      max_lifetime: 60 * 10,
+      // Pool size (`max: 10`) unchanged.
+      idle_timeout: 5,
+      max_lifetime: 60 * 5,
+      connect_timeout: 10,
     });
   }
@@ -147,33 +169,53 @@ export class PostgresEngine extends SqlStorageEngine {
   }
   /**
-   * Single retry on a transient connection-layer failure. The
-   * `idle_timeout` / `max_lifetime` config above prevents *most*
-   * stale-connection cases, but a query can still race a
-   * provider-initiated drop in flight — the postgres.js client
-   * rejects with `code: "CONNECTION_ENDED"` and the next attempt
-   * checks out a fresh connection from the pool. One retry is
-   * enough; if it fails again the host-side network is genuinely
-   * broken and the caller should see the error.
+   * Retry on transient connection-layer failures. Three attempts
+   * with exponential-ish backoff (0, 50ms, 200ms) — the pool may
+   * have multiple stale sockets accumulated during an idle period
+   * (especially on managed Postgres after boot when no traffic
+   * has flowed for a while), so a single retry can land on a
+   * second stale socket and still fail. Three attempts virtually
+   * always exhausts the staleness wave; if all three throw, the
+   * failure is real and the caller should see it.
    *
-   * Only retries reads + the standard exec/run paths in `query`;
-   * `sql.unsafe(sql)` calls in `executeRaw` (migration DDL) and
-   * `sql.begin(...)` transactions are unwrapped — those are
-   * idempotent-by-construction (DDL is `IF NOT EXISTS`) or
-   * atomically scoped (transactions roll back cleanly), and adding
-   * a retry around them would complicate the transaction
-   * semantics.
+   * Applied to every pg path the executor exposes:
+   *  - `query()` (run/get/all)  — natural retry: queries are
+   *    idempotent at the connection-failure boundary because the
+   *    server-side rollback runs cleanly on socket close.
+   *  - `exec(sql)` for DDL      — `CREATE TABLE IF NOT EXISTS` and
+   *    friends are idempotent by construction.
+   *  - `transaction(fn)`        — only retried when the
+   *    CONNECTION_ENDED reject arrives *before* the transaction
+   *    body started executing on the connection; if it errors
+   *    mid-transaction, the postgres.js client surfaces a
+   *    different error class (the inner SQL error) and bypasses
+   *    this retry, preserving the all-or-nothing semantics.
    */
   private async runWithRetry<T>(fn: () => Promise<T>): Promise<T> {
-    try {
-      return await fn();
-    } catch (err) {
-      const code = (err as { code?: string } | null | undefined)?.code;
-      if (code === "CONNECTION_ENDED" || code === "CONNECTION_CLOSED" || code === "CONNECTION_DESTROYED") {
+    const backoffs = [0, 50, 200];
+    let lastErr: unknown;
+    for (let attempt = 0; attempt < backoffs.length; attempt++) {
+      if (backoffs[attempt] > 0) {
+        await new Promise((r) => setTimeout(r, backoffs[attempt]));
+      }
+      try {
         return await fn();
+      } catch (err) {
+        lastErr = err;
+        const code = (err as { code?: string } | null | undefined)?.code;
+        if (
+          code === "CONNECTION_ENDED" ||
+          code === "CONNECTION_CLOSED" ||
+          code === "CONNECTION_DESTROYED" ||
+          code === "CONNECT_TIMEOUT" ||
+          code === "ECONNRESET"
+        ) {
+          continue;
+        }
+        throw err;
       }
-      throw err;
     }
+    throw lastErr;
   }
   private addToPathCache(tenantId: string, path: string): void {