@poncho-ai/harness 0.52.0 → 0.52.2

@@ -1,5 +1,5 @@
  
- > @poncho-ai/harness@0.52.0 build /home/runner/work/poncho-ai/poncho-ai/packages/harness
+ > @poncho-ai/harness@0.52.2 build /home/runner/work/poncho-ai/poncho-ai/packages/harness
  > node scripts/embed-docs.js && tsup src/index.ts --format esm --dts
  
  [embed-docs] Generated poncho-docs.ts with 4 topics
@@ -8,9 +8,9 @@
  CLI tsup v8.5.1
  CLI Target: es2022
  ESM Build start
- ESM dist/index.js 536.18 KB
+ ESM dist/index.js 540.60 KB
  ESM dist/isolate-F2PPSUL6.js 53.82 KB
- ESM ⚡️ Build success in 234ms
+ ESM ⚡️ Build success in 204ms
  DTS Build start
- DTS ⚡️ Build success in 7615ms
- DTS dist/index.d.ts 92.18 KB
+ DTS ⚡️ Build success in 7242ms
+ DTS dist/index.d.ts 93.60 KB
package/CHANGELOG.md CHANGED
@@ -1,5 +1,85 @@
1
1
  # @poncho-ai/harness
2
2
 
3
+ ## 0.52.2
4
+
5
+ ### Patch Changes
6
+
7
+ - [#124](https://github.com/cesr/poncho-ai/pull/124) [`4ae26e0`](https://github.com/cesr/poncho-ai/commit/4ae26e0d8d2788f57411f9c17e10766769514f9b) Thanks [@cesr](https://github.com/cesr)! - harness: postgres retry covers exec/transaction + 3 attempts + tighter idle
8
+
9
+ Follow-up to the previous `idle_timeout`/`max_lifetime`/retry patch.
10
+ Live testing on Railway showed the previous values weren't tight
11
+ enough — `write CONNECTION_ENDED postgres.railway.internal:5432`
12
+ still surfaced both during user-facing chat turns and during
13
+ subagent auto-callback reruns, despite the new config and the
14
+ one-shot retry.
15
+
16
+ Two failure modes the previous version didn't cover:
17
+ 1. The retry only wrapped `private query()` (executor.run/get/all),
18
+ but `executor.exec` (`sql.unsafe`) and `executor.transaction`
19
+ (`sql.begin`) called the postgres.js client directly. A pg drop
20
+ inside a transaction or migration write threw straight through.
21
+ 2. After an idle period the pool can have multiple stale sockets;
22
+ a single retry can checkout a second stale socket from the pool
23
+ and fail again. One-shot retry exhausted into an error visible
24
+ to the caller.
25
+
26
+ Fixes:
27
+ - All three executor paths (`run/get/all`, `exec`, `transaction`)
28
+ now go through the same `runWithRetry` wrapper. Transactions
29
+ only retry the connection-level `CONNECTION_ENDED` reject from
30
+ the postgres.js client — actual SQL errors mid-transaction
31
+ surface as a different error class and bypass the retry,
32
+ preserving atomic semantics.
33
+ - Three attempts with light exponential backoff (0, 50ms, 200ms).
34
+ Enough to ride out a typical staleness wave; if all three fail
35
+ the network is genuinely broken.
36
+ - `CONNECT_TIMEOUT` and `ECONNRESET` added to the retry-eligible
37
+ error codes.
38
+
39
+ Config knobs tightened:
40
+ - `idle_timeout: 5` (was 20). Empirically Railway's pg drops
41
+ sockets well before 20s; 5s wins the race in practice while
42
+ staying long enough for bursty workloads to reuse connections.
43
+ - `max_lifetime: 300` (was 600). Same reasoning — recycle more
44
+ aggressively.
45
+ - `connect_timeout: 10` (was 30 default). Faster failure during
46
+ incidents lets callers shed load instead of stacking up.
47
+
48
+ - [#144](https://github.com/cesr/poncho-ai/pull/144) [`28d640b`](https://github.com/cesr/poncho-ai/commit/28d640b2f82ea780f8e0be90965972d9903c01d7) Thanks [@cesr](https://github.com/cesr)! - orchestrator: make subagent result delivery reliable
49
+
50
+ Subagent results could silently never reach the parent agent. Several
51
+ plumbing bugs in `runSubagent` / `runSubagentContinuation`:
52
+ - **Emit-before-persist race.** `subagent:completed` / `subagent:error`
53
+ were emitted to the parent's event stream _before_ the result was
54
+ written to the store, so a consumer reacting to the event (the parent
55
+ callback, the streaming client) could race the write. Now the result
56
+ is persisted first, then the event is emitted.
57
+ - **Silently swallowed writes.** Two `appendSubagentResult(...).catch(() => {})`
58
+ call sites (the error path and the continuation-error path) dropped the
59
+ result with no trace on a transient store failure. Replaced with a
60
+ shared `appendSubagentResultReliable` helper that retries once and then
61
+ logs loudly — a dropped result is the worst failure mode (the parent
62
+ waits forever on a subagent it thinks is still running).
63
+ - **Un-awaited eventSink.** The subagent-callback run path was the lone
64
+ `this.eventSink(...)` call site that didn't `await` (every other site
65
+ does), so callback-turn events could interleave out of order. Now awaited.
66
+ - **Spawn rejections went to a bare `console.error`.** A background
67
+ `runSubagent` that rejected outside its own try/catch left the parent
68
+ hanging. Both fire-and-forget spawn paths now route to a
69
+ `handleSpawnFailure` that marks the child errored and hands the parent
70
+ an error result so the turn can resume.
71
+ - **`recoverStaleSubagents` now also drains undelivered results.** It
72
+ previously only rescued children stuck in `running`; it now also
73
+ re-triggers the parent callback for any parent that has results sitting
74
+ in the store with no active run (e.g. a result persisted just before a
75
+ process restart, whose in-memory callback trigger was lost).
76
+
77
+ ## 0.52.1
78
+
79
+ ### Patch Changes
80
+
81
+ - [`0e8fff1`](https://github.com/cesr/poncho-ai/commit/0e8fff12aed9d5efe1821ed3560ead48a16113c1) Thanks [@cesr](https://github.com/cesr)! - Only send `temperature` to the model when the agent explicitly sets one. The harness previously defaulted to `temperature: 0.2` and always passed it to `streamText`, which returns a 400 ("`temperature` is deprecated for this model") on models that removed sampling params (Fable 5, Opus 4.7+). `temperature` is now omitted from the request when undefined — the same treatment `maxTokens` already had — and `defaultAgentDefinition` no longer hard-codes a `temperature` line into the generated frontmatter (pass `temperature` explicitly to set one).
82
+
3
83
  ## 0.52.0
4
84
 
5
85
  ### Minor Changes
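The retry behavior described in the 0.52.2 entry can be sketched roughly as follows. This is an illustration under stated assumptions, not the harness's actual code: `retryTransient`, `RETRYABLE_CODES`, `BACKOFFS_MS`, and the injectable `sleep` parameter are hypothetical names. Only the three-attempt schedule (0, 50 ms, 200 ms) and the list of retry-eligible error codes are taken from the changelog and the accompanying diff.

```typescript
// Hypothetical sketch of a three-attempt retry for connection-level errors.
// Names are illustrative; only the schedule and the codes come from the diff.
const RETRYABLE_CODES = new Set([
  "CONNECTION_ENDED",
  "CONNECTION_CLOSED",
  "CONNECTION_DESTROYED",
  "CONNECT_TIMEOUT",
  "ECONNRESET",
]);

const BACKOFFS_MS = [0, 50, 200];

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function retryTransient<T>(
  fn: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  let lastErr: unknown;
  for (const delay of BACKOFFS_MS) {
    // The first attempt runs immediately; later attempts wait first.
    if (delay > 0) await sleep(delay);
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      const code = (err as { code?: unknown } | null | undefined)?.code;
      // Anything other than a known connection-level code is rethrown at once,
      // so ordinary SQL errors are never retried.
      if (typeof code !== "string" || !RETRYABLE_CODES.has(code)) throw err;
    }
  }
  // All attempts hit a retryable error: surface the last one to the caller.
  throw lastErr;
}
```

Making `sleep` injectable is a choice made for this sketch so the backoff can be exercised without real delays; the bundled `runWithRetry` in the diff calls `setTimeout` directly.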
package/dist/index.d.ts CHANGED
@@ -737,7 +737,12 @@ interface DefaultAgentDefinitionOptions {
737
737
  modelProvider?: "anthropic" | "openai" | "openai-codex";
738
738
  /** Model name. Default: "claude-opus-4-5". */
739
739
  modelName?: string;
740
- /** Sampling temperature. Default: 0.2. */
740
+ /**
741
+ * Sampling temperature. When unset, it is omitted from the generated
742
+ * frontmatter entirely and the harness sends no temperature (provider
743
+ * default). Newer models (Fable 5, Opus 4.7+) reject `temperature` — leave
744
+ * this unset for them.
745
+ */
741
746
  temperature?: number;
742
747
  /** Max tool-call steps per run. Default: 20. */
743
748
  maxSteps?: number;
@@ -1797,22 +1802,27 @@ declare class PostgresEngine extends SqlStorageEngine {
1797
1802
  private patchVfs;
1798
1803
  private query;
1799
1804
  /**
1800
- * Single retry on a transient connection-layer failure. The
1801
- * `idle_timeout` / `max_lifetime` config above prevents *most*
1802
- * stale-connection cases, but a query can still race a
1803
- * provider-initiated drop in flight; the postgres.js client
1804
- * rejects with `code: "CONNECTION_ENDED"` and the next attempt
1805
- * checks out a fresh connection from the pool. One retry is
1806
- * enough; if it fails again the host-side network is genuinely
1807
- * broken and the caller should see the error.
1805
+ * Retry on transient connection-layer failures. Three attempts
1806
+ * with exponential-ish backoff (0, 50ms, 200ms) — the pool may
1807
+ * have multiple stale sockets accumulated during an idle period
1808
+ * (especially on managed Postgres after boot when no traffic
1809
+ * has flowed for a while), so a single retry can land on a
1810
+ * second stale socket and still fail. Three attempts virtually
1811
+ * always exhausts the staleness wave; if all three throw, the
1812
+ * failure is real and the caller should see it.
1808
1813
  *
1809
- * Only retries reads + the standard exec/run paths in `query`;
1810
- * `sql.unsafe(sql)` calls in `executeRaw` (migration DDL) and
1811
- * `sql.begin(...)` transactions are unwrapped those are
1812
- * idempotent-by-construction (DDL is `IF NOT EXISTS`) or
1813
- * atomically scoped (transactions roll back cleanly), and adding
1814
- * a retry around them would complicate the transaction
1815
- * semantics.
1814
+ * Applied to every pg path the executor exposes:
1815
+ * - `query()` (run/get/all) — natural retry: queries are
1816
+ * idempotent at the connection-failure boundary because the
1817
+ * server-side rollback runs cleanly on socket close.
1818
+ * - `exec(sql)` for DDL — `CREATE TABLE IF NOT EXISTS` and
1819
+ * friends are idempotent by construction.
1820
+ * - `transaction(fn)` — only retried when the
1821
+ * CONNECTION_ENDED reject arrives *before* the transaction
1822
+ * body started executing on the connection; if it errors
1823
+ * mid-transaction, the postgres.js client surfaces a
1824
+ * different error class (the inner SQL error) and bypasses
1825
+ * this retry, preserving the all-or-nothing semantics.
1816
1826
  */
1817
1827
  private runWithRetry;
1818
1828
  private addToPathCache;
@@ -2140,6 +2150,22 @@ declare class AgentOrchestrator {
2140
2150
  processSubagentCallback(conversationId: string, skipLockCheck?: boolean): Promise<void>;
2141
2151
  runSubagentContinuation(conversationId: string, conversation: Conversation, continuationMessages: Message[]): AsyncGenerator<AgentEvent>;
2142
2152
  createSubagentManager(): SubagentManager;
2153
+ /**
2154
+ * Append a subagent result to its parent, retrying once on a transient
2155
+ * store failure before giving up loudly. A silently dropped result is the
2156
+ * worst subagent failure mode — the parent waits forever on a subagent it
2157
+ * thinks is still running — so this never swallows the error the way the
2158
+ * old `.catch(() => {})` call sites did. Returns whether the result landed.
2159
+ */
2160
+ private appendSubagentResultReliable;
2161
+ /**
2162
+ * A subagent's fire-and-forget background run rejected outside its own
2163
+ * error handling (e.g. it threw before entering its try block, or the
2164
+ * catch block itself threw). Without this the parent is left waiting on a
2165
+ * subagent that will never report back. Record the failure on the child
2166
+ * and hand the parent an error result so the turn can resume.
2167
+ */
2168
+ private handleSpawnFailure;
2143
2169
  recoverStaleSubagents(): Promise<void>;
2144
2170
  }
2145
2171
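The doc comments above describe two ideas: retry the result write once before giving up loudly, and persist the result before emitting the completion event. A minimal sketch of both, assuming simplified stand-in types, might look like this. `SubagentResult`, `ResultStore`, `EventSink`, `appendReliable`, and `deliverResult` are hypothetical names for illustration, and the real private methods may differ in detail.

```typescript
// Hypothetical types; the real store and event sink live inside the harness.
interface SubagentResult {
  subagentId: string;
  status: "completed" | "error";
}

interface ResultStore {
  appendSubagentResult(parentId: string, result: SubagentResult): Promise<void>;
}

type EventSink = (
  parentId: string,
  event: { type: string; subagentId: string },
) => Promise<void>;

// Try the write twice; log the failure instead of silently swallowing it.
async function appendReliable(
  store: ResultStore,
  parentId: string,
  result: SubagentResult,
): Promise<boolean> {
  for (let attempt = 1; attempt <= 2; attempt++) {
    try {
      await store.appendSubagentResult(parentId, result);
      return true;
    } catch (err) {
      if (attempt === 2) {
        console.error(`failed to persist result for ${result.subagentId}:`, err);
      }
    }
  }
  return false;
}

// Persist first, then emit, so a consumer reacting to the event can already
// read the stored result.
async function deliverResult(
  store: ResultStore,
  emit: EventSink,
  parentId: string,
  result: SubagentResult,
): Promise<boolean> {
  const landed = await appendReliable(store, parentId, result);
  await emit(parentId, { type: `subagent:${result.status}`, subagentId: result.subagentId });
  return landed;
}
```

Returning a boolean rather than throwing mirrors the "Returns whether the result landed" note in the doc comment, so a caller can decide whether to trigger the parent callback.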
 
package/dist/index.js CHANGED
@@ -588,7 +588,8 @@ var defaultAgentDefinition = (opts = {}) => {
588
588
  const description = opts.description ?? DEFAULT_AGENT_DESCRIPTION;
589
589
  const modelProvider = opts.modelProvider ?? DEFAULT_MODEL_PROVIDER;
590
590
  const modelName = opts.modelName ?? DEFAULT_MODEL_NAME;
591
- const temperature = opts.temperature ?? DEFAULT_TEMPERATURE;
591
+ const temperatureLine = opts.temperature !== void 0 ? `
592
+ temperature: ${opts.temperature}` : "";
592
593
  const maxSteps = opts.maxSteps ?? DEFAULT_MAX_STEPS;
593
594
  const timeout = opts.timeout ?? DEFAULT_TIMEOUT;
594
595
  return `---
@@ -597,8 +598,7 @@ id: ${id}
597
598
  description: ${description}
598
599
  model:
599
600
  provider: ${modelProvider}
600
- name: ${modelName}
601
- temperature: ${temperature}
601
+ name: ${modelName}${temperatureLine}
602
602
  limits:
603
603
  maxSteps: ${maxSteps}
604
604
  timeout: ${timeout}
@@ -4415,12 +4415,12 @@ var PostgresEngine = class extends SqlStorageEngine {
  return rows;
  },
  exec: async (sql) => {
- await this.sql.unsafe(sql);
+ await this.runWithRetry(() => this.sql.unsafe(sql));
  },
  transaction: async (fn) => {
- await this.sql.begin(async () => {
+ await this.runWithRetry(() => this.sql.begin(async () => {
  await fn();
- });
+ }));
  }
  };
  }
@@ -4438,25 +4438,34 @@ var PostgresEngine = class extends SqlStorageEngine {
4438
4438
  prepare: false,
4439
4439
  // Connection-pool resilience. Managed Postgres providers
4440
4440
  // (Railway, Neon, Heroku, etc.) routinely drop idle TCP
4441
- // connections server-side after a few minutes. Without these
4442
- // knobs, porsager/postgres keeps stale sockets in the pool;
4443
- // the next query on one rejects with
4444
- // `write CONNECTION_ENDED <host>:5432` at `durMs=0`, surfacing
4445
- // as a hard failure to the caller. Two complementary settings:
4441
+ // connections server-side after a few minutes, and on
4442
+ // Railway in particular, mid-stream drops within a few
4443
+ // seconds of inactivity are common. Without these knobs,
4444
+ // porsager/postgres keeps stale sockets in the pool; the
4445
+ // next query on one rejects with
4446
+ // `write CONNECTION_ENDED <host>:5432` at `durMs=0`,
4447
+ // surfacing as a hard failure to the caller.
4446
4448
  //
4447
- // - `idle_timeout: 20` closes idle connections client-side
4448
- // after 20s, before any reasonable provider-side timer
4449
- // fires. Fresh connection on next checkout = no stale
4450
- // socket race.
4451
- // - `max_lifetime: 600` (10 min) recycles long-lived
4452
- // connections defensively even if they've stayed busy,
4453
- // which sidesteps a separate class of provider-side
4454
- // "max connection age" limits.
4449
+ // - `idle_timeout: 5` closes idle connections client-side
4450
+ // aggressively. Empirically Railway's pg drops sockets
4451
+ // well before the 20s value that managed-provider docs
4452
+ // suggest; 5s is short enough to win the race in
4453
+ // practice while staying long enough that bursty
4454
+ // workloads still get connection reuse.
4455
+ // - `max_lifetime: 300` (5 min) recycles long-lived
4456
+ // connections defensively. Even with idle_timeout, a
4457
+ // connection that's been actively serving small queries
4458
+ // for an hour can hit provider-side max-age limits.
4459
+ // - `connect_timeout: 10` — slightly less patient on
4460
+ // initial connect than the 30s default. Combined with
4461
+ // the retry below, "connection refused" surfaces faster
4462
+ // during incidents and the caller can shed load instead
4463
+ // of stacking up.
4455
4464
  //
4456
- // Defaults remain `max: 10`, `connect_timeout: 30` — leaving
4457
- // pool size + initial connect behavior unchanged.
4458
- idle_timeout: 20,
4459
- max_lifetime: 60 * 10
4465
+ // Pool size (`max: 10`) unchanged.
4466
+ idle_timeout: 5,
4467
+ max_lifetime: 60 * 5,
4468
+ connect_timeout: 10
4460
4469
  });
4461
4470
  }
4462
4471
  async initialize() {
@@ -4505,33 +4514,47 @@ var PostgresEngine = class extends SqlStorageEngine {
4505
4514
  );
4506
4515
  }
4507
4516
  /**
4508
- * Single retry on a transient connection-layer failure. The
4509
- * `idle_timeout` / `max_lifetime` config above prevents *most*
4510
- * stale-connection cases, but a query can still race a
4511
- * provider-initiated drop in flight; the postgres.js client
4512
- * rejects with `code: "CONNECTION_ENDED"` and the next attempt
4513
- * checks out a fresh connection from the pool. One retry is
4514
- * enough; if it fails again the host-side network is genuinely
4515
- * broken and the caller should see the error.
4517
+ * Retry on transient connection-layer failures. Three attempts
4518
+ * with exponential-ish backoff (0, 50ms, 200ms) — the pool may
4519
+ * have multiple stale sockets accumulated during an idle period
4520
+ * (especially on managed Postgres after boot when no traffic
4521
+ * has flowed for a while), so a single retry can land on a
4522
+ * second stale socket and still fail. Three attempts virtually
4523
+ * always exhausts the staleness wave; if all three throw, the
4524
+ * failure is real and the caller should see it.
4516
4525
  *
4517
- * Only retries reads + the standard exec/run paths in `query`;
4518
- * `sql.unsafe(sql)` calls in `executeRaw` (migration DDL) and
4519
- * `sql.begin(...)` transactions are unwrapped; those are
4520
- * idempotent-by-construction (DDL is `IF NOT EXISTS`) or
4521
- * atomically scoped (transactions roll back cleanly), and adding
4522
- * a retry around them would complicate the transaction
4523
- * semantics.
4526
+ * Applied to every pg path the executor exposes:
4527
+ * - `query()` (run/get/all) — natural retry: queries are
4528
+ * idempotent at the connection-failure boundary because the
4529
+ * server-side rollback runs cleanly on socket close.
4530
+ * - `exec(sql)` for DDL — `CREATE TABLE IF NOT EXISTS` and
4531
+ * friends are idempotent by construction.
4532
+ * - `transaction(fn)` — only retried when the
4533
+ * CONNECTION_ENDED reject arrives *before* the transaction
4534
+ * body started executing on the connection; if it errors
4535
+ * mid-transaction, the postgres.js client surfaces a
4536
+ * different error class (the inner SQL error) and bypasses
4537
+ * this retry, preserving the all-or-nothing semantics.
4524
4538
  */
4525
4539
  async runWithRetry(fn) {
4526
- try {
4527
- return await fn();
4528
- } catch (err) {
4529
- const code = err?.code;
4530
- if (code === "CONNECTION_ENDED" || code === "CONNECTION_CLOSED" || code === "CONNECTION_DESTROYED") {
4540
+ const backoffs = [0, 50, 200];
4541
+ let lastErr;
4542
+ for (let attempt = 0; attempt < backoffs.length; attempt++) {
4543
+ if (backoffs[attempt] > 0) {
4544
+ await new Promise((r) => setTimeout(r, backoffs[attempt]));
4545
+ }
4546
+ try {
4531
4547
  return await fn();
4548
+ } catch (err) {
4549
+ lastErr = err;
4550
+ const code = err?.code;
4551
+ if (code === "CONNECTION_ENDED" || code === "CONNECTION_CLOSED" || code === "CONNECTION_DESTROYED" || code === "CONNECT_TIMEOUT" || code === "ECONNRESET") {
4552
+ continue;
4553
+ }
4554
+ throw err;
4532
4555
  }
4533
- throw err;
4534
4556
  }
4557
+ throw lastErr;
4535
4558
  }
4536
4559
  addToPathCache(tenantId, path) {
4537
4560
  const paths = this.pathCache.get(tenantId);
@@ -10832,7 +10855,7 @@ ${textContent}` };
  cachedCoreMessages = [...cachedCoreMessages, ...newCoreMessages];
  convertedUpTo = messages.length;
  const coreMessages = cachedCoreMessages;
- const temperature = agent.frontmatter.model?.temperature ?? 0.2;
+ const temperature = agent.frontmatter.model?.temperature;
  const maxTokens = agent.frontmatter.model?.maxTokens;
  const cachedMessages = skipTailCache ? coreMessages : addPromptCacheBreakpoints(
  coreMessages,
@@ -10860,7 +10883,7 @@ ${textContent}` };
  ...useStaticCache ? {} : { system: systemPrompt },
  messages: messagesForStep,
  tools: toolsForStep,
- temperature,
+ ...typeof temperature === "number" ? { temperature } : {},
  abortSignal: input.abortSignal,
  ...typeof maxTokens === "number" ? { maxTokens } : {},
  experimental_telemetry: {
@@ -13077,12 +13100,6 @@ var AgentOrchestrator = class {
13077
13100
  };
13078
13101
  await this.conversationStore.update(conv);
13079
13102
  }
13080
- this.hooks?.onStreamEnd?.(childConversationId);
13081
- await this.eventSink(parentConversationId, {
13082
- type: "subagent:completed",
13083
- subagentId: childConversationId,
13084
- conversationId: childConversationId
13085
- });
13086
13103
  let gathered = realResponseText(runResult?.response) || realResponseText(draft.assistantResponse);
13087
13104
  if (!gathered) {
13088
13105
  const freshSubConv = await this.conversationStore.get(childConversationId);
@@ -13104,7 +13121,13 @@ var AgentOrchestrator = class {
13104
13121
  ...abnormal ? { error: { code: runError?.code ?? "SUBAGENT_INCOMPLETE", message: runError?.message ?? "subagent ended without a result" } } : {},
13105
13122
  timestamp: Date.now()
13106
13123
  };
13107
- await this.conversationStore.appendSubagentResult(parentConversationId, pendingResult);
13124
+ await this.appendSubagentResultReliable(parentConversationId, pendingResult);
13125
+ this.hooks?.onStreamEnd?.(childConversationId);
13126
+ await this.eventSink(parentConversationId, {
13127
+ type: "subagent:completed",
13128
+ subagentId: childConversationId,
13129
+ conversationId: childConversationId
13130
+ });
13108
13131
  this.triggerParentCallback(parentConversationId).catch(
13109
13132
  (err) => console.error(`[poncho][subagent] Parent callback failed:`, err instanceof Error ? err.message : err)
13110
13133
  );
@@ -13121,13 +13144,6 @@ var AgentOrchestrator = class {
13121
13144
  conv.updatedAt = Date.now();
13122
13145
  await this.conversationStore.update(conv);
13123
13146
  }
13124
- this.hooks?.onStreamEnd?.(childConversationId);
13125
- await this.eventSink(parentConversationId, {
13126
- type: "subagent:error",
13127
- subagentId: childConversationId,
13128
- conversationId: childConversationId,
13129
- error: errMsg
13130
- });
13131
13147
  const pendingResult = {
13132
13148
  subagentId: childConversationId,
13133
13149
  task,
@@ -13135,7 +13151,13 @@ var AgentOrchestrator = class {
13135
13151
  error: { code: "SUBAGENT_ERROR", message: errMsg },
13136
13152
  timestamp: Date.now()
13137
13153
  };
13138
- await this.conversationStore.appendSubagentResult(parentConversationId, pendingResult).catch(() => {
13154
+ await this.appendSubagentResultReliable(parentConversationId, pendingResult);
13155
+ this.hooks?.onStreamEnd?.(childConversationId);
13156
+ await this.eventSink(parentConversationId, {
13157
+ type: "subagent:error",
13158
+ subagentId: childConversationId,
13159
+ conversationId: childConversationId,
13160
+ error: errMsg
13139
13161
  });
13140
13162
  this.triggerParentCallback(parentConversationId).catch(
13141
13163
  (err2) => console.error(`[poncho][subagent] Parent callback failed:`, err2 instanceof Error ? err2.message : err2)
@@ -13251,12 +13273,12 @@ ${resultBody}`,
13251
13273
  },
13252
13274
  initialContextTokens: conversation.contextTokens ?? 0,
13253
13275
  initialContextWindow: conversation.contextWindow ?? 0,
13254
- onEvent: (event) => {
13276
+ onEvent: async (event) => {
13255
13277
  if (event.type === "run:started") {
13256
13278
  const active = this.activeConversationRuns.get(conversationId);
13257
13279
  if (active) active.runId = event.runId;
13258
13280
  }
13259
- this.eventSink(conversationId, event);
13281
+ await this.eventSink(conversationId, event);
13260
13282
  }
13261
13283
  });
13262
13284
  flushTurnDraft(execution.draft);
@@ -13442,11 +13464,6 @@ ${resultBody}`,
13442
13464
  await this.conversationStore.update(conv);
13443
13465
  }
13444
13466
  this.activeSubagentRuns.delete(conversationId);
13445
- await this.eventSink(parentConversationId, {
13446
- type: "subagent:completed",
13447
- subagentId: conversationId,
13448
- conversationId
13449
- });
13450
13467
  let gathered = realResponseText(runResult?.response) || realResponseText(draft.assistantResponse);
13451
13468
  if (!gathered) {
13452
13469
  const freshSubConv = await this.conversationStore.get(conversationId);
@@ -13464,7 +13481,14 @@ ${resultBody}`,
13464
13481
  ...abnormal ? { error: { code: runError?.code ?? "SUBAGENT_INCOMPLETE", message: runError?.message ?? "subagent ended without a result" } } : {},
13465
13482
  timestamp: Date.now()
13466
13483
  };
13467
- await this.conversationStore.appendSubagentResult(parentConversationId, result);
13484
+ await this.appendSubagentResultReliable(parentConversationId, result);
13485
+ }
13486
+ await this.eventSink(parentConversationId, {
13487
+ type: "subagent:completed",
13488
+ subagentId: conversationId,
13489
+ conversationId
13490
+ });
13491
+ if (parentConv) {
13468
13492
  if (this.isServerless) {
13469
13493
  this.hooks.dispatchBackground("subagent-callback", parentConversationId);
13470
13494
  } else {
@@ -13492,11 +13516,6 @@ ${resultBody}`,
13492
13516
  conv.updatedAt = Date.now();
13493
13517
  await this.conversationStore.update(conv);
13494
13518
  }
13495
- await this.eventSink(conversation.parentConversationId, {
13496
- type: "subagent:completed",
13497
- subagentId: conversationId,
13498
- conversationId
13499
- });
13500
13519
  const parentConv = await this.conversationStore.get(conversation.parentConversationId);
13501
13520
  if (parentConv) {
13502
13521
  const result = {
@@ -13506,12 +13525,20 @@ ${resultBody}`,
13506
13525
  error: { code: "CONTINUATION_ERROR", message: err instanceof Error ? err.message : String(err) },
13507
13526
  timestamp: Date.now()
13508
13527
  };
13509
- await this.conversationStore.appendSubagentResult(conversation.parentConversationId, result);
13528
+ await this.appendSubagentResultReliable(conversation.parentConversationId, result);
13529
+ }
13530
+ await this.eventSink(conversation.parentConversationId, {
13531
+ type: "subagent:completed",
13532
+ subagentId: conversationId,
13533
+ conversationId
13534
+ });
13535
+ if (parentConv) {
13510
13536
  if (this.isServerless) {
13511
13537
  this.hooks.dispatchBackground("subagent-callback", conversation.parentConversationId);
13512
13538
  } else {
13513
- this.processSubagentCallback(conversation.parentConversationId).catch(() => {
13514
- });
13539
+ this.processSubagentCallback(conversation.parentConversationId).catch(
13540
+ (err2) => console.error(`[poncho][subagent] Continuation-error callback failed:`, err2 instanceof Error ? err2.message : err2)
13541
+ );
13515
13542
  }
13516
13543
  }
13517
13544
  }
@@ -13555,7 +13582,7 @@ ${resultBody}`,
13555
13582
  opts.parentConversationId,
13556
13583
  opts.task,
13557
13584
  opts.ownerId
13558
- ).catch((err) => console.error(`[poncho][subagent] Background spawn failed:`, err instanceof Error ? err.message : err));
13585
+ ).catch((err) => this.handleSpawnFailure(conversation.conversationId, opts.parentConversationId, opts.task, err));
13559
13586
  }
13560
13587
  return { subagentId: conversation.conversationId };
13561
13588
  },
@@ -13588,7 +13615,7 @@ ${resultBody}`,
13588
13615
  conversation.parentConversationId,
13589
13616
  message,
13590
13617
  conversation.ownerId
13591
- ).catch((err) => console.error(`[poncho][subagent] Background sendMessage failed:`, err instanceof Error ? err.message : err));
13618
+ ).catch((err) => this.handleSpawnFailure(subagentId, conversation.parentConversationId, message, err));
13592
13619
  }
13593
13620
  return { subagentId };
13594
13621
  },
@@ -13667,6 +13694,67 @@ ${resultBody}`,
13667
13694
  };
13668
13695
  }
13669
13696
  // ── Stale subagent recovery ──
13697
+ /**
13698
+ * Append a subagent result to its parent, retrying once on a transient
13699
+ * store failure before giving up loudly. A silently dropped result is the
13700
+ * worst subagent failure mode — the parent waits forever on a subagent it
13701
+ * thinks is still running — so this never swallows the error the way the
13702
+ * old `.catch(() => {})` call sites did. Returns whether the result landed.
13703
+ */
13704
+ async appendSubagentResultReliable(parentConversationId, result) {
13705
+ try {
13706
+ await this.conversationStore.appendSubagentResult(parentConversationId, result);
13707
+ return true;
13708
+ } catch (firstErr) {
13709
+ try {
13710
+ await this.conversationStore.appendSubagentResult(parentConversationId, result);
13711
+ return true;
13712
+ } catch (secondErr) {
13713
+ console.error(
13714
+ `[poncho][subagent] FAILED to persist result for subagent ${result.subagentId} to parent ${parentConversationId} after 2 attempts \u2014 the parent will not see this result:`,
13715
+ secondErr instanceof Error ? secondErr.message : secondErr,
13716
+ `(first attempt: ${firstErr instanceof Error ? firstErr.message : firstErr})`
13717
+ );
13718
+ return false;
13719
+ }
13720
+ }
13721
+ }
13722
+ /**
13723
+ * A subagent's fire-and-forget background run rejected outside its own
13724
+ * error handling (e.g. it threw before entering its try block, or the
13725
+ * catch block itself threw). Without this the parent is left waiting on a
13726
+ * subagent that will never report back. Record the failure on the child
13727
+ * and hand the parent an error result so the turn can resume.
13728
+ */
13729
+ async handleSpawnFailure(childConversationId, parentConversationId, task, err) {
13730
+ const message = err instanceof Error ? err.message : String(err);
13731
+ console.error(`[poncho][subagent] Background run failed for ${childConversationId}:`, message);
13732
+ try {
13733
+ const conv = await this.conversationStore.get(childConversationId);
13734
+ if (conv?.subagentMeta && conv.subagentMeta.status === "running") {
13735
+ conv.subagentMeta = {
13736
+ ...conv.subagentMeta,
13737
+ status: "error",
13738
+ error: { code: "SUBAGENT_SPAWN_FAILED", message }
13739
+ };
13740
+ conv.updatedAt = Date.now();
13741
+ await this.conversationStore.update(conv);
13742
+ }
13743
+ } catch {
13744
+ }
13745
+ const appended = await this.appendSubagentResultReliable(parentConversationId, {
13746
+ subagentId: childConversationId,
13747
+ task,
13748
+ status: "error",
13749
+ error: { code: "SUBAGENT_SPAWN_FAILED", message },
13750
+ timestamp: Date.now()
13751
+ });
13752
+ if (appended) {
13753
+ this.triggerParentCallback(parentConversationId).catch(
13754
+ (e) => console.error(`[poncho][subagent] Parent callback failed after spawn failure:`, e instanceof Error ? e.message : e)
13755
+ );
13756
+ }
13757
+ }
13670
13758
  async recoverStaleSubagents() {
13671
13759
  const allSummaries = await this.conversationStore.listSummaries();
13672
13760
  const subagentSummaries = allSummaries.filter((s) => s.parentConversationId);
@@ -13692,11 +13780,20 @@ ${resultBody}`,
13692
13780
  error: conv.subagentMeta.error,
13693
13781
  timestamp: Date.now()
13694
13782
  };
13695
- await this.conversationStore.appendSubagentResult(conv.parentConversationId, pendingResult);
13783
+ await this.appendSubagentResultReliable(conv.parentConversationId, pendingResult);
13696
13784
  parentsToCallback.add(conv.parentConversationId);
13697
13785
  }
13698
13786
  }
13699
13787
  }
13788
+ const parentIds = new Set(
13789
+ subagentSummaries.map((s) => s.parentConversationId).filter((id) => !!id)
13790
+ );
13791
+ for (const parentId of parentIds) {
13792
+ if (parentsToCallback.has(parentId)) continue;
13793
+ if (this.activeConversationRuns.has(parentId)) continue;
13794
+ const parent = await this.conversationStore.get(parentId);
13795
+ if (parent?.pendingSubagentResults?.length) parentsToCallback.add(parentId);
13796
+ }
13700
13797
  for (const parentId of parentsToCallback) {
13701
13798
  this.processSubagentCallback(parentId).catch(
13702
13799
  (err) => console.error(`[poncho][subagent] Recovery callback failed for ${parentId}:`, err instanceof Error ? err.message : err)
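The extra pass added to `recoverStaleSubagents` selects parents that have undelivered results and no active run. A simplified sketch of that selection follows, assuming in-memory inputs in place of the store lookups. `SubagentSummary`, `parentsNeedingCallback`, and the `pendingResultCount` callback are hypothetical names introduced only for this example.

```typescript
// Hypothetical, simplified inputs; the real method reads these from the store.
interface SubagentSummary {
  conversationId: string;
  parentConversationId?: string;
}

function parentsNeedingCallback(
  summaries: SubagentSummary[],
  alreadyQueued: ReadonlySet<string>,
  activeRuns: ReadonlySet<string>,
  pendingResultCount: (parentId: string) => number,
): Set<string> {
  // Start from the parents that the stale-child pass already queued.
  const queued = new Set(alreadyQueued);
  for (const summary of summaries) {
    const parentId = summary.parentConversationId;
    if (!parentId) continue; // not a subagent conversation
    if (queued.has(parentId)) continue; // already scheduled for a callback
    if (activeRuns.has(parentId)) continue; // a live run will pick the results up
    if (pendingResultCount(parentId) > 0) queued.add(parentId);
  }
  return queued;
}
```

Skipping parents with an active run follows the `activeConversationRuns.has(parentId)` check in the diff; the assumption here is that a running parent drains its own pending results.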
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@poncho-ai/harness",
- "version": "0.52.0",
+ "version": "0.52.2",
  "description": "Agent execution runtime - conversation loop, tool dispatch, streaming",
  "repository": {
  "type": "git",
@@ -26,7 +26,12 @@ export interface DefaultAgentDefinitionOptions {
26
26
  modelProvider?: "anthropic" | "openai" | "openai-codex";
27
27
  /** Model name. Default: "claude-opus-4-5". */
28
28
  modelName?: string;
29
- /** Sampling temperature. Default: 0.2. */
29
+ /**
30
+ * Sampling temperature. When unset, it is omitted from the generated
31
+ * frontmatter entirely and the harness sends no temperature (provider
32
+ * default). Newer models (Fable 5, Opus 4.7+) reject `temperature` — leave
33
+ * this unset for them.
34
+ */
30
35
  temperature?: number;
31
36
  /** Max tool-call steps per run. Default: 20. */
32
37
  maxSteps?: number;
@@ -55,7 +60,10 @@ export const defaultAgentDefinition = (
55
60
  const description = opts.description ?? DEFAULT_AGENT_DESCRIPTION;
56
61
  const modelProvider = opts.modelProvider ?? DEFAULT_MODEL_PROVIDER;
57
62
  const modelName = opts.modelName ?? DEFAULT_MODEL_NAME;
58
- const temperature = opts.temperature ?? DEFAULT_TEMPERATURE;
63
+ // Opt-in: only emit a `temperature:` line when explicitly provided, so the
64
+ // harness sends no temperature otherwise (newer models reject it).
65
+ const temperatureLine =
66
+ opts.temperature !== undefined ? `\n temperature: ${opts.temperature}` : "";
59
67
  const maxSteps = opts.maxSteps ?? DEFAULT_MAX_STEPS;
60
68
  const timeout = opts.timeout ?? DEFAULT_TIMEOUT;
61
69
 
@@ -65,8 +73,7 @@ id: ${id}
65
73
  description: ${description}
66
74
  model:
67
75
  provider: ${modelProvider}
68
- name: ${modelName}
69
- temperature: ${temperature}
76
+ name: ${modelName}${temperatureLine}
70
77
  limits:
71
78
  maxSteps: ${maxSteps}
72
79
  timeout: ${timeout}
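A minimal sketch of the opt-in temperature line from the hunk above. `renderModelBlock`, `ModelOptions`, and the model name are hypothetical, simplified stand-ins, not the package's real generator. The point it illustrates is that the check is `!== undefined` rather than a truthiness test, so an explicit `0` is still emitted.

```typescript
// Hypothetical, simplified stand-in for the frontmatter generator.
interface ModelOptions {
  provider: string;
  name: string;
  temperature?: number;
}

function renderModelBlock(opts: ModelOptions): string {
  // Emit the temperature line only when a value was explicitly provided.
  const temperatureLine =
    opts.temperature !== undefined ? `\n  temperature: ${opts.temperature}` : "";
  return `model:\n  provider: ${opts.provider}\n  name: ${opts.name}${temperatureLine}`;
}

// "example-model" is a placeholder name used only for this illustration.
const withoutTemp = renderModelBlock({ provider: "anthropic", name: "example-model" });
const withTemp = renderModelBlock({
  provider: "anthropic",
  name: "example-model",
  temperature: 0,
});

console.log(withoutTemp);
console.log(withTemp);
```

With no `temperature` supplied, the rendered block contains no `temperature:` key at all, which is what lets the harness omit the parameter from the request.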
package/src/harness.ts CHANGED
@@ -2907,7 +2907,12 @@ Code is wrapped in an async IIFE — use \`return\` to return a value to the too
2907
2907
  convertedUpTo = messages.length;
2908
2908
  const coreMessages = cachedCoreMessages;
2909
2909
 
2910
- const temperature = agent.frontmatter.model?.temperature ?? 0.2;
2910
+ // Only send temperature when the agent explicitly set one. Newer
2911
+ // models (Fable 5, Opus 4.7+) removed sampling params entirely and
2912
+ // return a 400 ("`temperature` is deprecated for this model") on any
2913
+ // value — forcing a default here broke them. Treated like maxTokens
2914
+ // below: omitted from the request when undefined.
2915
+ const temperature = agent.frontmatter.model?.temperature;
2911
2916
  const maxTokens = agent.frontmatter.model?.maxTokens;
2912
2917
  // Place the tail breakpoint before any untruncated tool-result so
2913
2918
  // we cache only the stable prefix when prior-run tool results are
@@ -2971,7 +2976,7 @@ Code is wrapped in an async IIFE — use \`return\` to return a value to the too
2971
2976
  ...(useStaticCache ? {} : { system: systemPrompt }),
2972
2977
  messages: messagesForStep,
2973
2978
  tools: toolsForStep,
2974
- temperature,
2979
+ ...(typeof temperature === "number" ? { temperature } : {}),
2975
2980
  abortSignal: input.abortSignal,
2976
2981
  ...(typeof maxTokens === "number" ? { maxTokens } : {}),
2977
2982
  experimental_telemetry: {
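The conditional-spread pattern used for `temperature` and `maxTokens` above can be sketched in isolation. `StepRequest` and `buildStepRequest` are hypothetical names for this illustration; the real request carries many more fields. The sketch shows that spreading `{}` leaves the key absent, which differs from sending `temperature: undefined`.

```typescript
// Hypothetical request shape, reduced to the fields relevant here.
interface StepRequest {
  messages: string[];
  temperature?: number;
  maxTokens?: number;
}

function buildStepRequest(
  messages: string[],
  temperature: number | undefined,
  maxTokens: number | undefined,
): StepRequest {
  return {
    messages,
    // Spreading an empty object adds no key, so the field is truly absent.
    ...(typeof temperature === "number" ? { temperature } : {}),
    ...(typeof maxTokens === "number" ? { maxTokens } : {}),
  };
}

const bare = buildStepRequest(["hi"], undefined, undefined);
const tuned = buildStepRequest(["hi"], 0.2, 1024);

console.log("temperature" in bare);
console.log(tuned);
```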
@@ -1012,13 +1012,6 @@ export class AgentOrchestrator {
1012
1012
  await this.conversationStore.update(conv);
1013
1013
  }
1014
1014
 
1015
- this.hooks?.onStreamEnd?.(childConversationId);
1016
- await this.eventSink(parentConversationId, {
1017
- type: "subagent:completed",
1018
- subagentId: childConversationId,
1019
- conversationId: childConversationId,
1020
- });
1021
-
1022
1015
  // Recover the subagent's real output: prefer the run response, then the
1023
1016
  // streamed draft, then walk the transcript — discarding the synthetic
1024
1017
  // "[Error: ...]" placeholder at each step.
@@ -1051,7 +1044,18 @@ export class AgentOrchestrator {
1051
1044
  : {}),
1052
1045
  timestamp: Date.now(),
1053
1046
  };
1054
- await this.conversationStore.appendSubagentResult(parentConversationId, pendingResult);
1047
+ // Persist the result BEFORE emitting subagent:completed: a consumer
1048
+ // reacting to the event (the parent callback, the streaming client)
1049
+ // must find the result already durable in the store, not race its write.
1050
+ await this.appendSubagentResultReliable(parentConversationId, pendingResult);
1051
+
1052
+ this.hooks?.onStreamEnd?.(childConversationId);
1053
+ await this.eventSink(parentConversationId, {
1054
+ type: "subagent:completed",
1055
+ subagentId: childConversationId,
1056
+ conversationId: childConversationId,
1057
+ });
1058
+
1055
1059
  this.triggerParentCallback(parentConversationId).catch(err =>
1056
1060
  console.error(`[poncho][subagent] Parent callback failed:`, err instanceof Error ? err.message : err),
1057
1061
  );
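The persist-before-emit ordering in this hunk can be sketched with an in-memory store. Everything here (`InMemoryStore`, `onCompleted`, `finishSubagent`, the ids) is a hypothetical stand-in rather than the orchestrator's real API. It shows a listener that reads the store as soon as the completion event fires and finds the result already written.

```typescript
// Hypothetical stand-ins for the conversation store and event sink.
type Result = { subagentId: string; status: "completed" | "error" };

class InMemoryStore {
  private pending = new Map<string, Result[]>();

  async append(parentId: string, result: Result): Promise<void> {
    const list = this.pending.get(parentId) ?? [];
    list.push(result);
    this.pending.set(parentId, list);
  }

  read(parentId: string): Result[] {
    return this.pending.get(parentId) ?? [];
  }
}

const store = new InMemoryStore();
const seenByListener: number[] = [];

// A listener that inspects the store the moment the event arrives.
async function onCompleted(parentId: string): Promise<void> {
  seenByListener.push(store.read(parentId).length);
}

async function finishSubagent(parentId: string, result: Result): Promise<void> {
  await store.append(parentId, result); // 1. make the result durable
  await onCompleted(parentId); // 2. only then announce completion
}

const done = finishSubagent("parent-1", { subagentId: "child-1", status: "completed" });
```

Swapping the two awaited calls in `finishSubagent` would let the listener observe an empty result list, which is the race the reordering removes.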
@@ -1070,6 +1074,16 @@ export class AgentOrchestrator {
1070
1074
  await this.conversationStore.update(conv);
1071
1075
  }
1072
1076
 
1077
+ const pendingResult: PendingSubagentResult = {
1078
+ subagentId: childConversationId,
1079
+ task,
1080
+ status: "error",
1081
+ error: { code: "SUBAGENT_ERROR", message: errMsg },
1082
+ timestamp: Date.now(),
1083
+ };
1084
+ // Persist before emitting (see the success path); never swallow.
1085
+ await this.appendSubagentResultReliable(parentConversationId, pendingResult);
1086
+
1073
1087
  this.hooks?.onStreamEnd?.(childConversationId);
1074
1088
  await this.eventSink(parentConversationId, {
1075
1089
  type: "subagent:error",
@@ -1078,14 +1092,6 @@ export class AgentOrchestrator {
1078
1092
  error: errMsg,
1079
1093
  });
1080
1094
 
1081
- const pendingResult: PendingSubagentResult = {
1082
- subagentId: childConversationId,
1083
- task,
1084
- status: "error",
1085
- error: { code: "SUBAGENT_ERROR", message: errMsg },
1086
- timestamp: Date.now(),
1087
- };
1088
- await this.conversationStore.appendSubagentResult(parentConversationId, pendingResult).catch(() => {});
1089
1095
  this.triggerParentCallback(parentConversationId).catch(err2 =>
1090
1096
  console.error(`[poncho][subagent] Parent callback failed:`, err2 instanceof Error ? err2.message : err2),
1091
1097
  );
@@ -1221,12 +1227,15 @@ export class AgentOrchestrator {
1221
1227
  },
1222
1228
  initialContextTokens: conversation.contextTokens ?? 0,
1223
1229
  initialContextWindow: conversation.contextWindow ?? 0,
1224
- onEvent: (event) => {
1230
+ onEvent: async (event) => {
1225
1231
  if (event.type === "run:started") {
1226
1232
  const active = this.activeConversationRuns.get(conversationId);
1227
1233
  if (active) active.runId = event.runId;
1228
1234
  }
1229
- this.eventSink(conversationId, event);
1235
+ // Await so the event is fully sunk before the next step's events,
1236
+ // matching every other eventSink call site (the callback run path
1237
+ // was the lone fire-and-forget exception).
1238
+ await this.eventSink(conversationId, event);
1230
1239
  },
1231
1240
  });
1232
1241
  flushTurnDraft(execution.draft);
@@ -1436,11 +1445,6 @@ export class AgentOrchestrator {
1436
1445
  }
1437
1446
 
1438
1447
  this.activeSubagentRuns.delete(conversationId);
1439
- await this.eventSink(parentConversationId, {
1440
- type: "subagent:completed",
1441
- subagentId: conversationId,
1442
- conversationId,
1443
- });
1444
1448
 
1445
1449
  let gathered = realResponseText(runResult?.response) || realResponseText(draft.assistantResponse);
1446
1450
  if (!gathered) {
@@ -1464,8 +1468,17 @@ export class AgentOrchestrator {
1464
1468
  : {}),
1465
1469
  timestamp: Date.now(),
1466
1470
  };
1467
- await this.conversationStore.appendSubagentResult(parentConversationId, result);
1471
+ // Persist before emitting completion (see runSubagent).
1472
+ await this.appendSubagentResultReliable(parentConversationId, result);
1473
+ }
1468
1474
 
1475
+ await this.eventSink(parentConversationId, {
1476
+ type: "subagent:completed",
1477
+ subagentId: conversationId,
1478
+ conversationId,
1479
+ });
1480
+
1481
+ if (parentConv) {
1469
1482
  if (this.isServerless) {
1470
1483
  this.hooks!.dispatchBackground!("subagent-callback", parentConversationId);
1471
1484
  } else {
@@ -1490,12 +1503,6 @@ export class AgentOrchestrator {
1490
1503
  await this.conversationStore.update(conv);
1491
1504
  }
1492
1505
 
1493
- await this.eventSink(conversation.parentConversationId!, {
1494
- type: "subagent:completed",
1495
- subagentId: conversationId,
1496
- conversationId,
1497
- });
1498
-
1499
1506
  const parentConv = await this.conversationStore.get(conversation.parentConversationId!);
1500
1507
  if (parentConv) {
1501
1508
  const result: PendingSubagentResult = {
@@ -1505,11 +1512,23 @@ export class AgentOrchestrator {
1505
1512
  error: { code: "CONTINUATION_ERROR", message: err instanceof Error ? err.message : String(err) },
1506
1513
  timestamp: Date.now(),
1507
1514
  };
1508
- await this.conversationStore.appendSubagentResult(conversation.parentConversationId!, result);
1515
+ // Persist before emitting; never swallow (was `.catch(() => {})`).
1516
+ await this.appendSubagentResultReliable(conversation.parentConversationId!, result);
1517
+ }
1518
+
1519
+ await this.eventSink(conversation.parentConversationId!, {
1520
+ type: "subagent:completed",
1521
+ subagentId: conversationId,
1522
+ conversationId,
1523
+ });
1524
+
1525
+ if (parentConv) {
1509
1526
  if (this.isServerless) {
1510
1527
  this.hooks!.dispatchBackground!("subagent-callback", conversation.parentConversationId!);
1511
1528
  } else {
1512
- this.processSubagentCallback(conversation.parentConversationId!).catch(() => {});
1529
+ this.processSubagentCallback(conversation.parentConversationId!).catch(err2 =>
1530
+ console.error(`[poncho][subagent] Continuation-error callback failed:`, err2 instanceof Error ? err2.message : err2),
1531
+ );
1513
1532
  }
1514
1533
  }
1515
1534
  }
@@ -1559,7 +1578,7 @@ export class AgentOrchestrator {
1559
1578
  opts.parentConversationId,
1560
1579
  opts.task,
1561
1580
  opts.ownerId,
1562
- ).catch(err => console.error(`[poncho][subagent] Background spawn failed:`, err instanceof Error ? err.message : err));
1581
+ ).catch(err => this.handleSpawnFailure(conversation.conversationId, opts.parentConversationId, opts.task, err));
1563
1582
  }
1564
1583
 
1565
1584
  return { subagentId: conversation.conversationId };
@@ -1596,7 +1615,7 @@ export class AgentOrchestrator {
1596
1615
  conversation.parentConversationId,
1597
1616
  message,
1598
1617
  conversation.ownerId,
1599
- ).catch(err => console.error(`[poncho][subagent] Background sendMessage failed:`, err instanceof Error ? err.message : err));
1618
+ ).catch(err => this.handleSpawnFailure(subagentId, conversation.parentConversationId!, message, err));
1600
1619
  }
1601
1620
 
1602
1621
  return { subagentId };
@@ -1684,6 +1703,79 @@ export class AgentOrchestrator {
1684
1703
 
1685
1704
  // ── Stale subagent recovery ──
1686
1705
 
1706
+ /**
1707
+ * Append a subagent result to its parent, retrying once on a transient
1708
+ * store failure before giving up loudly. A silently dropped result is the
1709
+ * worst subagent failure mode — the parent waits forever on a subagent it
1710
+ * thinks is still running — so this never swallows the error the way the
1711
+ * old `.catch(() => {})` call sites did. Returns whether the result landed.
1712
+ */
1713
+ private async appendSubagentResultReliable(
1714
+ parentConversationId: string,
1715
+ result: PendingSubagentResult,
1716
+ ): Promise<boolean> {
1717
+ try {
1718
+ await this.conversationStore.appendSubagentResult(parentConversationId, result);
1719
+ return true;
1720
+ } catch (firstErr) {
1721
+ try {
1722
+ await this.conversationStore.appendSubagentResult(parentConversationId, result);
1723
+ return true;
1724
+ } catch (secondErr) {
1725
+ console.error(
1726
+ `[poncho][subagent] FAILED to persist result for subagent ${result.subagentId} ` +
1727
+ `to parent ${parentConversationId} after 2 attempts — the parent will not see this result:`,
1728
+ secondErr instanceof Error ? secondErr.message : secondErr,
1729
+ `(first attempt: ${firstErr instanceof Error ? firstErr.message : firstErr})`,
1730
+ );
1731
+ return false;
1732
+ }
1733
+ }
1734
+ }
1735
+
1736
+ /**
1737
+ * A subagent's fire-and-forget background run rejected outside its own
1738
+ * error handling (e.g. it threw before entering its try block, or the
1739
+ * catch block itself threw). Without this the parent is left waiting on a
1740
+ * subagent that will never report back. Record the failure on the child
1741
+ * and hand the parent an error result so the turn can resume.
1742
+ */
1743
+ private async handleSpawnFailure(
1744
+ childConversationId: string,
1745
+ parentConversationId: string,
1746
+ task: string,
1747
+ err: unknown,
1748
+ ): Promise<void> {
1749
+ const message = err instanceof Error ? err.message : String(err);
1750
+ console.error(`[poncho][subagent] Background run failed for ${childConversationId}:`, message);
1751
+ try {
1752
+ const conv = await this.conversationStore.get(childConversationId);
1753
+ if (conv?.subagentMeta && conv.subagentMeta.status === "running") {
1754
+ conv.subagentMeta = {
1755
+ ...conv.subagentMeta,
1756
+ status: "error",
1757
+ error: { code: "SUBAGENT_SPAWN_FAILED", message },
1758
+ };
1759
+ conv.updatedAt = Date.now();
1760
+ await this.conversationStore.update(conv);
1761
+ }
1762
+ } catch {
1763
+ // best-effort: the result append below is what the parent actually needs
1764
+ }
1765
+ const appended = await this.appendSubagentResultReliable(parentConversationId, {
1766
+ subagentId: childConversationId,
1767
+ task,
1768
+ status: "error",
1769
+ error: { code: "SUBAGENT_SPAWN_FAILED", message },
1770
+ timestamp: Date.now(),
1771
+ });
1772
+ if (appended) {
1773
+ this.triggerParentCallback(parentConversationId).catch(e =>
1774
+ console.error(`[poncho][subagent] Parent callback failed after spawn failure:`, e instanceof Error ? e.message : e),
1775
+ );
1776
+ }
1777
+ }
1778
+
1687
1779
  async recoverStaleSubagents(): Promise<void> {
1688
1780
  const allSummaries = await this.conversationStore.listSummaries();
1689
1781
  const subagentSummaries = allSummaries.filter((s) => s.parentConversationId);
@@ -1711,11 +1803,26 @@ export class AgentOrchestrator {
1711
1803
  error: conv.subagentMeta.error,
1712
1804
  timestamp: Date.now(),
1713
1805
  };
1714
- await this.conversationStore.appendSubagentResult(conv.parentConversationId, pendingResult);
1806
+ await this.appendSubagentResultReliable(conv.parentConversationId, pendingResult);
1715
1807
  parentsToCallback.add(conv.parentConversationId);
1716
1808
  }
1717
1809
  }
1718
1810
  }
1811
+
1812
+ // Also drain parents that already have results sitting in the store but
1813
+ // no active run to deliver them — e.g. a result persisted just before a
1814
+ // process restart, whose in-memory callback trigger was lost. Without
1815
+ // this the parent stays stuck even though its result landed durably.
1816
+ const parentIds = new Set(
1817
+ subagentSummaries.map(s => s.parentConversationId).filter((id): id is string => !!id),
1818
+ );
1819
+ for (const parentId of parentIds) {
1820
+ if (parentsToCallback.has(parentId)) continue;
1821
+ if (this.activeConversationRuns.has(parentId)) continue;
1822
+ const parent = await this.conversationStore.get(parentId);
1823
+ if (parent?.pendingSubagentResults?.length) parentsToCallback.add(parentId);
1824
+ }
1825
+
1719
1826
  for (const parentId of parentsToCallback) {
1720
1827
  this.processSubagentCallback(parentId).catch(err =>
1721
1828
  console.error(`[poncho][subagent] Recovery callback failed for ${parentId}:`, err instanceof Error ? err.message : err),
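The drain step added to `recoverStaleSubagents` amounts to a filter over candidate parents. A rough sketch follows, assuming a synchronous `lookup` for brevity (the real store call is async); `Parent` and `parentsNeedingCallback` are hypothetical names.

```typescript
// Hypothetical, simplified shape of a parent conversation.
interface Parent {
  id: string;
  pendingSubagentResults?: unknown[];
}

function parentsNeedingCallback(
  candidateParentIds: string[],
  alreadyQueued: Set<string>,
  activeRuns: Set<string>,
  lookup: (id: string) => Parent | undefined,
): string[] {
  // De-duplicate first: several subagents can share one parent.
  return Array.from(new Set(candidateParentIds)).filter((id) => {
    if (alreadyQueued.has(id)) return false; // recovery already queued it
    if (activeRuns.has(id)) return false; // a live run will deliver the results
    const parent = lookup(id);
    return (parent?.pendingSubagentResults?.length ?? 0) > 0;
  });
}

const parents: Record<string, Parent> = {
  a: { id: "a", pendingSubagentResults: [{}] },
  b: { id: "b", pendingSubagentResults: [] },
  c: { id: "c", pendingSubagentResults: [{}] },
  d: { id: "d", pendingSubagentResults: [{}] },
};

const toDrain = parentsNeedingCallback(
  ["a", "a", "b", "c", "d"],
  new Set(["c"]), // already queued by the stale-subagent pass
  new Set(["d"]), // currently running
  (id) => parents[id],
);

console.log(toDrain);
```

Only a parent that has pending results, is not already queued, and has no active run is selected for a callback.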
@@ -36,12 +36,25 @@ export class PostgresEngine extends SqlStorageEngine {
36
36
  return rows as T[];
37
37
  },
38
38
  exec: async (sql: string): Promise<void> => {
39
- await this.sql.unsafe(sql);
39
+ // DDL is idempotent in our migrations (`CREATE TABLE IF NOT
40
+ // EXISTS`, etc.), so retrying on a stale-socket drop is
41
+ // safe — same idempotency as `query()` reads/writes.
42
+ await this.runWithRetry(() => this.sql.unsafe(sql));
40
43
  },
41
44
  transaction: async (fn: () => Promise<void>): Promise<void> => {
42
- await this.sql.begin(async () => {
45
+ // Transactions are inherently retry-safe at the
46
+ // CONNECTION_ENDED boundary: if the connection dies before
47
+ // BEGIN takes effect server-side, no work was committed and
48
+ // re-running `fn` produces the correct end state. The retry
49
+ // only catches the connection-level reject from the
50
+ // postgres.js client; a partial-commit + drop scenario
51
+ // surfaces as a different error code and bypasses the
52
+ // retry, preserving the caller's expectation that a
53
+ // returned transaction either fully committed or fully
54
+ // rolled back.
55
+ await this.runWithRetry(() => this.sql.begin(async () => {
43
56
  await fn();
44
- });
57
+ }));
45
58
  },
46
59
  };
47
60
  }
@@ -59,25 +72,34 @@ export class PostgresEngine extends SqlStorageEngine {
59
72
  prepare: false,
60
73
  // Connection-pool resilience. Managed Postgres providers
61
74
  // (Railway, Neon, Heroku, etc.) routinely drop idle TCP
62
- // connections server-side after a few minutes. Without these
63
- // knobs, porsager/postgres keeps stale sockets in the pool;
64
- // the next query on one rejects with
65
- // `write CONNECTION_ENDED <host>:5432` at `durMs=0`, surfacing
66
- // as a hard failure to the caller. Two complementary settings:
75
+ // connections server-side after a few minutes, and on
76
+ // Railway in particular, mid-stream drops within a few
77
+ // seconds of inactivity are common. Without these knobs,
78
+ // porsager/postgres keeps stale sockets in the pool; the
79
+ // next query on one rejects with
80
+ // `write CONNECTION_ENDED <host>:5432` at `durMs=0`,
81
+ // surfacing as a hard failure to the caller.
67
82
  //
68
- // - `idle_timeout: 20` closes idle connections client-side
69
- // after 20s, before any reasonable provider-side timer
70
- // fires. Fresh connection on next checkout = no stale
71
- // socket race.
72
- // - `max_lifetime: 600` (10 min) recycles long-lived
73
- // connections defensively even if they've stayed busy,
74
- // which sidesteps a separate class of provider-side
75
- // "max connection age" limits.
83
+ // - `idle_timeout: 5` closes idle connections client-side
84
+ // aggressively. Empirically Railway's pg drops sockets
85
+ // well before the 20s value that managed-provider docs
86
+ // suggest; 5s is short enough to win the race in
87
+ // practice while staying long enough that bursty
88
+ // workloads still get connection reuse.
89
+ // - `max_lifetime: 300` (5 min) recycles long-lived
90
+ // connections defensively. Even with idle_timeout, a
91
+ // connection that's been actively serving small queries
92
+ // for an hour can hit provider-side max-age limits.
93
+ // - `connect_timeout: 10` — slightly less patient on
94
+ // initial connect than the 30s default. Combined with
95
+ // the retry below, "connection refused" surfaces faster
96
+ // during incidents and the caller can shed load instead
97
+ // of stacking up.
76
98
  //
77
- // Defaults remain `max: 10`, `connect_timeout: 30` — leaving
78
- // pool size + initial connect behavior unchanged.
79
- idle_timeout: 20,
80
- max_lifetime: 60 * 10,
99
+ // Pool size (`max: 10`) unchanged.
100
+ idle_timeout: 5,
101
+ max_lifetime: 60 * 5,
102
+ connect_timeout: 10,
81
103
  });
82
104
  }
83
105
 
@@ -147,33 +169,53 @@ export class PostgresEngine extends SqlStorageEngine {
147
169
  }
148
170
 
149
171
  /**
150
- * Single retry on a transient connection-layer failure. The
151
- * `idle_timeout` / `max_lifetime` config above prevents *most*
152
- * stale-connection cases, but a query can still race a
153
- * provider-initiated drop in flight; the postgres.js client
154
- * rejects with `code: "CONNECTION_ENDED"` and the next attempt
155
- * checks out a fresh connection from the pool. One retry is
156
- * enough; if it fails again the host-side network is genuinely
157
- * broken and the caller should see the error.
172
+ * Retry on transient connection-layer failures. Three attempts
173
+ * with exponential-ish backoff (0, 50ms, 200ms) — the pool may
174
+ * have multiple stale sockets accumulated during an idle period
175
+ * (especially on managed Postgres after boot when no traffic
176
+ * has flowed for a while), so a single retry can land on a
177
+ * second stale socket and still fail. Three attempts virtually
178
+ * always exhausts the staleness wave; if all three throw, the
179
+ * failure is real and the caller should see it.
158
180
  *
159
- * Only retries reads + the standard exec/run paths in `query`;
160
- * `sql.unsafe(sql)` calls in `executeRaw` (migration DDL) and
161
- * `sql.begin(...)` transactions are unwrapped those are
162
- * idempotent-by-construction (DDL is `IF NOT EXISTS`) or
163
- * atomically scoped (transactions roll back cleanly), and adding
164
- * a retry around them would complicate the transaction
165
- * semantics.
181
+ * Applied to every pg path the executor exposes:
182
+ * - `query()` (run/get/all) — natural retry: queries are
183
+ * idempotent at the connection-failure boundary because the
184
+ * server-side rollback runs cleanly on socket close.
185
+ * - `exec(sql)` for DDL — `CREATE TABLE IF NOT EXISTS` and
186
+ * friends are idempotent by construction.
187
+ * - `transaction(fn)` — only retried when the
188
+ * CONNECTION_ENDED reject arrives *before* the transaction
189
+ * body started executing on the connection; if it errors
190
+ * mid-transaction, the postgres.js client surfaces a
191
+ * different error class (the inner SQL error) and bypasses
192
+ * this retry, preserving the all-or-nothing semantics.
166
193
  */
167
194
  private async runWithRetry<T>(fn: () => Promise<T>): Promise<T> {
168
- try {
169
- return await fn();
170
- } catch (err) {
171
- const code = (err as { code?: string } | null | undefined)?.code;
172
- if (code === "CONNECTION_ENDED" || code === "CONNECTION_CLOSED" || code === "CONNECTION_DESTROYED") {
195
+ const backoffs = [0, 50, 200];
196
+ let lastErr: unknown;
197
+ for (let attempt = 0; attempt < backoffs.length; attempt++) {
198
+ if (backoffs[attempt] > 0) {
199
+ await new Promise((r) => setTimeout(r, backoffs[attempt]));
200
+ }
201
+ try {
173
202
  return await fn();
203
+ } catch (err) {
204
+ lastErr = err;
205
+ const code = (err as { code?: string } | null | undefined)?.code;
206
+ if (
207
+ code === "CONNECTION_ENDED" ||
208
+ code === "CONNECTION_CLOSED" ||
209
+ code === "CONNECTION_DESTROYED" ||
210
+ code === "CONNECT_TIMEOUT" ||
211
+ code === "ECONNRESET"
212
+ ) {
213
+ continue;
214
+ }
215
+ throw err;
174
216
  }
175
- throw err;
176
217
  }
218
+ throw lastErr;
177
219
  }
178
220
 
179
221
  private addToPathCache(tenantId: string, path: string): void {
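A self-contained sketch of the three-attempt retry shown in `runWithRetry` above, written as a free function so it can run outside the class. `retryTransient`, `RETRYABLE`, and the `flaky` helper are hypothetical names for this illustration; the error codes and backoff values are copied from the diff.

```typescript
// Error codes treated as transient, taken from the diff above.
const RETRYABLE = new Set([
  "CONNECTION_ENDED",
  "CONNECTION_CLOSED",
  "CONNECTION_DESTROYED",
  "CONNECT_TIMEOUT",
  "ECONNRESET",
]);

async function retryTransient<T>(
  fn: () => Promise<T>,
  backoffsMs: number[] = [0, 50, 200],
): Promise<T> {
  let lastErr: unknown;
  for (const delay of backoffsMs) {
    if (delay > 0) {
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      const code = (err as { code?: string } | null | undefined)?.code;
      // Anything that is not a known connection-level failure propagates at once.
      if (code === undefined || !RETRYABLE.has(code)) throw err;
    }
  }
  // Every attempt hit a transient error: surface the last one.
  throw lastErr;
}

// Simulated operation that hits two stale sockets before succeeding.
let calls = 0;
const flaky = async (): Promise<string> => {
  calls += 1;
  if (calls < 3) {
    throw Object.assign(new Error("stale socket"), { code: "CONNECTION_ENDED" });
  }
  return "ok";
};

const outcome = retryTransient(flaky);
outcome.then((value) => console.log(value, calls));
```

A single retry would have failed on the second stale socket here; the third attempt is what lets the call succeed.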