@poncho-ai/harness 0.52.1 → 0.52.2
- package/.turbo/turbo-build.log +5 -5
- package/CHANGELOG.md +74 -0
- package/dist/index.d.ts +36 -15
- package/dist/index.js +172 -75
- package/package.json +1 -1
- package/src/orchestrator/orchestrator.ts +142 -35
- package/src/storage/postgres-engine.ts +83 -41
package/.turbo/turbo-build.log
CHANGED
|
@@ -1,5 +1,5 @@
|
|
|
1
1
|
|
|
2
|
-
> @poncho-ai/harness@0.52.
|
|
2
|
+
> @poncho-ai/harness@0.52.2 build /home/runner/work/poncho-ai/poncho-ai/packages/harness
|
|
3
3
|
> node scripts/embed-docs.js && tsup src/index.ts --format esm --dts
|
|
4
4
|
|
|
5
5
|
[embed-docs] Generated poncho-docs.ts with 4 topics
|
|
@@ -8,9 +8,9 @@
|
|
|
8
8
|
CLI tsup v8.5.1
|
|
9
9
|
CLI Target: es2022
|
|
10
10
|
ESM Build start
|
|
11
|
-
ESM dist/index.js
|
|
11
|
+
ESM dist/index.js 540.60 KB
|
|
12
12
|
ESM dist/isolate-F2PPSUL6.js 53.82 KB
|
|
13
|
-
ESM ⚡️ Build success in
|
|
13
|
+
ESM ⚡️ Build success in 204ms
|
|
14
14
|
DTS Build start
|
|
15
|
-
DTS ⚡️ Build success in
|
|
16
|
-
DTS dist/index.d.ts
|
|
15
|
+
DTS ⚡️ Build success in 7242ms
|
|
16
|
+
DTS dist/index.d.ts 93.60 KB
|
package/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,79 @@
|
|
|
1
1
|
# @poncho-ai/harness
|
|
2
2
|
|
|
3
|
+
## 0.52.2
|
|
4
|
+
|
|
5
|
+
### Patch Changes
|
|
6
|
+
|
|
7
|
+
- [#124](https://github.com/cesr/poncho-ai/pull/124) [`4ae26e0`](https://github.com/cesr/poncho-ai/commit/4ae26e0d8d2788f57411f9c17e10766769514f9b) Thanks [@cesr](https://github.com/cesr)! - harness: postgres retry covers exec/transaction + 3 attempts + tighter idle
|
|
8
|
+
|
|
9
|
+
Follow-up to the previous `idle_timeout`/`max_lifetime`/retry patch.
|
|
10
|
+
Live testing on Railway showed the previous values weren't tight
|
|
11
|
+
enough — `write CONNECTION_ENDED postgres.railway.internal:5432`
|
|
12
|
+
still surfaced both during user-facing chat turns and during
|
|
13
|
+
subagent auto-callback reruns, despite the new config and the
|
|
14
|
+
one-shot retry.
|
|
15
|
+
|
|
16
|
+
Two failure modes the previous version didn't cover:
|
|
17
|
+
1. The retry only wrapped `private query()` (executor.run/get/all),
|
|
18
|
+
but `executor.exec` (`sql.unsafe`) and `executor.transaction`
|
|
19
|
+
(`sql.begin`) called the postgres.js client directly. A pg drop
|
|
20
|
+
inside a transaction or migration write threw straight through.
|
|
21
|
+
2. After an idle period the pool can have multiple stale sockets;
|
|
22
|
+
a single retry can checkout a second stale socket from the pool
|
|
23
|
+
and fail again. One-shot retry exhausted into an error visible
|
|
24
|
+
to the caller.
|
|
25
|
+
|
|
26
|
+
Fixes:
|
|
27
|
+
- All three executor paths (`run/get/all`, `exec`, `transaction`)
|
|
28
|
+
now go through the same `runWithRetry` wrapper. Transactions
|
|
29
|
+
only retry the connection-level `CONNECTION_ENDED` reject from
|
|
30
|
+
the postgres.js client — actual SQL errors mid-transaction
|
|
31
|
+
surface as a different error class and bypass the retry,
|
|
32
|
+
preserving atomic semantics.
|
|
33
|
+
- Three attempts with light exponential backoff (0, 50ms, 200ms).
|
|
34
|
+
Enough to ride out a typical staleness wave; if all three fail
|
|
35
|
+
the network is genuinely broken.
|
|
36
|
+
- `CONNECT_TIMEOUT` and `ECONNRESET` added to the retry-eligible
|
|
37
|
+
error codes.
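The retry behaviour described in these fixes can be illustrated with a small standalone sketch. It is not the package's code: `runWithRetrySketch` and `makeFlakyQuery` are hypothetical names, and the fake query only imitates a pool that hands out stale sockets. The backoff schedule and the retry-eligible codes are the ones named in this release.

```typescript
// Error codes treated as transient connection failures in this release.
const RETRYABLE_CODES = new Set([
  "CONNECTION_ENDED",
  "CONNECTION_CLOSED",
  "CONNECTION_DESTROYED",
  "CONNECT_TIMEOUT",
  "ECONNRESET",
]);

// Delay before each attempt, in milliseconds (three attempts in total).
const BACKOFFS_MS = [0, 50, 200];

// Hypothetical stand-in for the engine's private retry wrapper.
async function runWithRetrySketch<T>(fn: () => Promise<T>): Promise<T> {
  let lastErr: unknown;
  for (const delay of BACKOFFS_MS) {
    if (delay > 0) {
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      const code = (err as { code?: string } | null)?.code;
      // Anything that is not a connection-level failure surfaces immediately.
      if (code === undefined || !RETRYABLE_CODES.has(code)) throw err;
    }
  }
  // All attempts hit a connection failure: surface the last one.
  throw lastErr;
}

// Hypothetical fake: the first `staleSockets` calls fail like a dead socket.
function makeFlakyQuery(staleSockets: number) {
  let calls = 0;
  const query = async (): Promise<string> => {
    calls += 1;
    if (calls <= staleSockets) {
      throw Object.assign(new Error("write CONNECTION_ENDED"), {
        code: "CONNECTION_ENDED",
      });
    }
    return "ok";
  };
  return { query, callCount: () => calls };
}
```

With two stale sockets, a one-shot retry would fail on the second call, while the three-attempt schedule reaches the healthy third call. An error carrying any other code is rethrown on the first attempt, which is what keeps SQL errors out of the retry path.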
|
|
38
|
+
|
|
39
|
+
Config knobs tightened:
|
|
40
|
+
- `idle_timeout: 5` (was 20). Empirically Railway's pg drops
|
|
41
|
+
sockets well before 20s; 5s wins the race in practice while
|
|
42
|
+
staying long enough for bursty workloads to reuse connections.
|
|
43
|
+
- `max_lifetime: 300` (was 600). Same reasoning — recycle more
|
|
44
|
+
aggressively.
|
|
45
|
+
- `connect_timeout: 10` (was 30 default). Faster failure during
|
|
46
|
+
incidents lets callers shed load instead of stacking up.
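As a rough sketch, the tightened knobs correspond to an options object like the one below. The option names are postgres.js connection options, whose timeouts are expressed in seconds; the constant name `poolOptions` and the commented-out call are illustrative assumptions, not the package's actual wiring.

```typescript
// Illustrative options object; values are the ones listed in this release.
const poolOptions = {
  max: 10,              // pool size, unchanged
  prepare: false,       // unchanged
  idle_timeout: 5,      // seconds; was 20
  max_lifetime: 60 * 5, // seconds; was 600
  connect_timeout: 10,  // seconds; was the 30s default
};

// Hypothetical usage, assuming `postgres` is imported from the "postgres"
// package and DATABASE_URL is set in the environment:
// const sql = postgres(process.env.DATABASE_URL!, poolOptions);
```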
|
|
47
|
+
|
|
48
|
+
- [#144](https://github.com/cesr/poncho-ai/pull/144) [`28d640b`](https://github.com/cesr/poncho-ai/commit/28d640b2f82ea780f8e0be90965972d9903c01d7) Thanks [@cesr](https://github.com/cesr)! - orchestrator: make subagent result delivery reliable
|
|
49
|
+
|
|
50
|
+
Subagent results could silently never reach the parent agent. Several
|
|
51
|
+
plumbing bugs in `runSubagent` / `runSubagentContinuation`:
|
|
52
|
+
- **Emit-before-persist race.** `subagent:completed` / `subagent:error`
|
|
53
|
+
were emitted to the parent's event stream _before_ the result was
|
|
54
|
+
written to the store, so a consumer reacting to the event (the parent
|
|
55
|
+
callback, the streaming client) could race the write. Now the result
|
|
56
|
+
is persisted first, then the event is emitted.
|
|
57
|
+
- **Silently swallowed writes.** Two `appendSubagentResult(...).catch(() => {})`
|
|
58
|
+
call sites (the error path and the continuation-error path) dropped the
|
|
59
|
+
result with no trace on a transient store failure. Replaced with a
|
|
60
|
+
shared `appendSubagentResultReliable` helper that retries once and then
|
|
61
|
+
logs loudly — a dropped result is the worst failure mode (the parent
|
|
62
|
+
waits forever on a subagent it thinks is still running).
|
|
63
|
+
- **Un-awaited eventSink.** The subagent-callback run path was the lone
|
|
64
|
+
`this.eventSink(...)` call site that didn't `await` (every other site
|
|
65
|
+
does), so callback-turn events could interleave out of order. Now awaited.
|
|
66
|
+
- **Spawn rejections went to a bare `console.error`.** A background
|
|
67
|
+
`runSubagent` that rejected outside its own try/catch left the parent
|
|
68
|
+
hanging. Both fire-and-forget spawn paths now route to a
|
|
69
|
+
`handleSpawnFailure` that marks the child errored and hands the parent
|
|
70
|
+
an error result so the turn can resume.
|
|
71
|
+
- **`recoverStaleSubagents` now also drains undelivered results.** It
|
|
72
|
+
previously only rescued children stuck in `running`; it now also
|
|
73
|
+
re-triggers the parent callback for any parent that has results sitting
|
|
74
|
+
in the store with no active run (e.g. a result persisted just before a
|
|
75
|
+
process restart, whose in-memory callback trigger was lost).
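The persist-then-emit ordering and the retry-once append can be sketched in isolation as follows. This is a simplified illustration under stated assumptions: `InMemoryStore`, `appendReliable`, and `deliverResult` are hypothetical names, and the store is an in-memory stand-in with injectable failures rather than the real conversation store.

```typescript
type SubagentResult = {
  subagentId: string;
  status: "completed" | "error";
  timestamp: number;
};
type SubagentEvent = { type: "subagent:completed"; subagentId: string };
type Emit = (parentId: string, event: SubagentEvent) => Promise<void>;

// Hypothetical in-memory store that can simulate transient write failures.
class InMemoryStore {
  readonly results = new Map<string, SubagentResult[]>();
  failuresToInject = 0;

  async appendSubagentResult(parentId: string, result: SubagentResult): Promise<void> {
    if (this.failuresToInject > 0) {
      this.failuresToInject -= 1;
      throw new Error("transient store failure");
    }
    const list = this.results.get(parentId) ?? [];
    list.push(result);
    this.results.set(parentId, list);
  }
}

// Try the write twice; log loudly instead of swallowing the error.
async function appendReliable(
  store: InMemoryStore,
  parentId: string,
  result: SubagentResult,
): Promise<boolean> {
  let firstErr: unknown;
  for (let attempt = 1; attempt <= 2; attempt++) {
    try {
      await store.appendSubagentResult(parentId, result);
      return true;
    } catch (err) {
      if (attempt === 1) {
        firstErr = err;
        continue;
      }
      console.error(
        `failed to persist result for ${result.subagentId} after 2 attempts:`,
        err,
        "(first attempt:",
        firstErr,
        ")",
      );
    }
  }
  return false;
}

// Persist first, then emit, so an event consumer finds the result in the store.
async function deliverResult(
  store: InMemoryStore,
  emit: Emit,
  parentId: string,
  result: SubagentResult,
): Promise<boolean> {
  const landed = await appendReliable(store, parentId, result);
  await emit(parentId, { type: "subagent:completed", subagentId: result.subagentId });
  return landed;
}
```

A consumer that reads the store from inside its event handler sees the result already written, even when the first write attempt failed. If both attempts fail, the helper reports `false` so the caller can decide whether to trigger the parent callback.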
|
|
76
|
+
|
|
3
77
|
## 0.52.1
|
|
4
78
|
|
|
5
79
|
### Patch Changes
|
package/dist/index.d.ts
CHANGED
|
@@ -1802,22 +1802,27 @@ declare class PostgresEngine extends SqlStorageEngine {
|
|
|
1802
1802
|
private patchVfs;
|
|
1803
1803
|
private query;
|
|
1804
1804
|
/**
|
|
1805
|
-
*
|
|
1806
|
-
*
|
|
1807
|
-
*
|
|
1808
|
-
*
|
|
1809
|
-
*
|
|
1810
|
-
*
|
|
1811
|
-
*
|
|
1812
|
-
*
|
|
1805
|
+
* Retry on transient connection-layer failures. Three attempts
|
|
1806
|
+
* with exponential-ish backoff (0, 50ms, 200ms) — the pool may
|
|
1807
|
+
* have multiple stale sockets accumulated during an idle period
|
|
1808
|
+
* (especially on managed Postgres after boot when no traffic
|
|
1809
|
+
* has flowed for a while), so a single retry can land on a
|
|
1810
|
+
* second stale socket and still fail. Three attempts virtually
|
|
1811
|
+
* always exhausts the staleness wave; if all three throw, the
|
|
1812
|
+
* failure is real and the caller should see it.
|
|
1813
1813
|
*
|
|
1814
|
-
*
|
|
1815
|
-
* `
|
|
1816
|
-
*
|
|
1817
|
-
*
|
|
1818
|
-
*
|
|
1819
|
-
*
|
|
1820
|
-
*
|
|
1814
|
+
* Applied to every pg path the executor exposes:
|
|
1815
|
+
* - `query()` (run/get/all) — natural retry: queries are
|
|
1816
|
+
* idempotent at the connection-failure boundary because the
|
|
1817
|
+
* server-side rollback runs cleanly on socket close.
|
|
1818
|
+
* - `exec(sql)` for DDL — `CREATE TABLE IF NOT EXISTS` and
|
|
1819
|
+
* friends are idempotent by construction.
|
|
1820
|
+
* - `transaction(fn)` — only retried when the
|
|
1821
|
+
* CONNECTION_ENDED reject arrives *before* the transaction
|
|
1822
|
+
* body started executing on the connection; if it errors
|
|
1823
|
+
* mid-transaction, the postgres.js client surfaces a
|
|
1824
|
+
* different error class (the inner SQL error) and bypasses
|
|
1825
|
+
* this retry, preserving the all-or-nothing semantics.
|
|
1821
1826
|
*/
|
|
1822
1827
|
private runWithRetry;
|
|
1823
1828
|
private addToPathCache;
|
|
@@ -2145,6 +2150,22 @@ declare class AgentOrchestrator {
|
|
|
2145
2150
|
processSubagentCallback(conversationId: string, skipLockCheck?: boolean): Promise<void>;
|
|
2146
2151
|
runSubagentContinuation(conversationId: string, conversation: Conversation, continuationMessages: Message[]): AsyncGenerator<AgentEvent>;
|
|
2147
2152
|
createSubagentManager(): SubagentManager;
|
|
2153
|
+
/**
|
|
2154
|
+
* Append a subagent result to its parent, retrying once on a transient
|
|
2155
|
+
* store failure before giving up loudly. A silently dropped result is the
|
|
2156
|
+
* worst subagent failure mode — the parent waits forever on a subagent it
|
|
2157
|
+
* thinks is still running — so this never swallows the error the way the
|
|
2158
|
+
* old `.catch(() => {})` call sites did. Returns whether the result landed.
|
|
2159
|
+
*/
|
|
2160
|
+
private appendSubagentResultReliable;
|
|
2161
|
+
/**
|
|
2162
|
+
* A subagent's fire-and-forget background run rejected outside its own
|
|
2163
|
+
* error handling (e.g. it threw before entering its try block, or the
|
|
2164
|
+
* catch block itself threw). Without this the parent is left waiting on a
|
|
2165
|
+
* subagent that will never report back. Record the failure on the child
|
|
2166
|
+
* and hand the parent an error result so the turn can resume.
|
|
2167
|
+
*/
|
|
2168
|
+
private handleSpawnFailure;
|
|
2148
2169
|
recoverStaleSubagents(): Promise<void>;
|
|
2149
2170
|
}
|
|
2150
2171
|
|
package/dist/index.js
CHANGED
|
@@ -4415,12 +4415,12 @@ var PostgresEngine = class extends SqlStorageEngine {
|
|
|
4415
4415
|
return rows;
|
|
4416
4416
|
},
|
|
4417
4417
|
exec: async (sql) => {
|
|
4418
|
-
await this.sql.unsafe(sql);
|
|
4418
|
+
await this.runWithRetry(() => this.sql.unsafe(sql));
|
|
4419
4419
|
},
|
|
4420
4420
|
transaction: async (fn) => {
|
|
4421
|
-
await this.sql.begin(async () => {
|
|
4421
|
+
await this.runWithRetry(() => this.sql.begin(async () => {
|
|
4422
4422
|
await fn();
|
|
4423
|
-
});
|
|
4423
|
+
}));
|
|
4424
4424
|
}
|
|
4425
4425
|
};
|
|
4426
4426
|
}
|
|
@@ -4438,25 +4438,34 @@ var PostgresEngine = class extends SqlStorageEngine {
|
|
|
4438
4438
|
prepare: false,
|
|
4439
4439
|
// Connection-pool resilience. Managed Postgres providers
|
|
4440
4440
|
// (Railway, Neon, Heroku, etc.) routinely drop idle TCP
|
|
4441
|
-
// connections server-side after a few minutes
|
|
4442
|
-
//
|
|
4443
|
-
//
|
|
4444
|
-
//
|
|
4445
|
-
//
|
|
4441
|
+
// connections server-side after a few minutes — and on
|
|
4442
|
+
// Railway in particular, mid-stream drops within a few
|
|
4443
|
+
// seconds of inactivity are common. Without these knobs,
|
|
4444
|
+
// porsager/postgres keeps stale sockets in the pool; the
|
|
4445
|
+
// next query on one rejects with
|
|
4446
|
+
// `write CONNECTION_ENDED <host>:5432` at `durMs=0`,
|
|
4447
|
+
// surfacing as a hard failure to the caller.
|
|
4446
4448
|
//
|
|
4447
|
-
// - `idle_timeout:
|
|
4448
|
-
//
|
|
4449
|
-
//
|
|
4450
|
-
//
|
|
4451
|
-
//
|
|
4452
|
-
//
|
|
4453
|
-
//
|
|
4454
|
-
//
|
|
4449
|
+
// - `idle_timeout: 5` closes idle connections client-side
|
|
4450
|
+
// aggressively. Empirically Railway's pg drops sockets
|
|
4451
|
+
// well before the 20s value that managed-provider docs
|
|
4452
|
+
// suggest; 5s is short enough to win the race in
|
|
4453
|
+
// practice while staying long enough that bursty
|
|
4454
|
+
// workloads still get connection reuse.
|
|
4455
|
+
// - `max_lifetime: 300` (5 min) recycles long-lived
|
|
4456
|
+
// connections defensively. Even with idle_timeout, a
|
|
4457
|
+
// connection that's been actively serving small queries
|
|
4458
|
+
// for an hour can hit provider-side max-age limits.
|
|
4459
|
+
// - `connect_timeout: 10` — slightly less patient on
|
|
4460
|
+
// initial connect than the 30s default. Combined with
|
|
4461
|
+
// the retry below, "connection refused" surfaces faster
|
|
4462
|
+
// during incidents and the caller can shed load instead
|
|
4463
|
+
// of stacking up.
|
|
4455
4464
|
//
|
|
4456
|
-
//
|
|
4457
|
-
|
|
4458
|
-
|
|
4459
|
-
|
|
4465
|
+
// Pool size (`max: 10`) unchanged.
|
|
4466
|
+
idle_timeout: 5,
|
|
4467
|
+
max_lifetime: 60 * 5,
|
|
4468
|
+
connect_timeout: 10
|
|
4460
4469
|
});
|
|
4461
4470
|
}
|
|
4462
4471
|
async initialize() {
|
|
@@ -4505,33 +4514,47 @@ var PostgresEngine = class extends SqlStorageEngine {
|
|
|
4505
4514
|
);
|
|
4506
4515
|
}
|
|
4507
4516
|
/**
|
|
4508
|
-
*
|
|
4509
|
-
*
|
|
4510
|
-
*
|
|
4511
|
-
*
|
|
4512
|
-
*
|
|
4513
|
-
*
|
|
4514
|
-
*
|
|
4515
|
-
*
|
|
4517
|
+
* Retry on transient connection-layer failures. Three attempts
|
|
4518
|
+
* with exponential-ish backoff (0, 50ms, 200ms) — the pool may
|
|
4519
|
+
* have multiple stale sockets accumulated during an idle period
|
|
4520
|
+
* (especially on managed Postgres after boot when no traffic
|
|
4521
|
+
* has flowed for a while), so a single retry can land on a
|
|
4522
|
+
* second stale socket and still fail. Three attempts virtually
|
|
4523
|
+
* always exhausts the staleness wave; if all three throw, the
|
|
4524
|
+
* failure is real and the caller should see it.
|
|
4516
4525
|
*
|
|
4517
|
-
*
|
|
4518
|
-
* `
|
|
4519
|
-
*
|
|
4520
|
-
*
|
|
4521
|
-
*
|
|
4522
|
-
*
|
|
4523
|
-
*
|
|
4526
|
+
* Applied to every pg path the executor exposes:
|
|
4527
|
+
* - `query()` (run/get/all) — natural retry: queries are
|
|
4528
|
+
* idempotent at the connection-failure boundary because the
|
|
4529
|
+
* server-side rollback runs cleanly on socket close.
|
|
4530
|
+
* - `exec(sql)` for DDL — `CREATE TABLE IF NOT EXISTS` and
|
|
4531
|
+
* friends are idempotent by construction.
|
|
4532
|
+
* - `transaction(fn)` — only retried when the
|
|
4533
|
+
* CONNECTION_ENDED reject arrives *before* the transaction
|
|
4534
|
+
* body started executing on the connection; if it errors
|
|
4535
|
+
* mid-transaction, the postgres.js client surfaces a
|
|
4536
|
+
* different error class (the inner SQL error) and bypasses
|
|
4537
|
+
* this retry, preserving the all-or-nothing semantics.
|
|
4524
4538
|
*/
|
|
4525
4539
|
async runWithRetry(fn) {
|
|
4526
|
-
|
|
4527
|
-
|
|
4528
|
-
|
|
4529
|
-
|
|
4530
|
-
|
|
4540
|
+
const backoffs = [0, 50, 200];
|
|
4541
|
+
let lastErr;
|
|
4542
|
+
for (let attempt = 0; attempt < backoffs.length; attempt++) {
|
|
4543
|
+
if (backoffs[attempt] > 0) {
|
|
4544
|
+
await new Promise((r) => setTimeout(r, backoffs[attempt]));
|
|
4545
|
+
}
|
|
4546
|
+
try {
|
|
4531
4547
|
return await fn();
|
|
4548
|
+
} catch (err) {
|
|
4549
|
+
lastErr = err;
|
|
4550
|
+
const code = err?.code;
|
|
4551
|
+
if (code === "CONNECTION_ENDED" || code === "CONNECTION_CLOSED" || code === "CONNECTION_DESTROYED" || code === "CONNECT_TIMEOUT" || code === "ECONNRESET") {
|
|
4552
|
+
continue;
|
|
4553
|
+
}
|
|
4554
|
+
throw err;
|
|
4532
4555
|
}
|
|
4533
|
-
throw err;
|
|
4534
4556
|
}
|
|
4557
|
+
throw lastErr;
|
|
4535
4558
|
}
|
|
4536
4559
|
addToPathCache(tenantId, path) {
|
|
4537
4560
|
const paths = this.pathCache.get(tenantId);
|
|
@@ -13077,12 +13100,6 @@ var AgentOrchestrator = class {
|
|
|
13077
13100
|
};
|
|
13078
13101
|
await this.conversationStore.update(conv);
|
|
13079
13102
|
}
|
|
13080
|
-
this.hooks?.onStreamEnd?.(childConversationId);
|
|
13081
|
-
await this.eventSink(parentConversationId, {
|
|
13082
|
-
type: "subagent:completed",
|
|
13083
|
-
subagentId: childConversationId,
|
|
13084
|
-
conversationId: childConversationId
|
|
13085
|
-
});
|
|
13086
13103
|
let gathered = realResponseText(runResult?.response) || realResponseText(draft.assistantResponse);
|
|
13087
13104
|
if (!gathered) {
|
|
13088
13105
|
const freshSubConv = await this.conversationStore.get(childConversationId);
|
|
@@ -13104,7 +13121,13 @@ var AgentOrchestrator = class {
|
|
|
13104
13121
|
...abnormal ? { error: { code: runError?.code ?? "SUBAGENT_INCOMPLETE", message: runError?.message ?? "subagent ended without a result" } } : {},
|
|
13105
13122
|
timestamp: Date.now()
|
|
13106
13123
|
};
|
|
13107
|
-
await this.
|
|
13124
|
+
await this.appendSubagentResultReliable(parentConversationId, pendingResult);
|
|
13125
|
+
this.hooks?.onStreamEnd?.(childConversationId);
|
|
13126
|
+
await this.eventSink(parentConversationId, {
|
|
13127
|
+
type: "subagent:completed",
|
|
13128
|
+
subagentId: childConversationId,
|
|
13129
|
+
conversationId: childConversationId
|
|
13130
|
+
});
|
|
13108
13131
|
this.triggerParentCallback(parentConversationId).catch(
|
|
13109
13132
|
(err) => console.error(`[poncho][subagent] Parent callback failed:`, err instanceof Error ? err.message : err)
|
|
13110
13133
|
);
|
|
@@ -13121,13 +13144,6 @@ var AgentOrchestrator = class {
|
|
|
13121
13144
|
conv.updatedAt = Date.now();
|
|
13122
13145
|
await this.conversationStore.update(conv);
|
|
13123
13146
|
}
|
|
13124
|
-
this.hooks?.onStreamEnd?.(childConversationId);
|
|
13125
|
-
await this.eventSink(parentConversationId, {
|
|
13126
|
-
type: "subagent:error",
|
|
13127
|
-
subagentId: childConversationId,
|
|
13128
|
-
conversationId: childConversationId,
|
|
13129
|
-
error: errMsg
|
|
13130
|
-
});
|
|
13131
13147
|
const pendingResult = {
|
|
13132
13148
|
subagentId: childConversationId,
|
|
13133
13149
|
task,
|
|
@@ -13135,7 +13151,13 @@ var AgentOrchestrator = class {
|
|
|
13135
13151
|
error: { code: "SUBAGENT_ERROR", message: errMsg },
|
|
13136
13152
|
timestamp: Date.now()
|
|
13137
13153
|
};
|
|
13138
|
-
await this.
|
|
13154
|
+
await this.appendSubagentResultReliable(parentConversationId, pendingResult);
|
|
13155
|
+
this.hooks?.onStreamEnd?.(childConversationId);
|
|
13156
|
+
await this.eventSink(parentConversationId, {
|
|
13157
|
+
type: "subagent:error",
|
|
13158
|
+
subagentId: childConversationId,
|
|
13159
|
+
conversationId: childConversationId,
|
|
13160
|
+
error: errMsg
|
|
13139
13161
|
});
|
|
13140
13162
|
this.triggerParentCallback(parentConversationId).catch(
|
|
13141
13163
|
(err2) => console.error(`[poncho][subagent] Parent callback failed:`, err2 instanceof Error ? err2.message : err2)
|
|
@@ -13251,12 +13273,12 @@ ${resultBody}`,
|
|
|
13251
13273
|
},
|
|
13252
13274
|
initialContextTokens: conversation.contextTokens ?? 0,
|
|
13253
13275
|
initialContextWindow: conversation.contextWindow ?? 0,
|
|
13254
|
-
onEvent: (event) => {
|
|
13276
|
+
onEvent: async (event) => {
|
|
13255
13277
|
if (event.type === "run:started") {
|
|
13256
13278
|
const active = this.activeConversationRuns.get(conversationId);
|
|
13257
13279
|
if (active) active.runId = event.runId;
|
|
13258
13280
|
}
|
|
13259
|
-
this.eventSink(conversationId, event);
|
|
13281
|
+
await this.eventSink(conversationId, event);
|
|
13260
13282
|
}
|
|
13261
13283
|
});
|
|
13262
13284
|
flushTurnDraft(execution.draft);
|
|
@@ -13442,11 +13464,6 @@ ${resultBody}`,
|
|
|
13442
13464
|
await this.conversationStore.update(conv);
|
|
13443
13465
|
}
|
|
13444
13466
|
this.activeSubagentRuns.delete(conversationId);
|
|
13445
|
-
await this.eventSink(parentConversationId, {
|
|
13446
|
-
type: "subagent:completed",
|
|
13447
|
-
subagentId: conversationId,
|
|
13448
|
-
conversationId
|
|
13449
|
-
});
|
|
13450
13467
|
let gathered = realResponseText(runResult?.response) || realResponseText(draft.assistantResponse);
|
|
13451
13468
|
if (!gathered) {
|
|
13452
13469
|
const freshSubConv = await this.conversationStore.get(conversationId);
|
|
@@ -13464,7 +13481,14 @@ ${resultBody}`,
|
|
|
13464
13481
|
...abnormal ? { error: { code: runError?.code ?? "SUBAGENT_INCOMPLETE", message: runError?.message ?? "subagent ended without a result" } } : {},
|
|
13465
13482
|
timestamp: Date.now()
|
|
13466
13483
|
};
|
|
13467
|
-
await this.
|
|
13484
|
+
await this.appendSubagentResultReliable(parentConversationId, result);
|
|
13485
|
+
}
|
|
13486
|
+
await this.eventSink(parentConversationId, {
|
|
13487
|
+
type: "subagent:completed",
|
|
13488
|
+
subagentId: conversationId,
|
|
13489
|
+
conversationId
|
|
13490
|
+
});
|
|
13491
|
+
if (parentConv) {
|
|
13468
13492
|
if (this.isServerless) {
|
|
13469
13493
|
this.hooks.dispatchBackground("subagent-callback", parentConversationId);
|
|
13470
13494
|
} else {
|
|
@@ -13492,11 +13516,6 @@ ${resultBody}`,
|
|
|
13492
13516
|
conv.updatedAt = Date.now();
|
|
13493
13517
|
await this.conversationStore.update(conv);
|
|
13494
13518
|
}
|
|
13495
|
-
await this.eventSink(conversation.parentConversationId, {
|
|
13496
|
-
type: "subagent:completed",
|
|
13497
|
-
subagentId: conversationId,
|
|
13498
|
-
conversationId
|
|
13499
|
-
});
|
|
13500
13519
|
const parentConv = await this.conversationStore.get(conversation.parentConversationId);
|
|
13501
13520
|
if (parentConv) {
|
|
13502
13521
|
const result = {
|
|
@@ -13506,12 +13525,20 @@ ${resultBody}`,
|
|
|
13506
13525
|
error: { code: "CONTINUATION_ERROR", message: err instanceof Error ? err.message : String(err) },
|
|
13507
13526
|
timestamp: Date.now()
|
|
13508
13527
|
};
|
|
13509
|
-
await this.
|
|
13528
|
+
await this.appendSubagentResultReliable(conversation.parentConversationId, result);
|
|
13529
|
+
}
|
|
13530
|
+
await this.eventSink(conversation.parentConversationId, {
|
|
13531
|
+
type: "subagent:completed",
|
|
13532
|
+
subagentId: conversationId,
|
|
13533
|
+
conversationId
|
|
13534
|
+
});
|
|
13535
|
+
if (parentConv) {
|
|
13510
13536
|
if (this.isServerless) {
|
|
13511
13537
|
this.hooks.dispatchBackground("subagent-callback", conversation.parentConversationId);
|
|
13512
13538
|
} else {
|
|
13513
|
-
this.processSubagentCallback(conversation.parentConversationId).catch(
|
|
13514
|
-
|
|
13539
|
+
this.processSubagentCallback(conversation.parentConversationId).catch(
|
|
13540
|
+
(err2) => console.error(`[poncho][subagent] Continuation-error callback failed:`, err2 instanceof Error ? err2.message : err2)
|
|
13541
|
+
);
|
|
13515
13542
|
}
|
|
13516
13543
|
}
|
|
13517
13544
|
}
|
|
@@ -13555,7 +13582,7 @@ ${resultBody}`,
|
|
|
13555
13582
|
opts.parentConversationId,
|
|
13556
13583
|
opts.task,
|
|
13557
13584
|
opts.ownerId
|
|
13558
|
-
).catch((err) =>
|
|
13585
|
+
).catch((err) => this.handleSpawnFailure(conversation.conversationId, opts.parentConversationId, opts.task, err));
|
|
13559
13586
|
}
|
|
13560
13587
|
return { subagentId: conversation.conversationId };
|
|
13561
13588
|
},
|
|
@@ -13588,7 +13615,7 @@ ${resultBody}`,
|
|
|
13588
13615
|
conversation.parentConversationId,
|
|
13589
13616
|
message,
|
|
13590
13617
|
conversation.ownerId
|
|
13591
|
-
).catch((err) =>
|
|
13618
|
+
).catch((err) => this.handleSpawnFailure(subagentId, conversation.parentConversationId, message, err));
|
|
13592
13619
|
}
|
|
13593
13620
|
return { subagentId };
|
|
13594
13621
|
},
|
|
@@ -13667,6 +13694,67 @@ ${resultBody}`,
|
|
|
13667
13694
|
};
|
|
13668
13695
|
}
|
|
13669
13696
|
// ── Stale subagent recovery ──
|
|
13697
|
+
/**
|
|
13698
|
+
* Append a subagent result to its parent, retrying once on a transient
|
|
13699
|
+
* store failure before giving up loudly. A silently dropped result is the
|
|
13700
|
+
* worst subagent failure mode — the parent waits forever on a subagent it
|
|
13701
|
+
* thinks is still running — so this never swallows the error the way the
|
|
13702
|
+
* old `.catch(() => {})` call sites did. Returns whether the result landed.
|
|
13703
|
+
*/
|
|
13704
|
+
async appendSubagentResultReliable(parentConversationId, result) {
|
|
13705
|
+
try {
|
|
13706
|
+
await this.conversationStore.appendSubagentResult(parentConversationId, result);
|
|
13707
|
+
return true;
|
|
13708
|
+
} catch (firstErr) {
|
|
13709
|
+
try {
|
|
13710
|
+
await this.conversationStore.appendSubagentResult(parentConversationId, result);
|
|
13711
|
+
return true;
|
|
13712
|
+
} catch (secondErr) {
|
|
13713
|
+
console.error(
|
|
13714
|
+
`[poncho][subagent] FAILED to persist result for subagent ${result.subagentId} to parent ${parentConversationId} after 2 attempts \u2014 the parent will not see this result:`,
|
|
13715
|
+
secondErr instanceof Error ? secondErr.message : secondErr,
|
|
13716
|
+
`(first attempt: ${firstErr instanceof Error ? firstErr.message : firstErr})`
|
|
13717
|
+
);
|
|
13718
|
+
return false;
|
|
13719
|
+
}
|
|
13720
|
+
}
|
|
13721
|
+
}
|
|
13722
|
+
/**
|
|
13723
|
+
* A subagent's fire-and-forget background run rejected outside its own
|
|
13724
|
+
* error handling (e.g. it threw before entering its try block, or the
|
|
13725
|
+
* catch block itself threw). Without this the parent is left waiting on a
|
|
13726
|
+
* subagent that will never report back. Record the failure on the child
|
|
13727
|
+
* and hand the parent an error result so the turn can resume.
|
|
13728
|
+
*/
|
|
13729
|
+
async handleSpawnFailure(childConversationId, parentConversationId, task, err) {
|
|
13730
|
+
const message = err instanceof Error ? err.message : String(err);
|
|
13731
|
+
console.error(`[poncho][subagent] Background run failed for ${childConversationId}:`, message);
|
|
13732
|
+
try {
|
|
13733
|
+
const conv = await this.conversationStore.get(childConversationId);
|
|
13734
|
+
if (conv?.subagentMeta && conv.subagentMeta.status === "running") {
|
|
13735
|
+
conv.subagentMeta = {
|
|
13736
|
+
...conv.subagentMeta,
|
|
13737
|
+
status: "error",
|
|
13738
|
+
error: { code: "SUBAGENT_SPAWN_FAILED", message }
|
|
13739
|
+
};
|
|
13740
|
+
conv.updatedAt = Date.now();
|
|
13741
|
+
await this.conversationStore.update(conv);
|
|
13742
|
+
}
|
|
13743
|
+
} catch {
|
|
13744
|
+
}
|
|
13745
|
+
const appended = await this.appendSubagentResultReliable(parentConversationId, {
|
|
13746
|
+
subagentId: childConversationId,
|
|
13747
|
+
task,
|
|
13748
|
+
status: "error",
|
|
13749
|
+
error: { code: "SUBAGENT_SPAWN_FAILED", message },
|
|
13750
|
+
timestamp: Date.now()
|
|
13751
|
+
});
|
|
13752
|
+
if (appended) {
|
|
13753
|
+
this.triggerParentCallback(parentConversationId).catch(
|
|
13754
|
+
(e) => console.error(`[poncho][subagent] Parent callback failed after spawn failure:`, e instanceof Error ? e.message : e)
|
|
13755
|
+
);
|
|
13756
|
+
}
|
|
13757
|
+
}
|
|
13670
13758
|
async recoverStaleSubagents() {
|
|
13671
13759
|
const allSummaries = await this.conversationStore.listSummaries();
|
|
13672
13760
|
const subagentSummaries = allSummaries.filter((s) => s.parentConversationId);
|
|
@@ -13692,11 +13780,20 @@ ${resultBody}`,
|
|
|
13692
13780
|
error: conv.subagentMeta.error,
|
|
13693
13781
|
timestamp: Date.now()
|
|
13694
13782
|
};
|
|
13695
|
-
await this.
|
|
13783
|
+
await this.appendSubagentResultReliable(conv.parentConversationId, pendingResult);
|
|
13696
13784
|
parentsToCallback.add(conv.parentConversationId);
|
|
13697
13785
|
}
|
|
13698
13786
|
}
|
|
13699
13787
|
}
|
|
13788
|
+
const parentIds = new Set(
|
|
13789
|
+
subagentSummaries.map((s) => s.parentConversationId).filter((id) => !!id)
|
|
13790
|
+
);
|
|
13791
|
+
for (const parentId of parentIds) {
|
|
13792
|
+
if (parentsToCallback.has(parentId)) continue;
|
|
13793
|
+
if (this.activeConversationRuns.has(parentId)) continue;
|
|
13794
|
+
const parent = await this.conversationStore.get(parentId);
|
|
13795
|
+
if (parent?.pendingSubagentResults?.length) parentsToCallback.add(parentId);
|
|
13796
|
+
}
|
|
13700
13797
|
for (const parentId of parentsToCallback) {
|
|
13701
13798
|
this.processSubagentCallback(parentId).catch(
|
|
13702
13799
|
(err) => console.error(`[poncho][subagent] Recovery callback failed for ${parentId}:`, err instanceof Error ? err.message : err)
|
package/package.json
CHANGED
package/src/orchestrator/orchestrator.ts
CHANGED
|
@@ -1012,13 +1012,6 @@ export class AgentOrchestrator {
|
|
|
1012
1012
|
await this.conversationStore.update(conv);
|
|
1013
1013
|
}
|
|
1014
1014
|
|
|
1015
|
-
this.hooks?.onStreamEnd?.(childConversationId);
|
|
1016
|
-
await this.eventSink(parentConversationId, {
|
|
1017
|
-
type: "subagent:completed",
|
|
1018
|
-
subagentId: childConversationId,
|
|
1019
|
-
conversationId: childConversationId,
|
|
1020
|
-
});
|
|
1021
|
-
|
|
1022
1015
|
// Recover the subagent's real output: prefer the run response, then the
|
|
1023
1016
|
// streamed draft, then walk the transcript — discarding the synthetic
|
|
1024
1017
|
// "[Error: ...]" placeholder at each step.
|
|
@@ -1051,7 +1044,18 @@ export class AgentOrchestrator {
|
|
|
1051
1044
|
: {}),
|
|
1052
1045
|
timestamp: Date.now(),
|
|
1053
1046
|
};
|
|
1054
|
-
|
|
1047
|
+
// Persist the result BEFORE emitting subagent:completed: a consumer
|
|
1048
|
+
// reacting to the event (the parent callback, the streaming client)
|
|
1049
|
+
// must find the result already durable in the store, not race its write.
|
|
1050
|
+
await this.appendSubagentResultReliable(parentConversationId, pendingResult);
|
|
1051
|
+
|
|
1052
|
+
this.hooks?.onStreamEnd?.(childConversationId);
|
|
1053
|
+
await this.eventSink(parentConversationId, {
|
|
1054
|
+
type: "subagent:completed",
|
|
1055
|
+
subagentId: childConversationId,
|
|
1056
|
+
conversationId: childConversationId,
|
|
1057
|
+
});
|
|
1058
|
+
|
|
1055
1059
|
this.triggerParentCallback(parentConversationId).catch(err =>
|
|
1056
1060
|
console.error(`[poncho][subagent] Parent callback failed:`, err instanceof Error ? err.message : err),
|
|
1057
1061
|
);
|
|
@@ -1070,6 +1074,16 @@ export class AgentOrchestrator {
|
|
|
1070
1074
|
await this.conversationStore.update(conv);
|
|
1071
1075
|
}
|
|
1072
1076
|
|
|
1077
|
+
const pendingResult: PendingSubagentResult = {
|
|
1078
|
+
subagentId: childConversationId,
|
|
1079
|
+
task,
|
|
1080
|
+
status: "error",
|
|
1081
|
+
error: { code: "SUBAGENT_ERROR", message: errMsg },
|
|
1082
|
+
timestamp: Date.now(),
|
|
1083
|
+
};
|
|
1084
|
+
// Persist before emitting (see the success path); never swallow.
|
|
1085
|
+
await this.appendSubagentResultReliable(parentConversationId, pendingResult);
|
|
1086
|
+
|
|
1073
1087
|
this.hooks?.onStreamEnd?.(childConversationId);
|
|
1074
1088
|
await this.eventSink(parentConversationId, {
|
|
1075
1089
|
type: "subagent:error",
|
|
@@ -1078,14 +1092,6 @@ export class AgentOrchestrator {
|
|
|
1078
1092
|
error: errMsg,
|
|
1079
1093
|
});
|
|
1080
1094
|
|
|
1081
|
-
const pendingResult: PendingSubagentResult = {
|
|
1082
|
-
subagentId: childConversationId,
|
|
1083
|
-
task,
|
|
1084
|
-
status: "error",
|
|
1085
|
-
error: { code: "SUBAGENT_ERROR", message: errMsg },
|
|
1086
|
-
timestamp: Date.now(),
|
|
1087
|
-
};
|
|
1088
|
-
await this.conversationStore.appendSubagentResult(parentConversationId, pendingResult).catch(() => {});
|
|
1089
1095
|
this.triggerParentCallback(parentConversationId).catch(err2 =>
|
|
1090
1096
|
console.error(`[poncho][subagent] Parent callback failed:`, err2 instanceof Error ? err2.message : err2),
|
|
1091
1097
|
);
|
|
@@ -1221,12 +1227,15 @@ export class AgentOrchestrator {
|
|
|
1221
1227
|
},
|
|
1222
1228
|
initialContextTokens: conversation.contextTokens ?? 0,
|
|
1223
1229
|
initialContextWindow: conversation.contextWindow ?? 0,
|
|
1224
|
-
onEvent: (event) => {
|
|
1230
|
+
onEvent: async (event) => {
|
|
1225
1231
|
if (event.type === "run:started") {
|
|
1226
1232
|
const active = this.activeConversationRuns.get(conversationId);
|
|
1227
1233
|
if (active) active.runId = event.runId;
|
|
1228
1234
|
}
|
|
1229
|
-
|
|
1235
|
+
// Await so the event is fully sunk before the next step's events,
|
|
1236
|
+
// matching every other eventSink call site (the callback run path
|
|
1237
|
+
// was the lone fire-and-forget exception).
|
|
1238
|
+
await this.eventSink(conversationId, event);
|
|
1230
1239
|
},
|
|
1231
1240
|
});
|
|
1232
1241
|
flushTurnDraft(execution.draft);
|
|
@@ -1436,11 +1445,6 @@ export class AgentOrchestrator {
|
|
|
1436
1445
|
}
|
|
1437
1446
|
|
|
1438
1447
|
this.activeSubagentRuns.delete(conversationId);
|
|
1439
|
-
await this.eventSink(parentConversationId, {
|
|
1440
|
-
type: "subagent:completed",
|
|
1441
|
-
subagentId: conversationId,
|
|
1442
|
-
conversationId,
|
|
1443
|
-
});
|
|
1444
1448
|
|
|
1445
1449
|
let gathered = realResponseText(runResult?.response) || realResponseText(draft.assistantResponse);
|
|
1446
1450
|
if (!gathered) {
|
|
@@ -1464,8 +1468,17 @@ export class AgentOrchestrator {
|
|
|
1464
1468
|
: {}),
|
|
1465
1469
|
timestamp: Date.now(),
|
|
1466
1470
|
};
|
|
1467
|
-
|
|
1471
|
+
// Persist before emitting completion (see runSubagent).
|
|
1472
|
+
await this.appendSubagentResultReliable(parentConversationId, result);
|
|
1473
|
+
}
|
|
1468
1474
|
|
|
1475
|
+
await this.eventSink(parentConversationId, {
|
|
1476
|
+
type: "subagent:completed",
|
|
1477
|
+
subagentId: conversationId,
|
|
1478
|
+
conversationId,
|
|
1479
|
+
});
|
|
1480
|
+
|
|
1481
|
+
if (parentConv) {
|
|
1469
1482
|
if (this.isServerless) {
|
|
1470
1483
|
this.hooks!.dispatchBackground!("subagent-callback", parentConversationId);
|
|
1471
1484
|
} else {
|
|
@@ -1490,12 +1503,6 @@ export class AgentOrchestrator {
|
|
|
1490
1503
|
await this.conversationStore.update(conv);
|
|
1491
1504
|
}
|
|
1492
1505
|
|
|
1493
|
-
await this.eventSink(conversation.parentConversationId!, {
|
|
1494
|
-
type: "subagent:completed",
|
|
1495
|
-
subagentId: conversationId,
|
|
1496
|
-
conversationId,
|
|
1497
|
-
});
|
|
1498
|
-
|
|
1499
1506
|
const parentConv = await this.conversationStore.get(conversation.parentConversationId!);
|
|
1500
1507
|
if (parentConv) {
|
|
1501
1508
|
const result: PendingSubagentResult = {
|
|
@@ -1505,11 +1512,23 @@ export class AgentOrchestrator {
|
|
|
1505
1512
|
error: { code: "CONTINUATION_ERROR", message: err instanceof Error ? err.message : String(err) },
|
|
1506
1513
|
timestamp: Date.now(),
|
|
1507
1514
|
};
|
|
1508
|
-
|
|
1515
|
+
// Persist before emitting; never swallow (was `.catch(() => {})`).
|
|
1516
|
+
await this.appendSubagentResultReliable(conversation.parentConversationId!, result);
|
|
1517
|
+
}
|
|
1518
|
+
|
|
1519
|
+
await this.eventSink(conversation.parentConversationId!, {
|
|
1520
|
+
type: "subagent:completed",
|
|
1521
|
+
subagentId: conversationId,
|
|
1522
|
+
conversationId,
|
|
1523
|
+
});
|
|
1524
|
+
|
|
1525
|
+
if (parentConv) {
|
|
1509
1526
|
if (this.isServerless) {
|
|
1510
1527
|
this.hooks!.dispatchBackground!("subagent-callback", conversation.parentConversationId!);
|
|
1511
1528
|
} else {
|
|
1512
|
-
this.processSubagentCallback(conversation.parentConversationId!).catch(
|
|
1529
|
+
this.processSubagentCallback(conversation.parentConversationId!).catch(err2 =>
|
|
1530
|
+
console.error(`[poncho][subagent] Continuation-error callback failed:`, err2 instanceof Error ? err2.message : err2),
|
|
1531
|
+
);
|
|
1513
1532
|
}
|
|
1514
1533
|
}
|
|
1515
1534
|
}
|
|
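The hunks above move the `subagent:completed` emission so that it runs after the subagent result has been appended to the parent. A minimal sketch of that ordering follows. It is an illustration, not the orchestrator's code: `ResultStore`, `EventSink`, `finishSubagent` and the simplified result and event shapes are hypothetical names. Unlike `appendSubagentResultReliable` in the diff, which logs and returns `false` on failure, this sketch simply lets a store error propagate.

```typescript
// Hypothetical, simplified shapes; the real orchestrator types are richer.
type SubagentResult = {
  subagentId: string;
  status: "completed" | "error";
  timestamp: number;
};
type SubagentEvent = { type: "subagent:completed"; subagentId: string };

interface ResultStore {
  appendSubagentResult(parentId: string, result: SubagentResult): Promise<void>;
}
type EventSink = (parentId: string, event: SubagentEvent) => Promise<void>;

// Persist first, then emit: by the time a listener reacts to the completion
// event, the result has already been written and can be read from the store.
async function finishSubagent(
  store: ResultStore,
  sink: EventSink,
  parentId: string,
  result: SubagentResult,
): Promise<void> {
  await store.appendSubagentResult(parentId, result);
  await sink(parentId, { type: "subagent:completed", subagentId: result.subagentId });
}
```

With the previous ordering, a consumer reacting to the event could read the parent before the result had been written. Awaiting the append first closes that window.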
@@ -1559,7 +1578,7 @@ export class AgentOrchestrator {
|
|
|
1559
1578
|
opts.parentConversationId,
|
|
1560
1579
|
opts.task,
|
|
1561
1580
|
opts.ownerId,
|
|
1562
|
-
).catch(err =>
|
|
1581
|
+
).catch(err => this.handleSpawnFailure(conversation.conversationId, opts.parentConversationId, opts.task, err));
|
|
1563
1582
|
}
|
|
1564
1583
|
|
|
1565
1584
|
return { subagentId: conversation.conversationId };
|
|
@@ -1596,7 +1615,7 @@ export class AgentOrchestrator {
|
|
|
1596
1615
|
conversation.parentConversationId,
|
|
1597
1616
|
message,
|
|
1598
1617
|
conversation.ownerId,
|
|
1599
|
-
).catch(err =>
|
|
1618
|
+
).catch(err => this.handleSpawnFailure(subagentId, conversation.parentConversationId!, message, err));
|
|
1600
1619
|
}
|
|
1601
1620
|
|
|
1602
1621
|
return { subagentId };
|
|
@@ -1684,6 +1703,79 @@ export class AgentOrchestrator {
|
|
|
1684
1703
|
|
|
1685
1704
|
// ── Stale subagent recovery ──
|
|
1686
1705
|
|
|
1706
|
+
/**
|
|
1707
|
+
* Append a subagent result to its parent, retrying once on a transient
|
|
1708
|
+
* store failure before giving up loudly. A silently dropped result is the
|
|
1709
|
+
* worst subagent failure mode — the parent waits forever on a subagent it
|
|
1710
|
+
* thinks is still running — so this never swallows the error the way the
|
|
1711
|
+
* old `.catch(() => {})` call sites did. Returns whether the result landed.
|
|
1712
|
+
*/
|
|
1713
|
+
private async appendSubagentResultReliable(
|
|
1714
|
+
parentConversationId: string,
|
|
1715
|
+
result: PendingSubagentResult,
|
|
1716
|
+
): Promise<boolean> {
|
|
1717
|
+
try {
|
|
1718
|
+
await this.conversationStore.appendSubagentResult(parentConversationId, result);
|
|
1719
|
+
return true;
|
|
1720
|
+
} catch (firstErr) {
|
|
1721
|
+
try {
|
|
1722
|
+
await this.conversationStore.appendSubagentResult(parentConversationId, result);
|
|
1723
|
+
return true;
|
|
1724
|
+
} catch (secondErr) {
|
|
1725
|
+
console.error(
|
|
1726
|
+
`[poncho][subagent] FAILED to persist result for subagent ${result.subagentId} ` +
|
|
1727
|
+
`to parent ${parentConversationId} after 2 attempts — the parent will not see this result:`,
|
|
1728
|
+
secondErr instanceof Error ? secondErr.message : secondErr,
|
|
1729
|
+
`(first attempt: ${firstErr instanceof Error ? firstErr.message : firstErr})`,
|
|
1730
|
+
);
|
|
1731
|
+
return false;
|
|
1732
|
+
}
|
|
1733
|
+
}
|
|
1734
|
+
}
|
|
1735
|
+
|
|
1736
|
+
/**
|
|
1737
|
+
* A subagent's fire-and-forget background run rejected outside its own
|
|
1738
|
+
* error handling (e.g. it threw before entering its try block, or the
|
|
1739
|
+
* catch block itself threw). Without this the parent is left waiting on a
|
|
1740
|
+
* subagent that will never report back. Record the failure on the child
|
|
1741
|
+
* and hand the parent an error result so the turn can resume.
|
|
1742
|
+
*/
|
|
1743
|
+
private async handleSpawnFailure(
|
|
1744
|
+
childConversationId: string,
|
|
1745
|
+
parentConversationId: string,
|
|
1746
|
+
task: string,
|
|
1747
|
+
err: unknown,
|
|
1748
|
+
): Promise<void> {
|
|
1749
|
+
const message = err instanceof Error ? err.message : String(err);
|
|
1750
|
+
console.error(`[poncho][subagent] Background run failed for ${childConversationId}:`, message);
|
|
1751
|
+
try {
|
|
1752
|
+
const conv = await this.conversationStore.get(childConversationId);
|
|
1753
|
+
if (conv?.subagentMeta && conv.subagentMeta.status === "running") {
|
|
1754
|
+
conv.subagentMeta = {
|
|
1755
|
+
...conv.subagentMeta,
|
|
1756
|
+
status: "error",
|
|
1757
|
+
error: { code: "SUBAGENT_SPAWN_FAILED", message },
|
|
1758
|
+
};
|
|
1759
|
+
conv.updatedAt = Date.now();
|
|
1760
|
+
await this.conversationStore.update(conv);
|
|
1761
|
+
}
|
|
1762
|
+
} catch {
|
|
1763
|
+
// best-effort: the result append below is what the parent actually needs
|
|
1764
|
+
}
|
|
1765
|
+
const appended = await this.appendSubagentResultReliable(parentConversationId, {
|
|
1766
|
+
subagentId: childConversationId,
|
|
1767
|
+
task,
|
|
1768
|
+
status: "error",
|
|
1769
|
+
error: { code: "SUBAGENT_SPAWN_FAILED", message },
|
|
1770
|
+
timestamp: Date.now(),
|
|
1771
|
+
});
|
|
1772
|
+
if (appended) {
|
|
1773
|
+
this.triggerParentCallback(parentConversationId).catch(e =>
|
|
1774
|
+
console.error(`[poncho][subagent] Parent callback failed after spawn failure:`, e instanceof Error ? e.message : e),
|
|
1775
|
+
);
|
|
1776
|
+
}
|
|
1777
|
+
}
|
|
1778
|
+
|
|
1687
1779
|
async recoverStaleSubagents(): Promise<void> {
|
|
1688
1780
|
const allSummaries = await this.conversationStore.listSummaries();
|
|
1689
1781
|
const subagentSummaries = allSummaries.filter((s) => s.parentConversationId);
|
|
@@ -1711,11 +1803,26 @@ export class AgentOrchestrator {
|
|
|
1711
1803
|
error: conv.subagentMeta.error,
|
|
1712
1804
|
timestamp: Date.now(),
|
|
1713
1805
|
};
|
|
1714
|
-
await this.
|
|
1806
|
+
await this.appendSubagentResultReliable(conv.parentConversationId, pendingResult);
|
|
1715
1807
|
parentsToCallback.add(conv.parentConversationId);
|
|
1716
1808
|
}
|
|
1717
1809
|
}
|
|
1718
1810
|
}
|
|
1811
|
+
|
|
1812
|
+
// Also drain parents that already have results sitting in the store but
|
|
1813
|
+
// no active run to deliver them — e.g. a result persisted just before a
|
|
1814
|
+
// process restart, whose in-memory callback trigger was lost. Without
|
|
1815
|
+
// this the parent stays stuck even though its result landed durably.
|
|
1816
|
+
const parentIds = new Set(
|
|
1817
|
+
subagentSummaries.map(s => s.parentConversationId).filter((id): id is string => !!id),
|
|
1818
|
+
);
|
|
1819
|
+
for (const parentId of parentIds) {
|
|
1820
|
+
if (parentsToCallback.has(parentId)) continue;
|
|
1821
|
+
if (this.activeConversationRuns.has(parentId)) continue;
|
|
1822
|
+
const parent = await this.conversationStore.get(parentId);
|
|
1823
|
+
if (parent?.pendingSubagentResults?.length) parentsToCallback.add(parentId);
|
|
1824
|
+
}
|
|
1825
|
+
|
|
1719
1826
|
for (const parentId of parentsToCallback) {
|
|
1720
1827
|
this.processSubagentCallback(parentId).catch(err =>
|
|
1721
1828
|
console.error(`[poncho][subagent] Recovery callback failed for ${parentId}:`, err instanceof Error ? err.message : err),
|
|
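The extra drain pass added to `recoverStaleSubagents` can be read as a pure selection step: which parents should get a callback? The sketch below restates that selection under simplifying assumptions. `selectParentsToDrain`, `SubagentSummary` and the `pendingCounts` map are hypothetical stand-ins for the conversation store and the in-memory run map; the real method reads each parent from the store instead.

```typescript
// Hypothetical, simplified input standing in for a conversation summary.
interface SubagentSummary {
  parentConversationId?: string;
}

function selectParentsToDrain(
  summaries: SubagentSummary[],
  alreadyQueued: ReadonlySet<string>,
  activeRuns: ReadonlySet<string>,
  pendingCounts: ReadonlyMap<string, number>,
): Set<string> {
  const toCallback = new Set(alreadyQueued);
  const parentIds = new Set(
    summaries.map((s) => s.parentConversationId).filter((id): id is string => !!id),
  );
  for (const parentId of parentIds) {
    if (toCallback.has(parentId)) continue; // already scheduled by stale-subagent recovery
    if (activeRuns.has(parentId)) continue; // this parent has an active run, so skip it
    if ((pendingCounts.get(parentId) ?? 0) > 0) toCallback.add(parentId);
  }
  return toCallback;
}
```

For example, with parents `a` (already queued), `b` (active run, two pending results) and `c` (idle, one pending result), the function selects `a` and `c`. `b` is skipped because it has an active run.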
@@ -36,12 +36,25 @@ export class PostgresEngine extends SqlStorageEngine {
|
|
|
36
36
|
return rows as T[];
|
|
37
37
|
},
|
|
38
38
|
exec: async (sql: string): Promise<void> => {
|
|
39
|
-
|
|
39
|
+
// DDL is idempotent in our migrations (`CREATE TABLE IF NOT
|
|
40
|
+
// EXISTS`, etc.), so retrying on a stale-socket drop is
|
|
41
|
+
// safe — same idempotency as `query()` reads/writes.
|
|
42
|
+
await this.runWithRetry(() => this.sql.unsafe(sql));
|
|
40
43
|
},
|
|
41
44
|
transaction: async (fn: () => Promise<void>): Promise<void> => {
|
|
42
|
-
|
|
45
|
+
// Transactions are inherently retry-safe at the
|
|
46
|
+
// CONNECTION_ENDED boundary: if the connection dies before
|
|
47
|
+
// BEGIN takes effect server-side, no work was committed and
|
|
48
|
+
// re-running `fn` produces the correct end state. The retry
|
|
49
|
+
// only catches the connection-level reject from the
|
|
50
|
+
// postgres.js client; a partial-commit + drop scenario
|
|
51
|
+
// surfaces as a different error code and bypasses the
|
|
52
|
+
// retry, preserving the caller's expectation that a
|
|
53
|
+
// returned transaction either fully committed or fully
|
|
54
|
+
// rolled back.
|
|
55
|
+
await this.runWithRetry(() => this.sql.begin(async () => {
|
|
43
56
|
await fn();
|
|
44
|
-
});
|
|
57
|
+
}));
|
|
45
58
|
},
|
|
46
59
|
};
|
|
47
60
|
}
|
|
@@ -59,25 +72,34 @@ export class PostgresEngine extends SqlStorageEngine {
|
|
|
59
72
|
prepare: false,
|
|
60
73
|
// Connection-pool resilience. Managed Postgres providers
|
|
61
74
|
// (Railway, Neon, Heroku, etc.) routinely drop idle TCP
|
|
62
|
-
// connections server-side after a few minutes
|
|
63
|
-
//
|
|
64
|
-
//
|
|
65
|
-
//
|
|
66
|
-
//
|
|
75
|
+
// connections server-side after a few minutes — and on
|
|
76
|
+
// Railway in particular, mid-stream drops within a few
|
|
77
|
+
// seconds of inactivity are common. Without these knobs,
|
|
78
|
+
// porsager/postgres keeps stale sockets in the pool; the
|
|
79
|
+
// next query on one rejects with
|
|
80
|
+
// `write CONNECTION_ENDED <host>:5432` at `durMs=0`,
|
|
81
|
+
// surfacing as a hard failure to the caller.
|
|
67
82
|
//
|
|
68
|
-
// - `idle_timeout:
|
|
69
|
-
//
|
|
70
|
-
//
|
|
71
|
-
//
|
|
72
|
-
//
|
|
73
|
-
//
|
|
74
|
-
//
|
|
75
|
-
//
|
|
83
|
+
// - `idle_timeout: 5` closes idle connections client-side
|
|
84
|
+
// aggressively. Empirically Railway's pg drops sockets
|
|
85
|
+
// well before the 20s value that managed-provider docs
|
|
86
|
+
// suggest; 5s is short enough to win the race in
|
|
87
|
+
// practice while staying long enough that bursty
|
|
88
|
+
// workloads still get connection reuse.
|
|
89
|
+
// - `max_lifetime: 300` (5 min) recycles long-lived
|
|
90
|
+
// connections defensively. Even with idle_timeout, a
|
|
91
|
+
// connection that's been actively serving small queries
|
|
92
|
+
// for an hour can hit provider-side max-age limits.
|
|
93
|
+
// - `connect_timeout: 10` — slightly less patient on
|
|
94
|
+
// initial connect than the 30s default. Combined with
|
|
95
|
+
// the retry below, "connection refused" surfaces faster
|
|
96
|
+
// during incidents and the caller can shed load instead
|
|
97
|
+
// of stacking up.
|
|
76
98
|
//
|
|
77
|
-
//
|
|
78
|
-
|
|
79
|
-
|
|
80
|
-
|
|
99
|
+
// Pool size (`max: 10`) unchanged.
|
|
100
|
+
idle_timeout: 5,
|
|
101
|
+
max_lifetime: 60 * 5,
|
|
102
|
+
connect_timeout: 10,
|
|
81
103
|
});
|
|
82
104
|
}
|
|
83
105
|
|
|
@@ -147,33 +169,53 @@ export class PostgresEngine extends SqlStorageEngine {
|
|
|
147
169
|
}
|
|
148
170
|
|
|
149
171
|
/**
|
|
150
|
-
*
|
|
151
|
-
*
|
|
152
|
-
*
|
|
153
|
-
*
|
|
154
|
-
*
|
|
155
|
-
*
|
|
156
|
-
*
|
|
157
|
-
*
|
|
172
|
+
* Retry on transient connection-layer failures. Three attempts
|
|
173
|
+
* with exponential-ish backoff (0, 50ms, 200ms) — the pool may
|
|
174
|
+
* have multiple stale sockets accumulated during an idle period
|
|
175
|
+
* (especially on managed Postgres after boot when no traffic
|
|
176
|
+
* has flowed for a while), so a single retry can land on a
|
|
177
|
+
* second stale socket and still fail. Three attempts virtually
|
|
178
|
+
* always exhaust the staleness wave; if all three throw, the
|
|
179
|
+
* failure is real and the caller should see it.
|
|
158
180
|
*
|
|
159
|
-
*
|
|
160
|
-
* `
|
|
161
|
-
*
|
|
162
|
-
*
|
|
163
|
-
*
|
|
164
|
-
*
|
|
165
|
-
*
|
|
181
|
+
* Applied to every pg path the executor exposes:
|
|
182
|
+
* - `query()` (run/get/all) — natural retry: queries are
|
|
183
|
+
* idempotent at the connection-failure boundary because the
|
|
184
|
+
* server-side rollback runs cleanly on socket close.
|
|
185
|
+
* - `exec(sql)` for DDL — `CREATE TABLE IF NOT EXISTS` and
|
|
186
|
+
* friends are idempotent by construction.
|
|
187
|
+
* - `transaction(fn)` — only retried when the
|
|
188
|
+
* CONNECTION_ENDED reject arrives *before* the transaction
|
|
189
|
+
* body started executing on the connection; if it errors
|
|
190
|
+
* mid-transaction, the postgres.js client surfaces a
|
|
191
|
+
* different error class (the inner SQL error) and bypasses
|
|
192
|
+
* this retry, preserving the all-or-nothing semantics.
|
|
166
193
|
*/
|
|
167
194
|
private async runWithRetry<T>(fn: () => Promise<T>): Promise<T> {
|
|
168
|
-
|
|
169
|
-
|
|
170
|
-
|
|
171
|
-
|
|
172
|
-
|
|
195
|
+
const backoffs = [0, 50, 200];
|
|
196
|
+
let lastErr: unknown;
|
|
197
|
+
for (let attempt = 0; attempt < backoffs.length; attempt++) {
|
|
198
|
+
if (backoffs[attempt] > 0) {
|
|
199
|
+
await new Promise((r) => setTimeout(r, backoffs[attempt]));
|
|
200
|
+
}
|
|
201
|
+
try {
|
|
173
202
|
return await fn();
|
|
203
|
+
} catch (err) {
|
|
204
|
+
lastErr = err;
|
|
205
|
+
const code = (err as { code?: string } | null | undefined)?.code;
|
|
206
|
+
if (
|
|
207
|
+
code === "CONNECTION_ENDED" ||
|
|
208
|
+
code === "CONNECTION_CLOSED" ||
|
|
209
|
+
code === "CONNECTION_DESTROYED" ||
|
|
210
|
+
code === "CONNECT_TIMEOUT" ||
|
|
211
|
+
code === "ECONNRESET"
|
|
212
|
+
) {
|
|
213
|
+
continue;
|
|
214
|
+
}
|
|
215
|
+
throw err;
|
|
174
216
|
}
|
|
175
|
-
throw err;
|
|
176
217
|
}
|
|
218
|
+
throw lastErr;
|
|
177
219
|
}
|
|
178
220
|
|
|
179
221
|
private addToPathCache(tenantId: string, path: string): void {
|
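The retry loop in the last hunk can be exercised outside the engine. The standalone sketch below follows the same shape: up to three attempts, delays of 0, 50 and 200 ms, retrying only on the listed connection-level codes. The names `isTransient` and `retryTransient`, and the injectable `sleep` parameter, are assumptions of this sketch and are not part of the package.

```typescript
// Codes treated as transient, copied from the hunk above.
const TRANSIENT_CODES = new Set([
  "CONNECTION_ENDED",
  "CONNECTION_CLOSED",
  "CONNECTION_DESTROYED",
  "CONNECT_TIMEOUT",
  "ECONNRESET",
]);

function isTransient(err: unknown): boolean {
  const code = (err as { code?: string } | null | undefined)?.code;
  return typeof code === "string" && TRANSIENT_CODES.has(code);
}

// `sleep` is injectable (an assumption of this sketch) so tests need not wait.
async function retryTransient<T>(
  fn: () => Promise<T>,
  backoffsMs: readonly number[] = [0, 50, 200],
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastErr: unknown;
  for (const delay of backoffsMs) {
    if (delay > 0) await sleep(delay);
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (!isTransient(err)) throw err; // non-connection errors surface immediately
    }
  }
  throw lastErr; // every attempt hit a transient failure
}
```

A call such as `retryTransient(() => runQuery())`, where `runQuery` is any promise-returning function of your own, would then retry only when the rejection carries one of the listed `code` values. Any other error is rethrown on the first attempt.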