@bookedsolid/rea 0.7.0 → 0.9.0
- package/dist/cli/serve.d.ts +8 -0
- package/dist/cli/serve.js +32 -6
- package/dist/cli/status.d.ts +25 -0
- package/dist/cli/status.js +86 -1
- package/dist/gateway/circuit-breaker.d.ts +8 -2
- package/dist/gateway/downstream-pool.d.ts +13 -1
- package/dist/gateway/downstream-pool.js +30 -2
- package/dist/gateway/downstream.d.ts +157 -0
- package/dist/gateway/downstream.js +307 -5
- package/dist/gateway/live-state.d.ts +252 -0
- package/dist/gateway/live-state.js +504 -0
- package/dist/gateway/server.d.ts +44 -1
- package/dist/gateway/server.js +101 -1
- package/dist/gateway/session-blocker.d.ts +132 -0
- package/dist/gateway/session-blocker.js +163 -0
- package/hooks/_lib/push-review-core.sh +52 -8
- package/hooks/push-review-gate-git.sh +8 -6
- package/hooks/push-review-gate.sh +32 -17
- package/package.json +1 -1
@@ -30,6 +30,26 @@
  * successful reconnect — in that case we mark the connection unhealthy and
  * let the circuit breaker take over.
  *
+ * ## Supervisor / child-death detection (0.9.0, BUG-002..003)
+ *
+ * The SDK `StdioClientTransport` exposes `onclose` + `onerror` callbacks that
+ * fire when the child process exits or the stdio pipe errors outside a
+ * caller-initiated `close()`. We wire both and treat an unexpected close as
+ * "child is dead" — the next `callTool` must force a fresh connect rather
+ * than calling into a stale `Client` that will reply `Not connected`.
+ *
+ * Before 0.9.0 the supervisor was reactive only: a dead child was not noticed
+ * until the NEXT tool call tried to use it, at which point the circuit could
+ * flap open → half-open → open with the child still dead because the
+ * half-open probe re-used the zombie client. 0.9.0 makes death detection
+ * eager: `onclose` nulls `this.client` so the very next call takes the
+ * `connect()` branch and actually respawns the child.
+ *
+ * "Not connected" error messages from the SDK (our in-flight fallback) are
+ * now also treated as fatal for the current client — we null it before the
+ * one-shot reconnect path so we spawn fresh rather than retrying with the
+ * same dead handle.
+ *
  * ## Why not request-level retries
  *
  * MCP tool calls are not idempotent by default. Retrying `send_message` after
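The eager-detection idea described in the comment block of this hunk can be illustrated with a minimal sketch. This is not the package's code: `FakeTransport` and `EagerConnection` are hypothetical stand-ins (the real class wires the MCP SDK's `StdioClientTransport`), and the only behaviour shown is that nulling the client inside `onclose` sends the next call down the connect branch.

```javascript
// Hypothetical stand-in for a stdio transport; NOT the MCP SDK class.
class FakeTransport {
  onclose = null;

  // Simulate the child process exiting on its own.
  die() {
    if (this.onclose) this.onclose();
  }
}

// Hypothetical, heavily simplified connection used only for illustration.
class EagerConnection {
  client = null;
  spawnCount = 0;

  connect() {
    const transport = new FakeTransport();
    // Wire the hook before the connection is treated as live.
    transport.onclose = () => {
      // Eager invalidation: drop the handle as soon as the child dies.
      this.client = null;
    };
    this.client = { transport };
    this.spawnCount += 1;
  }

  callTool(name) {
    // A null client forces the connect() branch, i.e. a fresh spawn.
    if (this.client === null) this.connect();
    return `${name} handled by spawn #${this.spawnCount}`;
  }
}

const conn = new EagerConnection();
const first = conn.callTool('ping');
conn.client.transport.die(); // child exits while the connection is idle
const second = conn.callTool('ping');
```

In this sketch the second call is served by a new spawn without any failed call in between, which is the difference from the reactive behaviour the comment attributes to versions before 0.9.0.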
@@ -94,10 +114,41 @@ export function buildChildEnv(config, hostEnv = process.env) {
     }
     return { env: out, missing: interp.missing, secretKeys: interp.secretKeys };
 }
+/**
+ * Substring marker for "the SDK thinks the client is still alive but the
+ * child transport is already gone" errors. Matches the exact message the
+ * MCP SDK throws from `Client` method calls after `onclose` has fired but
+ * before our own code has re-connected. Kept as a constant so tests can
+ * assert against it without string duplication.
+ */
+const NOT_CONNECTED_MARKER = 'Not connected';
 export class DownstreamConnection {
     config;
     logger;
     client = null;
+    /**
+     * Handle to the currently active transport, so our `onclose`/`onerror`
+     * hooks can tell "this is the transport we care about" vs "a stale callback
+     * firing after we already swapped to a new transport". Cleared in `close()`
+     * BEFORE we invoke `client.close()` so our own tear-down does not race the
+     * supervisor path.
+     */
+    activeTransport = null;
+    /**
+     * Set of transports currently being torn down by an in-flight `close()`.
+     * `onclose` / `onerror` callbacks that fire for a transport in this set
+     * must NOT be promoted to an "unexpected child death" — they are our own
+     * tear-down signal.
+     *
+     * Codex P2 (0.9.0 review): the earlier `closingIntentionally` boolean was
+     * connection-wide. Under concurrent calls, one call's `await this.close()`
+     * could overlap with another call's reconnect that had already installed
+     * a NEW transport. A genuine `onclose` from the new transport would hit
+     * the boolean guard and be silently ignored, reintroducing the stale-
+     * handle bug the patch targeted. Per-transport scoping eliminates the
+     * race: only the exact transport we asked to close is silenced.
+     */
+    closingTransports = new Set();
     /**
      * Whether a reconnect has already been attempted in the CURRENT failure
      * episode. Resets to `false` after a reconnect succeeds (so a later,
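The per-transport scoping described for `closingTransports` can be sketched in isolation. The class below is a hypothetical reduction, not the package's `DownstreamConnection`: `TransportGuard`, `recordedDeaths`, and the plain-object transports are assumptions made for illustration. It shows a close on the old transport being silenced while a close on a newly installed transport is still recorded.

```javascript
// Hypothetical reduction of the two guards described in the JSDoc above.
class TransportGuard {
  activeTransport = null;
  closingTransports = new Set();
  recordedDeaths = [];

  handleUnexpectedClose(transport) {
    // Stale callback from a transport that is no longer current.
    if (this.activeTransport !== transport) return;
    // Our own tear-down of this exact transport; not a death.
    if (this.closingTransports.has(transport)) return;
    this.activeTransport = null;
    this.recordedDeaths.push(transport.id);
  }
}

const guard = new TransportGuard();
const oldTransport = { id: 'old' };
const newTransport = { id: 'new' };

// A close() on the old transport is in flight.
guard.activeTransport = oldTransport;
guard.closingTransports.add(oldTransport);
guard.handleUnexpectedClose(oldTransport); // silenced: intentional close

// A concurrent reconnect installs a new transport, which then really dies.
guard.activeTransport = newTransport;
guard.handleUnexpectedClose(newTransport); // recorded: genuine death

// The old close() finishes and removes its entry.
guard.closingTransports.delete(oldTransport);
```

With a single connection-wide boolean in place of the set, the second callback would have been dropped as long as the first close was still pending, which is the race the comment describes.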
@@ -107,7 +158,30 @@ export class DownstreamConnection {
     reconnectAttempted = false;
     /** Epoch ms of the last successful reconnect. Used by the flapping guard. */
     lastReconnectAt = 0;
+    /**
+     * Epoch ms of the most recent unexpected child-death event. Stamped by
+     * `handleUnexpectedClose()`. 0 means "never died unexpectedly".
+     *
+     * Codex 0.9.0 pass-5 P2b: when `handleUnexpectedClose` nulls `this.client`,
+     * the very next `callTool` takes the top-level `client === null` branch,
+     * which normally bypasses the flap-window check entirely (that check lives
+     * in the catch branch below, conditioned on `lastReconnectAt`). A downstream
+     * that crashes immediately after every spawn would therefore be respawned
+     * unconditionally on every incoming call — exactly the loop the flap
+     * window is supposed to suppress. Consulting this timestamp in the
+     * `client === null` branch lets us refuse the respawn when the previous
+     * death is within the flap window, and the caller gets a clear error
+     * instead of watching the child die again.
+     */
+    unexpectedDeathAt = 0;
     health = 'healthy';
+    /**
+     * Optional supervisor-event listener. Set via
+     * {@link onSupervisorEvent}. A single subscriber is sufficient — the pool
+     * is the one consumer. Listener failures are swallowed; a broken consumer
+     * must never break the connection lifecycle.
+     */
+    supervisorListener = null;
     /**
      * The most recent error observed on this connection (connect or call
      * failure). Surfaced via `__rea__health` so callers can diagnose an empty
@@ -163,6 +237,127 @@ export class DownstreamConnection {
     get isConnected() {
         return this.client !== null;
     }
+    /**
+     * Register a supervisor-event listener. Intended for the pool to wire up
+     * SESSION_BLOCKER tracking + observability hooks without the connection
+     * class having to know about either. Only one listener is supported — a
+     * second call replaces the first. Pass `null` to detach.
+     */
+    onSupervisorEvent(listener) {
+        this.supervisorListener = listener;
+    }
+    /**
+     * Invoke the supervisor listener if registered. Swallows listener errors —
+     * a broken observer must never break the connection state machine.
+     */
+    emitSupervisorEvent(event) {
+        const listener = this.supervisorListener;
+        if (listener === null)
+            return;
+        try {
+            listener(event);
+        }
+        catch {
+            // Intentionally swallowed. See JSDoc.
+        }
+    }
+    /**
+     * Emit a `health_changed` event. Called from every site that mutates a
+     * health/last_error/tools_count-visible field WITHOUT firing one of the
+     * louder supervisor events (`child_died_unexpectedly` / `respawned`).
+     * Addresses Codex 0.9.0 pass-2 P2a — live-state was only scheduled from
+     * breaker transitions and respawns, so transient errors below the breaker
+     * threshold would leave `rea status` showing stale data.
+     */
+    emitHealthChanged() {
+        this.emitSupervisorEvent({ kind: 'health_changed', server: this.config.name });
+    }
+    /**
+     * Handle an unexpected transport close. Fires when the child process exits
+     * outside a caller-initiated `close()`, or when the stdio pipe errors in a
+     * way the SDK surfaces as a close event.
+     *
+     * Contract:
+     *   - Only runs for the currently-active transport (stale callbacks from
+     *     an already-swapped transport are ignored).
+     *   - Does NOT run when WE initiated the close (the transport is a member
+     *     of `closingTransports` for the duration of our own `close()` call).
+     *   - Nulls `this.client` so the next `callTool` takes the `connect()`
+     *     branch and actually respawns the child.
+     *   - Marks the connection unhealthy so the pool knows not to route
+     *     traffic to it while we wait for the next call.
+     *   - Emits a `child_died_unexpectedly` supervisor event so the pool's
+     *     SESSION_BLOCKER tracker can count this even though no callTool has
+     *     failed yet (the child may die mid-idle).
+     */
+    handleUnexpectedClose(transport, reason) {
+        // Stale callback: a previous transport's onclose firing after we've
+        // already swapped in a new one. Ignore — the new transport is live and
+        // we don't want to clobber it.
+        if (this.activeTransport !== transport)
+            return;
+        // Per-transport intentional-close filter. Codex P2 (0.9.0 review): a
+        // connection-wide boolean would let a late `onclose` from a newly
+        // reconnected transport be silenced while an earlier `close()` on the
+        // PREVIOUS transport was still in flight. Scoping by transport
+        // identity means only the exact transport we asked to close is
+        // silenced — a real death on any other transport fires normally.
+        if (this.closingTransports.has(transport))
+            return;
+        this.client = null;
+        this.activeTransport = null;
+        this.health = 'unhealthy';
+        this.#lastErrorMessage = `child process exited unexpectedly: ${reason}`;
+        // Codex 0.9.0 pass-5 P2b: stamp the death time so `callTool`'s
+        // `client === null` branch can consult the flap window and refuse a
+        // respawn if the child died within `RECONNECT_FLAP_WINDOW_MS`. Without
+        // this, the top-level respawn path bypasses the flap guard entirely.
+        this.unexpectedDeathAt = Date.now();
+        this.logger?.warn({
+            event: 'downstream.child_died',
+            server_name: this.config.name,
+            message: `downstream "${this.config.name}" child died unexpectedly — next call will respawn`,
+            reason,
+        });
+        this.emitSupervisorEvent({
+            kind: 'child_died_unexpectedly',
+            server: this.config.name,
+            reason,
+        });
+    }
+    /**
+     * Handle a transport-layer protocol error. onerror does NOT always imply
+     * close — the SDK emits it for protocol errors too. We record the error
+     * text but leave connection invalidation to the eventual onclose callback,
+     * which is guaranteed to follow a fatal transport error on stdio.
+     *
+     * Codex 0.9.0 pass-6 P2: filter stale/intentional-close callbacks the
+     * same way `handleUnexpectedClose` does. Without this, a delayed
+     * onerror from a PREVIOUSLY-ACTIVE transport (one we've already torn
+     * down or replaced) can clobber the HEALTHY replacement connection's
+     * last_error and emit a spurious health_changed, leaving `rea status`
+     * showing a stale error on a perfectly live child. The `onclose`
+     * hook already enforced this filter; the `onerror` hook did not.
+     */
+    handleTransportError(transport, err) {
+        if (this.activeTransport !== transport)
+            return;
+        if (this.closingTransports.has(transport))
+            return;
+        this.#lastErrorMessage = err.message;
+        this.logger?.warn({
+            event: 'downstream.transport_error',
+            server_name: this.config.name,
+            message: `downstream "${this.config.name}" transport error`,
+            error: err.message,
+        });
+        // Codex 0.9.0 pass-4 P2: surface the new last_error to the live-state
+        // publisher immediately. Before this emit, a protocol-level transport
+        // error that did NOT trigger a subsequent onclose would update
+        // last_error in memory but leave `rea status` showing the previous
+        // (stale) value until some unrelated circuit/respawn event flushed.
+        this.emitHealthChanged();
+    }
     /**
      * Last error observed, or null if the connection has never failed (or fully
      * recovered).
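The single-listener contract in this hunk (one subscriber, replacement on re-registration, listener errors swallowed) can be exercised with a small stand-alone sketch. `SupervisorEmitter` and `deathsByServer` are hypothetical names used for illustration; the event shapes follow the `kind` / `server` / `reason` fields visible in the hunk, and the counting consumer is an assumption about how a pool might use them.

```javascript
// Hypothetical emitter mirroring the single-listener contract in the hunk above.
class SupervisorEmitter {
  listener = null;

  // A second registration replaces the first; passing null detaches.
  onSupervisorEvent(listener) {
    this.listener = listener;
  }

  emit(event) {
    const listener = this.listener;
    if (listener === null) return;
    try {
      listener(event);
    } catch {
      // Swallowed on purpose: a broken observer must not break the emitter.
    }
  }
}

const emitter = new SupervisorEmitter();
const deathsByServer = new Map();

// Assumed consumer: count unexpected deaths per server name.
emitter.onSupervisorEvent((event) => {
  if (event.kind === 'child_died_unexpectedly') {
    deathsByServer.set(event.server, (deathsByServer.get(event.server) ?? 0) + 1);
  }
});
emitter.emit({ kind: 'child_died_unexpectedly', server: 'demo', reason: 'transport closed' });
emitter.emit({ kind: 'health_changed', server: 'demo' });

// Replace the listener with one that throws; emit() must still return normally.
emitter.onSupervisorEvent(() => {
  throw new Error('broken observer');
});
let emitThrew = false;
try {
  emitter.emit({ kind: 'respawned', server: 'demo' });
} catch {
  emitThrew = true;
}
```

Swallowing listener errors trades visibility for isolation: a faulty observer fails silently, so a consumer that needs diagnostics would have to log inside its own callback.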
@@ -206,11 +401,13 @@ export class DownstreamConnection {
             this.health = 'unhealthy';
             const msg = `failed to resolve env for downstream "${this.config.name}": ${err instanceof Error ? err.message : err}`;
             this.#lastErrorMessage = msg;
+            this.emitHealthChanged();
             throw new Error(msg);
         }
         if (built.missing.length > 0) {
             this.health = 'unhealthy';
             this.#lastErrorMessage = `missing env: ${built.missing.join(', ')}`;
+            this.emitHealthChanged();
             // One line per missing var so grep/jq users can find the exact gap.
             // We intentionally do NOT log the env key name's VALUE (there is none —
             // it's unresolved) nor any other env values.
@@ -225,17 +422,38 @@ export class DownstreamConnection {
             args: this.config.args,
             env: built.env,
         });
+        // BUG-002/003: wire supervisor hooks BEFORE connect so we never miss a
+        // close event that fires during the initial handshake. The hooks only
+        // act on the transport we hand them — a stale callback from a previous
+        // transport is ignored in `handleUnexpectedClose`.
+        transport.onclose = () => {
+            this.handleUnexpectedClose(transport, 'transport closed');
+        };
+        transport.onerror = (err) => {
+            this.handleTransportError(transport, err);
+        };
         const client = new Client({ name: `rea-gateway-client:${this.config.name}`, version: '0.2.0' }, { capabilities: {} });
         try {
             await client.connect(transport);
             this.client = client;
+            this.activeTransport = transport;
             this.health = 'healthy';
             this.#lastErrorMessage = null;
+            this.emitHealthChanged();
         }
         catch (err) {
             this.health = 'unhealthy';
             const msg = `failed to connect to downstream "${this.config.name}" (${this.config.command}): ${err instanceof Error ? err.message : err}`;
             this.#lastErrorMessage = msg;
+            // The transport may have partially started and set up child pipes —
+            // tell the SDK to tear it down so we don't leak the zombie child.
+            try {
+                await transport.close();
+            }
+            catch {
+                // Best-effort.
+            }
+            this.emitHealthChanged();
             throw new Error(msg);
         }
     }
@@ -254,7 +472,38 @@ export class DownstreamConnection {
      */
     async callTool(toolName, args) {
         if (this.client === null) {
+            // Codex 0.9.0 pass-5 P2b: if the previous death was inside the flap
+            // window, refuse the respawn and surface the flap-window error instead.
+            // This keeps a crash-on-spawn child from being respawned on every
+            // incoming call — the same guarantee the `catch` branch provides for
+            // transport errors on a live client. The timestamp is stamped by
+            // `handleUnexpectedClose`; if the client was nulled by some other
+            // path (our own `close()`, initial cold start, etc.) `unexpectedDeathAt`
+            // is 0 and the check is a no-op.
+            const deathWithinFlapWindow = this.unexpectedDeathAt !== 0 &&
+                Date.now() - this.unexpectedDeathAt < RECONNECT_FLAP_WINDOW_MS;
+            if (deathWithinFlapWindow) {
+                this.health = 'unhealthy';
+                const msg = `downstream "${this.config.name}" unhealthy — child died within ` +
+                    `flap window, refusing to respawn`;
+                this.#lastErrorMessage = msg;
+                this.logger?.error({
+                    event: 'downstream.respawn_refused_flap',
+                    server_name: this.config.name,
+                    message: msg,
+                    last_death_ms_ago: Date.now() - this.unexpectedDeathAt,
+                });
+                this.emitHealthChanged();
+                throw new Error(msg);
+            }
             await this.connect();
+            // A successful spawn after a death ends the episode — clear the stamp
+            // so future unrelated deaths get their own flap window rather than
+            // inheriting this one.
+            this.unexpectedDeathAt = 0;
+            // Successful respawn counts as recovery for the supervisor — emit it
+            // so observability sinks can reset per-server session-blocker counts.
+            this.emitSupervisorEvent({ kind: 'respawned', server: this.config.name });
         }
         try {
             const result = await this.client.callTool({ name: toolName, arguments: args });
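The flap-window decision in the `client === null` branch reduces to a pure predicate, sketched below. The value of `RECONNECT_FLAP_WINDOW_MS` does not appear in this diff, so the 30-second figure here is an assumption for illustration only, and `shouldRefuseRespawn` is a hypothetical helper name; the real code computes the condition inline.

```javascript
// Assumed value for illustration only; the real constant is not shown in this diff.
const RECONNECT_FLAP_WINDOW_MS = 30_000;

// Hypothetical helper: true when a respawn should be refused.
// `unexpectedDeathAt === 0` means "no unexpected death recorded", so cold
// starts and intentional closes always pass through.
function shouldRefuseRespawn(unexpectedDeathAt, now, windowMs = RECONNECT_FLAP_WINDOW_MS) {
  return unexpectedDeathAt !== 0 && now - unexpectedDeathAt < windowMs;
}

// Cold start: no death stamp, respawn allowed.
const coldStart = shouldRefuseRespawn(0, 5_000);
// Death 500 ms ago with a 1000 ms window: respawn refused.
const recentDeath = shouldRefuseRespawn(10_000, 10_500, 1_000);
// Death exactly one window ago: the strict `<` lets the respawn through.
const oldDeath = shouldRefuseRespawn(10_000, 11_000, 1_000);
```

Taking `now` as a parameter instead of calling `Date.now()` inside the helper keeps the boundary cases testable without real delays.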
@@ -262,12 +511,40 @@ export class DownstreamConnection {
             // this, a connection that failed once and then recovered on the very
             // next call (same client, no reconnect) would forever report the old
             // error via `__rea__health`, misleading operators about live state.
+            //
+            // Codex 0.9.0 pass-2 P2a: only emit `health_changed` when we actually
+            // cleared something — the common success path runs through here every
+            // call, so noisy emission would burn debounced writes. A same-value
+            // write is a no-op for live-state purposes.
+            const hadError = this.#lastErrorMessage !== null;
             this.#lastErrorMessage = null;
+            if (hadError)
+                this.emitHealthChanged();
             return result;
         }
         catch (err) {
             const message = err instanceof Error ? err.message : String(err);
             const withinFlapWindow = this.lastReconnectAt !== 0 && Date.now() - this.lastReconnectAt < RECONNECT_FLAP_WINDOW_MS;
+            // BUG-003: "Not connected" means the SDK's idea of the client state
+            // has diverged from reality — usually because the child exited between
+            // calls and the `onclose` hook hasn't fired yet (or raced this call).
+            // Force a proper tear-down NOW so the next branch either reconnects
+            // against a clean slate (reconnect branch) or leaves a null client so
+            // the NEXT callTool's guard spawns fresh (terminal branch). Codex
+            // 0.9.0 pass-3 P2: an earlier implementation nulled `this.client` +
+            // `this.activeTransport` inline here, which made the subsequent
+            // `await this.close()` below a no-op (`c` was already null) — the
+            // stale child would leak until gateway shutdown. Calling `close()`
+            // eagerly ensures the transport is actually closed.
+            if (message.includes(NOT_CONNECTED_MARKER)) {
+                try {
+                    await this.close();
+                }
+                catch {
+                    // Best-effort — close() already swallows transport close errors,
+                    // but belt-and-braces for any unexpected throw.
+                }
+            }
             if (!this.reconnectAttempted && !withinFlapWindow) {
                 this.reconnectAttempted = true;
                 this.health = 'degraded';
@@ -278,8 +555,12 @@ export class DownstreamConnection {
                     reason: message,
                 });
                 try {
+                    // For non-NOT_CONNECTED paths we still need to tear down the old
+                    // client. When we DID take the NOT_CONNECTED branch above, `close()`
+                    // is idempotent: `c === null` short-circuits cleanly.
                     await this.close();
                     await this.connect();
+                    this.emitSupervisorEvent({ kind: 'respawned', server: this.config.name });
                     const result = await this.client.callTool({ name: toolName, arguments: args });
                     // Success: episode closed. Reset for the NEXT unrelated failure and
                     // stamp the reconnect time so flap-guard can refuse rapid repeats.
@@ -303,6 +584,7 @@ export class DownstreamConnection {
                         message: `downstream "${this.config.name}" unhealthy after one reconnect`,
                         error: errMsg,
                     });
+                    this.emitHealthChanged();
                     throw new Error(`downstream "${this.config.name}" unhealthy after one reconnect: ${errMsg}`);
                 }
             }
@@ -314,19 +596,39 @@ export class DownstreamConnection {
                 message: `downstream "${this.config.name}" call failed`,
                 error: message,
             });
+            this.emitHealthChanged();
             throw new Error(`downstream "${this.config.name}" call failed: ${message}`);
         }
     }
     async close() {
         const c = this.client;
+        // Capture the transport being closed BEFORE we null `activeTransport`,
+        // so a synchronously-firing `onclose` during `c.close()` can be matched
+        // against this specific transport instead of whichever transport is
+        // "current" at the moment the callback lands. Codex P2 (0.9.0 review):
+        // the earlier implementation used a connection-wide boolean, which
+        // under concurrent calls could silence a legitimate death event for a
+        // newer transport while we were still tearing down an older one.
+        const closingTransport = this.activeTransport;
+        if (closingTransport !== null) {
+            this.closingTransports.add(closingTransport);
+        }
         this.client = null;
-
-            return;
+        this.activeTransport = null;
         try {
-
+            if (c === null)
+                return;
+            try {
+                await c.close();
+            }
+            catch {
+                // Best-effort close — child may already be gone.
+            }
         }
-
-
+        finally {
+            if (closingTransport !== null) {
+                this.closingTransports.delete(closingTransport);
+            }
         }
     }
 }
@@ -0,0 +1,252 @@
+/**
+ * Live `serve.state.json` publisher (BUG-005, 0.9.0).
+ *
+ * Before 0.9.0 `.rea/serve.state.json` was written once at `rea serve` boot
+ * and never touched again. `rea status` therefore only surfaced
+ * `session_id`, `started_at`, and `metrics_port` — agents planning a
+ * multi-downstream workflow had no way to see "is helixir's circuit open
+ * right now?" without calling `__rea__health` through the MCP transport
+ * (which, ironically, wouldn't work if the gateway was the thing that had
+ * wedged).
+ *
+ * The publisher subscribes to two signals:
+ *
+ *   1. Circuit-breaker `onStateChange` — transitions to/from open/half-open
+ *      update the per-downstream block.
+ *   2. Supervisor events from the pool — `child_died_unexpectedly` and
+ *      `respawned` update per-downstream liveness.
+ *
+ * Each update debounces to at most one write per ~250 ms via a trailing
+ * timer so a storm of transitions (e.g. open → half-open → open → half-open
+ * during a flap) doesn't spam the filesystem.
+ *
+ * Writes reuse the atomic temp+rename pattern from `serve.ts`. The write
+ * carries the same ownership key (`session_id`) as the boot write so a
+ * racing second `rea serve` instance is still correctly distinguished at
+ * shutdown.
+ *
+ * ## Why not an IPC endpoint?
+ *
+ * We briefly considered piggy-backing a `/downstreams.json` route on the
+ * metrics HTTP server. Rejected on the grounds of:
+ *
+ *   - `rea status` works when `REA_METRICS_PORT` is unset (common in local
+ *     dev); a disk snapshot keeps it authoritative.
+ *   - The write rate is bounded (debounced) and the snapshot is tiny (few
+ *     hundred bytes).
+ *   - The on-disk file is the one surface a CRASHED gateway leaves behind
+ *     — IPC evaporates the moment the process dies, whereas a file survives
+ *     for post-mortem inspection.
+ */
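The coalescing behaviour the header comment describes (a burst of updates producing at most one write per window) can be sketched as follows. Only the declaration file is visible in this diff, so the implementation below is an assumption: `createUpdateScheduler` is a hypothetical name, and whether the real publisher keeps the first timer or resets it on each call is not shown. The timer function is injectable so the behaviour can be checked without waiting on a real clock.

```javascript
// Hypothetical trailing-timer coalescer; not the package's LiveStatePublisher.
// `setTimer` defaults to the global setTimeout and can be swapped for a fake.
function createUpdateScheduler(flush, debounceMs = 250, setTimer = setTimeout) {
  let pending = false;
  return function scheduleUpdate() {
    // A flush is already queued for this window: coalesce into it.
    if (pending) return;
    pending = true;
    setTimer(() => {
      pending = false;
      flush();
    }, debounceMs);
  };
}

// Demo with a fake timer that records callbacks instead of waiting.
const queuedTimers = [];
let flushCount = 0;
const scheduleUpdate = createUpdateScheduler(
  () => { flushCount += 1; },
  250,
  (callback) => { queuedTimers.push(callback); },
);

// A burst of three transitions queues a single timer.
scheduleUpdate();
scheduleUpdate();
scheduleUpdate();
const timersAfterBurst = queuedTimers.length;

// The window elapses: one flush, and the next update queues a new timer.
queuedTimers[0]();
scheduleUpdate();
const timersAfterSecondBurst = queuedTimers.length;
```

In this variant the first call in a window fixes the flush time, so a continuous storm of updates still flushes once per window instead of being postponed indefinitely.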
41
|
+
import type { CircuitBreaker, CircuitState } from './circuit-breaker.js';
|
|
42
|
+
import type { DownstreamPool } from './downstream-pool.js';
|
|
43
|
+
import type { FieldRedactor, Logger } from './log.js';
|
|
44
|
+
import type { SessionBlockerTracker } from './session-blocker.js';
|
|
45
|
+
export interface LiveStateOptions {
|
|
46
|
+
baseDir: string;
|
|
47
|
+
stateFilePath: string;
|
|
48
|
+
sessionId: string;
|
|
49
|
+
startedAt: string;
|
|
50
|
+
metricsPort: number | null;
|
|
51
|
+
pool: DownstreamPool;
|
|
52
|
+
breaker: CircuitBreaker;
|
|
53
|
+
sessionBlocker: SessionBlockerTracker;
|
|
54
|
+
logger?: Logger;
|
|
55
|
+
/**
|
|
56
|
+
* Debounce window for coalesced writes. Default 250 ms. Exposed so tests
|
|
57
|
+
* can force immediate flushes.
|
|
58
|
+
*/
|
|
59
|
+
debounceMs?: number;
|
|
60
|
+
/**
|
|
61
|
+
* Redactor applied to `last_error` strings before they are written to
|
|
62
|
+
* `serve.state.json`. `rea serve` wires this to the same
|
|
63
|
+
* `buildRegexRedactor` instance the gateway logger uses (policy
|
|
64
|
+
* `redact.patterns` + built-in `SECRET_PATTERNS`) so a credential that
|
|
65
|
+
* leaked into a downstream error message does not end up on disk or on
|
|
66
|
+
* an operator's terminal via `rea status`.
|
|
67
|
+
*
|
|
68
|
+
* Omitting the redactor preserves pre-0.9.0 behavior (no last_error
|
|
69
|
+
* redaction at the publisher layer). Direct embedders of `createGateway`
|
|
70
|
+
* that pass their own logger redactor should also pass this.
|
|
71
|
+
*/
|
|
72
|
+
lastErrorRedactor?: FieldRedactor;
|
|
73
|
+
}
|
|
74
|
+
/**
|
|
75
|
+
* Per-downstream block surfaced in `serve.state.json` and echoed by
|
|
76
|
+
* `rea status`. Narrow by design — anything an operator wants beyond this
|
|
77
|
+
* lives in `__rea__health` where the gateway's richer state machine is
|
|
78
|
+
* live.
|
|
79
|
+
*/
|
|
80
|
+
export interface LiveDownstreamState {
|
|
81
|
+
name: string;
|
|
82
|
+
connected: boolean;
|
|
83
|
+
healthy: boolean;
|
|
84
|
+
circuit_state: CircuitState;
|
|
85
|
+
/** ISO timestamp when the circuit is expected to move to half-open. Only present when `open`. */
|
|
86
|
+
retry_at: string | null;
|
|
87
|
+
last_error: string | null;
|
|
88
|
+
tools_count: number | null;
|
|
89
|
+
/** Cumulative circuit-open transitions counted toward SESSION_BLOCKER. */
|
|
90
|
+
open_transitions: number;
|
|
91
|
+
/** True once SESSION_BLOCKER has fired for this server in the current session. */
|
|
92
|
+
session_blocker_emitted: boolean;
|
|
93
|
+
}
|
|
94
|
+
export interface LiveServeState {
|
|
95
|
+
session_id: string;
|
|
96
|
+
started_at: string;
|
|
97
|
+
metrics_port: number | null;
|
|
98
|
+
/** Downstream block — added in 0.9.0. Empty array when no servers configured. */
|
|
99
|
+
downstreams: LiveDownstreamState[];
|
|
100
|
+
/** ISO timestamp of this snapshot. Separate from `started_at`. */
|
|
101
|
+
updated_at: string;
|
|
102
|
+
/**
|
|
103
|
+
* PID of the process that wrote this snapshot. Added in 0.9.0 pass-4 so
|
|
104
|
+
* a NEW `rea serve` can detect an ABANDONED state file (writer crashed,
|
|
105
|
+
* no one cleaned up) and take over ownership. Without this field,
|
|
106
|
+
* the pass-2 session_id-only ownership check was strictly safer but
|
|
107
|
+
* also strictly one-directional: once an older session wrote, no new
|
|
108
|
+
* session could ever claim the file, and `rea status` would stall on
|
|
109
|
+
* the dead session forever. Optional for backward compatibility with
|
|
110
|
+
* pre-0.9.0 snapshots that lack the field.
|
|
111
|
+
*/
|
|
112
|
+
owner_pid?: number;
|
|
113
|
+
}
|
|
114
|
+
export declare class LiveStatePublisher {
|
|
115
|
+
private readonly opts;
|
|
116
|
+
private timer;
|
|
117
|
+
/**
|
|
118
|
+
* Separate timer for the yield-retry path. Kept distinct from `timer` so a
|
|
119
|
+
* scheduled debounce doesn't cancel the retry and vice-versa — they serve
|
|
120
|
+
* different purposes (coalesce vs. poll). Cleared by `stop()`.
|
|
121
|
+
*/
|
|
122
|
+
private yieldRetryTimer;
|
|
123
|
+
private stopped;
|
|
124
|
+
constructor(opts: LiveStateOptions);
|
|
125
|
+
/**
|
|
126
|
+
* Schedule a write. Coalesces multiple calls within the debounce window
|
|
127
|
+
* into a single flush. Safe to call from circuit-breaker and supervisor
|
|
128
|
+
* event paths without worrying about write rate.
|
|
129
|
+
*/
|
|
130
|
+
scheduleUpdate(): void;
|
|
131
|
+
/**
|
|
132
|
+
* Write the current snapshot synchronously, bypassing the debounce.
|
|
133
|
+
* Called on boot (to publish the initial downstream block) and on
|
|
134
|
+
* shutdown (to flush any pending updates before the state file is
|
|
135
|
+
* ownership-cleaned).
|
|
136
|
+
*
|
|
137
|
+
* ## Ownership handoff (Codex P1 + P2b)
|
|
138
|
+
*
|
|
139
|
+
* The ownership check + rename is performed under a sidecar lockfile
|
|
140
|
+
* (`serve.state.json.lock`) created with `O_EXCL` (`wx`). This converts
|
|
141
|
+
* what was two non-atomic steps into a serialized critical section.
|
|
142
|
+
*
|
|
143
|
+
* Flow:
|
|
144
|
+
*
|
|
145
|
+
* 1. Acquire the lock (`open(path, 'wx')`). If EEXIST, a concurrent
|
|
146
|
+
* writer — either another publisher in THIS process (not possible
|
|
147
|
+
* given the debounce, but cheap to defend against) or another
|
|
148
|
+
* `rea serve` instance with overlapping lifetime — holds it. Skip
|
|
149
|
+
* this flush silently; the debounce timer will try again, and on
|
|
150
|
+
* shutdown the concurrent writer's own state will be authoritative.
|
|
151
|
+
* 2. Under the lock: re-read the on-disk `session_id`. If it belongs
|
|
152
|
+
* to a DIFFERENT session, another instance has already claimed the
|
|
153
|
+
* breadcrumb. Release the lock and yield (log-only).
|
|
154
|
+
* 3. Under the lock: atomically rename our temp file over the target.
|
|
155
|
+
* Because the concurrent writer cannot execute step 3 until we
|
|
156
|
+
* release the lock, and we only reach step 3 after confirming the
|
|
157
|
+
* on-disk session matches ours, the "older clobbers newer"
|
|
158
|
+
* race Codex flagged is closed.
|
|
159
|
+
* 4. Release the lock (unlink the sidecar) in a finally block.
|
|
160
|
+
*
|
|
161
|
+
* Stale locks from a crashed process with the same PID would deadlock
|
|
162
|
+
* the critical section forever — so the acquire step checks the lock
|
|
163
|
+
* file's contents (written as our PID + random nonce) and, if the
|
|
164
|
+
* owning PID is no longer running, steals it. The steal path is
|
|
165
|
+
* intentionally narrow (PID-check only, no timestamp TTL) because
|
|
166
|
+
* holding the lock longer than a single flushNow invocation is a bug.
|
|
167
|
+
*/
|
|
168
|
+
flushNow(): void;
|
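The four-step flow documented for `flushNow()` can be sketched with Node's synchronous `fs` API. This is a simplified illustration under stated assumptions: the function name `flushUnderLock`, the result strings, and the temp-file naming are hypothetical, and the sketch omits stale-lock stealing and the `owner_pid` takeover that the surrounding comments describe.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

type FlushResult = "written" | "lock_busy" | "yielded";

// Hypothetical helper: serialize "check session_id, then rename" under an
// O_EXCL sidecar lock. Not the package's actual code.
function flushUnderLock(
  statePath: string,
  sessionId: string,
  payload: Record<string, unknown>,
): FlushResult {
  const lockPath = `${statePath}.lock`;
  let fd: number;
  try {
    fd = fs.openSync(lockPath, "wx"); // step 1: fails with EEXIST if held
  } catch (err) {
    if ((err as { code?: string }).code === "EEXIST") return "lock_busy";
    throw err; // unexpected I/O error: let the caller log it
  }
  try {
    // Step 2: re-read the on-disk session_id while holding the lock.
    let onDisk: unknown;
    try {
      onDisk = JSON.parse(fs.readFileSync(statePath, "utf8")).session_id;
    } catch {
      onDisk = undefined; // absent or corrupt file: nobody to yield to
    }
    if (typeof onDisk === "string" && onDisk !== sessionId) return "yielded";

    // Step 3: write a temp file, then atomically rename it over the target.
    const tmpPath = `${statePath}.${process.pid}.tmp`;
    fs.writeFileSync(tmpPath, JSON.stringify({ ...payload, session_id: sessionId }));
    fs.renameSync(tmpPath, statePath);
    return "written";
  } finally {
    // Step 4: always release the lock, even on the yield path.
    fs.closeSync(fd);
    fs.rmSync(lockPath, { force: true });
  }
}

// Usage: the second session yields because the first already owns the file.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "lock-sketch-"));
const demoState = path.join(demoDir, "serve.state.json");
const first = flushUnderLock(demoState, "session-a", { n: 1 });
const second = flushUnderLock(demoState, "session-b", { n: 2 });
```

The `finally` block matters here: a yield in step 2 must still release the lock, otherwise the next flush would see contention that no writer is actually causing.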
|
169
|
+
/** Path to the sidecar lockfile. Resolved once per call; trivial cost. */
|
|
170
|
+
private lockFilePath;
|
|
171
|
+
/**
|
|
172
|
+
* Try to acquire the sidecar lock. Returns the lock file descriptor on
|
|
173
|
+
* success, or `null` on contention. Throws only on unexpected I/O errors
|
|
174
|
+
* (permissions, disk full) — those propagate out of `flushNow`'s try
|
|
175
|
+
* block and land in the `write_failed` log path.
|
|
176
|
+
*
|
|
177
|
+
* Stale-lock recovery: if a lockfile exists but its recorded PID is not
|
|
178
|
+
* currently running, the file is unlinked and one retry is issued. This
|
|
179
|
+
* covers the case where a previous `rea serve` was SIGKILL'd mid-flush and
|
|
180
|
+
* left a dangling lockfile.
|
|
181
|
+
*/
|
|
182
|
+
private acquireLock;
|
|
183
|
+
/**
|
|
184
|
+
* Release the sidecar lock. Best-effort — if the unlink fails, the next
|
|
185
|
+
* flushNow will see a dangling lock and the stale-lock recovery path
|
|
186
|
+
* will clean it up. We MUST still close the fd so we don't leak it.
|
|
187
|
+
*/
|
|
188
|
+
private releaseLock;
|
|
189
|
+
/**
|
|
190
|
+
* Returns true iff the lock file's recorded PID is not currently alive.
|
|
191
|
+
* Uses `process.kill(pid, 0)` which sends no signal but errors with
|
|
192
|
+
* ESRCH when the PID is gone. Any parse error or unexpected kill error
|
|
193
|
+
* is treated as "not stale" to err on the side of NOT stealing a live
|
|
194
|
+
* peer's lock.
|
|
195
|
+
*/
|
|
196
|
+
private isStaleLock;
|
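The liveness probe described for `isStaleLock` can be sketched as below. The exact on-disk layout of the lock file is not shown in this diff, so the `"<pid> <nonce>"` format parsed here is an assumption, and the function names are hypothetical.

```typescript
// Hypothetical sketch. Assumes the lock file body is "<pid> <nonce>";
// the real layout is not visible in this declaration file.
function pidIsGone(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0: existence check only, nothing is sent
    return false; // no error: the process exists
  } catch (err) {
    // Only ESRCH means "no such process". EPERM or anything unexpected is
    // treated as "not gone" so a live peer's lock is never stolen.
    return (err as { code?: string }).code === "ESRCH";
  }
}

function isStaleLockContents(contents: string): boolean {
  const first = contents.trim().split(/\s+/)[0] ?? "";
  const pid = Number.parseInt(first, 10);
  // Reject non-positive values too: pid 0 and negative pids address process
  // groups in kill(2), which is not what a liveness probe wants.
  if (!Number.isInteger(pid) || pid <= 0) return false; // parse error: not stale
  return pidIsGone(pid);
}
```

Treating every ambiguous outcome as "not stale" biases the sketch toward leaving a dangling lock in place rather than stealing a live one, which matches the stated intent of the comment above.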
|
197
|
+
/**
|
|
198
|
+
* Returns true iff this publisher is allowed to write the on-disk state
|
|
199
|
+
* file on behalf of its session. The check runs under the sidecar lock
|
|
200
|
+
* (see `flushNow`) so the read + subsequent rename form one serialized
|
|
201
|
+
* critical section.
|
|
202
|
+
*
|
|
203
|
+
* Ownership resolves against three buckets:
|
|
204
|
+
*
|
|
205
|
+
* 1. **Safe-to-write**: the file is absent, corrupt, or has a missing/
|
|
206
|
+
* malformed `session_id`. No competing session is on disk, so we
|
|
207
|
+
* write without hesitation.
|
|
208
|
+
* 2. **We own it**: the stored `session_id` matches ours. Normal
|
|
209
|
+
* steady-state — every flush lands here.
|
|
210
|
+
* 3. **Another session owns it**: the stored `session_id` differs
|
|
211
|
+
* from ours. Before 0.9.0 pass-4 this was an unconditional yield,
|
|
212
|
+
* which was strictly safer but broke the crash-recovery case —
|
|
213
|
+
* a NEW `rea serve` launched after an unclean shutdown would
|
|
214
|
+
* observe the crashed session's id and yield forever, leaving
|
|
215
|
+
* `rea status` permanently stuck. Codex pass-4 P1 flagged this.
|
|
216
|
+
*
|
|
217
|
+
* The 0.9.0 `owner_pid` field exists exactly to disambiguate this
|
|
218
|
+
* bucket. If `owner_pid` is alive, an overlapping writer is still
|
|
219
|
+
* running and we yield (silent). If `owner_pid` is gone (ESRCH)
|
|
220
|
+
* or missing from the payload (pre-0.9.0 file or same-process
|
|
221
|
+
* write), we treat the file as abandoned and take over.
|
|
222
|
+
*
|
|
223
|
+
* `process.kill(pid, 0)` throws ESRCH for a missing PID and EPERM for a
|
|
224
|
+
* live PID we cannot signal. We treat EPERM as "alive from someone's
|
|
225
|
+
* perspective" and yield — never steal a file the kernel is uncertain
|
|
226
|
+
* about.
|
|
227
|
+
*/
|
|
228
|
+
private ownsStateFile;
|
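The three ownership buckets, plus the `owner_pid` disambiguation, can be expressed as a pure decision function. This is a sketch under assumptions: `resolveOwnership`, `probePid`, and the `Decision`/`Liveness` types are hypothetical names, a malformed `owner_pid` is treated the same as a missing one, and unexpected probe errors are mapped to a conservative yield. The source comments do not spell out those last two cases.

```typescript
type Liveness = "alive" | "gone" | "unknown";
type Decision = "write" | "yield" | "take_over";

// Hypothetical probe: ESRCH means gone, EPERM means alive but not signalable.
function probePid(pid: number): Liveness {
  try {
    process.kill(pid, 0);
    return "alive";
  } catch (err) {
    const code = (err as { code?: string }).code;
    if (code === "ESRCH") return "gone";
    if (code === "EPERM") return "alive";
    return "unknown";
  }
}

// Hypothetical decision function. `raw` is the state file's text, or null
// when the file does not exist.
function resolveOwnership(
  raw: string | null,
  ourSessionId: string,
  probe: (pid: number) => Liveness = probePid,
): Decision {
  // Bucket 1: absent, corrupt, or missing/malformed session_id.
  if (raw === null) return "write";
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return "write";
  }
  if (typeof parsed !== "object" || parsed === null) return "write";
  const { session_id, owner_pid } = parsed as {
    session_id?: unknown;
    owner_pid?: unknown;
  };
  if (typeof session_id !== "string" || session_id === "") return "write";

  // Bucket 2: we already own the file.
  if (session_id === ourSessionId) return "write";

  // Bucket 3: another session owns it; owner_pid decides.
  if (typeof owner_pid !== "number" || !Number.isInteger(owner_pid) || owner_pid <= 0) {
    return "take_over"; // missing (or, by assumption, malformed) owner_pid
  }
  return probe(owner_pid) === "gone" ? "take_over" : "yield";
}
```

Passing the probe as a parameter keeps the decision logic testable without depending on which PIDs happen to exist on the machine.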
|
229
|
+
/**
|
|
230
|
+
* Schedule a longer-interval retry of `flushNow`. Used by the yield path
|
|
231
|
+
* so a new gateway waiting on a live peer eventually reclaims the file
|
|
232
|
+
* when the peer exits. Idempotent — if a retry is already pending, this
|
|
233
|
+
* call is a no-op.
|
|
234
|
+
*
|
|
235
|
+
* Distinct from `scheduleUpdate()` because:
|
|
236
|
+
* - The debounce timer coalesces rapid events; this timer polls at a
|
|
237
|
+
* slow cadence for ownership changes.
|
|
238
|
+
* - Scheduling yield retries on the debounce timer would mean one
|
|
239
|
+
* supervisor event during the wait cancels the retry, and the
|
|
240
|
+
* debounce timer ALSO can't be re-scheduled while `timer !== null`.
|
|
241
|
+
*/
|
|
242
|
+
private scheduleYieldRetry;
|
|
243
|
+
/**
|
|
244
|
+
* Stop further scheduled writes. Called from the gateway shutdown path
|
|
245
|
+
* AFTER the final flush. Clears any pending timers; no more writes will
|
|
246
|
+
* occur after this returns.
|
|
247
|
+
*/
|
|
248
|
+
stop(): void;
|
|
249
|
+
/** Exposed for tests. Builds the canonical payload from live sources. */
|
|
250
|
+
buildSnapshot(): LiveServeState;
|
|
251
|
+
private buildDownstreamBlock;
|
|
252
|
+
}
|