@bookedsolid/rea 0.7.0 → 0.9.0

@@ -30,6 +30,26 @@
  * successful reconnect — in that case we mark the connection unhealthy and
  * let the circuit breaker take over.
  *
+ * ## Supervisor / child-death detection (0.9.0, BUG-002..003)
+ *
+ * The SDK `StdioClientTransport` exposes `onclose` + `onerror` callbacks that
+ * fire when the child process exits or the stdio pipe errors outside a
+ * caller-initiated `close()`. We wire both and treat an unexpected close as
+ * "child is dead" — the next `callTool` must force a fresh connect rather
+ * than calling into a stale `Client` that will reply `Not connected`.
+ *
+ * Before 0.9.0 the supervisor was reactive only: a dead child was not noticed
+ * until the NEXT tool call tried to use it, at which point the circuit could
+ * flap open → half-open → open with the child still dead because the
+ * half-open probe re-used the zombie client. 0.9.0 makes death detection
+ * eager: `onclose` nulls `this.client` so the very next call takes the
+ * `connect()` branch and actually respawns the child.
+ *
+ * "Not connected" error messages from the SDK (our in-flight fallback) are
+ * now also treated as fatal for the current client — we null it before the
+ * one-shot reconnect path so we spawn fresh rather than retrying with the
+ * same dead handle.
+ *
  * ## Why not request-level retries
  *
  * MCP tool calls are not idempotent by default. Retrying `send_message` after
@@ -94,10 +114,41 @@ export function buildChildEnv(config, hostEnv = process.env) {
  }
  return { env: out, missing: interp.missing, secretKeys: interp.secretKeys };
  }
+ /**
+ * Substring marker for "the SDK thinks the client is still alive but the
+ * child transport is already gone" errors. Matches the exact message the
+ * MCP SDK throws from `Client` method calls after `onclose` has fired but
+ * before our own code has re-connected. Kept as a constant so tests can
+ * assert against it without string duplication.
+ */
+ const NOT_CONNECTED_MARKER = 'Not connected';
  export class DownstreamConnection {
  config;
  logger;
  client = null;
+ /**
+ * Handle to the currently active transport, so our `onclose`/`onerror`
+ * hooks can tell "this is the transport we care about" vs "a stale callback
+ * firing after we already swapped to a new transport". Cleared in `close()`
+ * BEFORE we invoke `client.close()` so our own tear-down does not race the
+ * supervisor path.
+ */
+ activeTransport = null;
+ /**
+ * Set of transports currently being torn down by an in-flight `close()`.
+ * `onclose` / `onerror` callbacks that fire for a transport in this set
+ * must NOT be promoted to an "unexpected child death" — they are our own
+ * tear-down signal.
+ *
+ * Codex P2 (0.9.0 review): the earlier `closingIntentionally` boolean was
+ * connection-wide. Under concurrent calls, one call's `await this.close()`
+ * could overlap with another call's reconnect that had already installed
+ * a NEW transport. A genuine `onclose` from the new transport would hit
+ * the boolean guard and be silently ignored, reintroducing the stale-
+ * handle bug the patch targeted. Per-transport scoping eliminates the
+ * race: only the exact transport we asked to close is silenced.
+ */
+ closingTransports = new Set();
  /**
  * Whether a reconnect has already been attempted in the CURRENT failure
  * episode. Resets to `false` after a reconnect succeeds (so a later,
@@ -107,7 +158,30 @@ export class DownstreamConnection {
  reconnectAttempted = false;
  /** Epoch ms of the last successful reconnect. Used by the flapping guard. */
  lastReconnectAt = 0;
+ /**
+ * Epoch ms of the most recent unexpected child-death event. Stamped by
+ * `handleUnexpectedClose()`. 0 means "never died unexpectedly".
+ *
+ * Codex 0.9.0 pass-5 P2b: when `handleUnexpectedClose` nulls `this.client`,
+ * the very next `callTool` takes the top-level `client === null` branch,
+ * which normally bypasses the flap-window check entirely (that check lives
+ * in the catch branch below, conditioned on `lastReconnectAt`). A downstream
+ * that crashes immediately after every spawn would therefore be respawned
+ * unconditionally on every incoming call — exactly the loop the flap
+ * window is supposed to suppress. Consulting this timestamp in the
+ * `client === null` branch lets us refuse the respawn when the previous
+ * death is within the flap window, and the caller gets a clear error
+ * instead of watching the child die again.
+ */
+ unexpectedDeathAt = 0;
  health = 'healthy';
+ /**
+ * Optional supervisor-event listener. Set via
+ * {@link onSupervisorEvent}. A single subscriber is sufficient — the pool
+ * is the one consumer. Listener failures are swallowed; a broken consumer
+ * must never break the connection lifecycle.
+ */
+ supervisorListener = null;
  /**
  * The most recent error observed on this connection (connect or call
  * failure). Surfaced via `__rea__health` so callers can diagnose an empty
@@ -163,6 +237,127 @@ export class DownstreamConnection {
  get isConnected() {
  return this.client !== null;
  }
+ /**
+ * Register a supervisor-event listener. Intended for the pool to wire up
+ * SESSION_BLOCKER tracking + observability hooks without the connection
+ * class having to know about either. Only one listener is supported — a
+ * second call replaces the first. Pass `null` to detach.
+ */
+ onSupervisorEvent(listener) {
+ this.supervisorListener = listener;
+ }
+ /**
+ * Invoke the supervisor listener if registered. Swallows listener errors —
+ * a broken observer must never break the connection state machine.
+ */
+ emitSupervisorEvent(event) {
+ const listener = this.supervisorListener;
+ if (listener === null)
+ return;
+ try {
+ listener(event);
+ }
+ catch {
+ // Intentionally swallowed. See JSDoc.
+ }
+ }
+ /**
+ * Emit a `health_changed` event. Called from every site that mutates a
+ * health/last_error/tools_count-visible field WITHOUT firing one of the
+ * louder supervisor events (`child_died_unexpectedly` / `respawned`).
+ * Addresses Codex 0.9.0 pass-2 P2a — live-state was only scheduled from
+ * breaker transitions and respawns, so transient errors below the breaker
+ * threshold would leave `rea status` showing stale data.
+ */
+ emitHealthChanged() {
+ this.emitSupervisorEvent({ kind: 'health_changed', server: this.config.name });
+ }
+ /**
+ * Handle an unexpected transport close. Fires when the child process exits
+ * outside a caller-initiated `close()`, or when the stdio pipe errors in a
+ * way the SDK surfaces as a close event.
+ *
+ * Contract:
+ * - Only runs for the currently-active transport (stale callbacks from
+ * an already-swapped transport are ignored).
+ * - Does NOT run when WE initiated the close (the transport is a member
+ * of `closingTransports` for the duration of our own `close()` call).
+ * - Nulls `this.client` so the next `callTool` takes the `connect()`
+ * branch and actually respawns the child.
+ * - Marks the connection unhealthy so the pool knows not to route
+ * traffic to it while we wait for the next call.
+ * - Emits a `child_died_unexpectedly` supervisor event so the pool's
+ * SESSION_BLOCKER tracker can count this even though no callTool has
+ * failed yet (the child may die mid-idle).
+ */
+ handleUnexpectedClose(transport, reason) {
+ // Stale callback: a previous transport's onclose firing after we've
+ // already swapped in a new one. Ignore — the new transport is live and
+ // we don't want to clobber it.
+ if (this.activeTransport !== transport)
+ return;
+ // Per-transport intentional-close filter. Codex P2 (0.9.0 review): a
+ // connection-wide boolean would let a late `onclose` from a newly
+ // reconnected transport be silenced while an earlier `close()` on the
+ // PREVIOUS transport was still in flight. Scoping by transport
+ // identity means only the exact transport we asked to close is
+ // silenced — a real death on any other transport fires normally.
+ if (this.closingTransports.has(transport))
+ return;
+ this.client = null;
+ this.activeTransport = null;
+ this.health = 'unhealthy';
+ this.#lastErrorMessage = `child process exited unexpectedly: ${reason}`;
+ // Codex 0.9.0 pass-5 P2b: stamp the death time so `callTool`'s
+ // `client === null` branch can consult the flap window and refuse a
+ // respawn if the child died within `RECONNECT_FLAP_WINDOW_MS`. Without
+ // this, the top-level respawn path bypasses the flap guard entirely.
+ this.unexpectedDeathAt = Date.now();
+ this.logger?.warn({
+ event: 'downstream.child_died',
+ server_name: this.config.name,
+ message: `downstream "${this.config.name}" child died unexpectedly — next call will respawn`,
+ reason,
+ });
+ this.emitSupervisorEvent({
+ kind: 'child_died_unexpectedly',
+ server: this.config.name,
+ reason,
+ });
+ }
+ /**
+ * Handle a transport-layer protocol error. onerror does NOT always imply
+ * close — the SDK emits it for protocol errors too. We record the error
+ * text but leave connection invalidation to the eventual onclose callback,
+ * which is guaranteed to follow a fatal transport error on stdio.
+ *
+ * Codex 0.9.0 pass-6 P2: filter stale/intentional-close callbacks the
+ * same way `handleUnexpectedClose` does. Without this, a delayed
+ * onerror from a PREVIOUSLY-ACTIVE transport (one we've already torn
+ * down or replaced) can clobber the HEALTHY replacement connection's
+ * last_error and emit a spurious health_changed, leaving `rea status`
+ * showing a stale error on a perfectly live child. The `onclose`
+ * hook already enforced this filter; the `onerror` hook did not.
+ */
+ handleTransportError(transport, err) {
+ if (this.activeTransport !== transport)
+ return;
+ if (this.closingTransports.has(transport))
+ return;
+ this.#lastErrorMessage = err.message;
+ this.logger?.warn({
+ event: 'downstream.transport_error',
+ server_name: this.config.name,
+ message: `downstream "${this.config.name}" transport error`,
+ error: err.message,
+ });
+ // Codex 0.9.0 pass-4 P2: surface the new last_error to the live-state
+ // publisher immediately. Before this emit, a protocol-level transport
+ // error that did NOT trigger a subsequent onclose would update
+ // last_error in memory but leave `rea status` showing the previous
+ // (stale) value until some unrelated circuit/respawn event flushed.
+ this.emitHealthChanged();
+ }
  /**
  * Last error observed, or null if the connection has never failed (or fully
  * recovered).
@@ -206,11 +401,13 @@ export class DownstreamConnection {
  this.health = 'unhealthy';
  const msg = `failed to resolve env for downstream "${this.config.name}": ${err instanceof Error ? err.message : err}`;
  this.#lastErrorMessage = msg;
+ this.emitHealthChanged();
  throw new Error(msg);
  }
  if (built.missing.length > 0) {
  this.health = 'unhealthy';
  this.#lastErrorMessage = `missing env: ${built.missing.join(', ')}`;
+ this.emitHealthChanged();
  // One line per missing var so grep/jq users can find the exact gap.
  // We intentionally do NOT log the env key name's VALUE (there is none —
  // it's unresolved) nor any other env values.
@@ -225,17 +422,38 @@ export class DownstreamConnection {
  args: this.config.args,
  env: built.env,
  });
+ // BUG-002/003: wire supervisor hooks BEFORE connect so we never miss a
+ // close event that fires during the initial handshake. The hooks only
+ // act on the transport we hand them — a stale callback from a previous
+ // transport is ignored in `handleUnexpectedClose`.
+ transport.onclose = () => {
+ this.handleUnexpectedClose(transport, 'transport closed');
+ };
+ transport.onerror = (err) => {
+ this.handleTransportError(transport, err);
+ };
  const client = new Client({ name: `rea-gateway-client:${this.config.name}`, version: '0.2.0' }, { capabilities: {} });
  try {
  await client.connect(transport);
  this.client = client;
+ this.activeTransport = transport;
  this.health = 'healthy';
  this.#lastErrorMessage = null;
+ this.emitHealthChanged();
  }
  catch (err) {
  this.health = 'unhealthy';
  const msg = `failed to connect to downstream "${this.config.name}" (${this.config.command}): ${err instanceof Error ? err.message : err}`;
  this.#lastErrorMessage = msg;
+ // The transport may have partially started and set up child pipes —
+ // tell the SDK to tear it down so we don't leak the zombie child.
+ try {
+ await transport.close();
+ }
+ catch {
+ // Best-effort.
+ }
+ this.emitHealthChanged();
  throw new Error(msg);
  }
  }
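Note on the hunks above: the per-transport filtering can be tried in isolation. The following is a minimal sketch, not the package's code. `FakeTransport` and `MiniConnection` are hypothetical stand-ins invented for illustration; only the `activeTransport` / `closingTransports` bookkeeping mirrors what the diff shows.

```javascript
// Hypothetical stand-in for an SDK transport: close() fires onclose synchronously.
class FakeTransport {
  constructor(name) {
    this.name = name;
    this.onclose = null;
  }
  close() {
    if (this.onclose) this.onclose();
  }
}

// Hypothetical miniature of the connection's transport bookkeeping.
class MiniConnection {
  constructor() {
    this.activeTransport = null;
    this.closingTransports = new Set();
    this.deaths = []; // names of transports that closed unexpectedly
  }
  attach(transport) {
    // Hook is bound to this specific transport, as in the diff.
    transport.onclose = () => this.handleClose(transport);
    this.activeTransport = transport;
  }
  handleClose(transport) {
    if (this.activeTransport !== transport) return; // stale callback
    if (this.closingTransports.has(transport)) return; // our own tear-down
    this.activeTransport = null;
    this.deaths.push(transport.name);
  }
  close() {
    const closing = this.activeTransport;
    if (closing === null) return;
    this.closingTransports.add(closing);
    this.activeTransport = null;
    try {
      closing.close();
    } finally {
      this.closingTransports.delete(closing);
    }
  }
}

const conn = new MiniConnection();
const a = new FakeTransport('a');
conn.attach(a);
conn.close(); // intentional close: nothing recorded

const b = new FakeTransport('b');
conn.attach(b);
a.close(); // late callback from the old transport: ignored
b.close(); // active transport closed without close(): recorded
console.log(conn.deaths); // only 'b' is recorded
```

Keying the filter on transport identity rather than a single flag is what lets the late callback from `a` be dropped without masking the real close of `b`.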
@@ -254,7 +472,38 @@ export class DownstreamConnection {
  */
  async callTool(toolName, args) {
  if (this.client === null) {
+ // Codex 0.9.0 pass-5 P2b: if the previous death was inside the flap
+ // window, refuse the respawn and surface the flap-window error instead.
+ // This keeps a crash-on-spawn child from being respawned on every
+ // incoming call — the same guarantee the `catch` branch provides for
+ // transport errors on a live client. The timestamp is stamped by
+ // `handleUnexpectedClose`; if the client was nulled by some other
+ // path (our own `close()`, initial cold start, etc.) `unexpectedDeathAt`
+ // is 0 and the check is a no-op.
+ const deathWithinFlapWindow = this.unexpectedDeathAt !== 0 &&
+ Date.now() - this.unexpectedDeathAt < RECONNECT_FLAP_WINDOW_MS;
+ if (deathWithinFlapWindow) {
+ this.health = 'unhealthy';
+ const msg = `downstream "${this.config.name}" unhealthy — child died within ` +
+ `flap window, refusing to respawn`;
+ this.#lastErrorMessage = msg;
+ this.logger?.error({
+ event: 'downstream.respawn_refused_flap',
+ server_name: this.config.name,
+ message: msg,
+ last_death_ms_ago: Date.now() - this.unexpectedDeathAt,
+ });
+ this.emitHealthChanged();
+ throw new Error(msg);
+ }
  await this.connect();
+ // A successful spawn after a death ends the episode — clear the stamp
+ // so future unrelated deaths get their own flap window rather than
+ // inheriting this one.
+ this.unexpectedDeathAt = 0;
+ // Successful respawn counts as recovery for the supervisor — emit it
+ // so observability sinks can reset per-server session-blocker counts.
+ this.emitSupervisorEvent({ kind: 'respawned', server: this.config.name });
  }
  try {
  const result = await this.client.callTool({ name: toolName, arguments: args });
@@ -262,12 +511,40 @@ export class DownstreamConnection {
  // this, a connection that failed once and then recovered on the very
  // next call (same client, no reconnect) would forever report the old
  // error via `__rea__health`, misleading operators about live state.
+ //
+ // Codex 0.9.0 pass-2 P2a: only emit `health_changed` when we actually
+ // cleared something — the common success path runs through here every
+ // call, so noisy emission would burn debounced writes. A same-value
+ // write is a no-op for live-state purposes.
+ const hadError = this.#lastErrorMessage !== null;
  this.#lastErrorMessage = null;
+ if (hadError)
+ this.emitHealthChanged();
  return result;
  }
  catch (err) {
  const message = err instanceof Error ? err.message : String(err);
  const withinFlapWindow = this.lastReconnectAt !== 0 && Date.now() - this.lastReconnectAt < RECONNECT_FLAP_WINDOW_MS;
+ // BUG-003: "Not connected" means the SDK's idea of the client state
+ // has diverged from reality — usually because the child exited between
+ // calls and the `onclose` hook hasn't fired yet (or raced this call).
+ // Force a proper tear-down NOW so the next branch either reconnects
+ // against a clean slate (reconnect branch) or leaves a null client so
+ // the NEXT callTool's guard spawns fresh (terminal branch). Codex
+ // 0.9.0 pass-3 P2: an earlier implementation nulled `this.client` +
+ // `this.activeTransport` inline here, which made the subsequent
+ // `await this.close()` below a no-op (`c` was already null) — the
+ // stale child would leak until gateway shutdown. Calling `close()`
+ // eagerly ensures the transport is actually closed.
+ if (message.includes(NOT_CONNECTED_MARKER)) {
+ try {
+ await this.close();
+ }
+ catch {
+ // Best-effort — close() already swallows transport close errors,
+ // but belt-and-braces for any unexpected throw.
+ }
+ }
  if (!this.reconnectAttempted && !withinFlapWindow) {
  this.reconnectAttempted = true;
  this.health = 'degraded';
@@ -278,8 +555,12 @@ export class DownstreamConnection {
  reason: message,
  });
  try {
+ // For non-NOT_CONNECTED paths we still need to tear down the old
+ // client. When we DID take the NOT_CONNECTED branch above, `close()`
+ // is idempotent: `c === null` short-circuits cleanly.
  await this.close();
  await this.connect();
+ this.emitSupervisorEvent({ kind: 'respawned', server: this.config.name });
  const result = await this.client.callTool({ name: toolName, arguments: args });
  // Success: episode closed. Reset for the NEXT unrelated failure and
  // stamp the reconnect time so flap-guard can refuse rapid repeats.
@@ -303,6 +584,7 @@ export class DownstreamConnection {
  message: `downstream "${this.config.name}" unhealthy after one reconnect`,
  error: errMsg,
  });
+ this.emitHealthChanged();
  throw new Error(`downstream "${this.config.name}" unhealthy after one reconnect: ${errMsg}`);
  }
  }
@@ -314,19 +596,39 @@ export class DownstreamConnection {
  message: `downstream "${this.config.name}" call failed`,
  error: message,
  });
+ this.emitHealthChanged();
  throw new Error(`downstream "${this.config.name}" call failed: ${message}`);
  }
  }
  async close() {
  const c = this.client;
+ // Capture the transport being closed BEFORE we null `activeTransport`,
+ // so a synchronously-firing `onclose` during `c.close()` can be matched
+ // against this specific transport instead of whichever transport is
+ // "current" at the moment the callback lands. Codex P2 (0.9.0 review):
+ // the earlier implementation used a connection-wide boolean, which
+ // under concurrent calls could silence a legitimate death event for a
+ // newer transport while we were still tearing down an older one.
+ const closingTransport = this.activeTransport;
+ if (closingTransport !== null) {
+ this.closingTransports.add(closingTransport);
+ }
  this.client = null;
- if (c === null)
- return;
+ this.activeTransport = null;
  try {
- await c.close();
+ if (c === null)
+ return;
+ try {
+ await c.close();
+ }
+ catch {
+ // Best-effort close — child may already be gone.
+ }
  }
- catch {
- // Best-effort close child may already be gone.
+ finally {
+ if (closingTransport !== null) {
+ this.closingTransports.delete(closingTransport);
+ }
  }
  }
  }
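Note on the `callTool` hunks above: the respawn guard reduces to one predicate over a timestamp with a `0` sentinel. The sketch below isolates that predicate. The window length is an assumption made for illustration (the diff references `RECONNECT_FLAP_WINDOW_MS` but does not show its value), and `deathWithinFlapWindow` as a free function is hypothetical; in the package the check is inline.

```javascript
// Assumed value for illustration only; the real RECONNECT_FLAP_WINDOW_MS is not shown in the diff.
const ASSUMED_FLAP_WINDOW_MS = 30_000;

// Hypothetical helper: true when a recorded unexpected death is recent enough
// that a respawn should be refused. A stamp of 0 means "never died unexpectedly".
function deathWithinFlapWindow(unexpectedDeathAt, now, windowMs = ASSUMED_FLAP_WINDOW_MS) {
  return unexpectedDeathAt !== 0 && now - unexpectedDeathAt < windowMs;
}

const diedAt = 1_000_000; // arbitrary epoch-ms stamp for the demo

// Cold start or intentional close: no stamp, so the guard is a no-op.
const coldStart = deathWithinFlapWindow(0, diedAt + 5);
// Call arrives 5 s after the death: inside the window, respawn refused.
const soonAfter = deathWithinFlapWindow(diedAt, diedAt + 5_000);
// Call arrives exactly at the window edge: the comparison is strict, so respawn is allowed.
const atEdge = deathWithinFlapWindow(diedAt, diedAt + ASSUMED_FLAP_WINDOW_MS);

console.log({ coldStart, soonAfter, atEdge });
```

Passing `now` in explicitly, rather than calling `Date.now()` inside, is only a convenience for this sketch so the three cases are deterministic.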
@@ -0,0 +1,252 @@
+ /**
+ * Live `serve.state.json` publisher (BUG-005, 0.9.0).
+ *
+ * Before 0.9.0 `.rea/serve.state.json` was written once at `rea serve` boot
+ * and never touched again. `rea status` therefore only surfaced
+ * `session_id`, `started_at`, and `metrics_port` — agents planning a
+ * multi-downstream workflow had no way to see "is helixir's circuit open
+ * right now?" without calling `__rea__health` through the MCP transport
+ * (which, ironically, wouldn't work if the gateway was the thing that had
+ * wedged).
+ *
+ * The publisher subscribes to two signals:
+ *
+ * 1. Circuit-breaker `onStateChange` — transitions to/from open/half-open
+ * update the per-downstream block.
+ * 2. Supervisor events from the pool — `child_died_unexpectedly` and
+ * `respawned` update per-downstream liveness.
+ *
+ * Each update debounces to at most one write per ~250 ms via a trailing
+ * timer so a storm of transitions (e.g. open → half-open → open → half-open
+ * during a flap) doesn't spam the filesystem.
+ *
+ * Writes reuse the atomic temp+rename pattern from `serve.ts`. The write
+ * carries the same ownership key (`session_id`) as the boot write so a
+ * racing second `rea serve` instance is still correctly distinguished at
+ * shutdown.
+ *
+ * ## Why not an IPC endpoint?
+ *
+ * We briefly considered piggy-backing a `/downstreams.json` route on the
+ * metrics HTTP server. Rejected on the grounds of:
+ *
+ * - `rea status` works when `REA_METRICS_PORT` is unset (common in local
+ * dev); a disk snapshot keeps it authoritative.
+ * - The write rate is bounded (debounced) and the snapshot is tiny (few
+ * hundred bytes).
+ * - The on-disk file is the one surface a CRASHED gateway leaves behind
+ * — IPC evaporates the moment the process dies, whereas a file survives
+ * for post-mortem inspection.
+ */
+ import type { CircuitBreaker, CircuitState } from './circuit-breaker.js';
+ import type { DownstreamPool } from './downstream-pool.js';
+ import type { FieldRedactor, Logger } from './log.js';
+ import type { SessionBlockerTracker } from './session-blocker.js';
+ export interface LiveStateOptions {
+ baseDir: string;
+ stateFilePath: string;
+ sessionId: string;
+ startedAt: string;
+ metricsPort: number | null;
+ pool: DownstreamPool;
+ breaker: CircuitBreaker;
+ sessionBlocker: SessionBlockerTracker;
+ logger?: Logger;
+ /**
+ * Debounce window for coalesced writes. Default 250 ms. Exposed so tests
+ * can force immediate flushes.
+ */
+ debounceMs?: number;
+ /**
+ * Redactor applied to `last_error` strings before they are written to
+ * `serve.state.json`. `rea serve` wires this to the same
+ * `buildRegexRedactor` instance the gateway logger uses (policy
+ * `redact.patterns` + built-in `SECRET_PATTERNS`) so a credential that
+ * leaked into a downstream error message does not end up on disk or on
+ * an operator's terminal via `rea status`.
+ *
+ * Omitting the redactor preserves pre-0.9.0 behavior (no last_error
+ * redaction at the publisher layer). Direct embedders of `createGateway`
+ * that pass their own logger redactor should also pass this.
+ */
+ lastErrorRedactor?: FieldRedactor;
+ }
+ /**
+ * Per-downstream block surfaced in `serve.state.json` and echoed by
+ * `rea status`. Narrow by design — anything an operator wants beyond this
+ * lives in `__rea__health` where the gateway's richer state machine is
+ * live.
+ */
+ export interface LiveDownstreamState {
+ name: string;
+ connected: boolean;
+ healthy: boolean;
+ circuit_state: CircuitState;
+ /** ISO timestamp when the circuit is expected to move to half-open. Only present when `open`. */
+ retry_at: string | null;
+ last_error: string | null;
+ tools_count: number | null;
+ /** Cumulative circuit-open transitions counted toward SESSION_BLOCKER. */
+ open_transitions: number;
+ /** True once SESSION_BLOCKER has fired for this server in the current session. */
+ session_blocker_emitted: boolean;
+ }
+ export interface LiveServeState {
+ session_id: string;
+ started_at: string;
+ metrics_port: number | null;
+ /** Downstream block — added in 0.9.0. Empty array when no servers configured. */
+ downstreams: LiveDownstreamState[];
+ /** ISO timestamp of this snapshot. Separate from `started_at`. */
+ updated_at: string;
+ /**
+ * PID of the process that wrote this snapshot. Added in 0.9.0 pass-4 so
+ * a NEW `rea serve` can detect an ABANDONED state file (writer crashed,
+ * no one cleaned up) and take over ownership. Without this field,
+ * the pass-2 session_id-only ownership check was strictly safer but
+ * also strictly one-directional: once an older session wrote, no new
+ * session could ever claim the file, and `rea status` would stall on
+ * the dead session forever. Optional for backward compatibility with
+ * pre-0.9.0 snapshots that lack the field.
+ */
+ owner_pid?: number;
+ }
+ export declare class LiveStatePublisher {
+ private readonly opts;
+ private timer;
+ /**
+ * Separate timer for the yield-retry path. Kept distinct from `timer` so a
+ * scheduled debounce doesn't cancel the retry and vice-versa — they serve
+ * different purposes (coalesce vs. poll). Cleared by `stop()`.
+ */
+ private yieldRetryTimer;
+ private stopped;
+ constructor(opts: LiveStateOptions);
+ /**
+ * Schedule a write. Coalesces multiple calls within the debounce window
+ * into a single flush. Safe to call from circuit-breaker and supervisor
+ * event paths without worrying about write rate.
+ */
+ scheduleUpdate(): void;
+ /**
+ * Write the current snapshot synchronously, bypassing the debounce.
+ * Called on boot (to publish the initial downstream block) and on
+ * shutdown (to flush any pending updates before the state file is
+ * ownership-cleaned).
+ *
+ * ## Ownership handoff (Codex P1 + P2b)
+ *
+ * The ownership check + rename is performed under a sidecar lockfile
+ * (`serve.state.json.lock`) created with `O_EXCL` (`wx`). This converts
+ * what was two non-atomic steps into a serialized critical section.
+ *
+ * Flow:
+ *
+ * 1. Acquire the lock (`open(path, 'wx')`). If EEXIST, a concurrent
+ * writer — either another publisher in THIS process (not possible
+ * given the debounce, but cheap to defend against) or another
+ * `rea serve` instance with overlapping lifetime — holds it. Skip
+ * this flush silently; the debounce timer will try again, and on
+ * shutdown the concurrent writer's own state will be authoritative.
+ * 2. Under the lock: re-read the on-disk `session_id`. If it belongs
+ * to a DIFFERENT session, another instance has already claimed the
+ * breadcrumb. Release the lock and yield (log-only).
+ * 3. Under the lock: atomically rename our temp file over the target.
+ * Because the concurrent writer cannot execute step 3 until we
+ * release the lock, and we only reach step 3 after confirming the
+ * on-disk session matches ours, the "older clobbers newer"
+ * race Codex flagged is closed.
+ * 4. Release the lock (unlink the sidecar) in a finally block.
+ *
+ * Stale locks from a crashed process with the same PID would deadlock
+ * the critical section forever — so the acquire step checks the lock
+ * file's contents (written as our PID + random nonce) and, if the
+ * owning PID is no longer running, steals it. The steal path is
+ * intentionally narrow (PID-check only, no timestamp TTL) because
+ * holding the lock longer than a single flushNow invocation is a bug.
+ */
+ flushNow(): void;
+ /** Path to the sidecar lockfile. Resolved once per call; trivial cost. */
+ private lockFilePath;
+ /**
+ * Try to acquire the sidecar lock. Returns the lock file descriptor on
+ * success, or `null` on contention. Throws only on unexpected I/O errors
+ * (permissions, disk full) — those propagate out of `flushNow`'s try
+ * block and land in the `write_failed` log path.
+ *
+ * Stale-lock recovery: if a lockfile exists but its recorded PID is not
+ * currently running, the file is unlinked and one retry is issued. This
+ * covers the case where a previous `rea serve` SIGKILL'd mid-flush and
+ * left a dangling lockfile.
+ */
+ private acquireLock;
+ /**
+ * Release the sidecar lock. Best-effort — if the unlink fails, the next
+ * flushNow will see a dangling lock and the stale-lock recovery path
+ * will clean it up. We MUST still close the fd so we don't leak it.
+ */
+ private releaseLock;
+ /**
+ * Returns true iff the lock file's recorded PID is not currently alive.
+ * Uses `process.kill(pid, 0)` which sends no signal but errors with
+ * ESRCH when the PID is gone. Any parse error or unexpected kill error
+ * is treated as "not stale" to err on the side of NOT stealing a live
+ * peer's lock.
+ */
+ private isStaleLock;
+ /**
+ * Returns true iff this publisher is allowed to write the on-disk state
+ * file on behalf of its session. The check runs under the sidecar lock
+ * (see `flushNow`) so the read + subsequent rename form one serialized
+ * critical section.
+ *
+ * Ownership resolves against three buckets:
+ *
+ * 1. **Safe-to-write**: the file is absent, corrupt, or has a missing/
+ * malformed `session_id`. No competing session is on disk, so we
+ * write without hesitation.
+ * 2. **We own it**: the stored `session_id` matches ours. Normal
+ * steady-state — every flush lands here.
+ * 3. **Another session owns it**: the stored `session_id` differs
+ * from ours. Before 0.9.0 pass-4 this was an unconditional yield,
+ * which was strictly safer but broke the crash-recovery case —
+ * a NEW `rea serve` launched after an unclean shutdown would
+ * observe the crashed session's id and yield forever, leaving
+ * `rea status` permanently stuck. Codex pass-4 P1 flagged this.
+ *
+ * The 0.9.0 `owner_pid` field exists exactly to disambiguate this
+ * bucket. If `owner_pid` is alive, an overlapping writer is still
+ * running and we yield (silent). If `owner_pid` is gone (ESRCH)
+ * or missing from the payload (pre-0.9.0 file or same-process
+ * write), we treat the file as abandoned and take over.
+ *
+ * `process.kill(pid, 0)` returns ESRCH for a missing PID, EPERM for a
+ * live PID we cannot signal. We treat EPERM as "alive from someone's
+ * perspective" and yield — never steal a file the kernel is uncertain
+ * about.
+ */
+ private ownsStateFile;
+ /**
+ * Schedule a longer-interval retry of `flushNow`. Used by the yield path
+ * so a new gateway waiting on a live peer eventually reclaims the file
+ * when the peer exits. Idempotent — if a retry is already pending, this
+ * call is a no-op.
+ *
+ * Distinct from `scheduleUpdate()` because:
+ * - The debounce timer coalesces rapid events; this timer polls at a
+ * slow cadence for ownership changes.
+ * - Scheduling yield retries on the debounce timer would mean one
+ * supervisor event during the wait cancels the retry, and the
+ * debounce timer ALSO can't be re-scheduled while `timer !== null`.
+ */
+ private scheduleYieldRetry;
+ /**
+ * Stop further scheduled writes. Called from the gateway shutdown path
+ * AFTER the final flush. Clears any pending timer; no more writes will
+ * occur after this returns.
+ */
+ stop(): void;
+ /** Exposed for tests. Builds the canonical payload from live sources. */
+ buildSnapshot(): LiveServeState;
+ private buildDownstreamBlock;
+ }
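Note on the declarations above: only the signatures of `acquireLock`, `releaseLock`, and `isStaleLock` appear in this diff, so the sketch below is one plausible reading of the documented flow, not the package's implementation. The function bodies, the lock-file contents (PID only, no nonce), and the temp-directory demo are all assumptions. The Node calls it relies on are `fs.openSync(path, 'wx')`, which fails with `EEXIST` when the file exists, and `process.kill(pid, 0)`, which sends no signal and throws `ESRCH` when the PID is gone.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Assumed reading of the JSDoc: stale only when the recorded PID is provably gone.
// Parse failures, EPERM, and read errors all count as "not stale".
function isStaleLock(lockPath) {
  try {
    const pid = Number.parseInt(fs.readFileSync(lockPath, 'utf8'), 10);
    if (!Number.isInteger(pid) || pid <= 0) return false;
    process.kill(pid, 0); // existence probe; throws if the PID is missing
    return false;
  } catch (err) {
    return err.code === 'ESRCH';
  }
}

// Returns a file descriptor on success, null on contention.
// Unexpected I/O errors (anything other than EEXIST) propagate to the caller.
function acquireLock(lockPath, retried = false) {
  try {
    const fd = fs.openSync(lockPath, 'wx'); // exclusive create
    fs.writeSync(fd, String(process.pid));
    return fd;
  } catch (err) {
    if (err.code !== 'EEXIST') throw err;
    if (!retried && isStaleLock(lockPath)) {
      fs.unlinkSync(lockPath);
      return acquireLock(lockPath, true); // exactly one retry after a steal
    }
    return null;
  }
}

// Always close the fd; unlinking is best-effort.
function releaseLock(fd, lockPath) {
  try {
    fs.closeSync(fd);
  } finally {
    try {
      fs.unlinkSync(lockPath);
    } catch {
      // A dangling lock is recovered by the stale-lock path later.
    }
  }
}

// Demo in a throwaway directory (hypothetical paths).
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'lock-demo-'));
const lockPath = path.join(dir, 'serve.state.json.lock');

const first = acquireLock(lockPath); // succeeds: no lock file yet
const second = acquireLock(lockPath); // null: held by a live PID (this process)
releaseLock(first, lockPath);
const third = acquireLock(lockPath); // succeeds again after release
releaseLock(third, lockPath);
fs.rmSync(dir, { recursive: true, force: true });
```

Treating every ambiguous outcome as "not stale" matches the stated preference to skip a flush rather than steal a lock from a peer that may still be alive.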