oxtail 0.14.1 → 0.16.0

package/README.md CHANGED
@@ -67,7 +67,7 @@ Contributing? `git clone https://github.com/d4j3y2k/oxtail && cd oxtail && npm i
67
67
  - `set_my_state` — write a small "state card" onto this session's registry entry so peers can see what we're doing without reading our transcript. v1 surfaces a single field, `purpose` (≤200 chars).
68
68
  - `send_message` — **fire-and-forget** message to a peer. Target is a tmux session name or a raw `client_session_id` UUID. Body ≤ 8KB. Delivery is async via the peer's mailbox file. A plain message does **not** wake an idle peer; pass `wake: "auto"` to nudge one (state-gated — see [Waking an idle peer](#waking-an-idle-peer)). Replies to `ask_peer` should pass `reply_to: "<request_id>"` when the inbound message carries a `request_id` — and a reply **auto-wakes the requester by default** (strictly gated; `wake: "off"` opts out). (v0.5+)
69
69
  - `read_my_messages` — drain this session's mailbox and return any queued messages. Messages include `from_session_id`, server-stamped `origin: "peer"`, and optional `request_id` / `reply_to`. Codex peers (and unhooked Claude Code) poll this; Claude Code peers with the hooks installed see messages mid-turn (PreToolUse) or at turn end (Stop) instead. (v0.5+)
70
- - `ask_peer` — **delegate-and-wait**. Enqueues a message with a `request_id` and blocks server-side until the peer replies with `send_message({ reply_to: request_id })` or the timeout elapses. Default timeout is 45s (`OXTAIL_ASK_PEER_TIMEOUT_MS`), and each call may pass `timeout_ms`. New peers use strict `reply_to` correlation; legacy/no-capability peers fall back to best-effort first-message matching and the response reports `correlation: "uncorrelated"`. That legacy path may stale-match old same-peer chatter, so callers should treat `uncorrelated` as compatibility-only. Use `send_message` for fire-and-forget. (v0.7+)
70
+ - `ask_peer` — **delegate-and-wait**. Enqueues a message with a `request_id` and blocks server-side until the peer replies with `send_message({ reply_to: request_id })` or the timeout elapses. Default timeout is 60s (`OXTAIL_ASK_PEER_TIMEOUT_MS`), and each call may pass `timeout_ms`. New peers use strict `reply_to` correlation; legacy/no-capability peers fall back to best-effort first-message matching and the response reports `correlation: "uncorrelated"`. That legacy path may stale-match old same-peer chatter, so callers should treat `uncorrelated` as compatibility-only. **Durable on timeout (v0.15+):** if the wait elapses, the request is recorded as a pending obligation, so when the peer's reply finally arrives — minutes or hours later — it *wakes the requester back* (`wake_reason: "late_reply_to_pending"`) instead of landing silently. That makes `ask_peer` safe for long-running delegations: let it time out, end the turn, get pulled back when the work is done. Use `send_message` for fire-and-forget. (v0.7+)
71
71
  - `reply_to_message` — **reply by `message_id`**. The atomic, correlation-safe alternative to hand-wiring `send_message`'s `target` + `reply_to`: pass the `message_id` the hook or `read_my_messages` showed you and the server looks the inbound envelope up in this session's durable **received-ledger**, derives the reply target (the original sender), carries `reply_to: request_id` when the inbound was an `ask_peer` (keeping the exchange correlated), and stamps `source_message_id`. Replying to a plain `send_message` works too — it just omits `reply_to`. Ownership is structural (you can only reply to a message delivered to *you*); fail-closed on an unknown/aged-out id. Same wake semantics as `send_message`, including the wake-on-reply default. (v0.13+)
72
72
  - `register_my_session` — pin this MCP server's `session_id` directly. Kept for debugging; prefer `claim_session`.
73
73
  - `get_my_session` — return this MCP server's own registry entry plus a per-strategy detection diagnosis. Useful for debugging.
@@ -163,7 +163,9 @@ send_message({ target: "<requester>", body: "...", reply_to: "<request_id>" })
163
163
 
164
164
  The reply path is deliberately **stricter** than explicit `wake: "auto"`. It fires only when the target is **freshly idle** — an `idle` activity marker newer than `OXTAIL_AUTOWAKE_FRESH_IDLE_MS` (default 5 min). Stale, unknown, missing, or busy state yields `skipped_no_fresh_idle` (no best-effort wake — typing unprompted into a terminal that may be unattended is the risk we refuse to take). Two more guards bound it: a **per-target rate limit** (`OXTAIL_AUTOWAKE_MIN_INTERVAL_MS`, default 4s → `skipped_rate_limited`) since one wake already drains the whole mailbox, and a **one-wake dedupe** keyed on `(session_id, reply_to)` (`skipped_deduped`) so a duplicate or late hook drain of the same reply can't re-fire. If the dedupe/rate store is somehow unwritable the wake degrades to `skipped_store_error` rather than failing the (already-delivered) message. The env kill-switch `OXTAIL_AUTOWAKE=off` disables reply auto-wake entirely (`wake_status: "disabled"`). Every outcome that reaches the gate surfaces a `wake_status`; the reply path also stamps `wake_reason: "reply_to_default"` (present even on a resolve error like `ambiguous-target`, where there's no single target to wake).
165
165
 
166
- **Coverage (which requesters this reaches).** The fresh-idle gate keys on the requester's busy/idle activity marker, which only the Claude Code hooks maintain. So wake-on-reply currently closes the stranding for a **hooked Claude Code requester** (the originally-observed case: a peer's async reply to an idle Claude session). A **Codex** requester — or a Claude requester without the hooks installed — has no idle marker, so a reply with `wake` unset returns `skipped_no_fresh_idle` and is **not** auto-woken; reach it with an explicit `wake: "auto"`, which always takes the lenient wake path (idle/unknown/stale all wake; only a fresh-`busy` peer is skipped) and bypasses the strict fresh-idle gate even for a reply. Closing the Codex/unhooked-requester direction *by default* needs a requester-side waiter signal (`expects_reply`), which is the next slice — a blind `unknown ⇒ wake` default is deliberately avoided because it reintroduces the double-wake-an-active-waiter risk this gate exists to prevent.
166
+ **Coverage (which requesters this reaches).** The fresh-idle gate keys on the requester's busy/idle activity marker, which only the Claude Code hooks maintain. So wake-on-reply currently closes the stranding for a **hooked Claude Code requester** (the originally-observed case: a peer's async reply to an idle Claude session). A **Codex** requester — or a Claude requester without the hooks installed — has no idle marker, so a reply with `wake` unset returns `skipped_no_fresh_idle` and is **not** auto-woken; reach it with an explicit `wake: "auto"`, which always takes the lenient wake path (idle/unknown/stale all wake; only a fresh-`busy` peer is skipped) and bypasses the strict fresh-idle gate even for a reply.
167
+
168
+ For the **`ask_peer` case specifically**, the Codex/unhooked-requester direction is now closed *by default* (v0.15+, see [Durable `ask_peer`](#durable-ask_peer-long-efforts) below): a timed-out `ask_peer` records a durable **pending-ask** keyed on the requester's `session_id` + `request_id`, and the matching late reply takes the lenient wake path regardless of any idle marker — so even a markerless idle Codex requester is pulled back. This is exactly the requester-side waiter signal the blind `unknown ⇒ wake` default was avoided for: it's evidence the requester *explicitly asked and is waiting*, so it can't double-wake an unrelated active turn.
167
169
 
168
170
  **Codex and the wake matrix.** The send-keys wake needs a tmux pane. A Codex peer running **outside tmux** has none, so it returns `wake_status: "skipped_no_target"` — its idle delivery stays poll-based (`read_my_messages`). Run Codex **inside a tmux pane** to get symmetric idle-wake; the routing already handles the Codex paste-burst case.
169
171
 
@@ -238,17 +240,34 @@ All wake paths funnel through one place, which **coalesces** rapid repeat wakes
238
240
  ### Constraints
239
241
 
240
242
  - The target peer must have a registered `client.session_id`. Codex peers must call `claim_session` / `register_my_session` first; without that, `ask_peer` returns `error: "peer-has-no-session-id"` rather than guessing.
241
- - Timeout defaults to 45000ms (conservative under typical MCP-client tool-call abort windows). Pass `timeout_ms` on a call when a specific delegation needs a different bound; max 300000ms.
243
+ - Timeout defaults to 60000ms — enough headroom for a slower multi-tool-call peer reply (e.g. a Codex peer running `set_my_state` + `reply_to_message` + composing a report, observed ~46s) while staying under both known callers' tool-call abort windows (Claude Code is clean to ~60s; Codex aborts ~120s). Pass `timeout_ms` on a call when a specific delegation needs a different bound; max 300000ms.
242
244
 
243
245
  ### Tuning the timeout
244
246
 
245
- If `ask_peer` returns an abort error before its built-in 45s timeout fires, your MCP client's tool-call ceiling is lower than 45s. Override the bound at server startup:
247
+ If `ask_peer` returns an abort error before its built-in 60s timeout fires, your MCP client's tool-call ceiling is lower than 60s. Override the bound at server startup:
246
248
 
247
249
  ```sh
248
250
  OXTAIL_ASK_PEER_TIMEOUT_MS=30000 npx -y oxtail@0.10.1
249
251
  ```
250
252
 
251
- The server reads the env var once at boot and uses it as the fixed timeout for all `ask_peer` calls in that session. Values must be positive numbers; anything else falls back to the 45000ms default.
253
+ The server reads the env var once at boot and uses it as the fixed timeout for all `ask_peer` calls in that session. Values must be positive numbers; anything else falls back to the 60000ms default.
254
+
255
+ ### Durable `ask_peer` (long efforts)
256
+
257
+ The blocking wait is a *short* primitive (bounded by the client's tool-call abort window, ~60s). A real task can take minutes or hours — far longer than any wait can block. So `ask_peer` decouples the **wait** from the **delivery of the answer**:
258
+
259
+ - On timeout (for a correlated peer + a claimed requester), the request is recorded as a durable **pending-ask** at `~/.oxtail/pending-ask/p-<hash(session_id, request_id)>`, keyed on the *requester's* `session_id` + `request_id`. A `recordPendingAsk` runs **before** one final authoritative union-drain of the requester's mailbox (write-before-final-drain), so a reply that lands in the poll-vs-deadline gap is returned immediately, and a reply that arrives later finds the persisted record.
260
+ - When that reply eventually arrives, `resolveSendWake` finds the matching pending-ask, **consumes it** (atomic `unlink`, single-winner — a duplicate/re-delivered reply can't re-fire), and takes the **lenient** wake path (`wake_reason: "late_reply_to_pending"`). Because the record is *proof the requester explicitly asked and is waiting*, the wake fires regardless of the 5-min fresh-idle window — and reaches a **markerless idle Codex** requester that the strict reply-default would skip. It also stamps the autowake dedupe for `(session_id, request_id)` so a later duplicate can't strict-wake via the fresh-idle fallback.
261
+ - `wake: "off"` still **consumes** the record (the obligation is satisfied — leaving it would let a later duplicate wake and violate the explicit off) but suppresses the wake (`wake_reason: "late_reply_to_pending_suppressed"`). The automatic (wake-unset) path honors `OXTAIL_AUTOWAKE=off` (`wake_status: "disabled"`); an explicit `wake: "auto"` intentionally does not.
262
+ - The reply drain is a **union across the requester's sibling MCP-child pids** (`drainMatchingReplyMany`), mirroring `read_my_messages` — a dual-scope requester's reply may land in a sibling pid, not the one that blocked in `ask_peer`.
263
+
264
+ Records are honored for `OXTAIL_PENDING_ASK_TTL_MS` (default 1h, sized for long efforts): a reply after that still delivers durably via `read_my_messages` but won't fire the pull-back wake (`consumePendingAsk` is TTL-aware — it removes an over-TTL record without waking). GC is **opportunistic** — abandoned records (a reply that never came) are swept when a later `ask_peer` times out, not on a wall-clock timer; the files are tiny, and a reply always cleans up its own record on arrival.
265
+
266
+ **The pattern:** `ask_peer` a long task → let it return `timed_out: true` → end your turn → get woken when the answer lands. Pair with a generous `OXTAIL_ACTIVITY_BUSY_TTL_MS` if your turns run long (see below).
267
+
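The pending-ask lifecycle described in this section (record on timeout, single-winner consume on the late reply, TTL-aware expiry, `wake: "off"` suppression) can be modelled in a few lines. This is a minimal in-memory sketch, not the package's code: the real store is file-based under `~/.oxtail/pending-ask/`, and `createPendingAskStore`, its method names, and the returned object shape are hypothetical. Only the two `wake_reason` strings and the 1h default come from the README text.

```javascript
// Hypothetical in-memory model of the pending-ask lifecycle (illustrative only).
const DEFAULT_TTL_MS = 60 * 60 * 1000; // README default for OXTAIL_PENDING_ASK_TTL_MS (1h)

function createPendingAskStore(ttlMs = DEFAULT_TTL_MS) {
  const records = new Map();
  // Keyed on the requester's session_id + request_id, as the README describes.
  const keyOf = (sessionId, requestId) => `${sessionId}\u0000${requestId}`;
  return {
    // Called when ask_peer times out.
    record(sessionId, requestId, nowMs) {
      records.set(keyOf(sessionId, requestId), nowMs);
    },
    // Called when a reply carrying reply_to arrives later.
    consume(sessionId, requestId, nowMs, wake) {
      const key = keyOf(sessionId, requestId);
      const recordedAt = records.get(key);
      if (recordedAt === undefined) return { matched: false };
      // Single winner: the record is removed on first match, whatever happens next.
      records.delete(key);
      // Over-TTL records are removed without waking.
      if (nowMs - recordedAt > ttlMs) return { matched: false, expired: true };
      // wake: "off" still consumes the record but suppresses the wake.
      if (wake === "off") {
        return { matched: true, wake: false, wake_reason: "late_reply_to_pending_suppressed" };
      }
      return { matched: true, wake: true, wake_reason: "late_reply_to_pending" };
    },
  };
}
```

Deleting the record before any other check is what keeps a duplicate or re-delivered reply from firing a second wake in this model.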
268
+ ### Keeping a long turn marked busy
269
+
270
+ `wake: "auto"` skips a peer that is **freshly `busy`** (mid-turn — its hooks deliver, so a keystroke wake would be noise). The `busy` marker is set at turn start (UserPromptSubmit hook) and **re-stamped on every tool call** (PreToolUse hook, v0.15+), so a long *active* turn stays fresh and never invites a spurious wake. A turn that stops making tool calls — one giant single tool call, or a crash without a clean Stop — ages past `OXTAIL_ACTIVITY_BUSY_TTL_MS` (default 10 min) and then *does* wake, which is the intended stale-busy → recovery behavior. Widen the TTL for deployments with very long single-tool-call turns.
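As a rough illustration of the freshness rule above, the sketch below decides whether a lenient (`wake: "auto"`) wake should fire for a given activity marker. It assumes a marker shaped like `{ state, stampedAtMs }` and a function name `shouldWakeLenient`, both hypothetical. It also treats a marker exactly at the TTL as still fresh, which the README does not specify.

```javascript
// Hypothetical sketch of the lenient wake decision; marker shape is assumed.
const DEFAULT_BUSY_TTL_MS = 10 * 60 * 1000; // README default for OXTAIL_ACTIVITY_BUSY_TTL_MS

function shouldWakeLenient(marker, nowMs, busyTtlMs = DEFAULT_BUSY_TTL_MS) {
  // Missing or unknown state: the lenient path wakes.
  if (!marker) return true;
  const fresh = nowMs - marker.stampedAtMs <= busyTtlMs;
  // Only a freshly busy peer is skipped; idle or stale-busy peers are woken.
  if (marker.state === "busy" && fresh) return false;
  return true;
}
```

Because the `PreToolUse` hook re-stamps the marker on every tool call, `stampedAtMs` keeps moving forward during an active turn, so the skip branch continues to apply until tool calls stop for longer than the TTL.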
252
271
 
253
272
  ### Recommended permissions for autonomous agent-to-agent collaboration
254
273
 
@@ -306,8 +325,9 @@ A scheduled CI job (`.github/workflows/codex-drift.yml`, also runnable on demand
306
325
 
307
326
  ## Status
308
327
 
309
- v0.13.0. Pushes the autonomous peer-messaging matrix toward zero human relay, hardens the wake path, then makes correlated replies atomic.
328
+ v0.15.0. Pushes the autonomous peer-messaging matrix toward zero human relay, hardens the wake path, makes correlated replies atomic, and makes delegation durable across long (minutes-to-hours) efforts.
310
329
 
330
+ - **Durable `ask_peer` + long-effort liveness (v0.15.0).** A timed-out `ask_peer` records a pending obligation (`~/.oxtail/pending-ask/`, keyed on requester `session_id` + `request_id`, written *before* a final authoritative union-drain), so the peer's reply — arriving minutes or hours later — *wakes the requester back* (`wake_reason: "late_reply_to_pending"`) instead of landing silently. The pull-back takes the lenient wake path, so it reaches even a markerless idle Codex requester — closing the last wake-on-reply asymmetry. The reply drain unions the requester's sibling MCP-child pids (and sweeps migrate-crash duplicates) so a dual-scope reply can't strand. Separately, the `PreToolUse` hook now re-stamps the `busy` marker every tool call, so a long *active* turn never reads as stale-busy and invites a spurious wake. New env: `OXTAIL_PENDING_ASK_TTL_MS` (1h), `OXTAIL_ACTIVITY_BUSY_TTL_MS` (10m); `ask_peer` default timeout 45s→60s.
311
331
  - **Reply by id (v0.13.0).** `reply_to_message(message_id, body)` removes the manual `target` + `reply_to` rewiring that silently degraded a correlated exchange into loose mailbox traffic: the server looks the inbound envelope up in a durable per-session **received-ledger** (`~/.oxtail/received/<hash(session_id)>.jsonl`), derives the reply target and `reply_to` itself, and enforces ownership structurally (you can only reply to a message delivered to you). The ledger is written *before* the mailbox line is visible — so a handle the hook displays is always resolvable even though both delivery paths destroy the queue entry once it is handed off. Fail-closed on an unknown/aged-out id.
312
332
  - **Wake-on-reply (v0.11.0).** A reply — `send_message` with `reply_to` — auto-wakes a freshly-idle requester by default, so an awaited answer doesn't strand an idle peer. Strictly gated (fresh-idle only, per-target rate limit, one-wake dedupe, `OXTAIL_AUTOWAKE=off` kill-switch). `wake:"off"` opts out; explicit `wake:"auto"` is the escape hatch for a requester without an idle marker (Codex / hookless Claude).
313
333
  - **Wake hardening (v0.12.0).** Wake keystrokes only ever target the pane the process tree confirms hosts the peer's `server_pid` — never a self-written `tmux_pane`/`tmux_session`, and registry entries whose `server_pid` doesn't match their filename are rejected. Rapid repeat wakes to one peer are coalesced (`skipped_debounced`). `oxtail diagnose` summarizes wake outcomes from `MCP_TRACE_FILE`, and a scheduled CI job flags drift in Codex's paste-burst window before it can break the wake.
@@ -42,6 +42,18 @@ if [ ! -t 0 ]; then
42
42
  fi
43
43
  [ -z "$sid" ] && exit 0
44
44
 
45
+ # Re-stamp "busy" on EVERY tool call (before any early-exit below) so a long,
46
+ # ACTIVE turn keeps a fresh marker and never reads as stale-busy (>TTL) to a
47
+ # peer's wake:auto. UserPromptSubmit sets "busy" once at turn start; without this
48
+ # a turn outrunning the TTL would invite a spurious keystroke wake into a working
49
+ # agent. The Stop hook flips this back to "idle" on a real stop. Keyed by
50
+ # session_id; sanitization MUST match the server's activitySessionKey().
51
+ safe_sid=$(printf '%s' "$sid" | tr -c 'A-Za-z0-9_-' '_')
52
+ [ -n "$safe_sid" ] && {
53
+ mkdir -p "$HOME/.oxtail/activity" 2>/dev/null || true
54
+ printf 'busy' > "$HOME/.oxtail/activity/$safe_sid" 2>/dev/null || true
55
+ }
56
+
45
57
  sessions_dir="$HOME/.oxtail/sessions"
46
58
  mailboxes_dir="$HOME/.oxtail/mailboxes"
47
59
  [ -d "$sessions_dir" ] || exit 0
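The hook comment requires the shell sanitization to match the server's `activitySessionKey()`. That function is not shown in this diff, so the following is only a guess at an equivalent JavaScript expression for the `tr -c 'A-Za-z0-9_-' '_'` step; `sanitizeSessionKey` is a hypothetical name.

```javascript
// Hypothetical JavaScript counterpart of: tr -c 'A-Za-z0-9_-' '_'
// Every character outside [A-Za-z0-9_-] becomes an underscore.
function sanitizeSessionKey(sessionId) {
  return sessionId.replace(/[^A-Za-z0-9_-]/g, "_");
}
```

For ASCII input such as a UUID the two agree character for character. They can differ on non-ASCII input, since `tr` may work on bytes while the regex works on UTF-16 code units.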
package/dist/claims.js CHANGED
@@ -155,9 +155,6 @@ function compareClaimScores(a, b) {
155
155
  }
156
156
  return a.claimed_at - b.claimed_at;
157
157
  }
158
- function scoresTie(a, b) {
159
- return compareClaimScores(a, b) === 0;
160
- }
161
158
  function claimKey(clientType, cwd, sessionId) {
162
159
  return createHash("sha256")
163
160
  .update(`${clientType} ${cwd} ${sessionId}`)
@@ -172,7 +169,14 @@ function claimPath(key) {
172
169
  // Atomic temp+rename so a concurrent reader never sees a torn write.
173
170
  export function writeClaim(input) {
174
171
  ensureClaimsDir();
175
- gcStaleClaims();
172
+ // Age-only sweep on this hot path. writeClaim can run concurrently with another
173
+ // agent's writeClaim (dual-scope, or two sessions in one project); the
174
+ // transcript-existence check is racy — a transcript that momentarily fails to
175
+ // stat would unlink a sibling's just-written claim (M6). Age is monotonic and
176
+ // race-free, so reclaim only by age here. recoverClaim already skips records
177
+ // whose transcript is gone, and the full (transcript-aware) sweep remains
178
+ // available via a direct gcStaleClaims() call.
179
+ gcStaleClaims(Date.now(), { ageOnly: true });
176
180
  const rec = {
177
181
  schema_version: 1,
178
182
  client_type: input.client_type,
@@ -245,14 +249,30 @@ export function recoverClaim(clientType, cwd, ancestors, deps = {}) {
245
249
  matches.sort((a, b) => compareClaimScores(b.score, a.score));
246
250
  const best = matches[0];
247
251
  const second = matches[1];
248
- if (second && scoresTie(best.score, second.score))
252
+ // Abstain on cross-session ambiguity. Two DISTINCT sessions that overlap the
253
+ // live chain equally (same overlap_count) AND at the same live-chain depth
254
+ // (same nearest_overlap_current) share liveness only at a common ancestor —
255
+ // the shared terminal/login-shell. The remaining tiebreakers (record-side
256
+ // depth, recency) do NOT correlate with which child actually restarted, so
257
+ // adopting either would risk cross-session misrouting (H1) — the very
258
+ // split-identity class this store exists to prevent. Return null so the caller
259
+ // falls back to the explicit claim_session next_step. (This strictly subsumes
260
+ // the old exact-tie check, which had equal overlap_count and nearest_current
261
+ // by definition.) A same-session second-best routes to the same identity and
262
+ // so can never split-route — defensive only, since the per-session claim key
263
+ // means two records can't share a session_id.
264
+ if (second &&
265
+ best.rec.session_id !== second.rec.session_id &&
266
+ best.score.overlap_count === second.score.overlap_count &&
267
+ best.score.nearest_overlap_current === second.score.nearest_overlap_current) {
249
268
  return null;
269
+ }
250
270
  return best.rec;
251
271
  }
252
272
  // Drop records that are clearly dead: transcript gone, or older than the max
253
273
  // age. Best-effort; never throws. A dead process pid alone is NOT grounds for
254
274
  // removal — that's exactly the restart case recovery exists to serve.
255
- export function gcStaleClaims(nowMs = Date.now()) {
275
+ export function gcStaleClaims(nowMs = Date.now(), opts = {}) {
256
276
  const dir = claimsDir();
257
277
  if (!existsSync(dir))
258
278
  return;
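The abstain rule added to `recoverClaim` above can be isolated into a small standalone function for experimentation. This sketch assumes the matches are already sorted best-first and carry the `rec` and `score` fields used in the diff; the wrapper name `pickUnambiguous` is hypothetical.

```javascript
// Sketch of the cross-session ambiguity check; expects matches sorted best-first.
function pickUnambiguous(sortedMatches) {
  const best = sortedMatches[0];
  const second = sortedMatches[1];
  if (!best) return null;
  // Abstain when two DISTINCT sessions overlap the live chain equally and at
  // the same depth: the remaining tiebreakers cannot tell them apart safely.
  if (second &&
      best.rec.session_id !== second.rec.session_id &&
      best.score.overlap_count === second.score.overlap_count &&
      best.score.nearest_overlap_current === second.score.nearest_overlap_current) {
    return null;
  }
  return best.rec;
}
```

A `null` result here corresponds to the caller falling back to an explicit `claim_session`, as the comment in the diff describes.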
@@ -274,7 +294,7 @@ export function gcStaleClaims(nowMs = Date.now()) {
274
294
  catch {
275
295
  continue;
276
296
  }
277
- const transcriptGone = !rec.transcript_path || !existsSync(rec.transcript_path);
297
+ const transcriptGone = !opts.ageOnly && (!rec.transcript_path || !existsSync(rec.transcript_path));
278
298
  const tooOld = nowMs - rec.claimed_at * 1000 > CLAIM_MAX_AGE_MS;
279
299
  if (transcriptGone || tooOld) {
280
300
  try {
@@ -2,6 +2,12 @@ import { closeSync, existsSync, openSync, readSync, readdirSync, statSync } from
2
2
  import { homedir } from "node:os";
3
3
  import { join } from "node:path";
4
4
  const FIVE_MIN_MS = 5 * 60 * 1000;
5
+ // started_at is whole-second granularity (Math.floor(Date.now()/1000)*1000)
6
+ // while a transcript's birth_ms is real-millisecond, so a transcript
7
+ // legitimately created in the same second can land slightly BEFORE started_at
8
+ // (delta in [-1000, 0]). Allow one second of grace below zero so the unique
9
+ // candidate isn't dropped on pure rounding (M7).
10
+ const ONE_SECOND_MS = 1000;
5
11
  const UUID_RE = /([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})/;
6
12
  // Returns the unique post-start candidate inside the window, or null if there
7
13
  // are zero or multiple. Multiple positive-delta candidates means another
@@ -10,7 +16,7 @@ const UUID_RE = /([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})/
10
16
  export function pickByDelta(candidates, startedAtMs, windowMs = FIVE_MIN_MS) {
11
17
  const ranked = candidates
12
18
  .map((c) => ({ ...c, delta: c.birth_ms - startedAtMs }))
13
- .filter((c) => c.delta > 0 && c.delta <= windowMs);
19
+ .filter((c) => c.delta > -ONE_SECOND_MS && c.delta <= windowMs);
14
20
  if (ranked.length !== 1)
15
21
  return null;
16
22
  return { session_id: ranked[0].session_id, birth_ms: ranked[0].birth_ms };
@@ -141,7 +147,7 @@ function abstainReason(type, candidates, startedAtMs) {
141
147
  }
142
148
  const ranked = candidates
143
149
  .map((c) => ({ ...c, delta: c.birth_ms - startedAtMs }))
144
- .filter((c) => c.delta > 0 && c.delta <= FIVE_MIN_MS);
150
+ .filter((c) => c.delta > -ONE_SECOND_MS && c.delta <= FIVE_MIN_MS);
145
151
  if (ranked.length === 0) {
146
152
  return {
147
153
  abstain: true,
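The effect of the one-second grace can be checked with a standalone copy of `pickByDelta` as it appears after this change. The candidate values below are made up for illustration.

```javascript
// Standalone copy of the post-change filter, for experimentation.
const FIVE_MIN_MS = 5 * 60 * 1000;
const ONE_SECOND_MS = 1000;

function pickByDelta(candidates, startedAtMs, windowMs = FIVE_MIN_MS) {
  const ranked = candidates
    .map((c) => ({ ...c, delta: c.birth_ms - startedAtMs }))
    .filter((c) => c.delta > -ONE_SECOND_MS && c.delta <= windowMs);
  if (ranked.length !== 1) return null;
  return { session_id: ranked[0].session_id, birth_ms: ranked[0].birth_ms };
}

// A transcript born 250 ms before the whole-second started_at is now accepted.
const startedAtMs = 10_000;
const accepted = pickByDelta([{ session_id: "a", birth_ms: 9_750 }], startedAtMs);
// A transcript born exactly 1000 ms earlier is still rejected (strict comparison).
const rejected = pickByDelta([{ session_id: "b", birth_ms: 9_000 }], startedAtMs);
```

Note that the filter is strict (`> -ONE_SECOND_MS`), so the accepted range is (-1000, windowMs]; a delta of exactly -1000 falls outside it even though the comment writes the interval as `[-1000, 0]`.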
package/dist/locks.js CHANGED
@@ -41,8 +41,13 @@ import { trace } from "./trace.js";
41
41
  // ALWAYS the single-winner `mkdir(lock)`, so even redundant clears can never
42
42
  // produce two owners — the worst they do is race to recreate the lock, which
43
43
  // exactly one wins.
44
- const LOCK_RETRY_LIMIT = 50;
45
44
  const LOCK_RETRY_DELAY_MS = 10;
45
+ // Total acquire budget is wall-clock, NOT a fixed retry count: a successful
46
+ // stale-clear retries mkdir immediately (no sleep) so it must not consume the
47
+ // budget without time passing — a count-based budget threw "could not acquire
48
+ // lock" spuriously under contention (H2). 2s is ample for the tiny mailbox/
49
+ // ledger critical sections and well under any caller-level timeout.
50
+ const LOCK_ACQUIRE_TIMEOUT_MS = 2_000;
46
51
  function sleepSync(ms) {
47
52
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
48
53
  }
@@ -166,7 +171,8 @@ export function clearStaleLock(lock, staleMs, traceEvent, traceCtx) {
166
171
  // releaseDirLock. The caller is responsible for creating the parent directory.
167
172
  export function acquireDirLock(lock, staleMs, traceEvent, traceCtx) {
168
173
  const token = mintToken();
169
- for (let i = 0; i < LOCK_RETRY_LIMIT; i++) {
174
+ const deadline = Date.now() + LOCK_ACQUIRE_TIMEOUT_MS;
175
+ for (;;) {
170
176
  try {
171
177
  mkdirSync(lock, { mode: 0o700 });
172
178
  writeOwner(lock, token);
@@ -176,12 +182,19 @@ export function acquireDirLock(lock, staleMs, traceEvent, traceCtx) {
176
182
  const err = e;
177
183
  if (err.code !== "EEXIST")
178
184
  throw err;
179
- if (clearStaleLock(lock, staleMs, traceEvent, traceCtx))
180
- continue;
181
- sleepSync(LOCK_RETRY_DELAY_MS);
185
+ // A successful stale-clear means the lock is gone: loop straight back to
186
+ // mkdir WITHOUT sleeping, to grab it before another contender (this retry
187
+ // must not consume the budget without time passing). Otherwise — a fresh
188
+ // holder or a lost steal — back off before retrying.
189
+ if (!clearStaleLock(lock, staleMs, traceEvent, traceCtx)) {
190
+ sleepSync(LOCK_RETRY_DELAY_MS);
191
+ }
192
+ }
193
+ // Wall-clock budget so the no-sleep stale-clear path cannot spin forever.
194
+ if (Date.now() >= deadline) {
195
+ throw new Error(`could not acquire lock at ${lock}`);
182
196
  }
183
197
  }
184
- throw new Error(`could not acquire lock at ${lock}`);
185
198
  }
186
199
  // Release the lock — but only if we PROVABLY still own it (owner === our token).
187
200
  // A holder that stalled past the stale window and was stolen from sees a
@@ -192,7 +205,13 @@ export function acquireDirLock(lock, staleMs, traceEvent, traceCtx) {
192
205
  // stale lock and is reclaimed by clearStaleLock, strictly safer than a stomp.
193
206
  export function releaseDirLock(lock, token) {
194
207
  if (!token) {
195
- removeLock(lock); // no token to verify (defensive/legacy) best-effort
208
+ // No token to prove ownership. An empty token reaches here only from a
209
+ // lockTokens Map miss (an acquire that threw, or a future same-key nested
210
+ // release), so removing would stomp whatever lock currently exists —
211
+ // possibly a DIFFERENT owner's fresh one. Leave it: a genuinely leaked lock
212
+ // ages into a stale lock and is reclaimed by clearStaleLock, strictly safer
213
+ // than a stomp (H3).
214
+ trace("lock_release_skipped_no_token", { lock });
196
215
  return;
197
216
  }
198
217
  const owner = readOwner(lock);
package/dist/mailbox.js CHANGED
@@ -188,6 +188,13 @@ export function requeueMany(target_pid, msgs) {
188
188
  // two unioned sibling mailboxes. Both copies are drained (so neither lingers) but
189
189
  // the message is returned ONCE. message_id is a unique per-message nonce, so this
190
190
  // only ever collapses true duplicates, never two distinct messages.
191
+ // Union-drain a session's mailboxes (one per server_pid it has used), deduping
192
+ // by message_id so a migrate crash-window duplicate (same id in two sibling
193
+ // mailboxes) is delivered once. INVARIANT: every unioned pid is drained (and so
194
+ // truncated) before returning — do NOT add a budget/early-exit short-circuit
195
+ // here. The dedup is per-call only, so a duplicate left in an un-drained sibling
196
+ // would re-surface on a later call with no cross-call dedup (M5). Budgeting
197
+ // belongs in the caller, applied to the already-fully-drained result.
191
198
  export function drainMany(pids) {
192
199
  const out = [];
193
200
  const seenPids = new Set();
@@ -265,8 +272,42 @@ export function migrateMailbox(fromPid, toPid) {
265
272
  }
266
273
  if (!raw || !raw.trim())
267
274
  return 0;
268
- const block = raw.endsWith("\n") ? raw : raw + "\n";
269
- const count = raw.split("\n").filter((l) => l.trim().length > 0).length;
275
+ // Migrate only VALID records, reserialized canonically. A crash mid-append
276
+ // into the source can leave a torn final line; copying raw bytes would glue
277
+ // a synthesized newline onto that fragment, promoting garbage into a
278
+ // standalone (unparseable) line in the dest AND over-counting it (H4). Parse
279
+ // each line with the same guard drain uses, drop torn/invalid ones, and
280
+ // rebuild a clean block so the count reflects real, deliverable messages.
281
+ const valid = [];
282
+ for (const line of raw.split("\n")) {
283
+ if (!line.trim())
284
+ continue;
285
+ let parsed;
286
+ try {
287
+ parsed = JSON.parse(line);
288
+ }
289
+ catch {
290
+ trace("mailbox_migrate_skip_invalid", { fromPid, toPid, line });
291
+ continue;
292
+ }
293
+ if (parsed &&
294
+ typeof parsed === "object" &&
295
+ parsed.schema_version === 1 &&
296
+ typeof parsed.id === "string" &&
297
+ typeof parsed.body === "string") {
298
+ valid.push(parsed);
299
+ }
300
+ else {
301
+ trace("mailbox_migrate_skip_invalid", { fromPid, toPid, line });
302
+ }
303
+ }
304
+ if (valid.length === 0) {
305
+ // Only torn/garbage lines — clear the source and report nothing migrated.
306
+ truncateSync(src, 0);
307
+ return 0;
308
+ }
309
+ // serializeMailboxLine already terminates each line with "\n", so join("").
310
+ const block = valid.map((m) => serializeMailboxLine(m)).join("");
270
311
  acquireLock(toPid);
271
312
  try {
272
313
  appendLines(mailboxPath(toPid), block);
@@ -276,7 +317,7 @@ export function migrateMailbox(fromPid, toPid) {
276
317
  }
277
318
  // Append succeeded → clear the source (still under the source lock).
278
319
  truncateSync(src, 0);
279
- return count;
320
+ return valid.length;
280
321
  }
281
322
  finally {
282
323
  releaseLock(fromPid);
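The validation step added to `migrateMailbox` can be tried in isolation. The sketch below copies the parse-and-guard loop into a pure function; `parseValidRecords` is a hypothetical name, and `JSON.stringify(m) + "\n"` stands in for `serializeMailboxLine`, whose exact output is not shown in this diff.

```javascript
// Sketch of the valid-record filter; tracing is omitted.
function parseValidRecords(raw) {
  const valid = [];
  for (const line of raw.split("\n")) {
    if (!line.trim()) continue;
    let parsed;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // torn or garbage line: dropped, not migrated
    }
    if (parsed &&
        typeof parsed === "object" &&
        parsed.schema_version === 1 &&
        typeof parsed.id === "string" &&
        typeof parsed.body === "string") {
      valid.push(parsed);
    }
  }
  return valid;
}

// Stand-in for serializeMailboxLine (assumed to emit one JSON line plus "\n").
const serializeLine = (m) => JSON.stringify(m) + "\n";

const raw = '{"schema_version":1,"id":"a","body":"hi"}\n{"schema_version":1,"id":"b","bo';
const valid = parseValidRecords(raw);
const block = valid.map(serializeLine).join("");
```

With this input the old line-counting code would have reported 2 migrated messages and appended the torn fragment as its own line. The filtered version reports 1 and writes only the parseable record.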
@@ -349,6 +390,79 @@ export function drainMatchingSession(my_pid, from_session_id) {
349
390
  export function drainMatchingReply(my_pid, from_session_id, reply_to) {
350
391
  return drainFirstMatching(my_pid, (msg) => msg.from_session_id === from_session_id && msg.reply_to === reply_to);
351
392
  }
393
+ // Union variant of drainMatchingReply across a session's sibling/previous MCP
394
+ // child pids. ask_peer waits on the requester's OWN pid, but the reply is
395
+ // addressed by client.session_id and resolveTarget(readAll) enqueues it to the
396
+ // session's freshest sibling — which, in a dual-scope / pid-rotation setup, may
397
+ // NOT be the pid blocked in ask_peer. A single-pid drain would then miss a reply
398
+ // that already landed in a sibling mailbox and strand it. Mirrors the session
399
+ // union read_my_messages / the PreToolUse hook already use.
400
+ //
401
+ // Returns the FIRST matching reply across the (deduped) pids. It does NOT pull
402
+ // every match: two DISTINCT replies to the same request_id (an answer + a
403
+ // follow-up correction) must not both be drained with one silently dropped — the
404
+ // second stays for read_my_messages. But once the first match is found, it DOES
405
+ // sweep an exact same-message_id duplicate out of the remaining pids: a
406
+ // migrate-crash can leave the SAME message in two siblings, and if we returned
407
+ // one copy and left the other, a later union drain would see only the lone
408
+ // survivor and re-deliver it as a "new" message. Sweeping by message_id removes
409
+ // the duplicate while leaving any distinct reply intact.
410
+ //
411
+ // `skipped` reports pids that could not be inspected (lock contention after the
412
+ // internal acquire-retry budget). The poll tolerates this (it retries next tick);
413
+ // the authoritative final drain in ask_peer retries the skipped pids so a
414
+ // transiently-locked sibling holding the reply isn't mistaken for "no reply".
415
+ export function drainMatchingReplyManyChecked(pids, from_session_id, reply_to) {
416
+ const seen = new Set();
417
+ const skipped = [];
418
+ let found = null;
419
+ for (const pid of pids) {
420
+ if (seen.has(pid))
421
+ continue;
422
+ seen.add(pid);
423
+ try {
424
+ if (!found) {
425
+ const m = drainMatchingReply(pid, from_session_id, reply_to);
426
+ if (m)
427
+ found = m;
428
+ }
429
+ else {
430
+ // Sweep an exact-message_id duplicate (migrate-crash) from this sibling;
431
+ // a distinct reply (different id) is left untouched.
432
+ const dupId = found.id;
433
+ drainFirstMatching(pid, (msg) => msg.id === dupId);
434
+ }
435
+ }
436
+ catch {
437
+ skipped.push(pid);
438
+ }
439
+ }
440
+ return { reply: found, skipped };
441
+ }
442
+ export function drainMatchingReplyMany(pids, from_session_id, reply_to) {
443
+ return drainMatchingReplyManyChecked(pids, from_session_id, reply_to).reply;
444
+ }
445
+ // Best-effort removal of an EXACT message_id from each of `pids`. Used to clean
446
+ // up a migrate-crash duplicate that was left in a pid the union drain couldn't
447
+ // inspect (lock contention) at the time the reply was pulled from another pid —
448
+ // otherwise a later read_my_messages would re-deliver the lone survivor as a
449
+ // "new" message. Matches by message_id only, so a DISTINCT reply (different id)
450
+ // in the same pid is never touched. Per-pid errors are skipped.
451
+ export function sweepMessageId(pids, messageId) {
452
+ const seen = new Set();
453
+ for (const pid of pids) {
454
+ if (seen.has(pid))
455
+ continue;
456
+ seen.add(pid);
457
+ try {
458
+ drainFirstMatching(pid, (msg) => msg.id === messageId);
459
+ }
460
+ catch {
461
+ // best effort — a still-locked pid is left; the dup is a rare crash-window
462
+ // artifact and the cost is at most one re-delivered (same-id) message.
463
+ }
464
+ }
465
+ }
352
466
  function drainFirstMatching(my_pid, matches) {
353
467
  acquireLock(my_pid);
354
468
  try {
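The union drain added in this hunk can be illustrated without the lock or file layer. The following is a minimal sketch, assuming a hypothetical in-memory `Map` of pid to message array in place of the real mailbox files; `drainFirst` and `drainReplyAcross` are illustrative names, not functions from the package. It shows the two behaviours the comments describe: only the first matching reply is returned, and an exact same-id duplicate in a later sibling is swept while a distinct reply is left in place.

```javascript
// Hypothetical in-memory stand-in for the per-pid mailbox files.
function drainFirst(mailboxes, pid, matches) {
  const box = mailboxes.get(pid);
  if (!box) return null;
  const i = box.findIndex(matches);
  if (i === -1) return null;
  return box.splice(i, 1)[0]; // remove and return the first match only
}

function drainReplyAcross(mailboxes, pids, fromSessionId, replyTo) {
  const seen = new Set();
  let found = null;
  for (const pid of pids) {
    if (seen.has(pid)) continue; // dedupe repeated pids
    seen.add(pid);
    if (!found) {
      found = drainFirst(
        mailboxes,
        pid,
        (m) => m.from_session_id === fromSessionId && m.reply_to === replyTo,
      );
    } else {
      // Sweep only an exact same-id duplicate; a distinct reply stays queued.
      const dupId = found.id;
      drainFirst(mailboxes, pid, (m) => m.id === dupId);
    }
  }
  return found;
}

const mailboxes = new Map([
  [101, [{ id: "a", from_session_id: "peer", reply_to: "req-1" }]],
  [102, [
    { id: "a", from_session_id: "peer", reply_to: "req-1" }, // crash duplicate
    { id: "b", from_session_id: "peer", reply_to: "req-1" }, // distinct follow-up
  ]],
]);

const reply = drainReplyAcross(mailboxes, [101, 102, 101], "peer", "req-1");
console.log(reply.id, mailboxes.get(102).map((m) => m.id));
```

In this model the follow-up `b` remains in pid 102 for a later read, which is the behaviour the hunk's comment attributes to `read_my_messages`.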
package/dist/pending-ask.js ADDED
@@ -0,0 +1,167 @@
1
+ // Pending-ask registry — durable ask_peer (the long-effort liveness fix).
2
+ //
3
+ // When an ask_peer wait TIMES OUT, the requester records a "pending ask" here:
4
+ // a durable note that it is still awaiting a reply correlated by request_id.
5
+ // When that reply eventually arrives — minutes or hours later, long after the
6
+ // 5-minute fresh-idle window the strict reply-default wake is gated to — the
7
+ // reply handler (server.ts resolveSendWake) finds the matching record and fires
8
+ // a LENIENT wake to pull the requester back, instead of stranding it idle until
9
+ // its next turn. This is what turns ask_peer into "delegate a long task and get
10
+ // pulled back the moment it's done", and it also reaches a markerless idle Codex
11
+ // requester that the fresh-idle gate would skip as skipped_no_fresh_idle.
12
+ //
13
+ // Design mirrors autowake.ts exactly: one small file per record under
14
+ // ~/.oxtail/pending-ask/, mtime is the source of truth (driven by an injected
15
+ // nowMs so it's deterministic in tests), the body is a debug breadcrumb, GC'd by
16
+ // age. Keyed on the REQUESTER's client.session_id + the request_id (the agent
17
+ // identity per AGENTS.md, never server_pid). Best-effort: a broken store
18
+ // degrades to "no record" — it NEVER throws, because a thrown error here would
19
+ // surface on an already-enqueued/already-delivered message and invite a retry.
20
+ import { createHash } from "node:crypto";
21
+ import { closeSync, mkdirSync, openSync, readdirSync, statSync, unlinkSync, utimesSync, writeFileSync, } from "node:fs";
22
+ import { homedir } from "node:os";
23
+ import { join } from "node:path";
24
+ function envPosInt(name, def, env = process.env) {
25
+ const v = env[name];
26
+ if (!v)
27
+ return def;
28
+ const n = Number(v);
29
+ return Number.isFinite(n) && n > 0 ? n : def;
30
+ }
31
+ // How long a recorded pending-ask is honored before GC reclaims it. Sized for
32
+ // long efforts (a delegated task that runs for the better part of an hour) — a
33
+ // reply arriving after this window still delivers durably via read_my_messages,
34
+ // it just won't fire the pull-back wake. Generous by default; tunable.
35
+ export const PENDING_ASK_TTL_MS = envPosInt("OXTAIL_PENDING_ASK_TTL_MS", 60 * 60 * 1000);
36
+ export function defaultPendingAskDir() {
37
+ return join(homedir(), ".oxtail", "pending-ask");
38
+ }
39
+ function hash(s) {
40
+ // request_id is caller-influenced, so never build a filename from it directly.
41
+ return createHash("sha256").update(s).digest("hex").slice(0, 32);
42
+ }
43
+ function recordPath(dir, sessionId, requestId) {
44
+ // JSON-encode the pair so the (sessionId, requestId) boundary is unambiguous
45
+ // and can't be crafted to collide with a different split (mirrors autowake.ts).
46
+ return join(dir, `p-${hash(JSON.stringify([sessionId, requestId]))}`);
47
+ }
48
+ function setMtime(path, nowMs) {
49
+ const t = nowMs / 1000;
50
+ try {
51
+ utimesSync(path, t, t);
52
+ }
53
+ catch {
54
+ // best effort — mtime drives TTL math; a failure only skews freshness by the
55
+ // small real-vs-injected clock delta.
56
+ }
57
+ }
58
+ // Record a pending ask. Atomic create-exclusive so a duplicate record (same
59
+ // requester + request_id) is a no-op rather than resetting the TTL clock.
60
+ // Returns true if a record now exists for this pair (freshly written OR already
61
+ // present), false only on a missing identity or an unusable store. Never throws.
62
+ export function recordPendingAsk(dir, sessionId, requestId, nowMs) {
63
+ // Never key on an empty identity: an unclaimed requester can't be correlated
64
+ // or replied-to, so there's nothing to wake later.
65
+ if (!sessionId || !requestId)
66
+ return false;
67
+ try {
68
+ mkdirSync(dir, { recursive: true, mode: 0o700 });
69
+ const p = recordPath(dir, sessionId, requestId);
70
+ try {
71
+ const fd = openSync(p, "wx"); // atomic create-exclusive
72
+ try {
73
+ writeFileSync(fd, JSON.stringify({ sessionId, requestId, at: nowMs }));
74
+ }
75
+ finally {
76
+ closeSync(fd);
77
+ }
78
+ setMtime(p, nowMs);
79
+ return true;
80
+ }
81
+ catch (e) {
82
+ // EEXIST: a record already exists → fine, leave its original mtime so the
83
+ // TTL counts from the first record, not this duplicate.
84
+ if (e.code === "EEXIST")
85
+ return true;
86
+ throw e;
87
+ }
88
+ }
89
+ catch {
90
+ // Store unusable (e.g. ~/.oxtail/pending-ask is a file, permission error) —
91
+ // degrade to "no durable record"; the strict fresh-idle reply-default still
92
+ // covers a Claude requester that went idle <5 min ago.
93
+ return false;
94
+ }
95
+ }
96
+ // Read-only: is there a live (within TTL) pending-ask for this pair?
97
+ export function hasPendingAsk(dir, sessionId, requestId, nowMs, ttlMs = PENDING_ASK_TTL_MS) {
98
+ if (!sessionId || !requestId)
99
+ return false;
100
+ try {
101
+ const st = statSync(recordPath(dir, sessionId, requestId));
102
+ return nowMs - st.mtimeMs < ttlMs;
103
+ }
104
+ catch {
105
+ return false;
106
+ }
107
+ }
108
+ // Atomically consume (delete) the pending-ask for this pair. Returns true iff a
109
+ // record existed, was within the TTL, and THIS caller removed it — the
110
+ // single-winner signal the reply handler uses to fire exactly one pull-back
111
+ // wake. A concurrent second reply (or a re-delivered duplicate) racing the same
112
+ // key loses: unlinkSync throws ENOENT for the loser, so it returns false and
113
+ // does not re-wake.
114
+ //
115
+ // When nowMs is supplied, an OVER-TTL record is still unlinked (so a stale
116
+ // record can't leak) but the function returns false — honoring the contract that
117
+ // a reply arriving after PENDING_ASK_TTL_MS still delivers durably but does NOT
118
+ // fire the late wake. Omit nowMs to consume regardless of age (used right after
119
+ // recordPendingAsk, where the record is freshly written).
120
+ export function consumePendingAsk(dir, sessionId, requestId, nowMs, ttlMs = PENDING_ASK_TTL_MS) {
121
+ if (!sessionId || !requestId)
122
+ return false;
123
+ const p = recordPath(dir, sessionId, requestId);
124
+ let withinTtl = true;
125
+ if (nowMs !== undefined) {
126
+ try {
127
+ withinTtl = nowMs - statSync(p).mtimeMs < ttlMs;
128
+ }
129
+ catch {
130
+ return false; // no record to consume
131
+ }
132
+ }
133
+ try {
134
+ unlinkSync(p); // remove regardless of age so a stale record can't leak
135
+ }
136
+ catch {
137
+ // ENOENT (no record / already consumed by a racing caller) or any store
138
+ // error → not ours to act on.
139
+ return false;
140
+ }
141
+ return withinTtl;
142
+ }
143
+ // Remove pending-ask records older than the TTL. Cheap, low-volume dir; run
144
+ // opportunistically so abandoned records (a reply that never came) can't
145
+ // accumulate. Mirrors gcAutowake.
146
+ export function gcPendingAsk(dir, nowMs, ttlMs = PENDING_ASK_TTL_MS) {
147
+ let names;
148
+ try {
149
+ names = readdirSync(dir);
150
+ }
151
+ catch {
152
+ return; // dir not created yet
153
+ }
154
+ for (const name of names) {
155
+ if (name[0] !== "p")
156
+ continue;
157
+ const p = join(dir, name);
158
+ try {
159
+ const st = statSync(p);
160
+ if (nowMs - st.mtimeMs >= ttlMs)
161
+ unlinkSync(p);
162
+ }
163
+ catch {
164
+ // best effort
165
+ }
166
+ }
167
+ }
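The record/consume contract in this file can be modelled in a few lines. This is a minimal sketch, assuming a hypothetical in-memory `Map` in place of the `~/.oxtail/pending-ask/` files and their mtimes; `record`, `consume`, and `store` are illustrative names, and the "omit nowMs" variant of `consumePendingAsk` is left out for brevity. It shows that a duplicate record does not reset the TTL clock, that only one consumer wins, and that an over-TTL record is removed without reporting a wake.

```javascript
const TTL_MS = 60 * 60 * 1000; // same default as PENDING_ASK_TTL_MS in the diff

// Hypothetical stand-in for the record directory: key -> recorded-at (ms).
const store = new Map();
const keyFor = (sessionId, requestId) => JSON.stringify([sessionId, requestId]);

function record(sessionId, requestId, nowMs) {
  if (!sessionId || !requestId) return false; // never key on an empty identity
  const k = keyFor(sessionId, requestId);
  if (!store.has(k)) store.set(k, nowMs); // a duplicate never resets the clock
  return true;
}

function consume(sessionId, requestId, nowMs, ttlMs = TTL_MS) {
  if (!sessionId || !requestId) return false;
  const k = keyFor(sessionId, requestId);
  if (!store.has(k)) return false; // nothing to consume, or a racer already won
  const withinTtl = nowMs - store.get(k) < ttlMs;
  store.delete(k); // removed regardless of age so a stale record cannot leak
  return withinTtl;
}

const t0 = 1000000;
record("sess-A", "req-1", t0);
record("sess-A", "req-1", t0 + 5000); // duplicate: original timestamp is kept
const firstReply = consume("sess-A", "req-1", t0 + 10 * 60 * 1000);
const duplicateReply = consume("sess-A", "req-1", t0 + 10 * 60 * 1000);

record("sess-A", "req-2", t0);
const lateReply = consume("sess-A", "req-2", t0 + TTL_MS); // exactly at TTL: expired
console.log(firstReply, duplicateReply, lateReply, store.size);
```

The real module gets the single-winner property from `unlinkSync` failing with ENOENT for the loser; the `Map` lookup here only imitates that outcome in a single process.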
package/dist/received.js CHANGED
@@ -88,24 +88,43 @@ function readLines(sessionId) {
88
88
  throw err;
89
89
  }
90
90
  }
91
+ // The message_id of a serialized ledger line, or null if unparseable. Used to
92
+ // keep recordReceived idempotent without fully deserializing every envelope.
93
+ function lineMessageId(line) {
94
+ try {
95
+ const parsed = JSON.parse(line);
96
+ return typeof parsed.id === "string" ? parsed.id : null;
97
+ }
98
+ catch {
99
+ return null;
100
+ }
101
+ }
91
102
  // Append an inbound envelope to the receiver's ledger and prune to receivedMax()
92
103
  // (oldest dropped first). Called by delivery.ts BEFORE the mailbox append.
104
+ // Idempotent by message_id: re-recording an id replaces its prior line.
93
105
  export function recordReceived(receiverSessionId, msg) {
94
106
  if (!receiverSessionId)
95
107
  return;
96
108
  acquireLock(receiverSessionId);
97
109
  try {
98
110
  const lines = readLines(receiverSessionId);
99
- lines.push(JSON.stringify(msg));
111
+ // Idempotent by message_id: a re-record (ask_peer abort recovery, chained
112
+ // re-delivery) must not append a duplicate ledger line. Duplicates waste the
113
+ // receivedMax prune budget and can evict still-needed handles early,
114
+ // surfacing as spurious reply_to_message "message-not-found" (M4). Drop any
115
+ // prior line for this id, then append the latest. lookupReceived already
116
+ // returns first-match newest-first, so behavior is unchanged for callers.
117
+ const deduped = msg.id ? lines.filter((l) => lineMessageId(l) !== msg.id) : lines;
118
+ deduped.push(JSON.stringify(msg));
100
119
  const max = receivedMax();
101
- let pruned = lines;
102
- if (lines.length > max) {
103
- pruned = lines.slice(lines.length - max);
120
+ let pruned = deduped;
121
+ if (deduped.length > max) {
122
+ pruned = deduped.slice(deduped.length - max);
104
123
  // No silent caps: a dropped handle becomes reply_to_message
105
124
  // "message-not-found", so surface that the bound bit.
106
125
  trace("received_ledger_pruned", {
107
126
  session_id: receiverSessionId,
108
- dropped: lines.length - max,
127
+ dropped: deduped.length - max,
109
128
  kept: max,
110
129
  });
111
130
  }
@@ -136,6 +155,8 @@ export function lookupReceived(receiverSessionId, messageId) {
136
155
  }
137
156
  if (parsed &&
138
157
  typeof parsed === "object" &&
158
+ parsed.schema_version === 1 &&
159
+ typeof parsed.body === "string" &&
139
160
  parsed.id === messageId) {
140
161
  return parsed;
141
162
  }
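The effect of the idempotent append on the prune budget is easiest to see with a tiny ledger. The sketch below assumes a plain array of serialized lines instead of the locked ledger file, and `appendIdempotent` is a hypothetical helper that mirrors the dedupe-then-prune order shown in the hunk; `lineMessageId` is copied from the diff.

```javascript
// Copied from the hunk: the id of a serialized line, or null if unparseable.
function lineMessageId(line) {
  try {
    const parsed = JSON.parse(line);
    return typeof parsed.id === "string" ? parsed.id : null;
  } catch {
    return null;
  }
}

// Hypothetical helper: drop any prior line for msg.id, append, then prune to max.
function appendIdempotent(lines, msg, max) {
  const deduped = msg.id
    ? lines.filter((l) => lineMessageId(l) !== msg.id)
    : lines.slice();
  deduped.push(JSON.stringify(msg));
  return deduped.length > max ? deduped.slice(deduped.length - max) : deduped;
}

let ledger = [];
const inbound = [
  { id: "m1", body: "first" },
  { id: "m2", body: "second" },
  { id: "m2", body: "second" }, // re-record of the same message
];
for (const msg of inbound) {
  ledger = appendIdempotent(ledger, msg, 2);
}

const ids = ledger.map(lineMessageId);
console.log(ids);
```

With a cap of 2, a plain append would have kept two copies of `m2` and evicted `m1`; the deduped ledger still holds both handles.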
package/dist/registry.js CHANGED
@@ -68,9 +68,21 @@ export function isValidTmuxSession(s) {
68
68
  // target, so we refuse rather than fall back to the self-written cached value.
69
69
  // - the resolved pane isn't a well-formed pane id (tmux output anomaly).
70
70
  // resolvePane is injected in tests; production uses currentPaneForServerPid.
71
- export function chooseVerifiedWakePane(peer, resolvePane = currentPaneForServerPid) {
71
+ export function chooseVerifiedWakePane(peer, resolvePane = currentPaneForServerPid, resolveSig = processStartSig) {
72
72
  if (!peer.tmux_pane)
73
73
  return null;
74
+ // PID-reuse guard: if the entry recorded the server process's start-time
75
+ // signature, confirm the live pid is STILL that process before resolving and
76
+ // waking its pane. Otherwise an OS-recycled pid — now an unrelated process
77
+ // that happens to sit under a different tmux pane — would resolve to, and get
78
+ // our wake keystrokes typed into, a stranger's pane (M3). Only refuse on a
79
+ // positively-different signature; an empty reading (transient ps failure)
80
+ // falls through to pane resolution, which fails closed for a truly dead pid.
81
+ if (peer.proc_sig) {
82
+ const liveSig = resolveSig(peer.server_pid);
83
+ if (liveSig && liveSig !== peer.proc_sig)
84
+ return null;
85
+ }
74
86
  const live = resolvePane(peer.server_pid);
75
87
  if (!live || !isValidTmuxPane(live))
76
88
  return null;
@@ -203,11 +215,36 @@ export function resolveTmuxPane(env = process.env, pid = process.pid) {
203
215
  export function currentPaneForServerPid(serverPid) {
204
216
  return findTmuxPaneByAncestry(serverPid, listTmuxPanePids(), listAllPpids());
205
217
  }
218
+ // The OS start-time signature (lstart) of a process, or "" if it can't be read
219
+ // (dead pid, or ps unavailable). Same provenance signal claims.ts uses on
220
+ // ancestor pids: an OS-recycled pid yields a DIFFERENT start time, so comparing
221
+ // a live pid's signature against one captured at register time detects pid reuse
222
+ // — distinguishing "our process is still alive" from "the pid now belongs to an
223
+ // unrelated process."
224
+ export function processStartSig(pid) {
225
+ try {
226
+ return execFileSync("ps", ["-o", "lstart=", "-p", String(pid)], {
227
+ encoding: "utf8",
228
+ stdio: ["ignore", "pipe", "pipe"],
229
+ }).trim();
230
+ }
231
+ catch {
232
+ return "";
233
+ }
234
+ }
235
+ // A process's start time never changes, so capture our own once and reuse it.
236
+ let cachedSelfProcSig;
237
+ function selfProcSig() {
238
+ if (cachedSelfProcSig === undefined)
239
+ cachedSelfProcSig = processStartSig(process.pid);
240
+ return cachedSelfProcSig;
241
+ }
206
242
  export function buildEntry(client, env = process.env) {
207
243
  const tmux_pane = resolveTmuxPane(env);
208
244
  return {
209
245
  server_pid: process.pid,
210
246
  started_at: Math.floor(Date.now() / 1000),
247
+ proc_sig: selfProcSig(),
211
248
  client,
212
249
  tmux_pane,
213
250
  tmux_session: resolveTmuxSessionFromPane(tmux_pane),
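The PID-reuse guard in this file only refuses on a positively different signature. The following is a minimal sketch of that decision, assuming the signature reader is injected (as `resolveSig` is in `chooseVerifiedWakePane`); `stillSameProcess` is a hypothetical name and the signature strings are made-up sample values, not real `ps` output.

```javascript
// Hypothetical helper isolating the comparison used by the guard.
function stillSameProcess(entry, resolveSig) {
  if (!entry.proc_sig) return true; // older entry without a signature: nothing to compare
  const liveSig = resolveSig(entry.server_pid);
  // An empty reading is indeterminate, so only a non-empty mismatch is refused.
  return !(liveSig && liveSig !== entry.proc_sig);
}

const entry = { server_pid: 4242, proc_sig: "Mon Jan  6 10:00:00 2025" };

const same = stillSameProcess(entry, () => "Mon Jan  6 10:00:00 2025");
const recycled = stillSameProcess(entry, () => "Tue Jan  7 09:30:00 2025");
const unreadable = stillSameProcess(entry, () => "");
const legacy = stillSameProcess({ server_pid: 4242 }, () => "anything");
console.log(same, recycled, unreadable, legacy);
```

Treating an empty reading as "not refused" matches the comment in the hunk: the later pane resolution is what fails closed for a pid that is actually dead.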
package/dist/server.js CHANGED
@@ -10,12 +10,13 @@ import { dirname, join, sep } from "node:path";
10
10
  import { clientFromHandshake, detectClient, enrichWithDiagnosis, transcriptPathFor, } from "./clients.js";
11
11
  import { isAbstain } from "./detect/index.js";
12
12
  import { trace } from "./trace.js";
13
- import { buildEntry, chooseVerifiedWakePane, findByTmuxSession, readAll, refreshTmuxBinding, register, sessionPidsForId, unregister, } from "./registry.js";
13
+ import { buildEntry, chooseVerifiedWakePane, findByTmuxSession, processStartSig, readAll, refreshTmuxBinding, register, sessionPidsForId, unregister, } from "./registry.js";
14
14
  import * as mailbox from "./mailbox.js";
15
15
  import * as received from "./received.js";
16
16
  import { deliverExistingToPeer, deliverToPeer } from "./delivery.js";
17
17
  import { recoverClaim, resolveAncestors, writeClaim } from "./claims.js";
18
- import { decideReplyAutoWake, defaultAutowakeDir } from "./autowake.js";
18
+ import { autowakeKillSwitchOff, claimWake, decideReplyAutoWake, defaultAutowakeDir, } from "./autowake.js";
19
+ import { consumePendingAsk, defaultPendingAskDir, gcPendingAsk, recordPendingAsk, } from "./pending-ask.js";
19
20
  import { markWoke, newWakeDebounceStore, recentlyWoke } from "./wake-debounce.js";
20
21
  // CLI subcommand dispatch must run before any MCP setup so that
21
22
  // `npx oxtail install-hook` doesn't open an MCP transport or register a
@@ -637,6 +638,11 @@ function refineFromHandshake(trigger) {
637
638
  return diagnosis;
638
639
  }
639
640
  server.server.oninitialized = () => {
641
+ // Sweep pending-ask records orphaned by a prior session (an ask that timed out,
642
+ // was never answered, and whose owner went away). gcPendingAsk otherwise only
643
+ // runs on a later ask_peer timeout, so this startup sweep keeps the dir from
644
+ // accumulating stale records. Best-effort; never throws.
645
+ gcPendingAsk(defaultPendingAskDir(), Date.now());
640
646
  const diagnosis = refineFromHandshake("oninitialized");
641
647
  // After type is known via handshake, schedule retries to catch transcript files
642
648
  // that don't exist yet at handshake time. No-op if session_id is already set.
@@ -854,12 +860,16 @@ server.registerTool("get_my_session", {
854
860
  // strategy mirrors session_id_source so callers can still see whether
855
861
  // env / birth-time / self-register resolved this entry.
856
862
  const source = entry.client.session_id_source ?? "self-register";
863
+ // Report confidence honestly per source: env and explicit self-register
864
+ // (claim_session) are authoritative ("high"); inferred sources (birth-time,
865
+ // sticky-claim) are "medium" — matching what the detect strategies return.
866
+ const confidence = source === "env" || source === "self-register" ? "high" : "medium";
857
867
  diagnosis = {
858
868
  per_strategy: {},
859
869
  winning: {
860
870
  session_id: entry.client.session_id,
861
871
  source,
862
- confidence: "high",
872
+ confidence,
863
873
  strategy: source,
864
874
  },
865
875
  next_step: null,
@@ -972,7 +982,20 @@ function resolveTarget(target, caller) {
972
982
  const fresh = reReadRegistryEntry(e.server_pid);
973
983
  if (!fresh)
974
984
  return false;
975
- return fresh.started_at === e.started_at;
985
+ if (fresh.started_at !== e.started_at)
986
+ return false;
987
+ // PID-reuse: started_at is the original registration time and lives on the
988
+ // stale on-disk entry, so a recycled pid (alive, file untouched) passes the
989
+ // check above. If the entry recorded the process start-time signature,
990
+ // confirm the live pid is still that same process; a recycled pid reads a
991
+ // different signature and is rejected (M3). Empty reading → indeterminate,
992
+ // leave it to downstream (the pane wake gate re-verifies before keystrokes).
993
+ if (fresh.proc_sig) {
994
+ const liveSig = processStartSig(e.server_pid);
995
+ if (liveSig && liveSig !== fresh.proc_sig)
996
+ return false;
997
+ }
998
+ return true;
976
999
  });
977
1000
  if (candidates.length === 0)
978
1001
  return { ok: false, error: "target-not-found" };
@@ -1010,7 +1033,7 @@ function resolveTarget(target, caller) {
1010
1033
  server.registerTool("send_message", {
1011
1034
  description: [
1012
1035
  "Fire-and-forget message to a peer in the same project root. Target: a tmux session name OR a client_session_id (UUID). Async via the peer's mailbox — delivered mid-turn (PreToolUse hook) or next-turn (read_my_messages); cross-project targets are rejected.",
1013
- "A plain message does NOT wake an idle peer. Pass wake:\"auto\" to nudge one via per-client send-keys, state-gated (skipped if the peer is mid-turn). EXCEPTION (wake-on-reply): when you set reply_to, this auto-wakes the requester by default so your answer doesn't strand them idle — pass wake:\"off\" to suppress. The reply-default wake is strictly gated: it fires only for a FRESHLY-IDLE requester (one whose Claude Code hooks maintain a fresh idle marker), with a per-target rate limit and a one-wake dedupe; env kill-switch OXTAIL_AUTOWAKE=off. A requester with no idle marker (Codex, or Claude without the hooks) returns skipped_no_fresh_idle and is NOT auto-woken — use explicit wake:\"auto\" for those. Response carries wake_status (\"fired\" | \"skipped_busy\" | \"skipped_debounced\" | \"skipped_no_fresh_idle\" | \"skipped_rate_limited\" | \"skipped_deduped\" | \"skipped_store_error\" | \"skipped_no_target\" | \"disabled\") and, on the reply path, wake_reason:\"reply_to_default\".",
1036
+ "A plain message does NOT wake an idle peer. Pass wake:\"auto\" to nudge one via per-client send-keys, state-gated (skipped if the peer is mid-turn). EXCEPTION (wake-on-reply): when you set reply_to, this auto-wakes the requester by default so your answer doesn't strand them idle — pass wake:\"off\" to suppress. The reply-default wake is strictly gated: it fires only for a FRESHLY-IDLE requester (one whose Claude Code hooks maintain a fresh idle marker), with a per-target rate limit and a one-wake dedupe; env kill-switch OXTAIL_AUTOWAKE=off. A requester with no idle marker (Codex, or Claude without the hooks) returns skipped_no_fresh_idle and is NOT auto-woken — use explicit wake:\"auto\" for those. Response carries wake_status (\"fired\" | \"skipped_busy\" | \"skipped_debounced\" | \"skipped_no_fresh_idle\" | \"skipped_rate_limited\" | \"skipped_deduped\" | \"skipped_store_error\" | \"skipped_no_target\" | \"disabled\") and, on the reply path, wake_reason:\"reply_to_default\" — or wake_reason:\"late_reply_to_pending\" when this reply answers an ask_peer that had timed out (durably pulls the requester back regardless of the fresh-idle window; \"late_reply_to_pending_suppressed\" if you passed wake:\"off\").",
1014
1037
  "Body is verbatim — wrap in <system-reminder>...</system-reminder> yourself if you want that framing. When replying to ask_peer, include reply_to: request_id from the inbound message. For a blocking send-and-wait, use ask_peer instead.",
1015
1038
  ].join(" "),
1016
1039
  inputSchema: {
@@ -1085,7 +1108,7 @@ server.registerTool("send_message", {
1085
1108
  server.registerTool("reply_to_message", {
1086
1109
  description: [
1087
1110
  "Reply to a specific inbound peer message by its message_id — the atomic, correlation-safe alternative to hand-wiring send_message's target + reply_to. The server looks the message up in this session's durable received-ledger, so you pass only the message_id the PreToolUse hook or read_my_messages already showed you; it derives the reply target (the original sender), carries reply_to=request_id when the inbound was an ask_peer (keeping the exchange correlated), and sets source_message_id for provenance. Replying to a plain send_message works too — it just omits reply_to. Ownership is structural: you can only reply to a message delivered to you.",
1088
- "Delivery + wake match send_message exactly, including the wake-on-reply default: when the inbound carried a request_id and you leave wake unset, a freshly-idle requester is auto-woken; pass wake:\"auto\" to nudge any idle peer, or wake:\"off\" to suppress. Fail-closed: an unknown or aged-out message_id returns error message-not-found instead of guessing a target.",
1111
+ "Delivery + wake match send_message exactly, including the wake-on-reply default: when the inbound carried a request_id and you leave wake unset, a freshly-idle requester is auto-woken; pass wake:\"auto\" to nudge any idle peer, or wake:\"off\" to suppress. If the inbound ask_peer had since timed out, this reply durably pulls the requester back (wake_reason late_reply_to_pending) regardless of the fresh-idle window. Fail-closed: an unknown or aged-out message_id returns error message-not-found instead of guessing a target.",
1089
1112
  ].join(" "),
1090
1113
  inputSchema: {
1091
1114
  message_id: z
@@ -1261,15 +1284,18 @@ server.registerTool("read_my_messages", {
1261
1284
  // elapses. Reply-to-capable peers must reply with reply_to=request_id; legacy
1262
1285
  // peers fall back to the original from_session_id-only matching.
1263
1286
  //
1264
- // User-tunable override via OXTAIL_ASK_PEER_TIMEOUT_MS; defaults to 45000ms
1265
- // (conservative under typical MCP-client tool-call abort windows). Set to a
1266
- // lower value if your client aborts before our timeout fires.
1287
+ // User-tunable override via OXTAIL_ASK_PEER_TIMEOUT_MS; defaults to 60000ms.
1288
+ // 60s covers a slower multi-tool-call peer reply (a Codex peer composing
1289
+ // set_my_state + reply_to_message + a report was observed at ~46s and falsely
1290
+ // timed out under the old 45s default) while staying under both known callers'
1291
+ // tool-call abort windows: Claude Code is clean to ~60s, Codex aborts ~120s.
1292
+ // Set to a lower value if your client aborts before our timeout fires.
1267
1293
  const ASK_PEER_TIMEOUT_MS = (() => {
1268
1294
  const env = process.env.OXTAIL_ASK_PEER_TIMEOUT_MS;
1269
1295
  if (!env)
1270
- return 45_000;
1296
+ return 60_000;
1271
1297
  const n = Number(env);
1272
- return Number.isFinite(n) && n > 0 ? n : 45_000;
1298
+ return Number.isFinite(n) && n > 0 ? n : 60_000;
1273
1299
  })();
1274
1300
  const ASK_PEER_GRACE_MS = 500;
1275
1301
  const ASK_PEER_POLL_MS = 200;
@@ -1470,6 +1496,14 @@ async function wakePeer(peer) {
1470
1496
  // No session-name fallback: a self-written tmux_session could target another
1471
1497
  // session, and the verified pane already handles pane-id churn. Pass null.
1472
1498
  const ok = await askPeerWakeImpl(verifiedPane, null, fire);
1499
+ if (!ok && sid) {
1500
+ // The fire failed (e.g. the pane vanished between verification and the
1501
+ // send-keys), so no keystroke landed. Clear the debounce stamp set pre-fire
1502
+ // above — otherwise a genuine retry within WAKE_DEBOUNCE_MS is suppressed as
1503
+ // "debounced" even though the peer was never actually woken (M1). The
1504
+ // pre-stamp only needs to survive a SUCCESSFUL fire's async paste gap.
1505
+ wakeDebounce.delete(sid);
1506
+ }
1473
1507
  return ok ? "fired" : "skipped_no_target";
1474
1508
  }
1475
1509
  // --- send_message wake:auto gating -------------------------------------------
@@ -1480,7 +1514,19 @@ async function wakePeer(peer) {
1480
1514
  // Keyed by session_id (the agent identity), NOT server_pid: a dual-scope agent
1481
1515
  // has several MCP children sharing one session_id, and the hooks/sender must
1482
1516
  // agree on the key (see AGENTS.md). Must match the sanitization in the hooks.
1483
- const ACTIVITY_BUSY_TTL_MS = 10 * 60 * 1000;
1517
+ // How long a "busy" marker is trusted before a peer treats the turn as stale and
1518
+ // wakes anyway. The PreToolUse hook now re-stamps "busy" on every tool call, so
1519
+ // a long ACTIVE turn stays fresh; this TTL only governs a turn that stops making
1520
+ // tool calls (one giant single tool call, or a crash without a clean Stop) — the
1521
+ // latter is exactly the stale-busy→wake recovery we want. Configurable for
1522
+ // deployments with very long single-tool-call turns.
1523
+ const ACTIVITY_BUSY_TTL_MS = (() => {
1524
+ const env = process.env.OXTAIL_ACTIVITY_BUSY_TTL_MS;
1525
+ if (!env)
1526
+ return 10 * 60 * 1000;
1527
+ const n = Number(env);
1528
+ return Number.isFinite(n) && n > 0 ? n : 10 * 60 * 1000;
1529
+ })();
1484
1530
  function activitySessionKey(sessionId) {
1485
1531
  return sessionId.replace(/[^A-Za-z0-9_-]/g, "_");
1486
1532
  }
@@ -1553,11 +1599,64 @@ async function autoWakeOnReply(peer, replyTo) {
1553
1599
  trace("autowake_reply_fire", { target_session_id: sid });
1554
1600
  return wakePeer(peer);
1555
1601
  }
1556
- // Resolve the wake for a send_message. The strict reply-default path engages
1557
- // only for a reply with wake UNSET; an explicit wake:"auto" always means the
1558
- // lenient wakeForSend path (even for a reply — the Codex/hookless escape hatch),
1559
- // and wake:"off" means no wake. Returns the status + reason to surface.
1602
+ // Stamp the autowake dedupe record for (sessionId, replyTo) when the durable
1603
+ // pending-ask path fires, so a re-delivered / duplicate copy of the SAME reply
1604
+ // can't separately strict-wake the requester via the fresh-idle reply-default
1605
+ // (the in-memory wakePeer debounce is per-process and not reply_to-keyed, so it
1606
+ // doesn't cover a restart or a >1s gap). Best-effort; we're stamping, not gating.
1607
+ //
1608
+ // Like the existing reply-default path (decideReplyAutoWake → claimWake), this is
1609
+ // stamped on the wake ATTEMPT — before wakeForSend's keystroke outcome is known —
1610
+ // and claimWake also stamps the per-target RATE record. Intentional and
1611
+ // consistent with that path: one wake pulls the requester in to drain its whole
1612
+ // mailbox, so a second reply within the rate window doesn't need its own wake.
1613
+ // (It is NOT stamped on the wake:"off" / kill-switch-disabled paths, where no
1614
+ // wake is intended — see resolveSendWake.)
1615
+ function stampReplyWakeDedupe(sessionId, replyTo) {
1616
+ if (!sessionId)
1617
+ return;
1618
+ try {
1619
+ claimWake(defaultAutowakeDir(), sessionId, replyTo, Date.now());
1620
+ }
1621
+ catch {
1622
+ // best effort — a failure only means a duplicate could still strict-wake,
1623
+ // which is harmless (debounced, and the requester drains an empty mailbox).
1624
+ }
1625
+ }
1626
+ // Resolve the wake for a send_message / reply_to_message. Order matters:
1627
+ // 1. DURABLE pending-ask: if this reply satisfies an ask_peer that timed out
1628
+ // and recorded a pending obligation, consume it (regardless of wake mode —
1629
+ // a late reply satisfies the obligation even under wake:"off", and leaving
1630
+ // the record would let a later duplicate wake and violate the explicit off)
1631
+ // and fire the LENIENT wakeForSend so even a long-idle / markerless-Codex
1632
+ // requester is pulled back. The automatic (wake unset) variant honors the
1633
+ // OXTAIL_AUTOWAKE kill-switch; an explicit wake:"auto" intentionally does
1634
+ // not (it's the caller's explicit ask, matching existing semantics).
1635
+ // 2. STRICT reply-default: a reply with wake UNSET and no pending record →
1636
+ // fresh-idle-only auto-wake (autowake.ts), wake_reason "reply_to_default".
1637
+ // 3. Explicit wake:"auto" → lenient wakeForSend. wake:"off" → no wake.
1560
1638
  async function resolveSendWake(peer, wake, replyTo) {
1639
+ if (replyTo) {
1640
+ const sid = peer.client.session_id ?? "";
1641
+ if (consumePendingAsk(defaultPendingAskDir(), sid, replyTo, Date.now())) {
1642
+ // wake:"off" and the kill-switch path do NOT wake — so they must NOT stamp
1643
+ // the wake-dedupe: stamping there would later suppress the strict wake for a
1644
+ // genuine, distinct second reply to the same request_id (no wake happened,
1645
+ // so there is nothing to dedupe against). Only stamp on the path that fires.
1646
+ if (wake === "off") {
1647
+ trace("late_reply_pending_suppressed", { target_session_id: sid });
1648
+ return { wake_reason: "late_reply_to_pending_suppressed" };
1649
+ }
1650
+ if (wake === undefined && autowakeKillSwitchOff()) {
1651
+ return { wake_status: "disabled", wake_reason: "late_reply_to_pending" };
1652
+ }
1653
+ // About to actually wake → stamp so a re-delivered copy of THIS reply can't
1654
+ // strict-wake again via the fresh-idle fallback.
1655
+ stampReplyWakeDedupe(peer.client.session_id, replyTo);
1656
+ trace("late_reply_pending_wake", { target_session_id: sid });
1657
+ return { wake_status: await wakeForSend(peer), wake_reason: "late_reply_to_pending" };
1658
+ }
1659
+ }
1561
1660
  if (replyAutoWakeTriggered(wake, replyTo)) {
1562
1661
  return { wake_status: await autoWakeOnReply(peer, replyTo), wake_reason: "reply_to_default" };
1563
1662
  }
@@ -1571,24 +1670,38 @@ async function resolveSendWake(peer, wake, replyTo) {
  // mailbox lock when there's a probable hit. The lock is held only inside
  // drainMatchingSession (sub-10ms) — never across the poll interval, so the
  // PreToolUse hook on subsequent caller tool calls is never starved.
- async function askPeerPoll(my_pid, from_session_id, request_id, require_reply_to, deadlineMs, signal) {
- let lastMtime = -1;
- const path = mailbox.mailboxFilePath(my_pid);
+ // The requester's mailbox pid union: own pid first (fast-path locality), then
+ // any sibling/previous MCP child sharing the session_id. Recomputed at the final
+ // drain so a sibling that appeared DURING the wait is still covered.
+ function requesterPids(ownPid, sessionId) {
+ return sessionId
+ ? [ownPid, ...sessionPidsForId(sessionId).filter((p) => p !== ownPid)]
+ : [ownPid];
+ }
+ async function askPeerPoll(pids, from_session_id, request_id, require_reply_to, deadlineMs, signal) {
+ // Watch the mtime of EVERY sibling pid's mailbox (a dual-scope requester's
+ // reply may land in a pid other than the one blocked here), draining only when
+ // a file that exists has changed — so the lock is acquired on a probable hit,
+ // never every tick. Mirrors the single-pid optimization, widened to the union.
+ const lastMtimes = new Map();
  while (Date.now() < deadlineMs) {
  if (signal.aborted)
  throw new Error("aborted");
- let stat = null;
- try {
- stat = statSync(path);
- }
- catch {
- // ENOENT: mailbox file not created yet; treat as no change
+ let changed = false;
+ for (const pid of pids) {
+ let m = -1;
+ try {
+ m = statSync(mailbox.mailboxFilePath(pid)).mtimeMs;
+ }
+ catch {
+ // ENOENT: mailbox file not created yet
+ }
+ if (m !== -1 && lastMtimes.get(pid) !== m)
+ changed = true;
+ lastMtimes.set(pid, m);
  }
- if (stat && stat.mtimeMs !== lastMtime) {
- lastMtime = stat.mtimeMs;
- const reply = require_reply_to
- ? mailbox.drainMatchingReply(my_pid, from_session_id, request_id)
- : mailbox.drainMatchingSession(my_pid, from_session_id);
+ if (changed) {
+ const reply = drainAskPeerReply(pids, from_session_id, request_id, require_reply_to);
  if (reply)
  return reply;
  }
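The per-tick change check in the new `askPeerPoll` can be isolated as a small helper. This is a minimal sketch, assuming the mtime lookup is injected rather than read with `statSync`, so it runs without touching the filesystem. `anyMailboxChanged`, `readMtimeMs`, and the fake pid values are hypothetical names introduced for illustration.

```javascript
// Hypothetical helper mirroring the loop body in askPeerPoll: returns true when
// any EXISTING mailbox has an mtime different from the one seen last tick.
// `readMtimeMs(pid)` stands in for statSync(path).mtimeMs and may throw.
function anyMailboxChanged(pids, lastMtimes, readMtimeMs) {
  let changed = false;
  for (const pid of pids) {
    let m = -1; // -1 means "no mailbox file for this pid yet"
    try {
      m = readMtimeMs(pid);
    } catch {
      // Missing file: m stays -1, which never counts as a change.
    }
    if (m !== -1 && lastMtimes.get(pid) !== m) changed = true;
    lastMtimes.set(pid, m);
  }
  return changed;
}

// Usage with a fake mtime table: pid 101 has a mailbox, pid 202 does not yet.
const fakeMtimes = new Map([[101, 5]]);
const readFake = (pid) => {
  if (!fakeMtimes.has(pid)) throw new Error("ENOENT");
  return fakeMtimes.get(pid);
};
const seenMtimes = new Map();
const firstTick = anyMailboxChanged([101, 202], seenMtimes, readFake);
const secondTick = anyMailboxChanged([101, 202], seenMtimes, readFake);
fakeMtimes.set(202, 9); // a sibling's mailbox appears during the wait
const thirdTick = anyMailboxChanged([101, 202], seenMtimes, readFake);
```

The first tick reports a change because pid 101's mailbox has not been seen before, the second reports none, and the third reports a change once the sibling's file appears. Only ticks that report a change would go on to acquire the mailbox lock.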
@@ -1599,15 +1712,18 @@ async function askPeerPoll(my_pid, from_session_id, request_id, require_reply_to
  }
  return null;
  }
- function drainAskPeerReply(my_pid, from_session_id, request_id, require_reply_to) {
+ function drainAskPeerReply(pids, from_session_id, request_id, require_reply_to) {
+ // Correlated peers: union-drain by reply_to across the requester's siblings.
+ // Legacy/uncorrelated peers: keep the best-effort own-pid session match (no
+ // request_id to correlate the union safely).
  return require_reply_to
- ? mailbox.drainMatchingReply(my_pid, from_session_id, request_id)
- : mailbox.drainMatchingSession(my_pid, from_session_id);
+ ? mailbox.drainMatchingReplyMany(pids, from_session_id, request_id)
+ : mailbox.drainMatchingSession(pids[0], from_session_id);
  }
  server.registerTool("ask_peer", {
  description: [
  "Delegate-and-wait: enqueue a message to a peer in the same project root, wake them, and block until they reply (via send_message) or the timeout elapses. Use this for back-and-forth; use send_message for fire-and-forget.",
- "Wakes the peer via per-client tmux send-keys (Codex gets a paste-burst-aware gap, Claude Code doesn't), then polls for a reply. For reply_to-capable peers, only from_session_id + reply_to == request_id satisfies the wait; legacy peers fall back to best-effort from_session_id matching and the response reports correlation:\"uncorrelated\". Response carries wake_status: \"fired\" | \"skipped_busy\" | \"skipped_no_target\" | \"disabled\" (skipped_unsupported is reserved). A peer that is mid-turn is NOT keystroke-woken (skipped_busy) — its hook/poll delivers the enqueued message and we still poll for the reply. Returns reply: null, timed_out: true on timeout (default 45000ms, override per call with timeout_ms, or set OXTAIL_ASK_PEER_TIMEOUT_MS at startup). timeout_ms is clamped to a safe ceiling (default 100000ms, env OXTAIL_ASK_PEER_MAX_TIMEOUT_MS) so the wait can't outlast the client's tool-call abort window — exceeding it makes the client hard-fail the call instead of returning graceful timed_out; the response reports timeout_clamped_from_ms when clamped. Late replies still arrive via read_my_messages / the hook.",
+ "Wakes the peer via per-client tmux send-keys (Codex gets a paste-burst-aware gap, Claude Code doesn't), then polls for a reply. For reply_to-capable peers, only from_session_id + reply_to == request_id satisfies the wait; legacy peers fall back to best-effort from_session_id matching and the response reports correlation:\"uncorrelated\". Response carries wake_status: \"fired\" | \"skipped_busy\" | \"skipped_no_target\" | \"disabled\" (skipped_unsupported is reserved). A peer that is mid-turn is NOT keystroke-woken (skipped_busy) — its hook/poll delivers the enqueued message and we still poll for the reply. Returns reply: null, timed_out: true on timeout (default 60000ms, override per call with timeout_ms, or set OXTAIL_ASK_PEER_TIMEOUT_MS at startup). timeout_ms is clamped to a safe ceiling (default 100000ms, env OXTAIL_ASK_PEER_MAX_TIMEOUT_MS) so the wait can't outlast the client's tool-call abort window — exceeding it makes the client hard-fail the call instead of returning graceful timed_out; the response reports timeout_clamped_from_ms when clamped. DURABLE DELEGATION: on timeout (correlated peers, claimed requester), the request is recorded as a pending obligation, so when the peer's reply finally arrives — minutes or hours later — it WAKES you back (wake_reason late_reply_to_pending), not just landing silently in read_my_messages. So ask_peer is safe for long tasks: let it time out, end your turn, get pulled back when the work is done.",
  "Target must have a registered client.session_id (Codex peers call claim_session first). Body is verbatim — frame it as an assignment (objective + requested action) so it reads as delegation, not chat. Wake overridable via OXTAIL_ASK_PEER_WAKE_STRATEGY=auto|legacy|off.",
  ].join(" "),
  inputSchema: {
@@ -1656,6 +1772,10 @@ server.registerTool("ask_peer", {
  const requestId = randomBytes(8).toString("hex");
  const requireReplyTo = peerSupportsReplyTo(peer);
  const fromSessionId = entry.client.session_id ?? undefined;
+ // The reply is addressed to OUR session_id; resolveTarget enqueues it to the
+ // session's freshest sibling, which may not be entry.server_pid. Drain the
+ // union (own pid first for fast-path locality), mirroring read_my_messages.
+ const myPids = requesterPids(entry.server_pid, fromSessionId);
  // Record-before-append (mirrors send_message): lets the peer answer with
  // reply_to_message(message_id) instead of hand-wiring target + reply_to.
  const msg = deliverToPeer(expectedSessionId, peer.server_pid, body, fromSessionId, {
@@ -1683,7 +1803,7 @@ server.registerTool("ask_peer", {
  // our outbound arrived, their hook delivered it as additionalContext and
  // their response may already be in our mailbox.
  await askPeerDelay(ASK_PEER_GRACE_MS, extra.signal);
- reply = drainAskPeerReply(entry.server_pid, expectedSessionId, requestId, requireReplyTo);
+ reply = drainAskPeerReply(myPids, expectedSessionId, requestId, requireReplyTo);
  if (!reply) {
  // Common path: peer was idle. Route the wake per client_type, but skip
  // the keystroke if the peer is FRESHLY busy (mid-turn): typing into a
@@ -1706,7 +1826,7 @@ server.registerTool("ask_peer", {
  // return this and the caller fail-fasts instead of polling.
  }
  else {
- reply = await askPeerPoll(entry.server_pid, expectedSessionId, requestId, requireReplyTo, deadlineMs, extra.signal);
+ reply = await askPeerPoll(myPids, expectedSessionId, requestId, requireReplyTo, deadlineMs, extra.signal);
  }
  }
  else {
@@ -1749,6 +1869,77 @@ server.registerTool("ask_peer", {
  // attempted) is NOT a timeout; the message has been enqueued and will be
  // delivered when the peer next enters a turn.
  const polled = wakeStatus !== "skipped_unsupported";
+ // Durable delegation: we polled to the deadline with no reply. Record a
+ // pending obligation FIRST, then do one final authoritative UNION drain —
+ // write-before-final-drain closes the poll-vs-deadline TOCTOU. A reply that
+ // landed in the gap is caught here and returned now; a reply that arrives
+ // AFTER finds the persisted record and pulls us back via resolveSendWake's
+ // late_reply_to_pending path — even minutes/hours later, and even for a
+ // markerless idle Codex requester. Correlated peers + claimed requester only.
+ if (polled && reply === null && !aborted && requireReplyTo) {
+ if (fromSessionId) {
+ const dir = defaultPendingAskDir();
+ // Opportunistic sweep so abandoned records (a reply that never came)
+ // can't accumulate — mirrors gcAutowake inside decideReplyAutoWake.
+ gcPendingAsk(dir, Date.now());
+ // Write the pending obligation BEFORE the final drain (write-before-
+ // final-drain): a reply that lands after the drain finds this record and
+ // wakes us via resolveSendWake; one that landed before is caught below.
+ if (!recordPendingAsk(dir, fromSessionId, requestId, Date.now())) {
+ // Store unwritable → silently degrades to the read_my_messages path
+ // (no durable pull-back). Surface it so the degradation is observable.
+ trace("ask_peer_pending_record_failed", { request_id: requestId });
+ }
+ // Authoritative final drain. Recompute the pid union NOW — a sibling MCP
+ // child may have appeared during the wait. Use the CHECKED variant and
+ // retry any pid we couldn't inspect (transient lock): silently treating
+ // "couldn't read" as "no reply" would leave the record with no later
+ // event to consume it → a stranded pull-back.
+ const finalPids = requesterPids(entry.server_pid, fromSessionId);
+ let drained = mailbox.drainMatchingReplyManyChecked(finalPids, expectedSessionId, requestId);
+ if (drained.skipped.length > 0) {
+ // A pid we couldn't inspect might hold either the already-landed reply
+ // (if we have none yet) OR a migrate-crash duplicate of the reply we DID
+ // pull (which a later read_my_messages would re-deliver). Retry once
+ // after a brief delay for the lock to clear.
+ try {
+ await askPeerDelay(ASK_PEER_POLL_MS, extra.signal);
+ if (!drained.reply) {
+ drained = mailbox.drainMatchingReplyManyChecked(drained.skipped, expectedSessionId, requestId);
+ if (!drained.reply && drained.skipped.length > 0) {
+ // Still un-inspectable after the retry: a lock held past the
+ // acquire budget + retry (SIGSTOP-class / long holder). diagnose
+ // can use this to tell "no reply" from "a reply may sit behind a
+ // locked pid" — the record persists, so a later send still wakes.
+ trace("ask_peer_skipped_after_final_retry", {
+ request_id: requestId,
+ skipped: drained.skipped,
+ });
+ }
+ }
+ else {
+ // We have the reply — sweep only its exact id from the skipped pids
+ // (a distinct second reply, different id, is left for read_my_messages).
+ mailbox.sweepMessageId(drained.skipped, drained.reply.id);
+ }
+ }
+ catch {
+ // aborted during the brief retry delay — leave the record; we return
+ // timed_out and the reply still delivers via read_my_messages.
+ }
+ }
+ if (drained.reply) {
+ consumePendingAsk(dir, fromSessionId, requestId);
+ reply = drained.reply;
+ trace("ask_peer_late_catch", { request_id: requestId, message_id: drained.reply.id });
+ }
+ }
+ else {
+ // Unclaimed requester: a peer can't correlate/reply_to_message back to
+ // us, so there's nothing to durably wake — surface it rather than guess.
+ trace("ask_peer_pending_skipped_unclaimed", { request_id: requestId });
+ }
+ }
  const timedOut = polled && reply === null;
  trace("ask_peer_end", {
  target_session_id: expectedSessionId,
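The write-before-final-drain ordering described in the durable-delegation comments can be modeled in memory. The sketch below is an illustration under simplifying assumptions: a `Set` stands in for the pending-ask store and an array for the requester's mailbox, with no locks, retries, or multiple pids. `finishTimedOutAsk` and `senderShouldWake` are hypothetical names, not oxtail functions.

```javascript
// Hypothetical in-memory model of the timeout path.
// `pending` plays the role of the pending-ask store, `mailbox` the queued messages.
function finishTimedOutAsk(pending, mailbox, requestId) {
  pending.add(requestId); // 1. record the obligation FIRST
  const i = mailbox.findIndex((m) => m.reply_to === requestId);
  if (i === -1) return null; // 2a. nothing yet: the record stays for a later wake
  const [reply] = mailbox.splice(i, 1); // 2b. reply landed in the gap: take it now
  pending.delete(requestId); //     and clear the record so it cannot fire later
  return reply;
}

// A later sender consults the record to decide whether to wake the requester.
// Set.prototype.delete returns true only if the entry existed, so this is consume-once.
function senderShouldWake(pending, requestId) {
  return pending.delete(requestId);
}
```

Because the record is written before the drain, a reply is either found by the drain or arrives afterwards and finds the record. Reversing the two steps would leave a window in which a reply lands after the drain but before the record exists, and nothing would wake the requester.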
@@ -30,7 +30,14 @@ export function newWakeDebounceStore() {
  // True if a wake fired for this key within the window — i.e. skip this one.
  export function recentlyWoke(store, key, nowMs, windowMs = WAKE_DEBOUNCE_MS) {
  const last = store.get(key);
- return last !== undefined && nowMs - last < windowMs;
+ if (last === undefined)
+ return false;
+ const delta = nowMs - last;
+ // A backwards clock step (NTP correction, laptop resume) makes delta negative
+ // and < windowMs, which would wrongly suppress every wake to this peer until
+ // the clock catches back up. Treat a negative delta as "not recent" (mirrors
+ // the ageMs >= 0 guard in isFreshIdle).
+ return delta >= 0 && delta < windowMs;
  }
  // Record that a wake fired for this key. Opportunistically evicts stale entries
  // so the map can't grow unbounded across many short-lived peers.
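The effect of the negative-delta guard in `recentlyWoke` is easiest to see with concrete timestamps. The sketch below copies the patched logic into a standalone function. `recentlyWokeSketch` is a hypothetical name, and `windowMs` is passed explicitly because the real value of `WAKE_DEBOUNCE_MS` is not shown in this diff. The 1000 ms window and the timestamps are arbitrary example values.

```javascript
// Standalone copy of the patched comparison, for illustration only.
function recentlyWokeSketch(store, key, nowMs, windowMs) {
  const last = store.get(key);
  if (last === undefined) return false;
  const delta = nowMs - last;
  // A negative delta means the clock stepped backwards; treat it as "not recent".
  return delta >= 0 && delta < windowMs;
}

const wakeStore = new Map([["peer-a", 10000]]); // last wake recorded at t=10000
const insideWindow = recentlyWokeSketch(wakeStore, "peer-a", 10500, 1000); // delta 500
const clockWentBack = recentlyWokeSketch(wakeStore, "peer-a", 9000, 1000); // delta -1000
const prePatchResult = 9000 - 10000 < 1000; // the old comparison for the same inputs
```

With these inputs `insideWindow` is `true` and `clockWentBack` is `false`, while `prePatchResult` is `true`: the old comparison would have kept suppressing wakes until the clock passed the stored timestamp again.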
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "oxtail",
- "version": "0.14.1",
+ "version": "0.16.0",
  "private": false,
  "type": "module",
  "description": "Coordination layer for parallel AI coding agent sessions, exposed over MCP.",
@@ -33,10 +33,14 @@ export const HOOK_MARKER_KEY = "_oxtailHook";
  // with no owner check, so during an upgrade window (before re-install) the
  // old hook can still lose the stall-resume / double-clear races against a
  // v6 peer. The version bump forces re-install to close that window.
+ // v7: pretooluse re-stamps the "busy" activity marker on every tool call, so a
+ // long ACTIVE turn stays fresh and doesn't invite a spurious wake:auto once
+ // it outruns ACTIVITY_BUSY_TTL_MS. A stale pre-v7 hook just doesn't refresh
+ // (the prior behavior) — never wrong, only less fresh on long turns.
  // INVARIANT: any change to an assets/*.sh script MUST bump this version, so
  // existing installs are forced to re-install. scripts/check-hook-version.mjs
  // enforces this in CI.
- export const HOOK_MARKER_VERSION = 6;
+ export const HOOK_MARKER_VERSION = 7;
  
  const HOOKS_DIR = path.join(os.homedir(), ".oxtail", "hooks");
42
46