oxtail 0.14.1 → 0.16.0
- package/README.md +26 -6
- package/assets/pretooluse.sh +12 -0
- package/dist/claims.js +27 -7
- package/dist/detect/birthTimeMatchStrategy.js +8 -2
- package/dist/locks.js +26 -7
- package/dist/mailbox.js +117 -3
- package/dist/pending-ask.js +167 -0
- package/dist/received.js +26 -5
- package/dist/registry.js +38 -1
- package/dist/server.js +227 -36
- package/dist/wake-debounce.js +8 -1
- package/package.json +1 -1
- package/scripts/hook-constants.mjs +5 -1
package/README.md
CHANGED
|
@@ -67,7 +67,7 @@ Contributing? `git clone https://github.com/d4j3y2k/oxtail && cd oxtail && npm i
|
|
|
67
67
|
- `set_my_state` — write a small "state card" onto this session's registry entry so peers can see what we're doing without reading our transcript. v1 surfaces a single field, `purpose` (≤200 chars).
|
|
68
68
|
- `send_message` — **fire-and-forget** message to a peer. Target is a tmux session name or a raw `client_session_id` UUID. Body ≤ 8KB. Delivery is async via the peer's mailbox file. A plain message does **not** wake an idle peer; pass `wake: "auto"` to nudge one (state-gated — see [Waking an idle peer](#waking-an-idle-peer)). Replies to `ask_peer` should pass `reply_to: "<request_id>"` when the inbound message carries a `request_id` — and a reply **auto-wakes the requester by default** (strictly gated; `wake: "off"` opts out). (v0.5+)
|
|
69
69
|
- `read_my_messages` — drain this session's mailbox and return any queued messages. Messages include `from_session_id`, server-stamped `origin: "peer"`, and optional `request_id` / `reply_to`. Codex peers (and unhooked Claude Code) poll this; Claude Code peers with the hooks installed see messages mid-turn (PreToolUse) or at turn end (Stop) instead. (v0.5+)
|
|
70
|
-
- `ask_peer` — **delegate-and-wait**. Enqueues a message with a `request_id` and blocks server-side until the peer replies with `send_message({ reply_to: request_id })` or the timeout elapses. Default timeout is
|
|
70
|
+
- `ask_peer` — **delegate-and-wait**. Enqueues a message with a `request_id` and blocks server-side until the peer replies with `send_message({ reply_to: request_id })` or the timeout elapses. Default timeout is 60s (`OXTAIL_ASK_PEER_TIMEOUT_MS`), and each call may pass `timeout_ms`. New peers use strict `reply_to` correlation; legacy/no-capability peers fall back to best-effort first-message matching and the response reports `correlation: "uncorrelated"`. That legacy path may stale-match old same-peer chatter, so callers should treat `uncorrelated` as compatibility-only. **Durable on timeout (v0.15+):** if the wait elapses, the request is recorded as a pending obligation, so when the peer's reply finally arrives — minutes or hours later — it *wakes the requester back* (`wake_reason: "late_reply_to_pending"`) instead of landing silently. That makes `ask_peer` safe for long-running delegations: let it time out, end the turn, get pulled back when the work is done. Use `send_message` for fire-and-forget. (v0.7+)
|
|
71
71
|
- `reply_to_message` — **reply by `message_id`**. The atomic, correlation-safe alternative to hand-wiring `send_message`'s `target` + `reply_to`: pass the `message_id` the hook or `read_my_messages` showed you and the server looks the inbound envelope up in this session's durable **received-ledger**, derives the reply target (the original sender), carries `reply_to: request_id` when the inbound was an `ask_peer` (keeping the exchange correlated), and stamps `source_message_id`. Replying to a plain `send_message` works too — it just omits `reply_to`. Ownership is structural (you can only reply to a message delivered to *you*); fail-closed on an unknown/aged-out id. Same wake semantics as `send_message`, including the wake-on-reply default. (v0.13+)
|
|
72
72
|
- `register_my_session` — pin this MCP server's `session_id` directly. Kept for debugging; prefer `claim_session`.
|
|
73
73
|
- `get_my_session` — return this MCP server's own registry entry plus a per-strategy detection diagnosis. Useful for debugging.
|
|
@@ -163,7 +163,9 @@ send_message({ target: "<requester>", body: "...", reply_to: "<request_id>" })
|
|
|
163
163
|
|
|
164
164
|
The reply path is deliberately **stricter** than explicit `wake: "auto"`. It fires only when the target is **freshly idle** — an `idle` activity marker newer than `OXTAIL_AUTOWAKE_FRESH_IDLE_MS` (default 5 min). Stale, unknown, missing, or busy state yields `skipped_no_fresh_idle` (no best-effort wake — typing unprompted into a terminal that may be unattended is the risk we refuse to take). Two more guards bound it: a **per-target rate limit** (`OXTAIL_AUTOWAKE_MIN_INTERVAL_MS`, default 4s → `skipped_rate_limited`) since one wake already drains the whole mailbox, and a **one-wake dedupe** keyed on `(session_id, reply_to)` (`skipped_deduped`) so a duplicate or late hook drain of the same reply can't re-fire. If the dedupe/rate store is somehow unwritable the wake degrades to `skipped_store_error` rather than failing the (already-delivered) message. The env kill-switch `OXTAIL_AUTOWAKE=off` disables reply auto-wake entirely (`wake_status: "disabled"`). Every outcome that reaches the gate surfaces a `wake_status`; the reply path also stamps `wake_reason: "reply_to_default"` (present even on a resolve error like `ambiguous-target`, where there's no single target to wake).
|
|
165
165
|
|
|
166
|
-
**Coverage (which requesters this reaches).** The fresh-idle gate keys on the requester's busy/idle activity marker, which only the Claude Code hooks maintain. So wake-on-reply currently closes the stranding for a **hooked Claude Code requester** (the originally-observed case: a peer's async reply to an idle Claude session). A **Codex** requester — or a Claude requester without the hooks installed — has no idle marker, so a reply with `wake` unset returns `skipped_no_fresh_idle` and is **not** auto-woken; reach it with an explicit `wake: "auto"`, which always takes the lenient wake path (idle/unknown/stale all wake; only a fresh-`busy` peer is skipped) and bypasses the strict fresh-idle gate even for a reply.
|
|
166
|
+
**Coverage (which requesters this reaches).** The fresh-idle gate keys on the requester's busy/idle activity marker, which only the Claude Code hooks maintain. So wake-on-reply currently closes the stranding for a **hooked Claude Code requester** (the originally-observed case: a peer's async reply to an idle Claude session). A **Codex** requester — or a Claude requester without the hooks installed — has no idle marker, so a reply with `wake` unset returns `skipped_no_fresh_idle` and is **not** auto-woken; reach it with an explicit `wake: "auto"`, which always takes the lenient wake path (idle/unknown/stale all wake; only a fresh-`busy` peer is skipped) and bypasses the strict fresh-idle gate even for a reply.
|
|
167
|
+
|
|
168
|
+
For the **`ask_peer` case specifically**, the Codex/unhooked-requester direction is now closed *by default* (v0.15+, see [Durable `ask_peer`](#durable-ask_peer-long-efforts) below): a timed-out `ask_peer` records a durable **pending-ask** keyed on the requester's `session_id` + `request_id`, and the matching late reply takes the lenient wake path regardless of any idle marker — so even a markerless idle Codex requester is pulled back. This is exactly the requester-side waiter signal the blind `unknown ⇒ wake` default was avoided for: it's evidence the requester *explicitly asked and is waiting*, so it can't double-wake an unrelated active turn.
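The strict reply-default gate described in this hunk can be sketched as a pure decision function. This is a minimal illustration, not the package's code: the status strings and the two defaults come from the README text, while the function name, the `marker` shape, the order of the checks, and the `"woken"` placeholder for a wake that fires are all assumptions.

```javascript
// Hypothetical sketch of the reply-default wake gate (wake left unset).
const FRESH_IDLE_MS = 5 * 60 * 1000; // OXTAIL_AUTOWAKE_FRESH_IDLE_MS default
const MIN_INTERVAL_MS = 4 * 1000;    // OXTAIL_AUTOWAKE_MIN_INTERVAL_MS default

// marker: { state: "idle" | "busy", stampedMs } or null (assumed shape).
function replyWakeStatus({ autowake, marker, lastWakeMs, alreadyWoken, nowMs }) {
  if (autowake === "off") return "disabled"; // env kill-switch
  const freshIdle =
    marker != null &&
    marker.state === "idle" &&
    nowMs - marker.stampedMs < FRESH_IDLE_MS;
  // Stale, unknown, missing, or busy state never wakes on this path.
  if (!freshIdle) return "skipped_no_fresh_idle";
  // One wake per (session_id, reply_to).
  if (alreadyWoken) return "skipped_deduped";
  // Per-target rate limit.
  if (lastWakeMs != null && nowMs - lastWakeMs < MIN_INTERVAL_MS) {
    return "skipped_rate_limited";
  }
  return "woken"; // placeholder name, not taken from the source
}

const now = 10 * 60 * 1000;
console.log(replyWakeStatus({ marker: null, nowMs: now }));
console.log(replyWakeStatus({ marker: { state: "idle", stampedMs: now - 1000 }, nowMs: now }));
```

A markerless requester (Codex, or Claude Code without hooks) falls into the first skip branch, which is the coverage gap the paragraph above describes.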
|
|
167
169
|
|
|
168
170
|
**Codex and the wake matrix.** The send-keys wake needs a tmux pane. A Codex peer running **outside tmux** has none, so it returns `wake_status: "skipped_no_target"` — its idle delivery stays poll-based (`read_my_messages`). Run Codex **inside a tmux pane** to get symmetric idle-wake; the routing already handles the Codex paste-burst case.
|
|
169
171
|
|
|
@@ -238,17 +240,34 @@ All wake paths funnel through one place, which **coalesces** rapid repeat wakes
|
|
|
238
240
|
### Constraints
|
|
239
241
|
|
|
240
242
|
- The target peer must have a registered `client.session_id`. Codex peers must call `claim_session` / `register_my_session` first; without that, `ask_peer` returns `error: "peer-has-no-session-id"` rather than guessing.
|
|
241
|
-
- Timeout defaults to
|
|
243
|
+
- Timeout defaults to 60000ms — enough headroom for a slower multi-tool-call peer reply (e.g. a Codex peer running `set_my_state` + `reply_to_message` + composing a report, observed ~46s) while staying under both known callers' tool-call abort windows (Claude Code is clean to ~60s; Codex aborts ~120s). Pass `timeout_ms` on a call when a specific delegation needs a different bound; max 300000ms.
|
|
242
244
|
|
|
243
245
|
### Tuning the timeout
|
|
244
246
|
|
|
245
|
-
If `ask_peer` returns an abort error before its built-in
|
|
247
|
+
If `ask_peer` returns an abort error before its built-in 60s timeout fires, your MCP client's tool-call ceiling is lower than 60s. Override the bound at server startup:
|
|
246
248
|
|
|
247
249
|
```sh
|
|
248
250
|
OXTAIL_ASK_PEER_TIMEOUT_MS=30000 npx -y oxtail@0.10.1
|
|
249
251
|
```
|
|
250
252
|
|
|
251
|
-
The server reads the env var once at boot and uses it as the fixed timeout for all `ask_peer` calls in that session. Values must be positive numbers; anything else falls back to the
|
|
253
|
+
The server reads the env var once at boot and uses it as the fixed timeout for all `ask_peer` calls in that session. Values must be positive numbers; anything else falls back to the 60000ms default.
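As an illustration of the fallback rule stated above (positive numbers are accepted, anything else falls back to 60000ms), a parser could look like the sketch below. The function name is hypothetical, and treating `Infinity` as invalid is an assumption; the README only says values must be positive numbers.

```javascript
// Hypothetical helper: resolve OXTAIL_ASK_PEER_TIMEOUT_MS once at boot.
const DEFAULT_ASK_PEER_TIMEOUT_MS = 60000;

function parseAskPeerTimeout(raw) {
  const n = Number(raw);
  // Reject NaN, zero, negatives, and (by assumption) non-finite values.
  return Number.isFinite(n) && n > 0 ? n : DEFAULT_ASK_PEER_TIMEOUT_MS;
}

console.log(parseAskPeerTimeout("30000"));
console.log(parseAskPeerTimeout("soon"));
```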
|
|
254
|
+
|
|
255
|
+
### Durable `ask_peer` (long efforts)
|
|
256
|
+
|
|
257
|
+
The blocking wait is a *short* primitive (bounded by the client's tool-call abort window, ~60s). A real task can take minutes or hours — far longer than any wait can block. So `ask_peer` decouples the **wait** from the **delivery of the answer**:
|
|
258
|
+
|
|
259
|
+
- On timeout (for a correlated peer + a claimed requester), the request is recorded as a durable **pending-ask** at `~/.oxtail/pending-ask/p-<hash(session_id, request_id)>`, keyed on the *requester's* `session_id` + `request_id`. A `recordPendingAsk` runs **before** one final authoritative union-drain of the requester's mailbox (write-before-final-drain), so a reply that lands in the poll-vs-deadline gap is returned immediately, and a reply that arrives later finds the persisted record.
|
|
260
|
+
- When that reply eventually arrives, `resolveSendWake` finds the matching pending-ask, **consumes it** (atomic `unlink`, single-winner — a duplicate/re-delivered reply can't re-fire), and takes the **lenient** wake path (`wake_reason: "late_reply_to_pending"`). Because the record is *proof the requester explicitly asked and is waiting*, the wake fires regardless of the 5-min fresh-idle window — and reaches a **markerless idle Codex** requester that the strict reply-default would skip. It also stamps the autowake dedupe for `(session_id, request_id)` so a later duplicate can't strict-wake via the fresh-idle fallback.
|
|
261
|
+
- `wake: "off"` still **consumes** the record (the obligation is satisfied — leaving it would let a later duplicate wake and violate the explicit off) but suppresses the wake (`wake_reason: "late_reply_to_pending_suppressed"`). The automatic (wake-unset) path honors `OXTAIL_AUTOWAKE=off` (`wake_status: "disabled"`); an explicit `wake: "auto"` intentionally does not.
|
|
262
|
+
- The reply drain is a **union across the requester's sibling MCP-child pids** (`drainMatchingReplyMany`), mirroring `read_my_messages` — a dual-scope requester's reply may land in a sibling pid, not the one that blocked in `ask_peer`.
|
|
263
|
+
|
|
264
|
+
Records are honored for `OXTAIL_PENDING_ASK_TTL_MS` (default 1h, sized for long efforts): a reply after that still delivers durably via `read_my_messages` but won't fire the pull-back wake (`consumePendingAsk` is TTL-aware — it removes an over-TTL record without waking). GC is **opportunistic** — abandoned records (a reply that never came) are swept when a later `ask_peer` times out, not on a wall-clock timer; the files are tiny, and a reply always cleans up its own record on arrival.
|
|
265
|
+
|
|
266
|
+
**The pattern:** `ask_peer` a long task → let it return `timed_out: true` → end your turn → get woken when the answer lands. Pair with a generous `OXTAIL_ACTIVITY_BUSY_TTL_MS` if your turns run long (see below).
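The record/consume lifecycle described in this section can be modeled in memory. The sketch below is an assumption-laden stand-in: a `Map` replaces the files under `~/.oxtail/pending-ask/`, and the signatures of `recordPendingAsk` and `consumePendingAsk` are guesses; only the function names, the keying on `session_id` + `request_id`, the single-winner consume, and the 1h TTL default come from the README.

```javascript
// In-memory model of the durable pending-ask store (the real one is file-based).
const PENDING_ASK_TTL_MS = 60 * 60 * 1000; // OXTAIL_PENDING_ASK_TTL_MS default (1h)
const pending = new Map();

function pendingKey(sessionId, requestId) {
  return JSON.stringify([sessionId, requestId]);
}

// Called when ask_peer times out, before the final drain.
function recordPendingAsk(sessionId, requestId, nowMs) {
  pending.set(pendingKey(sessionId, requestId), { recordedMs: nowMs });
}

// Returns true only when a live record was removed, i.e. the late reply
// should take the lenient wake path. An over-TTL record is removed without waking.
function consumePendingAsk(sessionId, requestId, nowMs) {
  const key = pendingKey(sessionId, requestId);
  const rec = pending.get(key);
  if (!rec) return false;
  pending.delete(key); // single winner: a duplicate reply finds nothing
  return nowMs - rec.recordedMs <= PENDING_ASK_TTL_MS;
}

const MIN = 60 * 1000;
recordPendingAsk("requester-1", "req-a", 0);
const firstReply = consumePendingAsk("requester-1", "req-a", 20 * MIN);
const duplicateReply = consumePendingAsk("requester-1", "req-a", 21 * MIN);
recordPendingAsk("requester-1", "req-b", 0);
const overTtlReply = consumePendingAsk("requester-1", "req-b", 120 * MIN);
console.log({ firstReply, duplicateReply, overTtlReply });
```

In this model the first reply wakes, the duplicate does not, and a reply two hours later clears its record without waking, matching the TTL paragraph above.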
|
|
267
|
+
|
|
268
|
+
### Keeping a long turn marked busy
|
|
269
|
+
|
|
270
|
+
`wake: "auto"` skips a peer that is **freshly `busy`** (mid-turn — its hooks deliver, so a keystroke wake would be noise). The `busy` marker is set at turn start (UserPromptSubmit hook) and **re-stamped on every tool call** (PreToolUse hook, v0.15+), so a long *active* turn stays fresh and never invites a spurious wake. A turn that stops making tool calls — one giant single tool call, or a crash without a clean Stop — ages past `OXTAIL_ACTIVITY_BUSY_TTL_MS` (default 10 min) and then *does* wake, which is the intended stale-busy → recovery behavior. Widen the TTL for deployments with very long single-tool-call turns.
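The effect of re-stamping on the lenient `wake: "auto"` path can be shown with a small freshness check. This is a sketch under stated assumptions: the `marker` shape and the function name are hypothetical, and the exact boundary comparison at the TTL is a guess; the 10 min default is from the paragraph above.

```javascript
// Hypothetical check: does wake:"auto" skip this peer?
const BUSY_TTL_MS = 10 * 60 * 1000; // OXTAIL_ACTIVITY_BUSY_TTL_MS default
const MIN = 60 * 1000;

// marker: { state: "idle" | "busy", stampedMs } or null (assumed shape).
function autoWakeSkipsPeer(marker, nowMs) {
  // Only a fresh "busy" marker suppresses the wake; idle, unknown, and stale wake.
  return marker != null && marker.state === "busy" && nowMs - marker.stampedMs < BUSY_TTL_MS;
}

// Active 30-minute turn: the last tool call re-stamped the marker at minute 30.
const activeTurn = { state: "busy", stampedMs: 30 * MIN };
// Stalled turn: stamped once at turn start, then one 15-minute tool call.
const stalledTurn = { state: "busy", stampedMs: 0 };

console.log(autoWakeSkipsPeer(activeTurn, 31 * MIN));
console.log(autoWakeSkipsPeer(stalledTurn, 15 * MIN));
```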
|
|
252
271
|
|
|
253
272
|
### Recommended permissions for autonomous agent-to-agent collaboration
|
|
254
273
|
|
|
@@ -306,8 +325,9 @@ A scheduled CI job (`.github/workflows/codex-drift.yml`, also runnable on demand
|
|
|
306
325
|
|
|
307
326
|
## Status
|
|
308
327
|
|
|
309
|
-
v0.
|
|
328
|
+
v0.15.0. Pushes the autonomous peer-messaging matrix toward zero human relay, hardens the wake path, makes correlated replies atomic, and makes delegation durable across long (minutes-to-hours) efforts.
|
|
310
329
|
|
|
330
|
+
- **Durable `ask_peer` + long-effort liveness (v0.15.0).** A timed-out `ask_peer` records a pending obligation (`~/.oxtail/pending-ask/`, keyed on requester `session_id` + `request_id`, written *before* a final authoritative union-drain), so the peer's reply — arriving minutes or hours later — *wakes the requester back* (`wake_reason: "late_reply_to_pending"`) instead of landing silently. The pull-back takes the lenient wake path, so it reaches even a markerless idle Codex requester — closing the last wake-on-reply asymmetry. The reply drain unions the requester's sibling MCP-child pids (and sweeps migrate-crash duplicates) so a dual-scope reply can't strand. Separately, the `PreToolUse` hook now re-stamps the `busy` marker every tool call, so a long *active* turn never reads as stale-busy and invites a spurious wake. New env: `OXTAIL_PENDING_ASK_TTL_MS` (1h), `OXTAIL_ACTIVITY_BUSY_TTL_MS` (10m); `ask_peer` default timeout 45s→60s.
|
|
311
331
|
- **Reply by id (v0.13.0).** `reply_to_message(message_id, body)` removes the manual `target` + `reply_to` rewiring that silently degraded a correlated exchange into loose mailbox traffic: the server looks the inbound envelope up in a durable per-session **received-ledger** (`~/.oxtail/received/<hash(session_id)>.jsonl`), derives the reply target and `reply_to` itself, and enforces ownership structurally (you can only reply to a message delivered to you). The ledger is written *before* the mailbox line is visible — so a handle the hook displays is always resolvable even though both delivery paths destroy the queue entry once it is handed off. Fail-closed on an unknown/aged-out id.
|
|
312
332
|
- **Wake-on-reply (v0.11.0).** A reply — `send_message` with `reply_to` — auto-wakes a freshly-idle requester by default, so an awaited answer doesn't strand an idle peer. Strictly gated (fresh-idle only, per-target rate limit, one-wake dedupe, `OXTAIL_AUTOWAKE=off` kill-switch). `wake:"off"` opts out; explicit `wake:"auto"` is the escape hatch for a requester without an idle marker (Codex / hookless Claude).
|
|
313
333
|
- **Wake hardening (v0.12.0).** Wake keystrokes only ever target the pane the process tree confirms hosts the peer's `server_pid` — never a self-written `tmux_pane`/`tmux_session`, and registry entries whose `server_pid` doesn't match their filename are rejected. Rapid repeat wakes to one peer are coalesced (`skipped_debounced`). `oxtail diagnose` summarizes wake outcomes from `MCP_TRACE_FILE`, and a scheduled CI job flags drift in Codex's paste-burst window before it can break the wake.
|
package/assets/pretooluse.sh
CHANGED
|
@@ -42,6 +42,18 @@ if [ ! -t 0 ]; then
|
|
|
42
42
|
fi
|
|
43
43
|
[ -z "$sid" ] && exit 0
|
|
44
44
|
|
|
45
|
+
# Re-stamp "busy" on EVERY tool call (before any early-exit below) so a long,
|
|
46
|
+
# ACTIVE turn keeps a fresh marker and never reads as stale-busy (>TTL) to a
|
|
47
|
+
# peer's wake:auto. UserPromptSubmit sets "busy" once at turn start; without this
|
|
48
|
+
# a turn outrunning the TTL would invite a spurious keystroke wake into a working
|
|
49
|
+
# agent. The Stop hook flips this back to "idle" on a real stop. Keyed by
|
|
50
|
+
# session_id; sanitization MUST match the server's activitySessionKey().
|
|
51
|
+
safe_sid=$(printf '%s' "$sid" | tr -c 'A-Za-z0-9_-' '_')
|
|
52
|
+
[ -n "$safe_sid" ] && {
|
|
53
|
+
mkdir -p "$HOME/.oxtail/activity" 2>/dev/null || true
|
|
54
|
+
printf 'busy' > "$HOME/.oxtail/activity/$safe_sid" 2>/dev/null || true
|
|
55
|
+
}
|
|
56
|
+
|
|
45
57
|
sessions_dir="$HOME/.oxtail/sessions"
|
|
46
58
|
mailboxes_dir="$HOME/.oxtail/mailboxes"
|
|
47
59
|
[ -d "$sessions_dir" ] || exit 0
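The hook comment requires the shell sanitization to match the server's `activitySessionKey()`. That function's body is not shown in this diff, so the sketch below is only an assumed JavaScript counterpart of `tr -c 'A-Za-z0-9_-' '_'`, with a hypothetical name.

```javascript
// Hypothetical JS counterpart of: printf '%s' "$sid" | tr -c 'A-Za-z0-9_-' '_'
function sanitizeSessionId(sid) {
  // Replace every character outside the allowed set with "_".
  return sid.replace(/[^A-Za-z0-9_-]/g, "_");
}

console.log(sanitizeSessionId("3f2c9a10-1b2c-4d5e-8f90-abcdef012345"));
console.log(sanitizeSessionId("../etc/passwd"));
```

One caveat: `tr` may operate on bytes while this regex operates on UTF-16 code units, so the two can emit a different number of underscores for non-ASCII input. For UUID session ids the outputs are identical.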
|
package/dist/claims.js
CHANGED
|
@@ -155,9 +155,6 @@ function compareClaimScores(a, b) {
|
|
|
155
155
|
}
|
|
156
156
|
return a.claimed_at - b.claimed_at;
|
|
157
157
|
}
|
|
158
|
-
function scoresTie(a, b) {
|
|
159
|
-
return compareClaimScores(a, b) === 0;
|
|
160
|
-
}
|
|
161
158
|
function claimKey(clientType, cwd, sessionId) {
|
|
162
159
|
return createHash("sha256")
|
|
163
160
|
.update(`${clientType} ${cwd} ${sessionId}`)
|
|
@@ -172,7 +169,14 @@ function claimPath(key) {
|
|
|
172
169
|
// Atomic temp+rename so a concurrent reader never sees a torn write.
|
|
173
170
|
export function writeClaim(input) {
|
|
174
171
|
ensureClaimsDir();
|
|
175
|
-
|
|
172
|
+
// Age-only sweep on this hot path. writeClaim can run concurrently with another
|
|
173
|
+
// agent's writeClaim (dual-scope, or two sessions in one project); the
|
|
174
|
+
// transcript-existence check is racy — a transcript that momentarily fails to
|
|
175
|
+
// stat would unlink a sibling's just-written claim (M6). Age is monotonic and
|
|
176
|
+
// race-free, so reclaim only by age here. recoverClaim already skips records
|
|
177
|
+
// whose transcript is gone, and the full (transcript-aware) sweep remains
|
|
178
|
+
// available via a direct gcStaleClaims() call.
|
|
179
|
+
gcStaleClaims(Date.now(), { ageOnly: true });
|
|
176
180
|
const rec = {
|
|
177
181
|
schema_version: 1,
|
|
178
182
|
client_type: input.client_type,
|
|
@@ -245,14 +249,30 @@ export function recoverClaim(clientType, cwd, ancestors, deps = {}) {
|
|
|
245
249
|
matches.sort((a, b) => compareClaimScores(b.score, a.score));
|
|
246
250
|
const best = matches[0];
|
|
247
251
|
const second = matches[1];
|
|
248
|
-
|
|
252
|
+
// Abstain on cross-session ambiguity. Two DISTINCT sessions that overlap the
|
|
253
|
+
// live chain equally (same overlap_count) AND at the same live-chain depth
|
|
254
|
+
// (same nearest_overlap_current) share liveness only at a common ancestor —
|
|
255
|
+
// the shared terminal/login-shell. The remaining tiebreakers (record-side
|
|
256
|
+
// depth, recency) do NOT correlate with which child actually restarted, so
|
|
257
|
+
// adopting either would risk cross-session misrouting (H1) — the very
|
|
258
|
+
// split-identity class this store exists to prevent. Return null so the caller
|
|
259
|
+
// falls back to the explicit claim_session next_step. (This strictly subsumes
|
|
260
|
+
// the old exact-tie check, which had equal overlap_count and nearest_current
|
|
261
|
+
// by definition.) A same-session second-best routes to the same identity and
|
|
262
|
+
// so can never split-route — defensive only, since the per-session claim key
|
|
263
|
+
// means two records can't share a session_id.
|
|
264
|
+
if (second &&
|
|
265
|
+
best.rec.session_id !== second.rec.session_id &&
|
|
266
|
+
best.score.overlap_count === second.score.overlap_count &&
|
|
267
|
+
best.score.nearest_overlap_current === second.score.nearest_overlap_current) {
|
|
249
268
|
return null;
|
|
269
|
+
}
|
|
250
270
|
return best.rec;
|
|
251
271
|
}
|
|
252
272
|
// Drop records that are clearly dead: transcript gone, or older than the max
|
|
253
273
|
// age. Best-effort; never throws. A dead process pid alone is NOT grounds for
|
|
254
274
|
// removal — that's exactly the restart case recovery exists to serve.
|
|
255
|
-
export function gcStaleClaims(nowMs = Date.now()) {
|
|
275
|
+
export function gcStaleClaims(nowMs = Date.now(), opts = {}) {
|
|
256
276
|
const dir = claimsDir();
|
|
257
277
|
if (!existsSync(dir))
|
|
258
278
|
return;
|
|
@@ -274,7 +294,7 @@ export function gcStaleClaims(nowMs = Date.now()) {
|
|
|
274
294
|
catch {
|
|
275
295
|
continue;
|
|
276
296
|
}
|
|
277
|
-
const transcriptGone = !rec.transcript_path || !existsSync(rec.transcript_path);
|
|
297
|
+
const transcriptGone = !opts.ageOnly && (!rec.transcript_path || !existsSync(rec.transcript_path));
|
|
278
298
|
const tooOld = nowMs - rec.claimed_at * 1000 > CLAIM_MAX_AGE_MS;
|
|
279
299
|
if (transcriptGone || tooOld) {
|
|
280
300
|
try {
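The new abstain condition in `recoverClaim` can be read as a standalone predicate. The sketch below lifts the three comparisons from the added lines; the function name `shouldAbstain` and the sample records are illustrative only.

```javascript
// Predicate form of the cross-session ambiguity check added to recoverClaim.
function shouldAbstain(best, second) {
  return Boolean(second) &&
    best.rec.session_id !== second.rec.session_id &&
    best.score.overlap_count === second.score.overlap_count &&
    best.score.nearest_overlap_current === second.score.nearest_overlap_current;
}

// Made-up candidates: two distinct sessions sharing only a common ancestor.
const a = { rec: { session_id: "s-1" }, score: { overlap_count: 1, nearest_overlap_current: 3 } };
const b = { rec: { session_id: "s-2" }, score: { overlap_count: 1, nearest_overlap_current: 3 } };
// A candidate that overlaps the live chain more deeply is unambiguous.
const c = { rec: { session_id: "s-3" }, score: { overlap_count: 2, nearest_overlap_current: 1 } };

console.log(shouldAbstain(a, b)); // ambiguous pair
console.log(shouldAbstain(c, a)); // clear winner
console.log(shouldAbstain(a, undefined)); // single match
```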
|
package/dist/detect/birthTimeMatchStrategy.js
CHANGED
|
@@ -2,6 +2,12 @@ import { closeSync, existsSync, openSync, readSync, readdirSync, statSync } from
|
|
|
2
2
|
import { homedir } from "node:os";
|
|
3
3
|
import { join } from "node:path";
|
|
4
4
|
const FIVE_MIN_MS = 5 * 60 * 1000;
|
|
5
|
+
// started_at is whole-second granularity (Math.floor(Date.now()/1000)*1000)
|
|
6
|
+
// while a transcript's birth_ms is real-millisecond, so a transcript
|
|
7
|
+
// legitimately created in the same second can land slightly BEFORE started_at
|
|
8
|
+
// (delta in [-1000, 0]). Allow one second of grace below zero so the unique
|
|
9
|
+
// candidate isn't dropped on pure rounding (M7).
|
|
10
|
+
const ONE_SECOND_MS = 1000;
|
|
5
11
|
const UUID_RE = /([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})/;
|
|
6
12
|
// Returns the unique post-start candidate inside the window, or null if there
|
|
7
13
|
// are zero or multiple. Multiple positive-delta candidates means another
|
|
@@ -10,7 +16,7 @@ const UUID_RE = /([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})/
|
|
|
10
16
|
export function pickByDelta(candidates, startedAtMs, windowMs = FIVE_MIN_MS) {
|
|
11
17
|
const ranked = candidates
|
|
12
18
|
.map((c) => ({ ...c, delta: c.birth_ms - startedAtMs }))
|
|
13
|
-
.filter((c) => c.delta >
|
|
19
|
+
.filter((c) => c.delta > -ONE_SECOND_MS && c.delta <= windowMs);
|
|
14
20
|
if (ranked.length !== 1)
|
|
15
21
|
return null;
|
|
16
22
|
return { session_id: ranked[0].session_id, birth_ms: ranked[0].birth_ms };
|
|
@@ -141,7 +147,7 @@ function abstainReason(type, candidates, startedAtMs) {
|
|
|
141
147
|
}
|
|
142
148
|
const ranked = candidates
|
|
143
149
|
.map((c) => ({ ...c, delta: c.birth_ms - startedAtMs }))
|
|
144
|
-
.filter((c) => c.delta >
|
|
150
|
+
.filter((c) => c.delta > -ONE_SECOND_MS && c.delta <= FIVE_MIN_MS);
|
|
145
151
|
if (ranked.length === 0) {
|
|
146
152
|
return {
|
|
147
153
|
abstain: true,
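To see what the one-second grace changes, `pickByDelta` can be exercised with made-up candidates. The function body below restates the lines visible in this hunk; the timestamps and session ids are invented for the demonstration.

```javascript
const FIVE_MIN_MS = 5 * 60 * 1000;
const ONE_SECOND_MS = 1000;

// Restated from the hunk above: the unique candidate inside the window, else null.
function pickByDelta(candidates, startedAtMs, windowMs = FIVE_MIN_MS) {
  const ranked = candidates
    .map((c) => ({ ...c, delta: c.birth_ms - startedAtMs }))
    .filter((c) => c.delta > -ONE_SECOND_MS && c.delta <= windowMs);
  if (ranked.length !== 1) return null;
  return { session_id: ranked[0].session_id, birth_ms: ranked[0].birth_ms };
}

const startedAt = 1_700_000_000_000; // whole-second granularity
// Born 400 ms "before" started_at: inside the grace, so it is kept.
const withinGrace = pickByDelta([{ session_id: "a", birth_ms: startedAt - 400 }], startedAt);
// Born a full second earlier: delta is -1000, which the strict ">" excludes.
const outsideGrace = pickByDelta([{ session_id: "b", birth_ms: startedAt - 1000 }], startedAt);
// Two candidates in the window: ambiguous, so the strategy abstains.
const ambiguous = pickByDelta(
  [{ session_id: "a", birth_ms: startedAt + 10 }, { session_id: "b", birth_ms: startedAt + 20 }],
  startedAt,
);
console.log({ withinGrace, outsideGrace, ambiguous });
```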
|
package/dist/locks.js
CHANGED
|
@@ -41,8 +41,13 @@ import { trace } from "./trace.js";
|
|
|
41
41
|
// ALWAYS the single-winner `mkdir(lock)`, so even redundant clears can never
|
|
42
42
|
// produce two owners — the worst they do is race to recreate the lock, which
|
|
43
43
|
// exactly one wins.
|
|
44
|
-
const LOCK_RETRY_LIMIT = 50;
|
|
45
44
|
const LOCK_RETRY_DELAY_MS = 10;
|
|
45
|
+
// Total acquire budget is wall-clock, NOT a fixed retry count: a successful
|
|
46
|
+
// stale-clear retries mkdir immediately (no sleep) so it must not consume the
|
|
47
|
+
// budget without time passing — a count-based budget threw "could not acquire
|
|
48
|
+
// lock" spuriously under contention (H2). 2s is ample for the tiny mailbox/
|
|
49
|
+
// ledger critical sections and well under any caller-level timeout.
|
|
50
|
+
const LOCK_ACQUIRE_TIMEOUT_MS = 2_000;
|
|
46
51
|
function sleepSync(ms) {
|
|
47
52
|
Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
|
|
48
53
|
}
|
|
@@ -166,7 +171,8 @@ export function clearStaleLock(lock, staleMs, traceEvent, traceCtx) {
|
|
|
166
171
|
// releaseDirLock. The caller is responsible for creating the parent directory.
|
|
167
172
|
export function acquireDirLock(lock, staleMs, traceEvent, traceCtx) {
|
|
168
173
|
const token = mintToken();
|
|
169
|
-
|
|
174
|
+
const deadline = Date.now() + LOCK_ACQUIRE_TIMEOUT_MS;
|
|
175
|
+
for (;;) {
|
|
170
176
|
try {
|
|
171
177
|
mkdirSync(lock, { mode: 0o700 });
|
|
172
178
|
writeOwner(lock, token);
|
|
@@ -176,12 +182,19 @@ export function acquireDirLock(lock, staleMs, traceEvent, traceCtx) {
|
|
|
176
182
|
const err = e;
|
|
177
183
|
if (err.code !== "EEXIST")
|
|
178
184
|
throw err;
|
|
179
|
-
|
|
180
|
-
|
|
181
|
-
|
|
185
|
+
// A successful stale-clear means the lock is gone: loop straight back to
|
|
186
|
+
// mkdir WITHOUT sleeping, to grab it before another contender (this retry
|
|
187
|
+
// must not consume the budget without time passing). Otherwise — a fresh
|
|
188
|
+
// holder or a lost steal — back off before retrying.
|
|
189
|
+
if (!clearStaleLock(lock, staleMs, traceEvent, traceCtx)) {
|
|
190
|
+
sleepSync(LOCK_RETRY_DELAY_MS);
|
|
191
|
+
}
|
|
192
|
+
}
|
|
193
|
+
// Wall-clock budget so the no-sleep stale-clear path cannot spin forever.
|
|
194
|
+
if (Date.now() >= deadline) {
|
|
195
|
+
throw new Error(`could not acquire lock at ${lock}`);
|
|
182
196
|
}
|
|
183
197
|
}
|
|
184
|
-
throw new Error(`could not acquire lock at ${lock}`);
|
|
185
198
|
}
|
|
186
199
|
// Release the lock — but only if we PROVABLY still own it (owner === our token).
|
|
187
200
|
// A holder that stalled past the stale window and was stolen from sees a
|
|
@@ -192,7 +205,13 @@ export function acquireDirLock(lock, staleMs, traceEvent, traceCtx) {
|
|
|
192
205
|
// stale lock and is reclaimed by clearStaleLock, strictly safer than a stomp.
|
|
193
206
|
export function releaseDirLock(lock, token) {
|
|
194
207
|
if (!token) {
|
|
195
|
-
|
|
208
|
+
// No token to prove ownership. An empty token reaches here only from a
|
|
209
|
+
// lockTokens Map miss (an acquire that threw, or a future same-key nested
|
|
210
|
+
// release), so removing would stomp whatever lock currently exists —
|
|
211
|
+
// possibly a DIFFERENT owner's fresh one. Leave it: a genuinely leaked lock
|
|
212
|
+
// ages into a stale lock and is reclaimed by clearStaleLock, strictly safer
|
|
213
|
+
// than a stomp (H3).
|
|
214
|
+
trace("lock_release_skipped_no_token", { lock });
|
|
196
215
|
return;
|
|
197
216
|
}
|
|
198
217
|
const owner = readOwner(lock);
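The switch from a retry count to a wall-clock budget in `acquireDirLock` can be demonstrated with injected dependencies and a fake clock. This is a sketch, not the package's function: `acquireWithDeadline`, `tryCreate`, `clearStale`, and `makeFakeClock` are hypothetical names standing in for `mkdirSync`, `clearStaleLock`, and `sleepSync`; the 2s budget and 10ms delay are the constants from the diff.

```javascript
// Same control flow as the new acquire loop, with the filesystem and clock injected.
function acquireWithDeadline({ tryCreate, clearStale, sleep, now, timeoutMs = 2000, retryDelayMs = 10 }) {
  const deadline = now() + timeoutMs;
  for (let attempts = 1; ; attempts++) {
    if (tryCreate()) return attempts;
    // A successful stale-clear retries immediately; otherwise back off.
    if (!clearStale()) sleep(retryDelayMs);
    if (now() >= deadline) throw new Error("could not acquire lock");
  }
}

function makeFakeClock() {
  let t = 0;
  return { now: () => t, sleep: (ms) => { t += ms; } };
}

// Case A: 100 failed creates, each followed by a successful stale-clear.
// No time passes, so the wall-clock budget is untouched.
let failuresLeft = 100;
const attemptsA = acquireWithDeadline({
  tryCreate: () => failuresLeft-- <= 0,
  clearStale: () => true,
  ...makeFakeClock(),
});

// Case B: a live holder never releases; the loop sleeps until the deadline.
const clockB = makeFakeClock();
let heldError = null;
try {
  acquireWithDeadline({ tryCreate: () => false, clearStale: () => false, ...clockB });
} catch (e) {
  heldError = e;
}
console.log({ attemptsA, gaveUpAtMs: clockB.now(), message: heldError && heldError.message });
```

Case A needs more attempts than the removed `LOCK_RETRY_LIMIT = 50` would have allowed, yet it succeeds because no time elapsed. Case B still gives up once the 2s budget is spent.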
|
package/dist/mailbox.js
CHANGED
|
@@ -188,6 +188,13 @@ export function requeueMany(target_pid, msgs) {
|
|
|
188
188
|
// two unioned sibling mailboxes. Both copies are drained (so neither lingers) but
|
|
189
189
|
// the message is returned ONCE. message_id is a unique per-message nonce, so this
|
|
190
190
|
// only ever collapses true duplicates, never two distinct messages.
|
|
191
|
+
// Union-drain a session's mailboxes (one per server_pid it has used), deduping
|
|
192
|
+
// by message_id so a migrate crash-window duplicate (same id in two sibling
|
|
193
|
+
// mailboxes) is delivered once. INVARIANT: every unioned pid is drained (and so
|
|
194
|
+
// truncated) before returning — do NOT add a budget/early-exit short-circuit
|
|
195
|
+
// here. The dedup is per-call only, so a duplicate left in an un-drained sibling
|
|
196
|
+
// would re-surface on a later call with no cross-call dedup (M5). Budgeting
|
|
197
|
+
// belongs in the caller, applied to the already-fully-drained result.
|
|
191
198
|
export function drainMany(pids) {
|
|
192
199
|
const out = [];
|
|
193
200
|
const seenPids = new Set();
|
|
@@ -265,8 +272,42 @@ export function migrateMailbox(fromPid, toPid) {
|
|
|
265
272
|
}
|
|
266
273
|
if (!raw || !raw.trim())
|
|
267
274
|
return 0;
|
|
268
|
-
|
|
269
|
-
|
|
275
|
+
// Migrate only VALID records, reserialized canonically. A crash mid-append
|
|
276
|
+
// into the source can leave a torn final line; copying raw bytes would glue
|
|
277
|
+
// a synthesized newline onto that fragment, promoting garbage into a
|
|
278
|
+
// standalone (unparseable) line in the dest AND over-counting it (H4). Parse
|
|
279
|
+
// each line with the same guard drain uses, drop torn/invalid ones, and
|
|
280
|
+
// rebuild a clean block so the count reflects real, deliverable messages.
|
|
281
|
+
const valid = [];
|
|
282
|
+
for (const line of raw.split("\n")) {
|
|
283
|
+
if (!line.trim())
|
|
284
|
+
continue;
|
|
285
|
+
let parsed;
|
|
286
|
+
try {
|
|
287
|
+
parsed = JSON.parse(line);
|
|
288
|
+
}
|
|
289
|
+
catch {
|
|
290
|
+
trace("mailbox_migrate_skip_invalid", { fromPid, toPid, line });
|
|
291
|
+
continue;
|
|
292
|
+
}
|
|
293
|
+
if (parsed &&
|
|
294
|
+
typeof parsed === "object" &&
|
|
295
|
+
parsed.schema_version === 1 &&
|
|
296
|
+
typeof parsed.id === "string" &&
|
|
297
|
+
typeof parsed.body === "string") {
|
|
298
|
+
valid.push(parsed);
|
|
299
|
+
}
|
|
300
|
+
else {
|
|
301
|
+
trace("mailbox_migrate_skip_invalid", { fromPid, toPid, line });
|
|
302
|
+
}
|
|
303
|
+
}
|
|
304
|
+
if (valid.length === 0) {
|
|
305
|
+
// Only torn/garbage lines — clear the source and report nothing migrated.
|
|
306
|
+
truncateSync(src, 0);
|
|
307
|
+
return 0;
|
|
308
|
+
}
|
|
309
|
+
// serializeMailboxLine already terminates each line with "\n", so join("").
|
|
310
|
+
const block = valid.map((m) => serializeMailboxLine(m)).join("");
|
|
270
311
|
acquireLock(toPid);
|
|
271
312
|
try {
|
|
272
313
|
appendLines(mailboxPath(toPid), block);
|
|
@@ -276,7 +317,7 @@ export function migrateMailbox(fromPid, toPid) {
|
|
|
276
317
|
}
|
|
277
318
|
// Append succeeded → clear the source (still under the source lock).
|
|
278
319
|
truncateSync(src, 0);
|
|
279
|
-
return
|
|
320
|
+
return valid.length;
|
|
280
321
|
}
|
|
281
322
|
finally {
|
|
282
323
|
releaseLock(fromPid);
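The line filter added to `migrateMailbox` can be tried in isolation. The guard below restates the checks from the hunk; the wrapper name `selectValidMailboxLines` and the sample mailbox contents are illustrative, and locking, tracing, and `serializeMailboxLine` are left out.

```javascript
// Keep only lines that parse and satisfy the same guard the hunk applies.
function selectValidMailboxLines(raw) {
  const valid = [];
  for (const line of raw.split("\n")) {
    if (!line.trim()) continue;
    let parsed;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // torn or garbage line
    }
    if (parsed &&
        typeof parsed === "object" &&
        parsed.schema_version === 1 &&
        typeof parsed.id === "string" &&
        typeof parsed.body === "string") {
      valid.push(parsed);
    }
  }
  return valid;
}

// Two complete records followed by a torn final line from a crash mid-append.
const raw =
  JSON.stringify({ schema_version: 1, id: "m1", body: "first" }) + "\n" +
  JSON.stringify({ schema_version: 1, id: "m2", body: "second" }) + "\n" +
  '{"schema_version":1,"id":"m3","bo';

const migrated = selectValidMailboxLines(raw);
console.log(migrated.length, migrated.map((m) => m.id));
```

Counting `migrated.length` rather than raw lines is what keeps the torn fragment out of the returned total.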
|
|
@@ -349,6 +390,79 @@ export function drainMatchingSession(my_pid, from_session_id) {
|
|
|
349
390
|
export function drainMatchingReply(my_pid, from_session_id, reply_to) {
|
|
350
391
|
return drainFirstMatching(my_pid, (msg) => msg.from_session_id === from_session_id && msg.reply_to === reply_to);
|
|
351
392
|
}
|
|
393
|
+
// Union variant of drainMatchingReply across a session's sibling/previous MCP
|
|
394
|
+
// child pids. ask_peer waits on the requester's OWN pid, but the reply is
|
|
395
|
+
// addressed by client.session_id and resolveTarget(readAll) enqueues it to the
|
|
396
|
+
// session's freshest sibling — which, in a dual-scope / pid-rotation setup, may
|
|
397
|
+
// NOT be the pid blocked in ask_peer. A single-pid drain would then miss a reply
|
|
398
|
+
// that already landed in a sibling mailbox and strand it. Mirrors the session
|
|
399
|
+
// union read_my_messages / the PreToolUse hook already use.
|
|
400
|
+
//
|
|
401
|
+
// Returns the FIRST matching reply across the (deduped) pids. It does NOT pull
|
|
402
|
+
// every match: two DISTINCT replies to the same request_id (an answer + a
|
|
403
|
+
// follow-up correction) must not both be drained with one silently dropped — the
|
|
404
|
+
// second stays for read_my_messages. But once the first match is found, it DOES
|
|
405
|
+
// sweep an exact same-message_id duplicate out of the remaining pids: a
|
|
406
|
+
// migrate-crash can leave the SAME message in two siblings, and if we returned
|
|
407
|
+
// one copy and left the other, a later union drain would see only the lone
|
|
408
|
+
// survivor and re-deliver it as a "new" message. Sweeping by message_id removes
|
|
409
|
+
// the duplicate while leaving any distinct reply intact.
|
|
410
|
+
//
|
|
411
|
+
// `skipped` reports pids that could not be inspected (lock contention after the
|
|
412
|
+
// internal acquire-retry budget). The poll tolerates this (it retries next tick);
|
|
413
|
+
// the authoritative final drain in ask_peer retries the skipped pids so a
|
|
414
|
+
// transiently-locked sibling holding the reply isn't mistaken for "no reply".
|
|
415
|
+
export function drainMatchingReplyManyChecked(pids, from_session_id, reply_to) {
|
|
416
|
+
const seen = new Set();
|
|
417
|
+
const skipped = [];
|
|
418
|
+
let found = null;
|
|
419
|
+
for (const pid of pids) {
|
|
420
|
+
if (seen.has(pid))
|
|
421
|
+
continue;
|
|
422
|
+
seen.add(pid);
|
|
423
|
+
try {
|
|
424
|
+
if (!found) {
|
|
425
|
+
const m = drainMatchingReply(pid, from_session_id, reply_to);
|
|
426
|
+
if (m)
|
|
427
|
+
found = m;
|
|
428
|
+
}
|
|
429
|
+
else {
|
|
430
|
+
// Sweep an exact-message_id duplicate (migrate-crash) from this sibling;
|
|
431
|
+
// a distinct reply (different id) is left untouched.
|
|
432
|
+
const dupId = found.id;
|
|
433
|
+
drainFirstMatching(pid, (msg) => msg.id === dupId);
|
|
434
|
+
}
|
|
435
|
+
}
|
|
436
|
+
catch {
|
|
437
|
+
skipped.push(pid);
|
|
438
|
+
}
|
|
439
|
+
}
|
|
440
|
+
return { reply: found, skipped };
|
|
441
|
+
}
|
|
442
|
+
export function drainMatchingReplyMany(pids, from_session_id, reply_to) {
|
|
443
|
+
return drainMatchingReplyManyChecked(pids, from_session_id, reply_to).reply;
|
|
444
|
+
}
|
|
445
|
+
// Best-effort removal of an EXACT message_id from each of `pids`. Used to clean
|
|
446
|
+
// up a migrate-crash duplicate that was left in a pid the union drain couldn't
|
|
447
|
+
// inspect (lock contention) at the time the reply was pulled from another pid —
|
|
448
|
+
// otherwise a later read_my_messages would re-deliver the lone survivor as a
|
|
449
|
+
// "new" message. Matches by message_id only, so a DISTINCT reply (different id)
|
|
450
|
+
// in the same pid is never touched. Per-pid errors are skipped.
|
|
451
|
+
export function sweepMessageId(pids, messageId) {
|
|
452
|
+
const seen = new Set();
|
|
453
|
+
for (const pid of pids) {
|
|
454
|
+
if (seen.has(pid))
|
|
455
|
+
continue;
|
|
456
|
+
seen.add(pid);
|
|
457
|
+
try {
|
|
458
|
+
drainFirstMatching(pid, (msg) => msg.id === messageId);
|
|
459
|
+
}
|
|
460
|
+
catch {
|
|
461
|
+
// best effort — a still-locked pid is left; the dup is a rare crash-window
|
|
462
|
+
// artifact and the cost is at most one re-delivered (same-id) message.
|
|
463
|
+
}
|
|
464
|
+
}
|
|
465
|
+
}
|
|
352
466
|
function drainFirstMatching(my_pid, matches) {
|
|
353
467
|
acquireLock(my_pid);
|
|
354
468
|
try {
|
|
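Reviewer sketch for the mailbox.js hunk above: the union drain returns the first matching reply, sweeps an exact same-id duplicate from the remaining pids, leaves a distinct follow-up reply alone, and reports pids it could not inspect. The in-memory `mailboxes` map, the `lockedPids` set and the name `drainReplyAcross` are assumptions made for illustration; the real code works on locked mailbox files.

```javascript
// Hypothetical in-memory stand-in for the per-pid mailbox files.
const mailboxes = new Map([
  [101, [{ id: "m1", from_session_id: "peer", reply_to: "req-1", body: "answer" }]],
  [102, [
    // Same id as in pid 101: models a migrate-crash duplicate.
    { id: "m1", from_session_id: "peer", reply_to: "req-1", body: "answer" },
    // Different id: a distinct follow-up reply to the same request.
    { id: "m2", from_session_id: "peer", reply_to: "req-1", body: "correction" },
  ]],
]);
const lockedPids = new Set([103]); // pids whose lock cannot be acquired

// Remove and return the first message matching `pred`, or null.
function drainFirstMatching(pid, pred) {
  if (lockedPids.has(pid)) throw new Error("lock contention");
  const box = mailboxes.get(pid) ?? [];
  const i = box.findIndex(pred);
  return i === -1 ? null : box.splice(i, 1)[0];
}

function drainReplyAcross(pids, fromSessionId, replyTo) {
  const skipped = [];
  let found = null;
  for (const pid of new Set(pids)) { // Set dedupes repeated pids, keeps order
    try {
      if (!found) {
        found = drainFirstMatching(
          pid,
          (m) => m.from_session_id === fromSessionId && m.reply_to === replyTo,
        );
      } else {
        // Sweep only an exact-id duplicate; other replies stay queued.
        const dupId = found.id;
        drainFirstMatching(pid, (m) => m.id === dupId);
      }
    } catch {
      skipped.push(pid);
    }
  }
  return { reply: found, skipped };
}

const result = drainReplyAcross([101, 102, 103, 101], "peer", "req-1");
```

In this run the reply comes from pid 101, the duplicate is swept from pid 102 while the correction stays there, and pid 103 is reported as skipped so a caller can retry it.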
package/dist/pending-ask.js
@@ -0,0 +1,167 @@
|
|
|
1
|
+
// Pending-ask registry — durable ask_peer (the long-effort liveness fix).
|
|
2
|
+
//
|
|
3
|
+
// When an ask_peer wait TIMES OUT, the requester records a "pending ask" here:
|
|
4
|
+
// a durable note that it is still awaiting a reply correlated by request_id.
|
|
5
|
+
// When that reply eventually arrives — minutes or hours later, long after the
|
|
6
|
+
// 5-minute fresh-idle window the strict reply-default wake is gated to — the
|
|
7
|
+
// reply handler (server.ts resolveSendWake) finds the matching record and fires
|
|
8
|
+
// a LENIENT wake to pull the requester back, instead of stranding it idle until
|
|
9
|
+
// its next turn. This is what turns ask_peer into "delegate a long task and get
|
|
10
|
+
// pulled back the moment it's done", and it also reaches a markerless idle Codex
|
|
11
|
+
// requester that the fresh-idle gate would skip as skipped_no_fresh_idle.
|
|
12
|
+
//
|
|
13
|
+
// Design mirrors autowake.ts exactly: one small file per record under
|
|
14
|
+
// ~/.oxtail/pending-ask/, mtime is the source of truth (driven by an injected
|
|
15
|
+
// nowMs so it's deterministic in tests), the body is a debug breadcrumb, GC'd by
|
|
16
|
+
// age. Keyed on the REQUESTER's client.session_id + the request_id (the agent
|
|
17
|
+
// identity per AGENTS.md, never server_pid). Best-effort: a broken store
|
|
18
|
+
// degrades to "no record" — it NEVER throws, because a thrown error here would
|
|
19
|
+
// surface on an already-enqueued/already-delivered message and invite a retry.
|
|
20
|
+
import { createHash } from "node:crypto";
|
|
21
|
+
import { closeSync, mkdirSync, openSync, readdirSync, statSync, unlinkSync, utimesSync, writeFileSync, } from "node:fs";
|
|
22
|
+
import { homedir } from "node:os";
|
|
23
|
+
import { join } from "node:path";
|
|
24
|
+
function envPosInt(name, def, env = process.env) {
|
|
25
|
+
const v = env[name];
|
|
26
|
+
if (!v)
|
|
27
|
+
return def;
|
|
28
|
+
const n = Number(v);
|
|
29
|
+
return Number.isFinite(n) && n > 0 ? n : def;
|
|
30
|
+
}
|
|
31
|
+
// How long a recorded pending-ask is honored before GC reclaims it. Sized for
|
|
32
|
+
// long efforts (a delegated task that runs for the better part of an hour) — a
|
|
33
|
+
// reply arriving after this window still delivers durably via read_my_messages,
|
|
34
|
+
// it just won't fire the pull-back wake. Generous by default; tunable.
|
|
35
|
+
export const PENDING_ASK_TTL_MS = envPosInt("OXTAIL_PENDING_ASK_TTL_MS", 60 * 60 * 1000);
|
|
36
|
+
export function defaultPendingAskDir() {
|
|
37
|
+
return join(homedir(), ".oxtail", "pending-ask");
|
|
38
|
+
}
|
|
39
|
+
function hash(s) {
|
|
40
|
+
// request_id is caller-influenced, so never build a filename from it directly.
|
|
41
|
+
return createHash("sha256").update(s).digest("hex").slice(0, 32);
|
|
42
|
+
}
|
|
43
|
+
function recordPath(dir, sessionId, requestId) {
|
|
44
|
+
// JSON-encode the pair so the (sessionId, requestId) boundary is unambiguous
|
|
45
|
+
// and can't be crafted to collide with a different split (mirrors autowake.ts).
|
|
46
|
+
return join(dir, `p-${hash(JSON.stringify([sessionId, requestId]))}`);
|
|
47
|
+
}
|
|
48
|
+
function setMtime(path, nowMs) {
|
|
49
|
+
const t = nowMs / 1000;
|
|
50
|
+
try {
|
|
51
|
+
utimesSync(path, t, t);
|
|
52
|
+
}
|
|
53
|
+
catch {
|
|
54
|
+
// best effort — mtime drives TTL math; a failure only skews freshness by the
|
|
55
|
+
// small real-vs-injected clock delta.
|
|
56
|
+
}
|
|
57
|
+
}
|
|
58
|
+
// Record a pending ask. Atomic create-exclusive so a duplicate record (same
|
|
59
|
+
// requester + request_id) is a no-op rather than resetting the TTL clock.
|
|
60
|
+
// Returns true if a record now exists for this pair (freshly written OR already
|
|
61
|
+
// present), false only on a missing identity or an unusable store. Never throws.
|
|
62
|
+
export function recordPendingAsk(dir, sessionId, requestId, nowMs) {
|
|
63
|
+
// Never key on an empty identity: an unclaimed requester can't be correlated
|
|
64
|
+
// or replied-to, so there's nothing to wake later.
|
|
65
|
+
if (!sessionId || !requestId)
|
|
66
|
+
return false;
|
|
67
|
+
try {
|
|
68
|
+
mkdirSync(dir, { recursive: true, mode: 0o700 });
|
|
69
|
+
const p = recordPath(dir, sessionId, requestId);
|
|
70
|
+
try {
|
|
71
|
+
const fd = openSync(p, "wx"); // atomic create-exclusive
|
|
72
|
+
try {
|
|
73
|
+
writeFileSync(fd, JSON.stringify({ sessionId, requestId, at: nowMs }));
|
|
74
|
+
}
|
|
75
|
+
finally {
|
|
76
|
+
closeSync(fd);
|
|
77
|
+
}
|
|
78
|
+
setMtime(p, nowMs);
|
|
79
|
+
return true;
|
|
80
|
+
}
|
|
81
|
+
catch (e) {
|
|
82
|
+
// EEXIST: a record already exists → fine, leave its original mtime so the
|
|
83
|
+
// TTL counts from the first record, not this duplicate.
|
|
84
|
+
if (e.code === "EEXIST")
|
|
85
|
+
return true;
|
|
86
|
+
throw e;
|
|
87
|
+
}
|
|
88
|
+
}
|
|
89
|
+
catch {
|
|
90
|
+
// Store unusable (e.g. ~/.oxtail/pending-ask is a file, permission error) —
|
|
91
|
+
// degrade to "no durable record"; the strict fresh-idle reply-default still
|
|
92
|
+
// covers a Claude requester that went idle <5 min ago.
|
|
93
|
+
return false;
|
|
94
|
+
}
|
|
95
|
+
}
|
|
96
|
+
// Read-only: is there a live (within TTL) pending-ask for this pair?
|
|
97
|
+
export function hasPendingAsk(dir, sessionId, requestId, nowMs, ttlMs = PENDING_ASK_TTL_MS) {
|
|
98
|
+
if (!sessionId || !requestId)
|
|
99
|
+
return false;
|
|
100
|
+
try {
|
|
101
|
+
const st = statSync(recordPath(dir, sessionId, requestId));
|
|
102
|
+
return nowMs - st.mtimeMs < ttlMs;
|
|
103
|
+
}
|
|
104
|
+
catch {
|
|
105
|
+
return false;
|
|
106
|
+
}
|
|
107
|
+
}
|
|
108
|
+
// Atomically consume (delete) the pending-ask for this pair. Returns true iff a
|
|
109
|
+
// record existed, was within the TTL, and THIS caller removed it — the
|
|
110
|
+
// single-winner signal the reply handler uses to fire exactly one pull-back
|
|
111
|
+
// wake. A concurrent second reply (or a re-delivered duplicate) racing the same
|
|
112
|
+
// key loses: unlinkSync throws ENOENT for the loser, so it returns false and
|
|
113
|
+
// does not re-wake.
|
|
114
|
+
//
|
|
115
|
+
// When nowMs is supplied, an OVER-TTL record is still unlinked (so a stale
|
|
116
|
+
// record can't leak) but the function returns false — honoring the contract that
|
|
117
|
+
// a reply arriving after PENDING_ASK_TTL_MS still delivers durably but does NOT
|
|
118
|
+
// fire the late wake. Omit nowMs to consume regardless of age (used right after
|
|
119
|
+
// recordPendingAsk, where the record is freshly written).
|
|
120
|
+
export function consumePendingAsk(dir, sessionId, requestId, nowMs, ttlMs = PENDING_ASK_TTL_MS) {
|
|
121
|
+
if (!sessionId || !requestId)
|
|
122
|
+
return false;
|
|
123
|
+
const p = recordPath(dir, sessionId, requestId);
|
|
124
|
+
let withinTtl = true;
|
|
125
|
+
if (nowMs !== undefined) {
|
|
126
|
+
try {
|
|
127
|
+
withinTtl = nowMs - statSync(p).mtimeMs < ttlMs;
|
|
128
|
+
}
|
|
129
|
+
catch {
|
|
130
|
+
return false; // no record to consume
|
|
131
|
+
}
|
|
132
|
+
}
|
|
133
|
+
try {
|
|
134
|
+
unlinkSync(p); // remove regardless of age so a stale record can't leak
|
|
135
|
+
}
|
|
136
|
+
catch {
|
|
137
|
+
// ENOENT (no record / already consumed by a racing caller) or any store
|
|
138
|
+
// error → not ours to act on.
|
|
139
|
+
return false;
|
|
140
|
+
}
|
|
141
|
+
return withinTtl;
|
|
142
|
+
}
|
|
143
|
+
// Remove pending-ask records older than the TTL. Cheap, low-volume dir; run
|
|
144
|
+
// opportunistically so abandoned records (a reply that never came) can't
|
|
145
|
+
// accumulate. Mirrors gcAutowake.
|
|
146
|
+
export function gcPendingAsk(dir, nowMs, ttlMs = PENDING_ASK_TTL_MS) {
|
|
147
|
+
let names;
|
|
148
|
+
try {
|
|
149
|
+
names = readdirSync(dir);
|
|
150
|
+
}
|
|
151
|
+
catch {
|
|
152
|
+
return; // dir not created yet
|
|
153
|
+
}
|
|
154
|
+
for (const name of names) {
|
|
155
|
+
if (name[0] !== "p")
|
|
156
|
+
continue;
|
|
157
|
+
const p = join(dir, name);
|
|
158
|
+
try {
|
|
159
|
+
const st = statSync(p);
|
|
160
|
+
if (nowMs - st.mtimeMs >= ttlMs)
|
|
161
|
+
unlinkSync(p);
|
|
162
|
+
}
|
|
163
|
+
catch {
|
|
164
|
+
// best effort
|
|
165
|
+
}
|
|
166
|
+
}
|
|
167
|
+
}
|
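Reviewer sketch of the record/consume contract described in the pending-ask.js comments above: a duplicate record does not reset the TTL clock, exactly one consumer wins, and an over-TTL record is deleted without signalling a wake. This is an in-memory model only; the `records` map and the names `record`, `consume` and `keyFor` are assumptions, whereas the real module keeps one file per record and reads the file mtime.

```javascript
const TTL_MS = 60 * 60 * 1000; // mirrors the 1-hour default in the diff
const records = new Map();     // key -> recorded-at timestamp (ms)

// JSON-encoding the pair keeps the (sessionId, requestId) boundary unambiguous.
const keyFor = (sessionId, requestId) => JSON.stringify([sessionId, requestId]);

function record(sessionId, requestId, nowMs) {
  if (!sessionId || !requestId) return false;
  const k = keyFor(sessionId, requestId);
  if (!records.has(k)) records.set(k, nowMs); // duplicate keeps the original timestamp
  return true;
}

function consume(sessionId, requestId, nowMs, ttlMs = TTL_MS) {
  const k = keyFor(sessionId, requestId);
  if (!records.has(k)) return false; // no record, or a racing caller already won
  const withinTtl = nowMs === undefined || nowMs - records.get(k) < ttlMs;
  records.delete(k);                 // removed regardless of age
  return withinTtl;
}

const MIN = 60 * 1000;
const t0 = 1_000_000;

record("sess-a", "req-1", t0);
record("sess-a", "req-1", t0 + 30 * MIN);                 // no-op: TTL still counts from t0
const first = consume("sess-a", "req-1", t0 + 59 * MIN);  // within TTL: this caller may wake
const second = consume("sess-a", "req-1", t0 + 59 * MIN); // already consumed: no second wake

record("sess-a", "req-2", t0);
const late = consume("sess-a", "req-2", t0 + 61 * MIN);   // over TTL: deleted, no wake
const leftover = records.size;

// Different splits of similar text never share a key.
const distinctKeys = keyFor("a", "b,c") !== keyFor("a,b", "c");
```

The file-based version gets the single-winner property from `unlinkSync` failing for the loser; the map's `has`/`delete` pair plays that role here only because this sketch is single-threaded.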
package/dist/received.js
CHANGED
|
@@ -88,24 +88,43 @@ function readLines(sessionId) {
|
|
|
88
88
|
throw err;
|
|
89
89
|
}
|
|
90
90
|
}
|
|
91
|
+
// The message_id of a serialized ledger line, or null if unparseable. Used to
|
|
92
|
+
// keep recordReceived idempotent without fully deserializing every envelope.
|
|
93
|
+
function lineMessageId(line) {
|
|
94
|
+
try {
|
|
95
|
+
const parsed = JSON.parse(line);
|
|
96
|
+
return typeof parsed.id === "string" ? parsed.id : null;
|
|
97
|
+
}
|
|
98
|
+
catch {
|
|
99
|
+
return null;
|
|
100
|
+
}
|
|
101
|
+
}
|
|
91
102
|
// Append an inbound envelope to the receiver's ledger and prune to receivedMax()
|
|
92
103
|
// (oldest dropped first). Called by delivery.ts BEFORE the mailbox append.
|
|
104
|
+
// Idempotent by message_id: re-recording an id replaces its prior line.
|
|
93
105
|
export function recordReceived(receiverSessionId, msg) {
|
|
94
106
|
if (!receiverSessionId)
|
|
95
107
|
return;
|
|
96
108
|
acquireLock(receiverSessionId);
|
|
97
109
|
try {
|
|
98
110
|
const lines = readLines(receiverSessionId);
|
|
99
|
-
|
|
111
|
+
// Idempotent by message_id: a re-record (ask_peer abort recovery, chained
|
|
112
|
+
// re-delivery) must not append a duplicate ledger line. Duplicates waste the
|
|
113
|
+
// receivedMax prune budget and can evict still-needed handles early,
|
|
114
|
+
// surfacing as spurious reply_to_message "message-not-found" (M4). Drop any
|
|
115
|
+
// prior line for this id, then append the latest. lookupReceived already
|
|
116
|
+
// returns first-match newest-first, so behavior is unchanged for callers.
|
|
117
|
+
const deduped = msg.id ? lines.filter((l) => lineMessageId(l) !== msg.id) : lines;
|
|
118
|
+
deduped.push(JSON.stringify(msg));
|
|
100
119
|
const max = receivedMax();
|
|
101
|
-
let pruned =
|
|
102
|
-
if (
|
|
103
|
-
pruned =
|
|
120
|
+
let pruned = deduped;
|
|
121
|
+
if (deduped.length > max) {
|
|
122
|
+
pruned = deduped.slice(deduped.length - max);
|
|
104
123
|
// No silent caps: a dropped handle becomes reply_to_message
|
|
105
124
|
// "message-not-found", so surface that the bound bit.
|
|
106
125
|
trace("received_ledger_pruned", {
|
|
107
126
|
session_id: receiverSessionId,
|
|
108
|
-
dropped:
|
|
127
|
+
dropped: deduped.length - max,
|
|
109
128
|
kept: max,
|
|
110
129
|
});
|
|
111
130
|
}
|
|
@@ -136,6 +155,8 @@ export function lookupReceived(receiverSessionId, messageId) {
|
|
|
136
155
|
}
|
|
137
156
|
if (parsed &&
|
|
138
157
|
typeof parsed === "object" &&
|
|
158
|
+
parsed.schema_version === 1 &&
|
|
159
|
+
typeof parsed.body === "string" &&
|
|
139
160
|
parsed.id === messageId) {
|
|
140
161
|
return parsed;
|
|
141
162
|
}
|
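Reviewer sketch of the dedupe-then-prune step from the received.js hunk above, pulled out as a pure function so the effect on the ledger is easy to see. `recordLine` and its return shape are assumptions for illustration; the real `recordReceived` reads and writes the ledger file under a lock and emits a trace when pruning.

```javascript
// Parse the id out of a serialized ledger line; null when unparseable.
function lineMessageId(line) {
  try {
    const parsed = JSON.parse(line);
    return typeof parsed.id === "string" ? parsed.id : null;
  } catch {
    return null;
  }
}

// Drop any prior line for msg.id, append the latest, then keep the newest `max`.
function recordLine(lines, msg, max) {
  const deduped = msg.id ? lines.filter((l) => lineMessageId(l) !== msg.id) : [...lines];
  deduped.push(JSON.stringify(msg));
  const dropped = Math.max(0, deduped.length - max);
  return { lines: deduped.slice(dropped), dropped };
}

let ledger = [];
for (const id of ["a", "b", "c"]) {
  ledger = recordLine(ledger, { id, body: `msg ${id}` }, 3).lines;
}

// Re-recording "a" replaces its old line instead of growing the ledger.
const rerecord = recordLine(ledger, { id: "a", body: "msg a (redelivered)" }, 3);
const idsAfterRerecord = rerecord.lines.map(lineMessageId);

// A genuinely new message at the cap evicts the oldest line.
const overflow = recordLine(rerecord.lines, { id: "d", body: "msg d" }, 3);
const idsAfterOverflow = overflow.lines.map(lineMessageId);
```

Without the filter step, the re-record of "a" would have pushed the ledger to four lines and evicted the only other copy of "a" plus nothing else useful; with it, the prune budget is spent only on distinct messages.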
package/dist/registry.js
CHANGED
|
@@ -68,9 +68,21 @@ export function isValidTmuxSession(s) {
|
|
|
68
68
|
// target, so we refuse rather than fall back to the self-written cached value.
|
|
69
69
|
// - the resolved pane isn't a well-formed pane id (tmux output anomaly).
|
|
70
70
|
// resolvePane is injected in tests; production uses currentPaneForServerPid.
|
|
71
|
-
export function chooseVerifiedWakePane(peer, resolvePane = currentPaneForServerPid) {
|
|
71
|
+
export function chooseVerifiedWakePane(peer, resolvePane = currentPaneForServerPid, resolveSig = processStartSig) {
|
|
72
72
|
if (!peer.tmux_pane)
|
|
73
73
|
return null;
|
|
74
|
+
// PID-reuse guard: if the entry recorded the server process's start-time
|
|
75
|
+
// signature, confirm the live pid is STILL that process before resolving and
|
|
76
|
+
// waking its pane. Otherwise an OS-recycled pid — now an unrelated process
|
|
77
|
+
// that happens to sit under a different tmux pane — would resolve to, and get
|
|
78
|
+
// our wake keystrokes typed into, a stranger's pane (M3). Only refuse on a
|
|
79
|
+
// positively-different signature; an empty reading (transient ps failure)
|
|
80
|
+
// falls through to pane resolution, which fails closed for a truly dead pid.
|
|
81
|
+
if (peer.proc_sig) {
|
|
82
|
+
const liveSig = resolveSig(peer.server_pid);
|
|
83
|
+
if (liveSig && liveSig !== peer.proc_sig)
|
|
84
|
+
return null;
|
|
85
|
+
}
|
|
74
86
|
const live = resolvePane(peer.server_pid);
|
|
75
87
|
if (!live || !isValidTmuxPane(live))
|
|
76
88
|
return null;
|
|
@@ -203,11 +215,36 @@ export function resolveTmuxPane(env = process.env, pid = process.pid) {
|
|
|
203
215
|
export function currentPaneForServerPid(serverPid) {
|
|
204
216
|
return findTmuxPaneByAncestry(serverPid, listTmuxPanePids(), listAllPpids());
|
|
205
217
|
}
|
|
218
|
+
// The OS start-time signature (lstart) of a process, or "" if it can't be read
|
|
219
|
+
// (dead pid, or ps unavailable). Same provenance signal claims.ts uses on
|
|
220
|
+
// ancestor pids: an OS-recycled pid yields a DIFFERENT start time, so comparing
|
|
221
|
+
// a live pid's signature against one captured at register time detects pid reuse
|
|
222
|
+
// — distinguishing "our process is still alive" from "the pid now belongs to an
|
|
223
|
+
// unrelated process."
|
|
224
|
+
export function processStartSig(pid) {
|
|
225
|
+
try {
|
|
226
|
+
return execFileSync("ps", ["-o", "lstart=", "-p", String(pid)], {
|
|
227
|
+
encoding: "utf8",
|
|
228
|
+
stdio: ["ignore", "pipe", "pipe"],
|
|
229
|
+
}).trim();
|
|
230
|
+
}
|
|
231
|
+
catch {
|
|
232
|
+
return "";
|
|
233
|
+
}
|
|
234
|
+
}
|
|
235
|
+
// A process's start time never changes, so capture our own once and reuse it.
|
|
236
|
+
let cachedSelfProcSig;
|
|
237
|
+
function selfProcSig() {
|
|
238
|
+
if (cachedSelfProcSig === undefined)
|
|
239
|
+
cachedSelfProcSig = processStartSig(process.pid);
|
|
240
|
+
return cachedSelfProcSig;
|
|
241
|
+
}
|
|
206
242
|
export function buildEntry(client, env = process.env) {
|
|
207
243
|
const tmux_pane = resolveTmuxPane(env);
|
|
208
244
|
return {
|
|
209
245
|
server_pid: process.pid,
|
|
210
246
|
started_at: Math.floor(Date.now() / 1000),
|
|
247
|
+
proc_sig: selfProcSig(),
|
|
211
248
|
client,
|
|
212
249
|
tmux_pane,
|
|
213
250
|
tmux_session: resolveTmuxSessionFromPane(tmux_pane),
|
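Reviewer sketch of the PID-reuse guard added to `chooseVerifiedWakePane` in the registry.js hunk above, using injected resolvers the way the diff says the tests do. The pane-id pattern in `looksLikePane`, the final `return live`, and the sample signature strings are assumptions; the package's own `isValidTmuxPane` and the tail of the real function are not shown in this diff.

```javascript
// Assumed pane-id shape ("%" followed by digits).
const looksLikePane = (s) => /^%\d+$/.test(s);

// Returns the pane to wake, or null when the wake must be refused.
function verifiedWakePane(peer, resolvePane, resolveSig) {
  if (!peer.tmux_pane) return null;
  if (peer.proc_sig) {
    const liveSig = resolveSig(peer.server_pid);
    // Refuse only on a positively different signature; "" means "could not read".
    if (liveSig && liveSig !== peer.proc_sig) return null;
  }
  const live = resolvePane(peer.server_pid);
  if (!live || !looksLikePane(live)) return null;
  return live;
}

const peer = { server_pid: 4242, tmux_pane: "%3", proc_sig: "Mon Jan  6 10:00:00 2025" };
const paneOf = () => "%3";

// Same process still alive: wake proceeds.
const samePid = verifiedWakePane(peer, paneOf, () => "Mon Jan  6 10:00:00 2025");
// Pid recycled by an unrelated process: different start time, wake refused.
const recycled = verifiedWakePane(peer, paneOf, () => "Tue Jan  7 09:30:00 2025");
// Signature unreadable (transient ps failure): falls through to pane resolution.
const unreadable = verifiedWakePane(peer, paneOf, () => "");
// Truly dead pid: no pane resolves, so the gate fails closed.
const deadPid = verifiedWakePane(peer, () => null, () => "");
```

The empty-reading fall-through is the design choice worth noting: treating "" as a mismatch would turn every transient `ps` failure into a refused wake, while the pane resolution step still rejects a pid that is actually gone.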
package/dist/server.js
CHANGED
|
@@ -10,12 +10,13 @@ import { dirname, join, sep } from "node:path";
|
|
|
10
10
|
import { clientFromHandshake, detectClient, enrichWithDiagnosis, transcriptPathFor, } from "./clients.js";
|
|
11
11
|
import { isAbstain } from "./detect/index.js";
|
|
12
12
|
import { trace } from "./trace.js";
|
|
13
|
-
import { buildEntry, chooseVerifiedWakePane, findByTmuxSession, readAll, refreshTmuxBinding, register, sessionPidsForId, unregister, } from "./registry.js";
|
|
13
|
+
import { buildEntry, chooseVerifiedWakePane, findByTmuxSession, processStartSig, readAll, refreshTmuxBinding, register, sessionPidsForId, unregister, } from "./registry.js";
|
|
14
14
|
import * as mailbox from "./mailbox.js";
|
|
15
15
|
import * as received from "./received.js";
|
|
16
16
|
import { deliverExistingToPeer, deliverToPeer } from "./delivery.js";
|
|
17
17
|
import { recoverClaim, resolveAncestors, writeClaim } from "./claims.js";
|
|
18
|
-
import { decideReplyAutoWake, defaultAutowakeDir } from "./autowake.js";
|
|
18
|
+
import { autowakeKillSwitchOff, claimWake, decideReplyAutoWake, defaultAutowakeDir, } from "./autowake.js";
|
|
19
|
+
import { consumePendingAsk, defaultPendingAskDir, gcPendingAsk, recordPendingAsk, } from "./pending-ask.js";
|
|
19
20
|
import { markWoke, newWakeDebounceStore, recentlyWoke } from "./wake-debounce.js";
|
|
20
21
|
// CLI subcommand dispatch must run before any MCP setup so that
|
|
21
22
|
// `npx oxtail install-hook` doesn't open an MCP transport or register a
|
|
@@ -637,6 +638,11 @@ function refineFromHandshake(trigger) {
|
|
|
637
638
|
return diagnosis;
|
|
638
639
|
}
|
|
639
640
|
server.server.oninitialized = () => {
|
|
641
|
+
// Sweep pending-ask records orphaned by a prior session (an ask that timed out,
|
|
642
|
+
// was never answered, and whose owner went away). gcPendingAsk otherwise only
|
|
643
|
+
// runs on a later ask_peer timeout, so this startup sweep keeps the dir from
|
|
644
|
+
// accumulating stale records. Best-effort; never throws.
|
|
645
|
+
gcPendingAsk(defaultPendingAskDir(), Date.now());
|
|
640
646
|
const diagnosis = refineFromHandshake("oninitialized");
|
|
641
647
|
// After type is known via handshake, schedule retries to catch transcript files
|
|
642
648
|
// that don't exist yet at handshake time. No-op if session_id is already set.
|
|
@@ -854,12 +860,16 @@ server.registerTool("get_my_session", {
|
|
|
854
860
|
// strategy mirrors session_id_source so callers can still see whether
|
|
855
861
|
// env / birth-time / self-register resolved this entry.
|
|
856
862
|
const source = entry.client.session_id_source ?? "self-register";
|
|
863
|
+
// Report confidence honestly per source: env and explicit self-register
|
|
864
|
+
// (claim_session) are authoritative ("high"); inferred sources (birth-time,
|
|
865
|
+
// sticky-claim) are "medium" — matching what the detect strategies return.
|
|
866
|
+
const confidence = source === "env" || source === "self-register" ? "high" : "medium";
|
|
857
867
|
diagnosis = {
|
|
858
868
|
per_strategy: {},
|
|
859
869
|
winning: {
|
|
860
870
|
session_id: entry.client.session_id,
|
|
861
871
|
source,
|
|
862
|
-
confidence
|
|
872
|
+
confidence,
|
|
863
873
|
strategy: source,
|
|
864
874
|
},
|
|
865
875
|
next_step: null,
|
|
@@ -972,7 +982,20 @@ function resolveTarget(target, caller) {
|
|
|
972
982
|
const fresh = reReadRegistryEntry(e.server_pid);
|
|
973
983
|
if (!fresh)
|
|
974
984
|
return false;
|
|
975
|
-
|
|
985
|
+
if (fresh.started_at !== e.started_at)
|
|
986
|
+
return false;
|
|
987
|
+
// PID-reuse: started_at is the original registration time and lives on the
|
|
988
|
+
// stale on-disk entry, so a recycled pid (alive, file untouched) passes the
|
|
989
|
+
// check above. If the entry recorded the process start-time signature,
|
|
990
|
+
// confirm the live pid is still that same process; a recycled pid reads a
|
|
991
|
+
// different signature and is rejected (M3). Empty reading → indeterminate,
|
|
992
|
+
// leave it to downstream (the pane wake gate re-verifies before keystrokes).
|
|
993
|
+
if (fresh.proc_sig) {
|
|
994
|
+
const liveSig = processStartSig(e.server_pid);
|
|
995
|
+
if (liveSig && liveSig !== fresh.proc_sig)
|
|
996
|
+
return false;
|
|
997
|
+
}
|
|
998
|
+
return true;
|
|
976
999
|
});
|
|
977
1000
|
if (candidates.length === 0)
|
|
978
1001
|
return { ok: false, error: "target-not-found" };
|
|
@@ -1010,7 +1033,7 @@ function resolveTarget(target, caller) {
|
|
|
1010
1033
|
server.registerTool("send_message", {
|
|
1011
1034
|
description: [
|
|
1012
1035
|
"Fire-and-forget message to a peer in the same project root. Target: a tmux session name OR a client_session_id (UUID). Async via the peer's mailbox — delivered mid-turn (PreToolUse hook) or next-turn (read_my_messages); cross-project targets are rejected.",
|
|
1013
|
-
"A plain message does NOT wake an idle peer. Pass wake:\"auto\" to nudge one via per-client send-keys, state-gated (skipped if the peer is mid-turn). EXCEPTION (wake-on-reply): when you set reply_to, this auto-wakes the requester by default so your answer doesn't strand them idle — pass wake:\"off\" to suppress. The reply-default wake is strictly gated: it fires only for a FRESHLY-IDLE requester (one whose Claude Code hooks maintain a fresh idle marker), with a per-target rate limit and a one-wake dedupe; env kill-switch OXTAIL_AUTOWAKE=off. A requester with no idle marker (Codex, or Claude without the hooks) returns skipped_no_fresh_idle and is NOT auto-woken — use explicit wake:\"auto\" for those. Response carries wake_status (\"fired\" | \"skipped_busy\" | \"skipped_debounced\" | \"skipped_no_fresh_idle\" | \"skipped_rate_limited\" | \"skipped_deduped\" | \"skipped_store_error\" | \"skipped_no_target\" | \"disabled\") and, on the reply path, wake_reason:\"reply_to_default\".",
|
|
1036
|
+
"A plain message does NOT wake an idle peer. Pass wake:\"auto\" to nudge one via per-client send-keys, state-gated (skipped if the peer is mid-turn). EXCEPTION (wake-on-reply): when you set reply_to, this auto-wakes the requester by default so your answer doesn't strand them idle — pass wake:\"off\" to suppress. The reply-default wake is strictly gated: it fires only for a FRESHLY-IDLE requester (one whose Claude Code hooks maintain a fresh idle marker), with a per-target rate limit and a one-wake dedupe; env kill-switch OXTAIL_AUTOWAKE=off. A requester with no idle marker (Codex, or Claude without the hooks) returns skipped_no_fresh_idle and is NOT auto-woken — use explicit wake:\"auto\" for those. Response carries wake_status (\"fired\" | \"skipped_busy\" | \"skipped_debounced\" | \"skipped_no_fresh_idle\" | \"skipped_rate_limited\" | \"skipped_deduped\" | \"skipped_store_error\" | \"skipped_no_target\" | \"disabled\") and, on the reply path, wake_reason:\"reply_to_default\" — or wake_reason:\"late_reply_to_pending\" when this reply answers an ask_peer that had timed out (durably pulls the requester back regardless of the fresh-idle window; \"late_reply_to_pending_suppressed\" if you passed wake:\"off\").",
|
|
1014
1037
|
"Body is verbatim — wrap in <system-reminder>...</system-reminder> yourself if you want that framing. When replying to ask_peer, include reply_to: request_id from the inbound message. For a blocking send-and-wait, use ask_peer instead.",
|
|
1015
1038
|
].join(" "),
|
|
1016
1039
|
inputSchema: {
|
|
@@ -1085,7 +1108,7 @@ server.registerTool("send_message", {
|
|
|
1085
1108
|
server.registerTool("reply_to_message", {
|
|
1086
1109
|
description: [
|
|
1087
1110
|
"Reply to a specific inbound peer message by its message_id — the atomic, correlation-safe alternative to hand-wiring send_message's target + reply_to. The server looks the message up in this session's durable received-ledger, so you pass only the message_id the PreToolUse hook or read_my_messages already showed you; it derives the reply target (the original sender), carries reply_to=request_id when the inbound was an ask_peer (keeping the exchange correlated), and sets source_message_id for provenance. Replying to a plain send_message works too — it just omits reply_to. Ownership is structural: you can only reply to a message delivered to you.",
|
|
1088
|
-
"Delivery + wake match send_message exactly, including the wake-on-reply default: when the inbound carried a request_id and you leave wake unset, a freshly-idle requester is auto-woken; pass wake:\"auto\" to nudge any idle peer, or wake:\"off\" to suppress. Fail-closed: an unknown or aged-out message_id returns error message-not-found instead of guessing a target.",
|
|
1111
|
+
"Delivery + wake match send_message exactly, including the wake-on-reply default: when the inbound carried a request_id and you leave wake unset, a freshly-idle requester is auto-woken; pass wake:\"auto\" to nudge any idle peer, or wake:\"off\" to suppress. If the inbound ask_peer had since timed out, this reply durably pulls the requester back (wake_reason late_reply_to_pending) regardless of the fresh-idle window. Fail-closed: an unknown or aged-out message_id returns error message-not-found instead of guessing a target.",
|
|
1089
1112
|
].join(" "),
|
|
1090
1113
|
inputSchema: {
|
|
1091
1114
|
message_id: z
|
|
@@ -1261,15 +1284,18 @@ server.registerTool("read_my_messages", {
|
|
|
1261
1284
|
// elapses. Reply-to-capable peers must reply with reply_to=request_id; legacy
|
|
1262
1285
|
// peers fall back to the original from_session_id-only matching.
|
|
1263
1286
|
//
|
|
1264
|
-
// User-tunable override via OXTAIL_ASK_PEER_TIMEOUT_MS; defaults to
|
|
1265
|
-
//
|
|
1266
|
-
//
|
|
1287
|
+
// User-tunable override via OXTAIL_ASK_PEER_TIMEOUT_MS; defaults to 60000ms.
|
|
1288
|
+
// 60s covers a slower multi-tool-call peer reply (a Codex peer composing
|
|
1289
|
+
// set_my_state + reply_to_message + a report was observed at ~46s and falsely
|
|
1290
|
+
// timed out under the old 45s default) while staying under both known callers'
|
|
1291
|
+
// tool-call abort windows: Claude Code is clean to ~60s, Codex aborts ~120s.
|
|
1292
|
+
// Set to a lower value if your client aborts before our timeout fires.
|
|
1267
1293
|
const ASK_PEER_TIMEOUT_MS = (() => {
|
|
1268
1294
|
const env = process.env.OXTAIL_ASK_PEER_TIMEOUT_MS;
|
|
1269
1295
|
if (!env)
|
|
1270
|
-
return
|
|
1296
|
+
return 60_000;
|
|
1271
1297
|
const n = Number(env);
|
|
1272
|
-
return Number.isFinite(n) && n > 0 ? n :
|
|
1298
|
+
return Number.isFinite(n) && n > 0 ? n : 60_000;
|
|
1273
1299
|
})();
|
|
1274
1300
|
const ASK_PEER_GRACE_MS = 500;
|
|
1275
1301
|
const ASK_PEER_POLL_MS = 200;
|
|
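Reviewer sketch of the override parsing used for `OXTAIL_ASK_PEER_TIMEOUT_MS` in the hunk above, written as a function over an explicit `env` object so the fallback cases can be checked without touching `process.env`. The helper name `timeoutFromEnv` is an assumption; the parsing rule itself is the one shown in the diff.

```javascript
// Use the override only when it parses to a finite number greater than zero;
// unset, empty, non-numeric, zero and negative values all fall back to `def`.
function timeoutFromEnv(env, name, def) {
  const raw = env[name];
  if (!raw) return def;
  const n = Number(raw);
  return Number.isFinite(n) && n > 0 ? n : def;
}

const NAME = "OXTAIL_ASK_PEER_TIMEOUT_MS";
const DEFAULT_MS = 60_000;

const unset = timeoutFromEnv({}, NAME, DEFAULT_MS);
const lowered = timeoutFromEnv({ [NAME]: "30000" }, NAME, DEFAULT_MS);
const withUnit = timeoutFromEnv({ [NAME]: "45s" }, NAME, DEFAULT_MS);
const negative = timeoutFromEnv({ [NAME]: "-5" }, NAME, DEFAULT_MS);
const zero = timeoutFromEnv({ [NAME]: "0" }, NAME, DEFAULT_MS);
```

Note that a value with a unit suffix such as "45s" is silently ignored rather than rejected, so a client that aborts early needs the plain millisecond form, for example "30000".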
@@ -1470,6 +1496,14 @@ async function wakePeer(peer) {
|
|
|
1470
1496
|
// No session-name fallback: a self-written tmux_session could target another
|
|
1471
1497
|
// session, and the verified pane already handles pane-id churn. Pass null.
|
|
1472
1498
|
const ok = await askPeerWakeImpl(verifiedPane, null, fire);
|
|
1499
|
+
if (!ok && sid) {
|
|
1500
|
+
// The fire failed (e.g. the pane vanished between verification and the
|
|
1501
|
+
// send-keys), so no keystroke landed. Clear the debounce stamp set pre-fire
|
|
1502
|
+
// above — otherwise a genuine retry within WAKE_DEBOUNCE_MS is suppressed as
|
|
1503
|
+
// "debounced" even though the peer was never actually woken (M1). The
|
|
1504
|
+
// pre-stamp only needs to survive a SUCCESSFUL fire's async paste gap.
|
|
1505
|
+
wakeDebounce.delete(sid);
|
|
1506
|
+
}
|
|
1473
1507
|
return ok ? "fired" : "skipped_no_target";
|
|
1474
1508
|
}
|
|
1475
1509
|
// --- send_message wake:auto gating -------------------------------------------
|
|
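Reviewer sketch of the debounce fix in the `wakePeer` hunk above: the stamp is written before the fire so concurrent sends are debounced, and cleared again when the fire fails so a genuine retry is not suppressed. This is a synchronous simplification; the `lastWoke` map, the `wake` function, the injected `fire` callback and the 1000 ms window are assumptions, not the package's wake-debounce API.

```javascript
const WAKE_DEBOUNCE_MS = 1000; // assumed window for this sketch
const lastWoke = new Map();    // session_id -> ms timestamp of the last wake stamp

// `fire` stands in for the send-keys call; it returns true when keystrokes landed.
function wake(sid, nowMs, fire) {
  const prev = lastWoke.get(sid);
  if (prev !== undefined && nowMs - prev < WAKE_DEBOUNCE_MS) return "skipped_debounced";
  lastWoke.set(sid, nowMs);      // pre-stamp before the fire
  const ok = fire();
  if (!ok) lastWoke.delete(sid); // failed fire: do not block a genuine retry
  return ok ? "fired" : "skipped_no_target";
}

const attempt1 = wake("sess-a", 0, () => false);  // pane vanished, nothing landed
const attempt2 = wake("sess-a", 100, () => true); // retry inside the window still fires
const attempt3 = wake("sess-a", 200, () => true); // debounced by attempt2's stamp
```

Before the fix, the second attempt in this sequence would have been reported as debounced even though the peer had never been woken.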
@@ -1480,7 +1514,19 @@ async function wakePeer(peer) {
|
|
|
1480
1514
|
// Keyed by session_id (the agent identity), NOT server_pid: a dual-scope agent
|
|
1481
1515
|
// has several MCP children sharing one session_id, and the hooks/sender must
|
|
1482
1516
|
// agree on the key (see AGENTS.md). Must match the sanitization in the hooks.
|
|
1483
|
-
|
|
1517
|
+
// How long a "busy" marker is trusted before a peer treats the turn as stale and
|
|
1518
|
+
// wakes anyway. The PreToolUse hook now re-stamps "busy" on every tool call, so
|
|
1519
|
+
// a long ACTIVE turn stays fresh; this TTL only governs a turn that stops making
|
|
1520
|
+
// tool calls (one giant single tool call, or a crash without a clean Stop) — the
|
|
1521
|
+
// latter is exactly the stale-busy→wake recovery we want. Configurable for
|
|
1522
|
+
// deployments with very long single-tool-call turns.
|
|
1523
|
+
const ACTIVITY_BUSY_TTL_MS = (() => {
|
|
1524
|
+
const env = process.env.OXTAIL_ACTIVITY_BUSY_TTL_MS;
|
|
1525
|
+
if (!env)
|
|
1526
|
+
return 10 * 60 * 1000;
|
|
1527
|
+
const n = Number(env);
|
|
1528
|
+
return Number.isFinite(n) && n > 0 ? n : 10 * 60 * 1000;
|
|
1529
|
+
})();
|
|
1484
1530
|
function activitySessionKey(sessionId) {
|
|
1485
1531
|
return sessionId.replace(/[^A-Za-z0-9_-]/g, "_");
|
|
1486
1532
|
}
|
|
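Reviewer sketch of how the busy-marker TTL described above could be applied, together with the session-key sanitization shown in `activitySessionKey`. The marker shape `{ state, stampedAtMs }` and the function `busyMarkerIsFresh` are hypothetical; the diff shows only the TTL constant and the key function, not how the marker is stored or read.

```javascript
const ACTIVITY_BUSY_TTL_MS = 10 * 60 * 1000; // default from the diff

// Same replacement rule as activitySessionKey in the diff.
const sessionKey = (sessionId) => sessionId.replace(/[^A-Za-z0-9_-]/g, "_");

// A "busy" stamp is trusted only while it is younger than the TTL; an older one
// is treated as a stale turn, so a wake may go ahead.
function busyMarkerIsFresh(marker, nowMs, ttlMs = ACTIVITY_BUSY_TTL_MS) {
  return marker.state === "busy" && nowMs - marker.stampedAtMs < ttlMs;
}

const MIN = 60 * 1000;
const key = sessionKey("abc/../x y");
const activeTurn = busyMarkerIsFresh({ state: "busy", stampedAtMs: 0 }, 2 * MIN);
const staleTurn = busyMarkerIsFresh({ state: "busy", stampedAtMs: 0 }, 11 * MIN);
const idlePeer = busyMarkerIsFresh({ state: "idle", stampedAtMs: 0 }, 1 * MIN);
```

Because the hook re-stamps on every tool call, a long but active turn keeps resetting `stampedAtMs` and stays fresh; only a turn that stops stamping ages past the TTL.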
@@ -1553,11 +1599,64 @@ async function autoWakeOnReply(peer, replyTo) {
     trace("autowake_reply_fire", { target_session_id: sid });
     return wakePeer(peer);
 }
-//
-//
-//
-//
+// Stamp the autowake dedupe record for (sessionId, replyTo) when the durable
+// pending-ask path fires, so a re-delivered / duplicate copy of the SAME reply
+// can't separately strict-wake the requester via the fresh-idle reply-default
+// (the in-memory wakePeer debounce is per-process and not reply_to-keyed, so it
+// doesn't cover a restart or a >1s gap). Best-effort; we're stamping, not gating.
+//
+// Like the existing reply-default path (decideReplyAutoWake → claimWake), this is
+// stamped on the wake ATTEMPT — before wakeForSend's keystroke outcome is known —
+// and claimWake also stamps the per-target RATE record. Intentional and
+// consistent with that path: one wake pulls the requester in to drain its whole
+// mailbox, so a second reply within the rate window doesn't need its own wake.
+// (It is NOT stamped on the wake:"off" / kill-switch-disabled paths, where no
+// wake is intended — see resolveSendWake.)
+function stampReplyWakeDedupe(sessionId, replyTo) {
+    if (!sessionId)
+        return;
+    try {
+        claimWake(defaultAutowakeDir(), sessionId, replyTo, Date.now());
+    }
+    catch {
+        // best effort — a failure only means a duplicate could still strict-wake,
+        // which is harmless (debounced, and the requester drains an empty mailbox).
+    }
+}
+// Resolve the wake for a send_message / reply_to_message. Order matters:
+//   1. DURABLE pending-ask: if this reply satisfies an ask_peer that timed out
+//      and recorded a pending obligation, consume it (regardless of wake mode —
+//      a late reply satisfies the obligation even under wake:"off", and leaving
+//      the record would let a later duplicate wake and violate the explicit off)
+//      and fire the LENIENT wakeForSend so even a long-idle / markerless-Codex
+//      requester is pulled back. The automatic (wake unset) variant honors the
+//      OXTAIL_AUTOWAKE kill-switch; an explicit wake:"auto" intentionally does
+//      not (it's the caller's explicit ask, matching existing semantics).
+//   2. STRICT reply-default: a reply with wake UNSET and no pending record →
+//      fresh-idle-only auto-wake (autowake.ts), wake_reason "reply_to_default".
+//   3. Explicit wake:"auto" → lenient wakeForSend. wake:"off" → no wake.
 async function resolveSendWake(peer, wake, replyTo) {
+    if (replyTo) {
+        const sid = peer.client.session_id ?? "";
+        if (consumePendingAsk(defaultPendingAskDir(), sid, replyTo, Date.now())) {
+            // wake:"off" and the kill-switch path do NOT wake — so they must NOT stamp
+            // the wake-dedupe: stamping there would later suppress the strict wake for a
+            // genuine, distinct second reply to the same request_id (no wake happened,
+            // so there is nothing to dedupe against). Only stamp on the path that fires.
+            if (wake === "off") {
+                trace("late_reply_pending_suppressed", { target_session_id: sid });
+                return { wake_reason: "late_reply_to_pending_suppressed" };
+            }
+            if (wake === undefined && autowakeKillSwitchOff()) {
+                return { wake_status: "disabled", wake_reason: "late_reply_to_pending" };
+            }
+            // About to actually wake → stamp so a re-delivered copy of THIS reply can't
+            // strict-wake again via the fresh-idle fallback.
+            stampReplyWakeDedupe(peer.client.session_id, replyTo);
+            trace("late_reply_pending_wake", { target_session_id: sid });
+            return { wake_status: await wakeForSend(peer), wake_reason: "late_reply_to_pending" };
+        }
+    }
     if (replyAutoWakeTriggered(wake, replyTo)) {
         return { wake_status: await autoWakeOnReply(peer, replyTo), wake_reason: "reply_to_default" };
     }
|
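The ordering documented for `resolveSendWake` can be restated as a pure decision function, which makes the branch precedence easier to check. This is an illustrative sketch only: `decideSendWake`, `hasPendingAsk` (standing in for `consumePendingAsk(...)` returning true) and `killSwitchOff` (standing in for `autowakeKillSwitchOff()`) are hypothetical names, and the strict branch assumes `replyAutoWakeTriggered` means "a reply with `wake` unset", as the comment describes.

```javascript
// Illustrative restatement of the documented ordering; not oxtail's code.
function decideSendWake({ wake, replyTo, hasPendingAsk, killSwitchOff }) {
    if (replyTo && hasPendingAsk) {
        // 1. Durable pending-ask. The record is consumed on every branch here,
        //    but the wake-dedupe is stamped only on the branch that wakes.
        if (wake === "off")
            return { action: "none", stampDedupe: false, reason: "late_reply_to_pending_suppressed" };
        if (wake === undefined && killSwitchOff)
            return { action: "disabled", stampDedupe: false, reason: "late_reply_to_pending" };
        return { action: "lenient", stampDedupe: true, reason: "late_reply_to_pending" };
    }
    if (replyTo && wake === undefined) {
        // 2. Strict reply-default (fresh-idle-only auto-wake).
        return { action: "strict", reason: "reply_to_default" };
    }
    // 3. Explicit modes: "auto" wakes leniently, anything else does not wake.
    return { action: wake === "auto" ? "lenient" : "none" };
}

// An explicit wake:"auto" bypasses the kill-switch on the durable path.
const explicitAuto = decideSendWake({ wake: "auto", replyTo: "req-1", hasPendingAsk: true, killSwitchOff: true });
```

Keeping the decision separate from the side effects (consume, stamp, keystroke) is a design choice of this sketch, not something the diff does; the real function interleaves them.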
@@ -1571,24 +1670,38 @@ async function resolveSendWake(peer, wake, replyTo) {
 // mailbox lock when there's a probable hit. The lock is held only inside
 // drainMatchingSession (sub-10ms) — never across the poll interval, so the
 // PreToolUse hook on subsequent caller tool calls is never starved.
-
-
-
+// The requester's mailbox pid union: own pid first (fast-path locality), then
+// any sibling/previous MCP child sharing the session_id. Recomputed at the final
+// drain so a sibling that appeared DURING the wait is still covered.
+function requesterPids(ownPid, sessionId) {
+    return sessionId
+        ? [ownPid, ...sessionPidsForId(sessionId).filter((p) => p !== ownPid)]
+        : [ownPid];
+}
+async function askPeerPoll(pids, from_session_id, request_id, require_reply_to, deadlineMs, signal) {
+    // Watch the mtime of EVERY sibling pid's mailbox (a dual-scope requester's
+    // reply may land in a pid other than the one blocked here), draining only when
+    // a file that exists has changed — so the lock is acquired on a probable hit,
+    // never every tick. Mirrors the single-pid optimization, widened to the union.
+    const lastMtimes = new Map();
     while (Date.now() < deadlineMs) {
         if (signal.aborted)
             throw new Error("aborted");
-        let
-
-
-
-
-
+        let changed = false;
+        for (const pid of pids) {
+            let m = -1;
+            try {
+                m = statSync(mailbox.mailboxFilePath(pid)).mtimeMs;
+            }
+            catch {
+                // ENOENT: mailbox file not created yet
+            }
+            if (m !== -1 && lastMtimes.get(pid) !== m)
+                changed = true;
+            lastMtimes.set(pid, m);
         }
-        if (
-
-            const reply = require_reply_to
-                ? mailbox.drainMatchingReply(my_pid, from_session_id, request_id)
-                : mailbox.drainMatchingSession(my_pid, from_session_id);
+        if (changed) {
+            const reply = drainAskPeerReply(pids, from_session_id, request_id, require_reply_to);
             if (reply)
                 return reply;
         }
|
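The mtime gate inside `askPeerPoll` can be isolated as a small function: a drain is attempted only when some mailbox file that exists reports an mtime different from the last one seen. The sketch below assumes a hypothetical injected `statMtimeMs(pid)` in place of `statSync(mailbox.mailboxFilePath(pid)).mtimeMs`, so it can run without touching the filesystem.

```javascript
// Sketch of the per-tick change check. `statMtimeMs` is a hypothetical
// stand-in that returns a file's mtime in ms, or throws if the file is missing.
function anyMailboxChanged(pids, lastMtimes, statMtimeMs) {
    let changed = false;
    for (const pid of pids) {
        let m = -1; // -1 means "no mailbox file yet"
        try {
            m = statMtimeMs(pid);
        }
        catch {
            // Missing file: keep m at -1 so it never counts as a change.
        }
        if (m !== -1 && lastMtimes.get(pid) !== m)
            changed = true;
        // Always record the latest observation, including -1, so every pid
        // is brought up to date on each tick.
        lastMtimes.set(pid, m);
    }
    return changed;
}

// Fake "filesystem": pid 101 has a mailbox, pid 202 does not exist yet.
const fakeFiles = new Map([[101, 1000]]);
const fakeStat = (pid) => {
    if (!fakeFiles.has(pid))
        throw new Error("ENOENT");
    return fakeFiles.get(pid);
};
const seen = new Map();
const firstTick = anyMailboxChanged([101, 202], seen, fakeStat);  // first sight of 101 counts as a change
const secondTick = anyMailboxChanged([101, 202], seen, fakeStat); // nothing moved since the first tick
```

Note that the first observation of an existing file counts as a change, so a reply already sitting in the mailbox triggers a drain on the first tick.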
@@ -1599,15 +1712,18 @@ async function askPeerPoll(my_pid, from_session_id, request_id, require_reply_to
     }
     return null;
 }
-function drainAskPeerReply(
+function drainAskPeerReply(pids, from_session_id, request_id, require_reply_to) {
+    // Correlated peers: union-drain by reply_to across the requester's siblings.
+    // Legacy/uncorrelated peers: keep the best-effort own-pid session match (no
+    // request_id to correlate the union safely).
     return require_reply_to
-        ? mailbox.
-        : mailbox.drainMatchingSession(
+        ? mailbox.drainMatchingReplyMany(pids, from_session_id, request_id)
+        : mailbox.drainMatchingSession(pids[0], from_session_id);
 }
 server.registerTool("ask_peer", {
     description: [
         "Delegate-and-wait: enqueue a message to a peer in the same project root, wake them, and block until they reply (via send_message) or the timeout elapses. Use this for back-and-forth; use send_message for fire-and-forget.",
-        "Wakes the peer via per-client tmux send-keys (Codex gets a paste-burst-aware gap, Claude Code doesn't), then polls for a reply. For reply_to-capable peers, only from_session_id + reply_to == request_id satisfies the wait; legacy peers fall back to best-effort from_session_id matching and the response reports correlation:\"uncorrelated\". Response carries wake_status: \"fired\" | \"skipped_busy\" | \"skipped_no_target\" | \"disabled\" (skipped_unsupported is reserved). A peer that is mid-turn is NOT keystroke-woken (skipped_busy) — its hook/poll delivers the enqueued message and we still poll for the reply. Returns reply: null, timed_out: true on timeout (default
+        "Wakes the peer via per-client tmux send-keys (Codex gets a paste-burst-aware gap, Claude Code doesn't), then polls for a reply. For reply_to-capable peers, only from_session_id + reply_to == request_id satisfies the wait; legacy peers fall back to best-effort from_session_id matching and the response reports correlation:\"uncorrelated\". Response carries wake_status: \"fired\" | \"skipped_busy\" | \"skipped_no_target\" | \"disabled\" (skipped_unsupported is reserved). A peer that is mid-turn is NOT keystroke-woken (skipped_busy) — its hook/poll delivers the enqueued message and we still poll for the reply. Returns reply: null, timed_out: true on timeout (default 60000ms, override per call with timeout_ms, or set OXTAIL_ASK_PEER_TIMEOUT_MS at startup). timeout_ms is clamped to a safe ceiling (default 100000ms, env OXTAIL_ASK_PEER_MAX_TIMEOUT_MS) so the wait can't outlast the client's tool-call abort window — exceeding it makes the client hard-fail the call instead of returning graceful timed_out; the response reports timeout_clamped_from_ms when clamped. DURABLE DELEGATION: on timeout (correlated peers, claimed requester), the request is recorded as a pending obligation, so when the peer's reply finally arrives — minutes or hours later — it WAKES you back (wake_reason late_reply_to_pending), not just landing silently in read_my_messages. So ask_peer is safe for long tasks: let it time out, end your turn, get pulled back when the work is done.",
         "Target must have a registered client.session_id (Codex peers call claim_session first). Body is verbatim — frame it as an assignment (objective + requested action) so it reads as delegation, not chat. Wake overridable via OXTAIL_ASK_PEER_WAKE_STRATEGY=auto|legacy|off.",
     ].join(" "),
     inputSchema: {
|
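The "own pid first, then siblings" union built by `requesterPids` is easy to exercise in isolation. In this sketch the session lookup is passed in as a parameter (`lookupPids`, a hypothetical stand-in for `sessionPidsForId`) so the ordering and de-duplication can be checked without a registry.

```javascript
// Sketch only: same shape as requesterPids in the diff, with the registry
// lookup injected. `lookupPids` is a hypothetical stand-in for sessionPidsForId.
function requesterPidsSketch(ownPid, sessionId, lookupPids) {
    return sessionId
        ? [ownPid, ...lookupPids(sessionId).filter((p) => p !== ownPid)]
        : [ownPid];
}

// A session whose registry lists three MCP children, including our own (3).
const lookupPids = () => [7, 3, 9];
const union = requesterPidsSketch(3, "session-a", lookupPids);          // own pid leads, no duplicate
const unclaimed = requesterPidsSketch(3, undefined, lookupPids);        // no session id: own pid only
```

Because the legacy branch of `drainAskPeerReply` reads `pids[0]`, putting the own pid first also preserves the previous single-pid behavior for uncorrelated peers.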
@@ -1656,6 +1772,10 @@ server.registerTool("ask_peer", {
     const requestId = randomBytes(8).toString("hex");
     const requireReplyTo = peerSupportsReplyTo(peer);
     const fromSessionId = entry.client.session_id ?? undefined;
+    // The reply is addressed to OUR session_id; resolveTarget enqueues it to the
+    // session's freshest sibling, which may not be entry.server_pid. Drain the
+    // union (own pid first for fast-path locality), mirroring read_my_messages.
+    const myPids = requesterPids(entry.server_pid, fromSessionId);
     // Record-before-append (mirrors send_message): lets the peer answer with
     // reply_to_message(message_id) instead of hand-wiring target + reply_to.
     const msg = deliverToPeer(expectedSessionId, peer.server_pid, body, fromSessionId, {
@@ -1683,7 +1803,7 @@ server.registerTool("ask_peer", {
         // our outbound arrived, their hook delivered it as additionalContext and
         // their response may already be in our mailbox.
         await askPeerDelay(ASK_PEER_GRACE_MS, extra.signal);
-        reply = drainAskPeerReply(
+        reply = drainAskPeerReply(myPids, expectedSessionId, requestId, requireReplyTo);
         if (!reply) {
             // Common path: peer was idle. Route the wake per client_type, but skip
             // the keystroke if the peer is FRESHLY busy (mid-turn): typing into a
@@ -1706,7 +1826,7 @@ server.registerTool("ask_peer", {
                 // return this and the caller fail-fasts instead of polling.
             }
             else {
-                reply = await askPeerPoll(
+                reply = await askPeerPoll(myPids, expectedSessionId, requestId, requireReplyTo, deadlineMs, extra.signal);
             }
         }
         else {
@@ -1749,6 +1869,77 @@ server.registerTool("ask_peer", {
     // attempted) is NOT a timeout; the message has been enqueued and will be
     // delivered when the peer next enters a turn.
     const polled = wakeStatus !== "skipped_unsupported";
+    // Durable delegation: we polled to the deadline with no reply. Record a
+    // pending obligation FIRST, then do one final authoritative UNION drain —
+    // write-before-final-drain closes the poll-vs-deadline TOCTOU. A reply that
+    // landed in the gap is caught here and returned now; a reply that arrives
+    // AFTER finds the persisted record and pulls us back via resolveSendWake's
+    // late_reply_to_pending path — even minutes/hours later, and even for a
+    // markerless idle Codex requester. Correlated peers + claimed requester only.
+    if (polled && reply === null && !aborted && requireReplyTo) {
+        if (fromSessionId) {
+            const dir = defaultPendingAskDir();
+            // Opportunistic sweep so abandoned records (a reply that never came)
+            // can't accumulate — mirrors gcAutowake inside decideReplyAutoWake.
+            gcPendingAsk(dir, Date.now());
+            // Write the pending obligation BEFORE the final drain (write-before-
+            // final-drain): a reply that lands after the drain finds this record and
+            // wakes us via resolveSendWake; one that landed before is caught below.
+            if (!recordPendingAsk(dir, fromSessionId, requestId, Date.now())) {
+                // Store unwritable → silently degrades to the read_my_messages path
+                // (no durable pull-back). Surface it so the degradation is observable.
+                trace("ask_peer_pending_record_failed", { request_id: requestId });
+            }
+            // Authoritative final drain. Recompute the pid union NOW — a sibling MCP
+            // child may have appeared during the wait. Use the CHECKED variant and
+            // retry any pid we couldn't inspect (transient lock): silently treating
+            // "couldn't read" as "no reply" would leave the record with no later
+            // event to consume it → a stranded pull-back.
+            const finalPids = requesterPids(entry.server_pid, fromSessionId);
+            let drained = mailbox.drainMatchingReplyManyChecked(finalPids, expectedSessionId, requestId);
+            if (drained.skipped.length > 0) {
+                // A pid we couldn't inspect might hold either the already-landed reply
+                // (if we have none yet) OR a migrate-crash duplicate of the reply we DID
+                // pull (which a later read_my_messages would re-deliver). Retry once
+                // after a brief delay for the lock to clear.
+                try {
+                    await askPeerDelay(ASK_PEER_POLL_MS, extra.signal);
+                    if (!drained.reply) {
+                        drained = mailbox.drainMatchingReplyManyChecked(drained.skipped, expectedSessionId, requestId);
+                        if (!drained.reply && drained.skipped.length > 0) {
+                            // Still un-inspectable after the retry: a lock held past the
+                            // acquire budget + retry (SIGSTOP-class / long holder). diagnose
+                            // can use this to tell "no reply" from "a reply may sit behind a
+                            // locked pid" — the record persists, so a later send still wakes.
+                            trace("ask_peer_skipped_after_final_retry", {
+                                request_id: requestId,
+                                skipped: drained.skipped,
+                            });
+                        }
+                    }
+                    else {
+                        // We have the reply — sweep only its exact id from the skipped pids
+                        // (a distinct second reply, different id, is left for read_my_messages).
+                        mailbox.sweepMessageId(drained.skipped, drained.reply.id);
+                    }
+                }
+                catch {
+                    // aborted during the brief retry delay — leave the record; we return
+                    // timed_out and the reply still delivers via read_my_messages.
+                }
+            }
+            if (drained.reply) {
+                consumePendingAsk(dir, fromSessionId, requestId);
+                reply = drained.reply;
+                trace("ask_peer_late_catch", { request_id: requestId, message_id: drained.reply.id });
+            }
+        }
+        else {
+            // Unclaimed requester: a peer can't correlate/reply_to_message back to
+            // us, so there's nothing to durably wake — surface it rather than guess.
+            trace("ask_peer_pending_skipped_unclaimed", { request_id: requestId });
+        }
+    }
     const timedOut = polled && reply === null;
     trace("ask_peer_end", {
         target_session_id: expectedSessionId,
|
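The write-before-final-drain ordering described in this hunk can be modelled with two in-memory structures. This is a toy sketch under stated assumptions: the `pending` set and `mailbox` array stand in for oxtail's on-disk pending-ask store and mailbox files, `makeWorld` is a hypothetical name, and the sender is assumed to enqueue the reply before it checks for a pending record.

```javascript
// Toy model (not oxtail's code): shows why recording the obligation BEFORE
// the final drain leaves no window in which a reply is neither returned
// nor followed by a wake.
function makeWorld() {
    const pending = new Set(); // request ids with a recorded obligation
    const mailbox = [];        // queued replies: { reply_to }
    const wakes = [];          // request ids that triggered a late wake
    return {
        pending,
        wakes,
        // Sender side: enqueue first, then wake if an obligation is on record.
        sendReply(requestId) {
            mailbox.push({ reply_to: requestId });
            if (pending.delete(requestId))
                wakes.push(requestId);
        },
        // Requester side at the deadline: record first, drain second.
        onTimeout(requestId) {
            pending.add(requestId);
            const i = mailbox.findIndex((m) => m.reply_to === requestId);
            if (i === -1)
                return null;           // record stays; a later reply will wake us
            pending.delete(requestId); // caught in the gap: consume the record
            return mailbox.splice(i, 1)[0];
        },
    };
}

// Case A: the reply lands just before the deadline and is caught by the drain.
const worldA = makeWorld();
worldA.sendReply("req-a");
const caught = worldA.onTimeout("req-a");

// Case B: the reply lands after the deadline and finds the record.
const worldB = makeWorld();
const missed = worldB.onTimeout("req-b");
worldB.sendReply("req-b");
```

With the opposite order (drain, then record), a reply arriving between the two steps would find no record and the drain would already have run, which is the gap the comment calls the poll-vs-deadline TOCTOU.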
package/dist/wake-debounce.js
CHANGED
|
@@ -30,7 +30,14 @@ export function newWakeDebounceStore() {
 // True if a wake fired for this key within the window — i.e. skip this one.
 export function recentlyWoke(store, key, nowMs, windowMs = WAKE_DEBOUNCE_MS) {
     const last = store.get(key);
-
+    if (last === undefined)
+        return false;
+    const delta = nowMs - last;
+    // A backwards clock step (NTP correction, laptop resume) makes delta negative
+    // and < windowMs, which would wrongly suppress every wake to this peer until
+    // the clock catches back up. Treat a negative delta as "not recent" (mirrors
+    // the ageMs >= 0 guard in isFreshIdle).
+    return delta >= 0 && delta < windowMs;
 }
 // Record that a wake fired for this key. Opportunistically evicts stale entries
 // so the map can't grow unbounded across many short-lived peers.
|
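The effect of the new `delta >= 0` guard in `recentlyWoke` is easiest to see with concrete timestamps. The sketch below copies the new function body and passes the window explicitly; the 1000 ms window is an assumption for illustration, since the value of `WAKE_DEBOUNCE_MS` is not shown in this diff.

```javascript
// Same logic as the patched recentlyWoke, with the window passed explicitly.
function recentlyWokeSketch(store, key, nowMs, windowMs) {
    const last = store.get(key);
    if (last === undefined)
        return false;
    const delta = nowMs - last;
    // A negative delta means the clock stepped backwards; treat it as not recent.
    return delta >= 0 && delta < windowMs;
}

const WINDOW_MS = 1000; // assumed window, for illustration only
const wakeStore = new Map([["peer-a", 10000]]); // last wake at t = 10000 ms

const withinWindow = recentlyWokeSketch(wakeStore, "peer-a", 10500, WINDOW_MS);  // 500 ms later
const afterWindow = recentlyWokeSketch(wakeStore, "peer-a", 11000, WINDOW_MS);   // exactly one window later
const clockWentBack = recentlyWokeSketch(wakeStore, "peer-a", 4000, WINDOW_MS);  // clock stepped back 6 s
const unknownPeer = recentlyWokeSketch(wakeStore, "peer-b", 10500, WINDOW_MS);   // never woken
```

Without the guard, the backwards-clock case would evaluate `-6000 < 1000` as true and suppress wakes to that peer until the clock passed the stored timestamp again.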
package/package.json
CHANGED
|
package/scripts/hook-constants.mjs
CHANGED
|
@@ -33,10 +33,14 @@ export const HOOK_MARKER_KEY = "_oxtailHook";
 // with no owner check, so during an upgrade window (before re-install) the
 // old hook can still lose the stall-resume / double-clear races against a
 // v6 peer. The version bump forces re-install to close that window.
+// v7: pretooluse re-stamps the "busy" activity marker on every tool call, so a
+// long ACTIVE turn stays fresh and doesn't invite a spurious wake:auto once
+// it outruns ACTIVITY_BUSY_TTL_MS. A stale pre-v7 hook just doesn't refresh
+// (the prior behavior) — never wrong, only less fresh on long turns.
 // INVARIANT: any change to an assets/*.sh script MUST bump this version, so
 // existing installs are forced to re-install. scripts/check-hook-version.mjs
 // enforces this in CI.
-export const HOOK_MARKER_VERSION =
+export const HOOK_MARKER_VERSION = 7;
 
 const HOOKS_DIR = path.join(os.homedir(), ".oxtail", "hooks");
 
|