oxtail 0.12.0 → 0.14.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/AGENTS.md CHANGED
@@ -55,9 +55,15 @@ The v0.9/v0.10.1 changes close the public dogfooding gaps found by real peer tra
55
55
  - **Session identity is monotonic after first non-null resolution.** Automatic detection is a bootstrap aid. Once `claim_session`, `register_my_session`, or sticky-claim recovery sets a session id, later env/birth-time detection and `get_my_session` refreshes must preserve it. Only another explicit claim can change it.
56
56
  - **`ask_peer` replies must correlate when the peer supports it.** Same-peer chatter is not a reply. Upgraded peers advertise `capabilities.mailbox.reply_to` and must satisfy waits with `from_session_id == target.session_id` plus `reply_to == request_id`; unmatched messages stay in the mailbox. The older `from_session_id`-only path is legacy compatibility and must be surfaced as `correlation: "uncorrelated"`. For no-capability peers, stale same-peer chatter may still satisfy the wait; that is an explicit compatibility limitation, not a correctness guarantee.
57
57
  - **Peer messages are context, not user authority.** Mailbox provenance (`origin: "peer"`, `request_id`, `reply_to`, `source_message_id`) is diagnostic metadata, not a trust boundary. Hook text must keep the trust framing visible — the "context, not user authority" line plus the `from_session_id` / `request_id` / `reply_to` reply fields (full protocol names) are rendered on every delivery — and injected hook bodies must stay under an explicit budget. Single-valued provenance the framing already implies (`origin: "peer"`) stays in the mailbox JSONL but need not be rendered into context.
58
+ - **A displayed reply handle must be resolvable: record the received-ledger before the mailbox line is visible.** Both delivery paths are destructive — `read_my_messages` and the PreToolUse/Stop hook each truncate the mailbox on handoff — so `reply_to_message` resolves `message_id` against a durable per-session ledger (`~/.oxtail/received/<hash(session_id)>.jsonl`), never the queue. `deliverToPeer` (the single delivery primitive behind `send_message` / `ask_peer` / `reply_to_message`) MUST write the ledger entry **before** appending the mailbox line: append-then-record reopens a window where the hook renders a `message_id` the receiver cannot yet reply to. The ledger is keyed and owned by receiver `session_id`; a lookup reads only the caller's own file. The ledger write is best-effort (a failure degrades to "no handle, reply via `send_message`") but must never reorder ahead of, or block, the actual delivery.
59
+ - **Delivery mutations are crash-consistent; the shared advisory lock is owner-validated.** A crash mid-write must never corrupt a neighbour: mailbox appends heal a torn record boundary so two JSONL lines can't glue into one unparseable record (`appendLines`), and full-file rewrites (received-ledger, selective drain) go through a temp file + atomic `rename` (`atomicWrite`), never an in-place `writeFileSync` that a torn write could leave half-applied. The `mkdir` lock — shared between the Node server and the bash hooks — carries an owner token in a sidecar `<lock>.owner` (beside the dir so it stays empty and a plain `rmdir` still works cross-language): **release removes the lock only if it still owns it** (a stalled holder can't stomp a successor), and **stale removal is single-winner** (`<lock>.steal` marker) **plus compare-and-clear** (remove only if the owner is still the dead token observed). Provable race-freedom is unachievable on a plain shared filesystem (no atomic compare-and-swap); the design closes every realistic race and the only residuals — enumerated in `src/locks.ts` — require a >30s SIGSTOP-class stall inside a microsecond syscall gap, with a bounded consequence (rare double-delivery or a degraded reply-handle, never a wedge or torn file). One protocol, mirrored in `src/locks.ts` and both hooks.
58
60
 
59
61
  ## Recently shipped
60
62
 
63
+ - **Crash-consistency + cross-language lock hardening (v0.14.0).** A `compile-sim` pass plus four Codex adversarial rounds hardened the delivery core against crash/torn-write and lock races. **Crash-consistency:** every mailbox append heals a torn previous write so a crash can't glue two JSONL records into one unparseable line (`appendLines` in `src/mailbox.ts`); the received-ledger rewrite and `drainFirstMatching`'s survivor rewrite are now atomic temp-file + `rename` (`atomicWrite`), so a torn write can't drop unrelated survivors / corrupt old reply handles. **Advisory lock:** the `mkdir` lock gains an owner token in a sidecar `<lock>.owner` (kept beside the dir so it stays empty and a bash hook's plain `rmdir` still works); **release** removes the lock only if it still owns it (closes stall-resume-release stomp), and **stale-clear** is gated behind a single-winner `<lock>.steal` marker + compare-and-clear (closes the double-clear). The protocol lives once in `src/locks.ts` and is mirrored in both bash hooks (`assets/pretooluse.sh`, `assets/stop.sh`; `HOOK_MARKER_VERSION` → 6 forces re-install). **Honest limit:** a provably race-free stale-recoverable lock isn't achievable on a plain shared FS — the residuals all require a >30s SIGSTOP-class stall in a microsecond syscall gap, bounded to a rare double-delivery / degraded reply-handle (never a wedge or torn file), documented in `src/locks.ts`. Also: `deliverExistingToPeer` preserves `message_id` + ledger on the `ask_peer` abort-recovery path (was minting a new id + skipping the ledger). Codex converged after 4 rounds (it broke the first two lock attempts before the owner-token + compare-and-clear design held).
64
+
65
+ - **Reply by id + received-ledger (v0.13.0).** `reply_to_message(message_id, body)` looks the inbound envelope up in a durable per-session ledger and derives `target` / `reply_to` / `source_message_id` server-side, replacing the manual rewiring that silently degraded a correlated exchange into loose mailbox traffic. New `src/received.ts` (ledger: sha256-keyed file, `mkdir`-lock, bounded retention `OXTAIL_RECEIVED_MAX`=1000 with a `received_ledger_pruned` trace so a drop is never silent) and `src/delivery.ts` (`deliverToPeer` = `buildMessage` → `recordReceived` → `requeue` — the record-before-append ordering above), wired into `send_message` / `ask_peer` / `reply_to_message`. Adversarial race-pair + ledger-failure-still-delivers tests in `src/delivery.test.ts`. Converged with Codex over a 5-round peer-messaging pressure test; Codex's review caught the append-before-record race, fixed before merge.
66
+
61
67
  - **Wake hardening (v0.12.0 — issues #5/#6/#7, the v0.7-review backlog).** Three deferred wake items, landed together. **#6 (security):** wake send-keys now only ever target the pane the live process tree says hosts the peer's `server_pid` (`chooseVerifiedWakePane` → `currentPaneForServerPid`), never the peer's self-written `tmux_pane`/`tmux_session`; unverifiable ⇒ refuse (`skipped_no_target`). Registry-sourced tmux ids are shape-validated (`isValidTmuxPane`/`isValidTmuxSession`) and a spoofed `TMUX_PANE` env is ignored. This removed the cached-pane and session-name send-keys fallbacks (legit peers always register a real pane; churn is handled by re-resolution). **#5 (debounce):** all wake paths funnel through `wakePeer`, which coalesces repeat wakes to the same peer within `OXTAIL_WAKE_DEBOUNCE_MS` (default 1s, in-memory per process) ⇒ `skipped_debounced`. **#7 (observability):** a `wake_outcome` trace event per wake; `oxtail diagnose` summarizes wake_status counts by tool from `MCP_TRACE_FILE`; a scheduled `codex-drift.yml` fails if Codex's `PASTE_ENTER_SUPPRESS_WINDOW` drifts past our 500ms gap. New modules: `src/wake-debounce.ts`, `src/diagnose.ts`; `chooseVerifiedWakePane` in `src/registry.ts`.
62
68
  - **Wake-on-reply (Slice 1, peer-messaging refinement push).** A `send_message` that carries `reply_to` now auto-wakes the original requester **by default** (explicit `wake:"off"` opts out), closing the observed stranding where a peer's async reply to an idle requester forced a human to relay it. The reply path is a separate, stricter gate than the lenient `wake:"auto"` path (`src/autowake.ts`): it fires only for a **fresh-idle** target (idle marker newer than `OXTAIL_AUTOWAKE_FRESH_IDLE_MS`, default 5m) — stale/unknown/missing/busy ⇒ `skipped_no_fresh_idle`, never a best-effort wake — and adds a **per-target rate limit** (`skipped_rate_limited`), a persistent **one-wake dedupe** keyed on `(session_id, reply_to)` (`skipped_deduped`, GC'd by age) to survive duplicate/late hook drains, an `OXTAIL_AUTOWAKE=off` kill-switch, and a best-effort `skipped_store_error` degrade so a broken dedupe store can never turn an already-enqueued reply into a tool error. Target is resolved by `client.session_id` with the pane re-resolved immediately before send-keys (no `server_pid`/stale-pane reuse). Response surfaces `wake_status` + `wake_reason:"reply_to_default"`. **Coverage caveat:** the fresh-idle gate keys on the busy/idle marker that only the Claude Code hooks maintain, so this slice reaches a **hooked Claude Code requester** (the observed case). A Codex / hookless-Claude requester has no idle marker ⇒ `skipped_no_fresh_idle` (reach it with explicit `wake:"auto"`); closing that direction is **Slice 2** (`expects_reply:true` — a requester-side waiter signal), deliberately not faked here with a blind `unknown ⇒ wake` that would reintroduce the active-waiter double-wake.
63
69
  - **Protocol hardening (v0.10.1).** `ask_peer` now stamps outbound messages with `request_id`; reply-to-capable peers answer with `send_message({ reply_to: request_id })`, and the waiter ignores stale same-peer messages. Explicit identity claims are monotonic, so stale automatic detection cannot clobber a real client session id. PreToolUse/Stop hook pushes are body-budgeted and labeled as peer context, not user authority.
package/README.md CHANGED
@@ -36,7 +36,7 @@ args = ["-y", "oxtail@0.10.1"]
36
36
 
37
37
  ```sh
38
38
  mkdir -p ~/.claude/commands
39
- curl -L https://raw.githubusercontent.com/d4j3y2k/oxtail/v0.12.0/.claude/commands/oxtail-join.md \
39
+ curl -L https://raw.githubusercontent.com/d4j3y2k/oxtail/v0.13.0/.claude/commands/oxtail-join.md \
40
40
  -o ~/.claude/commands/oxtail-join.md
41
41
  ```
42
42
 
@@ -44,9 +44,9 @@ curl -L https://raw.githubusercontent.com/d4j3y2k/oxtail/v0.12.0/.claude/command
44
44
 
45
45
  ```sh
46
46
  mkdir -p ~/.codex/skills/oxtail-join/agents
47
- curl -L https://raw.githubusercontent.com/d4j3y2k/oxtail/v0.12.0/integrations/codex/oxtail-join/SKILL.md \
47
+ curl -L https://raw.githubusercontent.com/d4j3y2k/oxtail/v0.13.0/integrations/codex/oxtail-join/SKILL.md \
48
48
  -o ~/.codex/skills/oxtail-join/SKILL.md
49
- curl -L https://raw.githubusercontent.com/d4j3y2k/oxtail/v0.12.0/integrations/codex/oxtail-join/agents/openai.yaml \
49
+ curl -L https://raw.githubusercontent.com/d4j3y2k/oxtail/v0.13.0/integrations/codex/oxtail-join/agents/openai.yaml \
50
50
  -o ~/.codex/skills/oxtail-join/agents/openai.yaml
51
51
  ```
52
52
 
@@ -68,10 +68,11 @@ Contributing? `git clone https://github.com/d4j3y2k/oxtail && cd oxtail && npm i
68
68
  - `send_message` — **fire-and-forget** message to a peer. Target is a tmux session name or a raw `client_session_id` UUID. Body ≤ 8KB. Delivery is async via the peer's mailbox file. A plain message does **not** wake an idle peer; pass `wake: "auto"` to nudge one (state-gated — see [Waking an idle peer](#waking-an-idle-peer)). Replies to `ask_peer` should pass `reply_to: "<request_id>"` when the inbound message carries a `request_id` — and a reply **auto-wakes the requester by default** (strictly gated; `wake: "off"` opts out). (v0.5+)
69
69
  - `read_my_messages` — drain this session's mailbox and return any queued messages. Messages include `from_session_id`, server-stamped `origin: "peer"`, and optional `request_id` / `reply_to`. Codex peers (and unhooked Claude Code) poll this; Claude Code peers with the hooks installed see messages mid-turn (PreToolUse) or at turn end (Stop) instead. (v0.5+)
70
70
  - `ask_peer` — **delegate-and-wait**. Enqueues a message with a `request_id` and blocks server-side until the peer replies with `send_message({ reply_to: request_id })` or the timeout elapses. Default timeout is 45s (`OXTAIL_ASK_PEER_TIMEOUT_MS`), and each call may pass `timeout_ms`. New peers use strict `reply_to` correlation; legacy/no-capability peers fall back to best-effort first-message matching and the response reports `correlation: "uncorrelated"`. That legacy path may stale-match old same-peer chatter, so callers should treat `uncorrelated` as compatibility-only. Use `send_message` for fire-and-forget. (v0.7+)
71
+ - `reply_to_message` — **reply by `message_id`**. The atomic, correlation-safe alternative to hand-wiring `send_message`'s `target` + `reply_to`: pass the `message_id` the hook or `read_my_messages` showed you and the server looks the inbound envelope up in this session's durable **received-ledger**, derives the reply target (the original sender), carries `reply_to: request_id` when the inbound was an `ask_peer` (keeping the exchange correlated), and stamps `source_message_id`. Replying to a plain `send_message` works too — it just omits `reply_to`. Ownership is structural (you can only reply to a message delivered to *you*); fail-closed on an unknown/aged-out id. Same wake semantics as `send_message`, including the wake-on-reply default. (v0.13+)
71
72
  - `register_my_session` — pin this MCP server's `session_id` directly. Kept for debugging; prefer `claim_session`.
72
73
  - `get_my_session` — return this MCP server's own registry entry plus a per-strategy detection diagnosis. Useful for debugging.
73
74
 
74
- See [design principles](https://github.com/d4j3y2k/oxtail/blob/v0.12.0/AGENTS.md) for scope and architecture.
75
+ See [design principles](https://github.com/d4j3y2k/oxtail/blob/v0.13.0/AGENTS.md) for scope and architecture.
75
76
 
76
77
  ## Usage from an agent
77
78
 
@@ -90,6 +91,8 @@ send_message({ target: "<peer-uuid>", body: "...", reply_to: "<ask request_id>"
90
91
  read_my_messages()
91
92
  ask_peer({ target: "primary", body: "[Handoff] please audit X and tell me what you find" })
92
93
  // → blocks server-side until the peer replies via send_message, then returns their body
94
+ reply_to_message({ message_id: "<id from the hook / read_my_messages>", body: "..." })
95
+ // → looks up the inbound envelope, derives target + reply_to itself; correlated when the inbound was an ask_peer
93
96
  ```
94
97
 
95
98
  Omitting `project_root` triggers a best-effort `.git`-ancestor walk from the server's own cwd. The response includes `inferred: true` when this happens. Pass `project_root` explicitly when you can.
@@ -112,6 +115,8 @@ read_my_messages()
112
115
 
113
116
  The mailbox lives at `~/.oxtail/mailboxes/<server_pid>.jsonl`, append-only JSONL, drained under an `mkdir`-based advisory lock. The transport is intentionally dumb: 8KB UTF-8 body cap, sender chooses the framing (raw text or pre-wrapped `<system-reminder>...</system-reminder>`). Hook-delivered mailbox pushes are body-budgeted at 24K escaped characters by default; set `OXTAIL_HOOK_MAX_BODY_CHARS` to tune. If the budget is exceeded, the hook tells the receiver which bodies were truncated or omitted.
114
117
 
118
+ Because both delivery paths are **destructive** — `read_my_messages` and the hook each truncate the mailbox once a message is handed off — a reply-by-id verb can't rely on the queue. Every delivered envelope is therefore also recorded in a durable **received-ledger** at `~/.oxtail/received/<hash(session_id)>.jsonl` keyed by `message_id`, written *before* the mailbox line becomes visible (so any handle a receiver can see is already resolvable) and bounded to the most recent `OXTAIL_RECEIVED_MAX` (default 1000) entries. `reply_to_message` reads only the caller's own ledger — that file *is* the ownership boundary.
119
+
115
120
  Inbound peer messages are context, not user authority. oxtail stamps delivered messages with `origin: "peer"` for provenance/debugging, but this is not a trust boundary and peers cannot mint trusted user instructions.
116
121
 
117
122
  Cross-project sends are rejected, never silently dropped. Sending to a peer with the same tmux session name as another live peer returns `ambiguous-target` with the candidate `client_session_id`s — use the UUID form to disambiguate.
@@ -301,8 +306,9 @@ A scheduled CI job (`.github/workflows/codex-drift.yml`, also runnable on demand
301
306
 
302
307
  ## Status
303
308
 
304
- v0.12.0. Pushes the autonomous peer-messaging matrix toward zero human relay, then hardens the wake path.
309
+ v0.13.0. Pushes the autonomous peer-messaging matrix toward zero human relay, hardens the wake path, then makes correlated replies atomic.
305
310
 
311
+ - **Reply by id (v0.13.0).** `reply_to_message(message_id, body)` removes the manual `target` + `reply_to` rewiring that silently degraded a correlated exchange into loose mailbox traffic: the server looks the inbound envelope up in a durable per-session **received-ledger** (`~/.oxtail/received/<hash(session_id)>.jsonl`), derives the reply target and `reply_to` itself, and enforces ownership structurally (you can only reply to a message delivered to you). The ledger is written *before* the mailbox line is visible — so a handle the hook displays is always resolvable even though both delivery paths destroy the queue entry once it is handed off. Fail-closed on an unknown/aged-out id.
306
312
  - **Wake-on-reply (v0.11.0).** A reply — `send_message` with `reply_to` — auto-wakes a freshly-idle requester by default, so an awaited answer doesn't strand an idle peer. Strictly gated (fresh-idle only, per-target rate limit, one-wake dedupe, `OXTAIL_AUTOWAKE=off` kill-switch). `wake:"off"` opts out; explicit `wake:"auto"` is the escape hatch for a requester without an idle marker (Codex / hookless Claude).
307
313
  - **Wake hardening (v0.12.0).** Wake keystrokes only ever target the pane the process tree confirms hosts the peer's `server_pid` — never a self-written `tmux_pane`/`tmux_session`, and registry entries whose `server_pid` doesn't match their filename are rejected. Rapid repeat wakes to one peer are coalesced (`skipped_debounced`). `oxtail diagnose` summarizes wake outcomes from `MCP_TRACE_FILE`, and a scheduled CI job flags drift in Codex's paste-burst window before it can break the wake.
308
314
  - **Correlated delegate-and-wait.** `ask_peer` now sends a `request_id`; upgraded peers reply with `send_message({ reply_to })`, and the waiter ignores same-peer chatter that does not match. Legacy peers are still supported, but their replies are marked `correlation: "uncorrelated"`.
@@ -60,19 +60,73 @@ done < <(grep -lE "\"session_id\"[[:space:]]*:[[:space:]]*\"$sid\"" "$sessions_d
60
60
 
61
61
  [ "${#mboxes[@]}" -eq 0 ] && exit 0
62
62
 
63
- # 3. Acquire each mailbox's mkdir-based lock (best-effort; 30s staleness window,
64
- # matching src/mailbox.ts:LOCK_STALE_MS). GNU and BSD stat formats differ.
65
- locked=()
66
- for m in "${mboxes[@]}"; do
63
+ # ── Advisory lock: owner-token mkdir lock mirror of src/locks.ts ────────────
64
+ # The lock is a mkdir dir; the owner token lives in the SIDECAR file
65
+ # "<lock>.owner" (beside the dir, not inside, so the dir stays empty and a plain
66
+ # rmdir still removes it). Stale removal is gated behind a single-winner mkdir
67
+ # "<lock>.steal" marker plus compare-and-clear (remove only if the owner is still
68
+ # the dead token we observed), and release removes the lock only if we still own
69
+ # it. Keep in sync with src/locks.ts. GNU and BSD stat formats differ.
70
+ OXL_STALE=30 # seconds; mirror src/mailbox.ts LOCK_STALE_MS — also the
71
+ # marker-staleness window (same SIGSTOP-class threshold)
72
+ oxl_now() { date +%s 2>/dev/null || echo 0; }
73
+ oxl_mtime() { stat -c %Y "$1" 2>/dev/null || stat -f %m "$1" 2>/dev/null || echo 0; }
74
+ oxl_token() { # pid.random; tolerate a missing /dev/urandom without degrading to bare pid
75
+ local r
76
+ r=$(od -An -N6 -tx1 /dev/urandom 2>/dev/null | tr -d ' \n')
77
+ [ -n "$r" ] || r="${RANDOM}${RANDOM}${RANDOM}"
78
+ echo "$$.$r"
79
+ }
80
+ oxl_owner() { cat "$1.owner" 2>/dev/null || true; }
81
+ oxl_clear_stale() { # $1=lock dir; returns 0 if it did clearing work (retry mkdir)
82
+ local lock="$1" n mt obs smt
83
+ n=$(oxl_now); mt=$(oxl_mtime "$lock")
84
+ [ "$mt" -gt 0 ] || return 1
85
+ [ $((n - mt)) -gt "$OXL_STALE" ] || return 1
86
+ obs=$(oxl_owner "$lock")
87
+ if mkdir "$lock.steal" 2>/dev/null; then
88
+ if [ "x$(oxl_owner "$lock")" = "x$obs" ]; then
89
+ rm -f "$lock.owner" 2>/dev/null
90
+ rmdir "$lock" 2>/dev/null || rm -rf "$lock" 2>/dev/null
91
+ fi
92
+ rmdir "$lock.steal" 2>/dev/null
93
+ return 0
94
+ fi
95
+ smt=$(oxl_mtime "$lock.steal")
96
+ if [ "$smt" -gt 0 ] && [ $((n - smt)) -gt "$OXL_STALE" ]; then rmdir "$lock.steal" 2>/dev/null; fi
97
+ return 1
98
+ }
99
+ oxl_acquire() { # $1=lock dir; prints owner token on success, returns 0/1
100
+ local lock="$1" t i
101
+ t=$(oxl_token)
67
102
  for i in $(seq 1 50); do
68
- if mkdir "$m.lock" 2>/dev/null; then locked+=("$m"); break; fi
69
- now=$(date +%s 2>/dev/null || echo 0)
70
- mtime=$(stat -c %Y "$m.lock" 2>/dev/null || stat -f %m "$m.lock" 2>/dev/null || echo 0)
71
- if [ "$mtime" -gt 0 ] && [ $((now - mtime)) -gt 30 ]; then
72
- rmdir "$m.lock" 2>/dev/null
103
+ if mkdir "$lock" 2>/dev/null; then
104
+ printf '%s' "$t" > "$lock.owner" 2>/dev/null || true
105
+ printf '%s' "$t"
106
+ return 0
73
107
  fi
108
+ oxl_clear_stale "$lock" && continue
74
109
  sleep 0.01
75
110
  done
111
+ return 1
112
+ }
113
+ oxl_release() { # $1=lock dir, $2=our token — remove only if we PROVABLY own it
114
+ local lock="$1" t="$2" o
115
+ o=$(oxl_owner "$lock")
116
+ if [ -z "$t" ] || [ "x$o" = "x$t" ]; then
117
+ rm -f "$lock.owner" 2>/dev/null
118
+ rmdir "$lock" 2>/dev/null || true
119
+ fi
120
+ # owner differs or absent → not provably ours; leave it (it ages into a stale
121
+ # lock and is reclaimed by oxl_clear_stale) rather than stomp a successor.
122
+ }
123
+ # ─────────────────────────────────────────────────────────────────────────────
124
+
125
+ # 3. Acquire each mailbox's owner-token lock (best-effort; 30s staleness window).
126
+ locked=()
127
+ locked_tokens=()
128
+ for m in "${mboxes[@]}"; do
129
+ tok=$(oxl_acquire "$m.lock") && { locked+=("$m"); locked_tokens+=("$tok"); }
76
130
  done
77
131
  [ "${#locked[@]}" -eq 0 ] && exit 0
78
132
 
@@ -183,5 +237,9 @@ if [ -n "$output" ]; then
183
237
  for m in "${locked[@]}"; do : > "$m"; done
184
238
  fi
185
239
 
186
- for m in "${locked[@]}"; do rmdir "$m.lock" 2>/dev/null || true; done
240
+ ri=0
241
+ for m in "${locked[@]}"; do
242
+ oxl_release "$m.lock" "${locked_tokens[$ri]:-}"
243
+ ri=$((ri + 1))
244
+ done
187
245
  exit 0
package/assets/stop.sh CHANGED
@@ -90,16 +90,72 @@ if [ "${#mboxes[@]}" -eq 0 ]; then
90
90
  exit 0
91
91
  fi
92
92
 
93
- # 5. Lock each non-empty mailbox (best-effort; 30s staleness window).
94
- locked=()
95
- for m in "${mboxes[@]}"; do
93
+ # ── Advisory lock: owner-token mkdir lock mirror of src/locks.ts ────────────
94
+ # Lock is a mkdir dir; owner token lives in the SIDECAR "<lock>.owner" (beside,
95
+ # not inside, so the dir stays empty and a plain rmdir still removes it). Stale
96
+ # removal is gated behind a single-winner mkdir "<lock>.steal" marker plus
97
+ # compare-and-clear; release removes the lock only if we still own it. Keep in
98
+ # sync with src/locks.ts (identical block in assets/pretooluse.sh).
99
+ OXL_STALE=30 # seconds; mirror src/mailbox.ts LOCK_STALE_MS — also the
100
+ # marker-staleness window (same SIGSTOP-class threshold)
101
+ oxl_now() { date +%s 2>/dev/null || echo 0; }
102
+ oxl_mtime() { stat -c %Y "$1" 2>/dev/null || stat -f %m "$1" 2>/dev/null || echo 0; }
103
+ oxl_token() { # pid.random; tolerate a missing /dev/urandom without degrading to bare pid
104
+ local r
105
+ r=$(od -An -N6 -tx1 /dev/urandom 2>/dev/null | tr -d ' \n')
106
+ [ -n "$r" ] || r="${RANDOM}${RANDOM}${RANDOM}"
107
+ echo "$$.$r"
108
+ }
109
+ oxl_owner() { cat "$1.owner" 2>/dev/null || true; }
110
+ oxl_clear_stale() { # $1=lock dir; returns 0 if it did clearing work (retry mkdir)
111
+ local lock="$1" n mt obs smt
112
+ n=$(oxl_now); mt=$(oxl_mtime "$lock")
113
+ [ "$mt" -gt 0 ] || return 1
114
+ [ $((n - mt)) -gt "$OXL_STALE" ] || return 1
115
+ obs=$(oxl_owner "$lock")
116
+ if mkdir "$lock.steal" 2>/dev/null; then
117
+ if [ "x$(oxl_owner "$lock")" = "x$obs" ]; then
118
+ rm -f "$lock.owner" 2>/dev/null
119
+ rmdir "$lock" 2>/dev/null || rm -rf "$lock" 2>/dev/null
120
+ fi
121
+ rmdir "$lock.steal" 2>/dev/null
122
+ return 0
123
+ fi
124
+ smt=$(oxl_mtime "$lock.steal")
125
+ if [ "$smt" -gt 0 ] && [ $((n - smt)) -gt "$OXL_STALE" ]; then rmdir "$lock.steal" 2>/dev/null; fi
126
+ return 1
127
+ }
128
+ oxl_acquire() { # $1=lock dir; prints owner token on success, returns 0/1
129
+ local lock="$1" t i
130
+ t=$(oxl_token)
96
131
  for i in $(seq 1 50); do
97
- if mkdir "$m.lock" 2>/dev/null; then locked+=("$m"); break; fi
98
- now=$(date +%s 2>/dev/null || echo 0)
99
- mt=$(stat -c %Y "$m.lock" 2>/dev/null || stat -f %m "$m.lock" 2>/dev/null || echo 0)
100
- if [ "$mt" -gt 0 ] && [ $((now - mt)) -gt 30 ]; then rmdir "$m.lock" 2>/dev/null; fi
132
+ if mkdir "$lock" 2>/dev/null; then
133
+ printf '%s' "$t" > "$lock.owner" 2>/dev/null || true
134
+ printf '%s' "$t"
135
+ return 0
136
+ fi
137
+ oxl_clear_stale "$lock" && continue
101
138
  sleep 0.01
102
139
  done
140
+ return 1
141
+ }
142
+ oxl_release() { # $1=lock dir, $2=our token — remove only if we PROVABLY own it
143
+ local lock="$1" t="$2" o
144
+ o=$(oxl_owner "$lock")
145
+ if [ -z "$t" ] || [ "x$o" = "x$t" ]; then
146
+ rm -f "$lock.owner" 2>/dev/null
147
+ rmdir "$lock" 2>/dev/null || true
148
+ fi
149
+ # owner differs or absent → not provably ours; leave it (it ages into a stale
150
+ # lock and is reclaimed by oxl_clear_stale) rather than stomp a successor.
151
+ }
152
+ # ─────────────────────────────────────────────────────────────────────────────
153
+
154
+ # 5. Lock each non-empty mailbox (best-effort; 30s staleness window).
155
+ locked=()
156
+ locked_tokens=()
157
+ for m in "${mboxes[@]}"; do
158
+ tok=$(oxl_acquire "$m.lock") && { locked+=("$m"); locked_tokens+=("$tok"); }
103
159
  done
104
160
  # Couldn't lock anything → leave messages for next time. This still allows the
105
161
  # turn to stop, so mark idle; otherwise wake:auto will suppress a wake for a
@@ -215,5 +271,9 @@ else
215
271
  mark_idle
216
272
  fi
217
273
 
218
- for m in "${locked[@]}"; do rmdir "$m.lock" 2>/dev/null || true; done
274
+ ri=0
275
+ for m in "${locked[@]}"; do
276
+ oxl_release "$m.lock" "${locked_tokens[$ri]:-}"
277
+ ri=$((ri + 1))
278
+ done
219
279
  exit 0
@@ -0,0 +1,53 @@
1
+ import * as mailbox from "./mailbox.js";
2
+ import { recordReceived } from "./received.js";
3
+ import { trace } from "./trace.js";
4
+ // Deliver a message to a peer's mailbox, recording the durable reply-handle in
5
+ // the receiver's ledger BEFORE the mailbox line becomes visible. The ordering is
6
+ // the correctness guarantee: a hook/poll drainer can only observe the mailbox
7
+ // line after the append, which happens strictly after the ledger write — so any
8
+ // message_id a receiver can drain/render already has a ledger entry behind it.
9
+ // The reverse order (append, then record) left a window where the hook rendered
10
+ // a handle reply_to_message could not yet resolve (the race Codex caught).
11
+ //
12
+ // receiverSessionId may be null/empty (an unclaimed peer): then there is no
13
+ // ledger to own the handle and we skip the record — reply_to_message simply
14
+ // won't find it, which is the documented fall-back-to-send_message path.
15
+ //
16
+ // The ledger write is best-effort: a ledger failure must NEVER drop the actual
17
+ // delivery. Worst case the reply handle is missing and the peer falls back to
18
+ // send_message — never the reverse (a visible line with no handle on success),
19
+ // because record precedes append.
20
+ export function deliverToPeer(receiverSessionId, targetPid, body, fromSessionId, options = {}) {
21
+ const msg = mailbox.buildMessage(body, fromSessionId, options);
22
+ if (receiverSessionId) {
23
+ try {
24
+ recordReceived(receiverSessionId, msg);
25
+ }
26
+ catch (e) {
27
+ trace("received_ledger_write_failed", { message_id: msg.id, error: String(e) });
28
+ }
29
+ }
30
+ mailbox.requeue(targetPid, msg);
31
+ return msg;
32
+ }
33
+ // Re-deliver an ALREADY-BUILT message to a peer, preserving its message_id and
34
+ // (re)recording the receiver's ledger handle BEFORE the mailbox line becomes
35
+ // visible — same record-before-append ordering as deliverToPeer. Used by the
36
+ // ask_peer abort-recovery path: the reply was drained into memory but the client
37
+ // aborted before it was returned, so it must be re-enqueued WITHOUT minting a new
38
+ // id. (mailbox.enqueue would mint a fresh id and skip the ledger, so the
39
+ // redelivered reply's displayed id resolves to message-not-found on
40
+ // reply_to_message.) The ledger write is best-effort — a failure must never drop
41
+ // the redelivery; worst case the handle is missing and the peer falls back to
42
+ // send_message.
43
+ export function deliverExistingToPeer(receiverSessionId, targetPid, msg) {
44
+ if (receiverSessionId) {
45
+ try {
46
+ recordReceived(receiverSessionId, msg);
47
+ }
48
+ catch (e) {
49
+ trace("received_ledger_write_failed", { message_id: msg.id, error: String(e) });
50
+ }
51
+ }
52
+ mailbox.requeue(targetPid, msg);
53
+ }
package/dist/locks.js ADDED
@@ -0,0 +1,207 @@
1
+ import { randomBytes } from "node:crypto";
2
+ import { mkdirSync, readFileSync, rmdirSync, rmSync, statSync, unlinkSync, writeFileSync, } from "node:fs";
3
+ import { trace } from "./trace.js";
4
+ // Shared advisory-lock primitive for the mkdir-based locks used by both the
5
+ // mailbox queues (mailbox.ts) and the received-ledger (received.ts), and mirrored
6
+ // in the bash hooks (assets/pretooluse.sh, assets/stop.sh). Centralised here
7
+ // because stale-recovery is subtle and must be reasoned about (and tested) once.
8
+ //
9
+ // HONEST LIMIT: a provably race-free, stale-recoverable advisory lock is not
10
+ // achievable on a plain shared filesystem (no atomic compare-and-swap; every
11
+ // "detect stale → remove → reacquire" has a check-then-act window). This design
12
+ // eliminates the REALISTIC failure modes; the residuals that remain ALL require
13
+ // a process to stall (SIGSTOP / huge swap / multi-second GC) past the 30s stale
14
+ // window while inside a microsecond-wide gap between two syscalls:
15
+ // (a) a clearer that stalls >30s between its owner-compare and its rmdir, while
16
+ // another clearer reclaims the (now >30s-stale) steal marker and reacquires;
17
+ // (b) a holder that stalls >30s between mkdir(lock) and writeOwner(lock), gets
18
+ // stale-cleared as owner-less, then resumes and overwrites a successor's
19
+ // owner sidecar;
20
+ // (c) a holder that stalls >30s mid-critical-section and resumes to do its data
21
+ // write believing it still holds the lock (the data ops do not re-validate
22
+ // ownership before writing).
23
+ // Eliminating these needs kernel-arbitrated locks (flock/fcntl), which are not
24
+ // viable here because the lock is shared with bash hooks on macOS (no flock CLI).
25
+ // The consequence of any of these firing is bounded — a rare double-delivery
26
+ // (benign once readers dedup by message_id) or a rare ledger lost-update (the
27
+ // reply handle degrades to send_message), never a wedge or torn file.
28
+ //
29
+ // Two mechanisms do the work:
30
+ // 1. OWNER TOKEN. Each acquisition writes a unique token into the SIDECAR file
31
+ // `<lock>.owner` (kept beside the lock dir, NOT inside it, so the lock dir
32
+ // stays empty and a bash hook's plain `rmdir <lock>` still works cross-
33
+ // language). Release only removes the lock if the token still matches — so a
34
+ // holder that stalled past the stale window, got its lock stolen, and then
35
+ // resumes can no longer rmdir the SUCCESSOR's fresh lock (stall-resume bug).
36
+ // 2. SINGLE-WINNER + COMPARE-AND-CLEAR. Stale removal is gated behind an atomic
37
+ // `mkdir(<lock>.steal)` marker, and the clearer removes the lock only if its
38
+ // owner is STILL the dead token it observed. While the marker is held and the
39
+ // lock still exists, nobody else can clear (marker) or acquire (mkdir EEXIST),
40
+ // so the owner is stable across the check→rmdir. And the actual acquire is
41
+ // ALWAYS the single-winner `mkdir(lock)`, so even redundant clears can never
42
+ // produce two owners — the worst they do is race to recreate the lock, which
43
+ // exactly one wins.
44
+ const LOCK_RETRY_LIMIT = 50;
45
+ const LOCK_RETRY_DELAY_MS = 10;
46
+ function sleepSync(ms) {
47
+ Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
48
+ }
49
+ // Sidecar beside the lock dir (not inside) so the lock dir stays empty and a
50
+ // bash hook's plain `rmdir <lock>` still removes a Node-held lock. An orphaned
51
+ // sidecar (lock dir removed but sidecar left, e.g. by a bash clearer that doesn't
52
+ // know about it) is harmless — the next acquirer overwrites it.
53
+ function ownerPath(lock) {
54
+ return `${lock}.owner`;
55
+ }
56
+ function mintToken() {
57
+ return `${process.pid}.${randomBytes(8).toString("hex")}`;
58
+ }
59
+ // Read the owner token, or null if the lock has none (foreign/legacy lock, or a
60
+ // lock observed in the tiny window after mkdir but before the owner write).
61
+ function readOwner(lock) {
62
+ try {
63
+ return readFileSync(ownerPath(lock), "utf8");
64
+ }
65
+ catch {
66
+ return null;
67
+ }
68
+ }
69
+ function writeOwner(lock, token) {
70
+ try {
71
+ writeFileSync(ownerPath(lock), token, { mode: 0o600 });
72
+ }
73
+ catch {
74
+ // Best effort: an owner-less lock still excludes (the dir exists); it just
75
+ // loses the stall-resume protection until the next acquisition.
76
+ }
77
+ }
78
+ // Remove the lock dir and its owner file. Tolerates a foreign non-empty lock dir
79
+ // (e.g. one a bash hook or test created without our layout) via a recursive rm.
80
+ function removeLock(lock) {
81
+ try {
82
+ unlinkSync(ownerPath(lock));
83
+ }
84
+ catch {
85
+ // no owner file — fine
86
+ }
87
+ try {
88
+ rmdirSync(lock);
89
+ }
90
+ catch (e) {
91
+ const err = e;
92
+ if (err.code === "ENOENT")
93
+ return;
94
+ // Non-empty (foreign contents) or other — fall back to recursive removal.
95
+ try {
96
+ rmSync(lock, { recursive: true, force: true });
97
+ }
98
+ catch {
99
+ // best effort
100
+ }
101
+ }
102
+ }
103
+ // Compare-and-clear a stale lock under the single-winner steal marker. Returns
104
+ // true iff this call did the clearing work (caller retries mkdir immediately);
105
+ // false if the lock is fresh, vanished, or another clearer holds the marker
106
+ // (caller should sleep and retry).
107
+ export function clearStaleLock(lock, staleMs, traceEvent, traceCtx) {
108
+ let st;
109
+ try {
110
+ st = statSync(lock);
111
+ }
112
+ catch {
113
+ return false; // lock vanished between the failed mkdir and now — just retry
114
+ }
115
+ if (Date.now() - st.mtimeMs <= staleMs)
116
+ return false; // fresh holder — wait
117
+ const observed = readOwner(lock); // the (presumed dead) holder's token, or null
118
+ const steal = `${lock}.steal`;
119
+ try {
120
+ mkdirSync(steal, { mode: 0o700 });
121
+ }
122
+ catch (e) {
123
+ const err = e;
124
+ if (err.code === "EEXIST") {
125
+ // Another clearer holds the marker. If the marker is itself stale by the
126
+ // SAME stale window as the lock (its clearer crashed/SIGSTOP'd mid-steal),
127
+ // force it so recovery cannot wedge forever. Using the lock's stale window
128
+ // (not a shorter one) means a clearer can only be displaced after a full
129
+ // 30s stall — the same SIGSTOP-class threshold as every other residual —
130
+ // rather than after a brief pause. Compare-and-clear below still refuses to
131
+ // remove a lock whose owner changed, backstopping a reclaim race.
132
+ try {
133
+ const sst = statSync(steal);
134
+ if (Date.now() - sst.mtimeMs > staleMs) {
135
+ try {
136
+ rmdirSync(steal);
137
+ }
138
+ catch {
139
+ // raced with another clearer — fine
140
+ }
141
+ }
142
+ }
143
+ catch {
144
+ // marker vanished — fine
145
+ }
146
+ }
147
+ return false; // lost the steal — sleep and retry
148
+ }
149
+ // Sole clearer (modulo a leaked-marker race, which compare-and-clear backstops).
150
+ // Re-read the owner now: if it still equals what we observed, the dead holder's
151
+ // lock is unchanged and safe to remove; if it changed, someone reacquired and
152
+ // we must leave their lock alone.
153
+ if (readOwner(lock) === observed) {
154
+ removeLock(lock);
155
+ trace(traceEvent, traceCtx);
156
+ }
157
+ try {
158
+ rmdirSync(steal);
159
+ }
160
+ catch {
161
+ // best effort — a leaked marker is force-cleared by the next clearer
162
+ }
163
+ return true;
164
+ }
165
+ // Acquire the advisory lock, returning the owner token to hand back to
166
+ // releaseDirLock. The caller is responsible for creating the parent directory.
167
+ export function acquireDirLock(lock, staleMs, traceEvent, traceCtx) {
168
+ const token = mintToken();
169
+ for (let i = 0; i < LOCK_RETRY_LIMIT; i++) {
170
+ try {
171
+ mkdirSync(lock, { mode: 0o700 });
172
+ writeOwner(lock, token);
173
+ return token;
174
+ }
175
+ catch (e) {
176
+ const err = e;
177
+ if (err.code !== "EEXIST")
178
+ throw err;
179
+ if (clearStaleLock(lock, staleMs, traceEvent, traceCtx))
180
+ continue;
181
+ sleepSync(LOCK_RETRY_DELAY_MS);
182
+ }
183
+ }
184
+ throw new Error(`could not acquire lock at ${lock}`);
185
+ }
186
+ // Release the lock — but only if we PROVABLY still own it (owner === our token).
187
+ // A holder that stalled past the stale window and was stolen from sees a
188
+ // different owner and leaves the successor's lock intact. We deliberately do NOT
189
+ // remove on an absent owner: a successor in its mkdir→writeOwner window has no
190
+ // owner yet, and removing then would stomp its fresh lock (Codex round-3). If our
191
+ // OWN owner write was lost, the cost is a leaked lock — which simply ages into a
192
+ // stale lock and is reclaimed by clearStaleLock, strictly safer than a stomp.
193
+ export function releaseDirLock(lock, token) {
194
+ if (!token) {
195
+ removeLock(lock); // no token to verify (defensive/legacy) — best-effort
196
+ return;
197
+ }
198
+ const owner = readOwner(lock);
199
+ if (owner === token) {
200
+ removeLock(lock);
201
+ }
202
+ else {
203
+ // Owner differs or is absent → not provably ours; leave it. A truly
204
+ // abandoned lock becomes stale and is reclaimed by clearStaleLock.
205
+ trace("lock_release_skipped_not_owner", { lock });
206
+ }
207
+ }
package/dist/mailbox.js CHANGED
@@ -1,7 +1,8 @@
1
1
  import { randomBytes } from "node:crypto";
2
- import { appendFileSync, mkdirSync, readFileSync, rmdirSync, statSync, truncateSync, writeFileSync, } from "node:fs";
2
+ import { appendFileSync, closeSync, mkdirSync, openSync, readFileSync, readSync, renameSync, statSync, truncateSync, writeFileSync, } from "node:fs";
3
3
  import { homedir } from "node:os";
4
4
  import { join } from "node:path";
5
+ import { acquireDirLock, releaseDirLock } from "./locks.js";
5
6
  import { trace } from "./trace.js";
6
7
  // Resolved lazily so tests can swap HOME between cases. Each call re-reads
7
8
  // homedir(), which on POSIX defers to $HOME.
@@ -17,58 +18,24 @@ function mailboxesDir() {
17
18
  //
18
19
  // Sync this value with assets/pretooluse.sh (find -mmin +0.5 ≈ 30s).
19
20
  const LOCK_STALE_MS = 30_000;
20
- const LOCK_RETRY_LIMIT = 50;
21
- const LOCK_RETRY_DELAY_MS = 10;
22
21
  function mailboxPath(pid) {
23
22
  return join(mailboxesDir(), `${pid}.jsonl`);
24
23
  }
25
24
  function lockPath(pid) {
26
25
  return `${mailboxPath(pid)}.lock`;
27
26
  }
28
- function sleepSync(ms) {
29
- Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
30
- }
27
+ // Owner tokens for held locks, so releaseLock can prove ownership (a lock stolen
28
+ // out from under a stalled holder is not removed on its late release). Keyed by
29
+ // pid; never two concurrent acquisitions of the same pid within one process.
30
+ const lockTokens = new Map();
31
31
  export function acquireLock(pid) {
32
32
  mkdirSync(mailboxesDir(), { recursive: true, mode: 0o700 });
33
- const lock = lockPath(pid);
34
- for (let i = 0; i < LOCK_RETRY_LIMIT; i++) {
35
- try {
36
- mkdirSync(lock, { mode: 0o700 });
37
- return;
38
- }
39
- catch (e) {
40
- const err = e;
41
- if (err.code !== "EEXIST")
42
- throw err;
43
- // Check staleness. If older than LOCK_STALE_MS, force-clear and retry.
44
- try {
45
- const st = statSync(lock);
46
- if (Date.now() - st.mtimeMs > LOCK_STALE_MS) {
47
- try {
48
- rmdirSync(lock);
49
- trace("mailbox_lock_stale_clear", { pid });
50
- }
51
- catch {
52
- // raced with another clearer; fall through to retry
53
- }
54
- continue;
55
- }
56
- }
57
- catch {
58
- // stat may race; just retry
59
- }
60
- sleepSync(LOCK_RETRY_DELAY_MS);
61
- }
62
- }
63
- throw new Error(`could not acquire mailbox lock for pid ${pid}`);
33
+ lockTokens.set(pid, acquireDirLock(lockPath(pid), LOCK_STALE_MS, "mailbox_lock_stale_clear", { pid }));
64
34
  }
65
35
  export function releaseLock(pid) {
66
- try {
67
- rmdirSync(lockPath(pid));
68
- }
69
- catch {
70
- // ignore ENOENT / not-empty / EPERM
71
- }
36
+ const token = lockTokens.get(pid);
37
+ lockTokens.delete(pid);
38
+ releaseDirLock(lockPath(pid), token ?? "");
72
39
  }
73
40
  // Critical: the serialized JSONL line must always begin
74
41
  // `{"schema_version":1,"id":"...","body":"`. The awk extractor in
@@ -107,8 +74,53 @@ export function serializeMailboxLine(msg) {
107
74
  }
108
75
  return line;
109
76
  }
110
- export function enqueue(target_pid, body, from_session_id, options = {}) {
111
- const msg = {
77
+ // Append JSONL bytes to a mailbox, healing a missing record boundary first.
78
+ // appendFileSync of a buffer is NOT a single atomic syscall, so a crash/torn
79
+ // write can leave a file ending in a partial line with no trailing "\n". A later
80
+ // append would then concatenate onto that partial line, gluing two records into
81
+ // one line that fails JSON.parse in BOTH drain() and the awk hook — silently
82
+ // dropping both messages. If the file is non-empty and its last byte isn't "\n",
83
+ // prepend one so the boundary is restored (the already-torn record is still lost,
84
+ // but it can no longer eat its neighbor). Every append path routes through here.
85
+ function appendLines(path, buf) {
86
+ let heal = false;
87
+ let fd;
88
+ try {
89
+ const st = statSync(path);
90
+ if (st.size > 0) {
91
+ fd = openSync(path, "r");
92
+ const last = Buffer.alloc(1);
93
+ readSync(fd, last, 0, 1, st.size - 1);
94
+ heal = last[0] !== 0x0a; // 0x0a === "\n"
95
+ }
96
+ }
97
+ catch (e) {
98
+ const err = e;
99
+ if (err.code !== "ENOENT")
100
+ throw err;
101
+ }
102
+ finally {
103
+ if (fd !== undefined)
104
+ closeSync(fd);
105
+ }
106
+ appendFileSync(path, heal ? "\n" + buf : buf);
107
+ }
108
+ // Atomically replace a file's contents: write to a unique temp file in the same
109
+ // directory, then renameSync over the target. rename(2) is atomic on POSIX, so a
110
+ // concurrent reader/crasher never observes a torn file — unlike writeFileSync,
111
+ // which issues multiple write() syscalls and can leave a half-written line on
112
+ // crash, dropping unrelated surviving records.
113
+ function atomicWrite(path, data) {
114
+ const tmp = `${path}.tmp.${process.pid}.${randomBytes(6).toString("hex")}`;
115
+ writeFileSync(tmp, data, { mode: 0o600 });
116
+ renameSync(tmp, path);
117
+ }
118
+ // Mint a message envelope WITHOUT writing it anywhere. Split out from enqueue so
119
+ // a higher layer (delivery.ts) can record the durable received-ledger entry
120
+ // BEFORE the mailbox line becomes visible — the ordering that guarantees any
121
+ // message_id a receiver can drain/render already has a ledger entry behind it.
122
+ export function buildMessage(body, from_session_id, options = {}) {
123
+ return {
112
124
  schema_version: 1,
113
125
  id: randomBytes(8).toString("hex"),
114
126
  body,
@@ -120,10 +132,12 @@ export function enqueue(target_pid, body, from_session_id, options = {}) {
120
132
  ...(options.reply_to ? { reply_to: options.reply_to } : {}),
121
133
  ...(options.source_message_id ? { source_message_id: options.source_message_id } : {}),
122
134
  };
123
- const line = serializeMailboxLine(msg);
135
+ }
136
+ export function enqueue(target_pid, body, from_session_id, options = {}) {
137
+ const msg = buildMessage(body, from_session_id, options);
124
138
  acquireLock(target_pid);
125
139
  try {
126
- appendFileSync(mailboxPath(target_pid), line);
140
+ appendLines(mailboxPath(target_pid), serializeMailboxLine(msg));
127
141
  }
128
142
  finally {
129
143
  releaseLock(target_pid);
@@ -138,7 +152,7 @@ export function requeue(target_pid, msg) {
138
152
  const line = serializeMailboxLine(msg);
139
153
  acquireLock(target_pid);
140
154
  try {
141
- appendFileSync(mailboxPath(target_pid), line);
155
+ appendLines(mailboxPath(target_pid), line);
142
156
  }
143
157
  finally {
144
158
  releaseLock(target_pid);
@@ -155,7 +169,7 @@ export function requeueMany(target_pid, msgs) {
155
169
  buf += serializeMailboxLine(m);
156
170
  acquireLock(target_pid);
157
171
  try {
158
- appendFileSync(mailboxPath(target_pid), buf);
172
+ appendLines(mailboxPath(target_pid), buf);
159
173
  }
160
174
  finally {
161
175
  releaseLock(target_pid);
@@ -244,7 +258,7 @@ export function migrateMailbox(fromPid, toPid) {
244
258
  const count = raw.split("\n").filter((l) => l.trim().length > 0).length;
245
259
  acquireLock(toPid);
246
260
  try {
247
- appendFileSync(mailboxPath(toPid), block);
261
+ appendLines(mailboxPath(toPid), block);
248
262
  }
249
263
  finally {
250
264
  releaseLock(toPid);
@@ -376,7 +390,7 @@ function drainFirstMatching(my_pid, matches) {
376
390
  }
377
391
  }
378
392
  else {
379
- writeFileSync(mailboxPath(my_pid), surviving.join("\n") + "\n");
393
+ atomicWrite(mailboxPath(my_pid), surviving.join("\n") + "\n");
380
394
  }
381
395
  return matchedMsg;
382
396
  }
@@ -0,0 +1,151 @@
1
+ import { createHash, randomBytes } from "node:crypto";
2
+ import { mkdirSync, readFileSync, renameSync, writeFileSync } from "node:fs";
3
+ import { homedir } from "node:os";
4
+ import { join } from "node:path";
5
+ import { acquireDirLock, releaseDirLock } from "./locks.js";
6
+ import { trace } from "./trace.js";
7
+ // The received-message ledger: a durable, per-session index of every inbound
8
+ // envelope, keyed by message_id. It exists because both delivery paths are
9
+ // DESTRUCTIVE — mailbox.drain() truncates the queue to 0 after a read, and the
10
+ // PreToolUse hook does `:> "$m"` after rendering messages into model context.
11
+ // So once a message is delivered, the mailbox no longer holds it. A reply verb
12
+ // (reply_to_message) that looks a message up by id therefore cannot rely on the
13
+ // mailbox; it needs this separate ledger.
14
+ //
15
+ // Correctness hinges on ORDERING, enforced by delivery.ts: the ledger entry is
16
+ // written BEFORE the mailbox line becomes visible. A drainer can only observe
17
+ // the line after the append, which happens strictly after this write — so any
18
+ // message_id a receiver can see has a ledger entry behind it. (The reverse order
19
+ // left a window where the hook rendered a handle reply_to_message couldn't yet
20
+ // resolve — the race Codex caught in review.)
21
+ //
22
+ // Ownership is structural: the ledger lives at received/<hash(session_id)>, and
23
+ // lookups only ever read the caller's own file. You can only reply to a message
24
+ // that was delivered to YOU.
25
+ function receivedDir() {
26
+ // Resolved lazily so tests can swap HOME between cases (mirrors mailbox.ts).
27
+ return join(homedir(), ".oxtail", "received");
28
+ }
29
+ // Hash the session_id into the filename (mirrors claims.ts) so two distinct ids
30
+ // can never collide onto one ledger file — a lossy character-sanitize could map
31
+ // different sessions to the same path. UUIDs are already path-safe; the hash is
32
+ // defensive and collision-free.
33
+ function ledgerKey(sessionId) {
34
+ return createHash("sha256").update(sessionId).digest("hex").slice(0, 32);
35
+ }
36
+ function ledgerPath(sessionId) {
37
+ return join(receivedDir(), `${ledgerKey(sessionId)}.jsonl`);
38
+ }
39
+ function lockPath(sessionId) {
40
+ return `${ledgerPath(sessionId)}.lock`;
41
+ }
42
+ // Lock idiom mirrors mailbox.ts (owner-token mkdir lock — see locks.ts). The
43
+ // ledger read-modify-write is small (bounded by receivedMax() lines) so the lock
44
+ // window is short.
45
+ const LOCK_STALE_MS = 30_000;
46
+ // Bounded retention: keep at most this many of the most-recent inbound messages
47
+ // per session. Read lazily so tests can tune it per-case. Generous by default so
48
+ // a realistic mailbox burst (read_my_messages budgets 50/drain) can't push a
49
+ // just-displayed handle out of the ledger before the receiver replies; when the
50
+ // cap DOES bite, recordReceived traces the drop so it is never silent.
51
+ export function receivedMax() {
52
+ const n = Number(process.env.OXTAIL_RECEIVED_MAX);
53
+ return Number.isFinite(n) && n > 0 ? Math.floor(n) : 1000;
54
+ }
55
+ // Owner tokens for held ledger locks (see mailbox.ts for the rationale).
56
+ const lockTokens = new Map();
57
+ function acquireLock(sessionId) {
58
+ mkdirSync(receivedDir(), { recursive: true, mode: 0o700 });
59
+ lockTokens.set(sessionId, acquireDirLock(lockPath(sessionId), LOCK_STALE_MS, "received_lock_stale_clear", {
60
+ session_id: sessionId,
61
+ }));
62
+ }
63
+ function releaseLock(sessionId) {
64
+ const token = lockTokens.get(sessionId);
65
+ lockTokens.delete(sessionId);
66
+ releaseDirLock(lockPath(sessionId), token ?? "");
67
+ }
68
+ // Atomically replace the ledger: write a unique temp file, then renameSync over
69
+ // the target. rename(2) is atomic on POSIX, so a crash/torn write can't leave a
70
+ // half-rewritten ledger that loses older reply handles — unlike a direct
71
+ // writeFileSync, which issues multiple write() syscalls.
72
+ function atomicWrite(path, data) {
73
+ const tmp = `${path}.tmp.${process.pid}.${randomBytes(6).toString("hex")}`;
74
+ writeFileSync(tmp, data, { mode: 0o600 });
75
+ renameSync(tmp, path);
76
+ }
77
+ function readLines(sessionId) {
78
+ try {
79
+ const raw = readFileSync(ledgerPath(sessionId), "utf8");
80
+ if (!raw)
81
+ return [];
82
+ return raw.split("\n").filter((l) => l.length > 0);
83
+ }
84
+ catch (e) {
85
+ const err = e;
86
+ if (err.code === "ENOENT")
87
+ return [];
88
+ throw err;
89
+ }
90
+ }
91
+ // Append an inbound envelope to the receiver's ledger and prune to receivedMax()
92
+ // (oldest dropped first). Called by delivery.ts BEFORE the mailbox append.
93
+ export function recordReceived(receiverSessionId, msg) {
94
+ if (!receiverSessionId)
95
+ return;
96
+ acquireLock(receiverSessionId);
97
+ try {
98
+ const lines = readLines(receiverSessionId);
99
+ lines.push(JSON.stringify(msg));
100
+ const max = receivedMax();
101
+ let pruned = lines;
102
+ if (lines.length > max) {
103
+ pruned = lines.slice(lines.length - max);
104
+ // No silent caps: a dropped handle becomes reply_to_message
105
+ // "message-not-found", so surface that the bound bit.
106
+ trace("received_ledger_pruned", {
107
+ session_id: receiverSessionId,
108
+ dropped: lines.length - max,
109
+ kept: max,
110
+ });
111
+ }
112
+ atomicWrite(ledgerPath(receiverSessionId), pruned.join("\n") + "\n");
113
+ }
114
+ finally {
115
+ releaseLock(receiverSessionId);
116
+ }
117
+ }
118
+ // Look up a previously-received envelope by message_id in this session's ledger.
119
+ // Newest-first scan (ids are unique, so the first match is the only match).
120
+ // Returns null when not found / aged out — the fail-closed signal the reply
121
+ // verb turns into message-not-found. Read under the same lock so a concurrent
122
+ // recordReceived rewrite can't be observed half-written.
123
+ export function lookupReceived(receiverSessionId, messageId) {
124
+ if (!receiverSessionId)
125
+ return null;
126
+ acquireLock(receiverSessionId);
127
+ try {
128
+ const lines = readLines(receiverSessionId);
129
+ for (let i = lines.length - 1; i >= 0; i--) {
130
+ let parsed;
131
+ try {
132
+ parsed = JSON.parse(lines[i]);
133
+ }
134
+ catch {
135
+ continue;
136
+ }
137
+ if (parsed &&
138
+ typeof parsed === "object" &&
139
+ parsed.id === messageId) {
140
+ return parsed;
141
+ }
142
+ }
143
+ return null;
144
+ }
145
+ finally {
146
+ releaseLock(receiverSessionId);
147
+ }
148
+ }
149
+ export function receivedFilePath(sessionId) {
150
+ return ledgerPath(sessionId);
151
+ }
package/dist/server.js CHANGED
@@ -12,6 +12,8 @@ import { isAbstain } from "./detect/index.js";
12
12
  import { trace } from "./trace.js";
13
13
  import { buildEntry, chooseVerifiedWakePane, findByTmuxSession, readAll, refreshTmuxBinding, register, sessionPidsForId, unregister, } from "./registry.js";
14
14
  import * as mailbox from "./mailbox.js";
15
+ import * as received from "./received.js";
16
+ import { deliverExistingToPeer, deliverToPeer } from "./delivery.js";
15
17
  import { recoverClaim, resolveAncestors, writeClaim } from "./claims.js";
16
18
  import { decideReplyAutoWake, defaultAutowakeDir } from "./autowake.js";
17
19
  import { markWoke, newWakeDebounceStore, recentlyWoke } from "./wake-debounce.js";
@@ -1053,7 +1055,11 @@ server.registerTool("send_message", {
1053
1055
  }
1054
1056
  const peer = resolved.entry;
1055
1057
  const fromSessionId = entry.client.session_id ?? undefined;
1056
- const msg = mailbox.enqueue(peer.server_pid, body, fromSessionId, {
1058
+ // deliverToPeer records the durable reply-handle in the recipient's ledger
1059
+ // BEFORE the mailbox line is visible, so a later reply_to_message(message_id)
1060
+ // resolves even after the destructive mailbox/hook drain — and never sees a
1061
+ // displayed-but-unrecorded handle (record precedes append).
1062
+ const msg = deliverToPeer(peer.client.session_id, peer.server_pid, body, fromSessionId, {
1057
1063
  reply_to,
1058
1064
  source_message_id,
1059
1065
  });
@@ -1076,6 +1082,100 @@ server.registerTool("send_message", {
1076
1082
  ...(wake_reason ? { wake_reason } : {}),
1077
1083
  });
1078
1084
  });
1085
+ server.registerTool("reply_to_message", {
1086
+ description: [
1087
+ "Reply to a specific inbound peer message by its message_id — the atomic, correlation-safe alternative to hand-wiring send_message's target + reply_to. The server looks the message up in this session's durable received-ledger, so you pass only the message_id the PreToolUse hook or read_my_messages already showed you; it derives the reply target (the original sender), carries reply_to=request_id when the inbound was an ask_peer (keeping the exchange correlated), and sets source_message_id for provenance. Replying to a plain send_message works too — it just omits reply_to. Ownership is structural: you can only reply to a message delivered to you.",
1088
+ "Delivery + wake match send_message exactly, including the wake-on-reply default: when the inbound carried a request_id and you leave wake unset, a freshly-idle requester is auto-woken; pass wake:\"auto\" to nudge any idle peer, or wake:\"off\" to suppress. Fail-closed: an unknown or aged-out message_id returns error message-not-found instead of guessing a target.",
1089
+ ].join(" "),
1090
+ inputSchema: {
1091
+ message_id: z
1092
+ .string()
1093
+ .min(1)
1094
+ .describe("The message_id of the inbound peer message you are replying to, as shown by the PreToolUse hook or read_my_messages."),
1095
+ body: z
1096
+ .string()
1097
+ .min(1)
1098
+ .refine((s) => Buffer.byteLength(s, "utf8") <= 8192, {
1099
+ message: "body exceeds 8192 UTF-8 bytes",
1100
+ })
1101
+ .describe("Reply body, ≤8KB UTF-8. Verbatim."),
1102
+ wake: z
1103
+ .enum(["off", "auto"])
1104
+ .optional()
1105
+ .describe('Wake strategy, same semantics as send_message. Unset: wake-on-reply default (auto-wakes a freshly-idle requester when the inbound was an ask_peer). "auto": nudge any idle peer. "off": no nudge.'),
1106
+ },
1107
+ }, async ({ message_id, body, wake }) => {
1108
+ const myId = entry.client.session_id;
1109
+ if (!myId) {
1110
+ return jsonResult({
1111
+ schema_version: 1,
1112
+ ok: false,
1113
+ error: "no-session-id",
1114
+ message: "This session has not claimed a session_id, so it has no received-ledger to reply from. Call claim_session first.",
1115
+ });
1116
+ }
1117
+ const inbound = received.lookupReceived(myId, message_id);
1118
+ if (!inbound) {
1119
+ return jsonResult({
1120
+ schema_version: 1,
1121
+ ok: false,
1122
+ error: "message-not-found",
1123
+ message: `No received message ${message_id} in this session's ledger (it may have aged out of retention, or predates reply_to_message). Fall back to send_message with an explicit target.`,
1124
+ });
1125
+ }
1126
+ const targetSid = inbound.from_session_id;
1127
+ if (!targetSid) {
1128
+ return jsonResult({
1129
+ schema_version: 1,
1130
+ ok: false,
1131
+ error: "no-reply-target",
1132
+ message: `Inbound message ${message_id} has no from_session_id, so there is no peer to reply to.`,
1133
+ });
1134
+ }
1135
+ const replyTo = inbound.request_id; // undefined when the inbound was a plain send_message
1136
+ const resolved = resolveTarget(targetSid, entry);
1137
+ if (!resolved.ok) {
1138
+ const replyDefault = replyAutoWakeTriggered(wake, replyTo);
1139
+ const wakeIntended = wake === "auto" || replyDefault;
1140
+ const wake_status = wakeIntended ? resolveErrorWakeStatus(resolved.error) : undefined;
1141
+ return jsonResult({
1142
+ schema_version: 1,
1143
+ ...resolved,
1144
+ in_reply_to_message_id: message_id,
1145
+ original_from_session_id: targetSid,
1146
+ ...(wake_status ? { wake_status } : {}),
1147
+ ...(replyDefault ? { wake_reason: "reply_to_default" } : {}),
1148
+ });
1149
+ }
1150
+ const peer = resolved.entry;
1151
+ const fromSessionId = entry.client.session_id ?? undefined;
1152
+ // Record the reply itself into the original asker's ledger (record-before-
1153
+ // append) so replies can be replied to in turn — chained correlation.
1154
+ const msg = deliverToPeer(peer.client.session_id, peer.server_pid, body, fromSessionId, {
1155
+ reply_to: replyTo,
1156
+ source_message_id: message_id,
1157
+ });
1158
+ const { wake_status, wake_reason } = await resolveSendWake(peer, wake, replyTo);
1159
+ if (wake_status) {
1160
+ trace("wake_outcome", {
1161
+ via: wake_reason === "reply_to_default" ? "reply_default" : "reply_to_message",
1162
+ wake_status,
1163
+ target_session_id: peer.client.session_id,
1164
+ client_type: peer.client.type,
1165
+ });
1166
+ }
1167
+ return jsonResult({
1168
+ schema_version: 1,
1169
+ ok: true,
1170
+ message_id: msg.id,
1171
+ in_reply_to_message_id: message_id,
1172
+ target_session_id: peer.client.session_id,
1173
+ target_server_pid: peer.server_pid,
1174
+ correlation: replyTo ? "correlated" : "uncorrelated",
1175
+ ...(wake_status ? { wake_status } : {}),
1176
+ ...(wake_reason ? { wake_reason } : {}),
1177
+ });
1178
+ });
1079
1179
  // read_my_messages budget. A session's union drain can return a backlog; cap
1080
1180
  // how much one call hands back so a flood (or a peer spamming near-8KB bodies)
1081
1181
  // can't blow the caller's context in a single drain. Overflow is NOT dropped or
@@ -1556,7 +1656,9 @@ server.registerTool("ask_peer", {
1556
1656
  const requestId = randomBytes(8).toString("hex");
1557
1657
  const requireReplyTo = peerSupportsReplyTo(peer);
1558
1658
  const fromSessionId = entry.client.session_id ?? undefined;
1559
- const msg = mailbox.enqueue(peer.server_pid, body, fromSessionId, {
1659
+ // Record-before-append (mirrors send_message): lets the peer answer with
1660
+ // reply_to_message(message_id) instead of hand-wiring target + reply_to.
1661
+ const msg = deliverToPeer(expectedSessionId, peer.server_pid, body, fromSessionId, {
1560
1662
  request_id: requestId,
1561
1663
  });
1562
1664
  const startedAt = Date.now();
@@ -1626,11 +1728,11 @@ server.registerTool("ask_peer", {
1626
1728
  // Re-enqueue so it's not lost.
1627
1729
  if (aborted && reply) {
1628
1730
  try {
1629
- mailbox.enqueue(entry.server_pid, reply.body, reply.from_session_id, {
1630
- request_id: reply.request_id,
1631
- reply_to: reply.reply_to,
1632
- source_message_id: reply.source_message_id,
1633
- });
1731
+ // Re-deliver the EXISTING reply: preserve reply.id and (re)write the
1732
+ // requester's received-ledger entry so reply_to_message against the
1733
+ // displayed id still resolves. mailbox.enqueue would mint a NEW id and
1734
+ // skip the ledger, breaking the reply handle on the abort path.
1735
+ deliverExistingToPeer(entry.client.session_id, entry.server_pid, reply);
1634
1736
  trace("ask_peer_abort_reenqueue", { message_id: reply.id });
1635
1737
  }
1636
1738
  catch (e) {
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "oxtail",
3
- "version": "0.12.0",
3
+ "version": "0.14.0",
4
4
  "private": false,
5
5
  "type": "module",
6
6
  "description": "Coordination layer for parallel AI coding agent sessions, exposed over MCP.",
@@ -23,10 +23,20 @@ export const HOOK_MARKER_KEY = "_oxtailHook";
23
23
  // and drop the redundant single-valued `origin` field. message_id +
24
24
  // from_session_id are still rendered (correlation/debug unaffected); a
25
25
  // stale pre-v5 hook is only larger, never wrong.
26
+ // v6: owner-token advisory lock (mirror of src/locks.ts) in pretooluse + stop.
27
+ // The lock dir gains a sidecar `<lock>.owner` token; stale removal is
28
+ // gated behind a single-winner `<lock>.steal` marker + compare-and-clear,
29
+ // and release only removes a lock we still own. The sidecar layout keeps
30
+ // the lock dir EMPTY so a pre-v6 hook's plain `rmdir` still removes a v6
31
+ // lock — i.e. mixed versions never WEDGE. They are not fully race-safe,
32
+ // though: a pre-v6 hook does an unconditional stale-rmdir / release-rmdir
33
+ // with no owner check, so during an upgrade window (before re-install) the
34
+ // old hook can still lose the stall-resume / double-clear races against a
35
+ // v6 peer. The version bump forces re-install to close that window.
26
36
  // INVARIANT: any change to an assets/*.sh script MUST bump this version, so
27
37
  // existing installs are forced to re-install. scripts/check-hook-version.mjs
28
38
  // enforces this in CI.
29
- export const HOOK_MARKER_VERSION = 5;
39
+ export const HOOK_MARKER_VERSION = 6;
30
40
 
31
41
  const HOOKS_DIR = path.join(os.homedir(), ".oxtail", "hooks");
32
42