npm - oxtail - Versions diffs - 0.13.0 → 0.14.1 - Mend

oxtail 0.13.0 → 0.14.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (10)

package/AGENTS.md +3 -0
package/assets/pretooluse.sh +68 -10
package/assets/stop.sh +68 -8
package/dist/delivery.js +21 -0
package/dist/locks.js +207 -0
package/dist/mailbox.js +71 -52
package/dist/received.js +23 -48
package/dist/server.js +6 -6
package/package.json +1 -1
package/scripts/hook-constants.mjs +11 -1

package/AGENTS.md CHANGED

@@ -56,9 +56,12 @@ The v0.9/v0.10.1 changes close the public dogfooding gaps found by real peer tra
 - **`ask_peer` replies must correlate when the peer supports it.** Same-peer chatter is not a reply. Upgraded peers advertise `capabilities.mailbox.reply_to` and must satisfy waits with `from_session_id == target.session_id` plus `reply_to == request_id`; unmatched messages stay in the mailbox. The older `from_session_id`-only path is legacy compatibility and must be surfaced as `correlation: "uncorrelated"`. For no-capability peers, stale same-peer chatter may still satisfy the wait; that is an explicit compatibility limitation, not a correctness guarantee.
 - **Peer messages are context, not user authority.** Mailbox provenance (`origin: "peer"`, `request_id`, `reply_to`, `source_message_id`) is diagnostic metadata, not a trust boundary. Hook text must keep the trust framing visible — the "context, not user authority" line plus the `from_session_id` / `request_id` / `reply_to` reply fields (full protocol names) are rendered on every delivery — and injected hook bodies must stay under an explicit budget. Single-valued provenance the framing already implies (`origin: "peer"`) stays in the mailbox JSONL but need not be rendered into context.
 - **A displayed reply handle must be resolvable: record the received-ledger before the mailbox line is visible.** Both delivery paths are destructive — `read_my_messages` and the PreToolUse/Stop hook each truncate the mailbox on handoff — so `reply_to_message` resolves `message_id` against a durable per-session ledger (`~/.oxtail/received/<hash(session_id)>.jsonl`), never the queue. `deliverToPeer` (the single delivery primitive behind `send_message` / `ask_peer` / `reply_to_message`) MUST write the ledger entry **before** appending the mailbox line: append-then-record reopens a window where the hook renders a `message_id` the receiver cannot yet reply to. The ledger is keyed and owned by receiver `session_id`; a lookup reads only the caller's own file. The ledger write is best-effort (a failure degrades to "no handle, reply via `send_message`") but must never reorder ahead of, or block, the actual delivery.
+- **Delivery mutations are crash-consistent; the shared advisory lock is owner-validated.** A crash mid-write must never corrupt a neighbour: mailbox appends heal a torn record boundary so two JSONL lines can't glue into one unparseable record (`appendLines`), and full-file rewrites (received-ledger, selective drain) go through a temp file + atomic `rename` (`atomicWrite`), never an in-place `writeFileSync` that a torn write could leave half-applied. The `mkdir` lock — shared between the Node server and the bash hooks — carries an owner token in a sidecar `<lock>.owner` (beside the dir so it stays empty and a plain `rmdir` still works cross-language): **release removes the lock only if it still owns it** (a stalled holder can't stomp a successor), and **stale removal is single-winner** (`<lock>.steal` marker) **plus compare-and-clear** (remove only if the owner is still the dead token observed). Provable race-freedom is unachievable on a plain shared filesystem (no atomic compare-and-swap); the design closes every realistic race and the only residuals — enumerated in `src/locks.ts` — require a >30s SIGSTOP-class stall inside a microsecond syscall gap, with a bounded consequence (rare double-delivery or a degraded reply-handle, never a wedge or torn file). One protocol, mirrored in `src/locks.ts` and both hooks. As a consequence-mitigation, the union reader `drainMany` dedups by `message_id` (v0.14.1), so the most realistic duplicate — a `migrateMailbox` crash-window leaving the same message in two sibling mailboxes — is delivered exactly once (ids are unique nonces, so dedup only ever collapses true duplicates).
 ## Recently shipped
+- **Crash-consistency + cross-language lock hardening (v0.14.0).** A `compile-sim` pass plus four Codex adversarial rounds hardened the delivery core against crash/torn-write and lock races. **Crash-consistency:** every mailbox append heals a torn previous write so a crash can't glue two JSONL records into one unparseable line (`appendLines` in `src/mailbox.ts`); the received-ledger rewrite and `drainFirstMatching`'s survivor rewrite are now atomic temp-file + `rename` (`atomicWrite`), so a torn write can't drop unrelated survivors / corrupt old reply handles. **Advisory lock:** the `mkdir` lock gains an owner token in a sidecar `<lock>.owner` (kept beside the dir so it stays empty and a bash hook's plain `rmdir` still works); **release** removes the lock only if it still owns it (closes stall-resume-release stomp), and **stale-clear** is gated behind a single-winner `<lock>.steal` marker + compare-and-clear (closes the double-clear). The protocol lives once in `src/locks.ts` and is mirrored in both bash hooks (`assets/pretooluse.sh`, `assets/stop.sh`; `HOOK_MARKER_VERSION` → 6 forces re-install). **Honest limit:** a provably race-free stale-recoverable lock isn't achievable on a plain shared FS — the residuals all require a >30s SIGSTOP-class stall in a microsecond syscall gap, bounded to a rare double-delivery / degraded reply-handle (never a wedge or torn file), documented in `src/locks.ts`. Also: `deliverExistingToPeer` preserves `message_id` + ledger on the `ask_peer` abort-recovery path (was minting a new id + skipping the ledger). Codex converged after 4 rounds (it broke the first two lock attempts before the owner-token + compare-and-clear design held).
 - **Reply by id + received-ledger (v0.13.0).** `reply_to_message(message_id, body)` looks the inbound envelope up in a durable per-session ledger and derives `target` / `reply_to` / `source_message_id` server-side, replacing the manual rewiring that silently degraded a correlated exchange into loose mailbox traffic. New `src/received.ts` (ledger: sha256-keyed file, `mkdir`-lock, bounded retention `OXTAIL_RECEIVED_MAX`=1000 with a `received_ledger_pruned` trace so a drop is never silent) and `src/delivery.ts` (`deliverToPeer` = `buildMessage` → `recordReceived` → `requeue` — the record-before-append ordering above), wired into `send_message` / `ask_peer` / `reply_to_message`. Adversarial race-pair + ledger-failure-still-delivers tests in `src/delivery.test.ts`. Converged with Codex over a 5-round peer-messaging pressure test; Codex's review caught the append-before-record race, fixed before merge.
 - **Wake hardening (v0.12.0 — issues #5/#6/#7, the v0.7-review backlog).** Three deferred wake items, landed together. **#6 (security):** wake send-keys now only ever target the pane the live process tree says hosts the peer's `server_pid` (`chooseVerifiedWakePane` → `currentPaneForServerPid`), never the peer's self-written `tmux_pane`/`tmux_session`; unverifiable ⇒ refuse (`skipped_no_target`). Registry-sourced tmux ids are shape-validated (`isValidTmuxPane`/`isValidTmuxSession`) and a spoofed `TMUX_PANE` env is ignored. This removed the cached-pane and session-name send-keys fallbacks (legit peers always register a real pane; churn is handled by re-resolution). **#5 (debounce):** all wake paths funnel through `wakePeer`, which coalesces repeat wakes to the same peer within `OXTAIL_WAKE_DEBOUNCE_MS` (default 1s, in-memory per process) ⇒ `skipped_debounced`. **#7 (observability):** a `wake_outcome` trace event per wake; `oxtail diagnose` summarizes wake_status counts by tool from `MCP_TRACE_FILE`; a scheduled `codex-drift.yml` fails if Codex's `PASTE_ENTER_SUPPRESS_WINDOW` drifts past our 500ms gap. New modules: `src/wake-debounce.ts`, `src/diagnose.ts`; `chooseVerifiedWakePane` in `src/registry.ts`.
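
The torn-record healing described in the AGENTS.md notes above can be illustrated with a small self-contained sketch. It is not the package's code: `appendWithHeal` and `readRecords` are hypothetical helpers written for this example, and the file lives in a temporary directory rather than under `~/.oxtail`. The sketch simulates a crash that leaves a partial JSONL record with no trailing newline, then compares a plain append with an append that restores the record boundary first.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper modelled on the behaviour described for `appendLines`:
// if the file is non-empty and its last byte is not "\n", prepend one.
function appendWithHeal(file, data) {
  let heal = false;
  try {
    const size = fs.statSync(file).size;
    if (size > 0) {
      const fd = fs.openSync(file, "r");
      try {
        const last = Buffer.alloc(1);
        fs.readSync(fd, last, 0, 1, size - 1);
        heal = last[0] !== 0x0a; // 0x0a is "\n"
      } finally {
        fs.closeSync(fd);
      }
    }
  } catch (e) {
    if (e.code !== "ENOENT") throw e; // a missing file needs no healing
  }
  fs.appendFileSync(file, heal ? "\n" + data : data);
}

// Parse a JSONL file, counting the lines that fail JSON.parse.
function readRecords(file) {
  const records = [];
  let dropped = 0;
  for (const line of fs.readFileSync(file, "utf8").split("\n")) {
    if (line === "") continue;
    try {
      records.push(JSON.parse(line));
    } catch {
      dropped++;
    }
  }
  return { records, dropped };
}

const dir = fs.mkdtempSync(path.join(os.tmpdir(), "heal-demo-"));
const torn = '{"id":"a","body":"trunc'; // simulated crash mid-write
const next = '{"id":"b","body":"hello"}\n';

// Plain append: the new record glues onto the torn one and both are lost.
const naiveFile = path.join(dir, "naive.jsonl");
fs.writeFileSync(naiveFile, torn);
fs.appendFileSync(naiveFile, next);
const naive = readRecords(naiveFile);

// Healing append: the torn record is still lost, but its neighbour survives.
const healedFile = path.join(dir, "healed.jsonl");
fs.writeFileSync(healedFile, torn);
appendWithHeal(healedFile, next);
const healed = readRecords(healedFile);

console.log("naive:", naive.records.length, "parsed,", naive.dropped, "dropped");
console.log("healed:", healed.records.length, "parsed,", healed.dropped, "dropped");
```

The sketch covers only the append path. Full-file rewrites are a separate concern, which the notes address with a temp file plus `rename`.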

package/assets/pretooluse.sh CHANGED

@@ -60,19 +60,73 @@ done < <(grep -lE "\"session_id\"[[:space:]]*:[[:space:]]*\"$sid\"" "$sessions_d
 [ "${#mboxes[@]}" -eq 0 ] && exit 0
-# 3. Acquire each mailbox's mkdir-based lock (best-effort; 30s staleness window,
-#    matching src/mailbox.ts:LOCK_STALE_MS). GNU and BSD stat formats differ.
-locked=()
-for m in "${mboxes[@]}"; do
+# ── Advisory lock: owner-token mkdir lock — mirror of src/locks.ts ────────────
+# The lock is a mkdir dir; the owner token lives in the SIDECAR file
+# "<lock>.owner" (beside the dir, not inside, so the dir stays empty and a plain
+# rmdir still removes it). Stale removal is gated behind a single-winner mkdir
+# "<lock>.steal" marker plus compare-and-clear (remove only if the owner is still
+# the dead token we observed), and release removes the lock only if we still own
+# it. Keep in sync with src/locks.ts. GNU and BSD stat formats differ.
+OXL_STALE=30        # seconds; mirror src/mailbox.ts LOCK_STALE_MS — also the
+                    # marker-staleness window (same SIGSTOP-class threshold)
+oxl_now() { date +%s 2>/dev/null || echo 0; }
+oxl_mtime() { stat -c %Y "$1" 2>/dev/null || stat -f %m "$1" 2>/dev/null || echo 0; }
+oxl_token() { # pid.random; tolerate a missing /dev/urandom without degrading to bare pid
+  local r
+  r=$(od -An -N6 -tx1 /dev/urandom 2>/dev/null | tr -d ' \n')
+  [ -n "$r" ] || r="${RANDOM}${RANDOM}${RANDOM}"
+  echo "$$.$r"
+}
+oxl_owner() { cat "$1.owner" 2>/dev/null || true; }
+oxl_clear_stale() { # $1=lock dir; returns 0 if it did clearing work (retry mkdir)
+  local lock="$1" n mt obs smt
+  n=$(oxl_now); mt=$(oxl_mtime "$lock")
+  [ "$mt" -gt 0 ] || return 1
+  [ $((n - mt)) -gt "$OXL_STALE" ] || return 1
+  obs=$(oxl_owner "$lock")
+  if mkdir "$lock.steal" 2>/dev/null; then
+    if [ "x$(oxl_owner "$lock")" = "x$obs" ]; then
+      rm -f "$lock.owner" 2>/dev/null
+      rmdir "$lock" 2>/dev/null || rm -rf "$lock" 2>/dev/null
+    fi
+    rmdir "$lock.steal" 2>/dev/null
+    return 0
+  fi
+  smt=$(oxl_mtime "$lock.steal")
+  if [ "$smt" -gt 0 ] && [ $((n - smt)) -gt "$OXL_STALE" ]; then rmdir "$lock.steal" 2>/dev/null; fi
+  return 1
+}
+oxl_acquire() { # $1=lock dir; prints owner token on success, returns 0/1
+  local lock="$1" t i
+  t=$(oxl_token)
   for i in $(seq 1 50); do
-    if mkdir "$m.lock" 2>/dev/null; then locked+=("$m"); break; fi
-    now=$(date +%s 2>/dev/null || echo 0)
-    mtime=$(stat -c %Y "$m.lock" 2>/dev/null || stat -f %m "$m.lock" 2>/dev/null || echo 0)
-    if [ "$mtime" -gt 0 ] && [ $((now - mtime)) -gt 30 ]; then
-      rmdir "$m.lock" 2>/dev/null
+    if mkdir "$lock" 2>/dev/null; then
+      printf '%s' "$t" > "$lock.owner" 2>/dev/null || true
+      printf '%s' "$t"
+      return 0
     fi
+    oxl_clear_stale "$lock" && continue
     sleep 0.01
   done
+  return 1
+}
+oxl_release() { # $1=lock dir, $2=our token — remove only if we PROVABLY own it
+  local lock="$1" t="$2" o
+  o=$(oxl_owner "$lock")
+  if [ -z "$t" ] || [ "x$o" = "x$t" ]; then
+    rm -f "$lock.owner" 2>/dev/null
+    rmdir "$lock" 2>/dev/null || true
+  fi
+  # owner differs or absent → not provably ours; leave it (it ages into a stale
+  # lock and is reclaimed by oxl_clear_stale) rather than stomp a successor.
+}
+# ─────────────────────────────────────────────────────────────────────────────
+# 3. Acquire each mailbox's owner-token lock (best-effort; 30s staleness window).
+locked=()
+locked_tokens=()
+for m in "${mboxes[@]}"; do
+  tok=$(oxl_acquire "$m.lock") && { locked+=("$m"); locked_tokens+=("$tok"); }
 done
 [ "${#locked[@]}" -eq 0 ] && exit 0
@@ -183,5 +237,9 @@ if [ -n "$output" ]; then
   for m in "${locked[@]}"; do : > "$m"; done
 fi
-for m in "${locked[@]}"; do rmdir "$m.lock" 2>/dev/null || true; done
+ri=0
+for m in "${locked[@]}"; do
+  oxl_release "$m.lock" "${locked_tokens[$ri]:-}"
+  ri=$((ri + 1))
+done
 exit 0
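
The hook comments explain that the owner token sits in a sidecar beside the lock directory so that a plain `rmdir` keeps working. A minimal sketch of why that placement matters, written in Node rather than bash and using a temporary directory with made-up lock names and a made-up token value (all assumptions for illustration):

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

const dir = fs.mkdtempSync(path.join(os.tmpdir(), "sidecar-demo-"));

// Layout described in the hooks: token in "<lock>.owner", beside the lock dir.
const beside = path.join(dir, "a.jsonl.lock");
fs.mkdirSync(beside);
fs.writeFileSync(`${beside}.owner`, "1234.abcdef");
fs.rmdirSync(beside); // works: the lock dir itself is empty
const besideRemoved = !fs.existsSync(beside);

// Counter-example (not what the package does): token stored inside the lock dir.
const inside = path.join(dir, "b.jsonl.lock");
fs.mkdirSync(inside);
fs.writeFileSync(path.join(inside, "owner"), "1234.abcdef");
let insideError = null;
try {
  fs.rmdirSync(inside); // a plain rmdir refuses a non-empty directory
} catch (e) {
  insideError = e.code;
}

console.log("beside removed:", besideRemoved);
console.log("inside rmdir error:", insideError);
```

With the token inside the directory, every peer that removes the lock would have to know the layout. Keeping it beside the directory lets an older or simpler remover succeed, at the cost of an occasional orphaned `.owner` file.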

package/assets/stop.sh CHANGED

@@ -90,16 +90,72 @@ if [ "${#mboxes[@]}" -eq 0 ]; then
   exit 0
 fi
-# 5. Lock each non-empty mailbox (best-effort; 30s staleness window).
-locked=()
-for m in "${mboxes[@]}"; do
+# ── Advisory lock: owner-token mkdir lock — mirror of src/locks.ts ────────────
+# Lock is a mkdir dir; owner token lives in the SIDECAR "<lock>.owner" (beside,
+# not inside, so the dir stays empty and a plain rmdir still removes it). Stale
+# removal is gated behind a single-winner mkdir "<lock>.steal" marker plus
+# compare-and-clear; release removes the lock only if we still own it. Keep in
+# sync with src/locks.ts (identical block in assets/pretooluse.sh).
+OXL_STALE=30        # seconds; mirror src/mailbox.ts LOCK_STALE_MS — also the
+                    # marker-staleness window (same SIGSTOP-class threshold)
+oxl_now() { date +%s 2>/dev/null || echo 0; }
+oxl_mtime() { stat -c %Y "$1" 2>/dev/null || stat -f %m "$1" 2>/dev/null || echo 0; }
+oxl_token() { # pid.random; tolerate a missing /dev/urandom without degrading to bare pid
+  local r
+  r=$(od -An -N6 -tx1 /dev/urandom 2>/dev/null | tr -d ' \n')
+  [ -n "$r" ] || r="${RANDOM}${RANDOM}${RANDOM}"
+  echo "$$.$r"
+}
+oxl_owner() { cat "$1.owner" 2>/dev/null || true; }
+oxl_clear_stale() { # $1=lock dir; returns 0 if it did clearing work (retry mkdir)
+  local lock="$1" n mt obs smt
+  n=$(oxl_now); mt=$(oxl_mtime "$lock")
+  [ "$mt" -gt 0 ] || return 1
+  [ $((n - mt)) -gt "$OXL_STALE" ] || return 1
+  obs=$(oxl_owner "$lock")
+  if mkdir "$lock.steal" 2>/dev/null; then
+    if [ "x$(oxl_owner "$lock")" = "x$obs" ]; then
+      rm -f "$lock.owner" 2>/dev/null
+      rmdir "$lock" 2>/dev/null || rm -rf "$lock" 2>/dev/null
+    fi
+    rmdir "$lock.steal" 2>/dev/null
+    return 0
+  fi
+  smt=$(oxl_mtime "$lock.steal")
+  if [ "$smt" -gt 0 ] && [ $((n - smt)) -gt "$OXL_STALE" ]; then rmdir "$lock.steal" 2>/dev/null; fi
+  return 1
+}
+oxl_acquire() { # $1=lock dir; prints owner token on success, returns 0/1
+  local lock="$1" t i
+  t=$(oxl_token)
   for i in $(seq 1 50); do
-    if mkdir "$m.lock" 2>/dev/null; then locked+=("$m"); break; fi
-    now=$(date +%s 2>/dev/null || echo 0)
-    mt=$(stat -c %Y "$m.lock" 2>/dev/null || stat -f %m "$m.lock" 2>/dev/null || echo 0)
-    if [ "$mt" -gt 0 ] && [ $((now - mt)) -gt 30 ]; then rmdir "$m.lock" 2>/dev/null; fi
+    if mkdir "$lock" 2>/dev/null; then
+      printf '%s' "$t" > "$lock.owner" 2>/dev/null || true
+      printf '%s' "$t"
+      return 0
+    fi
+    oxl_clear_stale "$lock" && continue
     sleep 0.01
   done
+  return 1
+}
+oxl_release() { # $1=lock dir, $2=our token — remove only if we PROVABLY own it
+  local lock="$1" t="$2" o
+  o=$(oxl_owner "$lock")
+  if [ -z "$t" ] || [ "x$o" = "x$t" ]; then
+    rm -f "$lock.owner" 2>/dev/null
+    rmdir "$lock" 2>/dev/null || true
+  fi
+  # owner differs or absent → not provably ours; leave it (it ages into a stale
+  # lock and is reclaimed by oxl_clear_stale) rather than stomp a successor.
+}
+# ─────────────────────────────────────────────────────────────────────────────
+# 5. Lock each non-empty mailbox (best-effort; 30s staleness window).
+locked=()
+locked_tokens=()
+for m in "${mboxes[@]}"; do
+  tok=$(oxl_acquire "$m.lock") && { locked+=("$m"); locked_tokens+=("$tok"); }
 done
 # Couldn't lock anything → leave messages for next time. This still allows the
 # turn to stop, so mark idle; otherwise wake:auto will suppress a wake for a
@@ -215,5 +271,9 @@ else
   mark_idle
 fi
-for m in "${locked[@]}"; do rmdir "$m.lock" 2>/dev/null || true; done
+ri=0
+for m in "${locked[@]}"; do
+  oxl_release "$m.lock" "${locked_tokens[$ri]:-}"
+  ri=$((ri + 1))
+done
 exit 0

package/dist/delivery.js CHANGED

@@ -30,3 +30,24 @@ export function deliverToPeer(receiverSessionId, targetPid, body, fromSessionId,
     mailbox.requeue(targetPid, msg);
     return msg;
 }
+// Re-deliver an ALREADY-BUILT message to a peer, preserving its message_id and
+// (re)recording the receiver's ledger handle BEFORE the mailbox line becomes
+// visible — same record-before-append ordering as deliverToPeer. Used by the
+// ask_peer abort-recovery path: the reply was drained into memory but the client
+// aborted before it was returned, so it must be re-enqueued WITHOUT minting a new
+// id. (mailbox.enqueue would mint a fresh id and skip the ledger, so the
+// redelivered reply's displayed id resolves to message-not-found on
+// reply_to_message.) The ledger write is best-effort — a failure must never drop
+// the redelivery; worst case the handle is missing and the peer falls back to
+// send_message.
+export function deliverExistingToPeer(receiverSessionId, targetPid, msg) {
+    if (receiverSessionId) {
+        try {
+            recordReceived(receiverSessionId, msg);
+        }
+        catch (e) {
+            trace("received_ledger_write_failed", { message_id: msg.id, error: String(e) });
+        }
+    }
+    mailbox.requeue(targetPid, msg);
+}
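
The record-before-append ordering and the best-effort ledger write in `deliverExistingToPeer` can be modelled without any files. The sketch below is an in-memory illustration only: `makeStore`, `redeliver`, `record`, and `failingRecord` are hypothetical names, and the `Map` and array stand in for the ledger file and the mailbox file.

```javascript
// In-memory stand-ins (hypothetical) for the received-ledger and the mailbox.
function makeStore() {
  return { ledger: new Map(), mailbox: [], events: [] };
}

// Re-deliver an already-built message: ledger first, then the mailbox.
// A ledger failure is logged and must not prevent the delivery.
function redeliver(store, msg, recordFn) {
  try {
    recordFn(store, msg);
  } catch (e) {
    store.events.push(`ledger_failed:${msg.id}`);
  }
  store.mailbox.push(msg); // the message becomes visible only at this point
  store.events.push(`queued:${msg.id}`);
}

const record = (store, msg) => {
  store.ledger.set(msg.id, msg);
  store.events.push(`recorded:${msg.id}`);
};
const failingRecord = () => {
  throw new Error("simulated ledger failure");
};

// Normal path: the handle is recorded before the message is queued,
// and the original id is kept.
const ok = makeStore();
redeliver(ok, { id: "m1", body: "reply" }, record);

// Degraded path: the ledger write fails, the message is still delivered.
const degraded = makeStore();
redeliver(degraded, { id: "m2", body: "reply" }, failingRecord);

console.log(ok.events);
console.log(degraded.events, degraded.ledger.has("m2"));
```

In the degraded case the receiver has the message but no resolvable handle, which matches the fallback to `send_message` described in the comment above.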

package/dist/locks.js ADDED

@@ -0,0 +1,207 @@
+import { randomBytes } from "node:crypto";
+import { mkdirSync, readFileSync, rmdirSync, rmSync, statSync, unlinkSync, writeFileSync, } from "node:fs";
+import { trace } from "./trace.js";
+// Shared advisory-lock primitive for the mkdir-based locks used by both the
+// mailbox queues (mailbox.ts) and the received-ledger (received.ts), and mirrored
+// in the bash hooks (assets/pretooluse.sh, assets/stop.sh). Centralised here
+// because stale-recovery is subtle and must be reasoned about (and tested) once.
+//
+// HONEST LIMIT: a provably race-free, stale-recoverable advisory lock is not
+// achievable on a plain shared filesystem (no atomic compare-and-swap; every
+// "detect stale → remove → reacquire" has a check-then-act window). This design
+// eliminates the REALISTIC failure modes; the residuals that remain ALL require
+// a process to stall (SIGSTOP / huge swap / multi-second GC) past the 30s stale
+// window while inside a microsecond-wide gap between two syscalls:
+//   (a) a clearer that stalls >30s between its owner-compare and its rmdir, while
+//       another clearer reclaims the (now >30s-stale) steal marker and reacquires;
+//   (b) a holder that stalls >30s between mkdir(lock) and writeOwner(lock), gets
+//       stale-cleared as owner-less, then resumes and overwrites a successor's
+//       owner sidecar;
+//   (c) a holder that stalls >30s mid-critical-section and resumes to do its data
+//       write believing it still holds the lock (the data ops do not re-validate
+//       ownership before writing).
+// Eliminating these needs kernel-arbitrated locks (flock/fcntl), which are not
+// viable here because the lock is shared with bash hooks on macOS (no flock CLI).
+// The consequence of any of these firing is bounded — a rare double-delivery
+// (benign once readers dedup by message_id) or a rare ledger lost-update (the
+// reply handle degrades to send_message), never a wedge or torn file.
+//
+// Two mechanisms do the work:
+//  1. OWNER TOKEN. Each acquisition writes a unique token into the SIDECAR file
+//     `<lock>.owner` (kept beside the lock dir, NOT inside it, so the lock dir
+//     stays empty and a bash hook's plain `rmdir <lock>` still works cross-
+//     language). Release only removes the lock if the token still matches — so a
+//     holder that stalled past the stale window, got its lock stolen, and then
+//     resumes can no longer rmdir the SUCCESSOR's fresh lock (stall-resume bug).
+//  2. SINGLE-WINNER + COMPARE-AND-CLEAR. Stale removal is gated behind an atomic
+//     `mkdir(<lock>.steal)` marker, and the clearer removes the lock only if its
+//     owner is STILL the dead token it observed. While the marker is held and the
+//     lock still exists, nobody else can clear (marker) or acquire (mkdir EEXIST),
+//     so the owner is stable across the check→rmdir. And the actual acquire is
+//     ALWAYS the single-winner `mkdir(lock)`, so even redundant clears can never
+//     produce two owners — the worst they do is race to recreate the lock, which
+//     exactly one wins.
+const LOCK_RETRY_LIMIT = 50;
+const LOCK_RETRY_DELAY_MS = 10;
+function sleepSync(ms) {
+    Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
+}
+// Sidecar beside the lock dir (not inside) so the lock dir stays empty and a
+// bash hook's plain `rmdir <lock>` still removes a Node-held lock. An orphaned
+// sidecar (lock dir removed but sidecar left, e.g. by a bash clearer that doesn't
+// know about it) is harmless — the next acquirer overwrites it.
+function ownerPath(lock) {
+    return `${lock}.owner`;
+}
+function mintToken() {
+    return `${process.pid}.${randomBytes(8).toString("hex")}`;
+}
+// Read the owner token, or null if the lock has none (foreign/legacy lock, or a
+// lock observed in the tiny window after mkdir but before the owner write).
+function readOwner(lock) {
+    try {
+        return readFileSync(ownerPath(lock), "utf8");
+    }
+    catch {
+        return null;
+    }
+}
+function writeOwner(lock, token) {
+    try {
+        writeFileSync(ownerPath(lock), token, { mode: 0o600 });
+    }
+    catch {
+        // Best effort: an owner-less lock still excludes (the dir exists); it just
+        // loses the stall-resume protection until the next acquisition.
+    }
+}
+// Remove the lock dir and its owner file. Tolerates a foreign non-empty lock dir
+// (e.g. one a bash hook or test created without our layout) via a recursive rm.
+function removeLock(lock) {
+    try {
+        unlinkSync(ownerPath(lock));
+    }
+    catch {
+        // no owner file — fine
+    }
+    try {
+        rmdirSync(lock);
+    }
+    catch (e) {
+        const err = e;
+        if (err.code === "ENOENT")
+            return;
+        // Non-empty (foreign contents) or other — fall back to recursive removal.
+        try {
+            rmSync(lock, { recursive: true, force: true });
+        }
+        catch {
+            // best effort
+        }
+    }
+}
+// Compare-and-clear a stale lock under the single-winner steal marker. Returns
+// true iff this call did the clearing work (caller retries mkdir immediately);
+// false if the lock is fresh, vanished, or another clearer holds the marker
+// (caller should sleep and retry).
+export function clearStaleLock(lock, staleMs, traceEvent, traceCtx) {
+    let st;
+    try {
+        st = statSync(lock);
+    }
+    catch {
+        return false; // lock vanished between the failed mkdir and now — just retry
+    }
+    if (Date.now() - st.mtimeMs <= staleMs)
+        return false; // fresh holder — wait
+    const observed = readOwner(lock); // the (presumed dead) holder's token, or null
+    const steal = `${lock}.steal`;
+    try {
+        mkdirSync(steal, { mode: 0o700 });
+    }
+    catch (e) {
+        const err = e;
+        if (err.code === "EEXIST") {
+            // Another clearer holds the marker. If the marker is itself stale by the
+            // SAME stale window as the lock (its clearer crashed/SIGSTOP'd mid-steal),
+            // force it so recovery cannot wedge forever. Using the lock's stale window
+            // (not a shorter one) means a clearer can only be displaced after a full
+            // 30s stall — the same SIGSTOP-class threshold as every other residual —
+            // rather than after a brief pause. Compare-and-clear below still refuses to
+            // remove a lock whose owner changed, backstopping a reclaim race.
+            try {
+                const sst = statSync(steal);
+                if (Date.now() - sst.mtimeMs > staleMs) {
+                    try {
+                        rmdirSync(steal);
+                    }
+                    catch {
+                        // raced with another clearer — fine
+                    }
+                }
+            }
+            catch {
+                // marker vanished — fine
+            }
+        }
+        return false; // lost the steal — sleep and retry
+    }
+    // Sole clearer (modulo a leaked-marker race, which compare-and-clear backstops).
+    // Re-read the owner now: if it still equals what we observed, the dead holder's
+    // lock is unchanged and safe to remove; if it changed, someone reacquired and
+    // we must leave their lock alone.
+    if (readOwner(lock) === observed) {
+        removeLock(lock);
+        trace(traceEvent, traceCtx);
+    }
+    try {
+        rmdirSync(steal);
+    }
+    catch {
+        // best effort — a leaked marker is force-cleared by the next clearer
+    }
+    return true;
+}
+// Acquire the advisory lock, returning the owner token to hand back to
+// releaseDirLock. The caller is responsible for creating the parent directory.
+export function acquireDirLock(lock, staleMs, traceEvent, traceCtx) {
+    const token = mintToken();
+    for (let i = 0; i < LOCK_RETRY_LIMIT; i++) {
+        try {
+            mkdirSync(lock, { mode: 0o700 });
+            writeOwner(lock, token);
+            return token;
+        }
+        catch (e) {
+            const err = e;
+            if (err.code !== "EEXIST")
+                throw err;
+            if (clearStaleLock(lock, staleMs, traceEvent, traceCtx))
+                continue;
+            sleepSync(LOCK_RETRY_DELAY_MS);
+        }
+    }
+    throw new Error(`could not acquire lock at ${lock}`);
+}
+// Release the lock — but only if we PROVABLY still own it (owner === our token).
+// A holder that stalled past the stale window and was stolen from sees a
+// different owner and leaves the successor's lock intact. We deliberately do NOT
+// remove on an absent owner: a successor in its mkdir→writeOwner window has no
+// owner yet, and removing then would stomp its fresh lock (Codex round-3). If our
+// OWN owner write was lost, the cost is a leaked lock — which simply ages into a
+// stale lock and is reclaimed by clearStaleLock, strictly safer than a stomp.
+export function releaseDirLock(lock, token) {
+    if (!token) {
+        removeLock(lock); // no token to verify (defensive/legacy) — best-effort
+        return;
+    }
+    const owner = readOwner(lock);
+    if (owner === token) {
+        removeLock(lock);
+    }
+    else {
+        // Owner differs or is absent → not provably ours; leave it. A truly
+        // abandoned lock becomes stale and is reclaimed by clearStaleLock.
+        trace("lock_release_skipped_not_owner", { lock });
+    }
+}
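
The stall-resume case that owner-validated release guards against can be reproduced in a few lines. This is a simplified, self-contained sketch rather than the module above: `acquire` and `release` are hypothetical reductions with no retry loop, no steal marker, and no tracing, and the stale clear is simulated by removing the lock by hand.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");
const crypto = require("node:crypto");

function readOwner(lock) {
  try {
    return fs.readFileSync(`${lock}.owner`, "utf8");
  } catch {
    return null; // no sidecar: the owner is unknown
  }
}

// Simplified acquire: single attempt, throws EEXIST if the lock is held.
function acquire(lock) {
  fs.mkdirSync(lock);
  const token = `${process.pid}.${crypto.randomBytes(8).toString("hex")}`;
  fs.writeFileSync(`${lock}.owner`, token);
  return token;
}

// Simplified release: remove the lock only if the sidecar still holds our token.
function release(lock, token) {
  if (readOwner(lock) !== token) return false;
  fs.rmSync(`${lock}.owner`, { force: true });
  fs.rmdirSync(lock);
  return true;
}

const dir = fs.mkdtempSync(path.join(os.tmpdir(), "lock-demo-"));
const lock = path.join(dir, "demo.lock");

// Holder A acquires, then "stalls" past the stale window.
const tokenA = acquire(lock);

// Simulated stale clear by another process, followed by successor B acquiring.
fs.rmSync(`${lock}.owner`, { force: true });
fs.rmdirSync(lock);
const tokenB = acquire(lock);

// A resumes and releases late: the owner no longer matches, so B's lock stays.
const lateRelease = release(lock, tokenA);
const stillHeld = fs.existsSync(lock);

// B's own release succeeds.
const ownRelease = release(lock, tokenB);

console.log({ lateRelease, stillHeld, ownRelease });
```

Without the token check, A's late release would remove B's fresh lock and a third process could enter while B is still inside its critical section.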

package/dist/mailbox.js CHANGED

@@ -1,7 +1,8 @@
 import { randomBytes } from "node:crypto";
-import { appendFileSync, mkdirSync, readFileSync, rmdirSync, statSync, truncateSync, writeFileSync, } from "node:fs";
+import { appendFileSync, closeSync, mkdirSync, openSync, readFileSync, readSync, renameSync, statSync, truncateSync, writeFileSync, } from "node:fs";
 import { homedir } from "node:os";
 import { join } from "node:path";
+import { acquireDirLock, releaseDirLock } from "./locks.js";
 import { trace } from "./trace.js";
 // Resolved lazily so tests can swap HOME between cases. Each call re-reads
 // homedir(), which on POSIX defers to $HOME.
@@ -17,58 +18,24 @@ function mailboxesDir() {
 //
 // Sync this value with assets/pretooluse.sh (find -mmin +0.5 ≈ 30s).
 const LOCK_STALE_MS = 30_000;
-const LOCK_RETRY_LIMIT = 50;
-const LOCK_RETRY_DELAY_MS = 10;
 function mailboxPath(pid) {
     return join(mailboxesDir(), `${pid}.jsonl`);
 }
 function lockPath(pid) {
     return `${mailboxPath(pid)}.lock`;
 }
-function sleepSync(ms) {
-    Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
-}
+// Owner tokens for held locks, so releaseLock can prove ownership (a lock stolen
+// out from under a stalled holder is not removed on its late release). Keyed by
+// pid; never two concurrent acquisitions of the same pid within one process.
+const lockTokens = new Map();
 export function acquireLock(pid) {
     mkdirSync(mailboxesDir(), { recursive: true, mode: 0o700 });
-    const lock = lockPath(pid);
-    for (let i = 0; i < LOCK_RETRY_LIMIT; i++) {
-        try {
-            mkdirSync(lock, { mode: 0o700 });
-            return;
-        }
-        catch (e) {
-            const err = e;
-            if (err.code !== "EEXIST")
-                throw err;
-            // Check staleness. If older than LOCK_STALE_MS, force-clear and retry.
-            try {
-                const st = statSync(lock);
-                if (Date.now() - st.mtimeMs > LOCK_STALE_MS) {
-                    try {
-                        rmdirSync(lock);
-                        trace("mailbox_lock_stale_clear", { pid });
-                    }
-                    catch {
-                        // raced with another clearer; fall through to retry
-                    }
-                    continue;
-                }
-            }
-            catch {
-                // stat may race; just retry
-            }
-            sleepSync(LOCK_RETRY_DELAY_MS);
-        }
-    }
-    throw new Error(`could not acquire mailbox lock for pid ${pid}`);
+    lockTokens.set(pid, acquireDirLock(lockPath(pid), LOCK_STALE_MS, "mailbox_lock_stale_clear", { pid }));
 }
 export function releaseLock(pid) {
-    try {
-        rmdirSync(lockPath(pid));
-    }
-    catch {
-        // ignore ENOENT / not-empty / EPERM
-    }
+    const token = lockTokens.get(pid);
+    lockTokens.delete(pid);
+    releaseDirLock(lockPath(pid), token ?? "");
 }
 // Critical: the serialized JSONL line must always begin
 // `{"schema_version":1,"id":"...","body":"`. The awk extractor in
@@ -107,6 +74,47 @@ export function serializeMailboxLine(msg) {
     }
     return line;
 }
+// Append JSONL bytes to a mailbox, healing a missing record boundary first.
+// appendFileSync of a buffer is NOT a single atomic syscall, so a crash/torn
+// write can leave a file ending in a partial line with no trailing "\n". A later
+// append would then concatenate onto that partial line, gluing two records into
+// one line that fails JSON.parse in BOTH drain() and the awk hook — silently
+// dropping both messages. If the file is non-empty and its last byte isn't "\n",
+// prepend one so the boundary is restored (the already-torn record is still lost,
+// but it can no longer eat its neighbor). Every append path routes through here.
+function appendLines(path, buf) {
+    let heal = false;
+    let fd;
+    try {
+        const st = statSync(path);
+        if (st.size > 0) {
+            fd = openSync(path, "r");
+            const last = Buffer.alloc(1);
+            readSync(fd, last, 0, 1, st.size - 1);
+            heal = last[0] !== 0x0a; // 0x0a === "\n"
+        }
+    }
+    catch (e) {
+        const err = e;
+        if (err.code !== "ENOENT")
+            throw err;
+    }
+    finally {
+        if (fd !== undefined)
+            closeSync(fd);
+    }
+    appendFileSync(path, heal ? "\n" + buf : buf);
+}
+// Atomically replace a file's contents: write to a unique temp file in the same
+// directory, then renameSync over the target. rename(2) is atomic on POSIX, so a
+// concurrent reader/crasher never observes a torn file — unlike writeFileSync,
+// which issues multiple write() syscalls and can leave a half-written line on
+// crash, dropping unrelated surviving records.
+function atomicWrite(path, data) {
+    const tmp = `${path}.tmp.${process.pid}.${randomBytes(6).toString("hex")}`;
+    writeFileSync(tmp, data, { mode: 0o600 });
+    renameSync(tmp, path);
+}
 // Mint a message envelope WITHOUT writing it anywhere. Split out from enqueue so
 // a higher layer (delivery.ts) can record the durable received-ledger entry
 // BEFORE the mailbox line becomes visible — the ordering that guarantees any
@@ -129,7 +137,7 @@ export function enqueue(target_pid, body, from_session_id, options = {}) {
     const msg = buildMessage(body, from_session_id, options);
     acquireLock(target_pid);
     try {
-        appendFileSync(mailboxPath(target_pid), serializeMailboxLine(msg));
+        appendLines(mailboxPath(target_pid), serializeMailboxLine(msg));
     }
     finally {
         releaseLock(target_pid);
@@ -144,7 +152,7 @@ export function requeue(target_pid, msg) {
     const line = serializeMailboxLine(msg);
     acquireLock(target_pid);
     try {
-        appendFileSync(mailboxPath(target_pid), line);
+        appendLines(mailboxPath(target_pid), line);
     }
     finally {
         releaseLock(target_pid);
@@ -161,7 +169,7 @@ export function requeueMany(target_pid, msgs) {
         buf += serializeMailboxLine(m);
     acquireLock(target_pid);
     try {
-        appendFileSync(mailboxPath(target_pid), buf);
+        appendLines(mailboxPath(target_pid), buf);
     }
     finally {
         releaseLock(target_pid);
@@ -174,17 +182,28 @@ export function requeueMany(target_pid, msgs) {
 // of silently stranding it. Best-effort per pid: a contended/unreadable mailbox
 // is skipped (counted) and left for the next poll rather than failing the whole
 // drain — one stuck lock must not block a session's entire inbox.
+//
+// Deduped by message_id: a migrateMailbox crash-window (append to dest done, but
+// the process died before truncating the source) can leave the SAME message in
+// two unioned sibling mailboxes. Both copies are drained (so neither lingers) but
+// the message is returned ONCE. message_id is a unique per-message nonce, so this
+// only ever collapses true duplicates, never two distinct messages.
 export function drainMany(pids) {
     const out = [];
-    const seen = new Set();
+    const seenPids = new Set();
+    const seenIds = new Set();
     let skipped = 0;
     for (const pid of pids) {
-        if (seen.has(pid))
+        if (seenPids.has(pid))
             continue;
-        seen.add(pid);
+        seenPids.add(pid);
         try {
-            for (const m of drain(pid))
+            for (const m of drain(pid)) {
+                if (seenIds.has(m.id))
+                    continue;
+                seenIds.add(m.id);
                 out.push(m);
+            }
         }
         catch {
             skipped++;
@@ -250,7 +269,7 @@ export function migrateMailbox(fromPid, toPid) {
         const count = raw.split("\n").filter((l) => l.trim().length > 0).length;
         acquireLock(toPid);
         try {
-            appendFileSync(mailboxPath(toPid), block);
+            appendLines(mailboxPath(toPid), block);
         }
         finally {
             releaseLock(toPid);
@@ -382,7 +401,7 @@ function drainFirstMatching(my_pid, matches) {
             }
         }
         else {
-            writeFileSync(mailboxPath(my_pid), surviving.join("\n") + "\n");
+            atomicWrite(mailboxPath(my_pid), surviving.join("\n") + "\n");
         }
         return matchedMsg;
     }
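
The record-boundary healing that `appendLines` describes can be tried in isolation. The following is a minimal sketch, not the package's code: `appendWithBoundary` and `parseJsonl` are hypothetical names, the torn record is simulated by writing a partial line, and the script uses CommonJS `require` so it runs standalone.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper mirroring the healing idea: append `text` to `file`,
// first inserting "\n" if the file is non-empty and does not end with one.
function appendWithBoundary(file, text) {
    let needsNewline = false;
    let fd;
    try {
        const size = fs.statSync(file).size;
        if (size > 0) {
            fd = fs.openSync(file, "r");
            const last = Buffer.alloc(1);
            fs.readSync(fd, last, 0, 1, size - 1); // read only the final byte
            needsNewline = last[0] !== 0x0a;
        }
    } catch (err) {
        if (err.code !== "ENOENT") throw err; // a missing file is fine
    } finally {
        if (fd !== undefined) fs.closeSync(fd);
    }
    fs.appendFileSync(file, needsNewline ? "\n" + text : text);
}

// Hypothetical reader: parse a JSONL file, skipping lines that fail JSON.parse.
function parseJsonl(file) {
    const out = [];
    for (const line of fs.readFileSync(file, "utf8").split("\n")) {
        if (line.trim().length === 0) continue;
        try {
            out.push(JSON.parse(line));
        } catch {
            // torn or glued record: skipped
        }
    }
    return out;
}

const dir = fs.mkdtempSync(path.join(os.tmpdir(), "heal-demo-"));
const glued = path.join(dir, "glued.jsonl");
const healed = path.join(dir, "healed.jsonl");
const torn = '{"id":"a","body":"he'; // simulated torn write, no trailing "\n"
const next = '{"id":"b","body":"hi"}\n';

// Plain append: the new record is glued onto the torn one, so both are lost.
fs.writeFileSync(glued, torn);
fs.appendFileSync(glued, next);

// Healing append: the torn record is still lost, but its neighbor survives.
fs.writeFileSync(healed, torn);
appendWithBoundary(healed, next);

const gluedIds = parseJsonl(glued).map((m) => m.id);
const healedIds = parseJsonl(healed).map((m) => m.id);
console.log(gluedIds, healedIds);
```

With the plain append no record parses; with the healing append only the record written after the tear is recovered.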

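The `message_id` dedupe added to `drainMany` can be illustrated without any files. This is a small sketch under stated assumptions: `mergeUnique` is a hypothetical name, and the in-memory arrays stand in for two sibling mailboxes that both hold the same message after an interrupted migration.

```javascript
// Hypothetical stand-in for draining several mailboxes: every batch is
// consumed in full, but a message id is returned only the first time it is seen.
function mergeUnique(batches) {
    const seenIds = new Set();
    const out = [];
    for (const batch of batches) {
        for (const msg of batch) {
            if (seenIds.has(msg.id)) continue; // duplicate copy, drop it
            seenIds.add(msg.id);
            out.push(msg);
        }
    }
    return out;
}

// "m1" was appended to the destination before the source was truncated.
const source = [{ id: "m1", body: "ping" }, { id: "m2", body: "pong" }];
const dest = [{ id: "m1", body: "ping" }, { id: "m3", body: "done" }];

const merged = mergeUnique([source, dest]);
console.log(merged.map((m) => m.id));
```

Because the ids are unique per message, the set only ever collapses true duplicates; first-seen order is preserved.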
package/dist/received.js CHANGED

@@ -1,7 +1,8 @@
-import { createHash } from "node:crypto";
-import { mkdirSync, readFileSync, rmdirSync, statSync, writeFileSync, } from "node:fs";
+import { createHash, randomBytes } from "node:crypto";
+import { mkdirSync, readFileSync, renameSync, writeFileSync } from "node:fs";
 import { homedir } from "node:os";
 import { join } from "node:path";
+import { acquireDirLock, releaseDirLock } from "./locks.js";
 import { trace } from "./trace.js";
 // The received-message ledger: a durable, per-session index of every inbound
 // envelope, keyed by message_id. It exists because both delivery paths are
@@ -38,12 +39,10 @@ function ledgerPath(sessionId) {
 function lockPath(sessionId) {
     return `${ledgerPath(sessionId)}.lock`;
 }
-// Lock idiom mirrors mailbox.ts (mkdir-based, staleness-cleared). The ledger
-// read-modify-write is small (bounded by receivedMax() lines) so the lock
+// Lock idiom mirrors mailbox.ts (owner-token mkdir lock — see locks.ts). The
+// ledger read-modify-write is small (bounded by receivedMax() lines) so the lock
 // window is short.
 const LOCK_STALE_MS = 30_000;
-const LOCK_RETRY_LIMIT = 50;
-const LOCK_RETRY_DELAY_MS = 10;
 // Bounded retention: keep at most this many of the most-recent inbound messages
 // per session. Read lazily so tests can tune it per-case. Generous by default so
 // a realistic mailbox burst (read_my_messages budgets 50/drain) can't push a
@@ -53,49 +52,27 @@ export function receivedMax() {
     const n = Number(process.env.OXTAIL_RECEIVED_MAX);
     return Number.isFinite(n) && n > 0 ? Math.floor(n) : 1000;
 }
-function sleepSync(ms) {
-    Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
-}
+// Owner tokens for held ledger locks (see mailbox.ts for the rationale).
+const lockTokens = new Map();
 function acquireLock(sessionId) {
     mkdirSync(receivedDir(), { recursive: true, mode: 0o700 });
-    const lock = lockPath(sessionId);
-    for (let i = 0; i < LOCK_RETRY_LIMIT; i++) {
-        try {
-            mkdirSync(lock, { mode: 0o700 });
-            return;
-        }
-        catch (e) {
-            const err = e;
-            if (err.code !== "EEXIST")
-                throw err;
-            try {
-                const st = statSync(lock);
-                if (Date.now() - st.mtimeMs > LOCK_STALE_MS) {
-                    try {
-                        rmdirSync(lock);
-                        trace("received_lock_stale_clear", { session_id: sessionId });
-                    }
-                    catch {
-                        // raced with another clearer; fall through to retry
-                    }
-                    continue;
-                }
-            }
-            catch {
-                // stat may race; just retry
-            }
-            sleepSync(LOCK_RETRY_DELAY_MS);
-        }
-    }
-    throw new Error(`could not acquire received-ledger lock for ${sessionId}`);
+    lockTokens.set(sessionId, acquireDirLock(lockPath(sessionId), LOCK_STALE_MS, "received_lock_stale_clear", {
+        session_id: sessionId,
+    }));
 }
 function releaseLock(sessionId) {
-    try {
-        rmdirSync(lockPath(sessionId));
-    }
-    catch {
-        // ignore ENOENT / not-empty / EPERM
-    }
+    const token = lockTokens.get(sessionId);
+    lockTokens.delete(sessionId);
+    releaseDirLock(lockPath(sessionId), token ?? "");
+}
+// Atomically replace the ledger: write a unique temp file, then renameSync over
+// the target. rename(2) is atomic on POSIX, so a crash/torn write can't leave a
+// half-rewritten ledger that loses older reply handles — unlike a direct
+// writeFileSync, which issues multiple write() syscalls.
+function atomicWrite(path, data) {
+    const tmp = `${path}.tmp.${process.pid}.${randomBytes(6).toString("hex")}`;
+    writeFileSync(tmp, data, { mode: 0o600 });
+    renameSync(tmp, path);
 }
 function readLines(sessionId) {
     try {
@@ -132,9 +109,7 @@ export function recordReceived(receiverSessionId, msg) {
                 kept: max,
             });
         }
-        writeFileSync(ledgerPath(receiverSessionId), pruned.join("\n") + "\n", {
-            mode: 0o600,
-        });
+        atomicWrite(ledgerPath(receiverSessionId), pruned.join("\n") + "\n");
     }
     finally {
         releaseLock(receiverSessionId);

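The write-to-temp-then-rename pattern behind `atomicWrite` can be exercised on its own. The sketch below is illustrative only: `replaceFile` is a hypothetical name, and the temp-file cleanup on failure is an added assumption that the diff above does not include.

```javascript
const crypto = require("node:crypto");
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper: write the new contents to a unique temp file in the
// same directory, then rename it over the target so readers see either the
// old file or the new one, never a partial write.
function replaceFile(target, data) {
    const suffix = crypto.randomBytes(6).toString("hex");
    const tmp = `${target}.tmp.${process.pid}.${suffix}`;
    try {
        fs.writeFileSync(tmp, data, { mode: 0o600 });
        fs.renameSync(tmp, target);
    } catch (err) {
        // Assumption (not in the diff): remove the temp file if anything fails.
        fs.rmSync(tmp, { force: true });
        throw err;
    }
}

const dir = fs.mkdtempSync(path.join(os.tmpdir(), "atomic-demo-"));
const ledger = path.join(dir, "ledger.jsonl");

fs.writeFileSync(ledger, '{"id":"old"}\n');
replaceFile(ledger, '{"id":"new"}\n');

const contents = fs.readFileSync(ledger, "utf8");
const leftovers = fs.readdirSync(dir).filter((f) => f.includes(".tmp."));
console.log(JSON.stringify(contents), leftovers.length);
```

The temp file must live on the same filesystem as the target for the rename to be atomic, which is why it is created next to the target rather than in the system temp directory.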
package/dist/server.js CHANGED

@@ -13,7 +13,7 @@ import { trace } from "./trace.js";
 import { buildEntry, chooseVerifiedWakePane, findByTmuxSession, readAll, refreshTmuxBinding, register, sessionPidsForId, unregister, } from "./registry.js";
 import * as mailbox from "./mailbox.js";
 import * as received from "./received.js";
-import { deliverToPeer } from "./delivery.js";
+import { deliverExistingToPeer, deliverToPeer } from "./delivery.js";
 import { recoverClaim, resolveAncestors, writeClaim } from "./claims.js";
 import { decideReplyAutoWake, defaultAutowakeDir } from "./autowake.js";
 import { markWoke, newWakeDebounceStore, recentlyWoke } from "./wake-debounce.js";
@@ -1728,11 +1728,11 @@ server.registerTool("ask_peer", {
     // Re-enqueue so it's not lost.
     if (aborted && reply) {
         try {
-            mailbox.enqueue(entry.server_pid, reply.body, reply.from_session_id, {
-                request_id: reply.request_id,
-                reply_to: reply.reply_to,
-                source_message_id: reply.source_message_id,
-            });
+            // Re-deliver the EXISTING reply: preserve reply.id and (re)write the
+            // requester's received-ledger entry so reply_to_message against the
+            // displayed id still resolves. mailbox.enqueue would mint a NEW id and
+            // skip the ledger, breaking the reply handle on the abort path.
+            deliverExistingToPeer(entry.client.session_id, entry.server_pid, reply);
             trace("ask_peer_abort_reenqueue", { message_id: reply.id });
         }
         catch (e) {

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "oxtail",
-  "version": "0.13.0",
+  "version": "0.14.1",
   "private": false,
   "type": "module",
   "description": "Coordination layer for parallel AI coding agent sessions, exposed over MCP.",

package/scripts/hook-constants.mjs CHANGED

@@ -23,10 +23,20 @@ export const HOOK_MARKER_KEY = "_oxtailHook";
 //       and drop the redundant single-valued `origin` field. message_id +
 //       from_session_id are still rendered (correlation/debug unaffected); a
 //       stale pre-v5 hook is only larger, never wrong.
+//   v6: owner-token advisory lock (mirror of src/locks.ts) in pretooluse + stop.
+//       The lock dir gains a sidecar `<lock>.owner` token; stale removal is
+//       gated behind a single-winner `<lock>.steal` marker + compare-and-clear,
+//       and release only removes a lock we still own. The sidecar layout keeps
+//       the lock dir EMPTY so a pre-v6 hook's plain `rmdir` still removes a v6
+//       lock — i.e. mixed versions never WEDGE. They are not fully race-safe,
+//       though: a pre-v6 hook does an unconditional stale-rmdir / release-rmdir
+//       with no owner check, so during an upgrade window (before re-install) the
+//       old hook can still lose the stall-resume / double-clear races against a
+//       v6 peer. The version bump forces re-install to close that window.
 // INVARIANT: any change to an assets/*.sh script MUST bump this version, so
 // existing installs are forced to re-install. scripts/check-hook-version.mjs
 // enforces this in CI.
-export const HOOK_MARKER_VERSION = 5;
+export const HOOK_MARKER_VERSION = 6;
 const HOOKS_DIR = path.join(os.homedir(), ".oxtail", "hooks");