talking-stick 0.1.0-alpha

@@ -0,0 +1,1156 @@
1
+ # Talking Stick MCP Coordination Plan
2
+
3
+ ## Purpose
4
+
5
+ Talking Stick is an MCP server that lets multiple agent harnesses coordinate work in a shared workspace without accidentally performing parallel work and without re-deriving context on every turn.
6
+
7
+ The core metaphor is simple:
8
+
9
+ - A workspace maps to a coordination room.
10
+ - Agents join the room by operating in a path that resolves to it.
11
+ - Exactly one agent may hold the talking stick for that room at a time.
12
+ - The holder may work, release the stick to the next agent in sequence, or explicitly pass it to a specific agent.
13
+ - Passing or releasing the stick requires a structured handoff. The handoff carries what was done, what remains, and where to look, so the next agent does not have to rediscover context.
14
+ - Normal release follows the member sequence. Explicit pass and timeout takeover are deliberate escape hatches.
15
+ - If the expected agent fails to respond, another active member may take over after a timeout. Timeout opens takeover eligibility; it does not revoke the expected agent until a takeover actually commits.
16
+
17
+ The goal is not a general chat system. The goal is a small, fault-tolerant coordination primitive that also serves as shared working memory for planning, code review, task handoff, and multi-agent turn-taking.
18
+
19
+ ## Design Goals
20
+
21
+ - Resolve coordination scope to a workspace root, matching how developers already think about workspaces (git, `CLAUDE.md`, `package.json`).
22
+ - Keep the MVP to one default room per workspace path; multiple simultaneous topics can be added later if real workflows need them.
23
+ - Make the handoff between agents the primary state transfer, not an afterthought.
24
+ - Provide predictable ordered turn-taking without making fairness a hard concurrency invariant.
25
+ - Work safely when multiple MCP server processes run concurrently (multiple terminal tabs, split views, parallel sessions).
26
+ - Store state in a predictable per-user data directory under `~/.local/share` on Linux and macOS, with an override for tests and project isolation.
27
+ - Recover cleanly when an agent crashes, times out, or stops polling.
28
+ - Keep the MCP surface small enough that harnesses can follow it reliably.
29
+ - Reject stale writes by checking fencing tokens and turn numbers on every owner mutation.
30
+ - Prefer explicit state transitions over hidden automatic behavior.
31
+
32
+ ## Non-Goals
33
+
34
+ - Not a replacement for git, issue trackers, or durable project documentation.
35
+ - Not a general-purpose pub/sub bus.
36
+ - Does not merge simultaneous edits from multiple agents.
37
+ - Does not guarantee that an agent follows instructions outside the protocol. It only makes protocol-compliant coordination safe.
38
+ - Does not coordinate across hosts. Single-host only in the MVP.
39
+
40
+ ## Core Concepts
41
+
42
+ ### Path Room and Workspace Resolution
43
+
44
+ A path room is the coordination scope for a workspace. In the MVP, room identity is the canonical room path, normally the workspace root.
45
+
46
+ Room resolution has three steps:
47
+
48
+ 1. Resolve the request path to a preferred workspace root.
49
+ 2. Search from the request path up to that workspace root for the deepest existing room.
50
+ 3. If no room exists on that path, create a room at the preferred workspace root.
51
+
52
+ This avoids the common monorepo failure mode where one agent starts in `/repo/packages/foo/src/` and another starts in `/repo/packages/bar/`, creating separate rooms even though both are working in the same repo. If `/repo/` is the git worktree root, both agents resolve to `/repo/`. It also preserves explicit nested rooms: if an operator creates a room at `/repo/packages/foo/`, agents below that path join the nested room instead of the repo root.
53
+
54
+ Preferred workspace root resolution:
55
+
56
+ 1. If the request path is inside a git worktree, use the git top-level path.
57
+ 2. Otherwise, use the nearest ancestor containing a recognized workspace marker such as `CLAUDE.md`, `AGENTS.md`, `package.json`, `pyproject.toml`, `Cargo.toml`, or `go.mod`.
58
+ 3. Otherwise, use the canonical request path.
59
+
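The three root-resolution rules above can be sketched as a pure function. This is a minimal illustration, not the package's implementation: `WorkspaceProbe`, `parentDir`, and `resolvePreferredRoot` are hypothetical names, the probe stands in for real git and filesystem calls, paths are assumed to be canonical POSIX-style absolute paths, and the marker search is assumed to include the request path itself.

```typescript
// Hypothetical probe; a real server would back these with git and filesystem calls.
interface WorkspaceProbe {
  gitTopLevel(dir: string): string | null; // git worktree root containing dir, if any
  hasFile(dir: string, name: string): boolean;
}

const WORKSPACE_MARKERS = [
  "CLAUDE.md", "AGENTS.md", "package.json", "pyproject.toml", "Cargo.toml", "go.mod",
];

// Parent of a POSIX-style absolute path; "/" is its own parent.
function parentDir(dir: string): string {
  const idx = dir.lastIndexOf("/");
  return idx <= 0 ? "/" : dir.slice(0, idx);
}

function resolvePreferredRoot(canonicalPath: string, probe: WorkspaceProbe): string {
  // Rule 1: inside a git worktree, the git top-level path wins.
  const gitRoot = probe.gitTopLevel(canonicalPath);
  if (gitRoot !== null) return gitRoot;

  // Rule 2: nearest directory (the path itself or an ancestor) holding a marker file.
  let dir = canonicalPath;
  while (true) {
    if (WORKSPACE_MARKERS.some((marker) => probe.hasFile(dir, marker))) return dir;
    const parent = parentDir(dir);
    if (parent === dir) break; // reached the filesystem root
    dir = parent;
  }

  // Rule 3: fall back to the canonical request path.
  return canonicalPath;
}
```

Injecting the probe keeps the rule ordering testable without touching a real filesystem.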
60
+ Canonicalization applied before room lookup:
61
+
62
+ - If `context_path` points to a file, use its parent directory.
63
+ - Resolve symlinks.
64
+ - Normalize path separators.
65
+ - Normalize casing on case-insensitive filesystems.
66
+
67
+ **Nesting and conflict.** Creating a nested room inside an existing room requires explicit opt-in via `force_new = true`. The default behavior is to join the ancestor room. Because talking-stick coordination is operator-initiated, this default is safe: operators know when they are starting a nested conversation and can request one explicitly.
68
+
69
+ ### Deferred Extension: Topics
70
+
71
+ The MVP intentionally has one default room per workspace path.
72
+
73
+ Optional topics may be added later if unrelated efforts in the same workspace need independent coordination, such as a `review` room and a `triage` room at the same repo root. Deferring topics keeps the initial protocol aligned with the simple "path chat" model and avoids making every tool carry an extra discriminator before the need is proven.
74
+
75
+ ### Agent Identity
76
+
77
+ `agent_id` is derived by the MCP adapter at connection time, not supplied by the harness. Harnesses should not set or guess their own identity; the server knows more about which process is calling than the harness does about itself.
78
+
79
+ Derivation signals, in order of preference:
80
+
81
+ 1. `clientInfo.name` and `clientInfo.version` from the MCP `initialize` handshake. Every MCP client sends these; Claude Code, Codex, and Gemini CLI all set distinctive values.
82
+ 2. The MCP server's own parent process identity: `(parent_pid, parent_start_time)`. Together these uniquely identify the harness instance on a host. `parent_pid` alone is unsafe because PIDs are reused after exit.
83
+ 3. Environment variables the harness exports, such as `CLAUDECODE`, `CLAUDE_CODE_ENTRYPOINT`, `TERM_PROGRAM`, `ITERM_SESSION_ID`, `TMUX`, `SSH_TTY`.
84
+
85
+ Composed identity, stable for the life of one harness instance:
86
+
87
+ ```text
88
+ <harness-slug>:<short-hash>
89
+
90
+ e.g.
91
+ claude-code:a3f1
92
+ codex:9b22
93
+ gemini:1c4e
94
+ ```
95
+
96
+ The hash is a short digest over the signals above, so reconnects from the same harness instance land on the same `agent_id`. Distinct tabs, splits, or parallel spawns of the same harness get distinct hashes because their `parent_pid`/`parent_start_time` differ.
97
+
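As a rough sketch of the composition described above, the identity can be built from a slug plus a short digest. The plan does not name a digest algorithm; the FNV-1a hash, the four-character truncation, and the names `IdentitySignals`, `slugify`, and `deriveAgentId` are assumptions made only for illustration.

```typescript
// Hypothetical signal bundle; field names are illustrative.
interface IdentitySignals {
  clientName: string;      // clientInfo.name from the MCP initialize handshake
  clientVersion: string;   // clientInfo.version
  parentPid: number;
  parentStartTime: string; // opaque start-time string for the parent process
}

// FNV-1a (32-bit) over UTF-16 code units. Assumption: any stable digest would do.
function fnv1a32(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Lowercase the client name and collapse non-alphanumeric runs into "-".
function slugify(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
}

function deriveAgentId(signals: IdentitySignals): string {
  const material = [
    signals.clientName,
    signals.clientVersion,
    String(signals.parentPid),
    signals.parentStartTime,
  ].join("|");
  const hex = ("0000000" + fnv1a32(material).toString(16)).slice(-8);
  // Same inputs always give the same id, so reconnects land on the same agent_id.
  return `${slugify(signals.clientName)}:${hex.slice(0, 4)}`;
}
```

A four-character hash can collide, so a real implementation would likely need a longer digest or a collision check against existing room members.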
98
+ For the Human CLI (see deferred extension), the same idea applies with different signals: `$USER`, the parent shell's `(pid, start_time)`, and the tty. Together these yield identities like:
99
+
100
+ ```text
101
+ human:wojtek:s003
102
+ ```
103
+
104
+ The derived string is the protocol-facing identity, but the server must also persist the source liveness facts behind it:
105
+
106
+ - `host_id`
107
+ - `pid`
108
+ - `process_started_at`
109
+ - `session_kind` (`mcp_harness`, `human_guardian`, later others)
110
+ - optional display metadata such as tty or client name
111
+
112
+ The digest alone is not enough for liveness decisions. If the system is going to say "that owner is really gone," it must be able to check whether the exact spawning process identified by `(host_id, pid, process_started_at)` still exists.
113
+
114
+ For the Human CLI, this implies a split between one-shot commands and holders:
115
+
116
+ - one-shot commands like `list`, `join`, `state`, and `events` are ordinary short-lived processes and do not own the room,
117
+ - indefinite human ownership uses an attached hold mode or a lightweight local guardian process, so the owner is still represented by one live process that can be checked and can exit cleanly.
118
+
119
+ The `join_path` response returns the assigned `agent_id` to the harness so it appears in logs, downstream handoffs, and event records. It also returns the effective policy for that room, including the server's expected heartbeat cadence, so harnesses and human-holder helpers can renew leases without guessing. An optional `agent_id_override` is accepted for tests and debugging and is flagged in the event stream.
120
+
121
+ `join_path` is the only MCP tool that accepts `agent_id_override`. If `join_path` receives an override, that override becomes the connection's derived identity for subsequent calls on that same connection until disconnect. Otherwise, for every owner mutation the adapter injects the derived identity from the connection context, and the service layer continues to use `agent_id` internally for fencing and membership checks.
122
+
123
+ This keeps the protocol surface simple: `agent_id` is the session-scoped identity. The MVP still does not need a separate global participant/session abstraction, but it does need to persist process metadata alongside room membership so it can make exact local liveness checks.
124
+
125
+ ### Membership Sequence
126
+
127
+ When an agent joins a room, the server appends it to the room's ordered member list if not already present. This order defines the default turn sequence.
128
+
129
+ ```text
130
+ A joins
131
+ B joins
132
+ C joins
133
+
134
+ Default sequence: A -> B -> C -> A
135
+ ```
136
+
137
+ An owner can follow this sequence by releasing the stick. An owner can skip the sequence by explicitly passing to a chosen agent.
138
+
139
+ ### Ownership
140
+
141
+ At most one agent owns the stick for a room at any time.
142
+
143
+ Only the owner may perform owner actions:
144
+
145
+ - `heartbeat`
146
+ - `release_stick`
147
+ - `pass_stick`
148
+ - `close_room` if that optional later tool is implemented
149
+
150
+ Every owner action carries:
151
+
152
+ - `room_id`
153
+ - `lease_id`
154
+ - `expected_turn_id`
155
+
156
+ `agent_id` is derived by the MCP adapter from the connection rather than sent by the caller. The service layer still uses `(agent_id, lease_id, turn_id)` together for fencing; the caller just does not get to name themselves. Old actions from stale agents must be rejected.
157
+
158
+ ### Turn ID Semantics
159
+
160
+ `turn_id` identifies the current ownership epoch. It increments only when an agent is granted ownership:
161
+
162
+ - an idle room is claimed,
163
+ - a reserved recipient claims the stick,
164
+ - a timeout takeover succeeds.
165
+
166
+ `release_stick` and `pass_stick` end the current ownership epoch and invalidate the current lease, but they do not grant ownership to the next agent by themselves. They create a pending reservation that the next eligible agent must claim.
167
+
168
+ ### Default Turn Order
169
+
170
+ The room maintains an ordered member list and a `sequence_index`. Normal release reserves the stick for the next active member after the current owner.
171
+
172
+ This gives the common case a predictable round-robin shape:
173
+
174
+ ```text
175
+ A releases -> B gets right of first refusal
175
+ B releases -> C gets right of first refusal
176
+ C releases -> A gets right of first refusal
178
+ ```
179
+
180
+ The sequence is not a hard fairness lock in the MVP:
181
+
182
+ - An owner may explicitly pass to any active member.
183
+ - If a reserved recipient misses `claim_ttl`, another active member may take over, but the immediately prior owner should not be the takeover winner while any other active member can take it.
184
+ - If an owner misses `owner_lease_ttl`, another active member may take over.
185
+
186
+ This is intentionally simpler than strict round fairness. The protocol should prevent accidental parallel ownership first; social fairness can be added as a configurable policy once real usage shows which workflows need it.
187
+
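The round-robin selection described above might look like the following sketch. The `Member` shape and the `nextActiveMember` name are hypothetical; the only behavior taken from the plan is "the next active member after the current owner, in join order, wrapping around."

```typescript
// Hypothetical member view; `active` would be derived from last_seen_at and presence_ttl.
interface Member {
  agentId: string;
  ordinal: number; // join order
  active: boolean;
}

// Returns the next active member after the owner in join order, wrapping around.
// Returns null when no other active member exists (the room would go idle).
function nextActiveMember(members: Member[], ownerId: string): Member | null {
  const ordered = [...members].sort((a, b) => a.ordinal - b.ordinal);
  const start = ordered.findIndex((m) => m.agentId === ownerId);
  if (start === -1) return null;
  // Visit every other member exactly once, starting just after the owner.
  for (let step = 1; step < ordered.length; step++) {
    const candidate = ordered[(start + step) % ordered.length];
    if (candidate !== undefined && candidate.active) return candidate;
  }
  return null;
}
```

Inactive members are skipped rather than removed, so they resume their place in the sequence once they are seen again.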
188
+ ### Handoff Artifact
189
+
190
+ A handoff is the structured payload produced when an agent releases or passes the stick. It is the protocol's primary state-transfer mechanism.
191
+
192
+ ```ts
193
+ interface Handoff {
194
+ // Required. Ensures state transfer is never empty.
195
+ status: string; // what I did, what I learned, what is unfinished
196
+ next_action: string; // what the recipient should do
197
+
198
+ // Optional. Reduces context re-derivation for the recipient.
199
+ artifacts?: Array<{
200
+ path: string; // absolute or workspace-relative
201
+ lines?: [number, number]; // inclusive line range
202
+ role: "examine" | "review" | "edit" | "context" | "output";
203
+ note?: string;
204
+ }>;
205
+
206
+ // Optional.
207
+ open_questions?: string[];
208
+ do_not?: string[]; // off-limits for the next agent
209
+ }
210
+ ```
211
+
212
+ The server validates that `status` and `next_action` are non-empty before accepting a `release_stick` or `pass_stick`. This makes an empty handoff mechanically impossible, although the server cannot verify that the described work was actually done.
213
+
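A minimal sketch of that validation follows, assuming whitespace-only strings count as empty (the plan only says "non-empty"). The `validateHandoff` name is hypothetical; the error shape mirrors the `invalid_handoff` example used elsewhere in this plan.

```typescript
interface Artifact {
  path: string;
  lines?: [number, number];
  role: "examine" | "review" | "edit" | "context" | "output";
  note?: string;
}

interface Handoff {
  status: string;
  next_action: string;
  artifacts?: Artifact[];
  open_questions?: string[];
  do_not?: string[];
}

interface InvalidHandoffError {
  error: "invalid_handoff";
  message: string;
  field: "status" | "next_action";
}

// Returns null when the handoff is acceptable, otherwise a structured error.
// Assumption: whitespace-only values are treated as empty.
function validateHandoff(handoff: Handoff): InvalidHandoffError | null {
  for (const field of ["status", "next_action"] as const) {
    if (handoff[field].trim().length === 0) {
      return {
        error: "invalid_handoff",
        message: `handoff.${field} must be non-empty`,
        field,
      };
    }
  }
  return null;
}
```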
214
+ `artifacts[]` entries give the recipient direct pointers — path plus optional line range — so a new owner can load exactly the relevant code without re-exploring the workspace. This is the primary mechanism the protocol uses to reduce prompt-context churn across agents.
215
+
216
+ The handoff is stored verbatim in the event log and returned to the recipient by `wait_for_turn` when they claim the stick.
217
+
218
+ ## Room State
219
+
220
+ ```ts
221
+ type RoomState =
222
+ | "idle"
223
+ | "owned"
224
+ | "reserved"
225
+ | "stale_owner"
226
+ | "owner_gone"
227
+ | "recipient_gone"
228
+ | "dormant"
229
+ | "closed";
230
+
231
+ interface PathRoom {
232
+ room_id: string; // server-generated
233
+ canonical_path: string;
234
+
235
+ members: AgentId[];
236
+ sequence_index: number;
237
+
238
+ owner: AgentId | null;
239
+ reserved_for: AgentId | null;
240
+ pending_handoff_event_seq: number | null;
241
+
242
+ turn_id: number;
243
+ lease_id: string | null;
244
+ lease_expires_at: string | null;
245
+ claim_expires_at: string | null;
246
+
247
+ state: RoomState;
248
+ updated_at: string;
249
+ }
250
+
251
+ interface RoomMember {
252
+ agent_id: AgentId;
253
+ ordinal: number;
254
+ joined_at: string;
255
+ last_seen_at: string;
256
+ status: "active" | "inactive"; // derived from last_seen_at and presence_ttl
257
+ }
258
+ ```
259
+
260
+ State meanings:
261
+
262
+ - `idle`: no current owner and no specific reserved recipient. It may still have a pending handoff from the previous release.
263
+ - `owned`: one agent has a live lease and may work.
264
+ - `reserved`: the stick has been released or passed to a specific agent, which has a limited time to claim it.
265
+ - `stale_owner`: derived state indicating the owner missed its lease heartbeat and takeover is available. The owner is not revoked until a takeover commits.
266
+ - `owner_gone`: derived state indicating the exact owning process is known to have exited. Takeover is immediately available; no lease timeout wait is required.
267
+ - `recipient_gone`: derived state indicating the reserved recipient's exact process is known to have exited. Takeover is immediately available; no claim timeout wait is required.
268
+ - `dormant`: derived state indicating no member is currently live or recently present, but the room was not explicitly closed. In implementation, takeover-relevant states such as `owner_gone`, `recipient_gone`, and `stale_owner` take precedence over `dormant`; the room should report the most actionable recovery state first.
269
+ - `closed`: no further turns are expected.
270
+
271
+ An active member is one whose `last_seen_at` is within `presence_ttl` and whose exact spawning process is still alive when liveness metadata is available. Death beats timeout for the current owner and reserved recipient: if the server can prove either exact process is gone, recovery opens immediately rather than waiting for the lease or claim timeout to expire.
272
+
273
+ As with lease expiry, activity can be derived lazily on reads and writes rather than maintained by a background process. Process liveness checks must use `pid + process_started_at`; `kill(pid, 0)` or pid-only lookup is not sufficient because PIDs are reused. To keep write transactions short, the MVP uses exact process checks for owner/reserved recovery and recent presence for broader sequence scans; timeout recovery remains the fallback for a stale sequence target.
274
+
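The exact-process rule can be illustrated with a small check that compares both the pid and the recorded start time. This is a sketch under stated assumptions: `StartTimeLookup` is a hypothetical stand-in for an OS-specific process query, and the `"unknown"` result for a different host is an illustrative choice, not something the plan specifies.

```typescript
interface ProcessIdentity {
  hostId: string;
  pid: number;
  processStartedAt: string;
}

type Liveness = "alive" | "gone" | "unknown";

// Hypothetical lookup: start time of whatever process currently holds this pid,
// or null when no such process exists.
type StartTimeLookup = (pid: number) => string | null;

function checkLiveness(
  recorded: ProcessIdentity,
  localHostId: string,
  lookup: StartTimeLookup,
): Liveness {
  // Processes on another host cannot be probed, so no death claim is made.
  if (recorded.hostId !== localHostId) return "unknown";
  const currentStart = lookup(recorded.pid);
  if (currentStart === null) return "gone";
  // Same pid but a different start time means the pid was reused by a new process.
  return currentStart === recorded.processStartedAt ? "alive" : "gone";
}
```

Only a `"gone"` result would justify opening immediate takeover; `"unknown"` falls back to the timeout path.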
275
+ ### Room Termination vs Dormancy
276
+
277
+ The protocol model reserves a `closed` state for an optional later `close_room` tool. The MVP implementation does not provide that tool and therefore never enters `closed`; rooms remain resumable even when they become dormant. This still matters because "no live processes currently point at this room" is a common state during normal work (an operator steps away, or all harnesses exit between turns), and it must not be confused with "this conversation is over."
278
+
279
+ The MVP therefore distinguishes three situations, not two:
280
+
281
+ - **Active:** at least one member has recent presence within `presence_ttl`. Normal operation.
282
+ - **Dormant:** no member has recent presence or a currently live process, and no optional later close mechanism has been invoked. The room persists, its event log stays readable, and any member returning later can resume. Dormancy is a derived condition, not a stored state; `get_room_state` is the authoritative projection and may use persisted process metadata when available, while `list_rooms` may use a cheaper summary projection based on room ownership fields and recent presence so it does not need to probe every room holder.
283
+ - **Closed:** reserved for a future `close_room` extension. If that tool is added later, the room becomes terminal and no further owner mutations are accepted. The event log remains for inspection.
284
+
285
+ Retention policy for long-dormant rooms (archive, prune, purge after N days with no activity) is out of scope for MVP and is expected to be a separate administrative concern, not a protocol state transition. This prevents surprise deletions and keeps the protocol's responsibilities narrow.
286
+
287
+ ## Default Lifecycle
288
+
289
+ ### Discover
290
+
291
+ An agent enumerates rooms reachable from a path:
292
+
293
+ ```ts
294
+ list_rooms({ context_path? }) -> Room[]
295
+ ```
296
+
297
+ Rooms are returned for the ancestor chain from `context_path` to the resolved workspace root, keyed by `room_id` and annotated with `canonical_path` and current state. This lets a harness show "here is what is happening in this workspace" in one call.
298
+
299
+ ### Join
300
+
301
+ ```ts
302
+ join_path({
303
+ context_path,
304
+ force_new?, // defaults to false
305
+ agent_id_override? // optional; tests and debugging only
306
+ })
307
+ ```
308
+
309
+ `agent_id` is not an input. It is derived server-side from the MCP connection and returned in the response so the harness can surface it in logs.
310
+
311
+ Resolution:
312
+
313
+ 1. Canonicalize `context_path`.
314
+ 2. Resolve the preferred workspace root.
315
+ 3. Walk up from the canonical `context_path` to the preferred workspace root looking for an existing room.
316
+ 4. If found and `force_new = false`: join the deepest existing ancestor room.
317
+ 5. If found and `force_new = true`: create a nested room at the canonical `context_path`, returning a warning that an ancestor room exists. If a room already exists at that exact path, join it.
318
+ 6. If not found: create a new room at the preferred workspace root.
319
+
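The join decision in steps 3 to 6 could be sketched as follows. `decideJoin`, `JoinDecision`, and the in-memory set of room paths are hypothetical; a real server would query its room store inside a transaction. Paths are assumed to be canonical POSIX-style absolute paths.

```typescript
type JoinDecision =
  | { action: "join"; path: string }
  | { action: "create"; path: string; warning?: string };

// Parent of a POSIX-style absolute path; "/" is its own parent.
function parentDir(dir: string): string {
  const idx = dir.lastIndexOf("/");
  return idx <= 0 ? "/" : dir.slice(0, idx);
}

function decideJoin(
  canonicalPath: string,
  preferredRoot: string,
  existingRooms: ReadonlySet<string>,
  forceNew: boolean,
): JoinDecision {
  // Step 3: walk up from the request path; the first hit is the deepest room.
  let deepest: string | null = null;
  let dir = canonicalPath;
  while (true) {
    if (existingRooms.has(dir)) {
      deepest = dir;
      break;
    }
    if (dir === preferredRoot || dir === "/") break;
    dir = parentDir(dir);
  }

  // Step 6: nothing on the chain, so create at the preferred workspace root.
  if (deepest === null) return { action: "create", path: preferredRoot };

  // Step 4, plus the exact-path case of step 5: join the existing room.
  if (!forceNew || deepest === canonicalPath) return { action: "join", path: deepest };

  // Step 5: explicit nested room, with a warning about the ancestor.
  return {
    action: "create",
    path: canonicalPath,
    warning: `ancestor room exists at ${deepest}`,
  };
}
```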
320
+ The response includes the resolved `room_id`, the `canonical_path` the agent actually joined (which may differ from the request path when workspace root resolution or ancestor lookup redirected the call), the effective room policy (including `heartbeat_interval_ms`), and a `handoff_template` hint describing the expected handoff shape. For the MVP this template is static server-wide; room-specific prompting can be added later if real workflows need it.
321
+
322
+ Effects:
323
+
324
+ - Adds `agent_id` to the ordered member list if absent.
325
+ - Updates the agent presence timestamp.
326
+ - Returns the current room state.
327
+
328
+ ### Wait
329
+
330
+ ```ts
331
+ wait_for_turn({
332
+ room_id,
333
+ cursor?,
334
+ max_wait_ms?
335
+ })
336
+ ```
337
+
338
+ Possible results:
339
+
340
+ ```ts
341
+ type WaitForTurnResult =
342
+ | {
343
+ status: "your_turn";
344
+ room_id: string;
345
+ turn_id: number;
346
+ lease_id: string;
347
+ handoff: Handoff | null; // null only for the first open claim in a fresh room
348
+ from_agent_id: AgentId | null;
349
+ reason: "direct_pass" | "sequence" | "open_claim";
350
+ }
351
+ | {
352
+ status: "not_yet";
353
+ cursor: string;
354
+ room_state: RoomState;
355
+ }
356
+ | {
357
+ status: "takeover_available";
358
+ room_id: string;
359
+ turn_id: number;
360
+ room_state:
361
+ | "owned"
362
+ | "reserved"
363
+ | "stale_owner"
364
+ | "owner_gone"
365
+ | "recipient_gone";
366
+ reason:
367
+ | "claim_timeout"
368
+ | "owner_timeout"
369
+ | "owner_gone"
370
+ | "recipient_gone";
371
+ current_owner?: AgentId;
372
+ reserved_for?: AgentId;
373
+ }
374
+ | {
375
+ status: "closed";
376
+ room_id: string;
377
+ };
378
+ ```
379
+
380
+ `wait_for_turn` may claim the stick when the caller is directly eligible:
381
+
382
+ - If the room is `idle`, any active member may claim.
383
+ - If the room is `reserved`, `reserved_for` may claim as long as no takeover has committed, even after `claim_expires_at`.
384
+
385
+ Each `wait_for_turn` call updates the caller's `last_seen_at`, so polling agents remain active.
386
+
387
+ As part of each read/write operation, the server may also refresh derived liveness for the current owner and reserved recipient. If the exact spawning process for either is proven absent, the room moves to `owner_gone` or `recipient_gone` as a derived condition and `takeover_available` is returned immediately to other eligible members.
388
+
389
+ `wait_for_turn` does not perform takeover for a non-reserved caller. If a timeout or proven process death has made takeover possible, it returns `takeover_available`; the caller must then invoke `takeover_stick` with an explicit reason.
390
+
391
+ `max_wait_ms = 0` is a valid non-blocking call. It still performs one atomic read/claim attempt before returning `your_turn`, `takeover_available`, `closed`, or `not_yet`.
392
+
393
+ When a claim succeeds, the server atomically:
394
+
395
+ - increments `turn_id`,
396
+ - issues a new `lease_id`,
397
+ - sets `owner = agent_id`,
398
+ - clears `reserved_for`,
399
+ - clears `pending_handoff_event_seq`,
400
+ - sets `lease_expires_at`,
401
+ - appends a claim event,
402
+ - returns `your_turn` with the prior handoff attached.
403
+
404
+ ### Work
405
+
406
+ While holding the stick, an agent should call:
407
+
408
+ ```ts
409
+ heartbeat({ room_id, lease_id, expected_turn_id })
410
+ ```
411
+
412
+ The heartbeat extends the owner lease and updates the owner's `last_seen_at`. A `stale_lease` response means another agent has taken over or otherwise invalidated the lease; the caller must stop acting as owner and re-read the room state.
413
+
414
+ Lease expiry opens takeover eligibility. It does not invalidate the owner's lease by itself. If an expired owner heartbeats before another agent successfully takes over, the heartbeat may renew the lease.
415
+
416
+ By contrast, exact process death is definitive. If the server can prove that the owning process is gone, the owner may not renew; takeover is immediately available to other eligible members.
417
+
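The renewal rule can be sketched as a fencing check that deliberately ignores the expiry timestamp. The `Room` shape, the millisecond clock, and the function signature are assumptions for illustration; the process-death check is omitted here, and a real server would run this inside a single transaction.

```typescript
interface Room {
  owner: string | null;
  leaseId: string | null;
  turnId: number;
  leaseExpiresAt: number | null; // epoch milliseconds in this sketch
}

interface HeartbeatInput {
  agentId: string; // injected by the adapter, not sent by the caller
  leaseId: string;
  expectedTurnId: number;
}

type HeartbeatResult =
  | { ok: true; leaseExpiresAt: number }
  | { ok: false; error: "stale_lease"; currentOwner: string | null; currentTurnId: number };

function heartbeat(room: Room, input: HeartbeatInput, now: number, leaseTtlMs: number): HeartbeatResult {
  // Fencing: all three values must match the room's current ownership epoch.
  const isCurrent =
    room.owner === input.agentId &&
    room.leaseId === input.leaseId &&
    room.turnId === input.expectedTurnId;
  if (!isCurrent) {
    return { ok: false, error: "stale_lease", currentOwner: room.owner, currentTurnId: room.turnId };
  }
  // leaseExpiresAt is intentionally not checked: expiry only opens takeover
  // eligibility, so an expired but unrevoked owner may still renew.
  const expiresAt = now + leaseTtlMs;
  room.leaseExpiresAt = expiresAt;
  return { ok: true, leaseExpiresAt: expiresAt };
}
```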
418
+ ### Release
419
+
420
+ The owner may release the stick without naming a recipient:
421
+
422
+ ```ts
423
+ release_stick({
424
+ room_id,
425
+ lease_id,
426
+ expected_turn_id,
427
+ handoff
428
+ })
429
+ ```
430
+
431
+ Server validates:
432
+
433
+ - `lease_id` and `expected_turn_id` are current.
434
+ - `handoff.status` and `handoff.next_action` are non-empty.
435
+
436
+ Effects on success:
437
+
438
+ - Appends a release event containing the full `handoff`.
439
+ - Stores that event's `event_seq` as `pending_handoff_event_seq`.
440
+ - Updates the releasing owner's `last_seen_at`.
441
+ - Clears current owner and invalidates the current lease.
442
+ - Advances `sequence_index` to the next active member after the releasing owner.
443
+ - Sets `reserved_for` to the member found above, if one exists.
444
+ - Sets `claim_expires_at` when a recipient is reserved, otherwise clears it.
445
+ - Changes state to `reserved`, or `idle` if no other active member exists.
446
+
447
+ ### Explicit Pass
448
+
449
+ ```ts
450
+ pass_stick({
451
+ room_id,
452
+ lease_id,
453
+ expected_turn_id,
454
+ to_agent_id,
455
+ handoff
456
+ })
457
+ ```
458
+
459
+ Same handoff validation as `release_stick`. In the MVP, `to_agent_id` must already be an active member of the room. Passing to non-members is deferred until there is an explicit invite or discovery story.
460
+
461
+ Effects:
462
+
463
+ - Appends a pass event containing the full `handoff`.
464
+ - Stores that event's `event_seq` as `pending_handoff_event_seq`.
465
+ - Updates the passing owner's `last_seen_at`.
466
+ - Clears current owner and invalidates the current lease.
467
+ - Sets `reserved_for = to_agent_id`.
468
+ - Sets `sequence_index` to the target agent's ordinal, so the default sequence resumes from the passed-to agent after they release.
469
+ - Sets `claim_expires_at`.
470
+ - Changes state to `reserved`.
471
+
472
+ If the target misses the claim timeout, another active member may take over. The immediately prior owner should not be the takeover winner while any other active member can take it.
473
+
474
+ ### Takeover
475
+
476
+ Another active member may take over when the expected owner or reserved recipient has failed to respond:
477
+
478
+ ```ts
479
+ takeover_stick({
480
+ room_id,
481
+ expected_turn_id,
482
+ reason
483
+ })
484
+ ```
485
+
486
+ The takeover call itself refreshes the caller's presence before eligibility is checked. Other members' activity is still evaluated from their existing `last_seen_at` values. This lets a returning active harness recover a timed-out room with one explicit operation, while the prior-owner guard still depends on whether some other member has been active recently.
487
+
488
+ Allowed when:
489
+
490
+ - room is `reserved` and `claim_expires_at` has passed, or
491
+ - room is `owned` and `lease_expires_at` has passed, or
492
+ - room is `reserved` and the reserved recipient's exact process is known gone, or
493
+ - room is `owned` and the owner's exact process is known gone, or
494
+ - room is `stale_owner`.
495
+
496
+ In timeout cases, an active member other than the current owner or reserved recipient may attempt takeover. The previous owner or reserved recipient is not revoked merely because a timeout elapsed; they are revoked only if another agent's `takeover_stick` transaction commits first.
497
+
498
+ In process-gone cases, the server has positive evidence that the exact spawning process has exited. The dead owner or dead reserved recipient is therefore immediately ineligible to reclaim, heartbeat, release, or pass. Ownership still transfers only by explicit `takeover_stick`; the server never auto-promotes another member in the background.
499
+
500
+ For claim timeouts, there is one additional anti-monopoly guard: the immediately prior owner, identified by the pending handoff event's `from_agent_id`, is not eligible to take over while any other active member is eligible. If no other active member is available, the prior owner may take over rather than deadlocking the room. This preserves the important "do not immediately grab the stick back" behavior without adding full round-fairness state.
501
+
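The eligibility rules and the prior-owner guard can be sketched as a pure predicate. All names here (`RoomView`, `takeoverReason`, `canTakeOver`) are hypothetical, the process-gone flags are assumed to be computed elsewhere, and the caller's presence is assumed to have been refreshed already, as the takeover call does before checking eligibility.

```typescript
type TakeoverReason = "claim_timeout" | "owner_timeout" | "owner_gone" | "recipient_gone";

interface Member {
  agentId: string;
  active: boolean;
}

interface RoomView {
  owner: string | null;
  reservedFor: string | null;
  priorOwner: string | null; // from_agent_id of the pending handoff event
  leaseExpiresAt: number | null;
  claimExpiresAt: number | null;
  ownerProcessGone: boolean;
  recipientProcessGone: boolean;
  members: Member[];
}

// Why takeover is currently possible, or null when it is not.
function takeoverReason(room: RoomView, now: number): TakeoverReason | null {
  if (room.owner !== null) {
    if (room.ownerProcessGone) return "owner_gone";
    if (room.leaseExpiresAt !== null && now > room.leaseExpiresAt) return "owner_timeout";
    return null;
  }
  if (room.reservedFor !== null) {
    if (room.recipientProcessGone) return "recipient_gone";
    if (room.claimExpiresAt !== null && now > room.claimExpiresAt) return "claim_timeout";
  }
  return null;
}

function canTakeOver(room: RoomView, callerId: string, now: number): boolean {
  const reason = takeoverReason(room, now);
  if (reason === null) return false;
  const caller = room.members.find((m) => m.agentId === callerId);
  if (caller === undefined || !caller.active) return false;
  // The current owner or reserved recipient does not take over from itself.
  if (callerId === room.owner || callerId === room.reservedFor) return false;
  // Anti-monopoly guard: on claim timeout the prior owner yields to any other candidate.
  if (reason === "claim_timeout" && callerId === room.priorOwner) {
    const otherCandidate = room.members.some(
      (m) => m.active && m.agentId !== callerId && m.agentId !== room.reservedFor,
    );
    return !otherCandidate;
  }
  return true;
}
```

When no other active candidate exists, the prior owner is allowed through, which avoids deadlocking a two-member room.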
502
+ Effects:
503
+
504
+ - Atomically re-reads the room and verifies takeover eligibility (timeout or proven process death).
505
+ - Atomically increments `turn_id`.
506
+ - Issues a new `lease_id`.
507
+ - Sets `owner = agent_id`.
508
+ - Updates the caller's `last_seen_at`.
509
+ - Clears `reserved_for`.
510
+ - Clears `pending_handoff_event_seq`.
511
+ - Records the previous owner or reserved recipient as revoked for that turn.
512
+ - Appends a takeover event with `reason`.
513
+
514
+ The result includes the new `turn_id` and `lease_id`. It does not include a handoff; the prior owner or reserved recipient never produced one. The new owner relies on `get_room_events` to reconstruct context from the most recent handoffs.
515
+
516
+ Old agents cannot mutate the room after takeover because their `lease_id` and `turn_id` no longer match.
517
+
518
+ ## MCP Tool Surface
519
+
520
+ MVP tools:
521
+
522
+ ```ts
523
+ list_rooms(input) -> Room[]
524
+ join_path(input) -> JoinPathResult
525
+ wait_for_turn(input) -> WaitForTurnResult
526
+ heartbeat(input) -> HeartbeatResult
527
+ release_stick(input) -> ReleaseStickResult
528
+ pass_stick(input) -> PassStickResult
529
+ takeover_stick(input) -> TakeoverStickResult
530
+ get_room_state(input) -> GetRoomStateResult
531
+ get_room_events(input) -> RoomEvent[]
532
+ ```
533
+
534
+ Optional later additions:
535
+
536
+ ```ts
537
+ append_note(input) -> AppendNoteResult
538
+ leave_path(input) -> LeavePathResult
539
+ close_room(input) -> CloseRoomResult
540
+ reorder_members(input) -> ReorderMembersResult
541
+ set_room_policy(input) -> SetRoomPolicyResult
542
+ ```
543
+
544
+ `append_note` may be useful for non-owners to add side-channel context without claiming ownership. It must not change owner state.
545
+
546
+ `get_room_events` is MVP because it is the recovery mechanism for takeover and the audit trail for the working-memory story. Without it, a new owner after takeover has no way to read prior handoffs.
547
+
548
+ ## State Transitions
549
+
550
+ ```text
551
+ idle
552
+ wait_for_turn by any active member
553
+ -> owned
554
+
555
+ owned
556
+ heartbeat by owner
557
+ -> owned
558
+
559
+ owned
560
+ release_stick by owner (with valid Handoff)
561
+ -> reserved, if another active member exists
562
+ -> idle, if no other active member exists
563
+
564
+ owned
565
+ pass_stick by owner to an active member (with valid Handoff)
566
+ -> reserved
567
+
568
+ owned
569
+ lease expires
570
+ -> stale_owner/takeover_available as derived state
571
+
572
+ owned
573
+ owning process is known gone
574
+ -> owner_gone/takeover_available as derived state
575
+
576
+ reserved
577
+ reserved_for calls wait_for_turn before any takeover commits
578
+ -> owned
579
+
580
+ reserved
581
+ claim timeout expires
582
+ -> reserved, but takeover_available is returned to other active members
583
+
584
+ reserved
585
+ reserved recipient process is known gone
586
+ -> recipient_gone/takeover_available as derived state
587
+
588
+ reserved
589
+ takeover_stick by another active member after claim timeout
590
+ prior owner is skipped if another candidate exists
591
+ -> owned
592
+
593
+ stale_owner
594
+ owner heartbeat/release/pass before takeover commits
595
+ -> owned/reserved
596
+
597
+ stale_owner
598
+ takeover_stick by another active member
599
+ -> owned
600
+
601
+ owner_gone/recipient_gone
602
+ takeover_stick by another active member
603
+ -> owned
604
+
605
+ owned/reserved/idle/stale_owner/owner_gone/recipient_gone/dormant
606
+ optional later close_room
607
+ -> closed
608
+
609
+ owned/reserved/idle/stale_owner/owner_gone/recipient_gone
610
+ all members inactive or gone and no explicit close
611
+ -> dormant as derived state
612
+ ```
613
+
614
+ ## Race Condition Prevention
615
+
616
+ The server is the only authority for ownership.
617
+
618
+ Required safety rules:
619
+
620
+ - Store room state in a transactional database.
621
+ - Use row-level locking or a single atomic compare-and-swap update for each room mutation.
622
+ - Require `room_id` for all owner mutations. `agent_id` is derived by the MCP adapter from the connection and supplied to the service layer; it is not a tool input.
623
+ - Require `lease_id` for all owner mutations.
624
+ - Require `expected_turn_id` for all owner mutations.
625
+ - Persist enough process metadata to identify the exact spawning process for each member (`pid` plus `process_started_at`, and preferably `host_id`).
626
+ - Increment `turn_id` whenever an agent is granted ownership.
627
+ - Never reuse `lease_id`.
628
+ - Treat `lease_id` as a fencing token.
629
+ - Treat `lease_expires_at` as takeover eligibility, not automatic lease revocation. A lease becomes stale only when the room's current `(lease_id, turn_id)` no longer matches the caller's values.
630
+ - Treat exact process death as stronger than timeout. If the exact owner or reserved-recipient process is known absent, expose immediate takeover eligibility and mark that member inactive.
631
+ - Never infer process death from `pid` alone.
632
+ - On claim-timeout takeover, reject the immediately prior owner when another active takeover candidate exists.
633
+ - Reject stale mutations with a structured error that includes the current owner and current turn.
634
+
635
+ Example stale mutation response:
636
+
637
+ ```json
638
+ {
639
+ "error": "stale_lease",
640
+ "message": "The supplied lease is no longer current for this room.",
641
+ "current_owner": "claude:session-2",
642
+ "current_turn_id": 12,
643
+ "room_state": "owned"
644
+ }
645
+ ```
646
+
647
+ Handoff and membership errors use the same structured form:
648
+
649
+ ```json
650
+ {
651
+ "error": "invalid_handoff",
652
+ "message": "handoff.next_action must be non-empty",
653
+ "field": "next_action"
654
+ }
655
+ ```
656
+
657
+ ```json
658
+ {
659
+ "error": "unknown_member",
660
+ "message": "pass_stick target must be an active room member in the MVP.",
661
+ "to_agent_id": "gemini:session-1"
662
+ }
663
+ ```
664
+
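The non-empty check behind `invalid_handoff` can be sketched as follows. This is a minimal illustration, not the server's actual validator: the `Artifact` and `Handoff` shapes are inferred from the examples in this plan, `validateHandoff` is a hypothetical name, and treating whitespace-only strings as empty is an assumption.

```typescript
// Hypothetical shapes: only status, next_action, and artifacts[] come from the plan.
interface Artifact {
  path: string;
  lines?: [number, number]; // optional line range
  role?: string; // e.g. "review" or "edit"
}

interface Handoff {
  status: string;
  next_action: string;
  artifacts?: Artifact[];
}

interface HandoffError {
  error: "invalid_handoff";
  message: string;
  field: "status" | "next_action";
}

// Returns null when the handoff is acceptable, otherwise a structured error
// naming the first required field that is missing or blank.
function validateHandoff(handoff: Handoff): HandoffError | null {
  const required: Array<"status" | "next_action"> = ["status", "next_action"];
  for (const field of required) {
    const value = handoff[field];
    // Assumption: whitespace-only text counts as empty.
    if (typeof value !== "string" || value.trim().length === 0) {
      return {
        error: "invalid_handoff",
        message: `handoff.${field} must be non-empty`,
        field,
      };
    }
  }
  return null;
}
```

Checking `status` before `next_action` is an arbitrary choice here; the plan does not say which field is reported first when both are empty.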
665
+ Owner mutation error precedence is deterministic:
666
+
667
+ 1. If `expected_turn_id` does not match the room's current `turn_id`, return `turn_mismatch`.
668
+ 2. If the turn matches but `owner`, `agent_id`, or `lease_id` does not match the current owner epoch, return `stale_lease`.
669
+
670
+ This keeps "I am writing against the wrong epoch" distinct from "I am in the current epoch but do not hold the current fencing token."
671
+
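The two-step precedence can be sketched as a pure check run against the room row that was re-read inside the write transaction. The type names and `checkOwnerMutation` are hypothetical; the error fields mirror the stale mutation response shown earlier.

```typescript
interface RoomEpoch {
  owner: string | null;
  lease_id: string | null;
  turn_id: number;
  state: string;
}

interface OwnerMutation {
  agent_id: string; // derived from the connection by the adapter, not a tool input
  lease_id: string;
  expected_turn_id: number;
}

type MutationCheck =
  | { ok: true }
  | {
      ok: false;
      error: "turn_mismatch" | "stale_lease";
      current_owner: string | null;
      current_turn_id: number;
      room_state: string;
    };

function checkOwnerMutation(room: RoomEpoch, req: OwnerMutation): MutationCheck {
  const reject = (error: "turn_mismatch" | "stale_lease"): MutationCheck => ({
    ok: false,
    error,
    current_owner: room.owner,
    current_turn_id: room.turn_id,
    room_state: room.state,
  });
  // Rule 1: a wrong epoch is reported before anything else is compared.
  if (req.expected_turn_id !== room.turn_id) return reject("turn_mismatch");
  // Rule 2: right epoch, but the caller does not hold the current fencing token.
  if (room.owner !== req.agent_id || room.lease_id !== req.lease_id) {
    return reject("stale_lease");
  }
  return { ok: true };
}
```

Because the turn comparison runs first, a caller holding an old lease from an old turn always sees `turn_mismatch`, never `stale_lease`.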
672
+ ## Deadlock Prevention
673
+
674
+ The protocol avoids permanent deadlock by combining:
675
+
676
+ - finite claim timeouts for reserved recipients,
677
+ - renewable leases for active owners,
678
+ - exact process-gone detection for owners and reserved recipients when metadata is available,
679
+ - takeover after missed claim or missed lease,
680
+ - prior-owner takeover fallback when no other active candidate exists,
681
+ - explicit stale state,
682
+ - read-only room inspection via `get_room_state` and `get_room_events`.
683
+
684
+ The server does not silently auto-transfer ownership when a lease expires. It marks the state recoverable and requires an explicit `takeover_stick` call. This keeps recovery auditable and prevents surprise parallel work.
685
+
686
+ Because the MVP has no daemon, expiry is evaluated lazily on reads and writes. A room row may still store `state = 'owned'` after `lease_expires_at`; `get_room_state` and `wait_for_turn` should report the derived state as `stale_owner` or `takeover_available`. The next successful heartbeat, release, pass, or takeover transaction writes the new projected state.
687
+
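Lazy expiry can be sketched as a pure function from the stored row to the state reported to callers. This is an illustration under assumptions: timestamps are taken to be ISO-8601 strings, `holderKnownGone` stands in for the result of a separate exact-process liveness check, the function and type names are hypothetical, and the `dormant` and `closed` states are left out for brevity.

```typescript
type StoredState = "idle" | "owned" | "reserved";
type DerivedState = StoredState | "stale_owner" | "owner_gone" | "recipient_gone";

interface RoomRow {
  state: StoredState;
  lease_expires_at: string | null; // assumed ISO-8601
  claim_expires_at: string | null; // assumed ISO-8601
}

interface DerivedView {
  state: DerivedState;
  takeover_available: boolean;
}

// Computes the reported state at read time without writing anything back.
function deriveRoomState(row: RoomRow, nowMs: number, holderKnownGone: boolean): DerivedView {
  const expired = (iso: string | null): boolean => iso !== null && Date.parse(iso) <= nowMs;
  if (row.state === "owned") {
    // Exact process death is stronger than a timeout, so it is checked first.
    if (holderKnownGone) return { state: "owner_gone", takeover_available: true };
    if (expired(row.lease_expires_at)) return { state: "stale_owner", takeover_available: true };
  }
  if (row.state === "reserved") {
    if (holderKnownGone) return { state: "recipient_gone", takeover_available: true };
    // A claim timeout leaves the room reserved but opens takeover to other members.
    if (expired(row.claim_expires_at)) return { state: "reserved", takeover_available: true };
  }
  return { state: row.state, takeover_available: false };
}
```

The stored row is never modified here, which matches the rule that expiry only opens eligibility and the next successful mutation writes the new projection.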
688
+ ## Multi-Process Concurrency
689
+
690
+ Agent harnesses routinely run in multiple terminal tabs, split panes, or parallel sessions. Each harness typically spawns its own MCP server subprocess. The server design must therefore support many concurrent server processes sharing state.
691
+
692
+ The model is **shared database, no daemon**. Every server process opens the same SQLite file. Coordination is purely through the database.
693
+
694
+ ### SQLite configuration
695
+
696
+ Every connection must apply these pragmas:
697
+
698
+ ```sql
699
+ PRAGMA journal_mode = WAL; -- concurrent readers + single writer, no reader blocking
700
+ PRAGMA synchronous = NORMAL; -- durable enough for coordination state, faster than FULL
701
+ PRAGMA busy_timeout = 5000; -- wait up to 5s when another process holds the write lock
702
+ PRAGMA foreign_keys = ON;
703
+ ```
704
+
705
+ All write transactions start with `BEGIN IMMEDIATE` so the write lock is acquired up front rather than through a mid-transaction upgrade. This avoids `SQLITE_BUSY` failures during lock promotion under contention.
706
+
707
+ Every mutation re-reads the relevant room row inside its transaction and verifies fencing conditions (`lease_id`, `expected_turn_id`, membership, timeout eligibility) before committing. This makes the "two processes see stale state and both try to claim" race impossible: one commits, the other re-reads and returns `not_yet`.
708
+
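The re-read-then-verify step can be sketched with an in-memory stand-in for the room row. In this illustration `claimInsideTransaction` (a hypothetical name) plays the role of the body of one `BEGIN IMMEDIATE` transaction, and the `turn_id` and `lease_id` fields on the `your_turn` result are assumptions. Membership and timeout checks are omitted.

```typescript
interface ClaimableRoom {
  state: "idle" | "owned" | "reserved";
  owner: string | null;
  reserved_for: string | null;
  turn_id: number;
  lease_id: string | null;
}

type ClaimResult =
  | { result: "your_turn"; turn_id: number; lease_id: string }
  | { result: "not_yet" };

// `room` is the row as re-read inside the transaction; the assignments below
// stand in for the UPDATE that the transaction would commit.
function claimInsideTransaction(
  room: ClaimableRoom,
  agentId: string,
  newLeaseId: string,
): ClaimResult {
  const claimable =
    room.state === "idle" || (room.state === "reserved" && room.reserved_for === agentId);
  if (!claimable) return { result: "not_yet" };
  room.state = "owned";
  room.owner = agentId;
  room.reserved_for = null;
  room.turn_id += 1; // turn_id increments on every grant
  room.lease_id = newLeaseId; // a fresh lease id per grant; ids are never reused
  return { result: "your_turn", turn_id: room.turn_id, lease_id: newLeaseId };
}
```

Because SQLite serializes write transactions, the two calls in a contended claim run one after the other: the second caller's re-read sees `owned` and returns `not_yet`.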
709
+ ### wait_for_turn across processes
710
+
711
+ `wait_for_turn` is implemented as bounded polling. Each server process polls `path_rooms` and `room_events` for the requested room at a short interval (250 ms recommended) up to `max_wait_ms`. Changes made by any other process become visible on the next poll.
712
+
713
+ A cursor on the most recent `event_seq`, which is monotonic, lets the server return immediately when newer events already exist, so consecutive long polls do not re-read events the caller has already seen.
714
+
715
+ ### Limitations
716
+
717
+ - **Network filesystems.** SQLite locking is unreliable on NFS and some SMB implementations, and WAL mode relies on shared memory, so every process using the database must run on the same host. The server must detect a non-local filesystem at startup and fail with a clear error suggesting that `TALKING_STICK_DATA_DIR` point at a local path.
718
+ - **Cross-host coordination.** The MVP is explicitly single-host. Multi-host deployments require a different backend (Postgres is the natural upgrade, with `SELECT ... FOR UPDATE` row locks replacing SQLite's single write lock).
719
+
720
+ ### Optional future: daemon mode
721
+
722
+ If polling overhead ever becomes a concern at scale, a future version may add an optional local daemon:
723
+
724
+ - The first client to call `join_path` starts a daemon if none is running.
725
+ - The daemon holds an exclusive writer connection to the SQLite file.
726
+ - Clients connect via a Unix domain socket at `<data_dir>/server.sock`.
727
+ - The daemon pushes state changes to waiting clients directly, eliminating polling.
728
+ - The daemon self-terminates after a configurable idle period.
729
+
730
+ This is not needed for the MVP. Polling is sufficient for typical multi-tab workflows.
731
+
732
+ ## Timeout Policy
733
+
734
+ Recommended defaults (product scale, sized for real agent work rather than chat turns):
735
+
736
+ ```ts
737
+ const owner_lease_ttl_ms = 45 * 60 * 1000; // 45 minutes
738
+ const heartbeat_interval_ms = 5 * 60 * 1000; // 5 minutes
739
+ const claim_ttl_ms = 20 * 60 * 1000; // 20 minutes
740
+ const wait_for_turn_max_wait_ms = 30 * 1000; // 30 seconds
741
+ const wait_for_turn_poll_ms = 250; // transport polling cadence
742
+ const presence_ttl_ms = 4 * 60 * 60 * 1000; // 4 hours
743
+ ```
744
+
745
+ Timeout meanings:
746
+
747
+ - `wait_for_turn` max wait is only a polling budget. The client should call again if it returns `not_yet`.
748
+ - `wait_for_turn_poll_ms` is how often a waiting process re-reads room state during a single long poll.
749
+ - `claim_ttl` is how long a reserved recipient has the exclusive right of first refusal before others may take over.
750
+ - `owner_lease_ttl` is how long an owner may remain silent before takeover becomes possible.
751
+ - `presence_ttl` determines whether a member is active for sequence selection and takeover eligibility.
752
+
753
+ Rationale for these defaults: a real agent turn often runs 20-30 minutes (plan-and-edit, build-and-verify, review-and-respond), and a human collaborator moving between a few rooms may easily be idle for an hour without being "gone." Earlier drafts inherited chat-scale defaults (5-minute lease, 10-minute presence), which would silently open takeover windows mid-turn. The selected values accept a slower takeover response in exchange for not interrupting legitimate long work; operators who want a faster response can shorten them via per-room policy once that ships.
754
+
755
+ These timers are fallback recovery, not the only recovery path. When the server can prove that the exact spawning process is gone, it should expose `owner_gone` or `recipient_gone` immediately instead of waiting for timeout. Ownership timings (lease, claim, presence) are the product-facing knobs; transport timings (wait max, poll cadence) keep their earlier values because they affect only polling efficiency, not ownership semantics.
756
+
757
+ Per-room policy is expected to become a first-class need quickly (batch workflows want longer TTLs; interactive workflows want shorter claims). Storing timeouts on the room record rather than as global server defaults is the recommended near-term extension, enabled via `set_room_policy`.
758
+
759
+ Even before per-room policy ships, the effective policy must be part of the `join_path` response so holders can schedule heartbeats from server truth rather than from compiled-in defaults.
760
+
761
+ ## Persistence Model
762
+
763
+ ### File Layout
764
+
765
+ State lives in a single SQLite database at a predictable per-user data directory:
766
+
767
+ - **Linux and macOS**: `$XDG_DATA_HOME/talking-stick/rooms.sqlite` if `XDG_DATA_HOME` is set, otherwise `~/.local/share/talking-stick/rooms.sqlite`.
768
+ - **Windows**: `%APPDATA%\talking-stick\rooms.sqlite`.
769
+
770
+ For this tool, the shared Linux/macOS default is intentional. Talking Stick is a CLI/MCP developer utility, and using `~/.local/share` on both Unix-like platforms keeps scripts, docs, backups, and troubleshooting consistent across machines.
771
+
772
+ Override:
773
+
774
+ - `TALKING_STICK_DATA_DIR` sets an explicit directory. If set, the database lives at `$TALKING_STICK_DATA_DIR/rooms.sqlite`. This is the recommended way to isolate test databases and to keep per-project state when that is desired.
775
+
776
+ The server creates the directory on first run if it does not exist. By default, all rooms (across all of the user's workspaces on the host) share the single database file. Keeping them together is what makes ancestor lookup a simple indexed query rather than a filesystem traversal.
777
+
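The resolution order above can be sketched as a pure function. This is a minimal illustration: `resolveDatabasePath` and its parameter shapes are hypothetical, the platform strings follow Node's `process.platform` values, and throwing when `APPDATA` is unset is an assumption rather than documented behavior. Trailing separators in the inputs are not normalized.

```typescript
type Platform = "linux" | "darwin" | "win32";

interface PathEnv {
  TALKING_STICK_DATA_DIR?: string;
  XDG_DATA_HOME?: string;
  APPDATA?: string;
}

function resolveDatabasePath(platform: Platform, env: PathEnv, homeDir: string): string {
  const sep = platform === "win32" ? "\\" : "/";
  const join = (...parts: string[]): string => parts.join(sep);
  // 1. The explicit override wins on every platform.
  if (env.TALKING_STICK_DATA_DIR) return join(env.TALKING_STICK_DATA_DIR, "rooms.sqlite");
  // 2. Windows uses %APPDATA%.
  if (platform === "win32") {
    if (!env.APPDATA) throw new Error("APPDATA is not set"); // assumption: fail loudly
    return join(env.APPDATA, "talking-stick", "rooms.sqlite");
  }
  // 3. Linux and macOS share the XDG-style location.
  const dataHome = env.XDG_DATA_HOME ? env.XDG_DATA_HOME : join(homeDir, ".local", "share");
  return join(dataHome, "talking-stick", "rooms.sqlite");
}
```

Passing the environment and home directory in as arguments, rather than reading them inside the function, keeps the cross-platform path tests independent of the machine they run on.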
778
+ ### Schema
779
+
780
+ ```sql
781
+ CREATE TABLE path_rooms (
782
+ room_id TEXT PRIMARY KEY,
783
+ canonical_path TEXT NOT NULL,
784
+ sequence_index INTEGER NOT NULL DEFAULT 0,
785
+ owner TEXT,
786
+ reserved_for TEXT,
787
+ pending_handoff_event_seq INTEGER,
788
+ turn_id INTEGER NOT NULL DEFAULT 0,
789
+ lease_id TEXT,
790
+ lease_expires_at TEXT,
791
+ claim_expires_at TEXT,
792
+ state TEXT NOT NULL,
793
+ updated_at TEXT NOT NULL,
794
+ UNIQUE (canonical_path)
795
+ );
796
+
797
+ -- No separate index on canonical_path is needed: the UNIQUE constraint above
798
+ -- already creates the index that ancestor lookup uses.
799
+
800
+ CREATE TABLE room_members (
801
+ room_id TEXT NOT NULL,
802
+ agent_id TEXT NOT NULL,
803
+ ordinal INTEGER NOT NULL,
804
+ joined_at TEXT NOT NULL,
805
+ last_seen_at TEXT NOT NULL,
806
+ status TEXT NOT NULL,
807
+ host_id TEXT,
808
+ pid INTEGER,
809
+ process_started_at TEXT,
810
+ session_kind TEXT NOT NULL,
811
+ display_name TEXT,
812
+ PRIMARY KEY (room_id, agent_id),
813
+ FOREIGN KEY (room_id) REFERENCES path_rooms(room_id)
814
+ );
815
+
816
+ CREATE TABLE room_events (
817
+ event_seq INTEGER PRIMARY KEY AUTOINCREMENT,
818
+ event_id TEXT NOT NULL UNIQUE,
819
+ room_id TEXT NOT NULL,
820
+ turn_id INTEGER NOT NULL,
821
+ event_type TEXT NOT NULL, -- claim | release | pass | takeover | close
822
+ from_agent_id TEXT,
823
+ to_agent_id TEXT,
824
+ handoff_json TEXT, -- NULL for claim | takeover | close
825
+ reason TEXT, -- populated on takeover events
826
+ created_at TEXT NOT NULL,
827
+ FOREIGN KEY (room_id) REFERENCES path_rooms(room_id)
828
+ );
829
+
830
+ CREATE INDEX room_events_room_seq_idx
831
+ ON room_events (room_id, event_seq);
832
+
833
+ CREATE INDEX room_events_room_turn_idx
834
+ ON room_events (room_id, turn_id);
835
+ ```
836
+
837
+ Ancestor lookup uses the `canonical_path` index: given a request path `P` and resolved workspace root `W`, generate ancestor paths from `P` up to `W` in code and issue a single `IN` query against `canonical_path`, picking the longest match. Each candidate is an exact-match index probe, so the lookup stays cheap at small scale; at very large scale consider materialized paths.
838
+
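The ancestor lookup can be sketched in three small helpers. This is an illustration under assumptions: both paths are already canonical absolute POSIX paths, `W` is not the filesystem root, and the helper names and selected columns are hypothetical.

```typescript
// Returns P, then each parent, up to and including W, deepest first.
function ancestorPaths(requestPath: string, workspaceRoot: string): string[] {
  if (requestPath !== workspaceRoot && !requestPath.startsWith(workspaceRoot + "/")) {
    throw new Error("request path is not inside the workspace root");
  }
  const paths: string[] = [];
  let current = requestPath;
  while (current !== workspaceRoot) {
    paths.push(current);
    current = current.slice(0, current.lastIndexOf("/")); // drop the last segment
  }
  paths.push(workspaceRoot);
  return paths;
}

// Builds the parameterised lookup with one "?" placeholder per candidate path.
function ancestorQuery(candidates: string[]): string {
  const placeholders = candidates.map(() => "?").join(", ");
  return `SELECT room_id, canonical_path FROM path_rooms WHERE canonical_path IN (${placeholders})`;
}

// Picks the longest canonical_path among the rows the query returned.
function deepestRoom(matches: string[]): string | null {
  return matches.reduce<string | null>(
    (best, candidate) => (best === null || candidate.length > best.length ? candidate : best),
    null,
  );
}
```

Since every candidate is a prefix of `P`, the longest matching string is also the deepest existing room.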
839
+ `room_events` is append-only. `path_rooms` is a projection of the event stream for fast reads. The event log is also the takeover recovery context: a new owner after `takeover_stick` reads recent events to reconstruct what was happening before the prior owner went silent.
840
+
841
+ `room_members` stores the process metadata needed for exact local liveness checks. If `pid` and `process_started_at` are unavailable on a platform, those columns may be null and the server falls back to timeout-based recovery for that member rather than making unsafe pid-only guesses.
842
+
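The exact liveness rule can be sketched as follows. The `lookup` callback is a hypothetical platform hook that returns the process currently using a pid (or `null`), and treating a null `host_id` as the local host is an assumption; the other names are illustrative as well.

```typescript
interface RecordedProcess {
  host_id: string | null;
  pid: number | null;
  process_started_at: string | null;
}

interface ObservedProcess {
  pid: number;
  started_at: string;
}

type Liveness = "alive" | "gone" | "unknown";

function checkLiveness(
  recorded: RecordedProcess,
  localHostId: string,
  lookup: (pid: number) => ObservedProcess | null,
): Liveness {
  // Without both pid and start time there is no exact identity to compare.
  if (recorded.pid === null || recorded.process_started_at === null) return "unknown";
  // A pid recorded on another host says nothing about local processes.
  if (recorded.host_id !== null && recorded.host_id !== localHostId) return "unknown";
  const observed = lookup(recorded.pid);
  if (observed === null) return "gone";
  // Same pid with a different start time means the pid was recycled.
  return observed.started_at === recorded.process_started_at ? "alive" : "gone";
}
```

An `unknown` result is the case where the server falls back to timeout-based recovery for that member.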
843
+ Heartbeats update `path_rooms.lease_expires_at` and `updated_at`; they are not written to `room_events` by default. Otherwise waiters would wake up on routine heartbeat traffic.
844
+
845
+ `path_rooms.pending_handoff_event_seq` points at the release or pass event that should be returned to the next successful claimant. It is cleared when a claim or takeover succeeds.
846
+
847
+ ## Agent Operating Instructions
848
+
849
+ Harnesses using this MCP server should follow these rules:
850
+
851
+ 1. Before joining, call `list_rooms` to see what is already happening in the workspace.
852
+ 2. Join using `join_path`. Accept the resolved `room_id` and `canonical_path` the server returns, even if they differ from what you asked for, because workspace root resolution or ancestor lookup may have attached you to a parent room.
853
+ 3. Do not perform shared task work unless `wait_for_turn` returns `your_turn`.
854
+ 4. When you receive `your_turn`, read the attached `handoff` before doing anything else. Load `artifacts[]` entries directly rather than re-exploring the workspace.
855
+ 5. While working, heartbeat periodically.
856
+ 6. Include `room_id`, `lease_id`, and `expected_turn_id` on every owner mutation. Do not send an `agent_id`; the server derives it from the MCP connection.
857
+ 7. If any owner mutation returns `stale_lease`, `turn_mismatch`, or `unknown_member`, stop working and read current state.
858
+ 8. To release the stick, construct a `Handoff` with a truthful `status` and an actionable `next_action`. Include `artifacts[]` entries when the next agent needs to load specific files or line ranges.
859
+ 9. Use `release_stick` to continue the default sequence.
860
+ 10. Use `pass_stick` to choose a specific active member.
861
+ 11. Use `takeover_stick` only after `wait_for_turn` or `get_room_state` reports takeover eligibility (an expired claim or lease, or an `owner_gone` or `recipient_gone` state). Include a reason. After a successful takeover, call `get_room_events` to reconstruct context from the most recent handoffs.
862
+
863
+ Suggested format for the free-text `status` field:
864
+
865
+ ```md
866
+ What I did:
867
+ What I learned:
868
+ Open risks:
869
+ ```
870
+
871
+ ## Example: Three-Agent Round Robin
872
+
873
+ Members join in this order:
874
+
875
+ ```text
876
+ codex
877
+ claude
878
+ gemini
879
+ ```
880
+
881
+ Flow:
882
+
883
+ ```text
884
+ codex claims idle room and writes initial plan.
885
+ codex releases stick with:
886
+ status: "wrote initial plan covering sections 1-3"
887
+ next_action: "review plan for gaps; suggest additions"
888
+ artifacts: [{ path: "plan.md", role: "review" }]
889
+ room reserves for claude.
890
+
891
+ claude claims reserved stick, receives codex's handoff.
892
+ claude reviews the plan and releases with:
893
+ status: "reviewed plan; section 2 is thin on error handling"
894
+ next_action: "extend error handling coverage in section 2"
895
+ artifacts: [{ path: "plan.md", lines: [45, 78], role: "edit" }]
896
+ room reserves for gemini.
897
+
898
+ gemini claims reserved stick, receives claude's handoff.
899
+ gemini extends section 2 and releases.
900
+ room reserves for codex, continuing the member sequence.
901
+ The cycle continues.
902
+ ```
903
+
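Selecting the next member on release can be sketched as a wrap-around scan over join order. This is an illustration: `Member`, its `active` flag, and `nextInSequence` are hypothetical, and the plan's `sequence_index` column is not modeled here.

```typescript
interface Member {
  agent_id: string;
  ordinal: number; // join order
  active: boolean; // e.g. seen within presence_ttl and not known gone
}

// Returns the next active member after `releaser` in ordinal order, wrapping
// around, or null when no other active member exists (the room goes idle).
function nextInSequence(members: Member[], releaser: string): string | null {
  const ordered = [...members].sort((a, b) => a.ordinal - b.ordinal);
  const start = ordered.findIndex((m) => m.agent_id === releaser);
  if (start === -1) throw new Error("releaser is not a room member");
  // step stops before ordered.length so the releaser itself is never selected.
  for (let step = 1; step < ordered.length; step++) {
    const candidate = ordered[(start + step) % ordered.length];
    if (candidate?.active) return candidate.agent_id;
  }
  return null;
}
```

With the three members above, a release by gemini wraps around to codex, and an inactive claude is skipped when codex releases.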
904
+ ## Example: Explicit Skip
905
+
906
+ Default sequence:
907
+
908
+ ```text
909
+ codex -> claude -> gemini -> codex
910
+ ```
911
+
912
+ If codex wants gemini to review a concurrency detail immediately:
913
+
914
+ ```text
915
+ codex pass_stick(
916
+ to_agent_id = gemini,
917
+ handoff = {
918
+ status: "found potential race in claim logic",
919
+ next_action: "assess whether lease_id fencing covers this case",
920
+ artifacts: [{ path: "src/claim.ts", lines: [102, 140], role: "review" }]
921
+ }
922
+ )
923
+ ```
924
+
925
+ The room reserves the next turn for gemini, skipping claude. After gemini releases, the default sequence resumes from gemini's position in the member list.
926
+
927
+ ## Example: Missed Recipient
928
+
929
+ ```text
930
+ codex holds, then releases; reserved for claude.
931
+ claude does not claim before claim_ttl.
932
+ wait_for_turn now returns takeover_available to other active members.
933
+ codex attempts takeover -- REJECTED while gemini is active, because codex was
934
+ the immediately prior owner.
935
+ gemini calls takeover_stick(reason = "claim timeout expired") -- ACCEPTED.
936
+ server grants gemini a fresh lease and increments turn_id.
937
+ gemini calls get_room_events to read codex's original handoff.
938
+ claude wakes up late and tries to claim -- REJECTED (stale turn).
939
+ ```
940
+
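The prior-owner guard in this example can be sketched as a small decision function. It is an illustration, not the server's logic: the context shape and `decideClaimTimeoutTakeover` are hypothetical, and excluding the reserved recipient from the set of "other candidates" is an assumption based on the recipient claiming through `wait_for_turn` rather than through takeover.

```typescript
interface TakeoverContext {
  reserved_for: string; // member that missed its claim
  prior_owner: string; // member that released or passed the stick
  active_members: string[]; // agent ids currently considered active
  claim_expired: boolean;
}

type TakeoverDecision = "accepted" | "rejected";

function decideClaimTimeoutTakeover(ctx: TakeoverContext, caller: string): TakeoverDecision {
  if (!ctx.claim_expired) return "rejected";
  if (!ctx.active_members.includes(caller)) return "rejected";
  // The reserved recipient claims through wait_for_turn, not through takeover.
  if (caller === ctx.reserved_for) return "rejected";
  if (caller === ctx.prior_owner) {
    // Guard: the prior owner may take over only when nobody else can.
    const otherCandidateExists = ctx.active_members.some(
      (id) => id !== ctx.prior_owner && id !== ctx.reserved_for,
    );
    return otherCandidateExists ? "rejected" : "accepted";
  }
  return "accepted";
}
```

In the scenario above, codex is rejected while gemini is active and gemini is accepted; if gemini were inactive, codex would be accepted as the fallback.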
941
+ ## Example: Crashed Owner
942
+
943
+ ```text
944
+ claude owns the stick.
945
+ claude stops heartbeating.
946
+ owner_lease_ttl expires.
947
+ get_room_state reports stale_owner.
948
+ codex calls takeover_stick(reason = "owner lease expired").
949
+ server grants codex a fresh lease and increments turn_id.
950
+ codex calls get_room_events to read the handoff claude received on claim,
951
+ so codex can infer what claude was working on.
952
+ claude later tries to release with its old lease.
953
+ server rejects as stale_lease.
954
+ ```
955
+
956
+ ## Example: Two-Process Concurrent Claim
957
+
958
+ ```text
959
+ Terminal A (claude process) and Terminal B (codex process) both run wait_for_turn
960
+ against an idle room.
961
+ Both processes poll. Both see room_state = idle at the same poll tick.
962
+ Both start a BEGIN IMMEDIATE transaction to claim.
963
+ Process A wins the write lock; process B blocks.
964
+ Process A re-reads the row, verifies idle, increments turn_id, sets owner=claude,
965
+ commits.
966
+ Process B acquires the write lock, re-reads, sees room_state = owned by claude,
967
+ aborts the claim, returns not_yet to its client.
968
+ Process B's client polls again and correctly sees claude as owner.
969
+ ```
970
+
971
+ ## Design Rationale
972
+
973
+ This section records the reasoning behind the load-bearing choices in this plan. Future maintainers should read it before proposing structural changes.
974
+
975
+ ### Why workspace root plus ancestor lookup
976
+
977
+ An earlier draft resolved each request path to a canonical string and looked up rooms by exact match. That failed the common monorepo case: two agents running in `/repo/packages/foo/` and `/repo/packages/bar/` would create separate rooms and never coordinate, even when they were doing related work for the same repo.
978
+
979
+ Resolving to a workspace root before lookup matches the mental model developers already use for `.git`, `CLAUDE.md`, `package.json`, and every other workspace marker in common use. An agent at any depth under `/repo/` automatically joins the `/repo/` room if `/repo/` is the resolved workspace root, and no explicit room identifiers need to be coordinated out of band.
980
+
981
+ Because coordination is operator-initiated, the server does not actively prevent nested rooms. Operators know when they are starting a nested conversation and can request one explicitly via `force_new`.
982
+
983
+ ### Why topics are deferred
984
+
985
+ Multiple concurrent conversations at the same path may become useful, for example a review in flight and a triage in parallel. But topics also weaken the central simplicity of the tool: "agents in this workspace share this room."
986
+
987
+ The MVP keeps one default room per workspace path. That makes the agent instructions shorter, avoids accidental topic mismatch, and keeps path membership as the primary mental model. If real usage needs concurrent rooms, topics can be added as an extension without changing the ownership and lease protocol.
988
+
989
+ ### Why the handoff is structured and mandatory
990
+
991
+ The protocol's original shape treated ownership transfer as metadata about the lock with a free-text note attached. But in practice the expensive operation in a multi-agent handoff is not the lock transfer — it is context reconstruction by the new owner. Every agent that takes the stick starts by asking "what was happening before I got here?"
992
+
993
+ Making the handoff a structured artifact (`status`, `next_action`, `artifacts[]`) turns ownership transfer into state transfer. The `artifacts[]` entries map directly onto how LLMs consume code — path and optional line range — so the next owner can load exactly the relevant context without rediscovering it.
994
+
995
+ Requiring `status` and `next_action` to be non-empty is a small amount of server-side validation that prevents a large class of low-quality handoffs. Handoff quality still depends on harness behavior, which the protocol cannot enforce, so `join_path` returns a `handoff_template` hint that harnesses can surface to their models. The MVP keeps that template static across rooms to avoid turning prompting policy into room configuration before a concrete need exists.
996
+
997
+ ### Why takeovers do not carry a handoff
998
+
999
+ A takeover happens precisely because the expected owner did not produce one. Requiring a handoff from the new owner would ask them to speak for the failed owner, which they cannot do truthfully.
1000
+
1001
+ Instead, the event log is the recovery context. `get_room_events` returns recent handoffs, including the one the failed owner received when they took the stick. This is also why `get_room_events` is part of the MVP rather than optional: without it, the takeover recovery story is broken.
1002
+
1003
+ ### Why `turn_id` increments on grant, not on release
1004
+
1005
+ Incrementing on grant keeps fencing math trivial: the current `(lease_id, turn_id)` pair always matches exactly one epoch, and release merely ends that epoch without starting a new one. A pending reservation is not a new epoch; it is a waiting room. Incrementing on release would create a window where a slow releaser and a fast new claimant disagree about the current epoch.
1006
+
1007
+ ### Why strict round fairness is deferred
1008
+
1009
+ Strict round fairness sounds attractive, but it adds state and policy complexity to the part of the system that must stay easiest to trust. It also creates awkward edge cases: a missed recipient, a single active member, a stale owner, or a deliberate explicit pass can all look like fairness violations even when continuing is the useful behavior.
1010
+
1011
+ The MVP uses ordered release for the normal path and explicit takeover for failure recovery. It also includes a narrow prior-owner guard for claim-timeout takeover, preventing the agent that just released or passed the stick from immediately grabbing it back when another active participant is available. That covers the most important anti-monopoly case without tracking full rounds. If agents later need stronger turn fairness, it can be added as a per-room policy using additional member state.
1012
+
1013
+ ### Why takeover is explicit rather than automatic
1014
+
1015
+ The server could auto-transfer ownership the moment a lease expired. It does not, because silent promotion is a common source of "surprise parallel work" in coordination systems. An explicit `takeover_stick` call requires an agent to name a reason and produces an auditable event, making recovery visible in the log.
1016
+
1017
+ ### Why shared SQLite instead of a daemon
1018
+
1019
+ Agent harnesses in this ecosystem spawn MCP servers as subprocesses, typically one per harness invocation. A daemon-based coordination server would require lifecycle management (who starts it, when it shuts down, how it recovers from crashes) that adds operational complexity with little benefit at a single-host, single-user scale.
1020
+
1021
+ SQLite in WAL mode handles concurrent readers and a single writer at low latency without any additional process. Multiple MCP server processes can share the database file without coordination beyond what SQLite already provides. The cost is polling — `wait_for_turn` cannot be push-notified across processes — but a 250 ms poll interval is well within the latency budget for agent-to-agent handoffs.
1022
+
1023
+ A daemon mode remains open as a future optimization if polling becomes a bottleneck. It is not needed for the typical multi-tab workflow the MVP targets.
1024
+
1025
+ ### Why `~/.local/share` on Linux and macOS
1026
+
1027
+ Writing coordination state to `~/.talking-stick/` would litter the home directory. Sending macOS to `~/Library/Application Support`, however, makes a CLI-first developer tool harder to script and explain consistently across Unix-like machines.
1028
+
1029
+ The MVP therefore uses the XDG-style location on both Linux and macOS: `$XDG_DATA_HOME/talking-stick` when set, otherwise `~/.local/share/talking-stick`. Windows uses `%APPDATA%\talking-stick`. The `TALKING_STICK_DATA_DIR` override exists for users who want per-project isolation, test databases, or a different local disk.
1030
+
1031
+ Centralizing all rooms in a single SQLite file (rather than one file per room) makes ancestor lookup a simple indexed query rather than a filesystem walk. It also means backups and migrations move a single file.
1032
+
1033
+ ## Open Design Questions
1034
+
1035
+ The following questions are worth revisiting once the MVP has seen real use:
1036
+
1037
+ - Should non-owners be able to append notes, or would that encourage side-channel work that bypasses the handoff discipline?
1038
+ - Should a human operator override use the same `takeover_stick` tool as peer agents, or a separate admin tool that bypasses timeout gating?
1039
+
1040
+ Current implementation note: no timeout-bypass or admin override exists today. Human operators use the same explicit `takeover_stick` flow as peer agents, and any bypass semantics remain intentionally undecided.
1041
+
1042
+ ## Implementation Plan
1043
+
1044
+ 1. Build a local TypeScript MCP server using the Node MCP SDK.
1045
+ 2. Use SQLite (via `better-sqlite3` or `libsql`) with WAL mode, resolving the data directory to `$XDG_DATA_HOME/talking-stick` when set or `~/.local/share/talking-stick` otherwise on Linux/macOS, `%APPDATA%\talking-stick` on Windows, and honoring `TALKING_STICK_DATA_DIR` as an override.
1046
+ 3. Apply required pragmas on every connection; use `BEGIN IMMEDIATE` for all write transactions.
1047
+ 4. Detect non-local filesystems at startup and fail fast with a clear error.
1048
+ 5. Implement canonical path resolution, workspace root detection, and deepest-ancestor room lookup.
1049
+ 6. Implement `list_rooms`, `join_path` (with `force_new`), `get_room_state`, and member sequencing.
1050
+ 7. Implement the `Handoff` type with server-side validation of required fields.
1051
+ 8. Implement `wait_for_turn` as bounded polling with monotonic cursor support and atomic claiming; attach the prior handoff to `your_turn` responses.
1052
+ 9. Implement `takeover_available` responses without auto-taking the stick.
1053
+ 10. Implement lease issuing, heartbeat, release (with handoff), explicit pass (with handoff), and takeover.
1054
+ 11. Implement `get_room_events` for both audit and takeover recovery.
1055
+ 12. Add tests for:
1056
+ - ancestor lookup (including nested rooms and `force_new`),
1057
+ - handoff validation errors,
1058
+ - stale leases,
1059
+ - simultaneous claims within one process,
1060
+ - **simultaneous claims across multiple concurrent processes** (spawn N processes, have them claim/release under contention, verify no state corruption),
1061
+ - explicit pass,
1062
+ - pass to unknown or inactive member rejection,
1063
+ - release sequence,
1064
+ - claim timeout and takeover,
1065
+ - prior owner rejected on claim-timeout takeover when another active candidate exists,
1066
+ - prior owner allowed on claim-timeout takeover when no other active candidate exists,
1067
+ - owner timeout and takeover,
1068
+ - owner process gone yields immediate `owner_gone` takeover availability,
1069
+ - reserved recipient process gone yields immediate `recipient_gone` takeover availability,
1070
+ - original reserved recipient claiming after claim timeout but before takeover,
1071
+ - expired owner heartbeating after lease timeout but before takeover,
1072
+ - dead owner or dead reserved recipient cannot reclaim after exact process-gone detection,
1073
+ - dormant rooms remain readable and resumable, rather than auto-closing,
1074
+ - event log reconstruction after takeover,
1075
+ - database path resolution across platforms and with `TALKING_STICK_DATA_DIR` set.
1076
+ 13. Add a small CLI or script for manual inspection during development.
1077
+
1078
+ Current first-slice test coverage:
1079
+
1080
+ - happy path: `join_path` -> idle `wait_for_turn` claim -> `release_stick` with handoff -> reserved recipient claim,
1081
+ - handoff validation,
1082
+ - stale lease rejection,
1083
+ - turn mismatch rejection,
1084
+ - deepest ancestor lookup,
1085
+ - workspace-root room creation,
1086
+ - `wait_for_turn` returns `takeover_available` after claim timeout,
1087
+ - reserved recipient may still claim after `claim_ttl` until takeover commits,
1088
+ - `wait_for_turn` returns `takeover_available` after owner lease timeout,
1089
+ - expired owner may heartbeat before takeover commits,
1090
+ - owner-timeout takeover fences stale owner writes,
1091
+ - owner process death yields immediate `owner_gone`,
1092
+ - reserved recipient process death yields immediate `recipient_gone`,
1093
+ - dormant rooms stay readable and resumable,
1094
+ - prior-owner takeover guard after claim timeout,
1095
+ - multi-process contention against an idle room.
1096
+
1097
+ ## Minimum Viable Version
1098
+
1099
+ The first useful version can omit optional notes, admin features, and per-room policy.
1100
+
1101
+ MVP tools:
1102
+
1103
+ ```text
1104
+ list_rooms
1105
+ join_path
1106
+ wait_for_turn
1107
+ heartbeat
1108
+ release_stick
1109
+ pass_stick
1110
+ takeover_stick
1111
+ get_room_state
1112
+ get_room_events
1113
+ ```
1114
+
1115
+ MVP storage:
1116
+
1117
+ ```text
1118
+ path_rooms (with canonical_path unique key, current ownership projection,
1119
+ and pending_handoff_event_seq)
1120
+ room_members (with join order and presence timestamps)
1121
+ room_events (with handoff_json payload on release and pass events)
1122
+ ```
1123
+
1124
+ MVP policy:
1125
+
1126
+ ```text
1127
+ data directory: ~/.local/share/talking-stick on Linux and macOS
1128
+ (or $XDG_DATA_HOME/talking-stick when set);
1129
+ %APPDATA%\talking-stick on Windows;
1130
+ override via TALKING_STICK_DATA_DIR
1131
+ database file: <data_dir>/rooms.sqlite, WAL mode, synchronous=NORMAL, busy_timeout=5s
1132
+ concurrency: shared database across server processes; BEGIN IMMEDIATE for writes;
1133
+ wait_for_turn polls at 250 ms across processes
1134
+ filesystem requirement: local filesystem; NFS/SMB rejected at startup
1135
+ room identity: canonical workspace path, resolved via workspace root detection
1136
+ plus deepest-ancestor lookup
1137
+ room creation default: attach to ancestor when one exists; require force_new to nest
1138
+ topics: deferred extension, not MVP
1139
+ release behavior: reserve next active member in sequence
1140
+ explicit pass behavior: reserve active target member
1141
+ takeover behavior: another active member after timeout; timeout itself does not revoke
1142
+ claim-timeout takeover skips the prior owner when another
1143
+ active candidate exists
1144
+ exact owner/recipient process death yields immediate
1145
+ takeover availability without waiting for timeout
1146
+ handoff requirement: release_stick and pass_stick require non-empty status and next_action
1147
+ recovery context: get_room_events supplies prior handoffs to takeover winner
1148
+ owner lease TTL: 45 minutes
1149
+ heartbeat interval: 5 minutes
1150
+ claim TTL: 20 minutes
1151
+ presence TTL: 4 hours
1152
+ close semantics: no `close_room` tool in the MVP implementation;
1153
+ rooms remain resumable and can become dormant
1154
+ when nobody is live
1155
+ wait_for_turn max wait: 30 seconds, polled at 250 ms
1156
+ ```