talking-stick 0.1.0-alpha
- package/README.md +166 -0
- package/dist/cli.js +701 -0
- package/dist/commands.js +70 -0
- package/dist/config.js +31 -0
- package/dist/db.js +177 -0
- package/dist/errors.js +20 -0
- package/dist/identity.js +184 -0
- package/dist/index.js +12 -0
- package/dist/install.js +272 -0
- package/dist/mcp-server.js +171 -0
- package/dist/path-resolution.js +101 -0
- package/dist/process-utils.js +93 -0
- package/dist/server.js +3 -0
- package/dist/service.js +980 -0
- package/dist/session-store.js +80 -0
- package/dist/skill-install.js +107 -0
- package/dist/types.js +1 -0
- package/docs/ambient-presence.md +191 -0
- package/docs/releases/0.1.0-alpha.md +32 -0
- package/docs/talking-stick-plan.md +1156 -0
- package/package.json +40 -0
- package/skills/talking-stick/SKILL.md +132 -0
- package/skills/talking-stick/agents/openai.yaml +4 -0
|
@@ -0,0 +1,1156 @@
|
|
|
1
|
+
# Talking Stick MCP Coordination Plan
|
|
2
|
+
|
|
3
|
+
## Purpose
|
|
4
|
+
|
|
5
|
+
Talking Stick is an MCP server that lets multiple agent harnesses coordinate work in a shared workspace without accidentally performing parallel work and without re-deriving context on every turn.
|
|
6
|
+
|
|
7
|
+
The core metaphor is simple:
|
|
8
|
+
|
|
9
|
+
- A workspace maps to a coordination room.
|
|
10
|
+
- Agents join the room by operating in a path that resolves to it.
|
|
11
|
+
- Exactly one agent may hold the talking stick for that room at a time.
|
|
12
|
+
- The holder may work, release the stick to the next agent in sequence, or explicitly pass it to a specific agent.
|
|
13
|
+
- Passing or releasing the stick requires a structured handoff. The handoff carries what was done, what remains, and where to look, so the next agent does not have to rediscover context.
|
|
14
|
+
- Normal release follows the member sequence. Explicit pass and timeout takeover are deliberate escape hatches.
|
|
15
|
+
- If the expected agent fails to respond, another active member may take over after a timeout. Timeout opens takeover eligibility; it does not revoke the expected agent until a takeover actually commits.
|
|
16
|
+
|
|
17
|
+
The goal is not a general chat system. The goal is a small, fault-tolerant coordination primitive that also serves as shared working memory for planning, code review, task handoff, and multi-agent turn-taking.
|
|
18
|
+
|
|
19
|
+
## Design Goals
|
|
20
|
+
|
|
21
|
+
- Resolve coordination scope to a workspace root, matching how developers already think about workspaces (git, `CLAUDE.md`, `package.json`).
|
|
22
|
+
- Keep the MVP to one default room per workspace path; multiple simultaneous topics can be added later if real workflows need them.
|
|
23
|
+
- Make the handoff between agents the primary state transfer, not an afterthought.
|
|
24
|
+
- Provide predictable ordered turn-taking without making fairness a hard concurrency invariant.
|
|
25
|
+
- Work safely when multiple MCP server processes run concurrently (multiple terminal tabs, split views, parallel sessions).
|
|
26
|
+
- Store state in a predictable per-user data directory under `~/.local/share` on Linux and macOS, with an override for tests and project isolation.
|
|
27
|
+
- Recover cleanly when an agent crashes, times out, or stops polling.
|
|
28
|
+
- Keep the MCP surface small enough that harnesses can follow it reliably.
|
|
29
|
+
- Make stale writes impossible with fencing tokens and turn numbers.
|
|
30
|
+
- Prefer explicit state transitions over hidden automatic behavior.
|
|
31
|
+
|
|
32
|
+
## Non-Goals
|
|
33
|
+
|
|
34
|
+
- Not a replacement for git, issue trackers, or durable project documentation.
|
|
35
|
+
- Not a general-purpose pub/sub bus.
|
|
36
|
+
- Does not merge simultaneous edits from multiple agents.
|
|
37
|
+
- Does not guarantee that an agent follows instructions outside the protocol. It only makes protocol-compliant coordination safe.
|
|
38
|
+
- Does not coordinate across hosts. Single-host only in the MVP.
|
|
39
|
+
|
|
40
|
+
## Core Concepts
|
|
41
|
+
|
|
42
|
+
### Path Room and Workspace Resolution
|
|
43
|
+
|
|
44
|
+
A path room is the coordination scope for a workspace. In the MVP, room identity is the canonical room path, normally the workspace root.
|
|
45
|
+
|
|
46
|
+
Room resolution has three steps:
|
|
47
|
+
|
|
48
|
+
1. Resolve the request path to a preferred workspace root.
|
|
49
|
+
2. Search from the request path up to that workspace root for the deepest existing room.
|
|
50
|
+
3. If no room exists on that path, create a room at the preferred workspace root.
|
|
51
|
+
|
|
52
|
+
This avoids the common monorepo failure mode where one agent starts in `/repo/packages/foo/src/` and another starts in `/repo/packages/bar/`, creating separate rooms even though both are working in the same repo. If `/repo/` is the git worktree root, both agents resolve to `/repo/`. It also preserves explicit nested rooms: if an operator creates a room at `/repo/packages/foo/`, agents below that path join the nested room instead of the repo root.
|
|
53
|
+
|
|
54
|
+
Preferred workspace root resolution:
|
|
55
|
+
|
|
56
|
+
1. If the request path is inside a git worktree, use the git top-level path.
|
|
57
|
+
2. Otherwise, use the nearest ancestor containing a recognized workspace marker such as `CLAUDE.md`, `AGENTS.md`, `package.json`, `pyproject.toml`, `Cargo.toml`, or `go.mod`.
|
|
58
|
+
3. Otherwise, use the canonical request path.
|
|
59
|
+
|
|
60
|
+
Canonicalization applied before room lookup:
|
|
61
|
+
|
|
62
|
+
- If `context_path` points to a file, use its parent directory.
|
|
63
|
+
- Resolve symlinks.
|
|
64
|
+
- Normalize path separators.
|
|
65
|
+
- Normalize casing on case-insensitive filesystems.
|
|
66
|
+
|
|
67
|
+
**Nesting and conflict.** Creating a nested room inside an existing room requires explicit opt-in via `force_new = true`. The default behavior is to join the ancestor room. Because talking-stick coordination is operator-initiated, this default is safe: operators know when they are starting a nested conversation and can request one explicitly.
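The lookup and nesting rules above can be sketched as a pure function. This is a minimal illustration, not the package's implementation: `isWithin`, `ancestorsUpTo`, and `resolveRoom` are hypothetical names, paths are assumed to be POSIX-style and already canonicalized, and the preferred workspace root is assumed to have been resolved by the caller.

```typescript
// Hypothetical sketch: paths are assumed canonical and POSIX-style.
type Resolution =
  | { action: "join"; roomPath: string }
  | { action: "create"; roomPath: string };

// True when `path` equals `root` or lies below it (segment-aware prefix check).
function isWithin(path: string, root: string): boolean {
  if (path === root) return true;
  const prefix = root.endsWith("/") ? root : root + "/";
  return path.startsWith(prefix);
}

// Ancestor chain from `path` up to and including `root`, deepest first.
function ancestorsUpTo(path: string, root: string): string[] {
  const chain: string[] = [];
  let current = path;
  while (isWithin(current, root)) {
    chain.push(current);
    if (current === root) break;
    const cut = current.lastIndexOf("/");
    current = cut <= 0 ? "/" : current.slice(0, cut);
  }
  return chain;
}

function resolveRoom(
  requestPath: string,
  preferredRoot: string,
  existingRooms: ReadonlySet<string>,
  forceNew: boolean = false,
): Resolution {
  const deepest = ancestorsUpTo(requestPath, preferredRoot).find((p) =>
    existingRooms.has(p),
  );
  // No room on the path: create one at the preferred workspace root.
  if (deepest === undefined) return { action: "create", roomPath: preferredRoot };
  // Default behavior: join the deepest existing ancestor room.
  if (!forceNew) return { action: "join", roomPath: deepest };
  // Explicit opt-in: nested room at the request path, or join it if it already exists.
  return existingRooms.has(requestPath)
    ? { action: "join", roomPath: requestPath }
    : { action: "create", roomPath: requestPath };
}
```

With rooms at `/repo` and `/repo/packages/foo`, a request from `/repo/packages/foo/src` joins the nested room, while a request from `/repo/packages/bar` joins `/repo`.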
|
|
68
|
+
|
|
69
|
+
### Deferred Extension: Topics
|
|
70
|
+
|
|
71
|
+
The MVP intentionally has one default room per workspace path.
|
|
72
|
+
|
|
73
|
+
Optional topics may be added later if unrelated efforts in the same workspace need independent coordination, such as a `review` room and a `triage` room at the same repo root. Deferring topics keeps the initial protocol aligned with the simple "path chat" model and avoids making every tool carry an extra discriminator before the need is proven.
|
|
74
|
+
|
|
75
|
+
### Agent Identity
|
|
76
|
+
|
|
77
|
+
`agent_id` is derived by the MCP adapter at connection time, not supplied by the harness. Harnesses should not set or guess their own identity; the server knows more about which process is calling than the harness does about itself.
|
|
78
|
+
|
|
79
|
+
Derivation signals, in order of preference:
|
|
80
|
+
|
|
81
|
+
1. `clientInfo.name` and `clientInfo.version` from the MCP `initialize` handshake. Every MCP client sends these; Claude Code, Codex, and Gemini CLI all set distinctive values.
|
|
82
|
+
2. The MCP server's own parent process identity: `(parent_pid, parent_start_time)`. Together these uniquely identify the harness instance on a host. `parent_pid` alone is unsafe because PIDs are reused after exit.
|
|
83
|
+
3. Environment variables the harness exports, such as `CLAUDECODE`, `CLAUDE_CODE_ENTRYPOINT`, `TERM_PROGRAM`, `ITERM_SESSION_ID`, `TMUX`, `SSH_TTY`.
|
|
84
|
+
|
|
85
|
+
Composed identity, stable for the life of one harness instance:
|
|
86
|
+
|
|
87
|
+
```text
<harness-slug>:<short-hash>

e.g.
claude-code:a3f1
codex:9b22
gemini:1c4e
```
|
|
95
|
+
|
|
96
|
+
The hash is a short digest over the signals above, so reconnects from the same harness instance land on the same `agent_id`. Distinct tabs, splits, or parallel spawns of the same harness get distinct hashes because their `parent_pid`/`parent_start_time` differ.
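As a rough sketch of how such an identity could be composed: the plan does not name a digest algorithm, so the FNV-1a-style hash here is only a dependency-free stand-in, and `IdentitySignals`, `slugify`, and `deriveAgentId` are hypothetical names. The format of `parentStartTime` is also an assumption.

```typescript
// Hypothetical signal bundle; field names are illustrative only.
interface IdentitySignals {
  clientName: string; // clientInfo.name from the MCP initialize handshake
  clientVersion: string; // clientInfo.version
  parentPid: number;
  parentStartTime: string; // assumed to be a stable string such as an ISO timestamp
}

// FNV-1a-style 32-bit hash over UTF-16 code units. A stand-in digest, not
// necessarily what the package uses.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Lowercase the client name and collapse non-alphanumerics into single dashes.
function slugify(name: string): string {
  return name
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

function deriveAgentId(signals: IdentitySignals): string {
  const material = [
    signals.clientName,
    signals.clientVersion,
    String(signals.parentPid),
    signals.parentStartTime,
  ].join("\u0000");
  // Zero-pad to 8 hex digits, then keep the first 4 as the short hash.
  const hex = ("0000000" + fnv1a(material).toString(16)).slice(-8);
  return slugify(signals.clientName) + ":" + hex.slice(0, 4);
}
```

Because the digest input includes `parentPid` and `parentStartTime`, a reconnect from the same harness instance reproduces the same string, while a second tab normally produces a different one. A four-character hash can collide, so a real implementation would need a policy for that case.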
|
|
97
|
+
|
|
98
|
+
For the Human CLI (see deferred extension), the same idea applies with different signals. `$USER`, the parent shell's `(pid, start_time)`, and the tty yield identities like:
|
|
99
|
+
|
|
100
|
+
```text
human:wojtek:s003
```
|
|
103
|
+
|
|
104
|
+
The derived string is the protocol-facing identity, but the server must also persist the source liveness facts behind it:
|
|
105
|
+
|
|
106
|
+
- `host_id`
|
|
107
|
+
- `pid`
|
|
108
|
+
- `process_started_at`
|
|
109
|
+
- `session_kind` (`mcp_harness`, `human_guardian`, later others)
|
|
110
|
+
- optional display metadata such as tty or client name
|
|
111
|
+
|
|
112
|
+
The digest alone is not enough for liveness decisions. If the system is going to say "that owner is really gone," it must be able to check whether the exact spawning process identified by `(host_id, pid, process_started_at)` still exists.
|
|
113
|
+
|
|
114
|
+
For the Human CLI, this implies a split between one-shot commands and holders:
|
|
115
|
+
|
|
116
|
+
- one-shot commands like `list`, `join`, `state`, and `events` are ordinary short-lived processes and do not own the room,
|
|
117
|
+
- indefinite human ownership uses an attached hold mode or a lightweight local guardian process, so the owner is still represented by one live process that can be checked and can exit cleanly.
|
|
118
|
+
|
|
119
|
+
The `join_path` response returns the assigned `agent_id` to the harness so it appears in logs, downstream handoffs, and event records. It also returns the effective policy for that room, including the server's expected heartbeat cadence, so harnesses and human-holder helpers can renew leases without guessing. An optional `agent_id_override` is accepted for tests and debugging and is flagged in the event stream.
|
|
120
|
+
|
|
121
|
+
No MCP tool input other than `join_path` carries `agent_id_override`. If `join_path` receives an override, that override becomes the connection's derived identity for subsequent calls on that same connection until disconnect. Otherwise, for every owner mutation the adapter injects the derived identity from the connection context, and the service layer continues to use `agent_id` internally for fencing and membership checks.
|
|
122
|
+
|
|
123
|
+
This keeps the protocol surface simple: `agent_id` is the session-scoped identity. The MVP still does not need a separate global participant/session abstraction, but it does need to persist process metadata alongside room membership so it can make exact local liveness checks.
|
|
124
|
+
|
|
125
|
+
### Membership Sequence
|
|
126
|
+
|
|
127
|
+
When an agent joins a room, the server appends it to the room's ordered member list if not already present. This order defines the default turn sequence.
|
|
128
|
+
|
|
129
|
+
```text
A joins
B joins
C joins

Default sequence: A -> B -> C -> A
```
|
|
136
|
+
|
|
137
|
+
An owner can follow this sequence by releasing the stick. An owner can skip the sequence by explicitly passing to a chosen agent.
|
|
138
|
+
|
|
139
|
+
### Ownership
|
|
140
|
+
|
|
141
|
+
At most one agent owns the stick for a room at any time.
|
|
142
|
+
|
|
143
|
+
Only the owner may perform owner actions:
|
|
144
|
+
|
|
145
|
+
- `heartbeat`
|
|
146
|
+
- `release_stick`
|
|
147
|
+
- `pass_stick`
|
|
148
|
+
- `close_room` if that optional later tool is implemented
|
|
149
|
+
|
|
150
|
+
Every owner action carries:
|
|
151
|
+
|
|
152
|
+
- `room_id`
|
|
153
|
+
- `lease_id`
|
|
154
|
+
- `expected_turn_id`
|
|
155
|
+
|
|
156
|
+
`agent_id` is derived by the MCP adapter from the connection rather than sent by the caller. The service layer still uses `(agent_id, lease_id, turn_id)` together for fencing; the caller just does not get to name themselves. Old actions from stale agents must be rejected.
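A minimal sketch of that fencing check, assuming the adapter has already resolved the caller's `agent_id` from the connection. `FenceState`, `OwnerAction`, and `checkFence` are hypothetical names; only the `stale_lease` label comes from the plan.

```typescript
// Hypothetical shapes for illustration; the real service layer may differ.
interface FenceState {
  owner: string | null;
  leaseId: string | null;
  turnId: number;
}

interface OwnerAction {
  leaseId: string;
  expectedTurnId: number;
}

type FenceResult = { ok: true } | { ok: false; error: "stale_lease" };

// All three of (agent_id, lease_id, turn_id) must match the current room row.
function checkFence(
  room: FenceState,
  callerAgentId: string, // injected by the adapter, never sent by the caller
  action: OwnerAction,
): FenceResult {
  const current =
    room.owner === callerAgentId &&
    room.leaseId === action.leaseId &&
    room.turnId === action.expectedTurnId;
  return current ? { ok: true } : { ok: false, error: "stale_lease" };
}
```

An agent that lost the stick to a takeover still holds its old `lease_id` and `turn_id`, so every later owner action from it fails this check.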
|
|
157
|
+
|
|
158
|
+
### Turn ID Semantics
|
|
159
|
+
|
|
160
|
+
`turn_id` identifies the current ownership epoch. It increments only when an agent is granted ownership:
|
|
161
|
+
|
|
162
|
+
- an idle room is claimed,
|
|
163
|
+
- a reserved recipient claims the stick,
|
|
164
|
+
- a timeout takeover succeeds.
|
|
165
|
+
|
|
166
|
+
`release_stick` and `pass_stick` end the current ownership epoch and invalidate the current lease, but they do not grant ownership to the next agent by themselves. They create a pending reservation that the next eligible agent must claim.
|
|
167
|
+
|
|
168
|
+
### Default Turn Order
|
|
169
|
+
|
|
170
|
+
The room maintains an ordered member list and a `sequence_index`. Normal release reserves the stick for the next active member after the current owner.
|
|
171
|
+
|
|
172
|
+
This gives the common case a predictable round-robin shape:
|
|
173
|
+
|
|
174
|
+
```text
A releases -> B gets right of first refusal
B releases -> C gets right of first refusal
C releases -> A gets right of first refusal
```
|
|
179
|
+
|
|
180
|
+
The sequence is not a hard fairness lock in the MVP:
|
|
181
|
+
|
|
182
|
+
- An owner may explicitly pass to any active member.
|
|
183
|
+
- If a reserved recipient misses `claim_ttl`, another active member may take over, but the immediately prior owner should not be the takeover winner while any other active member can take it.
|
|
184
|
+
- If an owner misses `owner_lease_ttl`, another active member may take over.
|
|
185
|
+
|
|
186
|
+
This is intentionally simpler than strict round fairness. The protocol should prevent accidental parallel ownership first; social fairness can be added as a configurable policy once real usage shows which workflows need it.
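The round-robin selection described in this section can be sketched as a scan over the ordered member list that skips inactive members. `SequenceMember` and `nextActiveMember` are hypothetical names, and the `active` flag is assumed to have been derived beforehand.

```typescript
// Hypothetical member shape; `active` is assumed to be computed by the caller.
interface SequenceMember {
  agentId: string;
  active: boolean;
}

// Returns the next active member after `currentOwner` in join order,
// wrapping around, or null when no other active member exists.
function nextActiveMember(
  members: readonly SequenceMember[],
  currentOwner: string,
): string | null {
  const start = members.findIndex((m) => m.agentId === currentOwner);
  if (start === -1) return null;
  for (let step = 1; step < members.length; step++) {
    const candidate = members[(start + step) % members.length];
    if (candidate !== undefined && candidate.active) return candidate.agentId;
  }
  return null;
}
```

With members `A`, `B`, `C` in join order and `B` inactive, a release by `A` reserves the stick for `C`.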
|
|
187
|
+
|
|
188
|
+
### Handoff Artifact
|
|
189
|
+
|
|
190
|
+
A handoff is the structured payload produced when an agent releases or passes the stick. It is the protocol's primary state-transfer mechanism.
|
|
191
|
+
|
|
192
|
+
```ts
interface Handoff {
  // Required. Ensures state transfer is never empty.
  status: string;      // what I did, what I learned, what is unfinished
  next_action: string; // what the recipient should do

  // Optional. Reduces context re-derivation for the recipient.
  artifacts?: Array<{
    path: string;             // absolute or workspace-relative
    lines?: [number, number]; // inclusive line range
    role: "examine" | "review" | "edit" | "context" | "output";
    note?: string;
  }>;

  // Optional.
  open_questions?: string[];
  do_not?: string[]; // off-limits for the next agent
}
```
|
|
211
|
+
|
|
212
|
+
The server validates that `status` and `next_action` are non-empty before accepting a `release_stick` or `pass_stick`. This makes "pass without doing anything" mechanically impossible.
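A possible shape for that validation is sketched here. The `Handoff` type mirrors the interface above; `validateHandoff` is a hypothetical name, and two details are assumptions rather than anything the plan states: whitespace-only strings are treated as empty, and an artifact line range must not run backwards.

```typescript
type ArtifactRole = "examine" | "review" | "edit" | "context" | "output";

interface Artifact {
  path: string;
  lines?: [number, number];
  role: ArtifactRole;
  note?: string;
}

interface Handoff {
  status: string;
  next_action: string;
  artifacts?: Artifact[];
  open_questions?: string[];
  do_not?: string[];
}

// Returns a list of problems; an empty list means the handoff is acceptable.
function validateHandoff(handoff: Handoff): string[] {
  const errors: string[] = [];
  // Assumption: whitespace-only text counts as empty.
  if (handoff.status.trim() === "") errors.push("status must be non-empty");
  if (handoff.next_action.trim() === "") errors.push("next_action must be non-empty");
  for (const artifact of handoff.artifacts ?? []) {
    // Assumption: an inclusive range must have start <= end.
    if (artifact.lines !== undefined && artifact.lines[0] > artifact.lines[1]) {
      errors.push("artifact " + artifact.path + ": line range start exceeds end");
    }
  }
  return errors;
}
```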
|
|
213
|
+
|
|
214
|
+
`artifacts[]` entries give the recipient direct pointers — path plus optional line range — so a new owner can load exactly the relevant code without re-exploring the workspace. This is the primary mechanism the protocol uses to reduce prompt-context churn across agents.
|
|
215
|
+
|
|
216
|
+
The handoff is stored verbatim in the event log and returned to the recipient by `wait_for_turn` when they claim the stick.
|
|
217
|
+
|
|
218
|
+
## Room State
|
|
219
|
+
|
|
220
|
+
```ts
type RoomState =
  | "idle"
  | "owned"
  | "reserved"
  | "stale_owner"
  | "owner_gone"
  | "recipient_gone"
  | "dormant"
  | "closed";

interface PathRoom {
  room_id: string; // server-generated
  canonical_path: string;

  members: AgentId[];
  sequence_index: number;

  owner: AgentId | null;
  reserved_for: AgentId | null;
  pending_handoff_event_seq: number | null;

  turn_id: number;
  lease_id: string | null;
  lease_expires_at: string | null;
  claim_expires_at: string | null;

  state: RoomState;
  updated_at: string;
}

interface RoomMember {
  agent_id: AgentId;
  ordinal: number;
  joined_at: string;
  last_seen_at: string;
  status: "active" | "inactive"; // derived from last_seen_at and presence_ttl
}
```
|
|
259
|
+
|
|
260
|
+
State meanings:
|
|
261
|
+
|
|
262
|
+
- `idle`: no current owner and no specific reserved recipient. It may still have a pending handoff from the previous release.
|
|
263
|
+
- `owned`: one agent has a live lease and may work.
|
|
264
|
+
- `reserved`: the stick has been released or passed to a specific agent, which has a limited time to claim it.
|
|
265
|
+
- `stale_owner`: derived state indicating the owner missed its lease heartbeat and takeover is available. The owner is not revoked until a takeover commits.
|
|
266
|
+
- `owner_gone`: derived state indicating the exact owning process is known to have exited. Takeover is immediately available; no lease timeout wait is required.
|
|
267
|
+
- `recipient_gone`: derived state indicating the reserved recipient's exact process is known to have exited. Takeover is immediately available; no claim timeout wait is required.
|
|
268
|
+
- `dormant`: derived state indicating no member is currently live or recently present, but the room was not explicitly closed. In implementation, takeover-relevant states such as `owner_gone`, `recipient_gone`, and `stale_owner` take precedence over `dormant`; the room should report the most actionable recovery state first.
|
|
269
|
+
- `closed`: no further turns are expected.
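One way to read these meanings is as a projection computed on demand from stored fields plus liveness facts. The sketch below is an interpretation, not the package's logic: `RoomSnapshot`, `LivenessFacts`, and `deriveState` are hypothetical names, timestamps are simplified to epoch milliseconds, and placing `dormant` ahead of plain `owned`, `reserved`, and `idle` is an assumption.

```typescript
type RoomState =
  | "idle"
  | "owned"
  | "reserved"
  | "stale_owner"
  | "owner_gone"
  | "recipient_gone"
  | "dormant"
  | "closed";

// Hypothetical inputs; the plan stores timestamps as strings.
interface RoomSnapshot {
  owner: string | null;
  reservedFor: string | null;
  leaseExpiresAt: number | null;
  closed: boolean;
}

interface LivenessFacts {
  ownerProcessGone: boolean;
  recipientProcessGone: boolean;
  anyMemberPresent: boolean; // some member is live or recently present
}

function deriveState(room: RoomSnapshot, live: LivenessFacts, now: number): RoomState {
  if (room.closed) return "closed";
  // Takeover-relevant states are reported first, ahead of dormancy.
  if (room.owner !== null && live.ownerProcessGone) return "owner_gone";
  if (room.reservedFor !== null && live.recipientProcessGone) return "recipient_gone";
  if (room.owner !== null && room.leaseExpiresAt !== null && now >= room.leaseExpiresAt) {
    return "stale_owner";
  }
  // Assumption: dormancy outranks the remaining non-recovery states.
  if (!live.anyMemberPresent) return "dormant";
  if (room.owner !== null) return "owned";
  if (room.reservedFor !== null) return "reserved";
  return "idle";
}
```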
|
|
270
|
+
|
|
271
|
+
An active member is one whose `last_seen_at` is within `presence_ttl` and whose exact spawning process is still alive when liveness metadata is available. Death beats timeout for the current owner and reserved recipient: if the server can prove either exact process is gone, recovery opens immediately rather than waiting for the lease, claim, or presence timeout to elapse.
|
|
272
|
+
|
|
273
|
+
As with lease expiry, activity can be derived lazily on reads and writes rather than maintained by a background process. Process liveness checks must use `pid + process_started_at`; `kill(pid, 0)` or pid-only lookup is not sufficient because PIDs are reused. To keep write transactions short, the MVP uses exact process checks for owner/reserved recovery and recent presence for broader sequence scans; timeout recovery remains the fallback for a stale sequence target.
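The exact-process check can be sketched with the operating-system lookup injected as a function, since how a start time is obtained differs per platform and is not specified here. `ProcessIdentity`, `StartTimeLookup`, `isExactProcessAlive`, and `isActiveMember` are hypothetical names, and `last_seen_at` is simplified to epoch milliseconds.

```typescript
// Hypothetical persisted liveness facts for one member.
interface ProcessIdentity {
  hostId: string;
  pid: number;
  processStartedAt: string;
}

// Injected lookup: returns the start time of the process currently holding
// `pid`, or null when no such process exists. Platform-specific in practice.
type StartTimeLookup = (pid: number) => string | null;

type Liveness = "alive" | "gone" | "unknown";

function isExactProcessAlive(
  id: ProcessIdentity,
  localHostId: string,
  lookup: StartTimeLookup,
): Liveness {
  // A process on another host cannot be probed locally.
  if (id.hostId !== localHostId) return "unknown";
  const startedAt = lookup(id.pid);
  // A live pid with a different start time is a reused pid, not the original process.
  return startedAt !== null && startedAt === id.processStartedAt ? "alive" : "gone";
}

// Active means recently seen and not proven gone.
function isActiveMember(
  lastSeenAt: number,
  now: number,
  presenceTtlMs: number,
  liveness: Liveness,
): boolean {
  return now - lastSeenAt <= presenceTtlMs && liveness !== "gone";
}
```

Comparing the start time is what distinguishes this from a pid-only probe: a recycled pid reports `gone` for the original owner even though some process answers to that number.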
|
|
274
|
+
|
|
275
|
+
### Room Termination vs Dormancy
|
|
276
|
+
|
|
277
|
+
The protocol model reserves a `closed` state for an optional later `close_room` tool. The MVP implementation does not provide that tool and therefore never enters `closed`; rooms remain resumable even when they become dormant. This still matters because "no live processes currently point at this room" is a common state during normal work (an operator steps away, or all harnesses exit between turns), and it must not be confused with "this conversation is over."
|
|
278
|
+
|
|
279
|
+
The MVP therefore distinguishes three situations, not two:
|
|
280
|
+
|
|
281
|
+
- **Active:** at least one member has recent presence within `presence_ttl`. Normal operation.
|
|
282
|
+
- **Dormant:** no member has recent presence or a currently live process, and no optional later close mechanism has been invoked. The room persists, its event log stays readable, and any member returning later can resume. Dormancy is a derived condition, not a stored state; `get_room_state` is the authoritative projection and may use persisted process metadata when available, while `list_rooms` may use a cheaper summary projection based on room ownership fields and recent presence so it does not need to probe every room holder.
|
|
283
|
+
- **Closed:** reserved for a future `close_room` extension. If that tool is added later, the room becomes terminal and no further owner mutations are accepted. The event log remains for inspection.
|
|
284
|
+
|
|
285
|
+
Retention policy for long-dormant rooms (archive, prune, purge after N days with no activity) is out of scope for MVP and is expected to be a separate administrative concern, not a protocol state transition. This prevents surprise deletions and keeps the protocol's responsibilities narrow.
|
|
286
|
+
|
|
287
|
+
## Default Lifecycle
|
|
288
|
+
|
|
289
|
+
### Discover
|
|
290
|
+
|
|
291
|
+
An agent enumerates rooms reachable from a path:
|
|
292
|
+
|
|
293
|
+
```ts
list_rooms({ context_path? }) -> Room[]
```
|
|
296
|
+
|
|
297
|
+
Rooms are returned for the ancestor chain from `context_path` to the resolved workspace root, keyed by `room_id` and annotated with `canonical_path` and current state. This lets a harness show "here is what is happening in this workspace" in one call.
|
|
298
|
+
|
|
299
|
+
### Join
|
|
300
|
+
|
|
301
|
+
```ts
join_path({
  context_path,
  force_new?,        // defaults to false
  agent_id_override? // optional; tests and debugging only
})
```
|
|
308
|
+
|
|
309
|
+
`agent_id` is not an input. It is derived server-side from the MCP connection and returned in the response so the harness can surface it in logs.
|
|
310
|
+
|
|
311
|
+
Resolution:
|
|
312
|
+
|
|
313
|
+
1. Canonicalize `context_path`.
|
|
314
|
+
2. Resolve the preferred workspace root.
|
|
315
|
+
3. Walk up from the canonical `context_path` to the preferred workspace root looking for an existing room.
|
|
316
|
+
4. If found and `force_new = false`: join the deepest existing ancestor room.
|
|
317
|
+
5. If found and `force_new = true`: create a nested room at the canonical `context_path`, returning a warning that an ancestor room exists. If a room already exists at that exact path, join it.
|
|
318
|
+
6. If not found: create a new room at the preferred workspace root.
|
|
319
|
+
|
|
320
|
+
The response includes the resolved `room_id`, the `canonical_path` the agent actually joined (which may differ from the request path when workspace root resolution or ancestor lookup redirected the call), the effective room policy (including `heartbeat_interval_ms`), and a `handoff_template` hint describing the expected handoff shape. For the MVP this template is static server-wide; room-specific prompting can be added later if real workflows need it.
|
|
321
|
+
|
|
322
|
+
Effects:
|
|
323
|
+
|
|
324
|
+
- Adds `agent_id` to the ordered member list if absent.
|
|
325
|
+
- Updates the agent presence timestamp.
|
|
326
|
+
- Returns the current room state.
|
|
327
|
+
|
|
328
|
+
### Wait
|
|
329
|
+
|
|
330
|
+
```ts
wait_for_turn({
  room_id,
  cursor?,
  max_wait_ms?
})
```
|
|
337
|
+
|
|
338
|
+
Possible results:
|
|
339
|
+
|
|
340
|
+
```ts
type WaitForTurnResult =
  | {
      status: "your_turn";
      room_id: string;
      turn_id: number;
      lease_id: string;
      handoff: Handoff | null; // null only for the first open claim in a fresh room
      from_agent_id: AgentId | null;
      reason: "direct_pass" | "sequence" | "open_claim";
    }
  | {
      status: "not_yet";
      cursor: string;
      room_state: RoomState;
    }
  | {
      status: "takeover_available";
      room_id: string;
      turn_id: number;
      room_state:
        | "owned"
        | "reserved"
        | "stale_owner"
        | "owner_gone"
        | "recipient_gone";
      reason:
        | "claim_timeout"
        | "owner_timeout"
        | "owner_gone"
        | "recipient_gone";
      current_owner?: AgentId;
      reserved_for?: AgentId;
    }
  | {
      status: "closed";
      room_id: string;
    };
```
|
|
379
|
+
|
|
380
|
+
`wait_for_turn` may claim the stick when the caller is directly eligible:
|
|
381
|
+
|
|
382
|
+
- If the room is `idle`, any active member may claim.
|
|
383
|
+
- If the room is `reserved`, `reserved_for` may claim as long as no takeover has committed, even after `claim_expires_at`.
|
|
384
|
+
|
|
385
|
+
Each `wait_for_turn` call updates the caller's `last_seen_at`, so polling agents remain active.
|
|
386
|
+
|
|
387
|
+
As part of each read/write operation, the server may also refresh derived liveness for the current owner and reserved recipient. If the exact spawning process for either is proven absent, the room moves to `owner_gone` or `recipient_gone` as a derived condition and `takeover_available` is returned immediately to other eligible members.
|
|
388
|
+
|
|
389
|
+
`wait_for_turn` does not perform takeover for a non-reserved caller. If a timeout or a proven process exit has made takeover possible, it returns `takeover_available`; the caller must then invoke `takeover_stick` with an explicit reason.
|
|
390
|
+
|
|
391
|
+
`max_wait_ms = 0` is a valid non-blocking call. It still performs one atomic read/claim attempt before returning `your_turn`, `takeover_available`, `closed`, or `not_yet`.
|
|
392
|
+
|
|
393
|
+
When a claim succeeds, the server atomically:
|
|
394
|
+
|
|
395
|
+
- increments `turn_id`,
|
|
396
|
+
- issues a new `lease_id`,
|
|
397
|
+
- sets `owner = agent_id`,
|
|
398
|
+
- clears `reserved_for`,
|
|
399
|
+
- clears `pending_handoff_event_seq`,
|
|
400
|
+
- sets `lease_expires_at`,
|
|
401
|
+
- appends a claim event,
|
|
402
|
+
- returns `your_turn` with the prior handoff attached.
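The claim transition listed above can be sketched as a pure function over the room row. In the real server this would run inside a single database transaction; `ClaimRow`, `ClaimOutcome`, and `tryClaim` are hypothetical names, timestamps are simplified to epoch milliseconds, and lease ID generation and event logging are left to the caller.

```typescript
// Hypothetical subset of the room row touched by a claim.
interface ClaimRow {
  owner: string | null;
  reservedFor: string | null;
  pendingHandoffEventSeq: number | null;
  turnId: number;
  leaseId: string | null;
  leaseExpiresAt: number | null;
}

interface ClaimOutcome {
  room: ClaimRow;
  // Event sequence of the handoff to return with `your_turn`, if any.
  handoffEventSeq: number | null;
}

// Returns the updated row when the caller is directly eligible, else null
// (the caller would then receive `not_yet` or `takeover_available`).
function tryClaim(
  room: ClaimRow,
  agentId: string,
  now: number,
  ownerLeaseTtlMs: number,
  newLeaseId: string,
): ClaimOutcome | null {
  const idle = room.owner === null && room.reservedFor === null;
  const reservedForCaller = room.owner === null && room.reservedFor === agentId;
  if (!idle && !reservedForCaller) return null;
  return {
    handoffEventSeq: room.pendingHandoffEventSeq,
    room: {
      owner: agentId,
      reservedFor: null,
      pendingHandoffEventSeq: null,
      turnId: room.turnId + 1, // a new ownership epoch begins
      leaseId: newLeaseId,
      leaseExpiresAt: now + ownerLeaseTtlMs,
    },
  };
}
```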
|
|
403
|
+
|
|
404
|
+
### Work
|
|
405
|
+
|
|
406
|
+
While holding the stick, an agent should call:
|
|
407
|
+
|
|
408
|
+
```ts
heartbeat({ room_id, lease_id, expected_turn_id })
```
|
|
411
|
+
|
|
412
|
+
The heartbeat extends the owner lease and updates the owner's `last_seen_at`. A `stale_lease` response means another agent has taken over or otherwise invalidated the lease; the caller must stop acting as owner and re-read the room state.
|
|
413
|
+
|
|
414
|
+
Lease expiry opens takeover eligibility. It does not invalidate the owner's lease by itself. If an expired owner heartbeats before another agent successfully takes over, the heartbeat may renew the lease.
|
|
415
|
+
|
|
416
|
+
By contrast, exact process death is definitive. If the server can prove that the owning process is gone, the owner may not renew; takeover is immediately available to other eligible members.
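The difference between lease expiry and process death can be sketched as follows. `LeaseRow`, `HeartbeatRequest`, and `applyHeartbeat` are hypothetical names, the `renewed` label is invented for illustration, and returning `stale_lease` for a proven-dead owner is an assumption; the plan only says such an owner may not renew.

```typescript
// Hypothetical subset of the room row touched by a heartbeat.
interface LeaseRow {
  owner: string | null;
  leaseId: string | null;
  turnId: number;
  leaseExpiresAt: number | null;
}

interface HeartbeatRequest {
  leaseId: string;
  expectedTurnId: number;
}

type HeartbeatOutcome =
  | { status: "renewed"; room: LeaseRow }
  | { status: "stale_lease" };

function applyHeartbeat(
  room: LeaseRow,
  callerAgentId: string,
  req: HeartbeatRequest,
  now: number,
  ownerLeaseTtlMs: number,
  ownerProcessGone: boolean,
): HeartbeatOutcome {
  const fenced =
    room.owner === callerAgentId &&
    room.leaseId === req.leaseId &&
    room.turnId === req.expectedTurnId;
  // A committed takeover changes lease_id and turn_id, so the fence fails.
  // Assumption: a proven-dead owner gets the same answer.
  if (!fenced || ownerProcessGone) return { status: "stale_lease" };
  // Note there is no check against leaseExpiresAt: expiry alone does not revoke.
  return {
    status: "renewed",
    room: { ...room, leaseExpiresAt: now + ownerLeaseTtlMs },
  };
}
```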
|
|
417
|
+
|
|
418
|
+
### Release
|
|
419
|
+
|
|
420
|
+
The owner may release the stick without naming a recipient:
|
|
421
|
+
|
|
422
|
+
```ts
release_stick({
  room_id,
  lease_id,
  expected_turn_id,
  handoff
})
```
|
|
430
|
+
|
|
431
|
+
Server validates:
|
|
432
|
+
|
|
433
|
+
- `lease_id` and `expected_turn_id` are current.
|
|
434
|
+
- `handoff.status` and `handoff.next_action` are non-empty.
|
|
435
|
+
|
|
436
|
+
Effects on success:
|
|
437
|
+
|
|
438
|
+
- Appends a release event containing the full `handoff`.
|
|
439
|
+
- Stores that event's `event_seq` as `pending_handoff_event_seq`.
|
|
440
|
+
- Updates the releasing owner's `last_seen_at`.
|
|
441
|
+
- Clears current owner and invalidates the current lease.
|
|
442
|
+
- Advances `sequence_index` to the next active member after the releasing owner.
|
|
443
|
+
- Sets `reserved_for` to the member found above, if one exists.
|
|
444
|
+
- Sets `claim_expires_at` when a recipient is reserved, otherwise clears it.
|
|
445
|
+
- Changes state to `reserved`, or `idle` if no other active member exists.
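The release effects above can be sketched as one state update, assuming fencing and handoff validation have already passed and the release event has been appended. `ReleaseMember`, `ReleaseRow`, `findNextActive`, and `applyRelease` are hypothetical names; timestamps are epoch milliseconds, and setting `sequence_index` to the reserved member's ordinal is one reading of "advances".

```typescript
interface ReleaseMember {
  agentId: string;
  ordinal: number;
  active: boolean;
}

interface ReleaseRow {
  owner: string | null;
  reservedFor: string | null;
  leaseId: string | null;
  sequenceIndex: number;
  pendingHandoffEventSeq: number | null;
  claimExpiresAt: number | null;
  state: "owned" | "reserved" | "idle";
}

// Next active member after `owner` in ordinal order, wrapping around.
function findNextActive(members: readonly ReleaseMember[], owner: string | null): ReleaseMember | null {
  const ordered = members.slice().sort((a, b) => a.ordinal - b.ordinal);
  const start = ordered.findIndex((m) => m.agentId === owner);
  if (start === -1) return null;
  for (let step = 1; step < ordered.length; step++) {
    const candidate = ordered[(start + step) % ordered.length];
    if (candidate !== undefined && candidate.active) return candidate;
  }
  return null;
}

function applyRelease(
  room: ReleaseRow,
  members: readonly ReleaseMember[],
  releaseEventSeq: number, // event_seq of the already-appended release event
  now: number,
  claimTtlMs: number,
): ReleaseRow {
  const next = findNextActive(members, room.owner);
  return {
    owner: null, // ownership ends here; the recipient must still claim
    leaseId: null,
    pendingHandoffEventSeq: releaseEventSeq,
    reservedFor: next === null ? null : next.agentId,
    sequenceIndex: next === null ? room.sequenceIndex : next.ordinal,
    claimExpiresAt: next === null ? null : now + claimTtlMs,
    state: next === null ? "idle" : "reserved",
  };
}
```

Note that `turn_id` is absent from this update: it changes only when the reserved recipient claims or a takeover commits.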
|
|
446
|
+
|
|
447
|
+
### Explicit Pass
|
|
448
|
+
|
|
449
|
+
```ts
pass_stick({
  room_id,
  lease_id,
  expected_turn_id,
  to_agent_id,
  handoff
})
```
|
|
458
|
+
|
|
459
|
+
Same handoff validation as `release_stick`. In the MVP, `to_agent_id` must already be an active member of the room. Passing to non-members is deferred until there is an explicit invite or discovery story.
|
|
460
|
+
|
|
461
|
+
Effects:
|
|
462
|
+
|
|
463
|
+
- Appends a pass event containing the full `handoff`.
|
|
464
|
+
- Stores that event's `event_seq` as `pending_handoff_event_seq`.
|
|
465
|
+
- Updates the passing owner's `last_seen_at`.
|
|
466
|
+
- Clears current owner and invalidates the current lease.
|
|
467
|
+
- Sets `reserved_for = to_agent_id`.
|
|
468
|
+
- Sets `sequence_index` to the target agent's ordinal, so the default sequence resumes from the passed-to agent after they release.
|
|
469
|
+
- Sets `claim_expires_at`.
|
|
470
|
+
- Changes state to `reserved`.
|
|
471
|
+
|
|
472
|
+
If the target misses the claim timeout, another active member may take over. The immediately prior owner should not be the takeover winner while any other active member can take it.
|
|
473
|
+
|
|
474
|
+
### Takeover
|
|
475
|
+
|
|
476
|
+
Another active member may take over when the expected owner or reserved recipient has failed to respond:
|
|
477
|
+
|
|
478
|
+
```ts
takeover_stick({
  room_id,
  expected_turn_id,
  reason
})
```
|
|
485
|
+
|
|
486
|
+
The takeover call itself refreshes the caller's presence before eligibility is checked. Other members' activity is still evaluated from their existing `last_seen_at` values. This lets a returning active harness recover a timed-out room with one explicit operation, while the prior-owner guard still depends on whether some other member has been active recently.
|
|
487
|
+
|
|
488
|
+
Allowed when:
|
|
489
|
+
|
|
490
|
+
- room is `reserved` and `claim_expires_at` has passed, or
|
|
491
|
+
- room is `owned` and `lease_expires_at` has passed, or
|
|
492
|
+
- room is `reserved` and the reserved recipient's exact process is known gone, or
|
|
493
|
+
- room is `owned` and the owner's exact process is known gone, or
|
|
494
|
+
- room is `stale_owner`.
|
|
495
|
+
|
|
496
|
+
In timeout cases, an active member other than the current owner or reserved recipient may attempt takeover. The previous owner or reserved recipient is not revoked merely because a timeout elapsed; they are revoked only if another agent's `takeover_stick` transaction commits first.
|
|
497
|
+
|
|
498
|
+
In process-gone cases, the server has positive evidence that the exact spawning process has exited. The dead owner or dead reserved recipient is therefore immediately ineligible to reclaim, heartbeat, release, or pass. Ownership still transfers only by explicit `takeover_stick`; the server never auto-promotes another member in the background.
|
|
499
|
+
|
|
500
|
+
For claim timeouts, there is one additional anti-monopoly guard: the immediately prior owner, identified by the pending handoff event's `from_agent_id`, is not eligible to take over while any other active member is eligible. If no other active member is available, the prior owner may take over rather than deadlocking the room. This preserves the important "do not immediately grab the stick back" behavior without adding full round-fairness state.
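The eligibility rules in this section, including the prior-owner guard, can be sketched as a predicate. `TakeoverView` and `canTakeOver` are hypothetical names, timestamps are epoch milliseconds, and the caller is assumed to be an active member already. The plan states the guard for claim timeouts; applying it to the `recipient_gone` case as well is an assumption made to keep the sketch short.

```typescript
// Hypothetical read-only view assembled inside the takeover transaction.
interface TakeoverView {
  owner: string | null;
  reservedFor: string | null;
  leaseExpiresAt: number | null;
  claimExpiresAt: number | null;
  ownerProcessGone: boolean;
  recipientProcessGone: boolean;
  priorOwner: string | null; // from_agent_id of the pending handoff event
}

function canTakeOver(
  view: TakeoverView,
  caller: string,
  activeMembers: readonly string[],
  now: number,
): boolean {
  if (view.owner !== null) {
    const opened =
      view.ownerProcessGone ||
      (view.leaseExpiresAt !== null && now >= view.leaseExpiresAt);
    // The current owner renews via heartbeat rather than taking over.
    return opened && caller !== view.owner;
  }
  if (view.reservedFor !== null) {
    const opened =
      view.recipientProcessGone ||
      (view.claimExpiresAt !== null && now >= view.claimExpiresAt);
    // The reserved recipient claims via wait_for_turn instead.
    if (!opened || caller === view.reservedFor) return false;
    if (caller === view.priorOwner) {
      // Anti-monopoly guard: the prior owner wins only when nobody else can.
      const others = activeMembers.filter(
        (m) => m !== caller && m !== view.reservedFor,
      );
      return others.length === 0;
    }
    return true;
  }
  return false; // idle rooms are claimed, not taken over
}
```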
|
|
501
|
+
|
|
502
|
+
Effects:
|
|
503
|
+
|
|
504
|
+
- Atomically re-reads the room and verifies takeover eligibility (timeout or proven process exit).
|
|
505
|
+
- Atomically increments `turn_id`.
|
|
506
|
+
- Issues a new `lease_id`.
|
|
507
|
+
- Sets `owner = agent_id`.
|
|
508
|
+
- Updates the caller's `last_seen_at`.
|
|
509
|
+
- Clears `reserved_for`.
|
|
510
|
+
- Clears `pending_handoff_event_seq`.
|
|
511
|
+
- Records the previous owner or reserved recipient as revoked for that turn.
|
|
512
|
+
- Appends a takeover event with `reason`.
|
|
513
|
+
|
|
514
|
+
The result includes the new `turn_id` and `lease_id`. It does not include a handoff; the prior owner or reserved recipient never produced one. The new owner relies on `get_room_events` to reconstruct context from the most recent handoffs.
|
|
515
|
+
|
|
516
|
+
Old agents cannot mutate the room after takeover because their `lease_id` and `turn_id` no longer match.
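
The eligibility rules above can be sketched as a pure function. This is a minimal sketch, not the package's implementation: `RoomRow`, `TakeoverRequest`, and `canTakeOver` are hypothetical names, timestamps are assumed to be epoch milliseconds, and the prior-owner guard is assumed to apply only when the claim timeout is the qualifying condition.

```ts
// Hypothetical shapes for this sketch; timestamps are epoch milliseconds.
interface RoomRow {
  state: "idle" | "owned" | "reserved" | "stale_owner";
  owner: string | null;
  reservedFor: string | null;
  leaseExpiresAt: number | null;
  claimExpiresAt: number | null;
  priorOwner: string | null; // from_agent_id of the pending handoff event
}

interface TakeoverRequest {
  callerId: string;
  now: number;
  activeMemberIds: string[]; // includes the caller, whose presence was just refreshed
  holderProcessGone: boolean; // positive evidence only, never a pid-only guess
}

type Decision = { allowed: boolean; why: string };

function canTakeOver(room: RoomRow, req: TakeoverRequest): Decision {
  const holder = room.state === "reserved" ? room.reservedFor : room.owner;
  if (room.state === "idle" || holder === null) {
    return { allowed: false, why: "room has no holder to replace" };
  }
  if (req.activeMemberIds.indexOf(req.callerId) === -1) {
    return { allowed: false, why: "caller is not an active member" };
  }
  if (req.callerId === holder) {
    return { allowed: false, why: "caller is the current holder" };
  }

  const claimTimedOut =
    room.state === "reserved" && room.claimExpiresAt !== null && req.now > room.claimExpiresAt;
  const leaseTimedOut =
    room.state === "owned" && room.leaseExpiresAt !== null && req.now > room.leaseExpiresAt;
  const eligible =
    claimTimedOut || leaseTimedOut || req.holderProcessGone || room.state === "stale_owner";
  if (!eligible) {
    return { allowed: false, why: "no timeout or process-gone condition holds" };
  }

  // Prior-owner guard (assumed to apply to claim timeouts only): the agent that
  // just released or passed must yield while another active candidate exists.
  if (claimTimedOut && req.callerId === room.priorOwner) {
    const others = req.activeMemberIds.filter((id) => id !== req.callerId && id !== holder);
    if (others.length > 0) {
      return { allowed: false, why: "prior owner must yield to another active member" };
    }
  }
  return { allowed: true, why: "eligible" };
}
```

In a real server this check would run inside the same write transaction that increments `turn_id`, so the decision and the commit see the same row.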
|
|
517
|
+
|
|
518
|
+
## MCP Tool Surface
|
|
519
|
+
|
|
520
|
+
MVP tools:
|
|
521
|
+
|
|
522
|
+
```ts
|
|
523
|
+
list_rooms(input) -> Room[]
|
|
524
|
+
join_path(input) -> JoinPathResult
|
|
525
|
+
wait_for_turn(input) -> WaitForTurnResult
|
|
526
|
+
heartbeat(input) -> HeartbeatResult
|
|
527
|
+
release_stick(input) -> ReleaseStickResult
|
|
528
|
+
pass_stick(input) -> PassStickResult
|
|
529
|
+
takeover_stick(input) -> TakeoverStickResult
|
|
530
|
+
get_room_state(input) -> GetRoomStateResult
|
|
531
|
+
get_room_events(input) -> RoomEvent[]
|
|
532
|
+
```
|
|
533
|
+
|
|
534
|
+
Optional later additions:
|
|
535
|
+
|
|
536
|
+
```ts
|
|
537
|
+
append_note(input) -> AppendNoteResult
|
|
538
|
+
leave_path(input) -> LeavePathResult
|
|
539
|
+
close_room(input) -> CloseRoomResult
|
|
540
|
+
reorder_members(input) -> ReorderMembersResult
|
|
541
|
+
set_room_policy(input) -> SetRoomPolicyResult
|
|
542
|
+
```
|
|
543
|
+
|
|
544
|
+
`append_note` may be useful for non-owners to add side-channel context without claiming ownership. It must not change owner state.
|
|
545
|
+
|
|
546
|
+
`get_room_events` is MVP because it is the recovery mechanism for takeover and the audit trail for the working-memory story. Without it, a new owner after takeover has no way to read prior handoffs.
|
|
547
|
+
|
|
548
|
+
## State Transitions
|
|
549
|
+
|
|
550
|
+
```text
|
|
551
|
+
idle
|
|
552
|
+
wait_for_turn by any active member
|
|
553
|
+
-> owned
|
|
554
|
+
|
|
555
|
+
owned
|
|
556
|
+
heartbeat by owner
|
|
557
|
+
-> owned
|
|
558
|
+
|
|
559
|
+
owned
|
|
560
|
+
release_stick by owner (with valid Handoff)
|
|
561
|
+
-> reserved, if another active member exists
|
|
562
|
+
-> idle, if no other active member exists
|
|
563
|
+
|
|
564
|
+
owned
|
|
565
|
+
pass_stick by owner to an active member (with valid Handoff)
|
|
566
|
+
-> reserved
|
|
567
|
+
|
|
568
|
+
owned
|
|
569
|
+
lease expires
|
|
570
|
+
-> stale_owner/takeover_available as derived state
|
|
571
|
+
|
|
572
|
+
owned
|
|
573
|
+
owning process is known gone
|
|
574
|
+
-> owner_gone/takeover_available as derived state
|
|
575
|
+
|
|
576
|
+
reserved
|
|
577
|
+
reserved_for calls wait_for_turn before any takeover commits
|
|
578
|
+
-> owned
|
|
579
|
+
|
|
580
|
+
reserved
|
|
581
|
+
claim timeout expires
|
|
582
|
+
-> reserved, but takeover_available is returned to other active members
|
|
583
|
+
|
|
584
|
+
reserved
|
|
585
|
+
reserved recipient process is known gone
|
|
586
|
+
-> recipient_gone/takeover_available as derived state
|
|
587
|
+
|
|
588
|
+
reserved
|
|
589
|
+
takeover_stick by another active member after claim timeout
|
|
590
|
+
prior owner is skipped if another candidate exists
|
|
591
|
+
-> owned
|
|
592
|
+
|
|
593
|
+
stale_owner
|
|
594
|
+
owner heartbeat/release/pass before takeover commits
|
|
595
|
+
-> owned/reserved
|
|
596
|
+
|
|
597
|
+
stale_owner
|
|
598
|
+
takeover_stick by another active member
|
|
599
|
+
-> owned
|
|
600
|
+
|
|
601
|
+
owner_gone/recipient_gone
|
|
602
|
+
takeover_stick by another active member
|
|
603
|
+
-> owned
|
|
604
|
+
|
|
605
|
+
owned/reserved/idle/stale_owner/owner_gone/recipient_gone/dormant
|
|
606
|
+
optional later close_room
|
|
607
|
+
-> closed
|
|
608
|
+
|
|
609
|
+
owned/reserved/idle/stale_owner/owner_gone/recipient_gone
|
|
610
|
+
all members inactive or gone and no explicit close
|
|
611
|
+
-> dormant as derived state
|
|
612
|
+
```
|
|
613
|
+
|
|
614
|
+
## Race Condition Prevention
|
|
615
|
+
|
|
616
|
+
The server is the only authority for ownership.
|
|
617
|
+
|
|
618
|
+
Required safety rules:
|
|
619
|
+
|
|
620
|
+
- Store room state in a transactional database.
|
|
621
|
+
- Use row-level locking or a single atomic compare-and-swap update for each room mutation.
|
|
622
|
+
- Require `room_id` for all owner mutations. `agent_id` is derived by the MCP adapter from the connection and supplied to the service layer; it is not a tool input.
|
|
623
|
+
- Require `lease_id` for all owner mutations.
|
|
624
|
+
- Require `expected_turn_id` for all owner mutations.
|
|
625
|
+
- Persist enough process metadata to identify the exact spawning process for each member (`pid` plus `process_started_at`, and preferably `host_id`).
|
|
626
|
+
- Increment `turn_id` whenever an agent is granted ownership.
|
|
627
|
+
- Never reuse `lease_id`.
|
|
628
|
+
- Treat `lease_id` as a fencing token.
|
|
629
|
+
- Treat `lease_expires_at` as takeover eligibility, not automatic lease revocation. A lease becomes stale only when the room's current `(lease_id, turn_id)` no longer matches the caller's values.
|
|
630
|
+
- Treat exact process death as stronger than timeout. If the exact owner or reserved-recipient process is known absent, expose immediate takeover eligibility and mark that member inactive.
|
|
631
|
+
- Never infer process death from `pid` alone.
|
|
632
|
+
- On claim-timeout takeover, reject the immediately prior owner when another active takeover candidate exists.
|
|
633
|
+
- Reject stale mutations with a structured error that includes the current owner and current turn.
|
|
634
|
+
|
|
635
|
+
Example stale mutation response:
|
|
636
|
+
|
|
637
|
+
```json
|
|
638
|
+
{
|
|
639
|
+
"error": "stale_lease",
|
|
640
|
+
"message": "The supplied lease is no longer current for this room.",
|
|
641
|
+
"current_owner": "claude:session-2",
|
|
642
|
+
"current_turn_id": 12,
|
|
643
|
+
"room_state": "owned"
|
|
644
|
+
}
|
|
645
|
+
```
|
|
646
|
+
|
|
647
|
+
Handoff and membership errors use the same structured form:
|
|
648
|
+
|
|
649
|
+
```json
|
|
650
|
+
{
|
|
651
|
+
"error": "invalid_handoff",
|
|
652
|
+
"message": "handoff.next_action must be non-empty",
|
|
653
|
+
"field": "next_action"
|
|
654
|
+
}
|
|
655
|
+
```
|
|
656
|
+
|
|
657
|
+
```json
|
|
658
|
+
{
|
|
659
|
+
"error": "unknown_member",
|
|
660
|
+
"message": "pass_stick target must be an active room member in the MVP.",
|
|
661
|
+
"to_agent_id": "gemini:session-1"
|
|
662
|
+
}
|
|
663
|
+
```
|
|
664
|
+
|
|
665
|
+
Owner mutation error precedence is deterministic:
|
|
666
|
+
|
|
667
|
+
1. If `expected_turn_id` does not match the room's current `turn_id`, return `turn_mismatch`.
|
|
668
|
+
2. If the turn matches but `owner`, `agent_id`, or `lease_id` does not match the current owner epoch, return `stale_lease`.
|
|
669
|
+
|
|
670
|
+
This keeps "I am writing against the wrong epoch" distinct from "I am in the current epoch but do not hold the current fencing token."
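
The precedence rule can be expressed as a small validator. The sketch below is illustrative only: `CurrentEpoch`, `OwnerMutation`, and `checkOwnerMutation` are hypothetical names, and the `turn_mismatch` message text is invented for the example. The error fields mirror the `stale_lease` response shown earlier in this section.

```ts
// Hypothetical shapes; field names in the error mirror the stale_lease example.
interface CurrentEpoch {
  owner: string | null;
  leaseId: string | null;
  turnId: number;
  state: string;
}

interface OwnerMutation {
  agentId: string; // derived from the MCP connection, not supplied as tool input
  leaseId: string;
  expectedTurnId: number;
}

interface MutationError {
  error: "turn_mismatch" | "stale_lease";
  message: string;
  current_owner: string | null;
  current_turn_id: number;
  room_state: string;
}

function checkOwnerMutation(room: CurrentEpoch, req: OwnerMutation): MutationError | null {
  const context = {
    current_owner: room.owner,
    current_turn_id: room.turnId,
    room_state: room.state,
  };
  // 1. Wrong epoch is reported first, even if the lease is also wrong.
  if (req.expectedTurnId !== room.turnId) {
    return {
      error: "turn_mismatch",
      message: "The supplied turn is not the room's current turn.",
      ...context,
    };
  }
  // 2. Right epoch, but the caller does not hold the current fencing token.
  if (room.owner !== req.agentId || room.leaseId !== req.leaseId) {
    return {
      error: "stale_lease",
      message: "The supplied lease is no longer current for this room.",
      ...context,
    };
  }
  return null; // the mutation may proceed
}
```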
|
|
671
|
+
|
|
672
|
+
## Deadlock Prevention
|
|
673
|
+
|
|
674
|
+
The protocol avoids permanent deadlock by combining:
|
|
675
|
+
|
|
676
|
+
- finite claim timeouts for reserved recipients,
|
|
677
|
+
- renewable leases for active owners,
|
|
678
|
+
- exact process-gone detection for owners and reserved recipients when metadata is available,
|
|
679
|
+
- takeover after missed claim or missed lease,
|
|
680
|
+
- prior-owner takeover fallback when no other active candidate exists,
|
|
681
|
+
- explicit stale state,
|
|
682
|
+
- read-only room inspection via `get_room_state` and `get_room_events`.
|
|
683
|
+
|
|
684
|
+
The server does not silently auto-transfer ownership when a lease expires. It marks the state recoverable and requires an explicit `takeover_stick` call. This keeps recovery auditable and prevents surprise parallel work.
|
|
685
|
+
|
|
686
|
+
Because the MVP has no daemon, expiry is evaluated lazily on reads and writes. A room row may still store `state = 'owned'` after `lease_expires_at`; `get_room_state` and `wait_for_turn` should report the derived state as `stale_owner` or `takeover_available`. The next successful heartbeat, release, pass, or takeover transaction writes the new projected state.
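
Lazy expiry amounts to computing a view over the stored row at read time. The following is a minimal sketch under stated assumptions: `StoredRoom`, `RoomView`, and `deriveRoomView` are hypothetical names, timestamps are assumed to be ISO 8601 strings, and takeover availability is modeled as a separate boolean rather than a state value.

```ts
// Hypothetical shapes; timestamps are assumed to be ISO 8601 strings.
interface StoredRoom {
  state: string; // value persisted in path_rooms.state
  leaseExpiresAt: string | null;
  claimExpiresAt: string | null;
}

interface RoomView {
  state: string; // derived state reported to callers
  takeoverAvailable: boolean;
}

function isPast(iso: string | null, now: Date): boolean {
  return iso !== null && Date.parse(iso) < now.getTime();
}

// Derives the reported state without writing anything back to the database.
function deriveRoomView(row: StoredRoom, now: Date, holderProcessGone: boolean): RoomView {
  if (row.state === "owned") {
    // Exact process death is stronger than a timeout, so it is checked first.
    if (holderProcessGone) return { state: "owner_gone", takeoverAvailable: true };
    if (isPast(row.leaseExpiresAt, now)) return { state: "stale_owner", takeoverAvailable: true };
  }
  if (row.state === "reserved") {
    if (holderProcessGone) return { state: "recipient_gone", takeoverAvailable: true };
    // The room stays reserved after claim timeout; only availability changes.
    if (isPast(row.claimExpiresAt, now)) return { state: "reserved", takeoverAvailable: true };
  }
  return { state: row.state, takeoverAvailable: row.state === "stale_owner" };
}
```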
|
|
687
|
+
|
|
688
|
+
## Multi-Process Concurrency
|
|
689
|
+
|
|
690
|
+
Agent harnesses routinely run in multiple terminal tabs, split panes, or parallel sessions. Each harness typically spawns its own MCP server subprocess. The server design must therefore support many concurrent server processes sharing state.
|
|
691
|
+
|
|
692
|
+
The model is **shared database, no daemon**. Every server process opens the same SQLite file. Coordination is purely through the database.
|
|
693
|
+
|
|
694
|
+
### SQLite configuration
|
|
695
|
+
|
|
696
|
+
Every connection must apply these pragmas:
|
|
697
|
+
|
|
698
|
+
```sql
|
|
699
|
+
PRAGMA journal_mode = WAL; -- concurrent readers + single writer, no reader blocking
|
|
700
|
+
PRAGMA synchronous = NORMAL; -- durable enough for coordination state, faster than FULL
|
|
701
|
+
PRAGMA busy_timeout = 5000; -- wait up to 5s when another process holds the write lock
|
|
702
|
+
PRAGMA foreign_keys = ON;
|
|
703
|
+
```
|
|
704
|
+
|
|
705
|
+
All write transactions start with `BEGIN IMMEDIATE` so the write lock is acquired up front rather than through a mid-transaction upgrade. This avoids `SQLITE_BUSY` failures during lock promotion under contention.
|
|
706
|
+
|
|
707
|
+
Every mutation re-reads the relevant room row inside its transaction and verifies fencing conditions (`lease_id`, `expected_turn_id`, membership, timeout eligibility) before committing. This makes the "two processes see stale state and both try to claim" race impossible: one commits, the other re-reads and returns `not_yet`.
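
A sketch of the re-read-inside-the-transaction pattern follows. It assumes a driver with a synchronous `exec(sql)` method (better-sqlite3's `Database#exec` has that shape); `SqlExecutor`, `RoomStore`, `withImmediateTransaction`, and `tryClaim` are hypothetical names introduced for illustration.

```ts
// Minimal driver surface assumed by this sketch.
interface SqlExecutor {
  exec(sql: string): unknown;
}

// Hypothetical data-access layer; a real one would issue SELECT/UPDATE statements.
interface RoomStore {
  readRoom(roomId: string): { state: string; owner: string | null; turnId: number };
  grantOwnership(roomId: string, agentId: string, turnId: number): void;
}

type ClaimResult = "your_turn" | "not_yet";

// Takes the write lock up front, commits on success, rolls back on any error.
function withImmediateTransaction<T>(db: SqlExecutor, body: () => T): T {
  db.exec("BEGIN IMMEDIATE");
  try {
    const result = body();
    db.exec("COMMIT");
    return result;
  } catch (err) {
    db.exec("ROLLBACK");
    throw err;
  }
}

function tryClaim(db: SqlExecutor, store: RoomStore, roomId: string, agentId: string): ClaimResult {
  return withImmediateTransaction<ClaimResult>(db, () => {
    // Re-read under the write lock: whatever the caller saw while polling may be stale.
    const room = store.readRoom(roomId);
    if (room.state !== "idle") return "not_yet";
    store.grantOwnership(roomId, agentId, room.turnId + 1);
    return "your_turn";
  });
}
```

Because the second claimant only reads after it holds the write lock, it always observes the first claimant's committed row.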
|
|
708
|
+
|
|
709
|
+
### wait_for_turn across processes
|
|
710
|
+
|
|
711
|
+
`wait_for_turn` is implemented as bounded polling. Each server process polls `path_rooms` and `room_events` for the requested room at a short interval (250 ms recommended) up to `max_wait_ms`. Changes made by any other process become visible on the next poll.
|
|
712
|
+
|
|
713
|
+
A cursor on the most recent monotonic `event_seq` lets the server return immediately when a newer event appears, so consecutive long polls do not redo work for events that were already seen.
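
The bounded polling loop might look like the sketch below. It is not the package's code: `pollUntil`, `PollOptions`, and the injected `now` and `sleep` functions are hypothetical, chosen so the loop can be exercised without real timers. The `check` callback stands in for one re-read of `path_rooms` and `room_events`.

```ts
type PollResult<T> = { status: "done"; value: T } | { status: "not_yet" };

interface PollOptions {
  maxWaitMs: number; // total polling budget, e.g. wait_for_turn_max_wait_ms
  pollMs: number; // cadence between checks, e.g. wait_for_turn_poll_ms
  now: () => number; // injected clock, epoch milliseconds
  sleep: (ms: number) => Promise<void>; // injected delay
}

// Calls check() until it yields a value or the budget is exhausted.
async function pollUntil<T>(check: () => T | null, opts: PollOptions): Promise<PollResult<T>> {
  const deadline = opts.now() + opts.maxWaitMs;
  for (;;) {
    const value = check();
    if (value !== null) return { status: "done", value };
    const remaining = deadline - opts.now();
    if (remaining <= 0) return { status: "not_yet" };
    // Never sleep past the deadline, so the call returns within its budget.
    await opts.sleep(Math.min(opts.pollMs, remaining));
  }
}
```

A caller that receives `not_yet` simply calls again, which matches the "polling budget" meaning of the max wait.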
|
|
714
|
+
|
|
715
|
+
### Limitations
|
|
716
|
+
|
|
717
|
+
- **Network filesystems.** SQLite locking is unreliable on NFS and some SMB implementations. The server must detect a non-local filesystem at startup and fail with a clear error that suggests pointing `TALKING_STICK_DATA_DIR` at a local path.
|
|
718
|
+
- **Cross-host coordination.** The MVP is explicitly single-host. Multi-host deployments require a different backend (Postgres is the natural upgrade, with `FOR UPDATE` locks replacing SQLite's write lock).
|
|
719
|
+
|
|
720
|
+
### Optional future: daemon mode
|
|
721
|
+
|
|
722
|
+
If polling overhead ever becomes a concern at scale, a future version may add an optional local daemon:
|
|
723
|
+
|
|
724
|
+
- The first client to call `join_path` starts a daemon if none is running.
|
|
725
|
+
- The daemon holds an exclusive writer connection to the SQLite file.
|
|
726
|
+
- Clients connect via a Unix domain socket at `<data_dir>/server.sock`.
|
|
727
|
+
- The daemon pushes state changes to waiting clients directly, eliminating polling.
|
|
728
|
+
- The daemon self-terminates after a configurable idle period.
|
|
729
|
+
|
|
730
|
+
This is not needed for MVP. Polling is sufficient for typical multi-tab workflows.
|
|
731
|
+
|
|
732
|
+
## Timeout Policy
|
|
733
|
+
|
|
734
|
+
Recommended defaults (product scale, sized for real agent work rather than chat turns):
|
|
735
|
+
|
|
736
|
+
```ts
|
|
737
|
+
owner_lease_ttl_ms = 45 * 60 * 1000; // 45 minutes
|
|
738
|
+
heartbeat_interval_ms = 5 * 60 * 1000; // 5 minutes
|
|
739
|
+
claim_ttl_ms = 20 * 60 * 1000; // 20 minutes
|
|
740
|
+
wait_for_turn_max_wait_ms = 30 * 1000; // 30 seconds
|
|
741
|
+
wait_for_turn_poll_ms = 250; // transport polling cadence
|
|
742
|
+
presence_ttl_ms = 4 * 60 * 60 * 1000; // 4 hours
|
|
743
|
+
```
|
|
744
|
+
|
|
745
|
+
Timeout meanings:
|
|
746
|
+
|
|
747
|
+
- `wait_for_turn` max wait is only a polling budget. The client should call again if it returns `not_yet`.
|
|
748
|
+
- `wait_for_turn_poll_ms` is how often a waiting process re-reads room state during a single long poll.
|
|
749
|
+
- `claim_ttl` is how long a reserved recipient has exclusive first right of refusal before others may take over.
|
|
750
|
+
- `owner_lease_ttl` is how long an owner may remain silent before takeover becomes possible.
|
|
751
|
+
- `presence_ttl` determines whether a member is active for sequence selection and takeover eligibility.
|
|
752
|
+
|
|
753
|
+
Rationale for these defaults: a real agent turn often runs 20-30 minutes (plan-and-edit, build-and-verify, review-and-respond), and a human collaborator walking through a few rooms may easily be idle for an hour without being "gone." Earlier drafts inherited chat-scale defaults (5-minute lease, 10-minute presence), which would silently open takeover windows mid-turn. The selected values accept a slower takeover response in exchange for not interrupting legitimate long work; operators who want a faster response can shorten them via per-room policy once that ships.
|
|
754
|
+
|
|
755
|
+
These timers are fallback recovery, not the only recovery path. When the server can prove that the exact spawning process is gone, it should expose `owner_gone` or `recipient_gone` immediately instead of waiting for timeout. Ownership timings (lease, claim, presence) are the product-facing knobs; transport timings (wait max, poll cadence) are unchanged from earlier drafts because they only affect polling efficiency, not ownership semantics.
|
|
756
|
+
|
|
757
|
+
Per-room policy is expected to become a first-class need quickly (batch workflows want longer TTLs; interactive workflows want shorter claims). Storing timeouts on the room record rather than as global server defaults is the recommended near-term extension, enabled via `set_room_policy`.
|
|
758
|
+
|
|
759
|
+
Even before per-room policy ships, the effective policy must be part of the `join_path` response so holders can schedule heartbeats from server truth rather than from compiled-in defaults.
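
A holder could derive its heartbeat schedule from the returned policy roughly as follows. This is only a sketch: the exact shape of the policy object in the `join_path` response is an assumption, with field names borrowed from the defaults listed above, and `planHeartbeats` is a hypothetical helper.

```ts
// Assumed subset of the effective policy returned by join_path.
interface EffectivePolicy {
  owner_lease_ttl_ms: number;
  heartbeat_interval_ms: number;
}

interface HeartbeatPlan {
  intervalMs: number; // how often the holder should heartbeat
  intervalsPerLease: number; // whole heartbeat intervals that fit in one lease TTL
}

function planHeartbeats(policy: EffectivePolicy): HeartbeatPlan {
  const { owner_lease_ttl_ms: ttl, heartbeat_interval_ms: interval } = policy;
  if (interval <= 0 || interval >= ttl) {
    // A heartbeat interval at or above the lease TTL would let the lease lapse mid-turn.
    throw new Error("heartbeat_interval_ms must be positive and shorter than owner_lease_ttl_ms");
  }
  return { intervalMs: interval, intervalsPerLease: Math.floor(ttl / interval) };
}
```

With the recommended defaults (45-minute lease, 5-minute heartbeat), nine heartbeat intervals fit in one lease, which leaves wide slack for a delayed or dropped heartbeat.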
|
|
760
|
+
|
|
761
|
+
## Persistence Model
|
|
762
|
+
|
|
763
|
+
### File Layout
|
|
764
|
+
|
|
765
|
+
State lives in a single SQLite database at a predictable per-user data directory:
|
|
766
|
+
|
|
767
|
+
- **Linux and macOS**: `$XDG_DATA_HOME/talking-stick/rooms.sqlite` if `XDG_DATA_HOME` is set, otherwise `~/.local/share/talking-stick/rooms.sqlite`.
|
|
768
|
+
- **Windows**: `%APPDATA%\talking-stick\rooms.sqlite`.
|
|
769
|
+
|
|
770
|
+
For this tool, the shared Linux/macOS default is intentional. Talking Stick is a CLI/MCP developer utility, and using `~/.local/share` on both Unix-like platforms keeps scripts, docs, backups, and troubleshooting consistent across machines.
|
|
771
|
+
|
|
772
|
+
Override:
|
|
773
|
+
|
|
774
|
+
- `TALKING_STICK_DATA_DIR` sets an explicit directory. If set, the database lives at `$TALKING_STICK_DATA_DIR/rooms.sqlite`. This is the recommended way to isolate test databases and to keep per-project state when that is desired.
|
|
775
|
+
|
|
776
|
+
The server creates the directory on first run if it does not exist. All rooms (across all workspaces on the host) share the single database file — keeping them together is what makes ancestor lookup a simple indexed query rather than a filesystem traversal.
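
The path rules above can be captured in one pure function. The sketch below takes the platform, environment, and home directory as inputs so it can be tested without touching the real host; `HostInfo` and `resolveDatabasePath` are hypothetical names, and failing when `APPDATA` is unset is an assumption, since the plan does not specify a Windows fallback.

```ts
interface HostInfo {
  platform: "linux" | "darwin" | "win32";
  env: Record<string, string | undefined>;
  homeDir: string;
}

function resolveDatabasePath(host: HostInfo): string {
  const sep = host.platform === "win32" ? "\\" : "/";
  const join = (...parts: string[]): string => parts.join(sep);

  // The explicit override wins on every platform.
  const override = host.env["TALKING_STICK_DATA_DIR"];
  if (override) return join(override, "rooms.sqlite");

  if (host.platform === "win32") {
    const appData = host.env["APPDATA"];
    if (!appData) throw new Error("APPDATA is not set"); // assumed behavior
    return join(appData, "talking-stick", "rooms.sqlite");
  }

  // Linux and macOS share the XDG-style location.
  const xdg = host.env["XDG_DATA_HOME"];
  const base = xdg ? xdg : join(host.homeDir, ".local", "share");
  return join(base, "talking-stick", "rooms.sqlite");
}
```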
|
|
777
|
+
|
|
778
|
+
### Schema
|
|
779
|
+
|
|
780
|
+
```sql
|
|
781
|
+
CREATE TABLE path_rooms (
|
|
782
|
+
room_id TEXT PRIMARY KEY,
|
|
783
|
+
canonical_path TEXT NOT NULL,
|
|
784
|
+
sequence_index INTEGER NOT NULL DEFAULT 0,
|
|
785
|
+
owner TEXT,
|
|
786
|
+
reserved_for TEXT,
|
|
787
|
+
pending_handoff_event_seq INTEGER,
|
|
788
|
+
turn_id INTEGER NOT NULL DEFAULT 0,
|
|
789
|
+
lease_id TEXT,
|
|
790
|
+
lease_expires_at TEXT,
|
|
791
|
+
claim_expires_at TEXT,
|
|
792
|
+
state TEXT NOT NULL,
|
|
793
|
+
updated_at TEXT NOT NULL,
|
|
794
|
+
UNIQUE (canonical_path)
|
|
795
|
+
);
|
|
796
|
+
|
|
797
|
+
CREATE INDEX path_rooms_canonical_path_idx
|
|
798
|
+
ON path_rooms (canonical_path);
|
|
799
|
+
|
|
800
|
+
CREATE TABLE room_members (
|
|
801
|
+
room_id TEXT NOT NULL,
|
|
802
|
+
agent_id TEXT NOT NULL,
|
|
803
|
+
ordinal INTEGER NOT NULL,
|
|
804
|
+
joined_at TEXT NOT NULL,
|
|
805
|
+
last_seen_at TEXT NOT NULL,
|
|
806
|
+
status TEXT NOT NULL,
|
|
807
|
+
host_id TEXT,
|
|
808
|
+
pid INTEGER,
|
|
809
|
+
process_started_at TEXT,
|
|
810
|
+
session_kind TEXT NOT NULL,
|
|
811
|
+
display_name TEXT,
|
|
812
|
+
PRIMARY KEY (room_id, agent_id),
|
|
813
|
+
FOREIGN KEY (room_id) REFERENCES path_rooms(room_id)
|
|
814
|
+
);
|
|
815
|
+
|
|
816
|
+
CREATE TABLE room_events (
|
|
817
|
+
event_seq INTEGER PRIMARY KEY AUTOINCREMENT,
|
|
818
|
+
event_id TEXT NOT NULL UNIQUE,
|
|
819
|
+
room_id TEXT NOT NULL,
|
|
820
|
+
turn_id INTEGER NOT NULL,
|
|
821
|
+
event_type TEXT NOT NULL, -- claim | release | pass | takeover | close
|
|
822
|
+
from_agent_id TEXT,
|
|
823
|
+
to_agent_id TEXT,
|
|
824
|
+
handoff_json TEXT, -- NULL for claim | takeover | close
|
|
825
|
+
reason TEXT, -- populated on takeover events
|
|
826
|
+
created_at TEXT NOT NULL,
|
|
827
|
+
FOREIGN KEY (room_id) REFERENCES path_rooms(room_id)
|
|
828
|
+
);
|
|
829
|
+
|
|
830
|
+
CREATE INDEX room_events_room_seq_idx
|
|
831
|
+
ON room_events (room_id, event_seq);
|
|
832
|
+
|
|
833
|
+
CREATE INDEX room_events_room_turn_idx
|
|
834
|
+
ON room_events (room_id, turn_id);
|
|
835
|
+
```
|
|
836
|
+
|
|
837
|
+
Ancestor lookup uses the `canonical_path` index: given a request path `P` and resolved workspace root `W`, generate ancestor paths from `P` up to `W` in code and issue a single `IN` query against `canonical_path`, picking the longest match. At small scale this is microsecond-fast; at very large scale consider materialized paths.
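
A minimal sketch of that lookup is shown here, assuming canonical absolute POSIX paths without trailing slashes. `ancestorPaths` and `buildAncestorQuery` are hypothetical helpers; the SQL targets the `path_rooms` table defined above and relies on SQLite's built-in `length()` function to pick the longest match.

```ts
// Returns requestPath and each ancestor up to and including workspaceRoot, deepest first.
function ancestorPaths(requestPath: string, workspaceRoot: string): string[] {
  const prefix = workspaceRoot === "/" ? "/" : workspaceRoot + "/";
  if (requestPath !== workspaceRoot && requestPath.indexOf(prefix) !== 0) {
    throw new Error("request path is not inside the workspace root");
  }
  const out: string[] = [];
  let current = requestPath;
  for (;;) {
    out.push(current);
    if (current === workspaceRoot) break;
    // Drop the last segment; an empty result means we reached the filesystem root.
    current = current.slice(0, current.lastIndexOf("/")) || "/";
  }
  return out;
}

// Builds one parameterized IN query; the longest canonical_path wins.
function buildAncestorQuery(paths: string[]): { sql: string; params: string[] } {
  const placeholders = paths.map(() => "?").join(", ");
  const sql =
    "SELECT room_id, canonical_path FROM path_rooms " +
    "WHERE canonical_path IN (" + placeholders + ") " +
    "ORDER BY length(canonical_path) DESC LIMIT 1";
  return { sql, params: paths };
}
```

For a request at `/repo/packages/foo` with root `/repo`, the candidates are `/repo/packages/foo`, `/repo/packages`, and `/repo`, so an existing `/repo` room is found without walking the filesystem.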
|
|
838
|
+
|
|
839
|
+
`room_events` is append-only. `path_rooms` is a projection of the event stream for fast reads. The event log is also the takeover recovery context: a new owner after `takeover_stick` reads recent events to reconstruct what was happening before the prior owner went silent.
|
|
840
|
+
|
|
841
|
+
`room_members` stores the process metadata needed for exact local liveness checks. If `pid` and `process_started_at` are unavailable on a platform, those columns may be null and the server falls back to timeout-based recovery for that member rather than making unsafe pid-only guesses.
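
The "exact process" rule can be sketched as a three-way classification that only reports `gone` on positive evidence. Everything here is illustrative: `ProcessIdentity`, `classifyLiveness`, and the injected `lookupStart` function are hypothetical, and comparing start times by exact string equality is an assumption about how they are recorded.

```ts
interface ProcessIdentity {
  hostId: string | null;
  pid: number | null;
  processStartedAt: string | null;
}

type Liveness = "alive" | "gone" | "unknown";

// lookupStart returns the start time of the local process with that pid,
// or null when no such process exists. It is injected; how it is obtained is platform-specific.
function classifyLiveness(
  recorded: ProcessIdentity,
  localHostId: string,
  lookupStart: (pid: number) => string | null,
): Liveness {
  // Missing metadata: fall back to timeout-based recovery.
  if (recorded.pid === null || recorded.processStartedAt === null) return "unknown";
  // A member recorded on another host cannot be checked locally.
  if (recorded.hostId !== null && recorded.hostId !== localHostId) return "unknown";

  const observedStart = lookupStart(recorded.pid);
  if (observedStart === null) return "gone"; // no process with that pid
  // Same pid but a different start time means the pid was reused by another process.
  return observedStart === recorded.processStartedAt ? "alive" : "gone";
}
```

Only a `gone` result should expose `owner_gone` or `recipient_gone`; `unknown` leaves the member subject to the normal lease and claim timers.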
|
|
842
|
+
|
|
843
|
+
Heartbeats update `path_rooms.lease_expires_at` and `updated_at`; they are not written to `room_events` by default. Otherwise waiters would wake up on routine heartbeat traffic.
|
|
844
|
+
|
|
845
|
+
`path_rooms.pending_handoff_event_seq` points at the release or pass event that should be returned to the next successful claimant. It is cleared when a claim or takeover succeeds.
|
|
846
|
+
|
|
847
|
+
## Agent Operating Instructions
|
|
848
|
+
|
|
849
|
+
Harnesses using this MCP server should follow these rules:
|
|
850
|
+
|
|
851
|
+
1. Before joining, call `list_rooms` to see what is already happening in the workspace.
|
|
852
|
+
2. Join using `join_path`. Accept the resolved `room_id` and `canonical_path` the server returns, even if they differ from what you asked for, because workspace root resolution or ancestor lookup may have attached you to a parent room.
|
|
853
|
+
3. Do not perform shared task work unless `wait_for_turn` returns `your_turn`.
|
|
854
|
+
4. When you receive `your_turn`, read the attached `handoff` before doing anything else. Load `artifacts[]` entries directly rather than re-exploring the workspace.
|
|
855
|
+
5. While working, heartbeat periodically.
|
|
856
|
+
6. Include `room_id`, `lease_id`, and `expected_turn_id` on every owner mutation. Do not send an `agent_id`; the server derives it from the MCP connection.
|
|
857
|
+
7. If any owner mutation returns `stale_lease`, `turn_mismatch`, or `unknown_member`, stop working and read current state.
|
|
858
|
+
8. To release the stick, construct a `Handoff` with a truthful `status` and an actionable `next_action`. Include `artifacts[]` entries when the next agent needs to load specific files or line ranges.
|
|
859
|
+
9. Use `release_stick` to continue the default sequence.
|
|
860
|
+
10. Use `pass_stick` to choose a specific active member.
|
|
861
|
+
11. Use `takeover_stick` only after `wait_for_turn` or `get_room_state` reports takeover eligibility (an expired claim or lease, or a holder process that is known gone). Include a reason. After a successful takeover, call `get_room_events` to reconstruct context from the most recent handoffs.
|
|
862
|
+
|
|
863
|
+
Suggested format for the free-text `status` field:
|
|
864
|
+
|
|
865
|
+
```md
|
|
866
|
+
What I did:
|
|
867
|
+
What I learned:
|
|
868
|
+
Open risks:
|
|
869
|
+
```
|
|
870
|
+
|
|
871
|
+
## Example: Three-Agent Round Robin
|
|
872
|
+
|
|
873
|
+
Members join in this order:
|
|
874
|
+
|
|
875
|
+
```text
|
|
876
|
+
codex
|
|
877
|
+
claude
|
|
878
|
+
gemini
|
|
879
|
+
```
|
|
880
|
+
|
|
881
|
+
Flow:
|
|
882
|
+
|
|
883
|
+
```text
|
|
884
|
+
codex claims idle room and writes initial plan.
|
|
885
|
+
codex releases stick with:
|
|
886
|
+
status: "wrote initial plan covering sections 1-3"
|
|
887
|
+
next_action: "review plan for gaps; suggest additions"
|
|
888
|
+
artifacts: [{ path: "plan.md", role: "review" }]
|
|
889
|
+
room reserves for claude.
|
|
890
|
+
|
|
891
|
+
claude claims reserved stick, receives codex's handoff.
|
|
892
|
+
claude reviews the plan and releases with:
|
|
893
|
+
status: "reviewed plan; section 2 is thin on error handling"
|
|
894
|
+
next_action: "extend error handling coverage in section 2"
|
|
895
|
+
artifacts: [{ path: "plan.md", lines: [45, 78], role: "edit" }]
|
|
896
|
+
room reserves for gemini.
|
|
897
|
+
|
|
898
|
+
gemini claims reserved stick, receives claude's handoff.
|
|
899
|
+
gemini extends section 2 and releases.
|
|
900
|
+
room reserves for codex, continuing the member sequence.
|
|
901
|
+
The cycle continues.
|
|
902
|
+
```
|
|
903
|
+
|
|
904
|
+
## Example: Explicit Skip
|
|
905
|
+
|
|
906
|
+
Default sequence:
|
|
907
|
+
|
|
908
|
+
```text
|
|
909
|
+
codex -> claude -> gemini -> codex
|
|
910
|
+
```
|
|
911
|
+
|
|
912
|
+
If codex wants gemini to review a concurrency detail immediately:
|
|
913
|
+
|
|
914
|
+
```text
|
|
915
|
+
codex pass_stick(
|
|
916
|
+
to_agent_id = gemini,
|
|
917
|
+
handoff = {
|
|
918
|
+
status: "found potential race in claim logic",
|
|
919
|
+
next_action: "assess whether lease_id fencing covers this case",
|
|
920
|
+
artifacts: [{ path: "src/claim.ts", lines: [102, 140], role: "review" }]
|
|
921
|
+
}
|
|
922
|
+
)
|
|
923
|
+
```
|
|
924
|
+
|
|
925
|
+
The room reserves the next turn for gemini, skipping claude. After gemini releases, the default sequence resumes from gemini's position in the member list.
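
Resuming the default sequence from the releasing agent's position can be sketched as a wrap-around scan over the member list. `Member` and `nextInSequence` are hypothetical names, and treating `ordinal` as the sort key for the member sequence is an assumption based on the `room_members` schema.

```ts
interface Member {
  agentId: string;
  ordinal: number; // position in the member sequence
  active: boolean; // presence is still within presence_ttl
}

// Returns the next active member after the releasing agent, or null if none exists.
function nextInSequence(members: Member[], releasingAgentId: string): string | null {
  const ordered = members.slice().sort((a, b) => a.ordinal - b.ordinal);
  const start = ordered.findIndex((m) => m.agentId === releasingAgentId);
  if (start === -1) return null;
  // Walk forward from the releaser's position, wrapping around, skipping inactive members.
  for (let step = 1; step < ordered.length; step++) {
    const candidate = ordered[(start + step) % ordered.length];
    if (candidate.active) return candidate.agentId;
  }
  return null; // no other active member: the room goes idle
}
```

After codex passes to gemini and gemini releases, the scan starts at gemini's ordinal and wraps to codex, so claude's skipped turn is not replayed.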
|
|
926
|
+
|
|
927
|
+
## Example: Missed Recipient
|
|
928
|
+
|
|
929
|
+
```text
|
|
930
|
+
codex holds, then releases; reserved for claude.
|
|
931
|
+
claude does not claim before claim_ttl.
|
|
932
|
+
wait_for_turn now returns takeover_available to other active members.
|
|
933
|
+
codex attempts takeover -- REJECTED while gemini is active, because codex was
|
|
934
|
+
the immediately prior owner.
|
|
935
|
+
gemini calls takeover_stick(reason = "claim timeout expired") -- ACCEPTED.
|
|
936
|
+
server grants gemini a fresh lease and increments turn_id.
|
|
937
|
+
gemini calls get_room_events to read codex's original handoff.
|
|
938
|
+
claude wakes up late and tries to claim -- REJECTED (stale turn).
|
|
939
|
+
```
|
|
940
|
+
|
|
941
|
+
## Example: Crashed Owner
|
|
942
|
+
|
|
943
|
+
```text
|
|
944
|
+
claude owns the stick.
|
|
945
|
+
claude stops heartbeating.
|
|
946
|
+
owner_lease_ttl expires.
|
|
947
|
+
get_room_state reports stale_owner.
|
|
948
|
+
codex calls takeover_stick(reason = "owner lease expired").
|
|
949
|
+
server grants codex a fresh lease and increments turn_id.
|
|
950
|
+
codex calls get_room_events to read the handoff claude received on claim,
|
|
951
|
+
so codex can infer what claude was working on.
|
|
952
|
+
claude later tries to release with its old lease.
|
|
953
|
+
server rejects as stale_lease.
|
|
954
|
+
```
|
|
955
|
+
|
|
956
|
+
## Example: Two-Process Concurrent Claim
|
|
957
|
+
|
|
958
|
+
```text
|
|
959
|
+
Terminal A (claude process) and Terminal B (codex process) both run wait_for_turn
|
|
960
|
+
against an idle room.
|
|
961
|
+
Both processes poll. Both see room_state = idle at the same poll tick.
|
|
962
|
+
Both start a BEGIN IMMEDIATE transaction to claim.
|
|
963
|
+
Process A wins the write lock; process B blocks.
|
|
964
|
+
Process A re-reads the row, verifies idle, increments turn_id, sets owner=claude,
|
|
965
|
+
commits.
|
|
966
|
+
Process B acquires the write lock, re-reads, sees room_state = owned by claude,
|
|
967
|
+
aborts the claim, returns not_yet to its client.
|
|
968
|
+
Process B's client polls again and correctly sees claude as owner.
|
|
969
|
+
```
|
|
970
|
+
|
|
971
|
+
## Design Rationale
|
|
972
|
+
|
|
973
|
+
This section records the reasoning behind the load-bearing choices in this plan. Future maintainers should read it before proposing structural changes.
|
|
974
|
+
|
|
975
|
+
### Why workspace root plus ancestor lookup
|
|
976
|
+
|
|
977
|
+
An earlier draft resolved each request path to a canonical string and looked up rooms by exact match. That failed the common monorepo case: two agents running in `/repo/packages/foo/` and `/repo/packages/bar/` would create separate rooms and never coordinate, even when they were doing related work for the same repo.
|
|
978
|
+
|
|
979
|
+
Resolving to a workspace root before lookup matches the mental model developers already use for `.git`, `CLAUDE.md`, `package.json`, and every other workspace marker in common use. An agent at any depth under `/repo/` automatically joins the `/repo/` room if `/repo/` is the resolved workspace root, and no explicit room identifiers need to be coordinated out of band.
|
|
980
|
+
|
|
981
|
+
Because coordination is operator-initiated, the server does not actively prevent nested rooms. Operators know when they are starting a nested conversation and can request one explicitly via `force_new`.
|
|
982
|
+
|
|
983
|
+
### Why topics are deferred
|
|
984
|
+
|
|
985
|
+
Multiple concurrent conversations at the same path may become useful, for example a review in flight and a triage in parallel. But topics also weaken the central simplicity of the tool: "agents in this workspace share this room."
|
|
986
|
+
|
|
987
|
+
The MVP keeps one default room per workspace path. That makes the agent instructions shorter, avoids accidental topic mismatch, and keeps path membership as the primary mental model. If real usage needs concurrent rooms, topics can be added as an extension without changing the ownership and lease protocol.
|
|
988
|
+
|
|
989
|
+
### Why the handoff is structured and mandatory
|
|
990
|
+
|
|
991
|
+
The protocol's original shape treated ownership transfer as metadata about the lock with a free-text note attached. But in practice the expensive operation in a multi-agent handoff is not the lock transfer — it is context reconstruction by the new owner. Every agent that takes the stick starts by asking "what was happening before I got here?"
|
|
992
|
+
|
|
993
|
+
Making the handoff a structured artifact (`status`, `next_action`, `artifacts[]`) turns ownership transfer into state transfer. The `artifacts[]` entries map directly onto how LLMs consume code — path and optional line range — so the next owner can load exactly the relevant context without rediscovering it.
|
|
994
|
+
|
|
995
|
+
Requiring `status` and `next_action` to be non-empty is a small amount of server-side validation that prevents a large class of low-quality handoffs. Handoff quality still depends on harness behavior, which the protocol cannot enforce, so `join_path` returns a `handoff_template` hint that harnesses can surface to their models. The MVP keeps that template static across rooms to avoid turning prompting policy into room configuration before a concrete need exists.
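
The server-side check described above is small enough to sketch in full. The `Handoff` and `Artifact` shapes are inferred from the examples in this plan, `validateHandoff` is a hypothetical name, and treating whitespace-only strings as empty is an assumption. The error object follows the `invalid_handoff` form shown earlier.

```ts
interface Artifact {
  path: string;
  lines?: [number, number];
  role?: string;
}

interface Handoff {
  status: string;
  next_action: string;
  artifacts?: Artifact[];
}

interface InvalidHandoff {
  error: "invalid_handoff";
  message: string;
  field: "status" | "next_action";
}

// Returns null for an acceptable handoff, or the first structured error found.
function validateHandoff(handoff: Handoff): InvalidHandoff | null {
  const required: Array<"status" | "next_action"> = ["status", "next_action"];
  for (const field of required) {
    // Whitespace-only values are treated as empty (an assumption of this sketch).
    if (handoff[field].trim().length === 0) {
      return {
        error: "invalid_handoff",
        message: "handoff." + field + " must be non-empty",
        field,
      };
    }
  }
  return null;
}
```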
|
|
996
|
+
|
|
997
|
+
### Why takeovers do not carry a handoff
|
|
998
|
+
|
|
999
|
+
A takeover happens precisely because the expected owner did not produce one. Requiring a handoff from the new owner would ask them to speak for the failed owner, which they cannot do truthfully.
|
|
1000
|
+
|
|
1001
|
+
Instead, the event log is the recovery context. `get_room_events` returns recent handoffs, including the one the failed owner received when they took the stick. This is also why `get_room_events` is MVP rather than optional — without it, the takeover recovery story is broken.
|
|
1002
|
+
|
|
1003
|
+
### Why `turn_id` increments on grant, not on release
|
|
1004
|
+
|
|
1005
|
+
Incrementing on grant keeps fencing math trivial: the current `(lease_id, turn_id)` pair always matches exactly one epoch, and release merely ends that epoch without starting a new one. A pending reservation is not a new epoch; it is a waiting room. Incrementing on release would create a window where a slow releaser and a fast new claimant disagree about the current epoch.
|
|
1006
|
+
|
|
1007
|
+
### Why strict round fairness is deferred
|
|
1008
|
+
|
|
1009
|
+
Strict round fairness sounds attractive, but it adds state and policy complexity to the part of the system that must stay easiest to trust. It also creates awkward edge cases: a missed recipient, a single active member, a stale owner, or a deliberate explicit pass can all look like fairness violations even when continuing is the useful behavior.
|
|
1010
|
+
|
|
1011
|
+
The MVP uses ordered release for the normal path and explicit takeover for failure recovery. It also includes a narrow prior-owner guard for claim-timeout takeover, preventing the agent that just released or passed the stick from immediately grabbing it back when another active participant is available. That covers the most important anti-monopoly case without tracking full rounds. If agents later need stronger turn fairness, it can be added as a per-room policy using additional member state.
|
|
1012
|
+
|
|
1013
|
+
### Why takeover is explicit rather than automatic
|
|
1014
|
+
|
|
1015
|
+
The server could auto-transfer ownership the moment a lease expired. It does not, because silent promotions are the source of most "surprise parallel work" incidents in real coordination systems. An explicit `takeover_stick` call requires an agent to name a reason and produces an auditable event, making recovery visible in the log.
|
|
1016
|
+
|
|
1017
|
+
### Why shared SQLite instead of a daemon
|
|
1018
|
+
|
|
1019
|
+
Agent harnesses in this ecosystem spawn MCP servers as subprocesses, typically one per harness invocation. A daemon-based coordination server would require lifecycle management (who starts it, when it shuts down, how it recovers from crashes) that adds operational complexity with little benefit at a single-host, single-user scale.
|
|
1020
|
+
|
|
1021
|
+
SQLite in WAL mode handles concurrent readers and a single writer at low latency without any additional process. Multiple MCP server processes can share the database file without coordination beyond what SQLite already provides. The cost is polling — `wait_for_turn` cannot be push-notified across processes — but a 250 ms poll interval is well within the latency budget for agent-to-agent handoffs.
|
|
1022
|
+
|
|
1023
|
+
A daemon mode remains open as a future optimization if polling becomes a bottleneck. It is not needed for the typical multi-tab workflow the MVP targets.
|
|
1024
|
+
|
|
1025
|
+
### Why `~/.local/share` on Linux and macOS
|
|
1026
|
+
|
|
1027
|
+
Writing coordination state to `~/.talking-stick/` would litter the home directory. Sending macOS to `~/Library/Application Support`, however, makes a CLI-first developer tool harder to script and explain consistently across Unix-like machines.
|
|
1028
|
+
|
|
1029
|
+
The MVP therefore uses the XDG-style location on both Linux and macOS: `$XDG_DATA_HOME/talking-stick` when set, otherwise `~/.local/share/talking-stick`. Windows uses `%APPDATA%\talking-stick`. The `TALKING_STICK_DATA_DIR` override exists for users who want per-project isolation, test databases, or a different local disk.
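
The lookup order above can be sketched as a pure function. The function name and the `DataDirInputs` shape are hypothetical, and the fallback used when `APPDATA` is unset on Windows is an assumption that this plan does not specify.

```typescript
// Inputs are injected so the function is easy to test on any host platform.
interface DataDirInputs {
  platform: string; // e.g. the value of process.platform
  env: Record<string, string | undefined>;
  homeDir: string;
}

function resolveDataDir({ platform, env, homeDir }: DataDirInputs): string {
  // 1. An explicit override always wins.
  const override = env["TALKING_STICK_DATA_DIR"];
  if (override) return override;

  // 2. Windows uses %APPDATA%\talking-stick.
  if (platform === "win32") {
    // Assumption: fall back to the conventional Roaming path when APPDATA is unset.
    const appData = env["APPDATA"] || `${homeDir}\\AppData\\Roaming`;
    return `${appData}\\talking-stick`;
  }

  // 3. Linux and macOS share the XDG-style location.
  const xdgDataHome = env["XDG_DATA_HOME"];
  const base = xdgDataHome ? xdgDataHome : `${homeDir}/.local/share`;
  return `${base}/talking-stick`;
}
```

For example, `resolveDataDir({ platform: "darwin", env: {}, homeDir: "/Users/sam" })` yields `/Users/sam/.local/share/talking-stick`, the same shape as on Linux.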
|
|
1030
|
+
|
|
1031
|
+
Centralizing all rooms in a single SQLite file (rather than one file per room) makes ancestor lookup a simple indexed query rather than a filesystem walk. It also means backups and migrations move a single file.
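
To illustrate why a single store keeps ancestor lookup cheap, the sketch below enumerates the ancestors of a canonical path and returns the deepest one that already has a room. It is a simplified, POSIX-only illustration with hypothetical function names; the `Set` stands in for the unique index on `canonical_path`, which a real query would probe instead.

```typescript
// "/a/b/c" -> ["/a/b/c", "/a/b", "/a", "/"], deepest first.
function ancestorPaths(canonicalPath: string): string[] {
  const parts = canonicalPath.split("/").filter((p) => p.length > 0);
  const out: string[] = [];
  for (let i = parts.length; i > 0; i--) {
    out.push("/" + parts.slice(0, i).join("/"));
  }
  out.push("/");
  return out;
}

// Returns the deepest existing room at or above the path, or null if none exists.
// Matching whole path segments avoids treating "/w/repository" as inside "/w/repo".
function findDeepestRoom(
  canonicalPath: string,
  roomPaths: ReadonlySet<string>,
): string | null {
  for (const candidate of ancestorPaths(canonicalPath)) {
    if (roomPaths.has(candidate)) return candidate;
  }
  return null;
}
```

Because the candidate list is bounded by path depth, the equivalent database lookup is a handful of indexed equality probes and needs no directory traversal.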
|
|
1032
|
+
|
|
1033
|
+
## Open Design Questions
|
|
1034
|
+
|
|
1035
|
+
The following questions are worth revisiting once the MVP has seen real use:
|
|
1036
|
+
|
|
1037
|
+
- Should non-owners be able to append notes, or would that encourage side-channel work that bypasses the handoff discipline?
|
|
1038
|
+
- Should a human operator override use the same `takeover_stick` tool as peer agents, or a separate admin tool that bypasses timeout gating?
|
|
1039
|
+
|
|
1040
|
+
Current implementation note: no timeout-bypass or admin override exists today. Human operators use the same explicit `takeover_stick` flow as peer agents, and any bypass semantics remain intentionally undecided.
|
|
1041
|
+
|
|
1042
|
+
## Implementation Plan
|
|
1043
|
+
|
|
1044
|
+
1. Build a local TypeScript MCP server using the Node MCP SDK.
|
|
1045
|
+
2. Use SQLite (via `better-sqlite3` or `libsql`) with WAL mode, resolving the data directory to `$XDG_DATA_HOME/talking-stick` (or `~/.local/share/talking-stick` when unset) on Linux/macOS and `%APPDATA%\talking-stick` on Windows, and honoring the `TALKING_STICK_DATA_DIR` override.
|
|
1046
|
+
3. Apply required pragmas on every connection; use `BEGIN IMMEDIATE` for all write transactions.
|
|
1047
|
+
4. Detect non-local filesystems at startup and fail fast with a clear error.
|
|
1048
|
+
5. Implement canonical path resolution, workspace root detection, and deepest-ancestor room lookup.
|
|
1049
|
+
6. Implement `list_rooms`, `join_path` (with `force_new`), `get_room_state`, and member sequencing.
|
|
1050
|
+
7. Implement the `Handoff` type with server-side validation of required fields.
|
|
1051
|
+
8. Implement `wait_for_turn` as bounded polling with monotonic cursor support and atomic claiming; attach the prior handoff to `your_turn` responses.
|
|
1052
|
+
9. Implement `takeover_available` responses without auto-taking the stick.
|
|
1053
|
+
10. Implement lease issuing, heartbeat, release (with handoff), explicit pass (with handoff), and takeover.
|
|
1054
|
+
11. Implement `get_room_events` for both audit and takeover recovery.
|
|
1055
|
+
12. Add tests for:
|
|
1056
|
+
- ancestor lookup (including nested rooms and `force_new`),
|
|
1057
|
+
- handoff validation errors,
|
|
1058
|
+
- stale leases,
|
|
1059
|
+
- simultaneous claims within one process,
|
|
1060
|
+
- **simultaneous claims across multiple concurrent processes** (spawn N processes, have them claim/release under contention, verify no state corruption),
|
|
1061
|
+
- explicit pass,
|
|
1062
|
+
- pass to unknown or inactive member rejection,
|
|
1063
|
+
- release sequence,
|
|
1064
|
+
- claim timeout and takeover,
|
|
1065
|
+
- prior owner rejected on claim-timeout takeover when another active candidate exists,
|
|
1066
|
+
- prior owner allowed on claim-timeout takeover when no other active candidate exists,
|
|
1067
|
+
- owner timeout and takeover,
|
|
1068
|
+
- owner process gone yields immediate `owner_gone` takeover availability,
|
|
1069
|
+
- reserved recipient process gone yields immediate `recipient_gone` takeover availability,
|
|
1070
|
+
- original reserved recipient claiming after claim timeout but before takeover,
|
|
1071
|
+
- expired owner heartbeating after lease timeout but before takeover,
|
|
1072
|
+
- dead owner or dead reserved recipient cannot reclaim after exact process-gone detection,
|
|
1073
|
+
- dormant rooms remain readable and resumable, rather than auto-closing,
|
|
1074
|
+
- event log reconstruction after takeover,
|
|
1075
|
+
- database path resolution across platforms and with `TALKING_STICK_DATA_DIR` set.
|
|
1076
|
+
13. Add a small CLI or script for manual inspection during development.
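
Step 7 can be sketched as follows. Only the two required fields named in the MVP policy (`status` and `next_action`) are modeled; the real `Handoff` type may carry more, and the function names and error messages are hypothetical.

```typescript
// Only the fields this plan marks as required are modeled here.
interface Handoff {
  status: string; // what was done
  next_action: string; // what remains
}

type ValidationResult =
  | { ok: true; handoff: Handoff }
  | { ok: false; errors: string[] };

function isNonEmptyString(value: unknown): value is string {
  return typeof value === "string" && value.trim().length > 0;
}

// Validates untrusted tool input before release_stick or pass_stick proceeds.
function validateHandoff(input: unknown): ValidationResult {
  if (typeof input !== "object" || input === null) {
    return { ok: false, errors: ["handoff must be an object"] };
  }
  const record = input as Record<string, unknown>;
  const status = record["status"];
  const nextAction = record["next_action"];
  if (isNonEmptyString(status) && isNonEmptyString(nextAction)) {
    return { ok: true, handoff: { status, next_action: nextAction } };
  }
  const errors: string[] = [];
  if (!isNonEmptyString(status)) errors.push("status must be a non-empty string");
  if (!isNonEmptyString(nextAction)) errors.push("next_action must be a non-empty string");
  return { ok: false, errors };
}
```

Returning every failing field at once lets the calling agent repair its handoff in a single retry.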
|
|
1077
|
+
|
|
1078
|
+
Current first-slice test coverage:
|
|
1079
|
+
|
|
1080
|
+
- happy path: `join_path` -> idle `wait_for_turn` claim -> `release_stick` with handoff -> reserved recipient claim,
|
|
1081
|
+
- handoff validation,
|
|
1082
|
+
- stale lease rejection,
|
|
1083
|
+
- turn mismatch rejection,
|
|
1084
|
+
- deepest ancestor lookup,
|
|
1085
|
+
- workspace-root room creation,
|
|
1086
|
+
- `wait_for_turn` returns `takeover_available` after claim timeout,
|
|
1087
|
+
- reserved recipient may still claim after `claim_ttl` until takeover commits,
|
|
1088
|
+
- `wait_for_turn` returns `takeover_available` after owner lease timeout,
|
|
1089
|
+
- expired owner may heartbeat before takeover commits,
|
|
1090
|
+
- owner-timeout takeover fences stale owner writes,
|
|
1091
|
+
- owner process death yields immediate `owner_gone`,
|
|
1092
|
+
- reserved recipient process death yields immediate `recipient_gone`,
|
|
1093
|
+
- dormant rooms stay readable and resumable,
|
|
1094
|
+
- prior-owner takeover guard after claim timeout,
|
|
1095
|
+
- multi-process contention against an idle room.
|
|
1096
|
+
|
|
1097
|
+
## Minimum Viable Version
|
|
1098
|
+
|
|
1099
|
+
The first useful version can omit optional notes, admin features, and per-room policy.
|
|
1100
|
+
|
|
1101
|
+
MVP tools:
|
|
1102
|
+
|
|
1103
|
+
```text
|
|
1104
|
+
list_rooms
|
|
1105
|
+
join_path
|
|
1106
|
+
wait_for_turn
|
|
1107
|
+
heartbeat
|
|
1108
|
+
release_stick
|
|
1109
|
+
pass_stick
|
|
1110
|
+
takeover_stick
|
|
1111
|
+
get_room_state
|
|
1112
|
+
get_room_events
|
|
1113
|
+
```
|
|
1114
|
+
|
|
1115
|
+
MVP storage:
|
|
1116
|
+
|
|
1117
|
+
```text
|
|
1118
|
+
path_rooms (with canonical_path unique key, current ownership projection,
|
|
1119
|
+
and pending_handoff_event_seq)
|
|
1120
|
+
room_members (with join order and presence timestamps)
|
|
1121
|
+
room_events (with handoff_json payload on release and pass events)
|
|
1122
|
+
```
|
|
1123
|
+
|
|
1124
|
+
MVP policy:
|
|
1125
|
+
|
|
1126
|
+
```text
|
|
1127
|
+
data directory: ~/.local/share/talking-stick on Linux and macOS
|
|
1128
|
+
(or $XDG_DATA_HOME/talking-stick when set);
|
|
1129
|
+
%APPDATA%\talking-stick on Windows;
|
|
1130
|
+
override via TALKING_STICK_DATA_DIR
|
|
1131
|
+
database file: <data_dir>/rooms.sqlite, WAL mode, synchronous=NORMAL, busy_timeout=5s
|
|
1132
|
+
concurrency: shared database across server processes; BEGIN IMMEDIATE for writes;
|
|
1133
|
+
wait_for_turn polls at 250 ms across processes
|
|
1134
|
+
filesystem requirement: local filesystem; NFS/SMB rejected at startup
|
|
1135
|
+
room identity: canonical workspace path, resolved via workspace root detection
|
|
1136
|
+
plus deepest-ancestor lookup
|
|
1137
|
+
room creation default: attach to ancestor when one exists; require force_new to nest
|
|
1138
|
+
topics: deferred extension, not MVP
|
|
1139
|
+
release behavior: reserve next active member in sequence
|
|
1140
|
+
explicit pass behavior: reserve active target member
|
|
1141
|
+
takeover behavior: another active member after timeout; timeout itself does not revoke
|
|
1142
|
+
claim-timeout takeover skips the prior owner when another
|
|
1143
|
+
active candidate exists
|
|
1144
|
+
exact owner/recipient process death yields immediate
|
|
1145
|
+
takeover availability without waiting for timeout
|
|
1146
|
+
handoff requirement: release_stick and pass_stick require non-empty status and next_action
|
|
1147
|
+
recovery context: get_room_events supplies prior handoffs to takeover winner
|
|
1148
|
+
owner lease TTL: 45 minutes
|
|
1149
|
+
heartbeat interval: 5 minutes
|
|
1150
|
+
claim TTL: 20 minutes
|
|
1151
|
+
presence TTL: 4 hours
|
|
1152
|
+
close semantics: no `close_room` tool in the MVP implementation;
|
|
1153
|
+
rooms remain resumable and can become dormant
|
|
1154
|
+
when nobody is live
|
|
1155
|
+
wait_for_turn max wait: 30 seconds, polled at 250 ms
|
|
1156
|
+
```
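
The `wait_for_turn` limits in the policy above (30-second maximum wait, 250 ms poll interval) can be sketched as a bounded polling loop. This is an illustration under stated assumptions: the `waiting` status, the `PollStep` type, and both function names are hypothetical, and the atomic claim and event cursor described elsewhere in this plan are left out.

```typescript
const POLL_INTERVAL_MS = 250;
const MAX_WAIT_MS = 30000;

// "your_turn" and "takeover_available" come from the plan; "waiting" is a placeholder.
type TurnStatus = "your_turn" | "takeover_available" | "waiting";

type PollStep =
  | { kind: "return"; status: TurnStatus }
  | { kind: "sleep"; ms: number };

// Pure decision: given what the database showed and how long we have waited,
// either return to the caller or sleep for at most one poll interval.
function nextPollStep(observed: TurnStatus, elapsedMs: number): PollStep {
  if (observed !== "waiting") return { kind: "return", status: observed };
  const remaining = MAX_WAIT_MS - elapsedMs;
  if (remaining <= 0) return { kind: "return", status: "waiting" };
  return { kind: "sleep", ms: Math.min(POLL_INTERVAL_MS, remaining) };
}

// Driver loop; `check` reads room state and `sleep` is injected by the caller.
async function waitForTurn(
  check: () => TurnStatus,
  sleep: (ms: number) => Promise<void>,
  now: () => number = () => Date.now(),
): Promise<TurnStatus> {
  const startedAt = now();
  for (;;) {
    const step = nextPollStep(check(), now() - startedAt);
    if (step.kind === "return") return step.status;
    await sleep(step.ms);
  }
}
```

A caller would typically pass a timer-based `sleep` and call `waitForTurn` again when it returns `waiting`, so no single tool call blocks for longer than the 30-second bound.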
|