@martintrojer/mu 0.3.1

package/docs/VISION.md ADDED
@@ -0,0 +1,440 @@
1
+ # VISION: A persistent crew of agents
2
+
3
+ > Terminology used in this doc is canonical. See
4
+ > [VOCABULARY.md](VOCABULARY.md) for definitions of *workstream*,
5
+ > *agent*, *task DAG*, *crew*, *track*, *claim*, *free*, *workspace*,
6
+ > and the rest.
7
+
8
+ ## What This Is
9
+
10
+ mu is a small, durable **control plane** for a persistent crew of AI
11
+ agents in tmux panes. Agents have names, roles, and status; they
12
+ live across sessions; they work on a built-in task graph with VCS
13
+ workspace isolation; humans and other agents drive them through one
14
+ CLI. State lives in one SQLite file; everything in mu is a typed
15
+ verb over that state or a reconciled view of reality.
16
+
17
+ A user with mu installed can:
18
+
19
+ ```bash
20
+ mu agent spawn worker-1 --tab Backend --workspace
21
+ mu agent spawn reviewer-1 --tab Review --workspace --role read-only
22
+ mu task add --title "Build auth" --impact 80 --effort-days 3
23
+ mu task claim build_auth --for worker-1 --evidence "have implementation plan"
24
+ mu agent send worker-1 "Implement build_auth per the description"
25
+ mu state # canonical state card
26
+ mu log --tail # subscribe to every state change
27
+ ```
28
+
29
+ That's the whole product. Everything else is in service of making
30
+ those few lines work, recover from failure, and scale to dozens of
31
+ agents and hundreds of tasks.
32
+
33
+ ---
34
+
35
+ ## Why It Exists
36
+
37
+ Existing tools force a choice:
38
+
39
+ - **Pi-subagents** is great for one-shot focused delegation but the
40
+ children it spawns are short-lived, pi-only, and not driveable from
41
+ outside their parent pi session.
42
+ - **Tmux-orchestration tools** spawn agents in panes but leave
43
+ coordination to chat transcripts or filesystem conventions.
44
+ - **Task trackers** (GitHub Issues, Linear, even tg) model the work but
45
+ don't run the agents.
46
+ - **Bigger orchestration platforms** provide the coordination
47
+ state but have accumulated breadth that costs more in
48
+ lifecycle/maintenance/model-entropy than it pays back. See
49
+ [§ What looking at a prior multi-agent runtime taught us](#what-looking-at-a-prior-multi-agent-runtime-taught-us)
50
+ below.
51
+
52
+ mu unifies the first three without becoming the fourth: persistent
53
+ pi agents, a structured work graph, per-agent VCS isolation, one
54
+ CLI, one SQLite file. The cost of "what should this agent do next?"
55
+ drops from "ask the LLM and hope" to:
56
+
57
+ ```bash
58
+ mu task next # top ready task by ROI
59
+ mu state # full picture as a JSON state card
60
+ ```
61
+
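A minimal sketch of how a "top ready task by ROI" query could be computed. The doc does not spell out the ROI formula or the exact readiness rule, so both are assumptions here: ROI is taken as `impact / effort_days`, and a task counts as ready when it is OPEN and every task blocking it is CLOSED. The `Task` shape and the function names are hypothetical, not mu's SDK.

```typescript
// Hypothetical task shape; mu's real schema may differ.
type Status = "OPEN" | "IN_PROGRESS" | "CLOSED";

interface Task {
  id: string;
  impact: number;      // mandatory per the doc
  effortDays: number;  // mandatory per the doc (effort_days)
  status: Status;
  blockedBy: string[]; // ids of tasks with a `blocks` edge into this one
}

// Assumed readiness rule: OPEN and every blocker is CLOSED.
function readyTasks(tasks: Task[]): Task[] {
  const closed = new Set(
    tasks.filter((t) => t.status === "CLOSED").map((t) => t.id),
  );
  return tasks.filter(
    (t) => t.status === "OPEN" && t.blockedBy.every((b) => closed.has(b)),
  );
}

// Assumed ROI formula: impact / effort_days, highest first.
function nextTask(tasks: Task[]): Task | undefined {
  const roi = (t: Task): number => t.impact / t.effortDays;
  return readyTasks(tasks).sort((a, b) => roi(b) - roi(a))[0];
}
```

The point of the sketch is that the answer is a pure function of the graph: the same rows always produce the same next task.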
62
+ ---
63
+
64
+ ## Design Principles
65
+
66
+ ### 1. The CLI is the product
67
+
68
+ The pi extension is a UX skin. Everything mu does must work from a
69
+ shell with no pi anywhere. If a feature requires the extension to
70
+ function, it doesn't ship.
71
+
72
+ ### 2. One DB is canonical
73
+
74
+ All state lives in `~/.local/state/mu/mu.db`. SQLite WAL. Multiple processes share
75
+ it safely. The DB is the source of truth; in-memory state is a cache.
76
+ The extension and the CLI both go through the same DB; they never
77
+ diverge.
78
+
79
+ ### 3. Reality wins reconciliation
80
+
81
+ `mu agent list` queries tmux, prunes ghosts, adopts orphans. The DB records
82
+ what we last observed, not what we wish were true. If worker-1's pane
83
+ crashed, the next `mu agent list` notices and updates the registry.
84
+
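The reconciliation step can be sketched as a set difference between what the DB remembers and what tmux reports. This assumes "ghost" means a registered agent whose pane is gone and "orphan" means a live pane with no registry row; VOCABULARY.md is the authority on those terms. The `reconcile` function and its result shape are hypothetical.

```typescript
interface ReconcileResult {
  ghosts: string[];  // registered in the DB, pane no longer exists (assumed meaning)
  orphans: string[]; // pane exists, no registry row (assumed meaning)
  alive: string[];   // present in both
}

// Compare the registry against the panes tmux reports right now.
function reconcile(registered: string[], livePanes: string[]): ReconcileResult {
  const live = new Set(livePanes);
  const known = new Set(registered);
  return {
    ghosts: registered.filter((name) => !live.has(name)),
    orphans: livePanes.filter((name) => !known.has(name)),
    alive: registered.filter((name) => live.has(name)),
  };
}
```

A caller would then prune the ghosts and adopt the orphans, so the registry ends up matching what was observed.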
85
+ ### 4. Agents are dumb workers; the task graph is the brain
86
+
87
+ The **task DAG is the central organizing primitive**, not a sidecar
88
+ feature. Tasks have mandatory `impact` and `effort_days`; edges are
89
+ `blocks` relationships; `ready`/`blocked`/`goals` are SQL views; the
90
+ parallel-track detector runs union-find with automatic diamond-merge
91
+ so two agents never collide on a shared dependency.
92
+
93
+ This means "what should this agent do next?" and "can we parallelize?"
94
+ are deterministic queries against the graph, not LLM judgment calls.
95
+ The LLM decides *what to type to the agent*; the graph decides *which
96
+ agent gets which task*.
97
+
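One way to read "union-find with automatic diamond-merge" is sketched below: every `blocks` edge unions its two endpoints, so tasks that share a prerequisite end up in the same track and are never handed to two agents at once. This is an interpretation under stated assumptions, not mu's actual detector; the `tracks` function and its inputs are hypothetical.

```typescript
// Each edge is [blocker, blocked].
type Edge = [string, string];

// Group tasks into tracks: connected components of the blocks graph.
function tracks(taskIds: string[], edges: Edge[]): string[][] {
  const parent = new Map<string, string>();
  taskIds.forEach((id) => parent.set(id, id));

  const find = (x: string): string => {
    let root = x;
    while (parent.get(root) !== root) root = parent.get(root) as string;
    // Path compression: point every node on the path straight at the root.
    let cur = x;
    while (cur !== root) {
      const next = parent.get(cur) as string;
      parent.set(cur, root);
      cur = next;
    }
    return root;
  };

  edges.forEach(([blocker, blocked]) => {
    if (!parent.has(blocker) || !parent.has(blocked)) {
      throw new Error(`edge references unknown task: ${blocker} -> ${blocked}`);
    }
    parent.set(find(blocker), find(blocked));
  });

  // Collect members per root, preserving the input order of taskIds.
  const groups = new Map<string, string[]>();
  taskIds.forEach((id) => {
    const root = find(id);
    const members = groups.get(root);
    if (members) members.push(id);
    else groups.set(root, [id]);
  });
  return Array.from(groups.values());
}
```

With a diamond (`a` blocks `b` and `c`, both block `d`) plus an unrelated pair (`x` blocks `y`), this yields two tracks: the whole diamond as one, and the pair as the other.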
98
+ ### 5. One workstream per tmux session
99
+
100
+ A mu workstream is a tmux session. All its agents are panes/windows
101
+ inside that session. `tmux a -t mu-<workstream>` shows the whole crew
102
+ live. Multiple workstreams on one machine are multiple isolated tmux
103
+ sessions, partitioned in the DB by `session_id`. Detach and reattach
104
+ as you would any tmux session — the crew survives.
105
+
106
+ ### 6. Pi-only, by current scope
107
+
108
+ Mu's status detection (`busy` / `needs_input` / `idle` / `done`) is
109
+ pi-only. The `--cli <name>` flag accepts other strings, but no other
110
+ CLI ships with detection support today, so a non-pi pane will always
111
+ show `needs_input`. In practice mu is a pi orchestrator.
112
+
113
+ The `--cli` and `MU_<UPPER_CLI>_COMMAND` surface stays useful for one
114
+ thing: swapping the pi binary. If your install ships pi under a
115
+ different binary name, set `MU_PI_COMMAND=<name>` once and every
116
+ spawn picks it up. Multi-word commands work too:
117
+ `MU_PI_COMMAND="pi-alt --some-flag"`.
118
+
119
+ Multi-CLI support (claude / codex with real status detection) is a
120
+ future possibility but not currently planned. If it earns its way
121
+ back per the [ROADMAP](ROADMAP.md) criteria, the substrate is ready
122
+ (spawn already accepts arbitrary commands; the schema's `cli` column
123
+ is TEXT). Until then, treat "mu is a pi orchestrator" as the honest
124
+ positioning.
125
+
126
+ ### 7. TypeScript on Node — deliberate, not a compromise
127
+
128
+ Mu is TypeScript on Node, with a small set of well-established
129
+ npm deps (`commander`, `better-sqlite3`, `cli-table3`, `picocolors`,
130
+ `execa`). No native code we maintain, no build matrix. Anyone
131
+ reading `package.json` should recognize every name.
132
+
133
+ The choice was framed early on as "boring," but in practice it
134
+ earns its keep on four specific axes, not just inertia:
135
+
136
+ - **Type system value is real.** The `AgentNotFoundError` /
137
+ `TaskNotFoundError` / `TaskNotInWorkstreamError` / `CycleError`
138
+ hierarchy maps directly to exit codes via `handle()`. The
139
+ `assertXInWorkstream` helper family stays type-safe across
140
+ every resource namespace. `noUncheckedIndexedAccess` has
141
+ prevented several real bugs in iteration. The same code in Go would lose
142
+ the discriminated unions; in Python the type checker is too
143
+ weak; in Rust the LOC cost would be 2–3×.
144
+ - **JSON-first surface fits TS like a glove.** `emitJson(value)`
145
+ is one line. Every read verb's `--json` output is
146
+ `JSON.stringify(value)` straight from a typed shape. Compare to
147
+ `serde_json` derive-macro friction in Rust or `json.dumps`
148
+ with no type guard in Python.
149
+ - **`better-sqlite3` is genuinely best-in-class.** Synchronous
150
+ request/response API matches the CLI invocation model
151
+ perfectly. WAL handling correct out of the box.
152
+ `db.transaction()` wrapper is exactly the right shape.
153
+ Equivalent in Rust (rusqlite) or Go (mattn/go-sqlite3) is more
154
+ verbose to use.
155
+ - **Iteration speed.** ~60 typed verbs / 14 tables (schema v7) /
156
+ ~880 tests in ~30k LOC src+tests, with multiple substantive
157
+ changes per day during active work. That cadence in a Rust
158
+ codebase of equivalent surface area would be 2–3× slower at
159
+ minimum.
160
+
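The first axis above (an error hierarchy that maps to exit codes) might look roughly like the following. The class names `AgentNotFoundError`, `TaskNotFoundError`, and `CycleError` come from the doc; the `MuError` base class, the numeric codes, and `exitCodeFor` (standing in for the `handle()` the doc mentions) are assumptions made for illustration.

```typescript
// Hypothetical base class; the exit code numbers below are invented.
abstract class MuError extends Error {
  abstract readonly exitCode: number;
  constructor(message: string) {
    super(message);
    // Keep `instanceof` working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

class AgentNotFoundError extends MuError {
  readonly exitCode = 3;
  constructor(name: string) {
    super(`agent not found: ${name}`);
  }
}

class TaskNotFoundError extends MuError {
  readonly exitCode = 4;
  constructor(id: string) {
    super(`task not found: ${id}`);
  }
}

class CycleError extends MuError {
  readonly exitCode = 5;
  constructor(path: string[]) {
    super(`cycle: ${path.join(" -> ")}`);
  }
}

// Maps any thrown value to a process exit code; unknown errors fall back to 1.
function exitCodeFor(err: unknown): number {
  return err instanceof MuError ? err.exitCode : 1;
}
```

Because each class carries its own code, adding a new error type does not require touching the mapping function.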
161
+ **Where it's weak: cold start.** Node's V8 init is ~30–50ms even
162
+ after tsup bundles. Rust would be ~5ms; Go ~10–15ms. This would
163
+ matter if mu were called in tight loops at the heart of agent
164
+ scripts — but it's not, by design. Pillar 4 ("async coordination
165
+ via the activity log") explicitly steers operators away from
166
+ polling loops toward `mu log --tail` subscriptions. **The
167
+ weakness is sidestepped by the architecture.** If that ever
168
+ stops being true (mu becomes a polling tool, distribution goes
169
+ broadly public with "no toolchain required" as a feature, or
170
+ sub-5ms startup becomes load-bearing), Rust is the natural port
171
+ target; until then, TS+Node is actively the right choice, not a
172
+ compromise.
173
+
174
+ **Native dep:** `better-sqlite3` requires prebuilds or a C++
175
+ toolchain. Prebuilds cover darwin-arm64/x64, linux-x64/arm64,
176
+ win32-x64 — every dev workstation we care about. Acceptable.
177
+
178
+ ### 8. Schema-first; typed verbs over read views; SQL as escape hatch
179
+
180
+ The product surface is:
181
+
182
+ - **Read views** (the `ready` / `blocked` / `goals` SQL views; `mu
183
+ state` as the curated state card) for inspection.
184
+ - **Typed verbs** that map cleanly to resource transitions for action
185
+ (`task add`, `task claim`, `task close`, `agent spawn`, `workspace
186
+ create`, `workstream init`, `archive add --destroy`, ...).
187
+ - **`--json` on every read verb** so scripts pipe through `jq`
188
+ instead of parsing tables.
189
+ - **`mu sql`** as the explicit escape hatch underneath.
190
+
191
+ There is no DSL, no plugin system, no workflow engine, no
192
+ `defineOperation` registry generating verbs from declarations. The
193
+ commander wiring in `src/cli.ts` is the single source of truth for
194
+ the verb surface. Adding a new verb is one SDK function plus one
195
+ commander block.
196
+
197
+ ### 9. Observed vs claimed
198
+
199
+ When a verb mutates state, the audit trail records what the caller
200
+ said it relied on. `mu task close design --evidence "tests pass:
201
+ npm test exit 0"` lands in the event log as
202
+ `task status design (IN_PROGRESS → CLOSED) evidence="..."`. The verb
203
+ still trusts the caller — mu doesn't run tests for you — but the
204
+ grounding for every state change is searchable in `mu log --kind
205
+ event`. This is the first inch of a discipline that earns more enforcement when
206
+ real-world friction asks for it.
207
+
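As a rough illustration of the event line quoted above, the sketch below builds the same message shape from a status transition and an optional evidence string. The row fields are assumptions (the doc only confirms `kind='event'` and a `source` field), and quoting the evidence with `JSON.stringify` is a guess at how escaping could be handled.

```typescript
// Hypothetical row shape for an audit event.
interface EventRow {
  kind: "event";
  source: string;  // who caused the change, e.g. the claiming agent or 'system'
  message: string;
}

// Render a status transition in the shape the doc quotes.
function statusEvent(
  source: string,
  taskId: string,
  from: string,
  to: string,
  evidence?: string,
): EventRow {
  const base = `task status ${taskId} (${from} → ${to})`;
  const message =
    evidence === undefined ? base : `${base} evidence=${JSON.stringify(evidence)}`;
  return { kind: "event", source, message };
}
```

The evidence is stored verbatim: nothing in the sketch inspects or validates what the caller claims.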
208
+ ### 10. Get out of the model's way
209
+
210
+ Mu coordinates agents; it does not reason about them. Specifically,
211
+ mu does not own:
212
+
213
+ - **Model selection.** No tier abstraction (no `mini/modest/big`),
214
+ no provider matrix, no vendor-name mapping. Pi already speaks
215
+ `--model sonnet:high` and `--provider openai`. The day mu invents
216
+ its own tier names is the day mu owns a vendor matrix that goes
217
+ stale every quarter — that's the "adjacent product identities"
218
+ trap an internal critique flagged, and we're not falling into it.
219
+ - **Effort / thinking levels.** Pi has
220
+ `--thinking off|minimal|low|medium|high|xhigh`. Mu doesn't wrap
221
+ it, doesn't normalise it, doesn't second-guess it. Pass-through
222
+ via `--command` or the `MU_<UPPER_CLI>_COMMAND` env var, full stop.
223
+ - **Prompt engineering.** Mu has no system-prompt templating, no
224
+ role injection beyond the agent name and `--role`, no "agent
225
+ template" registry. The system prompt is whatever you put in the
226
+ spawn command and the first message you send.
227
+ - **Tool routing decisions.** Pi (and any other CLI you spawn) owns
228
+ tool allowlists, MCP servers, extensions. Mu doesn't proxy or
229
+ inspect them.
230
+ - **Output interpretation.** Mu reads pane contents to detect
231
+ `busy / needs_input / idle / done` (a 4-state classification).
232
+ It does not parse model output for facts, claims, or tool calls.
233
+ The `--evidence` payload is whatever the agent says it is; mu
234
+ records it without interpretation.
235
+
236
+ The full mechanism is one function: `--cli <key>` uppercases the
237
+ key and looks up `$MU_<KEY>_COMMAND`. That's mu's entire vendor
238
+ surface. The operator pattern when you want different models per
239
+ role is just convention on top:
240
+
241
+ ```bash
242
+ export MU_PI_MINI_COMMAND="pi --model haiku:off" # → --cli pi_mini
243
+ export MU_PI_BIG_COMMAND="pi --model opus:high" # → --cli pi_big
244
+
245
+ mu agent spawn worker-1 --cli pi_mini
246
+ mu agent spawn reviewer-1 --cli pi_big
247
+ ```
248
+
249
+ Your shell rc owns the mapping. The names `pi_mini` / `pi_big` are
250
+ operator convention — mu doesn't know about "tiers," it just looks
251
+ up whatever env var the uppercased key produces. Swap the whole
252
+ matrix in one line: per-machine, per-workstream, per-project,
253
+ wherever you set the env. The substrate stays small; the
254
+ orchestrator stays in charge.
255
+
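The lookup described above fits in a few lines. This is a sketch under assumptions: the function names are hypothetical, the environment is passed in as a plain object so the example stays self-contained, and what mu does when the variable is unset is not stated in the doc, so the sketch simply returns `undefined`.

```typescript
// Build the env var name for a --cli key, e.g. "pi_mini" -> "MU_PI_MINI_COMMAND".
function envVarFor(cliKey: string): string {
  return `MU_${cliKey.toUpperCase()}_COMMAND`;
}

// Look up the command for a key. Returns undefined when nothing is configured;
// the real fallback behaviour is not specified in the doc.
function resolveCommand(
  cliKey: string,
  env: Record<string, string | undefined>,
): string | undefined {
  const value = env[envVarFor(cliKey)];
  return value !== undefined && value.trim() !== "" ? value : undefined;
}
```

Multi-word values such as `pi --model opus:high` pass through untouched, which matches the doc's note that multi-word commands work.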
256
+ This is the zen of mu: every layer doing its job, no layer
257
+ speaking for another.
258
+
259
+ ---
260
+
261
+ ## What It Enables
262
+
263
+ - **Persistent crews in one place** — Spawn worker-1/worker-2/reviewer-1
264
+ once, send them work all day. `tmux a -t mu-<workstream>` shows
265
+ the whole crew in one session: each agent in its own pane, all
266
+ observable at a glance, all detachable.
267
+ - **Multi-pi crews in one session** — several pi workers and a
268
+ read-only pi reviewer in the same workstream, each in its own
269
+ pane, each independently observable via `tmux attach`.
270
+ - **Graph-driven coordination** — The task DAG answers "what's ready?",
271
+ "what blocks what?", "what can be parallelized?" with SQL queries
272
+ and union-find, not LLM guesses. Notes per task accumulate durable
273
+ context that outlives any single agent or session.
274
+ - **Deterministic parallelization** — Diamond patterns (shared
275
+ prerequisites) get merged automatically so two agents never collide
276
+ on a shared dependency. The orchestrator follows the algorithm; it
277
+ doesn't have to be smart enough to spot the trap.
278
+ - **VCS workspace isolation** — Each agent gets its own jj workspace,
279
+ sl clone, git worktree, or `cp -a` snapshot, auto-detected.
280
+ `mu agent spawn --workspace` creates and mounts; `mu agent close`
281
+ auto-frees. Two parallel agents in the same project never trample
282
+ each other's working tree.
283
+ - **Async coordination via `mu log`** — Every state-changing verb
284
+ auto-emits a `kind='event'` row. Subscribers `mu log --tail`
285
+ instead of polling. Real-time wakeups without a daemon.
286
+ - **Audit trail with grounding** — `--evidence` on lifecycle verbs
287
+ records what the caller observed. Searchable via `mu log --kind
288
+ event`.
289
+ - **Crash recovery** — Reconciliation prunes ghost agents; the reaper
290
+ reverts their IN_PROGRESS tasks to OPEN with an explanatory note;
291
+ no manual cleanup.
292
+ - **Human-driveable** — Anything mu can do, you can do from a shell.
293
+ Debug, recover, script, cron.
294
+
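The crash-recovery bullet can be sketched as a pure function over task rows: any IN_PROGRESS task owned by a ghost agent goes back to OPEN with a note explaining why. The `Task` shape, the `owner` field, and the note wording are assumptions; only the behaviour (revert to OPEN, leave a note) comes from the doc.

```typescript
type Status = "OPEN" | "IN_PROGRESS" | "CLOSED";

// Hypothetical row shape; mu's schema may store ownership and notes differently.
interface Task {
  id: string;
  status: Status;
  owner: string | null;
  notes: string[];
}

// Revert tasks held by vanished agents; everything else is returned unchanged.
function reap(tasks: Task[], ghostAgents: string[]): Task[] {
  const ghosts = new Set(ghostAgents);
  return tasks.map((t): Task => {
    if (t.status !== "IN_PROGRESS" || t.owner === null || !ghosts.has(t.owner)) {
      return t;
    }
    return {
      ...t,
      status: "OPEN",
      owner: null,
      notes: [...t.notes, `reaper: reverted to OPEN, agent ${t.owner} disappeared`],
    };
  });
}
```

Returning new rows instead of mutating in place keeps the sketch easy to test; a real implementation would presumably do the same work inside one DB transaction.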
295
+ ---
296
+
297
+ ## What It Is NOT
298
+
299
+ - **Not an orchestrator.** mu provides primitives. The orchestration
300
+ *policy* (when to spawn, what to assign, when to free) is yours —
301
+ expressed as bash scripts, jq pipelines over `--json` output, or
302
+ driven by an LLM through the bundled skill. There is no JS DSL,
303
+ no workflow engine, no `mu run script.ts`. (See
304
+ [§ What looking at a prior multi-agent runtime taught us](#what-looking-at-a-prior-multi-agent-runtime-taught-us).)
305
+ - **Not a build tool.** mu doesn't compile, test, or deploy your code.
306
+ It runs agents that do those things.
307
+ - **Not a chat protocol.** Agents communicate through the work graph
308
+ (notes, claim, status) and the `agent_logs` activity channel.
309
+ - **Not a replacement for pi-subagents.** Different problem (persistent
310
+ crew vs one-shot focused delegation). Install both; they share the
311
+ agent-frontmatter format.
312
+ - **Not a hosted service.** Local-first SQLite. Zero ops, no accounts.
313
+ Your machine is the deployment.
314
+ - **Not a verifier.** The verbs trust the caller. `task close
315
+ --evidence "tests pass"` records the claim; mu doesn't run the
316
+ tests. Verification is the caller's job. (mu may grow optional
317
+ verifying-runners later if friction surfaces; today it's an
318
+ audit-trail discipline, not enforcement.)
319
+ - **Not undoable.** No snapshots, no `mu undo`. `mu workstream
320
+ destroy --yes` is irreversible. Recovery is restoring `mu.db` from
321
+ a backup. Snapshots are deferred past 1.0 — the SQL escape hatch
322
+ + FK CASCADE behaviour cover most repair scenarios.
323
+
324
+ ---
325
+
326
+ ## Key Constraints
327
+
328
+ 1. **Tmux required.** The substrate is tmux panes. No tmux, no agents.
329
+ `mu doctor` checks for it on every run that touches the agent layer.
330
+
331
+ 2. **Local-only persistence.** SQLite file at `~/.local/state/mu/mu.db`.
332
+ No cross-machine state in v1; layer something like syncthing on top
333
+ if you want it.
334
+
335
+ 3. **Pi-only.** Status detection (and de-facto the entire product)
336
+ targets pi. `--cli pi` is the meaningful default; `--cli` accepts
337
+ other strings as a key for the `MU_<UPPER_CLI>_COMMAND` env var
338
+ resolver but no other CLI has a detector. We optimize for
339
+ false-negative-then-poll over false-positive-then-act.
340
+
341
+ 4. **Send is fire-and-forget.** `mu agent send` delivers to the pane;
342
+ no acknowledgment. Orchestrators poll status or subscribe to
343
+ `mu log --tail` for confirmation. This is by design — the
344
+ alternative requires a protocol every CLI would have to speak.
345
+
346
+ 5. **Recursion is opt-in.** Default `maxSubagentDepth: 0`. Children
347
+ get the `mu` binary on PATH but the bundled skill explicitly says
348
+ "you are not the orchestrator." Hierarchical orchestration is
349
+ intentional, not accidental.
350
+
351
+ 6. **Subscriptions are polling-based.** `mu log --tail` polls
352
+ SQLite once per second. SQLite handles the concurrency; latency
353
+ is bounded by the poll interval. Real subscription mechanisms
354
+ (SQLite hooks, fs.watch) are a future ask if anyone hits the
355
+ latency cliff.
356
+
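Constraint 6 describes a polling tail. One common way to build that is a cursor over a monotonically increasing row id, sketched below. The row shape, the `fetchSince` callback (standing in for a SQLite query), and the assumption that ids only grow are all hypothetical; the doc only states that `mu log --tail` polls once per second.

```typescript
// Hypothetical log row; assumes ids increase with insertion order.
interface LogRow {
  id: number;
  kind: string;
  message: string;
}

// Stand-in for a query such as "rows with id greater than afterId, ordered by id".
type FetchSince = (afterId: number) => LogRow[];

// One poll: emit every row newer than the cursor and return the advanced cursor.
function pollOnce(
  cursor: number,
  fetchSince: FetchSince,
  emit: (row: LogRow) => void,
): number {
  let next = cursor;
  for (const row of fetchSince(cursor)) {
    emit(row);
    if (row.id > next) next = row.id;
  }
  return next;
}
```

A real loop would call `pollOnce` on a one-second timer and carry the returned cursor forward, which is why worst-case latency is bounded by the poll interval.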
357
+ ---
358
+
359
+ ## What looking at a prior multi-agent runtime taught us
360
+
361
+ We ran a five-role council critique against a prior internal
362
+ multi-agent runtime that mu's author worked on, which is mu's
363
+ design ancestor. The council converged on a sharp central claim:
364
+
365
+ > [The runtime] is not justified as a better general coding harness.
366
+ > [The runtime] is justified only when it becomes a durable
367
+ > coordination/control plane for work that outgrows a thin harness
368
+ > plus manually managed tmux.
369
+
370
+ And a sharper recommendation for what such a control plane should
371
+ look like:
372
+
373
+ > A minimal defensible core would be: durable sessions/transcripts;
374
+ > agent registry; task records / task graph; workspace and checkout
375
+ > ownership/leases; event log; wakeups/timers; human approvals/input;
376
+ > typed control API; read-only views/state cards; recovery/orphan
377
+ > detection.
378
+ >
379
+ > Everything else — chat, docs, IDE assist, incident-triage,
380
+ > mobile-agent, end-to-end workflows, memory policy, rich
381
+ > dashboards, workflow DSLs — should be optional layers that prove
382
+ > they strengthen the supervision loop.
383
+
384
+ This is independent validation of the shape mu had landed on. Almost
385
+ every item in the council's minimal core ships in mu today (9 of 10);
386
+ almost every item the council criticised the prior runtime for is
387
+ something mu explicitly does not have.
388
+
389
+ ### The council's criticisms → mu's design choices
390
+
391
+ | Council critique of the prior runtime | mu's stance |
392
+ | ------------------------------------------------------------- | -------------------------------------------------------------------------- |
393
+ | "Sprawling product identities (TUI + web + Thrift + plugin host + workflow engine + chat + docs + memory + ...)" | One CLI, one SQLite file, no plugins, no web UI, no Thrift, no chat/docs integrations. |
394
+ | "Workflow DSL is mostly liability" | Rejected outright. No `mu run`/`eval`/`repl`. `--json` + bash + jq cover the scripting story. |
395
+ | "defineOperation/verb-registry adds entropy without consumers" | Rejected. The commander wiring in `src/cli.ts` is the verb surface; one place. |
396
+ | "Plugin sprawl with hidden state and lifecycle bugs" | No plugins. Adding behaviour is a typed verb in `src/cli.ts`. |
397
+ | "CLI verbs as a primary model surface vs. typed mutations" | mu's verbs *are* the typed mutations. CLI is a thin wrapper over a typed SDK with idempotency, validation, exit-code-mapped errors. |
398
+ | "Raw SQL as the only inspection surface is too low-level" | `mu state` is the canonical state card. `--json` everywhere. `mu sql` is the escape hatch beneath, not the cockpit. |
399
+ | "Distinguish observed from claimed state" | `--evidence` on lifecycle verbs (first inch). The verb still trusts the caller; the audit trail records grounding. |
400
+ | "Approval primitives belong in the core" | **REMOVED post-v0.3 wave.** Shipped as `mu approve add/list/grant/deny/wait` in v0.1; zero usage across 200+ tasks of dogfood through v0.2 + v0.3. Anti-anticipatory pruning per VISION.md "no traits with zero implementors". May return when a real second implementor surfaces (e.g., an unattended pi-orchestrator running mu). |
401
+ | "Reads must distinguish provenance (process telemetry vs agent self-report)" | event `source` field attributes events to actor (claiming agent / decider / 'system'). |
402
+ | "State must be authoritative and recoverable, not just durable" | Reconciliation runs on read paths; reaper recovers stuck IN_PROGRESS automatically. |
403
+
404
+ ### What this validates
405
+
406
+ Three things the council's analysis lets us state with more
407
+ confidence than "we just had a hunch":
408
+
409
+ 1. **The anti-feature pledges are load-bearing.** No DSL, no plugins,
410
+ no daemon, no config file, no web UI, no remote sync. Each one is
411
+ a failure mode the prior runtime exhibited that mu chose not to inherit.
412
+
413
+ 2. **"Pi+tmux is the benchmark" is the right comparison.** mu only
414
+ earns its complexity above the threshold where coordination
415
+ itself is the work — multiple agents, multiple checkouts, delayed
416
+ wakeups, recovery, approvals. Below that, a thin harness with
417
+ manual tmux is more transparent.
418
+
419
+ 3. **Schema-first + typed verbs + state cards is the right model UX
420
+ shape.** Not because we read a paper that said so, but because
421
+ independent reasoning from operators, engineers, architects, and
422
+ model-UX specialists converges on it.
423
+
424
+ ### What this still flags as gaps
425
+
426
+ The council's critique cuts mu too in places. The honest list:
427
+
428
+ - **Wakeups are polling-based.** `mu log --tail` polls every 1s.
429
+ Real subscriptions (SQLite update hooks) are deferred.
430
+ - **`--evidence` is grounding, not verification.** mu doesn't run the
431
+ tests. A future `--verify-by` mode that runs a command and records
432
+ its exit could deepen this; not built yet.
433
+ - **No idempotency keys on mutations.** Most ops are idempotent by
434
+ happenstance; not declared as part of the API contract.
435
+ - **No dry-run on most mutations.** Only `workstream destroy` has it.
436
+ - **No capability model.** The `role` field is stored on agent rows
437
+ but unused. No "reviewer-1 cannot delete tasks" enforcement.
438
+
439
+ Each is a known gap with a clear shape. None has yet caused enough
440
+ real-world friction to be promoted into planned work.