brainclaw 1.9.0 → 1.10.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (149)
  1. package/README.md +631 -499
  2. package/dist/brainclaw-vscode.vsix +0 -0
  3. package/dist/cli.js +18 -1
  4. package/dist/commands/code-map.js +129 -0
  5. package/dist/commands/codev.js +7 -0
  6. package/dist/commands/harvest.js +1 -1
  7. package/dist/commands/hooks.js +73 -73
  8. package/dist/commands/init.js +1 -1
  9. package/dist/commands/install-hooks.js +78 -78
  10. package/dist/commands/mcp-read-handlers.js +57 -14
  11. package/dist/commands/mcp.js +200 -13
  12. package/dist/commands/run-profile.js +3 -2
  13. package/dist/commands/switch.js +125 -93
  14. package/dist/commands/version.js +1 -1
  15. package/dist/core/agent-capability.js +19 -4
  16. package/dist/core/agent-files.js +131 -119
  17. package/dist/core/code-map/backend.js +123 -0
  18. package/dist/core/code-map/core.js +81 -0
  19. package/dist/core/code-map/drafts.js +2 -0
  20. package/dist/core/code-map/extractor.js +29 -0
  21. package/dist/core/code-map/finalizer.js +191 -0
  22. package/dist/core/code-map/freshness.js +108 -0
  23. package/dist/core/code-map/ids.js +0 -0
  24. package/dist/core/code-map/importable.js +35 -0
  25. package/dist/core/code-map/indexes.js +197 -0
  26. package/dist/core/code-map/lang/java/imports.scm +17 -0
  27. package/dist/core/code-map/lang/java/index.js +254 -0
  28. package/dist/core/code-map/lang/java/tags.scm +48 -0
  29. package/dist/core/code-map/lang/php/imports.scm +21 -0
  30. package/dist/core/code-map/lang/php/index.js +251 -0
  31. package/dist/core/code-map/lang/php/tags.scm +44 -0
  32. package/dist/core/code-map/lang/provider.js +9 -0
  33. package/dist/core/code-map/lang/providers.js +24 -0
  34. package/dist/core/code-map/lang/python/imports.scm +90 -0
  35. package/dist/core/code-map/lang/python/index.js +364 -0
  36. package/dist/core/code-map/lang/python/tags.scm +81 -0
  37. package/dist/core/code-map/lang/query-runtime.js +374 -0
  38. package/dist/core/code-map/lang/registry.js +125 -0
  39. package/dist/core/code-map/lang/typescript/imports.scm +90 -0
  40. package/dist/core/code-map/lang/typescript/index.js +306 -0
  41. package/dist/core/code-map/lang/typescript/tags.js.scm +106 -0
  42. package/dist/core/code-map/lang/typescript/tags.scm +151 -0
  43. package/dist/core/code-map/lock.js +210 -0
  44. package/dist/core/code-map/materialized.js +51 -0
  45. package/dist/core/code-map/memory-reader.js +59 -0
  46. package/dist/core/code-map/paths.js +53 -0
  47. package/dist/core/code-map/query.js +568 -0
  48. package/dist/core/code-map/refresh.js +0 -0
  49. package/dist/core/code-map/resolve.js +177 -0
  50. package/dist/core/code-map/store.js +206 -0
  51. package/dist/core/code-map/types.js +288 -0
  52. package/dist/core/code-map/vocabulary.js +57 -0
  53. package/dist/core/code-map/wasm-loader.js +294 -0
  54. package/dist/core/code-map/work-section.js +206 -0
  55. package/dist/core/codev-prompts.js +38 -38
  56. package/dist/core/codev-rounds.js +4 -0
  57. package/dist/core/default-profiles/doctor.yaml +11 -11
  58. package/dist/core/default-profiles/janitor.yaml +11 -11
  59. package/dist/core/default-profiles/onboarder.yaml +11 -11
  60. package/dist/core/default-profiles/reviewer.yaml +13 -13
  61. package/dist/core/dispatcher.js +1 -1
  62. package/dist/core/entity-operations.js +29 -3
  63. package/dist/core/execution-adapters.js +11 -10
  64. package/dist/core/execution-profile.js +58 -0
  65. package/dist/core/execution.js +1 -1
  66. package/dist/core/facade-schema.js +9 -0
  67. package/dist/core/instruction-templates.js +2 -0
  68. package/dist/core/loops/verbs.js +0 -1
  69. package/dist/core/mcp-command-resolution.js +3 -1
  70. package/dist/core/messaging.js +2 -2
  71. package/dist/core/protocol-skills.js +164 -164
  72. package/dist/core/runtime-signals.js +1 -1
  73. package/dist/core/search.js +19 -2
  74. package/dist/core/security-guard.js +207 -207
  75. package/dist/core/spawn-check.js +16 -2
  76. package/dist/core/staleness.js +1 -1
  77. package/dist/core/store-resolution.js +67 -11
  78. package/dist/core/worktree.js +18 -18
  79. package/dist/facts.js +9 -5
  80. package/dist/facts.json +8 -4
  81. package/dist/vendor/web-tree-sitter/tree-sitter.js +3980 -0
  82. package/dist/vendor/web-tree-sitter/tree-sitter.wasm +0 -0
  83. package/dist/wasm/tree-sitter-java.wasm +0 -0
  84. package/dist/wasm/tree-sitter-javascript.wasm +0 -0
  85. package/dist/wasm/tree-sitter-php.wasm +0 -0
  86. package/dist/wasm/tree-sitter-python.wasm +0 -0
  87. package/dist/wasm/tree-sitter-tsx.wasm +0 -0
  88. package/dist/wasm/tree-sitter-typescript.wasm +0 -0
  89. package/dist/wasm/tree-sitter.wasm +0 -0
  90. package/docs/PROTOCOL.md +1 -1
  91. package/docs/adapters/openclaw.md +43 -43
  92. package/docs/architecture/project-refs.md +328 -328
  93. package/docs/cli.md +2131 -2093
  94. package/docs/code-map.md +198 -0
  95. package/docs/concepts/coordination.md +52 -52
  96. package/docs/concepts/coordinator-runbook.md +129 -129
  97. package/docs/concepts/dispatch-lifecycle.md +245 -245
  98. package/docs/concepts/event-log-store.md +928 -928
  99. package/docs/concepts/ideation-loop.md +317 -317
  100. package/docs/concepts/loop-engine.md +520 -511
  101. package/docs/concepts/mcp-governance.md +268 -268
  102. package/docs/concepts/memory.md +84 -84
  103. package/docs/concepts/multi-agent-workflows.md +167 -167
  104. package/docs/concepts/observer-protocol.md +361 -361
  105. package/docs/concepts/plans-and-claims.md +217 -217
  106. package/docs/concepts/project-md-convention.md +35 -35
  107. package/docs/concepts/runtime-notes.md +38 -38
  108. package/docs/concepts/troubleshooting.md +254 -254
  109. package/docs/concepts/workspace-bootstrapping.md +142 -142
  110. package/docs/context-format-changelog.md +35 -35
  111. package/docs/context-format.md +48 -48
  112. package/docs/index.md +65 -65
  113. package/docs/integrations/agents.md +158 -158
  114. package/docs/integrations/claude-code.md +23 -23
  115. package/docs/integrations/cline.md +77 -77
  116. package/docs/integrations/continue.md +55 -55
  117. package/docs/integrations/copilot.md +68 -68
  118. package/docs/integrations/cursor.md +23 -23
  119. package/docs/integrations/kilocode.md +72 -72
  120. package/docs/integrations/mcp.md +385 -378
  121. package/docs/integrations/mistral-vibe.md +122 -122
  122. package/docs/integrations/openclaw.md +92 -92
  123. package/docs/integrations/opencode.md +84 -84
  124. package/docs/integrations/overview.md +115 -115
  125. package/docs/integrations/roo.md +71 -71
  126. package/docs/integrations/windsurf.md +77 -77
  127. package/docs/mcp-schema-changelog.md +364 -356
  128. package/docs/playbooks/integration/index.md +121 -121
  129. package/docs/playbooks/orchestration.md +37 -0
  130. package/docs/playbooks/productivity/index.md +99 -99
  131. package/docs/playbooks/team/index.md +117 -117
  132. package/docs/product/agent-first-model.md +184 -184
  133. package/docs/product/entity-model-audit.md +462 -462
  134. package/docs/product/positioning.md +86 -86
  135. package/docs/quickstart-existing-project.md +107 -107
  136. package/docs/quickstart.md +183 -183
  137. package/docs/release-maintenance.md +79 -79
  138. package/docs/reputation.md +52 -52
  139. package/docs/review.md +45 -45
  140. package/docs/security.md +212 -212
  141. package/docs/server-operations.md +118 -118
  142. package/docs/storage.md +106 -106
  143. package/package.json +86 -66
  144. package/docs/concepts/event-log-store-critique-A.md +0 -333
  145. package/docs/concepts/event-log-store-critique-B.md +0 -353
  146. package/docs/concepts/event-log-store-phase0-measurements.md +0 -58
  147. package/docs/concepts/event-log-store-proposal-A.md +0 -365
  148. package/docs/concepts/event-log-store-proposal-B.md +0 -404
  149. package/docs/concepts/identity-model-proposal.md +0 -371
@@ -1,928 +1,928 @@
1
- # Event-Log Store — Converged Design Spec
2
-
3
- > Synthesis (round 3) of ideation loop lop_3bf55b9492e0d96c, pln#543 step 1.
4
- > Distills proposal-A, proposal-B, and both cross-critiques. Where the two
5
- > round-2 VERDICT blocks agree, this spec follows them; where they diverge,
6
- > one option is chosen and the loser recorded in Appendix A. Status: SPEC,
7
- > product calls ARBITRATED (Juan, 2026-06-10 — see §6); C1–C4 resolved by
8
- > the symmetric schema review of 2026-06-10 (§2.1.1–2.1.4, §2.2, §2.10,
9
- > §2.11); residue R1–R4 + new product call J6 in §6, pending the Codex
10
- > second pass.
11
-
12
- ## 1. Motivation
13
-
14
- The 2026-06-10 review (zone 1) found that `src/core/event-log.ts` cannot
15
- serve as the store's source of truth in its current form: appends are
16
- swallowed on error (`appendFileSync` inside a catch-all — a journal that may
17
- silently drop writes is not a journal), events carry no payload (state is not
18
- reconstructible from the log), rotation at 10MB renames the file away and
19
- **deletes all reader cursors** (silent re-notification loss, history
20
- unreachable), and the only ordering key is a wall-clock timestamp (unreliable
21
- across agent shells, WSL, containers). Meanwhile every state mutation already
22
- serializes through the hardened store lock, and loops already run a
23
- payload-carrying journal (`loops/<id>/events.jsonl`) — the substrate and the
24
- precedent both exist. This spec evolves the event log into a write-ahead
25
- journal of full-entity snapshots, organized as immutable segments plus
26
- out-of-band checkpoints, with the existing per-entity JSON directories
27
- demoted to lazily reconciled projections (the pln#496 pattern).
28
-
29
- ## 2. Design
30
-
31
- ### 2.1 Event record format
32
-
33
- Each record is one JSONL line, zod-validated, envelope version `v: 2`
34
- (existing events are retroactively v1):
35
-
36
- ```jsonc
37
- {
38
- "v": 2,
39
- "seq": 18342, // store-global monotonic, assigned under lock
40
- "ts": "2026-06-10T14:03:22.114Z", // informational ONLY — never an ordering key
41
- "writer": "w_31416-9f3c2a", // pid + start-nonce (NOT agent name, NOT bare pid)
42
- "agent": "claude-code",
43
- "agent_id": "agt_...", // optional
44
- "user": "jberdah", // optional
45
- "action": "update", // EventAction union (see payload rule below)
46
- "item_type": "plan",
47
- "item_id": "pln_2290bc70",
48
- "entity_rev": 7, // per-entity monotonic revision
49
- "summary": "step 1 completed", // human-facing, optional
50
- "payload": { /* full post-image, schema-valid entity doc */ }
51
- }
52
- ```
53
-
54
- Normative rules:
55
-
56
- - **Payload = full entity snapshot** (post-image), never a diff. Required
57
- iff the action mutates a persisted entity; lifecycle/observability actions
58
- (`session_start`, notification verbs) are payload-free. The normative
59
- action-class → payload-requirement mapping is §2.1.1.
60
- - **Tombstones**: `action: "delete"`, payload omitted. No redundant
61
- `deleted` boolean. Per-item_type semantics in §2.1.3.
62
- - **`(seq, writer)` is the normative event identity.** Bare `seq` is an
63
- address, valid only where the lock guarantees held (see §2.2 anomaly
64
- handling). Federation idempotency keys, dedup, and the dup-seq reducer all
65
- use the pair.
66
- - **`entity_rev`** is a per-entity monotonic revision bumped on every event
67
- for that id, carried in the envelope (entity schemas untouched). It powers
68
- projection dirty-checks, the never-regress guard (§2.7), optimistic
69
- concurrency for future API writes, and is the local half of federation
70
- conflict detection.
71
- - **New event kinds** introduced by this spec: `checkpoint_ref` (§2.4),
72
- `journal_note` (§2.6), `seq_repair` (§2.2), `backfill` (§4). Normative
73
- schemas in §2.1.2.
74
- - **Writer identity** is pid + per-process random start-nonce. Pid reuse
75
- makes bare pid unreliable over a journal's lifetime; agent name is
76
- metadata, not identity.
77
- - **Max record size, enforced at write time**: payloads > 64 KB are
78
- externalized via `payload_ref` (§2.10); the *envelope line* hard-fails at
79
- 256 KB (with payload_ref no legitimate record approaches it). The cap is
80
- the tripwire that tells us when the snapshot-everywhere assumption expires
81
- (see falsifier, §2.8 — it fired on handoffs in phase 0, hence §2.10).
82
-
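The size rule above can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: the constant names, `checkRecordSize`, and the `payload_ref` field type are hypothetical, and only the envelope field names and the 64 KB / 256 KB limits come from the spec text.

```typescript
// Hypothetical names; the limits are the ones stated in the normative rules.
const PAYLOAD_INLINE_MAX = 64 * 1024; // payloads above this are externalized
const ENVELOPE_LINE_MAX = 256 * 1024; // the serialized line hard-fails above this

interface EventEnvelopeV2 {
  v: 2;
  seq: number;
  ts: string; // informational only, never an ordering key
  writer: string;
  agent: string;
  agent_id?: string;
  user?: string;
  action: string;
  item_type: string;
  item_id?: string;
  entity_rev?: number;
  summary?: string;
  payload?: unknown;
  payload_ref?: string; // assumption: the real shape is defined in §2.10
}

type SizeDecision =
  | { kind: "inline"; line: string }
  | { kind: "externalize"; payloadBytes: number };

function byteLength(text: string): number {
  return new TextEncoder().encode(text).length;
}

// Decide at write time whether a record can be appended inline.
function checkRecordSize(record: EventEnvelopeV2): SizeDecision {
  if (record.payload !== undefined) {
    const payloadBytes = byteLength(JSON.stringify(record.payload));
    if (payloadBytes > PAYLOAD_INLINE_MAX) {
      // The caller would store the payload out of line and retry with payload_ref.
      return { kind: "externalize", payloadBytes };
    }
  }
  const line = JSON.stringify(record);
  if (byteLength(line) > ENVELOPE_LINE_MAX) {
    throw new Error(`envelope line exceeds ${ENVELOPE_LINE_MAX} bytes`);
  }
  return { kind: "inline", line };
}
```

Measuring the serialized bytes rather than the string length keeps the check honest for non-ASCII content, which is an assumption about how the cap is meant to be counted.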
83
- #### 2.1.1 Action taxonomy → payload requirement (C1 resolution)
84
-
85
- The v2 `action` field extends today's `EventAction` union
86
- (`src/core/event-log.ts`, 34 members) with five journal-meta actions
87
- (`checkpoint_ref`, `journal_note`, `seq_repair`, `backfill`,
88
- `federation_apply`) and three progress-split verbs (`run_progress`,
89
- `assignment_amended`, `run_amended` — see the heartbeat/durable split
90
- below), and classifies every action into exactly one of five classes.
91
- `EventItemType` gains `journal` (journal-meta records) and, at registry
92
- unification (§4 phase 3), `loop`.
93
-
94
- **Encoding (R1 resolution, 2026-06-12):** the class is **never
95
- serialized** — `action` remains the only discriminant on the wire. Code
96
- carries one `ACTION_CLASS_BY_ACTION` table typed
97
- `satisfies Record<EventAction, ActionClass>` so adding a 35th action
98
- without classifying it is a compile error, and the zod schema is a
99
- discriminated union keyed on `action` (zod ≥ 4 accepts enum values per
100
- branch — 4.4.3 is the installed runtime dep). The phase-gated payload
101
- requiredness of registry-lifecycle (OPTIONAL until phase 1.5) is a
102
- runtime refinement selected by journal mode
103
- (`off | dual | primary | registryPrimary`), not a frozen schema variant —
104
- the wire format never changes across the cutover, only the validator
105
- strictness. A serialized `class` field was rejected: a derived persisted
106
- field is drift waiting to happen (same failure family as trp#180).
107
-
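As a hedged illustration of the encoding paragraph above: a table checked with `satisfies Record<EventAction, ActionClass>` turns an unclassified action into a compile error. The union below is a deliberately abbreviated subset (six actions, not the full list), and `classOf` is a hypothetical helper.

```typescript
type ActionClass =
  | "entity-state"
  | "tombstone"
  | "journal-meta"
  | "observability"
  | "registry-lifecycle";

// Abbreviated for the sketch; the real union has many more members.
type EventAction =
  | "create"
  | "update"
  | "delete"
  | "checkpoint_ref"
  | "session_start"
  | "run_amended";

// Adding a member to EventAction without adding a row here fails type-checking,
// and so does adding a row for an action that is not in the union.
const ACTION_CLASS_BY_ACTION = {
  create: "entity-state",
  update: "entity-state",
  delete: "tombstone",
  checkpoint_ref: "journal-meta",
  session_start: "observability",
  run_amended: "registry-lifecycle",
} as const satisfies Record<EventAction, ActionClass>;

// The class is derived in code on every read; it is never written to the wire.
function classOf(action: EventAction): ActionClass {
  return ACTION_CLASS_BY_ACTION[action];
}
```

Because the class is looked up rather than persisted, there is no stored field that could drift from the table.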
108
- | Class | Actions | `payload` | `item_id` | `entity_rev` |
109
- |---|---|---|---|---|
110
- | entity-state | `create`, `update`, `accept`, `reject`, `claim`, `release_claim`, `rollback`, `upgrade`, `backfill` | REQUIRED — versioned post-image (§2.1.4) or `payload_ref` (§2.10) | REQUIRED | REQUIRED, bumped |
111
- | tombstone | `delete` | FORBIDDEN | REQUIRED | REQUIRED, bumped |
112
- | journal-meta | `checkpoint_ref`, `journal_note`, `seq_repair`, `federation_apply` | REQUIRED — meta-schema per action (§2.1.2), never an entity post-image | FORBIDDEN (`item_type: "journal"`) | absent |
113
- | observability | `session_start`, `session_end`, `assignment_offered`*, `assignment_progress`†, `run_progress`† | FORBIDDEN | optional | absent |
114
- | registry-lifecycle | `assignment_created/accepted/started/completed/cancelled/failed/blocked/timed_out/expired/retrying/rerouted`, `assignment_amended`†, `run_amended`†, all other `run_*` (`run_running` = the transition into running, emitted once†) | OPTIONAL until registry families go journal-primary (J4 phase 1.5); REQUIRED post-image from then on | REQUIRED | absent until phase 1.5, then REQUIRED, bumped |
115
-
116
- \* `assignment_offered` is a status transition of the assignment doc and
117
- moves to registry-lifecycle at phase 1.5; until then it is notification-only.
118
- † See the heartbeat/durable-progress split below.
119
-
120
- Holes found by the adversarial enumeration, resolved as follows:
121
-
122
- - **`item_id` is optional in today's `MemoryEvent`.** v2 makes it REQUIRED
123
- for entity-state, tombstone, and registry-lifecycle records — a
124
- payload-carrying or rev-bumping record without an addressable entity is
125
- unreplayable and rejected at write time.
126
- - **Heartbeat vs durable progress (pre-P0 codex review, 2026-06-12).**
127
- Today's code conflates the two on a single verb:
128
- `recordAssignmentProgress` persists `status_reason` and **appends to
129
- `artifacts`** (durable, accumulating state) on the same path that bumps
130
- `last_heartbeat_at`, then emits `assignment_progress`
131
- (`assignments.ts:296-301`); `recordAgentRunProgress` likewise mutates
132
- `session_id`/`status_reason`/`artifacts` and re-emits `run_running` on
133
- every tick (`agentruns.ts:318-325`). If those verbs stayed
134
- heartbeat-class/no-replay, journal-primary replay would silently drop
135
- artifacts reported mid-run. Normative invariant: **any event reflecting
136
- a durable-field mutation must be replayable by the phase its family goes
137
- journal-primary.** Resolution, effective at phase 1.5 (no write-path
138
- code change before then):
139
- - `assignment_progress` and `run_progress` (new) are **pure ticks** —
140
- observability-class, payload FORBIDDEN, touching only ephemeral-class
141
- fields (`last_heartbeat_at`, `updated_at`, `last_event_at`); excluded
142
- from replay, exactly as §2.8 masks them.
143
- - `run_running` re-scopes to the status **transition into** running —
144
- registry-lifecycle, emitted once per transition, never as a tick.
145
- - A progress call that carries durable mutations (`status_reason`,
146
- `artifacts`, `session_id` binding) emits `assignment_amended` /
147
- `run_amended` instead — registry-lifecycle, REQUIRED post-image,
148
- `entity_rev` bumped. The write path splits on argument presence;
149
- existing tests that assert artifacts-on-progress move to the amended
150
- verbs with the phase 1.5 migration.
151
- - **Whole-store operations** (`rollback`, `upgrade` — today emitted once
152
- with `item_type: "state"`): in v2 the diff choke point (§2.8) emits them
153
- *per entity* (entity-state class, post-image each), plus one
154
- `journal_note` kind `store_marker` recording the store-level operation
155
- for audit. The coarse `item_type: "state"` event class disappears; the
156
- `state` item_type survives only inside `store_marker` notes.
157
- - **Compactor archival vs deletion**: archival removal emits `delete` with
158
- `summary: "archived"` (the archived copy lives outside live dirs and is
159
- not journal-visible). Restore emits `create` continuing the entity_rev
160
- counter. **`entity_rev` is per item_id and never resets**, including
161
- across delete→recreate — required by federation LWW (§2.11).
162
- - **Sessions stay observability-class.** `current_session` /
163
- `session_snapshot` docs remain projection-only (ephemeral-class, like
164
- heartbeats); if sessions ever need replay they move to entity-state, but
165
- nothing today consumes a replayed session.
166
-
167
- #### 2.1.2 Journal-meta record schemas (C1 resolution)
168
-
169
- All journal-meta records use `item_type: "journal"`, omit `item_id` and
170
- `entity_rev`, and carry a payload discriminated as follows:
171
-
172
- ```jsonc
173
- // checkpoint_ref — appended AFTER manifest fsync (§2.4)
174
- { "action": "checkpoint_ref", "item_type": "journal", "payload": {
175
- "file": "ckpt-00018000.json", // name under checkpoints/
176
- "sha256": "…", // hash of the manifest bytes
177
- "head_seq": 18342, // last seq the manifest covers
178
- "entities": 913, // live entity count
179
- "bytes": 481332,
180
- "blobs": ["…"] // payload_ref closure (§2.10); [] if none
181
- } }
182
-
183
- // journal_note — discriminated by payload.kind
184
- { "action": "journal_note", "item_type": "journal", "payload": {
185
- "kind": "torn_tail_adjudicated",
186
- "segment": "seg-00018000.jsonl",
187
- "byte_start": 104832, "byte_end": 105219,
188
- "sha256": "…" } } // hash of the adjudicated fragment
189
- { "payload": { "kind": "genesis", // phase-1 migration marker (§4)
190
- "migrated_from": "v1", "v1_events_parked": 17727,
191
- "backfill_count": 913, "tool_version": "…" } }
192
- { "payload": { "kind": "redaction", // J1 audit trail — doctor redact
193
- "segments": ["seg-00000001.jsonl"],
194
- "redacted": [{ "seq": 1234, "writer": "w_…" }],
195
- "reason": "…", "by": "…" } }
196
- { "payload": { "kind": "store_marker", // whole-store ops (§2.1.1)
197
- "op": "rollback", // "rollback" | "upgrade"
198
- "detail": "…" } }
199
-
200
- // seq_repair — tail-validation correction (§2.2)
201
- { "action": "seq_repair", "item_type": "journal", "payload": {
202
- "meta_next_seq": 18301, // stale value found in meta.json
203
- "tail_seq": 18342, // observed at the active-segment tail
204
- "repaired_next_seq": 18343 } }
205
-
206
- // federation_apply — local record of an applied remote slice
207
- // (required by identity-model-proposal §"local apply"; declared NOW so
208
- // the frozen v2 union needs no post-freeze extension — emitted only once
209
- // federation ships, inert until then)
210
- { "action": "federation_apply", "item_type": "journal", "payload": {
211
- "origin_id": "org_a1b2…", "origin_epoch": 3,
212
- "seq_range": [120, 184], // remote seqs covered by the slice
213
- "applied": 64, // records materialized locally
214
- "conflicts": 1, // LWW losers surfaced as candidates (§2.11)
215
- "slice_sha256": "…" } } // hash of the ingested slice bytes
216
- ```
217
-
218
- `backfill` is **entity-state class**, not journal-meta: normal envelope
219
- with `item_type`/`item_id`/`entity_rev`/payload. Genesis (§4 phase 1) = one
220
- `journal_note` kind `genesis` followed by one `backfill` per live entity
221
- with `entity_rev: 1`, all under a single lock hold. Doctor-initiated
222
- re-syncs reuse `backfill` with the entity's current rev + 1.
223
-
224
- #### 2.1.3 Tombstone semantics per item_type (C1 resolution)
225
-
226
- - Payload FORBIDDEN; `item_id` REQUIRED; `entity_rev` bumped. The rev
227
- counter survives deletion (§2.1.1 — never resets per item_id).
228
- - Projection unlink happens iff a tombstone is applied (§2.8). The
229
- never-unlink-unparseable guard **wins over the tombstone**: the file is
230
- preserved and the divergence is a *persistent, counted* doctor item —
231
- divergence-by-design, distinct from corruption.
232
- - Singleton item types (`state`, `session`) never tombstone —
233
- schema-forbidden; an encountered one is a doctor error.
234
- - Claims: lifecycle release is `release_claim` (entity-state, post-image
235
- with `status: released`); `delete` on a claim appears only from prune.
236
- - Archival is `delete` + `summary: "archived"` (§2.1.1), not a distinct
237
- action.
238
-
239
- #### 2.1.4 Payload schema versioning (C2 resolution)
240
-
241
- Decision: **version-in-payload + migration-on-replay, reusing the existing
242
- versioned-document registry** (`src/core/migration.ts`). No new envelope
243
- field.
244
-
245
- - Every entity payload — and every checkpoint post-image — is persisted
246
- exactly as its projection file is today: the document carries
247
- `schema_version` and is registered in the migration registry keyed by
248
- `VersionedDocumentType`.
249
- - Replay runs each payload through the same detect → stepwise-migrate →
250
- zod-validate path projections already use (`loadVersionedJsonFile`
251
- semantics). One mechanism, one registry, one set of migration tests —
252
- the journal adds zero new versioning machinery.
253
- - The envelope's `v: 2` governs ONLY envelope shape (seq/writer/action
254
- fields). Envelope and payload version independently.
255
- - **Migration-retention invariant (normative):** journal immutability makes
256
- migration paths load-bearing — a stepwise migration may never be deleted
257
- while any non-archived segment, or either of the **two** newest verified
258
- checkpoints, contains a payload at the pre-migration version. (Two, not
259
- one: the §2.4 fallback chain replays from the second-newest checkpoint,
260
- so the version floor is the state of the *second-newest* checkpoint.)
261
- Checkpoints rewrite post-images at current schema versions when written,
262
- so each checkpoint advances the floor; in the common case replay spans
263
- only the post-checkpoint tail (weeks of records, ≤ 1–2 schema versions).
264
- Archived segments may outlive migration paths: doctor warns
265
- "archive predates migration floor" rather than promising eternal
266
- replayability of archives.
267
- - **Replay validation failure** (unknown version / migration throws / zod
268
- fails): skip + count + doctor (the §2.6 mid-file rule). If the failed
269
- record is the entity's *newest*, the projection keeps its current content
270
- (never-regress, §2.7) and the entity is flagged divergent — rebuild never
271
- silently regresses to the previous snapshot.
272
-
273
- Alternatives rejected (recorded in Appendix A): per-record envelope
274
- schema-version (redundant — payloads self-describe); segment-rewrite
275
- migration (violates immutability and J1's audited-rewrite-only rule).
276
-
277
- ### 2.2 Seq and ordering
278
-
279
- - `seq` is store-global, monotonic, persisted as `next_seq` in
280
- `events/meta.json`, incremented **under the store mutation lock**. Every
281
- append — including observability events — takes the lock and gets a seq.
282
- There is no lockless append path and no `seq: null` record class (a
283
- lockless path races segment roll, and seq-less records are unaddressable
284
- by seq-watermark cursors).
285
- - **Timestamps never order anything.** `ts` is for humans and notification
286
- summaries.
287
- - **Tail validation at lock acquisition (normative):** before its first
288
- append, a writer reads the last record of the active segment and sets
289
- `next_seq = max(meta.next_seq, tail_seq + 1)`. If meta was behind, it
290
- appends a `seq_repair` event recording the correction. This re-derives
291
- truth from the journal (meta is a cache) and caps seq collisions to the
292
- single in-flight race write.
293
- - **Two writers are NOT impossible.** The lock can be broken on presumed
294
- owner death, and presumed death is fallible (pid-liveness false negatives
295
- on Windows, pid reuse). A duplicate `seq` from distinct writers is a
296
- **detected anomaly**: the reducer applies both records in file order
297
- (snapshot payloads make double-apply convergent — the later line wins
298
- wholesale), and doctor emits a warning. Detection via `(seq, writer)`;
299
- containment via tail validation above. The journal's two-writer story is
300
- only as rare as lock.ts's steal rate; the spec depends on lock.ts
301
- identifying owners by pid + random token (verified against today's
302
- `lockIsOwnedByCurrentProcess` — token-based, pid reuse alone cannot forge
303
- ownership).
304
- - **Dup-seq reducer semantics (normative, C1 resolution).** Replay
305
- processes records strictly in (segment, file-line) order — never sorted
306
- by seq. Collision cases:
307
- 1. Identical `(seq, writer)`, identical payload bytes → idempotent
308
- duplicate (e.g. ambiguous-retry residue): second occurrence skipped,
309
- doctor counter.
310
- 2. Identical `(seq, writer)`, different content → doctor **ERROR**
311
- (should be impossible — a writer never reuses its own seq); both
312
- applied in file order, later wins, entity flagged.
313
- 3. Same `seq`, different writers → the lock-steal anomaly above: both
314
- applied in file order, doctor **WARNING**.
315
- 4. `entity_rev` ties produced by case 3 on the same entity: later file
316
- order wins wholesale; the never-regress guard (§2.7) treats
317
- equal-rev-different-writer as a doctor-flagged overwrite, not a
318
- regression.
319
- 5. `entity_rev` *gaps* during replay (expected prev+1, observed larger):
320
- doctor warning (possible lost event) — snapshot payloads self-heal
321
- state, the counter records that history is incomplete.
322
- - **Scope boundary (stated so the assumption is visible when it breaks):**
323
- global-seq-under-lock welds event capture to lock availability. Sandboxed
324
- or worktree workers that cannot reach the store produce zero journal
325
- events until a sync point — the journal is the truth of the *store*, not
326
- of the *system*. The moment any roadmap item requires offline local event
327
- capture with later merge, this primitive is falsified and per-writer seqs
328
- + merge (the federation mechanism applied locally) become necessary.
329
- Until then, global seq costs zero new coordination and stays.
330
-
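Two of the rules in this section, tail validation and the dup-seq collision cases 1 to 3, can be sketched as pure functions. This is an illustration under assumptions: the function and type names are hypothetical, an empty active segment is modelled as `tailSeq === null` (a case the text does not spell out), and payload equality is approximated by comparing the serialized payload strings.

```typescript
interface SeqRepairPayload {
  meta_next_seq: number;
  tail_seq: number;
  repaired_next_seq: number;
}

// next_seq = max(meta.next_seq, tail_seq + 1); a repair payload is produced
// only when meta was behind the journal tail.
function resolveNextSeq(
  metaNextSeq: number,
  tailSeq: number | null,
): { nextSeq: number; repair?: SeqRepairPayload } {
  if (tailSeq === null || metaNextSeq > tailSeq) {
    return { nextSeq: metaNextSeq };
  }
  const repaired = tailSeq + 1;
  return {
    nextSeq: repaired,
    repair: { meta_next_seq: metaNextSeq, tail_seq: tailSeq, repaired_next_seq: repaired },
  };
}

interface SeenRecord {
  seq: number;
  writer: string;
  payloadJson: string; // serialized payload, used here as the "payload bytes"
}

type Collision =
  | "none"
  | "idempotent-duplicate" // case 1: second occurrence skipped, doctor counter
  | "same-writer-conflict" // case 2: doctor ERROR, later line wins
  | "lock-steal"; // case 3: doctor WARNING, both applied in file order

// Records are compared in file order; nothing is ever sorted by seq.
function classifyCollision(earlier: SeenRecord, later: SeenRecord): Collision {
  if (earlier.seq !== later.seq) return "none";
  if (earlier.writer !== later.writer) return "lock-steal";
  return earlier.payloadJson === later.payloadJson
    ? "idempotent-duplicate"
    : "same-writer-conflict";
}
```

With the values from the `seq_repair` example in §2.1.2, `resolveNextSeq(18301, 18342)` yields a next seq of 18343 and a repair payload carrying all three numbers.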
331
- ### 2.3 Segments and sealing
332
-
333
- Layout:
334
-
335
- ```
336
- .brainclaw/events/
337
- meta.json # next_seq + per-family last_applied_seq — rebuildable cache
338
- seg-00000001.jsonl # immutable once rolled; name = first seq it contains
339
- seg-00018000.jsonl # active segment (newest = append target)
340
- checkpoints/
341
- ckpt-00018000.json # self-contained state manifest (out-of-band, §2.4)
342
- quarantine/ # doctor-parked bytes only (offline repair, §2.6)
343
- archive/
344
- events.v1.jsonl # parked legacy notification log (never deleted)
345
- ```
346
-
347
- - Segments are **named by their first seq at birth and never renamed**.
348
- The active segment is simply the newest one. No rename means no Windows
349
- EPERM/EBUSY hazard, no retry protocol, no cursor invalidation. Locating
350
- seq N = directory listing + binary search by filename; no index file.
351
- - Roll when the active segment ≥ 10 MB: under the lock, write a checkpoint
352
- (§2.4), create the next segment, update `meta.json`. Rolled segments are
353
- immutable — an invariant that holds because **all** appenders take the
354
- lock and resolve the active segment inside it.
355
- - `meta.json` is a single small file (one read covers staleness checks for
356
- everything), rewritten atomically (temp+rename), and is a **rebuildable
357
- cache**: if missing or corrupt it is reconstructed from the segment
358
- listing plus a tail read of the last segment.
359
- - **Retention**: sealed segments are never auto-deleted. `gc` may move
360
- segments superseded by a *verified* checkpoint to `archive/`
361
- (park-don't-delete), but never past the **second-newest verified
362
- checkpoint** — the previous checkpoint must remain replayable as the
363
- fallback chain (§2.4).
364
- - **Support boundary**: journal correctness is guaranteed on local
365
- filesystems only (NTFS, ext4, APFS). O_APPEND atomicity does not hold on
366
- SMB/NFS. `bclaw doctor` performs best-effort (heuristic) detection of UNC
367
- paths and mapped network drives and warns; the boundary is documented,
368
- not silently assumed.
369
-
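The "directory listing + binary search by filename" lookup might look like the sketch below. The function names are hypothetical, and the `seg-<first seq>.jsonl` pattern is inferred from the layout shown above; non-segment entries such as `meta.json` are simply ignored.

```typescript
// Parse the first seq out of a segment filename, or null for other entries.
function firstSeqOf(name: string): number | null {
  const match = /^seg-(\d+)\.jsonl$/.exec(name);
  return match ? Number(match[1]) : null;
}

// Return the segment whose range contains `seq`: the last segment whose
// first seq is <= seq. Returns null when seq predates every listed segment.
function findSegmentForSeq(dirEntries: string[], seq: number): string | null {
  const segments = dirEntries
    .map((name) => ({ name, first: firstSeqOf(name) }))
    .filter((s): s is { name: string; first: number } => s.first !== null)
    .sort((a, b) => a.first - b.first);

  let lo = 0;
  let hi = segments.length - 1;
  let found = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (segments[mid].first <= seq) {
      found = mid; // candidate; keep looking for a later segment
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return found === -1 ? null : segments[found].name;
}
```

The lookup only narrows to a file: a seq beyond the journal head still resolves to the active segment, so the caller has to confirm the record exists by reading it.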
370
- ### 2.4 Checkpoints
371
-
372
- - A checkpoint is an **out-of-band, self-contained** manifest
373
- `checkpoints/ckpt-<seq>.json`: full post-images of every live entity at
374
- head seq ("self-contained" = manifest + its blob closure once
375
- `payload_ref` exists — §2.10). Never hashes referencing projection files — a checkpoint whose
376
- validity depends on projection integrity is useless in exactly the
377
- scenarios it exists for.
378
- - Written under the lock at segment roll (and on `bclaw doctor --compact`):
379
- write manifest → fsync → append a `checkpoint_ref` event to the journal
380
- carrying the checkpoint's **sha256** → update meta last. A crash leaves
381
- at worst an orphan manifest with no ref (harmless) — cursors never see
382
- checkpoint content, the seq space is not inflated, and rebuild needs no
383
- terminator-scanning.
384
- - **Verify before archive (normative):** a checkpoint must be fully
385
- re-parsed and schema-validated before any segment it supersedes moves to
386
- `archive/`. On checksum or parse failure at rebuild time, fall back to
387
- the previous checkpoint and replay more segments (guaranteed available by
388
- the two-checkpoint gc floor).
389
- - Rebuild cost is bounded: latest verified checkpoint + replay of segments
390
- after it (≤ ~10 MB tail in the common case).
391
-
392
- ### 2.5 Cursors
393
-
394
- - `AgentCursor` = `{last_seq, last_read}` — a **seq watermark**. Rotation,
395
- compaction, archival, and any future segment surgery cannot invalidate a
396
- watermark. Byte offsets are dead (they die under any file mutation,
397
- including the offline repairs in §2.6).
398
- - `readUnseenEvents(agent)` = binary-search the segment containing
399
- `last_seq + 1` by filename, stream forward across segments.
400
- - If the watermark predates the oldest non-archived segment, the reader
401
- gets `{gap: true}` plus a summary built from the latest checkpoint —
402
- notifications degrade gracefully; state rebuild never depended on them.
403
- - **Cursor key and self-exclusion.** Cursors are keyed today by agent
404
- *name*; the identity-model proposal re-keys them name → actor instance
405
- (its migration step 3 — one-time rename, cursors are caches). v2
406
- self-exclusion compares the record's `writer` (or actor id), never the
407
- display name: three same-name claude-code instances sharing one cursor
408
- and consuming each other's notifications was an observed incident
409
- (2026-06-10).
410
-
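A minimal sketch of the watermark lookup in §2.5, assuming each non-archived segment can be identified by the seq of its first record (the spec says the search is by filename; the exact filename format is not shown here, so `liveFirstSeqs` is a hypothetical stand-in for it).

```typescript
interface SegmentLookup {
  gap: boolean;  // true when the watermark predates the oldest live segment
  index: number; // index of the segment to start streaming from, or -1
}

// liveFirstSeqs: first seq of each non-archived segment, sorted ascending (assumed).
function locateSegment(liveFirstSeqs: number[], lastSeq: number): SegmentLookup {
  const target = lastSeq + 1;
  const oldest = liveFirstSeqs[0];
  if (oldest === undefined) {
    return { gap: false, index: -1 }; // no live segments, nothing to read
  }
  if (target < oldest) {
    return { gap: true, index: -1 }; // caller falls back to a checkpoint summary
  }
  // Binary search for the last segment whose first seq is <= target.
  let lo = 0;
  let hi = liveFirstSeqs.length - 1;
  while (lo < hi) {
    const mid = (lo + hi + 1) >> 1;
    const first = liveFirstSeqs[mid];
    if (first !== undefined && first <= target) {
      lo = mid;
    } else {
      hi = mid - 1;
    }
  }
  return { gap: false, index: lo };
}
```

Because the lookup depends only on seq values, rotation or archival changes which segments are listed but never invalidates the cursor itself.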
411
- ### 2.6 Append protocol, framing, torn tails
412
-
413
- - One record = **one single-buffer write** (`"\n" + JSON + "\n"`) to an fd
414
- opened append-only (O_APPEND / FILE_APPEND_DATA). The lock is the primary
415
- concurrency guarantee; single-write atomicity on local FS is the seatbelt
416
- for the lock-steal window.
417
- - The **leading `\n`** caps torn-write damage at exactly one event: if the
418
- previous append tore (no trailing newline), our leading newline
419
- terminates the fragment as its own malformed line instead of letting our
420
- valid record be absorbed into it.
421
- - **Short-write check**: `bytesWritten !== buffer.length` ⇒ throw inside
422
- `mutate()`; the mutation fails loudly before any projection write.
423
- - **Append failures are loud.** The current error swallow is removed for v2
424
- state events: a failed journal append is a failed mutation.
425
- - Reader rules (normative):
426
- 1. Split on `\n`, skip empty lines.
427
- 2. A mid-file line failing parse or schema validation: skip, count,
428
- surface via doctor — never silently (trp_d5595086).
429
- 3. A torn **tail** (final line, unparseable or missing trailing `\n`) is
430
- expected crash residue: skip it. This is correct even when the torn
431
- line *parses* validly — journal-first + fsync-before-projection (§2.7)
432
- means an unconfirmed tail can always be dropped, because the caller
433
- was never told "ok".
434
- - **No hot-path rewrites, ever.** The journal is append-only; nothing
435
- truncates or moves bytes during normal operation. When a writer (under
436
- lock, before appending) detects a torn tail, it appends a `journal_note`
437
- event recording the fragment's segment, byte range, and content hash as
438
- *adjudicated*. Doctor counts adjudicated fragments separately from
439
- unexplained mid-file corruption — benign crash residue never raises a
440
- permanent alarm (alarm fatigue is how real corruption later slips
441
- through). Physical excision of damaged bytes into `quarantine/` exists
442
- only as an **offline doctor repair** (doctor holds the lock, no
443
- concurrent appender, parks bytes, never deletes).
444
-
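The framing and reader rules of §2.6 can be illustrated on an in-memory string. This is a sketch under assumptions: `frameRecord` and `readJournal` are hypothetical names, and `JSON.parse` stands in for "parse plus schema validation". It performs no file I/O, locking, or fsync.

```typescript
interface JournalRead {
  records: unknown[];
  skippedMidFile: number; // malformed mid-file lines, to be surfaced via doctor
  tornTail: boolean;      // expected crash residue at the end of the file
}

// One record = one buffer: leading newline, JSON, trailing newline.
function frameRecord(record: object): string {
  return "\n" + JSON.stringify(record) + "\n";
}

function readJournal(text: string): JournalRead {
  const result: JournalRead = { records: [], skippedMidFile: 0, tornTail: false };
  const endsWithNewline = text.length > 0 && text.charAt(text.length - 1) === "\n";
  const lines = text.split("\n"); // rule 1: split on "\n"
  let lastNonEmpty = -1;
  for (let i = lines.length - 1; i >= 0; i--) {
    if (lines[i] !== "") { lastNonEmpty = i; break; }
  }
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    if (line === undefined || line === "") continue; // rule 1: skip empty lines
    const isTail = i === lastNonEmpty;
    if (isTail && !endsWithNewline) {
      // rule 3: an unterminated tail is dropped even if it would parse
      result.tornTail = true;
      continue;
    }
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch (err) {
      if (isTail) result.tornTail = true; // rule 3: unparseable final line
      else result.skippedMidFile += 1;    // rule 2: skip and count, never silent
      continue;
    }
    result.records.push(parsed);
  }
  return result;
}
```

Appending a framed record after a torn fragment shows why the leading newline matters: the fragment becomes its own malformed line and the new record still parses.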
445
- ### 2.7 Durability (fsync) and the journal-first invariant
446
-
447
- - **Write order inside `mutate()` (the single most important invariant):**
448
-
449
- ```
450
- append v2 event(s) → fsync journal fd → write projection files → bump watermark in meta
451
- ```
452
-
453
- Program-order journal-first is fiction without a barrier: the OS may
454
- persist later projection writes before earlier journal appends, yielding
455
- a projection *from the future* that the journal cannot explain — which a
456
- reconciler would then wrongly regress (silent data loss).
457
- - **Default: one `fsync` per `mutate()` call** — after the last append,
458
- before any projection write. Mutations are human-action frequency, not
459
- hot-loop; one fsync each is affordable on NTFS. Config escape hatch
460
- `store.journal.fsync: "mutation" | "never"`; **CI and tests run the prod
461
- default** (fidelity over speed, per the test-env-contamination history).
462
- - **Never-regress guard (defense in depth — fsync can be configured off):**
463
- the reconciler refuses to overwrite a projection with replayed state that
464
- is *older* (lower `entity_rev`) than what the projection holds; a
465
- regressing mismatch is a doctor error, not a write.
466
-
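As a rough illustration of the never-regress guard, the reconciler's decision reduces to a comparison of `entity_rev` values. The function name and the three outcome labels below are hypothetical; the equal-rev branch follows the ephemeral-field rule stated in §2.8.

```typescript
type ReconcileDecision = "apply" | "keep" | "doctor_error";

// projectionRev: entity_rev currently held by the projection file, or null if absent.
// replayedRev: entity_rev of the state produced by journal replay.
function decideReconcile(
  projectionRev: number | null,
  replayedRev: number
): ReconcileDecision {
  if (projectionRev === null) return "apply";          // nothing to regress
  if (replayedRev > projectionRev) return "apply";     // fill the gap forward
  if (replayedRev === projectionRev) return "keep";    // equal rev is never overwritten
  return "doctor_error";                               // older replayed state: report, do not write
}
```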
467
- ### 2.8 Projections and event emission
468
-
469
- - Projections are exactly today's per-entity JSON files — atomic,
470
- pretty-printed, git-diffable. They remain the store's human-readable and
471
- MCP-cheap representation.
472
- - **Staleness check is O(1)**: read `meta.json`, compare per-family
473
- `last_applied_seq` to `next_seq - 1`. Equal (the overwhelmingly common
474
- case) → serve projection files directly; the MCP worker-per-call fresh
475
- path adds one small file read. Behind → acquire the lock, replay only the
476
- gap onto the projection files, bump the watermark, serve. pln#496 lazy
477
- reconcile; no daemon.
478
- - **Lock contended** → serve the stale projection annotated `stale: true`
479
- rather than block; whoever wins the lock heals once (no thundering herd
480
- of identical reconciles). Whether claim-class entities may be served
481
- stale is a Juan call (§6).
482
- - **Emission = diff synthesis at the persist choke point, permanently,
483
- plus verb-site intent annotation.** `persistStateUnlocked` computes an
484
- id-level diff (created / changed / removed) against the loaded state and
485
- synthesizes snapshot events — a single choke point provably consistent
486
- with what was persisted, immune to call-site drift. To preserve verb
487
- semantics (`claim` vs `update` vs `complete` — consumed by notifications
488
- and federation signaling), verb sites declare
489
- `(action, item_type, item_id, summary)` into the in-flight mutation
490
- context (today's ~30 `appendEvent` call sites already pass exactly these
491
- fields; they redirect to the context instead of the legacy stream); the
492
- diff supplies the payload and emits any *unannotated* change as `update`
493
- plus a doctor counter. There is **no migration to explicit call-site
494
- event emission** — explicit emission is justified only for registries
495
- that never pass through `State` (assignments/runs/loops), and those reuse
496
- the same append+project primitive.
497
- - **Deletion authority** (journal-primary mode): a projection file is
498
- unlinked only when a tombstone for its id is applied. "Absent from
499
- in-memory state" stops being a deletion signal — closing the
500
- trp_d5595086 bug class structurally. The never-unlink-unparseable guard
501
- carries over on the projection side. The **same id-level diff** that
502
- synthesizes events is what drives the unlink — one diff, two consumers,
503
- cannot disagree (today's `deleteMissing` path and event emission are
504
- separate code; v2 fuses them). The coarse `agent: "system"` /
505
- `item_type: "state"` ping today's `persistStateUnlocked` appends is
506
- replaced by the per-entity diff events.
507
- - **Heartbeat-class churn is never journaled.** Refresh/liveness field
508
- updates (claim `expires_at` extensions, run `last_heartbeat_at`,
509
- assignment `last_progress`, lock metadata) are ephemeral —
510
- projection/registry layer only. Only lifecycle *transitions* (claimed,
511
- released, completed, failed) are events. Without this rule, 20 agents ×
512
- 30 s heartbeats × 2 KB snapshots ≈ 115 MB/day of journal for zero
513
- information.
514
- - **Ephemeral-field masking (normative consequence).** Ephemeral fields
515
- mutate projections without journal events or `entity_rev` bumps, so a
516
- projection can legitimately differ from replayed state at *equal* rev.
517
- Therefore: the reconciler never overwrites a projection at equal rev (it
518
- only fills gaps forward), `doctor --verify-journal` masks the ephemeral
519
- field set per item_type before diffing, and the §2.8 diff synthesizer
520
- emits **no event** for ephemeral-only changes. The ephemeral field set
521
- is declared once per schema — a single source consumed by all three. **Falsifier (phase 0 deliverable):** from the dogfood
522
- store's 17k v1 events, compute per-item_type p95 entity size × event
523
- frequency; instrument event bytes by action class during the dual-mode
524
- sprint. If any non-heartbeat class exceeds ~50% of journal bytes, or any
525
- record would exceed 64 KB, that type needs `payload_ref` or a delta
526
- format in phase 1, not deferred.
527
-
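The read-path decision in §2.8 (fresh, reconcile, or serve stale) can be sketched as a pure function over the meta cache. The `StoreMeta` shape is an assumption made for this example: the spec names `next_seq` and a per-family `last_applied_seq`, but not their exact layout in `meta.json`. The sketch also ignores any special handling for claim-class entities.

```typescript
// Assumed layout of the fields the staleness check reads from meta.json.
interface StoreMeta {
  next_seq: number;
  last_applied_seq: { [family: string]: number };
}

type ReadPlan = "serve_fresh" | "reconcile_then_serve" | "serve_stale";

function planRead(meta: StoreMeta, family: string, lockAvailable: boolean): ReadPlan {
  const head = meta.next_seq - 1;
  const applied = meta.last_applied_seq[family];
  const watermark = applied === undefined ? 0 : applied; // unseen family: nothing applied yet
  if (watermark === head) {
    return "serve_fresh"; // common case: projection files are current
  }
  // Behind: whoever gets the lock replays the gap; everyone else serves stale.
  return lockAvailable ? "reconcile_then_serve" : "serve_stale";
}
```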
528
- ### 2.9 Locking interplay
529
-
530
- - The journal lives **inside** the existing `mutate()` critical section. No
531
- new lock protocol. Seq assignment, appends, fsync, projection writes, and
532
- the watermark bump all happen under the one store lock, journal-first.
533
- - Lock-steal residual (a breaker briefly coexisting with a
534
- stale-but-alive owner) is handled by detection + containment (§2.2), not
535
- denied. The phrase "impossible by construction" is banned.
536
- - Lock-hold growth (fsync + reconciling readers) is instrumented, not
537
- assumed away: the phase-1 dual sprint records lock wait-time
538
- distribution. **Falsifier:** p95 lock wait > ~200 ms under normal
539
- multi-agent load falsifies global-seq-under-lock and forces the
540
- per-writer-journal redesign (§2.2 boundary). Note for the instrumented
541
- baseline: today's `persistStateUnlocked` already runs a **git commit**
542
- (`commitMemoryChange`) inside the critical section — the pre-existing
543
- dominant lock-hold term. The fsync the journal adds is marginal against
544
- it; attribute wait-time per phase (append/fsync/projection/git) so the
545
- falsifier indicts the right component.
546
- - Federation imports must chunk: a 10k-event pull takes and releases the
547
- lock per chunk rather than starving local agents.
548
-
549
- ### 2.10 Oversized payloads — `payload_ref` and the handoff diet (C3 resolution)
550
-
551
- The phase-0 measurement (`event-log-store-phase0-measurements.md`) fired
552
- the §2.8 falsifier: handoff entities are p50 109,700 B / p95 225,157 B —
553
- roughly 1.7× the 64 KB threshold at p50 and 3.4× at p95, while every other
554
- item_type sits at p95 ≤ 7.5 KB. Per C3's own rule this enters **phase 1**;
555
- the record format ships with it. Two composable mechanisms, **both
556
- adopted**:
557
-
558
- 1. **Handoff diet (primary fix).** The dominant bytes are the inline
559
- `snapshot.diff` (same root cause as the 41 MB
560
- `handoffs/compacted.jsonl`). Externalize `snapshot.diff` from the
561
- handoff document to a content-addressed attachment under
562
- `events/blobs/` referenced by hash; the handoff entity returns to the
563
- 2–8 KB class every other entity lives in. One move fixes the journal
564
- record size, checkpoint size, J2's git posture, and the legacy
565
- compacted.jsonl pathology. The schema change rides the existing
566
- migration registry (§2.1.4). Product call J6 (§6) confirms portability
567
- implications.
568
- 2. **`payload_ref` (permanent safety net).** If a serialized payload
569
- exceeds 64 KB, the writer stores it at
570
- `events/blobs/<sha256[0:2]>/<sha256>` (content-addressed, write-once)
571
- and the record carries `payload_ref: { sha256, bytes }` *instead of*
572
- `payload`. Readers resolve transparently; a missing or hash-mismatched
573
- blob is a doctor **ERROR** for that entity — never silent.
574
- - **Blob-before-ref ordering (normative):** the blob is written and
575
- fsync'd *before* the journal append that references it — the §2.7
576
- barrier extended one link left. A crash between the two leaves an
577
- orphan blob (harmless, gc-able), never a dangling ref.
578
- - **Checkpoint closure:** checkpoints store oversized post-images as
579
- the same `payload_ref` (manifests stay small); the
580
- `checkpoint_ref.payload.blobs` list (§2.1.2) enumerates the closure.
581
- "Self-contained" (§2.4) is redefined as *manifest + blob closure*;
582
- verify-before-archive verifies the manifest hash AND presence + hash
583
- of every blob in the closure.
584
- - **Blob gc:** park-don't-delete. A blob moves to `archive/blobs/` only
585
- when referenced by zero records in non-archived segments AND by
586
- neither of the two newest verified checkpoints' closures — the §2.3
587
- floor extended verbatim.
588
- - **Redaction closure (J1 × `payload_ref`, normative — resolves the
589
- blocking half of R2, 2026-06-12):** `doctor redact` of a record whose
590
- payload lives in a blob must also **delete the blob** (true erasure —
591
- the one exception to park-don't-delete; an erasure request is not
592
- satisfied by parking) AND regenerate any checkpoint whose closure
593
- references it *before* the redaction completes — manifest rewritten
594
- minus the redacted post-image, re-verified, the stale checkpoint
595
- parked. The redaction `journal_note` (§2.1.2) lists rewritten
596
- checkpoints alongside segments. Invariant: after `doctor redact`
597
- returns, no live segment, no `archive/blobs/` entry, and no
598
- checkpoint closure can yield the redacted bytes. The *federation*
599
- half of R2 (peer re-presenting a pre-redaction record or checkpoint)
600
- stays open in §6 — it cannot be closed before the federation
601
- transport exists.
602
- - **Git (J2 boundary):** `events/blobs/` is gitignored like segments.
603
- With the diet in place no live entity ships an oversized payload, so
604
- bare-clone restorability from projections + checkpoints holds in
605
- practice; doctor flags any checkpoint whose closure references a
606
- gitignored blob as not-clone-restorable. This becomes a real product
607
- trade-off only if J6 rejects the diet.
608
-
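A small sketch of the `payload_ref` decision and blob path layout described above. It assumes 64 KB means 65,536 bytes and that the caller has already serialized the payload and computed its sha256; `chooseBody` and `blobPath` are hypothetical helper names, and the blob write, fsync, and journal append are left out.

```typescript
const PAYLOAD_REF_THRESHOLD_BYTES = 64 * 1024; // assumption: 64 KB = 65,536 bytes

interface PayloadRef {
  sha256: string;
  bytes: number;
}

type RecordBody<T> = { payload: T } | { payload_ref: PayloadRef };

// events/blobs/<sha256[0:2]>/<sha256>
function blobPath(sha256Hex: string): string {
  if (!/^[0-9a-f]{64}$/.test(sha256Hex)) {
    throw new Error("expected a lowercase hex sha256 digest");
  }
  return "events/blobs/" + sha256Hex.slice(0, 2) + "/" + sha256Hex;
}

// serializedBytes and sha256Hex describe the serialized payload (computed by the caller).
function chooseBody<T>(payload: T, serializedBytes: number, sha256Hex: string): RecordBody<T> {
  if (serializedBytes > PAYLOAD_REF_THRESHOLD_BYTES) {
    // The blob at blobPath(sha256Hex) must be written and fsync'd before
    // the record carrying this reference is appended.
    return { payload_ref: { sha256: sha256Hex, bytes: serializedBytes } };
  }
  return { payload: payload };
}
```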
609
- Residual falsifier follow-up: `runtime_note`/`session` event *count* (10k
610
- of 17.7k v1 events) is volume, not bytes — both classes are payload-free
611
- in v2 (observability), so they contribute line overhead only (~2–3 MB at
612
- historical rates) and do not threaten the weekly-roll target. No per-class
613
- retention knob needed ahead of J5.
614
-
615
- ### 2.11 Federation conflict primitive (C4 resolution)
616
-
617
- Cross-checked against `identity-model-proposal.md` (origin-partitioned
618
- write authority; scalar `entity_rev` + origin tag;
619
- `(origin_id, origin_epoch, seq)`-headed slices — the epoch handles
620
- restore-from-backup, see the proposal). Both symmetric reviews attacked the
621
- same concurrent-edit hole independently and produced two complementary
622
- detection mechanisms; this section reconciles them (coordinator synthesis
623
- 2026-06-11, flagged for Codex adjudication in §6 R-C4).
624
-
625
- - **Execution entities** (claims, runs, locks, assignments): single-writer
626
- per origin; other origins materialize read-only. Authority partition
627
- means no concurrent-write conflict exists; the scalar is trivially
628
- sufficient. (The advisory cross-machine claim race is *arbitration*, not
629
- journal conflict — deferred to the cloud dispatcher per the proposal.)
630
- - **Memory entities**: LWW ordered by the total order
631
- `(entity_rev, origin_id)` — rev first, origin_id lexicographic as the
632
- deterministic tiebreak. **No wall clock anywhere** (the "LWW by what
633
- clock?" answer: by revision counter + origin id, never time). Convergent:
634
- every origin applying the same record set reaches the same head.
635
- - **The attack (resolution ≠ detection):** origin A edits entity e
636
- rev 7→8→9; origin B, offline, edits e 7→8. B's slice reaches A after A
637
- is at rev 9. *Resolution* is correct (9 > 8, deterministic LWW). But
638
- *detection* — the proposal's "conflicts surface as candidates, never
639
- silent overwrite" — cannot be decided from head comparison: B's rev-8 is
640
- concurrent with A's lineage, not an ancestor of it, and a bare scalar
641
- head cannot distinguish "stale copy of what I already incorporated" from
642
- "divergent edit with a lower rev".
643
- - **Adopted detection rules (two, complementary — reconciled with the
644
- identity proposal's hardened model):**
645
- 1. **PRIMARY — `base_rev` fast-forward check** (from the identity
646
- proposal, post-review): every *exported* memory-entity record carries
647
- `base_rev`, the rev the write was based on. Receiver rule: incoming is
648
- a clean fast-forward iff `incoming.base_rev >= current.rev`; otherwise
649
- the write was concurrent → LWW materializes the winner AND a conflict
650
- candidate carries both post-images. One integer per exported record,
651
- decided from the record alone — **independent of local history
652
- retention**, so it survives gc/compaction and works on a fresh
653
- materialize.
654
- 2. **DEFENSE-IN-DEPTH — `(rev, origin)` journal collision at replay**:
655
- import replays the incoming slice through the reducer; an incoming
656
- record whose `(item_id, entity_rev)` already exists locally **with a
657
- different origin** is a conflict (the §2.2 dup-detection generalized
658
- from `(seq, writer)`). Catches legacy/foreign slices lacking
659
- `base_rev` and cross-checks rule 1, at zero envelope cost — but only
660
- reaches back to the gc floor.
661
- In the attack above, both rules fire: B's record has `base_rev 7 <
662
- current rev 9` (rule 1) and B's (e, 8) collides with A's journaled
663
- (e, 8) (rule 2) → candidate surfaced while LWW keeps A's rev 9.
664
- - **Residual miss-window, now narrow:** only a record that *lacks*
665
- `base_rev` (legacy exporter) AND whose colliding rev is archived past
666
- the gc floor escapes surfacing — convergence still never breaks.
667
- The per-origin high-watermark map (a bounded vector clock, size =
668
- origin count, typically ≤ 3) remains the **named upgrade path** if dogfooding
669
- shows missed candidates; it slots into import metadata without touching
670
- the envelope, which stays origin-agnostic (origin appears only in
671
- exported slice headers, per the proposal's migration step 2).
672
- - **Cross-requirement flowing back to the identity proposal:**
673
- `entity_rev` must never reset per item_id — tombstone → recreate
674
- continues the counter (§2.1.1/§2.1.3) — otherwise (rev, origin)
675
- collisions become false positives after delete→recreate races.
676
-
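The LWW order and detection rule 1 from §2.11 can be sketched as follows, replaying the attack scenario (A at rev 9, B's offline rev 8 based on rev 7). The type and function names are hypothetical, and rule 2 (journal collision at replay) is not modeled because it needs local history.

```typescript
interface Versioned {
  entity_rev: number;
  origin_id: string;
}

interface IncomingRecord extends Versioned {
  base_rev: number; // the rev the exporting origin based its write on
}

// Total order (entity_rev, origin_id): rev first, origin_id lexicographic tiebreak.
function compareLww(a: Versioned, b: Versioned): number {
  if (a.entity_rev !== b.entity_rev) return a.entity_rev - b.entity_rev;
  if (a.origin_id < b.origin_id) return -1;
  if (a.origin_id > b.origin_id) return 1;
  return 0;
}

interface ImportOutcome {
  winner: Versioned;
  conflictCandidate: boolean; // true when the write was concurrent (rule 1)
}

function applyIncoming(current: Versioned, incoming: IncomingRecord): ImportOutcome {
  const fastForward = incoming.base_rev >= current.entity_rev;
  const winner = compareLww(incoming, current) > 0 ? incoming : current;
  // Resolution is always LWW; detection is decided separately from base_rev.
  return { winner: winner, conflictCandidate: !fastForward };
}
```

No timestamp appears anywhere in the comparison, which is what keeps the result identical on every origin that applies the same record set.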
677
- ## 3. Failure-mode matrix
678
-
679
- | # | Scenario (round-2 attack) | Mitigation in this spec |
680
- |---|---|---|
681
- | 1 | Crash mid-append (torn tail) | Leading-`\n` framing caps loss at 1 event; reader skips tail; next writer appends adjudicating `journal_note`; doctor counts adjudicated residue separately from corruption (§2.6) |
682
- | 2 | Torn line that parses validly | Dropped anyway: journal-first + fsync means an unconfirmed tail was never acknowledged to the caller (§2.6 rule 3) |
683
- | 3 | Crash between append and projection write | Projection stale, never ahead (fsync barrier §2.7); lazy reconcile heals forward on next read |
684
- | 4 | Projection from the future (no-fsync reorder) | One fsync per mutate before projection writes; never-regress guard keyed on `entity_rev` as second line (§2.7) |
685
- | 5 | Two writers in the lock-steal window | O_APPEND seatbelt prevents byte interleaving; duplicate seq detected via `(seq, writer)`, applied in file order (snapshot double-apply is convergent), doctor warns (§2.2) |
686
- | 6 | Seq counter corruption outliving the race (both writers rewrite meta, loser's bump lost, third writer reuses seq) | Tail validation at lock acquisition: `next_seq = max(meta, tail+1)` + `seq_repair` event; meta is a rebuildable cache, the journal is truth (§2.2) |
687
- | 7 | Lockless appender writes into a just-rolled "immutable" segment | No lockless path exists; all appends take the lock and resolve the active segment inside it (§2.2, §2.3) |
688
- | 8 | Crash mid-checkpoint | Out-of-band manifest; worst case orphan file with no `checkpoint_ref` (harmless); meta written last (§2.4) |
689
- | 9 | Corrupt checkpoint discovered after segments archived | Verify-by-full-re-parse before archival; sha256 in `checkpoint_ref`; previous-checkpoint fallback; gc floor = second-newest verified checkpoint (§2.4) |
690
- | 10 | Oversized record exits the O_APPEND atomicity envelope | Payloads > 64 KB externalized via `payload_ref` (§2.10); envelope line hard-fails at 256 KB (§2.1) |
691
- | 11 | Partial `write()` (signal, ENOSPC, quota) | Short-write check ⇒ loud mutation failure before projections (§2.6) |
692
- | 12 | Rotation/sealing during concurrent read | Segments never renamed; active segment is just the newest file; seq watermarks survive any layout change (§2.3, §2.5) |
693
- | 13 | Cursor predates archived history | `{gap: true}` + checkpoint-built summary; graceful notification degradation (§2.5) |
694
- | 14 | Clock skew / ts collision | Irrelevant — ts never orders (§2.2) |
695
- | 15 | 100k-event store cold read | Fresh path O(1) check + projection read; stale path replays only the gap; rebuild bounded by latest checkpoint (§2.4, §5) |
696
- | 16 | `meta.json` corrupt/lost | Rebuilt from segment listing + tail read — it is a cache, not truth (§2.3) |
697
- | 17 | Heartbeat churn floods segments (20-agent scale) | Heartbeat-class updates excluded from the journal by rule; volume falsifier instrumented (§2.8) |
698
- | 18 | Store on a network mount | Documented local-FS-only support boundary; doctor warns heuristically (§2.3) |
699
- | 19 | Wedged lock = no event capture; sandboxed workers can't append | Stated scope boundary: journal is truth of the store, not the system; offline capture falsifies the primitive and triggers the per-writer redesign (§2.2) |
700
- | 20 | Mid-file malformed line (should be impossible under lock) | Skip + count + doctor alarm (unexplained-corruption class), never silent (§2.6) |
701
- | 21 | Crash between blob write and referencing append | Blob-before-ref ordering: worst case an orphan blob (harmless, gc-able), never a dangling `payload_ref` (§2.10) |
702
- | 22 | `payload_ref` blob missing or hash-mismatched at read | Doctor ERROR for that entity, read fails loudly — never silent (§2.10) |
703
- | 23 | Replay diff flags ephemeral-only field drift as divergence | Ephemeral field set masked per item_type in verify-journal and the reconciler; equal-rev projections never overwritten (§2.8) |
704
-
705
- ## 4. Migration plan
706
-
707
- Flag: `store.journal_v2: off | dual | primary` (default `off`). Each phase
708
- ships dark behind the flag; this repo's own store (~17k v1 events of real
709
- multi-agent traffic) is the canary. A `.brainclaw/` backup is taken at every
710
- phase flip (upgrade-style, park-don't-delete).
711
-
712
- - **Phase 0 — format, no behavior change.** Land the v2 record schema (zod),
713
- segment reader/writer, meta cache, doctor counters,
714
- max-record-size enforcement, and the **snapshot-size falsifier
715
- measurement** (§2.8). v1 `events.jsonl` untouched.
716
- - **Phase 1 — `dual`: journal-first dual-write.** One-shot
717
- `bclaw migrate journal`: backup store; emit a **genesis backfill** — one
718
- `backfill` snapshot event per current entity, built from the projection
719
- files (the only truth we have; the 17k payload-less v1 events are not
720
- translatable — parked to `events/archive/events.v1.jsonl`, readable
721
- forever for forensics); initialize meta. `persistStateUnlocked` reorders
722
- to append → fsync → existing file writes → watermark. Notifications
723
- switch to seq-watermark cursors. State dirs remain authoritative.
724
- Phase 1 also lands `payload_ref` + the handoff diet (§2.10) — the
725
- phase-0 falsifier fired on handoffs, so the record format ships with
726
- both.
727
- **Rollback:** set `off` — projection files were written on every mutation
728
- in exactly today's format; park `events/`; zero data transformation in
729
- either direction.
730
- - **Phase 2 — verification (promotion gate).**
731
- `bclaw doctor --verify-journal` rebuilds state from
732
- checkpoint + journal in a temp dir and diffs against live projections —
733
- the only check that validates the actual claim ("the journal is
734
- sufficient to reproduce state"). Runs in CI on **both OS families**,
735
- alongside: kill-9 storm tests (crash between append and projection must
736
- always converge), a two-process append stress test (N children × K
737
- events; assert no interleaved bytes, no lost `(seq, writer)` pairs), and
738
- the tail-validation test. Doctor counters (skipped lines, torn tails,
739
- adjudicated fragments, unannotated-diff emissions, network-FS warning)
740
- run always-on as continuous telemetry. **Exit criterion:** zero
741
- divergence across a full dogfooding sprint of real multi-agent traffic,
742
- including dispatch worktree churn; lock wait-time distribution recorded
743
- (§2.9 falsifier).
744
- - **Phase 3 — `primary`.** Reads serve projections via lazy reconcile;
745
- deletion authority moves to tombstones; `mutateState` callers unchanged.
746
- Then per-entity ops: single-entity mutations append + patch one
747
- projection file without full-store load/rewrite; registries
748
- (assignments/runs/loops) unify on the same append+project primitive
749
- (entry phase is a Juan sequencing call, §6). **Rollback:** projections
750
- are at all times a complete materialized state in legacy format — flip
751
- to `dual` or `off`, re-arm legacy delete semantics, no data
752
- transformation.
753
-
754
- ### Phase 2 gate status (pln#565, 2026-06-12)
755
-
756
- The promotion gate is now **mechanically checkable** via one command and an
757
- automated hardening suite. Status of each Phase-2 exit criterion:
758
-
759
- | Criterion | Status | Evidence |
760
- | --- | --- | --- |
761
- | Journal reproduces projections (the core claim) | ✅ | `brainclaw doctor --verify-journal` — rebuilds from journal, diffs vs live projections, exit 1 on drift. GREEN on this repo's store (mode=dual). |
762
- | Tail validation / torn-tail adjudication | ✅ | `journal-v2.test` (torn tail → `torn_tail_adjudicated`, stale meta → `seq_repair`). |
763
- | Two-process append stress | ✅ | `journal-concurrency.test` — N processes × K appends, gap-free 1..N*K seq, N distinct writers, zero torn/lost. |
764
- | Kill-9 storm convergence (append path) | ✅ | `journal-concurrency.test` — SIGKILL mid-append storm: journal stays readable, seqs never duplicate, post-storm append re-derives a non-colliding seq, state still materializes. |
765
- | **Persist crash-ordering — journal before projections (I2)** | ✅ | pln#566 F1 (codex review): persist now PLANs → emits+fsyncs the journal → APPLIES projection writes, so a crash can only leave the journal ahead (recoverable), never projections ahead. Proven by `journal-crash-ordering.test` via deterministic fault injection on the real `mutateState` pipeline. (Earlier the kill-9 test exercised `forceAppendJournalRecords` directly, not the mutation pipeline — that gap is now closed.) |
766
- | Migration + rollback tooling | ✅ | genesis backfill + `rollbackJournal` (park `events/`, projections untouched). |
767
- | Dual-OS CI | ✅ | `.github/workflows/ci.yml` matrix `[ubuntu, windows]`. |
768
- | **Zero divergence across a real multi-agent sprint** | ✅ | seq#47 (2026-06-12): 4 parallel claude-code lanes + dispatch worktree churn + 4 merges → `verify-journal` zero drift throughout. |
769
- | Lock wait-time distribution (§2.9 falsifier) | ◐ | Lock serialization proven under contention by `journal-concurrency.test`; explicit p50/p95 telemetry via doctor counters is the one remaining instrumentation item — lands with the cutover (it touches the mutate hot path). |
770
-
771
- **Verdict:** the correctness gate is GREEN. The only residual is wait-time
772
- *telemetry* (not a correctness blocker). The primary cutover (Phase 3) is a
773
- Juan sequencing call (§6) and a distinct implementation workstream (tombstones +
774
- per-entity append/patch), not gated on more verification.
775
-
776
- ## 5. Perf targets (measured, not asserted)
777
-
778
- - `bclaw_work` cold read < 1 s on a 100k-event store.
779
- - Single-entity op cost independent of store size: O(1) append + O(1)
780
- projection write + O(gap) reconcile.
781
- - MCP worker-per-call overhead delta < 50 ms vs. today (fresh path = one
782
- extra small meta read).
783
- - One fsync per `mutate()`; lock p95 wait < 200 ms under normal multi-agent
784
- load (falsifier threshold, §2.9).
785
- - Segment roll ≈ every 2–3 weeks at current write rates (post heartbeat
786
- exclusion); checkpoint cost O(live entities) under lock.
787
-
788
- ## 6. OPEN QUESTIONS
789
-
790
- Severity-ranked. Every open question from round 2 not resolved by this spec
791
- is carried here.
792
-
793
- ### [JUAN — product calls] — RESOLVED 2026-06-10
794
-
795
- | # | Sev | Decision |
796
- |---|---|---|
797
- | J1 | HIGH | **`doctor redact` ships in v1.** Immutability is "immutable except via audited `doctor redact`": tooled segment rewrite, audit-trailed, seq watermarks survive it. Rationale: the EU/GDPR positioning cannot answer "impossible" to an erasure request. (Write-time secret-detection may complement later; it does not replace redaction.) |
798
- | J2 | HIGH | **Projections + checkpoints in git; segments and meta gitignored.** The store's git-diffable identity = the per-entity projections (diff/merge as today) plus checkpoints (single-file snapshots a human can adjudicate in a merge, making a bare git clone restorable without segments). No segment blobs in history; the branched-seq merge problem never enters git. |
799
- | J3 | MED | **Read-through for claim-class entities.** Claims and active assignments read the journal tail even under contention — consistency before liveness for the coordination primitive (no double-work is the product promise). Tail-read cost is paid only on this hot-critical path; memory-class entities keep stale-annotated reads (§2.8). |
800
- | J4 | MED | **Registry enters in a dedicated Phase 1.5.** Phase 1 = memory entities (low volume, proven reversibility); registry lifecycle transitions migrate once the journal is hardened in real use. Matches the off/dual/primary posture: the dispatch lifecycle is the product's credibility — it is not migrated first. |
801
- | J5 | LOW | **Defer fine gc/archive thresholds.** The normative two-verified-checkpoint floor stands alone until federation defines its consumer; count/age knobs are trivial additive later. |
802
-
803
- ### [JUAN — new product call raised by C3]
804
-
805
- | # | Sev | Question |
806
- |---|---|---|
807
- | J6 | MED | **Handoff diet (§2.10):** externalize `snapshot.diff` from handoff documents to content-addressed blob attachments. Affects handoff export/import and federation transfer (the blob closure must travel with the document). Recommended: **accept** — it also fixes the 41 MB `compacted.jsonl` class and keeps J2's bare-clone restorability intact. |
808
-
809
- ### [CODEX — schema/invariant review] — RESOLVED 2026-06-10 (symmetric pass)
810
-
811
- | # | Sev | Resolution |
812
- |---|---|---|
813
- | C1 | HIGH | Resolved in §2.1.1 (action taxonomy, 5 classes, holes closed: required `item_id`, `assignment_progress` heartbeat-class, store-ops per-entity + `store_marker`, archival-vs-delete, rev-never-resets), §2.1.2 (journal-meta schemas incl. genesis + J1 redaction audit note), §2.1.3 (tombstones per item_type), §2.2 (dup-seq reducer, 5 normative cases). |
814
- | C2 | HIGH | Resolved in §2.1.4: version-in-payload + migration-on-replay reusing the existing `migration.ts` versioned-document registry; migration-retention invariant pinned to the *second-newest* checkpoint; alternatives in Appendix A. |
815
- | C3 | MED | Falsifier FIRED on handoffs (phase-0 measurements). Resolved in §2.10: handoff diet (primary) + `payload_ref` (safety net), blob-before-ref ordering, checkpoint blob closure, gc floor extension, J2 git posture. Residual product call → J6. |
816
- | C4 | MED | Resolved in §2.11 against `identity-model-proposal.md`: scalar `(entity_rev, origin_id)` survives — convergence intact; conflict *surfacing* via (rev, origin) journal collision; documented miss-window past the gc floor with the per-origin watermark as named upgrade path. |
817
-
818
- ### [CODEX pre-P0 review] — RESOLVED 2026-06-12 (claude-code, codex out of credits)
819
-
820
- Codex's final pass before P0 implementation surfaced 5 findings; all
821
- verified against code and resolved in this revision:
822
-
823
- | # | Sev | Resolution |
824
- |---|---|---|
825
- | F1 | MED/HIGH | `assignment_progress` carried durable state (`status_reason`, `artifacts`) on the heartbeat path — un-replayable as specced. Resolved in §2.1.1: heartbeat/durable split (`assignment_progress`/`run_progress` = pure ticks; `assignment_amended`/`run_amended` = registry-lifecycle with post-image), effective phase 1.5. |
826
- | F2 | MED | Same ambiguity on `run_running` (re-emitted per tick). Resolved with F1 — one decision: `run_running` = transition-only; ticks move to `run_progress`. |
827
- | F3 | MED | `federation_apply` required by the identity proposal but absent from the taxonomy. Resolved: declared as journal-meta NOW (§2.1.1 table + §2.1.2 schema), inert until federation ships — avoids a post-freeze union extension. |
828
- | F4 | MED | Redaction × payload_ref/checkpoints under-specified. Blocking half resolved in §2.10 (redaction closure: blob deletion + checkpoint regeneration, normative invariant); federation re-import half stays open as R2 below. |
829
- | F5 | LOW | Spec said 32 `EventAction` members; code has 34. Corrected in §2.1.1. |
830
-
831
- ### [CODEX residue — needs a second model's schema instincts]
832
-
833
- | # | Sev | Question |
834
- |---|---|---|
835
- | R1 | MED | ~~Zod encoding of §2.1.1~~ **RESOLVED 2026-06-12** (codex recommendation, claude-code verified zod 4.4.3 installed): `action` stays the only discriminant — no serialized `class` field (derived-field drift, trp#180 family). `ACTION_CLASS_BY_ACTION` table `satisfies Record<EventAction, ActionClass>` for compile-time exhaustiveness; zod discriminatedUnion on `action` enums per class; phase-gated payload requiredness = runtime refinement by journal mode, not schema variants. See §2.1.1. |
836
- | R2 | MED | **Redaction × cursors × federation — federation half only** (blob/checkpoint closure resolved in §2.10). Does seq-watermark survival hold for a cursor positioned *inside* a redacted range? And the re-import hole: a federation peer that pulled a record pre-redaction can re-present it — `(seq, writer)` dedup would *reject* the redacted copy (good) but the peer's checkpoint may still embed the payload. The redaction note likely needs to propagate as a federation signal; decide with the federation transport (cannot be closed before it exists). |
837
- | R3 | LOW | **Ephemeral field set enumeration (§2.8).** Adversarial sweep of the real zod schemas for fields beyond `last_heartbeat_at` / claim `expires_at` / `last_progress` that mutate without semantic change (counters, denormalized caches?) — the masking set must be complete or verify-journal cries wolf. |
838
- | R4 | LOW | **C4 miss-window sizing (§2.11).** Gc-floor window (weeks) vs realistic offline-origin durations; should the per-origin watermark ship in federation v1 regardless of dogfood evidence? |
839
- | R-C4 | MED | **Dual conflict-detection adjudication (§2.11, reconciliation 2026-06-11).** The two symmetric reviews independently produced `base_rev` fast-forward (identity proposal) and `(rev, origin)` journal collision (this spec); the coordinator kept BOTH (primary + defense-in-depth). Adjudicate: is the redundancy worth the dual maintenance, or should one become normative? Note `base_rev` is the only one that survives gc and fresh materializes. |
840
-
841
- ## Appendix A — Rejected alternatives
842
-
843
- - **Diff/patch payloads (RFC 6902 or field-deltas).** Every event becomes
844
- load-bearing: one torn line poisons all later state for that entity, and
845
- zero-dep means hand-rolling a patch engine. Snapshots are idempotent,
846
- self-healing, and compaction-trivial. (Both proposals; unanimous.)
847
- - **A's rename-based sealing (`active.jsonl` → range name).** Contradicts
848
- its own cursor format (offsets dangle after rename), and rename-of-open-
849
- file is the exact Windows EPERM/EBUSY hazard it then needs retry logic
850
- for. Segments are born with their permanent first-seq name.
851
- - **A's byte-offset cursors `{segment_id, offset}`.** Die under rename,
852
- under quarantine truncation, and under any future segment surgery
853
- (including J1 redaction). Seq watermarks survive all of it.
854
- - **A's writer-inline torn-tail quarantine (truncate + move bytes).** A
855
- read-modify-write of the journal on the hot path: breaks append-only,
856
- races the very lock-steal window the seatbelt exists for, and can
857
- quarantine a live in-flight write. Demoted to offline doctor repair.
858
- - **A's `fsync: rotate` default (no fsync per mutation).** Program-order
859
- journal-first without a barrier permits projections from the future and
860
- silent reconciler regression — the trp_d5595086 class. One fsync per
861
- mutate is affordable at human-action mutation rates.
862
- - **B's in-journal checkpoint event runs (+ terminator).** Pollutes every
863
- seq-watermark cursor with O(entities) phantom events, leaves headless
864
- runs on crash that are schema-identical to real events, and stretches
865
- lock hold time. Out-of-band manifests have none of these.
866
- - **A's "referencing" checkpoint variant (hashes of projection files).**
867
- Circular: a rebuild-from-truth artifact whose validity depends on
868
- projection integrity is useless precisely when projections are suspect.
869
- Killed without further study.
870
- - **B's lockless observability appends (`seq: null`).** Races segment roll
871
- into "immutable" files, and seq-less records are unaddressable by B's own
872
- seq-watermark cursors. All appends take the lock; revisit only if
873
- instrumentation shows notification contention.
874
- - **B's `(writer_id, writer_seq)` per-writer counter in the envelope.**
875
- Serves only federation and is derivable later; `entity_rev` serves three
876
- local masters today. Dead weight dropped.
877
- - **B's `deleted: true` tombstone boolean.** Redundant with
878
- `action: "delete"`; one source of truth in the envelope.
879
- - **B's two meta files (`HEAD.json` + `projections.json`).** Two reads per
880
- MCP call, two renames per mutation, plus cross-file ordering reasoning,
881
- for state always consumed together. Single `meta.json`, keeping B's
882
- rebuildable-cache property.
883
- - **Migration to explicit verb-site event emission (A's end-state).** ~30
884
- call sites each become a chance to forget, double-emit, or
885
- emit-without-persisting. The diff choke point is provably consistent
886
- with what was persisted; verb semantics are preserved by intent
887
- annotation instead. Conversely, **pure diff with no annotation** (B's
888
- letter) was also rejected: it collapses the EventAction union to generic
889
- `update`, losing semantics notifications and federation signaling consume.
890
- - **Splitting notification stream from state journal now (B Q6).** Same
891
- journal is simpler — one reader, one cursor type, one ordering; split
892
- only if volume instrumentation demands it.
893
- - **Separate journal per entity (vs. one per store).** Global order comes
894
- free with one journal; per-entity journals reintroduce cross-entity
895
- ordering as a problem. (Proposal A §0; never contested.)
896
- - **Per-record envelope schema-version for payloads (C2 alternative).**
897
- Redundant: payloads already self-describe via `schema_version` + the
898
- migration registry; a second version field in the envelope creates two
899
- sources of truth that can disagree.
900
- - **Migration-by-segment-rewrite (C2 alternative).** Rewriting old
901
- segments to the current payload schema violates append-only immutability
902
- and J1's audited-rewrite-only rule; replay-time migration is
903
- pure-functional and leaves bytes untouched.
904
- - **Hash in every envelope, inline payloads included (C3 variant).**
905
- Inline payloads are already line-framed and zod-validated; mandatory
906
- hashing buys federation dedup nothing (dedup keys on `(seq, writer)`)
907
- at a per-mutation CPU cost. The hash lives where it is load-bearing:
908
- `payload_ref` and `checkpoint_ref`.
909
- - **A vector-clock component in the envelope (C4 alternative).** Origin-
910
- partitioned write authority makes convergence scalar-safe (§2.11); the
911
- only thing a vector adds is complete conflict *surfacing* across
912
- gc-floor-sized offline windows. Deferred to import metadata (per-origin
913
- watermark) — the envelope stays origin-agnostic.
914
-
915
- ## Appendix B — Memory citations (union of rounds 1–2)
916
-
917
- trp_d5595086 (silent-loss-via-swallow → loud appends, doctor-visible skips,
918
- tombstone deletion authority, never-regress guard);
919
- feedback_lazy_reconcile_pattern / pln#496 (read-path reconciliation, no
920
- daemon); trp_e85e9fbe (dual-platform CI gates, Windows/POSIX divergence
921
- discipline); trp_26e9634b (missing-store failure mode); trp_09988deb
922
- (upgrade-style backups); feedback_no_init_force + park-don't-delete house
923
- rule (retention, quarantine, archives, rollback);
924
- federation_architecture_decisions + cross_project_signaling_vs_execution
925
- (Pull-and-Materialize substrate, signaling-only foreign writes, no daemon);
926
- feedback_bisect_state_before_code (doctor counters over silent skips);
927
- feedback_ideation_loop_single_agent_method (multi-instance multi-round
928
- method that produced this spec).
1
+ # Event-Log Store — Converged Design Spec
2
+
3
+ > Synthesis (round 3) of ideation loop lop_3bf55b9492e0d96c, pln#543 step 1.
4
+ > Distills proposal-A, proposal-B, and both cross-critiques. Where the two
5
+ > round-2 VERDICT blocks agree, this spec follows them; where they diverge,
6
+ > one option is chosen and the loser recorded in Appendix A. Status: SPEC,
7
+ > product calls ARBITRATED (Juan, 2026-06-10 — see §6); C1–C4 resolved by
8
+ > the symmetric schema review of 2026-06-10 (§2.1.1–2.1.4, §2.2, §2.10,
9
+ > §2.11); residue R1–R4 + new product call J6 in §6, pending the Codex
10
+ > second pass.
11
+
12
+ ## 1. Motivation
13
+
14
+ The 2026-06-10 review (zone 1) found that `src/core/event-log.ts` cannot
15
+ serve as the store's source of truth in its current form: appends are
16
+ swallowed on error (`appendFileSync` inside a catch-all — a journal that may
17
+ silently drop writes is not a journal), events carry no payload (state is not
18
+ reconstructible from the log), rotation at 10 MB renames the file away and
19
+ **deletes all reader cursors** (silent re-notification loss, history
20
+ unreachable), and the only ordering key is a wall-clock timestamp (unreliable
21
+ across agent shells, WSL, containers). Meanwhile every state mutation already
22
+ serializes through the hardened store lock, and loops already run a
23
+ payload-carrying journal (`loops/<id>/events.jsonl`) — the substrate and the
24
+ precedent both exist. This spec evolves the event log into a write-ahead
25
+ journal of full-entity snapshots, organized as immutable segments plus
26
+ out-of-band checkpoints, with the existing per-entity JSON directories
27
+ demoted to lazily reconciled projections (the pln#496 pattern).
28
+
29
+ ## 2. Design
30
+
31
+ ### 2.1 Event record format
32
+
33
+ Each record is one JSONL line, zod-validated, envelope version `v: 2`
34
+ (existing events are retroactively v1):
35
+
36
+ ```jsonc
37
+ {
38
+ "v": 2,
39
+ "seq": 18342, // store-global monotonic, assigned under lock
40
+ "ts": "2026-06-10T14:03:22.114Z", // informational ONLY — never an ordering key
41
+ "writer": "w_31416-9f3c2a", // pid + start-nonce (NOT agent name, NOT bare pid)
42
+ "agent": "claude-code",
43
+ "agent_id": "agt_...", // optional
44
+ "user": "jberdah", // optional
45
+ "action": "update", // EventAction union (see payload rule below)
46
+ "item_type": "plan",
47
+ "item_id": "pln_2290bc70",
48
+ "entity_rev": 7, // per-entity monotonic revision
49
+ "summary": "step 1 completed", // human-facing, optional
50
+ "payload": { /* full post-image, schema-valid entity doc */ }
51
+ }
52
+ ```
53
+
54
+ Normative rules:
55
+
56
+ - **Payload = full entity snapshot** (post-image), never a diff. Required
57
+ iff the action mutates a persisted entity; lifecycle/observability actions
58
+ (`session_start`, notification verbs) are payload-free. The normative
59
+ action-class → payload-requirement mapping is §2.1.1.
60
+ - **Tombstones**: `action: "delete"`, payload omitted. No redundant
61
+ `deleted` boolean. Per-item_type semantics in §2.1.3.
62
+ - **`(seq, writer)` is the normative event identity.** Bare `seq` is an
63
+ address, valid only where the lock guarantees held (see §2.2 anomaly
64
+ handling). Federation idempotency keys, dedup, and the dup-seq reducer all
65
+ use the pair.
66
+ - **`entity_rev`** is a per-entity monotonic revision bumped on every event
67
+ for that id, carried in the envelope (entity schemas untouched). It powers
68
+ projection dirty-checks, the never-regress guard (§2.7), optimistic
69
+ concurrency for future API writes, and is the local half of federation
70
+ conflict detection.
71
+ - **New event kinds** introduced by this spec: `checkpoint_ref` (§2.4),
72
+ `journal_note` (§2.6), `seq_repair` (§2.2), `backfill` (§4). Normative
73
+ schemas in §2.1.2.
74
+ - **Writer identity** is pid + per-process random start-nonce. Pid reuse
75
+ makes bare pid unreliable over a journal's lifetime; agent name is
76
+ metadata, not identity.
77
+ - **Max record size, enforced at write time**: payloads > 64 KB are
78
+ externalized via `payload_ref` (§2.10); the *envelope line* hard-fails at
79
+ 256 KB (with `payload_ref`, no legitimate record approaches it). The cap is
80
+ the tripwire that tells us when the snapshot-everywhere assumption expires
81
+ (see falsifier, §2.8 — it fired on handoffs in phase 0, hence §2.10).
82
+
83
+ #### 2.1.1 Action taxonomy → payload requirement (C1 resolution)
84
+
85
+ The v2 `action` field extends today's `EventAction` union
86
+ (`src/core/event-log.ts`, 34 members) with four journal-meta actions
87
+ (`checkpoint_ref`, `journal_note`, `seq_repair`, `federation_apply`), the
88
+ entity-state `backfill`, and three progress-split verbs (`run_progress`,
89
+ `assignment_amended`, `run_amended` — see the heartbeat/durable split
90
+ below), and classifies every action into exactly one of five classes.
91
+ `EventItemType` gains `journal` (journal-meta records) and, at registry
92
+ unification (§4 phase 3), `loop`.
93
+
94
+ **Encoding (R1 resolution, 2026-06-12):** the class is **never
95
+ serialized** — `action` remains the only discriminant on the wire. Code
96
+ carries one `ACTION_CLASS_BY_ACTION` table typed
97
+ `satisfies Record<EventAction, ActionClass>` so adding a new action
98
+ without classifying it is a compile error, and the zod schema is a
99
+ discriminated union keyed on `action` (zod ≥ 4 accepts enum values per
100
+ branch — 4.4.3 is the installed runtime dep). The phase-gated payload
101
+ requiredness of registry-lifecycle (OPTIONAL until phase 1.5) is a
102
+ runtime refinement selected by journal mode
103
+ (`off | dual | primary | registryPrimary`), not a frozen schema variant —
104
+ the wire format never changes across the cutover, only the validator
105
+ strictness. A serialized `class` field was rejected: a derived persisted
106
+ field is drift waiting to happen (same failure family as trp#180).
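
The encoding above can be sketched as follows. This is a minimal illustration, not the shipped code: the `EventAction` subset, the `payloadRule` helper, and the assumption that `registryPrimary` is the mode in which registry-lifecycle payloads become REQUIRED are all hypothetical. zod is left out so the block stays dependency-free.

```typescript
// Hypothetical subset of the v2 union; the real EventAction has many more members.
type EventAction =
  | "create" | "update" | "delete"
  | "checkpoint_ref" | "session_start" | "assignment_created";

type ActionClass =
  | "entity-state" | "tombstone" | "journal-meta"
  | "observability" | "registry-lifecycle";

// `satisfies` checks the table against the Record type without widening it:
// adding a member to EventAction without a row here fails to compile.
const ACTION_CLASS_BY_ACTION = {
  create: "entity-state",
  update: "entity-state",
  delete: "tombstone",
  checkpoint_ref: "journal-meta",
  session_start: "observability",
  assignment_created: "registry-lifecycle",
} as const satisfies Record<EventAction, ActionClass>;

type JournalMode = "off" | "dual" | "primary" | "registryPrimary";
type PayloadRule = "REQUIRED" | "FORBIDDEN" | "OPTIONAL";

// Runtime refinement: the wire format is fixed, only strictness varies by mode.
// Assumption: "registryPrimary" stands for the post-phase-1.5 mode.
function payloadRule(action: EventAction, mode: JournalMode): PayloadRule {
  switch (ACTION_CLASS_BY_ACTION[action]) {
    case "entity-state":
    case "journal-meta":
      return "REQUIRED";
    case "tombstone":
    case "observability":
      return "FORBIDDEN";
    case "registry-lifecycle":
      return mode === "registryPrimary" ? "REQUIRED" : "OPTIONAL";
  }
}
```

Because the class is derived from the table at validation time, nothing class-related is ever written to the journal, which is the property the R1 resolution asks for.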
107
+
108
+ | Class | Actions | `payload` | `item_id` | `entity_rev` |
109
+ |---|---|---|---|---|
110
+ | entity-state | `create`, `update`, `accept`, `reject`, `claim`, `release_claim`, `rollback`, `upgrade`, `backfill` | REQUIRED — versioned post-image (§2.1.4) or `payload_ref` (§2.10) | REQUIRED | REQUIRED, bumped |
111
+ | tombstone | `delete` | FORBIDDEN | REQUIRED | REQUIRED, bumped |
112
+ | journal-meta | `checkpoint_ref`, `journal_note`, `seq_repair`, `federation_apply` | REQUIRED — meta-schema per action (§2.1.2), never an entity post-image | FORBIDDEN (`item_type: "journal"`) | absent |
113
+ | observability | `session_start`, `session_end`, `assignment_offered`*, `assignment_progress`†, `run_progress`† | FORBIDDEN | optional | absent |
114
+ | registry-lifecycle | `assignment_created/accepted/started/completed/cancelled/failed/blocked/timed_out/expired/retrying/rerouted`, `assignment_amended`†, `run_amended`†, all other `run_*` (`run_running` = the transition into running, emitted once†) | OPTIONAL until registry families go journal-primary (J4 phase 1.5); REQUIRED post-image from then on | REQUIRED | absent until phase 1.5, then REQUIRED, bumped |
115
+
116
+ \* `assignment_offered` is a status transition of the assignment doc and
117
+ moves to registry-lifecycle at phase 1.5; until then it is notification-only.
118
+ † See the heartbeat/durable-progress split below.
119
+
120
+ Holes found by the adversarial enumeration, resolved as follows:
121
+
122
+ - **`item_id` is optional in today's `MemoryEvent`.** v2 makes it REQUIRED
123
+ for entity-state, tombstone, and registry-lifecycle records — a
124
+ payload-carrying or rev-bumping record without an addressable entity is
125
+ unreplayable and rejected at write time.
126
+ - **Heartbeat vs durable progress (pre-P0 codex review, 2026-06-12).**
127
+ Today's code conflates the two on a single verb:
128
+ `recordAssignmentProgress` persists `status_reason` and **appends to
129
+ `artifacts`** (durable, accumulating state) on the same path that bumps
130
+ `last_heartbeat_at`, then emits `assignment_progress`
131
+ (`assignments.ts:296-301`); `recordAgentRunProgress` likewise mutates
132
+ `session_id`/`status_reason`/`artifacts` and re-emits `run_running` on
133
+ every tick (`agentruns.ts:318-325`). If those verbs stayed
134
+ heartbeat-class/no-replay, journal-primary replay would silently drop
135
+ artifacts reported mid-run. Normative invariant: **any event reflecting
136
+ a durable-field mutation must be replayable by the phase its family goes
137
+ journal-primary.** Resolution, effective at phase 1.5 (no write-path
138
+ code change before then):
139
+ - `assignment_progress` and `run_progress` (new) are **pure ticks** —
140
+ observability-class, payload FORBIDDEN, touching only ephemeral-class
141
+ fields (`last_heartbeat_at`, `updated_at`, `last_event_at`); excluded
142
+ from replay, exactly as §2.8 masks them.
143
+ - `run_running` re-scopes to the status **transition into** running —
144
+ registry-lifecycle, emitted once per transition, never as a tick.
145
+ - A progress call that carries durable mutations (`status_reason`,
146
+ `artifacts`, `session_id` binding) emits `assignment_amended` /
147
+ `run_amended` instead — registry-lifecycle, REQUIRED post-image,
148
+ `entity_rev` bumped. The write path splits on argument presence;
149
+ existing tests that assert artifacts-on-progress move to the amended
150
+ verbs with the phase 1.5 migration.
151
+ - **Whole-store operations** (`rollback`, `upgrade` — today emitted once
152
+ with `item_type: "state"`): in v2 the diff choke point (§2.8) emits them
153
+ *per entity* (entity-state class, post-image each), plus one
154
+ `journal_note` kind `store_marker` recording the store-level operation
155
+ for audit. The coarse `item_type: "state"` event class disappears; the
156
+ `state` item_type survives only inside `store_marker` notes.
157
+ - **Compactor archival vs deletion**: archival removal emits `delete` with
158
+ `summary: "archived"` (the archived copy lives outside live dirs and is
159
+ not journal-visible). Restore emits `create` continuing the entity_rev
160
+ counter. **`entity_rev` is per item_id and never resets**, including
161
+ across delete→recreate — required by federation LWW (§2.11).
162
+ - **Sessions stay observability-class.** `current_session` /
163
+ `session_snapshot` docs remain projection-only (ephemeral-class, like
164
+ heartbeats); if sessions ever need replay they move to entity-state, but
165
+ nothing today consumes a replayed session.
166
+
167
+ #### 2.1.2 Journal-meta record schemas (C1 resolution)
168
+
169
+ All journal-meta records use `item_type: "journal"`, omit `item_id` and
170
+ `entity_rev`, and carry a payload discriminated as follows:
171
+
172
+ ```jsonc
173
+ // checkpoint_ref — appended AFTER manifest fsync (§2.4)
174
+ { "action": "checkpoint_ref", "item_type": "journal", "payload": {
175
+ "file": "ckpt-00018000.json", // name under checkpoints/
176
+ "sha256": "…", // hash of the manifest bytes
177
+ "head_seq": 18342, // last seq the manifest covers
178
+ "entities": 913, // live entity count
179
+ "bytes": 481332,
180
+ "blobs": ["…"] // payload_ref closure (§2.10); [] if none
181
+ } }
182
+
183
+ // journal_note — discriminated by payload.kind
184
+ { "action": "journal_note", "item_type": "journal", "payload": {
185
+ "kind": "torn_tail_adjudicated",
186
+ "segment": "seg-00018000.jsonl",
187
+ "byte_start": 104832, "byte_end": 105219,
188
+ "sha256": "…" } } // hash of the adjudicated fragment
189
+ { "payload": { "kind": "genesis", // phase-1 migration marker (§4)
190
+ "migrated_from": "v1", "v1_events_parked": 17727,
191
+ "backfill_count": 913, "tool_version": "…" } }
192
+ { "payload": { "kind": "redaction", // J1 audit trail — doctor redact
193
+ "segments": ["seg-00000001.jsonl"],
194
+ "redacted": [{ "seq": 1234, "writer": "w_…" }],
195
+ "reason": "…", "by": "…" } }
196
+ { "payload": { "kind": "store_marker", // whole-store ops (§2.1.1)
197
+ "op": "rollback", // "rollback" | "upgrade"
198
+ "detail": "…" } }
199
+
200
+ // seq_repair — tail-validation correction (§2.2)
201
+ { "action": "seq_repair", "item_type": "journal", "payload": {
202
+ "meta_next_seq": 18301, // stale value found in meta.json
203
+ "tail_seq": 18342, // observed at the active-segment tail
204
+ "repaired_next_seq": 18343 } }
205
+
206
+ // federation_apply — local record of an applied remote slice
207
+ // (required by identity-model-proposal §"local apply"; declared NOW so
208
+ // the frozen v2 union needs no post-freeze extension — emitted only once
209
+ // federation ships, inert until then)
210
+ { "action": "federation_apply", "item_type": "journal", "payload": {
211
+ "origin_id": "org_a1b2…", "origin_epoch": 3,
212
+ "seq_range": [120, 184], // remote seqs covered by the slice
213
+ "applied": 64, // records materialized locally
214
+ "conflicts": 1, // LWW losers surfaced as candidates (§2.11)
215
+ "slice_sha256": "…" } } // hash of the ingested slice bytes
216
+ ```
217
+
218
+ `backfill` is **entity-state class**, not journal-meta: normal envelope
219
+ with `item_type`/`item_id`/`entity_rev`/payload. Genesis (§4 phase 1) = one
220
+ `journal_note` kind `genesis` followed by one `backfill` per live entity
221
+ with `entity_rev: 1`, all under a single lock hold. Doctor-initiated
222
+ re-syncs reuse `backfill` with the entity's current rev + 1.
223
+
224
+ #### 2.1.3 Tombstone semantics per item_type (C1 resolution)
225
+
226
+ - Payload FORBIDDEN; `item_id` REQUIRED; `entity_rev` bumped. The rev
227
+ counter survives deletion (§2.1.1 — never resets per item_id).
228
+ - Projection unlink happens iff a tombstone is applied (§2.8). The
229
+ never-unlink-unparseable guard **wins over the tombstone**: the file is
230
+ preserved and the divergence is a *persistent, counted* doctor item —
231
+ divergence-by-design, distinct from corruption.
232
+ - Singleton item types (`state`, `session`) never tombstone —
233
+ schema-forbidden; an encountered one is a doctor error.
234
+ - Claims: lifecycle release is `release_claim` (entity-state, post-image
235
+ with `status: released`); `delete` on a claim appears only from prune.
236
+ - Archival is `delete` + `summary: "archived"` (§2.1.1), not a distinct
237
+ action.
238
+
239
+ #### 2.1.4 Payload schema versioning (C2 resolution)
240
+
241
+ Decision: **version-in-payload + migration-on-replay, reusing the existing
242
+ versioned-document registry** (`src/core/migration.ts`). No new envelope
243
+ field.
244
+
245
+ - Every entity payload — and every checkpoint post-image — is persisted
246
+ exactly as its projection file is today: the document carries
247
+ `schema_version` and is registered in the migration registry keyed by
248
+ `VersionedDocumentType`.
249
+ - Replay runs each payload through the same detect → stepwise-migrate →
250
+ zod-validate path projections already use (`loadVersionedJsonFile`
251
+ semantics). One mechanism, one registry, one set of migration tests —
252
+ the journal adds zero new versioning machinery.
253
+ - The envelope's `v: 2` governs ONLY envelope shape (seq/writer/action
254
+ fields). Envelope and payload version independently.
255
+ - **Migration-retention invariant (normative):** journal immutability makes
256
+ migration paths load-bearing — a stepwise migration may never be deleted
257
+ while any non-archived segment, or either of the **two** newest verified
258
+ checkpoints, contains a payload at the pre-migration version. (Two, not
259
+ one: the §2.4 fallback chain replays from the second-newest checkpoint,
260
+ so the version floor is the state of the *second-newest* checkpoint.)
261
+ Checkpoints rewrite post-images at current schema versions when written,
262
+ so each checkpoint advances the floor; in the common case replay spans
263
+ only the post-checkpoint tail (weeks of records, ≤ 1–2 schema versions).
264
+ Archived segments may outlive migration paths: doctor warns
265
+ "archive predates migration floor" rather than promising eternal
266
+ replayability of archives.
267
+ - **Replay validation failure** (unknown version / migration throws / zod
268
+ fails): skip + count + doctor (the §2.6 mid-file rule). If the failed
269
+ record is the entity's *newest*, the projection keeps its current content
270
+ (never-regress, §2.7) and the entity is flagged divergent — rebuild never
271
+ silently regresses to the previous snapshot.
272
+
273
+ Alternatives rejected (recorded in Appendix A): per-record envelope
274
+ schema-version (redundant — payloads self-describe); segment-rewrite
275
+ migration (violates immutability and J1's audited-rewrite-only rule).
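
A rough sketch of the migration-on-replay path described in this section, under stated assumptions: `Doc`, `Step`, and `migrateOnReplay` are hypothetical names, the step registry is a plain `Map`, and a `null` result stands for the "skip + count + doctor" outcome. The real path goes through the registry in `src/core/migration.ts` and a zod validation step that is not reproduced here.

```typescript
type Doc = { schema_version: number; [key: string]: unknown };
type Step = (doc: Doc) => Doc; // hypothetical: migrates version n to n + 1

// Returns the document at `current`, or null when replay must skip the record.
function migrateOnReplay(doc: Doc, steps: Map<number, Step>, current: number): Doc | null {
  let out = doc;
  while (out.schema_version < current) {
    const step = steps.get(out.schema_version);
    if (!step) return null; // migration path deleted or never registered
    const next = step(out);
    if (next.schema_version !== out.schema_version + 1) return null; // malformed step
    out = next;
  }
  // A version newer than `current` is unknown to this build: skip, never guess.
  return out.schema_version === current ? out : null;
}
```

The missing-step branch is what the migration-retention invariant protects against: if a step were deleted while a replayable segment still held payloads at its source version, those records would fall into the skip path.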
276
+
277
+ ### 2.2 Seq and ordering
278
+
279
+ - `seq` is store-global, monotonic, persisted as `next_seq` in
280
+ `events/meta.json`, incremented **under the store mutation lock**. Every
281
+ append — including observability events — takes the lock and gets a seq.
282
+ There is no lockless append path and no `seq: null` record class (a
283
+ lockless path races segment roll, and seq-less records are unaddressable
284
+ by seq-watermark cursors).
285
+ - **Timestamps never order anything.** `ts` is for humans and notification
286
+ summaries.
287
+ - **Tail validation at lock acquisition (normative):** before its first
288
+ append, a writer reads the last record of the active segment and sets
289
+ `next_seq = max(meta.next_seq, tail_seq + 1)`. If meta was behind, it
290
+ appends a `seq_repair` event recording the correction. This re-derives
291
+ truth from the journal (meta is a cache) and caps seq collisions to the
292
+ single in-flight race write.
293
+ - **Two writers are NOT impossible.** The lock can be broken on presumed
294
+ owner death, and presumed death is fallible (pid-liveness false negatives
295
+ on Windows, pid reuse). A duplicate `seq` from distinct writers is a
296
+ **detected anomaly**: the reducer applies both records in file order
297
+ (snapshot payloads make double-apply convergent — the later line wins
298
+ wholesale), and doctor emits a warning. Detection via `(seq, writer)`;
299
+ containment via tail validation above. The journal's two-writer story is
300
+ only as rare as lock.ts's steal rate; the spec depends on lock.ts
301
+ identifying owners by pid + random token (verified against today's
302
+ `lockIsOwnedByCurrentProcess` — token-based, pid reuse alone cannot forge
303
+ ownership).
304
+ - **Dup-seq reducer semantics (normative, C1 resolution).** Replay
305
+ processes records strictly in (segment, file-line) order — never sorted
306
+ by seq. Collision cases:
307
+ 1. Identical `(seq, writer)`, identical payload bytes → idempotent
308
+ duplicate (e.g. ambiguous-retry residue): second occurrence skipped,
309
+ doctor counter.
310
+ 2. Identical `(seq, writer)`, different content → doctor **ERROR**
311
+ (should be impossible — a writer never reuses its own seq); both
312
+ applied in file order, later wins, entity flagged.
313
+ 3. Same `seq`, different writers → the lock-steal anomaly above: both
314
+ applied in file order, doctor **WARNING**.
315
+ 4. `entity_rev` ties produced by case 3 on the same entity: later file
316
+ order wins wholesale; the never-regress guard (§2.7) treats
317
+ equal-rev-different-writer as a doctor-flagged overwrite, not a
318
+ regression.
319
+ 5. `entity_rev` *gaps* during replay (expected prev+1, observed larger):
320
+ doctor warning (possible lost event) — snapshot payloads self-heal
321
+ state, the counter records that history is incomplete.
322
+ - **Scope boundary (stated so the assumption is visible when it breaks):**
323
+ global-seq-under-lock welds event capture to lock availability. Sandboxed
324
+ or worktree workers that cannot reach the store produce zero journal
325
+ events until a sync point — the journal is the truth of the *store*, not
326
+ of the *system*. The moment any roadmap item requires offline local event
327
+ capture with later merge, this primitive is falsified and per-writer seqs
328
+ + merge (the federation mechanism applied locally) become necessary.
329
+ Until then, global seq costs zero new coordination and stays.
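
The tail-validation rule and the first three collision cases can be sketched as pure functions. This is an illustration only: `SeqRecord`, `validateTail`, and `classifyCollision` are hypothetical names, and `body` stands in for the serialized record bytes the spec compares.

```typescript
interface SeqRecord { seq: number; writer: string; body: string }

// Tail validation: meta.json is a cache, the active-segment tail is the truth.
function validateTail(metaNextSeq: number, tailSeq: number): { nextSeq: number; repaired: boolean } {
  const nextSeq = Math.max(metaNextSeq, tailSeq + 1);
  // `repaired` is true when meta was behind and a seq_repair event is due.
  return { nextSeq, repaired: nextSeq !== metaNextSeq };
}

type Collision = "none" | "idempotent-duplicate" | "same-writer-conflict" | "lock-steal";

// Classifies two records met in file order; records are never sorted by seq.
function classifyCollision(earlier: SeqRecord, later: SeqRecord): Collision {
  if (earlier.seq !== later.seq) return "none";
  if (earlier.writer !== later.writer) return "lock-steal"; // case 3: doctor WARNING
  return earlier.body === later.body
    ? "idempotent-duplicate"   // case 1: second occurrence skipped
    : "same-writer-conflict";  // case 2: doctor ERROR, later wins
}
```

With the values from the `seq_repair` example in §2.1.2, `validateTail(18301, 18342)` yields `nextSeq` 18343 with `repaired` set.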
330
+
331
+ ### 2.3 Segments and sealing
332
+
333
+ Layout:
334
+
335
+ ```
336
+ .brainclaw/events/
337
+   meta.json               # next_seq + per-family last_applied_seq — rebuildable cache
337
+   seg-00000001.jsonl      # immutable once rolled; name = first seq it contains
338
+   seg-00018000.jsonl      # active segment (newest = append target)
339
+   checkpoints/
340
+     ckpt-00018000.json    # self-contained state manifest (out-of-band, §2.4)
341
+   quarantine/             # doctor-parked bytes only (offline repair, §2.6)
342
+   archive/
343
+     events.v1.jsonl       # parked legacy notification log (never deleted)
345
+ ```
346
+
347
+ - Segments are **named by their first seq at birth and never renamed**.
348
+ The active segment is simply the newest one. No rename means no Windows
349
+ EPERM/EBUSY hazard, no retry protocol, no cursor invalidation. Locating
350
+ seq N = directory listing + binary search by filename; no index file.
351
+ - Roll when the active segment ≥ 10 MB: under the lock, write a checkpoint
352
+ (§2.4), create the next segment, update `meta.json`. Rolled segments are
353
+ immutable — an invariant that holds because **all** appenders take the
354
+ lock and resolve the active segment inside it.
355
+ - `meta.json` is a single small file (one read covers staleness checks for
356
+ everything), rewritten atomically (temp+rename), and is a **rebuildable
357
+ cache**: if missing or corrupt it is reconstructed from the segment
358
+ listing plus a tail read of the last segment.
359
+ - **Retention**: sealed segments are never auto-deleted. `gc` may move
360
+ segments superseded by a *verified* checkpoint to `archive/`
361
+ (park-don't-delete), but never past the **second-newest verified
362
+ checkpoint** — the previous checkpoint must remain replayable as the
363
+ fallback chain (§2.4).
364
+ - **Support boundary**: journal correctness is guaranteed on local
365
+ filesystems only (NTFS, ext4, APFS). O_APPEND atomicity does not hold on
366
+ SMB/NFS. `bclaw doctor` performs best-effort (heuristic) detection of UNC
367
+ paths and mapped network drives and warns; the boundary is documented,
368
+ not silently assumed.
369
+
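The filename-based lookup described above can be sketched as follows. This is a minimal illustration, not the shipped code: `firstSeqOf` and `locateSegment` are hypothetical names, and the directory listing is assumed to arrive as a plain array of file names.

```javascript
// Hypothetical helper: parse "seg-00018000.jsonl" into its first seq (18000).
function firstSeqOf(name) {
  const m = /^seg-(\d+)\.jsonl$/.exec(name);
  return m ? Number(m[1]) : null;
}

// Return the name of the segment that would contain `seq`, or null when
// `seq` predates every listed segment. Non-segment files are ignored.
function locateSegment(names, seq) {
  const segs = names
    .map((name) => ({ name, first: firstSeqOf(name) }))
    .filter((s) => s.first !== null)
    .sort((a, b) => a.first - b.first);
  let lo = 0;
  let hi = segs.length - 1;
  let found = null;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (segs[mid].first <= seq) {
      found = segs[mid].name; // candidate: the last segment starting at or before seq
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return found;
}
```

Because names are fixed at birth, the sorted listing is the only index the search needs.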
370
+ ### 2.4 Checkpoints
371
+
372
+ - A checkpoint is an **out-of-band, self-contained** manifest
373
+ `checkpoints/ckpt-<seq>.json`: full post-images of every live entity at
374
+ head seq ("self-contained" = manifest + its blob closure once
375
+ `payload_ref` exists, §2.10). A checkpoint never stores hashes that reference projection files: a checkpoint whose
376
+ validity depends on projection integrity is useless in exactly the
377
+ scenarios it exists for.
378
+ - Written under the lock at segment roll (and on `bclaw doctor --compact`):
379
+ write manifest → fsync → append a `checkpoint_ref` event to the journal
380
+ carrying the checkpoint's **sha256** → update meta last. A crash leaves
381
+ at worst an orphan manifest with no ref (harmless) — cursors never see
382
+ checkpoint content, the seq space is not inflated, and rebuild needs no
383
+ terminator-scanning.
384
+ - **Verify before archive (normative):** a checkpoint must be fully
385
+ re-parsed and schema-validated before any segment it supersedes moves to
386
+ `archive/`. On checksum or parse failure at rebuild time, fall back to
387
+ the previous checkpoint and replay more segments (guaranteed available by
388
+ the two-checkpoint gc floor).
389
+ - Rebuild cost is bounded: latest verified checkpoint + replay of segments
390
+ after it (≤ ~10 MB tail in the common case).
391
+
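One way to read the two-checkpoint gc floor is sketched below. It rests on a loudly stated assumption that is not in the spec text: a checkpoint `ckpt-N` is taken to capture state through seq `N - 1`, so a segment is superseded only when the next segment starts at or before the floor. `archivableSegments` is a hypothetical name.

```javascript
// segmentFirstSeqs: first seq of every live segment (any order).
// verifiedCheckpointSeqs: the N of each verified ckpt-N; assumed here to
// capture state through seq N - 1 (an assumption of this sketch).
function archivableSegments(segmentFirstSeqs, verifiedCheckpointSeqs) {
  const ckpts = [...verifiedCheckpointSeqs].sort((a, b) => a - b);
  if (ckpts.length < 2) return []; // no fallback chain yet: archive nothing
  const floor = ckpts[ckpts.length - 2]; // second-newest verified checkpoint
  const segs = [...segmentFirstSeqs].sort((a, b) => a - b);
  const out = [];
  // The newest (active) segment has no successor and never qualifies.
  for (let i = 0; i + 1 < segs.length; i++) {
    // Segment i holds seqs [segs[i], segs[i + 1] - 1].
    if (segs[i + 1] <= floor) out.push(segs[i]);
  }
  return out;
}
```

With segments starting at 1, 18000, 36000 and 54000 and verified checkpoints at 18000, 36000 and 54000, only the first two segments may be parked: if the newest checkpoint fails verification, the one at 36000 plus the remaining segments still rebuild the store.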
392
+ ### 2.5 Cursors
393
+
394
+ - `AgentCursor` = `{last_seq, last_read}` — a **seq watermark**. Rotation,
395
+ compaction, archival, and any future segment surgery cannot invalidate a
396
+ watermark. Byte offsets are dead (they die under any file mutation,
397
+ including the offline repairs in §2.6).
398
+ - `readUnseenEvents(agent)` = binary-search the segment containing
399
+ `last_seq + 1` by filename, stream forward across segments.
400
+ - If the watermark predates the oldest non-archived segment, the reader
401
+ gets `{gap: true}` plus a summary built from the latest checkpoint —
402
+ notifications degrade gracefully; state rebuild never depended on them.
403
+ - **Cursor key and self-exclusion.** Cursors are keyed today by agent
404
+ *name*; the identity-model proposal re-keys them name → actor instance
405
+ (its migration step 3 — one-time rename, cursors are caches). v2
406
+ self-exclusion compares the record's `writer` (or actor id), never the
407
+ display name: three same-name claude-code instances sharing one cursor
408
+ and consuming each other's notifications was an observed incident
409
+ (2026-06-10).
410
+
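The watermark semantics above can be illustrated with an in-memory stand-in. This sketch assumes segments are already loaded as arrays and uses a linear scan where the spec calls for a filename binary search; the `selfWriter` parameter and the returned `last_seq` field are illustrative choices, not the package's API.

```javascript
// segments: live (non-archived) segments sorted by firstSeq, each shaped as
// { firstSeq, records: [{ seq, writer, ... }] }. An in-memory stand-in.
function readUnseenEvents(cursor, segments, selfWriter) {
  const want = cursor.last_seq + 1;
  // The watermark predates the oldest live segment: report a gap.
  const gap = segments.length > 0 && want < segments[0].firstSeq;
  const events = [];
  let last_seq = cursor.last_seq;
  for (const seg of segments) {
    for (const rec of seg.records) {
      if (rec.seq < want) continue;
      last_seq = Math.max(last_seq, rec.seq); // watermark advances past own records too
      if (rec.writer !== selfWriter) events.push(rec); // self-exclusion by writer, not name
    }
  }
  return { gap, events, last_seq };
}
```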
411
+ ### 2.6 Append protocol, framing, torn tails
412
+
413
+ - One record = **one single-buffer write** (`"\n" + JSON + "\n"`) to an fd
414
+ opened append-only (O_APPEND / FILE_APPEND_DATA). The lock is the primary
415
+ concurrency guarantee; single-write atomicity on local FS is the seatbelt
416
+ for the lock-steal window.
417
+ - The **leading `\n`** caps torn-write damage at exactly one event: if the
418
+ previous append tore (no trailing newline), our leading newline
419
+ terminates the fragment as its own malformed line instead of letting our
420
+ valid record be absorbed into it.
421
+ - **Short-write check**: `bytesWritten !== buffer.length` ⇒ throw inside
422
+ `mutate()`; the mutation fails loudly before any projection write.
423
+ - **Append failures are loud.** The current error swallow is removed for v2
424
+ state events: a failed journal append is a failed mutation.
425
+ - Reader rules (normative):
426
+ 1. Split on `\n`, skip empty lines.
427
+ 2. A mid-file line failing parse or schema validation: skip, count,
428
+ surface via doctor — never silently (trp_d5595086).
429
+ 3. A torn **tail** (final line, unparseable or missing trailing `\n`) is
430
+ expected crash residue: skip it. This is correct even when the torn
431
+ line *parses* validly — journal-first + fsync-before-projection (§2.7)
432
+ means an unconfirmed tail can always be dropped, because the caller
433
+ was never told "ok".
434
+ - **No hot-path rewrites, ever.** The journal is append-only; nothing
435
+ truncates or moves bytes during normal operation. When a writer (under
436
+ lock, before appending) detects a torn tail, it appends a `journal_note`
437
+ event recording the fragment's segment, byte range, and content hash as
438
+ *adjudicated*. Doctor counts adjudicated fragments separately from
439
+ unexplained mid-file corruption — benign crash residue never raises a
440
+ permanent alarm (alarm fatigue is how real corruption later slips
441
+ through). Physical excision of damaged bytes into `quarantine/` exists
442
+ only as an **offline doctor repair** (doctor holds the lock, no
443
+ concurrent appender, parks bytes, never deletes).
444
+
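The framing and the three reader rules can be sketched as pure functions over a segment's text. This is a minimal sketch under stated assumptions: schema validation is reduced to "parses as JSON and has an integer `seq`", and the names `frameRecord` and `readJournal` are hypothetical.

```javascript
// Frame one record as a single buffer: leading and trailing newline.
function frameRecord(record) {
  return "\n" + JSON.stringify(record) + "\n";
}

// Apply the reader rules to the full text of one segment.
function readJournal(text) {
  const lines = text.split("\n");
  const endsWithNewline = text.endsWith("\n");
  let lastIdx = -1;
  lines.forEach((line, i) => { if (line !== "") lastIdx = i; });
  const out = { records: [], skipped: 0, tornTail: 0 };
  lines.forEach((line, i) => {
    if (line === "") return; // rule 1: skip empty lines
    let rec = null;
    try {
      const parsed = JSON.parse(line);
      // Stand-in for schema validation (assumption of this sketch).
      if (parsed !== null && Number.isInteger(parsed.seq)) rec = parsed;
    } catch { /* unparseable line */ }
    const isTail = i === lastIdx;
    if (isTail && (rec === null || !endsWithNewline)) {
      out.tornTail++; // rule 3: drop the tail even if it parses
      return;
    }
    if (rec === null) {
      out.skipped++; // rule 2: mid-file damage is counted, never silent
      return;
    }
    out.records.push(rec);
  });
  return out;
}
```

Appending a framed record after a torn fragment shows the leading newline at work: the fragment becomes its own malformed line and the new record still parses.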
445
+ ### 2.7 Durability (fsync) and the journal-first invariant
446
+
447
+ - **Write order inside `mutate()` (the single most important invariant):**
448
+
449
+ ```
450
+ append v2 event(s) → fsync journal fd → write projection files → bump watermark in meta
451
+ ```
452
+
453
+ Program-order journal-first is fiction without a barrier: the OS may
454
+ persist later projection writes before earlier journal appends, yielding
455
+ a projection *from the future* that the journal cannot explain — which a
456
+ reconciler would then wrongly regress (silent data loss).
457
+ - **Default: one `fsync` per `mutate()` call** — after the last append,
458
+ before any projection write. Mutations are human-action frequency, not
459
+ hot-loop; one fsync each is affordable on NTFS. Config escape hatch
460
+ `store.journal.fsync: "mutation" | "never"`; **CI and tests run the prod
461
+ default** (fidelity over speed, per the test-env-contamination history).
462
+ - **Never-regress guard (defense in depth — fsync can be configured off):**
463
+ the reconciler refuses to overwrite a projection with replayed state that
464
+ is *older* (lower `entity_rev`) than what the projection holds; a
465
+ regressing mismatch is a doctor error, not a write.
466
+
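The never-regress guard reduces to a three-way comparison on `entity_rev`. The sketch below is illustrative only; `reconcileDecision` and its return labels are hypothetical, and the equal-rev branch follows the equal-rev rule stated in §2.8.

```javascript
// projectionRev: entity_rev currently held by the projection file, or null
// when no projection exists. replayedRev: entity_rev produced by replay.
function reconcileDecision(projectionRev, replayedRev) {
  if (projectionRev === null) return "apply";        // nothing to regress
  if (replayedRev > projectionRev) return "apply";   // fill the gap forward
  if (replayedRev === projectionRev) return "keep";  // equal rev: leave the projection alone
  return "doctor_error";                             // regression: refuse to write
}
```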
467
+ ### 2.8 Projections and event emission
468
+
469
+ - Projections are exactly today's per-entity JSON files — atomic,
470
+ pretty-printed, git-diffable. They remain the store's human-readable and
471
+ MCP-cheap representation.
472
+ - **Staleness check is O(1)**: read `meta.json`, compare per-family
473
+ `last_applied_seq` to `next_seq - 1`. Equal (the overwhelmingly common
474
+ case) → serve projection files directly; the MCP worker-per-call fresh
475
+ path adds one small file read. Behind → acquire the lock, replay only the
476
+ gap onto the projection files, bump the watermark, serve. This is the
477
+ pln#496 lazy reconcile; there is no daemon.
478
+ - **Lock contended** → serve the stale projection annotated `stale: true`
479
+ rather than block; whoever wins the lock heals once (no thundering herd
480
+ of identical reconciles). Whether claim-class entities may be served
481
+ stale is a Juan call (§6).
482
+ - **Emission = diff synthesis at the persist choke point, permanently,
483
+ plus verb-site intent annotation.** `persistStateUnlocked` computes an
484
+ id-level diff (created / changed / removed) against the loaded state and
485
+ synthesizes snapshot events — a single choke point provably consistent
486
+ with what was persisted, immune to call-site drift. To preserve verb
487
+ semantics (`claim` vs `update` vs `complete` — consumed by notifications
488
+ and federation signaling), verb sites declare
489
+ `(action, item_type, item_id, summary)` into the in-flight mutation
490
+ context (today's ~30 `appendEvent` call sites already pass exactly these
491
+ fields; they redirect to the context instead of the legacy stream); the
492
+ diff supplies the payload and emits any *unannotated* change as `update`
493
+ plus a doctor counter. There is **no migration to explicit call-site
494
+ event emission** — explicit emission is justified only for registries
495
+ that never pass through `State` (assignments/runs/loops), and those reuse
496
+ the same append+project primitive.
497
+ - **Deletion authority** (journal-primary mode): a projection file is
498
+ unlinked only when a tombstone for its id is applied. "Absent from
499
+ in-memory state" stops being a deletion signal — closing the
500
+ trp_d5595086 bug class structurally. The never-unlink-unparseable guard
501
+ carries over on the projection side. The **same id-level diff** that
502
+ synthesizes events is what drives the unlink — one diff, two consumers,
503
+ cannot disagree (today's `deleteMissing` path and event emission are
504
+ separate code; v2 fuses them). The coarse `agent: "system"` /
505
+ `item_type: "state"` ping today's `persistStateUnlocked` appends is
506
+ replaced by the per-entity diff events.
507
+ - **Heartbeat-class churn is never journaled.** Refresh/liveness field
508
+ updates (claim `expires_at` extensions, run `last_heartbeat_at`,
509
+ assignment `last_progress`, lock metadata) are ephemeral —
510
+ projection/registry layer only. Only lifecycle *transitions* (claimed,
511
+ released, completed, failed) are events. Without this rule, 20 agents ×
512
+ 30s heartbeats × 2 KB snapshots ≈ >100 MB/day of journal for zero
513
+ information.
514
+ - **Ephemeral-field masking (normative consequence).** Ephemeral fields
515
+ mutate projections without journal events or `entity_rev` bumps, so a
516
+ projection can legitimately differ from replayed state at *equal* rev.
517
+ Therefore: the reconciler never overwrites a projection at equal rev (it
518
+ only fills gaps forward), `doctor --verify-journal` masks the ephemeral
519
+ field set per item_type before diffing, and the §2.8 diff synthesizer
520
+ emits **no event** for ephemeral-only changes. The ephemeral field set
521
+ is declared once per schema: a single source consumed by all three.
522
+ - **Falsifier (phase 0 deliverable):** from the dogfood store's 17k v1
523
+ events, compute per-item_type p95 entity size × event frequency, and
524
+ instrument event bytes by action class during the dual-mode sprint. If any
525
+ non-heartbeat class exceeds ~50% of journal bytes, or any record would exceed
526
+ 64 KB, that type needs `payload_ref` or a delta format in phase 1, not later.
527
+
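The id-level diff with ephemeral masking and verb-site intent might look like the sketch below. It is an assumption-laden illustration: `synthesizeEvents`, the `kind` labels, and the event shape are hypothetical, entities are compared by a shallow canonical form, and nested key order is assumed stable.

```javascript
// Canonical form of an entity with ephemeral fields masked out.
// Only top-level keys are sorted; nested key order is assumed stable.
function masked(entity, ephemeral) {
  const keys = Object.keys(entity).filter((k) => !ephemeral.has(k)).sort();
  return JSON.stringify(keys.map((k) => [k, entity[k]]));
}

// loaded / next: Map<id, entity> before and after the mutation.
// intents: Map<id, action> declared by verb sites into the mutation context.
function synthesizeEvents(loaded, next, ephemeral, intents) {
  const events = [];
  let unannotated = 0; // feeds the doctor counter
  const emit = (id, kind, payload) => {
    let action = intents.get(id);
    if (action === undefined) {
      action = "update"; // unannotated changes are emitted as `update`
      unannotated++;
    }
    events.push({ item_id: id, kind, action, payload });
  };
  for (const [id, entity] of next) {
    const prev = loaded.get(id);
    if (prev === undefined) emit(id, "created", entity);
    else if (masked(prev, ephemeral) !== masked(entity, ephemeral)) emit(id, "changed", entity);
    // ephemeral-only differences fall through: no event
  }
  for (const id of loaded.keys()) {
    if (!next.has(id)) emit(id, "removed", null); // the same diff drives the unlink
  }
  return { events, unannotated };
}
```

A change that only touches a masked field such as `expires_at` produces no event, while a removed id produces exactly one `removed` entry that both the journal and the unlink step can consume.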
528
+ ### 2.9 Locking interplay
529
+
530
+ - The journal lives **inside** the existing `mutate()` critical section. No
531
+ new lock protocol. Seq assignment, appends, fsync, projection writes, and
532
+ the watermark bump all happen under the one store lock, journal-first.
533
+ - Lock-steal residual (a breaker briefly coexisting with a
534
+ stale-but-alive owner) is handled by detection + containment (§2.2), not
535
+ denied. The phrase "impossible by construction" is banned.
536
+ - Lock-hold growth (fsync + reconciling readers) is instrumented, not
537
+ assumed away: the phase-1 dual sprint records lock wait-time
538
+ distribution. **Falsifier:** p95 lock wait > ~200 ms under normal
539
+ multi-agent load falsifies global-seq-under-lock and forces the
540
+ per-writer-journal redesign (§2.2 boundary). Note for the instrumented
541
+ baseline: today's `persistStateUnlocked` already runs a **git commit**
542
+ (`commitMemoryChange`) inside the critical section — the pre-existing
543
+ dominant lock-hold term. The fsync the journal adds is marginal against
544
+ it; attribute wait-time per phase (append/fsync/projection/git) so the
545
+ falsifier indicts the right component.
546
+ - Federation imports must chunk: a 10k-event pull takes and releases the
547
+ lock per chunk rather than starving local agents.
548
+
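The lock wait-time falsifier is a percentile check over recorded samples. The sketch below uses the nearest-rank definition of p95, which is one of several common conventions and an assumption here; `percentile` and `lockFalsifierFires` are hypothetical names.

```javascript
// Nearest-rank percentile of a non-empty numeric sample (p in 1..100).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// waitsMs: lock wait times in milliseconds collected during the sprint.
// Returns true when p95 exceeds the threshold (200 ms in the text above).
function lockFalsifierFires(waitsMs, thresholdMs = 200) {
  return percentile(waitsMs, 95) > thresholdMs;
}
```

Running the same check separately on per-phase samples (append, fsync, projection, git) would show which component the threshold indicts.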
549
+ ### 2.10 Oversized payloads — `payload_ref` and the handoff diet (C3 resolution)
550
+
551
+ The phase-0 measurement (`event-log-store-phase0-measurements.md`) fired
552
+ the §2.8 falsifier: handoff entities are p50 109,700 B / p95 225,157 B,
553
+ about 1.7× the 64 KB threshold at p50 and 3.4× at p95, while every other
554
+ item_type sits at p95 ≤ 7.5 KB. Per C3's own rule this enters **phase 1**;
555
+ the record format ships with it. Two composable mechanisms, **both
556
+ adopted**:
557
+
558
+ 1. **Handoff diet (primary fix).** The dominant bytes are the inline
559
+ `snapshot.diff` (same root cause as the 41 MB
560
+ `handoffs/compacted.jsonl`). Externalize `snapshot.diff` from the
561
+ handoff document to a content-addressed attachment under
562
+ `events/blobs/` referenced by hash; the handoff entity returns to the
563
+ 2–8 KB class every other entity lives in. One move fixes the journal
564
+ record size, checkpoint size, J2's git posture, and the legacy
565
+ compacted.jsonl pathology. The schema change rides the existing
566
+ migration registry (§2.1.4). Product call J6 (§6) confirms portability
567
+ implications.
568
+ 2. **`payload_ref` (permanent safety net).** If a serialized payload
569
+ exceeds 64 KB, the writer stores it at
570
+ `events/blobs/<sha256[0:2]>/<sha256>` (content-addressed, write-once)
571
+ and the record carries `payload_ref: { sha256, bytes }` *instead of*
572
+ `payload`. Readers resolve transparently; a missing or hash-mismatched
573
+ blob is a doctor **ERROR** for that entity — never silent.
574
+ - **Blob-before-ref ordering (normative):** the blob is written and
575
+ fsync'd *before* the journal append that references it — the §2.7
576
+ barrier extended one link left. A crash between the two leaves an
577
+ orphan blob (harmless, gc-able), never a dangling ref.
578
+ - **Checkpoint closure:** checkpoints store oversized post-images as
579
+ the same `payload_ref` (manifests stay small); the
580
+ `checkpoint_ref.payload.blobs` list (§2.1.2) enumerates the closure.
581
+ "Self-contained" (§2.4) is redefined as *manifest + blob closure*;
582
+ verify-before-archive verifies the manifest hash AND presence + hash
583
+ of every blob in the closure.
584
+ - **Blob gc:** park-don't-delete. A blob moves to `archive/blobs/` only
585
+ when referenced by zero records in non-archived segments AND by
586
+ neither of the two newest verified checkpoints' closures — the §2.3
587
+ floor extended verbatim.
588
+ - **Redaction closure (J1 × `payload_ref`, normative — resolves the
589
+ blocking half of R2, 2026-06-12):** `doctor redact` of a record whose
590
+ payload lives in a blob must also **delete the blob** (true erasure —
591
+ the one exception to park-don't-delete; an erasure request is not
592
+ satisfied by parking) AND regenerate any checkpoint whose closure
593
+ references it *before* the redaction completes — manifest rewritten
594
+ minus the redacted post-image, re-verified, the stale checkpoint
595
+ parked. The redaction `journal_note` (§2.1.2) lists rewritten
596
+ checkpoints alongside segments. Invariant: after `doctor redact`
597
+ returns, no live segment, no `archive/blobs/` entry, and no
598
+ checkpoint closure can yield the redacted bytes. The *federation*
599
+ half of R2 (peer re-presenting a pre-redaction record or checkpoint)
600
+ stays open in §6 — it cannot be closed before the federation
601
+ transport exists.
602
+ - **Git (J2 boundary):** `events/blobs/` is gitignored like segments.
603
+ With the diet in place no live entity ships an oversized payload, so
604
+ bare-clone restorability from projections + checkpoints holds in
605
+ practice; doctor flags any checkpoint whose closure references a
606
+ gitignored blob as not-clone-restorable. This becomes a real product
607
+ trade-off only if J6 rejects the diet.
608
+
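The `payload_ref` decision and the content-addressed path can be sketched without touching the filesystem. Assumptions are labeled: the hash function is injected as `sha256Hex` (string to hex digest) so the sketch has no dependencies, and `encodePayload`, `blobPath`, and the returned shape are hypothetical.

```javascript
const PAYLOAD_LIMIT = 64 * 1024; // 64 KB threshold from the text

// Content-addressed location: events/blobs/<sha256[0:2]>/<sha256>.
function blobPath(sha256) {
  return `events/blobs/${sha256.slice(0, 2)}/${sha256}`;
}

// Decide how a payload travels. `sha256Hex` is an injected hash function
// (an assumption of this sketch, not the package's API).
function encodePayload(payload, sha256Hex) {
  const serialized = JSON.stringify(payload);
  const bytes = new TextEncoder().encode(serialized).length;
  if (bytes <= PAYLOAD_LIMIT) {
    return { record: { payload }, blob: null }; // inline as usual
  }
  const sha256 = sha256Hex(serialized);
  return {
    record: { payload_ref: { sha256, bytes } }, // replaces `payload`
    // Blob-before-ref: the caller writes and fsyncs this blob first,
    // then appends the record that references it.
    blob: { path: blobPath(sha256), data: serialized },
  };
}
```

Returning the blob alongside the record leaves the write ordering to the caller, which is where the blob-before-ref rule has to be enforced.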
609
+ Residual falsifier follow-up: `runtime_note`/`session` event *count* (10k
610
+ of 17.7k v1 events) is volume, not bytes — both classes are payload-free
611
+ in v2 (observability), so they contribute line overhead only (~2–3 MB at
612
+ historical rates) and do not threaten the weekly-roll target. No per-class
613
+ retention knob needed ahead of J5.
614
+
615
+ ### 2.11 Federation conflict primitive (C4 resolution)
616
+
617
+ Cross-checked against `identity-model-proposal.md` (origin-partitioned
618
+ write authority; scalar `entity_rev` + origin tag;
619
+ `(origin_id, origin_epoch, seq)`-headed slices — the epoch handles
620
+ restore-from-backup, see the proposal). Both symmetric reviews attacked the
621
+ same concurrent-edit hole independently and produced two complementary
622
+ detection mechanisms; this section reconciles them (coordinator synthesis
623
+ 2026-06-11, flagged for Codex adjudication in §6 R-C4).
624
+
625
+ - **Execution entities** (claims, runs, locks, assignments): single-writer
626
+ per origin; other origins materialize read-only. Authority partition
627
+ means no concurrent-write conflict exists; the scalar is trivially
628
+ sufficient. (The advisory cross-machine claim race is *arbitration*, not
629
+ journal conflict — deferred to the cloud dispatcher per the proposal.)
630
+ - **Memory entities**: LWW ordered by the total order
631
+ `(entity_rev, origin_id)` — rev first, origin_id lexicographic as the
632
+ deterministic tiebreak. **No wall clock anywhere** (the "LWW by what
633
+ clock?" answer: by revision counter + origin id, never time). Convergent:
634
+ every origin applying the same record set reaches the same head.
635
+ - **The attack (resolution ≠ detection):** origin A edits entity e
636
+ rev 7→8→9; origin B, offline, edits e 7→8. B's slice reaches A after A
637
+ is at rev 9. *Resolution* is correct (9 > 8, deterministic LWW). But
638
+ *detection* — the proposal's "conflicts surface as candidates, never
639
+ silent overwrite" — cannot be decided from head comparison: B's rev-8 is
640
+ concurrent with A's lineage, not an ancestor of it, and a bare scalar
641
+ head cannot distinguish "stale copy of what I already incorporated" from
642
+ "divergent edit with a lower rev".
643
+ - **Adopted detection rules (two, complementary — reconciled with the
644
+ identity proposal's hardened model):**
645
+ 1. **PRIMARY — `base_rev` fast-forward check** (from the identity
646
+ proposal, post-review): every *exported* memory-entity record carries
647
+ `base_rev`, the rev the write was based on. Receiver rule: incoming is
648
+ a clean fast-forward iff `incoming.base_rev >= current.rev`; otherwise
649
+ the write was concurrent → LWW materializes the winner AND a conflict
650
+ candidate carries both post-images. One integer per exported record,
651
+ decided from the record alone — **independent of local history
652
+ retention**, so it survives gc/compaction and works on a fresh
653
+ materialize.
654
+ 2. **DEFENSE-IN-DEPTH — `(rev, origin)` journal collision at replay**:
655
+ import replays the incoming slice through the reducer; an incoming
656
+ record whose `(item_id, entity_rev)` already exists locally **with a
657
+ different origin** is a conflict (the §2.2 dup-detection generalized
658
+ from `(seq, writer)`). Catches legacy/foreign slices lacking
659
+ `base_rev` and cross-checks rule 1, at zero envelope cost — but only
660
+ reaches back to the gc floor.
661
+ In the attack above, both rules fire: B's record has `base_rev 7 <
662
+ current rev 9` (rule 1) and B's (e, 8) collides with A's journaled
663
+ (e, 8) (rule 2) → candidate surfaced while LWW keeps A's rev 9.
664
+ - **Residual miss-window, now narrow:** only a record that *lacks*
665
+ `base_rev` (legacy exporter) AND whose colliding rev is archived past
666
+ the gc floor escapes surfacing — convergence still never breaks.
667
+ The per-origin high-watermark map (a bounded vector clock, size =
668
+ origin count, typically ≤ 3) remains the **named upgrade path** if dogfooding
669
+ shows missed candidates; it slots into import metadata without touching
670
+ the envelope, which stays origin-agnostic (origin appears only in
671
+ exported slice headers, per the proposal's migration step 2).
672
+ - **Cross-requirement flowing back to the identity proposal:**
673
+ `entity_rev` must never reset per item_id — tombstone → recreate
674
+ continues the counter (§2.1.1/§2.1.3) — otherwise (rev, origin)
675
+ collisions become false positives after delete→recreate races.
676
+
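The LWW total order and the `base_rev` fast-forward rule (rule 1) can be sketched together. This illustration assumes records are plain objects with `entity_rev`, `origin_id`, and `base_rev`, and that already-imported duplicates are filtered upstream; `compareHeads` and `applyIncoming` are hypothetical names.

```javascript
// Total order on heads: entity_rev first, origin_id as lexicographic tiebreak.
function compareHeads(a, b) {
  if (a.entity_rev !== b.entity_rev) return a.entity_rev - b.entity_rev;
  if (a.origin_id === b.origin_id) return 0;
  return a.origin_id < b.origin_id ? -1 : 1;
}

// Rule 1: the incoming record is a clean fast-forward iff
// incoming.base_rev >= current.entity_rev. Otherwise the write was
// concurrent: LWW still picks the head, and a candidate is surfaced.
function applyIncoming(current, incoming) {
  const fastForward = incoming.base_rev >= current.entity_rev;
  const head = compareHeads(incoming, current) > 0 ? incoming : current;
  const conflict = fastForward ? null : { kept: head, candidates: [current, incoming] };
  return { head, conflict };
}
```

In the attack scenario, A's head at rev 9 receives B's rev 8 with `base_rev` 7: the head stays at A's rev 9 and a conflict candidate carrying both post-images is returned.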
677
+ ## 3. Failure-mode matrix
678
+
679
+ | # | Scenario (round-2 attack) | Mitigation in this spec |
680
+ |---|---|---|
681
+ | 1 | Crash mid-append (torn tail) | Leading-`\n` framing caps loss at 1 event; reader skips tail; next writer appends adjudicating `journal_note`; doctor counts adjudicated residue separately from corruption (§2.6) |
682
+ | 2 | Torn line that parses validly | Dropped anyway: journal-first + fsync means an unconfirmed tail was never acknowledged to the caller (§2.6 rule 3) |
683
+ | 3 | Crash between append and projection write | Projection stale, never ahead (fsync barrier §2.7); lazy reconcile heals forward on next read |
684
+ | 4 | Projection from the future (no-fsync reorder) | One fsync per mutate before projection writes; never-regress guard keyed on `entity_rev` as second line (§2.7) |
685
+ | 5 | Two writers in the lock-steal window | O_APPEND seatbelt prevents byte interleaving; duplicate seq detected via `(seq, writer)`, applied in file order (snapshot double-apply is convergent), doctor warns (§2.2) |
686
+ | 6 | Seq counter corruption outliving the race (both writers rewrite meta, loser's bump lost, third writer reuses seq) | Tail validation at lock acquisition: `next_seq = max(meta, tail+1)` + `seq_repair` event; meta is a rebuildable cache, the journal is truth (§2.2) |
687
+ | 7 | Lockless appender writes into a just-rolled "immutable" segment | No lockless path exists; all appends take the lock and resolve the active segment inside it (§2.2, §2.3) |
688
+ | 8 | Crash mid-checkpoint | Out-of-band manifest; worst case orphan file with no `checkpoint_ref` (harmless); meta written last (§2.4) |
689
+ | 9 | Corrupt checkpoint discovered after segments archived | Verify-by-full-re-parse before archival; sha256 in `checkpoint_ref`; previous-checkpoint fallback; gc floor = second-newest verified checkpoint (§2.4) |
690
+ | 10 | Oversized record exits the O_APPEND atomicity envelope | Payloads > 64 KB externalized via `payload_ref` (§2.10); envelope line hard-fails at 256 KB (§2.1) |
691
+ | 11 | Partial `write()` (signal, ENOSPC, quota) | Short-write check ⇒ loud mutation failure before projections (§2.6) |
692
+ | 12 | Rotation/sealing during concurrent read | Segments never renamed; active segment is just the newest file; seq watermarks survive any layout change (§2.3, §2.5) |
693
+ | 13 | Cursor predates archived history | `{gap: true}` + checkpoint-built summary; graceful notification degradation (§2.5) |
694
+ | 14 | Clock skew / ts collision | Irrelevant — ts never orders (§2.2) |
695
+ | 15 | 100k-event store cold read | Fresh path O(1) check + projection read; stale path replays only the gap; rebuild bounded by latest checkpoint (§2.4, §5) |
696
+ | 16 | `meta.json` corrupt/lost | Rebuilt from segment listing + tail read — it is a cache, not truth (§2.3) |
697
+ | 17 | Heartbeat churn floods segments (20-agent scale) | Heartbeat-class updates excluded from the journal by rule; volume falsifier instrumented (§2.8) |
698
+ | 18 | Store on a network mount | Documented local-FS-only support boundary; doctor warns heuristically (§2.3) |
699
+ | 19 | Wedged lock = no event capture; sandboxed workers can't append | Stated scope boundary: journal is truth of the store, not the system; offline capture falsifies the primitive and triggers the per-writer redesign (§2.2) |
700
+ | 20 | Mid-file malformed line (should be impossible under lock) | Skip + count + doctor alarm (unexplained-corruption class), never silent (§2.6) |
701
+ | 21 | Crash between blob write and referencing append | Blob-before-ref ordering: worst case an orphan blob (harmless, gc-able), never a dangling `payload_ref` (§2.10) |
702
+ | 22 | `payload_ref` blob missing or hash-mismatched at read | Doctor ERROR for that entity, read fails loudly — never silent (§2.10) |
703
+ | 23 | Replay diff flags ephemeral-only field drift as divergence | Ephemeral field set masked per item_type in verify-journal and the reconciler; equal-rev projections never overwritten (§2.8) |
704
+
705
+ ## 4. Migration plan
706
+
707
+ Flag: `store.journal_v2: off | dual | primary` (default `off`). Each phase
708
+ ships dark behind the flag; this repo's own store (~17k v1 events of real
709
+ multi-agent traffic) is the canary. A `.brainclaw/` backup is taken at every
710
+ phase flip (upgrade-style, park-don't-delete).
711
+
712
+ - **Phase 0 — format, no behavior change.** Land the v2 record schema (zod),
713
+ segment reader/writer, meta cache, doctor counters,
714
+ max-record-size enforcement, and the **snapshot-size falsifier
715
+ measurement** (§2.8). v1 `events.jsonl` untouched.
716
+ - **Phase 1 — `dual`: journal-first dual-write.** One-shot
717
+ `bclaw migrate journal`: backup store; emit a **genesis backfill** — one
718
+ `backfill` snapshot event per current entity, built from the projection
719
+ files (the only truth we have; the 17k payload-less v1 events are not
720
+ translatable — parked to `events/archive/events.v1.jsonl`, readable
721
+ forever for forensics); initialize meta. `persistStateUnlocked` reorders
722
+ to append → fsync → existing file writes → watermark. Notifications
723
+ switch to seq-watermark cursors. State dirs remain authoritative.
724
+ Phase 1 also lands `payload_ref` + the handoff diet (§2.10) — the
725
+ phase-0 falsifier fired on handoffs, so the record format ships with
726
+ both.
727
+ **Rollback:** set `off` — projection files were written on every mutation
728
+ in exactly today's format; park `events/`; zero data transformation in
729
+ either direction.
730
+ - **Phase 2 — verification (promotion gate).**
731
+ `bclaw doctor --verify-journal` rebuilds state from
732
+ checkpoint + journal in a temp dir and diffs against live projections —
733
+ the only check that validates the actual claim ("the journal is
734
+ sufficient to reproduce state"). Runs in CI on **both OS families**,
735
+ alongside: kill-9 storm tests (crash between append and projection must
736
+ always converge), a two-process append stress test (N children × K
737
+ events; assert no interleaved bytes, no lost `(seq, writer)` pairs), and
738
+ the tail-validation test. Doctor counters (skipped lines, torn tails,
739
+ adjudicated fragments, unannotated-diff emissions, network-FS warning)
740
+ run always-on as continuous telemetry. **Exit criterion:** zero
741
+ divergence across a full dogfooding sprint of real multi-agent traffic,
742
+ including dispatch worktree churn; lock wait-time distribution recorded
743
+ (§2.9 falsifier).
744
+ - **Phase 3 — `primary`.** Reads serve projections via lazy reconcile;
745
+ deletion authority moves to tombstones; `mutateState` callers unchanged.
746
+ Then per-entity ops: single-entity mutations append + patch one
747
+ projection file without full-store load/rewrite; registries
748
+ (assignments/runs/loops) unify on the same append+project primitive
749
+ (entry phase is a Juan sequencing call, §6). **Rollback:** projections
750
+ are at all times a complete materialized state in legacy format — flip
751
+ to `dual` or `off`, re-arm legacy delete semantics, no data
752
+ transformation.
753
+
754
+ ### Phase 2 gate status (pln#565, 2026-06-12)
755
+
756
+ The promotion gate is now **mechanically checkable** via one command and an
757
+ automated hardening suite. Status of each Phase-2 exit criterion:
758
+
759
+ | Criterion | Status | Evidence |
760
+ | --- | --- | --- |
761
+ | Journal reproduces projections (the core claim) | ✅ | `brainclaw doctor --verify-journal` — rebuilds from journal, diffs vs live projections, exit 1 on drift. GREEN on this repo's store (mode=dual). |
762
+ | Tail validation / torn-tail adjudication | ✅ | `journal-v2.test` (torn tail → `torn_tail_adjudicated`, stale meta → `seq_repair`). |
763
+ | Two-process append stress | ✅ | `journal-concurrency.test` — N processes × K appends, gap-free 1..N*K seq, N distinct writers, zero torn/lost. |
764
+ | Kill-9 storm convergence (append path) | ✅ | `journal-concurrency.test` — SIGKILL mid-append storm: journal stays readable, seqs never duplicate, post-storm append re-derives a non-colliding seq, state still materializes. |
765
+ | **Persist crash-ordering — journal before projections (I2)** | ✅ | pln#566 F1 (codex review): persist now PLANs → emits+fsyncs the journal → APPLIES projection writes, so a crash can only leave the journal ahead (recoverable), never projections ahead. Proven by `journal-crash-ordering.test` via deterministic fault injection on the real `mutateState` pipeline. (Earlier the kill-9 test exercised `forceAppendJournalRecords` directly, not the mutation pipeline — that gap is now closed.) |
766
+ | Migration + rollback tooling | ✅ | genesis backfill + `rollbackJournal` (park `events/`, projections untouched). |
767
+ | Dual-OS CI | ✅ | `.github/workflows/ci.yml` matrix `[ubuntu, windows]`. |
768
+ | **Zero divergence across a real multi-agent sprint** | ✅ | seq#47 (2026-06-12): 4 parallel claude-code lanes + dispatch worktree churn + 4 merges → `verify-journal` zero drift throughout. |
769
+ | Lock wait-time distribution (§2.9 falsifier) | ◐ | Lock serialization proven under contention by `journal-concurrency.test`; explicit p50/p95 telemetry via doctor counters is the one remaining instrumentation item — lands with the cutover (it touches the mutate hot path). |
770
+
771
+ **Verdict:** the correctness gate is GREEN. The only residual is wait-time
772
+ *telemetry* (not a correctness blocker). The primary cutover (Phase 3) is a
773
+ Juan sequencing call (§6) and a distinct implementation workstream (tombstones +
774
+ per-entity append/patch), not gated on more verification.
775
+
776
+ ## 5. Perf targets (measured, not asserted)
777
+
778
+ - `bclaw_work` cold read < 1 s on a 100k-event store.
779
+ - Single-entity op cost independent of store size: O(1) append + O(1)
780
+ projection write + O(gap) reconcile.
781
+ - MCP worker-per-call overhead delta < 50 ms vs. today (fresh path = one
782
+ extra small meta read).
783
+ - One fsync per `mutate()`; lock p95 wait < 200 ms under normal multi-agent
784
+ load (falsifier threshold, §2.9).
785
+ - Segment roll ≈ every 2–3 weeks at current write rates (post heartbeat
786
+ exclusion); checkpoint cost O(live entities) under lock.
787
+
788
+ ## 6. OPEN QUESTIONS
789
+
790
+ Severity-ranked. Every open question from round 2 not resolved by this spec
791
+ is carried here.
792
+
793
+ ### [JUAN — product calls] — RESOLVED 2026-06-10
794
+
795
+ | # | Sev | Decision |
796
+ |---|---|---|
797
+ | J1 | HIGH | **`doctor redact` ships in v1.** Immutability is "immutable except via audited `doctor redact`": tooled segment rewrite, audit-trailed, seq watermarks survive it. Rationale: the EU/GDPR positioning cannot answer "impossible" to an erasure request. (Write-time secret-detection may complement later; it does not replace redaction.) |
798
+ | J2 | HIGH | **Projections + checkpoints in git; segments and meta gitignored.** The store's git-diffable identity = the per-entity projections (diff/merge as today) plus checkpoints (single-file snapshots a human can adjudicate in a merge, making a bare git clone restorable without segments). No segment blobs in history; the branched-seq merge problem never enters git. |
799
+ | J3 | MED | **Read-through for claim-class entities.** Claims and active assignments read the journal tail even under contention — consistency before liveness for the coordination primitive (no double-work is the product promise). Tail-read cost is paid only on this hot-critical path; memory-class entities keep stale-annotated reads (§2.8). |
800
+ | J4 | MED | **Registry enters in a dedicated Phase 1.5.** Phase 1 = memory entities (low volume, proven reversibility); registry lifecycle transitions migrate once the journal is hardened in real use. Matches the off/dual/primary posture: the dispatch lifecycle is the product's credibility — it is not migrated first. |
+ | J5 | LOW | **Defer fine gc/archive thresholds.** The normative two-verified-checkpoint floor stands alone until federation defines its consumer; count/age knobs are trivially additive later. |
+
+ ### [JUAN — new product call raised by C3]
+
+ | # | Sev | Question |
+ |---|---|---|
+ | J6 | MED | **Handoff diet (§2.10):** externalize `snapshot.diff` from handoff documents to content-addressed blob attachments. Affects handoff export/import and federation transfer (the blob closure must travel with the document). Recommended: **accept** — it also fixes the 41 MB `compacted.jsonl` class and keeps J2's bare-clone restorability intact. |
+
+ ### [CODEX — schema/invariant review] — RESOLVED 2026-06-10 (symmetric pass)
+
+ | # | Sev | Resolution |
+ |---|---|---|
+ | C1 | HIGH | Resolved in §2.1.1 (action taxonomy, 5 classes, holes closed: required `item_id`, `assignment_progress` heartbeat-class, store-ops per-entity + `store_marker`, archival-vs-delete, rev-never-resets), §2.1.2 (journal-meta schemas incl. genesis + J1 redaction audit note), §2.1.3 (tombstones per item_type), §2.2 (dup-seq reducer, 5 normative cases). |
+ | C2 | HIGH | Resolved in §2.1.4: version-in-payload + migration-on-replay reusing the existing `migration.ts` versioned-document registry; migration-retention invariant pinned to the *second-newest* checkpoint; alternatives in Appendix A. |
+ | C3 | MED | Falsifier FIRED on handoffs (phase-0 measurements). Resolved in §2.10: handoff diet (primary) + `payload_ref` (safety net), blob-before-ref ordering, checkpoint blob closure, gc floor extension, J2 git posture. Residual product call → J6. |
+ | C4 | MED | Resolved in §2.11 against `identity-model-proposal.md`: scalar `(entity_rev, origin_id)` survives — convergence intact; conflict *surfacing* via (rev, origin) journal collision; documented miss-window past the gc floor with the per-origin watermark as named upgrade path. |
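The C2 resolution (version-in-payload plus migration-on-replay) can be sketched as below. This is an illustration under stated assumptions, not the contents of `migration.ts`: the `MIGRATIONS` registry, the `VersionedPayload` shape, and the `title`/`name`/`tags` fields are all hypothetical. Only the idea that each payload carries its own `schema_version` and is upgraded while being replayed comes from the table.

```typescript
// Hypothetical payload shape: every payload carries its own schema_version.
interface VersionedPayload {
  schema_version: number;
  [field: string]: unknown;
}

type Migration = (payload: VersionedPayload) => VersionedPayload;

// Hypothetical registry: key N upgrades a payload from version N to N + 1.
const MIGRATIONS: Record<number, Migration> = {
  1: (p) => ({ ...p, schema_version: 2, tags: [] }),
  2: (p) => {
    const { title, ...rest } = p;
    return { ...rest, schema_version: 3, name: title };
  },
};

const CURRENT_VERSION = 3;

// Pure function: returns an upgraded copy and never touches the stored record.
function migrateOnReplay(payload: VersionedPayload): VersionedPayload {
  let current = payload;
  while (current.schema_version < CURRENT_VERSION) {
    const step = MIGRATIONS[current.schema_version];
    if (!step) {
      throw new Error(`no migration from schema_version ${current.schema_version}`);
    }
    current = step(current);
  }
  return current;
}
```

Because the chain runs at replay time, old segments keep their original bytes. This is also why the migration steps must be retained for as long as a record at the old version can still be replayed.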
+
+ ### [CODEX pre-P0 review] — RESOLVED 2026-06-12 (claude-code, codex out of credits)
+
+ Codex's final pass before P0 implementation surfaced 5 findings; all
+ verified against code and resolved in this revision:
+
+ | # | Sev | Resolution |
+ |---|---|---|
+ | F1 | MED/HIGH | `assignment_progress` carried durable state (`status_reason`, `artifacts`) on the heartbeat path — un-replayable as specced. Resolved in §2.1.1: heartbeat/durable split (`assignment_progress`/`run_progress` = pure ticks; `assignment_amended`/`run_amended` = registry-lifecycle with post-image), effective phase 1.5. |
+ | F2 | MED | Same ambiguity on `run_running` (re-emitted per tick). Resolved with F1 — one decision: `run_running` = transition-only; ticks move to `run_progress`. |
+ | F3 | MED | `federation_apply` required by the identity proposal but absent from the taxonomy. Resolved: declared as journal-meta NOW (§2.1.1 table + §2.1.2 schema), inert until federation ships — avoids a post-freeze union extension. |
+ | F4 | MED | Redaction × payload_ref/checkpoints under-specified. Blocking half resolved in §2.10 (redaction closure: blob deletion + checkpoint regeneration, normative invariant); federation re-import half stays open as R2 below. |
+ | F5 | LOW | Spec said 32 `EventAction` members; code has 34. Corrected in §2.1.1. |
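The F1/F2 decision (transition-only `run_running`, ticks on `run_progress`) can be illustrated with a small sketch. The action names come from the table; the `prevStatus` values, the `run_id` and `at` fields, and the function name are assumptions made for the example.

```typescript
// Hypothetical event shapes; only the action names are taken from the spec.
type RunEvent =
  | { action: "run_running"; run_id: string }               // lifecycle transition
  | { action: "run_progress"; run_id: string; at: number }; // pure tick, no durable state

// Emits the transition once, then only ticks while the run stays running.
function eventForHeartbeat(runId: string, prevStatus: string, now: number): RunEvent {
  if (prevStatus === "running") {
    return { action: "run_progress", run_id: runId, at: now };
  }
  return { action: "run_running", run_id: runId };
}
```

Under this split, skipping every `run_progress` record during replay leaves the rebuilt state unchanged, which is the property the heartbeat class needs.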
+
+ ### [CODEX residue — needs a second model's schema instincts]
+
+ | # | Sev | Question |
+ |---|---|---|
+ | R1 | MED | ~~Zod encoding of §2.1.1~~ **RESOLVED 2026-06-12** (codex recommendation, claude-code verified zod 4.4.3 installed): `action` stays the only discriminant — no serialized `class` field (derived-field drift, trp#180 family). `ACTION_CLASS_BY_ACTION` table `satisfies Record<EventAction, ActionClass>` for compile-time exhaustiveness; zod discriminatedUnion on `action` enums per class; phase-gated payload requiredness = runtime refinement by journal mode, not schema variants. See §2.1.1. |
+ | R2 | MED | **Redaction × cursors × federation — federation half only** (blob/checkpoint closure resolved in §2.10). Does seq-watermark survival hold for a cursor positioned *inside* a redacted range? And the re-import hole: a federation peer that pulled a record pre-redaction can re-present it — `(seq, writer)` dedup would *reject* the redacted copy (good) but the peer's checkpoint may still embed the payload. The redaction note likely needs to propagate as a federation signal; decide with the federation transport (cannot be closed before it exists). |
+ | R3 | LOW | **Ephemeral field set enumeration (§2.8).** Adversarial sweep of the real zod schemas for fields beyond `last_heartbeat_at` / claim `expires_at` / `last_progress` that mutate without semantic change (counters, denormalized caches?) — the masking set must be complete or verify-journal cries wolf. |
+ | R4 | LOW | **C4 miss-window sizing (§2.11).** Gc-floor window (weeks) vs realistic offline-origin durations; should the per-origin watermark ship in federation v1 regardless of dogfood evidence? |
+ | R-C4 | MED | **Dual conflict-detection adjudication (§2.11, reconciliation 2026-06-11).** The two symmetric reviews independently produced `base_rev` fast-forward (identity proposal) and `(rev, origin)` journal collision (this spec); the coordinator kept BOTH (primary + defense-in-depth). Adjudicate: is the redundancy worth the dual maintenance, or should one become normative? Note that `base_rev` is the only one that survives gc and fresh materializations. |
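The R1 resolution relies on a lookup table checked with `satisfies` rather than a serialized `class` field. The sketch below shows that pattern on a reduced union. The `ActionClass` member names and the class assigned to `create`/`update`/`delete` are assumptions; the real union has 34 `EventAction` members and its own class names in §2.1.1.

```typescript
// Hypothetical class names; the spec defines 5 classes but does not name them here.
type ActionClass = "memory" | "registry_lifecycle" | "heartbeat" | "journal_meta" | "store_op";

// Reduced, illustrative subset of the EventAction union.
type EventAction =
  | "create" | "update" | "delete"
  | "run_running" | "assignment_amended" | "run_amended"
  | "assignment_progress" | "run_progress"
  | "federation_apply" | "store_marker";

// `satisfies` makes the compiler reject a missing action or an invalid class,
// while keeping the class derived from `action` instead of stored alongside it.
const ACTION_CLASS_BY_ACTION = {
  create: "memory",
  update: "memory",
  delete: "memory",
  run_running: "registry_lifecycle",
  assignment_amended: "registry_lifecycle",
  run_amended: "registry_lifecycle",
  assignment_progress: "heartbeat",
  run_progress: "heartbeat",
  federation_apply: "journal_meta",
  store_marker: "store_op",
} satisfies Record<EventAction, ActionClass>;

function classOf(action: EventAction): ActionClass {
  return ACTION_CLASS_BY_ACTION[action];
}
```

Adding a member to `EventAction` without adding a row to the table fails type-checking, so the class can never drift from the action the way a persisted field could.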
+
+ ## Appendix A — Rejected alternatives
+
+ - **Diff/patch payloads (RFC 6902 or field-deltas).** Every event becomes
+ load-bearing: one torn line poisons all later state for that entity, and
+ zero-dep means hand-rolling a patch engine. Snapshots are idempotent,
+ self-healing, and compaction-trivial. (Both proposals; unanimous.)
+ - **A's rename-based sealing (`active.jsonl` → range name).** Contradicts
+ its own cursor format (offsets dangle after rename), and
+ rename-of-open-file is the exact Windows EPERM/EBUSY hazard it then needs
+ retry logic for. Segments are born with their permanent first-seq name.
+ - **A's byte-offset cursors `{segment_id, offset}`.** Die under rename,
+ under quarantine truncation, and under any future segment surgery
+ (including J1 redaction). Seq watermarks survive all of it.
+ - **A's writer-inline torn-tail quarantine (truncate + move bytes).** A
+ read-modify-write of the journal on the hot path: breaks append-only,
+ races the very lock-steal window the seatbelt exists for, and can
+ quarantine a live in-flight write. Demoted to offline doctor repair.
+ - **A's `fsync: rotate` default (no fsync per mutation).** Program-order
+ journal-first without a barrier permits projections from the future and
+ silent reconciler regression — the trp_d5595086 class. One fsync per
+ mutate is affordable at human-action mutation rates.
+ - **B's in-journal checkpoint event runs (+ terminator).** Pollutes every
+ seq-watermark cursor with O(entities) phantom events, leaves headless
+ runs on crash that are schema-identical to real events, and stretches
+ lock hold time. Out-of-band manifests have none of these.
+ - **A's "referencing" checkpoint variant (hashes of projection files).**
+ Circular: a rebuild-from-truth artifact whose validity depends on
+ projection integrity is useless precisely when projections are suspect.
+ Killed without further study.
+ - **B's lockless observability appends (`seq: null`).** Races segment roll
+ into "immutable" files, and seq-less records are unaddressable by B's own
+ seq-watermark cursors. All appends take the lock; revisit only if
+ instrumentation shows notification contention.
+ - **B's `(writer_id, writer_seq)` per-writer counter in the envelope.**
+ Serves only federation and is derivable later; `entity_rev` serves three
+ local masters today. Dead weight dropped.
+ - **B's `deleted: true` tombstone boolean.** Redundant with
+ `action: "delete"`; one source of truth in the envelope.
+ - **B's two meta files (`HEAD.json` + `projections.json`).** Two reads per
+ MCP call, two renames per mutation, plus cross-file ordering reasoning,
+ for state always consumed together. Single `meta.json`, keeping B's
+ rebuildable-cache property.
+ - **Migration to explicit verb-site event emission (A's end-state).** ~30
+ call sites each become a chance to forget, double-emit, or
+ emit-without-persisting. The diff choke point is provably consistent
+ with what was persisted; verb semantics are preserved by intent
+ annotation instead. Conversely, **pure diff with no annotation** (B's
+ letter) was also rejected: it collapses the EventAction union to generic
+ `update`, losing the semantics that notifications and federation signaling
+ consume.
+ - **Splitting notification stream from state journal now (B Q6).** Same
+ journal is simpler — one reader, one cursor type, one ordering; split
+ only if volume instrumentation demands it.
+ - **Separate journal per entity (vs. one per store).** Global order comes
+ free with one journal; per-entity journals reintroduce cross-entity
+ ordering as a problem. (Proposal A §0; never contested.)
+ - **Per-record envelope schema-version for payloads (C2 alternative).**
+ Redundant: payloads already self-describe via `schema_version` + the
+ migration registry; a second version field in the envelope creates two
+ sources of truth that can disagree.
+ - **Migration-by-segment-rewrite (C2 alternative).** Rewriting old
+ segments to the current payload schema violates append-only immutability
+ and J1's audited-rewrite-only rule; replay-time migration is
+ pure-functional and leaves bytes untouched.
+ - **Hash in every envelope, inline payloads included (C3 variant).**
+ Inline payloads are already line-framed and zod-validated; mandatory
+ hashing buys federation dedup nothing (dedup keys on `(seq, writer)`)
+ at a per-mutation CPU cost. The hash lives where it is load-bearing:
+ `payload_ref` and `checkpoint_ref`.
+ - **A vector-clock component in the envelope (C4 alternative).**
+ Origin-partitioned write authority makes convergence scalar-safe (§2.11);
+ the only thing a vector adds is complete conflict *surfacing* across
+ gc-floor-sized offline windows. Deferred to import metadata (per-origin
+ watermark) — the envelope stays origin-agnostic.
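Several of the rejections above rest on the claim that seq watermarks survive segment surgery where byte offsets do not. The sketch below shows the shape of that argument with an in-memory stand-in for segment files. `Segment`, `Cursor`, and `readSince` are hypothetical names, and real segments are files rather than arrays.

```typescript
interface JournalRecord {
  seq: number;
  action: string;
}

// Hypothetical stand-in for a segment file, identified by its first seq.
interface Segment {
  firstSeq: number;
  records: JournalRecord[];
}

// The cursor stores only the highest seq already consumed, never a byte offset.
interface Cursor {
  watermark: number;
}

function readSince(
  segments: Segment[],
  cursor: Cursor,
): { records: JournalRecord[]; cursor: Cursor } {
  const ordered = [...segments].sort((a, b) => a.firstSeq - b.firstSeq);
  const records: JournalRecord[] = [];
  let watermark = cursor.watermark;
  for (const segment of ordered) {
    for (const record of segment.records) {
      if (record.seq > cursor.watermark) {
        records.push(record);
        watermark = Math.max(watermark, record.seq);
      }
    }
  }
  return { records, cursor: { watermark } };
}
```

If a segment is later rewritten (for example by an audited redaction that drops a record), a stored `{ watermark }` still selects exactly the records with a higher seq, whereas a `{segment_id, offset}` pair would point into bytes that have moved.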
+
+ ## Appendix B — Memory citations (union of rounds 1–2)
+
+ trp_d5595086 (silent-loss-via-swallow → loud appends, doctor-visible skips,
+ tombstone deletion authority, never-regress guard);
+ feedback_lazy_reconcile_pattern / pln#496 (read-path reconciliation, no
+ daemon); trp_e85e9fbe (dual-platform CI gates, Windows/POSIX divergence
+ discipline); trp_26e9634b (missing-store failure mode); trp_09988deb
+ (upgrade-style backups); feedback_no_init_force + park-don't-delete house
+ rule (retention, quarantine, archives, rollback);
+ federation_architecture_decisions + cross_project_signaling_vs_execution
+ (Pull-and-Materialize substrate, signaling-only foreign writes, no daemon);
+ feedback_bisect_state_before_code (doctor counters over silent skips);
+ feedback_ideation_loop_single_agent_method (multi-instance multi-round
+ method that produced this spec).