@romiluz/clawmongo 2026.3.22 → 2026.3.23

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,112 @@
1
+ # ClawMongo vs OpenClaw Default Memory
2
+
3
+ ## Overview
4
+
5
+ OpenClaw ships with QMD (SQLite + Markdown files) as its default memory backend. ClawMongo replaces this with MongoDB Community + mongot + Voyage AI. This page compares the two approaches feature by feature to help you decide which is right for your workload.
6
+
7
+ Both are valid choices. The default memory is simpler to set up. ClawMongo is more capable at scale. This is not a judgment -- it is a tradeoff.
8
+
9
+ ---
10
+
11
+ ## Feature Comparison
12
+
13
+ | Capability | OpenClaw Default (QMD/SQLite) | ClawMongo (MongoDB) |
14
+ |---|---|---|
15
+ | **Storage backend** | SQLite file + Markdown files | MongoDB Community (replica set) |
16
+ | **Vector search** | sqlite-vec extension or LanceDB | mongot + Voyage AI autoEmbed (`voyage-4-large`) |
17
+ | **Embedding management** | Application-side (OpenAI, Gemini, Voyage, Mistral, Ollama) | Automated via mongot (zero application-side code) |
18
+ | **Full-text search** | SQLite FTS5 / BM25 | mongot text indexes (Lucene standard analyzer) |
19
+ | **Hybrid search** | BM25 + vector with MMR diversity | `$rankFusion` / `$scoreFusion` + manual RRF fallback |
20
+ | **Knowledge graph** | None | `$graphLookup` with entities, relations, bi-directional expansion |
21
+ | **Episodes** | None | Auto-materialized from event windows (daily, weekly, thread, topic, decision) |
22
+ | **Event sourcing** | None (append-only Markdown) | Canonical events collection with derived projections |
23
+ | **Structured memory** | Basic key-value facts | Salience, temporal validity, state, provenance, revision tracking |
24
+ | **Procedures** | None | Versioned workflow artifacts with intent tags and ordered steps |
25
+ | **Retrieval paths** | 1 (search) | 8 (active-critical, procedural, structured, raw-window, graph, episodic, kb, hybrid) |
26
+ | **Retrieval planning** | Simple search dispatch | Pure-function planner scoring paths based on query analysis |
27
+ | **Reranking** | MMR (Maximal Marginal Relevance) | Source diversity penalty + episode boost + deduplication |
28
+ | **Schema validation** | None | JSON Schema (`$jsonSchema`) on 17 collections |
29
+ | **Multi-tenant isolation** | Filesystem separation | Compound indexes with `agentId` prefix |
30
+ | **Cross-instance sync** | File sync (rsync, git) | MongoDB replica set + change streams |
31
+ | **Operational visibility** | Limited | Ingest runs, projection runs, relevance telemetry (3 collections) |
32
+ | **Data model** | Flat files + SQLite rows | 20 collections, 53 standard indexes, up to 8 search indexes |
33
+ | **Entity extraction** | None | Rule-based from conversations (@mentions, #tags, URLs, paths, quoted names) |
34
+ | **Graph traversal** | None | `$graphLookup` with `restrictSearchWithMatch` for tenant isolation |
35
+ | **Memory lifecycle** | Manual | TTL indexes for caches/telemetry, consolidation lifecycle for events |
36
+ | **Write idempotency** | File overwrites | `$setOnInsert` + `$set` on unique compound keys |
37
+ | **Diagnostic tools** | Limited | `memory relevance *` CLI with explain-driven telemetry |
38
+
39
+ ---
40
+
41
+ ## When to Use OpenClaw Default Memory
42
+
43
+ The default memory backend is the right choice when:
44
+
45
+ - **Single user, small corpus**: You are one person with a modest amount of conversation history and notes. SQLite handles this well.
46
+ - **Local-only deployment**: You run OpenClaw on one machine with no need for multi-instance sync. File-based storage is simpler.
47
+ - **No operational requirements**: You do not need retrieval diagnostics, schema validation, or audit trails on your agent's memory.
48
+ - **Minimal setup time**: You want to start using OpenClaw in 2 minutes without setting up MongoDB.
49
+ - **No knowledge graph needs**: Your use case does not require entity relationships or graph traversal.
50
+
51
+ The default memory works well for personal note-taking assistants, simple Q&A bots, and single-machine setups where retrieval quality is not a primary concern.
52
+
53
+ ---
54
+
55
+ ## When to Use ClawMongo
56
+
57
+ ClawMongo is the right choice when:
58
+
59
+ - **Team-scale knowledge**: Your agent processes enough data that flat-file search quality degrades. MongoDB indexes maintain performance as the corpus grows.
60
+ - **Retrieval quality SLOs**: You need to measure and maintain retrieval quality over time. Relevance telemetry gives you the data to do this.
61
+ - **Production deployment**: You need schema validation, idempotent writes, and operational visibility -- guarantees that production systems depend on.
62
+ - **Multi-instance sync**: You run multiple gateway instances that need to share memory state. MongoDB replica sets handle this natively.
63
+ - **Knowledge graph**: Your agent needs to understand relationships between entities (people, projects, topics) and traverse them during retrieval.
64
+ - **Episode materialization**: You want conversation threads automatically summarized into searchable episodes.
65
+ - **Auditable memory**: You need to know what was written, when, and why. Event-sourcing provides a complete audit trail.
66
+ - **MongoDB expertise**: Your team already operates MongoDB and wants to use familiar tooling, monitoring, and backup infrastructure.
67
+
68
+ ---
69
+
70
+ ## What You Keep Either Way
71
+
72
+ Both configurations give you the full OpenClaw platform:
73
+
74
+ - 22 messaging channels (WhatsApp, Telegram, Slack, Discord, and 18 more)
75
+ - 78 extensions (25+ LLM providers, tools, media, infra)
76
+ - Companion apps (macOS, iOS, Android)
77
+ - Voice Wake + Talk Mode
78
+ - Live Canvas + A2UI
79
+ - Browser control
80
+ - Skills platform
81
+ - Gateway control plane
82
+
83
+ The memory backend is a pluggable layer. Switching from default to ClawMongo changes how your agent stores and retrieves memory. It does not change how the agent communicates, reasons, or uses tools.
84
+
85
+ ---
86
+
87
+ ## Migration Path
88
+
89
+ ClawMongo includes a built-in migration function (`backfillEventsFromChunks`) that reads existing chunk data and creates canonical events from it. This allows existing OpenClaw users to migrate their conversation history into ClawMongo's event-sourced model.
90
+
91
+ Steps:
92
+
93
+ 1. Install ClawMongo: `npm install -g @romiluz/clawmongo@latest`
94
+ 2. Configure MongoDB connection: `clawmongo config set memory.mongodb.uri "mongodb://..."`
95
+ 3. Run onboarding: `clawmongo onboard --install-daemon`
96
+ 4. The migration runs automatically on first startup when it detects existing chunk data without corresponding events.
97
+
98
+ After migration, ClawMongo uses MongoDB as the sole memory backend. The original Markdown/SQLite files remain untouched as a backup but are no longer read by the runtime.
99
+
100
+ ---
101
+
102
+ ## Numbers at a Glance
103
+
104
+ | Metric | OpenClaw Default | ClawMongo |
105
+ |--------|-----------------|-----------|
106
+ | Collections | ~2 (SQLite tables) | 20 |
107
+ | Indexes | Few | 53 standard + up to 8 search |
108
+ | Retrieval paths | 1 | 8 |
109
+ | Schema-validated collections | 0 | 17 |
110
+ | Unit tests (memory module) | Varies | 573 |
111
+ | v2 memory unit tests | N/A | 205 |
112
+ | Live e2e tests | N/A | 53 (MongoDB 8.2 + Voyage AI) |
@@ -0,0 +1,548 @@
1
+ # MongoDB Capabilities in ClawMongo
2
+
3
+ ClawMongo uses 12 MongoDB capabilities that together make MongoDB the best agentic data layer. This page explains **why** each capability matters for agent memory and **how** ClawMongo implements it.
4
+
5
+ ---
6
+
7
+ ## Table of Contents
8
+
9
+ 1. [Automated Embeddings (Voyage AI via mongot)](#1-automated-embeddings-voyage-ai-via-mongot)
10
+ 2. [Vector Search ($vectorSearch)](#2-vector-search-vectorsearch)
11
+ 3. [Full-Text Search (mongot)](#3-full-text-search-mongot)
12
+ 4. [Hybrid Search ($rankFusion / $scoreFusion)](#4-hybrid-search-rankfusion--scorefusion)
13
+ 5. [Knowledge Graph ($graphLookup)](#5-knowledge-graph-graphlookup)
14
+ 6. [Event-Sourcing (Canonical Events)](#6-event-sourcing-canonical-events)
15
+ 7. [Schema Validation (JSON Schema)](#7-schema-validation-json-schema)
16
+ 8. [Change Streams](#8-change-streams)
17
+ 9. [TTL Indexes](#9-ttl-indexes)
18
+ 10. [Multi-Tenant Isolation](#10-multi-tenant-isolation)
19
+ 11. [Idempotent Upserts](#11-idempotent-upserts)
20
+ 12. [Relevance Telemetry](#12-relevance-telemetry)
21
+
22
+ ---
23
+
24
+ ## 1. Automated Embeddings (Voyage AI via mongot)
25
+
26
+ ### Why This Matters
27
+
28
+ Most agent memory systems require you to manage an embedding pipeline: choose an embedding model, run it in your application code, handle batching, retries, model version upgrades, and dimension mismatches. This is an entire infrastructure category that has nothing to do with building an AI assistant.
29
+
30
+ With mongot's `autoEmbed`, ClawMongo eliminates the embedding pipeline entirely. mongot reads the `text` field from your documents, calls the Voyage AI API to generate embeddings at index time, and does the same at query time. Your application code never touches embeddings. When Voyage releases a better model, you update the index definition -- not your application.
31
+
32
+ ### How It Works
33
+
34
+ mongot delegates to the Voyage AI API using the `autoEmbed` field type in vector search index definitions. At index time, mongot reads the text field from each document and sends it to `voyage-4-large` for embedding. At query time, `$vectorSearch` sends the query text to mongot, which embeds it and runs approximate nearest neighbor (ANN) search.
35
+
36
+ Configuration: set `memory.mongodb.embeddingMode = "automated"` in ClawMongo config. Provide a Voyage AI API key in your mongot configuration.
37
+
38
+ ### Configuration Example
39
+
40
+ ```json5
41
+ // Vector search index definition (autoEmbed)
42
+ {
43
+ type: "vectorSearch",
44
+ definition: {
45
+ fields: [
46
+ {
47
+ type: "autoEmbed",
48
+ modality: "text",
49
+ path: "text",
50
+ model: "voyage-4-large" // 1024 dimensions
51
+ },
52
+ { type: "filter", path: "source" },
53
+ { type: "filter", path: "agentId" },
54
+ { type: "filter", path: "scope" }
55
+ ]
56
+ }
57
+ }
58
+ ```
59
+
60
+ ### Collections Using This
61
+
62
+ - `chunks` (conversation memory)
63
+ - `kb_chunks` (knowledge base)
64
+ - `structured_mem` (structured facts)
65
+ - `procedures` (workflow artifacts)
66
+
67
+ ---
68
+
69
+ ## 2. Vector Search ($vectorSearch)
70
+
71
+ ### Why This Matters
72
+
73
+ Agent memory retrieval is fundamentally a semantic matching problem. When a user asks "what was that project we discussed last week," the agent needs to find relevant conversations by meaning, not by keyword overlap. Vector search enables this by comparing the mathematical similarity of text embeddings.
74
+
75
+ Without vector search, agent memory is limited to exact keyword matches. With it, the agent can find semantically related content even when the query uses different words than the stored text.
76
+
77
+ ### How It Works
78
+
79
+ ClawMongo uses `$vectorSearch` as a first-stage aggregation pipeline operator. With autoEmbed enabled, you pass the query as text and mongot handles embedding:
80
+
81
+ ```javascript
82
+ db.chunks.aggregate([
83
+ {
84
+ $vectorSearch: {
85
+ index: "chunks_vector",
86
+ query: { text: "what project did we discuss" },
87
+ path: "text",
88
+ numCandidates: 100,
89
+ limit: 10,
90
+ filter: { agentId: "agent-123", scope: "agent" }
91
+ }
92
+ },
93
+ { $project: { text: 1, score: { $meta: "vectorSearchScore" } } }
94
+ ])
95
+ ```
96
+
97
+ The HNSW index provides sub-linear search time. Filter fields (`agentId`, `scope`, `scopeRef`, `source`, `path`) are pre-filtered before vector comparison, so multi-tenant queries remain fast.
98
+
99
+ ### Collections Using This
100
+
101
+ - `chunks` (conversation memory vector search)
102
+ - `kb_chunks` (knowledge base vector search)
103
+ - `structured_mem` (structured fact semantic search)
104
+ - `procedures` (workflow artifact search)
105
+
106
+ ---
107
+
108
+ ## 3. Full-Text Search (mongot)
109
+
110
+ ### Why This Matters
111
+
112
+ Vector search excels at semantic similarity but can miss exact terms. When a user says "find the error about ECONNREFUSED," they want keyword-exact matches, not semantically similar errors. Full-text search with mongot handles this case with Lucene-powered text indexes.
113
+
114
+ Agent memory needs both search modes. Semantic search finds related content; text search finds exact content. Using only one leaves blind spots.
115
+
116
+ ### How It Works
117
+
118
+ ClawMongo creates mongot text search indexes alongside vector indexes. The text indexes use Lucene's standard analyzer for tokenization and relevance scoring:
119
+
120
+ ```json5
121
+ // Text search index definition
122
+ {
123
+ type: "search",
124
+ definition: {
125
+ mappings: {
126
+ dynamic: false,
127
+ fields: {
128
+ text: { type: "string", analyzer: "lucene.standard" },
129
+ source: { type: "token" },
130
+ agentId: { type: "token" },
131
+ scope: { type: "token" }
132
+ }
133
+ }
134
+ }
135
+ }
136
+ ```
137
+
138
+ ClawMongo also maintains standard MongoDB `$text` indexes as a defensive fallback for when mongot is unavailable. The `buildOrJoinFtsQuery` function constructs OR-join queries (not AND-join) to maximize recall on natural language queries.
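As a concrete illustration, the OR-join construction can be sketched as below. `buildOrJoinFtsQuery` is named above, but this body is an assumption, not ClawMongo's actual implementation; it relies on the fact that MongoDB's `$text` treats space-separated terms as OR by default.

```javascript
// Hypothetical sketch of buildOrJoinFtsQuery (the real implementation is not
// shown in this doc). $text ORs space-separated terms by default, so an
// OR-join query is just the distinct lowercased terms joined by spaces.
function buildOrJoinFtsQuery(query) {
  const terms = [...new Set(query.toLowerCase().split(/\s+/).filter(Boolean))];
  return { $text: { $search: terms.join(" ") } };
}
```

For example, `buildOrJoinFtsQuery("find the ECONNREFUSED error")` yields `{ $text: { $search: "find the econnrefused error" } }`, which matches documents containing any of the terms rather than requiring all of them.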
139
+
140
+ ### Collections Using This
141
+
142
+ - `chunks` (conversation text search)
143
+ - `kb_chunks` (knowledge base text search)
144
+ - `structured_mem` (structured fact text search)
145
+ - `procedures` (workflow text search)
146
+ - `entities` (entity name + alias text search)
147
+ - `episodes` (episode title + summary text search)
148
+
149
+ ---
150
+
151
+ ## 4. Hybrid Search ($rankFusion / $scoreFusion)
152
+
153
+ ### Why This Matters
154
+
155
+ Neither vector search nor full-text search alone is sufficient for agent memory. Vector search misses exact terms. Text search misses semantic similarity. Hybrid search fuses results from both to maximize recall.
156
+
157
+ The challenge is scoring: vector scores (cosine similarity) and text scores (BM25/Lucene) live on different scales. Naive combination penalizes one type. Reciprocal Rank Fusion (RRF) solves this by ranking results by their position in each list, not by raw score.
158
+
159
+ ### How It Works
160
+
161
+ On MongoDB 8.1+, ClawMongo uses `$rankFusion` to combine vector and text pipelines in a single aggregation. On MongoDB 8.2+, `$scoreFusion` is also available for score-based combination with normalization.
162
+
163
+ For servers where these stages are unavailable, ClawMongo falls back to manual RRF: run both searches independently, compute `1 / (k + rank)` for each result in each list, sum scores for results appearing in both, and sort by combined score.
164
+
165
+ ```javascript
166
+ // Manual RRF fallback (simplified)
167
+ function rrfScore(rank, k = 60) {
168
+ return 1 / (k + rank);
169
+ }
170
+ // Results appearing in both vector and text lists get summed RRF scores
171
+ ```
172
+
173
+ Score normalization happens per search method: vector scores are clamped to [0, 1], text scores are min-max normalized within their result set. The `normalizeSearchResults` function in `mongodb-hybrid.ts` handles this.
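Putting the pieces together, the manual fallback fusion over two ranked id lists can be sketched as follows. `fuseRrf` is an illustrative name, not ClawMongo's internal API:

```javascript
// Illustrative manual RRF fusion over two ranked result-id lists
// (ClawMongo's actual internals live in mongodb-hybrid.ts).
function fuseRrf(vectorIds, textIds, k = 60) {
  const scores = new Map();
  for (const list of [vectorIds, textIds]) {
    list.forEach((id, i) => {
      // rank is 1-based; ids appearing in both lists accumulate both terms
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Because RRF uses ranks rather than raw scores, the incompatible scales of cosine similarity and BM25 never need to be reconciled directly: a result ranked highly by both methods rises to the top regardless of its absolute scores.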
174
+
175
+ ### Collections Using This
176
+
177
+ Hybrid search applies to any collection that has both a text and vector index: `chunks`, `kb_chunks`, `structured_mem`, `procedures`.
178
+
179
+ ---
180
+
181
+ ## 5. Knowledge Graph ($graphLookup)
182
+
183
+ ### Why This Matters
184
+
185
+ Flat retrieval (search and return snippets) loses the relationships between entities. When a user asks "what does Alice work on," the agent needs to traverse a graph: find Alice (entity), follow her `works_on` edges (relations), and return the connected projects (entities). This cannot be done with search alone.
186
+
187
+ Most agent memory systems either skip graph traversal entirely or bolt on an external graph database (Neo4j, etc.). ClawMongo uses MongoDB's native `$graphLookup` stage, so the knowledge graph lives in the same database as everything else.
188
+
189
+ ### How It Works
190
+
191
+ Entities are extracted from conversation events using rule-based regex patterns (5 types: @mentions, #tags, URLs, file paths, "Quoted Names"). Relations link entities with typed edges (`works_on`, `owns`, `depends_on`, `blocked_by`, `decided`, `mentioned_with`, `reported_by`, `related_to`) and carry weight + confidence scores.
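A hedged sketch of what rule-based extraction for those five pattern types can look like -- the exact regexes ClawMongo uses are not shown in this doc, so these are illustrative:

```javascript
// Illustrative patterns for the five entity types named above; not the
// actual regexes from ClawMongo's extractor.
const ENTITY_PATTERNS = {
  mention: /@([A-Za-z0-9_]+)/g,        // @mentions
  tag: /#([A-Za-z0-9_-]+)/g,           // #tags
  url: /https?:\/\/[^\s)]+/g,          // URLs
  path: /(?:^|\s)(\/[\w./-]+)/g,       // file paths
  quoted: /"([^"]{2,60})"/g,           // "Quoted Names"
};

function extractEntities(text) {
  const found = [];
  for (const [type, re] of Object.entries(ENTITY_PATTERNS)) {
    for (const m of text.matchAll(re)) {
      // use the capture group when the pattern has one, else the full match
      found.push({ type, name: (m[1] ?? m[0]).trim() });
    }
  }
  return found;
}
```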
192
+
193
+ Graph traversal uses `$graphLookup` with optional bi-directional expansion via `$facet`:
194
+
195
+ ```javascript
196
+ // Bi-directional graph expansion
197
+ db.relations.aggregate([
198
+ { $match: { agentId: "agent-123", scope: "agent" } },
199
+ {
200
+ $facet: {
201
+ outbound: [
202
+ { $match: { fromEntityId: rootEntityId } },
203
+ {
204
+ $graphLookup: {
205
+ from: "relations",
206
+ startWith: "$toEntityId",
207
+ connectFromField: "toEntityId",
208
+ connectToField: "fromEntityId",
209
+ as: "connections",
210
+ maxDepth: 1,
211
+ restrictSearchWithMatch: { agentId: "agent-123" }
212
+ }
213
+ }
214
+ ],
215
+ inbound: [
216
+ { $match: { toEntityId: rootEntityId } },
217
+ {
218
+ $graphLookup: {
219
+ from: "relations",
220
+ startWith: "$fromEntityId",
221
+ connectFromField: "fromEntityId",
222
+ connectToField: "toEntityId",
223
+ as: "connections",
224
+ maxDepth: 1,
225
+ restrictSearchWithMatch: { agentId: "agent-123" }
226
+ }
227
+ }
228
+ ]
229
+ }
230
+ }
231
+ ])
232
+ ```
233
+
234
+ The `restrictSearchWithMatch` parameter is essential for multi-tenant isolation -- it ensures `$graphLookup` never traverses into another agent's data.
235
+
236
+ ### Collections Using This
237
+
238
+ - `entities` (nodes: people, projects, topics, concepts)
239
+ - `relations` (edges: typed, weighted, directional)
240
+ - `entity_links` (entity resolution: confirmed same, candidate same, related mention)
241
+
242
+ ---
243
+
244
+ ## 6. Event-Sourcing (Canonical Events)
245
+
246
+ ### Why This Matters
247
+
248
+ Most agent memory systems write directly to their storage format (chunks, embeddings, key-value pairs). This means there is no audit trail, no way to replay history, and no way to derive new representations from old conversations.
249
+
250
+ ClawMongo uses event-sourcing: every inbound message, tool output, and system event is written to a canonical `events` collection first. Chunks, entities, relations, and episodes are all derived projections from these events. If you want to re-index, re-extract entities, or add a new derived representation, you replay the events -- you never lose data.
251
+
252
+ ### How It Works
253
+
254
+ The write path goes through `writeEventAndProject()`:
255
+
256
+ 1. Write the canonical event to the `events` collection using `$setOnInsert` (idempotent)
257
+ 2. Project a chunk from the event into the `chunks` collection
258
+ 3. Record an ingest run in `ingest_runs`
259
+ 4. Fire-and-forget entity extraction via `extractAndUpsertEntities()`
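The four steps above can be sketched as a single function. Only `writeEventAndProject` and `extractAndUpsertEntities` are named in this doc; the other helper names are illustrative stubs, not ClawMongo's actual internals:

```javascript
// Hedged sketch of the write path; deps bundles illustrative stand-ins
// for the real collaborators.
async function writeEventAndProject(deps, event) {
  await deps.insertEventIdempotent(event);   // 1. canonical event ($setOnInsert)
  await deps.projectChunk(event);            // 2. derive a searchable chunk
  await deps.recordIngestRun(event.eventId); // 3. audit trail in ingest_runs
  // 4. fire-and-forget: extraction failures must not block the write path
  deps.extractAndUpsertEntities(event).catch(() => {});
}
```

The ordering matters: the canonical event is durable before any projection is attempted, so a crash mid-write leaves a replayable event rather than an orphaned chunk.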
260
+
261
+ ```javascript
262
+ // Canonical event structure
263
+ {
264
+ eventId: "uuid", // unique identifier
265
+ agentId: "agent-123", // agent isolation
266
+ role: "user", // user | assistant | system | tool
267
+ body: "message text", // the content
268
+ scope: "agent", // memory scope
269
+ scopeRef: "agent-123", // resolved scope namespace
270
+ timestamp: ISODate(), // when it happened
271
+ sessionId: "session-abc", // optional session context
272
+ projectedAt: ISODate(), // when chunks were derived
273
+ consolidatedAt: ISODate(), // when rolled into an episode
274
+ consolidatedIntoEpisodeId: "episode-xyz"
275
+ }
276
+ ```
277
+
278
+ Events are append-only for the canonical fields. The `projectedAt` and `consolidatedAt` fields are updated when derived projections are created, providing a full lifecycle audit trail.
279
+
280
+ ### Collections Using This
281
+
282
+ - `events` (canonical source of truth)
283
+ - `chunks` (derived: searchable text fragments)
284
+ - `entities`, `relations` (derived: knowledge graph)
285
+ - `episodes` (derived: consolidated summaries)
286
+ - `ingest_runs` (audit: what was ingested and when)
287
+ - `projection_runs` (audit: what was derived and when)
288
+
289
+ ---
290
+
291
+ ## 7. Schema Validation (JSON Schema)
292
+
293
+ ### Why This Matters
294
+
295
+ Agent memory is written by both application code and LLM-generated tool calls. LLMs produce unpredictable output. Without schema validation, a malformed memory write silently corrupts the database. With it, the write fails fast and the application can handle the error.
296
+
297
+ ClawMongo applies JSON Schema validation (`$jsonSchema`) to 17 collections with `validationAction: "error"`, so invalid documents are rejected at write time rather than silently accepted.
298
+
299
+ ### How It Works
300
+
301
+ Validators are defined in `mongodb-schema.ts` and applied via `ensureSchemaValidation()` on every startup. The `collMod` command applies the schema idempotently.
302
+
303
+ ```javascript
304
+ // Example: events collection schema
305
+ {
306
+ $jsonSchema: {
307
+ bsonType: "object",
308
+ required: ["eventId", "agentId", "role", "body", "scope", "scopeRef", "timestamp"],
309
+ properties: {
310
+ eventId: { bsonType: "string" },
311
+ agentId: { bsonType: "string" },
312
+ role: { enum: ["user", "assistant", "system", "tool"] },
313
+ body: { bsonType: "string" },
314
+ scope: { enum: ["session", "user", "agent", "workspace", "tenant", "global"] },
315
+ scopeRef: { bsonType: "string" },
316
+ timestamp: { bsonType: "date" }
317
+ }
318
+ }
319
+ }
320
+ ```
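Applying a validator like the one above is a small, idempotent operation. A minimal sketch, assuming the Node.js driver's `db.command` (`ensureSchemaValidation` is named earlier; this body is an assumption, not the actual code):

```javascript
// Hedged sketch of idempotent validator application via collMod.
async function ensureSchemaValidation(db, collectionName, jsonSchema) {
  await db.command({
    collMod: collectionName,
    validator: { $jsonSchema: jsonSchema },
    validationAction: "error", // reject invalid writes instead of logging
    validationLevel: "strict",
  });
}
```

Because `collMod` replaces the validator wholesale, running this on every startup converges to the schema defined in code, which is what makes the startup-time application safe.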
321
+
322
+ Schema validation catches common LLM mistakes: missing required fields, wrong types (string instead of date), invalid enum values (a `role` of `"human"` instead of `"user"`), and negative numbers where minimums are defined.
323
+
324
+ ### Collections Using This
325
+
326
+ All 17 validated collections: `chunks`, `knowledge_base`, `kb_chunks`, `structured_mem`, `structured_mem_revisions`, `procedures`, `procedure_revisions`, `relevance_runs`, `relevance_artifacts`, `relevance_regressions`, `events`, `entities`, `relations`, `entity_links`, `episodes`, `ingest_runs`, `projection_runs`.
327
+
328
+ ---
329
+
330
+ ## 8. Change Streams
331
+
332
+ ### Why This Matters
333
+
334
+ When multiple gateway instances or agents share the same MongoDB backend, they need to stay in sync. Without change streams, each instance would need to poll for changes, introducing latency and wasted queries.
335
+
336
+ MongoDB change streams provide real-time notification of document changes. ClawMongo uses them to sync memory state across gateway instances without polling.
337
+
338
+ ### How It Works
339
+
340
+ The `MongoDBChangeStreamWatcher` class opens a change stream on memory collections and emits events when documents are inserted, updated, or deleted. Resume tokens are stored in the `meta` collection so watchers can resume after gateway restarts without missing changes.
341
+
342
+ ```typescript
343
+ // Change stream setup (simplified)
344
+ const changeStream = collection.watch([], {
345
+ fullDocument: "updateLookup",
346
+ resumeAfter: savedResumeToken
347
+ });
348
+
349
+ changeStream.on("change", (event) => {
350
+ // Notify local cache / other subsystems
351
+ // Persist resume token for crash recovery
352
+ });
353
+ ```
354
+
355
+ Change streams require a replica set. For single-node development, ClawMongo falls back to periodic sync as a degraded but functional alternative.
356
+
357
+ ### Collections Using This
358
+
359
+ Change streams are opened on the primary memory collections (`chunks`, `structured_mem`, `events`). The resume token is stored in the `meta` collection.
360
+
361
+ ---
362
+
363
+ ## 9. TTL Indexes
364
+
365
+ ### Why This Matters
366
+
367
+ Not all data in an agent's memory should live forever. Embedding caches become stale when the model changes. Relevance telemetry data is useful for diagnostics but should not accumulate indefinitely. Without automatic expiration, operators must build and maintain cleanup jobs.
368
+
369
+ MongoDB TTL indexes handle this automatically: documents are deleted when their timestamp field exceeds the configured `expireAfterSeconds`.
370
+
371
+ ### How It Works
372
+
373
+ ClawMongo creates TTL indexes on three collection types:
374
+
375
+ ```javascript
376
+ // Embedding cache: expire after N days
377
+ await embeddingCache.createIndex(
378
+ { updatedAt: 1 },
379
+ { name: "idx_cache_ttl", expireAfterSeconds: days * 86400 }
380
+ );
381
+
382
+ // Relevance telemetry: expire after retention period
383
+ await relevanceRuns.createIndex(
384
+ { ts: 1 },
385
+ { name: "idx_relruns_ttl", expireAfterSeconds: retentionDays * 86400 }
386
+ );
387
+ ```
388
+
389
+ TTL is configurable per collection type and defaults to disabled for canonical data (events, chunks, structured memory) to prevent accidental data loss.
390
+
391
+ ### Collections Using This
392
+
393
+ - `embedding_cache` (configurable via `embeddingCacheTtlDays`)
394
+ - `relevance_runs` (configurable via `relevanceRetentionDays`)
395
+ - `relevance_artifacts` (follows same retention as runs)
396
+ - `files` (optional, configurable via `memoryTtlDays` -- use with caution)
397
+
398
+ ---
399
+
400
+ ## 10. Multi-Tenant Isolation
401
+
402
+ ### Why This Matters
403
+
404
+ A single ClawMongo deployment can serve multiple agents, each with their own memory, entities, and episodes. Agent A must never see Agent B's data, even when they share the same database and collections.
405
+
406
+ Most agent memory systems achieve isolation through separate databases or file paths. ClawMongo uses compound indexes with `agentId` as the leading field, so all queries are scoped by agent at the index level.
407
+
408
+ ### How It Works
409
+
410
+ Every collection uses `agentId` as the first field in its compound indexes:
411
+
412
+ ```javascript
413
+ // Events: scoped by agent + scope + time
414
+ { agentId: 1, scope: 1, scopeRef: 1, timestamp: -1 }
415
+
416
+ // Entities: scoped by agent + scope + type + name
417
+ { agentId: 1, scope: 1, scopeRef: 1, type: 1, name: 1 }
418
+
419
+ // Structured memory: unique per agent + scope + type + key
420
+ { agentId: 1, scope: 1, scopeRef: 1, type: 1, key: 1 } // unique: true
421
+ ```
422
+
423
+ For `$graphLookup`, the `restrictSearchWithMatch` parameter ensures graph traversal stays within the agent's data boundary. Without this, a graph expansion could follow edges into another agent's entities.
424
+
425
+ The `VALID_SCOPES` and `VALID_ROLES` ReadonlySet patterns validate scope and role parameters before they reach MongoDB, preventing injection of unexpected values.
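A minimal sketch of that validation pattern -- the set contents mirror the enums in the schema validation section, and `assertScope` / `assertRole` are illustrative names:

```javascript
// Illustrative parameter validation; the source describes VALID_SCOPES and
// VALID_ROLES as ReadonlySets, so plain Sets stand in here.
const VALID_SCOPES = new Set(["session", "user", "agent", "workspace", "tenant", "global"]);
const VALID_ROLES = new Set(["user", "assistant", "system", "tool"]);

function assertScope(scope) {
  if (!VALID_SCOPES.has(scope)) throw new Error(`invalid scope: ${scope}`);
  return scope;
}

function assertRole(role) {
  if (!VALID_ROLES.has(role)) throw new Error(`invalid role: ${role}`);
  return role;
}
```

Rejecting unexpected values before they reach a query means a bad scope fails loudly at the call site instead of silently matching zero documents (or, worse, the wrong ones).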
426
+
427
+ ### Collections Using This
428
+
429
+ All 20 collections use `agentId`-prefixed indexes. The `scope` and `scopeRef` fields add a second level of isolation within each agent (session, user, workspace, tenant, global).
430
+
431
+ ---
432
+
433
+ ## 11. Idempotent Upserts
434
+
435
+ ### Why This Matters
436
+
437
+ Network retries, webhook replays, and concurrent writes are facts of life in distributed systems. If writing the same event twice creates two documents, the agent's memory is corrupted. Idempotent upserts ensure that retrying a write produces the same result as the original write.
438
+
439
+ ### How It Works
440
+
441
+ ClawMongo separates creation-time fields from mutable fields using MongoDB's `$setOnInsert` and `$set` operators on unique compound keys:
442
+
443
+ ```javascript
444
+ // Writing a canonical event (idempotent)
445
+ await events.updateOne(
446
+ { eventId: eventId }, // unique key
447
+ {
448
+ $setOnInsert: { // only set on first insert
449
+ eventId,
450
+ agentId,
451
+ role,
452
+ body,
453
+ scope,
454
+ scopeRef,
455
+ timestamp,
456
+ createdAt: new Date()
457
+ },
458
+ $set: { // update on every upsert
459
+ updatedAt: new Date()
460
+ }
461
+ },
462
+ { upsert: true }
463
+ );
464
+ ```
465
+
466
+ This pattern is used consistently across all collections:
467
+ - `eventId` for events (unique: true)
468
+ - `episodeId` for episodes (unique: true)
469
+ - `{agentId, scope, scopeRef, type, key}` for structured memory (unique: true)
470
+ - `{procedureId, agentId, scope, scopeRef}` for procedures (unique: true)
471
+ - `{fromEntityId, toEntityId, linkType}` + agent scope for entity links (unique: true)
472
+
473
+ ### Collections Using This
474
+
475
+ All collections with unique compound keys: `events`, `episodes`, `structured_mem`, `procedures`, `entity_links`, `embedding_cache`, `knowledge_base`, `kb_chunks`.
476
+
477
+ ---
478
+
479
+ ## 12. Relevance Telemetry
480
+
481
+ ### Why This Matters
482
+
483
+ Retrieval quality degrades silently. Without measurement, you cannot tell whether your agent is finding the right memories or returning irrelevant noise. Most agent memory systems provide no visibility into retrieval quality.
484
+
485
+ ClawMongo includes explain-driven diagnostics that record what happened during each retrieval: which search methods were used, what scores came back, how long the search took, and whether the results meet quality thresholds.
486
+
487
+ ### How It Works
488
+
489
+ The `MongoDBRelevanceRuntime` class records telemetry for each retrieval operation:
490
+
491
+ - **Relevance runs** (`relevance_runs`): per-query metadata including agent ID, timestamp, query hash, source scope, latency, hit sources, fallback path, and status (`ok`, `degraded`, `insufficient-data`)
492
+ - **Relevance artifacts** (`relevance_artifacts`): detailed explain output from search, vector, and fusion stages. Includes raw `explain()` output for deep debugging.
493
+ - **Relevance regressions** (`relevance_regressions`): detected quality drops with severity levels, baseline vs current metrics, delta, and failing cases.
494
+
495
+ ```javascript
496
+ // Relevance run record
497
+ {
498
+ runId: "uuid",
499
+ agentId: "agent-123",
500
+ ts: ISODate(),
501
+ queryHash: "sha256-of-query",
502
+ sourceScope: "all",
503
+ latencyMs: 142,
504
+ topK: 10,
505
+ hitSources: ["chunks", "structured_mem"],
506
+ fallbackPath: null,
507
+ status: "ok"
508
+ }
509
+ ```
510
+
511
+ The CLI surface (`memory relevance *`) exposes this data for operators. Sampling rate is configurable to control storage overhead.
512
+
513
+ ### Collections Using This
514
+
515
+ - `relevance_runs` (query-level telemetry)
516
+ - `relevance_artifacts` (detailed explain output)
517
+ - `relevance_regressions` (quality regression detection)
518
+
519
+ ---
520
+
521
+ ## The Full Picture
522
+
523
+ All 12 capabilities work together in a single query/write cycle:
524
+
525
+ ```text
526
+ Write path:
527
+ message arrives
528
+ -> Schema Validation rejects malformed input [7]
529
+ -> Event-Sourcing writes canonical event [6]
530
+ -> Idempotent Upsert prevents duplicate writes [11]
531
+ -> Automated Embeddings index the text [1]
532
+ -> Entity extraction builds Knowledge Graph [5]
533
+ -> Episode triggers consolidate event windows [6]
534
+ -> Multi-Tenant Isolation scopes everything by agent [10]
535
+ -> Change Streams notify other gateway instances [8]
536
+
537
+ Read path:
538
+ query arrives
539
+ -> Retrieval planner selects paths
540
+ -> Vector Search finds semantically similar content [2]
541
+ -> Full-Text Search finds exact keyword matches [3]
542
+ -> Hybrid Search fuses both result sets [4]
543
+ -> Knowledge Graph traverses entity relationships [5]
544
+ -> Relevance Telemetry records what happened [12]
545
+ -> TTL Indexes keep caches and telemetry bounded [9]
546
+ ```
547
+
548
+ This is why MongoDB is the best agentic data layer: one database, one operational surface, 12 capabilities that would otherwise require 5-6 separate services.