claude_memory 0.10.0 → 0.12.0
- checksums.yaml +4 -4
- data/.claude/memory.sqlite3 +0 -0
- data/.claude/rules/claude_memory.generated.md +42 -64
- data/.claude/skills/release/SKILL.md +44 -6
- data/.claude/skills/study-repo/SKILL.md +15 -0
- data/.claude-plugin/commands/audit-memory.md +68 -0
- data/.claude-plugin/marketplace.json +1 -1
- data/.claude-plugin/plugin.json +1 -1
- data/CHANGELOG.md +70 -0
- data/CLAUDE.md +20 -5
- data/README.md +64 -2
- data/db/migrations/018_add_otel_telemetry.rb +81 -0
- data/docs/1_0_punchlist.md +522 -89
- data/docs/GETTING_STARTED.md +3 -1
- data/docs/api_stability.md +341 -0
- data/docs/architecture.md +3 -3
- data/docs/audit_runbook.md +209 -0
- data/docs/claude_monitoring.md +956 -0
- data/docs/dashboard.md +23 -3
- data/docs/improvements.md +329 -5
- data/docs/influence/ai-memory-systems-2026.md +403 -0
- data/docs/memory_audit_2026-05-21.md +303 -0
- data/docs/plugin.md +1 -1
- data/docs/quality_review.md +35 -0
- data/lib/claude_memory/audit/checks.rb +239 -0
- data/lib/claude_memory/audit/finding.rb +33 -0
- data/lib/claude_memory/audit/runner.rb +73 -0
- data/lib/claude_memory/commands/audit_command.rb +117 -0
- data/lib/claude_memory/commands/dashboard_command.rb +2 -1
- data/lib/claude_memory/commands/digest_command.rb +95 -3
- data/lib/claude_memory/commands/hook_command.rb +27 -2
- data/lib/claude_memory/commands/import_auto_memory_command.rb +180 -0
- data/lib/claude_memory/commands/initializers/hooks_configurator.rb +7 -4
- data/lib/claude_memory/commands/otel_command.rb +240 -0
- data/lib/claude_memory/commands/registry.rb +5 -1
- data/lib/claude_memory/commands/show_command.rb +90 -0
- data/lib/claude_memory/commands/stats_command.rb +94 -2
- data/lib/claude_memory/configuration.rb +60 -0
- data/lib/claude_memory/core/fact_query_builder.rb +1 -0
- data/lib/claude_memory/dashboard/api.rb +8 -0
- data/lib/claude_memory/dashboard/index.html +140 -1
- data/lib/claude_memory/dashboard/prompt_journey.rb +48 -0
- data/lib/claude_memory/dashboard/server.rb +86 -0
- data/lib/claude_memory/dashboard/telemetry.rb +156 -0
- data/lib/claude_memory/dashboard/trust.rb +180 -11
- data/lib/claude_memory/deprecations.rb +106 -0
- data/lib/claude_memory/distill/bare_conclusion_detector.rb +71 -0
- data/lib/claude_memory/distill/reference_material_detector.rb +37 -4
- data/lib/claude_memory/hook/auto_memory_mirror.rb +7 -3
- data/lib/claude_memory/hook/context_injector.rb +11 -2
- data/lib/claude_memory/hook/handler.rb +142 -1
- data/lib/claude_memory/mcp/tool_definitions.rb +3 -3
- data/lib/claude_memory/otel/attributes.rb +118 -0
- data/lib/claude_memory/otel/constants.rb +32 -0
- data/lib/claude_memory/otel/ingestor.rb +54 -0
- data/lib/claude_memory/otel/otlp_json_envelope.rb +254 -0
- data/lib/claude_memory/otel/prompt_scope.rb +108 -0
- data/lib/claude_memory/otel/settings_writer.rb +122 -0
- data/lib/claude_memory/otel/status.rb +58 -0
- data/lib/claude_memory/recall/staleness_annotator.rb +73 -0
- data/lib/claude_memory/resolve/predicate_policy.rb +17 -1
- data/lib/claude_memory/resolve/resolver.rb +30 -3
- data/lib/claude_memory/shortcuts.rb +61 -18
- data/lib/claude_memory/store/prompt_journey_query.rb +87 -0
- data/lib/claude_memory/store/schema_manager.rb +1 -1
- data/lib/claude_memory/store/sqlite_store.rb +136 -0
- data/lib/claude_memory/sweep/maintenance.rb +31 -1
- data/lib/claude_memory/sweep/sweeper.rb +6 -0
- data/lib/claude_memory/templates/hooks.example.json +5 -0
- data/lib/claude_memory/version.rb +1 -1
- data/lib/claude_memory.rb +20 -0
- metadata +28 -1
|
@@ -0,0 +1,403 @@
|
|
|
1
|
+
# AI Memory Systems Landscape Analysis (2026)
|
|
2
|
+
|
|
3
|
+
*Analysis Date: 2026-05-23*
|
|
4
|
+
*Source: "The state of AI memory systems: benchmarks, architectures, and what actually works"*
|
|
5
|
+
*Author: Yohei Nakajima (compiled by Claude Opus 4.6 Research)*
|
|
6
|
+
*Source URL: https://x.com/yoheinakajima/status/2037201711937577319*
|
|
7
|
+
*Type: Meta-study (article, not single repository)*
|
|
8
|
+
|
|
9
|
+
---
|
|
10
|
+
|
|
11
|
+
## Executive Summary
|
|
12
|
+
|
|
13
|
+
### What this is
|
|
14
|
+
|
|
15
|
+
This is a **field survey**, not a single-repo study. The article reviews seven memory benchmarks and ~12 open-source memory systems published 2024-2026, ranks them by performance, and extracts five architectural patterns that separate top performers from the rest. Unlike a `/study-repo` of one codebase, the unit of analysis is **architectural choices that correlate with benchmark wins**.
|
|
16
|
+
|
|
17
|
+
### Key finding from the article
|
|
18
|
+
|
|
19
|
+
> "Architecture matters more than model size. A 20B-parameter model with Hindsight's multi-strategy memory achieves 83.6% on LongMemEval, dramatically outperforming full-context GPT-4o at 60.2%."
|
|
20
|
+
|
|
21
|
+
The field is converging on a specific template: **hybrid vector+graph storage, multi-strategy retrieval with reranking, explicit temporal modeling, and active memory consolidation**. Pure vector-store approaches (Mem0) plateau around 49% on LongMemEval; graph-native systems (Zep) reach 71%; multi-strategy systems (Hindsight) break 90%.
|
|
22
|
+
|
|
23
|
+
### Why ClaudeMemory cares
|
|
24
|
+
|
|
25
|
+
ClaudeMemory sits architecturally closest to Mem0 (vector + light graph via entity_aliases and fact_links, SQLite-only, LLM-light extraction). The article quantifies the cost of that choice — 22-point gap to Zep on LongMemEval, ~42-point gap to Hindsight. We don't need to chase those numbers, but the gaps tell us where our retrieval will *predictably* fail (multi-hop, temporal reasoning, conflict resolution at scale) and what's worth adopting given our local-first, no-cloud, single-developer constraints.
|
|
26
|
+
|
|
27
|
+
### Systems surveyed in the article (for cross-reference)
|
|
28
|
+
|
|
29
|
+
| System | Architecture | LongMemEval | LoCoMo | License |
|
|
30
|
+
|--------|--------------|-------------|--------|---------|
|
|
31
|
+
| Hindsight (Vectorize) | 4 networks + 4-strategy retrieval + cross-encoder | **91.4%** | 89.61% | MIT |
|
|
32
|
+
| Zep / Graphiti | Bi-temporal knowledge graph | 71.2% | 75.14% (disputed) | Apache 2.0 |
|
|
33
|
+
| MemGPT / Letta | OS-style hierarchy + agent-controlled | n/a | 74.0% (filesystem variant) | Apache 2.0 |
|
|
34
|
+
| Mem0 | Vector + optional graph, LLM-orchestrated CRUD | ~49% | 66.9-68.5% | Apache 2.0 |
|
|
35
|
+
| Cognee | Graph + vector + relational + ontology validation | n/a | n/a (self-reported wins) | Apache 2.0 |
|
|
36
|
+
| HippoRAG | Hippocampal indexing + Personalized PageRank | n/a | n/a | MIT |
|
|
37
|
+
| Letta (filesystem) | Simple file tools + agent capability | n/a | 74.0% | Apache 2.0 |
|
|
38
|
+
|
|
39
|
+
None of these were cloned for this study — the article itself is the primary source. Source-level file:line references in this document are to **ClaudeMemory** code, for adoption assessment.
|
|
40
|
+
|
|
41
|
+
### Production readiness assessment (article-derived)
|
|
42
|
+
|
|
43
|
+
- **Most mature**: Zep/Graphiti (24K stars, enterprise customers, Apache 2.0)
|
|
44
|
+
- **Best-published benchmarks**: Hindsight (MIT, but optimized for Vectorize-as-a-service)
|
|
45
|
+
- **Best fit for local-first**: Cognee (file-based defaults, swappable to cloud DBs) and Letta (open agent file format)
|
|
46
|
+
- **Most disputed**: LoCoMo benchmark itself — Mem0 and Zep publicly contradict each other's scores; the article calls LoCoMo "unreliable for cross-vendor comparison."
|
|
47
|
+
|
|
48
|
+
---
|
|
49
|
+
|
|
50
|
+
## Architecture Overview
|
|
51
|
+
|
|
52
|
+
### The Five Patterns (article's central claim)
|
|
53
|
+
|
|
54
|
+
The article identifies five patterns where the correlation with benchmark performance is "nearly linear":
|
|
55
|
+
|
|
56
|
+
1. **Multi-strategy retrieval** is the single biggest differentiator. Hindsight (4 strategies, 91.4%) > Zep (3 strategies, 71.2%) > Mem0 (1-2 strategies, 49%).
|
|
57
|
+
2. **Graph structure is essential for complex reasoning, vector for breadth.** Every top system uses hybrid storage.
|
|
58
|
+
3. **Temporal modeling correlates with the largest gains.** Systems with explicit temporal models score 20-60 points higher on temporal queries.
|
|
59
|
+
4. **Active memory consolidation prevents degradation at scale.** Top systems all run a background "refine/invalidate/prune" pass.
|
|
60
|
+
5. **Agent-controlled memory can outperform specialized infrastructure.** Letta's filesystem approach beat Mem0's purpose-built memory by 5.5 points on LoCoMo.
|
|
61
|
+
|
|
62
|
+
### Comparison Table: ClaudeMemory vs. the field
|
|
63
|
+
|
|
64
|
+
| Capability | Hindsight | Zep | Letta | Mem0 | Cognee | **ClaudeMemory** |
|
|
65
|
+
|-----------|-----------|-----|-------|------|--------|------------------|
|
|
66
|
+
| Vector search | ✅ cosine | ✅ cosine | ✅ pgvector | ✅ Qdrant | ✅ LanceDB | ✅ sqlite-vec (vec0) |
|
|
67
|
+
| BM25 / FTS | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ FTS5 |
|
|
68
|
+
| Graph traversal | ✅ | ✅ BFS | ❌ | ✅ (graph variant) | ✅ | ⚠️ partial (entity_aliases, fact_links — no traversal API) |
|
|
69
|
+
| Temporal-aware retrieval | ✅ dual timestamps | ✅ bi-temporal | ❌ | ⚠️ basic | ✅ | ⚠️ valid_from/valid_to in schema, not in ranking |
|
|
70
|
+
| Reranking | ✅ cross-encoder | ✅ RRF/MMR/cross-encoder | ❌ | ❌ | ❌ | ✅ RRF (`lib/claude_memory/core/rr_fusion.rb:1`) |
|
|
71
|
+
| Reflection/consolidation | ✅ reflect op | ✅ invalidate-not-delete | ✅ sleep-time compute | ✅ LLM CRUD | ✅ memify | ⚠️ supersession + sweep TTLs; no LLM reflect step |
|
|
72
|
+
| Agent-controlled writes | ❌ | ❌ | ✅ core operation | ❌ | ❌ | ⚠️ `memory.store_extraction` exists but ingestion is mostly passive via hooks |
|
|
73
|
+
| Bi-temporal (valid+ingest time) | ✅ | ✅ 4 timestamps | ❌ | ❌ | ⚠️ | ❌ (only valid_from/valid_to + created_at) |
|
|
74
|
+
| Fact / opinion separation | ✅ 4 networks | ⚠️ episode vs semantic | ⚠️ human vs persona block | ❌ | ❌ | ❌ (single facts table, all predicates equal) |
|
|
75
|
+
| Latency target | n/a | <200ms-1s P95 | varies | 1.4s P95 | n/a | hook context + recall: typically <100ms for SQLite read |
|
|
76
|
+
| Ingestion cost | high (parallel strategies) | hours for large corpora (many LLM calls) | low | low | medium | **low** (Layer 1 NullDistiller is free; Layer 2 piggybacks on Claude Code session) |
|
|
77
|
+
|
|
78
|
+
ClaudeMemory's profile: **vector + FTS + light graph hints, no traversal, no temporal-aware ranking, no reflection pass, mostly passive ingestion.** Closest peer: the Mem0 base variant, which the article scores at ~49% on LongMemEval. The features we've explicitly *rejected* (cross-encoder reranking, LLM query expansion, custom fine-tuned models; see `docs/improvements.md` "Features to Avoid") are the same ones Hindsight uses to break 90%. The article suggests we are correctly trading some peak benchmark score for cost, latency, and local-first operation, but it also names two things we never chose to give up and simply haven't built: temporal-aware ranking and explicit graph traversal.
|
|
79
|
+
|
|
80
|
+
---
|
|
81
|
+
|
|
82
|
+
## Key Components Deep-Dive
|
|
83
|
+
|
|
84
|
+
This section maps each pattern from the article to ClaudeMemory's current implementation and the gap, with `file:line` references to **our** code (the studied systems weren't cloned).
|
|
85
|
+
|
|
86
|
+
### 1. Multi-Strategy Retrieval
|
|
87
|
+
|
|
88
|
+
**The article's claim.** Hindsight runs four concurrent retrieval strategies (cosine semantic similarity, BM25 keyword matching, graph traversal across the shared memory graph, temporal reasoning) and fuses them with cross-encoder reranking. On temporal queries specifically, this took accuracy from a 31.6% baseline to 91.0%, a gain of roughly 60 points. Zep's three strategies (cosine + BM25 + BFS graph traversal) hit 71.2% on LongMemEval. Mem0's 1-2 strategies score ~49%.
|
|
89
|
+
|
|
90
|
+
**What we have.**
|
|
91
|
+
|
|
92
|
+
- Vector (`lib/claude_memory/index/vector_index.rb`): sqlite-vec native KNN.
|
|
93
|
+
- BM25/FTS5 (`lib/claude_memory/index/lexical_fts.rb`): SQLite FTS5 full-text.
|
|
94
|
+
- Fusion (`lib/claude_memory/core/rr_fusion.rb:1`): Reciprocal Rank Fusion of vec + FTS, with optional `score_trace` for debugging.
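For readers unfamiliar with Reciprocal Rank Fusion, the idea can be sketched in a few lines. This is an illustration only, not the code in `rr_fusion.rb`: the method name `rrf_fuse`, the constant `k = 60` (a commonly used default), and the fact IDs are all assumptions.

```ruby
# Illustrative Reciprocal Rank Fusion (RRF); not the gem's actual implementation.
# Each input is a list of fact IDs ordered best-first by one retrieval strategy.
# k dampens the influence of top ranks; 60 is a commonly used default (assumption).
def rrf_fuse(ranked_lists, k: 60)
  scores = Hash.new(0.0)
  ranked_lists.each do |ids|
    ids.each_with_index do |id, index|
      scores[id] += 1.0 / (k + index + 1) # index + 1 is the 1-based rank
    end
  end
  # Highest fused score first; ties broken by ID for a stable order.
  scores.sort_by { |id, score| [-score, id] }.map(&:first)
end

vector_hits = [101, 102, 103] # hypothetical sqlite-vec results
fts_hits    = [103, 101, 104] # hypothetical FTS5 results
fused = rrf_fuse([vector_hits, fts_hits])
# fused => [101, 103, 102, 104]
```

Because RRF only consumes ranks, any additional strategy (graph or temporal) can join the fusion by contributing one more ordered list, without score normalization.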
|
|
95
|
+
|
|
96
|
+
**What we don't have.**
|
|
97
|
+
|
|
98
|
+
- **No graph traversal as a retrieval strategy.** We store entity relationships in `entity_aliases` and supersession/conflict edges in `fact_links`, but no MCP tool walks them from a seed entity. `memory.fact_graph` returns immediate-neighbor facts for one fact_id; it doesn't BFS from a query.
|
|
99
|
+
- **No temporal-aware retrieval strategy.** We have `valid_from`, `valid_to`, `last_recalled_at` columns but the ranker doesn't use them as a third RRF input.
|
|
100
|
+
|
|
101
|
+
**Why this matters per the article.** "BM25 catches exact mentions that embedding search misses; graph traversal finds multi-hop connections invisible to flat similarity; temporal filtering prevents returning outdated facts." We have two of Zep's three strategies (cosine and BM25); the third, graph BFS, is the one Zep credits for its 22-point lead over Mem0.
|
|
102
|
+
|
|
103
|
+
### 2. Hybrid Vector + Graph Storage
|
|
104
|
+
|
|
105
|
+
**The article's claim.** Pure-vector systems plateau at ~50% on LongMemEval; graph-native systems reach 71%+. "The specific graph implementation matters less than having one — Neo4j, FalkorDB, and custom in-memory graphs all appear in high-performing systems."
|
|
106
|
+
|
|
107
|
+
**What we have.** A subject-predicate-object fact table with entity nodes and edges between facts (`fact_links` for supersession + conflict). This is graph-shaped data but we don't expose it as a traversable graph at query time.
|
|
108
|
+
|
|
109
|
+
**What we don't have.** A `BFS from entity X over relationship type Y` capability. The article specifically calls out that this finds multi-hop connections invisible to similarity search ("Who recommended the architecture decision we're using for storage?" — needs entity-resolved hops, not text overlap).
|
|
110
|
+
|
|
111
|
+
**Why this matters per the article.** Quote: "The specific graph implementation matters less than having one." We have graph-shaped storage but no graph-shaped retrieval, which is the worst of both worlds if we don't fix it.
|
|
112
|
+
|
|
113
|
+
### 3. Explicit Temporal Modeling
|
|
114
|
+
|
|
115
|
+
**The article's claim.** Temporal reasoning is the hardest capability across every benchmark (up to 73% human-vs-system gap on LoCoMo). Hindsight stores **dual timestamps** (occurrence time + mention time) — "what happened when" vs "what did I learn when." Zep's **bi-temporal model** tracks four timestamps per edge:
|
|
116
|
+
|
|
117
|
+
- `valid_at` — when the fact became true in the world
|
|
118
|
+
- `invalid_at` — when it was superseded
|
|
119
|
+
- `created_at` — when Graphiti ingested it
|
|
120
|
+
- `expired_at` — when the record was logically replaced
|
|
121
|
+
|
|
122
|
+
This enables point-in-time queries ("What did we know about X on date Y?") and full audit trails.
|
|
123
|
+
|
|
124
|
+
**What we have.**
|
|
125
|
+
|
|
126
|
+
- `valid_from` / `valid_to` (`db/migrations/001_create_initial_schema.rb:64-65`) — world-time validity window.
|
|
127
|
+
- `created_at` — ingest time.
|
|
128
|
+
- `last_recalled_at` (schema v17) — access time.
|
|
129
|
+
|
|
130
|
+
**What we don't have.**
|
|
131
|
+
|
|
132
|
+
- No `invalid_at` / `expired_at` distinction. We set `valid_to` when superseded and `status='superseded'` — but `valid_to` conflates "world-time end" and "ingest-time supersession." A fact retroactively learned to have been false in 2023 and a fact superseded today look identical in the schema.
|
|
133
|
+
- No temporal-aware retrieval. `Recall` queries don't weight by recency, and `memory.recall` doesn't accept "as of <date>" filters.
|
|
134
|
+
|
|
135
|
+
**Why this matters per the article.** This is the field-wide weakness. Systems "with explicit temporal modeling consistently score 20-60 points higher on temporal queries than systems treating time as metadata." We're currently in the "metadata" camp.
|
|
136
|
+
|
|
137
|
+
### 4. Active Memory Consolidation
|
|
138
|
+
|
|
139
|
+
**The article's claim.** Systems that accumulate without consolidating suffer noise growth. The article catalogs five consolidation strategies:
|
|
140
|
+
|
|
141
|
+
- **Hindsight reflect** — updates beliefs based on new evidence.
|
|
142
|
+
- **Zep invalidate-not-delete** — contradicted facts are marked invalid, preserving history.
|
|
143
|
+
- **Cognee memify** — prunes stale nodes, strengthens frequent connections, derives new facts.
|
|
144
|
+
- **Letta sleep-time compute** — background agent processes facts during idle time using stronger/slower models, producing refined "learned context."
|
|
145
|
+
- **Mem0 LLM-CRUD** — ADD / UPDATE / DELETE / NOOP decided per-extraction by an LLM.
|
|
146
|
+
|
|
147
|
+
**What we have.**
|
|
148
|
+
|
|
149
|
+
- Supersession with provenance preservation (`lib/claude_memory/resolve/resolver.rb:126-149`) — closest to Zep's invalidate-not-delete (we set status=`superseded` and keep the row).
|
|
150
|
+
- Sweep with TTL escalation (`lib/claude_memory/sweep/maintenance.rb`) — closest to Cognee's pruning.
|
|
151
|
+
- Conflict detection — adjacent to Mem0's LLM-CRUD but rule-based, not LLM-driven.
|
|
152
|
+
|
|
153
|
+
**What we don't have.**
|
|
154
|
+
|
|
155
|
+
- **No reflect/refine pass.** We never re-examine an old fact in light of new context. A decision from January and one from May about the same subject don't get re-evaluated as a pair unless they happen to trigger supersession at insert time.
|
|
156
|
+
- **No background "learned context" agent.** Layer 2 distillation runs *only* on the current session's transcripts; nothing reflects on the full corpus during idle time.
|
|
157
|
+
|
|
158
|
+
**Why this matters per the article.** Without consolidation, signal-to-noise degrades as memory grows — this is the "scale" failure mode. Today our corpus is small (low hundreds of facts per project). The article suggests this will hurt at 10K+ facts.
|
|
159
|
+
|
|
160
|
+
### 5. Agent-Controlled Memory
|
|
161
|
+
|
|
162
|
+
**The article's claim.** Letta demonstrated that a simple filesystem approach (agent + file tools) hit 74% on LoCoMo with GPT-4o-mini, beating Mem0's purpose-built infrastructure at 68.5%. Quote: "Agent capability matters more than specialized memory infrastructure."
|
|
163
|
+
|
|
164
|
+
**What we have.**
|
|
165
|
+
|
|
166
|
+
- `memory.store_extraction` MCP tool — the agent *can* write, but in practice extraction happens passively via SessionStart hook injection (Layer 2 distillation).
|
|
167
|
+
- Five "shortcut" tools (`memory.decisions`, `memory.conventions`, `memory.architecture`, `memory.facts_by_tool`, `memory.facts_by_context`) the agent uses for recall.
|
|
168
|
+
|
|
169
|
+
**What we don't have.**
|
|
170
|
+
|
|
171
|
+
- No "agent decides when to remember" mode. Layer 1 (NullDistiller regex) runs unconditionally on hook events; Layer 2 runs on SessionStart; Layer 3 is user-triggered. The agent doesn't proactively decide "this thread is important, store it."
|
|
172
|
+
|
|
173
|
+
**Why this matters per the article.** This is the "autonomy vs. determinism" trade-off the article explicitly names. Letta's autonomy is non-deterministic and model-dependent; our determinism is fast and predictable. We probably don't want to flip the model — but a *partial* adoption (an explicit "save this for later" tool the agent can call mid-conversation) is consistent with our current architecture.
|
|
174
|
+
|
|
175
|
+
### 6. Benchmarks We're Not Running
|
|
176
|
+
|
|
177
|
+
**The article's claim.** Seven benchmarks now define the evaluation landscape:
|
|
178
|
+
|
|
179
|
+
| Benchmark | Venue / date | What it tests | Notes |
|
|
180
|
+
|-----------|------|---------------|-------|
|
|
181
|
+
| LongMemEval | ICLR 2025 | 5 abilities × 500 questions, 115K-1.5M token contexts | Gold standard |
|
|
182
|
+
| LoCoMo | ACL 2024 | 10 conversations × 300 turns | Vendor-disputed; scores unreliable |
|
|
183
|
+
| MemBench | ACL 2025 | Factual vs reflective memory | Useful for our "decision vs convention" split |
|
|
184
|
+
| MemoryBench | Tsinghua 2025 | Continual learning from feedback | 11 datasets, 3 domains, 2 languages |
|
|
185
|
+
| MemoryAgentBench | ICLR 2026 | 4 competencies including conflict resolution | "No method excels at all four" |
|
|
186
|
+
| EverMemBench | Feb 2026 | Multi-party group conversations | Niche |
|
|
187
|
+
| Letta Leaderboard | 2025 | LLMs managing own memory via tools | Most relevant to our MCP design |
|
|
188
|
+
|
|
189
|
+
**What we have.** Our own eval suite (`spec/evals/`), DevMemBench (`spec/benchmarks/`), and `spec/benchmarks/comparative/` against QMD + grepai.
|
|
190
|
+
|
|
191
|
+
**What we don't have.** Any cross-comparison against LongMemEval or LoCoMo. We can't say with evidence "ClaudeMemory scores X on LongMemEval" — and given the article's framing, that's the question potential adopters will ask.
|
|
192
|
+
|
|
193
|
+
**Why this matters per the article.** LongMemEval is the only benchmark the article describes as rigorous. LoCoMo numbers are "unreliable for cross-vendor comparison" because of public scoring disputes. If we report any benchmark, it should be LongMemEval; if we cite LoCoMo it should be with the disclaimer.
|
|
194
|
+
|
|
195
|
+
---
|
|
196
|
+
|
|
197
|
+
## Comparative Analysis
|
|
198
|
+
|
|
199
|
+
### What the field does well that we don't
|
|
200
|
+
|
|
201
|
+
1. **Graph traversal at retrieval time** (Zep, Mem0 graph variant, Cognee). We store the graph; we don't walk it.
|
|
202
|
+
2. **Bi-temporal modeling** (Zep). We conflate world-time and ingest-time in a single `valid_to` column.
|
|
203
|
+
3. **Active consolidation / reflect pass** (Hindsight, Cognee memify, Letta sleep-time). We supersede at insert time only.
|
|
204
|
+
4. **Epistemic separation** (Hindsight 4 networks: world facts / agent experiences / entity observations / evolving opinions). We have `provenance.strength` (stated/inferred/derived) but don't route differently.
|
|
205
|
+
5. **Standardized benchmark scores** (LongMemEval). We have internal evals only.
|
|
206
|
+
|
|
207
|
+
### What we do well that they don't
|
|
208
|
+
|
|
209
|
+
1. **Local-first, zero-cloud-dependency.** Letta and Mem0 require PostgreSQL + (Qdrant or pgvector). Cognee defaults to file-based but is Python-heavyweight. Our gem + SQLite stack ships as a single Ruby dependency.
|
|
210
|
+
2. **No LLM in the retrieval path.** Zep makes this point ("no LLM calls during retrieval"), achieving 200ms-1s P95 — and so do we, even more aggressively (no inference at all, just SQL).
|
|
211
|
+
3. **Free Layer 2 distillation.** Mem0 calls an LLM for every extraction. Letta runs background sleep-time agents. We piggyback on the user's existing Claude Code session via context hook injection — zero additional API spend. This is genuinely novel and the article doesn't mention any equivalent.
|
|
212
|
+
4. **Provenance receipts on every fact.** Mem0 logs operations to SQLite for audit but doesn't tie each fact to a quoted source. Our `provenance` + `mcp_tool_calls` tables give every claim a traceable origin.
|
|
213
|
+
5. **Public predicate vocabulary.** PredicatePolicy is the article's missing piece for fact/opinion separation — it's an opinionated, curated set of 9 predicates with cardinality semantics, exposed publicly via `docs/api_stability.md`. Hindsight does this implicitly in code; we do it as a contract.
|
|
214
|
+
|
|
215
|
+
### Trade-offs the article explicitly names
|
|
216
|
+
|
|
217
|
+
| Tension | Their pole | Our pole |
|
|
218
|
+
|---------|-----------|----------|
|
|
219
|
+
| Richness vs. latency | Zep: hours of ingestion for richer graph | NullDistiller P95 <5ms; minutes for Layer 3 manual |
|
|
220
|
+
| Autonomy vs. determinism | Letta: agent-controlled, model-dependent | Deterministic SQL queries |
|
|
221
|
+
| Completeness vs. compression | Zep preserves raw episodes | We distill into structured facts only (raw transcript chunks live in `content_items` until swept) |
|
|
222
|
+
|
|
223
|
+
These poles match the design decisions we've already made and recorded. The article validates them, including specifically what we *gave up* (peak benchmark score on LongMemEval) for what we *gained* (sub-100ms recall, no cloud cost, no LLM in critical path).
|
|
224
|
+
|
|
225
|
+
---
|
|
226
|
+
|
|
227
|
+
## Adoption Opportunities
|
|
228
|
+
|
|
229
|
+
### High Priority ⭐
|
|
230
|
+
|
|
231
|
+
#### 1. Graph Traversal as a Third Retrieval Strategy ⭐
|
|
232
|
+
|
|
233
|
+
- **Value.** The article credits graph-BFS as the difference between Mem0 (49% on LongMemEval) and Zep (71.2%). We already store the graph; we just don't traverse it at query time. This is the highest-leverage gap in our retrieval — work we've already done 80% of, exposed differently.
|
|
234
|
+
- **Evidence.** Article Pattern 1 + Pattern 2. ClaudeMemory has `fact_links` (supersession/conflict edges) and `entities`/`entity_aliases` (entity nodes) but `lib/claude_memory/recall.rb` doesn't BFS over them.
|
|
235
|
+
- **Implementation.** Add a `Recall::GraphTraversal` strategy: resolve the query to seed entities via the existing entity matcher, BFS one or two hops over `entities` ↔ `facts` ↔ `entities` (using `subject_id` and `object_entity_id` if present), score by hop distance × edge type. Fuse into the existing RRF in `Core::RRFusion` (`lib/claude_memory/core/rr_fusion.rb`) as a third source alongside vec + FTS. Bound BFS depth (1-2 hops) so latency stays sub-100ms.
|
|
236
|
+
- **Effort.** Medium — 2-3 days. The data is already shaped correctly; this is a new strategy class + RRF integration + tests.
|
|
237
|
+
- **Trade-off.** Adds a third source to RRF tuning. Empty graphs (early-project use) contribute no candidates to the fusion, so retrieval degrades gracefully to vec + FTS.
|
|
238
|
+
- **Recommendation.** **ADOPT** in 0.12.0 or 0.13.0. Aligns with our existing hybrid retrieval; no new dependencies; demonstrably the field's biggest accuracy lever.
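The bounded traversal described under Implementation could look roughly like the sketch below. It is a minimal illustration under stated assumptions: `GraphFact` and `graph_ranked_fact_ids` are hypothetical names, facts are in-memory structs rather than SQLite rows, and scoring uses hop distance only (edge-type weighting is left out).

```ruby
require "set"

# Hypothetical in-memory stand-in for rows of the facts table.
GraphFact = Struct.new(:id, :subject_id, :object_entity_id)

# Bounded breadth-first search: start from seed entity IDs, follow
# entity -> fact -> entity hops, and return fact IDs ordered nearest hop first.
def graph_ranked_fact_ids(facts, seed_entity_ids, max_hops: 2)
  frontier = seed_entity_ids.to_set
  visited_entities = frontier.dup
  hop_of_fact = {}

  1.upto(max_hops) do |hop|
    next_frontier = Set.new
    facts.each do |fact|
      endpoints = [fact.subject_id, fact.object_entity_id].compact
      next unless endpoints.any? { |e| frontier.include?(e) }

      hop_of_fact[fact.id] ||= hop # keep the shortest hop distance seen
      endpoints.each do |e|
        next_frontier << e unless visited_entities.include?(e)
      end
    end
    break if next_frontier.empty?

    visited_entities.merge(next_frontier)
    frontier = next_frontier
  end

  hop_of_fact.sort_by { |id, hop| [hop, id] }.map(&:first)
end

facts = [
  GraphFact.new(1, :alice, :storage_decision),  # "Alice recommended the storage decision"
  GraphFact.new(2, :storage_decision, :sqlite), # "the storage decision uses SQLite"
  GraphFact.new(3, :bob, :ci_pipeline)          # unrelated
]
ranked = graph_ranked_fact_ids(facts, [:sqlite])
# ranked => [2, 1]
```

Starting from the `:sqlite` entity, the one-hop fact (2) ranks ahead of the two-hop fact (1), and the unrelated fact never appears. The returned list is already rank-ordered, so it can be handed to RRF as a third source. A real implementation would query edges by entity ID rather than scanning every fact per hop.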
|
|
239
|
+
|
|
240
|
+
#### 2. Temporal-Aware Retrieval Strategy ⭐
|
|
241
|
+
|
|
242
|
+
- **Value.** The article says temporal reasoning shows the largest performance gaps across every benchmark (up to 73% on LoCoMo). Adding even basic recency weighting and "as-of" filtering would close part of this.
|
|
243
|
+
- **Evidence.** Article Pattern 3. Schema already has `valid_from`, `valid_to`, `created_at`, `last_recalled_at`. None are used in ranking.
|
|
244
|
+
- **Implementation.** Two pieces:
|
|
245
|
+
  1. **Recency boost in RRF.** Add a `temporal_rank` input to `Core::RRFusion`: facts with newer `valid_from` get a small rank boost (decay factor configurable). Doesn't replace lexical/semantic; it's a third (or fourth, with graph) RRF source.
|
|
246
|
+
  2. **`as_of` parameter on `memory.recall`.** Optional ISO 8601 timestamp; filters to facts where `valid_from <= as_of AND (valid_to IS NULL OR valid_to > as_of)`. Enables "what did we know about X on date Y" queries the article credits Zep with.
|
|
247
|
+
- **Effort.** Small — 1-2 days. Existing columns; just thread the new parameter and ranker.
|
|
248
|
+
- **Trade-off.** Recency weighting can over-rank ephemeral facts (e.g., a debugging note from yesterday vs. a long-standing convention). Cap the boost at low weight (e.g., 0.1× of vec contribution) and tune via the existing eval harness.
|
|
249
|
+
- **Recommendation.** **ADOPT** in 0.12.0 alongside #1. Tiny change, big article-validated upside, no new dependencies.
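The two pieces under Implementation can be sketched as follows. This is a minimal illustration under assumptions: `TimedFact`, `facts_as_of`, and `temporal_rank` are hypothetical names, facts are in-memory structs rather than SQLite rows, every fact has a non-nil `valid_from`, and recency is expressed as a plain rank list (so it can feed RRF like any other source) rather than a tuned decay factor.

```ruby
require "time"

# Hypothetical fact rows; only the columns named in the text are modeled.
TimedFact = Struct.new(:id, :valid_from, :valid_to)

# as_of filter: keep facts whose validity window contains the timestamp,
# i.e. valid_from <= as_of AND (valid_to IS NULL OR valid_to > as_of).
def facts_as_of(facts, as_of)
  facts.select do |f|
    f.valid_from <= as_of && (f.valid_to.nil? || f.valid_to > as_of)
  end
end

# Recency ranking: newest valid_from first, usable as one more RRF input.
def temporal_rank(facts)
  facts.sort_by { |f| [-f.valid_from.to_f, f.id] }.map(&:id)
end

facts = [
  TimedFact.new(1, Time.iso8601("2025-01-10T00:00:00Z"), Time.iso8601("2025-06-01T00:00:00Z")),
  TimedFact.new(2, Time.iso8601("2025-06-01T00:00:00Z"), nil),
  TimedFact.new(3, Time.iso8601("2026-02-01T00:00:00Z"), nil)
]

as_of = Time.iso8601("2025-03-15T00:00:00Z")
known_then   = facts_as_of(facts, as_of).map(&:id) # => [1]
newest_first = temporal_rank(facts)                # => [3, 2, 1]
```

Keeping the filter and the ranking separate means `as_of` can narrow the candidate set before fusion, while the recency list only nudges ordering within it.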
|
|
250
|
+
|
|
251
|
+
#### 3. Bi-Temporal Schema Cleanup (`world_invalid_at` vs `ingest_expired_at`)
|
|
252
|
+
|
|
253
|
+
- **Value.** Today, `valid_to` does double duty: "fact ceased to be true in the world" *and* "we superseded this fact during ingestion." The article calls this out specifically as Zep's most important innovation. With both columns, point-in-time queries work correctly — without them, we silently corrupt the temporal axis.
|
|
254
|
+
- **Evidence.** Article: "Every entity edge tracks four timestamps: valid_at (when the fact became true in the world), invalid_at (when it was superseded), created_at (when Graphiti ingested it), expired_at (when the record was logically replaced)."
|
|
255
|
+
- **Implementation.** New schema migration (v18 is already taken by `db/migrations/018_add_otel_telemetry.rb`, so this would use the next free version): rename `valid_to` → `world_invalid_at`; add `ingest_expired_at` (datetime, nullable). Update `Resolver` to set `ingest_expired_at` on supersession and leave `world_invalid_at` for explicit user-supplied "this fact stopped being true on date X" updates. Backfill: copy existing `valid_to` into both columns (we can't recover the distinction historically).
|
|
256
|
+
- **Effort.** Medium — schema migration + resolver update + MCP tool surface (optional `world_invalid_at` parameter on `memory.reject_fact` and friends) + tests. 2-3 days.
|
|
257
|
+
- **Trade-off.** API surface change — `valid_to` is part of the public schema per `docs/api_stability.md`. Needs a deprecation cycle (alias `valid_to` to `world_invalid_at` in the Sequel model for one minor version).
|
|
258
|
+
- **Recommendation.** **ADOPT** in 0.13.0 (after #1 and #2). Lower urgency than the retrieval changes, but it's the foundation for any future "as of" reasoning, audit trail, and historical reasoning. Cheaper to do before our corpus grows.
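To make the distinction concrete, the sketch below shows how two rows that look identical on the ingest axis can differ on the world axis once the columns are split. It is illustrative only: `BiFact` and both predicate methods are hypothetical, and the column names follow the proposal above rather than any shipped schema.

```ruby
require "time"

# Hypothetical row shape after the proposed split; column names follow the text.
BiFact = Struct.new(:id, :valid_from, :world_invalid_at, :created_at, :ingest_expired_at)

def ts(string)
  Time.iso8601(string)
end

# World-time question: was this fact true in the world at `at`?
def true_in_world?(fact, at)
  fact.valid_from <= at && (fact.world_invalid_at.nil? || fact.world_invalid_at > at)
end

# Ingest-time question: was this row the store's current record at `at`?
def current_in_store?(fact, at)
  fact.created_at <= at && (fact.ingest_expired_at.nil? || fact.ingest_expired_at > at)
end

# Learned in 2025 that the fact had already stopped being true in 2023.
retroactive = BiFact.new(1, ts("2022-01-01T00:00:00Z"), ts("2023-06-01T00:00:00Z"),
                         ts("2024-01-01T00:00:00Z"), ts("2025-03-01T00:00:00Z"))
# Superseded during ingestion in 2025; no claim about when it stopped being true.
superseded = BiFact.new(2, ts("2022-01-01T00:00:00Z"), nil,
                        ts("2024-01-01T00:00:00Z"), ts("2025-03-01T00:00:00Z"))

probe = ts("2024-06-01T00:00:00Z")
world_view = [retroactive, superseded].map { |f| true_in_world?(f, probe) }    # => [false, true]
store_view = [retroactive, superseded].map { |f| current_in_store?(f, probe) } # => [true, true]
```

With a single `valid_to` column, both rows would carry the same 2025 end date and the `world_view` answer for the first row would be wrong.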
|
|
259
|
+
|
|
260
|
+
#### 4. LongMemEval Benchmark Integration
|
|
261
|
+
|
|
262
|
+
- **Value.** The article calls LongMemEval the "gold standard." Without an external benchmark score, we can't credibly position ClaudeMemory against the field. Internal evals (which we have) don't answer "is this competitive with Zep/Mem0?"
|
|
263
|
+
- **Evidence.** Article: "LongMemEval has emerged as the gold standard… Three-stage framework (Indexing → Retrieval → Reading) with LLM-as-judge scoring provides the most rigorous evaluation available."
|
|
264
|
+
- **Implementation.** Add `spec/benchmarks/longmemeval/` adapter. Dataset is public (Wu et al. arXiv). Wire it into `bin/run-evals --longmemeval`. Report Recall@k, MRR, nDCG@10 the way DevMemBench already does.
|
|
265
|
+
- **Effort.** Medium — 2-4 days. Mostly dataset wrangling + adapter code. The existing DevMemBench pipeline already has the right shape.
|
|
266
|
+
- **Trade-off.** LongMemEval_S contexts run ~115K tokens each; ingesting all 500 questions will be slow and will cost real API spend if we use Claude Code in the inner loop. Mitigation: stub mode for the retrieval-only portion (no LLM-judge), real mode opt-in.
|
|
267
|
+
- **Recommendation.** **ADOPT** in 0.12.0 or 0.13.0. This is what we'd cite in a release blog post; the article makes it clear it's the only number that matters.
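For reference, the three retrieval metrics named under Implementation can be computed as below. This is a generic sketch with binary relevance, not the DevMemBench code; the method names and the sample IDs are assumptions.

```ruby
# Standard retrieval metrics with binary relevance. `ranked` is a list of
# retrieved IDs, best first; `relevant` is the list of gold IDs for the query.

def recall_at_k(ranked, relevant, k)
  return 0.0 if relevant.empty?

  (ranked.first(k) & relevant).size.to_f / relevant.size
end

# Reciprocal rank of the first relevant hit (0.0 when none is retrieved).
# MRR is the mean of this value over all queries.
def reciprocal_rank(ranked, relevant)
  index = ranked.index { |id| relevant.include?(id) }
  index ? 1.0 / (index + 1) : 0.0
end

# Discounted cumulative gain: position i (0-based) is discounted by log2(i + 2).
def dcg(gains)
  gains.each_with_index.sum { |gain, i| gain / Math.log2(i + 2) }
end

def ndcg_at_k(ranked, relevant, k)
  gains = ranked.first(k).map { |id| relevant.include?(id) ? 1.0 : 0.0 }
  ideal_dcg = dcg(Array.new([relevant.size, k].min, 1.0))
  ideal_dcg.zero? ? 0.0 : dcg(gains) / ideal_dcg
end

ranked   = %w[q7 q2 q9 q4] # hypothetical retrieval output
relevant = %w[q2 q4]       # hypothetical gold set

recall = recall_at_k(ranked, relevant, 2)  # => 0.5
rr     = reciprocal_rank(ranked, relevant) # => 0.5
ndcg   = ndcg_at_k(ranked, relevant, 10)   # roughly 0.651
```

These cover only the retrieval stage; the LLM-as-judge reading stage the article describes would sit on top and is not sketched here.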
|
|
268
|
+
|
|
269
|
+
### Medium Priority
|
|
270
|
+
|
|
271
|
+
#### 5. Reflect Pass — Background Consolidation on Idle
|
|
272
|
+
|
|
273
|
+
- **Value.** Hindsight's reflect operation and Letta's sleep-time compute both run a background process that re-examines stored facts using a stronger/slower model. The article credits this with preventing noise growth at scale. We don't have it. Today our corpus is small enough not to need it, but we will need it once any single project exceeds ~5K facts.
|
|
274
|
+
- **Evidence.** Article Pattern 4.
|
|
275
|
+
- **Implementation.** Extend `Sweep::Maintenance` with a `reflect` operation that runs during the SessionEnd hook when N facts have accumulated since last reflect. The reflect operation is an MCP-callable prompt: "Given these N facts about subject X, produce: (a) a consolidated summary fact, (b) any contradictions, (c) any facts that should be marked obsolete." Like Layer 2 distillation, this can piggyback on the user's Claude Code session — no extra API cost.
|
|
276
|
+
- **Effort.** Large — 5-7 days. Touches sweep, hooks, MCP, and skill design. Needs a careful prompt + good eval to prove we're not introducing hallucinated consolidations.
|
|
277
|
+
- **Trade-off.** Risk of consolidating away real distinctions. Mitigation: every consolidated fact links to the source facts via `fact_links` (already supported); manual `claude-memory reject` undoes a bad consolidation.
|
|
278
|
+
- **Recommendation.** **CONSIDER** for 1.0.0 or later. The article validates the direction; we don't have the scale problem yet. Track when largest project DB crosses 5K facts.
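
As a rough sketch of the trigger and prompt described in the Implementation bullet above: the code below checks whether enough facts have accumulated and builds one prompt per subject. `ReflectFact`, `ReflectSketch`, and the threshold value are all hypothetical stand-ins; how `Sweep::Maintenance` would actually count facts or dispatch the prompt is not specified in this document.

```ruby
# Hypothetical sketch; these names are illustrative, not the gem's API.
ReflectFact = Struct.new(:id, :subject, :predicate, :object, keyword_init: true)

module ReflectSketch
  module_function

  # True once enough facts have accumulated since the last reflect pass.
  def due?(facts_since_last_reflect, threshold:)
    facts_since_last_reflect >= threshold
  end

  # Groups facts by subject and builds one consolidation prompt per subject,
  # following the wording quoted in the Implementation bullet.
  def prompts_for(facts)
    facts.group_by(&:subject).map do |subject, group|
      lines = group.map { |f| "- [#{f.id}] #{f.predicate}: #{f.object}" }
      <<~PROMPT
        Given these #{group.size} facts about subject #{subject}, produce:
        (a) a consolidated summary fact,
        (b) any contradictions,
        (c) any facts that should be marked obsolete.

        #{lines.join("\n")}
      PROMPT
    end
  end
end

facts = [
  ReflectFact.new(id: 1, subject: "repo", predicate: "uses_database", object: "sqlite"),
  ReflectFact.new(id: 2, subject: "repo", predicate: "uses_database", object: "postgres"),
  ReflectFact.new(id: 3, subject: "ci", predicate: "runs_on", object: "github_actions")
]

if ReflectSketch.due?(facts.size, threshold: 3) # threshold chosen only for the demo
  ReflectSketch.prompts_for(facts).each { |prompt| puts prompt }
end
```

Keeping the fact ids in the prompt is what would let a consolidated fact link back to its sources, as the Trade-off bullet requires.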

#### 6. `memory.save_this` Tool: Agent-Initiated Storage

- **Value.** Letta's striking result on LoCoMo (74% vs Mem0's 68.5%) suggests that giving the agent explicit "save this" capability beats passive extraction in some scenarios. We already have `memory.store_extraction`, but it's framed as "report an extraction you found," not "I (the agent) want to remember this for later." A friendlier surface might increase use.
- **Evidence.** Article Pattern 5 + Letta filesystem result.
- **Implementation.** Add `memory.save_this` as a thin wrapper over `memory.store_extraction` with a simpler prompt: "Save the most important fact from this turn for future sessions. Tag with `subject`, `predicate`, `object`, and a brief reason." Document it in the MCP `memory_guide` prompt as the agent's "I want to remember this" tool.
- **Effort.** Small, 1 day. Mostly MCP surface, prompt updates, and tests.
- **Trade-off.** Could drive low-quality "save everything" behavior. Mitigation: the existing `BareConclusionDetector` already gates against poor extractions.
- **Recommendation.** **CONSIDER** in 0.13.0 if first-week usage shows agents rarely use `store_extraction` proactively. Cheap to try; cheap to remove.
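
The thin-wrapper idea above might look something like the following. This is a sketch under stated assumptions: `SaveThisSketch` is hypothetical, the `store` callable stands in for whatever handler backs `memory.store_extraction`, and the keyword arguments passed to it (`facts:`, `reason:`, `source:`) are invented for illustration rather than taken from the real tool schema.

```ruby
# Hypothetical sketch; the argument shape passed to `store` is an assumption.
module SaveThisSketch
  REQUIRED = %i[subject predicate object reason].freeze

  module_function

  # Validates the agent's payload, then delegates to the store callable.
  def save_this(payload, store:)
    missing = REQUIRED.select { |key| payload[key].to_s.strip.empty? }
    unless missing.empty?
      raise ArgumentError, "missing fields: #{missing.join(', ')}"
    end

    store.call(
      facts: [payload.slice(:subject, :predicate, :object)],
      reason: payload[:reason],
      source: "agent_initiated"
    )
  end
end

# Stand-in for the real extraction handler: records calls in memory.
stored = []
fake_store = ->(**args) { stored << args; :ok }

SaveThisSketch.save_this(
  { subject: "repo", predicate: "uses_database", object: "sqlite",
    reason: "User confirmed the storage choice" },
  store: fake_store
)
puts stored.first[:facts].inspect
```

Rejecting payloads without a `reason` is one cheap guard against "save everything" behavior, in addition to the existing detector mentioned in the Trade-off bullet.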

#### 7. Provenance Strength Routing (light epistemic separation)

- **Value.** Hindsight's 4-network architecture (world facts / agent experiences / entity observations / evolving opinions) gives different retrieval characteristics to different fact types. We have a similar axis, `provenance.strength` ∈ {stated, inferred, derived}, but the ranker doesn't use it.
- **Evidence.** Article: "Epistemic separation — structurally distinguishing evidence from inference — is a key innovation."
- **Implementation.** Add a small weight in `Core::RRFusion`: `stated` facts get full weight, `inferred` get 0.7×, `derived` get 0.5×. Surface a `strength_filter` parameter on `memory.recall` for "only stated facts" use cases.
- **Effort.** Small, 1 day. We already store the data.
- **Trade-off.** Minor: could under-rank inferred facts that are nonetheless useful. Tune via the eval harness.
- **Recommendation.** **CONSIDER**. Already covered partially by improvement #57 (Provenance-Strength-Aware Retrieval Ranking) in `docs/improvements.md`. This article *strongly validates* that improvement; promoting #57 from Medium to High is the right move.
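
To make the weighting in the Implementation bullet concrete, here is a small self-contained sketch of reciprocal rank fusion with the 1.0 / 0.7 / 0.5 strength multipliers and an optional `strength_filter`. It is not the code in `Core::RRFusion`: the module name, the smoothing constant `K = 60` (a commonly used RRF default), and the choice to treat facts with unknown strength as `stated` are all assumptions.

```ruby
# Hypothetical sketch of strength-weighted RRF; not the gem's implementation.
module StrengthWeightedRRF
  STRENGTH_WEIGHTS = { "stated" => 1.0, "inferred" => 0.7, "derived" => 0.5 }.freeze
  K = 60 # common RRF smoothing constant; an assumption here

  module_function

  # ranked_lists: array of arrays of fact ids, best first (e.g. vec and FTS).
  # strengths: hash of fact id => provenance strength string.
  # strength_filter: when set, only facts with that strength are returned.
  def fuse(ranked_lists, strengths, strength_filter: nil)
    scores = Hash.new(0.0)
    ranked_lists.each do |list|
      list.each_with_index do |id, index|
        scores[id] += 1.0 / (K + index + 1)
      end
    end

    weighted = scores.filter_map do |id, score|
      strength = strengths.fetch(id, "stated") # unknown strength: assume stated
      next if strength_filter && strength != strength_filter

      [id, score * STRENGTH_WEIGHTS.fetch(strength, 1.0)]
    end
    # Highest score first; ties broken by id for a stable order.
    weighted.sort_by { |id, score| [-score, id] }.map(&:first)
  end
end

vec_results = %w[a b c]
fts_results = %w[b a d]
strengths = { "a" => "inferred", "b" => "stated", "c" => "derived", "d" => "stated" }

puts StrengthWeightedRRF.fuse([vec_results, fts_results], strengths).inspect
# ["b", "a", "d", "c"]
puts StrengthWeightedRRF.fuse([vec_results, fts_results], strengths, strength_filter: "stated").inspect
# ["b", "d"]
```

In the example, `a` and `b` have identical raw RRF scores, and the multiplier alone pushes the inferred fact below the stated one. That is exactly the under-ranking risk the Trade-off bullet says to tune with the eval harness.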

### Low Priority / Defer

#### 8. Ontology Validation Layer (Cognee-style)

- **Value.** Canonicalizes "car manufacturer," "automobile maker," "vehicle producer" into one entity. Reduces graph fragmentation.
- **Evidence.** Article: Cognee uses RDF/OWL ontologies + `difflib.get_close_matches()`.
- **Trade-off.** We already do this for predicates via `PredicatePolicy::SYNONYMS`. Extending to entities means defining ontologies per project, which is heavyweight for a single-developer tool.
- **Recommendation.** **DEFER**. Our `entity_aliases` mechanism is the lightweight version of this. Adopt only if entity fragmentation shows up as a real failure mode in benchmarks.

#### 9. LoCoMo Benchmark

- **Value.** Cross-comparison with other memory systems.
- **Evidence.** Article: "Vendor disputes about proper implementation… Mem0 and Zep have publicly contradicted each other's reported scores, making LoCoMo rankings unreliable for cross-vendor comparison."
- **Recommendation.** **DEFER**. The article specifically discredits LoCoMo as a comparison axis. LongMemEval (recommendation #4) is the right benchmark to invest in. If we cite LoCoMo at all, cite our own number standalone, not against vendor-reported scores.

### Features to Avoid (article-derived)

These are confirmed by the article as either over-engineering, mismatched, or solving problems we don't have:

- **Cross-encoder reranking**: Already in our avoid list. Article confirms: "Hindsight's four parallel retrieval strategies with cross-encoder reranking are expensive." No LLM in the retrieval path is one of our key advantages.
- **Bi-temporal complexity beyond a second column**: Zep tracks four timestamps per edge. The article doesn't quantify the value of `expired_at` separately from `invalid_at`. Recommendation #3 above adopts the simpler 3-timestamp model (`world_invalid_at` + `ingest_expired_at` + `created_at`) rather than the full 4-column Graphiti schema.
- **Custom fine-tuned models for any pipeline stage**: Already in our avoid list. Hindsight's results require Gemini-3 Pro for the 91.4% number; their 20B open variant scores 83.6%. We can't and shouldn't compete on model size; per the article, architecture (which we can fix) matters more anyway.
- **Cloud-required architecture**: Letta requires PostgreSQL + pgvector; Cognee defaults to local but production runs PostgreSQL + Neo4j + Qdrant. Our SQLite-only stack is a real differentiator the article doesn't address.
- **Multi-network epistemic separation as a hard schema split** (full Hindsight 4-network model): Over-complex for our scale. Recommendation #7 above adopts the soft version (weight by `provenance.strength`).
- **Conversation-level memory (Letta filesystem approach as primary mode)**: Article reports 74% on LoCoMo for filesystem-only, but the read/write loop consumes user-visible tokens on every interaction. Our hook-based passive ingestion is cheaper per session.
- **Sleep-time compute as a separate service**: Letta runs background agents. We can achieve the same effect for free inside the user's session via hooks (recommendation #5 runs the reflect pass in the SessionEnd hook). No separate process needed.

---

## Implementation Recommendations

### Phase 1: Validate the architecture pattern (0.12.0)

- **Graph traversal strategy** (recommendation #1, ⭐). Highest leverage; data is ready.
- **Temporal recency in RRF + `as_of` parameter** (recommendation #2, ⭐). Tiny code, big benchmark-validated upside.
- **LongMemEval integration** (recommendation #4, ⭐). Get a baseline number before we start tuning, so we can measure each subsequent change.
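
For the graph traversal item above, one plausible shape is a bounded breadth-first walk from the entities mentioned in a query, emitting fact ids in hop order so the result can be fed to RRF as a third ranked list next to vec and FTS. The sketch below is illustrative only: `GraphTraversalSketch`, the in-memory `edges` hash, and the `max_hops` default are assumptions, and the real strategy would read edges from the existing SQLite schema.

```ruby
require "set"

# Hypothetical sketch of a graph retrieval source; not the gem's API.
module GraphTraversalSketch
  module_function

  # edges: hash of entity id => array of [neighbor entity id, fact id] pairs.
  # Returns fact ids in breadth-first order from the seed entities, so facts
  # closer to the query's entities rank first.
  def ranked_facts(edges, seed_entities, max_hops: 2)
    visited = Set.new(seed_entities)
    frontier = seed_entities.dup
    ranked = []

    max_hops.times do
      next_frontier = []
      frontier.each do |entity|
        edges.fetch(entity, []).each do |neighbor, fact_id|
          ranked << fact_id unless ranked.include?(fact_id)
          # Set#add? returns nil when the entity was already visited.
          next_frontier << neighbor if visited.add?(neighbor)
        end
      end
      frontier = next_frontier
    end
    ranked
  end
end

edges = {
  "rails" => [["postgres", "f1"], ["sidekiq", "f2"]],
  "postgres" => [["pgvector", "f3"]],
  "pgvector" => [["hnsw", "f4"]]
}

puts GraphTraversalSketch.ranked_facts(edges, ["rails"]).inspect
# ["f1", "f2", "f3"]
```

Capping the hop count keeps the traversal cheap and LLM-free, which matches the "no LLM in retrieval path" constraint stated elsewhere in this document.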

### Phase 2: Foundation cleanup (0.13.0)

- **Bi-temporal schema cleanup** (recommendation #3). Schema change is easier now than later.
- **Promote improvement #57 to High and ship it** (recommendation #7). Already-tracked work; this article strongly validates it.
- **`memory.save_this` tool** (recommendation #6) if eval data suggests agents under-use `memory.store_extraction`.

### Phase 3: Scale concerns (1.0.0 or later)

- **Reflect pass** (recommendation #5). Only when a real project DB crosses ~5K facts; until then, premature.

### What to skip

- **LoCoMo benchmark** (recommendation #9). Article explicitly discredits it for cross-vendor use.
- **Ontology validation** (recommendation #8). Our existing `entity_aliases` + `PredicatePolicy::SYNONYMS` are the right-sized version.

---

## Architecture Decisions

### What to preserve (validated by the article)

1. **Local-first, SQLite-only**: competitive differentiator vs. Letta/Cognee cloud stacks.
2. **No LLM in retrieval path**: Zep makes this same choice and credits it for <200ms-1s latency; we go further with no LLM at all.
3. **Hook-based passive ingestion via Claude Code session**: zero-API-cost Layer 2 distillation; the article surveys no equivalent.
4. **RRF over vec+FTS**: same pattern Zep uses (cosine + BM25 + BFS); we just need to add the third source.
5. **Publicly-versioned predicate vocabulary** (`PredicatePolicy` + `docs/api_stability.md`): light, opinionated, stable. Field-wide there's no equivalent contract.
6. **Provenance receipts on every fact**: comparable systems log operations to SQLite but don't tie each fact to a quoted source.

### What to adopt (article-validated)

1. **Graph traversal as third retrieval strategy**: closes the largest article-named gap.
2. **Temporal-aware RRF + `as_of` queries**: closes the second-largest gap.
3. **Bi-temporal columns**: `world_invalid_at` separate from `ingest_expired_at`.
4. **LongMemEval as the comparison benchmark**: the only number the article describes as rigorous.
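
Items 2 and 3 in the list above interact: an `as_of` query has to consult both invalidation columns. The sketch below shows one way that filter could read, under several assumptions: `TemporalFact` and `AsOfSketch` are hypothetical, `created_at` is treated as the ingest time, and a single `as_of` timestamp is applied to both the world-time and ingest-time axes (a fuller design might accept one timestamp per axis).

```ruby
# Hypothetical sketch of an `as_of` filter over the 3-timestamp model.
TemporalFact = Struct.new(:id, :created_at, :world_invalid_at, :ingest_expired_at,
                          keyword_init: true)

module AsOfSketch
  module_function

  # A fact is visible at `as_of` when it had been recorded by then, was still
  # true in the world, and had not yet been superseded in the store.
  def visible?(fact, as_of)
    return false if fact.created_at > as_of
    return false if fact.world_invalid_at && fact.world_invalid_at <= as_of
    return false if fact.ingest_expired_at && fact.ingest_expired_at <= as_of

    true
  end

  def filter(facts, as_of:)
    facts.select { |fact| visible?(fact, as_of) }
  end
end

facts = [
  TemporalFact.new(id: "f1", created_at: Time.utc(2026, 1, 10),
                   world_invalid_at: Time.utc(2026, 2, 15)),
  TemporalFact.new(id: "f2", created_at: Time.utc(2026, 2, 15)),
  TemporalFact.new(id: "f3", created_at: Time.utc(2026, 3, 10))
]

puts AsOfSketch.filter(facts, as_of: Time.utc(2026, 2, 1)).map(&:id).inspect
# ["f1"]
puts AsOfSketch.filter(facts, as_of: Time.utc(2026, 3, 1)).map(&:id).inspect
# ["f2"]
```

Keeping the two invalidation columns separate is what lets the filter distinguish "this stopped being true" from "we replaced this record", even though this sketch treats both as hiding the fact.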

### What to reject

1. **Cross-encoder LLM reranking**: already rejected; the article confirms cost is the reason.
2. **Cloud-required graph DB** (Neo4j, FalkorDB): SQLite + our existing schema is sufficient; recommendation #1 traverses the graph we already have.
3. **4-network hard epistemic split**: recommendation #7 adopts the soft (weight-by-strength) version.
4. **LoCoMo benchmark**: the article itself discredits cross-vendor comparison.

---

## Key Takeaways

1. **We are architecturally closer to Mem0 (49% on LongMemEval) than to Zep (71.2%) or Hindsight (91.4%).** That's mostly a deliberate trade for local-first / no-LLM-in-retrieval. But two pieces of the gap (graph traversal and temporal-aware retrieval) are unforced. We already store the data; we just don't query it.

2. **The biggest single improvement we can make is adding graph traversal as a third RRF source.** Article-validated as the difference between Mem0-class and Zep-class systems. We have the data shape; we don't have the strategy class.

3. **Layer 2 distillation (free LLM via Claude Code session) is genuinely novel.** No system the article surveys does this. We should keep emphasizing it in documentation and in any benchmark write-up.

4. **Our existing improvement #57 (Provenance-Strength-Aware Retrieval Ranking) is the soft version of Hindsight's epistemic separation.** This article promotes it from "nice to have" to "fits the field-wide pattern." Recommend moving #57 to High Priority.

5. **Temporal reasoning is the field's hardest problem.** We've under-invested here. Schema-level fix (recommendation #3) and ranker-level fix (recommendation #2) together cost about a week's work.

6. **We should benchmark against LongMemEval before tuning any of this.** Without a baseline, we can't tell which adopted changes help.

7. **Article's clearest negative result: pure vector approaches plateau at ~50% on LongMemEval.** Anything we do that doubles down on vector-only retrieval is investing in the wrong axis.

---

## Cross-References

- **`docs/improvements.md`**: recommendations #1, #2, #3, #4, #6 should be added as new entries. Recommendation #7 promotes existing #57 from Medium to High.
- **`docs/influence/qmd.md`**, **`docs/influence/grepai.md`**: these are the other single-repo studies in the same architectural space (hybrid vec+FTS, no graph, no temporal). The article suggests their tradeoffs match ours.
- **`docs/api_stability.md`**: schema changes in recommendation #3 (bi-temporal cleanup) need to land in the same commit as updates here. The `valid_to` rename is a public-API break that needs deprecation aliasing.
- **`spec/benchmarks/README.md`**: recommendation #4 (LongMemEval integration) belongs in this directory.
- **`lib/claude_memory/core/rr_fusion.rb`**: recommendations #1, #2, #7 all change this fusion (#1 adds a new source; #2 and #7 add weighting). Touching this file once for all three is cheaper than three separate passes.
|