claude_memory 0.11.0 → 0.12.0
- checksums.yaml +4 -4
- data/.claude/memory.sqlite3 +0 -0
- data/.claude/rules/claude_memory.generated.md +42 -64
- data/.claude/skills/release/SKILL.md +44 -6
- data/.claude/skills/study-repo/SKILL.md +15 -0
- data/.claude-plugin/commands/audit-memory.md +68 -0
- data/.claude-plugin/marketplace.json +1 -1
- data/.claude-plugin/plugin.json +1 -1
- data/CHANGELOG.md +26 -0
- data/CLAUDE.md +9 -2
- data/README.md +29 -1
- data/db/migrations/018_add_otel_telemetry.rb +81 -0
- data/docs/1_0_punchlist.md +318 -66
- data/docs/api_stability.md +341 -0
- data/docs/audit_runbook.md +209 -0
- data/docs/claude_monitoring.md +956 -0
- data/docs/improvements.md +148 -9
- data/docs/influence/ai-memory-systems-2026.md +403 -0
- data/docs/memory_audit_2026-05-21.md +303 -0
- data/docs/plugin.md +1 -1
- data/lib/claude_memory/audit/checks.rb +239 -0
- data/lib/claude_memory/audit/finding.rb +33 -0
- data/lib/claude_memory/audit/runner.rb +73 -0
- data/lib/claude_memory/commands/audit_command.rb +117 -0
- data/lib/claude_memory/commands/dashboard_command.rb +2 -1
- data/lib/claude_memory/commands/import_auto_memory_command.rb +180 -0
- data/lib/claude_memory/commands/otel_command.rb +240 -0
- data/lib/claude_memory/commands/registry.rb +4 -1
- data/lib/claude_memory/configuration.rb +60 -0
- data/lib/claude_memory/core/fact_query_builder.rb +1 -0
- data/lib/claude_memory/dashboard/api.rb +8 -0
- data/lib/claude_memory/dashboard/index.html +140 -1
- data/lib/claude_memory/dashboard/prompt_journey.rb +48 -0
- data/lib/claude_memory/dashboard/server.rb +86 -0
- data/lib/claude_memory/dashboard/telemetry.rb +156 -0
- data/lib/claude_memory/deprecations.rb +106 -0
- data/lib/claude_memory/distill/reference_material_detector.rb +37 -4
- data/lib/claude_memory/hook/auto_memory_mirror.rb +7 -3
- data/lib/claude_memory/hook/context_injector.rb +11 -2
- data/lib/claude_memory/mcp/tool_definitions.rb +3 -3
- data/lib/claude_memory/otel/attributes.rb +118 -0
- data/lib/claude_memory/otel/constants.rb +32 -0
- data/lib/claude_memory/otel/ingestor.rb +54 -0
- data/lib/claude_memory/otel/otlp_json_envelope.rb +254 -0
- data/lib/claude_memory/otel/prompt_scope.rb +108 -0
- data/lib/claude_memory/otel/settings_writer.rb +122 -0
- data/lib/claude_memory/otel/status.rb +58 -0
- data/lib/claude_memory/recall/staleness_annotator.rb +73 -0
- data/lib/claude_memory/resolve/predicate_policy.rb +17 -1
- data/lib/claude_memory/resolve/resolver.rb +30 -3
- data/lib/claude_memory/shortcuts.rb +61 -18
- data/lib/claude_memory/store/prompt_journey_query.rb +87 -0
- data/lib/claude_memory/store/schema_manager.rb +1 -1
- data/lib/claude_memory/store/sqlite_store.rb +136 -0
- data/lib/claude_memory/sweep/maintenance.rb +31 -1
- data/lib/claude_memory/sweep/sweeper.rb +6 -0
- data/lib/claude_memory/version.rb +1 -1
- data/lib/claude_memory.rb +18 -0
- metadata +26 -1
data/docs/improvements.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
# Improvements to Consider
|
|
2
2
|
|
|
3
|
-
*Updated: 2026-04-28 (post-0.10.0) - Restructured 1.0 punchlist around milestone versions. **0.11.0 "Trust & Cost"** ships #47 (token budget), #48 (hallucination rate), #51 (claude-memory show), #53 (first-week ROI nudge — moved up from post-1.0), and a 3-scenario prototype of #49 (harm benchmark). **0.12.0 "Release Discipline"** ships #49 full corpus, #50 (CLAUDE.md baseline), #52 (benchmark scoreboard). **1.0.0** lands soak-validated #54/#55/#56 if time + new #59 API stability audit. See `docs/1_0_punchlist.md` for the full plan with calendar targets. Also added 2026-04-28: two ranking-signal gaps surfaced by the Mercury / "Why Karpathy's Second Brain Breaks" article (Zaid, 2026-04-28) — provenance-strength-aware ranking (#57) and reinforcement/decay scoring (#58). Earlier 2026-04-28 updates: opened the 1.0 punchlist track + added cq study. Previously: 2026-03-30 - Re-studied all 7 influencer repos. New recommendations: CLAUDE_CONFIG_DIR support (#26, from episodic-memory), Usage Stats / ROI Tracking (#27, from grepai v0.35.0). New Features to Avoid: AST-Aware Code Chunking (QMD), Custom Instructions via Env Var (lossless-claw v0.5.2), OpenClaw Context Injection (claude-mem v10.6.0). Repos with no changes: kbs (v0.2.1), claude-supermemory (v2.0.1), episodic-memory (v1.0.15). Previously: 14 features implemented through 2026-03-24.*
|
|
3
|
+
*Updated: 2026-05-23 - Added AI Memory Systems Landscape Analysis (Nakajima/Opus 4.6 Research article, 2026-03-26) — meta-study of 7 benchmarks + ~12 systems. Four High Priority items: graph traversal as third RRF source (#64), temporal-aware retrieval (#65), bi-temporal schema cleanup (#66), LongMemEval integration (#67). One promotion: improvement #57 (provenance-strength ranking) Medium → High, validated as the "soft epistemic separation" pattern. See `docs/influence/ai-memory-systems-2026.md`. Previously: 2026-05-01 - Added Strands Agent SOPs study (article, not repo) — one M-priority item (parameter blocks in skill frontmatter); rest already implemented or deferred. See `docs/influence/strands-agent-sops.md`. Previously: 2026-04-28 (post-0.10.0) - Restructured 1.0 punchlist around milestone versions. **0.11.0 "Trust & Cost"** ships #47 (token budget), #48 (hallucination rate), #51 (claude-memory show), #53 (first-week ROI nudge — moved up from post-1.0), and a 3-scenario prototype of #49 (harm benchmark). **0.12.0 "Release Discipline"** ships #49 full corpus, #50 (CLAUDE.md baseline), #52 (benchmark scoreboard). **1.0.0** lands soak-validated #54/#55/#56 if time + new #59 API stability audit. See `docs/1_0_punchlist.md` for the full plan with calendar targets. Also added 2026-04-28: two ranking-signal gaps surfaced by the Mercury / "Why Karpathy's Second Brain Breaks" article (Zaid, 2026-04-28) — provenance-strength-aware ranking (#57) and reinforcement/decay scoring (#58). Earlier 2026-04-28 updates: opened the 1.0 punchlist track + added cq study. Previously: 2026-03-30 - Re-studied all 7 influencer repos. New recommendations: CLAUDE_CONFIG_DIR support (#26, from episodic-memory), Usage Stats / ROI Tracking (#27, from grepai v0.35.0). New Features to Avoid: AST-Aware Code Chunking (QMD), Custom Instructions via Env Var (lossless-claw v0.5.2), OpenClaw Context Injection (claude-mem v10.6.0). 
Repos with no changes: kbs (v0.2.1), claude-supermemory (v2.0.1), episodic-memory (v1.0.15). Previously: 14 features implemented through 2026-03-24.*
|
|
4
4
|
*Sources:*
|
|
5
5
|
- *[thedotmack/claude-mem](https://github.com/thedotmack/claude-mem) - Memory compression system (v10.6.3, re-studied 2026-03-30)*
|
|
6
6
|
- *[obra/episodic-memory](https://github.com/obra/episodic-memory) - Semantic conversation search (v1.0.15, re-studied 2026-03-30 — no changes)*
|
|
@@ -318,6 +318,99 @@ cq is complementary to ClaudeMemory, not competing: it's an out-of-band SQL audi
|
|
|
318
318
|
|
|
319
319
|
---
|
|
320
320
|
|
|
321
|
+
## Strands Agent SOPs Study (2026-05-01)
|
|
322
|
+
|
|
323
|
+
Source: docs/influence/strands-agent-sops.md — article study (AWS Open Source Blog)
|
|
324
|
+
|
|
325
|
+
Amazon's Strands Agent SOPs describe markdown-based parameterized workflows for agents (RFC-2119 keywords, parameter blocks, sequential chaining via artifact handoff, MCP-prompt invocation). **ClaudeMemory has independently arrived at the same architecture** via Anthropic Skills (`/distill-transcripts`, `/release`, `/study-repo`), MCP `prompts/list`+`prompts/get` (`memory_guide`), and the `Ingest → Distill → Resolve → Publish` pipeline. The article is *validation*, not a roadmap.
|
|
326
|
+
|
|
327
|
+
### Medium Priority Recommendations
|
|
328
|
+
|
|
329
|
+
- [ ] **Add explicit `## Parameters` blocks to skill markdowns**
|
|
330
|
+
- Value: Self-documenting skills; Claude can prompt the user for missing parameters instead of guessing from `$ARGUMENTS`
|
|
331
|
+
- Evidence: Strands' `Required Parameters / Optional Parameters` block — the only verbatim format snippet in the article (`docs/influence/strands-agent-sops.md`)
|
|
332
|
+
- Implementation: Add `## Parameters` section to `lib/claude_memory/commands/skills/distill-transcripts.md`, `release.md`, `study-repo.md`, `quality-update.md`, `improve.md`. Format: bullet list with `name: description (default: …)`
|
|
333
|
+
- Effort: ~30 minutes total
|
|
334
|
+
- Trade-off: Tiny doc maintenance; no runtime cost
|
|
335
|
+
|
|
336
|
+
### Deferred / Avoid (from this study)
|
|
337
|
+
|
|
338
|
+
- **Progress markers + checkpoint file in `/distill-transcripts`** — UX-only improvement; DB already handles correctness. Defer until usage data shows multi-hundred-item distillation runs.
|
|
339
|
+
- **MCP-prompt-exposed skill format spec** (analog of `strands-agents-sops rule`) — solves a problem we don't have; defer until ≥3 skill-authoring locations exist.
|
|
340
|
+
- **Strands Python package** — wrong language ecosystem.
|
|
341
|
+
- **`.sop/<name>/` artifact filesystem** — would parallel our DB-as-checkpoint substrate and double the cleanup burden.
|
|
342
|
+
- **Adopting "SOP" as user-facing terminology** — Anthropic Skills is the term Claude Code users know; renaming creates confusion for zero gain.
|
|
343
|
+
|
|
344
|
+
---
|
|
345
|
+
|
|
346
|
+
## AI Memory Systems Landscape Study (2026-05-23)
|
|
347
|
+
|
|
348
|
+
Source: `docs/influence/ai-memory-systems-2026.md` — meta-study of the Nakajima/Opus 4.6 Research article surveying 7 memory benchmarks and ~12 memory systems (Hindsight, Zep/Graphiti, MemGPT/Letta, Mem0, Cognee, HippoRAG, etc.).
|
|
349
|
+
|
|
350
|
+
**Headline finding.** ClaudeMemory's retrieval profile (vector + FTS, light graph, no temporal-aware ranking) sits architecturally closest to Mem0 (49% on LongMemEval). Two unforced gaps separate us from Zep-class systems (71.2%): we already store the graph but don't traverse it at query time, and we have temporal columns we don't rank by. Closing both is ~3-5 days of work without new dependencies.
|
|
351
|
+
|
|
352
|
+
### High Priority Recommendations
|
|
353
|
+
|
|
354
|
+
- [ ] **64. Graph Traversal as Third RRF Source** ⭐
|
|
355
|
+
- Value: Field-wide validated as the difference between Mem0-class (49%) and Zep-class (71.2%) LongMemEval scores. We already store the graph (`entities`, `entity_aliases`, `fact_links`).
|
|
356
|
+
- Evidence: Article Pattern 1 + Pattern 2; our `lib/claude_memory/recall.rb` has no BFS strategy; `lib/claude_memory/core/rr_fusion.rb` fuses only vec + FTS.
|
|
357
|
+
- Implementation: Add `Recall::GraphTraversal` strategy that resolves query → seed entities → 1-2 hop BFS over `entities` ↔ `facts` ↔ `entities`, scored by hop distance × edge type. Fuse into existing RRF as a third source. Bound depth so latency stays sub-100ms.
|
|
358
|
+
- Effort: Medium (2-3 days). Data shape already correct; new strategy class + RRF integration + tests.
|
|
359
|
+
- Trade-off: Empty graphs degrade gracefully to zero rerank contribution.
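
A minimal sketch of how recommendation #64 could fit together, assuming an in-memory adjacency map in place of the real `entities` / `facts` tables. The method names `graph_ranking` and `rrf_fuse`, the hash shapes, and the constant `k: 60` are illustrative assumptions, not the existing `Core::RRFusion` API. For brevity the sketch ranks by hop distance only and leaves out the edge-type weighting mentioned above.

```ruby
require "set"

# Bounded breadth-first walk over a hypothetical in-memory graph.
# entity_facts: entity name => fact ids; fact_entities: fact id => entity names.
def graph_ranking(seed_entities, entity_facts, fact_entities, max_hops: 2)
  visited = Set.new(seed_entities)
  fact_hops = {} # fact id => hop at which the fact was first reached
  frontier = seed_entities.dup

  1.upto(max_hops) do |hop|
    next_frontier = []
    frontier.each do |entity|
      entity_facts.fetch(entity, []).each do |fact|
        next if fact_hops.key?(fact)

        fact_hops[fact] = hop
        fact_entities.fetch(fact, []).each do |neighbor|
          # Set#add? returns nil when the entity was already visited.
          next_frontier << neighbor if visited.add?(neighbor)
        end
      end
    end
    frontier = next_frontier
  end

  # Nearer facts rank first; ids break ties so the order is deterministic.
  fact_hops.sort_by { |fact, hop| [hop, fact] }.map(&:first)
end

# Reciprocal Rank Fusion: score(id) = sum over sources of 1 / (k + rank),
# where rank is 1-based within each source's ranking.
def rrf_fuse(rankings, k: 60)
  scores = Hash.new(0.0)
  rankings.each do |ranking|
    ranking.each_with_index do |id, index|
      scores[id] += 1.0 / (k + index + 1)
    end
  end
  scores.sort_by { |id, score| [-score, id] }.map(&:first)
end
```

Passing an empty graph ranking as the third source leaves the fused order identical to the two-source result, which is the graceful degradation the trade-off line describes.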
|
|
360
|
+
|
|
361
|
+
- [ ] **65. Temporal-Aware Retrieval Strategy** ⭐
|
|
362
|
+
- Value: Article identifies temporal reasoning as the hardest field-wide capability (up to 73% gap on LoCoMo). Schema already has `valid_from`, `valid_to`, `last_recalled_at`; ranker doesn't use them.
|
|
363
|
+
- Evidence: Article Pattern 3.
|
|
364
|
+
- Implementation: (1) Add `temporal_rank` input to `Core::RRFusion` — facts with newer `valid_from` get a small rank boost (capped at ~0.1× vec contribution). (2) Optional `as_of` ISO 8601 parameter on `memory.recall` filters to `valid_from <= as_of AND (valid_to IS NULL OR valid_to > as_of)`.
|
|
365
|
+
- Effort: Small (1-2 days). Existing columns; thread parameter and ranker.
|
|
366
|
+
- Trade-off: Recency over-ranking risk; cap boost weight and tune via eval harness.
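
The two halves of recommendation #65 can be sketched as follows, assuming facts are plain hashes with ISO 8601 timestamp strings. The method names, the hash shape, and the `weight: 0.1` default are assumptions made for illustration; only the column names and the `as_of` predicate come from the text above.

```ruby
require "time"

# True when the fact was valid at the given instant:
# valid_from <= as_of AND (valid_to IS NULL OR valid_to > as_of).
def valid_as_of?(fact, as_of)
  from = Time.iso8601(fact[:valid_from])
  to = fact[:valid_to] && Time.iso8601(fact[:valid_to])
  from <= as_of && (to.nil? || to > as_of)
end

# Point-in-time filter for an optional `as_of` parameter.
def filter_as_of(facts, as_of_iso8601)
  as_of = Time.iso8601(as_of_iso8601)
  facts.select { |fact| valid_as_of?(fact, as_of) }
end

# Recency as a down-weighted rank source: the newest fact receives
# weight / (k + 1), so the boost stays a small fraction of what a
# full-weight source contributes at the same rank.
def temporal_boosts(facts, k: 60, weight: 0.1)
  newest_first = facts.sort_by { |fact| Time.iso8601(fact[:valid_from]) }.reverse
  newest_first.each_with_index.map do |fact, index|
    [fact[:id], weight / (k + index + 1)]
  end.to_h
end
```

Keeping the boost as a separate weighted source means the cap can be tuned through the eval harness without touching the vector or FTS rankings.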
|
|
367
|
+
|
|
368
|
+
- [ ] **66. Bi-Temporal Schema Cleanup (world vs ingest time)**
|
|
369
|
+
- Value: Today `valid_to` does double duty — "fact ceased to be true in the world" *and* "we superseded this fact during ingestion." Article credits this distinction as Zep's most important innovation. Without it, point-in-time queries silently corrupt the temporal axis.
|
|
370
|
+
- Evidence: Article: "Every entity edge tracks four timestamps: valid_at, invalid_at, created_at, expired_at." See also our schema (`db/migrations/001_create_initial_schema.rb:64-65`).
|
|
371
|
+
- Implementation: Schema v18 migration: rename `valid_to` → `world_invalid_at`; add `ingest_expired_at` (datetime, nullable). Resolver sets `ingest_expired_at` on supersession; leaves `world_invalid_at` for explicit "this fact stopped being true on date X" updates. Backfill copies `valid_to` into both columns.
|
|
372
|
+
- Effort: Medium (2-3 days). Schema migration + resolver update + MCP tool surface + tests. Public API break — needs deprecation alias for one minor version per `docs/api_stability.md`.
|
|
373
|
+
- Trade-off: API surface change. Lower urgency than #64/#65 but cheaper to do before corpus grows.
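
To make the world-time versus ingest-time split in #66 concrete, here is a small sketch over plain hashes. It assumes rows are symbol-keyed hashes; the method names are hypothetical, and a real migration would operate on the database rather than in memory.

```ruby
# Backfill: copy the legacy `valid_to` value into both new columns.
def backfill_bitemporal(row)
  legacy = row[:valid_to]
  row.reject { |key, _value| key == :valid_to }
     .merge(world_invalid_at: legacy, ingest_expired_at: legacy)
end

# Supersession during ingestion touches only the ingest-time column.
def supersede(row, at:)
  row.merge(ingest_expired_at: at)
end

# An explicit "stopped being true on date X" update touches only world time.
def end_in_world(row, at:)
  row.merge(world_invalid_at: at)
end
```

After the split, a superseded fact keeps a nil `world_invalid_at`, so a point-in-time query can still tell "replaced in our store" apart from "no longer true".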
|
|
374
|
+
|
|
375
|
+
- [ ] **67. LongMemEval Benchmark Integration** ⭐
|
|
376
|
+
- Value: Article calls LongMemEval the "gold standard" — the only benchmark it describes as rigorous. Without an external benchmark score, we can't credibly position ClaudeMemory against the field.
|
|
377
|
+
- Evidence: Article — Wu et al. ICLR 2025, 500 questions across 115K-1.5M token contexts, three-stage framework with LLM-as-judge.
|
|
378
|
+
- Implementation: Add `spec/benchmarks/longmemeval/` adapter. Dataset is public. Wire into `bin/run-evals --longmemeval`. Report Recall@k, MRR, nDCG@10 like DevMemBench.
|
|
379
|
+
- Effort: Medium (2-4 days). Mostly dataset wrangling + adapter code; existing DevMemBench pipeline has the right shape.
|
|
380
|
+
- Trade-off: Real-mode runs (with LLM judge) cost API spend. Mitigation: stub mode for retrieval-only, real mode opt-in.
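
For reference, the retrieval metrics named in #67 can be computed per query roughly as below. This is a sketch with binary relevance and hypothetical method names; it is not taken from the DevMemBench pipeline, whose exact definitions may differ.

```ruby
# Fraction of relevant ids that appear in the top k results.
def recall_at_k(ranked, relevant, k)
  return 0.0 if relevant.empty?

  hits = ranked.first(k).count { |id| relevant.include?(id) }
  hits.to_f / relevant.size
end

# Reciprocal rank of the first relevant result; MRR is the mean over queries.
def reciprocal_rank(ranked, relevant)
  index = ranked.index { |id| relevant.include?(id) }
  index ? 1.0 / (index + 1) : 0.0
end

# nDCG@k with binary gains: DCG of the actual ranking over the ideal DCG.
def ndcg_at_k(ranked, relevant, k)
  gains = ranked.first(k).each_with_index.map do |id, index|
    relevant.include?(id) ? 1.0 / Math.log2(index + 2) : 0.0
  end
  ideal_hits = [relevant.size, k].min
  ideal = (0...ideal_hits).map { |index| 1.0 / Math.log2(index + 2) }
  ideal.empty? ? 0.0 : gains.sum / ideal.sum
end
```

These three need only ranked ids and a relevance set, so they fit the stub (retrieval-only) mode; the LLM judge is needed only for answer-quality scoring in real mode.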
|
|
381
|
+
|
|
382
|
+
### Promotion (existing improvement, article-validated)
|
|
383
|
+
|
|
384
|
+
- [ ] **#57 Provenance-Strength-Aware Retrieval Ranking** — promote from Medium to High Priority
|
|
385
|
+
- Rationale: Article describes Hindsight's "epistemic separation" (4 networks: world facts / agent experiences / entity observations / evolving opinions) as a key innovation. Our `provenance.strength` ∈ {stated, inferred, derived} is the soft version of this — already in the schema, just not used by the ranker. This article promotes the change from "nice to have" to "fits the field-wide pattern."
|
|
386
|
+
- Implementation unchanged from existing #57 entry.
|
|
387
|
+
|
|
388
|
+
### Medium Priority
|
|
389
|
+
|
|
390
|
+
- [ ] **Reflect Pass — Background Consolidation on Idle** (see influence doc rec #5)
|
|
391
|
+
- Value: Hindsight's reflect operation and Letta's sleep-time compute both re-examine stored facts using a background process. Article credits this with preventing noise growth at scale. We don't have it; today our corpus is small enough not to need it.
|
|
392
|
+
- Recommendation: Track when largest project DB crosses 5K facts. Until then, premature. **CONSIDER for 1.0.0 or later.**
|
|
393
|
+
|
|
394
|
+
- [ ] **`memory.save_this` Tool — Agent-Initiated Storage** (see influence doc rec #6)
|
|
395
|
+
- Value: Letta's striking result (74% vs Mem0's 68.5% on LoCoMo) suggests agent-controlled "save this" beats passive extraction. We have `memory.store_extraction` but it's framed as "report an extraction," not "I want to remember this."
|
|
396
|
+
- Implementation: Thin wrapper over `store_extraction` with friendlier prompt. Document in MCP `memory_guide` prompt.
|
|
397
|
+
- Effort: Small (1 day).
|
|
398
|
+
- Recommendation: **CONSIDER** in 0.13.0 if first-week usage shows agents under-use `store_extraction` proactively.
|
|
399
|
+
|
|
400
|
+
### Features to Avoid (from this study)
|
|
401
|
+
|
|
402
|
+
- **Cross-encoder LLM reranking** — Article confirms cost as the reason (already in our avoid list).
|
|
403
|
+
- **Full 4-column Graphiti timestamp model** — Recommendation #66 above adopts the simpler 3-timestamp version (world_invalid_at + ingest_expired_at + created_at).
|
|
404
|
+
- **Hindsight 4-network hard epistemic split** — Over-complex for our scale; recommendation #57 promotion is the soft version.
|
|
405
|
+
- **Cloud-required graph DB** (Neo4j / FalkorDB) — Recommendation #64 traverses the graph we already have in SQLite.
|
|
406
|
+
- **Custom fine-tuned models in any pipeline stage** — Article confirms architecture > model size; we can't compete on model investment anyway.
|
|
407
|
+
- **LoCoMo benchmark for cross-vendor comparison** — Article explicitly discredits it: "Mem0 and Zep have publicly contradicted each other's reported scores, making LoCoMo rankings unreliable for cross-vendor comparison." If we cite LoCoMo at all, cite our own number standalone.
|
|
408
|
+
- **Cognee-style RDF/OWL ontology validation** — Our `entity_aliases` + `PredicatePolicy::SYNONYMS` are the right-sized version for a single-developer tool.
|
|
409
|
+
- **Letta-style filesystem-only memory as primary mode** — Consumes user-visible tokens on every interaction; our hook-based passive ingestion is cheaper per session.
|
|
410
|
+
- **Sleep-time compute as a separate background service** — We can achieve the same effect on the next SessionStart via Layer 2 distillation, for free. No separate process needed.
|
|
411
|
+
|
|
412
|
+
---
|
|
413
|
+
|
|
321
414
|
## Medium Priority
|
|
322
415
|
|
|
323
416
|
### ~~18. Shell Completion for CLI~~ ✅ Implemented 2026-03-20
|
|
@@ -446,7 +539,9 @@ Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #10). Builds on
|
|
|
446
539
|
|
|
447
540
|
---
|
|
448
541
|
|
|
449
|
-
### 59. API Stability Audit (
|
|
542
|
+
### 59. API Stability Audit (promoted to 0.12.0 — 2026-05-01)
|
|
543
|
+
|
|
544
|
+
*Originally slated as 1.0 release blocker; promoted to 0.12 because #52's benchmark scoreboard needs an explicit "what surfaces are stable" list to know what counts as a regression vs. internal change. The deprecation-warning module is also a prerequisite for any soft-rename work surfaced during the 0.12 → 1.0 soak.*
|
|
450
545
|
|
|
451
546
|
Source: 2026-04-28 path-to-1.0 review (`docs/1_0_punchlist.md` #11). Added after 0.10.0 ship. *(Renumbered from #57 to #59 during rebase against origin/main on 2026-04-28 — Mercury-article PR #5 had already taken #57 and #58.)*
|
|
452
547
|
|
|
@@ -527,18 +622,23 @@ Source: 2026-04-30 production verification of #48 hallucination-rate metric. Sur
|
|
|
527
622
|
|
|
528
623
|
Source: 2026-04-30 #60 investigation, cause 4. All 27 rejected facts in this project's 7-day window were `uses_database` (18) or `deployment_platform` (9) with `session_id=nil` (MCP-originated), all from a 2-day burst on 2026-04-23 to 04-24. The pattern: when running `/study-repo` on an external project, the LLM extracted that project's tech stack and asserted it as facts about *this* project. Cleanup happened correctly via `claude-memory reject` after detection, but the round-trip is wasteful and noisy.
|
|
529
624
|
|
|
530
|
-
**
|
|
625
|
+
**Phase 1 — prompt fix (LANDED 2026-05-01).**
|
|
626
|
+
|
|
627
|
+
`.claude/skills/study-repo/SKILL.md` gained a top-level "CRITICAL: Memory Discipline" section that explicitly forbids the LLM from calling `memory.store_extraction` with the studied project's tech stack as `uses_database` / `uses_framework` / `uses_language` / `deployment_platform` / `auth_method`. Allowed: `predicate=reference` for descriptions of the external project, plus genuine project-facing decisions/conventions/architecture derived from contrast (with reason clauses). The influence document (`docs/influence/<project>.md`) is named as the right home for "what tech does the studied project use" observations, taking memory entirely out of that loop.
|
|
628
|
+
|
|
629
|
+
**Phase 2 — defense-in-depth detector (DEFERRED to 0.12.x or later).**
|
|
531
630
|
|
|
532
|
-
-
|
|
533
|
-
|
|
534
|
-
-
|
|
631
|
+
If the prompt fix isn't enough on its own — measured by re-running `/study-repo` against ≥3 external projects post-2026-05-01 and counting any `uses_database`/`deployment_platform` rows that appear with non-self subjects — build `Distill::ExternalAttributionDetector` as a sister to `ReferenceMaterialDetector`. Heuristics: source content_item text containing "studying X", "/study-repo", a non-current-project repo URL, or "external project" → bias single-value-cardinality extractions toward `predicate=reference`.
|
|
632
|
+
|
|
633
|
+
False-positive risk to handle: legitimate facts ABOUT this project that mention an external one ("ClaudeMemory adopts SessionStart hook context injection like claude-supermemory does") must still land as `decision` with reason clause, not be retagged. Solution if needed: detector requires both (a) external-project marker in source AND (b) the extracted subject not being the current project's repo entity.
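
A rough sketch of the two-condition rule described above, should Phase 2 be needed. Everything here is an assumption for illustration: the method names, the marker patterns, the restriction to GitHub URLs, and the choice of which predicates count as single-value. It is not the planned `Distill::ExternalAttributionDetector` implementation.

```ruby
# Text markers suggesting the source is about an external project.
EXTERNAL_MARKERS = [
  /\bstudying\s+\S+/i,
  %r{/study-repo},
  /\bexternal project\b/i
].freeze

# Assumed subset of single-value-cardinality predicates.
SINGLE_VALUE_PREDICATES = %w[uses_database deployment_platform].freeze

# Condition (a): the source carries an external-project marker,
# either a phrase above or a repo URL other than the current project's.
def external_marker?(text, current_repo_url:)
  return true if EXTERNAL_MARKERS.any? { |pattern| text.match?(pattern) }

  urls = text.scan(%r{https?://github\.com/[\w.-]+/[\w.-]+})
  urls.any? { |url| url != current_repo_url }
end

# Retag only when (a) holds AND (b) the subject is not the current project.
def retag_external(extraction, source_text:, current_subject:, current_repo_url:)
  return extraction unless SINGLE_VALUE_PREDICATES.include?(extraction[:predicate])
  return extraction if extraction[:subject] == current_subject
  return extraction unless external_marker?(source_text, current_repo_url: current_repo_url)

  extraction.merge(predicate: "reference")
end
```

Because the subject check runs before the marker check, a `decision` or any fact about the current project passes through untouched even when the source text mentions an external repo.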
|
|
535
634
|
|
|
536
635
|
**Acceptance.**
|
|
537
636
|
|
|
538
|
-
-
|
|
539
|
-
-
|
|
637
|
+
- After Phase 1: re-run `/study-repo` on a fresh DB; observe zero `uses_database` or `deployment_platform` facts inserted that point to the external project's tech.
|
|
638
|
+
- After Phase 1: the 27-fact cluster pattern doesn't reappear in similar `/study-repo` sessions.
|
|
639
|
+
- Phase 2 trigger: only build if Phase 1 measurement shows persistent leakage.
|
|
540
640
|
|
|
541
|
-
**Effort.**
|
|
641
|
+
**Effort.** Phase 1: 15 minutes (done). Phase 2 (if needed): ~½ day for detector + tests.
|
|
542
642
|
|
|
543
643
|
---
|
|
544
644
|
|
|
@@ -566,6 +666,45 @@ C. **Retroactive rejection.** Mark them all `status=rejected`. Cheap and clean b
|
|
|
566
666
|
|
|
567
667
|
---
|
|
568
668
|
|
|
669
|
+
### 63. Pre-Release Hook Smoke Gate (0.12.0)
|
|
670
|
+
|
|
671
|
+
Source: 2026-04-30 verification incident during 0.11 work. Five commits landed for #47 token-budget telemetry with 156 specs green. The user asked "did you actually run claude-memory show on this project?" — at which point a smoke test revealed the installed gem was still 0.9.1 and 24 hours of real SessionStart hook events had recorded no `context_tokens` field. The bug was not in the code; the bug was in the *release process* — specs verify code correctness against the working tree, but production hooks invoke the installed gem via PATH. Without `rake install`, every hook/MCP code change is dead in production.
|
|
672
|
+
|
|
673
|
+
This already lives in memory (`feedback_hooks_run_installed_gem.md`) and as two project conventions stored via `memory.store_extraction`. It's a known trap that I (Claude) hit anyway. **Codify it into the release pipeline so the trap can't be sprung again.**
|
|
674
|
+
|
|
675
|
+
**Implementation.**
|
|
676
|
+
|
|
677
|
+
- **New `bin/pre-release-smoke`** script that:
|
|
678
|
+
1. Runs `bundle exec rake install` (rebuild gem from current working tree).
|
|
679
|
+
2. Verifies `which claude-memory` resolves to the installed-gem binary (sanity check).
|
|
680
|
+
3. Triggers each gem-managed hook event with a synthetic payload via stdin: `claude-memory hook context`, `claude-memory hook ingest --db /tmp/smoke.sqlite3`, `claude-memory hook nudge`, etc. — populates a temp DB.
|
|
681
|
+
4. Inspects `activity_events` table via `sqlite3 json_extract` for the fields the current version is supposed to record. Specifically:
|
|
682
|
+
- `hook_context` events should carry both `context_length` and `context_tokens` (since 0.11.0).
|
|
683
|
+
- `roi_nudge` events should carry `n`, `used`, `pct`, `prior_count` (since 0.11.0).
|
|
684
|
+
- Any field added in a future release joins this checklist.
|
|
685
|
+
5. Exits non-zero if any expected field is null or absent.
|
|
686
|
+
- **Per-version expectation manifest** at `spec/smoke/expected_fields.yml` — declarative list of `{event_type, fields, since_version}` so the script doesn't need code changes when a new field lands; just append to the YAML and the gate enforces it on the next release.
|
|
687
|
+
- **`/release` skill integration.** Phase 1 Step 5b (after specs, before lint) runs `bin/pre-release-smoke`. Failure aborts the release with the field name(s) that were null. Skill description gains a one-line "verifies installed gem actually fires hooks correctly".
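
The manifest-driven check could look roughly like the sketch below. The YAML keys mirror the `{event_type, fields, since_version}` shape described above, but the `missing_fields` method, the event hash shape, and the inline manifest are assumptions for illustration; a real script would read `spec/smoke/expected_fields.yml` and query the temp DB instead.

```ruby
require "rubygems"
require "yaml"
require "json"

# Inline stand-in for the manifest file (hypothetical contents).
MANIFEST_YAML = <<~YAML
  - event_type: hook_context
    fields: ["context_length", "context_tokens"]
    since_version: "0.11.0"
  - event_type: roi_nudge
    fields: ["n", "used", "pct", "prior_count"]
    since_version: "0.11.0"
YAML

# Returns a list of "event_type.field" strings that are null or absent.
# events: array of hashes with "event_type" and a JSON "detail_json" string.
def missing_fields(events, manifest, current_version)
  current = Gem::Version.new(current_version)
  manifest.flat_map do |entry|
    # Skip expectations introduced after the version being released.
    next [] if Gem::Version.new(entry["since_version"]) > current

    matching = events.select { |event| event["event_type"] == entry["event_type"] }
    next ["#{entry["event_type"]}: no events recorded"] if matching.empty?

    matching.flat_map do |event|
      detail = JSON.parse(event["detail_json"] || "{}")
      entry["fields"].select { |field| detail[field].nil? }
                     .map { |field| "#{entry["event_type"]}.#{field}" }
    end.uniq
  end
end
```

A wrapper script would load the manifest with `YAML.safe_load`, print the returned names, and exit non-zero whenever the list is non-empty, so adding a new expectation stays a YAML-only change.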
|
|
688
|
+
|
|
689
|
+
**Acceptance.**
|
|
690
|
+
|
|
691
|
+
- `bin/pre-release-smoke` exits 0 when the installed gem matches the working tree and all expected fields populate.
|
|
692
|
+
- Deleting the `context_tokens:` line from `Hook::Handler#context` and re-running `bin/pre-release-smoke` produces a clear error pointing at the missing field on `hook_context.detail_json`.
|
|
693
|
+
- `/release` skill aborts Phase 1 if the smoke gate fails — never reaches `git push`.
|
|
694
|
+
- Test: `spec/smoke/pre_release_smoke_spec.rb` verifies the manifest schema and that the script's exit-code logic flips on simulated null fields.
|
|
695
|
+
|
|
696
|
+
**Edge cases.**
|
|
697
|
+
|
|
698
|
+
- The script uses a temp DB so it can't pollute the user's project DB. Cleans up on exit.
|
|
699
|
+
- If `rake install` fails (gemspec validation, signing, etc.), the script reports that as a separate failure mode, not a smoke-gate failure.
|
|
700
|
+
- The `hook nudge` synthetic payload needs a `session_id` of a real session that contributed facts — the script can pre-seed one fact and use a dedicated `smoke-test-NNNN` session id.
|
|
701
|
+
|
|
702
|
+
**Effort.** ~½ day for the script + manifest + skill integration. Spec is the bulk of the time.
|
|
703
|
+
|
|
704
|
+
**Why this release.** The 0.11 verification gap directly motivated this. Release Discipline that fails to catch a trap already hit twice (#47 today, plus the 2026-04-16 ActivityLog incident in `feedback_hooks_run_installed_gem.md`) isn't real discipline. Pairs naturally with #52: the scoreboard catches regressions in measurement; the smoke gate catches the regression where the measurement itself doesn't fire.
|
|
705
|
+
|
|
706
|
+
---
|
|
707
|
+
|
|
569
708
|
### 21. Incremental Indexing with File Watching
|
|
570
709
|
|
|
571
710
|
Source: grepai study (reinforced 2026-03-02)
|
|
@@ -0,0 +1,403 @@
|
|
|
1
|
+
# AI Memory Systems Landscape Analysis (2026)
|
|
2
|
+
|
|
3
|
+
*Analysis Date: 2026-05-23*
|
|
4
|
+
*Source: "The state of AI memory systems: benchmarks, architectures, and what actually works"*
|
|
5
|
+
*Author: Yohei Nakajima (compiled by Claude Opus 4.6 Research)*
|
|
6
|
+
*Source URL: https://x.com/yoheinakajima/status/2037201711937577319*
|
|
7
|
+
*Type: Meta-study (article, not single repository)*
|
|
8
|
+
|
|
9
|
+
---
|
|
10
|
+
|
|
11
|
+
## Executive Summary
|
|
12
|
+
|
|
13
|
+
### What this is
|
|
14
|
+
|
|
15
|
+
This is a **field survey**, not a single-repo study. The article reviews seven memory benchmarks and ~12 open-source memory systems published 2024-2026, ranks them by performance, and extracts five architectural patterns that separate top performers from the rest. Unlike a `/study-repo` of one codebase, the unit of analysis is **architectural choices that correlate with benchmark wins**.
|
|
16
|
+
|
|
17
|
+
### Key finding from the article
|
|
18
|
+
|
|
19
|
+
> "Architecture matters more than model size. A 20B-parameter model with Hindsight's multi-strategy memory achieves 83.6% on LongMemEval, dramatically outperforming full-context GPT-4o at 60.2%."
|
|
20
|
+
|
|
21
|
+
The field is converging on a specific template: **hybrid vector+graph storage, multi-strategy retrieval with reranking, explicit temporal modeling, and active memory consolidation**. Pure vector-store approaches (Mem0) plateau around 49% on LongMemEval; graph-native systems (Zep) reach 71%; multi-strategy systems (Hindsight) break 90%.
|
|
22
|
+
|
|
23
|
+
### Why ClaudeMemory cares
|
|
24
|
+
|
|
25
|
+
ClaudeMemory sits architecturally closest to Mem0 (vector + light graph via entity_aliases and fact_links, SQLite-only, LLM-light extraction). The article quantifies the cost of that choice — 22-point gap to Zep on LongMemEval, ~42-point gap to Hindsight. We don't need to chase those numbers, but the gaps tell us where our retrieval will *predictably* fail (multi-hop, temporal reasoning, conflict resolution at scale) and what's worth adopting given our local-first, no-cloud, single-developer constraints.
|
|
26
|
+
|
|
27
|
+
### Systems surveyed in the article (for cross-reference)
|
|
28
|
+
|
|
29
|
+
| System | Architecture | LongMemEval | LoCoMo | License |
|
|
30
|
+
|--------|--------------|-------------|--------|---------|
|
|
31
|
+
| Hindsight (Vectorize) | 4 networks + 4-strategy retrieval + cross-encoder | **91.4%** | 89.61% | MIT |
|
|
32
|
+
| Zep / Graphiti | Bi-temporal knowledge graph | 71.2% | 75.14% (disputed) | Apache 2.0 |
|
|
33
|
+
| MemGPT / Letta | OS-style hierarchy + agent-controlled | n/a | 74.0% (filesystem variant) | Apache 2.0 |
|
|
34
|
+
| Mem0 | Vector + optional graph, LLM-orchestrated CRUD | ~49% | 66.9-68.5% | Apache 2.0 |
|
|
35
|
+
| Cognee | Graph + vector + relational + ontology validation | n/a | n/a (self-reported wins) | Apache 2.0 |
|
|
36
|
+
| HippoRAG | Hippocampal indexing + Personalized PageRank | n/a | n/a | MIT |
|
|
37
|
+
| Letta (filesystem) | Simple file tools + agent capability | n/a | 74.0% | Apache 2.0 |
|
|
38
|
+
|
|
39
|
+
None of these were cloned for this study — the article itself is the primary source. Source-level file:line references in this document are to **ClaudeMemory** code, for adoption assessment.
|
|
40
|
+
|
|
41
|
+
### Production readiness assessment (article-derived)
|
|
42
|
+
|
|
43
|
+
- **Most mature**: Zep/Graphiti (24K stars, enterprise customers, Apache 2.0)
|
|
44
|
+
- **Best-published benchmarks**: Hindsight (MIT, but optimized for Vectorize-as-a-service)
|
|
45
|
+
- **Best fit for local-first**: Cognee (file-based defaults, swappable to cloud DBs) and Letta (open agent file format)
|
|
46
|
+
- **Most disputed**: LoCoMo benchmark itself — Mem0 and Zep publicly contradict each other's scores; the article calls LoCoMo "unreliable for cross-vendor comparison."
|
|
47
|
+
|
|
48
|
+
---
|
|
49
|
+
|
|
50
|
+
## Architecture Overview
|
|
51
|
+
|
|
52
|
+
### The Five Patterns (article's central claim)
|
|
53
|
+
|
|
54
|
+
The article identifies five patterns where the correlation with benchmark performance is "nearly linear":
|
|
55
|
+
|
|
56
|
+
1. **Multi-strategy retrieval** is the single biggest differentiator. Hindsight (4 strategies, 91.4%) > Zep (3 strategies, 71.2%) > Mem0 (1-2 strategies, 49%).
|
|
57
|
+
2. **Graph structure is essential for complex reasoning, vector for breadth.** Every top system uses hybrid storage.
|
|
58
|
+
3. **Temporal modeling correlates with the largest gains.** Systems with explicit temporal models score 20-60 points higher on temporal queries.
|
|
59
|
+
4. **Active memory consolidation prevents degradation at scale.** Top systems all run a background "refine/invalidate/prune" pass.
|
|
60
|
+
5. **Agent-controlled memory can outperform specialized infrastructure.** Letta's filesystem approach beat Mem0's purpose-built memory by 5.5 points on LoCoMo.
|
|
61
|
+
|
|
62
|
+
### Comparison Table: ClaudeMemory vs. the field
|
|
63
|
+
|
|
64
|
+
| Capability | Hindsight | Zep | Letta | Mem0 | Cognee | **ClaudeMemory** |
|
|
65
|
+
|-----------|-----------|-----|-------|------|--------|------------------|
|
|
66
|
+
| Vector search | ✅ cosine | ✅ cosine | ✅ pgvector | ✅ Qdrant | ✅ LanceDB | ✅ sqlite-vec (vec0) |
|
|
67
|
+
| BM25 / FTS | ✅ | ✅ | ❌ | ❌ | ❌ | ✅ FTS5 |
|
|
68
|
+
| Graph traversal | ✅ | ✅ BFS | ❌ | ✅ (graph variant) | ✅ | ⚠️ partial (entity_aliases, fact_links — no traversal API) |
|
|
69
|
+
| Temporal-aware retrieval | ✅ dual timestamps | ✅ bi-temporal | ❌ | ⚠️ basic | ✅ | ⚠️ valid_from/valid_to in schema, not in ranking |
|
|
70
|
+
| Reranking | ✅ cross-encoder | ✅ RRF/MMR/cross-encoder | ❌ | ❌ | ❌ | ✅ RRF (`lib/claude_memory/core/rr_fusion.rb:1`) |
|
|
71
|
+
| Reflection/consolidation | ✅ reflect op | ✅ invalidate-not-delete | ✅ sleep-time compute | ✅ LLM CRUD | ✅ memify | ⚠️ supersession + sweep TTLs; no LLM reflect step |
|
|
72
|
+
| Agent-controlled writes | ❌ | ❌ | ✅ core operation | ❌ | ❌ | ⚠️ `memory.store_extraction` exists but ingestion is mostly passive via hooks |
|
|
73
|
+
| Bi-temporal (valid+ingest time) | ✅ | ✅ 4 timestamps | ❌ | ❌ | ⚠️ | ❌ (only valid_from/valid_to + created_at) |
|
|
74
|
+
| Fact / opinion separation | ✅ 4 networks | ⚠️ episode vs semantic | ⚠️ human vs persona block | ❌ | ❌ | ❌ (single facts table, all predicates equal) |
|
|
75
|
+
| Latency target | n/a | <200ms-1s P95 | varies | 1.4s P95 | n/a | hook context + recall: typically <100ms for SQLite read |
|
|
76
|
+
| Ingestion cost | high (parallel strategies) | hours for large corpora (many LLM calls) | low | low | medium | **low** (Layer 1 NullDistiller is free; Layer 2 piggybacks on Claude Code session) |
|
|
77
|
+
|
|
78
|
+
ClaudeMemory's profile: **vector + FTS + light graph hints, no traversal, no temporal-aware ranking, no reflection pass, mostly passive ingestion.** Closest peer: Mem0 base variant — which the article scores at ~49% on LongMemEval. The features we've explicitly *rejected* (cross-encoder reranking, LLM query expansion, custom fine-tuned models — see `docs/improvements.md` "Features to Avoid") are the same ones Hindsight uses to break 90%. The article suggests we are correctly trading some peak benchmark score for cost/latency/local-first, but it also names two things we **didn't** trade away by choice but simply haven't built: temporal-aware ranking and explicit graph traversal.
|
|
79
|
+
|
|
80
|
+
---
|
|
81
|
+
|
|
82
|
+
## Key Components Deep-Dive
|
|
83
|
+
|
|
84
|
+
This section maps each pattern from the article to ClaudeMemory's current implementation and the gap, with `file:line` references to **our** code (the studied systems weren't cloned).
|
|
85
|
+
|
|
86
|
+
### 1. Multi-Strategy Retrieval
|
|
87
|
+
|
|
88
|
+
**The article's claim.** Hindsight runs four concurrent retrieval strategies (cosine semantic similarity, BM25 keyword matching, graph traversal across the shared memory graph, temporal reasoning) and fuses them with cross-encoder reranking. On temporal queries specifically, this took accuracy from a 31.6% baseline to 91.0%, a gain of roughly 60 points (59.4). Zep's three strategies (cosine + BM25 + BFS graph traversal) hit 71.2% on LongMemEval. Mem0's 1-2 strategies score 49%.
|
|
89
|
+
|
|
90
|
+
**What we have.**
|
|
91
|
+
|
|
92
|
+
- Vector (`lib/claude_memory/index/vector_index.rb`): sqlite-vec native KNN.
|
|
93
|
+
- BM25/FTS5 (`lib/claude_memory/index/lexical_fts.rb`): SQLite FTS5 full-text.
|
|
94
|
+
- Fusion (`lib/claude_memory/core/rr_fusion.rb:1`): Reciprocal Rank Fusion of vec + FTS, with optional `score_trace` for debugging.
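
As a rough illustration of what Reciprocal Rank Fusion does with the vec and FTS result lists, here is a minimal sketch. It is not taken from `rr_fusion.rb`; the method name `rrf_fuse`, the constant `k = 60` (a commonly used default), and the sample fact ids are all assumptions made for the example.

```ruby
# Reciprocal Rank Fusion: every source adds 1 / (k + rank) to each item it returns.
# k = 60 is a commonly used default, assumed here; it is not read from rr_fusion.rb.
def rrf_fuse(rankings, k: 60)
  scores = Hash.new(0.0)
  rankings.each_value do |ids|
    ids.each_with_index do |id, index|
      scores[id] += 1.0 / (k + index + 1) # ranks are 1-based
    end
  end
  # Highest fused score first; ties broken by id so the order is stable.
  scores.sort_by { |id, score| [-score, id] }.map(&:first)
end

vec_hits = [101, 102, 103] # hypothetical fact ids, best first
fts_hits = [102, 104, 101]

fused = rrf_fuse({ vec: vec_hits, fts: fts_hits })
```

Because each source contributes only through rank positions, adding another strategy means passing one more ordered id list; no score normalization across sources is needed.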
|
|
95
|
+
|
|
96
|
+
**What we don't have.**
|
|
97
|
+
|
|
98
|
+
- **No graph traversal as a retrieval strategy.** We store entity nodes in `entities`/`entity_aliases` and supersession/conflict edges in `fact_links`, but no MCP tool walks them from a seed entity. `memory.fact_graph` returns immediate-neighbor facts for one fact_id; it doesn't BFS from a query.
|
|
99
|
+
- **No temporal-aware retrieval strategy.** We have `valid_from`, `valid_to`, `last_recalled_at` columns but the ranker doesn't use them as a third RRF input.
|
|
100
|
+
|
|
101
|
+
**Why this matters per the article.** "BM25 catches exact mentions that embedding search misses; graph traversal finds multi-hop connections invisible to flat similarity; temporal filtering prevents returning outdated facts." Of Zep's three strategies (cosine, BM25, graph BFS) we have two; the missing one, graph BFS, is the one the article credits for Zep's 22-point lead over Mem0.
|
|
102
|
+
|
|
103
|
+
### 2. Hybrid Vector + Graph Storage
|
|
104
|
+
|
|
105
|
+
**The article's claim.** Pure-vector systems plateau at ~50% on LongMemEval; graph-native systems reach 71%+. "The specific graph implementation matters less than having one — Neo4j, FalkorDB, and custom in-memory graphs all appear in high-performing systems."
|
|
106
|
+
|
|
107
|
+
**What we have.** A subject-predicate-object fact table with entity nodes and edges between facts (`fact_links` for supersession + conflict). This is graph-shaped data but we don't expose it as a traversable graph at query time.
|
|
108
|
+
|
|
109
|
+
**What we don't have.** A `BFS from entity X over relationship type Y` capability. The article specifically calls out that this finds multi-hop connections invisible to similarity search ("Who recommended the architecture decision we're using for storage?" — needs entity-resolved hops, not text overlap).
|
|
110
|
+
|
|
111
|
+
**Why this matters per the article.** Quote: "The specific graph implementation matters less than having one." We have graph-shaped storage but no graph-shaped retrieval, which is the worst of both worlds if we don't fix this.
|
|
112
|
+
|
|
113
|
+
### 3. Explicit Temporal Modeling
|
|
114
|
+
|
|
115
|
+
**The article's claim.** Temporal reasoning is the hardest capability across every benchmark (up to 73% human-vs-system gap on LoCoMo). Hindsight stores **dual timestamps** (occurrence time + mention time) — "what happened when" vs "what did I learn when." Zep's **bi-temporal model** tracks four timestamps per edge:
|
|
116
|
+
|
|
117
|
+
- `valid_at` — when the fact became true in the world
|
|
118
|
+
- `invalid_at` — when it was superseded
|
|
119
|
+
- `created_at` — when Graphiti ingested it
|
|
120
|
+
- `expired_at` — when the record was logically replaced
|
|
121
|
+
|
|
122
|
+
This enables point-in-time queries ("What did we know about X on date Y?") and full audit trails.
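
To make the two time axes concrete, the sketch below separates "what was true at time T" from "what was in the store at time T" using the four timestamps listed above. The `Edge` struct, the helper names, and the sample data are hypothetical; this shows the idea only, not Zep's or Graphiti's actual API.

```ruby
require "time"

# Hypothetical edge record carrying the four timestamps listed above.
Edge = Struct.new(:fact, :valid_at, :invalid_at, :created_at, :expired_at, keyword_init: true)

def ts(str)
  Time.iso8601(str)
end

# True when `time` falls in the half-open window [from, to); a nil `to` means still open.
def in_window?(time, from, to)
  from <= time && (to.nil? || time < to)
end

# World-time axis: which facts were true in the world at `time`?
def true_at(edges, time)
  edges.select { |e| in_window?(time, e.valid_at, e.invalid_at) }.map(&:fact)
end

# Ingest-time axis: which records were live in the store at `time`?
def known_at(edges, time)
  edges.select { |e| in_window?(time, e.created_at, e.expired_at) }.map(&:fact)
end

edges = [
  Edge.new(fact: "storage is Postgres",
           valid_at: ts("2023-01-01T00:00:00Z"), invalid_at: ts("2024-06-01T00:00:00Z"),
           created_at: ts("2023-02-01T00:00:00Z"), expired_at: ts("2024-07-01T00:00:00Z")),
  Edge.new(fact: "storage is SQLite",
           valid_at: ts("2024-06-01T00:00:00Z"), invalid_at: nil,
           created_at: ts("2024-07-01T00:00:00Z"), expired_at: nil)
]

mid_june = ts("2024-06-15T00:00:00Z")
world_view = true_at(edges, mid_june)   # the switch had already happened
store_view = known_at(edges, mid_june)  # but it had not been ingested yet
```

In mid-June the world had already moved to SQLite while the store still held the Postgres record. A single end-of-validity column cannot express that difference.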
|
|
123
|
+
|
|
124
|
+
**What we have.**
|
|
125
|
+
|
|
126
|
+
- `valid_from` / `valid_to` (`db/migrations/001_create_initial_schema.rb:64-65`) — world-time validity window.
|
|
127
|
+
- `created_at` — ingest time.
|
|
128
|
+
- `last_recalled_at` (schema v17) — access time.
|
|
129
|
+
|
|
130
|
+
**What we don't have.**
|
|
131
|
+
|
|
132
|
+
- No `invalid_at` / `expired_at` distinction. We set `valid_to` when superseded and `status='superseded'` — but `valid_to` conflates "world-time end" and "ingest-time supersession." A fact retroactively learned to have been false in 2023 and a fact superseded today look identical in the schema.
|
|
133
|
+
- No temporal-aware retrieval. `Recall` queries don't weight by recency, and `memory.recall` doesn't accept "as of <date>" filters.
|
|
134
|
+
|
|
135
|
+
**Why this matters per the article.** This is the field-wide weakness. Systems "with explicit temporal modeling consistently score 20-60 points higher on temporal queries than systems treating time as metadata." We're currently in the "metadata" camp.
|
|
136
|
+
|
|
137
|
+
### 4. Active Memory Consolidation
|
|
138
|
+
|
|
139
|
+
**The article's claim.** Systems that accumulate without consolidating suffer noise growth. The article catalogs five consolidation strategies:
|
|
140
|
+
|
|
141
|
+
- **Hindsight reflect** — updates beliefs based on new evidence.
|
|
142
|
+
- **Zep invalidate-not-delete** — contradicted facts are marked invalid, preserving history.
|
|
143
|
+
- **Cognee memify** — prunes stale nodes, strengthens frequent connections, derives new facts.
|
|
144
|
+
- **Letta sleep-time compute** — background agent processes facts during idle time using stronger/slower models, producing refined "learned context."
|
|
145
|
+
- **Mem0 LLM-CRUD** — ADD / UPDATE / DELETE / NOOP decided per-extraction by an LLM.
|
|
146
|
+
|
|
147
|
+
**What we have.**
|
|
148
|
+
|
|
149
|
+
- Supersession with provenance preservation (`lib/claude_memory/resolve/resolver.rb:126-149`) — closest to Zep's invalidate-not-delete (we set status=`superseded` and keep the row).
|
|
150
|
+
- Sweep with TTL escalation (`lib/claude_memory/sweep/maintenance.rb`) — closest to Cognee's pruning.
|
|
151
|
+
- Conflict detection — adjacent to Mem0's LLM-CRUD but rule-based, not LLM-driven.
|
|
152
|
+
|
|
153
|
+
**What we don't have.**
|
|
154
|
+
|
|
155
|
+
- **No reflect/refine pass.** We never re-examine an old fact in light of new context. A decision from January and one from May about the same subject don't get re-evaluated as a pair unless they happen to trigger supersession at insert time.
|
|
156
|
+
- **No background "learned context" agent.** Layer 2 distillation runs *only* on the current session's transcripts; nothing reflects on the full corpus during idle time.
|
|
157
|
+
|
|
158
|
+
**Why this matters per the article.** Without consolidation, signal-to-noise degrades as memory grows — this is the "scale" failure mode. Today our corpus is small (low hundreds of facts per project). The article suggests this will hurt at 10K+ facts.
|
|
159
|
+
|
|
160
|
+
### 5. Agent-Controlled Memory
|
|
161
|
+
|
|
162
|
+
**The article's claim.** Letta demonstrated that a simple filesystem approach (agent + file tools) hit 74% on LoCoMo with GPT-4o-mini, beating Mem0's purpose-built infrastructure at 68.5%. Quote: "Agent capability matters more than specialized memory infrastructure."
|
|
163
|
+
|
|
164
|
+
**What we have.**
|
|
165
|
+
|
|
166
|
+
- `memory.store_extraction` MCP tool — the agent *can* write, but in practice extraction happens passively via SessionStart hook injection (Layer 2 distillation).
|
|
167
|
+
- Five "shortcut" tools (`memory.decisions`, `memory.conventions`, `memory.architecture`, `memory.facts_by_tool`, `memory.facts_by_context`) the agent uses for recall.
|
|
168
|
+
|
|
169
|
+
**What we don't have.**
|
|
170
|
+
|
|
171
|
+
- No "agent decides when to remember" mode. Layer 1 (NullDistiller regex) runs unconditionally on hook events; Layer 2 runs on SessionStart; Layer 3 is user-triggered. The agent doesn't proactively decide "this thread is important, store it."
|
|
172
|
+
|
|
173
|
+
**Why this matters per the article.** This is the "autonomy vs. determinism" trade-off the article explicitly names. Letta's autonomy is non-deterministic and model-dependent; our determinism is fast and predictable. We probably don't want to flip the model — but a *partial* adoption (an explicit "save this for later" tool the agent can call mid-conversation) is consistent with our current architecture.
|
|
174
|
+
|
|
175
|
+
### 6. Benchmarks We're Not Running
|
|
176
|
+
|
|
177
|
+
**The article's claim.** Seven benchmarks now define the evaluation landscape:
|
|
178
|
+
|
|
179
|
+
| Benchmark | Venue / year | What it tests | Notes |
|
|
180
|
+
|-----------|------|---------------|-------|
|
|
181
|
+
| LongMemEval | ICLR 2025 | 5 abilities across 500 questions, 115K-1.5M token contexts | Gold standard |
|
|
182
|
+
| LoCoMo | ACL 2024 | 10 conversations × 300 turns | Vendor-disputed; scores unreliable |
|
|
183
|
+
| MemBench | ACL 2025 | Factual vs reflective memory | Useful for our "decision vs convention" split |
|
|
184
|
+
| MemoryBench | Tsinghua 2025 | Continual learning from feedback | 11 datasets, 3 domains, 2 languages |
|
|
185
|
+
| MemoryAgentBench | ICLR 2026 | 4 competencies including conflict resolution | "No method excels at all four" |
|
|
186
|
+
| EverMemBench | Feb 2026 | Multi-party group conversations | Niche |
|
|
187
|
+
| Letta Leaderboard | 2025 | LLMs managing own memory via tools | Most relevant to our MCP design |
|
|
188
|
+
|
|
189
|
+
**What we have.** Our own eval suite (`spec/evals/`), DevMemBench (`spec/benchmarks/`), and `spec/benchmarks/comparative/` against QMD + grepai.
|
|
190
|
+
|
|
191
|
+
**What we don't have.** Any cross-comparison against LongMemEval or LoCoMo. We can't say with evidence "ClaudeMemory scores X on LongMemEval" — and given the article's framing, that's the question potential adopters will ask.
|
|
192
|
+
|
|
193
|
+
**Why this matters per the article.** LongMemEval is the only benchmark the article describes as rigorous. LoCoMo numbers are "unreliable for cross-vendor comparison" because of public scoring disputes. If we report any benchmark, it should be LongMemEval; if we cite LoCoMo it should be with the disclaimer.
|
|
194
|
+
|
|
195
|
+
---
|
|
196
|
+
|
|
197
|
+
## Comparative Analysis
|
|
198
|
+
|
|
199
|
+
### What the field does well that we don't
|
|
200
|
+
|
|
201
|
+
1. **Graph traversal at retrieval time** (Zep, Mem0 graph variant, Cognee). We store the graph; we don't walk it.
|
|
202
|
+
2. **Bi-temporal modeling** (Zep). We conflate world-time and ingest-time in a single `valid_to` column.
|
|
203
|
+
3. **Active consolidation / reflect pass** (Hindsight, Cognee memify, Letta sleep-time). We supersede at insert time only.
|
|
204
|
+
4. **Epistemic separation** (Hindsight 4 networks: world facts / agent experiences / entity observations / evolving opinions). We have `provenance.strength` (stated/inferred/derived) but don't route differently.
|
|
205
|
+
5. **Standardized benchmark scores** (LongMemEval). We have internal evals only.
|
|
206
|
+
|
|
207
|
+
### What we do well that they don't
|
|
208
|
+
|
|
209
|
+
1. **Local-first, zero-cloud-dependency.** Letta and Mem0 require PostgreSQL + (Qdrant or pgvector). Cognee defaults to file-based but is Python-heavyweight. Our gem + SQLite stack ships as a single Ruby dependency.
|
|
210
|
+
2. **No LLM in the retrieval path.** Zep makes this point ("no LLM calls during retrieval"), achieving 200ms-1s P95 — and so do we, even more aggressively (no inference at all, just SQL).
|
|
211
|
+
3. **Free Layer 2 distillation.** Mem0 calls an LLM for every extraction. Letta runs background sleep-time agents. We piggyback on the user's existing Claude Code session via context hook injection — zero additional API spend. This is genuinely novel and the article doesn't mention any equivalent.
|
|
212
|
+
4. **Provenance receipts on every fact.** Mem0 logs operations to SQLite for audit but doesn't tie each fact to a quoted source. Our `provenance` + `mcp_tool_calls` tables give every claim a traceable origin.
|
|
213
|
+
5. **Public predicate vocabulary.** PredicatePolicy is the article's missing piece for fact/opinion separation — it's an opinionated, curated set of 9 predicates with cardinality semantics, exposed publicly via `docs/api_stability.md`. Hindsight does this implicitly in code; we do it as a contract.
|
|
214
|
+
|
|
215
|
+
### Trade-offs the article explicitly names
|
|
216
|
+
|
|
217
|
+
| Tension | Their pole | Our pole |
|
|
218
|
+
|---------|-----------|----------|
|
|
219
|
+
| Richness vs. latency | Zep: hours of ingestion for richer graph | NullDistiller P95 <5ms; minutes for Layer 3 manual |
|
|
220
|
+
| Autonomy vs. determinism | Letta: agent-controlled, model-dependent | Deterministic SQL queries |
|
|
221
|
+
| Completeness vs. compression | Zep preserves raw episodes | We distill into structured facts only (raw transcript chunks live in `content_items` until swept) |
|
|
222
|
+
|
|
223
|
+
These poles match the design decisions we've already made and recorded. The article validates them, including specifically what we *gave up* (peak benchmark score on LongMemEval) for what we *gained* (sub-100ms recall, no cloud cost, no LLM in critical path).
|
|
224
|
+
|
|
225
|
+
---
|
|
226
|
+
|
|
227
|
+
## Adoption Opportunities
|
|
228
|
+
|
|
229
|
+
### High Priority ⭐
|
|
230
|
+
|
|
231
|
+
#### 1. Graph Traversal as a Third Retrieval Strategy ⭐
|
|
232
|
+
|
|
233
|
+
- **Value.** The article credits graph-BFS as the difference between Mem0 (49% on LongMemEval) and Zep (71.2%). We already store the graph; we just don't traverse it at query time. This is the highest-leverage gap in our retrieval: roughly 80% of the work is already done, and what remains is exposing the stored graph at query time.
|
|
234
|
+
- **Evidence.** Article Pattern 1 + Pattern 2. ClaudeMemory has `fact_links` (supersession/conflict edges) and `entities`/`entity_aliases` (entity nodes) but `lib/claude_memory/recall.rb` doesn't BFS over them.
|
|
235
|
+
- **Implementation.** Add a `Recall::GraphTraversal` strategy: resolve the query to seed entities via the existing entity matcher, BFS one or two hops over `entities` ↔ `facts` ↔ `entities` (using `subject_id` and `object_entity_id` if present), score by hop distance × edge type. Fuse into the existing RRF in `Core::RRFusion` (`lib/claude_memory/core/rr_fusion.rb`) as a third source alongside vec + FTS. Bound BFS depth (1-2 hops) so latency stays sub-100ms.
|
|
236
|
+
- **Effort.** Medium — 2-3 days. The data is already shaped correctly; this is a new strategy class + RRF integration + tests.
|
|
237
|
+
- **Trade-off.** Adds a third source to RRF tuning. Empty graphs (early-project use) will simply contribute zero rerank weight — degrades gracefully.
|
|
238
|
+
- **Recommendation.** **ADOPT** in 0.12.0 or 0.13.0. Aligns with our existing hybrid retrieval; no new dependencies; demonstrably the field's biggest accuracy lever.
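
A minimal sketch of the bounded traversal described under Implementation, assuming the graph has already been loaded into an in-memory adjacency list. The adjacency shape, the method name `bounded_bfs`, and the sample entities are hypothetical; the real strategy would read from the existing tables. Only hop distance is used for ordering here, and edge-type weighting is left out.

```ruby
# Hypothetical adjacency list: entity id => [[neighbor entity id, fact id], ...].
# The shape is an assumption for illustration, not the actual schema.
def bounded_bfs(adjacency, seeds, max_hops: 2)
  hops = {} # fact id => smallest hop count at which the fact was reached
  visited = seeds.to_h { |seed| [seed, true] }
  frontier = seeds

  (1..max_hops).each do |hop|
    next_frontier = []
    frontier.each do |entity|
      adjacency.fetch(entity, []).each do |neighbor, fact_id|
        hops[fact_id] ||= hop
        next if visited[neighbor]

        visited[neighbor] = true
        next_frontier << neighbor
      end
    end
    frontier = next_frontier
  end

  # Closest facts first; the ordered list can feed rank fusion as one more source.
  hops.sort_by { |fact_id, hop| [hop, fact_id] }.map(&:first)
end

graph = {
  "alice"   => [["storage", 1]],
  "storage" => [["alice", 1], ["sqlite", 2]],
  "sqlite"  => [["storage", 2], ["wal", 3]]
}

two_hops = bounded_bfs(graph, ["alice"], max_hops: 2)
```

An empty adjacency list yields an empty result, which matches the graceful degradation noted in the trade-off: an early project with no graph simply contributes nothing to fusion.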
|
|
239
|
+
|
|
240
|
+
#### 2. Temporal-Aware Retrieval Strategy ⭐
|
|
241
|
+
|
|
242
|
+
- **Value.** The article says temporal reasoning shows the largest performance gaps across every benchmark (up to 73% on LoCoMo). Adding even basic recency weighting and "as-of" filtering would close part of this.
|
|
243
|
+
- **Evidence.** Article Pattern 3. Schema already has `valid_from`, `valid_to`, `created_at`, `last_recalled_at`. None are used in ranking.
|
|
244
|
+
- **Implementation.** Two pieces:
|
|
245
|
+
1. **Recency boost in RRF.** Add a `temporal_rank` input to `Core::RRFusion`: facts with newer `valid_from` get a small rank boost (decay factor configurable). Doesn't replace lexical/semantic — it's a third (or fourth, with graph) RRF source.
|
|
246
|
+
2. **`as_of` parameter on `memory.recall`.** Optional ISO 8601 timestamp; filters to facts where `valid_from <= as_of AND (valid_to IS NULL OR valid_to > as_of)`. Enables "what did we know about X on date Y" queries the article credits Zep with.
|
|
247
|
+
- **Effort.** Small — 1-2 days. Existing columns; just thread the new parameter and ranker.
|
|
248
|
+
- **Trade-off.** Recency weighting can over-rank ephemeral facts (e.g., a debugging note from yesterday vs. a long-standing convention). Cap the boost at low weight (e.g., 0.1× of vec contribution) and tune via the existing eval harness.
|
|
249
|
+
- **Recommendation.** **ADOPT** in 0.12.0 alongside #1. Tiny change, big article-validated upside, no new dependencies.
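
The two pieces can be sketched in plain Ruby as follows. The `Fact` struct and both method names are hypothetical stand-ins; the filter mirrors the `valid_from <= as_of AND (valid_to IS NULL OR valid_to > as_of)` predicate from the Implementation notes. The configurable decay and the weight cap are omitted, so this only produces the ordered list a fusion step would consume.

```ruby
require "time"

# Hypothetical in-memory stand-in for a fact row; only the columns needed here.
Fact = Struct.new(:id, :valid_from, :valid_to, keyword_init: true)

# Same predicate as the SQL condition: valid_from <= as_of AND (valid_to IS NULL OR valid_to > as_of).
def as_of_filter(facts, as_of)
  facts.select { |f| f.valid_from <= as_of && (f.valid_to.nil? || f.valid_to > as_of) }
end

# Newest valid_from first; ties broken by id. Returns fact ids, best first.
def temporal_rank(facts)
  facts.sort_by { |f| [-f.valid_from.to_f, f.id] }.map(&:id)
end

facts = [
  Fact.new(id: 1, valid_from: Time.iso8601("2024-01-01T00:00:00Z"), valid_to: Time.iso8601("2024-06-01T00:00:00Z")),
  Fact.new(id: 2, valid_from: Time.iso8601("2024-06-01T00:00:00Z"), valid_to: nil),
  Fact.new(id: 3, valid_from: Time.iso8601("2025-01-01T00:00:00Z"), valid_to: nil)
]

march_2024 = as_of_filter(facts, Time.iso8601("2024-03-01T00:00:00Z")).map(&:id)
latest_first = temporal_rank(as_of_filter(facts, Time.iso8601("2025-02-01T00:00:00Z")))
```

The window is half-open, so a fact whose `valid_to` equals the `as_of` instant is excluded and its successor is included. This avoids returning both sides of a supersession for the same moment.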
|
|
250
|
+
|
|
251
|
+
#### 3. Bi-Temporal Schema Cleanup (`world_invalid_at` vs `ingest_expired_at`)
|
|
252
|
+
|
|
253
|
+
- **Value.** Today, `valid_to` does double duty: "fact ceased to be true in the world" *and* "we superseded this fact during ingestion." The article calls this out specifically as Zep's most important innovation. With both columns, point-in-time queries work correctly — without them, we silently corrupt the temporal axis.
|
|
254
|
+
- **Evidence.** Article: "Every entity edge tracks four timestamps: valid_at (when the fact became true in the world), invalid_at (when it was superseded), created_at (when Graphiti ingested it), expired_at (when the record was logically replaced)."
|
|
255
|
+
- **Implementation.** New schema migration (the next free version; v18 is already taken by `db/migrations/018_add_otel_telemetry.rb`): rename `valid_to` → `world_invalid_at`; add `ingest_expired_at` (datetime, nullable). Update `Resolver` to set `ingest_expired_at` on supersession and leave `world_invalid_at` for explicit user-supplied "this fact stopped being true on date X" updates. Backfill: copy existing `valid_to` into both columns (we can't recover the distinction historically).
|
|
256
|
+
- **Effort.** Medium — schema migration + resolver update + MCP tool surface (optional `world_invalid_at` parameter on `memory.reject_fact` and friends) + tests. 2-3 days.
|
|
257
|
+
- **Trade-off.** API surface change — `valid_to` is part of the public schema per `docs/api_stability.md`. Needs a deprecation cycle (alias `valid_to` to `world_invalid_at` in the Sequel model for one minor version).
|
|
258
|
+
- **Recommendation.** **ADOPT** in 0.13.0 (after #1 and #2). Lower urgency than the retrieval changes, but it's the foundation for any future "as of" reasoning, audit trail, and historical reasoning. Cheaper to do before our corpus grows.
|
|
259
|
+
|
|
260
|
+
#### 4. LongMemEval Benchmark Integration
|
|
261
|
+
|
|
262
|
+
- **Value.** The article calls LongMemEval the "gold standard." Without an external benchmark score, we can't credibly position ClaudeMemory against the field. Internal evals (which we have) don't answer "is this competitive with Zep/Mem0?"
|
|
263
|
+
- **Evidence.** Article: "LongMemEval has emerged as the gold standard… Three-stage framework (Indexing → Retrieval → Reading) with LLM-as-judge scoring provides the most rigorous evaluation available."
|
|
264
|
+
- **Implementation.** Add `spec/benchmarks/longmemeval/` adapter. Dataset is public (Wu et al. arXiv). Wire it into `bin/run-evals --longmemeval`. Report Recall@k, MRR, nDCG@10 the way DevMemBench already does.
|
|
265
|
+
- **Effort.** Medium — 2-4 days. Mostly dataset wrangling + adapter code. The existing DevMemBench pipeline already has the right shape.
|
|
266
|
+
- **Trade-off.** LongMemEval_S is ~115K tokens; ingesting all 500 questions will be slow and cost real API spend if we use Claude Code in the inner loop. Mitigation: stub mode for the retrieval-only portion (no LLM-judge), real mode opt-in.
|
|
267
|
+
- **Recommendation.** **ADOPT** in 0.12.0 or 0.13.0. This is what we'd cite in a release blog post; the article makes it clear it's the only number that matters.
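
For reference, the three retrieval metrics named under Implementation can be computed per query as in the sketch below. The method names are hypothetical and binary relevance is assumed; how DevMemBench actually computes them is not shown in this document.

```ruby
# ranked: retrieved ids, best first. relevant: ids judged relevant (binary relevance assumed).
def recall_at_k(ranked, relevant, k)
  return 0.0 if relevant.empty?

  (ranked.first(k) & relevant).size.to_f / relevant.size
end

# Reciprocal rank of the first relevant hit; MRR is the mean of this value over all queries.
def reciprocal_rank(ranked, relevant)
  index = ranked.index { |id| relevant.include?(id) }
  index ? 1.0 / (index + 1) : 0.0
end

# nDCG@k with binary gains: DCG of the ranking divided by the DCG of an ideal ranking.
def ndcg_at_k(ranked, relevant, k)
  gains = ranked.first(k).each_with_index.map do |id, i|
    relevant.include?(id) ? 1.0 / Math.log2(i + 2) : 0.0
  end
  ideal = (0...[relevant.size, k].min).sum { |i| 1.0 / Math.log2(i + 2) }
  ideal.zero? ? 0.0 : gains.sum / ideal
end

ranked   = [5, 3, 9, 1] # hypothetical retrieval result for one question
relevant = [3, 1, 7]    # hypothetical gold evidence ids

recall3 = recall_at_k(ranked, relevant, 3)
rr      = reciprocal_rank(ranked, relevant)
ndcg4   = ndcg_at_k(ranked, relevant, 4)
```

None of these needs an LLM judge, so they fit the retrieval-only stub mode mentioned in the trade-off.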
|
|
268
|
+
|
|
269
|
+
### Medium Priority
|
|
270
|
+
|
|
271
|
+
#### 5. Reflect Pass — Background Consolidation on Idle
|
|
272
|
+
|
|
273
|
+
- **Value.** Hindsight's reflect operation and Letta's sleep-time compute both run a background process that re-examines stored facts using a stronger/slower model. The article credits this with preventing noise growth at scale. We don't have it; today our corpus is small enough to not need it; we will need it once any single project exceeds ~5K facts.
|
|
274
|
+
- **Evidence.** Article Pattern 4.
|
|
275
|
+
- **Implementation.** Extend `Sweep::Maintenance` with a `reflect` operation that runs during the SessionEnd hook when N facts have accumulated since last reflect. The reflect operation is an MCP-callable prompt: "Given these N facts about subject X, produce: (a) a consolidated summary fact, (b) any contradictions, (c) any facts that should be marked obsolete." Like Layer 2 distillation, this can piggyback on the user's Claude Code session — no extra API cost.
|
|
276
|
+
- **Effort.** Large — 5-7 days. Touches sweep, hooks, MCP, and skill design. Needs a careful prompt + good eval to prove we're not introducing hallucinated consolidations.
|
|
277
|
+
- **Trade-off.** Risk of consolidating away real distinctions. Mitigation: every consolidated fact links to the source facts via `fact_links` (already supported); manual `claude-memory reject` undoes a bad consolidation.
|
|
278
|
+
- **Recommendation.** **CONSIDER** for 1.0.0 or later. The article validates the direction; we don't have the scale problem yet. Track when largest project DB crosses 5K facts.
|
|
279
|
+
|
|
280
|
+
#### 6. `memory.save_this` Tool — Agent-Initiated Storage
|
|
281
|
+
|
|
282
|
+
- **Value.** Letta's striking result (74% vs Mem0's 68.5%) suggests that giving the agent explicit "save this" capability beats passive extraction in some scenarios. We already have `memory.store_extraction`, but it's framed as "report an extraction you found," not "I (the agent) want to remember this for later." A friendlier surface might increase use.
|
|
283
|
+
- **Evidence.** Article Pattern 5 + Letta filesystem result.
|
|
284
|
+
- **Implementation.** Add `memory.save_this` as a thin wrapper over `memory.store_extraction` with simpler prompt: "Save the most important fact from this turn for future sessions. Tag with `subject`, `predicate`, `object`, and a brief reason." Document it in the MCP `memory_guide` prompt as the agent's "I want to remember this" tool.
|
|
285
|
+
- **Effort.** Small — 1 day. Mostly MCP surface + prompt updates + tests.
|
|
286
|
+
- **Trade-off.** Could drive low-quality "save everything" behavior. Mitigation: existing `BareConclusionDetector` already gates against poor extractions.
|
|
287
|
+
- **Recommendation.** **CONSIDER** in 0.13.0 if first-week usage shows agents rarely use `store_extraction` proactively. Cheap to try; cheap to remove.
|
|
288
|
+
|
|
289
|
+
#### 7. Provenance Strength Routing (light epistemic separation)
|
|
290
|
+
|
|
291
|
+
- **Value.** Hindsight's 4-network architecture (world facts / agent experiences / entity observations / evolving opinions) gives different retrieval characteristics to different fact types. We have a similar axis — `provenance.strength` ∈ {stated, inferred, derived} — but the ranker doesn't use it.
|
|
292
|
+
- **Evidence.** Article: "Epistemic separation — structurally distinguishing evidence from inference — is a key innovation."
|
|
293
|
+
- **Implementation.** Add a small weight in `Core::RRFusion`: `stated` facts get full weight, `inferred` get 0.7×, `derived` get 0.5×. Surface a `strength_filter` parameter on `memory.recall` for "only stated facts" use cases.
|
|
294
|
+
- **Effort.** Small — 1 day. We already store the data.
|
|
295
|
+
- **Trade-off.** Minor — could under-rank inferred facts that are nonetheless useful. Tune via eval harness.
|
|
296
|
+
- **Recommendation.** **CONSIDER**. Already covered partially by improvement #57 (Provenance-Strength-Aware Retrieval Ranking) in `docs/improvements.md`. This article *strongly validates* that improvement; promoting #57 from Medium to High is the right move.
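
A small sketch of the soft weighting, using the multipliers from the Implementation bullet. The constant and method names are hypothetical, and treating an unknown strength as full weight is an assumption of this example rather than documented behavior.

```ruby
# Multipliers from the Implementation bullet; the constant name is hypothetical.
STRENGTH_WEIGHT = { "stated" => 1.0, "inferred" => 0.7, "derived" => 0.5 }.freeze

# fused: fact id => fused score. strengths: fact id => strength label.
# Unknown labels fall back to full weight (an assumption made for this sketch).
def apply_strength(fused, strengths)
  fused
    .map { |id, score| [id, score * STRENGTH_WEIGHT.fetch(strengths[id], 1.0)] }
    .sort_by { |id, score| [-score, id] }
end

# Keeps only facts with the requested strength, for "only stated facts" queries.
def strength_filter(fused, strengths, only:)
  fused.select { |id, _score| strengths[id] == only }
end

fused     = { 1 => 0.030, 2 => 0.032, 3 => 0.031 } # hypothetical fused scores
strengths = { 1 => "stated", 2 => "derived", 3 => "inferred" }

reranked    = apply_strength(fused, strengths).map(&:first)
stated_only = strength_filter(fused, strengths, only: "stated").keys
```

In this sample the derived fact has the highest raw score and drops to last place after weighting. That is the under-ranking risk named in the trade-off, and it is why the multipliers should be tuned against the eval harness.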
|
|
297
|
+
|
|
298
|
+
### Low Priority / Defer
|
|
299
|
+
|
|
300
|
+
#### 8. Ontology Validation Layer (Cognee-style)
|
|
301
|
+
|
|
302
|
+
- **Value.** Canonicalizes "car manufacturer," "automobile maker," "vehicle producer" into one entity. Reduces graph fragmentation.
|
|
303
|
+
- **Evidence.** Article: Cognee uses RDF/OWL ontologies + `difflib.get_close_matches()`.
|
|
304
|
+
- **Trade-off.** We already do this for predicates via `PredicatePolicy::SYNONYMS`. Extending to entities means defining ontologies per project — heavyweight for a single-developer tool.
|
|
305
|
+
- **Recommendation.** **DEFER**. Our entity_aliases mechanism is the lightweight version of this. Adopt only if entity fragmentation shows up as a real failure mode in benchmarks.
|
|
306
|
+
|
|
307
|
+
#### 9. LoCoMo Benchmark
|
|
308
|
+
|
|
309
|
+
- **Value.** Cross-comparison with other memory systems.
|
|
310
|
+
- **Evidence.** Article: "Vendor disputes about proper implementation… Mem0 and Zep have publicly contradicted each other's reported scores, making LoCoMo rankings unreliable for cross-vendor comparison."
|
|
311
|
+
- **Recommendation.** **DEFER**. The article specifically discredits LoCoMo as a comparison axis. LongMemEval (recommendation #4) is the right benchmark to invest in. If we cite LoCoMo at all, cite our own number standalone, not against vendor-reported scores.
|
|
312
|
+
|
|
313
|
+
### Features to Avoid (article-derived)
|
|
314
|
+
|
|
315
|
+
These are confirmed by the article as either over-engineering, mismatched, or solving problems we don't have:
|
|
316
|
+
|
|
317
|
+
- **Cross-encoder reranking** — Already in our avoid list. Article confirms: "Hindsight's four parallel retrieval strategies with cross-encoder reranking are expensive." No LLM in retrieval path is one of our key advantages.
|
|
318
|
+
- **Bi-temporal complexity beyond a second column**: Zep tracks four timestamps per edge. The article doesn't quantify the value of `expired_at` separately from `invalid_at`. Recommendation #3 above adds only one new column (`ingest_expired_at`) next to the renamed `world_invalid_at`; together with the existing `valid_from` and `created_at`, that already covers the four roles in the Graphiti schema, so we should stop there rather than layer on further temporal machinery.
|
|
319
|
+
- **Custom fine-tuned models for any pipeline stage**: Already in our avoid list. Hindsight's results require Gemini-3 Pro for the 91.4% number; their 20B open variant scores 83.6%. We can't and shouldn't compete on model size; per the article, architecture (which we can fix) matters more anyway.
|
|
320
|
+
- **Cloud-required architecture** — Letta requires PostgreSQL + pgvector; Cognee defaults to local but production runs PostgreSQL + Neo4j + Qdrant. Our SQLite-only stack is a real differentiator the article doesn't address.
|
|
321
|
+
- **Multi-network epistemic separation as a hard schema split** (full Hindsight 4-network model) — Over-complex for our scale. Recommendation #7 above adopts the soft version (weight by `provenance.strength`).
|
|
322
|
+
- **Conversation-level memory (Letta filesystem approach as primary mode)** — Article reports 74% on LoCoMo for filesystem-only, but the read/write loop consumes user-visible tokens on every interaction. Our hook-based passive ingestion is cheaper per session.
|
|
323
|
+
- **Sleep-time compute as a separate service**: Letta runs background agents. We can achieve the same effect for free from a session hook inside the user's existing Claude Code session (recommendation #5 proposes the SessionEnd hook). No separate process needed.
|
|
324
|
+
|
|
325
|
+
---
|
|
326
|
+
|
|
327
|
+
## Implementation Recommendations
|
|
328
|
+
|
|
329
|
+
### Phase 1 — Validate the architecture pattern (0.12.0)
|
|
330
|
+
|
|
331
|
+
- **Graph traversal strategy** (recommendation #1, ⭐). Highest leverage; data is ready.
|
|
332
|
+
- **Temporal recency in RRF + `as_of` parameter** (recommendation #2, ⭐). Tiny code, big benchmark-validated upside.
|
|
333
|
+
- **LongMemEval integration** (recommendation #4, ⭐). Get a baseline number before we start tuning, so we can measure each subsequent change.
|
|
334
|
+
|
|
335
|
+
### Phase 2 — Foundation cleanup (0.13.0)
|
|
336
|
+
|
|
337
|
+
- **Bi-temporal schema cleanup** (recommendation #3). Schema change is easier now than later.
|
|
338
|
+
- **Promote improvement #57 to High and ship it** (recommendation #7). Already-tracked work; this article strongly validates it.
|
|
339
|
+
- **`memory.save_this` tool** (recommendation #6) if eval data suggests agents under-use `memory.store_extraction`.
|
|
340
|
+
|
|
341
|
+
### Phase 3 — Scale concerns (1.0.0 or later)
|
|
342
|
+
|
|
343
|
+
- **Reflect pass** (recommendation #5). Only when a real project DB crosses ~5K facts; until then, premature.
|
|
344
|
+
|
|
345
|
+
### What to skip
|
|
346
|
+
|
|
347
|
+
- **LoCoMo benchmark** (recommendation #9). Article explicitly discredits it for cross-vendor use.
|
|
348
|
+
- **Ontology validation** (recommendation #8). Our existing `entity_aliases` + `PredicatePolicy::SYNONYMS` are the right-sized version.
|
|
349
|
+
|
|
350
|
+
---
|
|
351
|
+
|
|
352
|
+
## Architecture Decisions
|
|
353
|
+
|
|
354
|
+
### What to preserve (validated by the article)
|
|
355
|
+
|
|
356
|
+
1. **Local-first, SQLite-only** — competitive differentiator vs. Letta/Cognee cloud stacks.
|
|
357
|
+
2. **No LLM in retrieval path** — Zep makes this same choice and credits it for <200ms-1s latency; we go further with no LLM at all.
|
|
358
|
+
3. **Hook-based passive ingestion via Claude Code session** — zero-API-cost Layer 2 distillation; the article surveys no equivalent.
|
|
359
|
+
4. **RRF over vec+FTS**: the same fusion pattern Zep uses (cosine + BM25 + BFS); we just need to add the third source.
|
|
360
|
+
5. **Publicly-versioned predicate vocabulary** (`PredicatePolicy` + `docs/api_stability.md`) — light, opinionated, stable. Field-wide there's no equivalent contract.
|
|
361
|
+
6. **Provenance receipts on every fact** — comparable systems log operations to SQLite but don't tie each fact to a quoted source.
|
|
362
|
+
|
|
363
|
+
### What to adopt (article-validated)
1. **Graph traversal as third retrieval strategy** — closes the largest article-named gap.
2. **Temporal-aware RRF + `as_of` queries** — closes the second-largest gap.
3. **Bi-temporal columns** — `world_invalid_at` separate from `ingest_expired_at`.
4. **LongMemEval as the comparison benchmark** — the only number the article describes as rigorous.
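To make items 2 and 3 concrete, here is a rough in-memory sketch of an `as_of` filter over bi-temporal facts. Only `world_invalid_at` and `ingest_expired_at` come from the list above; the `Fact` struct, the `world_valid_at` and `ingest_created_at` fields, and the `as_of` signature are assumptions made for illustration. A real implementation would presumably be a SQL predicate rather than a Ruby filter.

```ruby
# Hypothetical in-memory fact. Only `world_invalid_at` and `ingest_expired_at`
# are named in the list above; the other two timestamp fields are assumptions.
Fact = Struct.new(:text, :world_valid_at, :world_invalid_at,
                  :ingest_created_at, :ingest_expired_at, keyword_init: true)

# True when `time` is inside [from, to); a nil `to` means "still open".
def within?(time, from, to)
  from <= time && (to.nil? || time < to)
end

# Facts that were true in the world at `world_time`,
# according to what the store believed at `known_at`.
def as_of(facts, world_time:, known_at:)
  facts.select do |f|
    within?(world_time, f.world_valid_at, f.world_invalid_at) &&
      within?(known_at, f.ingest_created_at, f.ingest_expired_at)
  end
end

FACTS = [
  Fact.new(text: "app uses MySQL",
           world_valid_at: Time.utc(2025, 1, 1), world_invalid_at: Time.utc(2026, 3, 1),
           ingest_created_at: Time.utc(2025, 6, 1), ingest_expired_at: nil),
  Fact.new(text: "app uses PostgreSQL",
           world_valid_at: Time.utc(2026, 3, 1), world_invalid_at: nil,
           ingest_created_at: Time.utc(2026, 3, 10), ingest_expired_at: nil)
].freeze

feb = Time.utc(2026, 2, 1)
apr = Time.utc(2026, 4, 1)
in_feb = as_of(FACTS, world_time: feb, known_at: apr).map(&:text)
in_apr = as_of(FACTS, world_time: apr, known_at: apr).map(&:text)
```

Keeping the two timelines separate is what lets a query distinguish "this stopped being true" from "we stopped believing this". The sample data is simplified: a stricter design would record the MySQL fact's end date as a new row version at the moment the store learned about it.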
### What to reject
1. **Cross-encoder LLM reranking** — already rejected; the article confirms cost is the reason.
2. **Cloud-required graph DB** (Neo4j, FalkorDB) — SQLite + our existing schema is sufficient; recommendation #1 traverses the graph we already have.
3. **4-network hard epistemic split** — recommendation #7 adopts the soft (weight-by-strength) version.
4. **LoCoMo benchmark** — the article itself discredits cross-vendor comparison.
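The difference between the hard split rejected in item 3 and the soft version can be sketched as a per-fact weight applied to the rank score, so weaker facts are demoted rather than filtered out. The strength labels, the weights, and the method name below are hypothetical placeholders, not values from the gem; any real weights would need tuning against a benchmark.

```ruby
# Hypothetical strength labels and weights, chosen only for illustration.
STRENGTH_WEIGHTS = { stated: 1.0, inferred: 0.6 }.freeze

# `ranked` is an array of [fact_id, strength] pairs from one source, best first.
# Returns a hash of fact_id => weighted reciprocal-rank score.
def strength_weighted_scores(ranked, k: 60)
  pairs = ranked.each_with_index.map do |(id, strength), index|
    weight = STRENGTH_WEIGHTS.fetch(strength, 0.5) # unknown labels get a middling weight
    [id, weight / (k + index + 1)]
  end
  pairs.to_h
end

scores = strength_weighted_scores([[1, :inferred], [2, :stated]])
order = scores.sort_by { |_id, score| -score }.map(&:first)
```

In this sample the stated fact at rank 2 overtakes the inferred fact at rank 1, yet the inferred fact still appears in the results, which is what separates weighting from a hard partition.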
---
## Key Takeaways
1. **We are architecturally closer to Mem0 (49% on LongMemEval) than to Zep (71.2%) or Hindsight (91.4%).** That's mostly a deliberate trade for local-first / no-LLM-in-retrieval. But two pieces of the gap — graph traversal and temporal-aware retrieval — are unforced. We already store the data; we just don't query it.
2. **The biggest single improvement we can make is adding graph traversal as a third RRF source.** Article-validated as the difference between Mem0-class and Zep-class systems. We have the data shape; we don't have the strategy class.
3. **Layer 2 distillation (free LLM via Claude Code session) is genuinely novel.** No system the article surveys does this. We should keep emphasizing it in documentation and in any benchmark write-up.
4. **Our existing improvement #57 (Provenance-Strength-Aware Retrieval Ranking) is the soft version of Hindsight's epistemic separation.** This article promotes it from "nice to have" to "fits the field-wide pattern." Recommend moving #57 to High Priority.
5. **Temporal reasoning is the field's hardest problem.** We've under-invested here. Schema-level fix (recommendation #3) and ranker-level fix (recommendation #2) together cost about a week's work.
6. **We should benchmark against LongMemEval before tuning any of this.** Without a baseline, we can't tell which adopted changes help.
7. **The article's clearest negative result: pure vector approaches plateau at ~50% on LongMemEval.** Anything we do that doubles down on vector-only retrieval is investing in the wrong axis.
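As a rough picture of the missing "strategy class" from takeaway 2, the sketch below walks facts breadth-first from a set of seed entities and returns fact ids ordered by hop distance, a list that could then be handed to rank fusion as a third source. The `Triple` struct, the method name, and the sample data are assumptions; the real schema and traversal (for example a recursive SQL query) may look quite different. With the sample data, two hops from `"app"` reach facts 1 and 2, and a third hop adds fact 3.

```ruby
require "set"

# Hypothetical triple shape; the gem's actual fact schema may differ.
Triple = Struct.new(:id, :subject, :predicate, :object)

# Breadth-first walk over entities linked by facts. Returns fact ids ordered
# by hop distance from the seed entities (nearest facts first).
def graph_ranked_fact_ids(triples, seeds, max_hops: 2)
  visited = Set.new(seeds)
  frontier = seeds.dup
  ranked = []
  max_hops.times do
    next_frontier = []
    triples.each do |t|
      next if ranked.include?(t.id)
      next unless frontier.include?(t.subject) || frontier.include?(t.object)
      ranked << t.id
      # Entities seen for the first time are expanded on the next hop.
      [t.subject, t.object].each do |entity|
        next_frontier << entity if visited.add?(entity)
      end
    end
    frontier = next_frontier
  end
  ranked
end

TRIPLES = [
  Triple.new(1, "app", "uses", "postgres"),
  Triple.new(2, "postgres", "hosted_on", "rds"),
  Triple.new(3, "rds", "region", "us-east-1"),
  Triple.new(4, "billing", "uses", "stripe")
].freeze

two_hops = graph_ranked_fact_ids(TRIPLES, ["app"])
three_hops = graph_ranked_fact_ids(TRIPLES, ["app"], max_hops: 3)
```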
---
## Cross-References
- **`docs/improvements.md`** — recommendations #1, #2, #3, #4, #6 should be added as new entries. Recommendation #7 promotes existing #57 from Medium to High.
- **`docs/influence/qmd.md`**, **`docs/influence/grepai.md`** — these are the other single-repo studies in the same architectural space (hybrid vec+FTS, no graph, no temporal). The article suggests their tradeoffs match ours.
- **`docs/api_stability.md`** — schema changes in recommendation #3 (bi-temporal cleanup) need to land in the same commit as updates here. `valid_to` rename is a public-API break with deprecation aliasing.
- **`spec/benchmarks/README.md`** — recommendation #4 (LongMemEval integration) belongs in this directory.
- **`lib/claude_memory/core/rr_fusion.rb`** — recommendations #1, #2, #7 all add new sources to this fusion. Touching this file once for all three is cheaper than three separate passes.
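
The "deprecation aliasing" mentioned for the `valid_to` rename could look roughly like the sketch below. It assumes the new name is `world_invalid_at`, which the text above suggests but does not state outright, and it uses a hypothetical `FactRecord` class rather than the gem's own model or its `deprecations.rb` helpers.

```ruby
# Hypothetical record class, not the gem's actual model.
class FactRecord
  attr_accessor :world_invalid_at

  def initialize(world_invalid_at: nil)
    @world_invalid_at = world_invalid_at
  end

  # The old name keeps working but warns on stderr, so callers can migrate
  # before the alias is removed in a later release.
  def valid_to
    warn "[DEPRECATION] `valid_to` is deprecated; use `world_invalid_at` instead."
    world_invalid_at
  end
end

fact = FactRecord.new(world_invalid_at: Time.utc(2026, 3, 1))
old_value = fact.valid_to # still returns the value, with a warning
```

Shipping the alias in the same commit as the schema change keeps existing callers working while the stability document records the removal timeline.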
|