claude_memory 0.12.1 → 0.13.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (58)
  1. checksums.yaml +4 -4
  2. data/.claude/memory.sqlite3 +0 -0
  3. data/.claude/rules/claude_memory.generated.md +6 -1
  4. data/.claude/settings.local.json +2 -1
  5. data/.claude-plugin/marketplace.json +2 -2
  6. data/.claude-plugin/plugin.json +2 -2
  7. data/CHANGELOG.md +38 -0
  8. data/CLAUDE.md +11 -6
  9. data/README.md +35 -0
  10. data/db/migrations/019_add_observations.rb +43 -0
  11. data/db/migrations/020_add_observation_promotion.rb +33 -0
  12. data/docs/GETTING_STARTED.md +38 -0
  13. data/docs/api_stability.md +16 -5
  14. data/docs/architecture.md +18 -6
  15. data/docs/audit_runbook.md +67 -0
  16. data/docs/dashboard.md +28 -0
  17. data/docs/improvements.md +173 -1
  18. data/docs/influence/mastra-observational-memory.md +198 -0
  19. data/docs/influence/strands-agent-sops.md +163 -0
  20. data/docs/quality_review.md +45 -0
  21. data/lib/claude_memory/audit/checks.rb +149 -0
  22. data/lib/claude_memory/audit/runner.rb +4 -0
  23. data/lib/claude_memory/commands/census_command.rb +1 -1
  24. data/lib/claude_memory/commands/hook_command.rb +16 -3
  25. data/lib/claude_memory/commands/initializers/hooks_configurator.rb +3 -1
  26. data/lib/claude_memory/commands/install_skill_command.rb +4 -0
  27. data/lib/claude_memory/commands/observations_command.rb +367 -0
  28. data/lib/claude_memory/commands/registry.rb +1 -0
  29. data/lib/claude_memory/commands/skills/reflect.md +68 -0
  30. data/lib/claude_memory/commands/stats_command.rb +60 -1
  31. data/lib/claude_memory/dashboard/api.rb +4 -0
  32. data/lib/claude_memory/dashboard/index.html +154 -2
  33. data/lib/claude_memory/dashboard/observations.rb +115 -0
  34. data/lib/claude_memory/dashboard/server.rb +1 -0
  35. data/lib/claude_memory/distill/extraction.rb +6 -4
  36. data/lib/claude_memory/distill/null_distiller.rb +108 -3
  37. data/lib/claude_memory/distill/reference_material_detector.rb +4 -1
  38. data/lib/claude_memory/domain/observation.rb +118 -0
  39. data/lib/claude_memory/embeddings/generator.rb +1 -1
  40. data/lib/claude_memory/hook/context_injector.rb +125 -2
  41. data/lib/claude_memory/mcp/handlers/management_handlers.rb +113 -2
  42. data/lib/claude_memory/mcp/handlers/query_handlers.rb +48 -1
  43. data/lib/claude_memory/mcp/instructions_builder.rb +1 -0
  44. data/lib/claude_memory/mcp/query_guide.rb +28 -0
  45. data/lib/claude_memory/mcp/tool_definitions.rb +58 -0
  46. data/lib/claude_memory/mcp/tools.rb +3 -0
  47. data/lib/claude_memory/observe/observations_renderer.rb +49 -0
  48. data/lib/claude_memory/observe/reflector.rb +107 -0
  49. data/lib/claude_memory/observe/token_overlap_matcher.rb +55 -0
  50. data/lib/claude_memory/publish.rb +53 -1
  51. data/lib/claude_memory/resolve/resolver.rb +45 -8
  52. data/lib/claude_memory/store/schema_manager.rb +1 -1
  53. data/lib/claude_memory/store/sqlite_store.rb +181 -0
  54. data/lib/claude_memory/sweep/maintenance.rb +15 -1
  55. data/lib/claude_memory/sweep/sweeper.rb +7 -1
  56. data/lib/claude_memory/version.rb +1 -1
  57. data/lib/claude_memory.rb +6 -0
  58. metadata +12 -1
data/docs/dashboard.md CHANGED
@@ -116,6 +116,29 @@ sqlite-vec coverage. Each surfaces an actionable fix string (e.g.,
116
116
  "Run `claude-memory init` to install the standard hook set"). Status
117
117
  escalates to the worst individual check (error > warning > healthy).
118
118
 
119
+ ### Observations (episodic layer, 0.13.0+)
120
+
121
+ The episodic counterpart to the fact-based panels. Facts answer "what is
122
+ true"; **observations** are an append-only log of "what happened" in your
123
+ sessions. Surfaced both as a first-class sidebar panel (headline numbers)
124
+ and an Advanced → Observations tab (full detail):
125
+
126
+ - **Counts by status / kind / priority** — active vs. consolidated vs.
127
+ expired; decision / preference / event; 🔴 important / 🟡 maybe / 🟢 info.
128
+ - **Corroboration + promotion readiness** — how many observations have been
129
+ seen enough times (≥2, the corroboration gate) to be promotable to facts,
130
+ and the highest corroboration count seen. Promotion is the
131
+ anti-hallucination gate: a one-off mention never becomes a fact.
132
+ - **Compression ratio** — source content tokens ÷ observation tokens, the
133
+ Mastra-style measure of how much the episodic log condenses raw sessions.
134
+ - **Recent timeline** — the latest observations, newest first, with their
135
+ priority markers.
136
+
137
+ Promote a corroborated observation to a fact with `memory.promote_observation`
138
+ (or `claude-memory observations promote`), merge related ones with
139
+ `memory.consolidate_observations`, or run the `/reflect` skill for a guided
140
+ survey → consolidate → promote pass.
141
+
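The headline numbers in this panel can be sketched in a few lines of Ruby. This is a minimal illustration, not the gem's `Dashboard::Observations` code: the `Observation` struct, the `summarize` helper, and the choice to count only `active` rows as promotable are all assumptions made for the example. Only the ≥2 gate and the ratio definition come from the text above.

```ruby
# Hypothetical sketch: field names mirror the docs, but this is not the gem's code.
Observation = Struct.new(:status, :kind, :priority, :corroboration_count,
                         :token_count, keyword_init: true)

PROMOTION_THRESHOLD = 2 # the corroboration gate described above

def summarize(observations, source_tokens:)
  obs_tokens = observations.sum(&:token_count)
  {
    by_status: observations.map(&:status).tally,
    by_kind: observations.map(&:kind).tally,
    by_priority: observations.map(&:priority).tally,
    # Assumption: only still-active observations count as promotion-ready.
    promotable: observations.count do |o|
      o.status == "active" && o.corroboration_count >= PROMOTION_THRESHOLD
    end,
    max_corroboration: observations.map(&:corroboration_count).max || 0,
    # Source content tokens divided by observation tokens; nil when undefined.
    compression_ratio: obs_tokens.zero? ? nil : (source_tokens.to_f / obs_tokens).round(2)
  }
end

sample = [
  Observation.new(status: "active", kind: "decision", priority: "important",
                  corroboration_count: 3, token_count: 20),
  Observation.new(status: "active", kind: "preference", priority: "info",
                  corroboration_count: 1, token_count: 30),
  Observation.new(status: "expired", kind: "event", priority: "info",
                  corroboration_count: 1, token_count: 50)
]
summary = summarize(sample, source_tokens: 400)
```

With this sample, 400 source tokens over 100 observation tokens gives a compression ratio of 4.0, and exactly one observation clears the corroboration gate.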
119
142
  ### Activity drill-down
120
143
 
121
144
  Clicking any moment opens a modal with the parsed payload, prettified JSON,
@@ -190,3 +213,8 @@ WAL writer lock open across page loads.
190
213
  flags as stale.
191
214
  - `claude-memory dedupe-conflicts` / `reclassify-references` — one-shot
192
215
  cleanups for what the Conflicts and Knowledge → References panels surface.
216
+ - `claude-memory observations [list|promote|consolidate]` *(0.13.0+)* — the
217
+ CLI mirror of the Observations panel: list/inspect the episodic log
218
+ (`--kind`, `--status`, `--scope`, `--json`), promote a corroborated
219
+ observation to a fact, or consolidate related ones. `claude-memory stats
220
+ --observations` prints the counts summary.
data/docs/improvements.md CHANGED
@@ -1,6 +1,6 @@
1
1
  # Improvements to Consider
2
2
 
3
- *Updated: 2026-05-23 - Added AI Memory Systems Landscape Analysis (Nakajima/Opus 4.6 Research article, 2026-03-26) — meta-study of 7 benchmarks + ~12 systems. Four High Priority items: graph traversal as third RRF source (#64), temporal-aware retrieval (#65), bi-temporal schema cleanup (#66), LongMemEval integration (#67). One promotion: improvement #57 (provenance-strength ranking) Medium → High, validated as the "soft epistemic separation" pattern. See `docs/influence/ai-memory-systems-2026.md`. Previously: 2026-05-01 - Added Strands Agent SOPs study (article, not repo) — one M-priority item (parameter blocks in skill frontmatter); rest already implemented or deferred. See `docs/influence/strands-agent-sops.md`. Previously: 2026-04-28 (post-0.10.0) - Restructured 1.0 punchlist around milestone versions. **0.11.0 "Trust & Cost"** ships #47 (token budget), #48 (hallucination rate), #51 (claude-memory show), #53 (first-week ROI nudge — moved up from post-1.0), and a 3-scenario prototype of #49 (harm benchmark). **0.12.0 "Release Discipline"** ships #49 full corpus, #50 (CLAUDE.md baseline), #52 (benchmark scoreboard). **1.0.0** lands soak-validated #54/#55/#56 if time + new #59 API stability audit. See `docs/1_0_punchlist.md` for the full plan with calendar targets. Also added 2026-04-28: two ranking-signal gaps surfaced by the Mercury / "Why Karpathy's Second Brain Breaks" article (Zaid, 2026-04-28) — provenance-strength-aware ranking (#57) and reinforcement/decay scoring (#58). Earlier 2026-04-28 updates: opened the 1.0 punchlist track + added cq study. Previously: 2026-03-30 - Re-studied all 7 influencer repos. New recommendations: CLAUDE_CONFIG_DIR support (#26, from episodic-memory), Usage Stats / ROI Tracking (#27, from grepai v0.35.0). New Features to Avoid: AST-Aware Code Chunking (QMD), Custom Instructions via Env Var (lossless-claw v0.5.2), OpenClaw Context Injection (claude-mem v10.6.0). 
Repos with no changes: kbs (v0.2.1), claude-supermemory (v2.0.1), episodic-memory (v1.0.15). Previously: 14 features implemented through 2026-03-24.*
3
+ *Updated: 2026-06-17 - Added #70 (recall-preserving fact precision on real transcripts — live obs-experiment found Layer-1 fact noise from prose/comparisons/English-word collisions; claim-context gating was measured to crater the distillation benchmark Fact F1 0.958→0.64, so the lever is downstream: wire ReferenceMaterialDetector into the ingest path / Layer-2. Observation extraction was tightened in-branch; facts left at baseline). Earlier 2026-06-16 - Added #69 (self-heal the FTS rank index after concurrent ingest — live incident: hook-vs-MCP write contention leaves `ORDER BY rank` malformed while data stays intact, silently degrading recall until a manual `compact`). Earlier 2026-06-16 - Added Mastra Observational Memory study — one High Priority item (#68, episodic observation layer: Observer + Reflector + observation→fact promotion bridge) and one Medium item (compression/cache telemetry + LongMemEval episodic suite). Key insight: ClaudeMemory has no episodic layer; observations ("what happened") complement facts ("what is true"). See `docs/influence/mastra-observational-memory.md`. Previously: 2026-05-23 - Added AI Memory Systems Landscape Analysis (Nakajima/Opus 4.6 Research article, 2026-03-26) — meta-study of 7 benchmarks + ~12 systems. Four High Priority items: graph traversal as third RRF source (#64), temporal-aware retrieval (#65), bi-temporal schema cleanup (#66), LongMemEval integration (#67). One promotion: improvement #57 (provenance-strength ranking) Medium → High, validated as the "soft epistemic separation" pattern. See `docs/influence/ai-memory-systems-2026.md`. Previously: 2026-05-01 - Added Strands Agent SOPs study (article, not repo) — one M-priority item (parameter blocks in skill frontmatter); rest already implemented or deferred. See `docs/influence/strands-agent-sops.md`. Previously: 2026-04-28 (post-0.10.0) - Restructured 1.0 punchlist around milestone versions. 
**0.11.0 "Trust & Cost"** ships #47 (token budget), #48 (hallucination rate), #51 (claude-memory show), #53 (first-week ROI nudge — moved up from post-1.0), and a 3-scenario prototype of #49 (harm benchmark). **0.12.0 "Release Discipline"** ships #49 full corpus, #50 (CLAUDE.md baseline), #52 (benchmark scoreboard). **1.0.0** lands soak-validated #54/#55/#56 if time + new #59 API stability audit. See `docs/1_0_punchlist.md` for the full plan with calendar targets. Also added 2026-04-28: two ranking-signal gaps surfaced by the Mercury / "Why Karpathy's Second Brain Breaks" article (Zaid, 2026-04-28) — provenance-strength-aware ranking (#57) and reinforcement/decay scoring (#58). Earlier 2026-04-28 updates: opened the 1.0 punchlist track + added cq study. Previously: 2026-03-30 - Re-studied all 7 influencer repos. New recommendations: CLAUDE_CONFIG_DIR support (#26, from episodic-memory), Usage Stats / ROI Tracking (#27, from grepai v0.35.0). New Features to Avoid: AST-Aware Code Chunking (QMD), Custom Instructions via Env Var (lossless-claw v0.5.2), OpenClaw Context Injection (claude-mem v10.6.0). Repos with no changes: kbs (v0.2.1), claude-supermemory (v2.0.1), episodic-memory (v1.0.15). Previously: 14 features implemented through 2026-03-24.*
4
4
  *Sources:*
5
5
  - *[thedotmack/claude-mem](https://github.com/thedotmack/claude-mem) - Memory compression system (v10.6.3, re-studied 2026-03-30)*
6
6
  - *[obra/episodic-memory](https://github.com/obra/episodic-memory) - Semantic conversation search (v1.0.15, re-studied 2026-03-30 — no changes)*
@@ -9,6 +9,7 @@
9
9
  - *[tobi/qmd](https://github.com/tobi/qmd) - On-device hybrid search engine (v2.0.1+unreleased, re-studied 2026-03-30)*
10
10
  - *[MadBomber/kbs](https://github.com/MadBomber/kbs) - Knowledge-Based System with RETE inference (v0.2.1, studied 2026-03-30 — no changes)*
11
11
  - *[martian-engineering/lossless-claw](https://github.com/martian-engineering/lossless-claw) - DAG-based lossless context management (v0.5.2, re-studied 2026-03-30)*
12
+ - *[Mastra Observational Memory](https://mastra.ai/blog/observational-memory) - Text-based dual-agent episodic memory (studied 2026-06-16)*
12
13
 
13
14
  This document contains only unimplemented improvements. Completed items are removed.
14
15
 
@@ -291,6 +292,61 @@ Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #6)
291
292
 
292
293
  ---
293
294
 
295
+ ### 69. Self-Heal the FTS Rank Index After Concurrent Ingest
296
+
297
+ Source: 2026-06-16 live incident on `claude/observational-layer-design-7662r9` — observed first-hand, not a study.
298
+
299
+ **Gap.** The contentless FTS5 index (`content_fts`) silently drifts into a broken state under concurrent writers: the ingest hook (`claude-memory hook ingest`) and the MCP server (`store_extraction` → `Index::LexicalFTS#index_content_item`) both write to the same WAL DB, and a large ingest produces `"Database busy, retrying"` (`Store::RetryHandler`) followed by an FTS index where **plain `MATCH` works but `... ORDER BY rank` raises `database disk image is malformed`**. `integrity_check` passes and all rows are intact, so `recall`/`recall_index` ranking is silently degraded (the rank query throws or returns nothing) while nothing looks wrong. The only fix today is for the user to run `claude-memory compact` manually (which runs `rebuild_fts` and vacuums). The problem is documented in the `docs/influence/...` gotchas and surfaced reactively in the dashboard (`lib/claude_memory/dashboard/api.rb:338`), but it is never repaired automatically. A severe form (btree corruption, plain `MATCH` also failing) was seen separately when **two** memory MCP servers ran concurrently; de-duping to a single server removed that, but the benign rank artifact still recurs from hook-vs-MCP contention alone.
300
+
301
+ **Implementation.**
302
+
303
+ - **Cheap probe + self-heal in the sweep tail.** In `Sweep::Maintenance`, add a `repair_fts_rank` step: run a bounded `SELECT rowid FROM content_fts WHERE content_fts MATCH 'a' ORDER BY rank LIMIT 1`; on `malformed`, call `Index::LexicalFTS#rebuild!` (already contentless-safe) instead of requiring a manual `compact`. Sweep already runs on PreCompact/SessionEnd, so recall ranking self-repairs within the session that broke it. Guard with a time budget so a huge index doesn't blow the hook timeout.
304
+ - **Reduce the contention that triggers it.** Raise the SQLite `busy_timeout` and use `BEGIN IMMEDIATE` for the FTS-writing transactions so the ingest hook and MCP server serialize cleanly instead of racing (the retry-loop WARN is the symptom). Confirm both the hook path and `ManagementHandlers#store_extraction` open connections with the same pragmas via `Store::RetryHandler`.
305
+ - **Proactive detection.** Add a `doctor` check that runs the rank probe and reports "FTS rank index needs rebuild — run `claude-memory compact`" (or auto-heals if the sweep step lands). Optionally a `roi_nudge`-style one-liner.
306
+ - **Option (larger).** Evaluate external-content FTS5 over `content_items` instead of contentless — more robust to `rebuild` and avoids the auxiliary-index drift entirely. Note as a follow-up, not part of the first fix.
307
+
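The probe-then-rebuild step from the first bullet could look roughly like the sketch below. It is deliberately database-free: `probe` and `rebuild` are injected callables standing in for the rank query and `Index::LexicalFTS#rebuild!`, and the method name, return symbols, and default budget are assumptions for illustration. The budget check here is also a simplification; a real implementation would more likely bound the rebuild itself.

```ruby
# Hypothetical sketch. `probe` and `rebuild` are injected callables; neither
# is the gem's real code. The SQL is the probe quoted in the bullet above.
RANK_PROBE_SQL =
  "SELECT rowid FROM content_fts WHERE content_fts MATCH 'a' ORDER BY rank LIMIT 1"

def repair_fts_rank(probe:, rebuild:, budget_seconds: 2.0)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  probe.call(RANK_PROBE_SQL)
  :healthy
rescue StandardError => e
  # Only the "malformed" rank artifact is self-healed; anything else propagates.
  raise unless e.message.include?("malformed")

  # Assumed guard: skip the rebuild if the probe alone already used the budget.
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  return :skipped_over_budget if elapsed > budget_seconds

  rebuild.call
  :rebuilt
end

rebuilt = false
broken_probe = ->(_sql) { raise "database disk image is malformed" }
result = repair_fts_rank(probe: broken_probe, rebuild: -> { rebuilt = true })
```

Injecting the two callables keeps the decision logic testable without a live SQLite connection, which matters because the failure mode is hard to reproduce on demand.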
308
+ **Acceptance.**
309
+
310
+ - An integration test that interleaves a large `hook ingest` with an MCP `store_extraction` against the same DB leaves `MATCH ... ORDER BY rank` working (or self-healed by the next sweep) — no manual `compact` required.
311
+ - `doctor` flags the rank-artifact when present.
312
+ - `"Database busy, retrying"` WARN frequency drops under the contention test.
313
+
314
+ **Effort.** Medium (~2 days). Self-heal step + pragmas are small; the integration test reproducing concurrent writers is the bulk.
315
+
316
+ **Why high priority.** It silently degrades `recall` — a core feature — and the user has no signal except empty/unranked results until they happen to run `compact`. Recurs on normal usage (hit live 2026-06-16). Relates to the WAL/connection-release discipline already noted for the dashboard.
317
+
318
+ ---
319
+
320
+ ### 70. Recall-Preserving Fact Precision on Real Transcripts (downstream, not regex)
321
+
322
+ Source: 2026-06-17 obs-experiment live session — observed first-hand.
323
+
324
+ **Gap.** Layer-1 `NullDistiller` fact extraction is noisy on real (large, mixed-content) transcripts. A live session on a tiny **Ruby + SQLite** project produced `uses_database = postgres/mysql`, `uses_framework = rails/express`, `uses_language = go`, `deployment_platform = docker` — all from prose mentions, comparisons, negations, instruction/skill text, and English-word collisions (`go` in "want this to go", `express` in "expressive"). This is the documented #48 hallucination problem, now characterized with live data.
325
+
326
+ **What was ruled out (with data).** Gating fact emission on a usage/claim verb near the entity (`using X`, `deployed on X`, `written in X`) was implemented and measured against `spec/benchmarks/distillation/extraction_spec.rb`:
327
+
328
+ | | Fact Precision | Fact Recall | F1 |
329
+ |---|---|---|---|
330
+ | baseline | 0.919 | 1.0 | 0.958 |
331
+ | + claim-context | 0.615 | 0.667 | **0.64** |
332
+
333
+ Regex can't separate terse legit claims (`MongoDB database`, `on AWS`, `Dockerized`) from terse prose mentions (`Postgres/MySQL buy you…`) — claim-context trades recall ~1:1 and even loses precision on the clean corpus. **Reverted.**
334
+
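For reference, the F1 column in the table is the harmonic mean of the precision and recall columns. The helper below is a small illustrative check of that arithmetic, not code from the benchmark suite.

```ruby
# F1 is the harmonic mean of precision and recall.
def f1(precision, recall)
  return 0.0 if (precision + recall).zero?

  2.0 * precision * recall / (precision + recall)
end

baseline = f1(0.919, 1.0)        # rounds to 0.958
claim_context = f1(0.615, 0.667) # rounds to 0.64
```

Because recall was already 1.0 at baseline, any recall lost to claim-context gating could only be recovered by a large precision gain, and the table shows precision falling as well.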
335
+ **Why distiller-level tightening is mostly off the table (2026-06-17 finding).** The benchmark *enforces* high recall: case `ext_ent_negative` ("We looked at MongoDB but **decided against it**") still **expects** `uses_database: mongodb`. The fact distiller is deliberately designed to "extract every mention, filter downstream," so negation/comparison **exclusion** also regresses recall (it would drop the rejected-MongoDB case). The genuine downstream precision lever is Layer-2 / the observation→fact promotion gate, which already prevents Layer-1 noise from being *committed* as corroborated facts.
336
+
337
+ **Done (recall-safe slice):**
338
+ - ✅ **English-word collision fix** — `go` is matched case-sensitively (`Go`/`golang`) so the verb "go" / "go-to" no longer fires `uses_language=go`. Benchmark Fact Precision **0.919 → 0.935**, Recall held at **1.0** (it removed a real false positive — "my go-to database"). `react`/`rust`/`express` are the same collision class and can follow the same `(?-i:)` pattern, each verified against the benchmark.
339
+
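The `(?-i:)` technique mentioned above can be illustrated as follows. The constant and method names are hypothetical and the pattern is a sketch rather than the distiller's actual regex: the outer `/i` flag keeps `golang` case-insensitive, while the inline group switches case-insensitivity off for the bare name `Go`.

```ruby
# Hypothetical pattern, not the NullDistiller's real one.
# (?-i:Go) matches only the capitalized form even though the regex is /i.
GO_LANGUAGE = /\b(?:(?-i:Go)|golang)\b/i

def mentions_go?(text)
  GO_LANGUAGE.match?(text)
end

mentions_go?("The service is written in Go") # => true
mentions_go?("we use GoLang for the CLI")    # => true
mentions_go?("want this to go")              # => false
mentions_go?("my go-to database")            # => false
```

A capitalized sentence-initial "Go" (as in "Go ahead") would still match under this sketch, which is one reason each such change is checked against the benchmark.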
340
+ **Deferred / not worth it:** wiring `ReferenceMaterialDetector` into the ingest path is a no-op for current Layer-1 (it produces stack predicates, not `convention` facts the detector targets) — skip until Layer-1 emits conventions.
341
+
342
+ **Acceptance (revised).** Distiller-level work is bounded to recall-safe English-word-collision fixes (benchmark Fact F1 ≥ baseline). Broader fact precision is owned by Layer-2 + the promotion gate, not regex.
343
+
344
+ **Effort.** Medium. Wiring the existing detector is small; extending its heuristics + a real-transcript precision fixture is the bulk.
345
+
346
+ **Why this priority.** Fact noise predates the observational layer (it's #48) and is mitigated downstream (promotion gate), so it's not a blocker — but it's the most-visible remaining quality gap on real sessions. The observational layer's Layer-1 observation extraction was tightened in this branch (high-precision/low-recall); facts were deliberately left at baseline pending this recall-preserving approach.
347
+
348
+ ---
349
+
294
350
  ## cq Study (2026-04-28)
295
351
 
296
352
  Source: docs/influence/cq.md — usefulness-focused study (not internals)
@@ -411,6 +467,122 @@ Source: `docs/influence/ai-memory-systems-2026.md` — meta-study of the Nakajim
411
467
 
412
468
  ---
413
469
 
470
+ ## Mastra Observational Memory Study (2026-06-16)
471
+
472
+ Source: `docs/influence/mastra-observational-memory.md` — architecture study of Mastra's Observational Memory (OM), a text-based dual-agent (Observer + Reflector) episodic memory that compresses raw messages into an append-only, dated observation log living in the context window. SOTA on LongMemEval (84–95%) at 3–6× compression, cache-stable by design.
473
+
474
+ **Headline finding.** In OM's taxonomy ClaudeMemory is the thing it positions against: a structured *semantic* store injected *dynamically per query*. The gap OM exposes is not retrieval quality — it's that **ClaudeMemory has no episodic layer at all.** Facts answer "what is true"; observations answer "what happened." OM is purely episodic, we are purely semantic. The two are complementary, and we already own analogues of OM's Observer (distillation pipeline) and Reflector (Resolve + Sweep) — they just emit facts, not a narrative log.
475
+
476
+ ### High Priority Recommendations
477
+
478
+ - [x] **68. Episodic Observation Layer (Observer + Reflector + promotion bridge)** ⭐ — ✅ **Shipped 2026-06-16/17** (phases 1–4)
479
+ - Value: Adds the missing episodic half of memory (narrative "what happened" log) and a cache-stable injection mode, on top of the existing semantic fact store. The promotion bridge (observation→fact on corroboration) doubles as an anti-hallucination gate for the documented reject-churn problem (distiller commits `uses_database`/`uses_framework` facts from one-off doc example text).
480
+ - Evidence: `docs/influence/mastra-observational-memory.md`. Our distill pipeline (`lib/claude_memory/distill/`) is already an Observer that emits facts; `resolve/` + `sweep/` is already a Reflector over facts. No episodic store exists.
481
+ - Implementation (phased):
482
+ 1. ✅ **Shipped 2026-06-16** (schema v19): `observations` table (`body`, `kind`, `priority` 🔴/🟡/🟢, `scope`, `source_content_item_id`, `consolidated_into` lineage, `token_count`); `Domain::Observation`; NullDistiller emits observation rows; Resolver persists them; `memory.observations` read tool. **Append-only with tombstoning, not lossy drop** — preserves provenance.
483
+ 2. ✅ **Shipped 2026-06-16**: two-block SessionStart injection via `ContextInjector` — Block 1 = observation log (🔴 marked, 🟡/🟢 stripped as Mastra does for the actor) ahead of Block 2 = undistilled tail; `Observe::ObservationsRenderer` shared with the published `.claude/rules/claude_memory.observations.md` snapshot; `observation_count` added to the `hook_context` activity event for token/compression measurement.
484
+ 3. ✅ **Shipped 2026-06-17**: `Observe::Reflector` — deterministic, free (no LLM) GC. Dedupes near-identical active observations into the newest (tombstone via `consolidated_into`) and expires stale info-level (🟢) ones past a TTL (`observation_info_ttl_days`, default 30); 🔴/🟡 never expire. Provenance-preserving (rows tombstoned, never deleted). Wired into `Sweep::Maintenance#reflect_observations` → `Sweeper#run!`, so it runs on the existing `PreCompact`/`SessionEnd` sweep — context-pressure-triggered, the analog of Mastra's ~40k-token threshold. (Semantic "merge related/surface patterns" deferred to phase 4 — needs the LLM.)
485
+ 4. ✅ **Shipped 2026-06-17** (schema v20): the observation→fact **promotion bridge**. Dedup folds duplicates' `corroboration_count` into the keeper; once an observation crosses `Domain::Observation::PROMOTION_THRESHOLD` (2) sightings it becomes a promotion candidate. `memory.promote_observation` creates the fact via the resolver, links provenance, marks the observation promoted, and **refuses uncorroborated observations server-side** — the anti-hallucination gate. `ContextInjector` surfaces candidates in a SessionStart "## Observation Reflection" section instructing Claude to promote inline (automatic semantic reflection, no extra API cost); a manual `/reflect` skill drives deep on-demand passes. (Trigger is SessionStart rather than PreCompact — the already-wired free injection point; a PreCompact context hook is a possible future refinement.)
486
+ 5. ✅ **Shipped 2026-06-17** (branch `claude/observational-layer-complete`): the LLM half. **Layer-2 Claude-as-observer** — the SessionStart extraction prompt asks Claude to emit episodic observations in the `observations` field of `memory.store_extraction` (coerced/validated at the handler border, persisted via the resolver), making the log rich where Layer-1 regex is high-precision/low-recall. **Semantic reflection** — `memory.consolidate_observations` merges related-but-differently-worded observations into one synthesized row with *combined* corroboration (which can tip it over the promotion gate), tombstoning the sources; surfaced in the reflection section + `/reflect`. (Also fixed a latent Liskov bug: `ReferenceMaterialDetector#reclassify` dropped observations when a fact was present.)
487
+ 6. ✅ **Shipped 2026-06-17** (branch `claude/observational-layer-complete`): observability + measurement. `Dashboard::Observations` panel (`/api/observations`, Advanced → Observations tab) — counts by status/kind/priority, corroboration + promotion readiness, recent timeline, and a **compression ratio** (source content tokens ÷ observation tokens, Mastra-style). The compression metric is the measurement half of design rec E; a full LongMemEval-style episodic benchmark remains (overlaps #67).
488
+ 7. ✅ **Shipped 2026-06-18** (branch `claude/observational-layer-complete`): polish. **PreCompact reflection trigger** — `claude-memory hook context` injects only the reflection nudge (`ContextInjector#reflection_context`) on PreCompact (context pressure, the Mastra token-threshold analog), wired into the standard PreCompact hook set; not the full snapshot. **Observation↔fact provenance** — `memory.observations` exposes status/corroboration_count/promoted_fact_id/consolidated_into; `memory.explain(fact_id)` shows `promoted_from_observations` (reverse link via `observations_for_fact`). Full observational layer complete; remaining: LongMemEval episodic benchmark (#67).
489
+ - Effort: Large, phased. Phase 1 ~2-3 days; full arc ~2 weeks. Reuses distill/resolve/sweep/publish/context-hook machinery and `context_tokens` telemetry.
490
+ - Trade-off: reflection is automatic on *lifecycle events* (compaction/session boundaries), not a wall clock — Claude Code has no timer/cron hook, and Routines/subagents incur separate token budgets (rejected). Observer/Reflector reuse the existing session (no extra API cost). **Augments dynamic recall, does not replace it (user-confirmed 2026-06-16).** See claude-code-guide consultation in the influence doc.
491
+
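Phases 3 and 4 combine into a small deterministic loop: fold near-duplicates into the newest row, tombstone the older ones, and treat the summed corroboration as the promotion signal. The sketch below illustrates that idea only. The `Obs` struct, the Jaccard token-overlap measure, and the 0.8 similarity cutoff are assumptions; the source names `Observe::TokenOverlapMatcher` and `PROMOTION_THRESHOLD` (2) but does not spell out the matching rule.

```ruby
require "set"

PROMOTION_THRESHOLD = 2    # from the phase 4 description above
SIMILARITY_THRESHOLD = 0.8 # assumption: the real matcher's cutoff is not stated

Obs = Struct.new(:id, :body, :corroboration_count, :consolidated_into,
                 keyword_init: true)

def tokens(text)
  text.downcase.scan(/[a-z0-9_]+/).to_set
end

# Jaccard overlap of the two token sets (an assumed similarity measure).
def overlap(a, b)
  ta = tokens(a)
  tb = tokens(b)
  union = ta | tb
  union.empty? ? 0.0 : (ta & tb).size.to_f / union.size
end

# Observations are assumed ordered oldest to newest. Each older near-duplicate
# is tombstoned into the newest match, and its count is folded into the keeper.
def dedupe!(observations)
  observations.each_with_index do |older, i|
    next if older.consolidated_into

    keeper = observations[(i + 1)..].reverse.find do |newer|
      newer.consolidated_into.nil? &&
        overlap(older.body, newer.body) >= SIMILARITY_THRESHOLD
    end
    next unless keeper

    keeper.corroboration_count += older.corroboration_count
    older.consolidated_into = keeper.id # tombstone, never delete
  end
  observations
end

# The gate: tombstoned or single-sighting observations are refused.
def promotable?(obs)
  obs.consolidated_into.nil? && obs.corroboration_count >= PROMOTION_THRESHOLD
end

a = Obs.new(id: 1, body: "Decided to use SQLite for the project store", corroboration_count: 1)
b = Obs.new(id: 2, body: "decided to use sqlite for the project store", corroboration_count: 1)
c = Obs.new(id: 3, body: "Prefers tabs over spaces", corroboration_count: 1)
dedupe!([a, b, c])
```

After the pass, `a` is tombstoned into `b`, `b` carries a corroboration count of 2 and passes the gate, and `c` remains a one-off that cannot be promoted.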
492
+ ### Medium Priority
493
+
494
+ - [ ] **Compression / cache telemetry + LongMemEval episodic suite** (see influence doc rec E)
495
+ - Value: Report compression ratio and token reduction on Trust/Health panels using existing `context_tokens` events (0.11.0). Add a LongMemEval-style long-session suite to DevMemBench to score the episodic layer. Overlaps with existing item #67 (LongMemEval integration) — coordinate.
496
+ - Effort: Medium. Depends on #68 phase 1-2.
497
+
498
+ ### Features to Avoid (from this study)
499
+
500
+ - **Two always-on background LLM agents** — violates the no-separate-API-call convention. Observer = context-hook injection; Reflector = deterministic shell-side GC + `PreCompact`-injected semantic consolidation (rides the existing session).
501
+ - **Claude Code Routines / subagents for recurring reflection** — Routines run as a separate scheduled cloud session; subagents run in their own context (~7× tokens). Both incur extra spend; reserve only for an explicitly opted-in one-off backfill.
502
+ - **Lossy drop on reflection** ("never forgives") — we tombstone via `consolidated_into` and retain raw `content_items`; provenance is non-negotiable.
503
+ - **Replacing dynamic recall with a wholesale-loaded log** — augment, don't replace; keep `memory.recall` for targeted lookups.
504
+
505
+ ---
506
+
507
+ ### 71. Exclude the project DB from the published gem (gem is 28MB, mostly the ~96MB dogfooding DB compressed)
508
+
509
+ Source: 2026-06-18 live observation while building the 0.13.0 release gem.
510
+
511
+ **Problem.** `claude_memory.gemspec` builds its file list from `git ls-files` and rejects `bin/ Gemfile .gitignore .rspec spec/ .github/ .standard.yml` — but **not** `.claude/memory.sqlite3`, which is tracked (per the "always commit the project DB" convention). So the published gem *ships the repo's own dogfooding memory database*: the working-tree DB is ~96MB, compressing to a **28MB gem** (v0.6.0 was 280KB; the gem has been silently growing — 0.9.1 was 19MB — as the DB accumulates). Gem users get nothing from it (they init their own empty DB on install), it bloats every download, and it's trending toward RubyGems' 100MB ceiling.
512
+
513
+ **Fix.** Add `.claude/` (or at least `.claude/memory.sqlite3` + WAL/SHM siblings) to the gemspec reject filter. Verify with `gem build` that the gem drops to <1MB and that nothing in the gem actually requires the file at runtime (it shouldn't — runtime opens the *user's* DB path via `Configuration`). Add a spec asserting `Gem::Specification.load(...).files` excludes `.claude/memory.sqlite3` so it can't regress.
514
+
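The proposed filter change amounts to one more prefix in the reject list. The sketch below is illustrative only: `packaged_files` and the sample file list are hypothetical, and the prefixes mirror the list quoted in the problem statement plus the proposed `.claude/` entry rather than the actual contents of `claude_memory.gemspec`.

```ruby
# Hypothetical sketch of the reject filter, not the gem's actual gemspec.
EXCLUDED_PREFIXES = %w[
  bin/ Gemfile .gitignore .rspec spec/ .github/ .standard.yml .claude/
].freeze

# Takes the tracked file list (e.g. the output of `git ls-files`) and drops
# anything under an excluded prefix.
def packaged_files(tracked_files)
  tracked_files.reject { |path| path.start_with?(*EXCLUDED_PREFIXES) }
end

tracked = %w[
  lib/claude_memory.rb
  lib/claude_memory/version.rb
  .claude/memory.sqlite3
  .claude/memory.sqlite3-wal
  .claude-plugin/plugin.json
  spec/claude_memory_spec.rb
  README.md
]
files = packaged_files(tracked)
```

Note that the `.claude/` prefix leaves `.claude-plugin/` entries in place but would also drop other files under `.claude/`, so the narrower option of excluding only the database and its WAL/SHM siblings may be preferable if the gem needs anything else from that directory.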
515
+ **Why High.** Low effort, high impact: ~28MB → <1MB published gem, and it removes a slow-growing landmine before it actually exceeds the RubyGems size limit and blocks a release. Not introduced by 0.13.0 — pre-existing and compounding.
516
+
517
+ **Note on the convention.** This does *not* conflict with "always commit `.claude/memory.sqlite3`" — that's about repo reproducibility for collaborators. Shipping it *in the gem* is a separate, unintended consequence of the `git ls-files` manifest.
518
+
519
+ ---
520
+
521
+ > **Observational-layer audit (2026-06-23).** A critical examination of every observation in this project's DB found the episodic layer is, in practice, producing ~no useful observations and is injecting noise into sessions. Four root causes (#72–#75), each backed by the live data below. Snapshot at audit time: **113 active observations, 0 consolidated, 0 expired, 0 promoted, every `corroboration_count = 1`**; only `decision`/`preference` kinds; every row traces to a `claude_code` transcript on the `observational-layer-*` branches. The count grew 99 → 105 → 113 *during* the audit session — the dogfooding loop is live and compounding. These supersede the optimistic framing of #68; the mechanism is sound, the inputs and the matching are not.
522
+
523
+ ### 72. Layer-2 (Claude-as-observer) produces **zero** observations — the quality source is silent ⭐
524
+
525
+ Source: 2026-06-23 observational-layer audit.
526
+
527
+ **🟡 Structural fix shipped 2026-06-23** (option a). Decoupled the observation-capture ask from the buried, rarely-fired deep-distill paragraph into its own prominent SessionStart section (`ContextInjector#format_observation_capture_prompt` — "## Log What Happened"). This maximizes the chance Claude authors observations, but persistence still rides a `store_extraction` tool call, so **effectiveness is not yet proven** — whether Layer-2 now actually fires is measurable via the `mcp_extraction` content-item source and needs real-session validation (and ultimately the #75 eval). Not closing this until that signal turns positive.
528
+
529
+ **Problem.** The design's quality observations were always meant to come from **Layer-2** (Claude-as-observer): the SessionStart prompt (`ContextInjector#format_distillation_prompt`) asks Claude to populate the `observations` field of its `memory.store_extraction` call. **The vehicle is dormant.** Evidence (sharpened 2026-06-23 with `mcp_tool_calls.called_at` + `activity_events`):
530
+ - `store_extraction` was invoked **4 times ever — all on 2026-04-17 to 2026-04-30**, i.e. *six-plus weeks before the observational layer (and the `observations` parameter) shipped on 2026-06-16/18*. Those calls **could not** have carried observations; the field didn't exist yet.
531
+ - **Since the layer shipped, `store_extraction` has fired ZERO times.** Layer-2 has never run, not once, in the feature's entire life. (Corroborating: `store_extraction` creates a synthetic `source: "mcp_extraction"` content_item; **0 of the 113 observations trace to `mcp_extraction`** — all to `claude_code` ingest.)
532
+ - **Layer-1 auto-ingest dominates ~100:1**: `activity_events` shows **409 `hook_ingest` vs 4 `store_extraction` total**. Content flows in automatically on every Stop/SessionEnd hook without Claude's involvement, so the Layer-2 deep-distill path — gated on a *fresh session* with *undistilled ≥200-char* items that Claude must *choose* to act on — essentially never triggers.
533
+
534
+ **Why this is the highest-leverage finding.** Layer-1 (regex over raw transcript) *cannot* produce episodic narrative — it only scrapes fragments (see #74; #74 makes it high-precision, not narrative). The design delegated quality to Layer-2, but Layer-2 is structurally dormant: it isn't a prompt-wording problem (the `store_extraction` schema *does* expose `observations` with a good description, and the prompt *does* ask) — it's that **the path the observations ride on doesn't fire** in normal operation. So the episodic log is, and will remain, 100% Layer-1 scrapes until observation authoring is moved onto a path that actually runs.
535
+
536
+ **Fix (design fork — not a one-shot code change).** Options:
537
+ - **(a) Author observations on the Layer-1 hook path, but with an LLM.** The hook can't call Claude (no API budget), so this means: have the *next* SessionStart context inject the raw undistilled tail and ask Claude to emit observations directly as part of the normal turn (not gated behind a voluntary `store_extraction` deep-distill). Rides the session, no extra cost — same mechanism the fact-injection uses.
538
+ - **(b) Make Layer-2 fire reliably** — lower the fresh-session/≥200-char gate, or make observation emission a first-class, early, non-optional instruction. Risk: effectiveness is unmeasurable without real A/B sessions, and the headless-recall gap (`project_headless_retrieval_gap.md`) says Claude often won't call MCP tools at all.
539
+ - **(c) Derive observations from what Claude already produces** — the `decisions`/`facts` it extracts are higher-signal than regex scrapes; synthesize observations from those deterministically.
540
+ - Regardless: **add telemetry distinguishing Layer-1 vs Layer-2 observation provenance** so "is Layer-2 firing?" is a dashboard number, not a forensic dig.
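The provenance split asked for in the last bullet can be sketched as a plain tally. This is a minimal illustration, assuming each observation row can be joined to the `source` of its content item; the `ProvenanceRow` struct and `provenance_counts` helper are hypothetical names, and only the `claude_code` and `mcp_extraction` source values come from the audit text.

```ruby
# Hypothetical row shape: an observation joined to its content item's source.
ProvenanceRow = Struct.new(:id, :source, keyword_init: true)

# Map content-item sources onto layers (source names taken from the audit).
LAYER_BY_SOURCE = {
  "claude_code" => :layer1,     # hook auto-ingest
  "mcp_extraction" => :layer2   # Claude-authored via store_extraction
}.freeze

# Tally observations per layer; unrecognized sources land in :unknown.
def provenance_counts(rows)
  counts = Hash.new(0)
  rows.each { |row| counts[LAYER_BY_SOURCE.fetch(row.source, :unknown)] += 1 }
  counts
end

rows = [
  ProvenanceRow.new(id: 1, source: "claude_code"),
  ProvenanceRow.new(id: 2, source: "claude_code"),
  ProvenanceRow.new(id: 3, source: "mcp_extraction")
]
counts = provenance_counts(rows)
```

A dashboard could surface `counts[:layer2]` directly, so a zero there reads as "Layer-2 is silent" without querying tool-call history.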
541
+
542
+ **Cross-links.** Blocks the value premise of #68; #74/#73 only make the Layer-1 floor less bad and the loop functional — neither makes the log *good*. This is the one that does.
543
+
544
+ ### 73. Observation dedup/corroboration is normalized-**exact**, so the promotion loop can never fire ⭐
545
+
546
+ **✅ Shipped 2026-06-23.** Replaced exact-string grouping with greedy clustering over an injected similarity matcher (`Reflector#dedupe_scope`). The default matcher is `Observe::TokenOverlapMatcher`: lexical Jaccard token overlap (deterministic, free, no embedding dependency) with a threshold of 0.5. It folds the common case (one event re-observed with slightly different wording, so corroboration now accumulates and can cross the promotion gate) while keeping unrelated statements apart (Jaccard ~0). **Deliberate limitation, data-driven:** on short bodies, measured TF-IDF cosine scores pure synonym paraphrases at 0.32 and unrelated pairs at 0.13, too narrow a gap to separate them reliably, so neither the default lexical matcher nor TF-IDF folds "use SQLite" vs "chose SQLite". That case needs real embeddings, which can be injected via the `matcher:` seam (any object responding to `similar?(a, b)`) when fastembed is configured.
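The clustering described above can be sketched as follows. This is an illustration under stated assumptions, not the shipped `Observe::TokenOverlapMatcher`: the class name `TokenOverlapSketch`, the tokenizer, and the `greedy_clusters` helper are hypothetical; only the `similar?(a, b)` interface and the 0.5 threshold come from the text.

```ruby
require "set"

# Hypothetical stand-in for a lexical matcher. Only the similar?(a, b)
# interface and the 0.5 default threshold are taken from the text above.
class TokenOverlapSketch
  def initialize(threshold: 0.5)
    @threshold = threshold
  end

  def similar?(a, b)
    jaccard(tokens(a), tokens(b)) >= @threshold
  end

  private

  # Jaccard index: size of the intersection over size of the union.
  def jaccard(x, y)
    union = x | y
    return 0.0 if union.empty?
    (x & y).size.to_f / union.size
  end

  # Assumed tokenizer: lowercase alphanumeric runs.
  def tokens(text)
    text.downcase.scan(/[a-z0-9_]+/).to_set
  end
end

# Greedy clustering: each body joins the first cluster whose first member
# it resembles, otherwise it starts a new cluster.
def greedy_clusters(bodies, matcher)
  bodies.each_with_object([]) do |body, clusters|
    home = clusters.find { |cluster| matcher.similar?(cluster.first, body) }
    home ? home << body : clusters << [body]
  end
end

bodies = [
  "decided to use SQLite for storage",
  "Decided to use SQLite for the storage",
  "PreCompact hook set"
]
clusters = greedy_clusters(bodies, TokenOverlapSketch.new)
```

Here the first two bodies share six of seven distinct tokens and fold into one cluster, while the third shares none and stays apart. Any object with the same `similar?` method, for example an embedding-backed one, could be passed in place of the sketch matcher.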
547
+
548
+ Source: 2026-06-23 observational-layer audit.
549
+
550
+ **Problem.** `Observe::Reflector#dedupe` folds observations by `group_by { [scope, normalize(body)] }` where `normalize` is just `downcase` + whitespace-collapse + strip. Two observations corroborate **only if their bodies are byte-identical after lowercasing**. Real captures of "the same thing" are never byte-identical — e.g. the four stored variants "PreCompact hook set.", "PreCompact hook set — the design's Mastra-token-threshold analog.", "PreCompact set alongside ingest + sweep." describe one event but never fold. Result, confirmed in the data: **every observation has `corroboration_count = 1`; 0 consolidated; 0 promoted.** The corroboration gate — the layer's headline anti-hallucination feature — is **dead by construction** on any varied text. It can only fire if the *exact same string* recurs, which regex fragments from different transcript chunks essentially never do.
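A small sketch of why exact grouping stays at one sighting per row. The `normalize` helper follows the description above (downcase, whitespace-collapse, strip); the `"project"` scope value and the second, case-and-spacing variant are assumptions added to show the only kind of difference this key tolerates.

```ruby
# normalize as described above: downcase, collapse whitespace, strip.
def normalize(body)
  body.downcase.gsub(/\s+/, " ").strip
end

variants = [
  "PreCompact hook set.",
  "precompact   hook set. ",                  # differs only in case and spacing
  "PreCompact set alongside ingest + sweep."  # same event, different wording
]

# Group on [scope, normalized body]; the scope value is assumed.
groups = variants.group_by { |body| ["project", normalize(body)] }
group_sizes = groups.values.map(&:size)
```

Only the case-and-spacing variant folds with the first row; the reworded capture of the same event lands in its own group, so its corroboration count never rises above 1.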
551
+
552
+ **Fix.** Corroboration/dedup must be **semantic**, not exact: reuse the existing embedding stack (`Embeddings` + sqlite-vec) to fold observations above a similarity threshold, or fold on a normalized *subject+kind* key rather than the full body. Until then, the promotion gate provides no value and the "graduate after 2 sightings" story is unsupported. Add a spec that two paraphrases of one event corroborate.
553
+
554
+ **Cross-links.** Without this, #72's quality observations still wouldn't promote.
555
+
556
+ ### 74. Layer-1 Observer ingests code/doc/transcript fragments; `noise_body?` lets ~⅓ through
557
+
558
+ **✅ Shipped 2026-06-23** (commit `d81a684`). Strengthened `NOISE_BODY_SIGNATURE` (code/JSON `key: "value"`, method calls, spaced table pipes, box-drawing glyphs, `(vector)` labels, JSONL fields) and added a prose-start requirement. Verified against the audit corpus: all five real noise samples now rejected, clean prose kept. **Residual (not a regression):** *truncated-prose* fragments with no code signature (e.g. "encompasses how to use fr…") can still slip — that's the greedy `.+` capture, and ultimately the Layer-2 question (#72), not the noise filter.
559
+
560
+ Source: 2026-06-23 observational-layer audit.
561
+
562
+ **Problem.** The Layer-1 Observer runs `decided to (.+)` / `we always|never (.+)` over **raw transcript text**, which on this repo (and any repo whose sessions discuss code) is saturated with trigger phrases inside source, specs, docs, and tool output. The `noise_body?` filter (`NOISE_BODY_SIGNATURE = /\bdef\s|\bclass\s|\bmodule\s|=>|::|","|":\s*"|[{}]|\$\(|&&|\|\|/`) is tuned for code-*syntax* and misses prose/table/transcript fragments. Measured against the live 113: the filter catches **39**, but **38 obvious-noise rows slip through** (≈44% look like noise by a conservative heuristic; manual review puts it higher). Concrete slipped examples actually sitting in the injected log:
563
+ - `[89] decided to use SQLite", kind: "decision", priority: 1) expect(id).to be_a(Integer)…` — a **spec fixture line**.
564
+ - `[104] decided to gate promotion on corroboration" | | Changes | Explicitly…` — a **CHANGELOG table row** (`| |` ≠ `||`, so it dodges the filter).
565
+ - `[48]–[55] / first-person `we always|never`)…` — fragments of the **distiller's own source-code comment**.
566
+ - `[99] · (vector) 78 ├─ How frozen_string_literal…` — **benchmark tree output**.
567
+
568
+ These are priority-1 `decision` rows, so they *are* injected into Block 1 of SessionStart (observed live in this session's own context) — spending context budget on garbage and risking misdirection (e.g. `[46] decided to use Postgres.`, a fixture string, implying a stack the project doesn't use).
569
+
570
+ **Fix.** Make Layer-1 high-precision-or-silent: reject bodies that look like code/markdown/transcript (leading `-`/`#`/`|`, table pipes, `key: "value"` shapes, tree glyphs `├─└─`, `(vector)`, backtick-dense spans, JSONL artifacts) — invert the default from high-recall to high-precision, since the recall here is ~all noise. Pair with the `ContentSanitizer`/Observer border. (This is the P1 item from the 2026-06-18 quality review, now empirically confirmed and worse than estimated.)
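The high-precision direction can be sketched like this. `NOISE_BODY_SIGNATURE` is the pattern quoted in the Problem paragraph; everything else (`EXTRA_NOISE_PATTERNS`, `PROSE_START`, `noise_body_sketch?`) is a hypothetical illustration of the shapes the Fix lists, not the shipped filter. The sample bodies are shortened from the slipped rows quoted above.

```ruby
# Signature quoted in the Problem paragraph above.
NOISE_BODY_SIGNATURE = /\bdef\s|\bclass\s|\bmodule\s|=>|::|","|":\s*"|[{}]|\$\(|&&|\|\|/

# Hypothetical additions covering shapes named in the Fix paragraph.
EXTRA_NOISE_PATTERNS = [
  /\s\|\s/,        # spaced table pipe
  /\b\w+:\s*"/,    # key: "value"
  /[├└─│]/,        # tree / box-drawing glyphs
  /\(vector\)/,    # benchmark label
  /\w\([^)]*\)/    # method call such as expect(id)
].freeze

# Assumed prose-start requirement: the body must open with a letter.
PROSE_START = /\A[[:alpha:]]/

def noise_body_sketch?(body)
  !body.match?(PROSE_START) ||
    body.match?(NOISE_BODY_SIGNATURE) ||
    EXTRA_NOISE_PATTERNS.any? { |pattern| body.match?(pattern) }
end

slipped = [
  'use SQLite", kind: "decision", priority: 1) expect(id).to be_a(Integer)',
  'gate promotion on corroboration" | | Changes | Explicitly',
  "· (vector) 78 ├─ How frozen_string_literal"
]
clean = "use SQLite for local storage"

missed_by_original = slipped.reject { |body| body.match?(NOISE_BODY_SIGNATURE) }
rejected_by_sketch = slipped.select { |body| noise_body_sketch?(body) }
```

The quoted signature misses all three samples, while the sketch rejects them and still keeps the clean prose body. Truncated prose with no code shape would still pass, which matches the residual noted in the shipped note.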
571
+
572
+ **Cross-links.** Even fully fixed, Layer-1 is a stopgap until #72; together they decide whether the log is signal or noise.
573
+
574
+ ### 75. The episodic layer has no fair test — this repo is a pathological self-pollution case
575
+
576
+ Source: 2026-06-23 observational-layer audit.
577
+
578
+ **Problem.** Every observation traces to this project's *own* `claude_code` design transcripts, whose specs literally contain `insert_observation(body: "decided to use SQLite")` and whose docs are full of "decided to…" prose. claude_memory dogfooding on its own repo is the **worst possible self-test** for the Observer — it maximizes trigger-text density and self-ingestion. So the audit above measures *self-pollution*, not the design's ceiling; a normal Rails/Django app would look very different. We currently have **no measurement of the layer's value on a representative project**, and the optimistic compression/promotion story in #68 was never validated.
579
+
580
+ **Fix.** Stand up the deferred **LongMemEval-style episodic suite** (#67/#68 medium item) and/or capture a real non-claude_memory project trace as a fixture, and report observation precision (signal vs noise), corroboration/promotion rates, and compression on *that*. Treat "episodic layer adds value" as **unproven** in public materials until this exists (the 0.13.0 blog draft already hedges accordingly). Until then, the self-pollution makes the dashboard Observations panel actively misleading on this repo.
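The metrics named above could be reported from a hand-labeled fixture along these lines. This is a sketch only: the `LabeledObservation` struct, its `signal` label, and the `episodic_report` helper are hypothetical, and the four rows are made-up illustration data, not measurements.

```ruby
# Hypothetical labeled fixture row: signal is a human judgment (true = useful).
LabeledObservation = Struct.new(:signal, :corroboration_count, :promoted, keyword_init: true)

# Precision, corroboration rate, and promotion rate over a labeled set.
def episodic_report(rows)
  return {} if rows.empty?
  total = rows.size.to_f
  {
    precision: rows.count(&:signal) / total,
    corroboration_rate: rows.count { |row| row.corroboration_count > 1 } / total,
    promotion_rate: rows.count(&:promoted) / total
  }
end

# Illustrative data only.
fixture = [
  LabeledObservation.new(signal: true,  corroboration_count: 2, promoted: true),
  LabeledObservation.new(signal: true,  corroboration_count: 1, promoted: false),
  LabeledObservation.new(signal: true,  corroboration_count: 1, promoted: false),
  LabeledObservation.new(signal: false, corroboration_count: 1, promoted: false)
]
report = episodic_report(fixture)
```

Running the same report on this repo's own data and on a representative project's trace would show how much of the audit result is self-pollution.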
581
+
582
+ **Cross-links.** Gates any future episodic value claim; depends on #72–#74 being fixed first to be worth measuring.
583
+
584
+ ---
585
+
414
586
  ## Medium Priority
415
587
 
416
588
  ### ~~18. Shell Completion for CLI~~ ✅ Implemented 2026-03-20
@@ -0,0 +1,198 @@
1
+ # Mastra Observational Memory — Influence Study
2
+
3
+ *Analysis Date: 2026-06-16*
4
+ *Source: Mastra "Observational Memory" (announcement + docs + research, Feb 2026)*
5
+ *Type: Architecture study (feature/paradigm, not a full repo clone)*
6
+ *Status: Design exploration — no code yet. Branch `claude/observational-layer-design-7662r9`.*
7
+
8
+ *Sources:*
9
+ - *[Announcing Observational Memory (Mastra blog)](https://mastra.ai/blog/observational-memory)*
10
+ - *[Observational Memory docs](https://mastra.ai/docs/memory/observational-memory)*
11
+ - *[Observational Memory research / LongMemEval](https://mastra.ai/research/observational-memory)*
12
+ - *[VentureBeat: "Observational memory cuts AI agent costs 10x..."](https://venturebeat.com/data/observational-memory-cuts-ai-agent-costs-10x-and-outscores-rag-on-long)*
13
+ - *[The Decoder: traffic-light priority system](https://the-decoder.com/mastras-open-source-ai-memory-uses-traffic-light-emojis-for-more-efficient-compression/)*
14
+
15
+ ---
16
+
17
+ ## Executive Summary
18
+
19
+ ### What this is
20
+
21
+ Mastra Observational Memory (OM) is a **text-based, dual-agent episodic memory** for long-running agents. It compresses raw message history into a structured, append-only log of dated **observations** that lives entirely in the LLM context window — no vector or graph DB. It reports state-of-the-art LongMemEval scores (84.23% with gpt-4o; 94.87% with gpt-5-mini) at 3–6× token compression.
22
+
23
+ ### Why ClaudeMemory cares
24
+
25
+ In Mastra's taxonomy, ClaudeMemory is the thing OM positions *against*: a structured **semantic** store (subject-predicate-object facts with scope, validity windows, supersession, provenance) injected **dynamically per query** via `memory.recall` and SessionStart fact injection.
26
+
27
+ The key realization from this study: **ClaudeMemory has no episodic layer at all.** Facts answer "what is true." Observations answer "what happened." OM is purely episodic; ClaudeMemory is purely semantic. An observational layer is not redundant with distillation — it is the missing half.
28
+
29
+ We already own two of OM's four moving parts in spirit:
30
+ - The **distillation pipeline** (NullDistiller + Layer-2 Claude-as-distiller) is an Observer that emits *facts* instead of a *narrative log*.
31
+ - **Resolve + Sweep** is a Reflector that operates on *facts* instead of *observations*.
32
+
33
+ The work is therefore: add a narrative episodic store, point the existing Observer/Reflector machinery at it, add a cache-stable injection mode, and — uniquely to us — bridge observations into facts via corroboration.
34
+
35
+ ---
36
+
37
+ ## How Mastra OM Works
38
+
39
+ ### Two-block context window
40
+
41
+ 1. **Observation block** — a compressed, append-only log of dated observations (decisions, key events, distilled facts from older messages). Reads like a log of decisions and actions, not documentation.
42
+ 2. **Raw tail** — recent messages not yet compressed.
43
+
44
+ ### The Observer
45
+
46
+ Fires when raw message tokens cross ~30k (configurable). A separate background agent compresses messages into new dated observations appended to the observation block. Each observation captures one discrete event: a user statement, an agent action, a tool-call result, or a preference expressed in passing. Mastra reports 3–6× compression from this step.
47
+
48
+ ### The Reflector
49
+
50
+ Fires when observations cross ~40k tokens (configurable). A separate background agent garbage-collects: combines related items, reflects on overarching patterns, and drops context that no longer matters.
51
+
52
+ ### Traffic-light priority
53
+
54
+ Observations carry 🔴 (important) / 🟡 (maybe important) / 🟢 (info only). The priority is **internal** to the Observer/Reflector pipeline. When observations are presented to the main "Actor" agent, 🟡 and 🟢 are stripped — only 🔴 survives — because the priority emojis serve the memory pipeline and are visual noise to the actor.
55
+
56
+ ### Prompt-cache stability (the headline win)
57
+
58
+ Because the observation block is **append-only between reflections**, the prompt prefix stays stable and every turn gets a full cache hit. Cache invalidates only on a reflection, which is infrequent. This is explicitly contrasted with RAG-style memory that re-retrieves and rewrites the prompt every turn, busting the cache and producing a variable cost curve.
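The prefix property behind this can be shown with plain strings. This is a toy sketch: it does not model any provider's cache, the observation lines are invented, and `append_observation` is a hypothetical helper.

```ruby
# Append-only: between reflections the block only grows at the end.
def append_observation(block, line)
  block + line + "\n"
end

before = "2026-06-16 decided to augment recall\n"
after = append_observation(before, "2026-06-17 PreCompact hook set")

# The earlier block is still an exact prefix of the new one, so a cache
# keyed on the prompt prefix can keep matching it.
prefix_stable = after.start_with?(before)

# A rewrite (reflection, or per-turn re-retrieval) changes the prefix.
rewritten = "2026-06-17 PreCompact hook set\n" + before
prefix_broken = !rewritten.start_with?(before)
```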
59
+
60
+ ### Storage
61
+
62
+ Plain text in a standard backend (Postgres / LibSQL / MongoDB), loaded directly into the context window — not pulled through embedding search.
63
+
64
+ ---
65
+
66
+ ## Comparative Analysis vs ClaudeMemory
67
+
68
+ | Dimension | Mastra OM | ClaudeMemory today |
69
+ |-----------|-----------|--------------------|
70
+ | Memory type | Episodic (narrative log) | Semantic (SPO facts) |
71
+ | Storage | Plain text in context window | Normalized SQLite + FTS5 + vec0 |
72
+ | Retrieval | None — log loaded wholesale | Dynamic per-query (FTS + vector RRF) |
73
+ | Compression | Observer (LLM), 3–6× | NullDistiller + Claude-as-distiller → facts |
74
+ | Consolidation | Reflector (LLM), lossy drop | Resolve (supersession) + Sweep (TTL/GC) |
75
+ | Provenance | Weak — compression is lossy | Strong — provenance receipts, lineage |
76
+ | Cache behavior | Stable append-only prefix | Per-query injection (cache-busting) |
77
+ | Cost | Two background LLM agents (extra API $) | Claude-as-distiller, zero extra API $ |
78
+
79
+ **The two systems are complementary, not competing.** OM's weakness is exactly ClaudeMemory's strength (provenance, truth maintenance) and vice versa (episodic recall, cache-stable injection).
80
+
81
+ ---
82
+
83
+ ## Adoption Opportunities (prioritized)
84
+
85
+ ### High Priority
86
+
87
+ **A. Episodic observation store + Layer-1 Observer.** New `observations` table (schema v19 — v18 was taken by OTel telemetry); NullDistiller emits observation rows alongside facts; `memory.observations` read tool. Append-only with `consolidated_into` lineage (mirrors `fact_links`) rather than Mastra's lossy drop — preserves our provenance guarantee. Zero behavior change to facts.
88
+
89
+ **B. Cache-stable injection.** Publish `.claude/rules/claude_memory.observations.md` (append-only, dated, 🔴+plain only — 🟡/🟢 stripped as Mastra does for the actor). SessionStart injects a two-block context: Block 1 = consolidated observations (stable, cache-friendly), Block 2 = recent undistilled tail. Front-loading a stable block reduces the per-turn `memory.recall` churn that busts caching. *Honest limit:* we influence Claude Code's cache via a stable `additionalContext` prefix within a session; we don't control it. Cross-session caching remains Claude Code's domain.
90
+
91
+ **C. The observation→fact promotion bridge (unique to us).** The Reflector promotes *corroborated* observations into structured facts. An observation is low-commitment; a fact is committed truth. Requiring repeated, corroborated sightings before promotion is a natural confidence gate, and it directly mitigates the documented hallucination problem where the distiller commits `uses_database`/`uses_framework` facts from one-off example text in docs (today producing reject churn). With observation-first, fact-on-corroboration, a premature hallucinated fact never commits.
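The gate can be sketched as a simple filter. This is a minimal illustration, assuming a per-observation `corroboration_count` and an `"active"` status; the `PromotionCandidate` struct, the `promotable` helper, and the threshold of two sightings are assumptions for the example, not a confirmed implementation.

```ruby
# Hypothetical observation shape for the promotion gate.
PromotionCandidate = Struct.new(:body, :corroboration_count, :status, keyword_init: true)

# Assumed threshold: promote after two corroborating sightings.
PROMOTION_THRESHOLD = 2

# Select active observations that have been seen often enough to promote.
def promotable(observations, threshold: PROMOTION_THRESHOLD)
  observations.select do |obs|
    obs.status == "active" && obs.corroboration_count >= threshold
  end
end

observations = [
  PromotionCandidate.new(body: "decided to use SQLite", corroboration_count: 3, status: "active"),
  PromotionCandidate.new(body: "decided to use Postgres", corroboration_count: 1, status: "active"),
  PromotionCandidate.new(body: "old decision", corroboration_count: 4, status: "consolidated")
]
ready = promotable(observations)
```

A one-off mention, such as example text in a doc, stays an observation because its count never reaches the threshold.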
92
+
93
+ ### Medium Priority
94
+
95
+ **D. Automatic Reflector (free) — confirmed feasible.** A consultation with the claude-code-guide agent (2026-06-16) confirms automatic reflection is achievable with zero extra API cost, in two tiers:
96
+ - **Deterministic tier (fully autonomous, no model):** dedupe near-identical observations, drop stale 🟢 past a TTL, merge by entity/time window — pure Ruby, run shell-side inside the `PreCompact` and `SessionEnd` hooks (and the existing Sweep). This needs no model and fires automatically.
97
+ - **Semantic tier (autonomous-on-next-turn, rides the session):** at `PreCompact`, the hook injects a reflection instruction via `additionalContext` ("consolidate the observation log: combine related items, surface patterns, drop the irrelevant"). Claude Code itself performs the consolidation on its next turn, inside the existing session — no separate paid call.
98
+
99
+ See the dedicated section below for why `PreCompact` is the right trigger and what the constraints are. This **supersedes** the earlier "manual `/reflect` only" recommendation: `/reflect` remains as a manual on-demand deep pass, but reflection is now primarily automatic.
100
+
101
+ **E. Compression / cache telemetry.** Reuse the `context_tokens` telemetry on `hook_context` events (0.11.0) and the Trust/Health panels to report compression ratio and token reduction. Add a LongMemEval-style episodic/long-session suite to DevMemBench alongside the existing retrieval and truth-maintenance suites.
102
+
103
+ ### Features to Avoid (from this study)
104
+
105
+ - **Two always-on background LLM agents.** Violates the standing convention against features requiring separate Anthropic API calls. Our Observer = context-hook injection (Claude-as-distiller); our Reflector = deterministic shell-side GC + `PreCompact`-injected semantic consolidation that rides the existing session (see automatic-reflection section).
106
+ - **Claude Code Routines / subagents for reflection.** Routines run as a separate scheduled cloud session (separate token budget); subagents run in their own context window (~7× token burn). Both incur extra spend — rejected for recurring reflection. Reserve them, if ever, for a one-off heavy backfill the user explicitly opts into.
107
+ - **Lossy drop on reflection.** Mastra truly discards observations ("never forgives"). We tombstone via `consolidated_into` and retain raw `content_items` — provenance is non-negotiable.
108
+ - **Replacing dynamic recall.** Augment, don't replace. Observations become a front-loaded episodic block; `memory.recall` stays for targeted lookups.
109
+
110
+ ---
111
+
112
+ ## Proposed Data Model (sketch)
113
+
114
+ ```
+ observations (schema v19)
+   id, ts (event time), session_id
+   body                    -- dense narrative text, the observation itself
+   kind                    -- user_statement | agent_action | tool_result | preference | decision | event
+   priority                -- 1=🔴 important, 2=🟡 maybe, 3=🟢 info (internal pipeline signal)
+   scope, project_path
+   source_content_item_id  -- provenance back to the raw transcript chunk
+   consolidated_into       -- Reflector lineage (mirrors fact_links supersession)
+   token_count             -- for budget / compression math
+   status, created_at, reflected_at
+ ```
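As a rough illustration of the sketch above, the same fields can be held in a Ruby value object. The field names, kinds, and priority levels come from the sketch; the `ObservationRecord` struct and its `valid?` and `tombstoned?` helpers are hypothetical and say nothing about how the real migration or store is written.

```ruby
# Kinds and priorities as listed in the data-model sketch.
OBSERVATION_KINDS = %w[user_statement agent_action tool_result preference decision event].freeze
PRIORITY_LABELS = { 1 => "🔴", 2 => "🟡", 3 => "🟢" }.freeze

# Hypothetical in-memory mirror of one observations row.
ObservationRecord = Struct.new(
  :id, :ts, :session_id, :body, :kind, :priority, :scope, :project_path,
  :source_content_item_id, :consolidated_into, :token_count, :status,
  :created_at, :reflected_at,
  keyword_init: true
) do
  # A row is well-formed when it has a body, a known kind, and a known priority.
  def valid?
    !body.to_s.strip.empty? &&
      OBSERVATION_KINDS.include?(kind) &&
      PRIORITY_LABELS.key?(priority)
  end

  # A consolidated row keeps its lineage link instead of being deleted.
  def tombstoned?
    !consolidated_into.nil?
  end
end

record = ObservationRecord.new(
  body: "decided to augment recall", kind: "decision", priority: 1, status: "active"
)
```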
126
+
127
+ ## Proposed Pipeline Integration
128
+
129
+ ```
+ Transcripts → Ingest → Index (FTS5)
+
+   ┌─────────────── Distill ───────────────┐
+   │                                       │
+ Facts (SPO, semantic)        Observations (narrative, episodic)    ← NEW
+   │                                       │
+ Resolve (truth maint.)       Reflect (consolidate / GC / pattern)  ← NEW
+   │                                       │
+ Store (facts)                Store (observations)                  ← NEW
+   │                                       │
+   └──────────── Promotion bridge ──────────┘
+      (Reflector promotes corroborated observations → facts)
+
+ Publish: stable observation block (cache-friendly) + fact snapshot
+ ```
145
+
146
+ ## Automatic Reflection in Claude Code (consultation findings, 2026-06-16)
147
+
148
+ Source: claude-code-guide agent consultation. Citations: [Hooks reference](https://code.claude.com/docs/en/hooks.md), [Subagents](https://code.claude.com/docs/en/subagents.md), [Routines / scheduled tasks](https://code.claude.com/docs/en/web-scheduled-tasks).
149
+
150
+ **What does not exist:** There is no timer-, cron-, or idle-based hook event. Hook events are lifecycle-driven only — `SessionStart`, `SessionEnd`, `UserPromptSubmit`, `Stop`/`StopFailure`, `PreCompact`/`PostCompact`, `PreToolUse`/`PostToolUse(Failure)`, plus async signals (`FileChanged`, etc.). No hook can force a model turn or enqueue a prompt; a hook can only inject `additionalContext` that the model acts on at its *next* invocation.
151
+
152
+ **What this unlocks anyway:** `PreCompact` is the right reflection trigger because it fires precisely when the context window is filling — i.e. on *context pressure*. That is conceptually the same signal Mastra uses (Reflector fires at a ~40k-token observation threshold). So "reflect when memory gets big" maps cleanly onto "reflect when Claude Code is about to compact."
153
+
154
+ **The free automatic pattern (recommended):**
155
+ - `PreCompact` + `SessionEnd` hooks run the **deterministic** Reflector shell-side in Ruby (dedupe / TTL-drop 🟢 / merge) — fully autonomous, no model, no cost.
156
+ - `PreCompact` injects an `additionalContext` instruction that makes Claude perform the **semantic** consolidation (pattern-finding, observation→fact promotion) on its next turn, inside the existing session — no separate paid call.
157
+ - `SessionStart` injects the consolidated two-block observation log (already in recommendation B).
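One piece of the deterministic tier, the TTL drop for 🟢 rows, might look like the sketch below. It is illustrative only: the `ReflectRow` struct, the `expire_stale_green` helper, the 14-day TTL, and the `"expired"` status value used here are assumptions, and the hook wiring is left out.

```ruby
# Hypothetical row shape for the deterministic pass.
ReflectRow = Struct.new(:id, :priority, :created_at, :status, keyword_init: true)

GREEN_PRIORITY = 3        # 3 = info only, per the data-model sketch
SECONDS_PER_DAY = 86_400

# Mark info-only rows older than the TTL as expired instead of deleting them.
# The 14-day default is an assumption for illustration.
def expire_stale_green(rows, now:, ttl_days: 14)
  cutoff = now - ttl_days * SECONDS_PER_DAY
  rows.each do |row|
    next unless row.status == "active" && row.priority == GREEN_PRIORITY
    row.status = "expired" if row.created_at < cutoff
  end
end

now = Time.utc(2026, 6, 23)
rows = [
  ReflectRow.new(id: 1, priority: 3, created_at: now - 30 * SECONDS_PER_DAY, status: "active"),
  ReflectRow.new(id: 2, priority: 1, created_at: now - 30 * SECONDS_PER_DAY, status: "active"),
  ReflectRow.new(id: 3, priority: 3, created_at: now - 2 * SECONDS_PER_DAY, status: "active")
]
expire_stale_green(rows, now: now)
statuses = rows.map(&:status)
```

Because the pass is pure Ruby over stored rows, it needs no model call and could run from any lifecycle hook that already executes shell-side.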
158
+
159
+ **Where extra cost is unavoidable (and therefore rejected):** truly autonomous *between-session* reflection on a wall clock. That requires Claude Code Routines (separate paid cloud session) or a headless `claude -p` call or a subagent (~7× tokens) — all separate spend. We accept the tradeoff: our reflection is automatic on *lifecycle events* (compaction, session boundaries), not on a wall-clock timer. For our single-developer, local-first scale this is sufficient.
160
+
161
+ ## Suggested Phasing
162
+
163
+ 1. Schema + Layer-1 Observer (table, NullDistiller rows, `memory.observations`).
164
+ 2. Stable two-block injection; measure token/compression deltas.
165
+ 3. **Automatic Reflector**: deterministic GC shell-side in `PreCompact` + `SessionEnd`/Sweep.
166
+ 4. **Automatic semantic reflection**: `PreCompact` `additionalContext` consolidation instruction + observation→fact promotion bridge. Keep a manual `/reflect` skill for on-demand deep passes.
167
+
168
+ Phase 4 is where this stops being "Mastra-on-Ruby" and becomes a hybrid episodic+semantic system stronger than either alone.
169
+
170
+ ## Decisions for ClaudeMemory (memory-convention format)
171
+
172
+ Per the `/study-repo` memory discipline, the following are decisions about **claude_memory itself** derived from this study — to be stored via `memory.store_extraction` (`subject=claude_memory`, `decision`/`architecture` predicate, reason clause embedded) once the memory MCP server is connected. External facts about Mastra stay in this influence doc, not in memory.
173
+
174
+ - **Decision:** claude_memory will add an episodic observation layer that *augments* (does not replace) the dynamic-recall semantic fact store — because facts answer "what is true" and observations answer "what happened," and we currently have no episodic half; recall stays for targeted lookups while observations provide a stable front-loaded narrative. (User-confirmed "augment" on 2026-06-16.)
175
+ - **Decision:** observation reflection will be automatic via the `PreCompact` and `SessionEnd` hooks rather than a manual-only skill — because Claude Code exposes no timer/cron hook, but `PreCompact` fires on context pressure (the analog of Mastra's token-threshold trigger) and rides the existing session at no extra API cost.
176
+ - **Decision:** the Reflector's deterministic GC runs shell-side in Ruby and its semantic consolidation runs via `PreCompact` `additionalContext` (Claude-as-reflector inline) — to keep automatic reflection within the no-extra-API-cost convention, explicitly rejecting Claude Code Routines and subagents because each incurs a separate token budget.
177
+ - **Decision:** reflection will tombstone superseded observations via a `consolidated_into` link rather than hard-deleting them (unlike Mastra's lossy drop) — to preserve claude_memory's provenance guarantee while still bounding context size.
178
+ - **Decision:** an observation is promoted to a structured fact only after corroboration across multiple observations — because requiring repeated sightings before commitment doubles as an anti-hallucination gate against the documented reject-churn from one-off doc/example text.
179
+
180
+ > **Pending memory ingestion (deferred 2026-06-16).** These were NOT written to the project DB in the remote design session: `.claude/memory.sqlite3` is git-LFS-backed and the container had only the pointer (no `git-lfs`, real DB not materialized), so writing would have clobbered it. Store them in a local session via `memory.store_extraction` with this payload:
181
+ >
182
+ > ```json
183
+ > {
184
+ > "scope": "project",
185
+ > "facts": [
186
+ > {"subject": "claude_memory", "predicate": "decision", "object": "Add an episodic observation layer that augments (does not replace) the dynamic-recall semantic fact store, because facts answer 'what is true' and observations answer 'what happened' and we currently lack the episodic half; recall stays for targeted lookups."},
187
+ > {"subject": "claude_memory", "predicate": "decision", "object": "Make observation reflection automatic via the PreCompact and SessionEnd hooks rather than a manual-only skill, because Claude Code exposes no timer/cron hook but PreCompact fires on context pressure (the analog of Mastra's token threshold) and rides the existing session at no extra API cost."},
188
+ > {"subject": "claude_memory", "predicate": "decision", "object": "Run the Reflector's deterministic GC shell-side in Ruby and its semantic consolidation via PreCompact additionalContext (Claude-as-reflector inline), to keep automatic reflection within the no-extra-API-cost convention, explicitly rejecting Claude Code Routines and subagents because each incurs a separate token budget."},
189
+ > {"subject": "claude_memory", "predicate": "decision", "object": "Tombstone superseded observations via a consolidated_into link rather than hard-deleting them (unlike Mastra's lossy drop), to preserve claude_memory's provenance guarantee while still bounding context size."},
190
+ > {"subject": "claude_memory", "predicate": "decision", "object": "Promote an observation to a structured fact only after corroboration across multiple observations, because requiring repeated sightings before commitment doubles as an anti-hallucination gate against reject-churn from one-off doc/example text."}
191
+ > ]
192
+ > }
193
+ > ```
194
+
195
+ ## Open Questions
196
+
197
+ - **Augment vs replace recall?** Resolved: **augment** (user-confirmed 2026-06-16). Observations become a front-loaded episodic block; `memory.recall` stays for targeted lookups.
198
+ - **Automatic vs manual reflection?** Resolved: **automatic** via `PreCompact`/`SessionEnd` (deterministic GC shell-side + semantic consolidation injected for the next turn), with `/reflect` retained for manual deep passes. The only thing we forgo is wall-clock between-session reflection, which would cost extra (Routines/subagents) — deliberately rejected.