claude_memory 0.10.0 → 0.11.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/docs/architecture.md CHANGED
@@ -40,7 +40,7 @@ ClaudeMemory is architected using Domain-Driven Design (DDD) principles with cle
 
  **Components:**
  - **CLI** (`cli.rb`): Thin router that dispatches to command classes
- - **Commands** (`commands/`): 32 command classes, each handling one CLI command
+ - **Commands** (`commands/`): 34 command classes, each handling one CLI command
  - **Configuration** (`configuration.rb`): Centralized ENV access and path calculation
 
  **Key Principles:**
@@ -205,7 +205,7 @@ end
  - **Server**: WEBrick HTTP server (default port 3377), starts via `claude-memory dashboard`
  - **API**: HTTP-shape glue + per-endpoint formatting; routes/delegates to panel classes
  - **Panels** (each backed by a dedicated class with focused responsibility):
- - `Trust`: weekly moments, fingerprint, utilization, feedback ratio, needs-review
+ - `Trust`: weekly moments, fingerprint, utilization, feedback ratio, needs-review, **token_budget** (p50/p95/avg over 30d, 0.11.0+), **quality_score** (live 30-day window + historical baseline, 0.11.0+)
  - `Moments`: feed-first activity stream with kind classification
  - `Knowledge`: predicate-grouped fact summary (incl. References section)
  - `Conflicts`: display-layer dedup with bulk-reject helper
@@ -361,7 +361,7 @@ FileSystem (write)
  - Value objects (SessionId, TranscriptPath, FactId)
  - Centralized Configuration
  - 4 domain models with business logic
- - 32 command classes
+ - 34 command classes
  - 25 MCP tools
  - Semantic search with local embeddings (FastEmbed + TF-IDF fallback)
  - Schema v17 with WAL mode
data/docs/dashboard.md CHANGED
@@ -31,7 +31,8 @@ The dashboard is **feed-first**: the main view is a chronological stream of
 
  ### Sidebar — Trust
 
- Three at-a-glance signals so you can answer "is memory helping?" in one look:
+ At-a-glance signals so you can answer "is memory helping?" and "what does
+ it cost?" in one look:
 
  - **This week's moments** — count of value-producing events (recall hits,
  context injections, extractions). Includes a week-over-week delta.
@@ -40,6 +41,16 @@ Three at-a-glance signals so you can answer "is memory helping?" in one look:
  - **Needs review** — open conflicts (deduped to distinct contradictions) +
  stale facts (active but not recalled in the configured window) + empty
  recalls (queries that returned nothing).
+ - **Token budget (30d)** *(0.11.0+)* — p50/p95/avg `context_tokens` injected
+ per SessionStart over the last 30 days, with sample size. Answers "what
+ does memory cost per session?" — pairs with the digest's "Context cost"
+ section and `claude-memory stats --tokens`.
+ - **Quality score (live, 30d)** *(0.11.0+)* — 0–100 hallucination-rate
+ proxy. `score = 100 - (suspect_pct + bare_pct)` where suspect = facts
+ retagged as `predicate=reference` and bare = decision/convention facts
+ whose object skipped the prompt-mandated reason clause. Headline is the
+ live 30-day window; the underlying snapshot also exposes a `historical`
+ block over all active facts for context. Returns 100 on empty stores.
  - **Utilization (30d)** — of facts extracted in the last 30 days, what % has
  Claude actually surfaced via recall or context injection. Color-coded
  (green ≥40%, yellow ≥15%, red below). Hidden on fresh installs.
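The quality-score arithmetic described above reduces to a few lines. A minimal Ruby sketch, with illustrative names; the reason-clause regex is an assumption standing in for the prompt's actual vocabulary, and "suspect" is simplified to a predicate check:

```ruby
# Hedged sketch of the Trust panel's quality score:
# 100 minus the suspect and bare-conclusion percentages.
# REASON_CLAUSE is an assumed stand-in for the prompt's
# "because…" / "so that…" / "to avoid…" vocabulary.
REASON_CLAUSE = /because|so that|to avoid/i

def quality_score(active_facts)
  return 100 if active_facts.empty? # empty stores score 100

  suspect = active_facts.count { |f| f[:predicate] == "reference" }
  bare = active_facts.count do |f|
    %w[decision convention].include?(f[:predicate]) &&
      !f[:object].match?(REASON_CLAUSE)
  end

  suspect_pct = 100.0 * suspect / active_facts.size
  bare_pct    = 100.0 * bare / active_facts.size
  (100 - (suspect_pct + bare_pct)).round
end
```

On an empty store this returns 100, matching the documented behavior.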
@@ -161,8 +172,17 @@ WAL writer lock open across page loads.
  ## Related CLI
 
  - `claude-memory digest [--since DAYS] [--output FILE]` — markdown report of
- the same Trust + Knowledge + Conflicts + Feedback signals, suitable for
- email or commit-into-repo.
+ the same Trust + Knowledge + Conflicts + Feedback signals plus
+ **Context cost** (token-budget p50/p95) and **Quality** (score + rejection
+ rate) sections. Suitable for email or commit-into-repo.
+ - `claude-memory show [--pending] [--source SOURCE]` *(0.11.0+)* — print
+ what memory would inject at the next SessionStart in plain Markdown.
+ Same `Hook::ContextInjector` path real sessions use, so the output
+ matches what Claude actually receives. Footer reports fact count, ~token
+ estimate, and char count.
+ - `claude-memory stats --tokens [--since DAYS]` *(0.11.0+)* — token budget
+ histogram (p50/p95/avg/min/max + bucketed distribution) for SessionStart
+ context injections. Same data the Trust panel's Token budget block aggregates.
  - `claude-memory census [--root DIR]` — privacy-safe cross-project
  predicate vocabulary scan; pairs with the Knowledge panel for "what
  predicates does my whole tree use?".
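For the p50/p95/avg figures that `stats --tokens` and the Token budget block report, a nearest-rank percentile over the per-session samples is enough. A hypothetical helper, not the gem's implementation:

```ruby
# Nearest-rank-style percentiles over per-session context_tokens samples,
# returning the same p50/p95/avg/n shape the Token budget block reports.
# Illustrative only; the gem's actual aggregation may differ.
def token_budget_stats(samples)
  return nil if samples.empty?

  sorted = samples.sort
  pct = ->(p) { sorted[((p / 100.0) * (sorted.size - 1)).round] }
  {
    p50: pct.call(50),
    p95: pct.call(95),
    avg: samples.sum / samples.size, # integer average of token counts
    n:   samples.size
  }
end
```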
data/docs/improvements.md CHANGED
@@ -1,6 +1,6 @@
  # Improvements to Consider
 
- *Updated: 2026-04-28 - Opened the 1.0 punchlist track (see `docs/1_0_punchlist.md`). High-priority entries below now include the must-have 1.0 items: token-budget telemetry (#47), hallucination-rate metric (#48), negative-fact harm benchmark (#49), CLAUDE.md baseline publication (#50), `claude-memory show` (#51), benchmark scoreboard diff (#52). Post-1.0 entries: first-week ROI nudge (#53), real-session repeat-correction detector (#54), token-cost growth tracking (#55), drift dashboard (#56). Earlier 2026-04-28 update added cq study (usefulness-focused). Previously: 2026-03-30 - Re-studied all 7 influencer repos. New recommendations: CLAUDE_CONFIG_DIR support (#26, from episodic-memory), Usage Stats / ROI Tracking (#27, from grepai v0.35.0). New Features to Avoid: AST-Aware Code Chunking (QMD), Custom Instructions via Env Var (lossless-claw v0.5.2), OpenClaw Context Injection (claude-mem v10.6.0). Repos with no changes: kbs (v0.2.1), claude-supermemory (v2.0.1), episodic-memory (v1.0.15). Previously: 14 features implemented through 2026-03-24.*
+ *Updated: 2026-04-28 (post-0.10.0) - Restructured 1.0 punchlist around milestone versions. **0.11.0 "Trust & Cost"** ships #47 (token budget), #48 (hallucination rate), #51 (claude-memory show), #53 (first-week ROI nudge — moved up from post-1.0), and a 3-scenario prototype of #49 (harm benchmark). **0.12.0 "Release Discipline"** ships #49 full corpus, #50 (CLAUDE.md baseline), #52 (benchmark scoreboard). **1.0.0** lands soak-validated #54/#55/#56 (time permitting) plus the new #59 API stability audit. See `docs/1_0_punchlist.md` for the full plan with calendar targets. Also added 2026-04-28: two ranking-signal gaps surfaced by the Mercury / "Why Karpathy's Second Brain Breaks" article (Zaid, 2026-04-28): provenance-strength-aware ranking (#57) and reinforcement/decay scoring (#58). Earlier 2026-04-28 updates: opened the 1.0 punchlist track + added cq study. Previously: 2026-03-30 - Re-studied all 7 influencer repos. New recommendations: CLAUDE_CONFIG_DIR support (#26, from episodic-memory), Usage Stats / ROI Tracking (#27, from grepai v0.35.0). New Features to Avoid: AST-Aware Code Chunking (QMD), Custom Instructions via Env Var (lossless-claw v0.5.2), OpenClaw Context Injection (claude-mem v10.6.0). Repos with no changes: kbs (v0.2.1), claude-supermemory (v2.0.1), episodic-memory (v1.0.15). Previously: 14 features implemented through 2026-03-24.*
  *Sources:*
  - *[thedotmack/claude-mem](https://github.com/thedotmack/claude-mem) - Memory compression system (v10.6.3, re-studied 2026-03-30)*
  - *[obra/episodic-memory](https://github.com/obra/episodic-memory) - Semantic conversation search (v1.0.15, re-studied 2026-03-30 — no changes)*
@@ -152,10 +152,14 @@ Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #2). Builds on
 
  ---
 
- ### 49. Negative-Fact Harm Benchmark
+ ### 49. Negative-Fact Harm Benchmark — *prototype in 0.11.0, full corpus in 0.12.0*
 
  Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #3). Parallels #32 (Repeat-Correction Benchmark) but inverts the goal.
 
+ **Two-phase delivery (added 2026-04-28):**
+ - **0.11.0 — 3-scenario prototype (~½d).** Three hand-written cases (one stale-tech, one mismatched-scope, one superseded-but-undetected) run against real Claude under `EVAL_MODE=real`. Smoke test: if even three cases produce a >0% harm rate, there is a fundamental issue the full 0.12 benchmark would only confirm, and we want to know early. No release gate yet — the prototype is diagnostic.
+ - **0.12.0 — full 10-15 scenario corpus (~2d).** Adds the missing harm classes (reference-material-as-fact + remaining stale/mismatched/superseded cases) and wires the >1% harm-rate release gate.
+
 
  **Gap.** Every benchmark we run measures whether memory **helps** (Recall@k, MRR, e2e pass rate, repeat-correction prevention rate). Nothing measures whether memory **harms** — i.e. holds a wrong/stale fact and causes Claude to follow it. Without this, "memory helps" is unfalsifiable.
 
  **Implementation.**
@@ -328,9 +332,9 @@ IndexCommand builds text→embedding cache from already-embedded facts before in
 
  In Ruby fallback path (`search_by_vector_fallback`), facts are grouped by `embedding_json` before cosine similarity computation. Unique embeddings scored once, results fanned out to all matching fact_ids. Native sqlite-vec path unaffected (handles own dedup).
 
- ### 53. First-Week ROI Nudge
+ ### 53. First-Week ROI Nudge — *targeted for 0.11.0 (moved up from post-1.0)*
 
- Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #7). Closes the cold-start gap.
+ Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #7). Closes the cold-start gap. **Moved up from post-1.0 to 0.11.0** in the 2026-04-28 path-to-1.0 restructure — fits the "Trust & Cost" theme since it's the user-visible proof that memory is doing work.
 
  **Gap.** New users install the gem, run a few sessions, and don't know whether memory is working. The dashboard exists but they have to know to look. The auto-memory mirror (#36) helps but isn't surfaced. We need a low-friction nudge in the first ~10 sessions that says "memory is working, here's what it did" — and then gets out of the way.
 
@@ -442,6 +446,126 @@ Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #10). Builds on
 
  ---
 
+ ### 59. API Stability Audit (1.0 release blocker)
+
+ Source: 2026-04-28 path-to-1.0 review (`docs/1_0_punchlist.md` #11). Added after 0.10.0 ship. *(Renumbered from #57 to #59 during rebase against origin/main on 2026-04-28 — Mercury-article PR #5 had already taken #57 and #58.)*
+
+ **Gap.** "1.0 commits to semver" is meaningless without an explicit public/internal split. Many of the surfaces touched in 0.9.0 / 0.10.0 (MCP tool schemas, hook payload shapes, CLI flags, dashboard endpoints) have evolved organically and aren't formally documented as stable vs. internal. Without this audit, future "regression" complaints become un-arbitrable — was that flag/method/tool *promised*? We don't know.
+
+ **Implementation.**
+
+ - **New `docs/api_stability.md`** as the authoritative public-API reference. Sections:
+ 1. *Public CLI surface*: every `claude-memory <subcommand>` registered in `Commands::Registry::COMMANDS`, every documented flag, with stability tier per command (`stable` / `experimental` / `internal`).
+ 2. *Public MCP tools*: every entry in `MCP::ToolDefinitions.all` with its argument schema, return shape, and tool-annotation hints (`readOnlyHint`, `idempotentHint`, `destructiveHint`). Stability tier per tool.
+ 3. *Public hook contract*: payload field names accepted by `Hook::Handler` and `Commands::HookCommand`, return shapes (`hookSpecificOutput`, exit codes via `Hook::ExitCodes`), stability tier per hook event.
+ 4. *Public Ruby API*: the surface external Ruby callers can rely on. Candidates: `ClaudeMemory::Recall`, `Configuration`, `Store::StoreManager`, `Domain::*`. Everything else (resolver internals, dashboard internals, sweep internals) marked internal.
+ 5. *Schema stability*: column names, table names, predicate vocabulary in `PredicatePolicy::POLICIES`. Schema migrations remain forward-compatible per the round-trip-spec convention; column *removals* require deprecation cycle.
+ - **Deprecation policy paragraph**: "we'll mark X deprecated in N.x.0 (with a runtime warning), keep it functional for ≥1 minor cycle, and remove no earlier than (N+1).0.0." Mirrors Ruby/Rails conventions.
+ - **Deprecation-warning instrumentation**: tiny module `ClaudeMemory::Deprecations` with a `warn(name, replacement:, removed_in:)` helper. Anywhere we want to change a public surface in 1.x, we wrap with `Deprecations.warn` first.
+ - **README + CLAUDE.md** add a top-level link: "Public API: see [docs/api_stability.md](docs/api_stability.md)".
+
+ **Acceptance.**
+
+ - `docs/api_stability.md` exists and lists every CLI command, MCP tool, hook event, and key Ruby class with a stability tier.
+ - A reader of the doc can answer "is `claude-memory dashboard --port` stable?" / "will `Recall.new(manager).query(...)` keep its signature in 1.x?" in <30 seconds.
+ - `ClaudeMemory::Deprecations.warn` is wired up and used at least once (e.g. for a soon-to-be-renamed flag) so the mechanism is exercised.
+ - `/release` skill knows about `docs/api_stability.md` and reminds the operator to update it on any public-surface change.
+
+ **Edge cases.**
+
+ - We have to be honest about which Ruby surfaces are public. `Recall` and `Configuration` clearly are; `Sweep::Maintenance` clearly isn't; `Domain::Fact` is ambiguous (used by external benchmark adapters in `spec/benchmarks/`). Default to **internal** when ambiguous — easier to promote later than demote.
+ - Schema column names are tricky. Migrations can rename safely; external SQL tools (e.g. cq) read the schema directly. Document the column names as "best-effort stable, no removal without deprecation cycle."
+ - The dashboard JSON API is internal — explicitly call this out so users don't build scripts against it.
+
+ **Effort.** ~2 days. The doc is the bulk of the time; the deprecation warning module is ~50 LOC.
+
+ **Why 1.0 must-have.** Without this, the semver promise is vibes. Future regressions in non-listed areas can be argued away; future regressions in listed areas are bugs. Forces honesty about what we're committing to.
+
+ ---
+
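A sketch of the proposed `ClaudeMemory::Deprecations` module, following the `warn(name, replacement:, removed_in:)` signature given in the bullets above (assumed shape, not shipped code):

```ruby
# Minimal sketch of the proposed deprecation-warning helper. Policy:
# warn at call time, keep the surface working for >=1 minor cycle.
module ClaudeMemory
  module Deprecations
    # Emits a one-line warning on stderr; once-per-process dedup
    # would be an obvious extension.
    def self.warn(name, replacement:, removed_in:)
      Kernel.warn(
        "[claude-memory] DEPRECATED: #{name} will be removed in " \
        "#{removed_in}; use #{replacement} instead."
      )
    end
  end
end
```

A call site wrapping a soon-to-be-renamed flag would then read `Deprecations.warn("--old-flag", replacement: "--new-flag", removed_in: "2.0.0")`.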
+ ### 60. LLM Extractor Calibration Drift (surfaced by #48)
+
+ Source: 2026-04-30 production verification of #48 hallucination-rate metric. Surfaced when the metric was first run against real data on this very project.
+
+ **The signal.** First run of `claude-memory digest` against `claude_memory/.claude/memory.sqlite3` after the metric landed:
+
+ | Number | Value | Verdict |
+ |---|---|---|
+ | Quality score | 39/100 | bad |
+ | Suspect (predicate=`reference`) | 2 / 59 (3.4%) | acceptable |
+ | Bare conclusions (decision/convention without reason) | 34 / 59 (57.6%) | poor |
+ | 7-day rejection rate | 27 of 32 facts (84.4%) | very bad |
+
+ **What it means.** The 84% rejection rate over 7 days says the LLM extractor in this project was producing noise faster than usable knowledge — almost everything new it created got rejected within a week. The 57.6% bare-conclusion rate confirms drift from the prompt's *"every decision/convention MUST embed a reason clause"* requirement: the prompt asks for "because…" / "so that…" / "to avoid…" but recent extractions skipped the reason clause a majority of the time.
+
+ **Why this is a finding, not a metric bug.** Spot-checked 5 flagged + 5 unflagged facts on 2026-04-30; the detector's regex correctly matches the prompt's strict reason-clause vocabulary in both directions. Not a false-positive issue. The metric is doing what it was designed to do: surface real LLM calibration drift that was previously invisible.
+
+ **Possible causes (to investigate).**
+
+ 1. **Prompt drift in `lib/claude_memory/commands/skills/distill-transcripts.md`** — the reason-clause requirement may have been added to the prompt after a chunk of older facts were already extracted. Mostly historical noise rather than ongoing extraction problem. → check `git log -p lib/claude_memory/commands/skills/distill-transcripts.md` for when the reason-clause section landed and whether bare-conclusion facts cluster pre-that-commit.
+ 2. **Auto-memory mirror regurgitation** — the `Hook::AutoMemoryMirror` (0.10.0) injects auto-memory file content as extraction candidates at SessionStart. If those auto-memory files have bare-conclusion content (likely, since they're written by Claude with no reason-clause discipline), the LLM may be re-extracting them faithfully without injecting reasons that weren't in the source. → grep auto-memory file content for the same bare conclusions appearing in flagged DB facts.
+ 3. **Reference-material guard too narrow** — `ReferenceMaterialDetector` only retags `convention` predicates; "From QMD restudy: adopt X" facts (clearly third-party-project descriptions) come back as `decision` rather than `reference` and stay in the corpus. → expand `GUARDED_PREDICATES` to include `decision` for the same patterns.
+ 4. **High rejection rate is correct + the corpus is junky** — 84% rejection in last 7 days might mean we (the team) are correctly rejecting noise that the LLM is producing too aggressively. → check whether rejected facts cluster by source (transcript topic, hook event type, time-of-day).
+
+ **Acceptance / next steps.**
+
+ - Investigation note in `docs/quality_review.md` capturing which of (1)–(4) above explains the bulk of the drift.
+ - If prompt drift (cause 1): the historical bulk-flag is fine, the live extraction rate is what matters. Expose "extraction rate" over a tighter window (last 24h vs 30-day baseline) so calibration drift becomes visible without historical noise drowning the signal.
+ - If auto-memory regurgitation (cause 2): patch the auto-memory-mirror prompt or distillation prompt to require reason-clause synthesis even when source text is bare.
+ - If reference-material guard too narrow (cause 3): expand `Distill::ReferenceMaterialDetector::GUARDED_PREDICATES` and re-run `claude-memory reclassify-references --predicate decision` against active corpus.
+ - If correct + junky (cause 4): the metric is healthy; the cleanup is `claude-memory reject` runs against high-frequency junk.
+
+ **Effort.** Investigation: 0.5d. Fix: depends on cause.
+
+ **Why this is in `improvements.md`.** Independently of which cause is correct, the verification of #48 surfaced a real signal worth tracking. The metric did its job (turning invisible drift into a visible 84%); now the work is the actual cleanup. Tracked here so it doesn't fall off the radar between 0.11 ship and the 1.0 soak.
+
+ **Update 2026-04-30: investigation complete.** Diagnostics ran for all four causes; results recorded in `docs/quality_review.md`. Summary: cause 1 (prompt drift) explains 97% of bare conclusions; cause 4 (`/study-repo` misattribution burst) explains 100% of the 7-day rejection cluster; causes 2 and 3 ruled out. Headline metric calibration fix landed in commit `7591da4` (live 30-day window + historical block). The two systemic issues split into entries #61 and #62 below.
+
+ ---
+
+ ### 61. /study-repo Misattribution Guard
+
+ Source: 2026-04-30 #60 investigation, cause 4. All 27 rejected facts in this project's 7-day window were `uses_database` (18) or `deployment_platform` (9) with `session_id=nil` (MCP-originated), all from a 2-day burst on 2026-04-23 to 04-24. The pattern: when running `/study-repo` on an external project, the LLM extracted that project's tech stack and asserted it as facts about *this* project. Cleanup happened correctly via `claude-memory reject` after detection, but the round-trip is wasteful and noisy.
+
+ **Implementation.**
+
+ - New `Distill::ExternalAttributionDetector` (sister to `ReferenceMaterialDetector`). Runs after extraction and before storage.
+ - Heuristics: when the source content_item text contains markers like "studying X", "/study-repo", a non-current-project repo URL, or "external project", strongly bias toward `predicate=reference` for any `uses_database`/`deployment_platform`/`uses_framework` extraction.
+ - Optional: extend `Hook::ContextInjector` or the distillation prompt to make this constraint explicit ("when discussing an external repository, do NOT extract its tech stack as project-level facts").
+
+ **Acceptance.**
+
+ - Re-run a `/study-repo` on a fresh DB; observe zero `uses_database` or `deployment_platform` facts inserted that point to the external project's tech.
+ - The 27 rejected facts cluster from this project's history doesn't reappear in similar scenarios.
+
+ **Effort.** ~½ day. Detector is mostly regex + content_item text inspection. Prompt addition is trivial.
+
+ ---
+
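The detector heuristic above reduces to a predicate-plus-markers check. A sketch with assumed marker regexes and method name, taken from the bullets rather than real gem code:

```ruby
# Illustrative core of the proposed Distill::ExternalAttributionDetector:
# bias tech-stack extractions toward predicate=reference when the source
# text carries external-repo study markers. Regexes are assumptions.
EXTERNAL_MARKERS = [
  /studying\s+\S+/i,   # "studying X"
  %r{/study-repo},     # the slash command itself
  /external project/i
].freeze
STACK_PREDICATES = %w[uses_database deployment_platform uses_framework].freeze

def retag_as_reference?(predicate, source_text)
  STACK_PREDICATES.include?(predicate) &&
    EXTERNAL_MARKERS.any? { |re| source_text.match?(re) }
end
```

The real detector would also check for non-current-project repo URLs, which needs the current project's identity and is omitted here.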
+ ### 62. Historical Bare-Conclusion Backfill
+
+ Source: 2026-04-30 #60 investigation, cause 1. 34 bare-conclusion facts pre-date the 2026-04-20 reason-clause prompt commit (`f22d12f`). They trip the strict bare-conclusion regex, but most are factually informative ("MCP tools return dual content + structuredContent via TextSummary module" — describes mechanics implicitly without a "because"). The `quality_score` headline now correctly windows to the last 30 days (commit `7591da4`), but those 34 facts still appear in the historical line and may surface in `claude-memory show` and recall queries forever.
+
+ **Implementation options (pick one).**
+
+ A. **Reclassify to `legacy_observation` predicate.** New non-guarded predicate that the bare-conclusion detector ignores. Migration walks active `decision`/`convention` facts created before 2026-04-20 with no reason clause, reclassifies. Preserves the content; removes the metric pollution.
+
+ B. **One-shot prompt-rewrite pass.** For each pre-2026-04-20 bare fact, run a small LLM call asking "infer the reason from the original quote/content_item text" and rewrite the object. Higher fidelity; costs ~$1-5 in API calls.
+
+ C. **Retroactive rejection.** Mark them all `status=rejected`. Cheap and clean but throws away signal. Probably wrong.
+
+ **Recommendation.** Option A. Cheap, reversible (predicate change is just a column update), and the facts remain queryable, just no longer counted in the bare-conclusion bucket.
+
+ **Acceptance.**
+
+ - Run the migration; verify the historical bare-conclusion count drops by ~34.
+ - Verify those facts still appear in `memory.recall` queries (predicate filter optional).
+ - `digest` quality section's historical block reports a meaningfully lower number afterwards.
+
+ **Effort.** ~½ day. Mostly a Sequel migration + a `claude-memory reclassify-bare-conclusions` command paralleling `reclassify-references`.
+
+ ---
+
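Option A's pass can be outlined as a pure-Ruby transform over fact rows. The real change would be a Sequel migration plus the `reclassify-bare-conclusions` command; the cutoff date, predicate name, and reason-clause regex below are assumptions for illustration:

```ruby
# Sketch of option A: retag pre-cutoff bare decision/convention facts
# to a hypothetical legacy_observation predicate. Dates compared as
# ISO-8601 strings; REASON_CLAUSE stands in for the detector's actual
# reason-clause vocabulary.
REASON_CLAUSE = /because|so that|to avoid/i
CUTOFF = "2026-04-20"

def reclassify_bare_conclusions(facts)
  facts.map do |fact|
    bare = %w[decision convention].include?(fact[:predicate]) &&
           fact[:created_at] < CUTOFF &&
           !fact[:object].match?(REASON_CLAUSE)
    bare ? fact.merge(predicate: "legacy_observation") : fact
  end
end
```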
  ### 21. Incremental Indexing with File Watching
446
570
 
447
571
  Source: grepai study (reinforced 2026-03-02)
@@ -562,6 +686,67 @@ Specs cover: refresher updates from both stores including cross-DB project→glo
 
  Schema migration v13 adds `mcp_tool_calls` telemetry table (tool_name, called_at, duration_ms, result_count, scope, error_class). `MCP::Telemetry` wraps `Server#handle_tools_call` with monotonic-clock timing, captures errors, and records to the project DB; DB errors are swallowed so telemetry never fails a real tool call. `StatsCommand` gains `--tools` and `--since DAYS` flags showing total calls, error rate, and per-tool breakdown (calls, avg ms, p95 ms, error rate). `Sweep::Maintenance#prune_old_mcp_tool_calls` enforces a 90-day retention window, wired into `Sweeper#run!`. Rejected NDJSON in favor of SQLite for schema/query consistency with the rest of the gem. Dropped query-text capture (YAGNI — the dedup insight the hash would enable also needs raw text). Also fixed a latent bug where `StatsCommand` opened the DB via `Sequel.sqlite` (requiring the unlisted `sqlite3` gem); now uses the extralite adapter consistently.
 
+ ### 57. Provenance-Strength-Aware Retrieval Ranking
+
+ Source: 2026-04-28 article "Why Karpathy's Second Brain Breaks at Agent Scale" (Zaid, [@Ctrl_Alt_Zaid](https://x.com/Ctrl_Alt_Zaid/status/2049082538686382397)) — "Memories need metadata such as confidence" / "without scoring, everything competes equally."
+
+ **Gap.** `Domain::Provenance` already records `strength` ∈ {`stated`, `inferred`} (provenance.rb:7,14,22-26), but the value is only consumed as a boolean (`stated?` / `inferred?`) for display. `Index::IndexQuery` and the RRF fusion in `Recall` do not factor strength into ranking. Result: a fact that was inferred from one ambiguous transcript line ranks identically to one explicitly stated multiple times across sessions.
+
+ **Implementation.**
+
+ - **Strength score derivation.** Add `Domain::Provenance#confidence_weight` returning `1.0` for `stated`, `0.6` for `inferred`. Single-source — no new column.
+ - **Per-fact aggregate.** New `SQLiteStore#fact_confidence(fact_id)` returns max strength weight across all provenance rows (a fact stated once and inferred twice is still high-confidence).
+ - **Ranking integration.** `Index::IndexQuery` already returns scored candidates; multiply final RRF score by `(0.7 + 0.3 * confidence_weight)`. Bounded modifier (0.7-1.0 range) so a low-confidence fact still ranks if it's the only relevant one — we're nudging, not filtering.
+ - **Surfacing.** `score_trace` (introduced in #5) gains a `confidence_factor` field so the multiplier is auditable in `memory.recall_semantic --explain`.
+
+ **Acceptance.**
+
+ - `memory.recall` results re-rank in tests: an `inferred`-only fact loses to a `stated` fact when both have similar BM25/vector scores.
+ - Retrieval benchmark (`spec/benchmarks/retrieval/`) shows Recall@k unchanged or improved on the 155-query set.
+ - `score_trace.confidence_factor` populated for every result.
+
+ **Edge cases.**
+
+ - Facts with no provenance (legacy / direct stores): default to 0.8 (between stated and inferred). Don't penalize as 0.6 — those facts predate the field.
+ - `memory.store_extraction` callers don't always set strength; default already lands on `stated` per provenance.rb:14, which is the right behavior.
+
+ **Effort.** ~half day. No schema migration; `strength` already exists.
+
+ **Why medium.** The article calls this out as a structural reliability requirement, but ClaudeMemory already has the data — we're just not using it. Cheap win that closes a visible gap in the article's external critique.
+
+ ---
+
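The strength-to-multiplier mapping in the bullets above can be sketched directly; names are illustrative, weights and bounds are taken from the text:

```ruby
# Confidence multiplier sketch for #57: take the max provenance
# strength per fact, map it into a bounded 0.7-1.0 modifier so
# low-confidence facts are nudged down in ranking, never filtered out.
STRENGTH_WEIGHTS = { "stated" => 1.0, "inferred" => 0.6 }.freeze
NO_PROVENANCE_WEIGHT = 0.8 # legacy facts predate the field: don't penalize

def confidence_factor(strengths)
  weight =
    if strengths.empty?
      NO_PROVENANCE_WEIGHT
    else
      strengths.map { |s| STRENGTH_WEIGHTS.fetch(s) }.max
    end
  0.7 + 0.3 * weight
end
```

The final score would then be `rrf_score * confidence_factor(strengths)`, with the factor echoed into `score_trace` for auditability.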
+ ### 58. Reinforcement-and-Decay Ranking Signal
+
+ Source: 2026-04-28 article "Why Karpathy's Second Brain Breaks at Agent Scale" (Zaid) — "Memories need metadata such as freshness, importance, reinforcement" / "Some memory should weaken, expire, or be archived."
+
+ **Gap.** `last_recalled_at` (schema v17, populated by `Sweep::RecallTimestampRefresher`) currently only feeds `Recall::StaleDetector` to *flag* unused facts (stale_detector.rb:57-61). It does not boost frequently-recalled facts in retrieval ranking, nor decay long-untouched ones. Result: a fact recalled 50 times in the last week and a fact recalled once 8 months ago compete on equal footing once their BM25/vector scores match — the inverse of what the article calls "the right memory, not the most memory."
+
+ **Implementation.**
+
+ - **Add `recall_count` column.** Migration vNN adds `facts.recall_count INTEGER DEFAULT 0`. `RecallTimestampRefresher` increments it alongside the `last_recalled_at` update (single UPDATE, no extra query).
+ - **Reinforcement-decay multiplier.** New `Recall::FreshnessScorer.weight(fact)` returns `max(0.5, min(1.5, log1p(recall_count) * exp(-age_days / HALF_LIFE)))` where `HALF_LIFE` defaults to 60 days. Bounded so a single hot fact can't dominate and a cold fact can't disappear.
+ - **Wire into RRF.** Same composition point as #57: `final_score = rrf_score * confidence_factor * freshness_factor`. Both factors land in `score_trace`.
+ - **Configuration.** `CLAUDE_MEMORY_RECALL_HALF_LIFE_DAYS` env var (default 60) for users who want longer/shorter memory.
+ - **Decay is soft, not destructive.** No facts are deleted or archived by this — that stays the user's job via `claude-memory reject`. The article's "decay" framing is correct in spirit (rank weight drops) but we don't auto-prune.
+
+ **Acceptance.**
+
+ - Two facts with identical BM25 scores: the one recalled 10× in the last week ranks above one not recalled in 6 months.
+ - Repeat-correction benchmark (#32) shows improvement: facts that "stuck" rank higher than abandoned ones.
+ - `score_trace.freshness_factor` populated; visible in `memory.recall_semantic --explain`.
+ - Telemetry: `activity_events` gain `freshness_factor` in the details JSON for hook_context events so we can backtest changes to `HALF_LIFE`.
+
+ **Edge cases.**
+
+ - Brand-new facts (recall_count=0, age=0): `log1p(0) = 0` would zero out the weight. Floor at 0.5 — new facts shouldn't be penalized for being new.
+ - Facts never recalled but still valid: clamped to 0.5 floor; ranked behind reinforced peers but not invisible.
+ - Cross-DB mixing: refresher already handles cross-DB project→global per memory fact "OperationTracker.reset_stuck_operations…"; recall_count lives on each fact in its own DB, which is the right shape.
+
+ **Effort.** ~1 day (migration, refresher update, ranking integration, tests).
+
+ **Why medium.** This pairs naturally with #57 — together they answer the article's "without scoring, everything competes equally" critique. Defer behind the 1.0 punchlist (#47-52) but ahead of the post-1.0 nudge/drift items, since these directly affect retrieval quality measured by the existing benchmarks.
+
  ---
 
  ## Low Priority / Defer
@@ -753,4 +938,4 @@ Influence documents:
 
  ---
 
- *Last updated: 2026-04-28 - 1.0 punchlist track opened (`docs/1_0_punchlist.md`). High Priority entries #47-52 (must-have for 1.0): token-budget telemetry, hallucination rate, harm benchmark, CLAUDE.md baseline publication, `claude-memory show`, benchmark scoreboard. Medium Priority entries #53-56 (post-1.0): first-week ROI nudge, real-session repeat-correction detection, token-cost growth tracking, drift dashboard. Previously: 2026-04-27 - #35 (access-based staleness, sweep-derived) landed.*
+ *Last updated: 2026-04-28 (post-0.10.0 release, post-rebase). 1.0 punchlist restructured around milestone versions per `docs/1_0_punchlist.md`. **0.11.0** = #47/#48/#51/#53 + #49 prototype. **0.12.0** = #49 full + #50/#52. **1.0.0** = #54/#55/#56/#59 (the new API stability audit). #59 added 2026-04-28 as a 1.0 release blocker (originally #57; renumbered after rebase brought in Mercury-article entries #57/#58). #53 (first-week ROI nudge) moved up from post-1.0 to 0.11.0. Previously: 2026-04-27 - #35 (access-based staleness, sweep-derived) landed.*
@@ -9,6 +9,41 @@
9
9
 
10
10
  ---
11
11
 
12
+ ## Post-0.11 Investigation: Hallucination Rate Metric Calibration (2026-04-30)
13
+
14
+ When #48 (hallucination-rate metric) was first run against this project's real DB, it surfaced numbers that *looked* alarming:
15
+
16
+ - Quality score: 39/100
17
+ - Bare conclusions: 34 / 59 active facts (57.6%)
18
+ - 7-day rejection rate: 27 of 32 facts (84.4%)
19
+
20
+ The first read was that the LLM extractor was producing noise faster than usable knowledge. Per `improvements.md` #60, four causes were proposed; diagnostics ran 2026-04-30:
21
+
22
+ | Cause | Verdict | Evidence |
23
+ |---|---|---|
24
+ | Prompt drift in `distill-transcripts.md` | **Confirmed dominant** | 34/35 (97%) bare-conclusion facts pre-date the reason-clause prompt commit `f22d12f` (2026-04-20). Only 1 was created post-commit (and that one is a meta-convention added during this session). |
25
+ | Auto-memory mirror regurgitation | Rejected | 0/35 substring matches in `~/.claude/projects/.../memory/*.md`. Auto-memory mirror only landed in 0.10.0 (2026-04-28), after the bare-fact creation window — temporally impossible to be the source. |
26
+ | `ReferenceMaterialDetector` predicate scope too narrow | Not material | Only 3/35 bare facts are `decision`-predicate; 0 of those match the strong reference-material patterns. Expanding `GUARDED_PREDICATES` would not move the needle on the bare-conclusion count. |
27
+ | Junky corpus / rejection cluster | **Confirmed in single class** | All 27 rejected facts in the 7-day window are `uses_database` (18) or `deployment_platform` (9), all with `session_id=nil` (MCP-originated, almost certainly `/study-repo` runs misattributing external-project tech to this project), all from 2026-04-23 to 04-24. Systemic single-class failure, correctly cleaned up after detection — not ongoing extraction noise. |
28
+
29
+ **What this means for #48 as currently shipped:**
30
+
31
+ The metric is *technically correct* but *pragmatically misleading*. It bakes historical noise (pre-prompt-commit bare conclusions) into a signal that users will read as "ongoing extraction quality." A 57.6% bare-conclusion rate looks like the LLM is broken; in reality the live extraction rate (post-2026-04-20) is ~3% (1 bare fact out of the 30+ created since the prompt commit landed).
32
+
33
+ The 84% rejection rate has a similar structural issue: it counts the cleanup of a bursty `/study-repo` regression against the active-facts denominator, rather than reflecting the actual extraction quality of the live window.
34
+
35
+ **Quick fix shipping now (this session):** restrict `quality_score` and the digest's "Quality" section to facts created within the same 30-day window already used by `token_budget`. Surface a separate "historical" line so users can see both numbers, but the headline is the live one. This makes the metric actionable: a high live bare-conclusion rate = live LLM calibration drift; a high historical rate = legacy data, not a current alarm.
36
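The windowing quick fix can be sketched as follows — the fact shape (`:bare` flag, `:created_at` timestamp) is an assumption for illustration, not the shipped `Trust#quality_score` internals:

```ruby
# Split a bare-conclusion rate into a live window (the headline) and a
# historical baseline (supplementary), so legacy noise stays visible
# without masquerading as ongoing extraction quality.
def bare_rate(facts)
  return 0.0 if facts.empty?
  (facts.count { |f| f[:bare] } * 100.0 / facts.size).round(1)
end

def quality_split(facts, window_days: 30, now: Time.now)
  cutoff = now - window_days * 86_400
  live = facts.select { |f| f[:created_at] >= cutoff }
  {live_pct: bare_rate(live), historical_pct: bare_rate(facts)}
end
```

With 34 legacy bare facts and one bare fact in a 30-fact live window, `live_pct` comes out near 3% while `historical_pct` stays above 50% — exactly the split the investigation needed.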
+
37
+ **Deferred to 0.12 / 1.x:**
38
+
39
+ 1. The systemic `/study-repo` misattribution failure mode (cause 4) deserves its own guard. External-project READMEs being studied should land in `reference` predicates, not as `uses_database`/`deployment_platform`. Track this as a follow-up entry.
40
+ 2. A backfill/cleanup pass on the 34 historical bare-conclusion facts: either retroactive rejection, or a one-shot reclassification that moves them to a `legacy_observation` predicate that the prompt's reason-clause requirement doesn't apply to.
41
+ 3. The metric's calibration assumes "bare conclusion = bad", but spot-checking shows several flagged facts are perfectly informative ("MCP tools return dual content + structuredContent via TextSummary module") — they describe mechanics implicitly. The vocabulary may itself be too strict; revisit during 1.0 soak with real usage data.
42
+
43
+ **Process win:** the metric did its job — it surfaced a real signal that would otherwise have stayed invisible, and the investigation distinguished historical noise from live calibration. Without #48 we'd have no way to know.
44
+
45
+ ---
46
+
12
47
  ## Executive Summary
13
48
 
14
49
  Six days, +2,011 LOC. The headline finding: **the watch-list item from 2026-04-22 (#28 — extract per-endpoint helpers from `Dashboard::API`) was not just deferred, it actively regressed.** `dashboard/api.rb` grew from 627 → 807 LOC (+180, +29%), is now the only file in `lib/` over 750 lines, and gained four new methods all exceeding 15 lines. Method-size pressure increased: the worst case grew from `recall` at 39 lines to `timeline` at 52, and the file still has 11 methods over 15 lines (the same count as last review) but with a higher mean length.
@@ -5,9 +5,11 @@ require "optparse"
5
5
  module ClaudeMemory
6
6
  module Commands
7
7
  # Weekly digest — a markdown summary of what memory did over the last N days.
8
- # Rolls up moment counts, new knowledge, utilization, conflicts, and user
9
- # feedback so users can see the value memory is delivering without
10
- # needing to visit the dashboard.
8
+ # Sections (in order): Activity, Context cost, Quality, New knowledge,
9
+ # Utilization, Conflicts, Feedback. The Context cost and Quality
10
+ # sections (added 0.11.0) read from `Dashboard::Trust#token_budget` and
11
+ # `#quality_score` so users see the cost/pollution side-by-side with
12
+ # the value side without needing to visit the dashboard.
11
13
  #
12
14
  # The data it aggregates all already exists (activity_events, facts,
13
15
  # conflicts, moment_feedback); this command only shapes it into a report.
@@ -48,6 +50,10 @@ module ClaudeMemory
48
50
  lines << ""
49
51
  lines << activity_section(manager, cutoff)
50
52
  lines << ""
53
+ lines << context_cost_section(manager)
54
+ lines << ""
55
+ lines << quality_section(manager, cutoff)
56
+ lines << ""
51
57
  lines << knowledge_section(manager, cutoff)
52
58
  lines << ""
53
59
  lines << utilization_section(manager)
@@ -124,6 +130,92 @@ module ClaudeMemory
124
130
  "## New knowledge\n\n_Unavailable: #{e.message}_"
125
131
  end
126
132
 
133
+ # The token cost of every SessionStart context injection, measured over
134
+ # the last 30 days (Trust panel's window — intentionally wider than the
135
+ # digest's coverage window so percentiles stay statistically meaningful
136
+ # on quiet weeks). Reports zero state explicitly so users know whether a
137
+ # missing number means "no injections" vs. "telemetry didn't fire".
138
+ def context_cost_section(manager)
139
+ tb = Dashboard::Trust.new(manager).token_budget
140
+ out = ["## Context cost", ""]
141
+ if tb[:sample_size].zero?
142
+ out << "_No context injections in the last #{tb[:window_days]} days._"
143
+ else
144
+ out << "**Per-session injected tokens (last #{tb[:window_days]}d, n=#{tb[:sample_size]}):**"
145
+ out << "- p50: #{tb[:p50]} tokens"
146
+ out << "- p95: #{tb[:p95]} tokens"
147
+ out << "- avg: #{tb[:avg]} tokens"
148
+ end
149
+ out.join("\n")
150
+ rescue Sequel::DatabaseError => e
151
+ "## Context cost\n\n_Unavailable: #{e.message}_"
152
+ end
153
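`Trust#token_budget`'s internals aren't part of this hunk; a minimal nearest-rank percentile over the recorded token samples might look roughly like this (an assumed shape, not the shipped code):

```ruby
# Nearest-rank percentile over sorted samples; returns nil for the
# empty case so callers can report the zero state explicitly rather
# than showing a misleading 0-token percentile.
def percentile(sorted, pct)
  return nil if sorted.empty?
  sorted[((pct / 100.0) * (sorted.size - 1)).round]
end

samples = [180, 220, 250, 300, 900].sort
{p50: percentile(samples, 50),          # => 250
 p95: percentile(samples, 95),          # => 900
 avg: samples.sum / samples.size}       # => 370
```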
+
154
+ # Hallucination-rate proxy. Reports two numbers per the
155
+ # `quality_review.md` 2026-04-30 investigation:
156
+ #
157
+ # - Live (last `window_days`, headline) — actionable signal of
158
+ # ongoing extraction quality.
159
+ # - Historical (all active facts, supplementary) — visible so
160
+ # legacy noise isn't hidden, but the headline is the live one.
161
+ #
162
+ # The split exists because the unwindowed metric mixed pre-prompt-
163
+ # commit bare conclusions with live data; users read the combined
164
+ # number as "ongoing quality" and that's misleading.
165
+ def quality_section(manager, cutoff)
166
+ out = ["## Quality", ""]
167
+ qs = Dashboard::Trust.new(manager).quality_score
168
+
169
+ if qs[:total_active].zero?
170
+ if qs[:historical][:total_active].zero?
171
+ out << "_No active facts to score yet._"
172
+ else
173
+ out << "_No facts extracted in the last #{qs[:window_days]} days._"
174
+ out << "- Historical (all active): score #{qs[:historical][:score]}/100, " \
175
+ "#{qs[:historical][:total_active]} facts, " \
176
+ "#{qs[:historical][:bare_conclusion_count]} bare, " \
177
+ "#{qs[:historical][:suspect_count]} suspect"
178
+ end
179
+ else
180
+ out << "**Live score (last #{qs[:window_days]}d):** #{qs[:score]}/100 _(higher is cleaner)_"
181
+ out << "- Suspect (reference material): #{qs[:suspect_count]} (#{qs[:suspect_pct]}%)"
182
+ out << "- Bare conclusions (decision/convention without reason): #{qs[:bare_conclusion_count]} (#{qs[:bare_pct]}%)"
183
+ if qs[:historical][:total_active] > qs[:total_active]
184
+ out << ""
185
+ out << "_Historical (all active): score #{qs[:historical][:score]}/100, " \
186
+ "#{qs[:historical][:total_active]} facts, " \
187
+ "#{qs[:historical][:bare_conclusion_count]} bare, " \
188
+ "#{qs[:historical][:suspect_count]} suspect_"
189
+ end
190
+ end
191
+
192
+ rate = rejection_rate_in_window(manager, cutoff)
193
+ out << ""
194
+ out << "**Rejection rate (in window):** #{rate[:rejected]} of #{rate[:created]} extracted facts rejected (#{rate[:pct]}%)"
195
+
196
+ out.join("\n")
197
+ rescue Sequel::DatabaseError => e
198
+ "## Quality\n\n_Unavailable: #{e.message}_"
199
+ end
200
+
201
+ # How many facts created in the digest window have since been
202
+ # rejected? Counts across both stores.
203
+ def rejection_rate_in_window(manager, cutoff)
204
+ created = 0
205
+ rejected = 0
206
+
207
+ %w[project global].each do |scope|
208
+ store = manager.store_if_exists(scope)
209
+ next unless store
210
+ dataset = store.facts.where { created_at >= cutoff }
211
+ created += dataset.count
212
+ rejected += dataset.where(status: "rejected").count
213
+ end
214
+
215
+ pct = created.zero? ? 0.0 : (rejected * 100.0 / created).round(1)
216
+ {created: created, rejected: rejected, pct: pct}
217
+ end
218
+
127
219
  def utilization_section(manager)
128
220
  util = Dashboard::Trust.new(manager).utilization
129
221
  pct = util[:ratio_pct]
@@ -19,9 +19,9 @@ module ClaudeMemory
19
19
  return Hook::ExitCodes::ERROR
20
20
  end
21
21
 
22
- unless %w[ingest sweep publish context].include?(subcommand)
22
+ unless %w[ingest sweep publish context nudge].include?(subcommand)
23
23
  stderr.puts "Unknown hook command: #{subcommand}"
24
- stderr.puts "Available: ingest, sweep, publish, context"
24
+ stderr.puts "Available: ingest, sweep, publish, context, nudge"
25
25
  return Hook::ExitCodes::ERROR
26
26
  end
27
27
 
@@ -63,6 +63,8 @@ module ClaudeMemory
63
63
  hook_publish(handler, payload)
64
64
  when "context"
65
65
  hook_context(payload, opts[:db])
66
+ when "nudge"
67
+ hook_nudge(payload, opts[:db])
66
68
  end
67
69
 
68
70
  store.close
@@ -169,6 +171,28 @@ module ClaudeMemory
169
171
  Hook::ExitCodes::SUCCESS
170
172
  end
171
173
 
174
+ def hook_nudge(payload, db_path)
175
+ # Nudge needs to count past nudge events across both stores,
176
+ # so prefer the manager-aware path. db_path overrides only
177
+ # the project store (useful for tests).
178
+ project_path = payload["project_path"] || payload["cwd"]
179
+ manager = ClaudeMemory::Store::StoreManager.new(
180
+ project_db_path: db_path, project_path: project_path
181
+ )
182
+ manager.ensure_both!
183
+ store = manager.project_store || manager.global_store
184
+
185
+ handler = ClaudeMemory::Hook::Handler.new(store, manager: manager)
186
+ result = handler.nudge(payload)
187
+
188
+ stdout.puts result[:message] if result[:status] == :emitted
189
+
190
+ manager.close
191
+ Hook::ExitCodes::SUCCESS
192
+ rescue => e
193
+ classify_error(e)
194
+ end
195
+
172
196
  def hook_context(payload, db_path)
173
197
  project_path = payload["project_path"] || payload["cwd"]
174
198
  source = payload["source"]
@@ -213,6 +237,7 @@ module ClaudeMemory
213
237
  details = {
214
238
  source: source,
215
239
  context_length: context_text&.length,
240
+ context_tokens: ClaudeMemory::Core::TokenEstimator.estimate(context_text),
216
241
  preview: context_text&.byteslice(0, CONTEXT_PREVIEW_BYTES),
217
242
  truncated: context_text ? context_text.bytesize > CONTEXT_PREVIEW_BYTES : false,
218
243
  top_fact_ids: injector.emitted_fact_ids.first(10),
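`TokenEstimator.estimate` itself isn't shown in this diff; a common chars/4 heuristic (an assumption about its approach, not the actual implementation) would be:

```ruby
# Rough token estimate: ~4 characters per token for English prose.
# Guards nil so a missing context payload records 0 rather than raising.
module TokenEstimatorSketch
  CHARS_PER_TOKEN = 4.0

  def self.estimate(text)
    return 0 if text.nil? || text.empty?
    (text.length / CHARS_PER_TOKEN).ceil
  end
end

TokenEstimatorSketch.estimate("a" * 400)  # => 100
TokenEstimatorSketch.estimate(nil)        # => 0
```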
@@ -19,8 +19,9 @@ module ClaudeMemory
19
19
  db_path = ClaudeMemory.project_db_path
20
20
  ingest_cmd = "claude-memory hook ingest --db #{db_path}"
21
21
  sweep_cmd = "claude-memory hook sweep --db #{db_path}"
22
+ nudge_cmd = "claude-memory hook nudge --db #{db_path}"
22
23
 
23
- hooks_config = build_hooks_config(ingest_cmd, sweep_cmd)
24
+ hooks_config = build_hooks_config(ingest_cmd, sweep_cmd, nudge_cmd)
24
25
 
25
26
  existing = load_json_file(settings_path)
26
27
  existing["hooks"] ||= {}
@@ -37,8 +38,9 @@ module ClaudeMemory
37
38
  db_path = ClaudeMemory.global_db_path
38
39
  ingest_cmd = "claude-memory hook ingest --db #{db_path}"
39
40
  sweep_cmd = "claude-memory hook sweep --db #{db_path}"
41
+ nudge_cmd = "claude-memory hook nudge --db #{db_path}"
40
42
 
41
- hooks_config = build_hooks_config(ingest_cmd, sweep_cmd)
43
+ hooks_config = build_hooks_config(ingest_cmd, sweep_cmd, nudge_cmd)
42
44
 
43
45
  existing = load_json_file(settings_path)
44
46
  existing["hooks"] ||= {}
@@ -96,7 +98,7 @@ module ClaudeMemory
96
98
 
97
99
  private
98
100
 
99
- def build_hooks_config(ingest_cmd, sweep_cmd)
101
+ def build_hooks_config(ingest_cmd, sweep_cmd, nudge_cmd = "claude-memory hook nudge")
100
102
  context_cmd = "claude-memory hook context"
101
103
 
102
104
  {
@@ -132,7 +134,8 @@ module ClaudeMemory
132
134
  {"type" => "command", "command" => ingest_cmd, "timeout" => 30,
133
135
  "statusMessage" => "Saving memory..."},
134
136
  {"type" => "command", "command" => sweep_cmd, "timeout" => 30,
135
- "statusMessage" => "Sweeping memory..."}
137
+ "statusMessage" => "Sweeping memory..."},
138
+ {"type" => "command", "command" => nudge_cmd, "timeout" => 5}
136
139
  ]
137
140
  }],
138
141
  "TaskCompleted" => [{
@@ -42,7 +42,8 @@ module ClaudeMemory
42
42
  "reclassify-references" => {class: ReclassifyReferencesCommand, description: "Retag existing convention facts that match reference-material heuristics"},
43
43
  "census" => {class: CensusCommand, description: "Aggregate predicate usage across project databases"},
44
44
  "dashboard" => {class: DashboardCommand, description: "Open debugging dashboard"},
45
- "digest" => {class: DigestCommand, description: "Render a weekly markdown digest of memory activity"}
45
+ "digest" => {class: DigestCommand, description: "Render a weekly markdown digest of memory activity"},
46
+ "show" => {class: ShowCommand, description: "Print what memory would inject at the next SessionStart"}
46
47
  }.freeze
47
48
 
48
49
  # Find a command class by name