RubyGems - claude_memory - Versions diffs - 0.10.0 → 0.12.0 - Mend

claude_memory 0.10.0 → 0.12.0

Files changed (72)

checksums.yaml +4 -4
data/.claude/memory.sqlite3 +0 -0
data/.claude/rules/claude_memory.generated.md +42 -64
data/.claude/skills/release/SKILL.md +44 -6
data/.claude/skills/study-repo/SKILL.md +15 -0
data/.claude-plugin/commands/audit-memory.md +68 -0
data/.claude-plugin/marketplace.json +1 -1
data/.claude-plugin/plugin.json +1 -1
data/CHANGELOG.md +70 -0
data/CLAUDE.md +20 -5
data/README.md +64 -2
data/db/migrations/018_add_otel_telemetry.rb +81 -0
data/docs/1_0_punchlist.md +522 -89
data/docs/GETTING_STARTED.md +3 -1
data/docs/api_stability.md +341 -0
data/docs/architecture.md +3 -3
data/docs/audit_runbook.md +209 -0
data/docs/claude_monitoring.md +956 -0
data/docs/dashboard.md +23 -3
data/docs/improvements.md +329 -5
data/docs/influence/ai-memory-systems-2026.md +403 -0
data/docs/memory_audit_2026-05-21.md +303 -0
data/docs/plugin.md +1 -1
data/docs/quality_review.md +35 -0
data/lib/claude_memory/audit/checks.rb +239 -0
data/lib/claude_memory/audit/finding.rb +33 -0
data/lib/claude_memory/audit/runner.rb +73 -0
data/lib/claude_memory/commands/audit_command.rb +117 -0
data/lib/claude_memory/commands/dashboard_command.rb +2 -1
data/lib/claude_memory/commands/digest_command.rb +95 -3
data/lib/claude_memory/commands/hook_command.rb +27 -2
data/lib/claude_memory/commands/import_auto_memory_command.rb +180 -0
data/lib/claude_memory/commands/initializers/hooks_configurator.rb +7 -4
data/lib/claude_memory/commands/otel_command.rb +240 -0
data/lib/claude_memory/commands/registry.rb +5 -1
data/lib/claude_memory/commands/show_command.rb +90 -0
data/lib/claude_memory/commands/stats_command.rb +94 -2
data/lib/claude_memory/configuration.rb +60 -0
data/lib/claude_memory/core/fact_query_builder.rb +1 -0
data/lib/claude_memory/dashboard/api.rb +8 -0
data/lib/claude_memory/dashboard/index.html +140 -1
data/lib/claude_memory/dashboard/prompt_journey.rb +48 -0
data/lib/claude_memory/dashboard/server.rb +86 -0
data/lib/claude_memory/dashboard/telemetry.rb +156 -0
data/lib/claude_memory/dashboard/trust.rb +180 -11
data/lib/claude_memory/deprecations.rb +106 -0
data/lib/claude_memory/distill/bare_conclusion_detector.rb +71 -0
data/lib/claude_memory/distill/reference_material_detector.rb +37 -4
data/lib/claude_memory/hook/auto_memory_mirror.rb +7 -3
data/lib/claude_memory/hook/context_injector.rb +11 -2
data/lib/claude_memory/hook/handler.rb +142 -1
data/lib/claude_memory/mcp/tool_definitions.rb +3 -3
data/lib/claude_memory/otel/attributes.rb +118 -0
data/lib/claude_memory/otel/constants.rb +32 -0
data/lib/claude_memory/otel/ingestor.rb +54 -0
data/lib/claude_memory/otel/otlp_json_envelope.rb +254 -0
data/lib/claude_memory/otel/prompt_scope.rb +108 -0
data/lib/claude_memory/otel/settings_writer.rb +122 -0
data/lib/claude_memory/otel/status.rb +58 -0
data/lib/claude_memory/recall/staleness_annotator.rb +73 -0
data/lib/claude_memory/resolve/predicate_policy.rb +17 -1
data/lib/claude_memory/resolve/resolver.rb +30 -3
data/lib/claude_memory/shortcuts.rb +61 -18
data/lib/claude_memory/store/prompt_journey_query.rb +87 -0
data/lib/claude_memory/store/schema_manager.rb +1 -1
data/lib/claude_memory/store/sqlite_store.rb +136 -0
data/lib/claude_memory/sweep/maintenance.rb +31 -1
data/lib/claude_memory/sweep/sweeper.rb +6 -0
data/lib/claude_memory/templates/hooks.example.json +5 -0
data/lib/claude_memory/version.rb +1 -1
data/lib/claude_memory.rb +20 -0
metadata +28 -1

data/docs/dashboard.md CHANGED

@@ -31,7 +31,8 @@ The dashboard is **feed-first**: the main view is a chronological stream of
 ### Sidebar — Trust
-Three at-a-glance signals so you can answer "is memory helping?" in one look:
+At-a-glance signals so you can answer "is memory helping?" — and "what does
+it cost?" — in one look:
 - **This week's moments** — count of value-producing events (recall hits,
   context injections, extractions). Includes a week-over-week delta.
@@ -40,6 +41,16 @@ Three at-a-glance signals so you can answer "is memory helping?" in one look:
 - **Needs review** — open conflicts (deduped to distinct contradictions) +
   stale facts (active but not recalled in the configured window) + empty
   recalls (queries that returned nothing).
+- **Token budget (30d)** *(0.11.0+)* — p50/p95/avg `context_tokens` injected
+  per SessionStart over the last 30 days, with sample size. Answers "what
+  does memory cost per session?" — pairs with the digest's "Context cost"
+  section and `claude-memory stats --tokens`.
+- **Quality score (live, 30d)** *(0.11.0+)* — 0–100 hallucination-rate
+  proxy. `score = 100 - (suspect_pct + bare_pct)` where suspect = facts
+  retagged as `predicate=reference` and bare = decision/convention facts
+  whose object skipped the prompt-mandated reason clause. Headline is the
+  live 30-day window; the underlying snapshot also exposes a `historical`
+  block over all active facts for context. Returns 100 on empty stores.
 - **Utilization (30d)** — of facts extracted in the last 30 days, what % has
   Claude actually surfaced via recall or context injection. Color-coded
   (green ≥40%, yellow ≥15%, red below). Hidden on fresh installs.
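
The Quality score formula in the hunk above can be illustrated with a minimal Ruby sketch. The method name `quality_score` and its keyword arguments are hypothetical and not taken from the gem. Clamping to 0–100 is also an assumption, since the text only states the range and the empty-store behavior.

```ruby
# Hypothetical helper illustrating score = 100 - (suspect_pct + bare_pct).
# Names are illustrative; this is not the gem's implementation.
def quality_score(total:, suspect:, bare:)
  # The doc states that an empty store returns 100.
  return 100.0 if total.zero?

  suspect_pct = 100.0 * suspect / total
  bare_pct = 100.0 * bare / total

  # Assumption: clamp the result to the documented 0-100 range.
  (100.0 - (suspect_pct + bare_pct)).clamp(0.0, 100.0).round(1)
end

puts quality_score(total: 0, suspect: 0, bare: 0)     # prints 100.0
puts quality_score(total: 200, suspect: 10, bare: 30) # prints 80.0
```

Because the two percentages are summed, a fact counted in both categories would be penalized twice. Whether the real metric de-duplicates such facts is not stated in the source.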
@@ -161,8 +172,17 @@ WAL writer lock open across page loads.
 ## Related CLI
 - `claude-memory digest [--since DAYS] [--output FILE]` — markdown report of
-  the same Trust + Knowledge + Conflicts + Feedback signals, suitable for
-  email or commit-into-repo.
+  the same Trust + Knowledge + Conflicts + Feedback signals plus
+  **Context cost** (token-budget p50/p95) and **Quality** (score + rejection
+  rate) sections. Suitable for email or commit-into-repo.
+- `claude-memory show [--pending] [--source SOURCE]` *(0.11.0+)* — print
+  what memory would inject at the next SessionStart in plain Markdown.
+  Same `Hook::ContextInjector` path real sessions use, so the output
+  matches what Claude actually receives. Footer reports fact count, ~token
+  estimate, and char count.
+- `claude-memory stats --tokens [--since DAYS]` *(0.11.0+)* — token budget
+  histogram (p50/p95/avg/min/max + bucketed distribution) for SessionStart
+  context injections. Same data the Trust panel's Token budget block aggregates.
 - `claude-memory census [--root DIR]` — privacy-safe cross-project
   predicate vocabulary scan; pairs with the Knowledge panel for "what
   predicates does my whole tree use?".
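
For readers unfamiliar with the p50/p95 summary that `stats --tokens` and the Token budget block report, here is a rough Ruby sketch. It assumes a nearest-rank percentile, which the source does not specify, and the helper names `percentile` and `token_budget` are hypothetical.

```ruby
# Nearest-rank percentile over an array of context_tokens samples.
# Assumption: the gem's actual percentile method is not shown in the source.
def percentile(samples, pct)
  return nil if samples.empty?

  sorted = samples.sort
  rank = (pct / 100.0 * sorted.length).ceil
  sorted[[rank, 1].max - 1]
end

# Summarizes samples the way the text describes: p50/p95/avg/min/max plus n.
def token_budget(samples)
  {
    n: samples.length,
    p50: percentile(samples, 50),
    p95: percentile(samples, 95),
    avg: samples.empty? ? nil : (samples.sum.to_f / samples.length).round(1),
    min: samples.min,
    max: samples.max
  }
end

# Illustrative sample values only, not real telemetry.
samples = [800, 1200, 950, 4000, 1100, 1000, 900, 1300, 1050, 980]
p token_budget(samples)
```

With these made-up samples, the single 4000-token session pulls the average (1328.0) well above the p50 (1000). This is the kind of skew that reporting both p50 and p95 makes visible.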

data/docs/improvements.md CHANGED

@@ -1,6 +1,6 @@
 # Improvements to Consider
-*Updated: 2026-04-28 - Opened the 1.0 punchlist track (see `docs/1_0_punchlist.md`). High-priority entries below now include the must-have 1.0 items: token-budget telemetry (#47), hallucination-rate metric (#48), negative-fact harm benchmark (#49), CLAUDE.md baseline publication (#50), `claude-memory show` (#51), benchmark scoreboard diff (#52). Post-1.0 entries: first-week ROI nudge (#53), real-session repeat-correction detector (#54), token-cost growth tracking (#55), drift dashboard (#56). Earlier 2026-04-28 update added cq study (usefulness-focused). Previously: 2026-03-30 - Re-studied all 7 influencer repos. New recommendations: CLAUDE_CONFIG_DIR support (#26, from episodic-memory), Usage Stats / ROI Tracking (#27, from grepai v0.35.0). New Features to Avoid: AST-Aware Code Chunking (QMD), Custom Instructions via Env Var (lossless-claw v0.5.2), OpenClaw Context Injection (claude-mem v10.6.0). Repos with no changes: kbs (v0.2.1), claude-supermemory (v2.0.1), episodic-memory (v1.0.15). Previously: 14 features implemented through 2026-03-24.*
+*Updated: 2026-05-23 - Added AI Memory Systems Landscape Analysis (Nakajima/Opus 4.6 Research article, 2026-03-26) — meta-study of 7 benchmarks + ~12 systems. Four High Priority items: graph traversal as third RRF source (#64), temporal-aware retrieval (#65), bi-temporal schema cleanup (#66), LongMemEval integration (#67). One promotion: improvement #57 (provenance-strength ranking) Medium → High, validated as the "soft epistemic separation" pattern. See `docs/influence/ai-memory-systems-2026.md`. Previously: 2026-05-01 - Added Strands Agent SOPs study (article, not repo) — one M-priority item (parameter blocks in skill frontmatter); rest already implemented or deferred. See `docs/influence/strands-agent-sops.md`. Previously: 2026-04-28 (post-0.10.0) - Restructured 1.0 punchlist around milestone versions. **0.11.0 "Trust & Cost"** ships #47 (token budget), #48 (hallucination rate), #51 (claude-memory show), #53 (first-week ROI nudge — moved up from post-1.0), and a 3-scenario prototype of #49 (harm benchmark). **0.12.0 "Release Discipline"** ships #49 full corpus, #50 (CLAUDE.md baseline), #52 (benchmark scoreboard). **1.0.0** lands soak-validated #54/#55/#56 if time + new #59 API stability audit. See `docs/1_0_punchlist.md` for the full plan with calendar targets. Also added 2026-04-28: two ranking-signal gaps surfaced by the Mercury / "Why Karpathy's Second Brain Breaks" article (Zaid, 2026-04-28) — provenance-strength-aware ranking (#57) and reinforcement/decay scoring (#58). Earlier 2026-04-28 updates: opened the 1.0 punchlist track + added cq study. Previously: 2026-03-30 - Re-studied all 7 influencer repos. New recommendations: CLAUDE_CONFIG_DIR support (#26, from episodic-memory), Usage Stats / ROI Tracking (#27, from grepai v0.35.0). New Features to Avoid: AST-Aware Code Chunking (QMD), Custom Instructions via Env Var (lossless-claw v0.5.2), OpenClaw Context Injection (claude-mem v10.6.0). Repos with no changes: kbs (v0.2.1), claude-supermemory (v2.0.1), episodic-memory (v1.0.15). Previously: 14 features implemented through 2026-03-24.*
 *Sources:*
 - *[thedotmack/claude-mem](https://github.com/thedotmack/claude-mem) - Memory compression system (v10.6.3, re-studied 2026-03-30)*
 - *[obra/episodic-memory](https://github.com/obra/episodic-memory) - Semantic conversation search (v1.0.15, re-studied 2026-03-30 — no changes)*
@@ -152,10 +152,14 @@ Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #2). Builds on
 ---
-### 49. Negative-Fact Harm Benchmark
+### 49. Negative-Fact Harm Benchmark — *prototype in 0.11.0, full corpus in 0.12.0*
 Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #3). Parallels #32 (Repeat-Correction Benchmark) but inverts the goal.
+**Two-phase delivery (added 2026-04-28):**
+- **0.11.0 — 3-scenario prototype (~½d).** Three hand-written cases (one stale-tech, one mismatched-scope, one superseded-but-undetected) run against real Claude under `EVAL_MODE=real`. Smoke test: if even three cases produce >0% harm rate, the full benchmark in 0.12 will reveal a fundamental issue and we want to know early. No release gate yet — the prototype is diagnostic.
+- **0.12.0 — full 10-15 scenario corpus (~2d).** Adds the missing harm classes (reference-material-as-fact + remaining stale/mismatched/superseded cases) and wires the >1% harm-rate release gate.
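
The ">1% harm-rate release gate" mentioned above amounts to simple arithmetic. The following is a minimal Ruby sketch, assuming each scenario yields a single harmed/not-harmed result. The result hash shape (`:harmed`) and both method names are hypothetical.

```ruby
# Hypothetical release-gate check: the gate fails when harm rate exceeds 1%.
HARM_RATE_GATE_PCT = 1.0

# results: array of hashes such as { scenario: "stale-tech", harmed: false }.
def harm_rate_pct(results)
  return 0.0 if results.empty?

  harmed = results.count { |r| r[:harmed] }
  100.0 * harmed / results.length
end

def gate_passes?(results, threshold_pct: HARM_RATE_GATE_PCT)
  harm_rate_pct(results) <= threshold_pct
end

# Fifteen illustrative scenarios, one of which caused harm.
results = Array.new(14) { { harmed: false } } + [{ harmed: true }]
puts harm_rate_pct(results).round(1) # prints 6.7
puts gate_passes?(results)           # prints false
```

Under that one-result-per-scenario assumption, a corpus of 10 to 15 scenarios makes the gate all-or-nothing: a single harmful outcome already exceeds 1%.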
 **Gap.** Every benchmark we run measures whether memory **helps** (Recall@k, MRR, e2e pass rate, repeat-correction prevention rate). Nothing measures whether memory **harms** — i.e. holds a wrong/stale fact and causes Claude to follow it. Without this, "memory helps" is unfalsifiable.
 **Implementation.**
@@ -314,6 +318,99 @@ cq is complementary to ClaudeMemory, not competing: it's an out-of-band SQL audi
 ---
+## Strands Agent SOPs Study (2026-05-01)
+Source: docs/influence/strands-agent-sops.md — article study (AWS Open Source Blog)
+Amazon's Strands Agent SOPs describe markdown-based parameterized workflows for agents (RFC-2119 keywords, parameter blocks, sequential chaining via artifact handoff, MCP-prompt invocation). **ClaudeMemory has independently arrived at the same architecture** via Anthropic Skills (`/distill-transcripts`, `/release`, `/study-repo`), MCP `prompts/list`+`prompts/get` (`memory_guide`), and the `Ingest → Distill → Resolve → Publish` pipeline. The article is *validation*, not a roadmap.
+### Medium Priority Recommendations
+- [ ] **Add explicit `## Parameters` blocks to skill markdowns**
+  - Value: Self-documenting skills; Claude can prompt the user for missing parameters instead of guessing from `$ARGUMENTS`
+  - Evidence: Strands' `Required Parameters / Optional Parameters` block — the only verbatim format snippet in the article (`docs/influence/strands-agent-sops.md`)
+  - Implementation: Add `## Parameters` section to `lib/claude_memory/commands/skills/distill-transcripts.md`, `release.md`, `study-repo.md`, `quality-update.md`, `improve.md`. Format: bullet list with `name: description (default: …)`
+  - Effort: ~30 minutes total
+  - Trade-off: Tiny doc maintenance; no runtime cost
+### Deferred / Avoid (from this study)
+- **Progress markers + checkpoint file in `/distill-transcripts`** — UX-only improvement; DB already handles correctness. Defer until usage data shows multi-hundred-item distillation runs.
+- **MCP-prompt-exposed skill format spec** (analog of `strands-agents-sops rule`) — solves a problem we don't have; defer until ≥3 skill-authoring locations exist.
+- **Strands Python package** — wrong language ecosystem.
+- **`.sop/<name>/` artifact filesystem** — would parallel our DB-as-checkpoint substrate and double the cleanup burden.
+- **Adopting "SOP" as user-facing terminology** — Anthropic Skills is the term Claude Code users know; renaming creates confusion for zero gain.
+---
+## AI Memory Systems Landscape Study (2026-05-23)
+Source: `docs/influence/ai-memory-systems-2026.md` — meta-study of the Nakajima/Opus 4.6 Research article surveying 7 memory benchmarks and ~12 memory systems (Hindsight, Zep/Graphiti, MemGPT/Letta, Mem0, Cognee, HippoRAG, etc.).
+**Headline finding.** ClaudeMemory's retrieval profile (vector + FTS, light graph, no temporal-aware ranking) sits architecturally closest to Mem0 (49% on LongMemEval). Two unforced gaps separate us from Zep-class systems (71.2%): we already store the graph but don't traverse it at query time, and we have temporal columns we don't rank by. Closing both is ~3-5 days of work without new dependencies.
+### High Priority Recommendations
+- [ ] **64. Graph Traversal as Third RRF Source** ⭐
+  - Value: Field-wide validated as the difference between Mem0-class (49%) and Zep-class (71.2%) LongMemEval scores. We already store the graph (`entities`, `entity_aliases`, `fact_links`).
+  - Evidence: Article Pattern 1 + Pattern 2; our `lib/claude_memory/recall.rb` has no BFS strategy; `lib/claude_memory/core/rr_fusion.rb` fuses only vec + FTS.
+  - Implementation: Add `Recall::GraphTraversal` strategy that resolves query → seed entities → 1-2 hop BFS over `entities` ↔ `facts` ↔ `entities`, scored by hop distance × edge type. Fuse into existing RRF as a third source. Bound depth so latency stays sub-100ms.
+  - Effort: Medium (2-3 days). Data shape already correct; new strategy class + RRF integration + tests.
+  - Trade-off: Empty graphs degrade gracefully to zero rerank contribution.
+- [ ] **65. Temporal-Aware Retrieval Strategy** ⭐
+  - Value: Article identifies temporal reasoning as the hardest field-wide capability (up to 73% gap on LoCoMo). Schema already has `valid_from`, `valid_to`, `last_recalled_at`; ranker doesn't use them.
+  - Evidence: Article Pattern 3.
+  - Implementation: (1) Add `temporal_rank` input to `Core::RRFusion` — facts with newer `valid_from` get a small rank boost (capped at ~0.1× vec contribution). (2) Optional `as_of` ISO 8601 parameter on `memory.recall` filters to `valid_from <= as_of AND (valid_to IS NULL OR valid_to > as_of)`.
+  - Effort: Small (1-2 days). Existing columns; thread parameter and ranker.
+  - Trade-off: Recency over-ranking risk; cap boost weight and tune via eval harness.
+- [ ] **66. Bi-Temporal Schema Cleanup (world vs ingest time)**
+  - Value: Today `valid_to` does double duty — "fact ceased to be true in the world" *and* "we superseded this fact during ingestion." Article credits this distinction as Zep's most important innovation. Without it, point-in-time queries silently corrupt the temporal axis.
+  - Evidence: Article: "Every entity edge tracks four timestamps: valid_at, invalid_at, created_at, expired_at." See also our schema (`db/migrations/001_create_initial_schema.rb:64-65`).
+  - Implementation: Schema v18 migration: rename `valid_to` → `world_invalid_at`; add `ingest_expired_at` (datetime, nullable). Resolver sets `ingest_expired_at` on supersession; leaves `world_invalid_at` for explicit "this fact stopped being true on date X" updates. Backfill copies `valid_to` into both columns.
+  - Effort: Medium (2-3 days). Schema migration + resolver update + MCP tool surface + tests. Public API break — needs deprecation alias for one minor version per `docs/api_stability.md`.
+  - Trade-off: API surface change. Lower urgency than #64/#65 but cheaper to do before corpus grows.
+- [ ] **67. LongMemEval Benchmark Integration** ⭐
+  - Value: Article calls LongMemEval the "gold standard" — the only benchmark it describes as rigorous. Without an external benchmark score, we can't credibly position ClaudeMemory against the field.
+  - Evidence: Article — Wu et al. ICLR 2025, 500 questions across 115K-1.5M token contexts, three-stage framework with LLM-as-judge.
+  - Implementation: Add `spec/benchmarks/longmemeval/` adapter. Dataset is public. Wire into `bin/run-evals --longmemeval`. Report Recall@k, MRR, nDCG@10 like DevMemBench.
+  - Effort: Medium (2-4 days). Mostly dataset wrangling + adapter code; existing DevMemBench pipeline has the right shape.
+  - Trade-off: Real-mode runs (with LLM judge) cost API spend. Mitigation: stub mode for retrieval-only, real mode opt-in.
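
Two of the retrieval changes above (#64 and #65) can be sketched in a few lines of Ruby. This is an illustration under stated assumptions, not the gem's code: `Fact`, `visible_as_of`, and `rrf_fuse` are hypothetical names, the fusion uses the standard Reciprocal Rank Fusion formula with the conventional `k = 60`, and the graph ranking is passed in as a plain list rather than produced by a real BFS.

```ruby
require "time"

# Hypothetical in-memory fact carrying only the temporal columns named in the text.
Fact = Struct.new(:id, :valid_from, :valid_to, keyword_init: true)

# Point-in-time filter mirroring the condition quoted in #65:
#   valid_from <= as_of AND (valid_to IS NULL OR valid_to > as_of)
def visible_as_of(facts, as_of)
  t = Time.iso8601(as_of)
  facts.select do |f|
    f.valid_from <= t && (f.valid_to.nil? || f.valid_to > t)
  end
end

# Reciprocal Rank Fusion: score(id) = sum over sources of 1 / (k + rank),
# where rank is 1-based. Ties are broken by id to keep the output stable.
def rrf_fuse(rankings, k: 60)
  scores = Hash.new(0.0)
  rankings.each_value do |ids|
    ids.each_with_index { |id, idx| scores[id] += 1.0 / (k + idx + 1) }
  end
  scores.sort_by { |id, score| [-score, id] }.map(&:first)
end

facts = [
  Fact.new(id: 1, valid_from: Time.iso8601("2026-01-01T00:00:00Z"), valid_to: nil),
  Fact.new(id: 2, valid_from: Time.iso8601("2026-01-01T00:00:00Z"),
           valid_to: Time.iso8601("2026-03-01T00:00:00Z")),
  Fact.new(id: 3, valid_from: Time.iso8601("2026-04-01T00:00:00Z"), valid_to: nil)
]
p visible_as_of(facts, "2026-02-01T00:00:00Z").map(&:id) # prints [1, 2]

# Graph results enter the fusion as a third ranked list alongside vec and FTS.
p rrf_fuse({ vec: [1, 2, 3], fts: [2, 1, 4], graph: [4, 2] }) # prints [2, 1, 4, 3]
```

An empty `graph:` list adds nothing to any score, which matches the "degrade gracefully" trade-off noted for #64.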
+### Promotion (existing improvement, article-validated)
+- [ ] **#57 Provenance-Strength-Aware Retrieval Ranking** — promote from Medium to High Priority
+  - Rationale: Article describes Hindsight's "epistemic separation" (4 networks: world facts / agent experiences / entity observations / evolving opinions) as a key innovation. Our `provenance.strength` ∈ {stated, inferred, derived} is the soft version of this — already in the schema, just not used by the ranker. This article promotes the change from "nice to have" to "fits the field-wide pattern."
+  - Implementation unchanged from existing #57 entry.
+### Medium Priority
+- [ ] **Reflect Pass — Background Consolidation on Idle** (see influence doc rec #5)
+  - Value: Hindsight's reflect operation and Letta's sleep-time compute both re-examine stored facts using a background process. Article credits this with preventing noise growth at scale. We don't have it; today our corpus is small enough not to need it.
+  - Recommendation: Track when largest project DB crosses 5K facts. Until then, premature. **CONSIDER for 1.0.0 or later.**
+- [ ] **`memory.save_this` Tool — Agent-Initiated Storage** (see influence doc rec #6)
+  - Value: Letta's striking result (74% vs Mem0's 68.5% on LoCoMo) suggests agent-controlled "save this" beats passive extraction. We have `memory.store_extraction` but it's framed as "report an extraction," not "I want to remember this."
+  - Implementation: Thin wrapper over `store_extraction` with friendlier prompt. Document in MCP `memory_guide` prompt.
+  - Effort: Small (1 day).
+  - Recommendation: **CONSIDER** in 0.13.0 if first-week usage shows agents under-use `store_extraction` proactively.
+### Features to Avoid (from this study)
+- **Cross-encoder LLM reranking** — Article confirms cost as the reason (already in our avoid list).
+- **Full 4-column Graphiti timestamp model** — Recommendation #66 above adopts the simpler 3-timestamp version (world_invalid_at + ingest_expired_at + created_at).
+- **Hindsight 4-network hard epistemic split** — Over-complex for our scale; recommendation #57 promotion is the soft version.
+- **Cloud-required graph DB** (Neo4j / FalkorDB) — Recommendation #64 traverses the graph we already have in SQLite.
+- **Custom fine-tuned models in any pipeline stage** — Article confirms architecture > model size; we can't compete on model investment anyway.
+- **LoCoMo benchmark for cross-vendor comparison** — Article explicitly discredits it: "Mem0 and Zep have publicly contradicted each other's reported scores, making LoCoMo rankings unreliable for cross-vendor comparison." If we cite LoCoMo at all, cite our own number standalone.
+- **Cognee-style RDF/OWL ontology validation** — Our `entity_aliases` + `PredicatePolicy::SYNONYMS` are the right-sized version for a single-developer tool.
+- **Letta-style filesystem-only memory as primary mode** — Consumes user-visible tokens on every interaction; our hook-based passive ingestion is cheaper per session.
+- **Sleep-time compute as a separate background service** — We can achieve the same effect on the next SessionStart via Layer 2 distillation, for free. No separate process needed.
+---
 ## Medium Priority
 ### ~~18. Shell Completion for CLI~~ ✅ Implemented 2026-03-20
@@ -328,9 +425,9 @@ IndexCommand builds text→embedding cache from already-embedded facts before in
 In Ruby fallback path (`search_by_vector_fallback`), facts are grouped by `embedding_json` before cosine similarity computation. Unique embeddings scored once, results fanned out to all matching fact_ids. Native sqlite-vec path unaffected (handles own dedup).
-### 53. First-Week ROI Nudge
+### 53. First-Week ROI Nudge — *targeted for 0.11.0 (moved up from post-1.0)*
-Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #7). Closes the cold-start gap.
+Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #7). Closes the cold-start gap. **Moved up from post-1.0 to 0.11.0** in the 2026-04-28 path-to-1.0 restructure — fits the "Trust & Cost" theme since it's the user-visible proof that memory is doing work.
 **Gap.** New users install the gem, run a few sessions, and don't know whether memory is working. The dashboard exists but they have to know to look. The auto-memory mirror (#36) helps but isn't surfaced. We need a low-friction nudge in the first ~10 sessions that says "memory is working, here's what it did" — and then gets out of the way.
@@ -442,6 +539,172 @@ Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #10). Builds on
 ---
+### 59. API Stability Audit (promoted to 0.12.0 — 2026-05-01)
+*Originally slated as 1.0 release blocker; promoted to 0.12 because #52's benchmark scoreboard needs an explicit "what surfaces are stable" list to know what counts as a regression vs. internal change. The deprecation-warning module is also a prerequisite for any soft-rename work surfaced during the 0.12 → 1.0 soak.*
+Source: 2026-04-28 path-to-1.0 review (`docs/1_0_punchlist.md` #11). Added after 0.10.0 ship. *(Renumbered from #57 to #59 during rebase against origin/main on 2026-04-28 — Mercury-article PR #5 had already taken #57 and #58.)*
+**Gap.** "1.0 commits to semver" is meaningless without an explicit public/internal split. Many of the surfaces touched in 0.9.0 / 0.10.0 (MCP tool schemas, hook payload shapes, CLI flags, dashboard endpoints) have evolved organically and aren't formally documented as stable vs. internal. Without this audit, future "regression" complaints become un-arbitrable — was that flag/method/tool *promised*? We don't know.
+**Implementation.**
+- **New `docs/api_stability.md`** as the authoritative public-API reference. Sections:
+  1. *Public CLI surface*: every `claude-memory <subcommand>` registered in `Commands::Registry::COMMANDS`, every documented flag, with stability tier per command (`stable` / `experimental` / `internal`).
+  2. *Public MCP tools*: every entry in `MCP::ToolDefinitions.all` with its argument schema, return shape, and tool-annotation hints (`readOnlyHint`, `idempotentHint`, `destructiveHint`). Stability tier per tool.
+  3. *Public hook contract*: payload field names accepted by `Hook::Handler` and `Commands::HookCommand`, return shapes (`hookSpecificOutput`, exit codes via `Hook::ExitCodes`), stability tier per hook event.
+  4. *Public Ruby API*: the surface external Ruby callers can rely on. Candidates: `ClaudeMemory::Recall`, `Configuration`, `Store::StoreManager`, `Domain::*`. Everything else (resolver internals, dashboard internals, sweep internals) marked internal.
+  5. *Schema stability*: column names, table names, predicate vocabulary in `PredicatePolicy::POLICIES`. Schema migrations remain forward-compatible per the round-trip-spec convention; column *removals* require deprecation cycle.
+- **Deprecation policy paragraph**: "we'll mark X deprecated in N.x.0 (with a runtime warning), keep it functional for ≥1 minor cycle, and remove no earlier than (N+1).0.0." Mirrors Ruby/Rails conventions.
+- **Deprecation-warning instrumentation**: tiny module `ClaudeMemory::Deprecations` with a `warn(name, replacement:, removed_in:)` helper. Anywhere we want to change a public surface in 1.x, we wrap with `Deprecations.warn` first.
+- **README + CLAUDE.md** add a top-level link: "Public API: see [docs/api_stability.md](docs/api_stability.md)".
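
To make the deprecation-warning helper above concrete, here is a small Ruby sketch. Only the `warn(name, replacement:, removed_in:)` signature comes from the text. The module name `DeprecationsSketch`, the once-per-name behavior, the message format, and the extra `io:` parameter are assumptions made for illustration, and the flags in the usage line are invented.

```ruby
# Illustrative stand-in for the proposed ClaudeMemory::Deprecations module.
# This is a sketch, not the gem's implementation.
module DeprecationsSketch
  @seen = {}

  class << self
    # Emits one warning per deprecated name and returns the message,
    # or nil if that name was already reported (assumed behavior).
    def warn(name, replacement:, removed_in:, io: $stderr)
      return nil if @seen[name]

      @seen[name] = true
      message = "[DEPRECATION] #{name} is deprecated; use #{replacement} instead. " \
                "It will be removed in #{removed_in}."
      io.puts(message)
      message
    end

    # Clears the seen-list, mainly useful in tests.
    def reset!
      @seen.clear
    end
  end
end

# Hypothetical flag names, for illustration only.
DeprecationsSketch.warn("--legacy-flag", replacement: "--new-flag", removed_in: "2.0.0")
```

Warning once per name keeps repeated CLI or hook invocations from flooding stderr. A real implementation might instead warn on every call or make this configurable.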
+**Acceptance.**
+- `docs/api_stability.md` exists and lists every CLI command, MCP tool, hook event, and key Ruby class with a stability tier.
+- A reader of the doc can answer "is `claude-memory dashboard --port` stable?" / "will `Recall.new(manager).query(...)` keep its signature in 1.x?" in <30 seconds.
+- `ClaudeMemory::Deprecations.warn` is wired up and used at least once (e.g. for a soon-to-be-renamed flag) so the mechanism is exercised.
+- `/release` skill knows about `docs/api_stability.md` and reminds the operator to update it on any public-surface change.
+**Edge cases.**
+- We have to be honest about which Ruby surfaces are public. `Recall` and `Configuration` clearly are; `Sweep::Maintenance` clearly isn't; `Domain::Fact` is ambiguous (used by external benchmark adapters in `spec/benchmarks/`). Default to **internal** when ambiguous — easier to promote later than demote.
+- Schema column names are tricky. Migrations can rename safely; external SQL tools (e.g. cq) read the schema directly. Document the column names as "best-effort stable, no removal without deprecation cycle."
+- The dashboard JSON API is internal — explicitly call this out so users don't build scripts against it.
+**Effort.** ~2 days. The doc is the bulk of the time; the deprecation warning module is ~50 LOC.
+**Why 1.0 must-have.** Without this, the semver promise is vibes. Future regressions in non-listed areas can be argued away; future regressions in listed areas are bugs. Forces honesty about what we're committing to.
+---
+### 60. LLM Extractor Calibration Drift (surfaced by #48)
+Source: 2026-04-30 production verification of #48 hallucination-rate metric. Surfaced when the metric was first run against real data on this very project.
+**The signal.** First run of `claude-memory digest` against `claude_memory/.claude/memory.sqlite3` after the metric landed:
+| Number | Value | Verdict |
+|---|---|---|
+| Quality score | 39/100 | bad |
+| Suspect (predicate=`reference`) | 2 / 59 (3.4%) | acceptable |
+| Bare conclusions (decision/convention without reason) | 34 / 59 (57.6%) | poor |
+| 7-day rejection rate | 27 of 32 facts (84.4%) | very bad |
+**What it means.** The 84% rejection rate over 7 days says the LLM extractor in this project was producing noise faster than usable knowledge — almost everything new it created got rejected within a week. The 57.6% bare-conclusion rate confirms the same drift from the prompt's *"every decision/convention MUST embed a reason clause"* requirement: the prompt asks for "because…" / "so that…" / "to avoid…" but recent extractions skipped the reason clause the majority of the time.
+**Why this is a finding, not a metric bug.** Spot-checked 5 flagged + 5 unflagged facts on 2026-04-30; the detector's regex correctly matches the prompt's strict reason-clause vocabulary in both directions. Not a false-positive issue. The metric is doing what it was designed to do: surface real LLM calibration drift that was previously invisible.
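
The bare-conclusion check described above can be approximated in Ruby as follows. The three reason-clause markers are the ones quoted in the text. The constant and method names are hypothetical, and the real `Distill::BareConclusionDetector` may use a broader or stricter vocabulary.

```ruby
# Reason-clause markers quoted in the text ("because", "so that", "to avoid").
# Assumption: the actual detector's regex is not shown in the source.
REASON_CLAUSE = /\b(because|so that|to avoid)\b/i

# Only decision/convention facts are required to carry a reason clause.
REASONED_PREDICATES = %w[decision convention].freeze

def bare_conclusion?(predicate:, object:)
  return false unless REASONED_PREDICATES.include?(predicate)

  !object.match?(REASON_CLAUSE)
end

puts bare_conclusion?(predicate: "decision", object: "Use SQLite")                              # prints true
puts bare_conclusion?(predicate: "decision", object: "Use SQLite because it needs no server")   # prints false
puts bare_conclusion?(predicate: "uses_database", object: "sqlite")                             # prints false
```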
+**Possible causes (to investigate).**
+1. **Prompt drift in `lib/claude_memory/commands/skills/distill-transcripts.md`** — the reason-clause requirement may have been added to the prompt after a chunk of older facts had already been extracted. If so, the flagged facts are mostly historical noise rather than an ongoing extraction problem. → check `git log -p lib/claude_memory/commands/skills/distill-transcripts.md` for when the reason-clause section landed and whether bare-conclusion facts cluster pre-that-commit.
+2. **Auto-memory mirror regurgitation** — the `Hook::AutoMemoryMirror` (0.10.0) injects auto-memory file content as extraction candidates at SessionStart. If those auto-memory files have bare-conclusion content (likely, since they're written by Claude with no reason-clause discipline), the LLM may be re-extracting them faithfully without injecting reasons that weren't in the source. → grep auto-memory file content for the same bare conclusions appearing in flagged DB facts.
+3. **Reference-material guard too narrow** — `ReferenceMaterialDetector` only retags `convention` predicates; "From QMD restudy: adopt X" facts (clearly third-party-project descriptions) come back as `decision` rather than `reference` and stay in the corpus. → expand `GUARDED_PREDICATES` to include `decision` for the same patterns.
+4. **High rejection rate is correct + the corpus is junky** — 84% rejection in the last 7 days might mean we (the team) are correctly rejecting noise that the LLM is producing too aggressively. → check whether rejected facts cluster by source (transcript topic, hook event type, time-of-day).
+**Acceptance / next steps.**
+- Investigation note in `docs/quality_review.md` capturing which of (1)–(4) above explains the bulk of the drift.
+- If prompt drift (cause 1): the historical bulk-flag is fine, the live extraction rate is what matters. Expose "extraction rate" over a tighter window (last 24h vs 30-day baseline) so calibration drift becomes visible without historical noise drowning the signal.
+- If auto-memory regurgitation (cause 2): patch the auto-memory-mirror prompt or distillation prompt to require reason-clause synthesis even when source text is bare.
+- If reference-material guard too narrow (cause 3): expand `Distill::ReferenceMaterialDetector::GUARDED_PREDICATES` and re-run `claude-memory reclassify-references --predicate decision` against active corpus.
+- If correct + junky (cause 4): the metric is healthy; the cleanup is `claude-memory reject` runs against high-frequency junk.
+**Effort.** Investigation: 0.5d. Fix: depends on cause.
+**Why this is in `improvements.md`.** Independently of which cause is correct, the verification of #48 surfaced a real signal worth tracking. The metric did its job (turning invisible drift into a visible 84%); now the work is the actual cleanup. Tracked here so it doesn't fall off the radar between 0.11 ship and the 1.0 soak.
+**Update 2026-04-30: investigation complete.** Diagnostics ran for all four causes; results recorded in `docs/quality_review.md`. Summary: cause 1 (prompt drift) explains 97% of bare conclusions; cause 4 (`/study-repo` misattribution burst) explains 100% of the 7-day rejection cluster; causes 2 and 3 ruled out. Headline metric calibration fix landed in commit `7591da4` (live 30-day window + historical block). The two systemic issues split into entries #61 and #62 below.
+---
+### 61. /study-repo Misattribution Guard
+Source: 2026-04-30 #60 investigation, cause 4. All 27 rejected facts in this project's 7-day window were `uses_database` (18) or `deployment_platform` (9) with `session_id=nil` (MCP-originated), all from a 2-day burst on 2026-04-23 to 04-24. The pattern: when running `/study-repo` on an external project, the LLM extracted that project's tech stack and asserted it as facts about *this* project. Cleanup happened correctly via `claude-memory reject` after detection, but the round-trip is wasteful and noisy.
+**Phase 1 — prompt fix (LANDED 2026-05-01).**
+`.claude/skills/study-repo/SKILL.md` gained a top-level "CRITICAL: Memory Discipline" section that explicitly forbids the LLM from calling `memory.store_extraction` with the studied project's tech stack as `uses_database` / `uses_framework` / `uses_language` / `deployment_platform` / `auth_method`. Allowed: `predicate=reference` for descriptions of the external project, plus genuine project-facing decisions/conventions/architecture derived from contrast (with reason clauses). The influence document (`docs/influence/<project>.md`) is named as the right home for "what tech does the studied project use" observations, taking memory entirely out of that loop.
+**Phase 2 — defense-in-depth detector (DEFERRED to 0.12.x or later).**
+If the prompt fix isn't enough on its own — measured by re-running `/study-repo` against ≥3 external projects post-2026-05-01 and counting any `uses_database`/`deployment_platform` rows that appear with non-self subjects — build `Distill::ExternalAttributionDetector` as a sister to `ReferenceMaterialDetector`. Heuristics: source content_item text containing "studying X", "/study-repo", a non-current-project repo URL, or "external project" → bias single-value-cardinality extractions toward `predicate=reference`.
+False-positive risk to handle: legitimate facts ABOUT this project that mention an external one ("ClaudeMemory adopts SessionStart hook context injection like claude-supermemory does") must still land as `decision` with reason clause, not be retagged. Solution if needed: detector requires both (a) external-project marker in source AND (b) the extracted subject not being the current project's repo entity.
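The two-condition rule described above can be sketched as a small pure function. This is only an illustration under stated assumptions: the module name, the method signature, the marker patterns, and the list of single-value predicates are hypothetical and not taken from the gem. The repo-URL heuristic is left out because it needs the current project's own URL.

```ruby
# Hypothetical sketch of the Phase 2 check; none of these names exist in the gem.
module ExternalAttributionSketch
  # Text markers named in the entry above (assumed regex forms).
  MARKERS = [/studying\s+\S+/i, %r{/study-repo}, /external project/i].freeze

  # Assumed set of single-value-cardinality predicates.
  SINGLE_VALUE_PREDICATES = %w[
    uses_database uses_framework uses_language deployment_platform auth_method
  ].freeze

  module_function

  # Returns the predicate the extraction should be stored under.
  def resolve_predicate(source_text:, subject:, predicate:, current_project:)
    # Decisions, conventions, and similar predicates are never retagged.
    return predicate unless SINGLE_VALUE_PREDICATES.include?(predicate)

    external_marker = MARKERS.any? { |pattern| source_text.match?(pattern) }
    foreign_subject = subject != current_project

    # Both conditions (a) and (b) must hold before retagging.
    external_marker && foreign_subject ? "reference" : predicate
  end
end
```

Because the function only retags when both conditions hold, a `decision` about this project that merely mentions an external one passes through unchanged.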
+**Acceptance.**
+- After Phase 1: re-run `/study-repo` on a fresh DB; observe zero `uses_database` or `deployment_platform` facts inserted that point to the external project's tech.
+- After Phase 1: the 27-fact cluster pattern doesn't reappear in similar `/study-repo` sessions.
+- Phase 2 trigger: only build if Phase 1 measurement shows persistent leakage.
+**Effort.** Phase 1: 15 minutes (done). Phase 2 (if needed): ~½ day for detector + tests.
+---
+### 62. Historical Bare-Conclusion Backfill
+Source: 2026-04-30 #60 investigation, cause 1. 34 bare-conclusion facts pre-date the 2026-04-20 reason-clause prompt commit (`f22d12f`). They satisfy the strict regex but most are factually informative ("MCP tools return dual content + structuredContent via TextSummary module" — describes mechanics implicitly without a "because"). The `quality_score` headline now correctly windows to the last 30 days (commit `7591da4`), but those 34 facts still appear in the historical line and may surface in `claude-memory show` and recall queries forever.
+**Implementation options (pick one).**
+A. **Reclassify to `legacy_observation` predicate.** New non-guarded predicate that the bare-conclusion detector ignores. Migration walks active `decision`/`convention` facts created before 2026-04-20 with no reason clause, reclassifies. Preserves the content; removes the metric pollution.
+B. **One-shot prompt-rewrite pass.** For each pre-2026-04-20 bare fact, run a small LLM call asking "infer the reason from the original quote/content_item text" and rewrite the object. Higher fidelity; costs ~$1-5 in API calls.
+C. **Retroactive rejection.** Mark them all `status=rejected`. Cheap and clean but throws away signal. Probably wrong.
+**Recommendation.** Option A. Cheap, reversible (predicate change is just a column update), and the facts remain queryable, just outside the bare-conclusion bucket.
+**Acceptance.**
+- Run the migration; verify the historical bare-conclusion count drops by ~34.
+- Verify those facts still appear in `memory.recall` queries (predicate filter optional).
+- `digest` quality section's historical block reports a meaningfully lower number afterwards.
+**Effort.** ~½ day. Mostly a Sequel migration + a `claude-memory reclassify-bare-conclusions` command paralleling `reclassify-references`.
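As a rough illustration of Option A, the selection rule can be expressed over plain hashes before committing to a migration. Everything here is an assumption made for the sketch: the hash keys, the `REASON_CLAUSE` pattern (a stand-in for the gem's actual bare-conclusion check, which this entry does not show), and the function name.

```ruby
require "date"

# Hypothetical sketch of Option A; the real change would be a Sequel migration.
CUTOFF = Date.new(2026, 4, 20)
GUARDED = %w[decision convention].freeze
# Assumed stand-in for the gem's reason-clause detection.
REASON_CLAUSE = /\b(because|so that|since|in order to)\b/i

# Returns a new array in which pre-cutoff bare conclusions carry the
# legacy_observation predicate; every other fact is returned unchanged.
def reclassify_bare_conclusions(facts)
  facts.map do |fact|
    bare = GUARDED.include?(fact[:predicate]) &&
      fact[:status] == "active" &&
      fact[:created_on] < CUTOFF &&
      !fact[:object].match?(REASON_CLAUSE)
    bare ? fact.merge(predicate: "legacy_observation") : fact
  end
end
```

Only the predicate changes, so the object text and provenance stay intact and the step can be reversed by restoring the original predicate.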
+---
+### 63. Pre-Release Hook Smoke Gate (0.12.0)
+Source: 2026-04-30 verification incident during 0.11 work. Five commits landed for #47 token-budget telemetry with 156 specs green. The user asked "did you actually run claude-memory show on this project?" — at which point a smoke test revealed the installed gem was still 0.9.1 and 24 hours of real SessionStart hook events had recorded no `context_tokens` field. The bug was not in the code; the bug was in the *release process* — specs verify code correctness against the working tree, but production hooks invoke the installed gem via PATH. Without `rake install`, every hook/MCP code change is dead in production.
+This already lives in memory (`feedback_hooks_run_installed_gem.md`) and as two project conventions stored via `memory.store_extraction`. It's a known trap that I (Claude) hit anyway. **Codify it into the release pipeline so the trap can't be sprung again.**
+**Implementation.**
+- **New `bin/pre-release-smoke`** script that:
+  1. Runs `bundle exec rake install` (rebuild gem from current working tree).
+  2. Verifies `which claude-memory` resolves to the installed-gem binary (sanity check).
+  3. Triggers each gem-managed hook event with a synthetic payload via stdin: `claude-memory hook context`, `claude-memory hook ingest --db /tmp/smoke.sqlite3`, `claude-memory hook nudge`, etc. — populates a temp DB.
+  4. Inspects `activity_events` table via `sqlite3 json_extract` for the fields the current version is supposed to record. Specifically:
+     - `hook_context` events should carry both `context_length` and `context_tokens` (since 0.11.0).
+     - `roi_nudge` events should carry `n`, `used`, `pct`, `prior_count` (since 0.11.0).
+     - Any future field added under release becomes part of this checklist.
+  5. Exits non-zero if any expected field is null or absent.
+- **Per-version expectation manifest** at `spec/smoke/expected_fields.yml` — declarative list of `{event_type, fields, since_version}` so the script doesn't need code changes when a new field lands; just append to the YAML and the gate enforces it on the next release.
+- **`/release` skill integration.** Phase 1 Step 5b (after specs, before lint) runs `bin/pre-release-smoke`. Failure aborts the release with the field name(s) that were null. Skill description gains a one-line "verifies installed gem actually fires hooks correctly".
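The manifest-driven field check at the heart of the gate could look roughly like the sketch below. The YAML layout follows the `{event_type, fields, since_version}` shape described above, but the exact schema, the `missing_fields` helper, and the row format (`event_type` plus a `detail_json` string) are assumptions for illustration. `since_version` is carried in the manifest but not evaluated here.

```ruby
require "json"
require "yaml"

# Assumed manifest shape; field names are quoted so YAML keeps them as strings.
MANIFEST_YAML = <<~YAML
  - event_type: hook_context
    fields: ["context_length", "context_tokens"]
    since_version: "0.11.0"
  - event_type: roi_nudge
    fields: ["n", "used", "pct", "prior_count"]
    since_version: "0.11.0"
YAML

# events: array of { "event_type" => String, "detail_json" => String } rows.
# Returns "event_type.field" strings for every expected field that is null or absent.
def missing_fields(manifest, events)
  manifest.flat_map do |entry|
    type = entry["event_type"]
    rows = events.select { |event| event["event_type"] == type }
    # An event type that never fired fails every field it should have recorded.
    next entry["fields"].map { |field| "#{type}.#{field}" } if rows.empty?

    rows.flat_map do |row|
      detail = JSON.parse(row["detail_json"])
      entry["fields"].select { |field| detail[field].nil? }
                     .map { |field| "#{type}.#{field}" }
    end
  end.uniq
end

manifest = YAML.safe_load(MANIFEST_YAML)
```

A wrapper script would call `missing_fields(manifest, rows)` on the temp DB's rows and exit non-zero when the result is non-empty, printing the returned names as the failure message.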
+**Acceptance.**
+- `bin/pre-release-smoke` exits 0 when the installed gem matches the working tree and all expected fields populate.
+- Deleting the `context_tokens:` line from `Hook::Handler#context` and re-running `bin/pre-release-smoke` produces a clear error pointing at the missing field on `hook_context.detail_json`.
+- `/release` skill aborts Phase 1 if the smoke gate fails — never reaches `git push`.
+- Test: `spec/smoke/pre_release_smoke_spec.rb` verifies the manifest schema and that the script's exit-code logic flips on simulated null fields.
+**Edge cases.**
+- The script uses a temp DB so it can't pollute the user's project DB. Cleans up on exit.
+- If `rake install` fails (gemspec validation, signing, etc.), the script reports that as a separate failure mode, not a smoke-gate failure.
+- The `hook nudge` synthetic payload needs a `session_id` of a real session that contributed facts — the script can pre-seed one fact and use a dedicated `smoke-test-NNNN` session id.
+**Effort.** ~½ day for the script + manifest + skill integration. Spec is the bulk of the time.
+**Why this release.** The 0.11 verification gap directly motivated this. Release discipline that doesn't catch a trap already hit twice (#47 today, plus the 2026-04-16 ActivityLog incident in `feedback_hooks_run_installed_gem.md`) isn't real discipline. Pairs naturally with #52: the scoreboard catches regressions in measurement; the smoke gate catches the regression where the measurement itself doesn't fire.
+---
 ### 21. Incremental Indexing with File Watching
 Source: grepai study (reinforced 2026-03-02)
@@ -562,6 +825,67 @@ Specs cover: refresher updates from both stores including cross-DB project→glo
 Schema migration v13 adds `mcp_tool_calls` telemetry table (tool_name, called_at, duration_ms, result_count, scope, error_class). `MCP::Telemetry` wraps `Server#handle_tools_call` with monotonic-clock timing, captures errors, and records to the project DB; DB errors are swallowed so telemetry never fails a real tool call. `StatsCommand` gains `--tools` and `--since DAYS` flags showing total calls, error rate, and per-tool breakdown (calls, avg ms, p95 ms, error rate). `Sweep::Maintenance#prune_old_mcp_tool_calls` enforces a 90-day retention window, wired into `Sweeper#run!`. Rejected NDJSON in favor of SQLite for schema/query consistency with the rest of the gem. Dropped query-text capture (YAGNI — the dedup insight the hash would enable also needs raw text). Also fixed a latent bug where `StatsCommand` opened the DB via `Sequel.sqlite` (requiring the unlisted `sqlite3` gem); now uses the extralite adapter consistently.
+### 57. Provenance-Strength-Aware Retrieval Ranking
+Source: 2026-04-28 article "Why Karpathy's Second Brain Breaks at Agent Scale" (Zaid, [@Ctrl_Alt_Zaid](https://x.com/Ctrl_Alt_Zaid/status/2049082538686382397)) — "Memories need metadata such as confidence" / "without scoring, everything competes equally."
+**Gap.** `Domain::Provenance` already records `strength` ∈ {`stated`, `inferred`} (provenance.rb:7,14,22-26), but the value is only consumed as a boolean (`stated?` / `inferred?`) for display. `Index::IndexQuery` and the RRF fusion in `Recall` do not factor strength into ranking. Result: a fact that was inferred from one ambiguous transcript line ranks identically to one explicitly stated multiple times across sessions.
+**Implementation.**
+- **Strength score derivation.** Add `Domain::Provenance#confidence_weight` returning `1.0` for `stated`, `0.6` for `inferred`. Single-source — no new column.
+- **Per-fact aggregate.** New `SQLiteStore#fact_confidence(fact_id)` returns max strength weight across all provenance rows (a fact stated once and inferred twice is still high-confidence).
+- **Ranking integration.** `Index::IndexQuery` already returns scored candidates; multiply final RRF score by `(0.7 + 0.3 * confidence_weight)`. The modifier is bounded to 0.7-1.0 for any weight in [0, 1] (in practice 1.0 for `stated` and 0.88 for `inferred`), so a low-confidence fact still ranks if it's the only relevant one: we're nudging, not filtering.
+- **Surfacing.** `score_trace` (introduced in #5) gains a `confidence_factor` field so the multiplier is auditable in `memory.recall_semantic --explain`.
+**Acceptance.**
+- `memory.recall` results re-rank in tests: an `inferred`-only fact loses to a `stated` fact when both have similar BM25/vector scores.
+- Retrieval benchmark (`spec/benchmarks/retrieval/`) shows Recall@k unchanged or improved on the 155-query set.
+- `score_trace.confidence_factor` populated for every result.
+**Edge cases.**
+- Facts with no provenance (legacy / direct stores): default to 0.8 (between stated and inferred). Don't penalize as 0.6 — those facts predate the field.
+- `memory.store_extraction` callers don't always set strength; default already lands on `stated` per provenance.rb:14, which is the right behavior.
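The weights, the max-across-provenance aggregate, and the bounded multiplier described in this entry can be worked through in a few lines. This is a minimal sketch with hypothetical free-standing functions and hash-shaped candidates, not the gem's `Domain::Provenance` or `SQLiteStore` API.

```ruby
# Hypothetical sketch; method and constant names are illustrative only.
STRENGTH_WEIGHTS = { "stated" => 1.0, "inferred" => 0.6 }.freeze
NO_PROVENANCE_WEIGHT = 0.8 # legacy facts with no provenance rows

# strengths: the strength value of each provenance row for one fact.
def fact_confidence(strengths)
  return NO_PROVENANCE_WEIGHT if strengths.empty?

  # Max, not mean: one explicit statement is enough for high confidence.
  strengths.map { |strength| STRENGTH_WEIGHTS.fetch(strength) }.max
end

# Bounded multiplier applied to the fused RRF score.
def confidence_factor(strengths)
  0.7 + 0.3 * fact_confidence(strengths)
end

# candidates: array of { id:, rrf_score:, strengths: } hashes.
def rerank(candidates)
  candidates
    .map { |c| c.merge(final_score: c[:rrf_score] * confidence_factor(c[:strengths])) }
    .sort_by { |c| -c[:final_score] }
end
```

With these weights an `inferred`-only fact is scaled by about 0.88 and a fact without provenance by about 0.94, so equal RRF scores are separated without any candidate being filtered out.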
+**Effort.** ~half day. No schema migration; `strength` already exists.
+**Why medium.** The article calls this out as a structural reliability requirement, but ClaudeMemory already has the data — we're just not using it. Cheap win that closes a visible gap in the article's external critique.
+---
+### 58. Reinforcement-and-Decay Ranking Signal
+Source: 2026-04-28 article "Why Karpathy's Second Brain Breaks at Agent Scale" (Zaid) — "Memories need metadata such as freshness, importance, reinforcement" / "Some memory should weaken, expire, or be archived."
+**Gap.** `last_recalled_at` (schema v17, populated by `Sweep::RecallTimestampRefresher`) currently only feeds `Recall::StaleDetector` to *flag* unused facts (stale_detector.rb:57-61). It does not boost frequently-recalled facts in retrieval ranking, nor decay long-untouched ones. Result: a fact recalled 50 times in the last week and a fact recalled once 8 months ago compete on equal footing once their BM25/vector scores match — the inverse of what the article calls "the right memory, not the most memory."
+**Implementation.**
+- **Add `recall_count` column.** Migration vNN adds `facts.recall_count INTEGER DEFAULT 0`. `RecallTimestampRefresher` increments it alongside the `last_recalled_at` update (single UPDATE, no extra query).
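+- **Reinforcement-decay multiplier.** New `Recall::FreshnessScorer.weight(fact)` returns `max(0.5, min(1.5, log1p(recall_count) * exp(-age_days / HALF_LIFE)))` where `HALF_LIFE` defaults to 60 days. As written, `HALF_LIFE` acts as an e-folding time rather than a true half-life: the decay term falls to 1/e (about 0.37), not 0.5, after `HALF_LIFE` days. Bounded so a single hot fact can't dominate and a cold fact can't disappear.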
+- **Wire into RRF.** Same composition point as #57: `final_score = rrf_score * confidence_factor * freshness_factor`. Both factors land in `score_trace`.
+- **Configuration.** `CLAUDE_MEMORY_RECALL_HALF_LIFE_DAYS` env var (default 60) for users who want longer/shorter memory.
+- **Decay is soft, not destructive.** No facts are deleted or archived by this — that stays the user's job via `claude-memory reject`. The article's "decay" framing is correct in spirit (rank weight drops) but we don't auto-prune.
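The multiplier above can be checked numerically with a short sketch. The function name and keyword arguments are hypothetical, `age_days` is assumed to mean days since the last recall, and `Math.log(1 + n)` is used in place of `log1p`.

```ruby
# Hypothetical sketch of the multiplier exactly as the formula above states it.
HALF_LIFE_DAYS = Float(ENV.fetch("CLAUDE_MEMORY_RECALL_HALF_LIFE_DAYS", "60"))

# age_days: assumed to be days since the fact was last recalled.
def freshness_weight(recall_count:, age_days:, half_life: HALF_LIFE_DAYS)
  raw = Math.log(1 + recall_count) * Math.exp(-age_days / half_life.to_f)
  # Clamp so hot facts cannot dominate and cold or brand-new facts stay visible.
  raw.clamp(0.5, 1.5)
end
```

With a 60-day constant, a fact recalled 10 times and last touched a week ago reaches the 1.5 ceiling, while a fact recalled once 180 days ago, or a brand-new fact with no recalls, sits at the 0.5 floor.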
+**Acceptance.**
+- Two facts with identical BM25 scores: the one recalled 10× in the last week ranks above one not recalled in 6 months.
+- Repeat-correction benchmark (#32) shows improvement: facts that "stuck" rank higher than abandoned ones.
+- `score_trace.freshness_factor` populated; visible in `memory.recall_semantic --explain`.
+- Telemetry: `activity_events` gain `freshness_factor` in the details JSON for hook_context events so we can backtest changes to `HALF_LIFE`.
+**Edge cases.**
+- Brand-new facts (recall_count=0, age=0): `log1p(0) = 0` would zero out the weight. Floor at 0.5 — new facts shouldn't be penalized for being new.
+- Facts never recalled but still valid: clamped to 0.5 floor; ranked behind reinforced peers but not invisible.
+- Cross-DB mixing: refresher already handles cross-DB project→global per memory fact "OperationTracker.reset_stuck_operations…"; recall_count lives on each fact in its own DB, which is the right shape.
+**Effort.** ~1 day (migration, refresher update, ranking integration, tests).
+**Why medium.** This pairs naturally with #57 — together they answer the article's "without scoring, everything competes equally" critique. Defer behind the 1.0 punchlist (#47-52) but ahead of the post-1.0 nudge/drift items, since these directly affect retrieval quality measured by the existing benchmarks.
 ---
 ## Low Priority / Defer
@@ -753,4 +1077,4 @@ Influence documents:
 ---
-*Last updated: 2026-04-28 - 1.0 punchlist track opened (`docs/1_0_punchlist.md`). High Priority entries #47-52 (must-have for 1.0): token-budget telemetry, hallucination rate, harm benchmark, CLAUDE.md baseline publication, `claude-memory show`, benchmark scoreboard. Medium Priority entries #53-56 (post-1.0): first-week ROI nudge, real-session repeat-correction detection, token-cost growth tracking, drift dashboard. Previously: 2026-04-27 - #35 (access-based staleness, sweep-derived) landed.*
+*Last updated: 2026-04-28 (post-0.10.0 release, post-rebase). 1.0 punchlist restructured around milestone versions per `docs/1_0_punchlist.md`. **0.11.0** = #47/#48/#51/#53 + #49 prototype. **0.12.0** = #49 full + #50/#52. **1.0.0** = #54/#55/#56/#59 (the new API stability audit). #59 added 2026-04-28 as a 1.0 release blocker (originally #57; renumbered after rebase brought in Mercury-article entries #57/#58). #53 (first-week ROI nudge) moved up from post-1.0 to 0.11.0. Previously: 2026-04-27 - #35 (access-based staleness, sweep-derived) landed.*