claude_memory 0.10.0 → 0.11.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/docs/architecture.md CHANGED
@@ -40,7 +40,7 @@ ClaudeMemory is architected using Domain-Driven Design (DDD) principles with cle
 
  **Components:**
  - **CLI** (`cli.rb`): Thin router that dispatches to command classes
- - **Commands** (`commands/`): 32 command classes, each handling one CLI command
+ - **Commands** (`commands/`): 34 command classes, each handling one CLI command
  - **Configuration** (`configuration.rb`): Centralized ENV access and path calculation
 
  **Key Principles:**
@@ -205,7 +205,7 @@ end
  - **Server**: WEBrick HTTP server (default port 3377), starts via `claude-memory dashboard`
  - **API**: HTTP-shape glue + per-endpoint formatting; routes/delegates to panel classes
  - **Panels** (each backed by a dedicated class with focused responsibility):
- - `Trust`: weekly moments, fingerprint, utilization, feedback ratio, needs-review
+ - `Trust`: weekly moments, fingerprint, utilization, feedback ratio, needs-review, **token_budget** (p50/p95/avg over 30d, 0.11.0+), **quality_score** (live 30-day window + historical baseline, 0.11.0+)
  - `Moments`: feed-first activity stream with kind classification
  - `Knowledge`: predicate-grouped fact summary (incl. References section)
  - `Conflicts`: display-layer dedup with bulk-reject helper
@@ -361,7 +361,7 @@ FileSystem (write)
  - Value objects (SessionId, TranscriptPath, FactId)
  - Centralized Configuration
  - 4 domain models with business logic
- - 32 command classes
+ - 34 command classes
  - 25 MCP tools
  - Semantic search with local embeddings (FastEmbed + TF-IDF fallback)
  - Schema v17 with WAL mode
data/docs/dashboard.md CHANGED
@@ -31,7 +31,8 @@ The dashboard is **feed-first**: the main view is a chronological stream of
 
  ### Sidebar — Trust
 
- Three at-a-glance signals so you can answer "is memory helping?" in one look:
+ At-a-glance signals so you can answer "is memory helping?" and "what does
+ it cost?" in one look:
 
  - **This week's moments** — count of value-producing events (recall hits,
  context injections, extractions). Includes a week-over-week delta.
@@ -40,6 +41,16 @@ Three at-a-glance signals so you can answer "is memory helping?" in one look:
  - **Needs review** — open conflicts (deduped to distinct contradictions) +
  stale facts (active but not recalled in the configured window) + empty
  recalls (queries that returned nothing).
+ - **Token budget (30d)** *(0.11.0+)* — p50/p95/avg `context_tokens` injected
+ per SessionStart over the last 30 days, with sample size. Answers "what
+ does memory cost per session?" — pairs with the digest's "Context cost"
+ section and `claude-memory stats --tokens`.
+ - **Quality score (live, 30d)** *(0.11.0+)* — 0–100 hallucination-rate
+ proxy. `score = 100 - (suspect_pct + bare_pct)` where suspect = facts
+ retagged as `predicate=reference` and bare = decision/convention facts
+ whose object skipped the prompt-mandated reason clause. Headline is the
+ live 30-day window; the underlying snapshot also exposes a `historical`
+ block over all active facts for context. Returns 100 on empty stores.
  - **Utilization (30d)** — of facts extracted in the last 30 days, what % has
  Claude actually surfaced via recall or context injection. Color-coded
  (green ≥40%, yellow ≥15%, red below). Hidden on fresh installs.
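The quality-score arithmetic described above reduces to a few lines. A minimal Ruby sketch, with illustrative names; the reason-clause regex is an assumption standing in for the prompt's actual vocabulary, and "suspect" is simplified to a predicate check:

```ruby
# Hedged sketch of the Trust panel's quality score:
# 100 minus the suspect and bare-conclusion percentages.
# REASON_CLAUSE is an assumed stand-in for the prompt's
# "because…" / "so that…" / "to avoid…" vocabulary.
REASON_CLAUSE = /because|so that|to avoid/i

def quality_score(active_facts)
  return 100 if active_facts.empty? # empty stores score 100

  suspect = active_facts.count { |f| f[:predicate] == "reference" }
  bare = active_facts.count do |f|
    %w[decision convention].include?(f[:predicate]) &&
      !f[:object].match?(REASON_CLAUSE)
  end

  suspect_pct = 100.0 * suspect / active_facts.size
  bare_pct    = 100.0 * bare / active_facts.size
  (100 - (suspect_pct + bare_pct)).round
end
```

On an empty store this returns 100, matching the documented behavior.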
@@ -161,8 +172,17 @@ WAL writer lock open across page loads.
  ## Related CLI
 
  - `claude-memory digest [--since DAYS] [--output FILE]` — markdown report of
- the same Trust + Knowledge + Conflicts + Feedback signals, suitable for
- email or commit-into-repo.
+ the same Trust + Knowledge + Conflicts + Feedback signals plus
+ **Context cost** (token-budget p50/p95) and **Quality** (score + rejection
+ rate) sections. Suitable for email or commit-into-repo.
+ - `claude-memory show [--pending] [--source SOURCE]` *(0.11.0+)* — print
+ what memory would inject at the next SessionStart in plain Markdown.
+ Same `Hook::ContextInjector` path real sessions use, so the output
+ matches what Claude actually receives. Footer reports fact count, ~token
+ estimate, and char count.
+ - `claude-memory stats --tokens [--since DAYS]` *(0.11.0+)* — token budget
+ histogram (p50/p95/avg/min/max + bucketed distribution) for SessionStart
+ context injections. Same data the Trust panel's Token budget block aggregates.
  - `claude-memory census [--root DIR]` — privacy-safe cross-project
  predicate vocabulary scan; pairs with the Knowledge panel for "what
  predicates does my whole tree use?".
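For the p50/p95/avg figures that `stats --tokens` and the Token budget block report, a nearest-rank percentile over the per-session samples is enough. A hypothetical helper, not the gem's implementation:

```ruby
# Nearest-rank-style percentiles over per-session context_tokens samples,
# returning the same p50/p95/avg/n shape the Token budget block reports.
# Illustrative only; the gem's actual aggregation may differ.
def token_budget_stats(samples)
  return nil if samples.empty?

  sorted = samples.sort
  pct = ->(p) { sorted[((p / 100.0) * (sorted.size - 1)).round] }
  {
    p50: pct.call(50),
    p95: pct.call(95),
    avg: samples.sum / samples.size, # integer average of token counts
    n:   samples.size
  }
end
```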
data/docs/improvements.md CHANGED
@@ -1,6 +1,6 @@
  # Improvements to Consider
 
- *Updated: 2026-04-28 - Opened the 1.0 punchlist track (see `docs/1_0_punchlist.md`). High-priority entries below now include the must-have 1.0 items: token-budget telemetry (#47), hallucination-rate metric (#48), negative-fact harm benchmark (#49), CLAUDE.md baseline publication (#50), `claude-memory show` (#51), benchmark scoreboard diff (#52). Post-1.0 entries: first-week ROI nudge (#53), real-session repeat-correction detector (#54), token-cost growth tracking (#55), drift dashboard (#56). Earlier 2026-04-28 update added cq study (usefulness-focused). Previously: 2026-03-30 - Re-studied all 7 influencer repos. New recommendations: CLAUDE_CONFIG_DIR support (#26, from episodic-memory), Usage Stats / ROI Tracking (#27, from grepai v0.35.0). New Features to Avoid: AST-Aware Code Chunking (QMD), Custom Instructions via Env Var (lossless-claw v0.5.2), OpenClaw Context Injection (claude-mem v10.6.0). Repos with no changes: kbs (v0.2.1), claude-supermemory (v2.0.1), episodic-memory (v1.0.15). Previously: 14 features implemented through 2026-03-24.*
+ *Updated: 2026-04-28 (post-0.10.0) - Restructured 1.0 punchlist around milestone versions. **0.11.0 "Trust & Cost"** ships #47 (token budget), #48 (hallucination rate), #51 (claude-memory show), #53 (first-week ROI nudge — moved up from post-1.0), and a 3-scenario prototype of #49 (harm benchmark). **0.12.0 "Release Discipline"** ships #49 full corpus, #50 (CLAUDE.md baseline), #52 (benchmark scoreboard). **1.0.0** lands soak-validated #54/#55/#56 (time permitting) plus the new #59 API stability audit. See `docs/1_0_punchlist.md` for the full plan with calendar targets. Also added 2026-04-28: two ranking-signal gaps surfaced by the Mercury / "Why Karpathy's Second Brain Breaks" article (Zaid, 2026-04-28): provenance-strength-aware ranking (#57) and reinforcement/decay scoring (#58). Earlier 2026-04-28 updates: opened the 1.0 punchlist track + added cq study. Previously: 2026-03-30 - Re-studied all 7 influencer repos. New recommendations: CLAUDE_CONFIG_DIR support (#26, from episodic-memory), Usage Stats / ROI Tracking (#27, from grepai v0.35.0). New Features to Avoid: AST-Aware Code Chunking (QMD), Custom Instructions via Env Var (lossless-claw v0.5.2), OpenClaw Context Injection (claude-mem v10.6.0). Repos with no changes: kbs (v0.2.1), claude-supermemory (v2.0.1), episodic-memory (v1.0.15). Previously: 14 features implemented through 2026-03-24.*
  *Sources:*
  - *[thedotmack/claude-mem](https://github.com/thedotmack/claude-mem) - Memory compression system (v10.6.3, re-studied 2026-03-30)*
  - *[obra/episodic-memory](https://github.com/obra/episodic-memory) - Semantic conversation search (v1.0.15, re-studied 2026-03-30 — no changes)*
@@ -152,10 +152,14 @@ Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #2). Builds on
 
  ---
 
- ### 49. Negative-Fact Harm Benchmark
+ ### 49. Negative-Fact Harm Benchmark — *prototype in 0.11.0, full corpus in 0.12.0*
 
  Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #3). Parallels #32 (Repeat-Correction Benchmark) but inverts the goal.
 
+ **Two-phase delivery (added 2026-04-28):**
+ - **0.11.0 — 3-scenario prototype (~½d).** Three hand-written cases (one stale-tech, one mismatched-scope, one superseded-but-undetected) run against real Claude under `EVAL_MODE=real`. Smoke test: if even three cases produce a >0% harm rate, there is a fundamental issue the full 0.12 benchmark would only confirm, and we want to know early. No release gate yet — the prototype is diagnostic.
+ - **0.12.0 — full 10-15 scenario corpus (~2d).** Adds the missing harm classes (reference-material-as-fact + remaining stale/mismatched/superseded cases) and wires the >1% harm-rate release gate.
+
 
  **Gap.** Every benchmark we run measures whether memory **helps** (Recall@k, MRR, e2e pass rate, repeat-correction prevention rate). Nothing measures whether memory **harms** — i.e. holds a wrong/stale fact and causes Claude to follow it. Without this, "memory helps" is unfalsifiable.
 
  **Implementation.**
@@ -328,9 +332,9 @@ IndexCommand builds text→embedding cache from already-embedded facts before in
 
  In Ruby fallback path (`search_by_vector_fallback`), facts are grouped by `embedding_json` before cosine similarity computation. Unique embeddings scored once, results fanned out to all matching fact_ids. Native sqlite-vec path unaffected (handles own dedup).
 
- ### 53. First-Week ROI Nudge
+ ### 53. First-Week ROI Nudge — *targeted for 0.11.0 (moved up from post-1.0)*
 
- Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #7). Closes the cold-start gap.
+ Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #7). Closes the cold-start gap. **Moved up from post-1.0 to 0.11.0** in the 2026-04-28 path-to-1.0 restructure — fits the "Trust & Cost" theme since it's the user-visible proof that memory is doing work.
 
  **Gap.** New users install the gem, run a few sessions, and don't know whether memory is working. The dashboard exists but they have to know to look. The auto-memory mirror (#36) helps but isn't surfaced. We need a low-friction nudge in the first ~10 sessions that says "memory is working, here's what it did" — and then gets out of the way.
 
@@ -442,6 +446,126 @@ Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #10). Builds on
 
  ---
 
+ ### 59. API Stability Audit (1.0 release blocker)
+
+ Source: 2026-04-28 path-to-1.0 review (`docs/1_0_punchlist.md` #11). Added after 0.10.0 ship. *(Renumbered from #57 to #59 during rebase against origin/main on 2026-04-28 — Mercury-article PR #5 had already taken #57 and #58.)*
+
+ **Gap.** "1.0 commits to semver" is meaningless without an explicit public/internal split. Many of the surfaces touched in 0.9.0 / 0.10.0 (MCP tool schemas, hook payload shapes, CLI flags, dashboard endpoints) have evolved organically and aren't formally documented as stable vs. internal. Without this audit, future "regression" complaints become un-arbitrable — was that flag/method/tool *promised*? We don't know.
+
+ **Implementation.**
+
+ - **New `docs/api_stability.md`** as the authoritative public-API reference. Sections:
+ 1. *Public CLI surface*: every `claude-memory <subcommand>` registered in `Commands::Registry::COMMANDS`, every documented flag, with stability tier per command (`stable` / `experimental` / `internal`).
+ 2. *Public MCP tools*: every entry in `MCP::ToolDefinitions.all` with its argument schema, return shape, and tool-annotation hints (`readOnlyHint`, `idempotentHint`, `destructiveHint`). Stability tier per tool.
+ 3. *Public hook contract*: payload field names accepted by `Hook::Handler` and `Commands::HookCommand`, return shapes (`hookSpecificOutput`, exit codes via `Hook::ExitCodes`), stability tier per hook event.
+ 4. *Public Ruby API*: the surface external Ruby callers can rely on. Candidates: `ClaudeMemory::Recall`, `Configuration`, `Store::StoreManager`, `Domain::*`. Everything else (resolver internals, dashboard internals, sweep internals) marked internal.
+ 5. *Schema stability*: column names, table names, predicate vocabulary in `PredicatePolicy::POLICIES`. Schema migrations remain forward-compatible per the round-trip-spec convention; column *removals* require deprecation cycle.
+ - **Deprecation policy paragraph**: "we'll mark X deprecated in N.x.0 (with a runtime warning), keep it functional for ≥1 minor cycle, and remove no earlier than (N+1).0.0." Mirrors Ruby/Rails conventions.
+ - **Deprecation-warning instrumentation**: tiny module `ClaudeMemory::Deprecations` with a `warn(name, replacement:, removed_in:)` helper. Anywhere we want to change a public surface in 1.x, we wrap with `Deprecations.warn` first.
+ - **README + CLAUDE.md** add a top-level link: "Public API: see [docs/api_stability.md](docs/api_stability.md)".
+
+ **Acceptance.**
+
+ - `docs/api_stability.md` exists and lists every CLI command, MCP tool, hook event, and key Ruby class with a stability tier.
+ - A reader of the doc can answer "is `claude-memory dashboard --port` stable?" / "will `Recall.new(manager).query(...)` keep its signature in 1.x?" in <30 seconds.
+ - `ClaudeMemory::Deprecations.warn` is wired up and used at least once (e.g. for a soon-to-be-renamed flag) so the mechanism is exercised.
+ - `/release` skill knows about `docs/api_stability.md` and reminds the operator to update it on any public-surface change.
+
+ **Edge cases.**
+
+ - We have to be honest about which Ruby surfaces are public. `Recall` and `Configuration` clearly are; `Sweep::Maintenance` clearly isn't; `Domain::Fact` is ambiguous (used by external benchmark adapters in `spec/benchmarks/`). Default to **internal** when ambiguous — easier to promote later than demote.
+ - Schema column names are tricky. Migrations can rename safely; external SQL tools (e.g. cq) read the schema directly. Document the column names as "best-effort stable, no removal without deprecation cycle."
+ - The dashboard JSON API is internal — explicitly call this out so users don't build scripts against it.
+
+ **Effort.** ~2 days. The doc is the bulk of the time; the deprecation warning module is ~50 LOC.
+
+ **Why 1.0 must-have.** Without this, the semver promise is vibes. Future regressions in non-listed areas can be argued away; future regressions in listed areas are bugs. Forces honesty about what we're committing to.
+
+ ---
+
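A sketch of the proposed `ClaudeMemory::Deprecations` module, following the `warn(name, replacement:, removed_in:)` signature given in the bullets above (assumed shape, not shipped code):

```ruby
# Minimal sketch of the proposed deprecation-warning helper. Policy:
# warn at call time, keep the surface working for >=1 minor cycle.
module ClaudeMemory
  module Deprecations
    # Emits a one-line warning on stderr; once-per-process dedup
    # would be an obvious extension.
    def self.warn(name, replacement:, removed_in:)
      Kernel.warn(
        "[claude-memory] DEPRECATED: #{name} will be removed in " \
        "#{removed_in}; use #{replacement} instead."
      )
    end
  end
end
```

A call site wrapping a soon-to-be-renamed flag would then read `Deprecations.warn("--old-flag", replacement: "--new-flag", removed_in: "2.0.0")`.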
+ ### 60. LLM Extractor Calibration Drift (surfaced by #48)
+
+ Source: 2026-04-30 production verification of #48 hallucination-rate metric. Surfaced when the metric was first run against real data on this very project.
+
+ **The signal.** First run of `claude-memory digest` against `claude_memory/.claude/memory.sqlite3` after the metric landed:
+
+ | Number | Value | Verdict |
+ |---|---|---|
+ | Quality score | 39/100 | bad |
+ | Suspect (predicate=`reference`) | 2 / 59 (3.4%) | acceptable |
+ | Bare conclusions (decision/convention without reason) | 34 / 59 (57.6%) | poor |
+ | 7-day rejection rate | 27 of 32 facts (84.4%) | very bad |
+
+ **What it means.** The 84% rejection rate over 7 days says the LLM extractor in this project was producing noise faster than usable knowledge — almost everything new it created got rejected within a week. The 57.6% bare-conclusion rate confirms drift from the prompt's *"every decision/convention MUST embed a reason clause"* requirement: the prompt asks for "because…" / "so that…" / "to avoid…" but recent extractions skipped the reason clause a majority of the time.
+
+ **Why this is a finding, not a metric bug.** Spot-checked 5 flagged + 5 unflagged facts on 2026-04-30; the detector's regex correctly matches the prompt's strict reason-clause vocabulary in both directions. Not a false-positive issue. The metric is doing what it was designed to do: surface real LLM calibration drift that was previously invisible.
+
+ **Possible causes (to investigate).**
+
+ 1. **Prompt drift in `lib/claude_memory/commands/skills/distill-transcripts.md`** — the reason-clause requirement may have been added to the prompt after a chunk of older facts were already extracted. Mostly historical noise rather than ongoing extraction problem. → check `git log -p lib/claude_memory/commands/skills/distill-transcripts.md` for when the reason-clause section landed and whether bare-conclusion facts cluster pre-that-commit.
+ 2. **Auto-memory mirror regurgitation** — the `Hook::AutoMemoryMirror` (0.10.0) injects auto-memory file content as extraction candidates at SessionStart. If those auto-memory files have bare-conclusion content (likely, since they're written by Claude with no reason-clause discipline), the LLM may be re-extracting them faithfully without injecting reasons that weren't in the source. → grep auto-memory file content for the same bare conclusions appearing in flagged DB facts.
+ 3. **Reference-material guard too narrow** — `ReferenceMaterialDetector` only retags `convention` predicates; "From QMD restudy: adopt X" facts (clearly third-party-project descriptions) come back as `decision` rather than `reference` and stay in the corpus. → expand `GUARDED_PREDICATES` to include `decision` for the same patterns.
+ 4. **High rejection rate is correct + the corpus is junky** — 84% rejection in last 7 days might mean we (the team) are correctly rejecting noise that the LLM is producing too aggressively. → check whether rejected facts cluster by source (transcript topic, hook event type, time-of-day).
+
+ **Acceptance / next steps.**
+
+ - Investigation note in `docs/quality_review.md` capturing which of (1)–(4) above explains the bulk of the drift.
+ - If prompt drift (cause 1): the historical bulk-flag is fine, the live extraction rate is what matters. Expose "extraction rate" over a tighter window (last 24h vs 30-day baseline) so calibration drift becomes visible without historical noise drowning the signal.
+ - If auto-memory regurgitation (cause 2): patch the auto-memory-mirror prompt or distillation prompt to require reason-clause synthesis even when source text is bare.
+ - If reference-material guard too narrow (cause 3): expand `Distill::ReferenceMaterialDetector::GUARDED_PREDICATES` and re-run `claude-memory reclassify-references --predicate decision` against active corpus.
+ - If correct + junky (cause 4): the metric is healthy; the cleanup is `claude-memory reject` runs against high-frequency junk.
+
+ **Effort.** Investigation: 0.5d. Fix: depends on cause.
+
+ **Why this is in `improvements.md`.** Independently of which cause is correct, the verification of #48 surfaced a real signal worth tracking. The metric did its job (turning invisible drift into a visible 84%); now the work is the actual cleanup. Tracked here so it doesn't fall off the radar between 0.11 ship and the 1.0 soak.
+
+ **Update 2026-04-30: investigation complete.** Diagnostics ran for all four causes; results recorded in `docs/quality_review.md`. Summary: cause 1 (prompt drift) explains 97% of bare conclusions; cause 4 (`/study-repo` misattribution burst) explains 100% of the 7-day rejection cluster; causes 2 and 3 ruled out. Headline metric calibration fix landed in commit `7591da4` (live 30-day window + historical block). The two systemic issues split into entries #61 and #62 below.
+
+ ---
+
+ ### 61. /study-repo Misattribution Guard
+
+ Source: 2026-04-30 #60 investigation, cause 4. All 27 rejected facts in this project's 7-day window were `uses_database` (18) or `deployment_platform` (9) with `session_id=nil` (MCP-originated), all from a 2-day burst on 2026-04-23 to 04-24. The pattern: when running `/study-repo` on an external project, the LLM extracted that project's tech stack and asserted it as facts about *this* project. Cleanup happened correctly via `claude-memory reject` after detection, but the round-trip is wasteful and noisy.
+
+ **Implementation.**
+
+ - New `Distill::ExternalAttributionDetector` (sister to `ReferenceMaterialDetector`). Runs after extraction and before storage.
+ - Heuristics: when the source content_item text contains markers like "studying X", "/study-repo", a non-current-project repo URL, or "external project", strongly bias toward `predicate=reference` for any `uses_database`/`deployment_platform`/`uses_framework` extraction.
+ - Optional: extend `Hook::ContextInjector` or the distillation prompt to make this constraint explicit ("when discussing an external repository, do NOT extract its tech stack as project-level facts").
+
+ **Acceptance.**
+
+ - Re-run a `/study-repo` on a fresh DB; observe zero `uses_database` or `deployment_platform` facts inserted that point to the external project's tech.
+ - The 27 rejected facts cluster from this project's history doesn't reappear in similar scenarios.
+
+ **Effort.** ~½ day. Detector is mostly regex + content_item text inspection. Prompt addition is trivial.
+
+ ---
+
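The detector heuristic above reduces to a predicate-plus-markers check. A sketch with assumed marker regexes and method name, taken from the bullets rather than real gem code:

```ruby
# Illustrative core of the proposed Distill::ExternalAttributionDetector:
# bias tech-stack extractions toward predicate=reference when the source
# text carries external-repo study markers. Regexes are assumptions.
EXTERNAL_MARKERS = [
  /studying\s+\S+/i,   # "studying X"
  %r{/study-repo},     # the slash command itself
  /external project/i
].freeze
STACK_PREDICATES = %w[uses_database deployment_platform uses_framework].freeze

def retag_as_reference?(predicate, source_text)
  STACK_PREDICATES.include?(predicate) &&
    EXTERNAL_MARKERS.any? { |re| source_text.match?(re) }
end
```

The real detector would also check for non-current-project repo URLs, which needs the current project's identity and is omitted here.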
+ ### 62. Historical Bare-Conclusion Backfill
+
+ Source: 2026-04-30 #60 investigation, cause 1. 34 bare-conclusion facts pre-date the 2026-04-20 reason-clause prompt commit (`f22d12f`). They trip the strict bare-conclusion regex, but most are factually informative ("MCP tools return dual content + structuredContent via TextSummary module" — describes mechanics implicitly without a "because"). The `quality_score` headline now correctly windows to the last 30 days (commit `7591da4`), but those 34 facts still appear in the historical line and may surface in `claude-memory show` and recall queries forever.
+
+ **Implementation options (pick one).**
+
+ A. **Reclassify to `legacy_observation` predicate.** New non-guarded predicate that the bare-conclusion detector ignores. Migration walks active `decision`/`convention` facts created before 2026-04-20 with no reason clause, reclassifies. Preserves the content; removes the metric pollution.
+
+ B. **One-shot prompt-rewrite pass.** For each pre-2026-04-20 bare fact, run a small LLM call asking "infer the reason from the original quote/content_item text" and rewrite the object. Higher fidelity; costs ~$1-5 in API calls.
+
+ C. **Retroactive rejection.** Mark them all `status=rejected`. Cheap and clean but throws away signal. Probably wrong.
+
+ **Recommendation.** Option A. Cheap, reversible (predicate change is just a column update), and the facts remain queryable, just no longer counted in the bare-conclusion bucket.
+
+ **Acceptance.**
+
+ - Run the migration; verify the historical bare-conclusion count drops by ~34.
+ - Verify those facts still appear in `memory.recall` queries (predicate filter optional).
+ - `digest` quality section's historical block reports a meaningfully lower number afterwards.
+
+ **Effort.** ~½ day. Mostly a Sequel migration + a `claude-memory reclassify-bare-conclusions` command paralleling `reclassify-references`.
+
+ ---
+
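Option A's pass can be outlined as a pure-Ruby transform over fact rows. The real change would be a Sequel migration plus the `reclassify-bare-conclusions` command; the cutoff date, predicate name, and reason-clause regex below are assumptions for illustration:

```ruby
# Sketch of option A: retag pre-cutoff bare decision/convention facts
# to a hypothetical legacy_observation predicate. Dates compared as
# ISO-8601 strings; REASON_CLAUSE stands in for the detector's actual
# reason-clause vocabulary.
REASON_CLAUSE = /because|so that|to avoid/i
CUTOFF = "2026-04-20"

def reclassify_bare_conclusions(facts)
  facts.map do |fact|
    bare = %w[decision convention].include?(fact[:predicate]) &&
           fact[:created_at] < CUTOFF &&
           !fact[:object].match?(REASON_CLAUSE)
    bare ? fact.merge(predicate: "legacy_observation") : fact
  end
end
```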
  ### 21. Incremental Indexing with File Watching
446
570
 
447
571
  Source: grepai study (reinforced 2026-03-02)
@@ -562,6 +686,67 @@ Specs cover: refresher updates from both stores including cross-DB project→glo
 
  Schema migration v13 adds `mcp_tool_calls` telemetry table (tool_name, called_at, duration_ms, result_count, scope, error_class). `MCP::Telemetry` wraps `Server#handle_tools_call` with monotonic-clock timing, captures errors, and records to the project DB; DB errors are swallowed so telemetry never fails a real tool call. `StatsCommand` gains `--tools` and `--since DAYS` flags showing total calls, error rate, and per-tool breakdown (calls, avg ms, p95 ms, error rate). `Sweep::Maintenance#prune_old_mcp_tool_calls` enforces a 90-day retention window, wired into `Sweeper#run!`. Rejected NDJSON in favor of SQLite for schema/query consistency with the rest of the gem. Dropped query-text capture (YAGNI — the dedup insight the hash would enable also needs raw text). Also fixed a latent bug where `StatsCommand` opened the DB via `Sequel.sqlite` (requiring the unlisted `sqlite3` gem); now uses the extralite adapter consistently.
 
+ ### 57. Provenance-Strength-Aware Retrieval Ranking
+
+ Source: 2026-04-28 article "Why Karpathy's Second Brain Breaks at Agent Scale" (Zaid, [@Ctrl_Alt_Zaid](https://x.com/Ctrl_Alt_Zaid/status/2049082538686382397)) — "Memories need metadata such as confidence" / "without scoring, everything competes equally."
+
+ **Gap.** `Domain::Provenance` already records `strength` ∈ {`stated`, `inferred`} (provenance.rb:7,14,22-26), but the value is only consumed as a boolean (`stated?` / `inferred?`) for display. `Index::IndexQuery` and the RRF fusion in `Recall` do not factor strength into ranking. Result: a fact that was inferred from one ambiguous transcript line ranks identically to one explicitly stated multiple times across sessions.
+
+ **Implementation.**
+
+ - **Strength score derivation.** Add `Domain::Provenance#confidence_weight` returning `1.0` for `stated`, `0.6` for `inferred`. Single-source — no new column.
+ - **Per-fact aggregate.** New `SQLiteStore#fact_confidence(fact_id)` returns max strength weight across all provenance rows (a fact stated once and inferred twice is still high-confidence).
+ - **Ranking integration.** `Index::IndexQuery` already returns scored candidates; multiply final RRF score by `(0.7 + 0.3 * confidence_weight)`. Bounded modifier (0.7-1.0 range) so a low-confidence fact still ranks if it's the only relevant one — we're nudging, not filtering.
+ - **Surfacing.** `score_trace` (introduced in #5) gains a `confidence_factor` field so the multiplier is auditable in `memory.recall_semantic --explain`.
+
+ **Acceptance.**
+
+ - `memory.recall` results re-rank in tests: an `inferred`-only fact loses to a `stated` fact when both have similar BM25/vector scores.
+ - Retrieval benchmark (`spec/benchmarks/retrieval/`) shows Recall@k unchanged or improved on the 155-query set.
+ - `score_trace.confidence_factor` populated for every result.
+
+ **Edge cases.**
+
+ - Facts with no provenance (legacy / direct stores): default to 0.8 (between stated and inferred). Don't penalize as 0.6 — those facts predate the field.
+ - `memory.store_extraction` callers don't always set strength; default already lands on `stated` per provenance.rb:14, which is the right behavior.
+
+ **Effort.** ~half day. No schema migration; `strength` already exists.
+
+ **Why medium.** The article calls this out as a structural reliability requirement, but ClaudeMemory already has the data — we're just not using it. Cheap win that closes a visible gap in the article's external critique.
+
+ ---
+
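The strength-to-multiplier mapping in the bullets above can be sketched directly; names are illustrative, weights and bounds are taken from the text:

```ruby
# Confidence multiplier sketch for #57: take the max provenance
# strength per fact, map it into a bounded 0.7-1.0 modifier so
# low-confidence facts are nudged down in ranking, never filtered out.
STRENGTH_WEIGHTS = { "stated" => 1.0, "inferred" => 0.6 }.freeze
NO_PROVENANCE_WEIGHT = 0.8 # legacy facts predate the field: don't penalize

def confidence_factor(strengths)
  weight =
    if strengths.empty?
      NO_PROVENANCE_WEIGHT
    else
      strengths.map { |s| STRENGTH_WEIGHTS.fetch(s) }.max
    end
  0.7 + 0.3 * weight
end
```

The final score would then be `rrf_score * confidence_factor(strengths)`, with the factor echoed into `score_trace` for auditability.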
+ ### 58. Reinforcement-and-Decay Ranking Signal
+
+ Source: 2026-04-28 article "Why Karpathy's Second Brain Breaks at Agent Scale" (Zaid) — "Memories need metadata such as freshness, importance, reinforcement" / "Some memory should weaken, expire, or be archived."
+
+ **Gap.** `last_recalled_at` (schema v17, populated by `Sweep::RecallTimestampRefresher`) currently only feeds `Recall::StaleDetector` to *flag* unused facts (stale_detector.rb:57-61). It does not boost frequently-recalled facts in retrieval ranking, nor decay long-untouched ones. Result: a fact recalled 50 times in the last week and a fact recalled once 8 months ago compete on equal footing once their BM25/vector scores match — the inverse of what the article calls "the right memory, not the most memory."
+
+ **Implementation.**
+
+ - **Add `recall_count` column.** Migration vNN adds `facts.recall_count INTEGER DEFAULT 0`. `RecallTimestampRefresher` increments it alongside the `last_recalled_at` update (single UPDATE, no extra query).
+ - **Reinforcement-decay multiplier.** New `Recall::FreshnessScorer.weight(fact)` returns `max(0.5, min(1.5, log1p(recall_count) * exp(-age_days / HALF_LIFE)))` where `HALF_LIFE` defaults to 60 days. Bounded so a single hot fact can't dominate and a cold fact can't disappear.
+ - **Wire into RRF.** Same composition point as #57: `final_score = rrf_score * confidence_factor * freshness_factor`. Both factors land in `score_trace`.
+ - **Configuration.** `CLAUDE_MEMORY_RECALL_HALF_LIFE_DAYS` env var (default 60) for users who want longer/shorter memory.
+ - **Decay is soft, not destructive.** No facts are deleted or archived by this — that stays the user's job via `claude-memory reject`. The article's "decay" framing is correct in spirit (rank weight drops) but we don't auto-prune.
+
+ **Acceptance.**
+
+ - Two facts with identical BM25 scores: the one recalled 10× in the last week ranks above one not recalled in 6 months.
+ - Repeat-correction benchmark (#32) shows improvement: facts that "stuck" rank higher than abandoned ones.
+ - `score_trace.freshness_factor` populated; visible in `memory.recall_semantic --explain`.
+ - Telemetry: `activity_events` gain `freshness_factor` in the details JSON for hook_context events so we can backtest changes to `HALF_LIFE`.
+
+ **Edge cases.**
+
+ - Brand-new facts (recall_count=0, age=0): `log1p(0) = 0` would zero out the weight. Floor at 0.5 — new facts shouldn't be penalized for being new.
+ - Facts never recalled but still valid: clamped to 0.5 floor; ranked behind reinforced peers but not invisible.
+ - Cross-DB mixing: refresher already handles cross-DB project→global per memory fact "OperationTracker.reset_stuck_operations…"; recall_count lives on each fact in its own DB, which is the right shape.
+
+ **Effort.** ~1 day (migration, refresher update, ranking integration, tests).
+
+ **Why medium.** This pairs naturally with #57 — together they answer the article's "without scoring, everything competes equally" critique. Defer behind the 1.0 punchlist (#47-52) but ahead of the post-1.0 nudge/drift items, since these directly affect retrieval quality measured by the existing benchmarks.
+
  ---
 
  ## Low Priority / Defer
@@ -753,4 +938,4 @@ Influence documents:
 
  ---
 
- *Last updated: 2026-04-28 - 1.0 punchlist track opened (`docs/1_0_punchlist.md`). High Priority entries #47-52 (must-have for 1.0): token-budget telemetry, hallucination rate, harm benchmark, CLAUDE.md baseline publication, `claude-memory show`, benchmark scoreboard. Medium Priority entries #53-56 (post-1.0): first-week ROI nudge, real-session repeat-correction detection, token-cost growth tracking, drift dashboard. Previously: 2026-04-27 - #35 (access-based staleness, sweep-derived) landed.*
+ *Last updated: 2026-04-28 (post-0.10.0 release, post-rebase). 1.0 punchlist restructured around milestone versions per `docs/1_0_punchlist.md`. **0.11.0** = #47/#48/#51/#53 + #49 prototype. **0.12.0** = #49 full + #50/#52. **1.0.0** = #54/#55/#56/#59 (the new API stability audit). #59 added 2026-04-28 as a 1.0 release blocker (originally #57; renumbered after rebase brought in Mercury-article entries #57/#58). #53 (first-week ROI nudge) moved up from post-1.0 to 0.11.0. Previously: 2026-04-27 - #35 (access-based staleness, sweep-derived) landed.*
@@ -9,6 +9,41 @@
9
9
 
10
10
  ---
11
11
 
12
+ ## Post-0.11 Investigation: Hallucination Rate Metric Calibration (2026-04-30)
13
+
14
+ When #48 (hallucination-rate metric) was first run against this project's real DB, it surfaced numbers that *looked* alarming:
15
+
16
+ - Quality score: 39/100
17
+ - Bare conclusions: 34 / 59 active facts (57.6%)
18
+ - 7-day rejection rate: 27 of 32 facts (84.4%)
19
+
20
+ The first read was that the LLM extractor was producing noise faster than usable knowledge. Per `improvements.md` #60, four causes were proposed; diagnostics ran 2026-04-30:
21
+
22
+ | Cause | Verdict | Evidence |
23
+ |---|---|---|
24
+ | Prompt drift in `distill-transcripts.md` | **Confirmed dominant** | 34/35 (97%) bare-conclusion facts pre-date the reason-clause prompt commit `f22d12f` (2026-04-20). Only 1 was created post-commit (and that one is a meta-convention added during this session). |
25
+ | Auto-memory mirror regurgitation | Rejected | 0/35 substring matches in `~/.claude/projects/.../memory/*.md`. Auto-memory mirror only landed in 0.10.0 (2026-04-28), after the bare-fact creation window — temporally impossible to be the source. |
26
+ | `ReferenceMaterialDetector` predicate scope too narrow | Not material | Only 3/35 bare facts are `decision`-predicate; 0 of those match the strong reference-material patterns. Expanding `GUARDED_PREDICATES` would not move the needle on the bare-conclusion count. |
27
+ | Junky corpus / rejection cluster | **Confirmed in single class** | All 27 rejected facts in the 7-day window are `uses_database` (18) or `deployment_platform` (9), all with `session_id=nil` (MCP-originated, almost certainly `/study-repo` runs misattributing external-project tech to this project), all from 2026-04-23 to 04-24. Systemic single-class failure, correctly cleaned up after detection — not ongoing extraction noise. |
28
+
29
+ **What this means for #48 as currently shipped:**
30
+
31
+ The metric is *technically correct* but *pragmatically misleading*. It bakes historical noise (pre-prompt-commit bare conclusions) into a signal that users will read as "ongoing extraction quality." A 57.6% bare-conclusion rate looks like the LLM is broken; in reality the live extraction rate (post-2026-04-20) is ~3% (1 bare fact out of the 30+ created since the prompt commit landed).
32
+
33
+ The 84% rejection rate has a similar structural issue: it counts the cleanup of a bursty `/study-repo` regression against the active-facts denominator, rather than reflecting the actual extraction quality of the live window.
34
+
35
+ **Quick fix shipping now (this session):** restrict `quality_score` and the digest's "Quality" section to facts created within the same 30-day window already used by `token_budget`. Surface a separate "historical" line so users can see both numbers, but the headline is the live one. This makes the metric actionable: a high live bare-conclusion rate = live LLM calibration drift; a high historical rate = legacy data, not a current alarm.
36
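The windowing quick fix can be sketched as follows — the fact shape (`:bare` flag, `:created_at` timestamp) is an assumption for illustration, not the shipped `Trust#quality_score` internals:

```ruby
# Split a bare-conclusion rate into a live window (the headline) and a
# historical baseline (supplementary), so legacy noise stays visible
# without masquerading as ongoing extraction quality.
def bare_rate(facts)
  return 0.0 if facts.empty?
  (facts.count { |f| f[:bare] } * 100.0 / facts.size).round(1)
end

def quality_split(facts, window_days: 30, now: Time.now)
  cutoff = now - window_days * 86_400
  live = facts.select { |f| f[:created_at] >= cutoff }
  {live_pct: bare_rate(live), historical_pct: bare_rate(facts)}
end
```

With 34 legacy bare facts and one bare fact in a 30-fact live window, `live_pct` comes out near 3% while `historical_pct` stays above 50% — exactly the split the investigation needed.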
+
37
+ **Deferred to 0.12 / 1.x:**
38
+
39
+ 1. The systemic `/study-repo` misattribution failure mode (cause 4) deserves its own guard. External-project READMEs being studied should land in `reference` predicates, not as `uses_database`/`deployment_platform`. Track this as a follow-up entry.
40
+ 2. A backfill/cleanup pass on the 34 historical bare-conclusion facts: either retroactive rejection, or a one-shot reclassification that moves them to a `legacy_observation` predicate that the prompt's reason-clause requirement doesn't apply to.
41
+ 3. The metric's calibration assumes "bare conclusion = bad", but spot-checking shows several flagged facts are perfectly informative ("MCP tools return dual content + structuredContent via TextSummary module") — they describe mechanics implicitly. The vocabulary may itself be too strict; revisit during 1.0 soak with real usage data.
42
+
43
+ **Process win:** the metric did its job — it surfaced a real signal that would otherwise have stayed invisible, and the investigation distinguished historical noise from live calibration. Without #48 we'd have no way to know.
44
+
45
+ ---
46
+
12
47
  ## Executive Summary
13
48
 
14
49
  Six days, +2,011 LOC. The headline finding: **the watch-list item from 2026-04-22 (#28 — extract per-endpoint helpers from `Dashboard::API`) was not just deferred, it actively regressed.** `dashboard/api.rb` grew from 627 → 807 LOC (+180, +29%), is now the only file in `lib/` over 750 lines, and gained four new methods all exceeding 15 lines. Method-size pressure increased: the worst case grew from `recall` at 39 lines to `timeline` at 52, and the file still has 11 methods over 15 lines (the same count as last review) but with a higher mean length.
@@ -5,9 +5,11 @@ require "optparse"
5
5
  module ClaudeMemory
6
6
  module Commands
7
7
  # Weekly digest — a markdown summary of what memory did over the last N days.
8
- # Rolls up moment counts, new knowledge, utilization, conflicts, and user
9
- # feedback so users can see the value memory is delivering without
10
- # needing to visit the dashboard.
8
+ # Sections (in order): Activity, Context cost, Quality, New knowledge,
9
+ # Utilization, Conflicts, Feedback. The Context cost and Quality
10
+ # sections (added 0.11.0) read from `Dashboard::Trust#token_budget` and
11
+ # `#quality_score` so users see the cost/pollution side-by-side with
12
+ # the value side without needing to visit the dashboard.
11
13
  #
12
14
  # The data it aggregates all already exists (activity_events, facts,
13
15
  # conflicts, moment_feedback); this command only shapes it into a report.
@@ -48,6 +50,10 @@ module ClaudeMemory
48
50
  lines << ""
49
51
  lines << activity_section(manager, cutoff)
50
52
  lines << ""
53
+ lines << context_cost_section(manager)
54
+ lines << ""
55
+ lines << quality_section(manager, cutoff)
56
+ lines << ""
51
57
  lines << knowledge_section(manager, cutoff)
52
58
  lines << ""
53
59
  lines << utilization_section(manager)
@@ -124,6 +130,92 @@ module ClaudeMemory
124
130
  "## New knowledge\n\n_Unavailable: #{e.message}_"
125
131
  end
126
132
 
133
+ # The token cost of every SessionStart context injection, measured over
134
+ # the last 30 days (Trust panel's window — intentionally wider than the
135
+ # digest's coverage window so percentiles stay statistically meaningful
136
+ # on quiet weeks). Reports zero state explicitly so users know whether a
137
+ # missing number means "no injections" vs. "telemetry didn't fire".
138
+ def context_cost_section(manager)
139
+ tb = Dashboard::Trust.new(manager).token_budget
140
+ out = ["## Context cost", ""]
141
+ if tb[:sample_size].zero?
142
+ out << "_No context injections in the last #{tb[:window_days]} days._"
143
+ else
144
+ out << "**Per-session injected tokens (last #{tb[:window_days]}d, n=#{tb[:sample_size]}):**"
145
+ out << "- p50: #{tb[:p50]} tokens"
146
+ out << "- p95: #{tb[:p95]} tokens"
147
+ out << "- avg: #{tb[:avg]} tokens"
148
+ end
149
+ out.join("\n")
150
+ rescue Sequel::DatabaseError => e
151
+ "## Context cost\n\n_Unavailable: #{e.message}_"
152
+ end
153
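`Trust#token_budget`'s internals aren't part of this hunk; a minimal nearest-rank percentile over the recorded token samples might look roughly like this (an assumed shape, not the shipped code):

```ruby
# Nearest-rank percentile over sorted samples; returns nil for the
# empty case so callers can report the zero state explicitly rather
# than showing a misleading 0-token percentile.
def percentile(sorted, pct)
  return nil if sorted.empty?
  sorted[((pct / 100.0) * (sorted.size - 1)).round]
end

samples = [180, 220, 250, 300, 900].sort
{p50: percentile(samples, 50),          # => 250
 p95: percentile(samples, 95),          # => 900
 avg: samples.sum / samples.size}       # => 370
```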
+
154
+ # Hallucination-rate proxy. Reports two numbers per the
155
+ # `quality_review.md` 2026-04-30 investigation:
156
+ #
157
+ # - Live (last `window_days`, headline) — actionable signal of
158
+ # ongoing extraction quality.
159
+ # - Historical (all active facts, supplementary) — visible so
160
+ # legacy noise isn't hidden, but the headline is the live one.
161
+ #
162
+ # The split exists because the unwindowed metric mixed pre-prompt-
163
+ # commit bare conclusions with live data; users read the combined
164
+ # number as "ongoing quality" and that's misleading.
165
+ def quality_section(manager, cutoff)
166
+ out = ["## Quality", ""]
167
+ qs = Dashboard::Trust.new(manager).quality_score
168
+
169
+ if qs[:total_active].zero?
170
+ if qs[:historical][:total_active].zero?
171
+ out << "_No active facts to score yet._"
172
+ else
173
+ out << "_No facts extracted in the last #{qs[:window_days]} days._"
174
+ out << "- Historical (all active): score #{qs[:historical][:score]}/100, " \
175
+ "#{qs[:historical][:total_active]} facts, " \
176
+ "#{qs[:historical][:bare_conclusion_count]} bare, " \
177
+ "#{qs[:historical][:suspect_count]} suspect"
178
+ end
179
+ else
180
+ out << "**Live score (last #{qs[:window_days]}d):** #{qs[:score]}/100 _(higher is cleaner)_"
181
+ out << "- Suspect (reference material): #{qs[:suspect_count]} (#{qs[:suspect_pct]}%)"
182
+ out << "- Bare conclusions (decision/convention without reason): #{qs[:bare_conclusion_count]} (#{qs[:bare_pct]}%)"
183
+ if qs[:historical][:total_active] > qs[:total_active]
184
+ out << ""
185
+ out << "_Historical (all active): score #{qs[:historical][:score]}/100, " \
186
+ "#{qs[:historical][:total_active]} facts, " \
187
+ "#{qs[:historical][:bare_conclusion_count]} bare, " \
188
+ "#{qs[:historical][:suspect_count]} suspect_"
189
+ end
190
+ end
191
+
192
+ rate = rejection_rate_in_window(manager, cutoff)
193
+ out << ""
194
+ out << "**Rejection rate (in window):** #{rate[:rejected]} of #{rate[:created]} extracted facts rejected (#{rate[:pct]}%)"
195
+
196
+ out.join("\n")
197
+ rescue Sequel::DatabaseError => e
198
+ "## Quality\n\n_Unavailable: #{e.message}_"
199
+ end
200
+
201
+ # How many facts created in the digest window have since been
202
+ # rejected? Counts across both stores.
203
+ def rejection_rate_in_window(manager, cutoff)
204
+ created = 0
205
+ rejected = 0
206
+
207
+ %w[project global].each do |scope|
208
+ store = manager.store_if_exists(scope)
209
+ next unless store
210
+ dataset = store.facts.where { created_at >= cutoff }
211
+ created += dataset.count
212
+ rejected += dataset.where(status: "rejected").count
213
+ end
214
+
215
+ pct = created.zero? ? 0.0 : (rejected * 100.0 / created).round(1)
216
+ {created: created, rejected: rejected, pct: pct}
217
+ end
218
+
127
219
  def utilization_section(manager)
128
220
  util = Dashboard::Trust.new(manager).utilization
129
221
  pct = util[:ratio_pct]
@@ -19,9 +19,9 @@ module ClaudeMemory
19
19
  return Hook::ExitCodes::ERROR
20
20
  end
21
21
 
22
- unless %w[ingest sweep publish context].include?(subcommand)
22
+ unless %w[ingest sweep publish context nudge].include?(subcommand)
23
23
  stderr.puts "Unknown hook command: #{subcommand}"
24
- stderr.puts "Available: ingest, sweep, publish, context"
24
+ stderr.puts "Available: ingest, sweep, publish, context, nudge"
25
25
  return Hook::ExitCodes::ERROR
26
26
  end
27
27
 
@@ -63,6 +63,8 @@ module ClaudeMemory
63
63
  hook_publish(handler, payload)
64
64
  when "context"
65
65
  hook_context(payload, opts[:db])
66
+ when "nudge"
67
+ hook_nudge(payload, opts[:db])
66
68
  end
67
69
 
68
70
  store.close
@@ -169,6 +171,28 @@ module ClaudeMemory
169
171
  Hook::ExitCodes::SUCCESS
170
172
  end
171
173
 
174
+ def hook_nudge(payload, db_path)
175
+ # Nudge needs to count past nudge events across both stores,
176
+ # so prefer the manager-aware path. db_path overrides only
177
+ # the project store (useful for tests).
178
+ project_path = payload["project_path"] || payload["cwd"]
179
+ manager = ClaudeMemory::Store::StoreManager.new(
180
+ project_db_path: db_path, project_path: project_path
181
+ )
182
+ manager.ensure_both!
183
+ store = manager.project_store || manager.global_store
184
+
185
+ handler = ClaudeMemory::Hook::Handler.new(store, manager: manager)
186
+ result = handler.nudge(payload)
187
+
188
+ stdout.puts result[:message] if result[:status] == :emitted
189
+
190
+ manager.close
191
+ Hook::ExitCodes::SUCCESS
192
+ rescue => e
193
+ classify_error(e)
194
+ end
195
+
172
196
  def hook_context(payload, db_path)
173
197
  project_path = payload["project_path"] || payload["cwd"]
174
198
  source = payload["source"]
@@ -213,6 +237,7 @@ module ClaudeMemory
213
237
  details = {
214
238
  source: source,
215
239
  context_length: context_text&.length,
240
+ context_tokens: ClaudeMemory::Core::TokenEstimator.estimate(context_text),
216
241
  preview: context_text&.byteslice(0, CONTEXT_PREVIEW_BYTES),
217
242
  truncated: context_text ? context_text.bytesize > CONTEXT_PREVIEW_BYTES : false,
218
243
  top_fact_ids: injector.emitted_fact_ids.first(10),
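`TokenEstimator.estimate` itself isn't shown in this diff; a common chars/4 heuristic (an assumption about its approach, not the actual implementation) would be:

```ruby
# Rough token estimate: ~4 characters per token for English prose.
# Guards nil so a missing context payload records 0 rather than raising.
module TokenEstimatorSketch
  CHARS_PER_TOKEN = 4.0

  def self.estimate(text)
    return 0 if text.nil? || text.empty?
    (text.length / CHARS_PER_TOKEN).ceil
  end
end

TokenEstimatorSketch.estimate("a" * 400)  # => 100
TokenEstimatorSketch.estimate(nil)        # => 0
```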
@@ -19,8 +19,9 @@ module ClaudeMemory
19
19
  db_path = ClaudeMemory.project_db_path
20
20
  ingest_cmd = "claude-memory hook ingest --db #{db_path}"
21
21
  sweep_cmd = "claude-memory hook sweep --db #{db_path}"
22
+ nudge_cmd = "claude-memory hook nudge --db #{db_path}"
22
23
 
23
- hooks_config = build_hooks_config(ingest_cmd, sweep_cmd)
24
+ hooks_config = build_hooks_config(ingest_cmd, sweep_cmd, nudge_cmd)
24
25
 
25
26
  existing = load_json_file(settings_path)
26
27
  existing["hooks"] ||= {}
@@ -37,8 +38,9 @@ module ClaudeMemory
37
38
  db_path = ClaudeMemory.global_db_path
38
39
  ingest_cmd = "claude-memory hook ingest --db #{db_path}"
39
40
  sweep_cmd = "claude-memory hook sweep --db #{db_path}"
41
+ nudge_cmd = "claude-memory hook nudge --db #{db_path}"
40
42
 
41
- hooks_config = build_hooks_config(ingest_cmd, sweep_cmd)
43
+ hooks_config = build_hooks_config(ingest_cmd, sweep_cmd, nudge_cmd)
42
44
 
43
45
  existing = load_json_file(settings_path)
44
46
  existing["hooks"] ||= {}
@@ -96,7 +98,7 @@ module ClaudeMemory
96
98
 
97
99
  private
98
100
 
99
- def build_hooks_config(ingest_cmd, sweep_cmd)
101
+ def build_hooks_config(ingest_cmd, sweep_cmd, nudge_cmd = "claude-memory hook nudge")
100
102
  context_cmd = "claude-memory hook context"
101
103
 
102
104
  {
@@ -132,7 +134,8 @@ module ClaudeMemory
132
134
  {"type" => "command", "command" => ingest_cmd, "timeout" => 30,
133
135
  "statusMessage" => "Saving memory..."},
134
136
  {"type" => "command", "command" => sweep_cmd, "timeout" => 30,
135
- "statusMessage" => "Sweeping memory..."}
137
+ "statusMessage" => "Sweeping memory..."},
138
+ {"type" => "command", "command" => nudge_cmd, "timeout" => 5}
136
139
  ]
137
140
  }],
138
141
  "TaskCompleted" => [{
@@ -42,7 +42,8 @@ module ClaudeMemory
42
42
  "reclassify-references" => {class: ReclassifyReferencesCommand, description: "Retag existing convention facts that match reference-material heuristics"},
43
43
  "census" => {class: CensusCommand, description: "Aggregate predicate usage across project databases"},
44
44
  "dashboard" => {class: DashboardCommand, description: "Open debugging dashboard"},
45
- "digest" => {class: DigestCommand, description: "Render a weekly markdown digest of memory activity"}
45
+ "digest" => {class: DigestCommand, description: "Render a weekly markdown digest of memory activity"},
46
+ "show" => {class: ShowCommand, description: "Print what memory would inject at the next SessionStart"}
46
47
  }.freeze
47
48
 
48
49
  # Find a command class by name