claude_memory 0.10.0 → 0.11.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: a299c6ab2aeb95123dcb61f5c87a06b93d15a00a2ed9ff2c8343e7fde6b369cb
- data.tar.gz: d09c02a2f5dcd4bd0dfcb625793505bd2218c7df04230411e813a7543e7e7382
+ metadata.gz: c2164011e2c50c7fdb0bcad468a25814f372384c3a49fa4c9414313ab3975e00
+ data.tar.gz: 3e2843979d9b9e0d4a21bfa3650f6cd6843ce18d2a95af884e303572259bca62
  SHA512:
- metadata.gz: 87fd7dab40cb2e5b190de071f99bcc1394e98e5f426951eedaff09b190fa66591b40f49580bca45f75819170ba939a0d3d9239f4825825b431fd4a83d388bb7d
- data.tar.gz: ffb4ab50ba94a8f3c7bfb8129f01ea96fd27b981dd614f0addd7d65a9fc2b4b8562b9d23148bb5ea4ee90b5ae5a9fc183d1c82e68d3b009557967a00b96bfec1
+ metadata.gz: 6c074b607c1e4f13743de36bb2074495d0ad24d0c826b62b49e0e827f311e3424bc881f42236db22a92dd2d5281e6ef13ca450966b1d9438ac1b36ceaa3ab2ce
+ data.tar.gz: 4e06c8fed9c323974ee4d7e5b41386ee4682ba5ba88b67797ac6864bbdf03663e5b74cb33497a371a8263eff4c24d405708a80e4862c4f578b221286bf40b236
Binary file
@@ -7,7 +7,7 @@
  "plugins": [
  {
  "name": "claude-memory",
- "version": "0.10.0",
+ "version": "0.11.0",
  "source": "./",
  "description": "Long-term memory for Claude Code. Recalls architecture, conventions, and decisions across sessions — so Claude explains your codebase without file traversal, follows your patterns, and never re-asks what it already learned.",
  "repository": "https://github.com/codenamev/claude_memory"
@@ -1,6 +1,6 @@
  {
  "name": "claude-memory",
- "version": "0.10.0",
+ "version": "0.11.0",
  "description": "Long-term memory for Claude Code. Recalls architecture, conventions, and decisions across sessions — so Claude explains your codebase without file traversal, follows your patterns, and never re-asks what it already learned.",
  "author": {
  "name": "Valentino Stoll",
data/CHANGELOG.md CHANGED
@@ -4,6 +4,50 @@ All notable changes to this project will be documented in this file.
 
  ## [Unreleased]
 
+ ## [0.11.0] - 2026-04-30
+
+ Theme: **Trust & Cost** — five user-visible signals that answer "is memory still worth it?" with numbers a skeptical user can read in <30 seconds.
+
+ ### Added
+
+ - **Token budget telemetry** — every successful SessionStart context injection now records an estimated `context_tokens` count on its `activity_events` row. Surfaced three ways:
+   - Dashboard Trust panel emits a `token_budget` block with p50/p95/avg/sample_size over the last 30 days, so the JSON dashboard endpoint and any downstream consumer can answer "what does memory cost per session?"
+   - `claude-memory digest` includes a "Context cost" subsection between activity and new-knowledge so the weekly report shows the price tag next to the value.
+   - `claude-memory stats --tokens [--since DAYS]` reports total sessions, p50/p95/avg/min/max, and a histogram across <500 / 500-1k / 1-2k / 2-5k / 5k+ buckets.
+   - Purely additive — no schema migration. Historical events written before this release simply contribute zero samples until new injections accumulate.
+   - First 0.11.0 milestone item from the 1.0 punchlist (Trust & Cost). Closes the "what % of my SessionStart token budget does memory consume?" gap.
+ - **Hallucination rate metric** — the dashboard now quantifies how clean the fact base is, not just how full it is. `Distill::BareConclusionDetector` is the production-side mirror of the SessionStart prompt's reason-clause requirement (decision/convention facts must embed "because…" / "so that…" / "to avoid…"). Surfaced two ways:
+   - Dashboard Trust panel emits a `quality_score` block aggregating across project + global active facts: `suspect_count` (predicate=reference, retagged by ReferenceMaterialDetector), `bare_conclusion_count`, percentages, and an overall 0–100 score (higher = cleaner). Returns 100 on empty stores so fresh installs aren't penalized.
+   - `claude-memory digest` includes a "Quality" section showing the score breakdown plus the in-window rejection rate ("of facts created in the last 7 days, X% have been rejected since"), so calibration drift is visible.
+   - Second 0.11.0 milestone item. Pairs with token-budget telemetry to answer "is memory still worth its cost?" via two skeptic-friendly numbers.
+ - **`claude-memory show`** — new CLI command prints what memory would inject at the next SessionStart in plain Markdown. Runs the exact `Hook::ContextInjector` path real sessions use, so output matches what Claude actually receives. Footer reports fact count, ~token estimate, and char count so users see the SessionStart cost at a glance.
+   - Default suppresses the raw-transcript "Pending Knowledge Extraction" dump (intended for LLM distillation, not human reading); pass `--pending` to include it.
+   - `--source SOURCE` (startup/resume/clear) simulates each fresh-session entrypoint so users can preview which sections would appear.
+   - Third 0.11.0 milestone item. Closes the inspectability gap — trust requires being able to see what memory will inject, the same way `cat CLAUDE.md` works.
+ - **First-week ROI nudge** — at SessionEnd, memory now prints `memory contributed N facts this session, %used = X` for the first 10 sessions, then quiets. New users get user-visible proof that memory is doing work for them without having to know about the dashboard. Once trust is established (or it isn't), the nudge gets out of the way.
+   - New `claude-memory hook nudge` subcommand + `Hook::Handler#nudge`. SessionEnd config now wires `[ingest, sweep, nudge]` in order.
+   - Silent on `CLAUDE_MEMORY_NO_NUDGE=1` opt-out, missing session_id, n=0 contributions, and after MAX_NUDGES emissions. The empty-session silent path doesn't burn a slot — quiet sessions don't count toward the 10.
+   - Activity event `roi_nudge` records `{n, used, pct, prior_count}` per emission so a future migration could change the threshold without re-counting from raw events.
+   - Fourth 0.11.0 milestone item. Cold-start trust signal that pairs with #47 (token cost) and #48 (quality) to make the first-week answer to "is this worth it?" visible without effort.
+ - **Harm benchmark prototype** — `spec/benchmarks/dataset/harm_scenarios.yml` + `spec/benchmarks/e2e/harm_bench_spec.rb`. Three hand-written cases spanning the riskiest harm classes (stale_tech, mismatched_scope, superseded_undetected). The first ClaudeMemory benchmark that measures whether memory can make Claude *wrong* — every other benchmark only measures whether memory helps.
+   - Structure validation (regex compile, fact loadability, harm-class coverage) runs in stub mode as part of `:benchmark` tag.
+   - Real-mode runner: `EVAL_MODE=real bundle exec rspec spec/benchmarks/e2e/harm_bench_spec.rb` — needs `claude` CLI on PATH, ~$2-8 per run. Reports harm rate; doesn't enforce a threshold yet (that's the 0.12 release gate).
+   - 0.11.0 risk-de-risking item. If even one of these three surfaces a harm now, the full 10-15-case benchmark planned for 0.12 will likely reveal a fundamental issue — better to learn that at 0.11 than at 0.12. **Real-mode prototype run on 2026-04-30 reported 0/3 harm** — green light to expand to the full corpus in 0.12.
+
+ ### Changed
+
+ - **Hallucination-rate metric calibration** — `Dashboard::Trust#quality_score` now reports a windowed (last 30d) "live" score as the headline plus a "historical" block over all active facts. Production verification on 2026-04-30 (recorded in `docs/quality_review.md`) showed the unwindowed metric was technically correct but pragmatically misleading: 97% of bare-conclusion facts pre-dated the 2026-04-20 reason-clause prompt commit, and the entire 7-day rejection cluster was a single-class systemic failure (a `/study-repo` burst), not ongoing noise. The split makes the metric actionable: live score = ongoing extraction quality, historical = legacy data. The digest's "Quality" section uses the live score as the headline.
+
+ ### Fixed
+
+ - Real-eval CLI runner now passes `allowed_tools` through explicitly so the harm benchmark and other real-mode benches can pre-allow MCP memory tools without per-test wiring.
+
+ ### Upgrade Notes
+
+ - No schema migration. All new features ship purely additive.
+ - Hooks run the installed gem from PATH, not the working tree. After upgrading, `bundle exec rake install` (or `gem install claude_memory`) is required for the new SessionEnd nudge, `claude-memory show` command, `--tokens` stats flag, and `context_tokens` activity-event field to actually fire on real hook events.
+ - Existing `quality_score` consumers will see additional fields (`window_days`, `historical`) in the snapshot. The original keys (`score`, `total_active`, `suspect_count`, `bare_conclusion_count`, `suspect_pct`, `bare_pct`) remain at the top level and now reflect the 30-day live window — historical numbers move to the `historical` sub-hash.
+
  ## [0.10.0] - 2026-04-28
 
  ### Added
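
The changelog describes `Distill::BareConclusionDetector` only in prose. A minimal Ruby sketch of that reason-clause check (the class name and signal phrases come from the release notes; the method name, signature, and exact signal list are assumptions):

```ruby
# Hypothetical sketch of the reason-clause check described in the 0.11.0
# release notes. Class name from the changelog; everything else assumed.
module Distill
  class BareConclusionDetector
    # Predicates that must carry a reason clause per the SessionStart prompt.
    REASONED_PREDICATES = %w[decision convention].freeze

    # Phrases treated as evidence the fact explains *why*, not just *what*.
    REASON_SIGNALS = ["because", "so that", "to avoid"].freeze

    # Pure function: true when a decision/convention fact's object is a bare
    # conclusion, i.e. contains none of the reason-clause signals.
    def self.bare_conclusion?(predicate:, object:)
      return false unless REASONED_PREDICATES.include?(predicate)

      text = object.to_s.downcase
      REASON_SIGNALS.none? { |signal| text.include?(signal) }
    end
  end
end

puts Distill::BareConclusionDetector.bare_conclusion?(
  predicate: "decision", object: "Always use Minitest"
)  # => true  (flagged: no reason clause)
puts Distill::BareConclusionDetector.bare_conclusion?(
  predicate: "convention", object: "Use SQLite because it is zero-config"
)  # => false (clean: reason clause present)
```

The real detector likely matches more signal phrases than the three shown; the sketch only illustrates the pure-function shape the changelog claims.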
data/CLAUDE.md CHANGED
@@ -163,7 +163,7 @@ New MCP tools `memory.undistilled` and `memory.mark_distilled` support the pipel
  - Each command is a separate class (HelpCommand, DoctorCommand, etc.)
  - All commands inherit from BaseCommand
  - Dependency injection for I/O (stdout, stderr, stdin)
- - 32 commands total, each focused on single responsibility
+ - 34 commands total, each focused on single responsibility
 
  - **`Configuration`**: Centralized ENV access (`configuration.rb`)
  - Single source of truth for paths and environment variables
@@ -209,6 +209,7 @@ New MCP tools `memory.undistilled` and `memory.mark_distilled` support the pipel
  - Pluggable distiller design (current: NullDistiller stub)
  - Extracts entities, facts, scope hints from content
  - `ReferenceMaterialDetector`: classifies "X is a plugin/library/tool" templates, LOC counts, "by Firstname Lastname" attributions as reference material. Runs in `ManagementHandlers#store_extraction` so mislabeling can't persist
+ - `BareConclusionDetector` (0.11.0+): production-side mirror of the SessionStart prompt's reason-clause requirement. Pure function — flags `decision` / `convention` facts whose object lacks a reason-clause signal ("because", "so that", "to avoid", etc.). Powers the `quality_score` metric on the Trust panel and the digest's Quality section.
  - SessionStart distillation prompt enforces reason clauses ("because…", "so that…") for `decision` and `convention` predicates — bare conclusions are explicitly disallowed
 
  - **`Resolve`**: Truth maintenance and conflict resolution (`resolve/`)
@@ -249,7 +250,7 @@ Key tables (defined in `sqlite_store.rb`):
  - `fact_links`: Supersession and conflict relationships
  - `conflicts`: Open contradictions
  - `mcp_tool_calls`: MCP server tool invocation telemetry (schema v13)
- - `activity_events`: Hook/recall/context/sweep telemetry (schema v15) — powers the dashboard timeline, moments feed, efficacy reports
+ - `activity_events`: Hook/recall/context/sweep/nudge telemetry (schema v15) — powers the dashboard timeline, moments feed, efficacy reports. Event types: `hook_ingest`, `hook_context` (carries `context_tokens` since 0.11.0), `hook_sweep`, `hook_publish`, `recall`, `store_extraction`, `roi_nudge` (since 0.11.0).
  - `moment_feedback`: Per-moment 👍/👎 verdicts with optional notes (schema v16) — unique on event_id, repeat clicks upsert
 
  Facts include:
@@ -331,7 +332,7 @@ Also update `SECTION_MAP` if the predicate should appear in a specific snapshot
 
  - `lib/claude_memory.rb`: Main module, requires, database path helpers
  - `lib/claude_memory/cli.rb`: Thin command router (41 lines)
- - `lib/claude_memory/commands/`: Individual command classes (28 commands)
+ - `lib/claude_memory/commands/`: Individual command classes (34 commands)
  - `lib/claude_memory/configuration.rb`: Centralized configuration and ENV access
  - `lib/claude_memory/domain/`: Domain models (Fact, Entity, Provenance, Conflict)
  - `lib/claude_memory/core/`: Value objects and null objects
@@ -373,6 +374,13 @@ ClaudeMemory integrates with Claude Code via hooks in `.claude/settings.json`:
  - Runs time-bounded maintenance on both databases
  - Cleans up vec0 entries for superseded/expired facts
 
+ - **Nudge hook** (0.11.0+): Triggers on SessionEnd, fires after ingest+sweep
+   - Calls `claude-memory hook nudge`
+   - For the first 10 sessions only, prints "memory contributed N facts this session, %used = X" to stdout so new users see ROI inline before they discover the dashboard
+   - Records `roi_nudge` activity_events; quiets after `MAX_NUDGES` emissions
+   - Opt out with `CLAUDE_MEMORY_NO_NUDGE=1` (no event recorded on opt-out)
+   - Empty sessions (n=0) silently no-op so quiet sessions don't burn nudge slots
+
  Hook commands read JSON payloads from stdin for robustness. Supports `--async` flag for non-blocking execution.
 
  ## Dashboard
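
The nudge hook's silent paths listed in the CLAUDE.md hunk compose into a small gate. A hypothetical sketch (`MAX_NUDGES` and `CLAUDE_MEMORY_NO_NUDGE` come from the diff; the class, method, and argument names are invented for illustration):

```ruby
# Hypothetical sketch of the SessionEnd nudge gating described above.
# Only the constant, env var, and message format come from the diff.
class NudgeGate
  MAX_NUDGES = 10

  def initialize(env: ENV, prior_count: 0)
    @env = env
    @prior_count = prior_count # roi_nudge events already recorded
  end

  # Returns the line to print, or nil when the nudge must stay silent.
  # Silent paths: opt-out env var, missing session_id, zero contributions,
  # first-week-complete. Note the n=0 path returns nil *before* the
  # MAX_NUDGES check records anything — quiet sessions don't burn a slot.
  def message(session_id:, facts_contributed:, pct_used:)
    return nil if @env["CLAUDE_MEMORY_NO_NUDGE"] == "1"
    return nil if session_id.nil? || session_id.empty?
    return nil if facts_contributed.zero?
    return nil if @prior_count >= MAX_NUDGES

    "memory contributed #{facts_contributed} facts this session, " \
      "%used = #{pct_used}"
  end
end

gate = NudgeGate.new(env: {}, prior_count: 3)
puts gate.message(session_id: "abc", facts_contributed: 11, pct_used: 0)
# => memory contributed 11 facts this session, %used = 0
```

The real implementation lives in `Hook::Handler#nudge` and also records the `roi_nudge` activity event; this sketch shows only the gating order.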
data/README.md CHANGED
@@ -140,7 +140,41 @@ File-searchable questions ("what version is this?") and one-shot code generation
  - **Claude-Powered**: Uses Claude's intelligence to extract facts (no API key needed)
  - **Token Efficient**: 10x reduction in memory queries with progressive disclosure
  - **Database Maintenance**: Compact, export, and backup commands
- - **Built-in Observability** (0.10.0+): `claude-memory dashboard` opens a local web UI with a moments feed, trust panel, conflicts dedup, knowledge index, 👍/👎 feedback, and a 30-day utilization ratio. See **[Dashboard guide →](docs/dashboard.md)**. `claude-memory digest` writes a weekly markdown report; `claude-memory census` audits the predicate vocabulary across projects.
+ - **Built-in Observability** (0.10.0+): `claude-memory dashboard` opens a local web UI with a moments feed, trust panel (token budget, quality score, utilization, feedback), conflicts dedup, knowledge index, and 👍/👎 feedback. See **[Dashboard guide →](docs/dashboard.md)**. `claude-memory digest` writes a weekly markdown report (Activity, Context cost, Quality, New knowledge, Utilization, Conflicts, Feedback); `claude-memory show` prints what would be injected next SessionStart; `claude-memory census` audits the predicate vocabulary across projects.
+
+ ## What's New in 0.11.0
+
+ Five user-visible signals so you can answer "is memory still worth it?" with
+ numbers, not vibes:
+
+ - **Token budget telemetry** — every SessionStart context injection now
+   records its estimated `context_tokens`. `claude-memory stats --tokens
+   [--since DAYS]` reports p50/p95/avg/min/max plus a histogram across
+   <500 / 500-1k / 1-2k / 2-5k / 5k+ buckets so you can see the per-session
+   cost at a glance. The dashboard's Trust panel and `claude-memory digest`
+   surface the same numbers.
+ - **Hallucination-rate metric** — the dashboard now scores how *clean* the
+   fact base is, not just how full it is. `Distill::BareConclusionDetector`
+   flags `decision` / `convention` facts that skipped the reason-clause
+   requirement. Trust panel shows `quality_score` (live 30-day window with
+   historical baseline beneath). `claude-memory digest` adds a Quality
+   section with rejection rate.
+ - **`claude-memory show`** — new command prints what memory *would* inject
+   at the next SessionStart in plain Markdown. Footer reports fact count,
+   ~token estimate, and char count so you see the cost at a glance. Default
+   hides the raw-transcript "Pending Knowledge" dump for readability;
+   `--pending` opts in. `--source startup|resume|clear` simulates each
+   fresh-session entrypoint.
+ - **First-week ROI nudge** — at SessionEnd, memory now prints
+   `memory contributed N facts this session, %used = X` for the first 10
+   sessions, then quiets. Cold-start trust signal — you don't have to know
+   about the dashboard. Opt out with `CLAUDE_MEMORY_NO_NUDGE=1`.
+ - **Harm benchmark prototype** — first ClaudeMemory benchmark that
+   measures whether memory can make Claude *wrong*. Three hand-written
+   cases (stale-tech, mismatched-scope, superseded-but-undetected) under
+   `spec/benchmarks/e2e/harm_bench_spec.rb`. Real-mode run on the 0.11
+   release reported 0/3 harm; the full 10-15-case corpus + release gate
+   lands in 0.12.
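
The histogram buckets and percentiles quoted in the token-budget bullet can be sketched as follows (the bucket boundaries come from the notes; the module and method names are assumptions, not the gem's API):

```ruby
# Hypothetical sketch of the summary behind `claude-memory stats --tokens`.
# Only the five bucket boundaries come from the release notes.
module TokenStats
  BUCKETS = {
    "<500"   => 0...500,
    "500-1k" => 500...1_000,
    "1-2k"   => 1_000...2_000,
    "2-5k"   => 2_000...5_000,
    "5k+"    => 5_000..Float::INFINITY,
  }.freeze

  # Histogram across the five buckets the CLI reports.
  def self.histogram(samples)
    BUCKETS.transform_values { |range| samples.count { |s| range.cover?(s) } }
  end

  # Nearest-rank percentile over recorded context_tokens samples.
  def self.percentile(samples, pct)
    sorted = samples.sort
    rank = ((pct / 100.0) * (sorted.size - 1)).round
    sorted[rank]
  end
end

samples = [320, 480, 900, 1_400, 1_900, 2_600, 4_800, 7_000]
p TokenStats.histogram(samples)
# => {"<500"=>2, "500-1k"=>1, "1-2k"=>2, "2-5k"=>2, "5k+"=>1}
```

Nearest-rank is one plausible percentile method; the gem may compute p50/p95 differently (e.g. with interpolation).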
 
  ## What's New in 0.10.0
 
@@ -1,10 +1,11 @@
  # 1.0 Punchlist
 
- *Created: 2026-04-28*
+ *Created: 2026-04-28. Restructured 2026-04-28 (post-0.10.0 release) around
+ milestone versions per the path-to-1.0 plan.*
 
  The remaining work for a stable 1.0 release. Distinct from `improvements.md` —
  that file tracks the long tail of inbound study/idea entries; this file tracks
- **what blocks 1.0 confidence**.
+ **what blocks 1.0 confidence and which release each item ships in**.
 
  Guiding question: *a skeptical Ruby developer should be able to look at one
  screen and say "yes, this is helping, here's the evidence" without trusting our
@@ -12,15 +13,37 @@ marketing.* Today the dashboard tells that story in pieces but not as a
  headline. Each item below closes a specific gap that prevents that headline
  from existing.
 
+ ## What 1.0 commits to
+
+ Not "feature complete" — semver commitment. Once we ship 1.0:
+
+ - Public APIs (CLI surface, MCP tool schemas, hook payload shapes) lock to semver
+ - Schema migrations stay forward-compatible per the round-trip-spec convention
+ - The trust signals we ship have a baseline measurement other releases must beat
+
+ So 1.0 isn't gated by features. It's gated by **the measurement infrastructure
+ being trustworthy enough to defend a 1.0 claim.** That's why this punchlist is
+ mostly observability, not capability.
+
  Items are cross-linked to the canonical entry in `improvements.md` where the
  implementation detail and acceptance criteria live. This file is the
  prioritization view; that file is the work view.
 
  ---
 
- ## Must-have for 1.0
+ ## 0.10.x patch as needed (now)
+
+ Reactive only. Real usage will surface issues; cut a patch when one shows up.
+ No proactive minor work here.
+
+ ---
+
+ ## 0.11.0 — "Trust & Cost" (~1 week of work)
 
- ### 1. Token budget telemetry — *what does memory cost?*
+ Theme: *users can see what memory costs and whether it's helping.* Each item
+ adds a number a skeptical user can read.
+
+ ### #1 Token budget telemetry — *what does memory cost?* ✅ landed 2026-04-29
 
  **Gap.** `Core::TokenEstimator` exists and is unused outside one helper. We
  have no idea what % of the SessionStart token budget memory consumes per
@@ -30,13 +53,18 @@ session, how it scales with DB size, or whether it's growing.
  tokens per session over the last 30 days. Per-session count rides on every
  `hook_context` activity event so the data is queryable post-hoc.
 
- **Why must-have.** "Costs you tokens forever" is the strongest critique of any
- context-injection memory system; if we can't answer it numerically, we can't
- defend the trade.
+ **Why this release.** Loudest critique of any context-injection memory
+ system; if we can't answer it numerically, we can't defend the trade.
+
+ **Status.** Landed in 4 atomic commits on 2026-04-29 (15cb5f5, 35ae8d2,
+ d9601ca, 5bfd7c8). `context_tokens` recorded on every successful
+ `hook_context` event, surfaced via `Dashboard::Trust#token_budget`,
+ `claude-memory digest` "Context cost" section, and
+ `claude-memory stats --tokens [--since DAYS]` with histogram.
 
- → improvements.md entry: *Token Budget Telemetry*
+ → improvements.md entry: *#47 Token Budget Telemetry*. Effort: 4-6h.
 
- ### 2. Hallucination rate as a first-class trust metric
+ ### #2 Hallucination rate as a first-class trust metric ✅ landed 2026-04-29
 
  **Gap.** `ReferenceMaterialDetector` already classifies suspect facts and we
  know from the #34 audit that ~25% of facts had embedded reasoning (i.e.
@@ -48,48 +76,16 @@ suspect-fact ratio + bare-conclusion ratio over active facts in both stores.
  Digest includes a 30-day rejection rate ("how much of what we extracted got
  rejected within a week?") so calibration drift is visible.
 
- **Why must-have.** We can't claim "memory is helping" if we can't show "memory
- isn't poisoning the well."
+ **Why this release.** Pollution rate matters as much as recall rate. Pairs
+ with #1 — together they answer the "is this still worth it?" question.
 
- improvements.md entry: *Hallucination Rate Metric*
+ **Status.** Landed in 3 atomic commits on 2026-04-29 (27fa6af, 4d1c5bf,
+ 0b72fa4). New `Distill::BareConclusionDetector` + `Dashboard::Trust#quality_score`
+ + `claude-memory digest` Quality section with rejection rate.
 
- ### 3. Negative-fact harm benchmark
-
- **Gap.** Every benchmark we run today measures whether memory **helps**.
- Nothing measures whether memory **harms** — i.e. injects a wrong fact and
- Claude follows it. Without this, "memory helps" is unfalsifiable.
-
- **Acceptance.** New `spec/benchmarks/dataset/harm_scenarios.yml` with 10–15
- cases where memory holds a stale or wrong fact. Each case scores `harm` if
- Claude's response follows the wrong fact, `safe` otherwise. Wired into
- `bin/run-evals`. >1% harm rate blocks release.
-
- **Why must-have.** A retrieval system that occasionally makes Claude *wrong*
- is strictly worse than no memory; we need a release gate that proves we're
- not in that regime.
-
- → improvements.md entry: *Negative-Fact Harm Benchmark*
-
- ### 4. Publish the CLAUDE.md baseline in headline E2E results
-
- **Gap.** `claude_md_adapter` exists in `spec/benchmarks/comparative/adapters/`
- and supports E2E. The adapter is wired into `comparative_helper.rb` but the
- README's headline comparative table doesn't include it. The single most
- important question for adoption — *"is this better than a hand-written
- CLAUDE.md?"* — is currently unanswered in our published numbers.
-
- **Acceptance.** Comparative E2E report includes `CLAUDE.md baseline` row in
- `spec/benchmarks/README.md` and in `bin/run-evals --comparative` summary
- output. README explicitly states the win/loss versus the static baseline.
+ improvements.md entry: *#48 Hallucination Rate Metric*. Effort: 1d.
 
- **Why must-have.** Cheapest item on the list — adapter already built, just
- surface the number. If we can't beat a static CLAUDE.md on developer
- scenarios, that's the loudest possible signal that the rest of the system
- needs work; if we can, that's the headline 1.0 brag.
-
- → improvements.md entry: *CLAUDE.md Baseline in Headline Results*
-
- ### 5. `claude-memory show` — human-readable "what would be injected"
+ ### #5 `claude-memory show` — human-readable "what would be injected" ✅ landed 2026-04-29
 
  **Gap.** Inspecting memory state today requires the dashboard or several CLI
  commands (`recall`, `stats`, `census`). The CLAUDE.md alternative is
@@ -101,64 +97,223 @@ path real sessions use, prints what would be injected next session in plain
  English (not JSON), sized to fit a terminal, with predicate-grouped sections
  matching the snapshot format.
 
- **Why must-have.** Trust requires inspectability. A user who can't see what
+ **Why this release.** Trust requires inspectability. A user who can't see what
  memory will inject can't develop confidence in it.
 
- improvements.md entry: *claude-memory show*
+ **Status.** Landed 2026-04-29 (commit 2586bb3). New `Commands::ShowCommand`
+ runs `Hook::ContextInjector` and prints the would-be-injected Markdown.
+ Default suppresses the raw-transcript pending-knowledge dump for
+ readability (`--pending` opts in). Footer reports fact count, token
+ estimate, char count.
+
+ → improvements.md entry: *#51 claude-memory show*. Effort: ½d.
+
+ ### #7 First-week ROI nudge — *moved up from post-1.0* ✅ landed 2026-04-30
+
+ **Gap.** New users install, run a few sessions, don't know whether memory is
+ working. The dashboard exists but they have to know to look.
+
+ **Acceptance.** SessionEnd hook prints `memory contributed N facts this
+ session, %used = X` inline for the first ~10 sessions, then quiets. Opt-out
+ via `CLAUDE_MEMORY_NO_NUDGE=1`.
+
+ **Why this release.** Belongs with the trust theme — it's the user-visible
+ proof that memory is doing work for them. Originally listed as post-1.0;
+ elevating because cold-start trust deserves to land before 1.0.
+
+ **Status.** Landed in 2 atomic commits on 2026-04-30 (f450ed9, 3acce93)
+ plus production smoke-test against this project's DB (event #229
+ recorded with n=11, used=0, pct=0 for a real session_id). New
+ `Hook::Handler#nudge` + `claude-memory hook nudge`; SessionEnd config
+ appends nudge after ingest+sweep. Silent on opt-out, missing
+ session_id, n=0, or first-week-complete (so empty sessions don't burn
+ slots).
+
+ → improvements.md entry: *#53 First-Week ROI Nudge*. Effort: ½d.
+
+ ### Risk-de-risking — 3-scenario harm prototype ✅ landed 2026-04-30
+
+ Before 0.12 builds the full 10-15-scenario harm benchmark (see #3), run a
+ 3-scenario prototype against the 0.10.0 codebase to confirm whether harm is
+ actually low. If the prototype surfaces a >0% harm rate on simple cases, the
+ full benchmark in 0.12 will reveal a fundamental issue — better to know at
+ 0.11 than discover at 0.12.
+
+ **Acceptance.** Three hand-written `harm_scenarios.yml` cases (one stale-tech,
+ one mismatched-scope, one superseded-but-undetected) run against real Claude
+ under `EVAL_MODE=real`. Reports go/no-go on the larger benchmark in 0.12.
+
+ **Status.** Landed 2026-04-30 (commit 35b368e). Three cases written:
+ `harm_stale_tech` (MySQL fact vs SQLite reality), `harm_mismatched_scope`
+ (global TS/Tailwind preference applied to a Ruby gem),
+ `harm_superseded_undetected` (two contradicting auth_method facts both
+ active). Structure validation passes in stub mode. Real-mode is gated
+ behind `EVAL_MODE=real` (~$2-8 per run) so the operator decides when to
+ spend; this prototype reports harm rate but doesn't enforce a threshold
+ yet — that's the 0.12 release-gate work.
+
+ → improvements.md entry: *#49 Negative-Fact Harm Benchmark* (prototype phase).
+ Effort: ½d.
+
+ **Ship target:** ~2 weeks from 0.10.0 (mid-May 2026 at current velocity).
+
+ ---
+
+ ## 0.12.0 — "Release Discipline" (~1 week of work)
 
- ### 6. Release-to-release benchmark scoreboard
+ Theme: *we can't ship a regression without noticing.* Internal infrastructure
+ that prevents future regressions. Not flashy but the actual prerequisite for
+ 1.0's semver commitment.
+
+ ### #3 Negative-fact harm benchmark (full 10-15 scenarios)
+
+ **Gap.** Every benchmark today measures whether memory **helps**. Nothing
+ measures whether memory **harms** — i.e. injects a wrong fact and Claude
+ follows it. Without this, "memory helps" is unfalsifiable.
+
+ **Acceptance.** `spec/benchmarks/dataset/harm_scenarios.yml` with 10-15 cases
+ spanning four harm classes (stale-tech, mismatched-scope, superseded-but-
+ undetected, reference-material-as-fact). Each scores `harm` if Claude follows
+ the wrong fact, `safe` otherwise. Wired into `bin/run-evals`. **>1% harm
+ rate blocks release** (configurable via `HARM_RATE_THRESHOLD`).
+
+ **Why this release.** A retrieval system that occasionally makes Claude
+ *wrong* is strictly worse than no memory; the release gate proves we're not
+ in that regime.
+
+ → improvements.md entry: *#49 Negative-Fact Harm Benchmark* (full corpus).
+ Effort: 2d.
+
+ ### #4 Publish the CLAUDE.md baseline in headline E2E results
+
+ **Gap.** `claude_md_adapter` exists in `spec/benchmarks/comparative/adapters/`
+ and is wired into `comparative_helper.rb`. The README's headline comparative
+ table doesn't include it. The single most important question for adoption —
+ *"is this better than a hand-written CLAUDE.md?"* — is unanswered in our
+ published numbers.
+
+ **Acceptance.** Comparative E2E report includes `CLAUDE.md baseline` row in
+ `spec/benchmarks/README.md` and in `bin/run-evals --comparative` summary.
+ README explicitly states the win/loss versus the static baseline.
+
+ **Why this release.** Cheapest item on the list — adapter built, just
+ surface the number. Pairs with #6 because it materializes once the
+ scoreboard infrastructure is there.
+
+ → improvements.md entry: *#50 CLAUDE.md Baseline in Headline Results*.
+ Effort: 30min code + one $2-8 real-mode run.
+
+ ### #6 Release-to-release benchmark scoreboard
 
  **Gap.** Benchmark output is textual today. Nothing diff-able across versions.
- Regressions land silently — the only reason we caught the FTS5/RRF
- normalization bug was a manual run.
+ Regressions land silently — the only reason we caught the BM25 normalization
+ bug was a manual run.
 
  **Acceptance.** Each `bin/run-evals` run writes
- `spec/benchmarks/results/<version>.json`. New `bin/bench-diff` (or rake task)
- compares against the last tagged version's JSON and reports deltas. Release
- script (`/release` skill) reads it and refuses to ship on regressions over a
- configurable threshold.
+ `spec/benchmarks/results/<version>.json`. New `bin/bench-diff` compares
+ against the last tagged version's JSON and reports deltas. `/release` skill
+ reads it and refuses to ship on regressions over threshold.
+
+ **Why this release.** The semver commitment in 1.0 *requires* this — we
+ can't promise non-regression without the infrastructure to detect it.
 
- **Why must-have.** Without longitudinal tracking, every benchmark we run is a
- snapshot. 1.0 is the moment we commit to *not regressing* what we ship.
+ improvements.md entry: *#52 Benchmark Scoreboard Diff*. Effort: 1d.
 
- improvements.md entry: *Benchmark Scoreboard Diff*
+ **Ship target:** ~4 weeks from 0.10.0 (end of May 2026).
 
  ---
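
The `bin/bench-diff` acceptance in item #6 implies a simple shape: load two per-version JSON result files, compute deltas, flag regressions over a threshold. A hypothetical sketch (the tool does not exist yet; the file format, metric names, and helper name are illustrative):

```ruby
# Hypothetical sketch of the bench-diff acceptance criteria: compare two
# benchmark-result JSON files and flag regressions beyond a threshold.
# Treats every metric as higher-is-better, which is itself an assumption.
require "json"

def bench_diff(baseline_path, candidate_path, threshold: 0.01)
  baseline = JSON.parse(File.read(baseline_path))
  candidate = JSON.parse(File.read(candidate_path))

  baseline.map do |metric, old_score|
    delta = candidate.fetch(metric) - old_score
    { metric: metric, delta: delta.round(4),
      regression: delta < -threshold } # worse by more than the threshold
  end
end

# Usage sketch: two temp files standing in for results/<version>.json.
require "tempfile"
old_f = Tempfile.new("bench")
old_f.write({ "recall" => 0.82, "harm_rate" => 0.0 }.to_json); old_f.flush
new_f = Tempfile.new("bench")
new_f.write({ "recall" => 0.78, "harm_rate" => 0.0 }.to_json); new_f.flush

p bench_diff(old_f.path, new_f.path)
# => [{:metric=>"recall", :delta=>-0.04, :regression=>true},
#     {:metric=>"harm_rate", :delta=>0.0, :regression=>false}]
```

A real gate would invert the sign for cost-like metrics (harm rate, token cost), where an increase is the regression.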
 
- ## Strong post-1.0
+ ## 0.12.x → 1.0 — soak period (2-3 weeks)
+
+ Critical phase. Run 0.12 against real usage. Watch:
+
+ - **Harm rate stays at 0%** — release gate from #3
+ - **Hallucination rate trend** — from #2
+ - **Token budget growth** — from #1, #9
+ - **Utilization ratio** — across multiple projects
+
+ If any signal shifts unfavorably during soak, fix it in 0.12.x. **Don't ship
+ 1.0 from a release that hasn't observed itself for ≥2 weeks.**
+
+ This soak period is also where the relevance-ratio metric (#31 from 0.10.0)
+ gets its first real-mode measurement, and where the 0.11 trust signals get a
+ chance to become real numbers rather than theory.
 
- These shouldn't block 1.0 but should land in the next release window.
+ ---
+
+ ## 1.0.0 — "Stable Memory"
 
- ### 7. First-week ROI nudge
+ Theme: *ready for daily use, ready to recommend.*
 
- SessionEnd hook prints `memory contributed N facts this session, %used = X`
- inline for the first ~10 sessions. Closes the cold-start gap where new users
- don't see value because they don't think to look.
+ ### Post-1.0-punchlist polish (if landed during soak)
 
- improvements.md entry: *First-Week ROI Nudge*
+ These were originally post-1.0 in the punchlist; if soak time permits, they
+ land in 1.0. Otherwise they ship in 1.1.
 
- ### 8. Real-session repeat-correction detector
+ ### #8 Real-session repeat-correction detection
 
- The repeat-correction benchmark (#32) is synthetic; production has no
- equivalent signal. Analyze `activity_events` to detect "this fact was injected
- last session, the user re-stated it this session" — that's where memory is
- silently failing.
+ The repeat-correction benchmark (#32 from 0.10.0) is synthetic; production
+ has no equivalent signal. Analyze `activity_events` for "this fact was
+ injected last session, the user re-stated it this session" — that's where
+ memory is silently failing.
 
- → improvements.md entry: *Real-Session Repeat-Correction Detection*
+ → improvements.md entry: *#54 Real-Session Repeat-Correction Detection*.
+ Effort: 2d.
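The detection described above could be sketched like this — assuming a simplified event shape (`{session:, type:, fact_id:}`); the real `activity_events` schema is not documented here and may differ.

```ruby
# Hypothetical sketch of repeat-correction detection. The idea: a fact that
# was injected in one session and then re-stated by the user in the next
# session suggests the injection wasn't seen or wasn't trusted. The event
# shape below is an assumption, not the gem's actual activity_events schema.
def repeat_corrections(events)
  by_session = events.group_by { |e| e[:session] }
  sessions   = by_session.keys.sort

  sessions.each_cons(2).flat_map do |prev, curr|
    injected = by_session[prev].select { |e| e[:type] == :injected }.map { |e| e[:fact_id] }
    restated = by_session[curr].select { |e| e[:type] == :user_stated }.map { |e| e[:fact_id] }

    # Intersection = facts memory supplied that the user still had to repeat.
    (injected & restated).map do |fact|
      { fact_id: fact, injected_in: prev, restated_in: curr }
    end
  end
end
```

Each hit is a concrete "memory silently failed here" data point that a weekly digest could surface.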
 
- ### 9. Token-cost growth tracking
+ ### #9 Token-cost growth tracking
 
  Builds on #1. Weekly digest reports "context cost grew X% over 30d" as an
  anomaly signal that the DB is bloating or context injection is going wide.
 
- → improvements.md entry: *Token-Cost Growth Tracking*
+ → improvements.md entry: *#55 Token-Cost Growth Tracking*. Effort: 3h after
+ #1 lands.
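The growth signal could be computed along these lines — a sketch assuming a list of daily SessionStart token totals, oldest first, which is not necessarily how the gem stores its telemetry.

```ruby
# Hypothetical sketch of the "context cost grew X% over 30d" signal:
# compare the average daily token cost of the most recent window against
# the window before it. Input shape (daily totals, oldest first) is an
# assumption for illustration.
def context_cost_growth(daily_tokens, window: 30)
  return nil if daily_tokens.size < window * 2 # not enough history yet

  prev_avg = daily_tokens[-(window * 2), window].sum.to_f / window
  return nil if prev_avg.zero?

  curr_avg = daily_tokens.last(window).sum.to_f / window
  ((curr_avg - prev_avg) / prev_avg * 100).round(1)
end
```

A digest could then flag anything above, say, +25% as "DB bloating or injection going wide" and link to `stats --tokens` for the breakdown.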
 
- ### 10. Drift dashboard
+ ### #10 Drift dashboard
 
  Snapshot `census` weekly, surface predicate distribution shifts on the
  dashboard. Answers "is my fact base going off?" without a manual audit.
 
- → improvements.md entry: *Drift Dashboard*
+ → improvements.md entry: *#56 Drift Dashboard*. Effort: 1.5d.
+
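One way to score a shift between two weekly `census` snapshots is total variation distance — a sketch; the `{"predicate" => count}` snapshot shape is an assumption, not the gem's actual census format.

```ruby
# Hypothetical drift score between two predicate-count snapshots using
# total variation distance: 0.0 = identical distributions, 1.0 = disjoint.
# The {"predicate" => count} shape is assumed for illustration.
def predicate_drift(old_counts, new_counts)
  keys      = old_counts.keys | new_counts.keys
  old_total = old_counts.values.sum.to_f
  new_total = new_counts.values.sum.to_f
  return 0.0 if old_total.zero? || new_total.zero?

  keys.sum do |k|
    (old_counts.fetch(k, 0) / old_total - new_counts.fetch(k, 0) / new_total).abs
  end / 2.0
end
```

A dashboard could plot this week-over-week and highlight any jump, answering "is my fact base going off?" with one number.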
+ ### #11 API stability audit (NEW — added 2026-04-28)
+
+ **Gap.** "1.0 commits to semver" is meaningless without an explicit
+ public/internal split. Many of the surfaces touched in 0.9.0 / 0.10.0
+ (MCP tool schemas, hook payload shapes, CLI flags, dashboard endpoints)
+ have evolved organically and aren't formally documented as stable vs.
+ internal.
+
+ **Acceptance.**
+
+ - New `docs/api_stability.md` enumerating:
+   - **Public CLI**: every `claude-memory <subcommand>` and its flags, with stability tier
+   - **Public MCP tools**: every tool's schema, return shape, and tool-annotation hints
+   - **Public hook contract**: payload fields, return shapes, exit codes
+   - **Public Ruby API**: which classes/modules under `lib/claude_memory/` are external-facing (`Recall`, `Configuration`, `Store::StoreManager`?) vs. internal-only
+   - **Schema**: stability of column names, table names, predicate vocabulary
+ - A deprecation policy: "we'll mark X deprecated in N.x.0 and remove no earlier than (N+1).0.0"
+ - README + CLAUDE.md link to the new doc as the authoritative source
+
+ **Why this release.** Without this, the 1.0 semver promise is vibes, not a
+ contract. Future regressions in non-listed areas can be argued away; future
+ regressions in listed areas are bugs. This forces us to be honest about what
+ we're committing to.
+
+ → improvements.md entry: *#59 API Stability Audit* (added 2026-04-28; renumbered
+ from #57 after rebase brought in Mercury-article entries #57/#58). Effort:
+ 2d including the doc + deprecation-warning instrumentation for any
+ soon-to-be-removed surface.
+
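The deprecation-warning instrumentation mentioned above could be as small as a warn-once helper — illustrative only; the module and method names below are not the gem's real API.

```ruby
# Hypothetical sketch of warn-once deprecation instrumentation, naming the
# removal version per the policy ("deprecated in N.x.0, removed no earlier
# than (N+1).0.0"). Names here are illustrative, not the gem's actual API.
module MemoryDeprecations
  @warned = {}

  # Returns true the first time a surface is flagged, false on repeats,
  # so callers can assert "warned exactly once" in specs.
  def self.warn_once(surface, deprecated_in:, removal_earliest:)
    return false if @warned[surface]

    @warned[surface] = true
    warn "[claude-memory] #{surface} is deprecated as of #{deprecated_in} " \
         "and may be removed in #{removal_earliest} or later."
    true
  end
end
```

Wiring every surface slated for removal through a helper like this makes the deprecation policy mechanically checkable rather than doc-only.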
+ ### Release framing
+
+ The README + CHANGELOG framing for 1.0 explicitly states:
+
+ - "We measured X harm rate, Y utilization, Z hallucination rate across N
+   projects over W weeks before tagging this."
+ - The public API surface is documented at `docs/api_stability.md`
+ - The deprecation policy is explicit
+
+ **Ship target:** 6-8 weeks from 0.10.0 (mid-June 2026 at current velocity).
 
  ---
 
@@ -168,23 +323,49 @@ dashboard. Answers "is my fact base going off?" without a manual audit.
  drawers cover the primary need.
  - **#45 Live SSE/WebSocket feed** — polling is adequate; dashboard polish, not
  a confidence gap.
+ - **#23 REST API endpoint** — MCP covers the primary use case; defer to 1.x.
+ - **#25 HTTP MCP transport** — no startup-latency complaint to motivate it yet.
 
  ---
 
- ## Sequencing recommendation
+ ## Risk to flag now
+
+ The biggest hidden risk in this plan is **that the harm benchmark (#3) finds
+ something.** If 10-15 scenarios with intentionally wrong facts produce a >1%
+ harm rate, that's a fundamental retrieval-discipline issue that could push
+ 1.0 back by months. The 3-scenario prototype in 0.11 (above) is specifically
+ designed to surface this risk earlier.
+
+ ---
+
+ ## Velocity assumptions
+
+ Based on actual release cadence (Mar-Apr 2026):
+
+ | Pair | Days |
+ |---|---|
+ | 0.7.0 → 0.7.1 | patch (days) |
+ | 0.7.1 → 0.8.0 | 17 |
+ | 0.8.0 → 0.9.0 | 17 |
+ | 0.9.0 → 0.9.1 | same day (patch) |
+ | 0.9.1 → 0.10.0 | 12 |
 
- Smallest set that materially shifts 1.0 confidence (~2 days):
+ Average ~2 weeks per minor, with substantial work landing each cycle.
 
- 1. **Token budget telemetry** (#1) closes the loudest critique.
- 2. **CLAUDE.md baseline publish** (#4) — adapter already built, one report change.
- 3. **Hallucination rate** (#2) reuses ReferenceMaterialDetector.
+ | Milestone | Estimated work | Calendar target |
+ |---|---|---|
+ | 0.10.x patches | reactive | as needed |
+ | 0.11.0 | ~1 week | ~2026-05-12 |
+ | 0.12.0 | ~1 week | ~2026-05-26 |
+ | Soak | 2-3 weeks | through ~2026-06-16 |
+ | 1.0.0 | 1-2 days release prep + #11 | ~2026-06-16 to 2026-06-23 |
 
- Then in roughly priority order: `claude-memory show` (#5), harm benchmark
- (#3), scoreboard (#6). Post-1.0 items follow naturally once the must-haves
- land.
+ These are calendar estimates assuming roughly the same focus level as the
+ 0.10.0 cycle. Real cadence will adjust based on what surfaces during soak.
 
  ---
 
- *Last updated: 2026-04-28 initial punchlist drawn from session-end critique
- of observability/outcome gaps. Each entry will be elaborated with concrete
- file:line refs in improvements.md as it's worked.*
+ *Last updated: 2026-04-28 (post-0.10.0). Restructured around milestone
+ versions per the path-to-1.0 plan. #7 moved up from post-1.0 to 0.11; #11
+ API stability audit added as a new 1.0 must-have; the 3-scenario harm
+ prototype added to 0.11 as de-risking work for the full 0.12 benchmark.*
@@ -593,8 +593,10 @@ Now that you're up and running:
  | `claude-memory changes` | Recent updates |
  | `claude-memory conflicts` | Show contradictions |
  | `claude-memory dashboard` | Open the local web UI (0.10.0+) |
- | `claude-memory digest --since 7` | Markdown report of the last 7 days (0.10.0+) |
+ | `claude-memory digest --since 7` | Markdown report of the last 7 days (0.10.0+; gains Context cost + Quality sections in 0.11.0) |
+ | `claude-memory show [--pending] [--source]` | Print what memory would inject at next SessionStart (0.11.0+) |
  | `claude-memory stats --stale` | List facts not recalled recently (0.10.0+) |
+ | `claude-memory stats --tokens [--since DAYS]` | SessionStart context-token budget histogram (0.11.0+) |
  | `claude-memory stats --tools` | MCP tool-call telemetry (0.9.0+) |
  | `claude-memory census` | Privacy-safe predicate audit across projects (0.10.0+) |
  | `claude-memory dedupe-conflicts --dry-run` | Preview historical conflict-row dedup (0.10.0+) |
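For context on the new `stats --tokens` row: a token-budget histogram can be bucketed along these lines — a sketch with invented bucket widths and rendering, assuming per-session SessionStart token counts as input.

```ruby
# Hypothetical sketch of a SessionStart token-budget histogram like the
# one `stats --tokens` might print. The 500-token bucket width and the
# ASCII bar rendering are invented for illustration.
def token_histogram(session_tokens, bucket_size: 500)
  buckets = session_tokens.group_by { |tokens| tokens / bucket_size }

  buckets.keys.sort.map do |b|
    count = buckets[b].size
    lo, hi = b * bucket_size, (b + 1) * bucket_size - 1
    format("%5d-%-5d | %-20s %d", lo, hi, "#" * count, count)
  end
end
```

Printing one line per occupied bucket keeps the output diff-friendly, which matters once results feed the release-to-release scoreboard.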