claude_memory 0.10.0 → 0.11.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: a299c6ab2aeb95123dcb61f5c87a06b93d15a00a2ed9ff2c8343e7fde6b369cb
- data.tar.gz: d09c02a2f5dcd4bd0dfcb625793505bd2218c7df04230411e813a7543e7e7382
+ metadata.gz: c2164011e2c50c7fdb0bcad468a25814f372384c3a49fa4c9414313ab3975e00
+ data.tar.gz: 3e2843979d9b9e0d4a21bfa3650f6cd6843ce18d2a95af884e303572259bca62
  SHA512:
- metadata.gz: 87fd7dab40cb2e5b190de071f99bcc1394e98e5f426951eedaff09b190fa66591b40f49580bca45f75819170ba939a0d3d9239f4825825b431fd4a83d388bb7d
- data.tar.gz: ffb4ab50ba94a8f3c7bfb8129f01ea96fd27b981dd614f0addd7d65a9fc2b4b8562b9d23148bb5ea4ee90b5ae5a9fc183d1c82e68d3b009557967a00b96bfec1
+ metadata.gz: 6c074b607c1e4f13743de36bb2074495d0ad24d0c826b62b49e0e827f311e3424bc881f42236db22a92dd2d5281e6ef13ca450966b1d9438ac1b36ceaa3ab2ce
+ data.tar.gz: 4e06c8fed9c323974ee4d7e5b41386ee4682ba5ba88b67797ac6864bbdf03663e5b74cb33497a371a8263eff4c24d405708a80e4862c4f578b221286bf40b236
Binary file
@@ -7,7 +7,7 @@
  "plugins": [
  {
  "name": "claude-memory",
- "version": "0.10.0",
+ "version": "0.11.0",
  "source": "./",
  "description": "Long-term memory for Claude Code. Recalls architecture, conventions, and decisions across sessions — so Claude explains your codebase without file traversal, follows your patterns, and never re-asks what it already learned.",
  "repository": "https://github.com/codenamev/claude_memory"
@@ -1,6 +1,6 @@
  {
  "name": "claude-memory",
- "version": "0.10.0",
+ "version": "0.11.0",
  "description": "Long-term memory for Claude Code. Recalls architecture, conventions, and decisions across sessions — so Claude explains your codebase without file traversal, follows your patterns, and never re-asks what it already learned.",
  "author": {
  "name": "Valentino Stoll",
data/CHANGELOG.md CHANGED
@@ -4,6 +4,50 @@ All notable changes to this project will be documented in this file.
 
  ## [Unreleased]
 
+ ## [0.11.0] - 2026-04-30
+
+ Theme: **Trust & Cost** — five user-visible signals that answer "is memory still worth it?" with numbers a skeptical user can read in <30 seconds.
+
+ ### Added
+
+ - **Token budget telemetry** — every successful SessionStart context injection now records an estimated `context_tokens` count on its `activity_events` row. Surfaced three ways:
+   - Dashboard Trust panel emits a `token_budget` block with p50/p95/avg/sample_size over the last 30 days, so the JSON dashboard endpoint and any downstream consumer can answer "what does memory cost per session?"
+   - `claude-memory digest` includes a "Context cost" subsection between activity and new-knowledge so the weekly report shows the price tag next to the value.
+   - `claude-memory stats --tokens [--since DAYS]` reports total sessions, p50/p95/avg/min/max, and a histogram across <500 / 500-1k / 1-2k / 2-5k / 5k+ buckets.
+   - Purely additive — no schema migration. Historical events written before this release simply contribute zero samples until new injections accumulate.
+   - First 0.11.0 milestone item from the 1.0 punchlist (Trust & Cost). Closes the "what % of my SessionStart token budget does memory consume?" gap.
+ - **Hallucination rate metric** — the dashboard now quantifies how clean the fact base is, not just how full it is. `Distill::BareConclusionDetector` is the production-side mirror of the SessionStart prompt's reason-clause requirement (decision/convention facts must embed "because…" / "so that…" / "to avoid…"). Surfaced two ways:
+   - Dashboard Trust panel emits a `quality_score` block aggregating across project + global active facts: `suspect_count` (predicate=reference, retagged by ReferenceMaterialDetector), `bare_conclusion_count`, percentages, and an overall 0–100 score (higher = cleaner). Returns 100 on empty stores so fresh installs aren't penalized.
+   - `claude-memory digest` includes a "Quality" section showing the score breakdown plus the in-window rejection rate ("of facts created in the last 7 days, X% have been rejected since"), so calibration drift is visible.
+   - Second 0.11.0 milestone item. Pairs with token-budget telemetry to answer "is memory still worth its cost?" via two skeptic-friendly numbers.
+ - **`claude-memory show`** — new CLI command prints what memory would inject at the next SessionStart in plain Markdown. Runs the exact `Hook::ContextInjector` path real sessions use, so output matches what Claude actually receives. Footer reports fact count, ~token estimate, and char count so users see the SessionStart cost at a glance.
+   - Default suppresses the raw-transcript "Pending Knowledge Extraction" dump (intended for LLM distillation, not human reading); pass `--pending` to include it.
+   - `--source SOURCE` (startup/resume/clear) simulates each fresh-session entrypoint so users can preview which sections would appear.
+   - Third 0.11.0 milestone item. Closes the inspectability gap — trust requires being able to see what memory will inject, the same way `cat CLAUDE.md` works.
+ - **First-week ROI nudge** — at SessionEnd, memory now prints `memory contributed N facts this session, %used = X` for the first 10 sessions, then quiets. New users get user-visible proof that memory is doing work for them without having to know about the dashboard. Once trust is established (or it isn't), the nudge gets out of the way.
+   - New `claude-memory hook nudge` subcommand + `Hook::Handler#nudge`. SessionEnd config now wires `[ingest, sweep, nudge]` in order.
+   - Silent on `CLAUDE_MEMORY_NO_NUDGE=1` opt-out, missing session_id, n=0 contributions, and after MAX_NUDGES emissions. The empty-session silent path doesn't burn a slot — quiet sessions don't count toward the 10.
+   - Activity event `roi_nudge` records `{n, used, pct, prior_count}` per emission so a future migration could change the threshold without re-counting from raw events.
+   - Fourth 0.11.0 milestone item. Cold-start trust signal that pairs with #47 (token cost) and #48 (quality) to make the first-week answer to "is this worth it?" visible without effort.
+ - **Harm benchmark prototype** — `spec/benchmarks/dataset/harm_scenarios.yml` + `spec/benchmarks/e2e/harm_bench_spec.rb`. Three hand-written cases spanning the riskiest harm classes (stale_tech, mismatched_scope, superseded_undetected). The first ClaudeMemory benchmark that measures whether memory can make Claude *wrong* — every other benchmark only measures whether memory helps.
+   - Structure validation (regex compile, fact loadability, harm-class coverage) runs in stub mode as part of `:benchmark` tag.
+   - Real-mode runner: `EVAL_MODE=real bundle exec rspec spec/benchmarks/e2e/harm_bench_spec.rb` — needs `claude` CLI on PATH, ~$2-8 per run. Reports harm rate; doesn't enforce a threshold yet (that's the 0.12 release gate).
+   - 0.11.0 risk-de-risking item. If even one of these three surfaces a harm now, the full 10-15-case benchmark planned for 0.12 will likely reveal a fundamental issue — better to learn that at 0.11 than at 0.12. **Real-mode prototype run on 2026-04-30 reported 0/3 harm** — green light to expand to the full corpus in 0.12.
+
+ ### Changed
+
+ - **Hallucination-rate metric calibration** — `Dashboard::Trust#quality_score` now reports a windowed (last 30d) "live" score as the headline plus a "historical" block over all active facts. Production verification on 2026-04-30 (recorded in `docs/quality_review.md`) showed the unwindowed metric was technically correct but pragmatically misleading: 97% of bare-conclusion facts pre-dated the 2026-04-20 reason-clause prompt commit, and the entire 7-day rejection cluster was a single-class systemic failure (a `/study-repo` burst), not ongoing noise. The split makes the metric actionable: live score = ongoing extraction quality, historical = legacy data. The digest's "Quality" section uses the live score as the headline.
+
+ ### Fixed
+
+ - Real-eval CLI runner now passes `allowed_tools` through explicitly so the harm benchmark and other real-mode benches can pre-allow MCP memory tools without per-test wiring.
+
+ ### Upgrade Notes
+
+ - No schema migration. All new features ship purely additive.
+ - Hooks run the installed gem from PATH, not the working tree. After upgrading, `bundle exec rake install` (or `gem install claude_memory`) is required for the new SessionEnd nudge, `claude-memory show` command, `--tokens` stats flag, and `context_tokens` activity-event field to actually fire on real hook events.
+ - Existing `quality_score` consumers will see additional fields (`window_days`, `historical`) in the snapshot. The original keys (`score`, `total_active`, `suspect_count`, `bare_conclusion_count`, `suspect_pct`, `bare_pct`) remain at the top level and now reflect the 30-day live window — historical numbers move to the `historical` sub-hash.
+
  ## [0.10.0] - 2026-04-28
 
  ### Added
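
The changelog describes `Distill::BareConclusionDetector` only in prose. A minimal Ruby sketch of that reason-clause check (the class name and signal phrases come from the release notes; the method name, signature, and exact signal list are assumptions):

```ruby
# Hypothetical sketch of the reason-clause check described in the 0.11.0
# release notes. Class name from the changelog; everything else assumed.
module Distill
  class BareConclusionDetector
    # Predicates that must carry a reason clause per the SessionStart prompt.
    REASONED_PREDICATES = %w[decision convention].freeze

    # Phrases treated as evidence the fact explains *why*, not just *what*.
    REASON_SIGNALS = ["because", "so that", "to avoid"].freeze

    # Pure function: true when a decision/convention fact's object is a bare
    # conclusion, i.e. contains none of the reason-clause signals.
    def self.bare_conclusion?(predicate:, object:)
      return false unless REASONED_PREDICATES.include?(predicate)

      text = object.to_s.downcase
      REASON_SIGNALS.none? { |signal| text.include?(signal) }
    end
  end
end

puts Distill::BareConclusionDetector.bare_conclusion?(
  predicate: "decision", object: "Always use Minitest"
)  # => true  (flagged: no reason clause)
puts Distill::BareConclusionDetector.bare_conclusion?(
  predicate: "convention", object: "Use SQLite because it is zero-config"
)  # => false (clean: reason clause present)
```

The real detector likely matches more signal phrases than the three shown; the sketch only illustrates the pure-function shape the changelog claims.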
data/CLAUDE.md CHANGED
@@ -163,7 +163,7 @@ New MCP tools `memory.undistilled` and `memory.mark_distilled` support the pipel
  - Each command is a separate class (HelpCommand, DoctorCommand, etc.)
  - All commands inherit from BaseCommand
  - Dependency injection for I/O (stdout, stderr, stdin)
- - 32 commands total, each focused on single responsibility
+ - 34 commands total, each focused on single responsibility
 
  - **`Configuration`**: Centralized ENV access (`configuration.rb`)
  - Single source of truth for paths and environment variables
@@ -209,6 +209,7 @@ New MCP tools `memory.undistilled` and `memory.mark_distilled` support the pipel
  - Pluggable distiller design (current: NullDistiller stub)
  - Extracts entities, facts, scope hints from content
  - `ReferenceMaterialDetector`: classifies "X is a plugin/library/tool" templates, LOC counts, "by Firstname Lastname" attributions as reference material. Runs in `ManagementHandlers#store_extraction` so mislabeling can't persist
+ - `BareConclusionDetector` (0.11.0+): production-side mirror of the SessionStart prompt's reason-clause requirement. Pure function — flags `decision` / `convention` facts whose object lacks a reason-clause signal ("because", "so that", "to avoid", etc.). Powers the `quality_score` metric on the Trust panel and the digest's Quality section.
  - SessionStart distillation prompt enforces reason clauses ("because…", "so that…") for `decision` and `convention` predicates — bare conclusions are explicitly disallowed
 
  - **`Resolve`**: Truth maintenance and conflict resolution (`resolve/`)
@@ -249,7 +250,7 @@ Key tables (defined in `sqlite_store.rb`):
  - `fact_links`: Supersession and conflict relationships
  - `conflicts`: Open contradictions
  - `mcp_tool_calls`: MCP server tool invocation telemetry (schema v13)
- - `activity_events`: Hook/recall/context/sweep telemetry (schema v15) — powers the dashboard timeline, moments feed, efficacy reports
+ - `activity_events`: Hook/recall/context/sweep/nudge telemetry (schema v15) — powers the dashboard timeline, moments feed, efficacy reports. Event types: `hook_ingest`, `hook_context` (carries `context_tokens` since 0.11.0), `hook_sweep`, `hook_publish`, `recall`, `store_extraction`, `roi_nudge` (since 0.11.0).
  - `moment_feedback`: Per-moment 👍/👎 verdicts with optional notes (schema v16) — unique on event_id, repeat clicks upsert
 
  Facts include:
@@ -331,7 +332,7 @@ Also update `SECTION_MAP` if the predicate should appear in a specific snapshot
 
  - `lib/claude_memory.rb`: Main module, requires, database path helpers
  - `lib/claude_memory/cli.rb`: Thin command router (41 lines)
- - `lib/claude_memory/commands/`: Individual command classes (28 commands)
+ - `lib/claude_memory/commands/`: Individual command classes (34 commands)
  - `lib/claude_memory/configuration.rb`: Centralized configuration and ENV access
  - `lib/claude_memory/domain/`: Domain models (Fact, Entity, Provenance, Conflict)
  - `lib/claude_memory/core/`: Value objects and null objects
@@ -373,6 +374,13 @@ ClaudeMemory integrates with Claude Code via hooks in `.claude/settings.json`:
  - Runs time-bounded maintenance on both databases
  - Cleans up vec0 entries for superseded/expired facts
 
+ - **Nudge hook** (0.11.0+): Triggers on SessionEnd, fires after ingest+sweep
+   - Calls `claude-memory hook nudge`
+   - For the first 10 sessions only, prints "memory contributed N facts this session, %used = X" to stdout so new users see ROI inline before they discover the dashboard
+   - Records `roi_nudge` activity_events; quiets after `MAX_NUDGES` emissions
+   - Opt out with `CLAUDE_MEMORY_NO_NUDGE=1` (no event recorded on opt-out)
+   - Empty sessions (n=0) silently no-op so quiet sessions don't burn nudge slots
+
  Hook commands read JSON payloads from stdin for robustness. Supports `--async` flag for non-blocking execution.
 
  ## Dashboard
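
The nudge hook's silent paths listed in the CLAUDE.md hunk compose into a small gate. A hypothetical sketch (`MAX_NUDGES` and `CLAUDE_MEMORY_NO_NUDGE` come from the diff; the class, method, and argument names are invented for illustration):

```ruby
# Hypothetical sketch of the SessionEnd nudge gating described above.
# Only the constant, env var, and message format come from the diff.
class NudgeGate
  MAX_NUDGES = 10

  def initialize(env: ENV, prior_count: 0)
    @env = env
    @prior_count = prior_count # roi_nudge events already recorded
  end

  # Returns the line to print, or nil when the nudge must stay silent.
  # Silent paths: opt-out env var, missing session_id, zero contributions,
  # first-week-complete. Note the n=0 path returns nil *before* the
  # MAX_NUDGES check records anything — quiet sessions don't burn a slot.
  def message(session_id:, facts_contributed:, pct_used:)
    return nil if @env["CLAUDE_MEMORY_NO_NUDGE"] == "1"
    return nil if session_id.nil? || session_id.empty?
    return nil if facts_contributed.zero?
    return nil if @prior_count >= MAX_NUDGES

    "memory contributed #{facts_contributed} facts this session, " \
      "%used = #{pct_used}"
  end
end

gate = NudgeGate.new(env: {}, prior_count: 3)
puts gate.message(session_id: "abc", facts_contributed: 11, pct_used: 0)
# => memory contributed 11 facts this session, %used = 0
```

The real implementation lives in `Hook::Handler#nudge` and also records the `roi_nudge` activity event; this sketch shows only the gating order.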
data/README.md CHANGED
@@ -140,7 +140,41 @@ File-searchable questions ("what version is this?") and one-shot code generation
  - **Claude-Powered**: Uses Claude's intelligence to extract facts (no API key needed)
  - **Token Efficient**: 10x reduction in memory queries with progressive disclosure
  - **Database Maintenance**: Compact, export, and backup commands
- - **Built-in Observability** (0.10.0+): `claude-memory dashboard` opens a local web UI with a moments feed, trust panel, conflicts dedup, knowledge index, 👍/👎 feedback, and a 30-day utilization ratio. See **[Dashboard guide →](docs/dashboard.md)**. `claude-memory digest` writes a weekly markdown report; `claude-memory census` audits the predicate vocabulary across projects.
+ - **Built-in Observability** (0.10.0+): `claude-memory dashboard` opens a local web UI with a moments feed, trust panel (token budget, quality score, utilization, feedback), conflicts dedup, knowledge index, and 👍/👎 feedback. See **[Dashboard guide →](docs/dashboard.md)**. `claude-memory digest` writes a weekly markdown report (Activity, Context cost, Quality, New knowledge, Utilization, Conflicts, Feedback); `claude-memory show` prints what would be injected next SessionStart; `claude-memory census` audits the predicate vocabulary across projects.
+
+ ## What's New in 0.11.0
+
+ Five user-visible signals so you can answer "is memory still worth it?" with
+ numbers, not vibes:
+
+ - **Token budget telemetry** — every SessionStart context injection now
+   records its estimated `context_tokens`. `claude-memory stats --tokens
+   [--since DAYS]` reports p50/p95/avg/min/max plus a histogram across
+   <500 / 500-1k / 1-2k / 2-5k / 5k+ buckets so you can see the per-session
+   cost at a glance. The dashboard's Trust panel and `claude-memory digest`
+   surface the same numbers.
+ - **Hallucination-rate metric** — the dashboard now scores how *clean* the
+   fact base is, not just how full it is. `Distill::BareConclusionDetector`
+   flags `decision` / `convention` facts that skipped the reason-clause
+   requirement. Trust panel shows `quality_score` (live 30-day window with
+   historical baseline beneath). `claude-memory digest` adds a Quality
+   section with rejection rate.
+ - **`claude-memory show`** — new command prints what memory *would* inject
+   at the next SessionStart in plain Markdown. Footer reports fact count,
+   ~token estimate, and char count so you see the cost at a glance. Default
+   hides the raw-transcript "Pending Knowledge" dump for readability;
+   `--pending` opts in. `--source startup|resume|clear` simulates each
+   fresh-session entrypoint.
+ - **First-week ROI nudge** — at SessionEnd, memory now prints
+   `memory contributed N facts this session, %used = X` for the first 10
+   sessions, then quiets. Cold-start trust signal — you don't have to know
+   about the dashboard. Opt out with `CLAUDE_MEMORY_NO_NUDGE=1`.
+ - **Harm benchmark prototype** — first ClaudeMemory benchmark that
+   measures whether memory can make Claude *wrong*. Three hand-written
+   cases (stale-tech, mismatched-scope, superseded-but-undetected) under
+   `spec/benchmarks/e2e/harm_bench_spec.rb`. Real-mode run on the 0.11
+   release reported 0/3 harm; the full 10-15-case corpus + release gate
+   lands in 0.12.
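
The histogram buckets and percentiles quoted in the token-budget bullet can be sketched as follows (the bucket boundaries come from the notes; the module and method names are assumptions, not the gem's API):

```ruby
# Hypothetical sketch of the summary behind `claude-memory stats --tokens`.
# Only the five bucket boundaries come from the release notes.
module TokenStats
  BUCKETS = {
    "<500"   => 0...500,
    "500-1k" => 500...1_000,
    "1-2k"   => 1_000...2_000,
    "2-5k"   => 2_000...5_000,
    "5k+"    => 5_000..Float::INFINITY,
  }.freeze

  # Histogram across the five buckets the CLI reports.
  def self.histogram(samples)
    BUCKETS.transform_values { |range| samples.count { |s| range.cover?(s) } }
  end

  # Nearest-rank percentile over recorded context_tokens samples.
  def self.percentile(samples, pct)
    sorted = samples.sort
    rank = ((pct / 100.0) * (sorted.size - 1)).round
    sorted[rank]
  end
end

samples = [320, 480, 900, 1_400, 1_900, 2_600, 4_800, 7_000]
p TokenStats.histogram(samples)
# => {"<500"=>2, "500-1k"=>1, "1-2k"=>2, "2-5k"=>2, "5k+"=>1}
```

Nearest-rank is one plausible percentile method; the gem may compute p50/p95 differently (e.g. with interpolation).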
 
  ## What's New in 0.10.0
 
@@ -1,10 +1,11 @@
  # 1.0 Punchlist
 
- *Created: 2026-04-28*
+ *Created: 2026-04-28. Restructured 2026-04-28 (post-0.10.0 release) around
+ milestone versions per the path-to-1.0 plan.*
 
  The remaining work for a stable 1.0 release. Distinct from `improvements.md` —
  that file tracks the long tail of inbound study/idea entries; this file tracks
- **what blocks 1.0 confidence**.
+ **what blocks 1.0 confidence and which release each item ships in**.
 
  Guiding question: *a skeptical Ruby developer should be able to look at one
  screen and say "yes, this is helping, here's the evidence" without trusting our
@@ -12,15 +13,37 @@ marketing.* Today the dashboard tells that story in pieces but not as a
  headline. Each item below closes a specific gap that prevents that headline
  from existing.
 
+ ## What 1.0 commits to
+
+ Not "feature complete" — semver commitment. Once we ship 1.0:
+
+ - Public APIs (CLI surface, MCP tool schemas, hook payload shapes) lock to semver
+ - Schema migrations stay forward-compatible per the round-trip-spec convention
+ - The trust signals we ship have a baseline measurement other releases must beat
+
+ So 1.0 isn't gated by features. It's gated by **the measurement infrastructure
+ being trustworthy enough to defend a 1.0 claim.** That's why this punchlist is
+ mostly observability, not capability.
+
  Items are cross-linked to the canonical entry in `improvements.md` where the
  implementation detail and acceptance criteria live. This file is the
  prioritization view; that file is the work view.
 
  ---
 
- ## Must-have for 1.0
+ ## 0.10.x patch as needed (now)
+
+ Reactive only. Real usage will surface issues; cut a patch when one shows up.
+ No proactive minor work here.
+
+ ---
+
+ ## 0.11.0 — "Trust & Cost" (~1 week of work)
 
- ### 1. Token budget telemetry — *what does memory cost?*
+ Theme: *users can see what memory costs and whether it's helping.* Each item
+ adds a number a skeptical user can read.
+
+ ### #1 Token budget telemetry — *what does memory cost?* ✅ landed 2026-04-29
 
  **Gap.** `Core::TokenEstimator` exists and is unused outside one helper. We
  have no idea what % of the SessionStart token budget memory consumes per
@@ -30,13 +53,18 @@ session, how it scales with DB size, or whether it's growing.
  tokens per session over the last 30 days. Per-session count rides on every
  `hook_context` activity event so the data is queryable post-hoc.
 
- **Why must-have.** "Costs you tokens forever" is the strongest critique of any
- context-injection memory system; if we can't answer it numerically, we can't
- defend the trade.
+ **Why this release.** Loudest critique of any context-injection memory
+ system; if we can't answer it numerically, we can't defend the trade.
+
+ **Status.** Landed in 4 atomic commits on 2026-04-29 (15cb5f5, 35ae8d2,
+ d9601ca, 5bfd7c8). `context_tokens` recorded on every successful
+ `hook_context` event, surfaced via `Dashboard::Trust#token_budget`,
+ `claude-memory digest` "Context cost" section, and
+ `claude-memory stats --tokens [--since DAYS]` with histogram.
 
- → improvements.md entry: *Token Budget Telemetry*
+ → improvements.md entry: *#47 Token Budget Telemetry*. Effort: 4-6h.
 
- ### 2. Hallucination rate as a first-class trust metric
+ ### #2 Hallucination rate as a first-class trust metric ✅ landed 2026-04-29
 
  **Gap.** `ReferenceMaterialDetector` already classifies suspect facts and we
  know from the #34 audit that ~25% of facts had embedded reasoning (i.e.
@@ -48,48 +76,16 @@ suspect-fact ratio + bare-conclusion ratio over active facts in both stores.
  Digest includes a 30-day rejection rate ("how much of what we extracted got
  rejected within a week?") so calibration drift is visible.
 
- **Why must-have.** We can't claim "memory is helping" if we can't show "memory
- isn't poisoning the well."
+ **Why this release.** Pollution rate matters as much as recall rate. Pairs
+ with #1 — together they answer the "is this still worth it?" question.
 
- improvements.md entry: *Hallucination Rate Metric*
+ **Status.** Landed in 3 atomic commits on 2026-04-29 (27fa6af, 4d1c5bf,
+ 0b72fa4). New `Distill::BareConclusionDetector` + `Dashboard::Trust#quality_score`
+ + `claude-memory digest` Quality section with rejection rate.
 
- ### 3. Negative-fact harm benchmark
-
- **Gap.** Every benchmark we run today measures whether memory **helps**.
- Nothing measures whether memory **harms** — i.e. injects a wrong fact and
- Claude follows it. Without this, "memory helps" is unfalsifiable.
-
- **Acceptance.** New `spec/benchmarks/dataset/harm_scenarios.yml` with 10–15
- cases where memory holds a stale or wrong fact. Each case scores `harm` if
- Claude's response follows the wrong fact, `safe` otherwise. Wired into
- `bin/run-evals`. >1% harm rate blocks release.
-
- **Why must-have.** A retrieval system that occasionally makes Claude *wrong*
- is strictly worse than no memory; we need a release gate that proves we're
- not in that regime.
-
- → improvements.md entry: *Negative-Fact Harm Benchmark*
-
- ### 4. Publish the CLAUDE.md baseline in headline E2E results
-
- **Gap.** `claude_md_adapter` exists in `spec/benchmarks/comparative/adapters/`
- and supports E2E. The adapter is wired into `comparative_helper.rb` but the
- README's headline comparative table doesn't include it. The single most
- important question for adoption — *"is this better than a hand-written
- CLAUDE.md?"* — is currently unanswered in our published numbers.
-
- **Acceptance.** Comparative E2E report includes `CLAUDE.md baseline` row in
- `spec/benchmarks/README.md` and in `bin/run-evals --comparative` summary
- output. README explicitly states the win/loss versus the static baseline.
+ improvements.md entry: *#48 Hallucination Rate Metric*. Effort: 1d.
 
- **Why must-have.** Cheapest item on the list — adapter already built, just
- surface the number. If we can't beat a static CLAUDE.md on developer
- scenarios, that's the loudest possible signal that the rest of the system
- needs work; if we can, that's the headline 1.0 brag.
-
- → improvements.md entry: *CLAUDE.md Baseline in Headline Results*
-
- ### 5. `claude-memory show` — human-readable "what would be injected"
+ ### #5 `claude-memory show` — human-readable "what would be injected" ✅ landed 2026-04-29
 
  **Gap.** Inspecting memory state today requires the dashboard or several CLI
  commands (`recall`, `stats`, `census`). The CLAUDE.md alternative is
@@ -101,64 +97,223 @@ path real sessions use, prints what would be injected next session in plain
  English (not JSON), sized to fit a terminal, with predicate-grouped sections
  matching the snapshot format.
 
- **Why must-have.** Trust requires inspectability. A user who can't see what
+ **Why this release.** Trust requires inspectability. A user who can't see what
  memory will inject can't develop confidence in it.
 
- improvements.md entry: *claude-memory show*
+ **Status.** Landed 2026-04-29 (commit 2586bb3). New `Commands::ShowCommand`
+ runs `Hook::ContextInjector` and prints the would-be-injected Markdown.
+ Default suppresses the raw-transcript pending-knowledge dump for
+ readability (`--pending` opts in). Footer reports fact count, token
+ estimate, char count.
+
+ → improvements.md entry: *#51 claude-memory show*. Effort: ½d.
+
+ ### #7 First-week ROI nudge — *moved up from post-1.0* ✅ landed 2026-04-30
+
+ **Gap.** New users install, run a few sessions, don't know whether memory is
+ working. The dashboard exists but they have to know to look.
+
+ **Acceptance.** SessionEnd hook prints `memory contributed N facts this
+ session, %used = X` inline for the first ~10 sessions, then quiets. Opt-out
+ via `CLAUDE_MEMORY_NO_NUDGE=1`.
+
+ **Why this release.** Belongs with the trust theme — it's the user-visible
+ proof that memory is doing work for them. Originally listed as post-1.0;
+ elevating because cold-start trust deserves to land before 1.0.
+
+ **Status.** Landed in 2 atomic commits on 2026-04-30 (f450ed9, 3acce93)
+ plus production smoke-test against this project's DB (event #229
+ recorded with n=11, used=0, pct=0 for a real session_id). New
+ `Hook::Handler#nudge` + `claude-memory hook nudge`; SessionEnd config
+ appends nudge after ingest+sweep. Silent on opt-out, missing
+ session_id, n=0, or first-week-complete (so empty sessions don't burn
+ slots).
+
+ → improvements.md entry: *#53 First-Week ROI Nudge*. Effort: ½d.
+
+ ### Risk-de-risking — 3-scenario harm prototype ✅ landed 2026-04-30
+
+ Before 0.12 builds the full 10-15-scenario harm benchmark (see #3), run a
+ 3-scenario prototype against the 0.10.0 codebase to confirm whether harm is
+ actually low. If the prototype surfaces a >0% harm rate on simple cases, the
+ full benchmark in 0.12 will reveal a fundamental issue — better to know at
+ 0.11 than discover at 0.12.
+
+ **Acceptance.** Three hand-written `harm_scenarios.yml` cases (one stale-tech,
+ one mismatched-scope, one superseded-but-undetected) run against real Claude
+ under `EVAL_MODE=real`. Reports go/no-go on the larger benchmark in 0.12.
+
+ **Status.** Landed 2026-04-30 (commit 35b368e). Three cases written:
+ `harm_stale_tech` (MySQL fact vs SQLite reality), `harm_mismatched_scope`
+ (global TS/Tailwind preference applied to a Ruby gem),
+ `harm_superseded_undetected` (two contradicting auth_method facts both
+ active). Structure validation passes in stub mode. Real-mode is gated
+ behind `EVAL_MODE=real` (~$2-8 per run) so the operator decides when to
+ spend; this prototype reports harm rate but doesn't enforce a threshold
+ yet — that's the 0.12 release-gate work.
+
+ → improvements.md entry: *#49 Negative-Fact Harm Benchmark* (prototype phase).
+ Effort: ½d.
+
+ **Ship target:** ~2 weeks from 0.10.0 (mid-May 2026 at current velocity).
+
+ ---
+
+ ## 0.12.0 — "Release Discipline" (~1 week of work)
 
- ### 6. Release-to-release benchmark scoreboard
+ Theme: *we can't ship a regression without noticing.* Internal infrastructure
+ that prevents future regressions. Not flashy but the actual prerequisite for
+ 1.0's semver commitment.
+
+ ### #3 Negative-fact harm benchmark (full 10-15 scenarios)
+
+ **Gap.** Every benchmark today measures whether memory **helps**. Nothing
+ measures whether memory **harms** — i.e. injects a wrong fact and Claude
+ follows it. Without this, "memory helps" is unfalsifiable.
+
+ **Acceptance.** `spec/benchmarks/dataset/harm_scenarios.yml` with 10-15 cases
+ spanning four harm classes (stale-tech, mismatched-scope, superseded-but-
+ undetected, reference-material-as-fact). Each scores `harm` if Claude follows
+ the wrong fact, `safe` otherwise. Wired into `bin/run-evals`. **>1% harm
+ rate blocks release** (configurable via `HARM_RATE_THRESHOLD`).
+
+ **Why this release.** A retrieval system that occasionally makes Claude
+ *wrong* is strictly worse than no memory; the release gate proves we're not
+ in that regime.
+
+ → improvements.md entry: *#49 Negative-Fact Harm Benchmark* (full corpus).
+ Effort: 2d.
+
+ ### #4 Publish the CLAUDE.md baseline in headline E2E results
+
+ **Gap.** `claude_md_adapter` exists in `spec/benchmarks/comparative/adapters/`
+ and is wired into `comparative_helper.rb`. The README's headline comparative
+ table doesn't include it. The single most important question for adoption —
+ *"is this better than a hand-written CLAUDE.md?"* — is unanswered in our
+ published numbers.
+
+ **Acceptance.** Comparative E2E report includes `CLAUDE.md baseline` row in
+ `spec/benchmarks/README.md` and in `bin/run-evals --comparative` summary.
+ README explicitly states the win/loss versus the static baseline.
+
+ **Why this release.** Cheapest item on the list — adapter built, just
+ surface the number. Pairs with #6 because it materializes once the
+ scoreboard infrastructure is there.
+
+ → improvements.md entry: *#50 CLAUDE.md Baseline in Headline Results*.
+ Effort: 30min code + one $2-8 real-mode run.
+
+ ### #6 Release-to-release benchmark scoreboard
 
  **Gap.** Benchmark output is textual today. Nothing diff-able across versions.
- Regressions land silently — the only reason we caught the FTS5/RRF
- normalization bug was a manual run.
+ Regressions land silently — the only reason we caught the BM25 normalization
+ bug was a manual run.
 
  **Acceptance.** Each `bin/run-evals` run writes
- `spec/benchmarks/results/<version>.json`. New `bin/bench-diff` (or rake task)
- compares against the last tagged version's JSON and reports deltas. Release
- script (`/release` skill) reads it and refuses to ship on regressions over a
- configurable threshold.
+ `spec/benchmarks/results/<version>.json`. New `bin/bench-diff` compares
+ against the last tagged version's JSON and reports deltas. `/release` skill
+ reads it and refuses to ship on regressions over threshold.
+
+ **Why this release.** The semver commitment in 1.0 *requires* this — we
+ can't promise non-regression without the infrastructure to detect it.
 
- **Why must-have.** Without longitudinal tracking, every benchmark we run is a
- snapshot. 1.0 is the moment we commit to *not regressing* what we ship.
+ improvements.md entry: *#52 Benchmark Scoreboard Diff*. Effort: 1d.
 
- improvements.md entry: *Benchmark Scoreboard Diff*
+ **Ship target:** ~4 weeks from 0.10.0 (end of May 2026).
 
  ---
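
The `bin/bench-diff` acceptance in item #6 implies a simple shape: load two per-version JSON result files, compute deltas, flag regressions over a threshold. A hypothetical sketch (the tool does not exist yet; the file format, metric names, and helper name are illustrative):

```ruby
# Hypothetical sketch of the bench-diff acceptance criteria: compare two
# benchmark-result JSON files and flag regressions beyond a threshold.
# Treats every metric as higher-is-better, which is itself an assumption.
require "json"

def bench_diff(baseline_path, candidate_path, threshold: 0.01)
  baseline = JSON.parse(File.read(baseline_path))
  candidate = JSON.parse(File.read(candidate_path))

  baseline.map do |metric, old_score|
    delta = candidate.fetch(metric) - old_score
    { metric: metric, delta: delta.round(4),
      regression: delta < -threshold } # worse by more than the threshold
  end
end

# Usage sketch: two temp files standing in for results/<version>.json.
require "tempfile"
old_f = Tempfile.new("bench")
old_f.write({ "recall" => 0.82, "harm_rate" => 0.0 }.to_json); old_f.flush
new_f = Tempfile.new("bench")
new_f.write({ "recall" => 0.78, "harm_rate" => 0.0 }.to_json); new_f.flush

p bench_diff(old_f.path, new_f.path)
# => [{:metric=>"recall", :delta=>-0.04, :regression=>true},
#     {:metric=>"harm_rate", :delta=>0.0, :regression=>false}]
```

A real gate would invert the sign for cost-like metrics (harm rate, token cost), where an increase is the regression.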
 
- ## Strong post-1.0
+ ## 0.12.x → 1.0 — soak period (2-3 weeks)
+
+ Critical phase. Run 0.12 against real usage. Watch:
+
+ - **Harm rate stays at 0%** — release gate from #3
+ - **Hallucination rate trend** — from #2
+ - **Token budget growth** — from #1, #9
+ - **Utilization ratio** — across multiple projects
+
+ If any signal shifts unfavorably during soak, fix it in 0.12.x. **Don't ship
+ 1.0 from a release that hasn't observed itself for ≥2 weeks.**
+
+ This soak period is also where the relevance-ratio metric (#31 from 0.10.0)
+ gets its first real-mode measurement, and where the 0.11 trust signals get a
+ chance to become real numbers rather than theory.
 
- These shouldn't block 1.0 but should land in the next release window.
+ ---
+
+ ## 1.0.0 — "Stable Memory"
 
- ### 7. First-week ROI nudge
+ Theme: *ready for daily use, ready to recommend.*
 
- SessionEnd hook prints `memory contributed N facts this session, %used = X`
- inline for the first ~10 sessions. Closes the cold-start gap where new users
- don't see value because they don't think to look.
+ ### Post-1.0-punchlist polish (if landed during soak)
 
- improvements.md entry: *First-Week ROI Nudge*
+ These were originally post-1.0 in the punchlist; if soak time permits, they
+ land in 1.0. Otherwise they ship in 1.1.
 
- ### 8. Real-session repeat-correction detector
+ ### #8 Real-session repeat-correction detection
 
- The repeat-correction benchmark (#32) is synthetic; production has no
- equivalent signal. Analyze `activity_events` to detect "this fact was injected
- last session, the user re-stated it this session" — that's where memory is
- silently failing.
+ The repeat-correction benchmark (#32 from 0.10.0) is synthetic; production
+ has no equivalent signal. Analyze `activity_events` for "this fact was
+ injected last session, the user re-stated it this session" — that's where
+ memory is silently failing.
 
- → improvements.md entry: *Real-Session Repeat-Correction Detection*
+ → improvements.md entry: *#54 Real-Session Repeat-Correction Detection*.
+ Effort: 2d.
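The detection described above could be sketched like this — assuming a simplified event shape (`{session:, type:, fact_id:}`); the real `activity_events` schema is not documented here and may differ.

```ruby
# Hypothetical sketch of repeat-correction detection. The idea: a fact that
# was injected in one session and then re-stated by the user in the next
# session suggests the injection wasn't seen or wasn't trusted. The event
# shape below is an assumption, not the gem's actual activity_events schema.
def repeat_corrections(events)
  by_session = events.group_by { |e| e[:session] }
  sessions   = by_session.keys.sort

  sessions.each_cons(2).flat_map do |prev, curr|
    injected = by_session[prev].select { |e| e[:type] == :injected }.map { |e| e[:fact_id] }
    restated = by_session[curr].select { |e| e[:type] == :user_stated }.map { |e| e[:fact_id] }

    # Intersection = facts memory supplied that the user still had to repeat.
    (injected & restated).map do |fact|
      { fact_id: fact, injected_in: prev, restated_in: curr }
    end
  end
end
```

Each hit is a concrete "memory silently failed here" data point that a weekly digest could surface.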
 
- ### 9. Token-cost growth tracking
+ ### #9 Token-cost growth tracking
 
  Builds on #1. Weekly digest reports "context cost grew X% over 30d" as an
  anomaly signal that the DB is bloating or context injection is going wide.
 
- → improvements.md entry: *Token-Cost Growth Tracking*
+ → improvements.md entry: *#55 Token-Cost Growth Tracking*. Effort: 3h after
+ #1 lands.
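The growth signal could be computed along these lines — a sketch assuming a list of daily SessionStart token totals, oldest first, which is not necessarily how the gem stores its telemetry.

```ruby
# Hypothetical sketch of the "context cost grew X% over 30d" signal:
# compare the average daily token cost of the most recent window against
# the window before it. Input shape (daily totals, oldest first) is an
# assumption for illustration.
def context_cost_growth(daily_tokens, window: 30)
  return nil if daily_tokens.size < window * 2 # not enough history yet

  prev_avg = daily_tokens[-(window * 2), window].sum.to_f / window
  return nil if prev_avg.zero?

  curr_avg = daily_tokens.last(window).sum.to_f / window
  ((curr_avg - prev_avg) / prev_avg * 100).round(1)
end
```

A digest could then flag anything above, say, +25% as "DB bloating or injection going wide" and link to `stats --tokens` for the breakdown.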
 
- ### 10. Drift dashboard
+ ### #10 Drift dashboard
 
  Snapshot `census` weekly, surface predicate distribution shifts on the
  dashboard. Answers "is my fact base going off?" without a manual audit.
 
- → improvements.md entry: *Drift Dashboard*
+ → improvements.md entry: *#56 Drift Dashboard*. Effort: 1.5d.
+
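One way to score a shift between two weekly `census` snapshots is total variation distance — a sketch; the `{"predicate" => count}` snapshot shape is an assumption, not the gem's actual census format.

```ruby
# Hypothetical drift score between two predicate-count snapshots using
# total variation distance: 0.0 = identical distributions, 1.0 = disjoint.
# The {"predicate" => count} shape is assumed for illustration.
def predicate_drift(old_counts, new_counts)
  keys      = old_counts.keys | new_counts.keys
  old_total = old_counts.values.sum.to_f
  new_total = new_counts.values.sum.to_f
  return 0.0 if old_total.zero? || new_total.zero?

  keys.sum do |k|
    (old_counts.fetch(k, 0) / old_total - new_counts.fetch(k, 0) / new_total).abs
  end / 2.0
end
```

A dashboard could plot this week-over-week and highlight any jump, answering "is my fact base going off?" with one number.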
+ ### #11 API stability audit (NEW — added 2026-04-28)
+
+ **Gap.** "1.0 commits to semver" is meaningless without an explicit
+ public/internal split. Many of the surfaces touched in 0.9.0 / 0.10.0
+ (MCP tool schemas, hook payload shapes, CLI flags, dashboard endpoints)
+ have evolved organically and aren't formally documented as stable vs.
+ internal.
+
+ **Acceptance.**
+
+ - New `docs/api_stability.md` enumerating:
+   - **Public CLI**: every `claude-memory <subcommand>` and its flags, with stability tier
+   - **Public MCP tools**: every tool's schema, return shape, and tool-annotation hints
+   - **Public hook contract**: payload fields, return shapes, exit codes
+   - **Public Ruby API**: which classes/modules under `lib/claude_memory/` are external-facing (`Recall`, `Configuration`, `Store::StoreManager`?) vs. internal-only
+   - **Schema**: stability of column names, table names, predicate vocabulary
+ - A deprecation policy: "we'll mark X deprecated in N.x.0 and remove no earlier than (N+1).0.0"
+ - README + CLAUDE.md link to the new doc as the authoritative source
+
+ **Why this release.** Without this, the 1.0 semver promise is vibes, not a
+ contract. Future regressions in non-listed areas can be argued away; future
+ regressions in listed areas are bugs. This forces us to be honest about what
+ we're committing to.
+
+ → improvements.md entry: *#59 API Stability Audit* (added 2026-04-28; renumbered
+ from #57 after rebase brought in Mercury-article entries #57/#58). Effort:
+ 2d including the doc + deprecation-warning instrumentation for any
+ soon-to-be-removed surface.
+
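The deprecation-warning instrumentation mentioned above could be as small as a warn-once helper — illustrative only; the module and method names below are not the gem's real API.

```ruby
# Hypothetical sketch of warn-once deprecation instrumentation, naming the
# removal version per the policy ("deprecated in N.x.0, removed no earlier
# than (N+1).0.0"). Names here are illustrative, not the gem's actual API.
module MemoryDeprecations
  @warned = {}

  # Returns true the first time a surface is flagged, false on repeats,
  # so callers can assert "warned exactly once" in specs.
  def self.warn_once(surface, deprecated_in:, removal_earliest:)
    return false if @warned[surface]

    @warned[surface] = true
    warn "[claude-memory] #{surface} is deprecated as of #{deprecated_in} " \
         "and may be removed in #{removal_earliest} or later."
    true
  end
end
```

Wiring every surface slated for removal through a helper like this makes the deprecation policy mechanically checkable rather than doc-only.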
+ ### Release framing
+
+ The README + CHANGELOG framing for 1.0 explicitly states:
+
+ - "We measured X harm rate, Y utilization, Z hallucination rate across N
+   projects over W weeks before tagging this."
+ - The public API surface is documented at `docs/api_stability.md`
+ - The deprecation policy is explicit
+
+ **Ship target:** 6-8 weeks from 0.10.0 (mid-June 2026 at current velocity).
 
  ---
 
@@ -168,23 +323,49 @@ dashboard. Answers "is my fact base going off?" without a manual audit.
  drawers cover the primary need.
  - **#45 Live SSE/WebSocket feed** — polling is adequate; dashboard polish, not
  a confidence gap.
+ - **#23 REST API endpoint** — MCP covers the primary use case; defer to 1.x.
+ - **#25 HTTP MCP transport** — no startup-latency complaint to motivate it yet.
 
  ---
 
- ## Sequencing recommendation
+ ## Risk to flag now
+
+ The biggest hidden risk in this plan is **that the harm benchmark (#3) finds
+ something.** If 10-15 scenarios with intentionally wrong facts produce a >1%
+ harm rate, that's a fundamental retrieval-discipline issue that could push
+ 1.0 back by months. The 3-scenario prototype in 0.11 (above) is specifically
+ designed to surface this risk earlier.
+
+ ---
+
+ ## Velocity assumptions
+
+ Based on actual release cadence (Mar-Apr 2026):
+
+ | Pair | Days |
+ |---|---|
+ | 0.7.0 → 0.7.1 | patch (days) |
+ | 0.7.1 → 0.8.0 | 17 |
+ | 0.8.0 → 0.9.0 | 17 |
+ | 0.9.0 → 0.9.1 | same day (patch) |
+ | 0.9.1 → 0.10.0 | 12 |
 
- Smallest set that materially shifts 1.0 confidence (~2 days):
+ Average ~2 weeks per minor, with substantial work landing each cycle.
 
- 1. **Token budget telemetry** (#1) closes the loudest critique.
- 2. **CLAUDE.md baseline publish** (#4) — adapter already built, one report change.
- 3. **Hallucination rate** (#2) reuses ReferenceMaterialDetector.
+ | Milestone | Estimated work | Calendar target |
+ |---|---|---|
+ | 0.10.x patches | reactive | as needed |
+ | 0.11.0 | ~1 week | ~2026-05-12 |
+ | 0.12.0 | ~1 week | ~2026-05-26 |
+ | Soak | 2-3 weeks | through ~2026-06-16 |
+ | 1.0.0 | 1-2 days release prep + #11 | ~2026-06-16 to 2026-06-23 |
 
- Then in roughly priority order: `claude-memory show` (#5), harm benchmark
- (#3), scoreboard (#6). Post-1.0 items follow naturally once the must-haves
- land.
+ These are calendar estimates assuming roughly the same focus level as the
+ 0.10.0 cycle. Real cadence will adjust based on what surfaces during soak.
 
  ---
 
- *Last updated: 2026-04-28 initial punchlist drawn from session-end critique
- of observability/outcome gaps. Each entry will be elaborated with concrete
- file:line refs in improvements.md as it's worked.*
+ *Last updated: 2026-04-28 (post-0.10.0). Restructured around milestone
+ versions per the path-to-1.0 plan. #7 moved up from post-1.0 to 0.11; #11
+ API stability audit added as a new 1.0 must-have; the 3-scenario harm
+ prototype added to 0.11 as de-risking work for the full 0.12 benchmark.*
@@ -593,8 +593,10 @@ Now that you're up and running:
  | `claude-memory changes` | Recent updates |
  | `claude-memory conflicts` | Show contradictions |
  | `claude-memory dashboard` | Open the local web UI (0.10.0+) |
- | `claude-memory digest --since 7` | Markdown report of the last 7 days (0.10.0+) |
+ | `claude-memory digest --since 7` | Markdown report of the last 7 days (0.10.0+; gains Context cost + Quality sections in 0.11.0) |
+ | `claude-memory show [--pending] [--source]` | Print what memory would inject at next SessionStart (0.11.0+) |
  | `claude-memory stats --stale` | List facts not recalled recently (0.10.0+) |
+ | `claude-memory stats --tokens [--since DAYS]` | SessionStart context-token budget histogram (0.11.0+) |
  | `claude-memory stats --tools` | MCP tool-call telemetry (0.9.0+) |
  | `claude-memory census` | Privacy-safe predicate audit across projects (0.10.0+) |
  | `claude-memory dedupe-conflicts --dry-run` | Preview historical conflict-row dedup (0.10.0+) |
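For context on the new `stats --tokens` row: a token-budget histogram can be bucketed along these lines — a sketch with invented bucket widths and rendering, assuming per-session SessionStart token counts as input.

```ruby
# Hypothetical sketch of a SessionStart token-budget histogram like the
# one `stats --tokens` might print. The 500-token bucket width and the
# ASCII bar rendering are invented for illustration.
def token_histogram(session_tokens, bucket_size: 500)
  buckets = session_tokens.group_by { |tokens| tokens / bucket_size }

  buckets.keys.sort.map do |b|
    count = buckets[b].size
    lo, hi = b * bucket_size, (b + 1) * bucket_size - 1
    format("%5d-%-5d | %-20s %d", lo, hi, "#" * count, count)
  end
end
```

Printing one line per occupied bucket keeps the output diff-friendly, which matters once results feed the release-to-release scoreboard.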