claude_memory 0.10.0 → 0.12.0

Files changed (72)
  1. checksums.yaml +4 -4
  2. data/.claude/memory.sqlite3 +0 -0
  3. data/.claude/rules/claude_memory.generated.md +42 -64
  4. data/.claude/skills/release/SKILL.md +44 -6
  5. data/.claude/skills/study-repo/SKILL.md +15 -0
  6. data/.claude-plugin/commands/audit-memory.md +68 -0
  7. data/.claude-plugin/marketplace.json +1 -1
  8. data/.claude-plugin/plugin.json +1 -1
  9. data/CHANGELOG.md +70 -0
  10. data/CLAUDE.md +20 -5
  11. data/README.md +64 -2
  12. data/db/migrations/018_add_otel_telemetry.rb +81 -0
  13. data/docs/1_0_punchlist.md +522 -89
  14. data/docs/GETTING_STARTED.md +3 -1
  15. data/docs/api_stability.md +341 -0
  16. data/docs/architecture.md +3 -3
  17. data/docs/audit_runbook.md +209 -0
  18. data/docs/claude_monitoring.md +956 -0
  19. data/docs/dashboard.md +23 -3
  20. data/docs/improvements.md +329 -5
  21. data/docs/influence/ai-memory-systems-2026.md +403 -0
  22. data/docs/memory_audit_2026-05-21.md +303 -0
  23. data/docs/plugin.md +1 -1
  24. data/docs/quality_review.md +35 -0
  25. data/lib/claude_memory/audit/checks.rb +239 -0
  26. data/lib/claude_memory/audit/finding.rb +33 -0
  27. data/lib/claude_memory/audit/runner.rb +73 -0
  28. data/lib/claude_memory/commands/audit_command.rb +117 -0
  29. data/lib/claude_memory/commands/dashboard_command.rb +2 -1
  30. data/lib/claude_memory/commands/digest_command.rb +95 -3
  31. data/lib/claude_memory/commands/hook_command.rb +27 -2
  32. data/lib/claude_memory/commands/import_auto_memory_command.rb +180 -0
  33. data/lib/claude_memory/commands/initializers/hooks_configurator.rb +7 -4
  34. data/lib/claude_memory/commands/otel_command.rb +240 -0
  35. data/lib/claude_memory/commands/registry.rb +5 -1
  36. data/lib/claude_memory/commands/show_command.rb +90 -0
  37. data/lib/claude_memory/commands/stats_command.rb +94 -2
  38. data/lib/claude_memory/configuration.rb +60 -0
  39. data/lib/claude_memory/core/fact_query_builder.rb +1 -0
  40. data/lib/claude_memory/dashboard/api.rb +8 -0
  41. data/lib/claude_memory/dashboard/index.html +140 -1
  42. data/lib/claude_memory/dashboard/prompt_journey.rb +48 -0
  43. data/lib/claude_memory/dashboard/server.rb +86 -0
  44. data/lib/claude_memory/dashboard/telemetry.rb +156 -0
  45. data/lib/claude_memory/dashboard/trust.rb +180 -11
  46. data/lib/claude_memory/deprecations.rb +106 -0
  47. data/lib/claude_memory/distill/bare_conclusion_detector.rb +71 -0
  48. data/lib/claude_memory/distill/reference_material_detector.rb +37 -4
  49. data/lib/claude_memory/hook/auto_memory_mirror.rb +7 -3
  50. data/lib/claude_memory/hook/context_injector.rb +11 -2
  51. data/lib/claude_memory/hook/handler.rb +142 -1
  52. data/lib/claude_memory/mcp/tool_definitions.rb +3 -3
  53. data/lib/claude_memory/otel/attributes.rb +118 -0
  54. data/lib/claude_memory/otel/constants.rb +32 -0
  55. data/lib/claude_memory/otel/ingestor.rb +54 -0
  56. data/lib/claude_memory/otel/otlp_json_envelope.rb +254 -0
  57. data/lib/claude_memory/otel/prompt_scope.rb +108 -0
  58. data/lib/claude_memory/otel/settings_writer.rb +122 -0
  59. data/lib/claude_memory/otel/status.rb +58 -0
  60. data/lib/claude_memory/recall/staleness_annotator.rb +73 -0
  61. data/lib/claude_memory/resolve/predicate_policy.rb +17 -1
  62. data/lib/claude_memory/resolve/resolver.rb +30 -3
  63. data/lib/claude_memory/shortcuts.rb +61 -18
  64. data/lib/claude_memory/store/prompt_journey_query.rb +87 -0
  65. data/lib/claude_memory/store/schema_manager.rb +1 -1
  66. data/lib/claude_memory/store/sqlite_store.rb +136 -0
  67. data/lib/claude_memory/sweep/maintenance.rb +31 -1
  68. data/lib/claude_memory/sweep/sweeper.rb +6 -0
  69. data/lib/claude_memory/templates/hooks.example.json +5 -0
  70. data/lib/claude_memory/version.rb +1 -1
  71. data/lib/claude_memory.rb +20 -0
  72. metadata +28 -1
@@ -1,10 +1,13 @@
1
1
  # 1.0 Punchlist
2
2
 
3
- *Created: 2026-04-28*
3
+ *Created: 2026-04-28. Restructured 2026-04-28 (post-0.10.0 release) around
4
+ milestone versions per the path-to-1.0 plan. Re-oriented 2026-05-27 to
5
+ acknowledge OTel + audit-toolkit landings and re-anchor on the three
6
+ 1.0 pillars.*
4
7
 
5
8
  The remaining work for a stable 1.0 release. Distinct from `improvements.md` —
6
9
  that file tracks the long tail of inbound study/idea entries; this file tracks
7
- **what blocks 1.0 confidence**.
10
+ **what blocks 1.0 confidence and which release each item ships in**.
8
11
 
9
12
  Guiding question: *a skeptical Ruby developer should be able to look at one
10
13
  screen and say "yes, this is helping, here's the evidence" without trusting our
@@ -12,15 +15,58 @@ marketing.* Today the dashboard tells that story in pieces but not as a
12
15
  headline. Each item below closes a specific gap that prevents that headline
13
16
  from existing.
14
17
 
18
+ ## What 1.0 commits to
19
+
20
+ Not "feature complete" — semver commitment. Once we ship 1.0:
21
+
22
+ - Public APIs (CLI surface, MCP tool schemas, hook payload shapes) lock to semver
23
+ - Schema migrations stay forward-compatible per the round-trip-spec convention
24
+ - The trust signals we ship have a baseline measurement other releases must beat
25
+
26
+ So 1.0 isn't gated by features. It's gated by **the measurement infrastructure
27
+ being trustworthy enough to defend a 1.0 claim.** That's why this punchlist is
28
+ mostly observability, not capability.
29
+
30
+ ### The three 1.0 pillars
31
+
32
+ Restated 2026-05-27 to ground prioritization decisions:
33
+
34
+ 1. **Stability** — semver-locked CLI / MCP / hook / Ruby API contracts, schema
35
+ round-trip discipline, deprecation policy. Anchored by `docs/api_stability.md`
36
+ (#11 ✅) and the round-trip-spec convention.
37
+ 2. **Visibility** — a skeptical user can see what memory costs, what memory
38
+ contains, what memory contributed, and what is wrong with it, on one screen,
39
+ in <30s, without trusting our marketing. Anchored by the Trust panel, the
40
+ digest, OTel ingestion, and the new `claude-memory audit` toolkit.
41
+ 3. **Long-horizon quality** — over weeks and months, the repo demonstrably
42
+ improves session quality rather than degrading it. Anchored by the harm
43
+ benchmark (#3, the actual release gate), the CLAUDE.md headline baseline
44
+ (#4), repeat-correction detection (#8), and the drift dashboard (#10).
45
+
46
+ Every 0.12 item maps to one of those pillars; an item that doesn't map is a
47
+ 1.x feature, not a 1.0 gate. The audit toolkit and OTel landed during 0.12
48
+ because they directly serve pillars 1 and 2 — not as scope creep, but as work
49
+ the original punchlist didn't anticipate would be needed.
50
+
15
51
  Items are cross-linked to the canonical entry in `improvements.md` where the
16
52
  implementation detail and acceptance criteria live. This file is the
17
53
  prioritization view; that file is the work view.
18
54
 
19
55
  ---
20
56
 
21
- ## Must-have for 1.0
57
+ ## 0.10.x patch as needed (now)
22
58
 
23
- ### 1. Token budget telemetry *what does memory cost?*
59
+ Reactive only. Real usage will surface issues; cut a patch when one shows up.
60
+ No proactive minor work here.
61
+
62
+ ---
63
+
64
+ ## 0.11.0 — "Trust & Cost" (~1 week of work)
65
+
66
+ Theme: *users can see what memory costs and whether it's helping.* Each item
67
+ adds a number a skeptical user can read.
68
+
69
+ ### #1 Token budget telemetry — *what does memory cost?* ✅ landed 2026-04-29
24
70
 
25
71
  **Gap.** `Core::TokenEstimator` exists and is unused outside one helper. We
26
72
  have no idea what % of the SessionStart token budget memory consumes per
@@ -30,13 +76,18 @@ session, how it scales with DB size, or whether it's growing.
30
76
  tokens per session over the last 30 days. Per-session count rides on every
31
77
  `hook_context` activity event so the data is queryable post-hoc.
32
78
 
33
- **Why must-have.** "Costs you tokens forever" is the strongest critique of any
34
- context-injection memory system; if we can't answer it numerically, we can't
35
- defend the trade.
79
+ **Why this release.** Loudest critique of any context-injection memory
80
+ system; if we can't answer it numerically, we can't defend the trade.
36
81
 
37
- → improvements.md entry: *Token Budget Telemetry*
82
+ **Status.** Landed in 4 atomic commits on 2026-04-29 (15cb5f5, 35ae8d2,
83
+ d9601ca, 5bfd7c8). `context_tokens` recorded on every successful
84
+ `hook_context` event, surfaced via `Dashboard::Trust#token_budget`,
85
+ `claude-memory digest` "Context cost" section, and
86
+ `claude-memory stats --tokens [--since DAYS]` with histogram.
38
87
 
39
- ### 2. Hallucination rate as a first-class trust metric
88
+ → improvements.md entry: *#47 Token Budget Telemetry*. Effort: 4-6h.
89
+
90
+ ### #2 Hallucination rate as a first-class trust metric ✅ landed 2026-04-29
40
91
 
41
92
  **Gap.** `ReferenceMaterialDetector` already classifies suspect facts and we
42
93
  know from the #34 audit that ~25% of facts had embedded reasoning (i.e.
@@ -48,48 +99,16 @@ suspect-fact ratio + bare-conclusion ratio over active facts in both stores.
48
99
  Digest includes a 30-day rejection rate ("how much of what we extracted got
49
100
  rejected within a week?") so calibration drift is visible.
50
101
 
51
- **Why must-have.** We can't claim "memory is helping" if we can't show "memory
52
- isn't poisoning the well."
53
-
54
- → improvements.md entry: *Hallucination Rate Metric*
55
-
56
- ### 3. Negative-fact harm benchmark
57
-
58
- **Gap.** Every benchmark we run today measures whether memory **helps**.
59
- Nothing measures whether memory **harms** — i.e. injects a wrong fact and
60
- Claude follows it. Without this, "memory helps" is unfalsifiable.
102
+ **Why this release.** Pollution rate matters as much as recall rate. Pairs
103
+ with #1 — together they answer the "is this still worth it?" question.
61
104
 
62
- **Acceptance.** New `spec/benchmarks/dataset/harm_scenarios.yml` with 10–15
63
- cases where memory holds a stale or wrong fact. Each case scores `harm` if
64
- Claude's response follows the wrong fact, `safe` otherwise. Wired into
65
- `bin/run-evals`. >1% harm rate blocks release.
105
+ **Status.** Landed in 3 atomic commits on 2026-04-29 (27fa6af, 4d1c5bf,
106
+ 0b72fa4). New `Distill::BareConclusionDetector` + `Dashboard::Trust#quality_score`
107
+ + `claude-memory digest` Quality section with rejection rate.
66
108
 
67
- **Why must-have.** A retrieval system that occasionally makes Claude *wrong*
68
- is strictly worse than no memory; we need a release gate that proves we're
69
- not in that regime.
109
+ → improvements.md entry: *#48 Hallucination Rate Metric*. Effort: 1d.
70
110
 
71
- → improvements.md entry: *Negative-Fact Harm Benchmark*
72
-
73
- ### 4. Publish the CLAUDE.md baseline in headline E2E results
74
-
75
- **Gap.** `claude_md_adapter` exists in `spec/benchmarks/comparative/adapters/`
76
- and supports E2E. The adapter is wired into `comparative_helper.rb` but the
77
- README's headline comparative table doesn't include it. The single most
78
- important question for adoption — *"is this better than a hand-written
79
- CLAUDE.md?"* — is currently unanswered in our published numbers.
80
-
81
- **Acceptance.** Comparative E2E report includes `CLAUDE.md baseline` row in
82
- `spec/benchmarks/README.md` and in `bin/run-evals --comparative` summary
83
- output. README explicitly states the win/loss versus the static baseline.
84
-
85
- **Why must-have.** Cheapest item on the list — adapter already built, just
86
- surface the number. If we can't beat a static CLAUDE.md on developer
87
- scenarios, that's the loudest possible signal that the rest of the system
88
- needs work; if we can, that's the headline 1.0 brag.
89
-
90
- → improvements.md entry: *CLAUDE.md Baseline in Headline Results*
91
-
92
- ### 5. `claude-memory show` — human-readable "what would be injected"
111
+ ### #5 `claude-memory show` — human-readable "what would be injected" ✅ landed 2026-04-29
93
112
 
94
113
  **Gap.** Inspecting memory state today requires the dashboard or several CLI
95
114
  commands (`recall`, `stats`, `census`). The CLAUDE.md alternative is
@@ -101,64 +120,426 @@ path real sessions use, prints what would be injected next session in plain
101
120
  English (not JSON), sized to fit a terminal, with predicate-grouped sections
102
121
  matching the snapshot format.
103
122
 
104
- **Why must-have.** Trust requires inspectability. A user who can't see what
123
+ **Why this release.** Trust requires inspectability. A user who can't see what
105
124
  memory will inject can't develop confidence in it.
106
125
 
107
- → improvements.md entry: *claude-memory show*
126
+ **Status.** Landed 2026-04-29 (commit 2586bb3). New `Commands::ShowCommand`
127
+ runs `Hook::ContextInjector` and prints the would-be-injected Markdown.
128
+ Default suppresses the raw-transcript pending-knowledge dump for
129
+ readability (`--pending` opts in). Footer reports fact count, token
130
+ estimate, char count.
131
+
132
+ → improvements.md entry: *#51 claude-memory show*. Effort: ½d.
133
+
134
+ ### #7 First-week ROI nudge — *moved up from post-1.0* ✅ landed 2026-04-30
135
+
136
+ **Gap.** New users install, run a few sessions, don't know whether memory is
137
+ working. The dashboard exists but they have to know to look.
138
+
139
+ **Acceptance.** SessionEnd hook prints `memory contributed N facts this
140
+ session, %used = X` inline for the first ~10 sessions, then quiets. Opt-out
141
+ via `CLAUDE_MEMORY_NO_NUDGE=1`.
142
+
143
+ **Why this release.** Belongs with the trust theme — it's the user-visible
144
+ proof that memory is doing work for them. Originally listed as post-1.0;
145
+ elevating because cold-start trust deserves to land before 1.0.
146
+
147
+ **Status.** Landed in 2 atomic commits on 2026-04-30 (f450ed9, 3acce93)
148
+ plus production smoke-test against this project's DB (event #229
149
+ recorded with n=11, used=0, pct=0 for a real session_id). New
150
+ `Hook::Handler#nudge` + `claude-memory hook nudge`; SessionEnd config
151
+ appends nudge after ingest+sweep. Silent on opt-out, missing
152
+ session_id, n=0, or first-week-complete (so empty sessions don't burn
153
+ slots).
154
+
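The silence rules in the status note above can be read as a small guard chain. The following is a minimal sketch under stated assumptions, not the gem's `Hook::Handler#nudge`: the method name `nudge_line`, its keyword arguments, and the `FIRST_WEEK_SESSIONS` constant are hypothetical; only the `CLAUDE_MEMORY_NO_NUDGE` variable, the four silent cases, and the message wording come from the punchlist.

```ruby
# Hypothetical sketch of the first-week nudge decision; names are illustrative.
FIRST_WEEK_SESSIONS = 10 # the punchlist says "the first ~10 sessions"

# Returns the line to print, or nil when the nudge should stay silent.
def nudge_line(session_id:, injected:, used:, sessions_nudged:, env: ENV)
  return nil if env["CLAUDE_MEMORY_NO_NUDGE"] == "1"      # explicit opt-out
  return nil if session_id.nil? || session_id.empty?      # missing session_id
  return nil if injected.zero?                            # n=0: empty session, don't burn a slot
  return nil if sessions_nudged >= FIRST_WEEK_SESSIONS    # first week complete

  pct = (100.0 * used / injected).round
  "memory contributed #{injected} facts this session, %used = #{pct}"
end
```

With the smoke-test values quoted above (n=11, used=0), this sketch would return `memory contributed 11 facts this session, %used = 0`.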
155
+ → improvements.md entry: *#53 First-Week ROI Nudge*. Effort: ½d.
156
+
157
+ ### Risk-de-risking — 3-scenario harm prototype ✅ landed 2026-04-30
158
+
159
+ Before 0.12 builds the full 10-15-scenario harm benchmark (see #3), run a
160
+ 3-scenario prototype against the 0.10.0 codebase to confirm whether harm is
161
+ actually low. If the prototype surfaces a >0% harm rate on simple cases, the
162
+ full benchmark in 0.12 will reveal a fundamental issue — better to know at
163
+ 0.11 than discover at 0.12.
164
+
165
+ **Acceptance.** Three hand-written `harm_scenarios.yml` cases (one stale-tech,
166
+ one mismatched-scope, one superseded-but-undetected) run against real Claude
167
+ under `EVAL_MODE=real`. Reports go/no-go on the larger benchmark in 0.12.
108
168
 
109
- ### 6. Release-to-release benchmark scoreboard
169
+ **Status.** Landed 2026-04-30 (commit 35b368e). Three cases written:
170
+ `harm_stale_tech` (MySQL fact vs SQLite reality), `harm_mismatched_scope`
171
+ (global TS/Tailwind preference applied to a Ruby gem),
172
+ `harm_superseded_undetected` (two contradicting auth_method facts both
173
+ active). Structure validation passes in stub mode. Real-mode is gated
174
+ behind `EVAL_MODE=real` (~$2-8 per run) so the operator decides when to
175
+ spend; this prototype reports harm rate but doesn't enforce a threshold
176
+ yet — that's the 0.12 release-gate work.
177
+
178
+ → improvements.md entry: *#49 Negative-Fact Harm Benchmark* (prototype phase).
179
+ Effort: ½d.
180
+
181
+ **Ship target:** ~2 weeks from 0.10.0 (mid-May 2026 at current velocity).
182
+
183
+ ---
184
+
185
+ ## 0.12.0 — "Release Discipline + Observability + Self-Audit" (~4 weeks of work)
186
+
187
+ Theme: *we can't ship a regression without noticing, and we can see what's
188
+ happening inside.* Internal infrastructure that prevents future regressions,
189
+ plus the observability primitives the 1.0 visibility pillar requires, plus
190
+ the self-audit toolkit that catches drift in our own DB.
191
+
192
+ *Restructured 2026-05-01: #11 (API stability audit) promoted from 1.0
193
+ because the scoreboard #6 needs an explicit stable-surface list to gate
194
+ against; new #12 (pre-release hook smoke gate) added to codify the
195
+ verification convention that surfaced during 0.11 work.*
196
+
197
+ *Restructured 2026-05-27: theme widened from "Release Discipline" to
198
+ acknowledge two unplanned but on-mission work tracks that landed during the
199
+ 0.12 window — the OTel observability primitives (~15 commits) and the audit
200
+ toolkit (#13). Both serve 1.0 pillars 1+2 directly and the punchlist now
201
+ reflects that.*
202
+
203
+ ### #3 Negative-fact harm benchmark (full 10-15 scenarios) — **in progress 2026-05-27 (Path B blocker)**
204
+
205
+ **Gap.** Every benchmark today measures whether memory **helps**. Nothing
206
+ measures whether memory **harms** — i.e. injects a wrong fact and Claude
207
+ follows it. Without this, "memory helps" is unfalsifiable. This is the
208
+ single 0.12 item that directly serves pillar 3 (long-horizon quality);
209
+ shipping 0.12 without it would tag a release whose central claim is
210
+ unmeasured.
211
+
212
+ **Acceptance.** `spec/benchmarks/dataset/harm_scenarios.yml` with 10-15 cases
213
+ spanning four harm classes (stale-tech, mismatched-scope, superseded-but-
214
+ undetected, reference-material-as-fact). Each scores `harm` if Claude follows
215
+ the wrong fact, `safe` otherwise. Wired into `bin/run-evals`. **>1% harm
216
+ rate blocks release** (configurable via `HARM_RATE_THRESHOLD`).
217
+
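To make the release-gate arithmetic concrete, here is a minimal sketch of how a harm rate could be computed and compared against the threshold. It is not the code behind `bin/run-evals`: the method names are hypothetical, and it assumes `HARM_RATE_THRESHOLD` is expressed as a fraction (0.01 for 1%), which the punchlist does not specify.

```ruby
# Hypothetical sketch of the harm-rate gate; only HARM_RATE_THRESHOLD and the
# harm/safe labels come from the punchlist.
DEFAULT_HARM_RATE_THRESHOLD = 0.01 # ">1% harm rate blocks release"

# results: one :harm or :safe symbol per scenario.
def harm_rate(results)
  return 0.0 if results.empty?

  results.count(:harm).fdiv(results.size)
end

# True when the measured harm rate exceeds the (assumed fractional) threshold.
def release_blocked?(results, env: ENV)
  threshold = Float(env.fetch("HARM_RATE_THRESHOLD", DEFAULT_HARM_RATE_THRESHOLD))
  harm_rate(results) > threshold
end
```

Under these assumptions a 10-15 case corpus is effectively zero-tolerance at the default: one harmed scenario out of 13 is about 7.7%, well above 1%.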
218
+ **Why this release.** A retrieval system that occasionally makes Claude
219
+ *wrong* is strictly worse than no memory; the release gate proves we're not
220
+ in that regime.
221
+
222
+ → improvements.md entry: *#49 Negative-Fact Harm Benchmark* (full corpus).
223
+ Effort: 2d.
224
+
225
+ ### #4 Publish the CLAUDE.md baseline in headline E2E results — **DEFERRED to 0.13 (2026-05-29): harness limitation**
226
+
227
+ **Gap.** `claude_md_adapter` exists in `spec/benchmarks/comparative/adapters/`
228
+ and is wired into `comparative_helper.rb`. The single most important question
229
+ for adoption — *"is this better than a hand-written CLAUDE.md?"* — is
230
+ unanswered in our published numbers.
231
+
232
+ **What happened.** The first real-mode comparative run (2026-05-28) returned
233
+ ClaudeMemory **0/10**, No-memory **0/10**, CLAUDE.md baseline **8/10** — and
234
+ investigation showed this is a *harness artifact, not a verdict*. The CLAUDE.md
235
+ adapter auto-loads every fact into context unconditionally; the ClaudeMemory
236
+ adapter relies on Claude proactively calling `memory.recall` MCP tools, which
237
+ `claude -p` headless mode doesn't do for these prompts (and the SessionStart
238
+ context hook injects only a generic top-5, not the specific fact each
239
+ LongMemEval-style scenario needs). So ClaudeMemory's retrieval path is never
240
+ exercised and it ties no-memory at 0. Publishing 0% vs 80% would actively
241
+ mislead and violate the visibility pillar's honest-numbers standard.
242
+
243
+ **Decision (2026-05-29).** Defer #4 to 0.13. It was never a release blocker
244
+ (the harm gate was, and it's green at 0/13). 0.12 ships without comparative
245
+ numbers; the README + benchmark README document the limitation honestly.
246
+
247
+ **0.13 acceptance.** Fix the harness so it fairly exercises ClaudeMemory's
248
+ retrieval — either (a) force memory-tool use (allowedTools + a recall-
249
+ encouraging system turn), or (b) inject the full fact set via the context
250
+ hook to match CLAUDE.md's "everything in context" model — then re-run and
251
+ publish the real win/loss.
252
+
253
+ → improvements.md entry: *#50 CLAUDE.md Baseline in Headline Results*.
254
+ Effort: harness fix ~1d + one real-mode run.
255
+
256
+ ### #16 Headless retrieval gap — *new observation 2026-05-29, investigate for 0.13*
257
+
258
+ **Observation.** The #4 comparative run surfaced a genuine (separable) product
259
+ concern: in fully headless, non-interactive `claude -p` usage with no
260
+ tool-forcing, Claude does **not** proactively call ClaudeMemory's `memory.recall`
261
+ MCP tools, so memory's contribution rides entirely on what the SessionStart
262
+ context hook injects (a generic top-5 decisions/conventions/architecture). For
263
+ *interactive* sessions — where Claude readily calls MCP tools — this isn't an
264
+ issue, and it's the primary use case. But the gap is real and worth measuring:
265
+ does the context-hook top-5 cover enough, or should headless usage get a richer
266
+ injection (or a recall-on-demand affordance)?
267
+
268
+ **Why not 0.12.** This is investigation, not a known fix, and it's orthogonal
269
+ to the 0.12 visibility/stability theme. Pair it with the #4 harness fix in 0.13
270
+ since both touch the same headless-retrieval seam.
271
+
272
+ → No improvements.md entry yet; originates from the 2026-05-28 comparative run.
273
+
274
+ ### #6 Release-to-release benchmark scoreboard ✅ landed 2026-05-01
110
275
 
111
276
  **Gap.** Benchmark output is textual today. Nothing diff-able across versions.
112
- Regressions land silently — the only reason we caught the FTS5/RRF
113
- normalization bug was a manual run.
277
+ Regressions land silently — the only reason we caught the BM25 normalization
278
+ bug was a manual run.
114
279
 
115
280
  **Acceptance.** Each `bin/run-evals` run writes
116
- `spec/benchmarks/results/<version>.json`. New `bin/bench-diff` (or rake task)
117
- compares against the last tagged version's JSON and reports deltas. Release
118
- script (`/release` skill) reads it and refuses to ship on regressions over a
119
- configurable threshold.
281
+ `spec/benchmarks/results/<version>.json`. New `bin/bench-diff` compares
282
+ against the last tagged version's JSON and reports deltas. `/release` skill
283
+ reads it and refuses to ship on regressions over threshold.
284
+
285
+ **Why this release.** The semver commitment in 1.0 *requires* this — we
286
+ can't promise non-regression without the infrastructure to detect it.
287
+
288
+ **Status.** Landed 2026-05-01. `bin/run-evals` writes
289
+ `spec/benchmarks/results/<version>.json` with diff-friendly pass-rate
290
+ metrics by category and per-scenario. `bin/bench-diff` compares against
291
+ the most recent prior tagged version's scoreboard via `Gem::Version`
292
+ ordering, flags pass-rate drops > threshold (default 5%), supports
293
+ `--threshold` / `--baseline` / `--json` / `--strict`. 11 unit specs
294
+ covering missing-baseline, threshold tuning, deep-nested metric paths,
295
+ JSON output. Wired into `/release` skill as new Phase 1 Step 7 (after
296
+ smoke gate, before lint). First release with the gate is 0.12.0 itself
297
+ — prior versions have no scoreboard, so bench-diff exits 0 with a "no
298
+ baseline" note; from 0.13 onward it actively gates.
299
+
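The baseline selection and threshold check described in the status note can be sketched as follows. This is an illustration, not the actual `bin/bench-diff`: the method names are hypothetical, the scoreboard is assumed to be a flat metric-to-pass-rate hash, and the 5% default is assumed to mean an absolute drop of 0.05 in pass rate. `Gem::Version` is the real RubyGems class.

```ruby
require "rubygems" # Gem::Version; normally already loaded

# Hypothetical sketch: pick the most recent prior version that has a scoreboard.
def baseline_version(current, available)
  cur = Gem::Version.new(current)
  available.map { |v| Gem::Version.new(v) }.select { |v| v < cur }.max
end

# Hypothetical sketch: report metrics whose pass rate dropped by more than
# `threshold` (assumed to be an absolute difference in pass-rate fraction).
def regressions(baseline, current, threshold: 0.05)
  baseline.each_with_object({}) do |(metric, old_rate), out|
    new_rate = current[metric]
    next if new_rate.nil? # metric absent in the current run; not scored here

    drop = old_rate - new_rate
    out[metric] = drop.round(4) if drop > threshold
  end
end
```

`Gem::Version` ordering matters because a plain string sort would place `0.9.1` after `0.10.0`. When no prior scoreboard exists, `baseline_version` returns `nil`, which corresponds to the "no baseline" case described above.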
300
+ → improvements.md entry: *#52 Benchmark Scoreboard Diff*. Effort: 1d.
301
+
302
+ ### #11 API stability audit — *promoted from 1.0 (2026-05-01)* ✅ landed 2026-05-01
303
+
304
+ **Gap.** "1.0 commits to semver" is meaningless without an explicit
305
+ public/internal split. Many of the surfaces touched in 0.9.0 / 0.10.0 / 0.11.0
306
+ (MCP tool schemas, hook payload shapes, CLI flags, dashboard endpoints,
307
+ `detail_json` field set) have evolved organically and aren't formally
308
+ documented as stable vs. internal.
309
+
310
+ **Acceptance.**
311
+
312
+ - New `docs/api_stability.md` enumerating:
313
+ - **Public CLI**: every `claude-memory <subcommand>` and its flags, with stability tier
314
+ - **Public MCP tools**: every tool's schema, return shape, and tool-annotation hints
315
+ - **Public hook contract**: payload fields, return shapes, exit codes, `detail_json` field set per event_type
316
+ - **Public Ruby API**: `Recall`, `Configuration`, `Store::StoreManager`, `Domain::*` vs. internal-only
317
+ - **Schema**: stability of column names, table names, predicate vocabulary
318
+ - Deprecation policy paragraph: "we'll mark X deprecated in N.x.0 (with a runtime warning), keep it functional for ≥1 minor cycle, and remove no earlier than (N+1).0.0"
319
+ - `ClaudeMemory::Deprecations.warn(name:, replacement:, removed_in:)` module wired up and used at least once so the mechanism is exercised
320
+ - README + CLAUDE.md link to the new doc as the authoritative source
321
+
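A deprecation helper with the keyword signature named above might look like the sketch below. It is deliberately placed in a hypothetical `DeprecationSketch` module so it is not mistaken for the gem's `ClaudeMemory::Deprecations`; the warn-once-per-name behavior, the `io:` parameter, and the message wording are assumptions made for illustration.

```ruby
# Hypothetical sketch of a deprecation-warning helper; not the gem's module.
module DeprecationSketch
  @seen = {}

  class << self
    # Emits one warning per deprecated name per process.
    # Returns the message, or nil if this name was already warned about.
    def warn(name:, replacement:, removed_in:, io: $stderr)
      return nil if @seen[name]

      @seen[name] = true
      message = "[DEPRECATION] #{name} is deprecated; use #{replacement} instead. " \
                "It will be removed no earlier than #{removed_in}."
      io.puts(message)
      message
    end

    # Clears the warned-names registry (useful in specs).
    def reset!
      @seen = {}
    end
  end
end
```

Warning once per name keeps hook output readable when a deprecated path is hit on every event; a real implementation might instead warn on every call or make this configurable.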
322
+ **Why this release.** #6's scoreboard needs to know what surfaces are stable
323
+ to gate against. Without #11, any "regression" finding is arguable. The
324
+ deprecation-warning module is also a prerequisite for any soft-rename work
325
+ during the 0.12 → 1.0 soak.
326
+
327
+ → improvements.md entry: *#59 API Stability Audit*. Effort: 2d.
328
+
329
+ ### #12 Pre-release hook smoke gate — *new this release (2026-05-01)* ✅ landed 2026-05-01
330
+
331
+ **Gap.** During 0.11 work, five commits landed for #47 token-budget telemetry
332
+ with 156 specs green. 24 hours of real SessionStart hook events recorded no
333
+ `context_tokens` field — because the *installed* gem was still 0.9.1 and the
334
+ `.claude/settings.json` hooks invoke the installed binary via PATH, not the
335
+ working tree. The bug wasn't in the code; the bug was in the release process.
336
+
337
+ This trap has been hit twice now (#47 in 0.11; an earlier ActivityLog
338
+ incident on 2026-04-16). It's documented in
339
+ `~/.claude/projects/.../memory/feedback_hooks_run_installed_gem.md` and as
340
+ two project conventions, but documentation hasn't stopped me (Claude) from
341
+ springing the trap again.
342
+
343
+ **Acceptance.**
344
+
345
+ - New `bin/pre-release-smoke` script: `rake install` → trigger each hook
346
+ with a synthetic payload → inspect `activity_events.detail_json` via
347
+ `sqlite3 json_extract` for expected fields per the current version → exit
348
+ non-zero if anything is null.
349
+ - Per-version expectation manifest at `spec/smoke/expected_fields.yml`
350
+ declares `{event_type, fields, since_version}` so new fields just need a
351
+ YAML append; no script changes per release.
352
+ - `/release` skill Phase 1 runs the smoke gate after specs and before lint.
353
+ Failure aborts before `git push`.
354
+ - Test: `spec/smoke/pre_release_smoke_spec.rb` validates the manifest schema
355
+ and that the exit-code logic correctly flips on simulated null fields.
356
+
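The manifest-driven check above can be illustrated with a short sketch. The manifest entry, the method names, and the use of `JSON.parse` are assumptions: the punchlist describes the real script as inspecting `detail_json` through `sqlite3 json_extract`, so parsing the JSON in Ruby here is a stand-in for that step, not a description of `bin/pre-release-smoke`.

```ruby
require "rubygems" # Gem::Version
require "yaml"
require "json"

# Illustrative manifest in the {event_type, fields, since_version} shape;
# the concrete entry is an assumption, not copied from the repository.
MANIFEST_YAML = <<~YAML
  - event_type: hook_context
    fields: [context_tokens]
    since_version: "0.11.0"
YAML

# Fields expected for an event type at a given gem version.
def expected_fields(manifest, event_type, version)
  v = Gem::Version.new(version)
  manifest
    .select { |e| e["event_type"] == event_type && Gem::Version.new(e["since_version"]) <= v }
    .flat_map { |e| e["fields"] }
end

# Expected fields that are absent or null in a recorded detail_json string.
def missing_fields(detail_json, fields)
  detail = JSON.parse(detail_json)
  fields.select { |f| detail[f].nil? }
end
```

A gate built on this would exit non-zero whenever `missing_fields` is non-empty for any event type. Adding a new field then only requires another YAML entry, which matches the "no script changes per release" goal.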
357
+ **Why this release.** Release Discipline that doesn't catch the trap I've
358
+ already hit twice isn't real discipline. Pairs with #6 — the scoreboard
359
+ catches regressions in measurement; the smoke gate catches the regression
360
+ where the measurement itself doesn't fire.
361
+
362
+ → improvements.md entry: *#63 Pre-Release Hook Smoke Gate*. Effort: ½d.
363
+
364
+ ### #13 Memory health audit toolkit — *unplanned, landed 2026-05-27* ✅
365
+
366
+ **Gap.** Drift inside the project DB — duplicate global conventions,
367
+ single-cardinality multiplicity, contamination-driven rejection churn, bare
368
+ conclusions, shortcut tools leaking the wrong predicate — was diagnosable
369
+ only by hand, project by project. The 2026-05-21 audit surfaced 103 rejected
370
+ single-cardinality facts in this project's own DB, all sourced from example
371
+ text in our own docs being re-ingested. Without a productionized check, this
372
+ class of regression silently erodes the 1.0 visibility claim.
373
+
374
+ **Acceptance.**
375
+
376
+ - `claude-memory audit` CLI with ten contract checks (C001-C010), `--json`
377
+ for CI, `--severity`, `--no-exit`
378
+ - `/audit-memory` slash command for interactive walkthrough
379
+ - `docs/audit_runbook.md` per-check rationale + remediation
380
+ - `ReferenceMaterialDetector` example-quote guard + `Resolver` `:discard`
381
+ path (defense-in-depth at write time)
382
+ - Memory shortcuts (`memory.decisions`/`.conventions`/`.architecture`)
383
+ switched from FTS text search to predicate-based filtering
384
+ - `claude-memory import-auto-memory` retroactively pulls auto-memory entries
385
+ `AutoMemoryMirror` missed (slug bug fixed: `tr("/_", "-")`)
386
+ - Signal-health benchmark spec (`spec/benchmarks/health/database_signal_spec.rb`)
387
+ codifies the cleanup contracts so regressions can be detected in CI
388
+
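For readers unfamiliar with `String#tr`, the slug expression quoted above maps both `/` and `_` to `-`. A minimal sketch, where the helper name `auto_memory_slug` and the sample path are hypothetical and only the `tr("/_", "-")` call comes from the punchlist:

```ruby
# Hypothetical helper; only the tr("/_", "-") call is taken from the punchlist.
def auto_memory_slug(project_path)
  # When the replacement set is shorter than the source set, String#tr pads it
  # with its last character, so both "/" and "_" become "-".
  project_path.tr("/_", "-")
end

auto_memory_slug("/Users/me/claude_memory") # => "-Users-me-claude-memory"
```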
389
+ **Why this release.** Serves pillars 1 (stability — guards single-cardinality
390
+ predicates from drifting) and 2 (visibility — surfaces drift as a measurable
391
+ signal). The detector + resolver fixes mean the 0.12 → 1.0 soak is more
392
+ likely to surface real signal vs. doc-text contamination noise.
393
+
394
+ → improvements.md entry: not yet promoted; lives in `docs/memory_audit_2026-05-21.md`
395
+ as the originating artifact. Effort: ~2d (across the 2026-05-27 session).
396
+
397
+ ### #14 OpenTelemetry ingestion + Dashboard Telemetry/Prompt Journey — *unplanned, landed 2026-05-21* ✅
398
+
399
+ **Gap.** The visibility pillar promised "you can see what memory costs and
400
+ what it's doing." Token-budget telemetry (#1) covered the cost; the rest —
401
+ per-tool latency, cost-per-hour, the full prompt-to-response journey across
402
+ hooks/MCP/distillation — was invisible without an external tracer. Claude
403
+ Code already exports OTLP if asked; the question was whether ClaudeMemory
404
+ should ingest its own telemetry rather than punting to Datadog/Honeycomb.
405
+
406
+ **Acceptance.**
407
+
408
+ - Schema v18: `otel_metrics`, `otel_events`, `otel_traces` + `prompt_id`
409
+ on `activity_events` for journey correlation
410
+ - `claude-memory otel` CLI manages the env block (`--enable`, `--disable`,
411
+ `--enable-traces`, `--capture-prompts`, `--status`, `--verify`, `--backfill`)
412
+ - Dashboard exposes `/v1/metrics`, `/v1/logs`, `/v1/traces` on
413
+ `127.0.0.1:3377` (OTLP/HTTP/JSON) plus a new "Telemetry" drawer
414
+ - Prompt Journey panel UNIONs `otel_events` with `activity_events` and
415
+ back-tags activity_events with `prompt.id` via `OTel::PromptScope`
416
+ - Sweep retention: 30d metrics, 14d events, 7d traces
417
+ - Privacy posture: opt-in for prompt capture; traces 501-gated until
418
+ explicit `--enable-traces`
419
+
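The retention windows listed above translate into per-table cutoffs. The sketch below is an assumption-labeled illustration rather than the gem's sweeper: the constant and method names are hypothetical, while the table names and the 30/14/7-day windows come from the list.

```ruby
# Hypothetical sketch of OTel retention cutoffs; not the gem's sweep code.
RETENTION_DAYS = {
  "otel_metrics" => 30,
  "otel_events"  => 14,
  "otel_traces"  => 7
}.freeze

SECONDS_PER_DAY = 86_400

# Returns a table => cutoff Time map; rows older than the cutoff would be swept.
def retention_cutoffs(now: Time.now.utc)
  RETENTION_DAYS.transform_values { |days| now - days * SECONDS_PER_DAY }
end
```

A sweep step could then delete rows whose timestamp precedes the table's cutoff; the timestamp column to compare against is not stated in the punchlist and is left out here.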
420
+ **Why this release.** Directly serves pillar 2 (visibility) at a depth
421
+ nothing else can — no dashboard polish substitutes for actual per-prompt
422
+ trace data. Loud answer to "what is this thing doing right now?"
423
+
424
+ → improvements.md entry: tracked under the OTel research → study line.
425
+ Effort: ~2.5w (Apr 26 → May 21).
426
+
427
+ ### #15 Staleness guard for single-value facts — *born from the #3 harm run, landed 2026-05-28* ✅
428
+
429
+ **Gap.** The first full-corpus real-mode harm run (#3) surfaced a 15.4%
430
+ harm rate (2 of 13 scenarios). One was a false positive in the test pattern (fixed in the
431
+ corpus); the other was a **real harm**: Claude emitted `git push heroku
432
+ HEAD:main` from a stale `deployment_platform` fact with no hedge.
433
+ Single-value predicates are exclusive claims Claude follows
434
+ authoritatively — and ClaudeMemory had no defense against a stale one
435
+ when no superseding fact exists (supersession only fires if the
436
+ migration was recorded). This is a direct pillar-3 (long-horizon
437
+ quality) hole: over months, single-value facts go stale and silently
438
+ make Claude wrong.
439
+
440
+ **Acceptance.**
441
+
442
+ - `Recall::StalenessAnnotator` pure function: flags single-value facts
443
+ (uses_database / deployment_platform / auth_method) that are old
444
+ (valid_from/created_at older than threshold) AND not recently
445
+ confirmed (last_recalled_at null/stale)
446
+ - `Hook::ContextInjector` appends a "⚠ stale … verify before relying"
447
+ marker at SessionStart; multi-value predicates never annotated
448
+ - `Configuration#injection_stale_days` (default 180, env override),
449
+ distinct from the 14-day dashboard review window
450
+ - Re-run of #3 (scaffolded + best-of-N) confirms the gate is green
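
The staleness rule in the acceptance list can be sketched as a pure function. This is an illustration under stated assumptions, not the real `Recall::StalenessAnnotator`: the `Fact` struct, the `stale?` and `annotate` helpers, and the marker wording are hypothetical, and the sketch assumes the same threshold applies to both fact age and last recall (the source does not say).

```ruby
require "date"

# Single-value predicates named in the acceptance list.
SINGLE_VALUE_PREDICATES = %w[uses_database deployment_platform auth_method].freeze

# Hypothetical fact shape; the real schema may differ.
Fact = Struct.new(:predicate, :object, :valid_from, :created_at,
                  :last_recalled_at, keyword_init: true)

# Stale = single-value AND old AND not recently confirmed.
def stale?(fact, today:, stale_days: 180)
  return false unless SINGLE_VALUE_PREDICATES.include?(fact.predicate)

  asserted_on = fact.valid_from || fact.created_at
  return false if asserted_on.nil?

  cutoff = today - stale_days
  old = asserted_on < cutoff
  # Assumption: a recall older than the same cutoff counts as "not confirmed".
  unconfirmed = fact.last_recalled_at.nil? || fact.last_recalled_at < cutoff
  old && unconfirmed
end

# Appends an illustrative marker; multi-value predicates pass through untouched.
def annotate(fact, today:, stale_days: 180)
  line = "#{fact.predicate}: #{fact.object}"
  stale?(fact, today: today, stale_days: stale_days) ? "#{line} (⚠ stale, verify before relying)" : line
end
```

Keeping the check free of I/O means the threshold from `Configuration#injection_stale_days` can be passed in by the caller and the rule can be tested against fixed dates.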
451
+
452
+ **Why this release.** It's the concrete payoff of building the harm
453
+ benchmark before 1.0: the benchmark didn't just report a number, it
454
+ forced a real defensive feature that makes the long-horizon-quality
455
+ claim defensible. Shipping #3 without #15 would have meant tagging a
456
+ release whose own gate said "memory makes Claude wrong 1-in-13 times."
457
+
458
+ **Harness hardening (same investigation).** The first full-corpus run
459
+ also exposed two confounds that made the gate unverifiable: scenarios
460
+ ran in an empty tmpdir (Claude often refused for lack of project
461
+ context, not because it resisted the bad fact) and single-shot scoring
462
+ was noisy (the harmed *set* changed run-to-run). Fixed by (a) shipping a
463
+ `project_files` scaffold per scenario whose current state contradicts
464
+ the wrong memory fact — making each case a real "memory vs reality"
465
+ test — and (b) best-of-N majority scoring (HARM_BENCH_RUNS, default 3).
466
+ Without this, #15's effect couldn't be measured cleanly.
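
Best-of-N majority scoring, as described above, could look roughly like this. The helper names and the shape of `results` (scenario name mapped to per-run verdicts) are assumptions for illustration; only `HARM_BENCH_RUNS` and its default of 3 come from the text.

```ruby
# Number of runs per scenario; HARM_BENCH_RUNS defaults to 3 per the text above.
RUNS = Integer(ENV.fetch("HARM_BENCH_RUNS", "3"))

# A scenario counts as harmed only if a strict majority of its runs were harmed.
def majority_harmed?(verdicts)
  verdicts.count(true) > verdicts.size / 2.0
end

# results: { scenario_name => [true/false per run] } (hypothetical shape).
def harm_rate(results)
  return 0.0 if results.empty?

  harmed = results.count { |_name, verdicts| majority_harmed?(verdicts) }
  harmed.to_f / results.size
end
```

A scenario that is harmed in one run out of three no longer flips the gate, which is what removes the run-to-run noise in the harmed set.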
467
+
468
+ → improvements.md entry: not yet promoted; originates from the
469
+ `spec/benchmarks/dataset/harm_scenarios.yml` `harm_stale_deployment_heroku`
470
+ finding. Effort: ~½d (2026-05-28 session).
471
+
472
+ **Ship target:** ready to tag (2026-05-29). #3 harm gate is green at 0/13
473
+ (best-of-3) after #15; #4 deferred to 0.13 (harness limitation, never a
474
+ blocker); everything else in 0.12 has shipped. 0.12 tags now; soak window
475
+ 2-3 weeks before 1.0.
120
476
 
121
- **Why must-have.** Without longitudinal tracking, every benchmark we run is a
122
- snapshot. 1.0 is the moment we commit to *not regressing* what we ship.
477
+ ---
123
478
 
124
- improvements.md entry: *Benchmark Scoreboard Diff*
479
+ ## 0.12.x → 1.0 soak period (2-3 weeks)
125
480
 
126
- ---
481
+ Critical phase. Run 0.12 against real usage. Watch:
482
+
483
+ - **Harm rate stays at 0%** — release gate from #3
484
+ - **Hallucination rate trend** — from #2
485
+ - **Token budget growth** — from #1, #9
486
+ - **Utilization ratio** — across multiple projects
487
+
488
+ If any signal shifts unfavorably during soak, fix in 0.12.x. **Don't ship 1.0
489
+ from a release that hasn't observed itself for ≥2 weeks.**
490
+
491
+ This soak period is also where the relevance ratio metric (#31 from 0.10.0)
492
+ gets its first real-mode measurement, and where the 0.11 trust
493
+ signals get a chance to become real numbers rather than theory.
127
494
 
128
- ## Strong post-1.0
495
+ ---
129
496
 
130
- These shouldn't block 1.0 but should land in the next release window.
497
+ ## 1.0.0 "Stable Memory"
131
498
 
132
- ### 7. First-week ROI nudge
499
+ Theme: *ready for daily use, ready to recommend.*
133
500
 
134
- SessionEnd hook prints `memory contributed N facts this session, %used = X`
135
- inline for the first ~10 sessions. Closes the cold-start gap where new users
136
- don't see value because they don't think to look.
501
+ ### Post-1.0-punchlist polish (if landed during soak)
137
502
 
138
- improvements.md entry: *First-Week ROI Nudge*
503
+ These were originally post-1.0 in the punchlist; if soak time permits, they
504
+ land in 1.0. Otherwise they ship in 1.1.
139
505
 
140
- ### 8. Real-session repeat-correction detector
506
+ ### #8 Real-session repeat-correction detection
141
507
 
142
- The repeat-correction benchmark (#32) is synthetic; production has no
143
- equivalent signal. Analyze `activity_events` to detect "this fact was injected
144
- last session, the user re-stated it this session" — that's where memory is
145
- silently failing.
508
+ The repeat-correction benchmark (#32 from 0.10.0) is synthetic; production
509
+ has no equivalent signal. Analyze `activity_events` for "this fact was
510
+ injected last session, the user re-stated it this session" — that's where
511
+ memory is silently failing.
146
512
 
147
- → improvements.md entry: *Real-Session Repeat-Correction Detection*
513
+ → improvements.md entry: *#54 Real-Session Repeat-Correction Detection*.
514
+ Effort: 2d.
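
One way the "injected last session, re-stated this session" check might be expressed is sketched below. The event shape (`:session`, `:kind`, `:fact_key`) is hypothetical and does not reflect the real `activity_events` columns; the function name is also an assumption.

```ruby
# events: array of hashes with hypothetical keys
#   :session  (Integer, increasing per session)
#   :kind     (:injected or :user_stated)
#   :fact_key (String identifying the fact)
def repeat_corrections(events)
  by_session = events.group_by { |e| e[:session] }

  # Compare each session with the one immediately before it in the log.
  by_session.keys.sort.each_cons(2).flat_map do |prev, curr|
    injected = by_session[prev].select { |e| e[:kind] == :injected }.map { |e| e[:fact_key] }
    restated = by_session[curr].select { |e| e[:kind] == :user_stated }.map { |e| e[:fact_key] }
    (injected & restated).map { |key| { fact_key: key, injected_in: prev, restated_in: curr } }
  end
end
```

Each returned row is a candidate silent failure: the fact was in context, yet the user had to say it again.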
148
515
 
149
- ### 9. Token-cost growth tracking
516
+ ### #9 Token-cost growth tracking
150
517
 
151
518
  Builds on #1. Weekly digest reports "context cost grew X% over 30d" as an
152
519
  anomaly signal that the DB is bloating or context injection is going wide.
153
520
 
154
- → improvements.md entry: *Token-Cost Growth Tracking*
521
+ → improvements.md entry: *#55 Token-Cost Growth Tracking*. Effort: 3h after
522
+ #1 lands.
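
The "context cost grew X% over 30d" signal reduces to a percentage change plus a threshold. A minimal sketch follows, assuming the two token totals are already available from the #1 telemetry; the function names and the 25% default threshold are illustrative, not taken from the source.

```ruby
# Percentage growth from a baseline window to the current window.
# Returns nil when there is no baseline to compare against.
def growth_percent(baseline_tokens, current_tokens)
  return nil if baseline_tokens.zero?

  ((current_tokens - baseline_tokens) / baseline_tokens.to_f * 100).round(1)
end

# Flags growth above a (hypothetical) threshold as an anomaly for the digest.
def growth_anomaly?(baseline_tokens, current_tokens, threshold_percent: 25.0)
  pct = growth_percent(baseline_tokens, current_tokens)
  !pct.nil? && pct > threshold_percent
end
```

For example, a baseline of 1,000 injected tokens and a current value of 1,400 gives 40.0% growth, which would be reported under the default threshold assumed here.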
155
523
 
156
- ### 10. Drift dashboard
524
+ ### #10 Drift dashboard
157
525
 
158
526
  Snapshot `census` weekly, surface predicate distribution shifts on the
159
527
  dashboard. Answers "is my fact base going off?" without a manual audit.
160
528
 
161
- → improvements.md entry: *Drift Dashboard*
529
+ → improvements.md entry: *#56 Drift Dashboard*. Effort: 1.5d.
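
A predicate distribution shift between two weekly `census` snapshots could be computed as below. This is a sketch under assumptions: snapshots are taken to be plain `{ predicate => count }` hashes, and the helper names and the 5-point `min_delta` are hypothetical.

```ruby
# Converts raw counts into each predicate's share of the total.
def shares(census)
  total = census.values.sum.to_f
  return {} if total.zero?

  census.transform_values { |count| count / total }
end

# Returns { predicate => share_delta } for predicates whose share moved
# by at least min_delta between the two snapshots.
def distribution_shifts(previous, current, min_delta: 0.05)
  prev = shares(previous)
  curr = shares(current)

  (prev.keys | curr.keys).filter_map do |predicate|
    delta = curr.fetch(predicate, 0.0) - prev.fetch(predicate, 0.0)
    [predicate, delta.round(3)] if delta.abs >= min_delta
  end.to_h
end
```

Comparing shares rather than raw counts keeps a uniformly growing fact base from registering as drift; only a change in the mix does.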
530
+
531
+ *(#11 API stability audit moved to 0.12 on 2026-05-01 — see above.)*
532
+
533
+ ### Release framing
534
+
535
+ README + CHANGELOG framing for 1.0 explicitly states:
536
+
537
+ - "We measured X harm rate, Y utilization, Z hallucination rate across N
538
+ projects over W weeks before tagging this."
539
+ - The public API surface is documented at `docs/api_stability.md`
540
+ - Deprecation policy explicit
541
+
542
+ **Ship target:** 6-8 weeks from 0.10.0 (mid-June 2026 at current velocity).
162
543
 
163
544
  ---
164
545
 
@@ -168,23 +549,75 @@ dashboard. Answers "is my fact base going off?" without a manual audit.
168
549
  drawers cover the primary need.
169
550
  - **#45 Live SSE/WebSocket feed** — polling is adequate; dashboard polish, not
170
551
  a confidence gap.
552
+ - **#23 REST API endpoint** — MCP covers primary use case; defer to 1.x.
553
+ - **#25 HTTP MCP transport** — no startup-latency complaint to motivate it yet.
554
+
555
+ ---
556
+
557
+ ## Risk to flag now
558
+
559
+ The biggest hidden risk in this plan was **the harm benchmark (#3) finds
560
+ something.** The 3-scenario prototype in 0.11 (above) was specifically
561
+ designed to surface this risk earlier — and **on 2026-04-30 the real-mode
562
+ prototype reported 0/3 harm**, green-lighting the full corpus expansion.
563
+ Risk is materially reduced; the 10-15-case corpus may still surface
564
+ something the 3-case sample missed, but a fundamental retrieval-discipline
565
+ issue is now unlikely.
566
+
567
+ Remaining risk for 0.12: **#11 API stability audit reveals the surface is
568
+ larger or messier than we thought**, pushing the doc work past the 2-day
569
+ estimate. Mitigation: scope `Public Ruby API` aggressively to "internal
570
+ unless proven otherwise" — easier to promote later than demote. *Update
571
+ 2026-05-27: #11 landed on time on 2026-05-01; this risk did not materialize.*
572
+
573
+ Remaining risk for 0.12, take 2 (added 2026-05-27 in light of Path B):
574
+ **the full 13-scenario harm corpus surfaces a >1% harm rate** that the
575
+ 3-scenario prototype masked. Mitigation paths if it happens: classify the
576
+ harming class, ship a guard (the way #13 added the `ReferenceMaterialDetector`
577
+ example-quote guard for the contamination class), re-run. Worst case
578
+ extends 0.12 by ~3-5 days; doesn't push 1.0 if the soak window has slack.
171
579
 
172
580
  ---
173
581
 
174
- ## Sequencing recommendation
582
+ ## Velocity assumptions
583
+
584
+ Based on actual release cadence Mar-Apr 2026:
585
+
586
+ | Pair | Days |
587
+ |---|---|
588
+ | 0.7.0 → 0.7.1 | minor patch, days |
589
+ | 0.7.1 → 0.8.0 | 17 |
590
+ | 0.8.0 → 0.9.0 | 17 |
591
+ | 0.9.0 → 0.9.1 | same day (patch) |
592
+ | 0.9.1 → 0.10.0 | 12 |
593
+
594
+ Average ~2 weeks per minor with substantial work landing each cycle.
175
595
 
176
- Smallest set that materially shifts 1.0 confidence (~2 days):
596
+ | Milestone | Estimated work | Calendar target | Status |
597
+ |---|---|---|---|
598
+ | 0.11.0 | ~1 week | ~2026-05-12 | ✅ shipped 2026-04-30 |
599
+ | 0.11.x patches | reactive | as-needed | open |
600
+ | 0.12.0 (originally planned) | ~1.5 weeks | ~2026-06-02 | superseded — actual scope widened (see 2026-05-27 restructure) |
601
+ | 0.12.0 (actual) | ~4 weeks (#6/#11/#12 + OTel + audit toolkit + Path B #3/#4) | tag ~2026-06-03 | 5 of 7 items shipped; #3 + #4 in progress |
602
+ | Soak | 2-3 weeks | through ~2026-06-24 | future |
603
+ | 1.0.0 | 1-2 days release prep | ~2026-06-24 to 2026-07-01 | future |
177
604
 
178
- 1. **Token budget telemetry** (#1) closes the loudest critique.
179
- 2. **CLAUDE.md baseline publish** (#4) adapter already built, one report change.
180
- 3. **Hallucination rate** (#2) reuses ReferenceMaterialDetector.
605
+ *0.12 grew from ~1 week to ~1.5 weeks after the 2026-05-01 restructure
606
+ (promoted #11 + added #12), then widened again to ~4 weeks after the
607
+ 2026-05-27 restructure that absorbed the OTel observability work and the
608
+ audit toolkit. 1.0 calendar shifted ~3 weeks later in total but the soak
609
+ window remains 2-3 weeks — the visibility/stability surface 0.12 now ships
610
+ is materially larger than the original "Release Discipline" scope.*
181
611
 
182
- Then in roughly priority order: `claude-memory show` (#5), harm benchmark
183
- (#3), scoreboard (#6). Post-1.0 items follow naturally once the must-haves
184
- land.
612
+ These are calendar estimates assuming roughly the same focus level as the
613
+ 0.10.0 cycle. Real cadence will adjust based on what surfaces during soak.
185
614
 
186
615
  ---
187
616
 
188
- *Last updated: 2026-04-28 initial punchlist drawn from session-end critique
189
- of observability/outcome gaps. Each entry will be elaborated with concrete
190
- file:line refs in improvements.md as it's worked.*
617
+ *Last updated: 2026-05-27 (mid-0.12 cycle). 0.11.0 shipped 2026-04-30 with
618
+ all 5 punchlist items + harm prototype reporting 0/3 harm. 0.12 restructured
619
+ 2026-05-01 (promoted #11, added #12) and again 2026-05-27 (absorbed OTel
620
+ #14 + audit toolkit #13, re-anchored on the three 1.0 pillars, committed
621
+ to Path B finishing #3 + #4 before tag). 0.12 grew ~1.5w → ~4w; 1.0 ship
622
+ target shifted ~3w later in return. Soak window held at 2-3w because the
623
+ visibility surface in 0.12 is materially larger than originally scoped.*