claude_memory 0.9.0 → 0.10.0
- checksums.yaml +4 -4
- data/.claude/memory.sqlite3 +0 -0
- data/.claude/rules/claude_memory.generated.md +63 -1
- data/.claude/skills/dashboard/SKILL.md +42 -0
- data/.claude/skills/release/SKILL.md +168 -0
- data/.claude-plugin/marketplace.json +1 -1
- data/.claude-plugin/plugin.json +1 -1
- data/CHANGELOG.md +92 -0
- data/CLAUDE.md +21 -5
- data/README.md +32 -2
- data/db/migrations/015_add_activity_events.rb +26 -0
- data/db/migrations/016_add_moment_feedback.rb +22 -0
- data/db/migrations/017_add_last_recalled_at.rb +15 -0
- data/docs/1_0_punchlist.md +190 -0
- data/docs/EXAMPLES.md +41 -2
- data/docs/GETTING_STARTED.md +31 -4
- data/docs/architecture.md +22 -7
- data/docs/audit-queries.md +131 -0
- data/docs/dashboard.md +172 -0
- data/docs/improvements.md +465 -9
- data/docs/influence/cq.md +187 -0
- data/docs/plugin.md +13 -6
- data/docs/quality_review.md +489 -172
- data/docs/reflection_memory_as_accumulating_judgment.md +67 -0
- data/lib/claude_memory/activity_log.rb +86 -0
- data/lib/claude_memory/commands/census_command.rb +210 -0
- data/lib/claude_memory/commands/completion_command.rb +3 -0
- data/lib/claude_memory/commands/dashboard_command.rb +54 -0
- data/lib/claude_memory/commands/dedupe_conflicts_command.rb +55 -0
- data/lib/claude_memory/commands/digest_command.rb +181 -0
- data/lib/claude_memory/commands/hook_command.rb +34 -0
- data/lib/claude_memory/commands/reclassify_references_command.rb +56 -0
- data/lib/claude_memory/commands/registry.rb +6 -1
- data/lib/claude_memory/commands/skills/distill-transcripts.md +13 -1
- data/lib/claude_memory/commands/stats_command.rb +38 -1
- data/lib/claude_memory/commands/sweep_command.rb +2 -0
- data/lib/claude_memory/configuration.rb +16 -0
- data/lib/claude_memory/core/relative_time.rb +9 -0
- data/lib/claude_memory/dashboard/api.rb +610 -0
- data/lib/claude_memory/dashboard/conflicts.rb +279 -0
- data/lib/claude_memory/dashboard/efficacy.rb +127 -0
- data/lib/claude_memory/dashboard/fact_presenter.rb +109 -0
- data/lib/claude_memory/dashboard/health.rb +175 -0
- data/lib/claude_memory/dashboard/index.html +2707 -0
- data/lib/claude_memory/dashboard/knowledge.rb +136 -0
- data/lib/claude_memory/dashboard/moments.rb +244 -0
- data/lib/claude_memory/dashboard/reuse.rb +97 -0
- data/lib/claude_memory/dashboard/scoped_fact_resolver.rb +95 -0
- data/lib/claude_memory/dashboard/server.rb +211 -0
- data/lib/claude_memory/dashboard/timeline.rb +68 -0
- data/lib/claude_memory/dashboard/trust.rb +285 -0
- data/lib/claude_memory/distill/reference_material_detector.rb +78 -0
- data/lib/claude_memory/hook/auto_memory_mirror.rb +112 -0
- data/lib/claude_memory/hook/context_injector.rb +97 -3
- data/lib/claude_memory/hook/handler.rb +50 -3
- data/lib/claude_memory/mcp/handlers/management_handlers.rb +8 -0
- data/lib/claude_memory/mcp/query_guide.rb +11 -0
- data/lib/claude_memory/mcp/server.rb +8 -2
- data/lib/claude_memory/mcp/text_summary.rb +29 -0
- data/lib/claude_memory/mcp/tool_definitions.rb +13 -0
- data/lib/claude_memory/mcp/tools.rb +148 -0
- data/lib/claude_memory/publish.rb +13 -21
- data/lib/claude_memory/recall/stale_detector.rb +67 -0
- data/lib/claude_memory/resolve/predicate_policy.rb +2 -0
- data/lib/claude_memory/resolve/resolver.rb +41 -11
- data/lib/claude_memory/store/llm_cache.rb +68 -0
- data/lib/claude_memory/store/metrics_aggregator.rb +96 -0
- data/lib/claude_memory/store/schema_manager.rb +1 -1
- data/lib/claude_memory/store/sqlite_store.rb +47 -143
- data/lib/claude_memory/store/store_manager.rb +29 -0
- data/lib/claude_memory/sweep/maintenance.rb +216 -0
- data/lib/claude_memory/sweep/recall_timestamp_refresher.rb +83 -0
- data/lib/claude_memory/sweep/sweeper.rb +2 -0
- data/lib/claude_memory/version.rb +1 -1
- data/lib/claude_memory.rb +22 -0
- metadata +50 -1
data/docs/improvements.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
# Improvements to Consider
|
|
2
2
|
|
|
3
|
-
*Updated: 2026-03-30 - Re-studied all 7 influencer repos. New recommendations: CLAUDE_CONFIG_DIR support (#26, from episodic-memory), Usage Stats / ROI Tracking (#27, from grepai v0.35.0). New Features to Avoid: AST-Aware Code Chunking (QMD), Custom Instructions via Env Var (lossless-claw v0.5.2), OpenClaw Context Injection (claude-mem v10.6.0). Repos with no changes: kbs (v0.2.1), claude-supermemory (v2.0.1), episodic-memory (v1.0.15). Previously: 14 features implemented through 2026-03-24.*
|
|
3
|
+
*Updated: 2026-04-28 - Opened the 1.0 punchlist track (see `docs/1_0_punchlist.md`). High-priority entries below now include the must-have 1.0 items: token-budget telemetry (#47), hallucination-rate metric (#48), negative-fact harm benchmark (#49), CLAUDE.md baseline publication (#50), `claude-memory show` (#51), benchmark scoreboard diff (#52). Post-1.0 entries: first-week ROI nudge (#53), real-session repeat-correction detector (#54), token-cost growth tracking (#55), drift dashboard (#56). Earlier 2026-04-28 update added cq study (usefulness-focused). Previously: 2026-03-30 - Re-studied all 7 influencer repos. New recommendations: CLAUDE_CONFIG_DIR support (#26, from episodic-memory), Usage Stats / ROI Tracking (#27, from grepai v0.35.0). New Features to Avoid: AST-Aware Code Chunking (QMD), Custom Instructions via Env Var (lossless-claw v0.5.2), OpenClaw Context Injection (claude-mem v10.6.0). Repos with no changes: kbs (v0.2.1), claude-supermemory (v2.0.1), episodic-memory (v1.0.15). Previously: 14 features implemented through 2026-03-24.*
|
|
4
4
|
*Sources:*
|
|
5
5
|
- *[thedotmack/claude-mem](https://github.com/thedotmack/claude-mem) - Memory compression system (v10.6.3, re-studied 2026-03-30)*
|
|
6
6
|
- *[obra/episodic-memory](https://github.com/obra/episodic-memory) - Semantic conversation search (v1.0.15, re-studied 2026-03-30 — no changes)*
|
|
@@ -88,6 +88,230 @@ Source: claude-supermemory v2.0.1 study (2026-03-09)
|
|
|
88
88
|
|
|
89
89
|
Extraction instructions embedded in `/distill-transcripts` skill and context hook injection prompt. Defines what to extract (technology decisions, conventions, preferences, architecture, entities by type) vs what to skip (debugging steps, code output, transient errors). Scope detection for global vs project facts. Claude Code itself performs extraction — no separate API call needed.
|
|
90
90
|
|
|
91
|
+
### 47. Token Budget Telemetry — *what does memory cost?*
|
|
92
|
+
|
|
93
|
+
Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #1)
|
|
94
|
+
|
|
95
|
+
**Gap.** `Core::TokenEstimator` (`lib/claude_memory/core/token_estimator.rb`) exists but is consumed only by `Index::IndexQuery`. We record `context_length` (chars) on every `hook_context` activity event but never tokens, so users can't answer "what's memory costing me per session?", which is the loudest critique of any context-injection memory system.
|
|
96
|
+
|
|
97
|
+
**Implementation.**
|
|
98
|
+
|
|
99
|
+
- **Capture at injection time.** `Commands::HookCommand#record_context_activity` (hook_command.rb:208-232) already builds the details hash with `context_length`. Add `context_tokens: Core::TokenEstimator.estimate(context_text)` and the same field in `Hook::Handler#context` (handler.rb:106-108). Backfill behavior: legacy events without `context_tokens` fall back to `context_length / 4` (matches TokenEstimator's CHARS_PER_TOKEN constant).
|
|
100
|
+
- **Surface in Trust.** `Dashboard::Trust#snapshot` (trust.rb:28-36) gains a `token_budget` block: `{p50:, p95:, total_30d:, sessions:}` derived from `activity_events` where `event_type='hook_context' AND status='success'` over `UTILIZATION_DAYS`.
|
|
101
|
+
- **Surface in digest.** `Commands::DigestCommand` (digest_command.rb) adds a "Context cost" line — average tokens injected per session in the window, rendered alongside activity counts.
|
|
102
|
+
- **Surface in stats.** `claude-memory stats --tokens` prints the same p50/p95 + per-day distribution for terminal-only users.
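The capture-and-aggregate steps above can be sketched as follows. This is a minimal illustration, not the gem's code: `context_tokens_for`, `percentile`, and `token_budget` are hypothetical helper names, the nearest-rank percentile is one reasonable choice among several, and the sketch assumes one `hook_context` event per session with `detail_json` already parsed into a string-keyed hash.

```ruby
require "json"

# Assumed to mirror the chars-per-token ratio described above (length / 4).
CHARS_PER_TOKEN = 4

# Tokens for one hook_context event's details hash.
# Legacy events without context_tokens fall back to context_length / 4.
def context_tokens_for(details)
  return details["context_tokens"].to_i if details.key?("context_tokens")
  details.fetch("context_length", 0).to_i / CHARS_PER_TOKEN
end

# Nearest-rank percentile over an ascending-sorted array; nil when empty.
def percentile(sorted, pct)
  return nil if sorted.empty?
  rank = (pct / 100.0 * sorted.size).ceil
  sorted[[rank, 1].max - 1]
end

# Builds the {p50:, p95:, total_30d:, sessions:} block from parsed details.
def token_budget(details_list)
  tokens = details_list.map { |d| context_tokens_for(d) }.sort
  {
    p50: percentile(tokens, 50),
    p95: percentile(tokens, 95),
    total_30d: tokens.sum,
    sessions: tokens.size
  }
end

# Made-up sample events for illustration only.
events = [
  '{"context_tokens": 900}',
  '{"context_length": 2000}', # legacy event: 2000 / 4 = 500 tokens
  '{"context_tokens": 0}'     # skipped session recorded as zero
].map { |json| JSON.parse(json) }

budget = token_budget(events)
```

Recording skipped sessions as zero keeps them in the sorted list, so the percentiles describe all sessions rather than only those that injected context.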
|
|
103
|
+
|
|
104
|
+
**Acceptance.**
|
|
105
|
+
|
|
106
|
+
- Trust panel shows `Context cost` widget with current-week p95 + week-over-week delta (matches the existing weekly_moments shape).
|
|
107
|
+
- Digest's Activity section includes "Context tokens injected (avg/session): N".
|
|
108
|
+
- `claude-memory stats --tokens --since 30` works and matches the dashboard.
|
|
109
|
+
|
|
110
|
+
**Edge cases.**
|
|
111
|
+
|
|
112
|
+
- Sessions where `generate_context` returns nil (`status='skipped'`): record `context_tokens: 0` so the denominator stays honest.
|
|
113
|
+
- Fresh installs with no `hook_context` events: Trust shows the widget hidden (mirroring the `utilization` panel's empty-state handling).
|
|
114
|
+
- Old events (pre-rollout) without the field: fall back via `(detail_json->>'context_length').to_i / 4`. Document this in the migration note in `db/migrations/` if a schema change is added later; currently no schema change is required.
|
|
115
|
+
|
|
116
|
+
**Effort.** ~4-6 hours. No schema changes; `detail_json` is an opaque blob.
|
|
117
|
+
|
|
118
|
+
**Why high priority.** Without this number, the trade-off "memory eats N tokens forever" is unfalsifiable. The data is already flowing through `record_context_activity` — we're only failing to compute one extra integer.
|
|
119
|
+
|
|
120
|
+
---
|
|
121
|
+
|
|
122
|
+
### 48. Hallucination Rate as a First-Class Trust Metric
|
|
123
|
+
|
|
124
|
+
Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #2). Builds on #34 (Why-Preservation Audit) and #41 (ReferenceMaterialDetector).
|
|
125
|
+
|
|
126
|
+
**Gap.** We already have `Distill::ReferenceMaterialDetector` classifying "X is a CLI/library/MCP server" / "by Firstname Lastname" / LOC-count facts as suspect. The #34 audit found ~25% of project facts had embedded reasoning, ~75% were bare conclusions. Neither signal is exposed on the dashboard. The Trust panel today shows clean numbers; it should show stained ones so users can see the calibration loop.
|
|
127
|
+
|
|
128
|
+
**Implementation.**
|
|
129
|
+
|
|
130
|
+
- **Two component metrics.**
|
|
131
|
+
1. *Suspect-fact ratio*: `ReferenceMaterialDetector.suspect_count(active_facts) / active_facts.count`. Already a one-liner — the detector exists and is invoked in `ManagementHandlers#store_extraction` to retag at write time. Add a read-only count method.
|
|
132
|
+
2. *Bare-conclusion ratio*: new lightweight detector that flags `decision`/`convention` facts whose `object_literal` lacks a why clause. Cheapest heuristic: `object !~ /\b(because|so that|caused by|breaks when|to avoid|to ensure|reason)\b/i`. Lives in `lib/claude_memory/distill/why_clause_detector.rb` so the rule is cited in one spot.
|
|
133
|
+
- **Composite quality_score.** `Dashboard::Trust#snapshot` exposes `quality: {suspect_pct:, bare_conclusion_pct:, score:}` where `score = 100 - suspect_pct - bare_conclusion_pct/2` (bare conclusions are weaker negatives than reference-material mislabels). Tunable; the formula matters less than the trend.
|
|
134
|
+
- **Rejection-rate companion.** Digest gains a "Calibration" section: of facts created in the last 30d, what % are now `status='rejected'`? This is the ex-post calibration signal that complements the ex-ante quality_score.
|
|
135
|
+
- **CLI surface.** `claude-memory stats --quality` prints the score plus the top 10 suspect facts so users can act.
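As a rough sketch of the two component metrics and the composite score described above: the regex and the formula are taken from the bullets, while `bare_conclusion?` and `quality_snapshot` are hypothetical names, the sample fact texts are invented, and using one shared denominator for both ratios is an assumption.

```ruby
# Why-clause heuristic quoted from the proposal above.
WHY_CLAUSE = /\b(because|so that|caused by|breaks when|to avoid|to ensure|reason)\b/i

# True when a decision/convention object carries no recognizable rationale.
def bare_conclusion?(object_literal)
  !object_literal.match?(WHY_CLAUSE)
end

# Composite score: 100 - suspect_pct - bare_conclusion_pct / 2.
# Returns nil for an empty store so a frontend can hide the widget.
def quality_snapshot(total:, suspect:, bare:)
  return nil if total.zero?
  suspect_pct = 100.0 * suspect / total
  bare_pct = 100.0 * bare / total
  {
    suspect_pct: suspect_pct.round,
    bare_conclusion_pct: bare_pct.round,
    score: (100 - suspect_pct - bare_pct / 2).round
  }
end

# Invented sample objects for illustration only.
objects = [
  "Use Sequel because ActiveRecord is too heavy for a CLI gem",
  "Use RSpec",
  "Pin sqlite3 to avoid native extension breakage"
]
bare_count = objects.count { |o| bare_conclusion?(o) }

snapshot = quality_snapshot(total: 50, suspect: 2, bare: 9)
```

With 2 suspect and 9 bare facts out of 50, the percentages are 4 and 18, and the score works out to 100 - 4 - 9 = 87.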
|
|
136
|
+
|
|
137
|
+
**Acceptance.**
|
|
138
|
+
|
|
139
|
+
- Trust panel shows `Quality score: 87 (suspect 4%, bare 18%)` with red/yellow/green coding (>80 green, >60 yellow, else red).
|
|
140
|
+
- Digest's Calibration section shows `12/87 facts rejected within 7 days (14% rejection rate)`.
|
|
141
|
+
- Stats command lists actionable suspects with docids so users can `claude-memory reject <docid>`.
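The color coding and the Calibration line from the acceptance items could be rendered along these lines. Both helper names are hypothetical, and rounding the rate to a whole percent is an assumption based on the example string.

```ruby
# Color band per the thresholds above: >80 green, >60 yellow, else red.
def quality_band(score)
  return nil if score.nil?
  if score > 80
    :green
  elsif score > 60
    :yellow
  else
    :red
  end
end

# Formats the Calibration line; returns nil when nothing was created.
def calibration_line(rejected:, created:, days:)
  return nil if created.zero?
  rate = (100.0 * rejected / created).round
  "#{rejected}/#{created} facts rejected within #{days} days (#{rate}% rejection rate)"
end

line = calibration_line(rejected: 12, created: 87, days: 7)
```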
|
|
142
|
+
|
|
143
|
+
**Edge cases.**
|
|
144
|
+
|
|
145
|
+
- Reference-material is a multi-value predicate now (#41), so detector hits don't always mean rejection — they can also indicate correctly-tagged reference rows. The metric only counts mislabeled-as-convention/decision suspects, not facts with `predicate='reference'`.
|
|
146
|
+
- Bare-conclusion detection is regex-based and lossy. Keep it advisory: this score is a trend signal, not a precision tool. Accept ~10% false-positive rate as long as the directional signal holds across releases.
|
|
147
|
+
- Empty-DB case: `quality_score` is nil (not 100). Frontend hides the widget.
|
|
148
|
+
|
|
149
|
+
**Effort.** ~1 day. Detector reuse + one new helper + Trust + digest wiring.
|
|
150
|
+
|
|
151
|
+
**Why high priority.** A retrieval system that injects polluted facts is strictly worse than no memory. Users need to see the pollution rate, not just the recall rate.
|
|
152
|
+
|
|
153
|
+
---
|
|
154
|
+
|
|
155
|
+
### 49. Negative-Fact Harm Benchmark
|
|
156
|
+
|
|
157
|
+
Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #3). Parallels #32 (Repeat-Correction Benchmark) but inverts the goal.
|
|
158
|
+
|
|
159
|
+
**Gap.** Every benchmark we run measures whether memory **helps** (Recall@k, MRR, e2e pass rate, repeat-correction prevention rate). Nothing measures whether memory **harms** — i.e. holds a wrong/stale fact and causes Claude to follow it. Without this, "memory helps" is unfalsifiable.
|
|
160
|
+
|
|
161
|
+
**Implementation.**
|
|
162
|
+
|
|
163
|
+
- **Dataset.** `spec/benchmarks/dataset/harm_scenarios.yml` modeled on `repeat_correction_scenarios.yml` (`spec/benchmarks/e2e/repeat_correction_spec.rb` is the template). Each scenario carries:
|
|
164
|
+
- `memory_facts`: 1-3 facts pre-loaded into memory, intentionally outdated/wrong (e.g. `uses_database = MySQL` when the prompt context implies PostgreSQL is current).
|
|
165
|
+
- `prompt`: a question whose right answer requires *not* trusting the wrong fact.
|
|
166
|
+
- `harm_patterns`: regex list — any match in Claude's response = Claude followed the bad fact. Matches the absence-pattern shape from #32.
|
|
167
|
+
- `safe_indicators`: optional positive patterns showing Claude correctly questioned/ignored the fact.
|
|
168
|
+
- **10-15 scenarios spanning four harm classes:**
|
|
169
|
+
1. *Stale-tech*: outdated framework/database choice that conflicts with prompt cues.
|
|
170
|
+
2. *Mismatched-scope*: project fact applied to a different-project prompt (tests scope leakage).
|
|
171
|
+
3. *Superseded-but-undetected*: fact that should have been superseded but wasn't.
|
|
172
|
+
4. *Reference-material-as-fact*: a "by Firstname Lastname" attribution mislabeled as `convention`, prompt asks for actual conventions.
|
|
173
|
+
- **Spec.** `spec/benchmarks/e2e/harm_spec.rb` runs each scenario through the e2e harness (`ClaudeCliRunner`) with memory enabled; scores `harm` if any `harm_patterns` matches, `safe` otherwise. Stub mode validates schema + regex compile (matches #32 pattern). Real mode reports harm rate with $-cost printed.
|
|
174
|
+
- **Release gate.** `bin/run-evals --all` aggregates harm rate; `> 1%` blocks release. Threshold tunable via `HARM_RATE_THRESHOLD` env var. The `/release` skill reads the latest result JSON (#52 below) before publishing.
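A minimal sketch of the stub-mode validation, per-scenario scoring, and release gate described above. The scenario content is invented, the helper names are hypothetical, case-insensitive matching is an assumption, and `HARM_RATE_THRESHOLD` is read here as a percentage because the text gives the default as `> 1%`.

```ruby
require "yaml"

# Illustrative scenario; field names follow the list above, values are made up.
SCENARIOS_YAML = <<~'YAML'
  - id: stale-tech-001
    memory_facts:
      - "uses_database = MySQL"
    prompt: "Which database should the new migration target?"
    harm_patterns:
      - '\bMySQL\b'
    safe_indicators:
      - 'PostgreSQL'
YAML

REQUIRED_KEYS = %w[id memory_facts prompt harm_patterns].freeze

# Stub-mode check: required keys present and every pattern compiles.
def validate_scenario!(scenario)
  missing = REQUIRED_KEYS - scenario.keys
  raise ArgumentError, "#{scenario["id"]}: missing #{missing.join(", ")}" unless missing.empty?
  (scenario["harm_patterns"] + scenario.fetch("safe_indicators", [])).each do |src|
    Regexp.new(src) # raises RegexpError on a malformed pattern
  end
  scenario
end

# A response counts as harm when any harm pattern matches; safe otherwise.
def score_response(scenario, response)
  harmed = scenario["harm_patterns"].any? do |src|
    Regexp.new(src, Regexp::IGNORECASE).match?(response)
  end
  harmed ? :harm : :safe
end

# Release gate: block when the harm rate (percent) exceeds the threshold.
def release_blocked?(outcomes, threshold_pct: Float(ENV.fetch("HARM_RATE_THRESHOLD", "1")))
  return false if outcomes.empty?
  harm_rate = 100.0 * outcomes.count(:harm) / outcomes.size
  harm_rate > threshold_pct
end

scenarios = YAML.safe_load(SCENARIOS_YAML).each { |s| validate_scenario!(s) }
outcomes = [
  score_response(scenarios.first, "Target MySQL, as recorded in memory."),
  score_response(scenarios.first, "The Gemfile points to PostgreSQL, so target that.")
]
```

Single-quoted YAML strings and a non-interpolating heredoc keep the `\b` anchors literal, which avoids a common source of silently broken patterns.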
|
|
175
|
+
|
|
176
|
+
**Acceptance.**
|
|
177
|
+
|
|
178
|
+
- Stub run validates 10-15 scenarios pass schema/regex checks.
|
|
179
|
+
- Real run prints `Harm rate: X/N (Y%)` with per-scenario passes/fails and `safe_indicators` stats.
|
|
180
|
+
- Release script refuses to publish when harm rate exceeds threshold.
|
|
181
|
+
- Dashboard shows latest harm rate alongside other benchmark scores once #52 lands.
|
|
182
|
+
|
|
183
|
+
**Edge cases.**
|
|
184
|
+
|
|
185
|
+
- `harm_patterns` regexes need to be specific enough that "I'm not sure" doesn't match. Lean on the same diagnostic discipline as #32 (positive `safe_indicators` for ambiguous cases).
|
|
186
|
+
- Scenario IDs need stable docids so we can track which scenarios regress release-to-release once #52 lands.
|
|
187
|
+
- No `acceptance_keywords` — the metric is *absence* of harm, not positive proof of correctness.
|
|
188
|
+
|
|
189
|
+
**Effort.** ~2 days. Dataset is the bulk of the time (real-world wrong-fact patterns drawn from the existing audit notes — Sequel.sqlite, hallucination CLAUDE.md example, Rails-vs-React conflicts).
|
|
190
|
+
|
|
191
|
+
**Why high priority.** Closes the "is this strictly better than no memory" question. Pairs with #50 (CLAUDE.md baseline) so we can publish "vs no memory: harmless; vs CLAUDE.md: superior".
|
|
192
|
+
|
|
193
|
+
---
|
|
194
|
+
|
|
195
|
+
### 50. Publish CLAUDE.md Baseline in Headline E2E Results
|
|
196
|
+
|
|
197
|
+
Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #4)
|
|
198
|
+
|
|
199
|
+
**Gap.** `spec/benchmarks/comparative/adapters/claude_md_adapter.rb` exists, supports E2E (`supports_e2e?` returns true, `setup_for_claude` writes a real CLAUDE.md), and is registered in `comparative_helper.rb`. But the README's headline comparative table doesn't include it. The single most important question for adoption — *"is this better than a hand-written CLAUDE.md?"* — is unanswered in our published numbers.
|
|
200
|
+
|
|
201
|
+
**Implementation.**
|
|
202
|
+
|
|
203
|
+
- **Surface in comparative E2E spec.** `spec/benchmarks/comparative/e2e/comparative_e2e_spec.rb` already iterates adapters via `ComparativeHelpers.adapters`; ensure CLAUDE.md baseline is included in the iteration (verify by reading the spec — likely needs an `if adapter.supports_e2e?` guard tweak).
|
|
204
|
+
- **Reporter changes.** `spec/benchmarks/comparative/reporting/comparative_reporter.rb` already supports multi-adapter rows. Confirm CLAUDE.md row appears in markdown + terminal output.
|
|
205
|
+
- **README publishing.** `spec/benchmarks/README.md` "Comparative Results" section gets a new E2E table showing pass rate per ability category for ClaudeMemory vs CLAUDE.md baseline vs No memory. Run `EVAL_MODE=real ./bin/run-evals --comparative` once and paste the result.
|
|
206
|
+
- **Release gate.** Add a soft gate in `/release` skill: warn (don't block) if ClaudeMemory's E2E pass rate isn't materially above CLAUDE.md baseline. Threshold: 5% absolute pass-rate margin. Tunable.
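The soft gate amounts to a margin check that warns without failing. A small sketch, assuming pass rates are expressed in percent and that meeting the margin exactly counts as passing; `baseline_warning` is a hypothetical name.

```ruby
# Soft gate: returns a warning string, or nil when the margin is met.
def baseline_warning(memory_pass_pct:, claude_md_pass_pct:, min_margin_pct: 5.0)
  margin = memory_pass_pct - claude_md_pass_pct
  return nil if margin >= min_margin_pct
  format("E2E margin over CLAUDE.md baseline is %.1f points; expected at least %.1f",
         margin, min_margin_pct)
end

warning = baseline_warning(memory_pass_pct: 80.0, claude_md_pass_pct: 78.0)
```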
|
|
207
|
+
|
|
208
|
+
**Acceptance.**
|
|
209
|
+
|
|
210
|
+
- README has a "ClaudeMemory vs CLAUDE.md baseline" E2E pass-rate table with a brief commentary on when each wins.
|
|
211
|
+
- Comparative reporter prints CLAUDE.md row inline with QMD/grepai/no-memory.
|
|
212
|
+
- README "Key takeaways" updated to include the ClaudeMemory-vs-CLAUDE.md comparison as a top-line finding.
|
|
213
|
+
|
|
214
|
+
**Edge cases.**
|
|
215
|
+
|
|
216
|
+
- CLAUDE.md baseline returns `[]` for `search()` — that's fine, retrieval comparison already handles this (it's a No-Retrieval row in retrieval results). The E2E story is the one we care about.
|
|
217
|
+
- The static CLAUDE.md grows unbounded with our test fact set (105 facts). That's the baseline's *actual* ergonomics — don't artificially shrink it. If CLAUDE.md beats us in E2E because Claude can read everything, that's a genuine signal.
|
|
218
|
+
|
|
219
|
+
**Effort.** ~30 min code + one $2-8 real-mode run.
|
|
220
|
+
|
|
221
|
+
**Why high priority.** Cheapest item on the list. If we can't beat a static CLAUDE.md on developer scenarios, that's the loudest possible "we're not done" signal; if we can, that's the headline 1.0 brag.
|
|
222
|
+
|
|
223
|
+
---
|
|
224
|
+
|
|
225
|
+
### 51. `claude-memory show` — Human-Readable "What Would Be Injected"
|
|
226
|
+
|
|
227
|
+
Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #5)
|
|
228
|
+
|
|
229
|
+
**Gap.** Inspecting memory state today requires the dashboard or several CLI commands (`recall`, `stats`, `census`). The CLAUDE.md alternative is `cat CLAUDE.md` — instant, plain-English, no tool. Users develop trust through inspectability, and we're missing the simplest possible inspect surface.
|
|
230
|
+
|
|
231
|
+
**Implementation.**
|
|
232
|
+
|
|
233
|
+
- **New command.** `lib/claude_memory/commands/show_command.rb` registered in `Commands::Registry`. Construct a `Hook::ContextInjector` against the current manager (`source: nil` → behaves as a startup session for the fresh-session sections), call `generate_context`, and print the result. That's the same path real sessions use, so the output is *exactly* what would be injected.
|
|
234
|
+
- **Plain-English rendering.** ContextInjector already returns markdown; the command pipes it through `less` if `STDOUT.tty?` and `--paginate` (default true). `--raw` flag dumps the unprocessed string for diffing across runs.
|
|
235
|
+
- **Section flags.** `--decisions`, `--conventions`, `--architecture`, `--undistilled`, `--mirror` filter to specific sections. Default is all sections.
|
|
236
|
+
- **Sized for terminal.** Existing `MAX_TEXT_PER_ITEM` (1500 chars) and per-section limits already cap output.
|
|
237
|
+
- **Token reporting.** When #47 lands, `claude-memory show` prints a footer line: `(Estimated cost: ~N tokens; X% of 200k context window.)` so the user sees the trade in the same view.
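The footer line could be produced roughly like this. The sketch assumes the same chars/4 estimate mentioned in #47 and rounds up; `cost_footer` is a hypothetical helper, not part of the gem.

```ruby
CHARS_PER_TOKEN = 4      # assumed ratio, matching the /4 fallback in #47
CONTEXT_WINDOW = 200_000 # window size quoted in the footer text above

# Formats the estimated-cost footer for a rendered context string.
def cost_footer(context_text)
  tokens = (context_text.length / CHARS_PER_TOKEN.to_f).ceil
  pct = 100.0 * tokens / CONTEXT_WINDOW
  format("(Estimated cost: ~%d tokens; %.1f%% of 200k context window.)", tokens, pct)
end

footer = cost_footer("x" * 8000)
```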
|
|
238
|
+
|
|
239
|
+
**Acceptance.**
|
|
240
|
+
|
|
241
|
+
- `claude-memory show` runs in <1s on a populated DB and prints what next session would see.
|
|
242
|
+
- `claude-memory show --raw` is suitable for diffing (deterministic ordering already enforced by `Recall#query`).
|
|
243
|
+
- `claude-memory show --decisions` works for narrow inspection.
|
|
244
|
+
|
|
245
|
+
**Edge cases.**
|
|
246
|
+
|
|
247
|
+
- Empty DB: print "No facts in memory yet. Try `claude-memory hook context` after a few sessions." rather than empty output.
|
|
248
|
+
- Fresh-session-only sections (undistilled, mirror) only show with `--source startup` or by default. `--no-fresh` suppresses them for the steady-state view.
|
|
249
|
+
- ContextInjector currently auto-commits the auto-memory mirror state on emission (context_injector.rb:67); the show command must pass an injector that *doesn't* commit, or the act of inspecting alters state. Two options: (a) add a `read_only:` flag to ContextInjector, (b) construct a no-op AutoMemoryMirror double in the show command. (a) is cleaner.
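Option (a) can be illustrated with stand-in classes. Everything here is hypothetical: `FakeMirror` and `SketchInjector` are not the gem's `AutoMemoryMirror` or `ContextInjector`, and the real constructor arguments are not shown in this document. The point is only that a `read_only:` flag lets the inspect path skip the commit.

```ruby
# Hypothetical stand-in for the auto-memory mirror; counts commits.
class FakeMirror
  attr_reader :commits

  def initialize
    @commits = 0
  end

  def pending_text
    "mirror: 2 new auto-memory notes"
  end

  def commit!
    @commits += 1
  end
end

# Hypothetical injector showing the read_only: flag from option (a).
class SketchInjector
  def initialize(mirror:, read_only: false)
    @mirror = mirror
    @read_only = read_only
  end

  # Emits context; commits mirror state only for non-inspecting callers.
  def generate_context
    text = @mirror.pending_text
    @mirror.commit! unless @read_only
    text
  end
end

mirror = FakeMirror.new
shown = SketchInjector.new(mirror: mirror, read_only: true).generate_context
commits_after_show = mirror.commits
SketchInjector.new(mirror: mirror).generate_context
commits_after_session = mirror.commits
```

Defaulting the flag to `false` keeps existing callers unchanged, so only the show command has to opt in.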
|
|
250
|
+
|
|
251
|
+
**Effort.** Half a day.
|
|
252
|
+
|
|
253
|
+
**Why high priority.** Trust requires inspectability. A user who can't see what memory will inject can't develop confidence in it. This is the answer to "show me, don't tell me."
|
|
254
|
+
|
|
255
|
+
---
|
|
256
|
+
|
|
257
|
+
### 52. Release-to-Release Benchmark Scoreboard
|
|
258
|
+
|
|
259
|
+
Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #6)
|
|
260
|
+
|
|
261
|
+
**Gap.** Benchmark output is textual today (`spec/benchmarks/comparative/reporting/comparative_reporter.rb` + per-spec `puts`). Nothing diff-able across versions. The only reason we caught the BM25 normalization regression was a manual run. 1.0 is the moment we commit to *not regressing* what we ship; we need machine-readable longitudinal results.
|
|
262
|
+
|
|
263
|
+
**Implementation.**
|
|
264
|
+
|
|
265
|
+
- **JSON output sink.** New `BenchmarkHelpers::ResultsWriter` module in `spec/benchmarks/benchmark_helper.rb`. Each benchmark spec calls `ResultsWriter.record(suite:, metrics:)` after computing its metrics. Writer accumulates into a single `spec/benchmarks/results/<version>-<timestamp>.json` per run, plus a `spec/benchmarks/results/latest.json` symlink.
|
|
266
|
+
- **Schema.** Top-level `{version:, run_at:, suites: {retrieval: {...}, resolution: {...}, distillation: {...}, e2e: {...}, harm: {...}, comparative: {...}}}`. Per-suite metrics match what's already printed today.
|
|
267
|
+
- **Diff command.** `bin/bench-diff [--against TAG]` reads the latest results JSON and the JSON for the named tag (default: previous tag from `git tag --sort=-creatordate`). Prints color-coded deltas for each metric. Threshold for "regression" is per-metric (e.g. Recall@5 ±2%, MRR ±3%, harm rate must not increase at all).
|
|
268
|
+
- **Release gate.** `/release` skill reads `latest.json` and the previous version's JSON before bumping; refuses to ship on regressions over threshold. Override with `--force-regression` for explicit acknowledgments (e.g. an intentional algorithm change).
|
|
269
|
+
- **Storage.** Results JSON committed to repo (small, <50KB per run) so any contributor can `bin/bench-diff` historically. `.gitignore` excludes intermediate timestamped files; only the per-version stable file is committed.
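A rough sketch of the JSON sink. Only `record(suite:, metrics:)`, the file naming, and the top-level schema come from the bullets above; the `flush` and `reset!` methods, the timestamp format, and the metric values are assumptions, and the module is shown un-nested for brevity.

```ruby
require "json"
require "time"
require "fileutils"
require "tmpdir"

# Sketch of a results sink; accumulates suites, then writes one file per run.
module ResultsWriter
  @suites = {}

  class << self
    # Accumulates one suite's metrics in memory.
    def record(suite:, metrics:)
      @suites[suite.to_s] = metrics
    end

    # Writes <version>-<timestamp>.json and repoints latest.json at it.
    def flush(dir:, version:, now: Time.now.utc)
      FileUtils.mkdir_p(dir)
      name = "#{version}-#{now.strftime("%Y%m%dT%H%M%SZ")}.json"
      payload = {version: version, run_at: now.iso8601, suites: @suites}
      File.write(File.join(dir, name), JSON.pretty_generate(payload))
      latest = File.join(dir, "latest.json")
      FileUtils.rm_f(latest)
      File.symlink(name, latest) # relative target, resolved inside dir
      File.join(dir, name)
    end

    def reset!
      @suites = {}
    end
  end
end

# Made-up metric values for illustration only.
dir = Dir.mktmpdir
ResultsWriter.record(suite: :retrieval, metrics: {"recall_at_5" => 0.82, "mrr" => 0.71})
path = ResultsWriter.flush(dir: dir, version: "0.10.0")
latest = JSON.parse(File.read(File.join(dir, "latest.json")))
```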
|
|
270
|
+
|
|
271
|
+
**Acceptance.**
|
|
272
|
+
|
|
273
|
+
- Running `bin/run-evals --all` writes `spec/benchmarks/results/<version>.json`.
|
|
274
|
+
- `bin/bench-diff` shows a clear delta table when there are changes.
|
|
275
|
+
- `/release` warns/blocks on regressions per the threshold.
|
|
276
|
+
- README "Latest Results" section is auto-generated from the JSON via a rake task to stop drift.
|
|
277
|
+
|
|
278
|
+
**Edge cases.**
|
|
279
|
+
|
|
280
|
+
- Stub mode (no real Claude) only fills retrieval/resolution/distillation suites; e2e/harm/comparative sections are absent. Diff command tolerates missing keys.
|
|
281
|
+
- Comparative results vary by adapter availability — schema accommodates absent adapters without diffing them as regressions.
|
|
282
|
+
- First run has no prior JSON: `bin/bench-diff` prints "no baseline" and `/release` proceeds without gating.
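The tolerance rules in these edge cases can be sketched as a diff over two parsed result hashes. The metric key names are assumptions, the thresholds reuse the examples given earlier (Recall@5 ±2%, MRR ±3%, harm rate must not increase) read as absolute differences on a 0 to 1 scale, and `bench_diff` is a hypothetical function rather than the planned script.

```ruby
# Per-metric regression tolerances; key names are assumed.
TOLERANCE = {"recall_at_5" => 0.02, "mrr" => 0.03, "harm_rate" => 0.0}.freeze
LOWER_IS_BETTER = %w[harm_rate].freeze

# Returns :no_baseline, or rows of {suite:, metric:, delta:, regression:}.
def bench_diff(baseline, latest)
  return :no_baseline if baseline.nil?
  rows = []
  latest.fetch("suites", {}).each do |suite, metrics|
    old = baseline.fetch("suites", {})[suite]
    next if old.nil? # suite absent from baseline: skip, not a regression
    metrics.each do |metric, value|
      next unless old.key?(metric)
      delta = value - old[metric]
      worse = LOWER_IS_BETTER.include?(metric) ? delta : -delta
      rows << {suite: suite, metric: metric, delta: delta.round(4),
               regression: worse > TOLERANCE.fetch(metric, 0.0)}
    end
  end
  rows
end

# Made-up results for illustration only.
baseline = {"suites" => {"retrieval" => {"recall_at_5" => 0.82, "mrr" => 0.70},
                         "harm" => {"harm_rate" => 0.0}}}
latest = {"suites" => {"retrieval" => {"recall_at_5" => 0.79, "mrr" => 0.69},
                       "e2e" => {"pass_rate" => 0.9}}}
rows = bench_diff(baseline, latest)
```

Iterating over the latest run's suites means a stub-mode run that lacks e2e or harm sections is never compared against them.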
|
|
283
|
+
|
|
284
|
+
**Effort.** ~1 day. Mostly plumbing; the metrics already exist as Ruby variables in the specs.
|
|
285
|
+
|
|
286
|
+
**Why high priority.** Without longitudinal tracking every benchmark we run is a snapshot. Pairs with #49 (harm benchmark) — the harm rate is the metric most worth tracking release-to-release.
|
|
287
|
+
|
|
288
|
+
---
|
|
289
|
+
|
|
290
|
+
## cq Study (2026-04-28)
|
|
291
|
+
|
|
292
|
+
Source: docs/influence/cq.md — usefulness-focused study (not internals)
|
|
293
|
+
|
|
294
|
+
cq is complementary to ClaudeMemory, not competing: it's an out-of-band SQL audit tool over raw Claude Code transcripts (DuckDB cache + `tool_calls`/`messages`/`sessions` views), aimed at meta-questions like "is my skill firing?" or "where did context go in that bad session?" ClaudeMemory has data parity for the per-project case (its own `tool_calls` table) but lacks cross-project SQL ergonomics.
|
|
295
|
+
|
|
296
|
+
### High Priority Recommendations
|
|
297
|
+
|
|
298
|
+
- [ ] **Install cq as a developer audit tool for the ClaudeMemory plugin itself**
|
|
299
|
+
- Value: Answer "is the memory plugin firing when it should?" — currently unanswerable
|
|
300
|
+
- Evidence: cq's three documented patterns (skill-activation gap, silent failure, context budget) translate directly; only predicate names change
|
|
301
|
+
- Effort: 5 minutes (`cargo install --git https://github.com/technicalpickles/cq`)
|
|
302
|
+
- Trade-off: Adds Rust toolchain dep on dev machine; runs out-of-band so no project impact
|
|
303
|
+
|
|
304
|
+
- [x] **Capture reference audit queries in `docs/audit-queries.md`** (2026-04-28)
|
|
305
|
+
- Five queries: activation rate, missed memory-shaped prompts, tool ranking, error rate, result-size distribution
|
|
306
|
+
- Each runnable as `cq sql "..." --since 30d --table` against Claude Code transcripts (not ClaudeMemory's own SQLite — cq sees calls that bypassed the MCP server entirely)
|
|
307
|
+
- Re-run before each release, after MCP server instruction changes, or when investigating "memory doesn't seem to do anything" reports
|
|
308
|
+
|
|
309
|
+
### Features to Avoid (from this study)
|
|
310
|
+
|
|
311
|
+
- DuckDB as a primary store — wrong tool for the curation workload
|
|
312
|
+
- Cross-project default scoping — breaks ClaudeMemory's project/global memory separation
|
|
313
|
+
- Re-indexing transcripts on every command — ClaudeMemory's hook-driven ingest is already the right pattern
|
|
314
|
+
|
|
91
315
|
---
|
|
92
316
|
|
|
93
317
|
## Medium Priority
|
|
@@ -104,6 +328,120 @@ IndexCommand builds text→embedding cache from already-embedded facts before in
|
|
|
104
328
|
|
|
105
329
|
In Ruby fallback path (`search_by_vector_fallback`), facts are grouped by `embedding_json` before cosine similarity computation. Unique embeddings scored once, results fanned out to all matching fact_ids. Native sqlite-vec path unaffected (handles own dedup).
|
|
106
330
|
|
|
331
|
+
### 53. First-Week ROI Nudge
|
|
332
|
+
|
|
333
|
+
Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #7). Closes the cold-start gap.
|
|
334
|
+
|
|
335
|
+
**Gap.** New users install the gem, run a few sessions, and don't know whether memory is working. The dashboard exists but they have to know to look. The auto-memory mirror (#36) helps but isn't surfaced. We need a low-friction nudge in the first ~10 sessions that says "memory is working, here's what it did" — and then gets out of the way.
|
|
336
|
+
|
|
337
|
+
**Implementation.**
|
|
338
|
+
|
|
339
|
+
- **New hook command.** `claude-memory hook session-end-summary` runs on SessionEnd alongside the existing ingest/sweep. Reads the most recent `hook_context` activity event for the current session_id; emits a `systemMessage` (or `additionalContext` if the spec supports it for SessionEnd) summarizing: facts injected, % used, top subjects.
|
|
340
|
+
- **Sentinel.** Tracked in a new `Configuration#session_count` (or `.claude/.session_counter`) — only emit on sessions 1–10. After 10, the user has either seen enough or doesn't care; turn it off so we don't become noise.
|
|
341
|
+
- **Hooks config.** `HooksConfigurator#build_hooks_config` (hooks_configurator.rb:130) gains the new command in the SessionEnd block.
|
|
342
|
+
- **Opt-out.** `CLAUDE_MEMORY_NO_NUDGE=1` disables.
|
|
343
|
+
|
|
344
|
+
**Acceptance.**
|
|
345
|
+
|
|
346
|
+
- Sessions 1-10 print a one-line "memory contributed N facts; you used Y of them" summary at session end.
|
|
347
|
+
- Session 11+ stays silent unless the user opts in via `CLAUDE_MEMORY_ALWAYS_NUDGE=1`.
|
|
348
|
+
- Telemetry: each emitted nudge logs an `activity_event` so we can track whether users disable it (rough proxy for noise).
|
|
349
|
+
|
|
350
|
+
**Edge cases.**
|
|
351
|
+
|
|
352
|
+
- Sessions where `generate_context` returned nil: don't emit the nudge — there's nothing to celebrate.
|
|
353
|
+
- Multi-window sessions / tab-switches: the session counter is per-(project_path, claude_config_dir), not global. Two projects = two independent first-week windows.
|
|
354
|
+
- "% used" needs a recall event in the same session to compute; absent that, fall back to "memory contributed N facts (use them via /memory-recall)".
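The sentinel, opt-out, opt-in, and fallback wording above can be combined into two small decisions. A sketch under assumptions: `nudge?` and `nudge_message` are hypothetical names, and letting `CLAUDE_MEMORY_NO_NUDGE` win over `CLAUDE_MEMORY_ALWAYS_NUDGE` when both are set is a guess at the intended precedence.

```ruby
NUDGE_SESSIONS = (1..10).freeze

# Decides whether the session-end summary should be emitted.
def nudge?(session_count:, context_emitted:, env: ENV)
  return false if env["CLAUDE_MEMORY_NO_NUDGE"] == "1"
  return false unless context_emitted # nil context: nothing to report
  return true if env["CLAUDE_MEMORY_ALWAYS_NUDGE"] == "1"
  NUDGE_SESSIONS.cover?(session_count)
end

# Builds the one-line summary, falling back when usage is unknown.
def nudge_message(facts_injected:, facts_used: nil)
  if facts_used
    "memory contributed #{facts_injected} facts; you used #{facts_used} of them"
  else
    "memory contributed #{facts_injected} facts (use them via /memory-recall)"
  end
end

message = nudge_message(facts_injected: 8, facts_used: 3)
```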
|
|
355
|
+
|
|
356
|
+
**Effort.** ~half a day.
|
|
357
|
+
|
|
358
|
+
**Why post-1.0.** Nice onboarding polish, not a confidence gap. The token-budget, hallucination, and harm metrics in the must-have set already give the skeptic the answer they need.
|
|
359
|
+
|
|
360
|
+
---
|
|
361
|
+
|
|
362
|
+
### 54. Real-Session Repeat-Correction Detection
|
|
363
|
+
|
|
364
|
+
Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #8). Production-side companion to #32 (synthetic harness).
|
|
365
|
+
|
|
366
|
+
**Gap.** The repeat-correction benchmark fires synthetic prompts and asks "did Claude repeat itself?". Production has no equivalent signal. When a user re-states something memory already injected, that's the strongest possible "memory failed silently" signal — and we don't capture it.
|
|
367
|
+
|
|
368
|
+
**Implementation.**
|
|
369
|
+
|
|
370
|
+
- **Detector.** New `Sweep::RepeatCorrectionDetector` (parallel to `Sweep::RecallTimestampRefresher`). Runs in the sweep cycle; reads `activity_events` for `event_type='hook_context'` over the last 7 days. For each session, takes the `top_subjects` (from `detail_json`) and looks at the next ingested transcript chunk for prompts that mention the same subject in a "we discussed this" / "I told you" / correction-shaped way.
|
|
371
|
+
- **Signal extraction.** Regex-light heuristic against ingested content: `/\b(again|already|told you|previously|as I said|reminder)\b/i` AND a subject keyword from the prior injection's `top_subjects`.
|
|
372
|
+
- **Surface.** New dashboard panel "Memory misses (last 30d)" + a `--missed` flag on `claude-memory stats`. Each row links to the offending session and the subject that was injected but not heeded.
|
|
373
|
+
- **Privacy posture.** Only surfaces subject names + session IDs, never the user's full prompt text. Same posture as census.
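
A minimal sketch of the signal-extraction heuristic described above, assuming the prompt text and the prior injection's `top_subjects` are already in hand. The regex is the one quoted in the bullet; `repeat_correction?` is a hypothetical helper name, and plain substring matching for the subject keyword is an assumption.

```ruby
# Cue regex copied from the proposal; everything else is illustrative.
CORRECTION_CUE = /\b(again|already|told you|previously|as I said|reminder)\b/i

# True only when a correction-shaped cue AND a previously injected subject
# both appear in the prompt (hypothetical helper, not the gem's API).
def repeat_correction?(prompt, top_subjects)
  return false unless prompt.match?(CORRECTION_CUE)

  text = prompt.downcase
  top_subjects.any? { |subject| text.include?(subject.downcase) }
end
```

Requiring both conditions matches the false-negative bias this entry asks for: a cue word alone, or a subject mention alone, is not flagged.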
|
|
374
|
+
|
|
375
|
+
**Acceptance.**
|
|
376
|
+
|
|
377
|
+
- Stats command shows actionable list of "memory was injected but the user re-corrected" cases.
|
|
378
|
+
- Dashboard surfaces these with a link to the originating fact so users can act (reject / promote / rephrase).
|
|
379
|
+
- Aggregate "miss rate" appears in digest as a 30d trend.
|
|
380
|
+
|
|
381
|
+
**Edge cases.**
|
|
382
|
+
|
|
383
|
+
- Heuristic is lossy: we'll miss real misses and flag false positives. Treat it as a trend signal, not a precision tool, the same posture as `relevance_ratio` (#31).
|
|
384
|
+
- Need to disambiguate "user re-stated for emphasis" vs "memory failed". Lean toward false-negative bias (only flag obvious cases) so the panel isn't crying wolf.
|
|
385
|
+
|
|
386
|
+
**Effort.** ~2 days. Detector logic is the bulk; the UI is a straightforward addition.
|
|
387
|
+
|
|
388
|
+
**Why post-1.0.** Good signal but not blocking — the synthetic harness in #32 already gives release-time guarantees. Production-side measurement is icing.
|
|
389
|
+
|
|
390
|
+
---
|
|
391
|
+
|
|
392
|
+
### 55. Token-Cost Growth Tracking
|
|
393
|
+
|
|
394
|
+
Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #9). Builds on #47 (token budget telemetry).
|
|
395
|
+
|
|
396
|
+
**Gap.** Once #47 is recording `context_tokens` per session, the next question is: *is it growing?* DB bloat or overly broad context injection should be visible as an anomaly, not discoverable only by manual census.
|
|
397
|
+
|
|
398
|
+
**Implementation.**
|
|
399
|
+
|
|
400
|
+
- **Digest section.** `Commands::DigestCommand` adds a "Context cost trend" line: `current 7d avg vs 30d avg (delta %)`. Same window comparison shape as the existing `weekly_moments`.
|
|
401
|
+
- **Dashboard widget.** Trust panel's `token_budget` block (added in #47) gains `growth_30d` and `growth_7d` fields with color coding (>20% growth = yellow, >50% = red).
|
|
402
|
+
- **Alert threshold.** New `Configuration#token_growth_alert_pct` (default 30) controls the "is this concerning" line. Configurable via env var.
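
The window comparison and color coding above amount to a small calculation. The sketch below assumes the 7d and 30d averages are already computed; `growth_pct` and `growth_severity` are hypothetical names, and the thresholds are the ones stated in the widget bullet.

```ruby
# Percentage change of the recent average against the baseline average.
# Returns nil when there is no baseline to compare against.
def growth_pct(recent_avg, baseline_avg)
  return nil if baseline_avg.nil? || baseline_avg.zero?

  ((recent_avg - baseline_avg) * 100.0 / baseline_avg).round(1)
end

# Maps growth to the color coding from the proposal (>20% yellow, >50% red).
def growth_severity(pct)
  return :none if pct.nil?
  return :red if pct > 50
  return :yellow if pct > 20

  :ok
end
```

For example, a 7d average of 1300 tokens against a 30d average of 1000 is 30% growth, which lands in the yellow band.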
|
|
403
|
+
|
|
404
|
+
**Acceptance.**
|
|
405
|
+
|
|
406
|
+
- Digest shows directional trend at a glance.
|
|
407
|
+
- Dashboard surfaces sustained growth with appropriate severity.
|
|
408
|
+
|
|
409
|
+
**Effort.** ~3 hours after #47 lands.
|
|
410
|
+
|
|
411
|
+
**Why post-1.0.** Pure derivation from #47's data; doesn't add new instrumentation.
|
|
412
|
+
|
|
413
|
+
---
|
|
414
|
+
|
|
415
|
+
### 56. Drift Dashboard
|
|
416
|
+
|
|
417
|
+
Source: 2026-04-28 1.0 readiness review (`docs/1_0_punchlist.md` #10). Builds on #30 (Predicate Census).
|
|
418
|
+
|
|
419
|
+
**Gap.** `claude-memory census` (#30) gives a one-shot privacy-safe scan but it's not longitudinal. "Is my fact base going off?" requires comparing today's predicate distribution against historical ones — which today only exists in a user's git history of committed `.claude/memory.sqlite3` (and we don't recommend committing that).
|
|
420
|
+
|
|
421
|
+
**Implementation.**
|
|
422
|
+
|
|
423
|
+
- **Snapshot store.** New table `census_snapshots` (schema migration vNN) stores compact aggregates: `{snapshotted_at, predicate, status, count, scope}`. Bounded retention (keep last 12 weeks).
|
|
424
|
+
- **Capture.** Sweep cycle records a snapshot weekly (gated by "last snapshot > 6 days ago"). Cheap — single aggregate query.
|
|
425
|
+
- **Dashboard panel.** "Distribution drift" widget shows a small sparkline per predicate over the last 12 weeks. Anomalies (predicate count drops >50%, or rises >200%) get highlighted.
|
|
426
|
+
- **CLI.** `claude-memory drift` prints a text-mode version of the dashboard widget for terminal users.
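
The anomaly rule for the drift widget could look roughly like this. It is a sketch under one loud assumption: "drops >50%" and "rises >200%" are read as percentage change between two consecutive snapshots. `drift_anomaly` is a hypothetical name.

```ruby
# Classifies the change in one predicate's count between two snapshots.
# Assumption: thresholds are percentage change relative to the previous count.
def drift_anomaly(previous_count, current_count)
  return nil if previous_count.nil? || previous_count.zero? # no baseline yet

  change_pct = (current_count - previous_count) * 100.0 / previous_count
  return :drop if change_pct < -50
  return :spike if change_pct > 200

  nil
end
```

Under this reading, the "dropped 40%" case from the acceptance list shows up in the sparkline but is not highlighted as an anomaly.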
|
|
427
|
+
|
|
428
|
+
**Acceptance.**
|
|
429
|
+
|
|
430
|
+
- Dashboard shows predicate distribution sparklines.
|
|
431
|
+
- A user who's been running the gem for 3 months can see "convention facts dropped 40% this week — what happened?".
|
|
432
|
+
- Snapshots stay <100KB total over 12 weeks (bounded by predicate × status × scope cardinality).
|
|
433
|
+
|
|
434
|
+
**Edge cases.**
|
|
435
|
+
|
|
436
|
+
- Fresh installs have no historical snapshots. Widget hides until 2+ snapshots exist.
|
|
437
|
+
- Schema migration touches the gem-core schema; needs round-trip migration tests per #f1fe317.
|
|
438
|
+
|
|
439
|
+
**Effort.** ~1.5 days.
|
|
440
|
+
|
|
441
|
+
**Why post-1.0.** Useful longitudinal signal but the must-have set already gives the headline confidence numbers. Drift is the "operate it long-term" question.
|
|
442
|
+
|
|
443
|
+
---
|
|
444
|
+
|
|
107
445
|
### 21. Incremental Indexing with File Watching
|
|
108
446
|
|
|
109
447
|
Source: grepai study (reinforced 2026-03-02)
|
|
@@ -143,15 +481,82 @@ Source: QMD v2.0.1+unreleased re-study (2026-03-30)
|
|
|
143
481
|
- **Effort**: 2-3 days (after #22)
|
|
144
482
|
- **Trade-off**: Adds tree-sitter dependency; graceful fallback to regex-only chunking when grammar unavailable
|
|
145
483
|
|
|
146
|
-
### 30. Predicate Census Command
|
|
484
|
+
### ~~30. Predicate Census Command~~ ✅ Implemented 2026-04-20
|
|
485
|
+
|
|
486
|
+
`claude-memory census [--root DIR]` scans every `.claude/memory.sqlite3` under the root (plus the global DB unless `--no-global`), aggregates per-DB predicate × status counts, entity type counts, schema versions, novel predicates, and synonym candidates (Jaccard token overlap ≥ 0.4 against `PredicatePolicy.known_predicates`). Emits privacy-safe JSON — no object_literal, no entity names, no paths, no quotes; per-DB entries carry an SHA256-prefixed id rather than a path. Supports `--output FILE`, `--pretty`.
|
|
487
|
+
|
|
488
|
+
### ~~31. Relevance Ratio Metric for Eval Suite~~ ✅ Implemented 2026-04-20
|
|
489
|
+
|
|
490
|
+
Offline plumbing landed; the real-mode measurement will materialize the first time someone runs `EVAL_MODE=real` against the e2e suite.
|
|
491
|
+
|
|
492
|
+
- `Hook::ContextInjector` now exposes `emitted_fact_ids` / `emitted_subjects` reader accessors populated during `generate_context`. Existing callers unaffected — the context string return value is unchanged, tracking is a side channel.
|
|
493
|
+
- `BenchmarkHelpers::RelevanceMetrics` module in `spec/benchmarks/benchmark_helper.rb` adds `relevance_ratio(subjects, response)` — case-insensitive subject-substring match, deduped, returns 1.0 for empty-injection (keeps the metric monotone with recall semantics so it doesn't penalize abstention scenarios).
|
|
494
|
+
- `spec/benchmarks/e2e/devmemeval_spec.rb` captures injected subjects via a local `ContextInjector` against the scenario DB (same state in → same injection out — avoids having to scrape the running Claude process), computes the ratio against `result[:result]`, prints per-scenario `relevance=X.XX` alongside the existing score, and reports `avg relevance ratio` per ability group.
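
To make the matching semantics concrete, here is a minimal sketch of a metric with the properties described above (case-insensitive substring match, deduped subjects, 1.0 for an empty injection). It is not the code in `benchmark_helper.rb`; the name `relevance_ratio_sketch` is deliberately different.

```ruby
# Fraction of distinct injected subjects that appear in the response.
# Illustrative only; the real helper may differ in normalization details.
def relevance_ratio_sketch(subjects, response)
  unique = subjects.map { |s| s.to_s.downcase.strip }.reject(&:empty?).uniq
  return 1.0 if unique.empty? # empty injection: nothing to apply, no penalty

  text = response.to_s.downcase
  unique.count { |subject| text.include?(subject) }.to_f / unique.size
end
```

With subjects `["SQLite", "sqlite", "Rails"]` and a response that mentions only sqlite, the deduped set has two subjects and one match, so the ratio is 0.5.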
|
|
147
495
|
|
|
148
|
-
|
|
496
|
+
Response-side matching stays deliberately approximate — subject substring overlap. The metric is a trend signal (is memory being *applied*, not just retrieved), not a precision tool. Benchmark owner should sanity-check the first real-mode run and tighten the matcher if the ratios look implausibly high or low.
|
|
149
497
|
|
|
150
|
-
|
|
151
|
-
|
|
152
|
-
|
|
153
|
-
|
|
154
|
-
-
|
|
498
|
+
### ~~32. Repeat-Correction Benchmark~~ ⭐ Partially Implemented 2026-04-21
|
|
499
|
+
|
|
500
|
+
Harness landed with a 2-scenario starter set drawn from real, repeated corrections in the project's auto-memory (Sequel.sqlite adapter, rake-install/git-ls-files). Path to the 5–10 scenario set left for incremental growth.
|
|
501
|
+
|
|
502
|
+
- `spec/benchmarks/dataset/repeat_correction_scenarios.yml` — each scenario carries `memory_facts` (pre-loaded as a past session's correction), `prompt` (would re-trigger the bad pattern), and `violation_patterns` (regexes; any match = correction was repeated). Optional `expected_mentions` for diagnostic "correction aware" signal.
|
|
503
|
+
- `spec/benchmarks/e2e/repeat_correction_spec.rb` — stub mode validates schema + regex compile + fact loadability; real mode (`EVAL_MODE=real`) runs each prompt through Claude and reports pass rate. No hard assertion on pass rate yet — the metric is a trend signal; tighten once baseline data exists. Tagged `:benchmark :eval_real :slow` matching `devmemeval_spec.rb`.
|
|
504
|
+
- `BenchmarkHelpers::DatasetLoader.load_repeat_correction_scenarios` added for consistency with existing dataset loaders.
|
|
505
|
+
|
|
506
|
+
Deliberately no `acceptance_keywords`-style pass gate — the point is *absence* of the bad pattern, not positive proof of the good one. Per the improvements note, this runs nightly or on release, not per commit.
|
|
507
|
+
|
|
508
|
+
### ~~33. Conflict Cluster Audit — Fact 21 / 45 / 48~~ ✅ Implemented 2026-04-19/20
|
|
509
|
+
|
|
510
|
+
Audit completed inline during the dashboard Conflicts-tab work on 2026-04-19 and the cluster was eliminated via the resolver fixes shipped on 2026-04-20.
|
|
511
|
+
|
|
512
|
+
**Classification of the three anchor facts (all three were (b) distiller hallucination):**
|
|
513
|
+
|
|
514
|
+
- **Fact 21** (`repo uses_database sqlite`) — correct keeper. Contradictions came from CLAUDE.md example text ("this app uses PostgreSQL") being extracted as a literal claim. Fixed by rewriting the example in CLAUDE.md line 258 to self-describe the real stack ("claude_memory uses SQLite for storage") — commit `61666bc`.
|
|
515
|
+
- **Fact 45** (`repo uses_framework rails`) — correct keeper. Contradictions were artifacts of the `uses_framework` single→multi reclassification in 0.9.0; `claude-memory restore --predicate uses_framework` already exists for this case (0.9.0 CHANGELOG).
|
|
516
|
+
- **Fact 48** (`repo deployment_platform aws`) — correct keeper. Contradictions from platform-mention hallucinations; no further resolution machinery needed beyond rejecting contradicting rows.
|
|
517
|
+
|
|
518
|
+
**Delivered cleanup**: bulk-reject-similar UI in the Conflicts modal (commit `61666bc`), resolver dedup (commit `f571ba4`), scope-leakage fix (commit `50cf02e`). Project DB conflict count dropped from 31 → 15 during the session via bulk-reject, with further shrinkage from the dedup + scope passes. Going forward, the resolver's dedup and the CLAUDE.md rewrite prevent the same cluster from regenerating.
|
|
519
|
+
|
|
520
|
+
No separate `docs/conflict_audit_2026-04.md` file written — the classification and resolution are preserved in the relevant commit messages and memory entries.
|
|
521
|
+
|
|
522
|
+
### ~~34. "Why" Preservation Audit~~ ✅ Implemented 2026-04-20
|
|
523
|
+
|
|
524
|
+
Audit of 20 random project facts showed ~25% embed reasoning, ~75% are bare conclusions — a material gap. Updated two extraction surfaces to require a reason clause for `decision` and `convention` predicates:
|
|
525
|
+
|
|
526
|
+
- `lib/claude_memory/commands/skills/distill-transcripts.md` — added reasoning requirement to the Facts section, with contrasting ❌ bare / ✅ with-why examples drawn from the audit sample, plus a prefer-one-fact-with-reason-over-two-without guideline.
|
|
527
|
+
- `lib/claude_memory/hook/context_injector.rb#format_distillation_prompt` — added a **Reasoning requirement** block to the SessionStart extraction prompt that ships with every fresh session; locked in by a new spec assertion so the contract can't silently regress.
|
|
528
|
+
|
|
529
|
+
No schema change. Reasoning rides in `object_literal`. The plugin-copy mirror (`.claude-plugin/commands/distill-transcripts.md`) was left alone — it's already out of sync with the source skill on the predicate list and is manually maintained; a separate improvement should reconcile it.
|
|
530
|
+
|
|
531
|
+
### ~~36. Auto-Mirror Auto-Memory Observations into claude_memory on SessionStart~~ ⭐ Partially Implemented 2026-04-21
|
|
532
|
+
|
|
533
|
+
Core diff + emission landed. Dashboard indicator (pending mirror count) deferred until real-session usage data suggests the UI is needed.
|
|
534
|
+
|
|
535
|
+
- `Hook::AutoMemoryMirror` scans `~/.claude/projects/<slug>/memory/*.md` (slug = `project_path.tr("/", "-")`) and diffs each file's md5 against `.claude/auto_memory_mirror.json`. `pending_candidates(limit:)` returns only new/changed entries, sorted by mtime descending. Bounded at 5 per session, 1500 chars per entry.
|
|
536
|
+
- `Hook::ContextInjector#generate_context` appends an "Auto-Memory Mirror Candidates" section on fresh sessions (startup/resume/clear/nil source) when candidates exist, then `commit`s them as the new baseline so subsequent sessions won't re-emit unchanged files. Section explains the mirror is advisory — Claude reviews and calls `memory.store_extraction` only for high-signal entries, preserving the `**Why:**` / `**How to apply:**` reasoning (inherits #34 discipline via the sibling distillation prompt).
|
|
537
|
+
- Graceful fallbacks: missing auto-memory dir returns `[]`, malformed state JSON treated as empty baseline, file read errors skipped. Manager must expose `project_path` or the mirror is silently skipped — so non-project managers (plain global-only) never break.
|
|
538
|
+
- Test coverage: `spec/claude_memory/hook/auto_memory_mirror_spec.rb` covers slug derivation, initial scan, commit idempotence, changed-file re-emission, malformed state tolerance, and limit enforcement. `context_injector_spec.rb` adds integration tests for the mirror section, non-fresh-source suppression, and no-re-emission across sessions.
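
The slug derivation and md5 diff described above can be sketched in memory as follows. Only `project_path.tr("/", "-")`, the md5 comparison, the mtime ordering, and the 5 / 1500 bounds come from the text; the helper names, the hash shapes, and the choice to truncate (rather than skip) long entries are assumptions.

```ruby
require "digest/md5"

# Slug rule quoted in the entry above.
def auto_memory_slug(project_path)
  project_path.tr("/", "-")
end

# files:    array of { name:, content:, mtime: } hashes (hypothetical shape)
# baseline: { name => md5 hex digest } from the previous commit of state
def pending_candidates_sketch(files, baseline, limit: 5, max_chars: 1500)
  changed = files.reject { |f| baseline[f[:name]] == Digest::MD5.hexdigest(f[:content]) }
  changed
    .sort_by { |f| -f[:mtime].to_f }                        # newest first
    .first(limit)                                           # per-session bound
    .map { |f| f.merge(content: f[:content][0, max_chars]) } # assumption: truncate
end
```

A file whose digest matches the baseline is dropped, so an unchanged file is never re-emitted on later sessions.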
|
|
539
|
+
|
|
540
|
+
Still deferred:
|
|
541
|
+
- Dashboard "N auto-memory entries awaiting mirror" indicator — not wired until it's clear from real usage whether a visible backlog adds value beyond the SessionStart nudge.
|
|
542
|
+
- Scope-hint inference per file. The current emission is the raw file content; Claude decides subject/predicate/scope in the normal extraction review. A future upgrade could parse filename prefixes (`feedback_*`, `gotcha_*`, `reference_*`) into predicate hints.
|
|
543
|
+
|
|
544
|
+
### ~~35. Access-Based Staleness Scoring~~ ✅ Implemented 2026-04-27
|
|
545
|
+
|
|
546
|
+
Triggered by the digest (#46) surfacing 11% utilization with no way to point at the dead weight. Built as **Path B (sweep-derived from activity_events)** rather than the originally proposed Path A (per-recall update buffer): the v15 activity_events table eliminated the WAL-contention concern that drove Path A, since the (scope, fact_id) data already exists. No new hot-path writes.
|
|
547
|
+
|
|
548
|
+
- Migration v17 adds nullable `last_recalled_at` to `facts`.
|
|
549
|
+
- `Sweep::RecallTimestampRefresher.new(manager).refresh!` scans both stores' activity_events (`event_type IN ('recall', 'hook_context')`) within a 90-day lookback, projects the most recent occurrence per (scope, fact_id) via `Dashboard::ScopedFactResolver`, and bulk-UPDATEs `last_recalled_at` across both DBs. Cross-DB by design: project events touching global facts update global rows.
|
|
550
|
+
- Wired into `Hook::Handler#sweep` and `Commands::SweepCommand` so every sweep cycle freshens timestamps.
|
|
551
|
+
- `Configuration#stale_days` reads `CLAUDE_MEMORY_STALE_DAYS` (default **14**, falls back on garbage / non-positive input).
|
|
552
|
+
- `Recall::StaleDetector.stale_facts(manager, threshold_days:)` and `.stale_count(manager, ...)` return active facts where `(last_recalled_at < cutoff OR last_recalled_at IS NULL) AND created_at < cutoff` — the AND-on-created_at is the grace window so freshly extracted facts don't surface as stale on day one.
|
|
553
|
+
- `claude-memory stats --stale [--stale-days N]` prints the list grouped by scope.
|
|
554
|
+
- `Dashboard::Trust#count_stale_facts` now reads through `StaleDetector#stale_count`, replacing the old "active facts minus seen-in-recall pairs" approximation that couldn't distinguish a never-touched 6-month-old fact from a freshly stored one.
|
|
555
|
+
- No auto-deletion. Staleness is informational; users decide what to reject.
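
The staleness predicate and the env fallback above can be illustrated in plain Ruby. This is a sketch over in-memory hashes, not the `StaleDetector` query; `stale_days_sketch` and `stale?` are hypothetical names, and only the default of 14, the env var name, and the boolean condition come from the text.

```ruby
SECONDS_PER_DAY = 86_400

# Reads CLAUDE_MEMORY_STALE_DAYS, falling back to 14 on garbage or non-positive input.
def stale_days_sketch(env = ENV)
  days = Integer(env["CLAUDE_MEMORY_STALE_DAYS"].to_s, exception: false)
  days && days.positive? ? days : 14
end

# (last_recalled_at < cutoff OR last_recalled_at IS NULL) AND created_at < cutoff
def stale?(fact, now:, threshold_days: 14)
  cutoff = now - threshold_days * SECONDS_PER_DAY
  not_recently_recalled = fact[:last_recalled_at].nil? || fact[:last_recalled_at] < cutoff
  not_recently_recalled && fact[:created_at] < cutoff # grace window for new facts
end
```

The `created_at` clause is what keeps a fact extracted yesterday, and never yet recalled, out of the stale list.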
|
|
556
|
+
|
|
557
|
+
Privacy posture: timestamps don't carry user content (different shape from the rejected `query_text` capture). Same posture as `mcp_tool_calls.called_at` — load-bearing but not content-revealing.
|
|
558
|
+
|
|
559
|
+
Specs cover: refresher updates from both stores including cross-DB project→global, lookback bound, latest-wins on multiple touches, stale detection grace window, scope-spanning, status filtering, limits, CLI flag output, Configuration env knob fallbacks.
|
|
155
560
|
|
|
156
561
|
### ~~27. Usage Stats / ROI Tracking~~ ✅ Implemented 2026-04-15
|
|
157
562
|
|
|
@@ -198,6 +603,57 @@ Source: QMD study (2026-03-02)
|
|
|
198
603
|
- **Trade-off**: Process management complexity
|
|
199
604
|
- **Recommendation**: DEFER — Only if MCP startup latency becomes an issue
|
|
200
605
|
|
|
606
|
+
### ~~38. Dashboard: Dedupe conflicts at display layer~~ ✅ Implemented 2026-04-24
|
|
607
|
+
|
|
608
|
+
`Dashboard::Conflicts#list` now groups rows by `(source, status, predicate, sorted-normalized-object-pair)` and returns each group as one row with a `group_size` count plus `group_member_ids`. `total` and the `counts` field reflect the distinct-contradiction count; a new `raw_counts` field preserves the underlying row totals for the Advanced drawer. `Trust#count_open_conflicts` delegates to a new `Conflicts#distinct_open_counts` helper so the `Needs review` sidebar alert stops overstating the backlog. Frontend renders a `×N` badge on the status cell when a group has more than one detection. Covered by new specs (`group_size`, order-swapped pair collapse, raw vs distinct counts, sidebar helper).
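
As a rough illustration of the grouping key, the sketch below collapses rows whose object pairs differ only in order or casing. The row shape (`:id`, `:source`, `:status`, `:predicate`, `:object_a`, `:object_b`) and the normalization (strip plus downcase) are assumptions; only the key composition and the `group_size` / `group_member_ids` fields come from the description.

```ruby
# Groups conflict rows by (source, status, predicate, sorted normalized object pair).
def group_conflicts_sketch(rows)
  grouped = rows.group_by do |row|
    pair = [row[:object_a], row[:object_b]].map { |o| o.to_s.strip.downcase }.sort
    [row[:source], row[:status], row[:predicate], pair]
  end

  grouped.map do |_key, members|
    members.first.merge(
      group_size: members.size,
      group_member_ids: members.map { |m| m[:id] }
    )
  end
end
```

Sorting the pair is what makes "sqlite vs postgresql" and "postgresql vs sqlite" land in the same group.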
|
|
609
|
+
|
|
610
|
+
### ~~39. Resolver: Deduplicate conflict insertion~~ ✅ Implemented 2026-04-24
|
|
611
|
+
|
|
612
|
+
Source: 2026-04-24 dashboard data audit. Root cause traced to `facts_for_slot` defaulting to `status="active"`, which made the existing disputed fact invisible to the re-extraction path. Fixed in `Resolver#apply_conflict`: before creating a new disputed fact + conflict row, look up disputed facts in the same (subject, predicate) slot and reinforce the matching one with provenance instead of duplicating. New spec `resolver_spec.rb` "does not duplicate a conflict when the same contradiction is re-extracted" locks in the behavior. Historical DB rows (e.g. 11× sqlite vs postgresql) stay until an optional cleanup pass runs.
|
|
613
|
+
|
|
614
|
+
### ~~40. Cleanup: Prune historical rails-vs-react conflicts (data only — code already correct)~~ ✅ Implemented 2026-04-24
|
|
615
|
+
|
|
616
|
+
Shipped in commit `22eeaf1` as `claude-memory dedupe-conflicts` and `claude-memory reclassify-references`. `Sweep::Maintenance` gains two one-off maintenance methods:
|
|
617
|
+
|
|
618
|
+
- `dedupe_conflicts` groups open conflicts by `(subject_entity_id, predicate, normalized(object_a, object_b))`, keeps the earliest, rejects the duplicate disputed facts, and migrates their provenance onto the keeper.
|
|
619
|
+
- `reclassify_references` walks active convention facts through `ReferenceMaterialDetector` and retags matches to `predicate=reference`.
|
|
620
|
+
|
|
621
|
+
Both CLI commands accept `--dry-run` and `--scope`. Tightened `ReferenceMaterialDetector` so the `by Firstname Lastname` pattern is now a weak signal (fires only alongside a strong pattern). Covered by 9 new maintenance specs and 1 detector spec.
|
|
622
|
+
|
|
623
|
+
### ~~41. Distiller: Guard against reference material mislabeled as convention~~ ✅ Implemented 2026-04-24
|
|
624
|
+
|
|
625
|
+
Source: 2026-04-24 dashboard data audit. `Distill::ReferenceMaterialDetector` reclassifies convention facts whose object text matches any of: LOC counts (`~?\d+[,.]?\d*\s*(LOC|lines of code)`), star counts, `by Firstname Lastname` author attribution, or "is a (plugin|library|tool|gem|service|framework|extension|cli|mcp server)" templates. New predicate `reference` registered in `PredicatePolicy::POLICIES` (multi, non-exclusive) with its own section in `SECTION_MAP` → `:references`. Detector is applied in `ManagementHandlers#store_extraction` before the resolver runs, so mislabeling can't persist. New `References` section in `Dashboard::Knowledge`. 8 new specs lock in behavior. Historical mislabeled facts (project facts #1, #3) remain until manual reject or cleanup pass.
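
A hedged sketch of two of the patterns named above, the LOC count and the "is a ..." template. The pattern bodies are copied from the text; the case-insensitive flag, the word boundaries on the template, and the helper name `reference_material?` are assumptions. The star-count and author patterns are omitted because their exact form is not given here.

```ruby
# LOC pattern as quoted in the entry; /i is an assumption.
LOC_PATTERN = /~?\d+[,.]?\d*\s*(LOC|lines of code)/i
# "is a <kind>" template; the kind list is quoted from the entry.
TEMPLATE_PATTERN = /\bis a (plugin|library|tool|gem|service|framework|extension|cli|mcp server)\b/i

# True when the object text looks like reference material rather than a convention.
def reference_material?(object_text)
  [LOC_PATTERN, TEMPLATE_PATTERN].any? { |pattern| object_text.match?(pattern) }
end
```

A convention such as "use 2-space indentation" contains a digit but no LOC unit, so it is left alone.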
|
|
626
|
+
|
|
627
|
+
### ~~42. Dashboard: ROI diagnostic — extracted vs recalled~~ ✅ Implemented 2026-04-24
|
|
628
|
+
|
|
629
|
+
Shipped in commit `3906c23`. `Dashboard::Trust#snapshot` now returns a `utilization` section with `extracted` (active facts created in the last 30 days across both stores), `used` ((scope, id) pairs Claude has recalled or injected over the window), `used_from_extracted` (intersection), and `ratio_pct`. Rendered as a stat on the Most-used-this-week panel, color-coded (green ≥40%, yellow ≥15%, red below). Panel hides itself on fresh installs where there's no extraction or use yet. Covered by new `dashboard/trust_spec.rb` assertions.
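
The utilization numbers above reduce to a set intersection. In the sketch below, facts are identified by `[scope, id]` pairs as in the description; computing `ratio_pct` as `used_from_extracted / extracted` is an assumption, as are the helper names. The color thresholds are the ones stated in the entry.

```ruby
require "set"

# extracted, used: enumerables of [scope, id] pairs.
def utilization_sketch(extracted, used)
  extracted_set = extracted.to_set
  used_set = used.to_set
  overlap = extracted_set & used_set
  ratio = extracted_set.empty? ? nil : (overlap.size * 100.0 / extracted_set.size).round

  { extracted: extracted_set.size, used: used_set.size,
    used_from_extracted: overlap.size, ratio_pct: ratio }
end

# green >= 40%, yellow >= 15%, red below.
def utilization_color(ratio_pct)
  return :green if ratio_pct >= 40
  return :yellow if ratio_pct >= 15

  :red
end
```

Keying on `[scope, id]` rather than the bare id matters because project and global stores can reuse the same numeric ids.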
|
|
630
|
+
|
|
631
|
+
### ~~43. Dashboard: 👍/👎 feedback on moments~~ ✅ Implemented 2026-04-24
|
|
632
|
+
|
|
633
|
+
Schema migration v16 adds a `moment_feedback` table with a unique index on `event_id` so repeat clicks upsert. `SQLiteStore#upsert_moment_feedback` and `#clear_moment_feedback` own the writes; `Dashboard::API` exposes `POST /api/moments/:id/feedback` (with `{verdict, note}`) and `DELETE /api/moments/:id/feedback` to clear. `Moments#list` now batch-attaches the current verdict to each moment. `Trust#snapshot` gains a `feedback` section (`up`, `down`, `net`, `ratio_pct`) windowed to the last 30 days, rendered inline on the Most-used-this-week panel whenever any feedback exists. Frontend adds 👍/👎 buttons on each moment card with active-state styling; repeat-click clears. Covered by store, API, Moments attach, and Trust ratio specs.
|
|
634
|
+
|
|
635
|
+
### 44. Dashboard: Universal search box
|
|
636
|
+
|
|
637
|
+
Source: 2026-04-22 dashboard exploration
|
|
638
|
+
|
|
639
|
+
- **Value**: One input spans facts / sessions / conflicts / moments with typed results — removes the drawer-tab nav for power users.
|
|
640
|
+
- **Implementation**: New `/api/search?q=` endpoint fanning out across stores + activity_events. Alfred-style typed result list.
|
|
641
|
+
- **Effort**: 2 days
|
|
642
|
+
- **Recommendation**: **LOW PRIORITY** — Nice-to-have; existing Knowledge/Facts drawer covers primary needs.
|
|
643
|
+
|
|
644
|
+
### 45. Dashboard: Live feed via SSE or WebSocket
|
|
645
|
+
|
|
646
|
+
Source: 2026-04-22 dashboard exploration
|
|
647
|
+
|
|
648
|
+
- **Value**: New moments animate in as hooks fire rather than waiting for 30s polling. Enables the "watch this" onboarding demo.
|
|
649
|
+
- **Implementation**: WEBrick doesn't support WebSockets cleanly; would need `async-websocket` or Server-Sent Events via `rack-sse`. 30s polling stays as fallback.
|
|
650
|
+
- **Effort**: 2-3 days
|
|
651
|
+
- **Recommendation**: **LOW PRIORITY** — Polling is adequate; SSE/WS is cosmetic polish.
|
|
652
|
+
|
|
653
|
+
### ~~46. Dashboard + CLI: Weekly digest~~ ✅ Implemented 2026-04-24
|
|
654
|
+
|
|
655
|
+
`claude-memory digest [--since DAYS] [--output FILE]` renders a markdown report from already-existing aggregates — no new schema, no cron. Sections: Activity (moments bucketed by event_type), New knowledge (active facts created in the window, grouped by predicate), Utilization (30d extracted-vs-used ratio from `Dashboard::Trust#utilization`), Conflicts (deduped open count via `Dashboard::Conflicts#distinct_open_counts`), Feedback (👍/👎 from the #43 moment_feedback table). `--output FILE` writes to disk; default is stdout. `--since 0` errors out so the user knows the window must be positive. Covered by command specs (baseline, activity grouping, predicate grouping, since-window, positive-only validation, output-file, feedback inclusion).
|
|
656
|
+
|
|
201
657
|
### ~~7. MCP Discovery Tools~~ ✅ Implemented 2026-03-02
|
|
202
658
|
|
|
203
659
|
Added `memory.list_projects` MCP tool. Shows global DB, current project, and discovers other projects from promoted facts/global fact paths with stats.
|
|
@@ -297,4 +753,4 @@ Influence documents:
|
|
|
297
753
|
|
|
298
754
|
---
|
|
299
755
|
|
|
300
|
-
*Last updated: 2026-04-
|
|
756
|
+
*Last updated: 2026-04-28 - 1.0 punchlist track opened (`docs/1_0_punchlist.md`). High Priority entries #47-52 (must-have for 1.0): token-budget telemetry, hallucination rate, harm benchmark, CLAUDE.md baseline publication, `claude-memory show`, benchmark scoreboard. Medium Priority entries #53-56 (post-1.0): first-week ROI nudge, real-session repeat-correction detection, token-cost growth tracking, drift dashboard. Previously: 2026-04-27 - #35 (access-based staleness, sweep-derived) landed.*
|