claude_memory 0.3.0 → 0.5.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.claude/CLAUDE.md +1 -1
- data/.claude/output-styles/memory-aware.md +1 -0
- data/.claude/rules/claude_memory.generated.md +9 -34
- data/.claude/settings.local.json +4 -1
- data/.claude/skills/check-memory/DEPRECATED.md +29 -0
- data/.claude/skills/check-memory/SKILL.md +10 -0
- data/.claude/skills/debug-memory +1 -0
- data/.claude/skills/improve/SKILL.md +12 -1
- data/.claude/skills/memory-first-workflow +1 -0
- data/.claude/skills/setup-memory +1 -0
- data/.claude-plugin/plugin.json +1 -1
- data/.lefthook/map_specs.rb +29 -0
- data/CHANGELOG.md +83 -5
- data/CLAUDE.md +38 -0
- data/README.md +43 -0
- data/Rakefile +14 -1
- data/WEEK2_COMPLETE.md +250 -0
- data/db/migrations/008_add_provenance_line_range.rb +21 -0
- data/db/migrations/009_add_docid.rb +39 -0
- data/db/migrations/010_add_llm_cache.rb +30 -0
- data/docs/architecture.md +49 -14
- data/docs/ci_integration.md +294 -0
- data/docs/eval_week1_summary.md +183 -0
- data/docs/eval_week2_summary.md +419 -0
- data/docs/evals.md +353 -0
- data/docs/improvements.md +72 -1085
- data/docs/influence/claude-supermemory.md +498 -0
- data/docs/influence/qmd.md +424 -2022
- data/docs/quality_review.md +64 -705
- data/lefthook.yml +8 -1
- data/lib/claude_memory/commands/doctor_command.rb +45 -4
- data/lib/claude_memory/commands/explain_command.rb +11 -6
- data/lib/claude_memory/commands/stats_command.rb +1 -1
- data/lib/claude_memory/core/fact_graph.rb +122 -0
- data/lib/claude_memory/core/fact_query_builder.rb +34 -14
- data/lib/claude_memory/core/fact_ranker.rb +3 -20
- data/lib/claude_memory/core/relative_time.rb +45 -0
- data/lib/claude_memory/core/result_sorter.rb +2 -2
- data/lib/claude_memory/core/rr_fusion.rb +57 -0
- data/lib/claude_memory/core/snippet_extractor.rb +97 -0
- data/lib/claude_memory/domain/fact.rb +3 -1
- data/lib/claude_memory/embeddings/fastembed_adapter.rb +55 -0
- data/lib/claude_memory/index/index_query.rb +2 -0
- data/lib/claude_memory/index/lexical_fts.rb +18 -0
- data/lib/claude_memory/infrastructure/operation_tracker.rb +7 -21
- data/lib/claude_memory/infrastructure/schema_validator.rb +30 -25
- data/lib/claude_memory/ingest/content_sanitizer.rb +8 -1
- data/lib/claude_memory/ingest/ingester.rb +74 -59
- data/lib/claude_memory/ingest/tool_extractor.rb +1 -1
- data/lib/claude_memory/ingest/tool_filter.rb +55 -0
- data/lib/claude_memory/logging/logger.rb +112 -0
- data/lib/claude_memory/mcp/query_guide.rb +96 -0
- data/lib/claude_memory/mcp/response_formatter.rb +86 -23
- data/lib/claude_memory/mcp/server.rb +34 -4
- data/lib/claude_memory/mcp/text_summary.rb +257 -0
- data/lib/claude_memory/mcp/tool_definitions.rb +27 -11
- data/lib/claude_memory/mcp/tools.rb +133 -120
- data/lib/claude_memory/publish.rb +12 -2
- data/lib/claude_memory/recall/expansion_detector.rb +44 -0
- data/lib/claude_memory/recall.rb +93 -41
- data/lib/claude_memory/resolve/resolver.rb +72 -40
- data/lib/claude_memory/store/sqlite_store.rb +99 -24
- data/lib/claude_memory/sweep/sweeper.rb +6 -0
- data/lib/claude_memory/version.rb +1 -1
- data/lib/claude_memory.rb +21 -0
- data/output-styles/memory-aware.md +71 -0
- data/skills/debug-memory/SKILL.md +146 -0
- data/skills/memory-first-workflow/SKILL.md +144 -0
- metadata +29 -5
- data/.claude/.mind.mv2.o2N83S +0 -0
- data/.claude/output-styles/memory-aware.md +0 -21
- data/docs/.claude/mind.mv2.lock +0 -0
- data/docs/remaining_improvements.md +0 -330
- /data/{.claude/skills → skills}/setup-memory/SKILL.md +0 -0
data/docs/evals.md
ADDED
|
@@ -0,0 +1,353 @@
|
|
|
1
|
+
# ClaudeMemory Evaluation Framework
|
|
2
|
+
|
|
3
|
+
## Overview
|
|
4
|
+
|
|
5
|
+
The ClaudeMemory eval framework measures the system's effectiveness at improving Claude Code's responses. Inspired by [Vercel's blog post on agent evals](https://vercel.com/blog/building-reliable-agents-what-we-learned-from-evals), this framework quantifies:
|
|
6
|
+
|
|
7
|
+
1. **Behavioral Outcomes**: Does memory improve response quality and accuracy?
|
|
8
|
+
2. **Tool Selection**: Are memory tools invoked when appropriate? (Future work)
|
|
9
|
+
3. **Mode Comparison**: MCP tools vs generated context vs both? (Future work)
|
|
10
|
+
|
|
11
|
+
## Key Insight from Vercel
|
|
12
|
+
|
|
13
|
+
**"Skills were NOT invoked 56% of the time, even when available."**
|
|
14
|
+
|
|
15
|
+
Vercel found that:
|
|
16
|
+
- Baseline (no tools): 53% pass rate
|
|
17
|
+
- Skills (on-demand tools): 79% pass rate (but 56% skip rate)
|
|
18
|
+
- AGENTS.md (persistent context): **100% pass rate**
|
|
19
|
+
|
|
20
|
+
Our hypothesis: ClaudeMemory's dual-mode approach (MCP tools + generated context file) should achieve high reliability.
|
|
21
|
+
|
|
22
|
+
## Current Status
|
|
23
|
+
|
|
24
|
+
**Week 1 Complete** ✅
|
|
25
|
+
|
|
26
|
+
- 3 eval scenarios implemented
|
|
27
|
+
- 15 tests passing (100% pass rate)
|
|
28
|
+
- Behavioral scoring logic proven
|
|
29
|
+
- Fast tests (<1s) suitable for TDD workflow
|
|
30
|
+
- Baseline comparison shows 100% improvement with memory
|
|
31
|
+
|
|
32
|
+
## Scenarios
|
|
33
|
+
|
|
34
|
+
### 1. Convention Recall
|
|
35
|
+
|
|
36
|
+
**Tests**: Whether Claude mentions stored coding conventions when asked.
|
|
37
|
+
|
|
38
|
+
**Setup**:
|
|
39
|
+
- Store conventions in memory (e.g., "Use 2-space indentation", "Prefer RSpec expect syntax")
|
|
40
|
+
- Ask: "What are the coding conventions for this Ruby project?"
|
|
41
|
+
|
|
42
|
+
**Results**:
|
|
43
|
+
- With Memory: Mentions specific conventions (score: 1.0)
|
|
44
|
+
- Baseline: Gives generic advice without specifics (score: 0.0)
|
|
45
|
+
- **Improvement: +100%**
|
|
46
|
+
|
|
47
|
+
### 2. Architectural Decision
|
|
48
|
+
|
|
49
|
+
**Tests**: Whether Claude respects stored architectural decisions.
|
|
50
|
+
|
|
51
|
+
**Setup**:
|
|
52
|
+
- Store decision in memory (e.g., "Use Sequel for database access, not ActiveRecord")
|
|
53
|
+
- Ask: "How should I query the database in this project?"
|
|
54
|
+
|
|
55
|
+
**Results**:
|
|
56
|
+
- With Memory: Recommends Sequel specifically (score: 1.0)
|
|
57
|
+
- Baseline: Lists multiple options without recommendation (score: 0.0)
|
|
58
|
+
- **Improvement: +100%**
|
|
59
|
+
|
|
60
|
+
### 3. Tech Stack Recall
|
|
61
|
+
|
|
62
|
+
**Tests**: Whether Claude correctly identifies frameworks and databases.
|
|
63
|
+
|
|
64
|
+
**Setup**:
|
|
65
|
+
- Store tech stack facts (uses_framework: "RSpec", uses_database: "SQLite")
|
|
66
|
+
- Ask: "What testing framework does this project use?"
|
|
67
|
+
|
|
68
|
+
**Results**:
|
|
69
|
+
- With Memory: Identifies RSpec confidently (score: 1.0)
|
|
70
|
+
- Baseline: Lists options but admits uncertainty (score: 0.0)
|
|
71
|
+
- **Improvement: +100%**
|
|
72
|
+
|
|
73
|
+
## Behavioral Scoring
|
|
74
|
+
|
|
75
|
+
Each eval calculates a **behavioral score** (0.0 - 1.0) that quantifies response quality:
|
|
76
|
+
|
|
77
|
+
```ruby
|
|
78
|
+
# Example: Convention Recall
|
|
79
|
+
mentions_indentation = response.include?("2-space")
|
|
80
|
+
mentions_rspec = response.include?("expect syntax")
|
|
81
|
+
|
|
82
|
+
score = 0.0
|
|
83
|
+
score += 0.5 if mentions_indentation
|
|
84
|
+
score += 0.5 if mentions_rspec
|
|
85
|
+
|
|
86
|
+
# With memory: 1.0
|
|
87
|
+
# Baseline: 0.0
|
|
88
|
+
```
|
|
89
|
+
|
|
90
|
+
Scores measure:
|
|
91
|
+
- **Accuracy**: Correct information mentioned
|
|
92
|
+
- **Specificity**: Project-specific vs generic advice
|
|
93
|
+
- **Confidence**: Definitive answer vs hedging
|
|
94
|
+
|
|
95
|
+
## Running Evals
|
|
96
|
+
|
|
97
|
+
```bash
|
|
98
|
+
# Quick summary report
|
|
99
|
+
./bin/run-evals
|
|
100
|
+
|
|
101
|
+
# Detailed output
|
|
102
|
+
bundle exec rspec spec/evals/ --format documentation
|
|
103
|
+
|
|
104
|
+
# Run specific scenario
|
|
105
|
+
bundle exec rspec spec/evals/convention_recall_spec.rb
|
|
106
|
+
|
|
107
|
+
# Run only eval tests (skip others)
|
|
108
|
+
bundle exec rspec --tag eval
|
|
109
|
+
```
|
|
110
|
+
|
|
111
|
+
## Example Output
|
|
112
|
+
|
|
113
|
+
```
|
|
114
|
+
============================================================
|
|
115
|
+
EVAL SUMMARY
|
|
116
|
+
============================================================
|
|
117
|
+
|
|
118
|
+
Total Examples: 15
|
|
119
|
+
Passed: 15 ✅
|
|
120
|
+
Failed: 0 ❌
|
|
121
|
+
Duration: 0.23s
|
|
122
|
+
|
|
123
|
+
============================================================
|
|
124
|
+
BY SCENARIO
|
|
125
|
+
============================================================
|
|
126
|
+
|
|
127
|
+
Convention Recall: 5/5 ✅
|
|
128
|
+
Architectural Decision: 5/5 ✅
|
|
129
|
+
Tech Stack Recall: 5/5 ✅
|
|
130
|
+
|
|
131
|
+
============================================================
|
|
132
|
+
BEHAVIORAL SCORES
|
|
133
|
+
============================================================
|
|
134
|
+
|
|
135
|
+
Convention Recall:
|
|
136
|
+
With Memory: 1.0 (100%)
|
|
137
|
+
Baseline: 0.0 (0%)
|
|
138
|
+
Improvement: +100%
|
|
139
|
+
|
|
140
|
+
Architectural Decision:
|
|
141
|
+
With Memory: 1.0 (100%)
|
|
142
|
+
Baseline: 0.0 (0%)
|
|
143
|
+
Improvement: +100%
|
|
144
|
+
|
|
145
|
+
Tech Stack Recall:
|
|
146
|
+
With Memory: 1.0 (100%)
|
|
147
|
+
Baseline: 0.0 (0%)
|
|
148
|
+
Improvement: +100%
|
|
149
|
+
|
|
150
|
+
============================================================
|
|
151
|
+
OVERALL: Memory improves responses by 100% on average
|
|
152
|
+
============================================================
|
|
153
|
+
```
|
|
154
|
+
|
|
155
|
+
## Implementation Approach
|
|
156
|
+
|
|
157
|
+
Following expert principles (Kent Beck, Gary Bernhardt, Sandi Metz), we took an incremental approach:
|
|
158
|
+
|
|
159
|
+
### Week 1: Prove the Concept ✅
|
|
160
|
+
|
|
161
|
+
**Goal**: Get ONE eval working end-to-end, no abstractions.
|
|
162
|
+
|
|
163
|
+
**What we built**:
|
|
164
|
+
- 3 eval scenarios with stubbed Claude responses
|
|
165
|
+
- Fixture setup using `Dir.mktmpdir` for isolation
|
|
166
|
+
- Memory population using existing `ClaudeMemory::Store` patterns
|
|
167
|
+
- Behavioral scoring logic
|
|
168
|
+
- Fast tests (<1s) by avoiding real API calls
|
|
169
|
+
|
|
170
|
+
**Key decisions**:
|
|
171
|
+
- ✅ Stub Claude responses instead of shelling out (fast, free, deterministic)
|
|
172
|
+
- ✅ No premature abstractions (inline everything first)
|
|
173
|
+
- ✅ Focus on evaluation logic, not infrastructure
|
|
174
|
+
|
|
175
|
+
### Week 2: Extract Patterns (Future)
|
|
176
|
+
|
|
177
|
+
**Triggers for extraction**:
|
|
178
|
+
- Fixture setup becomes repetitive → Extract `FixtureBuilder`
|
|
179
|
+
- Scoring logic duplicated → Extract `ScoreCalculator`
|
|
180
|
+
- Need real Claude execution → Extract `ClaudeRunner` (slow tests, CI only)
|
|
181
|
+
|
|
182
|
+
**NOT extracting yet** because we don't feel enough pain.
|
|
183
|
+
|
|
184
|
+
### Week 3+: Advanced Features (Future)
|
|
185
|
+
|
|
186
|
+
**Potential additions**:
|
|
187
|
+
- Real Claude execution (tagged `:slow`, CI only)
|
|
188
|
+
- Tool call tracking (did Claude invoke `memory.conventions`?)
|
|
189
|
+
- Mode comparison (MCP vs context vs both)
|
|
190
|
+
- Regression tracking (store results over time)
|
|
191
|
+
- CI integration (block releases on eval failures)
|
|
192
|
+
|
|
193
|
+
## Design Principles Applied
|
|
194
|
+
|
|
195
|
+
### Kent Beck: Simple Design
|
|
196
|
+
|
|
197
|
+
> "Make it work, make it right, make it fast"
|
|
198
|
+
|
|
199
|
+
- Started with ONE passing eval
|
|
200
|
+
- Added 2 more to feel pain points
|
|
201
|
+
- No design up front—let it emerge from real needs
|
|
202
|
+
|
|
203
|
+
### Gary Bernhardt: Fast Tests
|
|
204
|
+
|
|
205
|
+
> "Tests should be fast enough for TDD workflow"
|
|
206
|
+
|
|
207
|
+
- Stubbed Claude responses (no API calls)
|
|
208
|
+
- Tests run in <1s (1003 tests in 47s total)
|
|
209
|
+
- Will add slow integration tests later (CI only)
|
|
210
|
+
|
|
211
|
+
### Sandi Metz: Single Responsibility
|
|
212
|
+
|
|
213
|
+
> "Extract collaborators only when you feel pain"
|
|
214
|
+
|
|
215
|
+
- Each eval is independent
|
|
216
|
+
- No shared base class yet
|
|
217
|
+
- Common patterns not extracted until needed
|
|
218
|
+
|
|
219
|
+
### Jeremy Evans: Simplicity
|
|
220
|
+
|
|
221
|
+
> "Start with 2 modes, not 4"
|
|
222
|
+
|
|
223
|
+
- Testing baseline vs full memory (2 modes)
|
|
224
|
+
- Defer MCP-only vs context-only comparison
|
|
225
|
+
|
|
226
|
+
### Avdi Grimm: Explicit Code
|
|
227
|
+
|
|
228
|
+
> "Make failures explicit"
|
|
229
|
+
|
|
230
|
+
- Clear behavioral assertions
|
|
231
|
+
- Quantified scores (not vague "better")
|
|
232
|
+
- Specific test names
|
|
233
|
+
|
|
234
|
+
## Files
|
|
235
|
+
|
|
236
|
+
```
|
|
237
|
+
spec/evals/
|
|
238
|
+
├── README.md # Eval documentation
|
|
239
|
+
├── convention_recall_spec.rb # Eval 1: Coding conventions
|
|
240
|
+
├── architectural_decision_spec.rb # Eval 2: Architectural decisions
|
|
241
|
+
└── tech_stack_recall_spec.rb # Eval 3: Tech stack identification
|
|
242
|
+
|
|
243
|
+
bin/
|
|
244
|
+
└── run-evals # Summary report runner
|
|
245
|
+
|
|
246
|
+
docs/
|
|
247
|
+
└── evals.md # This file
|
|
248
|
+
```
|
|
249
|
+
|
|
250
|
+
## Future Work
|
|
251
|
+
|
|
252
|
+
### Phase 1: Real Claude Execution (Optional)
|
|
253
|
+
|
|
254
|
+
If we need to validate against actual Claude behavior:
|
|
255
|
+
|
|
256
|
+
```ruby
|
|
257
|
+
def run_claude_headless(prompt, working_dir)
|
|
258
|
+
cmd = ["claude", "-p", prompt, "--output-format", "json"]
|
|
259
|
+
output, status = Open3.capture2(*cmd, chdir: working_dir)
|
|
260
|
+
JSON.parse(output)
|
|
261
|
+
end
|
|
262
|
+
```
|
|
263
|
+
|
|
264
|
+
**Trade-offs**:
|
|
265
|
+
- ✅ Tests real Claude behavior
|
|
266
|
+
- ❌ Slow (30s+ per test)
|
|
267
|
+
- ❌ Costs money (API calls)
|
|
268
|
+
- ❌ Non-deterministic
|
|
269
|
+
|
|
270
|
+
**Recommendation**: Only add if stubbed tests miss real issues.
|
|
271
|
+
|
|
272
|
+
### Phase 2: Tool Call Tracking
|
|
273
|
+
|
|
274
|
+
Track whether Claude invokes memory tools:
|
|
275
|
+
|
|
276
|
+
```ruby
|
|
277
|
+
# Check transcript for tool calls
|
|
278
|
+
tool_invoked = transcript[:tool_calls].any? { |t| t[:tool] == "memory.conventions" }
|
|
279
|
+
|
|
280
|
+
# Tool selection score
|
|
281
|
+
tool_selection_score = tool_invoked ? 1.0 : 0.0
|
|
282
|
+
```
|
|
283
|
+
|
|
284
|
+
**Use case**: Detect when Claude skips memory tools (like Vercel's 56% skip rate).
|
|
285
|
+
|
|
286
|
+
### Phase 3: Mode Comparison
|
|
287
|
+
|
|
288
|
+
Test 4 configurations:
|
|
289
|
+
1. Baseline (no memory)
|
|
290
|
+
2. MCP tools only
|
|
291
|
+
3. Generated context only
|
|
292
|
+
4. Both (current default)
|
|
293
|
+
|
|
294
|
+
**Expected result**: Generated context should have highest pass rate (like Vercel's AGENTS.md).
|
|
295
|
+
|
|
296
|
+
### Phase 4: Regression Tracking
|
|
297
|
+
|
|
298
|
+
Store eval results over time:
|
|
299
|
+
|
|
300
|
+
```ruby
|
|
301
|
+
# Store results in SQLite
|
|
302
|
+
@db[:eval_runs].insert(
|
|
303
|
+
timestamp: Time.now,
|
|
304
|
+
git_sha: `git rev-parse HEAD`.strip,
|
|
305
|
+
pass_rate: 1.0,
|
|
306
|
+
avg_score: 1.0
|
|
307
|
+
)
|
|
308
|
+
|
|
309
|
+
# Compare to previous runs
|
|
310
|
+
previous_run = @db[:eval_runs].order(:timestamp).last
|
|
311
|
+
regression = pass_rate < previous_run[:pass_rate]
|
|
312
|
+
```
|
|
313
|
+
|
|
314
|
+
**Use case**: Prevent regressions during development.
|
|
315
|
+
|
|
316
|
+
### Phase 5: CI Integration
|
|
317
|
+
|
|
318
|
+
Add to GitHub Actions:
|
|
319
|
+
|
|
320
|
+
```yaml
|
|
321
|
+
- name: Run ClaudeMemory Evals
|
|
322
|
+
run: ./bin/run-evals
|
|
323
|
+
|
|
324
|
+
- name: Check for Regressions
|
|
325
|
+
run: |
|
|
326
|
+
if [ $? -ne 0 ]; then
|
|
327
|
+
echo "Evals failed! Blocking release."
|
|
328
|
+
exit 1
|
|
329
|
+
fi
|
|
330
|
+
```
|
|
331
|
+
|
|
332
|
+
**Use case**: Enforce quality before gem releases.
|
|
333
|
+
|
|
334
|
+
## Success Metrics
|
|
335
|
+
|
|
336
|
+
**Current (Week 1)**:
|
|
337
|
+
- ✅ 15 tests passing (100% pass rate)
|
|
338
|
+
- ✅ Behavioral scores: 1.0 with memory, 0.0 baseline
|
|
339
|
+
- ✅ Fast tests (<1s)
|
|
340
|
+
- ✅ Baseline comparison proven valuable
|
|
341
|
+
|
|
342
|
+
**Future Goals**:
|
|
343
|
+
- [ ] Tool invocation rate > 80% (better than Vercel's 44%)
|
|
344
|
+
- [ ] Pass rate maintained across versions (no regressions)
|
|
345
|
+
- [ ] Generated context achieves 100% pass rate (like Vercel's AGENTS.md)
|
|
346
|
+
- [ ] Mode comparison validates dual-mode approach
|
|
347
|
+
|
|
348
|
+
## References
|
|
349
|
+
|
|
350
|
+
- **Vercel Blog**: [Building reliable agents: What we learned from evals](https://vercel.com/blog/building-reliable-agents-what-we-learned-from-evals)
|
|
351
|
+
- **Implementation Plan**: Detailed plan document with expert reviews
|
|
352
|
+
- **Testing Patterns**: `spec/claude_memory/mcp/tools_spec.rb`, `spec/claude_memory/recall_spec.rb`
|
|
353
|
+
- **Expert Principles**: Kent Beck (Simple Design), Gary Bernhardt (Fast Tests), Sandi Metz (SRP)
|