claude_memory 0.3.0 → 0.5.0
- checksums.yaml +4 -4
- data/.claude/CLAUDE.md +1 -1
- data/.claude/output-styles/memory-aware.md +1 -0
- data/.claude/rules/claude_memory.generated.md +9 -34
- data/.claude/settings.local.json +4 -1
- data/.claude/skills/check-memory/DEPRECATED.md +29 -0
- data/.claude/skills/check-memory/SKILL.md +10 -0
- data/.claude/skills/debug-memory +1 -0
- data/.claude/skills/improve/SKILL.md +12 -1
- data/.claude/skills/memory-first-workflow +1 -0
- data/.claude/skills/setup-memory +1 -0
- data/.claude-plugin/plugin.json +1 -1
- data/.lefthook/map_specs.rb +29 -0
- data/CHANGELOG.md +83 -5
- data/CLAUDE.md +38 -0
- data/README.md +43 -0
- data/Rakefile +14 -1
- data/WEEK2_COMPLETE.md +250 -0
- data/db/migrations/008_add_provenance_line_range.rb +21 -0
- data/db/migrations/009_add_docid.rb +39 -0
- data/db/migrations/010_add_llm_cache.rb +30 -0
- data/docs/architecture.md +49 -14
- data/docs/ci_integration.md +294 -0
- data/docs/eval_week1_summary.md +183 -0
- data/docs/eval_week2_summary.md +419 -0
- data/docs/evals.md +353 -0
- data/docs/improvements.md +72 -1085
- data/docs/influence/claude-supermemory.md +498 -0
- data/docs/influence/qmd.md +424 -2022
- data/docs/quality_review.md +64 -705
- data/lefthook.yml +8 -1
- data/lib/claude_memory/commands/doctor_command.rb +45 -4
- data/lib/claude_memory/commands/explain_command.rb +11 -6
- data/lib/claude_memory/commands/stats_command.rb +1 -1
- data/lib/claude_memory/core/fact_graph.rb +122 -0
- data/lib/claude_memory/core/fact_query_builder.rb +34 -14
- data/lib/claude_memory/core/fact_ranker.rb +3 -20
- data/lib/claude_memory/core/relative_time.rb +45 -0
- data/lib/claude_memory/core/result_sorter.rb +2 -2
- data/lib/claude_memory/core/rr_fusion.rb +57 -0
- data/lib/claude_memory/core/snippet_extractor.rb +97 -0
- data/lib/claude_memory/domain/fact.rb +3 -1
- data/lib/claude_memory/embeddings/fastembed_adapter.rb +55 -0
- data/lib/claude_memory/index/index_query.rb +2 -0
- data/lib/claude_memory/index/lexical_fts.rb +18 -0
- data/lib/claude_memory/infrastructure/operation_tracker.rb +7 -21
- data/lib/claude_memory/infrastructure/schema_validator.rb +30 -25
- data/lib/claude_memory/ingest/content_sanitizer.rb +8 -1
- data/lib/claude_memory/ingest/ingester.rb +74 -59
- data/lib/claude_memory/ingest/tool_extractor.rb +1 -1
- data/lib/claude_memory/ingest/tool_filter.rb +55 -0
- data/lib/claude_memory/logging/logger.rb +112 -0
- data/lib/claude_memory/mcp/query_guide.rb +96 -0
- data/lib/claude_memory/mcp/response_formatter.rb +86 -23
- data/lib/claude_memory/mcp/server.rb +34 -4
- data/lib/claude_memory/mcp/text_summary.rb +257 -0
- data/lib/claude_memory/mcp/tool_definitions.rb +27 -11
- data/lib/claude_memory/mcp/tools.rb +133 -120
- data/lib/claude_memory/publish.rb +12 -2
- data/lib/claude_memory/recall/expansion_detector.rb +44 -0
- data/lib/claude_memory/recall.rb +93 -41
- data/lib/claude_memory/resolve/resolver.rb +72 -40
- data/lib/claude_memory/store/sqlite_store.rb +99 -24
- data/lib/claude_memory/sweep/sweeper.rb +6 -0
- data/lib/claude_memory/version.rb +1 -1
- data/lib/claude_memory.rb +21 -0
- data/output-styles/memory-aware.md +71 -0
- data/skills/debug-memory/SKILL.md +146 -0
- data/skills/memory-first-workflow/SKILL.md +144 -0
- metadata +29 -5
- data/.claude/.mind.mv2.o2N83S +0 -0
- data/.claude/output-styles/memory-aware.md +0 -21
- data/docs/.claude/mind.mv2.lock +0 -0
- data/docs/remaining_improvements.md +0 -330
- /data/{.claude/skills → skills}/setup-memory/SKILL.md +0 -0
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d32c7aa8093eedade1783eefa7109c996aa134d01dea01b67794612c126cf068
+  data.tar.gz: 78fe0f2ff602c740b6ab727a0fde88d07fb235f38b1143b644b4e7498f07079f
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: be68b153e9e25ef3b8e5ad71ba7fdd89e5ace40f0848125b480bf35c96ad8b0a73b1fd891e59faf5807bc99d8d637f0c6d647e2b1d6c86ad0654b9df69c1ab45
+  data.tar.gz: 92ee3e92e7967e03f642a5736a84f70563dd6e9f00fc3a7b24714b97294348e35645cdbfdd4b760bc5ee21301bb869bccbc686050cbc95e2bf7399c38508a32e
data/.claude/CLAUDE.md
CHANGED
@@ -0,0 +1 @@
+../../output-styles/memory-aware.md
@@ -1,46 +1,21 @@
 <!--
 This file is auto-generated by claude-memory.
 Do not edit manually - changes will be overwritten.
-Generated: 2026-
+Generated: 2026-02-04T16:37:00Z
 -->

 # Project Memory

 ## Current Decisions

--
+- From QMD 2026-02-02 restudy: adopt Claude Code plugin format, MCP structured content pattern, MCP query guide prompt, inline status checks. Carry forward sqlite-vec, RRF, docids, smart expansion from 2026-01-26. Reject custom fine-tuned models, LLM reranking, YAML collections.
+- From claude-supermemory study: adopt SessionStart hook context injection (hookSpecificOutput.additionalContext), tool-specific observation compression, and relative time formatting. Reject cloud storage dependency and no-test approach.

 ## Conventions

--
--
--
--
--
--
-- Pre-commit hook runs standard:fix and stages changes
-- Pre-commit hook runs tests
-- Pre-commit hook runs quality review on Ruby files
-- Pre-commit hook runs standard:fix and stages changes
-- Pre-commit hook runs tests
-- Pre-commit hook runs quality review on Ruby files
-- frozen_string_literal: true
-- RSpec documentation format
-- Default Rake task runs tests and linter
-- frozen_string_literal: true in all Ruby files
-- RSpec uses documentation format
-- Default Rake task runs tests and linter
-- Pre-commit hook runs standard:fix and stages changes
-- Pre-commit hook runs tests
-- Pre-commit hook runs quality review on Ruby files
-- Pre-commit hook runs standard:fix and stages changes
-- Pre-commit hook runs tests
-- Pre-commit hook runs quality review on Ruby files
-- frozen_string_literal: true in all Ruby files
-- RSpec format documentation mode
-- Default Rake task runs spec and standard
-- frozen_string_literal: true in all Ruby files
-
-## Technical Constraints
-
-- **Uses database**: SQLite with Extralite adapter
+- MCP tools return dual content (text summary) + structuredContent (JSON) via TextSummary module and Server#handle_tools_call. Compact mode (compact: true) omits receipts for ~60% smaller responses.
+- ContentSanitizer strips system-reminder, local-command-caveat, command-message, command-name, command-args tags in addition to private/no-memory/secret/claude-memory-context.
+- Core::RelativeTime module provides progressive time formatting: just now → Xm ago → Xh ago → Xd ago → YYYY-MM-DD. Used in ResponseFormatter for *_ago fields.
+- MCP server registers memory_guide prompt via prompts/list and prompts/get endpoints. QueryGuide module holds prompt content.
+- Claude Code plugin with marketplace.json, skill definitions, MCP server bundling. 5,700+ stars, by Tobi Lütke. Custom fine-tuned query expansion (Qwen3-1.7B, SFT+GRPO). Dual content/structuredContent MCP pattern.
+- Cloud-backed Claude Code plugin (~1,195 LOC JavaScript) using Supermemory API for persistent memory across sessions. Uses hooks for SessionStart context injection and Stop transcript capture. No local database.
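The progressive format attributed to `Core::RelativeTime` in the hunk above (just now → Xm ago → Xh ago → Xd ago → YYYY-MM-DD) could be sketched roughly as follows. This is an illustration, not the gem's code: the method name `relative_time`, the `now:` keyword, and every cut-off (one minute, one hour, one day, seven days) are assumptions, since the diff states only the order of the formats.

```ruby
# Hypothetical sketch of progressive relative-time formatting.
# All thresholds below are assumed for illustration only.
def relative_time(timestamp, now: Time.now)
  seconds = (now - timestamp).to_i

  # Anything under a minute (including future timestamps) is "just now".
  return "just now" if seconds < 60
  return "#{seconds / 60}m ago" if seconds < 3_600
  return "#{seconds / 3_600}h ago" if seconds < 86_400
  return "#{seconds / 86_400}d ago" if seconds < 7 * 86_400

  # Older entries fall back to an absolute date.
  timestamp.strftime("%Y-%m-%d")
end
```

Passing `now:` explicitly keeps the function deterministic, which makes it straightforward to test without stubbing the clock.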
data/.claude/settings.local.json
CHANGED
@@ -25,7 +25,10 @@
       "Bash(git log:*)",
       "Bash(find:*)",
       "Bash(wc:*)",
-      "mcp__plugin_claude-memory_memory__memory_architecture"
+      "mcp__plugin_claude-memory_memory__memory_architecture",
+      "mcp__plugin_claude-memory_memory__memory_recall_index",
+      "Bash(./bin/run-evals:*)",
+      "WebSearch"
     ]
   },
   "enableAllProjectMcpServers": true
@@ -0,0 +1,29 @@
+# Deprecated: /check-memory
+
+This skill is **no longer needed** and should not be used.
+
+## Why Deprecated?
+
+The `/check-memory` skill was created to force a "check memory before file exploration" workflow. However, this should be **automatic**, not manual.
+
+## What Replaced It?
+
+The enhanced `memory-aware` output style now handles this automatically by:
+- Explicitly instructing Claude to check memory FIRST before file reads
+- Providing clear workflow: memory.recall → then file exploration if needed
+- Making this behavior persistent across all conversations
+
+## If You Need Debugging
+
+Use `/debug-memory` instead to troubleshoot ClaudeMemory installation issues.
+
+## Migration
+
+If you were using `/check-memory`:
+- **Remove** any references to it
+- **Use** the `memory-aware` output style (automatically applied)
+- **Trust** that Claude will check memory first automatically
+
+## Archive Date
+
+Deprecated: 2026-01-29
@@ -16,6 +16,16 @@ The user is asking about: $ARGUMENTS

 ## Step-by-Step Workflow

+### 0. Verify Memory Health
+
+Before querying, confirm the memory system is operational:
+
+```
+memory.check_setup
+```
+
+If status is not "healthy", inform the user and suggest running `claude-memory doctor` for details.
+
 ### 1. Query Memory (REQUIRED FIRST STEP)

 Run multiple memory queries to find existing knowledge:
@@ -0,0 +1 @@
+../../skills/debug-memory
@@ -11,7 +11,8 @@ Systematically implement feature improvements from `docs/improvements.md`, makin

 ## Process Overview

-1. **
+1. **Check memory health** by calling `memory.check_setup` to verify the system is operational
+2. **Read the improvements document** from `docs/improvements.md`
 2. **Identify unimplemented features** from "Remaining Tasks" section
 3. **Prioritize by stated priority** (Medium → Low)
 4. **Assess feasibility** (skip if too complex or requires external services)
@@ -22,6 +23,16 @@ Systematically implement feature improvements from `docs/improvements.md`, makin

 ## Detailed Steps

+### Step 0: Verify Memory Health
+
+Before starting, confirm the memory system is operational:
+
+```
+memory.check_setup
+```
+
+If status is not "healthy", address any issues before proceeding.
+
 ### Step 1: Read and Parse Improvements Document

 ```bash
@@ -0,0 +1 @@
+../../skills/memory-first-workflow
@@ -0,0 +1 @@
+../../skills/setup-memory
data/.claude-plugin/plugin.json
CHANGED
@@ -0,0 +1,29 @@
+#!/usr/bin/env ruby
+# frozen_string_literal: true
+
+# Map changed Ruby files to their corresponding spec files
+# Usage: .lefthook/map_specs.rb
+# Returns: Space-separated list of spec files to run
+
+changed_files = `git diff --cached --name-only --diff-filter=ACM`.split("\n")
+ruby_files = changed_files.select { |f| f.end_with?(".rb") && f.start_with?("lib/") }
+
+# Map lib/ files to their spec/ equivalents
+specs = ruby_files.map do |file|
+  # lib/claude_memory/foo/bar.rb → spec/claude_memory/foo/bar_spec.rb
+  file.sub("lib/", "spec/").sub(".rb", "_spec.rb")
+end.select { |spec| File.exist?(spec) }
+
+# Always run integration tests if core infrastructure changed
+critical_paths = [
+  "lib/claude_memory/store",
+  "lib/claude_memory/mcp",
+  "lib/claude_memory/hook"
+]
+
+if ruby_files.any? { |f| critical_paths.any? { |path| f.start_with?(path) } }
+  specs += Dir["spec/integration/*_spec.rb"] if Dir.exist?("spec/integration")
+end
+
+# Output unique spec files
+puts specs.uniq.join(" ")
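The lib-to-spec mapping in the hook script above can be isolated as a pure function, which makes the rule easy to check without a git index. The sketch below is hypothetical: `spec_path_for` is not a name from the gem, and it uses anchored patterns where the script uses plain string substitution.

```ruby
# Hypothetical helper mirroring the mapping rule in the hook script.
# Pure function: no filesystem or git access.
def spec_path_for(lib_path)
  return nil unless lib_path.start_with?("lib/") && lib_path.end_with?(".rb")

  # Anchor both substitutions so only the leading "lib/" and the
  # trailing ".rb" are rewritten.
  lib_path.sub(%r{\Alib/}, "spec/").sub(/\.rb\z/, "_spec.rb")
end

puts spec_path_for("lib/claude_memory/core/rr_fusion.rb")
```

One possible reason to prefer the anchored form: `sub(".rb", "_spec.rb")` rewrites the first `.rb` it finds, so a directory name containing `.rb` would be rewritten instead of the file extension. Whether such paths occur in this repository is not known from the diff.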
data/CHANGELOG.md
CHANGED
@@ -4,16 +4,94 @@ All notable changes to this project will be documented in this file.

 ## [Unreleased]

+## [0.5.0] - 2026-02-04
+
 ### Added

-
+**MCP Structured Content & Compact Mode**
+- Dual content (text summary) + structuredContent (JSON) for all MCP tools
+- `TextSummary` module generates human-readable summaries alongside structured data
+- Compact mode (`compact: true`) omits provenance receipts for ~60% smaller responses
+- MCP query guide prompt registered via `prompts/list` and `prompts/get` endpoints
+- `QueryGuide` module provides tool selection guidance to Claude
+
+**Search & Retrieval Improvements**
+- Reciprocal Rank Fusion (RRF) replacing naive merge for hybrid search
+- Better result ordering when combining FTS5 and semantic search results
+- Smart expansion detection to skip unnecessary vector search
+- Reduces latency when FTS5 already provides strong matches
+- Enhanced snippet extraction for search results
+- Better context windows around matched terms
+
+**Provenance & Traceability**
+- Line-range references in provenance for precise source linking
+- Facts now track exact line ranges in source transcripts
+- Fact dependency graph visualization via BFS traversal
+- Trace supersession and conflict chains between facts
+
+**User-Friendly Identifiers**
+- Docid short hash system for user-friendly fact references
+- Short, memorable identifiers instead of raw integer IDs
+
+**Caching & Performance**
+- LLM response caching schema and store methods
+- Cache layer for expensive extraction operations
+- Structured JSON logging with level filtering
+- Configurable log levels (debug, info, warn, error)
+- JSON format for machine-parseable log output
+
+**Ingestion & Content Processing**
+- Configurable tool capture filtering for ingestion
+- Control which tool outputs are captured during transcript processing
+- ContentSanitizer now strips `system-reminder`, `local-command-caveat`, `command-message`,
+  `command-name`, and `command-args` tags in addition to privacy tags
+- Relative time formatting in MCP recall output
+- Progressive format: just now → Xm ago → Xh ago → Xd ago → YYYY-MM-DD

-
+**Developer Tools**
+- `--brief` flag for doctor command and health checks in skills
+- Quick pass/fail output for automated workflows

-###
+### Fixed
+- Preserve SQLite PRAGMAs across connection reconnects
+- WAL mode and other pragmas now survive reconnection cycles
+- Timestamp-only churn in publish output
+- Publish no longer regenerates files when only the timestamp changed

 ### Internal

+**Code Quality Improvements**
+- Extract duplicates and decompose long methods across codebase
+- Extract ingester transaction body into focused methods
+- Decompose `resolve_fact` into intention-revealing methods
+- Extract `check_setup` and `detailed_stats` into focused helpers
+- Fix N+1 query patterns in `recall.rb`
+- Fix 6 quick wins from quality review (frozen strings, method sizes, naming)
+
+**Research & Studies**
+- QMD restudy (2026-02-02): adopt Claude Code plugin format, MCP structured content pattern,
+  MCP query guide prompt, inline status checks
+- claude-supermemory study: adopt SessionStart hook context injection, tool-specific observation
+  compression, and relative time formatting
+
+## [0.4.0] - 2026-02-02
+
+### Added
+
+**Semantic Search with FastEmbed**
+- Integrated [fastembed-rb](https://github.com/khasinski/fastembed-rb) for high-quality local embeddings
+- Uses BAAI/bge-small-en-v1.5 model (384-dim, ~67MB ONNX, runs locally)
+- No API key required -- model downloaded once to `~/.cache/fastembed/`
+- Asymmetric query/passage encoding for better retrieval accuracy
+- `FastembedAdapter` class implementing the existing `Generator` interface for drop-in replacement
+- Benchmark retrieval scores jumped significantly with real embeddings:
+- Semantic easy: Recall@5 = 0.900, medium: 0.696
+- Hybrid aggregate: Recall@5 = 0.727 (was 0.266 with TF-IDF fallback)
+
+### Documentation
+- Updated benchmark results throughout README, spec/benchmarks/README, and architecture docs
+- Replaced TF-IDF embedding references with FastEmbed in architecture documentation
+
 ## [0.3.0] - 2026-01-29

 ### Added
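The changelog entry above names Reciprocal Rank Fusion as the replacement for a naive merge of FTS5 and semantic results. As a rough illustration of the general technique (not the contents of `rr_fusion.rb`, which this section does not show), each item scores the sum of `1 / (k + rank)` over every list it appears in. The name `rr_fuse` and the default `k: 60`, a commonly used constant for RRF, are assumptions here.

```ruby
# Hypothetical sketch of Reciprocal Rank Fusion over ranked ID lists.
# Each input list is ordered best-first; k dampens the weight of top ranks.
def rr_fuse(ranked_lists, k: 60)
  scores = Hash.new(0.0)

  ranked_lists.each do |list|
    list.each_with_index do |id, index|
      scores[id] += 1.0 / (k + index + 1) # ranks are 1-based
    end
  end

  # Highest fused score first; ties broken by ID for a stable order.
  scores.sort_by { |id, score| [-score, id] }.map(&:first)
end

fts_results = [1, 2, 3]      # illustrative fact IDs from lexical search
vector_results = [3, 1, 4]   # illustrative fact IDs from semantic search
p rr_fuse([fts_results, vector_results])
```

Because only ranks are used, the two result lists do not need comparable raw scores, which is the usual argument for RRF over score-based merging.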
@@ -41,9 +119,9 @@ All notable changes to this project will be documented in this file.
 - Configuration class for centralized ENV access and testability

 **Search & Recall**
-- `index` command to generate
+- `index` command to generate embeddings for semantic search
 - Index command resumability with checkpoints (recover from interruption)
-- Semantic search capabilities
+- Semantic search capabilities with embedding-based vector search
 - Improved full-text search with empty query handling

 **Session Intelligence**
data/CLAUDE.md
CHANGED
@@ -11,6 +11,10 @@ ClaudeMemory is a Ruby gem that provides long-term, self-managed memory for Clau
 - Sequel (~> 5.0) for database access
 - Extralite (~> 2.14) for high-performance SQLite storage

+## Working with This Codebase
+
+**Check memory before exploring code.** Use `memory.recall`, `memory.decisions`, `memory.architecture`, or `memory.conventions` to find existing knowledge before reading files.
+
 ## Development Commands

 ### Setup
@@ -49,6 +53,40 @@ bundle exec rake release # Tag + push to RubyGems (requires credentials)
 bundle exec claude-memory <command>
 ```

+### Evals
+```bash
+# Run automated evaluation suite (stub mode - fast, free)
+./bin/run-evals # Run all evals with summary report
+
+# Run real eval validation (slow, costs ~$0.12)
+./bin/run-real-evals all # Run all scenarios with real Claude
+./bin/run-real-evals convention_recall,tech_stack_recall # Specific scenarios
+
+# Or run directly with RSpec
+bundle exec rspec spec/evals/ # Run all eval scenarios (stub mode)
+bundle exec rspec --tag eval # Run only eval-tagged tests
+EVAL_MODE=real bundle exec rspec spec/evals/ --tag eval_real # Real mode
+```
+
+The eval framework tests ClaudeMemory's effectiveness by comparing baseline (no memory) vs memory-enabled responses. See `spec/evals/README.md` for details, `spec/evals/REAL_MODE.md` for real Claude execution, and `spec/evals/CI_INTEGRATION.md` for GitHub Actions integration.
+
+### Benchmarks (DevMemBench)
+```bash
+# Run offline benchmarks - retrieval accuracy + truth maintenance ($0, ~8s)
+bundle exec rspec spec/benchmarks/ --tag benchmark --format documentation
+
+# Run all evals + benchmarks together
+./bin/run-evals --all
+
+# Run only benchmarks (skip evals)
+./bin/run-evals --benchmarks-only
+
+# End-to-end with real Claude (~$2-8)
+EVAL_MODE=real bundle exec rspec spec/benchmarks/e2e/ --tag eval_real
+```
+
+DevMemBench measures retrieval accuracy (Recall@k, MRR, nDCG@10) across 155 queries, truth maintenance correctness across 100 cases, and end-to-end Claude response quality across 31 scenarios. Semantic and hybrid retrieval use [fastembed-rb](https://github.com/khasinski/fastembed-rb) (BAAI/bge-small-en-v1.5, local ONNX, no API key). See `spec/benchmarks/README.md` for full details.
+
 ## Architecture

 ### Dual-Database System
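Two of the retrieval metrics named in the DevMemBench paragraph above, Recall@k and MRR, have standard textbook definitions that can be sketched in a few lines. The helper names below are hypothetical, and how the benchmark suite itself computes or aggregates these values is not shown in this diff.

```ruby
# Hypothetical helpers for two of the retrieval metrics named above.
# `retrieved` is a best-first list of IDs; `relevant` lists the expected IDs.

# Fraction of the relevant IDs that appear in the top k results.
def recall_at_k(retrieved, relevant, k)
  return 0.0 if relevant.empty?

  hits = retrieved.first(k) & relevant
  hits.size.to_f / relevant.size
end

# 1 / rank of the first relevant result, or 0.0 if none was retrieved.
def reciprocal_rank(retrieved, relevant)
  index = retrieved.index { |id| relevant.include?(id) }
  index ? 1.0 / (index + 1) : 0.0
end

# Mean of the reciprocal ranks over [retrieved, relevant] pairs.
def mean_reciprocal_rank(runs)
  return 0.0 if runs.empty?

  runs.sum { |retrieved, relevant| reciprocal_rank(retrieved, relevant) } / runs.size
end
```

Recall@k rewards finding every expected fact within the cut-off, while MRR only looks at how early the first correct fact appears, so the two can diverge on queries with several relevant facts.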
data/README.md
CHANGED
@@ -206,6 +206,49 @@ The uninstall command removes:
 - 🏗️ [Architecture](docs/architecture.md) - Technical deep dive
 - 📝 [Changelog](CHANGELOG.md) - Release notes

+## Benchmarks
+
+ClaudeMemory includes **DevMemBench**, a developer-domain benchmark suite that measures retrieval quality and truth maintenance accuracy. All offline benchmarks run locally at zero cost.
+
+### Latest Results
+
+| Benchmark | Metric | Score |
+|-----------|--------|-------|
+| **Truth Maintenance** | Accuracy (100 cases) | **100%** |
+| **FTS5 Retrieval** | Recall@5 (40 easy queries) | **97.5%** |
+| **Semantic Retrieval** | Recall@5 (85 queries aggregate) | **78.6%** |
+| **Semantic Retrieval** | Recall@5 (40 medium queries) | **69.6%** |
+| **Hybrid Retrieval** | Recall@5 (100 queries aggregate) | **72.7%** |
+| **Hybrid Retrieval** | Recall@10 (20 hard queries) | **62.8%** |
+| **Scope Ranking** | Queries returning expected facts | **5/5** |
+
+Semantic and hybrid retrieval use [fastembed-rb](https://github.com/khasinski/fastembed-rb) with the BAAI/bge-small-en-v1.5 model (384-dim, runs locally, no API key needed).
+
+### What the benchmarks measure
+
+**Retrieval accuracy** -- Given a database of ~105 developer-domain facts across 5 simulated projects, how well does search find the right facts? Measured with standard IR metrics (Recall@k, MRR, nDCG@10) across 155 queries at varying difficulty levels (exact keyword match, semantic paraphrase, cross-category synthesis, abstention, temporal).
+
+**Truth maintenance** -- Given pairs of existing and incoming facts, does the resolver correctly determine the outcome? 100 FEVER-inspired cases test four outcomes: supersession (new stated fact replaces old), conflict (inferred fact contradicts stated), accumulation (multi-value predicates coexist), and corroboration (same fact adds provenance).
+
+**End-to-end with Claude** -- 31 scenarios across 5 LongMemEval ability categories (information extraction, multi-session reasoning, temporal reasoning, knowledge updates, abstention). Requires `EVAL_MODE=real` and costs ~$2-8 per run.
+
+### Running benchmarks
+
+```bash
+# Offline benchmarks ($0, ~8 seconds)
+bundle exec rspec spec/benchmarks/ --tag benchmark --format documentation
+
+# Full evals + benchmarks
+./bin/run-evals --all
+
+# End-to-end with real Claude (~$2-8)
+EVAL_MODE=real bundle exec rspec spec/benchmarks/e2e/ --tag eval_real
+```
+
+The benchmark dataset draws from real CLAUDE.md patterns and is designed specifically for ClaudeMemory's 6 predicates and 8 entity types. Open IR datasets (BEIR, FEVER, LongMemEval) informed the methodology but don't cover developer-domain knowledge.
+
+👉 **[Benchmark Details →](spec/benchmarks/README.md)**
+
 ## For Developers

 - **Language:** Ruby 3.2+
data/Rakefile
CHANGED
@@ -3,7 +3,20 @@
 require "bundler/gem_tasks"
 require "rspec/core/rake_task"

-
+# Parallel test execution for faster runs
+desc "Run specs in parallel"
+task :spec do
+  # Use parallel_rspec if available, fall back to regular rspec
+  if system("which parallel_rspec > /dev/null 2>&1")
+    sh "bundle exec parallel_rspec spec/"
+  else
+    puts "parallel_tests not installed, running sequentially"
+    sh "bundle exec rspec"
+  end
+end
+
+# Sequential test execution (for debugging)
+RSpec::Core::RakeTask.new(:spec_sequential)

 require "standard/rake"