claude_memory 0.3.0 → 0.5.0

Files changed (75)
  1. checksums.yaml +4 -4
  2. data/.claude/CLAUDE.md +1 -1
  3. data/.claude/output-styles/memory-aware.md +1 -0
  4. data/.claude/rules/claude_memory.generated.md +9 -34
  5. data/.claude/settings.local.json +4 -1
  6. data/.claude/skills/check-memory/DEPRECATED.md +29 -0
  7. data/.claude/skills/check-memory/SKILL.md +10 -0
  8. data/.claude/skills/debug-memory +1 -0
  9. data/.claude/skills/improve/SKILL.md +12 -1
  10. data/.claude/skills/memory-first-workflow +1 -0
  11. data/.claude/skills/setup-memory +1 -0
  12. data/.claude-plugin/plugin.json +1 -1
  13. data/.lefthook/map_specs.rb +29 -0
  14. data/CHANGELOG.md +83 -5
  15. data/CLAUDE.md +38 -0
  16. data/README.md +43 -0
  17. data/Rakefile +14 -1
  18. data/WEEK2_COMPLETE.md +250 -0
  19. data/db/migrations/008_add_provenance_line_range.rb +21 -0
  20. data/db/migrations/009_add_docid.rb +39 -0
  21. data/db/migrations/010_add_llm_cache.rb +30 -0
  22. data/docs/architecture.md +49 -14
  23. data/docs/ci_integration.md +294 -0
  24. data/docs/eval_week1_summary.md +183 -0
  25. data/docs/eval_week2_summary.md +419 -0
  26. data/docs/evals.md +353 -0
  27. data/docs/improvements.md +72 -1085
  28. data/docs/influence/claude-supermemory.md +498 -0
  29. data/docs/influence/qmd.md +424 -2022
  30. data/docs/quality_review.md +64 -705
  31. data/lefthook.yml +8 -1
  32. data/lib/claude_memory/commands/doctor_command.rb +45 -4
  33. data/lib/claude_memory/commands/explain_command.rb +11 -6
  34. data/lib/claude_memory/commands/stats_command.rb +1 -1
  35. data/lib/claude_memory/core/fact_graph.rb +122 -0
  36. data/lib/claude_memory/core/fact_query_builder.rb +34 -14
  37. data/lib/claude_memory/core/fact_ranker.rb +3 -20
  38. data/lib/claude_memory/core/relative_time.rb +45 -0
  39. data/lib/claude_memory/core/result_sorter.rb +2 -2
  40. data/lib/claude_memory/core/rr_fusion.rb +57 -0
  41. data/lib/claude_memory/core/snippet_extractor.rb +97 -0
  42. data/lib/claude_memory/domain/fact.rb +3 -1
  43. data/lib/claude_memory/embeddings/fastembed_adapter.rb +55 -0
  44. data/lib/claude_memory/index/index_query.rb +2 -0
  45. data/lib/claude_memory/index/lexical_fts.rb +18 -0
  46. data/lib/claude_memory/infrastructure/operation_tracker.rb +7 -21
  47. data/lib/claude_memory/infrastructure/schema_validator.rb +30 -25
  48. data/lib/claude_memory/ingest/content_sanitizer.rb +8 -1
  49. data/lib/claude_memory/ingest/ingester.rb +74 -59
  50. data/lib/claude_memory/ingest/tool_extractor.rb +1 -1
  51. data/lib/claude_memory/ingest/tool_filter.rb +55 -0
  52. data/lib/claude_memory/logging/logger.rb +112 -0
  53. data/lib/claude_memory/mcp/query_guide.rb +96 -0
  54. data/lib/claude_memory/mcp/response_formatter.rb +86 -23
  55. data/lib/claude_memory/mcp/server.rb +34 -4
  56. data/lib/claude_memory/mcp/text_summary.rb +257 -0
  57. data/lib/claude_memory/mcp/tool_definitions.rb +27 -11
  58. data/lib/claude_memory/mcp/tools.rb +133 -120
  59. data/lib/claude_memory/publish.rb +12 -2
  60. data/lib/claude_memory/recall/expansion_detector.rb +44 -0
  61. data/lib/claude_memory/recall.rb +93 -41
  62. data/lib/claude_memory/resolve/resolver.rb +72 -40
  63. data/lib/claude_memory/store/sqlite_store.rb +99 -24
  64. data/lib/claude_memory/sweep/sweeper.rb +6 -0
  65. data/lib/claude_memory/version.rb +1 -1
  66. data/lib/claude_memory.rb +21 -0
  67. data/output-styles/memory-aware.md +71 -0
  68. data/skills/debug-memory/SKILL.md +146 -0
  69. data/skills/memory-first-workflow/SKILL.md +144 -0
  70. metadata +29 -5
  71. data/.claude/.mind.mv2.o2N83S +0 -0
  72. data/.claude/output-styles/memory-aware.md +0 -21
  73. data/docs/.claude/mind.mv2.lock +0 -0
  74. data/docs/remaining_improvements.md +0 -330
  75. /data/{.claude/skills → skills}/setup-memory/SKILL.md +0 -0
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 7835fd6219efb8e7b8942c0933abd529228b92fc003649abeb26415fae05881b
- data.tar.gz: 8a466024f18ac23db38cc85cedc9e80512f9b8a7cf74cdbcd453626f12408afc
+ metadata.gz: d32c7aa8093eedade1783eefa7109c996aa134d01dea01b67794612c126cf068
+ data.tar.gz: 78fe0f2ff602c740b6ab727a0fde88d07fb235f38b1143b644b4e7498f07079f
  SHA512:
- metadata.gz: 664a3477dd134ad7857cc7e292fea52833260986cb8d85f80d5ab5faba5c3821bab72e754960d5dadd3c203e03443b393e3b7687fbd4915f70970b3d98c9e3ec
- data.tar.gz: 9924044b6fbe3d4a39a722e754a4dc94d3f73af7b0cb9b5177353f9de7461afbd5688f411fbe8d2dcccdd52153318b2ed279ed70984fc7f8b6b45544ad2036e5
+ metadata.gz: be68b153e9e25ef3b8e5ad71ba7fdd89e5ace40f0848125b480bf35c96ad8b0a73b1fd891e59faf5807bc99d8d637f0c6d647e2b1d6c86ad0654b9df69c1ab45
+ data.tar.gz: 92ee3e92e7967e03f642a5736a84f70563dd6e9f00fc3a7b24714b97294348e35645cdbfdd4b760bc5ee21301bb869bccbc686050cbc95e2bf7399c38508a32e
data/.claude/CLAUDE.md CHANGED
@@ -1,4 +1,4 @@
- <!-- ClaudeMemory v0.3.0 -->
+ <!-- ClaudeMemory v0.5.0 -->
  # Project Memory

  @.claude/rules/claude_memory.generated.md
@@ -0,0 +1 @@
+ ../../output-styles/memory-aware.md
@@ -1,46 +1,21 @@
  <!--
  This file is auto-generated by claude-memory.
  Do not edit manually - changes will be overwritten.
- Generated: 2026-01-29T18:59:01Z
+ Generated: 2026-02-04T16:37:00Z
  -->

  # Project Memory

  ## Current Decisions

- - Extralite chosen to fix database lock contention between MCP server and hooks
+ - From QMD 2026-02-02 restudy: adopt Claude Code plugin format, MCP structured content pattern, MCP query guide prompt, inline status checks. Carry forward sqlite-vec, RRF, docids, smart expansion from 2026-01-26. Reject custom fine-tuned models, LLM reranking, YAML collections.
+ - From claude-supermemory study: adopt SessionStart hook context injection (hookSpecificOutput.additionalContext), tool-specific observation compression, and relative time formatting. Reject cloud storage dependency and no-test approach.

  ## Conventions

- - frozen_string_literal: true in all Ruby files
- - RSpec uses documentation format
- - Default Rake task runs tests and linter
- - frozen_string_literal: true in all Ruby files
- - Default Rake task runs spec and standard
- - Standard Ruby linter configured for Ruby 3.2
- - Pre-commit hook runs standard:fix and stages changes
- - Pre-commit hook runs tests
- - Pre-commit hook runs quality review on Ruby files
- - Pre-commit hook runs standard:fix and stages changes
- - Pre-commit hook runs tests
- - Pre-commit hook runs quality review on Ruby files
- - frozen_string_literal: true
- - RSpec documentation format
- - Default Rake task runs tests and linter
- - frozen_string_literal: true in all Ruby files
- - RSpec uses documentation format
- - Default Rake task runs tests and linter
- - Pre-commit hook runs standard:fix and stages changes
- - Pre-commit hook runs tests
- - Pre-commit hook runs quality review on Ruby files
- - Pre-commit hook runs standard:fix and stages changes
- - Pre-commit hook runs tests
- - Pre-commit hook runs quality review on Ruby files
- - frozen_string_literal: true in all Ruby files
- - RSpec format documentation mode
- - Default Rake task runs spec and standard
- - frozen_string_literal: true in all Ruby files
-
- ## Technical Constraints
-
- - **Uses database**: SQLite with Extralite adapter
+ - MCP tools return dual content (text summary) + structuredContent (JSON) via TextSummary module and Server#handle_tools_call. Compact mode (compact: true) omits receipts for ~60% smaller responses.
+ - ContentSanitizer strips system-reminder, local-command-caveat, command-message, command-name, command-args tags in addition to private/no-memory/secret/claude-memory-context.
+ - Core::RelativeTime module provides progressive time formatting: just now → Xm ago → Xh ago → Xd ago → YYYY-MM-DD. Used in ResponseFormatter for *_ago fields.
+ - MCP server registers memory_guide prompt via prompts/list and prompts/get endpoints. QueryGuide module holds prompt content.
+ - Claude Code plugin with marketplace.json, skill definitions, MCP server bundling. 5,700+ stars, by Tobi Lütke. Custom fine-tuned query expansion (Qwen3-1.7B, SFT+GRPO). Dual content/structuredContent MCP pattern.
+ - Cloud-backed Claude Code plugin (~1,195 LOC JavaScript) using Supermemory API for persistent memory across sessions. Uses hooks for SessionStart context injection and Stop transcript capture. No local database.
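The progressive time format named in the conventions above (just now → Xm ago → Xh ago → Xd ago → YYYY-MM-DD) can be sketched as follows. This is a minimal illustration and not the contents of `lib/claude_memory/core/relative_time.rb`: the module name `RelativeTimeSketch`, the method name `ago`, and the 7-day cutoff before switching to an absolute date are all assumptions, since the diff summary does not state the thresholds.

```ruby
# frozen_string_literal: true

# Hypothetical sketch of progressive relative-time formatting.
# Names and the 7-day cutoff are assumptions, not the gem's actual code.
module RelativeTimeSketch
  MINUTE = 60
  HOUR = 60 * MINUTE
  DAY = 24 * HOUR
  DATE_CUTOFF_DAYS = 7 # assumed threshold for falling back to YYYY-MM-DD

  module_function

  # Formats `time` relative to `now`, coarsening the unit as the gap grows.
  def ago(time, now: Time.now)
    seconds = (now - time).to_i
    return "just now" if seconds < MINUTE
    return "#{seconds / MINUTE}m ago" if seconds < HOUR
    return "#{seconds / HOUR}h ago" if seconds < DAY
    return "#{seconds / DAY}d ago" if seconds < DATE_CUTOFF_DAYS * DAY

    time.strftime("%Y-%m-%d")
  end
end
```

Passing `now:` explicitly keeps the function deterministic in tests; timestamps in the future fall through to "just now" in this sketch.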
@@ -25,7 +25,10 @@
  "Bash(git log:*)",
  "Bash(find:*)",
  "Bash(wc:*)",
- "mcp__plugin_claude-memory_memory__memory_architecture"
+ "mcp__plugin_claude-memory_memory__memory_architecture",
+ "mcp__plugin_claude-memory_memory__memory_recall_index",
+ "Bash(./bin/run-evals:*)",
+ "WebSearch"
  ]
  },
  "enableAllProjectMcpServers": true
@@ -0,0 +1,29 @@
+ # Deprecated: /check-memory
+
+ This skill is **no longer needed** and should not be used.
+
+ ## Why Deprecated?
+
+ The `/check-memory` skill was created to force a "check memory before file exploration" workflow. However, this should be **automatic**, not manual.
+
+ ## What Replaced It?
+
+ The enhanced `memory-aware` output style now handles this automatically by:
+ - Explicitly instructing Claude to check memory FIRST before file reads
+ - Providing clear workflow: memory.recall → then file exploration if needed
+ - Making this behavior persistent across all conversations
+
+ ## If You Need Debugging
+
+ Use `/debug-memory` instead to troubleshoot ClaudeMemory installation issues.
+
+ ## Migration
+
+ If you were using `/check-memory`:
+ - **Remove** any references to it
+ - **Use** the `memory-aware` output style (automatically applied)
+ - **Trust** that Claude will check memory first automatically
+
+ ## Archive Date
+
+ Deprecated: 2026-01-29
@@ -16,6 +16,16 @@ The user is asking about: $ARGUMENTS

  ## Step-by-Step Workflow

+ ### 0. Verify Memory Health
+
+ Before querying, confirm the memory system is operational:
+
+ ```
+ memory.check_setup
+ ```
+
+ If status is not "healthy", inform the user and suggest running `claude-memory doctor` for details.
+
  ### 1. Query Memory (REQUIRED FIRST STEP)

  Run multiple memory queries to find existing knowledge:
@@ -0,0 +1 @@
+ ../../skills/debug-memory
@@ -11,7 +11,8 @@ Systematically implement feature improvements from `docs/improvements.md`, makin

  ## Process Overview

- 1. **Read the improvements document** from `docs/improvements.md`
+ 1. **Check memory health** by calling `memory.check_setup` to verify the system is operational
+ 2. **Read the improvements document** from `docs/improvements.md`
  2. **Identify unimplemented features** from "Remaining Tasks" section
  3. **Prioritize by stated priority** (Medium → Low)
  4. **Assess feasibility** (skip if too complex or requires external services)
@@ -22,6 +23,16 @@ Systematically implement feature improvements from `docs/improvements.md`, makin

  ## Detailed Steps

+ ### Step 0: Verify Memory Health
+
+ Before starting, confirm the memory system is operational:
+
+ ```
+ memory.check_setup
+ ```
+
+ If status is not "healthy", address any issues before proceeding.
+
  ### Step 1: Read and Parse Improvements Document

  ```bash
@@ -0,0 +1 @@
+ ../../skills/memory-first-workflow
@@ -0,0 +1 @@
+ ../../skills/setup-memory
@@ -1,6 +1,6 @@
  {
  "name": "claude-memory",
- "version": "0.2.0",
+ "version": "0.5.0",
  "description": "Long-term self-managed memory for Claude Code with fact extraction, truth maintenance, and provenance tracking",
  "author": {
  "name": "Valentino Stoll"
@@ -0,0 +1,29 @@
+ #!/usr/bin/env ruby
+ # frozen_string_literal: true
+
+ # Map changed Ruby files to their corresponding spec files
+ # Usage: .lefthook/map_specs.rb
+ # Returns: Space-separated list of spec files to run
+
+ changed_files = `git diff --cached --name-only --diff-filter=ACM`.split("\n")
+ ruby_files = changed_files.select { |f| f.end_with?(".rb") && f.start_with?("lib/") }
+
+ # Map lib/ files to their spec/ equivalents
+ specs = ruby_files.map do |file|
+   # lib/claude_memory/foo/bar.rb → spec/claude_memory/foo/bar_spec.rb
+   file.sub("lib/", "spec/").sub(".rb", "_spec.rb")
+ end.select { |spec| File.exist?(spec) }
+
+ # Always run integration tests if core infrastructure changed
+ critical_paths = [
+   "lib/claude_memory/store",
+   "lib/claude_memory/mcp",
+   "lib/claude_memory/hook"
+ ]
+
+ if ruby_files.any? { |f| critical_paths.any? { |path| f.start_with?(path) } }
+   specs += Dir["spec/integration/*_spec.rb"] if Dir.exist?("spec/integration")
+ end
+
+ # Output unique spec files
+ puts specs.uniq.join(" ")
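The lib-to-spec mapping rule in the hook script above can be isolated as a pure function, which makes it easy to check without a git index. The sketch below is illustrative only: `spec_path_for` is a hypothetical name, and it anchors both substitutions with regular expressions (an assumption about intent) so that only the leading `lib/` and the trailing `.rb` are rewritten.

```ruby
# frozen_string_literal: true

# Hypothetical helper mirroring the lib/ → spec/ mapping rule.
# Returns nil for paths the hook would ignore (non-Ruby or outside lib/).
def spec_path_for(lib_path)
  return nil unless lib_path.start_with?("lib/") && lib_path.end_with?(".rb")

  lib_path
    .sub(%r{\Alib/}, "spec/")   # only the leading directory
    .sub(/\.rb\z/, "_spec.rb")  # only the file extension
end

puts spec_path_for("lib/claude_memory/core/rr_fusion.rb")
# prints "spec/claude_memory/core/rr_fusion_spec.rb"
```

Unlike the hook, this helper does not check that the spec file exists; a caller would still filter with `File.exist?`.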
data/CHANGELOG.md CHANGED
@@ -4,16 +4,94 @@ All notable changes to this project will be documented in this file.

  ## [Unreleased]

+ ## [0.5.0] - 2026-02-04
+
  ### Added

- ### Changed
+ **MCP Structured Content & Compact Mode**
+ - Dual content (text summary) + structuredContent (JSON) for all MCP tools
+ - `TextSummary` module generates human-readable summaries alongside structured data
+ - Compact mode (`compact: true`) omits provenance receipts for ~60% smaller responses
+ - MCP query guide prompt registered via `prompts/list` and `prompts/get` endpoints
+ - `QueryGuide` module provides tool selection guidance to Claude
+
+ **Search & Retrieval Improvements**
+ - Reciprocal Rank Fusion (RRF) replacing naive merge for hybrid search
+ - Better result ordering when combining FTS5 and semantic search results
+ - Smart expansion detection to skip unnecessary vector search
+ - Reduces latency when FTS5 already provides strong matches
+ - Enhanced snippet extraction for search results
+ - Better context windows around matched terms
+
+ **Provenance & Traceability**
+ - Line-range references in provenance for precise source linking
+ - Facts now track exact line ranges in source transcripts
+ - Fact dependency graph visualization via BFS traversal
+ - Trace supersession and conflict chains between facts
+
+ **User-Friendly Identifiers**
+ - Docid short hash system for user-friendly fact references
+ - Short, memorable identifiers instead of raw integer IDs
+
+ **Caching & Performance**
+ - LLM response caching schema and store methods
+ - Cache layer for expensive extraction operations
+ - Structured JSON logging with level filtering
+ - Configurable log levels (debug, info, warn, error)
+ - JSON format for machine-parseable log output
+
+ **Ingestion & Content Processing**
+ - Configurable tool capture filtering for ingestion
+ - Control which tool outputs are captured during transcript processing
+ - ContentSanitizer now strips `system-reminder`, `local-command-caveat`, `command-message`,
+ `command-name`, and `command-args` tags in addition to privacy tags
+ - Relative time formatting in MCP recall output
+ - Progressive format: just now → Xm ago → Xh ago → Xd ago → YYYY-MM-DD

- ### Fixed
+ **Developer Tools**
+ - `--brief` flag for doctor command and health checks in skills
+ - Quick pass/fail output for automated workflows

- ### Documentation
+ ### Fixed
+ - Preserve SQLite PRAGMAs across connection reconnects
+ - WAL mode and other pragmas now survive reconnection cycles
+ - Timestamp-only churn in publish output
+ - Publish no longer regenerates files when only the timestamp changed

  ### Internal

+ **Code Quality Improvements**
+ - Extract duplicates and decompose long methods across codebase
+ - Extract ingester transaction body into focused methods
+ - Decompose `resolve_fact` into intention-revealing methods
+ - Extract `check_setup` and `detailed_stats` into focused helpers
+ - Fix N+1 query patterns in `recall.rb`
+ - Fix 6 quick wins from quality review (frozen strings, method sizes, naming)
+
+ **Research & Studies**
+ - QMD restudy (2026-02-02): adopt Claude Code plugin format, MCP structured content pattern,
+ MCP query guide prompt, inline status checks
+ - claude-supermemory study: adopt SessionStart hook context injection, tool-specific observation
+ compression, and relative time formatting
+
+ ## [0.4.0] - 2026-02-02
+
+ ### Added
+
+ **Semantic Search with FastEmbed**
+ - Integrated [fastembed-rb](https://github.com/khasinski/fastembed-rb) for high-quality local embeddings
+ - Uses BAAI/bge-small-en-v1.5 model (384-dim, ~67MB ONNX, runs locally)
+ - No API key required -- model downloaded once to `~/.cache/fastembed/`
+ - Asymmetric query/passage encoding for better retrieval accuracy
+ - `FastembedAdapter` class implementing the existing `Generator` interface for drop-in replacement
+ - Benchmark retrieval scores jumped significantly with real embeddings:
+ - Semantic easy: Recall@5 = 0.900, medium: 0.696
+ - Hybrid aggregate: Recall@5 = 0.727 (was 0.266 with TF-IDF fallback)
+
+ ### Documentation
+ - Updated benchmark results throughout README, spec/benchmarks/README, and architecture docs
+ - Replaced TF-IDF embedding references with FastEmbed in architecture documentation
+
  ## [0.3.0] - 2026-01-29

  ### Added
@@ -41,9 +119,9 @@ All notable changes to this project will be documented in this file.
  - Configuration class for centralized ENV access and testability

  **Search & Recall**
- - `index` command to generate TF-IDF embeddings for semantic search
+ - `index` command to generate embeddings for semantic search
  - Index command resumability with checkpoints (recover from interruption)
- - Semantic search capabilities using TF-IDF embeddings
+ - Semantic search capabilities with embedding-based vector search
  - Improved full-text search with empty query handling

  **Session Intelligence**
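The 0.5.0 changelog entry above names Reciprocal Rank Fusion (RRF) as the replacement for a naive merge of FTS5 and semantic results. The sketch below shows the standard RRF formula, where each document scores the sum of `1 / (k + rank)` over every list it appears in. It is not the contents of `lib/claude_memory/core/rr_fusion.rb`: the module name `RrfSketch`, the `fuse` signature, and the constant `k = 60` (the value commonly used in the literature) are assumptions.

```ruby
# frozen_string_literal: true

# Hypothetical sketch of Reciprocal Rank Fusion over ranked ID lists.
# Not the gem's implementation; names and k = 60 are assumptions.
module RrfSketch
  DEFAULT_K = 60

  module_function

  # ranked_lists: array of arrays of IDs, best match first in each list.
  # Returns all IDs ordered by fused score, highest first.
  def fuse(ranked_lists, k: DEFAULT_K)
    scores = Hash.new(0.0)
    ranked_lists.each do |list|
      list.each_with_index do |id, index|
        scores[id] += 1.0 / (k + index + 1) # ranks are 1-based
      end
    end
    # Tie-break on ID so the output is deterministic (IDs must be comparable).
    scores.sort_by { |id, score| [-score, id] }.map(&:first)
  end
end

fts_hits = [101, 102, 103]    # illustrative lexical results
vector_hits = [103, 101, 104] # illustrative semantic results
p RrfSketch.fuse([fts_hits, vector_hits])
# prints [101, 103, 102, 104]
```

Because RRF uses only ranks, it needs no normalization between BM25-style lexical scores and vector similarities, which is the usual reason to prefer it over merging raw scores.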
data/CLAUDE.md CHANGED
@@ -11,6 +11,10 @@ ClaudeMemory is a Ruby gem that provides long-term, self-managed memory for Clau
  - Sequel (~> 5.0) for database access
  - Extralite (~> 2.14) for high-performance SQLite storage

+ ## Working with This Codebase
+
+ **Check memory before exploring code.** Use `memory.recall`, `memory.decisions`, `memory.architecture`, or `memory.conventions` to find existing knowledge before reading files.
+
  ## Development Commands

  ### Setup
@@ -49,6 +53,40 @@ bundle exec rake release # Tag + push to RubyGems (requires credentials)
  bundle exec claude-memory <command>
  ```

+ ### Evals
+ ```bash
+ # Run automated evaluation suite (stub mode - fast, free)
+ ./bin/run-evals # Run all evals with summary report
+
+ # Run real eval validation (slow, costs ~$0.12)
+ ./bin/run-real-evals all # Run all scenarios with real Claude
+ ./bin/run-real-evals convention_recall,tech_stack_recall # Specific scenarios
+
+ # Or run directly with RSpec
+ bundle exec rspec spec/evals/ # Run all eval scenarios (stub mode)
+ bundle exec rspec --tag eval # Run only eval-tagged tests
+ EVAL_MODE=real bundle exec rspec spec/evals/ --tag eval_real # Real mode
+ ```
+
+ The eval framework tests ClaudeMemory's effectiveness by comparing baseline (no memory) vs memory-enabled responses. See `spec/evals/README.md` for details, `spec/evals/REAL_MODE.md` for real Claude execution, and `spec/evals/CI_INTEGRATION.md` for GitHub Actions integration.
+
+ ### Benchmarks (DevMemBench)
+ ```bash
+ # Run offline benchmarks - retrieval accuracy + truth maintenance ($0, ~8s)
+ bundle exec rspec spec/benchmarks/ --tag benchmark --format documentation
+
+ # Run all evals + benchmarks together
+ ./bin/run-evals --all
+
+ # Run only benchmarks (skip evals)
+ ./bin/run-evals --benchmarks-only
+
+ # End-to-end with real Claude (~$2-8)
+ EVAL_MODE=real bundle exec rspec spec/benchmarks/e2e/ --tag eval_real
+ ```
+
+ DevMemBench measures retrieval accuracy (Recall@k, MRR, nDCG@10) across 155 queries, truth maintenance correctness across 100 cases, and end-to-end Claude response quality across 31 scenarios. Semantic and hybrid retrieval use [fastembed-rb](https://github.com/khasinski/fastembed-rb) (BAAI/bge-small-en-v1.5, local ONNX, no API key). See `spec/benchmarks/README.md` for full details.
+
  ## Architecture

  ### Dual-Database System
data/README.md CHANGED
@@ -206,6 +206,49 @@ The uninstall command removes:
  - 🏗️ [Architecture](docs/architecture.md) - Technical deep dive
  - 📝 [Changelog](CHANGELOG.md) - Release notes

+ ## Benchmarks
+
+ ClaudeMemory includes **DevMemBench**, a developer-domain benchmark suite that measures retrieval quality and truth maintenance accuracy. All offline benchmarks run locally at zero cost.
+
+ ### Latest Results
+
+ | Benchmark | Metric | Score |
+ |-----------|--------|-------|
+ | **Truth Maintenance** | Accuracy (100 cases) | **100%** |
+ | **FTS5 Retrieval** | Recall@5 (40 easy queries) | **97.5%** |
+ | **Semantic Retrieval** | Recall@5 (85 queries aggregate) | **78.6%** |
+ | **Semantic Retrieval** | Recall@5 (40 medium queries) | **69.6%** |
+ | **Hybrid Retrieval** | Recall@5 (100 queries aggregate) | **72.7%** |
+ | **Hybrid Retrieval** | Recall@10 (20 hard queries) | **62.8%** |
+ | **Scope Ranking** | Queries returning expected facts | **5/5** |
+
+ Semantic and hybrid retrieval use [fastembed-rb](https://github.com/khasinski/fastembed-rb) with the BAAI/bge-small-en-v1.5 model (384-dim, runs locally, no API key needed).
+
+ ### What the benchmarks measure
+
+ **Retrieval accuracy** -- Given a database of ~105 developer-domain facts across 5 simulated projects, how well does search find the right facts? Measured with standard IR metrics (Recall@k, MRR, nDCG@10) across 155 queries at varying difficulty levels (exact keyword match, semantic paraphrase, cross-category synthesis, abstention, temporal).
+
+ **Truth maintenance** -- Given pairs of existing and incoming facts, does the resolver correctly determine the outcome? 100 FEVER-inspired cases test four outcomes: supersession (new stated fact replaces old), conflict (inferred fact contradicts stated), accumulation (multi-value predicates coexist), and corroboration (same fact adds provenance).
+
+ **End-to-end with Claude** -- 31 scenarios across 5 LongMemEval ability categories (information extraction, multi-session reasoning, temporal reasoning, knowledge updates, abstention). Requires `EVAL_MODE=real` and costs ~$2-8 per run.
+
+ ### Running benchmarks
+
+ ```bash
+ # Offline benchmarks ($0, ~8 seconds)
+ bundle exec rspec spec/benchmarks/ --tag benchmark --format documentation
+
+ # Full evals + benchmarks
+ ./bin/run-evals --all
+
+ # End-to-end with real Claude (~$2-8)
+ EVAL_MODE=real bundle exec rspec spec/benchmarks/e2e/ --tag eval_real
+ ```
+
+ The benchmark dataset draws from real CLAUDE.md patterns and is designed specifically for ClaudeMemory's 6 predicates and 8 entity types. Open IR datasets (BEIR, FEVER, LongMemEval) informed the methodology but don't cover developer-domain knowledge.
+
+ 👉 **[Benchmark Details →](spec/benchmarks/README.md)**
+
  ## For Developers

  - **Language:** Ruby 3.2+
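For readers unfamiliar with the retrieval metrics the README hunk above reports, the sketch below computes Recall@k and reciprocal rank (the per-query term that MRR averages) using their textbook definitions. It is not taken from `spec/benchmarks/`: the module name `IrMetricsSketch`, the method names, and the choice to score a query with no relevant facts as 0.0 are assumptions, and the benchmark suite may treat abstention queries differently.

```ruby
# frozen_string_literal: true

# Hypothetical sketch of two standard IR metrics.
# Not the DevMemBench implementation; names and edge-case handling are assumptions.
module IrMetricsSketch
  module_function

  # Fraction of the relevant IDs that appear in the top k results.
  def recall_at_k(ranked_ids, relevant_ids, k)
    relevant = relevant_ids.uniq
    return 0.0 if relevant.empty? # assumed handling when nothing is relevant

    hits = ranked_ids.first(k) & relevant
    hits.size.to_f / relevant.size
  end

  # 1 / rank of the first relevant result, or 0.0 when none is retrieved.
  def reciprocal_rank(ranked_ids, relevant_ids)
    index = ranked_ids.index { |id| relevant_ids.include?(id) }
    index ? 1.0 / (index + 1) : 0.0
  end

  # Mean reciprocal rank over [ranked_ids, relevant_ids] pairs.
  def mrr(queries)
    return 0.0 if queries.empty?

    queries.sum { |ranked, relevant| reciprocal_rank(ranked, relevant) } / queries.size
  end
end
```

As a usage example, `IrMetricsSketch.recall_at_k([3, 1, 7, 9], [1, 2], 3)` evaluates to 0.5, because one of the two relevant facts appears in the top three results.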
data/Rakefile CHANGED
@@ -3,7 +3,20 @@
  require "bundler/gem_tasks"
  require "rspec/core/rake_task"

- RSpec::Core::RakeTask.new(:spec)
+ # Parallel test execution for faster runs
+ desc "Run specs in parallel"
+ task :spec do
+   # Use parallel_rspec if available, fall back to regular rspec
+   if system("which parallel_rspec > /dev/null 2>&1")
+     sh "bundle exec parallel_rspec spec/"
+   else
+     puts "parallel_tests not installed, running sequentially"
+     sh "bundle exec rspec"
+   end
+ end
+
+ # Sequential test execution (for debugging)
+ RSpec::Core::RakeTask.new(:spec_sequential)

  require "standard/rake"
9
22