@1mbrain/benchmarks 0.1.1

Files changed (69)
  1. package/README.md +85 -0
  2. package/fixtures/1mbrain-focused-mini/1mbrain-focused-mini.json +928 -0
  3. package/fixtures/1mbrain-focused-mini/README.md +45 -0
  4. package/fixtures/adversarial-memory/dataset_claude_adversarial.json +3333 -0
  5. package/fixtures/adversarial-memory/dataset_gemini_adversarial_memory.json +2984 -0
  6. package/fixtures/balanced-mini/dataset_claude_balanced_mini.json +2077 -0
  7. package/fixtures/balanced-mini/dataset_gemini_balanced_mini.json +1995 -0
  8. package/fixtures/generate_datasets.js +1741 -0
  9. package/fixtures/graph-stress-hard/README.md +43 -0
  10. package/fixtures/graph-stress-hard/dataset_graph_stress_hard.json +4374 -0
  11. package/fixtures/graph-stress-hard/generate_graph_stress_hard.js +526 -0
  12. package/fixtures/realistic-medium/dataset_claude_realistic_medium.json +7462 -0
  13. package/fixtures/realistic-medium/dataset_gemini_realistic_medium.json +7277 -0
  14. package/fixtures/realistic-medium/gen_claude_medium.js +600 -0
  15. package/package.json +22 -0
  16. package/reports/benchmark_report.md +48 -0
  17. package/reports/benchmark_report_claude_adversarial.md +42 -0
  18. package/reports/benchmark_report_claude_adversarial_adaptive.md +42 -0
  19. package/reports/benchmark_report_claude_adversarial_adaptive2_fast.md +42 -0
  20. package/reports/benchmark_report_claude_adversarial_adaptive_fast.md +42 -0
  21. package/reports/benchmark_report_claude_adversarial_rerank.md +42 -0
  22. package/reports/benchmark_report_claude_balanced_mini.md +42 -0
  23. package/reports/benchmark_report_claude_balanced_mini_adaptive.md +42 -0
  24. package/reports/benchmark_report_claude_balanced_mini_adaptive2_fast.md +42 -0
  25. package/reports/benchmark_report_claude_balanced_mini_adaptive_fast.md +42 -0
  26. package/reports/benchmark_report_claude_balanced_mini_rerank.md +42 -0
  27. package/reports/benchmark_report_claude_realistic_medium.md +42 -0
  28. package/reports/benchmark_report_claude_realistic_medium_adaptive.md +42 -0
  29. package/reports/benchmark_report_claude_realistic_medium_adaptive2_fast.md +42 -0
  30. package/reports/benchmark_report_claude_realistic_medium_adaptive_fast.md +42 -0
  31. package/reports/benchmark_report_claude_realistic_medium_evidence_rerank_local.md +42 -0
  32. package/reports/benchmark_report_claude_realistic_medium_openai_evidence_rerank.md +41 -0
  33. package/reports/benchmark_report_claude_realistic_medium_openai_multi_signal.md +41 -0
  34. package/reports/benchmark_report_claude_realistic_medium_openai_multi_signal_scoped.md +41 -0
  35. package/reports/benchmark_report_claude_realistic_medium_openai_phase8_no_judge.md +42 -0
  36. package/reports/benchmark_report_claude_realistic_medium_openai_rankingpolicy.md +41 -0
  37. package/reports/benchmark_report_claude_realistic_medium_openai_stale_filter.md +41 -0
  38. package/reports/benchmark_report_claude_realistic_medium_openai_stale_filter_absence_fix.md +41 -0
  39. package/reports/benchmark_report_claude_realistic_medium_openai_write_time_invalidation.md +41 -0
  40. package/reports/benchmark_report_claude_realistic_medium_rerank.md +42 -0
  41. package/reports/benchmark_report_claude_realistic_medium_stale_filter_local.md +42 -0
  42. package/reports/benchmark_report_graph_stress_hard.md +42 -0
  43. package/reports/benchmark_report_graph_stress_hard_absence_fix.md +42 -0
  44. package/reports/benchmark_report_graph_stress_hard_adaptive.md +42 -0
  45. package/reports/benchmark_report_graph_stress_hard_evidence_rerank.md +42 -0
  46. package/reports/benchmark_report_graph_stress_hard_multi_signal_current_guardrail.md +42 -0
  47. package/reports/benchmark_report_graph_stress_hard_multi_signal_guardrail_fixed.md +42 -0
  48. package/reports/benchmark_report_graph_stress_hard_multi_signal_local.md +42 -0
  49. package/reports/benchmark_report_graph_stress_hard_multi_signal_scoped_guardrail.md +42 -0
  50. package/reports/benchmark_report_graph_stress_hard_multi_signal_vector_pure_guardrail.md +42 -0
  51. package/reports/benchmark_report_graph_stress_hard_phase8_sdk_guardrail.md +42 -0
  52. package/reports/benchmark_report_graph_stress_hard_rerank.md +42 -0
  53. package/reports/benchmark_report_graph_stress_hard_stale_filter.md +42 -0
  54. package/reports/benchmark_report_graph_stress_hard_write_time_invalidation.md +42 -0
  55. package/results/.gitignore +2 -0
  56. package/src/adapters/1mbrain.ts +317 -0
  57. package/src/adapters/keyword-embedding.ts +48 -0
  58. package/src/adapters/mem0.ts +124 -0
  59. package/src/adapters/qdrant.ts +214 -0
  60. package/src/adapters/unavailable.ts +49 -0
  61. package/src/adapters/vector-baseline.ts +149 -0
  62. package/src/datasets/focused-mini.ts +158 -0
  63. package/src/datasets/synthetic-agent-memory.ts +532 -0
  64. package/src/llm-evaluator.ts +262 -0
  65. package/src/metrics.ts +482 -0
  66. package/src/provider.ts +151 -0
  67. package/src/runner.ts +635 -0
  68. package/tsconfig.json +10 -0
  69. package/tsconfig.tsbuildinfo +1 -0
package/package.json ADDED
@@ -0,0 +1,22 @@
+ {
+   "name": "@1mbrain/benchmarks",
+   "version": "0.1.1",
+   "description": "Provider-level benchmarks for 1MBrain and comparable memory providers.",
+   "license": "MIT",
+   "type": "module",
+   "main": "./dist/index.js",
+   "types": "./dist/index.d.ts",
+   "scripts": {
+     "bench": "tsx src/runner.ts",
+     "build": "tsc -p tsconfig.json",
+     "typecheck": "tsc --noEmit"
+   },
+   "dependencies": {
+     "@1mbrain/core": "*",
+     "mem0ai": "^3.0.9"
+   },
+   "devDependencies": {
+     "tsx": "^4.19.0",
+     "typescript": "^5.7.0"
+   }
+ }
@@ -0,0 +1,48 @@
1
+ # 1MBrain Benchmark Final Report
2
+
3
+ This report evaluates the performance of **1MBrain** against standard vector-only baselines and other providers using the `graph-stress-hard` dataset (60 cases).
4
+
5
+ ## Performance Leaderboard
6
+
7
+ | Provider | Evidence Accuracy | Recall@5 | MRR | p95 Latency | Ingestion Rate |
8
+ |---|---:|---:|---:|---:|---:|
9
+ | 1MBrain Graph Full | 1 | 1 | 0.917 | 7.49ms | 9.552ms/case |
10
+ | 1MBrain Graph Light | 1 | 1 | 0.917 | 6.65ms | 8.098ms/case |
11
+ | 1MBrain Vector Only | 0.361 | 0.889 | 0.786 | 2.164ms | 7.73ms/case |
12
+ | Vector Baseline (SQLite) | 0.322 | 0.883 | 0.752 | 0.673ms | 1.392ms/case |
13
+ | Qdrant Vector | N/A | N/A | N/A | N/A | N/A (Unsupported) |
14
+ | Mem0 (Cloud) | N/A | N/A | N/A | N/A | N/A (Unsupported) |
15
+ | Zep/Graphiti | N/A | N/A | N/A | N/A | N/A (Unsupported) |
16
+ | Letta | N/A | N/A | N/A | N/A | N/A (Unsupported) |
17
+ | LangMem | N/A | N/A | N/A | N/A | N/A (Unsupported) |
18
+
19
+ ## Key Evaluation Questions
20
+
21
+ ### 1. Where does 1MBrain outperform typical vector-only memory?
22
+ 1MBrain Graph Full outperforms the Vector Baseline by **210.345%** in evidence retrieval accuracy on this focused dataset. The clearest measurable advantage is in graph-aware scenarios: multi-hop evidence accuracy is **1** for Graph Full versus **0.602** for Vector Only.
23
+
24
+ ### 2. Where does 1MBrain underperform?
25
+ The main weakness is not graph traversal cost; it is retrieval precision under paraphrase, stale preference conflicts, and noisy distractors. Graph Full p95 latency is **7.49ms**, compared to **0.673ms** for the raw SQLite vector baseline, which is still low in absolute terms. Note that the benchmark currently uses a local keyword embedder rather than a stronger semantic embedder.
26
+
27
+ ### 3. Does association graph improve recall quality?
28
+ Yes on this dataset. 1MBrain Graph Full achieved evidence accuracy of **1** compared to **0.361** for 1MBrain Vector Only, a **176.923%** relative improvement. This shows graph links help here, but one 60-case dataset is not enough to claim the graph layer alone solves recall quality.
29
+
30
+ ### 4. Does spreading activation improve multi-hop reasoning?
31
+ Yes on this dataset. Multi-hop evidence accuracy improved from **0.602** for Vector Only to **1** for Graph Full. On harder queries, graph traversal still depends on good seed recall and/or query expansion to consistently reach the correct neighboring nodes.
32
+
33
+ ### 5. Does decay/refresh help prevent stale memory pollution?
34
+ Not convincingly in this run. Memory update evidence accuracy is **1**, but stale-memory failures are still present. This benchmark should be treated as evidence that explicit recency/conflict resolution needs more work before public claims about stale-memory handling.
35
+
36
+ ### 6. Is Memory Passport practically useful?
37
+ Not demonstrated in this run. Graph Full portability success rate is **0** on the focused portability cases, which does not support a claim of practical usefulness. The vector baseline has no portability capability and is expected to fail those operation checks.
38
+
39
+ ### 7. What is the tradeoff between quality, latency, and cost?
40
+ - **Quality:** Graph-enabled 1MBrain is the best local provider in this run by a wide margin: evidence accuracy is **1** for both graph configurations versus **0.361** for Vector Only and **0.322** for the SQLite baseline.
41
+ - **Latency:** The SQLite vector-only baseline is the fastest; Graph Full adds roughly **6.817ms** of p95 latency over it (7.49ms versus 0.673ms) in this small dataset.
42
+ - **Cost:** Since 1MBrain can run fully locally (SQLite + local embedder/Ollama), the running query cost is **$0.00** per 1,000 queries in API fees, whereas cloud API vendors charge usage fees (not measured here).
43
+
44
+ ### 8. What should be improved before public release?
45
+ - Replace or complement the keyword embedder with a stronger local semantic embedder for paraphrase-heavy questions.
46
+ - Add explicit recency/conflict ranking so newer preferences reliably beat stale memories.
47
+ - Improve seed recall and query expansion before spreading activation so graph traversal starts from the right nodes.
48
+ - Keep failure-case reporting in the public benchmark so claims remain reproducible and falsifiable.
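The derived figures in the report above (relative improvement and added p95 latency) are simple arithmetic on the leaderboard values. The following is a minimal TypeScript sketch of that arithmetic, not the package's own code: the helper names `relativeImprovementPct` and `latencyOverheadMs` are hypothetical. Because the leaderboard shows rounded values, a percentage recomputed from them can differ slightly from the reported one, which was presumably derived from unrounded scores.

```typescript
// Hypothetical helpers for illustration; not taken from @1mbrain/benchmarks.

/** Relative change of `candidate` over `baseline`, in percent. */
function relativeImprovementPct(candidate: number, baseline: number): number {
  if (baseline === 0) {
    throw new RangeError("baseline must be non-zero");
  }
  return ((candidate - baseline) / baseline) * 100;
}

/** Absolute p95 latency difference in milliseconds, rounded to 3 decimals. */
function latencyOverheadMs(candidateP95: number, baselineP95: number): number {
  return Math.round((candidateP95 - baselineP95) * 1000) / 1000;
}

// Rounded leaderboard inputs: Graph Full (1) versus SQLite baseline (0.322).
const improvementPct = relativeImprovementPct(1, 0.322);
// Graph Full p95 (7.49ms) versus SQLite baseline p95 (0.673ms).
const overheadMs = latencyOverheadMs(7.49, 0.673);

console.log(improvementPct.toFixed(3), overheadMs);
```

Recomputing from the rounded inputs gives a percentage close to, but not identical with, the reported **210.345%**, while the latency difference reproduces the reported **6.817ms**.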
@@ -0,0 +1,42 @@
1
+ # 1MBrain Benchmark Final Report
2
+
3
+ This report evaluates the performance of **1MBrain** against standard vector-only baselines and other providers using the `memory-bench-adversarial` dataset (60 cases).
4
+
5
+ ## Performance Leaderboard
6
+
7
+ | Provider | Evidence Accuracy | Recall@5 | MRR | p95 Latency | Ingestion Rate |
8
+ |---|---:|---:|---:|---:|---:|
9
+ | 1MBrain Graph Full | 0.789 | 0.906 | 0.732 | 2.859ms | 5.058ms/case |
10
+ | 1MBrain Vector Only | 0.789 | 0.906 | 0.732 | 1.935ms | 4.899ms/case |
11
+ | Vector Baseline (SQLite) | 0.789 | 0.906 | 0.732 | 0.713ms | 1.237ms/case |
12
+
13
+ ## Key Evaluation Questions
14
+
15
+ ### 1. Where does 1MBrain outperform typical vector-only memory?
16
+ 1MBrain Graph Full matches the Vector Baseline in evidence retrieval accuracy on this focused dataset (a **0%** difference). Multi-hop evidence accuracy is also tied at **0.833** for Graph Full and Vector Only, so this run shows no measurable graph-aware advantage.
17
+
18
+ ### 2. Where does 1MBrain underperform?
19
+ The main weakness is not graph traversal cost; it is retrieval precision under paraphrase, stale preference conflicts, and noisy distractors. Graph Full p95 latency is **2.859ms**, compared to **0.713ms** for the raw SQLite vector baseline. This is still low in absolute terms, but quality improvements are modest because the benchmark currently uses a local keyword embedder rather than a stronger semantic embedder.
20
+
21
+ ### 3. Does association graph improve recall quality?
22
+ Not in this run. 1MBrain Graph Full achieved evidence accuracy of **0.789**, the same as 1MBrain Vector Only, a **0%** relative improvement. Graph links provide no measurable recall benefit on this dataset.
23
+
24
+ ### 4. Does spreading activation improve multi-hop reasoning?
25
+ Not in this run. Multi-hop evidence accuracy is unchanged at **0.833** for both Vector Only and Graph Full, and some required supporting memories were still missed. The failure cases indicate that graph traversal needs better seed recall and/or query expansion to consistently reach the correct neighboring nodes.
26
+
27
+ ### 5. Does decay/refresh help prevent stale memory pollution?
28
+ Not convincingly in this run. Memory update evidence accuracy is **0.422**, but stale-memory failures are still present. This benchmark should be treated as evidence that explicit recency/conflict resolution needs more work before public claims about stale-memory handling.
29
+
30
+ ### 6. Is Memory Passport practically useful?
31
+ Not demonstrated in this run. Graph Full portability success rate is **0** on the focused portability cases, which does not support a claim of practical usefulness. The vector baseline has no portability capability and is expected to fail those operation checks.
32
+
33
+ ### 7. What is the tradeoff between quality, latency, and cost?
34
+ - **Quality:** All three local providers tie on evidence accuracy (**0.789**) and Recall@5 (**0.906**) in this run, so graph-enabled 1MBrain shows no quality advantage.
35
+ - **Latency:** The SQLite vector-only baseline is the fastest; Graph Full adds roughly **2.146ms** of p95 latency over it (2.859ms versus 0.713ms) in this small dataset.
36
+ - **Cost:** Since 1MBrain can run fully locally (SQLite + local embedder/Ollama), the running query cost is **$0.00** per 1,000 queries in API fees, whereas cloud API vendors charge usage fees (not measured here).
37
+
38
+ ### 8. What should be improved before public release?
39
+ - Replace or complement the keyword embedder with a stronger local semantic embedder for paraphrase-heavy questions.
40
+ - Add explicit recency/conflict ranking so newer preferences reliably beat stale memories.
41
+ - Improve seed recall and query expansion before spreading activation so graph traversal starts from the right nodes.
42
+ - Keep failure-case reporting in the public benchmark so claims remain reproducible and falsifiable.
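The leaderboard columns Recall@5 and MRR are standard ranking metrics. The sketch below shows their usual textbook definitions in TypeScript. It is an illustration only: the `RankedCase` shape and the function names are assumptions, and the package's `src/metrics.ts` may compute these values differently.

```typescript
// Illustrative definitions only; the package's own metrics code may differ.

interface RankedCase {
  retrieved: string[]; // memory ids in ranked order, best first
  relevant: Set<string>; // ids of the required evidence memories
}

/** Fraction of relevant ids that appear in the top-k retrieved results. */
function recallAtK(c: RankedCase, k: number): number {
  if (c.relevant.size === 0) return 0;
  const topK = new Set(c.retrieved.slice(0, k));
  let hits = 0;
  c.relevant.forEach((id) => {
    if (topK.has(id)) hits++;
  });
  return hits / c.relevant.size;
}

/** 1 / rank of the first relevant result, or 0 if none was retrieved. */
function reciprocalRank(c: RankedCase): number {
  const idx = c.retrieved.findIndex((id) => c.relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

function mean(xs: number[]): number {
  return xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Two made-up cases: the first ranks relevant evidence at position 1,
// the second at position 2.
const cases: RankedCase[] = [
  { retrieved: ["m1", "m7", "m3"], relevant: new Set(["m1", "m3"]) },
  { retrieved: ["m9", "m2", "m4"], relevant: new Set(["m2"]) },
];

const recallAt5 = mean(cases.map((c) => recallAtK(c, 5)));
const mrr = mean(cases.map(reciprocalRank));
console.log({ recallAt5, mrr });
```

Under these definitions, identical Recall@5 with a lower MRR means the same evidence was found but ranked further down the result list.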
@@ -0,0 +1,42 @@
1
+ # 1MBrain Benchmark Final Report
2
+
3
+ This report evaluates the performance of **1MBrain** against standard vector-only baselines and other providers using the `memory-bench-adversarial` dataset (60 cases).
4
+
5
+ ## Performance Leaderboard
6
+
7
+ | Provider | Evidence Accuracy | Recall@5 | MRR | p95 Latency | Ingestion Rate |
8
+ |---|---:|---:|---:|---:|---:|
9
+ | 1MBrain Graph Full | 0.789 | 0.906 | 0.731 | 5.144ms | 9.504ms/case |
10
+ | 1MBrain Vector Only | 0.789 | 0.906 | 0.732 | 4.164ms | 10.152ms/case |
11
+ | Vector Baseline (SQLite) | 0.789 | 0.906 | 0.732 | 1.301ms | 2.721ms/case |
12
+
13
+ ## Key Evaluation Questions
14
+
15
+ ### 1. Where does 1MBrain outperform typical vector-only memory?
16
+ 1MBrain Graph Full matches the Vector Baseline in evidence retrieval accuracy on this focused dataset (a **0%** difference). Multi-hop evidence accuracy is also tied at **0.833** for Graph Full and Vector Only, so this run shows no measurable graph-aware advantage.
17
+
18
+ ### 2. Where does 1MBrain underperform?
19
+ The main weakness is not graph traversal cost; it is retrieval precision under paraphrase, stale preference conflicts, and noisy distractors. Graph Full p95 latency is **5.144ms**, compared to **1.301ms** for the raw SQLite vector baseline. This is still low in absolute terms, but quality improvements are modest because the benchmark currently uses a local keyword embedder rather than a stronger semantic embedder.
20
+
21
+ ### 3. Does association graph improve recall quality?
22
+ Not in this run. 1MBrain Graph Full achieved evidence accuracy of **0.789**, the same as 1MBrain Vector Only, a **0%** relative improvement. Graph links provide no measurable recall benefit on this dataset.
23
+
24
+ ### 4. Does spreading activation improve multi-hop reasoning?
25
+ Not in this run. Multi-hop evidence accuracy is unchanged at **0.833** for both Vector Only and Graph Full, and some required supporting memories were still missed. The failure cases indicate that graph traversal needs better seed recall and/or query expansion to consistently reach the correct neighboring nodes.
26
+
27
+ ### 5. Does decay/refresh help prevent stale memory pollution?
28
+ Not convincingly in this run. Memory update evidence accuracy is **0.422**, but stale-memory failures are still present. This benchmark should be treated as evidence that explicit recency/conflict resolution needs more work before public claims about stale-memory handling.
29
+
30
+ ### 6. Is Memory Passport practically useful?
31
+ Not demonstrated in this run. Graph Full portability success rate is **0** on the focused portability cases, which does not support a claim of practical usefulness. The vector baseline has no portability capability and is expected to fail those operation checks.
32
+
33
+ ### 7. What is the tradeoff between quality, latency, and cost?
34
+ - **Quality:** All three local providers tie on evidence accuracy (**0.789**) and Recall@5 (**0.906**) in this run, so graph-enabled 1MBrain shows no quality advantage.
35
+ - **Latency:** The SQLite vector-only baseline is the fastest; Graph Full adds roughly **3.842ms** of p95 latency over it (5.144ms versus 1.301ms) in this small dataset.
36
+ - **Cost:** Since 1MBrain can run fully locally (SQLite + local embedder/Ollama), the running query cost is **$0.00** per 1,000 queries in API fees, whereas cloud API vendors charge usage fees (not measured here).
37
+
38
+ ### 8. What should be improved before public release?
39
+ - Replace or complement the keyword embedder with a stronger local semantic embedder for paraphrase-heavy questions.
40
+ - Add explicit recency/conflict ranking so newer preferences reliably beat stale memories.
41
+ - Improve seed recall and query expansion before spreading activation so graph traversal starts from the right nodes.
42
+ - Keep failure-case reporting in the public benchmark so claims remain reproducible and falsifiable.
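The p95 latency column summarizes per-query timings with a percentile. One common way to compute it is the nearest-rank method, sketched below in TypeScript. This is an assumption about the method: the report does not say which percentile definition the runner uses, and the function name `percentile` is hypothetical.

```typescript
// Nearest-rank percentile; an illustrative sketch, not the runner's code.

function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new RangeError("no samples");
  }
  if (p <= 0 || p > 100) {
    throw new RangeError("p must be in (0, 100]");
  }
  // Sort a copy ascending so the caller's array is left untouched.
  const sorted = [...samples].sort((a, b) => a - b);
  // 1-based rank of the smallest value covering at least p% of the samples.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Made-up latencies of 1ms through 20ms, one per query.
const latenciesMs = Array.from({ length: 20 }, (_, i) => i + 1);
const p95 = percentile(latenciesMs, 95);
console.log(`p95 = ${p95}ms`);
```

With only 40 to 60 cases per dataset, a p95 value is determined by the two or three slowest queries, so differences of a millisecond or two between runs should be read with care.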
@@ -0,0 +1,42 @@
1
+ # 1MBrain Benchmark Final Report
2
+
3
+ This report evaluates the performance of **1MBrain** against standard vector-only baselines and other providers using the `memory-bench-adversarial` dataset (60 cases).
4
+
5
+ ## Performance Leaderboard
6
+
7
+ | Provider | Evidence Accuracy | Recall@5 | MRR | p95 Latency | Ingestion Rate |
8
+ |---|---:|---:|---:|---:|---:|
9
+ | 1MBrain Graph Full | 0.789 | 0.906 | 0.731 | 2.264ms | 4.355ms/case |
10
+ | 1MBrain Vector Only | 0.789 | 0.906 | 0.732 | 1.854ms | 4.179ms/case |
11
+ | Vector Baseline (SQLite) | 0.789 | 0.906 | 0.732 | 0.492ms | 1.099ms/case |
12
+
13
+ ## Key Evaluation Questions
14
+
15
+ ### 1. Where does 1MBrain outperform typical vector-only memory?
16
+ 1MBrain Graph Full matches the Vector Baseline in evidence retrieval accuracy on this focused dataset (a **0%** difference). Multi-hop evidence accuracy is also tied at **0.833** for Graph Full and Vector Only, so this run shows no measurable graph-aware advantage.
17
+
18
+ ### 2. Where does 1MBrain underperform?
19
+ The main weakness is not graph traversal cost; it is retrieval precision under paraphrase, stale preference conflicts, and noisy distractors. Graph Full p95 latency is **2.264ms**, compared to **0.492ms** for the raw SQLite vector baseline. This is still low in absolute terms, but quality improvements are modest because the benchmark currently uses a local keyword embedder rather than a stronger semantic embedder.
20
+
21
+ ### 3. Does association graph improve recall quality?
22
+ Not in this run. 1MBrain Graph Full achieved evidence accuracy of **0.789**, the same as 1MBrain Vector Only, a **0%** relative improvement. Graph links provide no measurable recall benefit on this dataset.
23
+
24
+ ### 4. Does spreading activation improve multi-hop reasoning?
25
+ Not in this run. Multi-hop evidence accuracy is unchanged at **0.833** for both Vector Only and Graph Full, and some required supporting memories were still missed. The failure cases indicate that graph traversal needs better seed recall and/or query expansion to consistently reach the correct neighboring nodes.
26
+
27
+ ### 5. Does decay/refresh help prevent stale memory pollution?
28
+ Not convincingly in this run. Memory update evidence accuracy is **0.422**, but stale-memory failures are still present. This benchmark should be treated as evidence that explicit recency/conflict resolution needs more work before public claims about stale-memory handling.
29
+
30
+ ### 6. Is Memory Passport practically useful?
31
+ Not demonstrated in this run. Graph Full portability success rate is **0** on the focused portability cases, which does not support a claim of practical usefulness. The vector baseline has no portability capability and is expected to fail those operation checks.
32
+
33
+ ### 7. What is the tradeoff between quality, latency, and cost?
34
+ - **Quality:** All three local providers tie on evidence accuracy (**0.789**) and Recall@5 (**0.906**) in this run, so graph-enabled 1MBrain shows no quality advantage.
35
+ - **Latency:** The SQLite vector-only baseline is the fastest; Graph Full adds roughly **1.772ms** of p95 latency over it (2.264ms versus 0.492ms) in this small dataset.
36
+ - **Cost:** Since 1MBrain can run fully locally (SQLite + local embedder/Ollama), the running query cost is **$0.00** per 1,000 queries in API fees, whereas cloud API vendors charge usage fees (not measured here).
37
+
38
+ ### 8. What should be improved before public release?
39
+ - Replace or complement the keyword embedder with a stronger local semantic embedder for paraphrase-heavy questions.
40
+ - Add explicit recency/conflict ranking so newer preferences reliably beat stale memories.
41
+ - Improve seed recall and query expansion before spreading activation so graph traversal starts from the right nodes.
42
+ - Keep failure-case reporting in the public benchmark so claims remain reproducible and falsifiable.
@@ -0,0 +1,42 @@
1
+ # 1MBrain Benchmark Final Report
2
+
3
+ This report evaluates the performance of **1MBrain** against standard vector-only baselines and other providers using the `memory-bench-adversarial` dataset (60 cases).
4
+
5
+ ## Performance Leaderboard
6
+
7
+ | Provider | Evidence Accuracy | Recall@5 | MRR | p95 Latency | Ingestion Rate |
8
+ |---|---:|---:|---:|---:|---:|
9
+ | 1MBrain Graph Full | 0.756 | 0.872 | 0.734 | 2.795ms | 4.394ms/case |
10
+ | 1MBrain Vector Only | 0.789 | 0.906 | 0.732 | 1.99ms | 4.169ms/case |
11
+ | Vector Baseline (SQLite) | 0.789 | 0.906 | 0.732 | 0.502ms | 1.092ms/case |
12
+
13
+ ## Key Evaluation Questions
14
+
15
+ ### 1. Where does 1MBrain outperform typical vector-only memory?
16
+ 1MBrain Graph Full underperforms the Vector Baseline by **4.225%** (relative) in evidence retrieval accuracy on this focused dataset (**0.756** versus **0.789**). Multi-hop evidence accuracy is tied at **0.833** for Graph Full and Vector Only, so this run shows no measurable graph-aware advantage.
17
+
18
+ ### 2. Where does 1MBrain underperform?
19
+ The main weakness is not graph traversal cost; it is retrieval precision under paraphrase, stale preference conflicts, and noisy distractors. Graph Full p95 latency is **2.795ms**, compared to **0.502ms** for the raw SQLite vector baseline. This is still low in absolute terms, but the graph configuration yields no quality improvement in this run. The benchmark currently uses a local keyword embedder rather than a stronger semantic embedder.
20
+
21
+ ### 3. Does association graph improve recall quality?
22
+ Not in this run. 1MBrain Graph Full achieved evidence accuracy of **0.756** compared to **0.789** for 1MBrain Vector Only, a **4.225%** relative decrease. On this dataset the graph layer slightly reduces recall quality rather than improving it.
23
+
24
+ ### 4. Does spreading activation improve multi-hop reasoning?
25
+ Not in this run. Multi-hop evidence accuracy is unchanged at **0.833** for both Vector Only and Graph Full, and some required supporting memories were still missed. The failure cases indicate that graph traversal needs better seed recall and/or query expansion to consistently reach the correct neighboring nodes.
26
+
27
+ ### 5. Does decay/refresh help prevent stale memory pollution?
28
+ Not convincingly in this run. Memory update evidence accuracy is **0.422**, but stale-memory failures are still present. This benchmark should be treated as evidence that explicit recency/conflict resolution needs more work before public claims about stale-memory handling.
29
+
30
+ ### 6. Is Memory Passport practically useful?
31
+ Not demonstrated in this run. Graph Full portability success rate is **0** on the focused portability cases, which does not support a claim of practical usefulness. The vector baseline has no portability capability and is expected to fail those operation checks.
32
+
33
+ ### 7. What is the tradeoff between quality, latency, and cost?
34
+ - **Quality:** Graph-enabled 1MBrain is not the best local provider in this run: Graph Full evidence accuracy is **0.756** versus **0.789** for both Vector Only and the SQLite baseline.
35
+ - **Latency:** The SQLite vector-only baseline is the fastest; Graph Full adds roughly **2.293ms** of p95 latency over it (2.795ms versus 0.502ms) in this small dataset.
36
+ - **Cost:** Since 1MBrain can run fully locally (SQLite + local embedder/Ollama), the running query cost is **$0.00** per 1,000 queries in API fees, whereas cloud API vendors charge usage fees (not measured here).
37
+
38
+ ### 8. What should be improved before public release?
39
+ - Replace or complement the keyword embedder with a stronger local semantic embedder for paraphrase-heavy questions.
40
+ - Add explicit recency/conflict ranking so newer preferences reliably beat stale memories.
41
+ - Improve seed recall and query expansion before spreading activation so graph traversal starts from the right nodes.
42
+ - Keep failure-case reporting in the public benchmark so claims remain reproducible and falsifiable.
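The report above phrases a negative delta as "outperforms ... by -4.225%", which suggests the summary text comes from a fixed template. A report generator could choose its wording from the sign of the delta instead. The TypeScript sketch below shows one way to do that, assuming a hypothetical `describeComparison` helper that is not part of the package.

```typescript
// Hypothetical sign-aware phrasing helper; not taken from the package source.

function describeComparison(candidate: number, baseline: number): string {
  if (baseline === 0) {
    return "cannot be compared against a zero baseline";
  }
  const pct = ((candidate - baseline) / baseline) * 100;
  // Treat differences below the reported precision (3 decimals) as a tie.
  if (Math.abs(pct) < 0.0005) {
    return "matches the baseline";
  }
  const magnitude = Math.abs(pct).toFixed(3);
  return pct > 0
    ? `outperforms the baseline by ${magnitude}%`
    : `underperforms the baseline by ${magnitude}%`;
}

// Rounded leaderboard values from the adversarial reports.
console.log(describeComparison(0.789, 0.789)); // tie case
console.log(describeComparison(0.756, 0.789)); // Graph Full below baseline
console.log(describeComparison(1, 0.322)); // Graph Full above baseline
```

Because the inputs here are the rounded leaderboard values, the printed magnitudes can differ in the last digits from the percentages quoted in the reports.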
@@ -0,0 +1,42 @@
1
+ # 1MBrain Benchmark Final Report
2
+
3
+ This report evaluates the performance of **1MBrain** against standard vector-only baselines and other providers using the `memory-bench-adversarial` dataset (60 cases).
4
+
5
+ ## Performance Leaderboard
6
+
7
+ | Provider | Evidence Accuracy | Recall@5 | MRR | p95 Latency | Ingestion Rate |
8
+ |---|---:|---:|---:|---:|---:|
9
+ | 1MBrain Graph Full | 0.686 | 0.819 | 0.602 | 11.214ms | 10.703ms/case |
10
+ | 1MBrain Vector Only | 0.789 | 0.906 | 0.732 | 6.429ms | 11.327ms/case |
11
+ | Vector Baseline (SQLite) | 0.789 | 0.906 | 0.732 | 1.531ms | 2.597ms/case |
12
+
13
+ ## Key Evaluation Questions
14
+
15
+ ### 1. Where does 1MBrain outperform typical vector-only memory?
16
+ 1MBrain Graph Full underperforms the Vector Baseline by **13.028%** (relative) in evidence retrieval accuracy on this focused dataset (**0.686** versus **0.789**). The one measurable advantage is in graph-aware scenarios: multi-hop evidence accuracy is **0.917** for Graph Full versus **0.833** for Vector Only.
17
+
18
+ ### 2. Where does 1MBrain underperform?
19
+ The main weakness is not graph traversal cost; it is retrieval precision under paraphrase, stale preference conflicts, and noisy distractors. Graph Full p95 latency is **11.214ms**, compared to **1.531ms** for the raw SQLite vector baseline. This is still low in absolute terms, but the graph configuration yields no overall quality improvement in this run. The benchmark currently uses a local keyword embedder rather than a stronger semantic embedder.
20
+
21
+ ### 3. Does association graph improve recall quality?
22
+ Not overall in this run. 1MBrain Graph Full achieved evidence accuracy of **0.686** compared to **0.789** for 1MBrain Vector Only, a **13.028%** relative decrease. Graph links help the multi-hop cases but reduce overall recall quality on this dataset.
23
+
24
+ ### 4. Does spreading activation improve multi-hop reasoning?
25
+ Yes, with caveats. Multi-hop evidence accuracy improved from **0.833** to **0.917**, but some required supporting memories were still missed. The failure cases indicate that graph traversal needs better seed recall and/or query expansion to consistently reach the correct neighboring nodes.
26
+
27
+ ### 5. Does decay/refresh help prevent stale memory pollution?
28
+ Not convincingly in this run. Memory update evidence accuracy is **0.511**, but stale-memory failures are still present. This benchmark should be treated as evidence that explicit recency/conflict resolution needs more work before public claims about stale-memory handling.
29
+
30
+ ### 6. Is Memory Passport practically useful?
31
+ Not demonstrated in this run. Graph Full portability success rate is **0** on the focused portability cases, which does not support a claim of practical usefulness. The vector baseline has no portability capability and is expected to fail those operation checks.
32
+
33
+ ### 7. What is the tradeoff between quality, latency, and cost?
34
+ - **Quality:** Graph-enabled 1MBrain is not the best local provider in this run: Graph Full evidence accuracy is **0.686** versus **0.789** for both Vector Only and the SQLite baseline.
35
+ - **Latency:** The SQLite vector-only baseline is the fastest; Graph Full adds roughly **9.684ms** of p95 latency over it (11.214ms versus 1.531ms) in this small dataset.
36
+ - **Cost:** Since 1MBrain can run fully locally (SQLite + local embedder/Ollama), the running query cost is **$0.00** per 1,000 queries in API fees, whereas cloud API vendors charge usage fees (not measured here).
37
+
38
+ ### 8. What should be improved before public release?
39
+ - Replace or complement the keyword embedder with a stronger local semantic embedder for paraphrase-heavy questions.
40
+ - Add explicit recency/conflict ranking so newer preferences reliably beat stale memories.
41
+ - Improve seed recall and query expansion before spreading activation so graph traversal starts from the right nodes.
42
+ - Keep failure-case reporting in the public benchmark so claims remain reproducible and falsifiable.
@@ -0,0 +1,42 @@
1
+ # 1MBrain Benchmark Final Report
2
+
3
+ This report evaluates the performance of **1MBrain** against standard vector-only baselines and other providers using the `memory-bench-balanced-mini` dataset (40 cases).
4
+
5
+ ## Performance Leaderboard
6
+
7
+ | Provider | Evidence Accuracy | Recall@5 | MRR | p95 Latency | Ingestion Rate |
8
+ |---|---:|---:|---:|---:|---:|
9
+ | 1MBrain Graph Full | 0.802 | 0.902 | 0.711 | 2.347ms | 3.656ms/case |
10
+ | 1MBrain Vector Only | 0.802 | 0.902 | 0.711 | 1.406ms | 3.395ms/case |
11
+ | Vector Baseline (SQLite) | 0.802 | 0.902 | 0.711 | 0.694ms | 0.883ms/case |
12
+
13
+ ## Key Evaluation Questions
14
+
15
+ ### 1. Where does 1MBrain outperform typical vector-only memory?
16
+ 1MBrain Graph Full matches the Vector Baseline in evidence retrieval accuracy on this focused dataset (a **0%** difference). Multi-hop evidence accuracy is also tied at **0.646** for Graph Full and Vector Only, so this run shows no measurable graph-aware advantage.
17
+
18
+ ### 2. Where does 1MBrain underperform?
19
+ The main weakness is not graph traversal cost; it is retrieval precision under paraphrase, stale preference conflicts, and noisy distractors. Graph Full p95 latency is **2.347ms**, compared to **0.694ms** for the raw SQLite vector baseline. This is still low in absolute terms, but quality improvements are modest because the benchmark currently uses a local keyword embedder rather than a stronger semantic embedder.
20
+
21
+ ### 3. Does association graph improve recall quality?
22
+ Not in this run. 1MBrain Graph Full achieved evidence accuracy of **0.802**, the same as 1MBrain Vector Only, a **0%** relative improvement. Graph links provide no measurable recall benefit on this dataset.
23
+
24
+ ### 4. Does spreading activation improve multi-hop reasoning?
25
+ Not in this run. Multi-hop evidence accuracy is unchanged at **0.646** for both Vector Only and Graph Full, and some required supporting memories were still missed. The failure cases indicate that graph traversal needs better seed recall and/or query expansion to consistently reach the correct neighboring nodes.
26
+
27
+ ### 5. Does decay/refresh help prevent stale memory pollution?
28
+ Not convincingly in this run. Memory update evidence accuracy is **0.65**, but stale-memory failures are still present. This benchmark should be treated as evidence that explicit recency/conflict resolution needs more work before public claims about stale-memory handling.
29
+
30
+ ### 6. Is Memory Passport practically useful?
31
+ Not demonstrated in this run. Graph Full portability success rate is **0** on the portability cases, so these results give no evidence that Memory Passport works for the 1MBrain adapters tested here. The vector baseline has no portability capability and is expected to fail those operation checks.
32
+
33
+ ### 7. What is the tradeoff between quality, latency, and cost?
34
+ - **Quality:** Graph-enabled 1MBrain ties 1MBrain Vector Only and the Vector Baseline on evidence accuracy (**0.802**) in this run, so it shows no quality margin.
35
+ - **Latency:** The SQLite vector-only baseline is the fastest; 1MBrain Graph Full adds roughly **1.654ms** of p95 latency over it on this small dataset.
36
+ - **Cost:** Because 1MBrain can run fully locally (SQLite + local embedder/Ollama), it incurs no per-query API fees (**$0.00** per 1,000 queries, excluding hardware), unlike cloud-hosted APIs that typically charge per request.
37
+
38
+ ### 8. What should be improved before public release?
39
+ - Replace or complement the keyword embedder with a stronger local semantic embedder for paraphrase-heavy questions.
40
+ - Add explicit recency/conflict ranking so newer preferences reliably beat stale memories.
41
+ - Improve seed recall and query expansion before spreading activation so graph traversal starts from the right nodes.
42
+ - Keep failure-case reporting in the public benchmark so claims remain reproducible and falsifiable.
@@ -0,0 +1,42 @@
1
+ # 1MBrain Benchmark Final Report
2
+
3
+ This report evaluates the performance of **1MBrain** against standard vector-only baselines and other providers using the `memory-bench-balanced-mini` dataset (40 cases).
4
+
5
+ ## Performance Leaderboard
6
+
7
+ | Provider | Evidence Accuracy | Recall@5 | MRR | p95 Latency | Ingestion Rate |
8
+ |---|---:|---:|---:|---:|---:|
9
+ | 1MBrain Graph Full | 0.802 | 0.902 | 0.682 | 4.102ms | 4.766ms/case |
10
+ | 1MBrain Vector Only | 0.802 | 0.902 | 0.711 | 3.356ms | 6.928ms/case |
11
+ | Vector Baseline (SQLite) | 0.802 | 0.902 | 0.711 | 1.171ms | 2.004ms/case |
12
+
13
+ ## Key Evaluation Questions
14
+
15
+ ### 1. Where does 1MBrain outperform typical vector-only memory?
16
+ 1MBrain Graph Full does not outperform the Vector Baseline on evidence retrieval accuracy in this run: the relative difference is **0%**. No multi-hop advantage is measurable either: multi-hop evidence accuracy is **0.646** for both Graph Full and Vector Only.
17
+
18
+ ### 2. Where does 1MBrain underperform?
19
+ The main weakness is not graph traversal cost; it is retrieval precision under paraphrase, stale preference conflicts, and noisy distractors. Graph Full p95 latency is **4.102ms**, compared to **1.171ms** for the raw SQLite vector baseline. This is still low in absolute terms, but quality improvements are modest because the benchmark currently uses a local keyword embedder rather than a stronger semantic embedder.
20
+
21
+ ### 3. Does association graph improve recall quality?
22
+ Not in this run. 1MBrain Graph Full achieved evidence accuracy of **0.802**, identical to **0.802** for 1MBrain Vector Only (a **0%** relative improvement). The graph links did not change evidence accuracy here, so this run does not support a claim that the graph layer improves recall quality.
23
+
24
+ ### 4. Does spreading activation improve multi-hop reasoning?
25
+ Not in this run. Multi-hop evidence accuracy was **0.646** both without and with spreading activation, and some required supporting memories were still missed. The failure cases indicate that graph traversal needs better seed recall and/or query expansion to consistently reach the correct neighboring nodes.
26
+
27
+ ### 5. Does decay/refresh help prevent stale memory pollution?
28
+ Not convincingly in this run. Memory update evidence accuracy is **0.65**, but stale-memory failures are still present. This benchmark should be treated as evidence that explicit recency/conflict resolution needs more work before public claims about stale-memory handling.
29
+
30
+ ### 6. Is Memory Passport practically useful?
31
+ Not demonstrated in this run. Graph Full portability success rate is **0** on the portability cases, so these results give no evidence that Memory Passport works for the 1MBrain adapters tested here. The vector baseline has no portability capability and is expected to fail those operation checks.
32
+
33
+ ### 7. What is the tradeoff between quality, latency, and cost?
34
+ - **Quality:** Graph-enabled 1MBrain ties the vector-only providers on evidence accuracy (**0.802**) and Recall@5 (**0.902**) in this run, and its MRR (**0.682**) trails theirs (**0.711**).
35
+ - **Latency:** The SQLite vector-only baseline is the fastest; 1MBrain Graph Full adds roughly **2.931ms** of p95 latency over it on this small dataset.
36
+ - **Cost:** Because 1MBrain can run fully locally (SQLite + local embedder/Ollama), it incurs no per-query API fees (**$0.00** per 1,000 queries, excluding hardware), unlike cloud-hosted APIs that typically charge per request.
37
+
38
+ ### 8. What should be improved before public release?
39
+ - Replace or complement the keyword embedder with a stronger local semantic embedder for paraphrase-heavy questions.
40
+ - Add explicit recency/conflict ranking so newer preferences reliably beat stale memories.
41
+ - Improve seed recall and query expansion before spreading activation so graph traversal starts from the right nodes.
42
+ - Keep failure-case reporting in the public benchmark so claims remain reproducible and falsifiable.
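The p95 latency gap quoted in question 7 is the difference between two leaderboard entries. The sketch below shows one way such figures could be derived from raw per-query timings. It assumes a nearest-rank percentile; the report does not say which percentile method the benchmark uses, and the function names `percentile` and `p95Overhead` are hypothetical.

```javascript
// Nearest-rank percentile over latency samples in milliseconds.
// Assumption: the benchmark's actual percentile method is not documented in the report.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to avoid floating-point drift in the rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Overhead of one provider's p95 over another's, rounded to 3 decimals.
function p95Overhead(providerP95Ms, baselineP95Ms) {
  return Number((providerP95Ms - baselineP95Ms).toFixed(3));
}

// Graph Full and SQLite baseline p95 values from the leaderboard above.
const overheadMs = p95Overhead(4.102, 1.171);
console.log(overheadMs); // 2.931
```

Note that this gap compares the full 1MBrain stack with the raw SQLite baseline. Subtracting the 1MBrain Vector Only p95 instead would isolate the cost of the graph layer itself.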
@@ -0,0 +1,42 @@
1
+ # 1MBrain Benchmark Final Report
2
+
3
+ This report evaluates the performance of **1MBrain** against standard vector-only baselines and other providers using the `memory-bench-balanced-mini` dataset (40 cases).
4
+
5
+ ## Performance Leaderboard
6
+
7
+ | Provider | Evidence Accuracy | Recall@5 | MRR | p95 Latency | Ingestion Rate |
8
+ |---|---:|---:|---:|---:|---:|
9
+ | 1MBrain Graph Full | 0.802 | 0.902 | 0.682 | 1.852ms | 3.095ms/case |
10
+ | 1MBrain Vector Only | 0.802 | 0.902 | 0.711 | 1.638ms | 3.058ms/case |
11
+ | Vector Baseline (SQLite) | 0.802 | 0.902 | 0.711 | 0.434ms | 0.896ms/case |
12
+
13
+ ## Key Evaluation Questions
14
+
15
+ ### 1. Where does 1MBrain outperform typical vector-only memory?
16
+ 1MBrain Graph Full does not outperform the Vector Baseline on evidence retrieval accuracy in this run: the relative difference is **0%**. No multi-hop advantage is measurable either: multi-hop evidence accuracy is **0.646** for both Graph Full and Vector Only.
17
+
18
+ ### 2. Where does 1MBrain underperform?
19
+ The main weakness is not graph traversal cost; it is retrieval precision under paraphrase, stale preference conflicts, and noisy distractors. Graph Full p95 latency is **1.852ms**, compared to **0.434ms** for the raw SQLite vector baseline. This is still low in absolute terms, but quality improvements are modest because the benchmark currently uses a local keyword embedder rather than a stronger semantic embedder.
20
+
21
+ ### 3. Does association graph improve recall quality?
22
+ Not in this run. 1MBrain Graph Full achieved evidence accuracy of **0.802**, identical to **0.802** for 1MBrain Vector Only (a **0%** relative improvement). The graph links did not change evidence accuracy here, so this run does not support a claim that the graph layer improves recall quality.
23
+
24
+ ### 4. Does spreading activation improve multi-hop reasoning?
25
+ Not in this run. Multi-hop evidence accuracy was **0.646** both without and with spreading activation, and some required supporting memories were still missed. The failure cases indicate that graph traversal needs better seed recall and/or query expansion to consistently reach the correct neighboring nodes.
26
+
27
+ ### 5. Does decay/refresh help prevent stale memory pollution?
28
+ Not convincingly in this run. Memory update evidence accuracy is **0.65**, but stale-memory failures are still present. This benchmark should be treated as evidence that explicit recency/conflict resolution needs more work before public claims about stale-memory handling.
29
+
30
+ ### 6. Is Memory Passport practically useful?
31
+ Not demonstrated in this run. Graph Full portability success rate is **0** on the portability cases, so these results give no evidence that Memory Passport works for the 1MBrain adapters tested here. The vector baseline has no portability capability and is expected to fail those operation checks.
32
+
33
+ ### 7. What is the tradeoff between quality, latency, and cost?
34
+ - **Quality:** Graph-enabled 1MBrain ties the vector-only providers on evidence accuracy (**0.802**) and Recall@5 (**0.902**) in this run, and its MRR (**0.682**) trails theirs (**0.711**).
35
+ - **Latency:** The SQLite vector-only baseline is the fastest; 1MBrain Graph Full adds roughly **1.418ms** of p95 latency over it on this small dataset.
36
+ - **Cost:** Because 1MBrain can run fully locally (SQLite + local embedder/Ollama), it incurs no per-query API fees (**$0.00** per 1,000 queries, excluding hardware), unlike cloud-hosted APIs that typically charge per request.
37
+
38
+ ### 8. What should be improved before public release?
39
+ - Replace or complement the keyword embedder with a stronger local semantic embedder for paraphrase-heavy questions.
40
+ - Add explicit recency/conflict ranking so newer preferences reliably beat stale memories.
41
+ - Improve seed recall and query expansion before spreading activation so graph traversal starts from the right nodes.
42
+ - Keep failure-case reporting in the public benchmark so claims remain reproducible and falsifiable.
@@ -0,0 +1,42 @@
1
+ # 1MBrain Benchmark Final Report
2
+
3
+ This report evaluates the performance of **1MBrain** against standard vector-only baselines and other providers using the `memory-bench-balanced-mini` dataset (40 cases).
4
+
5
+ ## Performance Leaderboard
6
+
7
+ | Provider | Evidence Accuracy | Recall@5 | MRR | p95 Latency | Ingestion Rate |
8
+ |---|---:|---:|---:|---:|---:|
9
+ | 1MBrain Graph Full | 0.821 | 0.921 | 0.682 | 2.078ms | 3.249ms/case |
10
+ | 1MBrain Vector Only | 0.802 | 0.902 | 0.711 | 1.392ms | 2.865ms/case |
11
+ | Vector Baseline (SQLite) | 0.802 | 0.902 | 0.711 | 0.448ms | 0.881ms/case |
12
+
13
+ ## Key Evaluation Questions
14
+
15
+ ### 1. Where does 1MBrain outperform typical vector-only memory?
16
+ 1MBrain Graph Full outperforms the Vector Baseline by **2.338%** (relative) in evidence retrieval accuracy on this dataset. The clearest measurable advantage is in graph-aware scenarios: multi-hop evidence accuracy is **0.833** for Graph Full versus **0.646** for Vector Only.
17
+
18
+ ### 2. Where does 1MBrain underperform?
19
+ The main weakness is not graph traversal cost; it is retrieval precision under paraphrase, stale preference conflicts, and noisy distractors. Graph Full p95 latency is **2.078ms**, compared to **0.448ms** for the raw SQLite vector baseline. This is still low in absolute terms, but quality improvements are modest because the benchmark currently uses a local keyword embedder rather than a stronger semantic embedder.
20
+
21
+ ### 3. Does association graph improve recall quality?
22
+ Partially. 1MBrain Graph Full achieved evidence accuracy of **0.821** compared to **0.802** for 1MBrain Vector Only, a **2.338%** relative improvement. This shows graph links help, but the improvement is not yet large enough to claim the graph layer alone solves recall quality.
23
+
24
+ ### 4. Does spreading activation improve multi-hop reasoning?
25
+ Yes, with caveats. Multi-hop evidence accuracy improved from **0.646** to **0.833**, but some required supporting memories were still missed. The failure cases indicate that graph traversal needs better seed recall and/or query expansion to consistently reach the correct neighboring nodes.
26
+
27
+ ### 5. Does decay/refresh help prevent stale memory pollution?
28
+ Not convincingly in this run. Memory update evidence accuracy is **0.65**, but stale-memory failures are still present. This benchmark should be treated as evidence that explicit recency/conflict resolution needs more work before public claims about stale-memory handling.
29
+
30
+ ### 6. Is Memory Passport practically useful?
31
+ Not demonstrated in this run. Graph Full portability success rate is **0** on the portability cases, so these results give no evidence that Memory Passport works for the 1MBrain adapters tested here. The vector baseline has no portability capability and is expected to fail those operation checks.
32
+
33
+ ### 7. What is the tradeoff between quality, latency, and cost?
34
+ - **Quality:** Graph-enabled 1MBrain leads on evidence accuracy (**0.821** vs **0.802**) and Recall@5 (**0.921** vs **0.902**) in this run, but only by a modest margin, and its MRR (**0.682**) trails the vector-only providers (**0.711**).
35
+ - **Latency:** The SQLite vector-only baseline is the fastest; 1MBrain Graph Full adds roughly **1.63ms** of p95 latency over it on this small dataset.
36
+ - **Cost:** Because 1MBrain can run fully locally (SQLite + local embedder/Ollama), it incurs no per-query API fees (**$0.00** per 1,000 queries, excluding hardware), unlike cloud-hosted APIs that typically charge per request.
37
+
38
+ ### 8. What should be improved before public release?
39
+ - Replace or complement the keyword embedder with a stronger local semantic embedder for paraphrase-heavy questions.
40
+ - Add explicit recency/conflict ranking so newer preferences reliably beat stale memories.
41
+ - Improve seed recall and query expansion before spreading activation so graph traversal starts from the right nodes.
42
+ - Keep failure-case reporting in the public benchmark so claims remain reproducible and falsifiable.
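The relative improvement quoted in questions 1 and 3 can be reproduced approximately from the leaderboard. The sketch below assumes the usual definition, (candidate - baseline) / baseline; the report does not state its formula, and `relativeImprovementPct` is a hypothetical helper name.

```javascript
// Relative improvement of a candidate score over a baseline score, in percent.
// Assumption: the report uses (candidate - baseline) / baseline; it does not say so explicitly.
function relativeImprovementPct(candidate, baseline) {
  if (baseline === 0) throw new Error("baseline must be non-zero");
  return ((candidate - baseline) / baseline) * 100;
}

// Evidence accuracy for Graph Full and Vector Only, as rounded in the leaderboard.
const pct = relativeImprovementPct(0.821, 0.802);
console.log(pct.toFixed(3));
```

With the rounded leaderboard values this gives about 2.37%, slightly above the reported **2.338%**. One possible explanation is that the report computes the figure from unrounded scores, but the source does not confirm this.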
@@ -0,0 +1,42 @@
1
+ # 1MBrain Benchmark Final Report
2
+
3
+ This report evaluates the performance of **1MBrain** against standard vector-only baselines and other providers using the `memory-bench-balanced-mini` dataset (40 cases).
4
+
5
+ ## Performance Leaderboard
6
+
7
+ | Provider | Evidence Accuracy | Recall@5 | MRR | p95 Latency | Ingestion Rate |
8
+ |---|---:|---:|---:|---:|---:|
9
+ | 1MBrain Graph Full | 0.788 | 0.888 | 0.641 | 8.7ms | 9.007ms/case |
10
+ | 1MBrain Vector Only | 0.802 | 0.902 | 0.711 | 5.1ms | 9.156ms/case |
11
+ | Vector Baseline (SQLite) | 0.802 | 0.902 | 0.711 | 1.545ms | 2.322ms/case |
12
+
13
+ ## Key Evaluation Questions
14
+
15
+ ### 1. Where does 1MBrain outperform typical vector-only memory?
16
+ Overall it does not: 1MBrain Graph Full trails the Vector Baseline by **1.818%** (relative) in evidence retrieval accuracy on this dataset. Its clear advantage is in graph-aware scenarios: multi-hop evidence accuracy is **1** for Graph Full versus **0.646** for Vector Only.
17
+
18
+ ### 2. Where does 1MBrain underperform?
19
+ The main weakness is not graph traversal cost; it is retrieval precision under paraphrase, stale preference conflicts, and noisy distractors. Graph Full p95 latency is **8.7ms**, compared to **1.545ms** for the raw SQLite vector baseline. This is still low in absolute terms, but quality improvements are modest because the benchmark currently uses a local keyword embedder rather than a stronger semantic embedder.
20
+
21
+ ### 3. Does association graph improve recall quality?
22
+ Not overall. 1MBrain Graph Full achieved evidence accuracy of **0.788** compared to **0.802** for 1MBrain Vector Only, a **1.818%** relative decrease. Graph links help on multi-hop cases in this run but cost accuracy elsewhere, so the graph layer alone does not solve recall quality.
23
+
24
+ ### 4. Does spreading activation improve multi-hop reasoning?
25
+ Yes, with caveats. Multi-hop evidence accuracy improved from **0.646** to **1**, but some required supporting memories were still missed. The failure cases indicate that graph traversal needs better seed recall and/or query expansion to consistently reach the correct neighboring nodes.
26
+
27
+ ### 5. Does decay/refresh help prevent stale memory pollution?
28
+ Not convincingly in this run. Memory update evidence accuracy is **0.8**, but stale-memory failures are still present. This benchmark should be treated as evidence that explicit recency/conflict resolution needs more work before public claims about stale-memory handling.
29
+
30
+ ### 6. Is Memory Passport practically useful?
31
+ Not demonstrated in this run. Graph Full portability success rate is **0** on the portability cases, so these results give no evidence that Memory Passport works for the 1MBrain adapters tested here. The vector baseline has no portability capability and is expected to fail those operation checks.
32
+
33
+ ### 7. What is the tradeoff between quality, latency, and cost?
34
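+ - **Quality:** Graph-enabled 1MBrain is not the best local provider overall in this run: its evidence accuracy (**0.788**), Recall@5 (**0.888**), and MRR (**0.641**) all trail the vector-only providers (**0.802**, **0.902**, **0.711**), although it leads on multi-hop cases.
35
+ - **Latency:** The SQLite vector-only baseline is the fastest; 1MBrain Graph Full adds roughly **7.155ms** of p95 latency over it on this small dataset.
36
+ - **Cost:** Because 1MBrain can run fully locally (SQLite + local embedder/Ollama), it incurs no per-query API fees (**$0.00** per 1,000 queries, excluding hardware), unlike cloud-hosted APIs that typically charge per request.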
37
+
38
+ ### 8. What should be improved before public release?
39
+ - Replace or complement the keyword embedder with a stronger local semantic embedder for paraphrase-heavy questions.
40
+ - Add explicit recency/conflict ranking so newer preferences reliably beat stale memories.
41
+ - Improve seed recall and query expansion before spreading activation so graph traversal starts from the right nodes.
42
+ - Keep failure-case reporting in the public benchmark so claims remain reproducible and falsifiable.
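The leaderboard reports Recall@5 and MRR without defining them. The sketch below uses the standard textbook definitions as an assumption: Recall@k is the fraction of expected memories found in the top k results, and MRR is the mean reciprocal rank of the first relevant result. The benchmark's own scoring code may differ, and the case data here is hypothetical, not taken from the fixtures.

```javascript
// Fraction of the expected memory ids that appear in the top-k results.
function recallAtK(rankedIds, relevantIds, k) {
  if (relevantIds.length === 0) return 0;
  const topK = new Set(rankedIds.slice(0, k));
  const hits = relevantIds.filter((id) => topK.has(id)).length;
  return hits / relevantIds.length;
}

// Reciprocal rank of the first relevant result, or 0 if none is retrieved.
function reciprocalRank(rankedIds, relevantIds) {
  const relevant = new Set(relevantIds);
  const index = rankedIds.findIndex((id) => relevant.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

// Arithmetic mean; returns 0 for an empty list.
function mean(values) {
  return values.length === 0 ? 0 : values.reduce((a, b) => a + b, 0) / values.length;
}

// Hypothetical cases for illustration only.
const cases = [
  { ranked: ["m3", "m1", "m7"], relevant: ["m1"] },
  { ranked: ["m2", "m9", "m4"], relevant: ["m2", "m4"] },
];
const mrr = mean(cases.map((c) => reciprocalRank(c.ranked, c.relevant)));
const recall5 = mean(cases.map((c) => recallAtK(c.ranked, c.relevant, 5)));
console.log(mrr, recall5); // 0.75 1
```

Under these definitions a provider can match another on Recall@5 while scoring lower on MRR, which happens when the right memory is retrieved but ranked further down the list.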
@@ -0,0 +1,42 @@
1
+ # 1MBrain Benchmark Final Report
2
+
3
+ This report evaluates the performance of **1MBrain** against standard vector-only baselines and other providers using the `memory-bench-realistic-medium` dataset (120 cases).
4
+
5
+ ## Performance Leaderboard
6
+
7
+ | Provider | Evidence Accuracy | Recall@5 | MRR | p95 Latency | Ingestion Rate |
8
+ |---|---:|---:|---:|---:|---:|
9
+ | 1MBrain Graph Full | 0.619 | 0.757 | 0.557 | 2.823ms | 5.678ms/case |
10
+ | 1MBrain Vector Only | 0.619 | 0.757 | 0.557 | 4.417ms | 8.086ms/case |
11
+ | Vector Baseline (SQLite) | 0.619 | 0.757 | 0.557 | 1.587ms | 3.331ms/case |
12
+
13
+ ## Key Evaluation Questions
14
+
15
+ ### 1. Where does 1MBrain outperform typical vector-only memory?
16
+ 1MBrain Graph Full does not outperform the Vector Baseline on evidence retrieval accuracy in this run: the relative difference is **0%**. No multi-hop advantage is measurable either: multi-hop evidence accuracy is **0.806** for both Graph Full and Vector Only.
17
+
18
+ ### 2. Where does 1MBrain underperform?
19
+ The main weakness is not graph traversal cost; it is retrieval precision under paraphrase, stale preference conflicts, and noisy distractors. Graph Full p95 latency is **2.823ms**, compared to **1.587ms** for the raw SQLite vector baseline. This is still low in absolute terms, but quality improvements are modest because the benchmark currently uses a local keyword embedder rather than a stronger semantic embedder.
20
+
21
+ ### 3. Does association graph improve recall quality?
22
+ Not in this run. 1MBrain Graph Full achieved evidence accuracy of **0.619**, identical to **0.619** for 1MBrain Vector Only (a **0%** relative improvement). The graph links did not change evidence accuracy here, so this run does not support a claim that the graph layer improves recall quality.
23
+
24
+ ### 4. Does spreading activation improve multi-hop reasoning?
25
+ Not in this run. Multi-hop evidence accuracy was **0.806** both without and with spreading activation, and some required supporting memories were still missed. The failure cases indicate that graph traversal needs better seed recall and/or query expansion to consistently reach the correct neighboring nodes.
26
+
27
+ ### 5. Does decay/refresh help prevent stale memory pollution?
28
+ Not convincingly in this run. Memory update evidence accuracy is **0.2**, but stale-memory failures are still present. This benchmark should be treated as evidence that explicit recency/conflict resolution needs more work before public claims about stale-memory handling.
29
+
30
+ ### 6. Is Memory Passport practically useful?
31
+ Not demonstrated in this run. Graph Full portability success rate is **0** on the portability cases, so these results give no evidence that Memory Passport works for the 1MBrain adapters tested here. The vector baseline has no portability capability and is expected to fail those operation checks.
32
+
33
+ ### 7. What is the tradeoff between quality, latency, and cost?
34
+ - **Quality:** Graph-enabled 1MBrain ties the vector-only providers on evidence accuracy (**0.619**), Recall@5 (**0.757**), and MRR (**0.557**) in this run, so it shows no quality margin.
35
+ - **Latency:** The SQLite vector-only baseline is the fastest; 1MBrain Graph Full adds roughly **1.236ms** of p95 latency over it on this dataset.
36
+ - **Cost:** Because 1MBrain can run fully locally (SQLite + local embedder/Ollama), it incurs no per-query API fees (**$0.00** per 1,000 queries, excluding hardware), unlike cloud-hosted APIs that typically charge per request.
37
+
38
+ ### 8. What should be improved before public release?
39
+ - Replace or complement the keyword embedder with a stronger local semantic embedder for paraphrase-heavy questions.
40
+ - Add explicit recency/conflict ranking so newer preferences reliably beat stale memories.
41
+ - Improve seed recall and query expansion before spreading activation so graph traversal starts from the right nodes.
42
+ - Keep failure-case reporting in the public benchmark so claims remain reproducible and falsifiable.
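The report attributes multi-hop misses to weak seed recall before spreading activation. The sketch below is a generic spreading-activation pass, not 1MBrain's implementation: the graph shape, the `decay` and `maxHops` parameters, and the function name `spreadActivation` are all assumptions made for illustration. It shows why a neighboring memory is only reachable when the traversal starts from the right seed.

```javascript
// Generic spreading activation over a weighted, directed association graph.
// graph: Map<nodeId, Array<{ to: string, weight: number }>>
// seeds: array of [nodeId, initialScore] pairs (e.g. from vector search).
// All names and defaults here are hypothetical, not 1MBrain's API.
function spreadActivation(graph, seeds, { decay = 0.5, maxHops = 2 } = {}) {
  const activation = new Map(seeds);
  let frontier = new Map(seeds);
  for (let hop = 0; hop < maxHops; hop++) {
    const next = new Map();
    for (const [node, score] of frontier) {
      for (const { to, weight } of graph.get(node) ?? []) {
        // Each hop passes on a decayed share of the source activation.
        next.set(to, (next.get(to) ?? 0) + score * weight * decay);
      }
    }
    for (const [node, score] of next) {
      activation.set(node, (activation.get(node) ?? 0) + score);
    }
    frontier = next;
  }
  // Highest activation first.
  return [...activation.entries()].sort((a, b) => b[1] - a[1]);
}

const graph = new Map([
  ["a", [{ to: "b", weight: 1 }]],
  ["b", [{ to: "c", weight: 1 }]],
]);
const fromGoodSeed = spreadActivation(graph, [["a", 1]]);
const fromBadSeed = spreadActivation(graph, [["x", 1]]);
console.log(fromGoodSeed); // [["a", 1], ["b", 0.5], ["c", 0.25]]
console.log(fromBadSeed);  // [["x", 1]]
```

Starting from seed `a`, the two-hop memory `c` is reached. Starting from an unrelated seed `x`, nothing is reached, which is the failure pattern the report describes.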
@@ -0,0 +1,42 @@
1
+ # 1MBrain Benchmark Final Report
2
+
3
+ This report evaluates the performance of **1MBrain** against standard vector-only baselines and other providers using the `memory-bench-realistic-medium` dataset (120 cases).
4
+
5
+ ## Performance Leaderboard
6
+
7
+ | Provider | Evidence Accuracy | Recall@5 | MRR | p95 Latency | Ingestion Rate |
8
+ |---|---:|---:|---:|---:|---:|
9
+ | 1MBrain Graph Full | 0.628 | 0.782 | 0.576 | 5.115ms | 10.215ms/case |
10
+ | 1MBrain Vector Only | 0.619 | 0.757 | 0.557 | 4.559ms | 10.847ms/case |
11
+ | Vector Baseline (SQLite) | 0.619 | 0.757 | 0.557 | 1.459ms | 2.908ms/case |
12
+
13
+ ## Key Evaluation Questions
14
+
15
+ ### 1. Where does 1MBrain outperform typical vector-only memory?
16
+ 1MBrain Graph Full outperforms the Vector Baseline by **1.345%** (relative) in evidence retrieval accuracy on this dataset. The gain does not come from multi-hop cases: multi-hop evidence accuracy is **0.806** for both Graph Full and Vector Only.
17
+
18
+ ### 2. Where does 1MBrain underperform?
19
+ The main weakness is not graph traversal cost; it is retrieval precision under paraphrase, stale preference conflicts, and noisy distractors. Graph Full p95 latency is **5.115ms**, compared to **1.459ms** for the raw SQLite vector baseline. This is still low in absolute terms, but quality improvements are modest because the benchmark currently uses a local keyword embedder rather than a stronger semantic embedder.
20
+
21
+ ### 3. Does association graph improve recall quality?
22
+ Partially. 1MBrain Graph Full achieved evidence accuracy of **0.628** compared to **0.619** for 1MBrain Vector Only, a **1.345%** relative improvement. This shows graph links help, but the improvement is not yet large enough to claim the graph layer alone solves recall quality.
23
+
24
+ ### 4. Does spreading activation improve multi-hop reasoning?
25
+ Not in this run. Multi-hop evidence accuracy was **0.806** both without and with spreading activation, and some required supporting memories were still missed. The failure cases indicate that graph traversal needs better seed recall and/or query expansion to consistently reach the correct neighboring nodes.
26
+
27
+ ### 5. Does decay/refresh help prevent stale memory pollution?
28
+ Not convincingly in this run. Memory update evidence accuracy is **0.2**, but stale-memory failures are still present. This benchmark should be treated as evidence that explicit recency/conflict resolution needs more work before public claims about stale-memory handling.
29
+
30
+ ### 6. Is Memory Passport practically useful?
31
+ Not demonstrated in this run. Graph Full portability success rate is **0** on the portability cases, so these results give no evidence that Memory Passport works for the 1MBrain adapters tested here. The vector baseline has no portability capability and is expected to fail those operation checks.
32
+
33
+ ### 7. What is the tradeoff between quality, latency, and cost?
34
+ - **Quality:** Graph-enabled 1MBrain is the best local provider in this run on evidence accuracy (**0.628** vs **0.619**), Recall@5 (**0.782** vs **0.757**), and MRR (**0.576** vs **0.557**), but only by a modest margin.
35
+ - **Latency:** The SQLite vector-only baseline is the fastest; 1MBrain Graph Full adds roughly **3.656ms** of p95 latency over it on this dataset.
36
+ - **Cost:** Because 1MBrain can run fully locally (SQLite + local embedder/Ollama), it incurs no per-query API fees (**$0.00** per 1,000 queries, excluding hardware), unlike cloud-hosted APIs that typically charge per request.
37
+
38
+ ### 8. What should be improved before public release?
39
+ - Replace or complement the keyword embedder with a stronger local semantic embedder for paraphrase-heavy questions.
40
+ - Add explicit recency/conflict ranking so newer preferences reliably beat stale memories.
41
+ - Improve seed recall and query expansion before spreading activation so graph traversal starts from the right nodes.
42
+ - Keep failure-case reporting in the public benchmark so claims remain reproducible and falsifiable.
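The recommendation to add explicit recency/conflict ranking can be illustrated with a small sketch. It assumes each retrieved memory carries a `subject` key, an `updatedAt` timestamp, and a retrieval `score`. These field names and the `resolveConflicts` helper are hypothetical and do not describe 1MBrain's data model.

```javascript
// Keep only the newest memory per subject, then rank survivors by retrieval score.
// Assumption: conflicting memories share a `subject` key; real conflict detection
// would likely need something more robust than an exact key match.
function resolveConflicts(candidates) {
  const newestBySubject = new Map();
  for (const candidate of candidates) {
    const current = newestBySubject.get(candidate.subject);
    if (!current || candidate.updatedAt > current.updatedAt) {
      newestBySubject.set(candidate.subject, candidate);
    }
  }
  return [...newestBySubject.values()].sort((a, b) => b.score - a.score);
}

// Hypothetical retrieval results: the stale preference scores higher than the new one.
const retrieved = [
  { id: "m1", subject: "editor", text: "prefers vim", updatedAt: 1, score: 0.9 },
  { id: "m2", subject: "editor", text: "prefers vscode", updatedAt: 2, score: 0.7 },
  { id: "m3", subject: "timezone", text: "lives in UTC+1", updatedAt: 1, score: 0.8 },
];
const ranked = resolveConflicts(retrieved);
console.log(ranked.map((m) => m.id)); // ["m3", "m2"]
```

Here the stale `m1` is dropped even though it has the highest retrieval score, so the newer preference `m2` is the only answer returned for its subject.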