@1mbrain/benchmarks 0.1.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +85 -0
- package/fixtures/1mbrain-focused-mini/1mbrain-focused-mini.json +928 -0
- package/fixtures/1mbrain-focused-mini/README.md +45 -0
- package/fixtures/adversarial-memory/dataset_claude_adversarial.json +3333 -0
- package/fixtures/adversarial-memory/dataset_gemini_adversarial_memory.json +2984 -0
- package/fixtures/balanced-mini/dataset_claude_balanced_mini.json +2077 -0
- package/fixtures/balanced-mini/dataset_gemini_balanced_mini.json +1995 -0
- package/fixtures/generate_datasets.js +1741 -0
- package/fixtures/graph-stress-hard/README.md +43 -0
- package/fixtures/graph-stress-hard/dataset_graph_stress_hard.json +4374 -0
- package/fixtures/graph-stress-hard/generate_graph_stress_hard.js +526 -0
- package/fixtures/realistic-medium/dataset_claude_realistic_medium.json +7462 -0
- package/fixtures/realistic-medium/dataset_gemini_realistic_medium.json +7277 -0
- package/fixtures/realistic-medium/gen_claude_medium.js +600 -0
- package/package.json +22 -0
- package/reports/benchmark_report.md +48 -0
- package/reports/benchmark_report_claude_adversarial.md +42 -0
- package/reports/benchmark_report_claude_adversarial_adaptive.md +42 -0
- package/reports/benchmark_report_claude_adversarial_adaptive2_fast.md +42 -0
- package/reports/benchmark_report_claude_adversarial_adaptive_fast.md +42 -0
- package/reports/benchmark_report_claude_adversarial_rerank.md +42 -0
- package/reports/benchmark_report_claude_balanced_mini.md +42 -0
- package/reports/benchmark_report_claude_balanced_mini_adaptive.md +42 -0
- package/reports/benchmark_report_claude_balanced_mini_adaptive2_fast.md +42 -0
- package/reports/benchmark_report_claude_balanced_mini_adaptive_fast.md +42 -0
- package/reports/benchmark_report_claude_balanced_mini_rerank.md +42 -0
- package/reports/benchmark_report_claude_realistic_medium.md +42 -0
- package/reports/benchmark_report_claude_realistic_medium_adaptive.md +42 -0
- package/reports/benchmark_report_claude_realistic_medium_adaptive2_fast.md +42 -0
- package/reports/benchmark_report_claude_realistic_medium_adaptive_fast.md +42 -0
- package/reports/benchmark_report_claude_realistic_medium_evidence_rerank_local.md +42 -0
- package/reports/benchmark_report_claude_realistic_medium_openai_evidence_rerank.md +41 -0
- package/reports/benchmark_report_claude_realistic_medium_openai_multi_signal.md +41 -0
- package/reports/benchmark_report_claude_realistic_medium_openai_multi_signal_scoped.md +41 -0
- package/reports/benchmark_report_claude_realistic_medium_openai_phase8_no_judge.md +42 -0
- package/reports/benchmark_report_claude_realistic_medium_openai_rankingpolicy.md +41 -0
- package/reports/benchmark_report_claude_realistic_medium_openai_stale_filter.md +41 -0
- package/reports/benchmark_report_claude_realistic_medium_openai_stale_filter_absence_fix.md +41 -0
- package/reports/benchmark_report_claude_realistic_medium_openai_write_time_invalidation.md +41 -0
- package/reports/benchmark_report_claude_realistic_medium_rerank.md +42 -0
- package/reports/benchmark_report_claude_realistic_medium_stale_filter_local.md +42 -0
- package/reports/benchmark_report_graph_stress_hard.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_absence_fix.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_adaptive.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_evidence_rerank.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_multi_signal_current_guardrail.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_multi_signal_guardrail_fixed.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_multi_signal_local.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_multi_signal_scoped_guardrail.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_multi_signal_vector_pure_guardrail.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_phase8_sdk_guardrail.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_rerank.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_stale_filter.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_write_time_invalidation.md +42 -0
- package/results/.gitignore +2 -0
- package/src/adapters/1mbrain.ts +317 -0
- package/src/adapters/keyword-embedding.ts +48 -0
- package/src/adapters/mem0.ts +124 -0
- package/src/adapters/qdrant.ts +214 -0
- package/src/adapters/unavailable.ts +49 -0
- package/src/adapters/vector-baseline.ts +149 -0
- package/src/datasets/focused-mini.ts +158 -0
- package/src/datasets/synthetic-agent-memory.ts +532 -0
- package/src/llm-evaluator.ts +262 -0
- package/src/metrics.ts +482 -0
- package/src/provider.ts +151 -0
- package/src/runner.ts +635 -0
- package/tsconfig.json +10 -0
- package/tsconfig.tsbuildinfo +1 -0
|
@@ -0,0 +1,45 @@
|
|
|
1
|
+
# 1MBrain Focused Mini Benchmark
|
|
2
|
+
|
|
3
|
+
This fixture is a small, GitHub-friendly benchmark dataset for evaluating memory-system
|
|
4
|
+
behavior before running large external suites such as LOCOMO.
|
|
5
|
+
|
|
6
|
+
It is designed to keep API usage low:
|
|
7
|
+
|
|
8
|
+
- 5 conversations
|
|
9
|
+
- 40 turns
|
|
10
|
+
- 41 memory records
|
|
11
|
+
- 23 questions
|
|
12
|
+
- Deterministic expected answers
|
|
13
|
+
- Explicit required and forbidden memory IDs
|
|
14
|
+
- No LLM judge required for retrieval metrics
|
|
15
|
+
|
|
16
|
+
## What It Tests
|
|
17
|
+
|
|
18
|
+
- Atomic fact recall
|
|
19
|
+
- Temporal updates and "latest wins" behavior
|
|
20
|
+
- Multi-hop association recall
|
|
21
|
+
- Contradiction resolution
|
|
22
|
+
- Paraphrase recall with weak keyword overlap
|
|
23
|
+
- Procedural memory
|
|
24
|
+
- Noise resistance
|
|
25
|
+
- Portability/export-import preservation
|
|
26
|
+
- Context injection suitability
|
|
27
|
+
|
|
28
|
+
## Recommended Metrics
|
|
29
|
+
|
|
30
|
+
- `Recall@5`
|
|
31
|
+
- `Recall@10`
|
|
32
|
+
- `MRR@10`
|
|
33
|
+
- `NDCG@10`
|
|
34
|
+
- `Forbidden hit rate`
|
|
35
|
+
- `Temporal correctness`
|
|
36
|
+
- `Multi-hop completeness`
|
|
37
|
+
- `Passport preservation rate`
|
|
38
|
+
|
|
39
|
+
## File
|
|
40
|
+
|
|
41
|
+
- `1mbrain-focused-mini.json`
|
|
42
|
+
|
|
43
|
+
The fixture stores natural conversation turns and a canonical `memory_records` list.
|
|
44
|
+
Benchmark runners should ingest `memory_records`, optionally create `associations`,
|
|
45
|
+
then run the listed questions.
|