@1mbrain/benchmarks 0.1.1
- package/README.md +85 -0
- package/fixtures/1mbrain-focused-mini/1mbrain-focused-mini.json +928 -0
- package/fixtures/1mbrain-focused-mini/README.md +45 -0
- package/fixtures/adversarial-memory/dataset_claude_adversarial.json +3333 -0
- package/fixtures/adversarial-memory/dataset_gemini_adversarial_memory.json +2984 -0
- package/fixtures/balanced-mini/dataset_claude_balanced_mini.json +2077 -0
- package/fixtures/balanced-mini/dataset_gemini_balanced_mini.json +1995 -0
- package/fixtures/generate_datasets.js +1741 -0
- package/fixtures/graph-stress-hard/README.md +43 -0
- package/fixtures/graph-stress-hard/dataset_graph_stress_hard.json +4374 -0
- package/fixtures/graph-stress-hard/generate_graph_stress_hard.js +526 -0
- package/fixtures/realistic-medium/dataset_claude_realistic_medium.json +7462 -0
- package/fixtures/realistic-medium/dataset_gemini_realistic_medium.json +7277 -0
- package/fixtures/realistic-medium/gen_claude_medium.js +600 -0
- package/package.json +22 -0
- package/reports/benchmark_report.md +48 -0
- package/reports/benchmark_report_claude_adversarial.md +42 -0
- package/reports/benchmark_report_claude_adversarial_adaptive.md +42 -0
- package/reports/benchmark_report_claude_adversarial_adaptive2_fast.md +42 -0
- package/reports/benchmark_report_claude_adversarial_adaptive_fast.md +42 -0
- package/reports/benchmark_report_claude_adversarial_rerank.md +42 -0
- package/reports/benchmark_report_claude_balanced_mini.md +42 -0
- package/reports/benchmark_report_claude_balanced_mini_adaptive.md +42 -0
- package/reports/benchmark_report_claude_balanced_mini_adaptive2_fast.md +42 -0
- package/reports/benchmark_report_claude_balanced_mini_adaptive_fast.md +42 -0
- package/reports/benchmark_report_claude_balanced_mini_rerank.md +42 -0
- package/reports/benchmark_report_claude_realistic_medium.md +42 -0
- package/reports/benchmark_report_claude_realistic_medium_adaptive.md +42 -0
- package/reports/benchmark_report_claude_realistic_medium_adaptive2_fast.md +42 -0
- package/reports/benchmark_report_claude_realistic_medium_adaptive_fast.md +42 -0
- package/reports/benchmark_report_claude_realistic_medium_evidence_rerank_local.md +42 -0
- package/reports/benchmark_report_claude_realistic_medium_openai_evidence_rerank.md +41 -0
- package/reports/benchmark_report_claude_realistic_medium_openai_multi_signal.md +41 -0
- package/reports/benchmark_report_claude_realistic_medium_openai_multi_signal_scoped.md +41 -0
- package/reports/benchmark_report_claude_realistic_medium_openai_phase8_no_judge.md +42 -0
- package/reports/benchmark_report_claude_realistic_medium_openai_rankingpolicy.md +41 -0
- package/reports/benchmark_report_claude_realistic_medium_openai_stale_filter.md +41 -0
- package/reports/benchmark_report_claude_realistic_medium_openai_stale_filter_absence_fix.md +41 -0
- package/reports/benchmark_report_claude_realistic_medium_openai_write_time_invalidation.md +41 -0
- package/reports/benchmark_report_claude_realistic_medium_rerank.md +42 -0
- package/reports/benchmark_report_claude_realistic_medium_stale_filter_local.md +42 -0
- package/reports/benchmark_report_graph_stress_hard.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_absence_fix.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_adaptive.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_evidence_rerank.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_multi_signal_current_guardrail.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_multi_signal_guardrail_fixed.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_multi_signal_local.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_multi_signal_scoped_guardrail.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_multi_signal_vector_pure_guardrail.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_phase8_sdk_guardrail.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_rerank.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_stale_filter.md +42 -0
- package/reports/benchmark_report_graph_stress_hard_write_time_invalidation.md +42 -0
- package/results/.gitignore +2 -0
- package/src/adapters/1mbrain.ts +317 -0
- package/src/adapters/keyword-embedding.ts +48 -0
- package/src/adapters/mem0.ts +124 -0
- package/src/adapters/qdrant.ts +214 -0
- package/src/adapters/unavailable.ts +49 -0
- package/src/adapters/vector-baseline.ts +149 -0
- package/src/datasets/focused-mini.ts +158 -0
- package/src/datasets/synthetic-agent-memory.ts +532 -0
- package/src/llm-evaluator.ts +262 -0
- package/src/metrics.ts +482 -0
- package/src/provider.ts +151 -0
- package/src/runner.ts +635 -0
- package/tsconfig.json +10 -0
- package/tsconfig.tsbuildinfo +1 -0
@@ -0,0 +1,43 @@

# Graph Stress Hard Benchmark

This fixture is a diagnostic benchmark for graph-aware memory retrieval. It is
intentionally harder and less provider-neutral than the Claude balanced and
realistic fixtures.

## Purpose

Use this dataset to tune and compare:

- Multi-hop association recall.
- Conflict and current-state resolution.
- Graph traversal with weak lexical overlap.
- Near-entity distractor resistance.
- Abstention when only a similar entity has the requested fact.

## Shape

- 12 conversations.
- 144 memory records.
- 60 questions.
- Deterministic generation from `generate_graph_stress_hard.js`.

## Category Mix

- 24 `multi_hop_association`.
- 12 `contradiction_resolution`.
- 12 `graph_traversal`.
- 7 `entity_disambiguation`.
- 5 `abstention`.
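As a quick sanity check, the counts in the Shape and Category Mix lists can be cross-checked arithmetically. The sketch below is not part of the package; the variable names are illustrative, and only the numbers are taken from the lists above.

```typescript
// Category counts copied from the "Category Mix" list.
const categoryMix: Record<string, number> = {
  multi_hop_association: 24,
  contradiction_resolution: 12,
  graph_traversal: 12,
  entity_disambiguation: 7,
  abstention: 5,
};

// Shape figures copied from the "Shape" list.
const shape = { conversations: 12, memoryRecords: 144, questions: 60 };

// Sum of all category counts; expected to match the stated question total.
let totalQuestions = 0;
for (const name of Object.keys(categoryMix)) {
  totalQuestions += categoryMix[name] ?? 0;
}

// Share of the question set per category, as a percentage rounded to one decimal.
const categoryShare: Record<string, number> = {};
for (const name of Object.keys(categoryMix)) {
  const count = categoryMix[name] ?? 0;
  categoryShare[name] = Math.round((count / totalQuestions) * 1000) / 10;
}

// Average density per conversation (derived here, not stated in the README).
const recordsPerConversation = shape.memoryRecords / shape.conversations;
const questionsPerConversation = shape.questions / shape.conversations;

// True when the category mix accounts for every question in the fixture.
const mixMatchesShape = totalQuestions === shape.questions;
```

The five categories sum to 60, matching the stated question count, and `multi_hop_association` alone makes up 40% of the questions. The derived averages are 12 memory records and 5 questions per conversation; whether the generator actually distributes them evenly is not stated in the README.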
## Usage

```powershell
# Regenerate the fixture (generation is deterministic).
node packages/benchmarks/fixtures/graph-stress-hard/generate_graph_stress_hard.js
# Select the dataset and the providers to compare.
$env:BENCH_DATASET = "graph-stress-hard"
$env:BENCH_PROVIDERS = "1mbrain_graph_full,1mbrain_vector_only,vector_baseline"
# Run the benchmark runner.
node packages/benchmarks/dist/runner.js
```

Results from this fixture are best interpreted alongside provider-neutral
datasets. A graph-enabled provider should separate itself most clearly on the
`multi_hop_recall` and `memory_update` metrics.
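One way to check that separation is to rank metrics by the score gap between a graph-enabled provider and a baseline. The following is a minimal sketch under assumptions: the `MetricScores` shape, the `metricGaps` helper, and the `other_metric` name are hypothetical and do not reflect the runner's actual report format, and the numbers are placeholders rather than measured results.

```typescript
// Per-metric scores for one provider, keyed by metric name (assumed to lie in [0, 1]).
type MetricScores = Record<string, number>;

interface MetricGap {
  metric: string;
  gap: number; // candidate score minus baseline score
}

// Compute candidate-minus-baseline gaps for every metric both providers report,
// sorted so the largest advantage for the candidate comes first.
function metricGaps(candidate: MetricScores, baseline: MetricScores): MetricGap[] {
  const gaps: MetricGap[] = [];
  for (const metric of Object.keys(candidate)) {
    const cand = candidate[metric];
    const base = baseline[metric];
    // Skip metrics that either provider did not report.
    if (cand === undefined || base === undefined) continue;
    gaps.push({ metric, gap: cand - base });
  }
  return gaps.sort((a, b) => b.gap - a.gap);
}

// Placeholder numbers for illustration only; these are NOT measured results.
const graphFull: MetricScores = { multi_hop_recall: 0.75, memory_update: 0.5, other_metric: 0.5 };
const vectorBaseline: MetricScores = { multi_hop_recall: 0.25, memory_update: 0.25, other_metric: 0.5 };

// Metrics ordered by how strongly the graph provider separates from the baseline.
const ranked = metricGaps(graphFull, vectorBaseline);
```

If the expectation above holds for a real run, `multi_hop_recall` and `memory_update` would appear at the top of `ranked`, with the remaining metrics showing smaller gaps.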