@remnic/bench 1.0.1 → 9.3.517
- package/README.md +103 -0
- package/baselines/procedural-recall-baseline.json +198 -0
- package/dist/index.d.ts +1556 -151
- package/dist/index.js +24292 -6309
- package/package.json +13 -6
package/README.md
CHANGED
@@ -20,11 +20,33 @@ The CLI loads `@remnic/bench` via a computed-specifier dynamic import. If it's n
 
 - **Benchmark runners** for a growing set of memory-oriented evals: `longmemeval`, `locomo`, `memory-arena`, `amemgym`, `ama-bench`, plus a lightweight smoke fixture.
 - **Stored-run management** — every `remnic bench run *` writes a timestamped JSON result under `~/.remnic/bench/results/`; `remnic bench runs list|show|delete` let you browse, inspect, and prune.
+- **Reproducibility manifests** — package-backed runs write `MANIFEST.json` beside the result files, locking result hashes, dataset file hashes, seeds, runtime profiles, command argv with secret values redacted, selected environment keys, git state, QMD collections, and config-file hashes.
 - **Baselines + regression gates** — save a run as a named baseline, compare candidates against it, gate CI on threshold violations.
 - **Result export** — `remnic bench export <run> --format json|csv|html`.
 - **Published feed** — `remnic bench publish --target remnic-ai` builds the tamper-evident integrity manifest consumed by remnic.ai.
 - **Provider discovery** — `remnic bench providers discover` enumerates local OpenAI / Anthropic / Ollama / LiteLLM providers for adapter wiring.
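Review note: the new "Reproducibility manifests" bullet says the manifest records command argv with secret values redacted. The diff shown here does not include that logic, so the following is only a minimal sketch of what such redaction could look like. The flag-name pattern, the `[REDACTED]` placeholder, and the name `redactArgv` are all hypothetical and are not taken from `@remnic/bench`.

```typescript
// Hypothetical sketch: NOT the redaction logic shipped in @remnic/bench.
// Assumption: a flag is "secret-bearing" when its name contains one of these words.
const SECRET_FLAG = /(key|token|secret|password)/i;
const REDACTED = "[REDACTED]";

function isLongFlag(arg: string): boolean {
  return arg.indexOf("--") === 0;
}

function redactArgv(argv: readonly string[]): string[] {
  const out: string[] = [];
  let maskNext = false;
  for (const arg of argv) {
    if (maskNext) {
      // The previous flag was secret-bearing and took its value as a separate argument.
      out.push(REDACTED);
      maskNext = false;
      continue;
    }
    const eq = arg.indexOf("=");
    if (isLongFlag(arg) && eq !== -1) {
      // Form: --flag=value
      const name = arg.slice(0, eq);
      out.push(SECRET_FLAG.test(name) ? name + "=" + REDACTED : arg);
    } else if (isLongFlag(arg) && SECRET_FLAG.test(arg)) {
      // Form: --flag value (the following argument is masked)
      out.push(arg);
      maskNext = true;
    } else {
      out.push(arg);
    }
  }
  return out;
}
```

A sketch like this masks by flag name only, so a secret passed positionally or through a boolean-style flag would need separate handling.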
 
+## Memory eval dimensions
+
+Agent memory without evals is vibes with a database.
+
+`@remnic/bench` exports `MEMORY_EVAL_DIMENSIONS` as Remnic's shared eval
+contract for user-aware agents. It covers:
+
+- repeated-context reduction
+- unnecessary-clarification reduction
+- retrieval correctness
+- stale-memory harm
+- scope respect
+- ask-when-needed decisions
+- act-when-enough-context decisions
+- personalization quality
+
+Each dimension maps to existing quick-capable benchmark ids. Use
+`listMemoryEvalBenchmarkIds()` when wiring CI coverage, and use the per-dimension
+`fullModeGuidance` strings when designing publishable eval claims. See
+[`docs/memory-evals.md`](../../docs/memory-evals.md) for the full map.
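Review note: the README describes each dimension as mapping to benchmark ids and carrying a `fullModeGuidance` string, but the shape of `MEMORY_EVAL_DIMENSIONS` is not visible in this part of the diff. The sketch below uses a local stand-in type to show how a CI coverage check over such a map might work. The field names `id` and `benchmarkIds`, the helper names, and the dimension-to-benchmark mapping are assumptions for illustration only.

```typescript
// Stand-in type: the real MEMORY_EVAL_DIMENSIONS entries may be shaped differently.
interface MemoryEvalDimension {
  id: string;               // assumed field name
  benchmarkIds: string[];   // assumed field name
  fullModeGuidance: string; // named in the README
}

// Placeholder mapping for illustration; docs/memory-evals.md holds the real map.
const dimensions: MemoryEvalDimension[] = [
  { id: "retrieval-correctness", benchmarkIds: ["longmemeval", "locomo"], fullModeGuidance: "..." },
  { id: "stale-memory-harm", benchmarkIds: ["memory-arena"], fullModeGuidance: "..." },
];

// Deduplicated, sorted union of benchmark ids across all dimensions.
function listBenchmarkIds(dims: readonly MemoryEvalDimension[]): string[] {
  const ids: string[] = [];
  for (const dim of dims) {
    for (const id of dim.benchmarkIds) {
      if (ids.indexOf(id) === -1) ids.push(id);
    }
  }
  return ids.sort();
}

// Ids of dimensions for which CI ran none of the mapped benchmarks.
function uncoveredDimensions(dims: readonly MemoryEvalDimension[], ran: readonly string[]): string[] {
  return dims
    .filter((dim) => !dim.benchmarkIds.some((id) => ran.indexOf(id) !== -1))
    .map((dim) => dim.id);
}
```

A CI job could fail when `uncoveredDimensions(...)` is non-empty, which would make a dropped benchmark visible as a lost dimension rather than only as a missing run.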
+
 ## CLI quick reference
 
 ```bash
@@ -44,6 +66,9 @@ remnic bench run --quick longmemeval
 remnic bench runs list
 remnic bench runs show <run-id> --detail
 
+# Inspect the reproducibility lock for the last run set:
+jq . ~/.remnic/bench/results/MANIFEST.json
+
 # Compare two runs:
 remnic bench compare base-run candidate-run
 
@@ -61,6 +86,83 @@ remnic bench publish --target remnic-ai
 
 Dataset markers match the runner's accepted filenames, so `datasets status` reports "downloaded" exactly when the runner will load successfully.
 
+## Running on real datasets
+
+The `longmemeval` and `locomo` runners ship with a bundled smoke fixture so
+`remnic bench run --quick` and CI stay green without downloading anything.
+To produce public-quality numbers you need the real datasets. Both live on
+HuggingFace.
+
+```bash
+# Print the exact download commands (no auto-fetch):
+scripts/bench/fetch-datasets.sh --help
+scripts/bench/fetch-datasets.sh --target ./bench-datasets
+```
+
+Expected layout (the `bench-datasets/` directory is gitignored):
+
+```
+bench-datasets/
+  longmemeval/
+    longmemeval_oracle.json # preferred filename
+    longmemeval_s_cleaned.json # optional alternate
+    longmemeval_s.json # optional alternate
+  locomo/
+    locomo10.json # preferred filename
+    locomo.json # optional alternate
+```
+
+Point the runners at the directory. Use the current `remnic bench run`
+CLI surface with `--dataset-dir` (a dedicated `remnic bench published`
+subcommand with user-configurable `--limit`, `--model`, and `--seed` is
+planned for a later slice of
+[#566](https://github.com/joshuaswarren/remnic/issues/566)):
+
+```bash
+pnpm exec remnic bench run longmemeval \
+  --dataset-dir ./bench-datasets/longmemeval
+
+pnpm exec remnic bench run locomo \
+  --dataset-dir ./bench-datasets/locomo
+```
+
+Programmatic loaders are exported from `@remnic/bench`:
+
+```ts
+import { loadLongMemEvalS, loadLoCoMo10 } from "@remnic/bench";
+
+const longmemeval = await loadLongMemEvalS({
+  mode: "full",
+  datasetDir: "./bench-datasets/longmemeval",
+  limit: 100,
+});
+// longmemeval.source === "dataset" when the real file was found,
+// "smoke" when quick-mode fallback was used, "missing" when full-mode
+// could not find any of the canonical filenames.
+```
+
+When `mode: "full"` and no dataset is found, the loaders return
+`{ source: "missing", errors }` and the runner throws a
+`formatMissingDatasetError()` message pointing operators at
+`scripts/bench/fetch-datasets.sh`. Quick mode silently falls back to the
+bundled smoke fixture and logs the probe errors so you can tell why.
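Review note: the fallback rules in this README section (probe the accepted filenames, return `"dataset"` on a hit, otherwise `"smoke"` in quick mode or `"missing"` in full mode) can be summarized in a small sketch. This is not the loader implementation from `@remnic/bench`. The function name `resolveDataset`, the injected `exists` callback, the `"quick"` mode literal, and the error strings are assumptions chosen to keep the example free of filesystem access.

```typescript
type DatasetSource = "dataset" | "smoke" | "missing";
type Mode = "quick" | "full"; // "full" appears in the README; "quick" is an assumed literal

interface ProbeResult {
  source: DatasetSource;
  file?: string;
  errors: string[];
}

// Accepted LongMemEval filenames from the README layout, preferred name first.
const LONGMEMEVAL_FILES = [
  "longmemeval_oracle.json",
  "longmemeval_s_cleaned.json",
  "longmemeval_s.json",
];

// `exists` is injected so the sketch is testable; a real loader would probe the disk.
function resolveDataset(
  mode: Mode,
  datasetDir: string,
  candidates: readonly string[],
  exists: (path: string) => boolean,
): ProbeResult {
  const errors: string[] = [];
  for (const name of candidates) {
    const path = datasetDir + "/" + name;
    if (exists(path)) return { source: "dataset", file: path, errors };
    errors.push("not found: " + path);
  }
  // No candidate matched: quick mode falls back to the smoke fixture,
  // full mode reports "missing" so the caller can raise an error.
  return { source: mode === "quick" ? "smoke" : "missing", errors };
}
```

Returning the probe errors in both fallback cases mirrors the README's statement that quick mode logs why the real dataset was not used.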
+
+## CI regression gate (smoke fixtures)
+
+`.github/workflows/bench-smoke.yml` runs `scripts/bench/bench-smoke.ts`
+on every PR. The script exercises the LongMemEval + LoCoMo runners
+against their bundled smoke fixtures with a fixed seed and a
+deterministic in-memory adapter (no real datasets, no LLM calls, no
+network). Metrics are compared to the committed baseline at
+`tests/fixtures/bench-smoke/baseline.json`; any drop greater than 5%
+fails the job.
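Review note: the README does not say whether "any drop greater than 5%" is a relative or an absolute drop, and `scripts/bench/bench-smoke.ts` is not part of this diff. The sketch below shows one plausible reading: a relative drop on higher-is-better metrics. The `Metrics` shape, the function name `findRegressions`, and the handling of missing metrics are assumptions.

```typescript
// Assumptions: metrics are higher-is-better, and "5%" means a relative drop vs. baseline.
type Metrics = { [name: string]: number };

interface Violation {
  metric: string;
  baseline: number;
  candidate: number;
  drop: number; // relative drop, e.g. 0.0625 for a 6.25% regression
}

function findRegressions(baseline: Metrics, candidate: Metrics, maxDrop: number = 0.05): Violation[] {
  const violations: Violation[] = [];
  for (const metric of Object.keys(baseline)) {
    const base = baseline[metric];
    const cand = candidate[metric];
    if (cand === undefined) {
      // A metric missing from the candidate run is treated as a full regression.
      violations.push({ metric, baseline: base, candidate: NaN, drop: 1 });
      continue;
    }
    // Guard against division by zero when the baseline value is 0.
    const drop = base === 0 ? 0 : (base - cand) / base;
    if (drop > maxDrop) violations.push({ metric, baseline: base, candidate: cand, drop });
  }
  return violations;
}
```

Under this reading, a baseline accuracy of 0.8 and a candidate of 0.75 is a 6.25% relative drop and would fail the gate, while 0.5 to 0.49 (2%) would pass.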
+
+Regenerate the baseline after an intentional runner change:
+
+```bash
+pnpm exec tsx scripts/bench/bench-smoke.ts --update-baseline
+```
+
 ## Programmatic API
 
 ```ts
@@ -68,6 +170,7 @@ import {
   listBenchmarks,
   runBenchmark,
   writeBenchmarkResult,
+  writeBenchmarkReproManifest,
   createLightweightAdapter,
   createRemnicAdapter,
   compareResults,
package/baselines/procedural-recall-baseline.json
@@ -0,0 +1,198 @@
+{
+  "schemaVersion": 1,
+  "fixture": {
+    "path": null,
+    "scenarioCount": 20
+  },
+  "onScore": 0.75,
+  "offScore": 0.25,
+  "lift": 0.5,
+  "confidenceInterval": {
+    "lower": 0.3,
+    "upper": 0.75,
+    "level": 0.95
+  },
+  "perCase": [
+    {
+      "id": "rerun-deploy-gateway",
+      "prompt": "Let's deploy the gateway service to production now.",
+      "expectMatch": true,
+      "onMatched": true,
+      "offMatched": false,
+      "onScore": 1,
+      "offScore": 0
+    },
+    {
+      "id": "rerun-open-pr",
+      "prompt": "Open a pull request for the regression fix against main.",
+      "expectMatch": true,
+      "onMatched": true,
+      "offMatched": false,
+      "onScore": 1,
+      "offScore": 0
+    },
+    {
+      "id": "rerun-run-tests",
+      "prompt": "Run the test suite before we merge the release branch.",
+      "expectMatch": true,
+      "onMatched": true,
+      "offMatched": false,
+      "onScore": 1,
+      "offScore": 0
+    },
+    {
+      "id": "rerun-rotate-credentials",
+      "prompt": "Rotate the staging database credentials right now.",
+      "expectMatch": true,
+      "onMatched": false,
+      "offMatched": false,
+      "onScore": 0,
+      "offScore": 0
+    },
+    {
+      "id": "rerun-ship-release",
+      "prompt": "We need to ship the v9 release tonight.",
+      "expectMatch": true,
+      "onMatched": true,
+      "offMatched": false,
+      "onScore": 1,
+      "offScore": 0
+    },
+    {
+      "id": "paramvar-deploy-api",
+      "prompt": "Let's deploy the API service to production today.",
+      "expectMatch": true,
+      "onMatched": true,
+      "offMatched": false,
+      "onScore": 1,
+      "offScore": 0
+    },
+    {
+      "id": "paramvar-rollback-ticket",
+      "prompt": "Roll back ticket PROJ-912 before the standup tomorrow.",
+      "expectMatch": true,
+      "onMatched": false,
+      "offMatched": false,
+      "onScore": 0,
+      "offScore": 0
+    },
+    {
+      "id": "paramvar-rotate-prod",
+      "prompt": "Rotate the production database credentials this morning.",
+      "expectMatch": true,
+      "onMatched": false,
+      "offMatched": false,
+      "onScore": 0,
+      "offScore": 0
+    },
+    {
+      "id": "paramvar-start-branch",
+      "prompt": "Starting work on the billing branch feature.",
+      "expectMatch": true,
+      "onMatched": true,
+      "offMatched": false,
+      "onScore": 1,
+      "offScore": 0
+    },
+    {
+      "id": "paramvar-merge-pr-after-ci",
+      "prompt": "Merge the pull request after CI turns green.",
+      "expectMatch": true,
+      "onMatched": true,
+      "offMatched": false,
+      "onScore": 1,
+      "offScore": 0
+    },
+    {
+      "id": "decomp-incident-response",
+      "prompt": "Run the incident response playbook for the gateway outage.",
+      "expectMatch": true,
+      "onMatched": false,
+      "offMatched": false,
+      "onScore": 0,
+      "offScore": 0
+    },
+    {
+      "id": "decomp-release-cut",
+      "prompt": "Cut the weekly release branch and start the release checklist.",
+      "expectMatch": true,
+      "onMatched": true,
+      "offMatched": false,
+      "onScore": 1,
+      "offScore": 0
+    },
+    {
+      "id": "decomp-onboarding",
+      "prompt": "Start the onboarding workflow for a new engineer joining the team.",
+      "expectMatch": true,
+      "onMatched": true,
+      "offMatched": false,
+      "onScore": 1,
+      "offScore": 0
+    },
+    {
+      "id": "decomp-data-migration",
+      "prompt": "Begin the schema migration plan for the billing table.",
+      "expectMatch": true,
+      "onMatched": false,
+      "offMatched": false,
+      "onScore": 0,
+      "offScore": 0
+    },
+    {
+      "id": "decomp-runbook-certificate",
+      "prompt": "Start the certificate renewal runbook for the public edge.",
+      "expectMatch": true,
+      "onMatched": true,
+      "offMatched": false,
+      "onScore": 1,
+      "offScore": 0
+    },
+    {
+      "id": "distract-explain-question",
+      "prompt": "How does hybrid retrieval combine BM25 and vector search?",
+      "expectMatch": false,
+      "onMatched": false,
+      "offMatched": false,
+      "onScore": 1,
+      "offScore": 1
+    },
+    {
+      "id": "distract-past-tense-recap",
+      "prompt": "What did we decide about the database rotation last week?",
+      "expectMatch": false,
+      "onMatched": false,
+      "offMatched": false,
+      "onScore": 1,
+      "offScore": 1
+    },
+    {
+      "id": "distract-summary-request",
+      "prompt": "Summarize the timeline of the gateway outage for the report.",
+      "expectMatch": false,
+      "onMatched": false,
+      "offMatched": false,
+      "onScore": 1,
+      "offScore": 1
+    },
+    {
+      "id": "distract-thanks",
+      "prompt": "Thanks, that summary helps a lot.",
+      "expectMatch": false,
+      "onMatched": false,
+      "offMatched": false,
+      "onScore": 1,
+      "offScore": 1
+    },
+    {
+      "id": "distract-off-topic-task",
+      "prompt": "Let's grab coffee before the standup tomorrow.",
+      "expectMatch": false,
+      "onMatched": false,
+      "offMatched": false,
+      "onScore": 1,
+      "offScore": 1
+    }
+  ],
+  "generatedAt": "baseline-v1"
+}
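Review note: the top-level `onScore`, `offScore`, and `lift` in this baseline are consistent with plain means over the 20 `perCase` entries, with `lift` as their difference. The sketch below reproduces that arithmetic from a condensed view of the cases. It is a consistency check, not the generator used by the package; the helper names are hypothetical, reading "on" and "off" as procedural recall enabled and disabled is an assumption, and the confidence interval is not reproduced because its method is not shown in the diff.

```typescript
interface CaseScore {
  onScore: number;
  offScore: number;
}

function mean(values: readonly number[]): number {
  let sum = 0;
  for (const v of values) sum += v;
  return values.length === 0 ? 0 : sum / values.length;
}

function summarize(cases: readonly CaseScore[]): { onScore: number; offScore: number; lift: number } {
  const onScore = mean(cases.map((c) => c.onScore));
  const offScore = mean(cases.map((c) => c.offScore));
  return { onScore, offScore, lift: onScore - offScore };
}

// Condensed view of the 20 perCase entries in the baseline file:
// 5 distractors score 1 in both arms, 10 procedural prompts score 1 only
// in the "on" arm, and 5 procedural prompts score 0 in both arms.
const cases: CaseScore[] = [];
for (let i = 0; i < 5; i++) cases.push({ onScore: 1, offScore: 1 });
for (let i = 0; i < 10; i++) cases.push({ onScore: 1, offScore: 0 });
for (let i = 0; i < 5; i++) cases.push({ onScore: 0, offScore: 0 });

const summary = summarize(cases);
// summary is { onScore: 0.75, offScore: 0.25, lift: 0.5 }, matching the file.
```

The `offScore` of 0.25 comes entirely from the five distractor cases, where not matching is the correct outcome in both arms.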