@remnic/bench 1.0.0 → 9.3.515
- package/README.md +202 -0
- package/baselines/procedural-recall-baseline.json +198 -0
- package/dist/index.d.ts +1560 -152
- package/dist/index.js +28784 -10696
- package/package.json +13 -6
package/README.md
ADDED
@@ -0,0 +1,202 @@
# @remnic/bench

Benchmark suite and CI regression gates for [Remnic](https://github.com/joshuaswarren/remnic) memory pipelines. Ships the runners, adapters, and results store that the `remnic bench` CLI surface drives.

`@remnic/bench` is an **optional companion** to [`@remnic/cli`](https://www.npmjs.com/package/@remnic/cli). Install it only when you need to run benchmarks, compare runs, or publish results. Memory-only users do not need it.

## Install

```bash
# Alongside the CLI:
npm install -g @remnic/cli @remnic/bench

# Or in a project that drives benchmarks programmatically:
pnpm add @remnic/bench
```

The CLI loads `@remnic/bench` via a computed-specifier dynamic import. If it's not installed, `remnic bench *` prints a clear install hint; the rest of the CLI keeps working.
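
A minimal sketch of the computed-specifier pattern described above. The helper name `loadOptionalPackage`, the result shape, and the hint wording are assumptions for illustration; the CLI's actual loader may differ. Building the specifier at runtime keeps bundlers and static analysis from treating the package as a hard dependency, and the `catch` branch turns a missing install into a hint instead of a crash.

```ts
// Hypothetical helper, not the actual @remnic/cli implementation.
type OptionalLoad<T> =
  | { kind: "loaded"; module: T }
  | { kind: "missing"; hint: string };

async function loadOptionalPackage<T = unknown>(
  scope: string,
  name: string,
): Promise<OptionalLoad<T>> {
  // Computed at runtime so the import cannot be resolved eagerly.
  const specifier = [scope, name].join("/");
  try {
    const module = (await import(specifier)) as T;
    return { kind: "loaded", module };
  } catch {
    // Simplification: every import failure is treated as "not installed".
    // A production loader would likely inspect the error and rethrow
    // anything that is not a missing-module error.
    return {
      kind: "missing",
      hint: `${specifier} is not installed. Try: npm install -g ${specifier}`,
    };
  }
}

// Only the bench subcommand touches the loader; other commands never pay for it.
async function runBenchCommand(): Promise<number> {
  const bench = await loadOptionalPackage("@remnic", "bench");
  if (bench.kind === "missing") {
    console.error(bench.hint);
    return 1;
  }
  return 0;
}
```

The string discriminant (`kind`) keeps the caller's branch explicit: the command either has a module to work with or a hint to print.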

## What it does

- **Benchmark runners** for a growing set of memory-oriented evals: `longmemeval`, `locomo`, `memory-arena`, `amemgym`, `ama-bench`, plus a lightweight smoke fixture.
- **Stored-run management**: every `remnic bench run *` writes a timestamped JSON result under `~/.remnic/bench/results/`; `remnic bench runs list|show|delete` let you browse, inspect, and prune.
- **Reproducibility manifests**: package-backed runs write `MANIFEST.json` beside the result files, locking result hashes, dataset file hashes, seeds, runtime profiles, command argv with secret values redacted, selected environment keys, git state, QMD collections, and config-file hashes.
- **Baselines + regression gates**: save a run as a named baseline, compare candidates against it, and gate CI on threshold violations.
- **Result export**: `remnic bench export <run> --format json|csv|html`.
- **Published feed**: `remnic bench publish --target remnic-ai` builds the tamper-evident integrity manifest consumed by remnic.ai.
- **Provider discovery**: `remnic bench providers discover` enumerates local OpenAI / Anthropic / Ollama / LiteLLM providers for adapter wiring.
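
To make the reproducibility-manifest bullet concrete, here is a minimal sketch of two of the ingredients it lists: content hashes and argv redaction. The `ManifestSketch` shape, the helper names, and the set of flags treated as secret are all assumptions for illustration; the real `MANIFEST.json` schema is defined by the package and records more fields.

```ts
import { createHash } from "node:crypto";

// Hypothetical shape; the real manifest also records seeds, git state, and more.
interface ManifestSketch {
  resultHashes: Record<string, string>;
  argv: string[];
}

// Which flags carry secrets is an assumption made for this example.
const SECRET_FLAGS = new Set(["--api-key", "--token"]);

function sha256(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// Redacts `--flag value` and `--flag=value` forms for the flags listed above.
function redactArgv(argv: string[]): string[] {
  const out: string[] = [];
  let redactNext = false;
  for (const arg of argv) {
    if (redactNext) {
      out.push("[REDACTED]");
      redactNext = false;
      continue;
    }
    const eq = arg.indexOf("=");
    const flag = eq === -1 ? arg : arg.slice(0, eq);
    if (SECRET_FLAGS.has(flag)) {
      if (eq === -1) {
        out.push(arg);
        redactNext = true; // the secret is the next argument
      } else {
        out.push(`${flag}=[REDACTED]`);
      }
      continue;
    }
    out.push(arg);
  }
  return out;
}

// `files` maps a file name to its content; each entry is hashed independently.
function buildManifestSketch(
  files: Record<string, string>,
  argv: string[],
): ManifestSketch {
  const resultHashes: Record<string, string> = {};
  for (const [name, content] of Object.entries(files)) {
    resultHashes[name] = sha256(content);
  }
  return { resultHashes, argv: redactArgv(argv) };
}
```

Hashing each file separately means a later verifier can report exactly which file changed rather than only that something did.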

## Memory eval dimensions

Agent memory without evals is vibes with a database.

`@remnic/bench` exports `MEMORY_EVAL_DIMENSIONS` as Remnic's shared eval
contract for user-aware agents. It covers:

- repeated-context reduction
- unnecessary-clarification reduction
- retrieval correctness
- stale-memory harm
- scope respect
- ask-when-needed decisions
- act-when-enough-context decisions
- personalization quality

Each dimension maps to existing quick-capable benchmark ids. Use
`listMemoryEvalBenchmarkIds()` when wiring CI coverage, and use the per-dimension
`fullModeGuidance` strings when designing publishable eval claims. See
[`docs/memory-evals.md`](../../docs/memory-evals.md) for the full map.
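
As a rough illustration of what "wiring CI coverage" could look like, the sketch below checks that every dimension is exercised by at least one benchmark that CI runs. It uses a local stand-in rather than the real export: the `DimensionSketch` shape, the dimension ids, and the dimension-to-benchmark mapping are invented for the example, and the actual `MEMORY_EVAL_DIMENSIONS` structure may differ.

```ts
// Local stand-in; not the real MEMORY_EVAL_DIMENSIONS export.
interface DimensionSketch {
  id: string;
  benchmarkIds: string[];
}

// The mapping below is made up purely to drive the example.
const dimensionsSketch: DimensionSketch[] = [
  { id: "retrieval-correctness", benchmarkIds: ["longmemeval", "locomo"] },
  { id: "stale-memory-harm", benchmarkIds: ["longmemeval"] },
];

// Deduplicated, sorted benchmark ids referenced by any dimension.
function listBenchmarkIdsSketch(dimensions: DimensionSketch[]): string[] {
  const ids = new Set<string>();
  for (const dimension of dimensions) {
    for (const id of dimension.benchmarkIds) {
      ids.add(id);
    }
  }
  return Array.from(ids).sort();
}

// Dimensions for which none of the mapped benchmarks run in CI.
function findUncoveredDimensions(
  dimensions: DimensionSketch[],
  ciBenchmarks: Set<string>,
): string[] {
  return dimensions
    .filter((d) => !d.benchmarkIds.some((id) => ciBenchmarks.has(id)))
    .map((d) => d.id);
}
```

A CI job could fail when `findUncoveredDimensions` returns a non-empty list, so that dropping a benchmark from the pipeline cannot silently remove coverage of a dimension.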

## CLI quick reference

```bash
# List available benchmarks:
remnic bench list

# Download a dataset for a full run:
remnic bench datasets download longmemeval

# Full run on the downloaded dataset:
remnic bench run longmemeval

# 60-second smoke run on the bundled fixture:
remnic bench run --quick longmemeval

# Browse stored runs:
remnic bench runs list
remnic bench runs show <run-id> --detail

# Inspect the reproducibility lock for the last run set:
jq . ~/.remnic/bench/results/MANIFEST.json

# Compare two runs:
remnic bench compare base-run candidate-run

# Save a baseline (archives the run under ~/.remnic/bench/baselines):
remnic bench baseline save dashboard-v1 candidate-run

# Gate CI against a stored run with a 2% threshold (compare takes run
# ids / paths, not baseline names; use `baseline save` for archival,
# then reference the underlying run id in `compare`):
remnic bench compare candidate-run nightly-run --threshold 0.02

# Ship results to remnic.ai:
remnic bench publish --target remnic-ai
```

Dataset markers match the runner's accepted filenames, so `datasets status` reports "downloaded" exactly when the runner will load successfully.

## Running on real datasets

The `longmemeval` and `locomo` runners ship with a bundled smoke fixture so
`remnic bench run --quick` and CI stay green without downloading anything.
To produce public-quality numbers you need the real datasets. Both live on
HuggingFace.

```bash
# Print the exact download commands (no auto-fetch):
scripts/bench/fetch-datasets.sh --help
scripts/bench/fetch-datasets.sh --target ./bench-datasets
```

Expected layout (the `bench-datasets/` directory is gitignored):

```
bench-datasets/
  longmemeval/
    longmemeval_oracle.json       # preferred filename
    longmemeval_s_cleaned.json    # optional alternate
    longmemeval_s.json            # optional alternate
  locomo/
    locomo10.json                 # preferred filename
    locomo.json                   # optional alternate
```

Point the runners at the directory. Use the current `remnic bench run`
CLI surface with `--dataset-dir` (a dedicated `remnic bench published`
subcommand with user-configurable `--limit`, `--model`, and `--seed` is
planned for a later slice of
[#566](https://github.com/joshuaswarren/remnic/issues/566)):

```bash
pnpm exec remnic bench run longmemeval \
  --dataset-dir ./bench-datasets/longmemeval

pnpm exec remnic bench run locomo \
  --dataset-dir ./bench-datasets/locomo
```

Programmatic loaders are exported from `@remnic/bench`:

```ts
import { loadLongMemEvalS, loadLoCoMo10 } from "@remnic/bench";

const longmemeval = await loadLongMemEvalS({
  mode: "full",
  datasetDir: "./bench-datasets/longmemeval",
  limit: 100,
});
// longmemeval.source === "dataset" when the real file was found,
// "smoke" when quick-mode fallback was used, "missing" when full-mode
// could not find any of the canonical filenames.
```

When `mode: "full"` and no dataset is found, the loaders return
`{ source: "missing", errors }` and the runner throws a
`formatMissingDatasetError()` message pointing operators at
`scripts/bench/fetch-datasets.sh`. Quick mode silently falls back to the
bundled smoke fixture and logs the probe errors so you can tell why.
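
The fallback rules above can be sketched as a small probe function. This is an illustration under stated assumptions, not the package's loader: `probeDataset`, the `ProbeResult` shape, and the `"quick"` mode literal are hypothetical names, and only the filenames and the three `source` values come from this README.

```ts
import { existsSync } from "node:fs";
import { join } from "node:path";

type Mode = "quick" | "full"; // "quick" as a literal is an assumption
type Source = "dataset" | "smoke" | "missing";

interface ProbeResult {
  source: Source;
  file?: string;
  errors: string[];
}

// Filenames from the expected layout, in preference order.
const LONGMEMEVAL_FILENAMES = [
  "longmemeval_oracle.json",
  "longmemeval_s_cleaned.json",
  "longmemeval_s.json",
];

// `exists` is injectable so the decision logic can be exercised without touching disk.
function probeDataset(
  mode: Mode,
  datasetDir: string | undefined,
  filenames: string[],
  exists: (path: string) => boolean = existsSync,
): ProbeResult {
  const errors: string[] = [];
  if (datasetDir === undefined) {
    errors.push("no dataset directory configured");
  } else {
    for (const name of filenames) {
      const candidate = join(datasetDir, name);
      if (exists(candidate)) {
        return { source: "dataset", file: candidate, errors };
      }
      errors.push(`not found: ${candidate}`);
    }
  }
  // Quick mode falls back to the bundled fixture; full mode reports the miss.
  return { source: mode === "quick" ? "smoke" : "missing", errors };
}
```

Returning the accumulated `errors` in every branch is what lets a quick-mode caller log why the fallback happened, and lets a full-mode caller build an actionable error message.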

## CI regression gate (smoke fixtures)

`.github/workflows/bench-smoke.yml` runs `scripts/bench/bench-smoke.ts`
on every PR. The script exercises the LongMemEval + LoCoMo runners
against their bundled smoke fixtures with a fixed seed and a
deterministic in-memory adapter (no real datasets, no LLM calls, no
network). Metrics are compared to the committed baseline at
`tests/fixtures/bench-smoke/baseline.json`; any drop greater than 5%
fails the job.
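
A minimal sketch of such a gate, assuming higher-is-better metrics and interpreting "drop greater than 5%" as a relative drop against the baseline value. Both are assumptions, as are the names `findRegressions` and `Regression`; the actual `bench-smoke.ts` script may compute the comparison differently.

```ts
type Metrics = Record<string, number>;

interface Regression {
  metric: string;
  baseline: number;
  current: number;
  relativeDrop: number;
}

// Returns every metric whose relative drop versus the baseline exceeds maxDrop (0.05 = 5%).
function findRegressions(
  baseline: Metrics,
  current: Metrics,
  maxDrop = 0.05,
): Regression[] {
  const regressions: Regression[] = [];
  for (const [metric, base] of Object.entries(baseline)) {
    if (base <= 0) {
      continue; // nothing to drop from; skip to avoid dividing by zero
    }
    const value = current[metric];
    // A metric missing from the current run is treated as a full drop.
    const now = value === undefined ? 0 : value;
    const relativeDrop = (base - now) / base;
    if (relativeDrop > maxDrop) {
      regressions.push({ metric, baseline: base, current: now, relativeDrop });
    }
  }
  return regressions;
}
```

A CI script would typically print the returned list and exit non-zero when it is non-empty. Improvements produce a negative `relativeDrop` and never trip the gate.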

Regenerate the baseline after an intentional runner change:

```bash
pnpm exec tsx scripts/bench/bench-smoke.ts --update-baseline
```

## Programmatic API

```ts
import {
  listBenchmarks,
  runBenchmark,
  writeBenchmarkResult,
  writeBenchmarkReproManifest,
  createLightweightAdapter,
  createRemnicAdapter,
  compareResults,
  saveBenchmarkBaseline,
  listBenchmarkResults,
  deleteBenchmarkResults,
  buildBenchmarkPublishFeed,
  discoverAllProviders,
  type BenchmarkResult,
  type ComparisonResult,
  type BenchmarkDefinition,
} from "@remnic/bench";
```

Each runner accepts a `system` adapter: `createRemnicAdapter()` talks to a live `@remnic/core` Orchestrator, while `createLightweightAdapter()` is a minimal in-memory stand-in used for CI smoke runs. Results conform to the `BenchmarkResult` schema (see `dist/index.d.ts`).

## Agent note

If you're an AI agent extending a Remnic-based stack: **do not** import `@remnic/bench` from a base install surface (CLI, core, plugin). Optional companion packages must be loaded via computed-specifier dynamic imports with an install-hint fallback. See `packages/remnic-cli/src/optional-bench.ts` in the repo for the canonical pattern, and the à-la-carte invariant in the repo's `AGENTS.md` §44 / `CLAUDE.md` gotcha #57.

## Related

- [`@remnic/cli`](https://www.npmjs.com/package/@remnic/cli): the CLI that drives `remnic bench *`
- [`@remnic/core`](https://www.npmjs.com/package/@remnic/core): the memory engine bench adapters talk to
- Source + issues: <https://github.com/joshuaswarren/remnic>

## License

MIT. See the root [LICENSE](https://github.com/joshuaswarren/remnic/blob/main/LICENSE) file.
package/baselines/procedural-recall-baseline.json
ADDED
@@ -0,0 +1,198 @@
{
  "schemaVersion": 1,
  "fixture": {
    "path": null,
    "scenarioCount": 20
  },
  "onScore": 0.75,
  "offScore": 0.25,
  "lift": 0.5,
  "confidenceInterval": {
    "lower": 0.3,
    "upper": 0.75,
    "level": 0.95
  },
  "perCase": [
    {
      "id": "rerun-deploy-gateway",
      "prompt": "Let's deploy the gateway service to production now.",
      "expectMatch": true,
      "onMatched": true,
      "offMatched": false,
      "onScore": 1,
      "offScore": 0
    },
    {
      "id": "rerun-open-pr",
      "prompt": "Open a pull request for the regression fix against main.",
      "expectMatch": true,
      "onMatched": true,
      "offMatched": false,
      "onScore": 1,
      "offScore": 0
    },
    {
      "id": "rerun-run-tests",
      "prompt": "Run the test suite before we merge the release branch.",
      "expectMatch": true,
      "onMatched": true,
      "offMatched": false,
      "onScore": 1,
      "offScore": 0
    },
    {
      "id": "rerun-rotate-credentials",
      "prompt": "Rotate the staging database credentials right now.",
      "expectMatch": true,
      "onMatched": false,
      "offMatched": false,
      "onScore": 0,
      "offScore": 0
    },
    {
      "id": "rerun-ship-release",
      "prompt": "We need to ship the v9 release tonight.",
      "expectMatch": true,
      "onMatched": true,
      "offMatched": false,
      "onScore": 1,
      "offScore": 0
    },
    {
      "id": "paramvar-deploy-api",
      "prompt": "Let's deploy the API service to production today.",
      "expectMatch": true,
      "onMatched": true,
      "offMatched": false,
      "onScore": 1,
      "offScore": 0
    },
    {
      "id": "paramvar-rollback-ticket",
      "prompt": "Roll back ticket PROJ-912 before the standup tomorrow.",
      "expectMatch": true,
      "onMatched": false,
      "offMatched": false,
      "onScore": 0,
      "offScore": 0
    },
    {
      "id": "paramvar-rotate-prod",
      "prompt": "Rotate the production database credentials this morning.",
      "expectMatch": true,
      "onMatched": false,
      "offMatched": false,
      "onScore": 0,
      "offScore": 0
    },
    {
      "id": "paramvar-start-branch",
      "prompt": "Starting work on the billing branch feature.",
      "expectMatch": true,
      "onMatched": true,
      "offMatched": false,
      "onScore": 1,
      "offScore": 0
    },
    {
      "id": "paramvar-merge-pr-after-ci",
      "prompt": "Merge the pull request after CI turns green.",
      "expectMatch": true,
      "onMatched": true,
      "offMatched": false,
      "onScore": 1,
      "offScore": 0
    },
    {
      "id": "decomp-incident-response",
      "prompt": "Run the incident response playbook for the gateway outage.",
      "expectMatch": true,
      "onMatched": false,
      "offMatched": false,
      "onScore": 0,
      "offScore": 0
    },
    {
      "id": "decomp-release-cut",
      "prompt": "Cut the weekly release branch and start the release checklist.",
      "expectMatch": true,
      "onMatched": true,
      "offMatched": false,
      "onScore": 1,
      "offScore": 0
    },
    {
      "id": "decomp-onboarding",
      "prompt": "Start the onboarding workflow for a new engineer joining the team.",
      "expectMatch": true,
      "onMatched": true,
      "offMatched": false,
      "onScore": 1,
      "offScore": 0
    },
    {
      "id": "decomp-data-migration",
      "prompt": "Begin the schema migration plan for the billing table.",
      "expectMatch": true,
      "onMatched": false,
      "offMatched": false,
      "onScore": 0,
      "offScore": 0
    },
    {
      "id": "decomp-runbook-certificate",
      "prompt": "Start the certificate renewal runbook for the public edge.",
      "expectMatch": true,
      "onMatched": true,
      "offMatched": false,
      "onScore": 1,
      "offScore": 0
    },
    {
      "id": "distract-explain-question",
      "prompt": "How does hybrid retrieval combine BM25 and vector search?",
      "expectMatch": false,
      "onMatched": false,
      "offMatched": false,
      "onScore": 1,
      "offScore": 1
    },
    {
      "id": "distract-past-tense-recap",
      "prompt": "What did we decide about the database rotation last week?",
      "expectMatch": false,
      "onMatched": false,
      "offMatched": false,
      "onScore": 1,
      "offScore": 1
    },
    {
      "id": "distract-summary-request",
      "prompt": "Summarize the timeline of the gateway outage for the report.",
      "expectMatch": false,
      "onMatched": false,
      "offMatched": false,
      "onScore": 1,
      "offScore": 1
    },
    {
      "id": "distract-thanks",
      "prompt": "Thanks, that summary helps a lot.",
      "expectMatch": false,
      "onMatched": false,
      "offMatched": false,
      "onScore": 1,
      "offScore": 1
    },
    {
      "id": "distract-off-topic-task",
      "prompt": "Let's grab coffee before the standup tomorrow.",
      "expectMatch": false,
      "onMatched": false,
      "offMatched": false,
      "onScore": 1,
      "offScore": 1
    }
  ],
  "generatedAt": "baseline-v1"
}
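
The aggregate fields in this baseline are consistent with simple means over `perCase`: 15 of the 20 cases score 1 in the "on" arm (0.75), 5 of 20 score 1 in the "off" arm (0.25), and `lift` is their difference (0.5). The sketch below recomputes those numbers. The scoring rule (a case scores 1 when the match outcome equals `expectMatch`) is inferred from the data in this file rather than taken from the package source, the helper names are hypothetical, and the confidence interval is not reproduced because the file does not say how it was computed.

```ts
// Subset of the per-case fields that the recomputation needs.
interface PerCase {
  id: string;
  expectMatch: boolean;
  onMatched: boolean;
  offMatched: boolean;
}

interface Summary {
  onScore: number;
  offScore: number;
  lift: number;
}

// Inferred rule: a case scores 1 when the outcome equals the expectation.
function caseScore(expectMatch: boolean, matched: boolean): number {
  return expectMatch === matched ? 1 : 0;
}

// Mean per-arm score and the difference between the two arms.
function summarize(cases: PerCase[]): Summary {
  if (cases.length === 0) {
    return { onScore: 0, offScore: 0, lift: 0 };
  }
  let onTotal = 0;
  let offTotal = 0;
  for (const c of cases) {
    onTotal += caseScore(c.expectMatch, c.onMatched);
    offTotal += caseScore(c.expectMatch, c.offMatched);
  }
  const onScore = onTotal / cases.length;
  const offScore = offTotal / cases.length;
  return { onScore, offScore, lift: onScore - offScore };
}
```

Under this rule the five `distract-*` cases score 1 in both arms because neither arm matched and no match was expected, which is why they raise both means without contributing to the lift.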