@remnic/bench 1.0.0 → 9.3.515

package/README.md ADDED
@@ -0,0 +1,202 @@
+ # @remnic/bench
+
+ Benchmark suite and CI regression gates for [Remnic](https://github.com/joshuaswarren/remnic) memory pipelines. Ships the runners, adapters, and results store that the `remnic bench` CLI surface drives.
+
+ `@remnic/bench` is an **optional companion** to [`@remnic/cli`](https://www.npmjs.com/package/@remnic/cli). Install it only when you need to run benchmarks, compare runs, or publish results. Memory-only users do not need it.
+
+ ## Install
+
+ ```bash
+ # Alongside the CLI:
+ npm install -g @remnic/cli @remnic/bench
+
+ # Or in a project that drives benchmarks programmatically:
+ pnpm add @remnic/bench
+ ```
+
+ The CLI loads `@remnic/bench` via a computed-specifier dynamic import. If it's not installed, `remnic bench *` prints a clear install hint; the rest of the CLI keeps working.
+
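The computed-specifier pattern described above can be sketched roughly as follows. This is an illustrative sketch, not the code in `packages/remnic-cli/src/optional-bench.ts`: the helper name `loadOptional`, the `LoadResult` type, and the hint wording are all hypothetical, and the error codes checked are the ones Node.js reports for an unresolvable module.

```ts
// Hypothetical helper; names and hint text are assumptions for illustration.
type LoadResult<T> = { ok: true; mod: T } | { ok: false; hint: string };

// Node.js reports ERR_MODULE_NOT_FOUND (ESM) or MODULE_NOT_FOUND (CommonJS)
// when a specifier cannot be resolved.
function isMissingModule(err: unknown): boolean {
  const code = (err as { code?: unknown } | null)?.code;
  return code === "ERR_MODULE_NOT_FOUND" || code === "MODULE_NOT_FOUND";
}

async function loadOptional<T = unknown>(name: string): Promise<LoadResult<T>> {
  // Building the specifier at runtime keeps bundlers and the type checker
  // from resolving the package statically, so a missing install only
  // surfaces here, at the point of use.
  const specifier = [name].join("");
  try {
    const mod = (await import(specifier)) as T;
    return { ok: true, mod };
  } catch (err) {
    if (isMissingModule(err)) {
      return {
        ok: false,
        hint: `Optional package "${name}" is not installed. Try: npm install -g ${name}`,
      };
    }
    throw err; // Anything else is a real failure and should not be hidden.
  }
}
```

A caller would print `hint` and exit the `bench` subcommand when `ok` is false, leaving every other command unaffected.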
+ ## What it does
+
+ - **Benchmark runners** for a growing set of memory-oriented evals: `longmemeval`, `locomo`, `memory-arena`, `amemgym`, `ama-bench`, plus a lightweight smoke fixture.
+ - **Stored-run management**: every `remnic bench run *` writes a timestamped JSON result under `~/.remnic/bench/results/`; `remnic bench runs list|show|delete` let you browse, inspect, and prune.
+ - **Reproducibility manifests**: package-backed runs write `MANIFEST.json` beside the result files, locking result hashes, dataset file hashes, seeds, runtime profiles, command argv with secret values redacted, selected environment keys, git state, QMD collections, and config-file hashes.
+ - **Baselines + regression gates**: save a run as a named baseline, compare candidates against it, gate CI on threshold violations.
+ - **Result export**: `remnic bench export <run> --format json|csv|html`.
+ - **Published feed**: `remnic bench publish --target remnic-ai` builds the tamper-evident integrity manifest consumed by remnic.ai.
+ - **Provider discovery**: `remnic bench providers discover` enumerates local OpenAI / Anthropic / Ollama / LiteLLM providers for adapter wiring.
+
+ ## Memory eval dimensions
+
+ Agent memory without evals is vibes with a database.
+
+ `@remnic/bench` exports `MEMORY_EVAL_DIMENSIONS` as Remnic's shared eval
+ contract for user-aware agents. It covers:
+
+ - repeated-context reduction
+ - unnecessary-clarification reduction
+ - retrieval correctness
+ - stale-memory harm
+ - scope respect
+ - ask-when-needed decisions
+ - act-when-enough-context decisions
+ - personalization quality
+
+ Each dimension maps to existing quick-capable benchmark ids. Use
+ `listMemoryEvalBenchmarkIds()` when wiring CI coverage, and use the per-dimension
+ `fullModeGuidance` strings when designing publishable eval claims. See
+ [`docs/memory-evals.md`](../../docs/memory-evals.md) for the full map.
+
+ ## CLI quick reference
+
+ ```bash
+ # List available benchmarks:
+ remnic bench list
+
+ # Download a dataset for a full run:
+ remnic bench datasets download longmemeval
+
+ # Full run on the downloaded dataset:
+ remnic bench run longmemeval
+
+ # 60-second smoke run on the bundled fixture:
+ remnic bench run --quick longmemeval
+
+ # Browse stored runs:
+ remnic bench runs list
+ remnic bench runs show <run-id> --detail
+
+ # Inspect the reproducibility lock for the last run set:
+ jq . ~/.remnic/bench/results/MANIFEST.json
+
+ # Compare two runs:
+ remnic bench compare base-run candidate-run
+
+ # Save a baseline (archives the run under ~/.remnic/bench/baselines):
+ remnic bench baseline save dashboard-v1 candidate-run
+
+ # Gate CI against a stored run with a 2% threshold (compare takes run
+ # ids / paths, not baseline names; use `baseline save` for archival,
+ # then reference the underlying run id in `compare`):
+ remnic bench compare candidate-run nightly-run --threshold 0.02
+
+ # Ship results to remnic.ai:
+ remnic bench publish --target remnic-ai
+ ```
+
+ Dataset markers match the runner's accepted filenames, so `datasets status` reports "downloaded" exactly when the runner will load successfully.
+
+ ## Running on real datasets
+
+ The `longmemeval` and `locomo` runners ship with a bundled smoke fixture so
+ `remnic bench run --quick` and CI stay green without downloading anything.
+ To produce public-quality numbers you need the real datasets. Both live on
+ Hugging Face.
+
+ ```bash
+ # Print the exact download commands (no auto-fetch):
+ scripts/bench/fetch-datasets.sh --help
+ scripts/bench/fetch-datasets.sh --target ./bench-datasets
+ ```
+
+ Expected layout (the `bench-datasets/` directory is gitignored):
+
+ ```
+ bench-datasets/
+   longmemeval/
+     longmemeval_oracle.json      # preferred filename
+     longmemeval_s_cleaned.json   # optional alternate
+     longmemeval_s.json           # optional alternate
+   locomo/
+     locomo10.json                # preferred filename
+     locomo.json                  # optional alternate
+ ```
+
+ Point the runners at the directory. Use the current `remnic bench run`
+ CLI surface with `--dataset-dir` (a dedicated `remnic bench published`
+ subcommand with user-configurable `--limit`, `--model`, and `--seed` is
+ planned for a later slice of
+ [#566](https://github.com/joshuaswarren/remnic/issues/566)):
+
+ ```bash
+ pnpm exec remnic bench run longmemeval \
+   --dataset-dir ./bench-datasets/longmemeval
+
+ pnpm exec remnic bench run locomo \
+   --dataset-dir ./bench-datasets/locomo
+ ```
+
+ Programmatic loaders are exported from `@remnic/bench`:
+
+ ```ts
+ import { loadLongMemEvalS, loadLoCoMo10 } from "@remnic/bench";
+
+ const longmemeval = await loadLongMemEvalS({
+   mode: "full",
+   datasetDir: "./bench-datasets/longmemeval",
+   limit: 100,
+ });
+ // longmemeval.source === "dataset" when the real file was found,
+ // "smoke" when quick-mode fallback was used, "missing" when full-mode
+ // could not find any of the canonical filenames.
+ ```
+
+ When `mode: "full"` and no dataset is found, the loaders return
+ `{ source: "missing", errors }` and the runner throws a
+ `formatMissingDatasetError()` message pointing operators at
+ `scripts/bench/fetch-datasets.sh`. Quick mode does not fail in that case: it
+ falls back to the bundled smoke fixture and logs the probe errors so you can
+ tell why.
+
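As a sketch of how a caller might branch on the loader's `source` field before publishing numbers: the `LoadOutcome` type and the `assertPublishable` helper below are hypothetical, and `errors` is assumed to be a list of strings, since the README names only `source` and `errors` and does not give the full result shape.

```ts
// Assumed result shape for illustration; the real exported types may differ.
type LoadOutcome =
  | { source: "dataset" | "smoke"; errors?: string[] }
  | { source: "missing"; errors: string[] };

// Hypothetical guard: only a run backed by the real dataset should feed
// published results.
function assertPublishable(outcome: LoadOutcome): void {
  switch (outcome.source) {
    case "dataset":
      return; // Real dataset file was found.
    case "smoke":
      throw new Error("Loaded the bundled smoke fixture; numbers are not publishable.");
    case "missing":
      throw new Error(`Dataset not found: ${outcome.errors.join("; ")}`);
  }
}
```

Treating `"smoke"` as a hard stop here is a design choice for publish paths only; quick CI runs would accept it.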
+ ## CI regression gate (smoke fixtures)
+
+ `.github/workflows/bench-smoke.yml` runs `scripts/bench/bench-smoke.ts`
+ on every PR. The script exercises the LongMemEval + LoCoMo runners
+ against their bundled smoke fixtures with a fixed seed and a
+ deterministic in-memory adapter (no real datasets, no LLM calls, no
+ network). Metrics are compared to the committed baseline at
+ `tests/fixtures/bench-smoke/baseline.json`; any drop greater than 5%
+ fails the job.
+
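The gate logic described above might look something like the sketch below. It is not the contents of `scripts/bench/bench-smoke.ts`: the `Metrics` shape and the `findRegressions` name are hypothetical, and the sketch assumes the 5% drop is measured relative to the baseline value, which the README does not specify.

```ts
// Hypothetical flat metric map, e.g. { accuracy: 0.8 }.
type Metrics = Record<string, number>;

// Returns one message per metric that regressed past the allowed drop.
// Assumption: "drop" is relative to the baseline value, not absolute.
function findRegressions(baseline: Metrics, current: Metrics, maxDrop = 0.05): string[] {
  const failures: string[] = [];
  for (const [name, base] of Object.entries(baseline)) {
    const value = current[name];
    if (value === undefined) {
      failures.push(`${name}: missing from current run`);
      continue;
    }
    if (base <= 0) continue; // Nothing to drop from; skip to avoid dividing by zero.
    const drop = (base - value) / base;
    if (drop > maxDrop) {
      failures.push(`${name}: dropped ${(drop * 100).toFixed(1)}% (${base} -> ${value})`);
    }
  }
  return failures;
}
```

A CI script would exit non-zero when the returned list is non-empty; improvements and drops within the threshold pass.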
+ Regenerate the baseline after an intentional runner change:
+
+ ```bash
+ pnpm exec tsx scripts/bench/bench-smoke.ts --update-baseline
+ ```
+
+ ## Programmatic API
+
+ ```ts
+ import {
+   listBenchmarks,
+   runBenchmark,
+   writeBenchmarkResult,
+   writeBenchmarkReproManifest,
+   createLightweightAdapter,
+   createRemnicAdapter,
+   compareResults,
+   saveBenchmarkBaseline,
+   listBenchmarkResults,
+   deleteBenchmarkResults,
+   buildBenchmarkPublishFeed,
+   discoverAllProviders,
+   type BenchmarkResult,
+   type ComparisonResult,
+   type BenchmarkDefinition,
+ } from "@remnic/bench";
+ ```
+
+ Each runner accepts a `system` adapter: `createRemnicAdapter()` talks to a live `@remnic/core` Orchestrator; `createLightweightAdapter()` is a minimal in-memory stand-in used for CI smoke runs. Results conform to the `BenchmarkResult` schema (see `dist/index.d.ts`).
+
+ ## Agent note
+
+ If you're an AI agent extending a Remnic-based stack: **do not** import `@remnic/bench` from a base install surface (CLI, core, plugin). Optional companion packages must be loaded via computed-specifier dynamic imports with an install-hint fallback. See `packages/remnic-cli/src/optional-bench.ts` in the repo for the canonical pattern, and the à-la-carte invariant in the repo's `AGENTS.md` §44 / `CLAUDE.md` gotcha #57.
+
+ ## Related
+
+ - [`@remnic/cli`](https://www.npmjs.com/package/@remnic/cli): the CLI that drives `remnic bench *`
+ - [`@remnic/core`](https://www.npmjs.com/package/@remnic/core): the memory engine bench adapters talk to
+ - Source + issues: <https://github.com/joshuaswarren/remnic>
+
+ ## License
+
+ MIT. See the root [LICENSE](https://github.com/joshuaswarren/remnic/blob/main/LICENSE) file.
@@ -0,0 +1,198 @@
+ {
+   "schemaVersion": 1,
+   "fixture": {
+     "path": null,
+     "scenarioCount": 20
+   },
+   "onScore": 0.75,
+   "offScore": 0.25,
+   "lift": 0.5,
+   "confidenceInterval": {
+     "lower": 0.3,
+     "upper": 0.75,
+     "level": 0.95
+   },
+   "perCase": [
+     {
+       "id": "rerun-deploy-gateway",
+       "prompt": "Let's deploy the gateway service to production now.",
+       "expectMatch": true,
+       "onMatched": true,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 0
+     },
+     {
+       "id": "rerun-open-pr",
+       "prompt": "Open a pull request for the regression fix against main.",
+       "expectMatch": true,
+       "onMatched": true,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 0
+     },
+     {
+       "id": "rerun-run-tests",
+       "prompt": "Run the test suite before we merge the release branch.",
+       "expectMatch": true,
+       "onMatched": true,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 0
+     },
+     {
+       "id": "rerun-rotate-credentials",
+       "prompt": "Rotate the staging database credentials right now.",
+       "expectMatch": true,
+       "onMatched": false,
+       "offMatched": false,
+       "onScore": 0,
+       "offScore": 0
+     },
+     {
+       "id": "rerun-ship-release",
+       "prompt": "We need to ship the v9 release tonight.",
+       "expectMatch": true,
+       "onMatched": true,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 0
+     },
+     {
+       "id": "paramvar-deploy-api",
+       "prompt": "Let's deploy the API service to production today.",
+       "expectMatch": true,
+       "onMatched": true,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 0
+     },
+     {
+       "id": "paramvar-rollback-ticket",
+       "prompt": "Roll back ticket PROJ-912 before the standup tomorrow.",
+       "expectMatch": true,
+       "onMatched": false,
+       "offMatched": false,
+       "onScore": 0,
+       "offScore": 0
+     },
+     {
+       "id": "paramvar-rotate-prod",
+       "prompt": "Rotate the production database credentials this morning.",
+       "expectMatch": true,
+       "onMatched": false,
+       "offMatched": false,
+       "onScore": 0,
+       "offScore": 0
+     },
+     {
+       "id": "paramvar-start-branch",
+       "prompt": "Starting work on the billing branch feature.",
+       "expectMatch": true,
+       "onMatched": true,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 0
+     },
+     {
+       "id": "paramvar-merge-pr-after-ci",
+       "prompt": "Merge the pull request after CI turns green.",
+       "expectMatch": true,
+       "onMatched": true,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 0
+     },
+     {
+       "id": "decomp-incident-response",
+       "prompt": "Run the incident response playbook for the gateway outage.",
+       "expectMatch": true,
+       "onMatched": false,
+       "offMatched": false,
+       "onScore": 0,
+       "offScore": 0
+     },
+     {
+       "id": "decomp-release-cut",
+       "prompt": "Cut the weekly release branch and start the release checklist.",
+       "expectMatch": true,
+       "onMatched": true,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 0
+     },
+     {
+       "id": "decomp-onboarding",
+       "prompt": "Start the onboarding workflow for a new engineer joining the team.",
+       "expectMatch": true,
+       "onMatched": true,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 0
+     },
+     {
+       "id": "decomp-data-migration",
+       "prompt": "Begin the schema migration plan for the billing table.",
+       "expectMatch": true,
+       "onMatched": false,
+       "offMatched": false,
+       "onScore": 0,
+       "offScore": 0
+     },
+     {
+       "id": "decomp-runbook-certificate",
+       "prompt": "Start the certificate renewal runbook for the public edge.",
+       "expectMatch": true,
+       "onMatched": true,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 0
+     },
+     {
+       "id": "distract-explain-question",
+       "prompt": "How does hybrid retrieval combine BM25 and vector search?",
+       "expectMatch": false,
+       "onMatched": false,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 1
+     },
+     {
+       "id": "distract-past-tense-recap",
+       "prompt": "What did we decide about the database rotation last week?",
+       "expectMatch": false,
+       "onMatched": false,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 1
+     },
+     {
+       "id": "distract-summary-request",
+       "prompt": "Summarize the timeline of the gateway outage for the report.",
+       "expectMatch": false,
+       "onMatched": false,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 1
+     },
+     {
+       "id": "distract-thanks",
+       "prompt": "Thanks, that summary helps a lot.",
+       "expectMatch": false,
+       "onMatched": false,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 1
+     },
+     {
+       "id": "distract-off-topic-task",
+       "prompt": "Let's grab coffee before the standup tomorrow.",
+       "expectMatch": false,
+       "onMatched": false,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 1
+     }
+   ],
+   "generatedAt": "baseline-v1"
+ }
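The aggregate fields in this fixture are consistent with plain means over `perCase`: 15 of the 20 cases have `onScore` 1 (0.75), 5 have `offScore` 1 (0.25), and `lift` is their difference (0.5). A minimal sketch of that arithmetic follows; the `CaseScore` type and `summarize` function are hypothetical names, and the sketch does not attempt the confidence interval, whose method is not stated in the diff.

```ts
// Hypothetical subset of a perCase entry; only the score fields are needed.
interface CaseScore {
  id: string;
  onScore: number;
  offScore: number;
}

function mean(values: number[]): number {
  if (values.length === 0) return 0; // Avoid NaN on an empty fixture.
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Assumption: the aggregates are unweighted means and lift = on - off.
function summarize(cases: CaseScore[]): { onScore: number; offScore: number; lift: number } {
  const onScore = mean(cases.map((c) => c.onScore));
  const offScore = mean(cases.map((c) => c.offScore));
  return { onScore, offScore, lift: onScore - offScore };
}

// Same score distribution as the fixture: 10 cases scored only when on,
// 5 scored in neither condition, 5 distractors scored in both.
const sample: CaseScore[] = [
  ...Array.from({ length: 10 }, (_, i) => ({ id: `hit-${i}`, onScore: 1, offScore: 0 })),
  ...Array.from({ length: 5 }, (_, i) => ({ id: `miss-${i}`, onScore: 0, offScore: 0 })),
  ...Array.from({ length: 5 }, (_, i) => ({ id: `distract-${i}`, onScore: 1, offScore: 1 })),
];
const summary = summarize(sample); // { onScore: 0.75, offScore: 0.25, lift: 0.5 }
```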