@remnic/bench 1.0.1 → 9.3.517

package/README.md CHANGED
@@ -20,11 +20,33 @@ The CLI loads `@remnic/bench` via a computed-specifier dynamic import. If it's n
 
  - **Benchmark runners** for a growing set of memory-oriented evals: `longmemeval`, `locomo`, `memory-arena`, `amemgym`, `ama-bench`, plus a lightweight smoke fixture.
  - **Stored-run management** — every `remnic bench run *` writes a timestamped JSON result under `~/.remnic/bench/results/`; `remnic bench runs list|show|delete` let you browse, inspect, and prune.
+ - **Reproducibility manifests** — package-backed runs write `MANIFEST.json` beside the result files, locking result hashes, dataset file hashes, seeds, runtime profiles, command argv with secret values redacted, selected environment keys, git state, QMD collections, and config-file hashes.
  - **Baselines + regression gates** — save a run as a named baseline, compare candidates against it, gate CI on threshold violations.
  - **Result export** — `remnic bench export <run> --format json|csv|html`.
  - **Published feed** — `remnic bench publish --target remnic-ai` builds the tamper-evident integrity manifest consumed by remnic.ai.
  - **Provider discovery** — `remnic bench providers discover` enumerates local OpenAI / Anthropic / Ollama / LiteLLM providers for adapter wiring.
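The manifest bullet above names two mechanics that are easy to picture in code: hashing files and recording argv with secret values redacted. The sketch below is illustrative only. `sha256Hex`, `redactArgv`, and the secret-flag pattern are hypothetical names and heuristics chosen for this example, not the `@remnic/bench` implementation, whose actual rules are not shown in this diff.

```ts
import { createHash } from "node:crypto";

// Hypothetical helper: hex-encoded SHA-256 digest of some file content.
export function sha256Hex(content: string | Uint8Array): string {
  return createHash("sha256").update(content).digest("hex");
}

// Assumed heuristic for flags that carry secrets (not taken from the package).
const SECRET_FLAG = /(key|token|secret|password)/i;

// Hypothetical helper: copy argv, replacing values of secret-looking flags.
// Handles both `--flag value` and `--flag=value` forms.
export function redactArgv(argv: readonly string[]): string[] {
  const out: string[] = [];
  let redactNext = false;
  for (const arg of argv) {
    if (redactNext) {
      out.push("[REDACTED]");
      redactNext = false;
      continue;
    }
    const eq = arg.indexOf("=");
    if (arg.startsWith("--") && eq !== -1) {
      const flag = arg.slice(0, eq);
      out.push(SECRET_FLAG.test(flag) ? `${flag}=[REDACTED]` : arg);
    } else {
      out.push(arg);
      // The next token is treated as this flag's value when the flag looks secret.
      redactNext = arg.startsWith("--") && SECRET_FLAG.test(arg);
    }
  }
  return out;
}
```

A manifest entry built this way stores only digests and redacted arguments, so it can be committed or published without leaking credentials while still letting a later run verify that inputs are byte-identical.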
 
+ ## Memory eval dimensions
+
+ Agent memory without evals is vibes with a database.
+
+ `@remnic/bench` exports `MEMORY_EVAL_DIMENSIONS` as Remnic's shared eval
+ contract for user-aware agents. It covers:
+
+ - repeated-context reduction
+ - unnecessary-clarification reduction
+ - retrieval correctness
+ - stale-memory harm
+ - scope respect
+ - ask-when-needed decisions
+ - act-when-enough-context decisions
+ - personalization quality
+
+ Each dimension maps to existing quick-capable benchmark ids. Use
+ `listMemoryEvalBenchmarkIds()` when wiring CI coverage, and use the per-dimension
+ `fullModeGuidance` strings when designing publishable eval claims. See
+ [`docs/memory-evals.md`](../../docs/memory-evals.md) for the full map.
+
  ## CLI quick reference
 
  ```bash
@@ -44,6 +66,9 @@ remnic bench run --quick longmemeval
  remnic bench runs list
  remnic bench runs show <run-id> --detail
 
+ # Inspect the reproducibility lock for the last run set:
+ jq . ~/.remnic/bench/results/MANIFEST.json
+
  # Compare two runs:
  remnic bench compare base-run candidate-run
 
@@ -61,6 +86,83 @@ remnic bench publish --target remnic-ai
 
  Dataset markers match the runner's accepted filenames, so `datasets status` reports "downloaded" exactly when the runner will load successfully.
 
+ ## Running on real datasets
+
+ The `longmemeval` and `locomo` runners ship with a bundled smoke fixture so
+ `remnic bench run --quick` and CI stay green without downloading anything.
+ To produce public-quality numbers you need the real datasets. Both live on
+ HuggingFace.
+
+ ```bash
+ # Print the exact download commands (no auto-fetch):
+ scripts/bench/fetch-datasets.sh --help
+ scripts/bench/fetch-datasets.sh --target ./bench-datasets
+ ```
+
+ Expected layout (the `bench-datasets/` directory is gitignored):
+
+ ```
+ bench-datasets/
+   longmemeval/
+     longmemeval_oracle.json # preferred filename
+     longmemeval_s_cleaned.json # optional alternate
+     longmemeval_s.json # optional alternate
+   locomo/
+     locomo10.json # preferred filename
+     locomo.json # optional alternate
+ ```
+
+ Point the runners at the directory. Use the current `remnic bench run`
+ CLI surface with `--dataset-dir` (a dedicated `remnic bench published`
+ subcommand with user-configurable `--limit`, `--model`, and `--seed` is
+ planned for a later slice of
+ [#566](https://github.com/joshuaswarren/remnic/issues/566)):
+
+ ```bash
+ pnpm exec remnic bench run longmemeval \
+   --dataset-dir ./bench-datasets/longmemeval
+
+ pnpm exec remnic bench run locomo \
+   --dataset-dir ./bench-datasets/locomo
+ ```
+
+ Programmatic loaders are exported from `@remnic/bench`:
+
+ ```ts
+ import { loadLongMemEvalS, loadLoCoMo10 } from "@remnic/bench";
+
+ const longmemeval = await loadLongMemEvalS({
+   mode: "full",
+   datasetDir: "./bench-datasets/longmemeval",
+   limit: 100,
+ });
+ // longmemeval.source === "dataset" when the real file was found,
+ // "smoke" when quick-mode fallback was used, "missing" when full-mode
+ // could not find any of the canonical filenames.
+ ```
+
+ When `mode: "full"` and no dataset is found, the loaders return
+ `{ source: "missing", errors }` and the runner throws a
+ `formatMissingDatasetError()` message pointing operators at
+ `scripts/bench/fetch-datasets.sh`. Quick mode silently falls back to the
+ bundled smoke fixture and logs the probe errors so you can tell why.
+
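The three `source` values described in the added README text behave like a discriminated union, which a caller can branch on exhaustively. The sketch below uses a local stand-in type rather than importing from `@remnic/bench`: `LoaderResult`, its `items` field, the `errors` field on the `"smoke"` variant, and `describeLoad` are assumptions made for illustration, and the real exported types may differ.

```ts
// Local stand-in for the loader result; field names other than `source`
// and the `errors` on "missing" are assumptions, not the package's types.
type LoaderResult<T> =
  | { source: "dataset"; items: T[] }
  | { source: "smoke"; items: T[]; errors: string[] }
  | { source: "missing"; errors: string[] };

// Turn a loader result into a log line, or throw when full mode found nothing.
function describeLoad<T>(result: LoaderResult<T>): string {
  switch (result.source) {
    case "dataset":
      return `loaded ${result.items.length} items from the real dataset`;
    case "smoke":
      // Quick-mode fallback: keep going, but surface why the probe failed.
      return `fell back to smoke fixture (${result.items.length} items); probe errors: ${result.errors.join("; ")}`;
    case "missing":
      throw new Error(
        `dataset missing: ${result.errors.join("; ")}. See scripts/bench/fetch-datasets.sh`,
      );
  }
}
```

Branching on `source` before touching any items keeps a smoke-fixture run from being mistaken for a publishable result.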
+ ## CI regression gate (smoke fixtures)
+
+ `.github/workflows/bench-smoke.yml` runs `scripts/bench/bench-smoke.ts`
+ on every PR. The script exercises the LongMemEval + LoCoMo runners
+ against their bundled smoke fixtures with a fixed seed and a
+ deterministic in-memory adapter (no real datasets, no LLM calls, no
+ network). Metrics are compared to the committed baseline at
+ `tests/fixtures/bench-smoke/baseline.json`; any drop greater than 5%
+ fails the job.
+
+ Regenerate the baseline after an intentional runner change:
+
+ ```bash
+ pnpm exec tsx scripts/bench/bench-smoke.ts --update-baseline
+ ```
+
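The gate described in the added section compares each metric to a committed baseline and fails on a drop greater than 5%. A minimal sketch of such a comparison follows. It is not the contents of `scripts/bench/bench-smoke.ts`: `findRegressions`, the flat `Metrics` shape, the choice to measure the drop relative to the baseline value, and treating a missing metric as a full drop are all assumptions, since the diff does not say whether the 5% is relative or absolute.

```ts
type Metrics = Record<string, number>;

interface Regression {
  metric: string;
  baseline: number;
  current: number;
  drop: number; // fraction of the baseline value that was lost
}

// Assumption: "drop" is relative to the baseline, so 0.80 -> 0.76 is a 5% drop.
function findRegressions(baseline: Metrics, current: Metrics, maxDrop = 0.05): Regression[] {
  const regressions: Regression[] = [];
  for (const [metric, base] of Object.entries(baseline)) {
    const cur = current[metric];
    if (cur === undefined) {
      // Assumption: a metric that disappeared counts as a full regression.
      regressions.push({ metric, baseline: base, current: Number.NaN, drop: 1 });
      continue;
    }
    const drop = base === 0 ? 0 : (base - cur) / base;
    if (drop > maxDrop) {
      regressions.push({ metric, baseline: base, current: cur, drop });
    }
  }
  return regressions;
}
```

A CI script built on this would exit non-zero when the returned list is non-empty, and an update-baseline mode would overwrite the baseline file with the current metrics instead of comparing.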
  ## Programmatic API
 
  ```ts
@@ -68,6 +170,7 @@ import {
    listBenchmarks,
    runBenchmark,
    writeBenchmarkResult,
+   writeBenchmarkReproManifest,
    createLightweightAdapter,
    createRemnicAdapter,
    compareResults,
@@ -0,0 +1,198 @@
+ {
+   "schemaVersion": 1,
+   "fixture": {
+     "path": null,
+     "scenarioCount": 20
+   },
+   "onScore": 0.75,
+   "offScore": 0.25,
+   "lift": 0.5,
+   "confidenceInterval": {
+     "lower": 0.3,
+     "upper": 0.75,
+     "level": 0.95
+   },
+   "perCase": [
+     {
+       "id": "rerun-deploy-gateway",
+       "prompt": "Let's deploy the gateway service to production now.",
+       "expectMatch": true,
+       "onMatched": true,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 0
+     },
+     {
+       "id": "rerun-open-pr",
+       "prompt": "Open a pull request for the regression fix against main.",
+       "expectMatch": true,
+       "onMatched": true,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 0
+     },
+     {
+       "id": "rerun-run-tests",
+       "prompt": "Run the test suite before we merge the release branch.",
+       "expectMatch": true,
+       "onMatched": true,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 0
+     },
+     {
+       "id": "rerun-rotate-credentials",
+       "prompt": "Rotate the staging database credentials right now.",
+       "expectMatch": true,
+       "onMatched": false,
+       "offMatched": false,
+       "onScore": 0,
+       "offScore": 0
+     },
+     {
+       "id": "rerun-ship-release",
+       "prompt": "We need to ship the v9 release tonight.",
+       "expectMatch": true,
+       "onMatched": true,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 0
+     },
+     {
+       "id": "paramvar-deploy-api",
+       "prompt": "Let's deploy the API service to production today.",
+       "expectMatch": true,
+       "onMatched": true,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 0
+     },
+     {
+       "id": "paramvar-rollback-ticket",
+       "prompt": "Roll back ticket PROJ-912 before the standup tomorrow.",
+       "expectMatch": true,
+       "onMatched": false,
+       "offMatched": false,
+       "onScore": 0,
+       "offScore": 0
+     },
+     {
+       "id": "paramvar-rotate-prod",
+       "prompt": "Rotate the production database credentials this morning.",
+       "expectMatch": true,
+       "onMatched": false,
+       "offMatched": false,
+       "onScore": 0,
+       "offScore": 0
+     },
+     {
+       "id": "paramvar-start-branch",
+       "prompt": "Starting work on the billing branch feature.",
+       "expectMatch": true,
+       "onMatched": true,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 0
+     },
+     {
+       "id": "paramvar-merge-pr-after-ci",
+       "prompt": "Merge the pull request after CI turns green.",
+       "expectMatch": true,
+       "onMatched": true,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 0
+     },
+     {
+       "id": "decomp-incident-response",
+       "prompt": "Run the incident response playbook for the gateway outage.",
+       "expectMatch": true,
+       "onMatched": false,
+       "offMatched": false,
+       "onScore": 0,
+       "offScore": 0
+     },
+     {
+       "id": "decomp-release-cut",
+       "prompt": "Cut the weekly release branch and start the release checklist.",
+       "expectMatch": true,
+       "onMatched": true,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 0
+     },
+     {
+       "id": "decomp-onboarding",
+       "prompt": "Start the onboarding workflow for a new engineer joining the team.",
+       "expectMatch": true,
+       "onMatched": true,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 0
+     },
+     {
+       "id": "decomp-data-migration",
+       "prompt": "Begin the schema migration plan for the billing table.",
+       "expectMatch": true,
+       "onMatched": false,
+       "offMatched": false,
+       "onScore": 0,
+       "offScore": 0
+     },
+     {
+       "id": "decomp-runbook-certificate",
+       "prompt": "Start the certificate renewal runbook for the public edge.",
+       "expectMatch": true,
+       "onMatched": true,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 0
+     },
+     {
+       "id": "distract-explain-question",
+       "prompt": "How does hybrid retrieval combine BM25 and vector search?",
+       "expectMatch": false,
+       "onMatched": false,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 1
+     },
+     {
+       "id": "distract-past-tense-recap",
+       "prompt": "What did we decide about the database rotation last week?",
+       "expectMatch": false,
+       "onMatched": false,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 1
+     },
+     {
+       "id": "distract-summary-request",
+       "prompt": "Summarize the timeline of the gateway outage for the report.",
+       "expectMatch": false,
+       "onMatched": false,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 1
+     },
+     {
+       "id": "distract-thanks",
+       "prompt": "Thanks, that summary helps a lot.",
+       "expectMatch": false,
+       "onMatched": false,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 1
+     },
+     {
+       "id": "distract-off-topic-task",
+       "prompt": "Let's grab coffee before the standup tomorrow.",
+       "expectMatch": false,
+       "onMatched": false,
+       "offMatched": false,
+       "onScore": 1,
+       "offScore": 1
+     }
+   ],
+   "generatedAt": "baseline-v1"
+ }
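The top-level `onScore`, `offScore`, and `lift` in the fixture above can be cross-checked against its `perCase` rows. The sketch below does that arithmetic under one assumption: a case scores 1 when its matched flag equals `expectMatch` and 0 otherwise, which agrees with every row in the file but is not stated anywhere in this diff. `CaseOutcome`, `score`, and `mean` are names invented for the example, and the `confidenceInterval` values are not reproduced because the diff does not show how they were computed.

```ts
interface CaseOutcome {
  expectMatch: boolean;
  onMatched: boolean;
  offMatched: boolean;
}

// onMatched flags for the 15 expectMatch=true cases, in file order.
const positiveOnMatched = [
  true, true, true, false, true, // rerun-*
  true, false, false, true, true, // paramvar-*
  false, true, true, false, true, // decomp-*
];

const cases: CaseOutcome[] = [
  // offMatched is false for every case in the fixture.
  ...positiveOnMatched.map((onMatched) => ({ expectMatch: true, onMatched, offMatched: false })),
  // Five distract-* cases: no match expected, none produced in either arm.
  ...Array.from({ length: 5 }, () => ({ expectMatch: false, onMatched: false, offMatched: false })),
];

// Assumed scoring rule: 1 when the outcome equals the expectation, else 0.
const score = (matched: boolean, expected: boolean): number => (matched === expected ? 1 : 0);
const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

const onScore = mean(cases.map((c) => score(c.onMatched, c.expectMatch)));
const offScore = mean(cases.map((c) => score(c.offMatched, c.expectMatch)));
const lift = onScore - offScore;

console.log({ onScore, offScore, lift }); // { onScore: 0.75, offScore: 0.25, lift: 0.5 }
```

With the feature on, 10 of the 15 positive cases plus all 5 distractors score 1 (15/20). With it off, only the 5 distractors do (5/20). The 0.5 lift therefore comes entirely from the positive cases.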