@irisrun/evals 0.1.0 → 0.2.0
- package/README.md +27 -0
- package/dist/evals.js +1 -1
- package/dist/reproduce.js +1 -1
- package/package.json +3 -3
package/README.md
ADDED
@@ -0,0 +1,27 @@
+# @irisrun/evals
+
+**Reproducible evals — because determinism makes scoring repeatable.** Run the
+same scenario and it replays **byte-identically** from its journal, so a score is
+a fact about the run, not a roll of the dice. The arbiter is reproducibility, not
+taste: the same case + scorer re-runs identically; a swapped tactic scores
+differently, but reproducibly.
+
+## What it is
+
+`runEval` runs a deterministic scenario (a fresh store + scripted performers per
+run) on the core `runTurn`, then scores the recorded session via
+`@irisrun/inspect`; `runSuite` runs a set; `reproduce` re-runs a case to confirm
+byte-identical replay. This is faithful record-replay of captured effects — not a
+claim that a live model is deterministic. Depends on `@irisrun/core` +
+`@irisrun/inspect`.
+
+## Use it
+
+```sh
+iris eval ./evals/suite.mjs # reproducible scenario scoring
+```
+
+See **[docs/Audit & reproducible evals](../../docs/audit-and-evals.md)**.
+
+---
+Part of [Iris](../../README.md) — own, portable, verifiable state.
package/dist/evals.js
CHANGED
@@ -1,4 +1,4 @@
-// The reproducible-eval arbiter
+// The reproducible-eval arbiter: "reproducible evals are the arbiter,
 // not editorial taste." An EvalCase is a DETERMINISTIC scenario; a Scorer reads the
 // recorded session (via @irisrun/inspect) and the last turn outcome. runEval calls
 // case.build() on EVERY invocation, so it gets a FRESH store AND fresh performers
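
The header comment in `evals.js` says that `runEval` calls `case.build()` on every invocation so that each run gets a fresh store and fresh performers. The sketch below illustrates why that contract yields repeatable scores. It is not the `@irisrun/evals` API: `makeCase`, `runEvalSketch`, `okRatio`, and the journal entry shape are hypothetical names invented for this illustration, and an in-memory array stands in for the store.

```javascript
// Hypothetical shapes for illustration only; not the @irisrun/evals API.
function makeCase(script) {
  return {
    // build() returns brand-new state on every call: an empty journal
    // and a scripted performer whose cursor starts at 0.
    build() {
      const journal = [];
      let index = 0;
      const performer = () => (index < script.length ? script[index++] : null);
      return { journal, performer };
    },
  };
}

// Run one scenario: call build(), drain the scripted performer into the
// journal, then let the scorer read the recorded journal.
function runEvalSketch(evalCase, scorer) {
  const { journal, performer } = evalCase.build();
  for (let reply = performer(); reply !== null; reply = performer()) {
    journal.push({ turn: journal.length, reply });
  }
  return { score: scorer(journal), journal };
}

// Toy scorer: fraction of recorded replies that contain "ok".
const okRatio = (journal) =>
  journal.filter((entry) => entry.reply.includes("ok")).length / journal.length;

const demoCase = makeCase(["ok: fetched", "error: timeout", "ok: retried"]);
const first = runEvalSketch(demoCase, okRatio);
const second = runEvalSketch(demoCase, okRatio);

// Both runs start from fresh state, so score and journal match exactly.
console.log(first.score === second.score);
console.log(JSON.stringify(first.journal) === JSON.stringify(second.journal));
```

Because nothing survives between calls to `build()`, the second run cannot observe the first. If the performer cursor were shared across runs, the second run would record an empty journal and score differently.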
package/dist/reproduce.js
CHANGED
@@ -1,4 +1,4 @@
-// reproduce()
+// reproduce(): makes "reproducible evals" an EXPLICIT, provable
 // feature, not just an implicit property. It runs an EvalCase N independent times
 // (each `case.build()` is a fresh store + performers, index reset to 0 — the EvalCase
 // contract) and proves byte-identical {score, status, FULL-journal digest} across
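
The `reproduce.js` comment describes running a case N independent times and comparing `{score, status, FULL-journal digest}` across runs. A minimal sketch of that comparison follows, under stated assumptions: `reproduceSketch`, `fnv1a`, `runOnce`, and the state shape are hypothetical, and the FNV-1a-style hash over UTF-16 code units is only a dependency-free stand-in for whatever digest the package actually computes.

```javascript
// Stand-in digest (assumption): 32-bit FNV-1a-style hash over UTF-16 code units.
// A real implementation would more likely use a cryptographic hash.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Run the scenario `times` times, fingerprint each run, and compare.
function reproduceSketch(buildCase, runOnce, times = 3) {
  const fingerprints = [];
  for (let i = 0; i < times; i++) {
    const { score, status, journal } = runOnce(buildCase());
    const journalDigest = fnv1a(JSON.stringify(journal));
    fingerprints.push(JSON.stringify({ score, status, journalDigest }));
  }
  const identical = fingerprints.every((f) => f === fingerprints[0]);
  return { identical, runs: times, fingerprint: fingerprints[0] };
}

// Hypothetical runner: replays scripted replies into a journal.
const runOnce = (state) => {
  const journal = [];
  while (state.index < state.replies.length) {
    journal.push({ turn: state.index, reply: state.replies[state.index++] });
  }
  return { score: journal.length, status: "completed", journal };
};

// Honors the contract: every build is fresh, index reset to 0.
const freshBuild = () => ({ replies: ["plan", "act", "done"], index: 0 });
const stable = reproduceSketch(freshBuild, runOnce, 5);

// Violates the contract: the same state object leaks across builds.
const leaked = { replies: ["plan", "act", "done"], index: 0 };
const unstable = reproduceSketch(() => leaked, runOnce, 2);

console.log(stable.identical);   // true
console.log(unstable.identical); // false
```

The counter-example shows what the check is for: when state leaks between builds, the second run sees an exhausted script, records an empty journal, and its fingerprint no longer matches the first.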
package/package.json
CHANGED
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
{
|
|
2
2
|
"name": "@irisrun/evals",
|
|
3
|
-
"version": "0.
|
|
3
|
+
"version": "0.2.0",
|
|
4
4
|
"type": "module",
|
|
5
5
|
"description": "Iris reproducible-eval arbiter (spec 03 §7) — run a deterministic scenario (fresh store + scripted performers per run), then score the recorded session via @irisrun/inspect. The arbiter is reproducibility, not taste: the same case+scorer re-runs byte-identically; a swapped tactic scores differently but reproducibly. Runner is core runTurn; deps @irisrun/core + @irisrun/inspect.",
|
|
6
6
|
"exports": {
|
|
@@ -11,8 +11,8 @@
|
|
|
11
11
|
}
|
|
12
12
|
},
|
|
13
13
|
"dependencies": {
|
|
14
|
-
"@irisrun/core": "^0.
|
|
15
|
-
"@irisrun/inspect": "^0.
|
|
14
|
+
"@irisrun/core": "^0.2.0",
|
|
15
|
+
"@irisrun/inspect": "^0.2.0"
|
|
16
16
|
},
|
|
17
17
|
"license": "MIT",
|
|
18
18
|
"engines": {
|