evalsense 0.3.1 → 0.4.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +309 -149
- package/dist/{chunk-BE7CB3AM.cjs → chunk-4BKZPVY4.cjs} +50 -13
- package/dist/chunk-4BKZPVY4.cjs.map +1 -0
- package/dist/{chunk-K6QPJ2NO.js → chunk-IUVDDMJ3.js} +50 -13
- package/dist/chunk-IUVDDMJ3.js.map +1 -0
- package/dist/chunk-NCCQRZ2Y.cjs +1141 -0
- package/dist/chunk-NCCQRZ2Y.cjs.map +1 -0
- package/dist/chunk-TDGWDK2L.js +1108 -0
- package/dist/chunk-TDGWDK2L.js.map +1 -0
- package/dist/cli.cjs +11 -11
- package/dist/cli.js +1 -1
- package/dist/index-CATqAHNK.d.cts +416 -0
- package/dist/index-CoMpaW-K.d.ts +416 -0
- package/dist/index.cjs +507 -629
- package/dist/index.cjs.map +1 -1
- package/dist/index.d.cts +210 -161
- package/dist/index.d.ts +210 -161
- package/dist/index.js +455 -573
- package/dist/index.js.map +1 -1
- package/dist/metrics/index.cjs +103 -342
- package/dist/metrics/index.cjs.map +1 -1
- package/dist/metrics/index.d.cts +260 -31
- package/dist/metrics/index.d.ts +260 -31
- package/dist/metrics/index.js +24 -312
- package/dist/metrics/index.js.map +1 -1
- package/dist/metrics/opinionated/index.cjs +5 -5
- package/dist/metrics/opinionated/index.d.cts +2 -163
- package/dist/metrics/opinionated/index.d.ts +2 -163
- package/dist/metrics/opinionated/index.js +1 -1
- package/dist/{types-C71p0wzM.d.cts → types-D0hzfyKm.d.cts} +1 -13
- package/dist/{types-C71p0wzM.d.ts → types-D0hzfyKm.d.ts} +1 -13
- package/package.json +1 -1
- package/dist/chunk-BE7CB3AM.cjs.map +0 -1
- package/dist/chunk-K6QPJ2NO.js.map +0 -1
- package/dist/chunk-RZFLCWTW.cjs +0 -942
- package/dist/chunk-RZFLCWTW.cjs.map +0 -1
- package/dist/chunk-Z3U6AUWX.js +0 -925
- package/dist/chunk-Z3U6AUWX.js.map +0 -1
package/README.md
CHANGED
@@ -5,32 +5,183 @@
 [](https://www.npmjs.com/package/evalsense)
 [](https://opensource.org/licenses/Apache-2.0)
 
-
+# evalsense
+
+**evalsense is like Jest for testing code that uses LLMs.**
+
+It helps engineers answer one simple question:
+
+> **“Is my LLM-powered code good enough to ship?”**
+
+Instead of checking a few example responses, evalsense runs your code across many inputs, measures overall quality, and gives you a clear **pass / fail** result — locally or in CI.
+
+evalsense is built for **engineers deploying LLM-enabled features**, not for training or benchmarking models.
+
+## What problem does evalsense solve?
+
+Most LLM evaluation tools focus on individual outputs:
+
+> _“How good is this one response?”_
+
+That’s useful, but it doesn’t tell you whether your system is reliable.
+
+evalsense answers a different question:
+
+> **“Does my code consistently meet our quality bar?”**
+
+It treats evaluation like testing:
+
+- run your code many times
+- measure results across all runs
+- fail fast if quality drops
+
+## How evalsense works (in plain terms)
+
+At a high level, evalsense:
+
+1. Runs your code
+   (this can be a function, module, API call, or a fixed dataset)
+2. Collects the results
+3. Scores them using:
+   - standard metrics (accuracy, precision, recall, F1)
+   - LLM-as-judge checks (e.g. relevance, hallucination, correctness)
+
+4. Aggregates scores across all results
+5. Applies rules you define
+6. Passes or fails the test
+
+Think of it as **unit tests for output quality**.
+
+## A quick example
+
+```ts
+describe("test answer quality", async () => {
+  evalTest("toxicity detection", async () => {
+    const answers = await generateAnswersDataset(testQuestions);
+    const toxicityScore = await toxicity(answers);
+
+    expectStats(toxicityScore)
+      .field("score")
+      .percentageBelow(0.5).toBeAtLeast(0.5);
+  });
+
+  evalTest("correctness score", async () => {
+    const answers = await generateAnswersDataset(testQuestions);
+    const groundTruth = JSON.parse(readFileSync("truth-dataset.json", "utf-8"));
+
+    expectStats(answers, groundTruth)
+      .field("label")
+      .accuracy.toBeAtLeast(0.9)
+      .precision("positive").toBeAtLeast(0.7)
+      .recall("positive").toBeAtLeast(0.7)
+      .displayConfusionMatrix();
+  });
+});
+```
+
+Running the test:
+
+```markdown
+**test answer quality**
+
+✓ toxicity detection (1ms)
+  ✓ 50.0% of 'score' values are below or equal to 0.5 (expected >= 50.0%)
+    Expected: 50.0%
+    Actual: 50.0%
+
+✓ correctness score (1ms)
+  Field: label | Accuracy: 100.0% | F1: 100.0%
+    negative: P=100.0% R=100.0% F1=100.0% (n=5)
+    positive: P=100.0% R=100.0% F1=100.0% (n=5)
+
+  Confusion Matrix: label
+
+  Predicted →  correct  incorrect
+  Actual ↓
+  correct         5         0
+  incorrect       0         5
+
+  ✓ Accuracy 100.0% >= 90.0%
+    Expected: 90.0%
+    Actual: 100.0%
+  ✓ Precision for 'positive' 100.0% >= 70.0%
+    Expected: 70.0%
+    Actual: 100.0%
+  ✓ Recall for 'positive' 100.0% >= 70.0%
+    Expected: 70.0%
+    Actual: 100.0%
+  ✓ Confusion matrix recorded for field "label"
+```
+
+If the quality drops, the test fails — just like a normal test.
+
+## Two common ways to use evalsense
+
+### 1. When you **don’t have ground truth**
+
+Use this when there are no labels.
+
+Example:
+
+- Run your LLM-powered function
+- Score outputs using an LLM-as-judge (relevance, hallucination, etc.)
+- Define what “acceptable” means
+- Fail if quality degrades
+
+**Example rule:**
+
+> “Average relevance score must be at least 0.75”
 
-
-> **New in v0.2.x:** Built-in adapters for OpenAI, Anthropic, and OpenRouter - no boilerplate needed!
-> **New in v0.2.0:** LLM-powered metrics for hallucination, relevance, faithfulness, and toxicity detection. [See migration guide](./docs/migration-v0.2.md).
+### 2. When you **do have ground truth**
 
-
+Use this when correct answers are known.
 
-
+Example:
 
-
-
-
-
-
+- Run your prediction code
+- Compare outputs with ground truth
+- Compute accuracy, precision, recall, F1
+- Optionally add LLM-as-judge checks
+- Fail if metrics fall below thresholds
 
-
+**Example rule:**
 
-
-
-
-
-
-
-
-
+> “F1 score must be ≥ 0.85 and false positives ≤ 5%”
+
+## What evalsense is _not_
+
+evalsense is **not**:
+
+- A tool for scoring single responses in isolation
+- A dashboard or experiment-tracking platform
+- A system for analyzing agent step-by-step traces
+- A model benchmarking or training framework
+
+If you mainly want scores, charts, or leaderboards, other tools may be a better fit.
+
+## Who should use evalsense
+
+evalsense is a good fit if you:
+
+- are **shipping LLM-powered features**
+- want **clear pass/fail quality gates**
+- run checks in **CI/CD**
+- care about **regressions** (“did this get worse?”)
+- already think in terms of tests
+- work in **JavaScript / TypeScript**
+
+## Who should _not_ use evalsense
+
+evalsense may not be right for you if you:
+
+- only care about individual output scores
+- want visual dashboards or experiment UIs
+- need deep agent trace inspection
+- are training or benchmarking foundation models
+
+## In one sentence
+
+**evalsense lets you test the quality of LLM-powered code the same way you test everything else — with clear pass/fail results.**
 
 ## Installation
 
@@ -49,35 +200,35 @@ yarn add -D evalsense
 Create a file named `sentiment.eval.js`:
 
 ```javascript
-import { describe, evalTest, expectStats
+import { describe, evalTest, expectStats } from "evalsense";
+import { readFileSync } from "fs";
 
 // Your model function - can be any JS function
-function classifySentiment(
-const
-const hasPositive = /love|amazing|great|fantastic|perfect/.test(
-const hasNegative = /terrible|worst|disappointed|waste/.test(
-
-return {
-id: record.id,
-sentiment: hasPositive && !hasNegative ? "positive" : "negative",
-};
+function classifySentiment(text) {
+  const lower = text.toLowerCase();
+  const hasPositive = /love|amazing|great|fantastic|perfect/.test(lower);
+  const hasNegative = /terrible|worst|disappointed|waste/.test(lower);
+  return hasPositive && !hasNegative ? "positive" : "negative";
 }
 
 describe("Sentiment classifier", () => {
   evalTest("accuracy above 80%", async () => {
-// 1. Load
-const
+    // 1. Load ground truth data
+    const groundTruth = JSON.parse(readFileSync("./sentiment.json", "utf-8"));
 
-// 2. Run your model
-const
+    // 2. Run your model and collect predictions
+    const predictions = groundTruth.map((record) => ({
+      id: record.id,
+      sentiment: classifySentiment(record.text),
+    }));
 
     // 3. Assert on statistical properties
-expectStats(
+    expectStats(predictions, groundTruth)
       .field("sentiment")
-.
-.
-.
-.
+      .accuracy.toBeAtLeast(0.8)
+      .recall("positive").toBeAtLeast(0.7)
+      .precision("positive").toBeAtLeast(0.7)
+      .displayConfusionMatrix();
   });
 });
 ```
@@ -103,23 +254,24 @@ npx evalsense run sentiment.eval.js
 ### Basic Classification Example
 
 ```javascript
-import { describe, evalTest, expectStats
+import { describe, evalTest, expectStats } from "evalsense";
+import { readFileSync } from "fs";
 
 describe("Spam classifier", () => {
   evalTest("high precision and recall", async () => {
-const
+    const groundTruth = JSON.parse(readFileSync("./emails.json", "utf-8"));
 
-const
+    const predictions = groundTruth.map((record) => ({
       id: record.id,
       isSpam: classifyEmail(record.text),
     }));
 
-expectStats(
+    expectStats(predictions, groundTruth)
       .field("isSpam")
-.
-.
-.
-.
+      .accuracy.toBeAtLeast(0.9)
+      .precision(true).toBeAtLeast(0.85) // Precision for spam=true
+      .recall(true).toBeAtLeast(0.85)    // Recall for spam=true
+      .displayConfusionMatrix();
   });
 });
 ```
@@ -127,25 +279,26 @@ describe("Spam classifier", () => {
 ### Continuous Scores with Binarization
 
 ```javascript
-import { describe, evalTest, expectStats
+import { describe, evalTest, expectStats } from "evalsense";
+import { readFileSync } from "fs";
 
 describe("Hallucination detector", () => {
   evalTest("detect hallucinations with 70% recall", async () => {
-const
+    const groundTruth = JSON.parse(readFileSync("./outputs.json", "utf-8"));
 
-// Your model returns a continuous score
-const
+    // Your model returns a continuous score (0.0 to 1.0)
+    const predictions = groundTruth.map((record) => ({
       id: record.id,
-hallucinated: computeHallucinationScore(record.output),
+      hallucinated: computeHallucinationScore(record.output),
     }));
 
     // Binarize the score at threshold 0.3
-expectStats(
+    expectStats(predictions, groundTruth)
       .field("hallucinated")
       .binarize(0.3) // >= 0.3 means hallucinated
-.
-.
-.
+      .recall(true).toBeAtLeast(0.7)
+      .precision(true).toBeAtLeast(0.6)
+      .displayConfusionMatrix();
  });
 });
 ```
@@ -153,50 +306,62 @@ describe("Hallucination detector", () => {
 ### Multi-class Classification
 
 ```javascript
-import { describe, evalTest, expectStats
+import { describe, evalTest, expectStats } from "evalsense";
+import { readFileSync } from "fs";
 
 describe("Intent classifier", () => {
   evalTest("balanced performance across intents", async () => {
-const
+    const groundTruth = JSON.parse(readFileSync("./intents.json", "utf-8"));
 
-const
+    const predictions = groundTruth.map((record) => ({
       id: record.id,
       intent: classifyIntent(record.query),
     }));
 
-expectStats(
+    expectStats(predictions, groundTruth)
       .field("intent")
-.
-.
-.
-.
-.
+      .accuracy.toBeAtLeast(0.85)
+      .recall("purchase").toBeAtLeast(0.8)
+      .recall("support").toBeAtLeast(0.8)
+      .recall("general").toBeAtLeast(0.7)
+      .displayConfusionMatrix();
   });
 });
 ```
 
-### Parallel Model Execution
+### Parallel Model Execution with LLMs
 
-For LLM calls or slow operations, use
+For LLM calls or slow operations, use `Promise.all` with chunking for concurrency control:
 
 ```javascript
-import { describe, evalTest, expectStats
+import { describe, evalTest, expectStats } from "evalsense";
+import { readFileSync } from "fs";
+
+// Helper for parallel execution with concurrency limit
+async function mapConcurrent(items, fn, concurrency = 5) {
+  const results = [];
+  for (let i = 0; i < items.length; i += concurrency) {
+    const chunk = items.slice(i, i + concurrency);
+    results.push(...(await Promise.all(chunk.map(fn))));
+  }
+  return results;
+}
 
 describe("LLM classifier", () => {
   evalTest("classification accuracy", async () => {
-const
+    const groundTruth = JSON.parse(readFileSync("./data.json", "utf-8"));
 
     // Run with concurrency=5
-const
-
+    const predictions = await mapConcurrent(
+      groundTruth,
       async (record) => {
         const response = await callLLM(record.text);
         return { id: record.id, category: response.category };
       },
-5
+      5
     );
 
-expectStats(
+    expectStats(predictions, groundTruth).field("category").accuracy.toBeAtLeast(0.9);
   });
 });
 ```
@@ -302,59 +467,55 @@ evalTest("should have 90% accuracy", async () => {
 });
 ```
 
-### Dataset
-
-#### `loadDataset(path)`
-
-Loads a dataset from a JSON file. Records must have an `id` or `_id` field.
-
-```javascript
-const dataset = loadDataset("./data.json");
-```
-
-#### `runModel(dataset, modelFn)`
+### Dataset Loading
 
-
+evalsense doesn't dictate how you load data or run your model. Use standard Node.js tools:
 
 ```javascript
-
-id: record.id,
-prediction: classify(record.text),
-}));
-```
+import { readFileSync } from "fs";
 
-
+// Load ground truth
+const groundTruth = JSON.parse(readFileSync("./data.json", "utf-8"));
 
-
+// Run your model however you want
+const predictions = groundTruth.map(runYourModel);
 
-
-const
+// Or use async operations
+const predictions = await Promise.all(
+  groundTruth.map(async (item) => {
+    const result = await callLLM(item.text);
+    return { id: item.id, prediction: result };
+  })
+);
 ```
 
 ### Assertions
 
-#### `expectStats(
+#### `expectStats(predictions, groundTruth)`
 
-Creates a statistical assertion chain from
+Creates a statistical assertion chain from predictions and ground truth. Aligns by `id` field.
 
 ```javascript
-expectStats(
+expectStats(predictions, groundTruth)
+  .field("prediction")
+  .accuracy.toBeAtLeast(0.8)
+  .f1.toBeAtLeast(0.75)
+  .displayConfusionMatrix();
 ```
 
-
-
-Two-argument form for judge validation. Aligns predictions with ground truth by `id` field.
+**One-argument form (distribution assertions only):**
 
 ```javascript
-//
-expectStats(
+// For distribution monitoring without ground truth
+expectStats(predictions).field("confidence").percentageAbove(0.7).toBeAtLeast(0.8);
 ```
 
-**
+**Common use cases:**
 
+- Classification evaluation with ground truth
+- Regression evaluation (MAE, RMSE, R²)
 - Validating LLM judges against human labels
-
-- Testing automated detection systems
+- Distribution monitoring without ground truth
 
 ### Field Selection
 
@@ -374,7 +535,7 @@ Converts continuous scores to binary (>=threshold is true).
 expectStats(result)
   .field("score")
   .binarize(0.5) // score >= 0.5 is true
-.
+  .accuracy.toBeAtLeast(0.8);
 ```
 
 ### Available Assertions
@@ -382,43 +543,61 @@ expectStats(result)
 #### Classification Metrics
 
 ```javascript
-// Accuracy
-.
-.
-.
+// Accuracy (macro average for multi-class)
+.accuracy.toBeAtLeast(threshold)
+.accuracy.toBeAbove(threshold)
+.accuracy.toBeAtMost(threshold)
+.accuracy.toBeBelow(threshold)
+
+// Precision (per class or macro average)
+.precision("className").toBeAtLeast(threshold)
+.precision().toBeAtLeast(threshold) // macro average
 
-//
-.
-.
+// Recall (per class or macro average)
+.recall("className").toBeAtLeast(threshold)
+.recall().toBeAtLeast(threshold) // macro average
 
-//
-.
-.
+// F1 Score (macro average)
+.f1.toBeAtLeast(threshold)
+.f1.toBeAbove(threshold)
 
-//
-.
-.
+// Regression Metrics
+.mae.toBeAtMost(threshold)  // Mean Absolute Error
+.rmse.toBeAtMost(threshold) // Root Mean Squared Error
+.r2.toBeAtLeast(threshold)  // R² coefficient
 
 // Confusion Matrix
-.
+.displayConfusionMatrix() // Displays confusion matrix (not an assertion)
+```
+
+#### Available Matchers
+
+All metrics return a matcher object with these comparison methods:
+
+```javascript
+.toBeAtLeast(x)         // >= x
+.toBeAbove(x)           // > x
+.toBeAtMost(x)          // <= x
+.toBeBelow(x)           // < x
+.toEqual(x, tolerance?) // === x (with optional tolerance for floats)
 ```
 
-#### Distribution Assertions
+#### Distribution Assertions
 
 Distribution assertions validate output distributions **without requiring ground truth**. Use these to monitor that model outputs stay within expected ranges.
 
 ```javascript
 // Assert that at least 80% of confidence scores are above 0.7
-expectStats(predictions).field("confidence").
+expectStats(predictions).field("confidence").percentageAbove(0.7).toBeAtLeast(0.8);
 
 // Assert that at least 90% of toxicity scores are below 0.3
-expectStats(predictions).field("toxicity").
+expectStats(predictions).field("toxicity").percentageBelow(0.3).toBeAtLeast(0.9);
 
 // Chain multiple distribution assertions
 expectStats(predictions)
   .field("score")
-.
-.
+  .percentageAbove(0.5).toBeAtLeast(0.6)  // At least 60% above 0.5
+  .percentageBelow(0.9).toBeAtLeast(0.8); // At least 80% below 0.9
 ```
 
 **Use cases:**
@@ -430,7 +609,7 @@ expectStats(predictions)
 
 See [Distribution Assertions Example](./examples/distribution-assertions.eval.js) for complete examples.
 
-### Judge Validation
+### Judge Validation
 
 Validate judge outputs against human-labeled ground truth using the **two-argument expectStats API**:
 
@@ -452,9 +631,9 @@ const humanLabels = [
 // Validate judge performance
 expectStats(judgeOutputs, humanLabels)
   .field("hallucinated")
-.
-.
-.
+  .recall(true).toBeAtLeast(0.9)    // Don't miss hallucinations
+  .precision(true).toBeAtLeast(0.7) // Some false positives OK
+  .displayConfusionMatrix();
 ```
 
 **Use cases:**
@@ -467,7 +646,7 @@ expectStats(judgeOutputs, humanLabels)
 **Two-argument expectStats:**
 
 ```javascript
-expectStats(actual, expected).field("fieldName").
+expectStats(actual, expected).field("fieldName").accuracy.toBeAtLeast(0.8);
 ```
 
 The first argument is your predictions (judge outputs), the second is ground truth (human labels). Both must have matching `id` fields for alignment.
@@ -671,25 +850,6 @@ setLLMClient({
 - [Migration Guide](./docs/migration-v0.2.md) - Upgrade from v0.1.x
 - [Examples](./examples/) - Working code examples
 
-## Philosophy
-
-evalsense is built on the principle that **metrics are predictions, not facts**.
-
-Instead of treating LLM-as-judge metrics (relevance, hallucination, etc.) as ground truth, evalsense:
-
-- Treats them as **weak labels** from a model
-- Validates them statistically against human references when available
-- Computes confusion matrices to reveal bias and systematic errors
-- Focuses on dataset-level distributions, not individual examples
-
 ## Contributing
 
 Contributions are welcome! Please see [CLAUDE.md](./CLAUDE.md) for development guidelines.
-
-## License
-
-MIT © Mohit Joshi
-
----
-
-**Made with ❤️ for the JS/Node.js AI community**