evalsense 0.3.2 → 0.4.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +89 -627
- package/dist/chunk-IZAC4S4T.js +1108 -0
- package/dist/chunk-IZAC4S4T.js.map +1 -0
- package/dist/{chunk-IYLSY7NX.js → chunk-RRTJDD4M.js} +13 -6
- package/dist/chunk-RRTJDD4M.js.map +1 -0
- package/dist/{chunk-BFGA2NUB.cjs → chunk-SYEKZ327.cjs} +13 -6
- package/dist/chunk-SYEKZ327.cjs.map +1 -0
- package/dist/chunk-UH6L7A5Y.cjs +1141 -0
- package/dist/chunk-UH6L7A5Y.cjs.map +1 -0
- package/dist/cli.cjs +11 -11
- package/dist/cli.js +1 -1
- package/dist/index-7Qog3wxS.d.ts +417 -0
- package/dist/index-ezghUO7Q.d.cts +417 -0
- package/dist/index.cjs +507 -580
- package/dist/index.cjs.map +1 -1
- package/dist/index.d.cts +210 -161
- package/dist/index.d.ts +210 -161
- package/dist/index.js +455 -524
- package/dist/index.js.map +1 -1
- package/dist/metrics/index.cjs +103 -342
- package/dist/metrics/index.cjs.map +1 -1
- package/dist/metrics/index.d.cts +260 -31
- package/dist/metrics/index.d.ts +260 -31
- package/dist/metrics/index.js +24 -312
- package/dist/metrics/index.js.map +1 -1
- package/dist/metrics/opinionated/index.cjs +5 -5
- package/dist/metrics/opinionated/index.d.cts +2 -163
- package/dist/metrics/opinionated/index.d.ts +2 -163
- package/dist/metrics/opinionated/index.js +1 -1
- package/dist/{types-C71p0wzM.d.cts → types-D0hzfyKm.d.cts} +1 -13
- package/dist/{types-C71p0wzM.d.ts → types-D0hzfyKm.d.ts} +1 -13
- package/package.json +1 -1
- package/dist/chunk-BFGA2NUB.cjs.map +0 -1
- package/dist/chunk-IYLSY7NX.js.map +0 -1
- package/dist/chunk-RZFLCWTW.cjs +0 -942
- package/dist/chunk-RZFLCWTW.cjs.map +0 -1
- package/dist/chunk-Z3U6AUWX.js +0 -925
- package/dist/chunk-Z3U6AUWX.js.map +0 -1
package/README.md
CHANGED
````diff
@@ -1,718 +1,180 @@
-
-
-> JS-native LLM evaluation framework with Jest-like API and statistical assertions
+[](https://www.evalsense.com)
 
 [](https://www.npmjs.com/package/evalsense)
+[](https://github.com/evalsense/evalsense/actions/workflows/ci.yml)
 [](https://opensource.org/licenses/Apache-2.0)
 
-**
-
-> **New in v0.3.2:** Enhanced assertion reporting - all assertions (passed and failed) now display expected vs actual values, and chained assertions evaluate completely instead of short-circuiting on first failure!
-> **New in v0.3.0:** Regression assertions (MAE, RMSE, R²) and flexible ID matching for custom identifier fields! [See migration guide](./docs/migration-v0.3.0.md).
-> **New in v0.2.x:** Built-in adapters for OpenAI, Anthropic, and OpenRouter - no boilerplate needed!
-> **New in v0.2.0:** LLM-powered metrics for hallucination, relevance, faithfulness, and toxicity detection. [See migration guide](./docs/migration-v0.2.md).
-
-## Why evalsense?
-
-Most LLM evaluation tools stop at producing scores (accuracy, relevance, hallucination). evalsense goes further by:
+> **Jest for LLM Evaluation.** Pass/fail quality gates for your LLM-powered code.
 
-
-- ✅ Analyzing **false positives vs false negatives** across datasets
-- ✅ Treating **metrics as predictions, not truth** (and validating them statistically)
-- ✅ Providing a **Jest-like API** that fits naturally into JS/Node workflows
-- ✅ Supporting **deterministic CI/CD** integration with specific exit codes
-
-## Features
-
-- 📊 **Dataset-level evaluation** - evaluate distributions, not single examples
-- 🎯 **Statistical rigor** - confusion matrices, precision/recall, F1, regression metrics
-- 🧪 **Jest-like API** - familiar `describe()` and test patterns
-- 🤖 **LLM-powered metrics** - hallucination, relevance, faithfulness, toxicity with explainable reasoning
-- ⚡ **Dual evaluation modes** - choose between accuracy (per-row) or cost efficiency (batch)
-- 🔄 **CI-friendly** - deterministic execution, machine-readable reports
-- 🚀 **JS-native** - first-class TypeScript support, works with any Node.js LLM library
-- 🔌 **Composable** - evaluate outputs from your existing LLM code
-
-## Installation
+evalsense runs your code across many inputs, measures quality statistically, and gives you a clear **pass / fail** result — locally or in CI.
 
 ```bash
 npm install --save-dev evalsense
 ```
 
-Or with yarn:
-
-```bash
-yarn add -D evalsense
-```
-
 ## Quick Start
 
-Create
+Create `sentiment.eval.js`:
 
 ```javascript
 import { describe, evalTest, expectStats } from "evalsense";
 import { readFileSync } from "fs";
 
-// Your model function - can be any JS function
 function classifySentiment(text) {
-
-  const hasPositive = /love|amazing|great|fantastic|perfect/.test(lower);
-  const hasNegative = /terrible|worst|disappointed|waste/.test(lower);
-  return hasPositive && !hasNegative ? "positive" : "negative";
+  return /love|great|amazing/.test(text.toLowerCase()) ? "positive" : "negative";
 }
 
 describe("Sentiment classifier", () => {
   evalTest("accuracy above 80%", async () => {
-    // 1. Load ground truth data
     const groundTruth = JSON.parse(readFileSync("./sentiment.json", "utf-8"));
 
-    // 2. Run your model and collect predictions
     const predictions = groundTruth.map((record) => ({
       id: record.id,
       sentiment: classifySentiment(record.text),
     }));
 
-    // 3. Assert on statistical properties
     expectStats(predictions, groundTruth)
       .field("sentiment")
-      .
-      .
-      .
-      .
+      .accuracy.toBeAtLeast(0.8)
+      .recall("positive")
+      .toBeAtLeast(0.7)
+      .precision("positive")
+      .toBeAtLeast(0.7)
+      .displayConfusionMatrix();
   });
 });
 ```
 
-
-
-```json
-[
-  { "id": "1", "text": "I love this product!", "sentiment": "positive" },
-  { "id": "2", "text": "Terrible experience.", "sentiment": "negative" },
-  { "id": "3", "text": "Great quality!", "sentiment": "positive" }
-]
-```
-
-Run the evaluation:
+Run it:
 
 ```bash
 npx evalsense run sentiment.eval.js
 ```
 
````
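The new `expectStats` chain in the Quick Start hunk asserts on dataset-level statistics. As a minimal sketch of what those assertions check, here are the standard definitions of accuracy and per-class precision/recall over `id`-aligned rows, in plain Node; the data is hypothetical and this is not evalsense's internal implementation:

```javascript
// Hypothetical ground truth and predictions, aligned by id as expectStats
// does with its two-argument form.
const groundTruth = [
  { id: "1", sentiment: "positive" },
  { id: "2", sentiment: "negative" },
  { id: "3", sentiment: "positive" },
  { id: "4", sentiment: "negative" },
];
const predictions = [
  { id: "1", sentiment: "positive" },
  { id: "2", sentiment: "positive" }, // wrong: a false positive for "positive"
  { id: "3", sentiment: "positive" },
  { id: "4", sentiment: "negative" },
];

// Pair each predicted label with its ground-truth label via id lookup.
const byId = new Map(groundTruth.map((r) => [r.id, r.sentiment]));
const pairs = predictions.map((p) => [p.sentiment, byId.get(p.id)]);

// Accuracy: fraction of rows where prediction matches truth.
const accuracy = pairs.filter(([p, t]) => p === t).length / pairs.length;

// Precision for a class: of the rows predicted as that class, how many are right.
const precision = (cls) => {
  const predicted = pairs.filter(([p]) => p === cls);
  return predicted.filter(([p, t]) => p === t).length / predicted.length;
};

// Recall for a class: of the rows truly in that class, how many were found.
const recall = (cls) => {
  const actual = pairs.filter(([, t]) => t === cls);
  return actual.filter(([p, t]) => p === t).length / actual.length;
};

console.log(accuracy);              // 0.75 (3 of 4 correct)
console.log(precision("positive")); // 2 of 3 predicted positives ≈ 0.667
console.log(recall("positive"));    // both actual positives found = 1
```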
````diff
-
-
-### Basic Classification Example
-
-```javascript
-import { describe, evalTest, expectStats } from "evalsense";
-import { readFileSync } from "fs";
-
-describe("Spam classifier", () => {
-  evalTest("high precision and recall", async () => {
-    const groundTruth = JSON.parse(readFileSync("./emails.json", "utf-8"));
-
-    const predictions = groundTruth.map((record) => ({
-      id: record.id,
-      isSpam: classifyEmail(record.text),
-    }));
+Output:
 
-    expectStats(predictions, groundTruth)
-      .field("isSpam")
-      .toHaveAccuracyAbove(0.9)
-      .toHavePrecisionAbove(true, 0.85) // Precision for spam=true
-      .toHaveRecallAbove(true, 0.85) // Recall for spam=true
-      .toHaveConfusionMatrix();
-  });
-});
 ```
+EvalSense v0.4.1
+Running 1 eval file(s)...
 
-
+Sentiment classifier
 
-
-
-
+  ✓ accuracy above 80% (12ms)
+    Field: sentiment | Accuracy: 90.0% | F1: 89.5%
+    negative: P=88.0% R=92.0% F1=90.0% (n=25)
+    positive: P=91.0% R=87.0% F1=89.0% (n=25)
+    ✓ Accuracy 90.0% >= 80.0%
+    ✓ Recall for 'positive' 87.0% >= 70.0%
+    ✓ Precision for 'positive' 91.0% >= 70.0%
 
-
-  evalTest("detect hallucinations with 70% recall", async () => {
-    const groundTruth = JSON.parse(readFileSync("./outputs.json", "utf-8"));
+Summary
 
-
-
-      id: record.id,
-      hallucinated: computeHallucinationScore(record.output),
-    }));
+Tests: 1 passed, 0 failed, 0 errors, 0 skipped
+Duration: 12ms
 
-
-    expectStats(predictions, groundTruth)
-      .field("hallucinated")
-      .binarize(0.3) // >= 0.3 means hallucinated
-      .toHaveRecallAbove(true, 0.7)
-      .toHavePrecisionAbove(true, 0.6)
-      .toHaveConfusionMatrix();
-  });
-});
+All tests passed!
 ```
 
-
-
-```javascript
-import { describe, evalTest, expectStats } from "evalsense";
-import { readFileSync } from "fs";
-
-describe("Intent classifier", () => {
-  evalTest("balanced performance across intents", async () => {
-    const groundTruth = JSON.parse(readFileSync("./intents.json", "utf-8"));
+## Key Features
 
-
-
-
-
+- **Jest-like API** — `describe`, `evalTest`, `expectStats` feel familiar
+- **Statistical assertions** — accuracy, precision, recall, F1, MAE, RMSE, R²
+- **Confusion matrices** — built-in display with `.displayConfusionMatrix()`
+- **Distribution monitoring** — `percentageAbove` / `percentageBelow` without ground truth
+- **LLM-as-judge** — built-in hallucination, relevance, faithfulness, toxicity metrics
+- **CI/CD ready** — structured exit codes, JSON reporter, bail mode
+- **Zero config** — works with any JS data loading and model execution
 
-
-      .field("intent")
-      .toHaveAccuracyAbove(0.85)
-      .toHaveRecallAbove("purchase", 0.8)
-      .toHaveRecallAbove("support", 0.8)
-      .toHaveRecallAbove("general", 0.7)
-      .toHaveConfusionMatrix();
-  });
-});
-```
-
-### Parallel Model Execution with LLMs
-
-For LLM calls or slow operations, use `Promise.all` with chunking for concurrency control:
-
-```javascript
-import { describe, evalTest, expectStats } from "evalsense";
-import { readFileSync } from "fs";
-
-// Helper for parallel execution with concurrency limit
-async function mapConcurrent(items, fn, concurrency = 5) {
-  const results = [];
-  for (let i = 0; i < items.length; i += concurrency) {
-    const chunk = items.slice(i, i + concurrency);
-    results.push(...(await Promise.all(chunk.map(fn))));
-  }
-  return results;
-}
-
-describe("LLM classifier", () => {
-  evalTest("classification accuracy", async () => {
-    const groundTruth = JSON.parse(readFileSync("./data.json", "utf-8"));
-
-    // Run with concurrency=5
-    const predictions = await mapConcurrent(
-      groundTruth,
-      async (record) => {
-        const response = await callLLM(record.text);
-        return { id: record.id, category: response.category };
-      },
-      5
-    );
-
-    expectStats(predictions, groundTruth).field("category").toHaveAccuracyAbove(0.9);
-  });
-});
-```
+## Two Ways to Use It
 
-###
-
-```javascript
-import { describe, evalTest, beforeAll, afterAll, beforeEach, afterEach } from "evalsense";
-
-describe("Model evaluation", () => {
-  let model;
-
-  beforeAll(async () => {
-    // Load model once before all tests
-    model = await loadModel();
-  });
-
-  afterAll(async () => {
-    // Cleanup after all tests
-    await model.dispose();
-  });
-
-  beforeEach(() => {
-    // Reset state before each test
-    model.reset();
-  });
-
-  afterEach(() => {
-    // Cleanup after each test
-    console.log("Test completed");
-  });
-
-  evalTest("test 1", async () => {
-    // ...
-  });
-
-  evalTest("test 2", async () => {
-    // ...
-  });
-});
-```
-
-## CLI Usage
-
-### Run Evaluations
-
-```bash
-# Run all eval files in current directory
-npx evalsense run
-
-# Run specific file or directory
-npx evalsense run tests/eval/
-
-# Filter tests by name
-npx evalsense run --filter "accuracy"
-
-# Output JSON report
-npx evalsense run --output report.json
-
-# Use different reporters
-npx evalsense run --reporter console # default
-npx evalsense run --reporter json
-npx evalsense run --reporter both
-
-# Bail on first failure
-npx evalsense run --bail
-
-# Set timeout (in milliseconds)
-npx evalsense run --timeout 60000
-```
-
-### List Eval Files
-
-```bash
-# List all discovered eval files
-npx evalsense list
-
-# List files in specific directory
-npx evalsense list tests/
-```
-
-## API Reference
-
-### Core API
-
-#### `describe(name, fn)`
-
-Groups related evaluation tests (like Jest's describe).
-
-```javascript
-describe("My model", () => {
-  // eval tests go here
-});
-```
-
-#### `evalTest(name, fn)` / `test(name, fn)` / `it(name, fn)`
-
-Defines an evaluation test.
-
-```javascript
-evalTest("should have 90% accuracy", async () => {
-  // test implementation
-});
-```
-
-### Dataset Loading
-
-evalsense doesn't dictate how you load data or run your model. Use standard Node.js tools:
-
-```javascript
-import { readFileSync } from "fs";
-
-// Load ground truth
-const groundTruth = JSON.parse(readFileSync("./data.json", "utf-8"));
-
-// Run your model however you want
-const predictions = groundTruth.map(runYourModel);
-
-// Or use async operations
-const predictions = await Promise.all(
-  groundTruth.map(async (item) => {
-    const result = await callLLM(item.text);
-    return { id: item.id, prediction: result };
-  })
-);
-```
-
-**Helper functions available (optional):**
-
-- `loadDataset(path)` - Simple JSON file loader
-- `runModel(dataset, fn)` - Sequential model execution
-- `runModelParallel(dataset, fn, concurrency)` - Parallel execution with concurrency limit
-
-### Assertions
-
-#### `expectStats(predictions, groundTruth)`
-
-Creates a statistical assertion chain from predictions and ground truth. Aligns by `id` field.
+### With ground truth (classification / regression)
 
 ```javascript
 expectStats(predictions, groundTruth)
-  .field("
-  .
-  .
-  .
-
-
-**New in v0.3.2: Enhanced Assertion Reporting**
-
-- All assertions (passed and failed) now display expected vs actual values
-- Chained assertions evaluate completely instead of short-circuiting on first failure
-- See all metric results in a single run for better debugging
-
-**One-argument form (distribution assertions only):**
-
-```javascript
-// For distribution monitoring without ground truth
-expectStats(predictions).field("confidence").toHavePercentageAbove(0.7, 0.8);
-```
-
-**Common use cases:**
-
-- Classification evaluation with ground truth
-- Regression evaluation (MAE, RMSE, R²)
-- Validating LLM judges against human labels
-- Distribution monitoring without ground truth
-
-### Field Selection
-
-#### `.field(fieldName)`
-
-Selects a field for evaluation.
-
-```javascript
-expectStats(result).field("sentiment");
-```
-
-#### `.binarize(threshold)`
-
-Converts continuous scores to binary (>=threshold is true).
-
-```javascript
-expectStats(result)
-  .field("score")
-  .binarize(0.5) // score >= 0.5 is true
-  .toHaveAccuracyAbove(0.8);
-```
-
-### Available Assertions
-
-#### Classification Metrics
-
-```javascript
-// Accuracy
-.toHaveAccuracyAbove(threshold)
-.toHaveAccuracyBelow(threshold)
-.toHaveAccuracyBetween(min, max)
-
-// Precision (per class)
-.toHavePrecisionAbove(className, threshold)
-.toHavePrecisionBelow(className, threshold)
-
-// Recall (per class)
-.toHaveRecallAbove(className, threshold)
-.toHaveRecallBelow(className, threshold)
-
-// F1 Score
-.toHaveF1Above(threshold) // Overall F1
-.toHaveF1Above(className, threshold) // Per-class F1
-
-// Confusion Matrix
-.toHaveConfusionMatrix() // Prints confusion matrix
-```
-
-#### Distribution Assertions (Pattern 1)
-
-Distribution assertions validate output distributions **without requiring ground truth**. Use these to monitor that model outputs stay within expected ranges.
-
-```javascript
-// Assert that at least 80% of confidence scores are above 0.7
-expectStats(predictions).field("confidence").toHavePercentageAbove(0.7, 0.8);
-
-// Assert that at least 90% of toxicity scores are below 0.3
-expectStats(predictions).field("toxicity").toHavePercentageBelow(0.3, 0.9);
-
-// Chain multiple distribution assertions
-expectStats(predictions)
-  .field("score")
-  .toHavePercentageAbove(0.5, 0.6) // At least 60% above 0.5
-  .toHavePercentageBelow(0.9, 0.8); // At least 80% below 0.9
-```
-
-**Use cases:**
-
-- Monitor confidence score distributions
-- Validate schema compliance rates
-- Check output range constraints
-- Ensure score distributions remain stable over time
-
-See [Distribution Assertions Example](./examples/distribution-assertions.eval.js) for complete examples.
-
-### Judge Validation (Pattern 1b)
-
-Validate judge outputs against human-labeled ground truth using the **two-argument expectStats API**:
-
-```javascript
-// Judge outputs (predictions from your judge/metric)
-const judgeOutputs = [
-  { id: "1", hallucinated: true },
-  { id: "2", hallucinated: false },
-  { id: "3", hallucinated: true },
-];
-
-// Human labels (ground truth)
-const humanLabels = [
-  { id: "1", hallucinated: true },
-  { id: "2", hallucinated: false },
-  { id: "3", hallucinated: false },
-];
-
-// Validate judge performance
-expectStats(judgeOutputs, humanLabels)
-  .field("hallucinated")
-  .toHaveRecallAbove(true, 0.9) // Don't miss hallucinations
-  .toHavePrecisionAbove(true, 0.7) // Some false positives OK
-  .toHaveConfusionMatrix();
+  .field("label")
+  .accuracy.toBeAtLeast(0.9)
+  .recall("positive")
+  .toBeAtLeast(0.8)
+  .f1.toBeAtLeast(0.85);
 ```
 
-
-
-- Evaluate LLM-as-judge accuracy
-- Validate heuristic metrics against human labels
-- Test automated detection systems (refusal, policy compliance)
-- Calibrate metric thresholds
-
-**Two-argument expectStats:**
+### Without ground truth (distribution monitoring)
 
 ```javascript
-expectStats(
-```
-
-The first argument is your predictions (judge outputs), the second is ground truth (human labels). Both must have matching `id` fields for alignment.
-
-See [Judge Validation Example](./examples/judge-validation.eval.js) for complete examples.
-
-For comprehensive guidance on evaluating agent systems, see [Agent Judges Design Patterns](./docs/agent-judges.md).
-
-## Dataset Format
-
-Datasets must be JSON arrays where each record has an `id` or `_id` field:
-
-```json
-[
-  {
-    "id": "1",
-    "text": "input text",
-    "label": "expected_output"
-  },
-  {
-    "id": "2",
-    "text": "another input",
-    "label": "another_output"
-  }
-]
-```
-
-**Requirements:**
-
-- Each record MUST have `id` or `_id` for alignment
-- Ground truth fields (e.g., `label`, `sentiment`, `category`) are compared against model outputs
-- Model functions must return predictions with matching `id`
-
-## Exit Codes
-
-evalsense returns specific exit codes for CI integration:
-
-- `0` - Success (all tests passed)
-- `1` - Assertion failure (statistical thresholds not met)
-- `2` - Integrity failure (dataset alignment issues)
-- `3` - Execution error (test threw exception)
-- `4` - Configuration error (invalid CLI options)
-
-## Writing Eval Files
-
-Eval files use the `.eval.js` or `.eval.ts` extension and are discovered automatically:
-
-```
-project/
-├── tests/
-│   ├── classifier.eval.js
-│   └── hallucination.eval.js
-├── data/
-│   └── dataset.json
-└── package.json
-```
-
-Run with:
-
-```bash
-npx evalsense run tests/
+expectStats(llmOutputs).field("toxicity_score").percentageBelow(0.3).toBeAtLeast(0.95); // 95% of outputs must be non-toxic
 ```
 
-##
-
-See the [`examples/`](./examples/) directory for complete examples:
-
-- [`classification.eval.js`](./examples/basic/classification.eval.js) - Binary sentiment classification
-- [`hallucination.eval.js`](./examples/basic/hallucination.eval.js) - Continuous score binarization
-- [`distribution-assertions.eval.js`](./examples/distribution-assertions.eval.js) - Distribution monitoring without ground truth
-- [`judge-validation.eval.js`](./examples/judge-validation.eval.js) - Validating judges against human labels
-
-## Field Types
-
-evalsense automatically determines evaluation metrics based on field values:
-
-- **Boolean** (`true`/`false`) → Binary classification metrics
-- **Categorical** (strings) → Multi-class classification metrics
-- **Numeric** (numbers) → Regression metrics (MAE, MSE, RMSE, R²)
-- **Numeric + threshold** → Binarized classification metrics
-
-## LLM-Based Metrics (v0.2.0+)
-
-evalsense includes LLM-powered metrics for hallucination detection, relevance assessment, faithfulness verification, and toxicity detection.
-
-### Quick Setup
+## LLM-Based Metrics
 
 ```javascript
-import { setLLMClient,
-import { hallucination, relevance
+import { setLLMClient, createAnthropicAdapter } from "evalsense/metrics";
+import { hallucination, relevance } from "evalsense/metrics/opinionated";
 
-// 1. Configure your LLM client (one-time setup)
 setLLMClient(
-
-  model: "
-  temperature: 0,
+  createAnthropicAdapter(process.env.ANTHROPIC_API_KEY, {
+    model: "claude-haiku-4-5-20251001",
   })
 );
 
-
-const results = await hallucination({
+const scores = await hallucination({
   outputs: [{ id: "1", output: "Paris has 50 million people." }],
   context: ["Paris has approximately 2.1 million residents."],
 });
-
-
-console.log(results[0].reasoning); // "Output claims 50M, context says 2.1M"
+// scores[0].score → 0.9 (high hallucination)
+// scores[0].reasoning → "Output claims 50M, context says 2.1M"
 ```
 
-
-
-- **`hallucination()`** - Detects claims not supported by context
-- **`relevance()`** - Measures query-response alignment
-- **`faithfulness()`** - Verifies outputs don't contradict sources
-- **`toxicity()`** - Identifies harmful or inappropriate content
-
-### Evaluation Modes
-
-Choose between accuracy and cost:
-
-```javascript
-// Per-row: Higher accuracy, higher cost (N API calls)
-await hallucination({
-  outputs,
-  context,
-  evaluationMode: "per-row", // default
-});
-
-// Batch: Lower cost, single API call
-await hallucination({
-  outputs,
-  context,
-  evaluationMode: "batch",
-});
-```
-
-### Built-in Provider Adapters
-
-evalsense includes ready-to-use adapters for popular LLM providers:
+Built-in providers: OpenAI, Anthropic, OpenRouter, or bring your own adapter.
+See [LLM Metrics Guide](./docs/llm-metrics.md) and [Adapters Guide](./docs/llm-adapters.md).
 
-
+## Using with Claude Code (Vibe Check)
 
-
-import { createOpenAIAdapter } from "evalsense/metrics";
-
-// npm install openai
-setLLMClient(
-  createOpenAIAdapter(process.env.OPENAI_API_KEY, {
-    model: "gpt-4-turbo-preview", // or "gpt-3.5-turbo" for lower cost
-    temperature: 0,
-    maxTokens: 4096,
-  })
-);
-```
+evalsense includes an example [Claude Code skill](./skill.md) that acts as an automated LLM quality gate. To set it up in your project:
 
-
-
-
-import { createAnthropicAdapter } from "evalsense/metrics";
+1. Install evalsense as a dev dependency
+2. Copy [`skill.md`](./skill.md) into your project at `.claude/skills/llm-quality-gate/SKILL.md`
+3. After building any LLM feature, run `/llm-quality-gate` in Claude Code
 
-
-setLLMClient(
-  createAnthropicAdapter(process.env.ANTHROPIC_API_KEY, {
-    model: "claude-3-5-sonnet-20241022", // or "claude-3-haiku-20240307" for speed
-    maxTokens: 4096,
-  })
-);
-```
+Claude will automatically create a `.eval.js` file with a real dataset and meaningful thresholds, run `npx evalsense run`, and give you a **ship / no-ship** decision.
 
-
+## Documentation
 
-
-
+| Guide | Description |
+| -------------------------------------------------- | ------------------------------------------------ |
+| [API Reference](./docs/api-reference.md) | Full API — all assertions, matchers, metrics |
+| [CLI Reference](./docs/cli.md) | All CLI flags, exit codes, CI integration |
+| [LLM Metrics](./docs/llm-metrics.md) | Hallucination, relevance, faithfulness, toxicity |
+| [LLM Adapters](./docs/llm-adapters.md) | OpenAI, Anthropic, OpenRouter, custom adapters |
+| [Custom Metrics](./docs/custom-metrics-guide.md) | Pattern and keyword metrics |
+| [Agent Judges](./docs/agent-judges.md) | Design patterns for evaluating agent systems |
+| [Regression Metrics](./docs/regression-metrics.md) | MAE, RMSE, R² usage |
+| [Examples](./examples/) | Working code examples |
 
-
-setLLMClient(
-  createOpenRouterAdapter(process.env.OPENROUTER_API_KEY, {
-    model: "anthropic/claude-3.5-sonnet", // or "openai/gpt-3.5-turbo", etc.
-    temperature: 0,
-    appName: "my-eval-system",
-  })
-);
-```
+## Dataset Format
 
-
+Records must have an `id` or `_id` field:
 
-```
-
-
-
-
-    return response.text;
-  },
-});
+```json
+[
+  { "id": "1", "text": "sample input", "label": "positive" },
+  { "id": "2", "text": "another input", "label": "negative" }
+]
 ```
 
-
-
-- [LLM Metrics Guide](./docs/llm-metrics.md) - Complete usage guide
-- [LLM Adapters Guide](./docs/llm-adapters.md) - Implement adapters for different providers
-- [Migration Guide](./docs/migration-v0.2.md) - Upgrade from v0.1.x
-- [Examples](./examples/) - Working code examples
-
-## Philosophy
-
-evalsense is built on the principle that **metrics are predictions, not facts**.
-
-Instead of treating LLM-as-judge metrics (relevance, hallucination, etc.) as ground truth, evalsense:
+## Exit Codes
 
-
-
-
-
+| Code | Meaning |
+| ---- | ------------------------- |
+| `0` | All tests passed |
+| `1` | Assertion failure |
+| `2` | Dataset integrity failure |
+| `3` | Execution error |
+| `4` | Configuration error |
 
 ## Contributing
 
-Contributions are welcome
+Contributions are welcome. See [CONTRIBUTING.md](./CONTRIBUTING.md) for setup, coding standards, and the PR process.
 
 ## License
 
-
-
----
-
-**Made with ❤️ for the JS/Node.js AI community**
+[Apache 2.0](./LICENSE)
````
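The README diff above repeatedly references regression assertions (MAE, RMSE, R²) carried over from v0.3.0. As a minimal sketch of their standard textbook formulas in plain Node (hypothetical numbers, not evalsense's implementation):

```javascript
// Hypothetical actual vs predicted values for a numeric field.
const actual = [3, -0.5, 2, 7];
const predicted = [2.5, 0.0, 2, 8];

const n = actual.length;
const mean = actual.reduce((a, b) => a + b, 0) / n;

// MAE: mean absolute error.
const mae =
  actual.reduce((sum, y, i) => sum + Math.abs(y - predicted[i]), 0) / n;

// RMSE: root mean squared error.
const rmse = Math.sqrt(
  actual.reduce((sum, y, i) => sum + (y - predicted[i]) ** 2, 0) / n
);

// R²: 1 minus (residual sum of squares / total sum of squares).
const r2 =
  1 -
  actual.reduce((s, y, i) => s + (y - predicted[i]) ** 2, 0) /
    actual.reduce((s, y) => s + (y - mean) ** 2, 0);

console.log(mae);  // 0.5
console.log(rmse); // ≈ 0.612
console.log(r2);   // ≈ 0.949
```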
|