evalsense 0.3.2 → 0.4.1

This diff compares publicly available package versions as released to a supported registry. It is provided for informational purposes only and reflects the changes between those versions as they appear in their respective public registries.
Files changed (38)
  1. package/README.md +89 -627
  2. package/dist/chunk-IZAC4S4T.js +1108 -0
  3. package/dist/chunk-IZAC4S4T.js.map +1 -0
  4. package/dist/{chunk-IYLSY7NX.js → chunk-RRTJDD4M.js} +13 -6
  5. package/dist/chunk-RRTJDD4M.js.map +1 -0
  6. package/dist/{chunk-BFGA2NUB.cjs → chunk-SYEKZ327.cjs} +13 -6
  7. package/dist/chunk-SYEKZ327.cjs.map +1 -0
  8. package/dist/chunk-UH6L7A5Y.cjs +1141 -0
  9. package/dist/chunk-UH6L7A5Y.cjs.map +1 -0
  10. package/dist/cli.cjs +11 -11
  11. package/dist/cli.js +1 -1
  12. package/dist/index-7Qog3wxS.d.ts +417 -0
  13. package/dist/index-ezghUO7Q.d.cts +417 -0
  14. package/dist/index.cjs +507 -580
  15. package/dist/index.cjs.map +1 -1
  16. package/dist/index.d.cts +210 -161
  17. package/dist/index.d.ts +210 -161
  18. package/dist/index.js +455 -524
  19. package/dist/index.js.map +1 -1
  20. package/dist/metrics/index.cjs +103 -342
  21. package/dist/metrics/index.cjs.map +1 -1
  22. package/dist/metrics/index.d.cts +260 -31
  23. package/dist/metrics/index.d.ts +260 -31
  24. package/dist/metrics/index.js +24 -312
  25. package/dist/metrics/index.js.map +1 -1
  26. package/dist/metrics/opinionated/index.cjs +5 -5
  27. package/dist/metrics/opinionated/index.d.cts +2 -163
  28. package/dist/metrics/opinionated/index.d.ts +2 -163
  29. package/dist/metrics/opinionated/index.js +1 -1
  30. package/dist/{types-C71p0wzM.d.cts → types-D0hzfyKm.d.cts} +1 -13
  31. package/dist/{types-C71p0wzM.d.ts → types-D0hzfyKm.d.ts} +1 -13
  32. package/package.json +1 -1
  33. package/dist/chunk-BFGA2NUB.cjs.map +0 -1
  34. package/dist/chunk-IYLSY7NX.js.map +0 -1
  35. package/dist/chunk-RZFLCWTW.cjs +0 -942
  36. package/dist/chunk-RZFLCWTW.cjs.map +0 -1
  37. package/dist/chunk-Z3U6AUWX.js +0 -925
  38. package/dist/chunk-Z3U6AUWX.js.map +0 -1
package/README.md CHANGED
@@ -1,718 +1,180 @@
- # evalsense
-
- > JS-native LLM evaluation framework with Jest-like API and statistical assertions
+ [![Evalsense logo](./brand/evalsense.png)](https://www.evalsense.com)

  [![npm version](https://img.shields.io/npm/v/evalsense.svg)](https://www.npmjs.com/package/evalsense)
+ [![CI](https://github.com/evalsense/evalsense/actions/workflows/ci.yml/badge.svg)](https://github.com/evalsense/evalsense/actions/workflows/ci.yml)
  [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

- **evalsense** brings classical ML-style statistical evaluation to LLM systems in JavaScript. Instead of evaluating individual test cases, evalsense evaluates entire datasets and computes confusion matrices, precision/recall, F1 scores, and other statistical metrics.
-
- > **New in v0.3.2:** Enhanced assertion reporting - all assertions (passed and failed) now display expected vs actual values, and chained assertions evaluate completely instead of short-circuiting on first failure!
- > **New in v0.3.0:** Regression assertions (MAE, RMSE, R²) and flexible ID matching for custom identifier fields! [See migration guide](./docs/migration-v0.3.0.md).
- > **New in v0.2.x:** Built-in adapters for OpenAI, Anthropic, and OpenRouter - no boilerplate needed!
- > **New in v0.2.0:** LLM-powered metrics for hallucination, relevance, faithfulness, and toxicity detection. [See migration guide](./docs/migration-v0.2.md).
-
- ## Why evalsense?
-
- Most LLM evaluation tools stop at producing scores (accuracy, relevance, hallucination). evalsense goes further by:
+ > **Jest for LLM Evaluation.** Pass/fail quality gates for your LLM-powered code.

- - ✅ Computing **confusion matrices** to reveal systematic failure patterns
- - ✅ Analyzing **false positives vs false negatives** across datasets
- - ✅ Treating **metrics as predictions, not truth** (and validating them statistically)
- - ✅ Providing a **Jest-like API** that fits naturally into JS/Node workflows
- - ✅ Supporting **deterministic CI/CD** integration with specific exit codes
-
- ## Features
-
- - 📊 **Dataset-level evaluation** - evaluate distributions, not single examples
- - 🎯 **Statistical rigor** - confusion matrices, precision/recall, F1, regression metrics
- - 🧪 **Jest-like API** - familiar `describe()` and test patterns
- - 🤖 **LLM-powered metrics** - hallucination, relevance, faithfulness, toxicity with explainable reasoning
- - ⚡ **Dual evaluation modes** - choose between accuracy (per-row) or cost efficiency (batch)
- - 🔄 **CI-friendly** - deterministic execution, machine-readable reports
- - 🚀 **JS-native** - first-class TypeScript support, works with any Node.js LLM library
- - 🔌 **Composable** - evaluate outputs from your existing LLM code
-
- ## Installation
+ evalsense runs your code across many inputs, measures quality statistically, and gives you a clear **pass / fail** result locally or in CI.

  ```bash
  npm install --save-dev evalsense
  ```

- Or with yarn:
-
- ```bash
- yarn add -D evalsense
- ```
-
  ## Quick Start

- Create a file named `sentiment.eval.js`:
+ Create `sentiment.eval.js`:

  ```javascript
  import { describe, evalTest, expectStats } from "evalsense";
  import { readFileSync } from "fs";

- // Your model function - can be any JS function
  function classifySentiment(text) {
-   const lower = text.toLowerCase();
-   const hasPositive = /love|amazing|great|fantastic|perfect/.test(lower);
-   const hasNegative = /terrible|worst|disappointed|waste/.test(lower);
-   return hasPositive && !hasNegative ? "positive" : "negative";
+   return /love|great|amazing/.test(text.toLowerCase()) ? "positive" : "negative";
  }

  describe("Sentiment classifier", () => {
    evalTest("accuracy above 80%", async () => {
-     // 1. Load ground truth data
      const groundTruth = JSON.parse(readFileSync("./sentiment.json", "utf-8"));

-     // 2. Run your model and collect predictions
      const predictions = groundTruth.map((record) => ({
        id: record.id,
        sentiment: classifySentiment(record.text),
      }));

-     // 3. Assert on statistical properties
      expectStats(predictions, groundTruth)
        .field("sentiment")
-       .toHaveAccuracyAbove(0.8)
-       .toHaveRecallAbove("positive", 0.7)
-       .toHavePrecisionAbove("positive", 0.7)
-       .toHaveConfusionMatrix();
+       .accuracy.toBeAtLeast(0.8)
+       .recall("positive")
+       .toBeAtLeast(0.7)
+       .precision("positive")
+       .toBeAtLeast(0.7)
+       .displayConfusionMatrix();
    });
  });
  ```

- Create `sentiment.json`:
-
- ```json
- [
-   { "id": "1", "text": "I love this product!", "sentiment": "positive" },
-   { "id": "2", "text": "Terrible experience.", "sentiment": "negative" },
-   { "id": "3", "text": "Great quality!", "sentiment": "positive" }
- ]
- ```
-
- Run the evaluation:
+ Run it:

  ```bash
  npx evalsense run sentiment.eval.js
  ```

- ## Usage
-
- ### Basic Classification Example
-
- ```javascript
- import { describe, evalTest, expectStats } from "evalsense";
- import { readFileSync } from "fs";
-
- describe("Spam classifier", () => {
-   evalTest("high precision and recall", async () => {
-     const groundTruth = JSON.parse(readFileSync("./emails.json", "utf-8"));
-
-     const predictions = groundTruth.map((record) => ({
-       id: record.id,
-       isSpam: classifyEmail(record.text),
-     }));
+ Output:

-     expectStats(predictions, groundTruth)
-       .field("isSpam")
-       .toHaveAccuracyAbove(0.9)
-       .toHavePrecisionAbove(true, 0.85) // Precision for spam=true
-       .toHaveRecallAbove(true, 0.85) // Recall for spam=true
-       .toHaveConfusionMatrix();
-   });
- });
  ```
+ EvalSense v0.4.1
+ Running 1 eval file(s)...

- ### Continuous Scores with Binarization
+ Sentiment classifier

- ```javascript
- import { describe, evalTest, expectStats } from "evalsense";
- import { readFileSync } from "fs";
+ ✓ accuracy above 80% (12ms)
+ Field: sentiment | Accuracy: 90.0% | F1: 89.5%
+ negative: P=88.0% R=92.0% F1=90.0% (n=25)
+ positive: P=91.0% R=87.0% F1=89.0% (n=25)
+ ✓ Accuracy 90.0% >= 80.0%
+ ✓ Recall for 'positive' 87.0% >= 70.0%
+ ✓ Precision for 'positive' 91.0% >= 70.0%

- describe("Hallucination detector", () => {
-   evalTest("detect hallucinations with 70% recall", async () => {
-     const groundTruth = JSON.parse(readFileSync("./outputs.json", "utf-8"));
+ Summary

-     // Your model returns a continuous score (0.0 to 1.0)
-     const predictions = groundTruth.map((record) => ({
-       id: record.id,
-       hallucinated: computeHallucinationScore(record.output),
-     }));
+ Tests: 1 passed, 0 failed, 0 errors, 0 skipped
+ Duration: 12ms

-     // Binarize the score at threshold 0.3
-     expectStats(predictions, groundTruth)
-       .field("hallucinated")
-       .binarize(0.3) // >= 0.3 means hallucinated
-       .toHaveRecallAbove(true, 0.7)
-       .toHavePrecisionAbove(true, 0.6)
-       .toHaveConfusionMatrix();
-   });
- });
+ All tests passed!
  ```

- ### Multi-class Classification
-
- ```javascript
- import { describe, evalTest, expectStats } from "evalsense";
- import { readFileSync } from "fs";
-
- describe("Intent classifier", () => {
-   evalTest("balanced performance across intents", async () => {
-     const groundTruth = JSON.parse(readFileSync("./intents.json", "utf-8"));
+ ## Key Features

-     const predictions = groundTruth.map((record) => ({
-       id: record.id,
-       intent: classifyIntent(record.query),
-     }));
+ - **Jest-like API** — `describe`, `evalTest`, `expectStats` feel familiar
+ - **Statistical assertions** — accuracy, precision, recall, F1, MAE, RMSE, R²
+ - **Confusion matrices** — built-in display with `.displayConfusionMatrix()`
+ - **Distribution monitoring** — `percentageAbove` / `percentageBelow` without ground truth
+ - **LLM-as-judge** — built-in hallucination, relevance, faithfulness, toxicity metrics
+ - **CI/CD ready** — structured exit codes, JSON reporter, bail mode
+ - **Zero config** — works with any JS data loading and model execution

-     expectStats(predictions, groundTruth)
-       .field("intent")
-       .toHaveAccuracyAbove(0.85)
-       .toHaveRecallAbove("purchase", 0.8)
-       .toHaveRecallAbove("support", 0.8)
-       .toHaveRecallAbove("general", 0.7)
-       .toHaveConfusionMatrix();
-   });
- });
- ```
-
- ### Parallel Model Execution with LLMs
-
- For LLM calls or slow operations, use `Promise.all` with chunking for concurrency control:
-
- ```javascript
- import { describe, evalTest, expectStats } from "evalsense";
- import { readFileSync } from "fs";
-
- // Helper for parallel execution with concurrency limit
- async function mapConcurrent(items, fn, concurrency = 5) {
-   const results = [];
-   for (let i = 0; i < items.length; i += concurrency) {
-     const chunk = items.slice(i, i + concurrency);
-     results.push(...(await Promise.all(chunk.map(fn))));
-   }
-   return results;
- }
-
- describe("LLM classifier", () => {
-   evalTest("classification accuracy", async () => {
-     const groundTruth = JSON.parse(readFileSync("./data.json", "utf-8"));
-
-     // Run with concurrency=5
-     const predictions = await mapConcurrent(
-       groundTruth,
-       async (record) => {
-         const response = await callLLM(record.text);
-         return { id: record.id, category: response.category };
-       },
-       5
-     );
-
-     expectStats(predictions, groundTruth).field("category").toHaveAccuracyAbove(0.9);
-   });
- });
- ```
+ ## Two Ways to Use It

- ### Test Lifecycle Hooks
-
- ```javascript
- import { describe, evalTest, beforeAll, afterAll, beforeEach, afterEach } from "evalsense";
-
- describe("Model evaluation", () => {
-   let model;
-
-   beforeAll(async () => {
-     // Load model once before all tests
-     model = await loadModel();
-   });
-
-   afterAll(async () => {
-     // Cleanup after all tests
-     await model.dispose();
-   });
-
-   beforeEach(() => {
-     // Reset state before each test
-     model.reset();
-   });
-
-   afterEach(() => {
-     // Cleanup after each test
-     console.log("Test completed");
-   });
-
-   evalTest("test 1", async () => {
-     // ...
-   });
-
-   evalTest("test 2", async () => {
-     // ...
-   });
- });
- ```
-
- ## CLI Usage
-
- ### Run Evaluations
-
- ```bash
- # Run all eval files in current directory
- npx evalsense run
-
- # Run specific file or directory
- npx evalsense run tests/eval/
-
- # Filter tests by name
- npx evalsense run --filter "accuracy"
-
- # Output JSON report
- npx evalsense run --output report.json
-
- # Use different reporters
- npx evalsense run --reporter console # default
- npx evalsense run --reporter json
- npx evalsense run --reporter both
-
- # Bail on first failure
- npx evalsense run --bail
-
- # Set timeout (in milliseconds)
- npx evalsense run --timeout 60000
- ```
-
- ### List Eval Files
-
- ```bash
- # List all discovered eval files
- npx evalsense list
-
- # List files in specific directory
- npx evalsense list tests/
- ```
-
- ## API Reference
-
- ### Core API
-
- #### `describe(name, fn)`
-
- Groups related evaluation tests (like Jest's describe).
-
- ```javascript
- describe("My model", () => {
-   // eval tests go here
- });
- ```
-
- #### `evalTest(name, fn)` / `test(name, fn)` / `it(name, fn)`
-
- Defines an evaluation test.
-
- ```javascript
- evalTest("should have 90% accuracy", async () => {
-   // test implementation
- });
- ```
-
- ### Dataset Loading
-
- evalsense doesn't dictate how you load data or run your model. Use standard Node.js tools:
-
- ```javascript
- import { readFileSync } from "fs";
-
- // Load ground truth
- const groundTruth = JSON.parse(readFileSync("./data.json", "utf-8"));
-
- // Run your model however you want
- const predictions = groundTruth.map(runYourModel);
-
- // Or use async operations
- const predictions = await Promise.all(
-   groundTruth.map(async (item) => {
-     const result = await callLLM(item.text);
-     return { id: item.id, prediction: result };
-   })
- );
- ```
-
- **Helper functions available (optional):**
-
- - `loadDataset(path)` - Simple JSON file loader
- - `runModel(dataset, fn)` - Sequential model execution
- - `runModelParallel(dataset, fn, concurrency)` - Parallel execution with concurrency limit
-
- ### Assertions
-
- #### `expectStats(predictions, groundTruth)`
-
- Creates a statistical assertion chain from predictions and ground truth. Aligns by `id` field.
+ ### With ground truth (classification / regression)

  ```javascript
  expectStats(predictions, groundTruth)
-   .field("prediction")
-   .toHaveAccuracyAbove(0.8)
-   .toHaveF1Above(0.75)
-   .toHaveConfusionMatrix();
- ```
-
- **New in v0.3.2: Enhanced Assertion Reporting**
-
- - All assertions (passed and failed) now display expected vs actual values
- - Chained assertions evaluate completely instead of short-circuiting on first failure
- - See all metric results in a single run for better debugging
-
- **One-argument form (distribution assertions only):**
-
- ```javascript
- // For distribution monitoring without ground truth
- expectStats(predictions).field("confidence").toHavePercentageAbove(0.7, 0.8);
- ```
-
- **Common use cases:**
-
- - Classification evaluation with ground truth
- - Regression evaluation (MAE, RMSE, R²)
- - Validating LLM judges against human labels
- - Distribution monitoring without ground truth
-
- ### Field Selection
-
- #### `.field(fieldName)`
-
- Selects a field for evaluation.
-
- ```javascript
- expectStats(result).field("sentiment");
- ```
-
- #### `.binarize(threshold)`
-
- Converts continuous scores to binary (>=threshold is true).
-
- ```javascript
- expectStats(result)
-   .field("score")
-   .binarize(0.5) // score >= 0.5 is true
-   .toHaveAccuracyAbove(0.8);
- ```
-
- ### Available Assertions
-
- #### Classification Metrics
-
- ```javascript
- // Accuracy
- .toHaveAccuracyAbove(threshold)
- .toHaveAccuracyBelow(threshold)
- .toHaveAccuracyBetween(min, max)
-
- // Precision (per class)
- .toHavePrecisionAbove(className, threshold)
- .toHavePrecisionBelow(className, threshold)
-
- // Recall (per class)
- .toHaveRecallAbove(className, threshold)
- .toHaveRecallBelow(className, threshold)
-
- // F1 Score
- .toHaveF1Above(threshold) // Overall F1
- .toHaveF1Above(className, threshold) // Per-class F1
-
- // Confusion Matrix
- .toHaveConfusionMatrix() // Prints confusion matrix
- ```
-
- #### Distribution Assertions (Pattern 1)
-
- Distribution assertions validate output distributions **without requiring ground truth**. Use these to monitor that model outputs stay within expected ranges.
-
- ```javascript
- // Assert that at least 80% of confidence scores are above 0.7
- expectStats(predictions).field("confidence").toHavePercentageAbove(0.7, 0.8);
-
- // Assert that at least 90% of toxicity scores are below 0.3
- expectStats(predictions).field("toxicity").toHavePercentageBelow(0.3, 0.9);
-
- // Chain multiple distribution assertions
- expectStats(predictions)
-   .field("score")
-   .toHavePercentageAbove(0.5, 0.6) // At least 60% above 0.5
-   .toHavePercentageBelow(0.9, 0.8); // At least 80% below 0.9
- ```
-
- **Use cases:**
-
- - Monitor confidence score distributions
- - Validate schema compliance rates
- - Check output range constraints
- - Ensure score distributions remain stable over time
-
- See [Distribution Assertions Example](./examples/distribution-assertions.eval.js) for complete examples.
-
- ### Judge Validation (Pattern 1b)
-
- Validate judge outputs against human-labeled ground truth using the **two-argument expectStats API**:
-
- ```javascript
- // Judge outputs (predictions from your judge/metric)
- const judgeOutputs = [
-   { id: "1", hallucinated: true },
-   { id: "2", hallucinated: false },
-   { id: "3", hallucinated: true },
- ];
-
- // Human labels (ground truth)
- const humanLabels = [
-   { id: "1", hallucinated: true },
-   { id: "2", hallucinated: false },
-   { id: "3", hallucinated: false },
- ];
-
- // Validate judge performance
- expectStats(judgeOutputs, humanLabels)
-   .field("hallucinated")
-   .toHaveRecallAbove(true, 0.9) // Don't miss hallucinations
-   .toHavePrecisionAbove(true, 0.7) // Some false positives OK
-   .toHaveConfusionMatrix();
+   .field("label")
+   .accuracy.toBeAtLeast(0.9)
+   .recall("positive")
+   .toBeAtLeast(0.8)
+   .f1.toBeAtLeast(0.85);
  ```

- **Use cases:**
-
- - Evaluate LLM-as-judge accuracy
- - Validate heuristic metrics against human labels
- - Test automated detection systems (refusal, policy compliance)
- - Calibrate metric thresholds
-
- **Two-argument expectStats:**
+ ### Without ground truth (distribution monitoring)

  ```javascript
- expectStats(actual, expected).field("fieldName").toHaveAccuracyAbove(0.8);
- ```
-
- The first argument is your predictions (judge outputs), the second is ground truth (human labels). Both must have matching `id` fields for alignment.
-
- See [Judge Validation Example](./examples/judge-validation.eval.js) for complete examples.
-
- For comprehensive guidance on evaluating agent systems, see [Agent Judges Design Patterns](./docs/agent-judges.md).
-
- ## Dataset Format
-
- Datasets must be JSON arrays where each record has an `id` or `_id` field:
-
- ```json
- [
-   {
-     "id": "1",
-     "text": "input text",
-     "label": "expected_output"
-   },
-   {
-     "id": "2",
-     "text": "another input",
-     "label": "another_output"
-   }
- ]
- ```
-
- **Requirements:**
-
- - Each record MUST have `id` or `_id` for alignment
- - Ground truth fields (e.g., `label`, `sentiment`, `category`) are compared against model outputs
- - Model functions must return predictions with matching `id`
-
- ## Exit Codes
-
- evalsense returns specific exit codes for CI integration:
-
- - `0` - Success (all tests passed)
- - `1` - Assertion failure (statistical thresholds not met)
- - `2` - Integrity failure (dataset alignment issues)
- - `3` - Execution error (test threw exception)
- - `4` - Configuration error (invalid CLI options)
-
- ## Writing Eval Files
-
- Eval files use the `.eval.js` or `.eval.ts` extension and are discovered automatically:
-
- ```
- project/
- ├── tests/
- │   ├── classifier.eval.js
- │   └── hallucination.eval.js
- ├── data/
- │   └── dataset.json
- └── package.json
- ```
-
- Run with:
-
- ```bash
- npx evalsense run tests/
+ expectStats(llmOutputs).field("toxicity_score").percentageBelow(0.3).toBeAtLeast(0.95); // 95% of outputs must be non-toxic
  ```

- ## Examples
-
- See the [`examples/`](./examples/) directory for complete examples:
-
- - [`classification.eval.js`](./examples/basic/classification.eval.js) - Binary sentiment classification
- - [`hallucination.eval.js`](./examples/basic/hallucination.eval.js) - Continuous score binarization
- - [`distribution-assertions.eval.js`](./examples/distribution-assertions.eval.js) - Distribution monitoring without ground truth
- - [`judge-validation.eval.js`](./examples/judge-validation.eval.js) - Validating judges against human labels
-
- ## Field Types
-
- evalsense automatically determines evaluation metrics based on field values:
-
- - **Boolean** (`true`/`false`) → Binary classification metrics
- - **Categorical** (strings) → Multi-class classification metrics
- - **Numeric** (numbers) → Regression metrics (MAE, MSE, RMSE, R²)
- - **Numeric + threshold** → Binarized classification metrics
-
- ## LLM-Based Metrics (v0.2.0+)
-
- evalsense includes LLM-powered metrics for hallucination detection, relevance assessment, faithfulness verification, and toxicity detection.
-
- ### Quick Setup
+ ## LLM-Based Metrics

  ```javascript
- import { setLLMClient, createOpenAIAdapter } from "evalsense/metrics";
- import { hallucination, relevance, faithfulness, toxicity } from "evalsense/metrics/opinionated";
+ import { setLLMClient, createAnthropicAdapter } from "evalsense/metrics";
+ import { hallucination, relevance } from "evalsense/metrics/opinionated";

- // 1. Configure your LLM client (one-time setup)
  setLLMClient(
-   createOpenAIAdapter(process.env.OPENAI_API_KEY, {
-     model: "gpt-4-turbo-preview",
-     temperature: 0,
+   createAnthropicAdapter(process.env.ANTHROPIC_API_KEY, {
+     model: "claude-haiku-4-5-20251001",
    })
  );

- // 2. Use metrics in evaluations
- const results = await hallucination({
+ const scores = await hallucination({
    outputs: [{ id: "1", output: "Paris has 50 million people." }],
    context: ["Paris has approximately 2.1 million residents."],
  });
-
- console.log(results[0].score); // 0.9 (high hallucination)
- console.log(results[0].reasoning); // "Output claims 50M, context says 2.1M"
+ // scores[0].score → 0.9 (high hallucination)
+ // scores[0].reasoning → "Output claims 50M, context says 2.1M"
  ```

- ### Available Metrics
-
- - **`hallucination()`** - Detects claims not supported by context
- - **`relevance()`** - Measures query-response alignment
- - **`faithfulness()`** - Verifies outputs don't contradict sources
- - **`toxicity()`** - Identifies harmful or inappropriate content
-
- ### Evaluation Modes
-
- Choose between accuracy and cost:
-
- ```javascript
- // Per-row: Higher accuracy, higher cost (N API calls)
- await hallucination({
-   outputs,
-   context,
-   evaluationMode: "per-row", // default
- });
-
- // Batch: Lower cost, single API call
- await hallucination({
-   outputs,
-   context,
-   evaluationMode: "batch",
- });
- ```
-
- ### Built-in Provider Adapters
-
- evalsense includes ready-to-use adapters for popular LLM providers:
+ Built-in providers: OpenAI, Anthropic, OpenRouter, or bring your own adapter.
+ See [LLM Metrics Guide](./docs/llm-metrics.md) and [Adapters Guide](./docs/llm-adapters.md).

- **OpenAI (GPT-4, GPT-3.5)**
+ ## Using with Claude Code (Vibe Check)

- ```javascript
- import { createOpenAIAdapter } from "evalsense/metrics";
-
- // npm install openai
- setLLMClient(
-   createOpenAIAdapter(process.env.OPENAI_API_KEY, {
-     model: "gpt-4-turbo-preview", // or "gpt-3.5-turbo" for lower cost
-     temperature: 0,
-     maxTokens: 4096,
-   })
- );
- ```
+ evalsense includes an example [Claude Code skill](./skill.md) that acts as an automated LLM quality gate. To set it up in your project:

- **Anthropic (Claude)**
-
- ```javascript
- import { createAnthropicAdapter } from "evalsense/metrics";
+ 1. Install evalsense as a dev dependency
+ 2. Copy [`skill.md`](./skill.md) into your project at `.claude/skills/llm-quality-gate/SKILL.md`
+ 3. After building any LLM feature, run `/llm-quality-gate` in Claude Code

- // npm install @anthropic-ai/sdk
- setLLMClient(
-   createAnthropicAdapter(process.env.ANTHROPIC_API_KEY, {
-     model: "claude-3-5-sonnet-20241022", // or "claude-3-haiku-20240307" for speed
-     maxTokens: 4096,
-   })
- );
- ```
+ Claude will automatically create a `.eval.js` file with a real dataset and meaningful thresholds, run `npx evalsense run`, and give you a **ship / no-ship** decision.

- **OpenRouter (100+ models from one API)**
+ ## Documentation

- ```javascript
- import { createOpenRouterAdapter } from "evalsense/metrics";
+ | Guide | Description |
+ | -------------------------------------------------- | ------------------------------------------------ |
+ | [API Reference](./docs/api-reference.md) | Full API — all assertions, matchers, metrics |
+ | [CLI Reference](./docs/cli.md) | All CLI flags, exit codes, CI integration |
+ | [LLM Metrics](./docs/llm-metrics.md) | Hallucination, relevance, faithfulness, toxicity |
+ | [LLM Adapters](./docs/llm-adapters.md) | OpenAI, Anthropic, OpenRouter, custom adapters |
+ | [Custom Metrics](./docs/custom-metrics-guide.md) | Pattern and keyword metrics |
+ | [Agent Judges](./docs/agent-judges.md) | Design patterns for evaluating agent systems |
+ | [Regression Metrics](./docs/regression-metrics.md) | MAE, RMSE, R² usage |
+ | [Examples](./examples/) | Working code examples |

- // No SDK needed - uses fetch
- setLLMClient(
-   createOpenRouterAdapter(process.env.OPENROUTER_API_KEY, {
-     model: "anthropic/claude-3.5-sonnet", // or "openai/gpt-3.5-turbo", etc.
-     temperature: 0,
-     appName: "my-eval-system",
-   })
- );
- ```
+ ## Dataset Format

- **Custom Adapter (for any provider)**
+ Records must have an `id` or `_id` field:

- ```javascript
- setLLMClient({
-   async complete(prompt) {
-     // Implement for your LLM provider
-     const response = await yourLLM.generate(prompt);
-     return response.text;
-   },
- });
+ ```json
+ [
+   { "id": "1", "text": "sample input", "label": "positive" },
+   { "id": "2", "text": "another input", "label": "negative" }
+ ]
  ```

- ### Learn More
-
- - [LLM Metrics Guide](./docs/llm-metrics.md) - Complete usage guide
- - [LLM Adapters Guide](./docs/llm-adapters.md) - Implement adapters for different providers
- - [Migration Guide](./docs/migration-v0.2.md) - Upgrade from v0.1.x
- - [Examples](./examples/) - Working code examples
-
- ## Philosophy
-
- evalsense is built on the principle that **metrics are predictions, not facts**.
-
- Instead of treating LLM-as-judge metrics (relevance, hallucination, etc.) as ground truth, evalsense:
+ ## Exit Codes

- - Treats them as **weak labels** from a model
- - Validates them statistically against human references when available
- - Computes confusion matrices to reveal bias and systematic errors
- - Focuses on dataset-level distributions, not individual examples
+ | Code | Meaning |
+ | ---- | ------------------------- |
+ | `0` | All tests passed |
+ | `1` | Assertion failure |
+ | `2` | Dataset integrity failure |
+ | `3` | Execution error |
+ | `4` | Configuration error |

  ## Contributing

- Contributions are welcome! Please see [CLAUDE.md](./CLAUDE.md) for development guidelines.
+ Contributions are welcome. See [CONTRIBUTING.md](./CONTRIBUTING.md) for setup, coding standards, and the PR process.

  ## License

- MIT © Mohit Joshi
-
- ---
-
- **Made with ❤️ for the JS/Node.js AI community**
+ [Apache 2.0](./LICENSE)
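
The fluent assertions introduced in the new README, such as `.accuracy.toBeAtLeast(0.8)`, `.recall("positive")`, and `.precision("positive")`, gate on standard confusion-matrix statistics computed after aligning predictions to ground truth by `id`. As a point of reference, here is a minimal sketch of those computations; it is an illustration only, not evalsense's implementation, and the helper name `classificationStats` is invented for this example:

```javascript
// Sketch of the statistics behind assertions like .accuracy.toBeAtLeast()
// and .recall("positive"). Not evalsense's code; for illustration only.
function classificationStats(predictions, groundTruth, field) {
  // Align the two datasets by `id`, as the README describes.
  const truthById = new Map(groundTruth.map((r) => [r.id, r[field]]));
  let correct = 0;
  const counts = {}; // per-class true positives, false positives, false negatives

  for (const p of predictions) {
    const actual = truthById.get(p.id);
    const predicted = p[field];
    for (const cls of [predicted, actual]) {
      counts[cls] ??= { tp: 0, fp: 0, fn: 0 };
    }
    if (predicted === actual) {
      correct++;
      counts[predicted].tp++;
    } else {
      counts[predicted].fp++; // predicted this class, was wrong
      counts[actual].fn++; // missed this class
    }
  }

  const perClass = {};
  for (const [cls, { tp, fp, fn }] of Object.entries(counts)) {
    const precision = tp + fp ? tp / (tp + fp) : 0;
    const recall = tp + fn ? tp / (tp + fn) : 0;
    const f1 = precision + recall ? (2 * precision * recall) / (precision + recall) : 0;
    perClass[cls] = { precision, recall, f1 };
  }
  return { accuracy: correct / predictions.length, perClass };
}

// Example: 3 of 4 predictions match the labels.
const truth = [
  { id: "1", sentiment: "positive" },
  { id: "2", sentiment: "negative" },
  { id: "3", sentiment: "positive" },
  { id: "4", sentiment: "negative" },
];
const preds = [
  { id: "1", sentiment: "positive" },
  { id: "2", sentiment: "negative" },
  { id: "3", sentiment: "negative" }, // the one miss: a false negative for "positive"
  { id: "4", sentiment: "negative" },
];
const stats = classificationStats(preds, truth, "sentiment");
console.log(stats.accuracy); // 0.75
console.log(stats.perClass.positive.recall); // 0.5
```

An assertion like `.recall("positive").toBeAtLeast(0.7)` then amounts to comparing `stats.perClass.positive.recall` against the threshold.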
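
The regression metrics the new README lists for numeric fields (MAE, RMSE, R²) likewise reduce to a few lines. This sketch is illustrative only, not evalsense's code; the helper name `regressionStats` is invented here:

```javascript
// Sketch of the regression metrics named in the README: mean absolute error,
// root mean squared error, and the coefficient of determination R².
// Not evalsense's implementation; for illustration only.
function regressionStats(predicted, actual) {
  const n = predicted.length;
  const mean = actual.reduce((sum, y) => sum + y, 0) / n;
  let absErr = 0;
  let sqErr = 0;
  let totalVar = 0;
  for (let i = 0; i < n; i++) {
    const e = predicted[i] - actual[i];
    absErr += Math.abs(e);
    sqErr += e * e;
    totalVar += (actual[i] - mean) ** 2; // variance of the labels around their mean
  }
  return {
    mae: absErr / n,
    rmse: Math.sqrt(sqErr / n),
    r2: 1 - sqErr / totalVar, // 1 means perfect fit; 0 means no better than the mean
  };
}

// Example: predictions [2, 4, 6] against labels [1, 4, 7].
const regStats = regressionStats([2, 4, 6], [1, 4, 7]);
console.log(regStats.mae.toFixed(3)); // 0.667
```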