evalsense 0.4.0 → 0.4.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,855 +1,180 @@
1
- # evalsense
2
-
3
- > JS-native LLM evaluation framework with Jest-like API and statistical assertions
1
+ [![Evalsense logo](./brand/evalsense.png)](https://www.evalsense.com)
4
2
 
5
3
  [![npm version](https://img.shields.io/npm/v/evalsense.svg)](https://www.npmjs.com/package/evalsense)
4
+ [![CI](https://github.com/evalsense/evalsense/actions/workflows/ci.yml/badge.svg)](https://github.com/evalsense/evalsense/actions/workflows/ci.yml)
6
5
  [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
7
6
 
8
- # evalsense
9
-
10
- **evalsense is like Jest for testing code that uses LLMs.**
11
-
12
- It helps engineers answer one simple question:
13
-
14
- > **“Is my LLM-powered code good enough to ship?”**
15
-
16
- Instead of checking a few example responses, evalsense runs your code across many inputs, measures overall quality, and gives you a clear **pass / fail** result — locally or in CI.
17
-
18
- evalsense is built for **engineers deploying LLM-enabled features**, not for training or benchmarking models.
19
-
20
- ## What problem does evalsense solve?
21
-
22
- Most LLM evaluation tools focus on individual outputs:
23
-
24
- > _“How good is this one response?”_
25
-
26
- That’s useful, but it doesn’t tell you whether your system is reliable.
27
-
28
- evalsense answers a different question:
29
-
30
- > **“Does my code consistently meet our quality bar?”**
31
-
32
- It treats evaluation like testing:
33
-
34
- - run your code many times
35
- - measure results across all runs
36
- - fail fast if quality drops
37
-
38
- ## How evalsense works (in plain terms)
39
-
40
- At a high level, evalsense:
41
-
42
- 1. Runs your code
43
- (this can be a function, module, API call, or a fixed dataset)
44
- 2. Collects the results
45
- 3. Scores them using:
46
- - standard metrics (accuracy, precision, recall, F1)
47
- - LLM-as-judge checks (e.g. relevance, hallucination, correctness)
48
-
49
- 4. Aggregates scores across all results
50
- 5. Applies rules you define
51
- 6. Passes or fails the test
52
-
53
- Think of it as **unit tests for output quality**.
54
-
55
- ## A quick example
56
-
57
- ```ts
58
- describe("test answer quality", async () => {
59
- evalTest("toxicity detection", async () => {
60
- const answers = await generateAnswersDataset(testQuestions);
61
- const toxicityScore = await toxicity(answers);
62
-
63
- expectStats(toxicityScore)
64
- .field("score")
65
- .percentageBelow(0.5).toBeAtLeast(0.5);
66
- });
67
-
68
- evalTest("correctness score", async () => {
69
- const answers = await generateAnswersDataset(testQuestions);
70
- const groundTruth = JSON.parse(readFileSync("truth-dataset.json", "utf-8"));
71
-
72
- expectStats(answers, groundTruth)
73
- .field("label")
74
- .accuracy.toBeAtLeast(0.9)
75
- .precision("positive").toBeAtLeast(0.7)
76
- .recall("positive").toBeAtLeast(0.7)
77
- .displayConfusionMatrix();
78
- });
79
- });
80
- ```
81
-
82
- Running the test:
83
-
84
- ```markdown
85
- **test answer quality**
86
-
87
- ✓ toxicity detection (1ms)
88
- ✓ 50.0% of 'score' values are below or equal to 0.5 (expected >= 50.0%)
89
- Expected: 50.0%
90
- Actual: 50.0%
91
-
92
- ✓ correctness score (1ms)
93
- Field: label | Accuracy: 100.0% | F1: 100.0%
94
- negative: P=100.0% R=100.0% F1=100.0% (n=5)
95
- positive: P=100.0% R=100.0% F1=100.0% (n=5)
96
-
97
- Confusion Matrix: label
98
-
99
- Predicted → positive negative
100
- Actual ↓
101
- positive 5 0
102
- negative 0 5
103
-
104
- ✓ Accuracy 100.0% >= 90.0%
105
- Expected: 90.0%
106
- Actual: 100.0%
107
- ✓ Precision for 'positive' 100.0% >= 70.0%
108
- Expected: 70.0%
109
- Actual: 100.0%
110
- ✓ Recall for 'positive' 100.0% >= 70.0%
111
- Expected: 70.0%
112
- Actual: 100.0%
113
- ✓ Confusion matrix recorded for field "label"
114
- ```
115
-
116
- If the quality drops, the test fails — just like a normal test.
117
-
118
- ## Two common ways to use evalsense
119
-
120
- ### 1. When you **don’t have ground truth**
121
-
122
- Use this when there are no labels.
123
-
124
- Example:
125
-
126
- - Run your LLM-powered function
127
- - Score outputs using an LLM-as-judge (relevance, hallucination, etc.)
128
- - Define what “acceptable” means
129
- - Fail if quality degrades
130
-
131
- **Example rule:**
132
-
133
- > “Average relevance score must be at least 0.75”
7
+ > **Jest for LLM Evaluation.** Pass/fail quality gates for your LLM-powered code.
134
8
 
135
- ### 2. When you **do have ground truth**
136
-
137
- Use this when correct answers are known.
138
-
139
- Example:
140
-
141
- - Run your prediction code
142
- - Compare outputs with ground truth
143
- - Compute accuracy, precision, recall, F1
144
- - Optionally add LLM-as-judge checks
145
- - Fail if metrics fall below thresholds
146
-
147
- **Example rule:**
148
-
149
- > “F1 score must be ≥ 0.85 and false positives ≤ 5%”
150
-
151
- ## What evalsense is _not_
152
-
153
- evalsense is **not**:
154
-
155
- - A tool for scoring single responses in isolation
156
- - A dashboard or experiment-tracking platform
157
- - A system for analyzing agent step-by-step traces
158
- - A model benchmarking or training framework
159
-
160
- If you mainly want scores, charts, or leaderboards, other tools may be a better fit.
161
-
162
- ## Who should use evalsense
163
-
164
- evalsense is a good fit if you:
165
-
166
- - are **shipping LLM-powered features**
167
- - want **clear pass/fail quality gates**
168
- - run checks in **CI/CD**
169
- - care about **regressions** (“did this get worse?”)
170
- - already think in terms of tests
171
- - work in **JavaScript / TypeScript**
172
-
173
- ## Who should _not_ use evalsense
174
-
175
- evalsense may not be right for you if you:
176
-
177
- - only care about individual output scores
178
- - want visual dashboards or experiment UIs
179
- - need deep agent trace inspection
180
- - are training or benchmarking foundation models
181
-
182
- ## In one sentence
183
-
184
- **evalsense lets you test the quality of LLM-powered code the same way you test everything else — with clear pass/fail results.**
185
-
186
- ## Installation
9
+ evalsense runs your code across many inputs, measures quality statistically, and gives you a clear **pass / fail** result — locally or in CI.
187
10
 
188
11
  ```bash
189
12
  npm install --save-dev evalsense
190
13
  ```
191
14
 
192
- Or with yarn:
193
-
194
- ```bash
195
- yarn add -D evalsense
196
- ```
197
-
198
15
  ## Quick Start
199
16
 
200
- Create a file named `sentiment.eval.js`:
17
+ Create `sentiment.eval.js`:
201
18
 
202
19
  ```javascript
203
20
  import { describe, evalTest, expectStats } from "evalsense";
204
21
  import { readFileSync } from "fs";
205
22
 
206
- // Your model function - can be any JS function
207
23
  function classifySentiment(text) {
208
- const lower = text.toLowerCase();
209
- const hasPositive = /love|amazing|great|fantastic|perfect/.test(lower);
210
- const hasNegative = /terrible|worst|disappointed|waste/.test(lower);
211
- return hasPositive && !hasNegative ? "positive" : "negative";
24
+ return /love|great|amazing/.test(text.toLowerCase()) ? "positive" : "negative";
212
25
  }
213
26
 
214
27
  describe("Sentiment classifier", () => {
215
28
  evalTest("accuracy above 80%", async () => {
216
- // 1. Load ground truth data
217
29
  const groundTruth = JSON.parse(readFileSync("./sentiment.json", "utf-8"));
218
30
 
219
- // 2. Run your model and collect predictions
220
31
  const predictions = groundTruth.map((record) => ({
221
32
  id: record.id,
222
33
  sentiment: classifySentiment(record.text),
223
34
  }));
224
35
 
225
- // 3. Assert on statistical properties
226
36
  expectStats(predictions, groundTruth)
227
37
  .field("sentiment")
228
38
  .accuracy.toBeAtLeast(0.8)
229
- .recall("positive").toBeAtLeast(0.7)
230
- .precision("positive").toBeAtLeast(0.7)
39
+ .recall("positive")
40
+ .toBeAtLeast(0.7)
41
+ .precision("positive")
42
+ .toBeAtLeast(0.7)
231
43
  .displayConfusionMatrix();
232
44
  });
233
45
  });
234
46
  ```
235
47
 
236
- Create `sentiment.json`:
237
-
238
- ```json
239
- [
240
- { "id": "1", "text": "I love this product!", "sentiment": "positive" },
241
- { "id": "2", "text": "Terrible experience.", "sentiment": "negative" },
242
- { "id": "3", "text": "Great quality!", "sentiment": "positive" }
243
- ]
244
- ```
245
-
246
- Run the evaluation:
48
+ Run it:
247
49
 
248
50
  ```bash
249
51
  npx evalsense run sentiment.eval.js
250
52
  ```
251
53
 
252
- ## Usage
253
-
254
- ### Basic Classification Example
255
-
256
- ```javascript
257
- import { describe, evalTest, expectStats } from "evalsense";
258
- import { readFileSync } from "fs";
259
-
260
- describe("Spam classifier", () => {
261
- evalTest("high precision and recall", async () => {
262
- const groundTruth = JSON.parse(readFileSync("./emails.json", "utf-8"));
263
-
264
- const predictions = groundTruth.map((record) => ({
265
- id: record.id,
266
- isSpam: classifyEmail(record.text),
267
- }));
268
-
269
- expectStats(predictions, groundTruth)
270
- .field("isSpam")
271
- .accuracy.toBeAtLeast(0.9)
272
- .precision(true).toBeAtLeast(0.85) // Precision for spam=true
273
- .recall(true).toBeAtLeast(0.85) // Recall for spam=true
274
- .displayConfusionMatrix();
275
- });
276
- });
277
- ```
278
-
279
- ### Continuous Scores with Binarization
54
+ Output:
280
55
 
281
- ```javascript
282
- import { describe, evalTest, expectStats } from "evalsense";
283
- import { readFileSync } from "fs";
284
-
285
- describe("Hallucination detector", () => {
286
- evalTest("detect hallucinations with 70% recall", async () => {
287
- const groundTruth = JSON.parse(readFileSync("./outputs.json", "utf-8"));
288
-
289
- // Your model returns a continuous score (0.0 to 1.0)
290
- const predictions = groundTruth.map((record) => ({
291
- id: record.id,
292
- hallucinated: computeHallucinationScore(record.output),
293
- }));
294
-
295
- // Binarize the score at threshold 0.3
296
- expectStats(predictions, groundTruth)
297
- .field("hallucinated")
298
- .binarize(0.3) // >= 0.3 means hallucinated
299
- .recall(true).toBeAtLeast(0.7)
300
- .precision(true).toBeAtLeast(0.6)
301
- .displayConfusionMatrix();
302
- });
303
- });
304
56
  ```
57
+ EvalSense v0.4.1
58
+ Running 1 eval file(s)...
305
59
 
306
- ### Multi-class Classification
60
+ Sentiment classifier
307
61
 
308
- ```javascript
309
- import { describe, evalTest, expectStats } from "evalsense";
310
- import { readFileSync } from "fs";
62
+ ✓ accuracy above 80% (12ms)
63
+ Field: sentiment | Accuracy: 90.0% | F1: 89.5%
64
+ negative: P=88.0% R=92.0% F1=90.0% (n=25)
65
+ positive: P=91.0% R=87.0% F1=89.0% (n=25)
66
+ ✓ Accuracy 90.0% >= 80.0%
67
+ ✓ Recall for 'positive' 87.0% >= 70.0%
68
+ ✓ Precision for 'positive' 91.0% >= 70.0%
311
69
 
312
- describe("Intent classifier", () => {
313
- evalTest("balanced performance across intents", async () => {
314
- const groundTruth = JSON.parse(readFileSync("./intents.json", "utf-8"));
70
+ Summary
315
71
 
316
- const predictions = groundTruth.map((record) => ({
317
- id: record.id,
318
- intent: classifyIntent(record.query),
319
- }));
72
+ Tests: 1 passed, 0 failed, 0 errors, 0 skipped
73
+ Duration: 12ms
320
74
 
321
- expectStats(predictions, groundTruth)
322
- .field("intent")
323
- .accuracy.toBeAtLeast(0.85)
324
- .recall("purchase").toBeAtLeast(0.8)
325
- .recall("support").toBeAtLeast(0.8)
326
- .recall("general").toBeAtLeast(0.7)
327
- .displayConfusionMatrix();
328
- });
329
- });
75
+ All tests passed!
330
76
  ```
331
77
 
332
- ### Parallel Model Execution with LLMs
78
+ ## Key Features
333
79
 
334
- For LLM calls or slow operations, use `Promise.all` with chunking for concurrency control:
80
+ - **Jest-like API** — `describe`, `evalTest`, and `expectStats` feel familiar
81
+ - **Statistical assertions** — accuracy, precision, recall, F1, MAE, RMSE, R²
82
+ - **Confusion matrices** — built-in display with `.displayConfusionMatrix()`
83
+ - **Distribution monitoring** — `percentageAbove` / `percentageBelow` without ground truth
84
+ - **LLM-as-judge** — built-in hallucination, relevance, faithfulness, toxicity metrics
85
+ - **CI/CD ready** — structured exit codes, JSON reporter, bail mode
86
+ - **Zero config** — works with any JS data loading and model execution
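The regression metrics named above (MAE, RMSE, R²) follow their standard definitions. A plain-JavaScript sketch of the first two, with made-up values — illustrative only, not the library's internals:

```javascript
// Standard definitions of MAE and RMSE, sketched outside the library.
const predicted = [2.5, 0.0, 2.0, 8.0];
const actual = [3.0, -0.5, 2.0, 7.0];
const n = predicted.length;

// Mean Absolute Error: average |prediction - truth|
const mae = predicted.reduce((sum, p, i) => sum + Math.abs(p - actual[i]), 0) / n;

// Root Mean Squared Error: square root of the average squared error
const rmse = Math.sqrt(
  predicted.reduce((sum, p, i) => sum + (p - actual[i]) ** 2, 0) / n
);

console.log(mae); // 0.5
console.log(rmse); // ≈ 0.6124
```

In an eval file these are the quantities that `.mae.toBeAtMost(...)` and `.rmse.toBeAtMost(...)` gate on.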
335
87
 
336
- ```javascript
337
- import { describe, evalTest, expectStats } from "evalsense";
338
- import { readFileSync } from "fs";
88
+ ## Two Ways to Use It
339
89
 
340
- // Helper for parallel execution with concurrency limit
341
- async function mapConcurrent(items, fn, concurrency = 5) {
342
- const results = [];
343
- for (let i = 0; i < items.length; i += concurrency) {
344
- const chunk = items.slice(i, i + concurrency);
345
- results.push(...(await Promise.all(chunk.map(fn))));
346
- }
347
- return results;
348
- }
349
-
350
- describe("LLM classifier", () => {
351
- evalTest("classification accuracy", async () => {
352
- const groundTruth = JSON.parse(readFileSync("./data.json", "utf-8"));
353
-
354
- // Run with concurrency=5
355
- const predictions = await mapConcurrent(
356
- groundTruth,
357
- async (record) => {
358
- const response = await callLLM(record.text);
359
- return { id: record.id, category: response.category };
360
- },
361
- 5
362
- );
363
-
364
- expectStats(predictions, groundTruth).field("category").accuracy.toBeAtLeast(0.9);
365
- });
366
- });
367
- ```
368
-
369
- ### Test Lifecycle Hooks
370
-
371
- ```javascript
372
- import { describe, evalTest, beforeAll, afterAll, beforeEach, afterEach } from "evalsense";
373
-
374
- describe("Model evaluation", () => {
375
- let model;
376
-
377
- beforeAll(async () => {
378
- // Load model once before all tests
379
- model = await loadModel();
380
- });
381
-
382
- afterAll(async () => {
383
- // Cleanup after all tests
384
- await model.dispose();
385
- });
386
-
387
- beforeEach(() => {
388
- // Reset state before each test
389
- model.reset();
390
- });
391
-
392
- afterEach(() => {
393
- // Cleanup after each test
394
- console.log("Test completed");
395
- });
396
-
397
- evalTest("test 1", async () => {
398
- // ...
399
- });
400
-
401
- evalTest("test 2", async () => {
402
- // ...
403
- });
404
- });
405
- ```
406
-
407
- ## CLI Usage
408
-
409
- ### Run Evaluations
410
-
411
- ```bash
412
- # Run all eval files in current directory
413
- npx evalsense run
414
-
415
- # Run specific file or directory
416
- npx evalsense run tests/eval/
417
-
418
- # Filter tests by name
419
- npx evalsense run --filter "accuracy"
420
-
421
- # Output JSON report
422
- npx evalsense run --output report.json
423
-
424
- # Use different reporters
425
- npx evalsense run --reporter console # default
426
- npx evalsense run --reporter json
427
- npx evalsense run --reporter both
428
-
429
- # Bail on first failure
430
- npx evalsense run --bail
431
-
432
- # Set timeout (in milliseconds)
433
- npx evalsense run --timeout 60000
434
- ```
435
-
436
- ### List Eval Files
437
-
438
- ```bash
439
- # List all discovered eval files
440
- npx evalsense list
441
-
442
- # List files in specific directory
443
- npx evalsense list tests/
444
- ```
445
-
446
- ## API Reference
447
-
448
- ### Core API
449
-
450
- #### `describe(name, fn)`
451
-
452
- Groups related evaluation tests (like Jest's describe).
453
-
454
- ```javascript
455
- describe("My model", () => {
456
- // eval tests go here
457
- });
458
- ```
459
-
460
- #### `evalTest(name, fn)` / `test(name, fn)` / `it(name, fn)`
461
-
462
- Defines an evaluation test.
463
-
464
- ```javascript
465
- evalTest("should have 90% accuracy", async () => {
466
- // test implementation
467
- });
468
- ```
469
-
470
- ### Dataset Loading
471
-
472
- evalsense doesn't dictate how you load data or run your model. Use standard Node.js tools:
473
-
474
- ```javascript
475
- import { readFileSync } from "fs";
476
-
477
- // Load ground truth
478
- const groundTruth = JSON.parse(readFileSync("./data.json", "utf-8"));
479
-
480
- // Run your model however you want
481
- const predictions = groundTruth.map(runYourModel);
482
-
483
- // Or use async operations
484
- const predictions = await Promise.all(
485
- groundTruth.map(async (item) => {
486
- const result = await callLLM(item.text);
487
- return { id: item.id, prediction: result };
488
- })
489
- );
490
- ```
491
-
492
- ### Assertions
493
-
494
- #### `expectStats(predictions, groundTruth)`
495
-
496
- Creates a statistical assertion chain from predictions and ground truth. Aligns by `id` field.
90
+ ### With ground truth (classification / regression)
497
91
 
498
92
  ```javascript
499
93
  expectStats(predictions, groundTruth)
500
- .field("prediction")
501
- .accuracy.toBeAtLeast(0.8)
502
- .f1.toBeAtLeast(0.75)
503
- .displayConfusionMatrix();
94
+ .field("label")
95
+ .accuracy.toBeAtLeast(0.9)
96
+ .recall("positive")
97
+ .toBeAtLeast(0.8)
98
+ .f1.toBeAtLeast(0.85);
504
99
  ```
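For readers new to these metrics, this is what the precision and recall assertions check, using their standard definitions (plain JavaScript, not the library's code; the sample pairs are invented):

```javascript
// Standard precision/recall for the "positive" class, sketched in plain JS.
const pairs = [
  // [predicted, actual]
  ["positive", "positive"],
  ["positive", "negative"],
  ["negative", "positive"],
  ["positive", "positive"],
  ["negative", "negative"],
];
const tp = pairs.filter(([p, a]) => p === "positive" && a === "positive").length;
const fp = pairs.filter(([p, a]) => p === "positive" && a !== "positive").length;
const fn = pairs.filter(([p, a]) => p !== "positive" && a === "positive").length;

const precision = tp / (tp + fp); // of predicted positives, how many were right
const recall = tp / (tp + fn); // of actual positives, how many were found

console.log(precision.toFixed(3)); // "0.667"
console.log(recall.toFixed(3)); // "0.667"
```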
505
100
 
506
- **One-argument form (distribution assertions only):**
101
+ ### Without ground truth (distribution monitoring)
507
102
 
508
103
  ```javascript
509
- // For distribution monitoring without ground truth
510
- expectStats(predictions).field("confidence").percentageAbove(0.7).toBeAtLeast(0.8);
104
+ expectStats(llmOutputs).field("toxicity_score").percentageBelow(0.3).toBeAtLeast(0.95); // 95% of outputs must be non-toxic
511
105
  ```
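What `percentageBelow(0.3).toBeAtLeast(0.95)` measures, sketched in plain JavaScript: the fraction of field values at or below the threshold ("below or equal", per the reporter output). The sample scores are invented:

```javascript
// Fraction of values <= threshold — the quantity percentageBelow gates on.
const llmOutputs = [
  { id: "1", toxicity_score: 0.05 },
  { id: "2", toxicity_score: 0.1 },
  { id: "3", toxicity_score: 0.6 },
  { id: "4", toxicity_score: 0.2 },
];
const threshold = 0.3;
const fraction =
  llmOutputs.filter((o) => o.toxicity_score <= threshold).length /
  llmOutputs.length;

console.log(fraction); // 0.75 — this sample would fail a .toBeAtLeast(0.95) gate
```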
512
106
 
513
- **Common use cases:**
514
-
515
- - Classification evaluation with ground truth
516
- - Regression evaluation (MAE, RMSE, R²)
517
- - Validating LLM judges against human labels
518
- - Distribution monitoring without ground truth
519
-
520
- ### Field Selection
521
-
522
- #### `.field(fieldName)`
523
-
524
- Selects a field for evaluation.
107
+ ## LLM-Based Metrics
525
108
 
526
109
  ```javascript
527
- expectStats(result).field("sentiment");
528
- ```
529
-
530
- #### `.binarize(threshold)`
531
-
532
- Converts continuous scores to binary (>=threshold is true).
533
-
534
- ```javascript
535
- expectStats(result)
536
- .field("score")
537
- .binarize(0.5) // score >= 0.5 is true
538
- .accuracy.toBeAtLeast(0.8);
539
- ```
540
-
541
- ### Available Assertions
542
-
543
- #### Classification Metrics
544
-
545
- ```javascript
546
- // Accuracy (macro average for multi-class)
547
- .accuracy.toBeAtLeast(threshold)
548
- .accuracy.toBeAbove(threshold)
549
- .accuracy.toBeAtMost(threshold)
550
- .accuracy.toBeBelow(threshold)
551
-
552
- // Precision (per class or macro average)
553
- .precision("className").toBeAtLeast(threshold)
554
- .precision().toBeAtLeast(threshold) // macro average
555
-
556
- // Recall (per class or macro average)
557
- .recall("className").toBeAtLeast(threshold)
558
- .recall().toBeAtLeast(threshold) // macro average
559
-
560
- // F1 Score (macro average)
561
- .f1.toBeAtLeast(threshold)
562
- .f1.toBeAbove(threshold)
563
-
564
- // Regression Metrics
565
- .mae.toBeAtMost(threshold) // Mean Absolute Error
566
- .rmse.toBeAtMost(threshold) // Root Mean Squared Error
567
- .r2.toBeAtLeast(threshold) // R² coefficient
568
-
569
- // Confusion Matrix
570
- .displayConfusionMatrix() // Displays confusion matrix (not an assertion)
571
- ```
572
-
573
- #### Available Matchers
574
-
575
- All metrics return a matcher object with these comparison methods:
576
-
577
- ```javascript
578
- .toBeAtLeast(x) // >= x
579
- .toBeAbove(x) // > x
580
- .toBeAtMost(x) // <= x
581
- .toBeBelow(x) // < x
582
- .toEqual(x, tolerance?) // === x (with optional tolerance for floats)
583
- ```
584
-
585
- #### Distribution Assertions
586
-
587
- Distribution assertions validate output distributions **without requiring ground truth**. Use these to monitor that model outputs stay within expected ranges.
588
-
589
- ```javascript
590
- // Assert that at least 80% of confidence scores are above 0.7
591
- expectStats(predictions).field("confidence").percentageAbove(0.7).toBeAtLeast(0.8);
592
-
593
- // Assert that at least 90% of toxicity scores are below 0.3
594
- expectStats(predictions).field("toxicity").percentageBelow(0.3).toBeAtLeast(0.9);
595
-
596
- // Chain multiple distribution assertions
597
- expectStats(predictions)
598
- .field("score")
599
- .percentageAbove(0.5).toBeAtLeast(0.6) // At least 60% above 0.5
600
- .percentageBelow(0.9).toBeAtLeast(0.8); // At least 80% below 0.9
601
- ```
602
-
603
- **Use cases:**
604
-
605
- - Monitor confidence score distributions
606
- - Validate schema compliance rates
607
- - Check output range constraints
608
- - Ensure score distributions remain stable over time
609
-
610
- See [Distribution Assertions Example](./examples/distribution-assertions.eval.js) for complete examples.
611
-
612
- ### Judge Validation
613
-
614
- Validate judge outputs against human-labeled ground truth using the **two-argument expectStats API**:
615
-
616
- ```javascript
617
- // Judge outputs (predictions from your judge/metric)
618
- const judgeOutputs = [
619
- { id: "1", hallucinated: true },
620
- { id: "2", hallucinated: false },
621
- { id: "3", hallucinated: true },
622
- ];
623
-
624
- // Human labels (ground truth)
625
- const humanLabels = [
626
- { id: "1", hallucinated: true },
627
- { id: "2", hallucinated: false },
628
- { id: "3", hallucinated: false },
629
- ];
630
-
631
- // Validate judge performance
632
- expectStats(judgeOutputs, humanLabels)
633
- .field("hallucinated")
634
- .recall(true).toBeAtLeast(0.9) // Don't miss hallucinations
635
- .precision(true).toBeAtLeast(0.7) // Some false positives OK
636
- .displayConfusionMatrix();
637
- ```
638
-
639
- **Use cases:**
640
-
641
- - Evaluate LLM-as-judge accuracy
642
- - Validate heuristic metrics against human labels
643
- - Test automated detection systems (refusal, policy compliance)
644
- - Calibrate metric thresholds
645
-
646
- **Two-argument expectStats:**
647
-
648
- ```javascript
649
- expectStats(actual, expected).field("fieldName").accuracy.toBeAtLeast(0.8);
650
- ```
651
-
652
- The first argument is your predictions (judge outputs), the second is ground truth (human labels). Both must have matching `id` fields for alignment.
653
-
654
- See [Judge Validation Example](./examples/judge-validation.eval.js) for complete examples.
655
-
656
- For comprehensive guidance on evaluating agent systems, see [Agent Judges Design Patterns](./docs/agent-judges.md).
110
+ import { setLLMClient, createAnthropicAdapter } from "evalsense/metrics";
111
+ import { hallucination, relevance } from "evalsense/metrics/opinionated";
657
112
 
658
- ## Dataset Format
659
-
660
- Datasets must be JSON arrays where each record has an `id` or `_id` field:
661
-
662
- ```json
663
- [
664
- {
665
- "id": "1",
666
- "text": "input text",
667
- "label": "expected_output"
668
- },
669
- {
670
- "id": "2",
671
- "text": "another input",
672
- "label": "another_output"
673
- }
674
- ]
675
- ```
676
-
677
- **Requirements:**
678
-
679
- - Each record MUST have `id` or `_id` for alignment
680
- - Ground truth fields (e.g., `label`, `sentiment`, `category`) are compared against model outputs
681
- - Model functions must return predictions with matching `id`
682
-
683
- ## Exit Codes
684
-
685
- evalsense returns specific exit codes for CI integration:
686
-
687
- - `0` - Success (all tests passed)
688
- - `1` - Assertion failure (statistical thresholds not met)
689
- - `2` - Integrity failure (dataset alignment issues)
690
- - `3` - Execution error (test threw exception)
691
- - `4` - Configuration error (invalid CLI options)
692
-
693
- ## Writing Eval Files
694
-
695
- Eval files use the `.eval.js` or `.eval.ts` extension and are discovered automatically:
696
-
697
- ```
698
- project/
699
- ├── tests/
700
- │ ├── classifier.eval.js
701
- │ └── hallucination.eval.js
702
- ├── data/
703
- │ └── dataset.json
704
- └── package.json
705
- ```
706
-
707
- Run with:
708
-
709
- ```bash
710
- npx evalsense run tests/
711
- ```
712
-
713
- ## Examples
714
-
715
- See the [`examples/`](./examples/) directory for complete examples:
716
-
717
- - [`classification.eval.js`](./examples/basic/classification.eval.js) - Binary sentiment classification
718
- - [`hallucination.eval.js`](./examples/basic/hallucination.eval.js) - Continuous score binarization
719
- - [`distribution-assertions.eval.js`](./examples/distribution-assertions.eval.js) - Distribution monitoring without ground truth
720
- - [`judge-validation.eval.js`](./examples/judge-validation.eval.js) - Validating judges against human labels
721
-
722
- ## Field Types
723
-
724
- evalsense automatically determines evaluation metrics based on field values:
725
-
726
- - **Boolean** (`true`/`false`) → Binary classification metrics
727
- - **Categorical** (strings) → Multi-class classification metrics
728
- - **Numeric** (numbers) → Regression metrics (MAE, MSE, RMSE, R²)
729
- - **Numeric + threshold** → Binarized classification metrics
730
-
731
- ## LLM-Based Metrics (v0.2.0+)
732
-
733
- evalsense includes LLM-powered metrics for hallucination detection, relevance assessment, faithfulness verification, and toxicity detection.
734
-
735
- ### Quick Setup
736
-
737
- ```javascript
738
- import { setLLMClient, createOpenAIAdapter } from "evalsense/metrics";
739
- import { hallucination, relevance, faithfulness, toxicity } from "evalsense/metrics/opinionated";
740
-
741
- // 1. Configure your LLM client (one-time setup)
742
113
  setLLMClient(
743
- createOpenAIAdapter(process.env.OPENAI_API_KEY, {
744
- model: "gpt-4-turbo-preview",
745
- temperature: 0,
114
+ createAnthropicAdapter(process.env.ANTHROPIC_API_KEY, {
115
+ model: "claude-haiku-4-5-20251001",
746
116
  })
747
117
  );
748
118
 
749
- // 2. Use metrics in evaluations
750
- const results = await hallucination({
119
+ const scores = await hallucination({
751
120
  outputs: [{ id: "1", output: "Paris has 50 million people." }],
752
121
  context: ["Paris has approximately 2.1 million residents."],
753
122
  });
754
-
755
- console.log(results[0].score); // 0.9 (high hallucination)
756
- console.log(results[0].reasoning); // "Output claims 50M, context says 2.1M"
757
- ```
758
-
759
- ### Available Metrics
760
-
761
- - **`hallucination()`** - Detects claims not supported by context
762
- - **`relevance()`** - Measures query-response alignment
763
- - **`faithfulness()`** - Verifies outputs don't contradict sources
764
- - **`toxicity()`** - Identifies harmful or inappropriate content
765
-
766
- ### Evaluation Modes
767
-
768
- Choose between accuracy and cost:
769
-
770
- ```javascript
771
- // Per-row: Higher accuracy, higher cost (N API calls)
772
- await hallucination({
773
- outputs,
774
- context,
775
- evaluationMode: "per-row", // default
776
- });
777
-
778
- // Batch: Lower cost, single API call
779
- await hallucination({
780
- outputs,
781
- context,
782
- evaluationMode: "batch",
783
- });
123
+ // scores[0].score → 0.9 (high hallucination)
124
+ // scores[0].reasoning → "Output claims 50M, context says 2.1M"
784
125
  ```
785
126
 
786
- ### Built-in Provider Adapters
127
+ Built-in providers: OpenAI, Anthropic, OpenRouter, or bring your own adapter.
128
+ See [LLM Metrics Guide](./docs/llm-metrics.md) and [Adapters Guide](./docs/llm-adapters.md).
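The "bring your own adapter" path is the same as in the v0.4.0 README: any object with an async `complete(prompt)` method that returns the model's text. A runnable sketch — `fakeLLM` is a stand-in so it works without a provider SDK:

```javascript
// A custom adapter is any object with an async complete(prompt) method
// (shape taken from the v0.4.0 README's custom-adapter example).
const fakeLLM = {
  async generate(prompt) {
    return { text: `echo: ${prompt}` };
  },
};

const customAdapter = {
  async complete(prompt) {
    const response = await fakeLLM.generate(prompt);
    return response.text;
  },
};

// In an eval file you would register it with setLLMClient(customAdapter).
customAdapter.complete("hello").then((text) => console.log(text)); // "echo: hello"
```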
787
129
 
788
- evalsense includes ready-to-use adapters for popular LLM providers:
130
+ ## Using with Claude Code (Vibe Check)
789
131
 
790
- **OpenAI (GPT-4, GPT-3.5)**
132
+ evalsense includes an example [Claude Code skill](./skill.md) that acts as an automated LLM quality gate. To set it up in your project:
791
133
 
792
- ```javascript
793
- import { createOpenAIAdapter } from "evalsense/metrics";
134
+ 1. Install evalsense as a dev dependency
135
+ 2. Copy [`skill.md`](./skill.md) into your project at `.claude/skills/llm-quality-gate/SKILL.md`
136
+ 3. After building any LLM feature, run `/llm-quality-gate` in Claude Code
794
137
 
795
- // npm install openai
796
- setLLMClient(
797
- createOpenAIAdapter(process.env.OPENAI_API_KEY, {
798
- model: "gpt-4-turbo-preview", // or "gpt-3.5-turbo" for lower cost
799
- temperature: 0,
800
- maxTokens: 4096,
801
- })
802
- );
803
- ```
804
-
805
- **Anthropic (Claude)**
138
+ Claude will automatically create a `.eval.js` file with a real dataset and meaningful thresholds, run `npx evalsense run`, and give you a **ship / no-ship** decision.
806
139
 
807
- ```javascript
808
- import { createAnthropicAdapter } from "evalsense/metrics";
140
+ ## Documentation
809
141
 
810
- // npm install @anthropic-ai/sdk
811
- setLLMClient(
812
- createAnthropicAdapter(process.env.ANTHROPIC_API_KEY, {
813
- model: "claude-3-5-sonnet-20241022", // or "claude-3-haiku-20240307" for speed
814
- maxTokens: 4096,
815
- })
816
- );
817
- ```
142
+ | Guide | Description |
143
+ | -------------------------------------------------- | ------------------------------------------------ |
144
+ | [API Reference](./docs/api-reference.md) | Full API — all assertions, matchers, metrics |
145
+ | [CLI Reference](./docs/cli.md) | All CLI flags, exit codes, CI integration |
146
+ | [LLM Metrics](./docs/llm-metrics.md) | Hallucination, relevance, faithfulness, toxicity |
147
+ | [LLM Adapters](./docs/llm-adapters.md) | OpenAI, Anthropic, OpenRouter, custom adapters |
148
+ | [Custom Metrics](./docs/custom-metrics-guide.md) | Pattern and keyword metrics |
149
+ | [Agent Judges](./docs/agent-judges.md) | Design patterns for evaluating agent systems |
150
+ | [Regression Metrics](./docs/regression-metrics.md) | MAE, RMSE, R² usage |
151
+ | [Examples](./examples/) | Working code examples |
818
152
 
819
- **OpenRouter (100+ models from one API)**
153
+ ## Dataset Format
820
154
 
821
- ```javascript
822
- import { createOpenRouterAdapter } from "evalsense/metrics";
155
+ Records must have an `id` or `_id` field:
823
156
 
824
- // No SDK needed - uses fetch
825
- setLLMClient(
826
- createOpenRouterAdapter(process.env.OPENROUTER_API_KEY, {
827
- model: "anthropic/claude-3.5-sonnet", // or "openai/gpt-3.5-turbo", etc.
828
- temperature: 0,
829
- appName: "my-eval-system",
830
- })
831
- );
157
+ ```json
158
+ [
159
+ { "id": "1", "text": "sample input", "label": "positive" },
160
+ { "id": "2", "text": "another input", "label": "negative" }
161
+ ]
832
162
  ```
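Alignment happens on that `id`: predictions are joined to ground-truth records by id before any scoring (per the v0.4.0 API docs). A plain-JavaScript sketch of the join, not the library's implementation:

```javascript
// Join predictions to ground truth on "id" before comparing fields.
const groundTruth = [
  { id: "1", label: "positive" },
  { id: "2", label: "negative" },
];
const predictions = [
  { id: "2", label: "negative" }, // record order does not matter; ids do
  { id: "1", label: "negative" },
];

const truthById = new Map(groundTruth.map((r) => [r.id, r]));
const matched = predictions.map((p) => ({
  predicted: p.label,
  expected: truthById.get(p.id).label,
}));
const accuracy =
  matched.filter((m) => m.predicted === m.expected).length / matched.length;

console.log(accuracy); // 0.5 — id "1" was misclassified
```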
833
163
 
834
- **Custom Adapter (for any provider)**
164
+ ## Exit Codes
835
165
 
836
- ```javascript
837
- setLLMClient({
838
- async complete(prompt) {
839
- // Implement for your LLM provider
840
- const response = await yourLLM.generate(prompt);
841
- return response.text;
842
- },
843
- });
844
- ```
166
+ | Code | Meaning |
167
+ | ---- | ------------------------- |
168
+ | `0` | All tests passed |
169
+ | `1` | Assertion failure |
170
+ | `2` | Dataset integrity failure |
171
+ | `3` | Execution error |
172
+ | `4` | Configuration error |
845
173
 
846
- ### Learn More
174
+ ## Contributing
847
175
 
848
- - [LLM Metrics Guide](./docs/llm-metrics.md) - Complete usage guide
849
- - [LLM Adapters Guide](./docs/llm-adapters.md) - Implement adapters for different providers
850
- - [Migration Guide](./docs/migration-v0.2.md) - Upgrade from v0.1.x
851
- - [Examples](./examples/) - Working code examples
176
+ Contributions are welcome. See [CONTRIBUTING.md](./CONTRIBUTING.md) for setup, coding standards, and the PR process.
852
177
 
853
- ## Contributing
178
+ ## License
854
179
 
855
- Contributions are welcome! Please see [CLAUDE.md](./CLAUDE.md) for development guidelines.
180
+ [Apache 2.0](./LICENSE)