evalsense 0.3.1 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (38)
  1. package/README.md +309 -149
  2. package/dist/{chunk-BE7CB3AM.cjs → chunk-4BKZPVY4.cjs} +50 -13
  3. package/dist/chunk-4BKZPVY4.cjs.map +1 -0
  4. package/dist/{chunk-K6QPJ2NO.js → chunk-IUVDDMJ3.js} +50 -13
  5. package/dist/chunk-IUVDDMJ3.js.map +1 -0
  6. package/dist/chunk-NCCQRZ2Y.cjs +1141 -0
  7. package/dist/chunk-NCCQRZ2Y.cjs.map +1 -0
  8. package/dist/chunk-TDGWDK2L.js +1108 -0
  9. package/dist/chunk-TDGWDK2L.js.map +1 -0
  10. package/dist/cli.cjs +11 -11
  11. package/dist/cli.js +1 -1
  12. package/dist/index-CATqAHNK.d.cts +416 -0
  13. package/dist/index-CoMpaW-K.d.ts +416 -0
  14. package/dist/index.cjs +507 -629
  15. package/dist/index.cjs.map +1 -1
  16. package/dist/index.d.cts +210 -161
  17. package/dist/index.d.ts +210 -161
  18. package/dist/index.js +455 -573
  19. package/dist/index.js.map +1 -1
  20. package/dist/metrics/index.cjs +103 -342
  21. package/dist/metrics/index.cjs.map +1 -1
  22. package/dist/metrics/index.d.cts +260 -31
  23. package/dist/metrics/index.d.ts +260 -31
  24. package/dist/metrics/index.js +24 -312
  25. package/dist/metrics/index.js.map +1 -1
  26. package/dist/metrics/opinionated/index.cjs +5 -5
  27. package/dist/metrics/opinionated/index.d.cts +2 -163
  28. package/dist/metrics/opinionated/index.d.ts +2 -163
  29. package/dist/metrics/opinionated/index.js +1 -1
  30. package/dist/{types-C71p0wzM.d.cts → types-D0hzfyKm.d.cts} +1 -13
  31. package/dist/{types-C71p0wzM.d.ts → types-D0hzfyKm.d.ts} +1 -13
  32. package/package.json +1 -1
  33. package/dist/chunk-BE7CB3AM.cjs.map +0 -1
  34. package/dist/chunk-K6QPJ2NO.js.map +0 -1
  35. package/dist/chunk-RZFLCWTW.cjs +0 -942
  36. package/dist/chunk-RZFLCWTW.cjs.map +0 -1
  37. package/dist/chunk-Z3U6AUWX.js +0 -925
  38. package/dist/chunk-Z3U6AUWX.js.map +0 -1
package/README.md CHANGED
@@ -5,32 +5,183 @@
5
5
  [![npm version](https://img.shields.io/npm/v/evalsense.svg)](https://www.npmjs.com/package/evalsense)
6
6
  [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
7
7
 
8
- **evalsense** brings classical ML-style statistical evaluation to LLM systems in JavaScript. Instead of evaluating individual test cases, evalsense evaluates entire datasets and computes confusion matrices, precision/recall, F1 scores, and other statistical metrics.
8
+ # evalsense
9
+
10
+ **evalsense is like Jest for testing code that uses LLMs.**
11
+
12
+ It helps engineers answer one simple question:
13
+
14
+ > **“Is my LLM-powered code good enough to ship?”**
15
+
16
+ Instead of checking a few example responses, evalsense runs your code across many inputs, measures overall quality, and gives you a clear **pass / fail** result — locally or in CI.
17
+
18
+ evalsense is built for **engineers deploying LLM-enabled features**, not for training or benchmarking models.
19
+
20
+ ## What problem does evalsense solve?
21
+
22
+ Most LLM evaluation tools focus on individual outputs:
23
+
24
+ > _“How good is this one response?”_
25
+
26
+ That’s useful, but it doesn’t tell you whether your system is reliable.
27
+
28
+ evalsense answers a different question:
29
+
30
+ > **“Does my code consistently meet our quality bar?”**
31
+
32
+ It treats evaluation like testing:
33
+
34
+ - run your code many times
35
+ - measure results across all runs
36
+ - fail fast if quality drops
37
+
38
+ ## How evalsense works (in plain terms)
39
+
40
+ At a high level, evalsense:
41
+
42
+ 1. Runs your code
43
+ (this can be a function, module, or API call, or you can supply a fixed dataset of outputs)
44
+ 2. Collects the results
45
+ 3. Scores them using:
46
+ - standard metrics (accuracy, precision, recall, F1)
47
+ - LLM-as-judge checks (e.g. relevance, hallucination, correctness)
48
+
49
+ 4. Aggregates scores across all results
50
+ 5. Applies rules you define
51
+ 6. Passes or fails the test
52
+
53
+ Think of it as **unit tests for output quality**.
54
+
55
+ ## A quick example
56
+
57
+ ```ts
58
+ describe("test answer quality", () => {
59
+ evalTest("toxicity detection", async () => {
60
+ const answers = await generateAnswersDataset(testQuestions);
61
+ const toxicityScore = await toxicity(answers);
62
+
63
+ expectStats(toxicityScore)
64
+ .field("score")
65
+ .percentageBelow(0.5).toBeAtLeast(0.5);
66
+ });
67
+
68
+ evalTest("correctness score", async () => {
69
+ const answers = await generateAnswersDataset(testQuestions);
70
+ const groundTruth = JSON.parse(readFileSync("truth-dataset.json", "utf-8"));
71
+
72
+ expectStats(answers, groundTruth)
73
+ .field("label")
74
+ .accuracy.toBeAtLeast(0.9)
75
+ .precision("positive").toBeAtLeast(0.7)
76
+ .recall("positive").toBeAtLeast(0.7)
77
+ .displayConfusionMatrix();
78
+ });
79
+ });
80
+ ```
81
+
82
+ Running the test:
83
+
84
+ ```markdown
85
+ **test answer quality**
86
+
87
+ ✓ toxicity detection (1ms)
88
+ ✓ 50.0% of 'score' values are below or equal to 0.5 (expected >= 50.0%)
89
+ Expected: 50.0%
90
+ Actual: 50.0%
91
+
92
+ ✓ correctness score (1ms)
93
+ Field: label | Accuracy: 100.0% | F1: 100.0%
94
+ negative: P=100.0% R=100.0% F1=100.0% (n=5)
95
+ positive: P=100.0% R=100.0% F1=100.0% (n=5)
96
+
97
+ Confusion Matrix: label
98
+
99
+ Predicted → negative positive
100
+ Actual ↓
101
+ negative 5 0
102
+ positive 0 5
103
+
104
+ ✓ Accuracy 100.0% >= 90.0%
105
+ Expected: 90.0%
106
+ Actual: 100.0%
107
+ ✓ Precision for 'positive' 100.0% >= 70.0%
108
+ Expected: 70.0%
109
+ Actual: 100.0%
110
+ ✓ Recall for 'positive' 100.0% >= 70.0%
111
+ Expected: 70.0%
112
+ Actual: 100.0%
113
+ ✓ Confusion matrix recorded for field "label"
114
+ ```
115
+
116
+ If the quality drops, the test fails — just like a normal test.
117
+
118
+ ## Two common ways to use evalsense
119
+
120
+ ### 1. When you **don’t have ground truth**
121
+
122
+ Use this when there are no labels.
123
+
124
+ Example:
125
+
126
+ - Run your LLM-powered function
127
+ - Score outputs using an LLM-as-judge (relevance, hallucination, etc.)
128
+ - Define what “acceptable” means
129
+ - Fail if quality degrades
130
+
131
+ **Example rule:**
132
+
133
+ > “Average relevance score must be at least 0.75”
9
134
 
10
- > **New in v0.3.0:** Regression assertions (MAE, RMSE, R²) and flexible ID matching for custom identifier fields! [See migration guide](./docs/migration-v0.3.0.md).
11
- > **New in v0.2.x:** Built-in adapters for OpenAI, Anthropic, and OpenRouter - no boilerplate needed!
12
- > **New in v0.2.0:** LLM-powered metrics for hallucination, relevance, faithfulness, and toxicity detection. [See migration guide](./docs/migration-v0.2.md).
135
+ ### 2. When you **do have ground truth**
13
136
 
14
- ## Why evalsense?
137
+ Use this when correct answers are known.
15
138
 
16
- Most LLM evaluation tools stop at producing scores (accuracy, relevance, hallucination). evalsense goes further by:
139
+ Example:
17
140
 
18
- - Computing **confusion matrices** to reveal systematic failure patterns
19
- - Analyzing **false positives vs false negatives** across datasets
20
- - Treating **metrics as predictions, not truth** (and validating them statistically)
21
- - Providing a **Jest-like API** that fits naturally into JS/Node workflows
22
- - Supporting **deterministic CI/CD** integration with specific exit codes
141
+ - Run your prediction code
142
+ - Compare outputs with ground truth
143
+ - Compute accuracy, precision, recall, F1
144
+ - Optionally add LLM-as-judge checks
145
+ - Fail if metrics fall below thresholds
23
146
 
24
- ## Features
147
+ **Example rule:**
25
148
 
26
- - 📊 **Dataset-level evaluation** - evaluate distributions, not single examples
27
- - 🎯 **Statistical rigor** - confusion matrices, precision/recall, F1, regression metrics
28
- - 🧪 **Jest-like API** - familiar `describe()` and test patterns
29
- - 🤖 **LLM-powered metrics** - hallucination, relevance, faithfulness, toxicity with explainable reasoning
30
- - **Dual evaluation modes** - choose between accuracy (per-row) or cost efficiency (batch)
31
- - 🔄 **CI-friendly** - deterministic execution, machine-readable reports
32
- - 🚀 **JS-native** - first-class TypeScript support, works with any Node.js LLM library
33
- - 🔌 **Composable** - evaluate outputs from your existing LLM code
149
+ > “F1 score must be at least 0.85 and false positives ≤ 5%”
150
+
151
+ ## What evalsense is _not_
152
+
153
+ evalsense is **not**:
154
+
155
+ - A tool for scoring single responses in isolation
156
+ - A dashboard or experiment-tracking platform
157
+ - A system for analyzing agent step-by-step traces
158
+ - A model benchmarking or training framework
159
+
160
+ If you mainly want scores, charts, or leaderboards, other tools may be a better fit.
161
+
162
+ ## Who should use evalsense
163
+
164
+ evalsense is a good fit if you:
165
+
166
+ - are **shipping LLM-powered features**
167
+ - want **clear pass/fail quality gates**
168
+ - run checks in **CI/CD**
169
+ - care about **regressions** (“did this get worse?”)
170
+ - already think in terms of tests
171
+ - work in **JavaScript / TypeScript**
172
+
173
+ ## Who should _not_ use evalsense
174
+
175
+ evalsense may not be right for you if you:
176
+
177
+ - only care about individual output scores
178
+ - want visual dashboards or experiment UIs
179
+ - need deep agent trace inspection
180
+ - are training or benchmarking foundation models
181
+
182
+ ## In one sentence
183
+
184
+ **evalsense lets you test the quality of LLM-powered code the same way you test everything else — with clear pass/fail results.**
34
185
 
35
186
  ## Installation
36
187
 
@@ -49,35 +200,35 @@ yarn add -D evalsense
49
200
  Create a file named `sentiment.eval.js`:
50
201
 
51
202
  ```javascript
52
- import { describe, evalTest, expectStats, loadDataset, runModel } from "evalsense";
203
+ import { describe, evalTest, expectStats } from "evalsense";
204
+ import { readFileSync } from "fs";
53
205
 
54
206
  // Your model function - can be any JS function
55
- function classifySentiment(record) {
56
- const text = record.text.toLowerCase();
57
- const hasPositive = /love|amazing|great|fantastic|perfect/.test(text);
58
- const hasNegative = /terrible|worst|disappointed|waste/.test(text);
59
-
60
- return {
61
- id: record.id,
62
- sentiment: hasPositive && !hasNegative ? "positive" : "negative",
63
- };
207
+ function classifySentiment(text) {
208
+ const lower = text.toLowerCase();
209
+ const hasPositive = /love|amazing|great|fantastic|perfect/.test(lower);
210
+ const hasNegative = /terrible|worst|disappointed|waste/.test(lower);
211
+ return hasPositive && !hasNegative ? "positive" : "negative";
64
212
  }
65
213
 
66
214
  describe("Sentiment classifier", () => {
67
215
  evalTest("accuracy above 80%", async () => {
68
- // 1. Load dataset with ground truth
69
- const dataset = loadDataset("./sentiment.json");
216
+ // 1. Load ground truth data
217
+ const groundTruth = JSON.parse(readFileSync("./sentiment.json", "utf-8"));
70
218
 
71
- // 2. Run your model on the dataset
72
- const result = await runModel(dataset, classifySentiment);
219
+ // 2. Run your model and collect predictions
220
+ const predictions = groundTruth.map((record) => ({
221
+ id: record.id,
222
+ sentiment: classifySentiment(record.text),
223
+ }));
73
224
 
74
225
  // 3. Assert on statistical properties
75
- expectStats(result)
226
+ expectStats(predictions, groundTruth)
76
227
  .field("sentiment")
77
- .toHaveAccuracyAbove(0.8)
78
- .toHaveRecallAbove("positive", 0.7)
79
- .toHavePrecisionAbove("positive", 0.7)
80
- .toHaveConfusionMatrix();
228
+ .accuracy.toBeAtLeast(0.8)
229
+ .recall("positive").toBeAtLeast(0.7)
230
+ .precision("positive").toBeAtLeast(0.7)
231
+ .displayConfusionMatrix();
81
232
  });
82
233
  });
83
234
  ```
@@ -103,23 +254,24 @@ npx evalsense run sentiment.eval.js
103
254
  ### Basic Classification Example
104
255
 
105
256
  ```javascript
106
- import { describe, evalTest, expectStats, loadDataset, runModel } from "evalsense";
257
+ import { describe, evalTest, expectStats } from "evalsense";
258
+ import { readFileSync } from "fs";
107
259
 
108
260
  describe("Spam classifier", () => {
109
261
  evalTest("high precision and recall", async () => {
110
- const dataset = loadDataset("./emails.json");
262
+ const groundTruth = JSON.parse(readFileSync("./emails.json", "utf-8"));
111
263
 
112
- const result = await runModel(dataset, (record) => ({
264
+ const predictions = groundTruth.map((record) => ({
113
265
  id: record.id,
114
266
  isSpam: classifyEmail(record.text),
115
267
  }));
116
268
 
117
- expectStats(result)
269
+ expectStats(predictions, groundTruth)
118
270
  .field("isSpam")
119
- .toHaveAccuracyAbove(0.9)
120
- .toHavePrecisionAbove(true, 0.85) // Precision for spam=true
121
- .toHaveRecallAbove(true, 0.85) // Recall for spam=true
122
- .toHaveConfusionMatrix();
271
+ .accuracy.toBeAtLeast(0.9)
272
+ .precision(true).toBeAtLeast(0.85) // Precision for spam=true
273
+ .recall(true).toBeAtLeast(0.85) // Recall for spam=true
274
+ .displayConfusionMatrix();
123
275
  });
124
276
  });
125
277
  ```
@@ -127,25 +279,26 @@ describe("Spam classifier", () => {
127
279
  ### Continuous Scores with Binarization
128
280
 
129
281
  ```javascript
130
- import { describe, evalTest, expectStats, loadDataset, runModel } from "evalsense";
282
+ import { describe, evalTest, expectStats } from "evalsense";
283
+ import { readFileSync } from "fs";
131
284
 
132
285
  describe("Hallucination detector", () => {
133
286
  evalTest("detect hallucinations with 70% recall", async () => {
134
- const dataset = loadDataset("./outputs.json");
287
+ const groundTruth = JSON.parse(readFileSync("./outputs.json", "utf-8"));
135
288
 
136
- // Your model returns a continuous score
137
- const result = await runModel(dataset, (record) => ({
289
+ // Your model returns a continuous score (0.0 to 1.0)
290
+ const predictions = groundTruth.map((record) => ({
138
291
  id: record.id,
139
- hallucinated: computeHallucinationScore(record.output), // 0.0 to 1.0
292
+ hallucinated: computeHallucinationScore(record.output),
140
293
  }));
141
294
 
142
295
  // Binarize the score at threshold 0.3
143
- expectStats(result)
296
+ expectStats(predictions, groundTruth)
144
297
  .field("hallucinated")
145
298
  .binarize(0.3) // >= 0.3 means hallucinated
146
- .toHaveRecallAbove(true, 0.7)
147
- .toHavePrecisionAbove(true, 0.6)
148
- .toHaveConfusionMatrix();
299
+ .recall(true).toBeAtLeast(0.7)
300
+ .precision(true).toBeAtLeast(0.6)
301
+ .displayConfusionMatrix();
149
302
  });
150
303
  });
151
304
  ```
@@ -153,50 +306,62 @@ describe("Hallucination detector", () => {
153
306
  ### Multi-class Classification
154
307
 
155
308
  ```javascript
156
- import { describe, evalTest, expectStats, loadDataset, runModel } from "evalsense";
309
+ import { describe, evalTest, expectStats } from "evalsense";
310
+ import { readFileSync } from "fs";
157
311
 
158
312
  describe("Intent classifier", () => {
159
313
  evalTest("balanced performance across intents", async () => {
160
- const dataset = loadDataset("./intents.json");
314
+ const groundTruth = JSON.parse(readFileSync("./intents.json", "utf-8"));
161
315
 
162
- const result = await runModel(dataset, (record) => ({
316
+ const predictions = groundTruth.map((record) => ({
163
317
  id: record.id,
164
318
  intent: classifyIntent(record.query),
165
319
  }));
166
320
 
167
- expectStats(result)
321
+ expectStats(predictions, groundTruth)
168
322
  .field("intent")
169
- .toHaveAccuracyAbove(0.85)
170
- .toHaveRecallAbove("purchase", 0.8)
171
- .toHaveRecallAbove("support", 0.8)
172
- .toHaveRecallAbove("general", 0.7)
173
- .toHaveConfusionMatrix();
323
+ .accuracy.toBeAtLeast(0.85)
324
+ .recall("purchase").toBeAtLeast(0.8)
325
+ .recall("support").toBeAtLeast(0.8)
326
+ .recall("general").toBeAtLeast(0.7)
327
+ .displayConfusionMatrix();
174
328
  });
175
329
  });
176
330
  ```
177
331
 
178
- ### Parallel Model Execution
332
+ ### Parallel Model Execution with LLMs
179
333
 
180
- For LLM calls or slow operations, use parallel execution:
334
+ For LLM calls or slow operations, use `Promise.all` with chunking for concurrency control:
181
335
 
182
336
  ```javascript
183
- import { describe, evalTest, expectStats, loadDataset, runModelParallel } from "evalsense";
337
+ import { describe, evalTest, expectStats } from "evalsense";
338
+ import { readFileSync } from "fs";
339
+
340
+ // Helper for parallel execution with concurrency limit
341
+ async function mapConcurrent(items, fn, concurrency = 5) {
342
+ const results = [];
343
+ for (let i = 0; i < items.length; i += concurrency) {
344
+ const chunk = items.slice(i, i + concurrency);
345
+ results.push(...(await Promise.all(chunk.map(fn))));
346
+ }
347
+ return results;
348
+ }
184
349
 
185
350
  describe("LLM classifier", () => {
186
351
  evalTest("classification accuracy", async () => {
187
- const dataset = loadDataset("./data.json");
352
+ const groundTruth = JSON.parse(readFileSync("./data.json", "utf-8"));
188
353
 
189
354
  // Run with concurrency=5
190
- const result = await runModelParallel(
191
- dataset,
355
+ const predictions = await mapConcurrent(
356
+ groundTruth,
192
357
  async (record) => {
193
358
  const response = await callLLM(record.text);
194
359
  return { id: record.id, category: response.category };
195
360
  },
196
- 5 // concurrency limit
361
+ 5
197
362
  );
198
363
 
199
- expectStats(result).field("category").toHaveAccuracyAbove(0.9);
364
+ expectStats(predictions, groundTruth).field("category").accuracy.toBeAtLeast(0.9);
200
365
  });
201
366
  });
202
367
  ```
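As an aside, the id-based alignment and accuracy computation that `expectStats(predictions, groundTruth)` performs can be sketched in plain JavaScript. This is an illustrative approximation only, not evalsense's implementation; `accuracyByField` and the sample data are hypothetical.

```javascript
// Illustrative sketch: align predictions to ground truth by `id`,
// then compute plain accuracy on one field. Not evalsense's actual code.
function accuracyByField(predictions, groundTruth, field) {
  const truthById = new Map(groundTruth.map((r) => [r.id, r]));
  let correct = 0;
  let total = 0;
  for (const p of predictions) {
    const truth = truthById.get(p.id);
    if (!truth) continue; // unmatched ids are simply skipped in this sketch
    total += 1;
    if (p[field] === truth[field]) correct += 1;
  }
  return total === 0 ? 0 : correct / total;
}

const groundTruth = [
  { id: 1, category: "billing" },
  { id: 2, category: "support" },
  { id: 3, category: "billing" },
  { id: 4, category: "support" },
];
const predictions = [
  { id: 1, category: "billing" },
  { id: 2, category: "support" },
  { id: 3, category: "support" }, // one misclassification
  { id: 4, category: "support" },
];

const acc = accuracyByField(predictions, groundTruth, "category");
console.log(acc); // 0.75
```

An assertion like `.accuracy.toBeAtLeast(0.8)` would fail here, since 0.75 < 0.8.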
@@ -302,59 +467,55 @@ evalTest("should have 90% accuracy", async () => {
302
467
  });
303
468
  ```
304
469
 
305
- ### Dataset Functions
306
-
307
- #### `loadDataset(path)`
308
-
309
- Loads a dataset from a JSON file. Records must have an `id` or `_id` field.
310
-
311
- ```javascript
312
- const dataset = loadDataset("./data.json");
313
- ```
314
-
315
- #### `runModel(dataset, modelFn)`
470
+ ### Dataset Loading
316
471
 
317
- Runs a model function on each record sequentially.
472
+ evalsense doesn't dictate how you load data or run your model. Use standard Node.js tools:
318
473
 
319
474
  ```javascript
320
- const result = await runModel(dataset, (record) => ({
321
- id: record.id,
322
- prediction: classify(record.text),
323
- }));
324
- ```
475
+ import { readFileSync } from "fs";
325
476
 
326
- #### `runModelParallel(dataset, modelFn, concurrency)`
477
+ // Load ground truth
478
+ const groundTruth = JSON.parse(readFileSync("./data.json", "utf-8"));
327
479
 
328
- Runs a model function with parallel execution.
480
+ // Run your model however you want
481
+ const predictions = groundTruth.map(runYourModel);
329
482
 
330
- ```javascript
331
- const result = await runModelParallel(dataset, modelFn, 10); // concurrency=10
483
+ // Or use async operations
484
+ const predictions = await Promise.all(
485
+ groundTruth.map(async (item) => {
486
+ const result = await callLLM(item.text);
487
+ return { id: item.id, prediction: result };
488
+ })
489
+ );
332
490
  ```
333
491
 
334
492
  ### Assertions
335
493
 
336
- #### `expectStats(result)`
494
+ #### `expectStats(predictions, groundTruth)`
337
495
 
338
- Creates a statistical assertion chain from model results.
496
+ Creates a statistical assertion chain from predictions and ground truth. Aligns by `id` field.
339
497
 
340
498
  ```javascript
341
- expectStats(result).field("prediction").toHaveAccuracyAbove(0.8);
499
+ expectStats(predictions, groundTruth)
500
+ .field("prediction")
501
+ .accuracy.toBeAtLeast(0.8)
502
+ .f1.toBeAtLeast(0.75)
503
+ .displayConfusionMatrix();
342
504
  ```
343
505
 
344
- #### `expectStats(predictions, groundTruth)`
345
-
346
- Two-argument form for judge validation. Aligns predictions with ground truth by `id` field.
506
+ **One-argument form (distribution assertions only):**
347
507
 
348
508
  ```javascript
349
- // Validate judge outputs against human labels
350
- expectStats(judgeOutputs, humanLabels).field("label").toHaveAccuracyAbove(0.85);
509
+ // For distribution monitoring without ground truth
510
+ expectStats(predictions).field("confidence").percentageAbove(0.7).toBeAtLeast(0.8);
351
511
  ```
352
512
 
353
- **When to use:**
513
+ **Common use cases:**
354
514
 
515
+ - Classification evaluation with ground truth
516
+ - Regression evaluation (MAE, RMSE, R²)
355
517
  - Validating LLM judges against human labels
356
- - Evaluating metric quality
357
- - Testing automated detection systems
518
+ - Distribution monitoring without ground truth
358
519
 
359
520
  ### Field Selection
360
521
 
@@ -374,7 +535,7 @@ Converts continuous scores to binary (>=threshold is true).
374
535
  expectStats(result)
375
536
  .field("score")
376
537
  .binarize(0.5) // score >= 0.5 is true
377
- .toHaveAccuracyAbove(0.8);
538
+ .accuracy.toBeAtLeast(0.8);
378
539
  ```
379
540
 
380
541
  ### Available Assertions
@@ -382,43 +543,61 @@ expectStats(result)
382
543
  #### Classification Metrics
383
544
 
384
545
  ```javascript
385
- // Accuracy
386
- .toHaveAccuracyAbove(threshold)
387
- .toHaveAccuracyBelow(threshold)
388
- .toHaveAccuracyBetween(min, max)
546
+ // Accuracy (macro average for multi-class)
547
+ .accuracy.toBeAtLeast(threshold)
548
+ .accuracy.toBeAbove(threshold)
549
+ .accuracy.toBeAtMost(threshold)
550
+ .accuracy.toBeBelow(threshold)
551
+
552
+ // Precision (per class or macro average)
553
+ .precision("className").toBeAtLeast(threshold)
554
+ .precision().toBeAtLeast(threshold) // macro average
389
555
 
390
- // Precision (per class)
391
- .toHavePrecisionAbove(className, threshold)
392
- .toHavePrecisionBelow(className, threshold)
556
+ // Recall (per class or macro average)
557
+ .recall("className").toBeAtLeast(threshold)
558
+ .recall().toBeAtLeast(threshold) // macro average
393
559
 
394
- // Recall (per class)
395
- .toHaveRecallAbove(className, threshold)
396
- .toHaveRecallBelow(className, threshold)
560
+ // F1 Score (macro average)
561
+ .f1.toBeAtLeast(threshold)
562
+ .f1.toBeAbove(threshold)
397
563
 
398
- // F1 Score
399
- .toHaveF1Above(threshold) // Overall F1
400
- .toHaveF1Above(className, threshold) // Per-class F1
564
+ // Regression Metrics
565
+ .mae.toBeAtMost(threshold) // Mean Absolute Error
566
+ .rmse.toBeAtMost(threshold) // Root Mean Squared Error
567
+ .r2.toBeAtLeast(threshold) // R² coefficient
401
568
 
402
569
  // Confusion Matrix
403
- .toHaveConfusionMatrix() // Prints confusion matrix
570
+ .displayConfusionMatrix() // Displays confusion matrix (not an assertion)
571
+ ```
572
+
573
+ #### Available Matchers
574
+
575
+ All metrics return a matcher object with these comparison methods:
576
+
577
+ ```javascript
578
+ .toBeAtLeast(x) // >= x
579
+ .toBeAbove(x) // > x
580
+ .toBeAtMost(x) // <= x
581
+ .toBeBelow(x) // < x
582
+ .toEqual(x, tolerance?) // === x (with optional tolerance for floats)
404
583
  ```
405
584
 
406
- #### Distribution Assertions (Pattern 1)
585
+ #### Distribution Assertions
407
586
 
408
587
  Distribution assertions validate output distributions **without requiring ground truth**. Use these to monitor that model outputs stay within expected ranges.
409
588
 
410
589
  ```javascript
411
590
  // Assert that at least 80% of confidence scores are above 0.7
412
- expectStats(predictions).field("confidence").toHavePercentageAbove(0.7, 0.8);
591
+ expectStats(predictions).field("confidence").percentageAbove(0.7).toBeAtLeast(0.8);
413
592
 
414
593
  // Assert that at least 90% of toxicity scores are below 0.3
415
- expectStats(predictions).field("toxicity").toHavePercentageBelow(0.3, 0.9);
594
+ expectStats(predictions).field("toxicity").percentageBelow(0.3).toBeAtLeast(0.9);
416
595
 
417
596
  // Chain multiple distribution assertions
418
597
  expectStats(predictions)
419
598
  .field("score")
420
- .toHavePercentageAbove(0.5, 0.6) // At least 60% above 0.5
421
- .toHavePercentageBelow(0.9, 0.8); // At least 80% below 0.9
599
+ .percentageAbove(0.5).toBeAtLeast(0.6) // At least 60% above 0.5
600
+ .percentageBelow(0.9).toBeAtLeast(0.8); // At least 80% below 0.9
422
601
  ```
423
602
 
424
603
  **Use cases:**
@@ -430,7 +609,7 @@ expectStats(predictions)
430
609
 
431
610
  See [Distribution Assertions Example](./examples/distribution-assertions.eval.js) for complete examples.
432
611
 
433
- ### Judge Validation (Pattern 1b)
612
+ ### Judge Validation
434
613
 
435
614
  Validate judge outputs against human-labeled ground truth using the **two-argument expectStats API**:
436
615
 
@@ -452,9 +631,9 @@ const humanLabels = [
452
631
  // Validate judge performance
453
632
  expectStats(judgeOutputs, humanLabels)
454
633
  .field("hallucinated")
455
- .toHaveRecallAbove(true, 0.9) // Don't miss hallucinations
456
- .toHavePrecisionAbove(true, 0.7) // Some false positives OK
457
- .toHaveConfusionMatrix();
634
+ .recall(true).toBeAtLeast(0.9) // Don't miss hallucinations
635
+ .precision(true).toBeAtLeast(0.7) // Some false positives OK
636
+ .displayConfusionMatrix();
458
637
  ```
459
638
 
460
639
  **Use cases:**
@@ -467,7 +646,7 @@ expectStats(judgeOutputs, humanLabels)
467
646
  **Two-argument expectStats:**
468
647
 
469
648
  ```javascript
470
- expectStats(actual, expected).field("fieldName").toHaveAccuracyAbove(0.8);
649
+ expectStats(actual, expected).field("fieldName").accuracy.toBeAtLeast(0.8);
471
650
  ```
472
651
 
473
652
  The first argument is your predictions (judge outputs), the second is ground truth (human labels). Both must have matching `id` fields for alignment.
@@ -671,25 +850,6 @@ setLLMClient({
671
850
  - [Migration Guide](./docs/migration-v0.2.md) - Upgrade from v0.1.x
672
851
  - [Examples](./examples/) - Working code examples
673
852
 
674
- ## Philosophy
675
-
676
- evalsense is built on the principle that **metrics are predictions, not facts**.
677
-
678
- Instead of treating LLM-as-judge metrics (relevance, hallucination, etc.) as ground truth, evalsense:
679
-
680
- - Treats them as **weak labels** from a model
681
- - Validates them statistically against human references when available
682
- - Computes confusion matrices to reveal bias and systematic errors
683
- - Focuses on dataset-level distributions, not individual examples
684
-
685
853
  ## Contributing
686
854
 
687
855
  Contributions are welcome! Please see [CLAUDE.md](./CLAUDE.md) for development guidelines.
688
-
689
- ## License
690
-
691
- MIT © Mohit Joshi
692
-
693
- ---
694
-
695
- **Made with ❤️ for the JS/Node.js AI community**
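For readers upgrading, the renamed distribution assertions (`percentageAbove(x).toBeAtLeast(y)`) boil down to a simple fraction check. Below is a minimal standalone sketch with a hypothetical `percentageAbove` helper; whether evalsense compares strictly (`>`) or inclusively (`>=`) at the threshold is not specified here, and the example values avoid the boundary.

```javascript
// Illustrative sketch of what .percentageAbove(0.7).toBeAtLeast(0.8) checks.
// Not evalsense's actual implementation.
function percentageAbove(values, threshold) {
  const above = values.filter((v) => v > threshold).length;
  return values.length === 0 ? 0 : above / values.length;
}

const confidences = [0.95, 0.88, 0.72, 0.61, 0.9];
const fraction = percentageAbove(confidences, 0.7); // 4 of 5 values exceed 0.7
console.log(fraction); // 0.8

// The assertion passes because the observed fraction meets the floor:
const passes = fraction >= 0.8;
console.log(passes); // true
```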