@mastra/evals 1.1.0 → 1.1.1-alpha.0

Files changed (33)
  1. package/CHANGELOG.md +11 -0
  2. package/dist/docs/SKILL.md +31 -20
  3. package/dist/docs/{SOURCE_MAP.json → assets/SOURCE_MAP.json} +1 -1
  4. package/dist/docs/{evals/02-built-in-scorers.md → references/docs-evals-built-in-scorers.md} +5 -7
  5. package/dist/docs/{evals/01-overview.md → references/docs-evals-overview.md} +26 -10
  6. package/dist/docs/references/reference-evals-answer-relevancy.md +105 -0
  7. package/dist/docs/references/reference-evals-answer-similarity.md +99 -0
  8. package/dist/docs/references/reference-evals-bias.md +120 -0
  9. package/dist/docs/references/reference-evals-completeness.md +137 -0
  10. package/dist/docs/references/reference-evals-content-similarity.md +101 -0
  11. package/dist/docs/references/reference-evals-context-precision.md +196 -0
  12. package/dist/docs/references/reference-evals-context-relevance.md +536 -0
  13. package/dist/docs/references/reference-evals-faithfulness.md +114 -0
  14. package/dist/docs/references/reference-evals-hallucination.md +220 -0
  15. package/dist/docs/references/reference-evals-keyword-coverage.md +128 -0
  16. package/dist/docs/references/reference-evals-noise-sensitivity.md +685 -0
  17. package/dist/docs/references/reference-evals-prompt-alignment.md +619 -0
  18. package/dist/docs/references/reference-evals-scorer-utils.md +330 -0
  19. package/dist/docs/references/reference-evals-textual-difference.md +113 -0
  20. package/dist/docs/references/reference-evals-tone-consistency.md +119 -0
  21. package/dist/docs/references/reference-evals-tool-call-accuracy.md +533 -0
  22. package/dist/docs/references/reference-evals-toxicity.md +123 -0
  23. package/dist/scorers/llm/faithfulness/index.d.ts +3 -1
  24. package/dist/scorers/llm/faithfulness/index.d.ts.map +1 -1
  25. package/dist/scorers/llm/noise-sensitivity/index.d.ts.map +1 -1
  26. package/dist/scorers/llm/prompt-alignment/index.d.ts.map +1 -1
  27. package/dist/scorers/prebuilt/index.cjs +11 -7
  28. package/dist/scorers/prebuilt/index.cjs.map +1 -1
  29. package/dist/scorers/prebuilt/index.js +11 -7
  30. package/dist/scorers/prebuilt/index.js.map +1 -1
  31. package/package.json +3 -4
  32. package/dist/docs/README.md +0 -31
  33. package/dist/docs/evals/03-reference.md +0 -4092
@@ -0,0 +1,536 @@
1
+ # Context Relevance Scorer
2
+
3
+ The `createContextRelevanceScorerLLM()` function creates a scorer that evaluates how relevant and useful provided context was for generating agent responses. It uses weighted relevance levels and applies penalties for unused high-relevance context and missing information.
4
+
5
+ It is especially useful for these use cases:
6
+
7
+ **Content Generation Evaluation**
8
+
9
+ Best for evaluating context quality in:
10
+
11
+ - Chat systems where context usage matters
12
+ - RAG pipelines needing nuanced relevance assessment
13
+ - Systems where missing context affects quality
14
+
15
+ **Context Selection Optimization**
16
+
17
+ Use when optimizing for:
18
+
19
+ - Comprehensive context coverage
20
+ - Effective context utilization
21
+ - Identifying context gaps
22
+
23
+ ## Parameters
24
+
25
+ **model:** (`MastraModelConfig`): The language model to use for evaluating context relevance
26
+
27
+ **options:** (`ContextRelevanceOptions`): Configuration options for the scorer
28
+
29
+ Note: Either `context` or `contextExtractor` must be provided. If both are provided, `contextExtractor` takes precedence.
30
+
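The option names used in the examples on this page (`context`, `contextExtractor`, `penalties`, `scale`) can be summarized as a type. The following is a sketch inferred from those examples, not the package's actual `ContextRelevanceOptions` declaration. The type names, the loose `unknown` parameter types, and the `hasContextSource` helper are assumptions made for illustration.

```typescript
// Sketch only: inferred from the examples on this page, not copied from @mastra/evals.
type PenaltyOptionsSketch = {
  unusedHighRelevanceContext?: number; // documented default: 0.1
  missingContextPerItem?: number; // documented default: 0.15
  maxMissingContextPenalty?: number; // documented default: 0.5
};

type ContextRelevanceOptionsSketch = {
  context?: string[]; // static context pieces
  contextExtractor?: (input: unknown, output: unknown) => string[]; // dynamic context
  penalties?: PenaltyOptionsSketch;
  scale?: number; // documented default: 1
};

// Hypothetical helper mirroring the note above: at least one context source is required.
function hasContextSource(options: ContextRelevanceOptionsSketch): boolean {
  return options.context !== undefined || options.contextExtractor !== undefined;
}
```

A check like `hasContextSource` could run before constructing the scorer to catch a configuration that supplies neither source.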
31
+ ## .run() Returns
32
+
33
+ **score:** (`number`): Weighted relevance score between 0 and `scale` (0 to 1 by default)
34
+
35
+ **reason:** (`string`): Human-readable explanation of the context relevance evaluation
36
+
37
+ ## Scoring Details
38
+
39
+ ### Weighted Relevance Scoring
40
+
41
+ Context Relevance computes its score from three factors:
42
+
43
+ 1. **Relevance Levels**: Each context piece is classified with weighted values:
44
+
45
+ - `high` = 1.0 (directly addresses the query)
46
+ - `medium` = 0.7 (supporting information)
47
+ - `low` = 0.3 (tangentially related)
48
+ - `none` = 0.0 (completely irrelevant)
49
+
50
+ 2. **Usage Detection**: Tracks whether relevant context was actually used in the response
51
+
52
+ 3. **Penalties Applied** (configurable via `penalties` options):
53
+
54
+ - **Unused High-Relevance**: `unusedHighRelevanceContext` penalty per unused high-relevance context (default: 0.1)
55
+ - **Missing Context**: `missingContextPerItem` penalty per identified missing item (default: 0.15), capped at `maxMissingContextPenalty` (default: 0.5)
56
+
57
+ ### Scoring Formula
58
+
59
+ ```text
60
+ Base Score = Σ(relevance_weights) / (num_contexts × 1.0)
61
+ Usage Penalty = count(unused_high_relevance) × unusedHighRelevanceContext
62
+ Missing Penalty = min(count(missing_context) × missingContextPerItem, maxMissingContextPenalty)
63
+
64
+ Final Score = max(0, Base Score - Usage Penalty - Missing Penalty) × scale
65
+ ```
66
+
67
+ **Default Values**:
68
+
69
+ - `unusedHighRelevanceContext` = 0.1 (10% penalty per unused high-relevance context)
70
+ - `missingContextPerItem` = 0.15 (15% penalty per missing context item)
71
+ - `maxMissingContextPenalty` = 0.5 (maximum 50% penalty for missing context)
72
+ - `scale` = 1
73
+
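The formula and default values above can be sketched as a small function. This is an illustration of the documented arithmetic only, not the scorer's internal implementation: the type names, the `contextRelevanceScore` function, and the choice to return 0 for an empty context list are assumptions.

```typescript
// Sketch of the documented formula; all names here are hypothetical.
type RelevanceLevel = "high" | "medium" | "low" | "none";

// Weights taken from the "Relevance Levels" list above.
const RELEVANCE_WEIGHTS: Record<RelevanceLevel, number> = {
  high: 1.0,
  medium: 0.7,
  low: 0.3,
  none: 0.0,
};

interface ContextEvaluation {
  level: RelevanceLevel;
  used: boolean; // whether the response actually used this context piece
}

interface Penalties {
  unusedHighRelevanceContext: number;
  missingContextPerItem: number;
  maxMissingContextPenalty: number;
}

const DEFAULT_PENALTIES: Penalties = {
  unusedHighRelevanceContext: 0.1,
  missingContextPerItem: 0.15,
  maxMissingContextPenalty: 0.5,
};

function contextRelevanceScore(
  evaluations: ContextEvaluation[],
  missingContextCount: number,
  penalties: Penalties = DEFAULT_PENALTIES,
  scale = 1,
): number {
  if (evaluations.length === 0) return 0; // assumption: no context scores 0

  // Base Score = sum of relevance weights / (number of contexts x 1.0)
  const weightSum = evaluations.reduce(
    (sum, e) => sum + RELEVANCE_WEIGHTS[e.level],
    0,
  );
  const baseScore = weightSum / evaluations.length;

  // Usage Penalty = count of unused high-relevance contexts x per-item penalty
  const unusedHigh = evaluations.filter(
    (e) => e.level === "high" && !e.used,
  ).length;
  const usagePenalty = unusedHigh * penalties.unusedHighRelevanceContext;

  // Missing Penalty = per-item penalty, capped at the configured maximum
  const missingPenalty = Math.min(
    missingContextCount * penalties.missingContextPerItem,
    penalties.maxMissingContextPenalty,
  );

  return Math.max(0, baseScore - usagePenalty - missingPenalty) * scale;
}
```

For a hypothetical run with two used `high` pieces, one `medium` piece, one `none` piece, one unused `high` piece, and no missing context, the base score is 3.7 / 5 = 0.74 and the usage penalty is 0.1, so the function returns roughly 0.64.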
74
+ ### Score interpretation
75
+
76
+ - **0.9-1.0**: Excellent - all context highly relevant and used
77
+ - **0.7-0.8**: Good - mostly relevant with minor gaps
78
+ - **0.4-0.6**: Mixed - significant irrelevant or unused context
79
+ - **0.2-0.3**: Poor - mostly irrelevant context
80
+ - **0.0-0.1**: Very poor - no relevant context found
81
+
82
+ ### Reason analysis
83
+
84
+ The reason field provides insights on:
85
+
86
+ - Relevance level of each context piece (high/medium/low/none)
87
+ - Which context was actually used in the response
88
+ - Penalties applied for unused high-relevance context (configurable via `unusedHighRelevanceContext`)
89
+ - Missing context that would have improved the response (penalized via `missingContextPerItem` up to `maxMissingContextPenalty`)
90
+
91
+ ### Optimization strategies
92
+
93
+ Use results to improve your system:
94
+
95
+ - **Filter irrelevant context**: Remove low/none relevance pieces before processing
96
+ - **Ensure context usage**: Make sure high-relevance context is incorporated
97
+ - **Fill context gaps**: Add missing information identified by the scorer
98
+ - **Balance context size**: Find optimal amount of context for best relevance
99
+ - **Tune penalty sensitivity**: Adjust `unusedHighRelevanceContext`, `missingContextPerItem`, and `maxMissingContextPenalty` based on your application's tolerance for unused or missing context
100
+
101
+ ### Difference from Context Precision
102
+
103
+ | Aspect | Context Relevance | Context Precision |
104
+ | ------------- | -------------------------------------- | ---------------------------------- |
105
+ | **Algorithm** | Weighted levels with penalties | Mean Average Precision (MAP) |
106
+ | **Relevance** | Multiple levels (high/medium/low/none) | Binary (yes/no) |
107
+ | **Position** | Not considered | Critical (rewards early placement) |
108
+ | **Usage** | Tracks and penalizes unused context | Not considered |
109
+ | **Missing** | Identifies and penalizes gaps | Not evaluated |
110
+
111
+ ## Scorer configuration
112
+
113
+ ### Custom penalty configuration
114
+
115
+ Control how penalties are applied for unused and missing context:
116
+
117
+ ```typescript
118
+ import { createContextRelevanceScorerLLM } from "@mastra/evals";
119
+
120
+ // Stricter penalty configuration
121
+ const strictScorer = createContextRelevanceScorerLLM({
122
+ model: "openai/gpt-5.1",
123
+ options: {
124
+ context: [
125
+ "Einstein won the Nobel Prize for photoelectric effect",
126
+ "He developed the theory of relativity",
127
+ "Einstein was born in Germany",
128
+ ],
129
+ penalties: {
130
+ unusedHighRelevanceContext: 0.2, // 20% penalty per unused high-relevance context
131
+ missingContextPerItem: 0.25, // 25% penalty per missing context item
132
+ maxMissingContextPenalty: 0.6, // Maximum 60% penalty for missing context
133
+ },
134
+ scale: 1,
135
+ },
136
+ });
137
+
138
+ // Lenient penalty configuration
139
+ const lenientScorer = createContextRelevanceScorerLLM({
140
+ model: "openai/gpt-5.1",
141
+ options: {
142
+ context: [
143
+ "Einstein won the Nobel Prize for photoelectric effect",
144
+ "He developed the theory of relativity",
145
+ "Einstein was born in Germany",
146
+ ],
147
+ penalties: {
148
+ unusedHighRelevanceContext: 0.05, // 5% penalty per unused high-relevance context
149
+ missingContextPerItem: 0.1, // 10% penalty per missing context item
150
+ maxMissingContextPenalty: 0.3, // Maximum 30% penalty for missing context
151
+ },
152
+ scale: 1,
153
+ },
154
+ });
155
+
156
+ const testRun = {
157
+ input: {
158
+ inputMessages: [
159
+ {
160
+ id: "1",
161
+ role: "user",
162
+ content: "What did Einstein achieve in physics?",
163
+ },
164
+ ],
165
+ },
166
+ output: [
167
+ {
168
+ id: "2",
169
+ role: "assistant",
170
+ content:
171
+ "Einstein won the Nobel Prize for his work on the photoelectric effect.",
172
+ },
173
+ ],
174
+ };
175
+
176
+ const strictResult = await strictScorer.run(testRun);
177
+ const lenientResult = await lenientScorer.run(testRun);
178
+
179
+ console.log("Strict penalties:", strictResult.score); // Lower score due to unused context
180
+ console.log("Lenient penalties:", lenientResult.score); // Higher score, less penalty
181
+ ```
182
+
183
+ ### Dynamic context extraction
184
+
185
+ ```typescript
186
+ const scorer = createContextRelevanceScorerLLM({
187
+ model: "openai/gpt-5.1",
188
+ options: {
189
+ contextExtractor: (input, output) => {
190
+ // Extract context based on the query
191
+ const userQuery = input?.inputMessages?.[0]?.content || "";
192
+ if (userQuery.includes("Einstein")) {
193
+ return [
194
+ "Einstein won the Nobel Prize for the photoelectric effect",
195
+ "He developed the theory of relativity",
196
+ ];
197
+ }
198
+ return ["General physics information"];
199
+ },
200
+ penalties: {
201
+ unusedHighRelevanceContext: 0.15,
202
+ },
203
+ },
204
+ });
205
+ ```
206
+
207
+ ### Custom scale factor
208
+
209
+ ```typescript
210
+ const scorer = createContextRelevanceScorerLLM({
211
+ model: "openai/gpt-5.1",
212
+ options: {
213
+ context: ["Relevant information...", "Supporting details..."],
214
+ scale: 100, // Scale scores from 0-100 instead of 0-1
215
+ },
216
+ });
217
+
218
+ // Result will be scaled: score: 85 instead of 0.85
219
+ ```
220
+
221
+ ### Combining multiple context sources
222
+
223
+ ```typescript
224
+ const scorer = createContextRelevanceScorerLLM({
225
+ model: "openai/gpt-5.1",
226
+ options: {
227
+ contextExtractor: (input, output) => {
228
+ const query = input?.inputMessages?.[0]?.content || "";
229
+
230
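+ // Combine from multiple sources (knowledgeBase, documentStore, and contextCache are placeholders for your own synchronous retrieval clients)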
231
+ const kbContext = knowledgeBase.search(query);
232
+ const docContext = documentStore.retrieve(query);
233
+ const cacheContext = contextCache.get(query);
234
+
235
+ return [...kbContext, ...docContext, ...cacheContext];
236
+ },
237
+ scale: 1,
238
+ },
239
+ });
240
+ ```
241
+
242
+ ## Examples
243
+
244
+ ### High relevance example
245
+
246
+ This example shows excellent context relevance where all context directly supports the response:
247
+
248
+ ```typescript
249
+ import { createContextRelevanceScorerLLM } from "@mastra/evals";
250
+
251
+ const scorer = createContextRelevanceScorerLLM({
252
+ model: "openai/gpt-5.1",
253
+ options: {
254
+ context: [
255
+ "Einstein won the Nobel Prize for his discovery of the photoelectric effect in 1921.",
256
+ "He published his theory of special relativity in 1905.",
257
+ "His general relativity theory, published in 1915, revolutionized our understanding of gravity.",
258
+ ],
259
+ scale: 1,
260
+ },
261
+ });
262
+
263
+ const result = await scorer.run({
264
+ input: {
265
+ inputMessages: [
266
+ {
267
+ id: "1",
268
+ role: "user",
269
+ content: "What were Einstein's major scientific achievements?",
270
+ },
271
+ ],
272
+ },
273
+ output: [
274
+ {
275
+ id: "2",
276
+ role: "assistant",
277
+ content:
278
+ "Einstein's major achievements include the Nobel Prize for the photoelectric effect, special relativity in 1905, and general relativity in 1915.",
279
+ },
280
+ ],
281
+ });
282
+
283
+ console.log(result);
284
+ // Output:
285
+ // {
286
+ // score: 1.0,
287
+ // reason: "The score is 1.0 because all context pieces are highly relevant to Einstein's achievements and were effectively used in generating the comprehensive response."
288
+ // }
289
+ ```
290
+
291
+ ### Mixed relevance example
292
+
293
+ This example shows moderate relevance with some context being irrelevant or unused:
294
+
295
+ ```typescript
296
+ import { createContextRelevanceScorerLLM } from "@mastra/evals";
297
+
298
+ const scorer = createContextRelevanceScorerLLM({
299
+ model: "openai/gpt-5.1",
300
+ options: {
301
+ context: [
302
+ "Solar eclipses occur when the Moon blocks the Sun.",
303
+ "The Moon moves between the Earth and Sun during eclipses.",
304
+ "The Moon is visible at night.",
305
+ "Stars twinkle due to atmospheric interference.",
306
+ "Total eclipses can last up to 7.5 minutes.",
307
+ ],
308
+ scale: 1,
309
+ },
310
+ });
311
+
312
+ const result = await scorer.run({
313
+ input: {
314
+ inputMessages: [
315
+ {
316
+ id: "1",
317
+ role: "user",
318
+ content: "What causes solar eclipses?",
319
+ },
320
+ ],
321
+ },
322
+ output: [
323
+ {
324
+ id: "2",
325
+ role: "assistant",
326
+ content:
327
+ "Solar eclipses happen when the Moon moves between Earth and the Sun, blocking sunlight.",
328
+ },
329
+ ],
330
+ });
331
+
332
+ console.log(result);
333
+ // Output with default penalties:
334
+ // {
335
+ // score: 0.64,
336
+ // reason: "The score is 0.64 because contexts 1 and 2 are highly relevant and used, context 5 is relevant but unused (10% penalty), while contexts 3 and 4 are irrelevant."
337
+ // }
338
+
339
+ // With custom penalty configuration
340
+ const customScorer = createContextRelevanceScorerLLM({
341
+ model: "openai/gpt-5.1",
342
+ options: {
343
+ context: [
344
+ "Solar eclipses occur when the Moon blocks the Sun.",
345
+ "The Moon moves between the Earth and Sun during eclipses.",
346
+ "The Moon is visible at night.",
347
+ "Stars twinkle due to atmospheric interference.",
348
+ "Total eclipses can last up to 7.5 minutes.",
349
+ ],
350
+ penalties: {
351
+ unusedHighRelevanceContext: 0.05, // Lower penalty for unused context
352
+ missingContextPerItem: 0.1,
353
+ maxMissingContextPenalty: 0.3,
354
+ },
355
+ },
356
+ });
357
+
358
+ const customResult = await customScorer.run({
359
+ input: {
360
+ inputMessages: [
361
+ { id: "1", role: "user", content: "What causes solar eclipses?" },
362
+ ],
363
+ },
364
+ output: [
365
+ {
366
+ id: "2",
367
+ role: "assistant",
368
+ content:
369
+ "Solar eclipses happen when the Moon moves between Earth and the Sun, blocking sunlight.",
370
+ },
371
+ ],
372
+ });
373
+
374
+ console.log(customResult);
375
+ // Output with lenient penalties:
376
+ // {
377
+ // score: 0.69, // Higher score due to reduced penalty for unused context
378
+ // reason: "The score is 0.69 because contexts 1 and 2 are highly relevant and used, context 5 is relevant but unused (5% penalty), while contexts 3 and 4 are irrelevant."
379
+ // }
380
+ ```
381
+
382
+ ### Low relevance example
383
+
384
+ This example shows poor context relevance with mostly irrelevant information:
385
+
386
+ ```typescript
387
+ import { createContextRelevanceScorerLLM } from "@mastra/evals";
388
+
389
+ const scorer = createContextRelevanceScorerLLM({
390
+ model: "openai/gpt-5.1",
391
+ options: {
392
+ context: [
393
+ "The Great Barrier Reef is located in Australia.",
394
+ "Coral reefs need warm water to survive.",
395
+ "Many fish species live in coral reefs.",
396
+ "Australia has six states and two territories.",
397
+ "The capital of Australia is Canberra.",
398
+ ],
399
+ scale: 1,
400
+ },
401
+ });
402
+
403
+ const result = await scorer.run({
404
+ input: {
405
+ inputMessages: [
406
+ {
407
+ id: "1",
408
+ role: "user",
409
+ content: "What is the capital of Australia?",
410
+ },
411
+ ],
412
+ },
413
+ output: [
414
+ {
415
+ id: "2",
416
+ role: "assistant",
417
+ content: "The capital of Australia is Canberra.",
418
+ },
419
+ ],
420
+ });
421
+
422
+ console.log(result);
423
+ // Output:
424
+ // {
425
+ // score: 0.26,
426
+ // reason: "The score is 0.26 because only context 5 is relevant to the query about Australia's capital, while the other contexts about reefs are completely irrelevant."
427
+ // }
428
+ ```
429
+
430
+ ### Dynamic context extraction
431
+
432
+ Extract context dynamically based on the run input:
433
+
434
+ ```typescript
435
+ import { createContextRelevanceScorerLLM } from "@mastra/evals";
436
+
437
+ const scorer = createContextRelevanceScorerLLM({
438
+ model: "openai/gpt-5.1",
439
+ options: {
440
+ contextExtractor: (input, output) => {
441
+ // Extract query from input
442
+ const query = input?.inputMessages?.[0]?.content || "";
443
+
444
+ // Dynamically retrieve context based on query
445
+ if (query.toLowerCase().includes("einstein")) {
446
+ return [
447
+ "Einstein developed E=mc²",
448
+ "He won the Nobel Prize in 1921",
449
+ "His theories revolutionized physics",
450
+ ];
451
+ }
452
+
453
+ if (query.toLowerCase().includes("climate")) {
454
+ return [
455
+ "Global temperatures are rising",
456
+ "CO2 levels affect climate",
457
+ "Renewable energy reduces emissions",
458
+ ];
459
+ }
460
+
461
+ return ["General knowledge base entry"];
462
+ },
463
+ penalties: {
464
+ unusedHighRelevanceContext: 0.15, // 15% penalty for unused relevant context
465
+ missingContextPerItem: 0.2, // 20% penalty per missing context item
466
+ maxMissingContextPenalty: 0.4, // Cap at 40% total missing context penalty
467
+ },
468
+ scale: 1,
469
+ },
470
+ });
471
+ ```
472
+
473
+ ### RAG system integration
474
+
475
+ Integrate with RAG pipelines to evaluate retrieved context:
476
+
477
+ ```typescript
478
+ import { createContextRelevanceScorerLLM } from "@mastra/evals";
479
+
480
+ const scorer = createContextRelevanceScorerLLM({
481
+ model: "openai/gpt-5.1",
482
+ options: {
483
+ contextExtractor: (input, output) => {
484
+ // Extract from RAG retrieval results
485
+ const ragResults = input?.metadata?.ragResults || [];
486
+
487
+ // Return the text content of retrieved documents
488
+ return ragResults
489
+ .filter((doc) => doc.relevanceScore > 0.5)
490
+ .map((doc) => doc.content);
491
+ },
492
+ penalties: {
493
+ unusedHighRelevanceContext: 0.12, // Moderate penalty for unused RAG context
494
+ missingContextPerItem: 0.18, // Higher penalty for missing information in RAG
495
+ maxMissingContextPenalty: 0.45, // Slightly higher cap for RAG systems
496
+ },
497
+ scale: 1,
498
+ },
499
+ });
500
+
501
+ // Evaluate RAG system performance
502
+ const evaluateRAG = async (testCases) => {
503
+ const results = [];
504
+
505
+ for (const testCase of testCases) {
506
+ const score = await scorer.run(testCase);
507
+ results.push({
508
+ query: testCase.input.inputMessages[0].content,
509
+ relevanceScore: score.score,
510
+ feedback: score.reason,
511
+ unusedContext: score.reason.includes("unused"),
512
+ missingContext: score.reason.includes("missing"),
513
+ });
514
+ }
515
+
516
+ return results;
517
+ };
518
+ ```
519
+
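The per-case entries built by `evaluateRAG` above can be rolled up into a summary. The sketch below assumes the entry shape shown in that example; `RagEvalResult` and `summarizeRagResults` are hypothetical names and are not part of `@mastra/evals`.

```typescript
// Shape of the entries pushed into `results` by evaluateRAG.
interface RagEvalResult {
  query: string;
  relevanceScore: number;
  feedback: string;
  unusedContext: boolean;
  missingContext: boolean;
}

// Hypothetical helper: aggregate per-case results into a single summary.
function summarizeRagResults(results: RagEvalResult[]) {
  const total = results.length;
  const scoreSum = results.reduce((sum, r) => sum + r.relevanceScore, 0);

  return {
    total,
    // Guard against division by zero when no test cases were run.
    averageScore: total === 0 ? 0 : scoreSum / total,
    withUnusedContext: results.filter((r) => r.unusedContext).length,
    withMissingContext: results.filter((r) => r.missingContext).length,
  };
}
```

Because `unusedContext` and `missingContext` are derived from substring matches on the free-text `reason`, the counts are best treated as rough indicators rather than exact measurements.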
520
+ ## Comparison with Context Precision
521
+
522
+ Choose the right scorer for your needs:
523
+
524
+ | Use Case | Context Relevance | Context Precision |
525
+ | ------------------------ | -------------------- | ------------------------- |
526
+ | **RAG evaluation** | When usage matters | When ranking matters |
527
+ | **Context quality** | Nuanced levels | Binary relevance |
528
+ | **Missing detection** | ✓ Identifies gaps | ✗ Not evaluated |
529
+ | **Usage tracking** | ✓ Tracks utilization | ✗ Not considered |
530
+ | **Position sensitivity** | ✗ Position agnostic | ✓ Rewards early placement |
531
+
532
+ ## Related
533
+
534
+ - [Context Precision Scorer](https://mastra.ai/reference/evals/context-precision) - Evaluates context ranking using MAP
535
+ - [Faithfulness Scorer](https://mastra.ai/reference/evals/faithfulness) - Measures answer groundedness in context
536
+ - [Custom Scorers](https://mastra.ai/docs/evals/custom-scorers) - Creating your own evaluation metrics
@@ -0,0 +1,114 @@
1
+ # Faithfulness Scorer
2
+
3
+ The `createFaithfulnessScorer()` function evaluates how factually accurate an LLM's output is compared to the provided context. It extracts claims from the output and verifies them against the context, which makes it useful for measuring the reliability of RAG pipeline responses.
4
+
5
+ ## Parameters
6
+
7
+ The `createFaithfulnessScorer()` function accepts a single options object with the following properties:
8
+
9
+ **model:** (`LanguageModel`): Configuration for the model used to evaluate faithfulness.
10
+
11
+ **context:** (`string[]`): Array of context chunks against which the output's claims will be verified.
12
+
13
+ **scale:** (`number`): The maximum score value. The final score will be normalized to this scale. (Default: `1`)
14
+
15
+ This function returns an instance of the MastraScorer class. The `.run()` method accepts the same input as other scorers (see the [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer)), but the return value includes LLM-specific fields as documented below.
16
+
17
+ ## .run() Returns
18
+
19
+ **runId:** (`string`): The id of the run (optional).
20
+
21
+ **preprocessStepResult:** (`string[]`): Array of extracted claims from the output.
22
+
23
+ **preprocessPrompt:** (`string`): The prompt sent to the LLM for the preprocess step (optional).
24
+
25
+ **analyzeStepResult:** (`object`): Object containing the verdict for each claim: `{ verdicts: Array<{ verdict: 'yes' | 'no' | 'unsure', reason: string }> }`
26
+
27
+ **analyzePrompt:** (`string`): The prompt sent to the LLM for the analyze step (optional).
28
+
29
+ **score:** (`number`): A score between 0 and the configured scale, representing the proportion of claims that are supported by the context.
30
+
31
+ **reason:** (`string`): A detailed explanation of the score, including which claims were supported, contradicted, or marked as unsure.
32
+
33
+ **generateReasonPrompt:** (`string`): The prompt sent to the LLM for the generateReason step (optional).
34
+
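The fields listed above can be written down as a type for use in calling code. This is a sketch based only on the field list in this section, not the package's exported types; `FaithfulnessRunResultSketch`, `FaithfulnessVerdict`, and `countVerdicts` are hypothetical names.

```typescript
// Sketch of the documented .run() result; not the package's own type.
type FaithfulnessVerdict = {
  verdict: "yes" | "no" | "unsure";
  reason: string;
};

interface FaithfulnessRunResultSketch {
  runId?: string;
  preprocessStepResult: string[]; // extracted claims
  preprocessPrompt?: string;
  analyzeStepResult: { verdicts: FaithfulnessVerdict[] };
  analyzePrompt?: string;
  score: number;
  reason: string;
  generateReasonPrompt?: string;
}

// Hypothetical helper: tally how many claims received each verdict.
function countVerdicts(
  result: FaithfulnessRunResultSketch,
): Record<"yes" | "no" | "unsure", number> {
  const counts = { yes: 0, no: 0, unsure: 0 };
  for (const { verdict } of result.analyzeStepResult.verdicts) {
    counts[verdict] += 1;
  }
  return counts;
}
```

A tally like this can help distinguish a low score caused by contradicted claims from one caused by unverifiable claims.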
35
+ ## Scoring Details
36
+
37
+ The scorer evaluates faithfulness through claim verification against provided context.
38
+
39
+ ### Scoring Process
40
+
41
+ 1. Analyzes claims and context:
42
+
43
+ - Extracts all claims (factual and speculative)
44
+
45
+ - Verifies each claim against context
46
+
47
+ - Assigns one of three verdicts:
48
+
49
+ - "yes" - claim supported by context
50
+ - "no" - claim contradicts context
51
+ - "unsure" - claim unverifiable
52
+
53
+ 2. Calculates faithfulness score:
54
+
55
+ - Counts supported claims
56
+ - Divides by total claims
57
+ - Scales to configured range
58
+
59
+ Final score: `(supported_claims / total_claims) * scale`
60
+
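The final-score formula above can be sketched in a few lines. This illustrates the documented arithmetic only and is not the scorer's internal code; the `faithfulnessScore` name and the choice to return 0 when there are no claims are assumptions.

```typescript
// Sketch of (supported_claims / total_claims) * scale; names are hypothetical.
type Verdict = "yes" | "no" | "unsure";

function faithfulnessScore(verdicts: Verdict[], scale = 1): number {
  if (verdicts.length === 0) return 0; // assumption: no claims scores 0

  // Only "yes" verdicts count as supported; "no" and "unsure" do not.
  const supported = verdicts.filter((v) => v === "yes").length;
  return (supported / verdicts.length) * scale;
}
```

For example, `faithfulnessScore(["yes", "yes", "no", "unsure"])` yields 0.5, since two of the four claims are supported.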
61
+ ### Score interpretation
62
+
63
+ With the default `scale` of 1, the faithfulness score falls between 0 and 1:
64
+
65
+ - **1.0**: All claims are accurate and directly supported by the context.
66
+ - **0.7–0.9**: Most claims are correct, with minor additions or omissions.
67
+ - **0.4–0.6**: Some claims are supported, but others are unverifiable.
68
+ - **0.1–0.3**: Most of the content is inaccurate or unsupported.
69
+ - **0.0**: All claims are false or contradict the context.
70
+
71
+ ## Example
72
+
73
+ Evaluate agent responses for faithfulness to provided context:
74
+
75
+ ```typescript
76
+ import { runEvals } from "@mastra/core/evals";
77
+ import { createFaithfulnessScorer } from "@mastra/evals/scorers/prebuilt";
78
+ import { myAgent } from "./agent";
79
+
80
+ // Context is typically populated from agent tool calls or RAG retrieval
81
+ const scorer = createFaithfulnessScorer({
82
+ model: "openai/gpt-4o",
83
+ });
84
+
85
+ const result = await runEvals({
86
+ data: [
87
+ {
88
+ input: "Tell me about the Tesla Model 3.",
89
+ },
90
+ {
91
+ input: "What are the key features of this electric vehicle?",
92
+ },
93
+ ],
94
+ scorers: [scorer],
95
+ target: myAgent,
96
+ onItemComplete: ({ scorerResults }) => {
97
+ console.log({
98
+ score: scorerResults[scorer.id].score,
99
+ reason: scorerResults[scorer.id].reason,
100
+ });
101
+ },
102
+ });
103
+
104
+ console.log(result.scores);
105
+ ```
106
+
107
+ For more details on `runEvals`, see the [runEvals reference](https://mastra.ai/reference/evals/run-evals).
108
+
109
+ To add this scorer to an agent, see the [Scorers overview](https://mastra.ai/docs/evals/overview) guide.
110
+
111
+ ## Related
112
+
113
+ - [Answer Relevancy Scorer](https://mastra.ai/reference/evals/answer-relevancy)
114
+ - [Hallucination Scorer](https://mastra.ai/reference/evals/hallucination)