@mastra/evals 1.1.0 → 1.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (33)
  1. package/CHANGELOG.md +22 -0
  2. package/dist/docs/SKILL.md +31 -20
  3. package/dist/docs/{SOURCE_MAP.json → assets/SOURCE_MAP.json} +1 -1
  4. package/dist/docs/{evals/02-built-in-scorers.md → references/docs-evals-built-in-scorers.md} +5 -7
  5. package/dist/docs/{evals/01-overview.md → references/docs-evals-overview.md} +26 -10
  6. package/dist/docs/references/reference-evals-answer-relevancy.md +105 -0
  7. package/dist/docs/references/reference-evals-answer-similarity.md +99 -0
  8. package/dist/docs/references/reference-evals-bias.md +120 -0
  9. package/dist/docs/references/reference-evals-completeness.md +137 -0
  10. package/dist/docs/references/reference-evals-content-similarity.md +101 -0
  11. package/dist/docs/references/reference-evals-context-precision.md +196 -0
  12. package/dist/docs/references/reference-evals-context-relevance.md +536 -0
  13. package/dist/docs/references/reference-evals-faithfulness.md +114 -0
  14. package/dist/docs/references/reference-evals-hallucination.md +220 -0
  15. package/dist/docs/references/reference-evals-keyword-coverage.md +128 -0
  16. package/dist/docs/references/reference-evals-noise-sensitivity.md +685 -0
  17. package/dist/docs/references/reference-evals-prompt-alignment.md +619 -0
  18. package/dist/docs/references/reference-evals-scorer-utils.md +330 -0
  19. package/dist/docs/references/reference-evals-textual-difference.md +113 -0
  20. package/dist/docs/references/reference-evals-tone-consistency.md +119 -0
  21. package/dist/docs/references/reference-evals-tool-call-accuracy.md +533 -0
  22. package/dist/docs/references/reference-evals-toxicity.md +123 -0
  23. package/dist/scorers/llm/faithfulness/index.d.ts +3 -1
  24. package/dist/scorers/llm/faithfulness/index.d.ts.map +1 -1
  25. package/dist/scorers/llm/noise-sensitivity/index.d.ts.map +1 -1
  26. package/dist/scorers/llm/prompt-alignment/index.d.ts.map +1 -1
  27. package/dist/scorers/prebuilt/index.cjs +11 -7
  28. package/dist/scorers/prebuilt/index.cjs.map +1 -1
  29. package/dist/scorers/prebuilt/index.js +11 -7
  30. package/dist/scorers/prebuilt/index.js.map +1 -1
  31. package/package.json +4 -5
  32. package/dist/docs/README.md +0 -31
  33. package/dist/docs/evals/03-reference.md +0 -4092

package/dist/docs/references/reference-evals-hallucination.md
@@ -0,0 +1,220 @@
# Hallucination Scorer

The `createHallucinationScorer()` function evaluates whether an LLM generates factually correct information by comparing its output against the provided context. This scorer measures hallucination by identifying direct contradictions between the context and the output.

## Parameters

The `createHallucinationScorer()` function accepts a single options object with the following properties:

**model:** (`LanguageModel`): Configuration for the model used to evaluate hallucination.

**options.scale:** (`number`): Maximum score value. (Default: `1`)

**options.context:** (`string[]`): Static context strings to use as ground truth for hallucination detection.

**options.getContext:** (`(params: GetContextParams) => string[] | Promise<string[]>`): A hook to dynamically resolve context at runtime. Takes priority over static context. Useful for live scoring where context (like tool results) is only available when the scorer runs.

This function returns an instance of the MastraScorer class. The `.run()` method accepts the same input as other scorers (see the [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer)), but the return value includes LLM-specific fields as documented below.

### GetContextParams

The `getContext` hook receives the following parameters:

**run:** (`GetContextRun`): The scorer run containing input, output, runId, requestContext, and tracingContext.

**results:** (`Record<string, any>`): Accumulated results from previous steps (e.g., preprocessStepResult with extracted claims).

**score:** (`number`): The computed score. Only present when called from the generateReason step.

**step:** (`'analyze' | 'generateReason'`): Which step is calling the hook. Useful for caching context between calls.

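Because the hook can be invoked once per step, the `step` parameter makes it easy to resolve context once and reuse it. A minimal sketch, assuming context comes from a hypothetical `fetchGroundTruth` helper (not part of the package):

```typescript
import { createHallucinationScorer } from "@mastra/evals/scorers/prebuilt";

// Hypothetical helper: loads ground-truth documents for a run.
declare function fetchGroundTruth(runId: string): Promise<string[]>;

// Cache so the context fetched during `analyze` is reused by `generateReason`.
const contextCache = new Map<string, string[]>();

const scorer = createHallucinationScorer({
  model: "openai/gpt-4o",
  options: {
    getContext: async ({ run, step }) => {
      if (step === "generateReason") {
        const cached = contextCache.get(run.runId);
        if (cached) return cached;
      }
      const context = await fetchGroundTruth(run.runId);
      contextCache.set(run.runId, context);
      return context;
    },
  },
});
```
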
## .run() Returns

**runId:** (`string`): The id of the run (optional).

**preprocessStepResult:** (`object`): Object with extracted claims: `{ claims: string[] }`

**preprocessPrompt:** (`string`): The prompt sent to the LLM for the preprocess step (optional).

**analyzeStepResult:** (`object`): Object with verdicts: `{ verdicts: Array<{ statement: string, verdict: 'yes' | 'no', reason: string }> }`

**analyzePrompt:** (`string`): The prompt sent to the LLM for the analyze step (optional).

**score:** (`number`): Hallucination score (0 to scale, default 0-1).

**reason:** (`string`): Detailed explanation of the score and identified contradictions.

**generateReasonPrompt:** (`string`): The prompt sent to the LLM for the generateReason step (optional).

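For convenience, here is the full result shape implied by the field list above, written in the same style as the Keyword Coverage reference later in this diff (a sketch; the optional prompt fields may be absent):

```typescript
{
  runId: string,
  preprocessStepResult: { claims: string[] },
  preprocessPrompt: string,
  analyzeStepResult: {
    verdicts: Array<{ statement: string, verdict: "yes" | "no", reason: string }>
  },
  analyzePrompt: string,
  score: number,
  reason: string,
  generateReasonPrompt: string
}
```
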
## Scoring Details

The scorer evaluates hallucination through contradiction detection and unsupported claim analysis.

### Scoring Process

1. Analyzes factual content:

   - Extracts statements from context
   - Identifies numerical values and dates
   - Maps statement relationships

2. Analyzes output for hallucinations:

   - Compares against context statements
   - Marks direct conflicts as hallucinations
   - Identifies unsupported claims as hallucinations
   - Evaluates numerical accuracy
   - Considers approximation context

3. Calculates hallucination score:

   - Counts hallucinated statements (contradictions and unsupported claims)
   - Divides by total statements
   - Scales to configured range

Final score: `(hallucinated_statements / total_statements) * scale`

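For example, if the preprocess step extracts five claims and the analyze step flags two of them, the score with the default scale of `1` is `2 / 5 = 0.4`. A minimal sketch of that arithmetic, assuming a `yes` verdict marks a hallucinated statement:

```typescript
// Illustrative only: the scorer performs this computation internally.
const verdicts = [
  { verdict: "yes" }, // contradicted or unsupported claim
  { verdict: "yes" },
  { verdict: "no" }, // supported by context
  { verdict: "no" },
  { verdict: "no" },
];
const scale = 1;

const hallucinated = verdicts.filter((v) => v.verdict === "yes").length;
const score = (hallucinated / verdicts.length) * scale; // 0.4
```
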
### Important Considerations

- Claims not present in context are treated as hallucinations

- Subjective claims are hallucinations unless explicitly supported

- Speculative language ("might", "possibly") about facts IN context is allowed

- Speculative language about facts NOT in context is treated as hallucination

- Empty outputs result in zero hallucinations

- Numerical evaluation considers:

  - Scale-appropriate precision
  - Contextual approximations
  - Explicit precision indicators

### Score interpretation

A hallucination score between 0 and 1:

- **0.0**: No hallucination — all claims match the context.
- **0.3–0.4**: Low hallucination — a few contradictions.
- **0.5–0.6**: Mixed hallucination — several contradictions.
- **0.7–0.8**: High hallucination — many contradictions.
- **0.9–1.0**: Complete hallucination — most or all claims contradict the context.

**Note:** The score represents the degree of hallucination: lower scores indicate better factual alignment with the provided context.

## Examples

### Static Context

Use static context when you have known ground truth to compare against:

```typescript
import { createHallucinationScorer } from "@mastra/evals/scorers/prebuilt";

const scorer = createHallucinationScorer({
  model: "openai/gpt-4o",
  options: {
    context: [
      "The first iPhone was announced on January 9, 2007.",
      "It was released on June 29, 2007.",
      "Steve Jobs introduced it at Macworld.",
    ],
  },
});
```

### Dynamic Context with getContext

Use `getContext` for live scoring scenarios where context comes from tool results:

```typescript
import { createHallucinationScorer } from "@mastra/evals/scorers/prebuilt";
import { extractToolResults } from "@mastra/evals/scorers";

const scorer = createHallucinationScorer({
  model: "openai/gpt-4o",
  options: {
    getContext: ({ run }) => {
      // Extract tool results as context
      const toolResults = extractToolResults(run.output);
      return toolResults.map((t) =>
        JSON.stringify({ tool: t.toolName, result: t.result }),
      );
    },
  },
});
```

### Live Scoring with Agent

Attach the scorer to an agent for live evaluation:

```typescript
import { Agent } from "@mastra/core/agent";
import { createHallucinationScorer } from "@mastra/evals/scorers/prebuilt";
import { extractToolResults } from "@mastra/evals/scorers";

const hallucinationScorer = createHallucinationScorer({
  model: "openai/gpt-4o",
  options: {
    getContext: ({ run }) => {
      const toolResults = extractToolResults(run.output);
      return toolResults.map((t) =>
        JSON.stringify({ tool: t.toolName, result: t.result }),
      );
    },
  },
});

const agent = new Agent({
  name: "my-agent",
  model: "openai/gpt-4o",
  instructions: "You are a helpful assistant.",
  evals: {
    scorers: [hallucinationScorer],
  },
});
```

### Batch Evaluation with runEvals

```typescript
import { runEvals } from "@mastra/core/evals";
import { createHallucinationScorer } from "@mastra/evals/scorers/prebuilt";
import { myAgent } from "./agent";

const scorer = createHallucinationScorer({
  model: "openai/gpt-4o",
  options: {
    context: ["Known fact 1", "Known fact 2"],
  },
});

const result = await runEvals({
  data: [
    { input: "Tell me about topic A" },
    { input: "Tell me about topic B" },
  ],
  scorers: [scorer],
  target: myAgent,
  onItemComplete: ({ scorerResults }) => {
    console.log({
      score: scorerResults[scorer.id].score,
      reason: scorerResults[scorer.id].reason,
    });
  },
});

console.log(result.scores);
```

For more details on `runEvals`, see the [runEvals reference](https://mastra.ai/reference/evals/run-evals).

To add this scorer to an agent, see the [Scorers overview](https://mastra.ai/docs/evals/overview) guide.

## Related

- [Faithfulness Scorer](https://mastra.ai/reference/evals/faithfulness)
- [Answer Relevancy Scorer](https://mastra.ai/reference/evals/answer-relevancy)

package/dist/docs/references/reference-evals-keyword-coverage.md
@@ -0,0 +1,128 @@
# Keyword Coverage Scorer

The `createKeywordCoverageScorer()` function evaluates how well an LLM's output covers the important keywords from the input. It analyzes keyword presence and matches while ignoring common words and stop words.

## Parameters

The `createKeywordCoverageScorer()` function does not take any options.

This function returns an instance of the MastraScorer class. See the [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer) for details on the `.run()` method and its input/output.

## .run() Returns

**runId:** (`string`): The id of the run (optional).

**preprocessStepResult:** (`object`): Object with extracted keywords: `{ referenceKeywords: Set<string>, responseKeywords: Set<string> }`

**analyzeStepResult:** (`object`): Object with keyword coverage: `{ totalKeywords: number, matchedKeywords: number }`

**score:** (`number`): Coverage score (0-1) representing the proportion of matched keywords.

`.run()` returns a result in the following shape:

```typescript
{
  runId: string,
  preprocessStepResult: {
    referenceKeywords: Set<string>,
    responseKeywords: Set<string>
  },
  analyzeStepResult: {
    totalKeywords: number,
    matchedKeywords: number
  },
  score: number
}
```

## Scoring Details

The scorer evaluates keyword coverage by matching keywords with the following features:

- Common word and stop word filtering (e.g., "the", "a", "and")
- Case-insensitive matching
- Word form variation handling
- Special handling of technical terms and compound words

### Scoring Process

1. Processes keywords from input and output:

   - Filters out common words and stop words
   - Normalizes case and word forms
   - Handles special terms and compounds

2. Calculates keyword coverage:

   - Matches keywords between texts
   - Counts successful matches
   - Computes coverage ratio

Final score: `matched_keywords / total_keywords`

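A minimal sketch of that computation (a toy illustration, not the library's implementation; the real tokenization, stop-word list, and word-form handling are more sophisticated, and the empty-input special cases below are omitted):

```typescript
// Illustrative only: simplified keyword extraction and matching.
const STOP_WORDS = new Set(["the", "a", "and", "like", "are"]);

function keywords(text: string): Set<string> {
  return new Set(
    text
      .toLowerCase()
      .split(/\W+/)
      .filter((word) => word.length > 1 && !STOP_WORDS.has(word)),
  );
}

const referenceKeywords = keywords("JavaScript frameworks like React and Vue");
const responseKeywords = keywords("React and Vue are popular JavaScript frameworks");

const matched = [...referenceKeywords].filter((w) => responseKeywords.has(w)).length;
const score = matched / referenceKeywords.size; // 4 / 4 = 1.0
```
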
### Score interpretation

A coverage score between 0 and 1:

- **1.0**: Complete coverage – all keywords present.
- **0.7–0.9**: High coverage – most keywords included.
- **0.4–0.6**: Partial coverage – some keywords present.
- **0.1–0.3**: Low coverage – few keywords matched.
- **0.0**: No coverage – no keywords found.

### Special Cases

The scorer handles several special cases:

- Empty input/output: Returns score of 1.0 if both empty, 0.0 if only one is empty
- Single word: Treated as a single keyword
- Technical terms: Preserves compound technical terms (e.g., "React.js", "machine learning")
- Case differences: "JavaScript" matches "javascript"
- Common words: Ignored in scoring to focus on meaningful keywords

## Example

Evaluate keyword coverage between input queries and agent responses:

```typescript
import { runEvals } from "@mastra/core/evals";
import { createKeywordCoverageScorer } from "@mastra/evals/scorers/prebuilt";
import { myAgent } from "./agent";

const scorer = createKeywordCoverageScorer();

const result = await runEvals({
  data: [
    {
      input: "JavaScript frameworks like React and Vue",
    },
    {
      input: "TypeScript offers interfaces, generics, and type inference",
    },
    {
      input:
        "Machine learning models require data preprocessing, feature engineering, and hyperparameter tuning",
    },
  ],
  scorers: [scorer],
  target: myAgent,
  onItemComplete: ({ scorerResults }) => {
    console.log({
      score: scorerResults[scorer.id].score,
    });
  },
});

console.log(result.scores);
```

For more details on `runEvals`, see the [runEvals reference](https://mastra.ai/reference/evals/run-evals).

To add this scorer to an agent, see the [Scorers overview](https://mastra.ai/docs/evals/overview) guide.

## Related

- [Completeness Scorer](https://mastra.ai/reference/evals/completeness)
- [Content Similarity Scorer](https://mastra.ai/reference/evals/content-similarity)
- [Answer Relevancy Scorer](https://mastra.ai/reference/evals/answer-relevancy)
- [Textual Difference Scorer](https://mastra.ai/reference/evals/textual-difference)