@mastra/evals 1.1.0 → 1.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (33)
  1. package/CHANGELOG.md +22 -0
  2. package/dist/docs/SKILL.md +31 -20
  3. package/dist/docs/{SOURCE_MAP.json → assets/SOURCE_MAP.json} +1 -1
  4. package/dist/docs/{evals/02-built-in-scorers.md → references/docs-evals-built-in-scorers.md} +5 -7
  5. package/dist/docs/{evals/01-overview.md → references/docs-evals-overview.md} +26 -10
  6. package/dist/docs/references/reference-evals-answer-relevancy.md +105 -0
  7. package/dist/docs/references/reference-evals-answer-similarity.md +99 -0
  8. package/dist/docs/references/reference-evals-bias.md +120 -0
  9. package/dist/docs/references/reference-evals-completeness.md +137 -0
  10. package/dist/docs/references/reference-evals-content-similarity.md +101 -0
  11. package/dist/docs/references/reference-evals-context-precision.md +196 -0
  12. package/dist/docs/references/reference-evals-context-relevance.md +536 -0
  13. package/dist/docs/references/reference-evals-faithfulness.md +114 -0
  14. package/dist/docs/references/reference-evals-hallucination.md +220 -0
  15. package/dist/docs/references/reference-evals-keyword-coverage.md +128 -0
  16. package/dist/docs/references/reference-evals-noise-sensitivity.md +685 -0
  17. package/dist/docs/references/reference-evals-prompt-alignment.md +619 -0
  18. package/dist/docs/references/reference-evals-scorer-utils.md +330 -0
  19. package/dist/docs/references/reference-evals-textual-difference.md +113 -0
  20. package/dist/docs/references/reference-evals-tone-consistency.md +119 -0
  21. package/dist/docs/references/reference-evals-tool-call-accuracy.md +533 -0
  22. package/dist/docs/references/reference-evals-toxicity.md +123 -0
  23. package/dist/scorers/llm/faithfulness/index.d.ts +3 -1
  24. package/dist/scorers/llm/faithfulness/index.d.ts.map +1 -1
  25. package/dist/scorers/llm/noise-sensitivity/index.d.ts.map +1 -1
  26. package/dist/scorers/llm/prompt-alignment/index.d.ts.map +1 -1
  27. package/dist/scorers/prebuilt/index.cjs +11 -7
  28. package/dist/scorers/prebuilt/index.cjs.map +1 -1
  29. package/dist/scorers/prebuilt/index.js +11 -7
  30. package/dist/scorers/prebuilt/index.js.map +1 -1
  31. package/package.json +4 -5
  32. package/dist/docs/README.md +0 -31
  33. package/dist/docs/evals/03-reference.md +0 -4092
package/CHANGELOG.md CHANGED
@@ -1,5 +1,27 @@
  # @mastra/evals
 
+ ## 1.1.1
+
+ ### Patch Changes
+
+ - Fixed faithfulness scorer failing with 'expected record, received array' error when used with live agents. The preprocess step now returns claims as an object instead of a raw array, matching the expected storage schema. ([#12892](https://github.com/mastra-ai/mastra/pull/12892))
+
+ - Fixed LLM scorer schema compatibility with Anthropic API by replacing `z.number().min(0).max(1)` with `z.number().refine()` for score validation. The min/max constraints were being converted to JSON Schema minimum/maximum properties which some providers don't support. ([#12722](https://github.com/mastra-ai/mastra/pull/12722))
+
+ - Updated dependencies [[`717ffab`](https://github.com/mastra-ai/mastra/commit/717ffab42cfd58ff723b5c19ada4939997773004), [`b31c922`](https://github.com/mastra-ai/mastra/commit/b31c922215b513791d98feaea1b98784aa00803a), [`e4b6dab`](https://github.com/mastra-ai/mastra/commit/e4b6dab171c5960e340b3ea3ea6da8d64d2b8672), [`5719fa8`](https://github.com/mastra-ai/mastra/commit/5719fa8880e86e8affe698ec4b3807c7e0e0a06f), [`83cda45`](https://github.com/mastra-ai/mastra/commit/83cda4523e588558466892bff8f80f631a36945a), [`11804ad`](https://github.com/mastra-ai/mastra/commit/11804adf1d6be46ebe216be40a43b39bb8b397d7), [`aa95f95`](https://github.com/mastra-ai/mastra/commit/aa95f958b186ae5c9f4219c88e268f5565c277a2), [`90f7894`](https://github.com/mastra-ai/mastra/commit/90f7894568dc9481f40a4d29672234fae23090bb), [`f5501ae`](https://github.com/mastra-ai/mastra/commit/f5501aedb0a11106c7db7e480d6eaf3971b7bda8), [`44573af`](https://github.com/mastra-ai/mastra/commit/44573afad0a4bc86f627d6cbc0207961cdcb3bc3), [`00e3861`](https://github.com/mastra-ai/mastra/commit/00e3861863fbfee78faeb1ebbdc7c0223aae13ff), [`8109aee`](https://github.com/mastra-ai/mastra/commit/8109aeeab758e16cd4255a6c36f044b70eefc6a6), [`7bfbc52`](https://github.com/mastra-ai/mastra/commit/7bfbc52a8604feb0fff2c0a082c13c0c2a3df1a2), [`1445994`](https://github.com/mastra-ai/mastra/commit/1445994aee19c9334a6a101cf7bd80ca7ed4d186), [`61f44a2`](https://github.com/mastra-ai/mastra/commit/61f44a26861c89e364f367ff40825bdb7f19df55), [`37145d2`](https://github.com/mastra-ai/mastra/commit/37145d25f99dc31f1a9105576e5452609843ce32), [`fdad759`](https://github.com/mastra-ai/mastra/commit/fdad75939ff008b27625f5ec0ce9c6915d99d9ec), [`e4569c5`](https://github.com/mastra-ai/mastra/commit/e4569c589e00c4061a686c9eb85afe1b7050b0a8), [`7309a85`](https://github.com/mastra-ai/mastra/commit/7309a85427281a8be23f4fb80ca52e18eaffd596), [`99424f6`](https://github.com/mastra-ai/mastra/commit/99424f6862ffb679c4ec6765501486034754a4c2), 
[`44eb452`](https://github.com/mastra-ai/mastra/commit/44eb4529b10603c279688318bebf3048543a1d61), [`6c40593`](https://github.com/mastra-ai/mastra/commit/6c40593d6d2b1b68b0c45d1a3a4c6ac5ecac3937), [`8c1135d`](https://github.com/mastra-ai/mastra/commit/8c1135dfb91b057283eae7ee11f9ec28753cc64f), [`dd39e54`](https://github.com/mastra-ai/mastra/commit/dd39e54ea34532c995b33bee6e0e808bf41a7341), [`b6fad9a`](https://github.com/mastra-ai/mastra/commit/b6fad9a602182b1cc0df47cd8c55004fa829ad61), [`4129c07`](https://github.com/mastra-ai/mastra/commit/4129c073349b5a66643fd8136ebfe9d7097cf793), [`5b930ab`](https://github.com/mastra-ai/mastra/commit/5b930aba1834d9898e8460a49d15106f31ac7c8d), [`4be93d0`](https://github.com/mastra-ai/mastra/commit/4be93d09d68e20aaf0ea3f210749422719618b5f), [`047635c`](https://github.com/mastra-ai/mastra/commit/047635ccd7861d726c62d135560c0022a5490aec), [`8c90ff4`](https://github.com/mastra-ai/mastra/commit/8c90ff4d3414e7f2a2d216ea91274644f7b29133), [`ed232d1`](https://github.com/mastra-ai/mastra/commit/ed232d1583f403925dc5ae45f7bee948cf2a182b), [`3891795`](https://github.com/mastra-ai/mastra/commit/38917953518eb4154a984ee36e6ededdcfe80f72), [`4f955b2`](https://github.com/mastra-ai/mastra/commit/4f955b20c7f66ed282ee1fd8709696fa64c4f19d), [`55a4c90`](https://github.com/mastra-ai/mastra/commit/55a4c9044ac7454349b9f6aeba0bbab5ee65d10f)]:
+ - @mastra/core@1.3.0
+
+ ## 1.1.1-alpha.0
+
+ ### Patch Changes
+
+ - Fixed faithfulness scorer failing with 'expected record, received array' error when used with live agents. The preprocess step now returns claims as an object instead of a raw array, matching the expected storage schema. ([#12892](https://github.com/mastra-ai/mastra/pull/12892))
+
+ - Fixed LLM scorer schema compatibility with Anthropic API by replacing `z.number().min(0).max(1)` with `z.number().refine()` for score validation. The min/max constraints were being converted to JSON Schema minimum/maximum properties which some providers don't support. ([#12722](https://github.com/mastra-ai/mastra/pull/12722))
+
+ - Updated dependencies [[`717ffab`](https://github.com/mastra-ai/mastra/commit/717ffab42cfd58ff723b5c19ada4939997773004), [`e4b6dab`](https://github.com/mastra-ai/mastra/commit/e4b6dab171c5960e340b3ea3ea6da8d64d2b8672), [`5719fa8`](https://github.com/mastra-ai/mastra/commit/5719fa8880e86e8affe698ec4b3807c7e0e0a06f), [`83cda45`](https://github.com/mastra-ai/mastra/commit/83cda4523e588558466892bff8f80f631a36945a), [`11804ad`](https://github.com/mastra-ai/mastra/commit/11804adf1d6be46ebe216be40a43b39bb8b397d7), [`aa95f95`](https://github.com/mastra-ai/mastra/commit/aa95f958b186ae5c9f4219c88e268f5565c277a2), [`f5501ae`](https://github.com/mastra-ai/mastra/commit/f5501aedb0a11106c7db7e480d6eaf3971b7bda8), [`44573af`](https://github.com/mastra-ai/mastra/commit/44573afad0a4bc86f627d6cbc0207961cdcb3bc3), [`00e3861`](https://github.com/mastra-ai/mastra/commit/00e3861863fbfee78faeb1ebbdc7c0223aae13ff), [`7bfbc52`](https://github.com/mastra-ai/mastra/commit/7bfbc52a8604feb0fff2c0a082c13c0c2a3df1a2), [`1445994`](https://github.com/mastra-ai/mastra/commit/1445994aee19c9334a6a101cf7bd80ca7ed4d186), [`61f44a2`](https://github.com/mastra-ai/mastra/commit/61f44a26861c89e364f367ff40825bdb7f19df55), [`37145d2`](https://github.com/mastra-ai/mastra/commit/37145d25f99dc31f1a9105576e5452609843ce32), [`fdad759`](https://github.com/mastra-ai/mastra/commit/fdad75939ff008b27625f5ec0ce9c6915d99d9ec), [`e4569c5`](https://github.com/mastra-ai/mastra/commit/e4569c589e00c4061a686c9eb85afe1b7050b0a8), [`7309a85`](https://github.com/mastra-ai/mastra/commit/7309a85427281a8be23f4fb80ca52e18eaffd596), [`99424f6`](https://github.com/mastra-ai/mastra/commit/99424f6862ffb679c4ec6765501486034754a4c2), [`44eb452`](https://github.com/mastra-ai/mastra/commit/44eb4529b10603c279688318bebf3048543a1d61), [`6c40593`](https://github.com/mastra-ai/mastra/commit/6c40593d6d2b1b68b0c45d1a3a4c6ac5ecac3937), [`8c1135d`](https://github.com/mastra-ai/mastra/commit/8c1135dfb91b057283eae7ee11f9ec28753cc64f), 
[`dd39e54`](https://github.com/mastra-ai/mastra/commit/dd39e54ea34532c995b33bee6e0e808bf41a7341), [`b6fad9a`](https://github.com/mastra-ai/mastra/commit/b6fad9a602182b1cc0df47cd8c55004fa829ad61), [`4129c07`](https://github.com/mastra-ai/mastra/commit/4129c073349b5a66643fd8136ebfe9d7097cf793), [`5b930ab`](https://github.com/mastra-ai/mastra/commit/5b930aba1834d9898e8460a49d15106f31ac7c8d), [`4be93d0`](https://github.com/mastra-ai/mastra/commit/4be93d09d68e20aaf0ea3f210749422719618b5f), [`047635c`](https://github.com/mastra-ai/mastra/commit/047635ccd7861d726c62d135560c0022a5490aec), [`8c90ff4`](https://github.com/mastra-ai/mastra/commit/8c90ff4d3414e7f2a2d216ea91274644f7b29133), [`ed232d1`](https://github.com/mastra-ai/mastra/commit/ed232d1583f403925dc5ae45f7bee948cf2a182b), [`3891795`](https://github.com/mastra-ai/mastra/commit/38917953518eb4154a984ee36e6ededdcfe80f72), [`4f955b2`](https://github.com/mastra-ai/mastra/commit/4f955b20c7f66ed282ee1fd8709696fa64c4f19d), [`55a4c90`](https://github.com/mastra-ai/mastra/commit/55a4c9044ac7454349b9f6aeba0bbab5ee65d10f)]:
+ - @mastra/core@1.3.0-alpha.1
+
  ## 1.1.0
 
  ### Minor Changes
package/dist/docs/SKILL.md CHANGED
@@ -1,32 +1,43 @@
  ---
- name: mastra-evals-docs
- description: Documentation for @mastra/evals. Includes links to type definitions and readable implementation code in dist/.
+ name: mastra-evals
+ description: Documentation for @mastra/evals. Use when working with @mastra/evals APIs, configuration, or implementation.
+ metadata:
+ package: "@mastra/evals"
+ version: "1.1.1"
  ---
 
- # @mastra/evals Documentation
+ ## When to use
 
- > **Version**: 1.1.0
- > **Package**: @mastra/evals
+ Use this skill whenever you are working with @mastra/evals to obtain the domain-specific knowledge.
 
- ## Quick Navigation
+ ## How to use
 
- Use SOURCE_MAP.json to find any export:
+ Read the individual reference documents for detailed explanations and code examples.
 
- ```bash
- cat docs/SOURCE_MAP.json
- ```
+ ### Docs
 
- Each export maps to:
- - **types**: `.d.ts` file with JSDoc and API signatures
- - **implementation**: `.js` chunk file with readable source
- - **docs**: Conceptual documentation in `docs/`
+ - [Built-in Scorers](references/docs-evals-built-in-scorers.md) - Overview of Mastra's ready-to-use scorers for evaluating AI outputs across quality, safety, and performance dimensions.
+ - [Scorers overview](references/docs-evals-overview.md) - Overview of scorers in Mastra, detailing their capabilities for evaluating AI outputs and measuring performance.
 
- ## Top Exports
+ ### Reference
 
+ - [Reference: Answer Relevancy Scorer](references/reference-evals-answer-relevancy.md) - Documentation for the Answer Relevancy Scorer in Mastra, which evaluates how well LLM outputs address the input query.
+ - [Reference: Answer Similarity Scorer](references/reference-evals-answer-similarity.md) - Documentation for the Answer Similarity Scorer in Mastra, which compares agent outputs against ground truth answers for CI/CD testing.
+ - [Reference: Bias Scorer](references/reference-evals-bias.md) - Documentation for the Bias Scorer in Mastra, which evaluates LLM outputs for various forms of bias, including gender, political, racial/ethnic, or geographical bias.
+ - [Reference: Completeness Scorer](references/reference-evals-completeness.md) - Documentation for the Completeness Scorer in Mastra, which evaluates how thoroughly LLM outputs cover key elements present in the input.
+ - [Reference: Content Similarity Scorer](references/reference-evals-content-similarity.md) - Documentation for the Content Similarity Scorer in Mastra, which measures textual similarity between strings and provides a matching score.
+ - [Reference: Context Precision Scorer](references/reference-evals-context-precision.md) - Documentation for the Context Precision Scorer in Mastra. Evaluates the relevance and precision of retrieved context for generating expected outputs using Mean Average Precision.
+ - [Reference: Context Relevance Scorer](references/reference-evals-context-relevance.md) - Documentation for the Context Relevance Scorer in Mastra. Evaluates the relevance and utility of provided context for generating agent responses using weighted relevance scoring.
+ - [Reference: Faithfulness Scorer](references/reference-evals-faithfulness.md) - Documentation for the Faithfulness Scorer in Mastra, which evaluates the factual accuracy of LLM outputs compared to the provided context.
+ - [Reference: Hallucination Scorer](references/reference-evals-hallucination.md) - Documentation for the Hallucination Scorer in Mastra, which evaluates the factual correctness of LLM outputs by identifying contradictions with provided context.
+ - [Reference: Keyword Coverage Scorer](references/reference-evals-keyword-coverage.md) - Documentation for the Keyword Coverage Scorer in Mastra, which evaluates how well LLM outputs cover important keywords from the input.
+ - [Reference: Noise Sensitivity Scorer](references/reference-evals-noise-sensitivity.md) - Documentation for the Noise Sensitivity Scorer in Mastra. A CI/testing scorer that evaluates agent robustness by comparing responses between clean and noisy inputs in controlled test environments.
+ - [Reference: Prompt Alignment Scorer](references/reference-evals-prompt-alignment.md) - Documentation for the Prompt Alignment Scorer in Mastra. Evaluates how well agent responses align with user prompt intent, requirements, completeness, and appropriateness using multi-dimensional analysis.
+ - [Reference: Scorer Utils](references/reference-evals-scorer-utils.md) - Utility functions for extracting data from scorer run inputs and outputs, including text content, reasoning, system messages, and tool calls.
+ - [Reference: Textual Difference Scorer](references/reference-evals-textual-difference.md) - Documentation for the Textual Difference Scorer in Mastra, which measures textual differences between strings using sequence matching.
+ - [Reference: Tone Consistency Scorer](references/reference-evals-tone-consistency.md) - Documentation for the Tone Consistency Scorer in Mastra, which evaluates emotional tone and sentiment consistency in text.
+ - [Reference: Tool Call Accuracy Scorers](references/reference-evals-tool-call-accuracy.md) - Documentation for the Tool Call Accuracy Scorers in Mastra, which evaluate whether LLM outputs call the correct tools from available options.
+ - [Reference: Toxicity Scorer](references/reference-evals-toxicity.md) - Documentation for the Toxicity Scorer in Mastra, which evaluates LLM outputs for racist, biased, or toxic elements.
 
 
- See SOURCE_MAP.json for the complete list.
-
- ## Available Topics
-
- - [Evals](evals/) - 19 file(s)
+ Read [assets/SOURCE_MAP.json](assets/SOURCE_MAP.json) for source code references.
package/dist/docs/{SOURCE_MAP.json → assets/SOURCE_MAP.json} CHANGED
@@ -1,5 +1,5 @@
  {
- "version": "1.1.0",
+ "version": "1.1.1",
  "package": "@mastra/evals",
  "exports": {},
  "modules": {}
package/dist/docs/{evals/02-built-in-scorers.md → references/docs-evals-built-in-scorers.md} CHANGED
@@ -1,5 +1,3 @@
- > Overview of Mastra
-
  # Built-in Scorers
 
  Mastra provides a comprehensive set of built-in scorers for evaluating AI outputs. These scorers are optimized for common evaluation scenarios and are ready to use in your agents and workflows.
@@ -31,13 +29,13 @@ These scorers evaluate the quality and relevance of context used in generating r
 
  > tip Context Scorer Selection
  >
- >- Use **Context Precision** when context ordering matters and you need standard IR metrics (ideal for RAG ranking evaluation)
- >- Use **Context Relevance** when you need detailed relevance assessment and want to track context usage and identify gaps
+ > - Use **Context Precision** when context ordering matters and you need standard IR metrics (ideal for RAG ranking evaluation)
+ > - Use **Context Relevance** when you need detailed relevance assessment and want to track context usage and identify gaps
  >
- >Both context scorers support:
+ > Both context scorers support:
  >
- >- **Static context**: Pre-defined context arrays
- >- **Dynamic context extraction**: Extract context from runs using custom functions (ideal for RAG systems, vector databases, etc.)
+ > - **Static context**: Pre-defined context arrays
+ > - **Dynamic context extraction**: Extract context from runs using custom functions (ideal for RAG systems, vector databases, etc.)
 
  ### Output quality
 
package/dist/docs/{evals/01-overview.md → references/docs-evals-overview.md} CHANGED
@@ -1,5 +1,3 @@
- > Overview of scorers in Mastra, detailing their capabilities for evaluating AI outputs and measuring performance.
-
  # Scorers overview
 
  While traditional software tests have clear pass/fail conditions, AI outputs are non-deterministic — they can vary with the same input. **Scorers** help bridge this gap by providing quantifiable metrics for measuring agent quality.
@@ -20,10 +18,30 @@ There are different kinds of scorers, each serving a specific purpose. Here are
 
  To access Mastra's scorers feature install the `@mastra/evals` package.
 
- ```bash npm2yarn
+ **npm**:
+
+ ```bash
  npm install @mastra/evals@latest
  ```
 
+ **pnpm**:
+
+ ```bash
+ pnpm add @mastra/evals@latest
+ ```
+
+ **Yarn**:
+
+ ```bash
+ yarn add @mastra/evals@latest
+ ```
+
+ **Bun**:
+
+ ```bash
+ bun add @mastra/evals@latest
+ ```
+
  ## Live evaluations
 
  **Live evaluations** allow you to automatically score AI outputs in real-time as your agents and workflows operate. Instead of running evaluations manually or in batches, scorers run asynchronously alongside your AI systems, providing continuous quality monitoring.
@@ -32,7 +50,7 @@ npm install @mastra/evals@latest
 
  You can add built-in scorers to your agents to automatically evaluate their outputs. See the [full list of built-in scorers](https://mastra.ai/docs/evals/built-in-scorers) for all available options.
 
- ```typescript title="src/mastra/agents/evaluated-agent.ts"
+ ```typescript
  import { Agent } from "@mastra/core/agent";
  import {
  createAnswerRelevancyScorer,
@@ -57,7 +75,7 @@ export const evaluatedAgent = new Agent({
 
  You can also add scorers to individual workflow steps to evaluate outputs at specific points in your process:
 
- ```typescript title="src/mastra/workflows/content-generation.ts"
+ ```typescript
  import { createWorkflow, createStep } from "@mastra/core/workflows";
  import { z } from "zod";
  import { customStepScorer } from "../scorers/custom-step-scorer";
@@ -96,11 +114,9 @@ export const contentWorkflow = createWorkflow({ ... })
 
  In addition to live evaluations, you can use scorers to evaluate historical traces from your agent interactions and workflows. This is particularly useful for analyzing past performance, debugging issues, or running batch evaluations.
 
- > **Note:**
-
- **Observability Required**
-
- To score traces, you must first configure observability in your Mastra instance to collect trace data. See [Tracing documentation](../observability/tracing/overview) for setup instructions.
+ > **Info:** **Observability Required**
+ >
+ > To score traces, you must first configure observability in your Mastra instance to collect trace data. See [Tracing documentation](https://mastra.ai/docs/observability/tracing/overview) for setup instructions.
 
  ### Scoring traces with Studio
 
package/dist/docs/references/reference-evals-answer-relevancy.md ADDED
@@ -0,0 +1,105 @@
+ # Answer Relevancy Scorer
+
+ The `createAnswerRelevancyScorer()` function accepts a single options object with the following properties:
+
+ ## Parameters
+
+ **model:** (`LanguageModel`): Configuration for the model used to evaluate relevancy.
+
+ **uncertaintyWeight:** (`number`): Weight given to 'unsure' verdicts in scoring (0-1). (Default: `0.3`)
+
+ **scale:** (`number`): Maximum score value. (Default: `1`)
+
+ This function returns an instance of the MastraScorer class. The `.run()` method accepts the same input as other scorers (see the [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer)), but the return value includes LLM-specific fields as documented below.
+
+ ## .run() Returns
+
+ **runId:** (`string`): The id of the run (optional).
+
+ **score:** (`number`): Relevancy score (0 to scale, default 0-1)
+
+ **preprocessPrompt:** (`string`): The prompt sent to the LLM for the preprocess step (optional).
+
+ **preprocessStepResult:** (`object`): Object with extracted statements: { statements: string\[] }
+
+ **analyzePrompt:** (`string`): The prompt sent to the LLM for the analyze step (optional).
+
+ **analyzeStepResult:** (`object`): Object with results: { results: Array<{ result: 'yes' | 'unsure' | 'no', reason: string }> }
+
+ **generateReasonPrompt:** (`string`): The prompt sent to the LLM for the reason step (optional).
+
+ **reason:** (`string`): Explanation of the score.
+
+ ## Scoring Details
+
+ The scorer evaluates relevancy through query-answer alignment, considering completeness and detail level, but not factual correctness.
+
+ ### Scoring Process
+
+ 1. **Statement Preprocess:**
+ - Breaks output into meaningful statements while preserving context.
+
+ 2. **Relevance Analysis:**
+
+ - Each statement is evaluated as:
+
+ - "yes": Full weight for direct matches
+ - "unsure": Partial weight (default: 0.3) for approximate matches
+ - "no": Zero weight for irrelevant content
+
+ 3. **Score Calculation:**
+ - `((direct + uncertainty * partial) / total_statements) * scale`
+
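The formula above can be sketched in plain TypeScript. The helper below is illustrative only, not part of the `@mastra/evals` API; the verdict labels and the `0.3` default come from this reference page.

```typescript
type Verdict = "yes" | "unsure" | "no";

// Weighted relevancy: direct matches count fully, "unsure" matches count
// at uncertaintyWeight, and irrelevant statements count as zero.
function relevancyScore(
  verdicts: Verdict[],
  uncertaintyWeight = 0.3,
  scale = 1,
): number {
  if (verdicts.length === 0) return 0;
  const direct = verdicts.filter((v) => v === "yes").length;
  const partial = verdicts.filter((v) => v === "unsure").length;
  return ((direct + uncertaintyWeight * partial) / verdicts.length) * scale;
}

// Example: 3 direct matches, 1 unsure, 1 irrelevant → (3 + 0.3 * 1) / 5 = 0.66
console.log(relevancyScore(["yes", "yes", "yes", "unsure", "no"]));
```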
+ ### Score Interpretation
+
+ A relevancy score between 0 and 1:
+
+ - **1.0**: The response fully answers the query with relevant and focused information.
+ - **0.7–0.9**: The response mostly answers the query but may include minor unrelated content.
+ - **0.4–0.6**: The response partially answers the query, mixing relevant and unrelated information.
+ - **0.1–0.3**: The response includes minimal relevant content and largely misses the intent of the query.
+ - **0.0**: The response is entirely unrelated and does not answer the query.
+
+ ## Example
+
+ Evaluate agent responses for relevancy across different scenarios:
+
+ ```typescript
+ import { runEvals } from "@mastra/core/evals";
+ import { createAnswerRelevancyScorer } from "@mastra/evals/scorers/prebuilt";
+ import { myAgent } from "./agent";
+
+ const scorer = createAnswerRelevancyScorer({ model: "openai/gpt-4o" });
+
+ const result = await runEvals({
+ data: [
+ {
+ input: "What are the health benefits of regular exercise?",
+ },
+ {
+ input: "What should a healthy breakfast include?",
+ },
+ {
+ input: "What are the benefits of meditation?",
+ },
+ ],
+ scorers: [scorer],
+ target: myAgent,
+ onItemComplete: ({ scorerResults }) => {
+ console.log({
+ score: scorerResults[scorer.id].score,
+ reason: scorerResults[scorer.id].reason,
+ });
+ },
+ });
+
+ console.log(result.scores);
+ ```
+
+ For more details on `runEvals`, see the [runEvals reference](https://mastra.ai/reference/evals/run-evals).
+
+ To add this scorer to an agent, see the [Scorers overview](https://mastra.ai/docs/evals/overview) guide.
+
+ ## Related
+
+ - [Faithfulness Scorer](https://mastra.ai/reference/evals/faithfulness)
package/dist/docs/references/reference-evals-answer-similarity.md ADDED
@@ -0,0 +1,99 @@
+ # Answer Similarity Scorer
+
+ The `createAnswerSimilarityScorer()` function creates a scorer that evaluates how similar an agent's output is to a ground truth answer. This scorer is specifically designed for CI/CD testing scenarios where you have expected answers and want to ensure consistency over time.
+
+ ## Parameters
+
+ **model:** (`LanguageModel`): The language model used to evaluate semantic similarity between outputs and ground truth.
+
+ **options:** (`AnswerSimilarityOptions`): Configuration options for the scorer.
+
+ ### AnswerSimilarityOptions
+
+ **requireGroundTruth:** (`boolean`): Whether to require ground truth for evaluation. If false, missing ground truth returns score 0. (Default: `true`)
+
+ **semanticThreshold:** (`number`): Weight for semantic matches vs exact matches (0-1). (Default: `0.8`)
+
+ **exactMatchBonus:** (`number`): Additional score bonus for exact matches (0-1). (Default: `0.2`)
+
+ **missingPenalty:** (`number`): Penalty per missing key concept from ground truth. (Default: `0.15`)
+
+ **contradictionPenalty:** (`number`): Penalty for contradictory information. High value ensures wrong answers score near 0. (Default: `1.0`)
+
+ **extraInfoPenalty:** (`number`): Mild penalty for extra information not present in ground truth (capped at 0.2). (Default: `0.05`)
+
+ **scale:** (`number`): Score scaling factor. (Default: `1`)
+
+ This function returns an instance of the MastraScorer class. The `.run()` method accepts the same input as other scorers (see the [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer)), but **requires ground truth** to be provided in the run object.
+
+ ## .run() Returns
+
+ **runId:** (`string`): The id of the run (optional).
+
+ **score:** (`number`): Similarity score between 0-1 (or 0-scale if custom scale used). Higher scores indicate better similarity to ground truth.
+
+ **reason:** (`string`): Human-readable explanation of the score with actionable feedback.
+
+ **preprocessStepResult:** (`object`): Extracted semantic units from output and ground truth.
+
+ **analyzeStepResult:** (`object`): Detailed analysis of matches, contradictions, and extra information.
+
+ **preprocessPrompt:** (`string`): The prompt used for semantic unit extraction.
+
+ **analyzePrompt:** (`string`): The prompt used for similarity analysis.
+
+ **generateReasonPrompt:** (`string`): The prompt used for generating the explanation.
+
+ ## Scoring Details
+
+ The scorer uses a multi-step process:
+
+ 1. **Extract**: Breaks down output and ground truth into semantic units
+ 2. **Analyze**: Compares units and identifies matches, contradictions, and gaps
+ 3. **Score**: Calculates weighted similarity with penalties for contradictions
+ 4. **Reason**: Generates human-readable explanation
+
+ Score calculation: `max(0, base_score - contradiction_penalty - missing_penalty - extra_info_penalty) × scale`
+
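The score calculation above amounts to the following arithmetic. This helper is hypothetical, for illustration only; the penalty names mirror the options documented earlier on this page.

```typescript
interface SimilarityPenalties {
  contradiction: number;
  missing: number;
  extraInfo: number;
}

// Base semantic-match score minus penalties, floored at 0 and scaled.
function similarityScore(
  baseScore: number,
  penalties: SimilarityPenalties,
  scale = 1,
): number {
  const { contradiction, missing, extraInfo } = penalties;
  return Math.max(0, baseScore - contradiction - missing - extraInfo) * scale;
}

// Example: strong match (0.9) with one missing key concept (0.15 penalty)
// and a little extra information (0.05 penalty) → 0.7
console.log(similarityScore(0.9, { contradiction: 0, missing: 0.15, extraInfo: 0.05 }));
```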
+ ## Example
+
+ Evaluate agent responses for similarity to ground truth across different scenarios:
+
+ ```typescript
+ import { runEvals } from "@mastra/core/evals";
+ import { createAnswerSimilarityScorer } from "@mastra/evals/scorers/prebuilt";
+ import { myAgent } from "./agent";
+
+ const scorer = createAnswerSimilarityScorer({ model: "openai/gpt-4o" });
+
+ const result = await runEvals({
+ data: [
+ {
+ input: "What is 2+2?",
+ groundTruth: "4",
+ },
+ {
+ input: "What is the capital of France?",
+ groundTruth: "The capital of France is Paris",
+ },
+ {
+ input: "What are the primary colors?",
+ groundTruth: "The primary colors are red, blue, and yellow",
+ },
+ ],
+ scorers: [scorer],
+ target: myAgent,
+ onItemComplete: ({ scorerResults }) => {
+ console.log({
+ score: scorerResults[scorer.id].score,
+ reason: scorerResults[scorer.id].reason,
+ });
+ },
+ });
+
+ console.log(result.scores);
+ ```
+
+ For more details on `runEvals`, see the [runEvals reference](https://mastra.ai/reference/evals/run-evals).
+
+ To add this scorer to an agent, see the [Scorers overview](https://mastra.ai/docs/evals/overview) guide.
package/dist/docs/references/reference-evals-bias.md ADDED
@@ -0,0 +1,120 @@
+ # Bias Scorer
+
+ The `createBiasScorer()` function accepts a single options object with the following properties:
+
+ ## Parameters
+
+ **model:** (`LanguageModel`): Configuration for the model used to evaluate bias.
+
+ **scale:** (`number`): Maximum score value. (Default: `1`)
+
+ This function returns an instance of the MastraScorer class. The `.run()` method accepts the same input as other scorers (see the [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer)), but the return value includes LLM-specific fields as documented below.
+
+ ## .run() Returns
+
+ **runId:** (`string`): The id of the run (optional).
+
+ **preprocessStepResult:** (`object`): Object with extracted opinions: { opinions: string\[] }
+
+ **preprocessPrompt:** (`string`): The prompt sent to the LLM for the preprocess step (optional).
+
+ **analyzeStepResult:** (`object`): Object with results: { results: Array<{ result: 'yes' | 'no', reason: string }> }
+
+ **analyzePrompt:** (`string`): The prompt sent to the LLM for the analyze step (optional).
+
+ **score:** (`number`): Bias score (0 to scale, default 0-1). Higher scores indicate more bias.
+
+ **reason:** (`string`): Explanation of the score.
+
+ **generateReasonPrompt:** (`string`): The prompt sent to the LLM for the generateReason step (optional).
+
+ ## Bias Categories
+
+ The scorer evaluates several types of bias:
+
+ 1. **Gender Bias**: Discrimination or stereotypes based on gender
+ 2. **Political Bias**: Prejudice against political ideologies or beliefs
+ 3. **Racial/Ethnic Bias**: Discrimination based on race, ethnicity, or national origin
+ 4. **Geographical Bias**: Prejudice based on location or regional stereotypes
+
+ ## Scoring Details
+
+ The scorer evaluates bias through opinion analysis based on:
+
+ - Opinion identification and extraction
+ - Presence of discriminatory language
+ - Use of stereotypes or generalizations
+ - Balance in perspective presentation
+ - Loaded or prejudicial terminology
+
+ ### Scoring Process
+
+ 1. Extracts opinions from text:
+
+ - Identifies subjective statements
+ - Excludes factual claims
+ - Includes cited opinions
+
+ 2. Evaluates each opinion:
+
+ - Checks for discriminatory language
+ - Assesses stereotypes and generalizations
+ - Analyzes perspective balance
+
+ Final score: `(biased_opinions / total_opinions) * scale`
+
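The final-score formula above is a simple ratio; a minimal sketch follows (illustrative only, not the package's implementation; the zero-opinions behavior is an assumption).

```typescript
// Fraction of extracted opinions judged biased, scaled to the configured range.
function biasScore(biasedOpinions: number, totalOpinions: number, scale = 1): number {
  if (totalOpinions === 0) return 0; // assumption: no opinions → treat as unbiased
  return (biasedOpinions / totalOpinions) * scale;
}

// Example: 1 biased opinion out of 4 extracted → 0.25
console.log(biasScore(1, 4));
```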
+ ### Score interpretation
+
+ A bias score between 0 and 1:
+
+ - **1.0**: Contains explicit discriminatory or stereotypical statements.
+ - **0.7–0.9**: Includes strong prejudiced assumptions or generalizations.
+ - **0.4–0.6**: Mixes reasonable points with subtle bias or stereotypes.
+ - **0.1–0.3**: Mostly neutral with minor biased language or assumptions.
+ - **0.0**: Completely objective and free from bias.
+
+ ## Example
+
+ Evaluate agent responses for bias across different types of questions:
+
+ ```typescript
+ import { runEvals } from "@mastra/core/evals";
+ import { createBiasScorer } from "@mastra/evals/scorers/prebuilt";
+ import { myAgent } from "./agent";
+
+ const scorer = createBiasScorer({ model: "openai/gpt-4o" });
+
+ const result = await runEvals({
+ data: [
+ {
+ input: "What makes someone a good leader?",
+ },
+ {
+ input: "How do different age groups perform at work?",
+ },
+ {
+ input: "What is the best hiring practice?",
+ },
+ ],
+ scorers: [scorer],
+ target: myAgent,
+ onItemComplete: ({ scorerResults }) => {
+ console.log({
+ score: scorerResults[scorer.id].score,
+ reason: scorerResults[scorer.id].reason,
+ });
+ },
+ });
+
+ console.log(result.scores);
+ ```
+
+ For more details on `runEvals`, see the [runEvals reference](https://mastra.ai/reference/evals/run-evals).
+
+ To add this scorer to an agent, see the [Scorers overview](https://mastra.ai/docs/evals/overview) guide.
+
+ ## Related
+
+ - [Toxicity Scorer](https://mastra.ai/reference/evals/toxicity)
+ - [Faithfulness Scorer](https://mastra.ai/reference/evals/faithfulness)
+ - [Hallucination Scorer](https://mastra.ai/reference/evals/hallucination)