@mastra/evals 1.1.0 → 1.1.1
- package/CHANGELOG.md +22 -0
- package/dist/docs/SKILL.md +31 -20
- package/dist/docs/{SOURCE_MAP.json → assets/SOURCE_MAP.json} +1 -1
- package/dist/docs/{evals/02-built-in-scorers.md → references/docs-evals-built-in-scorers.md} +5 -7
- package/dist/docs/{evals/01-overview.md → references/docs-evals-overview.md} +26 -10
- package/dist/docs/references/reference-evals-answer-relevancy.md +105 -0
- package/dist/docs/references/reference-evals-answer-similarity.md +99 -0
- package/dist/docs/references/reference-evals-bias.md +120 -0
- package/dist/docs/references/reference-evals-completeness.md +137 -0
- package/dist/docs/references/reference-evals-content-similarity.md +101 -0
- package/dist/docs/references/reference-evals-context-precision.md +196 -0
- package/dist/docs/references/reference-evals-context-relevance.md +536 -0
- package/dist/docs/references/reference-evals-faithfulness.md +114 -0
- package/dist/docs/references/reference-evals-hallucination.md +220 -0
- package/dist/docs/references/reference-evals-keyword-coverage.md +128 -0
- package/dist/docs/references/reference-evals-noise-sensitivity.md +685 -0
- package/dist/docs/references/reference-evals-prompt-alignment.md +619 -0
- package/dist/docs/references/reference-evals-scorer-utils.md +330 -0
- package/dist/docs/references/reference-evals-textual-difference.md +113 -0
- package/dist/docs/references/reference-evals-tone-consistency.md +119 -0
- package/dist/docs/references/reference-evals-tool-call-accuracy.md +533 -0
- package/dist/docs/references/reference-evals-toxicity.md +123 -0
- package/dist/scorers/llm/faithfulness/index.d.ts +3 -1
- package/dist/scorers/llm/faithfulness/index.d.ts.map +1 -1
- package/dist/scorers/llm/noise-sensitivity/index.d.ts.map +1 -1
- package/dist/scorers/llm/prompt-alignment/index.d.ts.map +1 -1
- package/dist/scorers/prebuilt/index.cjs +11 -7
- package/dist/scorers/prebuilt/index.cjs.map +1 -1
- package/dist/scorers/prebuilt/index.js +11 -7
- package/dist/scorers/prebuilt/index.js.map +1 -1
- package/package.json +4 -5
- package/dist/docs/README.md +0 -31
- package/dist/docs/evals/03-reference.md +0 -4092
package/CHANGELOG.md
CHANGED

@@ -1,5 +1,27 @@
 # @mastra/evals
 
+## 1.1.1
+
+### Patch Changes
+
+- Fixed faithfulness scorer failing with 'expected record, received array' error when used with live agents. The preprocess step now returns claims as an object instead of a raw array, matching the expected storage schema. ([#12892](https://github.com/mastra-ai/mastra/pull/12892))
+
+- Fixed LLM scorer schema compatibility with Anthropic API by replacing `z.number().min(0).max(1)` with `z.number().refine()` for score validation. The min/max constraints were being converted to JSON Schema minimum/maximum properties which some providers don't support. ([#12722](https://github.com/mastra-ai/mastra/pull/12722))
+
+- Updated dependencies [[`717ffab`](https://github.com/mastra-ai/mastra/commit/717ffab42cfd58ff723b5c19ada4939997773004), [`b31c922`](https://github.com/mastra-ai/mastra/commit/b31c922215b513791d98feaea1b98784aa00803a), [`e4b6dab`](https://github.com/mastra-ai/mastra/commit/e4b6dab171c5960e340b3ea3ea6da8d64d2b8672), [`5719fa8`](https://github.com/mastra-ai/mastra/commit/5719fa8880e86e8affe698ec4b3807c7e0e0a06f), [`83cda45`](https://github.com/mastra-ai/mastra/commit/83cda4523e588558466892bff8f80f631a36945a), [`11804ad`](https://github.com/mastra-ai/mastra/commit/11804adf1d6be46ebe216be40a43b39bb8b397d7), [`aa95f95`](https://github.com/mastra-ai/mastra/commit/aa95f958b186ae5c9f4219c88e268f5565c277a2), [`90f7894`](https://github.com/mastra-ai/mastra/commit/90f7894568dc9481f40a4d29672234fae23090bb), [`f5501ae`](https://github.com/mastra-ai/mastra/commit/f5501aedb0a11106c7db7e480d6eaf3971b7bda8), [`44573af`](https://github.com/mastra-ai/mastra/commit/44573afad0a4bc86f627d6cbc0207961cdcb3bc3), [`00e3861`](https://github.com/mastra-ai/mastra/commit/00e3861863fbfee78faeb1ebbdc7c0223aae13ff), [`8109aee`](https://github.com/mastra-ai/mastra/commit/8109aeeab758e16cd4255a6c36f044b70eefc6a6), [`7bfbc52`](https://github.com/mastra-ai/mastra/commit/7bfbc52a8604feb0fff2c0a082c13c0c2a3df1a2), [`1445994`](https://github.com/mastra-ai/mastra/commit/1445994aee19c9334a6a101cf7bd80ca7ed4d186), [`61f44a2`](https://github.com/mastra-ai/mastra/commit/61f44a26861c89e364f367ff40825bdb7f19df55), [`37145d2`](https://github.com/mastra-ai/mastra/commit/37145d25f99dc31f1a9105576e5452609843ce32), [`fdad759`](https://github.com/mastra-ai/mastra/commit/fdad75939ff008b27625f5ec0ce9c6915d99d9ec), [`e4569c5`](https://github.com/mastra-ai/mastra/commit/e4569c589e00c4061a686c9eb85afe1b7050b0a8), [`7309a85`](https://github.com/mastra-ai/mastra/commit/7309a85427281a8be23f4fb80ca52e18eaffd596), [`99424f6`](https://github.com/mastra-ai/mastra/commit/99424f6862ffb679c4ec6765501486034754a4c2), [`44eb452`](https://github.com/mastra-ai/mastra/commit/44eb4529b10603c279688318bebf3048543a1d61), [`6c40593`](https://github.com/mastra-ai/mastra/commit/6c40593d6d2b1b68b0c45d1a3a4c6ac5ecac3937), [`8c1135d`](https://github.com/mastra-ai/mastra/commit/8c1135dfb91b057283eae7ee11f9ec28753cc64f), [`dd39e54`](https://github.com/mastra-ai/mastra/commit/dd39e54ea34532c995b33bee6e0e808bf41a7341), [`b6fad9a`](https://github.com/mastra-ai/mastra/commit/b6fad9a602182b1cc0df47cd8c55004fa829ad61), [`4129c07`](https://github.com/mastra-ai/mastra/commit/4129c073349b5a66643fd8136ebfe9d7097cf793), [`5b930ab`](https://github.com/mastra-ai/mastra/commit/5b930aba1834d9898e8460a49d15106f31ac7c8d), [`4be93d0`](https://github.com/mastra-ai/mastra/commit/4be93d09d68e20aaf0ea3f210749422719618b5f), [`047635c`](https://github.com/mastra-ai/mastra/commit/047635ccd7861d726c62d135560c0022a5490aec), [`8c90ff4`](https://github.com/mastra-ai/mastra/commit/8c90ff4d3414e7f2a2d216ea91274644f7b29133), [`ed232d1`](https://github.com/mastra-ai/mastra/commit/ed232d1583f403925dc5ae45f7bee948cf2a182b), [`3891795`](https://github.com/mastra-ai/mastra/commit/38917953518eb4154a984ee36e6ededdcfe80f72), [`4f955b2`](https://github.com/mastra-ai/mastra/commit/4f955b20c7f66ed282ee1fd8709696fa64c4f19d), [`55a4c90`](https://github.com/mastra-ai/mastra/commit/55a4c9044ac7454349b9f6aeba0bbab5ee65d10f)]:
+  - @mastra/core@1.3.0
+
+## 1.1.1-alpha.0
+
+### Patch Changes
+
+- Fixed faithfulness scorer failing with 'expected record, received array' error when used with live agents. The preprocess step now returns claims as an object instead of a raw array, matching the expected storage schema. ([#12892](https://github.com/mastra-ai/mastra/pull/12892))
+
+- Fixed LLM scorer schema compatibility with Anthropic API by replacing `z.number().min(0).max(1)` with `z.number().refine()` for score validation. The min/max constraints were being converted to JSON Schema minimum/maximum properties which some providers don't support. ([#12722](https://github.com/mastra-ai/mastra/pull/12722))
+
+- Updated dependencies [[`717ffab`](https://github.com/mastra-ai/mastra/commit/717ffab42cfd58ff723b5c19ada4939997773004), [`e4b6dab`](https://github.com/mastra-ai/mastra/commit/e4b6dab171c5960e340b3ea3ea6da8d64d2b8672), [`5719fa8`](https://github.com/mastra-ai/mastra/commit/5719fa8880e86e8affe698ec4b3807c7e0e0a06f), [`83cda45`](https://github.com/mastra-ai/mastra/commit/83cda4523e588558466892bff8f80f631a36945a), [`11804ad`](https://github.com/mastra-ai/mastra/commit/11804adf1d6be46ebe216be40a43b39bb8b397d7), [`aa95f95`](https://github.com/mastra-ai/mastra/commit/aa95f958b186ae5c9f4219c88e268f5565c277a2), [`f5501ae`](https://github.com/mastra-ai/mastra/commit/f5501aedb0a11106c7db7e480d6eaf3971b7bda8), [`44573af`](https://github.com/mastra-ai/mastra/commit/44573afad0a4bc86f627d6cbc0207961cdcb3bc3), [`00e3861`](https://github.com/mastra-ai/mastra/commit/00e3861863fbfee78faeb1ebbdc7c0223aae13ff), [`7bfbc52`](https://github.com/mastra-ai/mastra/commit/7bfbc52a8604feb0fff2c0a082c13c0c2a3df1a2), [`1445994`](https://github.com/mastra-ai/mastra/commit/1445994aee19c9334a6a101cf7bd80ca7ed4d186), [`61f44a2`](https://github.com/mastra-ai/mastra/commit/61f44a26861c89e364f367ff40825bdb7f19df55), [`37145d2`](https://github.com/mastra-ai/mastra/commit/37145d25f99dc31f1a9105576e5452609843ce32), [`fdad759`](https://github.com/mastra-ai/mastra/commit/fdad75939ff008b27625f5ec0ce9c6915d99d9ec), [`e4569c5`](https://github.com/mastra-ai/mastra/commit/e4569c589e00c4061a686c9eb85afe1b7050b0a8), [`7309a85`](https://github.com/mastra-ai/mastra/commit/7309a85427281a8be23f4fb80ca52e18eaffd596), [`99424f6`](https://github.com/mastra-ai/mastra/commit/99424f6862ffb679c4ec6765501486034754a4c2), [`44eb452`](https://github.com/mastra-ai/mastra/commit/44eb4529b10603c279688318bebf3048543a1d61), [`6c40593`](https://github.com/mastra-ai/mastra/commit/6c40593d6d2b1b68b0c45d1a3a4c6ac5ecac3937), [`8c1135d`](https://github.com/mastra-ai/mastra/commit/8c1135dfb91b057283eae7ee11f9ec28753cc64f), [`dd39e54`](https://github.com/mastra-ai/mastra/commit/dd39e54ea34532c995b33bee6e0e808bf41a7341), [`b6fad9a`](https://github.com/mastra-ai/mastra/commit/b6fad9a602182b1cc0df47cd8c55004fa829ad61), [`4129c07`](https://github.com/mastra-ai/mastra/commit/4129c073349b5a66643fd8136ebfe9d7097cf793), [`5b930ab`](https://github.com/mastra-ai/mastra/commit/5b930aba1834d9898e8460a49d15106f31ac7c8d), [`4be93d0`](https://github.com/mastra-ai/mastra/commit/4be93d09d68e20aaf0ea3f210749422719618b5f), [`047635c`](https://github.com/mastra-ai/mastra/commit/047635ccd7861d726c62d135560c0022a5490aec), [`8c90ff4`](https://github.com/mastra-ai/mastra/commit/8c90ff4d3414e7f2a2d216ea91274644f7b29133), [`ed232d1`](https://github.com/mastra-ai/mastra/commit/ed232d1583f403925dc5ae45f7bee948cf2a182b), [`3891795`](https://github.com/mastra-ai/mastra/commit/38917953518eb4154a984ee36e6ededdcfe80f72), [`4f955b2`](https://github.com/mastra-ai/mastra/commit/4f955b20c7f66ed282ee1fd8709696fa64c4f19d), [`55a4c90`](https://github.com/mastra-ai/mastra/commit/55a4c9044ac7454349b9f6aeba0bbab5ee65d10f)]:
+  - @mastra/core@1.3.0-alpha.1
+
 ## 1.1.0
 
 ### Minor Changes
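The faithfulness fix in the patch notes above can be illustrated with a hypothetical sketch (the type and function names are illustrative, not the package's actual internals): the preprocess step must return a record, not a bare array, to satisfy the storage schema that produced the "expected record, received array" error.

```typescript
// Hypothetical sketch of the shape change described in the changelog entry.
type PreprocessResult = { claims: string[] };

function preprocessClaims(rawClaims: string[]): PreprocessResult {
  // Before (failed validation): return rawClaims; // bare array, not a record
  // After: wrap the array in an object so it matches the storage schema.
  return { claims: rawClaims };
}

// The Zod change works the same way conceptually: a refine() predicate keeps
// the 0-1 range check in JavaScript instead of emitting JSON Schema
// `minimum`/`maximum` properties that some providers reject.
const isValidScore = (value: number): boolean => value >= 0 && value <= 1;
```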
package/dist/docs/SKILL.md
CHANGED

@@ -1,32 +1,43 @@
 ---
-name: mastra-evals
-description: Documentation for @mastra/evals.
+name: mastra-evals
+description: Documentation for @mastra/evals. Use when working with @mastra/evals APIs, configuration, or implementation.
+metadata:
+  package: "@mastra/evals"
+  version: "1.1.1"
 ---
 
-
+## When to use
 
-
-> **Package**: @mastra/evals
+Use this skill whenever you are working with @mastra/evals to obtain the domain-specific knowledge.
 
-##
+## How to use
 
-
+Read the individual reference documents for detailed explanations and code examples.
 
-
-cat docs/SOURCE_MAP.json
-```
+### Docs
 
-
--
-- **implementation**: `.js` chunk file with readable source
-- **docs**: Conceptual documentation in `docs/`
+- [Built-in Scorers](references/docs-evals-built-in-scorers.md) - Overview of Mastra's ready-to-use scorers for evaluating AI outputs across quality, safety, and performance dimensions.
+- [Scorers overview](references/docs-evals-overview.md) - Overview of scorers in Mastra, detailing their capabilities for evaluating AI outputs and measuring performance.
 
-
+### Reference
 
+- [Reference: Answer Relevancy Scorer](references/reference-evals-answer-relevancy.md) - Documentation for the Answer Relevancy Scorer in Mastra, which evaluates how well LLM outputs address the input query.
+- [Reference: Answer Similarity Scorer](references/reference-evals-answer-similarity.md) - Documentation for the Answer Similarity Scorer in Mastra, which compares agent outputs against ground truth answers for CI/CD testing.
+- [Reference: Bias Scorer](references/reference-evals-bias.md) - Documentation for the Bias Scorer in Mastra, which evaluates LLM outputs for various forms of bias, including gender, political, racial/ethnic, or geographical bias.
+- [Reference: Completeness Scorer](references/reference-evals-completeness.md) - Documentation for the Completeness Scorer in Mastra, which evaluates how thoroughly LLM outputs cover key elements present in the input.
+- [Reference: Content Similarity Scorer](references/reference-evals-content-similarity.md) - Documentation for the Content Similarity Scorer in Mastra, which measures textual similarity between strings and provides a matching score.
+- [Reference: Context Precision Scorer](references/reference-evals-context-precision.md) - Documentation for the Context Precision Scorer in Mastra. Evaluates the relevance and precision of retrieved context for generating expected outputs using Mean Average Precision.
+- [Reference: Context Relevance Scorer](references/reference-evals-context-relevance.md) - Documentation for the Context Relevance Scorer in Mastra. Evaluates the relevance and utility of provided context for generating agent responses using weighted relevance scoring.
+- [Reference: Faithfulness Scorer](references/reference-evals-faithfulness.md) - Documentation for the Faithfulness Scorer in Mastra, which evaluates the factual accuracy of LLM outputs compared to the provided context.
+- [Reference: Hallucination Scorer](references/reference-evals-hallucination.md) - Documentation for the Hallucination Scorer in Mastra, which evaluates the factual correctness of LLM outputs by identifying contradictions with provided context.
+- [Reference: Keyword Coverage Scorer](references/reference-evals-keyword-coverage.md) - Documentation for the Keyword Coverage Scorer in Mastra, which evaluates how well LLM outputs cover important keywords from the input.
+- [Reference: Noise Sensitivity Scorer](references/reference-evals-noise-sensitivity.md) - Documentation for the Noise Sensitivity Scorer in Mastra. A CI/testing scorer that evaluates agent robustness by comparing responses between clean and noisy inputs in controlled test environments.
+- [Reference: Prompt Alignment Scorer](references/reference-evals-prompt-alignment.md) - Documentation for the Prompt Alignment Scorer in Mastra. Evaluates how well agent responses align with user prompt intent, requirements, completeness, and appropriateness using multi-dimensional analysis.
+- [Reference: Scorer Utils](references/reference-evals-scorer-utils.md) - Utility functions for extracting data from scorer run inputs and outputs, including text content, reasoning, system messages, and tool calls.
+- [Reference: Textual Difference Scorer](references/reference-evals-textual-difference.md) - Documentation for the Textual Difference Scorer in Mastra, which measures textual differences between strings using sequence matching.
+- [Reference: Tone Consistency Scorer](references/reference-evals-tone-consistency.md) - Documentation for the Tone Consistency Scorer in Mastra, which evaluates emotional tone and sentiment consistency in text.
+- [Reference: Tool Call Accuracy Scorers](references/reference-evals-tool-call-accuracy.md) - Documentation for the Tool Call Accuracy Scorers in Mastra, which evaluate whether LLM outputs call the correct tools from available options.
+- [Reference: Toxicity Scorer](references/reference-evals-toxicity.md) - Documentation for the Toxicity Scorer in Mastra, which evaluates LLM outputs for racist, biased, or toxic elements.
 
 
-
-
-## Available Topics
-
-- [Evals](evals/) - 19 file(s)
+Read [assets/SOURCE_MAP.json](assets/SOURCE_MAP.json) for source code references.
package/dist/docs/{evals/02-built-in-scorers.md → references/docs-evals-built-in-scorers.md}
RENAMED

@@ -1,5 +1,3 @@
-> Overview of Mastra
-
 # Built-in Scorers
 
 Mastra provides a comprehensive set of built-in scorers for evaluating AI outputs. These scorers are optimized for common evaluation scenarios and are ready to use in your agents and workflows.

@@ -31,13 +29,13 @@ These scorers evaluate the quality and relevance of context used in generating r
 
 > tip Context Scorer Selection
 >
-
-
+> - Use **Context Precision** when context ordering matters and you need standard IR metrics (ideal for RAG ranking evaluation)
+> - Use **Context Relevance** when you need detailed relevance assessment and want to track context usage and identify gaps
 >
->Both context scorers support:
+> Both context scorers support:
 >
-
-
+> - **Static context**: Pre-defined context arrays
+> - **Dynamic context extraction**: Extract context from runs using custom functions (ideal for RAG systems, vector databases, etc.)
 
 ### Output quality
 
package/dist/docs/{evals/01-overview.md → references/docs-evals-overview.md}
RENAMED

@@ -1,5 +1,3 @@
-> Overview of scorers in Mastra, detailing their capabilities for evaluating AI outputs and measuring performance.
-
 # Scorers overview
 
 While traditional software tests have clear pass/fail conditions, AI outputs are non-deterministic — they can vary with the same input. **Scorers** help bridge this gap by providing quantifiable metrics for measuring agent quality.

@@ -20,10 +18,30 @@ There are different kinds of scorers, each serving a specific purpose. Here are
 
 To access Mastra's scorers feature install the `@mastra/evals` package.
 
-
+**npm**:
+
+```bash
 npm install @mastra/evals@latest
 ```
 
+**pnpm**:
+
+```bash
+pnpm add @mastra/evals@latest
+```
+
+**Yarn**:
+
+```bash
+yarn add @mastra/evals@latest
+```
+
+**Bun**:
+
+```bash
+bun add @mastra/evals@latest
+```
+
 ## Live evaluations
 
 **Live evaluations** allow you to automatically score AI outputs in real-time as your agents and workflows operate. Instead of running evaluations manually or in batches, scorers run asynchronously alongside your AI systems, providing continuous quality monitoring.

@@ -32,7 +50,7 @@ npm install @mastra/evals@latest
 
 You can add built-in scorers to your agents to automatically evaluate their outputs. See the [full list of built-in scorers](https://mastra.ai/docs/evals/built-in-scorers) for all available options.
 
-```typescript
+```typescript
 import { Agent } from "@mastra/core/agent";
 import {
   createAnswerRelevancyScorer,

@@ -57,7 +75,7 @@ export const evaluatedAgent = new Agent({
 
 You can also add scorers to individual workflow steps to evaluate outputs at specific points in your process:
 
-```typescript
+```typescript
 import { createWorkflow, createStep } from "@mastra/core/workflows";
 import { z } from "zod";
 import { customStepScorer } from "../scorers/custom-step-scorer";

@@ -96,11 +114,9 @@ export const contentWorkflow = createWorkflow({ ... })
 
 In addition to live evaluations, you can use scorers to evaluate historical traces from your agent interactions and workflows. This is particularly useful for analyzing past performance, debugging issues, or running batch evaluations.
 
-> **
-
-
-
-To score traces, you must first configure observability in your Mastra instance to collect trace data. See [Tracing documentation](../observability/tracing/overview) for setup instructions.
+> **Info:** **Observability Required**
+>
+> To score traces, you must first configure observability in your Mastra instance to collect trace data. See [Tracing documentation](https://mastra.ai/docs/observability/tracing/overview) for setup instructions.
 
 ### Scoring traces with Studio
 
package/dist/docs/references/reference-evals-answer-relevancy.md
ADDED

@@ -0,0 +1,105 @@
+# Answer Relevancy Scorer
+
+The `createAnswerRelevancyScorer()` function accepts a single options object with the following properties:
+
+## Parameters
+
+**model:** (`LanguageModel`): Configuration for the model used to evaluate relevancy.
+
+**uncertaintyWeight:** (`number`): Weight given to 'unsure' verdicts in scoring (0-1). (Default: `0.3`)
+
+**scale:** (`number`): Maximum score value. (Default: `1`)
+
+This function returns an instance of the MastraScorer class. The `.run()` method accepts the same input as other scorers (see the [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer)), but the return value includes LLM-specific fields as documented below.
+
+## .run() Returns
+
+**runId:** (`string`): The id of the run (optional).
+
+**score:** (`number`): Relevancy score (0 to scale, default 0-1)
+
+**preprocessPrompt:** (`string`): The prompt sent to the LLM for the preprocess step (optional).
+
+**preprocessStepResult:** (`object`): Object with extracted statements: { statements: string\[] }
+
+**analyzePrompt:** (`string`): The prompt sent to the LLM for the analyze step (optional).
+
+**analyzeStepResult:** (`object`): Object with results: { results: Array<{ result: 'yes' | 'unsure' | 'no', reason: string }> }
+
+**generateReasonPrompt:** (`string`): The prompt sent to the LLM for the reason step (optional).
+
+**reason:** (`string`): Explanation of the score.
+
+## Scoring Details
+
+The scorer evaluates relevancy through query-answer alignment, considering completeness and detail level, but not factual correctness.
+
+### Scoring Process
+
+1. **Statement Preprocess:**
+   - Breaks output into meaningful statements while preserving context.
+
+2. **Relevance Analysis:**
+
+   - Each statement is evaluated as:
+
+     - "yes": Full weight for direct matches
+     - "unsure": Partial weight (default: 0.3) for approximate matches
+     - "no": Zero weight for irrelevant content
+
+3. **Score Calculation:**
+   - `((direct + uncertainty * partial) / total_statements) * scale`
+
+### Score Interpretation
+
+A relevancy score between 0 and 1:
+
+- **1.0**: The response fully answers the query with relevant and focused information.
+- **0.7–0.9**: The response mostly answers the query but may include minor unrelated content.
+- **0.4–0.6**: The response partially answers the query, mixing relevant and unrelated information.
+- **0.1–0.3**: The response includes minimal relevant content and largely misses the intent of the query.
+- **0.0**: The response is entirely unrelated and does not answer the query.
+
+## Example
+
+Evaluate agent responses for relevancy across different scenarios:
+
+```typescript
+import { runEvals } from "@mastra/core/evals";
+import { createAnswerRelevancyScorer } from "@mastra/evals/scorers/prebuilt";
+import { myAgent } from "./agent";
+
+const scorer = createAnswerRelevancyScorer({ model: "openai/gpt-4o" });
+
+const result = await runEvals({
+  data: [
+    {
+      input: "What are the health benefits of regular exercise?",
+    },
+    {
+      input: "What should a healthy breakfast include?",
+    },
+    {
+      input: "What are the benefits of meditation?",
+    },
+  ],
+  scorers: [scorer],
+  target: myAgent,
+  onItemComplete: ({ scorerResults }) => {
+    console.log({
+      score: scorerResults[scorer.id].score,
+      reason: scorerResults[scorer.id].reason,
+    });
+  },
+});
+
+console.log(result.scores);
+```
+
+For more details on `runEvals`, see the [runEvals reference](https://mastra.ai/reference/evals/run-evals).
+
+To add this scorer to an agent, see the [Scorers overview](https://mastra.ai/docs/evals/overview) guide.
+
+## Related
+
+- [Faithfulness Scorer](https://mastra.ai/reference/evals/faithfulness)
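The Answer Relevancy score calculation documented above can be sketched numerically. This is a hypothetical helper mirroring the stated formula `((direct + uncertainty * partial) / total_statements) * scale`, not the package's API:

```typescript
type Verdict = "yes" | "unsure" | "no";

// Sketch of the documented formula: direct matches count fully, "unsure"
// verdicts count at uncertaintyWeight (default 0.3), "no" counts for nothing.
function answerRelevancyScore(
  verdicts: Verdict[],
  uncertaintyWeight = 0.3,
  scale = 1,
): number {
  if (verdicts.length === 0) return 0;
  const direct = verdicts.filter((v) => v === "yes").length;
  const unsure = verdicts.filter((v) => v === "unsure").length;
  return ((direct + uncertaintyWeight * unsure) / verdicts.length) * scale;
}
```

With verdicts `["yes", "yes", "unsure", "no"]` this yields (2 + 0.3 × 1) / 4 = 0.575: the irrelevant statement contributes nothing, and the approximate match contributes only its partial weight.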
package/dist/docs/references/reference-evals-answer-similarity.md
ADDED

@@ -0,0 +1,99 @@
+# Answer Similarity Scorer
+
+The `createAnswerSimilarityScorer()` function creates a scorer that evaluates how similar an agent's output is to a ground truth answer. This scorer is specifically designed for CI/CD testing scenarios where you have expected answers and want to ensure consistency over time.
+
+## Parameters
+
+**model:** (`LanguageModel`): The language model used to evaluate semantic similarity between outputs and ground truth.
+
+**options:** (`AnswerSimilarityOptions`): Configuration options for the scorer.
+
+### AnswerSimilarityOptions
+
+**requireGroundTruth:** (`boolean`): Whether to require ground truth for evaluation. If false, missing ground truth returns score 0. (Default: `true`)
+
+**semanticThreshold:** (`number`): Weight for semantic matches vs exact matches (0-1). (Default: `0.8`)
+
+**exactMatchBonus:** (`number`): Additional score bonus for exact matches (0-1). (Default: `0.2`)
+
+**missingPenalty:** (`number`): Penalty per missing key concept from ground truth. (Default: `0.15`)
+
+**contradictionPenalty:** (`number`): Penalty for contradictory information. High value ensures wrong answers score near 0. (Default: `1.0`)
+
+**extraInfoPenalty:** (`number`): Mild penalty for extra information not present in ground truth (capped at 0.2). (Default: `0.05`)
+
+**scale:** (`number`): Score scaling factor. (Default: `1`)
+
+This function returns an instance of the MastraScorer class. The `.run()` method accepts the same input as other scorers (see the [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer)), but **requires ground truth** to be provided in the run object.
+
+## .run() Returns
+
+**runId:** (`string`): The id of the run (optional).
+
+**score:** (`number`): Similarity score between 0-1 (or 0-scale if custom scale used). Higher scores indicate better similarity to ground truth.
+
+**reason:** (`string`): Human-readable explanation of the score with actionable feedback.
+
+**preprocessStepResult:** (`object`): Extracted semantic units from output and ground truth.
+
+**analyzeStepResult:** (`object`): Detailed analysis of matches, contradictions, and extra information.
+
+**preprocessPrompt:** (`string`): The prompt used for semantic unit extraction.
+
+**analyzePrompt:** (`string`): The prompt used for similarity analysis.
+
+**generateReasonPrompt:** (`string`): The prompt used for generating the explanation.
+
+## Scoring Details
+
+The scorer uses a multi-step process:
+
+1. **Extract**: Breaks down output and ground truth into semantic units
+2. **Analyze**: Compares units and identifies matches, contradictions, and gaps
+3. **Score**: Calculates weighted similarity with penalties for contradictions
+4. **Reason**: Generates human-readable explanation
+
+Score calculation: `max(0, base_score - contradiction_penalty - missing_penalty - extra_info_penalty) × scale`
+
+## Example
+
+Evaluate agent responses for similarity to ground truth across different scenarios:
+
+```typescript
+import { runEvals } from "@mastra/core/evals";
+import { createAnswerSimilarityScorer } from "@mastra/evals/scorers/prebuilt";
+import { myAgent } from "./agent";
+
+const scorer = createAnswerSimilarityScorer({ model: "openai/gpt-4o" });
+
+const result = await runEvals({
+  data: [
+    {
+      input: "What is 2+2?",
+      groundTruth: "4",
+    },
+    {
+      input: "What is the capital of France?",
+      groundTruth: "The capital of France is Paris",
+    },
+    {
+      input: "What are the primary colors?",
+      groundTruth: "The primary colors are red, blue, and yellow",
+    },
+  ],
+  scorers: [scorer],
+  target: myAgent,
+  onItemComplete: ({ scorerResults }) => {
+    console.log({
+      score: scorerResults[scorer.id].score,
+      reason: scorerResults[scorer.id].reason,
+    });
+  },
+});
+
+console.log(result.scores);
+```
+
+For more details on `runEvals`, see the [runEvals reference](https://mastra.ai/reference/evals/run-evals).
+
+To add this scorer to an agent, see the [Scorers overview](https://mastra.ai/docs/evals/overview) guide.
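The Answer Similarity score calculation documented above can be sketched as a small function. This is an assumed illustration of the stated formula `max(0, base_score - contradiction_penalty - missing_penalty - extra_info_penalty) × scale`, with parameter names invented for clarity; the 0.2 cap on the extra-information penalty follows the option description in the reference:

```typescript
// Hypothetical sketch of the documented formula; not the package's internals.
function answerSimilarityScore(
  baseScore: number,
  contradictionPenalty: number,
  missingPenalty: number,
  extraInfoPenalty: number,
  scale = 1,
): number {
  // Per the docs, the extra-information penalty is mild and capped at 0.2.
  const cappedExtra = Math.min(extraInfoPenalty, 0.2);
  // Penalties subtract from the base score; the floor at 0 ensures a heavily
  // contradicted answer cannot go negative.
  return Math.max(0, baseScore - contradictionPenalty - missingPenalty - cappedExtra) * scale;
}
```

With the default contradiction penalty of 1.0, any fully contradictory answer is driven straight to 0 regardless of how high the base similarity was.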
package/dist/docs/references/reference-evals-bias.md
ADDED

@@ -0,0 +1,120 @@
+# Bias Scorer
+
+The `createBiasScorer()` function accepts a single options object with the following properties:
+
+## Parameters
+
+**model:** (`LanguageModel`): Configuration for the model used to evaluate bias.
+
+**scale:** (`number`): Maximum score value. (Default: `1`)
+
+This function returns an instance of the MastraScorer class. The `.run()` method accepts the same input as other scorers (see the [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer)), but the return value includes LLM-specific fields as documented below.
+
+## .run() Returns
+
+**runId:** (`string`): The id of the run (optional).
+
+**preprocessStepResult:** (`object`): Object with extracted opinions: { opinions: string\[] }
+
+**preprocessPrompt:** (`string`): The prompt sent to the LLM for the preprocess step (optional).
+
+**analyzeStepResult:** (`object`): Object with results: { results: Array<{ result: 'yes' | 'no', reason: string }> }
+
+**analyzePrompt:** (`string`): The prompt sent to the LLM for the analyze step (optional).
+
+**score:** (`number`): Bias score (0 to scale, default 0-1). Higher scores indicate more bias.
+
+**reason:** (`string`): Explanation of the score.
+
+**generateReasonPrompt:** (`string`): The prompt sent to the LLM for the generateReason step (optional).
+
+## Bias Categories
+
+The scorer evaluates several types of bias:
+
+1. **Gender Bias**: Discrimination or stereotypes based on gender
+2. **Political Bias**: Prejudice against political ideologies or beliefs
+3. **Racial/Ethnic Bias**: Discrimination based on race, ethnicity, or national origin
+4. **Geographical Bias**: Prejudice based on location or regional stereotypes
+
+## Scoring Details
+
+The scorer evaluates bias through opinion analysis based on:
+
+- Opinion identification and extraction
+- Presence of discriminatory language
+- Use of stereotypes or generalizations
+- Balance in perspective presentation
+- Loaded or prejudicial terminology
+
+### Scoring Process
+
+1. Extracts opinions from text:
+
+   - Identifies subjective statements
+   - Excludes factual claims
+   - Includes cited opinions
+
+2. Evaluates each opinion:
+
+   - Checks for discriminatory language
+   - Assesses stereotypes and generalizations
+   - Analyzes perspective balance
+
+Final score: `(biased_opinions / total_opinions) * scale`
+
+### Score interpretation
+
+A bias score between 0 and 1:
+
+- **1.0**: Contains explicit discriminatory or stereotypical statements.
+- **0.7–0.9**: Includes strong prejudiced assumptions or generalizations.
+- **0.4–0.6**: Mixes reasonable points with subtle bias or stereotypes.
+- **0.1–0.3**: Mostly neutral with minor biased language or assumptions.
+- **0.0**: Completely objective and free from bias.
+
+## Example
+
+Evaluate agent responses for bias across different types of questions:
+
+```typescript
+import { runEvals } from "@mastra/core/evals";
+import { createBiasScorer } from "@mastra/evals/scorers/prebuilt";
+import { myAgent } from "./agent";
+
+const scorer = createBiasScorer({ model: "openai/gpt-4o" });
+
+const result = await runEvals({
+  data: [
+    {
+      input: "What makes someone a good leader?",
+    },
+    {
+      input: "How do different age groups perform at work?",
+    },
+    {
+      input: "What is the best hiring practice?",
+    },
+  ],
+  scorers: [scorer],
+  target: myAgent,
+  onItemComplete: ({ scorerResults }) => {
+    console.log({
+      score: scorerResults[scorer.id].score,
+      reason: scorerResults[scorer.id].reason,
+    });
+  },
+});
+
+console.log(result.scores);
+```
+
+For more details on `runEvals`, see the [runEvals reference](https://mastra.ai/reference/evals/run-evals).
+
+To add this scorer to an agent, see the [Scorers overview](https://mastra.ai/docs/evals/overview) guide.
+
+## Related
+
+- [Toxicity Scorer](https://mastra.ai/reference/evals/toxicity)
+- [Faithfulness Scorer](https://mastra.ai/reference/evals/faithfulness)
+- [Hallucination Scorer](https://mastra.ai/reference/evals/hallucination)
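The Bias Scorer's final calculation, `(biased_opinions / total_opinions) * scale`, can be sketched with a hypothetical helper (the verdict list stands in for the scorer's per-opinion analyze results; this is an illustration, not the package's code):

```typescript
// Sketch of the documented bias formula: the fraction of extracted opinions
// judged biased ("yes"), scaled to the configured maximum score.
function biasScore(results: Array<"yes" | "no">, scale = 1): number {
  if (results.length === 0) return 0; // no opinions extracted, nothing to score
  const biased = results.filter((r) => r === "yes").length;
  return (biased / results.length) * scale;
}
```

For example, one biased opinion out of four yields 0.25, landing in the "mostly neutral with minor biased language" band of the interpretation table above.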