@mastra/evals 1.2.0-alpha.1 → 1.2.1-alpha.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +83 -0
- package/dist/docs/SKILL.md +1 -1
- package/dist/docs/assets/SOURCE_MAP.json +1 -1
- package/dist/docs/references/docs-evals-built-in-scorers.md +1 -1
- package/dist/docs/references/docs-evals-overview.md +8 -9
- package/dist/scorers/prebuilt/index.cjs +1 -1
- package/dist/scorers/prebuilt/index.cjs.map +1 -1
- package/dist/scorers/prebuilt/index.js +1 -1
- package/dist/scorers/prebuilt/index.js.map +1 -1
- package/package.json +7 -7
package/CHANGELOG.md
CHANGED
@@ -1,5 +1,88 @@

# @mastra/evals

## 1.2.1-alpha.0

### Patch Changes

- Fix answer-similarity scorer to align prompt guidelines with allowed match types ([#15001](https://github.com/mastra-ai/mastra/pull/15001))

  The answer-similarity scorer could throw a ZodError when the LLM returned "contradiction" as a matchType, since only exact/semantic/partial/missing are valid. The prompt now correctly directs contradictory information to the existing contradictions array instead.
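A minimal sketch of the match-type validation behind the ZodError mentioned above (names are illustrative; the real scorer validates the judge's output with a Zod schema whose exact shape is not shown here):

```ts
// The output schema only admits four match types, so a judge response of
// "contradiction" fails validation. Illustrative stand-in, not the real schema.
const ALLOWED_MATCH_TYPES = ['exact', 'semantic', 'partial', 'missing'] as const;
type MatchType = (typeof ALLOWED_MATCH_TYPES)[number];

function parseMatchType(value: string): MatchType {
  if (!(ALLOWED_MATCH_TYPES as readonly string[]).includes(value)) {
    // Before the prompt fix, the LLM could answer "contradiction" and land here
    throw new Error(`Invalid matchType: ${value}`);
  }
  return value as MatchType;
}
```

After the fix, contradictory facts are reported in the separate `contradictions` array rather than as a fifth match type.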
- Updated dependencies [[`ac7baf6`](https://github.com/mastra-ai/mastra/commit/ac7baf66ef1db15e03975ef4ebb02724f015a391), [`0df8321`](https://github.com/mastra-ai/mastra/commit/0df832196eeb2450ab77ce887e8553abdd44c5a6), [`61109b3`](https://github.com/mastra-ai/mastra/commit/61109b34feb0e38d54bee4b8ca83eb7345b1d557), [`33f1ead`](https://github.com/mastra-ai/mastra/commit/33f1eadfa19c86953f593478e5fa371093b33779)]:
  - @mastra/core@1.23.0-alpha.8

## 1.2.0

### Minor Changes

- **Trajectory scorers**: Added scorers for evaluating agent and workflow execution paths. ([#14697](https://github.com/mastra-ai/mastra/pull/14697))
  - `createTrajectoryScorerCode` — unified scorer that evaluates accuracy, efficiency, blacklist violations, and tool failure patterns in a single pass. Supports per-item expectations from datasets with static defaults. Nested `ExpectedStep.children` configs allow recursive evaluation with different rules per hierarchy level.
  - `createTrajectoryAccuracyScorerCode` — deterministic accuracy scorer with strict, relaxed, and unordered ordering modes.
  - `createTrajectoryAccuracyScorerLLM` — LLM-based scorer for semantic trajectory evaluation.

  **Utility functions:**

  - `extractTrajectory` / `extractWorkflowTrajectory` — Convert agent runs and workflow executions into structured trajectories
  - `extractTrajectoryFromTrace` — Build hierarchical trajectories from observability trace spans, including nested agent/tool calls
  - `compareTrajectories` — Compare actual vs. expected trajectories with configurable ordering and data matching. Accepts `ExpectedStep[]` for simpler expected step definitions
  - `checkTrajectoryEfficiency` — Evaluate step counts, token usage, and duration against budgets
  - `checkTrajectoryBlacklist` — Detect forbidden tools or tool sequences
  - `analyzeToolFailures` — Detect retry patterns, fallbacks, and argument corrections
  **Example — unified scorer with defaults:**

  ```ts
  import { createTrajectoryScorerCode } from '@mastra/evals/scorers';

  const scorer = createTrajectoryScorerCode({
    defaults: {
      ordering: 'strict',
      steps: [
        { name: 'validate-input' },
        {
          name: 'research-agent',
          stepType: 'agent_run',
          children: {
            ordering: 'unordered',
            steps: [{ name: 'search' }, { name: 'summarize' }],
          },
        },
        { name: 'save-result' },
      ],
      maxSteps: 10,
      blacklistedTools: ['deleteAll'],
    },
  });
  ```
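The `blacklistedTools` option in the example above feeds the blacklist dimension; the basic check can be sketched like this (a simplified stand-in for `checkTrajectoryBlacklist`, not the library's implementation):

```ts
// Simplified blacklist check over a trajectory's tool-call steps.
// The step shape here is illustrative; the real utility consumes structured
// trajectories produced by the extraction helpers.
interface StepSketch {
  name: string;
  stepType: 'tool_call' | 'agent_run' | 'model_generation';
}

function findBlacklistViolations(steps: StepSketch[], blacklistedTools: string[]): string[] {
  const forbidden = new Set(blacklistedTools);
  // Only tool calls can violate the blacklist; agent and model steps are ignored
  return steps
    .filter(step => step.stepType === 'tool_call' && forbidden.has(step.name))
    .map(step => step.name);
}
```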
### Patch Changes

- **Configurable weights**: Add `weights` option to `createTrajectoryScorerCode` for controlling how dimension scores are combined. Defaults to `{ accuracy: 0.4, efficiency: 0.3, toolFailures: 0.2, blacklist: 0.1 }`. ([#14740](https://github.com/mastra-ai/mastra/pull/14740))

  ```ts
  const scorer = createTrajectoryScorerCode({
    defaults: { steps: [{ name: 'search' }], maxSteps: 5 },
    weights: { accuracy: 0.6, efficiency: 0.2, toolFailures: 0.1, blacklist: 0.1 },
  });
  ```
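Under linear weighting, the final score is just the weight-scaled sum of the four dimension scores; a sketch under that assumption (the function and type names are illustrative, not part of the package API):

```ts
// Illustrative linear combination of per-dimension scores (each in [0, 1]).
// The weight keys mirror the documented defaults.
type DimensionScores = {
  accuracy: number;
  efficiency: number;
  toolFailures: number;
  blacklist: number;
};

const DEFAULT_WEIGHTS: DimensionScores = {
  accuracy: 0.4,
  efficiency: 0.3,
  toolFailures: 0.2,
  blacklist: 0.1,
};

function combineScores(scores: DimensionScores, weights: DimensionScores = DEFAULT_WEIGHTS): number {
  return (
    scores.accuracy * weights.accuracy +
    scores.efficiency * weights.efficiency +
    scores.toolFailures * weights.toolFailures +
    scores.blacklist * weights.blacklist
  );
}
```

With the defaults, perfect scores on every dimension combine to 1, and shifting weight toward `accuracy` (as in the example above) makes accuracy failures dominate the final score.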
  **ExpectedStep redesign**: `ExpectedStep` is now a discriminated union mirroring `TrajectoryStep`. When you specify a `stepType`, you get autocomplete for that variant's fields (e.g., `toolArgs` for `tool_call`, `modelId` for `model_generation`). The old `data: Record<string, unknown>` field is replaced by direct variant fields.

  ```ts
  // Before: { name: 'search', stepType: 'tool_call', data: { input: { query: 'weather' } } }
  // After:
  { name: 'search', stepType: 'tool_call', toolArgs: { query: 'weather' } }
  ```

  **Remove `compareStepData`**: The `compareStepData` option is removed from `compareTrajectories`, `TrajectoryExpectation`, and all scorers. Data fields are now auto-compared when present on expected steps — if you specify `toolArgs` on an `ExpectedStep`, it will be compared against the actual step. If you omit it, only name and stepType are matched.
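The auto-compare rule above ("compare only what the expected step specifies") can be sketched as a matcher; this is a simplified illustration, not the matching logic `compareTrajectories` actually uses:

```ts
// Name always matters; stepType and toolArgs are checked only when the
// expected step specifies them. Shapes are simplified for illustration.
interface ActualStepSketch {
  name: string;
  stepType: string;
  toolArgs?: Record<string, unknown>;
}

interface ExpectedStepSketch {
  name: string;
  stepType?: string;
  toolArgs?: Record<string, unknown>;
}

function stepMatches(actual: ActualStepSketch, expected: ExpectedStepSketch): boolean {
  if (actual.name !== expected.name) return false;
  if (expected.stepType !== undefined && actual.stepType !== expected.stepType) return false;
  if (expected.toolArgs !== undefined) {
    // JSON comparison stands in for the library's structural data matching
    return JSON.stringify(actual.toolArgs) === JSON.stringify(expected.toolArgs);
  }
  return true;
}
```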
  Also fixes documentation inaccuracies in `trajectory-accuracy.mdx` and `scorer-utils.mdx`.

- Updated dependencies [[`dc514a8`](https://github.com/mastra-ai/mastra/commit/dc514a83dba5f719172dddfd2c7b858e4943d067), [`e333b77`](https://github.com/mastra-ai/mastra/commit/e333b77e2d76ba57ccec1818e08cebc1993469ff), [`dc9fc19`](https://github.com/mastra-ai/mastra/commit/dc9fc19da4437f6b508cc355f346a8856746a76b), [`60a224d`](https://github.com/mastra-ai/mastra/commit/60a224dd497240e83698cfa5bfd02e3d1d854844), [`fbf22a7`](https://github.com/mastra-ai/mastra/commit/fbf22a7ad86bcb50dcf30459f0d075e51ddeb468), [`f16d92c`](https://github.com/mastra-ai/mastra/commit/f16d92c677a119a135cebcf7e2b9f51ada7a9df4), [`949b7bf`](https://github.com/mastra-ai/mastra/commit/949b7bfd4e40f2b2cba7fef5eb3f108a02cfe938), [`404fea1`](https://github.com/mastra-ai/mastra/commit/404fea13042181f0b0c73a101392ac87c79ceae2), [`ebf5047`](https://github.com/mastra-ai/mastra/commit/ebf5047e825c38a1a356f10b214c1d4260dfcd8d), [`12c647c`](https://github.com/mastra-ai/mastra/commit/12c647cf3a26826eb72d40b42e3c8356ceae16ed), [`d084b66`](https://github.com/mastra-ai/mastra/commit/d084b6692396057e83c086b954c1857d20b58a14), [`79c699a`](https://github.com/mastra-ai/mastra/commit/79c699acf3cd8a77e11c55530431f48eb48456e9), [`62757b6`](https://github.com/mastra-ai/mastra/commit/62757b6db6e8bb86569d23ad0b514178f57053f8), [`675f15b`](https://github.com/mastra-ai/mastra/commit/675f15b7eaeea649158d228ea635be40480c584d), [`b174c63`](https://github.com/mastra-ai/mastra/commit/b174c63a093108d4e53b9bc89a078d9f66202b3f), [`819f03c`](https://github.com/mastra-ai/mastra/commit/819f03c25823373b32476413bd76be28a5d8705a), [`04160ee`](https://github.com/mastra-ai/mastra/commit/04160eedf3130003cf842ad08428c8ff69af4cc1), [`2c27503`](https://github.com/mastra-ai/mastra/commit/2c275032510d131d2cde47f99953abf0fe02c081), [`424a1df`](https://github.com/mastra-ai/mastra/commit/424a1df7bee59abb5c83717a54807fdd674a6224), [`3d70b0b`](https://github.com/mastra-ai/mastra/commit/3d70b0b3524d817173ad870768f259c06d61bd23), [`eef7cb2`](https://github.com/mastra-ai/mastra/commit/eef7cb2abe7ef15951e2fdf792a5095c6c643333), [`260fe12`](https://github.com/mastra-ai/mastra/commit/260fe1295fe7354e39d6def2775e0797a7a277f0), [`12c88a6`](https://github.com/mastra-ai/mastra/commit/12c88a6e32bf982c2fe0c6af62e65a3414519a75), [`43595bf`](https://github.com/mastra-ai/mastra/commit/43595bf7b8df1a6edce7a23b445b5124d2a0b473), [`78670e9`](https://github.com/mastra-ai/mastra/commit/78670e97e76d7422cf7025faf371b2aeafed860d), [`e8a5b0b`](https://github.com/mastra-ai/mastra/commit/e8a5b0b9bc94d12dee4150095512ca27a288d778), [`3b45a13`](https://github.com/mastra-ai/mastra/commit/3b45a138d09d040779c0aba1edbbfc1b57442d23), [`d400e7c`](https://github.com/mastra-ai/mastra/commit/d400e7c8b8d7afa6ba2c71769eace4048e3cef8e), [`f58d1a7`](https://github.com/mastra-ai/mastra/commit/f58d1a7a457588a996c3ecb53201a68f3d28c432), [`a49a929`](https://github.com/mastra-ai/mastra/commit/a49a92904968b4fc67e01effee8c7c8d0464ba85), [`8127d96`](https://github.com/mastra-ai/mastra/commit/8127d96280492e335d49b244501088dfdd59a8f1)]:
  - @mastra/core@1.18.0

## 1.2.0-alpha.1

### Patch Changes
package/dist/docs/SKILL.md
CHANGED
@@ -28,7 +28,7 @@ These scorers evaluate the quality and relevance of context used in generating r
  - [`context-precision`](https://mastra.ai/reference/evals/context-precision): Evaluates context relevance and ranking using Mean Average Precision, rewarding early placement of relevant context (`0-1`, higher is better)
  - [`context-relevance`](https://mastra.ai/reference/evals/context-relevance): Measures context utility with nuanced relevance levels, usage tracking, and missing context detection (`0-1`, higher is better)

- >
+ > **Context Scorer Selection:**
  >
  > - Use **Context Precision** when context ordering matters and you need standard IR metrics (ideal for RAG ranking evaluation)
  > - Use **Context Relevance** when you need detailed relevance assessment and want to track context usage and identify gaps
@@ -111,9 +111,9 @@ export const contentWorkflow = createWorkflow({ ... })

  In addition to live evaluations, you can use scorers to evaluate historical traces from your agent interactions and workflows. This is particularly useful for analyzing past performance, debugging issues, or running batch evaluations.

- > **Observability
+ > **Observability required:** To score traces, you must first configure observability in your Mastra instance to collect trace data. See [Tracing documentation](https://mastra.ai/docs/observability/tracing/overview) for setup instructions.

-
+ ## Studio

  To score traces, you first need to register your scorers with your Mastra instance:
@@ -126,16 +126,15 @@ const mastra = new Mastra({
  })
  ```

- Once registered, you can score traces interactively within Studio under the Observability section.
+ Once registered, you can score traces interactively within Studio under the **Observability** section. Open Studio to manage scorers, review scores, and run experiments.

-
-
-
-
- For more details, see [Studio](https://mastra.ai/docs/getting-started/studio) docs.
+ - **Scorers list**: Browse all registered scorers with their description, and the number of agents and workflows each scorer is attached to.
+ - **Score results**: Select a scorer to see a paginated list of every score it has produced. Click a row to open the detail panel, which shows the score value, reason, input, output, and the prompts used by the judge. From this panel, save any result as a dataset item for future experiments.
+ - **Agent Evaluate tab**: Open the Evaluate tab on any agent to attach or detach scorers, create or edit stored scorers inline, manage datasets, and run experiments. Experiment results display per-item scores alongside pass/fail status and version tags.
+ - **Trace scoring**: In the Observability section, run a scorer against any historical trace or span to evaluate past interactions. Filter scores by agent or workflow.

  ## Next steps

  - Learn how to create your own scorers in the [Creating Custom Scorers](https://mastra.ai/docs/evals/custom-scorers) guide
  - Explore built-in scorers in the [Built-in Scorers](https://mastra.ai/docs/evals/built-in-scorers) section
- - Test scorers with [Studio](https://mastra.ai/docs/
+ - Test scorers with [Studio](https://mastra.ai/docs/studio/overview)
@@ -333,7 +333,7 @@ Matching Guidelines:
|
|
|
333
333
|
- "semantic": The same concept or fact expressed differently but with equivalent meaning
|
|
334
334
|
- "partial": Some overlap but missing important details or context
|
|
335
335
|
- "missing": No corresponding information found in the output
|
|
336
|
-
-
|
|
336
|
+
- For factually incorrect information (wrong facts, incorrect names), mark the match as "missing" and add it to the "contradictions" array
|
|
337
337
|
|
|
338
338
|
CRITICAL: If the output contains factually incorrect information (wrong names, wrong facts, opposite claims), you MUST identify contradictions and mark relevant matches as "missing" while adding entries to the contradictions array.
|
|
339
339
|
|