@mastra/evals 1.2.0-alpha.1 → 1.2.1-alpha.0

package/CHANGELOG.md CHANGED
@@ -1,5 +1,88 @@
  # @mastra/evals
 
+ ## 1.2.1-alpha.0
+
+ ### Patch Changes
+
+ - Fix answer-similarity scorer to align prompt guidelines with allowed match types ([#15001](https://github.com/mastra-ai/mastra/pull/15001))
+
+   The answer-similarity scorer could throw a ZodError when the LLM returned
+   "contradiction" as a matchType, since only exact/semantic/partial/missing are
+   valid. The prompt now correctly directs contradictory information to the
+   existing contradictions array instead.
+
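The constraint behind this fix can be sketched without any library. The snippet below is a hypothetical illustration, not the scorer's actual code: `ALLOWED_MATCH_TYPES`, `isAllowedMatchType`, `AnalysisResult`, and the `groundTruthUnit` field are names invented here, and the real scorer validates with a Zod schema rather than a hand-written guard. It only shows why `"contradiction"` fails validation and what a conforming result might look like under the updated prompt.

```typescript
// Hypothetical sketch: the four match types the changelog lists as valid.
const ALLOWED_MATCH_TYPES = ['exact', 'semantic', 'partial', 'missing'] as const;
type MatchType = (typeof ALLOWED_MATCH_TYPES)[number];

// Type guard standing in for the schema check (the real scorer uses Zod).
function isAllowedMatchType(value: string): value is MatchType {
  return ALLOWED_MATCH_TYPES.some(allowed => allowed === value);
}

// Assumed result shape, for illustration only.
interface AnalysisResult {
  matches: { groundTruthUnit: string; matchType: MatchType }[];
  contradictions: string[];
}

// What the updated prompt asks for: the conflicting unit is marked "missing",
// and the conflict itself is recorded in the contradictions array.
const conforming: AnalysisResult = {
  matches: [{ groundTruthUnit: 'Paris is the capital of France', matchType: 'missing' }],
  contradictions: ['Output claims Lyon is the capital of France'],
};

console.log(isAllowedMatchType('contradiction')); // false
console.log(conforming.matches.every(m => isAllowedMatchType(m.matchType))); // true
```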
+ - Updated dependencies [[`ac7baf6`](https://github.com/mastra-ai/mastra/commit/ac7baf66ef1db15e03975ef4ebb02724f015a391), [`0df8321`](https://github.com/mastra-ai/mastra/commit/0df832196eeb2450ab77ce887e8553abdd44c5a6), [`61109b3`](https://github.com/mastra-ai/mastra/commit/61109b34feb0e38d54bee4b8ca83eb7345b1d557), [`33f1ead`](https://github.com/mastra-ai/mastra/commit/33f1eadfa19c86953f593478e5fa371093b33779)]:
+   - @mastra/core@1.23.0-alpha.8
+
+ ## 1.2.0
+
+ ### Minor Changes
+
+ - **Trajectory scorers**: Added scorers for evaluating agent and workflow execution paths. ([#14697](https://github.com/mastra-ai/mastra/pull/14697))
+ - `createTrajectoryScorerCode` — unified scorer that evaluates accuracy, efficiency, blacklist violations, and tool failure patterns in a single pass. Supports per-item expectations from datasets with static defaults. Nested `ExpectedStep.children` configs allow recursive evaluation with different rules per hierarchy level.
+ - `createTrajectoryAccuracyScorerCode` — deterministic accuracy scorer with strict, relaxed, and unordered ordering modes.
+ - `createTrajectoryAccuracyScorerLLM` — LLM-based scorer for semantic trajectory evaluation.
+
+ **Utility functions:**
+ - `extractTrajectory` / `extractWorkflowTrajectory` — Convert agent runs and workflow executions into structured trajectories
+ - `extractTrajectoryFromTrace` — Build hierarchical trajectories from observability trace spans, including nested agent/tool calls
+ - `compareTrajectories` — Compare actual vs. expected trajectories with configurable ordering and data matching. Accepts `ExpectedStep[]` for simpler expected step definitions
+ - `checkTrajectoryEfficiency` — Evaluate step counts, token usage, and duration against budgets
+ - `checkTrajectoryBlacklist` — Detect forbidden tools or tool sequences
+ - `analyzeToolFailures` — Detect retry patterns, fallbacks, and argument corrections
+
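The three ordering modes named for the accuracy scorer and `compareTrajectories` can be pictured with a name-only comparison. This is a minimal sketch under stated assumptions, not the library's implementation: `matchesExpected` is a hypothetical helper, and the semantics given to each mode (strict as an exact sequence, relaxed as an in-order subsequence with extras allowed, unordered as presence in any order) are assumed from the mode names rather than confirmed by this changelog.

```typescript
type OrderingMode = 'strict' | 'relaxed' | 'unordered';

// Hypothetical helper: compares step names only, with assumed semantics per mode.
function matchesExpected(actual: string[], expected: string[], mode: OrderingMode): boolean {
  if (mode === 'strict') {
    // Assumed: same steps, same order, no extras.
    return actual.length === expected.length && actual.every((name, i) => name === expected[i]);
  }
  if (mode === 'relaxed') {
    // Assumed: expected steps appear in order; extra steps in between are allowed.
    let next = 0;
    for (const name of actual) {
      if (next < expected.length && name === expected[next]) next++;
    }
    return next === expected.length;
  }
  // 'unordered' (assumed): every expected step appears somewhere, order ignored.
  const remaining = [...actual];
  return expected.every(name => {
    const idx = remaining.indexOf(name);
    if (idx === -1) return false;
    remaining.splice(idx, 1); // consume the match so duplicates are counted
    return true;
  });
}

const actualSteps = ['validate-input', 'search', 'log', 'summarize'];
console.log(matchesExpected(actualSteps, ['search', 'summarize'], 'strict'));    // false
console.log(matchesExpected(actualSteps, ['search', 'summarize'], 'relaxed'));   // true
console.log(matchesExpected(actualSteps, ['summarize', 'search'], 'relaxed'));   // false
console.log(matchesExpected(actualSteps, ['summarize', 'search'], 'unordered')); // true
```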
+ **Example — unified scorer with defaults:**
+
+ ```ts
+ import { createTrajectoryScorerCode } from '@mastra/evals/scorers';
+
+ const scorer = createTrajectoryScorerCode({
+   defaults: {
+     ordering: 'strict',
+     steps: [
+       { name: 'validate-input' },
+       {
+         name: 'research-agent',
+         stepType: 'agent_run',
+         children: {
+           ordering: 'unordered',
+           steps: [{ name: 'search' }, { name: 'summarize' }],
+         },
+       },
+       { name: 'save-result' },
+     ],
+     maxSteps: 10,
+     blacklistedTools: ['deleteAll'],
+   },
+ });
+ ```
+
+ ### Patch Changes
+
+ - **Configurable weights**: Add `weights` option to `createTrajectoryScorerCode` for controlling how dimension scores are combined. Defaults to `{ accuracy: 0.4, efficiency: 0.3, toolFailures: 0.2, blacklist: 0.1 }`. ([#14740](https://github.com/mastra-ai/mastra/pull/14740))
+
+ ```ts
+ const scorer = createTrajectoryScorerCode({
+   defaults: { steps: [{ name: 'search' }], maxSteps: 5 },
+   weights: { accuracy: 0.6, efficiency: 0.2, toolFailures: 0.1, blacklist: 0.1 },
+ });
+ ```
+
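To make the effect of `weights` concrete, here is a rough sketch of how four dimension scores could be folded into one number. It assumes each dimension score lies in the range 0 to 1 and that the combination is a normalized weighted average; the changelog does not state the exact formula, and `combineScores` is a hypothetical helper rather than part of the package.

```typescript
type Dimension = 'accuracy' | 'efficiency' | 'toolFailures' | 'blacklist';
type Weights = Record<Dimension, number>;

// Default weights as quoted in the changelog entry.
const DEFAULT_WEIGHTS: Weights = { accuracy: 0.4, efficiency: 0.3, toolFailures: 0.2, blacklist: 0.1 };

// Assumption: scores are combined as a weighted average, normalized by the weight total.
function combineScores(scores: Record<Dimension, number>, weights: Weights = DEFAULT_WEIGHTS): number {
  const dims = Object.keys(weights) as Dimension[];
  const totalWeight = dims.reduce((sum, d) => sum + weights[d], 0);
  if (totalWeight === 0) return 0; // avoid dividing by zero when all weights are 0
  const weighted = dims.reduce((sum, d) => sum + weights[d] * scores[d], 0);
  return weighted / totalWeight;
}

// Illustrative scores: perfect everywhere except efficiency.
const dimensionScores = { accuracy: 1, efficiency: 0.5, toolFailures: 1, blacklist: 1 };
const customWeights: Weights = { accuracy: 0.6, efficiency: 0.2, toolFailures: 0.1, blacklist: 0.1 };

console.log(combineScores(dimensionScores).toFixed(2));                // "0.85"
console.log(combineScores(dimensionScores, customWeights).toFixed(2)); // "0.90"
```

Under this assumption, shifting weight from efficiency to accuracy raises the combined score for a run that was accurate but slow.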
+ **ExpectedStep redesign**: `ExpectedStep` is now a discriminated union mirroring `TrajectoryStep`. When you specify a `stepType`, you get autocomplete for that variant's fields (e.g., `toolArgs` for `tool_call`, `modelId` for `model_generation`). The old `data: Record<string, unknown>` field is replaced by direct variant fields.
+
+ ```ts
+ // Before: { name: 'search', stepType: 'tool_call', data: { input: { query: 'weather' } } }
+ // After:
+ { name: 'search', stepType: 'tool_call', toolArgs: { query: 'weather' } }
+ ```
+
+ **Remove `compareStepData`**: The `compareStepData` option is removed from `compareTrajectories`, `TrajectoryExpectation`, and all scorers. Data fields are now auto-compared when present on expected steps — if you specify `toolArgs` on an `ExpectedStep`, it will be compared against the actual step. If you omit it, only name and stepType are matched.
+
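The auto-compare rule can be sketched with a trimmed-down discriminated union. Everything here is hypothetical: `ExpectedStepSketch`, `ActualStepSketch`, and `stepMatches` are invented for illustration, only two step types are modeled, and the equality check is a simple `JSON.stringify` comparison (key-order sensitive), which may differ from how the library compares data fields.

```typescript
// Hypothetical, trimmed-down variants; the real types cover more step types and fields.
type ExpectedStepSketch =
  | { name: string; stepType?: undefined }
  | { name: string; stepType: 'tool_call'; toolArgs?: Record<string, unknown> }
  | { name: string; stepType: 'model_generation'; modelId?: string };

type ActualStepSketch =
  | { name: string; stepType: 'tool_call'; toolArgs: Record<string, unknown> }
  | { name: string; stepType: 'model_generation'; modelId: string };

function stepMatches(actual: ActualStepSketch, expected: ExpectedStepSketch): boolean {
  if (actual.name !== expected.name) return false;
  if (expected.stepType === undefined) return true; // name-only expectation
  if (actual.stepType !== expected.stepType) return false;
  if (expected.stepType === 'tool_call' && actual.stepType === 'tool_call') {
    // Data is compared only when the expectation specifies it.
    return expected.toolArgs === undefined ||
      JSON.stringify(actual.toolArgs) === JSON.stringify(expected.toolArgs);
  }
  if (expected.stepType === 'model_generation' && actual.stepType === 'model_generation') {
    return expected.modelId === undefined || actual.modelId === expected.modelId;
  }
  return true;
}

const actualStep: ActualStepSketch = { name: 'search', stepType: 'tool_call', toolArgs: { query: 'weather' } };

console.log(stepMatches(actualStep, { name: 'search', stepType: 'tool_call' }));                                 // true
console.log(stepMatches(actualStep, { name: 'search', stepType: 'tool_call', toolArgs: { query: 'weather' } })); // true
console.log(stepMatches(actualStep, { name: 'search', stepType: 'tool_call', toolArgs: { query: 'news' } }));    // false
```

Because `stepType` is the discriminant, TypeScript narrows each branch so that `toolArgs` is only reachable on `tool_call` steps and `modelId` only on `model_generation` steps.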
+ Also fixes documentation inaccuracies in `trajectory-accuracy.mdx` and `scorer-utils.mdx`.
+
+ - Updated dependencies [[`dc514a8`](https://github.com/mastra-ai/mastra/commit/dc514a83dba5f719172dddfd2c7b858e4943d067), [`e333b77`](https://github.com/mastra-ai/mastra/commit/e333b77e2d76ba57ccec1818e08cebc1993469ff), [`dc9fc19`](https://github.com/mastra-ai/mastra/commit/dc9fc19da4437f6b508cc355f346a8856746a76b), [`60a224d`](https://github.com/mastra-ai/mastra/commit/60a224dd497240e83698cfa5bfd02e3d1d854844), [`fbf22a7`](https://github.com/mastra-ai/mastra/commit/fbf22a7ad86bcb50dcf30459f0d075e51ddeb468), [`f16d92c`](https://github.com/mastra-ai/mastra/commit/f16d92c677a119a135cebcf7e2b9f51ada7a9df4), [`949b7bf`](https://github.com/mastra-ai/mastra/commit/949b7bfd4e40f2b2cba7fef5eb3f108a02cfe938), [`404fea1`](https://github.com/mastra-ai/mastra/commit/404fea13042181f0b0c73a101392ac87c79ceae2), [`ebf5047`](https://github.com/mastra-ai/mastra/commit/ebf5047e825c38a1a356f10b214c1d4260dfcd8d), [`12c647c`](https://github.com/mastra-ai/mastra/commit/12c647cf3a26826eb72d40b42e3c8356ceae16ed), [`d084b66`](https://github.com/mastra-ai/mastra/commit/d084b6692396057e83c086b954c1857d20b58a14), [`79c699a`](https://github.com/mastra-ai/mastra/commit/79c699acf3cd8a77e11c55530431f48eb48456e9), [`62757b6`](https://github.com/mastra-ai/mastra/commit/62757b6db6e8bb86569d23ad0b514178f57053f8), [`675f15b`](https://github.com/mastra-ai/mastra/commit/675f15b7eaeea649158d228ea635be40480c584d), [`b174c63`](https://github.com/mastra-ai/mastra/commit/b174c63a093108d4e53b9bc89a078d9f66202b3f), [`819f03c`](https://github.com/mastra-ai/mastra/commit/819f03c25823373b32476413bd76be28a5d8705a), [`04160ee`](https://github.com/mastra-ai/mastra/commit/04160eedf3130003cf842ad08428c8ff69af4cc1), [`2c27503`](https://github.com/mastra-ai/mastra/commit/2c275032510d131d2cde47f99953abf0fe02c081), [`424a1df`](https://github.com/mastra-ai/mastra/commit/424a1df7bee59abb5c83717a54807fdd674a6224), [`3d70b0b`](https://github.com/mastra-ai/mastra/commit/3d70b0b3524d817173ad870768f259c06d61bd23), 
[`eef7cb2`](https://github.com/mastra-ai/mastra/commit/eef7cb2abe7ef15951e2fdf792a5095c6c643333), [`260fe12`](https://github.com/mastra-ai/mastra/commit/260fe1295fe7354e39d6def2775e0797a7a277f0), [`12c88a6`](https://github.com/mastra-ai/mastra/commit/12c88a6e32bf982c2fe0c6af62e65a3414519a75), [`43595bf`](https://github.com/mastra-ai/mastra/commit/43595bf7b8df1a6edce7a23b445b5124d2a0b473), [`78670e9`](https://github.com/mastra-ai/mastra/commit/78670e97e76d7422cf7025faf371b2aeafed860d), [`e8a5b0b`](https://github.com/mastra-ai/mastra/commit/e8a5b0b9bc94d12dee4150095512ca27a288d778), [`3b45a13`](https://github.com/mastra-ai/mastra/commit/3b45a138d09d040779c0aba1edbbfc1b57442d23), [`d400e7c`](https://github.com/mastra-ai/mastra/commit/d400e7c8b8d7afa6ba2c71769eace4048e3cef8e), [`f58d1a7`](https://github.com/mastra-ai/mastra/commit/f58d1a7a457588a996c3ecb53201a68f3d28c432), [`a49a929`](https://github.com/mastra-ai/mastra/commit/a49a92904968b4fc67e01effee8c7c8d0464ba85), [`8127d96`](https://github.com/mastra-ai/mastra/commit/8127d96280492e335d49b244501088dfdd59a8f1)]:
+   - @mastra/core@1.18.0
+
  ## 1.2.0-alpha.1
 
  ### Patch Changes
@@ -3,7 +3,7 @@ name: mastra-evals
  description: Documentation for @mastra/evals. Use when working with @mastra/evals APIs, configuration, or implementation.
  metadata:
    package: "@mastra/evals"
-   version: "1.2.0-alpha.1"
+   version: "1.2.1-alpha.0"
  ---
 
  ## When to use
@@ -1,5 +1,5 @@
  {
-   "version": "1.2.0-alpha.1",
+   "version": "1.2.1-alpha.0",
    "package": "@mastra/evals",
    "exports": {},
    "modules": {}
@@ -28,7 +28,7 @@ These scorers evaluate the quality and relevance of context used in generating r
  - [`context-precision`](https://mastra.ai/reference/evals/context-precision): Evaluates context relevance and ranking using Mean Average Precision, rewarding early placement of relevant context (`0-1`, higher is better)
  - [`context-relevance`](https://mastra.ai/reference/evals/context-relevance): Measures context utility with nuanced relevance levels, usage tracking, and missing context detection (`0-1`, higher is better)
 
- > tip Context Scorer Selection
+ > **Context Scorer Selection:**
  >
  > - Use **Context Precision** when context ordering matters and you need standard IR metrics (ideal for RAG ranking evaluation)
  > - Use **Context Relevance** when you need detailed relevance assessment and want to track context usage and identify gaps
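For readers unfamiliar with the metric behind Context Precision, the snippet below computes average precision for a single ranked list using the standard information-retrieval definition. It is a sketch only: `averagePrecision` is a hypothetical helper, the relevance flags are supplied by hand here, and how the scorer itself judges relevance or normalizes the result is not shown in this diff.

```typescript
// Average precision for one ranked list.
// relevance[i] is true when the context chunk at rank i + 1 is relevant.
function averagePrecision(relevance: boolean[]): number {
  let hits = 0;
  let precisionSum = 0;
  relevance.forEach((isRelevant, i) => {
    if (isRelevant) {
      hits++;
      precisionSum += hits / (i + 1); // precision at this rank
    }
  });
  return hits === 0 ? 0 : precisionSum / hits;
}

// Relevant chunks ranked first score higher than the same chunks ranked last.
console.log(averagePrecision([true, true, false]));            // 1
console.log(averagePrecision([false, true, true]).toFixed(3)); // "0.583"
```

Averaging this value over several queries gives Mean Average Precision, which is why early placement of relevant context is rewarded.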
@@ -111,9 +111,9 @@ export const contentWorkflow = createWorkflow({ ... })
 
  In addition to live evaluations, you can use scorers to evaluate historical traces from your agent interactions and workflows. This is particularly useful for analyzing past performance, debugging issues, or running batch evaluations.
 
- > **Observability Required:** To score traces, you must first configure observability in your Mastra instance to collect trace data. See [Tracing documentation](https://mastra.ai/docs/observability/tracing/overview) for setup instructions.
+ > **Observability required:** To score traces, you must first configure observability in your Mastra instance to collect trace data. See [Tracing documentation](https://mastra.ai/docs/observability/tracing/overview) for setup instructions.
 
- ### Scoring traces with Studio
+ ## Studio
 
  To score traces, you first need to register your scorers with your Mastra instance:
 
@@ -126,16 +126,15 @@ const mastra = new Mastra({
  })
  ```
 
- Once registered, you can score traces interactively within Studio under the Observability section. This provides a user-friendly interface for running scorers against historical traces.
+ Once registered, you can score traces interactively within Studio under the **Observability** section. Open Studio to manage scorers, review scores, and run experiments.
 
- ## Testing scorers locally
-
- Mastra provides a CLI command `mastra dev` to test your scorers. Studio includes a scorers section where you can run individual scorers against test inputs and view detailed results.
-
- For more details, see [Studio](https://mastra.ai/docs/getting-started/studio) docs.
+ - **Scorers list**: Browse all registered scorers with their description, and the number of agents and workflows each scorer is attached to.
+ - **Score results**: Select a scorer to see a paginated list of every score it has produced. Click a row to open the detail panel, which shows the score value, reason, input, output, and the prompts used by the judge. From this panel, save any result as a dataset item for future experiments.
+ - **Agent Evaluate tab**: Open the Evaluate tab on any agent to attach or detach scorers, create or edit stored scorers inline, manage datasets, and run experiments. Experiment results display per-item scores alongside pass/fail status and version tags.
+ - **Trace scoring**: In the Observability section, run a scorer against any historical trace or span to evaluate past interactions. Filter scores by agent or workflow.
 
  ## Next steps
 
  - Learn how to create your own scorers in the [Creating Custom Scorers](https://mastra.ai/docs/evals/custom-scorers) guide
  - Explore built-in scorers in the [Built-in Scorers](https://mastra.ai/docs/evals/built-in-scorers) section
- - Test scorers with [Studio](https://mastra.ai/docs/getting-started/studio)
+ - Test scorers with [Studio](https://mastra.ai/docs/studio/overview)
@@ -333,7 +333,7 @@ Matching Guidelines:
  - "semantic": The same concept or fact expressed differently but with equivalent meaning
  - "partial": Some overlap but missing important details or context
  - "missing": No corresponding information found in the output
- - "contradiction": Information that directly conflicts with the ground truth (wrong facts, incorrect names, false claims)
+ - For factually incorrect information (wrong facts, incorrect names), mark the match as "missing" and add it to the "contradictions" array
 
  CRITICAL: If the output contains factually incorrect information (wrong names, wrong facts, opposite claims), you MUST identify contradictions and mark relevant matches as "missing" while adding entries to the contradictions array.