@mastra/mcp-docs-server 1.1.17-alpha.1 → 1.1.17-alpha.13

Files changed (48)
  1. package/.docs/docs/evals/built-in-scorers.md +1 -0
  2. package/.docs/docs/memory/observational-memory.md +56 -9
  3. package/.docs/docs/observability/tracing/bridges/otel.md +3 -3
  4. package/.docs/docs/observability/tracing/exporters/sentry.md +1 -1
  5. package/.docs/docs/server/auth/okta.md +225 -0
  6. package/.docs/docs/server/auth.md +1 -0
  7. package/.docs/docs/server/mastra-client.md +17 -0
  8. package/.docs/docs/workspace/lsp.md +116 -0
  9. package/.docs/docs/workspace/overview.md +15 -1
  10. package/.docs/guides/agent-frameworks/ai-sdk.md +3 -3
  11. package/.docs/models/gateways/openrouter.md +2 -1
  12. package/.docs/models/index.md +1 -1
  13. package/.docs/models/providers/groq.md +24 -16
  14. package/.docs/models/providers/llmgateway.md +269 -0
  15. package/.docs/models/providers/poe.md +3 -1
  16. package/.docs/models/providers/zai-coding-plan.md +3 -2
  17. package/.docs/models/providers/zai.md +14 -13
  18. package/.docs/models/providers/zhipuai-coding-plan.md +5 -2
  19. package/.docs/models/providers/zhipuai.md +13 -12
  20. package/.docs/models/providers.md +1 -0
  21. package/.docs/reference/ai-sdk/handle-chat-stream.md +2 -0
  22. package/.docs/reference/ai-sdk/with-mastra.md +2 -2
  23. package/.docs/reference/auth/okta.md +162 -0
  24. package/.docs/reference/client-js/agents.md +13 -8
  25. package/.docs/reference/client-js/mastra-client.md +1 -1
  26. package/.docs/reference/client-js/memory.md +1 -1
  27. package/.docs/reference/deployer/cloudflare.md +31 -1
  28. package/.docs/reference/evals/noise-sensitivity.md +3 -3
  29. package/.docs/reference/evals/run-evals.md +78 -3
  30. package/.docs/reference/evals/scorer-utils.md +188 -0
  31. package/.docs/reference/evals/trajectory-accuracy.md +627 -0
  32. package/.docs/reference/harness/harness-class.md +2 -0
  33. package/.docs/reference/index.md +3 -2
  34. package/.docs/reference/logging/pino-logger.md +58 -0
  35. package/.docs/reference/memory/observational-memory.md +34 -8
  36. package/.docs/reference/observability/tracing/interfaces.md +1 -1
  37. package/.docs/reference/processors/message-history-processor.md +1 -1
  38. package/.docs/reference/processors/processor-interface.md +3 -3
  39. package/.docs/reference/processors/semantic-recall-processor.md +1 -1
  40. package/.docs/reference/processors/skill-search-processor.md +93 -0
  41. package/.docs/reference/processors/tool-call-filter.md +2 -2
  42. package/.docs/reference/processors/working-memory-processor.md +1 -1
  43. package/.docs/reference/streaming/agents/stream.md +1 -1
  44. package/.docs/reference/tools/mcp-client.md +1 -1
  45. package/CHANGELOG.md +42 -0
  46. package/package.json +4 -4
  47. package/.docs/reference/core/getStoredAgentById.md +0 -87
  48. package/.docs/reference/core/listStoredAgents.md +0 -91
@@ -0,0 +1,627 @@
1
+ # Trajectory accuracy scorers
2
+
3
+ Mastra provides two trajectory accuracy scorers for evaluating whether an agent or workflow follows an expected sequence of actions:
4
+
5
+ 1. **Code-based scorer** - Deterministic evaluation using exact step matching and ordering
6
+ 2. **LLM-based scorer** - Semantic evaluation using AI to assess trajectory quality and appropriateness
7
+
8
+ Both scorers work with agents and workflows. The `runEvals` pipeline automatically extracts trajectories, so scorers receive a `Trajectory` object directly.
9
+
10
+ ## Trajectory extraction
11
+
12
+ The `runEvals` pipeline uses two extraction strategies, depending on whether observability storage is configured:
13
+
14
+ ### Trace-based extraction (preferred)
15
+
16
+ When the target's `Mastra` instance has storage configured, the pipeline fetches the full execution trace from the observability store and calls `extractTrajectoryFromTrace()`. This produces a hierarchical trajectory with nested `children`, capturing the complete execution tree — including nested agent runs, tool calls within workflow steps, and model generations.
17
+
18
+ For example, a workflow that calls an agent, which in turn calls tools, produces:
19
+
20
+ ```text
21
+ workflow_run
22
+ └─ workflow_step (validate-input)
23
+ └─ workflow_step (process-data)
24
+ └─ agent_run (my-agent)
25
+ └─ model_generation
26
+ └─ tool_call (search)
27
+ └─ model_generation
28
+ └─ tool_call (summarize)
29
+ └─ workflow_step (save-result)
30
+ ```
31
+
32
+ ### Fallback extraction
33
+
34
+ When storage is not available, the pipeline falls back to:
35
+
36
+ - **Agents:** `extractTrajectory()` — Extracts `ToolCallStep` entries from `toolInvocations` in the agent's message output. Produces a flat list of tool calls.
37
+ - **Workflows:** `extractWorkflowTrajectory()` — Extracts `WorkflowStepStep` entries from `stepResults`. Produces a flat list of workflow steps.
38
+
39
+ These fallbacks don't capture nested execution or non-tool-call spans.
40
+
41
+ ## Trajectory types
42
+
43
+ Trajectory steps use a discriminated union on `stepType`. Each step type has specific properties:
44
+
45
+ ### `ToolCallStep`
46
+
47
+ Represents an agent tool call.
48
+
49
+ **stepType** (`'tool_call'`): Discriminant.
50
+
51
+ **name** (`string`): Tool name.
52
+
53
+ **toolArgs** (`Record<string, unknown>`): Arguments passed to the tool.
54
+
55
+ **toolResult** (`Record<string, unknown>`): Result returned by the tool.
56
+
57
+ **success** (`boolean`): Whether the call succeeded.
58
+
59
+ **durationMs** (`number`): Execution time in milliseconds.
60
+
61
+ **metadata** (`Record<string, unknown>`): Arbitrary metadata.
62
+
63
+ **children** (`TrajectoryStep[]`): Nested sub-steps.
64
+
65
+ ### `WorkflowStepStep`
66
+
67
+ Represents a workflow step execution.
68
+
69
+ **stepType** (`'workflow_step'`): Discriminant.
70
+
71
+ **name** (`string`): Step identifier.
72
+
73
+ **stepId** (`string`): Step ID in the workflow.
74
+
75
+ **status** (`string`): Step result status (success, failed, suspended, etc.).
76
+
77
+ **output** (`Record<string, unknown>`): Step output data.
78
+
79
+ **durationMs** (`number`): Execution time in milliseconds.
80
+
81
+ **metadata** (`Record<string, unknown>`): Arbitrary metadata.
82
+
83
+ **children** (`TrajectoryStep[]`): Nested sub-steps (e.g. tool calls inside the step).
84
+
85
+ ### Other step types
86
+
87
+ The discriminated union includes these additional step types:
88
+
89
+ | Step type | Key properties |
90
+ | ---------------------- | ------------------------------------------------------------- |
91
+ | `mcp_tool_call` | `toolArgs`, `toolResult`, `mcpServer`, `success` |
92
+ | `model_generation` | `modelId`, `promptTokens`, `completionTokens`, `finishReason` |
93
+ | `agent_run` | `agentId` |
94
+ | `workflow_run` | `workflowId`, `status` |
95
+ | `workflow_conditional` | `conditionCount`, `selectedSteps` |
96
+ | `workflow_parallel` | `branchCount`, `parallelSteps` |
97
+ | `workflow_loop` | `loopType`, `totalIterations` |
98
+ | `workflow_sleep` | `durationMs`, `sleepType` |
99
+ | `workflow_wait_event` | `eventName`, `eventReceived` |
100
+ | `processor_run` | `processorId` |
101
+
102
+ All step types share the base properties `name`, `durationMs`, `metadata`, and `children`.
103
+
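The discriminated union described above can be sketched with local types. This is a minimal illustration, not Mastra's actual type definitions: the interfaces below are simplified, hypothetical copies that only mirror a few of the documented fields, and `describeStep` and `describeAll` are helper names invented for this example.

```typescript
// Simplified, local stand-ins for the documented step shapes.
// Assumption: these are NOT imported from Mastra and omit most fields.
interface BaseStep {
  name: string
  durationMs?: number
  metadata?: Record<string, unknown>
  children?: Step[]
}

interface ToolCallStep extends BaseStep {
  stepType: 'tool_call'
  toolArgs?: Record<string, unknown>
  success?: boolean
}

interface WorkflowStepStep extends BaseStep {
  stepType: 'workflow_step'
  status?: string
}

interface AgentRunStep extends BaseStep {
  stepType: 'agent_run'
  agentId?: string
}

type Step = ToolCallStep | WorkflowStepStep | AgentRunStep

// Switching on `stepType` narrows the union, so variant fields are type-safe.
function describeStep(step: Step): string {
  switch (step.stepType) {
    case 'tool_call':
      return `tool ${step.name} (${step.success === false ? 'failed' : 'ok'})`
    case 'workflow_step':
      return `step ${step.name} [${step.status ?? 'unknown'}]`
    case 'agent_run':
      return `agent ${step.agentId ?? step.name}`
  }
}

// Depth-first walk over nested `children`, parents before their sub-steps.
function describeAll(steps: Step[]): string[] {
  const lines: string[] = []
  for (const step of steps) {
    lines.push(describeStep(step))
    lines.push(...describeAll(step.children ?? []))
  }
  return lines
}

const tree: Step[] = [
  {
    stepType: 'workflow_step',
    name: 'process-data',
    status: 'success',
    children: [
      {
        stepType: 'agent_run',
        name: 'my-agent',
        agentId: 'my-agent',
        children: [
          { stepType: 'tool_call', name: 'search', success: true },
          { stepType: 'tool_call', name: 'summarize', success: false },
        ],
      },
    ],
  },
]

const described = describeAll(tree)
```

The walk visits each parent before its children, which matches the top-down layout of the execution tree shown under trace-based extraction.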
104
+ ## Expected steps
105
+
106
+ When defining expected trajectories, use `ExpectedStep` instead of the full `TrajectoryStep` discriminated union. `ExpectedStep` is a discriminated union that mirrors `TrajectoryStep` — when you specify a `stepType`, you get autocomplete for that variant's fields (e.g., `toolArgs` for `tool_call`, `modelId` for `model_generation`). All variant-specific fields are optional, so you only assert against what you care about.
107
+
108
+ Omit `stepType` entirely to match any step by name only.
109
+
110
+ **name** (`string`): Step name to match (tool name, agent ID, workflow step name, etc.).
111
+
112
+ **stepType** (`TrajectoryStepType`): Step type discriminant. When set, enables autocomplete for that variant's fields. If omitted, matches any step type with the given name.
113
+
114
+ **(variant fields)** (`varies`): Type-specific fields from the corresponding `TrajectoryStep` variant. For example, `toolArgs` and `toolResult` for `tool_call`, `modelId` for `model_generation`, `output` for `workflow_step`. All are optional; only specified fields are compared.
115
+
116
+ **children** (`TrajectoryExpectation`): Nested expectation config for this step's children. Overrides the parent config for evaluating children of this step.
117
+
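The "only specified fields are compared" rule can be illustrated with a small matcher. This is a sketch under stated assumptions, not the library's comparison logic: `matchesExpected`, `ActualStep`, and `ExpectedMatcher` are hypothetical names, and comparing values with `JSON.stringify` is a simplification that is sensitive to key order.

```typescript
// Hypothetical local shapes; real steps carry more fields than shown here.
interface ActualStep extends Record<string, unknown> {
  name: string
  stepType: string
}

interface ExpectedMatcher extends Record<string, unknown> {
  name: string
  stepType?: string
}

// A step matches when every field present on the matcher equals the
// corresponding field on the actual step. Omitted fields are ignored.
function matchesExpected(actual: ActualStep, expected: ExpectedMatcher): boolean {
  for (const [key, value] of Object.entries(expected)) {
    // `children` holds nested expectations, which this sketch does not evaluate.
    if (key === 'children' || value === undefined) continue
    // Assumption: structural equality via JSON serialization.
    if (JSON.stringify(actual[key]) !== JSON.stringify(value)) return false
  }
  return true
}

const actual: ActualStep = {
  name: 'search',
  stepType: 'tool_call',
  toolArgs: { query: 'weather' },
}

const byNameOnly = matchesExpected(actual, { name: 'search' })
const byWrongType = matchesExpected(actual, { name: 'search', stepType: 'model_generation' })
const byArgs = matchesExpected(actual, {
  name: 'search',
  stepType: 'tool_call',
  toolArgs: { query: 'weather' },
})
```

Leaving `stepType` off the matcher means the loop never checks it, which is how a name-only expectation can match a step of any type.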
118
+ ### Simple expected steps
119
+
120
+ ```typescript
121
+ const steps: ExpectedStep[] = [
122
+ // Match by name only (any step type)
123
+ { name: 'search' },
124
+
125
+ // Match by name and step type (autocomplete for tool_call fields)
126
+ { name: 'search', stepType: 'tool_call' },
127
+
128
+ // Match with specific toolArgs (auto-compared when present)
129
+ { name: 'search', stepType: 'tool_call', toolArgs: { query: 'weather' } },
130
+
131
+ // Match a model generation step by model ID
132
+ { name: 'gpt-4o', stepType: 'model_generation', modelId: 'gpt-4o' },
133
+ ]
134
+ ```
135
+
136
+ ### Nested expectations
137
+
138
+ Each expected step can include a `children` config with its own evaluation rules. This lets you set different ordering or comparison rules at each level of the hierarchy.
139
+
140
+ ```typescript
141
+ const scorer = createTrajectoryScorerCode({
142
+ defaults: {
143
+ ordering: 'strict',
144
+ steps: [
145
+ { name: 'validate-input', stepType: 'workflow_step' },
146
+ {
147
+ name: 'research-agent',
148
+ stepType: 'agent_run',
149
+ children: {
150
+ // Sub-agent can call tools in any order
151
+ ordering: 'unordered',
152
+ steps: [
153
+ { name: 'search', stepType: 'tool_call' },
154
+ { name: 'summarize', stepType: 'tool_call' },
155
+ ],
156
+ },
157
+ },
158
+ { name: 'save-result', stepType: 'workflow_step' },
159
+ ],
160
+ },
161
+ })
162
+ ```
163
+
164
+ In this example, the parent workflow requires strict ordering of its steps, but the nested `research-agent` allows its tool calls in any order.
165
+
166
+ ## Choosing between scorers
167
+
168
+ ### Use the code-based scorer when:
169
+
170
+ - You need **deterministic, reproducible** results
171
+ - You have a **known expected trajectory** to compare against
172
+ - You want to validate **exact step sequences**
173
+ - Speed and cost are priorities (no LLM calls)
174
+ - You are running automated tests in CI/CD
175
+
176
+ ### Use the LLM-based scorer when:
177
+
178
+ - You need **semantic understanding** of whether steps were appropriate
179
+ - The optimal trajectory is **not predetermined** (evaluate based on task requirements)
180
+ - You want to detect **unnecessary, redundant, or missing** steps
181
+ - You need **explanations** for scoring decisions
182
+ - You are evaluating **production agent behavior**
183
+
184
+ ## Code-based trajectory accuracy scorer
185
+
186
+ The `createTrajectoryAccuracyScorerCode()` function from `@mastra/evals/scorers/prebuilt` provides deterministic scoring based on step matching and ordering against an expected trajectory.
187
+
188
+ ### Parameters
189
+
190
+ **expectedTrajectory** (`Trajectory | ExpectedStep[]`): Static expected trajectory to compare against. Accepts a full Trajectory or an array of ExpectedStep matchers. When omitted, the scorer reads expectedTrajectory from each dataset item at runtime.
191
+
192
+ **comparisonOptions** (`TrajectoryComparisonOptions`): Controls how the comparison is performed.
193
+
194
+ This function returns an instance of the MastraScorer class. See the [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer) for details on the `.run()` method and its input/output.
195
+
196
+ ### Expected trajectory sources
197
+
198
+ The code-based scorer resolves `expectedTrajectory` from two sources, in order of priority:
199
+
200
+ 1. **Constructor option** — A static trajectory passed when creating the scorer. Used for all dataset items.
201
+ 2. **Dataset item** — An `expectedTrajectory` field on the dataset item, passed through the `runEvals` pipeline. Allows different expected trajectories per item.
202
+
203
+ ```typescript
204
+ // Static: same expected trajectory for all items
205
+ const scorer = createTrajectoryAccuracyScorerCode({
206
+ expectedTrajectory: {
207
+ steps: [
208
+ { stepType: 'tool_call', name: 'search' },
209
+ { stepType: 'tool_call', name: 'summarize' },
210
+ ],
211
+ },
212
+ })
213
+ ```
214
+
215
+ ```typescript
216
+ // Per-item: each dataset item has its own expectedTrajectory
217
+ const scorer = createTrajectoryAccuracyScorerCode()
218
+
219
+ await runEvals({
220
+ target: myAgent,
221
+ scorers: { trajectory: [scorer] },
222
+ data: [
223
+ {
224
+ input: 'Search and summarize weather',
225
+ expectedTrajectory: {
226
+ steps: [
227
+ { stepType: 'tool_call', name: 'search' },
228
+ { stepType: 'tool_call', name: 'summarize' },
229
+ ],
230
+ },
231
+ },
232
+ {
233
+ input: 'Just search for weather',
234
+ expectedTrajectory: {
235
+ steps: [{ stepType: 'tool_call', name: 'search' }],
236
+ },
237
+ },
238
+ ],
239
+ })
240
+ ```
241
+
242
+ ### Evaluation modes
243
+
244
+ The code-based scorer operates in two modes based on `strictOrder`:
245
+
246
+ #### Strict mode (`strictOrder: true`)
247
+
248
+ Requires an exact match. The actual steps must match the expected steps in the same order with no extra or missing steps. Returns `1.0` for an exact match and `0.0` otherwise.
249
+
250
+ #### Relaxed mode (`strictOrder: false`, default)
251
+
252
+ Allows extra steps. Expected steps must appear in the correct relative order. The score is calculated based on how many expected steps were matched, with optional penalties for extra or repeated steps.
253
+
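The difference between the two modes can be sketched on plain step names. This is a simplified illustration, not the scorer's implementation: `strictScore` and `relaxedMatchRatio` are hypothetical helpers, and the relaxed version deliberately leaves out the optional penalties for extra or repeated steps, so its numbers will not necessarily equal the scores the real scorer reports.

```typescript
// Strict: names must be identical, in the same order, with nothing extra or missing.
function strictScore(actual: string[], expected: string[]): number {
  const same =
    actual.length === expected.length && actual.every((name, i) => name === expected[i])
  return same ? 1 : 0
}

// Relaxed (simplified): fraction of expected names found, in order,
// as a subsequence of the actual names. Extra steps are skipped.
function relaxedMatchRatio(actual: string[], expected: string[]): number {
  if (expected.length === 0) return 1
  let matched = 0
  for (const name of actual) {
    if (matched < expected.length && name === expected[matched]) matched++
  }
  return matched / expected.length
}

const actualNames = ['search-tool', 'log-tool', 'summarize-tool']
const expectedNames = ['search-tool', 'summarize-tool']

const strict = strictScore(actualNames, expectedNames)
const relaxed = relaxedMatchRatio(actualNames, expectedNames)
```

With the extra `log-tool` call, the strict check fails outright, while the relaxed check still finds both expected steps in order. Any penalty for the extra step would be applied on top of this ratio.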
254
+ ## Code-based scoring details
255
+
256
+ - **Continuous scores**: Returns values between 0.0 and 1.0 in relaxed mode; binary (0 or 1) in strict mode
257
+ - **Deterministic**: Same input always produces the same output
258
+ - **Fast**: No external API calls
259
+
260
+ ### Code-based scorer results
261
+
262
+ ```typescript
263
+ {
264
+ runId: string,
265
+ preprocessStepResult: {
266
+ actualTrajectory: Trajectory,
267
+ expectedTrajectory: Trajectory,
268
+ comparison: {
269
+ score: number,
270
+ matchedSteps: number,
271
+ totalExpectedSteps: number,
272
+ totalActualSteps: number,
273
+ missingSteps: string[],
274
+ extraSteps: string[],
275
+ outOfOrderSteps: string[],
276
+ repeatedSteps: string[]
277
+ },
278
+ actualStepNames: string[],
279
+ expectedStepNames: string[]
280
+ },
281
+ score: number
282
+ }
283
+ ```
284
+
285
+ ## Code-based scorer examples
286
+
287
+ ### Agent trajectory with strict ordering
288
+
289
+ Validates that an agent follows an exact sequence of tool calls:
290
+
291
+ ```typescript
292
+ import { createTrajectoryAccuracyScorerCode } from '@mastra/evals/scorers/prebuilt'
293
+ import { runEvals } from '@mastra/core/evals'
294
+
295
+ const scorer = createTrajectoryAccuracyScorerCode({
296
+ expectedTrajectory: {
297
+ steps: [
298
+ { stepType: 'tool_call', name: 'auth-tool' },
299
+ { stepType: 'tool_call', name: 'fetch-tool' },
300
+ ],
301
+ },
302
+ comparisonOptions: { strictOrder: true },
303
+ })
304
+
305
+ const result = await runEvals({
306
+ target: myAgent,
307
+ scorers: { trajectory: [scorer] },
308
+ data: [{ input: 'Get my data' }],
309
+ })
310
+
311
+ console.log(result.scores.trajectory['trajectory-accuracy']) // 1.0
312
+ ```
313
+
314
+ ### Agent trajectory with relaxed ordering
315
+
316
+ Allows extra steps as long as expected steps appear in the correct relative order:
317
+
318
+ ```typescript
319
+ const scorer = createTrajectoryAccuracyScorerCode({
320
+ expectedTrajectory: {
321
+ steps: [
322
+ { stepType: 'tool_call', name: 'search-tool' },
323
+ { stepType: 'tool_call', name: 'summarize-tool' },
324
+ ],
325
+ },
326
+ comparisonOptions: { strictOrder: false },
327
+ })
328
+
329
+ // Agent called search-tool → log-tool → summarize-tool
330
+ // The extra log-tool is allowed in relaxed mode
331
+ // score: 0.75 — all expected steps matched, small penalty for extra step
332
+ ```
333
+
334
+ ### Workflow trajectory
335
+
336
+ Evaluates a workflow's execution path:
337
+
338
+ ```typescript
339
+ import { createTrajectoryAccuracyScorerCode } from '@mastra/evals/scorers/prebuilt'
340
+ import { runEvals } from '@mastra/core/evals'
341
+
342
+ const scorer = createTrajectoryAccuracyScorerCode({
343
+ expectedTrajectory: {
344
+ steps: [
345
+ { stepType: 'workflow_step', name: 'validate-input' },
346
+ { stepType: 'workflow_step', name: 'process-data' },
347
+ { stepType: 'workflow_step', name: 'save-result' },
348
+ ],
349
+ },
350
+ })
351
+
352
+ const result = await runEvals({
353
+ target: myWorkflow,
354
+ scorers: { trajectory: [scorer] },
355
+ data: [{ input: { data: 'test' } }],
356
+ })
357
+
358
+ console.log(result.scores.trajectory['trajectory-accuracy'])
359
+ ```
360
+
361
+ ### Comparing step data
362
+
363
+ Validates not just the step names but also step-specific data. For tool calls, this compares `toolArgs` and `toolResult`. For workflow steps, this compares `output`.
364
+
365
+ ```typescript
366
+ const scorer = createTrajectoryAccuracyScorerCode({
367
+ expectedTrajectory: {
368
+ steps: [
369
+ {
370
+ stepType: 'tool_call',
371
+ name: 'search-tool',
372
+ toolArgs: { query: 'weather in NYC' },
373
+ },
374
+ ],
375
+ },
376
+ })
377
+ // Data fields like toolArgs are auto-compared when present on expected steps
378
+ ```
379
+
380
+ ## LLM-based trajectory accuracy scorer
381
+
382
+ The `createTrajectoryAccuracyScorerLLM()` function from `@mastra/evals/scorers/prebuilt` uses an LLM to evaluate whether an agent's or workflow's trajectory was appropriate, efficient, and complete.
383
+
384
+ ### Parameters
385
+
386
+ **model** (`MastraModelConfig`): The LLM model to use for evaluating trajectory quality.
387
+
388
+ **expectedTrajectory** (`Trajectory | ExpectedStep[]`): Optional static expected trajectory to compare against. Accepts a full Trajectory or an array of ExpectedStep matchers. When omitted, the LLM evaluates the trajectory based on the task requirements alone. Can also come from dataset items at runtime.
389
+
390
+ ### Features
391
+
392
+ The LLM-based scorer provides:
393
+
394
+ - **Task-aware evaluation**: Assesses whether each step was necessary given the user's request
395
+ - **Ordering assessment**: Evaluates whether steps were taken in a logical order
396
+ - **Missing step detection**: Identifies steps that should have been taken
397
+ - **Redundancy detection**: Flags unnecessary or repeated steps
398
+ - **Reasoning generation**: Provides human-readable explanations for scoring decisions
399
+
400
+ ### Evaluation process
401
+
402
+ 1. **Receive trajectory**: Gets a pre-extracted `Trajectory` object from the pipeline
403
+ 2. **Analyze steps**: Evaluates each step for necessity and ordering using the LLM
404
+ 3. **Generate score**: Calculates score weighted as 60% necessity, 30% ordering, minus 10% missing penalty
405
+ 4. **Generate reasoning**: Provides a human-readable explanation
406
+
407
+ ## LLM-based scoring details
408
+
409
+ - **Fractional scores**: Returns values between 0.0 and 1.0
410
+ - **Context-aware**: Considers user intent and task requirements
411
+ - **Explanatory**: Provides reasoning for scores
412
+ - **Flexible**: Works with or without an expected trajectory
413
+
414
+ ### LLM-based scorer options
415
+
416
+ ```typescript
417
+ // Evaluate based on task requirements (no expected trajectory)
418
+ const openScorer = createTrajectoryAccuracyScorerLLM({
419
+ model: { provider: 'openai', name: 'gpt-5.4' },
420
+ })
421
+
422
+ // Evaluate against a static expected trajectory
423
+ const guidedScorer = createTrajectoryAccuracyScorerLLM({
424
+ model: { provider: 'openai', name: 'gpt-5.4' },
425
+ expectedTrajectory: {
426
+ steps: [
427
+ { stepType: 'tool_call', name: 'search-tool' },
428
+ { stepType: 'tool_call', name: 'summarize-tool' },
429
+ ],
430
+ },
431
+ })
432
+ ```
433
+
434
+ ### LLM-based scorer results
435
+
436
+ ```typescript
437
+ {
438
+ runId: string,
439
+ preprocessStepResult: {
440
+ actualTrajectory: Trajectory,
441
+ actualTrajectoryFormatted: string,
442
+ expectedTrajectoryFormatted?: string,
443
+ hasSteps: boolean
444
+ },
445
+ analyzeStepResult: {
446
+ stepEvaluations: Array<{
447
+ stepName: string,
448
+ wasNecessary: boolean,
449
+ wasInOrder: boolean,
450
+ reasoning: string
451
+ }>,
452
+ missingSteps?: string[],
453
+ extraSteps?: string[],
454
+ overallAssessment: string
455
+ },
456
+ score: number,
457
+ reason: string
458
+ }
459
+ ```
460
+
461
+ ## Unified trajectory scorer
462
+
463
+ The `createTrajectoryScorerCode()` function from `@mastra/evals/scorers/prebuilt` provides a multi-dimensional trajectory evaluation that checks accuracy, efficiency, blacklisted tools, and tool failure patterns in a single pass.
464
+
465
+ ### Parameters
466
+
467
+ **defaults** (`TrajectoryExpectation`): Default expectations applied to all dataset items. Per-item expectedTrajectory values override these defaults.
468
+
469
+ **weights** (`TrajectoryScoreWeights`): Custom weights for combining dimension scores. Weights are normalized to sum to 1.0.
470
+
471
+ ### Scoring behavior
472
+
473
+ The unified scorer evaluates four dimensions:
474
+
475
+ 1. **Accuracy** — Matches actual steps against expected steps (if `steps` is configured). Uses the `ordering` mode.
476
+ 2. **Efficiency** — Checks step budgets (`maxSteps`, `maxTotalTokens`, `maxTotalDurationMs`) and redundant calls (`noRedundantCalls`).
477
+ 3. **Blacklist** — Checks for forbidden tools or sequences. Any violation immediately results in a score of **0.0** regardless of other dimensions.
478
+ 4. **Tool failures** — Detects retry patterns, fallback patterns, and argument correction patterns.
479
+
480
+ The final score is a weighted combination of the active dimensions, with the weights renormalized over the dimensions that are active. The default weights are accuracy 0.4, efficiency 0.3, tool failures 0.2, and blacklist 0.1; customize them with the `weights` option. A blacklist violation overrides the result to 0. When nested evaluations are present, the final score is 70% the top-level score and 30% the average of the nested scores.
481
+
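The combination rule described in this section can be sketched as follows. This is an illustration built only from the description above, not the library's code: `combineScores` and its option names are hypothetical, and the value returned when no dimension is active is an assumption.

```typescript
type Dimension = 'accuracy' | 'efficiency' | 'toolFailures' | 'blacklist'

// Default weights as documented for the unified scorer.
const defaultWeights: Record<Dimension, number> = {
  accuracy: 0.4,
  efficiency: 0.3,
  toolFailures: 0.2,
  blacklist: 0.1,
}

interface CombineOptions {
  blacklistViolated?: boolean
  nestedScores?: number[]
  weights?: Record<Dimension, number>
}

function combineScores(
  scores: Partial<Record<Dimension, number>>,
  options: CombineOptions = {},
): number {
  // A blacklist violation overrides every other dimension.
  if (options.blacklistViolated) return 0

  const weights = options.weights ?? defaultWeights
  let weighted = 0
  let totalWeight = 0
  for (const key of Object.keys(scores) as Dimension[]) {
    const score = scores[key]
    if (score === undefined) continue // inactive dimension
    weighted += score * weights[key]
    totalWeight += weights[key]
  }
  // Renormalize over active dimensions.
  // Assumption: with no active dimension the sketch returns 1.
  const topLevel = totalWeight > 0 ? weighted / totalWeight : 1

  const nested = options.nestedScores ?? []
  if (nested.length === 0) return topLevel
  const nestedAverage = nested.reduce((sum, s) => sum + s, 0) / nested.length
  return 0.7 * topLevel + 0.3 * nestedAverage
}

// Only accuracy and efficiency are active: (0.4 * 1 + 0.3 * 0.5) / 0.7
const twoDimensions = combineScores({ accuracy: 1, efficiency: 0.5 })
const violated = combineScores({ accuracy: 1 }, { blacklistViolated: true })
const withNested = combineScores({ accuracy: 1 }, { nestedScores: [0.5, 1] })
```

Renormalizing by the sum of active weights keeps the result in the 0 to 1 range even when only some dimensions are configured.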
482
+ ### Unified scorer results
483
+
484
+ ```typescript
485
+ {
486
+ runId: string,
487
+ preprocessStepResult: {
488
+ accuracy?: TrajectoryComparisonResult,
489
+ efficiency?: TrajectoryEfficiencyResult,
490
+ blacklist?: TrajectoryBlacklistResult,
491
+ toolFailures?: ToolFailureAnalysisResult,
492
+ nested?: NestedEvaluationResult[],
493
+ },
494
+ score: number,
495
+ reason: string
496
+ }
497
+ ```
498
+
499
+ ### Per-item expectations
500
+
501
+ Each dataset item can override the defaults with its own `expectedTrajectory`. This lets you vary expectations per prompt:
502
+
503
+ ```typescript
504
+ import { createTrajectoryScorerCode } from '@mastra/evals/scorers/prebuilt'
505
+ import { runEvals } from '@mastra/core/evals'
506
+
507
+ // Default blacklist applies to all items
508
+ const scorer = createTrajectoryScorerCode({
509
+ defaults: {
510
+ blacklistedTools: ['deleteAll'],
511
+ maxSteps: 5,
512
+ },
513
+ })
514
+
515
+ const result = await runEvals({
516
+ target: myAgent,
517
+ scorers: { trajectory: [scorer] },
518
+ data: [
519
+ {
520
+ input: 'Search for weather',
521
+ expectedTrajectory: {
522
+ steps: [{ stepType: 'tool_call', name: 'search' }],
523
+ maxSteps: 2,
524
+ },
525
+ },
526
+ {
527
+ input: 'Search and summarize',
528
+ expectedTrajectory: {
529
+ steps: [
530
+ { stepType: 'tool_call', name: 'search' },
531
+ { stepType: 'tool_call', name: 'summarize' },
532
+ ],
533
+ },
534
+ },
535
+ ],
536
+ })
537
+ ```
538
+
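One way to picture how per-item values override the defaults is a field-level merge. This is a sketch that assumes a shallow merge, which the text above suggests but does not confirm: `Expectation` is a trimmed-down, hypothetical stand-in for `TrajectoryExpectation`, and `resolveExpectation` is an invented helper.

```typescript
// Hypothetical, reduced version of the expectation config used above.
interface Expectation {
  steps?: { name: string; stepType?: string }[]
  maxSteps?: number
  blacklistedTools?: string[]
}

// Assumption: fields set on the dataset item replace the same fields in
// the defaults, and fields the item omits fall back to the defaults.
function resolveExpectation(defaults: Expectation, perItem: Expectation = {}): Expectation {
  return { ...defaults, ...perItem }
}

const defaults: Expectation = { blacklistedTools: ['deleteAll'], maxSteps: 5 }

// First item: tightens the step budget and adds expected steps.
const first = resolveExpectation(defaults, {
  steps: [{ stepType: 'tool_call', name: 'search' }],
  maxSteps: 2,
})

// Second item: sets steps only, so the default budget still applies.
const second = resolveExpectation(defaults, {
  steps: [
    { stepType: 'tool_call', name: 'search' },
    { stepType: 'tool_call', name: 'summarize' },
  ],
})
```

Under this assumption the default blacklist reaches every item, while each item can still tighten or relax its own step budget.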
539
+ ### Example: efficiency and blacklist
540
+
541
+ ```typescript
542
+ import { createTrajectoryScorerCode } from '@mastra/evals/scorers/prebuilt'
543
+
544
+ const scorer = createTrajectoryScorerCode({
545
+ defaults: {
546
+ blacklistedTools: ['escalate', 'admin-override'],
547
+ blacklistedSequences: [['escalate', 'admin-override']],
548
+ maxSteps: 10,
549
+ noRedundantCalls: true,
550
+ maxRetriesPerTool: 2,
551
+ },
552
+ // Customize how dimensions contribute to the final score
553
+ weights: {
554
+ accuracy: 0.5, // prioritize step accuracy
555
+ efficiency: 0.3,
556
+ toolFailures: 0.1,
557
+ blacklist: 0.1,
558
+ },
559
+ })
560
+ ```
561
+
562
+ ## Using trajectory scorers with `runEvals`
563
+
564
+ Trajectory scorers are configured under the `trajectory` key in the scorer config. The `runEvals` pipeline handles trajectory extraction automatically.
565
+
566
+ ### Agent trajectory evaluation
567
+
568
+ ```typescript
569
+ import { runEvals } from '@mastra/core/evals'
570
+ import { createTrajectoryAccuracyScorerCode } from '@mastra/evals/scorers/prebuilt'
571
+
572
+ const trajectoryScorer = createTrajectoryAccuracyScorerCode({
573
+ expectedTrajectory: {
574
+ steps: [
575
+ { stepType: 'tool_call', name: 'search' },
576
+ { stepType: 'tool_call', name: 'format' },
577
+ ],
578
+ },
579
+ })
580
+
581
+ const result = await runEvals({
582
+ target: myAgent,
583
+ scorers: {
584
+ agent: [qualityScorer], // receives raw MastraDBMessage[] output
585
+ trajectory: [trajectoryScorer], // receives pre-extracted Trajectory
586
+ },
587
+ data: [{ input: 'Find and format the data' }],
588
+ })
589
+
590
+ // result.scores.agent['quality'] — agent-level score
591
+ // result.scores.trajectory['trajectory-accuracy'] — trajectory score
592
+ ```
593
+
594
+ ### Workflow trajectory evaluation
595
+
596
+ ```typescript
597
+ import { runEvals } from '@mastra/core/evals'
598
+ import { createTrajectoryAccuracyScorerCode } from '@mastra/evals/scorers/prebuilt'
599
+
600
+ const workflowTrajectoryScorer = createTrajectoryAccuracyScorerCode({
601
+ expectedTrajectory: {
602
+ steps: [
603
+ { stepType: 'workflow_step', name: 'validate' },
604
+ { stepType: 'workflow_step', name: 'process' },
605
+ { stepType: 'workflow_step', name: 'notify' },
606
+ ],
607
+ },
608
+ })
609
+
610
+ const result = await runEvals({
611
+ target: myWorkflow,
612
+ scorers: {
613
+ workflow: [outputScorer], // receives workflow output
614
+ trajectory: [workflowTrajectoryScorer], // receives pre-extracted Trajectory from step results
615
+ },
616
+ data: [{ input: { userId: '123' } }],
617
+ })
618
+
619
+ // result.scores.workflow['output-quality'] — workflow-level score
620
+ // result.scores.trajectory['trajectory-accuracy'] — trajectory score
621
+ ```
622
+
623
+ ## Related
624
+
625
+ - [runEvals reference](https://mastra.ai/reference/evals/run-evals) — Pipeline that extracts trajectories and passes them to scorers
626
+ - [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer) — Base scorer interface
627
+ - [Scorer utils](https://mastra.ai/reference/evals/scorer-utils) — Utility functions including `extractTrajectory` and `compareTrajectories`
@@ -94,6 +94,8 @@ await harness.sendMessage({ content: 'Hello!' })
94
94
 
95
95
  **omConfig** (`HarnessOMConfig`): Default configuration for observational memory (observer/reflector model IDs and thresholds).
96
96
 
97
+ **disableBuiltinTools** (`BuiltinToolId[]`): Built-in harness tool IDs to remove from the `harnessBuiltIn` toolset. Valid values are `ask_user`, `submit_plan`, `task_write`, `task_check`, and `subagent`.
98
+
97
99
  **heartbeatHandlers** (`HeartbeatHandler[]`): Periodic background tasks started during `init()`. Use for gateway sync, cache refresh, and similar tasks.
98
100
 
99
101
  **idGenerator** (`() => string`): Custom ID generator for Harness-managed IDs such as threads and mode-run identifiers. (Default: `timestamp + random string`)
@@ -35,6 +35,7 @@ The Reference section provides documentation of Mastra's API, including paramete
35
35
  - [Clerk](https://mastra.ai/reference/auth/clerk)
36
36
  - [Firebase](https://mastra.ai/reference/auth/firebase)
37
37
  - [JSON Web Token](https://mastra.ai/reference/auth/jwt)
38
+ - [Okta](https://mastra.ai/reference/auth/okta)
38
39
  - [Supabase](https://mastra.ai/reference/auth/supabase)
39
40
  - [WorkOS](https://mastra.ai/reference/auth/workos)
40
41
  - [create-mastra](https://mastra.ai/reference/cli/create-mastra)
@@ -65,7 +66,6 @@ The Reference section provides documentation of Mastra's API, including paramete
65
66
  - [.getScorerById()](https://mastra.ai/reference/core/getScorerById)
66
67
  - [.getServer()](https://mastra.ai/reference/core/getServer)
67
68
  - [.getStorage()](https://mastra.ai/reference/core/getStorage)
68
- - [.getStoredAgentById()](https://mastra.ai/reference/core/getStoredAgentById)
69
69
  - [.getTelemetry()](https://mastra.ai/reference/core/getTelemetry)
70
70
  - [.getVector()](https://mastra.ai/reference/core/getVector)
71
71
  - [.getWorkflow()](https://mastra.ai/reference/core/getWorkflow)
@@ -76,7 +76,6 @@ The Reference section provides documentation of Mastra's API, including paramete
76
76
  - [.listMCPServers()](https://mastra.ai/reference/core/listMCPServers)
77
77
  - [.listMemory()](https://mastra.ai/reference/core/listMemory)
78
78
  - [.listScorers()](https://mastra.ai/reference/core/listScorers)
79
- - [.listStoredAgents()](https://mastra.ai/reference/core/listStoredAgents)
80
79
  - [.listVectors()](https://mastra.ai/reference/core/listVectors)
81
80
  - [.listWorkflows()](https://mastra.ai/reference/core/listWorkflows)
82
81
  - [.setLogger()](https://mastra.ai/reference/core/setLogger)
@@ -105,6 +104,7 @@ The Reference section provides documentation of Mastra's API, including paramete
105
104
  - [Tone Consistency Scorer](https://mastra.ai/reference/evals/tone-consistency)
106
105
  - [Tool Call Accuracy Scorers](https://mastra.ai/reference/evals/tool-call-accuracy)
107
106
  - [Toxicity](https://mastra.ai/reference/evals/toxicity)
107
+ - [Trajectory Accuracy Scorers](https://mastra.ai/reference/evals/trajectory-accuracy)
108
108
  - [Harness Class](https://mastra.ai/reference/harness/harness-class)
109
109
  - [Cloned Thread Utilities](https://mastra.ai/reference/memory/clone-utilities)
110
110
  - [Memory Class](https://mastra.ai/reference/memory/memory-class)
@@ -152,6 +152,7 @@ The Reference section provides documentation of Mastra's API, including paramete
152
152
  - [Processor Interface](https://mastra.ai/reference/processors/processor-interface)
153
153
  - [PromptInjectionDetector](https://mastra.ai/reference/processors/prompt-injection-detector)
154
154
  - [SemanticRecall](https://mastra.ai/reference/processors/semantic-recall-processor)
155
+ - [SkillSearchProcessor](https://mastra.ai/reference/processors/skill-search-processor)
155
156
  - [SystemPromptScrubber](https://mastra.ai/reference/processors/system-prompt-scrubber)
156
157
  - [TokenLimiterProcessor](https://mastra.ai/reference/processors/token-limiter-processor)
157
158
  - [ToolCallFilter](https://mastra.ai/reference/processors/tool-call-filter)