npm - @mastra/evals - Versions diffs - 1.1.2-alpha.0 → 1.2.0-alpha.0 - Mend

@mastra/evals 1.1.2-alpha.0 → 1.2.0-alpha.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (60)

package/CHANGELOG.md +59 -2
package/LICENSE.md +15 -0
package/dist/chunk-EVBNIL5M.js +606 -0
package/dist/chunk-EVBNIL5M.js.map +1 -0
package/dist/chunk-XRUR5PBK.cjs +632 -0
package/dist/chunk-XRUR5PBK.cjs.map +1 -0
package/dist/docs/SKILL.md +20 -19
package/dist/docs/assets/SOURCE_MAP.json +1 -1
package/dist/docs/references/docs-evals-built-in-scorers.md +2 -1
package/dist/docs/references/docs-evals-overview.md +11 -16
package/dist/docs/references/reference-evals-answer-relevancy.md +25 -25
package/dist/docs/references/reference-evals-answer-similarity.md +33 -35
package/dist/docs/references/reference-evals-bias.md +24 -24
package/dist/docs/references/reference-evals-completeness.md +19 -20
package/dist/docs/references/reference-evals-content-similarity.md +20 -20
package/dist/docs/references/reference-evals-context-precision.md +36 -36
package/dist/docs/references/reference-evals-context-relevance.md +136 -141
package/dist/docs/references/reference-evals-faithfulness.md +24 -24
package/dist/docs/references/reference-evals-hallucination.md +52 -69
package/dist/docs/references/reference-evals-keyword-coverage.md +18 -18
package/dist/docs/references/reference-evals-noise-sensitivity.md +167 -177
package/dist/docs/references/reference-evals-prompt-alignment.md +111 -116
package/dist/docs/references/reference-evals-scorer-utils.md +285 -105
package/dist/docs/references/reference-evals-textual-difference.md +18 -18
package/dist/docs/references/reference-evals-tone-consistency.md +19 -19
package/dist/docs/references/reference-evals-tool-call-accuracy.md +165 -165
package/dist/docs/references/reference-evals-toxicity.md +21 -21
package/dist/docs/references/reference-evals-trajectory-accuracy.md +613 -0
package/dist/scorers/code/index.d.ts +1 -0
package/dist/scorers/code/index.d.ts.map +1 -1
package/dist/scorers/code/trajectory/index.d.ts +147 -0
package/dist/scorers/code/trajectory/index.d.ts.map +1 -0
package/dist/scorers/llm/answer-similarity/index.d.ts +2 -2
package/dist/scorers/llm/context-precision/index.d.ts +2 -2
package/dist/scorers/llm/context-relevance/index.d.ts +1 -1
package/dist/scorers/llm/faithfulness/index.d.ts +1 -1
package/dist/scorers/llm/hallucination/index.d.ts +2 -2
package/dist/scorers/llm/index.d.ts +1 -0
package/dist/scorers/llm/index.d.ts.map +1 -1
package/dist/scorers/llm/noise-sensitivity/index.d.ts +1 -1
package/dist/scorers/llm/prompt-alignment/index.d.ts +5 -5
package/dist/scorers/llm/tool-call-accuracy/index.d.ts +1 -1
package/dist/scorers/llm/toxicity/index.d.ts +1 -1
package/dist/scorers/llm/trajectory/index.d.ts +58 -0
package/dist/scorers/llm/trajectory/index.d.ts.map +1 -0
package/dist/scorers/llm/trajectory/prompts.d.ts +20 -0
package/dist/scorers/llm/trajectory/prompts.d.ts.map +1 -0
package/dist/scorers/prebuilt/index.cjs +638 -59
package/dist/scorers/prebuilt/index.cjs.map +1 -1
package/dist/scorers/prebuilt/index.js +578 -2
package/dist/scorers/prebuilt/index.js.map +1 -1
package/dist/scorers/utils.cjs +41 -17
package/dist/scorers/utils.d.ts +171 -1
package/dist/scorers/utils.d.ts.map +1 -1
package/dist/scorers/utils.js +1 -1
package/package.json +14 -11
package/dist/chunk-OEOE7ZHN.js +0 -195
package/dist/chunk-OEOE7ZHN.js.map +0 -1
package/dist/chunk-W3U7MMDX.cjs +0 -212
package/dist/chunk-W3U7MMDX.cjs.map +0 -1

package/dist/docs/references/reference-evals-tool-call-accuracy.md CHANGED

@@ -1,11 +1,11 @@
-# Tool Call Accuracy Scorers
+# Tool call accuracy scorers
 Mastra provides two tool call accuracy scorers for evaluating whether an LLM selects the correct tools from available options:
 1. **Code-based scorer** - Deterministic evaluation using exact tool matching
 2. **LLM-based scorer** - Semantic evaluation using AI to assess appropriateness
-## Choosing Between Scorers
+## Choosing between scorers
 ### Use the Code-Based Scorer When:
@@ -23,17 +23,17 @@ Mastra provides two tool call accuracy scorers for evaluating whether an LLM sel
 - You need **explanations** for scoring decisions
 - You're evaluating **production agent behavior**
-## Code-Based Tool Call Accuracy Scorer
+## Code-based tool call accuracy scorer
 The `createToolCallAccuracyScorerCode()` function from `@mastra/evals/scorers/prebuilt` provides deterministic binary scoring based on exact tool matching and supports both strict and lenient evaluation modes, as well as tool calling order validation.
 ### Parameters
-**expectedTool:** (`string`): The name of the tool that should be called for the given task. Ignored when expectedToolOrder is provided.
+**expectedTool** (`string`): The name of the tool that should be called for the given task. Ignored when expectedToolOrder is provided.
-**strictMode:** (`boolean`): Controls evaluation strictness. For single tool mode: only exact single tool calls accepted. For order checking mode: tools must match exactly with no extra tools allowed.
+**strictMode** (`boolean`): Controls evaluation strictness. For single tool mode: only exact single tool calls accepted. For order checking mode: tools must match exactly with no extra tools allowed.
-**expectedToolOrder:** (`string[]`): Array of tool names in the expected calling order. When provided, enables order checking mode and ignores expectedTool parameter.
+**expectedToolOrder** (`string[]`): Array of tool names in the expected calling order. When provided, enables order checking mode and ignores expectedTool parameter.
 This function returns an instance of the MastraScorer class. See the [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer) for details on the `.run()` method and its input/output.
@@ -43,7 +43,7 @@ The code-based scorer operates in two distinct modes:
 #### Single Tool Mode
-When `expectedToolOrder` is not provided, the scorer evaluates single tool selection:
+When `expectedToolOrder` isn't provided, the scorer evaluates single tool selection:
 - **Standard Mode (strictMode: false)**: Returns `1` if the expected tool is called, regardless of other tools
 - **Strict Mode (strictMode: true)**: Returns `1` only if exactly one tool is called and it matches the expected tool
@@ -55,7 +55,7 @@ When `expectedToolOrder` is provided, the scorer validates tool calling sequence
 - **Strict Order (strictMode: true)**: Tools must be called in exactly the specified order with no extra tools
 - **Flexible Order (strictMode: false)**: Expected tools must appear in correct relative order (extra tools allowed)
-## Code-Based Scoring Details
+## Code-based scoring details
 - **Binary scores**: Always returns 0 or 1
 - **Deterministic**: Same input always produces same output
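
Editorial sketch (not part of the diffed file): the single tool and order checking rules described in the hunks above can be illustrated with a small standalone function. The names `scoreToolCalls` and `ToolAccuracyOptions` are hypothetical, and this is an assumption about how such binary scoring could work, not the `@mastra/evals` implementation.

```typescript
// Hypothetical illustration of the documented rules; not the library's code.
interface ToolAccuracyOptions {
  expectedTool?: string
  strictMode?: boolean
  expectedToolOrder?: string[]
}

// actualTools: names of the tools the agent called, in call order.
function scoreToolCalls(actualTools: string[], options: ToolAccuracyOptions): 0 | 1 {
  const { expectedTool, strictMode = false, expectedToolOrder } = options

  // Order checking mode: expectedTool is ignored when an order is given.
  if (expectedToolOrder && expectedToolOrder.length > 0) {
    if (strictMode) {
      // Strict order: the same tools in the same positions, no extras.
      const sameLength = actualTools.length === expectedToolOrder.length
      const samePositions = expectedToolOrder.every((name, i) => actualTools[i] === name)
      return sameLength && samePositions ? 1 : 0
    }
    // Flexible order: expected tools must appear as a subsequence of the calls.
    let next = 0
    for (const name of actualTools) {
      if (next < expectedToolOrder.length && name === expectedToolOrder[next]) next++
    }
    return next === expectedToolOrder.length ? 1 : 0
  }

  // Single tool mode.
  if (expectedTool === undefined) return 0
  if (strictMode) {
    // Strict: exactly one call, and it is the expected tool.
    return actualTools.length === 1 && actualTools[0] === expectedTool ? 1 : 0
  }
  // Standard: the expected tool appears anywhere among the calls.
  return actualTools.some((name) => name === expectedTool) ? 1 : 0
}
```

Under these assumptions, a call list of `['auth-tool', 'log-tool', 'fetch-tool']` passes a flexible order check for `['auth-tool', 'fetch-tool']` but fails the strict one, which matches the behavior the documentation describes.
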
@@ -66,22 +66,22 @@ When `expectedToolOrder` is provided, the scorer validates tool calling sequence
 ```typescript
 // Standard mode - passes if expected tool is called
 const lenientScorer = createCodeScorer({
-  expectedTool: "search-tool",
+  expectedTool: 'search-tool',
   strictMode: false,
-});
+})
 // Strict mode - only passes if exactly one tool is called
 const strictScorer = createCodeScorer({
-  expectedTool: "search-tool",
+  expectedTool: 'search-tool',
   strictMode: true,
-});
+})
 // Order checking with strict mode
 const strictOrderScorer = createCodeScorer({
-  expectedTool: "step1-tool",
-  expectedToolOrder: ["step1-tool", "step2-tool", "step3-tool"],
+  expectedTool: 'step1-tool',
+  expectedToolOrder: ['step1-tool', 'step2-tool', 'step3-tool'],
   strictMode: true, // no extra tools allowed
-});
+})
 ```
 ### Code-Based Scorer Results
@@ -103,7 +103,7 @@ const strictOrderScorer = createCodeScorer({
 }
 ```
-## Code-Based Scorer Examples
+## Code-based scorer examples
 The code-based scorer provides deterministic, binary scoring (0 or 1) based on exact tool matching.
@@ -111,40 +111,40 @@ The code-based scorer provides deterministic, binary scoring (0 or 1) based on e
 ```typescript
 const scorer = createToolCallAccuracyScorerCode({
-  expectedTool: "weather-tool",
-});
+  expectedTool: 'weather-tool',
+})
 // Simulate LLM input and output with tool call
 const inputMessages = [
   createTestMessage({
-    content: "What is the weather like in New York today?",
-    role: "user",
-    id: "input-1",
+    content: 'What is the weather like in New York today?',
+    role: 'user',
+    id: 'input-1',
   }),
-];
+]
 const output = [
   createTestMessage({
-    content: "Let me check the weather for you.",
-    role: "assistant",
-    id: "output-1",
+    content: 'Let me check the weather for you.',
+    role: 'assistant',
+    id: 'output-1',
     toolInvocations: [
       createToolInvocation({
-        toolCallId: "call-123",
-        toolName: "weather-tool",
-        args: { location: "New York" },
-        result: { temperature: "72°F", condition: "sunny" },
-        state: "result",
+        toolCallId: 'call-123',
+        toolName: 'weather-tool',
+        args: { location: 'New York' },
+        result: { temperature: '72°F', condition: 'sunny' },
+        state: 'result',
       }),
     ],
   }),
-];
+]
-const run = createAgentTestRun({ inputMessages, output });
-const result = await scorer.run(run);
+const run = createAgentTestRun({ inputMessages, output })
+const result = await scorer.run(run)
-console.log(result.score); // 1
-console.log(result.preprocessStepResult?.correctToolCalled); // true
+console.log(result.score) // 1
+console.log(result.preprocessStepResult?.correctToolCalled) // true
 ```
 ### Strict mode evaluation
@@ -153,37 +153,37 @@ Only passes if exactly one tool is called:
 ```typescript
 const strictScorer = createToolCallAccuracyScorerCode({
-  expectedTool: "weather-tool",
+  expectedTool: 'weather-tool',
   strictMode: true,
-});
+})
 // Multiple tools called - fails in strict mode
 const output = [
   createTestMessage({
-    content: "Let me help you with that.",
-    role: "assistant",
-    id: "output-1",
+    content: 'Let me help you with that.',
+    role: 'assistant',
+    id: 'output-1',
     toolInvocations: [
       createToolInvocation({
-        toolCallId: "call-1",
-        toolName: "search-tool",
+        toolCallId: 'call-1',
+        toolName: 'search-tool',
         args: {},
         result: {},
-        state: "result",
+        state: 'result',
       }),
       createToolInvocation({
-        toolCallId: "call-2",
-        toolName: "weather-tool",
-        args: { location: "New York" },
-        result: { temperature: "20°C" },
-        state: "result",
+        toolCallId: 'call-2',
+        toolName: 'weather-tool',
+        args: { location: 'New York' },
+        result: { temperature: '20°C' },
+        state: 'result',
       }),
     ],
   }),
-];
+]
-const result = await strictScorer.run(run);
-console.log(result.score); // 0 - fails because multiple tools were called
+const result = await strictScorer.run(run)
+console.log(result.score) // 0 - fails because multiple tools were called
 ```
 ### Tool order validation
@@ -192,37 +192,37 @@ Validates that tools are called in a specific sequence:
 ```typescript
 const orderScorer = createToolCallAccuracyScorerCode({
-  expectedTool: "auth-tool", // ignored when order is specified
-  expectedToolOrder: ["auth-tool", "fetch-tool"],
+  expectedTool: 'auth-tool', // ignored when order is specified
+  expectedToolOrder: ['auth-tool', 'fetch-tool'],
   strictMode: true, // no extra tools allowed
-});
+})
 const output = [
   createTestMessage({
-    content: "I will authenticate and fetch the data.",
-    role: "assistant",
-    id: "output-1",
+    content: 'I will authenticate and fetch the data.',
+    role: 'assistant',
+    id: 'output-1',
     toolInvocations: [
       createToolInvocation({
-        toolCallId: "call-1",
-        toolName: "auth-tool",
-        args: { token: "abc123" },
+        toolCallId: 'call-1',
+        toolName: 'auth-tool',
+        args: { token: 'abc123' },
         result: { authenticated: true },
-        state: "result",
+        state: 'result',
       }),
       createToolInvocation({
-        toolCallId: "call-2",
-        toolName: "fetch-tool",
-        args: { endpoint: "/data" },
-        result: { data: ["item1"] },
-        state: "result",
+        toolCallId: 'call-2',
+        toolName: 'fetch-tool',
+        args: { endpoint: '/data' },
+        result: { data: ['item1'] },
+        state: 'result',
       }),
     ],
   }),
-];
+]
-const result = await orderScorer.run(run);
-console.log(result.score); // 1 - correct order
+const result = await orderScorer.run(run)
+console.log(result.score) // 1 - correct order
 ```
 ### Flexible order mode
@@ -231,55 +231,55 @@ Allows extra tools as long as expected tools maintain relative order:
 ```typescript
 const flexibleOrderScorer = createToolCallAccuracyScorerCode({
-  expectedTool: "auth-tool",
-  expectedToolOrder: ["auth-tool", "fetch-tool"],
+  expectedTool: 'auth-tool',
+  expectedToolOrder: ['auth-tool', 'fetch-tool'],
   strictMode: false, // allows extra tools
-});
+})
 const output = [
   createTestMessage({
-    content: "Performing comprehensive operation.",
-    role: "assistant",
-    id: "output-1",
+    content: 'Performing comprehensive operation.',
+    role: 'assistant',
+    id: 'output-1',
     toolInvocations: [
       createToolInvocation({
-        toolCallId: "call-1",
-        toolName: "auth-tool",
-        args: { token: "abc123" },
+        toolCallId: 'call-1',
+        toolName: 'auth-tool',
+        args: { token: 'abc123' },
         result: { authenticated: true },
-        state: "result",
+        state: 'result',
       }),
       createToolInvocation({
-        toolCallId: "call-2",
-        toolName: "log-tool", // Extra tool - OK in flexible mode
-        args: { message: "Starting fetch" },
+        toolCallId: 'call-2',
+        toolName: 'log-tool', // Extra tool - OK in flexible mode
+        args: { message: 'Starting fetch' },
         result: { logged: true },
-        state: "result",
+        state: 'result',
       }),
       createToolInvocation({
-        toolCallId: "call-3",
-        toolName: "fetch-tool",
-        args: { endpoint: "/data" },
-        result: { data: ["item1"] },
-        state: "result",
+        toolCallId: 'call-3',
+        toolName: 'fetch-tool',
+        args: { endpoint: '/data' },
+        result: { data: ['item1'] },
+        state: 'result',
       }),
     ],
   }),
-];
+]
-const result = await flexibleOrderScorer.run(run);
-console.log(result.score); // 1 - auth-tool comes before fetch-tool
+const result = await flexibleOrderScorer.run(run)
+console.log(result.score) // 1 - auth-tool comes before fetch-tool
 ```
-## LLM-Based Tool Call Accuracy Scorer
+## LLM-based tool call accuracy scorer
 The `createToolCallAccuracyScorerLLM()` function from `@mastra/evals/scorers/prebuilt` uses an LLM to evaluate whether the tools called by an agent are appropriate for the given user request, providing semantic evaluation rather than exact matching.
 ### Parameters
-**model:** (`MastraModelConfig`): The LLM model to use for evaluating tool appropriateness
+**model** (`MastraModelConfig`): The LLM model to use for evaluating tool appropriateness
-**availableTools:** (`Array<{name: string, description: string}>`): List of available tools with their descriptions for context
+**availableTools** (`Array<{name: string, description: string}>`): List of available tools with their descriptions for context
 ### Features
@@ -298,7 +298,7 @@ The LLM-based scorer provides:
 3. **Generate Score**: Calculates score based on appropriate vs total tool calls
 4. **Generate Reasoning**: Provides human-readable explanation
-## LLM-Based Scoring Details
+## LLM-based scoring details
 - **Fractional scores**: Returns values between 0.0 and 1.0
 - **Context-aware**: Considers user intent and appropriateness
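
Editorial sketch (not part of the diffed file): the "appropriate vs total tool calls" step above suggests a simple ratio. The snippet below is a minimal illustration under that assumption. `ToolCallVerdict` and `appropriatenessScore` are hypothetical names, and the real scorer may weight or post-process the LLM's verdicts differently.

```typescript
// Hypothetical shape for one per-call judgment produced by the LLM.
interface ToolCallVerdict {
  toolName: string
  appropriate: boolean
}

// Fraction of tool calls judged appropriate, in the range 0.0 to 1.0.
// Returns null when there are no tool calls: the excerpt does not say how
// that case is scored, so this sketch deliberately leaves it undecided.
function appropriatenessScore(verdicts: ToolCallVerdict[]): number | null {
  if (verdicts.length === 0) return null
  const appropriateCount = verdicts.filter((v) => v.appropriate).length
  return appropriateCount / verdicts.length
}
```

With one appropriate and one inappropriate call, this ratio is 0.5, consistent with the fractional scores the documentation mentions.
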
@@ -309,7 +309,7 @@ The LLM-based scorer provides:
 ```typescript
 // Basic configuration
 const basicLLMScorer = createLLMScorer({
-  model: 'openai/gpt-5.1',
+  model: 'openai/gpt-5.4',
   availableTools: [
     { name: 'tool1', description: 'Description 1' },
     { name: 'tool2', description: 'Description 2' }
@@ -341,7 +341,7 @@ const customModelScorer = createLLMScorer({
 }
 ```
-## LLM-Based Scorer Examples
+## LLM-based scorer examples
 The LLM-based scorer uses AI to evaluate whether tool selections are appropriate for the user's request.
@@ -349,53 +349,53 @@ The LLM-based scorer uses AI to evaluate whether tool selections are appropriate
 ```typescript
 const llmScorer = createToolCallAccuracyScorerLLM({
-  model: "openai/gpt-5.1",
+  model: 'openai/gpt-5.4',
   availableTools: [
     {
-      name: "weather-tool",
-      description: "Get current weather information for any location",
+      name: 'weather-tool',
+      description: 'Get current weather information for any location',
     },
     {
-      name: "calendar-tool",
-      description: "Check calendar events and scheduling",
+      name: 'calendar-tool',
+      description: 'Check calendar events and scheduling',
     },
     {
-      name: "search-tool",
-      description: "Search the web for general information",
+      name: 'search-tool',
+      description: 'Search the web for general information',
     },
   ],
-});
+})
 const inputMessages = [
   createTestMessage({
-    content: "What is the weather like in San Francisco today?",
-    role: "user",
-    id: "input-1",
+    content: 'What is the weather like in San Francisco today?',
+    role: 'user',
+    id: 'input-1',
   }),
-];
+]
 const output = [
   createTestMessage({
-    content: "Let me check the current weather for you.",
-    role: "assistant",
-    id: "output-1",
+    content: 'Let me check the current weather for you.',
+    role: 'assistant',
+    id: 'output-1',
     toolInvocations: [
       createToolInvocation({
-        toolCallId: "call-123",
-        toolName: "weather-tool",
-        args: { location: "San Francisco", date: "today" },
-        result: { temperature: "68°F", condition: "foggy" },
-        state: "result",
+        toolCallId: 'call-123',
+        toolName: 'weather-tool',
+        args: { location: 'San Francisco', date: 'today' },
+        result: { temperature: '68°F', condition: 'foggy' },
+        state: 'result',
       }),
     ],
   }),
-];
+]
-const run = createAgentTestRun({ inputMessages, output });
-const result = await llmScorer.run(run);
+const run = createAgentTestRun({ inputMessages, output })
+const result = await llmScorer.run(run)
-console.log(result.score); // 1.0 - appropriate tool usage
-console.log(result.reason); // "The agent correctly used the weather-tool to address the user's request for weather information."
+console.log(result.score) // 1.0 - appropriate tool usage
+console.log(result.reason) // "The agent correctly used the weather-tool to address the user's request for weather information."
 ```
 ### Handling inappropriate tool usage
@@ -403,34 +403,34 @@ console.log(result.reason); // "The agent correctly used the weather-tool to add
 ```typescript
 const inputMessages = [
   createTestMessage({
-    content: "What is the weather in Tokyo?",
-    role: "user",
-    id: "input-1",
+    content: 'What is the weather in Tokyo?',
+    role: 'user',
+    id: 'input-1',
   }),
-];
+]
 const inappropriateOutput = [
   createTestMessage({
-    content: "Let me search for that information.",
-    role: "assistant",
-    id: "output-1",
+    content: 'Let me search for that information.',
+    role: 'assistant',
+    id: 'output-1',
     toolInvocations: [
       createToolInvocation({
-        toolCallId: "call-456",
-        toolName: "search-tool", // Less appropriate than weather-tool
-        args: { query: "Tokyo weather" },
-        result: { results: ["Tokyo weather data..."] },
-        state: "result",
+        toolCallId: 'call-456',
+        toolName: 'search-tool', // Less appropriate than weather-tool
+        args: { query: 'Tokyo weather' },
+        result: { results: ['Tokyo weather data...'] },
+        state: 'result',
       }),
     ],
   }),
-];
+]
-const run = createAgentTestRun({ inputMessages, output: inappropriateOutput });
-const result = await llmScorer.run(run);
+const run = createAgentTestRun({ inputMessages, output: inappropriateOutput })
+const result = await llmScorer.run(run)
-console.log(result.score); // 0.5 - partially appropriate
-console.log(result.reason); // "The agent used search-tool when weather-tool would have been more appropriate for a direct weather query."
+console.log(result.score) // 0.5 - partially appropriate
+console.log(result.reason) // "The agent used search-tool when weather-tool would have been more appropriate for a direct weather query."
 ```
 ### Evaluating clarification requests
@@ -465,64 +465,64 @@ console.log(result.score); // 1.0 - appropriate to ask for clarification
 console.log(result.reason); // "The agent appropriately asked for clarification rather than calling tools with insufficient information."
 ```
-## Comparing Both Scorers
+## Comparing both scorers
 Here's an example using both scorers on the same data:
 ```typescript
 import {
   createToolCallAccuracyScorerCode as createCodeScorer,
-  createToolCallAccuracyScorerLLM as createLLMScorer
-} from "@mastra/evals/scorers/prebuilt";
+  createToolCallAccuracyScorerLLM as createLLMScorer,
+} from '@mastra/evals/scorers/prebuilt'
 // Setup both scorers
 const codeScorer = createCodeScorer({
-  expectedTool: "weather-tool",
+  expectedTool: 'weather-tool',
   strictMode: false,
-});
+})
 const llmScorer = createLLMScorer({
-  model: "openai/gpt-5.1",
+  model: 'openai/gpt-5.4',
   availableTools: [
-    { name: "weather-tool", description: "Get weather information" },
-    { name: "search-tool", description: "Search the web" },
+    { name: 'weather-tool', description: 'Get weather information' },
+    { name: 'search-tool', description: 'Search the web' },
   ],
-});
+})
 // Test data
 const run = createAgentTestRun({
   inputMessages: [
     createTestMessage({
-      content: "What is the weather?",
-      role: "user",
-      id: "input-1",
+      content: 'What is the weather?',
+      role: 'user',
+      id: 'input-1',
     }),
   ],
   output: [
     createTestMessage({
-      content: "Let me find that information.",
-      role: "assistant",
-      id: "output-1",
+      content: 'Let me find that information.',
+      role: 'assistant',
+      id: 'output-1',
       toolInvocations: [
         createToolInvocation({
-          toolCallId: "call-1",
-          toolName: "search-tool",
-          args: { query: "weather" },
-          result: { results: ["weather data"] },
-          state: "result",
+          toolCallId: 'call-1',
+          toolName: 'search-tool',
+          args: { query: 'weather' },
+          result: { results: ['weather data'] },
+          state: 'result',
         }),
       ],
     }),
   ],
-});
+})
 // Run both scorers
-const codeResult = await codeScorer.run(run);
-const llmResult = await llmScorer.run(run);
+const codeResult = await codeScorer.run(run)
+const llmResult = await llmScorer.run(run)
-console.log("Code Scorer:", codeResult.score); // 0 - wrong tool
-console.log("LLM Scorer:", llmResult.score); // 0.3 - partially appropriate
-console.log("LLM Reason:", llmResult.reason); // Explains why search-tool is less appropriate
+console.log('Code Scorer:', codeResult.score) // 0 - wrong tool
+console.log('LLM Scorer:', llmResult.score) // 0.3 - partially appropriate
+console.log('LLM Reason:', llmResult.reason) // Explains why search-tool is less appropriate
 ```
 ## Related