@mastra/mcp-docs-server 0.13.30-alpha.0 → 0.13.30-alpha.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.docs/organized/changelogs/%40mastra%2Fagent-builder.md +9 -9
- package/.docs/organized/changelogs/%40mastra%2Fai-sdk.md +15 -0
- package/.docs/organized/changelogs/%40mastra%2Fclient-js.md +15 -15
- package/.docs/organized/changelogs/%40mastra%2Fcore.md +35 -35
- package/.docs/organized/changelogs/%40mastra%2Fdeployer-cloud.md +17 -17
- package/.docs/organized/changelogs/%40mastra%2Fdeployer.md +24 -24
- package/.docs/organized/changelogs/%40mastra%2Fmcp-docs-server.md +15 -15
- package/.docs/organized/changelogs/%40mastra%2Fmemory.md +16 -16
- package/.docs/organized/changelogs/%40mastra%2Fpg.md +16 -16
- package/.docs/organized/changelogs/%40mastra%2Fplayground-ui.md +31 -31
- package/.docs/organized/changelogs/%40mastra%2Freact.md +20 -0
- package/.docs/organized/changelogs/%40mastra%2Fserver.md +15 -15
- package/.docs/organized/changelogs/create-mastra.md +19 -19
- package/.docs/organized/changelogs/mastra.md +27 -27
- package/.docs/organized/code-examples/agent.md +0 -1
- package/.docs/organized/code-examples/agui.md +2 -2
- package/.docs/organized/code-examples/client-side-tools.md +2 -2
- package/.docs/raw/agents/adding-voice.mdx +118 -25
- package/.docs/raw/agents/agent-memory.mdx +73 -89
- package/.docs/raw/agents/guardrails.mdx +1 -1
- package/.docs/raw/agents/overview.mdx +39 -7
- package/.docs/raw/agents/using-tools.mdx +95 -0
- package/.docs/raw/deployment/overview.mdx +9 -11
- package/.docs/raw/frameworks/agentic-uis/ai-sdk.mdx +1 -1
- package/.docs/raw/frameworks/servers/express.mdx +2 -2
- package/.docs/raw/getting-started/installation.mdx +34 -85
- package/.docs/raw/getting-started/mcp-docs-server.mdx +13 -1
- package/.docs/raw/index.mdx +49 -14
- package/.docs/raw/observability/ai-tracing/exporters/otel.mdx +3 -0
- package/.docs/raw/reference/observability/ai-tracing/exporters/otel.mdx +6 -0
- package/.docs/raw/reference/scorers/answer-relevancy.mdx +105 -7
- package/.docs/raw/reference/scorers/answer-similarity.mdx +266 -16
- package/.docs/raw/reference/scorers/bias.mdx +107 -6
- package/.docs/raw/reference/scorers/completeness.mdx +131 -8
- package/.docs/raw/reference/scorers/content-similarity.mdx +107 -8
- package/.docs/raw/reference/scorers/context-precision.mdx +234 -18
- package/.docs/raw/reference/scorers/context-relevance.mdx +418 -35
- package/.docs/raw/reference/scorers/faithfulness.mdx +122 -8
- package/.docs/raw/reference/scorers/hallucination.mdx +125 -8
- package/.docs/raw/reference/scorers/keyword-coverage.mdx +141 -9
- package/.docs/raw/reference/scorers/noise-sensitivity.mdx +478 -6
- package/.docs/raw/reference/scorers/prompt-alignment.mdx +351 -102
- package/.docs/raw/reference/scorers/textual-difference.mdx +134 -6
- package/.docs/raw/reference/scorers/tone-consistency.mdx +133 -0
- package/.docs/raw/reference/scorers/tool-call-accuracy.mdx +422 -65
- package/.docs/raw/reference/scorers/toxicity.mdx +125 -7
- package/.docs/raw/reference/workflows/workflow.mdx +33 -0
- package/.docs/raw/scorers/custom-scorers.mdx +244 -3
- package/.docs/raw/scorers/overview.mdx +8 -38
- package/.docs/raw/server-db/middleware.mdx +5 -2
- package/.docs/raw/server-db/runtime-context.mdx +178 -0
- package/.docs/raw/streaming/workflow-streaming.mdx +5 -1
- package/.docs/raw/tools-mcp/overview.mdx +25 -7
- package/.docs/raw/workflows/overview.mdx +28 -1
- package/CHANGELOG.md +14 -0
- package/package.json +4 -4
- package/.docs/raw/agents/runtime-context.mdx +0 -106
- package/.docs/raw/agents/using-tools-and-mcp.mdx +0 -241
- package/.docs/raw/getting-started/model-providers.mdx +0 -63
- package/.docs/raw/tools-mcp/runtime-context.mdx +0 -63
- /package/.docs/raw/{evals → scorers/evals-old-api}/custom-eval.mdx +0 -0
- /package/.docs/raw/{evals → scorers/evals-old-api}/overview.mdx +0 -0
- /package/.docs/raw/{evals → scorers/evals-old-api}/running-in-ci.mdx +0 -0
- /package/.docs/raw/{evals → scorers/evals-old-api}/textual-evals.mdx +0 -0
- /package/.docs/raw/{server-db → workflows}/snapshots.mdx +0 -0
package/.docs/raw/reference/scorers/tone-consistency.mdx

@@ -37,6 +37,22 @@ This function returns an instance of the MastraScorer class. See the [MastraScor
 ]}
 />
 
+`.run()` returns a result in the following shape:
+
+```typescript
+{
+  runId: string,
+  analyzeStepResult: {
+    responseSentiment?: number,
+    referenceSentiment?: number,
+    difference?: number,
+    avgSentiment?: number,
+    sentimentVariance?: number,
+  },
+  score: number
+}
+```
+
 ## Scoring Details
 
 The scorer evaluates sentiment consistency through tone pattern analysis and mode-specific scoring.
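A note on the shape added above: every field of `analyzeStepResult` is optional because which metrics are present depends on the mode the scorer ran in. A minimal sketch of narrowing on them (the `ToneResult` alias is hypothetical, mirroring the documented shape; it is not exported by the package):

```typescript
// Hypothetical alias mirroring the documented `.run()` result shape.
type ToneResult = {
  runId: string;
  analyzeStepResult: {
    responseSentiment?: number;
    referenceSentiment?: number;
    difference?: number;
    avgSentiment?: number;
    sentimentVariance?: number;
  };
  score: number;
};

// Comparison mode populates responseSentiment/referenceSentiment/difference;
// stability mode populates avgSentiment/sentimentVariance.
function describeToneResult(result: ToneResult): string {
  const metrics = result.analyzeStepResult;
  if (metrics.difference !== undefined) {
    return `comparison mode, sentiment difference ${metrics.difference.toFixed(3)}`;
  }
  if (metrics.sentimentVariance !== undefined) {
    return `stability mode, sentiment variance ${metrics.sentimentVariance.toFixed(3)}`;
  }
  return 'no tone metrics available';
}
```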
@@ -69,6 +85,123 @@ Final score: `mode_specific_score * scale`
 - 0.1-0.3: Poor consistency with major tone changes
 - 0.0: No consistency - completely different tones
 
+### analyzeStepResult
+Object with tone metrics:
+- **responseSentiment**: Sentiment score for the response (comparison mode).
+- **referenceSentiment**: Sentiment score for the input/reference (comparison mode).
+- **difference**: Absolute difference between sentiment scores (comparison mode).
+- **avgSentiment**: Average sentiment across sentences (stability mode).
+- **sentimentVariance**: Variance of sentiment across sentences (stability mode).
+
+## Examples
+
+### Positive tone example
+
+In this example, the texts exhibit a similar positive sentiment. The scorer measures the consistency between the tones, resulting in a high score.
+
+```typescript filename="src/example-positive-tone.ts" showLineNumbers copy
+import { createToneScorer } from "@mastra/evals/scorers/code";
+
+const scorer = createToneScorer();
+
+const input = 'This product is fantastic and amazing!';
+const output = 'The product is excellent and wonderful!';
+
+const result = await scorer.run({
+  input: [{ role: 'user', content: input }],
+  output: { role: 'assistant', text: output },
+});
+
+console.log('Score:', result.score);
+console.log('AnalyzeStepResult:', result.analyzeStepResult);
+```
+
+#### Positive tone output
+
+The scorer returns a high score reflecting strong sentiment alignment. The `analyzeStepResult` field provides sentiment values and the difference between them.
+
+```typescript
+{
+  score: 0.8333333333333335,
+  analyzeStepResult: {
+    responseSentiment: 1.3333333333333333,
+    referenceSentiment: 1.1666666666666667,
+    difference: 0.16666666666666652,
+  },
+}
+```
+
+### Stable tone example
+
+In this example, the text’s internal tone consistency is analyzed by passing an empty response. This signals the scorer to evaluate sentiment stability within the single input text, resulting in a score reflecting how uniform the tone is throughout.
+
+```typescript filename="src/example-stable-tone.ts" showLineNumbers copy
+import { createToneScorer } from "@mastra/evals/scorers/code";
+
+const scorer = createToneScorer();
+
+const input = 'Great service! Friendly staff. Perfect atmosphere.';
+const output = '';
+
+const result = await scorer.run({
+  input: [{ role: 'user', content: input }],
+  output: { role: 'assistant', text: output },
+});
+
+console.log('Score:', result.score);
+console.log('AnalyzeStepResult:', result.analyzeStepResult);
+```
+
+#### Stable tone output
+
+The scorer returns a high score indicating consistent sentiment throughout the input text. The `analyzeStepResult` field includes the average sentiment and sentiment variance, reflecting tone stability.
+
+```typescript
+{
+  score: 0.9444444444444444,
+  analyzeStepResult: {
+    avgSentiment: 1.3333333333333333,
+    sentimentVariance: 0.05555555555555556,
+  },
+}
+```
+
+### Mixed tone example
+
+In this example, the input and response have different emotional tones. The scorer picks up on these variations and gives a lower consistency score.
+
+```typescript filename="src/example-mixed-tone.ts" showLineNumbers copy
+import { createToneScorer } from "@mastra/evals/scorers/code";
+
+const scorer = createToneScorer();
+
+const input = 'The interface is frustrating and confusing, though it has potential.';
+const output = 'The design shows promise but needs significant improvements to be usable.';
+
+const result = await scorer.run({
+  input: [{ role: 'user', content: input }],
+  output: { role: 'assistant', text: output },
+});
+
+console.log('Score:', result.score);
+console.log('AnalyzeStepResult:', result.analyzeStepResult);
+```
+
+#### Mixed tone output
+
+The scorer returns a low score due to the noticeable differences in emotional tone. The `analyzeStepResult` field highlights the sentiment values and the degree of variation between them.
+
+```typescript
+{
+  score: 0.4181818181818182,
+  analyzeStepResult: {
+    responseSentiment: -0.4,
+    referenceSentiment: 0.18181818181818182,
+    difference: 0.5818181818181818,
+  },
+}
+```
+
 ## Related
 
 - [Content Similarity Scorer](./content-similarity)
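The three sample outputs added in this hunk are internally consistent with the `mode_specific_score * scale` formula from the hunk header: with the default scale, each comparison-mode score equals `1 - difference`, and the stability-mode score equals `1 - sentimentVariance`. A quick sketch checking that arithmetic (inferred from the sample numbers, not taken from the library's source):

```typescript
// Inferred from the documented sample outputs, not the library's source:
// comparison mode appears to score 1 - difference, stability mode
// 1 - sentimentVariance, each multiplied by the scale (default 1).
const comparisonScore = (difference: number, scale = 1) => (1 - difference) * scale;
const stabilityScore = (sentimentVariance: number, scale = 1) => (1 - sentimentVariance) * scale;

console.log(comparisonScore(0.16666666666666652)); // ≈ 0.8333 (positive tone example)
console.log(stabilityScore(0.05555555555555556));  // ≈ 0.9444 (stable tone example)
console.log(comparisonScore(0.5818181818181818));  // ≈ 0.4182 (mixed tone example)
```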
package/.docs/raw/reference/scorers/tool-call-accuracy.mdx

@@ -10,7 +10,23 @@ Mastra provides two tool call accuracy scorers for evaluating whether an LLM sel
 1. **Code-based scorer** - Deterministic evaluation using exact tool matching
 2. **LLM-based scorer** - Semantic evaluation using AI to assess appropriateness
 
-
+## Choosing Between Scorers
+
+### Use the Code-Based Scorer When:
+
+- You need **deterministic, reproducible** results
+- You want to test **exact tool matching**
+- You need to validate **specific tool sequences**
+- Speed and cost are priorities (no LLM calls)
+- You're running automated tests
+
+### Use the LLM-Based Scorer When:
+
+- You need **semantic understanding** of appropriateness
+- Tool selection depends on **context and intent**
+- You want to handle **edge cases** like clarification requests
+- You need **explanations** for scoring decisions
+- You're evaluating **production agent behavior**
 
 ## Code-Based Tool Call Accuracy Scorer
 
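A sketch of the "automated tests" bullet above, wiring the code-based scorer into a vitest test. The vitest usage is illustrative; the scorer import path appears verbatim later in this diff, while the path for the `createAgentTestRun`/`createUIMessage`/`createToolInvocation` helpers used throughout these examples is not shown in this diff and is an assumption:

```typescript
import { describe, expect, it } from 'vitest';
import { createToolCallAccuracyScorerCode } from '@mastra/evals/scorers/code';
// Helper import path is an assumption; the diff uses these helpers without
// showing where they come from.
import { createAgentTestRun, createToolInvocation, createUIMessage } from '@mastra/evals/scorers/utils';

describe('weather agent tool selection', () => {
  it('calls weather-tool for a weather question', async () => {
    const scorer = createToolCallAccuracyScorerCode({
      expectedTool: 'weather-tool',
      strictMode: false,
    });

    // Captured agent output with its tool invocations.
    const run = createAgentTestRun({
      inputMessages: [
        createUIMessage({ content: 'Weather in Oslo?', role: 'user', id: 'input-1' }),
      ],
      output: [
        createUIMessage({
          content: 'Checking the weather.',
          role: 'assistant',
          id: 'output-1',
          toolInvocations: [
            createToolInvocation({
              toolCallId: 'call-1',
              toolName: 'weather-tool',
              args: { location: 'Oslo' },
              result: { temperature: '8°C' },
              state: 'result',
            }),
          ],
        }),
      ],
    });

    // Deterministic binary score, cheap enough to gate CI.
    expect((await scorer.run(run)).score).toBe(1);
  });
});
```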
@@ -62,28 +78,220 @@ When `expectedToolOrder` is provided, the scorer validates tool calling sequence
 - **Strict Order (strictMode: true)**: Tools must be called in exactly the specified order with no extra tools
 - **Flexible Order (strictMode: false)**: Expected tools must appear in correct relative order (extra tools allowed)
 
-
+## Code-Based Scoring Details
+
+- **Binary scores**: Always returns 0 or 1
+- **Deterministic**: Same input always produces same output
+- **Fast**: No external API calls
+
+### Code-Based Scorer Options
+
+```typescript showLineNumbers copy
+// Standard mode - passes if expected tool is called
+const lenientScorer = createCodeScorer({
+  expectedTool: 'search-tool',
+  strictMode: false
+});
+
+// Strict mode - only passes if exactly one tool is called
+const strictScorer = createCodeScorer({
+  expectedTool: 'search-tool',
+  strictMode: true
+});
+
+// Order checking with strict mode
+const strictOrderScorer = createCodeScorer({
+  expectedTool: 'step1-tool',
+  expectedToolOrder: ['step1-tool', 'step2-tool', 'step3-tool'],
+  strictMode: true // no extra tools allowed
+});
+```
+
+### Code-Based Scorer Results
 
 ```typescript
-
+{
+  runId: string,
+  preprocessStepResult: {
+    expectedTool: string,
+    actualTools: string[],
+    strictMode: boolean,
+    expectedToolOrder?: string[],
+    hasToolCalls: boolean,
+    correctToolCalled: boolean,
+    correctOrderCalled: boolean | null,
+    toolCallInfos: ToolCallInfo[]
+  },
+  score: number // Always 0 or 1
+}
+```
+
+## Code-Based Scorer Examples
+
+The code-based scorer provides deterministic, binary scoring (0 or 1) based on exact tool matching.
 
-
-
-
+### Correct tool selection
+
+```typescript filename="src/example-correct-tool.ts" showLineNumbers copy
+const scorer = createToolCallAccuracyScorerCode({
+  expectedTool: 'weather-tool'
 });
 
-//
-const
-
+// Simulate LLM input and output with tool call
+const inputMessages = [
+  createUIMessage({
+    content: 'What is the weather like in New York today?',
+    role: 'user',
+    id: 'input-1'
+  })
+];
+
+const output = [
+  createUIMessage({
+    content: 'Let me check the weather for you.',
+    role: 'assistant',
+    id: 'output-1',
+    toolInvocations: [
+      createToolInvocation({
+        toolCallId: 'call-123',
+        toolName: 'weather-tool',
+        args: { location: 'New York' },
+        result: { temperature: '72°F', condition: 'sunny' },
+        state: 'result'
+      })
+    ]
+  })
+];
+
+const run = createAgentTestRun({ inputMessages, output });
+const result = await scorer.run(run);
+
+console.log(result.score); // 1
+console.log(result.preprocessStepResult?.correctToolCalled); // true
+```
+
+### Strict mode evaluation
+
+Only passes if exactly one tool is called:
+
+```typescript filename="src/example-strict-mode.ts" showLineNumbers copy
+const strictScorer = createToolCallAccuracyScorerCode({
+  expectedTool: 'weather-tool',
   strictMode: true
 });
 
-//
+// Multiple tools called - fails in strict mode
+const output = [
+  createUIMessage({
+    content: 'Let me help you with that.',
+    role: 'assistant',
+    id: 'output-1',
+    toolInvocations: [
+      createToolInvocation({
+        toolCallId: 'call-1',
+        toolName: 'search-tool',
+        args: {},
+        result: {},
+        state: 'result',
+      }),
+      createToolInvocation({
+        toolCallId: 'call-2',
+        toolName: 'weather-tool',
+        args: { location: 'New York' },
+        result: { temperature: '20°C' },
+        state: 'result',
+      })
+    ]
+  })
+];
+
+const result = await strictScorer.run(run);
+console.log(result.score); // 0 - fails because multiple tools were called
+```
+
+### Tool order validation
+
+Validates that tools are called in a specific sequence:
+
+```typescript filename="src/example-order-validation.ts" showLineNumbers copy
 const orderScorer = createToolCallAccuracyScorerCode({
-  expectedTool: '
-  expectedToolOrder: ['
-  strictMode: true //
+  expectedTool: 'auth-tool', // ignored when order is specified
+  expectedToolOrder: ['auth-tool', 'fetch-tool'],
+  strictMode: true // no extra tools allowed
 });
+
+const output = [
+  createUIMessage({
+    content: 'I will authenticate and fetch the data.',
+    role: 'assistant',
+    id: 'output-1',
+    toolInvocations: [
+      createToolInvocation({
+        toolCallId: 'call-1',
+        toolName: 'auth-tool',
+        args: { token: 'abc123' },
+        result: { authenticated: true },
+        state: 'result'
+      }),
+      createToolInvocation({
+        toolCallId: 'call-2',
+        toolName: 'fetch-tool',
+        args: { endpoint: '/data' },
+        result: { data: ['item1'] },
+        state: 'result'
+      })
+    ]
+  })
+];
+
+const result = await orderScorer.run(run);
+console.log(result.score); // 1 - correct order
+```
+
+### Flexible order mode
+
+Allows extra tools as long as expected tools maintain relative order:
+
+```typescript filename="src/example-flexible-order.ts" showLineNumbers copy
+const flexibleOrderScorer = createToolCallAccuracyScorerCode({
+  expectedTool: 'auth-tool',
+  expectedToolOrder: ['auth-tool', 'fetch-tool'],
+  strictMode: false // allows extra tools
+});
+
+const output = [
+  createUIMessage({
+    content: 'Performing comprehensive operation.',
+    role: 'assistant',
+    id: 'output-1',
+    toolInvocations: [
+      createToolInvocation({
+        toolCallId: 'call-1',
+        toolName: 'auth-tool',
+        args: { token: 'abc123' },
+        result: { authenticated: true },
+        state: 'result'
+      }),
+      createToolInvocation({
+        toolCallId: 'call-2',
+        toolName: 'log-tool', // Extra tool - OK in flexible mode
+        args: { message: 'Starting fetch' },
+        result: { logged: true },
+        state: 'result'
+      }),
+      createToolInvocation({
+        toolCallId: 'call-3',
+        toolName: 'fetch-tool',
+        args: { endpoint: '/data' },
+        result: { data: ['item1'] },
+        state: 'result'
+      })
+    ]
+  })
+];
+
+const result = await flexibleOrderScorer.run(run);
+console.log(result.score); // 1 - auth-tool comes before fetch-tool
 ```
 
 ## LLM-Based Tool Call Accuracy Scorer
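The flexible-order mode added in this hunk is, in effect, a subsequence check: the expected tools must appear in the actual call sequence in the same relative order, with unrelated tools allowed in between. An illustrative standalone version of that logic (a sketch of the described behavior, not the library's source):

```typescript
// Subsequence check: does `expected` appear in `actual` in order,
// possibly with other tools interleaved? Mirrors the flexible-order
// behavior described above.
function inRelativeOrder(expected: string[], actual: string[]): boolean {
  let next = 0;
  for (const tool of actual) {
    if (tool === expected[next]) next++;
    if (next === expected.length) return true;
  }
  return expected.length === 0;
}

// Matches the flexible-order example: log-tool in between is fine.
console.log(inRelativeOrder(['auth-tool', 'fetch-tool'],
  ['auth-tool', 'log-tool', 'fetch-tool'])); // true

// Strict order would additionally require `actual` to equal `expected` exactly.
console.log(inRelativeOrder(['auth-tool', 'fetch-tool'],
  ['fetch-tool', 'auth-tool'])); // false
```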
@@ -126,82 +334,231 @@ The LLM-based scorer provides:
 3. **Generate Score**: Calculates score based on appropriate vs total tool calls
 4. **Generate Reasoning**: Provides human-readable explanation
 
-
+## LLM-Based Scoring Details
+
+- **Fractional scores**: Returns values between 0.0 and 1.0
+- **Context-aware**: Considers user intent and appropriateness
+- **Explanatory**: Provides reasoning for scores
+
+### LLM-Based Scorer Options
+
+```typescript showLineNumbers copy
+// Basic configuration
+const basicLLMScorer = createLLMScorer({
+  model: openai('gpt-4o-mini'),
+  availableTools: [
+    { name: 'tool1', description: 'Description 1' },
+    { name: 'tool2', description: 'Description 2' }
+  ]
+});
+
+// With different model
+const customModelScorer = createLLMScorer({
+  model: openai('gpt-4'), // More powerful model for complex evaluations
+  availableTools: [...]
+});
+```
+
+### LLM-Based Scorer Results
 
 ```typescript
-
-
+{
+  runId: string,
+  score: number, // 0.0 to 1.0
+  reason: string, // Human-readable explanation
+  analyzeStepResult: {
+    evaluations: Array<{
+      toolCalled: string,
+      wasAppropriate: boolean,
+      reasoning: string
+    }>,
+    missingTools?: string[]
+  }
+}
+```
+
+## LLM-Based Scorer Examples
+
+The LLM-based scorer uses AI to evaluate whether tool selections are appropriate for the user's request.
 
+### Basic LLM evaluation
+
+```typescript filename="src/example-llm-basic.ts" showLineNumbers copy
 const llmScorer = createToolCallAccuracyScorerLLM({
   model: openai('gpt-4o-mini'),
   availableTools: [
-{
-name: 'weather-tool',
-description: 'Get current weather information for any location'
+    {
+      name: 'weather-tool',
+      description: 'Get current weather information for any location'
     },
-{
-name: '
-description: '
+    {
+      name: 'calendar-tool',
+      description: 'Check calendar events and scheduling'
     },
-{
-name: '
-description: '
+    {
+      name: 'search-tool',
+      description: 'Search the web for general information'
     }
   ]
 });
 
-const
-
-
+const inputMessages = [
+  createUIMessage({
+    content: 'What is the weather like in San Francisco today?',
+    role: 'user',
+    id: 'input-1'
+  })
+];
+
+const output = [
+  createUIMessage({
+    content: 'Let me check the current weather for you.',
+    role: 'assistant',
+    id: 'output-1',
+    toolInvocations: [
+      createToolInvocation({
+        toolCallId: 'call-123',
+        toolName: 'weather-tool',
+        args: { location: 'San Francisco', date: 'today' },
+        result: { temperature: '68°F', condition: 'foggy' },
+        state: 'result'
+      })
+    ]
+  })
+];
+
+const run = createAgentTestRun({ inputMessages, output });
+const result = await llmScorer.run(run);
+
+console.log(result.score); // 1.0 - appropriate tool usage
+console.log(result.reason); // "The agent correctly used the weather-tool to address the user's request for weather information."
 ```
 
-
-
-
-
-
-
-
--
-
-
-
-
-
-
-
--
-
+### Handling inappropriate tool usage
+
+```typescript filename="src/example-llm-inappropriate.ts" showLineNumbers copy
+const inputMessages = [
+  createUIMessage({
+    content: 'What is the weather in Tokyo?',
+    role: 'user',
+    id: 'input-1'
+  })
+];
+
+const inappropriateOutput = [
+  createUIMessage({
+    content: 'Let me search for that information.',
+    role: 'assistant',
+    id: 'output-1',
+    toolInvocations: [
+      createToolInvocation({
+        toolCallId: 'call-456',
+        toolName: 'search-tool', // Less appropriate than weather-tool
+        args: { query: 'Tokyo weather' },
+        result: { results: ['Tokyo weather data...'] },
+        state: 'result'
+      })
+    ]
+  })
+];
+
+const run = createAgentTestRun({ inputMessages, output: inappropriateOutput });
+const result = await llmScorer.run(run);
+
+console.log(result.score); // 0.5 - partially appropriate
+console.log(result.reason); // "The agent used search-tool when weather-tool would have been more appropriate for a direct weather query."
+```
 
-
+### Evaluating clarification requests
+
+The LLM scorer recognizes when agents appropriately ask for clarification:
+
+```typescript filename="src/example-llm-clarification.ts" showLineNumbers copy
+const vagueInput = [
+  createUIMessage({
+    content: 'I need help with something',
+    role: 'user',
+    id: 'input-1'
+  })
+];
+
+const clarificationOutput = [
+  createUIMessage({
+    content: 'I\'d be happy to help! Could you please provide more details about what you need assistance with?',
+    role: 'assistant',
+    id: 'output-1',
+    // No tools called - asking for clarification instead
+  })
+];
+
+const run = createAgentTestRun({
+  inputMessages: vagueInput,
+  output: clarificationOutput
+});
+const result = await llmScorer.run(run);
 
-
+console.log(result.score); // 1.0 - appropriate to ask for clarification
+console.log(result.reason); // "The agent appropriately asked for clarification rather than calling tools with insufficient information."
+```
 
-
-- **Deterministic**: Same input always produces same output
-- **Fast**: No external API calls
+## Comparing Both Scorers
 
-
+Here's an example using both scorers on the same data:
 
--
-
-
+```typescript filename="src/example-comparison.ts" showLineNumbers copy
+import { createToolCallAccuracyScorerCode as createCodeScorer } from '@mastra/evals/scorers/code';
+import { createToolCallAccuracyScorerLLM as createLLMScorer } from '@mastra/evals/scorers/llm';
+import { openai } from '@ai-sdk/openai';
 
-
+// Setup both scorers
+const codeScorer = createCodeScorer({
+  expectedTool: 'weather-tool',
+  strictMode: false
+});
 
-
+const llmScorer = createLLMScorer({
+  model: openai('gpt-4o-mini'),
+  availableTools: [
+    { name: 'weather-tool', description: 'Get weather information' },
+    { name: 'search-tool', description: 'Search the web' }
+  ]
+});
 
-
-
-
-
+// Test data
+const run = createAgentTestRun({
+  inputMessages: [
+    createUIMessage({
+      content: 'What is the weather?',
+      role: 'user',
+      id: 'input-1'
+    })
+  ],
+  output: [
+    createUIMessage({
+      content: 'Let me find that information.',
+      role: 'assistant',
+      id: 'output-1',
+      toolInvocations: [
+        createToolInvocation({
+          toolCallId: 'call-1',
+          toolName: 'search-tool',
+          args: { query: 'weather' },
+          result: { results: ['weather data'] },
+          state: 'result'
+        })
+      ]
+    })
+  ]
+});
 
-
+// Run both scorers
+const codeResult = await codeScorer.run(run);
+const llmResult = await llmScorer.run(run);
 
-
-
-
-
+console.log('Code Scorer:', codeResult.score); // 0 - wrong tool
+console.log('LLM Scorer:', llmResult.score); // 0.3 - partially appropriate
+console.log('LLM Reason:', llmResult.reason); // Explains why search-tool is less appropriate
+```
 
 ## Related
 