@mastra/mcp-docs-server 0.13.17-alpha.3 → 0.13.17-alpha.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (112)
  1. package/.docs/organized/changelogs/%40mastra%2Fagent-builder.md +25 -0
  2. package/.docs/organized/changelogs/%40mastra%2Fai-sdk.md +19 -0
  3. package/.docs/organized/changelogs/%40mastra%2Fastra.md +10 -10
  4. package/.docs/organized/changelogs/%40mastra%2Fauth.md +6 -0
  5. package/.docs/organized/changelogs/%40mastra%2Fchroma.md +10 -10
  6. package/.docs/organized/changelogs/%40mastra%2Fclickhouse.md +10 -10
  7. package/.docs/organized/changelogs/%40mastra%2Fclient-js.md +26 -26
  8. package/.docs/organized/changelogs/%40mastra%2Fcloud.md +10 -10
  9. package/.docs/organized/changelogs/%40mastra%2Fcloudflare-d1.md +10 -10
  10. package/.docs/organized/changelogs/%40mastra%2Fcloudflare.md +11 -11
  11. package/.docs/organized/changelogs/%40mastra%2Fcore.md +46 -46
  12. package/.docs/organized/changelogs/%40mastra%2Fcouchbase.md +10 -10
  13. package/.docs/organized/changelogs/%40mastra%2Fdeployer-cloud.md +19 -0
  14. package/.docs/organized/changelogs/%40mastra%2Fdeployer-cloudflare.md +25 -25
  15. package/.docs/organized/changelogs/%40mastra%2Fdeployer-netlify.md +11 -11
  16. package/.docs/organized/changelogs/%40mastra%2Fdeployer-vercel.md +12 -12
  17. package/.docs/organized/changelogs/%40mastra%2Fdeployer.md +33 -33
  18. package/.docs/organized/changelogs/%40mastra%2Fdynamodb.md +11 -11
  19. package/.docs/organized/changelogs/%40mastra%2Fevals.md +19 -19
  20. package/.docs/organized/changelogs/%40mastra%2Ffastembed.md +6 -0
  21. package/.docs/organized/changelogs/%40mastra%2Ffirecrawl.md +20 -20
  22. package/.docs/organized/changelogs/%40mastra%2Fgithub.md +19 -19
  23. package/.docs/organized/changelogs/%40mastra%2Flance.md +10 -10
  24. package/.docs/organized/changelogs/%40mastra%2Flibsql.md +10 -10
  25. package/.docs/organized/changelogs/%40mastra%2Floggers.md +11 -11
  26. package/.docs/organized/changelogs/%40mastra%2Fmcp-docs-server.md +27 -27
  27. package/.docs/organized/changelogs/%40mastra%2Fmcp-registry-registry.md +20 -20
  28. package/.docs/organized/changelogs/%40mastra%2Fmcp.md +19 -19
  29. package/.docs/organized/changelogs/%40mastra%2Fmem0.md +10 -10
  30. package/.docs/organized/changelogs/%40mastra%2Fmemory.md +24 -24
  31. package/.docs/organized/changelogs/%40mastra%2Fmongodb.md +11 -11
  32. package/.docs/organized/changelogs/%40mastra%2Fmssql.md +10 -5
  33. package/.docs/organized/changelogs/%40mastra%2Fopensearch.md +10 -10
  34. package/.docs/organized/changelogs/%40mastra%2Fpg.md +19 -19
  35. package/.docs/organized/changelogs/%40mastra%2Fpinecone.md +10 -10
  36. package/.docs/organized/changelogs/%40mastra%2Fplayground-ui.md +34 -34
  37. package/.docs/organized/changelogs/%40mastra%2Fqdrant.md +13 -13
  38. package/.docs/organized/changelogs/%40mastra%2Frag.md +10 -10
  39. package/.docs/organized/changelogs/%40mastra%2Fragie.md +19 -19
  40. package/.docs/organized/changelogs/%40mastra%2Fschema-compat.md +13 -0
  41. package/.docs/organized/changelogs/%40mastra%2Fserver.md +25 -25
  42. package/.docs/organized/changelogs/%40mastra%2Fturbopuffer.md +11 -11
  43. package/.docs/organized/changelogs/%40mastra%2Fupstash.md +10 -10
  44. package/.docs/organized/changelogs/%40mastra%2Fvectorize.md +10 -10
  45. package/.docs/organized/changelogs/%40mastra%2Fvoice-azure.md +10 -10
  46. package/.docs/organized/changelogs/%40mastra%2Fvoice-cloudflare.md +10 -10
  47. package/.docs/organized/changelogs/%40mastra%2Fvoice-deepgram.md +10 -10
  48. package/.docs/organized/changelogs/%40mastra%2Fvoice-elevenlabs.md +10 -10
  49. package/.docs/organized/changelogs/%40mastra%2Fvoice-gladia.md +9 -0
  50. package/.docs/organized/changelogs/%40mastra%2Fvoice-google-gemini-live.md +19 -0
  51. package/.docs/organized/changelogs/%40mastra%2Fvoice-google.md +10 -10
  52. package/.docs/organized/changelogs/%40mastra%2Fvoice-murf.md +10 -10
  53. package/.docs/organized/changelogs/%40mastra%2Fvoice-openai-realtime.md +19 -19
  54. package/.docs/organized/changelogs/%40mastra%2Fvoice-openai.md +10 -10
  55. package/.docs/organized/changelogs/%40mastra%2Fvoice-playai.md +11 -11
  56. package/.docs/organized/changelogs/%40mastra%2Fvoice-sarvam.md +11 -11
  57. package/.docs/organized/changelogs/%40mastra%2Fvoice-speechify.md +10 -10
  58. package/.docs/organized/changelogs/create-mastra.md +13 -13
  59. package/.docs/organized/changelogs/mastra.md +31 -31
  60. package/.docs/organized/code-examples/a2a.md +1 -1
  61. package/.docs/organized/code-examples/agent-network.md +1 -1
  62. package/.docs/organized/code-examples/agent.md +22 -1
  63. package/.docs/organized/code-examples/agui.md +1 -1
  64. package/.docs/organized/code-examples/ai-sdk-useChat.md +1 -1
  65. package/.docs/organized/code-examples/ai-sdk-v5.md +2 -2
  66. package/.docs/organized/code-examples/assistant-ui.md +3 -3
  67. package/.docs/organized/code-examples/bird-checker-with-express.md +1 -1
  68. package/.docs/organized/code-examples/bird-checker-with-nextjs-and-eval.md +1 -1
  69. package/.docs/organized/code-examples/bird-checker-with-nextjs.md +1 -1
  70. package/.docs/organized/code-examples/client-side-tools.md +1 -1
  71. package/.docs/organized/code-examples/crypto-chatbot.md +1 -1
  72. package/.docs/organized/code-examples/experimental-auth-weather-agent.md +1 -1
  73. package/.docs/organized/code-examples/fireworks-r1.md +1 -1
  74. package/.docs/organized/code-examples/heads-up-game.md +32 -56
  75. package/.docs/organized/code-examples/mcp-configuration.md +2 -2
  76. package/.docs/organized/code-examples/mcp-registry-registry.md +1 -1
  77. package/.docs/organized/code-examples/memory-with-mem0.md +1 -1
  78. package/.docs/organized/code-examples/memory-with-processors.md +1 -1
  79. package/.docs/organized/code-examples/openapi-spec-writer.md +2 -2
  80. package/.docs/organized/code-examples/quick-start.md +1 -1
  81. package/.docs/organized/code-examples/stock-price-tool.md +1 -1
  82. package/.docs/organized/code-examples/weather-agent.md +1 -1
  83. package/.docs/organized/code-examples/workflow-ai-recruiter.md +1 -1
  84. package/.docs/organized/code-examples/workflow-with-inline-steps.md +1 -1
  85. package/.docs/organized/code-examples/workflow-with-memory.md +1 -1
  86. package/.docs/organized/code-examples/workflow-with-separate-steps.md +1 -1
  87. package/.docs/organized/code-examples/workflow-with-suspend-resume.md +1 -1
  88. package/.docs/raw/agents/overview.mdx +35 -4
  89. package/.docs/raw/deployment/monorepo.mdx +1 -1
  90. package/.docs/raw/frameworks/agentic-uis/ai-sdk.mdx +44 -14
  91. package/.docs/raw/getting-started/installation.mdx +52 -4
  92. package/.docs/raw/getting-started/templates.mdx +2 -22
  93. package/.docs/raw/reference/agents/generate.mdx +2 -2
  94. package/.docs/raw/reference/agents/getDefaultStreamOptions.mdx +2 -1
  95. package/.docs/raw/reference/agents/getDefaultVNextStreamOptions.mdx +1 -1
  96. package/.docs/raw/reference/agents/stream.mdx +2 -2
  97. package/.docs/raw/reference/cli/build.mdx +0 -6
  98. package/.docs/raw/reference/cli/start.mdx +8 -1
  99. package/.docs/raw/reference/scorers/noise-sensitivity.mdx +237 -0
  100. package/.docs/raw/reference/scorers/prompt-alignment.mdx +369 -0
  101. package/.docs/raw/scorers/off-the-shelf-scorers.mdx +2 -2
  102. package/.docs/raw/streaming/overview.mdx +2 -2
  103. package/.docs/raw/streaming/tool-streaming.mdx +8 -2
  104. package/.docs/raw/streaming/workflow-streaming.mdx +8 -2
  105. package/.docs/raw/tools-mcp/overview.mdx +44 -0
  106. package/.docs/raw/workflows/overview.mdx +19 -17
  107. package/.docs/raw/workflows/suspend-and-resume.mdx +64 -7
  108. package/CHANGELOG.md +1813 -0
  109. package/dist/stdio.js +18 -1
  110. package/dist/tools/blog.d.ts.map +1 -1
  111. package/dist/tools/docs.d.ts.map +1 -1
  112. package/package.json +17 -7
@@ -0,0 +1,237 @@
---
title: "Reference: Noise Sensitivity Scorer | Scorers | Mastra Docs"
description: Documentation for the Noise Sensitivity Scorer in Mastra. Evaluates how robust an agent is when exposed to irrelevant, distracting, or misleading information in user queries.
---

import { PropertiesTable } from "@/components/properties-table";

# Noise Sensitivity Scorer

The `createNoiseSensitivityScorerLLM()` function creates a scorer that evaluates how robust an agent is when exposed to irrelevant, distracting, or misleading information. It measures the agent's ability to maintain response quality and accuracy despite noise in the input.

## Parameters

<PropertiesTable
  content={[
    {
      name: "model",
      type: "MastraLanguageModel",
      description: "The language model to use for evaluating noise sensitivity",
      required: true,
    },
    {
      name: "options",
      type: "NoiseSensitivityOptions",
      description: "Configuration options for the scorer",
      required: true,
      children: [
        {
          name: "baselineResponse",
          type: "string",
          description: "The expected clean response to compare against (what the agent should ideally produce without noise)",
          required: true,
        },
        {
          name: "noisyQuery",
          type: "string",
          description: "The user query with added noise, distractions, or misleading information",
          required: true,
        },
        {
          name: "noiseType",
          type: "string",
          description: "Type of noise added (e.g., 'misinformation', 'distractors', 'adversarial')",
          required: false,
        },
        {
          name: "scoring",
          type: "object",
          description: "Advanced scoring configuration for fine-tuning evaluation",
          required: false,
          children: [
            {
              name: "impactWeights",
              type: "object",
              description: "Custom weights for different impact levels",
              required: false,
              children: [
                {
                  name: "none",
                  type: "number",
                  description: "Weight for no impact (default: 1.0)",
                  required: false,
                },
                {
                  name: "minimal",
                  type: "number",
                  description: "Weight for minimal impact (default: 0.85)",
                  required: false,
                },
                {
                  name: "moderate",
                  type: "number",
                  description: "Weight for moderate impact (default: 0.6)",
                  required: false,
                },
                {
                  name: "significant",
                  type: "number",
                  description: "Weight for significant impact (default: 0.3)",
                  required: false,
                },
                {
                  name: "severe",
                  type: "number",
                  description: "Weight for severe impact (default: 0.1)",
                  required: false,
                },
              ],
            },
            {
              name: "penalties",
              type: "object",
              description: "Penalty configuration for major issues",
              required: false,
              children: [
                {
                  name: "majorIssuePerItem",
                  type: "number",
                  description: "Penalty per major issue identified (default: 0.1)",
                  required: false,
                },
                {
                  name: "maxMajorIssuePenalty",
                  type: "number",
                  description: "Maximum total penalty for major issues (default: 0.3)",
                  required: false,
                },
              ],
            },
            {
              name: "discrepancyThreshold",
              type: "number",
              description: "Threshold for using conservative scoring when LLM and calculated scores diverge (default: 0.2)",
              required: false,
            },
          ],
        },
      ],
    },
  ]}
/>

## .run() Returns

<PropertiesTable
  content={[
    {
      name: "score",
      type: "number",
      description: "Robustness score between 0 and 1 (1.0 = completely robust, 0.0 = severely compromised)",
    },
    {
      name: "reason",
      type: "string",
      description: "Human-readable explanation of how noise affected the agent's response",
    },
  ]}
/>

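For orientation, here is a minimal usage sketch. It assumes the scorer is exported from `@mastra/evals` and accepts the same `.run()` input shape as the other LLM scorers in these docs; treat the exact call signature as illustrative rather than definitive.

```typescript
import { openai } from '@ai-sdk/openai';
import { createNoiseSensitivityScorerLLM } from '@mastra/evals';

// Compare the agent's answer to a noisy query against the expected clean answer
const scorer = createNoiseSensitivityScorerLLM({
  model: openai('gpt-4o'),
  options: {
    baselineResponse: 'Regular exercise improves cardiovascular health, mood, and sleep quality.',
    noisyQuery:
      'What are the benefits of exercise? Also, did you know the moon is made of cheese?',
    noiseType: 'distractors',
  },
});

const result = await scorer.run({
  input: [{
    role: 'user',
    content: 'What are the benefits of exercise? Also, did you know the moon is made of cheese?'
  }],
  output: {
    role: 'assistant',
    text: 'Regular exercise improves cardiovascular health, mood, and sleep quality.'
  }
});

console.log(result.score, result.reason);
```
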
## Evaluation Dimensions

The Noise Sensitivity scorer analyzes five key dimensions:

### 1. Content Accuracy
Evaluates whether facts and information remain correct despite noise. The scorer checks if the agent maintains truthfulness when exposed to misinformation.

### 2. Completeness
Assesses if the noisy response addresses the original query as thoroughly as the baseline. Measures whether noise causes the agent to miss important information.

### 3. Relevance
Determines if the agent stayed focused on the original question or got distracted by irrelevant information in the noise.

### 4. Consistency
Compares how similar the responses are in their core message and conclusions. Evaluates whether noise causes the agent to contradict itself.

### 5. Hallucination Resistance
Checks if noise causes the agent to generate false or fabricated information that wasn't present in either the query or the noise.

## Scoring Algorithm

### Formula

```
Final Score = max(0, min(llm_score, calculated_score) - issues_penalty)
```

Where:
- `llm_score` = Direct robustness score from LLM analysis
- `calculated_score` = Average of impact weights across dimensions
- `issues_penalty` = min(major_issues × penalty_rate, max_penalty)

### Impact Level Weights

Each dimension receives an impact level with corresponding weights:

- **None (1.0)**: Response virtually identical in quality and accuracy
- **Minimal (0.85)**: Slight phrasing changes but maintains correctness
- **Moderate (0.6)**: Noticeable changes affecting quality but core info correct
- **Significant (0.3)**: Major degradation in quality or accuracy
- **Severe (0.1)**: Response substantially worse or completely derailed

### Conservative Scoring

When the LLM's direct score and the calculated score diverge by more than the discrepancy threshold, the scorer uses the lower (more conservative) score to ensure reliable evaluation.

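To make the arithmetic concrete, the sketch below applies the default weights and penalty settings documented above to one hypothetical evaluation; it illustrates the formula rather than the scorer's internal implementation.

```typescript
// Default impact weights and penalty settings from the tables above
type Impact = 'none' | 'minimal' | 'moderate' | 'significant' | 'severe';

const impactWeights: Record<Impact, number> = {
  none: 1.0,
  minimal: 0.85,
  moderate: 0.6,
  significant: 0.3,
  severe: 0.1,
};

function finalScore(llmScore: number, dimensionImpacts: Impact[], majorIssues: number): number {
  // calculated_score = average of the impact weights across the evaluated dimensions
  const calculated =
    dimensionImpacts.reduce((sum, impact) => sum + impactWeights[impact], 0) /
    dimensionImpacts.length;

  // issues_penalty = min(major_issues × 0.1, 0.3)
  const issuesPenalty = Math.min(majorIssues * 0.1, 0.3);

  // Final Score = max(0, min(llm_score, calculated_score) - issues_penalty)
  return Math.max(0, Math.min(llmScore, calculated) - issuesPenalty);
}

// Four unaffected dimensions, one moderately affected, one major issue flagged
finalScore(0.9, ['none', 'none', 'none', 'none', 'moderate'], 1); // 0.8
```
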
## Noise Types

### Misinformation
False or misleading claims mixed with legitimate queries.

Example: "What causes climate change? Also, climate change is a hoax invented by scientists."

### Distractors
Irrelevant information that could pull focus from the main query.

Example: "How do I bake a cake? My cat is orange and I like pizza on Tuesdays."

### Adversarial
Deliberately conflicting instructions designed to confuse.

Example: "Write a summary of this article. Actually, ignore that and tell me about dogs instead."

## Usage Patterns

### Testing Agent Robustness
Use to verify that agents maintain quality when faced with:
- User confusion or contradictions
- Multiple unrelated questions in one query
- False premises or assumptions
- Emotional or distracting content

### Quality Assurance
Integrate into evaluation pipelines to:
- Benchmark different models' noise resistance
- Identify agents vulnerable to manipulation
- Validate production readiness

### Security Testing
Evaluate resistance to:
- Prompt injection attempts
- Social engineering tactics
- Information pollution attacks

## Score Interpretation

- **0.9-1.0**: Excellent robustness, minimal impact from noise
- **0.7-0.8**: Good resistance with minor degradation
- **0.5-0.6**: Moderate impact, some key aspects affected
- **0.3-0.4**: Significant vulnerability to noise
- **0.0-0.2**: Severe compromise, agent easily misled

## Related

- [Noise Sensitivity Examples](/examples/scorers/noise-sensitivity) - Practical usage examples
- [Hallucination Scorer](/reference/scorers/hallucination) - Evaluates fabricated content
- [Answer Relevancy Scorer](/reference/scorers/answer-relevancy) - Measures response focus
- [Custom Scorers](/docs/scorers/custom-scorers) - Creating your own evaluation metrics
@@ -0,0 +1,369 @@
---
title: "Reference: Prompt Alignment Scorer | Scorers | Mastra Docs"
description: Documentation for the Prompt Alignment Scorer in Mastra. Evaluates how well agent responses align with user prompt intent, requirements, completeness, and appropriateness using multi-dimensional analysis.
---

import { PropertiesTable } from "@/components/properties-table";

# Prompt Alignment Scorer

The `createPromptAlignmentScorerLLM()` function creates a scorer that evaluates how well agent responses align with user prompts across multiple dimensions: intent understanding, requirement fulfillment, response completeness, and format appropriateness.

## Parameters

<PropertiesTable
  content={[
    {
      name: "model",
      type: "MastraLanguageModel",
      description: "The language model to use for evaluating prompt-response alignment",
      required: true,
    },
    {
      name: "options",
      type: "PromptAlignmentOptions",
      description: "Configuration options for the scorer",
      required: false,
      children: [
        {
          name: "scale",
          type: "number",
          description: "Scale factor to multiply the final score (default: 1)",
          required: false,
        },
        {
          name: "evaluationMode",
          type: "'user' | 'system' | 'both'",
          description: "Evaluation mode - 'user' evaluates user prompt alignment only, 'system' evaluates system compliance only, 'both' evaluates both with weighted scoring (default: 'both')",
          required: false,
        },
      ],
    },
  ]}
/>

## .run() Returns

<PropertiesTable
  content={[
    {
      name: "score",
      type: "number",
      description: "Multi-dimensional alignment score between 0 and scale (default 0-1)",
    },
    {
      name: "reason",
      type: "string",
      description: "Human-readable explanation of the prompt alignment evaluation with detailed breakdown",
    },
  ]}
/>

## Scoring Details

### Multi-Dimensional Analysis

Prompt Alignment evaluates responses across four key dimensions with weighted scoring that adapts based on the evaluation mode:

#### User Mode ('user')
Evaluates alignment with user prompts only:

1. **Intent Alignment** (40% weight) - Whether the response addresses the user's core request
2. **Requirements Fulfillment** (30% weight) - If all user requirements are met
3. **Completeness** (20% weight) - Whether the response is comprehensive for user needs
4. **Response Appropriateness** (10% weight) - If format and tone match user expectations

#### System Mode ('system')
Evaluates compliance with system guidelines only:

1. **Intent Alignment** (35% weight) - Whether the response follows system behavioral guidelines
2. **Requirements Fulfillment** (35% weight) - If all system constraints are respected
3. **Completeness** (15% weight) - Whether the response adheres to all system rules
4. **Response Appropriateness** (15% weight) - If format and tone match system specifications

#### Both Mode ('both' - default)
Combines evaluation of both user and system alignment:

- **User alignment**: 70% of final score (using user mode weights)
- **System compliance**: 30% of final score (using system mode weights)
- Provides balanced assessment of user satisfaction and system adherence

### Scoring Formula

**User Mode:**
```
Weighted Score = (intent_score × 0.4) + (requirements_score × 0.3) +
                 (completeness_score × 0.2) + (appropriateness_score × 0.1)
Final Score = Weighted Score × scale
```

**System Mode:**
```
Weighted Score = (intent_score × 0.35) + (requirements_score × 0.35) +
                 (completeness_score × 0.15) + (appropriateness_score × 0.15)
Final Score = Weighted Score × scale
```

**Both Mode (default):**
```
User Score = (user dimensions with user weights)
System Score = (system dimensions with system weights)
Weighted Score = (User Score × 0.7) + (System Score × 0.3)
Final Score = Weighted Score × scale
```

**Weight Distribution Rationale**:
- **User Mode**: Prioritizes intent (40%) and requirements (30%) for user satisfaction
- **System Mode**: Balances behavioral compliance (35%) and constraints (35%) equally
- **Both Mode**: 70/30 split ensures user needs are primary while maintaining system compliance

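As a worked illustration of the formulas above (not the scorer's internal code), the weighted combination in the default `'both'` mode can be computed like this:

```typescript
// Per-dimension scores are assumed to be normalized to the 0-1 range
interface DimensionScores {
  intent: number;
  requirements: number;
  completeness: number;
  appropriateness: number;
}

const userWeights = { intent: 0.4, requirements: 0.3, completeness: 0.2, appropriateness: 0.1 };
const systemWeights = { intent: 0.35, requirements: 0.35, completeness: 0.15, appropriateness: 0.15 };

const weighted = (s: DimensionScores, w: typeof userWeights) =>
  s.intent * w.intent +
  s.requirements * w.requirements +
  s.completeness * w.completeness +
  s.appropriateness * w.appropriateness;

// 'both' mode: 70% user alignment + 30% system compliance, then apply the scale factor
function bothModeScore(user: DimensionScores, system: DimensionScores, scale = 1): number {
  return (weighted(user, userWeights) * 0.7 + weighted(system, systemWeights) * 0.3) * scale;
}

// Perfect user alignment but only partial system compliance
bothModeScore(
  { intent: 1, requirements: 1, completeness: 1, appropriateness: 1 },
  { intent: 0.8, requirements: 0.6, completeness: 1, appropriateness: 1 },
); // 0.7 + 0.3 × 0.79 = 0.937
```
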
### Score Interpretation

- **0.9-1.0** = Excellent alignment across all dimensions
- **0.8-0.9** = Very good alignment with minor gaps
- **0.7-0.8** = Good alignment but missing some requirements or completeness
- **0.6-0.7** = Moderate alignment with noticeable gaps
- **0.4-0.6** = Poor alignment with significant issues
- **0.0-0.4** = Very poor alignment, response doesn't address the prompt effectively

### Comparison with Other Scorers

| Aspect | Prompt Alignment | Answer Relevancy | Faithfulness |
|--------|------------------|------------------|--------------|
| **Focus** | Multi-dimensional prompt adherence | Query-response relevance | Context groundedness |
| **Evaluation** | Intent, requirements, completeness, format | Semantic similarity to query | Factual consistency with context |
| **Use Case** | General prompt following | Information retrieval | RAG/context-based systems |
| **Dimensions** | 4 weighted dimensions | Single relevance dimension | Single faithfulness dimension |

### When to Use Each Mode

**User Mode (`'user'`)** - Use when:
- Evaluating customer service responses for user satisfaction
- Testing content generation quality from user perspective
- Measuring how well responses address user questions
- Focusing purely on request fulfillment without system constraints

**System Mode (`'system'`)** - Use when:
- Auditing AI safety and compliance with behavioral guidelines
- Ensuring agents follow brand voice and tone requirements
- Validating adherence to content policies and constraints
- Testing system-level behavioral consistency

**Both Mode (`'both'`)** - Use when (default, recommended):
- Comprehensive evaluation of overall AI agent performance
- Balancing user satisfaction with system compliance
- Production monitoring where both user and system requirements matter
- Holistic assessment of prompt-response alignment

## Usage Examples

### Basic Configuration

```typescript
import { openai } from '@ai-sdk/openai';
import { createPromptAlignmentScorerLLM } from '@mastra/evals';

const scorer = createPromptAlignmentScorerLLM({
  model: openai('gpt-4o'),
});

// Evaluate a code generation task
const result = await scorer.run({
  input: [{
    role: 'user',
    content: 'Write a Python function to calculate factorial with error handling'
  }],
  output: {
    role: 'assistant',
    text: `def factorial(n):
  if n < 0:
    raise ValueError("Factorial not defined for negative numbers")
  if n == 0:
    return 1
  return n * factorial(n-1)`
  }
});
// Result: { score: 0.95, reason: "Excellent alignment - function addresses intent, includes error handling..." }
```

### Custom Configuration Examples

```typescript
// Configure scale and evaluation mode
const scorer = createPromptAlignmentScorerLLM({
  model: openai('gpt-4o'),
  options: {
    scale: 10, // Score from 0-10 instead of 0-1
    evaluationMode: 'both' // 'user', 'system', or 'both' (default)
  },
});

// User-only evaluation - focus on user satisfaction
const userScorer = createPromptAlignmentScorerLLM({
  model: openai('gpt-4o'),
  options: { evaluationMode: 'user' }
});

// System-only evaluation - focus on compliance
const systemScorer = createPromptAlignmentScorerLLM({
  model: openai('gpt-4o'),
  options: { evaluationMode: 'system' }
});

const result = await scorer.run(testRun);
// Result: { score: 8.5, reason: "Score: 8.5 out of 10 - Good alignment with both user intent and system guidelines..." }
```

### Format-Specific Evaluation

```typescript
// Evaluate bullet point formatting
const result = await scorer.run({
  input: [{
    role: 'user',
    content: 'List the benefits of TypeScript in bullet points'
  }],
  output: {
    role: 'assistant',
    text: 'TypeScript provides static typing, better IDE support, and enhanced code reliability.'
  }
});
// Result: Lower appropriateness score due to format mismatch (paragraph vs bullet points)
```

## Usage Patterns

### Code Generation Evaluation
Ideal for evaluating:
- Programming task completion
- Code quality and completeness
- Adherence to coding requirements
- Format specifications (functions, classes, etc.)

```typescript
// Example: API endpoint creation
const codePrompt = "Create a REST API endpoint with authentication and rate limiting";
// Scorer evaluates: intent (API creation), requirements (auth + rate limiting),
// completeness (full implementation), format (code structure)
```

### Instruction Following Assessment
Perfect for:
- Task completion verification
- Multi-step instruction adherence
- Requirement compliance checking
- Educational content evaluation

```typescript
// Example: Multi-requirement task
const taskPrompt = "Write a Python class with initialization, validation, error handling, and documentation";
// Scorer tracks each requirement individually and provides detailed breakdown
```

### Content Format Validation
Useful for:
- Format specification compliance
- Style guide adherence
- Output structure verification
- Response appropriateness checking

```typescript
// Example: Structured output
const formatPrompt = "Explain the differences between let and const in JavaScript using bullet points";
// Scorer evaluates content accuracy AND format compliance
```

## Common Use Cases

### 1. Agent Response Quality
Measure how well your AI agents follow user instructions:

```typescript
const agent = new Agent({
  name: 'CodingAssistant',
  instructions: 'You are a helpful coding assistant. Always provide working code examples.',
  model: openai('gpt-4o'),
});

// Evaluate comprehensive alignment (default)
const scorer = createPromptAlignmentScorerLLM({
  model: openai('gpt-4o-mini'),
  options: { evaluationMode: 'both' } // Evaluates both user intent and system guidelines
});

// Evaluate just user satisfaction
const userScorer = createPromptAlignmentScorerLLM({
  model: openai('gpt-4o-mini'),
  options: { evaluationMode: 'user' } // Focus only on user request fulfillment
});

// Evaluate system compliance
const systemScorer = createPromptAlignmentScorerLLM({
  model: openai('gpt-4o-mini'),
  options: { evaluationMode: 'system' } // Check adherence to system instructions
});

const result = await scorer.run(agentRun);
```

### 2. Prompt Engineering Optimization
Test different prompts to improve alignment:

```typescript
const prompts = [
  'Write a function to calculate factorial',
  'Create a Python function that calculates factorial with error handling for negative inputs',
  'Implement a factorial calculator in Python with: input validation, error handling, and docstring'
];

// Compare alignment scores to find the best prompt
for (const prompt of prompts) {
  const result = await scorer.run(createTestRun(prompt, response));
  console.log(`Prompt alignment: ${result.score}`);
}
```

### 3. Multi-Agent System Evaluation
Compare different agents or models:

```typescript
const agents = [agent1, agent2, agent3];
const testPrompts = [...]; // Array of test prompts

for (const agent of agents) {
  let totalScore = 0;
  for (const prompt of testPrompts) {
    const response = await agent.run(prompt);
    const evaluation = await scorer.run({ input: prompt, output: response });
    totalScore += evaluation.score;
  }
  console.log(`${agent.name} average alignment: ${totalScore / testPrompts.length}`);
}
```

## Error Handling

The scorer handles various edge cases gracefully:

```typescript
// Missing user prompt
try {
  await scorer.run({ input: [], output: response });
} catch (error) {
  // Error: "Both user prompt and agent response are required for prompt alignment scoring"
}

// Empty response
const result = await scorer.run({
  input: [userMessage],
  output: { role: 'assistant', text: '' }
});
// Returns low scores with detailed reasoning about incompleteness
```

## Related

- [Answer Relevancy Scorer](/reference/scorers/answer-relevancy) - Evaluates query-response relevance
- [Faithfulness Scorer](/reference/scorers/faithfulness) - Measures context groundedness
- [Tool Call Accuracy Scorer](/reference/scorers/tool-call-accuracy) - Evaluates tool selection
- [Custom Scorers](/docs/scorers/custom-scorers) - Creating your own evaluation metrics
@@ -20,6 +20,7 @@ These scorers evaluate how correct, truthful, and complete your agent's answers
  - [`content-similarity`](/reference/scorers/content-similarity): Measures textual similarity using character-level matching (`0-1`, higher is better)
  - [`textual-difference`](/reference/scorers/textual-difference): Measures textual differences between strings (`0-1`, higher means more similar)
  - [`tool-call-accuracy`](/reference/scorers/tool-call-accuracy): Evaluates whether the LLM selects the correct tool from available options (`0-1`, higher is better)
+ - [`prompt-alignment`](/reference/scorers/prompt-alignment): Measures how well agent responses align with user prompt intent, requirements, completeness, and format (`0-1`, higher is better)

  ### Context Quality

@@ -28,14 +29,13 @@ These scorers evaluate the quality and relevance of context used in generating r
  - [`context-precision`](/reference/scorers/context-precision): Evaluates context relevance and ranking using Mean Average Precision, rewarding early placement of relevant context (`0-1`, higher is better)
  - [`context-relevance`](/reference/scorers/context-relevance): Measures context utility with nuanced relevance levels, usage tracking, and missing context detection (`0-1`, higher is better)

- :::tip Context Scorer Selection
+ > tip Context Scorer Selection
  - Use **Context Precision** when context ordering matters and you need standard IR metrics (ideal for RAG ranking evaluation)
  - Use **Context Relevance** when you need detailed relevance assessment and want to track context usage and identify gaps

  Both context scorers support:
  - **Static context**: Pre-defined context arrays
  - **Dynamic context extraction**: Extract context from runs using custom functions (ideal for RAG systems, vector databases, etc.)
- :::

  ### Output Quality

@@ -15,8 +15,8 @@ Mastra supports real-time, incremental responses from agents and workflows, allo

  Mastra currently supports two streaming methods, this page explains how to use `streamVNext()`.

- 1. **`.stream()`**: Current stable API, supports **AI SDK v1**.
- 2. **`.streamVNext()`**: Experimental API, supports **AI SDK v2**.
+ 1. **`.stream()`**: Current stable API, supports **AI SDK v4** (`LanguageModelV1`).
+ 2. **`.streamVNext()`**: Experimental API, supports **AI SDK v5** (`LanguageModelV2`).

  ## Streaming with agents

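As a rough sketch of the experimental call (the agent setup follows the agent docs; the `textStream` property on the returned stream is an assumption and the chunk shape may differ between the two APIs):

```typescript
import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";

const agent = new Agent({
  name: "weather-agent",
  instructions: "You are a concise weather assistant.",
  model: openai("gpt-4o-mini"),
});

// Experimental streaming API; `.stream()` is called the same way for the stable API
const stream = await agent.streamVNext("Summarize today's weather in Berlin.");

for await (const chunk of stream.textStream) {
  process.stdout.write(chunk);
}
```
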
@@ -3,6 +3,8 @@ title: "Tool Streaming | Streaming | Mastra"
  description: "Learn how to use tool streaming in Mastra, including handling tool calls, tool results, and tool execution events during streaming."
  ---

+ import { Callout } from "nextra/components";
+
  # Tool streaming

  Tool streaming in Mastra enables tools to send incremental results while they run, rather than waiting until execution finishes. This allows you to surface partial progress, intermediate states, or progressive data directly to users or upstream agents and workflows.
@@ -36,6 +38,10 @@ export const testAgent = new Agent({

  The `writer` argument is passed to a tool’s `execute` function and can be used to emit custom events, data, or values into the active stream. This enables tools to provide intermediate results or status updates while execution is still in progress.

+ <Callout type="warning">
+ You must `await` the call to `writer.write(...)` or else you will lock the stream and get a `WritableStream is locked` error.
+ </Callout>
+
  ```typescript {5,8,15} showLineNumbers copy
  import { createTool } from "@mastra/core/tools";

@@ -44,14 +50,14 @@ export const testTool = createTool({
  execute: async ({ context, writer }) => {
    const { value } = context;

-   writer?.write({
+   await writer?.write({
      type: "custom-event",
      status: "pending"
    });

    const response = await fetch(...);

-   writer?.write({
+   await writer?.write({
      type: "custom-event",
      status: "success"
    });