@mastra/mcp-docs-server 0.13.37 → 0.13.38

Files changed (397)

package/.docs/raw/reference/scorers/prompt-alignment.mdx CHANGED

@@ -3,7 +3,7 @@ title: "Reference: Prompt Alignment Scorer | Scorers | Mastra Docs"
 description: Documentation for the Prompt Alignment Scorer in Mastra. Evaluates how well agent responses align with user prompt intent, requirements, completeness, and appropriateness using multi-dimensional analysis.
 ---
-import { PropertiesTable } from "@/components/properties-table";
+import PropertiesTable from "@site/src/components/PropertiesTable";
 # Prompt Alignment Scorer
@@ -16,7 +16,8 @@ The `createPromptAlignmentScorerLLM()` function creates a scorer that evaluates
     {
       name: "model",
       type: "MastraModelConfig",
-      description: "The language model to use for evaluating prompt-response alignment",
+      description:
+        "The language model to use for evaluating prompt-response alignment",
       required: true,
     },
     {
@@ -34,7 +35,8 @@ The `createPromptAlignmentScorerLLM()` function creates a scorer that evaluates
         {
           name: "evaluationMode",
           type: "'user' | 'system' | 'both'",
-          description: "Evaluation mode - 'user' evaluates user prompt alignment only, 'system' evaluates system compliance only, 'both' evaluates both with weighted scoring (default: 'both')",
+          description:
+            "Evaluation mode - 'user' evaluates user prompt alignment only, 'system' evaluates system compliance only, 'both' evaluates both with weighted scoring (default: 'both')",
           required: false,
         },
       ],
@@ -49,12 +51,14 @@ The `createPromptAlignmentScorerLLM()` function creates a scorer that evaluates
     {
       name: "score",
       type: "number",
-      description: "Multi-dimensional alignment score between 0 and scale (default 0-1)",
+      description:
+        "Multi-dimensional alignment score between 0 and scale (default 0-1)",
     },
     {
       name: "reason",
       type: "string",
-      description: "Human-readable explanation of the prompt alignment evaluation with detailed breakdown",
+      description:
+        "Human-readable explanation of the prompt alignment evaluation with detailed breakdown",
     },
   ]}
 />
@@ -104,12 +108,12 @@ The `createPromptAlignmentScorerLLM()` function creates a scorer that evaluates
 You can customize the Prompt Alignment Scorer by adjusting the scale parameter and evaluation mode to fit your scoring needs.
 ```typescript showLineNumbers copy
-const scorer = createPromptAlignmentScorerLLM({
-  model: 'openai/gpt-4o-mini',
-  options: {
+const scorer = createPromptAlignmentScorerLLM({
+  model: "openai/gpt-4o-mini",
+  options: {
     scale: 10, // Score from 0-10 instead of 0-1
-    evaluationMode: 'both' // 'user', 'system', or 'both' (default)
-  }
+    evaluationMode: "both", // 'user', 'system', or 'both' (default)
+  },
 });
 ```
@@ -118,14 +122,16 @@ const scorer = createPromptAlignmentScorerLLM({
 Prompt Alignment evaluates responses across four key dimensions with weighted scoring that adapts based on the evaluation mode:
 #### User Mode ('user')
 Evaluates alignment with user prompts only:
 1. **Intent Alignment** (40% weight) - Whether the response addresses the user's core request
 2. **Requirements Fulfillment** (30% weight) - If all user requirements are met
-3. **Completeness** (20% weight) - Whether the response is comprehensive for user needs
+3. **Completeness** (20% weight) - Whether the response is comprehensive for user needs
 4. **Response Appropriateness** (10% weight) - If format and tone match user expectations
 #### System Mode ('system')
 Evaluates compliance with system guidelines only:
 1. **Intent Alignment** (35% weight) - Whether the response follows system behavioral guidelines
@@ -134,6 +140,7 @@ Evaluates compliance with system guidelines only:
 4. **Response Appropriateness** (15% weight) - If format and tone match system specifications
 #### Both Mode ('both' - default)
 Combines evaluation of both user and system alignment:
 - **User alignment**: 70% of final score (using user mode weights)
@@ -143,28 +150,32 @@ Combines evaluation of both user and system alignment:
 ### Scoring Formula
 **User Mode:**
 ```
-Weighted Score = (intent_score × 0.4) + (requirements_score × 0.3) +
+Weighted Score = (intent_score × 0.4) + (requirements_score × 0.3) +
                  (completeness_score × 0.2) + (appropriateness_score × 0.1)
 Final Score = Weighted Score × scale
 ```
 **System Mode:**
 ```
-Weighted Score = (intent_score × 0.35) + (requirements_score × 0.35) +
+Weighted Score = (intent_score × 0.35) + (requirements_score × 0.35) +
                  (completeness_score × 0.15) + (appropriateness_score × 0.15)
 Final Score = Weighted Score × scale
 ```
 **Both Mode (default):**
 ```
 User Score = (user dimensions with user weights)
-System Score = (system dimensions with system weights)
+System Score = (system dimensions with system weights)
 Weighted Score = (User Score × 0.7) + (System Score × 0.3)
 Final Score = Weighted Score × scale
 ```
 **Weight Distribution Rationale**:
 - **User Mode**: Prioritizes intent (40%) and requirements (30%) for user satisfaction
 - **System Mode**: Balances behavioral compliance (35%) and constraints (35%) equally
 - **Both Mode**: 70/30 split ensures user needs are primary while maintaining system compliance
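
The scoring formulas in this hunk can be worked through with a small sketch. This is not Mastra's implementation: `DimensionScores`, `weightedSum`, and `computeFinalScore` are hypothetical names introduced only for illustration, and the sketch assumes each dimension score is already normalized to the 0-1 range. Only the weights and the 70/30 split come from the documentation above.

```typescript
// Hypothetical helper types; not part of @mastra/evals.
type EvaluationMode = "user" | "system" | "both";

interface DimensionScores {
  intent: number;
  requirements: number;
  completeness: number;
  appropriateness: number;
}

// Weights taken from the "User Mode" and "System Mode" formulas above.
const USER_WEIGHTS: DimensionScores = {
  intent: 0.4,
  requirements: 0.3,
  completeness: 0.2,
  appropriateness: 0.1,
};

const SYSTEM_WEIGHTS: DimensionScores = {
  intent: 0.35,
  requirements: 0.35,
  completeness: 0.15,
  appropriateness: 0.15,
};

function weightedSum(scores: DimensionScores, weights: DimensionScores): number {
  return (
    scores.intent * weights.intent +
    scores.requirements * weights.requirements +
    scores.completeness * weights.completeness +
    scores.appropriateness * weights.appropriateness
  );
}

// Applies the documented formula for the chosen mode, then multiplies by scale.
function computeFinalScore(
  mode: EvaluationMode,
  scores: { user?: DimensionScores; system?: DimensionScores },
  scale = 1,
): number {
  const part = (
    s: DimensionScores | undefined,
    w: DimensionScores,
    label: string,
  ): number => {
    if (!s) throw new Error(`${label} scores are required in '${mode}' mode`);
    return weightedSum(s, w);
  };

  let weighted: number;
  if (mode === "user") {
    weighted = part(scores.user, USER_WEIGHTS, "user");
  } else if (mode === "system") {
    weighted = part(scores.system, SYSTEM_WEIGHTS, "system");
  } else {
    // Both mode: 70% user alignment, 30% system compliance.
    weighted =
      part(scores.user, USER_WEIGHTS, "user") * 0.7 +
      part(scores.system, SYSTEM_WEIGHTS, "system") * 0.3;
  }
  return weighted * scale;
}

// Made-up dimension scores, chosen only to make the arithmetic easy to follow.
const userScores: DimensionScores = {
  intent: 1,
  requirements: 0.5,
  completeness: 0.5,
  appropriateness: 1,
};
const systemScores: DimensionScores = {
  intent: 1,
  requirements: 1,
  completeness: 0,
  appropriateness: 1,
};

// User: 0.4 + 0.15 + 0.1 + 0.1 = 0.75; system: 0.35 + 0.35 + 0 + 0.15 = 0.85
// Both: 0.75 * 0.7 + 0.85 * 0.3 = 0.78, then scaled by 10
console.log(
  computeFinalScore("both", { user: userScores, system: systemScores }, 10).toFixed(2),
); // prints "7.80"
```

With these inputs the system-side completeness of 0 costs only 0.15 × 0.3 = 0.045 of the weighted score, which illustrates how the 70/30 split keeps user alignment dominant in the default mode.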
@@ -181,18 +192,21 @@ Final Score = Weighted Score × scale
 ### When to Use Each Mode
 **User Mode (`'user'`)** - Use when:
 - Evaluating customer service responses for user satisfaction
-- Testing content generation quality from user perspective
+- Testing content generation quality from user perspective
 - Measuring how well responses address user questions
 - Focusing purely on request fulfillment without system constraints
 **System Mode (`'system'`)** - Use when:
 - Auditing AI safety and compliance with behavioral guidelines
 - Ensuring agents follow brand voice and tone requirements
 - Validating adherence to content policies and constraints
 - Testing system-level behavioral consistency
 **Both Mode (`'both'`)** - Use when (default, recommended):
 - Comprehensive evaluation of overall AI agent performance
 - Balancing user satisfaction with system compliance
 - Production monitoring where both user and system requirements matter
@@ -201,21 +215,26 @@ Final Score = Weighted Score × scale
 ## Common Use Cases
 ### Code Generation Evaluation
 Ideal for evaluating:
 - Programming task completion
-- Code quality and completeness
+- Code quality and completeness
 - Adherence to coding requirements
 - Format specifications (functions, classes, etc.)
 ```typescript
 // Example: API endpoint creation
-const codePrompt = "Create a REST API endpoint with authentication and rate limiting";
-// Scorer evaluates: intent (API creation), requirements (auth + rate limiting),
+const codePrompt =
+  "Create a REST API endpoint with authentication and rate limiting";
+// Scorer evaluates: intent (API creation), requirements (auth + rate limiting),
 // completeness (full implementation), format (code structure)
 ```
 ### Instruction Following Assessment
 Perfect for:
 - Task completion verification
 - Multi-step instruction adherence
 - Requirement compliance checking
@@ -223,12 +242,15 @@ Perfect for:
 ```typescript
 // Example: Multi-requirement task
-const taskPrompt = "Write a Python class with initialization, validation, error handling, and documentation";
+const taskPrompt =
+  "Write a Python class with initialization, validation, error handling, and documentation";
 // Scorer tracks each requirement individually and provides detailed breakdown
 ```
 ### Content Format Validation
 Useful for:
 - Format specification compliance
 - Style guide adherence
 - Output structure verification
@@ -236,49 +258,53 @@ Useful for:
 ```typescript
 // Example: Structured output
-const formatPrompt = "Explain the differences between let and const in JavaScript using bullet points";
+const formatPrompt =
+  "Explain the differences between let and const in JavaScript using bullet points";
 // Scorer evaluates content accuracy AND format compliance
 ```
 ### Agent Response Quality
 Measure how well your AI agents follow user instructions:
 ```typescript
 const agent = new Agent({
-  name: 'CodingAssistant',
-  instructions: 'You are a helpful coding assistant. Always provide working code examples.',
-  model: 'openai/gpt-4o',
+  name: "CodingAssistant",
+  instructions:
+    "You are a helpful coding assistant. Always provide working code examples.",
+  model: "openai/gpt-4o",
 });
 // Evaluate comprehensive alignment (default)
 const scorer = createPromptAlignmentScorerLLM({
-  model: 'openai/gpt-4o-mini',
-  options: { evaluationMode: 'both' } // Evaluates both user intent and system guidelines
+  model: "openai/gpt-4o-mini",
+  options: { evaluationMode: "both" }, // Evaluates both user intent and system guidelines
 });
 // Evaluate just user satisfaction
 const userScorer = createPromptAlignmentScorerLLM({
-  model: 'openai/gpt-4o-mini',
-  options: { evaluationMode: 'user' } // Focus only on user request fulfillment
+  model: "openai/gpt-4o-mini",
+  options: { evaluationMode: "user" }, // Focus only on user request fulfillment
 });
 // Evaluate system compliance
 const systemScorer = createPromptAlignmentScorerLLM({
-  model: 'openai/gpt-4o-mini',
-  options: { evaluationMode: 'system' } // Check adherence to system instructions
+  model: "openai/gpt-4o-mini",
+  options: { evaluationMode: "system" }, // Check adherence to system instructions
 });
 const result = await scorer.run(agentRun);
 ```
 ### Prompt Engineering Optimization
 Test different prompts to improve alignment:
 ```typescript
 const prompts = [
-  'Write a function to calculate factorial',
-  'Create a Python function that calculates factorial with error handling for negative inputs',
-  'Implement a factorial calculator in Python with: input validation, error handling, and docstring'
+  "Write a function to calculate factorial",
+  "Create a Python function that calculates factorial with error handling for negative inputs",
+  "Implement a factorial calculator in Python with: input validation, error handling, and docstring",
 ];
 // Compare alignment scores to find the best prompt
@@ -289,6 +315,7 @@ for (const prompt of prompts) {
 ```
 ### Multi-Agent System Evaluation
 Compare different agents or models:
 ```typescript
@@ -311,27 +338,30 @@ for (const agent of agents) {
 ### Basic Configuration
 ```typescript
-import { createPromptAlignmentScorerLLM } from '@mastra/evals';
+import { createPromptAlignmentScorerLLM } from "@mastra/evals";
 const scorer = createPromptAlignmentScorerLLM({
-  model: 'openai/gpt-4o',
+  model: "openai/gpt-4o",
 });
 // Evaluate a code generation task
 const result = await scorer.run({
-  input: [{
-    role: 'user',
-    content: 'Write a Python function to calculate factorial with error handling'
-  }],
+  input: [
+    {
+      role: "user",
+      content:
+        "Write a Python function to calculate factorial with error handling",
+    },
+  ],
   output: {
-    role: 'assistant',
+    role: "assistant",
     text: `def factorial(n):
     if n < 0:
         raise ValueError("Factorial not defined for negative numbers")
     if n == 0:
         return 1
-    return n * factorial(n-1)`
-  }
+    return n * factorial(n-1)`,
+  },
 });
 // Result: { score: 0.95, reason: "Excellent alignment - function addresses intent, includes error handling..." }
 ```
@@ -341,23 +371,23 @@ const result = await scorer.run({
 ```typescript
 // Configure scale and evaluation mode
 const scorer = createPromptAlignmentScorerLLM({
-  model: 'openai/gpt-4o',
+  model: "openai/gpt-4o",
   options: {
     scale: 10, // Score from 0-10 instead of 0-1
-    evaluationMode: 'both' // 'user', 'system', or 'both' (default)
+    evaluationMode: "both", // 'user', 'system', or 'both' (default)
   },
 });
 // User-only evaluation - focus on user satisfaction
 const userScorer = createPromptAlignmentScorerLLM({
-  model: 'openai/gpt-4o',
-  options: { evaluationMode: 'user' }
+  model: "openai/gpt-4o",
+  options: { evaluationMode: "user" },
 });
 // System-only evaluation - focus on compliance
 const systemScorer = createPromptAlignmentScorerLLM({
-  model: 'openai/gpt-4o',
-  options: { evaluationMode: 'system' }
+  model: "openai/gpt-4o",
+  options: { evaluationMode: "system" },
 });
 const result = await scorer.run(testRun);
@@ -369,14 +399,16 @@ const result = await scorer.run(testRun);
 ```typescript
 // Evaluate bullet point formatting
 const result = await scorer.run({
-  input: [{
-    role: 'user',
-    content: 'List the benefits of TypeScript in bullet points'
-  }],
+  input: [
+    {
+      role: "user",
+      content: "List the benefits of TypeScript in bullet points",
+    },
+  ],
   output: {
-    role: 'assistant',
-    text: 'TypeScript provides static typing, better IDE support, and enhanced code reliability.'
-  }
+    role: "assistant",
+    text: "TypeScript provides static typing, better IDE support, and enhanced code reliability.",
+  },
 });
 // Result: Lower appropriateness score due to format mismatch (paragraph vs bullet points)
 ```
@@ -385,26 +417,29 @@ const result = await scorer.run({
 In this example, the response fully addresses the user's prompt with all requirements met.
-```typescript filename="src/example-excellent-prompt-alignment.ts" showLineNumbers copy
+```typescript title="src/example-excellent-prompt-alignment.ts" showLineNumbers copy
 import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/llm";
-const scorer = createPromptAlignmentScorerLLM({
-  model: 'openai/gpt-4o-mini'
+const scorer = createPromptAlignmentScorerLLM({
+  model: "openai/gpt-4o-mini",
 });
-const inputMessages = [{
-  role: 'user',
-  content: "Write a Python function to calculate factorial with error handling for negative numbers"
-}];
+const inputMessages = [
+  {
+    role: "user",
+    content:
+      "Write a Python function to calculate factorial with error handling for negative numbers",
+  },
+];
-const outputMessage = {
+const outputMessage = {
   text: `def factorial(n):
     """Calculate factorial of a number."""
     if n < 0:
         raise ValueError("Factorial not defined for negative numbers")
     if n == 0 or n == 1:
         return 1
-    return n * factorial(n - 1)`
+    return n * factorial(n - 1)`,
 };
 const result = await scorer.run({
@@ -430,20 +465,22 @@ The output receives a high score because it perfectly addresses the intent, fulf
 In this example, the response addresses the core intent but misses some requirements or has format issues.
-```typescript filename="src/example-partial-prompt-alignment.ts" showLineNumbers copy
+```typescript title="src/example-partial-prompt-alignment.ts" showLineNumbers copy
 import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/llm";
-const scorer = createPromptAlignmentScorerLLM({
-  model: 'openai/gpt-4o-mini'
+const scorer = createPromptAlignmentScorerLLM({
+  model: "openai/gpt-4o-mini",
 });
-const inputMessages = [{
-  role: 'user',
-  content: "List the benefits of TypeScript in bullet points"
-}];
+const inputMessages = [
+  {
+    role: "user",
+    content: "List the benefits of TypeScript in bullet points",
+  },
+];
-const outputMessage = {
-  text: "TypeScript provides static typing, better IDE support, and enhanced code reliability through compile-time error checking."
+const outputMessage = {
+  text: "TypeScript provides static typing, better IDE support, and enhanced code reliability through compile-time error checking.",
 };
 const result = await scorer.run({
@@ -469,22 +506,25 @@ The output receives a lower score because while the content is accurate, it does
 In this example, the response fails to address the user's specific requirements.
-```typescript filename="src/example-poor-prompt-alignment.ts" showLineNumbers copy
+```typescript title="src/example-poor-prompt-alignment.ts" showLineNumbers copy
 import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/llm";
-const scorer = createPromptAlignmentScorerLLM({
-  model: 'openai/gpt-4o-mini'
+const scorer = createPromptAlignmentScorerLLM({
+  model: "openai/gpt-4o-mini",
 });
-const inputMessages = [{
-  role: 'user',
-  content: "Write a Python class with initialization, validation, error handling, and documentation"
-}];
+const inputMessages = [
+  {
+    role: "user",
+    content:
+      "Write a Python class with initialization, validation, error handling, and documentation",
+  },
+];
-const outputMessage = {
+const outputMessage = {
   text: `class Example:
     def __init__(self, value):
-        self.value = value`
+        self.value = value`,
 };
 const result = await scorer.run({
@@ -512,26 +552,30 @@ The output receives a low score because it only partially fulfills the requireme
 Evaluates how well the response addresses the user's request, ignoring system instructions:
-```typescript filename="src/example-user-mode.ts" showLineNumbers copy
-const scorer = createPromptAlignmentScorerLLM({
-  model: 'openai/gpt-4o-mini',
-  options: { evaluationMode: 'user' }
+```typescript title="src/example-user-mode.ts" showLineNumbers copy
+const scorer = createPromptAlignmentScorerLLM({
+  model: "openai/gpt-4o-mini",
+  options: { evaluationMode: "user" },
 });
 const result = await scorer.run({
   input: {
-    inputMessages: [{
-      role: 'user',
-      content: "Explain recursion with an example"
-    }],
-    systemMessages: [{
-      role: 'system',
-      content: "Always provide code examples in Python"
-    }]
+    inputMessages: [
+      {
+        role: "user",
+        content: "Explain recursion with an example",
+      },
+    ],
+    systemMessages: [
+      {
+        role: "system",
+        content: "Always provide code examples in Python",
+      },
+    ],
+  },
+  output: {
+    text: "Recursion is when a function calls itself. For example: factorial(5) = 5 * factorial(4)",
   },
-  output: {
-    text: "Recursion is when a function calls itself. For example: factorial(5) = 5 * factorial(4)"
-  }
 });
 // Scores high for addressing user request, even without Python code
 ```
@@ -540,26 +584,31 @@ const result = await scorer.run({
 Evaluates compliance with system behavioral guidelines and constraints:
-```typescript filename="src/example-system-mode.ts" showLineNumbers copy
-const scorer = createPromptAlignmentScorerLLM({
-  model: 'openai/gpt-4o-mini',
-  options: { evaluationMode: 'system' }
+```typescript title="src/example-system-mode.ts" showLineNumbers copy
+const scorer = createPromptAlignmentScorerLLM({
+  model: "openai/gpt-4o-mini",
+  options: { evaluationMode: "system" },
 });
 const result = await scorer.run({
   input: {
-    systemMessages: [{
-      role: 'system',
-      content: "You are a helpful assistant. Always be polite, concise, and provide examples."
-    }],
-    inputMessages: [{
-      role: 'user',
-      content: "What is machine learning?"
-    }]
+    systemMessages: [
+      {
+        role: "system",
+        content:
+          "You are a helpful assistant. Always be polite, concise, and provide examples.",
+      },
+    ],
+    inputMessages: [
+      {
+        role: "user",
+        content: "What is machine learning?",
+      },
+    ],
+  },
+  output: {
+    text: "Machine learning is a subset of AI where computers learn from data. For example, spam filters learn to identify unwanted emails by analyzing patterns in previously marked spam.",
   },
-  output: {
-    text: "Machine learning is a subset of AI where computers learn from data. For example, spam filters learn to identify unwanted emails by analyzing patterns in previously marked spam."
-  }
 });
 // Evaluates politeness, conciseness, and example provision
 ```
@@ -568,47 +617,52 @@ const result = await scorer.run({
 Evaluates both user intent fulfillment and system compliance with weighted scoring (70% user, 30% system):
-```typescript filename="src/example-both-mode.ts" showLineNumbers copy
-const scorer = createPromptAlignmentScorerLLM({
-  model: 'openai/gpt-4o-mini',
-  options: { evaluationMode: 'both' } // This is the default
+```typescript title="src/example-both-mode.ts" showLineNumbers copy
+const scorer = createPromptAlignmentScorerLLM({
+  model: "openai/gpt-4o-mini",
+  options: { evaluationMode: "both" }, // This is the default
 });
 const result = await scorer.run({
   input: {
-    systemMessages: [{
-      role: 'system',
-      content: "Always provide code examples when explaining programming concepts"
-    }],
-    inputMessages: [{
-      role: 'user',
-      content: "Explain how to reverse a string"
-    }]
+    systemMessages: [
+      {
+        role: "system",
+        content:
+          "Always provide code examples when explaining programming concepts",
+      },
+    ],
+    inputMessages: [
+      {
+        role: "user",
+        content: "Explain how to reverse a string",
+      },
+    ],
   },
-  output: {
+  output: {
     text: `To reverse a string, you can iterate through it backwards. Here's an example in Python:
     def reverse_string(s):
         return s[::-1]
-    # Usage: reverse_string("hello") returns "olleh"`
-  }
+    # Usage: reverse_string("hello") returns "olleh"`,
+  },
 });
 // High score for both addressing the user's request AND following system guidelines
 ```
 ## Comparison with Other Scorers
-| Aspect | Prompt Alignment | Answer Relevancy | Faithfulness |
-|--------|------------------|------------------|--------------|
-| **Focus** | Multi-dimensional prompt adherence | Query-response relevance | Context groundedness |
+| Aspect         | Prompt Alignment                           | Answer Relevancy             | Faithfulness                     |
+| -------------- | ------------------------------------------ | ---------------------------- | -------------------------------- |
+| **Focus**      | Multi-dimensional prompt adherence         | Query-response relevance     | Context groundedness             |
 | **Evaluation** | Intent, requirements, completeness, format | Semantic similarity to query | Factual consistency with context |
-| **Use Case** | General prompt following | Information retrieval | RAG/context-based systems |
-| **Dimensions** | 4 weighted dimensions | Single relevance dimension | Single faithfulness dimension |
+| **Use Case**   | General prompt following                   | Information retrieval        | RAG/context-based systems        |
+| **Dimensions** | 4 weighted dimensions                      | Single relevance dimension   | Single faithfulness dimension    |
 ## Related
 - [Answer Relevancy Scorer](/reference/scorers/answer-relevancy) - Evaluates query-response relevance
 - [Faithfulness Scorer](/reference/scorers/faithfulness) - Measures context groundedness
 - [Tool Call Accuracy Scorer](/reference/scorers/tool-call-accuracy) - Evaluates tool selection
-- [Custom Scorers](/docs/scorers/custom-scorers) - Creating your own evaluation metrics
+- [Custom Scorers](/docs/scorers/custom-scorers) - Creating your own evaluation metrics