@mastra/mcp-docs-server 0.13.39 → 1.0.0-beta.1

Files changed (494)

package/.docs/raw/reference/evals/answer-relevancy.mdx CHANGED

@@ -1,189 +1,227 @@
 ---
-title: "Reference: AnswerRelevancyMetric | Evals | Mastra Docs"
-description: Documentation for the Answer Relevancy Metric in Mastra, which evaluates how well LLM outputs address the input query.
+title: "Reference: Answer Relevancy Scorer | Evals | Mastra Docs"
+description: Documentation for the Answer Relevancy Scorer in Mastra, which evaluates how well LLM outputs address the input query.
 ---
-# AnswerRelevancyMetric
+# Answer Relevancy Scorer
-:::info Scorers
-This documentation refers to the legacy evals API. For the latest scorer features, see [Scorers](/docs/scorers/overview).
-:::
+The `createAnswerRelevancyScorer()` function accepts a single options object with the following properties:
-The `AnswerRelevancyMetric` class evaluates how well an LLM's output answers or addresses the input query. It uses a judge-based system to determine relevancy and provides detailed scoring and reasoning.
-## Basic Usage
-```typescript
-import { openai } from "@ai-sdk/openai";
-import { AnswerRelevancyMetric } from "@mastra/evals/llm";
-// Configure the model for evaluation
-const model = openai("gpt-4o-mini");
-const metric = new AnswerRelevancyMetric(model, {
-  uncertaintyWeight: 0.3,
-  scale: 1,
-});
-const result = await metric.measure(
-  "What is the capital of France?",
-  "Paris is the capital of France.",
-);
-console.log(result.score); // Score from 0-1
-console.log(result.info.reason); // Explanation of the score
-```
-## Constructor Parameters
+## Parameters
 <PropertiesTable
   content={[
     {
       name: "model",
       type: "LanguageModel",
-      description: "Configuration for the model used to evaluate relevancy",
-      isOptional: false,
+      required: true,
+      description: "Configuration for the model used to evaluate relevancy.",
     },
-    {
-      name: "options",
-      type: "AnswerRelevancyMetricOptions",
-      description: "Configuration options for the metric",
-      isOptional: true,
-      defaultValue: "{ uncertaintyWeight: 0.3, scale: 1 }",
-    },
-  ]}
-/>
-### AnswerRelevancyMetricOptions
-<PropertiesTable
-  content={[
     {
       name: "uncertaintyWeight",
       type: "number",
-      description: "Weight given to 'unsure' verdicts in scoring (0-1)",
-      isOptional: true,
+      required: false,
       defaultValue: "0.3",
+      description: "Weight given to 'unsure' verdicts in scoring (0-1).",
     },
     {
       name: "scale",
       type: "number",
-      description: "Maximum score value",
-      isOptional: true,
+      required: false,
       defaultValue: "1",
+      description: "Maximum score value.",
     },
   ]}
 />
-## measure() Parameters
+This function returns an instance of the MastraScorer class. The `.run()` method accepts the same input as other scorers (see the [MastraScorer reference](./mastra-scorer)), but the return value includes LLM-specific fields as documented below.
+## .run() Returns
 <PropertiesTable
   content={[
     {
-      name: "input",
+      name: "runId",
       type: "string",
-      description: "The original query or prompt",
-      isOptional: false,
+      description: "The id of the run (optional).",
     },
-    {
-      name: "output",
-      type: "string",
-      description: "The LLM's response to evaluate",
-      isOptional: false,
-    },
-  ]}
-/>
-## Returns
-<PropertiesTable
-  content={[
     {
       name: "score",
       type: "number",
       description: "Relevancy score (0 to scale, default 0-1)",
     },
     {
-      name: "info",
+      name: "preprocessPrompt",
+      type: "string",
+      description:
+        "The prompt sent to the LLM for the preprocess step (optional).",
+    },
+    {
+      name: "preprocessStepResult",
       type: "object",
-      description: "Object containing the reason for the score",
-      properties: [
-        {
-          type: "string",
-          parameters: [
-            {
-              name: "reason",
-              type: "string",
-              description: "Explanation of the score",
-            },
-          ],
-        },
-      ],
+      description: "Object with extracted statements: { statements: string[] }",
+    },
+    {
+      name: "analyzePrompt",
+      type: "string",
+      description:
+        "The prompt sent to the LLM for the analyze step (optional).",
+    },
+    {
+      name: "analyzeStepResult",
+      type: "object",
+      description:
+        "Object with results: { results: Array<{ result: 'yes' | 'unsure' | 'no', reason: string }> }",
+    },
+    {
+      name: "generateReasonPrompt",
+      type: "string",
+      description: "The prompt sent to the LLM for the reason step (optional).",
+    },
+    {
+      name: "reason",
+      type: "string",
+      description: "Explanation of the score.",
     },
   ]}
 />
 ## Scoring Details
-The metric evaluates relevancy through query-answer alignment, considering completeness, accuracy, and detail level.
+The scorer evaluates relevancy through query-answer alignment, considering completeness and detail level, but not factual correctness.
 ### Scoring Process
-1. Statement Analysis:
-   - Breaks output into meaningful statements while preserving context
-   - Evaluates each statement against query requirements
+1. **Statement Preprocess:**
+   - Breaks output into meaningful statements while preserving context.
+2. **Relevance Analysis:**
+   - Each statement is evaluated as:
+     - "yes": Full weight for direct matches
+     - "unsure": Partial weight (default: 0.3) for approximate matches
+     - "no": Zero weight for irrelevant content
+3. **Score Calculation:**
+   - `((direct + uncertainty * partial) / total_statements) * scale`
+### Score Interpretation
+A relevancy score between 0 and 1:
+- **1.0**: The response fully answers the query with relevant and focused information.
+- **0.7–0.9**: The response mostly answers the query but may include minor unrelated content.
+- **0.4–0.6**: The response partially answers the query, mixing relevant and unrelated information.
+- **0.1–0.3**: The response includes minimal relevant content and largely misses the intent of the query.
+- **0.0**: The response is entirely unrelated and does not answer the query.
+## Examples
+### High relevancy example
+In this example, the response accurately addresses the input query with specific and relevant information.
+```typescript title="src/example-high-answer-relevancy.ts" showLineNumbers copy
+import { createAnswerRelevancyScorer } from "@mastra/evals/scorers/prebuilt";
+const scorer = createAnswerRelevancyScorer({ model: "openai/gpt-4o-mini" });
+const inputMessages = [
+  {
+    role: "user",
+    content: "What are the health benefits of regular exercise?",
+  },
+];
+const outputMessage = {
+  text: "Regular exercise improves cardiovascular health, strengthens muscles, boosts metabolism, and enhances mental well-being through the release of endorphins.",
+};
+const result = await scorer.run({
+  input: inputMessages,
+  output: outputMessage,
+});
+console.log(result);
+```
+#### High relevancy output
-2. Evaluates relevance of each statement:
-   - "yes": Full weight for direct matches
-   - "unsure": Partial weight (default: 0.3) for approximate matches
-   - "no": Zero weight for irrelevant content
+The output receives a high score because it accurately answers the query without including unrelated information.
-Final score: `((direct + uncertainty * partial) / total_statements) * scale`
+```typescript
+{
+  score: 1,
+  reason: 'The score is 1 because the output directly addresses the question by providing multiple explicit health benefits of regular exercise, including improvements in cardiovascular health, muscle strength, metabolism, and mental well-being. Each point is relevant and contributes to a comprehensive understanding of the health benefits.'
+}
+```
+### Partial relevancy example
+In this example, the response addresses the query in part but includes additional information that isn't directly relevant.
+```typescript title="src/example-partial-answer-relevancy.ts" showLineNumbers copy
+import { createAnswerRelevancyScorer } from "@mastra/evals/scorers/prebuilt";
-### Score interpretation
+const scorer = createAnswerRelevancyScorer({ model: "openai/gpt-4o-mini" });
-(0 to scale, default 0-1)
+const inputMessages = [
+  { role: "user", content: "What should a healthy breakfast include?" },
+];
+const outputMessage = {
+  text: "A nutritious breakfast should include whole grains and protein. However, the timing of your breakfast is just as important - studies show eating within 2 hours of waking optimizes metabolism and energy levels throughout the day.",
+};
-- 1.0: Perfect relevance - complete and accurate
-- 0.7-0.9: High relevance - minor gaps or imprecisions
-- 0.4-0.6: Moderate relevance - significant gaps
-- 0.1-0.3: Low relevance - major issues
-- 0.0: No relevance - incorrect or off-topic
+const result = await scorer.run({
+  input: inputMessages,
+  output: outputMessage,
+});
+console.log(result);
+```
-## Example with Custom Configuration
+#### Partial relevancy output
+The output receives a lower score because it partially answers the query. While some relevant information is included, unrelated details reduce the overall relevance.
 ```typescript
-import { openai } from "@ai-sdk/openai";
-import { AnswerRelevancyMetric } from "@mastra/evals/llm";
+{
+  score: 0.25,
+  reason: 'The score is 0.25 because the output provides a direct answer by mentioning whole grains and protein as components of a healthy breakfast, which is relevant. However, the additional information about the timing of breakfast and its effects on metabolism and energy levels is not directly related to the question, leading to a lower overall relevance score.'
+}
+```
-// Configure the model for evaluation
-const model = openai("gpt-4o-mini");
+### Low relevancy example
-const metric = new AnswerRelevancyMetric(model, {
-  uncertaintyWeight: 0.5, // Higher weight for uncertain verdicts
-  scale: 5, // Use 0-5 scale instead of 0-1
+In this example, the response does not address the query and contains information that is entirely unrelated.
+```typescript title="src/example-low-answer-relevancy.ts" showLineNumbers copy
+import { createAnswerRelevancyScorer } from "@mastra/evals/scorers/prebuilt";
+const scorer = createAnswerRelevancyScorer({ model: "openai/gpt-4o-mini" });
+const inputMessages = [
+  { role: "user", content: "What are the benefits of meditation?" },
+];
+const outputMessage = {
+  text: "The Great Wall of China is over 13,000 miles long and was built during the Ming Dynasty to protect against invasions.",
+};
+const result = await scorer.run({
+  input: inputMessages,
+  output: outputMessage,
 });
-const result = await metric.measure(
-  "What are the benefits of exercise?",
-  "Regular exercise improves cardiovascular health, builds strength, and boosts mental wellbeing.",
-);
-// Example output:
-// {
-//   score: 4.5,
-//   info: {
-//     reason: "The score is 4.5 out of 5 because the response directly addresses the query
-//           with specific, accurate benefits of exercise. It covers multiple aspects
-//           (cardiovascular, muscular, and mental health) in a clear and concise manner.
-//           The answer is highly relevant and provides appropriate detail without
-//           including unnecessary information."
-//   }
-// }
+console.log(result);
+```
+#### Low relevancy output
+The output receives a score of 0 because it fails to answer the query or provide any relevant information.
+```typescript
+{
+  score: 0,
+  reason: 'The score is 0 because the output about the Great Wall of China is completely unrelated to the benefits of meditation, providing no relevant information or context that addresses the input question.'
+}
 ```
 ## Related
-- [Prompt Alignment Metric](./prompt-alignment)
-- [Context Precision Metric](./context-precision)
-- [Faithfulness Metric](./faithfulness)
+- [Faithfulness Scorer](./faithfulness)
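
The score calculation documented in the answer-relevancy diff above, `((direct + uncertainty * partial) / total_statements) * scale`, can be checked by hand with a small sketch. The `Verdict` type and the `relevancyScore` helper below are hypothetical names used only for illustration; they are not exports of `@mastra/evals`, and returning 0 for an empty verdict list is an assumption rather than documented behavior.

```typescript
// Hypothetical helper, not part of @mastra/evals: it only reproduces the
// documented formula so the arithmetic can be verified by hand.
type Verdict = "yes" | "unsure" | "no";

function relevancyScore(
  verdicts: Verdict[],
  uncertaintyWeight = 0.3, // documented default for "unsure" verdicts
  scale = 1, // documented default maximum score
): number {
  // Assumption: an output with no extracted statements scores 0.
  if (verdicts.length === 0) return 0;

  // "yes" verdicts count with full weight, "unsure" with partial weight,
  // and "no" verdicts contribute nothing.
  const direct = verdicts.filter((v) => v === "yes").length;
  const partial = verdicts.filter((v) => v === "unsure").length;

  return ((direct + uncertaintyWeight * partial) / verdicts.length) * scale;
}

// One "yes" and one "unsure" with weight 0.5 on a 0-5 scale:
// ((1 + 0.5 * 1) / 2) * 5 = 3.75
console.log(relevancyScore(["yes", "unsure"], 0.5, 5));
```

With the defaults, four statements judged "yes", "unsure", "no", "no" give (1 + 0.3) / 4 = 0.325, which falls in the 0.1–0.3 to 0.4–0.6 boundary region of the interpretation bands.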

package/.docs/raw/reference/{scorers → evals}/answer-similarity.mdx RENAMED

@@ -1,5 +1,5 @@
 ---
-title: "Reference: Answer Similarity Scorer | Scorers | Mastra Docs"
+title: "Reference: Answer Similarity Scorer | Evals | Mastra Docs"
 description: Documentation for the Answer Similarity Scorer in Mastra, which compares agent outputs against ground truth answers for CI/CD testing.
 ---
@@ -151,17 +151,17 @@ Score calculation: `max(0, base_score - contradiction_penalty - missing_penalty
 ## Examples
-### Usage with runExperiment
+### Usage with runEvals
-This scorer is designed for use with `runExperiment` for CI/CD testing:
+This scorer is designed for use with `runEvals` for CI/CD testing:
 ```typescript
-import { runExperiment } from "@mastra/core/scores";
-import { createAnswerSimilarityScorer } from "@mastra/evals/scorers/llm";
+import { runEvals } from "@mastra/core/evals";
+import { createAnswerSimilarityScorer } from "@mastra/evals/scorers/prebuilt";
 const scorer = createAnswerSimilarityScorer({ model });
-await runExperiment({
+await runEvals({
   data: [
     {
       input: "What is the capital of France?",
@@ -184,13 +184,13 @@ await runExperiment({
 In this example, the agent's output semantically matches the ground truth perfectly.
 ```typescript title="src/example-perfect-similarity.ts" showLineNumbers copy
-import { runExperiment } from "@mastra/core/scores";
-import { createAnswerSimilarityScorer } from "@mastra/evals/scorers/llm";
+import { runEvals } from "@mastra/core/evals";
+import { createAnswerSimilarityScorer } from "@mastra/evals/scorers/prebuilt";
 import { myAgent } from "./agent";
 const scorer = createAnswerSimilarityScorer({ model: "openai/gpt-4o-mini" });
-const result = await runExperiment({
+const result = await runEvals({
   data: [
     {
       input: "What is 2+2?",
@@ -222,13 +222,13 @@ The output receives a perfect score because both the agent's answer and ground t
 In this example, the agent provides the same information as the ground truth but with different phrasing.
 ```typescript title="src/example-semantic-similarity.ts" showLineNumbers copy
-import { runExperiment } from "@mastra/core/scores";
-import { createAnswerSimilarityScorer } from "@mastra/evals/scorers/llm";
+import { runEvals } from "@mastra/core/evals";
+import { createAnswerSimilarityScorer } from "@mastra/evals/scorers/prebuilt";
 import { myAgent } from "./agent";
 const scorer = createAnswerSimilarityScorer({ model: "openai/gpt-4o-mini" });
-const result = await runExperiment({
+const result = await runEvals({
   data: [
     {
       input: "What is the capital of France?",
@@ -260,13 +260,13 @@ The output receives a high score because it conveys the same information with eq
 In this example, the agent's response is partially correct but missing key information.
 ```typescript title="src/example-partial-similarity.ts" showLineNumbers copy
-import { runExperiment } from "@mastra/core/scores";
-import { createAnswerSimilarityScorer } from "@mastra/evals/scorers/llm";
+import { runEvals } from "@mastra/core/evals";
+import { createAnswerSimilarityScorer } from "@mastra/evals/scorers/prebuilt";
 import { myAgent } from "./agent";
 const scorer = createAnswerSimilarityScorer({ model: "openai/gpt-4o-mini" });
-const result = await runExperiment({
+const result = await runEvals({
   data: [
     {
       input: "What are the primary colors?",
@@ -298,13 +298,13 @@ The output receives a moderate score because it includes some correct informatio
 In this example, the agent provides factually incorrect information that contradicts the ground truth.
 ```typescript title="src/example-contradiction.ts" showLineNumbers copy
-import { runExperiment } from "@mastra/core/scores";
-import { createAnswerSimilarityScorer } from "@mastra/evals/scorers/llm";
+import { runEvals } from "@mastra/core/evals";
+import { createAnswerSimilarityScorer } from "@mastra/evals/scorers/prebuilt";
 import { myAgent } from "./agent";
 const scorer = createAnswerSimilarityScorer({ model: "openai/gpt-4o-mini" });
-const result = await runExperiment({
+const result = await runEvals({
   data: [
     {
       input: "Who wrote Romeo and Juliet?",
@@ -337,15 +337,15 @@ Use the scorer in your test suites to ensure agent consistency over time:
 ```typescript title="src/ci-integration.test.ts" showLineNumbers copy
 import { describe, it, expect } from "vitest";
-import { runExperiment } from "@mastra/core/scores";
-import { createAnswerSimilarityScorer } from "@mastra/evals/scorers/llm";
+import { runEvals } from "@mastra/core/evals";
+import { createAnswerSimilarityScorer } from "@mastra/evals/scorers/prebuilt";
 import { myAgent } from "./agent";
 describe("Agent Consistency Tests", () => {
   const scorer = createAnswerSimilarityScorer({ model: "openai/gpt-4o-mini" });
   it("should provide accurate factual answers", async () => {
-    const result = await runExperiment({
+    const result = await runEvals({
       data: [
         {
           input: "What is the speed of light?",
@@ -376,9 +376,9 @@ describe("Agent Consistency Tests", () => {
     // Run multiple times to check consistency
     const results = await Promise.all([
-      runExperiment({ data: [testData], scorers: [scorer], target: myAgent }),
-      runExperiment({ data: [testData], scorers: [scorer], target: myAgent }),
-      runExperiment({ data: [testData], scorers: [scorer], target: myAgent }),
+      runEvals({ data: [testData], scorers: [scorer], target: myAgent }),
+      runEvals({ data: [testData], scorers: [scorer], target: myAgent }),
+      runEvals({ data: [testData], scorers: [scorer], target: myAgent }),
     ]);
     // Check that all runs produce similar scores (within 0.1 tolerance)
@@ -396,8 +396,8 @@ describe("Agent Consistency Tests", () => {
 Customize the scorer behavior for specific use cases:
 ```typescript title="src/custom-config.ts" showLineNumbers copy
-import { runExperiment } from "@mastra/core/scores";
-import { createAnswerSimilarityScorer } from "@mastra/evals/scorers/llm";
+import { runEvals } from "@mastra/core/evals";
+import { createAnswerSimilarityScorer } from "@mastra/evals/scorers/prebuilt";
 import { myAgent } from "./agent";
 // Configure for strict exact matching with high scale
@@ -422,7 +422,7 @@ const lenientScorer = createAnswerSimilarityScorer({
   },
 });
-const result = await runExperiment({
+const result = await runEvals({
   data: [
     {
       input: "Explain photosynthesis",