npm - @mastra/core - Versions diffs - 1.6.0 → 1.7.0 - Mend

@mastra/core 1.6.0 → 1.7.0

Files changed (453)

package/dist/docs/references/docs-evals-custom-scorers.md DELETED

@@ -1,519 +0,0 @@
-# Custom scorers
-Mastra provides a unified `createScorer` factory that allows you to build custom evaluation logic using either JavaScript functions or LLM-based prompt objects for each step. This flexibility lets you choose the best approach for each part of your evaluation pipeline.
-### The Four-Step Pipeline
-All scorers in Mastra follow a consistent four-step evaluation pipeline:
-1. **preprocess** (optional): Prepare or transform input/output data
-2. **analyze** (optional): Perform evaluation analysis and gather insights
-3. **generateScore** (required): Convert analysis into a numerical score
-4. **generateReason** (optional): Generate human-readable explanations
-Each step can use either **functions** or **prompt objects** (LLM-based evaluation), giving you the flexibility to combine deterministic algorithms with AI judgment as needed.
-### Functions vs Prompt Objects
-**Functions** use JavaScript for deterministic logic. They're ideal for:
-- Algorithmic evaluations with clear criteria
-- Performance-critical scenarios
-- Integration with existing libraries
-- Consistent, reproducible results
-**Prompt Objects** use LLMs as judges for evaluation. They're perfect for:
-- Subjective evaluations requiring human-like judgment
-- Complex criteria difficult to code algorithmically
-- Natural language understanding tasks
-- Nuanced context evaluation
-**What “prompt object” means:** Instead of a function, the step is an object with `description` + `createPrompt` (and `outputSchema` for `preprocess`/`analyze`). That object tells Mastra to run the judge LLM for the step and store the structured output in `results.<step>StepResult`.
-You can mix and match approaches within a single scorer - for example, use a function for preprocessing data and an LLM for analyzing quality.
-### Initializing a Scorer
-Every scorer starts with the `createScorer` factory function, which requires an id and description, and optionally accepts a type specification and judge configuration.
-```typescript
-import { createScorer } from '@mastra/core/evals';
-const glutenCheckerScorer = createScorer({
-  id: 'gluten-checker',
-  description: 'Check if recipes contain gluten ingredients',
-  judge: {                    // Optional: for prompt object steps
-    model: 'openai/gpt-5.1',
-    instructions: 'You are a Chef that identifies if recipes contain gluten.'
-  }
-})
-// Chain step methods here
-.preprocess(...)
-.analyze(...)
-.generateScore(...)
-.generateReason(...)
-```
-The judge configuration is only needed if you plan to use prompt objects in any step. Individual steps can override this default configuration with their own judge settings.
-If all steps are function-based, the judge is never called and there is no judge output. To see LLM output, define at least one step as a prompt object and read the corresponding step result (for example, `results.analyzeStepResult`).
-#### Minimal judge example (prompt object)
-This example uses a prompt object in `analyze`, so the judge runs and its structured output is available as `results.analyzeStepResult`.
-```typescript
-import { createScorer } from "@mastra/core/evals";
-import { z } from "zod";
-const quoteSourcesScorer = createScorer({
-  id: "quote-sources",
-  description: "Check if the response includes sources",
-  judge: {
-    model: "openai/gpt-4.1-nano",
-    instructions: "You are a strict evaluator.",
-  },
-})
-  .analyze({
-    description: "Detect whether sources are present",
-    outputSchema: z.object({
-      hasSources: z.boolean(),
-      sources: z.array(z.string()),
-    }),
-    createPrompt: ({ run }) => `
-Does the response contain sources? Extract them as a list.
-Response:
-${run.output}
-`,
-  })
-  .generateScore(({ results }) => (results.analyzeStepResult.hasSources ? 1 : 0));
-// Run the scorer and inspect judge output
-const result = await quoteSourcesScorer.run({
-  input: "What is the capital of France?",
-  output: "Paris is the capital of France [1]. Source: [1] Wikipedia",
-});
-console.log(result.score);              // 1
-console.log(result.analyzeStepResult);  // { hasSources: true, sources: ["Wikipedia"] }
-```
-#### Agent Type for Agent Evaluation
-For type safety and compatibility with both live agent scoring and trace scoring, use `type: 'agent'` when creating scorers for agent evaluation. This allows you to use the same scorer for an agent and also use it to score traces:
-```typescript
-const myScorer = createScorer({
-  type: "agent", // Automatically handles agent input/output types
-}).generateScore(({ run, results }) => {
-  // run.output is automatically typed as ScorerRunOutputForAgent
-  // run.input is automatically typed as ScorerRunInputForAgent
-});
-```
-### Step-by-Step Breakdown
-#### preprocess Step (Optional)
-Prepares input/output data when you need to extract specific elements, filter content, or transform complex data structures.
-**Functions:** `({ run, results }) => any`
-```typescript
-const glutenCheckerScorer = createScorer(...)
-.preprocess(({ run }) => {
-  // Extract and clean recipe text
-  const recipeText = run.output.text.toLowerCase();
-  const wordCount = recipeText.split(' ').length;
-  return {
-    recipeText,
-    wordCount,
-    hasCommonGlutenWords: /flour|wheat|bread|pasta/.test(recipeText)
-  };
-})
-```
-**Prompt Objects:** Use `description`, `outputSchema`, and `createPrompt` to structure LLM-based preprocessing.
-```typescript
-const glutenCheckerScorer = createScorer(...)
-.preprocess({
-  description: 'Extract ingredients from the recipe',
-  outputSchema: z.object({
-    ingredients: z.array(z.string()),
-    cookingMethods: z.array(z.string())
-  }),
-  createPrompt: ({ run }) => `
-    Extract all ingredients and cooking methods from this recipe:
-    ${run.output.text}
-    Return JSON with ingredients and cookingMethods arrays.
-  `
-})
-```
-**Data Flow:** Results are available to subsequent steps as `results.preprocessStepResult`
-#### analyze Step (Optional)
-Performs core evaluation analysis, gathering insights that will inform the scoring decision.
-**Functions:** `({ run, results }) => any`
-```typescript
-const glutenCheckerScorer = createScorer({...})
-.preprocess(...)
-.analyze(({ run, results }) => {
-  const { recipeText, hasCommonGlutenWords } = results.preprocessStepResult;
-  // Simple gluten detection algorithm
-  const glutenKeywords = ['wheat', 'flour', 'barley', 'rye', 'bread'];
-  const foundGlutenWords = glutenKeywords.filter(word =>
-    recipeText.includes(word)
-  );
-  return {
-    isGlutenFree: foundGlutenWords.length === 0,
-    detectedGlutenSources: foundGlutenWords,
-    confidence: hasCommonGlutenWords ? 0.9 : 0.7
-  };
-})
-```
-**Prompt Objects:** Use `description`, `outputSchema`, and `createPrompt` for LLM-based analysis.
-```typescript
-const glutenCheckerScorer = createScorer({...})
-.preprocess(...)
-.analyze({
-  description: 'Analyze recipe for gluten content',
-  outputSchema: z.object({
-    isGlutenFree: z.boolean(),
-    glutenSources: z.array(z.string()),
-    confidence: z.number().min(0).max(1)
-  }),
-  createPrompt: ({ run, results }) => `
-    Analyze this recipe for gluten content:
-    "${results.preprocessStepResult.recipeText}"
-    Look for wheat, barley, rye, and hidden sources like soy sauce.
-    Return JSON with isGlutenFree, glutenSources array, and confidence (0-1).
-  `
-})
-```
-**Data Flow:** Results are available to subsequent steps as `results.analyzeStepResult`
-#### generateScore Step (Required)
-Converts analysis results into a numerical score. This is the only required step in the pipeline.
-**Functions:** `({ run, results }) => number`
-```typescript
-const glutenCheckerScorer = createScorer({...})
-.preprocess(...)
-.analyze(...)
-.generateScore(({ results }) => {
-  const { isGlutenFree, confidence } = results.analyzeStepResult;
-  // Return 1 for gluten-free, 0 for contains gluten
-  // Weight by confidence level
-  return isGlutenFree ? confidence : 0;
-})
-```
-**Prompt Objects:** See the [`createScorer`](https://mastra.ai/reference/evals/create-scorer) API reference for details on using prompt objects with `generateScore`, including the required `calculateScore` function.
-**Data Flow:** The score is available to generateReason as the `score` parameter
-#### generateReason Step (Optional)
-Generates human-readable explanations for the score, useful for debugging, transparency, or user feedback.
-**Functions:** `({ run, results, score }) => string`
-```typescript
-const glutenCheckerScorer = createScorer({...})
-.preprocess(...)
-.analyze(...)
-.generateScore(...)
-.generateReason(({ results, score }) => {
-  const { isGlutenFree, glutenSources } = results.analyzeStepResult;
-  if (isGlutenFree) {
-    return `Score: ${score}. This recipe is gluten-free with no harmful ingredients detected.`;
-  } else {
-    return `Score: ${score}. Contains gluten from: ${glutenSources.join(', ')}`;
-  }
-})
-```
-**Prompt Objects:** Use `description` and `createPrompt` for LLM-generated explanations.
-```typescript
-const glutenCheckerScorer = createScorer({...})
-.preprocess(...)
-.analyze(...)
-.generateScore(...)
-.generateReason({
-  description: 'Explain the gluten assessment',
-  createPrompt: ({ results, score }) => `
-    Explain why this recipe received a score of ${score}.
-    Analysis: ${JSON.stringify(results.analyzeStepResult)}
-    Provide a clear explanation for someone with dietary restrictions.
-  `
-})
-```
-## Example: Create a custom scorer
-A custom scorer in Mastra uses `createScorer` with four core components:
-1. [**Judge Configuration**](#judge-configuration)
-2. [**Analysis Step**](#analysis-step)
-3. [**Score Generation**](#score-generation)
-4. [**Reason Generation**](#reason-generation)
-Together, these components allow you to define custom evaluation logic using LLMs as judges.
-> **Info:** Visit [createScorer](https://mastra.ai/reference/evals/create-scorer) for the full API and configuration options.
-```typescript
-import { createScorer } from "@mastra/core/evals";
-import { z } from "zod";
-export const GLUTEN_INSTRUCTIONS = `You are a Chef that identifies if recipes contain gluten.`;
-export const generateGlutenPrompt = ({
-  output,
-}: {
-  output: string;
-}) => `Check if this recipe is gluten-free.
-Check for:
-- Wheat
-- Barley
-- Rye
-- Common sources like flour, pasta, bread
-Example with gluten:
-"Mix flour and water to make dough"
-Response: {
-  "isGlutenFree": false,
-  "glutenSources": ["flour"]
-}
-Example gluten-free:
-"Mix rice, beans, and vegetables"
-Response: {
-  "isGlutenFree": true,
-  "glutenSources": []
-}
-Recipe to analyze:
-${output}
-Return your response in this format:
-{
-  "isGlutenFree": boolean,
-  "glutenSources": ["list ingredients containing gluten"]
-}`;
-export const generateReasonPrompt = ({
-  isGlutenFree,
-  glutenSources,
-}: {
-  isGlutenFree: boolean;
-  glutenSources: string[];
-}) => `Explain why this recipe is${isGlutenFree ? "" : " not"} gluten-free.
-${glutenSources.length > 0 ? `Sources of gluten: ${glutenSources.join(", ")}` : "No gluten-containing ingredients found"}
-Return your response in this format:
-"This recipe is [gluten-free/contains gluten] because [explanation]"`;
-export const glutenCheckerScorer = createScorer({
-  id: "gluten-checker",
-  description: "Check if the output contains any gluten",
-  judge: {
-    model: "openai/gpt-4.1-nano",
-    instructions: GLUTEN_INSTRUCTIONS,
-  },
-})
-  .analyze({
-    description: "Analyze the output for gluten",
-    outputSchema: z.object({
-      isGlutenFree: z.boolean(),
-      glutenSources: z.array(z.string()),
-    }),
-    createPrompt: ({ run }) => {
-      const { output } = run;
-      return generateGlutenPrompt({ output: output.text });
-    },
-  })
-  .generateScore(({ results }) => {
-    return results.analyzeStepResult.isGlutenFree ? 1 : 0;
-  })
-  .generateReason({
-    description: "Generate a reason for the score",
-    createPrompt: ({ results }) => {
-      return generateReasonPrompt({
-        glutenSources: results.analyzeStepResult.glutenSources,
-        isGlutenFree: results.analyzeStepResult.isGlutenFree,
-      });
-    },
-  });
-```
-### Judge Configuration
-Sets up the LLM model and defines its role as a domain expert.
-```typescript
-judge: {
-  model: 'openai/gpt-4.1-nano',
-  instructions: GLUTEN_INSTRUCTIONS,
-}
-```
-### Analysis Step
-Defines how the LLM should analyze the input and what structured output to return.
-```typescript
-.analyze({
-  description: 'Analyze the output for gluten',
-  outputSchema: z.object({
-    isGlutenFree: z.boolean(),
-    glutenSources: z.array(z.string()),
-  }),
-  createPrompt: ({ run }) => {
-    const { output } = run;
-    return generateGlutenPrompt({ output: output.text });
-  },
-})
-```
-The analysis step uses a prompt object to:
-- Provide a clear description of the analysis task
-- Define expected output structure with Zod schema (both boolean result and list of gluten sources)
-- Generate dynamic prompts based on the input content
-### Score Generation
-Converts the LLM's structured analysis into a numerical score.
-```typescript
-.generateScore(({ results }) => {
-  return results.analyzeStepResult.isGlutenFree ? 1 : 0;
-})
-```
-The score generation function takes the analysis results and applies business logic to produce a score. In this case, the LLM directly determines if the recipe is gluten-free, so we use that boolean result: 1 for gluten-free, 0 for contains gluten.
-### Reason Generation
-Provides human-readable explanations for the score using another LLM call.
-```typescript
-.generateReason({
-  description: 'Generate a reason for the score',
-  createPrompt: ({ results }) => {
-    return generateReasonPrompt({
-      glutenSources: results.analyzeStepResult.glutenSources,
-      isGlutenFree: results.analyzeStepResult.isGlutenFree,
-    });
-  },
-})
-```
-The reason generation step creates explanations that help users understand why a particular score was assigned, using both the boolean result and the specific gluten sources identified by the analysis step.
-## High gluten-free example
-```typescript
-const result = await glutenCheckerScorer.run({
-  input: [{ role: 'user', content: 'Mix rice, beans, and vegetables' }],
-  output: { text: 'Mix rice, beans, and vegetables' },
-});
-console.log('Score:', result.score);
-console.log('Gluten sources:', result.analyzeStepResult.glutenSources);
-console.log('Reason:', result.reason);
-```
-### High gluten-free output
-```typescript
-{
-  score: 1,
-  analyzeStepResult: {
-    isGlutenFree: true,
-    glutenSources: []
-  },
-  reason: 'This recipe is gluten-free because rice, beans, and vegetables are naturally gluten-free ingredients that are safe for people with celiac disease.'
-}
-```
-## Partial gluten example
-```typescript
-const result = await glutenCheckerScorer.run({
-  input: [{ role: "user", content: "Mix flour and water to make dough" }],
-  output: { text: "Mix flour and water to make dough" },
-});
-console.log("Score:", result.score);
-console.log("Gluten sources:", result.analyzeStepResult.glutenSources);
-console.log("Reason:", result.reason);
-```
-### Partial gluten output
-```typescript
-{
-  score: 0,
-  analyzeStepResult: {
-    isGlutenFree: false,
-    glutenSources: ['flour']
-  },
-  reason: 'This recipe is not gluten-free because it contains flour. Regular flour is made from wheat and contains gluten, making it unsafe for people with celiac disease or gluten sensitivity.'
-}
-```
-## Low gluten-free example
-```typescript
-const result = await glutenCheckerScorer.run({
-  input: [{ role: "user", content: "Add soy sauce and noodles" }],
-  output: { text: "Add soy sauce and noodles" },
-});
-console.log("Score:", result.score);
-console.log("Gluten sources:", result.analyzeStepResult.glutenSources);
-console.log("Reason:", result.reason);
-```
-### Low gluten-free output
-```typescript
-{
-  score: 0,
-  analyzeStepResult: {
-    isGlutenFree: false,
-    glutenSources: ['soy sauce', 'noodles']
-  },
-  reason: 'This recipe is not gluten-free because it contains soy sauce, noodles. Regular soy sauce contains wheat and most noodles are made from wheat flour, both of which contain gluten and are unsafe for people with gluten sensitivity.'
-}
-```
-**Examples and Resources:**
-- [createScorer API Reference](https://mastra.ai/reference/evals/create-scorer) - Complete technical documentation
-- [Built-in Scorers Source Code](https://github.com/mastra-ai/mastra/tree/main/packages/evals/src/scorers) - Real implementations for reference
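
The deleted guide above describes a four-step flow (preprocess, analyze, generateScore, generateReason) in which each step reads the results of the earlier ones. As a rough, library-free sketch of that data flow, written with plain functions only: this is not part of the package diff and not Mastra's implementation, and every name here (`runGlutenPipeline`, `GLUTEN_KEYWORDS`, the interfaces) is hypothetical. It assumes the output is a plain string and that simple substring matching is good enough for illustration.

```typescript
// Hypothetical sketch of the four-step scorer flow; none of these names are Mastra APIs.

interface Run {
  input: string;
  output: string;
}

interface Analysis {
  isGlutenFree: boolean;
  glutenSources: string[];
}

interface ScoreResult {
  score: number;
  reason: string;
  analyzeStepResult: Analysis;
}

// Assumed keyword list, loosely based on the words used in the deleted examples.
const GLUTEN_KEYWORDS = ["wheat", "flour", "barley", "rye", "bread", "pasta"];

// Step 1 (preprocess): normalize the text so matching is case-insensitive.
function preprocess(run: Run): { recipeText: string } {
  return { recipeText: run.output.toLowerCase() };
}

// Step 2 (analyze): collect every keyword that appears in the text.
function analyze(pre: { recipeText: string }): Analysis {
  const glutenSources = GLUTEN_KEYWORDS.filter((word) =>
    pre.recipeText.includes(word),
  );
  return { isGlutenFree: glutenSources.length === 0, glutenSources };
}

// Step 3 (generateScore): 1 for gluten-free, 0 otherwise.
function generateScore(analysis: Analysis): number {
  return analysis.isGlutenFree ? 1 : 0;
}

// Step 4 (generateReason): explain the score in plain language.
function generateReason(analysis: Analysis, score: number): string {
  return analysis.isGlutenFree
    ? `Score: ${score}. No gluten keywords detected.`
    : `Score: ${score}. Contains gluten from: ${analysis.glutenSources.join(", ")}`;
}

// Chain the steps, passing each result forward to the next one.
function runGlutenPipeline(run: Run): ScoreResult {
  const pre = preprocess(run);
  const analyzeStepResult = analyze(pre);
  const score = generateScore(analyzeStepResult);
  const reason = generateReason(analyzeStepResult, score);
  return { score, reason, analyzeStepResult };
}

const example = runGlutenPipeline({
  input: "How do I make dough?",
  output: "Mix flour and water to make dough",
});
console.log(example.score, example.analyzeStepResult.glutenSources);
```

Substring matching is deliberately naive: a word such as "fryer" contains "rye" and would be flagged. That kind of false positive is one reason the deleted guide suggests an LLM-based step for criteria that are hard to code algorithmically.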

package/dist/docs/references/docs-evals-overview.md DELETED

@@ -1,146 +0,0 @@
-# Scorers overview
-While traditional software tests have clear pass/fail conditions, AI outputs are non-deterministic — they can vary with the same input. **Scorers** help bridge this gap by providing quantifiable metrics for measuring agent quality.
-Scorers are automated tests that evaluate agent outputs using model-graded, rule-based, and statistical methods. Scorers return **scores**: numerical values (typically between 0 and 1) that quantify how well an output meets your evaluation criteria. These scores enable you to objectively track performance, compare different approaches, and identify areas for improvement in your AI systems. Scorers can be customized with your own prompts and scoring functions.
-Scorers can be run in the cloud, capturing real-time results. But scorers can also be part of your CI/CD pipeline, allowing you to test and monitor your agents over time.
-## Types of Scorers
-There are different kinds of scorers, each serving a specific purpose. Here are some common types:
-1. **Textual Scorers**: Evaluate accuracy, reliability, and context understanding of agent responses
-2. **Classification Scorers**: Measure accuracy in categorizing data based on predefined categories
-3. **Prompt Engineering Scorers**: Explore impact of different instructions and input formats
-## Installation
-To access Mastra's scorers feature, install the `@mastra/evals` package.
-**npm**:
-```bash
-npm install @mastra/evals@latest
-```
-**pnpm**:
-```bash
-pnpm add @mastra/evals@latest
-```
-**Yarn**:
-```bash
-yarn add @mastra/evals@latest
-```
-**Bun**:
-```bash
-bun add @mastra/evals@latest
-```
-## Live evaluations
-**Live evaluations** allow you to automatically score AI outputs in real-time as your agents and workflows operate. Instead of running evaluations manually or in batches, scorers run asynchronously alongside your AI systems, providing continuous quality monitoring.
-### Adding scorers to agents
-You can add built-in scorers to your agents to automatically evaluate their outputs. See the [full list of built-in scorers](https://mastra.ai/docs/evals/built-in-scorers) for all available options.
-```typescript
-import { Agent } from "@mastra/core/agent";
-import {
-  createAnswerRelevancyScorer,
-  createToxicityScorer,
-} from "@mastra/evals/scorers/prebuilt";
-export const evaluatedAgent = new Agent({
-  scorers: {
-    relevancy: {
-      scorer: createAnswerRelevancyScorer({ model: "openai/gpt-4.1-nano" }),
-      sampling: { type: "ratio", rate: 0.5 },
-    },
-    safety: {
-      scorer: createToxicityScorer({ model: "openai/gpt-4.1-nano" }),
-      sampling: { type: "ratio", rate: 1 },
-    },
-  },
-});
-```
-### Adding scorers to workflow steps
-You can also add scorers to individual workflow steps to evaluate outputs at specific points in your process:
-```typescript
-import { createWorkflow, createStep } from "@mastra/core/workflows";
-import { z } from "zod";
-import { customStepScorer } from "../scorers/custom-step-scorer";
-const contentStep = createStep({
-  scorers: {
-    customStepScorer: {
-      scorer: customStepScorer(),
-      sampling: {
-        type: "ratio",
-        rate: 1, // Score every step execution
-      }
-    }
-  },
-});
-export const contentWorkflow = createWorkflow({ ... })
-  .then(contentStep)
-  .commit();
-```
-### How live evaluations work
-**Asynchronous execution**: Live evaluations run in the background without blocking your agent responses or workflow execution. This ensures your AI systems maintain their performance while still being monitored.
-**Sampling control**: The `sampling.rate` parameter (0-1) controls what percentage of outputs get scored:
-- `1.0`: Score every single response (100%)
-- `0.5`: Score half of all responses (50%)
-- `0.1`: Score 10% of responses
-- `0.0`: Disable scoring
-**Automatic storage**: All scoring results are automatically stored in the `mastra_scorers` table in your configured database, allowing you to analyze performance trends over time.
-## Trace evaluations
-In addition to live evaluations, you can use scorers to evaluate historical traces from your agent interactions and workflows. This is particularly useful for analyzing past performance, debugging issues, or running batch evaluations.
-> **Info:** **Observability Required**
->
-> To score traces, you must first configure observability in your Mastra instance to collect trace data. See [Tracing documentation](https://mastra.ai/docs/observability/tracing/overview) for setup instructions.
-### Scoring traces with Studio
-To score traces, you first need to register your scorers with your Mastra instance:
-```typescript
-const mastra = new Mastra({
-  scorers: {
-    answerRelevancy: myAnswerRelevancyScorer,
-    responseQuality: myResponseQualityScorer,
-  },
-});
-```
-Once registered, you can score traces interactively within Studio under the Observability section. This provides a user-friendly interface for running scorers against historical traces.
-## Testing scorers locally
-Mastra provides a CLI command `mastra dev` to test your scorers. Studio includes a scorers section where you can run individual scorers against test inputs and view detailed results.
-For more details, see [Studio](https://mastra.ai/docs/getting-started/studio) docs.
-## Next steps
-- Learn how to create your own scorers in the [Creating Custom Scorers](https://mastra.ai/docs/evals/custom-scorers) guide
-- Explore built-in scorers in the [Built-in Scorers](https://mastra.ai/docs/evals/built-in-scorers) section
-- Test scorers with [Studio](https://mastra.ai/docs/getting-started/studio)
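
The deleted overview says `sampling.rate` (0-1) controls what fraction of outputs get scored, with `1.0` scoring every response and `0.0` disabling scoring. One plausible way to realize that behavior is a per-output random draw. The sketch below is an assumption about how ratio sampling could work, not Mastra's actual implementation; `shouldScore` and `RatioSampling` are hypothetical names, and the injectable `random` parameter exists only to make the function testable.

```typescript
// Hypothetical sketch of ratio sampling; not taken from the Mastra source.

type RatioSampling = { type: "ratio"; rate: number };

// Decide whether a single output should be scored.
// `random` must return a value in [0, 1), like Math.random.
function shouldScore(
  sampling: RatioSampling,
  random: () => number = Math.random,
): boolean {
  const { rate } = sampling;
  // Reject NaN and out-of-range rates instead of silently clamping them.
  if (!(rate >= 0 && rate <= 1)) {
    throw new RangeError(`sampling.rate must be between 0 and 1, got ${rate}`);
  }
  // With a draw in [0, 1): rate 1 always passes, rate 0 never passes.
  return random() < rate;
}

// Usage: count how many of 1,000 simulated outputs would be scored at rate 0.5.
let scored = 0;
for (let i = 0; i < 1000; i++) {
  if (shouldScore({ type: "ratio", rate: 0.5 })) scored++;
}
console.log(`Scored ${scored} of 1000 outputs`);
```

Because each output is drawn independently, the scored share only approximates the configured rate over many outputs; a counter-based scheme would give exact proportions at the cost of keeping state.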