npm - @mastra/core - Versions diffs - 1.2.0 → 1.2.1-alpha.0 - Mend

@mastra/core 1.2.0 → 1.2.1-alpha.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (406)

package/dist/docs/references/reference-evals-hallucination.md ADDED

@@ -0,0 +1,220 @@
+# Hallucination Scorer
+The `createHallucinationScorer()` function evaluates whether an LLM generates factually correct information by comparing its output against the provided context. This scorer measures hallucination by identifying claims in the output that directly contradict the context or that the context does not support.
+## Parameters
+The `createHallucinationScorer()` function accepts a single options object with the following properties:
+**model:** (`LanguageModel`): Configuration for the model used to evaluate hallucination.
+**options.scale:** (`number`): Maximum score value. (Default: `1`)
+**options.context:** (`string[]`): Static context strings to use as ground truth for hallucination detection.
+**options.getContext:** (`(params: GetContextParams) => string[] | Promise<string[]>`): A hook to dynamically resolve context at runtime. Takes priority over static context. Useful for live scoring where context (like tool results) is only available when the scorer runs.
+This function returns an instance of the MastraScorer class. The `.run()` method accepts the same input as other scorers (see the [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer)), but the return value includes LLM-specific fields as documented below.
+### GetContextParams
+The `getContext` hook receives the following parameters:
+**run:** (`GetContextRun`): The scorer run containing input, output, runId, requestContext, and tracingContext.
+**results:** (`Record<string, any>`): Accumulated results from previous steps (e.g., preprocessStepResult with extracted claims).
+**score:** (`number`): The computed score. Only present when called from the generateReason step.
+**step:** (`'analyze' | 'generateReason'`): Which step is calling the hook. Useful for caching context between calls.
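The caching use of `step` mentioned above can be sketched roughly as follows. This is an illustration only: the `GetContextParams` type here is a simplified local stand-in, `loadContext` and `contextCache` are hypothetical names, and the sketch assumes that `generateReason` is the last step to call the hook for a given run.

```typescript
// Simplified local stand-in for the documented parameter shape (assumption:
// the real GetContextParams type comes from the package and is richer).
type GetContextParams = {
  run: { runId?: string; output: unknown };
  results: Record<string, unknown>;
  score?: number;
  step: "analyze" | "generateReason";
};

// Hypothetical loader standing in for an expensive lookup (API, database, tools).
let loadCount = 0;
function loadContext(runId: string): string[] {
  loadCount += 1;
  return [`Known fact for ${runId}`];
}

// Context is cached per run so both steps see the same strings.
const contextCache = new Map<string, string[]>();

function getContext({ run, step }: GetContextParams): string[] {
  const key = run.runId ?? "no-run-id";
  let context = contextCache.get(key);
  if (context === undefined) {
    context = loadContext(key);
    contextCache.set(key, context);
  }
  // Assumption: generateReason is the final caller, so the entry can be released.
  if (step === "generateReason") contextCache.delete(key);
  return context;
}

const run = { runId: "run-1", output: "..." };
getContext({ run, results: {}, step: "analyze" });
getContext({ run, results: {}, score: 0.5, step: "generateReason" });
console.log(loadCount); // 1
```

The hook is synchronous here for brevity; the documented signature also allows returning a `Promise<string[]>`, in which case the cache could hold the pending promise instead.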
+## .run() Returns
+**runId:** (`string`): The id of the run (optional).
+**preprocessStepResult:** (`object`): Object with extracted claims: `{ claims: string[] }`
+**preprocessPrompt:** (`string`): The prompt sent to the LLM for the preprocess step (optional).
+**analyzeStepResult:** (`object`): Object with verdicts: `{ verdicts: Array<{ statement: string, verdict: 'yes' | 'no', reason: string }> }`
+**analyzePrompt:** (`string`): The prompt sent to the LLM for the analyze step (optional).
+**score:** (`number`): Hallucination score (0 to scale, default 0-1).
+**reason:** (`string`): Detailed explanation of the score and identified contradictions.
+**generateReasonPrompt:** (`string`): The prompt sent to the LLM for the generateReason step (optional).
+## Scoring Details
+The scorer evaluates hallucination through contradiction detection and unsupported claim analysis.
+### Scoring Process
+1. Analyzes factual content:
+   - Extracts statements from context
+   - Identifies numerical values and dates
+   - Maps statement relationships
+2. Analyzes output for hallucinations:
+   - Compares against context statements
+   - Marks direct conflicts as hallucinations
+   - Identifies unsupported claims as hallucinations
+   - Evaluates numerical accuracy
+   - Considers approximation context
+3. Calculates hallucination score:
+   - Counts hallucinated statements (contradictions and unsupported claims)
+   - Divides by total statements
+   - Scales to configured range
+Final score: `(hallucinated_statements / total_statements) * scale`
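The final-score formula above can be worked through with a small sketch. The `Verdict` type mirrors the documented `analyzeStepResult` shape, but `hallucinationScore` is a hypothetical helper, and treating a `'yes'` verdict as "hallucinated" is an assumption that this reference does not confirm.

```typescript
// Shape modeled on the documented analyzeStepResult verdicts.
type Verdict = { statement: string; verdict: "yes" | "no"; reason: string };

// Hypothetical helper: (hallucinated_statements / total_statements) * scale.
// Assumption: verdict "yes" means the statement was judged a hallucination.
function hallucinationScore(verdicts: Verdict[], scale = 1): number {
  // With no statements there is nothing to count as hallucinated.
  if (verdicts.length === 0) return 0;
  const hallucinated = verdicts.filter((v) => v.verdict === "yes").length;
  return (hallucinated / verdicts.length) * scale;
}

const verdicts: Verdict[] = [
  { statement: "The iPhone was announced in 2007.", verdict: "no", reason: "Matches the context." },
  { statement: "It was released in 2008.", verdict: "yes", reason: "The context says June 29, 2007." },
  { statement: "It sold 10 million units on day one.", verdict: "yes", reason: "Not supported by the context." },
  { statement: "Steve Jobs introduced it.", verdict: "no", reason: "Matches the context." },
];

console.log(hallucinationScore(verdicts)); // 0.5
console.log(hallucinationScore(verdicts, 10)); // 5
```

Two of the four statements are flagged, one as a contradiction and one as an unsupported claim, so the score is 0.5 on the default scale.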
+### Important Considerations
+- Claims not present in context are treated as hallucinations
+- Subjective claims are hallucinations unless explicitly supported
+- Speculative language ("might", "possibly") about facts IN context is allowed
+- Speculative language about facts NOT in context is treated as hallucination
+- Empty outputs result in zero hallucinations
+- Numerical evaluation considers:
+  - Scale-appropriate precision
+  - Contextual approximations
+  - Explicit precision indicators
+### Score interpretation
+A hallucination score between 0 and 1:
+- **0.0**: No hallucination — all claims match the context.
+- **0.3–0.4**: Low hallucination — a few contradictions.
+- **0.5–0.6**: Mixed hallucination — several contradictions.
+- **0.7–0.8**: High hallucination — many contradictions.
+- **0.9–1.0**: Complete hallucination — most or all claims contradict the context.
+**Note:** The score represents the degree of hallucination: lower scores indicate better factual alignment with the provided context.
+## Examples
+### Static Context
+Use static context when you have known ground truth to compare against:
+```typescript
+import { createHallucinationScorer } from "@mastra/evals/scorers/prebuilt";
+const scorer = createHallucinationScorer({
+  model: "openai/gpt-4o",
+  options: {
+    context: [
+      "The first iPhone was announced on January 9, 2007.",
+      "It was released on June 29, 2007.",
+      "Steve Jobs introduced it at Macworld.",
+    ],
+  },
+});
+```
+### Dynamic Context with getContext
+Use `getContext` for live scoring scenarios where context comes from tool results:
+```typescript
+import { createHallucinationScorer } from "@mastra/evals/scorers/prebuilt";
+import { extractToolResults } from "@mastra/evals/scorers";
+const scorer = createHallucinationScorer({
+  model: "openai/gpt-4o",
+  options: {
+    getContext: ({ run, step }) => {
+      // Extract tool results as context
+      const toolResults = extractToolResults(run.output);
+      return toolResults.map((t) =>
+        JSON.stringify({ tool: t.toolName, result: t.result })
+      );
+    },
+  },
+});
+```
+### Live Scoring with Agent
+Attach the scorer to an agent for live evaluation:
+```typescript
+import { Agent } from "@mastra/core/agent";
+import { createHallucinationScorer } from "@mastra/evals/scorers/prebuilt";
+import { extractToolResults } from "@mastra/evals/scorers";
+const hallucinationScorer = createHallucinationScorer({
+  model: "openai/gpt-4o",
+  options: {
+    getContext: ({ run }) => {
+      const toolResults = extractToolResults(run.output);
+      return toolResults.map((t) =>
+        JSON.stringify({ tool: t.toolName, result: t.result })
+      );
+    },
+  },
+});
+const agent = new Agent({
+  name: "my-agent",
+  model: "openai/gpt-4o",
+  instructions: "You are a helpful assistant.",
+  evals: {
+    scorers: [hallucinationScorer],
+  },
+});
+```
+### Batch Evaluation with runEvals
+```typescript
+import { runEvals } from "@mastra/core/evals";
+import { createHallucinationScorer } from "@mastra/evals/scorers/prebuilt";
+import { myAgent } from "./agent";
+const scorer = createHallucinationScorer({
+  model: "openai/gpt-4o",
+  options: {
+    context: ["Known fact 1", "Known fact 2"],
+  },
+});
+const result = await runEvals({
+  data: [
+    { input: "Tell me about topic A" },
+    { input: "Tell me about topic B" },
+  ],
+  scorers: [scorer],
+  target: myAgent,
+  onItemComplete: ({ scorerResults }) => {
+    console.log({
+      score: scorerResults[scorer.id].score,
+      reason: scorerResults[scorer.id].reason,
+    });
+  },
+});
+console.log(result.scores);
+```
+For more details on `runEvals`, see the [runEvals reference](https://mastra.ai/reference/evals/run-evals).
+To add this scorer to an agent, see the [Scorers overview](https://mastra.ai/docs/evals/overview) guide.
+## Related
+- [Faithfulness Scorer](https://mastra.ai/reference/evals/faithfulness)
+- [Answer Relevancy Scorer](https://mastra.ai/reference/evals/answer-relevancy)

package/dist/docs/references/reference-evals-keyword-coverage.md ADDED

@@ -0,0 +1,128 @@
+# Keyword Coverage Scorer
+The `createKeywordCoverageScorer()` function evaluates how well an LLM's output covers the important keywords from the input. It analyzes keyword presence and matches while ignoring common words and stop words.
+## Parameters
+The `createKeywordCoverageScorer()` function does not take any options.
+This function returns an instance of the MastraScorer class. See the [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer) for details on the `.run()` method and its input/output.
+## .run() Returns
+**runId:** (`string`): The id of the run (optional).
+**preprocessStepResult:** (`object`): Object with extracted keywords: `{ referenceKeywords: Set<string>, responseKeywords: Set<string> }`
+**analyzeStepResult:** (`object`): Object with keyword coverage: `{ totalKeywords: number, matchedKeywords: number }`
+**score:** (`number`): Coverage score (0-1) representing the proportion of matched keywords.
+`.run()` returns a result in the following shape:
+```typescript
+{
+  runId: string,
+  preprocessStepResult: {
+    referenceKeywords: Set<string>,
+    responseKeywords: Set<string>
+  },
+  analyzeStepResult: {
+    totalKeywords: number,
+    matchedKeywords: number
+  },
+  score: number
+}
+```
+## Scoring Details
+The scorer evaluates keyword coverage by matching keywords with the following features:
+- Common word and stop word filtering (e.g., "the", "a", "and")
+- Case-insensitive matching
+- Word form variation handling
+- Special handling of technical terms and compound words
+### Scoring Process
+1. Processes keywords from input and output:
+   - Filters out common words and stop words
+   - Normalizes case and word forms
+   - Handles special terms and compounds
+2. Calculates keyword coverage:
+   - Matches keywords between texts
+   - Counts successful matches
+   - Computes coverage ratio
+Final score: `matched_keywords / total_keywords`, which falls between 0 and 1 (this scorer takes no options, so there is no configurable scale)
+### Score interpretation
+A coverage score between 0 and 1:
+- **1.0**: Complete coverage – all keywords present.
+- **0.7–0.9**: High coverage – most keywords included.
+- **0.4–0.6**: Partial coverage – some keywords present.
+- **0.1–0.3**: Low coverage – few keywords matched.
+- **0.0**: No coverage – no keywords found.
+### Special Cases
+The scorer handles several special cases:
+- Empty input/output: Returns a score of 1.0 if both are empty and 0.0 if only one is empty
+- Single word: Treated as a single keyword
+- Technical terms: Preserves compound technical terms (e.g., "React.js", "machine learning")
+- Case differences: "JavaScript" matches "javascript"
+- Common words: Ignored in scoring to focus on meaningful keywords
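The coverage ratio and the empty-input cases described above can be approximated with a short sketch. This is not the package's implementation: `STOP_WORDS` is a tiny illustrative list, `extractKeywords` and `keywordCoverage` are hypothetical helpers, the empty-case check is applied to the extracted keyword sets, and word-form normalization is left out entirely.

```typescript
// Assumption: a minimal stop-word list for illustration only.
const STOP_WORDS = new Set(["the", "a", "an", "and", "or", "of", "like", "is", "are", "to", "in"]);

// Lowercases the text and keeps compound tokens such as "react.js" together.
function extractKeywords(text: string): Set<string> {
  const tokens = text.toLowerCase().match(/[a-z0-9]+(?:[.+#-][a-z0-9]+)*/g) ?? [];
  return new Set(tokens.filter((token) => !STOP_WORDS.has(token)));
}

// Ratio of reference keywords that also appear in the response.
function keywordCoverage(input: string, output: string): number {
  const reference = extractKeywords(input);
  const response = extractKeywords(output);
  if (reference.size === 0 && response.size === 0) return 1; // both empty
  if (reference.size === 0 || response.size === 0) return 0; // only one empty
  let matched = 0;
  reference.forEach((keyword) => {
    if (response.has(keyword)) matched += 1;
  });
  return matched / reference.size;
}

console.log(
  keywordCoverage(
    "JavaScript frameworks like React and Vue",
    "React and Vue are popular javascript libraries.",
  ),
); // 0.75
```

Here the reference keywords are `javascript`, `frameworks`, `react`, and `vue`; three of them appear in the response, giving 3 / 4.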
+## Example
+Evaluate keyword coverage between input queries and agent responses:
+```typescript
+import { runEvals } from "@mastra/core/evals";
+import { createKeywordCoverageScorer } from "@mastra/evals/scorers/prebuilt";
+import { myAgent } from "./agent";
+const scorer = createKeywordCoverageScorer();
+const result = await runEvals({
+  data: [
+    {
+      input: "JavaScript frameworks like React and Vue",
+    },
+    {
+      input: "TypeScript offers interfaces, generics, and type inference",
+    },
+    {
+      input:
+        "Machine learning models require data preprocessing, feature engineering, and hyperparameter tuning",
+    },
+  ],
+  scorers: [scorer],
+  target: myAgent,
+  onItemComplete: ({ scorerResults }) => {
+    console.log({
+      score: scorerResults[scorer.id].score,
+    });
+  },
+});
+console.log(result.scores);
+```
+For more details on `runEvals`, see the [runEvals reference](https://mastra.ai/reference/evals/run-evals).
+To add this scorer to an agent, see the [Scorers overview](https://mastra.ai/docs/evals/overview) guide.
+## Related
+- [Completeness Scorer](https://mastra.ai/reference/evals/completeness)
+- [Content Similarity Scorer](https://mastra.ai/reference/evals/content-similarity)
+- [Answer Relevancy Scorer](https://mastra.ai/reference/evals/answer-relevancy)
+- [Textual Difference Scorer](https://mastra.ai/reference/evals/textual-difference)

package/dist/docs/references/reference-evals-mastra-scorer.md ADDED

@@ -0,0 +1,123 @@
+# MastraScorer
+The `MastraScorer` class is the base class for all scorers in Mastra. It provides a standard `.run()` method for evaluating input/output pairs and supports multi-step scoring workflows with preprocess → analyze → generateScore → generateReason execution flow.
+**Note:** Most users should use [`createScorer`](https://mastra.ai/reference/evals/create-scorer) to create scorer instances. Direct instantiation of `MastraScorer` is not recommended.
+## How to Get a MastraScorer Instance
+Use the `createScorer` factory function, which returns a `MastraScorer` instance:
+```typescript
+import { createScorer } from "@mastra/core/evals";
+const scorer = createScorer({
+  name: "My Custom Scorer",
+  description: "Evaluates responses based on custom criteria",
+}).generateScore(({ run, results }) => {
+  // scoring logic
+  return 0.85;
+});
+// scorer is now a MastraScorer instance
+```
+## .run() Method
+The `.run()` method is the primary way to execute your scorer and evaluate input/output pairs. It processes the data through your defined steps (preprocess → analyze → generateScore → generateReason) and returns a comprehensive result object with the score, reasoning, and intermediate results.
+```typescript
+const result = await scorer.run({
+  input: "What is machine learning?",
+  output: "Machine learning is a subset of artificial intelligence...",
+  runId: "optional-run-id",
+  requestContext: {
+    /* optional context */
+  },
+});
+```
+## .run() Input
+**input:** (`any`): Input data to be evaluated. Can be any type depending on your scorer's requirements.
+**output:** (`any`): Output data to be evaluated. Can be any type depending on your scorer's requirements.
+**runId:** (`string`): Optional unique identifier for this scoring run.
+**requestContext:** (`any`): Optional request context from the agent or workflow step being evaluated.
+**groundTruth:** (`any`): Optional expected or reference output for comparison during scoring. Automatically passed when using runEvals.
+## .run() Returns
+**runId:** (`string`): The unique identifier for this scoring run.
+**score:** (`number`): Numerical score computed by the generateScore step.
+**reason:** (`string`): Explanation for the score, if generateReason step was defined (optional).
+**preprocessStepResult:** (`any`): Result of the preprocess step, if defined (optional).
+**analyzeStepResult:** (`any`): Result of the analyze step, if defined (optional).
+**preprocessPrompt:** (`string`): Preprocess prompt, if defined (optional).
+**analyzePrompt:** (`string`): Analyze prompt, if defined (optional).
+**generateScorePrompt:** (`string`): Generate score prompt, if defined (optional).
+**generateReasonPrompt:** (`string`): Generate reason prompt, if defined (optional).
+## Step Execution Flow
+When you call `.run()`, the MastraScorer executes the defined steps in this order:
+1. **preprocess** (optional) - Extracts or transforms data
+2. **analyze** (optional) - Processes the input/output and preprocessed data
+3. **generateScore** (required) - Computes the numerical score
+4. **generateReason** (optional) - Provides explanation for the score
+Each step receives the results from previous steps, allowing you to build complex evaluation pipelines.
+## Usage Example
+```typescript
+const scorer = createScorer({
+  name: "Quality Scorer",
+  description: "Evaluates response quality",
+})
+  .preprocess(({ run }) => {
+    // Extract key information
+    return { wordCount: run.output.split(" ").length };
+  })
+  .analyze(({ run, results }) => {
+    // Analyze the response
+    const hasSubstance = results.preprocessStepResult.wordCount > 10;
+    return { hasSubstance };
+  })
+  .generateScore(({ results }) => {
+    // Calculate score
+    return results.analyzeStepResult.hasSubstance ? 1.0 : 0.0;
+  })
+  .generateReason(({ score, results }) => {
+    // Explain the score
+    const wordCount = results.preprocessStepResult.wordCount;
+    return `Score: ${score}. Response has ${wordCount} words.`;
+  });
+// Use the scorer
+const result = await scorer.run({
+  input: "What is machine learning?",
+  output: "Machine learning is a subset of artificial intelligence that learns from data.",
+});
+console.log(result.score); // 1
+console.log(result.reason); // "Score: 1. Response has 12 words."
+```
+## Integration
+MastraScorer instances can be used with agents and workflow steps.
+See the [createScorer reference](https://mastra.ai/reference/evals/create-scorer) for detailed information on defining custom scoring logic.

package/dist/docs/references/reference-evals-run-evals.md ADDED

@@ -0,0 +1,138 @@
+# runEvals
+The `runEvals` function enables batch evaluation of agents and workflows by running multiple test cases against scorers concurrently. This is essential for systematic testing, performance analysis, and validation of AI systems.
+## Usage Example
+```typescript
+import { runEvals } from "@mastra/core/evals";
+import { myAgent } from "./agents/my-agent";
+import { myScorer1, myScorer2 } from "./scorers";
+const result = await runEvals({
+  target: myAgent,
+  data: [
+    { input: "What is machine learning?" },
+    { input: "Explain neural networks" },
+    { input: "How does AI work?" },
+  ],
+  scorers: [myScorer1, myScorer2],
+  concurrency: 2,
+  onItemComplete: ({ item, targetResult, scorerResults }) => {
+    console.log(`Completed: ${item.input}`);
+    console.log(`Scores:`, scorerResults);
+  },
+});
+console.log(`Average scores:`, result.scores);
+console.log(`Processed ${result.summary.totalItems} items`);
+```
+## Parameters
+**target:** (`Agent | Workflow`): The agent or workflow to evaluate.
+**data:** (`RunEvalsDataItem[]`): Array of test cases with input data and optional ground truth.
+**scorers:** (`MastraScorer[] | WorkflowScorerConfig`): Array of scorers for agents, or configuration object for workflows specifying scorers for the workflow and individual steps.
+**concurrency?:** (`number`): Number of test cases to run concurrently. (Default: `1`)
+**onItemComplete?:** (`function`): Callback function called after each test case completes. Receives item, target result, and scorer results.
+## Data Item Structure
+**input:** (`string | string[] | CoreMessage[] | any`): Input data for the target. For agents: messages or strings. For workflows: workflow input data.
+**groundTruth?:** (`any`): Expected or reference output for comparison during scoring.
+**requestContext?:** (`RequestContext`): Request Context to pass to the target during execution.
+**tracingContext?:** (`TracingContext`): Tracing context for observability and debugging.
+## Workflow Scorer Configuration
+For workflows, you can specify scorers at different levels using `WorkflowScorerConfig`:
+**workflow?:** (`MastraScorer[]`): Array of scorers to evaluate the entire workflow output.
+**steps?:** (`Record<string, MastraScorer[]>`): Object mapping step IDs to arrays of scorers for evaluating individual step outputs.
+## Returns
+**scores:** (`Record<string, any>`): Average scores across all test cases, organized by scorer name.
+**summary:** (`object`): Summary information about the experiment execution.
+**summary.totalItems:** (`number`): Total number of test cases processed.
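To illustrate what "average scores across all test cases" could look like, here is a minimal sketch of per-scorer averaging. The `ItemScores` shape and the `averageScores` helper are hypothetical; they mirror the `scorerResults[scorer.id].score` access used in the examples, and `runEvals` itself may aggregate differently.

```typescript
// Hypothetical per-item shape: scorer id -> result carrying a numeric score.
type ItemScores = Record<string, { score: number }>;

// Averages each scorer's score over every item in which it appears.
function averageScores(items: ItemScores[]): Record<string, number> {
  const totals: Record<string, { sum: number; count: number }> = {};
  for (const item of items) {
    for (const id of Object.keys(item)) {
      const entry = totals[id] ?? { sum: 0, count: 0 };
      entry.sum += item[id].score;
      entry.count += 1;
      totals[id] = entry;
    }
  }
  const averages: Record<string, number> = {};
  for (const id of Object.keys(totals)) {
    averages[id] = totals[id].sum / totals[id].count;
  }
  return averages;
}

const perItem: ItemScores[] = [
  { hallucination: { score: 0.5 }, coverage: { score: 1 } },
  { hallucination: { score: 0 }, coverage: { score: 0.5 } },
];

console.log(averageScores(perItem)); // { hallucination: 0.25, coverage: 0.75 }
```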
+## Examples
+### Agent Evaluation
+```typescript
+import { createScorer, runEvals } from "@mastra/core/evals";
+const myScorer = createScorer({
+  id: "my-scorer",
+  description: "Check if Agent's response contains ground truth",
+  type: "agent",
+}).generateScore(({ run }) => {
+  const response = run.output[0]?.content || "";
+  const expectedResponse = run.groundTruth;
+  return response.includes(expectedResponse) ? 1 : 0;
+});
+// chatAgent is assumed to be an Agent instance defined elsewhere.
+const result = await runEvals({
+  target: chatAgent,
+  data: [
+    {
+      input: "What is AI?",
+      groundTruth:
+        "AI is a field of computer science that creates intelligent machines.",
+    },
+    {
+      input: "How does machine learning work?",
+      groundTruth:
+        "Machine learning uses algorithms to learn patterns from data.",
+    },
+  ],
+  scorers: [myScorer],
+  concurrency: 3,
+});
+```
+### Workflow Evaluation
+```typescript
+// myWorkflow and the scorers below are assumed to be defined elsewhere.
+const workflowResult = await runEvals({
+  target: myWorkflow,
+  data: [
+    { input: { query: "Process this data", priority: "high" } },
+    { input: { query: "Another task", priority: "low" } },
+  ],
+  scorers: {
+    workflow: [outputQualityScorer],
+    steps: {
+      "validation-step": [validationScorer],
+      "processing-step": [processingScorer],
+    },
+  },
+  onItemComplete: ({ item, targetResult, scorerResults }) => {
+    console.log(`Workflow completed for: ${item.input.query}`);
+    if (scorerResults.workflow) {
+      console.log("Workflow scores:", scorerResults.workflow);
+    }
+    if (scorerResults.steps) {
+      console.log("Step scores:", scorerResults.steps);
+    }
+  },
+});
+```
+## Related
+- [createScorer()](https://mastra.ai/reference/evals/create-scorer) - Create custom scorers for experiments
+- [MastraScorer](https://mastra.ai/reference/evals/mastra-scorer) - Learn about scorer structure and methods
+- [Custom Scorers](https://mastra.ai/docs/evals/custom-scorers) - Guide to building evaluation logic
+- [Scorers Overview](https://mastra.ai/docs/evals/overview) - Understanding scorer concepts