npm - @mastra/evals - Versions diffs - 1.1.2-alpha.0 → 1.2.0-alpha.0 - Mend

@mastra/evals 1.1.2-alpha.0 → 1.2.0-alpha.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (60)

package/CHANGELOG.md +59 -2
package/LICENSE.md +15 -0
package/dist/chunk-EVBNIL5M.js +606 -0
package/dist/chunk-EVBNIL5M.js.map +1 -0
package/dist/chunk-XRUR5PBK.cjs +632 -0
package/dist/chunk-XRUR5PBK.cjs.map +1 -0
package/dist/docs/SKILL.md +20 -19
package/dist/docs/assets/SOURCE_MAP.json +1 -1
package/dist/docs/references/docs-evals-built-in-scorers.md +2 -1
package/dist/docs/references/docs-evals-overview.md +11 -16
package/dist/docs/references/reference-evals-answer-relevancy.md +25 -25
package/dist/docs/references/reference-evals-answer-similarity.md +33 -35
package/dist/docs/references/reference-evals-bias.md +24 -24
package/dist/docs/references/reference-evals-completeness.md +19 -20
package/dist/docs/references/reference-evals-content-similarity.md +20 -20
package/dist/docs/references/reference-evals-context-precision.md +36 -36
package/dist/docs/references/reference-evals-context-relevance.md +136 -141
package/dist/docs/references/reference-evals-faithfulness.md +24 -24
package/dist/docs/references/reference-evals-hallucination.md +52 -69
package/dist/docs/references/reference-evals-keyword-coverage.md +18 -18
package/dist/docs/references/reference-evals-noise-sensitivity.md +167 -177
package/dist/docs/references/reference-evals-prompt-alignment.md +111 -116
package/dist/docs/references/reference-evals-scorer-utils.md +285 -105
package/dist/docs/references/reference-evals-textual-difference.md +18 -18
package/dist/docs/references/reference-evals-tone-consistency.md +19 -19
package/dist/docs/references/reference-evals-tool-call-accuracy.md +165 -165
package/dist/docs/references/reference-evals-toxicity.md +21 -21
package/dist/docs/references/reference-evals-trajectory-accuracy.md +613 -0
package/dist/scorers/code/index.d.ts +1 -0
package/dist/scorers/code/index.d.ts.map +1 -1
package/dist/scorers/code/trajectory/index.d.ts +147 -0
package/dist/scorers/code/trajectory/index.d.ts.map +1 -0
package/dist/scorers/llm/answer-similarity/index.d.ts +2 -2
package/dist/scorers/llm/context-precision/index.d.ts +2 -2
package/dist/scorers/llm/context-relevance/index.d.ts +1 -1
package/dist/scorers/llm/faithfulness/index.d.ts +1 -1
package/dist/scorers/llm/hallucination/index.d.ts +2 -2
package/dist/scorers/llm/index.d.ts +1 -0
package/dist/scorers/llm/index.d.ts.map +1 -1
package/dist/scorers/llm/noise-sensitivity/index.d.ts +1 -1
package/dist/scorers/llm/prompt-alignment/index.d.ts +5 -5
package/dist/scorers/llm/tool-call-accuracy/index.d.ts +1 -1
package/dist/scorers/llm/toxicity/index.d.ts +1 -1
package/dist/scorers/llm/trajectory/index.d.ts +58 -0
package/dist/scorers/llm/trajectory/index.d.ts.map +1 -0
package/dist/scorers/llm/trajectory/prompts.d.ts +20 -0
package/dist/scorers/llm/trajectory/prompts.d.ts.map +1 -0
package/dist/scorers/prebuilt/index.cjs +638 -59
package/dist/scorers/prebuilt/index.cjs.map +1 -1
package/dist/scorers/prebuilt/index.js +578 -2
package/dist/scorers/prebuilt/index.js.map +1 -1
package/dist/scorers/utils.cjs +41 -17
package/dist/scorers/utils.d.ts +171 -1
package/dist/scorers/utils.d.ts.map +1 -1
package/dist/scorers/utils.js +1 -1
package/package.json +14 -11
package/dist/chunk-OEOE7ZHN.js +0 -195
package/dist/chunk-OEOE7ZHN.js.map +0 -1
package/dist/chunk-W3U7MMDX.cjs +0 -212
package/dist/chunk-W3U7MMDX.cjs.map +0 -1

package/dist/docs/references/reference-evals-bias.md CHANGED

@@ -1,34 +1,34 @@
-# Bias Scorer
+# Bias scorer
 The `createBiasScorer()` function accepts a single options object with the following properties:
 ## Parameters
-**model:** (`LanguageModel`): Configuration for the model used to evaluate bias.
+**model** (`LanguageModel`): Configuration for the model used to evaluate bias.
-**scale:** (`number`): Maximum score value. (Default: `1`)
+**scale** (`number`): Maximum score value. (Default: `1`)
 This function returns an instance of the MastraScorer class. The `.run()` method accepts the same input as other scorers (see the [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer)), but the return value includes LLM-specific fields as documented below.
-## .run() Returns
+## `.run()` returns
-**runId:** (`string`): The id of the run (optional).
+**runId** (`string`): The id of the run (optional).
-**preprocessStepResult:** (`object`): Object with extracted opinions: { opinions: string\[] }
+**preprocessStepResult** (`object`): Object with extracted opinions: { opinions: string\[] }
-**preprocessPrompt:** (`string`): The prompt sent to the LLM for the preprocess step (optional).
+**preprocessPrompt** (`string`): The prompt sent to the LLM for the preprocess step (optional).
-**analyzeStepResult:** (`object`): Object with results: { results: Array<{ result: 'yes' | 'no', reason: string }> }
+**analyzeStepResult** (`object`): Object with results: { results: Array<{ result: 'yes' | 'no', reason: string }> }
-**analyzePrompt:** (`string`): The prompt sent to the LLM for the analyze step (optional).
+**analyzePrompt** (`string`): The prompt sent to the LLM for the analyze step (optional).
-**score:** (`number`): Bias score (0 to scale, default 0-1). Higher scores indicate more bias.
+**score** (`number`): Bias score (0 to scale, default 0-1). Higher scores indicate more bias.
-**reason:** (`string`): Explanation of the score.
+**reason** (`string`): Explanation of the score.
-**generateReasonPrompt:** (`string`): The prompt sent to the LLM for the generateReason step (optional).
+**generateReasonPrompt** (`string`): The prompt sent to the LLM for the generateReason step (optional).
-## Bias Categories
+## Bias categories
 The scorer evaluates several types of bias:
@@ -37,7 +37,7 @@ The scorer evaluates several types of bias:
 3. **Racial/Ethnic Bias**: Discrimination based on race, ethnicity, or national origin
 4. **Geographical Bias**: Prejudice based on location or regional stereotypes
-## Scoring Details
+## Scoring details
 The scorer evaluates bias through opinion analysis based on:
@@ -78,22 +78,22 @@ A bias score between 0 and 1:
 Evaluate agent responses for bias across different types of questions:
 ```typescript
-import { runEvals } from "@mastra/core/evals";
-import { createBiasScorer } from "@mastra/evals/scorers/prebuilt";
-import { myAgent } from "./agent";
+import { runEvals } from '@mastra/core/evals'
+import { createBiasScorer } from '@mastra/evals/scorers/prebuilt'
+import { myAgent } from './agent'
-const scorer = createBiasScorer({ model: "openai/gpt-4o" });
+const scorer = createBiasScorer({ model: 'openai/gpt-5.4' })
 const result = await runEvals({
   data: [
     {
-      input: "What makes someone a good leader?",
+      input: 'What makes someone a good leader?',
     },
     {
-      input: "How do different age groups perform at work?",
+      input: 'How do different age groups perform at work?',
     },
     {
-      input: "What is the best hiring practice?",
+      input: 'What is the best hiring practice?',
     },
   ],
   scorers: [scorer],
@@ -102,11 +102,11 @@ const result = await runEvals({
     console.log({
       score: scorerResults[scorer.id].score,
       reason: scorerResults[scorer.id].reason,
-    });
+    })
   },
-});
+})
-console.log(result.scores);
+console.log(result.scores)
 ```
 For more details on `runEvals`, see the [runEvals reference](https://mastra.ai/reference/evals/run-evals).

package/dist/docs/references/reference-evals-completeness.md CHANGED

@@ -1,20 +1,20 @@
-# Completeness Scorer
+# Completeness scorer
 The `createCompletenessScorer()` function evaluates how thoroughly an LLM's output covers the key elements present in the input. It analyzes nouns, verbs, topics, and terms to determine coverage and provides a detailed completeness score.
 ## Parameters
-The `createCompletenessScorer()` function does not take any options.
+The `createCompletenessScorer()` function doesn't take any options.
 This function returns an instance of the MastraScorer class. See the [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer) for details on the `.run()` method and its input/output.
-## .run() Returns
+## `.run()` returns
-**runId:** (`string`): The id of the run (optional).
+**runId** (`string`): The id of the run (optional).
-**preprocessStepResult:** (`object`): Object with extracted elements and coverage details: { inputElements: string\[], outputElements: string\[], missingElements: string\[], elementCounts: { input: number, output: number } }
+**preprocessStepResult** (`object`): Object with extracted elements and coverage details: { inputElements: string\[], outputElements: string\[], missingElements: string\[], elementCounts: { input: number, output: number } }
-**score:** (`number`): Completeness score (0-1) representing the proportion of input elements covered in the output.
+**score** (`number`): Completeness score (0-1) representing the proportion of input elements covered in the output.
 The `.run()` method returns a result in the following shape:
@@ -31,7 +31,7 @@ The `.run()` method returns a result in the following shape:
 }
 ```
-## Element Extraction Details
+## Element extraction details
 The scorer extracts and analyzes several types of elements:
@@ -48,7 +48,7 @@ The extraction process includes:
 - Special handling of short words (3 characters or less)
 - Deduplication of elements
-### extractStepResult
+### `extractStepResult`
 From the `.run()` method, you can get the `extractStepResult` object with the following properties:
@@ -57,7 +57,7 @@ From the `.run()` method, you can get the `extractStepResult` object with the fo
 - **missingElements**: Input elements not found in the output.
 - **elementCounts**: The number of elements in the input and output.
-## Scoring Details
+## Scoring details
 The scorer evaluates completeness through linguistic element coverage analysis.
@@ -92,25 +92,24 @@ A completeness score between 0 and 1:
 Evaluate agent responses for completeness across different query complexities:
 ```typescript
-import { runEvals } from "@mastra/core/evals";
-import { createCompletenessScorer } from "@mastra/evals/scorers/prebuilt";
-import { myAgent } from "./agent";
+import { runEvals } from '@mastra/core/evals'
+import { createCompletenessScorer } from '@mastra/evals/scorers/prebuilt'
+import { myAgent } from './agent'
-const scorer = createCompletenessScorer();
+const scorer = createCompletenessScorer()
 const result = await runEvals({
   data: [
     {
       input:
-        "Explain the process of photosynthesis, including the inputs, outputs, and stages involved.",
+        'Explain the process of photosynthesis, including the inputs, outputs, and stages involved.',
     },
     {
-      input:
-        "What are the benefits and drawbacks of remote work for both employees and employers?",
+      input: 'What are the benefits and drawbacks of remote work for both employees and employers?',
     },
     {
       input:
-        "Compare renewable and non-renewable energy sources in terms of cost, environmental impact, and sustainability.",
+        'Compare renewable and non-renewable energy sources in terms of cost, environmental impact, and sustainability.',
     },
   ],
   scorers: [scorer],
@@ -118,11 +117,11 @@ const result = await runEvals({
   onItemComplete: ({ scorerResults }) => {
     console.log({
       score: scorerResults[scorer.id].score,
-    });
+    })
   },
-});
+})
-console.log(result.scores);
+console.log(result.scores)
 ```
 For more details on `runEvals`, see the [runEvals reference](https://mastra.ai/reference/evals/run-evals).
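Reviewer note: the completeness doc above describes the score as the proportion of input elements covered in the output, alongside `missingElements` and `elementCounts`. The following is a minimal sketch of that ratio only. `coverageScore` is a hypothetical helper, not part of `@mastra/evals`; the lowercase/trim normalization and the score of 1 for empty input are assumptions, and the package's own element extraction (nouns, verbs, topics, terms) is not reproduced here.

```typescript
// Hypothetical helper for illustration; not exported by @mastra/evals.
interface CoverageResult {
  score: number
  missingElements: string[]
  elementCounts: { input: number; output: number }
}

function coverageScore(inputElements: string[], outputElements: string[]): CoverageResult {
  // Assumption: elements are compared case-insensitively after trimming.
  const normalize = (s: string) => s.trim().toLowerCase()

  // Deduplicate both sides before comparing.
  const input = Array.from(new Set(inputElements.map(normalize)))
  const output = new Set(outputElements.map(normalize))

  // Input elements that never appear in the output.
  const missingElements = input.filter(el => !output.has(el))

  // Assumption: an empty input counts as fully covered.
  const score = input.length === 0 ? 1 : (input.length - missingElements.length) / input.length

  return {
    score,
    missingElements,
    elementCounts: { input: input.length, output: output.size },
  }
}

const result = coverageScore(
  ['photosynthesis', 'inputs', 'outputs', 'stages'],
  ['Photosynthesis', 'inputs', 'stages', 'chlorophyll'],
)
console.log(result.score) // 0.75
console.log(result.missingElements) // [ 'outputs' ]
```

Three of the four input elements appear in the output, so the sketch reports 0.75 and lists `outputs` as missing.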

package/dist/docs/references/reference-evals-content-similarity.md CHANGED

@@ -1,4 +1,4 @@
-# Content Similarity Scorer
+# Content similarity scorer
 The `createContentSimilarityScorer()` function measures the textual similarity between two strings, providing a score that indicates how closely they match. It supports configurable options for case sensitivity and whitespace handling.
@@ -6,23 +6,23 @@ The `createContentSimilarityScorer()` function measures the textual similarity b
 The `createContentSimilarityScorer()` function accepts a single options object with the following properties:
-**ignoreCase:** (`boolean`): Whether to ignore case differences when comparing strings. (Default: `true`)
+**ignoreCase** (`boolean`): Whether to ignore case differences when comparing strings. (Default: `true`)
-**ignoreWhitespace:** (`boolean`): Whether to normalize whitespace when comparing strings. (Default: `true`)
+**ignoreWhitespace** (`boolean`): Whether to normalize whitespace when comparing strings. (Default: `true`)
 This function returns an instance of the MastraScorer class. See the [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer) for details on the `.run()` method and its input/output.
-## .run() Returns
+## `.run()` returns
-**runId:** (`string`): The id of the run (optional).
+**runId** (`string`): The id of the run (optional).
-**preprocessStepResult:** (`object`): Object with processed input and output: { processedInput: string, processedOutput: string }
+**preprocessStepResult** (`object`): Object with processed input and output: { processedInput: string, processedOutput: string }
-**analyzeStepResult:** (`object`): Object with similarity: { similarity: number }
+**analyzeStepResult** (`object`): Object with similarity: { similarity: number }
-**score:** (`number`): Similarity score (0-1) where 1 indicates perfect similarity.
+**score** (`number`): Similarity score (0-1) where 1 indicates perfect similarity.
-## Scoring Details
+## Scoring details
 The scorer evaluates textual similarity through character-level matching and configurable text normalization.
@@ -47,23 +47,23 @@ Final score: `similarity_value * scale`
 Evaluate textual similarity between expected and actual agent outputs:
 ```typescript
-import { runEvals } from "@mastra/core/evals";
-import { createContentSimilarityScorer } from "@mastra/evals/scorers/prebuilt";
-import { myAgent } from "./agent";
+import { runEvals } from '@mastra/core/evals'
+import { createContentSimilarityScorer } from '@mastra/evals/scorers/prebuilt'
+import { myAgent } from './agent'
-const scorer = createContentSimilarityScorer();
+const scorer = createContentSimilarityScorer()
 const result = await runEvals({
   data: [
     {
-      input: "Summarize the benefits of TypeScript",
+      input: 'Summarize the benefits of TypeScript',
       groundTruth:
-        "TypeScript provides static typing, better tooling support, and improved code maintainability.",
+        'TypeScript provides static typing, better tooling support, and improved code maintainability.',
     },
     {
-      input: "What is machine learning?",
+      input: 'What is machine learning?',
       groundTruth:
-        "Machine learning is a subset of AI that enables systems to learn from data without explicit programming.",
+        'Machine learning is a subset of AI that enables systems to learn from data without explicit programming.',
     },
   ],
   scorers: [scorer],
@@ -72,11 +72,11 @@ const result = await runEvals({
     console.log({
       score: scorerResults[scorer.id].score,
       groundTruth: scorerResults[scorer.id].groundTruth,
-    });
+    })
   },
-});
+})
-console.log(result.scores);
+console.log(result.scores)
 ```
 For more details on `runEvals`, see the [runEvals reference](https://mastra.ai/reference/evals/run-evals).
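Reviewer note: the content similarity doc above lists `ignoreCase` and `ignoreWhitespace` (both defaulting to `true`) and a `preprocessStepResult` holding the processed strings. A rough sketch of what such a normalization step could look like follows. `normalizeForComparison` is a hypothetical name, and collapsing whitespace runs to a single space is an assumption; the package may normalize differently, and the similarity calculation itself is not shown.

```typescript
// Hypothetical helper for illustration; not exported by @mastra/evals.
interface NormalizeOptions {
  ignoreCase?: boolean
  ignoreWhitespace?: boolean
}

function normalizeForComparison(
  text: string,
  { ignoreCase = true, ignoreWhitespace = true }: NormalizeOptions = {},
): string {
  let out = text
  // Lowercase so that "TypeScript" and "typescript" compare equal.
  if (ignoreCase) out = out.toLowerCase()
  // Assumption: collapse whitespace runs to one space and trim the ends.
  if (ignoreWhitespace) out = out.replace(/\s+/g, ' ').trim()
  return out
}

const processedInput = normalizeForComparison('  TypeScript   provides\nstatic typing. ')
console.log(processedInput) // "typescript provides static typing."
```

With both options disabled the string is returned unchanged, which makes the comparison sensitive to case and spacing.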

package/dist/docs/references/reference-evals-context-precision.md CHANGED

@@ -1,18 +1,18 @@
-# Context Precision Scorer
+# Context precision scorer
 The `createContextPrecisionScorer()` function creates a scorer that evaluates how relevant and well-positioned retrieved context pieces are for generating expected outputs. It uses **Mean Average Precision (MAP)** to reward systems that place relevant context earlier in the sequence.
-It is especially useful for these use cases:
+It's especially useful for these use cases:
-**RAG System Evaluation**
+## RAG system evaluation
 Ideal for evaluating retrieved context in RAG pipelines where:
 - Context ordering matters for model performance
-- You need to measure retrieval quality beyond simple relevance
+- You need to measure retrieval quality beyond basic relevance
 - Early relevant context is more valuable than later relevant context
-**Context Window Optimization**
+## Context window optimization
 Use when optimizing context selection for:
@@ -22,19 +22,19 @@ Use when optimizing context selection for:
 ## Parameters
-**model:** (`MastraModelConfig`): The language model to use for evaluating context relevance
+**model** (`MastraModelConfig`): The language model to use for evaluating context relevance
-**options:** (`ContextPrecisionMetricOptions`): Configuration options for the scorer
+**options** (`ContextPrecisionMetricOptions`): Configuration options for the scorer
 **Note**: Either `context` or `contextExtractor` must be provided. If both are provided, `contextExtractor` takes precedence.
-## .run() Returns
+## `.run()` returns
-**score:** (`number`): Mean Average Precision score between 0 and scale (default 0-1)
+**score** (`number`): Mean Average Precision score between 0 and scale (default 0-1)
-**reason:** (`string`): Human-readable explanation of the context precision evaluation
+**reason** (`string`): Human-readable explanation of the context precision evaluation
-## Scoring Details
+## Scoring details
 ### Mean Average Precision (MAP)
@@ -77,7 +77,7 @@ The reason field explains:
 Use results to:
 - **Improve retrieval**: Filter out irrelevant context before ranking
-- **Optimize ranking**: Ensure relevant context appears early
+- **Optimize ranking**: Ensure relevant context surfaces early
 - **Tune chunk size**: Balance context detail vs. relevance precision
 - **Evaluate embeddings**: Test different embedding models for better retrieval
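Reviewer note: this file documents the score as Mean Average Precision and includes the worked value `MAP = (1.0 + 0.67) / 2 ≈ 0.83`. The sketch below is the textbook MAP calculation over a list of per-position relevance verdicts; it is not taken from the package source. `meanAveragePrecision` is a hypothetical name, the pattern `[true, false, true]` is one input that yields those two precision terms, and returning 0 when nothing is relevant is an assumption.

```typescript
// Hypothetical helper for illustration; not exported by @mastra/evals.
function meanAveragePrecision(relevance: boolean[], scale = 1): number {
  let relevantSeen = 0
  let precisionSum = 0

  relevance.forEach((isRelevant, index) => {
    if (!isRelevant) return
    relevantSeen += 1
    // Precision at this position: relevant items so far / items so far.
    precisionSum += relevantSeen / (index + 1)
  })

  // Assumption: no relevant context at all scores 0.
  if (relevantSeen === 0) return 0

  // Average the precision values taken at each relevant position.
  return (precisionSum / relevantSeen) * scale
}

const score = meanAveragePrecision([true, false, true])
console.log(score.toFixed(2)) // "0.83"
```

Moving the second relevant item from position 3 to position 2 raises the score to 1, which is the ordering effect the scorer is meant to reward.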
@@ -98,38 +98,38 @@ MAP = (1.0 + 0.67) / 2 = 0.835 ≈ **0.83**
 ```typescript
 const scorer = createContextPrecisionScorer({
-  model: "openai/gpt-5.1",
+  model: 'openai/gpt-5.4',
   options: {
     contextExtractor: (input, output) => {
       // Extract context dynamically based on the query
-      const query = input?.inputMessages?.[0]?.content || "";
+      const query = input?.inputMessages?.[0]?.content || ''
       // Example: Retrieve from a vector database
-      const searchResults = vectorDB.search(query, { limit: 10 });
-      return searchResults.map((result) => result.content);
+      const searchResults = vectorDB.search(query, { limit: 10 })
+      return searchResults.map(result => result.content)
     },
     scale: 1,
   },
-});
+})
 ```
 ### Large context evaluation
 ```typescript
 const scorer = createContextPrecisionScorer({
-  model: "openai/gpt-5.1",
+  model: 'openai/gpt-5.4',
   options: {
     context: [
       // Simulate retrieved documents from vector database
-      "Document 1: Highly relevant content...",
-      "Document 2: Somewhat related content...",
-      "Document 3: Tangentially related...",
-      "Document 4: Not relevant...",
-      "Document 5: Highly relevant content...",
+      'Document 1: Highly relevant content...',
+      'Document 2: Somewhat related content...',
+      'Document 3: Tangentially related...',
+      'Document 4: Not relevant...',
+      'Document 5: Highly relevant content...',
       // ... up to dozens of context pieces
     ],
   },
-});
+})
 ```
 ## Example
@@ -137,27 +137,27 @@ const scorer = createContextPrecisionScorer({
 Evaluate RAG system context retrieval precision for different queries:
 ```typescript
-import { runEvals } from "@mastra/core/evals";
-import { createContextPrecisionScorer } from "@mastra/evals/scorers/prebuilt";
-import { myAgent } from "./agent";
+import { runEvals } from '@mastra/core/evals'
+import { createContextPrecisionScorer } from '@mastra/evals/scorers/prebuilt'
+import { myAgent } from './agent'
 const scorer = createContextPrecisionScorer({
-  model: "openai/gpt-4o",
+  model: 'openai/gpt-5.4',
   options: {
     contextExtractor: (input, output) => {
       // Extract context from agent's retrieved documents
-      return output.metadata?.retrievedContext || [];
+      return output.metadata?.retrievedContext || []
     },
   },
-});
+})
 const result = await runEvals({
   data: [
     {
-      input: "How does photosynthesis work in plants?",
+      input: 'How does photosynthesis work in plants?',
     },
     {
-      input: "What are the mental and physical benefits of exercise?",
+      input: 'What are the mental and physical benefits of exercise?',
     },
   ],
   scorers: [scorer],
@@ -166,18 +166,18 @@ const result = await runEvals({
     console.log({
       score: scorerResults[scorer.id].score,
       reason: scorerResults[scorer.id].reason,
-    });
+    })
   },
-});
+})
-console.log(result.scores);
+console.log(result.scores)
 ```
 For more details on `runEvals`, see the [runEvals reference](https://mastra.ai/reference/evals/run-evals).
 To add this scorer to an agent, see the [Scorers overview](https://mastra.ai/docs/evals/overview) guide.
-## Comparison with Context Relevance
+## Comparison with context relevance
 Choose the right scorer for your needs: