@mastra/evals 1.1.2-alpha.0 → 1.2.0-alpha.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (60)
  1. package/CHANGELOG.md +59 -2
  2. package/LICENSE.md +15 -0
  3. package/dist/chunk-EVBNIL5M.js +606 -0
  4. package/dist/chunk-EVBNIL5M.js.map +1 -0
  5. package/dist/chunk-XRUR5PBK.cjs +632 -0
  6. package/dist/chunk-XRUR5PBK.cjs.map +1 -0
  7. package/dist/docs/SKILL.md +20 -19
  8. package/dist/docs/assets/SOURCE_MAP.json +1 -1
  9. package/dist/docs/references/docs-evals-built-in-scorers.md +2 -1
  10. package/dist/docs/references/docs-evals-overview.md +11 -16
  11. package/dist/docs/references/reference-evals-answer-relevancy.md +25 -25
  12. package/dist/docs/references/reference-evals-answer-similarity.md +33 -35
  13. package/dist/docs/references/reference-evals-bias.md +24 -24
  14. package/dist/docs/references/reference-evals-completeness.md +19 -20
  15. package/dist/docs/references/reference-evals-content-similarity.md +20 -20
  16. package/dist/docs/references/reference-evals-context-precision.md +36 -36
  17. package/dist/docs/references/reference-evals-context-relevance.md +136 -141
  18. package/dist/docs/references/reference-evals-faithfulness.md +24 -24
  19. package/dist/docs/references/reference-evals-hallucination.md +52 -69
  20. package/dist/docs/references/reference-evals-keyword-coverage.md +18 -18
  21. package/dist/docs/references/reference-evals-noise-sensitivity.md +167 -177
  22. package/dist/docs/references/reference-evals-prompt-alignment.md +111 -116
  23. package/dist/docs/references/reference-evals-scorer-utils.md +285 -105
  24. package/dist/docs/references/reference-evals-textual-difference.md +18 -18
  25. package/dist/docs/references/reference-evals-tone-consistency.md +19 -19
  26. package/dist/docs/references/reference-evals-tool-call-accuracy.md +165 -165
  27. package/dist/docs/references/reference-evals-toxicity.md +21 -21
  28. package/dist/docs/references/reference-evals-trajectory-accuracy.md +613 -0
  29. package/dist/scorers/code/index.d.ts +1 -0
  30. package/dist/scorers/code/index.d.ts.map +1 -1
  31. package/dist/scorers/code/trajectory/index.d.ts +147 -0
  32. package/dist/scorers/code/trajectory/index.d.ts.map +1 -0
  33. package/dist/scorers/llm/answer-similarity/index.d.ts +2 -2
  34. package/dist/scorers/llm/context-precision/index.d.ts +2 -2
  35. package/dist/scorers/llm/context-relevance/index.d.ts +1 -1
  36. package/dist/scorers/llm/faithfulness/index.d.ts +1 -1
  37. package/dist/scorers/llm/hallucination/index.d.ts +2 -2
  38. package/dist/scorers/llm/index.d.ts +1 -0
  39. package/dist/scorers/llm/index.d.ts.map +1 -1
  40. package/dist/scorers/llm/noise-sensitivity/index.d.ts +1 -1
  41. package/dist/scorers/llm/prompt-alignment/index.d.ts +5 -5
  42. package/dist/scorers/llm/tool-call-accuracy/index.d.ts +1 -1
  43. package/dist/scorers/llm/toxicity/index.d.ts +1 -1
  44. package/dist/scorers/llm/trajectory/index.d.ts +58 -0
  45. package/dist/scorers/llm/trajectory/index.d.ts.map +1 -0
  46. package/dist/scorers/llm/trajectory/prompts.d.ts +20 -0
  47. package/dist/scorers/llm/trajectory/prompts.d.ts.map +1 -0
  48. package/dist/scorers/prebuilt/index.cjs +638 -59
  49. package/dist/scorers/prebuilt/index.cjs.map +1 -1
  50. package/dist/scorers/prebuilt/index.js +578 -2
  51. package/dist/scorers/prebuilt/index.js.map +1 -1
  52. package/dist/scorers/utils.cjs +41 -17
  53. package/dist/scorers/utils.d.ts +171 -1
  54. package/dist/scorers/utils.d.ts.map +1 -1
  55. package/dist/scorers/utils.js +1 -1
  56. package/package.json +14 -11
  57. package/dist/chunk-OEOE7ZHN.js +0 -195
  58. package/dist/chunk-OEOE7ZHN.js.map +0 -1
  59. package/dist/chunk-W3U7MMDX.cjs +0 -212
  60. package/dist/chunk-W3U7MMDX.cjs.map +0 -1
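Of the 60 files listed, only the prompt-alignment reference is diffed in full below. For orientation, the `.run()` result shape that reference documents (fields `score` and `reason`) can be sketched as follows; the interface name is illustrative, not a package export:

```typescript
// Result shape documented for `.run()` in the updated prompt-alignment reference.
// `PromptAlignmentResult` is a name chosen here for illustration only.
interface PromptAlignmentResult {
  score: number // 0 up to the configured `scale` (default scale: 1)
  reason: string // human-readable explanation with a detailed breakdown
}

const example: PromptAlignmentResult = {
  score: 0.95,
  reason: 'Excellent alignment - function addresses intent, includes error handling...',
}

console.log(`Prompt alignment: ${example.score}`)
```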
@@ -1,18 +1,18 @@
- # Prompt Alignment Scorer
+ # Prompt alignment scorer
 
  The `createPromptAlignmentScorerLLM()` function creates a scorer that evaluates how well agent responses align with user prompts across multiple dimensions: intent understanding, requirement fulfillment, response completeness, and format appropriateness.
 
  ## Parameters
 
- **model:** (`MastraModelConfig`): The language model to use for evaluating prompt-response alignment
+ **model** (`MastraModelConfig`): The language model to use for evaluating prompt-response alignment
 
- **options:** (`PromptAlignmentOptions`): Configuration options for the scorer
+ **options** (`PromptAlignmentOptions`): Configuration options for the scorer
 
- ## .run() Returns
+ ## `.run()` returns
 
- **score:** (`number`): Multi-dimensional alignment score between 0 and scale (default 0-1)
+ **score** (`number`): Multi-dimensional alignment score between 0 and scale (default 0-1)
 
- **reason:** (`string`): Human-readable explanation of the prompt alignment evaluation with detailed breakdown
+ **reason** (`string`): Human-readable explanation of the prompt alignment evaluation with detailed breakdown
 
  `.run()` returns a result in the following shape:
 
@@ -52,7 +52,7 @@ The `createPromptAlignmentScorerLLM()` function creates a scorer that evaluates
  }
  ```
 
- ## Scoring Details
+ ## Scoring details
 
  ### Scorer configuration
 
@@ -60,12 +60,12 @@ You can customize the Prompt Alignment Scorer by adjusting the scale parameter a
  ```typescript
  const scorer = createPromptAlignmentScorerLLM({
- model: "openai/gpt-5.1",
+ model: 'openai/gpt-5.4',
  options: {
  scale: 10, // Score from 0-10 instead of 0-1
- evaluationMode: "both", // 'user', 'system', or 'both' (default)
+ evaluationMode: 'both', // 'user', 'system', or 'both' (default)
  },
- });
+ })
  ```
 
  ### Multi-Dimensional Analysis
@@ -163,7 +163,7 @@ Final Score = Weighted Score × scale
  - Production monitoring where both user and system requirements matter
  - Holistic assessment of prompt-response alignment
 
- ## Common Use Cases
+ ## Common use cases
 
  ### Code Generation Evaluation
 
@@ -176,8 +176,7 @@ Ideal for evaluating:
 
  ```typescript
  // Example: API endpoint creation
- const codePrompt =
- "Create a REST API endpoint with authentication and rate limiting";
+ const codePrompt = 'Create a REST API endpoint with authentication and rate limiting'
  // Scorer evaluates: intent (API creation), requirements (auth + rate limiting),
  // completeness (full implementation), format (code structure)
  ```
@@ -194,7 +193,7 @@ Perfect for:
  ```typescript
  // Example: Multi-requirement task
  const taskPrompt =
- "Write a Python class with initialization, validation, error handling, and documentation";
+ 'Write a Python class with initialization, validation, error handling, and documentation'
  // Scorer tracks each requirement individually and provides detailed breakdown
  ```
 
@@ -210,7 +209,7 @@ Useful for:
  ```typescript
  // Example: Structured output
  const formatPrompt =
- "Explain the differences between let and const in JavaScript using bullet points";
+ 'Explain the differences between let and const in JavaScript using bullet points'
  // Scorer evaluates content accuracy AND format compliance
  ```
 
@@ -220,31 +219,30 @@ Measure how well your AI agents follow user instructions:
 
  ```typescript
  const agent = new Agent({
- name: "CodingAssistant",
- instructions:
- "You are a helpful coding assistant. Always provide working code examples.",
- model: "openai/gpt-5.1",
- });
+ name: 'CodingAssistant',
+ instructions: 'You are a helpful coding assistant. Always provide working code examples.',
+ model: 'openai/gpt-5.4',
+ })
 
  // Evaluate comprehensive alignment (default)
  const scorer = createPromptAlignmentScorerLLM({
- model: "openai/gpt-5.1",
- options: { evaluationMode: "both" }, // Evaluates both user intent and system guidelines
- });
+ model: 'openai/gpt-5.4',
+ options: { evaluationMode: 'both' }, // Evaluates both user intent and system guidelines
+ })
 
  // Evaluate just user satisfaction
  const userScorer = createPromptAlignmentScorerLLM({
- model: "openai/gpt-5.1",
- options: { evaluationMode: "user" }, // Focus only on user request fulfillment
- });
+ model: 'openai/gpt-5.4',
+ options: { evaluationMode: 'user' }, // Focus only on user request fulfillment
+ })
 
  // Evaluate system compliance
  const systemScorer = createPromptAlignmentScorerLLM({
- model: "openai/gpt-5.1",
- options: { evaluationMode: "system" }, // Check adherence to system instructions
- });
+ model: 'openai/gpt-5.4',
+ options: { evaluationMode: 'system' }, // Check adherence to system instructions
+ })
 
- const result = await scorer.run(agentRun);
+ const result = await scorer.run(agentRun)
  ```
 
  ### Prompt Engineering Optimization
@@ -253,15 +251,15 @@ Test different prompts to improve alignment:
 
  ```typescript
  const prompts = [
- "Write a function to calculate factorial",
- "Create a Python function that calculates factorial with error handling for negative inputs",
- "Implement a factorial calculator in Python with: input validation, error handling, and docstring",
- ];
+ 'Write a function to calculate factorial',
+ 'Create a Python function that calculates factorial with error handling for negative inputs',
+ 'Implement a factorial calculator in Python with: input validation, error handling, and docstring',
+ ]
 
  // Compare alignment scores to find the best prompt
  for (const prompt of prompts) {
- const result = await scorer.run(createTestRun(prompt, response));
- console.log(`Prompt alignment: ${result.score}`);
+ const result = await scorer.run(createTestRun(prompt, response))
+ console.log(`Prompt alignment: ${result.score}`)
  }
  ```
 
@@ -289,23 +287,22 @@ for (const agent of agents) {
  ### Basic Configuration
 
  ```typescript
- import { createPromptAlignmentScorerLLM } from "@mastra/evals";
+ import { createPromptAlignmentScorerLLM } from '@mastra/evals'
 
  const scorer = createPromptAlignmentScorerLLM({
- model: "openai/gpt-5.1",
- });
+ model: 'openai/gpt-5.4',
+ })
 
  // Evaluate a code generation task
  const result = await scorer.run({
  input: [
  {
- role: "user",
- content:
- "Write a Python function to calculate factorial with error handling",
+ role: 'user',
+ content: 'Write a Python function to calculate factorial with error handling',
  },
  ],
  output: {
- role: "assistant",
+ role: 'assistant',
  text: `def factorial(n):
  if n < 0:
  raise ValueError("Factorial not defined for negative numbers")
@@ -313,7 +310,7 @@ const result = await scorer.run({
  return 1
  return n * factorial(n-1)`,
  },
- });
+ })
  // Result: { score: 0.95, reason: "Excellent alignment - function addresses intent, includes error handling..." }
  ```
 
@@ -322,26 +319,26 @@ const result = await scorer.run({
  ```typescript
  // Configure scale and evaluation mode
  const scorer = createPromptAlignmentScorerLLM({
- model: "openai/gpt-5.1",
+ model: 'openai/gpt-5.4',
  options: {
  scale: 10, // Score from 0-10 instead of 0-1
- evaluationMode: "both", // 'user', 'system', or 'both' (default)
+ evaluationMode: 'both', // 'user', 'system', or 'both' (default)
  },
- });
+ })
 
  // User-only evaluation - focus on user satisfaction
  const userScorer = createPromptAlignmentScorerLLM({
- model: "openai/gpt-5.1",
- options: { evaluationMode: "user" },
- });
+ model: 'openai/gpt-5.4',
+ options: { evaluationMode: 'user' },
+ })
 
  // System-only evaluation - focus on compliance
  const systemScorer = createPromptAlignmentScorerLLM({
- model: "openai/gpt-5.1",
- options: { evaluationMode: "system" },
- });
+ model: 'openai/gpt-5.4',
+ options: { evaluationMode: 'system' },
+ })
 
- const result = await scorer.run(testRun);
+ const result = await scorer.run(testRun)
  // Result: { score: 8.5, reason: "Score: 8.5 out of 10 - Good alignment with both user intent and system guidelines..." }
  ```
 
@@ -352,15 +349,15 @@ const result = await scorer.run(testRun);
  const result = await scorer.run({
  input: [
  {
- role: "user",
- content: "List the benefits of TypeScript in bullet points",
+ role: 'user',
+ content: 'List the benefits of TypeScript in bullet points',
  },
  ],
  output: {
- role: "assistant",
- text: "TypeScript provides static typing, better IDE support, and enhanced code reliability.",
+ role: 'assistant',
+ text: 'TypeScript provides static typing, better IDE support, and enhanced code reliability.',
  },
- });
+ })
  // Result: Lower appropriateness score due to format mismatch (paragraph vs bullet points)
  ```
 
@@ -369,19 +366,19 @@ const result = await scorer.run({
  In this example, the response fully addresses the user's prompt with all requirements met.
 
  ```typescript
- import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/prebuilt";
+ import { createPromptAlignmentScorerLLM } from '@mastra/evals/scorers/prebuilt'
 
  const scorer = createPromptAlignmentScorerLLM({
- model: "openai/gpt-5.1",
- });
+ model: 'openai/gpt-5.4',
+ })
 
  const inputMessages = [
  {
- role: "user",
+ role: 'user',
  content:
- "Write a Python function to calculate factorial with error handling for negative numbers",
+ 'Write a Python function to calculate factorial with error handling for negative numbers',
  },
- ];
+ ]
 
  const outputMessage = {
  text: `def factorial(n):
@@ -391,14 +388,14 @@ const outputMessage = {
  if n == 0 or n == 1:
  return 1
  return n * factorial(n - 1)`,
- };
+ }
 
  const result = await scorer.run({
  input: inputMessages,
  output: outputMessage,
- });
+ })
 
- console.log(result);
+ console.log(result)
  ```
 
  ### Excellent alignment output
@@ -417,29 +414,29 @@ The output receives a high score because it perfectly addresses the intent, fulf
  In this example, the response addresses the core intent but misses some requirements or has format issues.
 
  ```typescript
- import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/prebuilt";
+ import { createPromptAlignmentScorerLLM } from '@mastra/evals/scorers/prebuilt'
 
  const scorer = createPromptAlignmentScorerLLM({
- model: "openai/gpt-5.1",
- });
+ model: 'openai/gpt-5.4',
+ })
 
  const inputMessages = [
  {
- role: "user",
- content: "List the benefits of TypeScript in bullet points",
+ role: 'user',
+ content: 'List the benefits of TypeScript in bullet points',
  },
- ];
+ ]
 
  const outputMessage = {
- text: "TypeScript provides static typing, better IDE support, and enhanced code reliability through compile-time error checking.",
- };
+ text: 'TypeScript provides static typing, better IDE support, and enhanced code reliability through compile-time error checking.',
+ }
 
  const result = await scorer.run({
  input: inputMessages,
  output: outputMessage,
- });
+ })
 
- console.log(result);
+ console.log(result)
  ```
 
  #### Partial alignment output
@@ -458,32 +455,32 @@ The output receives a lower score because while the content is accurate, it does
  In this example, the response fails to address the user's specific requirements.
 
  ```typescript
- import { createPromptAlignmentScorerLLM } from "@mastra/evals/scorers/prebuilt";
+ import { createPromptAlignmentScorerLLM } from '@mastra/evals/scorers/prebuilt'
 
  const scorer = createPromptAlignmentScorerLLM({
- model: "openai/gpt-5.1",
- });
+ model: 'openai/gpt-5.4',
+ })
 
  const inputMessages = [
  {
- role: "user",
+ role: 'user',
  content:
- "Write a Python class with initialization, validation, error handling, and documentation",
+ 'Write a Python class with initialization, validation, error handling, and documentation',
  },
- ];
+ ]
 
  const outputMessage = {
  text: `class Example:
  def __init__(self, value):
  self.value = value`,
- };
+ }
 
  const result = await scorer.run({
  input: inputMessages,
  output: outputMessage,
- });
+ })
 
- console.log(result);
+ console.log(result)
  ```
 
  ### Poor alignment output
@@ -505,29 +502,29 @@ Evaluates how well the response addresses the user's request, ignoring system in
 
  ```typescript
  const scorer = createPromptAlignmentScorerLLM({
- model: "openai/gpt-5.1",
- options: { evaluationMode: "user" },
- });
+ model: 'openai/gpt-5.4',
+ options: { evaluationMode: 'user' },
+ })
 
  const result = await scorer.run({
  input: {
  inputMessages: [
  {
- role: "user",
- content: "Explain recursion with an example",
+ role: 'user',
+ content: 'Explain recursion with an example',
  },
  ],
  systemMessages: [
  {
- role: "system",
- content: "Always provide code examples in Python",
+ role: 'system',
+ content: 'Always provide code examples in Python',
  },
  ],
  },
  output: {
- text: "Recursion is when a function calls itself. For example: factorial(5) = 5 * factorial(4)",
+ text: 'Recursion is when a function calls itself. For example: factorial(5) = 5 * factorial(4)',
  },
- });
+ })
  // Scores high for addressing user request, even without Python code
  ```
 
@@ -537,30 +534,29 @@ Evaluates compliance with system behavioral guidelines and constraints:
 
  ```typescript
  const scorer = createPromptAlignmentScorerLLM({
- model: "openai/gpt-5.1",
- options: { evaluationMode: "system" },
- });
+ model: 'openai/gpt-5.4',
+ options: { evaluationMode: 'system' },
+ })
 
  const result = await scorer.run({
  input: {
  systemMessages: [
  {
- role: "system",
- content:
- "You are a helpful assistant. Always be polite, concise, and provide examples.",
+ role: 'system',
+ content: 'You are a helpful assistant. Always be polite, concise, and provide examples.',
  },
  ],
  inputMessages: [
  {
- role: "user",
- content: "What is machine learning?",
+ role: 'user',
+ content: 'What is machine learning?',
  },
  ],
  },
  output: {
- text: "Machine learning is a subset of AI where computers learn from data. For example, spam filters learn to identify unwanted emails by analyzing patterns in previously marked spam.",
+ text: 'Machine learning is a subset of AI where computers learn from data. For example, spam filters learn to identify unwanted emails by analyzing patterns in previously marked spam.',
  },
- });
+ })
  // Evaluates politeness, conciseness, and example provision
  ```
 
@@ -570,23 +566,22 @@ Evaluates both user intent fulfillment and system compliance with weighted scori
 
  ```typescript
  const scorer = createPromptAlignmentScorerLLM({
- model: "openai/gpt-5.1",
- options: { evaluationMode: "both" }, // This is the default
- });
+ model: 'openai/gpt-5.4',
+ options: { evaluationMode: 'both' }, // This is the default
+ })
 
  const result = await scorer.run({
  input: {
  systemMessages: [
  {
- role: "system",
- content:
- "Always provide code examples when explaining programming concepts",
+ role: 'system',
+ content: 'Always provide code examples when explaining programming concepts',
  },
  ],
  inputMessages: [
  {
- role: "user",
- content: "Explain how to reverse a string",
+ role: 'user',
+ content: 'Explain how to reverse a string',
  },
  ],
  },
@@ -598,11 +593,11 @@ const result = await scorer.run({
 
  # Usage: reverse_string("hello") returns "olleh"`,
  },
- });
+ })
  // High score for both addressing the user's request AND following system guidelines
  ```
 
- ## Comparison with Other Scorers
+ ## Comparison with other scorers
 
  | Aspect | Prompt Alignment | Answer Relevancy | Faithfulness |
  | -------------- | ------------------------------------------ | ---------------------------- | -------------------------------- |