@mastra/evals 1.1.2-alpha.0 → 1.2.0-alpha.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CHANGELOG.md +59 -2
- package/LICENSE.md +15 -0
- package/dist/chunk-EVBNIL5M.js +606 -0
- package/dist/chunk-EVBNIL5M.js.map +1 -0
- package/dist/chunk-XRUR5PBK.cjs +632 -0
- package/dist/chunk-XRUR5PBK.cjs.map +1 -0
- package/dist/docs/SKILL.md +20 -19
- package/dist/docs/assets/SOURCE_MAP.json +1 -1
- package/dist/docs/references/docs-evals-built-in-scorers.md +2 -1
- package/dist/docs/references/docs-evals-overview.md +11 -16
- package/dist/docs/references/reference-evals-answer-relevancy.md +25 -25
- package/dist/docs/references/reference-evals-answer-similarity.md +33 -35
- package/dist/docs/references/reference-evals-bias.md +24 -24
- package/dist/docs/references/reference-evals-completeness.md +19 -20
- package/dist/docs/references/reference-evals-content-similarity.md +20 -20
- package/dist/docs/references/reference-evals-context-precision.md +36 -36
- package/dist/docs/references/reference-evals-context-relevance.md +136 -141
- package/dist/docs/references/reference-evals-faithfulness.md +24 -24
- package/dist/docs/references/reference-evals-hallucination.md +52 -69
- package/dist/docs/references/reference-evals-keyword-coverage.md +18 -18
- package/dist/docs/references/reference-evals-noise-sensitivity.md +167 -177
- package/dist/docs/references/reference-evals-prompt-alignment.md +111 -116
- package/dist/docs/references/reference-evals-scorer-utils.md +285 -105
- package/dist/docs/references/reference-evals-textual-difference.md +18 -18
- package/dist/docs/references/reference-evals-tone-consistency.md +19 -19
- package/dist/docs/references/reference-evals-tool-call-accuracy.md +165 -165
- package/dist/docs/references/reference-evals-toxicity.md +21 -21
- package/dist/docs/references/reference-evals-trajectory-accuracy.md +613 -0
- package/dist/scorers/code/index.d.ts +1 -0
- package/dist/scorers/code/index.d.ts.map +1 -1
- package/dist/scorers/code/trajectory/index.d.ts +147 -0
- package/dist/scorers/code/trajectory/index.d.ts.map +1 -0
- package/dist/scorers/llm/answer-similarity/index.d.ts +2 -2
- package/dist/scorers/llm/context-precision/index.d.ts +2 -2
- package/dist/scorers/llm/context-relevance/index.d.ts +1 -1
- package/dist/scorers/llm/faithfulness/index.d.ts +1 -1
- package/dist/scorers/llm/hallucination/index.d.ts +2 -2
- package/dist/scorers/llm/index.d.ts +1 -0
- package/dist/scorers/llm/index.d.ts.map +1 -1
- package/dist/scorers/llm/noise-sensitivity/index.d.ts +1 -1
- package/dist/scorers/llm/prompt-alignment/index.d.ts +5 -5
- package/dist/scorers/llm/tool-call-accuracy/index.d.ts +1 -1
- package/dist/scorers/llm/toxicity/index.d.ts +1 -1
- package/dist/scorers/llm/trajectory/index.d.ts +58 -0
- package/dist/scorers/llm/trajectory/index.d.ts.map +1 -0
- package/dist/scorers/llm/trajectory/prompts.d.ts +20 -0
- package/dist/scorers/llm/trajectory/prompts.d.ts.map +1 -0
- package/dist/scorers/prebuilt/index.cjs +638 -59
- package/dist/scorers/prebuilt/index.cjs.map +1 -1
- package/dist/scorers/prebuilt/index.js +578 -2
- package/dist/scorers/prebuilt/index.js.map +1 -1
- package/dist/scorers/utils.cjs +41 -17
- package/dist/scorers/utils.d.ts +171 -1
- package/dist/scorers/utils.d.ts.map +1 -1
- package/dist/scorers/utils.js +1 -1
- package/package.json +14 -11
- package/dist/chunk-OEOE7ZHN.js +0 -195
- package/dist/chunk-OEOE7ZHN.js.map +0 -1
- package/dist/chunk-W3U7MMDX.cjs +0 -212
- package/dist/chunk-W3U7MMDX.cjs.map +0 -1
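
The most significant addition in this release is the new trajectory scorer (see the `trajectory/` entries and the new `reference-evals-trajectory-accuracy.md` above). As rough orientation only, a minimal usage sketch: the export name below is an assumption based on the `create*Scorer` naming convention the package uses for its other prebuilt scorers and on the new doc's filename, and is not confirmed by this diff.

```typescript
// Orientation sketch only. This release adds trajectory scorers under
// dist/scorers/code/trajectory and dist/scorers/llm/trajectory, plus a new
// reference-evals-trajectory-accuracy.md doc. The export name below is an
// assumption based on the package's create*Scorer convention; consult the
// new reference doc for the real API.
import { createTrajectoryAccuracyScorer } from '@mastra/evals/scorers/prebuilt'

const trajectoryScorer = createTrajectoryAccuracyScorer({
  model: 'openai/gpt-5.4', // model id reused from the examples in this diff
})
```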

package/dist/docs/references/reference-evals-hallucination.md

@@ -1,4 +1,4 @@
-# Hallucination
+# Hallucination scorer
 
 The `createHallucinationScorer()` function evaluates whether an LLM generates factually correct information by comparing its output against the provided context. This scorer measures hallucination by identifying direct contradictions between the context and the output.
 
@@ -6,47 +6,37 @@ The `createHallucinationScorer()` function evaluates whether an LLM generates fa
 
 The `createHallucinationScorer()` function accepts a single options object with the following properties:
 
-**model
+**model** (`LanguageModel`): Configuration for the model used to evaluate hallucination.
 
-**options
+**options** (`Options`): Configuration options.
 
-**options.
+**options.scale** (`number`): Maximum score value.
 
-**options.
+**options.context** (`string[]`): Static context strings to use as ground truth for hallucination detection.
 
-
-
-### GetContextParams
-
-The `getContext` hook receives the following parameters:
-
-**run:** (`GetContextRun`): The scorer run containing input, output, runId, requestContext, and tracingContext.
+**options.getContext** (`(params: GetContextParams) => string[] | Promise<string[]>`): A hook to dynamically resolve context at runtime. Takes priority over static context. Useful for live scoring where context (like tool results) is only available when the scorer runs.
 
-
-
-**score:** (`number`): The computed score. Only present when called from the generateReason step.
-
-**step:** (`'analyze' | 'generateReason'`): Which step is calling the hook. Useful for caching context between calls.
+This function returns an instance of the MastraScorer class. The `.run()` method accepts the same input as other scorers (see the [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer)), but the return value includes LLM-specific fields as documented below.
 
-##
+## `.run()` returns
 
-**runId
+**runId** (`string`): The id of the run (optional).
 
-**preprocessStepResult
+**preprocessStepResult** (`object`): Object with extracted claims: { claims: string\[] }
 
-**preprocessPrompt
+**preprocessPrompt** (`string`): The prompt sent to the LLM for the preprocess step (optional).
 
-**analyzeStepResult
+**analyzeStepResult** (`object`): Object with verdicts: { verdicts: Array<{ statement: string, verdict: 'yes' | 'no', reason: string }> }
 
-**analyzePrompt
+**analyzePrompt** (`string`): The prompt sent to the LLM for the analyze step (optional).
 
-**score
+**score** (`number`): Hallucination score (0 to scale, default 0-1).
 
-**reason
+**reason** (`string`): Detailed explanation of the score and identified contradictions.
 
-**generateReasonPrompt
+**generateReasonPrompt** (`string`): The prompt sent to the LLM for the generateReason step (optional).
 
-## Scoring
+## Scoring details
 
 The scorer evaluates hallucination through contradiction detection and unsupported claim analysis.
 
@@ -111,40 +101,38 @@ A hallucination score between 0 and 1:
 Use static context when you have known ground truth to compare against:
 
 ```typescript
-import { createHallucinationScorer } from
+import { createHallucinationScorer } from '@mastra/evals/scorers/prebuilt'
 
 const scorer = createHallucinationScorer({
-  model:
+  model: 'openai/gpt-5.4',
   options: {
     context: [
-
-
-
+      'The first iPhone was announced on January 9, 2007.',
+      'It was released on June 29, 2007.',
+      'Steve Jobs introduced it at Macworld.',
     ],
   },
-})
+})
 ```
 
-### Dynamic Context with getContext
+### Dynamic Context with `getContext`
 
 Use `getContext` for live scoring scenarios where context comes from tool results:
 
 ```typescript
-import { createHallucinationScorer } from
-import { extractToolResults } from
+import { createHallucinationScorer } from '@mastra/evals/scorers/prebuilt'
+import { extractToolResults } from '@mastra/evals/scorers'
 
 const scorer = createHallucinationScorer({
-  model:
+  model: 'openai/gpt-5.4',
   options: {
     getContext: ({ run, step }) => {
       // Extract tool results as context
-      const toolResults = extractToolResults(run.output)
-      return toolResults.map((t)
-        JSON.stringify({ tool: t.toolName, result: t.result })
-      );
+      const toolResults = extractToolResults(run.output)
+      return toolResults.map(t => JSON.stringify({ tool: t.toolName, result: t.result }))
     },
   },
-})
+})
 ```
 
 ### Live Scoring with Agent
@@ -152,62 +140,57 @@ const scorer = createHallucinationScorer({
 Attach the scorer to an agent for live evaluation:
 
 ```typescript
-import { Agent } from
-import { createHallucinationScorer } from
-import { extractToolResults } from
+import { Agent } from '@mastra/core/agent'
+import { createHallucinationScorer } from '@mastra/evals/scorers/prebuilt'
+import { extractToolResults } from '@mastra/evals/scorers'
 
 const hallucinationScorer = createHallucinationScorer({
-  model:
+  model: 'openai/gpt-5.4',
   options: {
     getContext: ({ run }) => {
-      const toolResults = extractToolResults(run.output)
-      return toolResults.map((t)
-        JSON.stringify({ tool: t.toolName, result: t.result })
-      );
+      const toolResults = extractToolResults(run.output)
+      return toolResults.map(t => JSON.stringify({ tool: t.toolName, result: t.result }))
    },
  },
-})
+})
 
 const agent = new Agent({
-  name:
-  model:
-  instructions:
+  name: 'my-agent',
+  model: 'openai/gpt-5.4',
+  instructions: 'You are a helpful assistant.',
   evals: {
     scorers: [hallucinationScorer],
   },
-})
+})
 ```
 
-### Batch Evaluation with runEvals
+### Batch Evaluation with `runEvals`
 
 ```typescript
-import { runEvals } from
-import { createHallucinationScorer } from
-import { myAgent } from
+import { runEvals } from '@mastra/core/evals'
+import { createHallucinationScorer } from '@mastra/evals/scorers/prebuilt'
+import { myAgent } from './agent'
 
 const scorer = createHallucinationScorer({
-  model:
+  model: 'openai/gpt-5.4',
   options: {
-    context: [
+    context: ['Known fact 1', 'Known fact 2'],
   },
-})
+})
 
 const result = await runEvals({
-  data: [
-    { input: "Tell me about topic A" },
-    { input: "Tell me about topic B" },
-  ],
+  data: [{ input: 'Tell me about topic A' }, { input: 'Tell me about topic B' }],
   scorers: [scorer],
   target: myAgent,
   onItemComplete: ({ scorerResults }) => {
     console.log({
       score: scorerResults[scorer.id].score,
       reason: scorerResults[scorer.id].reason,
-    })
+    })
   },
-})
+})
 
-console.log(result.scores)
+console.log(result.scores)
 ```
 
 For more details on `runEvals`, see the [runEvals reference](https://mastra.ai/reference/evals/run-evals).
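
The diff above removes the old `GetContextParams` section, which documented the hook's `run`, `score`, and `step` parameters and noted that `step` is useful for caching context between the analyze and generateReason calls. A minimal sketch of that caching pattern, assuming the parameter shapes described in the removed text (`run.runId`, `step: 'analyze' | 'generateReason'`); the retrieval helper is hypothetical.

```typescript
import { createHallucinationScorer } from '@mastra/evals/scorers/prebuilt'

// Illustrative cache; not part of the package API.
const contextCache = new Map<string, string[]>()

const scorer = createHallucinationScorer({
  model: 'openai/gpt-5.4',
  options: {
    getContext: async ({ run, step }) => {
      // `step` is 'analyze' | 'generateReason' per the removed docs; reuse the
      // context resolved during analyze so generateReason explains the same facts.
      const key = run.runId ?? 'default'
      const cached = contextCache.get(key)
      if (step === 'generateReason' && cached) return cached
      const context = await fetchGroundTruth(run.input) // hypothetical retrieval helper
      contextCache.set(key, context)
      return context
    },
  },
})

// Hypothetical stand-in for whatever retrieval backs your ground truth.
async function fetchGroundTruth(input: unknown): Promise<string[]> {
  return ['The first iPhone was announced on January 9, 2007.']
}
```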

package/dist/docs/references/reference-evals-keyword-coverage.md

@@ -1,22 +1,22 @@
-# Keyword
+# Keyword coverage scorer
 
 The `createKeywordCoverageScorer()` function evaluates how well an LLM's output covers the important keywords from the input. It analyzes keyword presence and matches while ignoring common words and stop words.
 
 ## Parameters
 
-The `createKeywordCoverageScorer()` function
+The `createKeywordCoverageScorer()` function doesn't take any options.
 
 This function returns an instance of the MastraScorer class. See the [MastraScorer reference](https://mastra.ai/reference/evals/mastra-scorer) for details on the `.run()` method and its input/output.
 
-##
+## `.run()` returns
 
-**runId
+**runId** (`string`): The id of the run (optional).
 
-**preprocessStepResult
+**preprocessStepResult** (`object`): Object with extracted keywords: { referenceKeywords: Set\<string>, responseKeywords: Set\<string> }
 
-**analyzeStepResult
+**analyzeStepResult** (`object`): Object with keyword coverage: { totalKeywords: number, matchedKeywords: number }
 
-**score
+**score** (`number`): Coverage score (0-1) representing the proportion of matched keywords.
 
 `.run()` returns a result in the following shape:
 
@@ -35,7 +35,7 @@ This function returns an instance of the MastraScorer class. See the [MastraScor
 }
 ```
 
-## Scoring
+## Scoring details
 
 The scorer evaluates keyword coverage by matching keywords with the following features:
 
@@ -85,23 +85,23 @@ The scorer handles several special cases:
 Evaluate keyword coverage between input queries and agent responses:
 
 ```typescript
-import { runEvals } from
-import { createKeywordCoverageScorer } from
-import { myAgent } from
+import { runEvals } from '@mastra/core/evals'
+import { createKeywordCoverageScorer } from '@mastra/evals/scorers/prebuilt'
+import { myAgent } from './agent'
 
-const scorer = createKeywordCoverageScorer()
+const scorer = createKeywordCoverageScorer()
 
 const result = await runEvals({
   data: [
     {
-      input:
+      input: 'JavaScript frameworks like React and Vue',
     },
     {
-      input:
+      input: 'TypeScript offers interfaces, generics, and type inference',
     },
     {
       input:
-
+        'Machine learning models require data preprocessing, feature engineering, and hyperparameter tuning',
     },
   ],
   scorers: [scorer],
@@ -109,11 +109,11 @@ const result = await runEvals({
   onItemComplete: ({ scorerResults }) => {
     console.log({
       score: scorerResults[scorer.id].score,
-    })
+    })
   },
-})
+})
 
-console.log(result.scores)
+console.log(result.scores)
 ```
 
 For more details on `runEvals`, see the [runEvals reference](https://mastra.ai/reference/evals/run-evals).
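
Per the updated `.run()` returns above, the keyword-coverage score is the proportion of matched keywords. A small sketch of consuming that result, using only the fields this diff documents; the zero-keyword fallback is an assumption (the doc mentions special-case handling without showing it).

```typescript
// Interpreting a keyword-coverage result using only the fields the updated
// `.run()` returns section documents; the zero-keyword fallback is an
// assumption (the doc mentions special-case handling without showing it).
type KeywordCoverageResult = {
  runId?: string
  preprocessStepResult: { referenceKeywords: Set<string>; responseKeywords: Set<string> }
  analyzeStepResult: { totalKeywords: number; matchedKeywords: number }
  score: number // 0-1, proportion of matched keywords
}

function summarizeCoverage(result: KeywordCoverageResult): string {
  const { totalKeywords, matchedKeywords } = result.analyzeStepResult
  // The score should track matchedKeywords / totalKeywords.
  const ratio = totalKeywords === 0 ? 1 : matchedKeywords / totalKeywords
  return `covered ${matchedKeywords}/${totalKeywords} keywords (score ${result.score.toFixed(2)}, ratio ${ratio.toFixed(2)})`
}
```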