agentevals 0.0.3 → 0.0.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (24)

package/README.md +141 -36
package/dist/index.cjs +3 -1
package/dist/index.d.ts +1 -0
package/dist/index.js +1 -0
package/dist/trajectory/match.cjs +84 -0
package/dist/trajectory/match.d.ts +61 -0
package/dist/trajectory/match.js +80 -0
package/dist/trajectory/strict.cjs +42 -42
package/dist/trajectory/strict.d.ts +23 -2
package/dist/trajectory/strict.js +40 -41
package/dist/trajectory/subset.cjs +13 -9
package/dist/trajectory/subset.d.ts +8 -1
package/dist/trajectory/subset.js +11 -8
package/dist/trajectory/superset.cjs +13 -9
package/dist/trajectory/superset.d.ts +8 -1
package/dist/trajectory/superset.js +11 -8
package/dist/trajectory/unordered.cjs +14 -10
package/dist/trajectory/unordered.d.ts +8 -1
package/dist/trajectory/unordered.js +12 -9
package/dist/trajectory/utils.cjs +107 -18
package/dist/trajectory/utils.d.ts +3 -2
package/dist/trajectory/utils.js +105 -17
package/dist/types.d.ts +3 -0
package/package.json +1 -1

package/README.md CHANGED

@@ -86,7 +86,6 @@ You can see that despite the small difference in the final response and tool cal
   - [Graph Trajectory](#graph-trajectory)
     - [Graph trajectory LLM-as-judge](#graph-trajectory-llm-as-judge)
     - [Graph trajectory strict match](#graph-trajectory-strict-match)
-- [Python Async Support](#python-async-support)
 - [LangSmith Integration](#langsmith-integration)
   - [Pytest or Vitest/Jest](#pytest-or-vitestjest)
   - [Evaluate](#evaluate)
@@ -106,7 +105,7 @@ npm install openai
 ```
 It is also helpful to be familiar with some [evaluation concepts](https://docs.smith.langchain.com/evaluation/concepts) and
-LangSmith's Vitest/Jest integration for running evals, which is documented [here](https://docs.smith.langchain.com/evaluation/how_to_guides/pytest).
+LangSmith's pytest integration for running evals, which is documented [here](https://docs.smith.langchain.com/evaluation/how_to_guides/pytest).
 ## Evaluators
@@ -116,21 +115,105 @@ Agent trajectory evaluators are used to judge the trajectory of an agent's execu
 These evaluators expect you to format your agent's trajectory as a list of OpenAI format dicts or as a list of LangChain `BaseMessage` classes, and handle message formatting
 under the hood.
+AgentEvals offers the `create_trajectory_match_evaluator`/`createTrajectoryMatchEvaluator` and `create_async_trajectory_match_evaluator` methods for this task.
+#### Checking tool call equality
+When checking equality between tool calls, these matchers will require that all tool call arguments are the same. You can configure this behavior to ignore tool call arguments by setting `tool_args_match_mode="ignore"` (Python) or `toolArgsMatchMode: "ignore"` (JS), or by only checking specific properties within the call using the `tool_args_match_overrides`/`toolArgsMatchOverrides` param.
+`tool_args_match_overrides`/`toolArgsMatchOverrides` takes a dictionary whose keys are tool names and whose values are either `"exact"`, `"ignore"`, a list of fields within the tool call that must match exactly, or a comparator function that takes two arguments and returns whether they are equal:
+```python
+ToolArgsMatchMode = Literal["exact", "ignore"]
+ToolArgsMatchOverrides = dict[str, Union[ToolArgsMatchMode, list[str],  Callable[[dict, dict], bool]]]
+```
+Here's an example that allows case insensitivity for the arguments to a tool named `get_weather`:
+```ts
+import { createTrajectoryMatchEvaluator } from "agentevals";
+const outputs = [
+    { role: "user", content: "What is the weather in SF?" },
+    {
+      role: "assistant",
+      tool_calls: [{
+        function: {
+          name: "get_weather",
+          arguments: JSON.stringify({ city: "san francisco" })
+        },
+      }]
+    },
+    { role: "tool", content: "It's 80 degrees and sunny in SF." },
+    { role: "assistant", content: "The weather in SF is 80 degrees and sunny." },
+];
+const referenceOutputs = [
+    { role: "user", content: "What is the weather in San Francisco?" },
+    {
+      role: "assistant",
+      tool_calls: [{
+        function: {
+          name: "get_weather",
+          arguments: JSON.stringify({ city: "San Francisco" })
+        }
+      }]
+    },
+    { role: "tool", content: "It's 80 degrees and sunny in San Francisco." },
+];
+const evaluator = createTrajectoryMatchEvaluator({
+  trajectoryMatchMode: "strict",
+  toolArgsMatchMode: "exact",  // Default value
+  toolArgsMatchOverrides: {
+    get_weather: (x, y) => {
+      return typeof x.city === "string" &&
+        typeof y.city === "string" &&
+        x.city.toLowerCase() === y.city.toLowerCase();
+    },
+  }
+});
+const result = await evaluator({
+  outputs,
+  referenceOutputs,
+});
+console.log(result);
+```
+```
+{
+  'key': 'trajectory_strict_match',
+  'score': true,
+}
+```
+This flexibility allows you to handle cases where you want looser equality for LLM generated arguments (`"san francisco"` to equal `"San Francisco"`) for only specific tool calls.
 #### Strict match
-The `trajectory_strict_match` evaluator, compares two trajectories and ensures that they contain the same messages
-in the same order with the same tool calls. It allows for differences in message content and tool call arguments,
-but requires that the selected tools at each step are the same.
+The `"strict"` `trajectory_match_mode` compares two trajectories and ensures that they contain the same messages
+in the same order with the same tool calls. Note that it does allow for differences in message content:
 ```ts
-import { trajectoryStrictMatch } from "agentevals";
+import { createTrajectoryMatchEvaluator } from "agentevals";
 const outputs = [
     { role: "user", content: "What is the weather in SF?" },
     {
       role: "assistant",
       tool_calls: [{
-        function: { name: "get_weather", arguments: JSON.stringify({ city: "SF" }) }
+        function: {
+          name: "get_weather",
+          arguments: JSON.stringify({ city: "San Francisco" })
+        },
+      }, {
+        function: {
+          name: "accuweather_forecast",
+          arguments: JSON.stringify({"city": "San Francisco"}),
+        },
       }]
     },
     { role: "tool", content: "It's 80 degrees and sunny in SF." },
@@ -143,7 +226,11 @@ const referenceOutputs = [
     { role: "tool", content: "It's 80 degrees and sunny in San Francisco." },
 ];
-const result = await trajectoryStrictMatch({
+const evaluator = createTrajectoryMatchEvaluator({
+  trajectoryMatchMode: "strict",
+})
+const result = await evaluator({
   outputs,
   referenceOutputs,
 });
@@ -153,17 +240,21 @@ console.log(result);
 ```
 {
-    'key': 'trajectory_accuracy',
-    'score': true,
+    'key': 'trajectory_strict_match',
+    'score': false,
 }
 ```
+`"strict"` is useful if you want to ensure that tools are always called in the same order for a given query (e.g. a company policy lookup tool before a tool that requests vacation time for an employee).
+**Note:** If you would like to configure the way this evaluator checks for tool call equality, see [this section](#checking-tool-call-equality).
 #### Unordered match
-The `trajectory_unordered_match` evaluator, compares two trajectories and ensures that they contain the same number of tool calls in any order. This is useful if you want to allow flexibility in how an agent obtains the proper information, but still do care that all information was retrieved.
+The `"unordered"` `trajectory_match_mode` compares two trajectories and ensures that they contain the same tool calls in any order. This is useful if you want to allow flexibility in how an agent obtains the proper information, but still do care that all information was retrieved.
 ```ts
-import { trajectoryUnorderedMatch } from "agentevals";
+import { createTrajectoryMatchEvaluator } from "agentevals";
 const outputs = [
   { role: "user", content: "What is the weather in SF and is there anything fun happening?" },
@@ -214,7 +305,11 @@ const referenceOutputs = [
   { role: "assistant", content: "In SF, it's 80˚ and sunny, but there is nothing fun happening." },
 ];
-const result = await trajectoryUnorderedMatch({
+const evaluator = createTrajectoryMatchEvaluator({
+  trajectoryMatchMode: "unordered",
+});
+const result = await evaluator({
   outputs,
   referenceOutputs,
 });
@@ -229,13 +324,16 @@ console.log(result)
 }
 ```
+`"unordered"` is useful if you want to ensure that specific tools are called at some point in the trajectory, but you don't necessarily need them to be in message order (e.g. the agent called a company policy retrieval tool at an arbitrary point in an interaction before authorizing spend for a pizza party).
+**Note:** If you would like to configure the way this evaluator checks for tool call equality, see [this section](#checking-tool-call-equality).
 #### Subset and superset match
-There are other evaluators for checking partial trajectory matches (ensuring that a trajectory contains a subset and superset of tool calls compared to a reference trajectory).
+The `"subset"` and `"superset"` modes match partial trajectories (ensuring that a trajectory contains a subset/superset of tool calls contained in a reference trajectory).
 ```ts
-import { trajectorySubset } from "agentevals";
-// import { trajectorySuperset } from "agentevals";
+import { createTrajectoryMatchEvaluator } from "agentevals";
 const outputs = [
   { role: "user", content: "What is the weather in SF and London?" },
@@ -246,9 +344,15 @@ const outputs = [
         name: "get_weather",
         arguments: JSON.stringify({ city: "SF and London" }),
       }
+    }, {
+      "function": {
+        name: "accuweather_forecast",
+        arguments: JSON.stringify({"city": "SF and London"}),
+      }
     }],
   },
   { role: "tool", content: "It's 80 degrees and sunny in SF, and 90 degrees and rainy in London." },
+  { role: "tool", content: "Unknown." },
   { role: "assistant", content: "The weather in SF is 80 degrees and sunny. In London, it's 90 degrees and rainy."},
 ];
@@ -260,23 +364,20 @@ const referenceOutputs = [
       {
         function: {
           name: "get_weather",
-          arguments: JSON.stringify({ city: "San Francisco" }),
-        }
-      },
-      {
-        function: {
-          name: "get_weather",
-          arguments: JSON.stringify({ city: "London" }),
+          arguments: JSON.stringify({ city: "SF and London" }),
         }
       },
     ],
   },
-  { role: "tool", content: "It's 80 degrees and sunny in San Francisco." },
-  { role: "tool", content: "It's 90 degrees and rainy in London." },
+  { role: "tool", content: "It's 80 degrees and sunny in San Francisco, and 90 degrees and rainy in London." },
   { role: "assistant", content: "The weather in SF is 80˚ and sunny. In London, it's 90˚ and rainy." },
 ];
-const result = await trajectorySubset({
+const evaluator = createTrajectoryMatchEvaluator({
+  trajectoryMatchMode: "superset", // or "subset"
+});
+const result = await evaluator({
   outputs,
   referenceOutputs,
 });
@@ -286,11 +387,15 @@ console.log(result)
 ```
 {
-    'key': 'trajectory_subset',
+    'key': 'trajectory_superset_match',
     'score': true,
 }
 ```
+`"superset"` is useful if you want to ensure that some key tools were called at some point in the trajectory, but an agent calling extra tools is still acceptable. `"subset"` is the inverse and is useful if you want to ensure that the agent did not call any tools beyond the expected ones.
+**Note:** If you would like to configure the way this evaluator checks for tool call equality, see [this section](#checking-tool-call-equality).
 #### Trajectory LLM-as-judge
 The LLM-as-judge trajectory evaluator that uses an LLM to evaluate the trajectory. Unlike the other trajectory evaluators, it doesn't require a reference trajectory,
@@ -514,7 +619,7 @@ console.log(res);
 }
 ```
-Note that though this evaluator takes the typical `inputs`, `outputs`, and `referenceOutputs` parameters, it internally combines `inputs` and `outputs` to form a `thread`. Therefore, if you want to customize the prompt, your prompt should also contain a `thread` input variable:
+Note that though this evaluator takes the typical `inputs`, `outputs`, and `reference_outputs` parameters, it internally combines `inputs` and `outputs` to form a `thread`. Therefore, if you want to customize the prompt, your prompt should also contain a `thread` input variable:
 ```ts
 const CUSTOM_PROMPT = `You are an expert data labeler.
@@ -546,18 +651,18 @@ const graphTrajectoryEvaluator = createGraphTrajectoryLLMAsJudge({
   model: "openai:o3-mini",
 })
 res = await graphTrajectoryEvaluator(
-  inputs=extractedTrajectory.inputs,
-  outputs=extractedTrajectory.outputs,
+  inputs: extractedTrajectory.inputs,
+  outputs: extractedTrajectory.outputs,
 )
 ```
-In order to format them properly into the prompt, `referenceOutputs` should be passed in as a `GraphTrajectory` object like `outputs`.
+In order to format them properly into the prompt, `reference_outputs` should be passed in as a `GraphTrajectory` object like `outputs`.
 Also note that like other LLM-as-judge evaluators, you can pass extra kwargs into the evaluator to format them into the prompt.
 #### Graph trajectory strict match
-The `graphTrajectoryStrictMatch` evaluator is a simple evaluator that checks if the steps in the provided graph trajectory match the reference trajectory exactly.
+The `graph_trajectory_strict_match` evaluator is a simple evaluator that checks if the steps in the provided graph trajectory match the reference trajectory exactly.
 ```ts
 import { tool } from "@langchain/core/tools";
@@ -626,23 +731,24 @@ console.log(result);
   'score': True,
 }
 ```
 ## LangSmith Integration
 For tracking experiments over time, you can log evaluator results to [LangSmith](https://smith.langchain.com/), a platform for building production-grade LLM applications that includes tracing, evaluation, and experimentation tools.
-LangSmith currently offers two ways to run evals. We'll give a quick example of how to run evals using both.
+LangSmith currently offers two ways to run evals: a [pytest](https://docs.smith.langchain.com/evaluation/how_to_guides/pytest) (Python) or [Vitest/Jest](https://docs.smith.langchain.com/evaluation/how_to_guides/vitest_jest) integration and the `evaluate` function. We'll give a quick example of how to run evals using both.
 ### Pytest or Vitest/Jest
-First, follow [these instructions](https://docs.smith.langchain.com/evaluation/how_to_guides/vitest_jest) to set up LangSmith's Vitest/Jest runner,
+First, follow [these instructions](https://docs.smith.langchain.com/evaluation/how_to_guides/pytest) to set up LangSmith's pytest runner, or these to set up [Vitest or Jest](https://docs.smith.langchain.com/evaluation/how_to_guides/vitest_jest),
 setting appropriate environment variables:
 ```bash
 export LANGSMITH_API_KEY="your_langsmith_api_key"
 export LANGSMITH_TRACING="true"
 ```
 Then, set up a file named `test_trajectory.eval.ts` with the following contents:
 ```ts
@@ -717,7 +823,6 @@ Now, run the eval with your runner of choice:
 vitest run test_trajectory.eval.ts
 ```
 Feedback from the prebuilt evaluator will be automatically logged in LangSmith as a table of results like this in your terminal:
 ![Terminal results](/static/img/pytest_output.png)
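
The "Checking tool call equality" section added to the README above says an override value can be `"exact"`, `"ignore"`, a list of fields that must match, or a comparator function. The following is a minimal sketch of how such a rule could be resolved for one pair of tool calls. The helper names `deepEqual` and `toolArgsMatch` are hypothetical, and the logic is an assumption based on the README text, not the implementation in `utils.js` (which this excerpt does not show).

```ts
// Hypothetical sketch: these types and helpers are NOT the agentevals internals.
type ToolArgs = Record<string, unknown>;
type ToolArgsMatchMode = "exact" | "ignore";
type ToolArgsMatcher = (a: ToolArgs, b: ToolArgs) => boolean;
type ToolArgsMatchOverrides = Record<
  string,
  ToolArgsMatchMode | string[] | ToolArgsMatcher
>;

// Structural equality for JSON-like values (objects, arrays, primitives).
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const aObj = a as Record<string, unknown>;
  const bObj = b as Record<string, unknown>;
  const aKeys = Object.keys(aObj);
  const bKeys = Object.keys(bObj);
  if (aKeys.length !== bKeys.length) return false;
  return aKeys.every((k) => k in bObj && deepEqual(aObj[k], bObj[k]));
}

// Picks the per-tool override if one exists, otherwise the global mode.
function toolArgsMatch(
  toolName: string,
  outputArgs: ToolArgs,
  referenceArgs: ToolArgs,
  mode: ToolArgsMatchMode = "exact",
  overrides: ToolArgsMatchOverrides = {}
): boolean {
  const rule = Object.prototype.hasOwnProperty.call(overrides, toolName)
    ? overrides[toolName]
    : mode;
  // Comparator function: delegate the decision entirely.
  if (typeof rule === "function") return rule(outputArgs, referenceArgs);
  // List of fields: only those fields have to be equal.
  if (Array.isArray(rule)) {
    return rule.every((field) => deepEqual(outputArgs[field], referenceArgs[field]));
  }
  if (rule === "ignore") return true;
  // "exact": every argument has to be equal.
  return deepEqual(outputArgs, referenceArgs);
}

const sketchOverrides: ToolArgsMatchOverrides = {
  get_weather: (x, y) =>
    String(x.city).toLowerCase() === String(y.city).toLowerCase(),
  search: ["query"], // hypothetical tool name, for illustration only
};

const caseInsensitiveCity = toolArgsMatch(
  "get_weather",
  { city: "san francisco" },
  { city: "San Francisco" },
  "exact",
  sketchOverrides
);
```

In this sketch an override applies only to the tool it names, so any other tool still falls back to the global mode. That matches the README's stated goal of looser equality "for only specific tool calls".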

package/dist/index.cjs CHANGED

@@ -14,7 +14,7 @@ var __exportStar = (this && this.__exportStar) || function(m, exports) {
     for (var p in m) if (p !== "default" && !Object.prototype.hasOwnProperty.call(exports, p)) __createBinding(exports, m, p);
 };
 Object.defineProperty(exports, "__esModule", { value: true });
-exports.GRAPH_TRAJECTORY_ACCURACY_PROMPT = exports.createGraphTrajectoryLLMAsJudge = exports.TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE = exports.TRAJECTORY_ACCURACY_PROMPT = exports.createTrajectoryLLMAsJudge = exports.trajectoryUnorderedMatch = exports.trajectorySuperset = exports.trajectorySubset = exports.trajectoryStrictMatch = void 0;
+exports.GRAPH_TRAJECTORY_ACCURACY_PROMPT = exports.createGraphTrajectoryLLMAsJudge = exports.TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE = exports.TRAJECTORY_ACCURACY_PROMPT = exports.createTrajectoryLLMAsJudge = exports.createTrajectoryMatchEvaluator = exports.trajectoryUnorderedMatch = exports.trajectorySuperset = exports.trajectorySubset = exports.trajectoryStrictMatch = void 0;
 var strict_js_1 = require("./trajectory/strict.cjs");
 Object.defineProperty(exports, "trajectoryStrictMatch", { enumerable: true, get: function () { return strict_js_1.trajectoryStrictMatch; } });
 var subset_js_1 = require("./trajectory/subset.cjs");
@@ -23,6 +23,8 @@ var superset_js_1 = require("./trajectory/superset.cjs");
 Object.defineProperty(exports, "trajectorySuperset", { enumerable: true, get: function () { return superset_js_1.trajectorySuperset; } });
 var unordered_js_1 = require("./trajectory/unordered.cjs");
 Object.defineProperty(exports, "trajectoryUnorderedMatch", { enumerable: true, get: function () { return unordered_js_1.trajectoryUnorderedMatch; } });
+var match_js_1 = require("./trajectory/match.cjs");
+Object.defineProperty(exports, "createTrajectoryMatchEvaluator", { enumerable: true, get: function () { return match_js_1.createTrajectoryMatchEvaluator; } });
 var llm_js_1 = require("./trajectory/llm.cjs");
 Object.defineProperty(exports, "createTrajectoryLLMAsJudge", { enumerable: true, get: function () { return llm_js_1.createTrajectoryLLMAsJudge; } });
 Object.defineProperty(exports, "TRAJECTORY_ACCURACY_PROMPT", { enumerable: true, get: function () { return llm_js_1.TRAJECTORY_ACCURACY_PROMPT; } });

package/dist/index.d.ts CHANGED

@@ -2,6 +2,7 @@ export { trajectoryStrictMatch } from "./trajectory/strict.js";
 export { trajectorySubset } from "./trajectory/subset.js";
 export { trajectorySuperset } from "./trajectory/superset.js";
 export { trajectoryUnorderedMatch } from "./trajectory/unordered.js";
+export { createTrajectoryMatchEvaluator, type TrajectoryMatchMode, } from "./trajectory/match.js";
 export { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT, TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE, } from "./trajectory/llm.js";
 export { createGraphTrajectoryLLMAsJudge, GRAPH_TRAJECTORY_ACCURACY_PROMPT, } from "./graph_trajectory/llm.js";
 export * from "./types.js";

package/dist/index.js CHANGED

@@ -2,6 +2,7 @@ export { trajectoryStrictMatch } from "./trajectory/strict.js";
 export { trajectorySubset } from "./trajectory/subset.js";
 export { trajectorySuperset } from "./trajectory/superset.js";
 export { trajectoryUnorderedMatch } from "./trajectory/unordered.js";
+export { createTrajectoryMatchEvaluator, } from "./trajectory/match.js";
 export { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT, TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE, } from "./trajectory/llm.js";
 export { createGraphTrajectoryLLMAsJudge, GRAPH_TRAJECTORY_ACCURACY_PROMPT, } from "./graph_trajectory/llm.js";
 export * from "./types.js";

package/dist/trajectory/match.cjs ADDED

@@ -0,0 +1,84 @@
+"use strict";
+Object.defineProperty(exports, "__esModule", { value: true });
+exports.createTrajectoryMatchEvaluator = void 0;
+const utils_js_1 = require("../utils.cjs");
+const strict_js_1 = require("./strict.cjs");
+const unordered_js_1 = require("./unordered.cjs");
+const subset_js_1 = require("./subset.cjs");
+const superset_js_1 = require("./superset.cjs");
+/**
+ * Creates an evaluator that compares trajectories between model outputs and reference outputs.
+ *
+ * @param options - The configuration options
+ * @param options.trajectoryMatchMode - The mode for matching trajectories:
+ *   - `"strict"`: Requires exact match in order and content
+ *   - `"unordered"`: Allows matching in any order
+ *   - `"subset"`: Accepts if output trajectory is a subset of reference
+ *   - `"superset"`: Accepts if output trajectory is a superset of reference
+ * @param options.toolArgsMatchMode - Mode for matching tool arguments ("exact" by default, can be "ignore")
+ * @param options.toolArgsMatchOverrides - Object containing custom overrides for tool argument matching.
+ *   Each key should be a tool name, and each value should be either a match mode or a matcher function.
+ *   Matchers should be a function that takes two sets of tool call args and returns whether they are equal.
+ *
+ * @returns An async function that evaluates trajectory matches between outputs and references.
+ *   The returned evaluator accepts:
+ *   - outputs: List of messages or dict representing the model output trajectory
+ *   - referenceOutputs: List of messages or dict representing the reference trajectory
+ *   - Additional arguments passed to the underlying evaluator
+ *
+ * @example
+ * ```typescript
+ * const matcher = (
+ *   outputToolCallArgs: Record<string, any>,
+ *   referenceToolCallArgs: Record<string, any>
+ * ): boolean => {
+ *   const outputArgs = (outputToolCallArgs.query ?? "").toLowerCase();
+ *   const referenceArgs = (referenceToolCallArgs.query ?? "").toLowerCase();
+ *   return outputArgs === referenceArgs;
+ * };
+ *
+ * const evaluator = createTrajectoryMatchEvaluator({
+ *   trajectoryMatchMode: "strict",
+ *   toolArgsMatchMode: "exact",
+ *   toolArgsMatchOverrides: {
+ *     myToolName: matcher,
+ *   },
+ * });
+ *
+ * const result = await evaluator({
+ *   outputs: [...],
+ *   referenceOutputs: [...],
+ * });
+ * ```
+ */
+function createTrajectoryMatchEvaluator({ trajectoryMatchMode = "strict", toolArgsMatchMode = "exact", toolArgsMatchOverrides, }) {
+    let scorer;
+    switch (trajectoryMatchMode) {
+        case "strict":
+            scorer = strict_js_1._scorer;
+            break;
+        case "unordered":
+            scorer = unordered_js_1._scorer;
+            break;
+        case "subset":
+            scorer = subset_js_1._scorer;
+            break;
+        case "superset":
+            scorer = superset_js_1._scorer;
+            break;
+        default:
+            throw new Error(`Invalid trajectory match type: ${trajectoryMatchMode}`);
+    }
+    return async function _wrappedEvaluator({ outputs, referenceOutputs, ...extra }) {
+        const normalizedOutputs = (0, utils_js_1._normalizeToOpenAIMessagesList)(outputs);
+        const normalizedReferenceOutputs = (0, utils_js_1._normalizeToOpenAIMessagesList)(referenceOutputs);
+        return (0, utils_js_1._runEvaluator)(`trajectory_${trajectoryMatchMode}_match`, scorer, `trajectory_${trajectoryMatchMode}_match`, {
+            outputs: normalizedOutputs,
+            referenceOutputs: normalizedReferenceOutputs,
+            toolArgsMatchMode,
+            toolArgsMatchOverrides,
+            ...extra,
+        });
+    };
+}
+exports.createTrajectoryMatchEvaluator = createTrajectoryMatchEvaluator;
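
The wrapper above passes the template string `trajectory_${trajectoryMatchMode}_match` to `_runEvaluator`, and its `default` branch throws on an unknown mode. The README output blocks in this diff show the same pattern as `'key': 'trajectory_strict_match'` and `'key': 'trajectory_superset_match'`. The following is a minimal sketch of that naming rule. `feedbackKeyFor` and `TRAJECTORY_MATCH_MODES` are hypothetical names that the package does not export.

```ts
// Hypothetical helper, not exported by agentevals; it only restates the
// template string and the default branch of the switch shown above.
const TRAJECTORY_MATCH_MODES = ["strict", "unordered", "subset", "superset"];

function feedbackKeyFor(mode: string): string {
  if (TRAJECTORY_MATCH_MODES.indexOf(mode) === -1) {
    throw new Error(`Invalid trajectory match type: ${mode}`);
  }
  return `trajectory_${mode}_match`;
}

const supersetKey = feedbackKeyFor("superset"); // "trajectory_superset_match"
```

Code that filters feedback by key may need updating, since the 0.0.3 README examples showed `trajectory_accuracy` and `trajectory_subset` for the standalone evaluators.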

package/dist/trajectory/match.d.ts ADDED

@@ -0,0 +1,61 @@
+import { BaseMessage } from "@langchain/core/messages";
+import { ChatCompletionMessage, ToolArgsMatchMode, ToolArgsMatchOverrides } from "../types.js";
+export type TrajectoryMatchMode = "strict" | "unordered" | "subset" | "superset";
+/**
+ * Creates an evaluator that compares trajectories between model outputs and reference outputs.
+ *
+ * @param options - The configuration options
+ * @param options.trajectoryMatchMode - The mode for matching trajectories:
+ *   - `"strict"`: Requires exact match in order and content
+ *   - `"unordered"`: Allows matching in any order
+ *   - `"subset"`: Accepts if output trajectory is a subset of reference
+ *   - `"superset"`: Accepts if output trajectory is a superset of reference
+ * @param options.toolArgsMatchMode - Mode for matching tool arguments ("exact" by default, can be "ignore")
+ * @param options.toolArgsMatchOverrides - Object containing custom overrides for tool argument matching.
+ *   Each key should be a tool name, and each value should be either a match mode or a matcher function.
+ *   Matchers should be a function that takes two sets of tool call args and returns whether they are equal.
+ *
+ * @returns An async function that evaluates trajectory matches between outputs and references.
+ *   The returned evaluator accepts:
+ *   - outputs: List of messages or dict representing the model output trajectory
+ *   - referenceOutputs: List of messages or dict representing the reference trajectory
+ *   - Additional arguments passed to the underlying evaluator
+ *
+ * @example
+ * ```typescript
+ * const matcher = (
+ *   outputToolCallArgs: Record<string, any>,
+ *   referenceToolCallArgs: Record<string, any>
+ * ): boolean => {
+ *   const outputArgs = (outputToolCallArgs.query ?? "").toLowerCase();
+ *   const referenceArgs = (referenceToolCallArgs.query ?? "").toLowerCase();
+ *   return outputArgs === referenceArgs;
+ * };
+ *
+ * const evaluator = createTrajectoryMatchEvaluator({
+ *   trajectoryMatchMode: "strict",
+ *   toolArgsMatchMode: "exact",
+ *   toolArgsMatchOverrides: {
+ *     myToolName: matcher,
+ *   },
+ * });
+ *
+ * const result = await evaluator({
+ *   outputs: [...],
+ *   referenceOutputs: [...],
+ * });
+ * ```
+ */
+export declare function createTrajectoryMatchEvaluator({ trajectoryMatchMode, toolArgsMatchMode, toolArgsMatchOverrides, }: {
+    trajectoryMatchMode?: TrajectoryMatchMode;
+    toolArgsMatchMode?: ToolArgsMatchMode;
+    toolArgsMatchOverrides?: ToolArgsMatchOverrides;
+}): ({ outputs, referenceOutputs, ...extra }: {
+    [key: string]: unknown;
+    outputs: ChatCompletionMessage[] | BaseMessage[] | {
+        messages: (BaseMessage | ChatCompletionMessage)[];
+    };
+    referenceOutputs: ChatCompletionMessage[] | BaseMessage[] | {
+        messages: (BaseMessage | ChatCompletionMessage)[];
+    };
+}) => Promise<import("langsmith/vitest").SimpleEvaluationResult>;
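
The declaration above lets `outputs` and `referenceOutputs` be either a bare array of messages or an object with a `messages` array. The following is a minimal sketch of how the two shapes reduce to one list. `SketchMessage` is a simplified stand-in for `ChatCompletionMessage`/`BaseMessage`, and `toMessageList` is a hypothetical helper. The real `_normalizeToOpenAIMessagesList` is not part of this excerpt and likely does more, such as converting `BaseMessage` instances.

```ts
// Simplified stand-in for ChatCompletionMessage / BaseMessage (assumption).
type SketchMessage = { role: string; content?: string };
type TrajectoryInput = SketchMessage[] | { messages: SketchMessage[] };

// Hypothetical helper: unwraps the `{ messages }` form into a plain array.
function toMessageList(input: TrajectoryInput): SketchMessage[] {
  return Array.isArray(input) ? input : input.messages;
}

const asArray = toMessageList([{ role: "user", content: "hi" }]);
const asObject = toMessageList({ messages: [{ role: "user", content: "hi" }] });
```

Both calls yield a one-element list, so an agent's state object that carries a `messages` field could be passed without unpacking it first.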

package/dist/trajectory/match.js ADDED

@@ -0,0 +1,80 @@
+import { _normalizeToOpenAIMessagesList, _runEvaluator } from "../utils.js";
+import { _scorer as trajectoryStrictScorer } from "./strict.js";
+import { _scorer as trajectoryUnorderedScorer } from "./unordered.js";
+import { _scorer as trajectorySubsetScorer } from "./subset.js";
+import { _scorer as trajectorySuperstScorer } from "./superset.js";
+/**
+ * Creates an evaluator that compares trajectories between model outputs and reference outputs.
+ *
+ * @param options - The configuration options
+ * @param options.trajectoryMatchMode - The mode for matching trajectories:
+ *   - `"strict"`: Requires exact match in order and content
+ *   - `"unordered"`: Allows matching in any order
+ *   - `"subset"`: Accepts if output trajectory is a subset of reference
+ *   - `"superset"`: Accepts if output trajectory is a superset of reference
+ * @param options.toolArgsMatchMode - Mode for matching tool arguments ("exact" by default, can be "ignore")
+ * @param options.toolArgsMatchOverrides - Object containing custom overrides for tool argument matching.
+ *   Each key should be a tool name, and each value should be either a match mode or a matcher function.
+ *   Matchers should be a function that takes two sets of tool call args and returns whether they are equal.
+ *
+ * @returns An async function that evaluates trajectory matches between outputs and references.
+ *   The returned evaluator accepts:
+ *   - outputs: List of messages or dict representing the model output trajectory
+ *   - referenceOutputs: List of messages or dict representing the reference trajectory
+ *   - Additional arguments passed to the underlying evaluator
+ *
+ * @example
+ * ```typescript
+ * const matcher = (
+ *   outputToolCallArgs: Record<string, any>,
+ *   referenceToolCallArgs: Record<string, any>
+ * ): boolean => {
+ *   const outputArgs = (outputToolCallArgs.query ?? "").toLowerCase();
+ *   const referenceArgs = (referenceToolCallArgs.query ?? "").toLowerCase();
+ *   return outputArgs === referenceArgs;
+ * };
+ *
+ * const evaluator = createTrajectoryMatchEvaluator({
+ *   trajectoryMatchMode: "strict",
+ *   toolArgsMatchMode: "exact",
+ *   toolArgsMatchOverrides: {
+ *     myToolName: matcher,
+ *   },
+ * });
+ *
+ * const result = await evaluator({
+ *   outputs: [...],
+ *   referenceOutputs: [...],
+ * });
+ * ```
+ */
+export function createTrajectoryMatchEvaluator({ trajectoryMatchMode = "strict", toolArgsMatchMode = "exact", toolArgsMatchOverrides, }) {
+    let scorer;
+    switch (trajectoryMatchMode) {
+        case "strict":
+            scorer = trajectoryStrictScorer;
+            break;
+        case "unordered":
+            scorer = trajectoryUnorderedScorer;
+            break;
+        case "subset":
+            scorer = trajectorySubsetScorer;
+            break;
+        case "superset":
+            scorer = trajectorySuperstScorer;
+            break;
+        default:
+            throw new Error(`Invalid trajectory match type: ${trajectoryMatchMode}`);
+    }
+    return async function _wrappedEvaluator({ outputs, referenceOutputs, ...extra }) {
+        const normalizedOutputs = _normalizeToOpenAIMessagesList(outputs);
+        const normalizedReferenceOutputs = _normalizeToOpenAIMessagesList(referenceOutputs);
+        return _runEvaluator(`trajectory_${trajectoryMatchMode}_match`, scorer, `trajectory_${trajectoryMatchMode}_match`, {
+            outputs: normalizedOutputs,
+            referenceOutputs: normalizedReferenceOutputs,
+            toolArgsMatchMode,
+            toolArgsMatchOverrides,
+            ...extra,
+        });
+    };
+}
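
The four modes selected by the `switch` above are described in the README in terms of which tool calls appear in the output and reference trajectories. The sketch below illustrates those descriptions on tool names only. It ignores tool arguments and message content, and it treats repeated calls as a multiset, which is an assumption. The actual `_scorer` functions are not shown in this excerpt, and every name in the block is hypothetical.

```ts
// Simplified message shape (assumption), modelled on the README examples.
type TrajToolCall = { function: { name: string; arguments: string } };
type TrajMessage = { role: string; content?: string; tool_calls?: TrajToolCall[] };
type MatchMode = "strict" | "unordered" | "subset" | "superset";

// Collects tool names in the order they are called.
function toolNames(messages: TrajMessage[]): string[] {
  const names: string[] = [];
  for (const message of messages) {
    for (const call of message.tool_calls ?? []) names.push(call.function.name);
  }
  return names;
}

// True if every name in `a` can be paired with a distinct occurrence in `b`.
function isSubMultiset(a: string[], b: string[]): boolean {
  const remaining = new Map<string, number>();
  for (const n of b) remaining.set(n, (remaining.get(n) ?? 0) + 1);
  for (const n of a) {
    const left = remaining.get(n) ?? 0;
    if (left === 0) return false;
    remaining.set(n, left - 1);
  }
  return true;
}

function toolNamesMatch(mode: MatchMode, outputs: TrajMessage[], reference: TrajMessage[]): boolean {
  const out = toolNames(outputs);
  const ref = toolNames(reference);
  switch (mode) {
    case "strict": // same calls in the same order
      return out.length === ref.length && out.every((n, i) => n === ref[i]);
    case "unordered": // same calls, any order
      return out.length === ref.length && isSubMultiset(out, ref);
    case "subset": // output calls nothing beyond the reference
      return isSubMultiset(out, ref);
    case "superset": // output calls at least everything in the reference
      return isSubMultiset(ref, out);
    default:
      throw new Error(`Invalid trajectory match type: ${mode}`);
  }
}

const weatherArgs = JSON.stringify({ city: "San Francisco" });
const sketchOutputs: TrajMessage[] = [
  { role: "user", content: "What is the weather in SF?" },
  {
    role: "assistant",
    tool_calls: [
      { function: { name: "get_weather", arguments: weatherArgs } },
      { function: { name: "accuweather_forecast", arguments: weatherArgs } },
    ],
  },
];
const sketchReference: TrajMessage[] = [
  { role: "user", content: "What is the weather in San Francisco?" },
  { role: "assistant", tool_calls: [{ function: { name: "get_weather", arguments: weatherArgs } }] },
];

const strictResult = toolNamesMatch("strict", sketchOutputs, sketchReference); // false
const supersetResult = toolNamesMatch("superset", sketchOutputs, sketchReference); // true
```

The extra `accuweather_forecast` call makes the strict comparison fail while the superset comparison passes, the same scores the README examples show for an extra `accuweather_forecast` call.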