npm - agentevals - Versions diffs - 0.0.4 → 0.0.6 - Mend

agentevals 0.0.4 → 0.0.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (21)

package/README.md +278 -156
package/dist/graph_trajectory/llm.d.ts +1 -1
package/dist/graph_trajectory/strict.d.ts +1 -1
package/dist/graph_trajectory/utils.cjs +8 -1
package/dist/graph_trajectory/utils.d.ts +2 -2
package/dist/graph_trajectory/utils.js +8 -1
package/dist/trajectory/llm.d.ts +5 -5
package/dist/trajectory/match.d.ts +6 -6
package/dist/trajectory/strict.cjs +6 -3
package/dist/trajectory/strict.d.ts +7 -11
package/dist/trajectory/strict.js +6 -3
package/dist/trajectory/subset.d.ts +5 -5
package/dist/trajectory/superset.d.ts +5 -5
package/dist/trajectory/unordered.d.ts +5 -5
package/dist/trajectory/utils.cjs +14 -0
package/dist/trajectory/utils.js +14 -0
package/dist/types.d.ts +18 -3
package/dist/utils.cjs +21 -2
package/dist/utils.d.ts +4 -3
package/dist/utils.js +19 -1
package/package.json +10 -10

package/README.md CHANGED

@@ -9,7 +9,7 @@ It is intended to provide a good conceptual starting point for your agent's eval
 If you are looking for more general evaluation tools, please check out the companion package [`openevals`](https://github.com/langchain-ai/openevals).
-## Quickstart
+# Quickstart
 To get started, install `agentevals`:
@@ -28,6 +28,7 @@ Once you've done this, you can run your first trajectory evaluator. We represent
 ```ts
 import {
   createTrajectoryLLMAsJudge,
+  type FlexibleChatCompletionMessage,
   TRAJECTORY_ACCURACY_PROMPT,
 } from "agentevals";
@@ -55,7 +56,7 @@ const outputs = [
     role: "assistant",
     content: "The weather in SF is 80 degrees and sunny.",
   },
-];
+] satisfies FlexibleChatCompletionMessage[];
 const evalResult = await trajectoryEvaluator({
   outputs,
@@ -72,25 +73,29 @@ console.log(evalResult);
 }
 ```
-You can see that despite the small difference in the final response and tool calls, the evaluator still returns a score of `true` since the overall trajectory is the same between the output and reference!
+You can see that the evaluator returns a score of `true` since the overall trajectory is a reasonable path for the agent to take to answer the user's question.
+For more details on this evaluator, including how to customize it, see the section on [trajectory LLM-as-judge](#trajectory-llm-as-judge).
-## Table of Contents
+# Table of Contents
 - [Installation](#installation)
 - [Evaluators](#evaluators)
-  - [Agent Trajectory](#agent-trajectory)
+  - [Agent Trajectory Match](#agent-trajectory-match)
     - [Strict match](#strict-match)
     - [Unordered match](#unordered-match)
     - [Subset/superset match](#subset-and-superset-match)
-    - [Trajectory LLM-as-judge](#trajectory-llm-as-judge)
+    - [Tool args match modes](#tool-args-match-modes)
+  - [Trajectory LLM-as-judge](#trajectory-llm-as-judge)
   - [Graph Trajectory](#graph-trajectory)
     - [Graph trajectory LLM-as-judge](#graph-trajectory-llm-as-judge)
     - [Graph trajectory strict match](#graph-trajectory-strict-match)
+- [Python Async Support](#python-async-support)
 - [LangSmith Integration](#langsmith-integration)
   - [Pytest or Vitest/Jest](#pytest-or-vitestjest)
   - [Evaluate](#evaluate)
-## Installation
+# Installation
 You can install `agentevals` like this:
@@ -107,124 +112,65 @@ npm install openai
 It is also helpful to be familiar with some [evaluation concepts](https://docs.smith.langchain.com/evaluation/concepts) and
 LangSmith's pytest integration for running evals, which is documented [here](https://docs.smith.langchain.com/evaluation/how_to_guides/pytest).
-## Evaluators
+# Evaluators
-### Agent trajectory
+## Agent trajectory match
-Agent trajectory evaluators are used to judge the trajectory of an agent's execution either against an expected trajectory or using an LLM.
+Agent trajectory match evaluators are used to judge the trajectory of an agent's execution either against an expected trajectory or using an LLM.
 These evaluators expect you to format your agent's trajectory as a list of OpenAI format dicts or as a list of LangChain `BaseMessage` classes, and handle message formatting
 under the hood.
-AgentEvals offers the `create_trajectory_match_evaluator`/`createTrajectoryMatchEvaluator` and `create_async_trajectory_match_evaluator` methods for this task.
-#### Checking tool call equality
-When checking equality between tool calls, these matchers will require that all tool call arguments are the same. You can configure this behavior to ignore tool call arguments by setting `tool_args_match_mode="ignore"` (Python) or `toolArgsMatchMode: "ignore"` (JS), or by only checking specific properties within the call using the `tool_args_match_overrides`/`toolArgsMatchOverrides` param.
-`tool_args_match_overrides`/`toolArgsMatchOverrides` takes a dictionary whose keys are tool names and whose values are either `"exact"`, `"ignore"`, a list of fields within the tool call that must match exactly, or a comparator function that takes two arguments and returns whether they are equal:
-```python
-ToolArgsMatchMode = Literal["exact", "ignore"]
-ToolArgsMatchOverrides = dict[str, Union[ToolArgsMatchMode, list[str],  Callable[[dict, dict], bool]]]
-```
+AgentEvals offers the `create_trajectory_match_evaluator`/`createTrajectoryMatchEvaluator` and `create_async_trajectory_match_evaluator` methods for this task. You can customize their behavior in a few ways:
-Here's an example that allows case insensitivity for the arguments to a tool named `get_weather`:
+- Setting `trajectory_match_mode`/`trajectoryMatchMode` to [`strict`](#strict-match), [`unordered`](#unordered-match), [`subset`](#subset-and-superset-match), or [`superset`](#subset-and-superset-match) to provide the general strategy the evaluator will use to compare trajectories
+- Setting [`tool_args_match_mode`](#tool-args-match-modes) and/or [`tool_args_match_overrides`](#tool-args-match-modes) to customize how the evaluator considers equality between tool calls in the actual trajectory vs. the reference. By default, only tool calls with the same arguments to the same tool are considered equal.
-```ts
-import { createTrajectoryMatchEvaluator } from "agentevals";
-const outputs = [
-    { role: "user", content: "What is the weather in SF?" },
-    {
-      role: "assistant",
-      tool_calls: [{
-        function: {
-          name: "get_weather",
-          arguments: JSON.stringify({ city: "san francisco" })
-        },
-      }]
-    },
-    { role: "tool", content: "It's 80 degrees and sunny in SF." },
-    { role: "assistant", content: "The weather in SF is 80 degrees and sunny." },
-];
-const referenceOutputs = [
-    { role: "user", content: "What is the weather in San Francisco?" },
-    {
-      role: "assistant",
-      tool_calls: [{
-        function: {
-          name: "get_weather",
-          arguments: JSON.stringify({ city: "San Francisco" })
-        }
-      }]
-    },
-    { role: "tool", content: "It's 80 degrees and sunny in San Francisco." },
-];
-const evaluator = createTrajectoryMatchEvaluator({
-  trajectoryMatchMode: "strict",
-  toolArgsMatchMode: "exact",  // Default value
-  toolArgsMatchOverrides: {
-    get_weather: (x, y) => {
-      return typeof x.city === "string" &&
-        typeof y.city === "string" &&
-        x.city.toLowerCase() === y.city.toLowerCase();
-    },
-  }
-});
-const result = await evaluator({
-  outputs,
-  referenceOutputs,
-});
-console.log(result);
-```
-```
-{
-  'key': 'trajectory_strict_match',
-  'score': true,
-}
-```
-This flexibility allows you to handle cases where you want looser equality for LLM generated arguments (`"san francisco"` to equal `"San Francisco"`) for only specific tool calls.
-#### Strict match
+### Strict match
 The `"strict"` `trajectory_match_mode` compares two trajectories and ensures that they contain the same messages
 in the same order with the same tool calls. Note that it does allow for differences in message content:
 ```ts
-import { createTrajectoryMatchEvaluator } from "agentevals";
+import {
+  createTrajectoryMatchEvaluator,
+  type FlexibleChatCompletionMessage,
+} from "agentevals";
 const outputs = [
-    { role: "user", content: "What is the weather in SF?" },
-    {
-      role: "assistant",
-      tool_calls: [{
-        function: {
-          name: "get_weather",
-          arguments: JSON.stringify({ city: "San Francisco" })
-        },
-      }, {
-        function: {
-          name: "accuweather_forecast",
-          arguments: JSON.stringify({"city": "San Francisco"}),
-        },
-      }]
-    },
-    { role: "tool", content: "It's 80 degrees and sunny in SF." },
-    { role: "assistant", content: "The weather in SF is 80 degrees and sunny." },
-];
+  { role: "user", content: "What is the weather in SF?" },
+  {
+    role: "assistant",
+    content: "",
+    tool_calls: [{
+      function: {
+        name: "get_weather",
+        arguments: JSON.stringify({ city: "San Francisco" })
+      },
+    }, {
+      function: {
+        name: "accuweather_forecast",
+        arguments: JSON.stringify({"city": "San Francisco"}),
+      },
+    }]
+  },
+  { role: "tool", content: "It's 80 degrees and sunny in SF." },
+  { role: "assistant", content: "The weather in SF is 80 degrees and sunny." },
+] satisfies FlexibleChatCompletionMessage[];
 const referenceOutputs = [
-    { role: "user", content: "What is the weather in San Francisco?" },
-    { role: "assistant", tool_calls: [{ function: { name: "get_weather", arguments: JSON.stringify({ city: "San Francisco" }) } }] },
-    { role: "tool", content: "It's 80 degrees and sunny in San Francisco." },
-];
+  { role: "user", content: "What is the weather in San Francisco?" },
+  {
+    role: "assistant",
+    content: "",
+    tool_calls: [{
+      function: {
+        name: "get_weather",
+        arguments: JSON.stringify({ city: "San Francisco" })
+      }
+    }]
+  },
+  { role: "tool", content: "It's 80 degrees and sunny in San Francisco." },
+] satisfies FlexibleChatCompletionMessage[];
 const evaluator = createTrajectoryMatchEvaluator({
   trajectoryMatchMode: "strict",
@@ -247,19 +193,23 @@ console.log(result);
 `"strict"` is useful if you want to ensure that tools are always called in the same order for a given query (e.g. a company policy lookup tool before a tool that requests vacation time for an employee).
-**Note:** If you would like to configure the way this evaluator checks for tool call equality, see [this section](#checking-tool-call-equality).
+**Note:** If you would like to configure the way this evaluator checks for tool call equality, see [this section](#tool-args-match-modes).
-#### Unordered match
+### Unordered match
 The `"unordered"` `trajectory_match_mode` compares two trajectories and ensures that they contain the same tool calls in any order. This is useful if you want to allow flexibility in how an agent obtains the proper information, but still do care that all information was retrieved.
 ```ts
-import { createTrajectoryMatchEvaluator } from "agentevals";
+import {
+  createTrajectoryMatchEvaluator,
+  type FlexibleChatCompletionMessage,
+} from "agentevals";
 const outputs = [
   { role: "user", content: "What is the weather in SF and is there anything fun happening?" },
   {
     role: "assistant",
+    content: "",
     tool_calls: [{
       function: {
         name: "get_weather",
@@ -270,6 +220,7 @@ const outputs = [
   { role: "tool", content: "It's 80 degrees and sunny in SF." },
   {
     role: "assistant",
+    content: "",
     tool_calls: [{
       function: {
         name: "get_fun_activities",
@@ -279,12 +230,13 @@ const outputs = [
   },
   { role: "tool", content: "Nothing fun is happening, you should stay indoors and read!" },
   { role: "assistant", content: "The weather in SF is 80 degrees and sunny, but there is nothing fun happening." },
-];
+] satisfies FlexibleChatCompletionMessage[];
 const referenceOutputs = [
   { role: "user", content: "What is the weather in SF and is there anything fun happening?" },
   {
     role: "assistant",
+    content: "",
     tool_calls: [
       {
         function: {
@@ -303,7 +255,7 @@ const referenceOutputs = [
   { role: "tool", content: "Nothing fun is happening, you should stay indoors and read!" },
   { role: "tool", content: "It's 80 degrees and sunny in SF." },
   { role: "assistant", content: "In SF, it's 80˚ and sunny, but there is nothing fun happening." },
-];
+] satisfies FlexibleChatCompletionMessage[];
 const evaluator = createTrajectoryMatchEvaluator({
   trajectoryMatchMode: "unordered",
@@ -326,19 +278,23 @@ console.log(result)
 `"unordered"` is useful if you want to ensure that specific tools are called at some point in the trajectory, but you don't necessarily need them to be in message order (e.g. the agent called a company policy retrieval tool at an arbitrary point in an interaction before authorizing spend for a pizza party).
-**Note:** If you would like to configure the way this evaluator checks for tool call equality, see [this section](#checking-tool-call-equality).
+**Note:** If you would like to configure the way this evaluator checks for tool call equality, see [this section](#tool-args-match-modes).
-#### Subset and superset match
+### Subset and superset match
 The `"subset"` and `"superset"` modes match partial trajectories (ensuring that a trajectory contains a subset/superset of tool calls contained in a reference trajectory).
 ```ts
-import { createTrajectoryMatchEvaluator } from "agentevals";
+import {
+  createTrajectoryMatchEvaluator,
+  type FlexibleChatCompletionMessage
+} from "agentevals";
 const outputs = [
   { role: "user", content: "What is the weather in SF and London?" },
   {
     role: "assistant",
+    content: "",
     tool_calls: [{
       function: {
         name: "get_weather",
@@ -354,12 +310,13 @@ const outputs = [
   { role: "tool", content: "It's 80 degrees and sunny in SF, and 90 degrees and rainy in London." },
   { role: "tool", content: "Unknown." },
   { role: "assistant", content: "The weather in SF is 80 degrees and sunny. In London, it's 90 degrees and rainy."},
-];
+] satisfies FlexibleChatCompletionMessage[];
 const referenceOutputs = [
   { role: "user", content: "What is the weather in SF and London?" },
   {
     role: "assistant",
+    content: "",
     tool_calls: [
       {
         function: {
@@ -371,7 +328,7 @@ const referenceOutputs = [
   },
   { role: "tool", content: "It's 80 degrees and sunny in San Francisco, and 90 degrees and rainy in London." },
   { role: "assistant", content: "The weather in SF is 80˚ and sunny. In London, it's 90˚ and rainy." },
-];
+] satisfies FlexibleChatCompletionMessage[];
 const evaluator = createTrajectoryMatchEvaluator({
   trajectoryMatchMode: "superset", // or "subset"
@@ -394,18 +351,148 @@ console.log(result)
 `"superset"` is useful if you want to ensure that some key tools were called at some point in the trajectory, but an agent calling extra tools is still acceptable. `"subset"` is the inverse and is useful if you want to ensure that the agent did not call any tools beyond the expected ones.
-**Note:** If you would like to configure the way this evaluator checks for tool call equality, see [this section](#checking-tool-call-equality).
+**Note:** If you would like to configure the way this evaluator checks for tool call equality, see [this section](#tool-args-match-modes).
+### Tool args match modes
+When checking equality between tool calls, the above evaluators will require that all tool call arguments are the exact same by default. You can configure this behavior in the following ways:
+- Treating any two tool calls for the same tool as equivalent by setting `tool_args_match_mode="ignore"` (Python) or `toolArgsMatchMode: "ignore"` (TypeScript)
+- Treating a tool call as equivalent if it contains a subset/superset of args compared to a reference tool call of the same name with `tool_args_match_mode="subset"/"superset"` (Python) or `toolArgsMatchMode: "subset"/"superset"` (TypeScript)
+- Setting custom matchers for all calls of a given tool using the `tool_args_match_overrides` (Python) or `toolArgsMatchOverrides` (TypeScript) param
+You can set both of these parameters at the same time. `tool_args_match_overrides` will take precedence over `tool_args_match_mode`.
+`tool_args_match_overrides`/`toolArgsMatchOverrides` takes a dictionary whose keys are tool names and whose values are either `"exact"`, `"ignore"`, a list of fields within the tool call that must match exactly, or a comparator function that takes two arguments and returns whether they are equal:
+```python
+ToolArgsMatchMode = Literal["exact", "ignore", "subset", "superset"]
-#### Trajectory LLM-as-judge
+ToolArgsMatchOverrides = dict[str, Union[ToolArgsMatchMode, list[str],  Callable[[dict, dict], bool]]]
+```
-The LLM-as-judge trajectory evaluator that uses an LLM to evaluate the trajectory. Unlike the other trajectory evaluators, it doesn't require a reference trajectory,
-and supports
-This allows for more flexibility in the trajectory comparison:
+Here's an example that allows case insensitivity for the arguments to a tool named `get_weather`:
+```ts
+import {
+  createTrajectoryMatchEvaluator,
+  type FlexibleChatCompletionMessage,
+} from "agentevals";
+const outputs = [
+  { role: "user", content: "What is the weather in SF?" },
+  {
+    role: "assistant",
+    content: "",
+    tool_calls: [{
+      function: {
+        name: "get_weather",
+        arguments: JSON.stringify({ city: "san francisco" })
+      },
+    }]
+  },
+  { role: "tool", content: "It's 80 degrees and sunny in SF." },
+  { role: "assistant", content: "The weather in SF is 80 degrees and sunny." },
+] satisfies FlexibleChatCompletionMessage[];
+const referenceOutputs = [
+  { role: "user", content: "What is the weather in San Francisco?" },
+  {
+    role: "assistant",
+    content: "",
+    tool_calls: [{
+      function: {
+        name: "get_weather",
+        arguments: JSON.stringify({ city: "San Francisco" })
+      }
+    }]
+  },
+  { role: "tool", content: "It's 80 degrees and sunny in San Francisco." },
+] satisfies FlexibleChatCompletionMessage[];
+const evaluator = createTrajectoryMatchEvaluator({
+  trajectoryMatchMode: "strict",
+  toolArgsMatchMode: "exact",  // Default value
+  toolArgsMatchOverrides: {
+    get_weather: (x, y) => {
+      return typeof x.city === "string" &&
+        typeof y.city === "string" &&
+        x.city.toLowerCase() === y.city.toLowerCase();
+    },
+  }
+});
+const result = await evaluator({
+  outputs,
+  referenceOutputs,
+});
+console.log(result);
+```
+```
+{
+  'key': 'trajectory_strict_match',
+  'score': true,
+}
+```
+This flexibility allows you to handle cases where you want looser equality for LLM generated arguments (`"san francisco"` to equal `"San Francisco"`) for only specific tool calls.
+## Trajectory LLM-as-judge
+The LLM-as-judge trajectory evaluator uses an LLM to evaluate the trajectory. Unlike the trajectory match evaluators, it doesn't require a reference trajectory. Here's an example:
 ```ts
 import {
   createTrajectoryLLMAsJudge,
-  TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE
+  TRAJECTORY_ACCURACY_PROMPT,
+  type FlexibleChatCompletionMessage,
+} from "agentevals";
+const evaluator = createTrajectoryLLMAsJudge({
+  prompt: TRAJECTORY_ACCURACY_PROMPT,
+  model: "openai:o3-mini",
+});
+const outputs = [
+  {role: "user", content: "What is the weather in SF?"},
+  {
+    role: "assistant",
+    content: "",
+    tool_calls: [
+      {
+        function: {
+          name: "get_weather",
+          arguments: JSON.stringify({ city: "SF" }),
+        }
+      }
+    ],
+  },
+  {role: "tool", content: "It's 80 degrees and sunny in SF."},
+  {role: "assistant", content: "The weather in SF is 80 degrees and sunny."},
+] satisfies FlexibleChatCompletionMessage[];
+const result = await evaluator({ outputs });
+console.log(result)
+```
+```
+{
+    'key': 'trajectory_accuracy',
+    'score': true,
+    'comment': 'The provided agent trajectory is reasonable...'
+}
+```
+If you have a reference trajectory, you can add an extra variable to your prompt and pass in the reference trajectory. Below, we use the prebuilt `TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE` prompt, which contains a `reference_outputs` variable:
+```ts
+import {
+  createTrajectoryLLMAsJudge,
+  TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
+  type FlexibleChatCompletionMessage,
 } from "agentevals";
 const evaluator = createTrajectoryLLMAsJudge({
@@ -417,6 +504,7 @@ const outputs = [
   {role: "user", content: "What is the weather in SF?"},
   {
     role: "assistant",
+    content: "",
     tool_calls: [
       {
         function: {
@@ -428,11 +516,13 @@ const outputs = [
   },
   {role: "tool", content: "It's 80 degrees and sunny in SF."},
   {role: "assistant", content: "The weather in SF is 80 degrees and sunny."},
-]
+] satisfies FlexibleChatCompletionMessage[];
 const referenceOutputs = [
   {role: "user", content: "What is the weather in SF?"},
   {
     role: "assistant",
+    content: "",
     tool_calls: [
       {
         function: {
@@ -444,7 +534,7 @@ const referenceOutputs = [
   },
   {role: "tool", content: "It's 80 degrees and sunny in San Francisco."},
   {role: "assistant", content: "The weather in SF is 80˚ and sunny."},
-]
+] satisfies FlexibleChatCompletionMessage[];
 const result = await evaluator({
   outputs,
@@ -484,7 +574,7 @@ const fewShotExamples = [
 See the [`openevals`](https://github.com/langchain-ai/openevals?tab=readme-ov-file#llm-as-judge) repo for a fully up to date list of parameters.
-### Graph trajectory
+## Graph trajectory
 For frameworks like [LangGraph](https://github.com/langchain-ai/langgraph) that model agents as graphs, it can be more convenient to represent trajectories in terms of nodes visited rather than messages. `agentevals` includes a category of evaluators called **graph trajectory** evaluators that are designed to work with this format, as well as convenient utilities for extracting trajectories from a LangGraph thread, including different conversation turns and interrupts.
@@ -509,7 +599,7 @@ const evaluator: ({ inputs, outputs, referenceOutputs, ...extra }: {
 Where `inputs` is a list of inputs (or a dict with a key named `"inputs"`) to the graph whose items each represent the start of a new invocation in a thread, `results` representing the final output from each turn in the thread, and `steps` representing the internal steps taken for each turn.
-#### Graph trajectory LLM-as-judge
+### Graph trajectory LLM-as-judge
 This evaluator is similar to the `trajectory_llm_as_judge` evaluator, but it works with graph trajectories instead of message trajectories. Below, we set up a LangGraph agent, extract a trajectory from it using the built-in utils, and pass it to the evaluator. First, let's set up our graph, call it, and then extract the trajectory:
@@ -603,10 +693,10 @@ const graphTrajectoryEvaluator = createGraphTrajectoryLLMAsJudge({
     model: "openai:o3-mini",
 })
-const res = await graphTrajectoryEvaluator(
-    inputs=extractedTrajectory.inputs,
-    outputs=extractedTrajectory.outputs,
-)
+const res = await graphTrajectoryEvaluator({
+  inputs: extractedTrajectory.inputs,
+  outputs: extractedTrajectory.outputs,
+});
 console.log(res);
 ```
@@ -650,17 +740,17 @@ const graphTrajectoryEvaluator = createGraphTrajectoryLLMAsJudge({
   prompt: CUSTOM_PROMPT,
   model: "openai:o3-mini",
 })
-res = await graphTrajectoryEvaluator(
+const res = await graphTrajectoryEvaluator({
   inputs: extractedTrajectory.inputs,
   outputs: extractedTrajectory.outputs,
-)
+});
 ```
 In order to format them properly into the prompt, `reference_outputs` should be passed in as a `GraphTrajectory` object like `outputs`.
-Also note that like other LLM-as-judge evaluators, you can pass extra kwargs into the evaluator to format them into the prompt.
+Also note that like other LLM-as-judge evaluators, you can pass extra params into the evaluator to format them into the prompt.
-#### Graph trajectory strict match
+### Graph trajectory strict match
 The `graph_trajectory_strict_match` evaluator is a simple evaluator that checks if the steps in the provided graph trajectory match the reference trajectory exactly.
@@ -732,18 +822,47 @@ console.log(result);
 }
 ```
-## LangSmith Integration
+# Python Async Support
+All `agentevals` evaluators support Python [asyncio](https://docs.python.org/3/library/asyncio.html). As a convention, evaluators that use a factory function will have `async` put immediately after `create_` in the function name (for example, `create_async_trajectory_llm_as_judge`), and evaluators used directly will end in `async` (e.g. `trajectory_strict_match_async`).
+Here's an example of how to use the `create_async_llm_as_judge` evaluator asynchronously:
+```python
+from openevals.llm import create_async_llm_as_judge
+evaluator = create_async_llm_as_judge(
+    prompt="What is the weather in {inputs}?",
+)
+result = await evaluator(inputs="San Francisco")
+```
+If you are using the OpenAI client directly, remember to pass in `AsyncOpenAI` as the `judge` parameter:
+```python
+from openai import AsyncOpenAI
+evaluator = create_async_llm_as_judge(
+    prompt="What is the weather in {inputs}?",
+    judge=AsyncOpenAI(),
+    model="o3-mini",
+)
+result = await evaluator(inputs="San Francisco")
+```
+# LangSmith Integration
 For tracking experiments over time, you can log evaluator results to [LangSmith](https://smith.langchain.com/), a platform for building production-grade LLM applications that includes tracing, evaluation, and experimentation tools.
 LangSmith currently offers two ways to run evals: a [pytest](https://docs.smith.langchain.com/evaluation/how_to_guides/pytest) (Python) or [Vitest/Jest](https://docs.smith.langchain.com/evaluation/how_to_guides/vitest_jest) integration and the `evaluate` function. We'll give a quick example of how to run evals using both.
-### Pytest or Vitest/Jest
+## Pytest or Vitest/Jest
 First, follow [these instructions](https://docs.smith.langchain.com/evaluation/how_to_guides/pytest) to set up LangSmith's pytest runner, or these to set up [Vitest or Jest](https://docs.smith.langchain.com/evaluation/how_to_guides/vitest_jest),
 setting appropriate environment variables:
 ```bash
 export LANGSMITH_API_KEY="your_langsmith_api_key"
 export LANGSMITH_TRACING="true"
@@ -776,6 +895,7 @@ ls.describe("trajectory accuracy", () => {
         {"role": "user", "content": "What is the weather in SF?"},
         {
             "role": "assistant",
+            "content": "",
             "tool_calls": [
                 {
                     "function": {
@@ -794,6 +914,7 @@ ls.describe("trajectory accuracy", () => {
         {"role": "user", "content": "What is the weather in SF?"},
         {
             "role": "assistant",
+            "content": "",
             "tool_calls": [
                 {
                     "function": {
@@ -831,7 +952,7 @@ And you should also see the results in the experiment view in LangSmith:
 ![LangSmith results](/static/img/langsmith_results.png)
-### Evaluate
+## Evaluate
 Alternatively, you can [create a dataset in LangSmith](https://docs.smith.langchain.com/evaluation/concepts#dataset-curation) and use your created evaluators with LangSmith's [`evaluate`](https://docs.smith.langchain.com/evaluation#8-run-and-view-results) function:
@@ -846,20 +967,21 @@ const trajectoryEvaluator = createTrajectoryLLMAsJudge({
 await evaluate(
   (inputs) => [
-        {role: "user", content: "What is the weather in SF?"},
-        {
-            role: "assistant",
-            tool_calls: [
-                {
-                    function: {
-                        name: "get_weather",
-                        arguments: json.dumps({"city": "SF"}),
-                    }
-                }
-            ],
-        },
-        {role: "tool", content: "It's 80 degrees and sunny in SF."},
-        {role: "assistant", content: "The weather in SF is 80 degrees and sunny."},
+      {role: "user", content: "What is the weather in SF?"},
+      {
+          role: "assistant",
+          content: "",
+          tool_calls: [
+              {
+                  function: {
+                      name: "get_weather",
+                      arguments: JSON.stringify({"city": "SF"}),
+                  }
+              }
+          ],
+      },
+      {role: "tool", content: "It's 80 degrees and sunny in SF."},
+      {role: "assistant", content: "The weather in SF is 80 degrees and sunny."},
     ],
   {
     data: datasetName,
@@ -868,7 +990,7 @@ await evaluate(
 );
 ```
-## Thank you!
+# Thank you!
 We hope that `agentevals` helps make evaluating your LLM agents easier!
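
As a reading aid for the `"subset"` and `"superset"` values of `toolArgsMatchMode` that the README changes above introduce, here is a minimal sketch of one plausible reading of the four modes. It is not the library's implementation: `argsMatch`, `isContainedIn`, and the local types are hypothetical names, and the assumption that `"subset"` means "the actual call's args are contained in the reference call's args" is inferred from the README wording only.

```ts
// Hypothetical helpers: NOT part of agentevals. They sketch one plausible
// reading of the tool-args match modes described in the README diff.
type ToolArgs = Record<string, unknown>;
type ToolArgsMatchMode = "exact" | "ignore" | "subset" | "superset";

// True when every key in `a` also appears in `b` with an equal value.
// JSON.stringify is a simplification: it is sensitive to key order in
// nested objects, which a real deep-equality check would not be.
const isContainedIn = (a: ToolArgs, b: ToolArgs): boolean =>
  Object.keys(a).every(
    (key) => key in b && JSON.stringify(a[key]) === JSON.stringify(b[key])
  );

function argsMatch(
  mode: ToolArgsMatchMode,
  actual: ToolArgs,
  reference: ToolArgs
): boolean {
  switch (mode) {
    case "ignore":
      // Any two calls to the same tool count as equal.
      return true;
    case "subset":
      // Assumed meaning: actual args are a subset of the reference args.
      return isContainedIn(actual, reference);
    case "superset":
      // Assumed meaning: actual args contain all of the reference args.
      return isContainedIn(reference, actual);
    case "exact":
      // Containment in both directions means the same keys and values.
      return isContainedIn(actual, reference) && isContainedIn(reference, actual);
  }
}

const reference = { city: "San Francisco", units: "F" };
console.log(argsMatch("subset", { city: "San Francisco" }, reference));
console.log(argsMatch("superset", { city: "San Francisco" }, reference));
```

Under these assumptions, a call that omits `units` passes in `"subset"` mode and fails in `"superset"` and `"exact"` modes. Per the README, an entry in `toolArgsMatchOverrides` for a given tool would take precedence over whichever mode is set.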

package/dist/graph_trajectory/llm.d.ts CHANGED

@@ -27,4 +27,4 @@ export declare const createGraphTrajectoryLLMAsJudge: ({ prompt, model, feedback
     };
     outputs: GraphTrajectory;
     referenceOutputs?: GraphTrajectory | undefined;
-}) => Promise<import("langsmith/vitest").SimpleEvaluationResult>;
+}) => Promise<import("../types.js").EvaluatorResult>;

package/dist/graph_trajectory/strict.d.ts CHANGED

@@ -11,4 +11,4 @@ import { GraphTrajectory } from "../types.js";
 export declare const graphTrajectoryStrictMatch: ({ outputs, referenceOutputs, }: {
     outputs: GraphTrajectory;
     referenceOutputs: GraphTrajectory;
-}) => Promise<import("langsmith/vitest").SimpleEvaluationResult>;
+}) => Promise<import("../types.js").EvaluatorResult>;

package/dist/graph_trajectory/utils.cjs CHANGED

@@ -56,7 +56,14 @@ const extractLangGraphTrajectoryFromSnapshots = (snapshots) => {
         }
         if (isAccumulatingSteps) {
             if (snapshot.metadata != null && snapshot.metadata.source === "input") {
-                inputs.push(snapshot.metadata.writes);
+                if ("writes" in snapshot.metadata &&
+                    snapshot.metadata.writes != null &&
+                    typeof snapshot.metadata.writes === "object") {
+                    inputs.push(snapshot.metadata.writes);
+                }
+                else {
+                    inputs.push(...snapshot.tasks.map((task) => ({ [task.name]: task.result })));
+                }
             }
             else if (i + 1 < snapshots.length &&
                 snapshots[i + 1].tasks?.find((task) => task.interrupts?.length > 0)) {
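
The hunk above changes how an `"input"` snapshot contributes to `inputs`: `metadata.writes` is used only when it is a non-null object, and otherwise one entry is built per task from `task.name` and `task.result`. The sketch below isolates that branch so it can be read and run on its own. `inputsFromSnapshot`, `SnapshotLike`, and `TaskLike` are hypothetical stand-ins, not exports of `agentevals` or `@langchain/langgraph`, and the real function only takes this branch when `metadata.source === "input"`.

```ts
// Minimal stand-ins for the snapshot shape used in the hunk above. The real
// StateSnapshot type from @langchain/langgraph has more fields than this.
type TaskLike = { name: string; result?: unknown };
type SnapshotLike = {
  metadata?: { source?: string; writes?: unknown } | null;
  tasks: TaskLike[];
};

// Hypothetical helper mirroring the new branch: prefer `metadata.writes`
// when it is a non-null object, otherwise fall back to the task results.
function inputsFromSnapshot(snapshot: SnapshotLike): Record<string, unknown>[] {
  const metadata = snapshot.metadata;
  if (
    metadata != null &&
    "writes" in metadata &&
    metadata.writes != null &&
    typeof metadata.writes === "object"
  ) {
    return [metadata.writes as Record<string, unknown>];
  }
  // Fallback: one { [task.name]: task.result } entry per task.
  return snapshot.tasks.map((task) => ({ [task.name]: task.result }));
}

const fromWrites = inputsFromSnapshot({
  metadata: { source: "input", writes: { messages: ["hi"] } },
  tasks: [],
});
const fromTasks = inputsFromSnapshot({
  metadata: { source: "input" },
  tasks: [{ name: "__start__", result: { messages: ["hi"] } }],
});
console.log(fromWrites, fromTasks);
```

Because neither path pushes a `null` value, this is consistent with the declaration change in `utils.d.ts`, where `null` is dropped from the element type of `inputs`.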

package/dist/graph_trajectory/utils.d.ts CHANGED

@@ -2,11 +2,11 @@ import type { StateSnapshot, Pregel } from "@langchain/langgraph/web";
 import type { RunnableConfig } from "@langchain/core/runnables";
 import type { GraphTrajectory } from "../types.js";
 export declare const extractLangGraphTrajectoryFromSnapshots: (snapshots: StateSnapshot[]) => {
-    inputs: (string | Record<string, unknown> | null)[];
+    inputs: (string | Record<string, unknown>)[];
     outputs: GraphTrajectory;
 };
 export declare const _getLangGraphStateHistoryRecursive: (graph: Pregel<any, any>, config: RunnableConfig) => Promise<StateSnapshot[]>;
 export declare const extractLangGraphTrajectoryFromThread: (graph: Pregel<any, any>, config: RunnableConfig) => Promise<{
-    inputs: (string | Record<string, unknown> | null)[];
+    inputs: (string | Record<string, unknown>)[];
     outputs: GraphTrajectory;
 }>;

package/dist/graph_trajectory/utils.js CHANGED

@@ -53,7 +53,14 @@ export const extractLangGraphTrajectoryFromSnapshots = (snapshots) => {
         }
         if (isAccumulatingSteps) {
             if (snapshot.metadata != null && snapshot.metadata.source === "input") {
-                inputs.push(snapshot.metadata.writes);
+                if ("writes" in snapshot.metadata &&
+                    snapshot.metadata.writes != null &&
+                    typeof snapshot.metadata.writes === "object") {
+                    inputs.push(snapshot.metadata.writes);
+                }
+                else {
+                    inputs.push(...snapshot.tasks.map((task) => ({ [task.name]: task.result })));
+                }
             }
             else if (i + 1 < snapshots.length &&
                 snapshots[i + 1].tasks?.find((task) => task.interrupts?.length > 0)) {
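
The hunk above changes how inputs are collected from an `"input"` snapshot: `metadata.writes` is used only when it is present and is a non-null object, and otherwise one `{ [task.name]: task.result }` entry is produced per task. A minimal sketch of that branch follows. `SketchSnapshot`, `SketchTask`, and `collectInputs` are hypothetical names for illustration, the shapes are trimmed down from the real `StateSnapshot`, and the sketch returns an array where the package pushes into `inputs`.

```ts
// Hypothetical, trimmed-down shapes; the real StateSnapshot has more fields.
type SketchTask = { name: string; result?: unknown };
type SketchSnapshot = {
  metadata?: { source?: string; writes?: unknown } | null;
  tasks: SketchTask[];
};

// Mirrors the new branch: prefer metadata.writes, fall back to task results.
function collectInputs(snapshot: SketchSnapshot): Record<string, unknown>[] {
  const metadata = snapshot.metadata;
  if (metadata == null || metadata.source !== "input") {
    return [];
  }
  if (
    "writes" in metadata &&
    metadata.writes != null &&
    typeof metadata.writes === "object"
  ) {
    return [metadata.writes as Record<string, unknown>];
  }
  // No usable writes: derive one input per task, keyed by task name.
  return snapshot.tasks.map((task) => ({ [task.name]: task.result }));
}

const withWrites = collectInputs({
  metadata: { source: "input", writes: { question: "hi" } },
  tasks: [],
});
const withoutWrites = collectInputs({
  metadata: { source: "input" },
  tasks: [{ name: "__start__", result: { question: "hi" } }],
});

console.log(JSON.stringify(withWrites)); // [{"question":"hi"}]
console.log(JSON.stringify(withoutWrites)); // [{"__start__":{"question":"hi"}}]
```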

package/dist/trajectory/llm.d.ts CHANGED

@@ -1,5 +1,5 @@
 import { BaseMessage } from "@langchain/core/messages";
-import { ChatCompletionMessage, EvaluatorResult, TrajectoryLLMAsJudgeParams } from "../types.js";
+import { ChatCompletionMessage, FlexibleChatCompletionMessage, EvaluatorResult, TrajectoryLLMAsJudgeParams } from "../types.js";
 export declare const TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE = "You are an expert data labeler.\nYour task is to grade the accuracy of an AI agent's internal trajectory.\n\n<Rubric>\n  An accurate trajectory:\n  - Makes logical sense between steps\n  - Shows clear progression\n  - Is relatively efficient, though it does not need to be perfectly efficient\n  - Is semantically equivalent to the provided reference trajectory\n</Rubric>\n\nBased on the following reference trajectory:\n\n<reference_trajectory>\n{reference_outputs}\n</reference_trajectory>\n\nGrade this actual trajectory:\n\n<trajectory>\n{outputs}\n</trajectory>\n";
 export declare const TRAJECTORY_ACCURACY_PROMPT = "You are an expert data labeler.\nYour task is to grade the accuracy of an AI agent's internal trajectory.\n\n<Rubric>\n  An accurate trajectory:\n  - Makes logical sense between steps\n  - Shows clear progression\n  - Is relatively efficient, though it does not need to be perfectly efficient\n</Rubric>\n\nFirst, try to understand the goal of the trajectory by looking at the input\n(if the input is not present try to infer it from the content of the first message),\nas well as the output of the final message. Once you understand the goal, grade the trajectory\nas it relates to achieving that goal.\n\nGrade the following trajectory:\n\n<trajectory>\n{outputs}\n</trajectory>";
 /**
@@ -25,10 +25,10 @@ export declare const TRAJECTORY_ACCURACY_PROMPT = "You are an expert data labele
  */
 export declare const createTrajectoryLLMAsJudge: ({ prompt, feedbackKey, model, system, judge, continuous, choices, useReasoning, fewShotExamples, }: TrajectoryLLMAsJudgeParams) => ({ inputs, outputs, referenceOutputs, ...extra }: {
     [key: string]: unknown;
-    outputs: ChatCompletionMessage[] | BaseMessage[] | {
-        messages: (BaseMessage | ChatCompletionMessage)[];
+    outputs: ChatCompletionMessage[] | FlexibleChatCompletionMessage[] | BaseMessage[] | {
+        messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
     };
-    referenceOutputs?: BaseMessage[] | ChatCompletionMessage[] | {
-        messages: (BaseMessage | ChatCompletionMessage)[];
+    referenceOutputs?: ChatCompletionMessage[] | BaseMessage[] | FlexibleChatCompletionMessage[] | {
+        messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
     } | undefined;
 }) => Promise<EvaluatorResult>;

package/dist/trajectory/match.d.ts CHANGED

@@ -1,5 +1,5 @@
 import { BaseMessage } from "@langchain/core/messages";
-import { ChatCompletionMessage, ToolArgsMatchMode, ToolArgsMatchOverrides } from "../types.js";
+import { ChatCompletionMessage, FlexibleChatCompletionMessage, ToolArgsMatchMode, ToolArgsMatchOverrides } from "../types.js";
 export type TrajectoryMatchMode = "strict" | "unordered" | "subset" | "superset";
 /**
  * Creates an evaluator that compares trajectories between model outputs and reference outputs.
@@ -52,10 +52,10 @@ export declare function createTrajectoryMatchEvaluator({ trajectoryMatchMode, to
     toolArgsMatchOverrides?: ToolArgsMatchOverrides;
 }): ({ outputs, referenceOutputs, ...extra }: {
     [key: string]: unknown;
-    outputs: ChatCompletionMessage[] | BaseMessage[] | {
-        messages: (BaseMessage | ChatCompletionMessage)[];
+    outputs: ChatCompletionMessage[] | FlexibleChatCompletionMessage[] | BaseMessage[] | {
+        messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
     };
-    referenceOutputs: ChatCompletionMessage[] | BaseMessage[] | {
-        messages: (BaseMessage | ChatCompletionMessage)[];
+    referenceOutputs: ChatCompletionMessage[] | FlexibleChatCompletionMessage[] | BaseMessage[] | {
+        messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
     };
-}) => Promise<import("langsmith/vitest").SimpleEvaluationResult>;
+}) => Promise<import("../types.js").EvaluatorResult>;

package/dist/trajectory/strict.cjs CHANGED

@@ -5,8 +5,8 @@ const utils_js_1 = require("../utils.cjs");
 const utils_js_2 = require("./utils.cjs");
 async function _scorer(params) {
     const { outputs, referenceOutputs, toolArgsMatchMode, toolArgsMatchOverrides, } = params;
-    const normalizedOutputs = (0, utils_js_1._normalizeToOpenAIMessagesList)(outputs);
-    const normalizedReferenceOutputs = (0, utils_js_1._normalizeToOpenAIMessagesList)(referenceOutputs);
+    const normalizedOutputs = outputs;
+    const normalizedReferenceOutputs = referenceOutputs;
     if (!normalizedOutputs || !normalizedReferenceOutputs) {
         throw new Error("Strict trajectory match requires both outputs and reference_outputs");
     }
@@ -66,8 +66,11 @@ exports._scorer = _scorer;
  * @returns EvaluatorResult containing a score of true if trajectory (including called tools) matches, false otherwise
  */
 async function trajectoryStrictMatch(params) {
+    const normalizedOutputs = (0, utils_js_1._normalizeToOpenAIMessagesList)(params.outputs);
+    const normalizedReferenceOutputs = (0, utils_js_1._normalizeToOpenAIMessagesList)(params.referenceOutputs);
     return (0, utils_js_1._runEvaluator)("trajectory_strict_match", _scorer, "trajectory_strict_match", {
-        ...params,
+        outputs: normalizedOutputs,
+        referenceOutputs: normalizedReferenceOutputs,
         toolArgsMatchMode: params.toolCallArgsExactMatch ? "exact" : "ignore",
     });
 }

package/dist/trajectory/strict.d.ts CHANGED

@@ -1,12 +1,8 @@
 import { BaseMessage } from "@langchain/core/messages";
-import { ChatCompletionMessage, EvaluatorResult, ToolArgsMatchMode, ToolArgsMatchOverrides } from "../types.js";
+import { ChatCompletionMessage, FlexibleChatCompletionMessage, EvaluatorResult, ToolArgsMatchMode, ToolArgsMatchOverrides } from "../types.js";
 export declare function _scorer(params: {
-    outputs: ChatCompletionMessage[] | BaseMessage[] | {
-        messages: (BaseMessage | ChatCompletionMessage)[];
-    };
-    referenceOutputs: ChatCompletionMessage[] | BaseMessage[] | {
-        messages: (BaseMessage | ChatCompletionMessage)[];
-    };
+    outputs: ChatCompletionMessage[];
+    referenceOutputs: ChatCompletionMessage[];
     toolArgsMatchMode: ToolArgsMatchMode;
     toolArgsMatchOverrides?: ToolArgsMatchOverrides;
 }): Promise<boolean>;
@@ -23,11 +19,11 @@ export declare function _scorer(params: {
  * @returns EvaluatorResult containing a score of true if trajectory (including called tools) matches, false otherwise
  */
 export declare function trajectoryStrictMatch(params: {
-    outputs: ChatCompletionMessage[] | BaseMessage[] | {
-        messages: (BaseMessage | ChatCompletionMessage)[];
+    outputs: ChatCompletionMessage[] | FlexibleChatCompletionMessage[] | BaseMessage[] | {
+        messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
     };
-    referenceOutputs: ChatCompletionMessage[] | BaseMessage[] | {
-        messages: (BaseMessage | ChatCompletionMessage)[];
+    referenceOutputs: ChatCompletionMessage[] | FlexibleChatCompletionMessage[] | BaseMessage[] | {
+        messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
     };
     toolCallArgsExactMatch: boolean;
 }): Promise<EvaluatorResult>;

package/dist/trajectory/strict.js CHANGED

@@ -2,8 +2,8 @@ import { _normalizeToOpenAIMessagesList, _runEvaluator } from "../utils.js";
 import { _getMatcherForToolName } from "./utils.js";
 export async function _scorer(params) {
     const { outputs, referenceOutputs, toolArgsMatchMode, toolArgsMatchOverrides, } = params;
-    const normalizedOutputs = _normalizeToOpenAIMessagesList(outputs);
-    const normalizedReferenceOutputs = _normalizeToOpenAIMessagesList(referenceOutputs);
+    const normalizedOutputs = outputs;
+    const normalizedReferenceOutputs = referenceOutputs;
     if (!normalizedOutputs || !normalizedReferenceOutputs) {
         throw new Error("Strict trajectory match requires both outputs and reference_outputs");
     }
@@ -62,8 +62,11 @@ export async function _scorer(params) {
  * @returns EvaluatorResult containing a score of true if trajectory (including called tools) matches, false otherwise
  */
 export async function trajectoryStrictMatch(params) {
+    const normalizedOutputs = _normalizeToOpenAIMessagesList(params.outputs);
+    const normalizedReferenceOutputs = _normalizeToOpenAIMessagesList(params.referenceOutputs);
     return _runEvaluator("trajectory_strict_match", _scorer, "trajectory_strict_match", {
-        ...params,
+        outputs: normalizedOutputs,
+        referenceOutputs: normalizedReferenceOutputs,
         toolArgsMatchMode: params.toolCallArgsExactMatch ? "exact" : "ignore",
     });
 }
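
The strict-match hunks move normalization out of `_scorer` and into the public `trajectoryStrictMatch` wrapper, so the scorer only ever sees a plain message array. The pattern can be sketched as below, under loud assumptions: `normalizeToList`, `scoreStrict`, and `strictMatch` are hypothetical stand-ins, the `{ messages }` branch is inferred from the declared parameter type rather than from the (partly elided) body of `_normalizeToOpenAIMessagesList`, and the scorer here compares roles only, whereas the real one also compares tool calls.

```ts
// Hypothetical simplified message and input shapes.
type Msg = { role: string; content: unknown };
type TrajectoryInput = Msg[] | { messages: Msg[] } | undefined;

// Boundary step: accept either shape, always return a flat list.
function normalizeToList(input: TrajectoryInput): Msg[] {
  if (!input) {
    return [];
  }
  return Array.isArray(input) ? input : input.messages;
}

// Inner scorer: works on one shape only (stand-in logic, roles only).
function scoreStrict(outputs: Msg[], referenceOutputs: Msg[]): boolean {
  return (
    outputs.length === referenceOutputs.length &&
    outputs.every((msg, i) => msg.role === referenceOutputs[i]?.role)
  );
}

// Public wrapper: normalize first, then delegate to the scorer.
function strictMatch(params: {
  outputs: TrajectoryInput;
  referenceOutputs: TrajectoryInput;
}): boolean {
  return scoreStrict(
    normalizeToList(params.outputs),
    normalizeToList(params.referenceOutputs)
  );
}

const sameShape = strictMatch({
  outputs: [{ role: "user", content: "hi" }],
  referenceOutputs: { messages: [{ role: "user", content: "hello" }] },
});
console.log(sameShape); // true
```

Keeping the scorer on a single input type is also what lets the `.d.ts` for `_scorer` shrink to `ChatCompletionMessage[]` in this release.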

package/dist/trajectory/subset.d.ts CHANGED

@@ -1,5 +1,5 @@
 import { BaseMessage } from "@langchain/core/messages";
-import { ChatCompletionMessage, EvaluatorResult, ToolArgsMatchMode, ToolArgsMatchOverrides } from "../types.js";
+import { ChatCompletionMessage, FlexibleChatCompletionMessage, EvaluatorResult, ToolArgsMatchMode, ToolArgsMatchOverrides } from "../types.js";
 export declare const _scorer: (params: {
     outputs: ChatCompletionMessage[];
     referenceOutputs: ChatCompletionMessage[];
@@ -21,10 +21,10 @@ export declare const _scorer: (params: {
  * @returns EvaluatorResult containing a score of true if trajectory (including called tools) matches, false otherwise
  */
 export declare function trajectorySubset(params: {
-    outputs: ChatCompletionMessage[] | BaseMessage[] | {
-        messages: (BaseMessage | ChatCompletionMessage)[];
+    outputs: FlexibleChatCompletionMessage[] | BaseMessage[] | {
+        messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
     };
-    referenceOutputs: ChatCompletionMessage[] | BaseMessage[] | {
-        messages: (BaseMessage | ChatCompletionMessage)[];
+    referenceOutputs: FlexibleChatCompletionMessage[] | BaseMessage[] | {
+        messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
     };
 }): Promise<EvaluatorResult>;

package/dist/trajectory/superset.d.ts CHANGED

@@ -1,5 +1,5 @@
 import { BaseMessage } from "@langchain/core/messages";
-import { ChatCompletionMessage, EvaluatorResult, ToolArgsMatchMode, ToolArgsMatchOverrides } from "../types.js";
+import { ChatCompletionMessage, FlexibleChatCompletionMessage, EvaluatorResult, ToolArgsMatchMode, ToolArgsMatchOverrides } from "../types.js";
 export declare const _scorer: (params: {
     outputs: ChatCompletionMessage[];
     referenceOutputs: ChatCompletionMessage[];
@@ -21,10 +21,10 @@ export declare const _scorer: (params: {
  * @returns EvaluatorResult containing a score of true if trajectory (including called tools) matches, false otherwise
  */
 export declare function trajectorySuperset(params: {
-    outputs: ChatCompletionMessage[] | BaseMessage[] | {
-        messages: (BaseMessage | ChatCompletionMessage)[];
+    outputs: FlexibleChatCompletionMessage[] | BaseMessage[] | {
+        messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
     };
-    referenceOutputs: ChatCompletionMessage[] | BaseMessage[] | {
-        messages: (BaseMessage | ChatCompletionMessage)[];
+    referenceOutputs: FlexibleChatCompletionMessage[] | BaseMessage[] | {
+        messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
     };
 }): Promise<EvaluatorResult>;

package/dist/trajectory/unordered.d.ts CHANGED

@@ -1,5 +1,5 @@
 import { BaseMessage } from "@langchain/core/messages";
-import { ChatCompletionMessage, EvaluatorResult, ToolArgsMatchMode, ToolArgsMatchOverrides } from "../types.js";
+import { ChatCompletionMessage, FlexibleChatCompletionMessage, EvaluatorResult, ToolArgsMatchMode, ToolArgsMatchOverrides } from "../types.js";
 export declare const _scorer: (params: {
     outputs: ChatCompletionMessage[];
     referenceOutputs: ChatCompletionMessage[];
@@ -21,10 +21,10 @@ export declare const _scorer: (params: {
  * @returns EvaluatorResult containing a score of true if trajectory (including called tools) matches, false otherwise
  */
 export declare function trajectoryUnorderedMatch(params: {
-    outputs: ChatCompletionMessage[] | BaseMessage[] | {
-        messages: (BaseMessage | ChatCompletionMessage)[];
+    outputs: FlexibleChatCompletionMessage[] | BaseMessage[] | {
+        messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
     };
-    referenceOutputs: ChatCompletionMessage[] | BaseMessage[] | {
-        messages: (BaseMessage | ChatCompletionMessage)[];
+    referenceOutputs: FlexibleChatCompletionMessage[] | BaseMessage[] | {
+        messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
     };
 }): Promise<EvaluatorResult>;

package/dist/trajectory/utils.cjs CHANGED

@@ -88,10 +88,24 @@ function _exactMatch(toolCall, referenceToolCall) {
 function _ignoreMatch(_toolCall, _referenceToolCall) {
     return true;
 }
+function _subsetMatch(toolCall, referenceToolCall) {
+    // Every key-value pair in toolCall must exist in referenceToolCall with the same value
+    return Object.entries(toolCall).every(([key, value]) => key in referenceToolCall && _deepEqual(referenceToolCall[key], value));
+}
+function _supersetMatch(toolCall, referenceToolCall) {
+    // Every key-value pair in referenceToolCall must exist in toolCall with the same value
+    return Object.entries(referenceToolCall).every(([key, value]) => key in toolCall && _deepEqual(toolCall[key], value));
+}
 function _getMatcherForComparisonMode(mode) {
     if (mode === "exact") {
         return _exactMatch;
     }
+    else if (mode === "subset") {
+        return _subsetMatch;
+    }
+    else if (mode === "superset") {
+        return _supersetMatch;
+    }
     else {
         return _ignoreMatch;
     }

package/dist/trajectory/utils.js CHANGED

@@ -84,10 +84,24 @@ function _exactMatch(toolCall, referenceToolCall) {
 function _ignoreMatch(_toolCall, _referenceToolCall) {
     return true;
 }
+function _subsetMatch(toolCall, referenceToolCall) {
+    // Every key-value pair in toolCall must exist in referenceToolCall with the same value
+    return Object.entries(toolCall).every(([key, value]) => key in referenceToolCall && _deepEqual(referenceToolCall[key], value));
+}
+function _supersetMatch(toolCall, referenceToolCall) {
+    // Every key-value pair in referenceToolCall must exist in toolCall with the same value
+    return Object.entries(referenceToolCall).every(([key, value]) => key in toolCall && _deepEqual(toolCall[key], value));
+}
 function _getMatcherForComparisonMode(mode) {
     if (mode === "exact") {
         return _exactMatch;
     }
+    else if (mode === "subset") {
+        return _subsetMatch;
+    }
+    else if (mode === "superset") {
+        return _supersetMatch;
+    }
     else {
         return _ignoreMatch;
     }
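
The two matchers added above differ only in which side drives the comparison: `"subset"` iterates over the actual object's keys, `"superset"` over the reference's. The sketch below reproduces that logic in a runnable form. `deepEqual` is a stand-in written for this example, since the package's `_deepEqual` body is not part of this diff, and the sample objects are made up.

```ts
type ToolArgs = Record<string, unknown>;

// Stand-in for the package's internal _deepEqual (not shown in the diff).
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const aObj = a as Record<string, unknown>;
  const bObj = b as Record<string, unknown>;
  const aKeys = Object.keys(aObj);
  return (
    aKeys.length === Object.keys(bObj).length &&
    aKeys.every((key) => key in bObj && deepEqual(aObj[key], bObj[key]))
  );
}

// "subset": every key-value pair in toolCall must appear in the reference.
const subsetMatch = (toolCall: ToolArgs, reference: ToolArgs): boolean =>
  Object.entries(toolCall).every(
    ([key, value]) => key in reference && deepEqual(reference[key], value)
  );

// "superset": every key-value pair in the reference must appear in toolCall.
const supersetMatch = (toolCall: ToolArgs, reference: ToolArgs): boolean =>
  Object.entries(reference).every(
    ([key, value]) => key in toolCall && deepEqual(toolCall[key], value)
  );

const actual = { city: "SF" };
const reference = { city: "SF", units: "fahrenheit" };

console.log(subsetMatch(actual, reference)); // true
console.log(supersetMatch(actual, reference)); // false
```

Note that an empty object passes `subsetMatch` against any reference, because `every` on an empty list of entries is vacuously true.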

package/dist/types.d.ts CHANGED

@@ -1,5 +1,20 @@
 import { createLLMAsJudge } from "openevals/llm";
 export * from "openevals/types";
+export type FlexibleChatCompletionMessage = Record<string, any> & ({
+    content: any;
+    role: "user" | "system" | "developer";
+    id?: string;
+} | {
+    role: "assistant";
+    content: any;
+    tool_calls?: any[];
+    id?: string;
+} | {
+    role: "tool";
+    content: any;
+    tool_call_id?: string;
+    id?: string;
+});
 export type GraphTrajectory = {
     inputs?: (Record<string, unknown> | null)[];
     results: Record<string, unknown>[];
@@ -9,9 +24,9 @@ export type ExtractedLangGraphThreadTrajectory = {
     inputs: (Record<string, unknown> | null)[][];
     outputs: GraphTrajectory;
 };
-export type TrajectoryLLMAsJudgeParams = Omit<Parameters<typeof createLLMAsJudge>[0], "prompt"> & {
-    prompt?: string;
+export type TrajectoryLLMAsJudgeParams = Partial<Omit<Parameters<typeof createLLMAsJudge>[0], "prompt">> & {
+    prompt?: Parameters<typeof createLLMAsJudge>[0]["prompt"];
 };
-export type ToolArgsMatchMode = "exact" | "ignore";
+export type ToolArgsMatchMode = "exact" | "ignore" | "subset" | "superset";
 export type ToolArgsMatcher = (toolCall: Record<string, unknown>, referenceToolCall: Record<string, unknown>) => boolean | Promise<boolean>;
 export type ToolArgsMatchOverrides = Record<string, ToolArgsMatchMode | string[] | ToolArgsMatcher>;
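
As a rough illustration of what the new `FlexibleChatCompletionMessage` type admits, the sketch below copies the declaration from the hunk above (so it runs without installing `agentevals`) and checks a small trajectory against it with `satisfies`. The message contents, the `get_weather` tool name, and the `custom_field` key are invented for the example.

```ts
// Copied from the diff above so the sketch is self-contained.
type FlexibleChatCompletionMessage = Record<string, any> &
  (
    | { content: any; role: "user" | "system" | "developer"; id?: string }
    | { role: "assistant"; content: any; tool_calls?: any[]; id?: string }
    | { role: "tool"; content: any; tool_call_id?: string; id?: string }
  );

const trajectory = [
  { role: "user", content: "What is the weather in SF?" },
  {
    role: "assistant",
    content: "",
    tool_calls: [
      {
        function: {
          name: "get_weather", // made-up tool name
          arguments: JSON.stringify({ city: "SF" }),
        },
      },
    ],
  },
  // tool_call_id is optional in the flexible type, so it can be omitted here.
  { role: "tool", content: "It's 80 degrees and sunny in SF." },
  // Extra keys are allowed through the Record<string, any> part of the type.
  {
    role: "assistant",
    content: "The weather in SF is 80 degrees and sunny.",
    custom_field: 1,
  },
] satisfies FlexibleChatCompletionMessage[];

const toolMessages = trajectory.filter((message) => message.role === "tool");
console.log(toolMessages.length); // 1
```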

package/dist/utils.cjs CHANGED

@@ -1,6 +1,6 @@
 "use strict";
 Object.defineProperty(exports, "__esModule", { value: true });
-exports._runEvaluator = exports.processScore = exports._normalizeToOpenAIMessagesList = exports._convertToOpenAIMessage = void 0;
+exports._runEvaluator = exports.processScore = exports._normalizeToOpenAIMessagesList = exports._convertToChatCompletionMessage = exports._convertToOpenAIMessage = void 0;
 const messages_1 = require("@langchain/core/messages");
 const openai_1 = require("@langchain/openai");
 const utils_1 = require("openevals/utils");
@@ -14,6 +14,25 @@ const _convertToOpenAIMessage = (message) => {
     }
 };
 exports._convertToOpenAIMessage = _convertToOpenAIMessage;
+const _convertToChatCompletionMessage = (message) => {
+    let converted;
+    if ((0, messages_1.isBaseMessage)(message)) {
+        // eslint-disable-next-line @typescript-eslint/no-explicit-any
+        converted = (0, openai_1._convertMessagesToOpenAIParams)([message])[0];
+    }
+    else {
+        converted = message;
+    }
+    // For tool messages without tool_call_id, generate one for compatibility
+    if (converted.role === "tool" && !converted.tool_call_id) {
+        converted = {
+            ...converted,
+            tool_call_id: `generated-${Math.random().toString(36).substring(2)}`,
+        };
+    }
+    return converted;
+};
+exports._convertToChatCompletionMessage = _convertToChatCompletionMessage;
 const _normalizeToOpenAIMessagesList = (messages) => {
     if (!messages) {
         return [];
@@ -30,7 +49,7 @@ const _normalizeToOpenAIMessagesList = (messages) => {
     else {
         messagesList = messages;
     }
-    return messagesList.map(exports._convertToOpenAIMessage);
+    return messagesList.map(exports._convertToChatCompletionMessage);
 };
 exports._normalizeToOpenAIMessagesList = _normalizeToOpenAIMessagesList;
 const processScore = (_, value) => {

package/dist/utils.d.ts CHANGED

@@ -1,9 +1,10 @@
 import { BaseMessage } from "@langchain/core/messages";
 import { EvaluationResultType } from "openevals/utils";
-import { ChatCompletionMessage, MultiResultScorerReturnType, SingleResultScorerReturnType } from "./types.js";
+import { ChatCompletionMessage, FlexibleChatCompletionMessage, MultiResultScorerReturnType, SingleResultScorerReturnType } from "./types.js";
 export declare const _convertToOpenAIMessage: (message: BaseMessage | ChatCompletionMessage) => ChatCompletionMessage;
-export declare const _normalizeToOpenAIMessagesList: (messages?: (BaseMessage | ChatCompletionMessage)[] | {
-    messages: (BaseMessage | ChatCompletionMessage)[];
+export declare const _convertToChatCompletionMessage: (message: BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage) => ChatCompletionMessage;
+export declare const _normalizeToOpenAIMessagesList: (messages?: (FlexibleChatCompletionMessage | ChatCompletionMessage | BaseMessage)[] | {
+    messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
 } | undefined) => ChatCompletionMessage[];
 export declare const processScore: (_: string, value: boolean | number | {
     score: boolean | number;

package/dist/utils.js CHANGED

@@ -10,6 +10,24 @@ export const _convertToOpenAIMessage = (message) => {
         return message;
     }
 };
+export const _convertToChatCompletionMessage = (message) => {
+    let converted;
+    if (isBaseMessage(message)) {
+        // eslint-disable-next-line @typescript-eslint/no-explicit-any
+        converted = _convertMessagesToOpenAIParams([message])[0];
+    }
+    else {
+        converted = message;
+    }
+    // For tool messages without tool_call_id, generate one for compatibility
+    if (converted.role === "tool" && !converted.tool_call_id) {
+        converted = {
+            ...converted,
+            tool_call_id: `generated-${Math.random().toString(36).substring(2)}`,
+        };
+    }
+    return converted;
+};
 export const _normalizeToOpenAIMessagesList = (messages) => {
     if (!messages) {
         return [];
@@ -26,7 +44,7 @@ export const _normalizeToOpenAIMessagesList = (messages) => {
     else {
         messagesList = messages;
     }
-    return messagesList.map(_convertToOpenAIMessage);
+    return messagesList.map(_convertToChatCompletionMessage);
 };
 export const processScore = (_, value) => {
     if (typeof value === "object") {
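
The new `_convertToChatCompletionMessage` backfills a `tool_call_id` on tool messages that lack one. The sketch below isolates just that step. `SketchMessage` and `ensureToolCallId` are hypothetical names, and the `BaseMessage` conversion branch is left out because it depends on `@langchain/core` and `@langchain/openai`.

```ts
// Hypothetical minimal message shape for this sketch.
type SketchMessage = {
  role: string;
  content: unknown;
  tool_call_id?: string;
};

// Returns a copy with a generated ID when a tool message has none;
// every other message is returned unchanged.
function ensureToolCallId(message: SketchMessage): SketchMessage {
  if (message.role === "tool" && !message.tool_call_id) {
    return {
      ...message,
      tool_call_id: `generated-${Math.random().toString(36).substring(2)}`,
    };
  }
  return message;
}

const original: SketchMessage = { role: "tool", content: "80 and sunny" };
const patched = ensureToolCallId(original);
const untouched = ensureToolCallId({
  role: "tool",
  content: "80 and sunny",
  tool_call_id: "call_123",
});

console.log(patched.tool_call_id?.startsWith("generated-")); // true
console.log(untouched.tool_call_id); // call_123
console.log(original.tool_call_id); // undefined (the input is not mutated)
```

Because the suffix comes from `Math.random()`, the generated ID is a placeholder rather than a stable identifier: converting the same message twice will generally give two different IDs.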

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "agentevals",
-  "version": "0.0.4",
+  "version": "0.0.6",
   "packageManager": "yarn@3.5.1",
   "type": "module",
   "scripts": {
@@ -14,18 +14,18 @@
     "test": "vitest run"
   },
   "dependencies": {
-    "@langchain/openai": "^0.4.4",
-    "langchain": "^0.3.18",
-    "langsmith": "^0.3.11",
-    "openevals": "^0.0.3"
+    "@langchain/openai": ">=0.4.4",
+    "langchain": ">=0.3.18",
+    "langsmith": ">=0.3.11",
+    "openevals": "^0.1.0"
   },
   "peerDependencies": {
-    "@langchain/core": "^0.3.40",
-    "@langchain/langgraph": "^0.2.46"
+    "@langchain/core": ">=0.3.73",
+    "@langchain/langgraph": ">=0.2.46"
   },
   "devDependencies": {
-    "@langchain/core": "^0.3.40",
-    "@langchain/langgraph": "^0.2.46",
+    "@langchain/core": "^0.3.73",
+    "@langchain/langgraph": "^0.4.9",
     "@langchain/scripts": "0.1.3",
     "@tsconfig/recommended": "^1.0.8",
     "@typescript-eslint/eslint-plugin": "^8.24.1",
@@ -43,7 +43,7 @@
     "prettier": "^3.5.1",
     "typescript": "~5.1.6",
     "vitest": "^3.0.5",
-    "zod": "^3.24.2"
+    "zod": "^4.1.5"
   },
   "files": [
     "dist/",
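
Several ranges above switch from caret (`^`) to `>=`. For `0.x` versions that widens the range considerably: under npm's rules a caret range stops at the next change to the left-most non-zero component, while `>=` has no upper bound. The sketch below hand-rolls both checks to make the difference concrete. It is a simplified illustration, not the `semver` package: `parse`, `compare`, `satisfiesGte`, and `satisfiesCaret` are hypothetical helpers, and prerelease tags are ignored.

```ts
type Version = [number, number, number];

// Parses "major.minor.patch"; prerelease/build suffixes are not handled.
const parse = (version: string): Version => {
  const [major = 0, minor = 0, patch = 0] = version.split(".").map(Number);
  return [major, minor, patch];
};

const compare = (a: Version, b: Version): number =>
  a[0] - b[0] || a[1] - b[1] || a[2] - b[2];

// ">=min": any version at or above the minimum.
const satisfiesGte = (version: string, min: string): boolean =>
  compare(parse(version), parse(min)) >= 0;

// "^min": at or above the minimum, with the left-most non-zero
// component of the minimum left unchanged.
const satisfiesCaret = (version: string, min: string): boolean => {
  const v = parse(version);
  const m = parse(min);
  if (compare(v, m) < 0) return false;
  if (m[0] > 0) return v[0] === m[0];
  if (m[1] > 0) return v[0] === 0 && v[1] === m[1];
  return v[0] === 0 && v[1] === 0 && v[2] === m[2];
};

// The devDependency "@langchain/langgraph": "^0.4.9" resolves to 0.4.x.
console.log(satisfiesCaret("0.4.9", "0.2.46")); // false (old peer range)
console.log(satisfiesGte("0.4.9", "0.2.46")); // true (new peer range)
```

Read this way, the old `^0.2.46` peer range would have rejected the `0.4.x` line that the devDependencies now use, while `>=0.2.46` accepts it.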