agentevals 0.0.4 → 0.0.6

package/README.md CHANGED
@@ -9,7 +9,7 @@ It is intended to provide a good conceptual starting point for your agent's eval
9
9
 
10
10
  If you are looking for more general evaluation tools, please check out the companion package [`openevals`](https://github.com/langchain-ai/openevals).
11
11
 
12
- ## Quickstart
12
+ # Quickstart
13
13
 
14
14
  To get started, install `agentevals`:
15
15
 
@@ -28,6 +28,7 @@ Once you've done this, you can run your first trajectory evaluator. We represent
28
28
  ```ts
29
29
  import {
30
30
  createTrajectoryLLMAsJudge,
31
+ type FlexibleChatCompletionMessage,
31
32
  TRAJECTORY_ACCURACY_PROMPT,
32
33
  } from "agentevals";
33
34
 
@@ -55,7 +56,7 @@ const outputs = [
55
56
  role: "assistant",
56
57
  content: "The weather in SF is 80 degrees and sunny.",
57
58
  },
58
- ];
59
+ ] satisfies FlexibleChatCompletionMessage[];
59
60
 
60
61
  const evalResult = await trajectoryEvaluator({
61
62
  outputs,
@@ -72,25 +73,29 @@ console.log(evalResult);
72
73
  }
73
74
  ```
74
75
 
75
- You can see that despite the small difference in the final response and tool calls, the evaluator still returns a score of `true` since the overall trajectory is the same between the output and reference!
76
+ You can see that the evaluator returns a score of `true` since the overall trajectory is a reasonable path for the agent to take to answer the user's question.
77
+
78
+ For more details on this evaluator, including how to customize it, see the section on [trajectory LLM-as-judge](#trajectory-llm-as-judge).
76
79
 
77
- ## Table of Contents
80
+ # Table of Contents
78
81
 
79
82
  - [Installation](#installation)
80
83
  - [Evaluators](#evaluators)
81
- - [Agent Trajectory](#agent-trajectory)
84
+ - [Agent Trajectory Match](#agent-trajectory-match)
82
85
  - [Strict match](#strict-match)
83
86
  - [Unordered match](#unordered-match)
84
87
  - [Subset/superset match](#subset-and-superset-match)
85
- - [Trajectory LLM-as-judge](#trajectory-llm-as-judge)
88
+ - [Tool args match modes](#tool-args-match-modes)
89
+ - [Trajectory LLM-as-judge](#trajectory-llm-as-judge)
86
90
  - [Graph Trajectory](#graph-trajectory)
87
91
  - [Graph trajectory LLM-as-judge](#graph-trajectory-llm-as-judge)
88
92
  - [Graph trajectory strict match](#graph-trajectory-strict-match)
93
+ - [Python Async Support](#python-async-support)
89
94
  - [LangSmith Integration](#langsmith-integration)
90
95
  - [Pytest or Vitest/Jest](#pytest-or-vitestjest)
91
96
  - [Evaluate](#evaluate)
92
97
 
93
- ## Installation
98
+ # Installation
94
99
 
95
100
  You can install `agentevals` like this:
96
101
 
@@ -107,124 +112,65 @@ npm install openai
107
112
  It is also helpful to be familiar with some [evaluation concepts](https://docs.smith.langchain.com/evaluation/concepts) and
108
113
  LangSmith's pytest integration for running evals, which is documented [here](https://docs.smith.langchain.com/evaluation/how_to_guides/pytest).
109
114
 
110
- ## Evaluators
115
+ # Evaluators
111
116
 
112
- ### Agent trajectory
117
+ ## Agent trajectory match
113
118
 
114
- Agent trajectory evaluators are used to judge the trajectory of an agent's execution either against an expected trajectory or using an LLM.
119
+ Agent trajectory match evaluators are used to judge the trajectory of an agent's execution either against an expected trajectory or using an LLM.
115
120
  These evaluators expect you to format your agent's trajectory as a list of OpenAI format dicts or as a list of LangChain `BaseMessage` classes, and handle message formatting
116
121
  under the hood.
117
122
 
118
- AgentEvals offers the `create_trajectory_match_evaluator`/`createTrajectoryMatchEvaluator` and `create_async_trajectory_match_evaluator` methods for this task.
119
-
120
- #### Checking tool call equality
121
-
122
- When checking equality between tool calls, these matchers will require that all tool call arguments are the same. You can configure this behavior to ignore tool call arguments by setting `tool_args_match_mode="ignore"` (Python) or `toolArgsMatchMode: "ignore"` (JS), or by only checking specific properties within the call using the `tool_args_match_overrides`/`toolArgsMatchOverrides` param.
123
-
124
- `tool_args_match_overrides`/`toolArgsMatchOverrides` takes a dictionary whose keys are tool names and whose values are either `"exact"`, `"ignore"`, a list of fields within the tool call that must match exactly, or a comparator function that takes two arguments and returns whether they are equal:
125
-
126
- ```python
127
- ToolArgsMatchMode = Literal["exact", "ignore"]
128
-
129
- ToolArgsMatchOverrides = dict[str, Union[ToolArgsMatchMode, list[str], Callable[[dict, dict], bool]]]
130
- ```
123
+ AgentEvals offers the `create_trajectory_match_evaluator`/`createTrajectoryMatchEvaluator` and `create_async_trajectory_match_evaluator` methods for this task. You can customize their behavior in a few ways:
131
124
 
132
- Here's an example that allows case insensitivity for the arguments to a tool named `get_weather`:
125
+ - Setting `trajectory_match_mode`/`trajectoryMatchMode` to [`strict`](#strict-match), [`unordered`](#unordered-match), [`subset`](#subset-and-superset-match), or [`superset`](#subset-and-superset-match) to provide the general strategy the evaluator will use to compare trajectories
126
+ - Setting [`tool_args_match_mode`](#tool-args-match-modes) and/or [`tool_args_match_overrides`](#tool-args-match-modes) to customize how the evaluator considers equality between tool calls in the actual trajectory vs. the reference. By default, only tool calls with the same arguments to the same tool are considered equal.
133
127
 
134
- ```ts
135
- import { createTrajectoryMatchEvaluator } from "agentevals";
136
-
137
- const outputs = [
138
- { role: "user", content: "What is the weather in SF?" },
139
- {
140
- role: "assistant",
141
- tool_calls: [{
142
- function: {
143
- name: "get_weather",
144
- arguments: JSON.stringify({ city: "san francisco" })
145
- },
146
- }]
147
- },
148
- { role: "tool", content: "It's 80 degrees and sunny in SF." },
149
- { role: "assistant", content: "The weather in SF is 80 degrees and sunny." },
150
- ];
151
-
152
- const referenceOutputs = [
153
- { role: "user", content: "What is the weather in San Francisco?" },
154
- {
155
- role: "assistant",
156
- tool_calls: [{
157
- function: {
158
- name: "get_weather",
159
- arguments: JSON.stringify({ city: "San Francisco" })
160
- }
161
- }]
162
- },
163
- { role: "tool", content: "It's 80 degrees and sunny in San Francisco." },
164
- ];
165
-
166
- const evaluator = createTrajectoryMatchEvaluator({
167
- trajectoryMatchMode: "strict",
168
- toolArgsMatchMode: "exact", // Default value
169
- toolArgsMatchOverrides: {
170
- get_weather: (x, y) => {
171
- return typeof x.city === "string" &&
172
- typeof y.city === "string" &&
173
- x.city.toLowerCase() === y.city.toLowerCase();
174
- },
175
- }
176
- });
177
-
178
- const result = await evaluator({
179
- outputs,
180
- referenceOutputs,
181
- });
182
-
183
- console.log(result);
184
- ```
185
-
186
- ```
187
- {
188
- 'key': 'trajectory_strict_match',
189
- 'score': true,
190
- }
191
- ```
192
-
193
- This flexibility allows you to handle cases where you want looser equality for LLM generated arguments (`"san francisco"` to equal `"San Francisco"`) for only specific tool calls.
194
-
195
- #### Strict match
128
+ ### Strict match
196
129
 
197
130
  The `"strict"` `trajectory_match_mode` compares two trajectories and ensures that they contain the same messages
198
131
  in the same order with the same tool calls. Note that it does allow for differences in message content:
199
132
 
200
133
  ```ts
201
- import { createTrajectoryMatchEvaluator } from "agentevals";
134
+ import {
135
+ createTrajectoryMatchEvaluator,
136
+ type FlexibleChatCompletionMessage,
137
+ } from "agentevals";
202
138
 
203
139
  const outputs = [
204
- { role: "user", content: "What is the weather in SF?" },
205
- {
206
- role: "assistant",
207
- tool_calls: [{
208
- function: {
209
- name: "get_weather",
210
- arguments: JSON.stringify({ city: "San Francisco" })
211
- },
212
- }, {
213
- function: {
214
- name: "accuweather_forecast",
215
- arguments: JSON.stringify({"city": "San Francisco"}),
216
- },
217
- }]
218
- },
219
- { role: "tool", content: "It's 80 degrees and sunny in SF." },
220
- { role: "assistant", content: "The weather in SF is 80 degrees and sunny." },
221
- ];
140
+ { role: "user", content: "What is the weather in SF?" },
141
+ {
142
+ role: "assistant",
143
+ content: "",
144
+ tool_calls: [{
145
+ function: {
146
+ name: "get_weather",
147
+ arguments: JSON.stringify({ city: "San Francisco" })
148
+ },
149
+ }, {
150
+ function: {
151
+ name: "accuweather_forecast",
152
+ arguments: JSON.stringify({"city": "San Francisco"}),
153
+ },
154
+ }]
155
+ },
156
+ { role: "tool", content: "It's 80 degrees and sunny in SF." },
157
+ { role: "assistant", content: "The weather in SF is 80 degrees and sunny." },
158
+ ] satisfies FlexibleChatCompletionMessage[];
222
159
 
223
160
  const referenceOutputs = [
224
- { role: "user", content: "What is the weather in San Francisco?" },
225
- { role: "assistant", tool_calls: [{ function: { name: "get_weather", arguments: JSON.stringify({ city: "San Francisco" }) } }] },
226
- { role: "tool", content: "It's 80 degrees and sunny in San Francisco." },
227
- ];
161
+ { role: "user", content: "What is the weather in San Francisco?" },
162
+ {
163
+ role: "assistant",
164
+ content: "",
165
+ tool_calls: [{
166
+ function: {
167
+ name: "get_weather",
168
+ arguments: JSON.stringify({ city: "San Francisco" })
169
+ }
170
+ }]
171
+ },
172
+ { role: "tool", content: "It's 80 degrees and sunny in San Francisco." },
173
+ ] satisfies FlexibleChatCompletionMessage[];
228
174
 
229
175
  const evaluator = createTrajectoryMatchEvaluator({
230
176
  trajectoryMatchMode: "strict",
@@ -247,19 +193,23 @@ console.log(result);
247
193
 
248
194
  `"strict"` is useful if you want to ensure that tools are always called in the same order for a given query (e.g. a company policy lookup tool before a tool that requests vacation time for an employee).
249
195
 
250
- **Note:** If you would like to configure the way this evaluator checks for tool call equality, see [this section](#checking-tool-call-equality).
196
+ **Note:** If you would like to configure the way this evaluator checks for tool call equality, see [this section](#tool-args-match-modes).
251
197
 
252
- #### Unordered match
198
+ ### Unordered match
253
199
 
254
200
  The `"unordered"` `trajectory_match_mode` compares two trajectories and ensures that they contain the same tool calls in any order. This is useful if you want to allow flexibility in how an agent obtains the proper information, but still do care that all information was retrieved.
255
201
 
256
202
  ```ts
257
- import { createTrajectoryMatchEvaluator } from "agentevals";
203
+ import {
204
+ createTrajectoryMatchEvaluator,
205
+ type FlexibleChatCompletionMessage,
206
+ } from "agentevals";
258
207
 
259
208
  const outputs = [
260
209
  { role: "user", content: "What is the weather in SF and is there anything fun happening?" },
261
210
  {
262
211
  role: "assistant",
212
+ content: "",
263
213
  tool_calls: [{
264
214
  function: {
265
215
  name: "get_weather",
@@ -270,6 +220,7 @@ const outputs = [
270
220
  { role: "tool", content: "It's 80 degrees and sunny in SF." },
271
221
  {
272
222
  role: "assistant",
223
+ content: "",
273
224
  tool_calls: [{
274
225
  function: {
275
226
  name: "get_fun_activities",
@@ -279,12 +230,13 @@ const outputs = [
279
230
  },
280
231
  { role: "tool", content: "Nothing fun is happening, you should stay indoors and read!" },
281
232
  { role: "assistant", content: "The weather in SF is 80 degrees and sunny, but there is nothing fun happening." },
282
- ];
233
+ ] satisfies FlexibleChatCompletionMessage[];
283
234
 
284
235
  const referenceOutputs = [
285
236
  { role: "user", content: "What is the weather in SF and is there anything fun happening?" },
286
237
  {
287
238
  role: "assistant",
239
+ content: "",
288
240
  tool_calls: [
289
241
  {
290
242
  function: {
@@ -303,7 +255,7 @@ const referenceOutputs = [
303
255
  { role: "tool", content: "Nothing fun is happening, you should stay indoors and read!" },
304
256
  { role: "tool", content: "It's 80 degrees and sunny in SF." },
305
257
  { role: "assistant", content: "In SF, it's 80˚ and sunny, but there is nothing fun happening." },
306
- ];
258
+ ] satisfies FlexibleChatCompletionMessage[];
307
259
 
308
260
  const evaluator = createTrajectoryMatchEvaluator({
309
261
  trajectoryMatchMode: "unordered",
@@ -326,19 +278,23 @@ console.log(result)
326
278
 
327
279
  `"unordered"` is useful if you want to ensure that specific tools are called at some point in the trajectory, but you don't necessarily need them to be in message order (e.g. the agent called a company policy retrieval tool at an arbitrary point in an interaction before authorizing spend for a pizza party).
328
280
 
329
- **Note:** If you would like to configure the way this evaluator checks for tool call equality, see [this section](#checking-tool-call-equality).
281
+ **Note:** If you would like to configure the way this evaluator checks for tool call equality, see [this section](#tool-args-match-modes).
330
282
 
331
- #### Subset and superset match
283
+ ### Subset and superset match
332
284
 
333
285
  The `"subset"` and `"superset"` modes match partial trajectories (ensuring that a trajectory contains a subset/superset of tool calls contained in a reference trajectory).
334
286
 
335
287
  ```ts
336
- import { createTrajectoryMatchEvaluator } from "agentevals";
288
+ import {
289
+ createTrajectoryMatchEvaluator,
290
+ type FlexibleChatCompletionMessage
291
+ } from "agentevals";
337
292
 
338
293
  const outputs = [
339
294
  { role: "user", content: "What is the weather in SF and London?" },
340
295
  {
341
296
  role: "assistant",
297
+ content: "",
342
298
  tool_calls: [{
343
299
  function: {
344
300
  name: "get_weather",
@@ -354,12 +310,13 @@ const outputs = [
354
310
  { role: "tool", content: "It's 80 degrees and sunny in SF, and 90 degrees and rainy in London." },
355
311
  { role: "tool", content: "Unknown." },
356
312
  { role: "assistant", content: "The weather in SF is 80 degrees and sunny. In London, it's 90 degrees and rainy."},
357
- ];
313
+ ] satisfies FlexibleChatCompletionMessage[];
358
314
 
359
315
  const referenceOutputs = [
360
316
  { role: "user", content: "What is the weather in SF and London?" },
361
317
  {
362
318
  role: "assistant",
319
+ content: "",
363
320
  tool_calls: [
364
321
  {
365
322
  function: {
@@ -371,7 +328,7 @@ const referenceOutputs = [
371
328
  },
372
329
  { role: "tool", content: "It's 80 degrees and sunny in San Francisco, and 90 degrees and rainy in London." },
373
330
  { role: "assistant", content: "The weather in SF is 80˚ and sunny. In London, it's 90˚ and rainy." },
374
- ];
331
+ ] satisfies FlexibleChatCompletionMessage[];
375
332
 
376
333
  const evaluator = createTrajectoryMatchEvaluator({
377
334
  trajectoryMatchMode: "superset", // or "subset"
@@ -394,18 +351,148 @@ console.log(result)
394
351
 
395
352
  `"superset"` is useful if you want to ensure that some key tools were called at some point in the trajectory, but an agent calling extra tools is still acceptable. `"subset"` is the inverse and is useful if you want to ensure that the agent did not call any tools beyond the expected ones.
396
353
 
397
- **Note:** If you would like to configure the way this evaluator checks for tool call equality, see [this section](#checking-tool-call-equality).
354
+ **Note:** If you would like to configure the way this evaluator checks for tool call equality, see [this section](#tool-args-match-modes).
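
The four trajectory match modes described above can be pictured as comparisons over the tool names each trajectory calls. The following is a minimal sketch, not the library's implementation: `toolNamesMatch`, `isSubMultiset`, and `countNames` are hypothetical helpers, and the sketch deliberately ignores tool arguments and message roles.

```ts
type MatchMode = "strict" | "unordered" | "subset" | "superset";

// Count how many times each tool name appears.
function countNames(names: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const name of names) {
    counts[name] = (counts[name] || 0) + 1;
  }
  return counts;
}

// True when every name in `a` appears at least as often in `b`.
function isSubMultiset(a: string[], b: string[]): boolean {
  const aCounts = countNames(a);
  const bCounts = countNames(b);
  return Object.keys(aCounts).every(
    (name) => aCounts[name] <= (bCounts[name] || 0)
  );
}

// Hypothetical comparison of tool-call names under each mode.
function toolNamesMatch(
  actual: string[],
  reference: string[],
  mode: MatchMode
): boolean {
  switch (mode) {
    case "strict": // same calls in the same order
      return (
        actual.length === reference.length &&
        actual.every((name, i) => name === reference[i])
      );
    case "unordered": // same calls in any order
      return isSubMultiset(actual, reference) && isSubMultiset(reference, actual);
    case "subset": // the agent made no calls beyond the reference
      return isSubMultiset(actual, reference);
    case "superset": // the agent made at least the reference calls
      return isSubMultiset(reference, actual);
  }
}
```

For example, under these assumptions an agent that calls `get_weather` and then an extra `accuweather_forecast` passes `"superset"` against a reference that only calls `get_weather`, but fails `"strict"`, `"unordered"`, and `"subset"`.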
355
+
356
+ ### Tool args match modes
357
+
358
+ When checking equality between tool calls, the above evaluators will require that all tool call arguments are the exact same by default. You can configure this behavior in the following ways:
359
+
360
+ - Treating any two tool calls for the same tool as equivalent by setting `tool_args_match_mode="ignore"` (Python) or `toolArgsMatchMode: "ignore"` (TypeScript)
361
+ - Treating a tool call as equivalent if it contains a subset/superset of args compared to a reference tool call of the same name with `tool_args_match_mode="subset"/"superset"` (Python) or `toolArgsMatchMode: "subset"/"superset"` (TypeScript)
362
+ - Setting custom matchers for all calls of a given tool using the `tool_args_match_overrides` (Python) or `toolArgsMatchOverrides` (TypeScript) param
363
+
364
+ You can set both of these parameters at the same time. `tool_args_match_overrides` will take precedence over `tool_args_match_mode`.
365
+
366
+ `tool_args_match_overrides`/`toolArgsMatchOverrides` takes a dictionary whose keys are tool names and whose values are either `"exact"`, `"ignore"`, a list of fields within the tool call that must match exactly, or a comparator function that takes two arguments and returns whether they are equal:
367
+
368
+ ```python
369
+ ToolArgsMatchMode = Literal["exact", "ignore", "subset", "superset"]
398
370
 
399
- #### Trajectory LLM-as-judge
371
+ ToolArgsMatchOverrides = dict[str, Union[ToolArgsMatchMode, list[str], Callable[[dict, dict], bool]]]
372
+ ```
400
373
 
401
- The LLM-as-judge trajectory evaluator that uses an LLM to evaluate the trajectory. Unlike the other trajectory evaluators, it doesn't require a reference trajectory,
402
- and supports
403
- This allows for more flexibility in the trajectory comparison:
374
+ Here's an example that allows case insensitivity for the arguments to a tool named `get_weather`:
375
+
376
+ ```ts
377
+ import {
378
+ createTrajectoryMatchEvaluator,
379
+ type FlexibleChatCompletionMessage,
380
+ } from "agentevals";
381
+
382
+ const outputs = [
383
+ { role: "user", content: "What is the weather in SF?" },
384
+ {
385
+ role: "assistant",
386
+ content: "",
387
+ tool_calls: [{
388
+ function: {
389
+ name: "get_weather",
390
+ arguments: JSON.stringify({ city: "san francisco" })
391
+ },
392
+ }]
393
+ },
394
+ { role: "tool", content: "It's 80 degrees and sunny in SF." },
395
+ { role: "assistant", content: "The weather in SF is 80 degrees and sunny." },
396
+ ] satisfies FlexibleChatCompletionMessage[];
397
+
398
+ const referenceOutputs = [
399
+ { role: "user", content: "What is the weather in San Francisco?" },
400
+ {
401
+ role: "assistant",
402
+ content: "",
403
+ tool_calls: [{
404
+ function: {
405
+ name: "get_weather",
406
+ arguments: JSON.stringify({ city: "San Francisco" })
407
+ }
408
+ }]
409
+ },
410
+ { role: "tool", content: "It's 80 degrees and sunny in San Francisco." },
411
+ ] satisfies FlexibleChatCompletionMessage[];
412
+
413
+ const evaluator = createTrajectoryMatchEvaluator({
414
+ trajectoryMatchMode: "strict",
415
+ toolArgsMatchMode: "exact", // Default value
416
+ toolArgsMatchOverrides: {
417
+ get_weather: (x, y) => {
418
+ return typeof x.city === "string" &&
419
+ typeof y.city === "string" &&
420
+ x.city.toLowerCase() === y.city.toLowerCase();
421
+ },
422
+ }
423
+ });
424
+
425
+ const result = await evaluator({
426
+ outputs,
427
+ referenceOutputs,
428
+ });
429
+
430
+ console.log(result);
431
+ ```
432
+
433
+ ```
434
+ {
435
+ 'key': 'trajectory_strict_match',
436
+ 'score': true,
437
+ }
438
+ ```
439
+
440
+ This flexibility allows you to handle cases where you want looser equality for LLM generated arguments (`"san francisco"` to equal `"San Francisco"`) for only specific tool calls.
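
To make the interaction between the match mode and per-tool overrides concrete, here is a minimal sketch of how such a resolution could work. It is an illustration under stated assumptions, not the package's code: `toolCallArgsMatch`, `argsMatch`, and `isArgsSubset` are hypothetical names, values are compared shallowly via `JSON.stringify`, and `"subset"` is read as "the actual call's args are a subset of the reference call's args".

```ts
type ArgsMatchMode = "exact" | "ignore" | "subset" | "superset";
type Args = Record<string, unknown>;
type Comparator = (a: Args, b: Args) => boolean;
type Override = ArgsMatchMode | string[] | Comparator;

// True when every key in `a` exists in `b` with an equal (serialized) value.
function isArgsSubset(a: Args, b: Args): boolean {
  return Object.keys(a).every(
    (key) => key in b && JSON.stringify(a[key]) === JSON.stringify(b[key])
  );
}

function argsMatch(actual: Args, reference: Args, mode: ArgsMatchMode): boolean {
  switch (mode) {
    case "ignore":
      return true;
    case "subset":
      return isArgsSubset(actual, reference);
    case "superset":
      return isArgsSubset(reference, actual);
    case "exact":
      return isArgsSubset(actual, reference) && isArgsSubset(reference, actual);
  }
}

// Hypothetical resolver: an override for the tool wins over the global mode.
function toolCallArgsMatch(
  toolName: string,
  actual: Args,
  reference: Args,
  mode: ArgsMatchMode,
  overrides: Record<string, Override> = {}
): boolean {
  const rule: Override = toolName in overrides ? overrides[toolName] : mode;
  if (typeof rule === "function") {
    return rule(actual, reference); // custom comparator
  }
  if (Array.isArray(rule)) {
    // Only the listed fields must match exactly.
    return rule.every(
      (field) => JSON.stringify(actual[field]) === JSON.stringify(reference[field])
    );
  }
  return argsMatch(actual, reference, rule);
}
```

In this sketch, `toolCallArgsMatch("get_weather", { city: "SF" }, { city: "SF", units: "F" }, "exact", { get_weather: ["city"] })` returns `true`, because the override restricts the comparison to `city` even though the global mode is `"exact"`.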
441
+
442
+ ## Trajectory LLM-as-judge
443
+
444
+ The LLM-as-judge trajectory evaluator uses an LLM to evaluate the trajectory. Unlike the trajectory match evaluators, it doesn't require a reference trajectory. Here's an example:
404
445
 
405
446
  ```ts
406
447
  import {
407
448
  createTrajectoryLLMAsJudge,
408
- TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE
449
+ TRAJECTORY_ACCURACY_PROMPT,
450
+ type FlexibleChatCompletionMessage,
451
+ } from "agentevals";
452
+
453
+ const evaluator = createTrajectoryLLMAsJudge({
454
+ prompt: TRAJECTORY_ACCURACY_PROMPT,
455
+ model: "openai:o3-mini",
456
+ });
457
+
458
+ const outputs = [
459
+ {role: "user", content: "What is the weather in SF?"},
460
+ {
461
+ role: "assistant",
462
+ content: "",
463
+ tool_calls: [
464
+ {
465
+ function: {
466
+ name: "get_weather",
467
+ arguments: JSON.stringify({ city: "SF" }),
468
+ }
469
+ }
470
+ ],
471
+ },
472
+ {role: "tool", content: "It's 80 degrees and sunny in SF."},
473
+ {role: "assistant", content: "The weather in SF is 80 degrees and sunny."},
474
+ ] satisfies FlexibleChatCompletionMessage[];
475
+
476
+ const result = await evaluator({ outputs });
477
+
478
+ console.log(result)
479
+ ```
480
+
481
+ ```
482
+ {
483
+ 'key': 'trajectory_accuracy',
484
+ 'score': true,
485
+ 'comment': 'The provided agent trajectory is reasonable...'
486
+ }
487
+ ```
488
+
489
+ If you have a reference trajectory, you can add an extra variable to your prompt and pass in the reference trajectory. Below, we use the prebuilt `TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE` prompt, which contains a `reference_outputs` variable:
490
+
491
+ ```ts
492
+ import {
493
+ createTrajectoryLLMAsJudge,
494
+ TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
495
+ type FlexibleChatCompletionMessage,
409
496
  } from "agentevals";
410
497
 
411
498
  const evaluator = createTrajectoryLLMAsJudge({
@@ -417,6 +504,7 @@ const outputs = [
417
504
  {role: "user", content: "What is the weather in SF?"},
418
505
  {
419
506
  role: "assistant",
507
+ content: "",
420
508
  tool_calls: [
421
509
  {
422
510
  function: {
@@ -428,11 +516,13 @@ const outputs = [
428
516
  },
429
517
  {role: "tool", content: "It's 80 degrees and sunny in SF."},
430
518
  {role: "assistant", content: "The weather in SF is 80 degrees and sunny."},
431
- ]
519
+ ] satisfies FlexibleChatCompletionMessage[];
520
+
432
521
  const referenceOutputs = [
433
522
  {role: "user", content: "What is the weather in SF?"},
434
523
  {
435
524
  role: "assistant",
525
+ content: "",
436
526
  tool_calls: [
437
527
  {
438
528
  function: {
@@ -444,7 +534,7 @@ const referenceOutputs = [
444
534
  },
445
535
  {role: "tool", content: "It's 80 degrees and sunny in San Francisco."},
446
536
  {role: "assistant", content: "The weather in SF is 80˚ and sunny."},
447
- ]
537
+ ] satisfies FlexibleChatCompletionMessage[];
448
538
 
449
539
  const result = await evaluator({
450
540
  outputs,
@@ -484,7 +574,7 @@ const fewShotExamples = [
484
574
 
485
575
  See the [`openevals`](https://github.com/langchain-ai/openevals?tab=readme-ov-file#llm-as-judge) repo for a fully up to date list of parameters.
486
576
 
487
- ### Graph trajectory
577
+ ## Graph trajectory
488
578
 
489
579
  For frameworks like [LangGraph](https://github.com/langchain-ai/langgraph) that model agents as graphs, it can be more convenient to represent trajectories in terms of nodes visited rather than messages. `agentevals` includes a category of evaluators called **graph trajectory** evaluators that are designed to work with this format, as well as convenient utilities for extracting trajectories from a LangGraph thread, including different conversation turns and interrupts.
490
580
 
@@ -509,7 +599,7 @@ const evaluator: ({ inputs, outputs, referenceOutputs, ...extra }: {
509
599
 
510
600
  Where `inputs` is a list of inputs (or a dict with a key named `"inputs"`) to the graph whose items each represent the start of a new invocation in a thread, `results` representing the final output from each turn in the thread, and `steps` representing the internal steps taken for each turn.
511
601
 
512
- #### Graph trajectory LLM-as-judge
602
+ ### Graph trajectory LLM-as-judge
513
603
 
514
604
  This evaluator is similar to the `trajectory_llm_as_judge` evaluator, but it works with graph trajectories instead of message trajectories. Below, we set up a LangGraph agent, extract a trajectory from it using the built-in utils, and pass it to the evaluator. First, let's set up our graph, call it, and then extract the trajectory:
515
605
 
@@ -603,10 +693,10 @@ const graphTrajectoryEvaluator = createGraphTrajectoryLLMAsJudge({
603
693
  model: "openai:o3-mini",
604
694
  })
605
695
 
606
- const res = await graphTrajectoryEvaluator(
607
- inputs=extractedTrajectory.inputs,
608
- outputs=extractedTrajectory.outputs,
609
- )
696
+ const res = await graphTrajectoryEvaluator({
697
+ inputs: extractedTrajectory.inputs,
698
+ outputs: extractedTrajectory.outputs,
699
+ });
610
700
 
611
701
  console.log(res);
612
702
  ```
@@ -650,17 +740,17 @@ const graphTrajectoryEvaluator = createGraphTrajectoryLLMAsJudge({
650
740
  prompt: CUSTOM_PROMPT,
651
741
  model: "openai:o3-mini",
652
742
  })
653
- res = await graphTrajectoryEvaluator(
743
+ const res = await graphTrajectoryEvaluator({
654
744
  inputs: extractedTrajectory.inputs,
655
745
  outputs: extractedTrajectory.outputs,
656
- )
746
+ });
657
747
  ```
658
748
 
659
749
  In order to format them properly into the prompt, `reference_outputs` should be passed in as a `GraphTrajectory` object like `outputs`.
660
750
 
661
- Also note that like other LLM-as-judge evaluators, you can pass extra kwargs into the evaluator to format them into the prompt.
751
+ Also note that like other LLM-as-judge evaluators, you can pass extra params into the evaluator to format them into the prompt.
662
752
 
663
- #### Graph trajectory strict match
753
+ ### Graph trajectory strict match
664
754
 
665
755
  The `graph_trajectory_strict_match` evaluator is a simple evaluator that checks if the steps in the provided graph trajectory match the reference trajectory exactly.
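
As a rough picture of what an exact comparison of steps involves, the sketch below compares two trajectories turn by turn. It assumes `steps` holds one list of step names per turn; `GraphTrajectorySketch` and `stepsMatchExactly` are hypothetical names for illustration, not the package's exported `GraphTrajectory` type or its evaluator.

```ts
// Hypothetical shape for illustration only.
type GraphTrajectorySketch = {
  inputs?: unknown[];
  results: unknown[];
  steps: string[][]; // assumed: one list of visited step names per turn
};

// True when both trajectories have the same turns with the same steps in order.
function stepsMatchExactly(
  outputs: GraphTrajectorySketch,
  referenceOutputs: GraphTrajectorySketch
): boolean {
  const actual = outputs.steps;
  const expected = referenceOutputs.steps;
  return (
    actual.length === expected.length &&
    actual.every(
      (turn, i) =>
        turn.length === expected[i].length &&
        turn.every((step, j) => step === expected[i][j])
    )
  );
}
```

Under this assumption, only `steps` is compared, so two trajectories with different `results` but identical steps would still match.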
666
756
 
@@ -732,18 +822,47 @@ console.log(result);
732
822
  }
733
823
  ```
734
824
 
735
- ## LangSmith Integration
825
+ # Python Async Support
826
+
827
+ All `agentevals` evaluators support Python [asyncio](https://docs.python.org/3/library/asyncio.html). As a convention, evaluators that use a factory function will have `async` put immediately after `create_` in the function name (for example, `create_async_trajectory_llm_as_judge`), and evaluators used directly will end in `async` (e.g. `trajectory_strict_match_async`).
828
+
829
+ Here's an example of how to use the `create_async_trajectory_llm_as_judge` evaluator asynchronously:
830
+
831
+ ```python
832
+ from agentevals.trajectory.llm import create_async_trajectory_llm_as_judge
833
+
834
+ evaluator = create_async_trajectory_llm_as_judge(
835
+ prompt="What is the weather in {inputs}?",
836
+ )
837
+
838
+ result = await evaluator(inputs="San Francisco")
839
+ ```
840
+
841
+ If you are using the OpenAI client directly, remember to pass in `AsyncOpenAI` as the `judge` parameter:
842
+
843
+ ```python
844
+ from openai import AsyncOpenAI
845
+
846
+ evaluator = create_async_trajectory_llm_as_judge(
847
+ prompt="What is the weather in {inputs}?",
848
+ judge=AsyncOpenAI(),
849
+ model="o3-mini",
850
+ )
851
+
852
+ result = await evaluator(inputs="San Francisco")
853
+ ```
854
+
855
+ # LangSmith Integration
736
856
 
737
857
  For tracking experiments over time, you can log evaluator results to [LangSmith](https://smith.langchain.com/), a platform for building production-grade LLM applications that includes tracing, evaluation, and experimentation tools.
738
858
 
739
859
  LangSmith currently offers two ways to run evals: a [pytest](https://docs.smith.langchain.com/evaluation/how_to_guides/pytest) (Python) or [Vitest/Jest](https://docs.smith.langchain.com/evaluation/how_to_guides/vitest_jest) integration and the `evaluate` function. We'll give a quick example of how to run evals using both.
740
860
 
741
- ### Pytest or Vitest/Jest
861
+ ## Pytest or Vitest/Jest
742
862
 
743
863
  First, follow [these instructions](https://docs.smith.langchain.com/evaluation/how_to_guides/pytest) to set up LangSmith's pytest runner, or these to set up [Vitest or Jest](https://docs.smith.langchain.com/evaluation/how_to_guides/vitest_jest),
744
864
  setting appropriate environment variables:
745
865
 
746
-
747
866
  ```bash
748
867
  export LANGSMITH_API_KEY="your_langsmith_api_key"
749
868
  export LANGSMITH_TRACING="true"
@@ -776,6 +895,7 @@ ls.describe("trajectory accuracy", () => {
776
895
  {"role": "user", "content": "What is the weather in SF?"},
777
896
  {
778
897
  "role": "assistant",
898
+ "content": "",
779
899
  "tool_calls": [
780
900
  {
781
901
  "function": {
@@ -794,6 +914,7 @@ ls.describe("trajectory accuracy", () => {
794
914
  {"role": "user", "content": "What is the weather in SF?"},
795
915
  {
796
916
  "role": "assistant",
917
+ "content": "",
797
918
  "tool_calls": [
798
919
  {
799
920
  "function": {
@@ -831,7 +952,7 @@ And you should also see the results in the experiment view in LangSmith:
831
952
 
832
953
  ![LangSmith results](/static/img/langsmith_results.png)
833
954
 
834
- ### Evaluate
955
+ ## Evaluate
835
956
 
836
957
  Alternatively, you can [create a dataset in LangSmith](https://docs.smith.langchain.com/evaluation/concepts#dataset-curation) and use your created evaluators with LangSmith's [`evaluate`](https://docs.smith.langchain.com/evaluation#8-run-and-view-results) function:
837
958
 
@@ -846,20 +967,21 @@ const trajectoryEvaluator = createTrajectoryLLMAsJudge({
846
967
 
847
968
  await evaluate(
848
969
  (inputs) => [
849
- {role: "user", content: "What is the weather in SF?"},
850
- {
851
- role: "assistant",
852
- tool_calls: [
853
- {
854
- function: {
855
- name: "get_weather",
856
- arguments: json.dumps({"city": "SF"}),
857
- }
858
- }
859
- ],
860
- },
861
- {role: "tool", content: "It's 80 degrees and sunny in SF."},
862
- {role: "assistant", content: "The weather in SF is 80 degrees and sunny."},
970
+ {role: "user", content: "What is the weather in SF?"},
971
+ {
972
+ role: "assistant",
973
+ content: "",
974
+ tool_calls: [
975
+ {
976
+ function: {
977
+ name: "get_weather",
978
+ arguments: JSON.stringify({"city": "SF"}),
979
+ }
980
+ }
981
+ ],
982
+ },
983
+ {role: "tool", content: "It's 80 degrees and sunny in SF."},
984
+ {role: "assistant", content: "The weather in SF is 80 degrees and sunny."},
863
985
  ],
864
986
  {
865
987
  data: datasetName,
@@ -868,7 +990,7 @@ await evaluate(
868
990
  );
869
991
  ```
870
992
 
871
- ## Thank you!
993
+ # Thank you!
872
994
 
873
995
  We hope that `agentevals` helps make evaluating your LLM agents easier!
874
996
 
@@ -27,4 +27,4 @@ export declare const createGraphTrajectoryLLMAsJudge: ({ prompt, model, feedback
27
27
  };
28
28
  outputs: GraphTrajectory;
29
29
  referenceOutputs?: GraphTrajectory | undefined;
30
- }) => Promise<import("langsmith/vitest").SimpleEvaluationResult>;
30
+ }) => Promise<import("../types.js").EvaluatorResult>;
@@ -11,4 +11,4 @@ import { GraphTrajectory } from "../types.js";
11
11
  export declare const graphTrajectoryStrictMatch: ({ outputs, referenceOutputs, }: {
12
12
  outputs: GraphTrajectory;
13
13
  referenceOutputs: GraphTrajectory;
14
- }) => Promise<import("langsmith/vitest").SimpleEvaluationResult>;
14
+ }) => Promise<import("../types.js").EvaluatorResult>;
@@ -56,7 +56,14 @@ const extractLangGraphTrajectoryFromSnapshots = (snapshots) => {
  }
  if (isAccumulatingSteps) {
  if (snapshot.metadata != null && snapshot.metadata.source === "input") {
- inputs.push(snapshot.metadata.writes);
+ if ("writes" in snapshot.metadata &&
+ snapshot.metadata.writes != null &&
+ typeof snapshot.metadata.writes === "object") {
+ inputs.push(snapshot.metadata.writes);
+ }
+ else {
+ inputs.push(...snapshot.tasks.map((task) => ({ [task.name]: task.result })));
+ }
  }
  else if (i + 1 < snapshots.length &&
  snapshots[i + 1].tasks?.find((task) => task.interrupts?.length > 0)) {
@@ -2,11 +2,11 @@ import type { StateSnapshot, Pregel } from "@langchain/langgraph/web";
  import type { RunnableConfig } from "@langchain/core/runnables";
  import type { GraphTrajectory } from "../types.js";
  export declare const extractLangGraphTrajectoryFromSnapshots: (snapshots: StateSnapshot[]) => {
- inputs: (string | Record<string, unknown> | null)[];
+ inputs: (string | Record<string, unknown>)[];
  outputs: GraphTrajectory;
  };
  export declare const _getLangGraphStateHistoryRecursive: (graph: Pregel<any, any>, config: RunnableConfig) => Promise<StateSnapshot[]>;
  export declare const extractLangGraphTrajectoryFromThread: (graph: Pregel<any, any>, config: RunnableConfig) => Promise<{
- inputs: (string | Record<string, unknown> | null)[];
+ inputs: (string | Record<string, unknown>)[];
  outputs: GraphTrajectory;
  }>;
@@ -53,7 +53,14 @@ export const extractLangGraphTrajectoryFromSnapshots = (snapshots) => {
  }
  if (isAccumulatingSteps) {
  if (snapshot.metadata != null && snapshot.metadata.source === "input") {
- inputs.push(snapshot.metadata.writes);
+ if ("writes" in snapshot.metadata &&
+ snapshot.metadata.writes != null &&
+ typeof snapshot.metadata.writes === "object") {
+ inputs.push(snapshot.metadata.writes);
+ }
+ else {
+ inputs.push(...snapshot.tasks.map((task) => ({ [task.name]: task.result })));
+ }
  }
  else if (i + 1 < snapshots.length &&
  snapshots[i + 1].tasks?.find((task) => task.interrupts?.length > 0)) {
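
The hunk above changes how input snapshots are collected: `metadata.writes` is used when it is present and is an object, and otherwise one `{ [task.name]: task.result }` entry is pushed per task. The following is a minimal sketch of that branch in isolation, not the package's implementation. `SnapshotLike`, `TaskLike`, and `collectInputs` are hypothetical names, and the snapshot shapes are pared-down assumptions rather than the real LangGraph `StateSnapshot` type.

```typescript
// Hypothetical, simplified stand-ins for the snapshot shapes used in the diff.
type TaskLike = { name: string; result?: unknown };
type SnapshotLike = {
  metadata?: { source?: string; writes?: unknown } | null;
  tasks: TaskLike[];
};

// Mirrors the new branch: prefer metadata.writes, else fall back to task results.
function collectInputs(snapshot: SnapshotLike): Record<string, unknown>[] {
  const metadata = snapshot.metadata;
  if (metadata == null || metadata.source !== "input") {
    return [];
  }
  if (
    "writes" in metadata &&
    metadata.writes != null &&
    typeof metadata.writes === "object"
  ) {
    return [metadata.writes as Record<string, unknown>];
  }
  // No usable writes: build one entry per task, keyed by the task name.
  return snapshot.tasks.map((task) => ({ [task.name]: task.result }));
}

const fromWrites = collectInputs({
  metadata: { source: "input", writes: { __start__: { messages: ["hi"] } } },
  tasks: [],
});

const fromTasks = collectInputs({
  metadata: { source: "input" },
  tasks: [{ name: "__start__", result: { messages: ["hi"] } }],
});

console.log(fromWrites, fromTasks);
```

In this sketch both calls yield the same single input entry, which is the point of the fallback: the extracted trajectory keeps its shape whether or not the snapshot carries `writes`.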
@@ -1,5 +1,5 @@
  import { BaseMessage } from "@langchain/core/messages";
- import { ChatCompletionMessage, EvaluatorResult, TrajectoryLLMAsJudgeParams } from "../types.js";
+ import { ChatCompletionMessage, FlexibleChatCompletionMessage, EvaluatorResult, TrajectoryLLMAsJudgeParams } from "../types.js";
  export declare const TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE = "You are an expert data labeler.\nYour task is to grade the accuracy of an AI agent's internal trajectory.\n\n<Rubric>\n An accurate trajectory:\n - Makes logical sense between steps\n - Shows clear progression\n - Is relatively efficient, though it does not need to be perfectly efficient\n - Is semantically equivalent to the provided reference trajectory\n</Rubric>\n\nBased on the following reference trajectory:\n\n<reference_trajectory>\n{reference_outputs}\n</reference_trajectory>\n\nGrade this actual trajectory:\n\n<trajectory>\n{outputs}\n</trajectory>\n";
  export declare const TRAJECTORY_ACCURACY_PROMPT = "You are an expert data labeler.\nYour task is to grade the accuracy of an AI agent's internal trajectory.\n\n<Rubric>\n An accurate trajectory:\n - Makes logical sense between steps\n - Shows clear progression\n - Is relatively efficient, though it does not need to be perfectly efficient\n</Rubric>\n\nFirst, try to understand the goal of the trajectory by looking at the input\n(if the input is not present try to infer it from the content of the first message),\nas well as the output of the final message. Once you understand the goal, grade the trajectory\nas it relates to achieving that goal.\n\nGrade the following trajectory:\n\n<trajectory>\n{outputs}\n</trajectory>";
  /**
@@ -25,10 +25,10 @@ export declare const TRAJECTORY_ACCURACY_PROMPT = "You are an expert data labele
  */
  export declare const createTrajectoryLLMAsJudge: ({ prompt, feedbackKey, model, system, judge, continuous, choices, useReasoning, fewShotExamples, }: TrajectoryLLMAsJudgeParams) => ({ inputs, outputs, referenceOutputs, ...extra }: {
  [key: string]: unknown;
- outputs: ChatCompletionMessage[] | BaseMessage[] | {
- messages: (BaseMessage | ChatCompletionMessage)[];
+ outputs: ChatCompletionMessage[] | FlexibleChatCompletionMessage[] | BaseMessage[] | {
+ messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
  };
- referenceOutputs?: BaseMessage[] | ChatCompletionMessage[] | {
- messages: (BaseMessage | ChatCompletionMessage)[];
+ referenceOutputs?: ChatCompletionMessage[] | BaseMessage[] | FlexibleChatCompletionMessage[] | {
+ messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
  } | undefined;
  }) => Promise<EvaluatorResult>;
@@ -1,5 +1,5 @@
  import { BaseMessage } from "@langchain/core/messages";
- import { ChatCompletionMessage, ToolArgsMatchMode, ToolArgsMatchOverrides } from "../types.js";
+ import { ChatCompletionMessage, FlexibleChatCompletionMessage, ToolArgsMatchMode, ToolArgsMatchOverrides } from "../types.js";
  export type TrajectoryMatchMode = "strict" | "unordered" | "subset" | "superset";
  /**
  * Creates an evaluator that compares trajectories between model outputs and reference outputs.
@@ -52,10 +52,10 @@ export declare function createTrajectoryMatchEvaluator({ trajectoryMatchMode, to
  toolArgsMatchOverrides?: ToolArgsMatchOverrides;
  }): ({ outputs, referenceOutputs, ...extra }: {
  [key: string]: unknown;
- outputs: ChatCompletionMessage[] | BaseMessage[] | {
- messages: (BaseMessage | ChatCompletionMessage)[];
+ outputs: ChatCompletionMessage[] | FlexibleChatCompletionMessage[] | BaseMessage[] | {
+ messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
  };
- referenceOutputs: ChatCompletionMessage[] | BaseMessage[] | {
- messages: (BaseMessage | ChatCompletionMessage)[];
+ referenceOutputs: ChatCompletionMessage[] | FlexibleChatCompletionMessage[] | BaseMessage[] | {
+ messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
  };
- }) => Promise<import("langsmith/vitest").SimpleEvaluationResult>;
+ }) => Promise<import("../types.js").EvaluatorResult>;
@@ -5,8 +5,8 @@ const utils_js_1 = require("../utils.cjs");
  const utils_js_2 = require("./utils.cjs");
  async function _scorer(params) {
  const { outputs, referenceOutputs, toolArgsMatchMode, toolArgsMatchOverrides, } = params;
- const normalizedOutputs = (0, utils_js_1._normalizeToOpenAIMessagesList)(outputs);
- const normalizedReferenceOutputs = (0, utils_js_1._normalizeToOpenAIMessagesList)(referenceOutputs);
+ const normalizedOutputs = outputs;
+ const normalizedReferenceOutputs = referenceOutputs;
  if (!normalizedOutputs || !normalizedReferenceOutputs) {
  throw new Error("Strict trajectory match requires both outputs and reference_outputs");
  }
@@ -66,8 +66,11 @@ exports._scorer = _scorer;
  * @returns EvaluatorResult containing a score of true if trajectory (including called tools) matches, false otherwise
  */
  async function trajectoryStrictMatch(params) {
+ const normalizedOutputs = (0, utils_js_1._normalizeToOpenAIMessagesList)(params.outputs);
+ const normalizedReferenceOutputs = (0, utils_js_1._normalizeToOpenAIMessagesList)(params.referenceOutputs);
  return (0, utils_js_1._runEvaluator)("trajectory_strict_match", _scorer, "trajectory_strict_match", {
- ...params,
+ outputs: normalizedOutputs,
+ referenceOutputs: normalizedReferenceOutputs,
  toolArgsMatchMode: params.toolCallArgsExactMatch ? "exact" : "ignore",
  });
  }
@@ -1,12 +1,8 @@
  import { BaseMessage } from "@langchain/core/messages";
- import { ChatCompletionMessage, EvaluatorResult, ToolArgsMatchMode, ToolArgsMatchOverrides } from "../types.js";
+ import { ChatCompletionMessage, FlexibleChatCompletionMessage, EvaluatorResult, ToolArgsMatchMode, ToolArgsMatchOverrides } from "../types.js";
  export declare function _scorer(params: {
- outputs: ChatCompletionMessage[] | BaseMessage[] | {
- messages: (BaseMessage | ChatCompletionMessage)[];
- };
- referenceOutputs: ChatCompletionMessage[] | BaseMessage[] | {
- messages: (BaseMessage | ChatCompletionMessage)[];
- };
+ outputs: ChatCompletionMessage[];
+ referenceOutputs: ChatCompletionMessage[];
  toolArgsMatchMode: ToolArgsMatchMode;
  toolArgsMatchOverrides?: ToolArgsMatchOverrides;
  }): Promise<boolean>;
@@ -23,11 +19,11 @@ export declare function _scorer(params: {
  * @returns EvaluatorResult containing a score of true if trajectory (including called tools) matches, false otherwise
  */
  export declare function trajectoryStrictMatch(params: {
- outputs: ChatCompletionMessage[] | BaseMessage[] | {
- messages: (BaseMessage | ChatCompletionMessage)[];
+ outputs: ChatCompletionMessage[] | FlexibleChatCompletionMessage[] | BaseMessage[] | {
+ messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
  };
- referenceOutputs: ChatCompletionMessage[] | BaseMessage[] | {
- messages: (BaseMessage | ChatCompletionMessage)[];
+ referenceOutputs: ChatCompletionMessage[] | FlexibleChatCompletionMessage[] | BaseMessage[] | {
+ messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
  };
  toolCallArgsExactMatch: boolean;
  }): Promise<EvaluatorResult>;
@@ -2,8 +2,8 @@ import { _normalizeToOpenAIMessagesList, _runEvaluator } from "../utils.js";
  import { _getMatcherForToolName } from "./utils.js";
  export async function _scorer(params) {
  const { outputs, referenceOutputs, toolArgsMatchMode, toolArgsMatchOverrides, } = params;
- const normalizedOutputs = _normalizeToOpenAIMessagesList(outputs);
- const normalizedReferenceOutputs = _normalizeToOpenAIMessagesList(referenceOutputs);
+ const normalizedOutputs = outputs;
+ const normalizedReferenceOutputs = referenceOutputs;
  if (!normalizedOutputs || !normalizedReferenceOutputs) {
  throw new Error("Strict trajectory match requires both outputs and reference_outputs");
  }
@@ -62,8 +62,11 @@ export async function _scorer(params) {
  * @returns EvaluatorResult containing a score of true if trajectory (including called tools) matches, false otherwise
  */
  export async function trajectoryStrictMatch(params) {
+ const normalizedOutputs = _normalizeToOpenAIMessagesList(params.outputs);
+ const normalizedReferenceOutputs = _normalizeToOpenAIMessagesList(params.referenceOutputs);
  return _runEvaluator("trajectory_strict_match", _scorer, "trajectory_strict_match", {
- ...params,
+ outputs: normalizedOutputs,
+ referenceOutputs: normalizedReferenceOutputs,
  toolArgsMatchMode: params.toolCallArgsExactMatch ? "exact" : "ignore",
  });
  }
@@ -1,5 +1,5 @@
  import { BaseMessage } from "@langchain/core/messages";
- import { ChatCompletionMessage, EvaluatorResult, ToolArgsMatchMode, ToolArgsMatchOverrides } from "../types.js";
+ import { ChatCompletionMessage, FlexibleChatCompletionMessage, EvaluatorResult, ToolArgsMatchMode, ToolArgsMatchOverrides } from "../types.js";
  export declare const _scorer: (params: {
  outputs: ChatCompletionMessage[];
  referenceOutputs: ChatCompletionMessage[];
@@ -21,10 +21,10 @@ export declare const _scorer: (params: {
  * @returns EvaluatorResult containing a score of true if trajectory (including called tools) matches, false otherwise
  */
  export declare function trajectorySubset(params: {
- outputs: ChatCompletionMessage[] | BaseMessage[] | {
- messages: (BaseMessage | ChatCompletionMessage)[];
+ outputs: FlexibleChatCompletionMessage[] | BaseMessage[] | {
+ messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
  };
- referenceOutputs: ChatCompletionMessage[] | BaseMessage[] | {
- messages: (BaseMessage | ChatCompletionMessage)[];
+ referenceOutputs: FlexibleChatCompletionMessage[] | BaseMessage[] | {
+ messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
  };
  }): Promise<EvaluatorResult>;
@@ -1,5 +1,5 @@
  import { BaseMessage } from "@langchain/core/messages";
- import { ChatCompletionMessage, EvaluatorResult, ToolArgsMatchMode, ToolArgsMatchOverrides } from "../types.js";
+ import { ChatCompletionMessage, FlexibleChatCompletionMessage, EvaluatorResult, ToolArgsMatchMode, ToolArgsMatchOverrides } from "../types.js";
  export declare const _scorer: (params: {
  outputs: ChatCompletionMessage[];
  referenceOutputs: ChatCompletionMessage[];
@@ -21,10 +21,10 @@ export declare const _scorer: (params: {
  * @returns EvaluatorResult containing a score of true if trajectory (including called tools) matches, false otherwise
  */
  export declare function trajectorySuperset(params: {
- outputs: ChatCompletionMessage[] | BaseMessage[] | {
- messages: (BaseMessage | ChatCompletionMessage)[];
+ outputs: FlexibleChatCompletionMessage[] | BaseMessage[] | {
+ messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
  };
- referenceOutputs: ChatCompletionMessage[] | BaseMessage[] | {
- messages: (BaseMessage | ChatCompletionMessage)[];
+ referenceOutputs: FlexibleChatCompletionMessage[] | BaseMessage[] | {
+ messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
  };
  }): Promise<EvaluatorResult>;
@@ -1,5 +1,5 @@
  import { BaseMessage } from "@langchain/core/messages";
- import { ChatCompletionMessage, EvaluatorResult, ToolArgsMatchMode, ToolArgsMatchOverrides } from "../types.js";
+ import { ChatCompletionMessage, FlexibleChatCompletionMessage, EvaluatorResult, ToolArgsMatchMode, ToolArgsMatchOverrides } from "../types.js";
  export declare const _scorer: (params: {
  outputs: ChatCompletionMessage[];
  referenceOutputs: ChatCompletionMessage[];
@@ -21,10 +21,10 @@ export declare const _scorer: (params: {
  * @returns EvaluatorResult containing a score of true if trajectory (including called tools) matches, false otherwise
  */
  export declare function trajectoryUnorderedMatch(params: {
- outputs: ChatCompletionMessage[] | BaseMessage[] | {
- messages: (BaseMessage | ChatCompletionMessage)[];
+ outputs: FlexibleChatCompletionMessage[] | BaseMessage[] | {
+ messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
  };
- referenceOutputs: ChatCompletionMessage[] | BaseMessage[] | {
- messages: (BaseMessage | ChatCompletionMessage)[];
+ referenceOutputs: FlexibleChatCompletionMessage[] | BaseMessage[] | {
+ messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
  };
  }): Promise<EvaluatorResult>;
@@ -88,10 +88,24 @@ function _exactMatch(toolCall, referenceToolCall) {
  function _ignoreMatch(_toolCall, _referenceToolCall) {
  return true;
  }
+ function _subsetMatch(toolCall, referenceToolCall) {
+ // Every key-value pair in toolCall must exist in referenceToolCall with the same value
+ return Object.entries(toolCall).every(([key, value]) => key in referenceToolCall && _deepEqual(referenceToolCall[key], value));
+ }
+ function _supersetMatch(toolCall, referenceToolCall) {
+ // Every key-value pair in referenceToolCall must exist in toolCall with the same value
+ return Object.entries(referenceToolCall).every(([key, value]) => key in toolCall && _deepEqual(toolCall[key], value));
+ }
  function _getMatcherForComparisonMode(mode) {
  if (mode === "exact") {
  return _exactMatch;
  }
+ else if (mode === "subset") {
+ return _subsetMatch;
+ }
+ else if (mode === "superset") {
+ return _supersetMatch;
+ }
  else {
  return _ignoreMatch;
  }
@@ -84,10 +84,24 @@ function _exactMatch(toolCall, referenceToolCall) {
  function _ignoreMatch(_toolCall, _referenceToolCall) {
  return true;
  }
+ function _subsetMatch(toolCall, referenceToolCall) {
+ // Every key-value pair in toolCall must exist in referenceToolCall with the same value
+ return Object.entries(toolCall).every(([key, value]) => key in referenceToolCall && _deepEqual(referenceToolCall[key], value));
+ }
+ function _supersetMatch(toolCall, referenceToolCall) {
+ // Every key-value pair in referenceToolCall must exist in toolCall with the same value
+ return Object.entries(referenceToolCall).every(([key, value]) => key in toolCall && _deepEqual(toolCall[key], value));
+ }
  function _getMatcherForComparisonMode(mode) {
  if (mode === "exact") {
  return _exactMatch;
  }
+ else if (mode === "subset") {
+ return _subsetMatch;
+ }
+ else if (mode === "superset") {
+ return _supersetMatch;
+ }
  else {
  return _ignoreMatch;
  }
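
The two matchers added above differ only in which side drives the comparison: `_subsetMatch` iterates the actual call's entries, `_supersetMatch` iterates the reference's. The sketch below restates them in typed form to make the asymmetry concrete. It is an illustration under assumptions, not the package source: `deepEqual` is a hypothetical stand-in for the internal `_deepEqual` helper (which this diff does not show), and the inputs are assumed to be plain argument objects.

```typescript
type ToolArgs = Record<string, unknown>;

// Hypothetical stand-in for the package's internal _deepEqual helper.
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const aObj = a as Record<string, unknown>;
  const bObj = b as Record<string, unknown>;
  const aKeys = Object.keys(aObj);
  const bKeys = Object.keys(bObj);
  return (
    aKeys.length === bKeys.length &&
    aKeys.every((key) => key in bObj && deepEqual(aObj[key], bObj[key]))
  );
}

// "subset": every entry in the actual call must appear in the reference.
function subsetMatch(toolCall: ToolArgs, referenceToolCall: ToolArgs): boolean {
  return Object.entries(toolCall).every(
    ([key, value]) => key in referenceToolCall && deepEqual(referenceToolCall[key], value)
  );
}

// "superset": every entry in the reference must appear in the actual call.
function supersetMatch(toolCall: ToolArgs, referenceToolCall: ToolArgs): boolean {
  return Object.entries(referenceToolCall).every(
    ([key, value]) => key in toolCall && deepEqual(toolCall[key], value)
  );
}

const actualArgs: ToolArgs = { city: "SF" };
const referenceArgs: ToolArgs = { city: "SF", units: "fahrenheit" };

const actualIsSubset = subsetMatch(actualArgs, referenceArgs);
const actualIsSuperset = supersetMatch(actualArgs, referenceArgs);

console.log({ actualIsSubset, actualIsSuperset });
```

Here the actual call omits `units`, so it passes the subset check and fails the superset check. Note that an empty actual object passes `subsetMatch` against any reference, because `every` over no entries is vacuously true.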
package/dist/types.d.ts CHANGED
@@ -1,5 +1,20 @@
  import { createLLMAsJudge } from "openevals/llm";
  export * from "openevals/types";
+ export type FlexibleChatCompletionMessage = Record<string, any> & ({
+ content: any;
+ role: "user" | "system" | "developer";
+ id?: string;
+ } | {
+ role: "assistant";
+ content: any;
+ tool_calls?: any[];
+ id?: string;
+ } | {
+ role: "tool";
+ content: any;
+ tool_call_id?: string;
+ id?: string;
+ });
  export type GraphTrajectory = {
  inputs?: (Record<string, unknown> | null)[];
  results: Record<string, unknown>[];
@@ -9,9 +24,9 @@ export type ExtractedLangGraphThreadTrajectory = {
  inputs: (Record<string, unknown> | null)[][];
  outputs: GraphTrajectory;
  };
- export type TrajectoryLLMAsJudgeParams = Omit<Parameters<typeof createLLMAsJudge>[0], "prompt"> & {
- prompt?: string;
+ export type TrajectoryLLMAsJudgeParams = Partial<Omit<Parameters<typeof createLLMAsJudge>[0], "prompt">> & {
+ prompt?: Parameters<typeof createLLMAsJudge>[0]["prompt"];
  };
- export type ToolArgsMatchMode = "exact" | "ignore";
+ export type ToolArgsMatchMode = "exact" | "ignore" | "subset" | "superset";
  export type ToolArgsMatcher = (toolCall: Record<string, unknown>, referenceToolCall: Record<string, unknown>) => boolean | Promise<boolean>;
  export type ToolArgsMatchOverrides = Record<string, ToolArgsMatchMode | string[] | ToolArgsMatcher>;
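
`FlexibleChatCompletionMessage`, added above, is looser than a strict OpenAI-style message: `tool_call_id` and `id` are optional and `content` is `any`. The sketch below shows what that permits at the call site. The type is copied locally from this diff so the example type-checks on its own; in real code it would presumably be imported from `agentevals`, as the README hunk at the top of this diff does. The `trajectory` variable name is illustrative.

```typescript
// Local copy of the type added to package/dist/types.d.ts in this diff,
// reproduced here so the sketch compiles without installing agentevals.
type FlexibleChatCompletionMessage = Record<string, any> &
  (
    | { content: any; role: "user" | "system" | "developer"; id?: string }
    | { role: "assistant"; content: any; tool_calls?: any[]; id?: string }
    | { role: "tool"; content: any; tool_call_id?: string; id?: string }
  );

// A tool message with no tool_call_id is accepted by the flexible type.
const trajectory = [
  { role: "user", content: "What is the weather in SF?" },
  {
    role: "assistant",
    content: "",
    tool_calls: [
      { function: { name: "get_weather", arguments: JSON.stringify({ city: "SF" }) } },
    ],
  },
  { role: "tool", content: "It's 80 degrees and sunny in SF." },
  { role: "assistant", content: "The weather in SF is 80 degrees and sunny." },
] satisfies FlexibleChatCompletionMessage[];

console.log(trajectory.length);
```

Using `satisfies` rather than a type annotation checks each literal against the union while keeping the narrower inferred type of the array, which is the same pattern the updated README example uses.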
package/dist/utils.cjs CHANGED
@@ -1,6 +1,6 @@
  "use strict";
  Object.defineProperty(exports, "__esModule", { value: true });
- exports._runEvaluator = exports.processScore = exports._normalizeToOpenAIMessagesList = exports._convertToOpenAIMessage = void 0;
+ exports._runEvaluator = exports.processScore = exports._normalizeToOpenAIMessagesList = exports._convertToChatCompletionMessage = exports._convertToOpenAIMessage = void 0;
  const messages_1 = require("@langchain/core/messages");
  const openai_1 = require("@langchain/openai");
  const utils_1 = require("openevals/utils");
@@ -14,6 +14,25 @@ const _convertToOpenAIMessage = (message) => {
  }
  };
  exports._convertToOpenAIMessage = _convertToOpenAIMessage;
+ const _convertToChatCompletionMessage = (message) => {
+ let converted;
+ if ((0, messages_1.isBaseMessage)(message)) {
+ // eslint-disable-next-line @typescript-eslint/no-explicit-any
+ converted = (0, openai_1._convertMessagesToOpenAIParams)([message])[0];
+ }
+ else {
+ converted = message;
+ }
+ // For tool messages without tool_call_id, generate one for compatibility
+ if (converted.role === "tool" && !converted.tool_call_id) {
+ converted = {
+ ...converted,
+ tool_call_id: `generated-${Math.random().toString(36).substring(2)}`,
+ };
+ }
+ return converted;
+ };
+ exports._convertToChatCompletionMessage = _convertToChatCompletionMessage;
  const _normalizeToOpenAIMessagesList = (messages) => {
  if (!messages) {
  return [];
@@ -30,7 +49,7 @@ const _normalizeToOpenAIMessagesList = (messages) => {
  else {
  messagesList = messages;
  }
- return messagesList.map(exports._convertToOpenAIMessage);
+ return messagesList.map(exports._convertToChatCompletionMessage);
  };
  exports._normalizeToOpenAIMessagesList = _normalizeToOpenAIMessagesList;
  const processScore = (_, value) => {
package/dist/utils.d.ts CHANGED
@@ -1,9 +1,10 @@
  import { BaseMessage } from "@langchain/core/messages";
  import { EvaluationResultType } from "openevals/utils";
- import { ChatCompletionMessage, MultiResultScorerReturnType, SingleResultScorerReturnType } from "./types.js";
+ import { ChatCompletionMessage, FlexibleChatCompletionMessage, MultiResultScorerReturnType, SingleResultScorerReturnType } from "./types.js";
  export declare const _convertToOpenAIMessage: (message: BaseMessage | ChatCompletionMessage) => ChatCompletionMessage;
- export declare const _normalizeToOpenAIMessagesList: (messages?: (BaseMessage | ChatCompletionMessage)[] | {
- messages: (BaseMessage | ChatCompletionMessage)[];
+ export declare const _convertToChatCompletionMessage: (message: BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage) => ChatCompletionMessage;
+ export declare const _normalizeToOpenAIMessagesList: (messages?: (FlexibleChatCompletionMessage | ChatCompletionMessage | BaseMessage)[] | {
+ messages: (BaseMessage | ChatCompletionMessage | FlexibleChatCompletionMessage)[];
  } | undefined) => ChatCompletionMessage[];
  export declare const processScore: (_: string, value: boolean | number | {
  score: boolean | number;
package/dist/utils.js CHANGED
@@ -10,6 +10,24 @@ export const _convertToOpenAIMessage = (message) => {
  return message;
  }
  };
+ export const _convertToChatCompletionMessage = (message) => {
+ let converted;
+ if (isBaseMessage(message)) {
+ // eslint-disable-next-line @typescript-eslint/no-explicit-any
+ converted = _convertMessagesToOpenAIParams([message])[0];
+ }
+ else {
+ converted = message;
+ }
+ // For tool messages without tool_call_id, generate one for compatibility
+ if (converted.role === "tool" && !converted.tool_call_id) {
+ converted = {
+ ...converted,
+ tool_call_id: `generated-${Math.random().toString(36).substring(2)}`,
+ };
+ }
+ return converted;
+ };
  export const _normalizeToOpenAIMessagesList = (messages) => {
  if (!messages) {
  return [];
@@ -26,7 +44,7 @@ export const _normalizeToOpenAIMessagesList = (messages) => {
  else {
  messagesList = messages;
  }
- return messagesList.map(_convertToOpenAIMessage);
+ return messagesList.map(_convertToChatCompletionMessage);
  };
  export const processScore = (_, value) => {
  if (typeof value === "object") {
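
The new `_convertToChatCompletionMessage` backfills a `tool_call_id` on tool messages that lack one. The sketch below isolates that step for plain-object messages only; it deliberately leaves out the `BaseMessage` conversion branch. `LooseMessage` and `backfillToolCallId` are hypothetical names introduced for illustration.

```typescript
// Hypothetical loose message shape; not a type exported by the package.
type LooseMessage = {
  role: string;
  content: unknown;
  tool_call_id?: string;
  [key: string]: unknown;
};

// Mirrors the backfill in the diff: only tool messages missing an id are touched.
function backfillToolCallId(message: LooseMessage): LooseMessage {
  if (message.role === "tool" && !message.tool_call_id) {
    return {
      ...message,
      tool_call_id: `generated-${Math.random().toString(36).substring(2)}`,
    };
  }
  return message;
}

const withoutId = backfillToolCallId({
  role: "tool",
  content: "It's 80 degrees and sunny in SF.",
});
const withId = backfillToolCallId({
  role: "tool",
  content: "ok",
  tool_call_id: "call_123",
});

console.log(withoutId.tool_call_id, withId.tool_call_id);
```

Because the generated value is random, it will not correspond to any `id` on a preceding assistant tool call; going by the code comment in the diff, it exists for shape compatibility rather than for linking a result to its call.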
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "agentevals",
- "version": "0.0.4",
+ "version": "0.0.6",
  "packageManager": "yarn@3.5.1",
  "type": "module",
  "scripts": {
@@ -14,18 +14,18 @@
  "test": "vitest run"
  },
  "dependencies": {
- "@langchain/openai": "^0.4.4",
- "langchain": "^0.3.18",
- "langsmith": "^0.3.11",
- "openevals": "^0.0.3"
+ "@langchain/openai": ">=0.4.4",
+ "langchain": ">=0.3.18",
+ "langsmith": ">=0.3.11",
+ "openevals": "^0.1.0"
  },
  "peerDependencies": {
- "@langchain/core": "^0.3.40",
- "@langchain/langgraph": "^0.2.46"
+ "@langchain/core": ">=0.3.73",
+ "@langchain/langgraph": ">=0.2.46"
  },
  "devDependencies": {
- "@langchain/core": "^0.3.40",
- "@langchain/langgraph": "^0.2.46",
+ "@langchain/core": "^0.3.73",
+ "@langchain/langgraph": "^0.4.9",
  "@langchain/scripts": "0.1.3",
  "@tsconfig/recommended": "^1.0.8",
  "@typescript-eslint/eslint-plugin": "^8.24.1",
@@ -43,7 +43,7 @@
  "prettier": "^3.5.1",
  "typescript": "~5.1.6",
  "vitest": "^3.0.5",
- "zod": "^3.24.2"
+ "zod": "^4.1.5"
  },
  "files": [
  "dist/",