agentevals 0.0.3 → 0.0.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -9,7 +9,7 @@ It is intended to provide a good conceptual starting point for your agent's eval

  If you are looking for more general evaluation tools, please check out the companion package [`openevals`](https://github.com/langchain-ai/openevals).

- ## Quickstart
+ # Quickstart

  To get started, install `agentevals`:

@@ -72,17 +72,20 @@ console.log(evalResult);
  }
  ```

- You can see that despite the small difference in the final response and tool calls, the evaluator still returns a score of `true` since the overall trajectory is the same between the output and reference!
+ You can see that the evaluator returns a score of `true` since the overall trajectory is a reasonable path for the agent to take to answer the user's question.

- ## Table of Contents
+ For more details on this evaluator, including how to customize it, see the section on [trajectory LLM-as-judge](#trajectory-llm-as-judge).
+
+ # Table of Contents

  - [Installation](#installation)
  - [Evaluators](#evaluators)
- - [Agent Trajectory](#agent-trajectory)
+ - [Agent Trajectory Match](#agent-trajectory-match)
  - [Strict match](#strict-match)
  - [Unordered match](#unordered-match)
  - [Subset/superset match](#subset-and-superset-match)
- - [Trajectory LLM-as-judge](#trajectory-llm-as-judge)
+ - [Tool args match modes](#tool-args-match-modes)
+ - [Trajectory LLM-as-judge](#trajectory-llm-as-judge)
  - [Graph Trajectory](#graph-trajectory)
  - [Graph trajectory LLM-as-judge](#graph-trajectory-llm-as-judge)
  - [Graph trajectory strict match](#graph-trajectory-strict-match)
@@ -91,7 +94,7 @@ You can see that despite the small difference in the final response and tool cal
  - [Pytest or Vitest/Jest](#pytest-or-vitestjest)
  - [Evaluate](#evaluate)

- ## Installation
+ # Installation

  You can install `agentevals` like this:

@@ -106,44 +109,70 @@ npm install openai
  ```

  It is also helpful to be familiar with some [evaluation concepts](https://docs.smith.langchain.com/evaluation/concepts) and
- LangSmith's Vitest/Jest integration for running evals, which is documented [here](https://docs.smith.langchain.com/evaluation/how_to_guides/pytest).
+ LangSmith's pytest integration for running evals, which is documented [here](https://docs.smith.langchain.com/evaluation/how_to_guides/pytest).

- ## Evaluators
+ # Evaluators

- ### Agent trajectory
+ ## Agent trajectory match

- Agent trajectory evaluators are used to judge the trajectory of an agent's execution either against an expected trajectory or using an LLM.
+ Agent trajectory match evaluators are used to judge the trajectory of an agent's execution either against an expected trajectory or using an LLM.
  These evaluators expect you to format your agent's trajectory as a list of OpenAI format dicts or as a list of LangChain `BaseMessage` classes, and handle message formatting
  under the hood.

- #### Strict match
+ AgentEvals offers the `create_trajectory_match_evaluator`/`createTrajectoryMatchEvaluator` and `create_async_trajectory_match_evaluator` methods for this task. You can customize their behavior in a few ways:
+
+ - Setting `trajectory_match_mode`/`trajectoryMatchMode` to [`strict`](#strict-match), [`unordered`](#unordered-match), [`subset`](#subset-and-superset-match), or [`superset`](#subset-and-superset-match) to provide the general strategy the evaluator will use to compare trajectories
+ - Setting [`tool_args_match_mode`](#tool-args-match-modes) and/or [`tool_args_match_overrides`](#tool-args-match-modes) to customize how the evaluator considers equality between tool calls in the actual trajectory vs. the reference. By default, only tool calls with the same arguments to the same tool are considered equal.
+
+ ### Strict match

- The `trajectory_strict_match` evaluator, compares two trajectories and ensures that they contain the same messages
- in the same order with the same tool calls. It allows for differences in message content and tool call arguments,
- but requires that the selected tools at each step are the same.
+ The `"strict"` `trajectory_match_mode` compares two trajectories and ensures that they contain the same messages
+ in the same order with the same tool calls. Note that it does allow for differences in message content:

  ```ts
- import { trajectoryStrictMatch } from "agentevals";
+ import { createTrajectoryMatchEvaluator } from "agentevals";

  const outputs = [
-   { role: "user", content: "What is the weather in SF?" },
-   {
-     role: "assistant",
-     tool_calls: [{
-       function: { name: "get_weather", arguments: JSON.stringify({ city: "SF" }) }
-     }]
-   },
-   { role: "tool", content: "It's 80 degrees and sunny in SF." },
-   { role: "assistant", content: "The weather in SF is 80 degrees and sunny." },
+   { role: "user", content: "What is the weather in SF?" },
+   {
+     role: "assistant",
+     content: "",
+     tool_calls: [{
+       function: {
+         name: "get_weather",
+         arguments: JSON.stringify({ city: "San Francisco" })
+       },
+     }, {
+       function: {
+         name: "accuweather_forecast",
+         arguments: JSON.stringify({ city: "San Francisco" }),
+       },
+     }]
+   },
+   { role: "tool", content: "It's 80 degrees and sunny in SF." },
+   { role: "assistant", content: "The weather in SF is 80 degrees and sunny." },
  ];

  const referenceOutputs = [
-   { role: "user", content: "What is the weather in San Francisco?" },
-   { role: "assistant", tool_calls: [{ function: { name: "get_weather", arguments: JSON.stringify({ city: "San Francisco" }) } }] },
-   { role: "tool", content: "It's 80 degrees and sunny in San Francisco." },
+   { role: "user", content: "What is the weather in San Francisco?" },
+   {
+     role: "assistant",
+     content: "",
+     tool_calls: [{
+       function: {
+         name: "get_weather",
+         arguments: JSON.stringify({ city: "San Francisco" })
+       }
+     }]
+   },
+   { role: "tool", content: "It's 80 degrees and sunny in San Francisco." },
  ];

- const result = await trajectoryStrictMatch({
+ const evaluator = createTrajectoryMatchEvaluator({
+   trajectoryMatchMode: "strict",
+ });
+
+ const result = await evaluator({
    outputs,
    referenceOutputs,
  });
@@ -153,22 +182,27 @@ console.log(result);

  ```
  {
-   'key': 'trajectory_accuracy',
-   'score': true,
+   'key': 'trajectory_strict_match',
+   'score': false,
  }
  ```

- #### Unordered match
+ `"strict"` is useful if you want to ensure that tools are always called in the same order for a given query (e.g. a company policy lookup tool before a tool that requests vacation time for an employee).
+
+ **Note:** If you would like to configure the way this evaluator checks for tool call equality, see [this section](#tool-args-match-modes).

- The `trajectory_unordered_match` evaluator, compares two trajectories and ensures that they contain the same number of tool calls in any order. This is useful if you want to allow flexibility in how an agent obtains the proper information, but still do care that all information was retrieved.
+ ### Unordered match
+
+ The `"unordered"` `trajectory_match_mode` compares two trajectories and ensures that they contain the same tool calls in any order. This is useful if you want to allow flexibility in how an agent obtains the proper information, but still do care that all information was retrieved.

  ```ts
- import { trajectoryUnorderedMatch } from "agentevals";
+ import { createTrajectoryMatchEvaluator } from "agentevals";

  const outputs = [
    { role: "user", content: "What is the weather in SF and is there anything fun happening?" },
    {
      role: "assistant",
+     content: "",
      tool_calls: [{
        function: {
          name: "get_weather",
@@ -179,6 +213,7 @@ const outputs = [
    { role: "tool", content: "It's 80 degrees and sunny in SF." },
    {
      role: "assistant",
+     content: "",
      tool_calls: [{
        function: {
          name: "get_fun_activities",
@@ -194,6 +229,7 @@ const referenceOutputs = [
    { role: "user", content: "What is the weather in SF and is there anything fun happening?" },
    {
      role: "assistant",
+     content: "",
      tool_calls: [
        {
          function: {
@@ -214,7 +250,11 @@ const referenceOutputs = [
    { role: "assistant", content: "In SF, it's 80˚ and sunny, but there is nothing fun happening." },
  ];

- const result = await trajectoryUnorderedMatch({
+ const evaluator = createTrajectoryMatchEvaluator({
+   trajectoryMatchMode: "unordered",
+ });
+
+ const result = await evaluator({
    outputs,
    referenceOutputs,
  });
@@ -229,26 +269,36 @@ console.log(result)
  }
  ```

- #### Subset and superset match
+ `"unordered"` is useful if you want to ensure that specific tools are called at some point in the trajectory, but you don't necessarily need them to be in message order (e.g. the agent called a company policy retrieval tool at an arbitrary point in an interaction before authorizing spend for a pizza party).
+
+ **Note:** If you would like to configure the way this evaluator checks for tool call equality, see [this section](#tool-args-match-modes).

- There are other evaluators for checking partial trajectory matches (ensuring that a trajectory contains a subset and superset of tool calls compared to a reference trajectory).
+ ### Subset and superset match
+
+ The `"subset"` and `"superset"` modes match partial trajectories (ensuring that a trajectory contains a subset/superset of tool calls contained in a reference trajectory).

  ```ts
- import { trajectorySubset } from "agentevals";
- // import { trajectorySuperset } from "agentevals";
+ import { createTrajectoryMatchEvaluator } from "agentevals";

  const outputs = [
    { role: "user", content: "What is the weather in SF and London?" },
    {
      role: "assistant",
+     content: "",
      tool_calls: [{
        function: {
          name: "get_weather",
          arguments: JSON.stringify({ city: "SF and London" }),
        }
+     }, {
+       function: {
+         name: "accuweather_forecast",
+         arguments: JSON.stringify({ city: "SF and London" }),
+       }
      }],
    },
    { role: "tool", content: "It's 80 degrees and sunny in SF, and 90 degrees and rainy in London." },
+   { role: "tool", content: "Unknown." },
    { role: "assistant", content: "The weather in SF is 80 degrees and sunny. In London, it's 90 degrees and rainy."},
  ];

@@ -256,27 +306,25 @@ const referenceOutputs = [
    { role: "user", content: "What is the weather in SF and London?" },
    {
      role: "assistant",
+     content: "",
      tool_calls: [
        {
          function: {
            name: "get_weather",
-           arguments: JSON.stringify({ city: "San Francisco" }),
-         }
-       },
-       {
-         function: {
-           name: "get_weather",
-           arguments: JSON.stringify({ city: "London" }),
+           arguments: JSON.stringify({ city: "SF and London" }),
          }
        },
      ],
    },
-   { role: "tool", content: "It's 80 degrees and sunny in San Francisco." },
-   { role: "tool", content: "It's 90 degrees and rainy in London." },
+   { role: "tool", content: "It's 80 degrees and sunny in San Francisco, and 90 degrees and rainy in London." },
    { role: "assistant", content: "The weather in SF is 80˚ and sunny. In London, it's 90˚ and rainy." },
  ];

- const result = await trajectorySubset({
+ const evaluator = createTrajectoryMatchEvaluator({
+   trajectoryMatchMode: "superset", // or "subset"
+ });
+
+ const result = await evaluator({
    outputs,
    referenceOutputs,
  });
@@ -286,16 +334,145 @@ console.log(result)

  ```
  {
-   'key': 'trajectory_subset',
+   'key': 'trajectory_superset_match',
    'score': true,
  }
  ```

- #### Trajectory LLM-as-judge
+ `"superset"` is useful if you want to ensure that some key tools were called at some point in the trajectory, but an agent calling extra tools is still acceptable. `"subset"` is the inverse and is useful if you want to ensure that the agent did not call any tools beyond the expected ones.
+
+ **Note:** If you would like to configure the way this evaluator checks for tool call equality, see [this section](#tool-args-match-modes).
+
+ ### Tool args match modes
+
+ When checking equality between tool calls, the above evaluators require by default that all tool call arguments are exactly the same. You can configure this behavior in the following ways:
+
+ - Treating any two tool calls for the same tool as equivalent by setting `tool_args_match_mode="ignore"` (Python) or `toolArgsMatchMode: "ignore"` (TypeScript)
+ - Treating a tool call as equivalent if it contains a subset/superset of args compared to a reference tool call of the same name with `tool_args_match_mode="subset"/"superset"` (Python) or `toolArgsMatchMode: "subset"/"superset"` (TypeScript)
+ - Setting custom matchers for all calls of a given tool using the `tool_args_match_overrides` (Python) or `toolArgsMatchOverrides` (TypeScript) param
+
+ You can set both of these parameters at the same time. `tool_args_match_overrides` will take precedence over `tool_args_match_mode`.
+
+ `tool_args_match_overrides`/`toolArgsMatchOverrides` takes a dictionary whose keys are tool names and whose values are either `"exact"`, `"ignore"`, a list of fields within the tool call that must match exactly, or a comparator function that takes two arguments and returns whether they are equal:
+
+ ```python
+ from typing import Callable, Literal, Union
+
+ ToolArgsMatchMode = Literal["exact", "ignore", "subset", "superset"]
+
+ ToolArgsMatchOverrides = dict[str, Union[ToolArgsMatchMode, list[str], Callable[[dict, dict], bool]]]
+ ```
+
+ Here's an example that allows case insensitivity for the arguments to a tool named `get_weather`:
+
+ ```ts
+ import { createTrajectoryMatchEvaluator } from "agentevals";
+
+ const outputs = [
+   { role: "user", content: "What is the weather in SF?" },
+   {
+     role: "assistant",
+     content: "",
+     tool_calls: [{
+       function: {
+         name: "get_weather",
+         arguments: JSON.stringify({ city: "san francisco" })
+       },
+     }]
+   },
+   { role: "tool", content: "It's 80 degrees and sunny in SF." },
+   { role: "assistant", content: "The weather in SF is 80 degrees and sunny." },
+ ];
+
+ const referenceOutputs = [
+   { role: "user", content: "What is the weather in San Francisco?" },
+   {
+     role: "assistant",
+     content: "",
+     tool_calls: [{
+       function: {
+         name: "get_weather",
+         arguments: JSON.stringify({ city: "San Francisco" })
+       }
+     }]
+   },
+   { role: "tool", content: "It's 80 degrees and sunny in San Francisco." },
+ ];
+
+ const evaluator = createTrajectoryMatchEvaluator({
+   trajectoryMatchMode: "strict",
+   toolArgsMatchMode: "exact", // Default value
+   toolArgsMatchOverrides: {
+     get_weather: (x, y) => {
+       return typeof x.city === "string" &&
+         typeof y.city === "string" &&
+         x.city.toLowerCase() === y.city.toLowerCase();
+     },
+   }
+ });
+
+ const result = await evaluator({
+   outputs,
+   referenceOutputs,
+ });
+
+ console.log(result);
+ ```
+
+ ```
+ {
+   'key': 'trajectory_strict_match',
+   'score': true,
+ }
+ ```
+
+ This flexibility allows you to handle cases where you want looser equality for LLM-generated arguments (e.g. `"san francisco"` equaling `"San Francisco"`) for only specific tool calls.
+
+ ## Trajectory LLM-as-judge
+
+ The LLM-as-judge trajectory evaluator uses an LLM to evaluate the trajectory. Unlike the trajectory match evaluators, it doesn't require a reference trajectory. Here's an example:
+
+ ```ts
+ import {
+   createTrajectoryLLMAsJudge,
+   TRAJECTORY_ACCURACY_PROMPT,
+ } from "agentevals";
+
+ const evaluator = createTrajectoryLLMAsJudge({
+   prompt: TRAJECTORY_ACCURACY_PROMPT,
+   model: "openai:o3-mini",
+ });
+
+ const outputs = [
+   {role: "user", content: "What is the weather in SF?"},
+   {
+     role: "assistant",
+     content: "",
+     tool_calls: [
+       {
+         function: {
+           name: "get_weather",
+           arguments: JSON.stringify({ city: "SF" }),
+         }
+       }
+     ],
+   },
+   {role: "tool", content: "It's 80 degrees and sunny in SF."},
+   {role: "assistant", content: "The weather in SF is 80 degrees and sunny."},
+ ];
+
+ const result = await evaluator({ outputs });
+
+ console.log(result)
+ ```
+
+ ```
+ {
+   'key': 'trajectory_accuracy',
+   'score': True,
+   'comment': 'The provided agent trajectory is reasonable...'
+ }
+ ```

- The LLM-as-judge trajectory evaluator that uses an LLM to evaluate the trajectory. Unlike the other trajectory evaluators, it doesn't require a reference trajectory,
- and supports
- This allows for more flexibility in the trajectory comparison:
+ If you have a reference trajectory, you can add an extra variable to your prompt and pass in the reference trajectory. Below, we use the prebuilt `TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE` prompt, which contains a `reference_outputs` variable:

  ```ts
  import {
@@ -312,6 +489,7 @@ const outputs = [
  {role: "user", content: "What is the weather in SF?"},
  {
    role: "assistant",
+   content: "",
    tool_calls: [
      {
        function: {
@@ -328,6 +506,7 @@ const referenceOutputs = [
  {role: "user", content: "What is the weather in SF?"},
  {
    role: "assistant",
+   content: "",
    tool_calls: [
      {
        function: {
@@ -379,7 +558,7 @@ const fewShotExamples = [

  See the [`openevals`](https://github.com/langchain-ai/openevals?tab=readme-ov-file#llm-as-judge) repo for a fully up to date list of parameters.

- ### Graph trajectory
+ ## Graph trajectory

  For frameworks like [LangGraph](https://github.com/langchain-ai/langgraph) that model agents as graphs, it can be more convenient to represent trajectories in terms of nodes visited rather than messages. `agentevals` includes a category of evaluators called **graph trajectory** evaluators that are designed to work with this format, as well as convenient utilities for extracting trajectories from a LangGraph thread, including different conversation turns and interrupts.

@@ -404,7 +583,7 @@ const evaluator: ({ inputs, outputs, referenceOutputs, ...extra }: {

  Where `inputs` is a list of inputs (or a dict with a key named `"inputs"`) to the graph whose items each represent the start of a new invocation in a thread, `results` represents the final output from each turn in the thread, and `steps` represents the internal steps taken for each turn.

- #### Graph trajectory LLM-as-judge
+ ### Graph trajectory LLM-as-judge

  This evaluator is similar to the `trajectory_llm_as_judge` evaluator, but it works with graph trajectories instead of message trajectories. Below, we set up a LangGraph agent, extract a trajectory from it using the built-in utils, and pass it to the evaluator. First, let's set up our graph, call it, and then extract the trajectory:

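To make that format concrete, here is a rough sketch (added for illustration; the field values are hypothetical, and the exact exported types live in `agentevals`' type definitions) of what a single-turn graph trajectory could look like:

```ts
// A hypothetical single-turn thread: one input, the final result for
// that turn, and the sequence of node names the graph visited.
const inputs = [{ question: "What is the weather in SF?" }];
const trajectory = {
  results: [{ answer: "It's 80 degrees and sunny in SF." }],
  steps: [["__start__", "agent", "tools", "agent"]],
};
```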
@@ -514,7 +693,7 @@ console.log(res);
  }
  ```

- Note that though this evaluator takes the typical `inputs`, `outputs`, and `referenceOutputs` parameters, it internally combines `inputs` and `outputs` to form a `thread`. Therefore, if you want to customize the prompt, your prompt should also contain a `thread` input variable:
+ Note that though this evaluator takes the typical `inputs`, `outputs`, and `reference_outputs` parameters, it internally combines `inputs` and `outputs` to form a `thread`. Therefore, if you want to customize the prompt, your prompt should also contain a `thread` input variable:

  ```ts
  const CUSTOM_PROMPT = `You are an expert data labeler.
@@ -546,18 +725,18 @@ const graphTrajectoryEvaluator = createGraphTrajectoryLLMAsJudge({
    model: "openai:o3-mini",
  });
  const res = await graphTrajectoryEvaluator({
-   inputs=extractedTrajectory.inputs,
-   outputs=extractedTrajectory.outputs,
+   inputs: extractedTrajectory.inputs,
+   outputs: extractedTrajectory.outputs,
  });
  ```

- In order to format them properly into the prompt, `referenceOutputs` should be passed in as a `GraphTrajectory` object like `outputs`.
+ In order to format them properly into the prompt, `reference_outputs` should be passed in as a `GraphTrajectory` object like `outputs`.

- Also note that like other LLM-as-judge evaluators, you can pass extra kwargs into the evaluator to format them into the prompt.
+ Also note that like other LLM-as-judge evaluators, you can pass extra params into the evaluator to format them into the prompt.

- #### Graph trajectory strict match
+ ### Graph trajectory strict match

- The `graphTrajectoryStrictMatch` evaluator is a simple evaluator that checks if the steps in the provided graph trajectory match the reference trajectory exactly.
+ The `graph_trajectory_strict_match` evaluator is a simple evaluator that checks if the steps in the provided graph trajectory match the reference trajectory exactly.

  ```ts
  import { tool } from "@langchain/core/tools";
@@ -626,15 +805,46 @@ console.log(result);
    'score': True,
  }
  ```
- ## LangSmith Integration
+
+ # Python Async Support
+
+ All `agentevals` evaluators support Python [asyncio](https://docs.python.org/3/library/asyncio.html). As a convention, evaluators that use a factory function will have `async` put immediately after `create_` in the function name (for example, `create_async_trajectory_llm_as_judge`), and evaluators used directly will end in `async` (e.g. `trajectory_strict_match_async`).
+
+ Here's an example of how to use the `create_async_trajectory_llm_as_judge` evaluator asynchronously:
+
+ ```python
+ from agentevals.trajectory.llm import create_async_trajectory_llm_as_judge
+
+ evaluator = create_async_trajectory_llm_as_judge(
+     prompt="What is the weather in {inputs}?",
+ )
+
+ result = await evaluator(inputs="San Francisco")
+ ```
+
+ If you are using the OpenAI client directly, remember to pass in `AsyncOpenAI` as the `judge` parameter:
+
+ ```python
+ from openai import AsyncOpenAI
+
+ evaluator = create_async_trajectory_llm_as_judge(
+     prompt="What is the weather in {inputs}?",
+     judge=AsyncOpenAI(),
+     model="o3-mini",
+ )
+
+ result = await evaluator(inputs="San Francisco")
+ ```
+
+ # LangSmith Integration

  For tracking experiments over time, you can log evaluator results to [LangSmith](https://smith.langchain.com/), a platform for building production-grade LLM applications that includes tracing, evaluation, and experimentation tools.

- LangSmith currently offers two ways to run evals. We'll give a quick example of how to run evals using both.
+ LangSmith currently offers two ways to run evals: a [pytest](https://docs.smith.langchain.com/evaluation/how_to_guides/pytest) (Python) or [Vitest/Jest](https://docs.smith.langchain.com/evaluation/how_to_guides/vitest_jest) integration and the `evaluate` function. We'll give a quick example of how to run evals using both.

- ### Pytest or Vitest/Jest
+ ## Pytest or Vitest/Jest

- First, follow [these instructions](https://docs.smith.langchain.com/evaluation/how_to_guides/vitest_jest) to set up LangSmith's Vitest/Jest runner,
+ First, follow [these instructions](https://docs.smith.langchain.com/evaluation/how_to_guides/pytest) to set up LangSmith's pytest runner, or these instructions to set up [Vitest or Jest](https://docs.smith.langchain.com/evaluation/how_to_guides/vitest_jest),
  setting appropriate environment variables:

  ```bash
@@ -642,7 +852,6 @@ export LANGSMITH_API_KEY="your_langsmith_api_key"
  export LANGSMITH_TRACING="true"
  ```

-
  Then, set up a file named `test_trajectory.eval.ts` with the following contents:

  ```ts
@@ -670,6 +879,7 @@ ls.describe("trajectory accuracy", () => {
  {"role": "user", "content": "What is the weather in SF?"},
  {
    "role": "assistant",
+   "content": "",
    "tool_calls": [
      {
        "function": {
@@ -688,6 +898,7 @@ ls.describe("trajectory accuracy", () => {
  {"role": "user", "content": "What is the weather in SF?"},
  {
    "role": "assistant",
+   "content": "",
    "tool_calls": [
      {
        "function": {
@@ -717,7 +928,6 @@ Now, run the eval with your runner of choice:
  vitest run test_trajectory.eval.ts
  ```

-
  Feedback from the prebuilt evaluator will be automatically logged in LangSmith as a table of results like this in your terminal:

  ![Terminal results](/static/img/pytest_output.png)
@@ -726,7 +936,7 @@ And you should also see the results in the experiment view in LangSmith:

  ![LangSmith results](/static/img/langsmith_results.png)

- ### Evaluate
+ ## Evaluate

  Alternatively, you can [create a dataset in LangSmith](https://docs.smith.langchain.com/evaluation/concepts#dataset-curation) and use your created evaluators with LangSmith's [`evaluate`](https://docs.smith.langchain.com/evaluation#8-run-and-view-results) function:

@@ -741,20 +951,21 @@ const trajectoryEvaluator = createTrajectoryLLMAsJudge({

  await evaluate(
    (inputs) => [
-     {role: "user", content: "What is the weather in SF?"},
-     {
-       role: "assistant",
-       tool_calls: [
-         {
-           function: {
-             name: "get_weather",
-             arguments: json.dumps({"city": "SF"}),
-           }
-         }
-       ],
-     },
-     {role: "tool", content: "It's 80 degrees and sunny in SF."},
-     {role: "assistant", content: "The weather in SF is 80 degrees and sunny."},
+     {role: "user", content: "What is the weather in SF?"},
+     {
+       role: "assistant",
+       content: "",
+       tool_calls: [
+         {
+           function: {
+             name: "get_weather",
+             arguments: JSON.stringify({ city: "SF" }),
+           }
+         }
+       ],
+     },
+     {role: "tool", content: "It's 80 degrees and sunny in SF."},
+     {role: "assistant", content: "The weather in SF is 80 degrees and sunny."},
    ],
    {
      data: datasetName,
@@ -763,7 +974,7 @@ await evaluate(
  );
  ```

- ## Thank you!
+ # Thank you!

  We hope that `agentevals` helps make evaluating your LLM agents easier!

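For quick reference, the consolidated trajectory match API documented above can be sketched end to end as follows (a minimal sketch; the message contents here are illustrative, not taken from the package):

```ts
import { createTrajectoryMatchEvaluator } from "agentevals";

// One factory covers all four match strategies; the mode is chosen up front.
const evaluator = createTrajectoryMatchEvaluator({
  trajectoryMatchMode: "unordered", // "strict" | "unordered" | "subset" | "superset"
  toolArgsMatchMode: "ignore", // treat any two calls to the same tool as equal
});

// Trajectories use the same OpenAI-style message format as the README examples.
const result = await evaluator({
  outputs: [{ role: "assistant", content: "The weather in SF is 80 degrees and sunny." }],
  referenceOutputs: [{ role: "assistant", content: "It is 80 degrees and sunny in SF." }],
});
console.log(result);
```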
package/dist/index.cjs CHANGED
@@ -14,7 +14,7 @@ var __exportStar = (this && this.__exportStar) || function(m, exports) {
  for (var p in m) if (p !== "default" && !Object.prototype.hasOwnProperty.call(exports, p)) __createBinding(exports, m, p);
  };
  Object.defineProperty(exports, "__esModule", { value: true });
- exports.GRAPH_TRAJECTORY_ACCURACY_PROMPT = exports.createGraphTrajectoryLLMAsJudge = exports.TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE = exports.TRAJECTORY_ACCURACY_PROMPT = exports.createTrajectoryLLMAsJudge = exports.trajectoryUnorderedMatch = exports.trajectorySuperset = exports.trajectorySubset = exports.trajectoryStrictMatch = void 0;
+ exports.GRAPH_TRAJECTORY_ACCURACY_PROMPT = exports.createGraphTrajectoryLLMAsJudge = exports.TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE = exports.TRAJECTORY_ACCURACY_PROMPT = exports.createTrajectoryLLMAsJudge = exports.createTrajectoryMatchEvaluator = exports.trajectoryUnorderedMatch = exports.trajectorySuperset = exports.trajectorySubset = exports.trajectoryStrictMatch = void 0;
  var strict_js_1 = require("./trajectory/strict.cjs");
  Object.defineProperty(exports, "trajectoryStrictMatch", { enumerable: true, get: function () { return strict_js_1.trajectoryStrictMatch; } });
  var subset_js_1 = require("./trajectory/subset.cjs");
@@ -23,6 +23,8 @@ var superset_js_1 = require("./trajectory/superset.cjs");
  Object.defineProperty(exports, "trajectorySuperset", { enumerable: true, get: function () { return superset_js_1.trajectorySuperset; } });
  var unordered_js_1 = require("./trajectory/unordered.cjs");
  Object.defineProperty(exports, "trajectoryUnorderedMatch", { enumerable: true, get: function () { return unordered_js_1.trajectoryUnorderedMatch; } });
+ var match_js_1 = require("./trajectory/match.cjs");
+ Object.defineProperty(exports, "createTrajectoryMatchEvaluator", { enumerable: true, get: function () { return match_js_1.createTrajectoryMatchEvaluator; } });
  var llm_js_1 = require("./trajectory/llm.cjs");
  Object.defineProperty(exports, "createTrajectoryLLMAsJudge", { enumerable: true, get: function () { return llm_js_1.createTrajectoryLLMAsJudge; } });
  Object.defineProperty(exports, "TRAJECTORY_ACCURACY_PROMPT", { enumerable: true, get: function () { return llm_js_1.TRAJECTORY_ACCURACY_PROMPT; } });
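Based on the CommonJS exports registered above, CJS consumers can presumably pull the new factory straight from the package root alongside the legacy matchers; a minimal sketch:

```ts
// Hypothetical CJS usage; mirrors the exports wired up in index.cjs above.
const { createTrajectoryMatchEvaluator, trajectoryStrictMatch } = require("agentevals");

const evaluator = createTrajectoryMatchEvaluator({ trajectoryMatchMode: "strict" });
```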
package/dist/index.d.ts CHANGED
@@ -2,6 +2,7 @@ export { trajectoryStrictMatch } from "./trajectory/strict.js";
  export { trajectorySubset } from "./trajectory/subset.js";
  export { trajectorySuperset } from "./trajectory/superset.js";
  export { trajectoryUnorderedMatch } from "./trajectory/unordered.js";
+ export { createTrajectoryMatchEvaluator, type TrajectoryMatchMode, } from "./trajectory/match.js";
  export { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT, TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE, } from "./trajectory/llm.js";
  export { createGraphTrajectoryLLMAsJudge, GRAPH_TRAJECTORY_ACCURACY_PROMPT, } from "./graph_trajectory/llm.js";
  export * from "./types.js";
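Since the typings above also export the `TrajectoryMatchMode` type, downstream code can constrain mode values at compile time; a small hypothetical helper:

```ts
import { createTrajectoryMatchEvaluator, type TrajectoryMatchMode } from "agentevals";

// Hypothetical helper: builds an evaluator for a caller-supplied mode,
// with the union of valid modes checked by the compiler.
function makeEvaluator(mode: TrajectoryMatchMode) {
  return createTrajectoryMatchEvaluator({ trajectoryMatchMode: mode });
}
```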
package/dist/index.js CHANGED
@@ -2,6 +2,7 @@ export { trajectoryStrictMatch } from "./trajectory/strict.js";
  export { trajectorySubset } from "./trajectory/subset.js";
  export { trajectorySuperset } from "./trajectory/superset.js";
  export { trajectoryUnorderedMatch } from "./trajectory/unordered.js";
+ export { createTrajectoryMatchEvaluator, } from "./trajectory/match.js";
  export { createTrajectoryLLMAsJudge, TRAJECTORY_ACCURACY_PROMPT, TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE, } from "./trajectory/llm.js";
  export { createGraphTrajectoryLLMAsJudge, GRAPH_TRAJECTORY_ACCURACY_PROMPT, } from "./graph_trajectory/llm.js";
  export * from "./types.js";