agentv 0.23.0 → 0.25.0

@@ -0,0 +1,237 @@
1
+ # Tool Trajectory Evaluator Guide
2
+
3
+ Tool trajectory evaluators validate that an agent used the expected tools during execution. They work with trace data returned by agent providers (codex, vscode, cli with trace support).
4
+
5
+ ## Evaluator Types
6
+
7
+ AgentV provides two ways to validate tool usage:
8
+
9
+ 1. **`tool_trajectory`** - Dedicated evaluator with configurable matching modes
10
+ 2. **`expected_messages`** - Inline `tool_calls` in `expected_messages` for simpler cases
11
+
12
+ ## Tool Trajectory Evaluator
13
+
14
+ ### Modes
15
+
16
+ #### 1. `any_order` - Minimum Tool Counts
17
+
18
+ Validates that each tool was called at least N times, regardless of order:
19
+
20
+ ```yaml
21
+ execution:
22
+   evaluators:
23
+     - name: tool-usage
24
+       type: tool_trajectory
25
+       mode: any_order
26
+       minimums:
27
+         knowledgeSearch: 2 # Must be called at least twice
28
+         documentRetrieve: 1 # Must be called at least once
29
+ ```
30
+
31
+ **Use cases:**
32
+ - Ensure required tools are used
33
+ - Execution order doesn't matter
34
+ - Allow flexibility in agent implementation
35
+
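The `any_order` check can be sketched in Python as follows. This is a minimal illustration, assuming the trace has already been reduced to a list of tool names; the helper name `any_order_score` is hypothetical, and AgentV's actual implementation may differ. The score follows the `any_order` formula given in the Scoring section of this guide.

```python
from collections import Counter


def any_order_score(called_tools: list[str], minimums: dict[str, int]) -> float:
    """Fraction of tools whose call count meets its configured minimum.

    Hypothetical helper for illustration; not part of AgentV's API.
    """
    if not minimums:
        return 1.0  # Assumption: with nothing required, nothing can fail.
    counts = Counter(called_tools)
    met = sum(1 for tool, minimum in minimums.items() if counts[tool] >= minimum)
    return met / len(minimums)


# knowledgeSearch is called twice (meets its minimum of 2);
# documentRetrieve is never called (misses its minimum of 1).
score = any_order_score(
    ["knowledgeSearch", "webFetch", "knowledgeSearch"],
    {"knowledgeSearch": 2, "documentRetrieve": 1},
)
print(score)  # 0.5
```

Calls to tools that have no entry in `minimums` (here the made-up `webFetch`) are ignored, which matches the stated intent of leaving the agent free in how it reaches the result.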
36
+ #### 2. `in_order` - Sequential Matching
37
+
38
+ Validates that tools appear in the expected sequence while allowing gaps (other tools can appear between them):
39
+
40
+ ```yaml
41
+ execution:
42
+   evaluators:
43
+     - name: workflow-sequence
44
+       type: tool_trajectory
45
+       mode: in_order
46
+       expected:
47
+         - tool: fetchData
48
+         - tool: validateSchema
49
+         - tool: transformData
50
+         - tool: saveResults
51
+ ```
52
+
53
+ **Use cases:**
54
+ - Validate logical workflow order
55
+ - Allow agent to use additional helper tools
56
+ - Check that key steps happen in sequence
57
+
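Sequential matching with gaps amounts to a subsequence check. The sketch below is one plausible reading, assuming a greedy left-to-right match over a list of tool names; `in_order_score` is a hypothetical name, and the guide does not say whether AgentV continues past an expected tool that never appears, so this version simply stops advancing at the first one it cannot find.

```python
def in_order_score(called_tools: list[str], expected: list[str]) -> float:
    """Greedy subsequence match: fraction of expected tools found in order.

    Hypothetical helper for illustration; not part of AgentV's API.
    """
    if not expected:
        return 1.0  # Assumption: an empty expectation is trivially satisfied.
    matched = 0
    for tool in called_tools:
        # Advance only when the next expected tool shows up; other calls are gaps.
        if matched < len(expected) and tool == expected[matched]:
            matched += 1
    return matched / len(expected)


called = ["fetchData", "logProgress", "validateSchema", "saveResults"]
expected = ["fetchData", "validateSchema", "transformData", "saveResults"]
print(in_order_score(called, expected))  # 0.5
```

In this run `logProgress` (a made-up helper tool) is tolerated as a gap, but the missing `transformData` blocks `saveResults` from matching, so only two of the four expected tools count.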
58
+ #### 3. `exact` - Strict Sequence Match
59
+
60
+ Validates the exact tool sequence with no gaps or extra tools:
61
+
62
+ ```yaml
63
+ execution:
64
+   evaluators:
65
+     - name: auth-sequence
66
+       type: tool_trajectory
67
+       mode: exact
68
+       expected:
69
+         - tool: checkCredentials
70
+         - tool: generateToken
71
+         - tool: auditLog
72
+ ```
73
+
74
+ **Use cases:**
75
+ - Security-critical workflows
76
+ - Strict protocol validation
77
+ - Regression testing specific behavior
78
+
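A positional comparison is the simplest way to picture `exact` mode. The following is a rough sketch under stated assumptions, not AgentV's implementation: `exact_score` is a hypothetical name, the score uses the "correctly positioned tools" formula from the Scoring section, and because the guide does not say how extra trailing calls affect the score, the sketch reports strict equality separately.

```python
def exact_score(called_tools: list[str], expected: list[str]) -> float:
    """Fraction of expected positions holding the expected tool.

    Hypothetical helper for illustration; not part of AgentV's API.
    """
    if not expected:
        # Assumption: with no expected calls, only an empty trace passes.
        return 1.0 if not called_tools else 0.0
    correct = sum(1 for actual, want in zip(called_tools, expected) if actual == want)
    return correct / len(expected)


expected = ["checkCredentials", "generateToken", "auditLog"]
called = ["checkCredentials", "auditLog", "generateToken"]

# Only position 0 matches, so 1 of 3 expected positions is correct.
print(exact_score(called, expected))
# Strict pass/fail: same tools, same order, nothing extra.
print(called == expected)  # False
```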
79
+ ## Expected Messages Evaluator
80
+
81
+ For simpler cases, specify tool_calls inline in `expected_messages`:
82
+
83
+ ```yaml
84
+ evalcases:
85
+   - id: research-task
86
+     expected_outcome: Agent searches and retrieves documents
87
+
88
+     input_messages:
89
+       - role: user
90
+         content: Research REST vs GraphQL differences
91
+
92
+     expected_messages:
93
+       - role: assistant
94
+         content: I'll research this topic.
95
+         tool_calls:
96
+           - tool: knowledgeSearch
97
+           - tool: knowledgeSearch
98
+           - tool: documentRetrieve
99
+
100
+     execution:
101
+       evaluators:
102
+         - name: tool-validator
103
+           type: expected_messages
104
+ ```
105
+
106
+ ### With Input Matching
107
+
108
+ Validate that specific inputs were passed to tools:
109
+
110
+ ```yaml
111
+ expected_messages:
112
+   - role: assistant
113
+     content: Checking metrics...
114
+     tool_calls:
115
+       - tool: getCpuMetrics
116
+         input:
117
+           server: prod-1
118
+       - tool: getMemoryMetrics
119
+         input:
120
+           server: prod-1
121
+ ```
122
+
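Input matching can be pictured as a sequential match in which a call counts only if both the tool name and the inputs agree. The sketch below is illustrative only: `call_matches` and `expected_messages_score` are hypothetical names, actual calls are assumed to use the trace event shape shown under Trace Event Structure (`name`, `input`), and the comparison rule (every expected key must be present with an equal value, extra actual keys are ignored) is an assumption, since the guide does not state how AgentV compares inputs.

```python
from typing import Any


def call_matches(expected: dict[str, Any], actual: dict[str, Any]) -> bool:
    """True if the tool name matches and every expected input key has an equal value.

    Assumption: an expected call without `input` matches any input.
    """
    if expected["tool"] != actual.get("name"):
        return False
    expected_input = expected.get("input") or {}
    actual_input = actual.get("input") or {}
    return all(
        key in actual_input and actual_input[key] == value
        for key, value in expected_input.items()
    )


def expected_messages_score(
    expected_calls: list[dict[str, Any]], actual_calls: list[dict[str, Any]]
) -> float:
    """Sequential matching: (matched tool_calls) / (expected tool_calls)."""
    if not expected_calls:
        return 1.0  # Assumption: nothing expected means nothing can fail.
    matched = 0
    for actual in actual_calls:
        if matched < len(expected_calls) and call_matches(expected_calls[matched], actual):
            matched += 1
    return matched / len(expected_calls)


expected_calls = [
    {"tool": "getCpuMetrics", "input": {"server": "prod-1"}},
    {"tool": "getMemoryMetrics", "input": {"server": "prod-1"}},
]
actual_calls = [
    {"type": "tool_call", "name": "getCpuMetrics", "input": {"server": "prod-1"}},
    {"type": "tool_call", "name": "getMemoryMetrics", "input": {"server": "prod-2"}},
]
print(expected_messages_score(expected_calls, actual_calls))  # 0.5
```

The second call targets `prod-2` instead of `prod-1`, so it does not count even though the tool name is right.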
123
+ ## Scoring
124
+
125
+ ### tool_trajectory Scoring
126
+
127
+ | Mode | Score Calculation |
128
+ |------|------------------|
129
+ | `any_order` | (tools meeting minimum) / (total tools with minimums) |
130
+ | `in_order` | (matched tools in sequence) / (expected tools count) |
131
+ | `exact` | (correctly positioned tools) / (expected tools count) |
132
+
133
+ ### expected_messages Scoring
134
+
135
+ Sequential matching: `(matched tool_calls) / (expected tool_calls)`
136
+
137
+ ## Trace Data Requirements
138
+
139
+ Tool trajectory evaluators require trace data from the agent provider. Supported providers:
140
+
141
+ - **codex** - Returns trace via JSONL log events
142
+ - **vscode / vscode-insiders** - Returns trace from Copilot execution
143
+ - **cli** - Can return trace if agent outputs trace format
144
+
145
+ ### Trace Event Structure
146
+
147
+ ```json
148
+ {
149
+   "type": "tool_call",
150
+   "name": "knowledgeSearch",
151
+   "input": { "query": "REST vs GraphQL" },
152
+   "timestamp": "2024-01-15T10:30:00Z"
153
+ }
154
+ ```
155
+
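When inspecting traces by hand, it can help to reduce a log to its tool calls. The following sketch assumes a JSONL log in which each line is one JSON object shaped like the event above; the helper name `tool_calls_from_jsonl` and the `message` event in the sample are hypothetical, and real provider logs may contain other event types or fields.

```python
import json
from typing import Any


def tool_calls_from_jsonl(jsonl_text: str) -> list[dict[str, Any]]:
    """Return the events whose `type` is "tool_call", in log order."""
    calls = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # Skip blank lines between events.
        event = json.loads(line)
        if isinstance(event, dict) and event.get("type") == "tool_call":
            calls.append(event)
    return calls


sample_log = "\n".join(
    [
        '{"type": "tool_call", "name": "knowledgeSearch", '
        '"input": {"query": "REST vs GraphQL"}, "timestamp": "2024-01-15T10:30:00Z"}',
        '{"type": "message", "text": "hypothetical non-tool event"}',
        '{"type": "tool_call", "name": "documentRetrieve", '
        '"input": {"id": "doc-1"}, "timestamp": "2024-01-15T10:30:05Z"}',
    ]
)

tool_names = [event["name"] for event in tool_calls_from_jsonl(sample_log)]
print(tool_names)  # ['knowledgeSearch', 'documentRetrieve']
```

The resulting list of names is the kind of input the trajectory modes operate on: counts for `any_order`, sequence for `in_order` and `exact`.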
156
+ ## Complete Examples
157
+
158
+ ### Research Agent Validation
159
+
160
+ ```yaml
161
+ $schema: agentv-eval-v2
162
+ description: Validate research agent tool usage
163
+
164
+ target: codex_agent # Provider that returns traces
165
+
166
+ evalcases:
167
+   - id: comprehensive-research
168
+     expected_outcome: Agent thoroughly researches the topic
169
+
170
+     input_messages:
171
+       - role: user
172
+         content: Research machine learning frameworks
173
+
174
+     execution:
175
+       evaluators:
176
+         # Check minimum tool usage
177
+         - name: coverage
178
+           type: tool_trajectory
179
+           mode: any_order
180
+           minimums:
181
+             webSearch: 1
182
+             documentRead: 2
183
+             noteTaking: 1
184
+
185
+         # Check workflow order
186
+         - name: workflow
187
+           type: tool_trajectory
188
+           mode: in_order
189
+           expected:
190
+             - tool: webSearch
191
+             - tool: documentRead
192
+             - tool: summarize
193
+ ```
194
+
195
+ ### Multi-Step Pipeline
196
+
197
+ ```yaml
198
+ evalcases:
199
+   - id: data-pipeline
200
+     expected_outcome: Process data through complete pipeline
201
+
202
+     input_messages:
203
+       - role: user
204
+         content: Process the customer dataset
205
+
206
+     expected_messages:
207
+       - role: assistant
208
+         content: Processing data...
209
+         tool_calls:
210
+           - tool: loadData
211
+           - tool: validate
212
+           - tool: transform
213
+           - tool: export
214
+
215
+     execution:
216
+       evaluators:
217
+         - name: pipeline-check
218
+           type: expected_messages
219
+ ```
220
+
221
+ ## CLI Options for Traces
222
+
223
+ ```bash
224
+ # Write trace files to disk
225
+ agentv eval evals/test.yaml --dump-traces
226
+
227
+ # Include full trace in result output
228
+ agentv eval evals/test.yaml --include-trace
229
+ ```
230
+
231
+ ## Best Practices
232
+
233
+ 1. **Choose the right mode** - Use `any_order` for flexibility, `exact` for strict validation
234
+ 2. **Start with any_order** - Then tighten to `in_order` or `exact` as needed
235
+ 3. **Combine with other evaluators** - Use tool trajectory for execution, LLM judge for output quality
236
+ 4. **Test with --dump-traces** - Inspect actual traces to understand agent behavior
237
+ 5. **Use expected_messages for simple cases** - It's more readable for basic tool validation
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "agentv",
3
- "version": "0.23.0",
3
+ "version": "0.25.0",
4
4
  "description": "CLI entry point for AgentV",
5
5
  "type": "module",
6
6
  "repository": {