agentv 0.23.0 → 0.25.0
- package/README.md +13 -10
- package/dist/{chunk-4T62HFF4.js → chunk-ZVSFP6NK.js} +822 -233
- package/dist/chunk-ZVSFP6NK.js.map +1 -0
- package/dist/cli.js +1 -1
- package/dist/cli.js.map +1 -1
- package/dist/index.js +1 -1
- package/dist/templates/.agentv/.env.template +10 -10
- package/dist/templates/.agentv/targets.yaml +8 -1
- package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +75 -6
- package/dist/templates/.claude/skills/agentv-eval-builder/references/composite-evaluator.md +215 -0
- package/dist/templates/.claude/skills/agentv-eval-builder/references/eval-schema.json +217 -217
- package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md +139 -0
- package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md +237 -0
- package/package.json +1 -1
- package/dist/chunk-4T62HFF4.js.map +0 -1
- package/dist/templates/agentv/.env.template +0 -23
package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md
ADDED
@@ -0,0 +1,237 @@
# Tool Trajectory Evaluator Guide

Tool trajectory evaluators validate that an agent used the expected tools during execution. They work with trace data returned by agent providers (codex, vscode, cli with trace support).

## Evaluator Types

AgentV provides two ways to validate tool usage:

1. **`tool_trajectory`** - Dedicated evaluator with configurable matching modes
2. **`expected_messages`** - Inline `tool_calls` in `expected_messages` for simpler cases

## Tool Trajectory Evaluator

### Modes

#### 1. `any_order` - Minimum Tool Counts

Validates that each tool was called at least N times, regardless of order:

```yaml
execution:
  evaluators:
    - name: tool-usage
      type: tool_trajectory
      mode: any_order
      minimums:
        knowledgeSearch: 2    # Must be called at least twice
        documentRetrieve: 1   # Must be called at least once
```

**Use cases:**
- Ensure required tools are used
- Don't care about execution order
- Allow flexibility in agent implementation
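
As a rough illustration of the `any_order` check, the sketch below counts tool calls and compares them with the configured minimums, using the score formula from the Scoring table in this guide. It is a minimal Python sketch, not AgentV's implementation: the function name `any_order_score`, the plain list of tool names, and the handling of an empty `minimums` map are assumptions made for the example.

```python
from collections import Counter


def any_order_score(actual_tools, minimums):
    """Fraction of tools in `minimums` that were called at least the required number of times."""
    if not minimums:
        return 1.0  # assumption: nothing required counts as a full score
    counts = Counter(actual_tools)  # missing tools count as 0
    met = sum(1 for tool, required in minimums.items() if counts[tool] >= required)
    return met / len(minimums)


# Hypothetical trace: order is irrelevant, only the counts matter.
calls = ["knowledgeSearch", "documentRetrieve", "knowledgeSearch"]
print(any_order_score(calls, {"knowledgeSearch": 2, "documentRetrieve": 1}))  # 1.0
```

Dropping one `knowledgeSearch` call from the trace would leave only `documentRetrieve` meeting its minimum, so the score would fall to 0.5.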

#### 2. `in_order` - Sequential Matching

Validates tools appear in the expected sequence, but allows gaps (other tools can appear between):

```yaml
execution:
  evaluators:
    - name: workflow-sequence
      type: tool_trajectory
      mode: in_order
      expected:
        - tool: fetchData
        - tool: validateSchema
        - tool: transformData
        - tool: saveResults
```

**Use cases:**
- Validate logical workflow order
- Allow agent to use additional helper tools
- Check that key steps happen in sequence
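
One way to picture `in_order` matching is a single left-to-right scan over the actual tool calls that advances through the expected list whenever the next expected tool shows up. The Python sketch below follows that idea and the score formula from the Scoring table in this guide. The greedy scan and the name `in_order_score` are assumptions for illustration; AgentV's actual matching may differ.

```python
def in_order_score(actual_tools, expected_tools):
    """Matched expected tools (in sequence, gaps allowed) divided by the expected count."""
    if not expected_tools:
        return 1.0  # assumption: an empty expectation is trivially satisfied
    matched = 0
    for tool in actual_tools:
        # Advance only when the call equals the next expected tool; other calls are gaps.
        if matched < len(expected_tools) and tool == expected_tools[matched]:
            matched += 1
    return matched / len(expected_tools)


expected = ["fetchData", "validateSchema", "transformData", "saveResults"]
# "logProgress" is a hypothetical helper tool that appears between expected steps.
actual = ["fetchData", "logProgress", "validateSchema", "transformData", "saveResults"]
print(in_order_score(actual, expected))  # 1.0
```

With this scan, a trace that calls `validateSchema` before `fetchData` only gets credit for the steps that still appear in order after the first expected tool is found.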

#### 3. `exact` - Strict Sequence Match

Validates the exact tool sequence with no gaps or extra tools:

```yaml
execution:
  evaluators:
    - name: auth-sequence
      type: tool_trajectory
      mode: exact
      expected:
        - tool: checkCredentials
        - tool: generateToken
        - tool: auditLog
```

**Use cases:**
- Security-critical workflows
- Strict protocol validation
- Regression testing specific behavior
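
For `exact` mode, a position-by-position comparison is the natural reading of the score formula in the Scoring table of this guide. The Python sketch below computes that score and, separately, whether the two sequences are identical. The guide does not say how extra calls beyond the expected list affect the score, so the separate `identical` flag and the name `exact_match` are assumptions made for this example.

```python
def exact_match(actual_tools, expected_tools):
    """Return (score, identical): positional matches / expected count, plus a strict equality flag."""
    identical = list(actual_tools) == list(expected_tools)
    if not expected_tools:
        return (1.0 if identical else 0.0, identical)
    # zip stops at the shorter sequence, so missing calls simply do not match.
    correct = sum(1 for a, e in zip(actual_tools, expected_tools) if a == e)
    return (correct / len(expected_tools), identical)


expected = ["checkCredentials", "generateToken", "auditLog"]
print(exact_match(["checkCredentials", "generateToken", "auditLog"], expected))  # (1.0, True)
```

A trace with an extra trailing call would still match every expected position in this sketch, which is why the `identical` flag is reported alongside the score.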

## Expected Messages Evaluator

For simpler cases, specify tool_calls inline in `expected_messages`:

```yaml
evalcases:
  - id: research-task
    expected_outcome: Agent searches and retrieves documents

    input_messages:
      - role: user
        content: Research REST vs GraphQL differences

    expected_messages:
      - role: assistant
        content: I'll research this topic.
        tool_calls:
          - tool: knowledgeSearch
          - tool: knowledgeSearch
          - tool: documentRetrieve

    execution:
      evaluators:
        - name: tool-validator
          type: expected_messages
```

### With Input Matching

Validate specific inputs were passed to tools:

```yaml
expected_messages:
  - role: assistant
    content: Checking metrics...
    tool_calls:
      - tool: getCpuMetrics
        input:
          server: prod-1
      - tool: getMemoryMetrics
        input:
          server: prod-1
```
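
To make input matching concrete, the Python sketch below compares expected `tool_calls` against traced calls in sequence. It assumes that an expected `input` matches when every expected key is present in the actual input with an equal value (extra actual keys are ignored); the guide does not state the exact comparison rule, so treat this as one plausible reading rather than AgentV's behavior. The helper names `call_matches` and `expected_messages_score` are hypothetical. Expected calls use the `tool` key from the YAML above, and actual calls use the `name` key from the trace event structure described in this guide.

```python
def call_matches(expected_call, actual_call):
    """True when tool names agree and every expected input key has an equal actual value."""
    if expected_call["tool"] != actual_call["name"]:
        return False
    wanted = expected_call.get("input")
    if wanted is None:
        return True  # no input constraint was specified
    got = actual_call.get("input") or {}
    return all(key in got and got[key] == value for key, value in wanted.items())


def expected_messages_score(expected_calls, actual_calls):
    """Sequential matching: matched tool_calls divided by expected tool_calls."""
    if not expected_calls:
        return 1.0  # assumption: nothing expected counts as a full score
    matched = 0
    for call in actual_calls:
        if matched < len(expected_calls) and call_matches(expected_calls[matched], call):
            matched += 1
    return matched / len(expected_calls)


expected_calls = [
    {"tool": "getCpuMetrics", "input": {"server": "prod-1"}},
    {"tool": "getMemoryMetrics", "input": {"server": "prod-1"}},
]
# Hypothetical trace: the second call targets a different server, so it does not match.
actual_calls = [
    {"name": "getCpuMetrics", "input": {"server": "prod-1", "window": "5m"}},
    {"name": "getMemoryMetrics", "input": {"server": "prod-2"}},
]
print(expected_messages_score(expected_calls, actual_calls))  # 0.5
```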

## Scoring

### tool_trajectory Scoring

| Mode | Score Calculation |
|------|------------------|
| `any_order` | (tools meeting minimum) / (total tools with minimums) |
| `in_order` | (matched tools in sequence) / (expected tools count) |
| `exact` | (correctly positioned tools) / (expected tools count) |

### expected_messages Scoring

Sequential matching: `(matched tool_calls) / (expected tool_calls)`

## Trace Data Requirements

Tool trajectory evaluators require trace data from the agent provider. Supported providers:

- **codex** - Returns trace via JSONL log events
- **vscode / vscode-insiders** - Returns trace from Copilot execution
- **cli** - Can return trace if agent outputs trace format

### Trace Event Structure

```json
{
  "type": "tool_call",
  "name": "knowledgeSearch",
  "input": { "query": "REST vs GraphQL" },
  "timestamp": "2024-01-15T10:30:00Z"
}
```
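
When inspecting traces by hand, it can help to reduce a trace to the ordered list of tool names that the evaluators compare. The Python sketch below does this for events shaped like the example above. It assumes one JSON event per line (JSONL) and that non-tool events carry a different `type`; the actual layout of dumped trace files is not specified in this guide, and the function name `tool_names_from_jsonl` is hypothetical.

```python
import json


def tool_names_from_jsonl(text):
    """Return the names of all tool_call events, in the order they appear."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between events
        event = json.loads(line)
        if event.get("type") == "tool_call":
            names.append(event["name"])
    return names


# Hypothetical trace content; the "message" event type is invented for the example.
sample = "\n".join([
    '{"type": "tool_call", "name": "knowledgeSearch", "input": {"query": "REST vs GraphQL"}, "timestamp": "2024-01-15T10:30:00Z"}',
    '{"type": "message", "content": "Looking at the results"}',
    '{"type": "tool_call", "name": "documentRetrieve", "input": {"id": "doc-1"}, "timestamp": "2024-01-15T10:30:05Z"}',
])
print(tool_names_from_jsonl(sample))  # ['knowledgeSearch', 'documentRetrieve']
```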

## Complete Examples

### Research Agent Validation

```yaml
$schema: agentv-eval-v2
description: Validate research agent tool usage

target: codex_agent  # Provider that returns traces

evalcases:
  - id: comprehensive-research
    expected_outcome: Agent thoroughly researches the topic

    input_messages:
      - role: user
        content: Research machine learning frameworks

    execution:
      evaluators:
        # Check minimum tool usage
        - name: coverage
          type: tool_trajectory
          mode: any_order
          minimums:
            webSearch: 1
            documentRead: 2
            noteTaking: 1

        # Check workflow order
        - name: workflow
          type: tool_trajectory
          mode: in_order
          expected:
            - tool: webSearch
            - tool: documentRead
            - tool: summarize
```

### Multi-Step Pipeline

```yaml
evalcases:
  - id: data-pipeline
    expected_outcome: Process data through complete pipeline

    input_messages:
      - role: user
        content: Process the customer dataset

    expected_messages:
      - role: assistant
        content: Processing data...
        tool_calls:
          - tool: loadData
          - tool: validate
          - tool: transform
          - tool: export

    execution:
      evaluators:
        - name: pipeline-check
          type: expected_messages
```

## CLI Options for Traces

```bash
# Write trace files to disk
agentv eval evals/test.yaml --dump-traces

# Include full trace in result output
agentv eval evals/test.yaml --include-trace
```

## Best Practices

1. **Choose the right mode** - Use `any_order` for flexibility, `exact` for strict validation
2. **Start with any_order** - Then tighten to `in_order` or `exact` as needed
3. **Combine with other evaluators** - Use tool trajectory for execution, LLM judge for output quality
4. **Test with --dump-traces** - Inspect actual traces to understand agent behavior
5. **Use expected_messages for simple cases** - It's more readable for basic tool validation