agentv 1.0.0 → 1.2.0
- package/dist/{chunk-RIJO5WBF.js → chunk-IVIT4U6S.js} +52 -256
- package/dist/chunk-IVIT4U6S.js.map +1 -0
- package/dist/cli.js +1 -1
- package/dist/index.js +1 -1
- package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +0 -16
- package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md +0 -27
- package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md +10 -68
- package/package.json +1 -1
- package/dist/chunk-RIJO5WBF.js.map +0 -1
package/dist/cli.js
CHANGED
package/dist/index.js
CHANGED
package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md
CHANGED
@@ -79,22 +79,6 @@ execution:
 
 See `references/tool-trajectory-evaluator.md` for modes and configuration.
 
-### Expected Tool Calls Evaluators
-Validate tool calls and inputs inline with conversation flow:
-
-```yaml
-expected_messages:
-  - role: assistant
-    tool_calls:
-      - tool: getMetrics
-        input: { server: "prod-1" }
-
-execution:
-  evaluators:
-    - name: input_check
-      type: expected_tool_calls
-```
-
 ### Multiple Evaluators
 Define multiple evaluators to run sequentially. The final score is a weighted average of all results.
 
package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md
CHANGED
@@ -142,33 +142,6 @@ evalcases:
             - tool: generateToken
 ```
 
-## Expected Messages with Tool Calls
-
-Validate precise tool inputs inline with expected messages.
-
-```yaml
-$schema: agentv-eval-v2
-description: Tool input validation
-target: mock_agent
-
-evalcases:
-  - id: precise-inputs
-    expected_outcome: Agent calls tools with correct parameters
-    input_messages:
-      - role: user
-        content: Check CPU metrics for prod-1
-    expected_messages:
-      - role: assistant
-        content: Checking metrics...
-        tool_calls:
-          - tool: getCpuMetrics
-            input: { server: "prod-1" }
-    execution:
-      evaluators:
-        - name: input-validator
-          type: expected_tool_calls
-```
-
 ## Static Trace Evaluation
 
 Evaluate pre-existing trace files without running an agent.
package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md
CHANGED
@@ -2,13 +2,6 @@
 
 Tool trajectory evaluators validate that an agent used the expected tools during execution. They work with trace data returned by agent providers (codex, vscode, cli with trace support).
 
-## Evaluator Types
-
-AgentV provides two ways to validate tool usage:
-
-1. **`tool_trajectory`** - Dedicated evaluator with configurable matching modes
-2. **`expected_messages`** - Inline tool_calls in expected_messages for simpler cases
-
 ## Tool Trajectory Evaluator
 
 ### Modes
@@ -76,50 +69,6 @@ execution:
 - Strict protocol validation
 - Regression testing specific behavior
 
-## Expected Tool Calls Evaluator
-
-For simpler cases, specify tool_calls inline in `expected_messages`:
-
-```yaml
-evalcases:
-  - id: research-task
-    expected_outcome: Agent searches and retrieves documents
-
-    input_messages:
-      - role: user
-        content: Research REST vs GraphQL differences
-
-    expected_messages:
-      - role: assistant
-        content: I'll research this topic.
-        tool_calls:
-          - tool: knowledgeSearch
-          - tool: knowledgeSearch
-          - tool: documentRetrieve
-
-    execution:
-      evaluators:
-        - name: tool-validator
-          type: expected_tool_calls
-```
-
-### With Input Matching
-
-Validate specific inputs were passed to tools:
-
-```yaml
-expected_messages:
-  - role: assistant
-    content: Checking metrics...
-    tool_calls:
-      - tool: getCpuMetrics
-        input:
-          server: prod-1
-      - tool: getMemoryMetrics
-        input:
-          server: prod-1
-```
-
 ## Scoring
 
 ### tool_trajectory Scoring
@@ -130,10 +79,6 @@ expected_messages:
 | `in_order` | (matched tools in sequence) / (expected tools count) |
 | `exact` | (correctly positioned tools) / (expected tools count) |
 
-### expected_tool_calls Scoring
-
-Sequential matching: `(matched tool_calls) / (expected tool_calls)`
-
 ## Trace Data Requirements
 
 Tool trajectory evaluators require trace data from the agent provider. Supported providers:
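The `in_order` and `exact` rows in the scoring table above can be read as simple ratios. The sketch below is one possible reading of those two formulas, not agentv's actual implementation: the names `Mode` and `scoreTrajectory`, the greedy matching used for `in_order`, and the score of 1 for an empty expectation list are all assumptions made for illustration.

```typescript
// Hypothetical helper; these names are not part of the agentv API.
type Mode = "in_order" | "exact";

function scoreTrajectory(mode: Mode, expected: string[], actual: string[]): number {
  // Assumption: an empty expectation list counts as a perfect score.
  if (expected.length === 0) return 1;

  let matched = 0;
  if (mode === "exact") {
    // Count positions where the actual tool equals the expected tool.
    expected.forEach((tool, i) => {
      if (actual[i] === tool) matched += 1;
    });
  } else {
    // Greedy in-sequence match: walk the actual calls once and advance
    // through the expected list whenever the next expected tool appears.
    let cursor = 0;
    for (const tool of actual) {
      if (cursor < expected.length && tool === expected[cursor]) cursor += 1;
    }
    matched = cursor;
  }
  return matched / expected.length;
}
```

For example, with expected tools `loadData, validate, transform, export` and an actual trace `loadData, log, validate, transform, export`, this sketch gives 1 for `in_order` (all four appear in sequence) and 0.25 for `exact` (only the first position lines up).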
@@ -198,24 +143,21 @@ evalcases:
 evalcases:
   - id: data-pipeline
     expected_outcome: Process data through complete pipeline
-
+
     input_messages:
       - role: user
         content: Process the customer dataset
-
-    expected_messages:
-      - role: assistant
-        content: Processing data...
-        tool_calls:
-          - tool: loadData
-          - tool: validate
-          - tool: transform
-          - tool: export
-
+
     execution:
       evaluators:
         - name: pipeline-check
-          type:
+          type: tool_trajectory
+          mode: exact
+          expected:
+            - tool: loadData
+            - tool: validate
+            - tool: transform
+            - tool: export
 ```
 
 ## CLI Options for Traces
@@ -234,4 +176,4 @@ agentv eval evals/test.yaml --include-trace
 2. **Start with any_order** - Then tighten to `in_order` or `exact` as needed
 3. **Combine with other evaluators** - Use tool trajectory for execution, LLM judge for output quality
 4. **Test with --dump-traces** - Inspect actual traces to understand agent behavior
-5. **Use
+5. **Use code evaluators for custom validation** - Write custom tool validation scripts with access to trace data
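Item 5 above points to custom scripts for checks such as tool input matching, whose inline `expected_tool_calls` documentation is removed in this diff. The following is a minimal sketch of what the core of such a check might look like. The `ToolCall` shape and the functions `inputMatches` and `scoreToolInputs` are hypothetical, and the sketch does not show how agentv passes trace data to a code evaluator; consult the package documentation for the real contract.

```typescript
// Hypothetical shape; the real agentv trace format may differ.
interface ToolCall {
  tool: string;
  input?: Record<string, unknown>;
}

// True when every key in `expected` is present in `actual` with an equal value.
// Extra keys in `actual` are ignored. Values are compared via JSON.stringify,
// so nested objects must list their keys in the same order to compare equal.
function inputMatches(
  expected: Record<string, unknown> | undefined,
  actual: Record<string, unknown> | undefined,
): boolean {
  if (!expected) return true; // no input constraint on this call
  if (!actual) return false;
  return Object.entries(expected).every(
    ([key, value]) => JSON.stringify(actual[key]) === JSON.stringify(value),
  );
}

// Positional matching: (matched tool calls) / (expected tool calls).
function scoreToolInputs(expected: ToolCall[], actual: ToolCall[]): number {
  // Assumption: an empty expectation list counts as a perfect score.
  if (expected.length === 0) return 1;
  let matched = 0;
  expected.forEach((call, i) => {
    const candidate = actual[i];
    if (candidate && candidate.tool === call.tool && inputMatches(call.input, candidate.input)) {
      matched += 1;
    }
  });
  return matched / expected.length;
}
```

Comparing only the keys named in the expectation keeps the check tolerant of extra parameters an agent may add; a stricter script could require full equality instead.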