agentv 0.25.0 → 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +3 -3
- package/dist/{chunk-ZVSFP6NK.js → chunk-RIJO5WBF.js} +94 -33
- package/dist/chunk-RIJO5WBF.js.map +1 -0
- package/dist/cli.js +1 -1
- package/dist/index.js +1 -1
- package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +37 -20
- package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md +94 -2
- package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md +8 -8
- package/package.json +1 -1
- package/dist/chunk-ZVSFP6NK.js.map +0 -1
package/dist/cli.js
CHANGED
package/dist/index.js
CHANGED
package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md
CHANGED

@@ -44,7 +44,7 @@ execution:
 ```

 **Contract:**
-- Input (stdin): JSON with `question`, `expected_outcome`, `reference_answer`, `candidate_answer`, `
+- Input (stdin): JSON with `question`, `expected_outcome`, `reference_answer`, `candidate_answer`, `guideline_files` (file paths), `input_files` (file paths, excludes guidelines), `input_messages`
 - Output (stdout): JSON with `score` (0.0-1.0), `hits`, `misses`, `reasoning`

 **Template:** See `references/custom-evaluators.md` for Python code evaluator template
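For orientation, a minimal `code_judge` script that satisfies the contract above might look like the sketch below; the scoring heuristic is a placeholder, not agentv's bundled template.

```python
#!/usr/bin/env python3
"""Sketch of a code_judge evaluator: read the payload from stdin, emit a verdict on stdout."""
import json
import sys

payload = json.load(sys.stdin)

expected = payload.get("expected_outcome", "")
candidate = payload.get("candidate_answer", "")
# Also available per the contract: question, reference_answer,
# guideline_files, input_files, input_messages.

# Placeholder heuristic: does the candidate mention the expected outcome?
hit = bool(expected) and expected.lower() in candidate.lower()

json.dump(
    {
        "score": 1.0 if hit else 0.0,  # must be 0.0-1.0
        "hits": [expected] if hit else [],
        "misses": [] if hit else [expected],
        "reasoning": "Substring check against expected_outcome (placeholder logic).",
    },
    sys.stdout,
)
```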
@@ -61,8 +61,42 @@ execution:
       model: gpt-5-chat
 ```

-###
-
+### Tool Trajectory Evaluators
+Validate agent tool usage patterns (requires trace data from provider):
+
+```yaml
+execution:
+  evaluators:
+    - name: research_check
+      type: tool_trajectory
+      mode: any_order   # Options: any_order, in_order, exact
+      minimums:         # For any_order mode
+        knowledgeSearch: 2
+      expected:         # For in_order/exact modes
+        - tool: knowledgeSearch
+        - tool: documentRetrieve
+```
+
+See `references/tool-trajectory-evaluator.md` for modes and configuration.
+
+### Expected Tool Calls Evaluators
+Validate tool calls and inputs inline with conversation flow:
+
+```yaml
+expected_messages:
+  - role: assistant
+    tool_calls:
+      - tool: getMetrics
+        input: { server: "prod-1" }
+
+execution:
+  evaluators:
+    - name: input_check
+      type: expected_tool_calls
+```
+
+### Multiple Evaluators
+Define multiple evaluators to run sequentially. The final score is a weighted average of all results.

 ```yaml
 execution:
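To make the `any_order` / `minimums` semantics above concrete, the check boils down to counting tool calls from the trace; the per-tool averaging here is an assumption, since the diff does not show agentv's exact scoring for this mode.

```python
from collections import Counter

def any_order_score(trace_tools: list[str], minimums: dict[str, int]) -> float:
    """Fraction of per-tool minimums satisfied by the trace (assumed equal weighting)."""
    counts = Counter(trace_tools)
    met = sum(1 for tool, need in minimums.items() if counts[tool] >= need)
    return met / len(minimums) if minimums else 1.0

# Matches the config above: knowledgeSearch must appear at least twice.
trace = ["knowledgeSearch", "documentRetrieve", "knowledgeSearch"]
print(any_order_score(trace, {"knowledgeSearch": 2}))  # 1.0
```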
@@ -119,23 +153,6 @@ execution:

 See `references/composite-evaluator.md` for aggregation types and patterns.

-### Tool Trajectory Evaluator
-Validate agent tool usage from trace data:
-
-```yaml
-execution:
-  evaluators:
-    - name: workflow-check
-      type: tool_trajectory
-      mode: in_order   # or: any_order, exact
-      expected:
-        - tool: fetchData
-        - tool: processData
-        - tool: saveResults
-```
-
-See `references/tool-trajectory-evaluator.md` for modes and configuration.
-
 ## Example
 ```yaml
 $schema: agentv-eval-v2
package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md
CHANGED

@@ -78,13 +78,12 @@ evalcases:
     execution:
       evaluators:
         - name: json_format_validator
-          type:
+          type: code_judge
           script: uv run validate_json.py
           cwd: ./evaluators
         - name: content_evaluator
          type: llm_judge
           prompt: ./judges/semantic_correctness.md
-          model: gpt-5-chat

     input_messages:
       - role: user
@@ -102,6 +101,99 @@ evalcases:
       }
 ```

+## Tool Trajectory Evaluation
+
+Validate that an agent uses specific tools during execution.
+
+```yaml
+$schema: agentv-eval-v2
+description: Tool usage validation
+target: mock_agent
+
+evalcases:
+  # Validate minimum tool usage (order doesn't matter)
+  - id: research-depth
+    expected_outcome: Agent researches thoroughly
+    input_messages:
+      - role: user
+        content: Research REST vs GraphQL
+    execution:
+      evaluators:
+        - name: research-check
+          type: tool_trajectory
+          mode: any_order
+          minimums:
+            knowledgeSearch: 2
+            documentRetrieve: 1
+
+  # Validate exact tool sequence
+  - id: auth-flow
+    expected_outcome: Agent follows auth sequence
+    input_messages:
+      - role: user
+        content: Authenticate user
+    execution:
+      evaluators:
+        - name: auth-sequence
+          type: tool_trajectory
+          mode: exact
+          expected:
+            - tool: checkCredentials
+            - tool: generateToken
+```
+
+## Expected Messages with Tool Calls
+
+Validate precise tool inputs inline with expected messages.
+
+```yaml
+$schema: agentv-eval-v2
+description: Tool input validation
+target: mock_agent
+
+evalcases:
+  - id: precise-inputs
+    expected_outcome: Agent calls tools with correct parameters
+    input_messages:
+      - role: user
+        content: Check CPU metrics for prod-1
+    expected_messages:
+      - role: assistant
+        content: Checking metrics...
+        tool_calls:
+          - tool: getCpuMetrics
+            input: { server: "prod-1" }
+    execution:
+      evaluators:
+        - name: input-validator
+          type: expected_tool_calls
+```
+
+## Static Trace Evaluation
+
+Evaluate pre-existing trace files without running an agent.
+
+```yaml
+$schema: agentv-eval-v2
+description: Static trace evaluation
+target: static_trace
+
+evalcases:
+  - id: validate-trace-file
+    expected_outcome: Trace contains required steps
+    input_messages:
+      - role: user
+        content: Analyze trace
+    execution:
+      evaluators:
+        - name: trace-check
+          type: tool_trajectory
+          mode: in_order
+          expected:
+            - tool: webSearch
+            - tool: readFile
+```
+
 ## Multi-Turn Conversation (Single Eval Case)

 ```yaml
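As a companion to the `expected_tool_calls` example above, matching one expected call against an actual call could be sketched as follows; whether agentv requires full input equality or treats `input` as a subset constraint is not stated in this diff, so the subset behaviour here is an assumption.

```python
def tool_call_matches(expected: dict, actual: dict) -> bool:
    """True if the actual call uses the expected tool and its input carries
    every expected key/value pair (subset match is an assumption)."""
    if expected.get("tool") != actual.get("tool"):
        return False
    expected_input = expected.get("input", {})
    actual_input = actual.get("input", {})
    return all(actual_input.get(key) == value for key, value in expected_input.items())

# From the precise-inputs eval case above:
expected = {"tool": "getCpuMetrics", "input": {"server": "prod-1"}}
actual = {"tool": "getCpuMetrics", "input": {"server": "prod-1", "window": "5m"}}
print(tool_call_matches(expected, actual))  # True
```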
package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md
CHANGED
@@ -76,7 +76,7 @@ execution:
 - Strict protocol validation
 - Regression testing specific behavior

-## Expected
+## Expected Tool Calls Evaluator

 For simpler cases, specify tool_calls inline in `expected_messages`:

@@ -84,11 +84,11 @@ For simpler cases, specify tool_calls inline in `expected_messages`:
 evalcases:
   - id: research-task
     expected_outcome: Agent searches and retrieves documents
-
+
     input_messages:
       - role: user
         content: Research REST vs GraphQL differences
-
+
     expected_messages:
       - role: assistant
         content: I'll research this topic.
@@ -96,11 +96,11 @@ evalcases:
           - tool: knowledgeSearch
           - tool: knowledgeSearch
           - tool: documentRetrieve
-
+
     execution:
       evaluators:
         - name: tool-validator
-          type:
+          type: expected_tool_calls
 ```

 ### With Input Matching
@@ -130,7 +130,7 @@ expected_messages:
 | `in_order` | (matched tools in sequence) / (expected tools count) |
 | `exact` | (correctly positioned tools) / (expected tools count) |

-###
+### expected_tool_calls Scoring

 Sequential matching: `(matched tool_calls) / (expected tool_calls)`

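The two table rows and the sequential-matching formula above translate roughly into the following sketch; duplicate tools and partial input matches may be handled differently by the real evaluator.

```python
def in_order_score(trace: list[str], expected: list[str]) -> float:
    """(matched tools in sequence) / (expected tools count), via a greedy subsequence match."""
    matched = 0
    for tool in trace:
        if matched < len(expected) and tool == expected[matched]:
            matched += 1
    return matched / len(expected)

def exact_score(trace: list[str], expected: list[str]) -> float:
    """(correctly positioned tools) / (expected tools count)."""
    hits = sum(1 for i, tool in enumerate(expected) if i < len(trace) and trace[i] == tool)
    return hits / len(expected)

print(in_order_score(["webSearch", "extra", "readFile"], ["webSearch", "readFile"]))  # 1.0
print(exact_score(["readFile", "webSearch"], ["webSearch", "readFile"]))              # 0.0
```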
@@ -215,7 +215,7 @@ evalcases:
     execution:
       evaluators:
         - name: pipeline-check
-          type:
+          type: expected_tool_calls
 ```

 ## CLI Options for Traces
@@ -234,4 +234,4 @@ agentv eval evals/test.yaml --include-trace
 2. **Start with any_order** - Then tighten to `in_order` or `exact` as needed
 3. **Combine with other evaluators** - Use tool trajectory for execution, LLM judge for output quality
 4. **Test with --dump-traces** - Inspect actual traces to understand agent behavior
-5. **Use
+5. **Use expected_tool_calls for simple cases** - It's more readable for basic tool validation