agentv 1.0.0 → 1.3.1
- package/dist/{chunk-RIJO5WBF.js → chunk-6R2YRXCQ.js} +287 -405
- package/dist/chunk-6R2YRXCQ.js.map +1 -0
- package/dist/cli.js +1 -1
- package/dist/index.js +1 -1
- package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +40 -19
- package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md +288 -0
- package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md +100 -41
- package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md +10 -68
- package/package.json +2 -2
- package/dist/chunk-RIJO5WBF.js.map +0 -1
package/dist/cli.js
CHANGED
package/dist/index.js
CHANGED
package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md
CHANGED
@@ -15,6 +15,7 @@ description: Create and maintain AgentV YAML evaluation files for testing AI age
 - Composite Evaluators: `references/composite-evaluator.md` - Combine multiple evaluators
 - Tool Trajectory: `references/tool-trajectory-evaluator.md` - Validate agent tool usage
 - Custom Evaluators: `references/custom-evaluators.md` - Code and LLM judge templates
+- Batch CLI: `references/batch-cli-evaluator.md` - Evaluate batch runner output (JSONL)
 
 ## Structure Requirements
 - Root level: `description` (optional), `target` (optional), `execution` (optional), `evalcases` (required)
@@ -79,22 +80,6 @@ execution:
 
 See `references/tool-trajectory-evaluator.md` for modes and configuration.
 
-### Expected Tool Calls Evaluators
-Validate tool calls and inputs inline with conversation flow:
-
-```yaml
-expected_messages:
-  - role: assistant
-    tool_calls:
-      - tool: getMetrics
-        input: { server: "prod-1" }
-
-execution:
-  evaluators:
-    - name: input_check
-      type: expected_tool_calls
-```
-
 ### Multiple Evaluators
 Define multiple evaluators to run sequentially. The final score is a weighted average of all results.
 
@@ -153,6 +138,42 @@ execution:
 
 See `references/composite-evaluator.md` for aggregation types and patterns.
 
+### Batch CLI Evaluation
+Evaluate external batch runners that process all evalcases in one invocation:
+
+```yaml
+$schema: agentv-eval-v2
+description: Batch CLI evaluation
+target: batch_cli
+
+evalcases:
+  - id: case-001
+    expected_outcome: Returns decision=CLEAR
+    expected_messages:
+      - role: assistant
+        content:
+          decision: CLEAR
+    input_messages:
+      - role: user
+        content:
+          row:
+            id: case-001
+            amount: 5000
+    execution:
+      evaluators:
+        - name: decision-check
+          type: code_judge
+          script: bun run ./scripts/check-output.ts
+          cwd: .
+```
+
+**Key pattern:**
+- Batch runner reads eval YAML via `--eval` flag, outputs JSONL keyed by `id`
+- Each evalcase has its own evaluator to validate its corresponding output
+- Use structured `expected_messages.content` for expected output fields
+
+See `references/batch-cli-evaluator.md` for full implementation guide.
+
 ## Example
 ```yaml
 $schema: agentv-eval-v2
@@ -163,7 +184,7 @@ execution:
 evalcases:
   - id: code-review-basic
     expected_outcome: Assistant provides helpful code analysis
-
+
     input_messages:
       - role: system
         content: You are an expert code reviewer.
@@ -172,14 +193,14 @@ evalcases:
           - type: text
             value: |-
               Review this function:
-
+
               ```python
               def add(a, b):
                   return a + b
               ```
           - type: file
             value: /prompts/python.instructions.md
-
+
     expected_messages:
       - role: assistant
         content: |-
package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md
ADDED
@@ -0,0 +1,288 @@
+# Batch CLI Evaluation Guide
+
+Guide for evaluating batch CLI output where a single runner processes all evalcases at once and outputs JSONL.
+
+## Overview
+
+Batch CLI evaluation is used when:
+- An external tool processes multiple inputs in a single invocation (e.g., AML screening, bulk classification)
+- The runner reads the eval YAML directly to extract all evalcases
+- Output is JSONL with records keyed by evalcase `id`
+- Each evalcase has its own evaluator to validate its corresponding output record
+
+## Execution Flow
+
+1. **AgentV** invokes the batch runner once, passing `--eval <yaml-path>` and `--output <jsonl-path>`
+2. **Batch runner** reads the eval YAML, extracts all evalcases, processes them, writes JSONL output keyed by `id`
+3. **AgentV** parses JSONL, routes each record to its matching evalcase by `id`
+4. **Per-case evaluator** validates the output for each evalcase independently
+
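Reviewer note: step 3 of the execution flow (routing each JSONL record to its evalcase by `id`) can be sketched as follows. This is only an illustration under stated assumptions, not AgentV's actual implementation; `BatchRecord`, `parseJsonl`, and `routeById` are hypothetical names, and the record shape mirrors the `{"id", "text"}` lines shown in the guide's runner contract.

```typescript
// Hypothetical record shape, mirroring the {"id", "text"} JSONL lines in the guide.
type BatchRecord = { id: string; text: string };

// Parse JSONL text into records, skipping blank lines.
function parseJsonl(jsonl: string): BatchRecord[] {
  return jsonl
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line) as BatchRecord);
}

// Index records by id and report evalcase ids that have no matching record.
function routeById(evalcaseIds: string[], records: BatchRecord[]) {
  const byId = new Map<string, BatchRecord>();
  for (const record of records) byId.set(record.id, record);
  const missing = evalcaseIds.filter((id) => !byId.has(id));
  return { byId, missing };
}

const jsonl = '{"id":"case-001","text":"{\\"decision\\":\\"CLEAR\\"}"}\n';
const routed = routeById(['case-001', 'case-002'], parseJsonl(jsonl));
console.log(routed.missing); // only case-002 lacks a record
```

Reporting the missing ids separately makes it visible when a runner silently drops an evalcase, which a plain lookup would hide.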
+## Eval File Structure
+
+```yaml
+$schema: agentv-eval-v2
+description: Batch CLI demo using structured input_messages
+
+target: batch_cli
+
+evalcases:
+  - id: case-001
+    expected_outcome: |-
+      Batch runner returns JSON with decision=CLEAR.
+
+    expected_messages:
+      - role: assistant
+        content:
+          decision: CLEAR # Structured expected output
+
+    input_messages:
+      - role: system
+        content: You are a batch processor.
+      - role: user
+        content: # Structured input (runner extracts this)
+          request:
+            type: screening_check
+            jurisdiction: AU
+          row:
+            id: case-001
+            name: Example A
+            amount: 5000
+
+    execution:
+      evaluators:
+        - name: decision-check
+          type: code_judge
+          script: bun run ./scripts/check-output.ts
+          cwd: .
+
+  - id: case-002
+    expected_outcome: |-
+      Batch runner returns JSON with decision=REVIEW.
+
+    expected_messages:
+      - role: assistant
+        content:
+          decision: REVIEW
+
+    input_messages:
+      - role: system
+        content: You are a batch processor.
+      - role: user
+        content:
+          request:
+            type: screening_check
+            jurisdiction: AU
+          row:
+            id: case-002
+            name: Example B
+            amount: 25000
+
+    execution:
+      evaluators:
+        - name: decision-check
+          type: code_judge
+          script: bun run ./scripts/check-output.ts
+          cwd: .
+```
+
+## Batch Runner Implementation
+
+The batch runner reads the eval YAML directly and processes all evalcases in one invocation.
+
+### Runner Contract
+
+**Input:** The runner receives the eval file path via `--eval` flag:
+```bash
+bun run batch-runner.ts --eval ./my-eval.yaml --output ./results.jsonl
+```
+
+**Output:** JSONL file where each line is a JSON object with:
+```json
+{"id": "case-001", "text": "{\"decision\": \"CLEAR\", ...}"}
+{"id": "case-002", "text": "{\"decision\": \"REVIEW\", ...}"}
+```
+
+The `id` field must match the evalcase `id` for AgentV to route output to the correct evaluator.
+
+### Example Runner (TypeScript)
+
+```typescript
+import fs from 'node:fs/promises';
+import { parse } from 'yaml';
+
+type EvalCase = {
+  id: string;
+  input_messages: Array<{ role: string; content: unknown }>;
+};
+
+async function main() {
+  const args = process.argv.slice(2);
+  const evalPath = getFlag(args, '--eval');
+  const outPath = getFlag(args, '--output');
+
+  // Read and parse eval YAML
+  const yamlText = await fs.readFile(evalPath, 'utf8');
+  const parsed = parse(yamlText);
+  const evalcases = parsed.evalcases as EvalCase[];
+
+  // Process each evalcase
+  const results: Array<{ id: string; text: string }> = [];
+  for (const evalcase of evalcases) {
+    const userContent = findUserContent(evalcase.input_messages);
+    const decision = processInput(userContent); // Your logic here
+
+    results.push({
+      id: evalcase.id,
+      text: JSON.stringify({ decision, ...otherFields }),
+    });
+  }
+
+  // Write JSONL output
+  const jsonl = results.map((r) => JSON.stringify(r)).join('\n') + '\n';
+  await fs.writeFile(outPath, jsonl, 'utf8');
+}
+
+function getFlag(args: string[], name: string): string {
+  const idx = args.indexOf(name);
+  return args[idx + 1];
+}
+
+function findUserContent(messages: Array<{ role: string; content: unknown }>) {
+  return messages.find((m) => m.role === 'user')?.content;
+}
+```
+
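Reviewer note: in the example runner, `getFlag` does not check for a missing flag. When the flag is absent, `indexOf` returns -1 and the function returns `args[0]`; when the flag is the last argument, it returns `undefined`. The example also never calls `main()` and leaves `processInput` and `otherFields` undefined, so it reads as a template rather than a complete script. A stricter flag parser might look like the sketch below; `requireFlag` is a hypothetical name and is not part of the package.

```typescript
// Hypothetical stricter replacement for `getFlag`: fails fast instead of
// returning undefined or an unrelated argument.
function requireFlag(args: string[], name: string): string {
  const idx = args.indexOf(name);
  const value = idx === -1 ? undefined : args[idx + 1];
  if (value === undefined || value.startsWith('--')) {
    throw new Error(`Missing value for ${name}`);
  }
  return value;
}

const args = ['--eval', './my-eval.yaml', '--output', './results.jsonl'];
console.log(requireFlag(args, '--eval')); // ./my-eval.yaml
```

Failing early here keeps a misconfigured `commandTemplate` from surfacing later as a confusing file-read error.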
+## Evaluator Implementation
+
+Each evalcase has its own evaluator that validates the output. The evaluator receives the standard code_judge input.
+
+### Evaluator Contract
+
+**Input (stdin):** Standard AgentV code_judge format:
+```json
+{
+  "candidate_answer": "{\"id\":\"case-001\",\"decision\":\"CLEAR\",...}",
+  "expected_messages": [{"role": "assistant", "content": {"decision": "CLEAR"}}],
+  "input_messages": [...],
+  ...
+}
+```
+
+**Output (stdout):** Standard evaluator result:
+```json
+{
+  "score": 1.0,
+  "hits": ["decision matches: CLEAR"],
+  "misses": [],
+  "reasoning": "Batch runner decision matches expected."
+}
+```
+
+### Example Evaluator (TypeScript)
+
+```typescript
+import fs from 'node:fs';
+
+type EvalInput = {
+  candidate_answer?: string;
+  expected_messages?: Array<{ role: string; content: unknown }>;
+};
+
+function main() {
+  const stdin = fs.readFileSync(0, 'utf8');
+  const input = JSON.parse(stdin) as EvalInput;
+
+  // Extract expected value from expected_messages
+  const expectedDecision = findExpectedDecision(input.expected_messages);
+
+  // Parse candidate answer (output from batch runner)
+  let candidateDecision: string | undefined;
+  try {
+    const parsed = JSON.parse(input.candidate_answer ?? '');
+    candidateDecision = parsed.decision;
+  } catch {
+    candidateDecision = undefined;
+  }
+
+  // Compare
+  const hits: string[] = [];
+  const misses: string[] = [];
+
+  if (expectedDecision === candidateDecision) {
+    hits.push(`decision matches: ${expectedDecision}`);
+  } else {
+    misses.push(`mismatch: expected=${expectedDecision} actual=${candidateDecision}`);
+  }
+
+  const score = misses.length === 0 ? 1 : 0;
+
+  process.stdout.write(JSON.stringify({
+    score,
+    hits,
+    misses,
+    reasoning: score === 1
+      ? 'Batch runner output matches expected.'
+      : 'Batch runner output did not match expected.',
+  }));
+}
+
+function findExpectedDecision(messages?: Array<{ role: string; content: unknown }>) {
+  if (!messages) return undefined;
+  for (const msg of messages) {
+    if (typeof msg.content === 'object' && msg.content !== null) {
+      return (msg.content as Record<string, unknown>).decision as string;
+    }
+  }
+  return undefined;
+}
+
+main();
+```
+
+## Structured Content in expected_messages
+
+For batch evaluation, use structured objects in `expected_messages.content` to define expected output fields:
+
+```yaml
+expected_messages:
+  - role: assistant
+    content:
+      decision: CLEAR
+      confidence: high
+      reasons: []
+```
+
+The evaluator then extracts these fields and compares against the parsed candidate output.
+
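Reviewer note: the example evaluator in this file compares only `decision`, while the structured content shown here carries several fields. A field-by-field comparison could be sketched as below. `compareFields` and the fractional scoring rule are assumptions made for illustration; the guide itself only shows scores of 0 and 1.

```typescript
type Result = { score: number; hits: string[]; misses: string[] };

// Compare every expected field against the parsed candidate output.
// Values are compared via JSON.stringify, which is key-order-sensitive for nested objects.
function compareFields(
  expected: Record<string, unknown>,
  candidate: Record<string, unknown>,
): Result {
  const hits: string[] = [];
  const misses: string[] = [];
  for (const [key, want] of Object.entries(expected)) {
    const got = candidate[key];
    if (JSON.stringify(want) === JSON.stringify(got)) {
      hits.push(`${key} matches`);
    } else {
      misses.push(`${key}: expected=${JSON.stringify(want)} actual=${JSON.stringify(got)}`);
    }
  }
  const total = hits.length + misses.length;
  // Assumed scoring rule: share of expected fields that matched.
  return { score: total === 0 ? 0 : hits.length / total, hits, misses };
}

const expected = { decision: 'CLEAR', confidence: 'high', reasons: [] };
const candidate = JSON.parse('{"decision":"CLEAR","confidence":"low","reasons":[]}');
console.log(compareFields(expected, candidate)); // two of the three fields match
```

Only fields named in the expected object are checked, so extra fields in the candidate output (such as `id`) do not count as misses.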
+## Best Practices
+
+1. **Use unique evalcase IDs** - The batch runner and AgentV use `id` to route outputs
+2. **Structured input_messages** - Put structured data in `user.content` for the runner to extract
+3. **Structured expected_messages** - Define expected output as objects for easy validation
+4. **Deterministic runners** - Batch runners should produce consistent output for testing
+5. **Healthcheck support** - Add `--healthcheck` flag for runner validation:
+   ```typescript
+   if (args.includes('--healthcheck')) {
+     console.log('batch-runner: healthy');
+     return;
+   }
+   ```
+
+## Target Configuration
+
+Configure the batch CLI provider in your target:
+
+```yaml
+# In agentv-targets.yaml or eval file
+targets:
+  batch_cli:
+    provider: cli
+    commandTemplate: bun run ./scripts/batch-runner.ts --eval {EVAL_FILE} --output {OUTPUT_FILE}
+    provider_batching: true
+```
+
+Key settings:
+- `provider: cli` - Use CLI provider
+- `provider_batching: true` - Run once for all evalcases
+- `{EVAL_FILE}` - Placeholder for eval file path
+- `{OUTPUT_FILE}` - Placeholder for JSONL output path
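Reviewer note: the `{EVAL_FILE}` and `{OUTPUT_FILE}` placeholders in `commandTemplate` are presumably substituted before the command runs. How AgentV does this is not shown in the diff; the sketch below only illustrates the idea, and `expandTemplate` is a hypothetical helper.

```typescript
// Hypothetical helper: substitute {NAME} placeholders in a command template.
// Unknown placeholders are left untouched so mistakes stay visible.
function expandTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{([A-Z_]+)\}/g, (match: string, name: string) =>
    values[name] ?? match,
  );
}

const command = expandTemplate(
  'bun run ./scripts/batch-runner.ts --eval {EVAL_FILE} --output {OUTPUT_FILE}',
  { EVAL_FILE: './my-eval.yaml', OUTPUT_FILE: './results.jsonl' },
);
console.log(command);
// bun run ./scripts/batch-runner.ts --eval ./my-eval.yaml --output ./results.jsonl
```

A real implementation would also need to quote paths that contain spaces; the sketch omits that.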
package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md
CHANGED
@@ -12,11 +12,11 @@ target: default
 evalcases:
   - id: simple-addition
     expected_outcome: Correctly calculates 2+2
-
+
     input_messages:
       - role: user
         content: What is 2 + 2?
-
+
     expected_messages:
       - role: assistant
         content: "4"
@@ -32,7 +32,7 @@ target: azure_base
 evalcases:
   - id: code-review-basic
     expected_outcome: Assistant provides helpful code analysis with security considerations
-
+
     input_messages:
       - role: system
         content: You are an expert code reviewer.
@@ -41,7 +41,7 @@ evalcases:
           - type: text
             value: |-
               Review this function for security issues:
-
+
               ```python
               def get_user(user_id):
                   query = f"SELECT * FROM users WHERE id = {user_id}"
@@ -49,13 +49,13 @@ evalcases:
               ```
           - type: file
             value: /prompts/security-guidelines.md
-
+
     expected_messages:
       - role: assistant
         content: |-
-          This code has a critical SQL injection vulnerability. The user_id is directly
+          This code has a critical SQL injection vulnerability. The user_id is directly
           interpolated into the query string without sanitization.
-
+
           Recommended fix:
           ```python
           def get_user(user_id):
@@ -74,7 +74,7 @@ target: default
 evalcases:
   - id: json-generation-with-validation
     expected_outcome: Generates valid JSON with required fields
-
+
     execution:
       evaluators:
         - name: json_format_validator
@@ -84,13 +84,13 @@ evalcases:
         - name: content_evaluator
           type: llm_judge
           prompt: ./judges/semantic_correctness.md
-
+
     input_messages:
       - role: user
         content: |-
-          Generate a JSON object for a user with name "Alice",
+          Generate a JSON object for a user with name "Alice",
           email "alice@example.com", and role "admin".
-
+
     expected_messages:
       - role: assistant
         content: |-
@@ -142,33 +142,6 @@ evalcases:
             - tool: generateToken
 ```
 
-## Expected Messages with Tool Calls
-
-Validate precise tool inputs inline with expected messages.
-
-```yaml
-$schema: agentv-eval-v2
-description: Tool input validation
-target: mock_agent
-
-evalcases:
-  - id: precise-inputs
-    expected_outcome: Agent calls tools with correct parameters
-    input_messages:
-      - role: user
-        content: Check CPU metrics for prod-1
-    expected_messages:
-      - role: assistant
-        content: Checking metrics...
-        tool_calls:
-          - tool: getCpuMetrics
-            input: { server: "prod-1" }
-    execution:
-      evaluators:
-        - name: input-validator
-          type: expected_tool_calls
-```
-
 ## Static Trace Evaluation
 
 Evaluate pre-existing trace files without running an agent.
@@ -207,7 +180,7 @@ evalcases:
       Assistant conducts a multi-turn debugging session, asking clarification
       questions when needed, correctly diagnosing the bug, and proposing a clear
       fix with rationale.
-
+
     input_messages:
       - role: system
         content: You are an expert debugging assistant who reasons step by step, asks clarifying questions, and explains fixes clearly.
@@ -232,7 +205,7 @@ evalcases:
       - role: user
         content: |-
           For `[1, 2, 3, 4]` I expect `[1, 2, 3, 4]`, but I get `[1, 2, 3]`.
-
+
     expected_messages:
       - role: assistant
         content: |-
@@ -241,7 +214,7 @@ evalcases:
           To include all items, you can either:
           - Use `range(len(items))`, or
           - Iterate directly over the list: `for item in items:`
-
+
           Here's a corrected version:
 
           ```python
@@ -253,6 +226,92 @@ evalcases:
           ```
 ```
 
+## Batch CLI Evaluation
+
+Evaluate external batch runners that process all evalcases in one invocation.
+
+```yaml
+$schema: agentv-eval-v2
+description: Batch CLI demo (AML screening)
+target: batch_cli
+
+evalcases:
+  - id: aml-001
+    expected_outcome: |-
+      Batch runner returns JSON with decision=CLEAR.
+
+    expected_messages:
+      - role: assistant
+        content:
+          decision: CLEAR
+
+    input_messages:
+      - role: system
+        content: You are a deterministic AML screening batch checker.
+      - role: user
+        content:
+          request:
+            type: aml_screening_check
+            jurisdiction: AU
+            effective_date: 2025-01-01
+          row:
+            id: aml-001
+            customer_name: Example Customer A
+            origin_country: NZ
+            destination_country: AU
+            transaction_type: INTERNATIONAL_TRANSFER
+            amount: 5000
+            currency: USD
+
+    execution:
+      evaluators:
+        - name: decision-check
+          type: code_judge
+          script: bun run ./scripts/check-batch-cli-output.ts
+          cwd: .
+
+  - id: aml-002
+    expected_outcome: |-
+      Batch runner returns JSON with decision=REVIEW.
+
+    expected_messages:
+      - role: assistant
+        content:
+          decision: REVIEW
+
+    input_messages:
+      - role: system
+        content: You are a deterministic AML screening batch checker.
+      - role: user
+        content:
+          request:
+            type: aml_screening_check
+            jurisdiction: AU
+            effective_date: 2025-01-01
+          row:
+            id: aml-002
+            customer_name: Example Customer B
+            origin_country: IR
+            destination_country: AU
+            transaction_type: INTERNATIONAL_TRANSFER
+            amount: 2000
+            currency: USD
+
+    execution:
+      evaluators:
+        - name: decision-check
+          type: code_judge
+          script: bun run ./scripts/check-batch-cli-output.ts
+          cwd: .
+```
+
+### Batch CLI Pattern Notes
+- **target: batch_cli** - Configure CLI provider with `provider_batching: true`
+- **Batch runner** - Reads eval YAML via `--eval` flag, outputs JSONL keyed by `id`
+- **Structured input** - Put data in `user.content` as objects for runner to extract
+- **Structured expected** - Use `expected_messages.content` with object fields
+- **Per-case evaluators** - Each evalcase has its own evaluator to validate output
+
 ## Notes on Examples
 
 ### File Path Conventions