agentv 1.0.0 → 1.3.1

package/dist/cli.js CHANGED
@@ -1,7 +1,7 @@
1
1
  #!/usr/bin/env node
2
2
  import {
3
3
  runCli
4
- } from "./chunk-RIJO5WBF.js";
4
+ } from "./chunk-6R2YRXCQ.js";
5
5
  import "./chunk-UE4GLFVL.js";
6
6
 
7
7
  // src/cli.ts
package/dist/index.js CHANGED
@@ -1,7 +1,7 @@
1
1
  import {
2
2
  app,
3
3
  runCli
4
- } from "./chunk-RIJO5WBF.js";
4
+ } from "./chunk-6R2YRXCQ.js";
5
5
  import "./chunk-UE4GLFVL.js";
6
6
  export {
7
7
  app,
@@ -15,6 +15,7 @@ description: Create and maintain AgentV YAML evaluation files for testing AI age
15
15
  - Composite Evaluators: `references/composite-evaluator.md` - Combine multiple evaluators
16
16
  - Tool Trajectory: `references/tool-trajectory-evaluator.md` - Validate agent tool usage
17
17
  - Custom Evaluators: `references/custom-evaluators.md` - Code and LLM judge templates
18
+ - Batch CLI: `references/batch-cli-evaluator.md` - Evaluate batch runner output (JSONL)
18
19
 
19
20
  ## Structure Requirements
20
21
  - Root level: `description` (optional), `target` (optional), `execution` (optional), `evalcases` (required)
@@ -79,22 +80,6 @@ execution:
79
80
 
80
81
  See `references/tool-trajectory-evaluator.md` for modes and configuration.
81
82
 
82
- ### Expected Tool Calls Evaluators
83
- Validate tool calls and inputs inline with conversation flow:
84
-
85
- ```yaml
86
- expected_messages:
87
- - role: assistant
88
- tool_calls:
89
- - tool: getMetrics
90
- input: { server: "prod-1" }
91
-
92
- execution:
93
- evaluators:
94
- - name: input_check
95
- type: expected_tool_calls
96
- ```
97
-
98
83
  ### Multiple Evaluators
99
84
  Define multiple evaluators to run sequentially. The final score is a weighted average of all results.
100
85
 
@@ -153,6 +138,42 @@ execution:
153
138
 
154
139
  See `references/composite-evaluator.md` for aggregation types and patterns.
155
140
 
141
+ ### Batch CLI Evaluation
142
+ Evaluate external batch runners that process all evalcases in one invocation:
143
+
144
+ ```yaml
145
+ $schema: agentv-eval-v2
146
+ description: Batch CLI evaluation
147
+ target: batch_cli
148
+
149
+ evalcases:
150
+   - id: case-001
151
+     expected_outcome: Returns decision=CLEAR
152
+     expected_messages:
153
+       - role: assistant
154
+         content:
155
+           decision: CLEAR
156
+     input_messages:
157
+       - role: user
158
+         content:
159
+           row:
160
+             id: case-001
161
+             amount: 5000
162
+     execution:
163
+       evaluators:
164
+         - name: decision-check
165
+           type: code_judge
166
+           script: bun run ./scripts/check-output.ts
167
+           cwd: .
168
+ ```
169
+
170
+ **Key pattern:**
171
+ - Batch runner reads eval YAML via `--eval` flag, outputs JSONL keyed by `id`
172
+ - Each evalcase has its own evaluator to validate its corresponding output
173
+ - Use structured `expected_messages.content` for expected output fields
174
+
175
+ See `references/batch-cli-evaluator.md` for full implementation guide.
176
+
156
177
  ## Example
157
178
  ```yaml
158
179
  $schema: agentv-eval-v2
@@ -163,7 +184,7 @@ execution:
163
184
  evalcases:
164
185
  - id: code-review-basic
165
186
  expected_outcome: Assistant provides helpful code analysis
166
-
187
+
167
188
  input_messages:
168
189
  - role: system
169
190
  content: You are an expert code reviewer.
@@ -172,14 +193,14 @@ evalcases:
172
193
  - type: text
173
194
  value: |-
174
195
  Review this function:
175
-
196
+
176
197
  ```python
177
198
  def add(a, b):
178
199
  return a + b
179
200
  ```
180
201
  - type: file
181
202
  value: /prompts/python.instructions.md
182
-
203
+
183
204
  expected_messages:
184
205
  - role: assistant
185
206
  content: |-
@@ -0,0 +1,288 @@
1
+ # Batch CLI Evaluation Guide
2
+
3
+ Guide for evaluating batch CLI output where a single runner processes all evalcases at once and outputs JSONL.
4
+
5
+ ## Overview
6
+
7
+ Batch CLI evaluation is used when:
8
+ - An external tool processes multiple inputs in a single invocation (e.g., AML screening, bulk classification)
9
+ - The runner reads the eval YAML directly to extract all evalcases
10
+ - Output is JSONL with records keyed by evalcase `id`
11
+ - Each evalcase has its own evaluator to validate its corresponding output record
12
+
13
+ ## Execution Flow
14
+
15
+ 1. **AgentV** invokes the batch runner once, passing `--eval <yaml-path>` and `--output <jsonl-path>`
16
+ 2. **Batch runner** reads the eval YAML, extracts all evalcases, processes them, writes JSONL output keyed by `id`
17
+ 3. **AgentV** parses JSONL, routes each record to its matching evalcase by `id`
18
+ 4. **Per-case evaluator** validates the output for each evalcase independently
19
+
20
+ ## Eval File Structure
21
+
22
+ ```yaml
23
+ $schema: agentv-eval-v2
24
+ description: Batch CLI demo using structured input_messages
25
+
26
+ target: batch_cli
27
+
28
+ evalcases:
29
+   - id: case-001
30
+     expected_outcome: |-
31
+       Batch runner returns JSON with decision=CLEAR.
32
+
33
+     expected_messages:
34
+       - role: assistant
35
+         content:
36
+           decision: CLEAR # Structured expected output
37
+
38
+     input_messages:
39
+       - role: system
40
+         content: You are a batch processor.
41
+       - role: user
42
+         content: # Structured input (runner extracts this)
43
+           request:
44
+             type: screening_check
45
+             jurisdiction: AU
46
+           row:
47
+             id: case-001
48
+             name: Example A
49
+             amount: 5000
50
+
51
+     execution:
52
+       evaluators:
53
+         - name: decision-check
54
+           type: code_judge
55
+           script: bun run ./scripts/check-output.ts
56
+           cwd: .
57
+
58
+   - id: case-002
59
+     expected_outcome: |-
60
+       Batch runner returns JSON with decision=REVIEW.
61
+
62
+     expected_messages:
63
+       - role: assistant
64
+         content:
65
+           decision: REVIEW
66
+
67
+     input_messages:
68
+       - role: system
69
+         content: You are a batch processor.
70
+       - role: user
71
+         content:
72
+           request:
73
+             type: screening_check
74
+             jurisdiction: AU
75
+           row:
76
+             id: case-002
77
+             name: Example B
78
+             amount: 25000
79
+
80
+     execution:
81
+       evaluators:
82
+         - name: decision-check
83
+           type: code_judge
84
+           script: bun run ./scripts/check-output.ts
85
+           cwd: .
86
+ ```
87
+
88
+ ## Batch Runner Implementation
89
+
90
+ The batch runner reads the eval YAML directly and processes all evalcases in one invocation.
91
+
92
+ ### Runner Contract
93
+
94
+ **Input:** The runner receives the eval file path via `--eval` flag:
95
+ ```bash
96
+ bun run batch-runner.ts --eval ./my-eval.yaml --output ./results.jsonl
97
+ ```
98
+
99
+ **Output:** JSONL file where each line is a JSON object with:
100
+ ```json
101
+ {"id": "case-001", "text": "{\"decision\": \"CLEAR\", ...}"}
102
+ {"id": "case-002", "text": "{\"decision\": \"REVIEW\", ...}"}
103
+ ```
104
+
105
+ The `id` field must match the evalcase `id` for AgentV to route output to the correct evaluator.
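The routing rule above can be sketched as a small TypeScript helper. This is an illustrative sketch, not AgentV's actual parser: `BatchRecord` and `indexById` are hypothetical names, and the `id`/`text` record shape is taken from the JSONL example above. It rejects duplicate ids on the assumption that each evalcase should receive exactly one record.

```typescript
// Hypothetical record shape, mirroring the JSONL example (not an AgentV type).
type BatchRecord = { id: string; text: string };

// Parse JSONL text and index the records by evalcase id.
function indexById(jsonl: string): Map<string, BatchRecord> {
  const byId = new Map<string, BatchRecord>();
  jsonl.split('\n').forEach((raw, i) => {
    const line = raw.trim();
    if (line === '') return; // tolerate blank lines and the trailing newline
    const rec = JSON.parse(line) as Partial<BatchRecord>;
    if (typeof rec.id !== 'string' || typeof rec.text !== 'string') {
      throw new Error(`line ${i + 1}: expected string "id" and "text" fields`);
    }
    if (byId.has(rec.id)) {
      throw new Error(`line ${i + 1}: duplicate id "${rec.id}"`);
    }
    byId.set(rec.id, { id: rec.id, text: rec.text });
  });
  return byId;
}
```

A lookup such as `indexById(fileText).get('case-001')` then yields the record for that evalcase, or `undefined` if the runner never emitted one.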
106
+
107
+ ### Example Runner (TypeScript)
108
+
109
+ ```typescript
110
+ import fs from 'node:fs/promises';
111
+ import { parse } from 'yaml';
112
+
113
+ type EvalCase = {
114
+   id: string;
115
+   input_messages: Array<{ role: string; content: unknown }>;
116
+ };
117
+
118
+ async function main() {
119
+   const args = process.argv.slice(2);
120
+   const evalPath = getFlag(args, '--eval');
121
+   const outPath = getFlag(args, '--output');
122
+
123
+   // Read and parse eval YAML
124
+   const yamlText = await fs.readFile(evalPath, 'utf8');
125
+   const parsed = parse(yamlText);
126
+   const evalcases = parsed.evalcases as EvalCase[];
127
+
128
+   // Process each evalcase
129
+   const results: Array<{ id: string; text: string }> = [];
130
+   for (const evalcase of evalcases) {
131
+     const userContent = findUserContent(evalcase.input_messages);
132
+     const decision = processInput(userContent); // Your logic here
133
+
134
+     results.push({
135
+       id: evalcase.id,
136
+       text: JSON.stringify({ decision }), // Add any other output fields here
137
+     });
138
+   }
139
+
140
+   // Write JSONL output
141
+   const jsonl = results.map((r) => JSON.stringify(r)).join('\n') + '\n';
142
+   await fs.writeFile(outPath, jsonl, 'utf8');
143
+ }
144
+
145
+ function getFlag(args: string[], name: string): string {
146
+   const idx = args.indexOf(name);
147
+   return args[idx + 1];
148
+ }
149
+
150
+ function findUserContent(messages: Array<{ role: string; content: unknown }>) {
151
+   return messages.find((m) => m.role === 'user')?.content;
152
+ }
153
+ ```
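One caveat about the listing above: `getFlag` returns `args[0]` when the flag is absent, because `indexOf` yields `-1`, and it returns `undefined` when the flag is the last argument. A stricter variant might look like the sketch below. `requireFlag` is a hypothetical name, not part of the package, and treating a following `--option` as a missing value is an assumption about the desired CLI behavior.

```typescript
// Hypothetical stricter replacement for getFlag: fails fast instead of
// silently returning the wrong argument or undefined.
function requireFlag(args: string[], name: string): string {
  const idx = args.indexOf(name);
  const value = idx === -1 ? undefined : args[idx + 1];
  // Assumption: a value that itself starts with "--" means the real value is missing.
  if (value === undefined || value.startsWith('--')) {
    throw new Error(`Missing value for ${name}`);
  }
  return value;
}
```

The listing also defines `main` without invoking it, so a real script would presumably end with a call such as `main().catch((err) => { console.error(err); process.exit(1); });`.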
154
+
155
+ ## Evaluator Implementation
156
+
157
+ Each evalcase has its own evaluator that validates the output. The evaluator receives the standard code_judge input.
158
+
159
+ ### Evaluator Contract
160
+
161
+ **Input (stdin):** Standard AgentV code_judge format:
162
+ ```json
163
+ {
164
+   "candidate_answer": "{\"id\":\"case-001\",\"decision\":\"CLEAR\",...}",
165
+   "expected_messages": [{"role": "assistant", "content": {"decision": "CLEAR"}}],
166
+   "input_messages": [...],
167
+   ...
168
+ }
169
+ ```
170
+
171
+ **Output (stdout):** Standard evaluator result:
172
+ ```json
173
+ {
174
+   "score": 1.0,
175
+   "hits": ["decision matches: CLEAR"],
176
+   "misses": [],
177
+   "reasoning": "Batch runner decision matches expected."
178
+ }
179
+ ```
180
+
181
+ ### Example Evaluator (TypeScript)
182
+
183
+ ```typescript
184
+ import fs from 'node:fs';
185
+
186
+ type EvalInput = {
187
+   candidate_answer?: string;
188
+   expected_messages?: Array<{ role: string; content: unknown }>;
189
+ };
190
+
191
+ function main() {
192
+   const stdin = fs.readFileSync(0, 'utf8');
193
+   const input = JSON.parse(stdin) as EvalInput;
194
+
195
+   // Extract expected value from expected_messages
196
+   const expectedDecision = findExpectedDecision(input.expected_messages);
197
+
198
+   // Parse candidate answer (output from batch runner)
199
+   let candidateDecision: string | undefined;
200
+   try {
201
+     const parsed = JSON.parse(input.candidate_answer ?? '');
202
+     candidateDecision = parsed.decision;
203
+   } catch {
204
+     candidateDecision = undefined;
205
+   }
206
+
207
+   // Compare (a missing expected decision never counts as a match)
208
+   const hits: string[] = [];
209
+   const misses: string[] = [];
210
+
211
+   if (expectedDecision !== undefined && expectedDecision === candidateDecision) {
212
+     hits.push(`decision matches: ${expectedDecision}`);
213
+   } else {
214
+     misses.push(`mismatch: expected=${expectedDecision} actual=${candidateDecision}`);
215
+   }
216
+
217
+   const score = misses.length === 0 ? 1 : 0;
218
+
219
+   process.stdout.write(JSON.stringify({
220
+     score,
221
+     hits,
222
+     misses,
223
+     reasoning: score === 1
224
+       ? 'Batch runner output matches expected.'
225
+       : 'Batch runner output did not match expected.',
226
+   }));
227
+ }
228
+
229
+ function findExpectedDecision(messages?: Array<{ role: string; content: unknown }>) {
230
+   if (!messages) return undefined;
231
+   for (const msg of messages) {
232
+     if (typeof msg.content === 'object' && msg.content !== null) {
233
+       return (msg.content as Record<string, unknown>).decision as string;
234
+     }
235
+   }
236
+   return undefined;
237
+ }
238
+
239
+ main();
240
+ ```
241
+
242
+ ## Structured Content in expected_messages
243
+
244
+ For batch evaluation, use structured objects in `expected_messages.content` to define expected output fields:
245
+
246
+ ```yaml
247
+ expected_messages:
248
+   - role: assistant
249
+     content:
250
+       decision: CLEAR
251
+       confidence: high
252
+       reasons: []
253
+ ```
254
+
255
+ The evaluator then extracts these fields and compares against the parsed candidate output.
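The field-by-field comparison described above could be sketched as follows. `compareFields` and `Verdict` are hypothetical names, and the sketch assumes that a fractional score between 0 and 1 is acceptable to AgentV; the examples in this guide only show scores of 0 and 1. The `JSON.stringify` comparison is a shortcut that suits scalars and small arrays but is sensitive to key order in nested objects.

```typescript
// Hypothetical result shape, mirroring the evaluator output fields shown earlier.
type Verdict = { score: number; hits: string[]; misses: string[] };

// Compare each expected field against the same field of the parsed candidate.
function compareFields(
  expected: Record<string, unknown>,
  candidate: Record<string, unknown>,
): Verdict {
  const keys = Object.keys(expected);
  if (keys.length === 0) {
    // Nothing to check is treated as a failure rather than a silent pass.
    return { score: 0, hits: [], misses: ['no expected fields to compare'] };
  }
  const hits: string[] = [];
  const misses: string[] = [];
  for (const key of keys) {
    const want = JSON.stringify(expected[key]);
    const got = JSON.stringify(candidate[key]);
    if (want === got) {
      hits.push(`${key} matches`);
    } else {
      misses.push(`${key}: expected=${want} actual=${got}`);
    }
  }
  // Score is the fraction of expected fields that matched.
  return { score: hits.length / keys.length, hits, misses };
}
```

With the YAML above as `expected`, a candidate of `{ decision: 'CLEAR', confidence: 'low', reasons: [] }` would produce two hits and one miss.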
256
+
257
+ ## Best Practices
258
+
259
+ 1. **Use unique evalcase IDs** - The batch runner and AgentV use `id` to route outputs
260
+ 2. **Structured input_messages** - Put structured data in `user.content` for the runner to extract
261
+ 3. **Structured expected_messages** - Define expected output as objects for easy validation
262
+ 4. **Deterministic runners** - Batch runners should produce consistent output for testing
263
+ 5. **Healthcheck support** - Add `--healthcheck` flag for runner validation:
264
+ ```typescript
265
+ if (args.includes('--healthcheck')) {
266
+   console.log('batch-runner: healthy');
267
+   return;
268
+ }
269
+ ```
270
+
271
+ ## Target Configuration
272
+
273
+ Configure the batch CLI provider in your target:
274
+
275
+ ```yaml
276
+ # In agentv-targets.yaml or eval file
277
+ targets:
278
+   batch_cli:
279
+     provider: cli
280
+     commandTemplate: bun run ./scripts/batch-runner.ts --eval {EVAL_FILE} --output {OUTPUT_FILE}
281
+     provider_batching: true
282
+ ```
283
+
284
+ Key settings:
285
+ - `provider: cli` - Use CLI provider
286
+ - `provider_batching: true` - Run once for all evalcases
287
+ - `{EVAL_FILE}` - Placeholder for eval file path
288
+ - `{OUTPUT_FILE}` - Placeholder for JSONL output path
@@ -12,11 +12,11 @@ target: default
12
12
  evalcases:
13
13
  - id: simple-addition
14
14
  expected_outcome: Correctly calculates 2+2
15
-
15
+
16
16
  input_messages:
17
17
  - role: user
18
18
  content: What is 2 + 2?
19
-
19
+
20
20
  expected_messages:
21
21
  - role: assistant
22
22
  content: "4"
@@ -32,7 +32,7 @@ target: azure_base
32
32
  evalcases:
33
33
  - id: code-review-basic
34
34
  expected_outcome: Assistant provides helpful code analysis with security considerations
35
-
35
+
36
36
  input_messages:
37
37
  - role: system
38
38
  content: You are an expert code reviewer.
@@ -41,7 +41,7 @@ evalcases:
41
41
  - type: text
42
42
  value: |-
43
43
  Review this function for security issues:
44
-
44
+
45
45
  ```python
46
46
  def get_user(user_id):
47
47
  query = f"SELECT * FROM users WHERE id = {user_id}"
@@ -49,13 +49,13 @@ evalcases:
49
49
  ```
50
50
  - type: file
51
51
  value: /prompts/security-guidelines.md
52
-
52
+
53
53
  expected_messages:
54
54
  - role: assistant
55
55
  content: |-
56
- This code has a critical SQL injection vulnerability. The user_id is directly
56
+ This code has a critical SQL injection vulnerability. The user_id is directly
57
57
  interpolated into the query string without sanitization.
58
-
58
+
59
59
  Recommended fix:
60
60
  ```python
61
61
  def get_user(user_id):
@@ -74,7 +74,7 @@ target: default
74
74
  evalcases:
75
75
  - id: json-generation-with-validation
76
76
  expected_outcome: Generates valid JSON with required fields
77
-
77
+
78
78
  execution:
79
79
  evaluators:
80
80
  - name: json_format_validator
@@ -84,13 +84,13 @@ evalcases:
84
84
  - name: content_evaluator
85
85
  type: llm_judge
86
86
  prompt: ./judges/semantic_correctness.md
87
-
87
+
88
88
  input_messages:
89
89
  - role: user
90
90
  content: |-
91
- Generate a JSON object for a user with name "Alice",
91
+ Generate a JSON object for a user with name "Alice",
92
92
  email "alice@example.com", and role "admin".
93
-
93
+
94
94
  expected_messages:
95
95
  - role: assistant
96
96
  content: |-
@@ -142,33 +142,6 @@ evalcases:
142
142
  - tool: generateToken
143
143
  ```
144
144
 
145
- ## Expected Messages with Tool Calls
146
-
147
- Validate precise tool inputs inline with expected messages.
148
-
149
- ```yaml
150
- $schema: agentv-eval-v2
151
- description: Tool input validation
152
- target: mock_agent
153
-
154
- evalcases:
155
- - id: precise-inputs
156
- expected_outcome: Agent calls tools with correct parameters
157
- input_messages:
158
- - role: user
159
- content: Check CPU metrics for prod-1
160
- expected_messages:
161
- - role: assistant
162
- content: Checking metrics...
163
- tool_calls:
164
- - tool: getCpuMetrics
165
- input: { server: "prod-1" }
166
- execution:
167
- evaluators:
168
- - name: input-validator
169
- type: expected_tool_calls
170
- ```
171
-
172
145
  ## Static Trace Evaluation
173
146
 
174
147
  Evaluate pre-existing trace files without running an agent.
@@ -207,7 +180,7 @@ evalcases:
207
180
  Assistant conducts a multi-turn debugging session, asking clarification
208
181
  questions when needed, correctly diagnosing the bug, and proposing a clear
209
182
  fix with rationale.
210
-
183
+
211
184
  input_messages:
212
185
  - role: system
213
186
  content: You are an expert debugging assistant who reasons step by step, asks clarifying questions, and explains fixes clearly.
@@ -232,7 +205,7 @@ evalcases:
232
205
  - role: user
233
206
  content: |-
234
207
  For `[1, 2, 3, 4]` I expect `[1, 2, 3, 4]`, but I get `[1, 2, 3]`.
235
-
208
+
236
209
  expected_messages:
237
210
  - role: assistant
238
211
  content: |-
@@ -241,7 +214,7 @@ evalcases:
241
214
  To include all items, you can either:
242
215
  - Use `range(len(items))`, or
243
216
  - Iterate directly over the list: `for item in items:`
244
-
217
+
245
218
  Here's a corrected version:
246
219
 
247
220
  ```python
@@ -253,6 +226,92 @@ evalcases:
253
226
  ```
254
227
  ```
255
228
 
229
+ ## Batch CLI Evaluation
230
+
231
+ Evaluate external batch runners that process all evalcases in one invocation.
232
+
233
+ ```yaml
234
+ $schema: agentv-eval-v2
235
+ description: Batch CLI demo (AML screening)
236
+ target: batch_cli
237
+
238
+ evalcases:
239
+   - id: aml-001
240
+     expected_outcome: |-
241
+       Batch runner returns JSON with decision=CLEAR.
242
+
243
+     expected_messages:
244
+       - role: assistant
245
+         content:
246
+           decision: CLEAR
247
+
248
+     input_messages:
249
+       - role: system
250
+         content: You are a deterministic AML screening batch checker.
251
+       - role: user
252
+         content:
253
+           request:
254
+             type: aml_screening_check
255
+             jurisdiction: AU
256
+             effective_date: 2025-01-01
257
+           row:
258
+             id: aml-001
259
+             customer_name: Example Customer A
260
+             origin_country: NZ
261
+             destination_country: AU
262
+             transaction_type: INTERNATIONAL_TRANSFER
263
+             amount: 5000
264
+             currency: USD
265
+
266
+     execution:
267
+       evaluators:
268
+         - name: decision-check
269
+           type: code_judge
270
+           script: bun run ./scripts/check-batch-cli-output.ts
271
+           cwd: .
272
+
273
+   - id: aml-002
274
+     expected_outcome: |-
275
+       Batch runner returns JSON with decision=REVIEW.
276
+
277
+     expected_messages:
278
+       - role: assistant
279
+         content:
280
+           decision: REVIEW
281
+
282
+     input_messages:
283
+       - role: system
284
+         content: You are a deterministic AML screening batch checker.
285
+       - role: user
286
+         content:
287
+           request:
288
+             type: aml_screening_check
289
+             jurisdiction: AU
290
+             effective_date: 2025-01-01
291
+           row:
292
+             id: aml-002
293
+             customer_name: Example Customer B
294
+             origin_country: IR
295
+             destination_country: AU
296
+             transaction_type: INTERNATIONAL_TRANSFER
297
+             amount: 2000
298
+             currency: USD
299
+
300
+     execution:
301
+       evaluators:
302
+         - name: decision-check
303
+           type: code_judge
304
+           script: bun run ./scripts/check-batch-cli-output.ts
305
+           cwd: .
306
+ ```
307
+
308
+ ### Batch CLI Pattern Notes
309
+ - **target: batch_cli** - Configure CLI provider with `provider_batching: true`
310
+ - **Batch runner** - Reads eval YAML via `--eval` flag, outputs JSONL keyed by `id`
311
+ - **Structured input** - Put data in `user.content` as objects for runner to extract
312
+ - **Structured expected** - Use `expected_messages.content` with object fields
313
+ - **Per-case evaluators** - Each evalcase has its own evaluator to validate output
314
+
256
315
  ## Notes on Examples
257
316
 
258
317
  ### File Path Conventions