agentv 1.5.0 → 1.6.1
- package/README.md +23 -2
- package/dist/{chunk-3RYQPI4H.js → chunk-HU4B6ODF.js} +1429 -369
- package/dist/chunk-HU4B6ODF.js.map +1 -0
- package/dist/cli.js +1 -1
- package/dist/index.js +1 -1
- package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +2 -3
- package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md +0 -2
- package/dist/templates/.claude/skills/agentv-eval-builder/references/compare-command.md +115 -0
- package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md +34 -9
- package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md +0 -7
- package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md +0 -2
- package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md +28 -2
- package/package.json +5 -2
- package/dist/chunk-3RYQPI4H.js.map +0 -1
package/dist/cli.js
CHANGED
package/dist/index.js
CHANGED
package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md
CHANGED

@@ -16,9 +16,10 @@ description: Create and maintain AgentV YAML evaluation files for testing AI age
 - Tool Trajectory: `references/tool-trajectory-evaluator.md` - Validate agent tool usage
 - Custom Evaluators: `references/custom-evaluators.md` - Code and LLM judge templates
 - Batch CLI: `references/batch-cli-evaluator.md` - Evaluate batch runner output (JSONL)
+- Compare: `references/compare-command.md` - Compare evaluation results between runs

 ## Structure Requirements
-- Root level: `description` (optional), `execution` (
+- Root level: `description` (optional), `execution` (with `target`), `evalcases` (required)
 - Eval case fields: `id` (required), `expected_outcome` (required), `input_messages` (required)
 - Optional fields: `expected_messages`, `conversation_id`, `rubrics`, `execution`
 - `expected_messages` is optional - omit for outcome-only evaluation where the LLM judge evaluates based on `expected_outcome` criteria alone
@@ -142,7 +143,6 @@ See `references/composite-evaluator.md` for aggregation types and patterns.
 Evaluate external batch runners that process all evalcases in one invocation:

 ```yaml
-$schema: agentv-eval-v2
 description: Batch CLI evaluation
 execution:
   target: batch_cli
@@ -177,7 +177,6 @@ See `references/batch-cli-evaluator.md` for full implementation guide.

 ## Example
 ```yaml
-$schema: agentv-eval-v2
 description: Example showing basic features and conversation threading
 execution:
   target: default
package/dist/templates/.claude/skills/agentv-eval-builder/references/compare-command.md
ADDED

@@ -0,0 +1,115 @@
+# Compare Command
+
+Compare evaluation results between two runs to measure performance differences.
+
+## Usage
+
+```bash
+agentv compare <baseline.jsonl> <candidate.jsonl> [--threshold <value>]
+```
+
+## Arguments
+
+| Argument | Description |
+|----------|-------------|
+| `result1` | Path to baseline JSONL result file |
+| `result2` | Path to candidate JSONL result file |
+| `--threshold`, `-t` | Score delta threshold for win/loss classification (default: 0.1) |
+
+## How It Works
+
+1. **Load Results**: Reads both JSONL files containing evaluation results
+2. **Match by eval_id**: Pairs results with matching `eval_id` fields
+3. **Compute Deltas**: Calculates `delta = score2 - score1` for each pair
+4. **Classify Outcomes**:
+   - `win`: delta >= threshold (candidate better)
+   - `loss`: delta <= -threshold (baseline better)
+   - `tie`: |delta| < threshold (no significant difference)
+5. **Output Summary**: JSON with matched results, unmatched counts, and statistics
+
+## Output Format
+
+```json
+{
+  "matched": [
+    {
+      "eval_id": "case-1",
+      "score1": 0.7,
+      "score2": 0.9,
+      "delta": 0.2,
+      "outcome": "win"
+    }
+  ],
+  "unmatched": {
+    "file1": 0,
+    "file2": 0
+  },
+  "summary": {
+    "total": 2,
+    "matched": 1,
+    "wins": 1,
+    "losses": 0,
+    "ties": 0,
+    "meanDelta": 0.2
+  }
+}
+```
+
+## Exit Codes
+
+| Code | Meaning |
+|------|---------|
+| `0` | Candidate is equal or better (meanDelta >= 0) |
+| `1` | Baseline is better (regression detected) |
+
+## Workflow Examples
+
+### Model Comparison
+
+Compare different model versions:
+
+```bash
+# Run baseline evaluation
+agentv eval evals/*.yaml --target gpt-4 --out baseline.jsonl
+
+# Run candidate evaluation
+agentv eval evals/*.yaml --target gpt-4o --out candidate.jsonl
+
+# Compare results
+agentv compare baseline.jsonl candidate.jsonl
+```
+
+### Prompt Optimization
+
+Compare before/after prompt changes:
+
+```bash
+# Run with original prompt
+agentv eval evals/*.yaml --out before.jsonl
+
+# Modify prompt, then run again
+agentv eval evals/*.yaml --out after.jsonl
+
+# Compare with strict threshold
+agentv compare before.jsonl after.jsonl --threshold 0.05
+```
+
+### CI Quality Gate
+
+Fail CI if candidate regresses:
+
+```bash
+#!/bin/bash
+agentv compare baseline.jsonl candidate.jsonl
+if [ $? -eq 1 ]; then
+  echo "Regression detected! Candidate performs worse than baseline."
+  exit 1
+fi
+echo "Candidate is equal or better than baseline."
+```
+
+## Tips
+
+- **Threshold Selection**: Default 0.1 means 10% difference required. Use stricter thresholds (0.05) for critical evaluations.
+- **Unmatched Results**: Check `unmatched` counts to identify eval cases that only exist in one file.
+- **Multiple Comparisons**: Compare against multiple baselines by running the command multiple times.
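The match/delta/classify flow documented in the new compare-command.md can be sketched in a few lines. This is a minimal TypeScript sketch mirroring the documented behavior (match by `eval_id`, `delta = score2 - score1`, threshold classification), not agentv's actual implementation; the record shapes are assumptions based on the output format shown above.

```typescript
interface EvalResult {
  eval_id: string;
  score: number;
}

type Outcome = "win" | "loss" | "tie";

// Classify a score delta against the threshold, as documented:
// win if delta >= threshold, loss if delta <= -threshold, else tie.
function classify(delta: number, threshold: number): Outcome {
  if (delta >= threshold) return "win"; // candidate better
  if (delta <= -threshold) return "loss"; // baseline better
  return "tie"; // |delta| < threshold
}

function compare(baseline: EvalResult[], candidate: EvalResult[], threshold = 0.1) {
  // Pair results from both files by eval_id.
  const byId = new Map<string, EvalResult>(candidate.map((r) => [r.eval_id, r]));
  const matched = baseline
    .filter((b) => byId.has(b.eval_id))
    .map((b) => {
      const c = byId.get(b.eval_id)!;
      const delta = c.score - b.score;
      return { eval_id: b.eval_id, score1: b.score, score2: c.score, delta, outcome: classify(delta, threshold) };
    });
  const meanDelta = matched.length
    ? matched.reduce((sum, m) => sum + m.delta, 0) / matched.length
    : 0;
  return {
    matched,
    unmatched: {
      file1: baseline.length - matched.length,
      file2: candidate.length - matched.length,
    },
    summary: {
      matched: matched.length,
      wins: matched.filter((m) => m.outcome === "win").length,
      losses: matched.filter((m) => m.outcome === "loss").length,
      ties: matched.filter((m) => m.outcome === "tie").length,
      meanDelta,
    },
  };
}
```

With the default threshold of 0.1, a pair scoring 0.7 then 0.9 classifies as a win, matching the example output above.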
package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md
CHANGED

@@ -11,17 +11,42 @@ Code evaluators receive input via stdin and write output to stdout, both as JSON
 ```json
 {
   "question": "string describing the task/question",
-  "
-  "
-  "
-  "
-  "
-  "
-  "
+  "expectedOutcome": "expected outcome description",
+  "referenceAnswer": "gold standard answer (optional)",
+  "candidateAnswer": "generated code/text from the agent",
+  "guidelineFiles": ["path1", "path2"],
+  "inputFiles": ["file1", "file2"],
+  "inputMessages": [{"role": "user", "content": "..."}],
+  "outputMessages": [
+    {
+      "role": "assistant",
+      "content": "...",
+      "toolCalls": [
+        {
+          "tool": "search",
+          "input": { "query": "..." },
+          "output": { "results": [...] },
+          "id": "call_123",
+          "timestamp": "2024-01-15T10:30:00Z"
+        }
+      ]
+    }
+  ],
+  "traceSummary": {
+    "eventCount": 5,
+    "toolNames": ["fetch", "search"],
+    "toolCallsByName": { "search": 2, "fetch": 1 },
+    "errorCount": 0,
+    "tokenUsage": { "input": 1000, "output": 500 },
+    "costUsd": 0.0015,
+    "durationMs": 3500
+  }
 }
 ```

-
+**Key fields:**
+- `outputMessages` - Full agent execution trace with tool calls (use `toolCalls[].input` for arguments)
+- `traceSummary` - Lightweight summary with execution metrics (counts only, no tool arguments)

 ### Output Format (to stdout)
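A code evaluator consuming the input format above might look like the following TypeScript sketch. It uses only fields shown in the diff (`candidateAnswer`, `traceSummary`); the `{ score, reason }` output shape and the `evaluate` function name are illustrative assumptions, since this chunk does not show the documented stdout schema.

```typescript
interface TraceSummary {
  errorCount: number;
  costUsd: number;
  durationMs: number;
}

interface EvaluatorInput {
  question: string;
  candidateAnswer: string;
  traceSummary?: TraceSummary;
}

// Hypothetical verdict shape -- the real output schema is defined in
// custom-evaluators.md under "Output Format (to stdout)".
function evaluate(input: EvaluatorInput): { score: number; reason: string } {
  if (!input.candidateAnswer.trim()) {
    return { score: 0, reason: "empty candidate answer" };
  }
  const t = input.traceSummary;
  if (t && t.errorCount > 0) {
    // traceSummary carries counts only, so we can gate on errors
    // without inspecting individual tool calls.
    return { score: 0, reason: `trace contains ${t.errorCount} error(s)` };
  }
  return { score: 1, reason: "answer present and trace clean" };
}

// A real evaluator would read the JSON payload from stdin and print
// the verdict to stdout, e.g.:
//   process.stdout.write(JSON.stringify(evaluate(JSON.parse(raw))));
```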
@@ -189,7 +214,7 @@ You can customize this template in your eval file using the `evaluatorTemplate`
 execution:
   evaluators:
     - name: my_validator
-      type:
+      type: code_judge
       script: uv run my_validator.py
       cwd: ./evaluators
 ```
package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md
CHANGED

@@ -5,7 +5,6 @@ This document contains complete examples of well-structured eval files demonstra
 ## Basic Example: Simple Q&A Eval

 ```yaml
-$schema: agentv-eval-v2
 description: Basic arithmetic evaluation
 execution:
   target: default

@@ -26,7 +25,6 @@ evalcases:
 ## Code Review with File References

 ```yaml
-$schema: agentv-eval-v2
 description: Code review with guidelines
 execution:
   target: azure_base

@@ -69,7 +67,6 @@ evalcases:
 ## Multi-Evaluator Configuration

 ```yaml
-$schema: agentv-eval-v2
 description: JSON generation with validation
 execution:
   target: default

@@ -109,7 +106,6 @@ evalcases:
 Validate that an agent uses specific tools during execution.

 ```yaml
-$schema: agentv-eval-v2
 description: Tool usage validation
 execution:
   target: mock_agent

@@ -151,7 +147,6 @@ evalcases:
 Evaluate pre-existing trace files without running an agent.

 ```yaml
-$schema: agentv-eval-v2
 description: Static trace evaluation
 execution:
   target: static_trace

@@ -175,7 +170,6 @@ evalcases:
 ## Multi-Turn Conversation (Single Eval Case)

 ```yaml
-$schema: agentv-eval-v2
 description: Multi-turn debugging session with clarifying questions
 execution:
   target: default

@@ -237,7 +231,6 @@ evalcases:
 Evaluate external batch runners that process all evalcases in one invocation.

 ```yaml
-$schema: agentv-eval-v2
 description: Batch CLI demo (AML screening)
 execution:
   target: batch_cli
package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md
CHANGED

@@ -9,8 +9,6 @@ Rubrics provide structured evaluation through lists of criteria that define what
 Define rubrics as simple strings - each becomes a required criterion with weight 1.0:

 ```yaml
-$schema: agentv-eval-v2
-
 evalcases:
   - id: quicksort-explanation
     expected_outcome: Explain how quicksort works
package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md
CHANGED
@@ -48,6 +48,34 @@ execution:
 - Allow agent to use additional helper tools
 - Check that key steps happen in sequence

+### Argument Matching
+
+For `in_order` and `exact` modes, you can optionally validate tool arguments:
+
+```yaml
+execution:
+  evaluators:
+    - name: search-validation
+      type: tool_trajectory
+      mode: in_order
+      expected:
+        # Partial match - only specified keys are checked
+        - tool: search
+          args: { query: "machine learning" }
+
+        # Skip argument validation for this tool
+        - tool: process
+          args: any
+
+        # No args field = no argument validation (same as args: any)
+        - tool: saveResults
+```
+
+**Argument matching modes:**
+- `args: { key: value }` - Partial deep equality (only specified keys are checked)
+- `args: any` - Skip argument validation
+- No `args` field - Same as `args: any`
+
 #### 3. `exact` - Strict Sequence Match

 Validates the exact tool sequence with no gaps or extra tools:
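The argument-matching rules in the new "Argument Matching" section can be sketched as a small matcher. This TypeScript sketch approximates the documented semantics (`args: any` or a missing `args` field skips validation; an object does a partial deep comparison where only the specified keys are checked); it is an illustration, not agentv's actual matcher, and its treatment of arrays as index-keyed objects is a simplifying assumption.

```typescript
type Args = Record<string, unknown>;
type ExpectedCall = { tool: string; args?: Args | "any" };
type ActualCall = { tool: string; input: Args };

// Partial deep equality: every key present in `expected` must match the
// corresponding key in `actual`; extra keys in `actual` are ignored.
function partialDeepEqual(expected: unknown, actual: unknown): boolean {
  if (expected === null || typeof expected !== "object") return expected === actual;
  if (actual === null || typeof actual !== "object") return false;
  return Object.entries(expected as Args).every(([key, value]) =>
    partialDeepEqual(value, (actual as Args)[key]),
  );
}

function callMatches(expected: ExpectedCall, actual: ActualCall): boolean {
  if (expected.tool !== actual.tool) return false;
  // No args field, or args: any => skip argument validation.
  if (expected.args === undefined || expected.args === "any") return true;
  return partialDeepEqual(expected.args, actual.input);
}
```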
@@ -121,9 +149,7 @@ The evaluator extracts tool calls from `output_messages[].tool_calls[]`. Optiona
 ### Research Agent Validation

 ```yaml
-$schema: agentv-eval-v2
 description: Validate research agent tool usage
-
 execution:
   target: codex_agent # Provider that returns traces

package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "agentv",
-  "version": "1.
+  "version": "1.6.1",
   "description": "CLI entry point for AgentV",
   "type": "module",
   "repository": {
@@ -14,7 +14,10 @@
   "bin": {
     "agentv": "./dist/cli.js"
   },
-  "files": [
+  "files": [
+    "dist",
+    "README.md"
+  ],
   "scripts": {
     "dev": "bun --watch src/index.ts",
     "build": "tsup && bun run copy-readme",
|