agentv 1.5.0 → 1.6.1

package/dist/cli.js CHANGED
@@ -1,7 +1,7 @@
  #!/usr/bin/env node
  import {
  runCli
- } from "./chunk-3RYQPI4H.js";
+ } from "./chunk-HU4B6ODF.js";
  import "./chunk-UE4GLFVL.js";

  // src/cli.ts
package/dist/index.js CHANGED
@@ -1,7 +1,7 @@
  import {
  app,
  runCli
- } from "./chunk-3RYQPI4H.js";
+ } from "./chunk-HU4B6ODF.js";
  import "./chunk-UE4GLFVL.js";
  export {
  app,
@@ -16,9 +16,10 @@ description: Create and maintain AgentV YAML evaluation files for testing AI age
  - Tool Trajectory: `references/tool-trajectory-evaluator.md` - Validate agent tool usage
  - Custom Evaluators: `references/custom-evaluators.md` - Code and LLM judge templates
  - Batch CLI: `references/batch-cli-evaluator.md` - Evaluate batch runner output (JSONL)
+ - Compare: `references/compare-command.md` - Compare evaluation results between runs

  ## Structure Requirements
- - Root level: `description` (optional), `execution` (optional with `target` inside), `evalcases` (required)
+ - Root level: `description` (optional), `execution` (with `target`), `evalcases` (required)
  - Eval case fields: `id` (required), `expected_outcome` (required), `input_messages` (required)
  - Optional fields: `expected_messages`, `conversation_id`, `rubrics`, `execution`
  - `expected_messages` is optional - omit for outcome-only evaluation where the LLM judge evaluates based on `expected_outcome` criteria alone
@@ -142,7 +143,6 @@ See `references/composite-evaluator.md` for aggregation types and patterns.
  Evaluate external batch runners that process all evalcases in one invocation:

  ```yaml
- $schema: agentv-eval-v2
  description: Batch CLI evaluation
  execution:
    target: batch_cli
@@ -177,7 +177,6 @@ See `references/batch-cli-evaluator.md` for full implementation guide.

  ## Example
  ```yaml
- $schema: agentv-eval-v2
  description: Example showing basic features and conversation threading
  execution:
    target: default
@@ -20,9 +20,7 @@ Batch CLI evaluation is used when:
  ## Eval File Structure

  ```yaml
- $schema: agentv-eval-v2
  description: Batch CLI demo using structured input_messages
-
  execution:
    target: batch_cli

@@ -0,0 +1,115 @@
+ # Compare Command
+ 
+ Compare evaluation results between two runs to measure performance differences.
+ 
+ ## Usage
+ 
+ ```bash
+ agentv compare <baseline.jsonl> <candidate.jsonl> [--threshold <value>]
+ ```
+ 
+ ## Arguments
+ 
+ | Argument | Description |
+ |----------|-------------|
+ | `result1` | Path to baseline JSONL result file |
+ | `result2` | Path to candidate JSONL result file |
+ | `--threshold`, `-t` | Score delta threshold for win/loss classification (default: 0.1) |
+ 
+ ## How It Works
+ 
+ 1. **Load Results**: Reads both JSONL files containing evaluation results
+ 2. **Match by eval_id**: Pairs results with matching `eval_id` fields
+ 3. **Compute Deltas**: Calculates `delta = score2 - score1` for each pair
+ 4. **Classify Outcomes**:
+    - `win`: delta >= threshold (candidate better)
+    - `loss`: delta <= -threshold (baseline better)
+    - `tie`: |delta| < threshold (no significant difference)
+ 5. **Output Summary**: JSON with matched results, unmatched counts, and statistics
+ 
+ ## Output Format
+ 
+ ```json
+ {
+   "matched": [
+     {
+       "eval_id": "case-1",
+       "score1": 0.7,
+       "score2": 0.9,
+       "delta": 0.2,
+       "outcome": "win"
+     }
+   ],
+   "unmatched": {
+     "file1": 0,
+     "file2": 0
+   },
+   "summary": {
+     "total": 2,
+     "matched": 1,
+     "wins": 1,
+     "losses": 0,
+     "ties": 0,
+     "meanDelta": 0.2
+   }
+ }
+ ```
+ 
+ ## Exit Codes
+ 
+ | Code | Meaning |
+ |------|---------|
+ | `0` | Candidate is equal or better (meanDelta >= 0) |
+ | `1` | Baseline is better (regression detected) |
+ 
+ ## Workflow Examples
+ 
+ ### Model Comparison
+ 
+ Compare different model versions:
+ 
+ ```bash
+ # Run baseline evaluation
+ agentv eval evals/*.yaml --target gpt-4 --out baseline.jsonl
+ 
+ # Run candidate evaluation
+ agentv eval evals/*.yaml --target gpt-4o --out candidate.jsonl
+ 
+ # Compare results
+ agentv compare baseline.jsonl candidate.jsonl
+ ```
+ 
+ ### Prompt Optimization
+ 
+ Compare before/after prompt changes:
+ 
+ ```bash
+ # Run with original prompt
+ agentv eval evals/*.yaml --out before.jsonl
+ 
+ # Modify prompt, then run again
+ agentv eval evals/*.yaml --out after.jsonl
+ 
+ # Compare with strict threshold
+ agentv compare before.jsonl after.jsonl --threshold 0.05
+ ```
+ 
+ ### CI Quality Gate
+ 
+ Fail CI if candidate regresses:
+ 
+ ```bash
+ #!/bin/bash
+ agentv compare baseline.jsonl candidate.jsonl
+ if [ $? -eq 1 ]; then
+   echo "Regression detected! Candidate performs worse than baseline."
+   exit 1
+ fi
+ echo "Candidate is equal or better than baseline."
+ ```
+ 
+ ## Tips
+ 
+ - **Threshold Selection**: Default 0.1 means 10% difference required. Use stricter thresholds (0.05) for critical evaluations.
+ - **Unmatched Results**: Check `unmatched` counts to identify eval cases that only exist in one file.
+ - **Multiple Comparisons**: Compare against multiple baselines by running the command multiple times.
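
The matching, delta, and classification steps documented in the new `compare-command.md` can be sketched as follows. This is an illustrative sketch of the documented behavior only, not agentv's actual source; the `Row` type and the `classify`/`compare` names are invented here:

```typescript
// Sketch of the documented compare pipeline: match rows by eval_id,
// compute delta = score2 - score1, classify against the threshold.
type Row = { eval_id: string; score: number };
type Outcome = "win" | "loss" | "tie";

function classify(delta: number, threshold: number): Outcome {
  if (delta >= threshold) return "win";   // candidate better
  if (delta <= -threshold) return "loss"; // baseline better
  return "tie";                           // |delta| < threshold
}

function compare(baseline: Row[], candidate: Row[], threshold = 0.1) {
  const byId = new Map(candidate.map((r) => [r.eval_id, r] as [string, Row]));
  const matched = baseline
    .filter((b) => byId.has(b.eval_id))
    .map((b) => {
      const c = byId.get(b.eval_id)!;
      const delta = c.score - b.score;
      return {
        eval_id: b.eval_id,
        score1: b.score,
        score2: c.score,
        delta,
        outcome: classify(delta, threshold),
      };
    });
  const meanDelta =
    matched.reduce((sum, m) => sum + m.delta, 0) / (matched.length || 1);
  return { matched, meanDelta };
}
```

A CLI wrapper around this would then exit 0 when `meanDelta >= 0` and 1 otherwise, matching the documented exit codes.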
@@ -11,17 +11,42 @@ Code evaluators receive input via stdin and write output to stdout, both as JSON
  ```json
  {
    "question": "string describing the task/question",
-   "expected_outcome": "expected outcome description",
-   "reference_answer": "gold standard answer (optional)",
-   "candidate_answer": "generated code/text from the agent",
-   "guideline_paths": ["path1", "path2"],
-   "input_files": ["file1", "file2"],
-   "input_messages": [{"role": "user", "content": "..."}],
-   "output_messages": [{"role": "assistant", "content": "...", "tool_calls": [...]}]
+   "expectedOutcome": "expected outcome description",
+   "referenceAnswer": "gold standard answer (optional)",
+   "candidateAnswer": "generated code/text from the agent",
+   "guidelineFiles": ["path1", "path2"],
+   "inputFiles": ["file1", "file2"],
+   "inputMessages": [{"role": "user", "content": "..."}],
+   "outputMessages": [
+     {
+       "role": "assistant",
+       "content": "...",
+       "toolCalls": [
+         {
+           "tool": "search",
+           "input": { "query": "..." },
+           "output": { "results": [...] },
+           "id": "call_123",
+           "timestamp": "2024-01-15T10:30:00Z"
+         }
+       ]
+     }
+   ],
+   "traceSummary": {
+     "eventCount": 5,
+     "toolNames": ["fetch", "search"],
+     "toolCallsByName": { "search": 2, "fetch": 1 },
+     "errorCount": 0,
+     "tokenUsage": { "input": 1000, "output": 500 },
+     "costUsd": 0.0015,
+     "durationMs": 3500
+   }
  }
  ```

- The `output_messages` array contains the full agent execution trace with tool calls, enabling custom validation of agent behavior.
+ **Key fields:**
+ - `outputMessages` - Full agent execution trace with tool calls (use `toolCalls[].input` for arguments)
+ - `traceSummary` - Lightweight summary with execution metrics (counts only, no tool arguments)

  ### Output Format (to stdout)

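
A code evaluator consuming the camelCase input shown above might look like the sketch below. It is illustrative only: a real evaluator would `JSON.parse` the object from stdin and `JSON.stringify` its result to stdout, and the `{ score }` output shape here is an assumption — see the Output Format section of that document for the actual contract. The "search" tool check is a hypothetical example:

```typescript
// Minimal sketch of a code evaluator's core logic. Field names follow the
// camelCase input schema documented above.
interface ToolCall {
  tool: string;
  input?: Record<string, unknown>;
}

interface EvaluatorInput {
  question: string;
  expectedOutcome: string;
  candidateAnswer: string;
  outputMessages?: { role: string; content?: string; toolCalls?: ToolCall[] }[];
}

// Example check: require that the agent called a (hypothetical) "search" tool
// somewhere in its execution trace.
function evaluate(input: EvaluatorInput): { score: number } {
  const tools = (input.outputMessages ?? []).flatMap((m) =>
    (m.toolCalls ?? []).map((t) => t.tool)
  );
  return { score: tools.includes("search") ? 1.0 : 0.0 };
}
```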
@@ -189,7 +214,7 @@ You can customize this template in your eval file using the `evaluatorTemplate`
  execution:
    evaluators:
      - name: my_validator
-       type: code
+       type: code_judge
        script: uv run my_validator.py
        cwd: ./evaluators
  ```
@@ -5,7 +5,6 @@ This document contains complete examples of well-structured eval files demonstra
  ## Basic Example: Simple Q&A Eval

  ```yaml
- $schema: agentv-eval-v2
  description: Basic arithmetic evaluation
  execution:
    target: default
@@ -26,7 +25,6 @@ evalcases:
  ## Code Review with File References

  ```yaml
- $schema: agentv-eval-v2
  description: Code review with guidelines
  execution:
    target: azure_base
@@ -69,7 +67,6 @@ evalcases:
  ## Multi-Evaluator Configuration

  ```yaml
- $schema: agentv-eval-v2
  description: JSON generation with validation
  execution:
    target: default
@@ -109,7 +106,6 @@ evalcases:
  Validate that an agent uses specific tools during execution.

  ```yaml
- $schema: agentv-eval-v2
  description: Tool usage validation
  execution:
    target: mock_agent
@@ -151,7 +147,6 @@ evalcases:
  Evaluate pre-existing trace files without running an agent.

  ```yaml
- $schema: agentv-eval-v2
  description: Static trace evaluation
  execution:
    target: static_trace
@@ -175,7 +170,6 @@ evalcases:
  ## Multi-Turn Conversation (Single Eval Case)

  ```yaml
- $schema: agentv-eval-v2
  description: Multi-turn debugging session with clarifying questions
  execution:
    target: default
@@ -237,7 +231,6 @@ evalcases:
  Evaluate external batch runners that process all evalcases in one invocation.

  ```yaml
- $schema: agentv-eval-v2
  description: Batch CLI demo (AML screening)
  execution:
    target: batch_cli
@@ -9,8 +9,6 @@ Rubrics provide structured evaluation through lists of criteria that define what
  Define rubrics as simple strings - each becomes a required criterion with weight 1.0:

  ```yaml
- $schema: agentv-eval-v2
-
  evalcases:
    - id: quicksort-explanation
      expected_outcome: Explain how quicksort works
@@ -48,6 +48,34 @@ execution:
  - Allow agent to use additional helper tools
  - Check that key steps happen in sequence

+ ### Argument Matching
+ 
+ For `in_order` and `exact` modes, you can optionally validate tool arguments:
+ 
+ ```yaml
+ execution:
+   evaluators:
+     - name: search-validation
+       type: tool_trajectory
+       mode: in_order
+       expected:
+         # Partial match - only specified keys are checked
+         - tool: search
+           args: { query: "machine learning" }
+ 
+         # Skip argument validation for this tool
+         - tool: process
+           args: any
+ 
+         # No args field = no argument validation (same as args: any)
+         - tool: saveResults
+ ```
+ 
+ **Argument matching modes:**
+ - `args: { key: value }` - Partial deep equality (only specified keys are checked)
+ - `args: any` - Skip argument validation
+ - No `args` field - Same as `args: any`
+ 
  #### 3. `exact` - Strict Sequence Match

  Validates the exact tool sequence with no gaps or extra tools:
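
The "partial deep equality" rule added above can be sketched as a recursive check: every key present in the expected args must match the actual call's args, while extra keys in the actual args are ignored. This is an illustration of the documented rule under the assumption that `args: any` parses to the string `"any"`, not the evaluator's actual code:

```typescript
// Recursive partial deep equality for tool-call argument matching.
// - expected omitted or "any": always matches (skip validation)
// - expected object: every listed key must match recursively; extra actual keys ignored
// - expected leaf value: compared by strict equality
function partialMatch(expected: unknown, actual: unknown): boolean {
  if (expected === undefined || expected === "any") return true;
  if (typeof expected !== "object" || expected === null) {
    return expected === actual;
  }
  if (typeof actual !== "object" || actual === null) return false;
  return Object.entries(expected as Record<string, unknown>).every(
    ([key, value]) => partialMatch(value, (actual as Record<string, unknown>)[key])
  );
}
```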
@@ -121,9 +149,7 @@ The evaluator extracts tool calls from `output_messages[].tool_calls[]`. Optiona
  ### Research Agent Validation

  ```yaml
- $schema: agentv-eval-v2
  description: Validate research agent tool usage
-
  execution:
    target: codex_agent # Provider that returns traces

package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "agentv",
-   "version": "1.5.0",
+   "version": "1.6.1",
    "description": "CLI entry point for AgentV",
    "type": "module",
    "repository": {
@@ -14,7 +14,10 @@
    "bin": {
      "agentv": "./dist/cli.js"
    },
-   "files": ["dist", "README.md"],
+   "files": [
+     "dist",
+     "README.md"
+   ],
    "scripts": {
      "dev": "bun --watch src/index.ts",
      "build": "tsup && bun run copy-readme",