agentv 1.3.1 → 1.5.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +439 -441
- package/dist/{chunk-6R2YRXCQ.js → chunk-3RYQPI4H.js} +487 -329
- package/dist/chunk-3RYQPI4H.js.map +1 -0
- package/dist/cli.js +1 -1
- package/dist/index.js +1 -1
- package/dist/templates/.agentv/.env.template +23 -23
- package/dist/templates/.agentv/config.yaml +15 -15
- package/dist/templates/.agentv/targets.yaml +71 -73
- package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +212 -211
- package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md +318 -288
- package/dist/templates/.claude/skills/agentv-eval-builder/references/composite-evaluator.md +215 -215
- package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md +216 -213
- package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md +340 -333
- package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md +139 -139
- package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md +198 -179
- package/dist/templates/.claude/skills/agentv-prompt-optimizer/SKILL.md +77 -77
- package/dist/templates/.github/prompts/agentv-eval-build.prompt.md +4 -4
- package/dist/templates/.github/prompts/agentv-optimize.prompt.md +3 -3
- package/package.json +2 -5
- package/dist/chunk-6R2YRXCQ.js.map +0 -1
package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md
CHANGED
@@ -1,139 +1,139 @@
# Rubric Evaluator Guide

Rubrics provide structured evaluation through lists of criteria that define what makes a good response. Rubrics are checked by an LLM judge and scored based on weights and requirements.

## Basic Usage

### Simple String Rubrics

Define rubrics as simple strings - each becomes a required criterion with weight 1.0:

```yaml
$schema: agentv-eval-v2

evalcases:
  - id: quicksort-explanation
    expected_outcome: Explain how quicksort works

    input_messages:
      - role: user
        content: Explain how the quicksort algorithm works

    rubrics:
      - Mentions divide-and-conquer approach
      - Explains the partition step
      - States time complexity correctly
```

### Detailed Rubric Objects

Use objects for fine-grained control over weights and requirements:

```yaml
evalcases:
  - id: technical-guide
    expected_outcome: Write a comprehensive HTTP status codes guide

    input_messages:
      - role: user
        content: Write a guide explaining HTTP status codes

    rubrics:
      - id: structure
        description: Has clear headings and organization
        weight: 1.0
        required: true

      - id: success-codes
        description: Covers 2xx success codes with examples
        weight: 2.0
        required: true

      - id: client-errors
        description: Explains 4xx client error codes
        weight: 2.0
        required: true

      - id: server-errors
        description: Explains 5xx server error codes
        weight: 1.5
        required: false

      - id: practical-examples
        description: Includes practical use case examples
        weight: 1.0
        required: false
```

## Rubric Object Fields

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `id` | string | auto-generated | Unique identifier for the rubric |
| `description` | string | required | The criterion being evaluated |
| `weight` | number | 1.0 | Relative importance (higher = more impact on score) |
| `required` | boolean | true | If true, failing this rubric forces verdict to 'fail' |

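For intuition, a bare string rubric behaves as if it were expanded with these defaults. A minimal Python sketch (the auto-generated `id` naming scheme here is an assumption, not agentv's actual scheme):

```python
# Sketch: how a bare string rubric expands into the object form, using
# the defaults from the table above. The id naming is hypothetical.
def normalize_rubric(rubric, index: int) -> dict:
    if isinstance(rubric, str):
        rubric = {"description": rubric}
    # Explicit fields in an object rubric override the defaults.
    return {"id": f"rubric-{index + 1}", "weight": 1.0, "required": True, **rubric}

# normalize_rubric("Explains the partition step", 1)
# -> {'id': 'rubric-2', 'weight': 1.0, 'required': True,
#     'description': 'Explains the partition step'}
```
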
## Scoring and Verdicts

**Score Calculation:**
```
score = (sum of satisfied weights) / (total weights)
```

**Verdict Rules:**
- `pass`: Score ≥ 0.8 AND all required rubrics satisfied
- `borderline`: Score ≥ 0.6 AND all required rubrics satisfied
- `fail`: Score < 0.6 OR any required rubric failed

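As an illustration, the following Python sketch (not agentv source; field names taken from the table above) reproduces this arithmetic:

```python
# Illustrative reimplementation of the scoring and verdict rules above;
# not agentv source code.
def rubric_verdict(results: list[dict]) -> tuple[float, str]:
    """results: one entry per rubric, e.g.
    {"weight": 2.0, "required": True, "satisfied": False}."""
    total = sum(r["weight"] for r in results)
    score = sum(r["weight"] for r in results if r["satisfied"]) / total
    required_ok = all(r["satisfied"] for r in results if r["required"])
    if required_ok and score >= 0.8:
        return score, "pass"
    if required_ok and score >= 0.6:
        return score, "borderline"
    return score, "fail"

# Worked example with the HTTP status codes rubrics above (total weight
# 1.0 + 2.0 + 2.0 + 1.5 + 1.0 = 7.5): missing only the optional
# server-errors criterion gives 6.0 / 7.5 = 0.8, i.e. a "pass".
```
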
## Combining Rubrics with Other Evaluators

Rubrics can be combined with code evaluators for comprehensive validation:

```yaml
evalcases:
  - id: email-validator
    expected_outcome: Python function to validate email addresses

    input_messages:
      - role: user
        content: Write a Python function to validate email addresses

    # Semantic evaluation via rubrics
    rubrics:
      - Uses regular expressions for validation
      - Includes type hints
      - Has docstring documentation
      - Handles edge cases (None, empty string)

    execution:
      evaluators:
        # Rubric evaluator is auto-added from inline rubrics field

        # Additional code evaluator for syntax checking
        - name: python_syntax
          type: code_judge
          script: uv run python -m py_compile
```

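The `py_compile` command above reports failure purely through its exit code, which suggests custom `code_judge` scripts can do the same. Under that assumption (the exact script contract is not documented in this guide), a standalone syntax-check script might look like:

```python
#!/usr/bin/env python3
# Hypothetical code_judge-style script. Assumes, based on the py_compile
# example above, that the candidate file path arrives as a CLI argument
# and that a nonzero exit code marks failure.
import py_compile
import sys

try:
    py_compile.compile(sys.argv[-1], doraise=True)
except py_compile.PyCompileError as err:
    print(err, file=sys.stderr)
    sys.exit(1)
```
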
## Generate Rubrics from Expected Outcome

Use the CLI to auto-generate rubrics from `expected_outcome`:

```bash
# Generate rubrics for eval cases that don't have them
agentv generate rubrics evals/my-eval.yaml

# Use a specific LLM target for generation
agentv generate rubrics evals/my-eval.yaml --target azure_base
```

This analyzes each `expected_outcome` and creates appropriate rubric items.

## Best Practices

1. **Use required sparingly** - Only mark rubrics as `required: true` for critical criteria
2. **Balance weights** - Use higher weights (2.0+) for core requirements, lower (0.5) for nice-to-haves
3. **Be specific** - "Includes error handling" is better than "Good code quality"
4. **Keep rubrics atomic** - Each rubric should test one thing
5. **Consider partial credit** - Non-required rubrics allow partial scores

package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md
CHANGED
@@ -1,179 +1,198 @@
# Tool Trajectory Evaluator Guide

Tool trajectory evaluators validate that an agent used the expected tools during execution. They work with trace data returned by agent providers (codex, vscode, cli with trace support).

## Tool Trajectory Evaluator

### Modes

#### 1. `any_order` - Minimum Tool Counts

Validates that each tool was called at least N times, regardless of order:

```yaml
execution:
  evaluators:
    - name: tool-usage
      type: tool_trajectory
      mode: any_order
      minimums:
        knowledgeSearch: 2 # Must be called at least twice
        documentRetrieve: 1 # Must be called at least once
```

**Use cases:**
- Ensure required tools are used
- Don't care about execution order
- Allow flexibility in agent implementation

#### 2. `in_order` - Sequential Matching

Validates tools appear in the expected sequence, but allows gaps (other tools can appear between):

```yaml
execution:
  evaluators:
    - name: workflow-sequence
      type: tool_trajectory
      mode: in_order
      expected:
        - tool: fetchData
        - tool: validateSchema
        - tool: transformData
        - tool: saveResults
```

**Use cases:**
- Validate logical workflow order
- Allow agent to use additional helper tools
- Check that key steps happen in sequence

#### 3. `exact` - Strict Sequence Match

Validates the exact tool sequence with no gaps or extra tools:

```yaml
execution:
  evaluators:
    - name: auth-sequence
      type: tool_trajectory
      mode: exact
      expected:
        - tool: checkCredentials
        - tool: generateToken
        - tool: auditLog
```

**Use cases:**
- Security-critical workflows
- Strict protocol validation
- Regression testing specific behavior

## Scoring

### tool_trajectory Scoring

| Mode | Score Calculation |
|------|------------------|
| `any_order` | (tools meeting minimum) / (total tools with minimums) |
| `in_order` | (matched tools in sequence) / (expected tools count) |
| `exact` | (correctly positioned tools) / (expected tools count) |

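To make these formulas concrete, here is a minimal Python sketch of the three modes. It is illustrative only, not agentv's implementation; in particular, how `exact` treats extra trailing calls is an assumption.

```python
# Illustrative sketch of the tool_trajectory scoring modes above.
from collections import Counter

def any_order_score(calls: list[str], minimums: dict[str, int]) -> float:
    counts = Counter(calls)
    return sum(counts[t] >= n for t, n in minimums.items()) / len(minimums)

def in_order_score(calls: list[str], expected: list[str]) -> float:
    # Greedy subsequence match: unrelated calls may appear between steps.
    matched, it = 0, iter(calls)
    for tool in expected:
        if any(call == tool for call in it):
            matched += 1
    return matched / len(expected)

def exact_score(calls: list[str], expected: list[str]) -> float:
    # Positional comparison; penalties for extra calls are assumed away.
    return sum(c == e for c, e in zip(calls, expected)) / len(expected)

calls = ["fetchData", "log", "validateSchema", "transformData", "saveResults"]
print(any_order_score(calls, {"fetchData": 1, "saveResults": 1}))         # 1.0
print(in_order_score(calls, ["fetchData", "validateSchema", "saveResults"]))  # 1.0
print(exact_score(calls, ["fetchData", "validateSchema"]))                # 0.5
```
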
## Trace Data Requirements

Tool trajectory evaluators require trace data from the agent provider. Providers return `output_messages` containing `tool_calls` that capture agent tool usage.

### Output Messages Format

Providers return `output_messages` with `tool_calls` in the JSONL output:

```json
{
  "id": "eval-001",
  "output_messages": [
    {
      "role": "assistant",
      "content": "I'll search for information about this topic.",
      "tool_calls": [
        {
          "tool": "knowledgeSearch",
          "input": { "query": "REST vs GraphQL" },
          "output": { "results": [...] },
          "id": "call_123",
          "timestamp": "2024-01-15T10:30:00Z"
        }
      ]
    }
  ]
}
```

The evaluator extracts tool calls from `output_messages[].tool_calls[]`. Optional fields `id` and `timestamp` can be included for debugging.

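As a sketch, that extraction over one JSONL result line could look like this (field names come from the example above; the helper itself is illustrative):

```python
# Sketch: pull the tool-call trajectory out of one JSONL result line,
# walking output_messages[].tool_calls[] as described above.
import json

def tool_trajectory(jsonl_line: str) -> list[str]:
    result = json.loads(jsonl_line)
    return [call["tool"]
            for message in result.get("output_messages", [])
            for call in message.get("tool_calls", [])]

# e.g. tool_trajectory(line) -> ["knowledgeSearch", ...]
```
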
### Supported Providers

- **codex** - Returns output_messages via JSONL log events
- **vscode / vscode-insiders** - Returns output_messages from Copilot execution
- **cli** - Returns `output_messages` with `tool_calls`

## Complete Examples

### Research Agent Validation

```yaml
$schema: agentv-eval-v2
description: Validate research agent tool usage

execution:
  target: codex_agent # Provider that returns traces

evalcases:
  - id: comprehensive-research
    expected_outcome: Agent thoroughly researches the topic

    input_messages:
      - role: user
        content: Research machine learning frameworks

    execution:
      evaluators:
        # Check minimum tool usage
        - name: coverage
          type: tool_trajectory
          mode: any_order
          minimums:
            webSearch: 1
            documentRead: 2
            noteTaking: 1

        # Check workflow order
        - name: workflow
          type: tool_trajectory
          mode: in_order
          expected:
            - tool: webSearch
            - tool: documentRead
            - tool: summarize
```

### Multi-Step Pipeline

```yaml
evalcases:
  - id: data-pipeline
    expected_outcome: Process data through complete pipeline

    input_messages:
      - role: user
        content: Process the customer dataset

    execution:
      evaluators:
        - name: pipeline-check
          type: tool_trajectory
          mode: exact
          expected:
            - tool: loadData
            - tool: validate
            - tool: transform
            - tool: export
```

## CLI Options for Traces

```bash
# Write trace files to disk
agentv eval evals/test.yaml --dump-traces

# Include full trace in result output
agentv eval evals/test.yaml --include-trace
```

## Best Practices

1. **Choose the right mode** - Use `any_order` for flexibility, `exact` for strict validation
2. **Start with any_order** - Then tighten to `in_order` or `exact` as needed
3. **Combine with other evaluators** - Use tool trajectory for execution, LLM judge for output quality
4. **Test with --dump-traces** - Inspect actual traces to understand agent behavior
5. **Use code evaluators for custom validation** - Write custom tool validation scripts with access to trace data