agentv 1.3.1 → 1.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,139 +1,139 @@
1
- # Rubric Evaluator Guide
2
-
3
- Rubrics provide structured evaluation through lists of criteria that define what makes a good response. Rubrics are checked by an LLM judge and scored based on weights and requirements.
4
-
5
- ## Basic Usage
6
-
7
- ### Simple String Rubrics
8
-
9
- Define rubrics as simple strings - each becomes a required criterion with weight 1.0:
10
-
11
- ```yaml
12
- $schema: agentv-eval-v2
13
-
14
- evalcases:
15
- - id: quicksort-explanation
16
- expected_outcome: Explain how quicksort works
17
-
18
- input_messages:
19
- - role: user
20
- content: Explain how the quicksort algorithm works
21
-
22
- rubrics:
23
- - Mentions divide-and-conquer approach
24
- - Explains the partition step
25
- - States time complexity correctly
26
- ```
27
-
28
- ### Detailed Rubric Objects
29
-
30
- Use objects for fine-grained control over weights and requirements:
31
-
32
- ```yaml
33
- evalcases:
34
- - id: technical-guide
35
- expected_outcome: Write a comprehensive HTTP status codes guide
36
-
37
- input_messages:
38
- - role: user
39
- content: Write a guide explaining HTTP status codes
40
-
41
- rubrics:
42
- - id: structure
43
- description: Has clear headings and organization
44
- weight: 1.0
45
- required: true
46
-
47
- - id: success-codes
48
- description: Covers 2xx success codes with examples
49
- weight: 2.0
50
- required: true
51
-
52
- - id: client-errors
53
- description: Explains 4xx client error codes
54
- weight: 2.0
55
- required: true
56
-
57
- - id: server-errors
58
- description: Explains 5xx server error codes
59
- weight: 1.5
60
- required: false
61
-
62
- - id: practical-examples
63
- description: Includes practical use case examples
64
- weight: 1.0
65
- required: false
66
- ```
67
-
68
- ## Rubric Object Fields
69
-
70
- | Field | Type | Default | Description |
71
- |-------|------|---------|-------------|
72
- | `id` | string | auto-generated | Unique identifier for the rubric |
73
- | `description` | string | required | The criterion being evaluated |
74
- | `weight` | number | 1.0 | Relative importance (higher = more impact on score) |
75
- | `required` | boolean | true | If true, failing this rubric forces verdict to 'fail' |
76
-
77
- ## Scoring and Verdicts
78
-
79
- **Score Calculation:**
80
- ```
81
- score = (sum of satisfied weights) / (total weights)
82
- ```
83
-
84
- **Verdict Rules:**
85
- - `pass`: Score ≥ 0.8 AND all required rubrics satisfied
86
- - `borderline`: Score ≥ 0.6 AND all required rubrics satisfied
87
- - `fail`: Score < 0.6 OR any required rubric failed
88
-
89
- ## Combining Rubrics with Other Evaluators
90
-
91
- Rubrics can be combined with code evaluators for comprehensive validation:
92
-
93
- ```yaml
94
- evalcases:
95
- - id: email-validator
96
- expected_outcome: Python function to validate email addresses
97
-
98
- input_messages:
99
- - role: user
100
- content: Write a Python function to validate email addresses
101
-
102
- # Semantic evaluation via rubrics
103
- rubrics:
104
- - Uses regular expressions for validation
105
- - Includes type hints
106
- - Has docstring documentation
107
- - Handles edge cases (None, empty string)
108
-
109
- execution:
110
- evaluators:
111
- # Rubric evaluator is auto-added from inline rubrics field
112
-
113
- # Additional code evaluator for syntax checking
114
- - name: python_syntax
115
- type: code_judge
116
- script: uv run python -m py_compile
117
- ```
118
-
119
- ## Generate Rubrics from Expected Outcome
120
-
121
- Use the CLI to auto-generate rubrics from `expected_outcome`:
122
-
123
- ```bash
124
- # Generate rubrics for eval cases that don't have them
125
- agentv generate rubrics evals/my-eval.yaml
126
-
127
- # Use a specific LLM target for generation
128
- agentv generate rubrics evals/my-eval.yaml --target azure_base
129
- ```
130
-
131
- This analyzes each `expected_outcome` and creates appropriate rubric items.
132
-
133
- ## Best Practices
134
-
135
- 1. **Use required sparingly** - Only mark rubrics as `required: true` for critical criteria
136
- 2. **Balance weights** - Use higher weights (2.0+) for core requirements, lower (0.5) for nice-to-haves
137
- 3. **Be specific** - "Includes error handling" is better than "Good code quality"
138
- 4. **Keep rubrics atomic** - Each rubric should test one thing
139
- 5. **Consider partial credit** - Non-required rubrics allow partial scores
1
+ # Rubric Evaluator Guide
2
+
3
+ Rubrics provide structured evaluation through lists of criteria that define what makes a good response. Rubrics are checked by an LLM judge and scored based on weights and requirements.
4
+
5
+ ## Basic Usage
6
+
7
+ ### Simple String Rubrics
8
+
9
+ Define rubrics as simple strings - each becomes a required criterion with weight 1.0:
10
+
11
+ ```yaml
12
+ $schema: agentv-eval-v2
13
+
14
+ evalcases:
15
+ - id: quicksort-explanation
16
+ expected_outcome: Explain how quicksort works
17
+
18
+ input_messages:
19
+ - role: user
20
+ content: Explain how the quicksort algorithm works
21
+
22
+ rubrics:
23
+ - Mentions divide-and-conquer approach
24
+ - Explains the partition step
25
+ - States time complexity correctly
26
+ ```
27
+
28
+ ### Detailed Rubric Objects
29
+
30
+ Use objects for fine-grained control over weights and requirements:
31
+
32
+ ```yaml
33
+ evalcases:
34
+ - id: technical-guide
35
+ expected_outcome: Write a comprehensive HTTP status codes guide
36
+
37
+ input_messages:
38
+ - role: user
39
+ content: Write a guide explaining HTTP status codes
40
+
41
+ rubrics:
42
+ - id: structure
43
+ description: Has clear headings and organization
44
+ weight: 1.0
45
+ required: true
46
+
47
+ - id: success-codes
48
+ description: Covers 2xx success codes with examples
49
+ weight: 2.0
50
+ required: true
51
+
52
+ - id: client-errors
53
+ description: Explains 4xx client error codes
54
+ weight: 2.0
55
+ required: true
56
+
57
+ - id: server-errors
58
+ description: Explains 5xx server error codes
59
+ weight: 1.5
60
+ required: false
61
+
62
+ - id: practical-examples
63
+ description: Includes practical use case examples
64
+ weight: 1.0
65
+ required: false
66
+ ```
67
+
68
+ ## Rubric Object Fields
69
+
70
+ | Field | Type | Default | Description |
71
+ |-------|------|---------|-------------|
72
+ | `id` | string | auto-generated | Unique identifier for the rubric |
73
+ | `description` | string | required | The criterion being evaluated |
74
+ | `weight` | number | 1.0 | Relative importance (higher = more impact on score) |
75
+ | `required` | boolean | true | If true, failing this rubric forces verdict to 'fail' |
76
+
77
+ ## Scoring and Verdicts
78
+
79
+ **Score Calculation:**
80
+ ```
81
+ score = (sum of satisfied weights) / (total weights)
82
+ ```
83
+
84
+ **Verdict Rules:**
85
+ - `pass`: Score ≥ 0.8 AND all required rubrics satisfied
86
+ - `borderline`: 0.6 ≤ Score < 0.8 AND all required rubrics satisfied
87
+ - `fail`: Score < 0.6 OR any required rubric failed
88
+
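
The scoring formula and verdict rules above can be sketched in Python. This is an illustrative sketch of the documented rules only, not agentv's actual implementation; the `(weight, satisfied, required)` tuple layout is an assumption made for the example:

```python
def judge(rubrics):
    """Score rubric results and derive a verdict.

    rubrics: list of (weight, satisfied, required) tuples.
    Illustrative only; agentv's internal types will differ.
    """
    total = sum(w for w, _, _ in rubrics)
    # score = (sum of satisfied weights) / (total weights)
    score = sum(w for w, ok, _ in rubrics if ok) / total if total else 0.0
    required_ok = all(ok for _, ok, req in rubrics if req)
    if required_ok and score >= 0.8:
        return score, "pass"
    if required_ok and score >= 0.6:
        return score, "borderline"
    return score, "fail"
```

For the HTTP-guide example above, satisfying the three required rubrics but neither optional one gives 5.0 / 7.5 ≈ 0.67, a `borderline` verdict.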
89
+ ## Combining Rubrics with Other Evaluators
90
+
91
+ Rubrics can be combined with code evaluators for comprehensive validation:
92
+
93
+ ```yaml
94
+ evalcases:
95
+ - id: email-validator
96
+ expected_outcome: Python function to validate email addresses
97
+
98
+ input_messages:
99
+ - role: user
100
+ content: Write a Python function to validate email addresses
101
+
102
+ # Semantic evaluation via rubrics
103
+ rubrics:
104
+ - Uses regular expressions for validation
105
+ - Includes type hints
106
+ - Has docstring documentation
107
+ - Handles edge cases (None, empty string)
108
+
109
+ execution:
110
+ evaluators:
111
+ # Rubric evaluator is auto-added from inline rubrics field
112
+
113
+ # Additional code evaluator for syntax checking
114
+ - name: python_syntax
115
+ type: code_judge
116
+ script: uv run python -m py_compile
117
+ ```
118
+
119
+ ## Generate Rubrics from Expected Outcome
120
+
121
+ Use the CLI to auto-generate rubrics from `expected_outcome`:
122
+
123
+ ```bash
124
+ # Generate rubrics for eval cases that don't have them
125
+ agentv generate rubrics evals/my-eval.yaml
126
+
127
+ # Use a specific LLM target for generation
128
+ agentv generate rubrics evals/my-eval.yaml --target azure_base
129
+ ```
130
+
131
+ This analyzes each `expected_outcome` and creates appropriate rubric items.
132
+
133
+ ## Best Practices
134
+
135
+ 1. **Use required sparingly** - Only mark rubrics as `required: true` for critical criteria
136
+ 2. **Balance weights** - Use higher weights (2.0+) for core requirements, lower (0.5) for nice-to-haves
137
+ 3. **Be specific** - "Includes error handling" is better than "Good code quality"
138
+ 4. **Keep rubrics atomic** - Each rubric should test one thing
139
+ 5. **Consider partial credit** - Non-required rubrics allow partial scores
@@ -1,179 +1,198 @@
1
- # Tool Trajectory Evaluator Guide
2
-
3
- Tool trajectory evaluators validate that an agent used the expected tools during execution. They work with trace data returned by agent providers (codex, vscode, cli with trace support).
4
-
5
- ## Tool Trajectory Evaluator
6
-
7
- ### Modes
8
-
9
- #### 1. `any_order` - Minimum Tool Counts
10
-
11
- Validates that each tool was called at least N times, regardless of order:
12
-
13
- ```yaml
14
- execution:
15
- evaluators:
16
- - name: tool-usage
17
- type: tool_trajectory
18
- mode: any_order
19
- minimums:
20
- knowledgeSearch: 2 # Must be called at least twice
21
- documentRetrieve: 1 # Must be called at least once
22
- ```
23
-
24
- **Use cases:**
25
- - Ensure required tools are used
26
- - Don't care about execution order
27
- - Allow flexibility in agent implementation
28
-
29
- #### 2. `in_order` - Sequential Matching
30
-
31
- Validates tools appear in the expected sequence, but allows gaps (other tools can appear between):
32
-
33
- ```yaml
34
- execution:
35
- evaluators:
36
- - name: workflow-sequence
37
- type: tool_trajectory
38
- mode: in_order
39
- expected:
40
- - tool: fetchData
41
- - tool: validateSchema
42
- - tool: transformData
43
- - tool: saveResults
44
- ```
45
-
46
- **Use cases:**
47
- - Validate logical workflow order
48
- - Allow agent to use additional helper tools
49
- - Check that key steps happen in sequence
50
-
51
- #### 3. `exact` - Strict Sequence Match
52
-
53
- Validates the exact tool sequence with no gaps or extra tools:
54
-
55
- ```yaml
56
- execution:
57
- evaluators:
58
- - name: auth-sequence
59
- type: tool_trajectory
60
- mode: exact
61
- expected:
62
- - tool: checkCredentials
63
- - tool: generateToken
64
- - tool: auditLog
65
- ```
66
-
67
- **Use cases:**
68
- - Security-critical workflows
69
- - Strict protocol validation
70
- - Regression testing specific behavior
71
-
72
- ## Scoring
73
-
74
- ### tool_trajectory Scoring
75
-
76
- | Mode | Score Calculation |
77
- |------|------------------|
78
- | `any_order` | (tools meeting minimum) / (total tools with minimums) |
79
- | `in_order` | (matched tools in sequence) / (expected tools count) |
80
- | `exact` | (correctly positioned tools) / (expected tools count) |
81
-
82
- ## Trace Data Requirements
83
-
84
- Tool trajectory evaluators require trace data from the agent provider. Supported providers:
85
-
86
- - **codex** - Returns trace via JSONL log events
87
- - **vscode / vscode-insiders** - Returns trace from Copilot execution
88
- - **cli** - Can return trace if agent outputs trace format
89
-
90
- ### Trace Event Structure
91
-
92
- ```json
93
- {
94
- "type": "tool_call",
95
- "name": "knowledgeSearch",
96
- "input": { "query": "REST vs GraphQL" },
97
- "timestamp": "2024-01-15T10:30:00Z"
98
- }
99
- ```
100
-
101
- ## Complete Examples
102
-
103
- ### Research Agent Validation
104
-
105
- ```yaml
106
- $schema: agentv-eval-v2
107
- description: Validate research agent tool usage
108
-
109
- target: codex_agent # Provider that returns traces
110
-
111
- evalcases:
112
- - id: comprehensive-research
113
- expected_outcome: Agent thoroughly researches the topic
114
-
115
- input_messages:
116
- - role: user
117
- content: Research machine learning frameworks
118
-
119
- execution:
120
- evaluators:
121
- # Check minimum tool usage
122
- - name: coverage
123
- type: tool_trajectory
124
- mode: any_order
125
- minimums:
126
- webSearch: 1
127
- documentRead: 2
128
- noteTaking: 1
129
-
130
- # Check workflow order
131
- - name: workflow
132
- type: tool_trajectory
133
- mode: in_order
134
- expected:
135
- - tool: webSearch
136
- - tool: documentRead
137
- - tool: summarize
138
- ```
139
-
140
- ### Multi-Step Pipeline
141
-
142
- ```yaml
143
- evalcases:
144
- - id: data-pipeline
145
- expected_outcome: Process data through complete pipeline
146
-
147
- input_messages:
148
- - role: user
149
- content: Process the customer dataset
150
-
151
- execution:
152
- evaluators:
153
- - name: pipeline-check
154
- type: tool_trajectory
155
- mode: exact
156
- expected:
157
- - tool: loadData
158
- - tool: validate
159
- - tool: transform
160
- - tool: export
161
- ```
162
-
163
- ## CLI Options for Traces
164
-
165
- ```bash
166
- # Write trace files to disk
167
- agentv eval evals/test.yaml --dump-traces
168
-
169
- # Include full trace in result output
170
- agentv eval evals/test.yaml --include-trace
171
- ```
172
-
173
- ## Best Practices
174
-
175
- 1. **Choose the right mode** - Use `any_order` for flexibility, `exact` for strict validation
176
- 2. **Start with any_order** - Then tighten to `in_order` or `exact` as needed
177
- 3. **Combine with other evaluators** - Use tool trajectory for execution, LLM judge for output quality
178
- 4. **Test with --dump-traces** - Inspect actual traces to understand agent behavior
179
- 5. **Use code evaluators for custom validation** - Write custom tool validation scripts with access to trace data
1
+ # Tool Trajectory Evaluator Guide
2
+
3
+ Tool trajectory evaluators validate that an agent used the expected tools during execution. They work with trace data returned by agent providers (codex, vscode, cli with trace support).
4
+
5
+ ## Tool Trajectory Evaluator
6
+
7
+ ### Modes
8
+
9
+ #### 1. `any_order` - Minimum Tool Counts
10
+
11
+ Validates that each tool was called at least N times, regardless of order:
12
+
13
+ ```yaml
14
+ execution:
15
+ evaluators:
16
+ - name: tool-usage
17
+ type: tool_trajectory
18
+ mode: any_order
19
+ minimums:
20
+ knowledgeSearch: 2 # Must be called at least twice
21
+ documentRetrieve: 1 # Must be called at least once
22
+ ```
23
+
24
+ **Use cases:**
25
+ - Ensure required tools are used
26
+ - Execution order doesn't matter
27
+ - Allow flexibility in agent implementation
28
+
29
+ #### 2. `in_order` - Sequential Matching
30
+
31
+ Validates that tools appear in the expected sequence, allowing gaps (other tools may appear in between):
32
+
33
+ ```yaml
34
+ execution:
35
+ evaluators:
36
+ - name: workflow-sequence
37
+ type: tool_trajectory
38
+ mode: in_order
39
+ expected:
40
+ - tool: fetchData
41
+ - tool: validateSchema
42
+ - tool: transformData
43
+ - tool: saveResults
44
+ ```
45
+
46
+ **Use cases:**
47
+ - Validate logical workflow order
48
+ - Allow agent to use additional helper tools
49
+ - Check that key steps happen in sequence
50
+
51
+ #### 3. `exact` - Strict Sequence Match
52
+
53
+ Validates the exact tool sequence with no gaps or extra tools:
54
+
55
+ ```yaml
56
+ execution:
57
+ evaluators:
58
+ - name: auth-sequence
59
+ type: tool_trajectory
60
+ mode: exact
61
+ expected:
62
+ - tool: checkCredentials
63
+ - tool: generateToken
64
+ - tool: auditLog
65
+ ```
66
+
67
+ **Use cases:**
68
+ - Security-critical workflows
69
+ - Strict protocol validation
70
+ - Regression testing specific behavior
71
+
72
+ ## Scoring
73
+
74
+ ### tool_trajectory Scoring
75
+
76
+ | Mode | Score Calculation |
77
+ |------|------------------|
78
+ | `any_order` | (tools meeting minimum) / (total tools with minimums) |
79
+ | `in_order` | (matched tools in sequence) / (expected tools count) |
80
+ | `exact` | (correctly positioned tools) / (expected tools count) |
81
+
82
+ ## Trace Data Requirements
83
+
84
+ Tool trajectory evaluators require trace data from the agent provider. Providers return `output_messages` containing `tool_calls` that capture agent tool usage.
85
+
86
+ ### Output Messages Format
87
+
88
+ Providers return `output_messages` with `tool_calls` in the JSONL output:
89
+
90
+ ```json
91
+ {
92
+ "id": "eval-001",
93
+ "output_messages": [
94
+ {
95
+ "role": "assistant",
96
+ "content": "I'll search for information about this topic.",
97
+ "tool_calls": [
98
+ {
99
+ "tool": "knowledgeSearch",
100
+ "input": { "query": "REST vs GraphQL" },
101
+ "output": { "results": [...] },
102
+ "id": "call_123",
103
+ "timestamp": "2024-01-15T10:30:00Z"
104
+ }
105
+ ]
106
+ }
107
+ ]
108
+ }
109
+ ```
110
+
111
+ The evaluator extracts tool calls from `output_messages[].tool_calls[]`. Optional fields `id` and `timestamp` can be included for debugging.
112
+
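
Extracting the flat tool-name sequence from one JSONL result record can be as simple as the following hypothetical helper (not part of agentv; shown to illustrate the `output_messages[].tool_calls[]` structure):

```python
import json

def tool_sequence(jsonl_line):
    """Flatten tool names from output_messages[].tool_calls[], in call order."""
    result = json.loads(jsonl_line)
    return [call["tool"]
            for msg in result.get("output_messages", [])
            for call in msg.get("tool_calls", [])]
```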
113
+ ### Supported Providers
114
+
115
+ - **codex** - Returns output_messages via JSONL log events
116
+ - **vscode / vscode-insiders** - Returns output_messages from Copilot execution
117
+ - **cli** - Returns `output_messages` with `tool_calls`
118
+
119
+ ## Complete Examples
120
+
121
+ ### Research Agent Validation
122
+
123
+ ```yaml
124
+ $schema: agentv-eval-v2
125
+ description: Validate research agent tool usage
126
+
127
+ execution:
128
+ target: codex_agent # Provider that returns traces
129
+
130
+ evalcases:
131
+ - id: comprehensive-research
132
+ expected_outcome: Agent thoroughly researches the topic
133
+
134
+ input_messages:
135
+ - role: user
136
+ content: Research machine learning frameworks
137
+
138
+ execution:
139
+ evaluators:
140
+ # Check minimum tool usage
141
+ - name: coverage
142
+ type: tool_trajectory
143
+ mode: any_order
144
+ minimums:
145
+ webSearch: 1
146
+ documentRead: 2
147
+ noteTaking: 1
148
+
149
+ # Check workflow order
150
+ - name: workflow
151
+ type: tool_trajectory
152
+ mode: in_order
153
+ expected:
154
+ - tool: webSearch
155
+ - tool: documentRead
156
+ - tool: summarize
157
+ ```
158
+
159
+ ### Multi-Step Pipeline
160
+
161
+ ```yaml
162
+ evalcases:
163
+ - id: data-pipeline
164
+ expected_outcome: Process data through complete pipeline
165
+
166
+ input_messages:
167
+ - role: user
168
+ content: Process the customer dataset
169
+
170
+ execution:
171
+ evaluators:
172
+ - name: pipeline-check
173
+ type: tool_trajectory
174
+ mode: exact
175
+ expected:
176
+ - tool: loadData
177
+ - tool: validate
178
+ - tool: transform
179
+ - tool: export
180
+ ```
181
+
182
+ ## CLI Options for Traces
183
+
184
+ ```bash
185
+ # Write trace files to disk
186
+ agentv eval evals/test.yaml --dump-traces
187
+
188
+ # Include full trace in result output
189
+ agentv eval evals/test.yaml --include-trace
190
+ ```
191
+
192
+ ## Best Practices
193
+
194
+ 1. **Choose the right mode** - Use `any_order` for flexibility, `exact` for strict validation
195
+ 2. **Start with any_order** - Then tighten to `in_order` or `exact` as needed
196
+ 3. **Combine with other evaluators** - Use tool trajectory for execution, LLM judge for output quality
197
+ 4. **Test with --dump-traces** - Inspect actual traces to understand agent behavior
198
+ 5. **Use code evaluators for custom validation** - Write custom tool validation scripts with access to trace data