agentv 1.3.1 → 1.6.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,211 +1,211 @@
- ---
- name: agentv-eval-builder
- description: Create and maintain AgentV YAML evaluation files for testing AI agent performance. Use this skill when creating new eval files, adding eval cases, or configuring custom evaluators (code validators or LLM judges) for agent testing workflows.
- ---
-
- # AgentV Eval Builder
-
- ## Schema Reference
- - Schema: `references/eval-schema.json` (JSON Schema for validation and tooling)
- - Format: YAML with structured content arrays
- - Examples: `references/example-evals.md`
-
- ## Feature Reference
- - Rubrics: `references/rubric-evaluator.md` - Structured criteria-based evaluation
- - Composite Evaluators: `references/composite-evaluator.md` - Combine multiple evaluators
- - Tool Trajectory: `references/tool-trajectory-evaluator.md` - Validate agent tool usage
- - Custom Evaluators: `references/custom-evaluators.md` - Code and LLM judge templates
- - Batch CLI: `references/batch-cli-evaluator.md` - Evaluate batch runner output (JSONL)
-
- ## Structure Requirements
- - Root level: `description` (optional), `target` (optional), `execution` (optional), `evalcases` (required)
- - Eval case fields: `id` (required), `expected_outcome` (required), `input_messages` (required)
- - Optional fields: `expected_messages`, `conversation_id`, `rubrics`, `execution`
- - `expected_messages` is optional - omit for outcome-only evaluation where the LLM judge evaluates based on `expected_outcome` criteria alone
- - Message fields: `role` (required), `content` (required)
- - Message roles: `system`, `user`, `assistant`, `tool`
- - Content types: `text` (inline), `file` (relative or absolute path)
- - Attachments (type: `file`) should default to the `user` role
- - File paths: Relative (from eval file dir) or absolute with "/" prefix (from repo root)
-
- ## Custom Evaluators
-
- Configure multiple evaluators per eval case via `execution.evaluators` array.
-
- ### Code Evaluators
- Scripts that validate output programmatically:
-
- ```yaml
- execution:
-   evaluators:
-     - name: json_format_validator
-       type: code_judge
-       script: uv run validate_output.py
-       cwd: ../../evaluators/scripts
- ```
-
- **Contract:**
- - Input (stdin): JSON with `question`, `expected_outcome`, `reference_answer`, `candidate_answer`, `guideline_files` (file paths), `input_files` (file paths, excludes guidelines), `input_messages`
- - Output (stdout): JSON with `score` (0.0-1.0), `hits`, `misses`, `reasoning`
-
- **Template:** See `references/custom-evaluators.md` for Python code evaluator template
-
- ### LLM Judges
- Language models evaluate response quality:
-
- ```yaml
- execution:
-   evaluators:
-     - name: content_evaluator
-       type: llm_judge
-       prompt: /evaluators/prompts/correctness.md
-       model: gpt-5-chat
- ```
-
- ### Tool Trajectory Evaluators
- Validate agent tool usage patterns (requires trace data from provider):
-
- ```yaml
- execution:
-   evaluators:
-     - name: research_check
-       type: tool_trajectory
-       mode: any_order # Options: any_order, in_order, exact
-       minimums: # For any_order mode
-         knowledgeSearch: 2
-       expected: # For in_order/exact modes
-         - tool: knowledgeSearch
-         - tool: documentRetrieve
- ```
-
- See `references/tool-trajectory-evaluator.md` for modes and configuration.
-
- ### Multiple Evaluators
- Define multiple evaluators to run sequentially. The final score is a weighted average of all results.
-
- ```yaml
- execution:
-   evaluators:
-     - name: format_check # Runs first
-       type: code_judge
-       script: uv run validate_json.py
-     - name: content_check # Runs second
-       type: llm_judge
- ```
-
- ### Rubric Evaluator
- Inline rubrics for structured criteria-based evaluation:
-
- ```yaml
- evalcases:
-   - id: explanation-task
-     expected_outcome: Clear explanation of quicksort
-     input_messages:
-       - role: user
-         content: Explain quicksort
-     rubrics:
-       - Mentions divide-and-conquer approach
-       - Explains the partition step
-       - id: complexity
-         description: States time complexity correctly
-         weight: 2.0
-         required: true
- ```
-
- See `references/rubric-evaluator.md` for detailed rubric configuration.
-
- ### Composite Evaluator
- Combine multiple evaluators with aggregation:
-
- ```yaml
- execution:
-   evaluators:
-     - name: release_gate
-       type: composite
-       evaluators:
-         - name: safety
-           type: llm_judge
-           prompt: ./prompts/safety.md
-         - name: quality
-           type: llm_judge
-           prompt: ./prompts/quality.md
-       aggregator:
-         type: weighted_average
-         weights:
-           safety: 0.3
-           quality: 0.7
- ```
-
- See `references/composite-evaluator.md` for aggregation types and patterns.
-
- ### Batch CLI Evaluation
- Evaluate external batch runners that process all evalcases in one invocation:
-
- ```yaml
- $schema: agentv-eval-v2
- description: Batch CLI evaluation
- target: batch_cli
-
- evalcases:
-   - id: case-001
-     expected_outcome: Returns decision=CLEAR
-     expected_messages:
-       - role: assistant
-         content:
-           decision: CLEAR
-     input_messages:
-       - role: user
-         content:
-           row:
-             id: case-001
-             amount: 5000
-     execution:
-       evaluators:
-         - name: decision-check
-           type: code_judge
-           script: bun run ./scripts/check-output.ts
-           cwd: .
- ```
-
- **Key pattern:**
- - Batch runner reads eval YAML via `--eval` flag, outputs JSONL keyed by `id`
- - Each evalcase has its own evaluator to validate its corresponding output
- - Use structured `expected_messages.content` for expected output fields
-
- See `references/batch-cli-evaluator.md` for full implementation guide.
-
- ## Example
- ```yaml
- $schema: agentv-eval-v2
- description: Example showing basic features and conversation threading
- execution:
-   target: default
-
- evalcases:
-   - id: code-review-basic
-     expected_outcome: Assistant provides helpful code analysis
-
-     input_messages:
-       - role: system
-         content: You are an expert code reviewer.
-       - role: user
-         content:
-           - type: text
-             value: |-
-               Review this function:
-
-               ```python
-               def add(a, b):
-                   return a + b
-               ```
-           - type: file
-             value: /prompts/python.instructions.md
-
-     expected_messages:
-       - role: assistant
-         content: |-
-           The function is simple and correct. Suggestions:
-           - Add type hints: `def add(a: int, b: int) -> int:`
-           - Add docstring
-           - Consider validation for edge cases
- ```
+ ---
+ name: agentv-eval-builder
+ description: Create and maintain AgentV YAML evaluation files for testing AI agent performance. Use this skill when creating new eval files, adding eval cases, or configuring custom evaluators (code validators or LLM judges) for agent testing workflows.
+ ---
+
+ # AgentV Eval Builder
+
+ ## Schema Reference
+ - Schema: `references/eval-schema.json` (JSON Schema for validation and tooling)
+ - Format: YAML with structured content arrays
+ - Examples: `references/example-evals.md`
+
+ ## Feature Reference
+ - Rubrics: `references/rubric-evaluator.md` - Structured criteria-based evaluation
+ - Composite Evaluators: `references/composite-evaluator.md` - Combine multiple evaluators
+ - Tool Trajectory: `references/tool-trajectory-evaluator.md` - Validate agent tool usage
+ - Custom Evaluators: `references/custom-evaluators.md` - Code and LLM judge templates
+ - Batch CLI: `references/batch-cli-evaluator.md` - Evaluate batch runner output (JSONL)
+ - Compare: `references/compare-command.md` - Compare evaluation results between runs
+
+ ## Structure Requirements
+ - Root level: `description` (optional), `execution` (with `target`), `evalcases` (required)
+ - Eval case fields: `id` (required), `expected_outcome` (required), `input_messages` (required)
+ - Optional fields: `expected_messages`, `conversation_id`, `rubrics`, `execution`
+ - `expected_messages` is optional - omit it for outcome-only evaluation, where the LLM judge scores against the `expected_outcome` criteria alone
+ - Message fields: `role` (required), `content` (required)
+ - Message roles: `system`, `user`, `assistant`, `tool`
+ - Content types: `text` (inline), `file` (relative or absolute path)
+ - Attachments (type: `file`) should default to the `user` role
+ - File paths: relative (resolved from the eval file's directory) or absolute with a "/" prefix (resolved from the repo root)
+
+ ## Custom Evaluators
+
+ Configure multiple evaluators per eval case via the `execution.evaluators` array.
+
+ ### Code Evaluators
+ Scripts that validate output programmatically:
+
+ ```yaml
+ execution:
+   evaluators:
+     - name: json_format_validator
+       type: code_judge
+       script: uv run validate_output.py
+       cwd: ../../evaluators/scripts
+ ```
+
+ **Contract:**
+ - Input (stdin): JSON with `question`, `expected_outcome`, `reference_answer`, `candidate_answer`, `guideline_files` (file paths), `input_files` (file paths, excludes guidelines), `input_messages`
+ - Output (stdout): JSON with `score` (0.0-1.0), `hits`, `misses`, `reasoning`
+
+ **Template:** See `references/custom-evaluators.md` for a Python code evaluator template; a minimal sketch follows.
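
To make the contract concrete, here is a minimal sketch of a `code_judge` script in Python. It follows the stdin/stdout contract above; the specific check (requiring `candidate_answer` to parse as JSON) is only an illustrative choice, not something the contract prescribes.

```python
#!/usr/bin/env python3
"""Minimal code_judge sketch: read the contract's JSON from stdin,
write a score payload to stdout. The JSON-parsing check is illustrative."""
import json
import sys

payload = json.load(sys.stdin)  # question, expected_outcome, candidate_answer, ...
candidate = payload.get("candidate_answer", "")

hits, misses = [], []
try:
    json.loads(candidate)  # example check: is the answer well-formed JSON?
    hits.append("candidate_answer parses as JSON")
except json.JSONDecodeError as exc:
    misses.append(f"candidate_answer is not valid JSON: {exc}")

json.dump(
    {
        "score": 1.0 if not misses else 0.0,
        "hits": hits,
        "misses": misses,
        "reasoning": "Checked that the candidate answer is well-formed JSON.",
    },
    sys.stdout,
)
```

The runtime named in `script` is up to you; `uv run`, `bun run`, or a plain interpreter call only changes how the process is launched.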
+
+ ### LLM Judges
+ Language models evaluate response quality:
+
+ ```yaml
+ execution:
+   evaluators:
+     - name: content_evaluator
+       type: llm_judge
+       prompt: /evaluators/prompts/correctness.md
+       model: gpt-5-chat
+ ```
+
+ ### Tool Trajectory Evaluators
+ Validate agent tool usage patterns (requires `output_messages` with `tool_calls` from the provider):
+
+ ```yaml
+ execution:
+   evaluators:
+     - name: research_check
+       type: tool_trajectory
+       mode: any_order # Options: any_order, in_order, exact
+       minimums: # For any_order mode
+         knowledgeSearch: 2
+       expected: # For in_order/exact modes
+         - tool: knowledgeSearch
+         - tool: documentRetrieve
+ ```
+
+ See `references/tool-trajectory-evaluator.md` for modes and configuration; a rough sketch of the matching logic follows.
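
For intuition, the three modes can be read as checks over the ordered list of tool names extracted from the trace. The sketch below assumes that reduction; the evaluator's actual semantics are defined in the reference file, not here.

```python
from collections import Counter

def check_trajectory(calls, mode, minimums=None, expected=None):
    """Sketch: `calls` is the ordered list of tool names observed in the trace."""
    if mode == "any_order":
        # Every configured tool must appear at least the given number of times.
        counts = Counter(calls)
        return all(counts[tool] >= n for tool, n in (minimums or {}).items())
    names = [step["tool"] for step in (expected or [])]
    if mode == "exact":
        return calls == names  # same tools, same order, nothing extra
    if mode == "in_order":
        it = iter(calls)  # expected tools must appear as a subsequence
        return all(name in it for name in names)
    raise ValueError(f"unknown mode: {mode}")
```

With `calls = ["knowledgeSearch", "documentRetrieve", "knowledgeSearch"]`, the `any_order` config above passes (two `knowledgeSearch` calls), `in_order` passes as well, and `exact` fails because of the extra call.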
+
+ ### Multiple Evaluators
+ Define multiple evaluators to run sequentially. The final score is a weighted average of all results; see the sketch after this example.
+
+ ```yaml
+ execution:
+   evaluators:
+     - name: format_check # Runs first
+       type: code_judge
+       script: uv run validate_json.py
+     - name: content_check # Runs second
+       type: llm_judge
+ ```
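
The combined score is just a weighted mean of the per-evaluator scores. A small sketch of the arithmetic, assuming a weight of 1.0 whenever an evaluator declares none (check `references/eval-schema.json` for the actual default):

```python
def combine(results):
    # results: [{"name": ..., "score": 0.0-1.0, "weight": optional}, ...]
    total = sum(r.get("weight", 1.0) for r in results)  # assumed default weight 1.0
    return sum(r["score"] * r.get("weight", 1.0) for r in results) / total

# format_check scores 1.0 and content_check scores 0.5 -> combined score 0.75
print(combine([{"name": "format_check", "score": 1.0},
               {"name": "content_check", "score": 0.5}]))
```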
+
+ ### Rubric Evaluator
+ Inline rubrics for structured criteria-based evaluation:
+
+ ```yaml
+ evalcases:
+   - id: explanation-task
+     expected_outcome: Clear explanation of quicksort
+     input_messages:
+       - role: user
+         content: Explain quicksort
+     rubrics:
+       - Mentions divide-and-conquer approach
+       - Explains the partition step
+       - id: complexity
+         description: States time complexity correctly
+         weight: 2.0
+         required: true
+ ```
+
+ See `references/rubric-evaluator.md` for detailed rubric configuration.
+
+ ### Composite Evaluator
+ Combine multiple evaluators with aggregation:
+
+ ```yaml
+ execution:
+   evaluators:
+     - name: release_gate
+       type: composite
+       evaluators:
+         - name: safety
+           type: llm_judge
+           prompt: ./prompts/safety.md
+         - name: quality
+           type: llm_judge
+           prompt: ./prompts/quality.md
+       aggregator:
+         type: weighted_average
+         weights:
+           safety: 0.3
+           quality: 0.7
+ ```
+
+ See `references/composite-evaluator.md` for aggregation types and patterns.
+
+ ### Batch CLI Evaluation
+ Evaluate external batch runners that process all evalcases in one invocation:
+
+ ```yaml
+ description: Batch CLI evaluation
+ execution:
+   target: batch_cli
+
+ evalcases:
+   - id: case-001
+     expected_outcome: Returns decision=CLEAR
+     expected_messages:
+       - role: assistant
+         content:
+           decision: CLEAR
+     input_messages:
+       - role: user
+         content:
+           row:
+             id: case-001
+             amount: 5000
+     execution:
+       evaluators:
+         - name: decision-check
+           type: code_judge
+           script: bun run ./scripts/check-output.ts
+           cwd: .
+ ```
+
+ **Key pattern:**
+ - Batch runner reads the eval YAML via the `--eval` flag and outputs JSONL keyed by `id` (sketched below)
+ - Each evalcase has its own evaluator to validate its corresponding output
+ - Use structured `expected_messages.content` for expected output fields
+
+ See `references/batch-cli-evaluator.md` for the full implementation guide.
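
A batch runner compatible with this pattern can be very small. The sketch below is hypothetical: the `--eval` flag and the JSONL-keyed-by-`id` output come from the key pattern above, while the decision rule and the `row` access path simply mirror the example evalcase.

```python
#!/usr/bin/env python3
"""Hypothetical batch runner: read the eval YAML named by --eval and emit
one JSON line per evalcase, keyed by id. The decision rule is illustrative."""
import argparse
import json

import yaml  # PyYAML

parser = argparse.ArgumentParser()
parser.add_argument("--eval", dest="eval_path", required=True)
args = parser.parse_args()

with open(args.eval_path) as fh:
    doc = yaml.safe_load(fh)

for case in doc["evalcases"]:
    row = case["input_messages"][0]["content"]["row"]  # shape from the example above
    decision = "CLEAR" if row.get("amount", 0) <= 5000 else "REVIEW"  # illustrative
    print(json.dumps({"id": case["id"], "decision": decision}))
```

Each evalcase's `code_judge` script then looks up its own `id` in that JSONL stream to score the corresponding output.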
+
+ ## Example
+ ```yaml
+ description: Example showing basic features and conversation threading
+ execution:
+   target: default
+
+ evalcases:
+   - id: code-review-basic
+     expected_outcome: Assistant provides helpful code analysis
+
+     input_messages:
+       - role: system
+         content: You are an expert code reviewer.
+       - role: user
+         content:
+           - type: text
+             value: |-
+               Review this function:
+
+               ```python
+               def add(a, b):
+                   return a + b
+               ```
+           - type: file
+             value: /prompts/python.instructions.md
+
+     expected_messages:
+       - role: assistant
+         content: |-
+           The function is simple and correct. Suggestions:
+           - Add type hints: `def add(a: int, b: int) -> int:`
+           - Add docstring
+           - Consider validation for edge cases
+ ```