agentv 1.3.1 → 1.6.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +460 -441
- package/dist/{chunk-6R2YRXCQ.js → chunk-HU4B6ODF.js} +1859 -641
- package/dist/chunk-HU4B6ODF.js.map +1 -0
- package/dist/cli.js +1 -1
- package/dist/index.js +1 -1
- package/dist/templates/.agentv/.env.template +23 -23
- package/dist/templates/.agentv/config.yaml +15 -15
- package/dist/templates/.agentv/targets.yaml +71 -73
- package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +211 -211
- package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md +316 -288
- package/dist/templates/.claude/skills/agentv-eval-builder/references/compare-command.md +115 -0
- package/dist/templates/.claude/skills/agentv-eval-builder/references/composite-evaluator.md +215 -215
- package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md +241 -213
- package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md +333 -333
- package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md +137 -139
- package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md +224 -179
- package/dist/templates/.claude/skills/agentv-prompt-optimizer/SKILL.md +77 -77
- package/dist/templates/.github/prompts/agentv-eval-build.prompt.md +4 -4
- package/dist/templates/.github/prompts/agentv-optimize.prompt.md +3 -3
- package/package.json +1 -1
- package/dist/chunk-6R2YRXCQ.js.map +0 -1

package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md
@@ -1,211 +1,211 @@
---
name: agentv-eval-builder
description: Create and maintain AgentV YAML evaluation files for testing AI agent performance. Use this skill when creating new eval files, adding eval cases, or configuring custom evaluators (code validators or LLM judges) for agent testing workflows.
---

# AgentV Eval Builder

## Schema Reference
- Schema: `references/eval-schema.json` (JSON Schema for validation and tooling)
- Format: YAML with structured content arrays
- Examples: `references/example-evals.md`

## Feature Reference
- Rubrics: `references/rubric-evaluator.md` - Structured criteria-based evaluation
- Composite Evaluators: `references/composite-evaluator.md` - Combine multiple evaluators
- Tool Trajectory: `references/tool-trajectory-evaluator.md` - Validate agent tool usage
- Custom Evaluators: `references/custom-evaluators.md` - Code and LLM judge templates
- Batch CLI: `references/batch-cli-evaluator.md` - Evaluate batch runner output (JSONL)
- Compare: `references/compare-command.md` - Compare evaluation results between runs

## Structure Requirements
- Root level: `description` (optional), `execution` (with `target`), `evalcases` (required)
- Eval case fields: `id` (required), `expected_outcome` (required), `input_messages` (required)
- Optional fields: `expected_messages`, `conversation_id`, `rubrics`, `execution`
- `expected_messages` is optional: omit it for outcome-only evaluation, where the LLM judge scores against the `expected_outcome` criteria alone (see the minimal sketch after this list)
- Message fields: `role` (required), `content` (required)
- Message roles: `system`, `user`, `assistant`, `tool`
- Content types: `text` (inline), `file` (relative or absolute path)
- Attachments (type: `file`) should default to the `user` role
- File paths: relative (resolved from the eval file's directory) or absolute with a "/" prefix (resolved from the repo root)
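
For instance, a minimal outcome-only case needs just the three required fields (the `id` and content below are illustrative):

```yaml
evalcases:
  - id: outcome-only-example            # illustrative id
    expected_outcome: Politely declines to share credentials and explains why
    input_messages:
      - role: user
        content: What is the admin password?
    # No expected_messages: the judge scores against expected_outcome alone
```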

## Custom Evaluators

Configure multiple evaluators per eval case via the `execution.evaluators` array.

### Code Evaluators
Scripts that validate output programmatically:

```yaml
execution:
  evaluators:
    - name: json_format_validator
      type: code_judge
      script: uv run validate_output.py
      cwd: ../../evaluators/scripts
```

**Contract:**
- Input (stdin): JSON with `question`, `expected_outcome`, `reference_answer`, `candidate_answer`, `guideline_files` (file paths), `input_files` (file paths, excludes guidelines), `input_messages`
- Output (stdout): JSON with `score` (0.0-1.0), `hits`, `misses`, `reasoning`

**Template:** See `references/custom-evaluators.md` for the Python code evaluator template.
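
A minimal sketch of a script honoring that contract (illustrative only; the shipped template is the authoritative starting point):

```python
#!/usr/bin/env python3
"""Sketch of a code_judge script: checks that the candidate answer is valid JSON."""
import json
import sys

payload = json.load(sys.stdin)                  # contract input
candidate = payload.get("candidate_answer", "")

try:
    json.loads(candidate)
    result = {"score": 1.0, "hits": ["output parses as JSON"], "misses": [],
              "reasoning": "Candidate answer is well-formed JSON."}
except json.JSONDecodeError as err:
    result = {"score": 0.0, "hits": [], "misses": ["output parses as JSON"],
              "reasoning": f"JSON parse error: {err}"}

json.dump(result, sys.stdout)                   # contract output
```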

### LLM Judges
Language models evaluate response quality:

```yaml
execution:
  evaluators:
    - name: content_evaluator
      type: llm_judge
      prompt: /evaluators/prompts/correctness.md
      model: gpt-5-chat
```

### Tool Trajectory Evaluators
Validate agent tool-usage patterns (requires `output_messages` with `tool_calls` from the provider):

```yaml
execution:
  evaluators:
    - name: research_check
      type: tool_trajectory
      mode: any_order      # Options: any_order, in_order, exact
      minimums:            # For any_order mode
        knowledgeSearch: 2
      expected:            # For in_order/exact modes
        - tool: knowledgeSearch
        - tool: documentRetrieve
```

See `references/tool-trajectory-evaluator.md` for modes and configuration.

### Multiple Evaluators
Define multiple evaluators to run sequentially. The final score is a weighted average of all results.

```yaml
execution:
  evaluators:
    - name: format_check   # Runs first
      type: code_judge
      script: uv run validate_json.py
    - name: content_check  # Runs second
      type: llm_judge
```

### Rubric Evaluator
Inline rubrics for structured criteria-based evaluation:

```yaml
evalcases:
  - id: explanation-task
    expected_outcome: Clear explanation of quicksort
    input_messages:
      - role: user
        content: Explain quicksort
    rubrics:
      - Mentions divide-and-conquer approach
      - Explains the partition step
      - id: complexity
        description: States time complexity correctly
        weight: 2.0
        required: true
```

See `references/rubric-evaluator.md` for detailed rubric configuration.

### Composite Evaluator
Combine multiple evaluators with aggregation:

```yaml
execution:
  evaluators:
    - name: release_gate
      type: composite
      evaluators:
        - name: safety
          type: llm_judge
          prompt: ./prompts/safety.md
        - name: quality
          type: llm_judge
          prompt: ./prompts/quality.md
      aggregator:
        type: weighted_average
        weights:
          safety: 0.3
          quality: 0.7
```

See `references/composite-evaluator.md` for aggregation types and patterns.

### Batch CLI Evaluation
Evaluate external batch runners that process all evalcases in one invocation:

```yaml
description: Batch CLI evaluation
execution:
  target: batch_cli

evalcases:
  - id: case-001
    expected_outcome: Returns decision=CLEAR
    expected_messages:
      - role: assistant
        content:
          decision: CLEAR
    input_messages:
      - role: user
        content:
          row:
            id: case-001
            amount: 5000
    execution:
      evaluators:
        - name: decision-check
          type: code_judge
          script: bun run ./scripts/check-output.ts
          cwd: .
```

**Key pattern:**
- The batch runner reads the eval YAML via an `--eval` flag and writes JSONL keyed by `id` (see the sketch below)
- Each evalcase carries its own evaluator to validate its corresponding output row
- Use structured `expected_messages.content` for the expected output fields
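
For example, a runner for the eval above might emit one JSON object per evalcase, keyed by `id` (the row shape beyond `id` is up to your runner; this sketch mirrors the `expected_messages.content` fields):

```jsonl
{"id": "case-001", "decision": "CLEAR"}
{"id": "case-002", "decision": "REVIEW"}
```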

See `references/batch-cli-evaluator.md` for the full implementation guide.

## Example
````yaml
description: Example showing basic features and conversation threading
execution:
  target: default

evalcases:
  - id: code-review-basic
    expected_outcome: Assistant provides helpful code analysis

    input_messages:
      - role: system
        content: You are an expert code reviewer.
      - role: user
        content:
          - type: text
            value: |-
              Review this function:

              ```python
              def add(a, b):
                  return a + b
              ```
          - type: file
            value: /prompts/python.instructions.md

    expected_messages:
      - role: assistant
        content: |-
          The function is simple and correct. Suggestions:
          - Add type hints: `def add(a: int, b: int) -> int:`
          - Add docstring
          - Consider validation for edge cases
````
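
The description above mentions conversation threading, though the excerpt shows a single case. Assuming the optional `conversation_id` field is what groups cases into one thread (an assumption from the field list, not confirmed by this excerpt), a follow-up case appended under `evalcases` might look like:

```yaml
  - id: code-review-followup          # hypothetical second turn
    conversation_id: code-review-1    # assumed shared id linking the turns
    expected_outcome: Assistant refines its review based on the follow-up
    input_messages:
      - role: user
        content: Can you also suggest a docstring?
```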