agentv 1.2.0 → 1.5.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +439 -441
- package/dist/{chunk-IVIT4U6S.js → chunk-3RYQPI4H.js} +709 -465
- package/dist/chunk-3RYQPI4H.js.map +1 -0
- package/dist/cli.js +1 -1
- package/dist/index.js +1 -1
- package/dist/templates/.agentv/.env.template +23 -23
- package/dist/templates/.agentv/config.yaml +15 -15
- package/dist/templates/.agentv/targets.yaml +71 -73
- package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +212 -174
- package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md +318 -0
- package/dist/templates/.claude/skills/agentv-eval-builder/references/composite-evaluator.md +215 -215
- package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md +216 -213
- package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md +340 -247
- package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md +139 -139
- package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md +198 -179
- package/dist/templates/.claude/skills/agentv-prompt-optimizer/SKILL.md +77 -77
- package/dist/templates/.github/prompts/agentv-eval-build.prompt.md +4 -4
- package/dist/templates/.github/prompts/agentv-optimize.prompt.md +3 -3
- package/package.json +3 -6
- package/dist/chunk-IVIT4U6S.js.map +0 -1
@@ -1,213 +1,216 @@
# Custom Evaluators Guide

Guide for writing custom code evaluators and LLM judges for AgentV eval files.

## Code Evaluator Contract

Code evaluators receive input via stdin and write output to stdout, both as JSON.

### Input Format (via stdin)

```json
{
  "question": "string describing the task/question",
  "expected_outcome": "expected outcome description",
  "reference_answer": "gold standard answer (optional)",
  "candidate_answer": "generated code/text from the agent",
  "guideline_paths": ["path1", "path2"],
  "input_files": ["file1", "file2"],
  "input_messages": [{"role": "user", "content": "..."}],
  "output_messages": [{"role": "assistant", "content": "...", "tool_calls": [...]}]
}
```

The `output_messages` array contains the full agent execution trace with tool calls, enabling custom validation of agent behavior.
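For example, a minimal sketch of walking the trace to collect tool-call names. The exact shape of each `tool_calls` entry is an assumption based on the schema above, and the `read_file` tool name is purely illustrative:

```python
# Sketch: collect tool-call names from an output_messages trace.
# The tool_calls entry shape ({"name": ..., "arguments": ...}) is assumed;
# the "read_file" tool name is hypothetical, for illustration only.

def tool_call_names(output_messages):
    """Return the name of every tool call in the agent trace, in order."""
    names = []
    for message in output_messages:
        for call in message.get("tool_calls") or []:
            names.append(call.get("name", ""))
    return names


trace = [
    {
        "role": "assistant",
        "content": "Let me inspect the file first.",
        "tool_calls": [{"name": "read_file", "arguments": {"path": "app.py"}}],
    },
    {"role": "assistant", "content": "Done."},
]

print(tool_call_names(trace))  # ['read_file']
```

A check like `"read_file" in tool_call_names(...)` can then feed directly into an evaluator's `hits`/`misses`.
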
### Output Format (to stdout)

```json
{
  "score": 0.85,
  "hits": ["successful check 1", "successful check 2"],
  "misses": ["failed check 1"],
  "reasoning": "Brief explanation of the score"
}
```

**Field Requirements:**
- `score`: Float between 0.0 and 1.0 (required)
- `hits`: Array of strings describing what passed (optional but recommended)
- `misses`: Array of strings describing what failed (optional but recommended)
- `reasoning`: String explaining the score (optional but recommended)
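Since each field has a simple type contract, a quick sanity check for your own evaluator's output can be sketched as a small helper (useful in your own unit tests; not part of AgentV):

```python
# Sketch: validate an evaluator result dict against the output contract above.

def validate_result(result):
    """Return True if the dict satisfies the evaluator output contract."""
    score = result.get("score")
    if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
        return False
    for key in ("hits", "misses"):
        value = result.get(key, [])
        if not (isinstance(value, list) and all(isinstance(s, str) for s in value)):
            return False
    return isinstance(result.get("reasoning", ""), str)


print(validate_result({"score": 0.85, "hits": ["ok"], "misses": [], "reasoning": "fine"}))  # True
print(validate_result({"score": 1.5}))  # False (score out of range)
```
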
## Python Code Evaluator Template

```python
#!/usr/bin/env python3
"""
Example code evaluator for AgentV

This evaluator checks for specific keywords in the output.
Replace validation logic as needed.
"""

import json
import sys
from typing import Any


def evaluate(input_data: dict[str, Any]) -> dict[str, Any]:
    """
    Evaluate the agent output.

    Args:
        input_data: Full input context from AgentV

    Returns:
        Evaluation result with score, hits, misses, reasoning
    """
    # Extract only the fields you need
    # Most evaluators only need 'candidate_answer' - avoid using unnecessary fields
    candidate_answer = input_data.get("candidate_answer", "")

    # Your validation logic here
    hits = []
    misses = []

    # Example: Check for keywords
    required_keywords = ["async", "await"]
    for keyword in required_keywords:
        if keyword in candidate_answer:
            hits.append(f"Contains required keyword: {keyword}")
        else:
            misses.append(f"Missing required keyword: {keyword}")

    # Calculate score
    if not required_keywords:
        score = 1.0
    else:
        score = len(hits) / len(required_keywords)

    # Build result
    return {
        "score": score,
        "hits": hits,
        "misses": misses,
        "reasoning": f"Found {len(hits)}/{len(required_keywords)} required keywords"
    }


def main():
    """Main entry point for AgentV code evaluator."""
    try:
        # Read input from stdin
        input_data = json.loads(sys.stdin.read())

        # Run evaluation
        result = evaluate(input_data)

        # Write result to stdout
        print(json.dumps(result, indent=2))

    except Exception as e:
        # Error handling: return zero score with error message
        error_result = {
            "score": 0.0,
            "hits": [],
            "misses": [f"Evaluator error: {str(e)}"],
            "reasoning": f"Evaluator error: {str(e)}"
        }
        print(json.dumps(error_result, indent=2))
        sys.exit(1)


if __name__ == "__main__":
    main()
```

## LLM Judge Prompt Template

LLM judges use markdown prompts to guide evaluation. AgentV automatically handles the output format, so focus your prompt on evaluation criteria and guidelines.

**Available Template Variables:**
- `{{question}}` - The original question/task
- `{{expected_outcome}}` - What the answer should accomplish
- `{{candidate_answer}}` - The actual output to evaluate
- `{{reference_answer}}` - Gold standard answer (optional, may be empty)
- `{{input_messages}}` - JSON stringified input message segments
- `{{output_messages}}` - JSON stringified expected output segments

**Default Evaluator Template:**

If you don't specify a custom evaluator template, AgentV uses this default:

```
You are an expert evaluator. Your goal is to grade the candidate_answer based on how well it achieves the expected_outcome for the original task.

Use the reference_answer as a gold standard for a high-quality response (if provided). The candidate_answer does not need to match it verbatim, but should capture the key points and follow the same spirit.

Be concise and focused in your evaluation. Provide succinct, specific feedback rather than verbose explanations.

[[ ## expected_outcome ## ]]
{{expected_outcome}}

[[ ## question ## ]]
{{question}}

[[ ## reference_answer ## ]]
{{reference_answer}}

[[ ## candidate_answer ## ]]
{{candidate_answer}}
```

You can customize this template in your eval file using the `evaluatorTemplate` field to add domain-specific criteria or scoring guidelines.
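A hedged sketch of what that customization might look like. The `evaluatorTemplate` field name comes from the text above, but the surrounding evaluator entry shape and the `llm_judge` type name are assumptions for illustration, as is the PEP 8 criterion:

```yaml
# Sketch only: entry shape and "llm_judge" type are assumed, not documented here.
execution:
  evaluators:
    - name: style_judge
      type: llm_judge          # assumed type name, for illustration
      evaluatorTemplate: |
        You are an expert evaluator. Grade the candidate_answer against the
        expected_outcome, and additionally check that any code follows PEP 8.

        [[ ## expected_outcome ## ]]
        {{expected_outcome}}

        [[ ## candidate_answer ## ]]
        {{candidate_answer}}
```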
## Best Practices

### For Code-based Evaluators

1. **Focus on relevant fields** - Most evaluators only need the `candidate_answer` field
2. **Avoid false positives** - Don't check fields like `question` or `reference_answer` unless you specifically need context
3. **Be deterministic** - Same input should always produce same output
4. **Handle errors gracefully** - Return a valid result even when evaluation fails
5. **Provide helpful feedback** - Use `hits` and `misses` to explain the score

### For Prompt-based Evaluators (LLM Judges)

1. **Clear criteria** - Define what you're evaluating
2. **Specific guidelines** - Provide scoring rubrics
3. **JSON output** - Enforce structured output format
4. **Examples** - Show what good/bad looks like
5. **Concise prompts** - Keep instructions focused

## Running Code Evaluators

### In Eval Files

```yaml
execution:
  evaluators:
    - name: my_validator
      type: code
      script: uv run my_validator.py
      cwd: ./evaluators
```

### Command Line Testing

Test your evaluator locally:

```bash
# Create test input
echo '{
  "candidate_answer": "test output here",
  "question": "test task",
  "expected_outcome": "expected result"
}' | uv run my_validator.py

# Should output:
# {
#   "score": 0.8,
#   "hits": ["check 1 passed"],
#   "misses": ["check 2 failed"],
#   "reasoning": "..."
# }
```