agentv 1.2.0 → 1.5.0

# Custom Evaluators Guide

Guide for writing custom code evaluators and LLM judges for AgentV eval files.

## Code Evaluator Contract

Code evaluators receive input via stdin and write output to stdout, both as JSON.

### Input Format (via stdin)

```json
{
  "question": "string describing the task/question",
  "expected_outcome": "expected outcome description",
  "reference_answer": "gold standard answer (optional)",
  "candidate_answer": "generated code/text from the agent",
  "guideline_paths": ["path1", "path2"],
  "input_files": ["file1", "file2"],
  "input_messages": [{"role": "user", "content": "..."}],
  "output_messages": [{"role": "assistant", "content": "...", "tool_calls": [...]}]
}
```

The `output_messages` array contains the full agent execution trace with tool calls, enabling custom validation of agent behavior.

### Output Format (to stdout)

```json
{
  "score": 0.85,
  "hits": ["successful check 1", "successful check 2"],
  "misses": ["failed check 1"],
  "reasoning": "Brief explanation of the score"
}
```

**Field Requirements:**
- `score`: Float between 0.0 and 1.0 (required)
- `hits`: Array of strings describing what passed (optional but recommended)
- `misses`: Array of strings describing what failed (optional but recommended)
- `reasoning`: String explaining the score (optional but recommended)

## Python Code Evaluator Template

```python
#!/usr/bin/env python3
"""
Example code evaluator for AgentV

This evaluator checks for specific keywords in the output.
Replace validation logic as needed.
"""

import json
import sys
from typing import Any


def evaluate(input_data: dict[str, Any]) -> dict[str, Any]:
    """
    Evaluate the agent output.

    Args:
        input_data: Full input context from AgentV

    Returns:
        Evaluation result with score, hits, misses, reasoning
    """
    # Extract only the fields you need
    # Most evaluators only need 'candidate_answer' - avoid using unnecessary fields
    candidate_answer = input_data.get("candidate_answer", "")

    # Your validation logic here
    hits = []
    misses = []

    # Example: Check for keywords
    required_keywords = ["async", "await"]
    for keyword in required_keywords:
        if keyword in candidate_answer:
            hits.append(f"Contains required keyword: {keyword}")
        else:
            misses.append(f"Missing required keyword: {keyword}")

    # Calculate score
    if not required_keywords:
        score = 1.0
    else:
        score = len(hits) / len(required_keywords)

    # Build result
    return {
        "score": score,
        "hits": hits,
        "misses": misses,
        "reasoning": f"Found {len(hits)}/{len(required_keywords)} required keywords"
    }


def main():
    """Main entry point for AgentV code evaluator."""
    try:
        # Read input from stdin
        input_data = json.loads(sys.stdin.read())

        # Run evaluation
        result = evaluate(input_data)

        # Write result to stdout
        print(json.dumps(result, indent=2))

    except Exception as e:
        # Error handling: return zero score with error message
        error_result = {
            "score": 0.0,
            "hits": [],
            "misses": [f"Evaluator error: {str(e)}"],
            "reasoning": f"Evaluator error: {str(e)}"
        }
        print(json.dumps(error_result, indent=2))
        sys.exit(1)


if __name__ == "__main__":
    main()
```

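The keyword check above only touches `candidate_answer`. To validate agent behavior instead, you can drop a different `evaluate` into the same skeleton and walk `output_messages`. A minimal sketch — the required tool name is hypothetical, and the nested `function`/`name` layout of each `tool_calls` entry is an assumption about your trace format, so inspect a real payload before depending on it:

```python
def evaluate(input_data: dict[str, Any]) -> dict[str, Any]:
    """Check that the agent called an expected tool at least once."""
    required_tool = "search_docs"  # hypothetical tool name, for illustration only

    # Collect tool names from the agent trace; the tool_calls shape is assumed here
    called: set[str] = set()
    for message in input_data.get("output_messages", []):
        for call in message.get("tool_calls") or []:
            name = call.get("function", {}).get("name") or call.get("name")
            if name:
                called.add(name)

    if required_tool in called:
        hits = [f"Agent called required tool: {required_tool}"]
        misses = []
    else:
        hits = []
        misses = [f"Agent never called required tool: {required_tool}"]

    return {
        "score": 1.0 if not misses else 0.0,
        "hits": hits,
        "misses": misses,
        "reasoning": f"Tools observed in trace: {sorted(called) if called else 'none'}"
    }
```

Scoring is binary here; a fractional score over a list of required tools works just as well.
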
## LLM Judge Prompt Template

LLM judges use markdown prompts to guide evaluation. AgentV automatically handles the output format, so focus your prompt on evaluation criteria and guidelines.

**Available Template Variables:**
- `{{question}}` - The original question/task
- `{{expected_outcome}}` - What the answer should accomplish
- `{{candidate_answer}}` - The actual output to evaluate
- `{{reference_answer}}` - Gold standard answer (optional, may be empty)
- `{{input_messages}}` - JSON stringified input message segments
- `{{output_messages}}` - JSON stringified expected output segments

**Default Evaluator Template:**

If you don't specify a custom evaluator template, AgentV uses this default:

```
You are an expert evaluator. Your goal is to grade the candidate_answer based on how well it achieves the expected_outcome for the original task.

Use the reference_answer as a gold standard for a high-quality response (if provided). The candidate_answer does not need to match it verbatim, but should capture the key points and follow the same spirit.

Be concise and focused in your evaluation. Provide succinct, specific feedback rather than verbose explanations.

[[ ## expected_outcome ## ]]
{{expected_outcome}}

[[ ## question ## ]]
{{question}}

[[ ## reference_answer ## ]]
{{reference_answer}}

[[ ## candidate_answer ## ]]
{{candidate_answer}}
```

You can customize this template in your eval file using the `evaluatorTemplate` field to add domain-specific criteria or scoring guidelines.

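For example, a domain-specific rubric can be layered onto the same structure. The sketch below is illustrative (the domain and rubric are made up); it keeps the `{{...}}` variables from the list above, and the `[[ ## ... ## ]]` section markers simply follow the default template's layout:

```
You are an expert evaluator grading answers about database migrations.

Score with this rubric:
- 1.0: the candidate_answer fully achieves the expected_outcome and is technically correct
- 0.5: it partially achieves the expected_outcome or contains minor inaccuracies
- 0.0: it misses the expected_outcome or is misleading

Be concise and cite specific shortcomings.

[[ ## expected_outcome ## ]]
{{expected_outcome}}

[[ ## question ## ]]
{{question}}

[[ ## reference_answer ## ]]
{{reference_answer}}

[[ ## candidate_answer ## ]]
{{candidate_answer}}
```
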
## Best Practices

### For Code-based Evaluators

1. **Focus on relevant fields** - Most evaluators only need the `candidate_answer` field
2. **Avoid false positives** - Don't check fields like `question` or `reference_answer` unless you specifically need context
3. **Be deterministic** - Same input should always produce same output
4. **Handle errors gracefully** - Return a valid result even when evaluation fails
5. **Provide helpful feedback** - Use `hits` and `misses` to explain the score

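To make practices 1 and 3 concrete: keep each check a small pure function over just the fields you read, and normalize text before comparing so the result never depends on incidental formatting. An illustrative sketch:

```python
def keyword_hits(candidate_answer: str, required: list[str]) -> list[str]:
    """Pure function: the same candidate_answer and keyword list always give the same hits."""
    normalized = " ".join(candidate_answer.lower().split())
    return [kw for kw in required if kw.lower() in normalized]
```
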
### For Prompt-based Evaluators (LLM Judges)

1. **Clear criteria** - Define what you're evaluating
2. **Specific guidelines** - Provide scoring rubrics
3. **JSON output** - Enforce structured output format
4. **Examples** - Show what good/bad looks like
5. **Concise prompts** - Keep instructions focused

## Running Code Evaluators

### In Eval Files

```yaml
execution:
  evaluators:
    - name: my_validator
      type: code
      script: uv run my_validator.py
      cwd: ./evaluators
```

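The `evaluators` key takes a list, which suggests several checks can run side by side for one eval. A sketch pairing the keyword validator with the trace-checking evaluator sketched earlier (the second name and script path are hypothetical):

```yaml
execution:
  evaluators:
    - name: my_validator
      type: code
      script: uv run my_validator.py
      cwd: ./evaluators
    - name: tool_call_validator
      type: code
      script: uv run tool_call_validator.py
      cwd: ./evaluators
```
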
### Command Line Testing

Test your evaluator locally:

```bash
# Create test input
echo '{
  "candidate_answer": "test output here",
  "question": "test task",
  "expected_outcome": "expected result"
}' | uv run my_validator.py

# Should output:
# {
#   "score": 0.8,
#   "hits": ["check 1 passed"],
#   "misses": ["check 2 failed"],
#   "reasoning": "..."
# }
```

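For repeated runs, keeping the sample payload in a file is less fiddly than an inline `echo`; `fixture.json` here is just whatever test input you save next to the evaluator:

```bash
# Reuse a saved test payload
uv run my_validator.py < fixture.json
```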