agentv 0.10.1 → 0.13.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dist/{chunk-72BHGHIT.js → chunk-WMO5PVPX.js} +806 -663
- package/dist/chunk-WMO5PVPX.js.map +1 -0
- package/dist/cli.js +1 -1
- package/dist/index.js +1 -1
- package/dist/templates/.agentv/.env.template +23 -0
- package/dist/templates/{github/prompts/eval-build.prompt.md → .claude/skills/agentv-eval-builder/SKILL.md} +57 -6
- package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md +399 -0
- package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md +317 -0
- package/dist/templates/.claude/skills/agentv-prompt-optimizer/SKILL.md +70 -0
- package/dist/templates/.github/prompts/agentv-eval-build.prompt.md +5 -0
- package/dist/templates/.github/prompts/agentv-optimize.prompt.md +4 -0
- package/package.json +2 -2
- package/dist/chunk-72BHGHIT.js.map +0 -1
- /package/dist/templates/{agentv → .agentv}/config.yaml +0 -0
- /package/dist/templates/{agentv → .agentv}/targets.yaml +0 -0
- /package/dist/templates/{github/contexts → .claude/skills/agentv-eval-builder/references}/config-schema.json +0 -0
- /package/dist/templates/{github/contexts → .claude/skills/agentv-eval-builder/references}/eval-schema.json +0 -0
package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md
ADDED

@@ -0,0 +1,317 @@

# Example Eval Files

This document contains complete examples of well-structured eval files demonstrating various AgentV patterns and best practices.

## Basic Example: Simple Q&A Eval

```yaml
$schema: agentv-eval-v2
description: Basic arithmetic evaluation
target: default

evalcases:
  - id: simple-addition
    outcome: Correctly calculates 2+2

    input_messages:
      - role: user
        content: What is 2 + 2?

    expected_messages:
      - role: assistant
        content: "4"
```

## Code Review with File References

```yaml
$schema: agentv-eval-v2
description: Code review with guidelines
target: azure_base

evalcases:
  - id: code-review-basic
    outcome: Assistant provides helpful code analysis with security considerations

    input_messages:
      - role: system
        content: You are an expert code reviewer.
      - role: user
        content:
          - type: text
            value: |-
              Review this function for security issues:

              ```python
              def get_user(user_id):
                  query = f"SELECT * FROM users WHERE id = {user_id}"
                  return db.execute(query)
              ```
          - type: file
            value: /prompts/security-guidelines.md

    expected_messages:
      - role: assistant
        content: |-
          This code has a critical SQL injection vulnerability. The user_id is directly
          interpolated into the query string without sanitization.

          Recommended fix:
          ```python
          def get_user(user_id):
              query = "SELECT * FROM users WHERE id = ?"
              return db.execute(query, (user_id,))
          ```
```

## Multi-Evaluator Configuration

```yaml
$schema: agentv-eval-v2
description: JSON generation with validation
target: default

evalcases:
  - id: json-generation-with-validation
    outcome: Generates valid JSON with required fields

    execution:
      evaluators:
        - name: json_format_validator
          type: code
          script: uv run validate_json.py
          cwd: ./evaluators
        - name: content_evaluator
          type: llm_judge
          prompt: ./judges/semantic_correctness.md
          model: gpt-5-chat

    input_messages:
      - role: user
        content: |-
          Generate a JSON object for a user with name "Alice",
          email "alice@example.com", and role "admin".

    expected_messages:
      - role: assistant
        content: |-
          {
            "name": "Alice",
            "email": "alice@example.com",
            "role": "admin"
          }
```
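
A `type: code` evaluator like `json_format_validator` might look like the following minimal sketch. This is an illustration only: the required field names and the stdin-to-JSON-verdict contract are assumptions for this example, not AgentV's actual code-evaluator interface (see references/custom-evaluators.md for the real contract).

```python
#!/usr/bin/env python3
"""Hypothetical sketch of validate_json.py.

The stdin/stdout contract and the {"score", "reason"} verdict shape are
assumptions for illustration; AgentV's real evaluator contract is
documented in references/custom-evaluators.md.
"""
import json
import sys

# Assumed required fields for this example's task.
REQUIRED_FIELDS = {"name", "email", "role"}


def evaluate(candidate_text: str) -> dict:
    """Return a score of 1.0 only if the candidate is a JSON object with all required fields."""
    try:
        data = json.loads(candidate_text)
    except json.JSONDecodeError as exc:
        return {"score": 0.0, "reason": f"Invalid JSON: {exc}"}
    if not isinstance(data, dict):
        return {"score": 0.0, "reason": "Top-level value is not a JSON object"}
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        return {"score": 0.0, "reason": f"Missing fields: {sorted(missing)}"}
    return {"score": 1.0, "reason": "Valid JSON with all required fields"}


if __name__ == "__main__":
    print(json.dumps(evaluate(sys.stdin.read())))
```

The deterministic format check runs alongside the `llm_judge` evaluator above, so structural validity and semantic correctness are scored independently.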

## Conversation Threading

```yaml
$schema: agentv-eval-v2
description: Multi-turn conversation for iterative code development
target: default

evalcases:
  - id: create-class
    conversation_id: python-dev-session
    outcome: Creates a basic User class

    input_messages:
      - role: user
        content: Create a Python User class with name and email attributes

    expected_messages:
      - role: assistant
        content: |-
          ```python
          class User:
              def __init__(self, name, email):
                  self.name = name
                  self.email = email
          ```

  - id: add-validation
    conversation_id: python-dev-session
    outcome: Adds email validation to the existing class

    input_messages:
      - role: user
        content: Add email validation to the User class

    expected_messages:
      - role: assistant
        content: |-
          ```python
          import re

          class User:
              def __init__(self, name, email):
                  self.name = name
                  self.email = self._validate_email(email)

              def _validate_email(self, email):
                  pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
                  if not re.match(pattern, email):
                      raise ValueError(f"Invalid email: {email}")
                  return email
          ```

  - id: add-repr
    conversation_id: python-dev-session
    outcome: Adds __repr__ method for better debugging

    input_messages:
      - role: user
        content: Add a __repr__ method to the class

    expected_messages:
      - role: assistant
        content: |-
          ```python
          def __repr__(self):
              return f"User(name='{self.name}', email='{self.email}')"
          ```
```

## Advanced: Per-Case Target Override

```yaml
$schema: agentv-eval-v2
description: Cross-platform testing with different targets
target: azure_base

evalcases:
  - id: azure-test
    outcome: Tests with Azure OpenAI (uses file-level default target)

    input_messages:
      - role: user
        content: Explain async/await in Python

    expected_messages:
      - role: assistant
        content: "async/await is Python's syntax for asynchronous programming..."

  - id: vscode-test
    outcome: Tests with VS Code Copilot

    execution:
      target: vscode_projectx

    input_messages:
      - role: user
        content:
          - type: text
            value: Refactor this code to use async/await
          - type: file
            value: /code-samples/sync_code.py

    expected_messages:
      - role: assistant
        content: "async def fetch_data():\n await ..."
```

## Complex Real-World Example: CargoWise Triage

```yaml
$schema: agentv-eval-v2
description: CargoWise criticality rating (CR1-CR9) classification eval
target: default

evalcases:
  - id: cr-global-outage
    conversation_id: cargowise-triage

    outcome: |
      Assistant correctly classifies as 'CR1' for complete system inaccessibility.
      Reasoning should emphasize 'any user on any workstation' and lack of access.

    execution:
      evaluators:
        - name: json_format_validator
          type: code
          script: uv run validate_output.py
        - name: content_evaluator
          type: llm_judge

    input_messages:
      - role: user
        content:
          - type: file
            value: ../../prompts/cw-criticality-rating.prompt.md
          - type: text
            value: |-
              Classify this CargoWise ticket:

              Ticket: Entire system down, no users can log in from any device
              Impact: All operations halted
              Scope: Global
              Signals: Error: "Server unreachable", no recent changes reported

    expected_messages:
      - role: assistant
        content: |-
          {
            "criticalityRating": "CR1",
            "reasoning": "Step 1: Issue prevents access to the entire application suite.
            Step 2: Affects any user on any workstation. Step 3: Matches CR1 definition
            exactly—no partial access or workarounds mentioned."
          }

  - id: cr-module-inaccessible
    conversation_id: cargowise-triage

    outcome: |
      Assistant correctly classifies as 'CR2' for module-wide inaccessibility.
      Reasoning should distinguish from CR1 by noting it's limited to one module.

    execution:
      evaluators:
        - name: json_format_validator
          type: code
          script: uv run validate_output.py
        - name: content_evaluator
          type: llm_judge

    input_messages:
      - role: user
        content:
          - type: file
            value: ../../prompts/cw-criticality-rating.prompt.md
          - type: text
            value: |-
              Classify this CargoWise ticket:

              Ticket: Customs module inaccessible for all users
              Impact: Customs declarations blocked
              Scope: Module-wide
              Signals: "Module not found" error, other modules working

    expected_messages:
      - role: assistant
        content: |-
          {
            "criticalityRating": "CR2",
            "reasoning": "Step 1: Issue affects an entire module (Customs).
            Step 2: No access for any user/workstation. Step 3: Does not impact
            the full suite, so CR2 over CR1."
          }
```

## Notes on Examples

### File Path Conventions
- **Absolute paths** (start with `/`): Resolved from repository root
  - Example: `/prompts/guidelines.md` → `<repo_root>/prompts/guidelines.md`
- **Relative paths** (start with `./` or `../`): Resolved from the eval file's directory
  - Example: `../../prompts/file.md` → Two directories up, then into prompts/
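
The two conventions above can be sketched as follows. This is a minimal illustration of the stated rules, not AgentV's actual resolver; the function name and signature are invented for the example.

```python
from pathlib import Path


def resolve_ref(value: str, repo_root: Path, eval_dir: Path) -> Path:
    """Illustrative sketch of the path conventions above (hypothetical helper)."""
    if value.startswith("/"):
        # Absolute paths are resolved from the repository root.
        return (repo_root / value.lstrip("/")).resolve()
    # Relative paths (./ or ../) are resolved from the eval file's directory.
    return (eval_dir / value).resolve()
```

For an eval file in `<repo_root>/evals/suite/`, `/prompts/guidelines.md` resolves to `<repo_root>/prompts/guidelines.md`, while `../../prompts/file.md` climbs two directories before descending into `prompts/`.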

### Outcome Writing Tips
- Be specific about what success looks like
- Mention key elements that must be present
- For classification tasks, specify the expected category
- For reasoning tasks, describe the expected thought process

### Expected Messages
- Show the pattern, not rigid templates
- Allow for natural language variation
- Focus on semantic correctness over exact matching
- Evaluators will handle the actual validation

package/dist/templates/.claude/skills/agentv-prompt-optimizer/SKILL.md
ADDED

@@ -0,0 +1,70 @@

---
description: Iteratively optimize a prompt file against an AgentV evaluation suite by analyzing failures and refining instructions.
---

# AgentV Prompt Optimizer

Iteratively optimize a prompt file against an AgentV evaluation suite.

## Usage
`prompt-optimizer <eval-path> [playbook-path]`

## Workflow

1. **Initialize**
   - Verify `<eval-path>` (file or glob) targets the correct system.
   - **Identify Prompt Files**:
     - Infer prompt files from the eval file content (look for `file:` references in `input_messages` that match these patterns).
     - Recursively check referenced prompt files for *other* prompt references (dependencies).
     - If multiple prompts are found, consider ALL of them as candidates for optimization.
   - **Identify Playbook**:
     - If `<playbook-path>` is provided, use it.
     - If not, create a new one in the parent directory of the eval files: `playbook-[timestamp].md`.
   - Read the content of the identified prompt file.

2. **Optimization Loop** (Max 5 iterations)
   - **Execute (The Generator)**: Run `pnpm agentv eval <eval-path>`.
   - **Analyze (The Reflector)**:
     - Locate the results file path from the console output (e.g., `.agentv/results/eval_...jsonl`).
     - Read the results file. Calculate the pass rate.
     - **Root Cause Analysis**: For each failure, perform a deep dive:
       - **Error Identification**: What exactly went wrong? (e.g., "Predicted 'High' but expected 'Low'")
       - **Root Cause**: Why did it happen? (e.g., "Ambiguous definition of 'High' severity", "Hallucinated a constraint", "Incorrect test expectation")
       - **Correct Approach**: What *should* the model have done?
       - **Key Insight**: What general rule or pattern can we learn from this?
     - **Regression Check**: Did this change break previously passing tests? If so, mark the previous change as "Harmful".
   - **Decide**:
     - If **100% pass**: STOP and report success.
     - If **score decreased**: Revert the last change and try a different approach.
     - If **no improvement** (2x): STOP and report stagnation.
   - **Log Result**:
     - Append the result of this iteration to the identified playbook file.
     - **Format**:
       ```markdown
       ### Iteration [N]
       - **Change**: [Description of edit]
       - **Rationale**: [Root Cause / Why this fix was chosen]
       - **Outcome**: [Success / Failure / Harmful] (Score: X% -> Y%)
       - **Insight**: [Key learning or pattern identified]
       ```
   - **Refine (The Curator)**:
     - Modify the relevant `<prompt-file>` (pick the one most likely to be the root cause) to address failures.
     - **Strategy**: Treat the prompt as a structured "Playbook".
       - **Clarify**: If ambiguous, make the existing instruction more specific.
       - **Add Rule**: If a constraint was missed, add a specific bullet point to the relevant section.
       - **Negative Constraint**: If hallucinating, explicitly state what NOT to do.
       - **Consolidate**: Check for redundant or overlapping instructions and merge them.
     - **Safety Check**: Ensure new rules don't contradict existing ones (unless intended).
     - **Constraint**: Avoid rewriting large sections. Make surgical, additive changes to preserve existing behavior.
     - **Apply**: Use `replace_string_in_file`.

3. **Completion**
   - Report the final score.
   - Summarize key changes made to the prompt.
   - **Finalize Playbook**: Add a summary header to the playbook file indicating session completion and the final score.

## Guidelines
- **Simplicity ("Less is More")**: Avoid adding specific rules for rare edge cases ("hotfixes"). Focus on universally applicable instructions.
- **Structure**: Maintain existing Markdown headers/sections.
- **Progressive Disclosure**: If the prompt grows too large (>200 lines), consider moving specialized logic into a separate file or skill.
- **Quality Criteria**: Ensure the prompt defines a clear persona, a specific task, and measurable success criteria.

package/package.json
CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "agentv",
-  "version": "0.
+  "version": "0.13.0",
   "description": "CLI entry point for AgentV",
   "type": "module",
   "repository": {

@@ -24,7 +24,7 @@
     "dotenv": "^16.4.5",
     "micromatch": "^4.0.8",
     "yaml": "^2.6.1",
-    "@agentv/core": "0.
+    "@agentv/core": "0.13.0"
   },
   "devDependencies": {
     "execa": "^9.3.0"