agentv 0.11.0 → 0.13.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,317 @@
+ # Example Eval Files
+
+ This document contains complete examples of well-structured eval files demonstrating various AgentV patterns and best practices.
+
+ ## Basic Example: Simple Q&A Eval
+
+ ```yaml
+ $schema: agentv-eval-v2
+ description: Basic arithmetic evaluation
+ target: default
+
+ evalcases:
+   - id: simple-addition
+     outcome: Correctly calculates 2+2
+
+     input_messages:
+       - role: user
+         content: What is 2 + 2?
+
+     expected_messages:
+       - role: assistant
+         content: "4"
+ ```
+
+ ## Code Review with File References
+
+ ```yaml
+ $schema: agentv-eval-v2
+ description: Code review with guidelines
+ target: azure_base
+
+ evalcases:
+   - id: code-review-basic
+     outcome: Assistant provides helpful code analysis with security considerations
+
+     input_messages:
+       - role: system
+         content: You are an expert code reviewer.
+       - role: user
+         content:
+           - type: text
+             value: |-
+               Review this function for security issues:
+
+               ```python
+               def get_user(user_id):
+                   query = f"SELECT * FROM users WHERE id = {user_id}"
+                   return db.execute(query)
+               ```
+           - type: file
+             value: /prompts/security-guidelines.md
+
+     expected_messages:
+       - role: assistant
+         content: |-
+           This code has a critical SQL injection vulnerability. The user_id is directly
+           interpolated into the query string without sanitization.
+
+           Recommended fix:
+           ```python
+           def get_user(user_id):
+               query = "SELECT * FROM users WHERE id = ?"
+               return db.execute(query, (user_id,))
+           ```
+ ```
+
+ ## Multi-Evaluator Configuration
+
+ ```yaml
+ $schema: agentv-eval-v2
+ description: JSON generation with validation
+ target: default
+
+ evalcases:
+   - id: json-generation-with-validation
+     outcome: Generates valid JSON with required fields
+
+     execution:
+       evaluators:
+         - name: json_format_validator
+           type: code
+           script: uv run validate_json.py
+           cwd: ./evaluators
+         - name: content_evaluator
+           type: llm_judge
+           prompt: ./judges/semantic_correctness.md
+           model: gpt-5-chat
+
+     input_messages:
+       - role: user
+         content: |-
+           Generate a JSON object for a user with name "Alice",
+           email "alice@example.com", and role "admin".
+
+     expected_messages:
+       - role: assistant
+         content: |-
+           {
+             "name": "Alice",
+             "email": "alice@example.com",
+             "role": "admin"
+           }
+ ```
+
+ ## Conversation Threading
+
+ ```yaml
+ $schema: agentv-eval-v2
+ description: Multi-turn conversation for iterative code development
+ target: default
+
+ evalcases:
+   - id: create-class
+     conversation_id: python-dev-session
+     outcome: Creates a basic User class
+
+     input_messages:
+       - role: user
+         content: Create a Python User class with name and email attributes
+
+     expected_messages:
+       - role: assistant
+         content: |-
+           ```python
+           class User:
+               def __init__(self, name, email):
+                   self.name = name
+                   self.email = email
+           ```
+
+   - id: add-validation
+     conversation_id: python-dev-session
+     outcome: Adds email validation to the existing class
+
+     input_messages:
+       - role: user
+         content: Add email validation to the User class
+
+     expected_messages:
+       - role: assistant
+         content: |-
+           ```python
+           import re
+
+           class User:
+               def __init__(self, name, email):
+                   self.name = name
+                   self.email = self._validate_email(email)
+
+               def _validate_email(self, email):
+                   pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
+                   if not re.match(pattern, email):
+                       raise ValueError(f"Invalid email: {email}")
+                   return email
+           ```
+
+   - id: add-repr
+     conversation_id: python-dev-session
+     outcome: Adds __repr__ method for better debugging
+
+     input_messages:
+       - role: user
+         content: Add a __repr__ method to the class
+
+     expected_messages:
+       - role: assistant
+         content: |-
+           ```python
+           def __repr__(self):
+               return f"User(name='{self.name}', email='{self.email}')"
+           ```
+ ```
+
+ ## Advanced: Per-Case Target Override
+
+ ```yaml
+ $schema: agentv-eval-v2
+ description: Cross-platform testing with different targets
+ target: azure_base
+
+ evalcases:
+   - id: azure-test
+     outcome: Tests with Azure OpenAI (uses file-level default target)
+
+     input_messages:
+       - role: user
+         content: Explain async/await in Python
+
+     expected_messages:
+       - role: assistant
+         content: "async/await is Python's syntax for asynchronous programming..."
+
+   - id: vscode-test
+     outcome: Tests with VS Code Copilot
+
+     execution:
+       target: vscode_projectx
+
+     input_messages:
+       - role: user
+         content:
+           - type: text
+             value: Refactor this code to use async/await
+           - type: file
+             value: /code-samples/sync_code.py
+
+     expected_messages:
+       - role: assistant
+         content: "async def fetch_data():\n await ..."
+ ```
+
+ ## Complex Real-World Example: CargoWise Triage
+
+ ```yaml
+ $schema: agentv-eval-v2
+ description: CargoWise criticality rating (CR1-CR9) classification eval
+ target: default
+
+ evalcases:
+   - id: cr-global-outage
+     conversation_id: cargowise-triage
+
+     outcome: |
+       Assistant correctly classifies as 'CR1' for complete system inaccessibility.
+       Reasoning should emphasize 'any user on any workstation' and lack of access.
+
+     execution:
+       evaluators:
+         - name: json_format_validator
+           type: code
+           script: uv run validate_output.py
+         - name: content_evaluator
+           type: llm_judge
+
+     input_messages:
+       - role: user
+         content:
+           - type: file
+             value: ../../prompts/cw-criticality-rating.prompt.md
+           - type: text
+             value: |-
+               Classify this CargoWise ticket:
+
+               Ticket: Entire system down, no users can log in from any device
+               Impact: All operations halted
+               Scope: Global
+               Signals: Error: "Server unreachable", no recent changes reported
+
+     expected_messages:
+       - role: assistant
+         content: |-
+           {
+             "criticalityRating": "CR1",
+             "reasoning": "Step 1: Issue prevents access to the entire application suite.
+             Step 2: Affects any user on any workstation. Step 3: Matches CR1 definition
+             exactly—no partial access or workarounds mentioned."
+           }
+
+   - id: cr-module-inaccessible
+     conversation_id: cargowise-triage
+
+     outcome: |
+       Assistant correctly classifies as 'CR2' for module-wide inaccessibility.
+       Reasoning should distinguish from CR1 by noting it's limited to one module.
+
+     execution:
+       evaluators:
+         - name: json_format_validator
+           type: code
+           script: uv run validate_output.py
+         - name: content_evaluator
+           type: llm_judge
+
+     input_messages:
+       - role: user
+         content:
+           - type: file
+             value: ../../prompts/cw-criticality-rating.prompt.md
+           - type: text
+             value: |-
+               Classify this CargoWise ticket:
+
+               Ticket: Customs module inaccessible for all users
+               Impact: Customs declarations blocked
+               Scope: Module-wide
+               Signals: "Module not found" error, other modules working
+
+     expected_messages:
+       - role: assistant
+         content: |-
+           {
+             "criticalityRating": "CR2",
+             "reasoning": "Step 1: Issue affects an entire module (Customs).
+             Step 2: No access for any user/workstation. Step 3: Does not impact
+             the full suite, so CR2 over CR1."
+           }
+ ```
+
+ ## Notes on Examples
+
+ ### File Path Conventions
+ - **Absolute paths** (start with `/`): Resolved from repository root
+   - Example: `/prompts/guidelines.md` → `<repo_root>/prompts/guidelines.md`
+ - **Relative paths** (start with `./` or `../`): Resolved from eval file directory
+   - Example: `../../prompts/file.md` → Two directories up, then into prompts/
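For instance, one `content` list can mix both styles. A minimal sketch (the file names below are placeholders, not files that ship with the package):

```yaml
input_messages:
  - role: user
    content:
      - type: file
        value: /prompts/guidelines.md  # absolute: <repo_root>/prompts/guidelines.md
      - type: file
        value: ../shared/context.md    # relative: resolved from this eval file's directory
      - type: text
        value: Apply the guidelines above to the attached context.
```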
+
+ ### Outcome Writing Tips
+ - Be specific about what success looks like
+ - Mention key elements that must be present
+ - For classification tasks, specify the expected category
+ - For reasoning tasks, describe the thought process expected
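To illustrate (the wording here is ours, not a required template), compare a vague outcome with a specific one, reusing the CargoWise example above:

```yaml
# Too vague for an evaluator to score reliably:
# outcome: Gives a good answer

# Specific: names the expected category and the reasoning to look for
outcome: |
  Assistant classifies the ticket as 'CR2' and explains that the impact
  is limited to one module rather than the full suite.
```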
+
+ ### Expected Messages
+ - Show the pattern, not rigid templates
+ - Allow for natural language variation
+ - Focus on semantic correctness over exact matching
+ - Evaluators will handle the actual validation
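For example, an expected message can sketch the shape of a good answer rather than its exact wording, leaving the semantic comparison to the evaluators (an illustrative sketch based on the code-review example above):

```yaml
expected_messages:
  - role: assistant
    # A representative answer, not an exact-match requirement
    content: |-
      The query is vulnerable to SQL injection because user input is
      interpolated directly; a parameterized query avoids this.
```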
@@ -0,0 +1,70 @@
+ ---
+ description: Iteratively optimize a prompt file against an AgentV evaluation suite by analyzing failures and refining instructions.
+ ---
+
+ # AgentV Prompt Optimizer
+
+ Iteratively optimize a prompt file against an AgentV evaluation suite.
+
+ ## Usage
+ `prompt-optimizer <eval-path> [playbook-path]`
+
+ ## Workflow
+
+ 1. **Initialize**
+    - Verify `<eval-path>` (file or glob) targets the correct system.
+    - **Identify Prompt Files**:
+      - Infer prompt files from the eval file content (look for `type: file` entries in `input_messages` whose `value` points to a prompt file, e.g. `*.prompt.md`).
+      - Recursively check referenced prompt files for *other* prompt references (dependencies).
+      - If multiple prompts are found, consider ALL of them as candidates for optimization.
+    - **Identify Playbook**:
+      - If `<playbook-path>` is provided, use it.
+      - If not, create a new one in the parent directory of the eval files: `playbook-[timestamp].md`.
+    - Read the content of the identified prompt files.
+
+ 2. **Optimization Loop** (max 5 iterations)
+    - **Execute (The Generator)**: Run `pnpm agentv eval <eval-path>`.
+    - **Analyze (The Reflector)**:
+      - Locate the results file path from the console output (e.g., `.agentv/results/eval_...jsonl`).
+      - Read the results file and calculate the pass rate.
+      - **Root Cause Analysis**: For each failure, perform a deep dive:
+        - **Error Identification**: What exactly went wrong? (e.g., "Predicted 'High' but expected 'Low'")
+        - **Root Cause**: Why did it happen? (e.g., "Ambiguous definition of 'High' severity", "Hallucinated a constraint", "Incorrect test expectation")
+        - **Correct Approach**: What *should* the model have done?
+        - **Key Insight**: What general rule or pattern can we learn from this?
+      - **Regression Check**: Did this change break previously passing tests? If so, mark the previous change as "Harmful".
+    - **Decide**:
+      - If **100% pass**: STOP and report success.
+      - If the **score decreased**: revert the last change and try a different approach.
+      - If there is **no improvement** for 2 consecutive iterations: STOP and report stagnation.
+    - **Log Result**:
+      - Append the result of this iteration to the identified playbook file.
+      - **Format**:
+        ```markdown
+        ### Iteration [N]
+        - **Change**: [Description of edit]
+        - **Rationale**: [Root Cause / Why this fix was chosen]
+        - **Outcome**: [Success / Failure / Harmful] (Score: X% -> Y%)
+        - **Insight**: [Key learning or pattern identified]
+        ```
+    - **Refine (The Curator)**:
+      - Modify the relevant prompt file (pick the one most likely to be the root cause) to address the failures.
+      - **Strategy**: Treat the prompt as a structured "Playbook".
+        - **Clarify**: If an instruction is ambiguous, make it more specific.
+        - **Add Rule**: If a constraint was missed, add a specific bullet point to the relevant section.
+        - **Negative Constraint**: If the model hallucinates, explicitly state what NOT to do.
+        - **Consolidate**: Check for redundant or overlapping instructions and merge them.
+      - **Safety Check**: Ensure new rules don't contradict existing ones (unless intended).
+      - **Constraint**: Avoid rewriting large sections; make surgical, additive changes to preserve existing behavior.
+      - **Apply**: Use `replace_string_in_file`.
+
+ 3. **Completion**
+    - Report final score.
+    - Summarize key changes made to the prompt.
+    - **Finalize Playbook**: Add a summary header to the playbook file indicating the session completion and final score.
+
+ ## Guidelines
+ - **Simplicity ("Less is More")**: Avoid adding specific rules for rare edge cases ("hotfixes"). Focus on universally applicable instructions.
+ - **Structure**: Maintain existing Markdown headers/sections.
+ - **Progressive Disclosure**: If the prompt grows too large (>200 lines), consider moving specialized logic into a separate file or skill.
+ - **Quality Criteria**: Ensure the prompt defines a clear persona, specific task, and measurable success criteria.
@@ -0,0 +1,5 @@
+ ---
+ description: 'Create and maintain AgentV YAML evaluation files'
+ ---
+
+ #file:../../.claude/skills/agentv-eval-builder/SKILL.md
@@ -0,0 +1,4 @@
+ ---
+ description: Iteratively optimize a prompt file against an AgentV evaluation suite
+ ---
+ #file:../../.claude/skills/agentv-prompt-optimizer/SKILL.md
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "agentv",
-   "version": "0.11.0",
+   "version": "0.13.0",
    "description": "CLI entry point for AgentV",
    "type": "module",
    "repository": {
@@ -24,7 +24,7 @@
      "dotenv": "^16.4.5",
      "micromatch": "^4.0.8",
      "yaml": "^2.6.1",
-     "@agentv/core": "0.11.0"
+     "@agentv/core": "0.13.0"
    },
    "devDependencies": {
      "execa": "^9.3.0"