agentv 0.14.2 → 0.16.0

This diff shows the contents of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the changes between package versions as they appear in their respective public registries.
package/dist/cli.js CHANGED
@@ -1,7 +1,7 @@
 #!/usr/bin/env node
 import {
   runCli
-} from "./chunk-TJ4U6A6X.js";
+} from "./chunk-HWGALLUR.js";
 import "./chunk-7XYYGJAC.js";
 
 // src/cli.ts
package/dist/index.js CHANGED
@@ -1,7 +1,7 @@
 import {
   createProgram,
   runCli
-} from "./chunk-TJ4U6A6X.js";
+} from "./chunk-HWGALLUR.js";
 import "./chunk-7XYYGJAC.js";
 export {
   createProgram,
@@ -1,9 +1,9 @@
 ---
-name: eval-builder
+name: agentv-eval-builder
 description: Create and maintain AgentV YAML evaluation files for testing AI agent performance. Use this skill when creating new eval files, adding eval cases, or configuring custom evaluators (code validators or LLM judges) for agent testing workflows.
 ---
 
-# Eval Builder
+# AgentV Eval Builder
 
 ## Schema Reference
 - Schema: `references/eval-schema.json` (JSON Schema for validation and tooling)
@@ -11,7 +11,7 @@ description: Create and maintain AgentV YAML evaluation files for testing AI age
 - Examples: `references/example-evals.md`
 
 ## Structure Requirements
-- Root level: `$schema` (required: "agentv-eval-v2"), `description` (optional), `target` (optional), `evalcases` (required)
+- Root level: `$schema` (required: "agentv-eval-v2"), `description` (optional), `execution` (optional), `evalcases` (required)
 - Eval case fields: `id` (required), `outcome` (required), `input_messages` (required), `expected_messages` (required)
 - Optional fields: `conversation_id`, `note`, `execution`
 - Message fields: `role` (required), `content` (required)
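
Read together, the structure requirements in the hunk above pin down a complete minimal file. As a quick sketch (assembled here for illustration from only the required fields listed above; it is not itself part of the package contents):

```yaml
$schema: agentv-eval-v2

evalcases:
  - id: minimal-case
    outcome: Assistant answers the arithmetic question correctly
    input_messages:
      - role: user
        content: What is 2 + 2?
    expected_messages:
      - role: assistant
        content: "4"
```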
@@ -71,10 +71,10 @@ execution:
 ```yaml
 $schema: agentv-eval-v2
 description: Example showing basic features and conversation threading
-target: default
+execution:
+  target: default
 
 evalcases:
-  # Basic eval case with file references
   - id: code-review-basic
     outcome: Assistant provides helpful code analysis
 
@@ -101,52 +101,4 @@ evalcases:
           - Add type hints: `def add(a: int, b: int) -> int:`
           - Add docstring
           - Consider validation for edge cases
-
-  # Advanced: conversation threading, multiple evaluators
-  - id: python-coding-session
-    conversation_id: python-coding-session
-    outcome: Generates correct code with proper error handling
-
-    execution:
-      target: azure_base
-      evaluators:
-        - name: keyword_check
-          type: code
-          script: uv run check_keywords.py
-          cwd: /evaluators/scripts
-        - name: semantic_judge
-          type: llm_judge
-          prompt: /evaluators/prompts/correctness.md
-          model: gpt-5-chat
-
-    input_messages:
-      - role: system
-        content: You are a code generator.
-      - role: user
-        content:
-          - type: text
-            value: Create a function to find the second largest number in a list.
-          - type: file
-            value: /prompts/python.instructions.md
-
-    expected_messages:
-      - role: assistant
-        content: |-
-          ```python
-          from typing import List, Union
-
-          def find_second_largest(numbers: List[int]) -> Union[int, None]:
-              """Find the second largest number."""
-              if not isinstance(numbers, list):
-                  raise TypeError("Input must be a list")
-              if not numbers:
-                  raise ValueError("List cannot be empty")
-
-              unique = list(set(numbers))
-              if len(unique) < 2:
-                  return None
-
-              unique.sort(reverse=True)
-              return unique[1]
-          ```
 ```
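
The removed block above is also the fullest illustration of per-case `execution` wiring in this file. For quick reference, the evaluator shape it used looks like this (a sketch restating only fields from the removed example; the field names are exactly those that appeared there):

```yaml
execution:
  target: azure_base                 # optional per-case target override
  evaluators:
    - name: keyword_check            # code evaluator: runs an external script
      type: code
      script: uv run check_keywords.py
      cwd: /evaluators/scripts
    - name: semantic_judge           # LLM judge: scores output against a prompt
      type: llm_judge
      prompt: /evaluators/prompts/correctness.md
      model: gpt-5-chat
```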
@@ -102,200 +102,65 @@ evalcases:
           }
 ```
 
-## Conversation Threading
+## Multi-Turn Conversation (Single Eval Case)
 
 ```yaml
 $schema: agentv-eval-v2
-description: Multi-turn conversation for iterative code development
+description: Multi-turn debugging session with clarifying questions
 target: default
 
 evalcases:
-  - id: create-class
-    conversation_id: python-dev-session
-    outcome: Creates a basic User class
+  - id: debug-with-clarification
+    outcome: |-
+      Assistant conducts a multi-turn debugging session, asking clarification
+      questions when needed, correctly diagnosing the bug, and proposing a clear
+      fix with rationale.
 
     input_messages:
+      - role: system
+        content: You are an expert debugging assistant who reasons step by step, asks clarifying questions, and explains fixes clearly.
       - role: user
-        content: Create a Python User class with name and email attributes
-
-    expected_messages:
-      - role: assistant
         content: |-
+          I'm getting an off-by-one error in this function, but I can't see why:
+
           ```python
-          class User:
-              def __init__(self, name, email):
-                  self.name = name
-                  self.email = email
+          def get_items(items):
+              result = []
+              for i in range(len(items) - 1):
+                  result.append(items[i])
+              return result
           ```
-
-  - id: add-validation
-    conversation_id: python-dev-session
-    outcome: Adds email validation to the existing class
-
-    input_messages:
-      - role: user
-        content: Add email validation to the User class
-
-    expected_messages:
+
+          Sometimes the last element is missing. Can you help debug this?
       - role: assistant
         content: |-
-          ```python
-          import re
-
-          class User:
-              def __init__(self, name, email):
-                  self.name = name
-                  self.email = self._validate_email(email)
-
-              def _validate_email(self, email):
-                  pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
-                  if not re.match(pattern, email):
-                      raise ValueError(f"Invalid email: {email}")
-                  return email
-          ```
-
-  - id: add-repr
-    conversation_id: python-dev-session
-    outcome: Adds __repr__ method for better debugging
-
-    input_messages:
+          I can help debug this. Before I propose a fix, could you tell me:
+          - What output you expect for an example input list
+          - What output you actually get
       - role: user
-        content: Add a __repr__ method to the class
+        content: |-
+          For `[1, 2, 3, 4]` I expect `[1, 2, 3, 4]`, but I get `[1, 2, 3]`.
 
     expected_messages:
       - role: assistant
         content: |-
+          You have an off-by-one error in your loop bounds.
+          You're iterating with `range(len(items) - 1)`, which stops before the last index.
+          To include all items, you can either:
+          - Use `range(len(items))`, or
+          - Iterate directly over the list: `for item in items:`
+
+          Here's a corrected version:
+
           ```python
-          def __repr__(self):
-              return f"User(name='{self.name}', email='{self.email}')"
+          def get_items(items):
+              result = []
+              for item in items:
+                  result.append(item)
+              return result
           ```
 ```
 
-## Advanced: Per-Case Target Override
-
-```yaml
-$schema: agentv-eval-v2
-description: Cross-platform testing with different targets
-target: azure_base
-
-evalcases:
-  - id: azure-test
-    outcome: Tests with Azure OpenAI (uses file-level default target)
-
-    input_messages:
-      - role: user
-        content: Explain async/await in Python
-
-    expected_messages:
-      - role: assistant
-        content: "async/await is Python's syntax for asynchronous programming..."
-
-  - id: vscode-test
-    outcome: Tests with VS Code Copilot
-
-    execution:
-      target: vscode_projectx
-
-    input_messages:
-      - role: user
-        content:
-          - type: text
-            value: Refactor this code to use async/await
-          - type: file
-            value: /code-samples/sync_code.py
-
-    expected_messages:
-      - role: assistant
-        content: "async def fetch_data():\n  await ..."
-```
-
-## Complex Real-World Example: CargoWise Triage
-
-```yaml
-$schema: agentv-eval-v2
-description: CargoWise criticality rating (CR1-CR9) classification eval
-target: default
-
-evalcases:
-  - id: cr-global-outage
-    conversation_id: cargowise-triage
-
-    outcome: |
-      Assistant correctly classifies as 'CR1' for complete system inaccessibility.
-      Reasoning should emphasize 'any user on any workstation' and lack of access.
-
-    execution:
-      evaluators:
-        - name: json_format_validator
-          type: code
-          script: uv run validate_output.py
-        - name: content_evaluator
-          type: llm_judge
-
-    input_messages:
-      - role: user
-        content:
-          - type: file
-            value: ../../prompts/cw-criticality-rating.prompt.md
-          - type: text
-            value: |-
-              Classify this CargoWise ticket:
-
-              Ticket: Entire system down, no users can log in from any device
-              Impact: All operations halted
-              Scope: Global
-              Signals: Error: "Server unreachable", no recent changes reported
-
-    expected_messages:
-      - role: assistant
-        content: |-
-          {
-            "criticalityRating": "CR1",
-            "reasoning": "Step 1: Issue prevents access to the entire application suite.
-            Step 2: Affects any user on any workstation. Step 3: Matches CR1 definition
-            exactly—no partial access or workarounds mentioned."
-          }
-
-  - id: cr-module-inaccessible
-    conversation_id: cargowise-triage
-
-    outcome: |
-      Assistant correctly classifies as 'CR2' for module-wide inaccessibility.
-      Reasoning should distinguish from CR1 by noting it's limited to one module.
-
-    execution:
-      evaluators:
-        - name: json_format_validator
-          type: code
-          script: uv run validate_output.py
-        - name: content_evaluator
-          type: llm_judge
-
-    input_messages:
-      - role: user
-        content:
-          - type: file
-            value: ../../prompts/cw-criticality-rating.prompt.md
-          - type: text
-            value: |-
-              Classify this CargoWise ticket:
-
-              Ticket: Customs module inaccessible for all users
-              Impact: Customs declarations blocked
-              Scope: Module-wide
-              Signals: "Module not found" error, other modules working
-
-    expected_messages:
-      - role: assistant
-        content: |-
-          {
-            "criticalityRating": "CR2",
-            "reasoning": "Step 1: Issue affects an entire module (Customs).
-            Step 2: No access for any user/workstation. Step 3: Does not impact
-            the full suite, so CR2 over CR1."
-          }
-```
-
 ## Notes on Examples
 
 ### File Path Conventions
@@ -1,13 +1,12 @@
 ---
-description: Iteratively optimize a prompt file against an AgentV evaluation suite by analyzing failures and refining instructions.
+description: Iteratively optimize prompt files against AgentV evaluation datasets by analyzing failures and refining instructions.
 ---
 
 # AgentV Prompt Optimizer
 
-Iteratively optimize a prompt file against an AgentV evaluation suite.
-
-## Usage
-`prompt-optimizer <eval-path> [playbook-path]`
+## Input Variables
+- `eval-path`: Path or glob pattern to the AgentV evaluation file(s) to optimize against
+- `optimization-log-path` (optional): Path where optimization progress should be logged
 
 ## Workflow
 
@@ -17,9 +16,9 @@ Iteratively optimize a prompt file against an AgentV evaluation suite.
    - Infer prompt files from the eval file content (look for `file:` references in `input_messages` that match these patterns).
    - Recursively check referenced prompt files for *other* prompt references (dependencies).
    - If multiple prompts are found, consider ALL of them as candidates for optimization.
-   - **Identify Playbook**:
-     - If `<playbook-path>` is provided, use it.
-     - If not, create a new one in the parent directory of the eval files: `playbook-[timestamp].md`.
+   - **Identify Optimization Log**:
+     - If `<optimization-log-path>` is provided, use it.
+     - If not, create a new one in the parent directory of the eval files: `optimization-[timestamp].md`.
    - Read content of the identified prompt file.
 
 2. **Optimization Loop** (Max 5 iterations)
@@ -38,7 +37,7 @@ Iteratively optimize a prompt file against an AgentV evaluation suite.
    - If **Score decreased**: Revert last change, try different approach.
    - If **No improvement** (2x): STOP and report stagnation.
    - **Log Result**:
-     - Append the result of this iteration to the identified playbook file.
+     - Append the result of this iteration to the identified optimization log file.
      - **Format**:
        ```markdown
        ### Iteration [N]
@@ -49,7 +48,7 @@ Iteratively optimize a prompt file against an AgentV evaluation suite.
        ```
    - **Refine (The Curator)**:
      - Modify the relevant `<prompt-file>` (pick the one most likely to be the root cause) to address failures.
-     - **Strategy**: Treat the prompt as a structured "Playbook".
+     - **Strategy**: Treat the prompt as a structured set of rules and instructions.
        - **Clarify**: If ambiguous, make the existing instruction more specific.
        - **Add Rule**: If a constraint was missed, add a specific bullet point to the relevant section.
        - **Negative Constraint**: If hallucinating, explicitly state what NOT to do.
@@ -61,7 +60,7 @@ Iteratively optimize a prompt file against an AgentV evaluation suite.
 3. **Completion**
    - Report final score.
    - Summarize key changes made to the prompt.
-   - **Finalize Playbook**: Add a summary header to the playbook file indicating the session completion and final score.
+   - **Finalize Optimization Log**: Add a summary header to the optimization log file indicating the session completion and final score.
 
 ## Guidelines
 - **Simplicity ("Less is More")**: Avoid adding specific rules for rare edge cases ("hotfixes"). Focus on universally applicable instructions.
@@ -1,4 +1,4 @@
 ---
-description: Iteratively optimize a prompt file against an AgentV evaluation suite
+description: Iteratively optimize prompt files against an AgentV evaluation suite
 ---
 #file:../../.claude/skills/agentv-prompt-optimizer/SKILL.md
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "agentv",
-  "version": "0.14.2",
+  "version": "0.16.0",
   "description": "CLI entry point for AgentV",
   "type": "module",
   "repository": {
@@ -24,7 +24,7 @@
     "dotenv": "^16.4.5",
     "micromatch": "^4.0.8",
     "yaml": "^2.6.1",
-    "@agentv/core": "0.14.2"
+    "@agentv/core": "0.16.0"
   },
   "devDependencies": {
     "execa": "^9.3.0"