agentv 3.10.2 → 3.11.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (42)
  1. package/dist/{chunk-6UE665XI.js → chunk-ETMDLQ72.js} +1141 -60
  2. package/dist/chunk-ETMDLQ72.js.map +1 -0
  3. package/dist/{chunk-KGK5NUFG.js → chunk-EZGWZVVK.js} +377 -163
  4. package/dist/chunk-EZGWZVVK.js.map +1 -0
  5. package/dist/{chunk-F7LAJMTO.js → chunk-JEW3FEO7.js} +68 -32
  6. package/dist/chunk-JEW3FEO7.js.map +1 -0
  7. package/dist/cli.js +3 -3
  8. package/dist/{dist-3QUJEJUT.js → dist-QERRYDSC.js} +2 -2
  9. package/dist/index.js +3 -3
  10. package/dist/{interactive-EO6AR2R3.js → interactive-AD4PRYDN.js} +3 -3
  11. package/package.json +3 -1
  12. package/dist/chunk-6UE665XI.js.map +0 -1
  13. package/dist/chunk-F7LAJMTO.js.map +0 -1
  14. package/dist/chunk-KGK5NUFG.js.map +0 -1
  15. package/dist/templates/.agents/skills/agentv-chat-to-eval/README.md +0 -84
  16. package/dist/templates/.agents/skills/agentv-chat-to-eval/SKILL.md +0 -144
  17. package/dist/templates/.agents/skills/agentv-chat-to-eval/examples/transcript-json.md +0 -67
  18. package/dist/templates/.agents/skills/agentv-chat-to-eval/examples/transcript-markdown.md +0 -101
  19. package/dist/templates/.agents/skills/agentv-eval-builder/SKILL.md +0 -458
  20. package/dist/templates/.agents/skills/agentv-eval-builder/references/config-schema.json +0 -36
  21. package/dist/templates/.agents/skills/agentv-eval-builder/references/custom-evaluators.md +0 -118
  22. package/dist/templates/.agents/skills/agentv-eval-builder/references/eval-schema.json +0 -12753
  23. package/dist/templates/.agents/skills/agentv-eval-builder/references/rubric-evaluator.md +0 -77
  24. package/dist/templates/.agents/skills/agentv-eval-orchestrator/SKILL.md +0 -50
  25. package/dist/templates/.agents/skills/agentv-prompt-optimizer/SKILL.md +0 -78
  26. package/dist/templates/.agentv/.env.example +0 -25
  27. package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +0 -177
  28. package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md +0 -316
  29. package/dist/templates/.claude/skills/agentv-eval-builder/references/compare-command.md +0 -137
  30. package/dist/templates/.claude/skills/agentv-eval-builder/references/composite-evaluator.md +0 -215
  31. package/dist/templates/.claude/skills/agentv-eval-builder/references/config-schema.json +0 -27
  32. package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md +0 -115
  33. package/dist/templates/.claude/skills/agentv-eval-builder/references/eval-schema.json +0 -278
  34. package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md +0 -333
  35. package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md +0 -79
  36. package/dist/templates/.claude/skills/agentv-eval-builder/references/structured-data-evaluators.md +0 -121
  37. package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md +0 -298
  38. package/dist/templates/.claude/skills/agentv-prompt-optimizer/SKILL.md +0 -78
  39. package/dist/templates/.github/prompts/agentv-eval-build.prompt.md +0 -5
  40. package/dist/templates/.github/prompts/agentv-optimize.prompt.md +0 -4
  41. package/dist/{dist-3QUJEJUT.js.map → dist-QERRYDSC.js.map} +0 -0
  42. package/dist/{interactive-EO6AR2R3.js.map → interactive-AD4PRYDN.js.map} +0 -0
@@ -1,333 +0,0 @@
- # Example Eval Files
-
- This document contains complete examples of well-structured eval files demonstrating various AgentV patterns and best practices.
-
- ## Basic Example: Simple Q&A Eval
-
- ```yaml
- description: Basic arithmetic evaluation
- execution:
-   target: default
-
- evalcases:
-   - id: simple-addition
-     expected_outcome: Correctly calculates 2+2
-
-     input_messages:
-       - role: user
-         content: What is 2 + 2?
-
-     expected_messages:
-       - role: assistant
-         content: "4"
- ```
-
- ## Code Review with File References
-
- ```yaml
- description: Code review with guidelines
- execution:
-   target: azure_base
-
- evalcases:
-   - id: code-review-basic
-     expected_outcome: Assistant provides helpful code analysis with security considerations
-
-     input_messages:
-       - role: system
-         content: You are an expert code reviewer.
-       - role: user
-         content:
-           - type: text
-             value: |-
-               Review this function for security issues:
-
-               ```python
-               def get_user(user_id):
-                   query = f"SELECT * FROM users WHERE id = {user_id}"
-                   return db.execute(query)
-               ```
-           - type: file
-             value: /prompts/security-guidelines.md
-
-     expected_messages:
-       - role: assistant
-         content: |-
-           This code has a critical SQL injection vulnerability. The user_id is directly
-           interpolated into the query string without sanitization.
-
-           Recommended fix:
-           ```python
-           def get_user(user_id):
-               query = "SELECT * FROM users WHERE id = ?"
-               return db.execute(query, (user_id,))
-           ```
- ```
-
- ## Multi-Evaluator Configuration
-
- ```yaml
- description: JSON generation with validation
- execution:
-   target: default
-
- evalcases:
-   - id: json-generation-with-validation
-     expected_outcome: Generates valid JSON with required fields
-
-     execution:
-       evaluators:
-         - name: json_format_validator
-           type: code_judge
-           script: uv run validate_json.py
-           cwd: ./evaluators
-         - name: content_evaluator
-           type: llm_judge
-           prompt: ./judges/semantic_correctness.md
-
-     input_messages:
-       - role: user
-         content: |-
-           Generate a JSON object for a user with name "Alice",
-           email "alice@example.com", and role "admin".
-
-     expected_messages:
-       - role: assistant
-         content: |-
-           {
-             "name": "Alice",
-             "email": "alice@example.com",
-             "role": "admin"
-           }
- ```
-
- ## Tool Trajectory Evaluation
-
- Validate that an agent uses specific tools during execution.
-
- ```yaml
- description: Tool usage validation
- execution:
-   target: mock_agent
-
- evalcases:
-   # Validate minimum tool usage (order doesn't matter)
-   - id: research-depth
-     expected_outcome: Agent researches thoroughly
-     input_messages:
-       - role: user
-         content: Research REST vs GraphQL
-     execution:
-       evaluators:
-         - name: research-check
-           type: tool_trajectory
-           mode: any_order
-           minimums:
-             knowledgeSearch: 2
-             documentRetrieve: 1
-
-   # Validate exact tool sequence
-   - id: auth-flow
-     expected_outcome: Agent follows auth sequence
-     input_messages:
-       - role: user
-         content: Authenticate user
-     execution:
-       evaluators:
-         - name: auth-sequence
-           type: tool_trajectory
-           mode: exact
-           expected:
-             - tool: checkCredentials
-             - tool: generateToken
- ```
-
- ## Static Trace Evaluation
-
- Evaluate pre-existing trace files without running an agent.
-
- ```yaml
- description: Static trace evaluation
- execution:
-   target: static_trace
-
- evalcases:
-   - id: validate-trace-file
-     expected_outcome: Trace contains required steps
-     input_messages:
-       - role: user
-         content: Analyze trace
-     execution:
-       evaluators:
-         - name: trace-check
-           type: tool_trajectory
-           mode: in_order
-           expected:
-             - tool: webSearch
-             - tool: readFile
- ```
-
- ## Multi-Turn Conversation (Single Eval Case)
-
- ```yaml
- description: Multi-turn debugging session with clarifying questions
- execution:
-   target: default
-
- evalcases:
-   - id: debug-with-clarification
-     expected_outcome: |-
-       Assistant conducts a multi-turn debugging session, asking clarification
-       questions when needed, correctly diagnosing the bug, and proposing a clear
-       fix with rationale.
-
-     input_messages:
-       - role: system
-         content: You are an expert debugging assistant who reasons step by step, asks clarifying questions, and explains fixes clearly.
-       - role: user
-         content: |-
-           I'm getting an off-by-one error in this function, but I can't see why:
-
-           ```python
-           def get_items(items):
-               result = []
-               for i in range(len(items) - 1):
-                   result.append(items[i])
-               return result
-           ```
-
-           Sometimes the last element is missing. Can you help debug this?
-       - role: assistant
-         content: |-
-           I can help debug this. Before I propose a fix, could you tell me:
-           - What output you expect for an example input list
-           - What output you actually get
-       - role: user
-         content: |-
-           For `[1, 2, 3, 4]` I expect `[1, 2, 3, 4]`, but I get `[1, 2, 3]`.
-
-     expected_messages:
-       - role: assistant
-         content: |-
-           You have an off-by-one error in your loop bounds.
-           You're iterating with `range(len(items) - 1)`, which stops before the last index.
-           To include all items, you can either:
-           - Use `range(len(items))`, or
-           - Iterate directly over the list: `for item in items:`
-
-           Here's a corrected version:
-
-           ```python
-           def get_items(items):
-               result = []
-               for item in items:
-                   result.append(item)
-               return result
-           ```
- ```
-
- ## Batch CLI Evaluation
-
- Evaluate external batch runners that process all evalcases in one invocation.
-
- ```yaml
- description: Batch CLI demo (AML screening)
- execution:
-   target: batch_cli
-
- evalcases:
-   - id: aml-001
-     expected_outcome: |-
-       Batch runner returns JSON with decision=CLEAR.
-
-     expected_messages:
-       - role: assistant
-         content:
-           decision: CLEAR
-
-     input_messages:
-       - role: system
-         content: You are a deterministic AML screening batch checker.
-       - role: user
-         content:
-           request:
-             type: aml_screening_check
-             jurisdiction: AU
-             effective_date: 2025-01-01
-           row:
-             id: aml-001
-             customer_name: Example Customer A
-             origin_country: NZ
-             destination_country: AU
-             transaction_type: INTERNATIONAL_TRANSFER
-             amount: 5000
-             currency: USD
-
-     execution:
-       evaluators:
-         - name: decision-check
-           type: code_judge
-           script: bun run ./scripts/check-batch-cli-output.ts
-           cwd: .
-
-   - id: aml-002
-     expected_outcome: |-
-       Batch runner returns JSON with decision=REVIEW.
-
-     expected_messages:
-       - role: assistant
-         content:
-           decision: REVIEW
-
-     input_messages:
-       - role: system
-         content: You are a deterministic AML screening batch checker.
-       - role: user
-         content:
-           request:
-             type: aml_screening_check
-             jurisdiction: AU
-             effective_date: 2025-01-01
-           row:
-             id: aml-002
-             customer_name: Example Customer B
-             origin_country: IR
-             destination_country: AU
-             transaction_type: INTERNATIONAL_TRANSFER
-             amount: 2000
-             currency: USD
-
-     execution:
-       evaluators:
-         - name: decision-check
-           type: code_judge
-           script: bun run ./scripts/check-batch-cli-output.ts
-           cwd: .
- ```
-
- ### Batch CLI Pattern Notes
- - **execution.target: batch_cli** - Configure CLI provider with `provider_batching: true`
- - **Batch runner** - Reads eval YAML via `--eval` flag, outputs JSONL keyed by `id`
- - **Structured input** - Put data in `user.content` as objects for runner to extract
- - **Structured expected** - Use `expected_messages.content` with object fields
- - **Per-case evaluators** - Each evalcase has its own evaluator to validate output
-
- ## Notes on Examples
-
- ### File Path Conventions
- - **Absolute paths** (start with `/`): Resolved from repository root
-   - Example: `/prompts/guidelines.md` → `<repo_root>/prompts/guidelines.md`
- - **Relative paths** (start with `./` or `../`): Resolved from eval file directory
-   - Example: `../../prompts/file.md` → Two directories up, then into prompts/
-
- ### expected_outcome Writing Tips
- - Be specific about what success looks like
- - Mention key elements that must be present
- - For classification tasks, specify the expected category
- - For reasoning tasks, describe the thought process expected
-
- ### Expected Messages
- - Show the pattern, not rigid templates
- - Allow for natural language variation
- - Focus on semantic correctness over exact matching
- - Evaluators will handle the actual validation
@@ -1,79 +0,0 @@
- # Rubric Evaluator
-
- ## Field Reference
-
- | Field | Type | Default | Description |
- |-------|------|---------|-------------|
- | `id` | string | auto-generated | Unique identifier |
- | `expected_outcome` | string | required* | Criterion being evaluated (*optional if `score_ranges` used) |
- | `weight` | number | 1.0 | Relative importance |
- | `required` | boolean | true | Failing forces verdict to 'fail' (checklist mode) |
- | `required_min_score` | integer | - | Minimum 0-10 score to pass (score-range mode) |
- | `score_ranges` | map or array | - | Score range definitions for analytic scoring |
-
- `description` is a backward-compatible alias for `expected_outcome`.
-
- ## Checklist Mode
-
- ```yaml
- rubrics:
-   - Mentions divide-and-conquer approach
-   - id: complexity
-     expected_outcome: States time complexity correctly
-     weight: 2.0
-     required: true
-   - id: examples
-     expected_outcome: Includes code examples
-     weight: 1.0
-     required: false
- ```
-
- ## Score-Range Mode
-
- Shorthand map format (recommended):
-
- ```yaml
- rubrics:
-   - id: correctness
-     weight: 2.0
-     required_min_score: 7
-     score_ranges:
-       0: Critical bugs
-       3: Minor bugs
-       6: Correct with minor issues
-       9: Fully correct
- ```
-
- Map keys are lower bounds (0-10). Each range extends from its key to (next key - 1), with the last extending to 10. Must start at 0.
-
- Array format is also accepted:
-
- ```yaml
- score_ranges:
-   - score_range: [0, 2]
-     expected_outcome: Critical bugs
-   - score_range: [3, 5]
-     expected_outcome: Minor bugs
-   - score_range: [6, 8]
-     expected_outcome: Correct with minor issues
-   - score_range: [9, 10]
-     expected_outcome: Fully correct
- ```
-
- Ranges must be integers 0-10, non-overlapping, covering all values 0-10.
-
- ## Scoring
-
- **Checklist:** `score = sum(satisfied weights) / sum(all weights)`
-
- **Score-range:** `score = weighted_average(raw_score / 10)` per criterion
-
- ## Verdicts
-
- | Verdict | Condition |
- |---------|-----------|
- | `pass` | score >= 0.8 AND all gating criteria satisfied |
- | `borderline` | score >= 0.6 AND all gating criteria satisfied |
- | `fail` | score < 0.6 OR any gating criterion failed |
-
- Gating: checklist uses `required: true`, score-range uses `required_min_score: N`.
@@ -1,121 +0,0 @@
- # Structured Data + Metrics Evaluators
-
- This reference covers the built-in evaluators used for grading structured outputs and gating on execution metrics:
-
- - `field_accuracy`
- - `latency`
- - `cost`
- - `token_usage`
-
- ## Ground Truth (`expected_messages`)
-
- Put the expected structured output in the evalcase `expected_messages` (typically as the last `assistant` message with `content` as an object). Evaluators read expected values from there.
-
- ```yaml
- evalcases:
-   - id: invoice-001
-     expected_messages:
-       - role: assistant
-         content:
-           invoice_number: "INV-2025-001234"
-           net_total: 1889
- ```
-
- ## `field_accuracy`
-
- Use `field_accuracy` to compare fields in the candidate JSON against the ground-truth object in `expected_messages`.
-
- ```yaml
- execution:
-   evaluators:
-     - name: invoice_fields
-       type: field_accuracy
-       aggregation: weighted_average
-       fields:
-         - path: invoice_number
-           match: exact
-           required: true
-           weight: 2.0
-         - path: invoice_date
-           match: date
-           formats: ["DD-MMM-YYYY", "YYYY-MM-DD"]
-         - path: net_total
-           match: numeric_tolerance
-           tolerance: 1.0
- ```
-
- ### Match types
-
- - `exact`: strict equality
- - `date`: compares dates after parsing; optionally provide `formats`
- - `numeric_tolerance`: numeric compare within `tolerance` (set `relative: true` for relative tolerance)
-
- For fuzzy string matching, use a `code_judge` evaluator (e.g. Levenshtein) instead of adding a fuzzy mode to `field_accuracy`.
-
- ### Aggregation
-
- - `weighted_average` (default): weighted mean of field scores
- - `all_or_nothing`: score 1.0 only if all graded fields pass
-
- ## `latency` and `cost`
-
- These evaluators gate on execution metrics reported by the provider (via `traceSummary`).
-
- ```yaml
- execution:
-   evaluators:
-     - name: performance
-       type: latency
-       threshold: 2000
-     - name: budget
-       type: cost
-       budget: 0.10
- ```
-
- ## `token_usage`
-
- Gate on provider-reported token usage (useful when cost is unavailable or model pricing differs).
-
- ```yaml
- execution:
-   evaluators:
-     - name: token-budget
-       type: token_usage
-       max_total: 10000
-       # or:
-       # max_input: 8000
-       # max_output: 2000
- ```
-
- ## Common pattern: combine correctness + gates
-
- Use a `composite` evaluator if you want a single “release gate” score/verdict from multiple checks:
-
- ```yaml
- execution:
-   evaluators:
-     - name: release_gate
-       type: composite
-       evaluators:
-         - name: correctness
-           type: field_accuracy
-           fields:
-             - path: invoice_number
-               match: exact
-         - name: latency
-           type: latency
-           threshold: 2000
-         - name: cost
-           type: cost
-           budget: 0.10
-         - name: tokens
-           type: token_usage
-           max_total: 10000
-       aggregator:
-         type: weighted_average
-         weights:
-           correctness: 0.8
-           latency: 0.1
-           cost: 0.05
-           tokens: 0.05
- ```
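
---

Aside: the scoring and verdict rules stated in the removed rubric-evaluator reference (checklist score = sum of satisfied weights over all weights; pass at >= 0.8, borderline at >= 0.6, with `required: true` criteria gating the verdict to `fail`) can be sketched as follows. This is a minimal illustration of the documented rules, not AgentV's actual implementation; the function name and dict shape are hypothetical.

```python
# Hypothetical sketch of checklist-mode scoring per the removed docs;
# not AgentV's real code. Each rubric dict carries 'weight', 'required',
# and a 'satisfied' flag produced by the judge.

def checklist_score(rubrics):
    total = sum(r.get("weight", 1.0) for r in rubrics)
    satisfied = sum(r.get("weight", 1.0) for r in rubrics if r["satisfied"])
    score = satisfied / total if total else 0.0
    # A failed criterion with required: true gates the verdict to 'fail'.
    gates_ok = all(r["satisfied"] or not r.get("required", True) for r in rubrics)
    if gates_ok and score >= 0.8:
        verdict = "pass"
    elif gates_ok and score >= 0.6:
        verdict = "borderline"
    else:
        verdict = "fail"
    return score, verdict

score, verdict = checklist_score([
    {"weight": 2.0, "required": True, "satisfied": True},
    {"weight": 1.0, "required": False, "satisfied": False},
])
# score = 2/3, all gating criteria satisfied -> "borderline"
```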