agentv 3.10.2 → 3.10.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (42)
  1. package/dist/{chunk-6UE665XI.js → chunk-7LC3VNOC.js} +4 -4
  2. package/dist/{chunk-KGK5NUFG.js → chunk-JUQCB3ZW.js} +56 -15
  3. package/dist/chunk-JUQCB3ZW.js.map +1 -0
  4. package/dist/{chunk-F7LAJMTO.js → chunk-U556GRI3.js} +4 -4
  5. package/dist/{chunk-F7LAJMTO.js.map → chunk-U556GRI3.js.map} +1 -1
  6. package/dist/cli.js +3 -3
  7. package/dist/{dist-3QUJEJUT.js → dist-2X7A3TTC.js} +2 -2
  8. package/dist/index.js +3 -3
  9. package/dist/{interactive-EO6AR2R3.js → interactive-CSA4KIND.js} +3 -3
  10. package/dist/templates/.agentv/.env.example +9 -11
  11. package/dist/templates/.agentv/config.yaml +13 -4
  12. package/dist/templates/.agentv/targets.yaml +16 -0
  13. package/package.json +1 -1
  14. package/dist/chunk-KGK5NUFG.js.map +0 -1
  15. package/dist/templates/.agents/skills/agentv-chat-to-eval/README.md +0 -84
  16. package/dist/templates/.agents/skills/agentv-chat-to-eval/SKILL.md +0 -144
  17. package/dist/templates/.agents/skills/agentv-chat-to-eval/examples/transcript-json.md +0 -67
  18. package/dist/templates/.agents/skills/agentv-chat-to-eval/examples/transcript-markdown.md +0 -101
  19. package/dist/templates/.agents/skills/agentv-eval-builder/SKILL.md +0 -458
  20. package/dist/templates/.agents/skills/agentv-eval-builder/references/config-schema.json +0 -36
  21. package/dist/templates/.agents/skills/agentv-eval-builder/references/custom-evaluators.md +0 -118
  22. package/dist/templates/.agents/skills/agentv-eval-builder/references/eval-schema.json +0 -12753
  23. package/dist/templates/.agents/skills/agentv-eval-builder/references/rubric-evaluator.md +0 -77
  24. package/dist/templates/.agents/skills/agentv-eval-orchestrator/SKILL.md +0 -50
  25. package/dist/templates/.agents/skills/agentv-prompt-optimizer/SKILL.md +0 -78
  26. package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +0 -177
  27. package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md +0 -316
  28. package/dist/templates/.claude/skills/agentv-eval-builder/references/compare-command.md +0 -137
  29. package/dist/templates/.claude/skills/agentv-eval-builder/references/composite-evaluator.md +0 -215
  30. package/dist/templates/.claude/skills/agentv-eval-builder/references/config-schema.json +0 -27
  31. package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md +0 -115
  32. package/dist/templates/.claude/skills/agentv-eval-builder/references/eval-schema.json +0 -278
  33. package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md +0 -333
  34. package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md +0 -79
  35. package/dist/templates/.claude/skills/agentv-eval-builder/references/structured-data-evaluators.md +0 -121
  36. package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md +0 -298
  37. package/dist/templates/.claude/skills/agentv-prompt-optimizer/SKILL.md +0 -78
  38. package/dist/templates/.github/prompts/agentv-eval-build.prompt.md +0 -5
  39. package/dist/templates/.github/prompts/agentv-optimize.prompt.md +0 -4
  40. package/dist/{chunk-6UE665XI.js.map → chunk-7LC3VNOC.js.map} +0 -0
  41. package/dist/{dist-3QUJEJUT.js.map → dist-2X7A3TTC.js.map} +0 -0
  42. package/dist/{interactive-EO6AR2R3.js.map → interactive-CSA4KIND.js.map} +0 -0
@@ -1,121 +0,0 @@
- # Structured Data + Metrics Evaluators
-
- This reference covers the built-in evaluators used for grading structured outputs and gating on execution metrics:
-
- - `field_accuracy`
- - `latency`
- - `cost`
- - `token_usage`
-
- ## Ground Truth (`expected_messages`)
-
- Put the expected structured output in the evalcase `expected_messages` (typically as the last `assistant` message with `content` as an object). Evaluators read expected values from there.
-
- ```yaml
- evalcases:
-   - id: invoice-001
-     expected_messages:
-       - role: assistant
-         content:
-           invoice_number: "INV-2025-001234"
-           net_total: 1889
- ```
-
- ## `field_accuracy`
-
- Use `field_accuracy` to compare fields in the candidate JSON against the ground-truth object in `expected_messages`.
-
- ```yaml
- execution:
-   evaluators:
-     - name: invoice_fields
-       type: field_accuracy
-       aggregation: weighted_average
-       fields:
-         - path: invoice_number
-           match: exact
-           required: true
-           weight: 2.0
-         - path: invoice_date
-           match: date
-           formats: ["DD-MMM-YYYY", "YYYY-MM-DD"]
-         - path: net_total
-           match: numeric_tolerance
-           tolerance: 1.0
- ```
-
- ### Match types
-
- - `exact`: strict equality
- - `date`: compares dates after parsing; optionally provide `formats`
- - `numeric_tolerance`: numeric compare within `tolerance` (set `relative: true` for relative tolerance)
-
- For fuzzy string matching, use a `code_judge` evaluator (e.g. Levenshtein) instead of adding a fuzzy mode to `field_accuracy`.
-
- ### Aggregation
-
- - `weighted_average` (default): weighted mean of field scores
- - `all_or_nothing`: score 1.0 only if all graded fields pass
-
- ## `latency` and `cost`
-
- These evaluators gate on execution metrics reported by the provider (via `traceSummary`).
-
- ```yaml
- execution:
-   evaluators:
-     - name: performance
-       type: latency
-       threshold: 2000
-     - name: budget
-       type: cost
-       budget: 0.10
- ```
-
- ## `token_usage`
-
- Gate on provider-reported token usage (useful when cost is unavailable or model pricing differs).
-
- ```yaml
- execution:
-   evaluators:
-     - name: token-budget
-       type: token_usage
-       max_total: 10000
-       # or:
-       # max_input: 8000
-       # max_output: 2000
- ```
-
- ## Common pattern: combine correctness + gates
-
- Use a `composite` evaluator if you want a single “release gate” score/verdict from multiple checks:
-
- ```yaml
- execution:
-   evaluators:
-     - name: release_gate
-       type: composite
-       evaluators:
-         - name: correctness
-           type: field_accuracy
-           fields:
-             - path: invoice_number
-               match: exact
-         - name: latency
-           type: latency
-           threshold: 2000
-         - name: cost
-           type: cost
-           budget: 0.10
-         - name: tokens
-           type: token_usage
-           max_total: 10000
-       aggregator:
-         type: weighted_average
-         weights:
-           correctness: 0.8
-           latency: 0.1
-           cost: 0.05
-           tokens: 0.05
- ```
@@ -1,298 +0,0 @@
- # Tool Trajectory Evaluator Guide
-
- Tool trajectory evaluators validate that an agent used the expected tools during execution. They work with trace data returned by agent providers (codex, vscode, cli with trace support).
-
- ## Tool Trajectory Evaluator
-
- ### Modes
-
- #### 1. `any_order` - Minimum Tool Counts
-
- Validates that each tool was called at least N times, regardless of order:
-
- ```yaml
- execution:
-   evaluators:
-     - name: tool-usage
-       type: tool_trajectory
-       mode: any_order
-       minimums:
-         knowledgeSearch: 2 # Must be called at least twice
-         documentRetrieve: 1 # Must be called at least once
- ```
-
- **Use cases:**
- - Ensure required tools are used
- - Don't care about execution order
- - Allow flexibility in agent implementation
-
- #### 2. `in_order` - Sequential Matching
-
- Validates tools appear in the expected sequence, but allows gaps (other tools can appear between):
-
- ```yaml
- execution:
-   evaluators:
-     - name: workflow-sequence
-       type: tool_trajectory
-       mode: in_order
-       expected:
-         - tool: fetchData
-         - tool: validateSchema
-         - tool: transformData
-         - tool: saveResults
- ```
-
- **Use cases:**
- - Validate logical workflow order
- - Allow agent to use additional helper tools
- - Check that key steps happen in sequence
-
- ### Argument Matching
-
- For `in_order` and `exact` modes, you can optionally validate tool arguments:
-
- ```yaml
- execution:
-   evaluators:
-     - name: search-validation
-       type: tool_trajectory
-       mode: in_order
-       expected:
-         # Partial match - only specified keys are checked
-         - tool: search
-           args: { query: "machine learning" }
-
-         # Skip argument validation for this tool
-         - tool: process
-           args: any
-
-         # No args field = no argument validation (same as args: any)
-         - tool: saveResults
- ```
-
- **Argument matching modes:**
- - `args: { key: value }` - Partial deep equality (only specified keys are checked)
- - `args: any` - Skip argument validation
- - No `args` field - Same as `args: any`
-
- ### Latency Assertions
-
- For `in_order` and `exact` modes, you can optionally validate per-tool timing with `max_duration_ms`:
-
- ```yaml
- execution:
-   evaluators:
-     - name: perf-check
-       type: tool_trajectory
-       mode: in_order
-       expected:
-         - tool: Read
-           max_duration_ms: 100 # Must complete within 100ms
-         - tool: Edit
-           max_duration_ms: 500 # Allow 500ms for edits
-         - tool: Write # No timing requirement
- ```
-
- **Latency assertion behavior:**
- - **Pass**: `actual_duration <= max_duration_ms` → counts as hit
- - **Fail**: `actual_duration > max_duration_ms` → counts as miss
- - **Skip**: No `duration_ms` in output → warning logged, neutral (neither hit nor miss)
-
- **Provider requirements:**
- Providers must include `duration_ms` in tool calls for latency checks:
-
- ```json
- {
-   "tool_calls": [{
-     "tool": "Read",
-     "duration_ms": 45
-   }]
- }
- ```
-
- **Best practices for latency assertions:**
- - Set generous thresholds to avoid flaky tests from timing variance
- - Only add latency assertions where timing matters (critical paths)
- - Combine with sequence checks for comprehensive validation
-
- #### 3. `exact` - Strict Sequence Match
-
- Validates the exact tool sequence with no gaps or extra tools:
-
- ```yaml
- execution:
-   evaluators:
-     - name: auth-sequence
-       type: tool_trajectory
-       mode: exact
-       expected:
-         - tool: checkCredentials
-         - tool: generateToken
-         - tool: auditLog
- ```
-
- **Use cases:**
- - Security-critical workflows
- - Strict protocol validation
- - Regression testing specific behavior
-
- ## Scoring
-
- ### tool_trajectory Scoring
-
- | Mode | Score Calculation |
- |------|------------------|
- | `any_order` | (tools meeting minimum) / (total tools with minimums) |
- | `in_order` | (sequence hits + latency hits) / (expected tools + latency assertions) |
- | `exact` | (sequence hits + latency hits) / (expected tools + latency assertions) |
-
- **With latency assertions:**
- - Each `max_duration_ms` assertion counts as a separate aspect
- - Latency checks only run when the sequence/position matches
- - Example: 3 expected tools + 2 latency assertions = 5 total aspects
-
- ## Trace Data Requirements
-
- Tool trajectory evaluators require trace data from the agent provider. Providers return `output_messages` containing `tool_calls` that capture agent tool usage.
-
- ### Output Messages Format
-
- Providers return `output_messages` with `tool_calls` in the JSONL output:
-
- ```json
- {
-   "id": "eval-001",
-   "output_messages": [
-     {
-       "role": "assistant",
-       "content": "I'll search for information about this topic.",
-       "tool_calls": [
-         {
-           "tool": "knowledgeSearch",
-           "input": { "query": "REST vs GraphQL" },
-           "output": { "results": [...] },
-           "id": "call_123",
-           "timestamp": "2024-01-15T10:30:00Z",
-           "duration_ms": 45
-         }
-       ]
-     }
-   ]
- }
- ```
-
- The evaluator extracts tool calls from `output_messages[].tool_calls[]`. Optional fields:
- - `id` and `timestamp` - for debugging
- - `duration_ms` - for latency assertions (required if using `max_duration_ms`)
-
- ### Supported Providers
-
- - **codex** - Returns output_messages via JSONL log events
- - **vscode / vscode-insiders** - Returns output_messages from Copilot execution
- - **cli** - Returns `output_messages` with `tool_calls`
-
- ## Complete Examples
-
- ### Research Agent Validation
-
- ```yaml
- description: Validate research agent tool usage
- execution:
-   target: codex_agent # Provider that returns traces
-
- evalcases:
-   - id: comprehensive-research
-     expected_outcome: Agent thoroughly researches the topic
-
-     input_messages:
-       - role: user
-         content: Research machine learning frameworks
-
-     execution:
-       evaluators:
-         # Check minimum tool usage
-         - name: coverage
-           type: tool_trajectory
-           mode: any_order
-           minimums:
-             webSearch: 1
-             documentRead: 2
-             noteTaking: 1
-
-         # Check workflow order
-         - name: workflow
-           type: tool_trajectory
-           mode: in_order
-           expected:
-             - tool: webSearch
-             - tool: documentRead
-             - tool: summarize
- ```
-
- ### Multi-Step Pipeline
-
- ```yaml
- evalcases:
-   - id: data-pipeline
-     expected_outcome: Process data through complete pipeline
-
-     input_messages:
-       - role: user
-         content: Process the customer dataset
-
-     execution:
-       evaluators:
-         - name: pipeline-check
-           type: tool_trajectory
-           mode: exact
-           expected:
-             - tool: loadData
-             - tool: validate
-             - tool: transform
-             - tool: export
- ```
-
- ### Pipeline with Latency Assertions
-
- ```yaml
- evalcases:
-   - id: data-pipeline-perf
-     expected_outcome: Process data within timing budgets
-
-     input_messages:
-       - role: user
-         content: Process the customer dataset quickly
-
-     execution:
-       evaluators:
-         - name: pipeline-perf
-           type: tool_trajectory
-           mode: in_order
-           expected:
-             - tool: loadData
-               max_duration_ms: 1000 # Network fetch within 1s
-             - tool: validate # No timing requirement
-             - tool: transform
-               max_duration_ms: 500 # Transform must be fast
-             - tool: export
-               max_duration_ms: 200 # Export should be quick
- ```
-
- ## CLI Options for Traces
-
- ```bash
- # Write trace files to disk
- agentv eval evals/test.yaml --dump-traces
-
- # Include full trace in result output
- agentv eval evals/test.yaml --include-trace
- ```
-
- ## Best Practices
-
- 1. **Choose the right mode** - Use `any_order` for flexibility, `exact` for strict validation
- 2. **Start with any_order** - Then tighten to `in_order` or `exact` as needed
- 3. **Combine with other evaluators** - Use tool trajectory for execution, LLM judge for output quality
- 4. **Test with --dump-traces** - Inspect actual traces to understand agent behavior
- 5. **Use code evaluators for custom validation** - Write custom tool validation scripts with access to trace data
@@ -1,78 +0,0 @@
- ---
- name: agentv-prompt-optimizer
- description: Iteratively optimize prompt files against AgentV evaluation datasets by analyzing failures and refining instructions.
- ---
-
- # AgentV Prompt Optimizer
-
- ## Input Variables
- - `eval-path`: Path or glob pattern to the AgentV evaluation file(s) to optimize against
- - `optimization-log-path` (optional): Path where optimization progress should be logged
-
- ## Workflow
-
- 1. **Initialize**
-    - Verify `<eval-path>` (file or glob) targets the correct system.
-    - **Identify Prompt Files**:
-      - Infer prompt files from the eval file content (look for `file:` references in `input_messages` that match these patterns).
-      - Recursively check referenced prompt files for *other* prompt references (dependencies).
-      - If multiple prompts are found, consider ALL of them as candidates for optimization.
-    - **Identify Optimization Log**:
-      - If `<optimization-log-path>` is provided, use it.
-      - If not, create a new one in the parent directory of the eval files: `optimization-[timestamp].md`.
-    - Read content of the identified prompt file.
-
- 2. **Optimization Loop** (Max 10 iterations)
-    - **Execute (The Generator)**: Run `agentv eval <eval-path>`.
-      - *Targeted Run*: If iterating on specific stubborn failures, use `--eval-id <case_id>` to run only the relevant eval cases.
-    - **Analyze (The Reflector)**:
-      - Locate the results file path from the console output (e.g., `.agentv/results/eval_...jsonl`).
-      - **Orchestrate Subagent**: Use `runSubagent` to analyze the results.
-        - **Task**: Read the results file, calculate pass rate, and perform root cause analysis.
-        - **Output**: Return a structured analysis including:
-          - **Score**: Current pass rate.
-          - **Root Cause**: Why failures occurred (e.g., "Ambiguous definition", "Hallucination").
-          - **Insight**: Key learning or pattern identified from the failures.
-          - **Strategy**: High-level plan to fix the prompt (e.g., "Clarify section X", "Add negative constraint").
-    - **Decide**:
-      - If **100% pass**: STOP and report success.
-      - If **Score decreased**: Revert last change, try different approach.
-      - If **No improvement** (2x): STOP and report stagnation.
-    - **Refine (The Curator)**:
-      - **Orchestrate Subagent**: Use `runSubagent` to apply the fix.
-        - **Task**: Read the relevant prompt file(s), apply the **Strategy** from the Reflector, and generate the log entry.
-        - **Output**: The **Log Entry** describing the specific operation performed.
-          ```markdown
-          ### Iteration [N]
-          - **Operation**: [ADD / UPDATE / DELETE]
-          - **Target**: [Section Name]
-          - **Change**: [Specific text added/modified]
-          - **Trigger**: [Specific failing test case or error pattern]
-          - **Rationale**: [From Reflector: Root Cause]
-          - **Score**: [From Reflector: Current Pass Rate]
-          - **Insight**: [From Reflector: Key Learning]
-          ```
-      - **Strategy**: Treat the prompt as a structured set of rules. Execute atomic operations:
-        - **ADD**: Insert a new rule if a constraint was missed.
-        - **UPDATE**: Refine an existing rule to be clearer or more general.
-          - *Clarify*: Make ambiguous instructions specific.
-          - *Generalize*: Refactor specific fixes into high-level principles (First Principles).
-        - **DELETE**: Remove obsolete, redundant, or harmful rules.
-          - *Prune*: If a general rule covers specific cases, delete the specific ones.
-        - **Negative Constraint**: If hallucinating, explicitly state what NOT to do. Prefer generalized prohibitions over specific forbidden tokens where possible.
-      - **Safety Check**: Ensure new rules don't contradict existing ones (unless intended).
-      - **Constraint**: Avoid rewriting large sections. Make surgical, additive changes to preserve existing behavior.
-    - **Log Result**:
-      - Append the **Log Entry** returned by the Curator to the optimization log file.
-
- 3. **Completion**
-    - Report final score.
-    - Summarize key changes made to the prompt.
-    - **Finalize Optimization Log**: Add a summary header to the optimization log file indicating the session completion and final score.
-
- ## Guidelines
- - **Generalization First**: Prefer broad, principle-based guidelines over specific examples or "hotfixes". Only use specific rules if generalized instructions fail to achieve the desired score.
- - **Simplicity ("Less is More")**: Avoid overfitting to the test set. If a specific rule doesn't significantly improve the score compared to a general one, choose the general one.
- - **Structure**: Maintain existing Markdown headers/sections.
- - **Progressive Disclosure**: If the prompt grows too large (>200 lines), consider moving specialized logic into a separate file or skill.
- - **Quality Criteria**: Ensure the prompt defines a clear persona, specific task, and measurable success criteria.
@@ -1,5 +0,0 @@
- ---
- description: 'Create and maintain AgentV YAML evaluation files'
- ---
-
- #file:../../.claude/skills/agentv-eval-builder/SKILL.md
@@ -1,4 +0,0 @@
- ---
- description: Iteratively optimize prompt files against an AgentV evaluation suite
- ---
- #file:../../.claude/skills/agentv-prompt-optimizer/SKILL.md