agentv 3.10.2 → 3.10.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dist/{chunk-6UE665XI.js → chunk-7LC3VNOC.js} +4 -4
- package/dist/{chunk-KGK5NUFG.js → chunk-JUQCB3ZW.js} +56 -15
- package/dist/chunk-JUQCB3ZW.js.map +1 -0
- package/dist/{chunk-F7LAJMTO.js → chunk-U556GRI3.js} +4 -4
- package/dist/{chunk-F7LAJMTO.js.map → chunk-U556GRI3.js.map} +1 -1
- package/dist/cli.js +3 -3
- package/dist/{dist-3QUJEJUT.js → dist-2X7A3TTC.js} +2 -2
- package/dist/index.js +3 -3
- package/dist/{interactive-EO6AR2R3.js → interactive-CSA4KIND.js} +3 -3
- package/dist/templates/.agentv/.env.example +9 -11
- package/dist/templates/.agentv/config.yaml +13 -4
- package/dist/templates/.agentv/targets.yaml +16 -0
- package/package.json +1 -1
- package/dist/chunk-KGK5NUFG.js.map +0 -1
- package/dist/templates/.agents/skills/agentv-chat-to-eval/README.md +0 -84
- package/dist/templates/.agents/skills/agentv-chat-to-eval/SKILL.md +0 -144
- package/dist/templates/.agents/skills/agentv-chat-to-eval/examples/transcript-json.md +0 -67
- package/dist/templates/.agents/skills/agentv-chat-to-eval/examples/transcript-markdown.md +0 -101
- package/dist/templates/.agents/skills/agentv-eval-builder/SKILL.md +0 -458
- package/dist/templates/.agents/skills/agentv-eval-builder/references/config-schema.json +0 -36
- package/dist/templates/.agents/skills/agentv-eval-builder/references/custom-evaluators.md +0 -118
- package/dist/templates/.agents/skills/agentv-eval-builder/references/eval-schema.json +0 -12753
- package/dist/templates/.agents/skills/agentv-eval-builder/references/rubric-evaluator.md +0 -77
- package/dist/templates/.agents/skills/agentv-eval-orchestrator/SKILL.md +0 -50
- package/dist/templates/.agents/skills/agentv-prompt-optimizer/SKILL.md +0 -78
- package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +0 -177
- package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md +0 -316
- package/dist/templates/.claude/skills/agentv-eval-builder/references/compare-command.md +0 -137
- package/dist/templates/.claude/skills/agentv-eval-builder/references/composite-evaluator.md +0 -215
- package/dist/templates/.claude/skills/agentv-eval-builder/references/config-schema.json +0 -27
- package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md +0 -115
- package/dist/templates/.claude/skills/agentv-eval-builder/references/eval-schema.json +0 -278
- package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md +0 -333
- package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md +0 -79
- package/dist/templates/.claude/skills/agentv-eval-builder/references/structured-data-evaluators.md +0 -121
- package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md +0 -298
- package/dist/templates/.claude/skills/agentv-prompt-optimizer/SKILL.md +0 -78
- package/dist/templates/.github/prompts/agentv-eval-build.prompt.md +0 -5
- package/dist/templates/.github/prompts/agentv-optimize.prompt.md +0 -4
- /package/dist/{chunk-6UE665XI.js.map → chunk-7LC3VNOC.js.map} +0 -0
- /package/dist/{dist-3QUJEJUT.js.map → dist-2X7A3TTC.js.map} +0 -0
- /package/dist/{interactive-EO6AR2R3.js.map → interactive-CSA4KIND.js.map} +0 -0
package/dist/templates/.claude/skills/agentv-eval-builder/references/structured-data-evaluators.md
DELETED
@@ -1,121 +0,0 @@

# Structured Data + Metrics Evaluators

This reference covers the built-in evaluators used for grading structured outputs and gating on execution metrics:

- `field_accuracy`
- `latency`
- `cost`
- `token_usage`

## Ground Truth (`expected_messages`)

Put the expected structured output in the evalcase `expected_messages` (typically as the last `assistant` message with `content` as an object). Evaluators read expected values from there.

```yaml
evalcases:
  - id: invoice-001
    expected_messages:
      - role: assistant
        content:
          invoice_number: "INV-2025-001234"
          net_total: 1889
```

## `field_accuracy`

Use `field_accuracy` to compare fields in the candidate JSON against the ground-truth object in `expected_messages`.

```yaml
execution:
  evaluators:
    - name: invoice_fields
      type: field_accuracy
      aggregation: weighted_average
      fields:
        - path: invoice_number
          match: exact
          required: true
          weight: 2.0
        - path: invoice_date
          match: date
          formats: ["DD-MMM-YYYY", "YYYY-MM-DD"]
        - path: net_total
          match: numeric_tolerance
          tolerance: 1.0
```

### Match types

- `exact`: strict equality
- `date`: compares dates after parsing; optionally provide `formats`
- `numeric_tolerance`: numeric compare within `tolerance` (set `relative: true` for relative tolerance)

For fuzzy string matching, use a `code_judge` evaluator (e.g. Levenshtein) instead of adding a fuzzy mode to `field_accuracy`.

### Aggregation

- `weighted_average` (default): weighted mean of field scores
- `all_or_nothing`: score 1.0 only if all graded fields pass

## `latency` and `cost`

These evaluators gate on execution metrics reported by the provider (via `traceSummary`).

```yaml
execution:
  evaluators:
    - name: performance
      type: latency
      threshold: 2000
    - name: budget
      type: cost
      budget: 0.10
```

## `token_usage`

Gate on provider-reported token usage (useful when cost is unavailable or model pricing differs).

```yaml
execution:
  evaluators:
    - name: token-budget
      type: token_usage
      max_total: 10000
      # or:
      # max_input: 8000
      # max_output: 2000
```

## Common pattern: combine correctness + gates

Use a `composite` evaluator if you want a single “release gate” score/verdict from multiple checks:

```yaml
execution:
  evaluators:
    - name: release_gate
      type: composite
      evaluators:
        - name: correctness
          type: field_accuracy
          fields:
            - path: invoice_number
              match: exact
        - name: latency
          type: latency
          threshold: 2000
        - name: cost
          type: cost
          budget: 0.10
        - name: tokens
          type: token_usage
          max_total: 10000
      aggregator:
        type: weighted_average
        weights:
          correctness: 0.8
          latency: 0.1
          cost: 0.05
          tokens: 0.05
```
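The weighted-average aggregation described in the deleted reference can be sketched in a few lines. This is a hypothetical illustration only, not agentv's actual implementation; the function names `match_field` and `weighted_average` are made up for the sketch, and it covers just the `exact` and `numeric_tolerance` match types.

```python
# Hypothetical sketch of field_accuracy-style scoring (not agentv's code):
# each field scores 0.0 or 1.0, then scores combine as a weighted average.

def match_field(expected, actual, match="exact", tolerance=0.0):
    """Score a single field: 1.0 on match, 0.0 otherwise."""
    if match == "exact":
        return 1.0 if expected == actual else 0.0
    if match == "numeric_tolerance":
        return 1.0 if abs(float(expected) - float(actual)) <= tolerance else 0.0
    raise ValueError(f"unknown match type: {match}")

def weighted_average(field_scores):
    """field_scores: list of (score, weight) pairs."""
    total_weight = sum(w for _, w in field_scores)
    return sum(s * w for s, w in field_scores) / total_weight

expected = {"invoice_number": "INV-2025-001234", "net_total": 1889}
actual = {"invoice_number": "INV-2025-001234", "net_total": 1889.4}

scores = [
    # invoice_number carries weight 2.0, as in the YAML example above
    (match_field(expected["invoice_number"], actual["invoice_number"]), 2.0),
    # net_total passes because 0.4 is within the 1.0 tolerance
    (match_field(expected["net_total"], actual["net_total"],
                 match="numeric_tolerance", tolerance=1.0), 1.0),
]
print(weighted_average(scores))  # 1.0: both fields pass
```

With `all_or_nothing` aggregation, the same inputs would instead score 1.0 only if every graded field passes.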
package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md
DELETED
@@ -1,298 +0,0 @@

# Tool Trajectory Evaluator Guide

Tool trajectory evaluators validate that an agent used the expected tools during execution. They work with trace data returned by agent providers (codex, vscode, cli with trace support).

## Tool Trajectory Evaluator

### Modes

#### 1. `any_order` - Minimum Tool Counts

Validates that each tool was called at least N times, regardless of order:

```yaml
execution:
  evaluators:
    - name: tool-usage
      type: tool_trajectory
      mode: any_order
      minimums:
        knowledgeSearch: 2 # Must be called at least twice
        documentRetrieve: 1 # Must be called at least once
```

**Use cases:**
- Ensure required tools are used
- Don't care about execution order
- Allow flexibility in agent implementation

#### 2. `in_order` - Sequential Matching

Validates tools appear in the expected sequence, but allows gaps (other tools can appear between):

```yaml
execution:
  evaluators:
    - name: workflow-sequence
      type: tool_trajectory
      mode: in_order
      expected:
        - tool: fetchData
        - tool: validateSchema
        - tool: transformData
        - tool: saveResults
```

**Use cases:**
- Validate logical workflow order
- Allow agent to use additional helper tools
- Check that key steps happen in sequence

### Argument Matching

For `in_order` and `exact` modes, you can optionally validate tool arguments:

```yaml
execution:
  evaluators:
    - name: search-validation
      type: tool_trajectory
      mode: in_order
      expected:
        # Partial match - only specified keys are checked
        - tool: search
          args: { query: "machine learning" }

        # Skip argument validation for this tool
        - tool: process
          args: any

        # No args field = no argument validation (same as args: any)
        - tool: saveResults
```

**Argument matching modes:**
- `args: { key: value }` - Partial deep equality (only specified keys are checked)
- `args: any` - Skip argument validation
- No `args` field - Same as `args: any`

### Latency Assertions

For `in_order` and `exact` modes, you can optionally validate per-tool timing with `max_duration_ms`:

```yaml
execution:
  evaluators:
    - name: perf-check
      type: tool_trajectory
      mode: in_order
      expected:
        - tool: Read
          max_duration_ms: 100 # Must complete within 100ms
        - tool: Edit
          max_duration_ms: 500 # Allow 500ms for edits
        - tool: Write # No timing requirement
```

**Latency assertion behavior:**
- **Pass**: `actual_duration <= max_duration_ms` → counts as hit
- **Fail**: `actual_duration > max_duration_ms` → counts as miss
- **Skip**: No `duration_ms` in output → warning logged, neutral (neither hit nor miss)

**Provider requirements:**
Providers must include `duration_ms` in tool calls for latency checks:

```json
{
  "tool_calls": [{
    "tool": "Read",
    "duration_ms": 45
  }]
}
```

**Best practices for latency assertions:**
- Set generous thresholds to avoid flaky tests from timing variance
- Only add latency assertions where timing matters (critical paths)
- Combine with sequence checks for comprehensive validation

#### 3. `exact` - Strict Sequence Match

Validates the exact tool sequence with no gaps or extra tools:

```yaml
execution:
  evaluators:
    - name: auth-sequence
      type: tool_trajectory
      mode: exact
      expected:
        - tool: checkCredentials
        - tool: generateToken
        - tool: auditLog
```

**Use cases:**
- Security-critical workflows
- Strict protocol validation
- Regression testing specific behavior

## Scoring

### tool_trajectory Scoring

| Mode | Score Calculation |
|------|-------------------|
| `any_order` | (tools meeting minimum) / (total tools with minimums) |
| `in_order` | (sequence hits + latency hits) / (expected tools + latency assertions) |
| `exact` | (sequence hits + latency hits) / (expected tools + latency assertions) |

**With latency assertions:**
- Each `max_duration_ms` assertion counts as a separate aspect
- Latency checks only run when the sequence/position matches
- Example: 3 expected tools + 2 latency assertions = 5 total aspects

## Trace Data Requirements

Tool trajectory evaluators require trace data from the agent provider. Providers return `output_messages` containing `tool_calls` that capture agent tool usage.

### Output Messages Format

Providers return `output_messages` with `tool_calls` in the JSONL output:

```json
{
  "id": "eval-001",
  "output_messages": [
    {
      "role": "assistant",
      "content": "I'll search for information about this topic.",
      "tool_calls": [
        {
          "tool": "knowledgeSearch",
          "input": { "query": "REST vs GraphQL" },
          "output": { "results": [...] },
          "id": "call_123",
          "timestamp": "2024-01-15T10:30:00Z",
          "duration_ms": 45
        }
      ]
    }
  ]
}
```

The evaluator extracts tool calls from `output_messages[].tool_calls[]`. Optional fields:
- `id` and `timestamp` - for debugging
- `duration_ms` - for latency assertions (required if using `max_duration_ms`)

### Supported Providers

- **codex** - Returns output_messages via JSONL log events
- **vscode / vscode-insiders** - Returns output_messages from Copilot execution
- **cli** - Returns `output_messages` with `tool_calls`

## Complete Examples

### Research Agent Validation

```yaml
description: Validate research agent tool usage
execution:
  target: codex_agent # Provider that returns traces

evalcases:
  - id: comprehensive-research
    expected_outcome: Agent thoroughly researches the topic

    input_messages:
      - role: user
        content: Research machine learning frameworks

    execution:
      evaluators:
        # Check minimum tool usage
        - name: coverage
          type: tool_trajectory
          mode: any_order
          minimums:
            webSearch: 1
            documentRead: 2
            noteTaking: 1

        # Check workflow order
        - name: workflow
          type: tool_trajectory
          mode: in_order
          expected:
            - tool: webSearch
            - tool: documentRead
            - tool: summarize
```

### Multi-Step Pipeline

```yaml
evalcases:
  - id: data-pipeline
    expected_outcome: Process data through complete pipeline

    input_messages:
      - role: user
        content: Process the customer dataset

    execution:
      evaluators:
        - name: pipeline-check
          type: tool_trajectory
          mode: exact
          expected:
            - tool: loadData
            - tool: validate
            - tool: transform
            - tool: export
```

### Pipeline with Latency Assertions

```yaml
evalcases:
  - id: data-pipeline-perf
    expected_outcome: Process data within timing budgets

    input_messages:
      - role: user
        content: Process the customer dataset quickly

    execution:
      evaluators:
        - name: pipeline-perf
          type: tool_trajectory
          mode: in_order
          expected:
            - tool: loadData
              max_duration_ms: 1000 # Network fetch within 1s
            - tool: validate # No timing requirement
            - tool: transform
              max_duration_ms: 500 # Transform must be fast
            - tool: export
              max_duration_ms: 200 # Export should be quick
```

## CLI Options for Traces

```bash
# Write trace files to disk
agentv eval evals/test.yaml --dump-traces

# Include full trace in result output
agentv eval evals/test.yaml --include-trace
```

## Best Practices

1. **Choose the right mode** - Use `any_order` for flexibility, `exact` for strict validation
2. **Start with any_order** - Then tighten to `in_order` or `exact` as needed
3. **Combine with other evaluators** - Use tool trajectory for execution, LLM judge for output quality
4. **Test with --dump-traces** - Inspect actual traces to understand agent behavior
5. **Use code evaluators for custom validation** - Write custom tool validation scripts with access to trace data
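The `in_order` mode described in the deleted guide is essentially a subsequence check. As a hypothetical sketch (not agentv's actual code; `in_order_score` is an illustrative name), the scoring can be expressed as a single forward pass over the actual tool calls:

```python
# Hypothetical sketch of in_order trajectory scoring (not agentv's code):
# walk the actual tool-call list once, advancing through the expected
# sequence; gaps (extra tools between expected ones) are allowed.

def in_order_score(expected, actual):
    """Fraction of expected tools found, in order, within actual."""
    hits = 0
    calls = iter(actual)  # single pass: position never moves backward
    for tool in expected:
        for called in calls:
            if called == tool:
                hits += 1
                break
    return hits / len(expected)

# fetchData ... saveResults appear in order, with extra "log" calls between
actual_calls = ["fetchData", "log", "validateSchema",
                "transformData", "log", "saveResults"]

print(in_order_score(
    ["fetchData", "validateSchema", "transformData", "saveResults"],
    actual_calls))  # 1.0: full sequence found despite gaps

print(in_order_score(
    ["saveResults", "fetchData"],
    actual_calls))  # 0.5: fetchData occurs before saveResults, not after
```

An `exact`-mode check would instead require `actual == [e for e in expected]` with no extra calls at all, which is why the guide reserves it for strict protocol validation.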
@@ -1,78 +0,0 @@

---
name: agentv-prompt-optimizer
description: Iteratively optimize prompt files against AgentV evaluation datasets by analyzing failures and refining instructions.
---

# AgentV Prompt Optimizer

## Input Variables
- `eval-path`: Path or glob pattern to the AgentV evaluation file(s) to optimize against
- `optimization-log-path` (optional): Path where optimization progress should be logged

## Workflow

1. **Initialize**
   - Verify `<eval-path>` (file or glob) targets the correct system.
   - **Identify Prompt Files**:
     - Infer prompt files from the eval file content (look for `file:` references in `input_messages` that match these patterns).
     - Recursively check referenced prompt files for *other* prompt references (dependencies).
     - If multiple prompts are found, consider ALL of them as candidates for optimization.
   - **Identify Optimization Log**:
     - If `<optimization-log-path>` is provided, use it.
     - If not, create a new one in the parent directory of the eval files: `optimization-[timestamp].md`.
   - Read content of the identified prompt file.

2. **Optimization Loop** (Max 10 iterations)
   - **Execute (The Generator)**: Run `agentv eval <eval-path>`.
     - *Targeted Run*: If iterating on specific stubborn failures, use `--eval-id <case_id>` to run only the relevant eval cases.
   - **Analyze (The Reflector)**:
     - Locate the results file path from the console output (e.g., `.agentv/results/eval_...jsonl`).
     - **Orchestrate Subagent**: Use `runSubagent` to analyze the results.
       - **Task**: Read the results file, calculate pass rate, and perform root cause analysis.
       - **Output**: Return a structured analysis including:
         - **Score**: Current pass rate.
         - **Root Cause**: Why failures occurred (e.g., "Ambiguous definition", "Hallucination").
         - **Insight**: Key learning or pattern identified from the failures.
         - **Strategy**: High-level plan to fix the prompt (e.g., "Clarify section X", "Add negative constraint").
   - **Decide**:
     - If **100% pass**: STOP and report success.
     - If **Score decreased**: Revert last change, try different approach.
     - If **No improvement** (2x): STOP and report stagnation.
   - **Refine (The Curator)**:
     - **Orchestrate Subagent**: Use `runSubagent` to apply the fix.
       - **Task**: Read the relevant prompt file(s), apply the **Strategy** from the Reflector, and generate the log entry.
       - **Output**: The **Log Entry** describing the specific operation performed.
         ```markdown
         ### Iteration [N]
         - **Operation**: [ADD / UPDATE / DELETE]
         - **Target**: [Section Name]
         - **Change**: [Specific text added/modified]
         - **Trigger**: [Specific failing test case or error pattern]
         - **Rationale**: [From Reflector: Root Cause]
         - **Score**: [From Reflector: Current Pass Rate]
         - **Insight**: [From Reflector: Key Learning]
         ```
     - **Strategy**: Treat the prompt as a structured set of rules. Execute atomic operations:
       - **ADD**: Insert a new rule if a constraint was missed.
       - **UPDATE**: Refine an existing rule to be clearer or more general.
         - *Clarify*: Make ambiguous instructions specific.
         - *Generalize*: Refactor specific fixes into high-level principles (First Principles).
       - **DELETE**: Remove obsolete, redundant, or harmful rules.
         - *Prune*: If a general rule covers specific cases, delete the specific ones.
       - **Negative Constraint**: If hallucinating, explicitly state what NOT to do. Prefer generalized prohibitions over specific forbidden tokens where possible.
       - **Safety Check**: Ensure new rules don't contradict existing ones (unless intended).
       - **Constraint**: Avoid rewriting large sections. Make surgical, additive changes to preserve existing behavior.
   - **Log Result**:
     - Append the **Log Entry** returned by the Curator to the optimization log file.

3. **Completion**
   - Report final score.
   - Summarize key changes made to the prompt.
   - **Finalize Optimization Log**: Add a summary header to the optimization log file indicating the session completion and final score.

## Guidelines
- **Generalization First**: Prefer broad, principle-based guidelines over specific examples or "hotfixes". Only use specific rules if generalized instructions fail to achieve the desired score.
- **Simplicity ("Less is More")**: Avoid overfitting to the test set. If a specific rule doesn't significantly improve the score compared to a general one, choose the general one.
- **Structure**: Maintain existing Markdown headers/sections.
- **Progressive Disclosure**: If the prompt grows too large (>200 lines), consider moving specialized logic into a separate file or skill.
- **Quality Criteria**: Ensure the prompt defines a clear persona, specific task, and measurable success criteria.
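The **Decide** step of the optimization loop amounts to a small state machine over the pass rate. The sketch below is a hypothetical illustration of that logic only; `decide` and its return values are made-up names, not part of agentv:

```python
# Hypothetical sketch of the Decide step (not agentv's code): stop on a
# perfect score, revert on regression, stop after two rounds without
# improvement, otherwise keep refining.

def decide(score, best_score, stagnant_rounds):
    """Return (action, updated_stagnant_rounds) for one iteration."""
    if score >= 1.0:
        return "stop_success", stagnant_rounds
    if score < best_score:
        # Regression: revert the last prompt change, try another approach
        return "revert", stagnant_rounds + 1
    if score == best_score:
        stagnant_rounds += 1
        if stagnant_rounds >= 2:
            return "stop_stagnation", stagnant_rounds
        return "refine", stagnant_rounds
    return "refine", 0  # improvement resets the stagnation counter

print(decide(1.0, 0.8, 0))  # ('stop_success', 0)
print(decide(0.7, 0.8, 0))  # ('revert', 1)
print(decide(0.8, 0.8, 1))  # ('stop_stagnation', 2)
```

In the workflow above this check runs once per iteration, capped at 10 iterations overall.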