agentv 3.10.2 → 3.11.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dist/{chunk-6UE665XI.js → chunk-ETMDLQ72.js} +1141 -60
- package/dist/chunk-ETMDLQ72.js.map +1 -0
- package/dist/{chunk-KGK5NUFG.js → chunk-EZGWZVVK.js} +377 -163
- package/dist/chunk-EZGWZVVK.js.map +1 -0
- package/dist/{chunk-F7LAJMTO.js → chunk-JEW3FEO7.js} +68 -32
- package/dist/chunk-JEW3FEO7.js.map +1 -0
- package/dist/cli.js +3 -3
- package/dist/{dist-3QUJEJUT.js → dist-QERRYDSC.js} +2 -2
- package/dist/index.js +3 -3
- package/dist/{interactive-EO6AR2R3.js → interactive-AD4PRYDN.js} +3 -3
- package/package.json +3 -1
- package/dist/chunk-6UE665XI.js.map +0 -1
- package/dist/chunk-F7LAJMTO.js.map +0 -1
- package/dist/chunk-KGK5NUFG.js.map +0 -1
- package/dist/templates/.agents/skills/agentv-chat-to-eval/README.md +0 -84
- package/dist/templates/.agents/skills/agentv-chat-to-eval/SKILL.md +0 -144
- package/dist/templates/.agents/skills/agentv-chat-to-eval/examples/transcript-json.md +0 -67
- package/dist/templates/.agents/skills/agentv-chat-to-eval/examples/transcript-markdown.md +0 -101
- package/dist/templates/.agents/skills/agentv-eval-builder/SKILL.md +0 -458
- package/dist/templates/.agents/skills/agentv-eval-builder/references/config-schema.json +0 -36
- package/dist/templates/.agents/skills/agentv-eval-builder/references/custom-evaluators.md +0 -118
- package/dist/templates/.agents/skills/agentv-eval-builder/references/eval-schema.json +0 -12753
- package/dist/templates/.agents/skills/agentv-eval-builder/references/rubric-evaluator.md +0 -77
- package/dist/templates/.agents/skills/agentv-eval-orchestrator/SKILL.md +0 -50
- package/dist/templates/.agents/skills/agentv-prompt-optimizer/SKILL.md +0 -78
- package/dist/templates/.agentv/.env.example +0 -25
- package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +0 -177
- package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md +0 -316
- package/dist/templates/.claude/skills/agentv-eval-builder/references/compare-command.md +0 -137
- package/dist/templates/.claude/skills/agentv-eval-builder/references/composite-evaluator.md +0 -215
- package/dist/templates/.claude/skills/agentv-eval-builder/references/config-schema.json +0 -27
- package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md +0 -115
- package/dist/templates/.claude/skills/agentv-eval-builder/references/eval-schema.json +0 -278
- package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md +0 -333
- package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md +0 -79
- package/dist/templates/.claude/skills/agentv-eval-builder/references/structured-data-evaluators.md +0 -121
- package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md +0 -298
- package/dist/templates/.claude/skills/agentv-prompt-optimizer/SKILL.md +0 -78
- package/dist/templates/.github/prompts/agentv-eval-build.prompt.md +0 -5
- package/dist/templates/.github/prompts/agentv-optimize.prompt.md +0 -4
- /package/dist/{dist-3QUJEJUT.js.map → dist-QERRYDSC.js.map} +0 -0
- /package/dist/{interactive-EO6AR2R3.js.map → interactive-AD4PRYDN.js.map} +0 -0
package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md
DELETED
@@ -1,333 +0,0 @@
# Example Eval Files

This document contains complete examples of well-structured eval files demonstrating various AgentV patterns and best practices.

## Basic Example: Simple Q&A Eval

```yaml
description: Basic arithmetic evaluation
execution:
  target: default

evalcases:
  - id: simple-addition
    expected_outcome: Correctly calculates 2+2

    input_messages:
      - role: user
        content: What is 2 + 2?

    expected_messages:
      - role: assistant
        content: "4"
```

## Code Review with File References

```yaml
description: Code review with guidelines
execution:
  target: azure_base

evalcases:
  - id: code-review-basic
    expected_outcome: Assistant provides helpful code analysis with security considerations

    input_messages:
      - role: system
        content: You are an expert code reviewer.
      - role: user
        content:
          - type: text
            value: |-
              Review this function for security issues:

              ```python
              def get_user(user_id):
                  query = f"SELECT * FROM users WHERE id = {user_id}"
                  return db.execute(query)
              ```
          - type: file
            value: /prompts/security-guidelines.md

    expected_messages:
      - role: assistant
        content: |-
          This code has a critical SQL injection vulnerability. The user_id is directly
          interpolated into the query string without sanitization.

          Recommended fix:
          ```python
          def get_user(user_id):
              query = "SELECT * FROM users WHERE id = ?"
              return db.execute(query, (user_id,))
          ```
```

## Multi-Evaluator Configuration

```yaml
description: JSON generation with validation
execution:
  target: default

evalcases:
  - id: json-generation-with-validation
    expected_outcome: Generates valid JSON with required fields

    execution:
      evaluators:
        - name: json_format_validator
          type: code_judge
          script: uv run validate_json.py
          cwd: ./evaluators
        - name: content_evaluator
          type: llm_judge
          prompt: ./judges/semantic_correctness.md

    input_messages:
      - role: user
        content: |-
          Generate a JSON object for a user with name "Alice",
          email "alice@example.com", and role "admin".

    expected_messages:
      - role: assistant
        content: |-
          {
            "name": "Alice",
            "email": "alice@example.com",
            "role": "admin"
          }
```
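The `validate_json.py` script referenced by the `code_judge` evaluator above is not shown in this reference. As a hypothetical sketch only (the actual I/O contract of `code_judge` scripts is not documented here; this assumes the script checks the candidate text for a JSON object with the required fields), its core check might look like:

```python
import json

# Fields the generated user object must contain (from the eval case above).
REQUIRED_FIELDS = {"name", "email", "role"}

def validate_user_json(text: str) -> bool:
    """Return True if text parses as a JSON object containing all required fields."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_FIELDS.issubset(data)
```

How the script receives the candidate output and reports its verdict depends on the `code_judge` contract, which is covered elsewhere.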
## Tool Trajectory Evaluation

Validate that an agent uses specific tools during execution.

```yaml
description: Tool usage validation
execution:
  target: mock_agent

evalcases:
  # Validate minimum tool usage (order doesn't matter)
  - id: research-depth
    expected_outcome: Agent researches thoroughly
    input_messages:
      - role: user
        content: Research REST vs GraphQL
    execution:
      evaluators:
        - name: research-check
          type: tool_trajectory
          mode: any_order
          minimums:
            knowledgeSearch: 2
            documentRetrieve: 1

  # Validate exact tool sequence
  - id: auth-flow
    expected_outcome: Agent follows auth sequence
    input_messages:
      - role: user
        content: Authenticate user
    execution:
      evaluators:
        - name: auth-sequence
          type: tool_trajectory
          mode: exact
          expected:
            - tool: checkCredentials
            - tool: generateToken
```
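The `tool_trajectory` modes reduce to simple checks over the sequence of tool names. This is not the evaluator's actual implementation, only a minimal sketch; the `in_order` semantics (expected tools appearing as a subsequence, other calls allowed in between) is an assumption inferred from the mode names:

```python
from collections import Counter

def check_any_order(calls, minimums):
    """any_order: each tool must appear at least its minimum count; order ignored."""
    counts = Counter(calls)
    return all(counts[tool] >= n for tool, n in minimums.items())

def check_exact(calls, expected):
    """exact: the call sequence must match the expected sequence exactly."""
    return calls == expected

def check_in_order(calls, expected):
    """in_order (assumed semantics): expected tools appear in order as a
    subsequence, with other calls allowed in between."""
    it = iter(calls)
    return all(tool in it for tool in expected)
```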
## Static Trace Evaluation

Evaluate pre-existing trace files without running an agent.

```yaml
description: Static trace evaluation
execution:
  target: static_trace

evalcases:
  - id: validate-trace-file
    expected_outcome: Trace contains required steps
    input_messages:
      - role: user
        content: Analyze trace
    execution:
      evaluators:
        - name: trace-check
          type: tool_trajectory
          mode: in_order
          expected:
            - tool: webSearch
            - tool: readFile
```

## Multi-Turn Conversation (Single Eval Case)

```yaml
description: Multi-turn debugging session with clarifying questions
execution:
  target: default

evalcases:
  - id: debug-with-clarification
    expected_outcome: |-
      Assistant conducts a multi-turn debugging session, asking clarification
      questions when needed, correctly diagnosing the bug, and proposing a clear
      fix with rationale.

    input_messages:
      - role: system
        content: You are an expert debugging assistant who reasons step by step, asks clarifying questions, and explains fixes clearly.
      - role: user
        content: |-
          I'm getting an off-by-one error in this function, but I can't see why:

          ```python
          def get_items(items):
              result = []
              for i in range(len(items) - 1):
                  result.append(items[i])
              return result
          ```

          Sometimes the last element is missing. Can you help debug this?
      - role: assistant
        content: |-
          I can help debug this. Before I propose a fix, could you tell me:
          - What output you expect for an example input list
          - What output you actually get
      - role: user
        content: |-
          For `[1, 2, 3, 4]` I expect `[1, 2, 3, 4]`, but I get `[1, 2, 3]`.

    expected_messages:
      - role: assistant
        content: |-
          You have an off-by-one error in your loop bounds.
          You're iterating with `range(len(items) - 1)`, which stops before the last index.
          To include all items, you can either:
          - Use `range(len(items))`, or
          - Iterate directly over the list: `for item in items:`

          Here's a corrected version:

          ```python
          def get_items(items):
              result = []
              for item in items:
                  result.append(item)
              return result
          ```
```

## Batch CLI Evaluation

Evaluate external batch runners that process all evalcases in one invocation.

```yaml
description: Batch CLI demo (AML screening)
execution:
  target: batch_cli

evalcases:
  - id: aml-001
    expected_outcome: |-
      Batch runner returns JSON with decision=CLEAR.

    expected_messages:
      - role: assistant
        content:
          decision: CLEAR

    input_messages:
      - role: system
        content: You are a deterministic AML screening batch checker.
      - role: user
        content:
          request:
            type: aml_screening_check
            jurisdiction: AU
            effective_date: 2025-01-01
          row:
            id: aml-001
            customer_name: Example Customer A
            origin_country: NZ
            destination_country: AU
            transaction_type: INTERNATIONAL_TRANSFER
            amount: 5000
            currency: USD

    execution:
      evaluators:
        - name: decision-check
          type: code_judge
          script: bun run ./scripts/check-batch-cli-output.ts
          cwd: .

  - id: aml-002
    expected_outcome: |-
      Batch runner returns JSON with decision=REVIEW.

    expected_messages:
      - role: assistant
        content:
          decision: REVIEW

    input_messages:
      - role: system
        content: You are a deterministic AML screening batch checker.
      - role: user
        content:
          request:
            type: aml_screening_check
            jurisdiction: AU
            effective_date: 2025-01-01
          row:
            id: aml-002
            customer_name: Example Customer B
            origin_country: IR
            destination_country: AU
            transaction_type: INTERNATIONAL_TRANSFER
            amount: 2000
            currency: USD

    execution:
      evaluators:
        - name: decision-check
          type: code_judge
          script: bun run ./scripts/check-batch-cli-output.ts
          cwd: .
```

### Batch CLI Pattern Notes
- **execution.target: batch_cli** - Configure CLI provider with `provider_batching: true`
- **Batch runner** - Reads eval YAML via `--eval` flag, outputs JSONL keyed by `id`
- **Structured input** - Put data in `user.content` as objects for runner to extract
- **Structured expected** - Use `expected_messages.content` with object fields
- **Per-case evaluators** - Each evalcase has its own evaluator to validate output
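A minimal sketch of the runner side of this contract, assuming the evalcases have already been parsed out of the YAML file passed via `--eval`; the row-extraction path and the screening rule (flagging an `IR` origin) are illustrative assumptions, not a real AML rule or the canonical runner:

```python
import json

HIGH_RISK_ORIGINS = {"IR"}  # illustrative placeholder, not a real screening list

def run_batch(evalcases):
    """Process every evalcase in one invocation, emitting one JSONL line keyed by id."""
    lines = []
    for case in evalcases:
        # Pull the structured row out of the user message content (see the
        # "Structured input" note above).
        user = next(m for m in case["input_messages"] if m["role"] == "user")
        row = user["content"]["row"]
        decision = "REVIEW" if row["origin_country"] in HIGH_RISK_ORIGINS else "CLEAR"
        lines.append(json.dumps({"id": case["id"], "decision": decision}))
    return "\n".join(lines)
```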
## Notes on Examples

### File Path Conventions
- **Absolute paths** (start with `/`): Resolved from repository root
  - Example: `/prompts/guidelines.md` → `<repo_root>/prompts/guidelines.md`
- **Relative paths** (start with `./` or `../`): Resolved from eval file directory
  - Example: `../../prompts/file.md` → Two directories up, then into prompts/

### expected_outcome Writing Tips
- Be specific about what success looks like
- Mention key elements that must be present
- For classification tasks, specify the expected category
- For reasoning tasks, describe the expected thought process

### Expected Messages
- Show the pattern, not rigid templates
- Allow for natural language variation
- Focus on semantic correctness over exact matching
- Evaluators will handle the actual validation
package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md
DELETED
@@ -1,79 +0,0 @@
# Rubric Evaluator

## Field Reference

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `id` | string | auto-generated | Unique identifier |
| `expected_outcome` | string | required* | Criterion being evaluated (*optional if `score_ranges` used) |
| `weight` | number | 1.0 | Relative importance |
| `required` | boolean | true | Failing forces verdict to 'fail' (checklist mode) |
| `required_min_score` | integer | - | Minimum 0-10 score to pass (score-range mode) |
| `score_ranges` | map or array | - | Score range definitions for analytic scoring |

`description` is a backward-compatible alias for `expected_outcome`.

## Checklist Mode

```yaml
rubrics:
  - Mentions divide-and-conquer approach
  - id: complexity
    expected_outcome: States time complexity correctly
    weight: 2.0
    required: true
  - id: examples
    expected_outcome: Includes code examples
    weight: 1.0
    required: false
```

## Score-Range Mode

Shorthand map format (recommended):

```yaml
rubrics:
  - id: correctness
    weight: 2.0
    required_min_score: 7
    score_ranges:
      0: Critical bugs
      3: Minor bugs
      6: Correct with minor issues
      9: Fully correct
```

Map keys are lower bounds (0-10). Each range extends from its key to (next key - 1), with the last extending to 10. Must start at 0.
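Not part of the evaluator's public API, but the expansion rule above can be sketched as:

```python
def expand_score_ranges(ranges):
    """Expand lower-bound map keys into inclusive (low, high, outcome) ranges
    on the 0-10 scale: each range runs from its key to (next key - 1), and the
    last range extends to 10."""
    keys = sorted(ranges)
    assert keys[0] == 0, "score_ranges must start at 0"
    out = []
    for i, low in enumerate(keys):
        high = keys[i + 1] - 1 if i + 1 < len(keys) else 10
        out.append((low, high, ranges[low]))
    return out
```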
Array format is also accepted:

```yaml
score_ranges:
  - score_range: [0, 2]
    expected_outcome: Critical bugs
  - score_range: [3, 5]
    expected_outcome: Minor bugs
  - score_range: [6, 8]
    expected_outcome: Correct with minor issues
  - score_range: [9, 10]
    expected_outcome: Fully correct
```

Ranges must be integers 0-10, non-overlapping, covering all values 0-10.

## Scoring

**Checklist:** `score = sum(satisfied weights) / sum(all weights)`

**Score-range:** `score = weighted_average(raw_score / 10)` per criterion

## Verdicts

| Verdict | Condition |
|---------|-----------|
| `pass` | score >= 0.8 AND all gating criteria satisfied |
| `borderline` | score >= 0.6 AND all gating criteria satisfied |
| `fail` | score < 0.6 OR any gating criterion failed |

Gating: checklist uses `required: true`, score-range uses `required_min_score: N`.
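The checklist formula and verdict table above can be paraphrased in code (a sketch of the documented rules, not the evaluator's own implementation):

```python
def checklist_score(criteria):
    """score = sum(satisfied weights) / sum(all weights); weight defaults to 1.0."""
    total = sum(c.get("weight", 1.0) for c in criteria)
    satisfied = sum(c.get("weight", 1.0) for c in criteria if c["satisfied"])
    return satisfied / total

def verdict(score, gating_ok):
    """Apply the verdict table: a gating failure forces 'fail' regardless of score."""
    if not gating_ok or score < 0.6:
        return "fail"
    return "pass" if score >= 0.8 else "borderline"
```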
package/dist/templates/.claude/skills/agentv-eval-builder/references/structured-data-evaluators.md
DELETED
@@ -1,121 +0,0 @@
# Structured Data + Metrics Evaluators

This reference covers the built-in evaluators used for grading structured outputs and gating on execution metrics:

- `field_accuracy`
- `latency`
- `cost`
- `token_usage`

## Ground Truth (`expected_messages`)

Put the expected structured output in the evalcase `expected_messages` (typically as the last `assistant` message with `content` as an object). Evaluators read expected values from there.

```yaml
evalcases:
  - id: invoice-001
    expected_messages:
      - role: assistant
        content:
          invoice_number: "INV-2025-001234"
          net_total: 1889
```

## `field_accuracy`

Use `field_accuracy` to compare fields in the candidate JSON against the ground-truth object in `expected_messages`.

```yaml
execution:
  evaluators:
    - name: invoice_fields
      type: field_accuracy
      aggregation: weighted_average
      fields:
        - path: invoice_number
          match: exact
          required: true
          weight: 2.0
        - path: invoice_date
          match: date
          formats: ["DD-MMM-YYYY", "YYYY-MM-DD"]
        - path: net_total
          match: numeric_tolerance
          tolerance: 1.0
```

### Match types

- `exact`: strict equality
- `date`: compares dates after parsing; optionally provide `formats`
- `numeric_tolerance`: numeric compare within `tolerance` (set `relative: true` for relative tolerance)

For fuzzy string matching, use a `code_judge` evaluator (e.g. Levenshtein) instead of adding a fuzzy mode to `field_accuracy`.
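As a sketch of what such a fuzzy-matching `code_judge` might compute (the judge's I/O wiring is omitted; only the distance itself is shown):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```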
### Aggregation

- `weighted_average` (default): weighted mean of field scores
- `all_or_nothing`: score 1.0 only if all graded fields pass

## `latency` and `cost`

These evaluators gate on execution metrics reported by the provider (via `traceSummary`).

```yaml
execution:
  evaluators:
    - name: performance
      type: latency
      threshold: 2000
    - name: budget
      type: cost
      budget: 0.10
```

## `token_usage`

Gate on provider-reported token usage (useful when cost is unavailable or model pricing differs).

```yaml
execution:
  evaluators:
    - name: token-budget
      type: token_usage
      max_total: 10000
      # or:
      # max_input: 8000
      # max_output: 2000
```

## Common pattern: combine correctness + gates

Use a `composite` evaluator if you want a single “release gate” score/verdict from multiple checks:

```yaml
execution:
  evaluators:
    - name: release_gate
      type: composite
      evaluators:
        - name: correctness
          type: field_accuracy
          fields:
            - path: invoice_number
              match: exact
        - name: latency
          type: latency
          threshold: 2000
        - name: cost
          type: cost
          budget: 0.10
        - name: tokens
          type: token_usage
          max_total: 10000
      aggregator:
        type: weighted_average
        weights:
          correctness: 0.8
          latency: 0.1
          cost: 0.05
          tokens: 0.05
```
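A sketch of what the `weighted_average` aggregator computes over the sub-evaluator results (assuming each sub-score is already normalized to the 0-1 range):

```python
def weighted_average(scores, weights):
    """Combine per-evaluator scores into a single release-gate score
    using the configured weights."""
    total = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total
```

With the weights above, a case that passes correctness, latency, and tokens but blows the cost budget still loses only the cost weight from its composite score.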