agentv 4.26.1 → 4.27.0-next.1
- package/dist/{chunk-JA4WQNE6.js → chunk-47JX7NNZ.js} +10 -2
- package/dist/chunk-47JX7NNZ.js.map +1 -0
- package/dist/{chunk-XBUHMRX2.js → chunk-V3LWJB5X.js} +431 -49
- package/dist/chunk-V3LWJB5X.js.map +1 -0
- package/dist/cli.js +2 -2
- package/dist/index.js +2 -2
- package/dist/{interactive-YMKWKPD7.js → interactive-L6PIIFNQ.js} +2 -2
- package/dist/skills/agentv-bench/LICENSE.txt +202 -0
- package/dist/skills/agentv-bench/SKILL.md +459 -0
- package/dist/skills/agentv-bench/agents/analyzer.md +177 -0
- package/dist/skills/agentv-bench/agents/comparator.md +247 -0
- package/dist/skills/agentv-bench/agents/executor.md +30 -0
- package/dist/skills/agentv-bench/agents/grader.md +238 -0
- package/dist/skills/agentv-bench/agents/mutator.md +172 -0
- package/dist/skills/agentv-bench/references/autoresearch.md +309 -0
- package/dist/skills/agentv-bench/references/description-optimization.md +66 -0
- package/dist/skills/agentv-bench/references/environment-adaptation.md +82 -0
- package/dist/skills/agentv-bench/references/eval-yaml-spec.md +338 -0
- package/dist/skills/agentv-bench/references/migrating-from-skill-creator.md +103 -0
- package/dist/skills/agentv-bench/references/schemas.md +432 -0
- package/dist/skills/agentv-bench/references/subagent-pipeline.md +181 -0
- package/dist/skills/agentv-bench/scripts/trajectory.html +462 -0
- package/dist/skills/agentv-eval-review/SKILL.md +53 -0
- package/dist/skills/agentv-eval-review/scripts/lint_eval.py +239 -0
- package/dist/skills/agentv-eval-writer/SKILL.md +707 -0
- package/dist/skills/agentv-eval-writer/references/config-schema.json +63 -0
- package/dist/skills/agentv-eval-writer/references/custom-evaluators.md +119 -0
- package/dist/skills/agentv-eval-writer/references/eval-schema.json +19077 -0
- package/dist/skills/agentv-eval-writer/references/rubric-evaluator.md +114 -0
- package/dist/skills/agentv-governance/SKILL.md +79 -0
- package/dist/skills/agentv-governance/references/eu-ai-act-risk-tiers.md +37 -0
- package/dist/skills/agentv-governance/references/governance-yaml-shape.md +125 -0
- package/dist/skills/agentv-governance/references/iso-42001-controls.md +46 -0
- package/dist/skills/agentv-governance/references/lint-rules.md +169 -0
- package/dist/skills/agentv-governance/references/mitre-atlas.md +38 -0
- package/dist/skills/agentv-governance/references/owasp-agentic-top-10-2025.md +28 -0
- package/dist/skills/agentv-governance/references/owasp-llm-top-10-2025.md +25 -0
- package/dist/skills/agentv-trace-analyst/SKILL.md +161 -0
- package/package.json +1 -1
- package/dist/chunk-JA4WQNE6.js.map +0 -1
- package/dist/chunk-XBUHMRX2.js.map +0 -1
- /package/dist/{interactive-YMKWKPD7.js.map → interactive-L6PIIFNQ.js.map} +0 -0
|
@@ -0,0 +1,707 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: agentv-eval-writer
|
|
3
|
+
description: >-
|
|
4
|
+
Write, edit, review, and validate AgentV EVAL.yaml / .eval.yaml evaluation files.
|
|
5
|
+
Use when asked to create new eval files, update or fix existing ones, add or remove test cases,
|
|
6
|
+
configure graders (`llm-grader`, `code-grader`, `rubrics`), review whether an eval is correct or complete,
|
|
7
|
+
convert between EVAL.yaml and evals.json using `agentv convert`, or generate eval test cases
|
|
8
|
+
from chat transcripts (markdown conversation or JSON messages).
|
|
9
|
+
Do NOT use for creating SKILL.md files, writing skill definitions, or running evals —
|
|
10
|
+
running and benchmarking belongs to agentv-bench.
|
|
11
|
+
---
|
|
12
|
+
|
|
13
|
+
# AgentV Eval Writer
|
|
14
|
+
|
|
15
|
+
Comprehensive docs: https://agentv.dev
|
|
16
|
+
|
|
17
|
+
## Evaluation Types
|
|
18
|
+
|
|
19
|
+
AgentV evaluations measure **execution quality** — whether your agent or skill produces correct output when invoked.
|
|
20
|
+
|
|
21
|
+
For **trigger quality** (whether the right skill is triggered for the right prompts), see the [Evaluation Types guide](https://agentv.dev/guides/evaluation-types/). Do not use execution eval configs (`EVAL.yaml`, `evals.json`) for trigger evaluation — these are distinct concerns requiring different tooling and methodologies.
|
|
22
|
+
|
|
23
|
+
## Starting from evals.json?
|
|
24
|
+
|
|
25
|
+
If the project already has an Agent Skills `evals.json` file, use it as a starting point instead of writing YAML from scratch:
|
|
26
|
+
|
|
27
|
+
```bash
|
|
28
|
+
# Convert evals.json to AgentV EVAL YAML
|
|
29
|
+
agentv convert evals.json
|
|
30
|
+
|
|
31
|
+
# Run directly without converting (all commands accept evals.json)
|
|
32
|
+
agentv eval evals.json
|
|
33
|
+
```
|
|
34
|
+
|
|
35
|
+
The converter maps `prompt` → `input`, `expected_output` → `expected_output`, `assertions` → `assertions` (`llm-grader`), and resolves `files[]` paths. The generated YAML includes TODO comments for AgentV features to add (workspace setup, code graders, rubrics, required gates).
|
|
36
|
+
|
|
37
|
+
After converting, enhance the YAML with AgentV-specific capabilities shown below.
|
|
38
|
+
|
|
39
|
+
## From Chat Transcript
|
|
40
|
+
|
|
41
|
+
Convert a chat conversation into eval test cases without starting from scratch.
|
|
42
|
+
|
|
43
|
+
**Input formats:**
|
|
44
|
+
|
|
45
|
+
Markdown conversation:
|
|
46
|
+
```
|
|
47
|
+
User: How do I reset my password?
|
|
48
|
+
Assistant: Go to Settings > Security > Reset Password...
|
|
49
|
+
```
|
|
50
|
+
|
|
51
|
+
JSON messages:
|
|
52
|
+
```json
|
|
53
|
+
[{"role": "user", "content": "How do I reset my password?"},
|
|
54
|
+
{"role": "assistant", "content": "Go to Settings > Security > Reset Password..."}]
|
|
55
|
+
```
|
|
56
|
+
|
|
57
|
+
**Select exchanges that make good test cases:**
|
|
58
|
+
- Factual Q&A — verifiable answers
|
|
59
|
+
- Task completion — user requests an action, agent performs it
|
|
60
|
+
- Edge cases — unusual inputs, error handling, boundary conditions
|
|
61
|
+
- Multi-turn reasoning — exchanges where earlier context matters
|
|
62
|
+
|
|
63
|
+
**Skip:** greetings, one-word acknowledgments, repeated exchanges
|
|
64
|
+
|
|
65
|
+
**Multi-turn format** (when context from prior turns matters):
|
|
66
|
+
```yaml
tests:
  - id: multi-turn-context
    criteria: "Agent remembers prior context"
    input:
      - role: user
        content: "My name is Alice"
      - role: assistant
        content: "Nice to meet you, Alice!"
      - role: user
        content: "What's my name?"
    expected_output: "Your name is Alice."
    assertions:
      - type: rubrics
        criteria:
          - Correctly recalls the user's name from earlier in the conversation
```
|
|
83
|
+
|
|
84
|
+
**Guidelines:** preserve exact wording in `expected_output`; aim for 5–15 tests per transcript; pick exchanges that test different capabilities.
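As a rough illustration of the markdown input format above, the following sketch splits a `User:` / `Assistant:` transcript into role-tagged messages that can then be copied into a test's `input`. `parseTranscript` and `ChatMessage` are hypothetical names used only for this example; they are not part of the AgentV CLI or SDK.

```typescript
// Hypothetical helper for illustration; not an AgentV API.
type ChatMessage = { role: "user" | "assistant"; content: string };

function parseTranscript(markdown: string): ChatMessage[] {
  const messages: ChatMessage[] = [];
  for (const rawLine of markdown.split("\n")) {
    const line = rawLine.trim();
    const match = /^(User|Assistant):\s*(.*)$/.exec(line);
    if (match) {
      // A speaker prefix starts a new message.
      const role = match[1] === "User" ? "user" : "assistant";
      messages.push({ role, content: match[2] ?? "" });
      continue;
    }
    const last = messages[messages.length - 1];
    if (line.length > 0 && last !== undefined) {
      // A non-empty line without a prefix continues the current message.
      last.content += "\n" + line;
    }
  }
  return messages;
}

const transcriptMessages = parseTranscript(
  "User: How do I reset my password?\nAssistant: Go to Settings > Security > Reset Password...",
);
```

Selecting which exchanges become tests, and writing `criteria` for them, remains a manual step.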
|
|
85
|
+
|
|
86
|
+
## Quick Start
|
|
87
|
+
|
|
88
|
+
```yaml
description: Example eval
execution:
  target: default

tests:
  - id: greeting
    criteria: Friendly greeting
    input: "Say hello"
    expected_output: "Hello! How can I help you?"
    assertions:
      - type: rubrics
        criteria:
          - Greeting is friendly and warm
          - Offers to help
```
|
|
104
|
+
|
|
105
|
+
## Eval File Structure
|
|
106
|
+
|
|
107
|
+
**Required:** `tests` (array or string path)
|
|
108
|
+
**Optional:** `name`, `description`, `version`, `author`, `tags`, `license`, `requires`, `execution`, `suite`, `workspace`, `assertions`, `input`
|
|
109
|
+
|
|
110
|
+
**Test fields:**
|
|
111
|
+
|
|
112
|
+
| Field | Required | Description |
|
|
113
|
+
|-------|----------|-------------|
|
|
114
|
+
| `id` | yes | Unique identifier |
|
|
115
|
+
| `criteria` | yes | What the response should accomplish |
|
|
116
|
+
| `input` | yes | Input to the agent (string or message array) |
|
|
117
|
+
| `expected_output` | no | Gold-standard reference answer |
|
|
118
|
+
| `assertions` | no | Graders: deterministic checks, rubrics, and LLM/code graders |
|
|
119
|
+
| `rubrics` | no | **Deprecated** — use `assertions: [{type: rubrics, criteria: [...]}]` instead |
|
|
120
|
+
| `execution` | no | Per-case execution overrides |
|
|
121
|
+
| `workspace` | no | Per-case workspace config (overrides suite-level) |
|
|
122
|
+
| `metadata` | no | Arbitrary key-value pairs passed to setup/teardown scripts |
|
|
123
|
+
| `conversation_id` | no | Thread grouping |
|
|
124
|
+
|
|
125
|
+
**Shorthand aliases:**
|
|
126
|
+
- `input` (string) expands to `[{role: "user", content: "..."}]`
|
|
127
|
+
- `expected_output` (string/object) expands to `[{role: "assistant", content: ...}]`
|
|
128
|
+
- Canonical `input` / `expected_output` take precedence when both present
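The shorthand expansion can be pictured with a small sketch. The function and type names here are illustrative only, and the treatment of arrays in `expandExpectedOutput` is an assumption rather than documented behavior.

```typescript
// Illustrative sketch of the shorthand expansion; not AgentV source code.
type ChatMessage = {
  role: "system" | "user" | "assistant" | "tool";
  content: unknown;
};

// A string shorthand becomes a single user message; arrays pass through.
function expandInput(input: string | ChatMessage[]): ChatMessage[] {
  return typeof input === "string" ? [{ role: "user", content: input }] : input;
}

// A string or object shorthand becomes a single assistant message.
// Assumption: an array is treated as an already-expanded message list.
function expandExpectedOutput(expected: string | object): ChatMessage[] {
  if (Array.isArray(expected)) return expected as ChatMessage[];
  return [{ role: "assistant", content: expected }];
}

const expandedInput = expandInput("Say hello");
const expandedExpected = expandExpectedOutput({ status: "ok" });
```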
|
|
129
|
+
|
|
130
|
+
**Message format:** `{role, content}` where role is `system`, `user`, `assistant`, or `tool`
|
|
131
|
+
**Content types:** inline text, `{type: "file", value: "./path.md"}`
|
|
132
|
+
**File paths:** relative from eval file dir, or absolute with `/` prefix from repo root
|
|
133
|
+
**File handling by provider type:** LLM providers receive file content inlined in XML tags. Agent providers receive a preread block with `file://` URIs and must read files themselves. See [Coding Agents > Prompt format](https://agentv.dev/targets/coding-agents#prompt-format).
|
|
134
|
+
|
|
135
|
+
**JSONL format:** One test per line as JSON. Optional `.yaml` sidecar for shared defaults. See `examples/features/basic-jsonl/`.
|
|
136
|
+
|
|
137
|
+
**Environment variables:** All string fields support `${{ VAR }}` interpolation. Missing vars resolve to empty string. Works in eval files, external case files, and workspace configs. `.env` files are loaded automatically.
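A minimal sketch of the interpolation rule, assuming variable names are limited to letters, digits, and underscores (the source does not state the accepted character set). `interpolate` is a hypothetical helper, not an AgentV export.

```typescript
// Illustrative only: replaces ${{ VAR }} placeholders from a plain map.
function interpolate(
  template: string,
  env: Record<string, string | undefined>,
): string {
  return template.replace(
    /\$\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}/g,
    // Missing variables resolve to an empty string, as described above.
    (_match: string, varName: string) => env[varName] ?? "",
  );
}

const withValue = interpolate("Bearer ${{ API_KEY }}", { API_KEY: "abc" });
const withMissing = interpolate("x${{ MISSING }}y", {});
```

Because a missing variable silently becomes an empty string, it may be worth validating required variables before a run.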
|
|
138
|
+
|
|
139
|
+
## Metadata
|
|
140
|
+
|
|
141
|
+
When `name` is present, the suite is parsed as a metadata-bearing eval:
|
|
142
|
+
|
|
143
|
+
```yaml
name: export-screening # required, lowercase/hyphens, max 64 chars
description: Evaluates export control screening accuracy
version: "1.0"
author: acme-compliance
tags: [compliance, agents]
license: Apache-2.0
requires:
  agentv: ">=0.30.0"
```
|
|
153
|
+
|
|
154
|
+
## Suite-level Input
|
|
155
|
+
|
|
156
|
+
Prepend shared input messages to every test (like suite-level `assertions`). Avoids repeating the same prompt file in each test:
|
|
157
|
+
|
|
158
|
+
```yaml
input:
  - role: user
    content:
      - type: file
        value: ./system-prompt.md

tests: ./cases.yaml

# cases.yaml — each test only needs its own query
# - id: test-1
#   criteria: ...
#   input: "User question here"
```
|
|
172
|
+
|
|
173
|
+
Effective input: `[...suite input, ...test input]`. Skipped when `execution.skip_defaults: true`.
|
|
174
|
+
Accepts same formats as test `input` (string or message array).
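The merge rule can be sketched as follows. The names are hypothetical, and the `skipDefaults` parameter stands in for `execution.skip_defaults`; this is a sketch of the documented behavior, not the runner's implementation.

```typescript
// Illustrative sketch of suite-level input prepending.
type ChatMessage = { role: string; content: unknown };
type InputSpec = string | ChatMessage[];

function toMessages(input: InputSpec): ChatMessage[] {
  return typeof input === "string" ? [{ role: "user", content: input }] : input;
}

function effectiveInput(
  suiteInput: InputSpec | undefined,
  testInput: InputSpec,
  skipDefaults = false,
): ChatMessage[] {
  const own = toMessages(testInput);
  // With skip_defaults, the test sees only its own input.
  if (skipDefaults || suiteInput === undefined) return own;
  return [...toMessages(suiteInput), ...own];
}

const merged = effectiveInput("Shared system prompt", "User question here");
const unmerged = effectiveInput("Shared system prompt", "User question here", true);
```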
|
|
175
|
+
|
|
176
|
+
## Tests as String Path
|
|
177
|
+
|
|
178
|
+
Point `tests` to an external file instead of inlining:
|
|
179
|
+
|
|
180
|
+
```yaml
|
|
181
|
+
name: my-eval
|
|
182
|
+
description: My evaluation suite
|
|
183
|
+
tests: ./cases.yaml # relative to eval file dir
|
|
184
|
+
```
|
|
185
|
+
|
|
186
|
+
The external file can be YAML (array of test objects) or JSONL.
|
|
187
|
+
|
|
188
|
+
## Assertions Field
|
|
189
|
+
|
|
190
|
+
`assertions` defines graders at the suite level or per-test level. It is the canonical field for all graders:
|
|
191
|
+
|
|
192
|
+
```yaml
# Suite-level (appended to every test)
assertions:
  - type: is-json
    required: true
  - type: contains
    value: "status"

tests:
  - id: test-1
    criteria: Returns JSON
    input: Get status
    # Per-test assertions (run before the suite-level assertions)
    assertions:
      - type: equals
        value: '{"status": "ok"}'
```
|
|
209
|
+
|
|
210
|
+
## How `criteria` and `assertions` Interact
|
|
211
|
+
|
|
212
|
+
`criteria` is a **data field** — it describes what the response should accomplish. It is **not** a grader. How it gets evaluated depends on whether `assertions` is present:
|
|
213
|
+
|
|
214
|
+
| Scenario | What happens | Warning? |
|
|
215
|
+
|----------|-------------|----------|
|
|
216
|
+
| `criteria` + **no `assertions`** | Implicit `llm-grader` runs automatically against `criteria` | No |
|
|
217
|
+
| `criteria` + **`assertions` with only deterministic graders** (contains, regex, etc.) | Only declared graders run. `criteria` is **not evaluated**. | Yes — warns that no grader will consume criteria |
|
|
218
|
+
| `criteria` + **`assertions` with a grader** (`llm-grader`, `code-grader`, `rubrics`) | Declared graders run. Graders receive `criteria` as input. | No |
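The table above can be read as a small decision function. This sketch is illustrative only: `planGrading` is a hypothetical name, and treating an empty `assertions` array the same as a missing one is an assumption.

```typescript
// Illustrative sketch of the criteria/assertions interaction.
type Assertion = { type: string };
type GradingPlan = { graders: string[]; warning: boolean };

// Grader types that receive `criteria` as input, per the table above.
const CRITERIA_CONSUMERS = new Set(["llm-grader", "code-grader", "rubrics"]);

function planGrading(criteria: string, assertions?: Assertion[]): GradingPlan {
  if (assertions === undefined || assertions.length === 0) {
    // No assertions: the implicit llm-grader evaluates criteria.
    return { graders: ["llm-grader"], warning: false };
  }
  const graders = assertions.map((a) => a.type);
  const consumed = graders.some((t) => CRITERIA_CONSUMERS.has(t));
  // Only deterministic graders: criteria is not evaluated, so warn.
  return { graders, warning: criteria.length > 0 && !consumed };
}

const implicitPlan = planGrading("Explains the bug");
const deterministicOnly = planGrading("Explains the bug", [{ type: "contains" }]);
const mixedPlan = planGrading("Explains the bug", [{ type: "llm-grader" }, { type: "contains" }]);
```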
|
|
219
|
+
|
|
220
|
+
### No assertions → implicit llm-grader
|
|
221
|
+
|
|
222
|
+
The simplest path. `criteria` is automatically evaluated by the default `llm-grader`:
|
|
223
|
+
|
|
224
|
+
```yaml
tests:
  - id: simple-eval
    criteria: Assistant correctly explains the bug and proposes a fix
    input: "Debug this function..."
    # No assertions → default llm-grader evaluates against criteria
```
|
|
231
|
+
|
|
232
|
+
### assertions present → no implicit grader
|
|
233
|
+
|
|
234
|
+
When `assertions` is defined, **only the declared graders run**. If you want an LLM grader alongside deterministic checks, declare it explicitly:
|
|
235
|
+
|
|
236
|
+
```yaml
tests:
  - id: mixed-eval
    criteria: Response is helpful and mentions the fix
    input: "Debug this function..."
    assertions:
      - type: llm-grader # must be explicit when assertions is present
      - type: contains
        value: "fix"
```
|
|
246
|
+
|
|
247
|
+
**Common mistake:** defining `criteria` with only deterministic graders. The criteria will be ignored and a warning is emitted:
|
|
248
|
+
|
|
249
|
+
```yaml
tests:
  - id: bad-example
    criteria: Gives a thoughtful answer # ⚠ NOT evaluated — no grader in assertions
    input: "What is 2+2?"
    assertions:
      - type: contains
        value: "4"
# Warning: criteria is defined but no grader in assertions will evaluate it.
```
|
|
259
|
+
|
|
260
|
+
## Required Gates
|
|
261
|
+
|
|
262
|
+
Any grader can be marked `required` to enforce a minimum score:
|
|
263
|
+
|
|
264
|
+
```yaml
assertions:
  - type: contains
    value: "DENIED"
    required: true # must score >= 0.8 (default)
  - type: rubrics
    required: 0.6 # must score >= 0.6 (custom threshold)
    criteria:
      - id: accuracy
        outcome: Identifies the denied party
        weight: 5.0
```
|
|
276
|
+
|
|
277
|
+
If a required grader scores below its threshold, the overall verdict is forced to `fail`.
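A minimal sketch of the gate rule, assuming `required: true` maps to the default threshold of 0.8 and a number is used as-is. The type and function names are hypothetical.

```typescript
// Illustrative sketch of required-gate evaluation; not AgentV source code.
type GraderResult = {
  graderName: string;
  score: number;
  required?: boolean | number;
};

const DEFAULT_REQUIRED_THRESHOLD = 0.8;

function requiredThreshold(required: boolean | number | undefined): number | null {
  if (required === true) return DEFAULT_REQUIRED_THRESHOLD;
  if (typeof required === "number") return required;
  return null; // not a required gate
}

// Returns the names of required graders that scored below their threshold.
// A non-empty result means the overall verdict is forced to fail.
function failedGates(results: GraderResult[]): string[] {
  return results
    .filter((r) => {
      const threshold = requiredThreshold(r.required);
      return threshold !== null && r.score < threshold;
    })
    .map((r) => r.graderName);
}

const gateFailures = failedGates([
  { graderName: "contains", score: 0, required: true },
  { graderName: "rubrics", score: 0.7, required: 0.6 },
  { graderName: "style", score: 0.1 },
]);
```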
|
|
278
|
+
|
|
279
|
+
## Workspace Setup/Teardown
|
|
280
|
+
|
|
281
|
+
Run scripts before/after each test. Define at suite level or override per case:
|
|
282
|
+
|
|
283
|
+
```yaml
workspace:
  template: ./workspace-templates/my-project
  setup:
    command: ["bun", "run", "setup.ts"]
    timeout_ms: 120000
  teardown:
    command: ["bun", "run", "teardown.ts"]

tests:
  - id: case-1
    input: Fix the bug
    criteria: Bug is fixed
    metadata:
      repo: sympy/sympy
    workspace:
      repos:
        - path: /testbed
          source:
            type: git
            url: https://github.com/sympy/sympy.git
          checkout:
            base_commit: "abc123"
      docker:
        image: swebench/sweb.eval.django__django:latest
```
|
|
309
|
+
|
|
310
|
+
**Lifecycle:** template copy → repo clone → setup → git baseline → agent → file changes → teardown → repo reset → cleanup
|
|
311
|
+
**Merge:** Case-level fields replace suite-level fields.
|
|
312
|
+
**Commands receive stdin JSON:** `{workspace_path, test_id, eval_run_id, case_input, case_metadata}`
|
|
313
|
+
**Setup failure:** aborts case. **Teardown failure:** non-fatal (warning).
|
|
314
|
+
For SWE-bench-style evals, keep operational checkout state under `workspace.repos[].checkout.base_commit`; treat `metadata.base_commit` as informational only.
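A setup or teardown script might validate the stdin payload before using it. The sketch below covers only the parsing step; the field types are assumptions (the source lists field names, not types), and `parseHookPayload` is a hypothetical helper.

```typescript
// Illustrative sketch; field types are assumed, not documented.
interface HookPayload {
  workspace_path: string;
  test_id: string;
  eval_run_id: string;
  case_input: unknown;
  case_metadata: Record<string, unknown>;
}

function parseHookPayload(raw: string): HookPayload {
  const data = JSON.parse(raw) as Partial<HookPayload>;
  for (const key of ["workspace_path", "test_id", "eval_run_id"] as const) {
    if (typeof data[key] !== "string") {
      throw new Error(`missing or non-string field: ${key}`);
    }
  }
  return {
    workspace_path: data.workspace_path as string,
    test_id: data.test_id as string,
    eval_run_id: data.eval_run_id as string,
    case_input: data.case_input,
    // Assumption: absent metadata is treated as an empty object.
    case_metadata: data.case_metadata ?? {},
  };
}

const samplePayload = parseHookPayload(
  JSON.stringify({
    workspace_path: "/tmp/ws",
    test_id: "case-1",
    eval_run_id: "run-1",
    case_input: "Fix the bug",
    case_metadata: { repo: "sympy/sympy" },
  }),
);
```

In a real script, the raw string would come from reading stdin to completion before calling the parser.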
|
|
315
|
+
|
|
316
|
+
### Repository Lifecycle
|
|
317
|
+
|
|
318
|
+
Clone repos into workspace automatically. For shared repo workspaces, pooling is the default:
|
|
319
|
+
|
|
320
|
+
```yaml
workspace:
  repos:
    - path: ./repo
      source:
        type: git
        url: https://github.com/org/repo.git
      checkout:
        ref: main
        ancestor: 1 # parent commit
      clone:
        depth: 10
  isolation: shared # shared | per_test
  mode: pooled # pooled | temp | static
  hooks:
    enabled: true # set false to skip all hooks
    after_each:
      reset: fast # none | fast | strict
```
|
|
340
|
+
|
|
341
|
+
- `source.type`: `git` (URL) or `local` (path)
|
|
342
|
+
- `checkout.resolve`: `remote` (ls-remote) or `local`
|
|
343
|
+
- `clone.depth`: shallow clone depth
|
|
344
|
+
- `clone.filter`: partial clone filter (e.g., `blob:none`)
|
|
345
|
+
- `clone.sparse`: sparse checkout paths array
|
|
346
|
+
- `mode`: `pooled` (default for shared repos), `temp`, or `static`
|
|
347
|
+
- `path`: workspace path used when `mode: static`; when empty/missing the workspace is auto-materialised (template copied + repos cloned); populated dirs are reused as-is
|
|
348
|
+
- `hooks.enabled`: boolean (default `true`); set `false` to skip all lifecycle hooks
|
|
349
|
+
- Pool reset defaults to `fast` (`git clean -fd`); use `--workspace-clean full` for strict reset (`git clean -fdx`)
|
|
350
|
+
- Pool entries are managed separately via `agentv workspace list` and `agentv workspace clean`
|
|
351
|
+
- `agentv workspace deps <eval-paths>` scans eval files and outputs a JSON manifest of required git repos (useful for CI pre-cloning)
|
|
352
|
+
|
|
353
|
+
See https://agentv.dev/targets/configuration/#repository-lifecycle
|
|
354
|
+
|
|
355
|
+
## Grader Types
|
|
356
|
+
|
|
357
|
+
Configure via `assertions` array. Multiple graders produce a weighted average score.
|
|
358
|
+
|
|
359
|
+
### code_grader
|
|
360
|
+
```yaml
- name: format_check
  type: code-grader
  command: [uv, run, validate.py]
  cwd: ./scripts # optional working directory
  target: {} # optional: enable LLM target proxy (max_calls: 50)
```
|
|
367
|
+
Contract: stdin JSON -> stdout JSON `{score, assertions: [{text, passed, evidence?}], reasoning}`
|
|
368
|
+
Input includes: `question`, `criteria`, `answer`, `reference_answer`, `output`, `trace`, `token_usage`, `cost_usd`, `duration_ms`, `start_time`, `end_time`, `file_changes`, `workspace_path`, `config`
|
|
369
|
+
When a workspace is configured, `workspace_path` is the absolute path to the workspace dir (also available as `AGENTV_WORKSPACE_PATH` env var). Use this for functional grading (e.g., running `npm test` in the workspace).
|
|
370
|
+
See docs at https://agentv.dev/graders/code-graders/
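The stdin/stdout contract can be sketched as a pure function from the input payload to the result object. Only the `answer` field is used here; the keyword check and all names are hypothetical, chosen just to show the output shape.

```typescript
// Illustrative sketch of a grader body that follows the output contract.
interface GraderInput {
  answer: string;
  [key: string]: unknown; // other documented fields are ignored in this sketch
}
interface GraderAssertion {
  text: string;
  passed: boolean;
  evidence?: string;
}
interface GraderOutput {
  score: number;
  assertions: GraderAssertion[];
  reasoning: string;
}

function gradeRequiredKeywords(input: GraderInput, keywords: string[]): GraderOutput {
  const assertions = keywords.map((keyword) => ({
    text: `mentions "${keyword}"`,
    passed: input.answer.includes(keyword),
  }));
  const passedCount = assertions.filter((a) => a.passed).length;
  // With no keywords there is nothing to fail, so the score is 1.
  const score = keywords.length === 0 ? 1 : passedCount / keywords.length;
  return {
    score,
    assertions,
    reasoning: `${passedCount}/${keywords.length} keywords found`,
  };
}

const keywordResult = gradeRequiredKeywords({ answer: "status ok" }, ["status", "error"]);
```

A command-line grader would parse the stdin JSON into `GraderInput` and print `JSON.stringify(result)` to stdout.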
|
|
371
|
+
|
|
372
|
+
### llm_grader
|
|
373
|
+
```yaml
- name: quality
  type: llm-grader
  prompt: ./prompts/eval.md # markdown template or command config
  target: grader_gpt_5_mini # optional: override the grader target for this grader
  model: gpt-5-chat # optional model override
  config: # passed to prompt templates as context.config
    strictness: high
```
|
|
382
|
+
Variables: `{{question}}`, `{{criteria}}`, `{{answer}}`, `{{reference_answer}}`, `{{input}}`, `{{expected_output}}`, `{{output}}`, `{{file_changes}}`
|
|
383
|
+
- Markdown templates: use `{{variable}}` syntax
|
|
384
|
+
- TypeScript templates: use `definePromptTemplate(fn)` from `@agentv/eval`, receives context object with all variables + `config`
|
|
385
|
+
- Use `target:` to run different `llm-grader` graders against different named LLM targets in the same eval (useful for grader panels / ensembles)
|
|
386
|
+
|
|
387
|
+
### composite
|
|
388
|
+
```yaml
- name: gate
  type: composite
  assertions:
    - name: safety
      type: llm-grader
      prompt: ./safety.md
    - name: quality
      type: llm-grader
  aggregator:
    type: weighted_average
    weights: { safety: 0.3, quality: 0.7 }
```
|
|
401
|
+
Aggregator types: `weighted_average`, `all_or_nothing`, `minimum`, `maximum`, `safety_gate`
|
|
402
|
+
- `safety_gate`: fails immediately if the named gate grader scores below threshold (default 1.0)
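To make the aggregator arithmetic concrete, here is a sketch of three of the listed types. The exact semantics are assumptions: in particular, giving unlisted children a weight of 1 is a guess, and `all_or_nothing` and `safety_gate` are omitted because the source does not spell out their scoring.

```typescript
// Illustrative aggregator sketches; not the AgentV implementation.
type ChildScores = Record<string, number>;

function weightedAverage(scores: ChildScores, weights: Record<string, number>): number {
  let total = 0;
  let weightSum = 0;
  for (const [childName, score] of Object.entries(scores)) {
    const weight = weights[childName] ?? 1; // assumption: default weight of 1
    total += weight * score;
    weightSum += weight;
  }
  return weightSum === 0 ? 0 : total / weightSum;
}

const minimumScore = (scores: ChildScores): number => Math.min(...Object.values(scores));
const maximumScore = (scores: ChildScores): number => Math.max(...Object.values(scores));

// safety scores 1.0, quality scores 0.5, with the weights from the YAML above:
// (0.3 * 1.0 + 0.7 * 0.5) / (0.3 + 0.7) = 0.65
const compositeScore = weightedAverage(
  { safety: 1.0, quality: 0.5 },
  { safety: 0.3, quality: 0.7 },
);
```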
|
|
403
|
+
|
|
404
|
+
### tool_trajectory
|
|
405
|
+
```yaml
- name: tool_check
  type: tool-trajectory
  mode: any_order # any_order | in_order | exact
  minimums: # for any_order
    knowledgeSearch: 2
  expected: # for in_order/exact
    - tool: knowledgeSearch
      args: { query: "search term" } # partial deep equality match
    - tool: documentRetrieve
      args: any # any arguments accepted
      max_duration_ms: 5000 # per-tool latency assertion
    - tool: summarize # omit args to skip argument checking
```
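The difference between the `any_order` and `in_order` modes can be sketched as two small checks. This is an interpretation, not the grader's implementation: it assumes `in_order` tolerates unrelated calls between expected ones, and it ignores `args` and latency matching entirely.

```typescript
// Illustrative sketch of two trajectory modes; names are hypothetical.
type ToolCall = { tool: string };

// any_order: each listed tool must be called at least `min` times.
function meetsMinimums(calls: ToolCall[], minimums: Record<string, number>): boolean {
  return Object.entries(minimums).every(
    ([tool, min]) => calls.filter((c) => c.tool === tool).length >= min,
  );
}

// in_order: expected tools must appear as a subsequence of the actual calls.
function matchesInOrder(calls: ToolCall[], expectedTools: string[]): boolean {
  let next = 0;
  for (const call of calls) {
    if (next < expectedTools.length && call.tool === expectedTools[next]) next++;
  }
  return next === expectedTools.length;
}

const observedCalls: ToolCall[] = [
  { tool: "knowledgeSearch" },
  { tool: "knowledgeSearch" },
  { tool: "documentRetrieve" },
  { tool: "summarize" },
];
const minimumsMet = meetsMinimums(observedCalls, { knowledgeSearch: 2 });
const orderMet = matchesInOrder(observedCalls, ["knowledgeSearch", "documentRetrieve", "summarize"]);
```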
|
|
419
|
+
|
|
420
|
+
### field_accuracy
|
|
421
|
+
```yaml
- name: fields
  type: field-accuracy
  match_type: exact # exact | date | numeric_tolerance
  numeric_tolerance: 0.01 # for numeric_tolerance match_type
  aggregation: weighted_average # weighted_average | all_or_nothing
```
|
|
428
|
+
Compares `output` fields against `expected_output` fields.
|
|
429
|
+
|
|
430
|
+
### latency
|
|
431
|
+
```yaml
- name: speed
  type: latency
  max_ms: 5000
```
|
|
436
|
+
|
|
437
|
+
### cost
|
|
438
|
+
```yaml
- name: budget
  type: cost
  max_usd: 0.10
```
|
|
443
|
+
|
|
444
|
+
### token_usage
|
|
445
|
+
```yaml
- name: tokens
  type: token-usage
  max_total_tokens: 4000
```
|
|
450
|
+
|
|
451
|
+
### execution_metrics
|
|
452
|
+
```yaml
- name: efficiency
  type: execution-metrics
  max_tool_calls: 10 # Maximum tool invocations
  max_llm_calls: 5 # Maximum LLM calls (assistant messages)
  max_tokens: 5000 # Maximum total tokens (input + output)
  max_cost_usd: 0.05 # Maximum cost in USD
  max_duration_ms: 30000 # Maximum execution duration
  target_exploration_ratio: 0.6 # Target ratio of read-only tool calls
  exploration_tolerance: 0.2 # Tolerance for ratio check (default: 0.2)
```
|
|
463
|
+
Declarative threshold-based checks on execution metrics. Only specified thresholds are checked.
|
|
464
|
+
Score is proportional: `passed / total` assertions. Missing data counts as a failed assertion.
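The proportional scoring can be illustrated with a short sketch covering three of the thresholds. The metric field names on the left-hand side are hypothetical, and returning 1 when no thresholds are specified is an assumption.

```typescript
// Illustrative sketch of threshold-based scoring; not AgentV source code.
type ObservedMetrics = { toolCalls?: number; llmCalls?: number; costUsd?: number };
type MetricLimits = { max_tool_calls?: number; max_llm_calls?: number; max_cost_usd?: number };

function scoreExecutionMetrics(metrics: ObservedMetrics, limits: MetricLimits): number {
  const checks: boolean[] = [];
  const check = (actual: number | undefined, limit: number | undefined): void => {
    if (limit === undefined) return; // only specified thresholds are checked
    checks.push(actual !== undefined && actual <= limit); // missing data fails
  };
  check(metrics.toolCalls, limits.max_tool_calls);
  check(metrics.llmCalls, limits.max_llm_calls);
  check(metrics.costUsd, limits.max_cost_usd);
  if (checks.length === 0) return 1; // assumption: nothing to check means full score
  return checks.filter(Boolean).length / checks.length;
}

// Tool calls exceed the limit, LLM calls pass, cost data is missing: 1 of 3 passes.
const metricsScore = scoreExecutionMetrics(
  { toolCalls: 12, llmCalls: 3 },
  { max_tool_calls: 10, max_llm_calls: 5, max_cost_usd: 0.05 },
);
```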
|
|
465
|
+
|
|
466
|
+
### contains
|
|
467
|
+
```yaml
- type: contains
  value: "DENIED"
  required: true
```
|
|
472
|
+
Binary check: does output contain the substring? Name auto-generated if omitted.
|
|
473
|
+
|
|
474
|
+
### regex
|
|
475
|
+
```yaml
- type: regex
  value: "\\d{3}-\\d{2}-\\d{4}"
```
|
|
479
|
+
Binary check: does output match the regex pattern?
|
|
480
|
+
|
|
481
|
+
### equals
|
|
482
|
+
```yaml
- type: equals
  value: "42"
```
|
|
486
|
+
Binary check: does output exactly equal the value (both trimmed)?
|
|
487
|
+
|
|
488
|
+
### is_json
|
|
489
|
+
```yaml
- type: is-json
  required: true
```
|
|
493
|
+
Binary check: is the output valid JSON?
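The four binary checks above (`contains`, `regex`, `equals`, `is-json`) are simple enough to sketch directly. These functions are illustrative stand-ins, and the regex check assumes an unanchored search rather than a full-string match.

```typescript
// Illustrative equivalents of the deterministic checks; names are hypothetical.
const containsCheck = (output: string, value: string): boolean => output.includes(value);

// Assumption: the pattern may match anywhere in the output.
const regexCheck = (output: string, pattern: string): boolean => new RegExp(pattern).test(output);

// Both sides are trimmed before comparison, as described for `equals`.
const equalsCheck = (output: string, value: string): boolean => output.trim() === value.trim();

function isJsonCheck(output: string): boolean {
  try {
    JSON.parse(output);
    return true;
  } catch {
    return false;
  }
}

const ssnFound = regexCheck("SSN 123-45-6789", "\\d{3}-\\d{2}-\\d{4}");
const trimmedEqual = equalsCheck(" 42\n", "42");
```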
|
|
494
|
+
|
|
495
|
+
### rubrics
|
|
496
|
+
```yaml
- type: rubrics
  criteria:
    - id: accuracy
      outcome: Correctly identifies the denied party
      weight: 5.0
    - id: reasoning
      outcome: Provides clear reasoning
      weight: 3.0
```
|
|
506
|
+
LLM-judged structured evaluation with weighted criteria. Criteria items support `id`, `outcome`, `weight`, and `required` fields.
|
|
507
|
+
|
|
508
|
+
### rubrics (inline, deprecated)
|
|
509
|
+
Top-level `rubrics:` field is deprecated. Use `type: rubrics` under `assertions` instead.
|
|
510
|
+
See `references/rubric-evaluator.md` for score-range mode and scoring formula.
|
|
511
|
+
|
|
512
|
+
## Execution Error Tolerance
|
|
513
|
+
|
|
514
|
+
Control how the runner handles execution errors (infrastructure failures, not quality failures):
|
|
515
|
+
|
|
516
|
+
```yaml
execution:
  fail_on_error: false # never halt (default)
  # fail_on_error: true # halt on first execution error
```
|
|
521
|
+
|
|
522
|
+
When halted, remaining tests get `executionStatus: 'execution_error'` with `failureReasonCode: 'error_threshold_exceeded'`.
|
|
523
|
+
|
|
524
|
+
## Suite-Level Quality Threshold
|
|
525
|
+
|
|
526
|
+
Set a minimum mean score for the eval suite. If the mean quality score falls below the threshold, the CLI exits with code 1 — useful for CI/CD quality gates.
|
|
527
|
+
|
|
528
|
+
```yaml
execution:
  threshold: 0.8
```
|
|
532
|
+
|
|
533
|
+
CLI flag `--threshold 0.8` overrides the YAML value. Must be a number between 0 and 1. Mean score is computed from quality results only (execution errors excluded).
|
|
534
|
+
|
|
535
|
+
The threshold also controls JUnit XML pass/fail: tests with scores below the threshold are marked as `<failure>`. When no threshold is set, JUnit defaults to 0.5.
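The suite-level gate amounts to a mean over quality results followed by a comparison. A sketch under stated assumptions: the result shape is hypothetical, and returning 0 when every case errored is a guess, since the source does not cover that case.

```typescript
// Illustrative sketch of the suite threshold check; not the CLI's code.
type CaseResult = { score: number; executionError?: boolean };

function suiteExitCode(results: CaseResult[], threshold: number): 0 | 1 {
  if (threshold < 0 || threshold > 1) {
    throw new RangeError("threshold must be a number between 0 and 1");
  }
  // Execution errors are excluded from the mean.
  const quality = results.filter((r) => r.executionError !== true);
  if (quality.length === 0) return 0; // assumption: no quality results, no gate failure
  const mean = quality.reduce((sum, r) => sum + r.score, 0) / quality.length;
  return mean < threshold ? 1 : 0;
}

// Mean of the two quality results is (0.9 + 0.6) / 2 = 0.75.
const sampleResults: CaseResult[] = [
  { score: 0.9 },
  { score: 0.6 },
  { score: 0, executionError: true },
];
const strictExit = suiteExitCode(sampleResults, 0.8);
const lenientExit = suiteExitCode(sampleResults, 0.7);
```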
|
|
536
|
+
|
|
537
|
+
## CLI Commands
|
|
538
|
+
|
|
539
|
+
```bash
|
|
540
|
+
# Run evaluation (requires API keys)
|
|
541
|
+
agentv eval <file.yaml> [--test-id <id>] [--target <name>] [--dry-run] [--threshold <0-1>]
|
|
542
|
+
|
|
543
|
+
# Run with OTLP JSON file (importable by OTel backends)
|
|
544
|
+
agentv eval <file.yaml> --otel-file traces/eval.otlp.json
|
|
545
|
+
|
|
546
|
+
# Run a single assertion in isolation (no API keys needed)
|
|
547
|
+
agentv eval assert <grader-name> --agent-output "..." --agent-input "..."
|
|
548
|
+
|
|
549
|
+
# Import agent transcripts for offline grading
|
|
550
|
+
agentv import claude --session-id <uuid>
|
|
551
|
+
|
|
552
|
+
# Re-run only execution errors from a previous run
|
|
553
|
+
agentv eval <file.yaml> --retry-errors .agentv/results/runs/<timestamp>/index.jsonl
|
|
554
|
+
|
|
555
|
+
# Validate eval file
|
|
556
|
+
agentv validate <file.yaml>
|
|
557
|
+
|
|
558
|
+
# Compare results — N-way matrix from a canonical run manifest
|
|
559
|
+
agentv compare .agentv/results/runs/<timestamp>/index.jsonl
|
|
560
|
+
agentv compare .agentv/results/runs/<timestamp>/index.jsonl --baseline <target> # CI regression gate
|
|
561
|
+
agentv compare .agentv/results/runs/<timestamp>/index.jsonl --baseline <target> --candidate <target> # pairwise
|
|
562
|
+
agentv compare .agentv/results/runs/<baseline-timestamp>/index.jsonl .agentv/results/runs/<candidate-timestamp>/index.jsonl
|
|
563
|
+
|
|
564
|
+
# Author assertions directly in the eval file
|
|
565
|
+
# Prefer simple assertions when they fit the criteria; use deterministic or LLM-based graders when needed
|
|
566
|
+
agentv validate <file.yaml>
|
|
567
|
+
```
|
|
568
|
+
|
|
569
|
+
## Code Judge SDK
|
|
570
|
+
|
|
571
|
+
Use `@agentv/eval` to build custom graders in TypeScript/JavaScript:
|
|
572
|
+
|
|
573
|
+
### defineAssertion (recommended for custom checks)
|
|
574
|
+
```typescript
#!/usr/bin/env bun
import { defineAssertion } from '@agentv/eval';

export default defineAssertion(({ answer, trace }) => ({
  pass: answer.length > 0 && (trace?.eventCount ?? 0) <= 10,
  reasoning: 'Checks content exists and is efficient',
}));
```
|
|
583
|
+
|
|
584
|
+
Assertions support both `pass: boolean` and `score: number` (0-1). If only `pass` is given, score is 1 (pass) or 0 (fail).
|
|
585
|
+
|
|
586
|
+
### defineCodeGrader (full control)
|
|
587
|
+
```typescript
#!/usr/bin/env bun
import { defineCodeGrader } from '@agentv/eval';

export default defineCodeGrader(({ trace, answer }) => ({
  // Treat a missing trace as zero events so the score agrees with the assertion below
  score: (trace?.eventCount ?? 0) <= 5 ? 1.0 : 0.5,
  assertions: [
    { text: 'Efficient tool usage', passed: (trace?.eventCount ?? 0) <= 5 },
  ],
}));
```
|
|
598
|
+
|
|
599
|
+
Both are used via `type: code-grader` in YAML with `command: [bun, run, grader.ts]`.
|
|
600
|
+
|
|
601
|
+
### Convention-Based Discovery
|
|
602
|
+
|
|
603
|
+
Place assertion files in `.agentv/assertions/` — they auto-register by filename:
|
|
604
|
+
|
|
605
|
+
```
|
|
606
|
+
.agentv/assertions/word-count.ts → type: word-count
|
|
607
|
+
.agentv/assertions/sentiment.ts → type: sentiment
|
|
608
|
+
```
|
|
609
|
+
|
|
610
|
+
No `command:` needed in YAML — just use `type: <filename>`.
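The filename-to-type convention can be sketched as a one-line mapping. `assertionTypeFromPath` is a hypothetical helper shown only to make the rule explicit; how AgentV actually resolves these files is not described in the source.

```typescript
// Illustrative sketch: derive the assertion type from a file path.
function assertionTypeFromPath(filePath: string): string {
  const fileName = filePath.split("/").pop() ?? filePath;
  // Drop the final extension, e.g. ".ts".
  return fileName.replace(/\.[^.]+$/, "");
}

const wordCountType = assertionTypeFromPath(".agentv/assertions/word-count.ts");
const sentimentType = assertionTypeFromPath(".agentv/assertions/sentiment.ts");
```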
|
|
611
|
+
|
|
612
|
+
## Programmatic API
|
|
613
|
+
|
|
614
|
+
Use `evaluate()` from `@agentv/core` to run evals as a library:
|
|
615
|
+
|
|
616
|
+
```typescript
import { evaluate } from '@agentv/core';

const { results, summary } = await evaluate({
  tests: [
    {
      id: 'greeting',
      input: 'Say hello',
      assertions: [{ type: 'contains', value: 'hello' }],
    },
  ],
  target: { provider: 'mock_agent' },
});
console.log(`${summary.passed}/${summary.total} passed`);
```
|
|
631
|
+
|
|
632
|
+
Supports inline tests (no YAML) or file-based via `specFile`.
|
|
633
|
+
|
|
634
|
+
## defineConfig
|
|
635
|
+
|
|
636
|
+
Type-safe project configuration in `agentv.config.ts`:
|
|
637
|
+
|
|
638
|
+
```typescript
import { defineConfig } from '@agentv/core';

export default defineConfig({
  execution: { workers: 5, maxRetries: 2 },
  output: { format: 'jsonl', dir: './results' },
  limits: { maxCostUsd: 10.0 },
});
```
|
|
647
|
+
|
|
648
|
+
Auto-discovered from project root. Validated with Zod.
|
|
649
|
+
|
|
650
|
+
## Scaffold Commands
|
|
651
|
+
|
|
652
|
+
```bash
|
|
653
|
+
agentv create assertion <name> # → .agentv/assertions/<name>.ts
|
|
654
|
+
agentv create eval <name> # → evals/<name>.eval.yaml + .cases.jsonl
|
|
655
|
+
```
|
|
656
|
+
|
|
657
|
+
## Skill Improvement Workflow
|
|
658
|
+
|
|
659
|
+
For a complete guide to iterating on skills using evaluations — writing scenarios, running baselines, comparing results, and improving — see the [Skill Improvement Workflow](https://agentv.dev/guides/skill-improvement-workflow/) guide.
|
|
660
|
+
## Human Review Checkpoint
|
|
661
|
+
|
|
662
|
+
After running evals, perform a human review before iterating. Create `feedback.json` in the results directory:
|
|
663
|
+
|
|
664
|
+
```json
{
  "run_id": "2026-03-14T10-32-00_claude",
  "reviewer": "engineer-name",
  "timestamp": "2026-03-14T12:00:00Z",
  "overall_notes": "Summary of observations",
  "per_case": [
    {
      "test_id": "test-id",
      "verdict": "acceptable | needs_improvement | incorrect | flaky",
      "notes": "Why this verdict",
      "evaluator_overrides": { "code-grader:name": "Override note" },
      "workspace_notes": "Workspace state observations"
    }
  ]
}
```
|
|
681
|
+
|
|
682
|
+
Use `evaluator_overrides` for workspace evaluations to annotate specific grader results (e.g., "code-grader was too strict"). Use `workspace_notes` for observations about workspace state.
|
|
683
|
+
|
|
684
|
+
Review workflow: run evals → inspect results (`agentv inspect show`) → write feedback → tune prompts/graders → re-run.
|
|
685
|
+
|
|
686
|
+
Full guide: https://agentv.dev/guides/human-review/
|
|
687
|
+
|
|
688
|
+
## Schemas
|
|
689
|
+
|
|
690
|
+
- Eval file: `references/eval-schema.json`
|
|
691
|
+
- Config: `references/config-schema.json`
|
|
692
|
+
|
|
693
|
+
## Accessing reference files
|
|
694
|
+
|
|
695
|
+
To load a specific reference without pulling the entire skill into context:
|
|
696
|
+
|
|
697
|
+
```bash
|
|
698
|
+
agentv skills get agentv-eval-writer --ref eval-schema.json
|
|
699
|
+
```
|
|
700
|
+
|
|
701
|
+
Or resolve the skill directory and read files directly:
|
|
702
|
+
|
|
703
|
+
```bash
|
|
704
|
+
cat $(agentv skills path agentv-eval-writer)/references/eval-schema.json
|
|
705
|
+
```
|
|
706
|
+
|
|
707
|
+
Use `--full` to retrieve every file in the skill at once.
|