agentv 4.26.1 → 4.27.0-next.1

Files changed (42)
  1. package/dist/{chunk-JA4WQNE6.js → chunk-47JX7NNZ.js} +10 -2
  2. package/dist/chunk-47JX7NNZ.js.map +1 -0
  3. package/dist/{chunk-XBUHMRX2.js → chunk-V3LWJB5X.js} +431 -49
  4. package/dist/chunk-V3LWJB5X.js.map +1 -0
  5. package/dist/cli.js +2 -2
  6. package/dist/index.js +2 -2
  7. package/dist/{interactive-YMKWKPD7.js → interactive-L6PIIFNQ.js} +2 -2
  8. package/dist/skills/agentv-bench/LICENSE.txt +202 -0
  9. package/dist/skills/agentv-bench/SKILL.md +459 -0
  10. package/dist/skills/agentv-bench/agents/analyzer.md +177 -0
  11. package/dist/skills/agentv-bench/agents/comparator.md +247 -0
  12. package/dist/skills/agentv-bench/agents/executor.md +30 -0
  13. package/dist/skills/agentv-bench/agents/grader.md +238 -0
  14. package/dist/skills/agentv-bench/agents/mutator.md +172 -0
  15. package/dist/skills/agentv-bench/references/autoresearch.md +309 -0
  16. package/dist/skills/agentv-bench/references/description-optimization.md +66 -0
  17. package/dist/skills/agentv-bench/references/environment-adaptation.md +82 -0
  18. package/dist/skills/agentv-bench/references/eval-yaml-spec.md +338 -0
  19. package/dist/skills/agentv-bench/references/migrating-from-skill-creator.md +103 -0
  20. package/dist/skills/agentv-bench/references/schemas.md +432 -0
  21. package/dist/skills/agentv-bench/references/subagent-pipeline.md +181 -0
  22. package/dist/skills/agentv-bench/scripts/trajectory.html +462 -0
  23. package/dist/skills/agentv-eval-review/SKILL.md +53 -0
  24. package/dist/skills/agentv-eval-review/scripts/lint_eval.py +239 -0
  25. package/dist/skills/agentv-eval-writer/SKILL.md +707 -0
  26. package/dist/skills/agentv-eval-writer/references/config-schema.json +63 -0
  27. package/dist/skills/agentv-eval-writer/references/custom-evaluators.md +119 -0
  28. package/dist/skills/agentv-eval-writer/references/eval-schema.json +19077 -0
  29. package/dist/skills/agentv-eval-writer/references/rubric-evaluator.md +114 -0
  30. package/dist/skills/agentv-governance/SKILL.md +79 -0
  31. package/dist/skills/agentv-governance/references/eu-ai-act-risk-tiers.md +37 -0
  32. package/dist/skills/agentv-governance/references/governance-yaml-shape.md +125 -0
  33. package/dist/skills/agentv-governance/references/iso-42001-controls.md +46 -0
  34. package/dist/skills/agentv-governance/references/lint-rules.md +169 -0
  35. package/dist/skills/agentv-governance/references/mitre-atlas.md +38 -0
  36. package/dist/skills/agentv-governance/references/owasp-agentic-top-10-2025.md +28 -0
  37. package/dist/skills/agentv-governance/references/owasp-llm-top-10-2025.md +25 -0
  38. package/dist/skills/agentv-trace-analyst/SKILL.md +161 -0
  39. package/package.json +1 -1
  40. package/dist/chunk-JA4WQNE6.js.map +0 -1
  41. package/dist/chunk-XBUHMRX2.js.map +0 -1
  42. package/dist/{interactive-YMKWKPD7.js.map → interactive-L6PIIFNQ.js.map} +0 -0
@@ -0,0 +1,707 @@
1
+ ---
2
+ name: agentv-eval-writer
3
+ description: >-
4
+ Write, edit, review, and validate AgentV EVAL.yaml / .eval.yaml evaluation files.
5
+ Use when asked to create new eval files, update or fix existing ones, add or remove test cases,
6
+ configure graders (`llm-grader`, `code-grader`, `rubrics`), review whether an eval is correct or complete,
7
+ convert between EVAL.yaml and evals.json using `agentv convert`, or generate eval test cases
8
+ from chat transcripts (markdown conversation or JSON messages).
9
+ Do NOT use for creating SKILL.md files, writing skill definitions, or running evals —
10
+ running and benchmarking belongs to agentv-bench.
11
+ ---
12
+
13
+ # AgentV Eval Writer
14
+
15
+ Comprehensive docs: https://agentv.dev
16
+
17
+ ## Evaluation Types
18
+
19
+ AgentV evaluations measure **execution quality** — whether your agent or skill produces correct output when invoked.
20
+
21
+ For **trigger quality** (whether the right skill is triggered for the right prompts), see the [Evaluation Types guide](https://agentv.dev/guides/evaluation-types/). Do not use execution eval configs (`EVAL.yaml`, `evals.json`) for trigger evaluation — these are distinct concerns requiring different tooling and methodologies.
22
+
23
+ ## Starting from evals.json?
24
+
25
+ If the project already has an Agent Skills `evals.json` file, use it as a starting point instead of writing YAML from scratch:
26
+
27
+ ```bash
28
+ # Convert evals.json to AgentV EVAL YAML
29
+ agentv convert evals.json
30
+
31
+ # Run directly without converting (all commands accept evals.json)
32
+ agentv eval evals.json
33
+ ```
34
+
35
+ The converter maps `prompt` → `input`, `expected_output` → `expected_output`, `assertions` → `assertions` (`llm-grader`), and resolves `files[]` paths. The generated YAML includes TODO comments for AgentV features to add (workspace setup, code graders, rubrics, required gates).
36
+
37
+ After converting, enhance the YAML with AgentV-specific capabilities shown below.
38
+
39
+ ## From Chat Transcript
40
+
41
+ Convert a chat conversation into eval test cases without starting from scratch.
42
+
43
+ **Input formats:**
44
+
45
+ Markdown conversation:
46
+ ```
47
+ User: How do I reset my password?
48
+ Assistant: Go to Settings > Security > Reset Password...
49
+ ```
50
+
51
+ JSON messages:
52
+ ```json
53
+ [{"role": "user", "content": "How do I reset my password?"},
54
+ {"role": "assistant", "content": "Go to Settings > Security > Reset Password..."}]
55
+ ```
56
+
57
+ **Select exchanges that make good test cases:**
58
+ - Factual Q&A — verifiable answers
59
+ - Task completion — user requests an action, agent performs it
60
+ - Edge cases — unusual inputs, error handling, boundary conditions
61
+ - Multi-turn reasoning — exchanges where earlier context matters
62
+
63
+ **Skip:** greetings, one-word acknowledgments, repeated exchanges
64
+
65
+ **Multi-turn format** (when context from prior turns matters):
66
+ ```yaml
67
+ tests:
68
+   - id: multi-turn-context
69
+     criteria: "Agent remembers prior context"
70
+     input:
71
+       - role: user
72
+         content: "My name is Alice"
73
+       - role: assistant
74
+         content: "Nice to meet you, Alice!"
75
+       - role: user
76
+         content: "What's my name?"
77
+     expected_output: "Your name is Alice."
78
+     assertions:
79
+       - type: rubrics
80
+         criteria:
81
+           - Correctly recalls the user's name from earlier in the conversation
82
+ ```
83
+
84
+ **Guidelines:** preserve exact wording in `expected_output`; aim for 5–15 tests per transcript; pick exchanges that test different capabilities.
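As a rough illustration of the Markdown input format above, the sketch below turns `User:` / `Assistant:` lines into the JSON message shape. `parse_transcript` is a hypothetical helper, not part of AgentV. It assumes every turn starts with one of those two labels and treats any other non-blank line as a continuation of the previous turn.

```python
import re

# Assumption: a turn starts with "User:" or "Assistant:" at the beginning of a line.
_TURN = re.compile(r"^(User|Assistant):\s*(.*)$")


def parse_transcript(text: str) -> list:
    """Convert a Markdown-style conversation into a list of {role, content} messages."""
    messages = []
    for line in text.splitlines():
        match = _TURN.match(line)
        if match:
            messages.append({"role": match.group(1).lower(), "content": match.group(2)})
        elif messages and line.strip():
            # Non-blank line without a label: continue the previous turn.
            messages[-1]["content"] += "\n" + line
    return messages


transcript = (
    "User: How do I reset my password?\n"
    "Assistant: Go to Settings > Security > Reset Password..."
)
for message in parse_transcript(transcript):
    print(message)
```

The resulting messages can then be reviewed by hand and the useful exchanges copied into `input` and `expected_output`.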
85
+
86
+ ## Quick Start
87
+
88
+ ```yaml
89
+ description: Example eval
90
+ execution:
91
+   target: default
92
+
93
+ tests:
94
+   - id: greeting
95
+     criteria: Friendly greeting
96
+     input: "Say hello"
97
+     expected_output: "Hello! How can I help you?"
98
+     assertions:
99
+       - type: rubrics
100
+         criteria:
101
+           - Greeting is friendly and warm
102
+           - Offers to help
103
+ ```
104
+
105
+ ## Eval File Structure
106
+
107
+ **Required:** `tests` (array or string path)
108
+ **Optional:** `name`, `description`, `version`, `author`, `tags`, `license`, `requires`, `execution`, `suite`, `workspace`, `assertions`, `input`
109
+
110
+ **Test fields:**
111
+
112
+ | Field | Required | Description |
113
+ |-------|----------|-------------|
114
+ | `id` | yes | Unique identifier |
115
+ | `criteria` | yes | What the response should accomplish |
116
+ | `input` | yes | Input to the agent |
117
+ | `expected_output` | no | Gold-standard reference answer |
118
+ | `assertions` | no | Graders: deterministic checks, rubrics, and LLM/code graders |
119
+ | `rubrics` | no | **Deprecated** — use `assertions: [{type: rubrics, criteria: [...]}]` instead |
120
+ | `execution` | no | Per-case execution overrides |
121
+ | `workspace` | no | Per-case workspace config (overrides suite-level) |
122
+ | `metadata` | no | Arbitrary key-value pairs passed to setup/teardown scripts |
123
+ | `conversation_id` | no | Thread grouping |
124
+
125
+ **Shorthand aliases:**
126
+ - `input` (string) expands to `[{role: "user", content: "..."}]`
127
+ - `expected_output` (string/object) expands to `[{role: "assistant", content: ...}]`
128
+ - Canonical `input` / `expected_output` take precedence when both present
129
+
130
+ **Message format:** `{role, content}` where role is `system`, `user`, `assistant`, or `tool`
131
+ **Content types:** inline text, `{type: "file", value: "./path.md"}`
132
+ **File paths:** relative from eval file dir, or absolute with `/` prefix from repo root
133
+ **File handling by provider type:** LLM providers receive file content inlined in XML tags. Agent providers receive a preread block with `file://` URIs and must read files themselves. See [Coding Agents > Prompt format](https://agentv.dev/targets/coding-agents#prompt-format).
134
+
135
+ **JSONL format:** One test per line as JSON. Optional `.yaml` sidecar for shared defaults. See `examples/features/basic-jsonl/`.
136
+
137
+ **Environment variables:** All string fields support `${{ VAR }}` interpolation. Missing vars resolve to empty string. Works in eval files, external case files, and workspace configs. `.env` files are loaded automatically.
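To make the interpolation rule concrete, here is a minimal sketch of how `${{ VAR }}` substitution with an empty-string fallback could behave. `interpolate` is a hypothetical helper and not AgentV's implementation; tolerating optional whitespace inside the braces is an assumption.

```python
import re

# Assumption: whitespace inside the braces is optional, so ${{VAR}} and ${{ VAR }} both match.
_PLACEHOLDER = re.compile(r"\$\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def interpolate(text: str, env: dict) -> str:
    """Replace each ${{ VAR }} with env[VAR]; missing variables become an empty string."""
    return _PLACEHOLDER.sub(lambda match: env.get(match.group(1), ""), text)


print(interpolate("key=${{ API_KEY }} region=${{ REGION }}", {"API_KEY": "abc"}))
```

Because missing variables resolve silently to an empty string, a typo in a variable name will not raise an error, so it is worth checking interpolated values when a run behaves unexpectedly.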
138
+
139
+ ## Metadata
140
+
141
+ When `name` is present, the suite is parsed as a metadata-bearing eval:
142
+
143
+ ```yaml
144
+ name: export-screening # required, lowercase/hyphens, max 64 chars
145
+ description: Evaluates export control screening accuracy
146
+ version: "1.0"
147
+ author: acme-compliance
148
+ tags: [compliance, agents]
149
+ license: Apache-2.0
150
+ requires:
151
+   agentv: ">=0.30.0"
152
+ ```
153
+
154
+ ## Suite-level Input
155
+
156
+ Prepend shared input messages to every test (like suite-level `assertions`). Avoids repeating the same prompt file in each test:
157
+
158
+ ```yaml
159
+ input:
160
+   - role: user
161
+     content:
162
+       - type: file
163
+         value: ./system-prompt.md
164
+
165
+ tests: ./cases.yaml
166
+
167
+ # cases.yaml — each test only needs its own query
168
+ # - id: test-1
169
+ #   criteria: ...
170
+ #   input: "User question here"
171
+ ```
172
+
173
+ Effective input: `[...suite input, ...test input]`. Skipped when `execution.skip_defaults: true`.
174
+ Accepts same formats as test `input` (string or message array).
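The merge rule can be sketched as follows. Both function names are hypothetical; the sketch only assumes what is stated above: string shorthand expands to a single user message, suite messages are prepended, and `skip_defaults` disables the prepend.

```python
def to_messages(value):
    """Expand the string shorthand into a one-element user message list."""
    if isinstance(value, str):
        return [{"role": "user", "content": value}]
    return list(value)


def effective_input(suite_input, test_input, skip_defaults=False):
    """Return [...suite input, ...test input], or only the test input when skipped."""
    test_messages = to_messages(test_input)
    if skip_defaults or suite_input is None:
        return test_messages
    return to_messages(suite_input) + test_messages


for message in effective_input("Shared instructions", "User question here"):
    print(message)
```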
175
+
176
+ ## Tests as String Path
177
+
178
+ Point `tests` to an external file instead of inlining:
179
+
180
+ ```yaml
181
+ name: my-eval
182
+ description: My evaluation suite
183
+ tests: ./cases.yaml # relative to eval file dir
184
+ ```
185
+
186
+ The external file can be YAML (array of test objects) or JSONL.
187
+
188
+ ## Assertions Field
189
+
190
+ `assertions` defines graders at the suite level or per-test level. It is the canonical field for all graders:
191
+
192
+ ```yaml
193
+ # Suite-level (appended to every test)
194
+ assertions:
195
+   - type: is-json
196
+     required: true
197
+   - type: contains
198
+     value: "status"
199
+
200
+ tests:
201
+   - id: test-1
202
+     criteria: Returns JSON
203
+     input: Get status
204
+     # Per-test assertions (run before suite-level)
205
+     assertions:
206
+       - type: equals
207
+         value: '{"status": "ok"}'
208
+ ```
209
+
210
+ ## How `criteria` and `assertions` Interact
211
+
212
+ `criteria` is a **data field** — it describes what the response should accomplish. It is **not** a grader. How it gets evaluated depends on whether `assertions` is present:
213
+
214
+ | Scenario | What happens | Warning? |
215
+ |----------|-------------|----------|
216
+ | `criteria` + **no `assertions`** | Implicit `llm-grader` runs automatically against `criteria` | No |
217
+ | `criteria` + **`assertions` with only deterministic graders** (contains, regex, etc.) | Only declared graders run. `criteria` is **not evaluated**. | Yes — warns that no grader will consume criteria |
218
+ | `criteria` + **`assertions` with a grader** (`llm-grader`, `code-grader`, `rubrics`) | Declared graders run. Graders receive `criteria` as input. | No |
219
+
220
+ ### No assertions → implicit llm-grader
221
+
222
+ The simplest path. `criteria` is automatically evaluated by the default `llm-grader`:
223
+
224
+ ```yaml
225
+ tests:
226
+   - id: simple-eval
227
+     criteria: Assistant correctly explains the bug and proposes a fix
228
+     input: "Debug this function..."
229
+     # No assertions → default llm-grader evaluates against criteria
230
+ ```
231
+
232
+ ### assertions present → no implicit grader
233
+
234
+ When `assertions` is defined, **only the declared graders run**. If you want an LLM grader alongside deterministic checks, declare it explicitly:
235
+
236
+ ```yaml
237
+ tests:
238
+   - id: mixed-eval
239
+     criteria: Response is helpful and mentions the fix
240
+     input: "Debug this function..."
241
+     assertions:
242
+       - type: llm-grader # must be explicit when assertions is present
243
+       - type: contains
244
+         value: "fix"
245
+ ```
246
+
247
+ **Common mistake:** defining `criteria` with only deterministic graders. The criteria will be ignored and a warning is emitted:
248
+
249
+ ```yaml
250
+ tests:
251
+   - id: bad-example
252
+     criteria: Gives a thoughtful answer # ⚠ NOT evaluated (no grader in assertions)
253
+     input: "What is 2+2?"
254
+     assertions:
255
+       - type: contains
256
+         value: "4"
257
+ # Warning: criteria is defined but no grader in assertions will evaluate it.
258
+ ```
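The three scenarios in the table above can be summarised as a small decision function. This is a sketch of the documented behaviour, not AgentV's code: `plan_grading` is a hypothetical name, it looks only at per-test assertions, and treating an empty `assertions` list the same as a missing one is an assumption.

```python
# Grader types that the table above lists as consuming `criteria`.
CRITERIA_CONSUMERS = {"llm-grader", "code-grader", "rubrics"}


def plan_grading(test: dict) -> dict:
    """Decide which graders run for a test and whether the criteria warning applies."""
    assertions = test.get("assertions")
    if not assertions:
        # No assertions: the implicit llm-grader evaluates `criteria`.
        return {"graders": ["llm-grader"], "implicit": True, "warn": False}
    types = [assertion["type"] for assertion in assertions]
    consumed = any(kind in CRITERIA_CONSUMERS for kind in types)
    # Criteria present but nothing consumes it: only declared graders run, with a warning.
    warn = bool(test.get("criteria")) and not consumed
    return {"graders": types, "implicit": False, "warn": warn}


bad_example = {
    "criteria": "Gives a thoughtful answer",
    "assertions": [{"type": "contains", "value": "4"}],
}
print(plan_grading(bad_example))
```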
259
+
260
+ ## Required Gates
261
+
262
+ Any grader can be marked `required` to enforce a minimum score:
263
+
264
+ ```yaml
265
+ assertions:
266
+   - type: contains
267
+     value: "DENIED"
268
+     required: true # must score >= 0.8 (default)
269
+   - type: rubrics
270
+     required: 0.6 # must score >= 0.6 (custom threshold)
271
+     criteria:
272
+       - id: accuracy
273
+         outcome: Identifies the denied party
274
+         weight: 5.0
275
+ ```
276
+
277
+ If a required grader scores below its threshold, the overall verdict is forced to `fail`.
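A minimal sketch of the gate logic, assuming only what is stated above: `required: true` means a minimum of 0.8 and a numeric `required` is used as the minimum directly. The function names and the result-dict shape are hypothetical.

```python
DEFAULT_REQUIRED_THRESHOLD = 0.8  # default minimum for `required: true`, per the text above


def required_threshold(required):
    """Map the `required` field to a numeric minimum, or None when the grader is not a gate."""
    if required is True:
        return DEFAULT_REQUIRED_THRESHOLD
    if isinstance(required, (int, float)) and not isinstance(required, bool):
        return float(required)
    return None


def failed_gates(results: list) -> list:
    """Return the names of required graders that scored below their threshold."""
    failed = []
    for result in results:
        minimum = required_threshold(result.get("required"))
        if minimum is not None and result["score"] < minimum:
            failed.append(result["name"])
    return failed


results = [
    {"name": "contains", "score": 1.0, "required": True},
    {"name": "rubrics", "score": 0.5, "required": 0.6},
    {"name": "style", "score": 0.1},  # not required, so it cannot force a fail
]
print(failed_gates(results))
```

Any non-empty result from `failed_gates` corresponds to the overall verdict being forced to `fail`.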
278
+
279
+ ## Workspace Setup/Teardown
280
+
281
+ Run scripts before/after each test. Define at suite level or override per case:
282
+
283
+ ```yaml
284
+ workspace:
285
+ template: ./workspace-templates/my-project
286
+ setup:
287
+ command: ["bun", "run", "setup.ts"]
288
+ timeout_ms: 120000
289
+ teardown:
290
+ command: ["bun", "run", "teardown.ts"]
291
+
292
+ tests:
293
+ - id: case-1
294
+ input: Fix the bug
295
+ criteria: Bug is fixed
296
+ metadata:
297
+ repo: sympy/sympy
298
+ workspace:
299
+ repos:
300
+ - path: /testbed
301
+ source:
302
+ type: git
303
+ url: https://github.com/sympy/sympy.git
304
+ checkout:
305
+ base_commit: "abc123"
306
+ docker:
307
+ image: swebench/sweb.eval.django__django:latest
308
+ ```
309
+
310
+ **Lifecycle:** template copy → repo clone → setup → git baseline → agent → file changes → teardown → repo reset → cleanup
311
+ **Merge:** Case-level fields replace suite-level fields.
312
+ **Commands receive stdin JSON:** `{workspace_path, test_id, eval_run_id, case_input, case_metadata}`
313
+ **Setup failure:** aborts case. **Teardown failure:** non-fatal (warning).
314
+ For SWE-bench-style evals, keep operational checkout state under `workspace.repos[].checkout.base_commit`; treat `metadata.base_commit` as informational only.
315
+
316
+ ### Repository Lifecycle
317
+
318
+ Clone repos into workspace automatically. For shared repo workspaces, pooling is the default:
319
+
320
+ ```yaml
321
+ workspace:
322
+ repos:
323
+ - path: ./repo
324
+ source:
325
+ type: git
326
+ url: https://github.com/org/repo.git
327
+ checkout:
328
+ ref: main
329
+ ancestor: 1 # parent commit
330
+ clone:
331
+ depth: 10
332
+ hooks:
333
+ after_each:
334
+ reset: fast # none | fast | strict
335
+ isolation: shared # shared | per_test
336
+ mode: pooled # pooled | temp | static
337
+ hooks:
338
+ enabled: true # set false to skip all hooks
339
+ ```
340
+
341
+ - `source.type`: `git` (URL) or `local` (path)
342
+ - `checkout.resolve`: `remote` (ls-remote) or `local`
343
+ - `clone.depth`: shallow clone depth
344
+ - `clone.filter`: partial clone filter (e.g., `blob:none`)
345
+ - `clone.sparse`: sparse checkout paths array
346
+ - `mode`: `pooled` (default for shared repos), `temp`, or `static`
347
+ - `path`: workspace path used when `mode: static`; when empty/missing the workspace is auto-materialised (template copied + repos cloned); populated dirs are reused as-is
348
+ - `hooks.enabled`: boolean (default `true`); set `false` to skip all lifecycle hooks
349
+ - Pool reset defaults to `fast` (`git clean -fd`); use `--workspace-clean full` for strict reset (`git clean -fdx`)
350
+ - Pool entries are managed separately via `agentv workspace list` and `agentv workspace clean`
351
+ - `agentv workspace deps <eval-paths>` scans eval files and outputs a JSON manifest of required git repos (useful for CI pre-cloning)
352
+
353
+ See https://agentv.dev/targets/configuration/#repository-lifecycle
354
+
355
+ ## Grader Types
356
+
357
+ Configure via `assertions` array. Multiple graders produce a weighted average score.
358
+
359
+ ### code_grader
360
+ ```yaml
361
+ - name: format_check
362
+   type: code-grader
363
+   command: [uv, run, validate.py]
364
+   cwd: ./scripts # optional working directory
365
+   target: {} # optional: enable LLM target proxy (max_calls: 50)
366
+ ```
367
+ Contract: stdin JSON -> stdout JSON `{score, assertions: [{text, passed, evidence?}], reasoning}`
368
+ Input includes: `question`, `criteria`, `answer`, `reference_answer`, `output`, `trace`, `token_usage`, `cost_usd`, `duration_ms`, `start_time`, `end_time`, `file_changes`, `workspace_path`, `config`
369
+ When a workspace is configured, `workspace_path` is the absolute path to the workspace dir (also available as `AGENTV_WORKSPACE_PATH` env var). Use this for functional grading (e.g., running `npm test` in the workspace).
370
+ See docs at https://agentv.dev/graders/code-graders/
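The stdin/stdout contract above could be satisfied by a script along these lines. This is a sketch, not the project's `validate.py`: the specific checks (answer is a JSON object with a `status` key) and the proportional score are illustrative choices, and `grade` / `run` are hypothetical names. Only the `answer` input field and the `{score, assertions, reasoning}` output shape come from the contract.

```python
import io
import json


def grade(payload: dict) -> dict:
    """Score the `answer` field: one entry per check, score = passed / total."""
    answer = payload.get("answer", "")
    try:
        parsed = json.loads(answer)
    except (TypeError, ValueError):
        parsed = None
    is_object = isinstance(parsed, dict)
    has_status = is_object and "status" in parsed
    checks = [
        {"text": "Answer is a JSON object", "passed": is_object},
        {"text": 'Answer has a "status" key', "passed": has_status},
    ]
    passed = sum(1 for check in checks if check["passed"])
    return {
        "score": passed / len(checks),
        "assertions": checks,
        "reasoning": f"{passed}/{len(checks)} checks passed",
    }


def run(stdin, stdout) -> None:
    """Read one JSON payload from stdin and write the result JSON to stdout."""
    json.dump(grade(json.load(stdin)), stdout)


# Demo with in-memory streams; a real grader script would call run(sys.stdin, sys.stdout).
demo_out = io.StringIO()
run(io.StringIO(json.dumps({"answer": '{"status": "ok"}'})), demo_out)
print(demo_out.getvalue())
```

Keeping the scoring in a pure function (`grade`) separate from the I/O wrapper makes the grader easy to unit-test without spawning a process.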
371
+
372
+ ### llm_grader
373
+ ```yaml
374
+ - name: quality
375
+   type: llm-grader
376
+   prompt: ./prompts/eval.md # markdown template or command config
377
+   target: grader_gpt_5_mini # optional: override the grader target for this grader
378
+   model: gpt-5-chat # optional model override
379
+   config: # passed to prompt templates as context.config
380
+     strictness: high
381
+ ```
382
+ Variables: `{{question}}`, `{{criteria}}`, `{{answer}}`, `{{reference_answer}}`, `{{input}}`, `{{expected_output}}`, `{{output}}`, `{{file_changes}}`
383
+ - Markdown templates: use `{{variable}}` syntax
384
+ - TypeScript templates: use `definePromptTemplate(fn)` from `@agentv/eval`, receives context object with all variables + `config`
385
+ - Use `target:` to run different `llm-grader` graders against different named LLM targets in the same eval (useful for grader panels / ensembles)
386
+
387
+ ### composite
388
+ ```yaml
389
+ - name: gate
390
+   type: composite
391
+   assertions:
392
+     - name: safety
393
+       type: llm-grader
394
+       prompt: ./safety.md
395
+     - name: quality
396
+       type: llm-grader
397
+   aggregator:
398
+     type: weighted_average
399
+     weights: { safety: 0.3, quality: 0.7 }
400
+ ```
401
+ Aggregator types: `weighted_average`, `all_or_nothing`, `minimum`, `maximum`, `safety_gate`
402
+ - `safety_gate`: fails immediately if the named gate grader scores below threshold (default 1.0)
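For intuition, the `weighted_average` aggregator can be read as the usual weighted mean over the named child graders. The sketch below is an assumption about its arithmetic, not AgentV's implementation; in particular, defaulting a missing weight to 1.0 is a guess. The other aggregator types are not modelled here.

```python
def weighted_average(scores: dict, weights: dict) -> float:
    """Weighted mean of child grader scores, keyed by grader name."""
    # Assumption: a grader without an explicit weight counts with weight 1.0.
    total_weight = sum(weights.get(name, 1.0) for name in scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(score * weights.get(name, 1.0) for name, score in scores.items())
    return weighted_sum / total_weight


# Weights taken from the YAML example above; the scores are made up for illustration.
print(weighted_average({"safety": 1.0, "quality": 0.5}, {"safety": 0.3, "quality": 0.7}))
```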
403
+
404
+ ### tool_trajectory
405
+ ```yaml
406
+ - name: tool_check
407
+   type: tool-trajectory
408
+   mode: any_order # any_order | in_order | exact
409
+   minimums: # for any_order
410
+     knowledgeSearch: 2
411
+   expected: # for in_order/exact
412
+     - tool: knowledgeSearch
413
+       args: { query: "search term" } # partial deep equality match
414
+     - tool: documentRetrieve
415
+       args: any # any arguments accepted
416
+       max_duration_ms: 5000 # per-tool latency assertion
417
+     - tool: summarize # omit args to skip argument checking
418
+ ```
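One plausible reading of the `any_order` and `in_order` modes is sketched below: `any_order` checks per-tool call counts against `minimums`, and `in_order` checks that the expected tools occur as a subsequence of the actual calls. These semantics are an assumption for illustration; the function names are hypothetical and argument matching and latency checks are left out.

```python
from collections import Counter


def meets_minimums(calls: list, minimums: dict) -> bool:
    """any_order: every tool in `minimums` was called at least that many times."""
    counts = Counter(calls)
    return all(counts[tool] >= minimum for tool, minimum in minimums.items())


def appears_in_order(calls: list, expected: list) -> bool:
    """in_order: `expected` is a subsequence of `calls` (other calls may be interleaved)."""
    remaining = iter(calls)
    # `tool in remaining` advances the iterator, so each match must come after the last one.
    return all(tool in remaining for tool in expected)


calls = ["knowledgeSearch", "knowledgeSearch", "documentRetrieve", "summarize"]
print(meets_minimums(calls, {"knowledgeSearch": 2}))
print(appears_in_order(calls, ["knowledgeSearch", "documentRetrieve", "summarize"]))
```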
419
+
420
+ ### field_accuracy
421
+ ```yaml
422
+ - name: fields
423
+   type: field-accuracy
424
+   match_type: exact # exact | date | numeric_tolerance
425
+   numeric_tolerance: 0.01 # for numeric_tolerance match_type
426
+   aggregation: weighted_average # weighted_average | all_or_nothing
427
+ ```
428
+ Compares `output` fields against `expected_output` fields.
429
+
430
+ ### latency
431
+ ```yaml
432
+ - name: speed
433
+   type: latency
434
+   max_ms: 5000
435
+ ```
436
+
437
+ ### cost
438
+ ```yaml
439
+ - name: budget
440
+   type: cost
441
+   max_usd: 0.10
442
+ ```
443
+
444
+ ### token_usage
445
+ ```yaml
446
+ - name: tokens
447
+   type: token-usage
448
+   max_total_tokens: 4000
449
+ ```
450
+
451
+ ### execution_metrics
452
+ ```yaml
453
+ - name: efficiency
454
+   type: execution-metrics
455
+   max_tool_calls: 10 # Maximum tool invocations
456
+   max_llm_calls: 5 # Maximum LLM calls (assistant messages)
457
+   max_tokens: 5000 # Maximum total tokens (input + output)
458
+   max_cost_usd: 0.05 # Maximum cost in USD
459
+   max_duration_ms: 30000 # Maximum execution duration
460
+   target_exploration_ratio: 0.6 # Target ratio of read-only tool calls
461
+   exploration_tolerance: 0.2 # Tolerance for ratio check (default: 0.2)
462
+ ```
463
+ Declarative threshold-based checks on execution metrics. Only specified thresholds are checked.
464
+ Score is proportional: `passed / total` assertions. Missing data counts as a failed assertion.
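The proportional scoring rule can be sketched as below for the `max_*` thresholds. The observed-metric key names (for example `tool_calls` for `max_tool_calls`) are hypothetical, the exploration-ratio check is omitted, and returning 1.0 when no thresholds are configured is an assumption.

```python
def score_execution_metrics(thresholds: dict, observed: dict) -> float:
    """Return passed / total over the configured `max_*` thresholds."""
    results = []
    for key, limit in thresholds.items():
        # Hypothetical naming: `max_tool_calls` is checked against observed["tool_calls"].
        metric = key[len("max_"):] if key.startswith("max_") else key
        value = observed.get(metric)
        # Missing data counts as a failed assertion.
        results.append(value is not None and value <= limit)
    # Assumption: with no thresholds configured there is nothing to fail.
    return sum(results) / len(results) if results else 1.0


thresholds = {"max_tool_calls": 10, "max_llm_calls": 5, "max_cost_usd": 0.05}
observed = {"tool_calls": 12, "llm_calls": 3}  # cost is missing, so that check fails
print(score_execution_metrics(thresholds, observed))
```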
465
+
466
+ ### contains
467
+ ```yaml
468
+ - type: contains
469
+   value: "DENIED"
470
+   required: true
471
+ ```
472
+ Binary check: does output contain the substring? Name auto-generated if omitted.
473
+
474
+ ### regex
475
+ ```yaml
476
+ - type: regex
477
+   value: "\\d{3}-\\d{2}-\\d{4}"
478
+ ```
479
+ Binary check: does output match the regex pattern?
480
+
481
+ ### equals
482
+ ```yaml
483
+ - type: equals
484
+   value: "42"
485
+ ```
486
+ Binary check: does output exactly equal the value (both trimmed)?
487
+
488
+ ### is_json
489
+ ```yaml
490
+ - type: is-json
491
+   required: true
492
+ ```
493
+ Binary check: is the output valid JSON?
494
+
495
+ ### rubrics
496
+ ```yaml
497
+ - type: rubrics
498
+   criteria:
499
+     - id: accuracy
500
+       outcome: Correctly identifies the denied party
501
+       weight: 5.0
502
+     - id: reasoning
503
+       outcome: Provides clear reasoning
504
+       weight: 3.0
505
+ ```
506
+ LLM-judged structured evaluation with weighted criteria. Criteria items support `id`, `outcome`, `weight`, and `required` fields.
507
+
508
+ ### rubrics (inline, deprecated)
509
+ Top-level `rubrics:` field is deprecated. Use `type: rubrics` under `assertions` instead.
510
+ See `references/rubric-evaluator.md` for score-range mode and scoring formula.
511
+
512
+ ## Execution Error Tolerance
513
+
514
+ Control how the runner handles execution errors (infrastructure failures, not quality failures):
515
+
516
+ ```yaml
517
+ execution:
518
+   fail_on_error: false # never halt (default)
519
+   # fail_on_error: true # halt on first execution error
520
+ ```
521
+
522
+ When halted, remaining tests get `executionStatus: 'execution_error'` with `failureReasonCode: 'error_threshold_exceeded'`.
523
+
524
+ ## Suite-Level Quality Threshold
525
+
526
+ Set a minimum mean score for the eval suite. If the mean quality score falls below the threshold, the CLI exits with code 1 — useful for CI/CD quality gates.
527
+
528
+ ```yaml
529
+ execution:
530
+   threshold: 0.8
531
+ ```
532
+
533
+ CLI flag `--threshold 0.8` overrides the YAML value. Must be a number between 0 and 1. Mean score is computed from quality results only (execution errors excluded).
534
+
535
+ The threshold also controls JUnit XML pass/fail: tests with scores below the threshold are marked as `<failure>`. When no threshold is set, JUnit defaults to 0.5.
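The suite-level gate amounts to a mean over quality results followed by a comparison. The sketch below assumes each result carries a `score` and an `executionStatus` field (the status value `execution_error` appears earlier in this document); the function name is hypothetical, and returning 0 when there are no quality results is an assumption.

```python
def suite_exit_code(results: list, threshold: float) -> int:
    """Return 1 when the mean quality score is below the threshold, else 0."""
    # Execution errors are excluded from the mean.
    scores = [r["score"] for r in results if r.get("executionStatus") != "execution_error"]
    if not scores:
        return 0  # assumption: nothing to gate on when no quality results exist
    mean = sum(scores) / len(scores)
    return 1 if mean < threshold else 0


results = [
    {"score": 0.9},
    {"score": 0.6},
    {"score": 0.0, "executionStatus": "execution_error"},  # ignored by the mean
]
print(suite_exit_code(results, 0.8))
```

With these made-up scores the mean over quality results is 0.75, so a threshold of 0.8 fails the gate while a threshold of 0.7 passes it.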
536
+
537
+ ## CLI Commands
538
+
539
+ ```bash
540
+ # Run evaluation (requires API keys)
541
+ agentv eval <file.yaml> [--test-id <id>] [--target <name>] [--dry-run] [--threshold <0-1>]
542
+
543
+ # Run with OTLP JSON file (importable by OTel backends)
544
+ agentv eval <file.yaml> --otel-file traces/eval.otlp.json
545
+
546
+ # Run a single assertion in isolation (no API keys needed)
547
+ agentv eval assert <grader-name> --agent-output "..." --agent-input "..."
548
+
549
+ # Import agent transcripts for offline grading
550
+ agentv import claude --session-id <uuid>
551
+
552
+ # Re-run only execution errors from a previous run
553
+ agentv eval <file.yaml> --retry-errors .agentv/results/runs/<timestamp>/index.jsonl
554
+
555
+ # Validate eval file
556
+ agentv validate <file.yaml>
557
+
558
+ # Compare results — N-way matrix from a canonical run manifest
559
+ agentv compare .agentv/results/runs/<timestamp>/index.jsonl
560
+ agentv compare .agentv/results/runs/<timestamp>/index.jsonl --baseline <target> # CI regression gate
561
+ agentv compare .agentv/results/runs/<timestamp>/index.jsonl --baseline <target> --candidate <target> # pairwise
562
+ agentv compare .agentv/results/runs/<baseline-timestamp>/index.jsonl .agentv/results/runs/<candidate-timestamp>/index.jsonl
563
+
564
+ # Author assertions directly in the eval file
565
+ # Prefer simple assertions when they fit the criteria; use deterministic or LLM-based graders when needed
566
+ agentv validate <file.yaml>
567
+ ```
568
+
569
+ ## Code Grader SDK
570
+
571
+ Use `@agentv/eval` to build custom graders in TypeScript/JavaScript:
572
+
573
+ ### defineAssertion (recommended for custom checks)
574
+ ```typescript
575
+ #!/usr/bin/env bun
576
+ import { defineAssertion } from '@agentv/eval';
577
+
578
+ export default defineAssertion(({ answer, trace }) => ({
579
+ pass: answer.length > 0 && (trace?.eventCount ?? 0) <= 10,
580
+ reasoning: 'Checks content exists and is efficient',
581
+ }));
582
+ ```
583
+
584
+ Assertions support both `pass: boolean` and `score: number` (0-1). If only `pass` is given, score is 1 (pass) or 0 (fail).
585
+
586
+ ### defineCodeGrader (full control)
587
+ ```typescript
588
+ #!/usr/bin/env bun
589
+ import { defineCodeGrader } from '@agentv/eval';
590
+
591
+ export default defineCodeGrader(({ trace, answer }) => ({
592
+ score: (trace?.eventCount ?? 0) <= 5 ? 1.0 : 0.5,
593
+ assertions: [
594
+ { text: 'Efficient tool usage', passed: (trace?.eventCount ?? 0) <= 5 },
595
+ ],
596
+ }));
597
+ ```
598
+
599
+ Both are used via `type: code-grader` in YAML with `command: [bun, run, grader.ts]`.
600
+
601
+ ### Convention-Based Discovery
602
+
603
+ Place assertion files in `.agentv/assertions/` — they auto-register by filename:
604
+
605
+ ```
606
+ .agentv/assertions/word-count.ts → type: word-count
607
+ .agentv/assertions/sentiment.ts → type: sentiment
608
+ ```
609
+
610
+ No `command:` needed in YAML — just use `type: <filename>`.
611
+
612
+ ## Programmatic API
613
+
614
+ Use `evaluate()` from `@agentv/core` to run evals as a library:
615
+
616
+ ```typescript
617
+ import { evaluate } from '@agentv/core';
618
+
619
+ const { results, summary } = await evaluate({
620
+ tests: [
621
+ {
622
+ id: 'greeting',
623
+ input: 'Say hello',
624
+ assertions: [{ type: 'contains', value: 'hello' }],
625
+ },
626
+ ],
627
+ target: { provider: 'mock_agent' },
628
+ });
629
+ console.log(`${summary.passed}/${summary.total} passed`);
630
+ ```
631
+
632
+ Supports inline tests (no YAML) or file-based via `specFile`.
633
+
634
+ ## defineConfig
635
+
636
+ Type-safe project configuration in `agentv.config.ts`:
637
+
638
+ ```typescript
639
+ import { defineConfig } from '@agentv/core';
640
+
641
+ export default defineConfig({
642
+ execution: { workers: 5, maxRetries: 2 },
643
+ output: { format: 'jsonl', dir: './results' },
644
+ limits: { maxCostUsd: 10.0 },
645
+ });
646
+ ```
647
+
648
+ Auto-discovered from project root. Validated with Zod.
649
+
650
+ ## Scaffold Commands
651
+
652
+ ```bash
653
+ agentv create assertion <name> # → .agentv/assertions/<name>.ts
654
+ agentv create eval <name> # → evals/<name>.eval.yaml + .cases.jsonl
655
+ ```
656
+
657
+ ## Skill Improvement Workflow
658
+
659
+ For a complete guide to iterating on skills using evaluations — writing scenarios, running baselines, comparing results, and improving — see the [Skill Improvement Workflow](https://agentv.dev/guides/skill-improvement-workflow/) guide.
660
+ ## Human Review Checkpoint
661
+
662
+ After running evals, perform a human review before iterating. Create `feedback.json` in the results directory:
663
+
664
+ ```json
665
+ {
666
+ "run_id": "2026-03-14T10-32-00_claude",
667
+ "reviewer": "engineer-name",
668
+ "timestamp": "2026-03-14T12:00:00Z",
669
+ "overall_notes": "Summary of observations",
670
+ "per_case": [
671
+ {
672
+ "test_id": "test-id",
673
+ "verdict": "acceptable | needs_improvement | incorrect | flaky",
674
+ "notes": "Why this verdict",
675
+ "evaluator_overrides": { "code-grader:name": "Override note" },
676
+ "workspace_notes": "Workspace state observations"
677
+ }
678
+ ]
679
+ }
680
+ ```
681
+
682
+ Use `evaluator_overrides` for workspace evaluations to annotate specific grader results (e.g., "code-grader was too strict"). Use `workspace_notes` for observations about workspace state.
683
+
684
+ Review workflow: run evals → inspect results (`agentv inspect show`) → write feedback → tune prompts/graders → re-run.
685
+
686
+ Full guide: https://agentv.dev/guides/human-review/
687
+
688
+ ## Schemas
689
+
690
+ - Eval file: `references/eval-schema.json`
691
+ - Config: `references/config-schema.json`
692
+
693
+ ## Accessing reference files
694
+
695
+ To load a specific reference without pulling the entire skill into context:
696
+
697
+ ```bash
698
+ agentv skills get agentv-eval-writer --ref eval-schema.json
699
+ ```
700
+
701
+ Or resolve the skill directory and read files directly:
702
+
703
+ ```bash
704
+ cat $(agentv skills path agentv-eval-writer)/references/eval-schema.json
705
+ ```
706
+
707
+ Use `--full` to retrieve every file in the skill at once.