agentv 4.26.1 → 4.27.0-next.1

Files changed (42)
  1. package/dist/{chunk-JA4WQNE6.js → chunk-47JX7NNZ.js} +10 -2
  2. package/dist/chunk-47JX7NNZ.js.map +1 -0
  3. package/dist/{chunk-XBUHMRX2.js → chunk-V3LWJB5X.js} +431 -49
  4. package/dist/chunk-V3LWJB5X.js.map +1 -0
  5. package/dist/cli.js +2 -2
  6. package/dist/index.js +2 -2
  7. package/dist/{interactive-YMKWKPD7.js → interactive-L6PIIFNQ.js} +2 -2
  8. package/dist/skills/agentv-bench/LICENSE.txt +202 -0
  9. package/dist/skills/agentv-bench/SKILL.md +459 -0
  10. package/dist/skills/agentv-bench/agents/analyzer.md +177 -0
  11. package/dist/skills/agentv-bench/agents/comparator.md +247 -0
  12. package/dist/skills/agentv-bench/agents/executor.md +30 -0
  13. package/dist/skills/agentv-bench/agents/grader.md +238 -0
  14. package/dist/skills/agentv-bench/agents/mutator.md +172 -0
  15. package/dist/skills/agentv-bench/references/autoresearch.md +309 -0
  16. package/dist/skills/agentv-bench/references/description-optimization.md +66 -0
  17. package/dist/skills/agentv-bench/references/environment-adaptation.md +82 -0
  18. package/dist/skills/agentv-bench/references/eval-yaml-spec.md +338 -0
  19. package/dist/skills/agentv-bench/references/migrating-from-skill-creator.md +103 -0
  20. package/dist/skills/agentv-bench/references/schemas.md +432 -0
  21. package/dist/skills/agentv-bench/references/subagent-pipeline.md +181 -0
  22. package/dist/skills/agentv-bench/scripts/trajectory.html +462 -0
  23. package/dist/skills/agentv-eval-review/SKILL.md +53 -0
  24. package/dist/skills/agentv-eval-review/scripts/lint_eval.py +239 -0
  25. package/dist/skills/agentv-eval-writer/SKILL.md +707 -0
  26. package/dist/skills/agentv-eval-writer/references/config-schema.json +63 -0
  27. package/dist/skills/agentv-eval-writer/references/custom-evaluators.md +119 -0
  28. package/dist/skills/agentv-eval-writer/references/eval-schema.json +19077 -0
  29. package/dist/skills/agentv-eval-writer/references/rubric-evaluator.md +114 -0
  30. package/dist/skills/agentv-governance/SKILL.md +79 -0
  31. package/dist/skills/agentv-governance/references/eu-ai-act-risk-tiers.md +37 -0
  32. package/dist/skills/agentv-governance/references/governance-yaml-shape.md +125 -0
  33. package/dist/skills/agentv-governance/references/iso-42001-controls.md +46 -0
  34. package/dist/skills/agentv-governance/references/lint-rules.md +169 -0
  35. package/dist/skills/agentv-governance/references/mitre-atlas.md +38 -0
  36. package/dist/skills/agentv-governance/references/owasp-agentic-top-10-2025.md +28 -0
  37. package/dist/skills/agentv-governance/references/owasp-llm-top-10-2025.md +25 -0
  38. package/dist/skills/agentv-trace-analyst/SKILL.md +161 -0
  39. package/package.json +1 -1
  40. package/dist/chunk-JA4WQNE6.js.map +0 -1
  41. package/dist/chunk-XBUHMRX2.js.map +0 -1
  42. /package/dist/{interactive-YMKWKPD7.js.map → interactive-L6PIIFNQ.js.map} +0 -0
@@ -0,0 +1,432 @@
1
+ # JSON Schemas
2
+
3
+ This document defines the JSON schemas used by skill-creator.
4
+
5
+ ---
6
+
7
+ ## evals.json
8
+
9
+ Defines the evals for a skill. Located at `evals/evals.json` within the skill directory.
10
+
11
+ ```json
12
+ {
13
+ "skill_name": "example-skill",
14
+ "evals": [
15
+ {
16
+ "id": 1,
17
+ "prompt": "User's example prompt",
18
+ "expected_output": "Description of expected result",
19
+ "files": ["evals/files/sample1.pdf"],
20
+ "assertions": [
21
+ "The output includes X",
22
+ "The skill used script Y"
23
+ ]
24
+ }
25
+ ]
26
+ }
27
+ ```
28
+
29
+ **Fields:**
30
+ - `skill_name`: Name matching the skill's frontmatter
31
+ - `evals[].id`: Unique integer identifier
32
+ - `evals[].prompt`: The task to execute
33
+ - `evals[].expected_output`: Human-readable description of success
34
+ - `evals[].files`: Optional list of input file paths (relative to skill root)
35
+ - `evals[].assertions`: List of verifiable statements
36
+
37
+ ---
38
+
39
+ ## history.json
40
+
41
+ Tracks version progression in Improve mode. Located at workspace root.
42
+
43
+ ```json
44
+ {
45
+ "started_at": "2026-01-15T10:30:00Z",
46
+ "skill_name": "pdf",
47
+ "current_best": "v2",
48
+ "iterations": [
49
+ {
50
+ "version": "v0",
51
+ "parent": null,
52
+ "assertion_pass_rate": 0.65,
53
+ "grading_result": "baseline",
54
+ "is_current_best": false
55
+ },
56
+ {
57
+ "version": "v1",
58
+ "parent": "v0",
59
+ "assertion_pass_rate": 0.75,
60
+ "grading_result": "won",
61
+ "is_current_best": false
62
+ },
63
+ {
64
+ "version": "v2",
65
+ "parent": "v1",
66
+ "assertion_pass_rate": 0.85,
67
+ "grading_result": "won",
68
+ "is_current_best": true
69
+ }
70
+ ]
71
+ }
72
+ ```
73
+
74
+ **Fields:**
75
+ - `started_at`: ISO timestamp of when improvement started
76
+ - `skill_name`: Name of the skill being improved
77
+ - `current_best`: Version identifier of the best performer
78
+ - `iterations[].version`: Version identifier (v0, v1, ...)
79
+ - `iterations[].parent`: Parent version this was derived from
80
+ - `iterations[].assertion_pass_rate`: Pass rate from grading
81
+ - `iterations[].grading_result`: "baseline", "won", "lost", or "tie"
82
+ - `iterations[].is_current_best`: Whether this is the current best version
83
+
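The field descriptions above imply a few invariants that can be checked mechanically. The following is a minimal sketch, not part of the package: `check_history` is a hypothetical helper, and the rules it enforces (exactly one iteration flagged as best, parents that exist, a known `grading_result`) are assumptions read off the example.

```python
ALLOWED_RESULTS = {"baseline", "won", "lost", "tie"}

def check_history(history: dict) -> list[str]:
    """Return a list of consistency problems; an empty list means none were found."""
    problems = []
    iterations = history.get("iterations", [])
    versions = {it["version"] for it in iterations}

    # Assumption: exactly one iteration is flagged, and it matches current_best.
    flagged = [it["version"] for it in iterations if it.get("is_current_best")]
    if flagged != [history.get("current_best")]:
        problems.append(f"current_best mismatch: flagged={flagged}")

    for it in iterations:
        parent = it.get("parent")
        if parent is not None and parent not in versions:
            problems.append(f"{it['version']}: unknown parent {parent!r}")
        if it.get("grading_result") not in ALLOWED_RESULTS:
            problems.append(f"{it['version']}: bad grading_result")
    return problems

example = {
    "current_best": "v1",
    "iterations": [
        {"version": "v0", "parent": None, "grading_result": "baseline", "is_current_best": False},
        {"version": "v1", "parent": "v0", "grading_result": "won", "is_current_best": True},
    ],
}
print(check_history(example))
```

An empty list means the file is internally consistent under these assumed rules.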
84
+ ---
85
+
86
+ ## grading.json
87
+
88
+ Output from the grader agent. Located at `<run-dir>/grading.json`.
89
+
90
+ **Important:** The `assertions` array must use the fields `text`, `passed`, and `evidence` — downstream tooling depends on these exact field names.
91
+
92
+ ```json
93
+ {
94
+ "assertions": [
95
+ {
96
+ "text": "The output includes the name 'John Smith'",
97
+ "passed": true,
98
+ "evidence": "Found in transcript Step 3: 'Extracted names: John Smith, Sarah Johnson'"
99
+ },
100
+ {
101
+ "text": "The spreadsheet has a SUM formula in cell B10",
102
+ "passed": false,
103
+ "evidence": "No spreadsheet was created. The output was a text file."
104
+ }
105
+ ],
106
+ "summary": {
107
+ "passed": 2,
108
+ "failed": 1,
109
+ "total": 3,
110
+ "pass_rate": 0.67
111
+ },
112
+ "execution_metrics": {
113
+ "tool_calls": {
114
+ "Read": 5,
115
+ "Write": 2,
116
+ "Bash": 8
117
+ },
118
+ "total_tool_calls": 15,
119
+ "total_steps": 6,
120
+ "errors_encountered": 0,
121
+ "output_chars": 12450,
122
+ "transcript_chars": 3200
123
+ },
124
+ "timing": {
125
+ "executor_duration_seconds": 165.0,
126
+ "grader_duration_seconds": 26.0,
127
+ "total_duration_seconds": 191.0
128
+ },
129
+ "claims": [
130
+ {
131
+ "claim": "The form has 12 fillable fields",
132
+ "type": "factual",
133
+ "verified": true,
134
+ "evidence": "Counted 12 fields in field_info.json"
135
+ }
136
+ ],
137
+ "user_notes_summary": {
138
+ "uncertainties": ["Used 2023 data, may be stale"],
139
+ "needs_review": [],
140
+ "workarounds": ["Fell back to text overlay for non-fillable fields"]
141
+ },
142
+ "eval_feedback": {
143
+ "suggestions": [
144
+ {
145
+ "assertion": "The output includes the name 'John Smith'",
146
+ "reason": "A hallucinated document that mentions the name would also pass"
147
+ }
148
+ ],
149
+ "overall": "Assertions check presence but not correctness."
150
+ }
151
+ }
152
+ ```
153
+
154
+ **Fields:**
155
+ - `assertions[]`: Graded assertion results with evidence
156
+ - `summary`: Aggregate pass/fail counts
157
+ - `execution_metrics`: Tool usage and output size (from executor's metrics.json)
158
+ - `timing`: Wall clock timing (from timing.json)
159
+ - `claims`: Extracted and verified claims from the output
160
+ - `user_notes_summary`: Issues flagged by the executor
161
+ - `eval_feedback`: (optional) Improvement suggestions for the evals, only present when the grader identifies issues worth raising
162
+
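Because downstream tooling depends on the exact assertion field names, it can help to validate them before writing the file. This is a sketch under stated assumptions: `summarize_assertions` is a hypothetical helper, and rounding `pass_rate` to two decimals is inferred from the `0.67` in the example rather than documented.

```python
REQUIRED_FIELDS = ("text", "passed", "evidence")

def summarize_assertions(assertions: list[dict]) -> dict:
    """Check field names, then build a summary object shaped like the example."""
    for index, assertion in enumerate(assertions):
        missing = [name for name in REQUIRED_FIELDS if name not in assertion]
        if missing:
            raise ValueError(f"assertion {index} is missing fields: {missing}")
    total = len(assertions)
    passed = sum(1 for assertion in assertions if assertion["passed"] is True)
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        # Assumption: two-decimal rounding, as in the example's 0.67.
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }

graded = [
    {"text": "A", "passed": True, "evidence": "seen in step 1"},
    {"text": "B", "passed": True, "evidence": "seen in step 2"},
    {"text": "C", "passed": False, "evidence": "no file created"},
]
print(summarize_assertions(graded))
```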
163
+ ---
164
+
165
+ ## metrics.json
166
+
167
+ Output from the executor agent. Located at `<run-dir>/outputs/metrics.json`.
168
+
169
+ ```json
170
+ {
171
+ "tool_calls": {
172
+ "Read": 5,
173
+ "Write": 2,
174
+ "Bash": 8,
175
+ "Edit": 1,
176
+ "Glob": 2,
177
+ "Grep": 0
178
+ },
179
+ "total_tool_calls": 18,
180
+ "total_steps": 6,
181
+ "files_created": ["filled_form.pdf", "field_values.json"],
182
+ "errors_encountered": 0,
183
+ "output_chars": 12450,
184
+ "transcript_chars": 3200
185
+ }
186
+ ```
187
+
188
+ **Fields:**
189
+ - `tool_calls`: Count per tool type
190
+ - `total_tool_calls`: Sum of all tool calls
191
+ - `total_steps`: Number of major execution steps
192
+ - `files_created`: List of output files created
193
+ - `errors_encountered`: Number of errors during execution
194
+ - `output_chars`: Total character count of output files
195
+ - `transcript_chars`: Character count of transcript
196
+
197
+ ---
198
+
199
+ ## timing.json
200
+
201
+ Wall clock timing for a run. Located at `<run-dir>/timing.json`.
202
+
203
+ **How to capture:** When a subagent task completes, the task notification includes `total_tokens` and `duration_ms`. Save these immediately — they are not persisted anywhere else and cannot be recovered after the fact.
204
+
205
+ ```json
206
+ {
207
+ "total_tokens": 84852,
208
+ "duration_ms": 23332,
209
+ "total_duration_seconds": 23.3,
210
+ "executor_start": "2026-01-15T10:30:00Z",
211
+ "executor_end": "2026-01-15T10:32:45Z",
212
+ "executor_duration_seconds": 165.0,
213
+ "grader_start": "2026-01-15T10:32:46Z",
214
+ "grader_end": "2026-01-15T10:33:12Z",
215
+ "grader_duration_seconds": 26.0
216
+ }
217
+ ```
218
+
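The duration fields in this example appear to be derivable from the other values. The sketch below assumes that `total_duration_seconds` is `duration_ms / 1000` rounded to one decimal and that each `*_duration_seconds` is the difference between its start and end timestamps; both relationships are inferred from the example, and `add_durations` is a hypothetical helper.

```python
from datetime import datetime

def _parse(timestamp: str) -> datetime:
    # Replace the trailing "Z" so older Python versions can parse the value.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))

def add_durations(timing: dict) -> dict:
    """Return a copy of timing with the derived duration fields filled in."""
    result = dict(timing)
    for phase in ("executor", "grader"):
        start = timing.get(f"{phase}_start")
        end = timing.get(f"{phase}_end")
        if start and end:
            seconds = (_parse(end) - _parse(start)).total_seconds()
            result[f"{phase}_duration_seconds"] = seconds
    if "duration_ms" in timing:
        result["total_duration_seconds"] = round(timing["duration_ms"] / 1000, 1)
    return result

timing = add_durations({
    "duration_ms": 23332,
    "executor_start": "2026-01-15T10:30:00Z",
    "executor_end": "2026-01-15T10:32:45Z",
    "grader_start": "2026-01-15T10:32:46Z",
    "grader_end": "2026-01-15T10:33:12Z",
})
print(timing)
```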
219
+ ---
220
+
221
+ ## benchmark.json
222
+
223
+ Output from Benchmark mode. Located at `benchmarks/<timestamp>/benchmark.json`.
224
+
225
+ ```json
226
+ {
227
+ "metadata": {
228
+ "skill_name": "pdf",
229
+ "skill_path": "/path/to/pdf",
230
+ "executor_model": "claude-sonnet-4-20250514",
231
+ "analyzer_model": "most-capable-model",
232
+ "timestamp": "2026-01-15T10:30:00Z",
233
+ "evals_run": [1, 2, 3],
234
+ "runs_per_configuration": 3
235
+ },
236
+
237
+ "runs": [
238
+ {
239
+ "eval_id": 1,
240
+ "eval_name": "Ocean",
241
+ "configuration": "with_skill",
242
+ "run_number": 1,
243
+ "result": {
244
+ "pass_rate": 0.85,
245
+ "passed": 6,
246
+ "failed": 1,
247
+ "total": 7,
248
+ "time_seconds": 42.5,
249
+ "tokens": 3800,
250
+ "tool_calls": 18,
251
+ "errors": 0
252
+ },
253
+ "assertions": [
254
+ {"text": "...", "passed": true, "evidence": "..."}
255
+ ],
256
+ "notes": [
257
+ "Used 2023 data, may be stale",
258
+ "Fell back to text overlay for non-fillable fields"
259
+ ]
260
+ }
261
+ ],
262
+
263
+ "run_summary": {
264
+ "with_skill": {
265
+ "pass_rate": {"mean": 0.85, "stddev": 0.05, "min": 0.80, "max": 0.90},
266
+ "time_seconds": {"mean": 45.0, "stddev": 12.0, "min": 32.0, "max": 58.0},
267
+ "tokens": {"mean": 3800, "stddev": 400, "min": 3200, "max": 4100}
268
+ },
269
+ "without_skill": {
270
+ "pass_rate": {"mean": 0.35, "stddev": 0.08, "min": 0.28, "max": 0.45},
271
+ "time_seconds": {"mean": 32.0, "stddev": 8.0, "min": 24.0, "max": 42.0},
272
+ "tokens": {"mean": 2100, "stddev": 300, "min": 1800, "max": 2500}
273
+ },
274
+ "delta": {
275
+ "pass_rate": "+0.50",
276
+ "time_seconds": "+13.0",
277
+ "tokens": "+1700"
278
+ }
279
+ },
280
+
281
+ "notes": [
282
+ "Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value",
283
+ "Eval 3 shows high variance (50% ± 40%) - may be flaky or model-dependent",
284
+ "Without-skill runs consistently fail on table extraction assertions",
285
+ "Skill adds 13s average execution time but improves pass rate by 50%"
286
+ ]
287
+ }
288
+ ```
289
+
290
+ **Fields:**
291
+ - `metadata`: Information about the benchmark run
292
+   - `skill_name`: Name of the skill
293
+   - `timestamp`: When the benchmark was run
294
+   - `evals_run`: List of eval names or IDs
295
+   - `runs_per_configuration`: Number of runs per config (e.g. 3)
296
+ - `runs[]`: Individual run results
297
+   - `eval_id`: Numeric eval identifier
298
+   - `eval_name`: Human-readable eval name (used as section header in the viewer)
299
+   - `configuration`: Must be `"with_skill"` or `"without_skill"` (the viewer uses this exact string for grouping and color coding)
300
+   - `run_number`: Integer run number (1, 2, 3...)
301
+   - `result`: Nested object with `pass_rate`, `passed`, `failed`, `total`, `time_seconds`, `tokens`, `tool_calls`, `errors`
302
+ - `run_summary`: Statistical aggregates per configuration
303
+   - `with_skill` / `without_skill`: Each contains `pass_rate`, `time_seconds`, `tokens` objects with `mean`, `stddev`, `min`, and `max` fields
304
+   - `delta`: Difference strings like `"+0.50"`, `"+13.0"`, `"+1700"`
305
+ - `notes`: Freeform observations from the analyzer
306
+
307
+ **Important:** The viewer reads these field names exactly. Using `config` instead of `configuration`, or putting `pass_rate` at the top level of a run instead of nested under `result`, will cause the viewer to show empty/zero values. Always reference this schema when generating benchmark.json manually.
308
+
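When benchmark.json is generated manually, `run_summary` can be computed from `runs[]` rather than typed by hand. The sketch below is illustrative only: `summarize_runs` is a hypothetical helper, the use of population standard deviation is an assumption (the schema does not say which variant the analyzer uses), and the delta formats are copied from the example strings.

```python
from statistics import mean, pstdev

METRICS = {"pass_rate": "+.2f", "time_seconds": "+.1f", "tokens": "+.0f"}

def summarize_runs(runs: list[dict]) -> dict:
    """Aggregate per-configuration statistics from entries shaped like runs[]."""
    summary = {}
    for config in ("with_skill", "without_skill"):
        # Values live under "result", not at the top level of a run.
        results = [run["result"] for run in runs if run["configuration"] == config]
        if not results:
            continue
        summary[config] = {}
        for metric in METRICS:
            values = [result[metric] for result in results]
            summary[config][metric] = {
                "mean": mean(values),
                "stddev": pstdev(values),  # assumption: population stddev
                "min": min(values),
                "max": max(values),
            }
    if "with_skill" in summary and "without_skill" in summary:
        summary["delta"] = {
            metric: format(
                summary["with_skill"][metric]["mean"]
                - summary["without_skill"][metric]["mean"],
                fmt,
            )
            for metric, fmt in METRICS.items()
        }
    return summary

def _run(config, pass_rate, seconds, tokens):
    return {"configuration": config,
            "result": {"pass_rate": pass_rate, "time_seconds": seconds, "tokens": tokens}}

runs = [
    _run("with_skill", 0.8, 40.0, 3600), _run("with_skill", 0.9, 50.0, 4000),
    _run("without_skill", 0.3, 30.0, 2000), _run("without_skill", 0.4, 34.0, 2200),
]
print(summarize_runs(runs)["delta"])
```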
309
+ ---
310
+
311
+ ## comparison.json
312
+
313
+ Output from blind comparator. Located at `<grading-dir>/comparison-N.json`.
314
+
315
+ ```json
316
+ {
317
+ "winner": "A",
318
+ "reasoning": "Output A provides a complete solution with proper formatting and all required fields. Output B is missing the date field and has formatting inconsistencies.",
319
+ "rubric": {
320
+ "A": {
321
+ "content": {
322
+ "correctness": 5,
323
+ "completeness": 5,
324
+ "accuracy": 4
325
+ },
326
+ "structure": {
327
+ "organization": 4,
328
+ "formatting": 5,
329
+ "usability": 4
330
+ },
331
+ "content_score": 4.7,
332
+ "structure_score": 4.3,
333
+ "overall_score": 9.0
334
+ },
335
+ "B": {
336
+ "content": {
337
+ "correctness": 3,
338
+ "completeness": 2,
339
+ "accuracy": 3
340
+ },
341
+ "structure": {
342
+ "organization": 3,
343
+ "formatting": 2,
344
+ "usability": 3
345
+ },
346
+ "content_score": 2.7,
347
+ "structure_score": 2.7,
348
+ "overall_score": 5.4
349
+ }
350
+ },
351
+ "output_quality": {
352
+ "A": {
353
+ "score": 9,
354
+ "strengths": ["Complete solution", "Well-formatted", "All fields present"],
355
+ "weaknesses": ["Minor style inconsistency in header"]
356
+ },
357
+ "B": {
358
+ "score": 5,
359
+ "strengths": ["Readable output", "Correct basic structure"],
360
+ "weaknesses": ["Missing date field", "Formatting inconsistencies", "Partial data extraction"]
361
+ }
362
+ },
363
+ "assertions": {
364
+ "A": {
365
+ "passed": 4,
366
+ "total": 5,
367
+ "pass_rate": 0.80,
368
+ "details": [
369
+ {"text": "Output includes name", "passed": true}
370
+ ]
371
+ },
372
+ "B": {
373
+ "passed": 3,
374
+ "total": 5,
375
+ "pass_rate": 0.60,
376
+ "details": [
377
+ {"text": "Output includes name", "passed": true}
378
+ ]
379
+ }
380
+ }
381
+ }
382
+ ```
383
+
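The scores in this example are consistent with a simple rule: each category score is the mean of its three criteria rounded to one decimal, and `overall_score` is the sum of the two category scores. That rule is inferred from the numbers above and is not stated by the schema; `rubric_scores` is a hypothetical helper that reproduces the arithmetic.

```python
def rubric_scores(rubric: dict) -> dict:
    """Derive category and overall scores from per-criterion ratings."""
    def category_mean(name: str) -> float:
        values = list(rubric[name].values())
        return round(sum(values) / len(values), 1)

    content = category_mean("content")
    structure = category_mean("structure")
    return {
        "content_score": content,
        "structure_score": structure,
        # Assumption: overall is the sum of the two rounded category scores.
        "overall_score": round(content + structure, 1),
    }

output_a = {
    "content": {"correctness": 5, "completeness": 5, "accuracy": 4},
    "structure": {"organization": 4, "formatting": 5, "usability": 4},
}
print(rubric_scores(output_a))
```

For Output A this gives (5 + 5 + 4) / 3 ≈ 4.7 and (4 + 5 + 4) / 3 ≈ 4.3, which sum to 9.0, matching the example.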
384
+ ---
385
+
386
+ ## analysis.json
387
+
388
+ Output from post-hoc analyzer. Located at `<grading-dir>/analysis.json`.
389
+
390
+ ```json
391
+ {
392
+ "comparison_summary": {
393
+ "winner": "A",
394
+ "winner_skill": "path/to/winner/skill",
395
+ "loser_skill": "path/to/loser/skill",
396
+ "comparator_reasoning": "Brief summary of why comparator chose winner"
397
+ },
398
+ "winner_strengths": [
399
+ "Clear step-by-step instructions for handling multi-page documents",
400
+ "Included validation script that caught formatting errors"
401
+ ],
402
+ "loser_weaknesses": [
403
+ "Vague instruction 'process the document appropriately' led to inconsistent behavior",
404
+ "No script for validation, agent had to improvise"
405
+ ],
406
+ "instruction_following": {
407
+ "winner": {
408
+ "score": 9,
409
+ "issues": ["Minor: skipped optional logging step"]
410
+ },
411
+ "loser": {
412
+ "score": 6,
413
+ "issues": [
414
+ "Did not use the skill's formatting template",
415
+ "Invented own approach instead of following step 3"
416
+ ]
417
+ }
418
+ },
419
+ "improvement_suggestions": [
420
+ {
421
+ "priority": "high",
422
+ "category": "instructions",
423
+ "suggestion": "Replace 'process the document appropriately' with explicit steps",
424
+ "expected_impact": "Would eliminate ambiguity that caused inconsistent behavior"
425
+ }
426
+ ],
427
+ "transcript_insights": {
428
+ "winner_execution_pattern": "Read skill -> Followed 5-step process -> Used validation script",
429
+ "loser_execution_pattern": "Read skill -> Unclear on approach -> Tried 3 different methods"
430
+ }
431
+ }
432
+ ```
@@ -0,0 +1,181 @@
1
+ # Subagent Pipeline — Running eval.yaml without CLI
2
+
3
+ This reference documents the detailed procedure for running evaluations in subagent mode
4
+ (`AGENT_EVAL_MODE=subagent`, the default). The orchestrating skill dispatches `executor`
5
+ subagents to run test cases and `grader` subagents to evaluate outputs.
6
+
7
+ Read this reference when executing Step 3 (Run and Grade) in subagent mode.
8
+
9
+ ## Prerequisites
10
+
11
+ - The eval.yaml file exists and contains valid test definitions
12
+ - `agentv` CLI is installed (or run from source via `AGENTV_CLI=bun /path/to/cli.ts` in `.env`)
13
+ - Read `references/eval-yaml-spec.md` for the full schema
14
+
15
+ ## Workspace Context
16
+
17
+ Some evals pass prompt files directly and don't require a specific workspace — those run fine
18
+ from anywhere. But evals that test agent behavior in a workspace (accessing skills, modifying
19
+ repos, using tools across multiple repos) require the user to be in the **target workspace**
20
+ (e.g., a multi-repo workspace set up by allagents). If the eval references workspace files or
21
+ expects the agent to use skills, check that the current directory is the target workspace, not
22
+ just the eval repo — and warn the user if it's wrong.
23
+
24
+ ## Executor Subagent Eligibility
25
+
26
+ All providers except `cli` are eligible for executor subagents by default. To opt out a
27
+ specific target, set `subagent_mode_allowed: false` in `.agentv/targets.yaml`:
28
+
29
+ ```yaml
30
+ # .agentv/targets.yaml
31
+ targets:
32
+   - name: my-target
33
+     provider: openai
34
+     model: ${{ OPENAI_MODEL }}
35
+     api_key: ${{ OPENAI_API_KEY }}
36
+     subagent_mode_allowed: false # forces CLI invocation instead of executor subagent
37
+ ```
38
+
39
+ When `subagent_mode_allowed: false`, the target falls back to CLI invocation via `agentv eval`
40
+ even in subagent mode.
41
+
42
+ ## CLI Targets: Single Command
43
+
44
+ For evals with CLI targets, `pipeline run` handles input extraction, target invocation, and
45
+ code grading in one step. When `--out` is omitted, the output directory defaults to
46
+ `.agentv/results/runs/<timestamp>` (same convention as `agentv eval`):
47
+
48
+ ```bash
49
+ # Extract inputs and invoke all CLI targets in parallel:
50
+ agentv pipeline run evals/repro.eval.yaml
51
+
52
+ # Also run code graders inline (instead of using pipeline grade separately):
53
+ agentv pipeline run evals/repro.eval.yaml --grader-type code
54
+ ```
55
+
56
+ By default, `pipeline run` extracts inputs and invokes targets only. Pass `--grader-type code`
57
+ to also run code-graders inline, or use `agentv pipeline grade <run-dir>` as a separate step.
58
+
59
+ The run directory is printed to stdout. Then continue to the grading and merge phases
60
+ described in SKILL.md Step 3.
61
+
62
+ ## Non-CLI Targets: Executor Subagents
63
+
64
+ When the target provider is not `cli`, check `manifest.json` → `target.subagent_mode_allowed`.
65
+ If `true` (default for all non-CLI providers), the subagent IS the target. If `false` (user
66
+ opted out via `subagent_mode_allowed: false` in `.agentv/targets.yaml`), fall back to
67
+ `agentv eval` CLI mode instead.
68
+
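The eligibility rule described above reduces to a two-way decision. The sketch below only restates that rule in code; `execution_mode` and its return labels are hypothetical names, and the manifest is assumed to be already loaded as a dict with a `target` object.

```python
def execution_mode(manifest: dict) -> str:
    """Decide how a target runs, following the rule described in the text."""
    target = manifest.get("target", {})
    if target.get("provider") == "cli":
        return "cli"  # CLI targets are never executor subagents
    # Non-CLI providers are eligible by default; false is an explicit opt-out.
    if target.get("subagent_mode_allowed", True):
        return "executor_subagent"
    return "cli"  # opted out: fall back to `agentv eval`

print(execution_mode({"target": {"provider": "openai"}}))
print(execution_mode({"target": {"provider": "openai", "subagent_mode_allowed": False}}))
```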
69
+ ### Step 1: Extract inputs
70
+
71
+ ```bash
72
+ # Defaults to .agentv/results/runs/<timestamp>
73
+ agentv pipeline input evals/repro.eval.yaml
74
+ ```
75
+
76
+ This creates a run directory with per-test `input.json`, `invoke.json`,
77
+ `criteria.md`, and grader configs.
78
+
79
+ ### Step 2: Dispatch executor subagents
80
+
81
+ Read `agents/executor.md`. Launch one `executor` subagent **per test case**, all in parallel.
82
+ Each subagent receives the test directory path, reads `input.json`, performs the task using
83
+ its own tools, and writes `response.md`.
84
+
85
+ Example: 5 tests = 5 executor subagents launched simultaneously.
86
+
87
+ ```
88
+ # Per executor subagent:
89
+ # - Reads <run-dir>/<test-id>/input.json
90
+ # - Performs the task
91
+ # - Writes <run-dir>/<test-id>/response.md
92
+ ```
93
+
94
+ ### After executors complete: read results from disk
95
+
96
+ When all executor subagents have finished, **read `response.md` directly from disk** — do
97
+ NOT use `read_agent` to fetch results. The executors wrote their outputs to the run directory.
98
+
99
+ ```bash
100
+ # Verify all responses exist:
101
+ for d in <run-dir>/<evalset>/*/; do
102
+ echo "$(basename "$d"): $([ -f "$d/response.md" ] && echo OK || echo MISSING)"
103
+ done
104
+ ```
105
+
106
+ If any `response.md` is MISSING, re-run that specific executor subagent. Do not proceed to
107
+ grading until all responses are present.
108
+
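The same check can be scripted when the list of missing test ids is needed for re-dispatch. This is a sketch rather than part of the pipeline: `missing_responses` is a hypothetical helper, and it assumes every subdirectory of the evalset directory is a test case.

```python
from pathlib import Path

def missing_responses(evalset_dir: str) -> list[str]:
    """Return the ids of test directories that have no response.md yet."""
    root = Path(evalset_dir)
    return sorted(
        test_dir.name
        for test_dir in root.iterdir()
        if test_dir.is_dir() and not (test_dir / "response.md").is_file()
    )
```

An empty list means every executor wrote its output and grading can proceed; otherwise the returned ids are the executors to re-run.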
109
+ ### Step 3 onward: Grade and merge
110
+
111
+ See SKILL.md Step 3 "Grading" section for the three-phase grading process (code graders →
112
+ LLM grading → merge and validate).
113
+
114
+ ## Step-by-Step Fine-Grained Control (CLI targets)
115
+
116
+ Use individual commands when you need control over each step with CLI targets:
117
+
118
+ ```bash
119
+ # Step 1: Extract inputs (defaults to .agentv/results/runs/<timestamp>)
120
+ agentv pipeline input evals/repro.eval.yaml
121
+
122
+ # Step 2: run_tests.py invokes CLI targets (or use pipeline run instead)
123
+
124
+ # Step 3: Run code graders
125
+ agentv pipeline grade <run-dir>
126
+
127
+ # Step 4: Subagent does LLM grading, writes results to llm_grader_results/<name>.json per test
128
+
129
+ # Step 5: Merge scores (writes index.jsonl with full scores[] for dashboard)
130
+ agentv pipeline bench <run-dir>
131
+
132
+ # Step 6: Validate
133
+ agentv results validate <run-dir>
134
+ ```
135
+
136
+ ## LLM Grading JSON Format
137
+
138
+ The agent reads `llm_graders/<name>.json` for each test, grades the response using the prompt
139
+ content, and produces a scores JSON:
140
+
141
+ ```json
142
+ {
143
+ "test-01": {
144
+ "relevance": {
145
+ "score": 0.85,
146
+ "assertions": [{"text": "Response is relevant", "passed": true, "evidence": "..."}]
147
+ }
148
+ }
149
+ }
150
+ ```
151
+
152
+ ## Pipeline Bench and Dashboard
153
+
154
+ `pipeline bench` merges LLM scores into `index.jsonl` with a full `scores[]` array per entry,
155
+ matching the CLI-mode schema. The web dashboard (`agentv results serve`) reads this format
156
+ directly — no separate conversion script is needed. Run `agentv results validate <run-dir>`
157
+ to verify compatibility.
158
+
159
+ ## Output Structure
160
+
161
+ The path hierarchy mirrors CLI mode: `<evalset-name>` comes from the `name` field in
162
+ the eval.yaml. The target is recorded in `manifest.json` — one run = one target.
163
+
164
+ ```
165
+ .agentv/results/runs/<experiment>/<timestamp>/
166
+ ├── manifest.json  ← eval metadata, target, test_ids
167
+ ├── index.jsonl  ← per-test scores
168
+ ├── benchmark.json  ← aggregate statistics
169
+ └── <evalset-name>/  ← eval.yaml "name" field, or eval file basename if absent (same as CLI mode)
170
+     └── <test-id>/  ← test case id
171
+         ├── input.json  ← test input text + messages
172
+         ├── invoke.json  ← target command or agent instructions
173
+         ├── criteria.md  ← grading criteria
174
+         ├── response.md  ← target/agent output
175
+         ├── timing.json  ← execution timing
176
+         ├── code_graders/<name>.json  ← grader configs written by `pipeline input`: code-grader scripts AND built-in types (contains, regex, equals, etc.)
177
+         ├── llm_graders/<name>.json  ← LLM grader configs
178
+         ├── code_grader_results/<name>.json  ← code grader results
179
+         ├── llm_grader_results/<name>.json  ← LLM grader results (written by grader subagents; one file per grader)
180
+         └── grading.json  ← merged grading (written by `pipeline bench`; do NOT write here directly)
181
+ ```