agentv 4.26.1 → 4.27.0-next.1

Files changed (42)
  1. package/dist/{chunk-JA4WQNE6.js → chunk-47JX7NNZ.js} +10 -2
  2. package/dist/chunk-47JX7NNZ.js.map +1 -0
  3. package/dist/{chunk-XBUHMRX2.js → chunk-V3LWJB5X.js} +431 -49
  4. package/dist/chunk-V3LWJB5X.js.map +1 -0
  5. package/dist/cli.js +2 -2
  6. package/dist/index.js +2 -2
  7. package/dist/{interactive-YMKWKPD7.js → interactive-L6PIIFNQ.js} +2 -2
  8. package/dist/skills/agentv-bench/LICENSE.txt +202 -0
  9. package/dist/skills/agentv-bench/SKILL.md +459 -0
  10. package/dist/skills/agentv-bench/agents/analyzer.md +177 -0
  11. package/dist/skills/agentv-bench/agents/comparator.md +247 -0
  12. package/dist/skills/agentv-bench/agents/executor.md +30 -0
  13. package/dist/skills/agentv-bench/agents/grader.md +238 -0
  14. package/dist/skills/agentv-bench/agents/mutator.md +172 -0
  15. package/dist/skills/agentv-bench/references/autoresearch.md +309 -0
  16. package/dist/skills/agentv-bench/references/description-optimization.md +66 -0
  17. package/dist/skills/agentv-bench/references/environment-adaptation.md +82 -0
  18. package/dist/skills/agentv-bench/references/eval-yaml-spec.md +338 -0
  19. package/dist/skills/agentv-bench/references/migrating-from-skill-creator.md +103 -0
  20. package/dist/skills/agentv-bench/references/schemas.md +432 -0
  21. package/dist/skills/agentv-bench/references/subagent-pipeline.md +181 -0
  22. package/dist/skills/agentv-bench/scripts/trajectory.html +462 -0
  23. package/dist/skills/agentv-eval-review/SKILL.md +53 -0
  24. package/dist/skills/agentv-eval-review/scripts/lint_eval.py +239 -0
  25. package/dist/skills/agentv-eval-writer/SKILL.md +707 -0
  26. package/dist/skills/agentv-eval-writer/references/config-schema.json +63 -0
  27. package/dist/skills/agentv-eval-writer/references/custom-evaluators.md +119 -0
  28. package/dist/skills/agentv-eval-writer/references/eval-schema.json +19077 -0
  29. package/dist/skills/agentv-eval-writer/references/rubric-evaluator.md +114 -0
  30. package/dist/skills/agentv-governance/SKILL.md +79 -0
  31. package/dist/skills/agentv-governance/references/eu-ai-act-risk-tiers.md +37 -0
  32. package/dist/skills/agentv-governance/references/governance-yaml-shape.md +125 -0
  33. package/dist/skills/agentv-governance/references/iso-42001-controls.md +46 -0
  34. package/dist/skills/agentv-governance/references/lint-rules.md +169 -0
  35. package/dist/skills/agentv-governance/references/mitre-atlas.md +38 -0
  36. package/dist/skills/agentv-governance/references/owasp-agentic-top-10-2025.md +28 -0
  37. package/dist/skills/agentv-governance/references/owasp-llm-top-10-2025.md +25 -0
  38. package/dist/skills/agentv-trace-analyst/SKILL.md +161 -0
  39. package/package.json +1 -1
  40. package/dist/chunk-JA4WQNE6.js.map +0 -1
  41. package/dist/chunk-XBUHMRX2.js.map +0 -1
  42. package/dist/{interactive-YMKWKPD7.js.map → interactive-L6PIIFNQ.js.map} +0 -0
@@ -0,0 +1,338 @@
1
+ # Eval YAML Spec — Schema and Assertion Grading Recipes
2
+
3
+ This reference documents the eval.yaml schema and grading recipes for every assertion type.
4
+ The grader agent uses this to evaluate assertions without the CLI.
5
+
6
+ ## 1. Eval YAML Structure
7
+
8
+ ### Top-level fields
9
+
10
+ - `name` (string, optional) — eval name
11
+ - `description` (string, optional) — description
12
+ - `execution` (object, optional) — `target`, `model`, etc.
13
+ - `workspace` (object, optional) — workspace config (template, hooks)
14
+ - `tests` (array, required) — test cases
15
+
16
+ ### Per-test fields
17
+
18
+ - `id` (string, required) — unique test identifier
19
+ - `input` (string | Message[], required) — task input. String shorthand expands to `[{role: user, content: "..."}]`
20
+ - `expected_output` (string | Message[], optional) — reference answer. String shorthand expands to `[{role: assistant, content: "..."}]`
21
+ - `criteria` (string, optional) — human-readable success criteria
22
+ - `assertions` (array, optional) — grader assertions
23
+ - `conversation_id` (string, optional) — groups related tests
24
+ - `execution` (object, optional) — per-test execution override
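
The string shorthand for `input` and `expected_output` can be sketched as a small normalization step. This is only an illustration: the `ChatMessage` type and the `expandShorthand` helper are hypothetical names, and the `system` role is an assumption that the spec above does not mention.

```typescript
// Hypothetical message type; only `role` and `content` come from the spec.
type ChatMessage = { role: "user" | "assistant" | "system"; content: string };

// Expand the string shorthand into a one-message array.
// Arrays are assumed to be full Message[] values already and pass through unchanged.
function expandShorthand(
  value: string | ChatMessage[],
  role: ChatMessage["role"],
): ChatMessage[] {
  return typeof value === "string" ? [{ role, content: value }] : value;
}

// `input` strings become user messages, `expected_output` strings become assistant messages.
const exampleInput = expandShorthand("Summarize the README", "user");
const exampleExpected = expandShorthand("A short summary", "assistant");
```

Normalizing both fields up front means later grading code only has to handle the array form.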
25
+
26
+ ## 2. Assertion Types and Grading Recipes
27
+
28
+ For each assertion type: YAML config fields, grading recipe (exact pseudocode for deterministic types), and PASS/FAIL conditions.
29
+
30
+ ### Deterministic assertions (zero-cost, instant)
31
+
32
+ #### `contains`
33
+
34
+ - **Fields:** `value` (string, required)
35
+ - **Recipe:**
36
+ ```
37
+ response.toLowerCase().includes(value.toLowerCase())
38
+ ```
39
+ Note: matching is case-insensitive by default in AgentV. If `case_sensitive: true` is set, compare the strings as-is, without lowercasing.
40
+ - **PASS:** substring found. **FAIL:** substring not found.
41
+
42
+ #### `contains-any`
43
+
44
+ - **Fields:** `value` (string[], required)
45
+ - **Recipe:**
46
+ ```
47
+ value.some(v => response.toLowerCase().includes(v.toLowerCase()))
48
+ ```
49
+ - **PASS:** at least one substring found. **FAIL:** no substring found.
50
+
51
+ #### `contains-all`
52
+
53
+ - **Fields:** `value` (string[], required)
54
+ - **Recipe:**
55
+ ```
56
+ value.every(v => response.toLowerCase().includes(v.toLowerCase()))
57
+ ```
58
+ - **PASS:** all substrings found. **FAIL:** at least one substring missing.
59
+
60
+ #### `icontains` / `icontains-any` / `icontains-all`
61
+
62
+ Same as the `contains` variants, but explicitly case-insensitive.
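
The three `contains` recipes share one comparison, so they can be written around a single helper. The sketch below is a minimal illustration, not AgentV's implementation; the function names are hypothetical, and the `caseSensitive` parameter stands in for the `case_sensitive` field mentioned in the note under `contains`.

```typescript
// Single substring check; lowercases both sides unless caseSensitive is true.
function containsCheck(response: string, value: string, caseSensitive = false): boolean {
  return caseSensitive
    ? response.includes(value)
    : response.toLowerCase().includes(value.toLowerCase());
}

// contains-any: at least one candidate substring is present.
function containsAny(response: string, values: string[], caseSensitive = false): boolean {
  return values.some((v) => containsCheck(response, v, caseSensitive));
}

// contains-all: every candidate substring is present.
function containsAll(response: string, values: string[], caseSensitive = false): boolean {
  return values.every((v) => containsCheck(response, v, caseSensitive));
}
```

Under this reading, the `icontains` variants correspond to calling these helpers with `caseSensitive` left at `false`.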
63
+
64
+ #### `equals`
65
+
66
+ - **Fields:** `value` (string, required)
67
+ - **Recipe:**
68
+ ```
69
+ response.trim() === value.trim()
70
+ ```
71
+ - **PASS:** exact match after trimming.
72
+
73
+ #### `regex`
74
+
75
+ - **Fields:** `value` (string, required — a regex pattern)
76
+ - **Recipe:**
77
+ ```
78
+ new RegExp(value).test(response)
79
+ ```
80
+ - **PASS:** pattern matches.
81
+
82
+ #### `starts-with`
83
+
84
+ - **Fields:** `value` (string, required)
85
+ - **Recipe:**
86
+ ```
87
+ response.startsWith(value)
88
+ ```
89
+ (or case-insensitive variant)
90
+ - **PASS:** response starts with value.
91
+
92
+ #### `ends-with`
93
+
94
+ - **Fields:** `value` (string, required)
95
+ - **Recipe:**
96
+ ```
97
+ response.endsWith(value)
98
+ ```
99
+ (or case-insensitive variant)
100
+ - **PASS:** response ends with value.
101
+
102
+ #### `is-json`
103
+
104
+ - **Fields:** none required
105
+ - **Recipe:**
106
+ ```
107
+ try { JSON.parse(response); return true } catch { return false }
108
+ ```
109
+ - **PASS:** response is valid JSON. **FAIL:** parse error.
110
+
111
+ #### `field-accuracy`
112
+
113
+ - **Fields:** `expected` (object, required — JSON object with field paths and expected values)
114
+ - **Recipe:** Parse response as JSON. For each field path in `expected`, check if the value matches.
115
+ - **PASS:** all fields match. Partial score = `matched_fields / total_fields`.
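
The `field-accuracy` recipe is prose only, so here is one possible reading of it. The sketch assumes that field paths are dot-separated object keys and that values are compared by their JSON serialization; neither detail is stated in the spec, and the helper names are hypothetical.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Parse without throwing; undefined signals invalid JSON.
function parseJson(text: string): Json | undefined {
  try {
    return JSON.parse(text) as Json;
  } catch {
    return undefined;
  }
}

// Walk a dot-separated path (assumed syntax) through nested objects.
function getPath(root: Json, path: string): Json | undefined {
  let current: Json | undefined = root;
  for (const key of path.split(".")) {
    if (current === null || typeof current !== "object" || Array.isArray(current)) {
      return undefined;
    }
    current = current[key];
  }
  return current;
}

// Score = matched_fields / total_fields, as in the recipe.
// Comparison via JSON.stringify is an assumption and is sensitive to object key order.
function fieldAccuracy(response: string, expected: Record<string, Json>): number {
  const parsed = parseJson(response);
  if (parsed === undefined) return 0;
  const paths = Object.keys(expected);
  if (paths.length === 0) return 1;
  const matched = paths.filter(
    (p) => JSON.stringify(getPath(parsed, p)) === JSON.stringify(expected[p]),
  ).length;
  return matched / paths.length;
}
```

A response that is not valid JSON scores 0 in this sketch, which matches the idea that no field can be checked.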
116
+
117
+ ### Metric assertions (require timing.json)
118
+
119
+ #### `latency`
120
+
121
+ - **Fields:** `threshold` (number, required — max duration in ms)
122
+ - **Recipe:** Read `timing.json`. Compare `duration_ms` against threshold.
123
+ - **PASS:** `duration_ms <= threshold`.
124
+
125
+ #### `cost`
126
+
127
+ - **Fields:** `threshold` (number, required — max cost in USD)
128
+ - **Recipe:** Read timing/token data. Compare cost against threshold.
129
+ - **PASS:** `cost <= threshold`.
130
+
131
+ #### `token-usage`
132
+
133
+ - **Fields:** `threshold` (number, required — max tokens)
134
+ - **Recipe:** Read `timing.json`. Compare `total_tokens` against threshold.
135
+ - **PASS:** `total_tokens <= threshold`.
136
+
137
+ #### `execution-metrics`
138
+
139
+ - **Fields:** Various threshold fields for tool calls, output chars, etc.
140
+ - **Recipe:** Read timing.json, compare each metric against its threshold.
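
The metric assertions all reduce to a "value at or below threshold" comparison. The following sketch shows that comparison for `latency` and `token-usage`, using the `duration_ms` and `total_tokens` field names from the recipes above. The helper names are hypothetical, and treating a missing metric as a failure is an assumption of this sketch.

```typescript
// Subset of timing.json used here; other fields may exist.
type Timing = { duration_ms?: number; total_tokens?: number };

// A metric passes when it is present and does not exceed its threshold.
function withinThreshold(actual: number | undefined, threshold: number): boolean {
  return actual !== undefined && actual <= threshold;
}

// Grade the raw contents of timing.json against two thresholds.
function gradeTiming(
  timingJson: string,
  maxLatencyMs: number,
  maxTokens: number,
): { latency: boolean; tokenUsage: boolean } {
  const timing = JSON.parse(timingJson) as Timing;
  return {
    latency: withinThreshold(timing.duration_ms, maxLatencyMs),
    tokenUsage: withinThreshold(timing.total_tokens, maxTokens),
  };
}
```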
141
+
142
+ ### Tool inspection assertions
143
+
144
+ #### `tool-trajectory`
145
+
146
+ - **Fields:** `expected` (array of expected tool calls), `mode` (string: `exact` | `contains` | `order`)
147
+ - **Recipe:** Inspect transcript for tool call sequence. Match against expected based on mode.
148
+ - **PASS:** tool calls match expected pattern per mode.
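
The spec names the three `tool-trajectory` modes without defining them. The sketch below shows one plausible interpretation and should be read as an assumption: `exact` requires the same sequence, `contains` requires every expected tool to appear somewhere, and `order` requires the expected tools to appear as an in-order subsequence. Tool calls are simplified to tool names; real `expected` entries may carry more detail.

```typescript
type TrajectoryMode = "exact" | "contains" | "order";

// Compare the actual tool-call names against the expected ones under the given mode.
function matchTrajectory(actual: string[], expected: string[], mode: TrajectoryMode): boolean {
  switch (mode) {
    case "exact":
      // Same length and same tool at every position.
      return actual.length === expected.length && actual.every((t, i) => t === expected[i]);
    case "contains":
      // Every expected tool appears at least once, in any position.
      return expected.every((t) => actual.includes(t));
    case "order": {
      // Expected tools appear in order, with other calls allowed in between.
      let next = 0;
      for (const tool of actual) {
        if (next < expected.length && tool === expected[next]) next++;
      }
      return next === expected.length;
    }
  }
}
```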
149
+
150
+ #### `skill-trigger`
151
+
152
+ - **Fields:** `skill_name` (string, required)
153
+ - **Recipe:** Check if the agent invoked the named skill in its tool calls.
154
+ - **PASS:** skill was triggered.
155
+
156
+ ### LLM-judged assertions (require Claude reasoning)
157
+
158
+ #### `llm-grader`
159
+
160
+ - **Fields:** `prompt` (string, required — either inline text or path to .md file)
161
+ - **Recipe:** Read the prompt. Evaluate the response against the criteria using your own reasoning. Produce score (0.0-1.0) with evidence.
162
+ - **PASS:** score >= 0.5 (configurable via `threshold`).
163
+
164
+ #### `rubric` / `rubrics`
165
+
166
+ - **Fields:** `rubric_items` or `criteria` (array of rubric items with descriptions and weights)
167
+ - **Recipe:** For each rubric item, evaluate the response. Score each item 0.0-1.0. Aggregate as weighted average.
168
+ - **PASS:** aggregate score >= threshold.
169
+
170
+ ### Script-based assertions
171
+
172
+ #### `code-grader`
173
+
174
+ - **Fields:** `path` (string, required — path to script), `command` (string[], optional — custom command)
175
+ - **Script SDK:** Use `defineCodeGrader` from `@agentv/eval`:
176
+ ```typescript
177
+ import { defineCodeGrader } from '@agentv/eval';
178
+ export default defineCodeGrader(({ outputText, trace }) => ({
179
+ score: outputText.includes('expected') ? 1 : 0,
180
+ assertions: [{ text: 'Contains expected', passed: outputText.includes('expected') }],
181
+ }));
182
+ ```
183
+ - **Recipe:** The CLI runs the script, passing context as JSON on stdin (`{output, outputText, input, inputText, ...}`). The script returns `{"score": N, "assertions": [...]}`.
184
+ - **PASS:** score >= 0.5 (or as configured).
185
+
186
+ ### Composite assertion
187
+
188
+ #### `composite`
189
+
190
+ - **Fields:** `assertions` (array of sub-assertions), `aggregation` (string: `weighted_average` | `min` | `max` | `all_pass`)
191
+ - **Recipe:** Evaluate each sub-assertion. Aggregate scores per aggregation mode.
192
+ - **PASS:** depends on aggregation mode.
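
The four aggregation modes can be sketched as a single function over sub-assertion results. This is an illustration under assumptions: the `SubResult` shape is hypothetical, a missing `weight` defaults to 1.0, `all_pass` is mapped to a score of 1 or 0, and an empty list scores 0.

```typescript
type Aggregation = "weighted_average" | "min" | "max" | "all_pass";
type SubResult = { score: number; passed: boolean; weight?: number };

// Combine sub-assertion results into one composite score in [0, 1].
function aggregate(results: SubResult[], mode: Aggregation): number {
  if (results.length === 0) return 0;
  const scores = results.map((r) => r.score);
  switch (mode) {
    case "min":
      return Math.min(...scores);
    case "max":
      return Math.max(...scores);
    case "all_pass":
      return results.every((r) => r.passed) ? 1 : 0;
    case "weighted_average": {
      const totalWeight = results.reduce((sum, r) => sum + (r.weight ?? 1), 0);
      if (totalWeight === 0) return 0;
      const weighted = results.reduce((sum, r) => sum + r.score * (r.weight ?? 1), 0);
      return weighted / totalWeight;
    }
  }
}
```

The same weighted-average branch would also cover the `rubric` recipe, where each rubric item contributes a score and a weight.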
193
+
194
+ ## 3. Negate Support
195
+
196
+ When `negate: true` is set on any assertion, invert the pass/fail result:
197
+
198
+ - A passing check becomes a failure
199
+ - A failing check becomes a pass
200
+ - Score is inverted: `1.0 - score`
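
Applied to a graded result, the negate rule is a small transformation. The `Outcome` type and `applyNegate` name below are hypothetical; the inversion itself follows the three bullets above.

```typescript
type Outcome = { passed: boolean; score: number };

// Flip pass/fail and invert the score when negate is set; otherwise return the outcome unchanged.
function applyNegate(outcome: Outcome, negate = false): Outcome {
  return negate ? { passed: !outcome.passed, score: 1 - outcome.score } : outcome;
}
```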
201
+
202
+ ## 4. Common Assertion Fields
203
+
204
+ All assertion types support:
205
+
206
+ - `name` (string, optional) — human-readable name
207
+ - `type` (string, required) — the assertion type
208
+ - `weight` (number, optional, default 1.0) — weight in score aggregation
209
+ - `negate` (boolean, optional) — invert result
210
+ - `threshold` (number, optional) — minimum score to pass (for LLM types)
211
+
212
+ ## 5. AgentV JSONL Output Format
213
+
214
+ Each line in the results JSONL file is an `EvaluationResult` object. In JSONL, field names use snake_case (applied by `toSnakeCaseDeep()`).
215
+
216
+ ### Required fields
217
+
218
+ - `timestamp` (string, ISO-8601)
219
+ - `test_id` (string)
220
+ - `score` (number, 0.0-1.0, weighted average of all assertion scores)
221
+ - `assertions` (array of `{text, passed, evidence?}`)
222
+ - `output` (Message[]) — agent output messages
223
+ - `execution_status` (string: `ok` | `quality_failure` | `execution_error`)
224
+
225
+ ### Optional fields
226
+
227
+ - `scores` (array of EvaluatorResult) — per-grader breakdown
228
+ - `input` (Message[]) — input messages
229
+ - `token_usage` (object: `{prompt_tokens, completion_tokens, total_tokens}`)
230
+ - `cost_usd` (number)
231
+ - `duration_ms` (number)
232
+ - `target` (string)
233
+ - `eval_set` (string)
234
+ - `error` (string)
235
+ - `file_changes` (string — unified diff)
236
+ - `mode` (string — `agent` for agent mode)
237
+
238
+ ### `scores[]` entries (EvaluatorResult)
239
+
240
+ - `name` (string) — grader name
241
+ - `type` (string) — grader kind (kebab-case)
242
+ - `score` (number, 0.0-1.0)
243
+ - `assertions` (array of `{text, passed, evidence?}`)
244
+ - `weight` (number, optional)
245
+ - `verdict` (string: `pass` | `fail` | `skip`)
246
+ - `details` (object, optional — structured data from code graders)
247
+ - `reasoning` (string, optional)
248
+
249
+ ## 6. Eval Set Support
250
+
251
+ An eval_set references multiple eval.yaml files:
252
+
253
+ ```yaml
254
+ # eval_set.yaml
255
+ eval_set:
256
+ - path: ./basic.eval.yaml
257
+ - path: ./advanced.eval.yaml
258
+ ```
259
+
260
+ Process each file's tests independently, then aggregate results.
261
+
262
+ ## 7. Agent-Mode Pipeline CLI Commands
263
+
264
+ These CLI subcommands break the monolithic `eval run` into discrete steps for agent-mode execution. The agent handles LLM grading between steps.
265
+
266
+ ### `agentv pipeline input <eval-path> --out <dir>`
267
+
268
+ Extracts inputs, target commands, and grader configs from an eval YAML file.
269
+
270
+ **Output structure:**
271
+ ```
272
+ <out-dir>/
273
+ ├── manifest.json
274
+ ├── <test-id>/
275
+ │ ├── input.json ← {input, input_files, metadata}
276
+ │ ├── invoke.json ← {kind, command?, cwd?, timeout_ms?}
277
+ │ ├── criteria.md ← human-readable success criteria
278
+ │ ├── expected_output.json ← (if present)
279
+ │ ├── code_graders/<name>.json ← {name, command, weight, config?}
280
+ │ └── llm_graders/<name>.json ← {name, weight, threshold?, prompt_content}
281
+ ```
282
+
283
+ **`manifest.json` format:**
284
+ ```json
285
+ {
286
+ "eval_file": "path/to/eval.yaml",
287
+ "timestamp": "2026-03-24T...",
288
+ "target": {"name": "target-name", "kind": "cli", "subagent_mode_allowed": false},
289
+ "test_ids": ["test-01", "test-02"]
290
+ }
291
+ ```
292
+
293
+ **`invoke.json` kinds:**
294
+ - `kind: "cli"` — has `command`, `cwd`, `timeout_ms`. Use the command to run the target.
295
+ - `kind: "agent"` — non-CLI provider. Check `manifest.json` `target.subagent_mode_allowed` to decide whether to dispatch executor subagents or fall back to `agentv eval` CLI.
296
+
297
+ ### `agentv pipeline grade <export-dir>`
298
+
299
+ Runs code-grader assertions against `response.md` files in each test directory.
300
+
301
+ **Prerequisites:** `pipeline input` has been run and `response.md` exists in each test dir.
302
+
303
+ **Output:** `<test-id>/code_grader_results/<name>.json` for each code grader, containing:
304
+ ```json
305
+ {
306
+ "name": "grader-name",
307
+ "type": "code-grader",
308
+ "score": 1.0,
309
+ "weight": 1.0,
310
+ "assertions": [{"text": "...", "passed": true}]
311
+ }
312
+ ```
313
+
314
+ ### `agentv pipeline bench <export-dir>`
315
+
316
+ Merges code-grader results with LLM grader scores and produces final artifacts.
317
+
318
+ LLM grader results are read from disk at `<test-id>/llm_grader_results/<name>.json` per test.
319
+
320
+ **LLM grader result file format** (`llm_grader_results/<name>.json`):
321
+ ```json
322
+ { "score": 0.85, "assertions": [{"text": "...", "passed": true, "evidence": "..."}] }
323
+ ```
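
An agent that has finished LLM grading needs to place its result where `pipeline bench` reads it. The sketch below only builds the path and the serialized body, leaving the actual file write to the caller. The helper name is hypothetical, and the range check reflects the 0.0-1.0 score range stated for `llm-grader`.

```typescript
type LlmGraderResult = {
  score: number;
  assertions: { text: string; passed: boolean; evidence?: string }[];
};

// Build the on-disk location and JSON body for one LLM grader result.
function llmGraderResultFile(
  exportDir: string,
  testId: string,
  graderName: string,
  result: LlmGraderResult,
): { path: string; content: string } {
  if (result.score < 0 || result.score > 1) {
    throw new RangeError(`score must be within 0.0-1.0, got ${result.score}`);
  }
  return {
    path: `${exportDir}/${testId}/llm_grader_results/${graderName}.json`,
    content: JSON.stringify(result, null, 2),
  };
}
```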
324
+
325
+ **Output:**
326
+ - `<test-id>/grading.json` — merged grading with `graders`, `assertions`, `summary.pass_rate`
327
+ - `index.jsonl` — one JSON line per test: `{test_id, score, pass, graders: [...]}`
328
+ - `benchmark.json` — aggregate stats: `{metadata: {targets}, run_summary: {<target>: {mean, stddev, n}}}`
329
+
330
+ ### Agent-Mode Workflow
331
+
332
+ ```
333
+ 1. agentv pipeline input eval.yaml --out ./export
334
+ 2. (Agent runs targets or reads response.md)
335
+ 3. agentv pipeline grade ./export
336
+ 4. (Agent does LLM grading, writes <test-id>/llm_grader_results/<name>.json)
337
+ 5. agentv pipeline bench ./export
338
+ ```
@@ -0,0 +1,103 @@
1
+ # Migrating from Skill-Creator to AgentV Lifecycle Skill
2
+
3
+ This reference covers how to use AgentV's unified agent-evaluation lifecycle skill (`agentv-bench`) with evals.json files originally created for Anthropic's skill-creator.
4
+
5
+ ## Drop-in Replacement
6
+
7
+ AgentV runs skill-creator's evals.json directly — no conversion required:
8
+
9
+ ```bash
10
+ # Run evals.json with AgentV
11
+ agentv eval evals.json
12
+
13
+ # Or run a single assertion offline (no API keys)
14
+ agentv eval assert <grader-name> --agent-output "..." --agent-input "..."
15
+ ```
16
+
17
+ AgentV automatically:
18
+ - Promotes `prompt` → input messages
19
+ - Promotes `expected_output` → reference answer
20
+ - Converts `assertions` → LLM-grader graders
21
+ - Resolves `files[]` paths relative to the evals.json directory
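
The four promotions listed above can be pictured as a mapping from one evals.json entry to one test. This is a rough sketch, not AgentV's converter: the entry shape, the output shape, and the choice to use each assertion string as the `llm-grader` prompt are all assumptions, and paths are joined with a plain `/` for brevity.

```typescript
// Hypothetical shape of one evals.json entry; only the four field names come from the list above.
type EvalsJsonEntry = {
  prompt: string;
  expected_output?: string;
  assertions?: string[];
  files?: string[];
};

// Hypothetical shape of the resulting test.
type ConvertedTest = {
  input: { role: "user"; content: string }[];
  expected_output?: { role: "assistant"; content: string }[];
  assertions: { type: "llm-grader"; prompt: string }[];
  files: string[];
};

// Apply the promotions to a single entry, resolving files against the evals.json directory.
function promoteEntry(entry: EvalsJsonEntry, evalsDir: string): ConvertedTest {
  const test: ConvertedTest = {
    input: [{ role: "user", content: entry.prompt }],
    assertions: (entry.assertions ?? []).map((text) => ({
      type: "llm-grader" as const,
      prompt: text,
    })),
    files: (entry.files ?? []).map((file) => `${evalsDir}/${file}`),
  };
  if (entry.expected_output !== undefined) {
    test.expected_output = [{ role: "assistant", content: entry.expected_output }];
  }
  return test;
}
```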
22
+
23
+ If you're using the `agentv-bench` skill, it orchestrates these same AgentV commands. Code graders, grading, and artifact generation remain in AgentV core; the skill just orchestrates and summarizes the existing outputs.
24
+
25
+ ## What You Gain
26
+
27
+ Moving from skill-creator's eval loop to AgentV's lifecycle skill gives you:
28
+
29
+ | Capability | skill-creator | AgentV lifecycle skill |
30
+ |-----------|---------------|----------------------|
31
+ | Workspace isolation | ❌ | ✅ Clone repos, run setup/teardown scripts |
32
+ | Code graders | ❌ | ✅ Python/TypeScript grader scripts via `defineCodeGrader()` |
33
+ | Tool trajectory scoring | ❌ | ✅ Evaluate tool call sequences |
34
+ | Multi-provider comparison | with-skill vs without-skill | N-way: Claude, GPT, Copilot, Gemini, custom CLI |
35
+ | Multi-turn evaluation | ❌ | ✅ Conversation tracking with `conversation_id` |
36
+ | Blind comparison | ❌ | ✅ Judge doesn't know which is baseline |
37
+ | Deterministic upgrade suggestions | ❌ | ✅ LLM-grader → contains/regex/is-json |
38
+ | Human review checkpoint | ❌ | ✅ Structured feedback gate |
39
+ | Workspace file tracking | ❌ | ✅ Evaluate by diffing workspace files |
40
+ | Agent mode (no API keys) | ❌ | ✅ Uses grader agent in agent mode |
41
+
42
+ ## Artifact Compatibility
43
+
44
+ AgentV's companion artifacts are compatible with skill-creator's eval-viewer:
45
+
46
+ | Artifact | Format | Compatible with eval-viewer |
47
+ |----------|--------|---------------------------|
48
+ | `<test-id>/grading.json` | Per-assertion evidence with claims | ✅ Superset of skill-creator's per-test grading format |
49
+ | `benchmark.json` | Aggregate pass rates, timing, patterns | ✅ Superset of Agent Skills benchmark format |
50
+ | Results JSONL | Per-test results | ✅ Standard JSONL format |
51
+
52
+ AgentV's schemas are supersets — they include all fields skill-creator expects, plus additional fields (claims extraction, pattern analysis, deterministic upgrade candidates). Tools that read skill-creator artifacts will read AgentV artifacts correctly, ignoring the extra fields.
53
+
54
+ The optimizer scripts layer reads those same artifacts directly:
55
+ - `aggregate-benchmark.ts` consumes `benchmark.json`, `timing.json`, and results JSONL
56
+ - `generate-report.ts` and `eval-viewer/generate-review.ts` render review output from AgentV artifacts
57
+ - `improve-description.ts` proposes follow-up experiments from benchmark/grading observations
58
+
59
+ ## Graduating to EVAL.yaml
60
+
61
+ When evals.json becomes limiting, convert to EVAL.yaml for the full feature set:
62
+
63
+ ```bash
64
+ # Convert evals.json to EVAL.yaml
65
+ agentv convert evals.json
66
+
67
+ # Edit the generated YAML to add workspace config, code graders, etc.
68
+ # Then run with the full lifecycle
69
+ agentv eval eval.yaml
70
+ ```
71
+
72
+ EVAL.yaml unlocks:
73
+ - **Workspace setup/teardown** — clone repos, install dependencies, clean up after tests
74
+ - **Code graders** — write graders in Python or TypeScript, not just LLM prompts
75
+ - **Rubric-based grading** — multi-dimensional scoring with weighted criteria
76
+ - **Retry policies** — automatic retries for flaky tests with configurable backoff
77
+ - **Test groups** — organize tests by category with shared config
78
+ - **Multi-turn conversations** — test agent interactions across multiple turns
79
+
80
+ ## What Stays in Skill-Creator
81
+
82
+ AgentV does NOT replace these skill-creator capabilities:
83
+
84
+ - **Trigger optimization** — optimizing when/how a skill is triggered
85
+ - **.skill packaging** — bundling skills for distribution
86
+ - **Skill authoring** — creating new SKILL.md files from scratch
87
+ - **Skill discovery** — finding and installing skills
88
+
89
+ AgentV focuses on the **evaluation and optimization loop**. Skill-creator focuses on **skill authoring and packaging**. They are complementary — use skill-creator to write the skill, use AgentV to evaluate and optimize it.
90
+
91
+ ## Example Workflow
92
+
93
+ ```
94
+ 1. Author a skill with skill-creator
95
+ 2. skill-creator generates evals.json
96
+ 3. Run evals.json through AgentV's lifecycle skill for richer evaluation:
97
+ - Workspace isolation (test in a real repo)
98
+ - Multi-provider comparison (does the skill work with GPT too?)
99
+ - Blind comparison (is the new version actually better?)
100
+ - Deterministic upgrades (replace vague LLM graders with precise checks)
101
+ 4. Use AgentV's optimization loop to refine the skill's prompts
102
+ 5. Return to skill-creator for packaging and distribution
103
+ ```