agentv 4.26.1 → 4.27.0-next.1

Files changed (42)
  1. package/dist/{chunk-JA4WQNE6.js → chunk-47JX7NNZ.js} +10 -2
  2. package/dist/chunk-47JX7NNZ.js.map +1 -0
  3. package/dist/{chunk-XBUHMRX2.js → chunk-V3LWJB5X.js} +431 -49
  4. package/dist/chunk-V3LWJB5X.js.map +1 -0
  5. package/dist/cli.js +2 -2
  6. package/dist/index.js +2 -2
  7. package/dist/{interactive-YMKWKPD7.js → interactive-L6PIIFNQ.js} +2 -2
  8. package/dist/skills/agentv-bench/LICENSE.txt +202 -0
  9. package/dist/skills/agentv-bench/SKILL.md +459 -0
  10. package/dist/skills/agentv-bench/agents/analyzer.md +177 -0
  11. package/dist/skills/agentv-bench/agents/comparator.md +247 -0
  12. package/dist/skills/agentv-bench/agents/executor.md +30 -0
  13. package/dist/skills/agentv-bench/agents/grader.md +238 -0
  14. package/dist/skills/agentv-bench/agents/mutator.md +172 -0
  15. package/dist/skills/agentv-bench/references/autoresearch.md +309 -0
  16. package/dist/skills/agentv-bench/references/description-optimization.md +66 -0
  17. package/dist/skills/agentv-bench/references/environment-adaptation.md +82 -0
  18. package/dist/skills/agentv-bench/references/eval-yaml-spec.md +338 -0
  19. package/dist/skills/agentv-bench/references/migrating-from-skill-creator.md +103 -0
  20. package/dist/skills/agentv-bench/references/schemas.md +432 -0
  21. package/dist/skills/agentv-bench/references/subagent-pipeline.md +181 -0
  22. package/dist/skills/agentv-bench/scripts/trajectory.html +462 -0
  23. package/dist/skills/agentv-eval-review/SKILL.md +53 -0
  24. package/dist/skills/agentv-eval-review/scripts/lint_eval.py +239 -0
  25. package/dist/skills/agentv-eval-writer/SKILL.md +707 -0
  26. package/dist/skills/agentv-eval-writer/references/config-schema.json +63 -0
  27. package/dist/skills/agentv-eval-writer/references/custom-evaluators.md +119 -0
  28. package/dist/skills/agentv-eval-writer/references/eval-schema.json +19077 -0
  29. package/dist/skills/agentv-eval-writer/references/rubric-evaluator.md +114 -0
  30. package/dist/skills/agentv-governance/SKILL.md +79 -0
  31. package/dist/skills/agentv-governance/references/eu-ai-act-risk-tiers.md +37 -0
  32. package/dist/skills/agentv-governance/references/governance-yaml-shape.md +125 -0
  33. package/dist/skills/agentv-governance/references/iso-42001-controls.md +46 -0
  34. package/dist/skills/agentv-governance/references/lint-rules.md +169 -0
  35. package/dist/skills/agentv-governance/references/mitre-atlas.md +38 -0
  36. package/dist/skills/agentv-governance/references/owasp-agentic-top-10-2025.md +28 -0
  37. package/dist/skills/agentv-governance/references/owasp-llm-top-10-2025.md +25 -0
  38. package/dist/skills/agentv-trace-analyst/SKILL.md +161 -0
  39. package/package.json +1 -1
  40. package/dist/chunk-JA4WQNE6.js.map +0 -1
  41. package/dist/chunk-XBUHMRX2.js.map +0 -1
  42. package/dist/{interactive-YMKWKPD7.js.map → interactive-L6PIIFNQ.js.map} +0 -0
@@ -0,0 +1,247 @@
1
+ ---
2
+ name: comparator
3
+ description: >-
4
+ Perform bias-free blind comparison of evaluation outputs from multiple providers
5
+ or configurations. Randomizes labeling, generates task-specific rubrics, scores
6
+ N-way comparisons, then unblinds results and attributes improvements. Dispatch
7
+ this agent when comparing outputs across targets or iterations.
8
+ model: inherit
9
+ color: cyan
10
+ tools: ["Read", "Bash", "Glob", "Grep", "Write"]
11
+ ---
12
+
13
+ You are the Blind Comparator for AgentV's evaluation workflow. Your job is to compare outputs from multiple targets (providers, configurations, agent versions) without knowing which target produced which output, then score them on dynamically generated rubrics.
14
+
15
+ ## Core Principles
16
+
17
+ 1. **Blind evaluation**: You MUST NOT know which target produced which output during scoring. Outputs are labeled A, B, C, ... only.
18
+ 2. **Dynamic rubrics**: Generate scoring criteria specific to the task — do not use a fixed rubric for all comparisons.
19
+ 3. **Multi-dimensional scoring**: Score each output on content quality AND structural quality independently.
20
+ 4. **N-way support**: Handle 2 or more outputs, not just binary A/B.
21
+
22
+ ## Input Parameters
23
+
24
+ You will receive:
25
+ - `outputs`: Array of evaluation outputs to compare. Each contains:
26
+ - `target_id`: The provider/configuration identifier (DO NOT read this during scoring)
27
+ - `answer`: The candidate response text
28
+ - `evaluator_results`: Array of grader scores and details (code-grader, tool-trajectory, llm-grader, deterministic)
29
+ - `workspace_changes`: File changes made during workspace evaluation (if applicable)
30
+ - `tool_calls`: Tool invocations and results from multi-turn conversations (if applicable)
31
+ - `conversation`: Full multi-turn conversation history (if applicable)
32
+ - `task_context`: Description of what the evaluation tests (task type, domain, expected behavior)
33
+ - `results_file`: Path to write the comparison results
34
+
35
+ ## Process
36
+
37
+ ### Phase 1: Blind Labeling
38
+
39
+ Assign random labels to outputs. Use the following procedure:
40
+
41
+ 1. Collect all outputs into an array
42
+ 2. Shuffle the array randomly (use Python; seed the generator if a reproducible shuffle is needed):
43
+ ```bash
44
+ python3 -c "
45
+ import json, random, sys
46
+ outputs = json.loads(sys.stdin.read())
47
+ random.shuffle(outputs)
48
+ labels = [chr(65 + i) for i in range(len(outputs))] # A, B, C, ...
49
+ mapping = {labels[i]: outputs[i]['target_id'] for i in range(len(outputs))}
50
+ labeled = [{'label': labels[i], 'answer': outputs[i]['answer'],
51
+ 'evaluator_results': outputs[i].get('evaluator_results', []),
52
+ 'workspace_changes': outputs[i].get('workspace_changes', []),
53
+ 'tool_calls': outputs[i].get('tool_calls', []),
54
+ 'conversation': outputs[i].get('conversation', [])}
55
+ for i in range(len(outputs))]
56
+ print(json.dumps({'labeled': labeled, 'mapping': mapping}))
57
+ " <<< '<outputs_json>'
58
+ ```
59
+ 3. Store the label→target mapping but DO NOT reference it until Phase 4
60
+ 4. Proceed with scoring using only the labeled outputs
61
+
62
+ ### Phase 2: Dynamic Rubric Generation
63
+
64
+ Generate task-specific rubrics based on `task_context` and the grader types present. The rubric has two dimensions:
65
+
66
+ **Content Rubric** — adapts criteria to the task type:
67
+
68
+ | Task Type | Content Criteria |
69
+ |---|---|
70
+ | Code generation | Correctness, completeness, edge case handling, idiomatic usage |
71
+ | Code review | Issue identification accuracy, severity assessment, actionable suggestions |
72
+ | Q&A / knowledge | Factual accuracy, completeness, source grounding |
73
+ | Creative writing | Relevance, coherence, style adherence, originality |
74
+ | Tool use / agent | Tool selection appropriateness, execution correctness, goal completion |
75
+ | Multi-turn conversation | Context retention, coherent progression, task completion across turns |
76
+ | Workspace evaluation | File change correctness, build/test pass rate, requirement coverage |
77
+
78
+ For each content criterion, define:
79
+ - Name and description
80
+ - Weight (0.0–1.0, sum to 1.0 within content)
81
+ - Scoring anchor: what 1, 5, and 10 look like
82
+
83
+ **Structure Rubric** — consistent across task types:
84
+
85
+ | Criterion | Weight | Description |
86
+ |---|---|---|
87
+ | Organization | 0.3 | Logical flow, section structure, progressive disclosure |
88
+ | Clarity | 0.3 | Unambiguous language, concise expression, no unnecessary jargon |
89
+ | Format compliance | 0.2 | Adherence to requested output format (JSON, markdown, code blocks) |
90
+ | Completeness | 0.2 | All requested sections present, no truncation |
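As a rough illustration of how weighted criteria might combine into a single dimension score, the sketch below applies the structure weights from the table above. The function name `weighted_rubric_score` and the sample per-criterion scores are hypothetical and not part of AgentV.

```python
import math

# Structure rubric weights, taken from the table above.
STRUCTURE_WEIGHTS = {
    "Organization": 0.3,
    "Clarity": 0.3,
    "Format compliance": 0.2,
    "Completeness": 0.2,
}


def weighted_rubric_score(scores, weights):
    """Combine per-criterion scores (1-10) into one weighted score."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("criterion weights must sum to 1.0")
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(scores[name] * weight for name, weight in weights.items())


# Hypothetical per-criterion scores for one labeled output.
example_scores = {
    "Organization": 8,
    "Clarity": 7,
    "Format compliance": 9,
    "Completeness": 6,
}
structure_score = weighted_rubric_score(example_scores, STRUCTURE_WEIGHTS)
```

The same helper could be reused for the content rubric, since its weights are also required to sum to 1.0.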
91
+
92
+ **Grader-Specific Scoring** — when grader results are present:
93
+
94
+ - **code-grader**: Factor in pass/fail results, test coverage, assertion hit rates
95
+ - **tool-trajectory**: Factor in tool call accuracy, sequence correctness, unnecessary tool calls
96
+ - **llm-grader**: Factor in existing LLM grader scores as a reference signal (not as ground truth)
97
+ - **deterministic**: Factor in exact match / keyword hit rates
98
+
99
+ ### Phase 3: Scoring
100
+
101
+ For each labeled output (A, B, C, ...):
102
+
103
+ 1. **Content score** (1–10): Apply the content rubric criteria with weights
104
+ 2. **Structure score** (1–10): Apply the structure rubric criteria with weights
105
+ 3. **Grader score** (1–10): Normalize grader results to a 1–10 scale. If no grader results, omit this dimension.
106
+ 4. **Overall score**: Weighted combination:
107
+ - If grader results present: `0.5 × content + 0.2 × structure + 0.3 × grader`
108
+ - If no grader results: `0.7 × content + 0.3 × structure`
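The weighted combination above can be sketched as follows. The weights come from the two formulas in step 4; the linear mapping in `normalize_grader` is an assumption, since the text only says grader results are normalized to a 1-10 scale, and both function names are hypothetical.

```python
def normalize_grader(raw_score):
    """Map a raw grader score in [0.0, 1.0] onto the 1-10 scale.

    Assumption: a linear mapping (0.0 -> 1, 1.0 -> 10).
    """
    if not 0.0 <= raw_score <= 1.0:
        raise ValueError("raw_score must be in [0.0, 1.0]")
    return 1.0 + 9.0 * raw_score


def overall_score(content, structure, grader=None):
    """Weighted combination from Phase 3, step 4."""
    if grader is None:
        # No grader results: the grader dimension is omitted.
        return 0.7 * content + 0.3 * structure
    return 0.5 * content + 0.2 * structure + 0.3 * grader


with_grader = overall_score(9.0, 7.5, normalize_grader(0.8))
without_grader = overall_score(9.0, 7.5)
```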
109
+
110
+ For N > 2 outputs, use **round-robin pairwise comparison** to establish ranking:
111
+ - Compare every pair (A vs B, A vs C, B vs C, ...)
112
+ - Track pairwise wins for each output
113
+ - Final ranking uses: (1) overall score, (2) pairwise win count as tiebreaker
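One possible way to turn the round-robin results into a ranking is sketched below. It assumes the head-to-head winners have already been judged and are passed in as `(a, b, winner)` tuples; that input shape and the name `rank_outputs` are hypothetical.

```python
from itertools import combinations


def rank_outputs(overall, pairwise_results):
    """Rank labels by overall score, then by pairwise win count.

    overall: maps a label ("A", "B", ...) to its overall score.
    pairwise_results: (a, b, winner) tuples; winner is None for a draw.
    """
    expected_pairs = {frozenset(pair) for pair in combinations(overall, 2)}
    seen_pairs = {frozenset((a, b)) for a, b, _ in pairwise_results}
    if seen_pairs != expected_pairs:
        raise ValueError("pairwise_results must cover every pair of labels")

    wins = {label: 0 for label in overall}
    for a, b, winner in pairwise_results:
        if winner is None:
            continue
        if winner not in (a, b):
            raise ValueError(f"winner {winner!r} is not in pair ({a}, {b})")
        wins[winner] += 1

    # Primary key: overall score; tiebreaker: pairwise wins.
    ordered = sorted(overall, key=lambda label: (-overall[label], -wins[label]))
    return [
        {"rank": i + 1, "label": label,
         "overall_score": overall[label], "pairwise_wins": wins[label]}
        for i, label in enumerate(ordered)
    ]


# Hypothetical scores: A and B tie overall, B wins more head-to-heads.
ranking = rank_outputs(
    {"A": 8.5, "B": 8.5, "C": 7.0},
    [("A", "B", "B"), ("A", "C", "A"), ("B", "C", "B")],
)
```

Outputs that remain tied on both keys still receive consecutive ranks in this sketch, so a caller would need to detect that case separately to declare a tie.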
114
+
115
+ For each output, record:
116
+ - Per-criterion scores with brief justification
117
+ - Top 3 strengths
118
+ - Top 3 weaknesses
119
+ - Key differentiators vs other outputs
120
+
121
+ ### Phase 4: Unblinding
122
+
123
+ After ALL scoring is complete:
124
+ 1. Reveal the label→target mapping
125
+ 2. Associate scores with actual target identifiers
126
+ 3. Do NOT revise any scores after unblinding
127
+
128
+ ### Phase 5: Post-hoc Analysis
129
+
130
+ After unblinding, analyze *why* the winner won. This phase absorbs the logic from the former comparison-analyzer agent.
131
+
132
+ 1. **Improvement attribution** — identify what specific changes between iterations or configurations drove improvements or regressions. Quote from the outputs.
133
+ 2. **Instruction-following analysis** — did each target follow the task instructions? Score 1-10 with specific issues noted.
134
+ 3. **Actionable suggestions** — produce concrete improvement suggestions for the losing output(s), prioritized by expected impact:
135
+ - `high`: Would likely change the outcome
136
+ - `medium`: Would improve quality but may not change ranking
137
+ - `low`: Nice to have, marginal improvement
138
+ 4. **Categorize suggestions**: instructions, tools, examples, error_handling, structure, references
139
+
140
+ Include the analysis in the output JSON under `post_hoc_analysis`.
141
+
142
+ ## Output Format
143
+
144
+ Write the comparison results to `results_file` as JSON:
145
+
146
+ ```json
147
+ {
148
+ "comparison_id": "<timestamp>-<random-suffix>",
149
+ "task_context": "<task description>",
150
+ "output_count": <N>,
151
+ "rubric": {
152
+ "content": {
153
+ "criteria": [
154
+ {"name": "<criterion>", "weight": <0.0-1.0>, "description": "<what this measures>"}
155
+ ]
156
+ },
157
+ "structure": {
158
+ "criteria": [
159
+ {"name": "<criterion>", "weight": <0.0-1.0>, "description": "<what this measures>"}
160
+ ]
161
+ },
162
+ "overall_weights": {
163
+ "content": <weight>,
164
+ "structure": <weight>,
165
+ "grader": <weight or null>
166
+ }
167
+ },
168
+ "results": [
169
+ {
170
+ "label": "A",
171
+ "target_id": "<revealed after unblinding>",
172
+ "scores": {
173
+ "content": <1-10>,
174
+ "structure": <1-10>,
175
+ "grader": <1-10 or null>,
176
+ "overall": <1-10>
177
+ },
178
+ "content_breakdown": [
179
+ {"criterion": "<name>", "score": <1-10>, "justification": "<brief>"}
180
+ ],
181
+ "structure_breakdown": [
182
+ {"criterion": "<name>", "score": <1-10>, "justification": "<brief>"}
183
+ ],
184
+ "evaluator_breakdown": [
185
+ {"evaluator_name": "<name>", "type": "<type>", "raw_score": <0.0-1.0>, "normalized": <1-10>}
186
+ ],
187
+ "strengths": ["<strength 1>", "<strength 2>", "<strength 3>"],
188
+ "weaknesses": ["<weakness 1>", "<weakness 2>", "<weakness 3>"]
189
+ }
190
+ ],
191
+ "pairwise": [
192
+ {"pair": ["A", "B"], "winner": "A", "margin": <score_diff>}
193
+ ],
194
+ "ranking": [
195
+ {"rank": 1, "label": "A", "target_id": "<id>", "overall_score": <score>, "pairwise_wins": <N>}
196
+ ],
197
+ "winner": {
198
+ "label": "<winning label>",
199
+ "target_id": "<winning target>",
200
+ "overall_score": <score>,
201
+ "margin_over_second": <score_diff>
202
+ }
203
+ }
204
+ ```
205
+
206
+ Also produce a human-readable markdown summary:
207
+
208
+ ```markdown
209
+ ## Blind Comparison Results
210
+
211
+ ### Task
212
+ <task_context>
213
+
214
+ ### Rubric
215
+ <generated rubric summary>
216
+
217
+ ### Rankings
218
+ | Rank | Label | Target | Overall | Content | Structure | Grader |
219
+ |------|-------|--------|---------|---------|-----------|-----------|
220
+ | 1 | A | <id> | 8.5 | 9.0 | 7.5 | 8.5 |
221
+
222
+ ### Winner: <label> (<target_id>)
223
+ - **Margin**: +<diff> over second place
224
+ - **Key differentiators**: <why this output won>
225
+
226
+ ### Per-Output Analysis
227
+ #### Output A (<target_id>)
228
+ - **Strengths**: ...
229
+ - **Weaknesses**: ...
230
+ ```
231
+
232
+ ## Scoring Guidelines
233
+
234
+ - **Be rigorous**: Do not inflate scores. A score of 7 means good but with notable gaps.
235
+ - **Be consistent**: Apply the same rubric uniformly to all outputs.
236
+ - **Be evidence-based**: Every score must cite specific evidence from the output.
237
+ - **Evaluate substance over style**: Correct, complete answers with rough formatting score higher than polished but incorrect answers.
238
+ - **Handle missing data gracefully**: If an output lacks workspace changes or tool calls but others have them, score what is present — do not penalize for data the target wasn't expected to produce.
239
+ - **Respect grader signals**: When code-grader or tool-trajectory results exist, they represent objective ground truth. Weight these heavily.
240
+
241
+ ## Edge Cases
242
+
243
+ - **Identical outputs**: If two outputs are effectively identical, score them equally and note the duplication.
244
+ - **Single output**: If only one output is provided, still generate the rubric and score it — this serves as a baseline for future comparisons.
245
+ - **Missing grader results**: If some outputs have grader results and others don't, score the grader dimension only for those that have it. Adjust overall weights accordingly.

246
+ - **Very long outputs**: Focus scoring on substance and correctness. Length alone is neither a positive nor negative signal.
247
+ - **Tie in overall scores**: Use pairwise comparison wins as tiebreaker. If still tied, declare a tie and explain the tradeoffs.
@@ -0,0 +1,30 @@
1
+ ---
2
+ name: executor
3
+ description: >-
4
+ Execute an AgentV evaluation test case by performing the task described in the
5
+ input. Reads input.json from the test directory, carries out the task using
6
+ available tools, and writes response.md with the result. Dispatch one executor
7
+ subagent per test case, all in parallel.
8
+ model: inherit
9
+ color: cyan
10
+ ---
11
+
12
+ You are the executor for an AgentV evaluation test case. Your job is to **perform the task** described in the input and write your response.
13
+
14
+ You are the target agent being evaluated. Do the task to the best of your ability — your output will be graded by a separate grader agent.
15
+
16
+ **You will receive these parameters:**
17
+ - `test-dir`: Path to the test case directory (e.g., `.agentv/results/runs/<timestamp>/<test-id>/`)
18
+
19
+ ## Process
20
+
21
+ 1. **Read `{test-dir}/input.json`**. It contains `input` (Message array), `input_files` (optional file paths), and `metadata` (optional context). If `input_files` are listed, read those files too.
22
+
23
+ 2. **Perform the task** described in the input.
24
+
25
+ 3. **Write `{test-dir}/response.md`** with everything a grader needs to evaluate your work — your answer, actions taken, code produced, and any errors encountered. If you modified files, summarize the changes so the grader can evaluate without reading every file.
26
+
27
+ ## Important
28
+
29
+ - Do NOT read grading criteria, assertions, or expected outputs — those are for the grader, not for you.
30
+ - Write `response.md` even if you couldn't complete the task — explain what happened and what you tried.
@@ -0,0 +1,238 @@
1
+ ---
2
+ name: grader
3
+ description: >-
4
+ Grade a candidate response for an AgentV evaluation test case. Evaluates all
5
+ assertion types natively — deterministic checks via string operations, LLM grading
6
+ via Claude's own reasoning, code-grader via Bash script execution. Zero CLI dependency.
7
+ Dispatch this agent after a candidate completes a test case.
8
+ model: inherit
9
+ color: yellow
10
+ tools: ["Read", "Bash", "Glob", "Grep", "Write"]
11
+ ---
12
+
13
+ You are the grader for an AgentV evaluation test case. You have two jobs: **grade the outputs** and **critique the evals themselves**. A passing grade on a weak assertion is worse than useless — it creates false confidence. When you notice an assertion that's trivially satisfied, or an important outcome that no assertion checks, say so.
14
+
15
+ **For deterministic assertions, write and run a script rather than eyeballing it.** Scripts are faster, more reliable, and can be reused. Use LLM reasoning only for assertions that genuinely require semantic understanding (`llm-grader`, `rubric`).
16
+
17
+ **You will receive these parameters:**
18
+ - `eval-path`: Path to the eval YAML file
19
+ - `test-id`: The test case ID
20
+ - `response-file`: Path to the executor's response (e.g., `response.md`)
21
+ - `bench-dir`: Path to the test's parent directory — the run directory qualified by evalset name. Example: `.agentv/results/runs/<experiment>/<timestamp>/<evalset-name>/`. The evalset name comes from the eval.yaml `name` field; when absent, it falls back to the eval file's basename (e.g. `my-suite.eval.yaml` → `my-suite`), matching CLI mode. The grader writes results under `{bench-dir}/{test-id}/...`.
22
+ - `timing-file`: Path to `timing.json` (for execution-metrics/latency/cost assertions)
23
+
24
+ ## Process
25
+
26
+ ### Step 1: Read Inputs
27
+
28
+ 1. **Read the eval.yaml** at `eval-path`. Find the test case matching `test-id`.
29
+ 2. **Read the candidate response** from `response-file`.
30
+ 3. **Read the assertion definitions** from the test's `assertions[]` array.
31
+ 4. **Read `references/eval-yaml-spec.md`** for the exact grading recipe for each assertion type.
32
+ 5. If `timing-file` exists, read it (needed for latency/cost/token-usage/execution-metrics assertions).
33
+
34
+ ### Step 2: Evaluate Each Assertion
35
+
36
+ For each assertion in the test's `assertions[]`, evaluate it natively based on its type:
37
+
38
+ **Deterministic assertions** — run the check directly. Write a short Bash script when multiple checks are needed:
39
+
40
+ | Type | How to evaluate |
41
+ |------|----------------|
42
+ | `contains` | Check if response includes the `value` substring (case-sensitive) |
43
+ | `contains-any` | Check if response includes ANY of the `value[]` substrings (case-sensitive) |
44
+ | `contains-all` | Check if response includes ALL of the `value[]` substrings (case-sensitive) |
45
+ | `icontains` / `icontains-any` / `icontains-all` | Same as above, case-insensitive |
46
+ | `equals` | `response.trim() === value.trim()` |
47
+ | `regex` | `new RegExp(value).test(response)` |
48
+ | `starts-with` | `response.startsWith(value)` |
49
+ | `ends-with` | `response.endsWith(value)` |
50
+ | `is-json` | `try { JSON.parse(response); PASS } catch { FAIL }` |
51
+ | `field-accuracy` | Parse response as JSON, check each field path against `expected` values |
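The table above expresses the checks as JavaScript expressions. A Python approximation of a subset of them is sketched below; the assertion dict shape (`type`, `value`) and the function name `check_deterministic` are assumptions, and Python's `re` and `json` modules differ from JavaScript's `RegExp` and `JSON.parse` in some corner cases.

```python
import json
import re


def check_deterministic(assertion, response):
    """Return True or False for a subset of the deterministic types."""
    kind = assertion["type"]
    value = assertion.get("value")
    if kind == "contains":
        return value in response
    if kind == "contains-any":
        return any(item in response for item in value)
    if kind == "contains-all":
        return all(item in response for item in value)
    if kind == "icontains":
        return value.lower() in response.lower()
    if kind == "equals":
        return response.strip() == value.strip()
    if kind == "regex":
        # Unanchored search, like RegExp.prototype.test in JavaScript.
        return re.search(value, response) is not None
    if kind == "starts-with":
        return response.startswith(value)
    if kind == "ends-with":
        return response.endswith(value)
    if kind == "is-json":
        try:
            json.loads(response)
            return True
        except json.JSONDecodeError:
            return False
    raise ValueError(f"unsupported assertion type: {kind}")
```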
52
+
53
+ **Metric assertions** — read `timing-file` and compare:
54
+
55
+ | Type | How to evaluate |
56
+ |------|----------------|
57
+ | `latency` | Compare `duration_ms` from timing.json against `threshold` |
58
+ | `cost` | Compare cost data against `threshold` |
59
+ | `token-usage` | Compare `total_tokens` from timing.json against `threshold` |
60
+ | `execution-metrics` | Compare timing.json metrics against configured thresholds |
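A minimal sketch of the threshold comparison for `latency` and `token-usage` follows. The field names `duration_ms` and `total_tokens` come from the table; the pass rule (measured value at or below `threshold`), the assertion dict shape, and the sample timing data are assumptions.

```python
import json


def check_metric(assertion, timing):
    """Compare a timing.json metric against the assertion threshold.

    Assumption: the assertion passes when the value is <= threshold.
    """
    field_by_type = {"latency": "duration_ms", "token-usage": "total_tokens"}
    field = field_by_type[assertion["type"]]
    if field not in timing:
        return None  # metric was not recorded; leave it unscored
    return timing[field] <= assertion["threshold"]


# Hypothetical timing.json content.
timing = json.loads('{"duration_ms": 1200, "total_tokens": 850}')
latency_ok = check_metric({"type": "latency", "threshold": 2000}, timing)
tokens_ok = check_metric({"type": "token-usage", "threshold": 500}, timing)
```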
61
+
62
+ **LLM-graded assertions** — YOU are the grader. Use your own reasoning:
63
+
64
+ | Type | How to evaluate |
65
+ |------|----------------|
66
+ | `llm-grader` | Read the `prompt` field. Evaluate the response against those criteria. Score 0.0-1.0 with evidence. |
67
+ | `rubric` / `rubrics` | Read rubric items/criteria. Score each item 0.0-1.0. Aggregate as weighted average. |
68
+
69
+ For LLM-graded types: be rigorous and fair. Score based on substance, not exact wording. If a `criteria` field exists on the test case, use it as additional context for your evaluation. If `expected_output` exists, use it as a reference answer (not as the only correct answer).
70
+
71
+ **Script-based assertions** — run via Bash:
72
+
73
+ | Type | How to evaluate |
74
+ |------|----------------|
75
+ | `code-grader` | Run: `bun <script-path>` or `python <script-path>`. Pass response via file. Parse stdout JSON: `{"score": N, "reason": "..."}` |
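The code-grader step might be driven from Python roughly as shown below. The `{"score": N, "reason": "..."}` stdout contract comes from the table; passing the response file path as the last command-line argument is an assumption (the text only says the response is passed via file), and `run_code_grader` is a hypothetical helper.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_code_grader(command, response_text):
    """Run a grader script and parse the JSON it prints to stdout.

    command: argv prefix, e.g. ["python", "<script-path>"].
    Assumption: the response file path is appended as the last argument.
    """
    with tempfile.TemporaryDirectory() as tmp:
        response_path = Path(tmp) / "response.md"
        response_path.write_text(response_text, encoding="utf-8")
        completed = subprocess.run(
            [*command, str(response_path)],
            capture_output=True,
            text=True,
            check=True,   # raise if the grader exits non-zero
            timeout=60,   # arbitrary limit chosen for this sketch
        )
    result = json.loads(completed.stdout)
    return float(result["score"]), result.get("reason", "")
```

A call such as `run_code_grader([sys.executable, "grader.py"], response)` would return the parsed score and reason, where `grader.py` stands in for whatever script the eval configures.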
76
+
77
+ **Composite assertions** — evaluate sub-assertions, then aggregate per the configured mode (weighted_average, min, max, all_pass).
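The four aggregation modes named above could be sketched as follows. The sub-result shape (`score`, `weight`, `passed`) and the 1.0/0.0 result for `all_pass` are assumptions; the exact semantics are defined by AgentV, not by this sketch.

```python
def aggregate_composite(sub_results, mode):
    """Aggregate sub-assertion results according to the configured mode."""
    if not sub_results:
        raise ValueError("composite assertion has no sub-assertions")
    scores = [r["score"] for r in sub_results]
    if mode == "weighted_average":
        total_weight = sum(r.get("weight", 1.0) for r in sub_results)
        weighted = sum(r["score"] * r.get("weight", 1.0) for r in sub_results)
        return weighted / total_weight
    if mode == "min":
        return min(scores)
    if mode == "max":
        return max(scores)
    if mode == "all_pass":
        # Assumption: 1.0 only when every sub-assertion passed, else 0.0.
        return 1.0 if all(r["passed"] for r in sub_results) else 0.0
    raise ValueError(f"unknown aggregation mode: {mode}")


# Hypothetical sub-assertion results.
subs = [
    {"score": 1.0, "weight": 2.0, "passed": True},
    {"score": 0.4, "weight": 1.0, "passed": False},
]
composite_scores = {
    mode: aggregate_composite(subs, mode)
    for mode in ("weighted_average", "min", "max", "all_pass")
}
```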
78
+
79
+ **Tool inspection assertions** — evaluate if transcript data is available:
80
+
81
+ | Type | How to evaluate |
82
+ |------|----------------|
83
+ | `tool-trajectory` | Inspect transcript for tool calls, match against expected sequence/mode |
84
+ | `skill-trigger` | Check if the named skill was invoked in tool calls |
85
+
86
+ If transcript data is not available for tool inspection assertions, record `score: null` with a note that transcript data was not captured. Exclude from the weighted average.
87
+
88
+ ### Step 3: Apply Negate
89
+
90
+ If any assertion has `negate: true`, invert the result:
91
+ - PASS becomes FAIL, FAIL becomes PASS
92
+ - Score is inverted: `1.0 - score`
93
+
94
+ ### Step 4: Calculate Weighted Score
95
+
96
+ Compute the overall score as a weighted average across all non-null assertions:
97
+ - Each assertion's `weight` defaults to 1.0 if not specified
98
+ - `overall_score = sum(score_i * weight_i) / sum(weight_i)` (excluding null-scored assertions)
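Steps 3 and 4 can be sketched together: invert negated results, drop null-scored assertions, then take the weighted average with a default weight of 1.0. The result dict shape and both function names are hypothetical.

```python
def apply_negate(result):
    """Invert score and verdict when `negate` is true (Step 3)."""
    if not result.get("negate") or result["score"] is None:
        return result
    return {
        **result,
        "score": 1.0 - result["score"],
        "passed": not result["passed"],
    }


def weighted_overall_score(results):
    """Weighted average over non-null assertion scores (Step 4)."""
    scored = [apply_negate(r) for r in results]
    scored = [r for r in scored if r["score"] is not None]
    total_weight = sum(r.get("weight", 1.0) for r in scored)
    if total_weight == 0:
        return None  # nothing could be scored
    return sum(r["score"] * r.get("weight", 1.0) for r in scored) / total_weight


# Hypothetical results: one weighted pass, one negated pass, one unscored.
results = [
    {"score": 1.0, "passed": True, "weight": 2.0},
    {"score": 1.0, "passed": True, "negate": True},
    {"score": None, "passed": None},
]
overall = weighted_overall_score(results)
```

Here the negated assertion contributes 0.0 at weight 1.0 and the null-scored one is excluded, so the average is taken over a total weight of 3.0.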
99
+
100
+ ### Step 5: Structured Evidence per Assertion
101
+
102
+ For every assertion, capture per-assertion evidence:
103
+
104
+ ```json
105
+ {
106
+ "text": "Response contains 'hello world'",
107
+ "passed": true,
108
+ "evidence": "Found in paragraph 2: 'The output is hello world as expected'"
109
+ }
110
+ ```
111
+
112
+ For each assertion:
113
+ 1. **Search for evidence** in the candidate response and any available outputs
114
+ 2. **Cite specifically**: Quote the exact text or describe what you found
115
+ 3. **Determine verdict** using the Surface vs Substance grading standards below
116
+
117
+ ### Step 6: Extract and Verify Claims
118
+
119
+ Beyond the predefined assertions, extract implicit claims from the candidate's output and verify them. This catches issues that predefined assertions miss.
120
+
121
+ 1. **Extract claims** from the candidate response:
122
+ - **Factual claims** — concrete statements ("The form has 12 fields", "Response time is under 200ms")
123
+ - **Process claims** — what the agent says it did ("Used pypdf to fill the form", "Ran all 15 test cases")
124
+ - **Quality claims** — self-assessments ("All fields were filled correctly", "The output is production-ready")
125
+
126
+ 2. **Verify each claim**:
127
+ - **Factual claims**: Check against the outputs or reference data
128
+ - **Process claims**: Verify from available evidence (logs, file contents, tool output)
129
+ - **Quality claims**: Evaluate whether the claim is justified by the actual output
130
+
131
+ 3. **Flag unverifiable claims**: Note claims that cannot be verified with available information — these are not automatic failures but should be recorded
132
+
133
+ ### Step 7: Read User Notes
134
+
135
+ If executor notes exist (e.g., `user_notes.md` in the output directory), read and consider them:
136
+
137
+ 1. Note any uncertainties or issues flagged by the executor
138
+ 2. Include relevant concerns in the grading output
139
+ 3. These may reveal problems even when assertions pass
140
+
141
+ If no user notes are found, set `user_notes_summary` to `{"uncertainties": [], "needs_review": [], "workarounds": []}`.
142
+
143
+ ### Step 8: Critique the Evals
144
+
145
+ After grading, consider whether the evals themselves could be improved. Only surface suggestions when there's a clear gap. Keep the bar high — flag things the eval author would say "good catch" about, not nitpicks.
146
+
147
+ Suggestions worth raising:
148
+ - An assertion that passed but would also pass for a clearly wrong output
149
+ - An important outcome you observed that no assertion covers
150
+ - An assertion that can't actually be verified from the available outputs
151
+ - An assertion that is trivially satisfiable without actually doing the work
152
+
153
+ If the evals are solid, set eval_feedback to `{"suggestions": [], "overall": "No suggestions, evals look solid."}`.
154
+
155
+ ### Step 9: Write results to disk
156
+
157
+ Write results to `{bench-dir}/{test-id}/llm_grader_results/<grader-name>.json`, where `<grader-name>` matches the filename from `llm_graders/<name>.json` (e.g. if the grader config is `llm_graders/rubrics.json`, write to `llm_grader_results/rubrics.json`).
158
+
159
+ Do **NOT** write directly to `grading.json` — that file is produced by `agentv pipeline bench` after merging all `llm_grader_results`. Writing directly to it bypasses the merge step and will cause `pipeline bench` to report `pass_rate=0`.
160
+
161
+ ```json
162
+ {
163
+ "score": 0.85,
164
+ "assertions": [
165
+ {
166
+ "text": "Response contains 'hello'",
167
+ "passed": true,
168
+ "evidence": "Found in paragraph 2: 'hello world'"
169
+ }
170
+ ],
171
+ "summary": {
172
+ "passed": 1,
173
+ "failed": 0,
174
+ "total": 1,
175
+ "pass_rate": 1.0
176
+ },
177
+ "claims": [
178
+ {
179
+ "claim": "Used async/await pattern",
180
+ "type": "process",
181
+ "verified": true,
182
+ "evidence": "Line 15 of output uses await fetch()"
183
+ }
184
+ ],
185
+ "user_notes_summary": {
186
+ "uncertainties": [],
187
+ "needs_review": [],
188
+ "workarounds": []
189
+ },
190
+ "eval_feedback": {
191
+ "suggestions": [],
192
+ "overall": "No suggestions, evals look solid."
193
+ }
194
+ }
195
+ ```
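Writing the result file to the path described in Step 9 might look like the sketch below. The path layout and field names follow the text and the JSON example; `write_grader_result` is a hypothetical helper, and reporting a `pass_rate` of 0.0 when there are no assertions is an assumption.

```python
import json
import tempfile
from pathlib import Path


def write_grader_result(bench_dir, test_id, grader_name, score, assertions):
    """Write {bench-dir}/{test-id}/llm_grader_results/<grader-name>.json."""
    passed = sum(1 for a in assertions if a["passed"])
    total = len(assertions)
    payload = {
        "score": score,
        "assertions": assertions,
        "summary": {
            "passed": passed,
            "failed": total - passed,
            "total": total,
            # Assumption: report 0.0 when there are no assertions.
            "pass_rate": passed / total if total else 0.0,
        },
        "claims": [],
        "user_notes_summary": {
            "uncertainties": [], "needs_review": [], "workarounds": [],
        },
        "eval_feedback": {
            "suggestions": [],
            "overall": "No suggestions, evals look solid.",
        },
    }
    out_dir = Path(bench_dir) / test_id / "llm_grader_results"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{grader_name}.json"
    out_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return out_path


# Usage with a temporary directory standing in for the real bench-dir.
with tempfile.TemporaryDirectory() as tmp:
    path = write_grader_result(
        tmp, "test-1", "rubrics", 0.85,
        [{"text": "Response contains 'hello'", "passed": True,
          "evidence": "Found in paragraph 2: 'hello world'"}],
    )
    written = json.loads(path.read_text(encoding="utf-8"))
    relative_path = path.relative_to(tmp).as_posix()
```

Note that the helper never touches `grading.json`, in line with the warning above that this file is produced by `agentv pipeline bench`.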
196
+
197
+ ### Field Descriptions
198
+
199
+ `pipeline bench` consumes only `score` and `assertions[]` from this file when merging into the canonical `grading.json`. The remaining fields are preserved on disk for human review and downstream tooling, but do not flow into the merged output.
200
+
201
+ **Consumed by `pipeline bench`:**
202
+ - **score**: Weighted overall score for this grader (0.0-1.0)
203
+ - **assertions**: Array of per-assertion results — `text` (assertion description), `passed` (boolean), `evidence` (cited quote or description)
204
+
205
+ **Kept for traceability (not merged):**
206
+ - **summary**: Aggregate stats — `passed`, `failed`, `total`, `pass_rate` (0.0-1.0)
207
+ - **claims**: Extracted and verified claims — `claim` (statement), `type` (factual/process/quality), `verified` (boolean), `evidence`
208
+ - **user_notes_summary**: Issues from executor notes — `uncertainties[]`, `needs_review[]`, `workarounds[]`. Empty arrays if no notes found.
209
+ - **eval_feedback**: Suggestions for improving the evals — `suggestions[]` (array of `{assertion?, reason}`), `overall` (brief assessment)
210
+
211
+ ## Grading Standards: Surface vs Substance
212
+
213
+ Apply these standards to every assertion and claim. The key question is always: does the evidence reflect genuine task completion, or just surface-level compliance?
214
+
215
+ **PASS when:**
216
+ - Clear evidence the assertion is true AND the evidence reflects genuine substance
217
+ - Example: a file exists AND contains the correct content, not just the right filename
218
+ - Example: a calculation is present AND produces the correct result, not just a formula placeholder
219
+
220
+ **FAIL when:**
221
+ - No evidence found, or evidence contradicts the assertion
222
+ - The evidence is superficial — technically satisfied but the underlying task outcome is wrong or incomplete
223
+ - The output appears to meet the assertion by coincidence rather than actually doing the work
224
+ - Example: correct filename but empty/wrong content
225
+ - Example: assertion checks for a keyword that appears in boilerplate rather than in meaningful output
226
+
227
+ **When uncertain:** The burden of proof is on passing: an assertion passes only when the evidence clearly supports it. Do not give the benefit of the doubt.
228
+
229
+ ## Grading Guidelines
230
+
231
+ - Evaluate substance over style — correct information with different wording scores high.
232
+ - A response that meets all criteria but uses different structure than the reference is still a pass.
233
+ - Be strict about factual correctness and completeness.
234
+ - Score 1.0 only when all criteria are fully met. Use partial scores (0.0-1.0) for partial matches.
235
+ - Do NOT give inflated scores. If something is missing, reflect it in the score and in a failed assertion entry.
236
+ - Base verdicts on evidence, not assumptions. Quote the exact text that supports your verdict.
237
+ - Apply the same standard consistently to each assertion.
238
+ - Explain failures clearly — make it clear why evidence was insufficient.