agentv 3.10.2 → 3.10.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dist/{chunk-6UE665XI.js → chunk-7LC3VNOC.js} +4 -4
- package/dist/{chunk-KGK5NUFG.js → chunk-JUQCB3ZW.js} +56 -15
- package/dist/chunk-JUQCB3ZW.js.map +1 -0
- package/dist/{chunk-F7LAJMTO.js → chunk-U556GRI3.js} +4 -4
- package/dist/{chunk-F7LAJMTO.js.map → chunk-U556GRI3.js.map} +1 -1
- package/dist/cli.js +3 -3
- package/dist/{dist-3QUJEJUT.js → dist-2X7A3TTC.js} +2 -2
- package/dist/index.js +3 -3
- package/dist/{interactive-EO6AR2R3.js → interactive-CSA4KIND.js} +3 -3
- package/dist/templates/.agentv/.env.example +9 -11
- package/dist/templates/.agentv/config.yaml +13 -4
- package/dist/templates/.agentv/targets.yaml +16 -0
- package/package.json +1 -1
- package/dist/chunk-KGK5NUFG.js.map +0 -1
- package/dist/templates/.agents/skills/agentv-chat-to-eval/README.md +0 -84
- package/dist/templates/.agents/skills/agentv-chat-to-eval/SKILL.md +0 -144
- package/dist/templates/.agents/skills/agentv-chat-to-eval/examples/transcript-json.md +0 -67
- package/dist/templates/.agents/skills/agentv-chat-to-eval/examples/transcript-markdown.md +0 -101
- package/dist/templates/.agents/skills/agentv-eval-builder/SKILL.md +0 -458
- package/dist/templates/.agents/skills/agentv-eval-builder/references/config-schema.json +0 -36
- package/dist/templates/.agents/skills/agentv-eval-builder/references/custom-evaluators.md +0 -118
- package/dist/templates/.agents/skills/agentv-eval-builder/references/eval-schema.json +0 -12753
- package/dist/templates/.agents/skills/agentv-eval-builder/references/rubric-evaluator.md +0 -77
- package/dist/templates/.agents/skills/agentv-eval-orchestrator/SKILL.md +0 -50
- package/dist/templates/.agents/skills/agentv-prompt-optimizer/SKILL.md +0 -78
- package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +0 -177
- package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md +0 -316
- package/dist/templates/.claude/skills/agentv-eval-builder/references/compare-command.md +0 -137
- package/dist/templates/.claude/skills/agentv-eval-builder/references/composite-evaluator.md +0 -215
- package/dist/templates/.claude/skills/agentv-eval-builder/references/config-schema.json +0 -27
- package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md +0 -115
- package/dist/templates/.claude/skills/agentv-eval-builder/references/eval-schema.json +0 -278
- package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md +0 -333
- package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md +0 -79
- package/dist/templates/.claude/skills/agentv-eval-builder/references/structured-data-evaluators.md +0 -121
- package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md +0 -298
- package/dist/templates/.claude/skills/agentv-prompt-optimizer/SKILL.md +0 -78
- package/dist/templates/.github/prompts/agentv-eval-build.prompt.md +0 -5
- package/dist/templates/.github/prompts/agentv-optimize.prompt.md +0 -4
- package/dist/{chunk-6UE665XI.js.map → chunk-7LC3VNOC.js.map} +0 -0
- package/dist/{dist-3QUJEJUT.js.map → dist-2X7A3TTC.js.map} +0 -0
- package/dist/{interactive-EO6AR2R3.js.map → interactive-CSA4KIND.js.map} +0 -0
@@ -1,77 +0,0 @@
-# Rubric Evaluator
-
-## Field Reference
-
-| Field | Type | Default | Description |
-|-------|------|---------|-------------|
-| `id` | string | auto-generated | Unique identifier |
-| `outcome` | string | required* | Criterion being evaluated (*optional if `score_ranges` used) |
-| `weight` | number | 1.0 | Relative importance |
-| `required` | boolean | true | Failing forces verdict to 'fail' (checklist mode) |
-| `required_min_score` | integer | - | Minimum 0-10 score to pass (score-range mode) |
-| `score_ranges` | map or array | - | Score range definitions for analytic scoring |
-
-## Checklist Mode
-
-```yaml
-rubrics:
-  - Mentions divide-and-conquer approach
-  - id: complexity
-    outcome: States time complexity correctly
-    weight: 2.0
-    required: true
-  - id: examples
-    outcome: Includes code examples
-    weight: 1.0
-    required: false
-```
-
-## Score-Range Mode
-
-Shorthand map format (recommended):
-
-```yaml
-rubrics:
-  - id: correctness
-    weight: 2.0
-    required_min_score: 7
-    score_ranges:
-      0: Critical bugs
-      3: Minor bugs
-      6: Correct with minor issues
-      9: Fully correct
-```
-
-Map keys are lower bounds (0-10). Each range extends from its key to (next key - 1), with the last extending to 10. Must start at 0.
-
-Array format is also accepted:
-
-```yaml
-score_ranges:
-  - score_range: [0, 2]
-    outcome: Critical bugs
-  - score_range: [3, 5]
-    outcome: Minor bugs
-  - score_range: [6, 8]
-    outcome: Correct with minor issues
-  - score_range: [9, 10]
-    outcome: Fully correct
-```
-
-Ranges must be integers 0-10, non-overlapping, covering all values 0-10.
-
-## Scoring
-
-**Checklist:** `score = sum(satisfied weights) / sum(all weights)`
-
-**Score-range:** `score = weighted_average(raw_score / 10)` per criterion
-
-## Verdicts
-
-| Verdict | Condition |
-|---------|-----------|
-| `pass` | score >= 0.8 AND all gating criteria satisfied |
-| `borderline` | score >= 0.6 AND all gating criteria satisfied |
-| `fail` | score < 0.6 OR any gating criterion failed |
-
-Gating: checklist uses `required: true`, score-range uses `required_min_score: N`.
@@ -1,50 +0,0 @@
----
-name: agentv-eval-orchestrator
-description: Run AgentV evaluations without API keys by orchestrating eval subcommands. Use this skill when asked to run evals, evaluate an agent, or test prompt quality using agentv.
----
-
-# AgentV Eval Orchestrator
-
-Run AgentV evaluations by acting as the LLM yourself — no API keys needed.
-
-## Quick Start
-
-```bash
-agentv prompt eval <eval-file.yaml>
-```
-
-This outputs a complete orchestration prompt with step-by-step instructions and all test IDs. Follow its instructions.
-
-## Workflow
-
-For each test, run these three steps:
-
-### 1. Get Task Input
-
-```bash
-agentv prompt eval input <path> --test-id <id>
-```
-
-Returns JSON with `input`, `guideline_paths`, and `criteria`. File references in messages use absolute paths — read them from the filesystem.
-
-### 2. Execute the Task
-
-You ARE the candidate LLM. Read `input` from step 1, read any referenced files, and answer the task. Save your response to a temp file.
-
-**Important**: Do not leak `criteria` into your answer — it's for your reference when judging, not part of the task.
-
-### 3. Judge the Result
-
-```bash
-agentv prompt eval judge <path> --test-id <id> --answer-file /tmp/eval_<id>.txt
-```
-
-Returns JSON with an `evaluators` array. Each evaluator has a `status`:
-
-- **`"completed"`** — Deterministic score is final. Read `result.score` (0.0–1.0).
-- **`"prompt_ready"`** — LLM grading required. Send `prompt.system_prompt` as system and `prompt.user_prompt` as user to yourself. Parse the JSON response to get `score`, `hits`, `misses`.
-
-## When to use this vs `agentv eval`
-
-- **`agentv eval`** — You have API keys configured. Runs everything end-to-end automatically.
-- **`agentv prompt`** — No API keys. You orchestrate: get input, answer the task yourself, judge the result.
@@ -1,78 +0,0 @@
----
-name: agentv-prompt-optimizer
-description: Iteratively optimize prompt files against AgentV evaluation datasets by analyzing failures and refining instructions.
----
-
-# AgentV Prompt Optimizer
-
-## Input Variables
-- `eval-path`: Path or glob pattern to the AgentV evaluation file(s) to optimize against
-- `optimization-log-path` (optional): Path where optimization progress should be logged
-
-## Workflow
-
-1. **Initialize**
-   - Verify `<eval-path>` (file or glob) targets the correct system.
-   - **Identify Prompt Files**:
-     - Infer prompt files from the eval file content (look for `file:` references in `input` that match these patterns).
-     - Recursively check referenced prompt files for *other* prompt references (dependencies).
-     - If multiple prompts are found, consider ALL of them as candidates for optimization.
-   - **Identify Optimization Log**:
-     - If `<optimization-log-path>` is provided, use it.
-     - If not, create a new one in the parent directory of the eval files: `optimization-[timestamp].md`.
-   - Read content of the identified prompt file.
-
-2. **Optimization Loop** (Max 10 iterations)
-   - **Execute (The Generator)**: Run `agentv eval <eval-path>`.
-     - *Targeted Run*: If iterating on specific stubborn failures, use `--test-id <test_id>` to run only the relevant tests.
-   - **Analyze (The Reflector)**:
-     - Locate the results file path from the console output (e.g., `.agentv/results/eval_...jsonl`).
-     - **Orchestrate Subagent**: Use `runSubagent` to analyze the results.
-       - **Task**: Read the results file, calculate pass rate, and perform root cause analysis.
-       - **Output**: Return a structured analysis including:
-         - **Score**: Current pass rate.
-         - **Root Cause**: Why failures occurred (e.g., "Ambiguous definition", "Hallucination").
-         - **Insight**: Key learning or pattern identified from the failures.
-         - **Strategy**: High-level plan to fix the prompt (e.g., "Clarify section X", "Add negative constraint").
-   - **Decide**:
-     - If **100% pass**: STOP and report success.
-     - If **Score decreased**: Revert last change, try different approach.
-     - If **No improvement** (2x): STOP and report stagnation.
-   - **Refine (The Curator)**:
-     - **Orchestrate Subagent**: Use `runSubagent` to apply the fix.
-       - **Task**: Read the relevant prompt file(s), apply the **Strategy** from the Reflector, and generate the log entry.
-       - **Output**: The **Log Entry** describing the specific operation performed.
-         ```markdown
-         ### Iteration [N]
-         - **Operation**: [ADD / UPDATE / DELETE]
-         - **Target**: [Section Name]
-         - **Change**: [Specific text added/modified]
-         - **Trigger**: [Specific failing test case or error pattern]
-         - **Rationale**: [From Reflector: Root Cause]
-         - **Score**: [From Reflector: Current Pass Rate]
-         - **Insight**: [From Reflector: Key Learning]
-         ```
-     - **Strategy**: Treat the prompt as a structured set of rules. Execute atomic operations:
-       - **ADD**: Insert a new rule if a constraint was missed.
-       - **UPDATE**: Refine an existing rule to be clearer or more general.
-         - *Clarify*: Make ambiguous instructions specific.
-         - *Generalize*: Refactor specific fixes into high-level principles (First Principles).
-       - **DELETE**: Remove obsolete, redundant, or harmful rules.
-         - *Prune*: If a general rule covers specific cases, delete the specific ones.
-       - **Negative Constraint**: If hallucinating, explicitly state what NOT to do. Prefer generalized prohibitions over specific forbidden tokens where possible.
-       - **Safety Check**: Ensure new rules don't contradict existing ones (unless intended).
-     - **Constraint**: Avoid rewriting large sections. Make surgical, additive changes to preserve existing behavior.
-   - **Log Result**:
-     - Append the **Log Entry** returned by the Curator to the optimization log file.
-
-3. **Completion**
-   - Report final score.
-   - Summarize key changes made to the prompt.
-   - **Finalize Optimization Log**: Add a summary header to the optimization log file indicating the session completion and final score.
-
-## Guidelines
-- **Generalization First**: Prefer broad, principle-based guidelines over specific examples or "hotfixes". Only use specific rules if generalized instructions fail to achieve the desired score.
-- **Simplicity ("Less is More")**: Avoid overfitting to the test set. If a specific rule doesn't significantly improve the score compared to a general one, choose the general one.
-- **Structure**: Maintain existing Markdown headers/sections.
-- **Progressive Disclosure**: If the prompt grows too large (>200 lines), consider moving specialized logic into a separate file or skill.
-- **Quality Criteria**: Ensure the prompt defines a clear persona, specific task, and measurable success criteria.
@@ -1,177 +0,0 @@
----
-name: agentv-eval-builder
-description: Create and maintain AgentV YAML evaluation files for testing AI agent performance. Use this skill when creating new eval files, adding eval cases, or configuring evaluators.
----
-
-# AgentV Eval Builder
-
-Comprehensive docs: https://agentv.dev
-
-## Quick Start
-
-```yaml
-description: Example eval
-execution:
-  target: default
-
-evalcases:
-  - id: greeting
-    expected_outcome: Friendly greeting
-    input: "Say hello"
-    expected_output: "Hello! How can I help you?"
-    rubrics:
-      - Greeting is friendly and warm
-      - Offers to help
-```
-
-## Eval File Structure
-
-**Required:** `evalcases` (array)
-**Optional:** `description`, `execution`, `dataset`
-
-**Eval case fields:**
-
-| Field | Required | Description |
-|-------|----------|-------------|
-| `id` | yes | Unique identifier |
-| `expected_outcome` | yes | What the response should accomplish |
-| `input` / `input_messages` | yes | Input to the agent |
-| `expected_output` / `expected_messages` | no | Gold-standard reference answer |
-| `rubrics` | no | Inline evaluation criteria |
-| `execution` | no | Per-case execution overrides |
-| `conversation_id` | no | Thread grouping |
-
-**Shorthand aliases:**
-- `input` (string) expands to `[{role: "user", content: "..."}]`
-- `expected_output` (string/object) expands to `[{role: "assistant", content: ...}]`
-- Canonical `input_messages` / `expected_messages` take precedence when both present
-
-**Message format:** `{role, content}` where role is `system`, `user`, `assistant`, or `tool`
-**Content types:** inline text, `{type: "file", value: "./path.md"}`
-**File paths:** relative from eval file dir, or absolute with `/` prefix from repo root
-
-**JSONL format:** One eval case per line as JSON. Optional `.yaml` sidecar for shared defaults. See `examples/features/basic-jsonl/`.
-
-## Evaluator Types
-
-Configure via `execution.evaluators` array. Multiple evaluators produce a weighted average score.
-
-### code_judge
-```yaml
-- name: format_check
-  type: code_judge
-  script: uv run validate.py
-  cwd: ./scripts # optional working directory
-  target: {} # optional: enable LLM target proxy (max_calls: 50)
-```
-Contract: stdin JSON -> stdout JSON `{score, hits, misses, reasoning}`
-See `references/custom-evaluators.md` for templates.
-
-### llm_judge
-```yaml
-- name: quality
-  type: llm_judge
-  prompt: ./prompts/eval.md # markdown template or script config
-  model: gpt-5-chat # optional model override
-  config: # passed to script templates as context.config
-    strictness: high
-```
-Variables: `{{question}}`, `{{expected_outcome}}`, `{{candidate_answer}}`, `{{reference_answer}}`, `{{input_messages}}`, `{{expected_messages}}`, `{{output_messages}}`
-- Markdown templates: use `{{variable}}` syntax
-- TypeScript templates: use `definePromptTemplate(fn)` from `@agentv/eval`, receives context object with all variables + `config`
-
-### composite
-```yaml
-- name: gate
-  type: composite
-  evaluators:
-    - name: safety
-      type: llm_judge
-      prompt: ./safety.md
-    - name: quality
-      type: llm_judge
-  aggregator:
-    type: weighted_average
-    weights: { safety: 0.3, quality: 0.7 }
-```
-Aggregator types: `weighted_average`, `all_or_nothing`, `minimum`, `maximum`, `safety_gate`
-- `safety_gate`: fails immediately if the named gate evaluator scores below threshold (default 1.0)

-### tool_trajectory
-```yaml
-- name: tool_check
-  type: tool_trajectory
-  mode: any_order # any_order | in_order | exact
-  minimums: # for any_order
-    knowledgeSearch: 2
-  expected: # for in_order/exact
-    - tool: knowledgeSearch
-      args: { query: "search term" } # partial deep equality match
-    - tool: documentRetrieve
-      args: any # any arguments accepted
-      max_duration_ms: 5000 # per-tool latency assertion
-    - tool: summarize # omit args to skip argument checking
-```
-
-### field_accuracy
-```yaml
-- name: fields
-  type: field_accuracy
-  match_type: exact # exact | date | numeric_tolerance
-  numeric_tolerance: 0.01 # for numeric_tolerance match_type
-  aggregation: weighted_average # weighted_average | all_or_nothing
-```
-Compares `output_messages` fields against `expected_messages` fields.
-
-### latency
-```yaml
-- name: speed
-  type: latency
-  max_ms: 5000
-```
-
-### cost
-```yaml
-- name: budget
-  type: cost
-  max_usd: 0.10
-```
-
-### token_usage
-```yaml
-- name: tokens
-  type: token_usage
-  max_total_tokens: 4000
-```
-
-### rubric (inline)
-```yaml
-rubrics:
-  - Simple string criterion
-  - id: weighted
-    expected_outcome: Detailed criterion
-    weight: 2.0
-    required: true
-```
-See `references/rubric-evaluator.md` for score-range mode and scoring formula.
-
-## CLI Commands
-
-```bash
-# Run evaluation
-bun agentv eval <file.yaml> [--eval-id <id>] [--target <name>] [--dry-run]
-
-# Validate eval file
-bun agentv validate <file.yaml>
-
-# Compare results between runs
-bun agentv compare <results1.jsonl> <results2.jsonl>
-
-# Generate rubrics from expected_outcome
-bun agentv generate rubrics <file.yaml> [--target <name>]
-```
-
-## Schemas
-
-- Eval file: `references/eval-schema.json`
-- Config: `references/config-schema.json`
@@ -1,316 +0,0 @@
-# Batch CLI Evaluation Guide
-
-Guide for evaluating batch CLI output where a single runner processes all evalcases at once and outputs JSONL.
-
-## Overview
-
-Batch CLI evaluation is used when:
-- An external tool processes multiple inputs in a single invocation (e.g., AML screening, bulk classification)
-- The runner reads the eval YAML directly to extract all evalcases
-- Output is JSONL with records keyed by evalcase `id`
-- Each evalcase has its own evaluator to validate its corresponding output record
-
-## Execution Flow
-
-1. **AgentV** invokes the batch runner once, passing `--eval <yaml-path>` and `--output <jsonl-path>`
-2. **Batch runner** reads the eval YAML, extracts all evalcases, processes them, writes JSONL output keyed by `id`
-3. **AgentV** parses JSONL, routes each record to its matching evalcase by `id`
-4. **Per-case evaluator** validates the output for each evalcase independently
-
-## Eval File Structure
-
-```yaml
-description: Batch CLI demo using structured input_messages
-execution:
-  target: batch_cli
-
-evalcases:
-  - id: case-001
-    expected_outcome: |-
-      Batch runner returns JSON with decision=CLEAR.
-
-    expected_messages:
-      - role: assistant
-        content:
-          decision: CLEAR # Structured expected output
-
-    input_messages:
-      - role: system
-        content: You are a batch processor.
-      - role: user
-        content: # Structured input (runner extracts this)
-          request:
-            type: screening_check
-            jurisdiction: AU
-            row:
-              id: case-001
-              name: Example A
-              amount: 5000
-
-    execution:
-      evaluators:
-        - name: decision-check
-          type: code_judge
-          script: bun run ./scripts/check-output.ts
-          cwd: .
-
-  - id: case-002
-    expected_outcome: |-
-      Batch runner returns JSON with decision=REVIEW.
-
-    expected_messages:
-      - role: assistant
-        content:
-          decision: REVIEW
-
-    input_messages:
-      - role: system
-        content: You are a batch processor.
-      - role: user
-        content:
-          request:
-            type: screening_check
-            jurisdiction: AU
-            row:
-              id: case-002
-              name: Example B
-              amount: 25000
-
-    execution:
-      evaluators:
-        - name: decision-check
-          type: code_judge
-          script: bun run ./scripts/check-output.ts
-          cwd: .
-```
-
-## Batch Runner Implementation
-
-The batch runner reads the eval YAML directly and processes all evalcases in one invocation.
-
-### Runner Contract
-
-**Input:** The runner receives the eval file path via `--eval` flag:
-```bash
-bun run batch-runner.ts --eval ./my-eval.yaml --output ./results.jsonl
-```
-
-**Output:** JSONL file where each line is a JSON object with:
-```json
-{"id": "case-001", "text": "{\"decision\": \"CLEAR\", ...}"}
-{"id": "case-002", "text": "{\"decision\": \"REVIEW\", ...}"}
-```
-
-The `id` field must match the evalcase `id` for AgentV to route output to the correct evaluator.
-
-### Output with Tool Trajectory Support
-
-To enable `tool_trajectory` evaluation, include `output_messages` with `tool_calls`:
-
-```json
-{
-  "id": "case-001",
-  "text": "{\"decision\": \"CLEAR\", ...}",
-  "output_messages": [
-    {
-      "role": "assistant",
-      "tool_calls": [
-        {
-          "tool": "screening_check",
-          "input": { "origin_country": "NZ", "amount": 5000 },
-          "output": { "decision": "CLEAR", "reasons": [] }
-        }
-      ]
-    },
-    {
-      "role": "assistant",
-      "content": { "decision": "CLEAR" }
-    }
-  ]
-}
-```
-
-AgentV extracts tool calls directly from `output_messages[].tool_calls[]` for `tool_trajectory` evaluators. This is the recommended format for batch runners that make tool calls.
-
-### Example Runner (TypeScript)
-
-```typescript
-import fs from 'node:fs/promises';
-import { parse } from 'yaml';
-
-type EvalCase = {
-  id: string;
-  input_messages: Array<{ role: string; content: unknown }>;
-};
-
-async function main() {
-  const args = process.argv.slice(2);
-  const evalPath = getFlag(args, '--eval');
-  const outPath = getFlag(args, '--output');
-
-  // Read and parse eval YAML
-  const yamlText = await fs.readFile(evalPath, 'utf8');
-  const parsed = parse(yamlText);
-  const evalcases = parsed.evalcases as EvalCase[];
-
-  // Process each evalcase
-  const results: Array<{ id: string; text: string }> = [];
-  for (const evalcase of evalcases) {
-    const userContent = findUserContent(evalcase.input_messages);
-    const decision = processInput(userContent); // Your logic here
-
-    results.push({
-      id: evalcase.id,
-      text: JSON.stringify({ decision }), // add any other output fields here
-    });
-  }
-
-  // Write JSONL output
-  const jsonl = results.map((r) => JSON.stringify(r)).join('\n') + '\n';
-  await fs.writeFile(outPath, jsonl, 'utf8');
-}
-
-function getFlag(args: string[], name: string): string {
-  const idx = args.indexOf(name);
-  return args[idx + 1];
-}
-
-function findUserContent(messages: Array<{ role: string; content: unknown }>) {
-  return messages.find((m) => m.role === 'user')?.content;
-}
-```
-
-## Evaluator Implementation
-
-Each evalcase has its own evaluator that validates the output. The evaluator receives the standard code_judge input.
-
-### Evaluator Contract
-
-**Input (stdin):** Standard AgentV code_judge format:
-```json
-{
-  "candidate_answer": "{\"id\":\"case-001\",\"decision\":\"CLEAR\",...}",
-  "expected_messages": [{"role": "assistant", "content": {"decision": "CLEAR"}}],
-  "input_messages": [...],
-  ...
-}
-```
-
-**Output (stdout):** Standard evaluator result:
-```json
-{
-  "score": 1.0,
-  "hits": ["decision matches: CLEAR"],
-  "misses": [],
-  "reasoning": "Batch runner decision matches expected."
-}
-```
-
-### Example Evaluator (TypeScript)
-
-```typescript
-import fs from 'node:fs';
-
-type EvalInput = {
-  candidate_answer?: string;
-  expected_messages?: Array<{ role: string; content: unknown }>;
-};
-
-function main() {
-  const stdin = fs.readFileSync(0, 'utf8');
-  const input = JSON.parse(stdin) as EvalInput;
-
-  // Extract expected value from expected_messages
-  const expectedDecision = findExpectedDecision(input.expected_messages);
-
-  // Parse candidate answer (output from batch runner)
-  let candidateDecision: string | undefined;
-  try {
-    const parsed = JSON.parse(input.candidate_answer ?? '');
-    candidateDecision = parsed.decision;
-  } catch {
-    candidateDecision = undefined;
-  }
-
-  // Compare
-  const hits: string[] = [];
-  const misses: string[] = [];
-
-  if (expectedDecision === candidateDecision) {
-    hits.push(`decision matches: ${expectedDecision}`);
-  } else {
-    misses.push(`mismatch: expected=${expectedDecision} actual=${candidateDecision}`);
-  }
-
-  const score = misses.length === 0 ? 1 : 0;
-
-  process.stdout.write(JSON.stringify({
-    score,
-    hits,
-    misses,
-    reasoning: score === 1
-      ? 'Batch runner output matches expected.'
-      : 'Batch runner output did not match expected.',
-  }));
-}
-
-function findExpectedDecision(messages?: Array<{ role: string; content: unknown }>) {
-  if (!messages) return undefined;
-  for (const msg of messages) {
-    if (typeof msg.content === 'object' && msg.content !== null) {
-      return (msg.content as Record<string, unknown>).decision as string;
-    }
-  }
-  return undefined;
-}
-
-main();
-```
-
-## Structured Content in expected_messages
-
-For batch evaluation, use structured objects in `expected_messages.content` to define expected output fields:
-
-```yaml
-expected_messages:
-  - role: assistant
-    content:
-      decision: CLEAR
-      confidence: high
-      reasons: []
-```
-
-The evaluator then extracts these fields and compares against the parsed candidate output.
-
-## Best Practices
-
-1. **Use unique evalcase IDs** - The batch runner and AgentV use `id` to route outputs
-2. **Structured input_messages** - Put structured data in `user.content` for the runner to extract
-3. **Structured expected_messages** - Define expected output as objects for easy validation
-4. **Deterministic runners** - Batch runners should produce consistent output for testing
-5. **Healthcheck support** - Add `--healthcheck` flag for runner validation:
-   ```typescript
-   if (args.includes('--healthcheck')) {
-     console.log('batch-runner: healthy');
-     return;
-   }
-   ```
-
-## Target Configuration
-
-Configure the batch CLI provider in your target:
-
-```yaml
-# In agentv-targets.yaml or eval file
-targets:
-  batch_cli:
-    provider: cli
-    commandTemplate: bun run ./scripts/batch-runner.ts --eval {EVAL_FILE} --output {OUTPUT_FILE}
-    provider_batching: true
-```
-
-Key settings:
-- `provider: cli` - Use CLI provider
-- `provider_batching: true` - Run once for all evalcases
-- `{EVAL_FILE}` - Placeholder for eval file path
-- `{OUTPUT_FILE}` - Placeholder for JSONL output path