agentv 2.0.2 → 2.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +199 -325
- package/dist/{chunk-5AJ7DFUO.js → chunk-5BLNVACB.js} +1208 -882
- package/dist/chunk-5BLNVACB.js.map +1 -0
- package/dist/cli.js +1 -1
- package/dist/index.js +1 -1
- package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +24 -2
- package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md +0 -12
- package/dist/templates/.claude/skills/agentv-eval-builder/references/composite-evaluator.md +215 -215
- package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md +90 -209
- package/package.json +2 -2
- package/dist/chunk-5AJ7DFUO.js.map +0 -1
- package/dist/templates/.agentv/{.env.template → .env.example} +0 -0
package/README.md CHANGED

@@ -1,431 +1,305 @@
 # AgentV

-
+**CLI-first AI agent evaluation. No server. No signup. No overhead.**

-
+AgentV evaluates your agents locally with multi-objective scoring (correctness, latency, cost, safety) from YAML specifications. Deterministic code judges + customizable LLM judges, all version-controlled in Git.

-
-
-This is the recommended method for users who want to use `agentv` as a command-line tool.
-
-1. Install via npm:
+## Installation

+**1. Install:**
 ```bash
-# Install globally
 npm install -g agentv
-
-# Or use npx to run without installing
-npx agentv --help
-```
-
-2. Verify the installation:
-
-```bash
-agentv --help
-```
-
-### Local Development Setup
-
-Follow these steps if you want to contribute to the `agentv` project itself. This workflow uses Bun workspaces for fast, efficient dependency management.
-
-1. Clone the repository and navigate into it:
-
-```bash
-git clone https://github.com/EntityProcess/agentv.git
-cd agentv
 ```

-2.
-
-```bash
-# Install Bun if you don't have it
-curl -fsSL https://bun.sh/install | bash # macOS/Linux
-# or
-powershell -c "irm bun.sh/install.ps1 | iex" # Windows
-
-# Install all workspace dependencies
-bun install
-```
-
-3. Build the project:
-
+**2. Initialize your workspace:**
 ```bash
-
+agentv init
 ```

-
+**3. Configure environment variables:**
+- The init command creates a `.env.example` file in your project root
+- Copy `.env.example` to `.env` and fill in your API keys, endpoints, and other configuration values
+- Update the environment variable names in `.agentv/targets.yaml` to match those defined in your `.env` file

-
-
+**4. Create an eval** (`./evals/example.yaml`):
+```yaml
+description: Math problem solving evaluation
+execution:
+  target: default
+
+evalcases:
+  - id: addition
+    expected_outcome: Correctly calculates 15 + 27 = 42
+
+    input_messages:
+      - role: user
+        content: What is 15 + 27?
+
+    expected_messages:
+      - role: assistant
+        content: "42"
+
+    execution:
+      evaluators:
+        - name: math_check
+          type: code_judge
+          script: ./validators/check_math.py
 ```

-5.
-
+**5. Run the eval:**
 ```bash
-
+agentv eval ./evals/example.yaml
 ```

-
-
-You are now ready to start development. The monorepo contains:
+Results appear in `.agentv/results/eval_<timestamp>.jsonl` with scores, reasoning, and execution traces.

-
-- `apps/cli/` - Command-line interface
+Learn more in the [examples/](examples/README.md) directory. For a detailed comparison with other frameworks, see [docs/COMPARISON.md](docs/COMPARISON.md).

-
+## Why AgentV?

-
-
-
+| Feature | AgentV | [LangWatch](https://github.com/langwatch/langwatch) | [LangSmith](https://github.com/langchain-ai/langsmith-sdk) | [LangFuse](https://github.com/langfuse/langfuse) |
+|---------|--------|-----------|-----------|----------|
+| **Setup** | `npm install` | Cloud account + API key | Cloud account + API key | Cloud account + API key |
+| **Server** | None (local) | Managed cloud | Managed cloud | Managed cloud |
+| **Privacy** | All local | Cloud-hosted | Cloud-hosted | Cloud-hosted |
+| **CLI-first** | ✓ | ✗ | Limited | Limited |
+| **CI/CD ready** | ✓ | Requires API calls | Requires API calls | Requires API calls |
+| **Version control** | ✓ (YAML in Git) | ✗ | ✗ | ✗ |
+| **Evaluators** | Code + LLM + Custom | LLM only | LLM + Code | LLM only |

-
-- The init command creates a `.env.template` file in your project root
-- Copy `.env.template` to `.env` and fill in your API keys, endpoints, and other configuration values
-- Update the environment variable names in `.agentv/targets.yaml` to match those defined in your `.env` file
+**Best for:** Developers who want evaluation in their workflow, not a separate dashboard. Teams prioritizing privacy and reproducibility.

-##
+## Features

-
--
+- **Multi-objective scoring**: Correctness, latency, cost, safety in one run
+- **Multiple evaluator types**: Code validators, LLM judges, custom Python/TypeScript
+- **Built-in targets**: VS Code Copilot, Codex CLI, Pi Coding Agent, Azure OpenAI, local CLI agents
+- **Structured evaluation**: Rubric-based grading with weights and requirements
+- **Batch evaluation**: Run hundreds of test cases in parallel
+- **Export**: JSON, JSONL, YAML formats
+- **Compare results**: Compute deltas between evaluation runs for A/B testing

-
+## Development

-
+Contributing to AgentV? Clone and set up the repository:

 ```bash
-
-agentv
-
-# Validate multiple files
-agentv validate evals/eval1.yaml evals/eval2.yaml
-
-# Validate entire directory (recursively finds all YAML files)
-agentv validate evals/
-```
-
-### Running Evals
-
-Run eval (target auto-selected from eval file or CLI override):
-
-```bash
-# If your eval.yaml contains "target: azure_base", it will be used automatically
-agentv eval "path/to/eval.yaml"
-
-# Override the eval file's target with CLI flag
-agentv eval --target vscode_projectx "path/to/eval.yaml"
+git clone https://github.com/EntityProcess/agentv.git
+cd agentv

-#
-
-```
+# Install Bun if you don't have it
+curl -fsSL https://bun.sh/install | bash

-
+# Install dependencies and build
+bun install && bun run build

-
-
+# Run tests
+bun test
 ```

-
-
-- `eval_paths...`: Path(s) or glob(s) to eval YAML files (required; e.g., `evals/**/*.yaml`)
-- `--target TARGET`: Execution target name from targets.yaml (overrides target specified in eval file)
-- `--targets TARGETS`: Path to targets.yaml file (default: ./.agentv/targets.yaml)
-- `--eval-id EVAL_ID`: Run only the eval case with this specific ID
-- `--out OUTPUT_FILE`: Output file path (default: .agentv/results/eval_<timestamp>.jsonl)
-- `--output-format FORMAT`: Output format: 'jsonl' or 'yaml' (default: jsonl)
-- `--dry-run`: Run with mock model for testing
-- `--agent-timeout SECONDS`: Timeout in seconds for agent response polling (default: 120)
-- `--max-retries COUNT`: Maximum number of retries for timeout cases (default: 2)
-- `--cache`: Enable caching of LLM responses (default: disabled)
-- `--workers COUNT`: Parallel workers for eval cases (default: 3; target `workers` setting used when provided)
-- `--verbose`: Verbose output
+See [AGENTS.md](AGENTS.md) for development guidelines and design principles.

-
+## Core Concepts

-
+**Evaluation files** (`.yaml`) define test cases with expected outcomes. **Targets** specify which agent/provider to evaluate. **Judges** (code or LLM) score results. **Results** are written as JSONL/YAML for analysis and comparison.

-
-2. Eval file specification: `target: my_target` key in the .eval.yaml file
-3. Default fallback: Uses the 'default' target (original behavior)
+## Usage

-
+### Running Evaluations

-
-
-
+```bash
+# Validate evals
+agentv validate evals/my-eval.yaml

-
+# Run an eval with default target (from eval file or targets.yaml)
+agentv eval evals/my-eval.yaml

-
+# Override target
+agentv eval --target azure_base evals/**/*.yaml

-
+# Run specific eval case
+agentv eval --eval-id case-123 evals/my-eval.yaml

-
+# Dry-run with mock provider
+agentv eval --dry-run evals/my-eval.yaml
+```

-
+See `agentv eval --help` for all options: workers, timeouts, output formats, trace dumping, and more.

-
+### Create Custom Evaluators

-
-- `provider`: The model provider (`azure`, `anthropic`, `gemini`, `codex`, `pi-coding-agent`, `vscode`, `vscode-insiders`, `cli`, or `mock`)
-- Provider-specific configuration fields at the top level (no `settings` wrapper needed)
-- Optional fields: `judge_target`, `workers`, `provider_batching`
+Write code judges in Python or TypeScript:

-
+```python
+# validators/check_answer.py
+import json, sys
+data = json.load(sys.stdin)
+candidate_answer = data.get("candidate_answer", "")

-
+hits = []
+misses = []

-
-
-
-
-    api_key: ${{ AZURE_OPENAI_API_KEY }}
-    model: ${{ AZURE_DEPLOYMENT_NAME }}
-    version: ${{ AZURE_OPENAI_API_VERSION }} # Optional: defaults to 2024-12-01-preview
-```
+if "42" in candidate_answer:
+    hits.append("Answer contains correct value (42)")
+else:
+    misses.append("Answer does not contain expected value (42)")

-
+score = 1.0 if hits else 0.0

-
-
-
-
-
-
-    provider_batching: false
-    judge_target: azure_base
-
-  - name: vscode_insiders_projectx
-    provider: vscode-insiders
-    workspace_template: ${{ PROJECTX_WORKSPACE_PATH }}
-    provider_batching: false
-    judge_target: azure_base
+print(json.dumps({
+    "score": score,
+    "hits": hits,
+    "misses": misses,
+    "reasoning": f"Passed {len(hits)} check(s)"
+}))
 ```

-
+Reference evaluators in your eval file:

 ```yaml
-
-
-
-
-
-    cwd: ${{ CLI_EVALS_DIR }} # optional working directory
-    timeout_seconds: 30 # optional per-command timeout
-    healthcheck:
-      type: command # or http
-      command_template: uv run ./my_agent.py --healthcheck
+execution:
+  evaluators:
+    - name: my_validator
+      type: code_judge
+      script: ./validators/check_answer.py
 ```

-
-- `{PROMPT}` - The rendered prompt text (shell-escaped)
-- `{FILES}` - Expands to multiple file arguments using `files_format` template
-- `{GUIDELINES}` - Guidelines content
-- `{EVAL_ID}` - Current eval case ID
-- `{ATTEMPT}` - Retry attempt number
-- `{OUTPUT_FILE}` - Path to output file (for agents that write responses to disk)
-
-**Codex CLI targets:**
+For complete templates, examples, and evaluator patterns, see: [custom-evaluators.md](.claude/skills/agentv-eval-builder/references/custom-evaluators.md)

-
-  - name: codex_cli
-    provider: codex
-    judge_target: azure_base
-    executable: ${{ CODEX_CLI_PATH }} # defaults to `codex` if omitted
-    args: # optional CLI arguments
-      - --profile
-      - ${{ CODEX_PROFILE }}
-      - --model
-      - ${{ CODEX_MODEL }}
-    timeout_seconds: 180
-    cwd: ${{ CODEX_WORKSPACE_DIR }}
-    log_format: json # 'summary' or 'json'
-```
+### Compare Evaluation Results

-
-Confirm the CLI works by running `codex exec --json --profile <name> "ping"` (or any supported dry run) before starting an eval. This prints JSONL events; seeing `item.completed` messages indicates the CLI is healthy.
+Run two evaluations and compare them:

-
-
-
--
-
-    judge_target: gemini_base
-    executable: ${{ PI_CLI_PATH }} # Optional: defaults to `pi` if omitted
-    pi_provider: google # google, anthropic, openai, groq, xai, openrouter
-    model: ${{ GEMINI_MODEL_NAME }}
-    api_key: ${{ GOOGLE_GENERATIVE_AI_API_KEY }}
-    tools: read,bash,edit,write # Available tools for the agent
-    timeout_seconds: 180
-    cwd: ${{ PI_WORKSPACE_DIR }} # Optional: run in specific directory
-    log_format: json # 'summary' (default) or 'json' for full logs
-    # system_prompt: optional override for the default system prompt
+```bash
+agentv eval evals/my-eval.yaml --out before.jsonl
+# ... make changes to your agent ...
+agentv eval evals/my-eval.yaml --out after.jsonl
+agentv compare before.jsonl after.jsonl --threshold 0.1
 ```

-
+Output shows wins, losses, ties, and mean delta to identify improvements.

-
+## Targets Configuration

-
+Define execution targets in `.agentv/targets.yaml` to decouple evals from providers:

-
-
-
+```yaml
+targets:
+  - name: azure_base
+    provider: azure
+    endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
+    api_key: ${{ AZURE_OPENAI_API_KEY }}
+    model: ${{ AZURE_DEPLOYMENT_NAME }}

-
-
-{
-
-  "expected_outcome": "expected outcome description",
-  "reference_answer": "gold standard answer (optional)",
-  "candidate_answer": "generated code/text from the agent",
-  "guideline_files": ["path/to/guideline1.md", "path/to/guideline2.md"],
-  "input_files": ["path/to/data.json", "path/to/config.yaml"],
-  "input_messages": [{"role": "user", "content": "..."}]
-}
-```
+  - name: vscode_dev
+    provider: vscode
+    workspace_template: ${{ WORKSPACE_PATH }}
+    judge_target: azure_base

-
-
-{
-
-  "hits": ["list of successful checks"],
-  "misses": ["list of failed checks"],
-  "reasoning": "explanation of the score"
-}
+  - name: local_agent
+    provider: cli
+    command_template: 'python agent.py --prompt {PROMPT}'
+    judge_target: azure_base
 ```

-
-- Evaluators receive **full context** but should select only relevant fields
-- Most evaluators only need `candidate_answer` field - ignore the rest to avoid false positives
-- Complex evaluators can use `question`, `reference_answer`, or `guideline_paths` for context-aware validation
-- Score range: `0.0` to `1.0` (float)
-- `hits` and `misses` are optional but recommended for debugging
-
-### Code Evaluator Templates
-
-Custom evaluators can be written in any language. For complete templates and examples:
+Supports: `azure`, `anthropic`, `gemini`, `codex`, `pi-coding-agent`, `claude-code`, `vscode`, `vscode-insiders`, `cli`, and `mock`.

-
-- **TypeScript template (with SDK)**: See `apps/cli/src/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md`
-- **Working examples**: See [examples/features/code-judge-sdk](examples/features/code-judge-sdk)
+Use `${{ VARIABLE_NAME }}` syntax to reference your `.env` file. See `.agentv/targets.yaml` after `agentv init` for detailed examples and all provider-specific fields.

-
+## Evaluation Features

-
-# Judge Name
+### Code Judges

-
+Write validators in any language (Python, TypeScript, Node, etc.):

-
-
-0
-...
-
-## Output Format
-{
-  "score": 0.85,
-  "passed": true,
-  "reasoning": "..."
-}
+```bash
+# Input: stdin JSON with question, expected_outcome, candidate_answer
+# Output: stdout JSON with score (0-1), hits, misses, reasoning
 ```
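
As a sketch of that stdin/stdout contract, using the field names listed above and the earlier Python judge (all values are illustrative only), a validator might read:

```json
{"question": "What is 15 + 27?", "expected_outcome": "Correctly calculates 15 + 27 = 42", "candidate_answer": "The answer is 42."}
```

and print:

```json
{"score": 1.0, "hits": ["Answer contains correct value (42)"], "misses": [], "reasoning": "Passed 1 check(s)"}
```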

-
-
-
+For complete examples and patterns, see:
+- [custom-evaluators skill](.claude/skills/agentv-eval-builder/references/custom-evaluators.md)
+- [code-judge-sdk example](examples/features/code-judge-sdk)

-###
+### LLM Judges

-
+Create markdown judge files with evaluation criteria and scoring guidelines:

 ```yaml
-
-
-
-
-
-- States time complexity correctly
+execution:
+  evaluators:
+    - name: semantic_check
+      type: llm_judge
+      prompt: ./judges/correctness.md
 ```

-
+Your judge prompt file defines criteria and scoring guidelines.
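
A minimal sketch of such a judge file, assuming the title / criteria / output-format layout suggested by the fragments removed above (the headings and criteria here are placeholders to adapt to your eval):

```markdown
# Correctness Judge

## Criteria
- Answer addresses the question directly
- States time complexity correctly

## Output Format
Return JSON: {"score": 0.0-1.0, "passed": true|false, "reasoning": "..."}
```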
+
+### Rubric-Based Evaluation
+
+Define structured criteria directly in your eval case:

 ```yaml
-
-  - id:
-
-
-
-
-
-
-
+evalcases:
+  - id: quicksort-explain
+    expected_outcome: Explain how quicksort works
+
+    input_messages:
+      - role: user
+        content: Explain quicksort algorithm
+
+    rubrics:
+      - Mentions divide-and-conquer approach
+      - Explains partition step
+      - States time complexity
 ```

-
-
-Automatically generate rubrics from `expected_outcome` fields:
+Scoring: `(satisfied weights) / (total weights)` → verdicts: `pass` (≥0.8), `borderline` (≥0.6), `fail`
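
For example, with three equally weighted rubrics of which two are satisfied, the score is 2/3 ≈ 0.67, a `borderline` verdict under these thresholds (assuming any rubrics marked as required are also met).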

+Auto-generate rubrics from expected outcomes:
 ```bash
-# Generate rubrics for all eval cases without rubrics
 agentv generate rubrics evals/my-eval.yaml
-
-# Use a specific LLM target for generation
-agentv generate rubrics evals/my-eval.yaml --target openai:gpt-4o
 ```

-
-
-- **Score**: (sum of satisfied weights) / (total weights)
-- **Verdicts**:
-  - `pass`: Score ≥ 0.8 and all required rubrics met
-  - `borderline`: Score ≥ 0.6 and all required rubrics met
-  - `fail`: Score < 0.6 or any required rubric failed
-
-For complete examples and detailed patterns, see [examples/features/rubric/](examples/features/rubric/).
+See [rubric-evaluator skill](.claude/skills/agentv-eval-builder/references/rubric-evaluator.md) for detailed patterns.

 ## Advanced Configuration

-### Retry
-
-AgentV supports automatic retry with exponential backoff for handling rate limiting (HTTP 429) and transient errors. All retry configuration fields are optional and work with Azure, Anthropic, and Gemini providers.
-
-**Available retry fields:**
+### Retry Behavior

-
-|-------|------|---------|-------------|
-| `max_retries` | number | 3 | Maximum number of retry attempts |
-| `retry_initial_delay_ms` | number | 1000 | Initial delay in milliseconds before first retry |
-| `retry_max_delay_ms` | number | 60000 | Maximum delay cap in milliseconds |
-| `retry_backoff_factor` | number | 2 | Exponential backoff multiplier |
-| `retry_status_codes` | number[] | [500, 408, 429, 502, 503, 504] | HTTP status codes to retry |
-
-**Example configuration:**
+Configure automatic retry with exponential backoff:

 ```yaml
 targets:
   - name: azure_base
     provider: azure
-
-
-
-
-
-    retry_initial_delay_ms: 2000 # Initial delay before first retry
-    retry_max_delay_ms: 120000 # Maximum delay cap
-    retry_backoff_factor: 2 # Exponential backoff multiplier
-    retry_status_codes: [500, 408, 429, 502, 503, 504] # HTTP status codes to retry
+    max_retries: 5
+    retry_initial_delay_ms: 2000
+    retry_max_delay_ms: 120000
+    retry_backoff_factor: 2
+    retry_status_codes: [500, 408, 429, 502, 503, 504]
 ```

-
-
-
-
-
+Automatically retries on rate limits, transient 5xx errors, and network failures with jitter.
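
With the values above, and assuming each retry delay is the previous one multiplied by `retry_backoff_factor` and capped at `retry_max_delay_ms`, the five retries would wait roughly 2s, 4s, 8s, 16s, and 32s before jitter is applied.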
+
+## Documentation & Learning
+
+**Getting Started:**
+- Run `agentv init` to set up your first evaluation workspace
+- Check [examples/README.md](examples/README.md) for demos (math, code generation, tool use)
+- AI agents: Ask Claude Code to `/agentv-eval-builder` to create and iterate on evals
+
+**Detailed Guides:**
+- [Evaluation format and structure](.claude/skills/agentv-eval-builder/SKILL.md)
+- [Custom evaluators](.claude/skills/agentv-eval-builder/references/custom-evaluators.md)
+- [Structured data evaluation](.claude/skills/agentv-eval-builder/references/structured-data-evaluators.md)
+
+**Reference:**
+- Monorepo structure: `packages/core/` (engine), `packages/eval/` (evaluation logic), `apps/cli/` (commands)

-##
+## Contributing

-
-- [ai-sdk](https://github.com/vercel/ai) - Vercel AI SDK
-- [Agentic Context Engineering (ACE)](https://github.com/ax-llm/ax/blob/main/docs/ACE.md)
+See [AGENTS.md](AGENTS.md) for development guidelines, design principles, and quality assurance workflow.

 ## License
