agentv 3.14.6 → 4.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +59 -533
- package/dist/{chunk-CQRWNXVG.js → chunk-2W5JKKXC.js} +537 -727
- package/dist/chunk-2W5JKKXC.js.map +1 -0
- package/dist/{chunk-Y25VL7PX.js → chunk-4Z326WWF.js} +40 -17
- package/dist/chunk-4Z326WWF.js.map +1 -0
- package/dist/{chunk-ELQEFMGO.js → chunk-XEAW7OQT.js} +594 -296
- package/dist/chunk-XEAW7OQT.js.map +1 -0
- package/dist/cli.js +3 -3
- package/dist/{dist-5EEXTTC3.js → dist-2JUUJ6PT.js} +18 -2
- package/dist/index.js +3 -3
- package/dist/{interactive-5ESM5DWV.js → interactive-7ZYS6IOC.js} +4 -11
- package/dist/interactive-7ZYS6IOC.js.map +1 -0
- package/dist/studio/assets/index-CDGReinH.js +71 -0
- package/dist/studio/assets/index-DofvSOmX.js +11 -0
- package/dist/studio/assets/index-izxfmBKC.css +1 -0
- package/dist/studio/index.html +13 -0
- package/package.json +1 -1
- package/dist/chunk-CQRWNXVG.js.map +0 -1
- package/dist/chunk-ELQEFMGO.js.map +0 -1
- package/dist/chunk-Y25VL7PX.js.map +0 -1
- package/dist/interactive-5ESM5DWV.js.map +0 -1
- package/dist/{dist-5EEXTTC3.js.map → dist-2JUUJ6PT.js.map} +0 -0
package/README.md
CHANGED
@@ -1,314 +1,90 @@
 # AgentV
 
-**
+**Evaluate AI agents from the terminal. No server. No signup.**
 
-AgentV evaluates your agents locally with multi-objective scoring (correctness, latency, cost, safety) from YAML specifications. Deterministic code graders + customizable LLM graders, all version-controlled in Git.
-
-## Installation
-
-### All Agents Plugin Manager
-
-**1. Add AgentV marketplace source:**
-```bash
-npx allagents plugin marketplace add EntityProcess/agentv
-```
-
-**2. Ask Claude to set up AgentV in your current repository**
-Example prompt:
-```text
-Set up AgentV in this repo.
-```
-
-The `agentv-onboarding` skill bootstraps setup automatically:
-- verifies `agentv` CLI availability
-- installs the CLI if needed
-- runs `agentv init`
-- verifies setup artifacts
-
-### CLI-Only Setup (Fallback)
-
-If you are not using Claude plugins, use the CLI directly.
-
-**1. Install:**
-```bash
-bun install -g agentv
-```
-
-Or with npm:
 ```bash
 npm install -g agentv
-```
-
-**2. Initialize your workspace:**
-```bash
 agentv init
+agentv eval evals/example.yaml
 ```
 
-
-- The init command creates a `.env.example` file in your project root
-- Copy `.env.example` to `.env` and fill in your API keys, endpoints, and other configuration values
-- Update the environment variable names in `.agentv/targets.yaml` to match those defined in your `.env` file
+That's it. Results in seconds, not minutes.
 
-
-```yaml
-description: Math problem solving evaluation
-execution:
-  target: default
+## What it does
 
+AgentV runs evaluation cases against your AI agents and scores them with deterministic code graders + customizable LLM graders. Everything lives in Git — YAML eval files, markdown judge prompts, JSONL results.
+
+```yaml
+# evals/math.yaml
+description: Math problem solving
 tests:
   - id: addition
-    criteria: Correctly calculates 15 + 27 = 42
-
     input: What is 15 + 27?
-
     expected_output: "42"
-
     assertions:
-      -
-
-        command: ./validators/check_math.py
+      - type: contains
+        value: "42"
 ```
 
-**5. Run the eval:**
 ```bash
-agentv eval
+agentv eval evals/math.yaml
 ```
 
-Results appear in `.agentv/results/eval_<timestamp>.jsonl` with scores, reasoning, and execution traces.
-
-Learn more in the [examples/](examples/README.md) directory. For a detailed comparison with other frameworks, see [docs/COMPARISON.md](docs/COMPARISON.md).
-
 ## Why AgentV?
 
-
-
-
-
-
-| **CLI-first** | ✓ | ✗ | Limited | Limited |
-| **CI/CD ready** | ✓ | Requires API calls | Requires API calls | Requires API calls |
-| **Version control** | ✓ (YAML in Git) | ✗ | ✗ | ✗ |
-| **Evaluators** | Code + LLM + Custom | LLM only | LLM + Code | LLM only |
-
-**Best for:** Developers who want evaluation in their workflow, not a separate dashboard. Teams prioritizing privacy and reproducibility.
-
-## Features
+- **Local-first** — runs on your machine, no cloud accounts or API keys for eval infrastructure
+- **Version-controlled** — evals, judges, and results all live in Git
+- **Hybrid graders** — deterministic code checks + LLM-based subjective scoring
+- **CI/CD native** — exit codes, JSONL output, threshold flags for pipeline gating
+- **Any agent** — supports Claude, Codex, Copilot, VS Code, Pi, Azure OpenAI, or any CLI agent
 
-
-- **Multiple evaluator types**: Code validators, LLM graders, custom Python/TypeScript
-- **Built-in targets**: VS Code Copilot, Codex CLI, Pi Coding Agent, Azure OpenAI, local CLI agents
-- **Structured evaluation**: Rubric-based grading with weights and requirements
-- **Batch evaluation**: Run hundreds of test cases in parallel
-- **Export**: JSON, JSONL, YAML formats
-- **Compare results**: Compute deltas between evaluation runs for A/B testing
-
-## Development
-
-Contributing to AgentV? Clone and set up the repository:
+## Quick start
 
+**1. Install and initialize:**
 ```bash
-
-
-
-# Install Bun if you don't have it
-curl -fsSL https://bun.sh/install | bash
-
-# Install dependencies and build
-bun install && bun run build
-
-# Run tests
-bun test
-```
-
-See [AGENTS.md](AGENTS.md) for development guidelines and design principles.
-
-### Releasing
-
-Version bump:
-
-```bash
-bun run release        # patch bump
-bun run release minor
-bun run release major
-```
-
-Canary rollout (recommended):
-
-```bash
-bun run publish:next           # publish current version to npm `next`
-bun run promote:latest         # promote same version to npm `latest`
-bun run tag:next 2.18.0        # point npm `next` to an explicit version
-bun run promote:latest 2.18.0  # point npm `latest` to an explicit version
-```
-
-Legacy prerelease flow (still available):
-
-```bash
-bun run release:next        # bump/increment `-next.N`
-bun run release:next major  # start new major prerelease line
+npm install -g agentv
+agentv init
 ```
 
-
-
-**Evaluation files** (`.yaml` or `.jsonl`) define test cases with expected outcomes. **Targets** specify which agent/provider to evaluate. **Graders** (code or LLM) score results. **Results** are written as JSONL/YAML for analysis and comparison.
+**2. Configure targets** in `.agentv/targets.yaml` — point to your agent or LLM provider.
 
-
-
-For large-scale evaluations, AgentV supports JSONL (JSON Lines) format as an alternative to YAML:
-
-```jsonl
-{"id": "test-1", "criteria": "Calculates correctly", "input": "What is 2+2?"}
-{"id": "test-2", "criteria": "Provides explanation", "input": "Explain variables"}
-```
-
-Optional sidecar YAML metadata file (`dataset.eval.yaml` alongside `dataset.jsonl`):
+**3. Create an eval** in `evals/`:
 ```yaml
-description:
-
-
-
-
-
-
-
+description: Code generation quality
+tests:
+  - id: fizzbuzz
+    criteria: Write a correct FizzBuzz implementation
+    input: Write FizzBuzz in Python
+    assertions:
+      - type: contains
+        value: "fizz"
+      - type: code-grader
+        command: ./validators/check_syntax.py
+      - type: llm-grader
+        prompt: ./graders/correctness.md
 ```
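The eval added above references a code grader at `./validators/check_syntax.py` without showing it. A minimal sketch, assuming the grader protocol this README describes elsewhere (JSON on stdin with `question`/`criteria`/`answer`, JSON with a 0-1 `score` on stdout), might look like this; the file name and the syntax-only check are hypothetical:

```python
# Hypothetical sketch of a ./validators/check_syntax.py code grader.
# Protocol (per this README): JSON in on stdin, JSON with a 0-1 score out on stdout.
import ast
import json

def grade(payload: dict) -> dict:
    """Score 1.0 if the answer parses as Python source, else 0.0."""
    answer = payload.get("answer", "")
    try:
        ast.parse(answer)
        ok = True
    except SyntaxError:
        ok = False
    return {
        "score": 1.0 if ok else 0.0,
        "assertions": [{"text": "Answer is syntactically valid Python", "passed": ok}],
    }

# A real validator script would wire this up as:
#   print(json.dumps(grade(json.load(sys.stdin))))
print(json.dumps(grade({"answer": "print('fizz')"})))
```

Because the check is pure stdin/stdout JSON, the same validator works unchanged from any target or CI environment.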
 
-
-
-## Usage
-
-### Running Evaluations
-
+**4. Run it:**
 ```bash
-# Validate evals
-agentv validate evals/my-eval.yaml
-
-# Run an eval with default target (from eval file or targets.yaml)
 agentv eval evals/my-eval.yaml
-
-# Override target
-agentv eval --target azure-llm evals/**/*.yaml
-
-# Run specific test
-agentv eval --test-id case-123 evals/my-eval.yaml
-
-# Dry-run with mock provider
-agentv eval --dry-run evals/my-eval.yaml
 ```
 
-
-
-#### Output Formats
-
-Write results to different formats using the `-o` flag (format auto-detected from extension):
-
+**5. Compare results across targets:**
 ```bash
-
-agentv eval evals/my-eval.yaml
-
-# Self-contained HTML dashboard (opens in any browser, no server needed)
-agentv eval evals/my-eval.yaml -o report.html
-
-# Explicit JSONL output
-agentv eval evals/my-eval.yaml -o output.jsonl
-
-# Multiple formats simultaneously
-agentv eval evals/my-eval.yaml -o report.html
-
-# JUnit XML for CI/CD integration
-agentv eval evals/my-eval.yaml -o results.xml
+agentv compare .agentv/results/runs/eval_<timestamp>/index.jsonl
 ```
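The old README's compare section prints a pairwise summary per target pair (wins, losses, ties, mean score delta). The arithmetic behind such a summary can be sketched as below; the tie tolerance of 0.1 is an assumption chosen so the sample numbers reproduce the old README's printed output, not AgentV's documented rule:

```python
# Illustrative arithmetic behind a pairwise summary like the one
# `agentv compare` prints. The 0.1 tie tolerance is an assumption.
baseline  = {"code-generation": 0.80, "greeting": 0.85, "summarization": 0.90}  # gpt-4.1
candidate = {"code-generation": 0.75, "greeting": 0.95, "summarization": 0.80}  # gpt-5-mini
TIE_TOLERANCE = 0.1

wins = losses = ties = 0
deltas = []
for test_id, base_score in baseline.items():
    delta = candidate[test_id] - base_score
    deltas.append(delta)
    if abs(delta) <= TIE_TOLERANCE:
        ties += 1
    elif delta > 0:
        wins += 1
    else:
        losses += 1

mean_delta = sum(deltas) / len(deltas)
print(f"{wins} wins, {losses} losses, {ties} ties (Δ {mean_delta:+.3f})")
```

With these per-test scores the script prints `0 wins, 0 losses, 3 ties (Δ -0.017)`, matching the sample pairwise summary for that target pair.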
 
-
-
-By default, `agentv eval` creates a run workspace under `.agentv/results/runs/<run>/`
-with `index.jsonl` as the machine-facing manifest.
-
-You can also convert an existing manifest to HTML after the fact:
+## Output formats
 
 ```bash
-agentv
+agentv eval evals/my-eval.yaml                 # JSONL (default)
+agentv eval evals/my-eval.yaml -o report.html  # HTML dashboard
+agentv eval evals/my-eval.yaml -o results.xml  # JUnit XML for CI
 ```
 
-
-
-AgentV does not apply a default top-level evaluation timeout. If you want one, set it explicitly
-with `--agent-timeout`, or set `execution.agentTimeoutMs` in your AgentV config to make it the
-default for your local runs.
-
-This top-level timeout is separate from provider- or tool-level timeouts. For example, an upstream
-agent or tool call may still time out even when AgentV's own top-level timeout is unset.
-
-### Create Custom Evaluators
-
-Write code graders in Python or TypeScript:
-
-```python
-# validators/check_answer.py
-import json, sys
-data = json.load(sys.stdin)
-answer = data.get("answer", "")
-
-assertions = []
-
-if "42" in answer:
-    assertions.append({"text": "Answer contains correct value (42)", "passed": True})
-else:
-    assertions.append({"text": "Answer does not contain expected value (42)", "passed": False})
-
-passed = sum(1 for a in assertions if a["passed"])
-score = 1.0 if passed == len(assertions) else 0.0
-
-print(json.dumps({
-    "score": score,
-    "assertions": assertions,
-}))
-```
-
-Reference evaluators in your eval file:
-
-```yaml
-assertions:
-  - name: my_validator
-    type: code-grader
-    command: ./validators/check_answer.py
-```
-
-For complete templates, examples, and evaluator patterns, see: [custom-evaluators](https://agentv.dev/evaluators/custom-evaluators/)
-
-### TypeScript SDK
-
-#### Custom Assertions with `defineAssertion()`
-
-Create custom assertion types in TypeScript using `@agentv/eval`:
-
-```typescript
-// .agentv/assertions/word-count.ts
-import { defineAssertion } from '@agentv/eval';
-
-export default defineAssertion(({ answer }) => {
-  const wordCount = answer.trim().split(/\s+/).length;
-  return {
-    pass: wordCount >= 3,
-    reasoning: `Output has ${wordCount} words`,
-  };
-});
-```
-
-Files in `.agentv/assertions/` are auto-discovered by filename — use directly in YAML:
-
-```yaml
-assertions:
-  - type: word-count # matches word-count.ts
-  - type: contains
-    value: "Hello"
-```
-
-See the [sdk-custom-assertion example](examples/features/sdk-custom-assertion).
-
-#### Programmatic API with `evaluate()`
+## TypeScript SDK
 
-Use AgentV
+Use AgentV programmatically:
 
 ```typescript
 import { evaluate } from '@agentv/core';
@@ -326,278 +102,28 @@ const { results, summary } = await evaluate({
 console.log(`${summary.passed}/${summary.total} passed`);
 ```
 
-
+## Documentation
 
-
+Full docs at [agentv.dev/docs](https://agentv.dev/docs/getting-started/introduction/).
 
-
+- [Eval files](https://agentv.dev/docs/evaluation/eval-files/) — format and structure
+- [Custom evaluators](https://agentv.dev/docs/evaluators/custom-evaluators/) — code graders in any language
+- [Rubrics](https://agentv.dev/docs/evaluation/rubrics/) — structured criteria scoring
+- [Targets](https://agentv.dev/docs/targets/configuration/) — configure agents and providers
+- [Compare results](https://agentv.dev/docs/tools/compare/) — A/B testing and regression detection
+- [Comparison with other frameworks](https://agentv.dev/docs/reference/comparison/) — vs Braintrust, Langfuse, LangSmith, LangWatch
 
-
-import { defineConfig } from '@agentv/core';
-
-export default defineConfig({
-  execution: { workers: 5, maxRetries: 2 },
-  output: { format: 'jsonl', dir: './results' },
-  limits: { maxCostUsd: 10.0 },
-});
-```
-
-See the [sdk-config-file example](examples/features/sdk-config-file).
-
-#### Scaffold Commands
-
-Bootstrap new assertions and eval files:
-
-```bash
-agentv create assertion sentiment  # → .agentv/assertions/sentiment.ts
-agentv create eval my-eval         # → evals/my-eval.eval.yaml + .cases.jsonl
-```
-
-### Compare Evaluation Results
-
-Compare a combined results file across all targets (N-way matrix):
-
-```bash
-agentv compare .agentv/results/runs/eval_<timestamp>/index.jsonl
-```
-
-```
-Score Matrix
-
-Test ID          gemini-3-flash-preview  gpt-4.1  gpt-5-mini
-───────────────  ──────────────────────  ───────  ──────────
-code-generation  0.70                    0.80     0.75
-greeting         0.90                    0.85     0.95
-summarization    0.85                    0.90     0.80
-
-Pairwise Summary:
-  gemini-3-flash-preview → gpt-4.1: 1 win, 0 losses, 2 ties (Δ +0.033)
-  gemini-3-flash-preview → gpt-5-mini: 0 wins, 0 losses, 3 ties (Δ +0.017)
-  gpt-4.1 → gpt-5-mini: 0 wins, 0 losses, 3 ties (Δ -0.017)
-```
-
-Designate a baseline for CI regression gating, or compare two specific targets:
-
-```bash
-agentv compare .agentv/results/runs/eval_<timestamp>/index.jsonl --baseline gpt-4.1
-agentv compare .agentv/results/runs/eval_<timestamp>/index.jsonl --baseline gpt-4.1 --candidate gpt-5-mini
-agentv compare before.jsonl after.jsonl  # two-file pairwise
-```
-
-## Targets Configuration
-
-Define execution targets in `.agentv/targets.yaml` to decouple evals from providers:
-
-```yaml
-targets:
-  - name: azure-llm
-    provider: azure
-    endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
-    api_key: ${{ AZURE_OPENAI_API_KEY }}
-    model: ${{ AZURE_DEPLOYMENT_NAME }}
-
-  - name: vscode_dev
-    provider: vscode
-    grader_target: azure-llm
-
-  - name: local_agent
-    provider: cli
-    command: 'python agent.py --prompt-file {PROMPT_FILE} --output {OUTPUT_FILE}'
-    grader_target: azure-llm
-```
-
-Supports: `azure`, `anthropic`, `gemini`, `codex`, `copilot`, `pi-coding-agent`, `claude`, `vscode`, `vscode-insiders`, `cli`, and `mock`.
-
-Workspace templates are configured at eval-level under `workspace.template` (not per-target `workspace_template`).
-
-Use `${{ VARIABLE_NAME }}` syntax to reference your `.env` file. See `.agentv/targets.yaml` after `agentv init` for detailed examples and all provider-specific fields.
-
-## Evaluation Features
-
-### Code Graders
-
-Write validators in any language (Python, TypeScript, Node, etc.):
-
-```bash
-# Input: stdin JSON with question, criteria, answer
-# Output: stdout JSON with score (0-1), hits, misses, reasoning
-```
-
-For complete examples and patterns, see:
-- [custom-evaluators](https://agentv.dev/evaluators/custom-evaluators/)
-- [code-grader-sdk example](examples/features/code-grader-sdk)
-
-### Deterministic Assertions
-
-Built-in assertion types for common text-matching patterns — no LLM grader or code_grader needed:
-
-| Type | Value | Behavior |
-|------|-------|----------|
-| `contains` | `string` | Pass if output includes the substring |
-| `contains_any` | `string[]` | Pass if output includes ANY of the strings |
-| `contains_all` | `string[]` | Pass if output includes ALL of the strings |
-| `icontains` | `string` | Case-insensitive `contains` |
-| `icontains_any` | `string[]` | Case-insensitive `contains_any` |
-| `icontains_all` | `string[]` | Case-insensitive `contains_all` |
-| `starts_with` | `string` | Pass if output starts with value (trimmed) |
-| `ends_with` | `string` | Pass if output ends with value (trimmed) |
-| `regex` | `string` | Pass if output matches regex (optional `flags: "i"`) |
-| `equals` | `string` | Pass if output exactly equals value (trimmed) |
-| `is_json` | — | Pass if output is valid JSON |
-
-All assertions support `weight`, `required`, and `negate` flags. Use `negate: true` to invert (no `not_` prefix needed).
-
-```yaml
-assertions:
-  # Case-insensitive matching for natural language variation
-  - type: icontains-any
-    value: ["missing rule code", "need rule code", "provide rule code"]
-    required: true
-
-  # Multiple required terms
-  - type: icontains-all
-    value: ["country code", "rule codes"]
-
-  # Case-insensitive regex
-  - type: regex
-    value: "[a-z]+@[a-z]+\\.[a-z]+"
-    flags: "i"
-```
-
-See the [assert-extended example](examples/features/assert-extended) for complete patterns.
-
-### Target Configuration: `grader_target`
-
-Agent provider targets (`codex`, `copilot`, `claude`, `vscode`) **must** specify `grader_target` (also accepts `judge_target` for backward compatibility) when using `llm_grader` or `rubrics` evaluators. Without it, AgentV errors at startup — agent providers cannot return structured JSON for grading.
-
-```yaml
-targets:
-  # Agent target — requires grader_target for LLM-based evaluation
-  - name: codex_local
-    provider: codex
-    grader_target: azure-llm  # Required: LLM provider for grading
-
-  # LLM target — no grader_target needed (grades itself)
-  - name: azure-llm
-    provider: azure
-```
-
-### Agentic Eval Patterns
-
-When agents respond via tool calls instead of text, use `tool_trajectory` instead of text assertions:
-
-- **Agent takes workspace actions** (creates files, runs commands) → `tool_trajectory` evaluator
-- **Agent responds in text** (answers questions, asks for info) → `contains`/`icontains_any`/`llm_grader`
-- **Agent does both** → `composite` evaluator combining both
-
-### LLM Graders
-
-Create markdown grader files with evaluation criteria and scoring guidelines:
-
-```yaml
-assertions:
-  - name: semantic_check
-    type: llm-grader
-    prompt: ./graders/correctness.md
-```
-
-Your grader prompt file defines criteria and scoring guidelines.
-
-### Rubric-Based Evaluation
-
-Define structured criteria directly in your test:
-
-```yaml
-tests:
-  - id: quicksort-explain
-    criteria: Explain how quicksort works
-
-    input: Explain quicksort algorithm
-
-    assertions:
-      - type: rubrics
-        criteria:
-          - Mentions divide-and-conquer approach
-          - Explains partition step
-          - States time complexity
-```
-
-Scoring: `(satisfied weights) / (total weights)` → verdicts: `pass` (≥0.8), `borderline` (≥0.6), `fail`
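The scoring rule on the removed line above is simple enough to state in a few lines of code. This is a sketch of that stated formula (weighted ratio plus the ≥0.8 / ≥0.6 verdict cutoffs), not AgentV's actual implementation:

```python
# Sketch of the rubric scoring rule: score = satisfied weights / total weights,
# mapped to pass (>= 0.8), borderline (>= 0.6), or fail verdicts.
def rubric_verdict(criteria: list[tuple[float, bool]]) -> tuple[float, str]:
    """criteria: (weight, satisfied) pairs; all weights must be > 0."""
    total = sum(w for w, _ in criteria)
    satisfied = sum(w for w, ok in criteria if ok)
    score = satisfied / total
    if score >= 0.8:
        verdict = "pass"
    elif score >= 0.6:
        verdict = "borderline"
    else:
        verdict = "fail"
    return score, verdict

# Three equally weighted criteria, two satisfied -> score ~0.67, "borderline":
print(rubric_verdict([(1.0, True), (1.0, True), (1.0, False)]))
```

Unequal weights shift the ratio: a single heavily weighted unsatisfied criterion can pull an otherwise good answer below the fail line.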
-
-Author assertions directly in your eval file. When you want help choosing between simple assertions, deterministic graders, and LLM-based graders, use the `agentv-eval-writer` skill.
-
-See [rubric evaluator](https://agentv.dev/evaluation/rubrics/) for detailed patterns.
-
-## Advanced Configuration
-
-### Retry Behavior
-
-Configure automatic retry with exponential backoff:
-
-```yaml
-targets:
-  - name: azure-llm
-    provider: azure
-    max_retries: 5
-    retry_initial_delay_ms: 2000
-    retry_max_delay_ms: 120000
-    retry_backoff_factor: 2
-    retry_status_codes: [500, 408, 429, 502, 503, 504]
-```
-
-Automatically retries on rate limits, transient 5xx errors, and network failures with jitter.
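The delay schedule those removed settings imply can be sketched as below, assuming plain exponential growth capped at the maximum. Jitter is omitted here; per the text above, the real implementation also randomizes each delay:

```python
# Sketch of the exponential-backoff schedule implied by the retry settings
# above (initial delay 2000 ms, factor 2, cap 120000 ms). Jitter omitted.
def backoff_delays(max_retries: int, initial_ms: int, factor: float, max_ms: int) -> list[int]:
    return [min(int(initial_ms * factor**attempt), max_ms) for attempt in range(max_retries)]

print(backoff_delays(5, 2000, 2, 120000))  # [2000, 4000, 8000, 16000, 32000]
```

With five retries the cap never triggers; it only matters for longer retry budgets (the eighth delay would already be clamped to 120000 ms).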
-
-## Documentation & Learning
-
-**Getting Started:**
-- Run `agentv init` to set up your first evaluation workspace
-- Check [examples/README.md](examples/README.md) for demos (math, code generation, tool use)
-- AI agents: Ask Claude Code to `/agentv-eval-builder` to create and iterate on evals
-
-**Detailed Guides:**
-- [Evaluation format and structure](https://agentv.dev/evaluation/eval-files/)
-- [Custom evaluators](https://agentv.dev/evaluators/custom-evaluators/)
-- [Rubric evaluator](https://agentv.dev/evaluation/rubrics/)
-- [Composite evaluator](https://agentv.dev/evaluators/composite/)
-- [Tool trajectory evaluator](https://agentv.dev/evaluators/tool-trajectory/)
-- [Structured data evaluators](https://agentv.dev/evaluators/structured-data/)
-- [Batch CLI evaluation](https://agentv.dev/evaluation/batch-cli/)
-- [Compare results](https://agentv.dev/tools/compare/)
-- [Example evaluations](https://agentv.dev/evaluation/examples/)
-
-**Reference:**
-- Monorepo structure: `packages/core/` (engine), `packages/eval/` (evaluation logic), `apps/cli/` (commands)
-
-## Troubleshooting
-
-### `EACCES` permission error on global install (npm)
-
-If you see `EACCES: permission denied` when running `npm install -g agentv`, switch to bun (recommended) or configure npm to use a user-owned directory:
-
-**Option 1 (recommended): Use bun instead**
-```bash
-bun install -g agentv
-```
-
-**Option 2: Fix npm permissions**
-```bash
-mkdir -p ~/.npm-global
-npm config set prefix ~/.npm-global --location=user
-```
-
-Then add the directory to your PATH. For bash (`~/.bashrc`) or zsh (`~/.zshrc`):
+## Development
 
 ```bash
-
-
+git clone https://github.com/EntityProcess/agentv.git
+cd agentv
+bun install && bun run build
+bun test
 ```
 
-
-
-## Contributing
-
-See [AGENTS.md](AGENTS.md) for development guidelines, design principles, and quality assurance workflow.
+See [AGENTS.md](AGENTS.md) for development guidelines.
 
 ## License
 
-MIT
+MIT