agentv 2.19.0 → 3.0.0-next.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +62 -36
- package/dist/agentv-provider-5CJVBBGG-2XVZBW7L.js +9 -0
- package/dist/{chunk-GC6T3RD4.js → chunk-5WIB7A27.js} +598 -403
- package/dist/chunk-5WIB7A27.js.map +1 -0
- package/dist/chunk-6GSYTMXD.js +31520 -0
- package/dist/chunk-6GSYTMXD.js.map +1 -0
- package/dist/{chunk-4MSAOMCC.js → chunk-DY4ZDTTO.js} +1018 -140
- package/dist/chunk-DY4ZDTTO.js.map +1 -0
- package/dist/chunk-HF4X7ALN.js +24299 -0
- package/dist/chunk-HF4X7ALN.js.map +1 -0
- package/dist/{chunk-FV32QHPB.js → chunk-XOSNETAV.js} +1 -1
- package/dist/cli.js +5 -4
- package/dist/cli.js.map +1 -1
- package/dist/{dist-MQBGD6LP.js → dist-WN2QIOQR.js} +27 -11
- package/dist/{esm-DX3WQKEN.js → esm-CZAWIY6F.js} +2 -2
- package/dist/esm-CZAWIY6F.js.map +1 -0
- package/dist/index.js +5 -4
- package/dist/{interactive-3TDBCSDW.js → interactive-B432TCRZ.js} +5 -4
- package/dist/{interactive-3TDBCSDW.js.map → interactive-B432TCRZ.js.map} +1 -1
- package/dist/{src-2N5EJ2N6.js → src-ML4D2MC2.js} +2 -2
- package/dist/templates/.agentv/targets.yaml +8 -11
- package/package.json +2 -2
- package/dist/chunk-4MSAOMCC.js.map +0 -1
- package/dist/chunk-GC6T3RD4.js.map +0 -1
- package/dist/chunk-XTYMR4I5.js +0 -49811
- package/dist/chunk-XTYMR4I5.js.map +0 -1
- package/dist/{dist-MQBGD6LP.js.map → agentv-provider-5CJVBBGG-2XVZBW7L.js.map} +0 -0
- package/dist/{chunk-FV32QHPB.js.map → chunk-XOSNETAV.js.map} +0 -0
- package/dist/{esm-DX3WQKEN.js.map → dist-WN2QIOQR.js.map} +0 -0
- package/dist/{src-2N5EJ2N6.js.map → src-ML4D2MC2.js.map} +0 -0
package/README.md
CHANGED

@@ -2,7 +2,7 @@
 
 **CLI-first AI agent evaluation. No server. No signup. No overhead.**
 
-AgentV evaluates your agents locally with multi-objective scoring (correctness, latency, cost, safety) from YAML specifications. Deterministic code
+AgentV evaluates your agents locally with multi-objective scoring (correctness, latency, cost, safety) from YAML specifications. Deterministic code graders + customizable LLM graders, all version-controlled in Git.
 
 ## Installation
 

@@ -58,9 +58,9 @@ tests:
 
 expected_output: "42"
 
-
+assertions:
 - name: math_check
-type: code-
+type: code-grader
 command: ./validators/check_math.py
 ```
 
@@ -90,7 +90,7 @@ Learn more in the [examples/](examples/README.md) directory. For a detailed comp
 ## Features
 
 - **Multi-objective scoring**: Correctness, latency, cost, safety in one run
-- **Multiple evaluator types**: Code validators, LLM
+- **Multiple evaluator types**: Code validators, LLM graders, custom Python/TypeScript
 - **Built-in targets**: VS Code Copilot, Codex CLI, Pi Coding Agent, Azure OpenAI, local CLI agents
 - **Structured evaluation**: Rubric-based grading with weights and requirements
 - **Batch evaluation**: Run hundreds of test cases in parallel
@@ -145,7 +145,7 @@ bun run release:next major # start new major prerelease line
 
 ## Core Concepts
 
-**Evaluation files** (`.yaml` or `.jsonl`) define test cases with expected outcomes. **Targets** specify which agent/provider to evaluate. **
+**Evaluation files** (`.yaml` or `.jsonl`) define test cases with expected outcomes. **Targets** specify which agent/provider to evaluate. **Graders** (code or LLM) score results. **Results** are written as JSONL/YAML for analysis and comparison.
 
 ### JSONL Format Support
 
@@ -161,11 +161,11 @@ Optional sidecar YAML metadata file (`dataset.eval.yaml` alongside `dataset.json
 description: Math evaluation dataset
 dataset: math-tests
 execution:
-target: azure-
-
+target: azure-llm
+assertions:
 - name: correctness
-type: llm-
-prompt: ./
+type: llm-grader
+prompt: ./graders/correctness.md
 ```
 
 Benefits: Streaming-friendly, Git-friendly diffs, programmatic generation, industry standard (DeepEval, LangWatch, Hugging Face).
@@ -182,7 +182,7 @@ agentv validate evals/my-eval.yaml
 agentv eval evals/my-eval.yaml
 
 # Override target
-agentv eval --target azure-
+agentv eval --target azure-llm evals/**/*.yaml
 
 # Run specific test
 agentv eval --test-id case-123 evals/my-eval.yaml
@@ -193,6 +193,32 @@ agentv eval --dry-run evals/my-eval.yaml
 
 See `agentv eval --help` for all options: workers, timeouts, output formats, trace dumping, and more.
 
+#### Output Formats
+
+Write results to different formats using the `-o` flag (format auto-detected from extension):
+
+```bash
+# JSONL (default streaming format)
+agentv eval evals/my-eval.yaml -o results.jsonl
+
+# Self-contained HTML dashboard (opens in any browser, no server needed)
+agentv eval evals/my-eval.yaml -o report.html
+
+# Multiple formats simultaneously
+agentv eval evals/my-eval.yaml -o results.jsonl -o report.html
+
+# JUnit XML for CI/CD integration
+agentv eval evals/my-eval.yaml -o results.xml
+```
+
+The HTML report auto-refreshes every 2 seconds during a live run, then locks once the run completes.
+
+You can also convert an existing JSONL results file to HTML after the fact:
+
+```bash
+agentv convert results.jsonl -o report.html
+```
+
 #### Timeouts
 
 AgentV does not apply a default top-level evaluation timeout. If you want one, set it explicitly
@@ -204,7 +230,7 @@ agent or tool call may still time out even when AgentV's own top-level timeout i
 
 ### Create Custom Evaluators
 
-Write code
+Write code graders in Python or TypeScript:
 
 ```python
 # validators/check_answer.py
@@ -233,9 +259,9 @@ print(json.dumps({
 Reference evaluators in your eval file:
 
 ```yaml
-
+assertions:
 - name: my_validator
-type: code-
+type: code-grader
 command: ./validators/check_answer.py
 ```
 
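The hunk above wires `./validators/check_answer.py` into the eval file, and an earlier hunk's context line (`print(json.dumps({`) suggests a code grader simply prints a JSON verdict. A minimal sketch of such a validator, assuming hypothetical `score`/`reason` field names (the real AgentV contract may differ):

```python
import json

def grade(answer: str) -> dict:
    """Toy grading rule: pass when the expected answer appears in the output."""
    passed = "42" in answer
    return {
        "score": 1.0 if passed else 0.0,  # assumed field name, not the documented contract
        "reason": "found expected answer" if passed else "expected answer missing",
    }

# A code grader script would print this JSON for AgentV to consume:
print(json.dumps(grade("The answer is 42")))
```

A validator like this can live in any language, as the README notes; only the printed JSON matters.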
@@ -263,7 +289,7 @@ export default defineAssertion(({ answer }) => {
 Files in `.agentv/assertions/` are auto-discovered by filename — use directly in YAML:
 
 ```yaml
-
+assertions:
 - type: word-count # matches word-count.ts
 - type: contains
 value: "Hello"
@@ -355,7 +381,7 @@ Define execution targets in `.agentv/targets.yaml` to decouple evals from provid
 
 ```yaml
 targets:
-- name: azure-
+- name: azure-llm
 provider: azure
 endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
 api_key: ${{ AZURE_OPENAI_API_KEY }}
@@ -363,12 +389,12 @@ targets:
 
 - name: vscode_dev
 provider: vscode
-
+grader_target: azure-llm
 
 - name: local_agent
 provider: cli
 command: 'python agent.py --prompt-file {PROMPT_FILE} --output {OUTPUT_FILE}'
-
+grader_target: azure-llm
 ```
 
 Supports: `azure`, `anthropic`, `gemini`, `codex`, `copilot`, `pi-coding-agent`, `claude`, `vscode`, `vscode-insiders`, `cli`, and `mock`.
@@ -379,7 +405,7 @@ Use `${{ VARIABLE_NAME }}` syntax to reference your `.env` file. See `.agentv/ta
 
 ## Evaluation Features
 
-### Code
+### Code Graders
 
 Write validators in any language (Python, TypeScript, Node, etc.):
 
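The `${{ VARIABLE_NAME }}` substitution mentioned in the hunk above can be pictured as a regex replacement over target values. This is an illustrative sketch under that assumption, not AgentV's actual resolver:

```python
import re

# Matches ${{ NAME }} placeholders; the exact grammar AgentV accepts is assumed here.
PLACEHOLDER = re.compile(r"\$\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")

def resolve(value: str, env: dict) -> str:
    """Replace ${{ NAME }} with env[NAME], leaving unknown names untouched."""
    return PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(0)), value)

print(resolve("endpoint: ${{ AZURE_OPENAI_ENDPOINT }}",
              {"AZURE_OPENAI_ENDPOINT": "https://example.openai.azure.com"}))
```

Leaving unknown placeholders intact (rather than substituting an empty string) makes missing `.env` entries visible in error messages.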
@@ -390,11 +416,11 @@ Write validators in any language (Python, TypeScript, Node, etc.):
 
 For complete examples and patterns, see:
 - [custom-evaluators](https://agentv.dev/evaluators/custom-evaluators/)
-- [code-
+- [code-grader-sdk example](examples/features/code-grader-sdk)
 
 ### Deterministic Assertions
 
-Built-in assertion types for common text-matching patterns — no LLM
+Built-in assertion types for common text-matching patterns — no LLM grader or code_grader needed:
 
 | Type | Value | Behavior |
 |------|-------|----------|
@@ -413,7 +439,7 @@ Built-in assertion types for common text-matching patterns — no LLM judge or c
 All assertions support `weight`, `required`, and `negate` flags. Use `negate: true` to invert (no `not_` prefix needed).
 
 ```yaml
-
+assertions:
 # Case-insensitive matching for natural language variation
 - type: icontains-any
 value: ["missing rule code", "need rule code", "provide rule code"]
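The assertion table is abbreviated in this diff. As an illustration only (not AgentV's implementation), `icontains-any` plus the `negate` flag described above could behave like:

```python
def icontains_any(output: str, values: list) -> bool:
    """Case-insensitive 'contains any of these substrings' check."""
    lowered = output.lower()
    return any(v.lower() in lowered for v in values)

def apply_negate(result: bool, negate: bool = False) -> bool:
    """`negate: true` inverts the assertion outcome (no `not_` prefix needed)."""
    return (not result) if negate else result

# The agent's reply matches "provide rule code" case-insensitively:
print(apply_negate(icontains_any(
    "Please provide the rule code.",
    ["missing rule code", "need rule code", "provide rule code"],
)))  # True
```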
@@ -431,19 +457,19 @@ assert:
 
 See the [assert-extended example](examples/features/assert-extended) for complete patterns.
 
-### Target Configuration: `
+### Target Configuration: `grader_target`
 
-Agent provider targets (`codex`, `copilot`, `claude`, `vscode`) **must** specify `judge_target` when using `
+Agent provider targets (`codex`, `copilot`, `claude`, `vscode`) **must** specify `grader_target` (also accepts `judge_target` for backward compatibility) when using `llm_grader` or `rubrics` evaluators. Without it, AgentV errors at startup — agent providers cannot return structured JSON for grading.
 
 ```yaml
 targets:
-# Agent target — requires
+# Agent target — requires grader_target for LLM-based evaluation
 - name: codex_local
 provider: codex
-
+grader_target: azure-llm # Required: LLM provider for grading
 
-# LLM target — no
-- name: azure-
+# LLM target — no grader_target needed (grades itself)
+- name: azure-llm
 provider: azure
 ```
 
@@ -452,21 +478,21 @@ targets:
 
 When agents respond via tool calls instead of text, use `tool_trajectory` instead of text assertions:
 
 - **Agent takes workspace actions** (creates files, runs commands) → `tool_trajectory` evaluator
-- **Agent responds in text** (answers questions, asks for info) → `contains`/`icontains_any`/`
+- **Agent responds in text** (answers questions, asks for info) → `contains`/`icontains_any`/`llm_grader`
 - **Agent does both** → `composite` evaluator combining both
 
-### LLM
+### LLM Graders
 
-Create markdown
+Create markdown grader files with evaluation criteria and scoring guidelines:
 
 ```yaml
-
+assertions:
 - name: semantic_check
-type: llm-
-prompt: ./
+type: llm-grader
+prompt: ./graders/correctness.md
 ```
 
-Your
+Your grader prompt file defines criteria and scoring guidelines.
 
 ### Rubric-Based Evaluation
 
@@ -479,7 +505,7 @@ tests:
 
 input: Explain quicksort algorithm
 
-
+assertions:
 - type: rubrics
 criteria:
 - Mentions divide-and-conquer approach
@@ -504,7 +530,7 @@ Configure automatic retry with exponential backoff:
 
 ```yaml
 targets:
-- name: azure-
+- name: azure-llm
 provider: azure
 max_retries: 5
 retry_initial_delay_ms: 2000
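The retry settings in the hunk above imply a delay schedule. Assuming a doubling factor (the multiplier is not shown in this diff), `max_retries: 5` with `retry_initial_delay_ms: 2000` would wait as follows:

```python
def backoff_delays_ms(max_retries: int, initial_delay_ms: int, factor: float = 2.0) -> list:
    """Exponential backoff: the delay before retry i is initial * factor**i."""
    return [int(initial_delay_ms * factor ** i) for i in range(max_retries)]

print(backoff_delays_ms(5, 2000))  # [2000, 4000, 8000, 16000, 32000]
```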
package/dist/agentv-provider-5CJVBBGG-2XVZBW7L.js
ADDED

@@ -0,0 +1,9 @@
+import { createRequire } from 'node:module'; const require = createRequire(import.meta.url);
+import {
+  AgentvProvider
+} from "./chunk-6GSYTMXD.js";
+import "./chunk-5H446C7X.js";
+export {
+  AgentvProvider
+};
+//# sourceMappingURL=agentv-provider-5CJVBBGG-2XVZBW7L.js.map
|