agentv 2.5.4 → 2.5.6
- package/README.md +17 -17
- package/dist/{chunk-I4EMT5Q2.js → chunk-APKXUJF3.js} +1250 -462
- package/dist/chunk-APKXUJF3.js.map +1 -0
- package/dist/{chunk-LTPZBEJU.js → chunk-BKMQNEUD.js} +9 -3
- package/dist/{chunk-LTPZBEJU.js.map → chunk-BKMQNEUD.js.map} +1 -1
- package/dist/{chunk-A7TQUSVG.js → chunk-LJVS3JAK.js} +2 -2
- package/dist/cli.js +2 -2
- package/dist/index.js +2 -2
- package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +123 -244
- package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md +56 -271
- package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md +55 -180
- package/dist/{token-DVVSDOYP.js → token-D3IYDJQZ.js} +3 -3
- package/dist/{token-util-YEKFTEJA.js → token-util-FWFPR2BV.js} +3 -3
- package/package.json +5 -2
- package/dist/chunk-I4EMT5Q2.js.map +0 -1
- package/dist/{chunk-A7TQUSVG.js.map → chunk-LJVS3JAK.js.map} +0 -0
- package/dist/{token-DVVSDOYP.js.map → token-D3IYDJQZ.js.map} +0 -0
- package/dist/{token-util-YEKFTEJA.js.map → token-util-FWFPR2BV.js.map} +0 -0
package/README.md
CHANGED
@@ -31,13 +31,9 @@ evalcases:
   - id: addition
     expected_outcome: Correctly calculates 15 + 27 = 42
 
-
-      - role: user
-        content: What is 15 + 27?
+    input: What is 15 + 27?
 
-
-      - role: assistant
-        content: "42"
+    expected_output: "42"
 
     execution:
       evaluators:
@@ -108,8 +104,8 @@ See [AGENTS.md](AGENTS.md) for development guidelines and design principles.
 For large-scale evaluations, AgentV supports JSONL (JSON Lines) format as an alternative to YAML:
 
 ```jsonl
-{"id": "test-1", "expected_outcome": "Calculates correctly", "
-{"id": "test-2", "expected_outcome": "Provides explanation", "
+{"id": "test-1", "expected_outcome": "Calculates correctly", "input": "What is 2+2?"}
+{"id": "test-2", "expected_outcome": "Provides explanation", "input": "Explain variables"}
 ```
 
 Optional sidecar YAML metadata file (`dataset.yaml` alongside `dataset.jsonl`):
@@ -184,7 +180,7 @@ execution:
 script: ./validators/check_answer.py
 ```
 
-For complete templates, examples, and evaluator patterns, see: [custom-evaluators
+For complete templates, examples, and evaluator patterns, see: [custom-evaluators](https://agentv.dev/evaluators/custom-evaluators/)
 
 ### Compare Evaluation Results
 
@@ -238,7 +234,7 @@ Write validators in any language (Python, TypeScript, Node, etc.):
 ```
 
 For complete examples and patterns, see:
-- [custom-evaluators
+- [custom-evaluators](https://agentv.dev/evaluators/custom-evaluators/)
 - [code-judge-sdk example](examples/features/code-judge-sdk)
 
 ### LLM Judges
@@ -264,9 +260,7 @@ evalcases:
   - id: quicksort-explain
     expected_outcome: Explain how quicksort works
 
-
-      - role: user
-        content: Explain quicksort algorithm
+    input: Explain quicksort algorithm
 
     rubrics:
       - Mentions divide-and-conquer approach
@@ -281,7 +275,7 @@ Auto-generate rubrics from expected outcomes:
 agentv generate rubrics evals/my-eval.yaml
 ```
 
-See [rubric
+See [rubric evaluator](https://agentv.dev/evaluation/rubrics/) for detailed patterns.
 
 ## Advanced Configuration
 
@@ -310,9 +304,15 @@ Automatically retries on rate limits, transient 5xx errors, and network failures
 - AI agents: Ask Claude Code to `/agentv-eval-builder` to create and iterate on evals
 
 **Detailed Guides:**
-- [Evaluation format and structure](.
-- [Custom evaluators](.
-- [
+- [Evaluation format and structure](https://agentv.dev/evaluation/eval-files/)
+- [Custom evaluators](https://agentv.dev/evaluators/custom-evaluators/)
+- [Rubric evaluator](https://agentv.dev/evaluation/rubrics/)
+- [Composite evaluator](https://agentv.dev/evaluators/composite/)
+- [Tool trajectory evaluator](https://agentv.dev/evaluators/tool-trajectory/)
+- [Structured data evaluators](https://agentv.dev/evaluators/structured-data/)
+- [Batch CLI evaluation](https://agentv.dev/evaluation/batch-cli/)
+- [Compare results](https://agentv.dev/tools/compare/)
+- [Example evaluations](https://agentv.dev/evaluation/examples/)
 
 **Reference:**
 - Monorepo structure: `packages/core/` (engine), `packages/eval/` (evaluation logic), `apps/cli/` (commands)
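
Note on the JSONL hunk (`@@ -108,8 +104,8 @@`): the added lines give each record a plain-string `input` next to `id` and `expected_outcome`. The sketch below shows one way a dataset in that shape could be sanity-checked before a run. It is not AgentV's loader. The helper name `load_cases` and the idea that all three keys are required are assumptions made for illustration; only the key names and the two sample records come from the README text in the diff.

```python
import json

# Keys that appear in the README's JSONL example. Treating all three as
# required is an assumption of this sketch, not a documented AgentV rule.
REQUIRED_KEYS = ("id", "expected_outcome", "input")


def load_cases(text):
    """Parse JSONL text into a list of dicts, one per non-blank line.

    Raises ValueError naming the line number if a record lacks one of
    REQUIRED_KEYS. (Hypothetical helper, not part of AgentV.)
    """
    cases = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {lineno}: missing keys {missing}")
        cases.append(record)
    return cases


# The two records added on the "+" side of the hunk.
SAMPLE = (
    '{"id": "test-1", "expected_outcome": "Calculates correctly", "input": "What is 2+2?"}\n'
    '{"id": "test-2", "expected_outcome": "Provides explanation", "input": "Explain variables"}\n'
)

cases = load_cases(SAMPLE)
for case in cases:
    print(case["id"], "->", case["input"])
```

Reading a real file would be `load_cases(open("dataset.jsonl", encoding="utf-8").read())`, using the `dataset.jsonl` file name mentioned in the README.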