agentv 2.1.0 → 2.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +21 -1
- package/dist/{chunk-5BLNVACB.js → chunk-5HTT24MQ.js} +538 -308
- package/dist/chunk-5HTT24MQ.js.map +1 -0
- package/dist/cli.js +1 -1
- package/dist/index.js +1 -1
- package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +23 -1
- package/package.json +1 -1
- package/dist/chunk-5BLNVACB.js.map +0 -1
package/README.md
CHANGED
@@ -101,7 +101,27 @@ See [AGENTS.md](AGENTS.md) for development guidelines and design principles.
 
 ## Core Concepts
 
-**Evaluation files** (`.yaml`) define test cases with expected outcomes. **Targets** specify which agent/provider to evaluate. **Judges** (code or LLM) score results. **Results** are written as JSONL/YAML for analysis and comparison.
+**Evaluation files** (`.yaml` or `.jsonl`) define test cases with expected outcomes. **Targets** specify which agent/provider to evaluate. **Judges** (code or LLM) score results. **Results** are written as JSONL/YAML for analysis and comparison.
+
+### JSONL Format Support
+
+For large-scale evaluations, AgentV supports JSONL (JSON Lines) format as an alternative to YAML:
+
+```jsonl
+{"id": "test-1", "expected_outcome": "Calculates correctly", "input_messages": [{"role": "user", "content": "What is 2+2?"}]}
+{"id": "test-2", "expected_outcome": "Provides explanation", "input_messages": [{"role": "user", "content": "Explain variables"}]}
+```
+
+Optional sidecar YAML metadata file (`dataset.yaml` alongside `dataset.jsonl`):
+```yaml
+description: Math evaluation dataset
+dataset: math-tests
+execution:
+  target: azure_base
+evaluator: llm_judge
+```
+
+Benefits: Streaming-friendly, Git-friendly diffs, programmatic generation, industry standard (DeepEval, LangWatch, Hugging Face).
 
 ## Usage
 
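
The README change above lists programmatic generation as one benefit of JSONL. As a minimal sketch, assuming only the three fields shown in the README example (`id`, `expected_outcome`, `input_messages`), a dataset could be generated and read back roughly like this. The `EvalCase` type and the `toJsonl` and `parseJsonl` helpers are hypothetical names used for illustration and are not part of AgentV's API:

```typescript
// Hypothetical types mirroring the fields in the README's JSONL example.
// AgentV may accept additional fields; these three are the only ones shown.
interface Message {
  role: string;
  content: string;
}

interface EvalCase {
  id: string;
  expected_outcome: string;
  input_messages: Message[];
}

// Serialize each case as one compact JSON object per line (JSON Lines).
function toJsonl(cases: EvalCase[]): string {
  return cases.map((c) => JSON.stringify(c)).join("\n") + "\n";
}

// Parse JSONL text, skipping blank lines and reporting the failing line number.
function parseJsonl(text: string): EvalCase[] {
  const cases: EvalCase[] = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return;
    try {
      cases.push(JSON.parse(line) as EvalCase);
    } catch (err) {
      throw new Error(`Invalid JSON on line ${index + 1}: ${(err as Error).message}`);
    }
  });
  return cases;
}

// Generate a small illustrative dataset and round-trip it.
const cases: EvalCase[] = [1, 2, 3].map((n) => ({
  id: `add-${n}`,
  expected_outcome: `Answers ${n + n}`,
  input_messages: [{ role: "user", content: `What is ${n}+${n}?` }],
}));

const jsonl = toJsonl(cases);
const roundTripped = parseJsonl(jsonl);
console.log(roundTripped.length);
```

Because each test case occupies exactly one line, appending or removing a case changes a single line in version control, which is what makes the format diff-friendly.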