agentv 1.2.0 → 1.5.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +439 -441
- package/dist/{chunk-IVIT4U6S.js → chunk-3RYQPI4H.js} +709 -465
- package/dist/chunk-3RYQPI4H.js.map +1 -0
- package/dist/cli.js +1 -1
- package/dist/index.js +1 -1
- package/dist/templates/.agentv/.env.template +23 -23
- package/dist/templates/.agentv/config.yaml +15 -15
- package/dist/templates/.agentv/targets.yaml +71 -73
- package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +212 -174
- package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md +318 -0
- package/dist/templates/.claude/skills/agentv-eval-builder/references/composite-evaluator.md +215 -215
- package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md +216 -213
- package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md +340 -247
- package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md +139 -139
- package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md +198 -179
- package/dist/templates/.claude/skills/agentv-prompt-optimizer/SKILL.md +77 -77
- package/dist/templates/.github/prompts/agentv-eval-build.prompt.md +4 -4
- package/dist/templates/.github/prompts/agentv-optimize.prompt.md +3 -3
- package/package.json +3 -6
- package/dist/chunk-IVIT4U6S.js.map +0 -1
package/README.md
CHANGED

# AgentV

A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, OpenAI Codex CLI, and Azure OpenAI.

## Installation and Setup

### Installation for End Users

This is the recommended method for users who want to use `agentv` as a command-line tool.

1. Install via npm:

   ```bash
   # Install globally
   npm install -g agentv

   # Or use npx to run without installing
   npx agentv --help
   ```

2. Verify the installation:

   ```bash
   agentv --help
   ```

### Local Development Setup

Follow these steps if you want to contribute to the `agentv` project itself. This workflow uses Bun workspaces for fast, efficient dependency management.

1. Clone the repository and navigate into it:

   ```bash
   git clone https://github.com/EntityProcess/agentv.git
   cd agentv
   ```

2. Install dependencies:

   ```bash
   # Install Bun if you don't have it
   curl -fsSL https://bun.sh/install | bash # macOS/Linux
   # or
   powershell -c "irm bun.sh/install.ps1 | iex" # Windows

   # Install all workspace dependencies
   bun install
   ```

3. Build the project:

   ```bash
   bun run build
   ```

4. Run tests:

   ```bash
   bun test
   ```

You are now ready to start development. The monorepo contains:

- `packages/core/` - Core evaluation engine
- `apps/cli/` - Command-line interface

### Environment Setup

1. Initialize your workspace:
   - Run `agentv init` at the root of your repository
   - This command automatically sets up the `.agentv/` directory structure and configuration files

2. Configure environment variables:
   - The init command creates a `.env.template` file in your project root
   - Copy `.env.template` to `.env` and fill in your API keys, endpoints, and other configuration values
   - Update the environment variable names in `.agentv/targets.yaml` to match those defined in your `.env` file

## Quick Start

You can use the following examples as a starting point.

- [Simple Example](docs/examples/simple/README.md): A minimal working example to help you get started fast.
- [Showcase](docs/examples/showcase/README.md): A collection of advanced use cases and real-world agent evaluation scenarios.

### Validating Eval Files

Validate your eval and targets files before running them:

```bash
# Validate a single file
agentv validate evals/my-eval.yaml

# Validate multiple files
agentv validate evals/eval1.yaml evals/eval2.yaml

# Validate an entire directory (recursively finds all YAML files)
agentv validate evals/
```

### Running Evals

Run an eval (the target is auto-selected from the eval file, or overridden via the CLI):

```bash
# If your eval.yaml contains "target: azure_base", it will be used automatically
agentv eval "path/to/eval.yaml"

# Override the eval file's target with a CLI flag
agentv eval --target vscode_projectx "path/to/eval.yaml"

# Run multiple evals via glob
agentv eval "path/to/evals/**/*.yaml"
```

Run a specific eval case with a custom targets path:

```bash
agentv eval --target vscode_projectx --targets "path/to/targets.yaml" --eval-id "my-eval-case" "path/to/eval.yaml"
```

### Command Line Options

- `eval_paths...`: Path(s) or glob(s) to eval YAML files (required; e.g., `evals/**/*.yaml`)
- `--target TARGET`: Execution target name from targets.yaml (overrides the target specified in the eval file)
- `--targets TARGETS`: Path to the targets.yaml file (default: `./.agentv/targets.yaml`)
- `--eval-id EVAL_ID`: Run only the eval case with this specific ID
- `--out OUTPUT_FILE`: Output file path (default: `.agentv/results/eval_<timestamp>.jsonl`)
- `--output-format FORMAT`: Output format: 'jsonl' or 'yaml' (default: jsonl)
- `--dry-run`: Run with a mock model for testing
- `--agent-timeout SECONDS`: Timeout in seconds for agent response polling (default: 120)
- `--max-retries COUNT`: Maximum number of retries for timeout cases (default: 2)
- `--cache`: Enable caching of LLM responses (default: disabled)
- `--dump-prompts`: Save all prompts to the `.agentv/prompts/` directory
- `--dump-traces`: Write trace files to the `.agentv/traces/` directory
- `--include-trace`: Include the full trace in the result output (verbose)
- `--workers COUNT`: Parallel workers for eval cases (default: 3; the target's `workers` setting is used when provided)
- `--verbose`: Verbose output
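
These flags compose. For example, a cached run with five workers that writes YAML results to a custom path:

```bash
agentv eval --workers 5 --cache --output-format yaml --out .agentv/results/nightly.yaml "evals/**/*.yaml"
```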

### Target Selection Priority

The CLI determines which execution target to use with the following precedence:

1. CLI flag override: `--target my_target` (when provided and not 'default')
2. Eval file specification: `target: my_target` key in the .eval.yaml file
3. Default fallback: Uses the 'default' target (original behavior)

This allows eval files to specify their preferred target while still allowing command-line overrides for flexibility, and maintains backward compatibility with existing workflows.
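
To make the precedence concrete, a minimal eval file might look like the sketch below. This is illustrative only: the file-level `target` key and the per-case `id`, `expected_outcome`, and `rubrics` fields are the ones this README references, while the `evalcases` container key is an assumption; see the bundled examples for the authoritative schema.

```yaml
# Illustrative sketch only; `evalcases` is an assumed container key.
target: azure_base # preferred target; the --target CLI flag overrides this

evalcases:
  - id: my-eval-case # runnable in isolation via --eval-id "my-eval-case"
    expected_outcome: Explain quicksort algorithm
    rubrics:
      - Mentions divide-and-conquer approach
```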

Output goes to `.agentv/results/eval_<timestamp>.jsonl` (or `.yaml`) unless `--out` is provided.

### Tips for VS Code Copilot Evals

**Workspace Switching:** The runner automatically switches to the target workspace when running evals. Make sure you're not actively using another VS Code instance, as this could cause prompts to be injected into the wrong workspace.

**Recommended Models:** Use Claude Sonnet 4.5 or Grok Code Fast 1 for best results, as these models are more consistent in following instruction chains.

## Targets and Environment Variables

Execution targets in `.agentv/targets.yaml` decouple evals from providers/settings and provide flexible environment variable mapping.

### Target Configuration Structure

Each target specifies:

- `name`: Unique identifier for the target
- `provider`: The model provider (`azure`, `anthropic`, `gemini`, `codex`, `vscode`, `vscode-insiders`, `cli`, or `mock`)
- Provider-specific configuration fields at the top level (no `settings` wrapper needed)
- Optional fields: `judge_target`, `workers`, `provider_batching`

### Examples

**Azure OpenAI targets:**

```yaml
- name: azure_base
  provider: azure
  endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
  api_key: ${{ AZURE_OPENAI_API_KEY }}
  model: ${{ AZURE_DEPLOYMENT_NAME }}
  version: ${{ AZURE_OPENAI_API_VERSION }} # Optional: defaults to 2024-12-01-preview
```

Note: Environment variables are referenced using `${{ VARIABLE_NAME }}` syntax. The actual values are resolved from your `.env` file at runtime.
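
For the Azure target above, a matching `.env` might look like this (the values are illustrative placeholders):

```bash
AZURE_OPENAI_ENDPOINT=https://my-resource.openai.azure.com
AZURE_OPENAI_API_KEY=<your-api-key>
AZURE_DEPLOYMENT_NAME=gpt-4o
AZURE_OPENAI_API_VERSION=2024-12-01-preview
```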

**VS Code targets:**

```yaml
- name: vscode_projectx
  provider: vscode
  workspace_template: ${{ PROJECTX_WORKSPACE_PATH }}
  provider_batching: false
  judge_target: azure_base

- name: vscode_insiders_projectx
  provider: vscode-insiders
  workspace_template: ${{ PROJECTX_WORKSPACE_PATH }}
  provider_batching: false
  judge_target: azure_base
```

**CLI targets (template-based):**

```yaml
- name: local_cli
  provider: cli
  judge_target: azure_base
  command_template: 'uv run ./my_agent.py --prompt {PROMPT} {FILES}'
  files_format: '--file {path}'
  cwd: ${{ CLI_EVALS_DIR }} # optional working directory
  timeout_seconds: 30 # optional per-command timeout
  healthcheck:
    type: command # or http
    command_template: uv run ./my_agent.py --healthcheck
```

**Supported placeholders in CLI commands:**
- `{PROMPT}` - The rendered prompt text (shell-escaped)
- `{FILES}` - Expands to multiple file arguments using the `files_format` template (see the expansion example below)
- `{GUIDELINES}` - Guidelines content
- `{EVAL_ID}` - Current eval case ID
- `{ATTEMPT}` - Retry attempt number
- `{OUTPUT_FILE}` - Path to an output file (for agents that write responses to disk)
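
For instance, with the `local_cli` target above and an eval case that attaches two files (hypothetical paths), the rendered command would look roughly like:

```bash
uv run ./my_agent.py --prompt 'Summarize the attached config' --file data/input.json --file data/config.yaml
```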

**Codex CLI targets:**

```yaml
- name: codex_cli
  provider: codex
  judge_target: azure_base
  executable: ${{ CODEX_CLI_PATH }} # defaults to `codex` if omitted
  args: # optional CLI arguments
    - --profile
    - ${{ CODEX_PROFILE }}
    - --model
    - ${{ CODEX_MODEL }}
  timeout_seconds: 180
  cwd: ${{ CODEX_WORKSPACE_DIR }}
  log_format: json # 'summary' or 'json'
```

Codex targets require the standalone `codex` CLI and a configured profile (via `codex configure`) so credentials are stored in `~/.codex/config` (or whatever path the CLI already uses). AgentV mirrors all guideline and attachment files into a fresh scratch workspace, so the `file://` preread links remain valid even when the CLI runs outside your repo tree.

Confirm the CLI works by running `codex exec --json --profile <name> "ping"` (or any supported dry run) before starting an eval. This prints JSONL events; seeing `item.completed` messages indicates the CLI is healthy.

## Writing Custom Evaluators

### Code Evaluator I/O Contract

Code evaluators receive input via stdin and write output to stdout as JSON.

**Input Format (via stdin):**
```json
{
  "question": "string describing the task/question",
  "expected_outcome": "expected outcome description",
  "reference_answer": "gold standard answer (optional)",
  "candidate_answer": "generated code/text from the agent",
  "guideline_files": ["path/to/guideline1.md", "path/to/guideline2.md"],
  "input_files": ["path/to/data.json", "path/to/config.yaml"],
  "input_messages": [{"role": "user", "content": "..."}]
}
```

**Output Format (to stdout):**
```json
{
  "score": 0.85,
  "hits": ["list of successful checks"],
  "misses": ["list of failed checks"],
  "reasoning": "explanation of the score"
}
```

**Key Points:**
- Evaluators receive **full context** but should select only the relevant fields
- Most evaluators only need the `candidate_answer` field; ignore the rest to avoid false positives
- Complex evaluators can use `question`, `reference_answer`, or `guideline_files` for context-aware validation
- Score range: `0.0` to `1.0` (float)
- `hits` and `misses` are optional but recommended for debugging

### Code Evaluator Script Template

```python
#!/usr/bin/env python3
import json
import sys

def evaluate(input_data):
    # Extract only the fields you need
    candidate_answer = input_data.get("candidate_answer", "")

    # Your validation logic here
    score = 0.0  # 0.0 to 1.0
    hits = ["successful check 1", "successful check 2"]
    misses = ["failed check 1"]
    reasoning = "Explanation of score"

    return {
        "score": score,
        "hits": hits,
        "misses": misses,
        "reasoning": reasoning
    }

if __name__ == "__main__":
    try:
        input_data = json.loads(sys.stdin.read())
        result = evaluate(input_data)
        print(json.dumps(result, indent=2))
    except Exception as e:
        error_result = {
            "score": 0.0,
            "hits": [],
            "misses": [f"Evaluator error: {str(e)}"],
            "reasoning": f"Evaluator error: {str(e)}"
        }
        print(json.dumps(error_result, indent=2))
        sys.exit(1)
```
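
Because the contract is plain stdin/stdout JSON, you can exercise an evaluator locally before wiring it into an eval. Assuming the template above is saved as `my_evaluator.py`:

```bash
echo '{"candidate_answer": "def quicksort(xs): ..."}' | python3 my_evaluator.py
```

It should print a result object in the output format shown above.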

### LLM Judge Template Structure

```markdown
# Judge Name

Evaluation criteria and guidelines...

## Scoring Guidelines
0.9-1.0: Excellent
0.7-0.8: Good
...

## Output Format
{
  "score": 0.85,
  "passed": true,
  "reasoning": "..."
}
```

## Rubric-Based Evaluation

AgentV supports structured evaluation through rubrics: lists of criteria that define what makes a good response. Rubrics are checked by an LLM judge and scored based on weights and requirements.

### Basic Usage

Define rubrics inline using simple strings:

```yaml
- id: example-1
  expected_outcome: Explain quicksort algorithm
  rubrics:
    - Mentions divide-and-conquer approach
    - Explains the partition step
    - States time complexity correctly
```

Or use detailed objects for fine-grained control:

```yaml
rubrics:
  - id: structure
    description: Has clear headings and organization
    weight: 1.0
    required: true
  - id: examples
    description: Includes practical examples
    weight: 0.5
    required: false
```

### Generate Rubrics

Automatically generate rubrics from `expected_outcome` fields:

```bash
# Generate rubrics for all eval cases without rubrics
agentv generate rubrics evals/my-eval.yaml

# Use a specific LLM target for generation
agentv generate rubrics evals/my-eval.yaml --target openai:gpt-4o
```

### Scoring and Verdicts

- **Score**: (sum of satisfied weights) / (total weights); see the worked example below
- **Verdicts**:
  - `pass`: Score ≥ 0.8 and all required rubrics met
  - `borderline`: Score ≥ 0.6 and all required rubrics met
  - `fail`: Score < 0.6 or any required rubric failed
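
As a worked example with the detailed rubrics above: if `structure` (weight 1.0, required) is satisfied but `examples` (weight 0.5) is not, the score is 1.0 / 1.5 ≈ 0.67, and because every required rubric passed, the verdict is `borderline`. A minimal sketch of the documented arithmetic (not AgentV's actual implementation):

```python
# Sketch of the documented scoring rule; illustrative only.
rubrics = [
    {"id": "structure", "weight": 1.0, "required": True, "satisfied": True},
    {"id": "examples", "weight": 0.5, "required": False, "satisfied": False},
]

score = sum(r["weight"] for r in rubrics if r["satisfied"]) / sum(r["weight"] for r in rubrics)
required_ok = all(r["satisfied"] for r in rubrics if r["required"])

if required_ok and score >= 0.8:
    verdict = "pass"
elif required_ok and score >= 0.6:
    verdict = "borderline"
else:
    verdict = "fail"

print(f"score={score:.2f}, verdict={verdict}")  # score=0.67, verdict=borderline
```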

For complete examples and detailed patterns, see [examples/features/evals/rubric/](examples/features/evals/rubric/).

## Advanced Configuration

### Retry Configuration

AgentV supports automatic retry with exponential backoff for handling rate limiting (HTTP 429) and transient errors. All retry configuration fields are optional and work with the Azure, Anthropic, and Gemini providers.

**Available retry fields:**

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `max_retries` | number | 3 | Maximum number of retry attempts |
| `retry_initial_delay_ms` | number | 1000 | Initial delay in milliseconds before the first retry |
| `retry_max_delay_ms` | number | 60000 | Maximum delay cap in milliseconds |
| `retry_backoff_factor` | number | 2 | Exponential backoff multiplier |
| `retry_status_codes` | number[] | [500, 408, 429, 502, 503, 504] | HTTP status codes to retry |

**Example configuration:**

```yaml
targets:
  - name: azure_base
    provider: azure
    endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
    api_key: ${{ AZURE_OPENAI_API_KEY }}
    model: gpt-4
    version: ${{ AZURE_OPENAI_API_VERSION }} # Optional: API version (defaults to 2024-12-01-preview)
    max_retries: 5 # Maximum retry attempts
    retry_initial_delay_ms: 2000 # Initial delay before first retry
    retry_max_delay_ms: 120000 # Maximum delay cap
    retry_backoff_factor: 2 # Exponential backoff multiplier
    retry_status_codes: [500, 408, 429, 502, 503, 504] # HTTP status codes to retry
```

**Retry behavior:**
- Exponential backoff with jitter (0.75-1.25x) to avoid thundering herd
- Automatically retries on HTTP 429 (rate limiting), 5xx errors, and network failures
- Respects abort signals for cancellation
- If no retry config is specified, sensible defaults are used
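
To make the schedule concrete, here is a minimal sketch of how the documented parameters combine; it mirrors the table above rather than AgentV's source:

```python
import random

def retry_delays(max_retries=5, initial_ms=2000, max_ms=120000, factor=2):
    """Yield the wait (ms) before each retry: capped exponential backoff with 0.75-1.25x jitter."""
    for attempt in range(max_retries):
        base = min(initial_ms * factor ** attempt, max_ms)
        yield base * random.uniform(0.75, 1.25)

# With the example configuration above, the base delays are 2s, 4s, 8s, 16s, 32s,
# each scaled by a random jitter factor before sleeping.
print([round(d) for d in retry_delays()])
```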

## Related Projects

- [subagent](https://github.com/EntityProcess/subagent) - VS Code Copilot programmatic interface
- [ai-sdk](https://github.com/vercel/ai) - Vercel AI SDK
- [Agentic Context Engineering (ACE)](https://github.com/ax-llm/ax/blob/main/docs/ACE.md)

## License

MIT License - see [LICENSE](LICENSE) for details.