agentv 0.2.3
- package/LICENSE +21 -0
- package/README.md +380 -0
- package/dist/chunk-S3RN2GSO.js +14542 -0
- package/dist/chunk-S3RN2GSO.js.map +1 -0
- package/dist/cli.js +8 -0
- package/dist/cli.js.map +1 -0
- package/dist/index.js +9 -0
- package/dist/index.js.map +1 -0
- package/dist/templates/eval-build.prompt.md +100 -0
- package/dist/templates/eval-schema.json +182 -0
- package/package.json +40 -0
package/LICENSE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 EntityProcess
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
package/README.md
ADDED
@@ -0,0 +1,380 @@
+# AgentV
+
+A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, Azure OpenAI, Anthropic, and Google Gemini.
+
+## Installation and Setup
+
+### Installation for End Users
+
+This is the recommended method for users who want to use `agentv` as a command-line tool.
+
+1. Install via npm:
+
+   ```bash
+   # Install globally
+   npm install -g agentv
+
+   # Or use npx to run without installing
+   npx agentv --help
+   ```
+
+2. Verify the installation:
+
+   ```bash
+   agentv --help
+   ```
+
+### Local Development Setup
+
+Follow these steps if you want to contribute to the `agentv` project itself. This workflow uses pnpm workspaces and an editable install for immediate feedback.
+
+1. Clone the repository and navigate into it:
+
+   ```bash
+   git clone https://github.com/EntityProcess/agentv.git
+   cd agentv
+   ```
+
+2. Install dependencies:
+
+   ```bash
+   # Install pnpm if you don't have it
+   npm install -g pnpm
+
+   # Install all workspace dependencies
+   pnpm install
+   ```
+
+3. Build the project:
+
+   ```bash
+   pnpm build
+   ```
+
+4. Run tests:
+
+   ```bash
+   pnpm test
+   ```
+
+You are now ready to start development. The monorepo contains:
+
+- `packages/core/` - Core evaluation engine
+- `apps/cli/` - Command-line interface
+
+### Environment Setup
+
+1. Configure environment variables:
+   - Copy [.env.template](docs/examples/simple/.env.template) to `.env` in your project root
+   - Fill in your API keys, endpoints, and other configuration values
+
+2. Set up targets:
+   - Copy [targets.yaml](docs/examples/simple/.agentv/targets.yaml) to `.agentv/targets.yaml`
+   - Update the environment variable names in targets.yaml to match those defined in your `.env` file
+
+## Quick Start
+
+### Linting Eval Files
+
+Validate your eval and targets files before running them:
+
+```bash
+# Lint a single file
+agentv lint evals/my-test.yaml
+
+# Lint multiple files
+agentv lint evals/test1.yaml evals/test2.yaml
+
+# Lint entire directory (recursively finds all YAML files)
+agentv lint evals/
+
+# Enable strict mode for additional checks
+agentv lint --strict evals/
+
+# Output results in JSON format
+agentv lint --json evals/
+```
+
+**Linter features:**
+
+- Validates `$schema` field is present and correct
+- Checks required fields and structure for eval and targets files
+- Validates file references exist and are accessible
+- Provides clear error messages with file path and location context
+- Exits with non-zero code on validation failures (CI-friendly)
+- Supports strict mode for additional checks (e.g., non-empty file content)
+
+**File type detection:**
+
+All AgentV files must include a `$schema` field:
+
+```yaml
+# Eval files
+$schema: agentv-eval-v2
+evalcases:
+  - id: test-1
+    # ...
+
+# Targets files
+$schema: agentv-targets-v2
+targets:
+  - name: default
+    # ...
+```
+
+Files without a `$schema` field will be rejected with a clear error message.
+
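Reviewer note on file type detection: as a rough illustration of how a `$schema` value could select the file type, the TypeScript sketch below maps the two schema identifiers listed in the README to a file kind and rejects everything else. The function name `detectFileKind`, the `FileKind` type, and the error messages are hypothetical and are not taken from the agentv source.

```typescript
// Hypothetical sketch: names and messages are illustrative, not agentv's API.
type FileKind = "eval" | "targets";

function detectFileKind(doc: Record<string, unknown>): FileKind {
  const schema = doc["$schema"];
  // A missing $schema is treated as an error, matching the README's statement
  // that files without the field are rejected.
  if (schema === undefined) {
    throw new Error("Missing required $schema field");
  }
  if (schema === "agentv-eval-v2") return "eval";
  if (schema === "agentv-targets-v2") return "targets";
  throw new Error("Unrecognized $schema value: " + String(schema));
}
```

The input is assumed to be an already parsed YAML document; parsing itself is left out to keep the sketch dependency-free.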
+### Running Evals
+
+Run an eval (the target is taken from the test file unless a CLI flag overrides it):
+
+```bash
+# If your test.yaml contains "target: azure_base", it will be used automatically
+agentv eval "path/to/test.yaml"
+
+# Override the test file's target with a CLI flag
+agentv eval --target vscode_projectx "path/to/test.yaml"
+```
+
+Run a specific test case with a custom targets path:
+
+```bash
+agentv eval --target vscode_projectx --targets "path/to/targets.yaml" --test-id "my-test-case" "path/to/test.yaml"
+```
+
+### Command Line Options
+
+- `test_file`: Path to test YAML file (required, positional argument)
+- `--target TARGET`: Execution target name from targets.yaml (overrides target specified in test file)
+- `--targets TARGETS`: Path to targets.yaml file (default: ./.agentv/targets.yaml)
+- `--test-id TEST_ID`: Run only the test case with this specific ID
+- `--out OUTPUT_FILE`: Output file path (default: results/{testname}_{timestamp}.jsonl)
+- `--format FORMAT`: Output format: 'jsonl' or 'yaml' (default: jsonl)
+- `--dry-run`: Run with mock model for testing
+- `--agent-timeout SECONDS`: Timeout in seconds for agent response polling (default: 120)
+- `--max-retries COUNT`: Maximum number of retries for timeout cases (default: 2)
+- `--cache`: Enable caching of LLM responses (default: disabled)
+- `--dump-prompts`: Save all prompts to `.agentv/prompts/` directory
+- `--verbose`: Verbose output
+
+### Target Selection Priority
+
+The CLI determines which execution target to use with the following precedence:
+
+1. CLI flag override: `--target my_target` (when provided and not 'default')
+2. Test file specification: `target: my_target` key in the .test.yaml file
+3. Default fallback: Uses the 'default' target (original behavior)
+
+This allows test files to specify their preferred target while still allowing command-line overrides for flexibility, and maintains backward compatibility with existing workflows.
+
+Output goes to `.agentv/results/{testname}_{timestamp}.jsonl` (or `.yaml`) unless `--out` is provided.
+
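Reviewer note on target selection: the three-step precedence in the README's Target Selection Priority list can be sketched in TypeScript as below. This is a minimal sketch under stated assumptions: `resolveTarget` and its parameter names are hypothetical, and the actual agentv implementation may differ.

```typescript
// Hypothetical helper; names are illustrative and not taken from the agentv source.
function resolveTarget(cliTarget?: string, fileTarget?: string): string {
  // 1. A CLI flag wins when it is provided and is not the literal "default".
  if (cliTarget !== undefined && cliTarget !== "" && cliTarget !== "default") {
    return cliTarget;
  }
  // 2. Otherwise use the `target:` key from the test file, if present.
  if (fileTarget !== undefined && fileTarget !== "") {
    return fileTarget;
  }
  // 3. Fall back to the "default" target.
  return "default";
}
```

Under these rules, `--target default` behaves the same as passing no flag, so a test file's own `target:` key still applies.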
+### Tips for VS Code Copilot Evals
+
+**Workspace Switching:** The runner automatically switches to the target workspace when running evals. Make sure you're not actively using another VS Code instance, as this could cause prompts to be injected into the wrong workspace.
+
+**Recommended Models:** Use Claude Sonnet 4.5 or Grok Code Fast 1 for best results, as these models are more consistent in following instruction chains.
+
+## Requirements
+
+- Node.js 20.0.0 or higher
+- Environment variables for your chosen providers (configured via targets.yaml)
+
+Environment variables by provider (the variable names are configured in targets.yaml):
+
+- **Azure OpenAI:** Set the environment variables named in your target's `settings.endpoint`, `settings.api_key`, and `settings.model`
+- **Anthropic Claude:** Set the environment variables named in your target's `settings.api_key` and `settings.model`
+- **Google Gemini:** Set the environment variables named in your target's `settings.api_key` and optional `settings.model`
+- **VS Code:** Set the environment variable named in your target's `settings.workspace_env` to the path of a `.code-workspace` file
+
+## Targets and Environment Variables
+
+Execution targets in `.agentv/targets.yaml` decouple tests from providers/settings and provide flexible environment variable mapping.
+
+### Target Configuration Structure
+
+Each target specifies:
+
+- `name`: Unique identifier for the target
+- `provider`: The model provider (`azure`, `anthropic`, `gemini`, `vscode`, `vscode-insiders`, or `mock`)
+- `settings`: Environment variable names to use for this target
+
+### Examples
+
+**Azure OpenAI targets:**
+
+```yaml
+- name: azure_base
+  provider: azure
+  settings:
+    endpoint: "AZURE_OPENAI_ENDPOINT"
+    api_key: "AZURE_OPENAI_API_KEY"
+    model: "AZURE_DEPLOYMENT_NAME"
+```
+
+**Anthropic targets:**
+
+```yaml
+- name: anthropic_base
+  provider: anthropic
+  settings:
+    api_key: "ANTHROPIC_API_KEY"
+    model: "ANTHROPIC_MODEL"
+```
+
+**Google Gemini targets:**
+
+```yaml
+- name: gemini_base
+  provider: gemini
+  settings:
+    api_key: "GOOGLE_API_KEY"
+    model: "GOOGLE_GEMINI_MODEL" # Optional, defaults to gemini-2.0-flash-exp
+```
+
+**VS Code targets:**
+
+```yaml
+- name: vscode_projectx
+  provider: vscode
+  settings:
+    workspace_env: "EVAL_PROJECTX_WORKSPACE_PATH"
+
+- name: vscode_insiders_projectx
+  provider: vscode-insiders
+  settings:
+    workspace_env: "EVAL_PROJECTX_WORKSPACE_PATH"
+```
+
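Reviewer note on the settings indirection: in the README's target examples, each `settings` value is the name of an environment variable rather than the secret itself. The TypeScript sketch below shows one way such a mapping could be resolved against an environment object. `resolveSettings` is a hypothetical name, and treating every listed variable as required is a simplifying assumption (the README marks the Gemini `model` as optional).

```typescript
// Hypothetical sketch: resolves setting keys to the values of the
// environment variables they name. Not taken from the agentv source.
function resolveSettings(
  settings: Record<string, string>,
  env: Record<string, string | undefined>,
): Record<string, string> {
  const resolved: Record<string, string> = {};
  const missing: string[] = [];
  for (const key of Object.keys(settings)) {
    const varName = settings[key];
    const value = env[varName];
    // Assumption: an unset or empty variable counts as missing.
    if (value === undefined || value === "") {
      missing.push(varName);
    } else {
      resolved[key] = value;
    }
  }
  if (missing.length > 0) {
    throw new Error("Missing environment variables: " + missing.join(", "));
  }
  return resolved;
}
```

In a Node.js program the second argument would typically be `process.env`, loaded from the `.env` file by whatever mechanism the project uses.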
+## Timeout Handling and Retries
+
+When using VS Code or other AI agents that may experience timeouts, the evaluator includes automatic retry functionality:
+
+- **Timeout detection:** Automatically detects when agents time out
+- **Automatic retries:** When a timeout occurs, the same test case is retried up to `--max-retries` times (default: 2)
+- **Retry behavior:** Only timeouts trigger retries; other errors proceed to the next test case
+- **Timeout configuration:** Use `--agent-timeout` to adjust how long to wait for agent responses
+
+Example with custom timeout settings:
+
+```bash
+agentv eval evals/projectx/example.yaml --target vscode_projectx --agent-timeout 180 --max-retries 3
+```
+
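Reviewer note on retry behavior: the rule that only timeouts are retried, up to a maximum count, can be sketched in TypeScript as follows. This is an assumption-laden sketch: `runWithRetries`, `isTimeout`, and the convention of marking timeouts with `name === "TimeoutError"` are all hypothetical and are not taken from the agentv source.

```typescript
// Hypothetical convention: a timeout is an Error whose name is "TimeoutError".
function isTimeout(err: unknown): boolean {
  return err instanceof Error && err.name === "TimeoutError";
}

// Runs `attempt` once, then retries it up to `maxRetries` more times,
// but only when the failure is a timeout.
async function runWithRetries<T>(
  attempt: () => Promise<T>,
  maxRetries: number = 2,
): Promise<T> {
  for (let retriesUsed = 0; ; retriesUsed++) {
    try {
      return await attempt();
    } catch (err) {
      // Non-timeout errors, or timeouts after the last retry, propagate to the caller.
      if (!isTimeout(err) || retriesUsed >= maxRetries) {
        throw err;
      }
    }
  }
}
```

With the default of 2 retries, a test case is attempted at most three times; the caller decides what to do with the final error, for example recording it and moving on to the next test case.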
+## How the Evals Work
+
+For each test case in a `.yaml` file:
+
+1. Parse YAML and collect user messages (inline text and referenced files)
+2. Extract code blocks from text for structured prompting
+3. Generate a candidate answer via the configured provider/model
+4. Score against the expected answer using AI-powered quality grading
+5. Output results in JSONL or YAML format with detailed metrics
+
+### VS Code Copilot Target
+
+- Opens your configured workspace and uses the `subagent` library to programmatically invoke VS Code Copilot
+- The prompt is built from the `.yaml` user content (task, files, code blocks)
+- Copilot is instructed to complete the task within the workspace context
+- Results are captured and scored automatically
+
+## Scoring and Outputs
+
+Run with `--verbose` to print detailed information and stack traces on errors.
+
+### Scoring Methodology
+
+AgentV uses an AI-powered quality grader that:
+
+- Extracts key aspects from the expected answer
+- Compares model output against those aspects
+- Provides detailed hit/miss analysis with reasoning
+- Returns a normalized score (0.0 to 1.0)
+
+### Output Formats
+
+**JSONL format (default):**
+
+- One JSON object per line (newline-delimited)
+- Fields: `test_id`, `score`, `hits`, `misses`, `model_answer`, `expected_aspect_count`, `target`, `timestamp`, `reasoning`, `raw_request`, `grader_raw_request`
+
+**YAML format (with `--format yaml`):**
+
+- Human-readable YAML documents
+- Same fields as JSONL, properly formatted for readability
+- Multi-line strings use literal block style
+
+### Summary Statistics
+
+After running all test cases, AgentV displays:
+
+- Mean, median, min, max scores
+- Standard deviation
+- Distribution histogram
+- Total test count and execution time
+
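Reviewer note on summary statistics: the statistics listed in the README (mean, median, min, max, standard deviation) can be computed from the per-test `score` values as in the TypeScript sketch below. The `summarize` function and `Summary` type are hypothetical, and the use of the population standard deviation (dividing by n) is an assumption, since the README does not say which variant agentv reports.

```typescript
// Hypothetical sketch of summary statistics over normalized scores (0.0 to 1.0).
interface Summary {
  count: number;
  mean: number;
  median: number;
  min: number;
  max: number;
  stdDev: number;
}

function summarize(scores: number[]): Summary {
  if (scores.length === 0) {
    throw new Error("Cannot summarize an empty score list");
  }
  // Sort a copy so the caller's array is left untouched.
  const sorted = scores.slice().sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((sum, x) => sum + x, 0) / n;
  const mid = Math.floor(n / 2);
  // For an even count, the median is the average of the two middle values.
  const median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  // Assumption: population variance (divide by n, not n - 1).
  const variance = sorted.reduce((sum, x) => sum + (x - mean) * (x - mean), 0) / n;
  return { count: n, mean, median, min: sorted[0], max: sorted[n - 1], stdDev: Math.sqrt(variance) };
}
```

The scores would come from the `score` field of each JSONL record, for example by applying `JSON.parse` to every non-empty line of the results file.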
+## Architecture
+
+AgentV is built as a TypeScript monorepo using:
+
+- **pnpm workspaces:** Efficient dependency management
+- **Turbo:** Build system and task orchestration
+- **@ax-llm/ax:** Unified LLM provider abstraction
+- **Vercel AI SDK:** Streaming and tool use capabilities
+- **Zod:** Runtime type validation
+- **Commander.js:** CLI argument parsing
+- **Vitest:** Testing framework
+
+### Package Structure
+
+- `@agentv/core` - Core evaluation engine, providers, grading logic
+- `agentv` - Main package that bundles CLI functionality
+
+## Troubleshooting
+
+### Installation Issues
+
+**Problem:** Package installation fails or command not found.
+
+**Solution:**
+
+```bash
+# Clear npm cache and reinstall
+npm cache clean --force
+npm uninstall -g agentv
+npm install -g agentv
+
+# Or use npx without installing
+npx agentv@latest --help
+```
+
+### VS Code Integration Issues
+
+**Problem:** VS Code workspace doesn't open or prompts aren't injected.
+
+**Solution:**
+
+- Ensure the `subagent` package is installed (should be automatic)
+- Verify your workspace path in `.env` is correct and points to a `.code-workspace` file
+- Close any other VS Code instances before running evals
+- Use `--verbose` flag to see detailed workspace switching logs
+
+### Provider Configuration Issues
+
+**Problem:** API authentication errors or missing credentials.
+
+**Solution:**
+
+- Double-check environment variables in your `.env` file
+- Verify the variable names in `targets.yaml` match your `.env` file
+- Use `--dry-run` first to test without making API calls
+- Check provider-specific documentation for required environment variables
+
+## License
+
+MIT License - see [LICENSE](LICENSE) for details.
+
+## Related Projects
+
+- [subagent](https://github.com/EntityProcess/subagent) - VS Code Copilot programmatic interface
+- [Ax](https://github.com/ax-llm/ax) - TypeScript LLM framework