agentv 0.2.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/LICENSE ADDED
MIT License

Copyright (c) 2025 EntityProcess

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md ADDED
# AgentV

A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, Azure OpenAI, Anthropic, and Google Gemini.

## Installation and Setup

### Installation for End Users

This is the recommended method for users who want to use `agentv` as a command-line tool.

1. Install via npm:

```bash
# Install globally
npm install -g agentv

# Or use npx to run without installing
npx agentv --help
```

2. Verify the installation:

```bash
agentv --help
```

### Local Development Setup

Follow these steps if you want to contribute to the `agentv` project itself. This workflow uses pnpm workspaces, which link the local packages so your changes are picked up immediately.

1. Clone the repository and navigate into it:

```bash
git clone https://github.com/EntityProcess/agentv.git
cd agentv
```

2. Install dependencies:

```bash
# Install pnpm if you don't have it
npm install -g pnpm

# Install all workspace dependencies
pnpm install
```

3. Build the project:

```bash
pnpm build
```

4. Run tests:

```bash
pnpm test
```

You are now ready to start development. The monorepo contains:

- `packages/core/` - Core evaluation engine
- `apps/cli/` - Command-line interface

### Environment Setup

1. Configure environment variables:
   - Copy [.env.template](docs/examples/simple/.env.template) to `.env` in your project root
   - Fill in your API keys, endpoints, and other configuration values

2. Set up targets:
   - Copy [targets.yaml](docs/examples/simple/.agentv/targets.yaml) to `.agentv/targets.yaml`
   - Update the environment variable names in `targets.yaml` to match those defined in your `.env` file (see the sketch below)

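For example, a target whose `settings` use the Azure variable names shown later in this README expects variables with exactly those names to be defined in `.env`. A minimal sketch (substitute the variable names your own targets use):

```yaml
# .agentv/targets.yaml — each settings value is the NAME of an environment variable
$schema: agentv-targets-v2
targets:
  - name: azure_base
    provider: azure
    settings:
      endpoint: "AZURE_OPENAI_ENDPOINT"   # .env must define AZURE_OPENAI_ENDPOINT=...
      api_key: "AZURE_OPENAI_API_KEY"     # .env must define AZURE_OPENAI_API_KEY=...
      model: "AZURE_DEPLOYMENT_NAME"      # .env must define AZURE_DEPLOYMENT_NAME=...
```
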
## Quick Start

### Linting Eval Files

Validate your eval and targets files before running them:

```bash
# Lint a single file
agentv lint evals/my-test.yaml

# Lint multiple files
agentv lint evals/test1.yaml evals/test2.yaml

# Lint entire directory (recursively finds all YAML files)
agentv lint evals/

# Enable strict mode for additional checks
agentv lint --strict evals/

# Output results in JSON format
agentv lint --json evals/
```

**Linter features:**

- Validates that the `$schema` field is present and correct
- Checks required fields and structure for eval and targets files
- Validates that file references exist and are accessible
- Provides clear error messages with file path and location context
- Exits with a non-zero code on validation failures (CI-friendly; see the workflow sketch below)
- Supports strict mode for additional checks (e.g., non-empty file content)

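Because the linter exits non-zero on failure, it can gate CI. A minimal sketch, assuming GitHub Actions and an `evals/` directory at the repository root (adapt the file name, paths, and CI system to your setup):

```yaml
# .github/workflows/lint-evals.yml (illustrative)
name: Lint eval files
on: [push, pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # A validation failure exits non-zero, which fails this step and the job
      - run: npx agentv lint --strict evals/
```
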
**File type detection:**

All AgentV files must include a `$schema` field:

```yaml
# Eval files
$schema: agentv-eval-v2
evalcases:
  - id: test-1
    # ...

# Targets files
$schema: agentv-targets-v2
targets:
  - name: default
    # ...
```

Files without a `$schema` field will be rejected with a clear error message.

### Running Evals

Run an eval (the target is taken from the test file, or overridden on the command line):

```bash
# If your test.yaml contains "target: azure_base", it will be used automatically
agentv eval "path/to/test.yaml"

# Override the test file's target with CLI flag
agentv eval --target vscode_projectx "path/to/test.yaml"
```

Run a specific test case with a custom targets path:

```bash
agentv eval --target vscode_projectx --targets "path/to/targets.yaml" --test-id "my-test-case" "path/to/test.yaml"
```

### Command Line Options

- `test_file`: Path to the test YAML file (required, positional argument)
- `--target TARGET`: Execution target name from targets.yaml (overrides the target specified in the test file)
- `--targets TARGETS`: Path to the targets.yaml file (default: ./.agentv/targets.yaml)
- `--test-id TEST_ID`: Run only the test case with this specific ID
- `--out OUTPUT_FILE`: Output file path (default: results/{testname}_{timestamp}.jsonl)
- `--format FORMAT`: Output format: 'jsonl' or 'yaml' (default: jsonl)
- `--dry-run`: Run with a mock model for testing
- `--agent-timeout SECONDS`: Timeout in seconds for agent response polling (default: 120)
- `--max-retries COUNT`: Maximum number of retries for timeout cases (default: 2)
- `--cache`: Enable caching of LLM responses (default: disabled)
- `--dump-prompts`: Save all prompts to the `.agentv/prompts/` directory
- `--verbose`: Verbose output

### Target Selection Priority

The CLI determines which execution target to use with the following precedence:

1. CLI flag override: `--target my_target` (when provided and not 'default')
2. Test file specification: the `target: my_target` key in the test file (see the example below)
3. Default fallback: uses the 'default' target

This lets test files declare their preferred target while still allowing command-line overrides, and keeps existing workflows backward compatible.

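For example, a test file that pins its own target (only the keys shown in this README are used; the body of the case is omitted):

```yaml
$schema: agentv-eval-v2
target: azure_base   # used automatically; `agentv eval --target vscode_projectx ...` would override it
evalcases:
  - id: my-test-case
    # ...
```
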
Output goes to `.agentv/results/{testname}_{timestamp}.jsonl` (or `.yaml`) unless `--out` is provided.

### Tips for VS Code Copilot Evals

**Workspace Switching:** The runner automatically switches to the target workspace when running evals. Make sure you're not actively using another VS Code instance, as this could cause prompts to be injected into the wrong workspace.

**Recommended Models:** Use Claude Sonnet 4.5 or Grok Code Fast 1 for best results, as these models are more consistent in following instruction chains.

## Requirements

- Node.js 20.0.0 or higher
- Environment variables for your chosen providers (configured via targets.yaml)

Environment keys (configured via targets.yaml):

- **Azure OpenAI:** Set the environment variables named by your target's `settings.endpoint`, `settings.api_key`, and `settings.model`
- **Anthropic Claude:** Set the environment variables named by your target's `settings.api_key` and `settings.model`
- **Google Gemini:** Set the environment variables named by your target's `settings.api_key` and optional `settings.model`
- **VS Code:** Set the environment variable named by your target's `settings.workspace_env` to the path of your `.code-workspace` file

## Targets and Environment Variables

Execution targets in `.agentv/targets.yaml` decouple tests from provider settings and provide flexible environment-variable mapping.

### Target Configuration Structure

Each target specifies:

- `name`: Unique identifier for the target
- `provider`: The model provider (`azure`, `anthropic`, `gemini`, `vscode`, `vscode-insiders`, or `mock`)
- `settings`: The names of the environment variables to read for this target

### Examples

**Azure OpenAI targets:**

```yaml
- name: azure_base
  provider: azure
  settings:
    endpoint: "AZURE_OPENAI_ENDPOINT"
    api_key: "AZURE_OPENAI_API_KEY"
    model: "AZURE_DEPLOYMENT_NAME"
```

**Anthropic targets:**

```yaml
- name: anthropic_base
  provider: anthropic
  settings:
    api_key: "ANTHROPIC_API_KEY"
    model: "ANTHROPIC_MODEL"
```

**Google Gemini targets:**

```yaml
- name: gemini_base
  provider: gemini
  settings:
    api_key: "GOOGLE_API_KEY"
    model: "GOOGLE_GEMINI_MODEL" # Optional, defaults to gemini-2.0-flash-exp
```

**VS Code targets:**

```yaml
- name: vscode_projectx
  provider: vscode
  settings:
    workspace_env: "EVAL_PROJECTX_WORKSPACE_PATH"

- name: vscode_insiders_projectx
  provider: vscode-insiders
  settings:
    workspace_env: "EVAL_PROJECTX_WORKSPACE_PATH"
```

## Timeout Handling and Retries

When using VS Code or other AI agents that may experience timeouts, the evaluator includes automatic retry functionality:

- **Timeout detection:** Automatically detects when agents time out
- **Automatic retries:** When a timeout occurs, the same test case is retried up to `--max-retries` times (default: 2)
- **Retry behavior:** Only timeouts trigger retries; other errors proceed to the next test case
- **Timeout configuration:** Use `--agent-timeout` to adjust how long to wait for agent responses

Example with custom timeout settings:

```bash
agentv eval evals/projectx/example.yaml --target vscode_projectx --agent-timeout 180 --max-retries 3
```

## How the Evals Work

For each test case in a `.yaml` file (a schematic sketch follows the list):

1. Parse YAML and collect user messages (inline text and referenced files)
2. Extract code blocks from text for structured prompting
3. Generate a candidate answer via the configured provider/model
4. Score against the expected answer using AI-powered quality grading
5. Output results in JSONL or YAML format with detailed metrics

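Schematically, a single case supplies the inputs for steps 1–4 above. The sketch below uses only keys that appear elsewhere in this README (`$schema`, `target`, `evalcases`, `id`); the comments stand in for the remaining fields rather than claiming their real names:

```yaml
$schema: agentv-eval-v2
target: azure_base
evalcases:
  - id: example-case
    # step 1: user messages — inline task text and/or referenced files
    # step 2: fenced code blocks in that text are extracted for structured prompting
    # step 4: an expected answer that the AI grader scores the candidate against
    # ...
```
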
### VS Code Copilot Target

- Opens your configured workspace and uses the `subagent` library to programmatically invoke VS Code Copilot
- The prompt is built from the `.yaml` user content (task, files, code blocks)
- Copilot is instructed to complete the task within the workspace context
- Results are captured and scored automatically

## Scoring and Outputs

Run with `--verbose` to print detailed information and stack traces on errors.

### Scoring Methodology

AgentV uses an AI-powered quality grader that:

- Extracts key aspects from the expected answer
- Compares the model output against those aspects
- Provides a detailed hit/miss analysis with reasoning
- Returns a normalized score (0.0 to 1.0)

### Output Formats

**JSONL format (default):**

- One JSON object per line (newline-delimited)
- Fields: `test_id`, `score`, `hits`, `misses`, `model_answer`, `expected_aspect_count`, `target`, `timestamp`, `reasoning`, `raw_request`, `grader_raw_request`

**YAML format (with `--format yaml`):**

- Human-readable YAML documents
- Same fields as JSONL, formatted for readability
- Multi-line strings use literal block style (see the example record below)

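As an illustration only, a single result record in `--format yaml` output might look like the sketch below. The field names are the ones listed above; the values are invented, and `raw_request`/`grader_raw_request` are omitted for brevity:

```yaml
test_id: my-test-case
target: azure_base
timestamp: "2025-01-01T12:00:00Z"
score: 0.75
expected_aspect_count: 4
hits:
  - "Explains how targets map to environment variables"
  - "Mentions the --agent-timeout flag"
  - "Recommends linting eval files before running them"
misses:
  - "Does not mention the --max-retries default"
reasoning: |
  Three of the four expected aspects are covered; the retry default
  is missing from the answer.
model_answer: |
  ...
```
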
### Summary Statistics

After running all test cases, AgentV displays:

- Mean, median, min, max scores
- Standard deviation
- Distribution histogram
- Total test count and execution time

## Architecture

AgentV is built as a TypeScript monorepo using:

- **pnpm workspaces:** Efficient dependency management
- **Turbo:** Build system and task orchestration
- **@ax-llm/ax:** Unified LLM provider abstraction
- **Vercel AI SDK:** Streaming and tool use capabilities
- **Zod:** Runtime type validation
- **Commander.js:** CLI argument parsing
- **Vitest:** Testing framework

### Package Structure

- `@agentv/core` - Core evaluation engine, providers, grading logic
- `agentv` - Main package that bundles CLI functionality

## Troubleshooting

### Installation Issues

**Problem:** Package installation fails or the `agentv` command is not found.

**Solution:**

```bash
# Clear npm cache and reinstall
npm cache clean --force
npm uninstall -g agentv
npm install -g agentv

# Or use npx without installing
npx agentv@latest --help
```

### VS Code Integration Issues

**Problem:** The VS Code workspace doesn't open or prompts aren't injected.

**Solution:**

- Ensure the `subagent` package is installed (this should be automatic)
- Verify that the workspace path in your `.env` is correct and points to a `.code-workspace` file
- Close any other VS Code instances before running evals
- Use the `--verbose` flag to see detailed workspace-switching logs

### Provider Configuration Issues

**Problem:** API authentication errors or missing credentials.

**Solution:**

- Double-check the environment variables in your `.env` file
- Verify that the variable names in `targets.yaml` match those in your `.env` file (see the sketch below)
- Use `--dry-run` first to test without making API calls
- Check provider-specific documentation for required environment variables

## License

MIT License - see [LICENSE](LICENSE) for details.

## Related Projects

- [subagent](https://github.com/EntityProcess/subagent) - VS Code Copilot programmatic interface
- [Ax](https://github.com/axflow/axflow) - TypeScript LLM framework