agentv 1.2.0 → 1.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,441 +1,439 @@
1
- # AgentV
2
-
3
- A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, OpenAI Codex CLI and Azure OpenAI.
4
-
5
- ## Installation and Setup
6
-
7
- ### Installation for End Users
8
-
9
- This is the recommended method for users who want to use `agentv` as a command-line tool.
10
-
11
- 1. Install via npm:
12
-
13
- ```bash
14
- # Install globally
15
- npm install -g agentv
16
-
17
- # Or use npx to run without installing
18
- npx agentv --help
19
- ```
20
-
21
- 2. Verify the installation:
22
-
23
- ```bash
24
- agentv --help
25
- ```
26
-
27
- ### Local Development Setup
28
-
29
- Follow these steps if you want to contribute to the `agentv` project itself. This workflow uses Bun workspaces for fast, efficient dependency management.
30
-
31
- 1. Clone the repository and navigate into it:
32
-
33
- ```bash
34
- git clone https://github.com/EntityProcess/agentv.git
35
- cd agentv
36
- ```
37
-
38
- 2. Install dependencies:
39
-
40
- ```bash
41
- # Install Bun if you don't have it
42
- curl -fsSL https://bun.sh/install | bash # macOS/Linux
43
- # or
44
- powershell -c "irm bun.sh/install.ps1 | iex" # Windows
45
-
46
- # Install all workspace dependencies
47
- bun install
48
- ```
49
-
50
- 3. Build the project:
51
-
52
- ```bash
53
- bun run build
54
- ```
55
-
56
- 4. Run tests:
57
-
58
- ```bash
59
- bun test
60
- ```
61
-
62
- You are now ready to start development. The monorepo contains:
63
-
64
- - `packages/core/` - Core evaluation engine
65
- - `apps/cli/` - Command-line interface
66
-
67
- ### Environment Setup
68
-
69
- 1. Initialize your workspace:
70
- - Run `agentv init` at the root of your repository
71
- - This command automatically sets up the `.agentv/` directory structure and configuration files
72
-
73
- 2. Configure environment variables:
74
- - The init command creates a `.env.template` file in your project root
75
- - Copy `.env.template` to `.env` and fill in your API keys, endpoints, and other configuration values
76
- - Update the environment variable names in `.agentv/targets.yaml` to match those defined in your `.env` file
77
-
78
- ## Quick Start
79
-
80
- You can use the following examples as a starting point.
81
- - [Simple Example](docs/examples/simple/README.md): A minimal working example to help you get started fast.
82
- - [Showcase](docs/examples/showcase/README.md): A collection of advanced use cases and real-world agent evaluation scenarios.
83
-
84
- ### Validating Eval Files
85
-
86
- Validate your eval and targets files before running them:
87
-
88
- ```bash
89
- # Validate a single file
90
- agentv validate evals/my-eval.yaml
91
-
92
- # Validate multiple files
93
- agentv validate evals/eval1.yaml evals/eval2.yaml
94
-
95
- # Validate entire directory (recursively finds all YAML files)
96
- agentv validate evals/
97
- ```
98
-
99
- ### Running Evals
100
-
101
- Run eval (target auto-selected from eval file or CLI override):
102
-
103
- ```bash
104
- # If your eval.yaml contains "target: azure_base", it will be used automatically
105
- agentv eval "path/to/eval.yaml"
106
-
107
- # Override the eval file's target with CLI flag
108
- agentv eval --target vscode_projectx "path/to/eval.yaml"
109
-
110
- # Run multiple evals via glob
111
- agentv eval "path/to/evals/**/*.yaml"
112
- ```
113
-
114
- Run a specific eval case with custom targets path:
115
-
116
- ```bash
117
- agentv eval --target vscode_projectx --targets "path/to/targets.yaml" --eval-id "my-eval-case" "path/to/eval.yaml"
118
- ```
119
-
120
- ### Command Line Options
121
-
122
- - `eval_paths...`: Path(s) or glob(s) to eval YAML files (required; e.g., `evals/**/*.yaml`)
123
- - `--target TARGET`: Execution target name from targets.yaml (overrides target specified in eval file)
124
- - `--targets TARGETS`: Path to targets.yaml file (default: ./.agentv/targets.yaml)
125
- - `--eval-id EVAL_ID`: Run only the eval case with this specific ID
126
- - `--out OUTPUT_FILE`: Output file path (default: .agentv/results/eval_<timestamp>.jsonl)
127
- - `--output-format FORMAT`: Output format: 'jsonl' or 'yaml' (default: jsonl)
128
- - `--dry-run`: Run with mock model for testing
129
- - `--agent-timeout SECONDS`: Timeout in seconds for agent response polling (default: 120)
130
- - `--max-retries COUNT`: Maximum number of retries for timeout cases (default: 2)
131
- - `--cache`: Enable caching of LLM responses (default: disabled)
132
- - `--dump-prompts`: Save all prompts to `.agentv/prompts/` directory
133
- - `--dump-traces`: Write trace files to `.agentv/traces/` directory
134
- - `--include-trace`: Include full trace in result output (verbose)
135
- - `--workers COUNT`: Parallel workers for eval cases (default: 3; target `workers` setting used when provided)
136
- - `--verbose`: Verbose output
137
-
138
- ### Target Selection Priority
139
-
140
- The CLI determines which execution target to use with the following precedence:
141
-
142
- 1. CLI flag override: `--target my_target` (when provided and not 'default')
143
- 2. Eval file specification: `target: my_target` key in the .eval.yaml file
144
- 3. Default fallback: Uses the 'default' target (original behavior)
145
-
146
- This allows eval files to specify their preferred target while still allowing command-line overrides for flexibility, and maintains backward compatibility with existing workflows.
147
-
148
- Output goes to `.agentv/results/eval_<timestamp>.jsonl` (or `.yaml`) unless `--out` is provided.
149
-
150
- ### Tips for VS Code Copilot Evals
151
-
152
- **Workspace Switching:** The runner automatically switches to the target workspace when running evals. Make sure you're not actively using another VS Code instance, as this could cause prompts to be injected into the wrong workspace.
153
-
154
- **Recommended Models:** Use Claude Sonnet 4.5 or Grok Code Fast 1 for best results, as these models are more consistent in following instruction chains.
155
-
156
- ## Targets and Environment Variables
157
-
158
- Execution targets in `.agentv/targets.yaml` decouple evals from providers/settings and provide flexible environment variable mapping.
159
-
160
- ### Target Configuration Structure
161
-
162
- Each target specifies:
163
-
164
- - `name`: Unique identifier for the target
165
- - `provider`: The model provider (`azure`, `anthropic`, `gemini`, `codex`, `vscode`, `vscode-insiders`, `cli`, or `mock`)
166
- - Provider-specific configuration fields at the top level (no `settings` wrapper needed)
167
- - Optional fields: `judge_target`, `workers`, `provider_batching`
168
-
169
- ### Examples
170
-
171
- **Azure OpenAI targets:**
172
-
173
- ```yaml
174
- - name: azure_base
175
- provider: azure
176
- endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
177
- api_key: ${{ AZURE_OPENAI_API_KEY }}
178
- model: ${{ AZURE_DEPLOYMENT_NAME }}
179
- version: ${{ AZURE_OPENAI_API_VERSION }} # Optional: defaults to 2024-12-01-preview
180
- ```
181
-
182
- Note: Environment variables are referenced using `${{ VARIABLE_NAME }}` syntax. The actual values are resolved from your `.env` file at runtime.
183
-
184
- **VS Code targets:**
185
-
186
- ```yaml
187
- - name: vscode_projectx
188
- provider: vscode
189
- workspace_template: ${{ PROJECTX_WORKSPACE_PATH }}
190
- provider_batching: false
191
- judge_target: azure_base
192
-
193
- - name: vscode_insiders_projectx
194
- provider: vscode-insiders
195
- workspace_template: ${{ PROJECTX_WORKSPACE_PATH }}
196
- provider_batching: false
197
- judge_target: azure_base
198
- ```
199
-
200
- **CLI targets (template-based):**
201
-
202
- ```yaml
203
- - name: local_cli
204
- provider: cli
205
- judge_target: azure_base
206
- command_template: 'uv run ./my_agent.py --prompt {PROMPT} {FILES}'
207
- files_format: '--file {path}'
208
- cwd: ${{ CLI_EVALS_DIR }} # optional working directory
209
- timeout_seconds: 30 # optional per-command timeout
210
- healthcheck:
211
- type: command # or http
212
- command_template: uv run ./my_agent.py --healthcheck
213
- ```
214
-
215
- **Supported placeholders in CLI commands:**
216
- - `{PROMPT}` - The rendered prompt text (shell-escaped)
217
- - `{FILES}` - Expands to multiple file arguments using `files_format` template
218
- - `{GUIDELINES}` - Guidelines content
219
- - `{EVAL_ID}` - Current eval case ID
220
- - `{ATTEMPT}` - Retry attempt number
221
- - `{OUTPUT_FILE}` - Path to output file (for agents that write responses to disk)
222
-
223
- **Codex CLI targets:**
224
-
225
- ```yaml
226
- - name: codex_cli
227
- provider: codex
228
- judge_target: azure_base
229
- executable: ${{ CODEX_CLI_PATH }} # defaults to `codex` if omitted
230
- args: # optional CLI arguments
231
- - --profile
232
- - ${{ CODEX_PROFILE }}
233
- - --model
234
- - ${{ CODEX_MODEL }}
235
- timeout_seconds: 180
236
- cwd: ${{ CODEX_WORKSPACE_DIR }}
237
- log_format: json # 'summary' or 'json'
238
- ```
239
-
240
- Codex targets require the standalone `codex` CLI and a configured profile (via `codex configure`) so credentials are stored in `~/.codex/config` (or whatever path the CLI already uses). AgentV mirrors all guideline and attachment files into a fresh scratch workspace, so the `file://` preread links remain valid even when the CLI runs outside your repo tree.
241
- Confirm the CLI works by running `codex exec --json --profile <name> "ping"` (or any supported dry run) before starting an eval. This prints JSONL events; seeing `item.completed` messages indicates the CLI is healthy.
242
-
243
- ## Writing Custom Evaluators
244
-
245
- ### Code Evaluator I/O Contract
246
-
247
- Code evaluators receive input via stdin and write output to stdout as JSON.
248
-
249
- **Input Format (via stdin):**
250
- ```json
251
- {
252
- "question": "string describing the task/question",
253
- "expected_outcome": "expected outcome description",
254
- "reference_answer": "gold standard answer (optional)",
255
- "candidate_answer": "generated code/text from the agent",
256
- "guideline_files": ["path/to/guideline1.md", "path/to/guideline2.md"],
257
- "input_files": ["path/to/data.json", "path/to/config.yaml"],
258
- "input_messages": [{"role": "user", "content": "..."}]
259
- }
260
- ```
261
-
262
- **Output Format (to stdout):**
263
- ```json
264
- {
265
- "score": 0.85,
266
- "hits": ["list of successful checks"],
267
- "misses": ["list of failed checks"],
268
- "reasoning": "explanation of the score"
269
- }
270
- ```
271
-
272
- **Key Points:**
273
- - Evaluators receive **full context** but should select only relevant fields
274
- - Most evaluators only need `candidate_answer` field - ignore the rest to avoid false positives
275
- - Complex evaluators can use `question`, `reference_answer`, or `guideline_paths` for context-aware validation
276
- - Score range: `0.0` to `1.0` (float)
277
- - `hits` and `misses` are optional but recommended for debugging
278
-
279
- ### Code Evaluator Script Template
280
-
281
- ```python
282
- #!/usr/bin/env python3
283
- import json
284
- import sys
285
-
286
- def evaluate(input_data):
287
- # Extract only the fields you need
288
- candidate_answer = input_data.get("candidate_answer", "")
289
-
290
- # Your validation logic here
291
- score = 0.0 # to 1.0
292
- hits = ["successful check 1", "successful check 2"]
293
- misses = ["failed check 1"]
294
- reasoning = "Explanation of score"
295
-
296
- return {
297
- "score": score,
298
- "hits": hits,
299
- "misses": misses,
300
- "reasoning": reasoning
301
- }
302
-
303
- if __name__ == "__main__":
304
- try:
305
- input_data = json.loads(sys.stdin.read())
306
- result = evaluate(input_data)
307
- print(json.dumps(result, indent=2))
308
- except Exception as e:
309
- error_result = {
310
- "score": 0.0,
311
- "hits": [],
312
- "misses": [f"Evaluator error: {str(e)}"],
313
- "reasoning": f"Evaluator error: {str(e)}"
314
- }
315
- print(json.dumps(error_result, indent=2))
316
- sys.exit(1)
317
- ```
318
-
319
- ### LLM Judge Template Structure
320
-
321
- ```markdown
322
- # Judge Name
323
-
324
- Evaluation criteria and guidelines...
325
-
326
- ## Scoring Guidelines
327
- 0.9-1.0: Excellent
328
- 0.7-0.8: Good
329
- ...
330
-
331
- ## Output Format
332
- {
333
- "score": 0.85,
334
- "passed": true,
335
- "reasoning": "..."
336
- }
337
- ```
338
-
339
- ## Rubric-Based Evaluation
340
-
341
- AgentV supports structured evaluation through rubrics - lists of criteria that define what makes a good response. Rubrics are checked by an LLM judge and scored based on weights and requirements.
342
-
343
- ### Basic Usage
344
-
345
- Define rubrics inline using simple strings:
346
-
347
- ```yaml
348
- - id: example-1
349
- expected_outcome: Explain quicksort algorithm
350
- rubrics:
351
- - Mentions divide-and-conquer approach
352
- - Explains the partition step
353
- - States time complexity correctly
354
- ```
355
-
356
- Or use detailed objects for fine-grained control:
357
-
358
- ```yaml
359
- rubrics:
360
- - id: structure
361
- description: Has clear headings and organization
362
- weight: 1.0
363
- required: true
364
- - id: examples
365
- description: Includes practical examples
366
- weight: 0.5
367
- required: false
368
- ```
369
-
370
- ### Generate Rubrics
371
-
372
- Automatically generate rubrics from `expected_outcome` fields:
373
-
374
- ```bash
375
- # Generate rubrics for all eval cases without rubrics
376
- agentv generate rubrics evals/my-eval.yaml
377
-
378
- # Use a specific LLM target for generation
379
- agentv generate rubrics evals/my-eval.yaml --target openai:gpt-4o
380
- ```
381
-
382
- ### Scoring and Verdicts
383
-
384
- - **Score**: (sum of satisfied weights) / (total weights)
385
- - **Verdicts**:
386
- - `pass`: Score ≥ 0.8 and all required rubrics met
387
- - `borderline`: Score ≥ 0.6 and all required rubrics met
388
- - `fail`: Score < 0.6 or any required rubric failed
389
-
390
- For complete examples and detailed patterns, see [examples/features/evals/rubric/](examples/features/evals/rubric/).
391
-
392
- ## Advanced Configuration
393
-
394
- ### Retry Configuration
395
-
396
- AgentV supports automatic retry with exponential backoff for handling rate limiting (HTTP 429) and transient errors. All retry configuration fields are optional and work with Azure, Anthropic, and Gemini providers.
397
-
398
- **Available retry fields:**
399
-
400
- | Field | Type | Default | Description |
401
- |-------|------|---------|-------------|
402
- | `max_retries` | number | 3 | Maximum number of retry attempts |
403
- | `retry_initial_delay_ms` | number | 1000 | Initial delay in milliseconds before first retry |
404
- | `retry_max_delay_ms` | number | 60000 | Maximum delay cap in milliseconds |
405
- | `retry_backoff_factor` | number | 2 | Exponential backoff multiplier |
406
- | `retry_status_codes` | number[] | [500, 408, 429, 502, 503, 504] | HTTP status codes to retry |
407
-
408
- **Example configuration:**
409
-
410
- ```yaml
411
- $schema: agentv-targets-v2.2
412
-
413
- targets:
414
- - name: azure_base
415
- provider: azure
416
- endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
417
- api_key: ${{ AZURE_OPENAI_API_KEY }}
418
- model: gpt-4
419
- version: ${{ AZURE_OPENAI_API_VERSION }} # Optional: API version (defaults to 2024-12-01-preview)
420
- max_retries: 5 # Maximum retry attempts
421
- retry_initial_delay_ms: 2000 # Initial delay before first retry
422
- retry_max_delay_ms: 120000 # Maximum delay cap
423
- retry_backoff_factor: 2 # Exponential backoff multiplier
424
- retry_status_codes: [500, 408, 429, 502, 503, 504] # HTTP status codes to retry
425
- ```
426
-
427
- **Retry behavior:**
428
- - Exponential backoff with jitter (0.75-1.25x) to avoid thundering herd
429
- - Automatically retries on HTTP 429 (rate limiting), 5xx errors, and network failures
430
- - Respects abort signals for cancellation
431
- - If no retry config is specified, uses sensible defaults
432
-
433
- ## Related Projects
434
-
435
- - [subagent](https://github.com/EntityProcess/subagent) - VS Code Copilot programmatic interface
436
- - [ai-sdk](https://github.com/vercel/ai) - Vercel AI SDK
437
- - [Agentic Context Engineering (ACE)](https://github.com/ax-llm/ax/blob/main/docs/ACE.md)
438
-
439
- ## License
440
-
441
- MIT License - see [LICENSE](LICENSE) for details.
1
+ # AgentV
2
+
3
+ A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, OpenAI Codex CLI, and Azure OpenAI.
4
+
5
+ ## Installation and Setup
6
+
7
+ ### Installation for End Users
8
+
9
+ This is the recommended method for users who want to use `agentv` as a command-line tool.
10
+
11
+ 1. Install via npm:
12
+
13
+ ```bash
14
+ # Install globally
15
+ npm install -g agentv
16
+
17
+ # Or use npx to run without installing
18
+ npx agentv --help
19
+ ```
20
+
21
+ 2. Verify the installation:
22
+
23
+ ```bash
24
+ agentv --help
25
+ ```
26
+
27
+ ### Local Development Setup
28
+
29
+ Follow these steps if you want to contribute to the `agentv` project itself. This workflow uses Bun workspaces for fast, efficient dependency management.
30
+
31
+ 1. Clone the repository and navigate into it:
32
+
33
+ ```bash
34
+ git clone https://github.com/EntityProcess/agentv.git
35
+ cd agentv
36
+ ```
37
+
38
+ 2. Install dependencies:
39
+
40
+ ```bash
41
+ # Install Bun if you don't have it
42
+ curl -fsSL https://bun.sh/install | bash # macOS/Linux
43
+ # or
44
+ powershell -c "irm bun.sh/install.ps1 | iex" # Windows
45
+
46
+ # Install all workspace dependencies
47
+ bun install
48
+ ```
49
+
50
+ 3. Build the project:
51
+
52
+ ```bash
53
+ bun run build
54
+ ```
55
+
56
+ 4. Run tests:
57
+
58
+ ```bash
59
+ bun test
60
+ ```
61
+
62
+ You are now ready to start development. The monorepo contains:
63
+
64
+ - `packages/core/` - Core evaluation engine
65
+ - `apps/cli/` - Command-line interface
66
+
67
+ ### Environment Setup
68
+
69
+ 1. Initialize your workspace:
70
+    - Run `agentv init` at the root of your repository
71
+    - This command automatically sets up the `.agentv/` directory structure and configuration files
72
+
73
+ 2. Configure environment variables:
74
+    - The init command creates a `.env.template` file in your project root
75
+    - Copy `.env.template` to `.env` and fill in your API keys, endpoints, and other configuration values
76
+    - Update the environment variable names in `.agentv/targets.yaml` to match those defined in your `.env` file
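+
+ For example, assuming the Azure OpenAI variable names used elsewhere in this README, a minimal `.env` might look like this (the values are illustrative placeholders, not real credentials):
+
+ ```bash
+ # .env (loaded at runtime; referenced from .agentv/targets.yaml via ${{ ... }})
+ AZURE_OPENAI_ENDPOINT=https://my-resource.openai.azure.com
+ AZURE_OPENAI_API_KEY=replace-with-your-key
+ AZURE_DEPLOYMENT_NAME=my-gpt-4o-deployment
+ AZURE_OPENAI_API_VERSION=2024-12-01-preview
+ ```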
77
+
78
+ ## Quick Start
79
+
80
+ You can use the following examples as a starting point.
81
+ - [Simple Example](docs/examples/simple/README.md): A minimal working example to help you get started fast.
82
+ - [Showcase](docs/examples/showcase/README.md): A collection of advanced use cases and real-world agent evaluation scenarios.
83
+
84
+ ### Validating Eval Files
85
+
86
+ Validate your eval and targets files before running them:
87
+
88
+ ```bash
89
+ # Validate a single file
90
+ agentv validate evals/my-eval.yaml
91
+
92
+ # Validate multiple files
93
+ agentv validate evals/eval1.yaml evals/eval2.yaml
94
+
95
+ # Validate entire directory (recursively finds all YAML files)
96
+ agentv validate evals/
97
+ ```
98
+
99
+ ### Running Evals
100
+
101
+ Run eval (target auto-selected from eval file or CLI override):
102
+
103
+ ```bash
104
+ # If your eval.yaml contains "target: azure_base", it will be used automatically
105
+ agentv eval "path/to/eval.yaml"
106
+
107
+ # Override the eval file's target with CLI flag
108
+ agentv eval --target vscode_projectx "path/to/eval.yaml"
109
+
110
+ # Run multiple evals via glob
111
+ agentv eval "path/to/evals/**/*.yaml"
112
+ ```
113
+
114
+ Run a specific eval case with custom targets path:
115
+
116
+ ```bash
117
+ agentv eval --target vscode_projectx --targets "path/to/targets.yaml" --eval-id "my-eval-case" "path/to/eval.yaml"
118
+ ```
119
+
120
+ ### Command Line Options
121
+
122
+ - `eval_paths...`: Path(s) or glob(s) to eval YAML files (required; e.g., `evals/**/*.yaml`)
123
+ - `--target TARGET`: Execution target name from targets.yaml (overrides target specified in eval file)
124
+ - `--targets TARGETS`: Path to targets.yaml file (default: ./.agentv/targets.yaml)
125
+ - `--eval-id EVAL_ID`: Run only the eval case with this specific ID
126
+ - `--out OUTPUT_FILE`: Output file path (default: .agentv/results/eval_<timestamp>.jsonl)
127
+ - `--output-format FORMAT`: Output format: 'jsonl' or 'yaml' (default: jsonl)
128
+ - `--dry-run`: Run with mock model for testing
129
+ - `--agent-timeout SECONDS`: Timeout in seconds for agent response polling (default: 120)
130
+ - `--max-retries COUNT`: Maximum number of retries for timeout cases (default: 2)
131
+ - `--cache`: Enable caching of LLM responses (default: disabled)
132
+ - `--dump-prompts`: Save all prompts to `.agentv/prompts/` directory
133
+ - `--dump-traces`: Write trace files to `.agentv/traces/` directory
134
+ - `--include-trace`: Include full trace in result output (verbose)
135
+ - `--workers COUNT`: Parallel workers for eval cases (default: 3; target `workers` setting used when provided)
136
+ - `--verbose`: Verbose output
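+
+ For example, a single run that fans out over a glob, caches LLM responses, raises parallelism, and writes YAML results to a custom location might look like this (the output path is illustrative):
+
+ ```bash
+ agentv eval --workers 5 --cache --output-format yaml --out .agentv/results/nightly.yaml "evals/**/*.yaml"
+ ```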
137
+
138
+ ### Target Selection Priority
139
+
140
+ The CLI determines which execution target to use with the following precedence:
141
+
142
+ 1. CLI flag override: `--target my_target` (when provided and not 'default')
143
+ 2. Eval file specification: `target: my_target` key in the .eval.yaml file
144
+ 3. Default fallback: Uses the 'default' target (original behavior)
145
+
146
+ This lets eval files specify their preferred target while the command line can still override it, and it keeps existing workflows backward compatible.
147
+
148
+ Output goes to `.agentv/results/eval_<timestamp>.jsonl` (or `.yaml`) unless `--out` is provided.
149
+
150
+ ### Tips for VS Code Copilot Evals
151
+
152
+ **Workspace Switching:** The runner automatically switches to the target workspace when running evals. Make sure you're not actively using another VS Code instance, as this could cause prompts to be injected into the wrong workspace.
153
+
154
+ **Recommended Models:** Use Claude Sonnet 4.5 or Grok Code Fast 1 for best results, as these models are more consistent in following instruction chains.
155
+
156
+ ## Targets and Environment Variables
157
+
158
+ Execution targets in `.agentv/targets.yaml` decouple evals from providers/settings and provide flexible environment variable mapping.
159
+
160
+ ### Target Configuration Structure
161
+
162
+ Each target specifies:
163
+
164
+ - `name`: Unique identifier for the target
165
+ - `provider`: The model provider (`azure`, `anthropic`, `gemini`, `codex`, `vscode`, `vscode-insiders`, `cli`, or `mock`)
166
+ - Provider-specific configuration fields at the top level (no `settings` wrapper needed)
167
+ - Optional fields: `judge_target`, `workers`, `provider_batching`
168
+
169
+ ### Examples
170
+
171
+ **Azure OpenAI targets:**
172
+
173
+ ```yaml
174
+ - name: azure_base
175
+   provider: azure
176
+   endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
177
+   api_key: ${{ AZURE_OPENAI_API_KEY }}
178
+   model: ${{ AZURE_DEPLOYMENT_NAME }}
179
+   version: ${{ AZURE_OPENAI_API_VERSION }} # Optional: defaults to 2024-12-01-preview
180
+ ```
181
+
182
+ Note: Environment variables are referenced using `${{ VARIABLE_NAME }}` syntax. The actual values are resolved from your `.env` file at runtime.
183
+
184
+ **VS Code targets:**
185
+
186
+ ```yaml
187
+ - name: vscode_projectx
188
+   provider: vscode
189
+   workspace_template: ${{ PROJECTX_WORKSPACE_PATH }}
190
+   provider_batching: false
191
+   judge_target: azure_base
192
+
193
+ - name: vscode_insiders_projectx
194
+   provider: vscode-insiders
195
+   workspace_template: ${{ PROJECTX_WORKSPACE_PATH }}
196
+   provider_batching: false
197
+   judge_target: azure_base
198
+ ```
199
+
200
+ **CLI targets (template-based):**
201
+
202
+ ```yaml
203
+ - name: local_cli
204
+   provider: cli
205
+   judge_target: azure_base
206
+   command_template: 'uv run ./my_agent.py --prompt {PROMPT} {FILES}'
207
+   files_format: '--file {path}'
208
+   cwd: ${{ CLI_EVALS_DIR }} # optional working directory
209
+   timeout_seconds: 30 # optional per-command timeout
210
+   healthcheck:
211
+     type: command # or http
212
+     command_template: uv run ./my_agent.py --healthcheck
213
+ ```
214
+
215
+ **Supported placeholders in CLI commands:**
216
+ - `{PROMPT}` - The rendered prompt text (shell-escaped)
217
+ - `{FILES}` - Expands to multiple file arguments using `files_format` template
218
+ - `{GUIDELINES}` - Guidelines content
219
+ - `{EVAL_ID}` - Current eval case ID
220
+ - `{ATTEMPT}` - Retry attempt number
221
+ - `{OUTPUT_FILE}` - Path to output file (for agents that write responses to disk)
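+
+ As a sketch of how these placeholders expand: with the `local_cli` target above and an eval case that attaches two hypothetical files, `data.json` and `config.yaml`, the executed command would look roughly like this:
+
+ ```bash
+ uv run ./my_agent.py --prompt '<rendered prompt text>' --file data.json --file config.yaml
+ ```
+
+ Each attached file is rendered through the `files_format` template (`--file {path}`), and the results are joined in place of `{FILES}`.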
222
+
223
+ **Codex CLI targets:**
224
+
225
+ ```yaml
226
+ - name: codex_cli
227
+   provider: codex
228
+   judge_target: azure_base
229
+   executable: ${{ CODEX_CLI_PATH }} # defaults to `codex` if omitted
230
+   args: # optional CLI arguments
231
+     - --profile
232
+     - ${{ CODEX_PROFILE }}
233
+     - --model
234
+     - ${{ CODEX_MODEL }}
235
+   timeout_seconds: 180
236
+   cwd: ${{ CODEX_WORKSPACE_DIR }}
237
+   log_format: json # 'summary' or 'json'
238
+ ```
239
+
240
+ Codex targets require the standalone `codex` CLI and a configured profile (via `codex configure`) so credentials are stored in `~/.codex/config` (or whatever path the CLI already uses). AgentV mirrors all guideline and attachment files into a fresh scratch workspace, so the `file://` preread links remain valid even when the CLI runs outside your repo tree.
241
+ Confirm the CLI works by running `codex exec --json --profile <name> "ping"` (or any supported dry run) before starting an eval. This prints JSONL events; seeing `item.completed` messages indicates the CLI is healthy.
242
+
243
+ ## Writing Custom Evaluators
244
+
245
+ ### Code Evaluator I/O Contract
246
+
247
+ Code evaluators receive input via stdin and write output to stdout as JSON.
248
+
249
+ **Input Format (via stdin):**
250
+ ```json
251
+ {
252
+ "question": "string describing the task/question",
253
+ "expected_outcome": "expected outcome description",
254
+ "reference_answer": "gold standard answer (optional)",
255
+ "candidate_answer": "generated code/text from the agent",
256
+ "guideline_files": ["path/to/guideline1.md", "path/to/guideline2.md"],
257
+ "input_files": ["path/to/data.json", "path/to/config.yaml"],
258
+ "input_messages": [{"role": "user", "content": "..."}]
259
+ }
260
+ ```
261
+
262
+ **Output Format (to stdout):**
263
+ ```json
264
+ {
265
+ "score": 0.85,
266
+ "hits": ["list of successful checks"],
267
+ "misses": ["list of failed checks"],
268
+ "reasoning": "explanation of the score"
269
+ }
270
+ ```
271
+
272
+ **Key Points:**
273
+ - Evaluators receive **full context** but should select only relevant fields
274
+ - Most evaluators only need the `candidate_answer` field; ignore the rest to avoid false positives
275
+ - Complex evaluators can use `question`, `reference_answer`, or `guideline_files` for context-aware validation
276
+ - Score range: `0.0` to `1.0` (float)
277
+ - `hits` and `misses` are optional but recommended for debugging
278
+
279
+ ### Code Evaluator Script Template
280
+
281
+ ```python
282
+ #!/usr/bin/env python3
283
+ import json
284
+ import sys
285
+
286
+ def evaluate(input_data):
287
+     # Extract only the fields you need
288
+     candidate_answer = input_data.get("candidate_answer", "")
289
+
290
+     # Your validation logic here
291
+     score = 0.0 # to 1.0
292
+     hits = ["successful check 1", "successful check 2"]
293
+     misses = ["failed check 1"]
294
+     reasoning = "Explanation of score"
295
+
296
+     return {
297
+         "score": score,
298
+         "hits": hits,
299
+         "misses": misses,
300
+         "reasoning": reasoning
301
+     }
302
+
303
+ if __name__ == "__main__":
304
+     try:
305
+         input_data = json.loads(sys.stdin.read())
306
+         result = evaluate(input_data)
307
+         print(json.dumps(result, indent=2))
308
+     except Exception as e:
309
+         error_result = {
310
+             "score": 0.0,
311
+             "hits": [],
312
+             "misses": [f"Evaluator error: {str(e)}"],
313
+             "reasoning": f"Evaluator error: {str(e)}"
314
+         }
315
+         print(json.dumps(error_result, indent=2))
316
+         sys.exit(1)
317
+ ```
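+
+ Because the contract is plain JSON over stdin/stdout, you can smoke-test an evaluator from the shell before wiring it into an eval. Assuming the template above is saved as `my_evaluator.py`:
+
+ ```bash
+ echo '{"candidate_answer": "def quicksort(arr): ..."}' | python3 my_evaluator.py
+ ```
+
+ The printed JSON should contain a `score` between 0.0 and 1.0, plus the optional `hits`, `misses`, and `reasoning` fields.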
318
+
319
+ ### LLM Judge Template Structure
320
+
321
+ ```markdown
322
+ # Judge Name
323
+
324
+ Evaluation criteria and guidelines...
325
+
326
+ ## Scoring Guidelines
327
+ 0.9-1.0: Excellent
328
+ 0.7-0.8: Good
329
+ ...
330
+
331
+ ## Output Format
332
+ {
333
+ "score": 0.85,
334
+ "passed": true,
335
+ "reasoning": "..."
336
+ }
337
+ ```
338
+
339
+ ## Rubric-Based Evaluation
340
+
341
+ AgentV supports structured evaluation through rubrics: lists of criteria that define what makes a good response. Rubrics are checked by an LLM judge and scored based on weights and requirements.
342
+
343
+ ### Basic Usage
344
+
345
+ Define rubrics inline using simple strings:
346
+
347
+ ```yaml
348
+ - id: example-1
349
+   expected_outcome: Explain quicksort algorithm
350
+   rubrics:
351
+     - Mentions divide-and-conquer approach
352
+     - Explains the partition step
353
+     - States time complexity correctly
354
+ ```
355
+
356
+ Or use detailed objects for fine-grained control:
357
+
358
+ ```yaml
359
+ rubrics:
360
+   - id: structure
361
+     description: Has clear headings and organization
362
+     weight: 1.0
363
+     required: true
364
+   - id: examples
365
+     description: Includes practical examples
366
+     weight: 0.5
367
+     required: false
368
+ ```
369
+
370
+ ### Generate Rubrics
371
+
372
+ Automatically generate rubrics from `expected_outcome` fields:
373
+
374
+ ```bash
375
+ # Generate rubrics for all eval cases without rubrics
376
+ agentv generate rubrics evals/my-eval.yaml
377
+
378
+ # Use a specific LLM target for generation
379
+ agentv generate rubrics evals/my-eval.yaml --target openai:gpt-4o
380
+ ```
381
+
382
+ ### Scoring and Verdicts
383
+
384
+ - **Score**: (sum of satisfied weights) / (total weights)
385
+ - **Verdicts**:
386
+   - `pass`: Score ≥ 0.8 and all required rubrics met
387
+   - `borderline`: Score ≥ 0.6 and all required rubrics met
388
+   - `fail`: Score < 0.6 or any required rubric failed
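+
+ For example, with the two weighted rubrics shown above (`structure`, weight 1.0, required; `examples`, weight 0.5, optional): if only `structure` is satisfied, the score is 1.0 / 1.5 ≈ 0.67 and the verdict is `borderline`; if both are satisfied, the score is 1.0 and the verdict is `pass`; and if `structure` fails, the verdict is `fail` regardless of the total score.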
389
+
390
+ For complete examples and detailed patterns, see [examples/features/evals/rubric/](examples/features/evals/rubric/).
391
+
392
+ ## Advanced Configuration
393
+
394
+ ### Retry Configuration
395
+
396
+ AgentV supports automatic retry with exponential backoff for handling rate limiting (HTTP 429) and transient errors. All retry configuration fields are optional and work with Azure, Anthropic, and Gemini providers.
397
+
398
+ **Available retry fields:**
399
+
400
+ | Field | Type | Default | Description |
401
+ |-------|------|---------|-------------|
402
+ | `max_retries` | number | 3 | Maximum number of retry attempts |
403
+ | `retry_initial_delay_ms` | number | 1000 | Initial delay in milliseconds before first retry |
404
+ | `retry_max_delay_ms` | number | 60000 | Maximum delay cap in milliseconds |
405
+ | `retry_backoff_factor` | number | 2 | Exponential backoff multiplier |
406
+ | `retry_status_codes` | number[] | [500, 408, 429, 502, 503, 504] | HTTP status codes to retry |
407
+
408
+ **Example configuration:**
409
+
410
+ ```yaml
411
+ targets:
412
+   - name: azure_base
413
+     provider: azure
414
+     endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
415
+     api_key: ${{ AZURE_OPENAI_API_KEY }}
416
+     model: gpt-4
417
+     version: ${{ AZURE_OPENAI_API_VERSION }} # Optional: API version (defaults to 2024-12-01-preview)
418
+     max_retries: 5 # Maximum retry attempts
419
+     retry_initial_delay_ms: 2000 # Initial delay before first retry
420
+     retry_max_delay_ms: 120000 # Maximum delay cap
421
+     retry_backoff_factor: 2 # Exponential backoff multiplier
422
+     retry_status_codes: [500, 408, 429, 502, 503, 504] # HTTP status codes to retry
423
+ ```
424
+
425
+ **Retry behavior:**
426
+ - Exponential backoff with jitter (0.75-1.25x) to avoid a thundering herd
427
+ - Automatically retries on HTTP 429 (rate limiting), 5xx errors, and network failures
428
+ - Respects abort signals for cancellation
429
+ - If no retry config is specified, uses sensible defaults
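+
+ The delay schedule these fields imply can be sketched as follows. This is an illustration of the documented behavior, not the runner's actual implementation:
+
+ ```python
+ import random
+
+ def retry_delays(max_retries=3, initial_ms=1000, max_ms=60000, factor=2):
+     """Yield backoff delays: exponential growth, capped at max_ms, with 0.75-1.25x jitter."""
+     for attempt in range(max_retries):
+         base = min(initial_ms * factor**attempt, max_ms)
+         yield base * random.uniform(0.75, 1.25)
+
+ # With the defaults above: roughly 1s, 2s, and 4s, each varied by +/-25%
+ print([round(d) for d in retry_delays()])
+ ```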
430
+
431
+ ## Related Projects
432
+
433
+ - [subagent](https://github.com/EntityProcess/subagent) - VS Code Copilot programmatic interface
434
+ - [ai-sdk](https://github.com/vercel/ai) - Vercel AI SDK
435
+ - [Agentic Context Engineering (ACE)](https://github.com/ax-llm/ax/blob/main/docs/ACE.md)
436
+
437
+ ## License
438
+
439
+ MIT License - see [LICENSE](LICENSE) for details.