agentv 0.2.8 → 0.5.0

package/README.md CHANGED
@@ -1,6 +1,6 @@
  # AgentV

- A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, Azure OpenAI, Anthropic, and Google Gemini.
+ A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, OpenAI Codex CLI, and Azure OpenAI.

  ## Installation and Setup

@@ -76,7 +76,7 @@ You are now ready to start development. The monorepo contains:

  ### Configuring Guideline Patterns

- AgentV automatically detects guideline files (instructions, prompts) and treats them differently from regular file content. You can customize which files are considered guidelines using an optional `.agentv/config.yaml` configuration file.
+ AgentV automatically detects guideline files and treats them differently from regular file content. You can customize which files are considered guidelines using an optional `.agentv/config.yaml` configuration file.

  **Config file discovery:**
  - AgentV searches for `.agentv/config.yaml` starting from the eval file's directory
@@ -84,16 +84,6 @@ AgentV automatically detects guideline files (instructions, prompts) and treats
  - Uses the first config file found (similar to how `targets.yaml` is discovered)
  - This allows you to place one config file at the project root for all evals
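
For instance, a single root-level config can cover every eval in a layout like this (paths are illustrative):

```
repo/
├── .agentv/
│   └── config.yaml      # first config found when searching up from the eval file
└── evals/
    └── projectx/
        └── example.yaml
```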

- **Default patterns** (used when `.agentv/config.yaml` is absent):
-
- ```yaml
- guideline_patterns:
- - "**/*.instructions.md"
- - "**/instructions/**"
- - "**/*.prompt.md"
- - "**/prompts/**"
- ```
-
  **Custom patterns** (create `.agentv/config.yaml` in the same directory as your eval file):

  ```yaml
@@ -105,13 +95,6 @@ guideline_patterns:
  - "**/*.rules.md" # Match by naming convention
  ```

- **How it works:**
-
- - Files matching guideline patterns are loaded as separate guideline context
- - Files NOT matching are treated as regular file content in user messages
- - Patterns use standard glob syntax (via [micromatch](https://github.com/micromatch/micromatch))
- - Paths are normalized to forward slashes for cross-platform compatibility
-
  See [config.yaml example](docs/examples/simple/.agentv/config.yaml) for more pattern examples.

  ### Validating Eval Files
@@ -200,18 +183,6 @@ Output goes to `.agentv/results/{evalname}_{timestamp}.jsonl` (or `.yaml`) unles

  **Recommended Models:** Use Claude Sonnet 4.5 or Grok Code Fast 1 for best results, as these models are more consistent in following instruction chains.

- ## Requirements
-
- - Node.js 20.0.0 or higher
- - Environment variables for your chosen providers (configured via targets.yaml)
-
- Environment keys (configured via targets.yaml):
-
- - **Azure OpenAI:** Set environment variables specified in your target's `settings.endpoint`, `settings.api_key`, and `settings.model`
- - **Anthropic Claude:** Set environment variables specified in your target's `settings.api_key` and `settings.model`
- - **Google Gemini:** Set environment variables specified in your target's `settings.api_key` and optional `settings.model`
- - **VS Code:** Set environment variable specified in your target's `settings.workspace_env` → `.code-workspace` path
-
  ## Targets and Environment Variables

  Execution targets in `.agentv/targets.yaml` decouple evals from providers/settings and provide flexible environment variable mapping.
@@ -221,7 +192,7 @@ Execution targets in `.agentv/targets.yaml` decouple evals from providers/settin
  Each target specifies:

  - `name`: Unique identifier for the target
- - `provider`: The model provider (`azure`, `anthropic`, `gemini`, `vscode`, `vscode-insiders`, or `mock`)
+ - `provider`: The model provider (`azure`, `anthropic`, `gemini`, `codex`, `vscode`, `vscode-insiders`, `cli`, or `mock`)
  - `settings`: Environment variable names to use for this target

  ### Examples
@@ -237,40 +208,54 @@ Each target specifies:
      model: "AZURE_DEPLOYMENT_NAME"
  ```
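
Note that the values under `settings` are environment variable names, not the secrets themselves; AgentV resolves the actual values from the environment at run time. A minimal sketch of a run (the target name `azure_base` and the second variable name are illustrative):

```bash
# targets.yaml stores variable NAMES; export the real values before running
export AZURE_DEPLOYMENT_NAME="gpt-4o"        # referenced by `model` above
export AZURE_OPENAI_API_KEY="..."            # illustrative variable name
agentv eval evals/projectx/example.yaml --target azure_base
```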

- **Anthropic targets:**
+ **VS Code targets:**

  ```yaml
- - name: anthropic_base
-   provider: anthropic
+ - name: vscode_projectx
+   provider: vscode
    settings:
-     api_key: "ANTHROPIC_API_KEY"
-     model: "ANTHROPIC_MODEL"
+     workspace_env: "EVAL_PROJECTX_WORKSPACE_PATH"
+
+ - name: vscode_insiders_projectx
+   provider: vscode-insiders
+   settings:
+     workspace_env: "EVAL_PROJECTX_WORKSPACE_PATH"
  ```

- **Google Gemini targets:**
+ **CLI targets (template-based):**

  ```yaml
- - name: gemini_base
-   provider: gemini
+ - name: local_cli
+   provider: cli
    settings:
-     api_key: "GOOGLE_API_KEY"
-     model: "GOOGLE_GEMINI_MODEL" # Optional, defaults to gemini-2.0-flash-exp
+     command_template: 'somecommand {PROMPT} {FILES}'
+     files_format: '--file {path}'
+     cwd: PROJECT_ROOT # optional working directory
+     env: # merged into process.env
+       API_TOKEN: LOCAL_AGENT_TOKEN
+     timeout_seconds: 30 # optional per-command timeout
+     healthcheck:
+       type: command # or http
+       command_template: code --version
  ```
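
As a rough illustration of the template mechanics, assuming `{PROMPT}` is replaced with the rendered prompt and `{FILES}` expands by applying `files_format` to each attachment path, the target above might run a command like:

```bash
# Hypothetical expansion for a prompt with two attached files
somecommand "Summarize the attached files" --file docs/a.md --file docs/b.md
```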

- **VS Code targets:**
+ **Codex CLI targets:**

  ```yaml
- - name: vscode_projectx
-   provider: vscode
-   settings:
-     workspace_env: "EVAL_PROJECTX_WORKSPACE_PATH"
-
- - name: vscode_insiders_projectx
-   provider: vscode-insiders
+ - name: codex_cli
+   provider: codex
    settings:
-     workspace_env: "EVAL_PROJECTX_WORKSPACE_PATH"
+     executable: "CODEX_CLI_PATH" # defaults to `codex` if omitted
+     profile: "CODEX_PROFILE" # matches the profile in ~/.codex/config
+     model: "CODEX_MODEL" # optional, falls back to profile default
+     approval_preset: "CODEX_APPROVAL_PRESET"
+     timeout_seconds: 180
+     cwd: CODEX_WORKSPACE_DIR
  ```

+ Codex targets require the standalone `codex` CLI and a configured profile (via `codex configure`) so credentials are stored in `~/.codex/config` (or whatever path the CLI already uses). AgentV mirrors all guideline and attachment files into a fresh scratch workspace, so the `file://` preread links remain valid even when the CLI runs outside your repo tree.
+ Confirm the CLI works by running `codex exec --json --profile <name> "ping"` (or any supported dry run) before starting an eval. This prints JSONL events; seeing `item.completed` messages indicates the CLI is healthy.
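
A quick smoke test, assuming a profile named `eval` already exists in `~/.codex/config` (the profile name is illustrative):

```bash
# Streams JSONL events; an item.completed event indicates a healthy CLI
codex exec --json --profile eval "ping"
```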
+

  ## Timeout Handling and Retries

  When using VS Code or other AI agents that may experience timeouts, the evaluator includes automatic retry functionality:
@@ -286,22 +271,116 @@ Example with custom timeout settings:
  agentv eval evals/projectx/example.yaml --target vscode_projectx --agent-timeout 180 --max-retries 3
  ```

- ## How the Evals Work
+ ## Writing Custom Evaluators
+
+ ### Code Evaluator I/O Contract
+
+ Code evaluators receive input via stdin and write output to stdout as JSON.
+
+ **Input Format (via stdin):**
+ ```json
+ {
+   "task": "string describing the task",
+   "outcome": "expected outcome description",
+   "expected": "expected output string",
+   "output": "generated code/text from the agent",
+   "system_message": "system message if any",
+   "guideline_paths": ["path1", "path2"],
+   "attachments": ["file1", "file2"],
+   "user_segments": [{"type": "text", "value": "..."}]
+ }
+ ```
+
+ **Output Format (to stdout):**
+ ```json
+ {
+   "score": 0.85,
+   "hits": ["list of successful checks"],
+   "misses": ["list of failed checks"],
+   "reasoning": "explanation of the score"
+ }
+ ```
+
+ **Key Points:**
+ - Evaluators receive **full context** but should select only the relevant fields
+ - Most evaluators only need the `output` field; ignore the rest to avoid false positives
+ - Complex evaluators can use `task`, `expected`, or `guideline_paths` for context-aware validation
+ - Score range: `0.0` to `1.0` (float)
+ - `hits` and `misses` are optional but recommended for debugging
+
+ ### Code Evaluator Script Template
+
+ ```python
+ #!/usr/bin/env python3
+ import json
+ import sys
+
+ def evaluate(input_data):
+     # Extract only the fields you need
+     output = input_data.get("output", "")
+
+     # Your validation logic here
+     score = 0.0  # 0.0 to 1.0
+     hits = ["successful check 1", "successful check 2"]
+     misses = ["failed check 1"]
+     reasoning = "Explanation of score"
+
+     return {
+         "score": score,
+         "hits": hits,
+         "misses": misses,
+         "reasoning": reasoning
+     }
+
+ if __name__ == "__main__":
+     try:
+         input_data = json.loads(sys.stdin.read())
+         result = evaluate(input_data)
+         print(json.dumps(result, indent=2))
+     except Exception as e:
+         error_result = {
+             "score": 0.0,
+             "hits": [],
+             "misses": [f"Evaluator error: {str(e)}"],
+             "reasoning": f"Evaluator error: {str(e)}"
+         }
+         print(json.dumps(error_result, indent=2))
+         sys.exit(1)
+ ```
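
To smoke-test an evaluator locally (the file name is hypothetical), pipe a minimal input document through it and check the JSON result:

```bash
# Only the fields your evaluator reads need to be present
echo '{"output": "some generated text"}' | python3 my_evaluator.py
```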
+
+ ### LLM Judge Template Structure
+
+ ```markdown
+ # Judge Name
+
+ Evaluation criteria and guidelines...
+
+ ## Scoring Guidelines
+ 0.9-1.0: Excellent
+ 0.7-0.8: Good
+ ...
+
+ ## Output Format
+ {
+   "score": 0.85,
+   "passed": true,
+   "reasoning": "..."
+ }
+ ```

- For each eval case in a `.yaml` file:
+ ## Next Steps

- 1. Parse YAML and collect user messages (inline text and referenced files)
- 2. Extract code blocks from text for structured prompting
- 3. Generate a candidate answer via the configured provider/model
- 4. Score against the expected answer using AI-powered quality grading
- 5. Output results in JSONL or YAML format with detailed metrics
+ - Review `docs/examples/simple/evals/example-eval.yaml` to understand the schema
+ - Create your own eval cases following the schema
+ - Write custom evaluator scripts for domain-specific validation
+ - Create LLM judge templates for semantic evaluation
+ - Set up optimizer configs when ready to improve prompts

- ### VS Code Copilot Target
+ ## Resources

- - Opens your configured workspace and uses the `subagent` library to programmatically invoke VS Code Copilot
- - The prompt is built from the `.yaml` user content (task, files, code blocks)
- - Copilot is instructed to complete the task within the workspace context
- - Results are captured and scored automatically
+ - [Simple Example README](docs/examples/simple/README.md)
+ - [Schema Specification](docs/openspec/changes/update-eval-schema-v2/)
+ - [Ax ACE Documentation](https://github.com/ax-llm/ax/blob/main/docs/ACE.md)

  ## Scoring and Outputs