agentv 0.2.11 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,6 +1,6 @@
  # AgentV

- A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, Azure OpenAI, Anthropic, and Google Gemini.
+ A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, OpenAI Codex CLI, and Azure OpenAI.

  ## Installation and Setup

@@ -183,18 +183,6 @@ Output goes to `.agentv/results/{evalname}_{timestamp}.jsonl` (or `.yaml`) unles

  **Recommended Models:** Use Claude Sonnet 4.5 or Grok Code Fast 1 for best results, as these models are more consistent in following instruction chains.

- ## Requirements
-
- - Node.js 20.0.0 or higher
- - Environment variables for your chosen providers (configured via targets.yaml)
-
- Environment keys (configured via targets.yaml):
-
- - **Azure OpenAI:** Set environment variables specified in your target's `settings.endpoint`, `settings.api_key`, and `settings.model`
- - **Anthropic Claude:** Set environment variables specified in your target's `settings.api_key` and `settings.model`
- - **Google Gemini:** Set environment variables specified in your target's `settings.api_key` and optional `settings.model`
- - **VS Code:** Set environment variable specified in your target's `settings.workspace_env` → `.code-workspace` path
-
  ## Targets and Environment Variables

  Execution targets in `.agentv/targets.yaml` decouple evals from providers/settings and provide flexible environment variable mapping.
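
Note that the values under a target's `settings` are environment-variable *names*, not literal values; AgentV reads the actual values from your environment when the eval runs. A minimal sketch of that indirection (illustrative only, not AgentV's source; the variable names are placeholders):

```python
import os

# A target's settings map logical keys to the NAMES of environment variables.
settings = {"endpoint": "AZURE_OPENAI_ENDPOINT", "api_key": "AZURE_OPENAI_API_KEY"}

# The real values are resolved from the environment at run time
# (raises KeyError if a referenced variable is unset).
resolved = {key: os.environ[name] for key, name in settings.items()}
```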
@@ -204,7 +192,7 @@ Execution targets in `.agentv/targets.yaml` decouple evals from providers/settin
  Each target specifies:

  - `name`: Unique identifier for the target
- - `provider`: The model provider (`azure`, `anthropic`, `gemini`, `vscode`, `vscode-insiders`, or `mock`)
+ - `provider`: The model provider (`azure`, `anthropic`, `gemini`, `codex`, `vscode`, `vscode-insiders`, `cli`, or `mock`)
  - `settings`: Environment variable names to use for this target

  ### Examples
@@ -220,40 +208,54 @@ Each target specifies:
      model: "AZURE_DEPLOYMENT_NAME"
  ```

- **Anthropic targets:**
+ **VS Code targets:**

  ```yaml
- - name: anthropic_base
-   provider: anthropic
+ - name: vscode_projectx
+   provider: vscode
    settings:
-     api_key: "ANTHROPIC_API_KEY"
-     model: "ANTHROPIC_MODEL"
+     workspace_env: "EVAL_PROJECTX_WORKSPACE_PATH"
+
+ - name: vscode_insiders_projectx
+   provider: vscode-insiders
+   settings:
+     workspace_env: "EVAL_PROJECTX_WORKSPACE_PATH"
  ```

- **Google Gemini targets:**
+ **CLI targets (template-based):**

  ```yaml
- - name: gemini_base
-   provider: gemini
+ - name: local_cli
+   provider: cli
    settings:
-     api_key: "GOOGLE_API_KEY"
-     model: "GOOGLE_GEMINI_MODEL" # Optional, defaults to gemini-2.0-flash-exp
+     command_template: 'somecommand {PROMPT} {FILES}'
+     files_format: '--file {path}'
+     cwd: PROJECT_ROOT # optional working directory
+     env: # merged into process.env
+       API_TOKEN: LOCAL_AGENT_TOKEN
+     timeout_seconds: 30 # optional per-command timeout
+     healthcheck:
+       type: command # or http
+       command_template: code --version
  ```
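
Roughly, the runner renders each attachment path through `files_format`, joins the results into the `{FILES}` slot, and substitutes the prompt text for `{PROMPT}`. A sketch of that expansion (illustrative only, not AgentV's actual code; the prompt and file names are made up, and a real runner would also shell-quote the prompt):

```python
command_template = "somecommand {PROMPT} {FILES}"
files_format = "--file {path}"

prompt = "Summarize the attached files"   # made-up prompt
attachments = ["notes.md", "src/app.ts"]  # made-up attachment paths

# Render each attachment through files_format, then fill both placeholders.
files_part = " ".join(files_format.replace("{path}", p) for p in attachments)
command = command_template.replace("{PROMPT}", prompt).replace("{FILES}", files_part)

print(command)
# somecommand Summarize the attached files --file notes.md --file src/app.ts
```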

- **VS Code targets:**
+ **Codex CLI targets:**

  ```yaml
- - name: vscode_projectx
-   provider: vscode
+ - name: codex_cli
+   provider: codex
    settings:
-     workspace_env: "EVAL_PROJECTX_WORKSPACE_PATH"
-
- - name: vscode_insiders_projectx
-   provider: vscode-insiders
-   settings:
-     workspace_env: "EVAL_PROJECTX_WORKSPACE_PATH"
+     executable: "CODEX_CLI_PATH" # defaults to `codex` if omitted
+     profile: "CODEX_PROFILE" # matches the profile in ~/.codex/config
+     model: "CODEX_MODEL" # optional, falls back to profile default
+     approval_preset: "CODEX_APPROVAL_PRESET"
+     timeout_seconds: 180
+     cwd: CODEX_WORKSPACE_DIR
  ```

+ Codex targets require the standalone `codex` CLI and a configured profile (via `codex configure`) so credentials are stored in `~/.codex/config` (or whatever path the CLI already uses). AgentV mirrors all guideline and attachment files into a fresh scratch workspace, so the `file://` preread links remain valid even when the CLI runs outside your repo tree.
+
+ Confirm the CLI works by running `codex exec --json --profile <name> "ping"` (or any supported dry run) before starting an eval. This prints JSONL events; seeing `item.completed` messages indicates the CLI is healthy.
+
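If you want to script that health check, a minimal sketch (the profile name is a placeholder, and the check is a plain substring scan of the JSONL stream rather than a schema-aware parse):

```python
import subprocess

# Run the documented dry-run command and look for an "item.completed"
# event anywhere in the JSONL output.
proc = subprocess.run(
    ["codex", "exec", "--json", "--profile", "default", "ping"],  # "default" is a placeholder
    capture_output=True,
    text=True,
)
healthy = any("item.completed" in line for line in proc.stdout.splitlines())
print("codex CLI healthy:", healthy)
```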
  ## Timeout Handling and Retries

  When using VS Code or other AI agents that may experience timeouts, the evaluator includes automatic retry functionality:
@@ -269,22 +271,116 @@ Example with custom timeout settings:
  agentv eval evals/projectx/example.yaml --target vscode_projectx --agent-timeout 180 --max-retries 3
  ```

- ## How the Evals Work
+ ## Writing Custom Evaluators
+
+ ### Code Evaluator I/O Contract
+
+ Code evaluators receive input via stdin and write output to stdout as JSON.
+
+ **Input Format (via stdin):**
+
+ ```json
+ {
+   "task": "string describing the task",
+   "outcome": "expected outcome description",
+   "expected": "expected output string",
+   "output": "generated code/text from the agent",
+   "system_message": "system message if any",
+   "guideline_paths": ["path1", "path2"],
+   "attachments": ["file1", "file2"],
+   "user_segments": [{"type": "text", "value": "..."}]
+ }
+ ```
+
+ **Output Format (to stdout):**
+
+ ```json
+ {
+   "score": 0.85,
+   "hits": ["list of successful checks"],
+   "misses": ["list of failed checks"],
+   "reasoning": "explanation of the score"
+ }
+ ```
+
+ **Key Points:**
+
+ - Evaluators receive **full context** but should select only relevant fields
+ - Most evaluators only need the `output` field; ignore the rest to avoid false positives
+ - Complex evaluators can use `task`, `expected`, or `guideline_paths` for context-aware validation
+ - Score range: `0.0` to `1.0` (float)
+ - `hits` and `misses` are optional but recommended for debugging
+
+ ### Code Evaluator Script Template
+
+ ```python
+ #!/usr/bin/env python3
+ import json
+ import sys
+
+ def evaluate(input_data):
+     # Extract only the fields you need
+     output = input_data.get("output", "")
+
+     # Your validation logic here
+     score = 0.0  # 0.0 to 1.0
+     hits = ["successful check 1", "successful check 2"]
+     misses = ["failed check 1"]
+     reasoning = "Explanation of score"
+
+     return {
+         "score": score,
+         "hits": hits,
+         "misses": misses,
+         "reasoning": reasoning,
+     }
+
+ if __name__ == "__main__":
+     try:
+         input_data = json.loads(sys.stdin.read())
+         result = evaluate(input_data)
+         print(json.dumps(result, indent=2))
+     except Exception as e:
+         error_result = {
+             "score": 0.0,
+             "hits": [],
+             "misses": [f"Evaluator error: {e}"],
+             "reasoning": f"Evaluator error: {e}",
+         }
+         print(json.dumps(error_result, indent=2))
+         sys.exit(1)
+ ```
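
To smoke-test an evaluator before wiring it into an eval, feed it a hand-built payload over stdin, mimicking what AgentV sends. A minimal sketch (`my_evaluator.py` and the payload are placeholders):

```python
import json
import subprocess

# Only "output" is populated here, since most evaluators ignore the rest.
payload = {"output": "def add(a, b):\n    return a + b"}

proc = subprocess.run(
    ["python3", "my_evaluator.py"],  # placeholder script name
    input=json.dumps(payload),
    capture_output=True,
    text=True,
)
verdict = json.loads(proc.stdout)
print(verdict["score"], verdict.get("reasoning"))
```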
+
+ ### LLM Judge Template Structure
+
+ ```markdown
+ # Judge Name
+
+ Evaluation criteria and guidelines...
+
+ ## Scoring Guidelines
+
+ 0.9-1.0: Excellent
+ 0.7-0.8: Good
+ ...
+
+ ## Output Format
+
+ {
+   "score": 0.85,
+   "passed": true,
+   "reasoning": "..."
+ }
+ ```

- For each eval case in a `.yaml` file:
+ ## Next Steps

- 1. Parse YAML and collect user messages (inline text and referenced files)
- 2. Extract code blocks from text for structured prompting
- 3. Generate a candidate answer via the configured provider/model
- 4. Score against the expected answer using AI-powered quality grading
- 5. Output results in JSONL or YAML format with detailed metrics
+ - Review `docs/examples/simple/evals/example-eval.yaml` to understand the schema
+ - Create your own eval cases following the schema
+ - Write custom evaluator scripts for domain-specific validation
+ - Create LLM judge templates for semantic evaluation
+ - Set up optimizer configs when ready to improve prompts

- ### VS Code Copilot Target
+ ## Resources

- - Opens your configured workspace and uses the `subagent` library to programmatically invoke VS Code Copilot
- - The prompt is built from the `.yaml` user content (task, files, code blocks)
- - Copilot is instructed to complete the task within the workspace context
- - Results are captured and scored automatically
+ - [Simple Example README](docs/examples/simple/README.md)
+ - [Schema Specification](docs/openspec/changes/update-eval-schema-v2/)
+ - [Ax ACE Documentation](https://github.com/ax-llm/ax/blob/main/docs/ACE.md)

  ## Scoring and Outputs