agentv 0.2.8 → 0.5.0
- package/README.md +142 -63
- package/dist/{chunk-RLBRJX7V.js → chunk-THVRLL37.js} +2316 -482
- package/dist/chunk-THVRLL37.js.map +1 -0
- package/dist/cli.js +1 -1
- package/dist/index.js +1 -1
- package/package.json +2 -2
- package/dist/chunk-RLBRJX7V.js.map +0 -1
package/README.md CHANGED
@@ -1,6 +1,6 @@
 # AgentV
 
-A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot,
+A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, OpenAI Codex CLI and Azure OpenAI.
 
 ## Installation and Setup
 
@@ -76,7 +76,7 @@ You are now ready to start development. The monorepo contains:
 
 ### Configuring Guideline Patterns
 
-AgentV automatically detects guideline files
+AgentV automatically detects guideline files and treats them differently from regular file content. You can customize which files are considered guidelines using an optional `.agentv/config.yaml` configuration file.
 
 **Config file discovery:**
 - AgentV searches for `.agentv/config.yaml` starting from the eval file's directory
@@ -84,16 +84,6 @@ AgentV automatically detects guideline files (instructions, prompts) and treats
 - Uses the first config file found (similar to how `targets.yaml` is discovered)
 - This allows you to place one config file at the project root for all evals
 
-**Default patterns** (used when `.agentv/config.yaml` is absent):
-
-```yaml
-guideline_patterns:
-  - "**/*.instructions.md"
-  - "**/instructions/**"
-  - "**/*.prompt.md"
-  - "**/prompts/**"
-```
-
 **Custom patterns** (create `.agentv/config.yaml` in same directory as your eval file):
 
 ```yaml
@@ -105,13 +95,6 @@ guideline_patterns:
   - "**/*.rules.md" # Match by naming convention
 ```
 
-**How it works:**
-
-- Files matching guideline patterns are loaded as separate guideline context
-- Files NOT matching are treated as regular file content in user messages
-- Patterns use standard glob syntax (via [micromatch](https://github.com/micromatch/micromatch))
-- Paths are normalized to forward slashes for cross-platform compatibility
-
 See [config.yaml example](docs/examples/simple/.agentv/config.yaml) for more pattern examples.
 
 ### Validating Eval Files
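The guideline classification removed from this section of the README (match against glob patterns, normalize to forward slashes, route matches to guideline context) can be sketched in Python. Note this is only an illustration: AgentV actually matches via micromatch in JavaScript, whose `**` semantics differ from Python's `fnmatch`, so the patterns here are simplified single-`*` stand-ins and `is_guideline` is a hypothetical helper, not AgentV's API.

```python
from fnmatch import fnmatch

# Simplified stand-ins for the README's micromatch patterns; fnmatch has no
# real "**" support, so single "*" (which crosses "/" in fnmatch) is used.
GUIDELINE_PATTERNS = [
    "*.instructions.md",
    "*/instructions/*",
    "*.prompt.md",
    "*/prompts/*",
]

def is_guideline(path: str, patterns=GUIDELINE_PATTERNS) -> bool:
    # Normalize to forward slashes, as the README describes for cross-platform use.
    normalized = path.replace("\\", "/")
    return any(fnmatch(normalized, p) for p in patterns)

print(is_guideline("docs/review.instructions.md"))  # True
print(is_guideline("src/main.py"))                  # False
```

Files for which this returns `True` would be loaded as guideline context; everything else stays regular file content in the user message.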
@@ -200,18 +183,6 @@ Output goes to `.agentv/results/{evalname}_{timestamp}.jsonl` (or `.yaml`) unles
 
 **Recommended Models:** Use Claude Sonnet 4.5 or Grok Code Fast 1 for best results, as these models are more consistent in following instruction chains.
 
-## Requirements
-
-- Node.js 20.0.0 or higher
-- Environment variables for your chosen providers (configured via targets.yaml)
-
-Environment keys (configured via targets.yaml):
-
-- **Azure OpenAI:** Set environment variables specified in your target's `settings.endpoint`, `settings.api_key`, and `settings.model`
-- **Anthropic Claude:** Set environment variables specified in your target's `settings.api_key` and `settings.model`
-- **Google Gemini:** Set environment variables specified in your target's `settings.api_key` and optional `settings.model`
-- **VS Code:** Set environment variable specified in your target's `settings.workspace_env` → `.code-workspace` path
-
 ## Targets and Environment Variables
 
 Execution targets in `.agentv/targets.yaml` decouple evals from providers/settings and provide flexible environment variable mapping.
@@ -221,7 +192,7 @@ Execution targets in `.agentv/targets.yaml` decouple evals from providers/settin
 Each target specifies:
 
 - `name`: Unique identifier for the target
-- `provider`: The model provider (`azure`, `anthropic`, `gemini`, `vscode`, `vscode-insiders`, or `mock`)
+- `provider`: The model provider (`azure`, `anthropic`, `gemini`, `codex`, `vscode`, `vscode-insiders`, `cli`, or `mock`)
 - `settings`: Environment variable names to use for this target
 
 ### Examples
@@ -237,40 +208,54 @@ Each target specifies:
     model: "AZURE_DEPLOYMENT_NAME"
 ```
 
-**
+**VS Code targets:**
 
 ```yaml
-- name:
-  provider:
+- name: vscode_projectx
+  provider: vscode
   settings:
-
-
+    workspace_env: "EVAL_PROJECTX_WORKSPACE_PATH"
+
+- name: vscode_insiders_projectx
+  provider: vscode-insiders
+  settings:
+    workspace_env: "EVAL_PROJECTX_WORKSPACE_PATH"
 ```
 
-**
+**CLI targets (template-based):**
 
 ```yaml
-- name:
-  provider:
+- name: local_cli
+  provider: cli
   settings:
-
-
+    command_template: 'somecommand {PROMPT} {FILES}'
+    files_format: '--file {path}'
+    cwd: PROJECT_ROOT # optional working directory
+    env: # merged into process.env
+      API_TOKEN: LOCAL_AGENT_TOKEN
+    timeout_seconds: 30 # optional per-command timeout
+    healthcheck:
+      type: command # or http
+      command_template: code --version
 ```
 
-**
+**Codex CLI targets:**
 
 ```yaml
-- name:
-  provider:
-  settings:
-    workspace_env: "EVAL_PROJECTX_WORKSPACE_PATH"
-
-- name: vscode_insiders_projectx
-  provider: vscode-insiders
+- name: codex_cli
+  provider: codex
   settings:
-
+    executable: "CODEX_CLI_PATH" # defaults to `codex` if omitted
+    profile: "CODEX_PROFILE" # matches the profile in ~/.codex/config
+    model: "CODEX_MODEL" # optional, falls back to profile default
+    approval_preset: "CODEX_APPROVAL_PRESET"
+    timeout_seconds: 180
+    cwd: CODEX_WORKSPACE_DIR
 ```
 
+Codex targets require the standalone `codex` CLI and a configured profile (via `codex configure`) so credentials are stored in `~/.codex/config` (or whatever path the CLI already uses). AgentV mirrors all guideline and attachment files into a fresh scratch workspace, so the `file://` preread links remain valid even when the CLI runs outside your repo tree.
+Confirm the CLI works by running `codex exec --json --profile <name> "ping"` (or any supported dry run) before starting an eval. This prints JSONL events; seeing `item.completed` messages indicates the CLI is healthy.
+
 ## Timeout Handling and Retries
 
 When using VS Code or other AI agents that may experience timeouts, the evaluator includes automatic retry functionality:
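The new CLI target's `command_template`/`files_format` pair is a string-substitution scheme: `{PROMPT}` is replaced with the prompt and `{FILES}` with one `files_format` entry per attached file. A minimal sketch of how such an expansion could behave; `build_command` is hypothetical (not AgentV's actual implementation), and quoting the prompt with `shlex.quote` is an assumption:

```python
import shlex

def build_command(command_template: str, files_format: str,
                  prompt: str, files: list[str]) -> str:
    # Hypothetical expansion: one files_format entry per attachment path.
    files_part = " ".join(files_format.replace("{path}", p) for p in files)
    # Assumed: the prompt is shell-quoted before substitution.
    return (command_template
            .replace("{PROMPT}", shlex.quote(prompt))
            .replace("{FILES}", files_part))

cmd = build_command(
    "somecommand {PROMPT} {FILES}",
    "--file {path}",
    "summarize",
    ["a.md", "b.md"],
)
print(cmd)  # somecommand summarize --file a.md --file b.md
```

The same substitution idea applies to the healthcheck's `command_template`, which simply has no placeholders to fill.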
@@ -286,22 +271,116 @@ Example with custom timeout settings:
 agentv eval evals/projectx/example.yaml --target vscode_projectx --agent-timeout 180 --max-retries 3
 ```
 
-##
+## Writing Custom Evaluators
+
+### Code Evaluator I/O Contract
+
+Code evaluators receive input via stdin and write output to stdout as JSON.
+
+**Input Format (via stdin):**
+```json
+{
+  "task": "string describing the task",
+  "outcome": "expected outcome description",
+  "expected": "expected output string",
+  "output": "generated code/text from the agent",
+  "system_message": "system message if any",
+  "guideline_paths": ["path1", "path2"],
+  "attachments": ["file1", "file2"],
+  "user_segments": [{"type": "text", "value": "..."}]
+}
+```
+
+**Output Format (to stdout):**
+```json
+{
+  "score": 0.85,
+  "hits": ["list of successful checks"],
+  "misses": ["list of failed checks"],
+  "reasoning": "explanation of the score"
+}
+```
+
+**Key Points:**
+- Evaluators receive **full context** but should select only relevant fields
+- Most evaluators only need `output` field - ignore the rest to avoid false positives
+- Complex evaluators can use `task`, `expected`, or `guideline_paths` for context-aware validation
+- Score range: `0.0` to `1.0` (float)
+- `hits` and `misses` are optional but recommended for debugging
+
+### Code Evaluator Script Template
+
+```python
+#!/usr/bin/env python3
+import json
+import sys
+
+def evaluate(input_data):
+    # Extract only the fields you need
+    output = input_data.get("output", "")
+
+    # Your validation logic here
+    score = 0.0  # to 1.0
+    hits = ["successful check 1", "successful check 2"]
+    misses = ["failed check 1"]
+    reasoning = "Explanation of score"
+
+    return {
+        "score": score,
+        "hits": hits,
+        "misses": misses,
+        "reasoning": reasoning
+    }
+
+if __name__ == "__main__":
+    try:
+        input_data = json.loads(sys.stdin.read())
+        result = evaluate(input_data)
+        print(json.dumps(result, indent=2))
+    except Exception as e:
+        error_result = {
+            "score": 0.0,
+            "hits": [],
+            "misses": [f"Evaluator error: {str(e)}"],
+            "reasoning": f"Evaluator error: {str(e)}"
+        }
+        print(json.dumps(error_result, indent=2))
+        sys.exit(1)
+```
+
+### LLM Judge Template Structure
+
+```markdown
+# Judge Name
+
+Evaluation criteria and guidelines...
+
+## Scoring Guidelines
+0.9-1.0: Excellent
+0.7-0.8: Good
+...
+
+## Output Format
+{
+  "score": 0.85,
+  "passed": true,
+  "reasoning": "..."
+}
+```
 
-
+## Next Steps
 
-
-
-
-
-
+- Review `docs/examples/simple/evals/example-eval.yaml` to understand the schema
+- Create your own eval cases following the schema
+- Write custom evaluator scripts for domain-specific validation
+- Create LLM judge templates for semantic evaluation
+- Set up optimizer configs when ready to improve prompts
 
-
+## Resources
 
--
--
--
-- Results are captured and scored automatically
+- [Simple Example README](docs/examples/simple/README.md)
+- [Schema Specification](docs/openspec/changes/update-eval-schema-v2/)
+- [Ax ACE Documentation](https://github.com/ax-llm/ax/blob/main/docs/ACE.md)
 
 ## Scoring and Outputs
 
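The evaluator I/O contract added in the final hunk (JSON payload on stdin, JSON result on stdout) can be exercised end to end with a small harness. This `run_evaluator` helper is illustrative only; the payload keys mirror the documented input format, with empty placeholders for the fields a simple check would ignore:

```python
import json
import subprocess
import sys

def run_evaluator(script_path: str, output_text: str) -> dict:
    # Build the documented input payload; only "output" varies here.
    payload = {
        "task": "", "outcome": "", "expected": "",
        "output": output_text, "system_message": "",
        "guideline_paths": [], "attachments": [], "user_segments": [],
    }
    # The evaluator script reads stdin and prints its JSON result to stdout.
    proc = subprocess.run(
        [sys.executable, script_path],
        input=json.dumps(payload),
        capture_output=True, text=True,
    )
    return json.loads(proc.stdout)

# Example (hypothetical script path):
# result = run_evaluator("my_evaluator.py", "def add(a, b): return a + b")
# result["score"] is a float in [0.0, 1.0]
```

Running the evaluator script template above through a harness like this is a quick way to sanity-check its scoring logic before wiring it into an eval file.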