agentv 2.0.1 → 2.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,424 +1,305 @@
  # AgentV

- A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, OpenAI Codex CLI, Pi Coding Agent, and Azure OpenAI.
+ **CLI-first AI agent evaluation. No server. No signup. No overhead.**

- ## Installation and Setup
+ AgentV evaluates your agents locally with multi-objective scoring (correctness, latency, cost, safety) from YAML specifications. Deterministic code judges + customizable LLM judges, all version-controlled in Git.

- ### Installation for End Users
-
- This is the recommended method for users who want to use `agentv` as a command-line tool.
-
- 1. Install via npm:
+ ## Installation

+ **1. Install:**
  ```bash
- # Install globally
  npm install -g agentv
-
- # Or use npx to run without installing
- npx agentv --help
- ```
-
- 2. Verify the installation:
-
- ```bash
- agentv --help
- ```
-
- ### Local Development Setup
-
- Follow these steps if you want to contribute to the `agentv` project itself. This workflow uses Bun workspaces for fast, efficient dependency management.
-
- 1. Clone the repository and navigate into it:
-
- ```bash
- git clone https://github.com/EntityProcess/agentv.git
- cd agentv
  ```

- 2. Install dependencies:
-
+ **2. Initialize your workspace:**
  ```bash
- # Install Bun if you don't have it
- curl -fsSL https://bun.sh/install | bash # macOS/Linux
- # or
- powershell -c "irm bun.sh/install.ps1 | iex" # Windows
-
- # Install all workspace dependencies
- bun install
+ agentv init
  ```

- 3. Build the project:
+ **3. Configure environment variables:**
+ - The init command creates a `.env.example` file in your project root
+ - Copy `.env.example` to `.env` and fill in your API keys, endpoints, and other configuration values
+ - Update the environment variable names in `.agentv/targets.yaml` to match those defined in your `.env` file
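For illustration, a minimal `.env` could map the variable names that the targets examples later in this README reference; the names and values below are placeholders, not requirements of the tool:

```bash
# .env (illustrative placeholders; use whatever names your .agentv/targets.yaml references)
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
AZURE_OPENAI_API_KEY=replace-with-your-key
AZURE_DEPLOYMENT_NAME=replace-with-your-deployment
```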
 
- ```bash
- bun run build
+ **4. Create an eval** (`./evals/example.yaml`):
+ ```yaml
+ description: Math problem solving evaluation
+ execution:
+   target: default
+
+ evalcases:
+   - id: addition
+     expected_outcome: Correctly calculates 15 + 27 = 42
+
+     input_messages:
+       - role: user
+         content: What is 15 + 27?
+
+     expected_messages:
+       - role: assistant
+         content: "42"
+
+     execution:
+       evaluators:
+         - name: math_check
+           type: code_judge
+           script: ./validators/check_math.py
  ```

- 4. Run tests:
-
+ **5. Run the eval:**
  ```bash
- bun test
+ agentv eval ./evals/example.yaml
  ```

- You are now ready to start development. The monorepo contains:
-
- - `packages/core/` - Core evaluation engine
- - `apps/cli/` - Command-line interface
-
- ### Environment Setup
-
- 1. Initialize your workspace:
- - Run `agentv init` at the root of your repository
- - This command automatically sets up the `.agentv/` directory structure and configuration files
+ Results appear in `.agentv/results/eval_<timestamp>.jsonl` with scores, reasoning, and execution traces.

- 2. Configure environment variables:
- - The init command creates a `.env.template` file in your project root
- - Copy `.env.template` to `.env` and fill in your API keys, endpoints, and other configuration values
- - Update the environment variable names in `.agentv/targets.yaml` to match those defined in your `.env` file
+ Learn more in the [examples/](examples/README.md) directory. For a detailed comparison with other frameworks, see [docs/COMPARISON.md](docs/COMPARISON.md).

- ## Quick Start
+ ## Why AgentV?

- You can use the following examples as a starting point.
- - [Simple Example](docs/examples/simple/README.md): A minimal working example to help you get started fast.
- - [Showcase](docs/examples/showcase/README.md): A collection of advanced use cases and real-world agent evaluation scenarios.
+ | Feature | AgentV | [LangWatch](https://github.com/langwatch/langwatch) | [LangSmith](https://github.com/langchain-ai/langsmith-sdk) | [LangFuse](https://github.com/langfuse/langfuse) |
+ |---------|--------|-----------|-----------|----------|
+ | **Setup** | `npm install` | Cloud account + API key | Cloud account + API key | Cloud account + API key |
+ | **Server** | None (local) | Managed cloud | Managed cloud | Managed cloud |
+ | **Privacy** | All local | Cloud-hosted | Cloud-hosted | Cloud-hosted |
+ | **CLI-first** | ✓ | ✗ | Limited | Limited |
+ | **CI/CD ready** | ✓ | Requires API calls | Requires API calls | Requires API calls |
+ | **Version control** | ✓ (YAML in Git) | ✗ | ✗ | ✗ |
+ | **Evaluators** | Code + LLM + Custom | LLM only | LLM + Code | LLM only |

- ### Validating Eval Files
+ **Best for:** Developers who want evaluation in their workflow, not a separate dashboard. Teams prioritizing privacy and reproducibility.

- Validate your eval and targets files before running them:
+ ## Features

- ```bash
- # Validate a single file
- agentv validate evals/my-eval.yaml
-
- # Validate multiple files
- agentv validate evals/eval1.yaml evals/eval2.yaml
-
- # Validate entire directory (recursively finds all YAML files)
- agentv validate evals/
- ```
+ - **Multi-objective scoring**: Correctness, latency, cost, safety in one run
+ - **Multiple evaluator types**: Code validators, LLM judges, custom Python/TypeScript
+ - **Built-in targets**: VS Code Copilot, Codex CLI, Pi Coding Agent, Azure OpenAI, local CLI agents
+ - **Structured evaluation**: Rubric-based grading with weights and requirements
+ - **Batch evaluation**: Run hundreds of test cases in parallel
+ - **Export**: JSON, JSONL, YAML formats
+ - **Compare results**: Compute deltas between evaluation runs for A/B testing

- ### Running Evals
+ ## Development

- Run eval (target auto-selected from eval file or CLI override):
+ Contributing to AgentV? Clone and set up the repository:

  ```bash
- # If your eval.yaml contains "target: azure_base", it will be used automatically
- agentv eval "path/to/eval.yaml"
-
- # Override the eval file's target with CLI flag
- agentv eval --target vscode_projectx "path/to/eval.yaml"
+ git clone https://github.com/EntityProcess/agentv.git
+ cd agentv

- # Run multiple evals via glob
- agentv eval "path/to/evals/**/*.yaml"
- ```
+ # Install Bun if you don't have it
+ curl -fsSL https://bun.sh/install | bash

- Run a specific eval case with custom targets path:
+ # Install dependencies and build
+ bun install && bun run build

- ```bash
- agentv eval --target vscode_projectx --targets "path/to/targets.yaml" --eval-id "my-eval-case" "path/to/eval.yaml"
+ # Run tests
+ bun test
  ```
 
- ### Command Line Options
-
- - `eval_paths...`: Path(s) or glob(s) to eval YAML files (required; e.g., `evals/**/*.yaml`)
- - `--target TARGET`: Execution target name from targets.yaml (overrides target specified in eval file)
- - `--targets TARGETS`: Path to targets.yaml file (default: ./.agentv/targets.yaml)
- - `--eval-id EVAL_ID`: Run only the eval case with this specific ID
- - `--out OUTPUT_FILE`: Output file path (default: .agentv/results/eval_<timestamp>.jsonl)
- - `--output-format FORMAT`: Output format: 'jsonl' or 'yaml' (default: jsonl)
- - `--dry-run`: Run with mock model for testing
- - `--agent-timeout SECONDS`: Timeout in seconds for agent response polling (default: 120)
- - `--max-retries COUNT`: Maximum number of retries for timeout cases (default: 2)
- - `--cache`: Enable caching of LLM responses (default: disabled)
- - `--workers COUNT`: Parallel workers for eval cases (default: 3; target `workers` setting used when provided)
- - `--verbose`: Verbose output
-
- ### Target Selection Priority
-
- The CLI determines which execution target to use with the following precedence:
+ See [AGENTS.md](AGENTS.md) for development guidelines and design principles.

- 1. CLI flag override: `--target my_target` (when provided and not 'default')
- 2. Eval file specification: `target: my_target` key in the .eval.yaml file
- 3. Default fallback: Uses the 'default' target (original behavior)
+ ## Core Concepts

- This allows eval files to specify their preferred target while still allowing command-line overrides for flexibility, and maintains backward compatibility with existing workflows.
+ **Evaluation files** (`.yaml`) define test cases with expected outcomes. **Targets** specify which agent/provider to evaluate. **Judges** (code or LLM) score results. **Results** are written as JSONL/YAML for analysis and comparison.

- Output goes to `.agentv/results/eval_<timestamp>.jsonl` (or `.yaml`) unless `--out` is provided.
+ ## Usage

- ### Tips for VS Code Copilot Evals
+ ### Running Evaluations

- **Workspace Switching:** The runner automatically switches to the target workspace when running evals. Make sure you're not actively using another VS Code instance, as this could cause prompts to be injected into the wrong workspace.
+ ```bash
+ # Validate evals
+ agentv validate evals/my-eval.yaml

- **Recommended Models:** Use Claude Sonnet 4.5 or Grok Code Fast 1 for best results, as these models are more consistent in following instruction chains.
+ # Run an eval with default target (from eval file or targets.yaml)
+ agentv eval evals/my-eval.yaml

- ## Targets and Environment Variables
+ # Override target
+ agentv eval --target azure_base evals/**/*.yaml

- Execution targets in `.agentv/targets.yaml` decouple evals from providers/settings and provide flexible environment variable mapping.
+ # Run specific eval case
+ agentv eval --eval-id case-123 evals/my-eval.yaml

- ### Target Configuration Structure
+ # Dry-run with mock provider
+ agentv eval --dry-run evals/my-eval.yaml
+ ```

- Each target specifies:
+ See `agentv eval --help` for all options: workers, timeouts, output formats, trace dumping, and more.

- - `name`: Unique identifier for the target
- - `provider`: The model provider (`azure`, `anthropic`, `gemini`, `codex`, `pi-coding-agent`, `vscode`, `vscode-insiders`, `cli`, or `mock`)
- - Provider-specific configuration fields at the top level (no `settings` wrapper needed)
- - Optional fields: `judge_target`, `workers`, `provider_batching`
+ ### Create Custom Evaluators

- ### Examples
+ Write code judges in Python or TypeScript:

- **Azure OpenAI targets:**
+ ```python
+ # validators/check_answer.py
+ import json, sys
+ data = json.load(sys.stdin)
+ candidate_answer = data.get("candidate_answer", "")

- ```yaml
- - name: azure_base
-   provider: azure
-   endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
-   api_key: ${{ AZURE_OPENAI_API_KEY }}
-   model: ${{ AZURE_DEPLOYMENT_NAME }}
-   version: ${{ AZURE_OPENAI_API_VERSION }} # Optional: defaults to 2024-12-01-preview
- ```
+ hits = []
+ misses = []

- Note: Environment variables are referenced using `${{ VARIABLE_NAME }}` syntax. The actual values are resolved from your `.env` file at runtime.
+ if "42" in candidate_answer:
+     hits.append("Answer contains correct value (42)")
+ else:
+     misses.append("Answer does not contain expected value (42)")

- **VS Code targets:**
+ score = 1.0 if hits else 0.0

- ```yaml
- - name: vscode_projectx
-   provider: vscode
-   workspace_template: ${{ PROJECTX_WORKSPACE_PATH }}
-   provider_batching: false
-   judge_target: azure_base
-
- - name: vscode_insiders_projectx
-   provider: vscode-insiders
-   workspace_template: ${{ PROJECTX_WORKSPACE_PATH }}
-   provider_batching: false
-   judge_target: azure_base
+ print(json.dumps({
+     "score": score,
+     "hits": hits,
+     "misses": misses,
+     "reasoning": f"Passed {len(hits)} check(s)"
+ }))
  ```
 
- **CLI targets (template-based):**
+ Reference evaluators in your eval file:

  ```yaml
- - name: local_cli
-   provider: cli
-   judge_target: azure_base
-   command_template: 'uv run ./my_agent.py --prompt {PROMPT} {FILES}'
-   files_format: '--file {path}'
-   cwd: ${{ CLI_EVALS_DIR }} # optional working directory
-   timeout_seconds: 30 # optional per-command timeout
-   healthcheck:
-     type: command # or http
-     command_template: uv run ./my_agent.py --healthcheck
+ execution:
+   evaluators:
+     - name: my_validator
+       type: code_judge
+       script: ./validators/check_answer.py
  ```

- **Supported placeholders in CLI commands:**
- - `{PROMPT}` - The rendered prompt text (shell-escaped)
- - `{FILES}` - Expands to multiple file arguments using `files_format` template
- - `{GUIDELINES}` - Guidelines content
- - `{EVAL_ID}` - Current eval case ID
- - `{ATTEMPT}` - Retry attempt number
- - `{OUTPUT_FILE}` - Path to output file (for agents that write responses to disk)
-
- **Codex CLI targets:**
-
- ```yaml
- - name: codex_cli
-   provider: codex
-   judge_target: azure_base
-   executable: ${{ CODEX_CLI_PATH }} # defaults to `codex` if omitted
-   args: # optional CLI arguments
-     - --profile
-     - ${{ CODEX_PROFILE }}
-     - --model
-     - ${{ CODEX_MODEL }}
-   timeout_seconds: 180
-   cwd: ${{ CODEX_WORKSPACE_DIR }}
-   log_format: json # 'summary' or 'json'
- ```
+ For complete templates, examples, and evaluator patterns, see: [custom-evaluators.md](.claude/skills/agentv-eval-builder/references/custom-evaluators.md)

- Codex targets require the standalone `codex` CLI and a configured profile (via `codex configure`) so credentials are stored in `~/.codex/config` (or whatever path the CLI already uses). AgentV mirrors all guideline and attachment files into a fresh scratch workspace, so the `file://` preread links remain valid even when the CLI runs outside your repo tree.
- Confirm the CLI works by running `codex exec --json --profile <name> "ping"` (or any supported dry run) before starting an eval. This prints JSONL events; seeing `item.completed` messages indicates the CLI is healthy.
+ ### Compare Evaluation Results

- **Pi Coding Agent targets:**
+ Run two evaluations and compare them:

- ```yaml
- - name: pi
-   provider: pi-coding-agent
-   judge_target: gemini_base
-   executable: ${{ PI_CLI_PATH }} # Optional: defaults to `pi` if omitted
-   pi_provider: google # google, anthropic, openai, groq, xai, openrouter
-   model: ${{ GEMINI_MODEL_NAME }}
-   api_key: ${{ GOOGLE_GENERATIVE_AI_API_KEY }}
-   tools: read,bash,edit,write # Available tools for the agent
-   timeout_seconds: 180
-   cwd: ${{ PI_WORKSPACE_DIR }} # Optional: run in specific directory
-   log_format: json # 'summary' (default) or 'json' for full logs
-   # system_prompt: optional override for the default system prompt
+ ```bash
+ agentv eval evals/my-eval.yaml --out before.jsonl
+ # ... make changes to your agent ...
+ agentv eval evals/my-eval.yaml --out after.jsonl
+ agentv compare before.jsonl after.jsonl --threshold 0.1
  ```

- Pi Coding Agent is an autonomous coding CLI from [pi-mono](https://github.com/badlogic/pi-mono). Install it globally with `npm install -g @mariozechner/pi-coding-agent` (or use a local path via `executable`). It supports multiple LLM providers and outputs JSONL events. AgentV extracts tool trajectories from the output for trace-based evaluation. File attachments are passed using Pi's native `@path` syntax.
-
- By default, a system prompt instructs the agent to include code in its response (required for evaluation scoring). Use `system_prompt` to override this behavior.
+ Output shows wins, losses, ties, and mean delta to identify improvements.

- ## Writing Custom Evaluators
+ ## Targets Configuration

- ### Code Evaluator I/O Contract
+ Define execution targets in `.agentv/targets.yaml` to decouple evals from providers:

- Code evaluators receive input via stdin and write output to stdout as JSON.
+ ```yaml
+ targets:
+   - name: azure_base
+     provider: azure
+     endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
+     api_key: ${{ AZURE_OPENAI_API_KEY }}
+     model: ${{ AZURE_DEPLOYMENT_NAME }}

- **Input Format (via stdin):**
- ```json
- {
-   "question": "string describing the task/question",
-   "expected_outcome": "expected outcome description",
-   "reference_answer": "gold standard answer (optional)",
-   "candidate_answer": "generated code/text from the agent",
-   "guideline_files": ["path/to/guideline1.md", "path/to/guideline2.md"],
-   "input_files": ["path/to/data.json", "path/to/config.yaml"],
-   "input_messages": [{"role": "user", "content": "..."}]
- }
- ```
+   - name: vscode_dev
+     provider: vscode
+     workspace_template: ${{ WORKSPACE_PATH }}
+     judge_target: azure_base

- **Output Format (to stdout):**
- ```json
- {
-   "score": 0.85,
-   "hits": ["list of successful checks"],
-   "misses": ["list of failed checks"],
-   "reasoning": "explanation of the score"
- }
+   - name: local_agent
+     provider: cli
+     command_template: 'python agent.py --prompt {PROMPT}'
+     judge_target: azure_base
  ```
 
- **Key Points:**
- - Evaluators receive **full context** but should select only relevant fields
- - Most evaluators only need `candidate_answer` field - ignore the rest to avoid false positives
- - Complex evaluators can use `question`, `reference_answer`, or `guideline_paths` for context-aware validation
- - Score range: `0.0` to `1.0` (float)
- - `hits` and `misses` are optional but recommended for debugging
+ Supports: `azure`, `anthropic`, `gemini`, `codex`, `pi-coding-agent`, `claude-code`, `vscode`, `vscode-insiders`, `cli`, and `mock`.

- ### Code Evaluator Templates
+ Use `${{ VARIABLE_NAME }}` syntax to reference your `.env` file. See `.agentv/targets.yaml` after `agentv init` for detailed examples and all provider-specific fields.

- Custom evaluators can be written in any language. For complete templates and examples:
+ ## Evaluation Features

- - **Python template**: See `apps/cli/src/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md`
- - **TypeScript template (with SDK)**: See `apps/cli/src/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md`
- - **Working examples**: See [examples/features/code-judge-sdk](examples/features/code-judge-sdk)
+ ### Code Judges

- ### LLM Judge Template Structure
+ Write validators in any language (Python, TypeScript, Node, etc.):

- ```markdown
- # Judge Name
-
- Evaluation criteria and guidelines...
-
- ## Scoring Guidelines
- 0.9-1.0: Excellent
- 0.7-0.8: Good
- ...
-
- ## Output Format
- {
-   "score": 0.85,
-   "passed": true,
-   "reasoning": "..."
- }
+ ```bash
+ # Input: stdin JSON with question, expected_outcome, candidate_answer
+ # Output: stdout JSON with score (0-1), hits, misses, reasoning
  ```
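As a quick sketch of the two payloads, based on the I/O contract documented in the previous version of this README and the Python judge shown earlier (the example values echo the math eval above; the exact field set in 2.1.0 may differ):

Input on stdin:

```json
{
  "question": "What is 15 + 27?",
  "expected_outcome": "Correctly calculates 15 + 27 = 42",
  "candidate_answer": "The answer is 42."
}
```

Output on stdout:

```json
{
  "score": 1.0,
  "hits": ["Answer contains correct value (42)"],
  "misses": [],
  "reasoning": "Passed 1 check(s)"
}
```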
 
- ## Rubric-Based Evaluation
-
- AgentV supports structured evaluation through rubrics - lists of criteria that define what makes a good response. Rubrics are checked by an LLM judge and scored based on weights and requirements.
+ For complete examples and patterns, see:
+ - [custom-evaluators skill](.claude/skills/agentv-eval-builder/references/custom-evaluators.md)
+ - [code-judge-sdk example](examples/features/code-judge-sdk)

- ### Basic Usage
+ ### LLM Judges

- Define rubrics inline using simple strings:
+ Create markdown judge files with evaluation criteria and scoring guidelines:

  ```yaml
- - id: example-1
-   expected_outcome: Explain quicksort algorithm
-   rubrics:
-     - Mentions divide-and-conquer approach
-     - Explains the partition step
-     - States time complexity correctly
+ execution:
+   evaluators:
+     - name: semantic_check
+       type: llm_judge
+       prompt: ./judges/correctness.md
  ```

- Or use detailed objects for fine-grained control:
+ Your judge prompt file defines criteria and scoring guidelines.
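For illustration, a minimal `./judges/correctness.md` could follow the template structure from the previous version of this README; the headings, score bands, and output fields below are placeholders to adapt, not a required schema:

```markdown
# Correctness Judge

Check whether the candidate answer satisfies the expected outcome.

## Scoring Guidelines
0.9-1.0: Fully correct and complete
0.7-0.8: Mostly correct with minor gaps
0.0-0.6: Incorrect or incomplete

## Output Format
{
  "score": 0.85,
  "passed": true,
  "reasoning": "..."
}
```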
+
+ ### Rubric-Based Evaluation
+
+ Define structured criteria directly in your eval case:

  ```yaml
- rubrics:
-   - id: structure
-     description: Has clear headings and organization
-     weight: 1.0
-     required: true
-   - id: examples
-     description: Includes practical examples
-     weight: 0.5
-     required: false
+ evalcases:
+   - id: quicksort-explain
+     expected_outcome: Explain how quicksort works
+
+     input_messages:
+       - role: user
+         content: Explain quicksort algorithm
+
+     rubrics:
+       - Mentions divide-and-conquer approach
+       - Explains partition step
+       - States time complexity
  ```

- ### Generate Rubrics
-
- Automatically generate rubrics from `expected_outcome` fields:
+ Scoring: `(satisfied weights) / (total weights)` → verdicts: `pass` (≥0.8), `borderline` (≥0.6), `fail`
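To make the arithmetic concrete, here is a small illustrative Python sketch of that formula. The thresholds and the rule that every required rubric must be satisfied come from the verdict rules in the previous version of this README; the rubric ids and weights are made up, and this is not AgentV's internal code:

```python
# Sketch of the documented rubric scoring rule (illustrative, not AgentV source).
rubrics = [
    {"id": "structure", "weight": 1.0, "required": True,  "satisfied": True},
    {"id": "examples",  "weight": 0.5, "required": False, "satisfied": False},
    {"id": "accuracy",  "weight": 1.0, "required": True,  "satisfied": True},
]

score = sum(r["weight"] for r in rubrics if r["satisfied"]) / sum(r["weight"] for r in rubrics)
required_ok = all(r["satisfied"] for r in rubrics if r["required"])

if score >= 0.8 and required_ok:
    verdict = "pass"
elif score >= 0.6 and required_ok:
    verdict = "borderline"
else:
    verdict = "fail"

print(f"score={score:.2f} verdict={verdict}")  # score=0.80 verdict=pass
```

With satisfied weights of 2.0 out of 2.5 total, the score is 0.80, which clears the pass threshold because both required rubrics are met.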
 
+ Auto-generate rubrics from expected outcomes:
  ```bash
- # Generate rubrics for all eval cases without rubrics
  agentv generate rubrics evals/my-eval.yaml
-
- # Use a specific LLM target for generation
- agentv generate rubrics evals/my-eval.yaml --target openai:gpt-4o
  ```

- ### Scoring and Verdicts
-
- - **Score**: (sum of satisfied weights) / (total weights)
- - **Verdicts**:
-   - `pass`: Score ≥ 0.8 and all required rubrics met
-   - `borderline`: Score ≥ 0.6 and all required rubrics met
-   - `fail`: Score < 0.6 or any required rubric failed
-
- For complete examples and detailed patterns, see [examples/features/evals/rubric/](examples/features/evals/rubric/).
+ See [rubric-evaluator skill](.claude/skills/agentv-eval-builder/references/rubric-evaluator.md) for detailed patterns.

  ## Advanced Configuration

- ### Retry Configuration
-
- AgentV supports automatic retry with exponential backoff for handling rate limiting (HTTP 429) and transient errors. All retry configuration fields are optional and work with Azure, Anthropic, and Gemini providers.
-
- **Available retry fields:**
+ ### Retry Behavior

- | Field | Type | Default | Description |
- |-------|------|---------|-------------|
- | `max_retries` | number | 3 | Maximum number of retry attempts |
- | `retry_initial_delay_ms` | number | 1000 | Initial delay in milliseconds before first retry |
- | `retry_max_delay_ms` | number | 60000 | Maximum delay cap in milliseconds |
- | `retry_backoff_factor` | number | 2 | Exponential backoff multiplier |
- | `retry_status_codes` | number[] | [500, 408, 429, 502, 503, 504] | HTTP status codes to retry |
-
- **Example configuration:**
+ Configure automatic retry with exponential backoff:

  ```yaml
  targets:
    - name: azure_base
      provider: azure
-     endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
-     api_key: ${{ AZURE_OPENAI_API_KEY }}
-     model: gpt-4
-     version: ${{ AZURE_OPENAI_API_VERSION }} # Optional: API version (defaults to 2024-12-01-preview)
-     max_retries: 5 # Maximum retry attempts
-     retry_initial_delay_ms: 2000 # Initial delay before first retry
-     retry_max_delay_ms: 120000 # Maximum delay cap
-     retry_backoff_factor: 2 # Exponential backoff multiplier
-     retry_status_codes: [500, 408, 429, 502, 503, 504] # HTTP status codes to retry
+     max_retries: 5
+     retry_initial_delay_ms: 2000
+     retry_max_delay_ms: 120000
+     retry_backoff_factor: 2
+     retry_status_codes: [500, 408, 429, 502, 503, 504]
  ```

- **Retry behavior:**
- - Exponential backoff with jitter (0.75-1.25x) to avoid thundering herd
- - Automatically retries on HTTP 429 (rate limiting), 5xx errors, and network failures
- - Respects abort signals for cancellation
- - If no retry config is specified, uses sensible defaults
+ Automatically retries on rate limits, transient 5xx errors, and network failures with jitter.
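As an illustration of the delay schedule these fields imply, here is a small Python sketch. It assumes the delay starts at the initial value, is multiplied by the backoff factor on each attempt, is capped at the maximum, and receives the 0.75-1.25x jitter mentioned in the previous version of this README; it is not AgentV's implementation:

```python
import random

# Illustrative backoff schedule for the configuration above (assumed semantics).
initial_ms, max_ms, factor, max_retries = 2000, 120_000, 2, 5

for attempt in range(max_retries):
    base = min(initial_ms * factor ** attempt, max_ms)  # 2s, 4s, 8s, 16s, 32s
    delay = base * random.uniform(0.75, 1.25)            # jitter avoids thundering herd
    print(f"retry {attempt + 1}: ~{delay / 1000:.1f}s")
```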
+
+ ## Documentation & Learning
+
+ **Getting Started:**
+ - Run `agentv init` to set up your first evaluation workspace
+ - Check [examples/README.md](examples/README.md) for demos (math, code generation, tool use)
+ - AI agents: Ask Claude Code to `/agentv-eval-builder` to create and iterate on evals
+
+ **Detailed Guides:**
+ - [Evaluation format and structure](.claude/skills/agentv-eval-builder/SKILL.md)
+ - [Custom evaluators](.claude/skills/agentv-eval-builder/references/custom-evaluators.md)
+ - [Structured data evaluation](.claude/skills/agentv-eval-builder/references/structured-data-evaluators.md)
+
+ **Reference:**
+ - Monorepo structure: `packages/core/` (engine), `packages/eval/` (evaluation logic), `apps/cli/` (commands)

- ## Related Projects
+ ## Contributing

- - [subagent](https://github.com/EntityProcess/subagent) - VS Code Copilot programmatic interface
- - [ai-sdk](https://github.com/vercel/ai) - Vercel AI SDK
- - [Agentic Context Engineering (ACE)](https://github.com/ax-llm/ax/blob/main/docs/ACE.md)
+ See [AGENTS.md](AGENTS.md) for development guidelines, design principles, and quality assurance workflow.
 
  ## License