agentv 2.0.2 → 2.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,431 +1,305 @@
   # AgentV
 
 - A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, OpenAI Codex CLI, Pi Coding Agent, and Azure OpenAI.
 + **CLI-first AI agent evaluation. No server. No signup. No overhead.**
 
 - ## Installation and Setup
 + AgentV evaluates your agents locally with multi-objective scoring (correctness, latency, cost, safety) from YAML specifications. Deterministic code judges + customizable LLM judges, all version-controlled in Git.
 
 - ### Installation for End Users
 -
 - This is the recommended method for users who want to use `agentv` as a command-line tool.
 -
 - 1. Install via npm:
 + ## Installation
 
 + **1. Install:**
   ```bash
 - # Install globally
   npm install -g agentv
 -
 - # Or use npx to run without installing
 - npx agentv --help
 - ```
 -
 - 2. Verify the installation:
 -
 - ```bash
 - agentv --help
 - ```
 -
 - ### Local Development Setup
 -
 - Follow these steps if you want to contribute to the `agentv` project itself. This workflow uses Bun workspaces for fast, efficient dependency management.
 -
 - 1. Clone the repository and navigate into it:
 -
 - ```bash
 - git clone https://github.com/EntityProcess/agentv.git
 - cd agentv
   ```
 
 - 2. Install dependencies:
 -
 - ```bash
 - # Install Bun if you don't have it
 - curl -fsSL https://bun.sh/install | bash # macOS/Linux
 - # or
 - powershell -c "irm bun.sh/install.ps1 | iex" # Windows
 -
 - # Install all workspace dependencies
 - bun install
 - ```
 -
 - 3. Build the project:
 -
 + **2. Initialize your workspace:**
   ```bash
 - bun run build
 + agentv init
   ```
 
 - 4. Run tests:
 + **3. Configure environment variables:**
 + - The init command creates a `.env.example` file in your project root
 + - Copy `.env.example` to `.env` and fill in your API keys, endpoints, and other configuration values
 + - Update the environment variable names in `.agentv/targets.yaml` to match those defined in your `.env` file
 
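 For example, a minimal `.env` pairing with the Azure target shown later in this README might look like the sketch below. The variable names are simply the ones referenced via `${{ ... }}` in the sample `targets.yaml`; use whatever names your own targets file references, and treat the values as placeholders.
 
 ```bash
 # .env (illustrative values only)
 AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com
 AZURE_OPENAI_API_KEY=<your-api-key>
 AZURE_DEPLOYMENT_NAME=<your-deployment-name>
 ```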
 - ```bash
 - bun test
 + **4. Create an eval** (`./evals/example.yaml`):
 + ```yaml
 + description: Math problem solving evaluation
 + execution:
 +   target: default
 +
 + evalcases:
 +   - id: addition
 +     expected_outcome: Correctly calculates 15 + 27 = 42
 +
 +     input_messages:
 +       - role: user
 +         content: What is 15 + 27?
 +
 +     expected_messages:
 +       - role: assistant
 +         content: "42"
 +
 +     execution:
 +       evaluators:
 +         - name: math_check
 +           type: code_judge
 +           script: ./validators/check_math.py
   ```
 
 - 5. (Optional) Install example dependencies:
 -
 + **5. Run the eval:**
   ```bash
 - bun run examples:install
 + agentv eval ./evals/example.yaml
   ```
 
 - This step is required if you want to run the examples in the `examples/` directory, as they are self-contained packages with their own dependencies.
 -
 - You are now ready to start development. The monorepo contains:
 + Results appear in `.agentv/results/eval_<timestamp>.jsonl` with scores, reasoning, and execution traces.
 
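 Because the results file is line-delimited JSON, each record can be inspected with standard tooling. A quick look (assuming `jq` is installed; it is not something AgentV requires):
 
 ```bash
 # Pretty-print the last record of a run
 tail -n 1 .agentv/results/eval_<timestamp>.jsonl | jq .
 ```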
 - - `packages/core/` - Core evaluation engine
 - - `apps/cli/` - Command-line interface
 + Learn more in the [examples/](examples/README.md) directory. For a detailed comparison with other frameworks, see [docs/COMPARISON.md](docs/COMPARISON.md).
 
 - ### Environment Setup
 + ## Why AgentV?
 
 - 1. Initialize your workspace:
 -    - Run `agentv init` at the root of your repository
 -    - This command automatically sets up the `.agentv/` directory structure and configuration files
 + | Feature | AgentV | [LangWatch](https://github.com/langwatch/langwatch) | [LangSmith](https://github.com/langchain-ai/langsmith-sdk) | [LangFuse](https://github.com/langfuse/langfuse) |
 + |---------|--------|-----------|-----------|----------|
 + | **Setup** | `npm install` | Cloud account + API key | Cloud account + API key | Cloud account + API key |
 + | **Server** | None (local) | Managed cloud | Managed cloud | Managed cloud |
 + | **Privacy** | All local | Cloud-hosted | Cloud-hosted | Cloud-hosted |
 + | **CLI-first** | ✓ | ✗ | Limited | Limited |
 + | **CI/CD ready** | ✓ | Requires API calls | Requires API calls | Requires API calls |
 + | **Version control** | ✓ (YAML in Git) | ✗ | ✗ | ✗ |
 + | **Evaluators** | Code + LLM + Custom | LLM only | LLM + Code | LLM only |
 
 - 2. Configure environment variables:
 -    - The init command creates a `.env.template` file in your project root
 -    - Copy `.env.template` to `.env` and fill in your API keys, endpoints, and other configuration values
 -    - Update the environment variable names in `.agentv/targets.yaml` to match those defined in your `.env` file
 + **Best for:** Developers who want evaluation in their workflow, not a separate dashboard. Teams prioritizing privacy and reproducibility.
 
 - ## Quick Start
 + ## Features
 
 - You can use the following examples as a starting point:
 - - [Examples](examples/README.md): Feature demonstrations and real-world showcase examples
 + - **Multi-objective scoring**: Correctness, latency, cost, safety in one run
 + - **Multiple evaluator types**: Code validators, LLM judges, custom Python/TypeScript
 + - **Built-in targets**: VS Code Copilot, Codex CLI, Pi Coding Agent, Azure OpenAI, local CLI agents
 + - **Structured evaluation**: Rubric-based grading with weights and requirements
 + - **Batch evaluation**: Run hundreds of test cases in parallel
 + - **Export**: JSON, JSONL, YAML formats
 + - **Compare results**: Compute deltas between evaluation runs for A/B testing
 
 - ### Validating Eval Files
 + ## Development
 
 - Validate your eval and targets files before running them:
 + Contributing to AgentV? Clone and set up the repository:
 
   ```bash
 - # Validate a single file
 - agentv validate evals/my-eval.yaml
 -
 - # Validate multiple files
 - agentv validate evals/eval1.yaml evals/eval2.yaml
 -
 - # Validate entire directory (recursively finds all YAML files)
 - agentv validate evals/
 - ```
 -
 - ### Running Evals
 -
 - Run eval (target auto-selected from eval file or CLI override):
 -
 - ```bash
 - # If your eval.yaml contains "target: azure_base", it will be used automatically
 - agentv eval "path/to/eval.yaml"
 -
 - # Override the eval file's target with CLI flag
 - agentv eval --target vscode_projectx "path/to/eval.yaml"
 + git clone https://github.com/EntityProcess/agentv.git
 + cd agentv
 
 - # Run multiple evals via glob
 - agentv eval "path/to/evals/**/*.yaml"
 - ```
 + # Install Bun if you don't have it
 + curl -fsSL https://bun.sh/install | bash
 
 - Run a specific eval case with custom targets path:
 + # Install dependencies and build
 + bun install && bun run build
 
 - ```bash
 - agentv eval --target vscode_projectx --targets "path/to/targets.yaml" --eval-id "my-eval-case" "path/to/eval.yaml"
 + # Run tests
 + bun test
   ```
 
 - ### Command Line Options
 -
 - - `eval_paths...`: Path(s) or glob(s) to eval YAML files (required; e.g., `evals/**/*.yaml`)
 - - `--target TARGET`: Execution target name from targets.yaml (overrides target specified in eval file)
 - - `--targets TARGETS`: Path to targets.yaml file (default: ./.agentv/targets.yaml)
 - - `--eval-id EVAL_ID`: Run only the eval case with this specific ID
 - - `--out OUTPUT_FILE`: Output file path (default: .agentv/results/eval_<timestamp>.jsonl)
 - - `--output-format FORMAT`: Output format: 'jsonl' or 'yaml' (default: jsonl)
 - - `--dry-run`: Run with mock model for testing
 - - `--agent-timeout SECONDS`: Timeout in seconds for agent response polling (default: 120)
 - - `--max-retries COUNT`: Maximum number of retries for timeout cases (default: 2)
 - - `--cache`: Enable caching of LLM responses (default: disabled)
 - - `--workers COUNT`: Parallel workers for eval cases (default: 3; target `workers` setting used when provided)
 - - `--verbose`: Verbose output
 + See [AGENTS.md](AGENTS.md) for development guidelines and design principles.
 
 - ### Target Selection Priority
 + ## Core Concepts
 
 - The CLI determines which execution target to use with the following precedence:
 + **Evaluation files** (`.yaml`) define test cases with expected outcomes. **Targets** specify which agent/provider to evaluate. **Judges** (code or LLM) score results. **Results** are written as JSONL/YAML for analysis and comparison.
 
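 Put together, a workspace typically ends up looking something like the sketch below (assembled from the paths used elsewhere in this README; the exact layout is up to you):
 
 ```text
 .agentv/
   targets.yaml        # execution targets (providers, judges, retries)
   results/            # eval_<timestamp>.jsonl / .yaml output
 evals/
   example.yaml        # eval cases with expected outcomes and rubrics
 validators/
   check_math.py       # code judges (stdin JSON in, score JSON out)
 judges/
   correctness.md      # LLM judge prompts
 .env                  # secrets referenced via ${{ VARIABLE_NAME }}
 ```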
 - 1. CLI flag override: `--target my_target` (when provided and not 'default')
 - 2. Eval file specification: `target: my_target` key in the .eval.yaml file
 - 3. Default fallback: Uses the 'default' target (original behavior)
 + ## Usage
 
 - This allows eval files to specify their preferred target while still allowing command-line overrides for flexibility, and maintains backward compatibility with existing workflows.
 + ### Running Evaluations
 
 - Output goes to `.agentv/results/eval_<timestamp>.jsonl` (or `.yaml`) unless `--out` is provided.
 -
 - ### Tips for VS Code Copilot Evals
 + ```bash
 + # Validate evals
 + agentv validate evals/my-eval.yaml
 
 - **Workspace Switching:** The runner automatically switches to the target workspace when running evals. Make sure you're not actively using another VS Code instance, as this could cause prompts to be injected into the wrong workspace.
 + # Run an eval with default target (from eval file or targets.yaml)
 + agentv eval evals/my-eval.yaml
 
 - **Recommended Models:** Use Claude Sonnet 4.5 or Grok Code Fast 1 for best results, as these models are more consistent in following instruction chains.
 + # Override target
 + agentv eval --target azure_base evals/**/*.yaml
 
 - ## Targets and Environment Variables
 + # Run specific eval case
 + agentv eval --eval-id case-123 evals/my-eval.yaml
 
 - Execution targets in `.agentv/targets.yaml` decouple evals from providers/settings and provide flexible environment variable mapping.
 + # Dry-run with mock provider
 + agentv eval --dry-run evals/my-eval.yaml
 + ```
 
 - ### Target Configuration Structure
 + See `agentv eval --help` for all options: workers, timeouts, output formats, trace dumping, and more.
 
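 As one illustration, the flags documented for the 2.0.2 CLI (listed in the removed options section above) combine like this; check `agentv eval --help` on your installed version for the current names and defaults:
 
 ```bash
 agentv eval evals/**/*.yaml \
   --targets ./.agentv/targets.yaml \
   --workers 5 \
   --agent-timeout 180 \
   --output-format yaml \
   --out .agentv/results/nightly.yaml \
   --cache \
   --verbose
 ```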
 - Each target specifies:
 + ### Create Custom Evaluators
 
 - - `name`: Unique identifier for the target
 - - `provider`: The model provider (`azure`, `anthropic`, `gemini`, `codex`, `pi-coding-agent`, `vscode`, `vscode-insiders`, `cli`, or `mock`)
 - - Provider-specific configuration fields at the top level (no `settings` wrapper needed)
 - - Optional fields: `judge_target`, `workers`, `provider_batching`
 + Write code judges in Python or TypeScript:
 
 - ### Examples
 + ```python
 + # validators/check_answer.py
 + import json, sys
 + data = json.load(sys.stdin)
 + candidate_answer = data.get("candidate_answer", "")
 
 - **Azure OpenAI targets:**
 + hits = []
 + misses = []
 
 - ```yaml
 - - name: azure_base
 -   provider: azure
 -   endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
 -   api_key: ${{ AZURE_OPENAI_API_KEY }}
 -   model: ${{ AZURE_DEPLOYMENT_NAME }}
 -   version: ${{ AZURE_OPENAI_API_VERSION }} # Optional: defaults to 2024-12-01-preview
 - ```
 + if "42" in candidate_answer:
 +     hits.append("Answer contains correct value (42)")
 + else:
 +     misses.append("Answer does not contain expected value (42)")
 
 - Note: Environment variables are referenced using `${{ VARIABLE_NAME }}` syntax. The actual values are resolved from your `.env` file at runtime.
 + score = 1.0 if hits else 0.0
 
 - **VS Code targets:**
 -
 - ```yaml
 - - name: vscode_projectx
 -   provider: vscode
 -   workspace_template: ${{ PROJECTX_WORKSPACE_PATH }}
 -   provider_batching: false
 -   judge_target: azure_base
 -
 - - name: vscode_insiders_projectx
 -   provider: vscode-insiders
 -   workspace_template: ${{ PROJECTX_WORKSPACE_PATH }}
 -   provider_batching: false
 -   judge_target: azure_base
 + print(json.dumps({
 +     "score": score,
 +     "hits": hits,
 +     "misses": misses,
 +     "reasoning": f"Passed {len(hits)} check(s)"
 + }))
   ```
 
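 A TypeScript judge follows the same stdin/stdout contract; here is a minimal sketch using only plain Node APIs (no AgentV SDK; see the templates linked below for the SDK-based variant):
 
 ```typescript
 // validators/check_answer.ts (same contract as the Python judge above)
 import { stdin, stdout } from "node:process";
 
 let raw = "";
 stdin.on("data", (chunk) => (raw += chunk));
 stdin.on("end", () => {
   const data = JSON.parse(raw);
   const answer: string = data.candidate_answer ?? "";
 
   const hits: string[] = [];
   const misses: string[] = [];
   if (answer.includes("42")) hits.push("Answer contains correct value (42)");
   else misses.push("Answer does not contain expected value (42)");
 
   const score = hits.length > 0 ? 1.0 : 0.0;
   stdout.write(JSON.stringify({ score, hits, misses, reasoning: `Passed ${hits.length} check(s)` }));
 });
 ```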
 - **CLI targets (template-based):**
 + Reference evaluators in your eval file:
 
   ```yaml
 - - name: local_cli
 -   provider: cli
 -   judge_target: azure_base
 -   command_template: 'uv run ./my_agent.py --prompt {PROMPT} {FILES}'
 -   files_format: '--file {path}'
 -   cwd: ${{ CLI_EVALS_DIR }} # optional working directory
 -   timeout_seconds: 30 # optional per-command timeout
 -   healthcheck:
 -     type: command # or http
 -     command_template: uv run ./my_agent.py --healthcheck
 + execution:
 +   evaluators:
 +     - name: my_validator
 +       type: code_judge
 +       script: ./validators/check_answer.py
   ```
 
 - **Supported placeholders in CLI commands:**
 - - `{PROMPT}` - The rendered prompt text (shell-escaped)
 - - `{FILES}` - Expands to multiple file arguments using `files_format` template
 - - `{GUIDELINES}` - Guidelines content
 - - `{EVAL_ID}` - Current eval case ID
 - - `{ATTEMPT}` - Retry attempt number
 - - `{OUTPUT_FILE}` - Path to output file (for agents that write responses to disk)
 -
 - **Codex CLI targets:**
 + For complete templates, examples, and evaluator patterns, see: [custom-evaluators.md](.claude/skills/agentv-eval-builder/references/custom-evaluators.md)
 
 - ```yaml
 - - name: codex_cli
 -   provider: codex
 -   judge_target: azure_base
 -   executable: ${{ CODEX_CLI_PATH }} # defaults to `codex` if omitted
 -   args: # optional CLI arguments
 -     - --profile
 -     - ${{ CODEX_PROFILE }}
 -     - --model
 -     - ${{ CODEX_MODEL }}
 -   timeout_seconds: 180
 -   cwd: ${{ CODEX_WORKSPACE_DIR }}
 -   log_format: json # 'summary' or 'json'
 - ```
 + ### Compare Evaluation Results
 
 - Codex targets require the standalone `codex` CLI and a configured profile (via `codex configure`) so credentials are stored in `~/.codex/config` (or whatever path the CLI already uses). AgentV mirrors all guideline and attachment files into a fresh scratch workspace, so the `file://` preread links remain valid even when the CLI runs outside your repo tree.
 - Confirm the CLI works by running `codex exec --json --profile <name> "ping"` (or any supported dry run) before starting an eval. This prints JSONL events; seeing `item.completed` messages indicates the CLI is healthy.
 + Run two evaluations and compare them:
 
 - **Pi Coding Agent targets:**
 -
 - ```yaml
 - - name: pi
 -   provider: pi-coding-agent
 -   judge_target: gemini_base
 -   executable: ${{ PI_CLI_PATH }} # Optional: defaults to `pi` if omitted
 -   pi_provider: google # google, anthropic, openai, groq, xai, openrouter
 -   model: ${{ GEMINI_MODEL_NAME }}
 -   api_key: ${{ GOOGLE_GENERATIVE_AI_API_KEY }}
 -   tools: read,bash,edit,write # Available tools for the agent
 -   timeout_seconds: 180
 -   cwd: ${{ PI_WORKSPACE_DIR }} # Optional: run in specific directory
 -   log_format: json # 'summary' (default) or 'json' for full logs
 -   # system_prompt: optional override for the default system prompt
 + ```bash
 + agentv eval evals/my-eval.yaml --out before.jsonl
 + # ... make changes to your agent ...
 + agentv eval evals/my-eval.yaml --out after.jsonl
 + agentv compare before.jsonl after.jsonl --threshold 0.1
   ```
 
 - Pi Coding Agent is an autonomous coding CLI from [pi-mono](https://github.com/badlogic/pi-mono). Install it globally with `npm install -g @mariozechner/pi-coding-agent` (or use a local path via `executable`). It supports multiple LLM providers and outputs JSONL events. AgentV extracts tool trajectories from the output for trace-based evaluation. File attachments are passed using Pi's native `@path` syntax.
 + Output shows wins, losses, ties, and mean delta to identify improvements.
 
 - By default, a system prompt instructs the agent to include code in its response (required for evaluation scoring). Use `system_prompt` to override this behavior.
 + ## Targets Configuration
 
 - ## Writing Custom Evaluators
 + Define execution targets in `.agentv/targets.yaml` to decouple evals from providers:
 
 - ### Code Evaluator I/O Contract
 -
 - Code evaluators receive input via stdin and write output to stdout as JSON.
 + ```yaml
 + targets:
 +   - name: azure_base
 +     provider: azure
 +     endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
 +     api_key: ${{ AZURE_OPENAI_API_KEY }}
 +     model: ${{ AZURE_DEPLOYMENT_NAME }}
 
 - **Input Format (via stdin):**
 - ```json
 - {
 -   "question": "string describing the task/question",
 -   "expected_outcome": "expected outcome description",
 -   "reference_answer": "gold standard answer (optional)",
 -   "candidate_answer": "generated code/text from the agent",
 -   "guideline_files": ["path/to/guideline1.md", "path/to/guideline2.md"],
 -   "input_files": ["path/to/data.json", "path/to/config.yaml"],
 -   "input_messages": [{"role": "user", "content": "..."}]
 - }
 - ```
 +   - name: vscode_dev
 +     provider: vscode
 +     workspace_template: ${{ WORKSPACE_PATH }}
 +     judge_target: azure_base
 
 - **Output Format (to stdout):**
 - ```json
 - {
 -   "score": 0.85,
 -   "hits": ["list of successful checks"],
 -   "misses": ["list of failed checks"],
 -   "reasoning": "explanation of the score"
 - }
 +   - name: local_agent
 +     provider: cli
 +     command_template: 'python agent.py --prompt {PROMPT}'
 +     judge_target: azure_base
   ```
 
 - **Key Points:**
 - - Evaluators receive **full context** but should select only relevant fields
 - - Most evaluators only need `candidate_answer` field - ignore the rest to avoid false positives
 - - Complex evaluators can use `question`, `reference_answer`, or `guideline_paths` for context-aware validation
 - - Score range: `0.0` to `1.0` (float)
 - - `hits` and `misses` are optional but recommended for debugging
 -
 - ### Code Evaluator Templates
 -
 - Custom evaluators can be written in any language. For complete templates and examples:
 + Supports: `azure`, `anthropic`, `gemini`, `codex`, `pi-coding-agent`, `claude-code`, `vscode`, `vscode-insiders`, `cli`, and `mock`.
 
 - - **Python template**: See `apps/cli/src/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md`
 - - **TypeScript template (with SDK)**: See `apps/cli/src/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md`
 - - **Working examples**: See [examples/features/code-judge-sdk](examples/features/code-judge-sdk)
 + Use `${{ VARIABLE_NAME }}` syntax to reference your `.env` file. See `.agentv/targets.yaml` after `agentv init` for detailed examples and all provider-specific fields.
 
 - ### LLM Judge Template Structure
 + ## Evaluation Features
 
 - ```markdown
 - # Judge Name
 + ### Code Judges
 
 - Evaluation criteria and guidelines...
 + Write validators in any language (Python, TypeScript, Node, etc.):
 
 - ## Scoring Guidelines
 - 0.9-1.0: Excellent
 - 0.7-0.8: Good
 - ...
 -
 - ## Output Format
 - {
 -   "score": 0.85,
 -   "passed": true,
 -   "reasoning": "..."
 - }
 + ```bash
 + # Input: stdin JSON with question, expected_outcome, candidate_answer
 + # Output: stdout JSON with score (0-1), hits, misses, reasoning
   ```
 
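 The full output shape, as spelled out in the 2.0.2 README's code-evaluator I/O contract above, is:
 
 ```json
 {
   "score": 0.85,
   "hits": ["list of successful checks"],
   "misses": ["list of failed checks"],
   "reasoning": "explanation of the score"
 }
 ```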
 - ## Rubric-Based Evaluation
 -
 - AgentV supports structured evaluation through rubrics - lists of criteria that define what makes a good response. Rubrics are checked by an LLM judge and scored based on weights and requirements.
 + For complete examples and patterns, see:
 + - [custom-evaluators skill](.claude/skills/agentv-eval-builder/references/custom-evaluators.md)
 + - [code-judge-sdk example](examples/features/code-judge-sdk)
 
 - ### Basic Usage
 + ### LLM Judges
 
 - Define rubrics inline using simple strings:
 + Create markdown judge files with evaluation criteria and scoring guidelines:
 
   ```yaml
 - - id: example-1
 -   expected_outcome: Explain quicksort algorithm
 -   rubrics:
 -     - Mentions divide-and-conquer approach
 -     - Explains the partition step
 -     - States time complexity correctly
 + execution:
 +   evaluators:
 +     - name: semantic_check
 +       type: llm_judge
 +       prompt: ./judges/correctness.md
   ```
 
 - Or use detailed objects for fine-grained control:
 + Your judge prompt file defines criteria and scoring guidelines.
 +
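 A judge prompt can follow the markdown template that shipped with 2.0.2, for example:
 
 ```markdown
 # Judge Name
 
 Evaluation criteria and guidelines...
 
 ## Scoring Guidelines
 0.9-1.0: Excellent
 0.7-0.8: Good
 ...
 
 ## Output Format
 {
   "score": 0.85,
   "passed": true,
   "reasoning": "..."
 }
 ```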
 + ### Rubric-Based Evaluation
 +
 + Define structured criteria directly in your eval case:
 
   ```yaml
 - rubrics:
 -   - id: structure
 -     description: Has clear headings and organization
 -     weight: 1.0
 -     required: true
 -   - id: examples
 -     description: Includes practical examples
 -     weight: 0.5
 -     required: false
 + evalcases:
 +   - id: quicksort-explain
 +     expected_outcome: Explain how quicksort works
 +
 +     input_messages:
 +       - role: user
 +         content: Explain quicksort algorithm
 +
 +     rubrics:
 +       - Mentions divide-and-conquer approach
 +       - Explains partition step
 +       - States time complexity
   ```
 
 - ### Generate Rubrics
 -
 - Automatically generate rubrics from `expected_outcome` fields:
 + Scoring: `(satisfied weights) / (total weights)` → verdicts: `pass` (≥0.8), `borderline` (≥0.6), `fail`
 
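 As a worked example using the weighted form from the 2.0.2 docs above: with a required `structure` rubric at weight 1.0 satisfied and an optional `examples` rubric at weight 0.5 missed, the score is 1.0 / 1.5 ≈ 0.67, a `borderline` verdict; any missed *required* rubric is a `fail` regardless of the numeric score.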
 + Auto-generate rubrics from expected outcomes:
   ```bash
 - # Generate rubrics for all eval cases without rubrics
   agentv generate rubrics evals/my-eval.yaml
 -
 - # Use a specific LLM target for generation
 - agentv generate rubrics evals/my-eval.yaml --target openai:gpt-4o
   ```
 
 - ### Scoring and Verdicts
 -
 - - **Score**: (sum of satisfied weights) / (total weights)
 - - **Verdicts**:
 -   - `pass`: Score ≥ 0.8 and all required rubrics met
 -   - `borderline`: Score ≥ 0.6 and all required rubrics met
 -   - `fail`: Score < 0.6 or any required rubric failed
 -
 - For complete examples and detailed patterns, see [examples/features/rubric/](examples/features/rubric/).
 + See [rubric-evaluator skill](.claude/skills/agentv-eval-builder/references/rubric-evaluator.md) for detailed patterns.
 
   ## Advanced Configuration
 
 - ### Retry Configuration
 -
 - AgentV supports automatic retry with exponential backoff for handling rate limiting (HTTP 429) and transient errors. All retry configuration fields are optional and work with Azure, Anthropic, and Gemini providers.
 -
 - **Available retry fields:**
 + ### Retry Behavior
 
 - | Field | Type | Default | Description |
 - |-------|------|---------|-------------|
 - | `max_retries` | number | 3 | Maximum number of retry attempts |
 - | `retry_initial_delay_ms` | number | 1000 | Initial delay in milliseconds before first retry |
 - | `retry_max_delay_ms` | number | 60000 | Maximum delay cap in milliseconds |
 - | `retry_backoff_factor` | number | 2 | Exponential backoff multiplier |
 - | `retry_status_codes` | number[] | [500, 408, 429, 502, 503, 504] | HTTP status codes to retry |
 -
 - **Example configuration:**
 + Configure automatic retry with exponential backoff:
 
   ```yaml
   targets:
     - name: azure_base
       provider: azure
 -     endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
 -     api_key: ${{ AZURE_OPENAI_API_KEY }}
 -     model: gpt-4
 -     version: ${{ AZURE_OPENAI_API_VERSION }} # Optional: API version (defaults to 2024-12-01-preview)
 -     max_retries: 5 # Maximum retry attempts
 -     retry_initial_delay_ms: 2000 # Initial delay before first retry
 -     retry_max_delay_ms: 120000 # Maximum delay cap
 -     retry_backoff_factor: 2 # Exponential backoff multiplier
 -     retry_status_codes: [500, 408, 429, 502, 503, 504] # HTTP status codes to retry
 +     max_retries: 5
 +     retry_initial_delay_ms: 2000
 +     retry_max_delay_ms: 120000
 +     retry_backoff_factor: 2
 +     retry_status_codes: [500, 408, 429, 502, 503, 504]
   ```
 
 - **Retry behavior:**
 - - Exponential backoff with jitter (0.75-1.25x) to avoid thundering herd
 - - Automatically retries on HTTP 429 (rate limiting), 5xx errors, and network failures
 - - Respects abort signals for cancellation
 - - If no retry config is specified, uses sensible defaults
 + Automatically retries on rate limits, transient 5xx errors, and network failures with jitter.
 +
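 With the values above and the jitter documented for 2.0.2 (a random 0.75-1.25x multiplier on each delay), the back-off sequence is roughly 2 s, 4 s, 8 s, 16 s, 32 s across the five attempts, and it would keep doubling up to the 120 s cap if more retries were configured.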
 + ## Documentation & Learning
 +
 + **Getting Started:**
 + - Run `agentv init` to set up your first evaluation workspace
 + - Check [examples/README.md](examples/README.md) for demos (math, code generation, tool use)
 + - AI agents: Ask Claude Code to `/agentv-eval-builder` to create and iterate on evals
 +
 + **Detailed Guides:**
 + - [Evaluation format and structure](.claude/skills/agentv-eval-builder/SKILL.md)
 + - [Custom evaluators](.claude/skills/agentv-eval-builder/references/custom-evaluators.md)
 + - [Structured data evaluation](.claude/skills/agentv-eval-builder/references/structured-data-evaluators.md)
 +
 + **Reference:**
 + - Monorepo structure: `packages/core/` (engine), `packages/eval/` (evaluation logic), `apps/cli/` (commands)
 
 - ## Related Projects
 + ## Contributing
 
 - - [subagent](https://github.com/EntityProcess/subagent) - VS Code Copilot programmatic interface
 - - [ai-sdk](https://github.com/vercel/ai) - Vercel AI SDK
 - - [Agentic Context Engineering (ACE)](https://github.com/ax-llm/ax/blob/main/docs/ACE.md)
 + See [AGENTS.md](AGENTS.md) for development guidelines, design principles, and quality assurance workflow.
 
   ## License