agentv 4.0.0 → 4.1.1

This diff shows the content changes between publicly released versions of this package, as published to a supported registry. It is provided for informational purposes only.
package/README.md CHANGED
@@ -1,314 +1,90 @@
  # AgentV
 
- **CLI-first AI agent evaluation. No server. No signup. No overhead.**
+ **Evaluate AI agents from the terminal. No server. No signup.**
 
- AgentV evaluates your agents locally with multi-objective scoring (correctness, latency, cost, safety) from YAML specifications. Deterministic code graders + customizable LLM graders, all version-controlled in Git.
-
- ## Installation
-
- ### All Agents Plugin Manager
-
- **1. Add AgentV marketplace source:**
- ```bash
- npx allagents plugin marketplace add EntityProcess/agentv
- ```
-
- **2. Ask Claude to set up AgentV in your current repository**
- Example prompt:
- ```text
- Set up AgentV in this repo.
- ```
-
- The `agentv-onboarding` skill bootstraps setup automatically:
- - verifies `agentv` CLI availability
- - installs the CLI if needed
- - runs `agentv init`
- - verifies setup artifacts
-
- ### CLI-Only Setup (Fallback)
-
- If you are not using Claude plugins, use the CLI directly.
-
- **1. Install:**
- ```bash
- bun install -g agentv
- ```
-
- Or with npm:
  ```bash
  npm install -g agentv
- ```
-
- **2. Initialize your workspace:**
- ```bash
  agentv init
+ agentv eval evals/example.yaml
  ```
 
- **3. Configure environment variables:**
- - The init command creates a `.env.example` file in your project root
- - Copy `.env.example` to `.env` and fill in your API keys, endpoints, and other configuration values
- - Update the environment variable names in `.agentv/targets.yaml` to match those defined in your `.env` file
+ That's it. Results in seconds, not minutes.
 
- **4. Create an eval** (`./evals/example.yaml`):
- ```yaml
- description: Math problem solving evaluation
- execution:
-   target: default
+ ## What it does
 
+ AgentV runs evaluation cases against your AI agents and scores them with deterministic code graders + customizable LLM graders. Everything lives in Git — YAML eval files, markdown judge prompts, JSONL results.
+
+ ```yaml
+ # evals/math.yaml
+ description: Math problem solving
  tests:
    - id: addition
-     criteria: Correctly calculates 15 + 27 = 42
-
      input: What is 15 + 27?
-
      expected_output: "42"
-
      assertions:
-       - name: math_check
-         type: code-grader
-         command: ./validators/check_math.py
+       - type: contains
+         value: "42"
  ```
 
- **5. Run the eval:**
  ```bash
- agentv eval ./evals/example.yaml
+ agentv eval evals/math.yaml
  ```
 
- Results appear in `.agentv/results/eval_<timestamp>.jsonl` with scores, reasoning, and execution traces.
-
- Learn more in the [examples/](examples/README.md) directory. For a detailed comparison with other frameworks, see [docs/COMPARISON.md](docs/COMPARISON.md).
-
  ## Why AgentV?
 
- | Feature | AgentV | [LangWatch](https://github.com/langwatch/langwatch) | [LangSmith](https://github.com/langchain-ai/langsmith-sdk) | [LangFuse](https://github.com/langfuse/langfuse) |
- |---------|--------|-----------|-----------|----------|
- | **Setup** | `bun install -g agentv` | Cloud account + API key | Cloud account + API key | Cloud account + API key |
- | **Server** | None (local) | Managed cloud | Managed cloud | Managed cloud |
- | **Privacy** | All local | Cloud-hosted | Cloud-hosted | Cloud-hosted |
- | **CLI-first** | ✓ | ✗ | Limited | Limited |
- | **CI/CD ready** | ✓ | Requires API calls | Requires API calls | Requires API calls |
- | **Version control** | ✓ (YAML in Git) | ✗ | ✗ | ✗ |
- | **Evaluators** | Code + LLM + Custom | LLM only | LLM + Code | LLM only |
-
- **Best for:** Developers who want evaluation in their workflow, not a separate dashboard. Teams prioritizing privacy and reproducibility.
-
- ## Features
+ - **Local-first** runs on your machine, no cloud accounts or API keys for eval infrastructure
+ - **Version-controlled** — evals, judges, and results all live in Git
+ - **Hybrid graders** deterministic code checks + LLM-based subjective scoring
+ - **CI/CD native** exit codes, JSONL output, threshold flags for pipeline gating
+ - **Any agent** supports Claude, Codex, Copilot, VS Code, Pi, Azure OpenAI, or any CLI agent
 
- - **Multi-objective scoring**: Correctness, latency, cost, safety in one run
- - **Multiple evaluator types**: Code validators, LLM graders, custom Python/TypeScript
- - **Built-in targets**: VS Code Copilot, Codex CLI, Pi Coding Agent, Azure OpenAI, local CLI agents
- - **Structured evaluation**: Rubric-based grading with weights and requirements
- - **Batch evaluation**: Run hundreds of test cases in parallel
- - **Export**: JSON, JSONL, YAML formats
- - **Compare results**: Compute deltas between evaluation runs for A/B testing
-
- ## Development
-
- Contributing to AgentV? Clone and set up the repository:
+ ## Quick start
 
+ **1. Install and initialize:**
  ```bash
- git clone https://github.com/EntityProcess/agentv.git
- cd agentv
-
- # Install Bun if you don't have it
- curl -fsSL https://bun.sh/install | bash
-
- # Install dependencies and build
- bun install && bun run build
-
- # Run tests
- bun test
- ```
-
- See [AGENTS.md](AGENTS.md) for development guidelines and design principles.
-
- ### Releasing
-
- Version bump:
-
- ```bash
- bun run release # patch bump
- bun run release minor
- bun run release major
- ```
-
- Canary rollout (recommended):
-
- ```bash
- bun run publish:next # publish current version to npm `next`
- bun run promote:latest # promote same version to npm `latest`
- bun run tag:next 2.18.0 # point npm `next` to an explicit version
- bun run promote:latest 2.18.0 # point npm `latest` to an explicit version
- ```
-
- Legacy prerelease flow (still available):
-
- ```bash
- bun run release:next # bump/increment `-next.N`
- bun run release:next major # start new major prerelease line
+ npm install -g agentv
+ agentv init
  ```
 
- ## Core Concepts
-
- **Evaluation files** (`.yaml` or `.jsonl`) define test cases with expected outcomes. **Targets** specify which agent/provider to evaluate. **Graders** (code or LLM) score results. **Results** are written as JSONL/YAML for analysis and comparison.
+ **2. Configure targets** in `.agentv/targets.yaml` — point to your agent or LLM provider.
 
- ### JSONL Format Support
-
- For large-scale evaluations, AgentV supports JSONL (JSON Lines) format as an alternative to YAML:
-
- ```jsonl
- {"id": "test-1", "criteria": "Calculates correctly", "input": "What is 2+2?"}
- {"id": "test-2", "criteria": "Provides explanation", "input": "Explain variables"}
- ```
-
- Optional sidecar YAML metadata file (`dataset.eval.yaml` alongside `dataset.jsonl`):
+ **3. Create an eval** in `evals/`:
  ```yaml
- description: Math evaluation dataset
- name: math-tests
- execution:
-   target: azure-llm
- assertions:
-   - name: correctness
-     type: llm-grader
-     prompt: ./graders/correctness.md
+ description: Code generation quality
+ tests:
+   - id: fizzbuzz
+     criteria: Write a correct FizzBuzz implementation
+     input: Write FizzBuzz in Python
+     assertions:
+       - type: contains
+         value: "fizz"
+       - type: code-grader
+         command: ./validators/check_syntax.py
+       - type: llm-grader
+         prompt: ./graders/correctness.md
  ```
 
- Benefits: Streaming-friendly, Git-friendly diffs, programmatic generation, industry standard (DeepEval, LangWatch, Hugging Face).
-
- ## Usage
-
- ### Running Evaluations
-
+ **4. Run it:**
  ```bash
- # Validate evals
- agentv validate evals/my-eval.yaml
-
- # Run an eval with default target (from eval file or targets.yaml)
  agentv eval evals/my-eval.yaml
-
- # Override target
- agentv eval --target azure-llm evals/**/*.yaml
-
- # Run specific test
- agentv eval --test-id case-123 evals/my-eval.yaml
-
- # Dry-run with mock provider
- agentv eval --dry-run evals/my-eval.yaml
  ```
 
- See `agentv eval --help` for all options: workers, timeouts, output formats, trace dumping, and more.
-
- #### Output Formats
-
- Write results to different formats using the `-o` flag (format auto-detected from extension):
-
+ **5. Compare results across targets:**
  ```bash
- # Default run workspace (index.jsonl + benchmark/timing/per-test artifacts)
- agentv eval evals/my-eval.yaml
-
- # Self-contained HTML dashboard (opens in any browser, no server needed)
- agentv eval evals/my-eval.yaml -o report.html
-
- # Explicit JSONL output
- agentv eval evals/my-eval.yaml -o output.jsonl
-
- # Multiple formats simultaneously
- agentv eval evals/my-eval.yaml -o report.html
-
- # JUnit XML for CI/CD integration
- agentv eval evals/my-eval.yaml -o results.xml
+ agentv compare .agentv/results/runs/eval_<timestamp>/index.jsonl
  ```
 
- The HTML report auto-refreshes every 2 seconds during a live run, then locks once the run completes.
-
- By default, `agentv eval` creates a run workspace under `.agentv/results/runs/<run>/`
- with `index.jsonl` as the machine-facing manifest.
-
- You can also convert an existing manifest to HTML after the fact:
+ ## Output formats
 
  ```bash
- agentv convert .agentv/results/runs/eval_<timestamp>/index.jsonl -o report.html
+ agentv eval evals/my-eval.yaml # JSONL (default)
+ agentv eval evals/my-eval.yaml -o report.html # HTML dashboard
+ agentv eval evals/my-eval.yaml -o results.xml # JUnit XML for CI
  ```
 
- #### Timeouts
-
- AgentV does not apply a default top-level evaluation timeout. If you want one, set it explicitly
- with `--agent-timeout`, or set `execution.agentTimeoutMs` in your AgentV config to make it the
- default for your local runs.
-
- This top-level timeout is separate from provider- or tool-level timeouts. For example, an upstream
- agent or tool call may still time out even when AgentV's own top-level timeout is unset.
-
- ### Create Custom Evaluators
-
- Write code graders in Python or TypeScript:
-
- ```python
- # validators/check_answer.py
- import json, sys
- data = json.load(sys.stdin)
- answer = data.get("answer", "")
-
- assertions = []
-
- if "42" in answer:
-     assertions.append({"text": "Answer contains correct value (42)", "passed": True})
- else:
-     assertions.append({"text": "Answer does not contain expected value (42)", "passed": False})
-
- passed = sum(1 for a in assertions if a["passed"])
- score = 1.0 if passed == len(assertions) else 0.0
-
- print(json.dumps({
-     "score": score,
-     "assertions": assertions,
- }))
- ```
-
- Reference evaluators in your eval file:
-
- ```yaml
- assertions:
-   - name: my_validator
-     type: code-grader
-     command: ./validators/check_answer.py
- ```
-
- For complete templates, examples, and evaluator patterns, see: [custom-evaluators](https://agentv.dev/evaluators/custom-evaluators/)
-
- ### TypeScript SDK
-
- #### Custom Assertions with `defineAssertion()`
-
- Create custom assertion types in TypeScript using `@agentv/eval`:
-
- ```typescript
- // .agentv/assertions/word-count.ts
- import { defineAssertion } from '@agentv/eval';
-
- export default defineAssertion(({ answer }) => {
-   const wordCount = answer.trim().split(/\s+/).length;
-   return {
-     pass: wordCount >= 3,
-     reasoning: `Output has ${wordCount} words`,
-   };
- });
- ```
-
- Files in `.agentv/assertions/` are auto-discovered by filename — use directly in YAML:
-
- ```yaml
- assertions:
-   - type: word-count # matches word-count.ts
-   - type: contains
-     value: "Hello"
- ```
-
- See the [sdk-custom-assertion example](examples/features/sdk-custom-assertion).
-
- #### Programmatic API with `evaluate()`
+ ## TypeScript SDK
 
- Use AgentV as a library — no YAML needed:
+ Use AgentV programmatically:
 
  ```typescript
  import { evaluate } from '@agentv/core';
@@ -326,278 +102,28 @@ const { results, summary } = await evaluate({
  console.log(`${summary.passed}/${summary.total} passed`);
  ```
 
- Auto-discovers `default` target from `.agentv/targets.yaml` and `.env` credentials. See the [sdk-programmatic-api example](examples/features/sdk-programmatic-api).
+ ## Documentation
 
- #### Typed Configuration with `defineConfig()`
+ Full docs at [agentv.dev/docs](https://agentv.dev/docs/getting-started/introduction/).
 
- Create `agentv.config.ts` at your project root for typed, validated configuration:
+ - [Eval files](https://agentv.dev/docs/evaluation/eval-files/) format and structure
+ - [Custom evaluators](https://agentv.dev/docs/evaluators/custom-evaluators/) — code graders in any language
+ - [Rubrics](https://agentv.dev/docs/evaluation/rubrics/) — structured criteria scoring
+ - [Targets](https://agentv.dev/docs/targets/configuration/) — configure agents and providers
+ - [Compare results](https://agentv.dev/docs/tools/compare/) — A/B testing and regression detection
+ - [Comparison with other frameworks](https://agentv.dev/docs/reference/comparison/) — vs Braintrust, Langfuse, LangSmith, LangWatch
 
- ```typescript
- import { defineConfig } from '@agentv/core';
-
- export default defineConfig({
-   execution: { workers: 5, maxRetries: 2 },
-   output: { format: 'jsonl', dir: './results' },
-   limits: { maxCostUsd: 10.0 },
- });
- ```
-
- See the [sdk-config-file example](examples/features/sdk-config-file).
-
- #### Scaffold Commands
-
- Bootstrap new assertions and eval files:
-
- ```bash
- agentv create assertion sentiment # → .agentv/assertions/sentiment.ts
- agentv create eval my-eval # → evals/my-eval.eval.yaml + .cases.jsonl
- ```
-
- ### Compare Evaluation Results
-
- Compare a combined results file across all targets (N-way matrix):
-
- ```bash
- agentv compare .agentv/results/runs/eval_<timestamp>/index.jsonl
- ```
-
- ```
- Score Matrix
-
- Test ID         gemini-3-flash-preview gpt-4.1 gpt-5-mini
- ─────────────── ────────────────────── ─────── ──────────
- code-generation 0.70                   0.80    0.75
- greeting        0.90                   0.85    0.95
- summarization   0.85                   0.90    0.80
-
- Pairwise Summary:
- gemini-3-flash-preview → gpt-4.1: 1 win, 0 losses, 2 ties (Δ +0.033)
- gemini-3-flash-preview → gpt-5-mini: 0 wins, 0 losses, 3 ties (Δ +0.017)
- gpt-4.1 → gpt-5-mini: 0 wins, 0 losses, 3 ties (Δ -0.017)
- ```
-
- Designate a baseline for CI regression gating, or compare two specific targets:
-
- ```bash
- agentv compare .agentv/results/runs/eval_<timestamp>/index.jsonl --baseline gpt-4.1
- agentv compare .agentv/results/runs/eval_<timestamp>/index.jsonl --baseline gpt-4.1 --candidate gpt-5-mini
- agentv compare before.jsonl after.jsonl # two-file pairwise
- ```
-
- ## Targets Configuration
-
- Define execution targets in `.agentv/targets.yaml` to decouple evals from providers:
-
- ```yaml
- targets:
-   - name: azure-llm
-     provider: azure
-     endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
-     api_key: ${{ AZURE_OPENAI_API_KEY }}
-     model: ${{ AZURE_DEPLOYMENT_NAME }}
-
-   - name: vscode_dev
-     provider: vscode
-     grader_target: azure-llm
-
-   - name: local_agent
-     provider: cli
-     command: 'python agent.py --prompt-file {PROMPT_FILE} --output {OUTPUT_FILE}'
-     grader_target: azure-llm
- ```
-
- Supports: `azure`, `anthropic`, `gemini`, `codex`, `copilot`, `pi-coding-agent`, `claude`, `vscode`, `vscode-insiders`, `cli`, and `mock`.
-
- Workspace templates are configured at eval-level under `workspace.template` (not per-target `workspace_template`).
-
- Use `${{ VARIABLE_NAME }}` syntax to reference your `.env` file. See `.agentv/targets.yaml` after `agentv init` for detailed examples and all provider-specific fields.
-
- ## Evaluation Features
-
- ### Code Graders
-
- Write validators in any language (Python, TypeScript, Node, etc.):
-
- ```bash
- # Input: stdin JSON with question, criteria, answer
- # Output: stdout JSON with score (0-1), hits, misses, reasoning
- ```
-
- For complete examples and patterns, see:
- - [custom-evaluators](https://agentv.dev/evaluators/custom-evaluators/)
- - [code-grader-sdk example](examples/features/code-grader-sdk)
-
- ### Deterministic Assertions
-
- Built-in assertion types for common text-matching patterns — no LLM grader or code_grader needed:
-
- | Type | Value | Behavior |
- |------|-------|----------|
- | `contains` | `string` | Pass if output includes the substring |
- | `contains_any` | `string[]` | Pass if output includes ANY of the strings |
- | `contains_all` | `string[]` | Pass if output includes ALL of the strings |
- | `icontains` | `string` | Case-insensitive `contains` |
- | `icontains_any` | `string[]` | Case-insensitive `contains_any` |
- | `icontains_all` | `string[]` | Case-insensitive `contains_all` |
- | `starts_with` | `string` | Pass if output starts with value (trimmed) |
- | `ends_with` | `string` | Pass if output ends with value (trimmed) |
- | `regex` | `string` | Pass if output matches regex (optional `flags: "i"`) |
- | `equals` | `string` | Pass if output exactly equals value (trimmed) |
- | `is_json` | — | Pass if output is valid JSON |
-
- All assertions support `weight`, `required`, and `negate` flags. Use `negate: true` to invert (no `not_` prefix needed).
-
- ```yaml
- assertions:
-   # Case-insensitive matching for natural language variation
-   - type: icontains-any
-     value: ["missing rule code", "need rule code", "provide rule code"]
-     required: true
-
-   # Multiple required terms
-   - type: icontains-all
-     value: ["country code", "rule codes"]
-
-   # Case-insensitive regex
-   - type: regex
-     value: "[a-z]+@[a-z]+\\.[a-z]+"
-     flags: "i"
- ```
-
- See the [assert-extended example](examples/features/assert-extended) for complete patterns.
-
- ### Target Configuration: `grader_target`
-
- Agent provider targets (`codex`, `copilot`, `claude`, `vscode`) **must** specify `grader_target` (also accepts `judge_target` for backward compatibility) when using `llm_grader` or `rubrics` evaluators. Without it, AgentV errors at startup — agent providers cannot return structured JSON for grading.
-
- ```yaml
- targets:
-   # Agent target — requires grader_target for LLM-based evaluation
-   - name: codex_local
-     provider: codex
-     grader_target: azure-llm # Required: LLM provider for grading
-
-   # LLM target — no grader_target needed (grades itself)
-   - name: azure-llm
-     provider: azure
- ```
-
- ### Agentic Eval Patterns
-
- When agents respond via tool calls instead of text, use `tool_trajectory` instead of text assertions:
-
- - **Agent takes workspace actions** (creates files, runs commands) → `tool_trajectory` evaluator
- - **Agent responds in text** (answers questions, asks for info) → `contains`/`icontains_any`/`llm_grader`
- - **Agent does both** → `composite` evaluator combining both
-
- ### LLM Graders
-
- Create markdown grader files with evaluation criteria and scoring guidelines:
-
- ```yaml
- assertions:
-   - name: semantic_check
-     type: llm-grader
-     prompt: ./graders/correctness.md
- ```
-
- Your grader prompt file defines criteria and scoring guidelines.
-
- ### Rubric-Based Evaluation
-
- Define structured criteria directly in your test:
-
- ```yaml
- tests:
-   - id: quicksort-explain
-     criteria: Explain how quicksort works
-
-     input: Explain quicksort algorithm
-
-     assertions:
-       - type: rubrics
-         criteria:
-           - Mentions divide-and-conquer approach
-           - Explains partition step
-           - States time complexity
- ```
-
- Scoring: `(satisfied weights) / (total weights)` → verdicts: `pass` (≥0.8), `borderline` (≥0.6), `fail`
-
- Author assertions directly in your eval file. When you want help choosing between simple assertions, deterministic graders, and LLM-based graders, use the `agentv-eval-writer` skill.
-
- See [rubric evaluator](https://agentv.dev/evaluation/rubrics/) for detailed patterns.
-
- ## Advanced Configuration
-
- ### Retry Behavior
-
- Configure automatic retry with exponential backoff:
-
- ```yaml
- targets:
-   - name: azure-llm
-     provider: azure
-     max_retries: 5
-     retry_initial_delay_ms: 2000
-     retry_max_delay_ms: 120000
-     retry_backoff_factor: 2
-     retry_status_codes: [500, 408, 429, 502, 503, 504]
- ```
-
- Automatically retries on rate limits, transient 5xx errors, and network failures with jitter.
-
- ## Documentation & Learning
-
- **Getting Started:**
- - Run `agentv init` to set up your first evaluation workspace
- - Check [examples/README.md](examples/README.md) for demos (math, code generation, tool use)
- - AI agents: Ask Claude Code to `/agentv-eval-builder` to create and iterate on evals
-
- **Detailed Guides:**
- - [Evaluation format and structure](https://agentv.dev/evaluation/eval-files/)
- - [Custom evaluators](https://agentv.dev/evaluators/custom-evaluators/)
- - [Rubric evaluator](https://agentv.dev/evaluation/rubrics/)
- - [Composite evaluator](https://agentv.dev/evaluators/composite/)
- - [Tool trajectory evaluator](https://agentv.dev/evaluators/tool-trajectory/)
- - [Structured data evaluators](https://agentv.dev/evaluators/structured-data/)
- - [Batch CLI evaluation](https://agentv.dev/evaluation/batch-cli/)
- - [Compare results](https://agentv.dev/tools/compare/)
- - [Example evaluations](https://agentv.dev/evaluation/examples/)
-
- **Reference:**
- - Monorepo structure: `packages/core/` (engine), `packages/eval/` (evaluation logic), `apps/cli/` (commands)
-
- ## Troubleshooting
-
- ### `EACCES` permission error on global install (npm)
-
- If you see `EACCES: permission denied` when running `npm install -g agentv`, switch to bun (recommended) or configure npm to use a user-owned directory:
-
- **Option 1 (recommended): Use bun instead**
- ```bash
- bun install -g agentv
- ```
-
- **Option 2: Fix npm permissions**
- ```bash
- mkdir -p ~/.npm-global
- npm config set prefix ~/.npm-global --location=user
- ```
-
- Then add the directory to your PATH. For bash (`~/.bashrc`) or zsh (`~/.zshrc`):
+ ## Development
 
  ```bash
- echo 'export PATH=~/.npm-global/bin:$PATH' >> ~/.bashrc
- source ~/.bashrc
+ git clone https://github.com/EntityProcess/agentv.git
+ cd agentv
+ bun install && bun run build
+ bun test
  ```
 
- After this, `npm install -g` will work without `sudo`.
-
- ## Contributing
-
- See [AGENTS.md](AGENTS.md) for development guidelines, design principles, and quality assurance workflow.
+ See [AGENTS.md](AGENTS.md) for development guidelines.
 
  ## License
 
- MIT License - see [LICENSE](LICENSE) for details.
+ MIT
@@ -29,12 +29,12 @@ import {
    subscribeToCopilotCliLogEntries,
    subscribeToCopilotSdkLogEntries,
    subscribeToPiLogEntries
- } from "./chunk-OXBBWZOY.js";
+ } from "./chunk-XEAW7OQT.js";
 
  // package.json
  var package_default = {
    name: "agentv",
-   version: "4.0.0",
+   version: "4.1.1",
    description: "CLI entry point for AgentV",
    type: "module",
    repository: {
@@ -113,10 +113,7 @@ async function resolveEvalPaths(evalPaths, cwd) {
      continue;
    }
    if (stats.isDirectory()) {
-     const dirGlob = path.posix.join(
-       candidatePath.replace(/\\/g, "/"),
-       "**/*.eval.{yaml,yml}"
-     );
+     const dirGlob = path.posix.join(candidatePath.replace(/\\/g, "/"), "**/*.eval.{yaml,yml}");
      const dirMatches = await fg(dirGlob, {
        absolute: true,
        onlyFiles: true,
@@ -4446,7 +4443,7 @@ async function runEvalCommand(input) {
    const useFileExport = !!options.otelFile;
    if (options.exportOtel || useFileExport) {
      try {
-       const { OtelTraceExporter, OTEL_BACKEND_PRESETS } = await import("./dist-3Z22B6SU.js");
+       const { OtelTraceExporter, OTEL_BACKEND_PRESETS } = await import("./dist-2JUUJ6PT.js");
        let endpoint = process.env.OTEL_EXPORTER_OTLP_ENDPOINT;
        let headers = {};
        if (options.otelBackend) {
@@ -4838,4 +4835,4 @@ export {
    selectTarget,
    runEvalCommand
  };
- //# sourceMappingURL=chunk-OT2J474N.js.map
+ //# sourceMappingURL=chunk-QCKPJPYC.js.map