agentv 2.19.0 → 3.0.0-next.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +62 -36
- package/dist/agentv-provider-5CJVBBGG-2XVZBW7L.js +9 -0
- package/dist/{chunk-GC6T3RD4.js → chunk-5WIB7A27.js} +598 -403
- package/dist/chunk-5WIB7A27.js.map +1 -0
- package/dist/chunk-6GSYTMXD.js +31520 -0
- package/dist/chunk-6GSYTMXD.js.map +1 -0
- package/dist/{chunk-4MSAOMCC.js → chunk-DY4ZDTTO.js} +1018 -140
- package/dist/chunk-DY4ZDTTO.js.map +1 -0
- package/dist/chunk-HF4X7ALN.js +24299 -0
- package/dist/chunk-HF4X7ALN.js.map +1 -0
- package/dist/{chunk-FV32QHPB.js → chunk-XOSNETAV.js} +1 -1
- package/dist/cli.js +5 -4
- package/dist/cli.js.map +1 -1
- package/dist/{dist-MQBGD6LP.js → dist-WN2QIOQR.js} +27 -11
- package/dist/{esm-DX3WQKEN.js → esm-CZAWIY6F.js} +2 -2
- package/dist/esm-CZAWIY6F.js.map +1 -0
- package/dist/index.js +5 -4
- package/dist/{interactive-3TDBCSDW.js → interactive-B432TCRZ.js} +5 -4
- package/dist/{interactive-3TDBCSDW.js.map → interactive-B432TCRZ.js.map} +1 -1
- package/dist/{src-2N5EJ2N6.js → src-ML4D2MC2.js} +2 -2
- package/dist/templates/.agentv/targets.yaml +8 -11
- package/package.json +2 -2
- package/dist/chunk-4MSAOMCC.js.map +0 -1
- package/dist/chunk-GC6T3RD4.js.map +0 -1
- package/dist/chunk-XTYMR4I5.js +0 -49811
- package/dist/chunk-XTYMR4I5.js.map +0 -1
- package/dist/{dist-MQBGD6LP.js.map → agentv-provider-5CJVBBGG-2XVZBW7L.js.map} +0 -0
- package/dist/{chunk-FV32QHPB.js.map → chunk-XOSNETAV.js.map} +0 -0
- package/dist/{esm-DX3WQKEN.js.map → dist-WN2QIOQR.js.map} +0 -0
- package/dist/{src-2N5EJ2N6.js.map → src-ML4D2MC2.js.map} +0 -0
package/README.md
CHANGED

@@ -2,7 +2,7 @@
 
 **CLI-first AI agent evaluation. No server. No signup. No overhead.**
 
-AgentV evaluates your agents locally with multi-objective scoring (correctness, latency, cost, safety) from YAML specifications. Deterministic code
+AgentV evaluates your agents locally with multi-objective scoring (correctness, latency, cost, safety) from YAML specifications. Deterministic code graders + customizable LLM graders, all version-controlled in Git.
 
 ## Installation
 

@@ -58,9 +58,9 @@ tests:
 
 expected_output: "42"
 
-
+assertions:
 - name: math_check
-type: code-
+type: code-grader
 command: ./validators/check_math.py
 ```
 
@@ -90,7 +90,7 @@ Learn more in the [examples/](examples/README.md) directory. For a detailed comp
 ## Features
 
 - **Multi-objective scoring**: Correctness, latency, cost, safety in one run
-- **Multiple evaluator types**: Code validators, LLM
+- **Multiple evaluator types**: Code validators, LLM graders, custom Python/TypeScript
 - **Built-in targets**: VS Code Copilot, Codex CLI, Pi Coding Agent, Azure OpenAI, local CLI agents
 - **Structured evaluation**: Rubric-based grading with weights and requirements
 - **Batch evaluation**: Run hundreds of test cases in parallel
@@ -145,7 +145,7 @@ bun run release:next major # start new major prerelease line
 
 ## Core Concepts
 
-**Evaluation files** (`.yaml` or `.jsonl`) define test cases with expected outcomes. **Targets** specify which agent/provider to evaluate. **
+**Evaluation files** (`.yaml` or `.jsonl`) define test cases with expected outcomes. **Targets** specify which agent/provider to evaluate. **Graders** (code or LLM) score results. **Results** are written as JSONL/YAML for analysis and comparison.
 
 ### JSONL Format Support
 
@@ -161,11 +161,11 @@ Optional sidecar YAML metadata file (`dataset.eval.yaml` alongside `dataset.json
 description: Math evaluation dataset
 dataset: math-tests
 execution:
-target: azure-
-
+target: azure-llm
+assertions:
 - name: correctness
-type: llm-
-prompt: ./
+type: llm-grader
+prompt: ./graders/correctness.md
 ```
 
 Benefits: Streaming-friendly, Git-friendly diffs, programmatic generation, industry standard (DeepEval, LangWatch, Hugging Face).
@@ -182,7 +182,7 @@ agentv validate evals/my-eval.yaml
 agentv eval evals/my-eval.yaml
 
 # Override target
-agentv eval --target azure-
+agentv eval --target azure-llm evals/**/*.yaml
 
 # Run specific test
 agentv eval --test-id case-123 evals/my-eval.yaml
@@ -193,6 +193,32 @@ agentv eval --dry-run evals/my-eval.yaml
 
 See `agentv eval --help` for all options: workers, timeouts, output formats, trace dumping, and more.
 
+#### Output Formats
+
+Write results to different formats using the `-o` flag (format auto-detected from extension):
+
+```bash
+# JSONL (default streaming format)
+agentv eval evals/my-eval.yaml -o results.jsonl
+
+# Self-contained HTML dashboard (opens in any browser, no server needed)
+agentv eval evals/my-eval.yaml -o report.html
+
+# Multiple formats simultaneously
+agentv eval evals/my-eval.yaml -o results.jsonl -o report.html
+
+# JUnit XML for CI/CD integration
+agentv eval evals/my-eval.yaml -o results.xml
+```
+
+The HTML report auto-refreshes every 2 seconds during a live run, then locks once the run completes.
+
+You can also convert an existing JSONL results file to HTML after the fact:
+
+```bash
+agentv convert results.jsonl -o report.html
+```
+
 #### Timeouts
 
 AgentV does not apply a default top-level evaluation timeout. If you want one, set it explicitly
@@ -204,7 +230,7 @@ agent or tool call may still time out even when AgentV's own top-level timeout i
 
 ### Create Custom Evaluators
 
-Write code
+Write code graders in Python or TypeScript:
 
 ```python
 # validators/check_answer.py
@@ -233,9 +259,9 @@ print(json.dumps({
 Reference evaluators in your eval file:
 
 ```yaml
-
+assertions:
 - name: my_validator
-type: code-
+type: code-grader
 command: ./validators/check_answer.py
 ```
 
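The hunk above wires `./validators/check_answer.py` into the eval file, and an earlier hunk's context line (`print(json.dumps({`) suggests a code grader simply prints a JSON verdict. A minimal sketch of such a validator, assuming hypothetical `score`/`reason` field names (the real AgentV contract may differ):

```python
import json

def grade(answer: str) -> dict:
    """Toy grading rule: pass when the expected answer appears in the output."""
    passed = "42" in answer
    return {
        "score": 1.0 if passed else 0.0,  # assumed field name, not the documented contract
        "reason": "found expected answer" if passed else "expected answer missing",
    }

# A code grader script would print this JSON for AgentV to consume:
print(json.dumps(grade("The answer is 42")))
```

A validator like this can live in any language, as the README notes; only the printed JSON matters.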
@@ -263,7 +289,7 @@ export default defineAssertion(({ answer }) => {
 Files in `.agentv/assertions/` are auto-discovered by filename — use directly in YAML:
 
 ```yaml
-
+assertions:
 - type: word-count # matches word-count.ts
 - type: contains
 value: "Hello"
@@ -355,7 +381,7 @@ Define execution targets in `.agentv/targets.yaml` to decouple evals from provid
 
 ```yaml
 targets:
-- name: azure-
+- name: azure-llm
 provider: azure
 endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
 api_key: ${{ AZURE_OPENAI_API_KEY }}
@@ -363,12 +389,12 @@ targets:
 
 - name: vscode_dev
 provider: vscode
-
+grader_target: azure-llm
 
 - name: local_agent
 provider: cli
 command: 'python agent.py --prompt-file {PROMPT_FILE} --output {OUTPUT_FILE}'
-
+grader_target: azure-llm
 ```
 
 Supports: `azure`, `anthropic`, `gemini`, `codex`, `copilot`, `pi-coding-agent`, `claude`, `vscode`, `vscode-insiders`, `cli`, and `mock`.
@@ -379,7 +405,7 @@ Use `${{ VARIABLE_NAME }}` syntax to reference your `.env` file. See `.agentv/ta
 
 ## Evaluation Features
 
-### Code
+### Code Graders
 
 Write validators in any language (Python, TypeScript, Node, etc.):
 
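The `${{ VARIABLE_NAME }}` substitution mentioned in the hunk above can be pictured as a regex replacement over target values. This is an illustrative sketch under that assumption, not AgentV's actual resolver:

```python
import re

# Matches ${{ NAME }} placeholders; the exact grammar AgentV accepts is assumed here.
PLACEHOLDER = re.compile(r"\$\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")

def resolve(value: str, env: dict) -> str:
    """Replace ${{ NAME }} with env[NAME], leaving unknown names untouched."""
    return PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(0)), value)

print(resolve("endpoint: ${{ AZURE_OPENAI_ENDPOINT }}",
              {"AZURE_OPENAI_ENDPOINT": "https://example.openai.azure.com"}))
```

Leaving unknown placeholders intact (rather than substituting an empty string) makes missing `.env` entries visible in error messages.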
@@ -390,11 +416,11 @@ Write validators in any language (Python, TypeScript, Node, etc.):
 
 For complete examples and patterns, see:
 - [custom-evaluators](https://agentv.dev/evaluators/custom-evaluators/)
-- [code-
+- [code-grader-sdk example](examples/features/code-grader-sdk)
 
 ### Deterministic Assertions
 
-Built-in assertion types for common text-matching patterns — no LLM
+Built-in assertion types for common text-matching patterns — no LLM grader or code_grader needed:
 
 | Type | Value | Behavior |
 |------|-------|----------|
@@ -413,7 +439,7 @@ Built-in assertion types for common text-matching patterns — no LLM judge or c
 All assertions support `weight`, `required`, and `negate` flags. Use `negate: true` to invert (no `not_` prefix needed).
 
 ```yaml
-
+assertions:
 # Case-insensitive matching for natural language variation
 - type: icontains-any
 value: ["missing rule code", "need rule code", "provide rule code"]
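The assertion table is abbreviated in this diff. As an illustration only (not AgentV's implementation), `icontains-any` plus the `negate` flag described above could behave like:

```python
def icontains_any(output: str, values: list) -> bool:
    """Case-insensitive 'contains any of these substrings' check."""
    lowered = output.lower()
    return any(v.lower() in lowered for v in values)

def apply_negate(result: bool, negate: bool = False) -> bool:
    """`negate: true` inverts the assertion outcome (no `not_` prefix needed)."""
    return (not result) if negate else result

# The agent's reply matches "provide rule code" case-insensitively:
print(apply_negate(icontains_any(
    "Please provide the rule code.",
    ["missing rule code", "need rule code", "provide rule code"],
)))  # True
```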
@@ -431,19 +457,19 @@ assert:
 
 See the [assert-extended example](examples/features/assert-extended) for complete patterns.
 
-### Target Configuration: `
+### Target Configuration: `grader_target`
 
-Agent provider targets (`codex`, `copilot`, `claude`, `vscode`) **must** specify `judge_target` when using `
+Agent provider targets (`codex`, `copilot`, `claude`, `vscode`) **must** specify `grader_target` (also accepts `judge_target` for backward compatibility) when using `llm_grader` or `rubrics` evaluators. Without it, AgentV errors at startup — agent providers cannot return structured JSON for grading.
 
 ```yaml
 targets:
-# Agent target — requires
+# Agent target — requires grader_target for LLM-based evaluation
 - name: codex_local
 provider: codex
-
+grader_target: azure-llm # Required: LLM provider for grading
 
-# LLM target — no
-- name: azure-
+# LLM target — no grader_target needed (grades itself)
+- name: azure-llm
 provider: azure
 ```
 
@@ -452,21 +478,21 @@ targets:
 
 When agents respond via tool calls instead of text, use `tool_trajectory` instead of text assertions:
 
 - **Agent takes workspace actions** (creates files, runs commands) → `tool_trajectory` evaluator
-- **Agent responds in text** (answers questions, asks for info) → `contains`/`icontains_any`/`
+- **Agent responds in text** (answers questions, asks for info) → `contains`/`icontains_any`/`llm_grader`
 - **Agent does both** → `composite` evaluator combining both
 
-### LLM
+### LLM Graders
 
-Create markdown
+Create markdown grader files with evaluation criteria and scoring guidelines:
 
 ```yaml
-
+assertions:
 - name: semantic_check
-type: llm-
-prompt: ./
+type: llm-grader
+prompt: ./graders/correctness.md
 ```
 
-Your
+Your grader prompt file defines criteria and scoring guidelines.
 
 ### Rubric-Based Evaluation
 
@@ -479,7 +505,7 @@ tests:
 
 input: Explain quicksort algorithm
 
-
+assertions:
 - type: rubrics
 criteria:
 - Mentions divide-and-conquer approach
@@ -504,7 +530,7 @@ Configure automatic retry with exponential backoff:
 
 ```yaml
 targets:
-- name: azure-
+- name: azure-llm
 provider: azure
 max_retries: 5
 retry_initial_delay_ms: 2000
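The retry settings in the hunk above imply a delay schedule. Assuming a doubling factor (the multiplier is not shown in this diff), `max_retries: 5` with `retry_initial_delay_ms: 2000` would wait as follows:

```python
def backoff_delays_ms(max_retries: int, initial_delay_ms: int, factor: float = 2.0) -> list:
    """Exponential backoff: the delay before retry i is initial * factor**i."""
    return [int(initial_delay_ms * factor ** i) for i in range(max_retries)]

print(backoff_delays_ms(5, 2000))  # [2000, 4000, 8000, 16000, 32000]
```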
package/dist/agentv-provider-5CJVBBGG-2XVZBW7L.js
ADDED

@@ -0,0 +1,9 @@
+import { createRequire } from 'node:module'; const require = createRequire(import.meta.url);
+import {
+  AgentvProvider
+} from "./chunk-6GSYTMXD.js";
+import "./chunk-5H446C7X.js";
+export {
+  AgentvProvider
+};
+//# sourceMappingURL=agentv-provider-5CJVBBGG-2XVZBW7L.js.map
|