agentv 1.5.0 → 1.6.1
- package/README.md +23 -2
- package/dist/{chunk-3RYQPI4H.js → chunk-HU4B6ODF.js} +1429 -369
- package/dist/chunk-HU4B6ODF.js.map +1 -0
- package/dist/cli.js +1 -1
- package/dist/index.js +1 -1
- package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +2 -3
- package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md +0 -2
- package/dist/templates/.claude/skills/agentv-eval-builder/references/compare-command.md +115 -0
- package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md +34 -9
- package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md +0 -7
- package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md +0 -2
- package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md +28 -2
- package/package.json +5 -2
- package/dist/chunk-3RYQPI4H.js.map +0 -1
package/dist/cli.js
CHANGED
package/dist/index.js
CHANGED
package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md
CHANGED

@@ -16,9 +16,10 @@ description: Create and maintain AgentV YAML evaluation files for testing AI age
 - Tool Trajectory: `references/tool-trajectory-evaluator.md` - Validate agent tool usage
 - Custom Evaluators: `references/custom-evaluators.md` - Code and LLM judge templates
 - Batch CLI: `references/batch-cli-evaluator.md` - Evaluate batch runner output (JSONL)
+- Compare: `references/compare-command.md` - Compare evaluation results between runs

 ## Structure Requirements
-- Root level: `description` (optional), `execution` (
+- Root level: `description` (optional), `execution` (with `target`), `evalcases` (required)
 - Eval case fields: `id` (required), `expected_outcome` (required), `input_messages` (required)
 - Optional fields: `expected_messages`, `conversation_id`, `rubrics`, `execution`
 - `expected_messages` is optional - omit for outcome-only evaluation where the LLM judge evaluates based on `expected_outcome` criteria alone
@@ -142,7 +143,6 @@ See `references/composite-evaluator.md` for aggregation types and patterns.
 Evaluate external batch runners that process all evalcases in one invocation:

 ```yaml
-$schema: agentv-eval-v2
 description: Batch CLI evaluation
 execution:
   target: batch_cli
@@ -177,7 +177,6 @@ See `references/batch-cli-evaluator.md` for full implementation guide.

 ## Example
 ```yaml
-$schema: agentv-eval-v2
 description: Example showing basic features and conversation threading
 execution:
   target: default
package/dist/templates/.claude/skills/agentv-eval-builder/references/compare-command.md
ADDED

@@ -0,0 +1,115 @@
+# Compare Command
+
+Compare evaluation results between two runs to measure performance differences.
+
+## Usage
+
+```bash
+agentv compare <baseline.jsonl> <candidate.jsonl> [--threshold <value>]
+```
+
+## Arguments
+
+| Argument | Description |
+|----------|-------------|
+| `result1` | Path to baseline JSONL result file |
+| `result2` | Path to candidate JSONL result file |
+| `--threshold`, `-t` | Score delta threshold for win/loss classification (default: 0.1) |
+
+## How It Works
+
+1. **Load Results**: Reads both JSONL files containing evaluation results
+2. **Match by eval_id**: Pairs results with matching `eval_id` fields
+3. **Compute Deltas**: Calculates `delta = score2 - score1` for each pair
+4. **Classify Outcomes**:
+   - `win`: delta >= threshold (candidate better)
+   - `loss`: delta <= -threshold (baseline better)
+   - `tie`: |delta| < threshold (no significant difference)
+5. **Output Summary**: JSON with matched results, unmatched counts, and statistics
+
+## Output Format
+
+```json
+{
+  "matched": [
+    {
+      "eval_id": "case-1",
+      "score1": 0.7,
+      "score2": 0.9,
+      "delta": 0.2,
+      "outcome": "win"
+    }
+  ],
+  "unmatched": {
+    "file1": 0,
+    "file2": 0
+  },
+  "summary": {
+    "total": 2,
+    "matched": 1,
+    "wins": 1,
+    "losses": 0,
+    "ties": 0,
+    "meanDelta": 0.2
+  }
+}
+```
+
+## Exit Codes
+
+| Code | Meaning |
+|------|---------|
+| `0` | Candidate is equal or better (meanDelta >= 0) |
+| `1` | Baseline is better (regression detected) |
+
+## Workflow Examples
+
+### Model Comparison
+
+Compare different model versions:
+
+```bash
+# Run baseline evaluation
+agentv eval evals/*.yaml --target gpt-4 --out baseline.jsonl
+
+# Run candidate evaluation
+agentv eval evals/*.yaml --target gpt-4o --out candidate.jsonl
+
+# Compare results
+agentv compare baseline.jsonl candidate.jsonl
+```
+
+### Prompt Optimization
+
+Compare before/after prompt changes:
+
+```bash
+# Run with original prompt
+agentv eval evals/*.yaml --out before.jsonl
+
+# Modify prompt, then run again
+agentv eval evals/*.yaml --out after.jsonl
+
+# Compare with strict threshold
+agentv compare before.jsonl after.jsonl --threshold 0.05
+```
+
+### CI Quality Gate
+
+Fail CI if candidate regresses:
+
+```bash
+#!/bin/bash
+agentv compare baseline.jsonl candidate.jsonl
+if [ $? -eq 1 ]; then
+  echo "Regression detected! Candidate performs worse than baseline."
+  exit 1
+fi
+echo "Candidate is equal or better than baseline."
+```
+
+## Tips
+
+- **Threshold Selection**: Default 0.1 means 10% difference required. Use stricter thresholds (0.05) for critical evaluations.
+- **Unmatched Results**: Check `unmatched` counts to identify eval cases that only exist in one file.
+- **Multiple Comparisons**: Compare against multiple baselines by running the command multiple times.
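The match/delta/classify flow documented in the new compare-command.md can be sketched in a few lines. This is a minimal TypeScript sketch mirroring the documented behavior (match by `eval_id`, `delta = score2 - score1`, threshold classification), not agentv's actual implementation; the record shapes are assumptions based on the output format shown above.

```typescript
interface EvalResult {
  eval_id: string;
  score: number;
}

type Outcome = "win" | "loss" | "tie";

// Classify a score delta against the threshold, as documented:
// win if delta >= threshold, loss if delta <= -threshold, else tie.
function classify(delta: number, threshold: number): Outcome {
  if (delta >= threshold) return "win"; // candidate better
  if (delta <= -threshold) return "loss"; // baseline better
  return "tie"; // |delta| < threshold
}

function compare(baseline: EvalResult[], candidate: EvalResult[], threshold = 0.1) {
  // Pair results from both files by eval_id.
  const byId = new Map<string, EvalResult>(candidate.map((r) => [r.eval_id, r]));
  const matched = baseline
    .filter((b) => byId.has(b.eval_id))
    .map((b) => {
      const c = byId.get(b.eval_id)!;
      const delta = c.score - b.score;
      return { eval_id: b.eval_id, score1: b.score, score2: c.score, delta, outcome: classify(delta, threshold) };
    });
  const meanDelta = matched.length
    ? matched.reduce((sum, m) => sum + m.delta, 0) / matched.length
    : 0;
  return {
    matched,
    unmatched: {
      file1: baseline.length - matched.length,
      file2: candidate.length - matched.length,
    },
    summary: {
      matched: matched.length,
      wins: matched.filter((m) => m.outcome === "win").length,
      losses: matched.filter((m) => m.outcome === "loss").length,
      ties: matched.filter((m) => m.outcome === "tie").length,
      meanDelta,
    },
  };
}
```

With the default threshold of 0.1, a pair scoring 0.7 then 0.9 classifies as a win, matching the example output above.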
package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md
CHANGED

@@ -11,17 +11,42 @@ Code evaluators receive input via stdin and write output to stdout, both as JSON
 ```json
 {
   "question": "string describing the task/question",
-  "
-  "
-  "
-  "
-  "
-  "
-  "
+  "expectedOutcome": "expected outcome description",
+  "referenceAnswer": "gold standard answer (optional)",
+  "candidateAnswer": "generated code/text from the agent",
+  "guidelineFiles": ["path1", "path2"],
+  "inputFiles": ["file1", "file2"],
+  "inputMessages": [{"role": "user", "content": "..."}],
+  "outputMessages": [
+    {
+      "role": "assistant",
+      "content": "...",
+      "toolCalls": [
+        {
+          "tool": "search",
+          "input": { "query": "..." },
+          "output": { "results": [...] },
+          "id": "call_123",
+          "timestamp": "2024-01-15T10:30:00Z"
+        }
+      ]
+    }
+  ],
+  "traceSummary": {
+    "eventCount": 5,
+    "toolNames": ["fetch", "search"],
+    "toolCallsByName": { "search": 2, "fetch": 1 },
+    "errorCount": 0,
+    "tokenUsage": { "input": 1000, "output": 500 },
+    "costUsd": 0.0015,
+    "durationMs": 3500
+  }
 }
 ```

-
+**Key fields:**
+- `outputMessages` - Full agent execution trace with tool calls (use `toolCalls[].input` for arguments)
+- `traceSummary` - Lightweight summary with execution metrics (counts only, no tool arguments)

 ### Output Format (to stdout)
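A code evaluator consuming the input format above might look like the following TypeScript sketch. It uses only fields shown in the diff (`candidateAnswer`, `traceSummary`); the `{ score, reason }` output shape and the `evaluate` function name are illustrative assumptions, since this chunk does not show the documented stdout schema.

```typescript
interface TraceSummary {
  errorCount: number;
  costUsd: number;
  durationMs: number;
}

interface EvaluatorInput {
  question: string;
  candidateAnswer: string;
  traceSummary?: TraceSummary;
}

// Hypothetical verdict shape -- the real output schema is defined in
// custom-evaluators.md under "Output Format (to stdout)".
function evaluate(input: EvaluatorInput): { score: number; reason: string } {
  if (!input.candidateAnswer.trim()) {
    return { score: 0, reason: "empty candidate answer" };
  }
  const t = input.traceSummary;
  if (t && t.errorCount > 0) {
    // traceSummary carries counts only, so we can gate on errors
    // without inspecting individual tool calls.
    return { score: 0, reason: `trace contains ${t.errorCount} error(s)` };
  }
  return { score: 1, reason: "answer present and trace clean" };
}

// A real evaluator would read the JSON payload from stdin and print
// the verdict to stdout, e.g.:
//   process.stdout.write(JSON.stringify(evaluate(JSON.parse(raw))));
```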
@@ -189,7 +214,7 @@ You can customize this template in your eval file using the `evaluatorTemplate`
 execution:
   evaluators:
     - name: my_validator
-      type:
+      type: code_judge
       script: uv run my_validator.py
       cwd: ./evaluators
 ```
package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md
CHANGED

@@ -5,7 +5,6 @@ This document contains complete examples of well-structured eval files demonstra
 ## Basic Example: Simple Q&A Eval

 ```yaml
-$schema: agentv-eval-v2
 description: Basic arithmetic evaluation
 execution:
   target: default

@@ -26,7 +25,6 @@ evalcases:
 ## Code Review with File References

 ```yaml
-$schema: agentv-eval-v2
 description: Code review with guidelines
 execution:
   target: azure_base

@@ -69,7 +67,6 @@ evalcases:
 ## Multi-Evaluator Configuration

 ```yaml
-$schema: agentv-eval-v2
 description: JSON generation with validation
 execution:
   target: default

@@ -109,7 +106,6 @@ evalcases:
 Validate that an agent uses specific tools during execution.

 ```yaml
-$schema: agentv-eval-v2
 description: Tool usage validation
 execution:
   target: mock_agent

@@ -151,7 +147,6 @@ evalcases:
 Evaluate pre-existing trace files without running an agent.

 ```yaml
-$schema: agentv-eval-v2
 description: Static trace evaluation
 execution:
   target: static_trace

@@ -175,7 +170,6 @@ evalcases:
 ## Multi-Turn Conversation (Single Eval Case)

 ```yaml
-$schema: agentv-eval-v2
 description: Multi-turn debugging session with clarifying questions
 execution:
   target: default

@@ -237,7 +231,6 @@ evalcases:
 Evaluate external batch runners that process all evalcases in one invocation.

 ```yaml
-$schema: agentv-eval-v2
 description: Batch CLI demo (AML screening)
 execution:
   target: batch_cli
package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md
CHANGED

@@ -9,8 +9,6 @@ Rubrics provide structured evaluation through lists of criteria that define what
 Define rubrics as simple strings - each becomes a required criterion with weight 1.0:

 ```yaml
-$schema: agentv-eval-v2
-
 evalcases:
   - id: quicksort-explanation
     expected_outcome: Explain how quicksort works
package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md
CHANGED
@@ -48,6 +48,34 @@ execution:
 - Allow agent to use additional helper tools
 - Check that key steps happen in sequence

+### Argument Matching
+
+For `in_order` and `exact` modes, you can optionally validate tool arguments:
+
+```yaml
+execution:
+  evaluators:
+    - name: search-validation
+      type: tool_trajectory
+      mode: in_order
+      expected:
+        # Partial match - only specified keys are checked
+        - tool: search
+          args: { query: "machine learning" }
+
+        # Skip argument validation for this tool
+        - tool: process
+          args: any
+
+        # No args field = no argument validation (same as args: any)
+        - tool: saveResults
+```
+
+**Argument matching modes:**
+- `args: { key: value }` - Partial deep equality (only specified keys are checked)
+- `args: any` - Skip argument validation
+- No `args` field - Same as `args: any`
+
 #### 3. `exact` - Strict Sequence Match

 Validates the exact tool sequence with no gaps or extra tools:
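The argument-matching rules in the new "Argument Matching" section can be sketched as a small matcher. This TypeScript sketch approximates the documented semantics (`args: any` or a missing `args` field skips validation; an object does a partial deep comparison where only the specified keys are checked); it is an illustration, not agentv's actual matcher, and its treatment of arrays as index-keyed objects is a simplifying assumption.

```typescript
type Args = Record<string, unknown>;
type ExpectedCall = { tool: string; args?: Args | "any" };
type ActualCall = { tool: string; input: Args };

// Partial deep equality: every key present in `expected` must match the
// corresponding key in `actual`; extra keys in `actual` are ignored.
function partialDeepEqual(expected: unknown, actual: unknown): boolean {
  if (expected === null || typeof expected !== "object") return expected === actual;
  if (actual === null || typeof actual !== "object") return false;
  return Object.entries(expected as Args).every(([key, value]) =>
    partialDeepEqual(value, (actual as Args)[key]),
  );
}

function callMatches(expected: ExpectedCall, actual: ActualCall): boolean {
  if (expected.tool !== actual.tool) return false;
  // No args field, or args: any => skip argument validation.
  if (expected.args === undefined || expected.args === "any") return true;
  return partialDeepEqual(expected.args, actual.input);
}
```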
@@ -121,9 +149,7 @@ The evaluator extracts tool calls from `output_messages[].tool_calls[]`. Optiona
 ### Research Agent Validation

 ```yaml
-$schema: agentv-eval-v2
 description: Validate research agent tool usage
-
 execution:
   target: codex_agent # Provider that returns traces

package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "agentv",
-  "version": "1.
+  "version": "1.6.1",
   "description": "CLI entry point for AgentV",
   "type": "module",
   "repository": {
@@ -14,7 +14,10 @@
   "bin": {
     "agentv": "./dist/cli.js"
   },
-  "files": [
+  "files": [
+    "dist",
+    "README.md"
+  ],
   "scripts": {
     "dev": "bun --watch src/index.ts",
     "build": "tsup && bun run copy-readme",
|