agentv 1.3.1 → 1.6.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,115 @@
+ # Compare Command
+
+ Compare evaluation results between two runs to measure performance differences.
+
+ ## Usage
+
+ ```bash
+ agentv compare <baseline.jsonl> <candidate.jsonl> [--threshold <value>]
+ ```
+
+ ## Arguments
+
+ | Argument | Description |
+ |----------|-------------|
+ | `<baseline.jsonl>` | Path to the baseline JSONL result file |
+ | `<candidate.jsonl>` | Path to the candidate JSONL result file |
+ | `--threshold`, `-t` | Score delta threshold for win/loss classification (default: 0.1) |
+
+ ## How It Works
+
+ 1. **Load Results**: Reads both JSONL files containing evaluation results
+ 2. **Match by eval_id**: Pairs results with matching `eval_id` fields
+ 3. **Compute Deltas**: Calculates `delta = score2 - score1` (candidate minus baseline) for each pair
+ 4. **Classify Outcomes** (see the sketch after this list):
+    - `win`: delta >= threshold (candidate better)
+    - `loss`: delta <= -threshold (baseline better)
+    - `tie`: |delta| < threshold (no significant difference)
+ 5. **Output Summary**: JSON with matched results, unmatched counts, and statistics
+
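+ A minimal TypeScript sketch of the classification rule, for illustration only (not agentv's actual source); it assumes scores on a 0–1 scale:
+
+ ```typescript
+ // Illustrative sketch of the outcome classification above.
+ type Outcome = "win" | "loss" | "tie";
+
+ function classify(
+   score1: number,
+   score2: number,
+   threshold = 0.1,
+ ): { delta: number; outcome: Outcome } {
+   const delta = score2 - score1; // candidate minus baseline
+   if (delta >= threshold) return { delta, outcome: "win" };
+   if (delta <= -threshold) return { delta, outcome: "loss" };
+   return { delta, outcome: "tie" }; // |delta| < threshold
+ }
+
+ classify(0.7, 0.9); // => { delta: ~0.2, outcome: "win" }
+ ```
+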
+ ## Output Format
+
+ ```json
+ {
+   "matched": [
+     {
+       "eval_id": "case-1",
+       "score1": 0.7,
+       "score2": 0.9,
+       "delta": 0.2,
+       "outcome": "win"
+     }
+   ],
+   "unmatched": {
+     "file1": 0,
+     "file2": 0
+   },
+   "summary": {
+     "total": 2,
+     "matched": 1,
+     "wins": 1,
+     "losses": 0,
+     "ties": 0,
+     "meanDelta": 0.2
+   }
+ }
+ ```
+
+ ## Exit Codes
+
+ | Code | Meaning |
+ |------|---------|
+ | `0` | Candidate is equal or better (meanDelta >= 0) |
+ | `1` | Baseline is better (regression detected) |
+
+ ## Workflow Examples
+
+ ### Model Comparison
+
+ Compare different model versions:
+
+ ```bash
+ # Run baseline evaluation
+ agentv eval evals/*.yaml --target gpt-4 --out baseline.jsonl
+
+ # Run candidate evaluation
+ agentv eval evals/*.yaml --target gpt-4o --out candidate.jsonl
+
+ # Compare results
+ agentv compare baseline.jsonl candidate.jsonl
+ ```
+
+ ### Prompt Optimization
+
+ Compare before/after prompt changes:
+
+ ```bash
+ # Run with original prompt
+ agentv eval evals/*.yaml --out before.jsonl
+
+ # Modify prompt, then run again
+ agentv eval evals/*.yaml --out after.jsonl
+
+ # Compare with strict threshold
+ agentv compare before.jsonl after.jsonl --threshold 0.05
+ ```
+
+ ### CI Quality Gate
+
+ Fail CI if candidate regresses:
+
+ ```bash
+ #!/bin/bash
+ agentv compare baseline.jsonl candidate.jsonl
+ if [ $? -eq 1 ]; then
+   echo "Regression detected! Candidate performs worse than baseline."
+   exit 1
+ fi
+ echo "Candidate is equal or better than baseline."
+ ```
+
+ ## Tips
+
+ - **Threshold Selection**: The default of 0.1 requires a score delta of at least 0.1 (10 points on a 0–1 score scale) before a pair counts as a win or loss. Use stricter thresholds (e.g. 0.05) for critical evaluations.
+ - **Unmatched Results**: Check the `unmatched` counts to identify eval cases that exist in only one file.
+ - **Multiple Comparisons**: Compare against multiple baselines by running the command once per baseline (see the sketch below).
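+
+ A hypothetical loop over several baselines, using Node's `child_process` to invoke the CLI; the file names here are illustrative:
+
+ ```typescript
+ // Hypothetical sketch: compare one candidate against several baselines.
+ // execFileSync throws when the command exits nonzero (exit code 1 = regression).
+ import { execFileSync } from "node:child_process";
+
+ const baselines = ["baseline-a.jsonl", "baseline-b.jsonl"]; // illustrative names
+
+ for (const baseline of baselines) {
+   try {
+     execFileSync("agentv", ["compare", baseline, "candidate.jsonl"], { stdio: "inherit" });
+     console.log(`${baseline}: candidate is equal or better`);
+   } catch {
+     console.error(`${baseline}: regression detected`);
+   }
+ }
+ ```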
@@ -1,215 +1,215 @@
+ # Composite Evaluator Guide
+
+ Composite evaluators combine multiple evaluators and aggregate their results. This enables sophisticated evaluation patterns like safety gates, weighted scoring, and conflict resolution.
+
+ ## Basic Structure
+
+ ```yaml
+ execution:
+   evaluators:
+     - name: my_composite
+       type: composite
+       evaluators:
+         - name: evaluator_1
+           type: llm_judge
+           prompt: ./prompts/check1.md
+         - name: evaluator_2
+           type: code_judge
+           script: uv run check2.py
+       aggregator:
+         type: weighted_average
+         weights:
+           evaluator_1: 0.6
+           evaluator_2: 0.4
+ ```
+
+ ## Aggregator Types
+
+ ### 1. Weighted Average (Default)
+
+ Combines scores using a weighted arithmetic mean:
+
+ ```yaml
+ aggregator:
+   type: weighted_average
+   weights:
+     safety: 0.3   # 30% weight
+     quality: 0.7  # 70% weight
+ ```
+
+ If weights are omitted, all evaluators have equal weight (1.0).
+
+ **Score calculation:**
+ ```
+ final_score = Σ(score_i × weight_i) / Σ(weight_i)
+ ```
+
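+ A minimal TypeScript sketch of this formula, assuming scores keyed by evaluator name and a default weight of 1.0 for unlisted evaluators; it is not agentv's internal implementation:
+
+ ```typescript
+ // Sketch of the weighted-average aggregation formula above.
+ function weightedAverage(
+   scores: Record<string, number>,
+   weights: Record<string, number> = {},
+ ): number {
+   let num = 0;
+   let den = 0;
+   for (const [name, score] of Object.entries(scores)) {
+     const w = weights[name] ?? 1.0; // unlisted evaluators default to weight 1.0
+     num += score * w;
+     den += w;
+   }
+   return num / den;
+ }
+
+ weightedAverage({ safety: 0.9, quality: 0.85 }, { safety: 0.3, quality: 0.7 });
+ // => 0.865
+ ```
+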
+ ### 2. Code Judge Aggregator
+
+ Run custom code to decide the final score based on all evaluator results:
+
+ ```yaml
+ aggregator:
+   type: code_judge
+   path: node ./scripts/safety-gate.js
+   cwd: ./evaluators  # optional working directory
+ ```
+
+ **Input (stdin):**
+ ```json
+ {
+   "results": {
+     "safety": { "score": 0.9, "hits": [...], "misses": [...] },
+     "quality": { "score": 0.85, "hits": [...], "misses": [...] }
+   }
+ }
+ ```
+
+ **Output (stdout):**
+ ```json
+ {
+   "score": 0.87,
+   "verdict": "pass",
+   "hits": ["Combined check passed"],
+   "misses": [],
+   "reasoning": "Safety gate passed, quality acceptable"
+ }
+ ```
+
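+ A sketch of what such a script might look like, implementing the stdin/stdout contract above as a safety gate; the gate threshold of 0.8 and the field handling are assumptions, not agentv requirements:
+
+ ```typescript
+ // Sketch of a code_judge aggregator: reads the results JSON from stdin,
+ // zeroes the score when safety falls below an assumed gate, writes JSON to stdout.
+ import { readFileSync } from "node:fs";
+
+ const { results } = JSON.parse(readFileSync(0, "utf8")); // fd 0 = stdin
+ const safety = results.safety.score;
+ const quality = results.quality.score;
+
+ const gated = safety < 0.8; // assumed gate threshold
+ const score = gated ? 0 : quality;
+
+ process.stdout.write(JSON.stringify({
+   score,
+   verdict: gated ? "fail" : "pass",
+   hits: gated ? [] : ["Safety gate passed"],
+   misses: gated ? [`Safety score ${safety} below gate threshold 0.8`] : [],
+   reasoning: gated
+     ? "Safety failed; quality ignored"
+     : `Safety passed; using quality score ${quality}`,
+ }));
+ ```
+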
+ ### 3. LLM Judge Aggregator
+
+ Use an LLM to resolve conflicts or make nuanced decisions:
+
+ ```yaml
+ aggregator:
+   type: llm_judge
+   prompt: ./prompts/conflict-resolution.md
+ ```
+
+ The `{{EVALUATOR_RESULTS_JSON}}` variable is replaced with the JSON results from all child evaluators.
+
90
+
91
+ ## Example Patterns
92
+
93
+ ### Safety Gate Pattern
94
+
95
+ Block outputs that fail safety even if quality is high:
96
+
97
+ ```yaml
98
+ evalcases:
99
+ - id: safety-gated-response
100
+ expected_outcome: Safe and accurate response
101
+
102
+ input_messages:
103
+ - role: user
104
+ content: Explain quantum computing
105
+
106
+ execution:
107
+ evaluators:
108
+ - name: safety_gate
109
+ type: composite
110
+ evaluators:
111
+ - name: safety
112
+ type: llm_judge
113
+ prompt: ./prompts/safety-check.md
114
+ - name: quality
115
+ type: llm_judge
116
+ prompt: ./prompts/quality-check.md
117
+ aggregator:
118
+ type: code_judge
119
+ path: ./scripts/safety-gate.js
120
+ ```
121
+
+ ### Multi-Criteria Weighted Evaluation
+
+ ```yaml
+ - name: release_readiness
+   type: composite
+   evaluators:
+     - name: correctness
+       type: llm_judge
+       prompt: ./prompts/correctness.md
+     - name: style
+       type: code_judge
+       script: uv run style_checker.py
+     - name: security
+       type: llm_judge
+       prompt: ./prompts/security.md
+   aggregator:
+     type: weighted_average
+     weights:
+       correctness: 0.5
+       style: 0.2
+       security: 0.3
+ ```
+
+ ### Nested Composites
+
+ Composites can contain other composites for complex hierarchies:
+
+ ```yaml
+ - name: comprehensive_eval
+   type: composite
+   evaluators:
+     - name: content_quality
+       type: composite
+       evaluators:
+         - name: accuracy
+           type: llm_judge
+           prompt: ./prompts/accuracy.md
+         - name: clarity
+           type: llm_judge
+           prompt: ./prompts/clarity.md
+       aggregator:
+         type: weighted_average
+         weights:
+           accuracy: 0.6
+           clarity: 0.4
+     - name: safety
+       type: llm_judge
+       prompt: ./prompts/safety.md
+   aggregator:
+     type: weighted_average
+     weights:
+       content_quality: 0.7
+       safety: 0.3
+ ```
+
+ ## Result Structure
+
+ Composite evaluators return nested `evaluator_results`:
+
+ ```json
+ {
+   "score": 0.85,
+   "verdict": "pass",
+   "hits": ["[safety] No harmful content", "[quality] Clear explanation"],
+   "misses": ["[quality] Could use more examples"],
+   "reasoning": "safety: Passed all checks; quality: Good but could improve",
+   "evaluator_results": [
+     {
+       "name": "safety",
+       "type": "llm_judge",
+       "score": 0.95,
+       "verdict": "pass",
+       "hits": ["No harmful content"],
+       "misses": []
+     },
+     {
+       "name": "quality",
+       "type": "llm_judge",
+       "score": 0.8,
+       "verdict": "pass",
+       "hits": ["Clear explanation"],
+       "misses": ["Could use more examples"]
+     }
+   ]
+ }
+ ```
+
+ ## Best Practices
+
+ 1. **Name evaluators clearly** - Names appear in results and debugging output
+ 2. **Use safety gates for critical checks** - Don't let high quality override safety failures
+ 3. **Balance weights thoughtfully** - Consider which aspects matter most for your use case
+ 4. **Keep nesting shallow** - Deep nesting makes debugging harder
+ 5. **Test aggregators independently** - Verify your custom aggregation logic with unit tests (see the sketch after this list)
+
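+ A minimal sketch of such a test, piping a fixture through the aggregator script with `node:test`; the path and expected behavior follow the hypothetical safety gate sketched earlier:
+
+ ```typescript
+ // Sketch: unit-test an aggregator script by piping fixture results through it.
+ import { test } from "node:test";
+ import assert from "node:assert/strict";
+ import { execFileSync } from "node:child_process";
+
+ test("safety gate zeroes the score when safety fails", () => {
+   const input = JSON.stringify({
+     results: {
+       safety: { score: 0.5, hits: [], misses: ["Harmful content"] },
+       quality: { score: 0.9, hits: ["Clear"], misses: [] },
+     },
+   });
+   const out = JSON.parse(
+     execFileSync("node", ["./scripts/safety-gate.js"], { input, encoding: "utf8" }),
+   );
+   assert.equal(out.score, 0);
+   assert.equal(out.verdict, "fail");
+ });
+ ```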