npm - agentv - Versions diffs - 3.10.2 → 3.10.3 - Mend

agentv 3.10.2 → 3.10.3

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (42)

package/dist/templates/.claude/skills/agentv-eval-builder/references/compare-command.md DELETED

@@ -1,137 +0,0 @@
-# Compare Command
-Compare evaluation results between two runs to measure performance differences.
-## Usage
-```bash
-agentv compare <baseline.jsonl> <candidate.jsonl> [options]
-```
-## Arguments
-| Argument | Description |
-|----------|-------------|
-| `result1` | Path to baseline JSONL result file |
-| `result2` | Path to candidate JSONL result file |
-| `--threshold`, `-t` | Score delta threshold for win/loss classification (default: 0.1) |
-| `--format`, `-f` | Output format: `table` (default) or `json` |
-| `--json` | Shorthand for `--format=json` |
-## How It Works
-1. **Load Results**: Reads both JSONL files containing evaluation results
-2. **Match by eval_id**: Pairs results with matching `eval_id` fields
-3. **Compute Deltas**: Calculates `delta = score2 - score1` for each pair
-4. **Classify Outcomes**:
-   - `win`: delta >= threshold (candidate better)
-   - `loss`: delta <= -threshold (baseline better)
-   - `tie`: |delta| < threshold (no significant difference)
-5. **Output Summary**: Human-readable table (default) or JSON
-## Output Format
-### Table Format (default)
-```
-Comparing: baseline.jsonl → candidate.jsonl
-  Eval ID        Baseline  Candidate     Delta  Result
-  ─────────────  ────────  ─────────  ────────  ────────
-  safety-check       0.70       0.90     +0.20  ✓ win
-  accuracy-test      0.85       0.80     -0.05  = tie
-  latency-eval       0.90       0.75     -0.15  ✗ loss
-Summary: 1 win, 1 loss, 1 tie | Mean Δ: +0.000 | Status: neutral
-```
-Colors are used to highlight wins (green), losses (red), and ties (gray). Colors are automatically disabled when output is piped or `NO_COLOR` is set.
-### JSON Format (`--json`)
-Output uses snake_case for Python ecosystem compatibility:
-```json
-{
-  "matched": [
-    {
-      "eval_id": "case-1",
-      "score1": 0.7,
-      "score2": 0.9,
-      "delta": 0.2,
-      "outcome": "win"
-    }
-  ],
-  "unmatched": {
-    "file1": 0,
-    "file2": 0
-  },
-  "summary": {
-    "total": 2,
-    "matched": 1,
-    "wins": 1,
-    "losses": 0,
-    "ties": 0,
-    "mean_delta": 0.2
-  }
-}
-```
-## Exit Codes
-| Code | Meaning |
-|------|---------|
-| `0` | Candidate is equal or better (`mean_delta` >= 0) |
-| `1` | Baseline is better (regression detected) |
-## Workflow Examples
-### Model Comparison
-Compare different model versions:
-```bash
-# Run baseline evaluation
-agentv eval evals/*.yaml --target gpt-4 --out baseline.jsonl
-# Run candidate evaluation
-agentv eval evals/*.yaml --target gpt-4o --out candidate.jsonl
-# Compare results
-agentv compare baseline.jsonl candidate.jsonl
-```
-### Prompt Optimization
-Compare before/after prompt changes:
-```bash
-# Run with original prompt
-agentv eval evals/*.yaml --out before.jsonl
-# Modify prompt, then run again
-agentv eval evals/*.yaml --out after.jsonl
-# Compare with strict threshold
-agentv compare before.jsonl after.jsonl --threshold 0.05
-```
-### CI Quality Gate
-Fail CI if candidate regresses:
-```bash
-#!/bin/bash
-agentv compare baseline.jsonl candidate.jsonl
-if [ $? -eq 1 ]; then
-  echo "Regression detected! Candidate performs worse than baseline."
-  exit 1
-fi
-echo "Candidate is equal or better than baseline."
-```
-## Tips
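-- **Threshold Selection**: The default of 0.1 requires an absolute score difference of at least 0.1 (10 points on the 0.0-1.0 scale) before a result counts as a win or loss. Use a smaller threshold (0.05) to flag smaller differences in critical evaluations.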
-- **Unmatched Results**: Check `unmatched` counts to identify eval cases that only exist in one file.
-- **Multiple Comparisons**: Compare against multiple baselines by running the command multiple times.
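
The matching, delta, and classification steps described in the deleted compare-command.md can be sketched in Python as follows. This is a minimal illustration, not agentv's implementation: the `score` key on each input record, the rounding of deltas, and the helper names `classify`, `compare`, and `exit_code` are assumptions made for the example. The `total` summary field is left out because the source does not define it.

```python
from typing import Iterable


def classify(delta: float, threshold: float = 0.1) -> str:
    """Map a score delta to win / loss / tie using the documented rules."""
    if delta >= threshold:
        return "win"   # candidate better
    if delta <= -threshold:
        return "loss"  # baseline better
    return "tie"       # |delta| < threshold


def compare(baseline: Iterable[dict], candidate: Iterable[dict],
            threshold: float = 0.1) -> dict:
    # Assumption: each record carries "eval_id" and "score" keys.
    base = {r["eval_id"]: r["score"] for r in baseline}
    cand = {r["eval_id"]: r["score"] for r in candidate}

    matched = []
    for eval_id, score1 in base.items():
        if eval_id not in cand:
            continue
        score2 = cand[eval_id]
        # Rounding is this sketch's choice, to keep float noise
        # (e.g. 0.9 - 0.7) out of the threshold comparison.
        delta = round(score2 - score1, 6)
        matched.append({
            "eval_id": eval_id,
            "score1": score1,
            "score2": score2,
            "delta": delta,
            "outcome": classify(delta, threshold),
        })

    outcomes = [m["outcome"] for m in matched]
    mean_delta = 0.0
    if matched:
        mean_delta = round(sum(m["delta"] for m in matched) / len(matched), 6)

    return {
        "matched": matched,
        "unmatched": {
            "file1": len(base.keys() - cand.keys()),  # only in baseline
            "file2": len(cand.keys() - base.keys()),  # only in candidate
        },
        "summary": {
            "matched": len(matched),
            "wins": outcomes.count("win"),
            "losses": outcomes.count("loss"),
            "ties": outcomes.count("tie"),
            "mean_delta": mean_delta,
        },
    }


def exit_code(result: dict) -> int:
    """0 when the candidate is equal or better, 1 on regression."""
    return 0 if result["summary"]["mean_delta"] >= 0 else 1
```

Fed the three rows from the example table (0.70 to 0.90, 0.85 to 0.80, 0.90 to 0.75), this sketch reports one win, one tie, and one loss with a mean delta of zero, so `exit_code` returns 0.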

package/dist/templates/.claude/skills/agentv-eval-builder/references/composite-evaluator.md DELETED

@@ -1,215 +0,0 @@
-# Composite Evaluator Guide
-Composite evaluators combine multiple evaluators and aggregate their results. This enables sophisticated evaluation patterns like safety gates, weighted scoring, and conflict resolution.
-## Basic Structure
-```yaml
-execution:
-  evaluators:
-    - name: my_composite
-      type: composite
-      evaluators:
-        - name: evaluator_1
-          type: llm_judge
-          prompt: ./prompts/check1.md
-        - name: evaluator_2
-          type: code_judge
-          script: uv run check2.py
-      aggregator:
-        type: weighted_average
-        weights:
-          evaluator_1: 0.6
-          evaluator_2: 0.4
-```
-## Aggregator Types
-### 1. Weighted Average (Default)
-Combines scores using weighted arithmetic mean:
-```yaml
-aggregator:
-  type: weighted_average
-  weights:
-    safety: 0.3      # 30% weight
-    quality: 0.7     # 70% weight
-```
-If weights are omitted, all evaluators have equal weight (1.0).
-**Score calculation:**
-```
-final_score = Σ(score_i × weight_i) / Σ(weight_i)
-```
-### 2. Code Judge Aggregator
-Run custom code to decide the final score based on all evaluator results:
-```yaml
-aggregator:
-  type: code_judge
-  path: node ./scripts/safety-gate.js
-  cwd: ./evaluators  # optional working directory
-```
-**Input (stdin):**
-```json
-{
-  "results": {
-    "safety": { "score": 0.9, "hits": [...], "misses": [...] },
-    "quality": { "score": 0.85, "hits": [...], "misses": [...] }
-  }
-}
-```
-**Output (stdout):**
-```json
-{
-  "score": 0.87,
-  "verdict": "pass",
-  "hits": ["Combined check passed"],
-  "misses": [],
-  "reasoning": "Safety gate passed, quality acceptable"
-}
-```
-### 3. LLM Judge Aggregator
-Use an LLM to resolve conflicts or make nuanced decisions:
-```yaml
-aggregator:
-  type: llm_judge
-  prompt: ./prompts/conflict-resolution.md
-```
-The `{{EVALUATOR_RESULTS_JSON}}` variable is replaced with the JSON results from all child evaluators.
-## Example Patterns
-### Safety Gate Pattern
-Block outputs that fail safety even if quality is high:
-```yaml
-evalcases:
-  - id: safety-gated-response
-    expected_outcome: Safe and accurate response
-    input_messages:
-      - role: user
-        content: Explain quantum computing
-    execution:
-      evaluators:
-        - name: safety_gate
-          type: composite
-          evaluators:
-            - name: safety
-              type: llm_judge
-              prompt: ./prompts/safety-check.md
-            - name: quality
-              type: llm_judge
-              prompt: ./prompts/quality-check.md
-          aggregator:
-            type: code_judge
-            path: ./scripts/safety-gate.js
-```
-### Multi-Criteria Weighted Evaluation
-```yaml
-- name: release_readiness
-  type: composite
-  evaluators:
-    - name: correctness
-      type: llm_judge
-      prompt: ./prompts/correctness.md
-    - name: style
-      type: code_judge
-      script: uv run style_checker.py
-    - name: security
-      type: llm_judge
-      prompt: ./prompts/security.md
-  aggregator:
-    type: weighted_average
-    weights:
-      correctness: 0.5
-      style: 0.2
-      security: 0.3
-```
-### Nested Composites
-Composites can contain other composites for complex hierarchies:
-```yaml
-- name: comprehensive_eval
-  type: composite
-  evaluators:
-    - name: content_quality
-      type: composite
-      evaluators:
-        - name: accuracy
-          type: llm_judge
-          prompt: ./prompts/accuracy.md
-        - name: clarity
-          type: llm_judge
-          prompt: ./prompts/clarity.md
-      aggregator:
-        type: weighted_average
-        weights:
-          accuracy: 0.6
-          clarity: 0.4
-    - name: safety
-      type: llm_judge
-      prompt: ./prompts/safety.md
-  aggregator:
-    type: weighted_average
-    weights:
-      content_quality: 0.7
-      safety: 0.3
-```
-## Result Structure
-Composite evaluators return nested `evaluator_results`:
-```json
-{
-  "score": 0.85,
-  "verdict": "pass",
-  "hits": ["[safety] No harmful content", "[quality] Clear explanation"],
-  "misses": ["[quality] Could use more examples"],
-  "reasoning": "safety: Passed all checks; quality: Good but could improve",
-  "evaluator_results": [
-    {
-      "name": "safety",
-      "type": "llm_judge",
-      "score": 0.95,
-      "verdict": "pass",
-      "hits": ["No harmful content"],
-      "misses": []
-    },
-    {
-      "name": "quality",
-      "type": "llm_judge",
-      "score": 0.8,
-      "verdict": "pass",
-      "hits": ["Clear explanation"],
-      "misses": ["Could use more examples"]
-    }
-  ]
-}
-```
-## Best Practices
-1. **Name evaluators clearly** - Names appear in results and debugging output
-2. **Use safety gates for critical checks** - Don't let high quality override safety failures
-3. **Balance weights thoughtfully** - Consider which aspects matter most for your use case
-4. **Keep nesting shallow** - Deep nesting makes debugging harder
-5. **Test aggregators independently** - Verify your custom aggregation logic with unit tests
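
Two of the aggregation strategies in the deleted composite-evaluator.md can be sketched in Python: the weighted arithmetic mean and a safety gate that consumes the documented `{"results": {...}}` payload. This is an illustrative sketch under stated assumptions, not the package's code. The 1.0 default for an evaluator missing from a partial `weights` map, the `min_safety` cutoff of 0.8, the `"fail"` verdict string, and the function names are all hypothetical.

```python
from typing import Optional


def weighted_average(scores: dict, weights: Optional[dict] = None) -> float:
    """final_score = sum(score_i * weight_i) / sum(weight_i)."""
    weights = weights or {}
    # Assumption: an evaluator with no listed weight counts as 1.0.
    pairs = [(score, weights.get(name, 1.0)) for name, score in scores.items()]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return 0.0  # nothing to aggregate
    return sum(s * w for s, w in pairs) / total_weight


def safety_gate(payload: dict, min_safety: float = 0.8) -> dict:
    """Zero the score when safety fails, regardless of quality.

    `payload` mirrors the documented stdin shape; `min_safety` is a
    hypothetical cutoff chosen for this example.
    """
    results = payload["results"]
    safety = results["safety"]["score"]
    quality = results["quality"]["score"]
    if safety < min_safety:
        return {
            "score": 0.0,
            "verdict": "fail",  # assumed counterpart of "pass"
            "hits": [],
            "misses": [f"Safety score {safety} below {min_safety}"],
            "reasoning": "Safety gate failed; quality ignored",
        }
    return {
        "score": quality,
        "verdict": "pass",
        "hits": ["Safety gate passed"],
        "misses": [],
        "reasoning": "Safety gate passed; quality score used",
    }


# Nested composite from the guide: content_quality feeds the outer average.
content_quality = weighted_average(
    {"accuracy": 0.9, "clarity": 0.8},
    {"accuracy": 0.6, "clarity": 0.4},
)
overall = weighted_average(
    {"content_quality": content_quality, "safety": 1.0},
    {"content_quality": 0.7, "safety": 0.3},
)
```

Keeping both aggregators as pure functions over plain dictionaries makes them straightforward to unit test, in line with the guide's advice to test aggregators independently.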

package/dist/templates/.claude/skills/agentv-eval-builder/references/config-schema.json DELETED

@@ -1,27 +0,0 @@
-{
-  "$schema": "http://json-schema.org/draft-07/schema#",
-  "title": "AgentV Config Schema",
-  "description": "Schema for .agentv/config.yaml configuration files",
-  "type": "object",
-  "properties": {
-    "$schema": {
-      "type": "string",
-      "description": "Schema identifier",
-      "enum": ["agentv-config-v2"]
-    },
-    "guideline_patterns": {
-      "type": "array",
-      "description": "Glob patterns for identifying guideline files (instructions, prompts). Files matching these patterns are treated as guidelines, while non-matching files are treated as regular file content.",
-      "items": {
-        "type": "string",
-        "description": "Glob pattern (e.g., '**/*.instructions.md', '**/prompts/**')"
-      },
-      "examples": [
-        ["**/*.instructions.md", "**/instructions/**", "**/*.prompt.md", "**/prompts/**"],
-        ["**/*.guide.md", "**/guidelines/**", "docs/AGENTS.md"]
-      ]
-    }
-  },
-  "required": ["$schema"],
-  "additionalProperties": false
-}
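
The constraints expressed by the deleted config-schema.json can be checked by hand with a few lines of Python. The sketch below mirrors only what the schema states (a required `$schema` equal to `agentv-config-v2`, an optional `guideline_patterns` array of strings, and no other properties). The function name `validate_config` and the error wording are hypothetical, and a real project would more likely use a JSON Schema validator library.

```python
ALLOWED_KEYS = {"$schema", "guideline_patterns"}
SCHEMA_ID = "agentv-config-v2"


def validate_config(config: object) -> list:
    """Return a list of human-readable problems; empty means valid."""
    if not isinstance(config, dict):
        return ["config must be an object"]

    errors = []

    # "required": ["$schema"] plus the single-value enum.
    if "$schema" not in config:
        errors.append("missing required property '$schema'")
    elif config["$schema"] != SCHEMA_ID:
        errors.append(f"'$schema' must be '{SCHEMA_ID}'")

    # Optional array of glob pattern strings.
    patterns = config.get("guideline_patterns")
    if patterns is not None:
        if not isinstance(patterns, list):
            errors.append("'guideline_patterns' must be an array")
        elif not all(isinstance(p, str) for p in patterns):
            errors.append("'guideline_patterns' items must be strings")

    # "additionalProperties": false
    for key in sorted(set(config) - ALLOWED_KEYS):
        errors.append(f"unexpected property '{key}'")

    return errors
```

The function collects every problem instead of stopping at the first, which tends to be more useful when reporting on a hand-edited config file.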

package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md DELETED

@@ -1,115 +0,0 @@
-# Custom Evaluators
-## Wire Format
-### Input (stdin JSON)
-```json
-{
-  "question": "string",
-  "expected_outcome": "string",
-  "reference_answer": "string",
-  "candidate_answer": "string",
-  "guideline_files": ["path"],
-  "input_files": ["path"],
-  "input_messages": [{"role": "user", "content": "..."}],
-  "expected_messages": [{"role": "assistant", "content": "..."}],
-  "output_messages": [{"role": "assistant", "content": "..."}],
-  "trace_summary": {
-    "event_count": 5,
-    "tool_names": ["fetch"],
-    "tool_calls_by_name": {"fetch": 1},
-    "error_count": 0,
-    "token_usage": {"input": 1000, "output": 500},
-    "cost_usd": 0.0015,
-    "duration_ms": 3500
-  }
-}
-```
-### Output (stdout JSON)
-```json
-{
-  "score": 0.85,
-  "hits": ["passed check"],
-  "misses": ["failed check"],
-  "reasoning": "explanation"
-}
-```
-`score` (0.0-1.0) is required; `hits`, `misses`, and `reasoning` are optional.
-## SDK Functions
-```typescript
-import { defineCodeJudge, createTargetClient, definePromptTemplate } from '@agentv/eval';
-```
-- `defineCodeJudge(fn)` - Wraps evaluation function with stdin/stdout handling
-- `createTargetClient()` - Returns LLM proxy client (when `target: {}` configured)
-  - `.invoke({question, systemPrompt})` - Single LLM call
-  - `.invokeBatch(requests)` - Batch LLM calls
-- `definePromptTemplate(fn)` - Wraps prompt generation function
-  - Context fields: `question`, `candidateAnswer`, `referenceAnswer`, `expectedOutcome`, `expectedMessages`, `outputMessages`, `config`, `traceSummary`
-## Python Example
-```python
-#!/usr/bin/env python3
-import json, sys
-def evaluate(data: dict) -> dict:
-    candidate = data.get("candidate_answer", "")
-    hits, misses = [], []
-    for kw in ["async", "await"]:
-        (hits if kw in candidate else misses).append(f"Keyword '{kw}'")
-    return {
-        "score": len(hits) / max(len(hits) + len(misses), 1),
-        "hits": hits, "misses": misses
-    }
-if __name__ == "__main__":
-    try:
-        print(json.dumps(evaluate(json.loads(sys.stdin.read()))))
-    except Exception as e:
-        print(json.dumps({"score": 0, "misses": [str(e)]}))
-        sys.exit(1)
-```
-## TypeScript Example
-```typescript
-#!/usr/bin/env bun
-import { defineCodeJudge } from '@agentv/eval';
-export default defineCodeJudge(({ candidateAnswer, expectedOutcome }) => {
-  const hits: string[] = [];
-  const misses: string[] = [];
-  if (candidateAnswer.includes(expectedOutcome)) {
-    hits.push('Matches expected outcome');
-  } else {
-    misses.push('Does not match expected outcome');
-  }
-  return {
-    score: hits.length / Math.max(hits.length + misses.length, 1),
-    hits, misses,
-  };
-});
-```
-## Template Variables
-Derived from eval case fields (users never author these directly):
-| Variable | Source |
-|----------|--------|
-| `question` | First user message in `input_messages` |
-| `expected_outcome` | Eval case `expected_outcome` field |
-| `reference_answer` | Last entry in `expected_messages` |
-| `candidate_answer` | Last entry in `output_messages` (runtime) |
-| `input_messages` | Full resolved input array (JSON) |
-| `expected_messages` | Full resolved expected array (JSON) |
-| `output_messages` | Full provider output array (JSON) |
-Markdown templates use `{{variable}}` syntax. TypeScript templates receive a context object.
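
The template-variable table in the deleted custom-evaluators.md can be illustrated with a short Python sketch that derives each variable from an eval case and substitutes `{{variable}}` placeholders. This is a guess at the behavior from the table alone, not agentv's code. It assumes message `content` is a plain string, tolerates whitespace inside the braces, and leaves unknown placeholders untouched; the helper names `derive_variables` and `render` are hypothetical.

```python
import json
import re


def derive_variables(eval_case: dict, output_messages: list) -> dict:
    """Build the template variables listed in the table."""
    input_messages = eval_case.get("input_messages", [])
    expected_messages = eval_case.get("expected_messages", [])

    # question: first user message in input_messages
    question = next(
        (m["content"] for m in input_messages if m.get("role") == "user"), ""
    )
    # reference_answer: last entry in expected_messages
    reference = expected_messages[-1]["content"] if expected_messages else ""
    # candidate_answer: last entry in output_messages (runtime)
    candidate = output_messages[-1]["content"] if output_messages else ""

    return {
        "question": question,
        "expected_outcome": eval_case.get("expected_outcome", ""),
        "reference_answer": reference,
        "candidate_answer": candidate,
        "input_messages": json.dumps(input_messages),
        "expected_messages": json.dumps(expected_messages),
        "output_messages": json.dumps(output_messages),
    }


def render(template: str, variables: dict) -> str:
    """Replace {{name}} placeholders; unknown names are left as written."""
    def substitute(match):
        name = match.group(1)
        return str(variables[name]) if name in variables else match.group(0)

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)
```

A prompt such as `Question: {{question}}` rendered against an eval case whose first user message is "Explain quantum computing" would produce `Question: Explain quantum computing`.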