agentv 0.22.0 → 0.25.0 (npm version diff, Mend)

Files changed (16)

package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md CHANGED

@@ -10,14 +10,13 @@ Code evaluators receive input via stdin and write output to stdout, both as JSON
 ```json
 {
-  "task": "string describing the task",
-  "outcome": "expected outcome description",
-  "expected": "expected output string",
-  "output": "generated code/text from the agent",
-  "system_message": "system message if any",
+  "question": "string describing the task/question",
+  "expected_outcome": "expected outcome description",
+  "reference_answer": "gold standard answer (optional)",
+  "candidate_answer": "generated code/text from the agent",
   "guideline_paths": ["path1", "path2"],
-  "attachments": ["file1", "file2"],
-  "user_segments": [{"type": "text", "value": "..."}]
+  "input_files": ["file1", "file2"],
+  "input_messages": [{"role": "user", "content": "..."}]
 }
 ```
@@ -65,8 +64,8 @@ def evaluate(input_data: dict[str, Any]) -> dict[str, Any]:
         Evaluation result with score, hits, misses, reasoning
     """
     # Extract only the fields you need
-    # Most evaluators only need 'output' - avoid using unnecessary fields
-    output = input_data.get("output", "")
+    # Most evaluators only need 'candidate_answer' - avoid using unnecessary fields
+    candidate_answer = input_data.get("candidate_answer", "")
     # Your validation logic here
     hits = []
@@ -75,7 +74,7 @@ def evaluate(input_data: dict[str, Any]) -> dict[str, Any]:
     # Example: Check for keywords
     required_keywords = ["async", "await"]
     for keyword in required_keywords:
-        if keyword in output:
+        if keyword in candidate_answer:
             hits.append(f"Contains required keyword: {keyword}")
         else:
             misses.append(f"Missing required keyword: {keyword}")
@@ -123,157 +122,55 @@ if __name__ == "__main__":
     main()
 ```
-## JSON Format Validator Example
-A common pattern is validating JSON output structure:
-```python
-#!/usr/bin/env python3
-"""
-JSON Format Validator for AgentV
-Validates that output is valid JSON with required keys.
-"""
-import json
-import sys
-from typing import Any
+## LLM Judge Prompt Template
-def validate_json_format(output: str, required_keys: list[str]) -> dict[str, Any]:
-    """
-    Validate that output is valid JSON with required keys.
-    Args:
-        output: The candidate output to validate
-        required_keys: List of required top-level keys
-    Returns:
-        Evaluation result dict
-    """
-    # Try to parse as JSON
-    try:
-        parsed = json.loads(output.strip())
-    except json.JSONDecodeError as e:
-        return {
-            "score": 0.0,
-            "hits": [],
-            "misses": ["Not valid JSON"],
-            "reasoning": f"Output is not valid JSON. Parse error: {str(e)}"
-        }
-    # Check if it's a dict
-    if not isinstance(parsed, dict):
-        return {
-            "score": 0.0,
-            "hits": [],
-            "misses": ["JSON is not an object/dict"],
-            "reasoning": f"Output is valid JSON but not an object. Got: {type(parsed).__name__}"
-        }
-    # Check for required keys
-    missing_keys = [key for key in required_keys if key not in parsed]
-    present_keys = [key for key in required_keys if key in parsed]
-    if missing_keys:
-        return {
-            "score": 0.0,
-            "hits": [f"Has key: {key}" for key in present_keys],
-            "misses": [f"Missing key: {key}" for key in missing_keys],
-            "reasoning": f"Valid JSON but missing required keys: {', '.join(missing_keys)}"
-        }
-    # All checks passed
-    return {
-        "score": 1.0,
-        "hits": [f"Valid JSON with all required keys: {', '.join(required_keys)}"],
-        "misses": [],
-        "reasoning": f"Valid JSON with all required keys: {', '.join(required_keys)}"
-    }
+LLM judges use markdown prompts to guide evaluation. AgentV automatically handles the output format, so focus your prompt on evaluation criteria and guidelines.
+**Available Template Variables:**
+- `{{question}}` - The original question/task
+- `{{expected_outcome}}` - What the answer should accomplish
+- `{{candidate_answer}}` - The actual output to evaluate
+- `{{reference_answer}}` - Gold standard answer (optional, may be empty)
+- `{{input_messages}}` - JSON stringified input message segments
+- `{{output_messages}}` - JSON stringified expected output segments
-def main():
-    """Main entry point."""
-    try:
-        input_data = json.loads(sys.stdin.read())
-        output = input_data.get("output", "")
-        # Define required keys (customize as needed)
-        required_keys = ["criticalityRating", "reasoning"]
-        result = validate_json_format(output, required_keys)
-        print(json.dumps(result, indent=2))
-    except Exception as e:
-        error_result = {
-            "score": 0.0,
-            "hits": [],
-            "misses": [f"Evaluator error: {str(e)}"],
-            "reasoning": f"Evaluator error: {str(e)}"
-        }
-        print(json.dumps(error_result, indent=2))
-        sys.exit(1)
+**Default Evaluator Template:**
+If you don't specify a custom evaluator template, AgentV uses this default:
-if __name__ == "__main__":
-    main()
 ```
+You are an expert evaluator. Your goal is to grade the candidate_answer based on how well it achieves the expected_outcome for the original task.
-## LLM Judge Prompt Template
-LLM judges use markdown prompts to guide evaluation:
-```markdown
-# Code Quality Judge
-Evaluate the candidate code for quality, correctness, and best practices.
-## Evaluation Criteria
+Use the reference_answer as a gold standard for a high-quality response (if provided). The candidate_answer does not need to match it verbatim, but should capture the key points and follow the same spirit.
-Rate the code on:
-1. **Correctness** - Does it solve the problem?
-2. **Style** - Does it follow best practices?
-3. **Completeness** - Are edge cases handled?
-4. **Documentation** - Are there helpful comments/docstrings?
+Be concise and focused in your evaluation. Provide succinct, specific feedback rather than verbose explanations.
-## Scoring Guidelines
+[[ ## expected_outcome ## ]]
+{{expected_outcome}}
-- **0.9-1.0:** Excellent - Correct, clean, well-documented
-- **0.7-0.8:** Good - Correct with minor style issues
-- **0.5-0.6:** Adequate - Works but has quality issues
-- **0.3-0.4:** Poor - Has bugs or major style problems
-- **0.0-0.2:** Unacceptable - Does not work or completely wrong
+[[ ## question ## ]]
+{{question}}
-## Output Format
+[[ ## reference_answer ## ]]
+{{reference_answer}}
-Respond with valid JSON:
-```json
-{
-  "score": 0.85,
-  "hits": [
-    "Correctly implements the algorithm",
-    "Good error handling"
-  ],
-  "misses": [
-    "Missing type hints",
-    "No docstring"
-  ],
-  "reasoning": "Code is correct and handles errors well, but lacks documentation."
-}
-```
+[[ ## candidate_answer ## ]]
+{{candidate_answer}}
 ```
+You can customize this template in your eval file using the `evaluatorTemplate` field to add domain-specific criteria or scoring guidelines.
 ## Best Practices
-### For Code Evaluators
+### For Code-based Evaluators
-1. **Focus on relevant fields** - Most evaluators only need the `output` field
-2. **Avoid false positives** - Don't check fields like `task` or `expected` unless you specifically need context
+1. **Focus on relevant fields** - Most evaluators only need the `candidate_answer` field
+2. **Avoid false positives** - Don't check fields like `question` or `reference_answer` unless you specifically need context
 3. **Be deterministic** - Same input should always produce same output
 4. **Handle errors gracefully** - Return a valid result even when evaluation fails
 5. **Provide helpful feedback** - Use `hits` and `misses` to explain the score
-### For LLM Judges
+### For Prompt-based Evaluators (LLM Judges)
 1. **Clear criteria** - Define what you're evaluating
 2. **Specific guidelines** - Provide scoring rubrics
@@ -281,37 +178,6 @@ Respond with valid JSON:
 4. **Examples** - Show what good/bad looks like
 5. **Concise prompts** - Keep instructions focused
-### Common Pitfalls to Avoid
-**❌ Checking unnecessary fields:**
-```python
-# BAD: Checking 'task' or 'expected' when you only need to validate format
-if "async" in input_data.get("task", ""):
-    # This creates false positives
-```
-**✅ Focus on output:**
-```python
-# GOOD: Only check the actual output
-output = input_data.get("output", "")
-if "async" in output:
-    # This is what you actually want to validate
-```
-**❌ Brittle string matching:**
-```python
-# BAD: Exact match is too strict
-if output == "The answer is 42":
-    score = 1.0
-```
-**✅ Flexible validation:**
-```python
-# GOOD: Check for semantic correctness
-if "42" in output and "answer" in output.lower():
-    score = 1.0
-```
 ## Running Code Evaluators
 ### In Eval Files
@@ -332,8 +198,9 @@ Test your evaluator locally:
 ```bash
 # Create test input
 echo '{
-  "output": "test output here",
-  "task": "test task"
+  "candidate_answer": "test output here",
+  "question": "test task",
+  "expected_outcome": "expected result"
 }' | uv run my_validator.py
 # Should output:
@@ -344,56 +211,3 @@ echo '{
 #   "reasoning": "..."
 # }
 ```
-## Advanced Patterns
-### Combining Multiple Checks
-```python
-def evaluate(input_data: dict[str, Any]) -> dict[str, Any]:
-    output = input_data.get("output", "")
-    checks = [
-        ("has_async", "async" in output, "Contains async keyword"),
-        ("has_await", "await" in output, "Contains await keyword"),
-        ("has_try", "try:" in output, "Has error handling"),
-    ]
-    hits = [msg for _, passed, msg in checks if passed]
-    misses = [msg for _, passed, msg in checks if not passed]
-    score = len(hits) / len(checks)
-    return {
-        "score": score,
-        "hits": hits,
-        "misses": misses,
-        "reasoning": f"Passed {len(hits)}/{len(checks)} checks"
-    }
-```
-### Weighted Scoring
-```python
-def evaluate(input_data: dict[str, Any]) -> dict[str, Any]:
-    output = input_data.get("output", "")
-    # Define checks with weights
-    checks = [
-        ("correctness", is_correct(output), 0.5),
-        ("style", has_good_style(output), 0.3),
-        ("docs", has_docs(output), 0.2),
-    ]
-    hits = [name for name, passed, _ in checks if passed]
-    misses = [name for name, passed, _ in checks if not passed]
-    # Weighted score
-    score = sum(weight for _, passed, weight in checks if passed)
-    return {
-        "score": score,
-        "hits": hits,
-        "misses": misses,
-        "reasoning": f"Weighted score: {score:.2f}"
-    }
-```
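
Note on the custom-evaluators.md hunks above: the paired -/+ lines rename most of the code-evaluator input fields between 0.22.0 and 0.25.0. A minimal sketch of a shim that a code evaluator script could use to accept either payload shape during migration follows. The helper name `normalize_input` and the `RENAMED_FIELDS` table are hypothetical; the mapping is read off the diff, not taken from any agentv API. `user_segments` is deliberately left unmapped because its items (`type`/`value`) have a different shape from `input_messages` items (`role`/`content`), and `system_message` has no counterpart in the new payload.

```python
from typing import Any

# Old (0.22.0) name -> new (0.25.0) name, inferred from the paired -/+ lines.
# This table is an assumption based on the diff, not an official mapping.
RENAMED_FIELDS = {
    "task": "question",
    "outcome": "expected_outcome",
    "expected": "reference_answer",
    "output": "candidate_answer",
    "attachments": "input_files",
}


def normalize_input(input_data: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of input_data keyed by the 0.25.0 field names."""
    normalized = dict(input_data)
    for old_name, new_name in RENAMED_FIELDS.items():
        if old_name in normalized:
            value = normalized.pop(old_name)
            # If the payload already carries the new name, that value wins.
            normalized.setdefault(new_name, value)
    return normalized
```

An evaluator would call `normalize_input(json.loads(sys.stdin.read()))` once and then read `candidate_answer` as in the updated template.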

package/dist/templates/.claude/skills/agentv-eval-builder/references/eval-schema.json CHANGED

@@ -71,7 +71,7 @@
             "type": "string",
             "description": "Optional conversation identifier for threading multiple eval cases together"
           },
-          "outcome": {
+          "expected_outcome": {
             "type": "string",
             "description": "Description of what the AI should accomplish in this eval"
           },
@@ -207,7 +207,7 @@
             "additionalProperties": true
           }
         },
-        "required": ["id", "outcome", "input_messages", "expected_messages"],
+        "required": ["id", "expected_outcome", "input_messages", "expected_messages"],
         "additionalProperties": false
       }
     }
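
Note on the eval-schema.json hunks above: because the eval case object sets `additionalProperties` to false and now requires `expected_outcome`, an eval case that still uses `outcome` would presumably fail validation on two counts. A rough sketch of a pre-flight check over already-parsed eval cases follows. `migrate_evalcase` and `missing_required_keys` are hypothetical helper names, and the key list is copied from the `required` array in the hunk.

```python
from typing import Any

# Required keys copied from the 0.25.0 "required" array in the hunk above.
REQUIRED_KEYS = ("id", "expected_outcome", "input_messages", "expected_messages")


def migrate_evalcase(evalcase: dict[str, Any]) -> dict[str, Any]:
    """Return a copy with the legacy 'outcome' key renamed."""
    migrated = dict(evalcase)
    # If both keys are present the case is ambiguous, so it is left untouched.
    if "outcome" in migrated and "expected_outcome" not in migrated:
        migrated["expected_outcome"] = migrated.pop("outcome")
    return migrated


def missing_required_keys(evalcase: dict[str, Any]) -> list[str]:
    """List the required keys that are absent from an eval case."""
    return [key for key in REQUIRED_KEYS if key not in evalcase]
```

This only checks key presence; full validation would still need to run the JSON schema itself.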

package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md CHANGED

@@ -11,7 +11,7 @@ target: default
 evalcases:
   - id: simple-addition
-    outcome: Correctly calculates 2+2
+    expected_outcome: Correctly calculates 2+2
     input_messages:
       - role: user
@@ -31,7 +31,7 @@ target: azure_base
 evalcases:
   - id: code-review-basic
-    outcome: Assistant provides helpful code analysis with security considerations
+    expected_outcome: Assistant provides helpful code analysis with security considerations
     input_messages:
       - role: system
@@ -73,7 +73,7 @@ target: default
 evalcases:
   - id: json-generation-with-validation
-    outcome: Generates valid JSON with required fields
+    expected_outcome: Generates valid JSON with required fields
     execution:
       evaluators:
@@ -111,7 +111,7 @@ target: default
 evalcases:
   - id: debug-with-clarification
-    outcome: |-
+    expected_outcome: |-
       Assistant conducts a multi-turn debugging session, asking clarification
       questions when needed, correctly diagnosing the bug, and proposing a clear
       fix with rationale.
@@ -169,7 +169,7 @@ evalcases:
 - **Relative paths** (start with `./` or `../`): Resolved from eval file directory
   - Example: `../../prompts/file.md` → Two directories up, then into prompts/
-### Outcome Writing Tips
+### expected_outcome Writing Tips
 - Be specific about what success looks like
 - Mention key elements that must be present
 - For classification tasks, specify the expected category

package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md ADDED

@@ -0,0 +1,139 @@
+# Rubric Evaluator Guide
+Rubrics provide structured evaluation through lists of criteria that define what makes a good response. Rubrics are checked by an LLM judge and scored based on weights and requirements.
+## Basic Usage
+### Simple String Rubrics
+Define rubrics as simple strings - each becomes a required criterion with weight 1.0:
+```yaml
+$schema: agentv-eval-v2
+evalcases:
+  - id: quicksort-explanation
+    expected_outcome: Explain how quicksort works
+    input_messages:
+      - role: user
+        content: Explain how the quicksort algorithm works
+    rubrics:
+      - Mentions divide-and-conquer approach
+      - Explains the partition step
+      - States time complexity correctly
+```
+### Detailed Rubric Objects
+Use objects for fine-grained control over weights and requirements:
+```yaml
+evalcases:
+  - id: technical-guide
+    expected_outcome: Write a comprehensive HTTP status codes guide
+    input_messages:
+      - role: user
+        content: Write a guide explaining HTTP status codes
+    rubrics:
+      - id: structure
+        description: Has clear headings and organization
+        weight: 1.0
+        required: true
+      - id: success-codes
+        description: Covers 2xx success codes with examples
+        weight: 2.0
+        required: true
+      - id: client-errors
+        description: Explains 4xx client error codes
+        weight: 2.0
+        required: true
+      - id: server-errors
+        description: Explains 5xx server error codes
+        weight: 1.5
+        required: false
+      - id: practical-examples
+        description: Includes practical use case examples
+        weight: 1.0
+        required: false
+```
+## Rubric Object Fields
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `id` | string | auto-generated | Unique identifier for the rubric |
+| `description` | string | required | The criterion being evaluated |
+| `weight` | number | 1.0 | Relative importance (higher = more impact on score) |
+| `required` | boolean | true | If true, failing this rubric forces verdict to 'fail' |
+## Scoring and Verdicts
+**Score Calculation:**
+```
+score = (sum of satisfied weights) / (total weights)
+```
+**Verdict Rules:**
+- `pass`: Score ≥ 0.8 AND all required rubrics satisfied
+- `borderline`: Score ≥ 0.6 AND all required rubrics satisfied
+- `fail`: Score < 0.6 OR any required rubric failed
+## Combining Rubrics with Other Evaluators
+Rubrics can be combined with code evaluators for comprehensive validation:
+```yaml
+evalcases:
+  - id: email-validator
+    expected_outcome: Python function to validate email addresses
+    input_messages:
+      - role: user
+        content: Write a Python function to validate email addresses
+    # Semantic evaluation via rubrics
+    rubrics:
+      - Uses regular expressions for validation
+      - Includes type hints
+      - Has docstring documentation
+      - Handles edge cases (None, empty string)
+    execution:
+      evaluators:
+        # Rubric evaluator is auto-added from inline rubrics field
+        # Additional code evaluator for syntax checking
+        - name: python_syntax
+          type: code_judge
+          script: uv run python -m py_compile
+```
+## Generate Rubrics from Expected Outcome
+Use the CLI to auto-generate rubrics from `expected_outcome`:
+```bash
+# Generate rubrics for eval cases that don't have them
+agentv generate rubrics evals/my-eval.yaml
+# Use a specific LLM target for generation
+agentv generate rubrics evals/my-eval.yaml --target azure_base
+```
+This analyzes each `expected_outcome` and creates appropriate rubric items.
+## Best Practices
+1. **Use required sparingly** - Only mark rubrics as `required: true` for critical criteria
+2. **Balance weights** - Use higher weights (2.0+) for core requirements, lower (0.5) for nice-to-haves
+3. **Be specific** - "Includes error handling" is better than "Good code quality"
+4. **Keep rubrics atomic** - Each rubric should test one thing
+5. **Consider partial credit** - Non-required rubrics allow partial scores
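
Note on the rubric-evaluator.md file above: the score formula and verdict rules in "Scoring and Verdicts" can be sketched as a small function. This is an illustration of the documented rules only, not agentv's implementation. The `RubricResult` class and `score_rubrics` function are hypothetical names, and returning `fail` for an empty or zero-weight rubric list is an assumption the guide does not cover.

```python
from dataclasses import dataclass


@dataclass
class RubricResult:
    description: str
    satisfied: bool
    weight: float = 1.0    # documented default
    required: bool = True  # documented default


def score_rubrics(results: list[RubricResult]) -> tuple[float, str]:
    """Apply the documented score formula and verdict thresholds."""
    total = sum(r.weight for r in results)
    if total <= 0:
        # Assumption: the guide does not define this case.
        return 0.0, "fail"
    score = sum(r.weight for r in results if r.satisfied) / total
    required_ok = all(r.satisfied for r in results if r.required)
    if not required_ok or score < 0.6:
        verdict = "fail"
    elif score >= 0.8:
        verdict = "pass"
    else:
        verdict = "borderline"
    return score, verdict
```

With the weights from the HTTP status codes example (1.0, 2.0, 2.0 required; 1.5, 1.0 optional), satisfying only the three required rubrics gives 5.0 / 7.5 ≈ 0.67, which is `borderline`. Also satisfying the 1.5-weight rubric gives 6.5 / 7.5 ≈ 0.87, which is `pass`.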