agentv 2.0.2 → 2.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,6 +1,6 @@
  # Custom Evaluators Guide
 
- Guide for writing custom code evaluators and LLM judges for AgentV eval files.
+ Templates and best practices for code evaluators and LLM judges. For YAML configuration, see `SKILL.md`.
 
  ## Code Evaluator Contract
 
@@ -19,21 +19,25 @@ Wire format uses snake_case for cross-language compatibility:
    "guideline_files": ["path1", "path2"],
    "input_files": ["file1", "file2"],
    "input_messages": [{"role": "user", "content": "..."}],
-   "output_messages": [
+   "expected_messages": [
      {
        "role": "assistant",
-       "content": "...",
        "tool_calls": [
          {
-           "tool": "search",
+           "tool": "vector_search",
            "input": { "query": "..." },
-           "output": { "results": [...] },
-           "id": "call_123",
-           "timestamp": "2024-01-15T10:30:00Z"
+           "output": { "results": ["doc1", "doc2"] }
          }
        ]
      }
    ],
+   "output_messages": [
+     {
+       "role": "assistant",
+       "content": "...",
+       "tool_calls": [...]
+     }
+   ],
    "trace_summary": {
      "event_count": 5,
      "tool_names": ["fetch", "search"],
@@ -47,7 +51,8 @@ Wire format uses snake_case for cross-language compatibility:
  ```
 
  **Key fields:**
- - `output_messages` - Full agent execution trace with tool calls (use `tool_calls[].input` for arguments)
+ - `expected_messages` - Expected agent behavior from YAML, including tool calls with outputs (use for retrieval context in RAG evals)
+ - `output_messages` - Actual agent execution trace with tool calls (from live agent runs)
  - `trace_summary` - Lightweight summary with execution metrics (counts only, no tool arguments)
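
For RAG-style checks, a code judge can compare the answer against the tool outputs recorded in `expected_messages`. A minimal sketch, assuming the SDK surfaces the field camelCased as `expectedMessages` with `toolCalls[].output` shaped as in the wire format above (the `results` key is taken from that example, not a guaranteed schema):

```typescript
#!/usr/bin/env bun
import { defineCodeJudge } from '@agentv/eval';

export default defineCodeJudge(({ candidateAnswer, expectedMessages }) => {
  // Gather expected retrieval context from the recorded tool-call outputs.
  const docs = (expectedMessages ?? [])
    .flatMap((m) => m.toolCalls ?? [])
    .flatMap((c) => ((c.output as { results?: string[] } | undefined)?.results ?? []));
  const cited = docs.filter((doc) => candidateAnswer.includes(doc));
  return {
    score: docs.length === 0 ? 1.0 : cited.length / docs.length,
    hits: cited.map((doc) => `Cited ${doc}`),
    misses: docs.filter((doc) => !cited.includes(doc)).map((doc) => `Missing ${doc}`),
    reasoning: `Cited ${cited.length}/${docs.length} expected documents`,
  };
});
```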
 
  ### Output Format (to stdout)
@@ -71,201 +76,128 @@ Wire format uses snake_case for cross-language compatibility:
 
  ```python
  #!/usr/bin/env python3
- """
- Example code evaluator for AgentV
-
- This evaluator checks for specific keywords in the output.
- Replace validation logic as needed.
- """
-
  import json
  import sys
- from typing import Any
-
-
- def evaluate(input_data: dict[str, Any]) -> dict[str, Any]:
-     """
-     Evaluate the agent output.
-
-     Args:
-         input_data: Full input context from AgentV
-
-     Returns:
-         Evaluation result with score, hits, misses, reasoning
-     """
-     # Extract only the fields you need
-     # Most evaluators only need 'candidate_answer' - avoid using unnecessary fields
-     candidate_answer = input_data.get("candidate_answer", "")
-
+
+ def evaluate(data: dict) -> dict:
+     candidate = data.get("candidate_answer", "")
+     hits, misses = [], []
+
      # Your validation logic here
-     hits = []
-     misses = []
-
-     # Example: Check for keywords
-     required_keywords = ["async", "await"]
-     for keyword in required_keywords:
-         if keyword in candidate_answer:
-             hits.append(f"Contains required keyword: {keyword}")
-         else:
-             misses.append(f"Missing required keyword: {keyword}")
-
-     # Calculate score
-     if not required_keywords:
-         score = 1.0
-     else:
-         score = len(hits) / len(required_keywords)
-
-     # Build result
+     keywords = ["async", "await"]
+     for kw in keywords:
+         (hits if kw in candidate else misses).append(f"Keyword '{kw}'")
+
      return {
-         "score": score,
+         "score": len(hits) / len(keywords) if keywords else 1.0,
          "hits": hits,
          "misses": misses,
-         "reasoning": f"Found {len(hits)}/{len(required_keywords)} required keywords"
+         "reasoning": f"Found {len(hits)}/{len(keywords)} keywords"
      }
 
-
- def main():
-     """Main entry point for AgentV code evaluator."""
+ if __name__ == "__main__":
      try:
-         # Read input from stdin
-         input_data = json.loads(sys.stdin.read())
-
-         # Run evaluation
-         result = evaluate(input_data)
-
-         # Write result to stdout
+         result = evaluate(json.loads(sys.stdin.read()))
          print(json.dumps(result, indent=2))
-
      except Exception as e:
-         # Error handling: return zero score with error message
-         error_result = {
-             "score": 0.0,
-             "hits": [],
-             "misses": [f"Evaluator error: {str(e)}"],
-             "reasoning": f"Evaluator error: {str(e)}"
-         }
-         print(json.dumps(error_result, indent=2))
+         print(json.dumps({"score": 0, "hits": [], "misses": [str(e)], "reasoning": "Error"}))
          sys.exit(1)
-
-
- if __name__ == "__main__":
-     main()
  ```
 
- ## TypeScript Code Evaluator Template (with SDK)
-
- The `@agentv/eval` SDK provides a declarative API for code evaluators with automatic stdin/stdout handling, validation, and error handling.
+ ## TypeScript Code Evaluator Template
 
- **Execution:** Keep evaluators as `.ts` files and run via `bun run` or Node loaders like `npx --yes tsx ./evaluators/my-check.ts`.
+ The `@agentv/eval` SDK provides a declarative API with automatic stdin/stdout handling.
 
  ```typescript
  #!/usr/bin/env bun
- /**
-  * Example TypeScript code evaluator using defineCodeJudge
-  *
-  * Run with: bun run ./evaluators/example-check.ts
-  * or: npx --yes tsx ./evaluators/example-check.ts
-  *
-  * The SDK handles:
-  * - Reading JSON from stdin
-  * - Converting snake_case to camelCase
-  * - Validating input with Zod
-  * - Error handling and output formatting
-  */
  import { defineCodeJudge } from '@agentv/eval';
 
- export default defineCodeJudge(({ candidateAnswer, expectedOutcome, inputFiles, guidelineFiles }) => {
+ export default defineCodeJudge(({ candidateAnswer, expectedOutcome }) => {
    const hits: string[] = [];
    const misses: string[] = [];
 
-   // Example: Check if answer contains expected outcome
+   // Your validation logic here
    if (candidateAnswer.includes(expectedOutcome)) {
      hits.push('Answer matches expected outcome');
    } else {
      misses.push('Answer does not match expected outcome');
    }
 
-   // Example: Check attachment mentions
-   const attachments = [...guidelineFiles, ...inputFiles];
-   for (const filePath of attachments) {
-     const fileName = filePath.split('/').pop() ?? filePath;
-     if (candidateAnswer.includes(fileName)) {
-       hits.push(`Mentions attachment: ${fileName}`);
-     } else {
-       misses.push(`Missing attachment: ${fileName}`);
-     }
-   }
-
-   // Calculate score
-   const totalChecks = hits.length + misses.length;
-   const score = totalChecks === 0 ? 0 : hits.length / totalChecks;
-
+   const total = hits.length + misses.length;
    return {
-     score,
+     score: total === 0 ? 0 : hits.length / total,
      hits,
      misses,
-     reasoning: `Passed ${hits.length}/${totalChecks} checks`,
+     reasoning: `Passed ${hits.length}/${total} checks`,
    };
  });
  ```
 
- **TypeScript SDK Benefits:**
- - **Zero boilerplate**: No try/catch, stdin parsing, or JSON.stringify needed
- - **Type-safe**: `CodeJudgeInput` interface with all fields typed
- - **camelCase**: Idiomatic TypeScript naming (`candidateAnswer` vs `candidate_answer`)
- - **Validation**: Zod schemas validate input and output at runtime
- - **Error handling**: Exceptions automatically produce valid failure results
+ **SDK exports:** `defineCodeJudge`, `Message`, `ToolCall`, `TraceSummary`, `CodeJudgeInput`, `CodeJudgeResult`
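
The exported trace types support cheap efficiency checks. The 2.0.x docs carried an execution-metrics example that still illustrates the typed input; a sketch, assuming `trace_summary` arrives camelCased as `traceSummary`:

```typescript
#!/usr/bin/env bun
import { defineCodeJudge } from '@agentv/eval';

export default defineCodeJudge(({ traceSummary }) => {
  if (!traceSummary) {
    return { score: 0.5, hits: [], misses: [], reasoning: 'No trace available' };
  }
  // trace_summary carries counts only, so this stays cheap and deterministic.
  const efficient = traceSummary.eventCount <= 10;
  return {
    score: efficient ? 1.0 : 0.5,
    hits: efficient ? ['Efficient execution'] : [],
    misses: efficient ? [] : ['Too many tool calls'],
    reasoning: `${traceSummary.eventCount} trace events`,
  };
});
```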
 
- **Available exports from `@agentv/eval`:**
- - `defineCodeJudge(handler)`: Define a code judge evaluator (recommended)
- - `CodeJudgeInput`: TypeScript type for input payload
- - `CodeJudgeResult`: TypeScript type for result
- - `TraceSummary`, `OutputMessage`: Types for trace data
- - `z`: Re-exported Zod for custom config schemas
+ ## Target Access for Code Evaluators
 
- **Using execution metrics:**
+ Code judges can access an LLM through a **target proxy** for metrics requiring multiple LLM calls (contextual precision, semantic similarity, etc.).
+
+ ### Configuration
+
+ ```yaml
+ evaluators:
+   - name: contextual-precision
+     type: code_judge
+     script: bun scripts/contextual-precision.ts
+     target:
+       max_calls: 10  # Default: 50
+ ```
+
+ ### Usage
 
  ```typescript
- import { defineCodeJudge } from '@agentv/eval';
+ #!/usr/bin/env bun
+ import { createTargetClient, defineCodeJudge } from '@agentv/eval';
 
- export default defineCodeJudge(({ traceSummary }) => {
-   if (!traceSummary) {
-     return { score: 0.5, reasoning: 'No trace available' };
-   }
+ export default defineCodeJudge(async ({ question, candidateAnswer }) => {
+   const target = createTargetClient();
+   if (!target) return { score: 0, misses: ['Target not configured'] };
 
-   const efficient = traceSummary.eventCount <= 10;
-   return {
-     score: efficient ? 1.0 : 0.5,
-     hits: efficient ? ['Efficient execution'] : [],
-     misses: efficient ? [] : ['Too many tool calls'],
-   };
+   const response = await target.invoke({
+     question: `Is this relevant to: ${question}? Response: ${candidateAnswer}`,
+     systemPrompt: 'Respond with JSON: { "relevant": true/false }'
+   });
+
+   const result = JSON.parse(response.rawText ?? '{}');
+   return { score: result.relevant ? 1.0 : 0.0 };
  });
  ```
 
- **See also:** `examples/features/code-judge-sdk/` for complete working examples
+ **Batch invocation:** Use `target.invokeBatch(requests)` for multiple calls.
+
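A sketch of batch use, assuming each request mirrors the `invoke` shape above and the responses come back as an array (the claim-splitting check itself is hypothetical):

```typescript
#!/usr/bin/env bun
import { createTargetClient, defineCodeJudge } from '@agentv/eval';

export default defineCodeJudge(async ({ question, candidateAnswer }) => {
  const target = createTargetClient();
  if (!target) return { score: 0, misses: ['Target not configured'] };

  // Verify up to five answer lines as separate claims, sent as one batch.
  const claims = candidateAnswer.split('\n').filter(Boolean).slice(0, 5);
  const responses = await target.invokeBatch(
    claims.map((claim) => ({
      question: `Does this claim help answer "${question}"? Claim: ${claim}`,
      systemPrompt: 'Respond with JSON: { "relevant": true/false }',
    })),
  );
  const relevant = responses.filter((r) => JSON.parse(r.rawText ?? '{}').relevant).length;
  return { score: claims.length ? relevant / claims.length : 0 };
});
```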
+ **Environment variables** (set automatically when `target` is configured):
+ - `AGENTV_TARGET_PROXY_URL` - Local proxy URL
+ - `AGENTV_TARGET_PROXY_TOKEN` - Bearer token for authentication
+
+ **See also:** `examples/features/code-judge-with-llm-calls/`
 
  ## LLM Judge Prompt Template
 
- LLM judges use markdown prompts to guide evaluation. AgentV automatically handles the output format, so focus your prompt on evaluation criteria and guidelines.
+ LLM judges use markdown prompts. AgentV handles the output format automatically.
 
  **Available Template Variables:**
  - `{{question}}` - The original question/task
  - `{{expected_outcome}}` - What the answer should accomplish
  - `{{candidate_answer}}` - The actual output to evaluate
- - `{{reference_answer}}` - Gold standard answer (optional, may be empty)
- - `{{input_messages}}` - JSON stringified input message segments
- - `{{output_messages}}` - JSON stringified expected output segments
+ - `{{reference_answer}}` - Gold standard answer (optional)
+ - `{{input_messages}}` - JSON stringified input messages
+ - `{{output_messages}}` - JSON stringified output messages
 
- **Default Evaluator Template:**
-
- If you don't specify a custom evaluator template, AgentV uses this default:
+ **Default Template:**
 
  ```
- You are an expert evaluator. Your goal is to grade the candidate_answer based on how well it achieves the expected_outcome for the original task.
+ You are an expert evaluator. Grade the candidate_answer based on how well it achieves the expected_outcome.
 
- Use the reference_answer as a gold standard for a high-quality response (if provided). The candidate_answer does not need to match it verbatim, but should capture the key points and follow the same spirit.
+ Use reference_answer as a gold standard (if provided). The candidate_answer doesn't need to match verbatim, but should capture key points.
 
- Be concise and focused in your evaluation. Provide succinct, specific feedback rather than verbose explanations.
+ Be concise. Provide specific feedback rather than verbose explanations.
 
  [[ ## expected_outcome ## ]]
  {{expected_outcome}}
@@ -280,76 +212,25 @@ Be concise and focused in your evaluation. Provide succinct, specific feedback rather than verbose explanations.
  {{candidate_answer}}
  ```
 
- You can customize this template in your eval file using the `evaluatorTemplate` field to add domain-specific criteria or scoring guidelines.
-
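For example, a domain-specific variant (wired up via the `evaluatorTemplate` field named in the 2.0.x text above; the 2.1.x docs no longer mention it) might read:

```
You are a code-review judge. Grade the candidate_answer strictly for factual correctness against the expected_outcome.

Deduct for any claim the expected_outcome does not support. Keep feedback to three bullet points.

[[ ## expected_outcome ## ]]
{{expected_outcome}}

[[ ## candidate_answer ## ]]
{{candidate_answer}}
```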
  ## Best Practices
 
- ### For Code-based Evaluators
-
- 1. **Focus on relevant fields** - Most evaluators only need the `candidate_answer` field
- 2. **Avoid false positives** - Don't check fields like `question` or `reference_answer` unless you specifically need context
- 3. **Be deterministic** - Same input should always produce same output
- 4. **Handle errors gracefully** - Return a valid result even when evaluation fails
- 5. **Provide helpful feedback** - Use `hits` and `misses` to explain the score
-
- ### For Prompt-based Evaluators (LLM Judges)
+ ### Code Evaluators
+ 1. **Focus on `candidate_answer`** - Most evaluators only need this field
+ 2. **Be deterministic** - Same input, same output
+ 3. **Handle errors gracefully** - Return a valid result even on failure
+ 4. **Use `hits`/`misses`** - Explain the score clearly
 
+ ### LLM Judges
  1. **Clear criteria** - Define what you're evaluating
- 2. **Specific guidelines** - Provide scoring rubrics
- 3. **JSON output** - Enforce structured output format
- 4. **Examples** - Show what good/bad looks like
- 5. **Concise prompts** - Keep instructions focused
-
- ## Running Code Evaluators
+ 2. **Specific rubrics** - Provide scoring guidelines
+ 3. **Concise prompts** - Keep instructions focused
 
- ### In Eval Files
-
- ```yaml
- execution:
-   evaluators:
-     - name: my_validator
-       type: code_judge
-       script: uv run my_validator.py
-       cwd: ./evaluators
- ```
-
- TypeScript evaluators use the same structure but invoke `tsx` (or another Node-compatible loader) so they work everywhere:
-
- ```yaml
- execution:
-   evaluators:
-     - name: csv_guardrail
-       type: code_judge
-       script: npx --yes tsx ./evaluators/check-csv.ts
-       cwd: ./evaluators
- ```
-
- ### Command Line Testing
-
- Test your evaluator locally:
+ ## Testing Locally
 
  ```bash
- # Create test input
- echo '{
-   "candidate_answer": "test output here",
-   "question": "test task",
-   "expected_outcome": "expected result"
- }' | uv run my_validator.py
-
- # Should output:
- # {
- #   "score": 0.8,
- #   "hits": ["check 1 passed"],
- #   "misses": ["check 2 failed"],
- #   "reasoning": "..."
- # }
- ```
+ # Python
+ echo '{"candidate_answer": "test", "question": "task", "expected_outcome": "result"}' | uv run my_validator.py
 
- ```bash
- # TypeScript (uses tsx loader under Node)
- echo '{
-   "candidate_answer": "test output here",
-   "question": "test task",
-   "expected_outcome": "expected result"
- }' | npx --yes tsx ./evaluators/check-csv.ts
+ # TypeScript
+ echo '{"candidate_answer": "test", "question": "task", "expected_outcome": "result"}' | bun run ./check.ts
  ```
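
Either pipe should print the standard result shape. The Python keyword template above, given that input, emits:

```json
{
  "score": 0.0,
  "hits": [],
  "misses": [
    "Keyword 'async'",
    "Keyword 'await'"
  ],
  "reasoning": "Found 0/2 keywords"
}
```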
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "agentv",
-   "version": "2.0.2",
+   "version": "2.1.1",
    "description": "CLI entry point for AgentV",
    "type": "module",
    "repository": {
@@ -31,7 +31,7 @@
      "test:watch": "bun test --watch"
    },
    "dependencies": {
-     "@agentv/core": "2.0.1",
+     "@agentv/core": "2.0.2",
      "@mariozechner/pi-agent": "^0.9.0",
      "@mariozechner/pi-ai": "^0.37.2",
      "cmd-ts": "^0.14.3",