agentv 2.0.1 → 2.1.0

This diff shows the changes between publicly released versions of the package as they appear in the public registry.
@@ -1,6 +1,6 @@
  # Custom Evaluators Guide

- Guide for writing custom code evaluators and LLM judges for AgentV eval files.
+ Templates and best practices for code evaluators and LLM judges. For YAML configuration, see `SKILL.md`.

  ## Code Evaluator Contract
 
@@ -19,21 +19,25 @@ Wire format uses snake_case for cross-language compatibility:
  "guideline_files": ["path1", "path2"],
  "input_files": ["file1", "file2"],
  "input_messages": [{"role": "user", "content": "..."}],
- "output_messages": [
+ "expected_messages": [
    {
      "role": "assistant",
-     "content": "...",
      "tool_calls": [
        {
-         "tool": "search",
+         "tool": "vector_search",
          "input": { "query": "..." },
-         "output": { "results": [...] },
-         "id": "call_123",
-         "timestamp": "2024-01-15T10:30:00Z"
+         "output": { "results": ["doc1", "doc2"] }
        }
      ]
    }
  ],
+ "output_messages": [
+   {
+     "role": "assistant",
+     "content": "...",
+     "tool_calls": [...]
+   }
+ ],
  "trace_summary": {
    "event_count": 5,
    "tool_names": ["fetch", "search"],
@@ -47,7 +51,8 @@ Wire format uses snake_case for cross-language compatibility:
  ```

  **Key fields:**
- - `output_messages` - Full agent execution trace with tool calls (use `tool_calls[].input` for arguments)
+ - `expected_messages` - Expected agent behavior from YAML, including tool calls with outputs (use for retrieval context in RAG evals)
+ - `output_messages` - Actual agent execution trace with tool calls (from live agent runs)
  - `trace_summary` - Lightweight summary with execution metrics (counts only, no tool arguments)
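For example, a retrieval-style check can compare the tools named in the actual trace against the ones the YAML expects. A minimal sketch without any SDK, reading the snake_case wire format documented above from stdin; the file name and comparison logic are illustrative only and not shipped with agentv:

```typescript
#!/usr/bin/env bun
// compare-tools.ts (hypothetical) - sketch only.
// Reads the documented snake_case payload from stdin and checks that every
// tool named in expected_messages also appears in output_messages.
import { readFileSync } from 'node:fs';

const payload = JSON.parse(readFileSync(0, 'utf8'));

const toolsOf = (messages: any[] = []): string[] =>
  messages.flatMap((m) => (m.tool_calls ?? []).map((c: { tool: string }) => c.tool));

const expected = Array.from(new Set(toolsOf(payload.expected_messages)));
const actual = new Set(toolsOf(payload.output_messages));

const hits: string[] = [];
const misses: string[] = [];
for (const tool of expected) {
  if (actual.has(tool)) {
    hits.push(`Called expected tool: ${tool}`);
  } else {
    misses.push(`Did not call expected tool: ${tool}`);
  }
}

const score = expected.length === 0 ? 1.0 : hits.length / expected.length;
console.log(JSON.stringify(
  { score, hits, misses, reasoning: `${hits.length}/${expected.length} expected tools were called` },
  null,
  2
));
```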
 
  ### Output Format (to stdout)
@@ -71,199 +76,128 @@ Wire format uses snake_case for cross-language compatibility:

  ```python
  #!/usr/bin/env python3
- """
- Example code evaluator for AgentV
-
- This evaluator checks for specific keywords in the output.
- Replace validation logic as needed.
- """
-
  import json
  import sys
- from typing import Any
-
-
- def evaluate(input_data: dict[str, Any]) -> dict[str, Any]:
-     """
-     Evaluate the agent output.
-
-     Args:
-         input_data: Full input context from AgentV
-
-     Returns:
-         Evaluation result with score, hits, misses, reasoning
-     """
-     # Extract only the fields you need
-     # Most evaluators only need 'candidate_answer' - avoid using unnecessary fields
-     candidate_answer = input_data.get("candidate_answer", "")
-
+
+ def evaluate(data: dict) -> dict:
+     candidate = data.get("candidate_answer", "")
+     hits, misses = [], []
+
      # Your validation logic here
-     hits = []
-     misses = []
-
-     # Example: Check for keywords
-     required_keywords = ["async", "await"]
-     for keyword in required_keywords:
-         if keyword in candidate_answer:
-             hits.append(f"Contains required keyword: {keyword}")
-         else:
-             misses.append(f"Missing required keyword: {keyword}")
-
-     # Calculate score
-     if not required_keywords:
-         score = 1.0
-     else:
-         score = len(hits) / len(required_keywords)
-
-     # Build result
+     keywords = ["async", "await"]
+     for kw in keywords:
+         (hits if kw in candidate else misses).append(f"Keyword '{kw}'")
+
      return {
-         "score": score,
+         "score": len(hits) / len(keywords) if keywords else 1.0,
          "hits": hits,
          "misses": misses,
-         "reasoning": f"Found {len(hits)}/{len(required_keywords)} required keywords"
+         "reasoning": f"Found {len(hits)}/{len(keywords)} keywords"
      }

-
- def main():
-     """Main entry point for AgentV code evaluator."""
+ if __name__ == "__main__":
      try:
-         # Read input from stdin
-         input_data = json.loads(sys.stdin.read())
-
-         # Run evaluation
-         result = evaluate(input_data)
-
-         # Write result to stdout
+         result = evaluate(json.loads(sys.stdin.read()))
          print(json.dumps(result, indent=2))
-
      except Exception as e:
-         # Error handling: return zero score with error message
-         error_result = {
-             "score": 0.0,
-             "hits": [],
-             "misses": [f"Evaluator error: {str(e)}"],
-             "reasoning": f"Evaluator error: {str(e)}"
-         }
-         print(json.dumps(error_result, indent=2))
+         print(json.dumps({"score": 0, "hits": [], "misses": [str(e)], "reasoning": "Error"}))
          sys.exit(1)
-
-
- if __name__ == "__main__":
-     main()
  ```
 
- ## TypeScript Code Evaluator Template (with SDK)
+ ## TypeScript Code Evaluator Template

- The optional `@agentv/core` SDK provides type-safe payload parsing with camelCase properties (`candidateAnswer` vs `candidate_answer`).
-
- **Execution:** Keep evaluators as `.ts` files and run via Node loaders like `npx --yes tsx ./evaluators/my-check.ts` so users don't need Bun after `npm install -g agentv`.
-
- **Without SDK:** Skip the import and parse JSON from stdin directly (similar to the Python template above).
+ The `@agentv/eval` SDK provides a declarative API with automatic stdin/stdout handling.

  ```typescript
- /**
-  * Example TypeScript code evaluator using the AgentV SDK
-  *
-  * Run with: npx --yes tsx ./evaluators/example-check.ts
-  *
-  * The SDK provides:
-  * - Type-safe CodeJudgePayload interface with all fields
-  * - camelCase properties (candidateAnswer, expectedOutcome, etc.)
-  * - Automatic conversion from snake_case wire format
-  */
-
- import { readCodeJudgePayload } from '@agentv/core';
-
- try {
-   // Read and parse stdin with automatic snake_case → camelCase conversion
-   const payload = readCodeJudgePayload();
-
-   // Type-safe camelCase access to all fields
-   const { candidateAnswer, expectedOutcome, inputFiles, guidelineFiles } = payload;
+ #!/usr/bin/env bun
+ import { defineCodeJudge } from '@agentv/eval';

-   // Your validation logic here
+ export default defineCodeJudge(({ candidateAnswer, expectedOutcome }) => {
    const hits: string[] = [];
    const misses: string[] = [];

-   // Example: Check if answer contains expected outcome
+   // Your validation logic here
    if (candidateAnswer.includes(expectedOutcome)) {
      hits.push('Answer matches expected outcome');
    } else {
      misses.push('Answer does not match expected outcome');
    }

-   // Example: Check attachment mentions
-   const attachments = [...guidelineFiles, ...inputFiles];
-   for (const filePath of attachments) {
-     const fileName = filePath.split('/').pop() ?? filePath;
-     if (candidateAnswer.includes(fileName)) {
-       hits.push(`Mentions attachment: ${fileName}`);
-     } else {
-       misses.push(`Missing attachment: ${fileName}`);
-     }
-   }
-
-   // Calculate score
-   const totalChecks = hits.length + misses.length;
-   const score = totalChecks === 0 ? 0 : hits.length / totalChecks;
-
-   // Build result
-   const result = {
-     score,
+   const total = hits.length + misses.length;
+   return {
+     score: total === 0 ? 0 : hits.length / total,
      hits,
      misses,
-     reasoning: `Passed ${hits.length}/${totalChecks} checks`
+     reasoning: `Passed ${hits.length}/${total} checks`,
    };
+ });
+ ```

-   console.log(JSON.stringify(result, null, 2));
-
- } catch (error) {
-   const message = error instanceof Error ? error.message : String(error);
-   console.log(JSON.stringify({
-     score: 0,
-     hits: [],
-     misses: [`Error: ${message}`],
-     reasoning: 'Evaluator error'
-   }, null, 2));
-   process.exit(1);
- }
+ **SDK exports:** `defineCodeJudge`, `Message`, `ToolCall`, `TraceSummary`, `CodeJudgeInput`, `CodeJudgeResult`
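The exported types can also be used to annotate a judge explicitly. A minimal sketch, assuming `CodeJudgeInput` and `CodeJudgeResult` describe the callback's parameter and return value (the diff lists these exports but not their shapes):

```typescript
#!/usr/bin/env bun
// Sketch only: assumes the judge callback can be typed with the exported
// CodeJudgeInput / CodeJudgeResult interfaces.
import { defineCodeJudge, type CodeJudgeInput, type CodeJudgeResult } from '@agentv/eval';

export default defineCodeJudge((input: CodeJudgeInput): CodeJudgeResult => {
  const answered = input.candidateAnswer.trim().length > 0;
  return {
    score: answered ? 1.0 : 0.0,
    hits: answered ? ['Candidate answer is non-empty'] : [],
    misses: answered ? [] : ['Candidate answer is empty'],
    reasoning: answered ? 'Answer present' : 'No answer produced',
  };
});
```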
+
+ ## Target Access for Code Evaluators
+
+ Code judges can access an LLM through a **target proxy** for metrics requiring multiple LLM calls (contextual precision, semantic similarity, etc).
+
+ ### Configuration
+
+ ```yaml
+ evaluators:
+   - name: contextual-precision
+     type: code_judge
+     script: bun scripts/contextual-precision.ts
+     target:
+       max_calls: 10 # Default: 50
+ ```
+
+ ### Usage
+
+ ```typescript
+ #!/usr/bin/env bun
+ import { createTargetClient, defineCodeJudge } from '@agentv/eval';
+
+ export default defineCodeJudge(async ({ question, candidateAnswer }) => {
+   const target = createTargetClient();
+   if (!target) return { score: 0, misses: ['Target not configured'] };
+
+   const response = await target.invoke({
+     question: `Is this relevant to: ${question}? Response: ${candidateAnswer}`,
+     systemPrompt: 'Respond with JSON: { "relevant": true/false }'
+   });
+
+   const result = JSON.parse(response.rawText ?? '{}');
+   return { score: result.relevant ? 1.0 : 0.0 };
+ });
  ```

- **TypeScript SDK Benefits:**
- - **Type-safe**: `CodeJudgePayload` interface with all fields typed
- - **camelCase**: Idiomatic TypeScript naming (`candidateAnswer` vs `candidate_answer`)
- - **Automatic conversion**: Handles snake_case wire format → camelCase objects
- - **Compile-time safety**: Catch typos and missing fields before runtime
+ **Batch invocation:** Use `target.invokeBatch(requests)` for multiple calls.
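Where a metric needs several judgments, the batched form avoids awaiting `invoke` in a loop. A hedged sketch, assuming `invokeBatch` accepts an array of the same request objects `invoke` takes and resolves to responses in request order (not confirmed by this diff):

```typescript
#!/usr/bin/env bun
// Sketch only: assumes invokeBatch accepts an array of { question, systemPrompt }
// requests (the same shape invoke takes) and returns responses in request order.
import { createTargetClient, defineCodeJudge } from '@agentv/eval';

export default defineCodeJudge(async ({ question, candidateAnswer }) => {
  const target = createTargetClient();
  if (!target) return { score: 0, misses: ['Target not configured'] };

  // Ask several independent sub-questions in one batch.
  const checks = ['relevant', 'grounded', 'complete'];
  const responses = await target.invokeBatch(
    checks.map((check) => ({
      question: `Is the response ${check} for: ${question}?\n\nResponse: ${candidateAnswer}`,
      systemPrompt: 'Respond with JSON: { "pass": true/false }',
    }))
  );

  const passed = responses.filter((r) => JSON.parse(r.rawText ?? '{}').pass).length;
  return {
    score: passed / checks.length,
    reasoning: `${passed}/${checks.length} sub-checks passed`,
  };
});
```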
 
- **Available in SDK:**
- - `readCodeJudgePayload()`: Read stdin and convert to camelCase (recommended)
- - `parseCodeJudgePayload(jsonString)`: Parse JSON string and convert to camelCase
- - `CodeJudgePayload`: TypeScript interface for type safety
+ **Environment variables** (set automatically when `target` is configured):
+ - `AGENTV_TARGET_PROXY_URL` - Local proxy URL
+ - `AGENTV_TARGET_PROXY_TOKEN` - Bearer token for authentication

- **See also:** `examples/features/code-judge-sdk/` for complete working examples
+ **See also:** `examples/features/code-judge-with-llm-calls/`

  ## LLM Judge Prompt Template
 
- LLM judges use markdown prompts to guide evaluation. AgentV automatically handles the output format, so focus your prompt on evaluation criteria and guidelines.
+ LLM judges use markdown prompts. AgentV handles the output format automatically.

  **Available Template Variables:**
  - `{{question}}` - The original question/task
  - `{{expected_outcome}}` - What the answer should accomplish
  - `{{candidate_answer}}` - The actual output to evaluate
- - `{{reference_answer}}` - Gold standard answer (optional, may be empty)
- - `{{input_messages}}` - JSON stringified input message segments
- - `{{output_messages}}` - JSON stringified expected output segments
-
- **Default Evaluator Template:**
+ - `{{reference_answer}}` - Gold standard answer (optional)
+ - `{{input_messages}}` - JSON stringified input messages
+ - `{{output_messages}}` - JSON stringified output messages

- If you don't specify a custom evaluator template, AgentV uses this default:
+ **Default Template:**

  ```
- You are an expert evaluator. Your goal is to grade the candidate_answer based on how well it achieves the expected_outcome for the original task.
+ You are an expert evaluator. Grade the candidate_answer based on how well it achieves the expected_outcome.

- Use the reference_answer as a gold standard for a high-quality response (if provided). The candidate_answer does not need to match it verbatim, but should capture the key points and follow the same spirit.
+ Use reference_answer as a gold standard (if provided). The candidate_answer doesn't need to match verbatim, but should capture key points.

- Be concise and focused in your evaluation. Provide succinct, specific feedback rather than verbose explanations.
+ Be concise. Provide specific feedback rather than verbose explanations.

  [[ ## expected_outcome ## ]]
  {{expected_outcome}}
@@ -278,76 +212,25 @@ Be concise and focused in your evaluation. Provide succinct, specific feedback r
  {{candidate_answer}}
  ```

- You can customize this template in your eval file using the `evaluatorTemplate` field to add domain-specific criteria or scoring guidelines.
-
  ## Best Practices
 
- ### For Code-based Evaluators
-
- 1. **Focus on relevant fields** - Most evaluators only need the `candidate_answer` field
- 2. **Avoid false positives** - Don't check fields like `question` or `reference_answer` unless you specifically need context
- 3. **Be deterministic** - Same input should always produce same output
- 4. **Handle errors gracefully** - Return a valid result even when evaluation fails
- 5. **Provide helpful feedback** - Use `hits` and `misses` to explain the score
-
- ### For Prompt-based Evaluators (LLM Judges)
+ ### Code Evaluators
+ 1. **Focus on `candidate_answer`** - Most evaluators only need this field
+ 2. **Be deterministic** - Same input same output
+ 3. **Handle errors gracefully** - Return valid result even on failure
+ 4. **Use `hits`/`misses`** - Explain the score clearly

+ ### LLM Judges
  1. **Clear criteria** - Define what you're evaluating
- 2. **Specific guidelines** - Provide scoring rubrics
- 3. **JSON output** - Enforce structured output format
- 4. **Examples** - Show what good/bad looks like
- 5. **Concise prompts** - Keep instructions focused
-
- ## Running Code Evaluators
-
- ### In Eval Files
-
- ```yaml
- execution:
-   evaluators:
-     - name: my_validator
-       type: code_judge
-       script: uv run my_validator.py
-       cwd: ./evaluators
- ```
+ 2. **Specific rubrics** - Provide scoring guidelines
+ 3. **Concise prompts** - Keep instructions focused

- TypeScript evaluators use the same structure but invoke `tsx` (or another Node-compatible loader) so they work everywhere:
-
- ```yaml
- execution:
-   evaluators:
-     - name: csv_guardrail
-       type: code_judge
-       script: npx --yes tsx ./evaluators/check-csv.ts
-       cwd: ./evaluators
- ```
-
- ### Command Line Testing
-
- Test your evaluator locally:
+ ## Testing Locally

  ```bash
- # Create test input
- echo '{
-   "candidate_answer": "test output here",
-   "question": "test task",
-   "expected_outcome": "expected result"
- }' | uv run my_validator.py
-
- # Should output:
- # {
- #   "score": 0.8,
- #   "hits": ["check 1 passed"],
- #   "misses": ["check 2 failed"],
- #   "reasoning": "..."
- # }
- ```
+ # Python
+ echo '{"candidate_answer": "test", "question": "task", "expected_outcome": "result"}' | uv run my_validator.py

- ```bash
- # TypeScript (uses tsx loader under Node)
- echo '{
-   "candidate_answer": "test output here",
-   "question": "test task",
-   "expected_outcome": "expected result"
- }' | npx --yes tsx ./evaluators/check-csv.ts
+ # TypeScript
+ echo '{"candidate_answer": "test", "question": "task", "expected_outcome": "result"}' | bun run ./check.ts
  ```
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "agentv",
-   "version": "2.0.1",
+   "version": "2.1.0",
    "description": "CLI entry point for AgentV",
    "type": "module",
    "repository": {
@@ -31,7 +31,9 @@
      "test:watch": "bun test --watch"
    },
    "dependencies": {
-     "@agentv/core": "1.5.0",
+     "@agentv/core": "2.0.2",
+     "@mariozechner/pi-agent": "^0.9.0",
+     "@mariozechner/pi-ai": "^0.37.2",
      "cmd-ts": "^0.14.3",
      "dotenv": "^16.4.5",
      "fast-glob": "^3.3.3",