npm - agentv - Versions diffs - 1.0.0 → 1.3.1 - Mend

agentv 1.0.0 → 1.3.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (10)

package/dist/cli.js CHANGED

@@ -1,7 +1,7 @@
 #!/usr/bin/env node
 import {
   runCli
-} from "./chunk-RIJO5WBF.js";
+} from "./chunk-6R2YRXCQ.js";
 import "./chunk-UE4GLFVL.js";
 // src/cli.ts

package/dist/index.js CHANGED

@@ -1,7 +1,7 @@
 import {
   app,
   runCli
-} from "./chunk-RIJO5WBF.js";
+} from "./chunk-6R2YRXCQ.js";
 import "./chunk-UE4GLFVL.js";
 export {
   app,

package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md CHANGED

@@ -15,6 +15,7 @@ description: Create and maintain AgentV YAML evaluation files for testing AI age
 - Composite Evaluators: `references/composite-evaluator.md` - Combine multiple evaluators
 - Tool Trajectory: `references/tool-trajectory-evaluator.md` - Validate agent tool usage
 - Custom Evaluators: `references/custom-evaluators.md` - Code and LLM judge templates
+- Batch CLI: `references/batch-cli-evaluator.md` - Evaluate batch runner output (JSONL)
 ## Structure Requirements
 - Root level: `description` (optional), `target` (optional), `execution` (optional), `evalcases` (required)
@@ -79,22 +80,6 @@ execution:
 See `references/tool-trajectory-evaluator.md` for modes and configuration.
-### Expected Tool Calls Evaluators
-Validate tool calls and inputs inline with conversation flow:
-```yaml
-expected_messages:
-  - role: assistant
-    tool_calls:
-      - tool: getMetrics
-        input: { server: "prod-1" }
-execution:
-  evaluators:
-    - name: input_check
-      type: expected_tool_calls
-```
 ### Multiple Evaluators
 Define multiple evaluators to run sequentially. The final score is a weighted average of all results.
@@ -153,6 +138,42 @@ execution:
 See `references/composite-evaluator.md` for aggregation types and patterns.
+### Batch CLI Evaluation
+Evaluate external batch runners that process all evalcases in one invocation:
+```yaml
+$schema: agentv-eval-v2
+description: Batch CLI evaluation
+target: batch_cli
+evalcases:
+  - id: case-001
+    expected_outcome: Returns decision=CLEAR
+    expected_messages:
+      - role: assistant
+        content:
+          decision: CLEAR
+    input_messages:
+      - role: user
+        content:
+          row:
+            id: case-001
+            amount: 5000
+    execution:
+      evaluators:
+        - name: decision-check
+          type: code_judge
+          script: bun run ./scripts/check-output.ts
+          cwd: .
+```
+**Key pattern:**
+- Batch runner reads eval YAML via `--eval` flag, outputs JSONL keyed by `id`
+- Each evalcase has its own evaluator to validate its corresponding output
+- Use structured `expected_messages.content` for expected output fields
+See `references/batch-cli-evaluator.md` for full implementation guide.
 ## Example
 ```yaml
 $schema: agentv-eval-v2
@@ -163,7 +184,7 @@ execution:
 evalcases:
   - id: code-review-basic
     expected_outcome: Assistant provides helpful code analysis
     input_messages:
       - role: system
         content: You are an expert code reviewer.
@@ -172,14 +193,14 @@ evalcases:
           - type: text
             value: |-
               Review this function:
               ```python
               def add(a, b):
                   return a + b
               ```
           - type: file
             value: /prompts/python.instructions.md
     expected_messages:
       - role: assistant
         content: |-

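Note on the "Key pattern" added to SKILL.md: a batch runner has to read `--eval` and `--output` from its arguments and emit one JSON object per line. The sketch below shows one way this could look. The helper names `getRequiredFlag` and `toJsonl` and the `RunnerRecord` type are hypothetical and not part of AgentV. Because `Array.prototype.indexOf` returns -1 for a missing flag, a bare `args[idx + 1]` lookup would silently return the first argument, so this version fails loudly instead.

```typescript
// Hypothetical helpers for a batch runner; these names are illustrative, not AgentV APIs.
type RunnerRecord = { id: string; text: string };

// Return the value that follows `name`, throwing when the flag or its value is absent.
function getRequiredFlag(args: string[], name: string): string {
  const idx = args.indexOf(name);
  const value = idx === -1 ? undefined : args[idx + 1];
  if (value === undefined || value.startsWith('--')) {
    throw new Error(`Missing value for ${name}`);
  }
  return value;
}

// Serialize one JSON object per line, ending with a trailing newline.
function toJsonl(records: RunnerRecord[]): string {
  return records.map((r) => JSON.stringify(r)).join('\n') + '\n';
}

const args = ['--eval', './my-eval.yaml', '--output', './results.jsonl'];
console.log(getRequiredFlag(args, '--eval'));
console.log(toJsonl([{ id: 'case-001', text: JSON.stringify({ decision: 'CLEAR' }) }]));
```

In a real runner the argument list would presumably come from `process.argv.slice(2)`, and the `id` in each record would be copied from the evalcase so the output can be matched back to it.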
package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md ADDED

@@ -0,0 +1,288 @@
+# Batch CLI Evaluation Guide
+Guide for evaluating batch CLI output where a single runner processes all evalcases at once and outputs JSONL.
+## Overview
+Batch CLI evaluation is used when:
+- An external tool processes multiple inputs in a single invocation (e.g., AML screening, bulk classification)
+- The runner reads the eval YAML directly to extract all evalcases
+- Output is JSONL with records keyed by evalcase `id`
+- Each evalcase has its own evaluator to validate its corresponding output record
+## Execution Flow
+1. **AgentV** invokes the batch runner once, passing `--eval <yaml-path>` and `--output <jsonl-path>`
+2. **Batch runner** reads the eval YAML, extracts all evalcases, processes them, writes JSONL output keyed by `id`
+3. **AgentV** parses JSONL, routes each record to its matching evalcase by `id`
+4. **Per-case evaluator** validates the output for each evalcase independently
+## Eval File Structure
+```yaml
+$schema: agentv-eval-v2
+description: Batch CLI demo using structured input_messages
+target: batch_cli
+evalcases:
+  - id: case-001
+    expected_outcome: |-
+      Batch runner returns JSON with decision=CLEAR.
+    expected_messages:
+      - role: assistant
+        content:
+          decision: CLEAR  # Structured expected output
+    input_messages:
+      - role: system
+        content: You are a batch processor.
+      - role: user
+        content:  # Structured input (runner extracts this)
+          request:
+            type: screening_check
+            jurisdiction: AU
+          row:
+            id: case-001
+            name: Example A
+            amount: 5000
+    execution:
+      evaluators:
+        - name: decision-check
+          type: code_judge
+          script: bun run ./scripts/check-output.ts
+          cwd: .
+  - id: case-002
+    expected_outcome: |-
+      Batch runner returns JSON with decision=REVIEW.
+    expected_messages:
+      - role: assistant
+        content:
+          decision: REVIEW
+    input_messages:
+      - role: system
+        content: You are a batch processor.
+      - role: user
+        content:
+          request:
+            type: screening_check
+            jurisdiction: AU
+          row:
+            id: case-002
+            name: Example B
+            amount: 25000
+    execution:
+      evaluators:
+        - name: decision-check
+          type: code_judge
+          script: bun run ./scripts/check-output.ts
+          cwd: .
+```
+## Batch Runner Implementation
+The batch runner reads the eval YAML directly and processes all evalcases in one invocation.
+### Runner Contract
+**Input:** The runner receives the eval file path via `--eval` flag:
+```bash
+bun run batch-runner.ts --eval ./my-eval.yaml --output ./results.jsonl
+```
+**Output:** JSONL file where each line is a JSON object with:
+```json
+{"id": "case-001", "text": "{\"decision\": \"CLEAR\", ...}"}
+{"id": "case-002", "text": "{\"decision\": \"REVIEW\", ...}"}
+```
+The `id` field must match the evalcase `id` for AgentV to route output to the correct evaluator.
+### Example Runner (TypeScript)
+```typescript
+import fs from 'node:fs/promises';
+import { parse } from 'yaml';
+type EvalCase = {
+  id: string;
+  input_messages: Array<{ role: string; content: unknown }>;
+};
+async function main() {
+  const args = process.argv.slice(2);
+  const evalPath = getFlag(args, '--eval');
+  const outPath = getFlag(args, '--output');
+  // Read and parse eval YAML
+  const yamlText = await fs.readFile(evalPath, 'utf8');
+  const parsed = parse(yamlText);
+  const evalcases = parsed.evalcases as EvalCase[];
+  // Process each evalcase
+  const results: Array<{ id: string; text: string }> = [];
+  for (const evalcase of evalcases) {
+    const userContent = findUserContent(evalcase.input_messages);
+    const decision = processInput(userContent); // Your logic here
+    results.push({
+      id: evalcase.id,
+      text: JSON.stringify({ decision, ...otherFields }),
+    });
+  }
+  // Write JSONL output
+  const jsonl = results.map((r) => JSON.stringify(r)).join('\n') + '\n';
+  await fs.writeFile(outPath, jsonl, 'utf8');
+}
+function getFlag(args: string[], name: string): string {
+  const idx = args.indexOf(name);
+  return args[idx + 1];
+}
+function findUserContent(messages: Array<{ role: string; content: unknown }>) {
+  return messages.find((m) => m.role === 'user')?.content;
+}
+```
+## Evaluator Implementation
+Each evalcase has its own evaluator that validates the output. The evaluator receives the standard code_judge input.
+### Evaluator Contract
+**Input (stdin):** Standard AgentV code_judge format:
+```json
+{
+  "candidate_answer": "{\"id\":\"case-001\",\"decision\":\"CLEAR\",...}",
+  "expected_messages": [{"role": "assistant", "content": {"decision": "CLEAR"}}],
+  "input_messages": [...],
+  ...
+}
+```
+**Output (stdout):** Standard evaluator result:
+```json
+{
+  "score": 1.0,
+  "hits": ["decision matches: CLEAR"],
+  "misses": [],
+  "reasoning": "Batch runner decision matches expected."
+}
+```
+### Example Evaluator (TypeScript)
+```typescript
+import fs from 'node:fs';
+type EvalInput = {
+  candidate_answer?: string;
+  expected_messages?: Array<{ role: string; content: unknown }>;
+};
+function main() {
+  const stdin = fs.readFileSync(0, 'utf8');
+  const input = JSON.parse(stdin) as EvalInput;
+  // Extract expected value from expected_messages
+  const expectedDecision = findExpectedDecision(input.expected_messages);
+  // Parse candidate answer (output from batch runner)
+  let candidateDecision: string | undefined;
+  try {
+    const parsed = JSON.parse(input.candidate_answer ?? '');
+    candidateDecision = parsed.decision;
+  } catch {
+    candidateDecision = undefined;
+  }
+  // Compare
+  const hits: string[] = [];
+  const misses: string[] = [];
+  if (expectedDecision === candidateDecision) {
+    hits.push(`decision matches: ${expectedDecision}`);
+  } else {
+    misses.push(`mismatch: expected=${expectedDecision} actual=${candidateDecision}`);
+  }
+  const score = misses.length === 0 ? 1 : 0;
+  process.stdout.write(JSON.stringify({
+    score,
+    hits,
+    misses,
+    reasoning: score === 1
+      ? 'Batch runner output matches expected.'
+      : 'Batch runner output did not match expected.',
+  }));
+}
+function findExpectedDecision(messages?: Array<{ role: string; content: unknown }>) {
+  if (!messages) return undefined;
+  for (const msg of messages) {
+    if (typeof msg.content === 'object' && msg.content !== null) {
+      return (msg.content as Record<string, unknown>).decision as string;
+    }
+  }
+  return undefined;
+}
+main();
+```
+## Structured Content in expected_messages
+For batch evaluation, use structured objects in `expected_messages.content` to define expected output fields:
+```yaml
+expected_messages:
+  - role: assistant
+    content:
+      decision: CLEAR
+      confidence: high
+      reasons: []
+```
+The evaluator then extracts these fields and compares against the parsed candidate output.
+## Best Practices
+1. **Use unique evalcase IDs** - The batch runner and AgentV use `id` to route outputs
+2. **Structured input_messages** - Put structured data in `user.content` for the runner to extract
+3. **Structured expected_messages** - Define expected output as objects for easy validation
+4. **Deterministic runners** - Batch runners should produce consistent output for testing
+5. **Healthcheck support** - Add `--healthcheck` flag for runner validation:
+   ```typescript
+   if (args.includes('--healthcheck')) {
+     console.log('batch-runner: healthy');
+     return;
+   }
+   ```
+## Target Configuration
+Configure the batch CLI provider in your target:
+```yaml
+# In agentv-targets.yaml or eval file
+targets:
+  batch_cli:
+    provider: cli
+    commandTemplate: bun run ./scripts/batch-runner.ts --eval {EVAL_FILE} --output {OUTPUT_FILE}
+    provider_batching: true
+```
+Key settings:
+- `provider: cli` - Use CLI provider
+- `provider_batching: true` - Run once for all evalcases
+- `{EVAL_FILE}` - Placeholder for eval file path
+- `{OUTPUT_FILE}` - Placeholder for JSONL output path

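Note on "Structured Content in expected_messages": the example evaluator in the new guide checks only `decision`. When the expected content carries several fields (`decision`, `confidence`, `reasons`), a per-field comparison might look like the sketch below. The `compareFields` function and the fractional score are assumptions made for illustration; the guide itself scores 0 or 1, and nothing in the diff confirms how AgentV treats partial scores.

```typescript
// Hypothetical per-field comparison; not taken from the AgentV package.
type FieldResult = { score: number; hits: string[]; misses: string[] };

// Compare every field in `expected` against the parsed candidate JSON string.
function compareFields(expected: Record<string, unknown>, candidateAnswer: string): FieldResult {
  let candidate: Record<string, unknown> = {};
  try {
    const parsed: unknown = JSON.parse(candidateAnswer);
    if (typeof parsed === 'object' && parsed !== null && !Array.isArray(parsed)) {
      candidate = parsed as Record<string, unknown>;
    }
  } catch {
    // Unparseable output: candidate stays empty, so every expected field is a miss.
  }

  const hits: string[] = [];
  const misses: string[] = [];
  for (const [key, want] of Object.entries(expected)) {
    const got = candidate[key];
    // JSON.stringify gives a simple structural comparison for scalars and arrays;
    // it is sensitive to key order inside nested objects.
    if (JSON.stringify(got) === JSON.stringify(want)) {
      hits.push(`${key} matches: ${JSON.stringify(want)}`);
    } else {
      misses.push(`${key}: expected=${JSON.stringify(want)} actual=${JSON.stringify(got)}`);
    }
  }

  // Assumed scoring: fraction of expected fields that matched.
  const total = hits.length + misses.length;
  return { score: total === 0 ? 0 : hits.length / total, hits, misses };
}

const result = compareFields(
  { decision: 'CLEAR', confidence: 'high', reasons: [] },
  '{"decision":"CLEAR","confidence":"low","reasons":[]}',
);
console.log(JSON.stringify(result));
```

Here two of the three expected fields match, so the sketch reports two hits and one miss. A stricter evaluator could keep the guide's all-or-nothing rule by returning 1 only when `misses` is empty.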
package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md CHANGED

@@ -12,11 +12,11 @@ target: default
 evalcases:
   - id: simple-addition
     expected_outcome: Correctly calculates 2+2
     input_messages:
       - role: user
         content: What is 2 + 2?
     expected_messages:
       - role: assistant
         content: "4"
@@ -32,7 +32,7 @@ target: azure_base
 evalcases:
   - id: code-review-basic
     expected_outcome: Assistant provides helpful code analysis with security considerations
     input_messages:
       - role: system
         content: You are an expert code reviewer.
@@ -41,7 +41,7 @@ evalcases:
           - type: text
             value: |-
               Review this function for security issues:
               ```python
               def get_user(user_id):
                   query = f"SELECT * FROM users WHERE id = {user_id}"
@@ -49,13 +49,13 @@ evalcases:
               ```
           - type: file
             value: /prompts/security-guidelines.md
     expected_messages:
       - role: assistant
         content: |-
-          This code has a critical SQL injection vulnerability. The user_id is directly
+          This code has a critical SQL injection vulnerability. The user_id is directly
           interpolated into the query string without sanitization.
           Recommended fix:
           ```python
           def get_user(user_id):
@@ -74,7 +74,7 @@ target: default
 evalcases:
   - id: json-generation-with-validation
     expected_outcome: Generates valid JSON with required fields
     execution:
       evaluators:
         - name: json_format_validator
@@ -84,13 +84,13 @@ evalcases:
         - name: content_evaluator
           type: llm_judge
           prompt: ./judges/semantic_correctness.md
     input_messages:
       - role: user
         content: |-
-          Generate a JSON object for a user with name "Alice",
+          Generate a JSON object for a user with name "Alice",
           email "alice@example.com", and role "admin".
     expected_messages:
       - role: assistant
         content: |-
@@ -142,33 +142,6 @@ evalcases:
             - tool: generateToken
 ```
-## Expected Messages with Tool Calls
-Validate precise tool inputs inline with expected messages.
-```yaml
-$schema: agentv-eval-v2
-description: Tool input validation
-target: mock_agent
-evalcases:
-  - id: precise-inputs
-    expected_outcome: Agent calls tools with correct parameters
-    input_messages:
-      - role: user
-        content: Check CPU metrics for prod-1
-    expected_messages:
-      - role: assistant
-        content: Checking metrics...
-        tool_calls:
-          - tool: getCpuMetrics
-            input: { server: "prod-1" }
-    execution:
-      evaluators:
-        - name: input-validator
-          type: expected_tool_calls
-```
 ## Static Trace Evaluation
 Evaluate pre-existing trace files without running an agent.
@@ -207,7 +180,7 @@ evalcases:
       Assistant conducts a multi-turn debugging session, asking clarification
       questions when needed, correctly diagnosing the bug, and proposing a clear
       fix with rationale.
     input_messages:
       - role: system
         content: You are an expert debugging assistant who reasons step by step, asks clarifying questions, and explains fixes clearly.
@@ -232,7 +205,7 @@ evalcases:
       - role: user
         content: |-
           For `[1, 2, 3, 4]` I expect `[1, 2, 3, 4]`, but I get `[1, 2, 3]`.
     expected_messages:
       - role: assistant
         content: |-
@@ -241,7 +214,7 @@ evalcases:
           To include all items, you can either:
           - Use `range(len(items))`, or
           - Iterate directly over the list: `for item in items:`
           Here's a corrected version:
           ```python
@@ -253,6 +226,92 @@ evalcases:
           ```
 ```
+## Batch CLI Evaluation
+Evaluate external batch runners that process all evalcases in one invocation.
+```yaml
+$schema: agentv-eval-v2
+description: Batch CLI demo (AML screening)
+target: batch_cli
+evalcases:
+  - id: aml-001
+    expected_outcome: |-
+      Batch runner returns JSON with decision=CLEAR.
+    expected_messages:
+      - role: assistant
+        content:
+          decision: CLEAR
+    input_messages:
+      - role: system
+        content: You are a deterministic AML screening batch checker.
+      - role: user
+        content:
+          request:
+            type: aml_screening_check
+            jurisdiction: AU
+            effective_date: 2025-01-01
+          row:
+            id: aml-001
+            customer_name: Example Customer A
+            origin_country: NZ
+            destination_country: AU
+            transaction_type: INTERNATIONAL_TRANSFER
+            amount: 5000
+            currency: USD
+    execution:
+      evaluators:
+        - name: decision-check
+          type: code_judge
+          script: bun run ./scripts/check-batch-cli-output.ts
+          cwd: .
+  - id: aml-002
+    expected_outcome: |-
+      Batch runner returns JSON with decision=REVIEW.
+    expected_messages:
+      - role: assistant
+        content:
+          decision: REVIEW
+    input_messages:
+      - role: system
+        content: You are a deterministic AML screening batch checker.
+      - role: user
+        content:
+          request:
+            type: aml_screening_check
+            jurisdiction: AU
+            effective_date: 2025-01-01
+          row:
+            id: aml-002
+            customer_name: Example Customer B
+            origin_country: IR
+            destination_country: AU
+            transaction_type: INTERNATIONAL_TRANSFER
+            amount: 2000
+            currency: USD
+    execution:
+      evaluators:
+        - name: decision-check
+          type: code_judge
+          script: bun run ./scripts/check-batch-cli-output.ts
+          cwd: .
+```
+### Batch CLI Pattern Notes
+- **target: batch_cli** - Configure CLI provider with `provider_batching: true`
+- **Batch runner** - Reads eval YAML via `--eval` flag, outputs JSONL keyed by `id`
+- **Structured input** - Put data in `user.content` as objects for runner to extract
+- **Structured expected** - Use `expected_messages.content` with object fields
+- **Per-case evaluators** - Each evalcase has its own evaluator to validate output
 ## Notes on Examples
 ### File Path Conventions