npm - agentv - Versions diffs - 2.0.1 → 2.1.0 - Mend

agentv 2.0.1 → 2.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (13)

package/README.md +199 -318
package/dist/{chunk-6SHT2QS6.js → chunk-5BLNVACB.js} +1286 -756
package/dist/chunk-5BLNVACB.js.map +1 -0
package/dist/cli.js +4 -2
package/dist/cli.js.map +1 -1
package/dist/index.js +1 -1
package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +24 -2
package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md +0 -12
package/dist/templates/.claude/skills/agentv-eval-builder/references/composite-evaluator.md +215 -215
package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md +100 -217
package/package.json +4 -2
package/dist/chunk-6SHT2QS6.js.map +0 -1
package/dist/templates/.agentv/{.env.template → .env.example} +0 -0

package/dist/cli.js CHANGED

@@ -1,11 +1,13 @@
 #!/usr/bin/env node
 import {
   runCli
-} from "./chunk-6SHT2QS6.js";
+} from "./chunk-5BLNVACB.js";
 import "./chunk-UE4GLFVL.js";
 // src/cli.ts
-runCli().catch((error) => {
+runCli().then(() => {
+  process.exit(0);
+}).catch((error) => {
   console.error(error);
   process.exit(1);
 });
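
The hunk above adds an explicit `process.exit(0)` on success. The diff does not say why; one plausible reason (an assumption) is that open handles such as timers or sockets could otherwise keep the Node.js event loop alive after `runCli()` resolves. A minimal sketch of the pattern, using a hypothetical `runAndExit` helper with an injectable `exit` callback so it can be exercised without terminating the process:

```typescript
// Hypothetical helper illustrating the pattern in the hunk above; not part of agentv.
type Exit = (code: number) => void;

function runAndExit(
  main: () => Promise<void>,
  exit: Exit,
  logError: (error: unknown) => void = console.error,
): Promise<void> {
  return main()
    .then(() => {
      // Success: exit explicitly instead of waiting for the event loop to drain.
      exit(0);
    })
    .catch((error: unknown) => {
      // Failure: report the error and signal it with a non-zero exit code.
      logError(error);
      exit(1);
    });
}
```

In a real entry point this would be wired up as `runAndExit(runCli, (code) => process.exit(code))`.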

package/dist/cli.js.map CHANGED

@@ -1 +1 @@
-{"version":3,"sources":["../src/cli.ts"],"sourcesContent":["#!/usr/bin/env node\nimport { runCli } from './index.js';\n\nrunCli().catch((error) => {\n console.error(error);\n process.exit(1);\n});\n"],"mappings":";;;;;;;AAGA,OAAO,EAAE,MAAM,CAAC,UAAU;AACxB,UAAQ,MAAM,KAAK;AACnB,UAAQ,KAAK,CAAC;AAChB,CAAC;","names":[]}
+{"version":3,"sources":["../src/cli.ts"],"sourcesContent":["#!/usr/bin/env node\nimport { runCli } from './index.js';\n\nrunCli()\n .then(() => {\n process.exit(0);\n })\n .catch((error) => {\n console.error(error);\n process.exit(1);\n });\n"],"mappings":";;;;;;;AAGA,OAAO,EACJ,KAAK,MAAM;AACV,UAAQ,KAAK,CAAC;AAChB,CAAC,EACA,MAAM,CAAC,UAAU;AAChB,UAAQ,MAAM,KAAK;AACnB,UAAQ,KAAK,CAAC;AAChB,CAAC;","names":[]}

package/dist/index.js CHANGED

@@ -1,7 +1,7 @@
 import {
   app,
   runCli
-} from "./chunk-6SHT2QS6.js";
+} from "./chunk-5BLNVACB.js";
 import "./chunk-UE4GLFVL.js";
 export {
   app,

package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md CHANGED

@@ -47,10 +47,32 @@ execution:
 ```
 **Contract:**
-- Input (stdin): JSON with `question`, `expected_outcome`, `reference_answer`, `candidate_answer`, `guideline_files` (file paths), `input_files` (file paths, excludes guidelines), `input_messages`
+- Input (stdin): JSON with `question`, `expected_outcome`, `reference_answer`, `candidate_answer`, `guideline_files`, `input_files`, `input_messages`, `expected_messages`, `output_messages`, `trace_summary`
 - Output (stdout): JSON with `score` (0.0-1.0), `hits`, `misses`, `reasoning`
-**TypeScript evaluators:** Keep `.ts` source files and run them via Node-compatible loaders such as `npx --yes tsx` so global `agentv` installs stay portable. See `references/custom-evaluators.md` for complete templates and command examples.
+**Target Proxy:** Code evaluators can access an LLM through the target proxy for sophisticated evaluation logic (e.g., Contextual Precision, semantic similarity). Enable with `target: {}`:
+```yaml
+execution:
+  evaluators:
+    - name: contextual_precision
+      type: code_judge
+      script: bun run evaluate.ts
+      target: {}           # Enable target proxy (max_calls: 50 default)
+```
+**RAG Evaluation Pattern:** For retrieval-based evals, pass retrieval context via `expected_messages.tool_calls`:
+```yaml
+expected_messages:
+  - role: assistant
+    tool_calls:
+      - tool: vector_search
+        output:
+          results: ["doc1", "doc2", "doc3"]
+```
+**TypeScript evaluators:** Keep `.ts` source files and run them via Node-compatible loaders such as `npx --yes tsx` so global `agentv` installs stay portable. See `references/custom-evaluators.md` for complete templates, target proxy usage, and command examples.
 **Template:** See `references/custom-evaluators.md` for Python and TypeScript templates
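
The contract in the hunk above (JSON on stdin, JSON with `score`, `hits`, `misses`, `reasoning` on stdout) can be illustrated with a small scoring function. This is a sketch under assumptions, not the template shipped in `references/custom-evaluators.md`: the field types, the `evaluate` name, and the word-overlap heuristic are all hypothetical.

```typescript
// Sketch of a code evaluator's scoring step. Field names follow the contract above;
// their types and the word-overlap heuristic are assumptions for illustration.
interface EvalInput {
  question: string;
  reference_answer: string;
  candidate_answer: string;
}

interface EvalOutput {
  score: number; // 0.0-1.0
  hits: string[];
  misses: string[];
  reasoning: string;
}

function evaluate(input: EvalInput): EvalOutput {
  // Lowercase and split on non-word characters, dropping empty fragments.
  const words = (text: string): string[] =>
    text.toLowerCase().split(/\W+/).filter((w) => w.length > 0);

  // Hypothetical heuristic: each distinct reference word should appear in the candidate.
  const expected = Array.from(new Set(words(input.reference_answer)));
  const candidate = new Set(words(input.candidate_answer));

  const found = expected.filter((w) => candidate.has(w));
  const missing = expected.filter((w) => !candidate.has(w));
  const score = expected.length === 0 ? 0 : found.length / expected.length;

  return {
    score,
    hits: found.map((w) => `contains "${w}"`),
    misses: missing.map((w) => `missing "${w}"`),
    reasoning: `${found.length} of ${expected.length} reference words found`,
  };
}
```

A real evaluator script would parse stdin as JSON, call a function like this, and print `JSON.stringify(result)` to stdout.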

package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md CHANGED

@@ -17,18 +17,6 @@ Batch CLI evaluation is used when:
 3. **AgentV** parses JSONL, routes each record to its matching evalcase by `id`
 4. **Per-case evaluator** validates the output for each evalcase independently
-## Error Handling
-### Missing IDs / Per-item Failures
-In batch mode, a runner can succeed overall while still failing to produce an output record for a specific evalcase `id`.
-- When an output record is missing, AgentV treats this as a per-item error.
-- The evaluation result will include an `error` field describing the issue.
-- The CLI progress display will show that evalcase as failed (❌) while still continuing to evaluate other cases.
-This behavior is intentional so one bad/missing record does not discard the entire batch.
 ## Eval File Structure
 ```yaml

package/dist/templates/.claude/skills/agentv-eval-builder/references/composite-evaluator.md CHANGED

@@ -1,215 +1,215 @@
-# Composite Evaluator Guide
-Composite evaluators combine multiple evaluators and aggregate their results. This enables sophisticated evaluation patterns like safety gates, weighted scoring, and conflict resolution.
-## Basic Structure
-```yaml
-execution:
-  evaluators:
-    - name: my_composite
-      type: composite
-      evaluators:
-        - name: evaluator_1
-          type: llm_judge
-          prompt: ./prompts/check1.md
-        - name: evaluator_2
-          type: code_judge
-          script: uv run check2.py
-      aggregator:
-        type: weighted_average
-        weights:
-          evaluator_1: 0.6
-          evaluator_2: 0.4
-```
-## Aggregator Types
-### 1. Weighted Average (Default)
-Combines scores using weighted arithmetic mean:
-```yaml
-aggregator:
-  type: weighted_average
-  weights:
-    safety: 0.3      # 30% weight
-    quality: 0.7     # 70% weight
-```
-If weights are omitted, all evaluators have equal weight (1.0).
-**Score calculation:**
-```
-final_score = Σ(score_i × weight_i) / Σ(weight_i)
-```
-### 2. Code Judge Aggregator
-Run custom code to decide final score based on all evaluator results:
-```yaml
-aggregator:
-  type: code_judge
-  path: node ./scripts/safety-gate.js
-  cwd: ./evaluators  # optional working directory
-```
-**Input (stdin):**
-```json
-{
-  "results": {
-    "safety": { "score": 0.9, "hits": [...], "misses": [...] },
-    "quality": { "score": 0.85, "hits": [...], "misses": [...] }
-  }
-}
-```
-**Output (stdout):**
-```json
-{
-  "score": 0.87,
-  "verdict": "pass",
-  "hits": ["Combined check passed"],
-  "misses": [],
-  "reasoning": "Safety gate passed, quality acceptable"
-}
-```
-### 3. LLM Judge Aggregator
-Use an LLM to resolve conflicts or make nuanced decisions:
-```yaml
-aggregator:
-  type: llm_judge
-  prompt: ./prompts/conflict-resolution.md
-```
-The `{{EVALUATOR_RESULTS_JSON}}` variable is replaced with the JSON results from all child evaluators.
-## Example Patterns
-### Safety Gate Pattern
-Block outputs that fail safety even if quality is high:
-```yaml
-evalcases:
-  - id: safety-gated-response
-    expected_outcome: Safe and accurate response
-    input_messages:
-      - role: user
-        content: Explain quantum computing
-    execution:
-      evaluators:
-        - name: safety_gate
-          type: composite
-          evaluators:
-            - name: safety
-              type: llm_judge
-              prompt: ./prompts/safety-check.md
-            - name: quality
-              type: llm_judge
-              prompt: ./prompts/quality-check.md
-          aggregator:
-            type: code_judge
-            path: ./scripts/safety-gate.js
-```
-### Multi-Criteria Weighted Evaluation
-```yaml
-- name: release_readiness
-  type: composite
-  evaluators:
-    - name: correctness
-      type: llm_judge
-      prompt: ./prompts/correctness.md
-    - name: style
-      type: code_judge
-      script: uv run style_checker.py
-    - name: security
-      type: llm_judge
-      prompt: ./prompts/security.md
-  aggregator:
-    type: weighted_average
-    weights:
-      correctness: 0.5
-      style: 0.2
-      security: 0.3
-```
-### Nested Composites
-Composites can contain other composites for complex hierarchies:
-```yaml
-- name: comprehensive_eval
-  type: composite
-  evaluators:
-    - name: content_quality
-      type: composite
-      evaluators:
-        - name: accuracy
-          type: llm_judge
-          prompt: ./prompts/accuracy.md
-        - name: clarity
-          type: llm_judge
-          prompt: ./prompts/clarity.md
-      aggregator:
-        type: weighted_average
-        weights:
-          accuracy: 0.6
-          clarity: 0.4
-    - name: safety
-      type: llm_judge
-      prompt: ./prompts/safety.md
-  aggregator:
-    type: weighted_average
-    weights:
-      content_quality: 0.7
-      safety: 0.3
-```
-## Result Structure
-Composite evaluators return nested `evaluator_results`:
-```json
-{
-  "score": 0.85,
-  "verdict": "pass",
-  "hits": ["[safety] No harmful content", "[quality] Clear explanation"],
-  "misses": ["[quality] Could use more examples"],
-  "reasoning": "safety: Passed all checks; quality: Good but could improve",
-  "evaluator_results": [
-    {
-      "name": "safety",
-      "type": "llm_judge",
-      "score": 0.95,
-      "verdict": "pass",
-      "hits": ["No harmful content"],
-      "misses": []
-    },
-    {
-      "name": "quality",
-      "type": "llm_judge",
-      "score": 0.8,
-      "verdict": "pass",
-      "hits": ["Clear explanation"],
-      "misses": ["Could use more examples"]
-    }
-  ]
-}
-```
-## Best Practices
-1. **Name evaluators clearly** - Names appear in results and debugging output
-2. **Use safety gates for critical checks** - Don't let high quality override safety failures
-3. **Balance weights thoughtfully** - Consider which aspects matter most for your use case
-4. **Keep nesting shallow** - Deep nesting makes debugging harder
-5. **Test aggregators independently** - Verify your custom aggregation logic with unit tests
+# Composite Evaluator Guide
+Composite evaluators combine multiple evaluators and aggregate their results. This enables sophisticated evaluation patterns like safety gates, weighted scoring, and conflict resolution.
+## Basic Structure
+```yaml
+execution:
+  evaluators:
+    - name: my_composite
+      type: composite
+      evaluators:
+        - name: evaluator_1
+          type: llm_judge
+          prompt: ./prompts/check1.md
+        - name: evaluator_2
+          type: code_judge
+          script: uv run check2.py
+      aggregator:
+        type: weighted_average
+        weights:
+          evaluator_1: 0.6
+          evaluator_2: 0.4
+```
+## Aggregator Types
+### 1. Weighted Average (Default)
+Combines scores using weighted arithmetic mean:
+```yaml
+aggregator:
+  type: weighted_average
+  weights:
+    safety: 0.3      # 30% weight
+    quality: 0.7     # 70% weight
+```
+If weights are omitted, all evaluators have equal weight (1.0).
+**Score calculation:**
+```
+final_score = Σ(score_i × weight_i) / Σ(weight_i)
+```
+### 2. Code Judge Aggregator
+Run custom code to decide final score based on all evaluator results:
+```yaml
+aggregator:
+  type: code_judge
+  path: node ./scripts/safety-gate.js
+  cwd: ./evaluators  # optional working directory
+```
+**Input (stdin):**
+```json
+{
+  "results": {
+    "safety": { "score": 0.9, "hits": [...], "misses": [...] },
+    "quality": { "score": 0.85, "hits": [...], "misses": [...] }
+  }
+}
+```
+**Output (stdout):**
+```json
+{
+  "score": 0.87,
+  "verdict": "pass",
+  "hits": ["Combined check passed"],
+  "misses": [],
+  "reasoning": "Safety gate passed, quality acceptable"
+}
+```
+### 3. LLM Judge Aggregator
+Use an LLM to resolve conflicts or make nuanced decisions:
+```yaml
+aggregator:
+  type: llm_judge
+  prompt: ./prompts/conflict-resolution.md
+```
+The `{{EVALUATOR_RESULTS_JSON}}` variable is replaced with the JSON results from all child evaluators.
+## Example Patterns
+### Safety Gate Pattern
+Block outputs that fail safety even if quality is high:
+```yaml
+evalcases:
+  - id: safety-gated-response
+    expected_outcome: Safe and accurate response
+    input_messages:
+      - role: user
+        content: Explain quantum computing
+    execution:
+      evaluators:
+        - name: safety_gate
+          type: composite
+          evaluators:
+            - name: safety
+              type: llm_judge
+              prompt: ./prompts/safety-check.md
+            - name: quality
+              type: llm_judge
+              prompt: ./prompts/quality-check.md
+          aggregator:
+            type: code_judge
+            path: ./scripts/safety-gate.js
+```
+### Multi-Criteria Weighted Evaluation
+```yaml
+- name: release_readiness
+  type: composite
+  evaluators:
+    - name: correctness
+      type: llm_judge
+      prompt: ./prompts/correctness.md
+    - name: style
+      type: code_judge
+      script: uv run style_checker.py
+    - name: security
+      type: llm_judge
+      prompt: ./prompts/security.md
+  aggregator:
+    type: weighted_average
+    weights:
+      correctness: 0.5
+      style: 0.2
+      security: 0.3
+```
+### Nested Composites
+Composites can contain other composites for complex hierarchies:
+```yaml
+- name: comprehensive_eval
+  type: composite
+  evaluators:
+    - name: content_quality
+      type: composite
+      evaluators:
+        - name: accuracy
+          type: llm_judge
+          prompt: ./prompts/accuracy.md
+        - name: clarity
+          type: llm_judge
+          prompt: ./prompts/clarity.md
+      aggregator:
+        type: weighted_average
+        weights:
+          accuracy: 0.6
+          clarity: 0.4
+    - name: safety
+      type: llm_judge
+      prompt: ./prompts/safety.md
+  aggregator:
+    type: weighted_average
+    weights:
+      content_quality: 0.7
+      safety: 0.3
+```
+## Result Structure
+Composite evaluators return nested `evaluator_results`:
+```json
+{
+  "score": 0.85,
+  "verdict": "pass",
+  "hits": ["[safety] No harmful content", "[quality] Clear explanation"],
+  "misses": ["[quality] Could use more examples"],
+  "reasoning": "safety: Passed all checks; quality: Good but could improve",
+  "evaluator_results": [
+    {
+      "name": "safety",
+      "type": "llm_judge",
+      "score": 0.95,
+      "verdict": "pass",
+      "hits": ["No harmful content"],
+      "misses": []
+    },
+    {
+      "name": "quality",
+      "type": "llm_judge",
+      "score": 0.8,
+      "verdict": "pass",
+      "hits": ["Clear explanation"],
+      "misses": ["Could use more examples"]
+    }
+  ]
+}
+```
+## Best Practices
+1. **Name evaluators clearly** - Names appear in results and debugging output
+2. **Use safety gates for critical checks** - Don't let high quality override safety failures
+3. **Balance weights thoughtfully** - Consider which aspects matter most for your use case
+4. **Keep nesting shallow** - Deep nesting makes debugging harder
+5. **Test aggregators independently** - Verify your custom aggregation logic with unit tests
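
The guide in this hunk gives the weighted-average formula and refers to a `safety-gate.js` aggregator without showing one. The sketch below illustrates both under stated assumptions: the result shape is taken from the JSON examples in the guide, while the function names, the 0.8 threshold, the `"fail"` verdict string, and the per-evaluator default weight of 1.0 are hypothetical.

```typescript
// Sketch only: the result shape follows the guide's JSON examples; the threshold,
// the "fail" verdict, and the function names are assumptions for illustration.
interface ChildResult {
  score: number;
  hits: string[];
  misses: string[];
}

type Results = Record<string, ChildResult>;

interface AggregatorOutput {
  score: number;
  verdict: string;
  hits: string[];
  misses: string[];
  reasoning: string;
}

// final_score = Σ(score_i × weight_i) / Σ(weight_i); a missing weight defaults to 1.0.
function weightedAverage(results: Results, weights: Record<string, number> = {}): number {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const name of Object.keys(results)) {
    const result = results[name];
    if (!result) continue;
    const weight = weights[name] ?? 1.0;
    weightedSum += result.score * weight;
    totalWeight += weight;
  }
  return totalWeight === 0 ? 0 : weightedSum / totalWeight;
}

// Safety gate: a low safety score fails the case regardless of the other scores.
function safetyGate(results: Results, threshold = 0.8): AggregatorOutput {
  const safety = results["safety"];
  if (!safety || safety.score < threshold) {
    return {
      score: 0,
      verdict: "fail",
      hits: [],
      misses: ["Safety gate failed"],
      reasoning: `Safety score below threshold ${threshold}`,
    };
  }
  return {
    score: weightedAverage(results),
    verdict: "pass",
    hits: ["Safety gate passed"],
    misses: [],
    reasoning: "Safety gate passed; score is the equal-weight average of child scores",
  };
}
```

Keeping the gate and the averaging as separate pure functions makes the aggregation logic easy to unit test before it is wrapped in a script that reads stdin and writes stdout.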