agentv 0.14.2 → 0.16.0

This diff shows the contents of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the changes between package versions as they appear in their respective public registries.
package/dist/cli.js CHANGED
@@ -1,7 +1,7 @@
 #!/usr/bin/env node
 import {
   runCli
-} from "./chunk-TJ4U6A6X.js";
+} from "./chunk-HWGALLUR.js";
 import "./chunk-7XYYGJAC.js";
 
 // src/cli.ts
package/dist/index.js CHANGED
@@ -1,7 +1,7 @@
 import {
   createProgram,
   runCli
-} from "./chunk-TJ4U6A6X.js";
+} from "./chunk-HWGALLUR.js";
 import "./chunk-7XYYGJAC.js";
 export {
   createProgram,
@@ -1,9 +1,9 @@
 ---
-name: eval-builder
+name: agentv-eval-builder
 description: Create and maintain AgentV YAML evaluation files for testing AI agent performance. Use this skill when creating new eval files, adding eval cases, or configuring custom evaluators (code validators or LLM judges) for agent testing workflows.
 ---
 
-# Eval Builder
+# AgentV Eval Builder
 
 ## Schema Reference
 - Schema: `references/eval-schema.json` (JSON Schema for validation and tooling)
@@ -11,7 +11,7 @@ description: Create and maintain AgentV YAML evaluation files for testing AI age
 - Examples: `references/example-evals.md`
 
 ## Structure Requirements
-- Root level: `$schema` (required: "agentv-eval-v2"), `description` (optional), `target` (optional), `evalcases` (required)
+- Root level: `$schema` (required: "agentv-eval-v2"), `description` (optional), `execution` (optional), `evalcases` (required)
 - Eval case fields: `id` (required), `outcome` (required), `input_messages` (required), `expected_messages` (required)
 - Optional fields: `conversation_id`, `note`, `execution`
 - Message fields: `role` (required), `content` (required)
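
Read together, the structure requirements in the hunk above pin down a complete minimal file. As a quick sketch (assembled here for illustration from only the required fields listed above; it is not itself part of the package contents):

```yaml
$schema: agentv-eval-v2

evalcases:
  - id: minimal-case
    outcome: Assistant answers the arithmetic question correctly
    input_messages:
      - role: user
        content: What is 2 + 2?
    expected_messages:
      - role: assistant
        content: "4"
```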
@@ -71,10 +71,10 @@ execution:
 ```yaml
 $schema: agentv-eval-v2
 description: Example showing basic features and conversation threading
-target: default
+execution:
+  target: default
 
 evalcases:
-  # Basic eval case with file references
   - id: code-review-basic
     outcome: Assistant provides helpful code analysis
 
@@ -101,52 +101,4 @@ evalcases:
           - Add type hints: `def add(a: int, b: int) -> int:`
           - Add docstring
           - Consider validation for edge cases
-
-  # Advanced: conversation threading, multiple evaluators
-  - id: python-coding-session
-    conversation_id: python-coding-session
-    outcome: Generates correct code with proper error handling
-
-    execution:
-      target: azure_base
-      evaluators:
-        - name: keyword_check
-          type: code
-          script: uv run check_keywords.py
-          cwd: /evaluators/scripts
-        - name: semantic_judge
-          type: llm_judge
-          prompt: /evaluators/prompts/correctness.md
-          model: gpt-5-chat
-
-    input_messages:
-      - role: system
-        content: You are a code generator.
-      - role: user
-        content:
-          - type: text
-            value: Create a function to find the second largest number in a list.
-          - type: file
-            value: /prompts/python.instructions.md
-
-    expected_messages:
-      - role: assistant
-        content: |-
-          ```python
-          from typing import List, Union
-
-          def find_second_largest(numbers: List[int]) -> Union[int, None]:
-              """Find the second largest number."""
-              if not isinstance(numbers, list):
-                  raise TypeError("Input must be a list")
-              if not numbers:
-                  raise ValueError("List cannot be empty")
-
-              unique = list(set(numbers))
-              if len(unique) < 2:
-                  return None
-
-              unique.sort(reverse=True)
-              return unique[1]
-          ```
 ```
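
The removed block above is also the fullest illustration of per-case `execution` wiring in this file. For quick reference, the evaluator shape it used looks like this (a sketch restating only fields from the removed example; the field names are exactly those that appeared there):

```yaml
execution:
  target: azure_base                 # optional per-case target override
  evaluators:
    - name: keyword_check            # code evaluator: runs an external script
      type: code
      script: uv run check_keywords.py
      cwd: /evaluators/scripts
    - name: semantic_judge           # LLM judge: scores output against a prompt
      type: llm_judge
      prompt: /evaluators/prompts/correctness.md
      model: gpt-5-chat
```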
@@ -102,200 +102,65 @@ evalcases:
           }
 ```
 
-## Conversation Threading
+## Multi-Turn Conversation (Single Eval Case)
 
 ```yaml
 $schema: agentv-eval-v2
-description: Multi-turn conversation for iterative code development
+description: Multi-turn debugging session with clarifying questions
 target: default
 
 evalcases:
-  - id: create-class
-    conversation_id: python-dev-session
-    outcome: Creates a basic User class
+  - id: debug-with-clarification
+    outcome: |-
+      Assistant conducts a multi-turn debugging session, asking clarification
+      questions when needed, correctly diagnosing the bug, and proposing a clear
+      fix with rationale.
 
     input_messages:
+      - role: system
+        content: You are an expert debugging assistant who reasons step by step, asks clarifying questions, and explains fixes clearly.
       - role: user
-        content: Create a Python User class with name and email attributes
-
-    expected_messages:
-      - role: assistant
         content: |-
+          I'm getting an off-by-one error in this function, but I can't see why:
+
           ```python
-          class User:
-              def __init__(self, name, email):
-                  self.name = name
-                  self.email = email
+          def get_items(items):
+              result = []
+              for i in range(len(items) - 1):
+                  result.append(items[i])
+              return result
           ```
-
-  - id: add-validation
-    conversation_id: python-dev-session
-    outcome: Adds email validation to the existing class
-
-    input_messages:
-      - role: user
-        content: Add email validation to the User class
-
-    expected_messages:
+
+          Sometimes the last element is missing. Can you help debug this?
       - role: assistant
         content: |-
-          ```python
-          import re
-
-          class User:
-              def __init__(self, name, email):
-                  self.name = name
-                  self.email = self._validate_email(email)
-
-              def _validate_email(self, email):
-                  pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
-                  if not re.match(pattern, email):
-                      raise ValueError(f"Invalid email: {email}")
-                  return email
-          ```
-
-  - id: add-repr
-    conversation_id: python-dev-session
-    outcome: Adds __repr__ method for better debugging
-
-    input_messages:
+          I can help debug this. Before I propose a fix, could you tell me:
+          - What output you expect for an example input list
+          - What output you actually get
       - role: user
-        content: Add a __repr__ method to the class
+        content: |-
+          For `[1, 2, 3, 4]` I expect `[1, 2, 3, 4]`, but I get `[1, 2, 3]`.
 
     expected_messages:
       - role: assistant
         content: |-
+          You have an off-by-one error in your loop bounds.
+          You're iterating with `range(len(items) - 1)`, which stops before the last index.
+          To include all items, you can either:
+          - Use `range(len(items))`, or
+          - Iterate directly over the list: `for item in items:`
+
+          Here's a corrected version:
+
           ```python
-          def __repr__(self):
-              return f"User(name='{self.name}', email='{self.email}')"
+          def get_items(items):
+              result = []
+              for item in items:
+                  result.append(item)
+              return result
           ```
 ```
 
-## Advanced: Per-Case Target Override
-
-```yaml
-$schema: agentv-eval-v2
-description: Cross-platform testing with different targets
-target: azure_base
-
-evalcases:
-  - id: azure-test
-    outcome: Tests with Azure OpenAI (uses file-level default target)
-
-    input_messages:
-      - role: user
-        content: Explain async/await in Python
-
-    expected_messages:
-      - role: assistant
-        content: "async/await is Python's syntax for asynchronous programming..."
-
-  - id: vscode-test
-    outcome: Tests with VS Code Copilot
-
-    execution:
-      target: vscode_projectx
-
-    input_messages:
-      - role: user
-        content:
-          - type: text
-            value: Refactor this code to use async/await
-          - type: file
-            value: /code-samples/sync_code.py
-
-    expected_messages:
-      - role: assistant
-        content: "async def fetch_data():\n  await ..."
-```
-
-## Complex Real-World Example: CargoWise Triage
-
-```yaml
-$schema: agentv-eval-v2
-description: CargoWise criticality rating (CR1-CR9) classification eval
-target: default
-
-evalcases:
-  - id: cr-global-outage
-    conversation_id: cargowise-triage
-
-    outcome: |
-      Assistant correctly classifies as 'CR1' for complete system inaccessibility.
-      Reasoning should emphasize 'any user on any workstation' and lack of access.
-
-    execution:
-      evaluators:
-        - name: json_format_validator
-          type: code
-          script: uv run validate_output.py
-        - name: content_evaluator
-          type: llm_judge
-
-    input_messages:
-      - role: user
-        content:
-          - type: file
-            value: ../../prompts/cw-criticality-rating.prompt.md
-          - type: text
-            value: |-
-              Classify this CargoWise ticket:
-
-              Ticket: Entire system down, no users can log in from any device
-              Impact: All operations halted
-              Scope: Global
-              Signals: Error: "Server unreachable", no recent changes reported
-
-    expected_messages:
-      - role: assistant
-        content: |-
-          {
-            "criticalityRating": "CR1",
-            "reasoning": "Step 1: Issue prevents access to the entire application suite.
-            Step 2: Affects any user on any workstation. Step 3: Matches CR1 definition
-            exactly—no partial access or workarounds mentioned."
-          }
-
-  - id: cr-module-inaccessible
-    conversation_id: cargowise-triage
-
-    outcome: |
-      Assistant correctly classifies as 'CR2' for module-wide inaccessibility.
-      Reasoning should distinguish from CR1 by noting it's limited to one module.
-
-    execution:
-      evaluators:
-        - name: json_format_validator
-          type: code
-          script: uv run validate_output.py
-        - name: content_evaluator
-          type: llm_judge
-
-    input_messages:
-      - role: user
-        content:
-          - type: file
-            value: ../../prompts/cw-criticality-rating.prompt.md
-          - type: text
-            value: |-
-              Classify this CargoWise ticket:
-
-              Ticket: Customs module inaccessible for all users
-              Impact: Customs declarations blocked
-              Scope: Module-wide
-              Signals: "Module not found" error, other modules working
-
-    expected_messages:
-      - role: assistant
-        content: |-
-          {
-            "criticalityRating": "CR2",
-            "reasoning": "Step 1: Issue affects an entire module (Customs).
-            Step 2: No access for any user/workstation. Step 3: Does not impact
-            the full suite, so CR2 over CR1."
-          }
-```
-
 ## Notes on Examples
 
 ### File Path Conventions
@@ -1,13 +1,12 @@
 ---
-description: Iteratively optimize a prompt file against an AgentV evaluation suite by analyzing failures and refining instructions.
+description: Iteratively optimize prompt files against AgentV evaluation datasets by analyzing failures and refining instructions.
 ---
 
 # AgentV Prompt Optimizer
 
-Iteratively optimize a prompt file against an AgentV evaluation suite.
-
-## Usage
-`prompt-optimizer <eval-path> [playbook-path]`
+## Input Variables
+- `eval-path`: Path or glob pattern to the AgentV evaluation file(s) to optimize against
+- `optimization-log-path` (optional): Path where optimization progress should be logged
 
 ## Workflow
 
@@ -17,9 +16,9 @@ Iteratively optimize a prompt file against an AgentV evaluation suite.
    - Infer prompt files from the eval file content (look for `file:` references in `input_messages` that match these patterns).
    - Recursively check referenced prompt files for *other* prompt references (dependencies).
    - If multiple prompts are found, consider ALL of them as candidates for optimization.
-   - **Identify Playbook**:
-     - If `<playbook-path>` is provided, use it.
-     - If not, create a new one in the parent directory of the eval files: `playbook-[timestamp].md`.
+   - **Identify Optimization Log**:
+     - If `<optimization-log-path>` is provided, use it.
+     - If not, create a new one in the parent directory of the eval files: `optimization-[timestamp].md`.
    - Read content of the identified prompt file.
 
 2. **Optimization Loop** (Max 5 iterations)
@@ -38,7 +37,7 @@ Iteratively optimize a prompt file against an AgentV evaluation suite.
    - If **Score decreased**: Revert last change, try different approach.
    - If **No improvement** (2x): STOP and report stagnation.
    - **Log Result**:
-     - Append the result of this iteration to the identified playbook file.
+     - Append the result of this iteration to the identified optimization log file.
      - **Format**:
        ```markdown
        ### Iteration [N]
@@ -49,7 +48,7 @@ Iteratively optimize a prompt file against an AgentV evaluation suite.
        ```
    - **Refine (The Curator)**:
      - Modify the relevant `<prompt-file>` (pick the one most likely to be the root cause) to address failures.
-     - **Strategy**: Treat the prompt as a structured "Playbook".
+     - **Strategy**: Treat the prompt as a structured set of rules and instructions.
        - **Clarify**: If ambiguous, make the existing instruction more specific.
        - **Add Rule**: If a constraint was missed, add a specific bullet point to the relevant section.
        - **Negative Constraint**: If hallucinating, explicitly state what NOT to do.
@@ -61,7 +60,7 @@ Iteratively optimize a prompt file against an AgentV evaluation suite.
 3. **Completion**
    - Report final score.
    - Summarize key changes made to the prompt.
-   - **Finalize Playbook**: Add a summary header to the playbook file indicating the session completion and final score.
+   - **Finalize Optimization Log**: Add a summary header to the optimization log file indicating the session completion and final score.
 
 ## Guidelines
 - **Simplicity ("Less is More")**: Avoid adding specific rules for rare edge cases ("hotfixes"). Focus on universally applicable instructions.
@@ -1,4 +1,4 @@
 ---
-description: Iteratively optimize a prompt file against an AgentV evaluation suite
+description: Iteratively optimize prompt files against an AgentV evaluation suite
 ---
 #file:../../.claude/skills/agentv-prompt-optimizer/SKILL.md
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "agentv",
-  "version": "0.14.2",
+  "version": "0.16.0",
   "description": "CLI entry point for AgentV",
   "type": "module",
   "repository": {
@@ -24,7 +24,7 @@
     "dotenv": "^16.4.5",
     "micromatch": "^4.0.8",
     "yaml": "^2.6.1",
-    "@agentv/core": "0.14.2"
+    "@agentv/core": "0.16.0"
   },
   "devDependencies": {
     "execa": "^9.3.0"