agentv 0.23.0 → 0.26.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -130,6 +130,8 @@ agentv eval --target vscode_projectx --targets "path/to/targets.yaml" --eval-id
  - `--max-retries COUNT`: Maximum number of retries for timeout cases (default: 2)
  - `--cache`: Enable caching of LLM responses (default: disabled)
  - `--dump-prompts`: Save all prompts to `.agentv/prompts/` directory
+ - `--dump-traces`: Write trace files to `.agentv/traces/` directory
+ - `--include-trace`: Include full trace in result output (verbose)
  - `--workers COUNT`: Parallel workers for eval cases (default: 3; target `workers` setting used when provided)
  - `--verbose`: Verbose output
 
@@ -174,6 +176,7 @@ Each target specifies:
  endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
  api_key: ${{ AZURE_OPENAI_API_KEY }}
  model: ${{ AZURE_DEPLOYMENT_NAME }}
+ version: ${{ AZURE_OPENAI_API_VERSION }} # Optional: defaults to 2024-12-01-preview
  ```
 
  Note: Environment variables are referenced using `${{ VARIABLE_NAME }}` syntax. The actual values are resolved from your `.env` file at runtime.
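As a minimal sketch of the `${{ VARIABLE_NAME }}` substitution described in the note above: it assumes placeholders are expanded from values already loaded into the process environment (for example from a `.env` file). The regex and function name are illustrative, not agentv's actual resolver.

```python
import os
import re

# Minimal sketch (not agentv's implementation): expand ${{ VAR }} placeholders
# using values already present in the process environment.
PLACEHOLDER = re.compile(r"\$\{\{\s*([A-Z0-9_]+)\s*\}\}")

def resolve_placeholders(text: str) -> str:
    """Replace each ${{ VAR }} occurrence with os.environ[VAR]."""
    return PLACEHOLDER.sub(lambda m: os.environ[m.group(1)], text)

if __name__ == "__main__":
    os.environ.setdefault("AZURE_OPENAI_API_VERSION", "2024-12-01-preview")
    print(resolve_placeholders("version: ${{ AZURE_OPENAI_API_VERSION }}"))
    # -> version: 2024-12-01-preview
```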
@@ -246,14 +249,13 @@ Code evaluators receive input via stdin and write output to stdout as JSON.
  **Input Format (via stdin):**
  ```json
  {
- "task": "string describing the task",
- "outcome": "expected outcome description",
- "expected": "expected output string",
- "output": "generated code/text from the agent",
- "system_message": "system message if any",
- "guideline_paths": ["path1", "path2"],
- "attachments": ["file1", "file2"],
- "user_segments": [{"type": "text", "value": "..."}]
+ "question": "string describing the task/question",
+ "expected_outcome": "expected outcome description",
+ "reference_answer": "gold standard answer (optional)",
+ "candidate_answer": "generated code/text from the agent",
+ "guideline_files": ["path/to/guideline1.md", "path/to/guideline2.md"],
+ "input_files": ["path/to/data.json", "path/to/config.yaml"],
+ "input_messages": [{"role": "user", "content": "..."}]
  }
  ```
 
@@ -269,8 +271,8 @@ Code evaluators receive input via stdin and write output to stdout as JSON.
 
  **Key Points:**
  - Evaluators receive **full context** but should select only relevant fields
- - Most evaluators only need `output` field - ignore the rest to avoid false positives
- - Complex evaluators can use `task`, `expected`, or `guideline_paths` for context-aware validation
+ - Most evaluators only need the `candidate_answer` field - ignore the rest to avoid false positives
+ - Complex evaluators can use `question`, `reference_answer`, or `guideline_files` for context-aware validation
  - Score range: `0.0` to `1.0` (float)
  - `hits` and `misses` are optional but recommended for debugging
 
@@ -283,7 +285,7 @@ import sys
 
  def evaluate(input_data):
      # Extract only the fields you need
-     output = input_data.get("output", "")
+     candidate_answer = input_data.get("candidate_answer", "")
 
      # Your validation logic here
      score = 0.0  # to 1.0
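To see the fragment above end to end, here is a complete, self-contained evaluator sketch using the renamed input fields. The stdin/stdout wiring follows the protocol described earlier (JSON in, JSON out, score in `0.0`-`1.0`, optional `hits`/`misses`); the specific checks are illustrative assumptions, not the package's shipped example.

```python
import json
import sys


def evaluate(input_data):
    # Extract only the fields you need
    candidate_answer = input_data.get("candidate_answer", "")

    # Illustrative checks only: require a non-empty answer with no TODO markers.
    # Real evaluators apply their own domain logic here.
    hits, misses = [], []
    if candidate_answer.strip():
        hits.append("candidate_answer is non-empty")
    else:
        misses.append("candidate_answer is empty")
    if "TODO" in candidate_answer:
        misses.append("candidate_answer still contains TODO markers")

    score = 1.0 if not misses else 0.0
    return {"score": score, "hits": hits, "misses": misses}


if __name__ == "__main__":
    # Input arrives as a single JSON object on stdin; the result is written
    # to stdout as JSON.
    result = evaluate(json.load(sys.stdin))
    print(json.dumps(result))
```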
@@ -414,6 +416,7 @@ targets:
  endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
  api_key: ${{ AZURE_OPENAI_API_KEY }}
  model: gpt-4
+ version: ${{ AZURE_OPENAI_API_VERSION }} # Optional: API version (defaults to 2024-12-01-preview)
  max_retries: 5 # Maximum retry attempts
  retry_initial_delay_ms: 2000 # Initial delay before first retry
  retry_max_delay_ms: 120000 # Maximum delay cap
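The three retry settings above are easiest to read as a backoff policy: start at `retry_initial_delay_ms`, grow the delay on each failed attempt, and never exceed `retry_max_delay_ms`. A small sketch of that general pattern follows; the doubling factor and function name are assumptions, not agentv's exact retry code.

```python
import time

# Sketch of how such settings are typically combined (assumption, not the
# package's implementation): exponential backoff starting at the initial
# delay, doubling per attempt, capped at the maximum delay.
def call_with_retries(call, max_retries=5, retry_initial_delay_ms=2000, retry_max_delay_ms=120000):
    delay_ms = retry_initial_delay_ms
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted
            time.sleep(delay_ms / 1000)
            delay_ms = min(delay_ms * 2, retry_max_delay_ms)
```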