npm - agentv - Versions diffs - 1.5.0 → 2.0.1 - Mend

agentv 1.5.0 → 2.0.1

Files changed (23)

package/README.md CHANGED

@@ -1,6 +1,6 @@
 # AgentV
-A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, OpenAI Codex CLI and Azure OpenAI.
+A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, OpenAI Codex CLI, Pi Coding Agent, and Azure OpenAI.
 ## Installation and Setup
@@ -129,9 +129,6 @@ agentv eval --target vscode_projectx --targets "path/to/targets.yaml" --eval-id
 - `--agent-timeout SECONDS`: Timeout in seconds for agent response polling (default: 120)
 - `--max-retries COUNT`: Maximum number of retries for timeout cases (default: 2)
 - `--cache`: Enable caching of LLM responses (default: disabled)
-- `--dump-prompts`: Save all prompts to `.agentv/prompts/` directory
-- `--dump-traces`: Write trace files to `.agentv/traces/` directory
-- `--include-trace`: Include full trace in result output (verbose)
 - `--workers COUNT`: Parallel workers for eval cases (default: 3; target `workers` setting used when provided)
 - `--verbose`: Verbose output
@@ -162,7 +159,7 @@ Execution targets in `.agentv/targets.yaml` decouple evals from providers/settin
 Each target specifies:
 - `name`: Unique identifier for the target
-- `provider`: The model provider (`azure`, `anthropic`, `gemini`, `codex`, `vscode`, `vscode-insiders`, `cli`, or `mock`)
+- `provider`: The model provider (`azure`, `anthropic`, `gemini`, `codex`, `pi-coding-agent`, `vscode`, `vscode-insiders`, `cli`, or `mock`)
 - Provider-specific configuration fields at the top level (no `settings` wrapper needed)
 - Optional fields: `judge_target`, `workers`, `provider_batching`
@@ -240,6 +237,27 @@ Note: Environment variables are referenced using `${{ VARIABLE_NAME }}` syntax.
 Codex targets require the standalone `codex` CLI and a configured profile (via `codex configure`) so credentials are stored in `~/.codex/config` (or whatever path the CLI already uses). AgentV mirrors all guideline and attachment files into a fresh scratch workspace, so the `file://` preread links remain valid even when the CLI runs outside your repo tree.
 Confirm the CLI works by running `codex exec --json --profile <name> "ping"` (or any supported dry run) before starting an eval. This prints JSONL events; seeing `item.completed` messages indicates the CLI is healthy.
+**Pi Coding Agent targets:**
+```yaml
+- name: pi
+  provider: pi-coding-agent
+  judge_target: gemini_base
+  executable: ${{ PI_CLI_PATH }}            # Optional: defaults to `pi` if omitted
+  pi_provider: google                       # google, anthropic, openai, groq, xai, openrouter
+  model: ${{ GEMINI_MODEL_NAME }}
+  api_key: ${{ GOOGLE_GENERATIVE_AI_API_KEY }}
+  tools: read,bash,edit,write               # Available tools for the agent
+  timeout_seconds: 180
+  cwd: ${{ PI_WORKSPACE_DIR }}              # Optional: run in specific directory
+  log_format: json                          # 'summary' (default) or 'json' for full logs
+  # system_prompt: optional override for the default system prompt
+```
+Pi Coding Agent is an autonomous coding CLI from [pi-mono](https://github.com/badlogic/pi-mono). Install it globally with `npm install -g @mariozechner/pi-coding-agent` (or use a local path via `executable`). It supports multiple LLM providers and outputs JSONL events. AgentV extracts tool trajectories from the output for trace-based evaluation. File attachments are passed using Pi's native `@path` syntax.
+By default, a system prompt instructs the agent to include code in its response (required for evaluation scoring). Use `system_prompt` to override this behavior.
 ## Writing Custom Evaluators
 ### Code Evaluator I/O Contract
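Note on the Pi Coding Agent hunk above: the added README text says that Pi outputs JSONL events and that AgentV extracts tool trajectories from them. The diff does not show Pi's event schema or AgentV's extraction code, so the following is only a minimal sketch of the general idea. The function name `extract_tool_trajectory` and the event shape `{"type": "tool_call", "name": ...}` are assumptions made for illustration, not the actual format.

```python
import json


def extract_tool_trajectory(jsonl_text):
    """Return the ordered tool names found in a JSONL event stream.

    Assumption (hypothetical schema): a tool invocation is a JSON object
    shaped like {"type": "tool_call", "name": "<tool>"}. Pi's real events
    may use different field names.
    """
    trajectory = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between events
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise such as plain log lines
        if isinstance(event, dict) and event.get("type") == "tool_call":
            name = event.get("name")
            if isinstance(name, str):
                trajectory.append(name)
    return trajectory
```

Under these assumptions, a stream containing a `read` call followed by a `bash` call would yield the list `["read", "bash"]`, which could then be compared against an expected tool sequence in a trace-based check.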
@@ -276,45 +294,13 @@ Code evaluators receive input via stdin and write output to stdout as JSON.
 - Score range: `0.0` to `1.0` (float)
 - `hits` and `misses` are optional but recommended for debugging
-### Code Evaluator Script Template
-```python
-#!/usr/bin/env python3
-import json
-import sys
-def evaluate(input_data):
-    # Extract only the fields you need
-    candidate_answer = input_data.get("candidate_answer", "")
-    # Your validation logic here
-    score = 0.0  # to 1.0
-    hits = ["successful check 1", "successful check 2"]
-    misses = ["failed check 1"]
-    reasoning = "Explanation of score"
-    return {
-        "score": score,
-        "hits": hits,
-        "misses": misses,
-        "reasoning": reasoning
-    }
-if __name__ == "__main__":
-    try:
-        input_data = json.loads(sys.stdin.read())
-        result = evaluate(input_data)
-        print(json.dumps(result, indent=2))
-    except Exception as e:
-        error_result = {
-            "score": 0.0,
-            "hits": [],
-            "misses": [f"Evaluator error: {str(e)}"],
-            "reasoning": f"Evaluator error: {str(e)}"
-        }
-        print(json.dumps(error_result, indent=2))
-        sys.exit(1)
-```
+### Code Evaluator Templates
+Custom evaluators can be written in any language. For complete templates and examples:
+- **Python template**: See `apps/cli/src/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md`
+- **TypeScript template (with SDK)**: See `apps/cli/src/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md`
+- **Working examples**: See [examples/features/code-judge-sdk](examples/features/code-judge-sdk)
 ### LLM Judge Template Structure