agentv 1.5.0 → 2.0.1

Files changed (23)
  1. package/README.md +30 -44
  2. package/dist/{chunk-3RYQPI4H.js → chunk-6SHT2QS6.js} +4075 -1129
  3. package/dist/chunk-6SHT2QS6.js.map +1 -0
  4. package/dist/cli.js +1 -1
  5. package/dist/index.js +1 -1
  6. package/dist/templates/.agentv/.env.template +23 -23
  7. package/dist/templates/.agentv/config.yaml +15 -15
  8. package/dist/templates/.agentv/targets.yaml +16 -0
  9. package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +6 -4
  10. package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md +12 -2
  11. package/dist/templates/.claude/skills/agentv-eval-builder/references/compare-command.md +137 -0
  12. package/dist/templates/.claude/skills/agentv-eval-builder/references/composite-evaluator.md +215 -215
  13. package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md +141 -4
  14. package/dist/templates/.claude/skills/agentv-eval-builder/references/eval-schema.json +10 -6
  15. package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md +0 -7
  16. package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md +0 -2
  17. package/dist/templates/.claude/skills/agentv-eval-builder/references/structured-data-evaluators.md +121 -0
  18. package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md +28 -2
  19. package/dist/templates/.claude/skills/agentv-prompt-optimizer/SKILL.md +77 -77
  20. package/dist/templates/.github/prompts/agentv-eval-build.prompt.md +4 -4
  21. package/dist/templates/.github/prompts/agentv-optimize.prompt.md +3 -3
  22. package/package.json +6 -3
  23. package/dist/chunk-3RYQPI4H.js.map +0 -1
package/README.md CHANGED
@@ -1,6 +1,6 @@
  # AgentV
 
- A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, OpenAI Codex CLI and Azure OpenAI.
+ A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, OpenAI Codex CLI, Pi Coding Agent, and Azure OpenAI.
 
  ## Installation and Setup
 
@@ -129,9 +129,6 @@ agentv eval --target vscode_projectx --targets "path/to/targets.yaml" --eval-id
  - `--agent-timeout SECONDS`: Timeout in seconds for agent response polling (default: 120)
  - `--max-retries COUNT`: Maximum number of retries for timeout cases (default: 2)
  - `--cache`: Enable caching of LLM responses (default: disabled)
- - `--dump-prompts`: Save all prompts to `.agentv/prompts/` directory
- - `--dump-traces`: Write trace files to `.agentv/traces/` directory
- - `--include-trace`: Include full trace in result output (verbose)
  - `--workers COUNT`: Parallel workers for eval cases (default: 3; target `workers` setting used when provided)
  - `--verbose`: Verbose output
 
@@ -162,7 +159,7 @@ Execution targets in `.agentv/targets.yaml` decouple evals from providers/settin
  Each target specifies:
 
  - `name`: Unique identifier for the target
- - `provider`: The model provider (`azure`, `anthropic`, `gemini`, `codex`, `vscode`, `vscode-insiders`, `cli`, or `mock`)
+ - `provider`: The model provider (`azure`, `anthropic`, `gemini`, `codex`, `pi-coding-agent`, `vscode`, `vscode-insiders`, `cli`, or `mock`)
  - Provider-specific configuration fields at the top level (no `settings` wrapper needed)
  - Optional fields: `judge_target`, `workers`, `provider_batching`
 
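Reviewer note: the target definitions touched by this diff take secrets and paths from environment variables, written as `${{ VARIABLE_NAME }}` in `targets.yaml`. As a rough illustration of that substitution, and not AgentV's actual resolver (which this diff does not show), a minimal Python sketch might look like the following. The function name `resolve_placeholders`, the fail-fast behaviour on missing variables, and the sample values are all assumptions made for the example.

```python
import re

# Matches ${{ NAME }} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\$\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def resolve_placeholders(value: str, env: dict) -> str:
    """Replace every ${{ NAME }} in `value` with env[NAME] (hypothetical helper)."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            # Assumption: fail fast rather than leave the placeholder in place.
            raise KeyError(f"Environment variable {name} is not set")
        return env[name]

    return PLACEHOLDER.sub(substitute, value)


# A target shaped like the README's Pi Coding Agent example; values are made up.
target = {
    "name": "pi",
    "provider": "pi-coding-agent",
    "model": "${{ GEMINI_MODEL_NAME }}",
    "timeout_seconds": 180,
}
env = {"GEMINI_MODEL_NAME": "example-model"}

# Only string fields can contain placeholders; other values pass through.
resolved = {
    key: resolve_placeholders(val, env) if isinstance(val, str) else val
    for key, val in target.items()
}
```

In real use `env` would typically be `os.environ`; a plain dict is used here so the sketch has no external dependencies.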
@@ -240,6 +237,27 @@ Note: Environment variables are referenced using `${{ VARIABLE_NAME }}` syntax.
  Codex targets require the standalone `codex` CLI and a configured profile (via `codex configure`) so credentials are stored in `~/.codex/config` (or whatever path the CLI already uses). AgentV mirrors all guideline and attachment files into a fresh scratch workspace, so the `file://` preread links remain valid even when the CLI runs outside your repo tree.
  Confirm the CLI works by running `codex exec --json --profile <name> "ping"` (or any supported dry run) before starting an eval. This prints JSONL events; seeing `item.completed` messages indicates the CLI is healthy.
 
+ **Pi Coding Agent targets:**
+
+ ```yaml
+ - name: pi
+   provider: pi-coding-agent
+   judge_target: gemini_base
+   executable: ${{ PI_CLI_PATH }} # Optional: defaults to `pi` if omitted
+   pi_provider: google # google, anthropic, openai, groq, xai, openrouter
+   model: ${{ GEMINI_MODEL_NAME }}
+   api_key: ${{ GOOGLE_GENERATIVE_AI_API_KEY }}
+   tools: read,bash,edit,write # Available tools for the agent
+   timeout_seconds: 180
+   cwd: ${{ PI_WORKSPACE_DIR }} # Optional: run in specific directory
+   log_format: json # 'summary' (default) or 'json' for full logs
+   # system_prompt: optional override for the default system prompt
+ ```
+
+ Pi Coding Agent is an autonomous coding CLI from [pi-mono](https://github.com/badlogic/pi-mono). Install it globally with `npm install -g @mariozechner/pi-coding-agent` (or use a local path via `executable`). It supports multiple LLM providers and outputs JSONL events. AgentV extracts tool trajectories from the output for trace-based evaluation. File attachments are passed using Pi's native `@path` syntax.
+
+ By default, a system prompt instructs the agent to include code in its response (required for evaluation scoring). Use `system_prompt` to override this behavior.
+
  ## Writing Custom Evaluators
 
  ### Code Evaluator I/O Contract
@@ -276,45 +294,13 @@ Code evaluators receive input via stdin and write output to stdout as JSON.
  - Score range: `0.0` to `1.0` (float)
  - `hits` and `misses` are optional but recommended for debugging
 
- ### Code Evaluator Script Template
-
- ```python
- #!/usr/bin/env python3
- import json
- import sys
-
- def evaluate(input_data):
-     # Extract only the fields you need
-     candidate_answer = input_data.get("candidate_answer", "")
-
-     # Your validation logic here
-     score = 0.0 # to 1.0
-     hits = ["successful check 1", "successful check 2"]
-     misses = ["failed check 1"]
-     reasoning = "Explanation of score"
-
-     return {
-         "score": score,
-         "hits": hits,
-         "misses": misses,
-         "reasoning": reasoning
-     }
-
- if __name__ == "__main__":
-     try:
-         input_data = json.loads(sys.stdin.read())
-         result = evaluate(input_data)
-         print(json.dumps(result, indent=2))
-     except Exception as e:
-         error_result = {
-             "score": 0.0,
-             "hits": [],
-             "misses": [f"Evaluator error: {str(e)}"],
-             "reasoning": f"Evaluator error: {str(e)}"
-         }
-         print(json.dumps(error_result, indent=2))
-         sys.exit(1)
- ```
+ ### Code Evaluator Templates
+
+ Custom evaluators can be written in any language. For complete templates and examples:
+
+ - **Python template**: See `apps/cli/src/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md`
+ - **TypeScript template (with SDK)**: See `apps/cli/src/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md`
+ - **Working examples**: See [examples/features/code-judge-sdk](examples/features/code-judge-sdk)
 
  ### LLM Judge Template Structure