agentv 3.9.2 → 3.10.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (40)
  1. package/dist/{chunk-OIVGGWJ3.js → chunk-GWHHM6X2.js} +25 -14
  2. package/dist/chunk-GWHHM6X2.js.map +1 -0
  3. package/dist/{chunk-6ZAFWUBT.js → chunk-JLFFYTZA.js} +4 -4
  4. package/dist/{chunk-JGMJL2LV.js → chunk-TXCVDTEE.js} +8 -7
  5. package/dist/{chunk-JGMJL2LV.js.map → chunk-TXCVDTEE.js.map} +1 -1
  6. package/dist/cli.js +3 -3
  7. package/dist/{dist-PUPHGVKL.js → dist-FPC7J7KQ.js} +2 -2
  8. package/dist/index.js +3 -3
  9. package/dist/{interactive-BD56NB23.js → interactive-N463HRIL.js} +3 -3
  10. package/dist/templates/.agents/skills/agentv-chat-to-eval/README.md +84 -0
  11. package/dist/templates/.agents/skills/agentv-chat-to-eval/SKILL.md +144 -0
  12. package/dist/templates/.agents/skills/agentv-chat-to-eval/examples/transcript-json.md +67 -0
  13. package/dist/templates/.agents/skills/agentv-chat-to-eval/examples/transcript-markdown.md +101 -0
  14. package/dist/templates/.agents/skills/agentv-eval-builder/SKILL.md +458 -0
  15. package/dist/templates/.agents/skills/agentv-eval-builder/references/config-schema.json +36 -0
  16. package/dist/templates/.agents/skills/agentv-eval-builder/references/custom-evaluators.md +118 -0
  17. package/dist/templates/.agents/skills/agentv-eval-builder/references/eval-schema.json +12753 -0
  18. package/dist/templates/.agents/skills/agentv-eval-builder/references/rubric-evaluator.md +77 -0
  19. package/dist/templates/.agents/skills/agentv-eval-orchestrator/SKILL.md +50 -0
  20. package/dist/templates/.agents/skills/agentv-prompt-optimizer/SKILL.md +78 -0
  21. package/dist/templates/.agentv/.env.example +25 -0
  22. package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +177 -0
  23. package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md +316 -0
  24. package/dist/templates/.claude/skills/agentv-eval-builder/references/compare-command.md +137 -0
  25. package/dist/templates/.claude/skills/agentv-eval-builder/references/composite-evaluator.md +215 -0
  26. package/dist/templates/.claude/skills/agentv-eval-builder/references/config-schema.json +27 -0
  27. package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md +115 -0
  28. package/dist/templates/.claude/skills/agentv-eval-builder/references/eval-schema.json +278 -0
  29. package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md +333 -0
  30. package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md +79 -0
  31. package/dist/templates/.claude/skills/agentv-eval-builder/references/structured-data-evaluators.md +121 -0
  32. package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md +298 -0
  33. package/dist/templates/.claude/skills/agentv-prompt-optimizer/SKILL.md +78 -0
  34. package/dist/templates/.github/prompts/agentv-eval-build.prompt.md +5 -0
  35. package/dist/templates/.github/prompts/agentv-optimize.prompt.md +4 -0
  36. package/package.json +3 -3
  37. package/dist/chunk-OIVGGWJ3.js.map +0 -1
  38. package/dist/{chunk-6ZAFWUBT.js.map → chunk-JLFFYTZA.js.map} +0 -0
  39. package/dist/{dist-PUPHGVKL.js.map → dist-FPC7J7KQ.js.map} +0 -0
  40. package/dist/{interactive-BD56NB23.js.map → interactive-N463HRIL.js.map} +0 -0
@@ -0,0 +1,77 @@
+ # Rubric Evaluator
+
+ ## Field Reference
+
+ | Field | Type | Default | Description |
+ |-------|------|---------|-------------|
+ | `id` | string | auto-generated | Unique identifier |
+ | `outcome` | string | required* | Criterion being evaluated (*optional if `score_ranges` used) |
+ | `weight` | number | 1.0 | Relative importance |
+ | `required` | boolean | true | Failing forces verdict to 'fail' (checklist mode) |
+ | `required_min_score` | integer | - | Minimum 0-10 score to pass (score-range mode) |
+ | `score_ranges` | map or array | - | Score range definitions for analytic scoring |
+
+ ## Checklist Mode
+
+ ```yaml
+ rubrics:
+   - Mentions divide-and-conquer approach
+   - id: complexity
+     outcome: States time complexity correctly
+     weight: 2.0
+     required: true
+   - id: examples
+     outcome: Includes code examples
+     weight: 1.0
+     required: false
+ ```
+
+ ## Score-Range Mode
+
+ Shorthand map format (recommended):
+
+ ```yaml
+ rubrics:
+   - id: correctness
+     weight: 2.0
+     required_min_score: 7
+     score_ranges:
+       0: Critical bugs
+       3: Minor bugs
+       6: Correct with minor issues
+       9: Fully correct
+ ```
+
+ Map keys are lower bounds (0-10). Each range extends from its key up to (next key - 1), and the last range extends to 10. Keys must start at 0.
+
+ Array format is also accepted:
+
+ ```yaml
+ score_ranges:
+   - score_range: [0, 2]
+     outcome: Critical bugs
+   - score_range: [3, 5]
+     outcome: Minor bugs
+   - score_range: [6, 8]
+     outcome: Correct with minor issues
+   - score_range: [9, 10]
+     outcome: Fully correct
+ ```
+
+ Range bounds must be integers 0-10; ranges must not overlap and must together cover all values 0-10.
+
+ ## Scoring
+
+ **Checklist:** `score = sum(satisfied weights) / sum(all weights)`
+
+ **Score-range:** `score = weighted_average(raw_score / 10)` per criterion
+
+ ## Verdicts
+
+ | Verdict | Condition |
+ |---------|-----------|
+ | `pass` | score >= 0.8 AND all gating criteria satisfied |
+ | `borderline` | 0.6 <= score < 0.8 AND all gating criteria satisfied |
+ | `fail` | score < 0.6 OR any gating criterion failed |
+
+ Gating: checklist mode uses `required: true`; score-range mode uses `required_min_score: N`.
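The checklist scoring formula, verdict thresholds, and map-to-range expansion rule documented in the added `rubric-evaluator.md` above can be sketched in Python. This is an illustration of the documented rules only; function names and data shapes are hypothetical, not AgentV's actual implementation.

```python
def checklist_score(criteria):
    """criteria: list of dicts with 'weight', 'satisfied', 'required' (shapes assumed)."""
    total = sum(c.get("weight", 1.0) for c in criteria)
    satisfied = sum(c.get("weight", 1.0) for c in criteria if c["satisfied"])
    score = satisfied / total if total else 0.0
    # Gating: any failing criterion with required: true forces 'fail'.
    gate_failed = any(c.get("required", True) and not c["satisfied"] for c in criteria)
    return score, verdict(score, gate_failed)

def verdict(score, gate_failed):
    # Thresholds from the Verdicts table: pass >= 0.8, borderline >= 0.6.
    if gate_failed or score < 0.6:
        return "fail"
    return "pass" if score >= 0.8 else "borderline"

def expand_ranges(mapping):
    """Expand the shorthand map: keys are lower bounds, each range runs
    to (next key - 1), the last range runs to 10, and keys start at 0."""
    keys = sorted(mapping)
    assert keys[0] == 0, "score_ranges map must start at 0"
    out = []
    for i, k in enumerate(keys):
        hi = keys[i + 1] - 1 if i + 1 < len(keys) else 10
        out.append(((k, hi), mapping[k]))
    return out

# Example from the Checklist Mode snippet: complexity satisfied, examples missed.
criteria = [
    {"weight": 2.0, "satisfied": True, "required": True},    # complexity
    {"weight": 1.0, "satisfied": False, "required": False},  # examples
]
score, v = checklist_score(criteria)  # score ≈ 0.667 -> "borderline"
```

Running `expand_ranges({0: "Critical bugs", 3: "Minor bugs", 6: "Correct with minor issues", 9: "Fully correct"})` yields exactly the four ranges shown in the array format, which is a quick way to check a shorthand map.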
@@ -0,0 +1,50 @@
+ ---
+ name: agentv-eval-orchestrator
+ description: Run AgentV evaluations without API keys by orchestrating eval subcommands. Use this skill when asked to run evals, evaluate an agent, or test prompt quality using agentv.
+ ---
+
+ # AgentV Eval Orchestrator
+
+ Run AgentV evaluations by acting as the LLM yourself — no API keys needed.
+
+ ## Quick Start
+
+ ```bash
+ agentv prompt eval <eval-file.yaml>
+ ```
+
+ This outputs a complete orchestration prompt with step-by-step instructions and all test IDs. Follow its instructions.
+
+ ## Workflow
+
+ For each test, run these three steps:
+
+ ### 1. Get Task Input
+
+ ```bash
+ agentv prompt eval input <path> --test-id <id>
+ ```
+
+ Returns JSON with `input`, `guideline_paths`, and `criteria`. File references in messages use absolute paths — read them from the filesystem.
+
+ ### 2. Execute the Task
+
+ You ARE the candidate LLM. Read `input` from step 1, read any referenced files, and answer the task. Save your response to a temp file.
+
+ **Important**: Do not leak `criteria` into your answer — it's for your reference when judging, not part of the task.
+
+ ### 3. Judge the Result
+
+ ```bash
+ agentv prompt eval judge <path> --test-id <id> --answer-file /tmp/eval_<id>.txt
+ ```
+
+ Returns JSON with an `evaluators` array. Each evaluator has a `status`:
+
+ - **`"completed"`** — Deterministic score is final. Read `result.score` (0.0–1.0).
+ - **`"prompt_ready"`** — LLM grading required. Send `prompt.system_prompt` as system and `prompt.user_prompt` as user to yourself. Parse the JSON response to get `score`, `hits`, `misses`.
+
+ ## When to use this vs `agentv eval`
+
+ - **`agentv eval`** — You have API keys configured. Runs everything end-to-end automatically.
+ - **`agentv prompt`** — No API keys. You orchestrate: get input, answer the task yourself, judge the result.
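The judge step in the orchestrator skill above returns an `evaluators` array with two possible statuses. A minimal sketch of how an orchestrating agent might split that output, assuming only the fields the doc names (`status`, `result.score`, `prompt.system_prompt`, `prompt.user_prompt`) plus a hypothetical `name` field for keying:

```python
def route_evaluators(judge_output):
    """Split judge output into final deterministic scores and prompts that
    still need LLM grading. Field names beyond the doc are assumptions."""
    final, pending = {}, {}
    for ev in judge_output["evaluators"]:
        if ev["status"] == "completed":
            # Deterministic score is final (0.0-1.0).
            final[ev["name"]] = ev["result"]["score"]
        elif ev["status"] == "prompt_ready":
            # Caller must send system_prompt/user_prompt to itself, then
            # parse the JSON reply for score, hits, misses.
            pending[ev["name"]] = ev["prompt"]
    return final, pending
```

The agent would then loop over `pending`, answer each prompt pair, and merge the parsed scores into `final` before reporting.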
@@ -0,0 +1,78 @@
+ ---
+ name: agentv-prompt-optimizer
+ description: Iteratively optimize prompt files against AgentV evaluation datasets by analyzing failures and refining instructions.
+ ---
+
+ # AgentV Prompt Optimizer
+
+ ## Input Variables
+ - `eval-path`: Path or glob pattern to the AgentV evaluation file(s) to optimize against
+ - `optimization-log-path` (optional): Path where optimization progress should be logged
+
+ ## Workflow
+
+ 1. **Initialize**
+    - Verify `<eval-path>` (file or glob) targets the correct system.
+    - **Identify Prompt Files**:
+      - Infer prompt files from the eval file content (look for `file:` references in `input` that point to prompt files).
+      - Recursively check referenced prompt files for *other* prompt references (dependencies).
+      - If multiple prompts are found, consider ALL of them as candidates for optimization.
+    - **Identify Optimization Log**:
+      - If `<optimization-log-path>` is provided, use it.
+      - If not, create a new one in the parent directory of the eval files: `optimization-[timestamp].md`.
+    - Read the content of the identified prompt file.
+
+ 2. **Optimization Loop** (Max 10 iterations)
+    - **Execute (The Generator)**: Run `agentv eval <eval-path>`.
+      - *Targeted Run*: If iterating on specific stubborn failures, use `--test-id <test_id>` to run only the relevant tests.
+    - **Analyze (The Reflector)**:
+      - Locate the results file path from the console output (e.g., `.agentv/results/eval_...jsonl`).
+      - **Orchestrate Subagent**: Use `runSubagent` to analyze the results.
+        - **Task**: Read the results file, calculate the pass rate, and perform root cause analysis.
+        - **Output**: Return a structured analysis including:
+          - **Score**: Current pass rate.
+          - **Root Cause**: Why failures occurred (e.g., "Ambiguous definition", "Hallucination").
+          - **Insight**: Key learning or pattern identified from the failures.
+          - **Strategy**: High-level plan to fix the prompt (e.g., "Clarify section X", "Add negative constraint").
+    - **Decide**:
+      - If **100% pass**: STOP and report success.
+      - If **Score decreased**: Revert the last change and try a different approach.
+      - If **No improvement** (2x): STOP and report stagnation.
+    - **Refine (The Curator)**:
+      - **Orchestrate Subagent**: Use `runSubagent` to apply the fix.
+        - **Task**: Read the relevant prompt file(s), apply the **Strategy** from the Reflector, and generate the log entry.
+        - **Output**: The **Log Entry** describing the specific operation performed.
+          ```markdown
+          ### Iteration [N]
+          - **Operation**: [ADD / UPDATE / DELETE]
+          - **Target**: [Section Name]
+          - **Change**: [Specific text added/modified]
+          - **Trigger**: [Specific failing test case or error pattern]
+          - **Rationale**: [From Reflector: Root Cause]
+          - **Score**: [From Reflector: Current Pass Rate]
+          - **Insight**: [From Reflector: Key Learning]
+          ```
+      - **Strategy**: Treat the prompt as a structured set of rules. Execute atomic operations:
+        - **ADD**: Insert a new rule if a constraint was missed.
+        - **UPDATE**: Refine an existing rule to be clearer or more general.
+          - *Clarify*: Make ambiguous instructions specific.
+          - *Generalize*: Refactor specific fixes into high-level principles (First Principles).
+        - **DELETE**: Remove obsolete, redundant, or harmful rules.
+          - *Prune*: If a general rule covers specific cases, delete the specific ones.
+        - **Negative Constraint**: If hallucinating, explicitly state what NOT to do. Prefer generalized prohibitions over specific forbidden tokens where possible.
+        - **Safety Check**: Ensure new rules don't contradict existing ones (unless intended).
+        - **Constraint**: Avoid rewriting large sections. Make surgical, additive changes to preserve existing behavior.
+    - **Log Result**:
+      - Append the **Log Entry** returned by the Curator to the optimization log file.
+
+ 3. **Completion**
+    - Report the final score.
+    - Summarize key changes made to the prompt.
+    - **Finalize Optimization Log**: Add a summary header to the optimization log file indicating session completion and the final score.
+
+ ## Guidelines
+ - **Generalization First**: Prefer broad, principle-based guidelines over specific examples or "hotfixes". Only use specific rules if generalized instructions fail to achieve the desired score.
+ - **Simplicity ("Less is More")**: Avoid overfitting to the test set. If a specific rule doesn't significantly improve the score compared to a general one, choose the general one.
+ - **Structure**: Maintain existing Markdown headers/sections.
+ - **Progressive Disclosure**: If the prompt grows too large (>200 lines), consider moving specialized logic into a separate file or skill.
+ - **Quality Criteria**: Ensure the prompt defines a clear persona, specific task, and measurable success criteria.
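The Decide step of the optimization loop above (stop on 100% pass, revert on regression, stop after two iterations without improvement) can be sketched as a pure function over the pass-rate history. The function name and the exact stagnation bookkeeping are illustrative assumptions, not part of the skill:

```python
def decide(history, max_stagnant=2):
    """Stop/continue logic for the optimization loop.
    history: pass rates per iteration, oldest first, newest last."""
    current = history[-1]
    if current == 1.0:
        return "stop: success"
    if len(history) >= 2 and current < history[-2]:
        return "revert: score decreased"
    # Stagnation: the last max_stagnant iterations never beat the earlier best.
    if len(history) > max_stagnant and \
            max(history[-max_stagnant:]) <= max(history[:-max_stagnant]):
        return "stop: stagnation"
    return "continue"
```

For example, a history of `[0.7, 0.7, 0.7]` stops as stagnation, while `[0.5, 0.7]` continues iterating.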
@@ -0,0 +1,25 @@
+ # Copy this file to .env and fill in your credentials
+
+ # Eval run mode (used by agentv-bench skill)
+ AGENT_EVAL_MODE=agent # agent | cli
+
+ # Azure OpenAI Configuration
+ AZURE_OPENAI_ENDPOINT=https://your-endpoint.openai.azure.com/
+ AZURE_OPENAI_API_KEY=your-openai-api-key-here
+ AZURE_DEPLOYMENT_NAME=gpt-5-mini
+ AZURE_OPENAI_API_VERSION=2024-12-01-preview
+
+ # OpenAI
+ OPENAI_ENDPOINT=https://your-endpoint.openai.azure.com/
+ OPENAI_API_KEY=your-openai-api-key-here
+ OPENAI_MODEL=gpt-5-mini
+
+ # Google Gemini
+ GOOGLE_GENERATIVE_AI_API_KEY=your-gemini-api-key-here
+ GEMINI_MODEL_NAME=gemini-3-flash-preview
+
+ # Anthropic
+ ANTHROPIC_API_KEY=your-anthropic-api-key-here
+
+ # CLI provider sample (used by the local_cli target)
+ CLI_EVALS_DIR=./docs/examples/simple/evals/local-cli
@@ -0,0 +1,177 @@
+ ---
+ name: agentv-eval-builder
+ description: Create and maintain AgentV YAML evaluation files for testing AI agent performance. Use this skill when creating new eval files, adding eval cases, or configuring evaluators.
+ ---
+
+ # AgentV Eval Builder
+
+ Comprehensive docs: https://agentv.dev
+
+ ## Quick Start
+
+ ```yaml
+ description: Example eval
+ execution:
+   target: default
+
+ evalcases:
+   - id: greeting
+     expected_outcome: Friendly greeting
+     input: "Say hello"
+     expected_output: "Hello! How can I help you?"
+     rubrics:
+       - Greeting is friendly and warm
+       - Offers to help
+ ```
+
+ ## Eval File Structure
+
+ **Required:** `evalcases` (array)
+ **Optional:** `description`, `execution`, `dataset`
+
+ **Eval case fields:**
+
+ | Field | Required | Description |
+ |-------|----------|-------------|
+ | `id` | yes | Unique identifier |
+ | `expected_outcome` | yes | What the response should accomplish |
+ | `input` / `input_messages` | yes | Input to the agent |
+ | `expected_output` / `expected_messages` | no | Gold-standard reference answer |
+ | `rubrics` | no | Inline evaluation criteria |
+ | `execution` | no | Per-case execution overrides |
+ | `conversation_id` | no | Thread grouping |
+
+ **Shorthand aliases:**
+ - `input` (string) expands to `[{role: "user", content: "..."}]`
+ - `expected_output` (string/object) expands to `[{role: "assistant", content: ...}]`
+ - Canonical `input_messages` / `expected_messages` take precedence when both are present
+
+ **Message format:** `{role, content}` where role is `system`, `user`, `assistant`, or `tool`
+ **Content types:** inline text, `{type: "file", value: "./path.md"}`
+ **File paths:** relative to the eval file's directory, or absolute from the repo root with a `/` prefix
+
+ **JSONL format:** One eval case per line as JSON. Optional `.yaml` sidecar for shared defaults. See `examples/features/basic-jsonl/`.
+
+ ## Evaluator Types
+
+ Configure via the `execution.evaluators` array. Multiple evaluators produce a weighted average score.
+
+ ### code_judge
+ ```yaml
+ - name: format_check
+   type: code_judge
+   script: uv run validate.py
+   cwd: ./scripts # optional working directory
+   target: {} # optional: enable LLM target proxy (max_calls: 50)
+ ```
+ Contract: stdin JSON -> stdout JSON `{score, hits, misses, reasoning}`
+ See `references/custom-evaluators.md` for templates.
+
+ ### llm_judge
+ ```yaml
+ - name: quality
+   type: llm_judge
+   prompt: ./prompts/eval.md # markdown template or script config
+   model: gpt-5-chat # optional model override
+   config: # passed to script templates as context.config
+     strictness: high
+ ```
+ Variables: `{{question}}`, `{{expected_outcome}}`, `{{candidate_answer}}`, `{{reference_answer}}`, `{{input_messages}}`, `{{expected_messages}}`, `{{output_messages}}`
+ - Markdown templates: use `{{variable}}` syntax
+ - TypeScript templates: use `definePromptTemplate(fn)` from `@agentv/eval`; the function receives a context object with all variables plus `config`
+
+ ### composite
+ ```yaml
+ - name: gate
+   type: composite
+   evaluators:
+     - name: safety
+       type: llm_judge
+       prompt: ./safety.md
+     - name: quality
+       type: llm_judge
+   aggregator:
+     type: weighted_average
+     weights: { safety: 0.3, quality: 0.7 }
+ ```
+ Aggregator types: `weighted_average`, `all_or_nothing`, `minimum`, `maximum`, `safety_gate`
+ - `safety_gate`: fails immediately if the named gate evaluator scores below the threshold (default 1.0)
+
+ ### tool_trajectory
+ ```yaml
+ - name: tool_check
+   type: tool_trajectory
+   mode: any_order # any_order | in_order | exact
+   minimums: # for any_order
+     knowledgeSearch: 2
+   expected: # for in_order/exact
+     - tool: knowledgeSearch
+       args: { query: "search term" } # partial deep equality match
+     - tool: documentRetrieve
+       args: any # any arguments accepted
+       max_duration_ms: 5000 # per-tool latency assertion
+     - tool: summarize # omit args to skip argument checking
+ ```
+
+ ### field_accuracy
+ ```yaml
+ - name: fields
+   type: field_accuracy
+   match_type: exact # exact | date | numeric_tolerance
+   numeric_tolerance: 0.01 # for numeric_tolerance match_type
+   aggregation: weighted_average # weighted_average | all_or_nothing
+ ```
+ Compares `output_messages` fields against `expected_messages` fields.
+
+ ### latency
+ ```yaml
+ - name: speed
+   type: latency
+   max_ms: 5000
+ ```
+
+ ### cost
+ ```yaml
+ - name: budget
+   type: cost
+   max_usd: 0.10
+ ```
+
+ ### token_usage
+ ```yaml
+ - name: tokens
+   type: token_usage
+   max_total_tokens: 4000
+ ```
+
+ ### rubric (inline)
+ ```yaml
+ rubrics:
+   - Simple string criterion
+   - id: weighted
+     expected_outcome: Detailed criterion
+     weight: 2.0
+     required: true
+ ```
+ See `references/rubric-evaluator.md` for score-range mode and the scoring formula.
+
+ ## CLI Commands
+
+ ```bash
+ # Run evaluation
+ bun agentv eval <file.yaml> [--eval-id <id>] [--target <name>] [--dry-run]
+
+ # Validate eval file
+ bun agentv validate <file.yaml>
+
+ # Compare results between runs
+ bun agentv compare <results1.jsonl> <results2.jsonl>
+
+ # Generate rubrics from expected_outcome
+ bun agentv generate rubrics <file.yaml> [--target <name>]
+ ```
+
+ ## Schemas
+
+ - Eval file: `references/eval-schema.json`
+ - Config: `references/config-schema.json`
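The composite aggregator types listed in the eval-builder skill above can be sketched in Python. The formulas here are inferred from the aggregator names and the `safety_gate` description, so treat this as a hypothetical illustration rather than AgentV's exact semantics (in particular, `all_or_nothing` is assumed to require every sub-score to be 1.0):

```python
def aggregate(scores, kind, weights=None, gate=None, threshold=1.0):
    """scores: {evaluator_name: score in 0.0-1.0}. Illustrative only."""
    if kind == "safety_gate":
        # Fail immediately if the named gate evaluator scores below threshold,
        # otherwise aggregate the remaining evaluators.
        if scores[gate] < threshold:
            return 0.0
        rest = {k: v for k, v in scores.items() if k != gate}
        return aggregate(rest, "weighted_average", weights)
    if kind == "weighted_average":
        w = weights or {k: 1.0 for k in scores}  # unweighted mean by default
        return sum(scores[k] * w[k] for k in scores) / sum(w[k] for k in scores)
    if kind == "all_or_nothing":
        return 1.0 if all(v >= 1.0 for v in scores.values()) else 0.0
    if kind == "minimum":
        return min(scores.values())
    if kind == "maximum":
        return max(scores.values())
    raise ValueError(f"unknown aggregator: {kind}")

# The weights from the composite example above: safety 0.3, quality 0.7.
overall = aggregate({"safety": 1.0, "quality": 0.8}, "weighted_average",
                    {"safety": 0.3, "quality": 0.7})  # ≈ 0.86
```

With `safety_gate`, a failing gate short-circuits to 0.0 regardless of how well the other evaluators scored.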