agentv 2.9.0-next.1 → 2.10.0

This diff shows the content of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the packages as they appear in those registries.
Files changed (22)
  1. package/dist/{chunk-H54JIK7G.js → chunk-G3OTPFYX.js} +2 -3
  2. package/dist/chunk-G3OTPFYX.js.map +1 -0
  3. package/dist/cli.js +1 -1
  4. package/dist/index.js +1 -1
  5. package/dist/templates/.agentv/config.yaml +1 -1
  6. package/dist/templates/.agentv/targets.yaml +10 -13
  7. package/package.json +1 -1
  8. package/dist/chunk-H54JIK7G.js.map +0 -1
  9. package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +0 -202
  10. package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md +0 -316
  11. package/dist/templates/.claude/skills/agentv-eval-builder/references/compare-command.md +0 -137
  12. package/dist/templates/.claude/skills/agentv-eval-builder/references/composite-evaluator.md +0 -215
  13. package/dist/templates/.claude/skills/agentv-eval-builder/references/config-schema.json +0 -27
  14. package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md +0 -118
  15. package/dist/templates/.claude/skills/agentv-eval-builder/references/eval-schema.json +0 -278
  16. package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md +0 -333
  17. package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md +0 -77
  18. package/dist/templates/.claude/skills/agentv-eval-builder/references/structured-data-evaluators.md +0 -121
  19. package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md +0 -298
  20. package/dist/templates/.claude/skills/agentv-prompt-optimizer/SKILL.md +0 -78
  21. package/dist/templates/.github/prompts/agentv-eval-build.prompt.md +0 -5
  22. package/dist/templates/.github/prompts/agentv-optimize.prompt.md +0 -4
--- a/package/dist/templates/.claude/skills/agentv-eval-builder/references/composite-evaluator.md
+++ /dev/null
@@ -1,215 +0,0 @@
- # Composite Evaluator Guide
-
- Composite evaluators combine multiple evaluators and aggregate their results. This enables sophisticated evaluation patterns like safety gates, weighted scoring, and conflict resolution.
-
- ## Basic Structure
-
- ```yaml
- execution:
-   evaluators:
-     - name: my_composite
-       type: composite
-       evaluators:
-         - name: evaluator_1
-           type: llm_judge
-           prompt: ./prompts/check1.md
-         - name: evaluator_2
-           type: code_judge
-           script: uv run check2.py
-       aggregator:
-         type: weighted_average
-         weights:
-           evaluator_1: 0.6
-           evaluator_2: 0.4
- ```
-
- ## Aggregator Types
-
- ### 1. Weighted Average (Default)
-
- Combines scores using a weighted arithmetic mean:
-
- ```yaml
- aggregator:
-   type: weighted_average
-   weights:
-     safety: 0.3   # 30% weight
-     quality: 0.7  # 70% weight
- ```
-
- If weights are omitted, all evaluators have equal weight (1.0).
-
- **Score calculation:**
- ```
- final_score = Σ(score_i × weight_i) / Σ(weight_i)
- ```
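For example, with the `safety: 0.3` / `quality: 0.7` weights above and child scores of 0.9 and 0.85 (the scores used in the stdin example below), the aggregate works out to:

```
final_score = (0.9 × 0.3 + 0.85 × 0.7) / (0.3 + 0.7)
            = (0.27 + 0.595) / 1.0
            = 0.865
```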
-
- ### 2. Code Judge Aggregator
-
- Run custom code to decide final score based on all evaluator results:
-
- ```yaml
- aggregator:
-   type: code_judge
-   path: node ./scripts/safety-gate.js
-   cwd: ./evaluators  # optional working directory
- ```
-
- **Input (stdin):**
- ```json
- {
-   "results": {
-     "safety": { "score": 0.9, "hits": [...], "misses": [...] },
-     "quality": { "score": 0.85, "hits": [...], "misses": [...] }
-   }
- }
- ```
-
- **Output (stdout):**
- ```json
- {
-   "score": 0.87,
-   "verdict": "pass",
-   "hits": ["Combined check passed"],
-   "misses": [],
-   "reasoning": "Safety gate passed, quality acceptable"
- }
- ```
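To make the wire format concrete, here is a minimal sketch of a gating aggregator in the spirit of the `safety-gate.js` referenced above. The stdin/stdout shapes follow the examples in this section; the 0.8 threshold and the gating rule itself are illustrative assumptions, not the package's shipped logic:

```typescript
// Safety-gate sketch: zero the score when safety fails, otherwise pass quality through.
interface EvaluatorResult {
  score: number;
  hits?: string[];
  misses?: string[];
}

let raw = '';
process.stdin.on('data', (chunk) => (raw += chunk));
process.stdin.on('end', () => {
  const { results } = JSON.parse(raw) as { results: Record<string, EvaluatorResult> };
  const safety = results.safety?.score ?? 0;
  const quality = results.quality?.score ?? 0;

  const gated = safety < 0.8; // assumed threshold, not from the docs
  const score = gated ? 0 : quality;

  console.log(
    JSON.stringify({
      score,
      verdict: gated ? 'fail' : 'pass',
      hits: gated ? [] : ['Safety gate passed'],
      misses: gated ? ['Safety score below 0.8 threshold'] : [],
      reasoning: gated
        ? 'Blocked by safety gate regardless of quality'
        : 'Safety gate passed; final score follows quality',
    }),
  );
});
```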
-
- ### 3. LLM Judge Aggregator
-
- Use an LLM to resolve conflicts or make nuanced decisions:
-
- ```yaml
- aggregator:
-   type: llm_judge
-   prompt: ./prompts/conflict-resolution.md
- ```
-
- The `{{EVALUATOR_RESULTS_JSON}}` variable is replaced with the JSON results from all child evaluators.
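A `conflict-resolution.md` prompt might then look like the sketch below. Only the `{{EVALUATOR_RESULTS_JSON}}` substitution is documented here; the instructions and the requested response shape are illustrative assumptions:

```markdown
You are reconciling verdicts from several evaluators.

Evaluator results:
{{EVALUATOR_RESULTS_JSON}}

When evaluators disagree, prefer the stricter verdict and explain why.
Respond with a JSON object containing "score", "verdict", "hits", "misses", and "reasoning".
```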
-
- ## Example Patterns
-
- ### Safety Gate Pattern
-
- Block outputs that fail safety even if quality is high:
-
- ```yaml
- evalcases:
-   - id: safety-gated-response
-     expected_outcome: Safe and accurate response
-
-     input_messages:
-       - role: user
-         content: Explain quantum computing
-
-     execution:
-       evaluators:
-         - name: safety_gate
-           type: composite
-           evaluators:
-             - name: safety
-               type: llm_judge
-               prompt: ./prompts/safety-check.md
-             - name: quality
-               type: llm_judge
-               prompt: ./prompts/quality-check.md
-           aggregator:
-             type: code_judge
-             path: ./scripts/safety-gate.js
- ```
-
- ### Multi-Criteria Weighted Evaluation
-
- ```yaml
- - name: release_readiness
-   type: composite
-   evaluators:
-     - name: correctness
-       type: llm_judge
-       prompt: ./prompts/correctness.md
-     - name: style
-       type: code_judge
-       script: uv run style_checker.py
-     - name: security
-       type: llm_judge
-       prompt: ./prompts/security.md
-   aggregator:
-     type: weighted_average
-     weights:
-       correctness: 0.5
-       style: 0.2
-       security: 0.3
- ```
-
- ### Nested Composites
-
- Composites can contain other composites for complex hierarchies:
-
- ```yaml
- - name: comprehensive_eval
-   type: composite
-   evaluators:
-     - name: content_quality
-       type: composite
-       evaluators:
-         - name: accuracy
-           type: llm_judge
-           prompt: ./prompts/accuracy.md
-         - name: clarity
-           type: llm_judge
-           prompt: ./prompts/clarity.md
-       aggregator:
-         type: weighted_average
-         weights:
-           accuracy: 0.6
-           clarity: 0.4
-     - name: safety
-       type: llm_judge
-       prompt: ./prompts/safety.md
-   aggregator:
-     type: weighted_average
-     weights:
-       content_quality: 0.7
-       safety: 0.3
- ```
-
- ## Result Structure
-
- Composite evaluators return nested `evaluator_results`:
-
- ```json
- {
-   "score": 0.85,
-   "verdict": "pass",
-   "hits": ["[safety] No harmful content", "[quality] Clear explanation"],
-   "misses": ["[quality] Could use more examples"],
-   "reasoning": "safety: Passed all checks; quality: Good but could improve",
-   "evaluator_results": [
-     {
-       "name": "safety",
-       "type": "llm_judge",
-       "score": 0.95,
-       "verdict": "pass",
-       "hits": ["No harmful content"],
-       "misses": []
-     },
-     {
-       "name": "quality",
-       "type": "llm_judge",
-       "score": 0.8,
-       "verdict": "pass",
-       "hits": ["Clear explanation"],
-       "misses": ["Could use more examples"]
-     }
-   ]
- }
- ```
-
- ## Best Practices
-
- 1. **Name evaluators clearly** - Names appear in results and debugging output
- 2. **Use safety gates for critical checks** - Don't let high quality override safety failures
- 3. **Balance weights thoughtfully** - Consider which aspects matter most for your use case
- 4. **Keep nesting shallow** - Deep nesting makes debugging harder
- 5. **Test aggregators independently** - Verify your custom aggregation logic with unit tests
--- a/package/dist/templates/.claude/skills/agentv-eval-builder/references/config-schema.json
+++ /dev/null
@@ -1,27 +0,0 @@
- {
-   "$schema": "http://json-schema.org/draft-07/schema#",
-   "title": "AgentV Config Schema",
-   "description": "Schema for .agentv/config.yaml configuration files",
-   "type": "object",
-   "properties": {
-     "$schema": {
-       "type": "string",
-       "description": "Schema identifier",
-       "enum": ["agentv-config-v2"]
-     },
-     "guideline_patterns": {
-       "type": "array",
-       "description": "Glob patterns for identifying guideline files (instructions, prompts). Files matching these patterns are treated as guidelines, while non-matching files are treated as regular file content.",
-       "items": {
-         "type": "string",
-         "description": "Glob pattern (e.g., '**/*.instructions.md', '**/prompts/**')"
-       },
-       "examples": [
-         ["**/*.instructions.md", "**/instructions/**", "**/*.prompt.md", "**/prompts/**"],
-         ["**/*.guide.md", "**/guidelines/**", "docs/AGENTS.md"]
-       ]
-     }
-   },
-   "required": ["$schema"],
-   "additionalProperties": false
- }
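Read back as YAML, a config that validates against this schema would look roughly like the sketch below (the pattern values come from the schema's own `examples`):

```yaml
# .agentv/config.yaml - sketch derived from the schema above
$schema: agentv-config-v2
guideline_patterns:
  - "**/*.instructions.md"
  - "**/instructions/**"
  - "**/*.prompt.md"
  - "**/prompts/**"
```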
--- a/package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md
+++ /dev/null
@@ -1,118 +0,0 @@
- # Custom Evaluators
-
- ## Wire Format
-
- ### Input (stdin JSON)
-
- ```json
- {
-   "question": "string",
-   "criteria": "string",
-   "reference_answer": "string",
-   "candidate_answer": "string",
-   "guideline_files": ["path"],
-   "input_files": ["path"],
-   "input_messages": [{"role": "user", "content": "..."}],
-   "expected_messages": [{"role": "assistant", "content": "..."}],
-   "output_messages": [{"role": "assistant", "content": "..."}],
-   "trace_summary": {
-     "event_count": 5,
-     "tool_names": ["fetch"],
-     "tool_calls_by_name": {"fetch": 1},
-     "error_count": 0,
-     "llm_call_count": 2,
-     "token_usage": {"input": 1000, "output": 500},
-     "cost_usd": 0.0015,
-     "duration_ms": 3500,
-     "start_time": "2026-02-13T10:00:00.000Z",
-     "end_time": "2026-02-13T10:00:03.500Z"
-   }
- }
- ```
-
- ### Output (stdout JSON)
-
- ```json
- {
-   "score": 0.85,
-   "hits": ["passed check"],
-   "misses": ["failed check"],
-   "reasoning": "explanation"
- }
- ```
-
- `score` (0.0-1.0) is required; `hits`, `misses`, and `reasoning` are optional.
-
- ## SDK Functions
-
- ```typescript
- import { defineCodeJudge, createTargetClient, definePromptTemplate } from '@agentv/eval';
- ```
-
- - `defineCodeJudge(fn)` - Wraps an evaluation function with stdin/stdout handling
- - `createTargetClient()` - Returns an LLM proxy client (when `target: {}` is configured)
-   - `.invoke({question, systemPrompt})` - Single LLM call
-   - `.invokeBatch(requests)` - Batch LLM calls
- - `definePromptTemplate(fn)` - Wraps a prompt generation function (see the sketch below)
-   - Context fields: `question`, `candidateAnswer`, `referenceAnswer`, `criteria`, `expectedMessages`, `outputMessages`, `config`, `traceSummary`
-
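To illustrate `definePromptTemplate`, here is a minimal sketch. It assumes the wrapped function receives the context fields listed above and returns the prompt text; the exact signature is not spelled out in this reference:

```typescript
import { definePromptTemplate } from '@agentv/eval';

// Sketch only: a string-returning template over the documented context fields.
export default definePromptTemplate(({ question, criteria, referenceAnswer, candidateAnswer }) =>
  [
    `Question: ${question}`,
    `Criteria: ${criteria}`,
    `Reference answer: ${referenceAnswer}`,
    `Candidate answer: ${candidateAnswer}`,
    'Score the candidate from 0.0 to 1.0 against the criteria and reference.',
  ].join('\n'),
);
```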
- ## Python Example
-
- ```python
- #!/usr/bin/env python3
- import json, sys
-
- def evaluate(data: dict) -> dict:
-     candidate = data.get("candidate_answer", "")
-     hits, misses = [], []
-     for kw in ["async", "await"]:
-         (hits if kw in candidate else misses).append(f"Keyword '{kw}'")
-     return {
-         "score": len(hits) / max(len(hits) + len(misses), 1),
-         "hits": hits, "misses": misses
-     }
-
- if __name__ == "__main__":
-     try:
-         print(json.dumps(evaluate(json.loads(sys.stdin.read()))))
-     except Exception as e:
-         print(json.dumps({"score": 0, "misses": [str(e)]}))
-         sys.exit(1)
- ```
-
- ## TypeScript Example
-
- ```typescript
- #!/usr/bin/env bun
- import { defineCodeJudge } from '@agentv/eval';
-
- export default defineCodeJudge(({ candidateAnswer, criteria }) => {
-   const hits: string[] = [];
-   const misses: string[] = [];
-   if (candidateAnswer.includes(criteria)) {
-     hits.push('Matches expected outcome');
-   } else {
-     misses.push('Does not match expected outcome');
-   }
-   return {
-     score: hits.length / Math.max(hits.length + misses.length, 1),
-     hits, misses,
-   };
- });
- ```
-
- ## Template Variables
-
- Derived from eval case fields (users never author these directly):
-
- | Variable | Source |
- |----------|--------|
- | `question` | First user message in `input_messages` |
- | `criteria` | Eval case `criteria` field |
- | `reference_answer` | Last entry in `expected_messages` |
- | `candidate_answer` | Last entry in `output_messages` (runtime) |
- | `input_messages` | Full resolved input array (JSON) |
- | `expected_messages` | Full resolved expected array (JSON) |
- | `output_messages` | Full provider output array (JSON) |
-
- Markdown templates use `{{variable}}` syntax. TypeScript templates receive a context object.
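For instance, a markdown judge template built from the variables in this table might look like the sketch below (the requested output shape mirrors the stdout wire format above and is otherwise an assumption):

```markdown
Evaluate the candidate answer against the reference.

Question: {{question}}
Criteria: {{criteria}}
Reference answer: {{reference_answer}}
Candidate answer: {{candidate_answer}}

Reply with JSON: {"score": 0.0-1.0, "hits": [...], "misses": [...], "reasoning": "..."}
```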
--- a/package/dist/templates/.claude/skills/agentv-eval-builder/references/eval-schema.json
+++ /dev/null
@@ -1,278 +0,0 @@
- {
-   "$schema": "http://json-schema.org/draft-07/schema#",
-   "title": "AgentV Eval Schema",
-   "description": "Schema for YAML evaluation files with conversation flows, multiple evaluators, and execution configuration",
-   "type": "object",
-   "properties": {
-     "description": {
-       "type": "string",
-       "description": "Description of what this eval suite covers"
-     },
-     "target": {
-       "type": "string",
-       "description": "(Deprecated: use execution.target instead) Default target configuration name. Can be overridden per eval case."
-     },
-     "execution": {
-       "type": "object",
-       "description": "Default execution configuration for all eval cases (can be overridden per case)",
-       "properties": {
-         "target": {
-           "type": "string",
-           "description": "Default target configuration name (e.g., default, azure_base, vscode_projectx). Can be overridden per eval case."
-         },
-         "evaluators": {
-           "type": "array",
-           "description": "Default evaluators for all eval cases (code-based and LLM judges)",
-           "items": {
-             "type": "object",
-             "properties": {
-               "name": {
-                 "type": "string",
-                 "description": "Evaluator name/identifier"
-               },
-               "type": {
-                 "type": "string",
-                 "enum": [
-                   "code",
-                   "llm_judge",
-                   "composite",
-                   "tool_trajectory",
-                   "field_accuracy",
-                   "latency",
-                   "cost",
-                   "token_usage"
-                 ],
-                 "description": "Evaluator type: 'code' for scripts/regex/keywords, 'llm_judge' for LLM-based evaluation"
-               },
-               "script": {
-                 "type": "string",
-                 "description": "Path to evaluator script (for type: code)"
-               },
-               "prompt": {
-                 "type": "string",
-                 "description": "Path to judge prompt file (for type: llm_judge)"
-               }
-             },
-             "required": ["name", "type"],
-             "additionalProperties": true
-           }
-         }
-       },
-       "additionalProperties": true
-     },
-     "cases": {
-       "type": "array",
-       "description": "Array of evaluation cases",
-       "minItems": 1,
-       "items": {
-         "type": "object",
-         "properties": {
-           "id": {
-             "type": "string",
-             "description": "Unique identifier for the eval case"
-           },
-           "conversation_id": {
-             "type": "string",
-             "description": "Optional conversation identifier for threading multiple eval cases together"
-           },
-           "criteria": {
-             "type": "string",
-             "description": "Description of what the AI should accomplish in this eval"
-           },
-           "note": {
-             "type": "string",
-             "description": "Optional note or additional context for the eval case. Use this to document test-specific considerations, known limitations, or rationale for expected behavior."
-           },
-           "input_messages": {
-             "type": "array",
-             "description": "Input messages for the conversation",
-             "minItems": 1,
-             "items": {
-               "type": "object",
-               "properties": {
-                 "role": {
-                   "type": "string",
-                   "enum": ["system", "user", "assistant", "tool"],
-                   "description": "Message role"
-                 },
-                 "content": {
-                   "oneOf": [
-                     {
-                       "type": "string",
-                       "description": "Simple text content"
-                     },
-                     {
-                       "type": "array",
-                       "description": "Mixed content items (text and file references)",
-                       "items": {
-                         "type": "object",
-                         "properties": {
-                           "type": {
-                             "type": "string",
-                             "enum": ["text", "file"],
-                             "description": "Content type: 'text' for inline content, 'file' for file references"
-                           },
-                           "value": {
-                             "type": "string",
-                             "description": "Text content or file path. Relative paths (e.g., ../prompts/file.md) are resolved from eval file directory. Absolute paths (e.g., /docs/examples/prompts/file.md) are resolved from repo root."
-                           }
-                         },
-                         "required": ["type", "value"],
-                         "additionalProperties": false
-                       }
-                     }
-                   ]
-                 }
-               },
-               "required": ["role", "content"],
-               "additionalProperties": false
-             }
-           },
-           "input": {
-             "description": "Alias for input_messages with shorthand support. String expands to single user message, array of messages passes through.",
-             "oneOf": [
-               {
-                 "type": "string",
-                 "description": "Shorthand: single user message content"
-               },
-               {
-                 "type": "array",
-                 "description": "Array of messages (same format as input_messages)",
-                 "items": {
-                   "type": "object",
-                   "properties": {
-                     "role": {
-                       "type": "string",
-                       "enum": ["system", "user", "assistant", "tool"]
-                     },
-                     "content": {
-                       "oneOf": [{ "type": "string" }, { "type": "array" }]
-                     }
-                   },
-                   "required": ["role", "content"]
-                 }
-               }
-             ]
-           },
-           "expected_messages": {
-             "type": "array",
-             "description": "Expected response messages. Canonical form — use this or expected_output (alias). The content of the last entry is derived as the template variable 'reference_answer' for evaluator prompts.",
-             "minItems": 1,
-             "items": {
-               "type": "object",
-               "properties": {
-                 "role": {
-                   "type": "string",
-                   "enum": ["system", "user", "assistant", "tool"],
-                   "description": "Message role"
-                 },
-                 "content": {
-                   "oneOf": [
-                     {
-                       "type": "string",
-                       "description": "Simple text content"
-                     },
-                     {
-                       "type": "array",
-                       "description": "Mixed content items",
-                       "items": {
-                         "type": "object",
-                         "properties": {
-                           "type": {
-                             "type": "string",
-                             "enum": ["text", "file"]
-                           },
-                           "value": {
-                             "type": "string"
-                           }
-                         },
-                         "required": ["type", "value"],
-                         "additionalProperties": false
-                       }
-                     }
-                   ]
-                 }
-               },
-               "required": ["role", "content"],
-               "additionalProperties": false
-             }
-           },
-           "expected_output": {
-             "description": "Alias for expected_messages with shorthand support. String expands to single assistant message, object wraps as assistant message content. Resolves to expected_messages internally — the content of the last resolved entry becomes the template variable 'reference_answer'.",
-             "oneOf": [
-               {
-                 "type": "string",
-                 "description": "Shorthand: single assistant message content"
-               },
-               {
-                 "type": "object",
-                 "description": "Shorthand: structured content wraps as assistant message"
-               },
-               {
-                 "type": "array",
-                 "description": "Array of messages (same format as expected_messages)",
-                 "items": {
-                   "type": "object",
-                   "properties": {
-                     "role": {
-                       "type": "string",
-                       "enum": ["system", "user", "assistant", "tool"]
-                     },
-                     "content": {
-                       "oneOf": [{ "type": "string" }, { "type": "object" }, { "type": "array" }]
-                     }
-                   },
-                   "required": ["role", "content"]
-                 }
-               }
-             ]
-           },
-           "execution": {
-             "type": "object",
-             "description": "Per-case execution configuration",
-             "properties": {
-               "target": {
-                 "type": "string",
-                 "description": "Override target for this specific eval case"
-               },
-               "evaluators": {
-                 "type": "array",
-                 "description": "Multiple evaluators (code-based and LLM judges)",
-                 "items": {
-                   "type": "object",
-                   "properties": {
-                     "name": {
-                       "type": "string",
-                       "description": "Evaluator name/identifier"
-                     },
-                     "type": {
-                       "type": "string",
-                       "enum": ["code", "llm_judge"],
-                       "description": "Evaluator type: 'code' for scripts/regex/keywords, 'llm_judge' for LLM-based evaluation"
-                     },
-                     "script": {
-                       "type": "string",
-                       "description": "Path to evaluator script (for type: code)"
-                     },
-                     "prompt": {
-                       "type": "string",
-                       "description": "Path to judge prompt file (for type: llm_judge)"
-                     }
-                   },
-                   "required": ["name", "type"],
-                   "additionalProperties": true
-                 }
-               }
-             },
-             "additionalProperties": true
-           }
-         },
-         "required": ["id", "criteria"],
-         "anyOf": [{ "required": ["input_messages"] }, { "required": ["input"] }],
-         "additionalProperties": true
-       }
-     }
-   },
-   "required": ["cases"],
-   "additionalProperties": false
- }
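Translated into YAML, a minimal eval file that satisfies this schema (using the `input` and `expected_output` shorthands) might look like this sketch; the target name, paths, and case content are placeholders:

```yaml
# evals/example.eval.yaml - illustrative sketch of the schema above
description: Smoke test for the summarization prompt
execution:
  target: default
  evaluators:
    - name: quality
      type: llm_judge
      prompt: ./prompts/quality-check.md
cases:
  - id: summarize-basic
    criteria: Produces a concise two-sentence summary of the input text
    input: Summarize the benefits of code review in two sentences.
    expected_output: Code review catches defects early and spreads knowledge across the team.
```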