agentv 2.0.1 → 2.1.0

package/dist/cli.js CHANGED
@@ -1,11 +1,13 @@
  #!/usr/bin/env node
  import {
    runCli
- } from "./chunk-6SHT2QS6.js";
+ } from "./chunk-5BLNVACB.js";
  import "./chunk-UE4GLFVL.js";

  // src/cli.ts
- runCli().catch((error) => {
+ runCli().then(() => {
+   process.exit(0);
+ }).catch((error) => {
    console.error(error);
    process.exit(1);
  });
package/dist/cli.js.map CHANGED
@@ -1 +1 @@
- {"version":3,"sources":["../src/cli.ts"],"sourcesContent":["#!/usr/bin/env node\nimport { runCli } from './index.js';\n\nrunCli().catch((error) => {\n console.error(error);\n process.exit(1);\n});\n"],"mappings":";;;;;;;AAGA,OAAO,EAAE,MAAM,CAAC,UAAU;AACxB,UAAQ,MAAM,KAAK;AACnB,UAAQ,KAAK,CAAC;AAChB,CAAC;","names":[]}
+ {"version":3,"sources":["../src/cli.ts"],"sourcesContent":["#!/usr/bin/env node\nimport { runCli } from './index.js';\n\nrunCli()\n .then(() => {\n process.exit(0);\n })\n .catch((error) => {\n console.error(error);\n process.exit(1);\n });\n"],"mappings":";;;;;;;AAGA,OAAO,EACJ,KAAK,MAAM;AACV,UAAQ,KAAK,CAAC;AAChB,CAAC,EACA,MAAM,CAAC,UAAU;AAChB,UAAQ,MAAM,KAAK;AACnB,UAAQ,KAAK,CAAC;AAChB,CAAC;","names":[]}
package/dist/index.js CHANGED
@@ -1,7 +1,7 @@
  import {
    app,
    runCli
- } from "./chunk-6SHT2QS6.js";
+ } from "./chunk-5BLNVACB.js";
  import "./chunk-UE4GLFVL.js";
  export {
    app,
@@ -47,10 +47,32 @@ execution:
  ```

  **Contract:**
- - Input (stdin): JSON with `question`, `expected_outcome`, `reference_answer`, `candidate_answer`, `guideline_files` (file paths), `input_files` (file paths, excludes guidelines), `input_messages`
+ - Input (stdin): JSON with `question`, `expected_outcome`, `reference_answer`, `candidate_answer`, `guideline_files`, `input_files`, `input_messages`, `expected_messages`, `output_messages`, `trace_summary`
  - Output (stdout): JSON with `score` (0.0-1.0), `hits`, `misses`, `reasoning`

- **TypeScript evaluators:** Keep `.ts` source files and run them via Node-compatible loaders such as `npx --yes tsx` so global `agentv` installs stay portable. See `references/custom-evaluators.md` for complete templates and command examples.
+ **Target Proxy:** Code evaluators can access an LLM through the target proxy for sophisticated evaluation logic (e.g., Contextual Precision, semantic similarity). Enable with `target: {}`:
+
+ ```yaml
+ execution:
+   evaluators:
+     - name: contextual_precision
+       type: code_judge
+       script: bun run evaluate.ts
+       target: {} # Enable target proxy (max_calls: 50 default)
+ ```
+
+ **RAG Evaluation Pattern:** For retrieval-based evals, pass retrieval context via `expected_messages.tool_calls`:
+
+ ```yaml
+ expected_messages:
+   - role: assistant
+     tool_calls:
+       - tool: vector_search
+         output:
+           results: ["doc1", "doc2", "doc3"]
+ ```
+
+ **TypeScript evaluators:** Keep `.ts` source files and run them via Node-compatible loaders such as `npx --yes tsx` so global `agentv` installs stay portable. See `references/custom-evaluators.md` for complete templates, target proxy usage, and command examples.

  **Template:** See `references/custom-evaluators.md` for Python and TypeScript templates
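
The stdin/stdout contract above can be sketched as a small Python code judge. This is a minimal sketch, not the template from `references/custom-evaluators.md`: the keyword-overlap heuristic and the helper names `evaluate` and `run` are assumptions made for illustration. Only the input field names and the four output keys come from the contract.

```python
import io
import json
from typing import Any, Dict, TextIO


def evaluate(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Toy heuristic (assumption): fraction of reference words found in the candidate."""
    candidate = (payload.get("candidate_answer") or "").lower()
    reference = (payload.get("reference_answer") or "").lower()
    keywords = sorted(set(reference.split()))
    # Substring match keeps the sketch short; a real judge would tokenize properly.
    hits = [f"mentions '{word}'" for word in keywords if word in candidate]
    misses = [f"missing '{word}'" for word in keywords if word not in candidate]
    score = len(hits) / len(keywords) if keywords else 0.0
    return {
        "score": score,  # stays within 0.0-1.0 as the contract requires
        "hits": hits,
        "misses": misses,
        "reasoning": f"{len(hits)} of {len(keywords)} reference keywords found",
    }


def run(stdin: TextIO, stdout: TextIO) -> None:
    """Read one JSON payload from stdin and write the JSON result to stdout."""
    json.dump(evaluate(json.load(stdin)), stdout)


# Demonstration with in-memory streams instead of the real stdin/stdout.
demo_in = io.StringIO(json.dumps({
    "question": "What is the capital of France?",
    "reference_answer": "Paris",
    "candidate_answer": "The capital of France is Paris.",
}))
demo_out = io.StringIO()
run(demo_in, demo_out)
demo_result = json.loads(demo_out.getvalue())
```

Used as a real evaluator script, the file would end with `run(sys.stdin, sys.stdout)` and be referenced from `script:`, for example `script: uv run evaluate.py` (a hypothetical file name).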
 
@@ -17,18 +17,6 @@ Batch CLI evaluation is used when:
  3. **AgentV** parses JSONL, routes each record to its matching evalcase by `id`
  4. **Per-case evaluator** validates the output for each evalcase independently

- ## Error Handling
-
- ### Missing IDs / Per-item Failures
-
- In batch mode, a runner can succeed overall while still failing to produce an output record for a specific evalcase `id`.
-
- - When an output record is missing, AgentV treats this as a per-item error.
- - The evaluation result will include an `error` field describing the issue.
- - The CLI progress display will show that evalcase as failed (❌) while still continuing to evaluate other cases.
-
- This behavior is intentional so one bad/missing record does not discard the entire batch.
-
  ## Eval File Structure

  ```yaml
@@ -1,215 +1,215 @@
- # Composite Evaluator Guide
-
- Composite evaluators combine multiple evaluators and aggregate their results. This enables sophisticated evaluation patterns like safety gates, weighted scoring, and conflict resolution.
-
- ## Basic Structure
-
- ```yaml
- execution:
-   evaluators:
-     - name: my_composite
-       type: composite
-       evaluators:
-         - name: evaluator_1
-           type: llm_judge
-           prompt: ./prompts/check1.md
-         - name: evaluator_2
-           type: code_judge
-           script: uv run check2.py
-       aggregator:
-         type: weighted_average
-         weights:
-           evaluator_1: 0.6
-           evaluator_2: 0.4
- ```
-
- ## Aggregator Types
-
- ### 1. Weighted Average (Default)
-
- Combines scores using weighted arithmetic mean:
-
- ```yaml
- aggregator:
-   type: weighted_average
-   weights:
-     safety: 0.3 # 30% weight
-     quality: 0.7 # 70% weight
- ```
-
- If weights are omitted, all evaluators have equal weight (1.0).
-
- **Score calculation:**
- ```
- final_score = Σ(score_i × weight_i) / Σ(weight_i)
- ```
-
- ### 2. Code Judge Aggregator
-
- Run custom code to decide final score based on all evaluator results:
-
- ```yaml
- aggregator:
-   type: code_judge
-   path: node ./scripts/safety-gate.js
-   cwd: ./evaluators # optional working directory
- ```
-
- **Input (stdin):**
- ```json
- {
-   "results": {
-     "safety": { "score": 0.9, "hits": [...], "misses": [...] },
-     "quality": { "score": 0.85, "hits": [...], "misses": [...] }
-   }
- }
- ```
-
- **Output (stdout):**
- ```json
- {
-   "score": 0.87,
-   "verdict": "pass",
-   "hits": ["Combined check passed"],
-   "misses": [],
-   "reasoning": "Safety gate passed, quality acceptable"
- }
- ```
-
- ### 3. LLM Judge Aggregator
-
- Use an LLM to resolve conflicts or make nuanced decisions:
-
- ```yaml
- aggregator:
-   type: llm_judge
-   prompt: ./prompts/conflict-resolution.md
- ```
-
- The `{{EVALUATOR_RESULTS_JSON}}` variable is replaced with the JSON results from all child evaluators.
-
- ## Example Patterns
-
- ### Safety Gate Pattern
-
- Block outputs that fail safety even if quality is high:
-
- ```yaml
- evalcases:
-   - id: safety-gated-response
-     expected_outcome: Safe and accurate response
-
-     input_messages:
-       - role: user
-         content: Explain quantum computing
-
-     execution:
-       evaluators:
-         - name: safety_gate
-           type: composite
-           evaluators:
-             - name: safety
-               type: llm_judge
-               prompt: ./prompts/safety-check.md
-             - name: quality
-               type: llm_judge
-               prompt: ./prompts/quality-check.md
-           aggregator:
-             type: code_judge
-             path: ./scripts/safety-gate.js
- ```
-
- ### Multi-Criteria Weighted Evaluation
-
- ```yaml
- - name: release_readiness
-   type: composite
-   evaluators:
-     - name: correctness
-       type: llm_judge
-       prompt: ./prompts/correctness.md
-     - name: style
-       type: code_judge
-       script: uv run style_checker.py
-     - name: security
-       type: llm_judge
-       prompt: ./prompts/security.md
-   aggregator:
-     type: weighted_average
-     weights:
-       correctness: 0.5
-       style: 0.2
-       security: 0.3
- ```
-
- ### Nested Composites
-
- Composites can contain other composites for complex hierarchies:
-
- ```yaml
- - name: comprehensive_eval
-   type: composite
-   evaluators:
-     - name: content_quality
-       type: composite
-       evaluators:
-         - name: accuracy
-           type: llm_judge
-           prompt: ./prompts/accuracy.md
-         - name: clarity
-           type: llm_judge
-           prompt: ./prompts/clarity.md
-       aggregator:
-         type: weighted_average
-         weights:
-           accuracy: 0.6
-           clarity: 0.4
-     - name: safety
-       type: llm_judge
-       prompt: ./prompts/safety.md
-   aggregator:
-     type: weighted_average
-     weights:
-       content_quality: 0.7
-       safety: 0.3
- ```
-
- ## Result Structure
-
- Composite evaluators return nested `evaluator_results`:
-
- ```json
- {
-   "score": 0.85,
-   "verdict": "pass",
-   "hits": ["[safety] No harmful content", "[quality] Clear explanation"],
-   "misses": ["[quality] Could use more examples"],
-   "reasoning": "safety: Passed all checks; quality: Good but could improve",
-   "evaluator_results": [
-     {
-       "name": "safety",
-       "type": "llm_judge",
-       "score": 0.95,
-       "verdict": "pass",
-       "hits": ["No harmful content"],
-       "misses": []
-     },
-     {
-       "name": "quality",
-       "type": "llm_judge",
-       "score": 0.8,
-       "verdict": "pass",
-       "hits": ["Clear explanation"],
-       "misses": ["Could use more examples"]
-     }
-   ]
- }
- ```
-
- ## Best Practices
-
- 1. **Name evaluators clearly** - Names appear in results and debugging output
- 2. **Use safety gates for critical checks** - Don't let high quality override safety failures
- 3. **Balance weights thoughtfully** - Consider which aspects matter most for your use case
- 4. **Keep nesting shallow** - Deep nesting makes debugging harder
- 5. **Test aggregators independently** - Verify your custom aggregation logic with unit tests
+ # Composite Evaluator Guide
+
+ Composite evaluators combine multiple evaluators and aggregate their results. This enables sophisticated evaluation patterns like safety gates, weighted scoring, and conflict resolution.
+
+ ## Basic Structure
+
+ ```yaml
+ execution:
+   evaluators:
+     - name: my_composite
+       type: composite
+       evaluators:
+         - name: evaluator_1
+           type: llm_judge
+           prompt: ./prompts/check1.md
+         - name: evaluator_2
+           type: code_judge
+           script: uv run check2.py
+       aggregator:
+         type: weighted_average
+         weights:
+           evaluator_1: 0.6
+           evaluator_2: 0.4
+ ```
+
+ ## Aggregator Types
+
+ ### 1. Weighted Average (Default)
+
+ Combines scores using weighted arithmetic mean:
+
+ ```yaml
+ aggregator:
+   type: weighted_average
+   weights:
+     safety: 0.3 # 30% weight
+     quality: 0.7 # 70% weight
+ ```
+
+ If weights are omitted, all evaluators have equal weight (1.0).
+
+ **Score calculation:**
+ ```
+ final_score = Σ(score_i × weight_i) / Σ(weight_i)
+ ```
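
The score calculation above can be written out in a few lines of Python. This is a sketch of the formula only, not AgentV's implementation. The function name is hypothetical, and two behaviors are assumptions: an evaluator missing from `weights` defaults to 1.0, and a zero total weight yields 0.0.

```python
from typing import Dict, Optional


def weighted_average(scores: Dict[str, float],
                     weights: Optional[Dict[str, float]] = None) -> float:
    """final_score = sum(score_i * weight_i) / sum(weight_i)."""
    weights = weights or {}
    weighted_sum = 0.0
    total_weight = 0.0
    for name, score in scores.items():
        # Assumption: an evaluator without an explicit weight counts as 1.0.
        weight = weights.get(name, 1.0)
        weighted_sum += score * weight
        total_weight += weight
    # Assumption: avoid dividing by zero when nothing carries weight.
    return weighted_sum / total_weight if total_weight else 0.0


# Weights from the example above; the scores are illustrative.
gated = weighted_average({"safety": 0.9, "quality": 0.85},
                         {"safety": 0.3, "quality": 0.7})
# With no weights, every evaluator counts equally.
equal = weighted_average({"safety": 0.9, "quality": 0.85})
```

With the 0.3/0.7 weights the result is 0.9 × 0.3 + 0.85 × 0.7 = 0.865; with equal weights it is the plain mean, 0.875.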
+
+ ### 2. Code Judge Aggregator
+
+ Run custom code to decide final score based on all evaluator results:
+
+ ```yaml
+ aggregator:
+   type: code_judge
+   path: node ./scripts/safety-gate.js
+   cwd: ./evaluators # optional working directory
+ ```
+
+ **Input (stdin):**
+ ```json
+ {
+   "results": {
+     "safety": { "score": 0.9, "hits": [...], "misses": [...] },
+     "quality": { "score": 0.85, "hits": [...], "misses": [...] }
+   }
+ }
+ ```
+
+ **Output (stdout):**
+ ```json
+ {
+   "score": 0.87,
+   "verdict": "pass",
+   "hits": ["Combined check passed"],
+   "misses": [],
+   "reasoning": "Safety gate passed, quality acceptable"
+ }
+ ```
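
The contents of `safety-gate.js` are not shown in the guide, so the following is only a Python stand-in for what such an aggregator might do with the stdin shape above. The 0.8 threshold, the `"fail"` verdict string, and the choice to pass the quality score through unchanged are all assumptions.

```python
import json
from typing import Any, Dict

SAFETY_THRESHOLD = 0.8  # hypothetical cut-off, not taken from the guide


def aggregate(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Zero the final score when safety is below the threshold."""
    results = payload["results"]
    safety = results["safety"]["score"]
    quality = results["quality"]["score"]
    if safety < SAFETY_THRESHOLD:
        return {
            "score": 0.0,
            "verdict": "fail",  # assumed counterpart of "pass"
            "hits": [],
            "misses": [f"Safety score {safety} is below {SAFETY_THRESHOLD}"],
            "reasoning": "Safety gate failed, quality ignored",
        }
    return {
        "score": quality,
        "verdict": "pass",
        "hits": ["Safety gate passed"],
        "misses": [],
        "reasoning": "Safety gate passed, final score taken from quality",
    }


# Same shape as the stdin example, with the elided lists left empty.
example_input = {
    "results": {
        "safety": {"score": 0.9, "hits": [], "misses": []},
        "quality": {"score": 0.85, "hits": [], "misses": []},
    }
}
example_output = aggregate(example_input)
print(json.dumps(example_output))
```

The point of a gate, as opposed to a weighted average, is that a low safety score cannot be compensated by a high quality score.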
+
+ ### 3. LLM Judge Aggregator
+
+ Use an LLM to resolve conflicts or make nuanced decisions:
+
+ ```yaml
+ aggregator:
+   type: llm_judge
+   prompt: ./prompts/conflict-resolution.md
+ ```
+
+ The `{{EVALUATOR_RESULTS_JSON}}` variable is replaced with the JSON results from all child evaluators.
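
To make the substitution concrete, here is a rough Python illustration of how a prompt template with that placeholder could be rendered. It is not AgentV's code: the `render_prompt` helper, the sample template text, and the pretty-printed JSON formatting are assumptions.

```python
import json
from typing import Any, Dict

PLACEHOLDER = "{{EVALUATOR_RESULTS_JSON}}"


def render_prompt(template: str, results: Dict[str, Any]) -> str:
    """Replace the placeholder with the child results serialized as JSON."""
    # Assumption: indented JSON; the real serialization format is not documented here.
    return template.replace(PLACEHOLDER, json.dumps(results, indent=2))


# Hypothetical contents of a conflict-resolution prompt file.
template = (
    "The evaluators below disagree. Decide on a final score.\n\n"
    "{{EVALUATOR_RESULTS_JSON}}\n"
)
prompt = render_prompt(template, {
    "safety": {"score": 0.9},
    "quality": {"score": 0.4},
})
```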
+
+ ## Example Patterns
+
+ ### Safety Gate Pattern
+
+ Block outputs that fail safety even if quality is high:
+
+ ```yaml
+ evalcases:
+   - id: safety-gated-response
+     expected_outcome: Safe and accurate response
+
+     input_messages:
+       - role: user
+         content: Explain quantum computing
+
+     execution:
+       evaluators:
+         - name: safety_gate
+           type: composite
+           evaluators:
+             - name: safety
+               type: llm_judge
+               prompt: ./prompts/safety-check.md
+             - name: quality
+               type: llm_judge
+               prompt: ./prompts/quality-check.md
+           aggregator:
+             type: code_judge
+             path: ./scripts/safety-gate.js
+ ```
+
+ ### Multi-Criteria Weighted Evaluation
+
+ ```yaml
+ - name: release_readiness
+   type: composite
+   evaluators:
+     - name: correctness
+       type: llm_judge
+       prompt: ./prompts/correctness.md
+     - name: style
+       type: code_judge
+       script: uv run style_checker.py
+     - name: security
+       type: llm_judge
+       prompt: ./prompts/security.md
+   aggregator:
+     type: weighted_average
+     weights:
+       correctness: 0.5
+       style: 0.2
+       security: 0.3
+ ```
+
+ ### Nested Composites
+
+ Composites can contain other composites for complex hierarchies:
+
+ ```yaml
+ - name: comprehensive_eval
+   type: composite
+   evaluators:
+     - name: content_quality
+       type: composite
+       evaluators:
+         - name: accuracy
+           type: llm_judge
+           prompt: ./prompts/accuracy.md
+         - name: clarity
+           type: llm_judge
+           prompt: ./prompts/clarity.md
+       aggregator:
+         type: weighted_average
+         weights:
+           accuracy: 0.6
+           clarity: 0.4
+     - name: safety
+       type: llm_judge
+       prompt: ./prompts/safety.md
+   aggregator:
+     type: weighted_average
+     weights:
+       content_quality: 0.7
+       safety: 0.3
+ ```
+
+ ## Result Structure
+
+ Composite evaluators return nested `evaluator_results`:
+
+ ```json
+ {
+   "score": 0.85,
+   "verdict": "pass",
+   "hits": ["[safety] No harmful content", "[quality] Clear explanation"],
+   "misses": ["[quality] Could use more examples"],
+   "reasoning": "safety: Passed all checks; quality: Good but could improve",
+   "evaluator_results": [
+     {
+       "name": "safety",
+       "type": "llm_judge",
+       "score": 0.95,
+       "verdict": "pass",
+       "hits": ["No harmful content"],
+       "misses": []
+     },
+     {
+       "name": "quality",
+       "type": "llm_judge",
+       "score": 0.8,
+       "verdict": "pass",
+       "hits": ["Clear explanation"],
+       "misses": ["Could use more examples"]
+     }
+   ]
+ }
+ ```
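
When debugging a composite, it can help to flatten the nested result into one row per child. The helper below is a hypothetical sketch for reading such a result. It assumes that a nested composite's own entry also carries an `evaluator_results` list, which the example above does not show.

```python
from typing import Any, Dict, List, Tuple


def flatten_results(result: Dict[str, Any], prefix: str = "") -> List[Tuple[str, float]]:
    """Return (dotted evaluator path, score) pairs, depth first."""
    rows: List[Tuple[str, float]] = []
    for child in result.get("evaluator_results", []):
        path = f"{prefix}{child['name']}"
        rows.append((path, child["score"]))
        # Assumption: nested composites expose their children the same way.
        rows.extend(flatten_results(child, prefix=path + "."))
    return rows


# Trimmed copy of the result shown above.
composite_result = {
    "score": 0.85,
    "verdict": "pass",
    "evaluator_results": [
        {"name": "safety", "type": "llm_judge", "score": 0.95},
        {"name": "quality", "type": "llm_judge", "score": 0.8},
    ],
}
rows = flatten_results(composite_result)
for path, score in rows:
    print(f"{path}: {score}")
```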
+
+ ## Best Practices
+
+ 1. **Name evaluators clearly** - Names appear in results and debugging output
+ 2. **Use safety gates for critical checks** - Don't let high quality override safety failures
+ 3. **Balance weights thoughtfully** - Consider which aspects matter most for your use case
+ 4. **Keep nesting shallow** - Deep nesting makes debugging harder
+ 5. **Test aggregators independently** - Verify your custom aggregation logic with unit tests