agentv 2.9.0-next.1 → 2.10.0

This diff shows the content of publicly available package versions as released to one of the supported registries, and is provided for informational purposes only.
Files changed (22)
  1. package/dist/{chunk-H54JIK7G.js → chunk-G3OTPFYX.js} +2 -3
  2. package/dist/chunk-G3OTPFYX.js.map +1 -0
  3. package/dist/cli.js +1 -1
  4. package/dist/index.js +1 -1
  5. package/dist/templates/.agentv/config.yaml +1 -1
  6. package/dist/templates/.agentv/targets.yaml +10 -13
  7. package/package.json +1 -1
  8. package/dist/chunk-H54JIK7G.js.map +0 -1
  9. package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +0 -202
  10. package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md +0 -316
  11. package/dist/templates/.claude/skills/agentv-eval-builder/references/compare-command.md +0 -137
  12. package/dist/templates/.claude/skills/agentv-eval-builder/references/composite-evaluator.md +0 -215
  13. package/dist/templates/.claude/skills/agentv-eval-builder/references/config-schema.json +0 -27
  14. package/dist/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md +0 -118
  15. package/dist/templates/.claude/skills/agentv-eval-builder/references/eval-schema.json +0 -278
  16. package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md +0 -333
  17. package/dist/templates/.claude/skills/agentv-eval-builder/references/rubric-evaluator.md +0 -77
  18. package/dist/templates/.claude/skills/agentv-eval-builder/references/structured-data-evaluators.md +0 -121
  19. package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md +0 -298
  20. package/dist/templates/.claude/skills/agentv-prompt-optimizer/SKILL.md +0 -78
  21. package/dist/templates/.github/prompts/agentv-eval-build.prompt.md +0 -5
  22. package/dist/templates/.github/prompts/agentv-optimize.prompt.md +0 -4

package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md
@@ -1,202 +0,0 @@
---
name: agentv-eval-builder
description: Create and maintain AgentV YAML evaluation files for testing AI agent performance. Use this skill when creating new eval files, adding eval cases, or configuring evaluators.
---

# AgentV Eval Builder

Comprehensive docs: https://agentv.dev

## Quick Start

```yaml
description: Example eval
execution:
  target: default

cases:
  - id: greeting
    criteria: Friendly greeting
    input: "Say hello"
    expected_output: "Hello! How can I help you?"
    rubrics:
      - Greeting is friendly and warm
      - Offers to help
```

## Eval File Structure

**Required:** `cases` (array)
**Optional:** `description`, `execution`, `dataset`

**Eval case fields:**

| Field | Required | Description |
|-------|----------|-------------|
| `id` | yes | Unique identifier |
| `criteria` | yes | What the response should accomplish |
| `input` / `input_messages` | yes | Input to the agent |
| `expected_output` / `expected_messages` | no | Gold-standard reference answer |
| `rubrics` | no | Inline evaluation criteria |
| `execution` | no | Per-case execution overrides |
| `conversation_id` | no | Thread grouping |

**Shorthand aliases** (expansion shown below):
- `input` (string) expands to `[{role: "user", content: "..."}]`
- `expected_output` (string/object) expands to `[{role: "assistant", content: ...}]`
- Canonical `input_messages` / `expected_messages` take precedence when both are present
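
For example, the Quick Start case above is equivalent to this expanded form (a sketch of the documented expansion):

```yaml
cases:
  - id: greeting
    criteria: Friendly greeting
    input_messages:
      - role: user
        content: "Say hello"
    expected_messages:
      - role: assistant
        content: "Hello! How can I help you?"
```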

**Message format:** `{role, content}` where role is `system`, `user`, `assistant`, or `tool`
**Content types:** inline text, `{type: "file", value: "./path.md"}`
**File paths:** relative to the eval file's directory, or absolute with a `/` prefix resolved from the repo root
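
A message can also load its content from a file; the path here is illustrative:

```yaml
input_messages:
  - role: user
    content: { type: "file", value: "./prompts/question.md" }
```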

**JSONL format:** One eval case per line as JSON. Optional `.yaml` sidecar for shared defaults. See `examples/features/basic-jsonl/`.
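
A minimal illustrative JSONL file (two made-up cases, using the fields from the table above):

```jsonl
{"id": "greeting", "criteria": "Friendly greeting", "input": "Say hello"}
{"id": "farewell", "criteria": "Polite goodbye", "input": "Say goodbye"}
```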

## Evaluator Types

Configure via the `execution.evaluators` array. Multiple evaluators produce a weighted average score.

### code_judge
```yaml
- name: format_check
  type: code_judge
  script: uv run validate.py
  cwd: ./scripts # optional working directory
  target: {} # optional: enable LLM target proxy (max_calls: 50)
```
Contract: stdin JSON -> stdout JSON `{score, hits, misses, reasoning}`
Input includes: `question`, `criteria`, `candidate_answer`, `reference_answer`, `output_messages`, `trace_summary`, `file_changes`, `workspace_path`, `config`
When `workspace_template` is configured, `workspace_path` is the absolute path to the workspace dir (also available as the `AGENTV_WORKSPACE_PATH` env var). Use this for functional grading (e.g., running `npm test` in the workspace).
See docs at https://agentv.dev/evaluators/code-judges/
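
A minimal code_judge script sketch in TypeScript against this contract (reads the input JSON from stdin, writes the result JSON to stdout); the pass condition is illustrative, not part of the contract:

```typescript
import fs from 'node:fs';

// Evaluator input arrives as JSON on stdin (fields documented above).
const input = JSON.parse(fs.readFileSync(0, 'utf8')) as {
  candidate_answer?: string;
};

// Illustrative check: require a non-empty candidate answer.
const ok = (input.candidate_answer ?? '').trim().length > 0;

// Result goes to stdout as {score, hits, misses, reasoning}.
process.stdout.write(JSON.stringify({
  score: ok ? 1 : 0,
  hits: ok ? ['candidate answer present'] : [],
  misses: ok ? [] : ['candidate answer empty'],
  reasoning: ok ? 'Answer present.' : 'No answer produced.',
}));
```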

### llm_judge
```yaml
- name: quality
  type: llm_judge
  prompt: ./prompts/eval.md # markdown template or script config
  model: gpt-5-chat # optional model override
  config: # passed to script templates as context.config
    strictness: high
```
Variables: `{{question}}`, `{{criteria}}`, `{{candidate_answer}}`, `{{reference_answer}}`, `{{input_messages}}`, `{{expected_messages}}`, `{{output_messages}}`, `{{file_changes}}`
- Markdown templates: use `{{variable}}` syntax
- TypeScript templates: use `definePromptTemplate(fn)` from `@agentv/eval`; the function receives a context object with all variables + `config` (sketched below)
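
A minimal TypeScript template sketch, assuming the context fields match the variable names above and that the template is the module's default export (an assumption, not confirmed by this doc):

```typescript
import { definePromptTemplate } from '@agentv/eval';

// Assumed context shape: the documented variables plus `config`.
export default definePromptTemplate((context: any) => `
Grade the answer with strictness "${context.config?.strictness ?? 'normal'}".

Question: ${context.question}
Criteria: ${context.criteria}
Candidate answer: ${context.candidate_answer}
Reference answer: ${context.reference_answer ?? 'n/a'}

Respond with a score between 0 and 1 and a short justification.
`);
```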

### composite
```yaml
- name: gate
  type: composite
  evaluators:
    - name: safety
      type: llm_judge
      prompt: ./safety.md
    - name: quality
      type: llm_judge
  aggregator:
    type: weighted_average
    weights: { safety: 0.3, quality: 0.7 }
```
Aggregator types: `weighted_average`, `all_or_nothing` (sketched below), `minimum`, `maximum`, `safety_gate`
- `safety_gate`: fails immediately if the named gate evaluator scores below threshold (default 1.0)
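
For strict gating without weights, the aggregator block can presumably be swapped for one of the other listed types; this sketch uses only the documented `type` key, and the all-pass semantics are inferred from the name:

```yaml
aggregator:
  type: all_or_nothing
```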

### tool_trajectory
```yaml
- name: tool_check
  type: tool_trajectory
  mode: any_order # any_order | in_order | exact
  minimums: # for any_order
    knowledgeSearch: 2
  expected: # for in_order/exact
    - tool: knowledgeSearch
      args: { query: "search term" } # partial deep equality match
    - tool: documentRetrieve
      args: any # any arguments accepted
      max_duration_ms: 5000 # per-tool latency assertion
    - tool: summarize # omit args to skip argument checking
```

### field_accuracy
```yaml
- name: fields
  type: field_accuracy
  match_type: exact # exact | date | numeric_tolerance
  numeric_tolerance: 0.01 # for numeric_tolerance match_type
  aggregation: weighted_average # weighted_average | all_or_nothing
```
Compares `output_messages` fields against `expected_messages` fields.

### latency
```yaml
- name: speed
  type: latency
  max_ms: 5000
```

### cost
```yaml
- name: budget
  type: cost
  max_usd: 0.10
```

### token_usage
```yaml
- name: tokens
  type: token_usage
  max_total_tokens: 4000
```

### execution_metrics
```yaml
- name: efficiency
  type: execution_metrics
  max_tool_calls: 10 # Maximum tool invocations
  max_llm_calls: 5 # Maximum LLM calls (assistant messages)
  max_tokens: 5000 # Maximum total tokens (input + output)
  max_cost_usd: 0.05 # Maximum cost in USD
  max_duration_ms: 30000 # Maximum execution duration
  target_exploration_ratio: 0.6 # Target ratio of read-only tool calls
  exploration_tolerance: 0.2 # Tolerance for ratio check (default: 0.2)
```
Declarative threshold-based checks on execution metrics. Only specified thresholds are checked.
Score is proportional: `hits / (hits + misses)`; for example, meeting four of five configured thresholds scores 0.8. Missing data counts as a miss.

### rubric (inline)
```yaml
rubrics:
  - Simple string criterion
  - id: weighted
    criteria: Detailed criterion
    weight: 2.0
    required: true
```
See `references/rubric-evaluator.md` for score-range mode and scoring formula.

## CLI Commands

```bash
# Run evaluation (requires API keys)
agentv eval <file.yaml> [--eval-id <id>] [--target <name>] [--dry-run]

# Run with trace persistence (writes to .agentv/traces/)
agentv eval <file.yaml> --trace

# Agent-orchestrated evals (no API keys needed)
agentv eval prompt <file.yaml> # orchestration overview
agentv eval prompt input <file.yaml> --eval-id <id> # task input JSON (file paths, not embedded content)
agentv eval prompt judge <file.yaml> --eval-id <id> --answer-file f # judge prompts / code judge results

# Validate eval file
agentv validate <file.yaml>

# Compare results between runs
agentv compare <results1.jsonl> <results2.jsonl>

# Generate rubrics from criteria
agentv generate rubrics <file.yaml> [--target <name>]
```

## Schemas

- Eval file: `references/eval-schema.json`
- Config: `references/config-schema.json`

package/dist/templates/.claude/skills/agentv-eval-builder/references/batch-cli-evaluator.md
@@ -1,316 +0,0 @@
# Batch CLI Evaluation Guide

Guide for evaluating batch CLI output where a single runner processes all evalcases at once and outputs JSONL.

## Overview

Batch CLI evaluation is used when:
- An external tool processes multiple inputs in a single invocation (e.g., AML screening, bulk classification)
- The runner reads the eval YAML directly to extract all evalcases
- Output is JSONL with records keyed by evalcase `id`
- Each evalcase has its own evaluator to validate its corresponding output record

## Execution Flow

1. **AgentV** invokes the batch runner once, passing `--eval <yaml-path>` and `--output <jsonl-path>`
2. **Batch runner** reads the eval YAML, extracts all evalcases, processes them, writes JSONL output keyed by `id`
3. **AgentV** parses JSONL, routes each record to its matching evalcase by `id`
4. **Per-case evaluator** validates the output for each evalcase independently

## Eval File Structure

```yaml
description: Batch CLI demo using structured input_messages
execution:
  target: batch_cli

evalcases:
  - id: case-001
    expected_outcome: |-
      Batch runner returns JSON with decision=CLEAR.

    expected_messages:
      - role: assistant
        content:
          decision: CLEAR # Structured expected output

    input_messages:
      - role: system
        content: You are a batch processor.
      - role: user
        content: # Structured input (runner extracts this)
          request:
            type: screening_check
            jurisdiction: AU
          row:
            id: case-001
            name: Example A
            amount: 5000

    execution:
      evaluators:
        - name: decision-check
          type: code_judge
          script: bun run ./scripts/check-output.ts
          cwd: .

  - id: case-002
    expected_outcome: |-
      Batch runner returns JSON with decision=REVIEW.

    expected_messages:
      - role: assistant
        content:
          decision: REVIEW

    input_messages:
      - role: system
        content: You are a batch processor.
      - role: user
        content:
          request:
            type: screening_check
            jurisdiction: AU
          row:
            id: case-002
            name: Example B
            amount: 25000

    execution:
      evaluators:
        - name: decision-check
          type: code_judge
          script: bun run ./scripts/check-output.ts
          cwd: .
```

## Batch Runner Implementation

The batch runner reads the eval YAML directly and processes all evalcases in one invocation.

### Runner Contract

**Input:** The runner receives the eval file path via the `--eval` flag:
```bash
bun run batch-runner.ts --eval ./my-eval.yaml --output ./results.jsonl
```

**Output:** JSONL file where each line is a JSON object with:
```json
{"id": "case-001", "text": "{\"decision\": \"CLEAR\", ...}"}
{"id": "case-002", "text": "{\"decision\": \"REVIEW\", ...}"}
```

The `id` field must match the evalcase `id` for AgentV to route output to the correct evaluator.

### Output with Tool Trajectory Support

To enable `tool_trajectory` evaluation, include `output_messages` with `tool_calls`:

```json
{
  "id": "case-001",
  "text": "{\"decision\": \"CLEAR\", ...}",
  "output_messages": [
    {
      "role": "assistant",
      "tool_calls": [
        {
          "tool": "screening_check",
          "input": { "origin_country": "NZ", "amount": 5000 },
          "output": { "decision": "CLEAR", "reasons": [] }
        }
      ]
    },
    {
      "role": "assistant",
      "content": { "decision": "CLEAR" }
    }
  ]
}
```

AgentV extracts tool calls directly from `output_messages[].tool_calls[]` for `tool_trajectory` evaluators. This is the recommended format for batch runners that make tool calls.

### Example Runner (TypeScript)

```typescript
import fs from 'node:fs/promises';
import { parse } from 'yaml';

type EvalCase = {
  id: string;
  input_messages: Array<{ role: string; content: unknown }>;
};

async function main() {
  const args = process.argv.slice(2);
  const evalPath = getFlag(args, '--eval');
  const outPath = getFlag(args, '--output');

  // Read and parse eval YAML
  const yamlText = await fs.readFile(evalPath, 'utf8');
  const parsed = parse(yamlText);
  const evalcases = parsed.evalcases as EvalCase[];

  // Process each evalcase
  const results: Array<{ id: string; text: string }> = [];
  for (const evalcase of evalcases) {
    const userContent = findUserContent(evalcase.input_messages);
    const decision = processInput(userContent);

    results.push({
      id: evalcase.id,
      text: JSON.stringify({ decision }), // add any other output fields here
    });
  }

  // Write JSONL output
  const jsonl = results.map((r) => JSON.stringify(r)).join('\n') + '\n';
  await fs.writeFile(outPath, jsonl, 'utf8');
}

function getFlag(args: string[], name: string): string {
  const idx = args.indexOf(name);
  if (idx === -1 || idx + 1 >= args.length) {
    throw new Error(`missing required flag: ${name}`);
  }
  return args[idx + 1];
}

function findUserContent(messages: Array<{ role: string; content: unknown }>) {
  return messages.find((m) => m.role === 'user')?.content;
}

// Placeholder: replace with your domain logic.
function processInput(content: unknown): string {
  return content ? 'CLEAR' : 'REVIEW';
}

main();
```

## Evaluator Implementation

Each evalcase has its own evaluator that validates the output. The evaluator receives the standard code_judge input.

### Evaluator Contract

**Input (stdin):** Standard AgentV code_judge format:
```json
{
  "candidate_answer": "{\"id\":\"case-001\",\"decision\":\"CLEAR\",...}",
  "expected_messages": [{"role": "assistant", "content": {"decision": "CLEAR"}}],
  "input_messages": [...],
  ...
}
```

**Output (stdout):** Standard evaluator result:
```json
{
  "score": 1.0,
  "hits": ["decision matches: CLEAR"],
  "misses": [],
  "reasoning": "Batch runner decision matches expected."
}
```

### Example Evaluator (TypeScript)

```typescript
import fs from 'node:fs';

type EvalInput = {
  candidate_answer?: string;
  expected_messages?: Array<{ role: string; content: unknown }>;
};

function main() {
  const stdin = fs.readFileSync(0, 'utf8');
  const input = JSON.parse(stdin) as EvalInput;

  // Extract expected value from expected_messages
  const expectedDecision = findExpectedDecision(input.expected_messages);

  // Parse candidate answer (output from batch runner)
  let candidateDecision: string | undefined;
  try {
    const parsed = JSON.parse(input.candidate_answer ?? '');
    candidateDecision = parsed.decision;
  } catch {
    candidateDecision = undefined;
  }

  // Compare
  const hits: string[] = [];
  const misses: string[] = [];

  if (expectedDecision === candidateDecision) {
    hits.push(`decision matches: ${expectedDecision}`);
  } else {
    misses.push(`mismatch: expected=${expectedDecision} actual=${candidateDecision}`);
  }

  const score = misses.length === 0 ? 1 : 0;

  process.stdout.write(JSON.stringify({
    score,
    hits,
    misses,
    reasoning: score === 1
      ? 'Batch runner output matches expected.'
      : 'Batch runner output did not match expected.',
  }));
}

function findExpectedDecision(messages?: Array<{ role: string; content: unknown }>) {
  if (!messages) return undefined;
  for (const msg of messages) {
    if (typeof msg.content === 'object' && msg.content !== null) {
      return (msg.content as Record<string, unknown>).decision as string;
    }
  }
  return undefined;
}

main();
```

## Structured Content in expected_messages

For batch evaluation, use structured objects in `expected_messages.content` to define expected output fields:

```yaml
expected_messages:
  - role: assistant
    content:
      decision: CLEAR
      confidence: high
      reasons: []
```

The evaluator then extracts these fields and compares against the parsed candidate output.

## Best Practices

1. **Use unique evalcase IDs** - The batch runner and AgentV use `id` to route outputs
2. **Structured input_messages** - Put structured data in `user.content` for the runner to extract
3. **Structured expected_messages** - Define expected output as objects for easy validation
4. **Deterministic runners** - Batch runners should produce consistent output for testing
5. **Healthcheck support** - Add a `--healthcheck` flag for runner validation:
   ```typescript
   if (args.includes('--healthcheck')) {
     console.log('batch-runner: healthy');
     return;
   }
   ```

## Target Configuration

Configure the batch CLI provider in your target:

```yaml
# In agentv-targets.yaml or eval file
targets:
  batch_cli:
    provider: cli
    commandTemplate: bun run ./scripts/batch-runner.ts --eval {EVAL_FILE} --output {OUTPUT_FILE}
    provider_batching: true
```

Key settings:
- `provider: cli` - Use CLI provider
- `provider_batching: true` - Run once for all evalcases
- `{EVAL_FILE}` - Placeholder for eval file path
- `{OUTPUT_FILE}` - Placeholder for JSONL output path
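
With this target in place, the eval file can then be run against it using the standard CLI form from the skill doc (file name illustrative):

```bash
agentv eval ./my-eval.yaml --target batch_cli
```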

package/dist/templates/.claude/skills/agentv-eval-builder/references/compare-command.md
@@ -1,137 +0,0 @@
# Compare Command

Compare evaluation results between two runs to measure performance differences.

## Usage

```bash
agentv compare <baseline.jsonl> <candidate.jsonl> [options]
```

## Arguments

| Argument | Description |
|----------|-------------|
| `result1` | Path to baseline JSONL result file |
| `result2` | Path to candidate JSONL result file |
| `--threshold`, `-t` | Score delta threshold for win/loss classification (default: 0.1) |
| `--format`, `-f` | Output format: `table` (default) or `json` |
| `--json` | Shorthand for `--format=json` |

## How It Works

1. **Load Results**: Reads both JSONL files containing evaluation results
2. **Match by eval_id**: Pairs results with matching `eval_id` fields
3. **Compute Deltas**: Calculates `delta = score2 - score1` for each pair
4. **Classify Outcomes**:
   - `win`: delta >= threshold (candidate better)
   - `loss`: delta <= -threshold (baseline better)
   - `tie`: |delta| < threshold (no significant difference)
5. **Output Summary**: Human-readable table (default) or JSON

## Output Format

### Table Format (default)

```
Comparing: baseline.jsonl → candidate.jsonl

Eval ID        Baseline  Candidate  Delta   Result
─────────────  ────────  ─────────  ──────  ────────
safety-check   0.70      0.90       +0.20   ✓ win
accuracy-test  0.85      0.80       -0.05   = tie
latency-eval   0.90      0.75       -0.15   ✗ loss

Summary: 1 win, 1 loss, 1 tie | Mean Δ: +0.000 | Status: neutral
```

Colors are used to highlight wins (green), losses (red), and ties (gray). Colors are automatically disabled when output is piped or `NO_COLOR` is set.

### JSON Format (`--json`)

Output uses snake_case for Python ecosystem compatibility:

```json
{
  "matched": [
    {
      "eval_id": "case-1",
      "score1": 0.7,
      "score2": 0.9,
      "delta": 0.2,
      "outcome": "win"
    }
  ],
  "unmatched": {
    "file1": 0,
    "file2": 0
  },
  "summary": {
    "total": 2,
    "matched": 1,
    "wins": 1,
    "losses": 0,
    "ties": 0,
    "mean_delta": 0.2
  }
}
```

## Exit Codes

| Code | Meaning |
|------|---------|
| `0` | Candidate is equal or better (meanDelta >= 0) |
| `1` | Baseline is better (regression detected) |

## Workflow Examples

### Model Comparison

Compare different model versions:

```bash
# Run baseline evaluation
agentv eval evals/*.yaml --target gpt-4 --out baseline.jsonl

# Run candidate evaluation
agentv eval evals/*.yaml --target gpt-4o --out candidate.jsonl

# Compare results
agentv compare baseline.jsonl candidate.jsonl
```

### Prompt Optimization

Compare before/after prompt changes:

```bash
# Run with original prompt
agentv eval evals/*.yaml --out before.jsonl

# Modify prompt, then run again
agentv eval evals/*.yaml --out after.jsonl

# Compare with strict threshold
agentv compare before.jsonl after.jsonl --threshold 0.05
```

### CI Quality Gate

Fail CI if candidate regresses:

```bash
#!/bin/bash
agentv compare baseline.jsonl candidate.jsonl
if [ $? -eq 1 ]; then
  echo "Regression detected! Candidate performs worse than baseline."
  exit 1
fi
echo "Candidate is equal or better than baseline."
```

## Tips

- **Threshold Selection**: The default 0.1 requires a score delta of at least 0.1 (10 percentage points on a 0-1 score scale) to count as a win or loss. Use stricter thresholds (e.g., 0.05) for critical evaluations.
- **Unmatched Results**: Check `unmatched` counts to identify eval cases that only exist in one file.
- **Multiple Comparisons**: Compare against multiple baselines by running the command multiple times.