aievaluator 1.0.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- aievaluator-1.0.1/PKG-INFO +366 -0
- aievaluator-1.0.1/README.md +336 -0
- aievaluator-1.0.1/pyproject.toml +50 -0
- aievaluator-1.0.1/setup.cfg +4 -0
- aievaluator-1.0.1/src/aievaluator/__init__.py +3 -0
- aievaluator-1.0.1/src/aievaluator/api/__init__.py +0 -0
- aievaluator-1.0.1/src/aievaluator/api/client.py +178 -0
- aievaluator-1.0.1/src/aievaluator/cli.py +532 -0
- aievaluator-1.0.1/src/aievaluator/config.py +115 -0
- aievaluator-1.0.1/src/aievaluator/formatters/__init__.py +7 -0
- aievaluator-1.0.1/src/aievaluator/formatters/json.py +28 -0
- aievaluator-1.0.1/src/aievaluator/formatters/junit.py +46 -0
- aievaluator-1.0.1/src/aievaluator/formatters/table.py +53 -0
- aievaluator-1.0.1/src/aievaluator.egg-info/PKG-INFO +366 -0
- aievaluator-1.0.1/src/aievaluator.egg-info/SOURCES.txt +30 -0
- aievaluator-1.0.1/src/aievaluator.egg-info/dependency_links.txt +1 -0
- aievaluator-1.0.1/src/aievaluator.egg-info/entry_points.txt +2 -0
- aievaluator-1.0.1/src/aievaluator.egg-info/requires.txt +8 -0
- aievaluator-1.0.1/src/aievaluator.egg-info/top_level.txt +1 -0
- aievaluator-1.0.1/tests/test_api.py +285 -0
- aievaluator-1.0.1/tests/test_cli_config.py +98 -0
- aievaluator-1.0.1/tests/test_cli_eval.py +495 -0
- aievaluator-1.0.1/tests/test_cli_init.py +115 -0
- aievaluator-1.0.1/tests/test_cli_login.py +74 -0
- aievaluator-1.0.1/tests/test_cli_quick.py +184 -0
- aievaluator-1.0.1/tests/test_cli_whoami.py +61 -0
- aievaluator-1.0.1/tests/test_config.py +218 -0
- aievaluator-1.0.1/tests/test_dataset.py +111 -0
- aievaluator-1.0.1/tests/test_exit_codes.py +122 -0
- aievaluator-1.0.1/tests/test_formatters.py +190 -0
- aievaluator-1.0.1/tests/test_metrics.py +84 -0
- aievaluator-1.0.1/tests/test_thresholds.py +76 -0
|
@@ -0,0 +1,366 @@
|
|
|
1
|
+
Metadata-Version: 2.4
|
|
2
|
+
Name: aievaluator
|
|
3
|
+
Version: 1.0.1
|
|
4
|
+
Summary: AI Evaluator CLI - evaluate your LLM agents from the command line
|
|
5
|
+
Author-email: AI Evaluator <support@aievaluator.dev>
|
|
6
|
+
License: MIT
|
|
7
|
+
Project-URL: Homepage, https://aievaluator.dev
|
|
8
|
+
Project-URL: Repository, https://github.com/aievaluator-dev/aievaluator-cli
|
|
9
|
+
Project-URL: Issues, https://github.com/aievaluator-dev/aievaluator-cli/issues
|
|
10
|
+
Keywords: ai,evaluation,llm,agent,testing,ci-cd
|
|
11
|
+
Classifier: Development Status :: 4 - Beta
|
|
12
|
+
Classifier: Intended Audience :: Developers
|
|
13
|
+
Classifier: License :: OSI Approved :: MIT License
|
|
14
|
+
Classifier: Operating System :: OS Independent
|
|
15
|
+
Classifier: Programming Language :: Python :: 3
|
|
16
|
+
Classifier: Programming Language :: Python :: 3.10
|
|
17
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
18
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
19
|
+
Classifier: Topic :: Software Development :: Testing
|
|
20
|
+
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
|
|
21
|
+
Requires-Python: >=3.10
|
|
22
|
+
Description-Content-Type: text/markdown
|
|
23
|
+
Requires-Dist: click>=8.1
|
|
24
|
+
Requires-Dist: httpx>=0.27
|
|
25
|
+
Requires-Dist: rich>=13.0
|
|
26
|
+
Provides-Extra: dev
|
|
27
|
+
Requires-Dist: pytest>=8.0; extra == "dev"
|
|
28
|
+
Requires-Dist: pytest-asyncio>=0.24; extra == "dev"
|
|
29
|
+
Requires-Dist: pytest-httpx>=0.30; extra == "dev"
|
|
30
|
+
|
|
31
|
+
# AI Evaluator CLI - Python
|
|
32
|
+
|
|
33
|
+
[](https://pypi.org/project/aievaluator/)
|
|
34
|
+
[](https://pypi.org/project/aievaluator/)
|
|
35
|
+
|
|
36
|
+
Evaluate your LLM agents from the terminal. No browser. No dashboard.
|
|
37
|
+
|
|
38
|
+
```bash
|
|
39
|
+
pip install aievaluator
|
|
40
|
+
```
|
|
41
|
+
|
|
42
|
+
---
|
|
43
|
+
|
|
44
|
+
## 🧭 Tutorial - From Zero to CI/CD
|
|
45
|
+
|
|
46
|
+
Every step builds on the previous one. Start wherever makes sense for you.
|
|
47
|
+
|
|
48
|
+
---
|
|
49
|
+
|
|
50
|
+
### Level 0 - Try it without installing anything
|
|
51
|
+
|
|
52
|
+
```bash
|
|
53
|
+
curl -s -X POST https://api.aievaluator.dev/api/v1/playground/evaluate \
|
|
54
|
+
-H "Content-Type: application/json" \
|
|
55
|
+
-d '{"queries":["What is 2+2?"],"metrics":["faithfulness"]}' | jq .
|
|
56
|
+
```
|
|
57
|
+
|
|
58
|
+
5 free per day. No key. No install. Good enough to decide if it's useful.
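
The same playground request can also be built from Python. This is a minimal sketch using only the standard library: the endpoint and payload fields come from the curl command above, while the helper names `build_playground_request` and `send` are hypothetical and nothing is assumed about the shape of the response.

```python
import json
import urllib.request

PLAYGROUND_URL = "https://api.aievaluator.dev/api/v1/playground/evaluate"


def build_playground_request(queries, metrics):
    """Build (but do not send) the POST request from the curl example."""
    body = json.dumps({"queries": list(queries), "metrics": list(metrics)})
    return urllib.request.Request(
        PLAYGROUND_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the decoded JSON body.

    Each call counts against the free daily quota, so it is kept separate
    from building the request.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `send(build_playground_request(["What is 2+2?"], ["faithfulness"]))` would issue the same request as the curl command.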
|
|
59
|
+
|
|
60
|
+
---
|
|
61
|
+
|
|
62
|
+
### Level 1 - Install and evaluate a single prompt
|
|
63
|
+
|
|
64
|
+
```bash
|
|
65
|
+
pip install aievaluator
|
|
66
|
+
|
|
67
|
+
# Ask a question, tell it what you expect
|
|
68
|
+
aievaluator quick "What is the capital of France?" --expected "Paris"
|
|
69
|
+
```
|
|
70
|
+
|
|
71
|
+
You'll see a table with the score. The `--expected` flag is optional; without it, the judge evaluates
|
|
72
|
+
the response on its own merits.
|
|
73
|
+
|
|
74
|
+
```
|
|
75
|
+
⚠️ Playground mode - 4/5 remaining
|
|
76
|
+
|
|
77
|
+
AI Evaluator - Results
|
|
78
|
+
Overall Score: 95.0% ✅ above threshold (0%)
|
|
79
|
+
Total rows: 1
|
|
80
|
+
Failed: 0
|
|
81
|
+
|
|
82
|
+
┌────┬────────────────────────────────────┬──────────┬──────┐
|
|
83
|
+
│ #  │ Query                              │ Score    │ Pass │
|
|
84
|
+
├────┼────────────────────────────────────┼──────────┼──────┤
|
|
85
|
+
│ 1  │ What is the capital of France?     │ 95%      │ ✅   │
|
|
86
|
+
└────┴────────────────────────────────────┴──────────┴──────┘
|
|
87
|
+
```
|
|
88
|
+
|
|
89
|
+
---
|
|
90
|
+
|
|
91
|
+
### Level 2 - Sign up and scaffold a project
|
|
92
|
+
|
|
93
|
+
Playground is great for trying, but you'll want more than 5 evals/day.
|
|
94
|
+
|
|
95
|
+
```bash
|
|
96
|
+
# Get your API key at https://aievaluator.dev/settings
|
|
97
|
+
aievaluator login
|
|
98
|
+
|
|
99
|
+
# Check your account
|
|
100
|
+
aievaluator whoami
|
|
101
|
+
```
|
|
102
|
+
|
|
103
|
+
Now scaffold your project:
|
|
104
|
+
|
|
105
|
+
```bash
|
|
106
|
+
aievaluator init
|
|
107
|
+
```
|
|
108
|
+
|
|
109
|
+
This creates:
|
|
110
|
+
- `aievaluator.config.json`: project-local config
|
|
111
|
+
- `evals/smoke-test.json`: sample dataset with 3 queries
|
|
112
|
+
- Updates `.gitignore`
|
|
113
|
+
|
|
114
|
+
Open `evals/smoke-test.json` and replace the sample queries with your own:
|
|
115
|
+
|
|
116
|
+
```json
|
|
117
|
+
[
|
|
118
|
+
{"input": "What are your business hours?", "expected_output": "Mon-Fri 9am-6pm"},
|
|
119
|
+
{"input": "How do I cancel my order?", "expected_output": "Go to My Orders → Cancel"},
|
|
120
|
+
{"input": "Do you ship internationally?", "expected_output": "Yes, via DHL Express"}
|
|
121
|
+
]
|
|
122
|
+
```
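
Before spending evaluations on a dataset, it can help to check the file locally. The sketch below loads either a JSON array like the one above or a JSONL file (the command reference lists both formats). The function name `load_dataset` and the rule that every row needs a non-empty `input` are assumptions made for illustration, not the CLI's documented validation.

```python
import json
from pathlib import Path


def load_dataset(path):
    """Load evaluation rows from a JSON array file or a JSONL file."""
    text = Path(path).read_text(encoding="utf-8")
    if str(path).endswith(".jsonl"):
        # JSONL: one JSON object per non-empty line.
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    else:
        rows = json.loads(text)
    if not isinstance(rows, list):
        raise ValueError("dataset must be a list of rows")
    for index, row in enumerate(rows, start=1):
        # Assumption: 'input' is required, 'expected_output' is optional.
        if not isinstance(row, dict) or not row.get("input"):
            raise ValueError(f"row {index} is missing a non-empty 'input'")
    return rows
```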
|
|
123
|
+
|
|
124
|
+
Test it against the built-in agent:
|
|
125
|
+
|
|
126
|
+
```bash
|
|
127
|
+
aievaluator quick --dataset ./evals/smoke-test.json
|
|
128
|
+
```
|
|
129
|
+
|
|
130
|
+
---
|
|
131
|
+
|
|
132
|
+
### Level 3 - Evaluate your own agent
|
|
133
|
+
|
|
134
|
+
Point the CLI at your agent's endpoint:
|
|
135
|
+
|
|
136
|
+
```bash
|
|
137
|
+
aievaluator eval \
|
|
138
|
+
--agent https://chatbot-staging.acme.com/api/chat \
|
|
139
|
+
--dataset ./evals/smoke-test.json \
|
|
140
|
+
--metrics faithfulness,g_eval
|
|
141
|
+
```
|
|
142
|
+
|
|
143
|
+
The CLI calls your agent with each query, then an LLM judge scores the responses.
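
Conceptually, the flow described above is a loop over the dataset rows. The sketch below illustrates that loop with two injected callables. `call_agent` and `judge` are hypothetical stand-ins for the HTTP call to your agent and the LLM judge, and the result fields are made up for the example; the CLI's real internals may differ.

```python
from typing import Callable, Optional


def run_evaluation(
    rows,
    call_agent: Callable[[str], str],
    judge: Callable[[str, str, Optional[str]], float],
):
    """Ask the agent each query, then let the judge score the answer."""
    results = []
    for row in rows:
        answer = call_agent(row["input"])
        # expected_output may be absent; the judge then scores the answer alone.
        score = judge(row["input"], answer, row.get("expected_output"))
        results.append({"input": row["input"], "answer": answer, "score": score})
    return results
```

Passing the agent and the judge as plain functions keeps the loop easy to test with fakes before pointing it at a real endpoint.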
|
|
144
|
+
|
|
145
|
+
---
|
|
146
|
+
|
|
147
|
+
### Level 4 - Add quality gates
|
|
148
|
+
|
|
149
|
+
Not all metrics are equally important. Set different thresholds per metric:
|
|
150
|
+
|
|
151
|
+
```bash
|
|
152
|
+
aievaluator eval \
|
|
153
|
+
--agent https://chatbot-staging.acme.com/api/chat \
|
|
154
|
+
--dataset ./evals/smoke-test.json \
|
|
155
|
+
--thresholds faithfulness:0.90,g_eval:0.75
|
|
156
|
+
```
|
|
157
|
+
|
|
158
|
+
- `faithfulness` must be ≥ 90% (hallucination = instant fail)
|
|
159
|
+
- `g_eval` must be ≥ 75% (general quality)
|
|
160
|
+
|
|
161
|
+
If any metric fails to meet its threshold, that row is marked ❌.
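
The per-metric rule can be sketched in a few lines of Python. The `name:value` syntax mirrors the `--thresholds` flag above. The helper names `parse_thresholds` and `row_passes` are hypothetical, and treating a metric with no score as a failure is an assumption of this sketch.

```python
def parse_thresholds(spec):
    """Parse 'faithfulness:0.90,g_eval:0.75' into {'faithfulness': 0.9, ...}."""
    thresholds = {}
    for item in spec.split(","):
        name, _, value = item.partition(":")
        if not name.strip() or not value.strip():
            raise ValueError(f"expected name:value, got {item!r}")
        thresholds[name.strip()] = float(value)
    return thresholds


def row_passes(scores, thresholds):
    """A row passes only if every gated metric meets its threshold."""
    # Assumption: a metric without a score counts as 0.0 and therefore fails.
    return all(
        scores.get(metric, 0.0) >= minimum
        for metric, minimum in thresholds.items()
    )
```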
|
|
162
|
+
|
|
163
|
+
**Or set one bar for everything:**
|
|
164
|
+
|
|
165
|
+
```bash
|
|
166
|
+
aievaluator eval \
|
|
167
|
+
--agent https://chatbot-staging.acme.com/api/chat \
|
|
168
|
+
--dataset ./evals/smoke-test.json \
|
|
169
|
+
--min-score 0.80
|
|
170
|
+
```
|
|
171
|
+
|
|
172
|
+
This works on `quick` too:
|
|
173
|
+
|
|
174
|
+
```bash
|
|
175
|
+
aievaluator quick "test prompt" --min-score 0.80
|
|
176
|
+
# Exit code 1 if any metric drops below 0.80
|
|
177
|
+
```
|
|
178
|
+
|
|
179
|
+
---
|
|
180
|
+
|
|
181
|
+
### Level 5 - Create your own evaluation criteria
|
|
182
|
+
|
|
183
|
+
Sometimes the built-in metrics aren't enough. Define a custom evaluator inline:
|
|
184
|
+
|
|
185
|
+
```bash
|
|
186
|
+
aievaluator eval \
|
|
187
|
+
--agent https://chatbot-staging.acme.com/api/chat \
|
|
188
|
+
--dataset ./evals/smoke-test.json \
|
|
189
|
+
--metrics politeness,g_eval \
|
|
190
|
+
--custom '{"name":"politeness","prompt":"Is the response polite and professional? Answer YES or NO and explain.","threshold":0.85}'
|
|
191
|
+
```
|
|
192
|
+
|
|
193
|
+
The custom evaluator `politeness` is defined in the request, referenced in `--metrics` by name,
|
|
194
|
+
and evaluated alongside `g_eval`. No dashboard needed.
|
|
195
|
+
|
|
196
|
+
**Custom evaluator with per-metric threshold override:**
|
|
197
|
+
|
|
198
|
+
```bash
|
|
199
|
+
aievaluator eval \
|
|
200
|
+
--agent $URL --dataset ./tests.json \
|
|
201
|
+
--metrics politeness,g_eval \
|
|
202
|
+
--custom '{"name":"politeness","prompt":"Is the tone friendly?","threshold":0.7}' \
|
|
203
|
+
--thresholds politeness:0.90,g_eval:0.80
|
|
204
|
+
```
|
|
205
|
+
|
|
206
|
+
The `--thresholds` flag overrides whatever was set in `--custom`. The engine uses the
|
|
207
|
+
per-evaluation value.
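
The precedence described above amounts to a dictionary merge in which `--thresholds` wins. A minimal sketch, assuming the custom evaluators arrive as parsed JSON objects; `effective_thresholds` is a hypothetical helper, not part of the package.

```python
def effective_thresholds(custom_evaluators, overrides):
    """Start from thresholds declared in --custom, then apply --thresholds."""
    merged = {
        evaluator["name"]: evaluator["threshold"]
        for evaluator in custom_evaluators
        if "threshold" in evaluator
    }
    # Values from --thresholds replace any value set in --custom.
    merged.update(overrides)
    return merged
```

With the command above, `politeness` would be gated at 0.90 rather than the 0.7 declared in `--custom`.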
|
|
208
|
+
|
|
209
|
+
---
|
|
210
|
+
|
|
211
|
+
### Level 6 - CI/CD pipeline
|
|
212
|
+
|
|
213
|
+
Add this to your GitHub Actions, GitLab CI, or Jenkins:
|
|
214
|
+
|
|
215
|
+
```bash
|
|
216
|
+
aievaluator eval \
|
|
217
|
+
--agent $STAGING_AGENT \
|
|
218
|
+
--dataset ./evals/regression.json \
|
|
219
|
+
--thresholds faithfulness:0.90,g_eval:0.75 \
|
|
220
|
+
--min-score 0.80 \
|
|
221
|
+
--ci \
|
|
222
|
+
--format junit > report.xml
|
|
223
|
+
```
|
|
224
|
+
|
|
225
|
+
| Flag | What it does |
|
|
226
|
+
|---|---|
|
|
227
|
+
| `--ci` | No colors, no prompts: clean output for logs |
|
|
228
|
+
| `--format junit` | JUnit XML that CI systems understand natively |
|
|
229
|
+
| `--min-score 0.80` | Overall score must be ≥ 80% |
|
|
230
|
+
| `--thresholds` | Per-metric quality bars |
|
|
231
|
+
|
|
232
|
+
Exit code 1 = pipeline fails = deploy blocked.
|
|
233
|
+
|
|
234
|
+
**Environment variables for CI:**
|
|
235
|
+
|
|
236
|
+
```bash
|
|
237
|
+
export AIEVALUATOR_API_KEY="sk-..." # No hardcoded keys in YAML
|
|
238
|
+
export AIEVALUATOR_ENGINE_URL="https://api.aievaluator.dev"
|
|
239
|
+
```
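
A pipeline can fail fast when the key is missing instead of waiting for the first API call. The preflight sketch below reads the two variables shown above. The function name `settings_from_env` and the fallback to the public engine URL are assumptions for illustration; the CLI's own resolution order is not documented here.

```python
import os

DEFAULT_ENGINE_URL = "https://api.aievaluator.dev"


def settings_from_env(environ=None):
    """Read CI settings from the environment, failing early if the key is unset."""
    env = os.environ if environ is None else environ
    api_key = env.get("AIEVALUATOR_API_KEY")
    if not api_key:
        raise RuntimeError("AIEVALUATOR_API_KEY is not set")
    return {
        "api_key": api_key,
        # Assumption: fall back to the public engine when no URL is given.
        "engine_url": env.get("AIEVALUATOR_ENGINE_URL", DEFAULT_ENGINE_URL),
    }
```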
|
|
240
|
+
|
|
241
|
+
---
|
|
242
|
+
|
|
243
|
+
## Complete Command Reference
|
|
244
|
+
|
|
245
|
+
### `aievaluator login`
|
|
246
|
+
|
|
247
|
+
```bash
|
|
248
|
+
aievaluator login # Interactive prompt
|
|
249
|
+
aievaluator login --api-key sk-xxx # Non-interactive (CI)
|
|
250
|
+
aievaluator login --engine-url https://custom.engine.com
|
|
251
|
+
```
|
|
252
|
+
|
|
253
|
+
### `aievaluator whoami`
|
|
254
|
+
|
|
255
|
+
```bash
|
|
256
|
+
aievaluator whoami
|
|
257
|
+
# Tenant: acme-corp
|
|
258
|
+
# Tier: pro
|
|
259
|
+
# Evals: 42/5000 this cycle
|
|
260
|
+
# Tokens: ↑124,800 · ↓89,200 this cycle
|
|
261
|
+
```
|
|
262
|
+
|
|
263
|
+
### `aievaluator quick`
|
|
264
|
+
|
|
265
|
+
```bash
|
|
266
|
+
# Single query
|
|
267
|
+
aievaluator quick "What is 2+2?" --expected "4"
|
|
268
|
+
|
|
269
|
+
# Per-metric thresholds
|
|
270
|
+
aievaluator quick "test" --metrics faithfulness:0.90,g_eval:0.75
|
|
271
|
+
|
|
272
|
+
# General threshold
|
|
273
|
+
aievaluator quick "test" --min-score 0.80
|
|
274
|
+
|
|
275
|
+
# From dataset (JSON or JSONL)
|
|
276
|
+
aievaluator quick --dataset ./tests.json
|
|
277
|
+
aievaluator quick --dataset ./tests.jsonl
|
|
278
|
+
|
|
279
|
+
# Custom judge model
|
|
280
|
+
aievaluator quick "test" --judge deepseek
|
|
281
|
+
```
|
|
282
|
+
|
|
283
|
+
### `aievaluator eval`
|
|
284
|
+
|
|
285
|
+
```bash
|
|
286
|
+
# Basic
|
|
287
|
+
aievaluator eval --agent $URL --dataset ./tests.json
|
|
288
|
+
|
|
289
|
+
# With quality gates
|
|
290
|
+
aievaluator eval --agent $URL --dataset ./tests.json \
|
|
291
|
+
--thresholds faithfulness:0.90,g_eval:0.75 --min-score 0.80
|
|
292
|
+
|
|
293
|
+
# Inline rows
|
|
294
|
+
aievaluator eval --agent $URL \
|
|
295
|
+
--rows '[{"input":"Hi","expected_output":"Hello"}]'
|
|
296
|
+
|
|
297
|
+
# Custom evaluator inline
|
|
298
|
+
aievaluator eval --agent $URL --dataset ./tests.json \
|
|
299
|
+
--metrics my-eval --custom '{"name":"my-eval","prompt":"...","threshold":0.8}'
|
|
300
|
+
|
|
301
|
+
# CI mode
|
|
302
|
+
aievaluator eval --agent $URL --dataset ./tests.json --ci --format junit
|
|
303
|
+
|
|
304
|
+
# Different agent format
|
|
305
|
+
aievaluator eval --agent $URL --dataset ./tests.json --agent-format claude
|
|
306
|
+
```
|
|
307
|
+
|
|
308
|
+
### `aievaluator config`
|
|
309
|
+
|
|
310
|
+
```bash
|
|
311
|
+
aievaluator config show
|
|
312
|
+
aievaluator config set default-metrics "faithfulness,g_eval"
|
|
313
|
+
aievaluator config set default-min-score 0.80
|
|
314
|
+
aievaluator config unset default-min-score
|
|
315
|
+
```
|
|
316
|
+
|
|
317
|
+
### `aievaluator init`
|
|
318
|
+
|
|
319
|
+
```bash
|
|
320
|
+
aievaluator init
|
|
321
|
+
# Creates aievaluator.config.json + evals/smoke-test.json + updates .gitignore
|
|
322
|
+
```
|
|
323
|
+
|
|
324
|
+
---
|
|
325
|
+
|
|
326
|
+
## Output Formats
|
|
327
|
+
|
|
328
|
+
### Table (default)
|
|
329
|
+
|
|
330
|
+
Human-readable table with scores, pass/fail icons, and token counts.
|
|
331
|
+
|
|
332
|
+
### JSON (`--format json`)
|
|
333
|
+
|
|
334
|
+
```bash
|
|
335
|
+
aievaluator eval ... --format json | jq '.overall_score'
|
|
336
|
+
```
|
|
337
|
+
|
|
338
|
+
Clean JSON on stdout. All logs/warnings go to stderr.
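
Because stdout carries only JSON, a script can parse it directly. The sketch below turns a report into an exit code. `overall_score` is the key used in the `jq` example above; that it is on the same 0 to 1 scale as `--min-score` is an assumption, and `gate_on_overall_score` is a hypothetical helper.

```python
import json


def gate_on_overall_score(report_text, min_score):
    """Return 0 if the report's overall score meets min_score, else 1."""
    report = json.loads(report_text)
    # Assumption: overall_score uses the same 0-1 scale as --min-score.
    score = float(report["overall_score"])
    return 0 if score >= min_score else 1
```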
|
|
339
|
+
|
|
340
|
+
### JUnit XML (`--format junit`)
|
|
341
|
+
|
|
342
|
+
```bash
|
|
343
|
+
aievaluator eval ... --format junit > report.xml
|
|
344
|
+
```
|
|
345
|
+
|
|
346
|
+
Native CI integration. `<testcase>` per query, `<failure>` for queries below threshold.
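
To make that mapping concrete, here is a rough sketch that builds such a report with the standard library. The input fields (`query`, `score`, `passed`), the suite name, and the failure message are hypothetical; the CLI's actual XML attributes may differ.

```python
import xml.etree.ElementTree as ET


def to_junit(results, suite_name="aievaluator"):
    """Emit one <testcase> per query and a <failure> for each failed query."""
    failed = [r for r in results if not r["passed"]]
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(len(failed)),
    )
    for result in results:
        case = ET.SubElement(suite, "testcase", name=result["query"])
        if not result["passed"]:
            ET.SubElement(
                case,
                "failure",
                message=f"score {result['score']:.2f} below threshold",
            )
    return ET.tostring(suite, encoding="unicode")
```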
|
|
347
|
+
|
|
348
|
+
---
|
|
349
|
+
|
|
350
|
+
## VS Code Extension
|
|
351
|
+
|
|
352
|
+
Prefer staying in your editor? Install the [VS Code extension](https://marketplace.visualstudio.com/items?itemName=aievaluator.aievaluator).
|
|
353
|
+
|
|
354
|
+
- Select text → right-click → Evaluate
|
|
355
|
+
- Per-metric threshold editor with preset buttons
|
|
356
|
+
- Custom evaluator support via Command Palette
|
|
357
|
+
- Sidebar with evaluation history
|
|
358
|
+
- Dataset file evaluation (JSON + JSONL)
|
|
359
|
+
|
|
360
|
+
[Full VS Code tutorial →](../vscode/README.md)
|
|
361
|
+
|
|
362
|
+
---
|
|
363
|
+
|
|
364
|
+
## Requirements
|
|
365
|
+
|
|
366
|
+
- Python 3.10+
|
|
@@ -0,0 +1,336 @@
|
|
|
1
|
+
# AI Evaluator CLI - Python
|
|
2
|
+
|
|
3
|
+
[](https://pypi.org/project/aievaluator/)
|
|
4
|
+
[](https://pypi.org/project/aievaluator/)
|
|
5
|
+
|
|
6
|
+
Evaluate your LLM agents from the terminal. No browser. No dashboard.
|
|
7
|
+
|
|
8
|
+
```bash
|
|
9
|
+
pip install aievaluator
|
|
10
|
+
```
|
|
11
|
+
|
|
12
|
+
---
|
|
13
|
+
|
|
14
|
+
## 🧭 Tutorial - From Zero to CI/CD
|
|
15
|
+
|
|
16
|
+
Every step builds on the previous one. Start wherever makes sense for you.
|
|
17
|
+
|
|
18
|
+
---
|
|
19
|
+
|
|
20
|
+
### Level 0 - Try it without installing anything
|
|
21
|
+
|
|
22
|
+
```bash
|
|
23
|
+
curl -s -X POST https://api.aievaluator.dev/api/v1/playground/evaluate \
|
|
24
|
+
-H "Content-Type: application/json" \
|
|
25
|
+
-d '{"queries":["What is 2+2?"],"metrics":["faithfulness"]}' | jq .
|
|
26
|
+
```
|
|
27
|
+
|
|
28
|
+
5 free per day. No key. No install. Good enough to decide if it's useful.
|
|
29
|
+
|
|
30
|
+
---
|
|
31
|
+
|
|
32
|
+
### Level 1 - Install and evaluate a single prompt
|
|
33
|
+
|
|
34
|
+
```bash
|
|
35
|
+
pip install aievaluator
|
|
36
|
+
|
|
37
|
+
# Ask a question, tell it what you expect
|
|
38
|
+
aievaluator quick "What is the capital of France?" --expected "Paris"
|
|
39
|
+
```
|
|
40
|
+
|
|
41
|
+
You'll see a table with the score. The `--expected` flag is optional; without it, the judge evaluates
|
|
42
|
+
the response on its own merits.
|
|
43
|
+
|
|
44
|
+
```
|
|
45
|
+
⚠️ Playground mode - 4/5 remaining
|
|
46
|
+
|
|
47
|
+
AI Evaluator - Results
|
|
48
|
+
Overall Score: 95.0% ✅ above threshold (0%)
|
|
49
|
+
Total rows: 1
|
|
50
|
+
Failed: 0
|
|
51
|
+
|
|
52
|
+
┌────┬────────────────────────────────────┬──────────┬──────┐
|
|
53
|
+
│ #  │ Query                              │ Score    │ Pass │
|
|
54
|
+
├────┼────────────────────────────────────┼──────────┼──────┤
|
|
55
|
+
│ 1  │ What is the capital of France?     │ 95%      │ ✅   │
|
|
56
|
+
└────┴────────────────────────────────────┴──────────┴──────┘
|
|
57
|
+
```
|
|
58
|
+
|
|
59
|
+
---
|
|
60
|
+
|
|
61
|
+
### Level 2 - Sign up and scaffold a project
|
|
62
|
+
|
|
63
|
+
Playground is great for trying, but you'll want more than 5 evals/day.
|
|
64
|
+
|
|
65
|
+
```bash
|
|
66
|
+
# Get your API key at https://aievaluator.dev/settings
|
|
67
|
+
aievaluator login
|
|
68
|
+
|
|
69
|
+
# Check your account
|
|
70
|
+
aievaluator whoami
|
|
71
|
+
```
|
|
72
|
+
|
|
73
|
+
Now scaffold your project:
|
|
74
|
+
|
|
75
|
+
```bash
|
|
76
|
+
aievaluator init
|
|
77
|
+
```
|
|
78
|
+
|
|
79
|
+
This creates:
|
|
80
|
+
- `aievaluator.config.json`: project-local config
|
|
81
|
+
- `evals/smoke-test.json`: sample dataset with 3 queries
|
|
82
|
+
- Updates `.gitignore`
|
|
83
|
+
|
|
84
|
+
Open `evals/smoke-test.json` and replace the sample queries with your own:
|
|
85
|
+
|
|
86
|
+
```json
|
|
87
|
+
[
|
|
88
|
+
{"input": "What are your business hours?", "expected_output": "Mon-Fri 9am-6pm"},
|
|
89
|
+
{"input": "How do I cancel my order?", "expected_output": "Go to My Orders → Cancel"},
|
|
90
|
+
{"input": "Do you ship internationally?", "expected_output": "Yes, via DHL Express"}
|
|
91
|
+
]
|
|
92
|
+
```
|
|
93
|
+
|
|
94
|
+
Test it against the built-in agent:
|
|
95
|
+
|
|
96
|
+
```bash
|
|
97
|
+
aievaluator quick --dataset ./evals/smoke-test.json
|
|
98
|
+
```
|
|
99
|
+
|
|
100
|
+
---
|
|
101
|
+
|
|
102
|
+
### Level 3 - Evaluate your own agent
|
|
103
|
+
|
|
104
|
+
Point the CLI at your agent's endpoint:
|
|
105
|
+
|
|
106
|
+
```bash
|
|
107
|
+
aievaluator eval \
|
|
108
|
+
--agent https://chatbot-staging.acme.com/api/chat \
|
|
109
|
+
--dataset ./evals/smoke-test.json \
|
|
110
|
+
--metrics faithfulness,g_eval
|
|
111
|
+
```
|
|
112
|
+
|
|
113
|
+
The CLI calls your agent with each query, then an LLM judge scores the responses.
|
|
114
|
+
|
|
115
|
+
---
|
|
116
|
+
|
|
117
|
+
### Level 4 - Add quality gates
|
|
118
|
+
|
|
119
|
+
Not all metrics are equally important. Set different thresholds per metric:
|
|
120
|
+
|
|
121
|
+
```bash
|
|
122
|
+
aievaluator eval \
|
|
123
|
+
--agent https://chatbot-staging.acme.com/api/chat \
|
|
124
|
+
--dataset ./evals/smoke-test.json \
|
|
125
|
+
--thresholds faithfulness:0.90,g_eval:0.75
|
|
126
|
+
```
|
|
127
|
+
|
|
128
|
+
- `faithfulness` must be ≥ 90% (hallucination = instant fail)
|
|
129
|
+
- `g_eval` must be ≥ 75% (general quality)
|
|
130
|
+
|
|
131
|
+
If any metric fails to meet its threshold, that row is marked ❌.
|
|
132
|
+
|
|
133
|
+
**Or set one bar for everything:**
|
|
134
|
+
|
|
135
|
+
```bash
|
|
136
|
+
aievaluator eval \
|
|
137
|
+
--agent https://chatbot-staging.acme.com/api/chat \
|
|
138
|
+
--dataset ./evals/smoke-test.json \
|
|
139
|
+
--min-score 0.80
|
|
140
|
+
```
|
|
141
|
+
|
|
142
|
+
This works on `quick` too:
|
|
143
|
+
|
|
144
|
+
```bash
|
|
145
|
+
aievaluator quick "test prompt" --min-score 0.80
|
|
146
|
+
# Exit code 1 if any metric drops below 0.80
|
|
147
|
+
```
|
|
148
|
+
|
|
149
|
+
---
|
|
150
|
+
|
|
151
|
+
### Level 5 - Create your own evaluation criteria
|
|
152
|
+
|
|
153
|
+
Sometimes the built-in metrics aren't enough. Define a custom evaluator inline:
|
|
154
|
+
|
|
155
|
+
```bash
|
|
156
|
+
aievaluator eval \
|
|
157
|
+
--agent https://chatbot-staging.acme.com/api/chat \
|
|
158
|
+
--dataset ./evals/smoke-test.json \
|
|
159
|
+
--metrics politeness,g_eval \
|
|
160
|
+
--custom '{"name":"politeness","prompt":"Is the response polite and professional? Answer YES or NO and explain.","threshold":0.85}'
|
|
161
|
+
```
|
|
162
|
+
|
|
163
|
+
The custom evaluator `politeness` is defined in the request, referenced in `--metrics` by name,
|
|
164
|
+
and evaluated alongside `g_eval`. No dashboard needed.
|
|
165
|
+
|
|
166
|
+
**Custom evaluator with per-metric threshold override:**
|
|
167
|
+
|
|
168
|
+
```bash
|
|
169
|
+
aievaluator eval \
|
|
170
|
+
--agent $URL --dataset ./tests.json \
|
|
171
|
+
--metrics politeness,g_eval \
|
|
172
|
+
--custom '{"name":"politeness","prompt":"Is the tone friendly?","threshold":0.7}' \
|
|
173
|
+
--thresholds politeness:0.90,g_eval:0.80
|
|
174
|
+
```
|
|
175
|
+
|
|
176
|
+
The `--thresholds` flag overrides whatever was set in `--custom`. The engine uses the
|
|
177
|
+
per-evaluation value.
|
|
178
|
+
|
|
179
|
+
---
|
|
180
|
+
|
|
181
|
+
### Level 6 - CI/CD pipeline
|
|
182
|
+
|
|
183
|
+
Add this to your GitHub Actions, GitLab CI, or Jenkins:
|
|
184
|
+
|
|
185
|
+
```bash
|
|
186
|
+
aievaluator eval \
|
|
187
|
+
--agent $STAGING_AGENT \
|
|
188
|
+
--dataset ./evals/regression.json \
|
|
189
|
+
--thresholds faithfulness:0.90,g_eval:0.75 \
|
|
190
|
+
--min-score 0.80 \
|
|
191
|
+
--ci \
|
|
192
|
+
--format junit > report.xml
|
|
193
|
+
```
|
|
194
|
+
|
|
195
|
+
| Flag | What it does |
|
|
196
|
+
|---|---|
|
|
197
|
+
| `--ci` | No colors, no prompts: clean output for logs |
|
|
198
|
+
| `--format junit` | JUnit XML that CI systems understand natively |
|
|
199
|
+
| `--min-score 0.80` | Overall score must be ≥ 80% |
|
|
200
|
+
| `--thresholds` | Per-metric quality bars |
|
|
201
|
+
|
|
202
|
+
Exit code 1 = pipeline fails = deploy blocked.
|
|
203
|
+
|
|
204
|
+
**Environment variables for CI:**
|
|
205
|
+
|
|
206
|
+
```bash
|
|
207
|
+
export AIEVALUATOR_API_KEY="sk-..." # No hardcoded keys in YAML
|
|
208
|
+
export AIEVALUATOR_ENGINE_URL="https://api.aievaluator.dev"
|
|
209
|
+
```
|
|
210
|
+
|
|
211
|
+
---
|
|
212
|
+
|
|
213
|
+
## Complete Command Reference
|
|
214
|
+
|
|
215
|
+
### `aievaluator login`
|
|
216
|
+
|
|
217
|
+
```bash
|
|
218
|
+
aievaluator login # Interactive prompt
|
|
219
|
+
aievaluator login --api-key sk-xxx # Non-interactive (CI)
|
|
220
|
+
aievaluator login --engine-url https://custom.engine.com
|
|
221
|
+
```
|
|
222
|
+
|
|
223
|
+
### `aievaluator whoami`
|
|
224
|
+
|
|
225
|
+
```bash
|
|
226
|
+
aievaluator whoami
|
|
227
|
+
# Tenant: acme-corp
|
|
228
|
+
# Tier: pro
|
|
229
|
+
# Evals: 42/5000 this cycle
|
|
230
|
+
# Tokens: ↑124,800 · ↓89,200 this cycle
|
|
231
|
+
```
|
|
232
|
+
|
|
233
|
+
### `aievaluator quick`
|
|
234
|
+
|
|
235
|
+
```bash
|
|
236
|
+
# Single query
|
|
237
|
+
aievaluator quick "What is 2+2?" --expected "4"
|
|
238
|
+
|
|
239
|
+
# Per-metric thresholds
|
|
240
|
+
aievaluator quick "test" --metrics faithfulness:0.90,g_eval:0.75
|
|
241
|
+
|
|
242
|
+
# General threshold
|
|
243
|
+
aievaluator quick "test" --min-score 0.80
|
|
244
|
+
|
|
245
|
+
# From dataset (JSON or JSONL)
|
|
246
|
+
aievaluator quick --dataset ./tests.json
|
|
247
|
+
aievaluator quick --dataset ./tests.jsonl
|
|
248
|
+
|
|
249
|
+
# Custom judge model
|
|
250
|
+
aievaluator quick "test" --judge deepseek
|
|
251
|
+
```
|
|
252
|
+
|
|
253
|
+
### `aievaluator eval`
|
|
254
|
+
|
|
255
|
+
```bash
|
|
256
|
+
# Basic
|
|
257
|
+
aievaluator eval --agent $URL --dataset ./tests.json
|
|
258
|
+
|
|
259
|
+
# With quality gates
|
|
260
|
+
aievaluator eval --agent $URL --dataset ./tests.json \
|
|
261
|
+
--thresholds faithfulness:0.90,g_eval:0.75 --min-score 0.80
|
|
262
|
+
|
|
263
|
+
# Inline rows
|
|
264
|
+
aievaluator eval --agent $URL \
|
|
265
|
+
--rows '[{"input":"Hi","expected_output":"Hello"}]'
|
|
266
|
+
|
|
267
|
+
# Custom evaluator inline
|
|
268
|
+
aievaluator eval --agent $URL --dataset ./tests.json \
|
|
269
|
+
--metrics my-eval --custom '{"name":"my-eval","prompt":"...","threshold":0.8}'
|
|
270
|
+
|
|
271
|
+
# CI mode
|
|
272
|
+
aievaluator eval --agent $URL --dataset ./tests.json --ci --format junit
|
|
273
|
+
|
|
274
|
+
# Different agent format
|
|
275
|
+
aievaluator eval --agent $URL --dataset ./tests.json --agent-format claude
|
|
276
|
+
```
|
|
277
|
+
|
|
278
|
+
### `aievaluator config`
|
|
279
|
+
|
|
280
|
+
```bash
|
|
281
|
+
aievaluator config show
|
|
282
|
+
aievaluator config set default-metrics "faithfulness,g_eval"
|
|
283
|
+
aievaluator config set default-min-score 0.80
|
|
284
|
+
aievaluator config unset default-min-score
|
|
285
|
+
```
|
|
286
|
+
|
|
287
|
+
### `aievaluator init`
|
|
288
|
+
|
|
289
|
+
```bash
|
|
290
|
+
aievaluator init
|
|
291
|
+
# Creates aievaluator.config.json + evals/smoke-test.json + updates .gitignore
|
|
292
|
+
```
|
|
293
|
+
|
|
294
|
+
---
|
|
295
|
+
|
|
296
|
+
## Output Formats
|
|
297
|
+
|
|
298
|
+
### Table (default)
|
|
299
|
+
|
|
300
|
+
Human-readable table with scores, pass/fail icons, and token counts.
|
|
301
|
+
|
|
302
|
+
### JSON (`--format json`)
|
|
303
|
+
|
|
304
|
+
```bash
|
|
305
|
+
aievaluator eval ... --format json | jq '.overall_score'
|
|
306
|
+
```
|
|
307
|
+
|
|
308
|
+
Clean JSON on stdout. All logs/warnings go to stderr.
|
|
309
|
+
|
|
310
|
+
### JUnit XML (`--format junit`)
|
|
311
|
+
|
|
312
|
+
```bash
|
|
313
|
+
aievaluator eval ... --format junit > report.xml
|
|
314
|
+
```
|
|
315
|
+
|
|
316
|
+
Native CI integration. `<testcase>` per query, `<failure>` for queries below threshold.
|
|
317
|
+
|
|
318
|
+
---
|
|
319
|
+
|
|
320
|
+
## VS Code Extension
|
|
321
|
+
|
|
322
|
+
Prefer staying in your editor? Install the [VS Code extension](https://marketplace.visualstudio.com/items?itemName=aievaluator.aievaluator).
|
|
323
|
+
|
|
324
|
+
- Select text → right-click → Evaluate
|
|
325
|
+
- Per-metric threshold editor with preset buttons
|
|
326
|
+
- Custom evaluator support via Command Palette
|
|
327
|
+
- Sidebar with evaluation history
|
|
328
|
+
- Dataset file evaluation (JSON + JSONL)
|
|
329
|
+
|
|
330
|
+
[Full VS Code tutorial →](../vscode/README.md)
|
|
331
|
+
|
|
332
|
+
---
|
|
333
|
+
|
|
334
|
+
## Requirements
|
|
335
|
+
|
|
336
|
+
- Python 3.10+
|