aievaluator 1.0.1__tar.gz

Files changed (32)
  1. aievaluator-1.0.1/PKG-INFO +366 -0
  2. aievaluator-1.0.1/README.md +336 -0
  3. aievaluator-1.0.1/pyproject.toml +50 -0
  4. aievaluator-1.0.1/setup.cfg +4 -0
  5. aievaluator-1.0.1/src/aievaluator/__init__.py +3 -0
  6. aievaluator-1.0.1/src/aievaluator/api/__init__.py +0 -0
  7. aievaluator-1.0.1/src/aievaluator/api/client.py +178 -0
  8. aievaluator-1.0.1/src/aievaluator/cli.py +532 -0
  9. aievaluator-1.0.1/src/aievaluator/config.py +115 -0
  10. aievaluator-1.0.1/src/aievaluator/formatters/__init__.py +7 -0
  11. aievaluator-1.0.1/src/aievaluator/formatters/json.py +28 -0
  12. aievaluator-1.0.1/src/aievaluator/formatters/junit.py +46 -0
  13. aievaluator-1.0.1/src/aievaluator/formatters/table.py +53 -0
  14. aievaluator-1.0.1/src/aievaluator.egg-info/PKG-INFO +366 -0
  15. aievaluator-1.0.1/src/aievaluator.egg-info/SOURCES.txt +30 -0
  16. aievaluator-1.0.1/src/aievaluator.egg-info/dependency_links.txt +1 -0
  17. aievaluator-1.0.1/src/aievaluator.egg-info/entry_points.txt +2 -0
  18. aievaluator-1.0.1/src/aievaluator.egg-info/requires.txt +8 -0
  19. aievaluator-1.0.1/src/aievaluator.egg-info/top_level.txt +1 -0
  20. aievaluator-1.0.1/tests/test_api.py +285 -0
  21. aievaluator-1.0.1/tests/test_cli_config.py +98 -0
  22. aievaluator-1.0.1/tests/test_cli_eval.py +495 -0
  23. aievaluator-1.0.1/tests/test_cli_init.py +115 -0
  24. aievaluator-1.0.1/tests/test_cli_login.py +74 -0
  25. aievaluator-1.0.1/tests/test_cli_quick.py +184 -0
  26. aievaluator-1.0.1/tests/test_cli_whoami.py +61 -0
  27. aievaluator-1.0.1/tests/test_config.py +218 -0
  28. aievaluator-1.0.1/tests/test_dataset.py +111 -0
  29. aievaluator-1.0.1/tests/test_exit_codes.py +122 -0
  30. aievaluator-1.0.1/tests/test_formatters.py +190 -0
  31. aievaluator-1.0.1/tests/test_metrics.py +84 -0
  32. aievaluator-1.0.1/tests/test_thresholds.py +76 -0
@@ -0,0 +1,366 @@
1
+ Metadata-Version: 2.4
2
+ Name: aievaluator
3
+ Version: 1.0.1
4
+ Summary: AI Evaluator CLI - evaluate your LLM agents from the command line
5
+ Author-email: AI Evaluator <support@aievaluator.dev>
6
+ License: MIT
7
+ Project-URL: Homepage, https://aievaluator.dev
8
+ Project-URL: Repository, https://github.com/aievaluator-dev/aievaluator-cli
9
+ Project-URL: Issues, https://github.com/aievaluator-dev/aievaluator-cli/issues
10
+ Keywords: ai,evaluation,llm,agent,testing,ci-cd
11
+ Classifier: Development Status :: 4 - Beta
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: License :: OSI Approved :: MIT License
14
+ Classifier: Operating System :: OS Independent
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.10
17
+ Classifier: Programming Language :: Python :: 3.11
18
+ Classifier: Programming Language :: Python :: 3.12
19
+ Classifier: Topic :: Software Development :: Testing
20
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
21
+ Requires-Python: >=3.10
22
+ Description-Content-Type: text/markdown
23
+ Requires-Dist: click>=8.1
24
+ Requires-Dist: httpx>=0.27
25
+ Requires-Dist: rich>=13.0
26
+ Provides-Extra: dev
27
+ Requires-Dist: pytest>=8.0; extra == "dev"
28
+ Requires-Dist: pytest-asyncio>=0.24; extra == "dev"
29
+ Requires-Dist: pytest-httpx>=0.30; extra == "dev"
30
+
31
+ # AI Evaluator CLI - Python
32
+
33
+ [![PyPI](https://img.shields.io/pypi/v/aievaluator)](https://pypi.org/project/aievaluator/)
34
+ [![Python](https://img.shields.io/pypi/pyversions/aievaluator)](https://pypi.org/project/aievaluator/)
35
+
36
+ Evaluate your LLM agents from the terminal. No browser. No dashboard.
37
+
38
+ ```bash
39
+ pip install aievaluator
40
+ ```
41
+
42
+ ---
43
+
44
+ ## 🧭 Tutorial - From Zero to CI/CD
45
+
46
+ Every step builds on the previous one. Start wherever makes sense for you.
47
+
48
+ ---
49
+
50
+ ### Level 0 - Try it without installing anything
51
+
52
+ ```bash
53
+ curl -s -X POST https://api.aievaluator.dev/api/v1/playground/evaluate \
54
+ -H "Content-Type: application/json" \
55
+ -d '{"queries":["What is 2+2?"],"metrics":["faithfulness"]}' | jq .
56
+ ```
57
+
58
+ 5 free per day. No key. No install. Good enough to decide if it's useful.
59
+
60
+ ---
61
+
62
+ ### Level 1 - Install and evaluate a single prompt
63
+
64
+ ```bash
65
+ pip install aievaluator
66
+
67
+ # Ask a question, tell it what you expect
68
+ aievaluator quick "What is the capital of France?" --expected "Paris"
69
+ ```
70
+
71
+ You'll see a table with the score. The `--expected` flag is optional; without it, the judge evaluates
72
+ the response on its own merits.
73
+
74
+ ```
75
+ ⚠️ Playground mode - 4/5 remaining
76
+
77
+ AI Evaluator - Results
78
+ Overall Score: 95.0% ✅ above threshold (0%)
79
+ Total rows: 1
80
+ Failed: 0
81
+
82
+ ┌────┬────────────────────────────────┬──────────┬──────┐
83
+ │ #  │ Query                          │ Score    │ Pass │
84
+ ├────┼────────────────────────────────┼──────────┼──────┤
85
+ │ 1  │ What is the capital of France? │ 95%      │ ✅   │
86
+ └────┴────────────────────────────────┴──────────┴──────┘
87
+ ```
88
+
89
+ ---
90
+
91
+ ### Level 2 - Sign up and scaffold a project
92
+
93
+ Playground is great for trying, but you'll want more than 5 evals/day.
94
+
95
+ ```bash
96
+ # Get your API key at https://aievaluator.dev/settings
97
+ aievaluator login
98
+
99
+ # Check your account
100
+ aievaluator whoami
101
+ ```
102
+
103
+ Now scaffold your project:
104
+
105
+ ```bash
106
+ aievaluator init
107
+ ```
108
+
109
+ This creates:
110
+ - `aievaluator.config.json` - project-local config
111
+ - `evals/smoke-test.json` - sample dataset with 3 queries
112
+ - Updates `.gitignore`
113
+
114
+ Open `evals/smoke-test.json` and replace the sample queries with your own:
115
+
116
+ ```json
117
+ [
118
+ {"input": "What are your business hours?", "expected_output": "Mon-Fri 9am-6pm"},
119
+ {"input": "How do I cancel my order?", "expected_output": "Go to My Orders → Cancel"},
120
+ {"input": "Do you ship internationally?", "expected_output": "Yes, via DHL Express"}
121
+ ]
122
+ ```
123
+
124
+ Test it against the built-in agent:
125
+
126
+ ```bash
127
+ aievaluator quick --dataset ./evals/smoke-test.json
128
+ ```
129
+
130
+ ---
131
+
132
+ ### Level 3 - Evaluate your own agent
133
+
134
+ Point the CLI at your agent's endpoint:
135
+
136
+ ```bash
137
+ aievaluator eval \
138
+ --agent https://chatbot-staging.acme.com/api/chat \
139
+ --dataset ./evals/smoke-test.json \
140
+ --metrics faithfulness,g_eval
141
+ ```
142
+
143
+ The CLI calls your agent with each query, then an LLM judge scores the responses.
144
+
145
+ ---
146
+
147
+ ### Level 4 - Add quality gates
148
+
149
+ Not all metrics are equally important. Set different thresholds per metric:
150
+
151
+ ```bash
152
+ aievaluator eval \
153
+ --agent https://chatbot-staging.acme.com/api/chat \
154
+ --dataset ./evals/smoke-test.json \
155
+ --thresholds faithfulness:0.90,g_eval:0.75
156
+ ```
157
+
158
+ - `faithfulness` must be ≥ 90% (hallucination = instant fail)
159
+ - `g_eval` must be ≥ 75% (general quality)
160
+
161
+ If any metric fails to meet its threshold, that row is marked ❌.
162
+
163
+ **Or set one bar for everything:**
164
+
165
+ ```bash
166
+ aievaluator eval \
167
+ --agent https://chatbot-staging.acme.com/api/chat \
168
+ --dataset ./evals/smoke-test.json \
169
+ --min-score 0.80
170
+ ```
171
+
172
+ This works on `quick` too:
173
+
174
+ ```bash
175
+ aievaluator quick "test prompt" --min-score 0.80
176
+ # Exit code 1 if any metric drops below 0.80
177
+ ```
178
+
179
+ ---
180
+
181
+ ### Level 5 - Create your own evaluation criteria
182
+
183
+ Sometimes the built-in metrics aren't enough. Define a custom evaluator inline:
184
+
185
+ ```bash
186
+ aievaluator eval \
187
+ --agent https://chatbot-staging.acme.com/api/chat \
188
+ --dataset ./evals/smoke-test.json \
189
+ --metrics politeness,g_eval \
190
+ --custom '{"name":"politeness","prompt":"Is the response polite and professional? Answer YES or NO and explain.","threshold":0.85}'
191
+ ```
192
+
193
+ The custom evaluator `politeness` is defined in the request, referenced in `--metrics` by name,
194
+ and evaluated alongside `g_eval`. No dashboard needed.
195
+
196
+ **Custom evaluator with per-metric threshold override:**
197
+
198
+ ```bash
199
+ aievaluator eval \
200
+ --agent $URL --dataset ./tests.json \
201
+ --metrics politeness,g_eval \
202
+ --custom '{"name":"politeness","prompt":"Is the tone friendly?","threshold":0.7}' \
203
+ --thresholds politeness:0.90,g_eval:0.80
204
+ ```
205
+
206
+ The `--thresholds` flag overrides whatever was set in `--custom`. The engine uses the
207
+ per-evaluation value.
208
+
209
+ ---
210
+
211
+ ### Level 6 - CI/CD pipeline
212
+
213
+ Add this to your GitHub Actions, GitLab CI, or Jenkins:
214
+
215
+ ```bash
216
+ aievaluator eval \
217
+ --agent $STAGING_AGENT \
218
+ --dataset ./evals/regression.json \
219
+ --thresholds faithfulness:0.90,g_eval:0.75 \
220
+ --min-score 0.80 \
221
+ --ci \
222
+ --format junit > report.xml
223
+ ```
224
+
225
+ | Flag | What it does |
226
+ |---|---|
227
+ | `--ci` | No colors, no prompts - clean output for logs |
228
+ | `--format junit` | JUnit XML that CI systems understand natively |
229
+ | `--min-score 0.80` | Overall score must be ≥ 80% |
230
+ | `--thresholds` | Per-metric quality bars |
231
+
232
+ Exit code 1 = pipeline fails = deploy blocked.
233
+
234
+ **Environment variables for CI:**
235
+
236
+ ```bash
237
+ export AIEVALUATOR_API_KEY="sk-..." # No hardcoded keys in YAML
238
+ export AIEVALUATOR_ENGINE_URL="https://api.aievaluator.dev"
239
+ ```
240
+
241
+ ---
242
+
243
+ ## 📋 Complete Command Reference
244
+
245
+ ### `aievaluator login`
246
+
247
+ ```bash
248
+ aievaluator login # Interactive prompt
249
+ aievaluator login --api-key sk-xxx # Non-interactive (CI)
250
+ aievaluator login --engine-url https://custom.engine.com
251
+ ```
252
+
253
+ ### `aievaluator whoami`
254
+
255
+ ```bash
256
+ aievaluator whoami
257
+ # Tenant: acme-corp
258
+ # Tier: pro
259
+ # Evals: 42/5000 this cycle
260
+ # Tokens: ↓124,800 · ↑89,200 this cycle
261
+ ```
262
+
263
+ ### `aievaluator quick`
264
+
265
+ ```bash
266
+ # Single query
267
+ aievaluator quick "What is 2+2?" --expected "4"
268
+
269
+ # Per-metric thresholds
270
+ aievaluator quick "test" --metrics faithfulness:0.90,g_eval:0.75
271
+
272
+ # General threshold
273
+ aievaluator quick "test" --min-score 0.80
274
+
275
+ # From dataset (JSON or JSONL)
276
+ aievaluator quick --dataset ./tests.json
277
+ aievaluator quick --dataset ./tests.jsonl
278
+
279
+ # Custom judge model
280
+ aievaluator quick "test" --judge deepseek
281
+ ```
282
+
283
+ ### `aievaluator eval`
284
+
285
+ ```bash
286
+ # Basic
287
+ aievaluator eval --agent $URL --dataset ./tests.json
288
+
289
+ # With quality gates
290
+ aievaluator eval --agent $URL --dataset ./tests.json \
291
+ --thresholds faithfulness:0.90,g_eval:0.75 --min-score 0.80
292
+
293
+ # Inline rows
294
+ aievaluator eval --agent $URL \
295
+ --rows '[{"input":"Hi","expected_output":"Hello"}]'
296
+
297
+ # Custom evaluator inline
298
+ aievaluator eval --agent $URL --dataset ./tests.json \
299
+ --metrics my-eval --custom '{"name":"my-eval","prompt":"...","threshold":0.8}'
300
+
301
+ # CI mode
302
+ aievaluator eval --agent $URL --dataset ./tests.json --ci --format junit
303
+
304
+ # Different agent format
305
+ aievaluator eval --agent $URL --dataset ./tests.json --agent-format claude
306
+ ```
307
+
308
+ ### `aievaluator config`
309
+
310
+ ```bash
311
+ aievaluator config show
312
+ aievaluator config set default-metrics "faithfulness,g_eval"
313
+ aievaluator config set default-min-score 0.80
314
+ aievaluator config unset default-min-score
315
+ ```
316
+
317
+ ### `aievaluator init`
318
+
319
+ ```bash
320
+ aievaluator init
321
+ # Creates aievaluator.config.json + evals/smoke-test.json + updates .gitignore
322
+ ```
323
+
324
+ ---
325
+
326
+ ## 📊 Output Formats
327
+
328
+ ### Table (default)
329
+
330
+ Human-readable table with scores, pass/fail icons, and token counts.
331
+
332
+ ### JSON (`--format json`)
333
+
334
+ ```bash
335
+ aievaluator eval ... --format json | jq '.overall_score'
336
+ ```
337
+
338
+ Clean JSON on stdout. All logs/warnings go to stderr.
339
+
340
+ ### JUnit XML (`--format junit`)
341
+
342
+ ```bash
343
+ aievaluator eval ... --format junit > report.xml
344
+ ```
345
+
346
+ Native CI integration. `<testcase>` per query, `<failure>` for queries below threshold.
347
+
348
+ ---
349
+
350
+ ## 🤖 VS Code Extension
351
+
352
+ Prefer staying in your editor? Install the [VS Code extension](https://marketplace.visualstudio.com/items?itemName=aievaluator.aievaluator).
353
+
354
+ - Select text → right-click → Evaluate
355
+ - Per-metric threshold editor with preset buttons
356
+ - Custom evaluator support via Command Palette
357
+ - Sidebar with evaluation history
358
+ - Dataset file evaluation (JSON + JSONL)
359
+
360
+ [Full VS Code tutorial →](../vscode/README.md)
361
+
362
+ ---
363
+
364
+ ## Requirements
365
+
366
+ - Python 3.10+
@@ -0,0 +1,336 @@
1
+ # AI Evaluator CLI - Python
2
+
3
+ [![PyPI](https://img.shields.io/pypi/v/aievaluator)](https://pypi.org/project/aievaluator/)
4
+ [![Python](https://img.shields.io/pypi/pyversions/aievaluator)](https://pypi.org/project/aievaluator/)
5
+
6
+ Evaluate your LLM agents from the terminal. No browser. No dashboard.
7
+
8
+ ```bash
9
+ pip install aievaluator
10
+ ```
11
+
12
+ ---
13
+
14
+ ## 🧭 Tutorial - From Zero to CI/CD
15
+
16
+ Every step builds on the previous one. Start wherever makes sense for you.
17
+
18
+ ---
19
+
20
+ ### Level 0 - Try it without installing anything
21
+
22
+ ```bash
23
+ curl -s -X POST https://api.aievaluator.dev/api/v1/playground/evaluate \
24
+ -H "Content-Type: application/json" \
25
+ -d '{"queries":["What is 2+2?"],"metrics":["faithfulness"]}' | jq .
26
+ ```
27
+
28
+ 5 free per day. No key. No install. Good enough to decide if it's useful.
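For scripted use, the same playground call can be made from Python. This is a minimal sketch using only the standard library; the URL and the `queries`/`metrics` fields are copied from the curl command, while the function names and the 30-second timeout are assumptions for illustration. The response is returned undecoded beyond JSON parsing because its schema is not documented here.

```python
import json
import urllib.request

# Endpoint and payload fields are copied from the curl example.
PLAYGROUND_URL = "https://api.aievaluator.dev/api/v1/playground/evaluate"


def build_playground_request(queries, metrics):
    """Build (but do not send) the POST request that the curl example makes."""
    body = json.dumps({"queries": list(queries), "metrics": list(metrics)})
    return urllib.request.Request(
        PLAYGROUND_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def evaluate(queries, metrics, timeout=30):
    """Send the request and return the parsed JSON response unchanged."""
    request = build_playground_request(queries, metrics)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `evaluate(["What is 2+2?"], ["faithfulness"])` counts against the same daily playground quota as the curl command.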
29
+
30
+ ---
31
+
32
+ ### Level 1 - Install and evaluate a single prompt
33
+
34
+ ```bash
35
+ pip install aievaluator
36
+
37
+ # Ask a question, tell it what you expect
38
+ aievaluator quick "What is the capital of France?" --expected "Paris"
39
+ ```
40
+
41
+ You'll see a table with the score. The `--expected` flag is optional; without it, the judge evaluates
42
+ the response on its own merits.
43
+
44
+ ```
45
+ ⚠️ Playground mode - 4/5 remaining
46
+
47
+ AI Evaluator - Results
48
+ Overall Score: 95.0% ✅ above threshold (0%)
49
+ Total rows: 1
50
+ Failed: 0
51
+
52
+ ┌────┬────────────────────────────────┬──────────┬──────┐
53
+ │ #  │ Query                          │ Score    │ Pass │
54
+ ├────┼────────────────────────────────┼──────────┼──────┤
55
+ │ 1  │ What is the capital of France? │ 95%      │ ✅   │
56
+ └────┴────────────────────────────────┴──────────┴──────┘
57
+ ```
58
+
59
+ ---
60
+
61
+ ### Level 2 - Sign up and scaffold a project
62
+
63
+ Playground is great for trying, but you'll want more than 5 evals/day.
64
+
65
+ ```bash
66
+ # Get your API key at https://aievaluator.dev/settings
67
+ aievaluator login
68
+
69
+ # Check your account
70
+ aievaluator whoami
71
+ ```
72
+
73
+ Now scaffold your project:
74
+
75
+ ```bash
76
+ aievaluator init
77
+ ```
78
+
79
+ This creates:
80
+ - `aievaluator.config.json` - project-local config
81
+ - `evals/smoke-test.json` - sample dataset with 3 queries
82
+ - Updates `.gitignore`
83
+
84
+ Open `evals/smoke-test.json` and replace the sample queries with your own:
85
+
86
+ ```json
87
+ [
88
+ {"input": "What are your business hours?", "expected_output": "Mon-Fri 9am-6pm"},
89
+ {"input": "How do I cancel my order?", "expected_output": "Go to My Orders → Cancel"},
90
+ {"input": "Do you ship internationally?", "expected_output": "Yes, via DHL Express"}
91
+ ]
92
+ ```
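Before spending evaluations on a hand-edited dataset, it can help to check the file locally. The sketch below is not the CLI's own loader; it assumes, based on the sample rows, that every row is an object with a string `input` field and that `.jsonl` files hold one such object per line. The function name `load_dataset` is hypothetical.

```python
import json
from pathlib import Path

REQUIRED_FIELD = "input"  # present in every sample row; assumed mandatory


def load_dataset(path):
    """Load rows from a .json file (one array) or a .jsonl file (one object per line)."""
    text = Path(path).read_text(encoding="utf-8")
    if str(path).endswith(".jsonl"):
        # Blank lines are skipped so a trailing newline does not break parsing.
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    else:
        rows = json.loads(text)
    if not isinstance(rows, list):
        raise ValueError("dataset must be a list of rows")
    for index, row in enumerate(rows, start=1):
        if not isinstance(row, dict) or not isinstance(row.get(REQUIRED_FIELD), str):
            raise ValueError(f"row {index}: missing string field '{REQUIRED_FIELD}'")
    return rows
```

`load_dataset("./evals/smoke-test.json")` returns the rows as a list of dicts, or raises `ValueError` naming the first malformed row.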
93
+
94
+ Test it against the built-in agent:
95
+
96
+ ```bash
97
+ aievaluator quick --dataset ./evals/smoke-test.json
98
+ ```
99
+
100
+ ---
101
+
102
+ ### Level 3 - Evaluate your own agent
103
+
104
+ Point the CLI at your agent's endpoint:
105
+
106
+ ```bash
107
+ aievaluator eval \
108
+ --agent https://chatbot-staging.acme.com/api/chat \
109
+ --dataset ./evals/smoke-test.json \
110
+ --metrics faithfulness,g_eval
111
+ ```
112
+
113
+ The CLI calls your agent with each query, then an LLM judge scores the responses.
114
+
115
+ ---
116
+
117
+ ### Level 4 - Add quality gates
118
+
119
+ Not all metrics are equally important. Set different thresholds per metric:
120
+
121
+ ```bash
122
+ aievaluator eval \
123
+ --agent https://chatbot-staging.acme.com/api/chat \
124
+ --dataset ./evals/smoke-test.json \
125
+ --thresholds faithfulness:0.90,g_eval:0.75
126
+ ```
127
+
128
+ - `faithfulness` must be ≥ 90% (hallucination = instant fail)
129
+ - `g_eval` must be ≥ 75% (general quality)
130
+
131
+ If any metric fails to meet its threshold, that row is marked ❌.
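The per-metric gate can be pictured as follows. This is a sketch of the rule as described, not the CLI's actual implementation: it assumes scores are on a 0 to 1 scale like the thresholds, and that a metric missing from a row counts as a failure. Both function names are hypothetical.

```python
def parse_thresholds(spec):
    """Parse a 'name:value,name:value' string into a dict of floats."""
    thresholds = {}
    for item in spec.split(","):
        name, _, value = item.strip().partition(":")
        if not name or not value:
            raise ValueError(f"expected name:value, got {item!r}")
        thresholds[name] = float(value)
    return thresholds


def row_passes(scores, thresholds):
    """A row passes only if every gated metric meets or exceeds its threshold."""
    # Assumption: a metric with no score is treated as 0.0 and therefore fails.
    return all(scores.get(metric, 0.0) >= bar for metric, bar in thresholds.items())
```

With the thresholds from the command above, a row scoring `faithfulness` 0.85 fails even if `g_eval` is 0.95, because each metric is checked against its own bar.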
132
+
133
+ **Or set one bar for everything:**
134
+
135
+ ```bash
136
+ aievaluator eval \
137
+ --agent https://chatbot-staging.acme.com/api/chat \
138
+ --dataset ./evals/smoke-test.json \
139
+ --min-score 0.80
140
+ ```
141
+
142
+ This works on `quick` too:
143
+
144
+ ```bash
145
+ aievaluator quick "test prompt" --min-score 0.80
146
+ # Exit code 1 if any metric drops below 0.80
147
+ ```
148
+
149
+ ---
150
+
151
+ ### Level 5 - Create your own evaluation criteria
152
+
153
+ Sometimes the built-in metrics aren't enough. Define a custom evaluator inline:
154
+
155
+ ```bash
156
+ aievaluator eval \
157
+ --agent https://chatbot-staging.acme.com/api/chat \
158
+ --dataset ./evals/smoke-test.json \
159
+ --metrics politeness,g_eval \
160
+ --custom '{"name":"politeness","prompt":"Is the response polite and professional? Answer YES or NO and explain.","threshold":0.85}'
161
+ ```
162
+
163
+ The custom evaluator `politeness` is defined in the request, referenced in `--metrics` by name,
164
+ and evaluated alongside `g_eval`. No dashboard needed.
165
+
166
+ **Custom evaluator with per-metric threshold override:**
167
+
168
+ ```bash
169
+ aievaluator eval \
170
+ --agent $URL --dataset ./tests.json \
171
+ --metrics politeness,g_eval \
172
+ --custom '{"name":"politeness","prompt":"Is the tone friendly?","threshold":0.7}' \
173
+ --thresholds politeness:0.90,g_eval:0.80
174
+ ```
175
+
176
+ The `--thresholds` flag overrides whatever was set in `--custom`. The engine uses the
177
+ per-evaluation value.
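The precedence rule amounts to a two-step merge. The sketch below illustrates that rule only and makes no claim about how the engine implements it; `effective_thresholds` is a hypothetical helper, and the `name` and `threshold` keys are taken from the `--custom` JSON in the example.

```python
import json


def effective_thresholds(custom_json, overrides):
    """Merge thresholds so that --thresholds values win over the --custom default."""
    custom = json.loads(custom_json)
    merged = {}
    if "threshold" in custom:
        merged[custom["name"]] = float(custom["threshold"])
    merged.update(overrides)  # applied last, so overrides replace the default
    return merged
```

For the command above, the merge yields 0.90 for `politeness` (the 0.7 from `--custom` is replaced) and 0.80 for `g_eval`.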
178
+
179
+ ---
180
+
181
+ ### Level 6 - CI/CD pipeline
182
+
183
+ Add this to your GitHub Actions, GitLab CI, or Jenkins:
184
+
185
+ ```bash
186
+ aievaluator eval \
187
+ --agent $STAGING_AGENT \
188
+ --dataset ./evals/regression.json \
189
+ --thresholds faithfulness:0.90,g_eval:0.75 \
190
+ --min-score 0.80 \
191
+ --ci \
192
+ --format junit > report.xml
193
+ ```
194
+
195
+ | Flag | What it does |
196
+ |---|---|
197
+ | `--ci` | No colors, no prompts - clean output for logs |
198
+ | `--format junit` | JUnit XML that CI systems understand natively |
199
+ | `--min-score 0.80` | Overall score must be ≥ 80% |
200
+ | `--thresholds` | Per-metric quality bars |
201
+
202
+ Exit code 1 = pipeline fails = deploy blocked.
203
+
204
+ **Environment variables for CI:**
205
+
206
+ ```bash
207
+ export AIEVALUATOR_API_KEY="sk-..." # No hardcoded keys in YAML
208
+ export AIEVALUATOR_ENGINE_URL="https://api.aievaluator.dev"
209
+ ```
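A pipeline step can fail fast when the key is missing instead of waiting for the first API call. This preflight check is a sketch: the two variable names come from the exports above, while `ci_settings` and the fallback to `https://api.aievaluator.dev` when `AIEVALUATOR_ENGINE_URL` is unset are assumptions, not documented CLI behavior.

```python
import os

# Assumed fallback: the engine URL shown in the export example.
DEFAULT_ENGINE_URL = "https://api.aievaluator.dev"


def ci_settings(environ=None):
    """Read the CI environment variables and fail early if the key is absent."""
    env = os.environ if environ is None else environ
    api_key = env.get("AIEVALUATOR_API_KEY", "")
    if not api_key:
        raise RuntimeError("AIEVALUATOR_API_KEY is not set")
    return {
        "api_key": api_key,
        "engine_url": env.get("AIEVALUATOR_ENGINE_URL", DEFAULT_ENGINE_URL),
    }
```

Passing a plain dict as `environ` keeps the check easy to unit test without touching the real environment.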
210
+
211
+ ---
212
+
213
+ ## 📋 Complete Command Reference
214
+
215
+ ### `aievaluator login`
216
+
217
+ ```bash
218
+ aievaluator login # Interactive prompt
219
+ aievaluator login --api-key sk-xxx # Non-interactive (CI)
220
+ aievaluator login --engine-url https://custom.engine.com
221
+ ```
222
+
223
+ ### `aievaluator whoami`
224
+
225
+ ```bash
226
+ aievaluator whoami
227
+ # Tenant: acme-corp
228
+ # Tier: pro
229
+ # Evals: 42/5000 this cycle
230
+ # Tokens: ↓124,800 · ↑89,200 this cycle
231
+ ```
232
+
233
+ ### `aievaluator quick`
234
+
235
+ ```bash
236
+ # Single query
237
+ aievaluator quick "What is 2+2?" --expected "4"
238
+
239
+ # Per-metric thresholds
240
+ aievaluator quick "test" --metrics faithfulness:0.90,g_eval:0.75
241
+
242
+ # General threshold
243
+ aievaluator quick "test" --min-score 0.80
244
+
245
+ # From dataset (JSON or JSONL)
246
+ aievaluator quick --dataset ./tests.json
247
+ aievaluator quick --dataset ./tests.jsonl
248
+
249
+ # Custom judge model
250
+ aievaluator quick "test" --judge deepseek
251
+ ```
252
+
253
+ ### `aievaluator eval`
254
+
255
+ ```bash
256
+ # Basic
257
+ aievaluator eval --agent $URL --dataset ./tests.json
258
+
259
+ # With quality gates
260
+ aievaluator eval --agent $URL --dataset ./tests.json \
261
+ --thresholds faithfulness:0.90,g_eval:0.75 --min-score 0.80
262
+
263
+ # Inline rows
264
+ aievaluator eval --agent $URL \
265
+ --rows '[{"input":"Hi","expected_output":"Hello"}]'
266
+
267
+ # Custom evaluator inline
268
+ aievaluator eval --agent $URL --dataset ./tests.json \
269
+ --metrics my-eval --custom '{"name":"my-eval","prompt":"...","threshold":0.8}'
270
+
271
+ # CI mode
272
+ aievaluator eval --agent $URL --dataset ./tests.json --ci --format junit
273
+
274
+ # Different agent format
275
+ aievaluator eval --agent $URL --dataset ./tests.json --agent-format claude
276
+ ```
277
+
278
+ ### `aievaluator config`
279
+
280
+ ```bash
281
+ aievaluator config show
282
+ aievaluator config set default-metrics "faithfulness,g_eval"
283
+ aievaluator config set default-min-score 0.80
284
+ aievaluator config unset default-min-score
285
+ ```
286
+
287
+ ### `aievaluator init`
288
+
289
+ ```bash
290
+ aievaluator init
291
+ # Creates aievaluator.config.json + evals/smoke-test.json + updates .gitignore
292
+ ```
293
+
294
+ ---
295
+
296
+ ## 📊 Output Formats
297
+
298
+ ### Table (default)
299
+
300
+ Human-readable table with scores, pass/fail icons, and token counts.
301
+
302
+ ### JSON (`--format json`)
303
+
304
+ ```bash
305
+ aievaluator eval ... --format json | jq '.overall_score'
306
+ ```
307
+
308
+ Clean JSON on stdout. All logs/warnings go to stderr.
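Because stdout carries only JSON, the report can be consumed from a script as easily as from `jq`. In this sketch the only field relied on is `overall_score`, which appears in the `jq` example; the rest of the report schema and the scale of the score are not documented here, so the value is returned as-is. The function name is hypothetical.

```python
import json


def overall_score(report_text):
    """Extract the overall_score field from a `--format json` report."""
    report = json.loads(report_text)
    if "overall_score" not in report:
        raise KeyError("report has no 'overall_score' field")
    return float(report["overall_score"])
```

In a pipeline, the captured stdout of the CLI (for example `sys.stdin.read()` when piped) would be passed as `report_text`.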
309
+
310
+ ### JUnit XML (`--format junit`)
311
+
312
+ ```bash
313
+ aievaluator eval ... --format junit > report.xml
314
+ ```
315
+
316
+ Native CI integration. `<testcase>` per query, `<failure>` for queries below threshold.
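To list the failing queries from `report.xml` outside a CI dashboard, the file can be read with the standard library. This sketch assumes the conventional JUnit layout, with `<failure>` as a direct child of `<testcase>` and the query identified by the `name` attribute; the exact attributes the CLI emits are not shown in this README.

```python
import xml.etree.ElementTree as ET


def failed_testcases(junit_xml):
    """Return the name attribute of every <testcase> that contains a <failure>."""
    root = ET.fromstring(junit_xml)
    failed = []
    # iter() also matches when the root is <testsuites> wrapping several suites.
    for case in root.iter("testcase"):
        if case.find("failure") is not None:
            failed.append(case.get("name", ""))
    return failed
```

An empty list means no query fell below its threshold.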
317
+
318
+ ---
319
+
320
+ ## 🤖 VS Code Extension
321
+
322
+ Prefer staying in your editor? Install the [VS Code extension](https://marketplace.visualstudio.com/items?itemName=aievaluator.aievaluator).
323
+
324
+ - Select text → right-click → Evaluate
325
+ - Per-metric threshold editor with preset buttons
326
+ - Custom evaluator support via Command Palette
327
+ - Sidebar with evaluation history
328
+ - Dataset file evaluation (JSON + JSONL)
329
+
330
+ [Full VS Code tutorial →](../vscode/README.md)
331
+
332
+ ---
333
+
334
+ ## Requirements
335
+
336
+ - Python 3.10+