agent-duelist 0.1.0 → 0.2.0
- package/README.md +150 -58
- package/dist/cli.js +870 -123
- package/dist/cli.js.map +1 -1
- package/dist/index.cjs +897 -227
- package/dist/index.cjs.map +1 -1
- package/dist/index.d.cts +67 -3
- package/dist/index.d.ts +67 -3
- package/dist/index.js +887 -224
- package/dist/index.js.map +1 -1
- package/package.json +1 -1
- package/templates/arena.config.ts +5 -5
package/README.md
CHANGED
|
@@ -1,13 +1,16 @@
|
|
|
1
1
|
# agent-duelist
|
|
2
2
|
|
|
3
3
|
[](https://github.com/DataGobes/agent-duelist/actions/workflows/ci.yml)
|
|
4
|
+
[](https://datagobes.github.io/agent-duelist/)
|
|
4
5
|
|
|
5
6
|
> Pit LLM providers against each other on agent tasks — Duel your models.
|
|
7
|
+
>
|
|
8
|
+
> **[View the landing page →](https://datagobes.github.io/agent-duelist/)**
|
|
6
9
|
|
|
7
10
|
`agent-duelist` is a TypeScript-first framework to pit multiple LLM providers against each other on the same tasks and get structured, reproducible results: correctness, latency, tokens, and cost.
|
|
8
11
|
|
|
9
12
|
## What you get
|
|
10
|
-
>
|
|
13
|
+
> 
|
|
11
14
|
|
|
12
15
|
|
|
13
16
|
- Compare OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible gateway.
|
|
@@ -64,8 +67,8 @@ import { z } from 'zod'
|
|
|
64
67
|
|
|
65
68
|
export default defineArena({
|
|
66
69
|
providers: [
|
|
67
|
-
openai('gpt-
|
|
68
|
-
azureOpenai('gpt-
|
|
70
|
+
openai('gpt-5-mini'),
|
|
71
|
+
azureOpenai('gpt-5-mini', { deployment: 'my-azure-deployment' }),
|
|
69
72
|
],
|
|
70
73
|
tasks: [
|
|
71
74
|
{
|
|
@@ -98,7 +101,7 @@ npx duelist run
|
|
|
98
101
|
You'll see a matrix like:
|
|
99
102
|
|
|
100
103
|
- Rows: tasks (`simple-qa`, `structured-extraction`)
|
|
101
|
-
- Columns: providers (`openai/gpt-
|
|
104
|
+
- Columns: providers (`openai/gpt-5-mini`, `azure/gpt-5-mini`)
|
|
102
105
|
- Cells: correctness score, latency, tokens, and estimated cost.
|
|
103
106
|
|
|
104
107
|
For CI or further processing:
|
|
@@ -133,21 +136,21 @@ import {
|
|
|
133
136
|
type ArenaProvider,
|
|
134
137
|
} from 'agent-duelist'
|
|
135
138
|
|
|
136
|
-
const oai = openai('gpt-
|
|
139
|
+
const oai = openai('gpt-5-mini')
|
|
137
140
|
|
|
138
|
-
const azure = azureOpenai('gpt-
|
|
141
|
+
const azure = azureOpenai('gpt-5-mini', {
|
|
139
142
|
deployment: 'my-deployment',
|
|
140
143
|
})
|
|
141
144
|
|
|
142
|
-
const claude = anthropic('claude-sonnet-4
|
|
145
|
+
const claude = anthropic('claude-sonnet-4.6')
|
|
143
146
|
|
|
144
|
-
const gem = gemini('gemini-
|
|
147
|
+
const gem = gemini('gemini-3-flash-preview') // uses GOOGLE_API_KEY
|
|
145
148
|
|
|
146
149
|
const local: ArenaProvider = openaiCompatible({
|
|
147
|
-
id: 'local/
|
|
150
|
+
id: 'local/llama',
|
|
148
151
|
name: 'Local Gateway',
|
|
149
152
|
baseURL: 'http://localhost:11434/v1',
|
|
150
|
-
model: '
|
|
153
|
+
model: 'llama3.3',
|
|
151
154
|
apiKeyEnv: 'LOCAL_LLM_API_KEY',
|
|
152
155
|
})
|
|
153
156
|
```
|
|
@@ -156,7 +159,7 @@ At minimum, a provider implements:
|
|
|
156
159
|
|
|
157
160
|
```ts
|
|
158
161
|
interface ArenaProvider {
|
|
159
|
-
id: string // e.g. 'openai/gpt-
|
|
162
|
+
id: string // e.g. 'openai/gpt-5-mini'
|
|
160
163
|
name: string // e.g. 'OpenAI'
|
|
161
164
|
model: string
|
|
162
165
|
run(input: TaskInput): Promise<TaskResult>
|
|
@@ -230,12 +233,39 @@ defineArena({
|
|
|
230
233
|
})
|
|
231
234
|
```
|
|
232
235
|
|
|
233
|
-
The judge model defaults to `gpt-
|
|
236
|
+
The judge model defaults to `gpt-5-mini`. It can also be set via the `DUELIST_JUDGE_MODEL` env var. The judge backend is auto-detected from the model name — `gemini-*` models use Google's API, otherwise it falls back to OpenAI or Azure OpenAI.
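The resolution order described above can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: `resolveJudge`, `JudgeSelection`, and the precedence of an explicit config value over the env var are assumptions, and the choice between OpenAI and Azure OpenAI is deliberately left unmodeled because the README does not say how it is made.

```ts
// Hypothetical sketch of judge selection; names are illustrative only.
type JudgeBackend = 'google' | 'openai-or-azure'

interface JudgeSelection {
  model: string
  backend: JudgeBackend
}

const DEFAULT_JUDGE_MODEL = 'gpt-5-mini'

// Assumption: explicit config wins, then DUELIST_JUDGE_MODEL, then the default.
function resolveJudge(
  configModel: string | undefined,
  env: Record<string, string | undefined>,
): JudgeSelection {
  const model = configModel ?? env.DUELIST_JUDGE_MODEL ?? DEFAULT_JUDGE_MODEL
  // `gemini-*` model names route to Google's API; everything else falls back.
  const backend: JudgeBackend = model.startsWith('gemini-') ? 'google' : 'openai-or-azure'
  return { model, backend }
}
```

For example, `resolveJudge(undefined, { DUELIST_JUDGE_MODEL: 'gemini-3.1-pro-preview' })` would select the Google backend under these assumptions.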
|
|
234
237
|
|
|
235
238
|
You can also add custom scorers for domain-specific metrics (e.g. tool-call correctness, safety, style).
|
|
236
239
|
|
|
237
240
|
---
|
|
238
241
|
|
|
242
|
+
### Arena options
|
|
243
|
+
|
|
244
|
+
`defineArena()` accepts these top-level options alongside `providers`, `tasks`, and `scorers`:
|
|
245
|
+
|
|
246
|
+
| Option | Type | Default | Description |
|
|
247
|
+
|--------|------|---------|-------------|
|
|
248
|
+
| `runs` | `number` | `1` | Number of runs per provider × task combination. Higher values improve statistical confidence for CI regression detection. |
|
|
249
|
+
| `judgeModel` | `string` | `'gpt-5-mini'` | Model used by the `llm-judge-correctness` scorer. Also settable via `DUELIST_JUDGE_MODEL` env var. Gemini models auto-route to Google's API. |
|
|
250
|
+
| `timeout` | `number` | `60000` | Per-request timeout in milliseconds. Requests exceeding this are marked as failures. Prevents hanging on unresponsive APIs. |
|
|
251
|
+
| `sparklines` | `boolean` | `true` | Show sparkline bars next to percentage scores in the console reporter. Disable with `false` if your terminal doesn't render Unicode block characters well. |
|
|
252
|
+
|
|
253
|
+
Example with all options:
|
|
254
|
+
|
|
255
|
+
```ts
|
|
256
|
+
export default defineArena({
|
|
257
|
+
providers: [openai('gpt-5-mini'), gemini('gemini-3-flash-preview')],
|
|
258
|
+
tasks: [/* ... */],
|
|
259
|
+
scorers: ['latency', 'cost', 'correctness', 'llm-judge-correctness'],
|
|
260
|
+
runs: 3,
|
|
261
|
+
judgeModel: 'gemini-3.1-pro-preview',
|
|
262
|
+
timeout: 30_000, // 30s — fail fast on slow APIs
|
|
263
|
+
sparklines: false, // plain percentages, no Unicode bars
|
|
264
|
+
})
|
|
265
|
+
```
|
|
266
|
+
|
|
267
|
+
---
|
|
268
|
+
|
|
239
269
|
## Cost & pricing
|
|
240
270
|
|
|
241
271
|
Cost estimation is intentionally transparent and conservative:
|
|
@@ -247,7 +277,7 @@ Cost estimation is intentionally transparent and conservative:
|
|
|
247
277
|
`agent-duelist` ships with a **locally bundled catalog** of per-token prices for many models, derived from OpenRouter's public pricing pages.
|
|
248
278
|
|
|
249
279
|
- The catalog maps `(provider, model)` → `{ inputPerM, outputPerM }` in USD per 1M tokens.
|
|
250
|
-
- Azure OpenAI models are resolved back to their base OpenAI models where possible (e.g. `azure/gpt-
|
|
280
|
+
- Azure OpenAI models are resolved back to their base OpenAI models where possible (e.g. `azure/gpt-5-mini` → `openai/gpt-5-mini`) so you don't need to configure Azure pricing manually.
|
|
251
281
|
|
|
252
282
|
3. **Estimated USD**
|
|
253
283
|
The `cost` scorer computes:
|
|
@@ -274,12 +304,19 @@ You can update the catalog with a script that re-scrapes OpenRouter's public pri
|
|
|
274
304
|
|
|
275
305
|
## CLI usage
|
|
276
306
|
|
|
277
|
-
|
|
307
|
+
### `duelist init`
|
|
308
|
+
|
|
309
|
+
Scaffold a new `arena.config.ts` in the current directory.
|
|
278
310
|
|
|
279
311
|
```bash
|
|
280
|
-
# Scaffold a new config
|
|
281
312
|
npx duelist init
|
|
313
|
+
```
|
|
282
314
|
|
|
315
|
+
### `duelist run`
|
|
316
|
+
|
|
317
|
+
Run benchmarks defined in your arena config.
|
|
318
|
+
|
|
319
|
+
```bash
|
|
283
320
|
# Run with the default config (arena.config.ts)
|
|
284
321
|
npx duelist run
|
|
285
322
|
|
|
@@ -288,12 +325,48 @@ npx duelist run --config path/to/arena.config.ts
|
|
|
288
325
|
|
|
289
326
|
# Get JSON instead of a table
|
|
290
327
|
npx duelist run --reporter json
|
|
328
|
+
|
|
329
|
+
# Suppress per-result progress lines
|
|
330
|
+
npx duelist run --quiet
|
|
331
|
+
```
|
|
332
|
+
|
|
333
|
+
| Option | Description |
|
|
334
|
+
|--------|-------------|
|
|
335
|
+
| `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
|
|
336
|
+
| `--reporter <type>` | Output format: `console` (default) or `json` |
|
|
337
|
+
| `-q, --quiet` | Suppress per-result progress |
|
|
338
|
+
|
|
339
|
+
### `duelist ci`
|
|
340
|
+
|
|
341
|
+
Run benchmarks, compare against a baseline, and enforce quality gates. Exits non-zero if regressions are detected or cost exceeds the budget.
|
|
342
|
+
|
|
343
|
+
```bash
|
|
344
|
+
# First run — establishes the baseline
|
|
345
|
+
npx duelist ci --update-baseline
|
|
346
|
+
|
|
347
|
+
# Subsequent runs — compare against baseline
|
|
348
|
+
npx duelist ci --threshold correctness=0.1 --budget 1.00
|
|
349
|
+
|
|
350
|
+
# Post comparison table as a PR comment (GitHub Actions)
|
|
351
|
+
npx duelist ci --threshold correctness=0.1 --comment
|
|
291
352
|
```
|
|
292
353
|
|
|
293
|
-
|
|
354
|
+
| Option | Description |
|
|
355
|
+
|--------|-------------|
|
|
356
|
+
| `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
|
|
357
|
+
| `--baseline <path>` | Baseline JSON file (default: `.duelist/baseline.json`) |
|
|
358
|
+
| `--budget <dollars>` | Max total cost in USD — fails if exceeded |
|
|
359
|
+
| `--threshold <scorer=delta>` | Regression threshold (repeatable, e.g. `--threshold correctness=0.1 --threshold cost=0.002`) |
|
|
360
|
+
| `--update-baseline` | Save results as new baseline after passing |
|
|
361
|
+
| `--comment` | Post markdown comparison table as a GitHub PR comment |
|
|
362
|
+
| `-q, --quiet` | Suppress per-result progress |
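As an illustration of the `scorer=delta` format accepted by `--threshold`, the pairs could be parsed along these lines. `parseThresholds` is a hypothetical helper written for this example, not a function exported by `agent-duelist`.

```ts
// Hypothetical helper: turns repeated `scorer=delta` values into a lookup map.
function parseThresholds(pairs: string[]): Map<string, number> {
  const thresholds = new Map<string, number>()
  for (const pair of pairs) {
    const eq = pair.indexOf('=')
    if (eq <= 0) {
      throw new Error(`Expected scorer=delta, got "${pair}"`)
    }
    const scorer = pair.slice(0, eq).trim()
    const raw = pair.slice(eq + 1).trim()
    const delta = Number(raw)
    // Reject empty or non-numeric deltas instead of silently treating them as 0.
    if (raw === '' || !Number.isFinite(delta)) {
      throw new Error(`Invalid delta for "${scorer}": "${raw}"`)
    }
    thresholds.set(scorer, delta)
  }
  return thresholds
}

// Mirrors: --threshold correctness=0.1 --threshold cost=0.002
const parsed = parseThresholds(['correctness=0.1', 'cost=0.002'])
```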
|
|
363
|
+
|
|
364
|
+
**How regression detection works:**
|
|
294
365
|
|
|
295
|
-
-
|
|
296
|
-
-
|
|
366
|
+
- With `runs > 1`, `duelist ci` uses 95% confidence intervals (t-distribution): a scorer only counts as regressed if the baseline and current intervals don't overlap beyond the threshold. This avoids false positives from noisy LLM outputs.
|
|
367
|
+
- With `runs === 1`, it uses a simple delta comparison.
|
|
368
|
+
- Without `--threshold` flags, regression detection is skipped entirely — only `--budget` is enforced.
|
|
369
|
+
- Results with high variance (coefficient of variation, CV > 0.3) are flagged as **flaky** with a warning.
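The rules above can be approximated with the sketch below. It is one plausible reading, not the package's implementation: it assumes higher scores are better, that "regressed" means the current interval's upper bound sits more than the threshold below the baseline interval's lower bound, and that CV is the sample standard deviation divided by the mean. All function names are hypothetical.

```ts
// Two-sided 95% t critical values, indexed by degrees of freedom (n - 1).
const T_95: Record<number, number> = {
  1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
  6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228,
}

function tCritical(df: number): number {
  // Beyond the small table, fall back to the normal approximation.
  return T_95[df] ?? 1.96
}

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length
}

function sampleStdDev(xs: number[]): number {
  if (xs.length < 2) return 0
  const m = mean(xs)
  const variance = xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / (xs.length - 1)
  return Math.sqrt(variance)
}

interface Interval { mean: number; lower: number; upper: number }

function confidenceInterval95(xs: number[]): Interval {
  const m = mean(xs)
  // A single run has no spread, so the interval collapses to a point.
  if (xs.length < 2) return { mean: m, lower: m, upper: m }
  const half = (tCritical(xs.length - 1) * sampleStdDev(xs)) / Math.sqrt(xs.length)
  return { mean: m, lower: m - half, upper: m + half }
}

// Assumed rule: regress only when the intervals are separated by more than `threshold`.
function hasRegressed(baseline: number[], current: number[], threshold: number): boolean {
  const b = confidenceInterval95(baseline)
  const c = confidenceInterval95(current)
  return b.lower - c.upper > threshold
}

// Assumed flakiness rule: coefficient of variation above 0.3.
function isFlaky(xs: number[]): boolean {
  const m = mean(xs)
  if (m === 0) return false
  return sampleStdDev(xs) / Math.abs(m) > 0.3
}
```

With a single run per side, both intervals collapse to points and `hasRegressed` reduces to the simple delta comparison described for `runs === 1`.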
|
|
297
370
|
|
|
298
371
|
The CLI loads TypeScript configs directly using a lightweight runtime loader so users don't need to precompile their config.
|
|
299
372
|
|
|
@@ -345,42 +418,7 @@ export default defineArena({
|
|
|
345
418
|
npx duelist run
|
|
346
419
|
```
|
|
347
420
|
|
|
348
|
-
Output
|
|
349
|
-
|
|
350
|
-
```
|
|
351
|
-
⬡ Agent Duelist Results (3 runs each)
|
|
352
|
-
──────────────────────────────────────────────────────────────────────
|
|
353
|
-
|
|
354
|
-
Task: extract-company
|
|
355
|
-
Provider Latency Cost Tokens Match Schema Fuzzy
|
|
356
|
-
─────────────────────────────────────────────────────────────────────────────────────────────
|
|
357
|
-
azure/gpt-5-mini 1905ms ~$0.189m 140 100% 100% 100%
|
|
358
|
-
azure/gpt-5-nano 2079ms ~$0.081m 249 100% 100% 100%
|
|
359
|
-
azure/gpt-5.2-chat 1493ms ~$0.0011 126 100% 100% 100%
|
|
360
|
-
|
|
361
|
-
Task: summarize
|
|
362
|
-
Provider Latency Cost Tokens Match Schema Fuzzy
|
|
363
|
-
─────────────────────────────────────────────────────────────────────────────────────────────
|
|
364
|
-
azure/gpt-5-mini 1723ms ~$0.192m 127 0% — 36%
|
|
365
|
-
azure/gpt-5-nano 2117ms ~$0.081m 234 0% — 43%
|
|
366
|
-
azure/gpt-5.2-chat 1008ms ~$0.584m 72 0% — 43%
|
|
367
|
-
|
|
368
|
-
Task: classify-sentiment
|
|
369
|
-
Provider Latency Cost Tokens Match Schema Fuzzy
|
|
370
|
-
─────────────────────────────────────────────────────────────────────────────────────────────
|
|
371
|
-
azure/gpt-5-mini 1012ms ~$0.077m 82 100% — 100%
|
|
372
|
-
azure/gpt-5-nano 1075ms ~$0.024m 104 100% — 100%
|
|
373
|
-
azure/gpt-5.2-chat 936ms ~$0.526m 81 100% — 100%
|
|
374
|
-
|
|
375
|
-
──────────────────────────────────────────────────────────────────────
|
|
376
|
-
Summary
|
|
377
|
-
|
|
378
|
-
◆ Most correct: azure/gpt-5-mini (OpenAI via Azure) (avg 67%)
|
|
379
|
-
◆ Fastest: azure/gpt-5.2-chat (OpenAI via Azure) (avg 1146ms)
|
|
380
|
-
◆ Cheapest: azure/gpt-5-nano (OpenAI via Azure) (avg ~$0.062m)
|
|
381
|
-
|
|
382
|
-
Costs estimated from OpenRouter pricing catalog.
|
|
383
|
-
```
|
|
421
|
+
Output includes box-drawing tables with medals, color-ranked metrics, sparkline bars, and a winner row per task — see the [screenshot above](#what-you-get) for a real example.
|
|
384
422
|
|
|
385
423
|
---
|
|
386
424
|
|
|
@@ -403,7 +441,7 @@ const weatherTool = {
|
|
|
403
441
|
}
|
|
404
442
|
|
|
405
443
|
export default defineArena({
|
|
406
|
-
providers: [openai('gpt-
|
|
444
|
+
providers: [openai('gpt-5-mini')],
|
|
407
445
|
tasks: [
|
|
408
446
|
{
|
|
409
447
|
name: 'weather-tool-call',
|
|
@@ -421,6 +459,57 @@ The model calls `getCurrentWeather`, the handler returns a stub result, and the
|
|
|
421
459
|
|
|
422
460
|
---
|
|
423
461
|
|
|
462
|
+
## CI / GitHub Actions
|
|
463
|
+
|
|
464
|
+
`duelist ci` is designed to run as a quality gate in your CI pipeline. It compares benchmark results against a saved baseline and fails the build if quality regresses or costs exceed a budget.
|
|
465
|
+
|
|
466
|
+
### GitHub Action
|
|
467
|
+
|
|
468
|
+
The easiest way to add eval quality gates to your PR workflow:
|
|
469
|
+
|
|
470
|
+
```yaml
|
|
471
|
+
# .github/workflows/eval.yml
|
|
472
|
+
name: LLM Eval
|
|
473
|
+
on: [pull_request]
|
|
474
|
+
|
|
475
|
+
jobs:
|
|
476
|
+
eval:
|
|
477
|
+
runs-on: ubuntu-latest
|
|
478
|
+
steps:
|
|
479
|
+
- uses: actions/checkout@v4
|
|
480
|
+
- uses: DataGobes/agent-duelist/.github/actions/duelist-ci@main
|
|
481
|
+
with:
|
|
482
|
+
budget: '1.00'
|
|
483
|
+
thresholds: 'correctness=0.1'
|
|
484
|
+
env:
|
|
485
|
+
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
|
|
486
|
+
```
|
|
487
|
+
|
|
488
|
+
The action handles Node.js setup, runs `duelist ci`, posts a comparison table as a PR comment, and optionally commits an updated baseline.
|
|
489
|
+
|
|
490
|
+
| Input | Default | Description |
|
|
491
|
+
|-------|---------|-------------|
|
|
492
|
+
| `config` | `arena.config.ts` | Path to arena config file |
|
|
493
|
+
| `baseline` | `.duelist/baseline.json` | Path to baseline JSON file |
|
|
494
|
+
| `budget` | — | Max total cost in USD |
|
|
495
|
+
| `thresholds` | — | Space-separated `scorer=delta` pairs |
|
|
496
|
+
| `update-baseline` | `false` | Save results as new baseline after passing |
|
|
497
|
+
| `comment` | `true` | Post results as PR comment |
|
|
498
|
+
| `node-version` | `20` | Node.js version to use |
|
|
499
|
+
|
|
500
|
+
### PR comment output
|
|
501
|
+
|
|
502
|
+
When `--comment` is enabled, `duelist ci` posts (or updates) a markdown table on the PR:
|
|
503
|
+
|
|
504
|
+
| Provider | Task | Scorer | Baseline | Current | Delta | Status |
|
|
505
|
+
|----------|------|--------|----------|---------|-------|--------|
|
|
506
|
+
| openai/gpt-5-mini | extract | correctness | 0.900 | 0.850 | -0.050 | ⚪ unchanged |
|
|
507
|
+
| openai/gpt-5-mini | extract | latency | 0.920 ± 0.030 | 0.890 ± 0.025 | -0.030 | ⚪ unchanged |
|
|
508
|
+
|
|
509
|
+
The comment also includes a cost summary, flakiness warnings, and a pass/fail verdict.
|
|
510
|
+
|
|
511
|
+
---
|
|
512
|
+
|
|
424
513
|
## Roadmap
|
|
425
514
|
|
|
426
515
|
Shipped so far:
|
|
@@ -428,8 +517,12 @@ Shipped so far:
|
|
|
428
517
|
- OpenAI, Azure OpenAI, Anthropic, Google Gemini, and OpenAI-compatible providers
|
|
429
518
|
- 7 built-in scorers including LLM-as-judge, tool-usage, schema validation, and fuzzy similarity
|
|
430
519
|
- Tool-calling support with local handlers for agent task benchmarking
|
|
431
|
-
-
|
|
520
|
+
- Console reporter with box-drawing tables, medal rankings, sparkline bars (toggleable), and per-task winner rows
|
|
521
|
+
- Configurable per-request timeout to prevent hanging on unresponsive APIs
|
|
432
522
|
- JSON reporter for CI/pipeline integration
|
|
523
|
+
- Markdown reporter for PR comments
|
|
524
|
+
- `duelist ci` command with regression detection, cost budgets, and flakiness warnings
|
|
525
|
+
- GitHub Action for CI/CD integration
|
|
433
526
|
- Pricing catalog from OpenRouter with refresh script
|
|
434
527
|
|
|
435
528
|
Planned directions (subject to community feedback):
|
|
@@ -437,8 +530,7 @@ Planned directions (subject to community feedback):
|
|
|
437
530
|
- **More providers**
|
|
438
531
|
- OpenRouter-native and more OpenAI-compatible gateways.
|
|
439
532
|
- **Better reporting**
|
|
440
|
-
-
|
|
441
|
-
- GitHub Actions summaries.
|
|
533
|
+
- HTML and CSV export options.
|
|
442
534
|
- **Agent workflows**
|
|
443
535
|
- Multi-step tool chains, multi-hop reasoning, and agent traces.
|
|
444
536
|
- **Plugin system**
|