agent-duelist 0.1.2 → 0.2.1
- package/README.md +165 -59
- package/dist/cli.js +1793 -394
- package/dist/cli.js.map +1 -1
- package/dist/index.cjs +1774 -396
- package/dist/index.cjs.map +1 -1
- package/dist/index.d.cts +73 -8
- package/dist/index.d.ts +73 -8
- package/dist/index.js +1765 -395
- package/dist/index.js.map +1 -1
- package/package.json +1 -1
- package/templates/arena.config.ts +5 -5
package/README.md
CHANGED
|
@@ -1,14 +1,18 @@
|
|
|
1
1
|
# agent-duelist
|
|
2
2
|
|
|
3
3
|
[](https://github.com/DataGobes/agent-duelist/actions/workflows/ci.yml)
|
|
4
|
+
[](https://datagobes.github.io/agent-duelist/)
|
|
4
5
|
|
|
5
6
|
> Pit LLM providers against each other on agent tasks — Duel your models.
|
|
7
|
+
>
|
|
8
|
+
> **[View the landing page →](https://datagobes.github.io/agent-duelist/)**
|
|
6
9
|
|
|
7
10
|
`agent-duelist` is a TypeScript-first framework to pit multiple LLM providers against each other on the same tasks and get structured, reproducible results: correctness, latency, tokens, and cost.
|
|
8
11
|
|
|
9
12
|
## What you get
|
|
10
|
-
>
|
|
11
|
-
|
|
13
|
+
> 
|
|
14
|
+
>
|
|
15
|
+
> 
|
|
12
16
|
|
|
13
17
|
- Compare OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible gateway.
|
|
14
18
|
- Define tasks once, run them against many providers.
|
|
@@ -64,8 +68,8 @@ import { z } from 'zod'
|
|
|
64
68
|
|
|
65
69
|
export default defineArena({
|
|
66
70
|
providers: [
|
|
67
|
-
openai('gpt-
|
|
68
|
-
azureOpenai('gpt-
|
|
71
|
+
openai('gpt-5-mini'),
|
|
72
|
+
azureOpenai('gpt-5-mini', { deployment: 'my-azure-deployment' }),
|
|
69
73
|
],
|
|
70
74
|
tasks: [
|
|
71
75
|
{
|
|
@@ -98,7 +102,7 @@ npx duelist run
|
|
|
98
102
|
You'll see a matrix like:
|
|
99
103
|
|
|
100
104
|
- Rows: tasks (`simple-qa`, `structured-extraction`)
|
|
101
|
-
- Columns: providers (`openai/gpt-
|
|
105
|
+
- Columns: providers (`openai/gpt-5-mini`, `azure/gpt-5-mini`)
|
|
102
106
|
- Cells: correctness score, latency, tokens, and estimated cost.
|
|
103
107
|
|
|
104
108
|
For CI or further processing:
|
|
@@ -107,6 +111,12 @@ For CI or further processing:
|
|
|
107
111
|
npx duelist run --reporter json > results.json
|
|
108
112
|
```
|
|
109
113
|
|
|
114
|
+
Generate a shareable HTML report:
|
|
115
|
+
|
|
116
|
+
```bash
|
|
117
|
+
npx duelist run --reporter html --output report.html
|
|
118
|
+
```
|
|
119
|
+
|
|
110
120
|
---
|
|
111
121
|
|
|
112
122
|
## Core concepts
|
|
@@ -133,21 +143,21 @@ import {
|
|
|
133
143
|
type ArenaProvider,
|
|
134
144
|
} from 'agent-duelist'
|
|
135
145
|
|
|
136
|
-
const oai = openai('gpt-
|
|
146
|
+
const oai = openai('gpt-5-mini')
|
|
137
147
|
|
|
138
|
-
const azure = azureOpenai('gpt-
|
|
148
|
+
const azure = azureOpenai('gpt-5-mini', {
|
|
139
149
|
deployment: 'my-deployment',
|
|
140
150
|
})
|
|
141
151
|
|
|
142
|
-
const claude = anthropic('claude-sonnet-4
|
|
152
|
+
const claude = anthropic('claude-sonnet-4.6')
|
|
143
153
|
|
|
144
|
-
const gem = gemini('gemini-
|
|
154
|
+
const gem = gemini('gemini-3-flash-preview') // uses GOOGLE_API_KEY
|
|
145
155
|
|
|
146
156
|
const local: ArenaProvider = openaiCompatible({
|
|
147
|
-
id: 'local/
|
|
157
|
+
id: 'local/llama',
|
|
148
158
|
name: 'Local Gateway',
|
|
149
159
|
baseURL: 'http://localhost:11434/v1',
|
|
150
|
-
model: '
|
|
160
|
+
model: 'llama3.3',
|
|
151
161
|
apiKeyEnv: 'LOCAL_LLM_API_KEY',
|
|
152
162
|
})
|
|
153
163
|
```
|
|
@@ -156,7 +166,7 @@ At minimum, a provider implements:
|
|
|
156
166
|
|
|
157
167
|
```ts
|
|
158
168
|
interface ArenaProvider {
|
|
159
|
-
id: string // e.g. 'openai/gpt-
|
|
169
|
+
id: string // e.g. 'openai/gpt-5-mini'
|
|
160
170
|
name: string // e.g. 'OpenAI'
|
|
161
171
|
model: string
|
|
162
172
|
run(input: TaskInput): Promise<TaskResult>
|
|
@@ -230,12 +240,39 @@ defineArena({
|
|
|
230
240
|
})
|
|
231
241
|
```
|
|
232
242
|
|
|
233
|
-
The judge model defaults to `gpt-
|
|
243
|
+
The judge model defaults to `gpt-5-mini`. It can also be set via the `DUELIST_JUDGE_MODEL` env var. The judge backend is auto-detected from the model name — `gemini-*` models use Google's API, otherwise it falls back to OpenAI or Azure OpenAI.
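The routing described above can be pictured with a minimal sketch. This is not the package's actual implementation: the helper names `resolveJudgeModel` and `detectJudgeBackend`, the `AZURE_OPENAI_ENDPOINT` variable used to pick Azure, and the precedence of config over environment are all assumptions made for illustration.

```ts
// Hypothetical sketch of judge-model resolution and backend detection.
type JudgeBackend = 'google' | 'azure-openai' | 'openai'
type Env = Record<string, string | undefined>

// Assumed precedence: explicit config value, then env var, then the documented default.
function resolveJudgeModel(configured: string | undefined, env: Env): string {
  return configured ?? env.DUELIST_JUDGE_MODEL ?? 'gpt-5-mini'
}

function detectJudgeBackend(model: string, env: Env): JudgeBackend {
  // `gemini-*` models go to Google's API, as the README states.
  if (model.startsWith('gemini-')) return 'google'
  // Assumption: Azure is preferred when an Azure endpoint is configured.
  if (env.AZURE_OPENAI_ENDPOINT) return 'azure-openai'
  return 'openai'
}
```

Passing the environment in as a plain object (for example `process.env`) keeps both helpers easy to test without touching real credentials.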
|
|
234
244
|
|
|
235
245
|
You can also add custom scorers for domain-specific metrics (e.g. tool-call correctness, safety, style).
|
|
236
246
|
|
|
237
247
|
---
|
|
238
248
|
|
|
249
|
+
### Arena options
|
|
250
|
+
|
|
251
|
+
`defineArena()` accepts these top-level options alongside `providers`, `tasks`, and `scorers`:
|
|
252
|
+
|
|
253
|
+
| Option | Type | Default | Description |
|
|
254
|
+
|--------|------|---------|-------------|
|
|
255
|
+
| `runs` | `number` | `1` | Number of runs per provider × task combination. Higher values improve statistical confidence for CI regression detection. |
|
|
256
|
+
| `judgeModel` | `string` | `'gpt-5-mini'` | Model used by the `llm-judge-correctness` scorer. Also settable via `DUELIST_JUDGE_MODEL` env var. Gemini models auto-route to Google's API. |
|
|
257
|
+
| `timeout` | `number` | `60000` | Per-request timeout in milliseconds. Requests exceeding this are marked as failures. Prevents hanging on unresponsive APIs. |
|
|
258
|
+
| `sparklines` | `boolean` | `true` | Show sparkline bars next to percentage scores in the console reporter. Disable with `false` if your terminal doesn't render Unicode block characters well. |
|
|
259
|
+
|
|
260
|
+
Example with all options:
|
|
261
|
+
|
|
262
|
+
```ts
|
|
263
|
+
export default defineArena({
|
|
264
|
+
providers: [openai('gpt-5-mini'), gemini('gemini-3-flash-preview')],
|
|
265
|
+
tasks: [/* ... */],
|
|
266
|
+
scorers: ['latency', 'cost', 'correctness', 'llm-judge-correctness'],
|
|
267
|
+
runs: 3,
|
|
268
|
+
judgeModel: 'gemini-3.1-pro-preview',
|
|
269
|
+
timeout: 30_000, // 30s — fail fast on slow APIs
|
|
270
|
+
sparklines: false, // plain percentages, no Unicode bars
|
|
271
|
+
})
|
|
272
|
+
```
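To illustrate what a per-request `timeout` means in practice, here is a minimal sketch that races a request against a timer and reports a failure when the timer wins. The `withTimeout` helper and the `Outcome` type are hypothetical names for this example only and say nothing about how `agent-duelist` implements the option internally.

```ts
// Hypothetical sketch: mark a request as failed if it exceeds timeoutMs.
type Outcome<T> = { ok: true; value: T } | { ok: false; error: string }

async function withTimeout<T>(request: Promise<T>, timeoutMs: number): Promise<Outcome<T>> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<Outcome<T>>((resolve) => {
    timer = setTimeout(
      () => resolve({ ok: false, error: `timed out after ${timeoutMs}ms` }),
      timeoutMs,
    )
  })
  try {
    // Whichever settles first decides the outcome.
    return await Promise.race([
      request.then((value): Outcome<T> => ({ ok: true, value })),
      timeout,
    ])
  } catch (err) {
    // A rejected request is also recorded as a failure rather than thrown.
    return { ok: false, error: err instanceof Error ? err.message : String(err) }
  } finally {
    if (timer !== undefined) clearTimeout(timer)
  }
}
```

Note that this sketch only stops waiting for the slow request. It does not cancel the underlying network call, which would need something like an `AbortController`.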
|
|
273
|
+
|
|
274
|
+
---
|
|
275
|
+
|
|
239
276
|
## Cost & pricing
|
|
240
277
|
|
|
241
278
|
Cost estimation is intentionally transparent and conservative:
|
|
@@ -247,7 +284,7 @@ Cost estimation is intentionally transparent and conservative:
|
|
|
247
284
|
`agent-duelist` ships with a **locally bundled catalog** of per-token prices for many models, derived from OpenRouter's public pricing pages.
|
|
248
285
|
|
|
249
286
|
- The catalog maps `(provider, model)` → `{ inputPerM, outputPerM }` in USD per 1M tokens.
|
|
250
|
-
- Azure OpenAI models are resolved back to their base OpenAI models where possible (e.g. `azure/gpt-
|
|
287
|
+
- Azure OpenAI models are resolved back to their base OpenAI models where possible (e.g. `azure/gpt-5-mini` → `openai/gpt-5-mini`) so you don't need to configure Azure pricing manually.
|
|
251
288
|
|
|
252
289
|
3. **Estimated USD**
|
|
253
290
|
The `cost` scorer computes:
|
|
@@ -274,12 +311,19 @@ You can update the catalog with a script that re-scrapes OpenRouter's public pri
|
|
|
274
311
|
|
|
275
312
|
## CLI usage
|
|
276
313
|
|
|
277
|
-
|
|
314
|
+
### `duelist init`
|
|
315
|
+
|
|
316
|
+
Scaffold a new `arena.config.ts` in the current directory.
|
|
278
317
|
|
|
279
318
|
```bash
|
|
280
|
-
# Scaffold a new config
|
|
281
319
|
npx duelist init
|
|
320
|
+
```
|
|
282
321
|
|
|
322
|
+
### `duelist run`
|
|
323
|
+
|
|
324
|
+
Run benchmarks defined in your arena config.
|
|
325
|
+
|
|
326
|
+
```bash
|
|
283
327
|
# Run with the default config (arena.config.ts)
|
|
284
328
|
npx duelist run
|
|
285
329
|
|
|
@@ -288,12 +332,48 @@ npx duelist run --config path/to/arena.config.ts
|
|
|
288
332
|
|
|
289
333
|
# Get JSON instead of a table
|
|
290
334
|
npx duelist run --reporter json
|
|
335
|
+
|
|
336
|
+
# Suppress per-result progress lines
|
|
337
|
+
npx duelist run --quiet
|
|
291
338
|
```
|
|
292
339
|
|
|
293
|
-
|
|
340
|
+
| Option | Description |
|
|
341
|
+
|--------|-------------|
|
|
342
|
+
| `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
|
|
343
|
+
| `--reporter <type>` | Output format: `console` (default) or `json` |
|
|
344
|
+
| `-q, --quiet` | Suppress per-result progress |
|
|
345
|
+
|
|
346
|
+
### `duelist ci`
|
|
294
347
|
|
|
295
|
-
-
|
|
296
|
-
|
|
348
|
+
Run benchmarks, compare against a baseline, and enforce quality gates. Exits non-zero if regressions are detected or cost exceeds the budget.
|
|
349
|
+
|
|
350
|
+
```bash
|
|
351
|
+
# First run — establishes the baseline
|
|
352
|
+
npx duelist ci --update-baseline
|
|
353
|
+
|
|
354
|
+
# Subsequent runs — compare against baseline
|
|
355
|
+
npx duelist ci --threshold correctness=0.1 --budget 1.00
|
|
356
|
+
|
|
357
|
+
# Post comparison table as a PR comment (GitHub Actions)
|
|
358
|
+
npx duelist ci --threshold correctness=0.1 --comment
|
|
359
|
+
```
|
|
360
|
+
|
|
361
|
+
| Option | Description |
|
|
362
|
+
|--------|-------------|
|
|
363
|
+
| `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
|
|
364
|
+
| `--baseline <path>` | Baseline JSON file (default: `.duelist/baseline.json`) |
|
|
365
|
+
| `--budget <dollars>` | Max total cost in USD — fails if exceeded |
|
|
366
|
+
| `--threshold <scorer=delta>` | Regression threshold (repeatable, e.g. `--threshold correctness=0.1 --threshold cost=0.002`) |
|
|
367
|
+
| `--update-baseline` | Save results as new baseline after passing |
|
|
368
|
+
| `--comment` | Post markdown comparison table as a GitHub PR comment |
|
|
369
|
+
| `-q, --quiet` | Suppress per-result progress |
|
|
370
|
+
|
|
371
|
+
**How regression detection works:**
|
|
372
|
+
|
|
373
|
+
- With `runs > 1`, the CI computes 95% confidence intervals (t-distribution) for each scorer. A scorer counts as regressed only if the baseline and current intervals don't overlap beyond the threshold, which avoids false positives from noisy LLM outputs.
|
|
374
|
+
- With `runs === 1`, it uses a simple delta comparison.
|
|
375
|
+
- Without `--threshold` flags, regression detection is skipped entirely — only `--budget` is enforced.
|
|
376
|
+
- Results with high variance (CV > 0.3) are flagged as **flaky** with a warning.
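The rules above can be sketched roughly as follows. This is one possible reading, not the package's code: it assumes higher scores are better, treats a regression as the baseline interval's lower bound exceeding the current interval's upper bound by more than the threshold, and uses a small table of two-sided 95% t critical values (falling back to 1.96 for larger samples). All function names are hypothetical.

```ts
// Hypothetical sketch of CI-based regression and flakiness checks.
interface Stats { mean: number; sd: number; n: number }

function summarize(xs: number[]): Stats {
  const n = xs.length
  const mean = xs.reduce((a, b) => a + b, 0) / n
  // Sample variance (n - 1 denominator); zero when there is a single run.
  const variance = n > 1 ? xs.reduce((a, x) => a + (x - mean) ** 2, 0) / (n - 1) : 0
  return { mean, sd: Math.sqrt(variance), n }
}

// Two-sided 95% t critical values for df = 1..10.
const T95 = [12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228]

function halfWidth(s: Stats): number {
  if (s.n < 2) return 0
  const t = T95[s.n - 2] ?? 1.96 // normal approximation beyond the table
  return (t * s.sd) / Math.sqrt(s.n)
}

function hasRegressed(baseline: number[], current: number[], threshold: number): boolean {
  const b = summarize(baseline)
  const c = summarize(current)
  // With single runs both half-widths are 0, so this reduces to a simple delta.
  return b.mean - halfWidth(b) - (c.mean + halfWidth(c)) > threshold
}

function isFlaky(xs: number[]): boolean {
  const s = summarize(xs)
  // Coefficient of variation: standard deviation relative to the mean.
  return s.mean !== 0 && s.sd / Math.abs(s.mean) > 0.3
}
```

With noisy samples the intervals widen, so a small drop in the mean is less likely to be reported as a regression than it would be under a plain delta comparison.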
|
|
297
377
|
|
|
298
378
|
The CLI loads TypeScript configs directly using a lightweight runtime loader so users don't need to precompile their config.
|
|
299
379
|
|
|
@@ -345,42 +425,13 @@ export default defineArena({
|
|
|
345
425
|
npx duelist run
|
|
346
426
|
```
|
|
347
427
|
|
|
348
|
-
Output
|
|
428
|
+
Output includes box-drawing tables with medals, color-ranked metrics, sparkline bars, and a winner row per task — see the [screenshot above](#what-you-get) for a real example.
|
|
349
429
|
|
|
350
|
-
|
|
351
|
-
|
|
352
|
-
|
|
353
|
-
|
|
354
|
-
|
|
355
|
-
Provider Latency Cost Tokens Match Schema Fuzzy
|
|
356
|
-
─────────────────────────────────────────────────────────────────────────────────────────────
|
|
357
|
-
azure/gpt-5-mini 1905ms ~$0.189m 140 100% 100% 100%
|
|
358
|
-
azure/gpt-5-nano 2079ms ~$0.081m 249 100% 100% 100%
|
|
359
|
-
azure/gpt-5.2-chat 1493ms ~$0.0011 126 100% 100% 100%
|
|
360
|
-
|
|
361
|
-
Task: summarize
|
|
362
|
-
Provider Latency Cost Tokens Match Schema Fuzzy
|
|
363
|
-
─────────────────────────────────────────────────────────────────────────────────────────────
|
|
364
|
-
azure/gpt-5-mini 1723ms ~$0.192m 127 0% — 36%
|
|
365
|
-
azure/gpt-5-nano 2117ms ~$0.081m 234 0% — 43%
|
|
366
|
-
azure/gpt-5.2-chat 1008ms ~$0.584m 72 0% — 43%
|
|
367
|
-
|
|
368
|
-
Task: classify-sentiment
|
|
369
|
-
Provider Latency Cost Tokens Match Schema Fuzzy
|
|
370
|
-
─────────────────────────────────────────────────────────────────────────────────────────────
|
|
371
|
-
azure/gpt-5-mini 1012ms ~$0.077m 82 100% — 100%
|
|
372
|
-
azure/gpt-5-nano 1075ms ~$0.024m 104 100% — 100%
|
|
373
|
-
azure/gpt-5.2-chat 936ms ~$0.526m 81 100% — 100%
|
|
374
|
-
|
|
375
|
-
──────────────────────────────────────────────────────────────────────
|
|
376
|
-
Summary
|
|
377
|
-
|
|
378
|
-
◆ Most correct: azure/gpt-5-mini (OpenAI via Azure) (avg 67%)
|
|
379
|
-
◆ Fastest: azure/gpt-5.2-chat (OpenAI via Azure) (avg 1146ms)
|
|
380
|
-
◆ Cheapest: azure/gpt-5-nano (OpenAI via Azure) (avg ~$0.062m)
|
|
381
|
-
|
|
382
|
-
Costs estimated from OpenRouter pricing catalog.
|
|
383
|
-
```
|
|
430
|
+
**How scoring works:**
|
|
431
|
+
|
|
432
|
+
- Providers are compared **head-to-head within each task** — all providers receive the same prompt at the same time.
|
|
433
|
+
- Medals are awarded only when a provider is the **sole leader** in a metric column. Ties don't award medals, keeping rankings meaningful.
|
|
434
|
+
- The overall winner is determined by category wins across correctness, latency, and cost.
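As a rough illustration of the sole-leader rule, the sketch below awards a medal only when exactly one provider holds the best value in a column, then counts category wins. The names `soleLeader` and `overallWinner` are hypothetical, and returning no overall winner when category wins are tied is an assumption rather than documented behavior.

```ts
// Hypothetical sketch of sole-leader medals and category-win counting.
interface Cell { provider: string; value: number }

function soleLeader(cells: Cell[], higherIsBetter: boolean): string | null {
  if (cells.length === 0) return null
  const best = cells.reduce((a, b) =>
    (higherIsBetter ? b.value > a.value : b.value < a.value) ? b : a,
  )
  // A tie for the best value means nobody gets the medal.
  const tiedForBest = cells.filter((c) => c.value === best.value)
  return tiedForBest.length === 1 ? best.provider : null
}

function overallWinner(categoryLeaders: Array<string | null>): string | null {
  const wins: Record<string, number> = {}
  for (const p of categoryLeaders) {
    if (p !== null) wins[p] = (wins[p] ?? 0) + 1
  }
  let best: string | null = null
  let bestWins = 0
  let tied = false
  for (const p of Object.keys(wins)) {
    const n = wins[p] ?? 0
    if (n > bestWins) { best = p; bestWins = n; tied = false }
    else if (n === bestWins) tied = true
  }
  // Assumption: a tie on category wins yields no overall winner.
  return tied ? null : best
}
```

For example, a provider that is the sole leader on latency and cost would beat one that only leads on correctness, while a column where every provider scores 100% awards no medal at all.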
|
|
384
435
|
|
|
385
436
|
---
|
|
386
437
|
|
|
@@ -403,7 +454,7 @@ const weatherTool = {
|
|
|
403
454
|
}
|
|
404
455
|
|
|
405
456
|
export default defineArena({
|
|
406
|
-
providers: [openai('gpt-
|
|
457
|
+
providers: [openai('gpt-5-mini')],
|
|
407
458
|
tasks: [
|
|
408
459
|
{
|
|
409
460
|
name: 'weather-tool-call',
|
|
@@ -421,15 +472,71 @@ The model calls `getCurrentWeather`, the handler returns a stub result, and the
|
|
|
421
472
|
|
|
422
473
|
---
|
|
423
474
|
|
|
475
|
+
## CI / GitHub Actions
|
|
476
|
+
|
|
477
|
+
`duelist ci` is designed to run as a quality gate in your CI pipeline. It compares benchmark results against a saved baseline and fails the build if quality regresses or costs exceed a budget.
|
|
478
|
+
|
|
479
|
+
### GitHub Action
|
|
480
|
+
|
|
481
|
+
The easiest way to add eval quality gates to your PR workflow:
|
|
482
|
+
|
|
483
|
+
```yaml
|
|
484
|
+
# .github/workflows/eval.yml
|
|
485
|
+
name: LLM Eval
|
|
486
|
+
on: [pull_request]
|
|
487
|
+
|
|
488
|
+
jobs:
|
|
489
|
+
eval:
|
|
490
|
+
runs-on: ubuntu-latest
|
|
491
|
+
steps:
|
|
492
|
+
- uses: actions/checkout@v4
|
|
493
|
+
- uses: DataGobes/agent-duelist/.github/actions/duelist-ci@main
|
|
494
|
+
with:
|
|
495
|
+
budget: '1.00'
|
|
496
|
+
thresholds: 'correctness=0.1'
|
|
497
|
+
env:
|
|
498
|
+
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
|
|
499
|
+
```
|
|
500
|
+
|
|
501
|
+
The action handles Node.js setup, runs `duelist ci`, posts a comparison table as a PR comment, and optionally commits an updated baseline.
|
|
502
|
+
|
|
503
|
+
| Input | Default | Description |
|
|
504
|
+
|-------|---------|-------------|
|
|
505
|
+
| `config` | `arena.config.ts` | Path to arena config file |
|
|
506
|
+
| `baseline` | `.duelist/baseline.json` | Path to baseline JSON file |
|
|
507
|
+
| `budget` | — | Max total cost in USD |
|
|
508
|
+
| `thresholds` | — | Space-separated `scorer=delta` pairs |
|
|
509
|
+
| `update-baseline` | `false` | Save results as new baseline after passing |
|
|
510
|
+
| `comment` | `true` | Post results as PR comment |
|
|
511
|
+
| `node-version` | `20` | Node.js version to use |
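The `thresholds` input is a space-separated list of `scorer=delta` pairs. A minimal sketch of how such a string could be parsed is shown below; `parseThresholds` is a hypothetical helper for illustration and is not taken from the action's source.

```ts
// Hypothetical sketch: parse "correctness=0.1 cost=0.002" into a lookup object.
function parseThresholds(input: string): Record<string, number> {
  const thresholds: Record<string, number> = {}
  for (const pair of input.split(/\s+/).filter(Boolean)) {
    const [scorer, raw] = pair.split('=')
    const delta = Number(raw)
    // Reject missing names, missing values, and non-numeric deltas.
    if (!scorer || raw === undefined || raw === '' || Number.isNaN(delta)) {
      throw new Error(`Invalid threshold "${pair}", expected scorer=delta`)
    }
    thresholds[scorer] = delta
  }
  return thresholds
}
```

Failing loudly on a malformed pair is a deliberate choice in this sketch: a silently ignored threshold would let a regression pass the gate unnoticed.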
|
|
512
|
+
|
|
513
|
+
### PR comment output
|
|
514
|
+
|
|
515
|
+
When `--comment` is enabled, the CI posts (or updates) a markdown table on the PR:
|
|
516
|
+
|
|
517
|
+
| Provider | Task | Scorer | Baseline | Current | Delta | Status |
|
|
518
|
+
|----------|------|--------|----------|---------|-------|--------|
|
|
519
|
+
| openai/gpt-5-mini | extract | correctness | 0.900 | 0.850 | -0.050 | ⚪ unchanged |
|
|
520
|
+
| openai/gpt-5-mini | extract | latency | 0.920 ± 0.030 | 0.890 ± 0.025 | -0.030 | ⚪ unchanged |
|
|
521
|
+
|
|
522
|
+
The comment also includes a cost summary, flakiness warnings, and a pass/fail verdict.
|
|
523
|
+
|
|
524
|
+
---
|
|
525
|
+
|
|
424
526
|
## Roadmap
|
|
425
527
|
|
|
426
528
|
Shipped so far:
|
|
427
529
|
|
|
428
|
-
- OpenAI, Azure OpenAI, Anthropic, Google Gemini, and OpenAI-compatible
|
|
530
|
+
- OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible provider
|
|
429
531
|
- 7 built-in scorers including LLM-as-judge, tool-usage, schema validation, and fuzzy similarity
|
|
430
532
|
- Tool-calling support with local handlers for agent task benchmarking
|
|
431
|
-
-
|
|
533
|
+
- Fair head-to-head benchmarking: tasks run sequentially while providers race in parallel, ensuring fair latency comparison without queue-induced timeout penalties
|
|
534
|
+
- Console reporter with box-drawing tables, sole-leader medal rankings, sparkline bars (toggleable), and per-task winner rows
|
|
535
|
+
- Configurable per-request timeout to prevent hanging on unresponsive APIs
|
|
432
536
|
- JSON reporter for CI/pipeline integration
|
|
537
|
+
- Markdown reporter for PR comments
|
|
538
|
+
- `duelist ci` command with regression detection, cost budgets, and flakiness warnings
|
|
539
|
+
- GitHub Action for CI/CD integration
|
|
433
540
|
- Pricing catalog from OpenRouter with refresh script
|
|
434
541
|
|
|
435
542
|
Planned directions (subject to community feedback):
|
|
@@ -437,8 +544,7 @@ Planned directions (subject to community feedback):
|
|
|
437
544
|
- **More providers**
|
|
438
545
|
- OpenRouter-native and more OpenAI-compatible gateways.
|
|
439
546
|
- **Better reporting**
|
|
440
|
-
-
|
|
441
|
-
- GitHub Actions summaries.
|
|
547
|
+
- HTML and CSV export options.
|
|
442
548
|
- **Agent workflows**
|
|
443
549
|
- Multi-step tool chains, multi-hop reasoning, and agent traces.
|
|
444
550
|
- **Plugin system**
|