agent-duelist 0.1.2 → 0.2.1

package/README.md CHANGED
@@ -1,14 +1,18 @@
1
1
  # agent-duelist
2
2
 
3
3
  [![CI](https://github.com/DataGobes/agent-duelist/actions/workflows/ci.yml/badge.svg)](https://github.com/DataGobes/agent-duelist/actions/workflows/ci.yml)
4
+ [![Docs](https://img.shields.io/badge/docs-landing%20page-f59e0b)](https://datagobes.github.io/agent-duelist/)
4
5
 
5
6
  > Pit LLM providers against each other on agent tasks — Duel your models.
7
+ >
8
+ > **[View the landing page →](https://datagobes.github.io/agent-duelist/)**
6
9
 
7
10
  `agent-duelist` is a TypeScript-first framework to pit multiple LLM providers against each other on the same tasks and get structured, reproducible results: correctness, latency, tokens, and cost.
8
11
 
9
12
  ## What you get
10
- > <img width="739" height="473" alt="image" src="https://github.com/user-attachments/assets/41149222-1035-42fd-b643-4f8b856c30a0" />
11
-
13
+ > ![Agent Duelist console output](docs/assets/screenshot.png)
14
+ >
15
+ > ![Agent Duelist HTML report](docs/assets/screenshot-html.png)
12
16
 
13
17
  - Compare OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible gateway.
14
18
  - Define tasks once, run them against many providers.
@@ -64,8 +68,8 @@ import { z } from 'zod'
64
68
 
65
69
  export default defineArena({
66
70
  providers: [
67
- openai('gpt-4o'),
68
- azureOpenai('gpt-4o', { deployment: 'my-azure-deployment' }),
71
+ openai('gpt-5-mini'),
72
+ azureOpenai('gpt-5-mini', { deployment: 'my-azure-deployment' }),
69
73
  ],
70
74
  tasks: [
71
75
  {
@@ -98,7 +102,7 @@ npx duelist run
98
102
  You'll see a matrix like:
99
103
 
100
104
  - Rows: tasks (`simple-qa`, `structured-extraction`)
101
- - Columns: providers (`openai/gpt-4o`, `azure/gpt-4o`)
105
+ - Columns: providers (`openai/gpt-5-mini`, `azure/gpt-5-mini`)
102
106
  - Cells: correctness score, latency, tokens, and estimated cost.
103
107
 
104
108
  For CI or further processing:
@@ -107,6 +111,12 @@ For CI or further processing:
107
111
  npx duelist run --reporter json > results.json
108
112
  ```
109
113
 
114
+ Generate a shareable HTML report:
115
+
116
+ ```bash
117
+ npx duelist run --reporter html --output report.html
118
+ ```
119
+
110
120
  ---
111
121
 
112
122
  ## Core concepts
@@ -133,21 +143,21 @@ import {
133
143
  type ArenaProvider,
134
144
  } from 'agent-duelist'
135
145
 
136
- const oai = openai('gpt-4o')
146
+ const oai = openai('gpt-5-mini')
137
147
 
138
- const azure = azureOpenai('gpt-4o', {
148
+ const azure = azureOpenai('gpt-5-mini', {
139
149
  deployment: 'my-deployment',
140
150
  })
141
151
 
142
- const claude = anthropic('claude-sonnet-4-20250514')
152
+ const claude = anthropic('claude-sonnet-4.6')
143
153
 
144
- const gem = gemini('gemini-2.5-flash') // uses GOOGLE_API_KEY
154
+ const gem = gemini('gemini-3-flash-preview') // uses GOOGLE_API_KEY
145
155
 
146
156
  const local: ArenaProvider = openaiCompatible({
147
- id: 'local/gpt-4o-like',
157
+ id: 'local/llama',
148
158
  name: 'Local Gateway',
149
159
  baseURL: 'http://localhost:11434/v1',
150
- model: 'gpt-4o',
160
+ model: 'llama3.3',
151
161
  apiKeyEnv: 'LOCAL_LLM_API_KEY',
152
162
  })
153
163
  ```
@@ -156,7 +166,7 @@ At minimum, a provider implements:
156
166
 
157
167
  ```ts
158
168
  interface ArenaProvider {
159
- id: string // e.g. 'openai/gpt-4o'
169
+ id: string // e.g. 'openai/gpt-5-mini'
160
170
  name: string // e.g. 'OpenAI'
161
171
  model: string
162
172
  run(input: TaskInput): Promise<TaskResult>
@@ -230,12 +240,39 @@ defineArena({
230
240
  })
231
241
  ```
232
242
 
233
- The judge model defaults to `gpt-4o-mini`. It can also be set via the `DUELIST_JUDGE_MODEL` env var. The judge backend is auto-detected from the model name — `gemini-*` models use Google's API, otherwise it falls back to OpenAI or Azure OpenAI.
243
+ The judge model defaults to `gpt-5-mini`. It can also be set via the `DUELIST_JUDGE_MODEL` env var. The judge backend is auto-detected from the model name — `gemini-*` models use Google's API, otherwise it falls back to OpenAI or Azure OpenAI.
234
244
 
235
245
  You can also add custom scorers for domain-specific metrics (e.g. tool-call correctness, safety, style).
236
246
 
237
247
  ---
238
248
 
249
+ ### Arena options
250
+
251
+ `defineArena()` accepts these top-level options alongside `providers`, `tasks`, and `scorers`:
252
+
253
+ | Option | Type | Default | Description |
254
+ |--------|------|---------|-------------|
255
+ | `runs` | `number` | `1` | Number of runs per provider × task combination. Higher values improve statistical confidence for CI regression detection. |
256
+ | `judgeModel` | `string` | `'gpt-5-mini'` | Model used by the `llm-judge-correctness` scorer. Also settable via `DUELIST_JUDGE_MODEL` env var. Gemini models auto-route to Google's API. |
257
+ | `timeout` | `number` | `60000` | Per-request timeout in milliseconds. Requests exceeding this are marked as failures. Prevents hanging on unresponsive APIs. |
258
+ | `sparklines` | `boolean` | `true` | Show sparkline bars next to percentage scores in the console reporter. Disable with `false` if your terminal doesn't render Unicode block characters well. |
259
+
260
+ Example with all options:
261
+
262
+ ```ts
263
+ export default defineArena({
264
+ providers: [openai('gpt-5-mini'), gemini('gemini-3-flash-preview')],
265
+ tasks: [/* ... */],
266
+ scorers: ['latency', 'cost', 'correctness', 'llm-judge-correctness'],
267
+ runs: 3,
268
+ judgeModel: 'gemini-3.1-pro-preview',
269
+ timeout: 30_000, // 30s — fail fast on slow APIs
270
+ sparklines: false, // plain percentages, no Unicode bars
271
+ })
272
+ ```
273
+
274
+ ---
275
+
239
276
  ## Cost & pricing
240
277
 
241
278
  Cost estimation is intentionally transparent and conservative:
@@ -247,7 +284,7 @@ Cost estimation is intentionally transparent and conservative:
247
284
  `agent-duelist` ships with a **locally bundled catalog** of per-token prices for many models, derived from OpenRouter's public pricing pages.
248
285
 
249
286
  - The catalog maps `(provider, model)` → `{ inputPerM, outputPerM }` in USD per 1M tokens.
250
- - Azure OpenAI models are resolved back to their base OpenAI models where possible (e.g. `azure/gpt-4o` → `openai/gpt-4o`) so you don't need to configure Azure pricing manually.
287
+ - Azure OpenAI models are resolved back to their base OpenAI models where possible (e.g. `azure/gpt-5-mini` → `openai/gpt-5-mini`) so you don't need to configure Azure pricing manually.
251
288
 
252
289
  3. **Estimated USD**
253
290
  The `cost` scorer computes:
@@ -274,12 +311,19 @@ You can update the catalog with a script that re-scrapes OpenRouter's public pri
274
311
 
275
312
  ## CLI usage
276
313
 
277
- Basic commands:
314
+ ### `duelist init`
315
+
316
+ Scaffold a new `arena.config.ts` in the current directory.
278
317
 
279
318
  ```bash
280
- # Scaffold a new config
281
319
  npx duelist init
320
+ ```
282
321
 
322
+ ### `duelist run`
323
+
324
+ Run benchmarks defined in your arena config.
325
+
326
+ ```bash
283
327
  # Run with the default config (arena.config.ts)
284
328
  npx duelist run
285
329
 
@@ -288,12 +332,48 @@ npx duelist run --config path/to/arena.config.ts
288
332
 
289
333
  # Get JSON instead of a table
290
334
  npx duelist run --reporter json
335
+
336
+ # Suppress per-result progress lines
337
+ npx duelist run --quiet
291
338
  ```
292
339
 
293
- Options (subject to change as the project evolves):
340
+ | Option | Description |
341
+ |--------|-------------|
342
+ | `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
343
+ | `--reporter <type>` | Output format: `console` (default) or `json` |
344
+ | `-q, --quiet` | Suppress per-result progress |
345
+
346
+ ### `duelist ci`
294
347
 
295
- - `--config` path to a config file (TypeScript).
296
- - `--reporter` — `console` (default) or `json`.
348
+ Run benchmarks, compare against a baseline, and enforce quality gates. Exits non-zero if regressions are detected or cost exceeds the budget.
349
+
350
+ ```bash
351
+ # First run — establishes the baseline
352
+ npx duelist ci --update-baseline
353
+
354
+ # Subsequent runs — compare against baseline
355
+ npx duelist ci --threshold correctness=0.1 --budget 1.00
356
+
357
+ # Post comparison table as a PR comment (GitHub Actions)
358
+ npx duelist ci --threshold correctness=0.1 --comment
359
+ ```
360
+
361
+ | Option | Description |
362
+ |--------|-------------|
363
+ | `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
364
+ | `--baseline <path>` | Baseline JSON file (default: `.duelist/baseline.json`) |
365
+ | `--budget <dollars>` | Max total cost in USD — fails if exceeded |
366
+ | `--threshold <scorer=delta>` | Regression threshold (repeatable, e.g. `--threshold correctness=0.1 --threshold cost=0.002`) |
367
+ | `--update-baseline` | Save results as new baseline after passing |
368
+ | `--comment` | Post markdown comparison table as a GitHub PR comment |
369
+ | `-q, --quiet` | Suppress per-result progress |
370
+
371
+ **How regression detection works:**
372
+
373
+ - With `runs > 1`, the CI uses 95% confidence intervals (t-distribution) — a scorer only regresses if the confidence intervals don't overlap beyond the threshold. This avoids false positives from noisy LLM outputs.
374
+ - With `runs === 1`, it uses a simple delta comparison.
375
+ - Without `--threshold` flags, regression detection is skipped entirely — only `--budget` is enforced.
376
+ - Results with high variance (CV > 0.3) are flagged as **flaky** with a warning.
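
The regression rules above can be sketched in TypeScript as follows. This is a minimal illustration, not agent-duelist's actual implementation: the names `summarize`, `regressed`, and `isFlaky` are hypothetical, the scorer is assumed to be higher-is-better, and "don't overlap beyond the threshold" is read as "the gap between the baseline's lower bound and the current upper bound exceeds the threshold". With a single run per side, the intervals collapse to points, so the same check reduces to a simple delta comparison.

```ts
// Hypothetical sketch of CI-based regression detection (not the library's code).

interface Summary {
  mean: number
  lower: number // lower bound of the 95% confidence interval
  upper: number // upper bound of the 95% confidence interval
  cv: number    // coefficient of variation: sample std dev / |mean|
}

// Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
const T_95 = new Map<number, number>([
  [1, 12.706], [2, 4.303], [3, 3.182], [4, 2.776], [5, 2.571],
  [6, 2.447], [7, 2.365], [8, 2.306], [9, 2.262],
])

function summarize(samples: number[]): Summary {
  const n = samples.length
  const mean = samples.reduce((a, b) => a + b, 0) / n
  // A single run has no spread: the interval is just the point itself.
  if (n < 2) return { mean, lower: mean, upper: mean, cv: 0 }
  const variance = samples.reduce((a, x) => a + (x - mean) ** 2, 0) / (n - 1)
  const sd = Math.sqrt(variance)
  // Falls back to the normal approximation (1.96) beyond the small table above.
  const t = T_95.get(n - 1) ?? 1.96
  const half = (t * sd) / Math.sqrt(n)
  return {
    mean,
    lower: mean - half,
    upper: mean + half,
    cv: mean === 0 ? 0 : sd / Math.abs(mean),
  }
}

// Assumes a higher-is-better scorer such as correctness.
function regressed(baseline: number[], current: number[], threshold: number): boolean {
  const b = summarize(baseline)
  const c = summarize(current)
  return b.lower - c.upper > threshold
}

// Mirrors the "CV > 0.3" flakiness rule described above.
function isFlaky(samples: number[]): boolean {
  return summarize(samples).cv > 0.3
}
```

Under these assumptions, a noisy drop such as `[0.9, 0.8, 1.0]` to `[0.7, 0.9, 0.5]` with a threshold of `0.1` is not reported as a regression because the two intervals overlap, while a consistent drop from `0.9` to `0.5` across three runs is.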
297
377
 
298
378
  The CLI loads TypeScript configs directly using a lightweight runtime loader so users don't need to precompile their config.
299
379
 
@@ -345,42 +425,13 @@ export default defineArena({
345
425
  npx duelist run
346
426
  ```
347
427
 
348
- Output:
428
+ Output includes box-drawing tables with medals, color-ranked metrics, sparkline bars, and a winner row per task — see the [screenshot above](#what-you-get) for a real example.
349
429
 
350
- ```
351
- ⬡ Agent Duelist Results (3 runs each)
352
- ──────────────────────────────────────────────────────────────────────
353
-
354
- Task: extract-company
355
- Provider Latency Cost Tokens Match Schema Fuzzy
356
- ─────────────────────────────────────────────────────────────────────────────────────────────
357
- azure/gpt-5-mini 1905ms ~$0.189m 140 100% 100% 100%
358
- azure/gpt-5-nano 2079ms ~$0.081m 249 100% 100% 100%
359
- azure/gpt-5.2-chat 1493ms ~$0.0011 126 100% 100% 100%
360
-
361
- Task: summarize
362
- Provider Latency Cost Tokens Match Schema Fuzzy
363
- ─────────────────────────────────────────────────────────────────────────────────────────────
364
- azure/gpt-5-mini 1723ms ~$0.192m 127 0% — 36%
365
- azure/gpt-5-nano 2117ms ~$0.081m 234 0% — 43%
366
- azure/gpt-5.2-chat 1008ms ~$0.584m 72 0% — 43%
367
-
368
- Task: classify-sentiment
369
- Provider Latency Cost Tokens Match Schema Fuzzy
370
- ─────────────────────────────────────────────────────────────────────────────────────────────
371
- azure/gpt-5-mini 1012ms ~$0.077m 82 100% — 100%
372
- azure/gpt-5-nano 1075ms ~$0.024m 104 100% — 100%
373
- azure/gpt-5.2-chat 936ms ~$0.526m 81 100% — 100%
374
-
375
- ──────────────────────────────────────────────────────────────────────
376
- Summary
377
-
378
- ◆ Most correct: azure/gpt-5-mini (OpenAI via Azure) (avg 67%)
379
- ◆ Fastest: azure/gpt-5.2-chat (OpenAI via Azure) (avg 1146ms)
380
- ◆ Cheapest: azure/gpt-5-nano (OpenAI via Azure) (avg ~$0.062m)
381
-
382
- Costs estimated from OpenRouter pricing catalog.
383
- ```
430
+ **How scoring works:**
431
+
432
+ - Providers are compared **head-to-head within each task** — all providers receive the same prompt at the same time.
433
+ - Medals are awarded only when a provider is the **sole leader** in a metric column. Ties don't award medals, keeping rankings meaningful.
434
+ - The overall winner is determined by category wins across correctness, latency, and cost.
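
As a rough illustration of the sole-leader medal rule, the sketch below picks a winner for one metric column only when exactly one provider holds the best value. The function name `soleLeader` and the input shape are assumptions made for this example and are not part of the package's API.

```ts
// Hypothetical sketch of the "sole leader" rule for a single metric column.
// `values` maps a provider id to its metric value for one task.
function soleLeader(
  values: Record<string, number>,
  higherIsBetter: boolean,
): string | null {
  const entries = Object.entries(values)
  if (entries.length === 0) return null

  const nums = entries.map(([, v]) => v)
  const best = higherIsBetter ? Math.max(...nums) : Math.min(...nums)

  // Count how many providers share the best value; a tie awards no medal.
  let leader: string | null = null
  let count = 0
  for (const [id, v] of entries) {
    if (v === best) {
      leader = id
      count++
    }
  }
  return count === 1 ? leader : null
}

// Latency: lower is better, one provider is strictly fastest.
const fastest = soleLeader({ 'provider-a': 1012, 'provider-b': 1075, 'provider-c': 936 }, false)

// Correctness: all providers tie at 100%, so nobody gets a medal.
const mostCorrect = soleLeader({ 'provider-a': 1, 'provider-b': 1, 'provider-c': 1 }, true)
```

Here `fastest` is `'provider-c'` and `mostCorrect` is `null`, which matches the stated intent that ties do not award medals.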
384
435
 
385
436
  ---
386
437
 
@@ -403,7 +454,7 @@ const weatherTool = {
403
454
  }
404
455
 
405
456
  export default defineArena({
406
- providers: [openai('gpt-4o')],
457
+ providers: [openai('gpt-5-mini')],
407
458
  tasks: [
408
459
  {
409
460
  name: 'weather-tool-call',
@@ -421,15 +472,71 @@ The model calls `getCurrentWeather`, the handler returns a stub result, and the
421
472
 
422
473
  ---
423
474
 
475
+ ## CI / GitHub Actions
476
+
477
+ `duelist ci` is designed to run as a quality gate in your CI pipeline. It compares benchmark results against a saved baseline and fails the build if quality regresses or costs exceed a budget.
478
+
479
+ ### GitHub Action
480
+
481
+ The easiest way to add eval quality gates to your PR workflow:
482
+
483
+ ```yaml
484
+ # .github/workflows/eval.yml
485
+ name: LLM Eval
486
+ on: [pull_request]
487
+
488
+ jobs:
489
+ eval:
490
+ runs-on: ubuntu-latest
491
+ steps:
492
+ - uses: actions/checkout@v4
493
+ - uses: DataGobes/agent-duelist/.github/actions/duelist-ci@main
494
+ with:
495
+ budget: '1.00'
496
+ thresholds: 'correctness=0.1'
497
+ env:
498
+ OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
499
+ ```
500
+
501
+ The action handles Node.js setup, runs `duelist ci`, posts a comparison table as a PR comment, and optionally commits an updated baseline.
502
+
503
+ | Input | Default | Description |
504
+ |-------|---------|-------------|
505
+ | `config` | `arena.config.ts` | Path to arena config file |
506
+ | `baseline` | `.duelist/baseline.json` | Path to baseline JSON file |
507
+ | `budget` | — | Max total cost in USD |
508
+ | `thresholds` | — | Space-separated `scorer=delta` pairs |
509
+ | `update-baseline` | `false` | Save results as new baseline after passing |
510
+ | `comment` | `true` | Post results as PR comment |
511
+ | `node-version` | `20` | Node.js version to use |
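
To make the `thresholds` input format concrete, here is a small sketch of how a space-separated list of `scorer=delta` pairs could be parsed into a map. The helper `parseThresholds` is hypothetical and only illustrates the format; how the action or CLI parses it internally is not shown in this README.

```ts
// Hypothetical parser for a space-separated list of `scorer=delta` pairs,
// e.g. 'correctness=0.1 cost=0.002'. Not the package's actual implementation.
function parseThresholds(input: string): Map<string, number> {
  const thresholds = new Map<string, number>()
  const pairs = input.trim().split(/\s+/).filter((p) => p.length > 0)

  for (const pair of pairs) {
    const eq = pair.indexOf('=')
    const scorer = pair.slice(0, eq)
    const raw = pair.slice(eq + 1)
    const delta = Number(raw)
    // Reject pairs with no '=', an empty scorer name, or a non-numeric delta.
    if (eq <= 0 || raw === '' || !Number.isFinite(delta)) {
      throw new Error(`Invalid threshold "${pair}", expected scorer=delta`)
    }
    thresholds.set(scorer, delta)
  }
  return thresholds
}

const parsed = parseThresholds('correctness=0.1 cost=0.002')
```

In this sketch, `parsed.get('correctness')` is `0.1` and `parsed.get('cost')` is `0.002`, while a malformed pair such as `correctness=` throws instead of silently defaulting to zero.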
512
+
513
+ ### PR comment output
514
+
515
+ When `--comment` is enabled, the CI posts (or updates) a markdown table on the PR:
516
+
517
+ | Provider | Task | Scorer | Baseline | Current | Delta | Status |
518
+ |----------|------|--------|----------|---------|-------|--------|
519
+ | openai/gpt-5-mini | extract | correctness | 0.900 | 0.850 | -0.050 | ⚪ unchanged |
520
+ | openai/gpt-5-mini | extract | latency | 0.920 ± 0.030 | 0.890 ± 0.025 | -0.030 | ⚪ unchanged |
521
+
522
+ The comment also includes a cost summary, flakiness warnings, and a pass/fail verdict.
523
+
524
+ ---
525
+
424
526
  ## Roadmap
425
527
 
426
528
  Shipped so far:
427
529
 
428
- - OpenAI, Azure OpenAI, Anthropic, Google Gemini, and OpenAI-compatible providers
530
+ - OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible provider
429
531
  - 7 built-in scorers including LLM-as-judge, tool-usage, schema validation, and fuzzy similarity
430
532
  - Tool-calling support with local handlers for agent task benchmarking
431
- - Colored console reporter with per-task tables and cross-provider summary
533
+ - Fair head-to-head benchmarking: tasks run sequentially while providers race in parallel, ensuring fair latency comparison without queue-induced timeout penalties
534
+ - Console reporter with box-drawing tables, sole-leader medal rankings, sparkline bars (toggleable), and per-task winner rows
535
+ - Configurable per-request timeout to prevent hanging on unresponsive APIs
432
536
  - JSON reporter for CI/pipeline integration
537
+ - Markdown reporter for PR comments
538
+ - `duelist ci` command with regression detection, cost budgets, and flakiness warnings
539
+ - GitHub Action for CI/CD integration
433
540
  - Pricing catalog from OpenRouter with refresh script
434
541
 
435
542
  Planned directions (subject to community feedback):
@@ -437,8 +544,7 @@ Planned directions (subject to community feedback):
437
544
  - **More providers**
438
545
  - OpenRouter-native and more OpenAI-compatible gateways.
439
546
  - **Better reporting**
440
- - Markdown/HTML/CSV reports.
441
- - GitHub Actions summaries.
547
+ - HTML and CSV export options.
442
548
  - **Agent workflows**
443
549
  - Multi-step tool chains, multi-hop reasoning, and agent traces.
444
550
  - **Plugin system**