agent-duelist 0.1.0 → 0.2.0

package/README.md CHANGED
@@ -1,13 +1,16 @@
1
1
  # agent-duelist
2
2
 
3
3
  [![CI](https://github.com/DataGobes/agent-duelist/actions/workflows/ci.yml/badge.svg)](https://github.com/DataGobes/agent-duelist/actions/workflows/ci.yml)
4
+ [![Docs](https://img.shields.io/badge/docs-landing%20page-f59e0b)](https://datagobes.github.io/agent-duelist/)
4
5
 
5
6
  > Pit LLM providers against each other on agent tasks — Duel your models.
7
+ >
8
+ > **[View the landing page →](https://datagobes.github.io/agent-duelist/)**
6
9
 
7
10
  `agent-duelist` is a TypeScript-first framework to pit multiple LLM providers against each other on the same tasks and get structured, reproducible results: correctness, latency, tokens, and cost.
8
11
 
9
12
  ## What you get
10
- > <img width="739" height="473" alt="image" src="https://github.com/user-attachments/assets/41149222-1035-42fd-b643-4f8b856c30a0" />
13
+ > ![Agent Duelist console output](docs/assets/screenshot.png)
11
14
 
12
15
 
13
16
  - Compare OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible gateway.
@@ -64,8 +67,8 @@ import { z } from 'zod'
64
67
 
65
68
  export default defineArena({
66
69
  providers: [
67
- openai('gpt-4o'),
68
- azureOpenai('gpt-4o', { deployment: 'my-azure-deployment' }),
70
+ openai('gpt-5-mini'),
71
+ azureOpenai('gpt-5-mini', { deployment: 'my-azure-deployment' }),
69
72
  ],
70
73
  tasks: [
71
74
  {
@@ -98,7 +101,7 @@ npx duelist run
98
101
  You'll see a matrix like:
99
102
 
100
103
  - Rows: tasks (`simple-qa`, `structured-extraction`)
101
- - Columns: providers (`openai/gpt-4o`, `azure/gpt-4o`)
104
+ - Columns: providers (`openai/gpt-5-mini`, `azure/gpt-5-mini`)
102
105
  - Cells: correctness score, latency, tokens, and estimated cost.
103
106
 
104
107
  For CI or further processing:
@@ -133,21 +136,21 @@ import {
133
136
  type ArenaProvider,
134
137
  } from 'agent-duelist'
135
138
 
136
- const oai = openai('gpt-4o')
139
+ const oai = openai('gpt-5-mini')
137
140
 
138
- const azure = azureOpenai('gpt-4o', {
141
+ const azure = azureOpenai('gpt-5-mini', {
139
142
  deployment: 'my-deployment',
140
143
  })
141
144
 
142
- const claude = anthropic('claude-sonnet-4-20250514')
145
+ const claude = anthropic('claude-sonnet-4.6')
143
146
 
144
- const gem = gemini('gemini-2.5-flash') // uses GOOGLE_API_KEY
147
+ const gem = gemini('gemini-3-flash-preview') // uses GOOGLE_API_KEY
145
148
 
146
149
  const local: ArenaProvider = openaiCompatible({
147
- id: 'local/gpt-4o-like',
150
+ id: 'local/llama',
148
151
  name: 'Local Gateway',
149
152
  baseURL: 'http://localhost:11434/v1',
150
- model: 'gpt-4o',
153
+ model: 'llama3.3',
151
154
  apiKeyEnv: 'LOCAL_LLM_API_KEY',
152
155
  })
153
156
  ```
@@ -156,7 +159,7 @@ At minimum, a provider implements:
156
159
 
157
160
  ```ts
158
161
  interface ArenaProvider {
159
- id: string // e.g. 'openai/gpt-4o'
162
+ id: string // e.g. 'openai/gpt-5-mini'
160
163
  name: string // e.g. 'OpenAI'
161
164
  model: string
162
165
  run(input: TaskInput): Promise<TaskResult>
@@ -230,12 +233,39 @@ defineArena({
230
233
  })
231
234
  ```
232
235
 
233
- The judge model defaults to `gpt-4o-mini`. It can also be set via the `DUELIST_JUDGE_MODEL` env var. The judge backend is auto-detected from the model name — `gemini-*` models use Google's API, otherwise it falls back to OpenAI or Azure OpenAI.
236
+ The judge model defaults to `gpt-5-mini`. It can also be set via the `DUELIST_JUDGE_MODEL` env var. The judge backend is auto-detected from the model name — `gemini-*` models use Google's API, otherwise it falls back to OpenAI or Azure OpenAI.
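The resolution described above can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: `resolveJudge` and the `JudgeBackend` type are hypothetical names, and the precedence (config option, then `DUELIST_JUDGE_MODEL`, then the default) is an assumption the README does not spell out.

```ts
// Hypothetical sketch: names and precedence are assumptions, not the library's API.
type JudgeBackend = 'google' | 'openai-or-azure'

function resolveJudge(
  configModel?: string,
  env: Record<string, string | undefined> = {},
): { model: string; backend: JudgeBackend } {
  // Assumed precedence: explicit config, then env var, then the documented default.
  const model = configModel ?? env.DUELIST_JUDGE_MODEL ?? 'gpt-5-mini'
  // `gemini-*` models route to Google's API; everything else falls back to OpenAI or Azure OpenAI.
  const backend: JudgeBackend = model.startsWith('gemini-') ? 'google' : 'openai-or-azure'
  return { model, backend }
}
```

How the real implementation chooses between OpenAI and Azure OpenAI is not described here, so the sketch leaves that as a single combined backend.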
234
237
 
235
238
  You can also add custom scorers for domain-specific metrics (e.g. tool-call correctness, safety, style).
236
239
 
237
240
  ---
238
241
 
242
+ ### Arena options
243
+
244
+ `defineArena()` accepts these top-level options alongside `providers`, `tasks`, and `scorers`:
245
+
246
+ | Option | Type | Default | Description |
247
+ |--------|------|---------|-------------|
248
+ | `runs` | `number` | `1` | Number of runs per provider × task combination. Higher values improve statistical confidence for CI regression detection. |
249
+ | `judgeModel` | `string` | `'gpt-5-mini'` | Model used by the `llm-judge-correctness` scorer. Also settable via `DUELIST_JUDGE_MODEL` env var. Gemini models auto-route to Google's API. |
250
+ | `timeout` | `number` | `60000` | Per-request timeout in milliseconds. Requests exceeding this are marked as failures. Prevents hanging on unresponsive APIs. |
251
+ | `sparklines` | `boolean` | `true` | Show sparkline bars next to percentage scores in the console reporter. Disable with `false` if your terminal doesn't render Unicode block characters well. |
252
+
253
+ Example with all options:
254
+
255
+ ```ts
256
+ export default defineArena({
257
+ providers: [openai('gpt-5-mini'), gemini('gemini-3-flash-preview')],
258
+ tasks: [/* ... */],
259
+ scorers: ['latency', 'cost', 'correctness', 'llm-judge-correctness'],
260
+ runs: 3,
261
+ judgeModel: 'gemini-3.1-pro-preview',
262
+ timeout: 30_000, // 30s — fail fast on slow APIs
263
+ sparklines: false, // plain percentages, no Unicode bars
264
+ })
265
+ ```
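For intuition, a per-request timeout like the `timeout` option could be enforced along these lines. This is a sketch under assumptions: `withTimeout` is a hypothetical helper, not an export of `agent-duelist`, and the library may implement its timeout differently.

```ts
// Hypothetical helper: races a request against a timer and aborts on expiry.
async function withTimeout<T>(
  work: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController()
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort() // lets the underlying request cancel itself, if it listens
      reject(new Error(`Request timed out after ${ms}ms`))
    }, ms)
  })
  try {
    // Whichever settles first wins; a timeout surfaces as a rejection (a failure).
    return await Promise.race([work(controller.signal), timeout])
  } finally {
    clearTimeout(timer) // avoid leaving a pending timer after a fast response
  }
}
```

A caller would pass the abort signal on to its HTTP client, for example `withTimeout((signal) => fetch(url, { signal }), 30_000)`.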
266
+
267
+ ---
268
+
239
269
  ## Cost & pricing
240
270
 
241
271
  Cost estimation is intentionally transparent and conservative:
@@ -247,7 +277,7 @@ Cost estimation is intentionally transparent and conservative:
247
277
  `agent-duelist` ships with a **locally bundled catalog** of per-token prices for many models, derived from OpenRouter's public pricing pages.
248
278
 
249
279
  - The catalog maps `(provider, model)` → `{ inputPerM, outputPerM }` in USD per 1M tokens.
250
- - Azure OpenAI models are resolved back to their base OpenAI models where possible (e.g. `azure/gpt-4o` → `openai/gpt-4o`) so you don't need to configure Azure pricing manually.
280
+ - Azure OpenAI models are resolved back to their base OpenAI models where possible (e.g. `azure/gpt-5-mini` → `openai/gpt-5-mini`) so you don't need to configure Azure pricing manually.
251
281
 
252
282
  3. **Estimated USD**
253
283
  The `cost` scorer computes:
@@ -274,12 +304,19 @@ You can update the catalog with a script that re-scrapes OpenRouter's public pri
274
304
 
275
305
  ## CLI usage
276
306
 
277
- Basic commands:
307
+ ### `duelist init`
308
+
309
+ Scaffold a new `arena.config.ts` in the current directory.
278
310
 
279
311
  ```bash
280
- # Scaffold a new config
281
312
  npx duelist init
313
+ ```
282
314
 
315
+ ### `duelist run`
316
+
317
+ Run benchmarks defined in your arena config.
318
+
319
+ ```bash
283
320
  # Run with the default config (arena.config.ts)
284
321
  npx duelist run
285
322
 
@@ -288,12 +325,48 @@ npx duelist run --config path/to/arena.config.ts
288
325
 
289
326
  # Get JSON instead of a table
290
327
  npx duelist run --reporter json
328
+
329
+ # Suppress per-result progress lines
330
+ npx duelist run --quiet
331
+ ```
332
+
333
+ | Option | Description |
334
+ |--------|-------------|
335
+ | `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
336
+ | `--reporter <type>` | Output format: `console` (default) or `json` |
337
+ | `-q, --quiet` | Suppress per-result progress |
338
+
339
+ ### `duelist ci`
340
+
341
+ Run benchmarks, compare against a baseline, and enforce quality gates. Exits non-zero if regressions are detected or cost exceeds the budget.
342
+
343
+ ```bash
344
+ # First run — establishes the baseline
345
+ npx duelist ci --update-baseline
346
+
347
+ # Subsequent runs — compare against baseline
348
+ npx duelist ci --threshold correctness=0.1 --budget 1.00
349
+
350
+ # Post comparison table as a PR comment (GitHub Actions)
351
+ npx duelist ci --threshold correctness=0.1 --comment
291
352
  ```
292
353
 
293
- Options (subject to change as the project evolves):
354
+ | Option | Description |
355
+ |--------|-------------|
356
+ | `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
357
+ | `--baseline <path>` | Baseline JSON file (default: `.duelist/baseline.json`) |
358
+ | `--budget <dollars>` | Max total cost in USD — fails if exceeded |
359
+ | `--threshold <scorer=delta>` | Regression threshold (repeatable, e.g. `--threshold correctness=0.1 --threshold cost=0.002`) |
360
+ | `--update-baseline` | Save results as new baseline after passing |
361
+ | `--comment` | Post markdown comparison table as a GitHub PR comment |
362
+ | `-q, --quiet` | Suppress per-result progress |
363
+
364
+ **How regression detection works:**
294
365
 
295
- - `--config` path to a config file (TypeScript).
296
- - `--reporter` `console` (default) or `json`.
366
+ - With `runs > 1`, `duelist ci` uses 95% confidence intervals (t-distribution): a scorer counts as regressed only if the baseline and current intervals are separated by more than the threshold. This avoids false positives from noisy LLM outputs.
367
+ - With `runs === 1`, it uses a simple delta comparison.
368
+ - Without `--threshold` flags, regression detection is skipped entirely — only `--budget` is enforced.
369
+ - Results with high variance (CV > 0.3) are flagged as **flaky** with a warning.
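The rules above can be sketched as follows. This is one plausible reading, not the package's actual implementation: `summarize` and `regressed` are hypothetical names, scores are assumed to be "higher is better", and t critical values beyond 10 degrees of freedom fall back to the df = 10 value, which is slightly conservative.

```ts
// Two-sided 95% t critical values for df = 1..10 (standard table values).
const T95 = [12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228]

interface Stats { mean: number; lower: number; upper: number; cv: number }

// Mean, 95% confidence interval, and coefficient of variation for one scorer.
function summarize(samples: number[]): Stats {
  const n = samples.length
  const mean = samples.reduce((a, b) => a + b, 0) / n
  if (n < 2) return { mean, lower: mean, upper: mean, cv: 0 } // single run: no interval
  const variance = samples.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1)
  const sd = Math.sqrt(variance)
  const t = T95[Math.min(n - 1, T95.length) - 1] // assumption: clamp df to the table
  const half = (t * sd) / Math.sqrt(n)
  const cv = mean === 0 ? 0 : sd / Math.abs(mean)
  return { mean, lower: mean - half, upper: mean + half, cv }
}

// Regressed only if the current interval sits below the baseline interval by more
// than the threshold. With one run each, this reduces to a simple delta check.
function regressed(baseline: number[], current: number[], threshold: number): boolean {
  return summarize(baseline).lower - summarize(current).upper > threshold
}

// Flaky if the coefficient of variation exceeds 0.3.
const isFlaky = (samples: number[]): boolean => summarize(samples).cv > 0.3
```

With three noisy runs per side, a drop in the mean from 0.9 to 0.7 may not count as a regression because the intervals still overlap, whereas the same drop with a single run each would exceed a threshold of 0.1.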
297
370
 
298
371
  The CLI loads TypeScript configs directly using a lightweight runtime loader so users don't need to precompile their config.
299
372
 
@@ -345,42 +418,7 @@ export default defineArena({
345
418
  npx duelist run
346
419
  ```
347
420
 
348
- Output:
349
-
350
- ```
351
- ⬡ Agent Duelist Results (3 runs each)
352
- ──────────────────────────────────────────────────────────────────────
353
-
354
- Task: extract-company
355
- Provider Latency Cost Tokens Match Schema Fuzzy
356
- ─────────────────────────────────────────────────────────────────────────────────────────────
357
- azure/gpt-5-mini 1905ms ~$0.189m 140 100% 100% 100%
358
- azure/gpt-5-nano 2079ms ~$0.081m 249 100% 100% 100%
359
- azure/gpt-5.2-chat 1493ms ~$0.0011 126 100% 100% 100%
360
-
361
- Task: summarize
362
- Provider Latency Cost Tokens Match Schema Fuzzy
363
- ─────────────────────────────────────────────────────────────────────────────────────────────
364
- azure/gpt-5-mini 1723ms ~$0.192m 127 0% — 36%
365
- azure/gpt-5-nano 2117ms ~$0.081m 234 0% — 43%
366
- azure/gpt-5.2-chat 1008ms ~$0.584m 72 0% — 43%
367
-
368
- Task: classify-sentiment
369
- Provider Latency Cost Tokens Match Schema Fuzzy
370
- ─────────────────────────────────────────────────────────────────────────────────────────────
371
- azure/gpt-5-mini 1012ms ~$0.077m 82 100% — 100%
372
- azure/gpt-5-nano 1075ms ~$0.024m 104 100% — 100%
373
- azure/gpt-5.2-chat 936ms ~$0.526m 81 100% — 100%
374
-
375
- ──────────────────────────────────────────────────────────────────────
376
- Summary
377
-
378
- ◆ Most correct: azure/gpt-5-mini (OpenAI via Azure) (avg 67%)
379
- ◆ Fastest: azure/gpt-5.2-chat (OpenAI via Azure) (avg 1146ms)
380
- ◆ Cheapest: azure/gpt-5-nano (OpenAI via Azure) (avg ~$0.062m)
381
-
382
- Costs estimated from OpenRouter pricing catalog.
383
- ```
421
+ Output includes box-drawing tables with medals, color-ranked metrics, sparkline bars, and a winner row per task — see the [screenshot above](#what-you-get) for a real example.
384
422
 
385
423
  ---
386
424
 
@@ -403,7 +441,7 @@ const weatherTool = {
403
441
  }
404
442
 
405
443
  export default defineArena({
406
- providers: [openai('gpt-4o')],
444
+ providers: [openai('gpt-5-mini')],
407
445
  tasks: [
408
446
  {
409
447
  name: 'weather-tool-call',
@@ -421,6 +459,57 @@ The model calls `getCurrentWeather`, the handler returns a stub result, and the
421
459
 
422
460
  ---
423
461
 
462
+ ## CI / GitHub Actions
463
+
464
+ `duelist ci` is designed to run as a quality gate in your CI pipeline. It compares benchmark results against a saved baseline and fails the build if quality regresses or costs exceed a budget.
465
+
466
+ ### GitHub Action
467
+
468
+ The easiest way to add eval quality gates to your PR workflow:
469
+
470
+ ```yaml
471
+ # .github/workflows/eval.yml
472
+ name: LLM Eval
473
+ on: [pull_request]
474
+
475
+ jobs:
476
+ eval:
477
+ runs-on: ubuntu-latest
478
+ steps:
479
+ - uses: actions/checkout@v4
480
+ - uses: DataGobes/agent-duelist/.github/actions/duelist-ci@main
481
+ with:
482
+ budget: '1.00'
483
+ thresholds: 'correctness=0.1'
484
+ env:
485
+ OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
486
+ ```
487
+
488
+ The action handles Node.js setup, runs `duelist ci`, posts a comparison table as a PR comment, and optionally commits an updated baseline.
489
+
490
+ | Input | Default | Description |
491
+ |-------|---------|-------------|
492
+ | `config` | `arena.config.ts` | Path to arena config file |
493
+ | `baseline` | `.duelist/baseline.json` | Path to baseline JSON file |
494
+ | `budget` | — | Max total cost in USD |
495
+ | `thresholds` | — | Space-separated `scorer=delta` pairs |
496
+ | `update-baseline` | `false` | Save results as new baseline after passing |
497
+ | `comment` | `true` | Post results as PR comment |
498
+ | `node-version` | `20` | Node.js version to use |
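Since the `thresholds` input is a space-separated list while the CLI takes repeatable `--threshold` flags, the mapping between the two presumably looks something like the sketch below. `thresholdsToArgs` is a hypothetical helper written for illustration; the action itself may do this in shell or in some other way.

```ts
// Hypothetical helper: turns 'correctness=0.1 cost=0.002' into repeated CLI flags.
function thresholdsToArgs(input: string): string[] {
  const args: string[] = []
  for (const pair of input.trim().split(/\s+/).filter(Boolean)) {
    const [scorer, delta] = pair.split('=')
    // Reject malformed pairs such as 'correctness', 'correctness=' or 'x=abc'.
    if (!scorer || delta === undefined || delta === '' || Number.isNaN(Number(delta))) {
      throw new Error(`Invalid threshold "${pair}", expected scorer=delta`)
    }
    args.push('--threshold', `${scorer}=${delta}`)
  }
  return args
}
```

An empty input yields no flags, which matches the documented behaviour that regression detection is skipped when no thresholds are given.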
499
+
500
+ ### PR comment output
501
+
502
+ When `--comment` is enabled, `duelist ci` posts (or updates) a markdown table on the PR:
503
+
504
+ | Provider | Task | Scorer | Baseline | Current | Delta | Status |
505
+ |----------|------|--------|----------|---------|-------|--------|
506
+ | openai/gpt-5-mini | extract | correctness | 0.900 | 0.850 | -0.050 | ⚪ unchanged |
507
+ | openai/gpt-5-mini | extract | latency | 0.920 ± 0.030 | 0.890 ± 0.025 | -0.030 | ⚪ unchanged |
508
+
509
+ The comment also includes a cost summary, flakiness warnings, and a pass/fail verdict.
510
+
511
+ ---
512
+
424
513
  ## Roadmap
425
514
 
426
515
  Shipped so far:
@@ -428,8 +517,12 @@ Shipped so far:
428
517
  - OpenAI, Azure OpenAI, Anthropic, Google Gemini, and OpenAI-compatible providers
429
518
  - 7 built-in scorers including LLM-as-judge, tool-usage, schema validation, and fuzzy similarity
430
519
  - Tool-calling support with local handlers for agent task benchmarking
431
- - Colored console reporter with per-task tables and cross-provider summary
520
+ - Console reporter with box-drawing tables, medal rankings, sparkline bars (toggleable), and per-task winner rows
521
+ - Configurable per-request timeout to prevent hanging on unresponsive APIs
432
522
  - JSON reporter for CI/pipeline integration
523
+ - Markdown reporter for PR comments
524
+ - `duelist ci` command with regression detection, cost budgets, and flakiness warnings
525
+ - GitHub Action for CI/CD integration
433
526
  - Pricing catalog from OpenRouter with refresh script
434
527
 
435
528
  Planned directions (subject to community feedback):
@@ -437,8 +530,7 @@ Planned directions (subject to community feedback):
437
530
  - **More providers**
438
531
  - OpenRouter-native and more OpenAI-compatible gateways.
439
532
  - **Better reporting**
440
- - Markdown/HTML/CSV reports.
441
- - GitHub Actions summaries.
533
+ - HTML and CSV export options.
442
534
  - **Agent workflows**
443
535
  - Multi-step tool chains, multi-hop reasoning, and agent traces.
444
536
  - **Plugin system**