agent-duelist 0.2.0 → 0.3.0

package/README.md CHANGED
@@ -1,31 +1,46 @@
1
1
  # agent-duelist
2
2
 
3
+ [![npm version](https://img.shields.io/npm/v/agent-duelist?color=f59e0b)](https://www.npmjs.com/package/agent-duelist)
3
4
  [![CI](https://github.com/DataGobes/agent-duelist/actions/workflows/ci.yml/badge.svg)](https://github.com/DataGobes/agent-duelist/actions/workflows/ci.yml)
4
5
  [![Docs](https://img.shields.io/badge/docs-landing%20page-f59e0b)](https://datagobes.github.io/agent-duelist/)
6
+ [![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](./LICENSE)
5
7
 
6
- > Pit LLM providers against each other on agent tasks — Duel your models.
8
+ > Pit LLM providers against each other on agent tasks — **Duel your models.**
7
9
  >
8
10
  > **[View the landing page →](https://datagobes.github.io/agent-duelist/)**
9
11
 
10
- `agent-duelist` is a TypeScript-first framework to pit multiple LLM providers against each other on the same tasks and get structured, reproducible results: correctness, latency, tokens, and cost.
12
+ `agent-duelist` is a TypeScript-first benchmarking framework that runs the same tasks against multiple LLM providers and gives you structured, reproducible results: correctness, latency, tokens, cost, and more.
13
+
14
+ ```bash
15
+ npx duelist init # scaffold a config
16
+ npx duelist run # see who wins
17
+ ```
11
18
 
12
19
  ## What you get
13
- > ![Agent Duelist console output](docs/assets/screenshot.png)
14
20
 
21
+ **Console output** — box-drawing tables with medals, color-ranked metrics, sparklines, and per-task winners:
22
+
23
+ ![Agent Duelist console output](docs/assets/screenshot.png)
24
+
25
+ **HTML report** — a self-contained, shareable single-file report with sortable tables, progress bars, tab navigation, and summary cards:
15
26
 
16
- - Compare OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible gateway.
17
- - Define tasks once, run them against many providers.
18
- - Get CLI tables and JSON results you can feed into dashboards, CI, or docs.
27
+ ![Agent Duelist HTML report](docs/assets/screenshot-html.png)
19
28
 
20
29
  ---
21
30
 
22
31
  ## Why agent-duelist?
23
32
 
24
- - **Provider-agnostic**: One config, many providers. Swap models and gateways without rewriting your tasks.
25
- - **Agent-focused**: Designed for agent workflows and tool use, not just single-turn prompts.
26
- - **Realistic metrics**: Latency, token counts, and cost estimates based on a pricing catalog.
27
- - **TypeScript-native DX**: Strongly typed APIs, Zod schemas for structured outputs, and a simple `defineArena()` entrypoint.
28
- - **CLI-first**: `npx duelist init` → `npx duelist run` gets you from zero to useful table in minutes.
33
+ | | |
34
+ |---|---|
35
+ | **Provider-agnostic** | One config, many providers. Swap models and gateways without rewriting tasks. |
36
+ | **Agent-focused** | Built for agent workflows and tool use, not just single-turn prompts. |
37
+ | **Task packs** | Built-in benchmark suites run with `--pack structured-output`, zero config needed. |
38
+ | **Quality-first ranking** | Medals decided by output quality. Speed and cost only break ties — a fast model that gets nothing right won't medal. |
39
+ | **7 built-in scorers** | Correctness, latency, cost, schema validation, fuzzy similarity, LLM-as-judge, and tool usage. |
40
+ | **Fair benchmarking** | Tasks run sequentially while providers race in parallel — fair latency comparison with no queue-induced penalties. |
41
+ | **TypeScript-native** | Strongly typed APIs, Zod schemas for structured outputs, and a simple `defineArena()` entrypoint. |
42
+ | **CI-ready** | Regression detection with confidence intervals, cost budgets, PR comments, and a prebuilt GitHub Action. |
43
+ | **CLI-first** | `npx duelist init` → `npx duelist run` gets you from zero to results in minutes. |
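The "Fair benchmarking" row can be pictured as a sequential loop over tasks with a parallel fan-out over providers. The following is a minimal sketch under that reading, not agent-duelist's actual runner; `Provider`, `Task`, `TimedResult`, and `runArena` are hypothetical names introduced only for illustration.

```ts
// Hypothetical shapes for illustration only; these are not agent-duelist's real types.
type Provider = { id: string; run: (prompt: string) => Promise<string> }
type Task = { name: string; prompt: string }
type TimedResult = { task: string; provider: string; output: string; latencyMs: number }

// Tasks run one at a time; within a task, every provider starts together,
// so no provider's latency includes time spent queued behind another provider.
async function runArena(providers: Provider[], tasks: Task[]): Promise<TimedResult[]> {
  const results: TimedResult[] = []
  for (const task of tasks) {
    const batch = await Promise.all(
      providers.map(async (provider): Promise<TimedResult> => {
        const start = Date.now()
        const output = await provider.run(task.prompt)
        return { task: task.name, provider: provider.id, output, latencyMs: Date.now() - start }
      }),
    )
    results.push(...batch)
  }
  return results
}
```

Because `Promise.all` preserves input order, results come back grouped by task and ordered by provider, whichever provider finishes first.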
29
44
 
30
45
  ---
31
46
 
@@ -39,7 +54,7 @@ pnpm add agent-duelist
39
54
  yarn add agent-duelist
40
55
  ```
41
56
 
42
- You'll also need API keys for the providers you want to benchmark, for example:
57
+ Set API keys for the providers you want to benchmark:
43
58
 
44
59
  ```bash
45
60
  export OPENAI_API_KEY=sk-...
@@ -50,7 +65,7 @@ export GOOGLE_API_KEY=...
50
65
 
51
66
  ---
52
67
 
53
- ## One-minute quickstart
68
+ ## Quickstart
54
69
 
55
70
  Initialize a config:
56
71
 
@@ -58,7 +73,7 @@ Initialize a config:
58
73
  npx duelist init
59
74
  ```
60
75
 
61
- This creates `arena.config.ts` in your project. A minimal example:
76
+ This creates `arena.config.ts` in your project:
62
77
 
63
78
  ```ts
64
79
  // arena.config.ts
@@ -98,33 +113,78 @@ Run the benchmark:
98
113
  npx duelist run
99
114
  ```
100
115
 
101
- You'll see a matrix like:
116
+ You'll see a results matrix:
102
117
 
103
- - Rows: tasks (`simple-qa`, `structured-extraction`)
104
- - Columns: providers (`openai/gpt-5-mini`, `azure/gpt-5-mini`)
105
- - Cells: correctness score, latency, tokens, and estimated cost.
118
+ - **Rows**: tasks (`simple-qa`, `structured-extraction`)
119
+ - **Columns**: providers (`openai/gpt-5-mini`, `azure/gpt-5-mini`)
120
+ - **Cells**: correctness score, latency, tokens, and estimated cost
106
121
 
107
- For CI or further processing:
122
+ Export the results in different formats:
108
123
 
109
124
  ```bash
125
+ # JSON for CI pipelines and dashboards
110
126
  npx duelist run --reporter json > results.json
127
+
128
+ # Self-contained HTML report you can share or host
129
+ npx duelist run --reporter html --output report.html
111
130
  ```
112
131
 
113
132
  ---
114
133
 
115
- ## Core concepts
134
+ ## Task packs
116
135
 
117
- ### Providers
136
+ Task packs are **built-in benchmark suites** — curated sets of tasks with recommended scorers. Use them to benchmark providers without writing any tasks yourself.
118
137
 
119
- Providers are **factory functions** that return plain objects implementing a shared `ArenaProvider` interface.
138
+ ```bash
139
+ # List available packs
140
+ npx duelist run --pack list
120
141
 
121
- This lets you:
142
+ # Run a pack with your providers
143
+ npx duelist run --pack structured-output --config arena.config.ts
144
+ ```
122
145
 
123
- - Swap providers without changing tasks.
124
- - Wrap or extend providers in your own code.
125
- - Mock providers in tests.
146
+ Your config only needs to define providers; the pack supplies tasks and scorers:
126
147
 
127
- Examples:
148
+ ```ts
149
+ // arena.config.ts
150
+ import { defineArena, openai, anthropic } from 'agent-duelist'
151
+
152
+ export default defineArena({
153
+ providers: [
154
+ openai('gpt-5-mini'),
155
+ anthropic('claude-sonnet-4.6'),
156
+ ],
157
+ tasks: [], // ignored when --pack is used
158
+ scorers: [], // pack supplies its own scorers
159
+ })
160
+ ```
161
+
162
+ ### Available packs
163
+
164
+ | Pack | Tasks | Description |
165
+ |------|-------|-------------|
166
+ | `structured-output` | 6 | Zod schema stress test — flat objects, nesting, arrays, enums, empty arrays, and adversarial input |
167
+
168
+ Packs work with both `run` and `ci` commands:
169
+
170
+ ```bash
171
+ # CI with a task pack
172
+ npx duelist ci --pack structured-output --threshold correctness=0.1 --budget 1.00
173
+ ```
174
+
175
+ You can also combine multiple packs:
176
+
177
+ ```bash
178
+ npx duelist run --pack structured-output,another-pack
179
+ ```
180
+
181
+ ---
182
+
183
+ ## Core concepts
184
+
185
+ ### Providers
186
+
187
+ Providers are **factory functions** that return plain objects implementing a shared `ArenaProvider` interface. This lets you swap providers without changing tasks, wrap or extend providers in your own code, and mock providers in tests.
128
188
 
129
189
  ```ts
130
190
  import {
@@ -133,25 +193,40 @@ import {
133
193
  anthropic,
134
194
  gemini,
135
195
  openaiCompatible,
136
- type ArenaProvider,
137
196
  } from 'agent-duelist'
138
197
 
198
+ // OpenAI
139
199
  const oai = openai('gpt-5-mini')
140
200
 
201
+ // Azure OpenAI
141
202
  const azure = azureOpenai('gpt-5-mini', {
142
203
  deployment: 'my-deployment',
143
204
  })
144
205
 
206
+ // Anthropic
145
207
  const claude = anthropic('claude-sonnet-4.6')
146
208
 
147
- const gem = gemini('gemini-3-flash-preview') // uses GOOGLE_API_KEY
209
+ // Google Gemini
210
+ const gem = gemini('gemini-3-flash-preview')
148
211
 
149
- const local: ArenaProvider = openaiCompatible({
212
+ // Any OpenAI-compatible gateway (Ollama, LiteLLM, vLLM, etc.)
213
+ const local = openaiCompatible({
150
214
  id: 'local/llama',
151
- name: 'Local Gateway',
215
+ name: 'Local Ollama',
152
216
  baseURL: 'http://localhost:11434/v1',
153
217
  model: 'llama3.3',
154
218
  apiKeyEnv: 'LOCAL_LLM_API_KEY',
219
+ free: true, // registers zero-cost pricing
220
+ })
221
+
222
+ // Reasoning models that emit <think> blocks (DeepSeek-R1, MiniMax M2.5, etc.)
223
+ const deepseek = openaiCompatible({
224
+ id: 'deepseek/r1',
225
+ name: 'DeepSeek R1',
226
+ baseURL: 'https://api.deepseek.com/v1',
227
+ model: 'deepseek-reasoner',
228
+ apiKeyEnv: 'DEEPSEEK_API_KEY',
229
+ stripThinking: true, // strips <think>...</think> from output
155
230
  })
156
231
  ```
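To make the `stripThinking` option concrete, here is a rough sketch of what removing `<think>` blocks from a response could look like. It is an assumption about the behavior, not the library's implementation, and `stripThinkBlocks` is a hypothetical helper.

```ts
// Hypothetical helper: removes every <think>...</think> span, then trims whitespace.
function stripThinkBlocks(output: string): string {
  // [\s\S] matches across newlines; *? keeps each match to a single block.
  return output.replace(/<think>[\s\S]*?<\/think>/g, '').trim()
}

const raw = '<think>\nThe user wants a capital city.\n</think>\nParis'
console.log(stripThinkBlocks(raw)) // prints "Paris"
```

Stripping before scoring matters because exact-match and schema scorers would otherwise see the reasoning text as part of the answer.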
157
232
 
@@ -178,6 +253,7 @@ interface ArenaTask {
178
253
  prompt: string
179
254
  expected?: unknown // used by correctness scorers
180
255
  schema?: ZodSchema<any> // used by schema-based scorers
256
+ tools?: ToolDefinition[] // used by tool-calling scorers
181
257
  }
182
258
  ```
183
259
 
@@ -206,7 +282,7 @@ const tasks: ArenaTask[] = [
206
282
 
207
283
  ### Scorers
208
284
 
209
- Scorers take raw model outputs and turn them into **numeric scores** (0–1) with optional details. Built-in scorers:
285
+ Scorers turn raw model outputs into **numeric scores** (0–1) with optional details. Seven built-in scorers ship out of the box:
210
286
 
211
287
  | Scorer | What it measures |
212
288
  |--------|-----------------|
@@ -215,7 +291,8 @@ Scorers take raw model outputs and turn them into **numeric scores** (0–1) wit
215
291
  | `correctness` | Exact match against `expected` (deep-equal, key-order independent for objects) |
216
292
  | `schema-correctness` | Validates output against the task's Zod `schema` via `safeParse()` |
217
293
  | `fuzzy-similarity` | Jaccard token-overlap similarity between output and `expected` |
218
- | `llm-judge-correctness` | Async LLM-as-judge calls a judge model to score correctness 0–1 |
294
+ | `tool-usage` | Whether the model invoked the expected tool(s) during a tool-calling task |
295
+ | `llm-judge-correctness` | LLM-as-judge — calls a judge model to score accuracy, completeness, and conciseness |
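As a rough illustration of the Jaccard token overlap that `fuzzy-similarity` is described as using, the sketch below scores two strings by the ratio of shared to total distinct tokens. The tokenization (lowercasing, splitting on non-alphanumeric characters) is an assumption; the built-in scorer may tokenize differently, and `tokenize` and `jaccard` are hypothetical names.

```ts
// Assumed tokenizer: lowercase, split on anything that is not a letter or digit.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0))
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, defined here as 1 when both sets are empty.
function jaccard(a: string, b: string): number {
  const setA = tokenize(a)
  const setB = tokenize(b)
  if (setA.size === 0 && setB.size === 0) return 1
  let shared = 0
  setA.forEach((token) => {
    if (setB.has(token)) shared++
  })
  return shared / (setA.size + setB.size - shared)
}

console.log(jaccard('The cat sat', 'the cat ran')) // prints 0.5
```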
219
296
 
220
297
  Configure them in your arena:
221
298
 
@@ -235,8 +312,6 @@ defineArena({
235
312
 
236
313
  The judge model defaults to `gpt-5-mini`. It can also be set via the `DUELIST_JUDGE_MODEL` env var. The judge backend is auto-detected from the model name — `gemini-*` models use Google's API, otherwise it falls back to OpenAI or Azure OpenAI.
237
314
 
238
- You can also add custom scorers for domain-specific metrics (e.g. tool-call correctness, safety, style).
239
-
240
315
  ---
241
316
 
242
317
  ### Arena options
@@ -245,7 +320,7 @@ You can also add custom scorers for domain-specific metrics (e.g. tool-call corr
245
320
 
246
321
  | Option | Type | Default | Description |
247
322
  |--------|------|---------|-------------|
248
- | `runs` | `number` | `1` | Number of runs per provider × task combination. Higher values improve statistical confidence for CI regression detection. |
323
+ | `runs` | `number` | `1` | Number of runs per provider x task combination. Higher values improve statistical confidence for CI regression detection. |
249
324
  | `judgeModel` | `string` | `'gpt-5-mini'` | Model used by the `llm-judge-correctness` scorer. Also settable via `DUELIST_JUDGE_MODEL` env var. Gemini models auto-route to Google's API. |
250
325
  | `timeout` | `number` | `60000` | Per-request timeout in milliseconds. Requests exceeding this are marked as failures. Prevents hanging on unresponsive APIs. |
251
326
  | `sparklines` | `boolean` | `true` | Show sparkline bars next to percentage scores in the console reporter. Disable with `false` if your terminal doesn't render Unicode block characters well. |
@@ -266,43 +341,62 @@ export default defineArena({
266
341
 
267
342
  ---
268
343
 
344
+ ## Reporters
345
+
346
+ agent-duelist includes four output formats, each suited to a different workflow:
347
+
348
+ | Reporter | Flag | Use case |
349
+ |----------|------|----------|
350
+ | **Console** | `--reporter console` (default) | Interactive development — box-drawing tables with medals, sparklines, color-ranked metrics, and per-task winners |
351
+ | **JSON** | `--reporter json` | CI pipelines, dashboards, and downstream tooling |
352
+ | **HTML** | `--reporter html` | Shareable single-file reports with sortable tables, animated backgrounds, tab navigation, CSS progress bars, medal rankings, and summary cards |
353
+ | **Markdown** | `--comment` (CI mode) | Auto-posted PR comment with comparison table, cost summary, and pass/fail verdict |
354
+
355
+ Generate an HTML report:
356
+
357
+ ```bash
358
+ npx duelist run --reporter html --output report.html
359
+ ```
360
+
361
+ The HTML report is a single self-contained file — no external dependencies, no build step. Open it in any browser or host it as a static page.
362
+
363
+ ---
364
+
269
365
  ## Cost & pricing
270
366
 
271
367
  Cost estimation is intentionally transparent and conservative:
272
368
 
273
- 1. **Token counts**
274
- Providers return token usage (prompt and completion tokens) in each `TaskResult`. These are treated as the source of truth.
369
+ 1. **Token counts** — Providers return token usage (prompt and completion tokens) in each `TaskResult`. These are the source of truth.
275
370
 
276
- 2. **Pricing catalog**
277
- `agent-duelist` ships with a **locally bundled catalog** of per-token prices for many models, derived from OpenRouter's public pricing pages.
371
+ 2. **Pricing catalog** — `agent-duelist` ships with a **locally bundled catalog** of per-token prices for many models, derived from OpenRouter's public pricing pages.
372
+ - The catalog maps `(provider, model)` to `{ inputPerM, outputPerM }` in USD per 1M tokens.
373
+ - Azure OpenAI models resolve back to their base OpenAI models (e.g. `azure/gpt-5-mini` → `openai/gpt-5-mini`).
374
+ - Cross-provider fallback: models hosted on Groq, Together, Fireworks, etc. resolve to the original provider's pricing.
278
375
 
279
- - The catalog maps `(provider, model)` → `{ inputPerM, outputPerM }` in USD per 1M tokens.
280
- - Azure OpenAI models are resolved back to their base OpenAI models where possible (e.g. `azure/gpt-5-mini` → `openai/gpt-5-mini`) so you don't need to configure Azure pricing manually.
281
-
282
- 3. **Estimated USD**
283
- The `cost` scorer computes:
376
+ 3. **Estimated USD**: The `cost` scorer computes:
284
377
 
285
378
  ```text
286
379
  estimatedUsd = (promptTokens * inputPerM + completionTokens * outputPerM) / 1_000_000
287
380
  ```
288
381
 
289
- In the console reporter, you'll see:
382
+ 4. **Unknown models** — If a model is not in the catalog, tokens are still reported and cost is marked as unknown (no fake numbers).
290
383
 
291
- - token counts: `prompt: X, completion: Y`
292
- - cost: `~$0.0XXm` (millicents; fractions of a cent)
293
- - a short disclaimer that this is an **estimate** based on a pricing snapshot.
384
+ 5. **Custom pricing**: Register pricing for models not in the catalog:
294
385
 
295
- 4. **Unknown models**
296
- If a model is not in the catalog:
386
+ ```ts
387
+ import { registerPricing } from 'agent-duelist'
297
388
 
298
- - Tokens are still reported.
299
- - Cost is marked as unknown (no fake numbers).
389
+ registerPricing('custom/my-model', {
390
+ inputPerToken: 0.000003,
391
+ outputPerToken: 0.000015,
392
+ })
393
+ ```
300
394
 
301
- You can update the catalog with a script that re-scrapes OpenRouter's public pricing page when prices change.
395
+ You can update the bundled catalog with `npm run update:pricing`, which re-scrapes OpenRouter's public pricing page.
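A worked instance of the formula in step 3, including the unknown-model case from step 4. The price entry below is a made-up number for illustration, not a value from the bundled catalog, and `estimateUsd` is a hypothetical helper rather than part of the package API.

```ts
type Price = { inputPerM: number; outputPerM: number } // USD per 1M tokens

// Illustrative prices only; not taken from the real catalog.
const catalog: Record<string, Price> = {
  'example/model-a': { inputPerM: 0.25, outputPerM: 2.0 },
}

// Returns undefined for unknown models instead of inventing a number.
function estimateUsd(model: string, promptTokens: number, completionTokens: number): number | undefined {
  const price = catalog[model]
  if (!price) return undefined
  return (promptTokens * price.inputPerM + completionTokens * price.outputPerM) / 1_000_000
}

console.log(estimateUsd('example/model-a', 1000, 500)) // prints 0.00125
console.log(estimateUsd('example/unknown', 1000, 500)) // prints undefined
```

With these numbers the input side contributes 1000 × 0.25 = 250 and the output side 500 × 2.0 = 1000, so the estimate is 1250 / 1,000,000 = 0.00125 USD.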
302
396
 
303
397
  ---
304
398
 
305
- ## CLI usage
399
+ ## CLI reference
306
400
 
307
401
  ### `duelist init`
308
402
 
@@ -310,30 +404,46 @@ Scaffold a new `arena.config.ts` in the current directory.
310
404
 
311
405
  ```bash
312
406
  npx duelist init
407
+ npx duelist init --force # overwrite an existing config
313
408
  ```
314
409
 
410
+ | Option | Description |
411
+ |--------|-------------|
412
+ | `--force` | Overwrite existing config file |
413
+
315
414
  ### `duelist run`
316
415
 
317
416
  Run benchmarks defined in your arena config.
318
417
 
319
418
  ```bash
320
- # Run with the default config (arena.config.ts)
419
+ # Default config, console output
321
420
  npx duelist run
322
421
 
323
- # Use a custom config
422
+ # Custom config
324
423
  npx duelist run --config path/to/arena.config.ts
325
424
 
326
- # Get JSON instead of a table
327
- npx duelist run --reporter json
425
+ # Run a built-in task pack
426
+ npx duelist run --pack structured-output
427
+
428
+ # List available packs
429
+ npx duelist run --pack list
430
+
431
+ # JSON for piping
432
+ npx duelist run --reporter json > results.json
433
+
434
+ # HTML report
435
+ npx duelist run --reporter html --output report.html
328
436
 
329
- # Suppress per-result progress lines
437
+ # Quiet mode
330
438
  npx duelist run --quiet
331
439
  ```
332
440
 
333
441
  | Option | Description |
334
442
  |--------|-------------|
335
443
  | `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
336
- | `--reporter <type>` | Output format: `console` (default) or `json` |
444
+ | `--pack <names>` | Run built-in task pack(s) instead of config tasks. Comma-separated for multiple packs. Use `list` to show available packs. |
445
+ | `--reporter <type>` | Output format: `console` (default), `json`, or `html` |
446
+ | `--output <path>` | Output file path for HTML reporter (default: `duelist-report.html`) |
337
447
  | `-q, --quiet` | Suppress per-result progress |
338
448
 
339
449
  ### `duelist ci`
@@ -347,6 +457,9 @@ npx duelist ci --update-baseline
347
457
  # Subsequent runs — compare against baseline
348
458
  npx duelist ci --threshold correctness=0.1 --budget 1.00
349
459
 
460
+ # Run CI with a task pack
461
+ npx duelist ci --pack structured-output --threshold correctness=0.1
462
+
350
463
  # Post comparison table as a PR comment (GitHub Actions)
351
464
  npx duelist ci --threshold correctness=0.1 --comment
352
465
  ```
@@ -354,6 +467,7 @@ npx duelist ci --threshold correctness=0.1 --comment
354
467
  | Option | Description |
355
468
  |--------|-------------|
356
469
  | `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
470
+ | `--pack <names>` | Run built-in task pack(s) instead of config tasks |
357
471
  | `--baseline <path>` | Baseline JSON file (default: `.duelist/baseline.json`) |
358
472
  | `--budget <dollars>` | Max total cost in USD — fails if exceeded |
359
473
  | `--threshold <scorer=delta>` | Regression threshold (repeatable, e.g. `--threshold correctness=0.1 --threshold cost=0.002`) |
@@ -368,24 +482,61 @@ npx duelist ci --threshold correctness=0.1 --comment
368
482
  - Without `--threshold` flags, regression detection is skipped entirely — only `--budget` is enforced.
369
483
  - Results with high variance (CV > 0.3) are flagged as **flaky** with a warning.
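The flakiness rule can be read as a coefficient-of-variation check over repeated runs. The sketch below assumes the sample standard deviation divided by the mean; the README does not say which variance estimator the CLI uses, and `coefficientOfVariation` and `isFlaky` are hypothetical names.

```ts
// CV = standard deviation / mean. Using the sample (n - 1) variance is an assumption.
function coefficientOfVariation(scores: number[]): number {
  if (scores.length < 2) return 0 // a single run has no measurable spread
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length
  if (mean === 0) return 0 // avoid dividing by zero
  const variance = scores.reduce((sum, s) => sum + (s - mean) ** 2, 0) / (scores.length - 1)
  return Math.sqrt(variance) / mean
}

// Threshold taken from the rule above: CV > 0.3 is flagged as flaky.
function isFlaky(scores: number[], threshold = 0.3): boolean {
  return coefficientOfVariation(scores) > threshold
}

console.log(isFlaky([0.9, 0.9, 0.9])) // prints false
console.log(isFlaky([0.5, 1.0, 1.5])) // prints true
```

This also shows why a higher `runs` value helps: with a single run there is no spread to measure, so nothing can be flagged.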
370
484
 
371
- The CLI loads TypeScript configs directly using a lightweight runtime loader so users don't need to precompile their config.
485
+ The CLI loads TypeScript configs directly using a lightweight runtime loader so you don't need to precompile your config.
486
+
487
+ ---
488
+
489
+ ## Tool-calling agent example
490
+
491
+ agent-duelist supports tool-calling tasks — define tools with Zod-typed parameters and handlers, and the provider will execute them during the benchmark:
492
+
493
+ ```ts
494
+ import { defineArena, openai } from 'agent-duelist'
495
+ import { z } from 'zod'
496
+
497
+ const weatherTool = {
498
+ name: 'getCurrentWeather',
499
+ description: 'Get the current weather in a given city',
500
+ parameters: z.object({ city: z.string() }),
501
+ handler: async ({ city }: { city: string }) => ({
502
+ city,
503
+ tempC: 20,
504
+ }),
505
+ }
506
+
507
+ export default defineArena({
508
+ providers: [openai('gpt-5-mini')],
509
+ tasks: [
510
+ {
511
+ name: 'weather-tool-call',
512
+ prompt: 'What is the current temperature in Amsterdam? Use the tool.',
513
+ expected: { city: 'Amsterdam' },
514
+ tools: [weatherTool],
515
+ },
516
+ ],
517
+ scorers: ['latency', 'cost', 'tool-usage'],
518
+ runs: 1,
519
+ })
520
+ ```
521
+
522
+ The model calls `getCurrentWeather`, the handler returns a stub result, and the `tool-usage` scorer reports whether the expected tool was invoked. Tool calls and their results are included in the JSON output for inspection.
372
523
 
373
524
  ---
374
525
 
375
526
  ## Example: multi-provider benchmark
376
527
 
377
- Here's a richer example comparing multiple providers across tasks:
528
+ A richer example comparing multiple providers across tasks:
378
529
 
379
530
  ```ts
380
531
  // arena.config.ts
381
- import { defineArena, azureOpenai } from 'agent-duelist'
532
+ import { defineArena, azureOpenai, openai, gemini } from 'agent-duelist'
382
533
  import { z } from 'zod'
383
534
 
384
535
  export default defineArena({
385
536
  providers: [
386
- azureOpenai('gpt-5-mini'),
537
+ openai('gpt-5-mini'),
387
538
  azureOpenai('gpt-5-nano'),
388
- azureOpenai('gpt-5.2-chat'),
539
+ gemini('gemini-3-flash-preview'),
389
540
  ],
390
541
  tasks: [
391
542
  {
@@ -418,44 +569,13 @@ export default defineArena({
418
569
  npx duelist run
419
570
  ```
420
571
 
421
- Output includes box-drawing tables with medals, color-ranked metrics, sparkline bars, and a winner row per task — see the [screenshot above](#what-you-get) for a real example.
422
-
423
- ---
424
-
425
- ## Tool-calling agent example
426
-
427
- agent-duelist supports tool-calling tasks — define tools with handlers, and the provider will execute them during the benchmark:
428
-
429
- ```ts
430
- import { defineArena, openai } from 'agent-duelist'
431
- import { z } from 'zod'
432
-
433
- const weatherTool = {
434
- name: 'getCurrentWeather',
435
- description: 'Get the current weather in a given city',
436
- parameters: z.object({ city: z.string() }),
437
- handler: async ({ city }: { city: string }) => ({
438
- city,
439
- tempC: 20,
440
- }),
441
- }
572
+ **How scoring works:**
442
573
 
443
- export default defineArena({
444
- providers: [openai('gpt-5-mini')],
445
- tasks: [
446
- {
447
- name: 'weather-tool-call',
448
- prompt: 'What is the current temperature in Amsterdam? Use the tool.',
449
- expected: { city: 'Amsterdam' },
450
- tools: [weatherTool],
451
- },
452
- ],
453
- scorers: ['latency', 'cost', 'tool-usage'],
454
- runs: 1,
455
- })
456
- ```
457
-
458
- The model calls `getCurrentWeather`, the handler returns a stub result, and the `tool-usage` scorer reports whether the expected tool was invoked. Tool calls and their results are included in the JSON output for inspection.
574
+ - Providers are compared **head-to-head within each task** — all providers receive the same prompt at the same time.
575
+ - Medals are awarded only when a provider is the **sole leader** in a metric column. Ties don't award medals, keeping rankings meaningful.
576
+ - **Quality-first ranking**: medals are decided by quality scorer wins (correctness, schema-correctness, etc.). Efficiency metrics (latency, cost) only break ties among quality-equal providers.
577
+ - **Quality gate**: providers that score 0% on all quality scorers are ineligible for medals entirely — being fast or cheap doesn't compensate for getting nothing right.
578
+ - The overall winner is the provider with the highest average correctness score.
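One way to picture these rules is the sketch below. It covers only the sole-leader rule and the quality gate for a single task, uses hypothetical names (`ProviderScores`, `passesQualityGate`, `soleLeader`), and is not the library's ranking code.

```ts
// Scores in [0, 1] per quality scorer, e.g. { correctness: 1, 'schema-correctness': 0.5 }.
type ProviderScores = { provider: string; quality: Record<string, number> }

// Quality gate: a provider that scores 0 on every quality scorer cannot medal.
function passesQualityGate(entry: ProviderScores): boolean {
  return Object.keys(entry.quality).some((scorer) => entry.quality[scorer] > 0)
}

// Sole-leader rule: return the provider with the strictly highest score for one scorer,
// or undefined when the top score is shared (ties award no medal).
function soleLeader(entries: ProviderScores[], scorer: string): string | undefined {
  const eligible = entries.filter(passesQualityGate)
  if (eligible.length === 0) return undefined
  const best = Math.max(...eligible.map((e) => e.quality[scorer] ?? 0))
  const leaders = eligible.filter((e) => (e.quality[scorer] ?? 0) === best).map((e) => e.provider)
  return leaders.length === 1 ? leaders[0] : undefined
}

const entries: ProviderScores[] = [
  { provider: 'a', quality: { correctness: 1 } },
  { provider: 'b', quality: { correctness: 0.5 } },
  { provider: 'c', quality: { correctness: 0 } },
]
console.log(soleLeader(entries, 'correctness')) // prints "a"
```

Tie-breaking on latency or cost among quality-equal providers is left out here to keep the sketch short.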
459
579
 
460
580
  ---
461
581
 
@@ -503,8 +623,8 @@ When `--comment` is enabled, the CI posts (or updates) a markdown table on the P
503
623
 
504
624
  | Provider | Task | Scorer | Baseline | Current | Delta | Status |
505
625
  |----------|------|--------|----------|---------|-------|--------|
506
- | openai/gpt-5-mini | extract | correctness | 0.900 | 0.850 | -0.050 | unchanged |
507
- | openai/gpt-5-mini | extract | latency | 0.920 ± 0.030 | 0.890 ± 0.025 | -0.030 | unchanged |
626
+ | openai/gpt-5-mini | extract | correctness | 0.900 | 0.850 | -0.050 | unchanged |
627
+ | openai/gpt-5-mini | extract | latency | 0.920 ± 0.030 | 0.890 ± 0.025 | -0.030 | unchanged |
508
628
 
509
629
  With cost summary, flakiness warnings, and pass/fail verdict.
510
630
 
@@ -512,33 +632,31 @@ With cost summary, flakiness warnings, and pass/fail verdict.
512
632
 
513
633
  ## Roadmap
514
634
 
515
- Shipped so far:
635
+ **Shipped:**
516
636
 
517
- - OpenAI, Azure OpenAI, Anthropic, Google Gemini, and OpenAI-compatible providers
637
+ - 5 provider types: OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible gateway
518
638
  - 7 built-in scorers including LLM-as-judge, tool-usage, schema validation, and fuzzy similarity
519
639
  - Tool-calling support with local handlers for agent task benchmarking
520
- - Console reporter with box-drawing tables, medal rankings, sparkline bars (toggleable), and per-task winner rows
521
- - Configurable per-request timeout to prevent hanging on unresponsive APIs
522
- - JSON reporter for CI/pipeline integration
523
- - Markdown reporter for PR comments
524
- - `duelist ci` command with regression detection, cost budgets, and flakiness warnings
640
+ - **Task packs**: built-in benchmark suites (`structured-output`) run with `--pack`, no config writing needed
641
+ - Quality-first medal ranking: output quality decides medals, efficiency only breaks ties
642
+ - Fair head-to-head benchmarking with parallel provider execution
643
+ - 4 reporters: console (tables + medals + sparklines), JSON, HTML (sortable, self-contained), and Markdown (PR comments)
644
+ - `duelist ci` with regression detection (confidence intervals), cost budgets, and flakiness warnings
525
645
  - GitHub Action for CI/CD integration
526
- - Pricing catalog from OpenRouter with refresh script
646
+ - Pricing catalog with cross-provider fallback and `registerPricing()` for custom models
647
+ - `openaiCompatible` with `stripThinking` for reasoning models and `free` flag for local models
648
+ - Configurable per-request timeout
527
649
 
528
- Planned directions (subject to community feedback):
650
+ **Planned** (subject to community feedback):
529
651
 
530
- - **More providers**
531
- - OpenRouter-native and more OpenAI-compatible gateways.
532
- - **Better reporting**
533
- - HTML and CSV export options.
534
- - **Agent workflows**
535
- - Multi-step tool chains, multi-hop reasoning, and agent traces.
536
- - **Plugin system**
537
- - First-class support for user-defined providers and scorers.
538
- - **Embedding-based scoring**
539
- - Semantic similarity via embedding distance.
652
+ - **More task packs** — reasoning, summarization, tool-calling, and multi-turn conversation packs
653
+ - **Agent workflows** — multi-step tool chains, multi-hop reasoning, and agent traces
654
+ - **More export formats** — CSV
655
+ - **Plugin system**: first-class support for user-defined providers and scorers
656
+ - **Embedding-based scoring** — semantic similarity via embedding distance
657
+ - **More providers**: OpenRouter-native and additional OpenAI-compatible gateways
540
658
 
541
- If you have a specific use case (framework comparisons, multi-agent competitions, tool-calling benchmarks), please open an issue; those will shape what gets built first.
659
+ Have a use case in mind? [Open an issue](https://github.com/DataGobes/agent-duelist/issues) — community feedback shapes what gets built first.
542
660
 
543
661
  ---
544
662
 
@@ -546,18 +664,18 @@ If you have a specific use case (framework comparisons, multi-agent competitions
546
664
 
547
665
  Contributions, issues, and feature requests are welcome.
548
666
 
549
- - **Bug reports / ideas**: open a GitHub issue.
667
+ - **Bug reports / ideas**: [open a GitHub issue](https://github.com/DataGobes/agent-duelist/issues).
550
668
  - **Code changes**:
551
- - Fork the repo.
552
- - Create a branch.
553
- - Run tests: `npm test`.
554
- - Run build: `npm run build`.
555
- - Open a PR with a clear description and, if possible, a small repro.
669
+ 1. Fork the repo.
670
+ 2. Create a branch.
671
+ 3. Run tests: `npm test`.
672
+ 4. Run build: `npm run build`.
673
+ 5. Open a PR with a clear description.
556
674
 
557
- Please try to keep PRs narrowly focused (single provider, one new scorer, etc.) so they're easy to review.
675
+ Please keep PRs narrowly focused (single provider, one new scorer, etc.) so they're easy to review.
558
676
 
559
677
  ---
560
678
 
561
679
  ## License
562
680
 
563
- MIT.
681
+ MIT