agent-duelist 0.2.1 → 0.3.0

package/README.md CHANGED
@@ -1,32 +1,46 @@
1
1
  # agent-duelist
2
2
 
3
+ [![npm version](https://img.shields.io/npm/v/agent-duelist?color=f59e0b)](https://www.npmjs.com/package/agent-duelist)
3
4
  [![CI](https://github.com/DataGobes/agent-duelist/actions/workflows/ci.yml/badge.svg)](https://github.com/DataGobes/agent-duelist/actions/workflows/ci.yml)
4
5
  [![Docs](https://img.shields.io/badge/docs-landing%20page-f59e0b)](https://datagobes.github.io/agent-duelist/)
6
+ [![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](./LICENSE)
5
7
 
6
- > Pit LLM providers against each other on agent tasks — Duel your models.
8
+ > Pit LLM providers against each other on agent tasks — **Duel your models.**
7
9
  >
8
10
  > **[View the landing page →](https://datagobes.github.io/agent-duelist/)**
9
11
 
10
- `agent-duelist` is a TypeScript-first framework to pit multiple LLM providers against each other on the same tasks and get structured, reproducible results: correctness, latency, tokens, and cost.
12
+ `agent-duelist` is a TypeScript-first benchmarking framework that runs the same tasks against multiple LLM providers and gives you structured, reproducible results: correctness, latency, tokens, cost, and more.
13
+
14
+ ```bash
15
+ npx duelist init # scaffold a config
16
+ npx duelist run # see who wins
17
+ ```
11
18
 
12
19
  ## What you get
13
- > ![Agent Duelist console output](docs/assets/screenshot.png)
14
- >
15
- > ![Agent Duelist HTML report](docs/assets/screenshot-html.png)
16
20
 
17
- - Compare OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible gateway.
18
- - Define tasks once, run them against many providers.
19
- - Get CLI tables and JSON results you can feed into dashboards, CI, or docs.
21
+ **Console output** — box-drawing tables with medals, color-ranked metrics, sparklines, and per-task winners:
22
+
23
+ ![Agent Duelist console output](docs/assets/screenshot.png)
24
+
25
+ **HTML report** — a self-contained, shareable single-file report with sortable tables, progress bars, tab navigation, and summary cards:
26
+
27
+ ![Agent Duelist HTML report](docs/assets/screenshot-html.png)
20
28
 
21
29
  ---
22
30
 
23
31
  ## Why agent-duelist?
24
32
 
25
- - **Provider-agnostic**: One config, many providers. Swap models and gateways without rewriting your tasks.
26
- - **Agent-focused**: Designed for agent workflows and tool use, not just single-turn prompts.
27
- - **Realistic metrics**: Latency, token counts, and cost estimates based on a pricing catalog.
28
- - **TypeScript-native DX**: Strongly typed APIs, Zod schemas for structured outputs, and a simple `defineArena()` entrypoint.
29
- - **CLI-first**: `npx duelist init` → `npx duelist run` gets you from zero to useful table in minutes.
33
+ | | |
34
+ |---|---|
35
+ | **Provider-agnostic** | One config, many providers. Swap models and gateways without rewriting tasks. |
36
+ | **Agent-focused** | Built for agent workflows and tool use, not just single-turn prompts. |
37
+ | **Task packs** | Built-in benchmark suites run with `--pack structured-output`, zero config needed. |
38
+ | **Quality-first ranking** | Medals decided by output quality. Speed and cost only break ties — a fast model that gets nothing right won't medal. |
39
+ | **7 built-in scorers** | Correctness, latency, cost, schema validation, fuzzy similarity, LLM-as-judge, and tool usage. |
40
+ | **Fair benchmarking** | Tasks run sequentially while providers race in parallel — fair latency comparison with no queue-induced penalties. |
41
+ | **TypeScript-native** | Strongly typed APIs, Zod schemas for structured outputs, and a simple `defineArena()` entrypoint. |
42
+ | **CI-ready** | Regression detection with confidence intervals, cost budgets, PR comments, and a prebuilt GitHub Action. |
43
+ | **CLI-first** | `npx duelist init` → `npx duelist run` gets you from zero to results in minutes. |
30
44
 
31
45
  ---
32
46
 
@@ -40,7 +54,7 @@ pnpm add agent-duelist
40
54
  yarn add agent-duelist
41
55
  ```
42
56
 
43
- You'll also need API keys for the providers you want to benchmark, for example:
57
+ Set API keys for the providers you want to benchmark:
44
58
 
45
59
  ```bash
46
60
  export OPENAI_API_KEY=sk-...
@@ -51,7 +65,7 @@ export GOOGLE_API_KEY=...
51
65
 
52
66
  ---
53
67
 
54
- ## One-minute quickstart
68
+ ## Quickstart
55
69
 
56
70
  Initialize a config:
57
71
 
@@ -59,7 +73,7 @@ Initialize a config:
59
73
  npx duelist init
60
74
  ```
61
75
 
62
- This creates `arena.config.ts` in your project. A minimal example:
76
+ This creates `arena.config.ts` in your project:
63
77
 
64
78
  ```ts
65
79
  // arena.config.ts
@@ -99,39 +113,78 @@ Run the benchmark:
99
113
  npx duelist run
100
114
  ```
101
115
 
102
- You'll see a matrix like:
116
+ You'll see a results matrix:
103
117
 
104
- - Rows: tasks (`simple-qa`, `structured-extraction`)
105
- - Columns: providers (`openai/gpt-5-mini`, `azure/gpt-5-mini`)
106
- - Cells: correctness score, latency, tokens, and estimated cost.
118
+ - **Rows**: tasks (`simple-qa`, `structured-extraction`)
119
+ - **Columns**: providers (`openai/gpt-5-mini`, `azure/gpt-5-mini`)
120
+ - **Cells**: correctness score, latency, tokens, and estimated cost
107
121
 
108
- For CI or further processing:
122
+ Export the results in different formats:
109
123
 
110
124
  ```bash
125
+ # JSON for CI pipelines and dashboards
111
126
  npx duelist run --reporter json > results.json
127
+
128
+ # Self-contained HTML report you can share or host
129
+ npx duelist run --reporter html --output report.html
112
130
  ```
113
131
 
114
- Generate a shareable HTML report:
132
+ ---
133
+
134
+ ## Task packs
135
+
136
+ Task packs are **built-in benchmark suites** — curated sets of tasks with recommended scorers. Use them to benchmark providers without writing any tasks yourself.
115
137
 
116
138
  ```bash
117
- npx duelist run --reporter html --output report.html
139
+ # List available packs
140
+ npx duelist run --pack list
141
+
142
+ # Run a pack with your providers
143
+ npx duelist run --pack structured-output --config arena.config.ts
118
144
  ```
119
145
 
120
- ---
146
+ Your config only needs to define providers — the pack supplies tasks and scorers:
121
147
 
122
- ## Core concepts
148
+ ```ts
149
+ // arena.config.ts
150
+ import { defineArena, openai, anthropic } from 'agent-duelist'
123
151
 
124
- ### Providers
152
+ export default defineArena({
153
+   providers: [
154
+     openai('gpt-5-mini'),
155
+     anthropic('claude-sonnet-4.6'),
156
+   ],
157
+   tasks: [],   // ignored when --pack is used
158
+   scorers: [], // pack supplies its own scorers
159
+ })
160
+ ```
125
161
 
126
- Providers are **factory functions** that return plain objects implementing a shared `ArenaProvider` interface.
162
+ ### Available packs
127
163
 
128
- This lets you:
164
+ | Pack | Tasks | Description |
165
+ |------|-------|-------------|
166
+ | `structured-output` | 6 | Zod schema stress test — flat objects, nesting, arrays, enums, empty arrays, and adversarial input |
129
167
 
130
- - Swap providers without changing tasks.
131
- - Wrap or extend providers in your own code.
132
- - Mock providers in tests.
168
+ Packs work with both `run` and `ci` commands:
133
169
 
134
- Examples:
170
+ ```bash
171
+ # CI with a task pack
172
+ npx duelist ci --pack structured-output --threshold correctness=0.1 --budget 1.00
173
+ ```
174
+
175
+ You can also combine multiple packs:
176
+
177
+ ```bash
178
+ npx duelist run --pack structured-output,another-pack
179
+ ```
180
+
181
+ ---
182
+
183
+ ## Core concepts
184
+
185
+ ### Providers
186
+
187
+ Providers are **factory functions** that return plain objects implementing a shared `ArenaProvider` interface. This lets you swap providers without changing tasks, wrap or extend providers in your own code, and mock providers in tests.
135
188
 
136
189
  ```ts
137
190
  import {
@@ -140,25 +193,40 @@ import {
140
193
  anthropic,
141
194
  gemini,
142
195
  openaiCompatible,
143
- type ArenaProvider,
144
196
  } from 'agent-duelist'
145
197
 
198
+ // OpenAI
146
199
  const oai = openai('gpt-5-mini')
147
200
 
201
+ // Azure OpenAI
148
202
  const azure = azureOpenai('gpt-5-mini', {
149
203
  deployment: 'my-deployment',
150
204
  })
151
205
 
206
+ // Anthropic
152
207
  const claude = anthropic('claude-sonnet-4.6')
153
208
 
154
- const gem = gemini('gemini-3-flash-preview') // uses GOOGLE_API_KEY
209
+ // Google Gemini
210
+ const gem = gemini('gemini-3-flash-preview')
155
211
 
156
- const local: ArenaProvider = openaiCompatible({
212
+ // Any OpenAI-compatible gateway (Ollama, LiteLLM, vLLM, etc.)
213
+ const local = openaiCompatible({
157
214
    id: 'local/llama',
158
-   name: 'Local Gateway',
215
+   name: 'Local Ollama',
159
216
    baseURL: 'http://localhost:11434/v1',
160
217
    model: 'llama3.3',
161
218
    apiKeyEnv: 'LOCAL_LLM_API_KEY',
219
+   free: true, // registers zero-cost pricing
220
+ })
221
+
222
+ // Reasoning models that emit <think> blocks (DeepSeek-R1, MiniMax M2.5, etc.)
223
+ const deepseek = openaiCompatible({
224
+   id: 'deepseek/r1',
225
+   name: 'DeepSeek R1',
226
+   baseURL: 'https://api.deepseek.com/v1',
227
+   model: 'deepseek-reasoner',
228
+   apiKeyEnv: 'DEEPSEEK_API_KEY',
229
+   stripThinking: true, // strips <think>...</think> from output
162
230
  })
163
231
  ```
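The `stripThinking` option is described above only as stripping `<think>...</think>` from output. As a rough illustration of that kind of post-processing, here is a minimal sketch that assumes a single regex pass is enough. The helper name `stripThinkBlocks` is hypothetical and is not part of the agent-duelist API; the library's actual implementation may differ.

```ts
// Hypothetical helper, not the library's implementation.
// Removes every <think>...</think> block (non-greedy, spanning newlines)
// and trims the whitespace left behind.
function stripThinkBlocks(output: string): string {
  return output.replace(/<think>[\s\S]*?<\/think>/g, '').trim()
}

const raw = '<think>The user wants JSON.\nKeep it short.</think>\n{"city":"Amsterdam"}'
const cleaned = stripThinkBlocks(raw)
console.log(cleaned)
```

Stripping before scoring matters because exact-match and schema scorers would otherwise see the reasoning text as part of the answer.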
164
232
 
@@ -185,6 +253,7 @@ interface ArenaTask {
185
253
  prompt: string
186
254
  expected?: unknown // used by correctness scorers
187
255
  schema?: ZodSchema<any> // used by schema-based scorers
256
+ tools?: ToolDefinition[] // used by tool-calling scorers
188
257
  }
189
258
  ```
190
259
 
@@ -213,7 +282,7 @@ const tasks: ArenaTask[] = [
213
282
 
214
283
  ### Scorers
215
284
 
216
- Scorers take raw model outputs and turn them into **numeric scores** (0–1) with optional details. Built-in scorers:
285
+ Scorers turn raw model outputs into **numeric scores** (0–1) with optional details. Seven built-in scorers ship out of the box:
217
286
 
218
287
  | Scorer | What it measures |
219
288
  |--------|-----------------|
@@ -222,7 +291,8 @@ Scorers take raw model outputs and turn them into **numeric scores** (0–1) wit
222
291
  | `correctness` | Exact match against `expected` (deep-equal, key-order independent for objects) |
223
292
  | `schema-correctness` | Validates output against the task's Zod `schema` via `safeParse()` |
224
293
  | `fuzzy-similarity` | Jaccard token-overlap similarity between output and `expected` |
225
- | `llm-judge-correctness` | Async LLM-as-judge calls a judge model to score correctness 0–1 |
294
+ | `tool-usage` | Whether the model invoked the expected tool(s) during a tool-calling task |
295
+ | `llm-judge-correctness` | LLM-as-judge — calls a judge model to score accuracy, completeness, and conciseness |
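The `fuzzy-similarity` row describes a Jaccard token-overlap score. The sketch below shows one way to compute such a score; the tokenization (lowercasing and splitting on non-alphanumeric characters) and the function names are assumptions for illustration, not the scorer's actual code.

```ts
// Assumed tokenization: lowercase, split on runs of non-alphanumeric characters.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean))
}

// Jaccard index: |A ∩ B| / |A ∪ B|, defined here as 1 when both sets are empty.
function jaccard(a: string, b: string): number {
  const setA = tokenize(a)
  const setB = tokenize(b)
  if (setA.size === 0 && setB.size === 0) return 1
  let intersection = 0
  setA.forEach((token) => {
    if (setB.has(token)) intersection++
  })
  const union = setA.size + setB.size - intersection
  return intersection / union
}

// Same words in a different order: the token sets are identical.
const reordered = jaccard('Paris is the capital of France', 'The capital of France is Paris')
// Two of four distinct tokens are shared.
const partial = jaccard('red green blue', 'red green yellow')
console.log(reordered, partial)
```

Because the score works on token sets, it ignores word order and repetition, which is why it suits free-text answers better than exact match.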
226
296
 
227
297
  Configure them in your arena:
228
298
 
@@ -242,8 +312,6 @@ defineArena({
242
312
 
243
313
  The judge model defaults to `gpt-5-mini`. It can also be set via the `DUELIST_JUDGE_MODEL` env var. The judge backend is auto-detected from the model name — `gemini-*` models use Google's API, otherwise it falls back to OpenAI or Azure OpenAI.
244
314
 
245
- You can also add custom scorers for domain-specific metrics (e.g. tool-call correctness, safety, style).
246
-
247
315
  ---
248
316
 
249
317
  ### Arena options
@@ -252,7 +320,7 @@ You can also add custom scorers for domain-specific metrics (e.g. tool-call corr
252
320
 
253
321
  | Option | Type | Default | Description |
254
322
  |--------|------|---------|-------------|
255
- | `runs` | `number` | `1` | Number of runs per provider × task combination. Higher values improve statistical confidence for CI regression detection. |
323
+ | `runs` | `number` | `1` | Number of runs per provider x task combination. Higher values improve statistical confidence for CI regression detection. |
256
324
  | `judgeModel` | `string` | `'gpt-5-mini'` | Model used by the `llm-judge-correctness` scorer. Also settable via `DUELIST_JUDGE_MODEL` env var. Gemini models auto-route to Google's API. |
257
325
  | `timeout` | `number` | `60000` | Per-request timeout in milliseconds. Requests exceeding this are marked as failures. Prevents hanging on unresponsive APIs. |
258
326
  | `sparklines` | `boolean` | `true` | Show sparkline bars next to percentage scores in the console reporter. Disable with `false` if your terminal doesn't render Unicode block characters well. |
@@ -273,43 +341,62 @@ export default defineArena({
273
341
 
274
342
  ---
275
343
 
344
+ ## Reporters
345
+
346
+ agent-duelist includes four output formats, each suited to a different workflow:
347
+
348
+ | Reporter | Flag | Use case |
349
+ |----------|------|----------|
350
+ | **Console** | `--reporter console` (default) | Interactive development — box-drawing tables with medals, sparklines, color-ranked metrics, and per-task winners |
351
+ | **JSON** | `--reporter json` | CI pipelines, dashboards, and downstream tooling |
352
+ | **HTML** | `--reporter html` | Shareable single-file reports with sortable tables, animated backgrounds, tab navigation, CSS progress bars, medal rankings, and summary cards |
353
+ | **Markdown** | `--comment` (CI mode) | Auto-posted PR comment with comparison table, cost summary, and pass/fail verdict |
354
+
355
+ Generate an HTML report:
356
+
357
+ ```bash
358
+ npx duelist run --reporter html --output report.html
359
+ ```
360
+
361
+ The HTML report is a single self-contained file — no external dependencies, no build step. Open it in any browser or host it as a static page.
362
+
363
+ ---
364
+
276
365
  ## Cost & pricing
277
366
 
278
367
  Cost estimation is intentionally transparent and conservative:
279
368
 
280
- 1. **Token counts**
281
- Providers return token usage (prompt and completion tokens) in each `TaskResult`. These are treated as the source of truth.
282
-
283
- 2. **Pricing catalog**
284
- `agent-duelist` ships with a **locally bundled catalog** of per-token prices for many models, derived from OpenRouter's public pricing pages.
369
+ 1. **Token counts** — Providers return token usage (prompt and completion tokens) in each `TaskResult`. These are the source of truth.
285
370
 
286
- - The catalog maps `(provider, model)` → `{ inputPerM, outputPerM }` in USD per 1M tokens.
287
- - Azure OpenAI models are resolved back to their base OpenAI models where possible (e.g. `azure/gpt-5-mini` → `openai/gpt-5-mini`) so you don't need to configure Azure pricing manually.
371
+ 2. **Pricing catalog** — `agent-duelist` ships with a **locally bundled catalog** of per-token prices for many models, derived from OpenRouter's public pricing pages.
372
+ - The catalog maps `(provider, model)` to `{ inputPerM, outputPerM }` in USD per 1M tokens.
373
+ - Azure OpenAI models resolve back to their base OpenAI models (e.g. `azure/gpt-5-mini` → `openai/gpt-5-mini`).
374
+ - Cross-provider fallback: models hosted on Groq, Together, Fireworks, etc. resolve to the original provider's pricing.
288
375
 
289
- 3. **Estimated USD**
290
- The `cost` scorer computes:
376
+ 3. **Estimated USD** — The `cost` scorer computes:
291
377
 
292
378
  ```text
293
379
  estimatedUsd = (promptTokens * inputPerM + completionTokens * outputPerM) / 1_000_000
294
380
  ```
295
381
 
296
- In the console reporter, you'll see:
382
+ 4. **Unknown models** — If a model is not in the catalog, tokens are still reported and cost is marked as unknown (no fake numbers).
297
383
 
298
- - token counts: `prompt: X, completion: Y`
299
- - cost: `~$0.0XXm` (millicents; fractions of a cent)
300
- - a short disclaimer that this is an **estimate** based on a pricing snapshot.
384
+ 5. **Custom pricing** — Register pricing for models not in the catalog:
301
385
 
302
- 4. **Unknown models**
303
- If a model is not in the catalog:
386
+ ```ts
387
+ import { registerPricing } from 'agent-duelist'
304
388
 
305
- - Tokens are still reported.
306
- - Cost is marked as unknown (no fake numbers).
389
+ registerPricing('custom/my-model', {
390
+   inputPerToken: 0.000003,
391
+   outputPerToken: 0.000015,
392
+ })
393
+ ```
307
394
 
308
- You can update the catalog with a script that re-scrapes OpenRouter's public pricing page when prices change.
395
+ You can update the bundled catalog with `npm run update:pricing`, which re-scrapes OpenRouter's public pricing page.
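To make the `estimatedUsd` formula from step 3 concrete, here is a small worked sketch. The `ModelPricing` type, the `estimateUsd` function, and the prices are illustrative assumptions; the prices are not taken from the bundled catalog.

```ts
interface ModelPricing {
  inputPerM: number  // USD per 1M prompt tokens
  outputPerM: number // USD per 1M completion tokens
}

// Direct transcription of the formula:
// (promptTokens * inputPerM + completionTokens * outputPerM) / 1_000_000
function estimateUsd(
  promptTokens: number,
  completionTokens: number,
  pricing: ModelPricing,
): number {
  return (promptTokens * pricing.inputPerM + completionTokens * pricing.outputPerM) / 1_000_000
}

// Illustrative prices only.
const examplePricing: ModelPricing = { inputPerM: 2, outputPerM: 8 }

// (1200 * 2 + 300 * 8) / 1_000_000 = 4800 / 1_000_000 = 0.0048
const estimate = estimateUsd(1_200, 300, examplePricing)
console.log(estimate.toFixed(4))
```

A single request costs a fraction of a cent at these rates, so totals only become meaningful once they are summed across tasks, providers, and runs.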
309
396
 
310
397
  ---
311
398
 
312
- ## CLI usage
399
+ ## CLI reference
313
400
 
314
401
  ### `duelist init`
315
402
 
@@ -317,30 +404,46 @@ Scaffold a new `arena.config.ts` in the current directory.
317
404
 
318
405
  ```bash
319
406
  npx duelist init
407
+ npx duelist init --force # overwrite an existing config
320
408
  ```
321
409
 
410
+ | Option | Description |
411
+ |--------|-------------|
412
+ | `--force` | Overwrite existing config file |
413
+
322
414
  ### `duelist run`
323
415
 
324
416
  Run benchmarks defined in your arena config.
325
417
 
326
418
  ```bash
327
- # Run with the default config (arena.config.ts)
419
+ # Default config, console output
328
420
  npx duelist run
329
421
 
330
- # Use a custom config
422
+ # Custom config
331
423
  npx duelist run --config path/to/arena.config.ts
332
424
 
333
- # Get JSON instead of a table
334
- npx duelist run --reporter json
425
+ # Run a built-in task pack
426
+ npx duelist run --pack structured-output
427
+
428
+ # List available packs
429
+ npx duelist run --pack list
430
+
431
+ # JSON for piping
432
+ npx duelist run --reporter json > results.json
335
433
 
336
- # Suppress per-result progress lines
434
+ # HTML report
435
+ npx duelist run --reporter html --output report.html
436
+
437
+ # Quiet mode
337
438
  npx duelist run --quiet
338
439
  ```
339
440
 
340
441
  | Option | Description |
341
442
  |--------|-------------|
342
443
  | `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
343
- | `--reporter <type>` | Output format: `console` (default) or `json` |
444
+ | `--pack <names>` | Run built-in task pack(s) instead of config tasks. Comma-separated for multiple packs. Use `list` to show available packs. |
445
+ | `--reporter <type>` | Output format: `console` (default), `json`, or `html` |
446
+ | `--output <path>` | Output file path for HTML reporter (default: `duelist-report.html`) |
344
447
  | `-q, --quiet` | Suppress per-result progress |
345
448
 
346
449
  ### `duelist ci`
@@ -354,6 +457,9 @@ npx duelist ci --update-baseline
354
457
  # Subsequent runs — compare against baseline
355
458
  npx duelist ci --threshold correctness=0.1 --budget 1.00
356
459
 
460
+ # Run CI with a task pack
461
+ npx duelist ci --pack structured-output --threshold correctness=0.1
462
+
357
463
  # Post comparison table as a PR comment (GitHub Actions)
358
464
  npx duelist ci --threshold correctness=0.1 --comment
359
465
  ```
@@ -361,6 +467,7 @@ npx duelist ci --threshold correctness=0.1 --comment
361
467
  | Option | Description |
362
468
  |--------|-------------|
363
469
  | `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
470
+ | `--pack <names>` | Run built-in task pack(s) instead of config tasks |
364
471
  | `--baseline <path>` | Baseline JSON file (default: `.duelist/baseline.json`) |
365
472
  | `--budget <dollars>` | Max total cost in USD — fails if exceeded |
366
473
  | `--threshold <scorer=delta>` | Regression threshold (repeatable, e.g. `--threshold correctness=0.1 --threshold cost=0.002`) |
@@ -375,24 +482,61 @@ npx duelist ci --threshold correctness=0.1 --comment
375
482
  - Without `--threshold` flags, regression detection is skipped entirely — only `--budget` is enforced.
376
483
  - Results with high variance (CV > 0.3) are flagged as **flaky** with a warning.
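The flakiness rule above is stated only as CV > 0.3. A minimal sketch of a coefficient-of-variation check follows, assuming the sample standard deviation (n - 1 denominator) over the per-run scores; the function names and the choice of estimator are assumptions, not the CLI's confirmed behavior.

```ts
// Coefficient of variation: standard deviation divided by the mean.
// Returns 0 when there are fewer than two samples or the mean is 0,
// since the ratio is undefined in those cases.
function coefficientOfVariation(samples: number[]): number {
  if (samples.length < 2) return 0
  const mean = samples.reduce((sum, x) => sum + x, 0) / samples.length
  if (mean === 0) return 0
  const variance =
    samples.reduce((sum, x) => sum + (x - mean) * (x - mean), 0) / (samples.length - 1)
  return Math.sqrt(variance) / Math.abs(mean)
}

const FLAKY_CV_THRESHOLD = 0.3 // threshold quoted in the README

function isFlaky(samples: number[]): boolean {
  return coefficientOfVariation(samples) > FLAKY_CV_THRESHOLD
}

const stableRuns = [0.9, 0.9, 0.9]   // identical scores, CV = 0
const unstableRuns = [1, 0, 1, 0]    // alternating pass/fail, CV above 1
console.log(isFlaky(stableRuns), isFlaky(unstableRuns))
```

With `runs: 1` there is nothing to measure variance against, so a check like this only becomes informative at higher `runs` values.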
377
484
 
378
- The CLI loads TypeScript configs directly using a lightweight runtime loader so users don't need to precompile their config.
485
+ The CLI loads TypeScript configs directly using a lightweight runtime loader so you don't need to precompile your config.
486
+
487
+ ---
488
+
489
+ ## Tool-calling agent example
490
+
491
+ agent-duelist supports tool-calling tasks — define tools with Zod-typed parameters and handlers, and the provider will execute them during the benchmark:
492
+
493
+ ```ts
494
+ import { defineArena, openai } from 'agent-duelist'
495
+ import { z } from 'zod'
496
+
497
+ const weatherTool = {
498
+   name: 'getCurrentWeather',
499
+   description: 'Get the current weather in a given city',
500
+   parameters: z.object({ city: z.string() }),
501
+   handler: async ({ city }: { city: string }) => ({
502
+     city,
503
+     tempC: 20,
504
+   }),
505
+ }
506
+
507
+ export default defineArena({
508
+   providers: [openai('gpt-5-mini')],
509
+   tasks: [
510
+     {
511
+       name: 'weather-tool-call',
512
+       prompt: 'What is the current temperature in Amsterdam? Use the tool.',
513
+       expected: { city: 'Amsterdam' },
514
+       tools: [weatherTool],
515
+     },
516
+   ],
517
+   scorers: ['latency', 'cost', 'tool-usage'],
518
+   runs: 1,
519
+ })
520
+ ```
521
+
522
+ The model calls `getCurrentWeather`, the handler returns a stub result, and the `tool-usage` scorer reports whether the expected tool was invoked. Tool calls and their results are included in the JSON output for inspection.
379
523
 
380
524
  ---
381
525
 
382
526
  ## Example: multi-provider benchmark
383
527
 
384
- Here's a richer example comparing multiple providers across tasks:
528
+ A richer example comparing multiple providers across tasks:
385
529
 
386
530
  ```ts
387
531
  // arena.config.ts
388
- import { defineArena, azureOpenai } from 'agent-duelist'
532
+ import { defineArena, azureOpenai, openai, gemini } from 'agent-duelist'
389
533
  import { z } from 'zod'
390
534
 
391
535
  export default defineArena({
392
536
  providers: [
393
- azureOpenai('gpt-5-mini'),
537
+ openai('gpt-5-mini'),
394
538
  azureOpenai('gpt-5-nano'),
395
- azureOpenai('gpt-5.2-chat'),
539
+ gemini('gemini-3-flash-preview'),
396
540
  ],
397
541
  tasks: [
398
542
  {
@@ -425,50 +569,13 @@ export default defineArena({
425
569
  npx duelist run
426
570
  ```
427
571
 
428
- Output includes box-drawing tables with medals, color-ranked metrics, sparkline bars, and a winner row per task — see the [screenshot above](#what-you-get) for a real example.
429
-
430
572
  **How scoring works:**
431
573
 
432
574
  - Providers are compared **head-to-head within each task** — all providers receive the same prompt at the same time.
433
575
  - Medals are awarded only when a provider is the **sole leader** in a metric column. Ties don't award medals, keeping rankings meaningful.
434
- - The overall winner is determined by category wins across correctness, latency, and cost.
435
-
436
- ---
437
-
438
- ## Tool-calling agent example
439
-
440
- agent-duelist supports tool-calling tasks — define tools with handlers, and the provider will execute them during the benchmark:
441
-
442
- ```ts
443
- import { defineArena, openai } from 'agent-duelist'
444
- import { z } from 'zod'
445
-
446
- const weatherTool = {
447
- name: 'getCurrentWeather',
448
- description: 'Get the current weather in a given city',
449
- parameters: z.object({ city: z.string() }),
450
- handler: async ({ city }: { city: string }) => ({
451
- city,
452
- tempC: 20,
453
- }),
454
- }
455
-
456
- export default defineArena({
457
- providers: [openai('gpt-5-mini')],
458
- tasks: [
459
- {
460
- name: 'weather-tool-call',
461
- prompt: 'What is the current temperature in Amsterdam? Use the tool.',
462
- expected: { city: 'Amsterdam' },
463
- tools: [weatherTool],
464
- },
465
- ],
466
- scorers: ['latency', 'cost', 'tool-usage'],
467
- runs: 1,
468
- })
469
- ```
470
-
471
- The model calls `getCurrentWeather`, the handler returns a stub result, and the `tool-usage` scorer reports whether the expected tool was invoked. Tool calls and their results are included in the JSON output for inspection.
576
+ - **Quality-first ranking**: medals are decided by quality scorer wins (correctness, schema-correctness, etc.). Efficiency metrics (latency, cost) only break ties among quality-equal providers.
577
+ - **Quality gate**: providers that score 0% on all quality scorers are ineligible for medals entirely — being fast or cheap doesn't compensate for getting nothing right.
578
+ - The overall winner is the provider with the highest average correctness score.
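The quality gate and tie-breaking rules above can be pictured with a simplified sketch. It assumes each provider has already been reduced to one aggregate quality score per task, and it breaks quality ties by latency first and cost second. That tie-break order, the `ProviderSummary` shape, and `rankProviders` are all illustrative assumptions; the README describes medals in terms of scorer wins and sole leaders, which this sketch does not reproduce.

```ts
interface ProviderSummary {
  id: string
  quality: number   // aggregate quality score in 0..1 (assumed aggregation)
  latencyMs: number
  costUsd: number
}

// Quality gate: providers with zero quality are dropped entirely.
// Ranking: higher quality first; latency, then cost, only break exact quality ties.
function rankProviders(results: ProviderSummary[]): ProviderSummary[] {
  return results
    .filter((r) => r.quality > 0)
    .sort(
      (a, b) =>
        b.quality - a.quality ||
        a.latencyMs - b.latencyMs ||
        a.costUsd - b.costUsd,
    )
}

const ranked = rankProviders([
  { id: 'fast-but-wrong', quality: 0, latencyMs: 200, costUsd: 0.001 },
  { id: 'accurate-slow', quality: 0.9, latencyMs: 1500, costUsd: 0.004 },
  { id: 'accurate-fast', quality: 0.9, latencyMs: 800, costUsd: 0.006 },
  { id: 'decent', quality: 0.7, latencyMs: 300, costUsd: 0.001 },
])
console.log(ranked.map((r) => r.id).join(' > '))
```

Note that `decent` is both faster and cheaper than the two leaders but still ranks below them, and `fast-but-wrong` is excluded despite having the lowest latency.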
472
579
 
473
580
  ---
474
581
 
@@ -516,8 +623,8 @@ When `--comment` is enabled, the CI posts (or updates) a markdown table on the P
516
623
 
517
624
  | Provider | Task | Scorer | Baseline | Current | Delta | Status |
518
625
  |----------|------|--------|----------|---------|-------|--------|
519
- | openai/gpt-5-mini | extract | correctness | 0.900 | 0.850 | -0.050 | unchanged |
520
- | openai/gpt-5-mini | extract | latency | 0.920 ± 0.030 | 0.890 ± 0.025 | -0.030 | unchanged |
626
+ | openai/gpt-5-mini | extract | correctness | 0.900 | 0.850 | -0.050 | unchanged |
627
+ | openai/gpt-5-mini | extract | latency | 0.920 ± 0.030 | 0.890 ± 0.025 | -0.030 | unchanged |
521
628
 
522
629
  With cost summary, flakiness warnings, and pass/fail verdict.
523
630
 
@@ -525,34 +632,31 @@ With cost summary, flakiness warnings, and pass/fail verdict.
525
632
 
526
633
  ## Roadmap
527
634
 
528
- Shipped so far:
635
+ **Shipped:**
529
636
 
530
- - OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible provider
637
+ - 5 provider types: OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible gateway
531
638
  - 7 built-in scorers including LLM-as-judge, tool-usage, schema validation, and fuzzy similarity
532
639
  - Tool-calling support with local handlers for agent task benchmarking
533
- - Fair head-to-head benchmarking: tasks run sequentially while providers race in parallel, ensuring fair latency comparison without queue-induced timeout penalties
534
- - Console reporter with box-drawing tables, sole-leader medal rankings, sparkline bars (toggleable), and per-task winner rows
535
- - Configurable per-request timeout to prevent hanging on unresponsive APIs
536
- - JSON reporter for CI/pipeline integration
537
- - Markdown reporter for PR comments
538
- - `duelist ci` command with regression detection, cost budgets, and flakiness warnings
640
+ - **Task packs**: built-in benchmark suites (`structured-output`) run with `--pack`, no config writing needed
641
+ - Quality-first medal ranking: output quality decides medals, efficiency only breaks ties
642
+ - Fair head-to-head benchmarking with parallel provider execution
643
+ - 4 reporters: console (tables + medals + sparklines), JSON, HTML (sortable, self-contained), and Markdown (PR comments)
644
+ - `duelist ci` with regression detection (confidence intervals), cost budgets, and flakiness warnings
539
645
  - GitHub Action for CI/CD integration
540
- - Pricing catalog from OpenRouter with refresh script
646
+ - Pricing catalog with cross-provider fallback and `registerPricing()` for custom models
647
+ - `openaiCompatible` with `stripThinking` for reasoning models and `free` flag for local models
648
+ - Configurable per-request timeout
541
649
 
542
- Planned directions (subject to community feedback):
650
+ **Planned** (subject to community feedback):
543
651
 
544
- - **More providers**
545
- - OpenRouter-native and more OpenAI-compatible gateways.
546
- - **Better reporting**
547
- - HTML and CSV export options.
548
- - **Agent workflows**
549
- - Multi-step tool chains, multi-hop reasoning, and agent traces.
550
- - **Plugin system**
551
- - First-class support for user-defined providers and scorers.
552
- - **Embedding-based scoring**
553
- - Semantic similarity via embedding distance.
652
+ - **More task packs** — reasoning, summarization, tool-calling, and multi-turn conversation packs
653
+ - **Agent workflows** — multi-step tool chains, multi-hop reasoning, and agent traces
654
+ - **More export formats** — CSV
655
+ - **Plugin system** — first-class support for user-defined providers and scorers
656
+ - **Embedding-based scoring** — semantic similarity via embedding distance
657
+ - **More providers** — OpenRouter-native and additional OpenAI-compatible gateways
554
658
 
555
- If you have a specific use case (framework comparisons, multi-agent competitions, tool-calling benchmarks), please open an issue; those will shape what gets built first.
659
+ Have a use case in mind? [Open an issue](https://github.com/DataGobes/agent-duelist/issues) — community feedback shapes what gets built first.
556
660
 
557
661
  ---
558
662
 
@@ -560,18 +664,18 @@ If you have a specific use case (framework comparisons, multi-agent competitions
560
664
 
561
665
  Contributions, issues, and feature requests are welcome.
562
666
 
563
- - **Bug reports / ideas**: open a GitHub issue.
667
+ - **Bug reports / ideas**: [open a GitHub issue](https://github.com/DataGobes/agent-duelist/issues).
564
668
  - **Code changes**:
565
- - Fork the repo.
566
- - Create a branch.
567
- - Run tests: `npm test`.
568
- - Run build: `npm run build`.
569
- - Open a PR with a clear description and, if possible, a small repro.
669
+ 1. Fork the repo.
670
+ 2. Create a branch.
671
+ 3. Run tests: `npm test`.
672
+ 4. Run build: `npm run build`.
673
+ 5. Open a PR with a clear description.
570
674
 
571
- Please try to keep PRs narrowly focused (single provider, one new scorer, etc.) so they're easy to review.
675
+ Please keep PRs narrowly focused (single provider, one new scorer, etc.) so they're easy to review.
572
676
 
573
677
  ---
574
678
 
575
679
  ## License
576
680
 
577
- MIT.
681
+ MIT