agent-duelist 0.2.1 → 0.3.1

package/README.md CHANGED
@@ -1,32 +1,46 @@
1
1
  # agent-duelist
2
2
 
3
+ [![npm version](https://img.shields.io/npm/v/agent-duelist?color=f59e0b)](https://www.npmjs.com/package/agent-duelist)
3
4
  [![CI](https://github.com/DataGobes/agent-duelist/actions/workflows/ci.yml/badge.svg)](https://github.com/DataGobes/agent-duelist/actions/workflows/ci.yml)
4
5
  [![Docs](https://img.shields.io/badge/docs-landing%20page-f59e0b)](https://datagobes.github.io/agent-duelist/)
6
+ [![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](./LICENSE)
5
7
 
6
- > Pit LLM providers against each other on agent tasks — Duel your models.
8
+ > Pit LLM providers against each other on agent tasks — **Duel your models.**
7
9
  >
8
10
  > **[View the landing page →](https://datagobes.github.io/agent-duelist/)**
9
11
 
10
- `agent-duelist` is a TypeScript-first framework to pit multiple LLM providers against each other on the same tasks and get structured, reproducible results: correctness, latency, tokens, and cost.
12
+ `agent-duelist` is a TypeScript-first benchmarking framework that runs the same tasks against multiple LLM providers and gives you structured, reproducible results: correctness, latency, tokens, cost, and more.
13
+
14
+ ```bash
15
+ npx duelist init # scaffold a config
16
+ npx duelist run # see who wins
17
+ ```
11
18
 
12
19
  ## What you get
13
- > ![Agent Duelist console output](docs/assets/screenshot.png)
14
- >
15
- > ![Agent Duelist HTML report](docs/assets/screenshot-html.png)
16
20
 
17
- - Compare OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible gateway.
18
- - Define tasks once, run them against many providers.
19
- - Get CLI tables and JSON results you can feed into dashboards, CI, or docs.
21
+ **Console output** — box-drawing tables with medals, color-ranked metrics, sparklines, and per-task winners:
22
+
23
+ ![Agent Duelist console output](docs/assets/screenshot.png)
24
+
25
+ **HTML report** — a self-contained, shareable single-file report with sortable tables, progress bars, tab navigation, and summary cards:
26
+
27
+ ![Agent Duelist HTML report](docs/assets/screenshot-html.png)
20
28
 
21
29
  ---
22
30
 
23
31
  ## Why agent-duelist?
24
32
 
25
- - **Provider-agnostic**: One config, many providers. Swap models and gateways without rewriting your tasks.
26
- - **Agent-focused**: Designed for agent workflows and tool use, not just single-turn prompts.
27
- - **Realistic metrics**: Latency, token counts, and cost estimates based on a pricing catalog.
28
- - **TypeScript-native DX**: Strongly typed APIs, Zod schemas for structured outputs, and a simple `defineArena()` entrypoint.
29
- - **CLI-first**: `npx duelist init` → `npx duelist run` gets you from zero to useful table in minutes.
33
+ | | |
34
+ |---|---|
35
+ | **Provider-agnostic** | One config, many providers. Swap models and gateways without rewriting tasks. |
36
+ | **Agent-focused** | Built for agent workflows and tool use, not just single-turn prompts. |
37
+ | **Task packs** | Built-in benchmark suites run with `--pack structured-output`, zero config needed. |
38
+ | **Quality-first ranking** | Medals decided by output quality. Speed and cost only break ties — a fast model that gets nothing right won't medal. |
39
+ | **7 built-in scorers** | Correctness, latency, cost, schema validation, fuzzy similarity, LLM-as-judge, and tool usage. |
40
+ | **Fair benchmarking** | Tasks run sequentially while providers race in parallel — fair latency comparison with no queue-induced penalties. |
41
+ | **TypeScript-native** | Strongly typed APIs, Zod schemas for structured outputs, and a simple `defineArena()` entrypoint. |
42
+ | **CI-ready** | Regression detection with confidence intervals, cost budgets, PR comments, and a prebuilt GitHub Action. |
43
+ | **CLI-first** | `npx duelist init` → `npx duelist run` gets you from zero to results in minutes. |
30
44
 
31
45
  ---
32
46
 
@@ -40,7 +54,7 @@ pnpm add agent-duelist
40
54
  yarn add agent-duelist
41
55
  ```
42
56
 
43
- You'll also need API keys for the providers you want to benchmark, for example:
57
+ Set API keys for the providers you want to benchmark:
44
58
 
45
59
  ```bash
46
60
  export OPENAI_API_KEY=sk-...
@@ -51,7 +65,7 @@ export GOOGLE_API_KEY=...
51
65
 
52
66
  ---
53
67
 
54
- ## One-minute quickstart
68
+ ## Quickstart
55
69
 
56
70
  Initialize a config:
57
71
 
@@ -59,7 +73,7 @@ Initialize a config:
59
73
  npx duelist init
60
74
  ```
61
75
 
62
- This creates `arena.config.ts` in your project. A minimal example:
76
+ This creates `arena.config.ts` in your project:
63
77
 
64
78
  ```ts
65
79
  // arena.config.ts
@@ -99,39 +113,80 @@ Run the benchmark:
99
113
  npx duelist run
100
114
  ```
101
115
 
102
- You'll see a matrix like:
116
+ You'll see a results matrix:
103
117
 
104
- - Rows: tasks (`simple-qa`, `structured-extraction`)
105
- - Columns: providers (`openai/gpt-5-mini`, `azure/gpt-5-mini`)
106
- - Cells: correctness score, latency, tokens, and estimated cost.
118
+ - **Rows**: tasks (`simple-qa`, `structured-extraction`)
119
+ - **Columns**: providers (`openai/gpt-5-mini`, `azure/gpt-5-mini`)
120
+ - **Cells**: correctness score, latency, tokens, and estimated cost
107
121
 
108
- For CI or further processing:
122
+ Export the results in different formats:
109
123
 
110
124
  ```bash
125
+ # JSON for CI pipelines and dashboards
111
126
  npx duelist run --reporter json > results.json
127
+
128
+ # Self-contained HTML report you can share or host
129
+ npx duelist run --reporter html --output report.html
112
130
  ```
113
131
 
114
- Generate a shareable HTML report:
132
+ ---
133
+
134
+ ## Task packs
135
+
136
+ Task packs are **built-in benchmark suites** — curated sets of tasks with recommended scorers. Use them to benchmark providers without writing any tasks yourself.
115
137
 
116
138
  ```bash
117
- npx duelist run --reporter html --output report.html
139
+ # List available packs
140
+ npx duelist run --pack list
141
+
142
+ # Run a pack with your providers
143
+ npx duelist run --pack structured-output --config arena.config.ts
118
144
  ```
119
145
 
120
- ---
146
+ Your config only needs to define providers — the pack supplies tasks and scorers:
121
147
 
122
- ## Core concepts
148
+ ```ts
149
+ // arena.config.ts
150
+ import { defineArena, openai, anthropic } from 'agent-duelist'
123
151
 
124
- ### Providers
152
+ export default defineArena({
153
+ providers: [
154
+ openai('gpt-5-mini'),
155
+ anthropic('claude-sonnet-4.6'),
156
+ ],
157
+ tasks: [], // ignored when --pack is used
158
+ scorers: [], // pack supplies its own scorers
159
+ })
160
+ ```
125
161
 
126
- Providers are **factory functions** that return plain objects implementing a shared `ArenaProvider` interface.
162
+ ### Available packs
127
163
 
128
- This lets you:
164
+ | Pack | Tasks | Scorers | Description |
165
+ |------|-------|---------|-------------|
166
+ | `structured-output` | 6 | correctness, schema-correctness, latency, cost | Zod schema stress test — flat objects, nesting, arrays, enums, empty arrays, and adversarial input |
167
+ | `tool-calling` | 4 | tool-usage, latency, cost | Function invocation accuracy — single calls, complex params, tool selection, and parallel calls |
168
+ | `reasoning` | 5 | correctness, latency, cost | Logic, math, and multi-step thinking — arithmetic, deduction, data interpretation, critical path, and business rules |
129
169
 
130
- - Swap providers without changing tasks.
131
- - Wrap or extend providers in your own code.
132
- - Mock providers in tests.
170
+ Packs work with both `run` and `ci` commands:
133
171
 
134
- Examples:
172
+ ```bash
173
+ # CI with a task pack
174
+ npx duelist ci --pack structured-output --threshold correctness=0.1 --budget 1.00
175
+ ```
176
+
177
+ You can also combine multiple packs:
178
+
179
+ ```bash
180
+ npx duelist run --pack structured-output,another-pack
181
+ ```
182
+
183
+ ---
184
+
185
+ ## Core concepts
186
+
187
+ ### Providers
188
+
189
+ Providers are **factory functions** that return plain objects implementing a shared `ArenaProvider` interface. This lets you swap providers without changing tasks, wrap or extend providers in your own code, and mock providers in tests.
135
190
 
136
191
  ```ts
137
192
  import {
@@ -140,25 +195,40 @@ import {
140
195
  anthropic,
141
196
  gemini,
142
197
  openaiCompatible,
143
- type ArenaProvider,
144
198
  } from 'agent-duelist'
145
199
 
200
+ // OpenAI
146
201
  const oai = openai('gpt-5-mini')
147
202
 
203
+ // Azure OpenAI
148
204
  const azure = azureOpenai('gpt-5-mini', {
149
205
  deployment: 'my-deployment',
150
206
  })
151
207
 
208
+ // Anthropic
152
209
  const claude = anthropic('claude-sonnet-4.6')
153
210
 
154
- const gem = gemini('gemini-3-flash-preview') // uses GOOGLE_API_KEY
211
+ // Google Gemini
212
+ const gem = gemini('gemini-3-flash-preview')
155
213
 
156
- const local: ArenaProvider = openaiCompatible({
214
+ // Any OpenAI-compatible gateway (Ollama, LiteLLM, vLLM, etc.)
215
+ const local = openaiCompatible({
157
216
  id: 'local/llama',
158
- name: 'Local Gateway',
217
+ name: 'Local Ollama',
159
218
  baseURL: 'http://localhost:11434/v1',
160
219
  model: 'llama3.3',
161
220
  apiKeyEnv: 'LOCAL_LLM_API_KEY',
221
+ free: true, // registers zero-cost pricing
222
+ })
223
+
224
+ // Reasoning models that emit <think> blocks (DeepSeek-R1, MiniMax M2.5, etc.)
225
+ const deepseek = openaiCompatible({
226
+ id: 'deepseek/r1',
227
+ name: 'DeepSeek R1',
228
+ baseURL: 'https://api.deepseek.com/v1',
229
+ model: 'deepseek-reasoner',
230
+ apiKeyEnv: 'DEEPSEEK_API_KEY',
231
+ stripThinking: true, // strips <think>...</think> from output
162
232
  })
163
233
  ```
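To make the `stripThinking` option concrete, here is a minimal sketch of the kind of cleanup it describes. This is not the library's implementation: the helper name `stripThinkBlocks` and the regex-based approach are assumptions for illustration only.

```typescript
// Hypothetical helper: remove <think>...</think> blocks from model output.
// The regex approach is an assumption, not agent-duelist's actual code.
function stripThinkBlocks(output: string): string {
  // [\s\S] matches across newlines; the lazy quantifier stops at the first closing tag.
  return output.replace(/<think>[\s\S]*?<\/think>/g, '').trim()
}

const raw = '<think>The user wants the capital of France.</think>\nParis'
console.log(stripThinkBlocks(raw)) // prints "Paris"
```

Stripping before scoring matters because an exact-match scorer would otherwise compare the reasoning text, not just the final answer, against `expected`.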
164
234
 
@@ -185,6 +255,7 @@ interface ArenaTask {
185
255
  prompt: string
186
256
  expected?: unknown // used by correctness scorers
187
257
  schema?: ZodSchema<any> // used by schema-based scorers
258
+ tools?: ToolDefinition[] // used by tool-calling scorers
188
259
  }
189
260
  ```
190
261
 
@@ -213,7 +284,7 @@ const tasks: ArenaTask[] = [
213
284
 
214
285
  ### Scorers
215
286
 
216
- Scorers take raw model outputs and turn them into **numeric scores** (0–1) with optional details. Built-in scorers:
287
+ Scorers turn raw model outputs into **numeric scores** (0–1) with optional details. Seven built-in scorers ship out of the box:
217
288
 
218
289
  | Scorer | What it measures |
219
290
  |--------|-----------------|
@@ -222,7 +293,8 @@ Scorers take raw model outputs and turn them into **numeric scores** (0–1) wit
222
293
  | `correctness` | Exact match against `expected` (deep-equal, key-order independent for objects) |
223
294
  | `schema-correctness` | Validates output against the task's Zod `schema` via `safeParse()` |
224
295
  | `fuzzy-similarity` | Jaccard token-overlap similarity between output and `expected` |
225
- | `llm-judge-correctness` | Async LLM-as-judge: calls a judge model to score correctness 0–1 |
296
+ | `tool-usage` | Tool calling accuracy: checks tool selection and argument correctness (1.0 exact match, 0.5 right tool / wrong args, 0.0 wrong tool) |
297
+ | `llm-judge-correctness` | LLM-as-judge — calls a judge model to score accuracy, completeness, and conciseness |
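Two of the scorers in the table rest on small, well-known techniques. The sketch below shows the idea behind each, under stated assumptions: the tokenization rule (lowercase, split on non-alphanumerics) and the function names `jaccard` and `deepEqual` are illustrative and may differ from what the package does internally.

```typescript
// Idea behind `fuzzy-similarity`: Jaccard overlap of token sets.
// The tokenization rule here is an assumption.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean))
}

function jaccard(a: string, b: string): number {
  const setA = tokenize(a)
  const setB = tokenize(b)
  if (setA.size === 0 && setB.size === 0) return 1 // two empty outputs count as identical
  let shared = 0
  setA.forEach((token) => {
    if (setB.has(token)) shared++
  })
  // |A ∩ B| / |A ∪ B|
  return shared / (setA.size + setB.size - shared)
}

// Idea behind `correctness`: deep equality that ignores object key order.
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true
  if (typeof a !== 'object' || typeof b !== 'object' || a === null || b === null) return false
  if (Array.isArray(a) !== Array.isArray(b)) return false
  const recA = a as Record<string, unknown>
  const recB = b as Record<string, unknown>
  const keysA = Object.keys(recA)
  if (keysA.length !== Object.keys(recB).length) return false
  return keysA.every(
    (key) => Object.prototype.hasOwnProperty.call(recB, key) && deepEqual(recA[key], recB[key]),
  )
}

console.log(jaccard('the quick brown fox', 'the quick red fox')) // 0.6
console.log(deepEqual({ a: 1, b: [2, 3] }, { b: [2, 3], a: 1 })) // true
```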
226
298
 
227
299
  Configure them in your arena:
228
300
 
@@ -242,8 +314,6 @@ defineArena({
242
314
 
243
315
  The judge model defaults to `gpt-5-mini`. It can also be set via the `DUELIST_JUDGE_MODEL` env var. The judge backend is auto-detected from the model name — `gemini-*` models use Google's API, otherwise it falls back to OpenAI or Azure OpenAI.
244
316
 
245
- You can also add custom scorers for domain-specific metrics (e.g. tool-call correctness, safety, style).
246
-
247
317
  ---
248
318
 
249
319
  ### Arena options
@@ -252,7 +322,7 @@ You can also add custom scorers for domain-specific metrics (e.g. tool-call corr
252
322
 
253
323
  | Option | Type | Default | Description |
254
324
  |--------|------|---------|-------------|
255
- | `runs` | `number` | `1` | Number of runs per provider × task combination. Higher values improve statistical confidence for CI regression detection. |
325
+ | `runs` | `number` | `1` | Number of runs per provider x task combination. Higher values improve statistical confidence for CI regression detection. |
256
326
  | `judgeModel` | `string` | `'gpt-5-mini'` | Model used by the `llm-judge-correctness` scorer. Also settable via `DUELIST_JUDGE_MODEL` env var. Gemini models auto-route to Google's API. |
257
327
  | `timeout` | `number` | `60000` | Per-request timeout in milliseconds. Requests exceeding this are marked as failures. Prevents hanging on unresponsive APIs. |
258
328
  | `sparklines` | `boolean` | `true` | Show sparkline bars next to percentage scores in the console reporter. Disable with `false` if your terminal doesn't render Unicode block characters well. |
@@ -273,43 +343,62 @@ export default defineArena({
273
343
 
274
344
  ---
275
345
 
346
+ ## Reporters
347
+
348
+ agent-duelist includes four output formats, each suited to a different workflow:
349
+
350
+ | Reporter | Flag | Use case |
351
+ |----------|------|----------|
352
+ | **Console** | `--reporter console` (default) | Interactive development — box-drawing tables with medals, sparklines, color-ranked metrics, and per-task winners |
353
+ | **JSON** | `--reporter json` | CI pipelines, dashboards, and downstream tooling |
354
+ | **HTML** | `--reporter html` | Shareable single-file reports with sortable tables, animated backgrounds, tab navigation, CSS progress bars, medal rankings, and summary cards |
355
+ | **Markdown** | `--comment` (CI mode) | Auto-posted PR comment with comparison table, cost summary, and pass/fail verdict |
356
+
357
+ Generate an HTML report:
358
+
359
+ ```bash
360
+ npx duelist run --reporter html --output report.html
361
+ ```
362
+
363
+ The HTML report is a single self-contained file — no external dependencies, no build step. Open it in any browser or host it as a static page.
364
+
365
+ ---
366
+
276
367
  ## Cost & pricing
277
368
 
278
369
  Cost estimation is intentionally transparent and conservative:
279
370
 
280
- 1. **Token counts**
281
- Providers return token usage (prompt and completion tokens) in each `TaskResult`. These are treated as the source of truth.
282
-
283
- 2. **Pricing catalog**
284
- `agent-duelist` ships with a **locally bundled catalog** of per-token prices for many models, derived from OpenRouter's public pricing pages.
371
+ 1. **Token counts** — Providers return token usage (prompt and completion tokens) in each `TaskResult`. These are the source of truth.
285
372
 
286
- - The catalog maps `(provider, model)` → `{ inputPerM, outputPerM }` in USD per 1M tokens.
287
- - Azure OpenAI models are resolved back to their base OpenAI models where possible (e.g. `azure/gpt-5-mini` → `openai/gpt-5-mini`) so you don't need to configure Azure pricing manually.
373
+ 2. **Pricing catalog**: `agent-duelist` ships with a **locally bundled catalog** of per-token prices for many models, derived from OpenRouter's public pricing pages.
374
+ - The catalog maps `(provider, model)` to `{ inputPerM, outputPerM }` in USD per 1M tokens.
375
+ - Azure OpenAI models resolve back to their base OpenAI models (e.g. `azure/gpt-5-mini` → `openai/gpt-5-mini`).
376
+ - Cross-provider fallback: models hosted on Groq, Together, Fireworks, etc. resolve to the original provider's pricing.
288
377
 
289
- 3. **Estimated USD**
290
- The `cost` scorer computes:
378
+ 3. **Estimated USD** — The `cost` scorer computes:
291
379
 
292
380
  ```text
293
381
  estimatedUsd = (promptTokens * inputPerM + completionTokens * outputPerM) / 1_000_000
294
382
  ```
295
383
 
296
- In the console reporter, you'll see:
384
+ 4. **Unknown models** — If a model is not in the catalog, tokens are still reported and cost is marked as unknown (no fake numbers).
297
385
 
298
- - token counts: `prompt: X, completion: Y`
299
- - cost: `~$0.0XXm` (millicents; fractions of a cent)
300
- - a short disclaimer that this is an **estimate** based on a pricing snapshot.
386
+ 5. **Custom pricing**: Register pricing for models not in the catalog:
301
387
 
302
- 4. **Unknown models**
303
- If a model is not in the catalog:
388
+ ```ts
389
+ import { registerPricing } from 'agent-duelist'
304
390
 
305
- - Tokens are still reported.
306
- - Cost is marked as unknown (no fake numbers).
391
+ registerPricing('custom/my-model', {
392
+ inputPerToken: 0.000003,
393
+ outputPerToken: 0.000015,
394
+ })
395
+ ```
307
396
 
308
- You can update the catalog with a script that re-scrapes OpenRouter's public pricing page when prices change.
397
+ You can update the bundled catalog with `npm run update:pricing`, which re-scrapes OpenRouter's public pricing page.
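The steps above can be sketched end to end in a few lines. This is a rough illustration rather than the package's code: the `catalog` entry uses placeholder prices (not real catalog values), and the names `lookupPrice` and `estimateUsd` are hypothetical.

```typescript
type Price = { inputPerM: number; outputPerM: number } // USD per 1M tokens
type Usage = { promptTokens: number; completionTokens: number }

// Placeholder prices for illustration only; these are NOT real catalog values.
const catalog: Record<string, Price> = {
  'openai/example-model': { inputPerM: 0.25, outputPerM: 2.0 },
}

// Hypothetical lookup: try the exact id, then map azure/<model> to openai/<model>.
function lookupPrice(id: string): Price | undefined {
  if (catalog[id]) return catalog[id]
  if (/^azure\//.test(id)) return catalog[id.replace(/^azure\//, 'openai/')]
  return undefined
}

// Returns undefined for unknown models instead of inventing a number.
function estimateUsd(id: string, usage: Usage): number | undefined {
  const price = lookupPrice(id)
  if (!price) return undefined
  return (
    (usage.promptTokens * price.inputPerM + usage.completionTokens * price.outputPerM) / 1_000_000
  )
}

const usage: Usage = { promptTokens: 1000, completionTokens: 500 }
console.log(estimateUsd('azure/example-model', usage)) // 0.00125
console.log(estimateUsd('custom/unknown', usage)) // undefined
```

Returning `undefined` for an uncatalogued model keeps "unknown" distinct from "free", which is the distinction the list above draws.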
309
398
 
310
399
  ---
311
400
 
312
- ## CLI usage
401
+ ## CLI reference
313
402
 
314
403
  ### `duelist init`
315
404
 
@@ -317,30 +406,46 @@ Scaffold a new `arena.config.ts` in the current directory.
317
406
 
318
407
  ```bash
319
408
  npx duelist init
409
+ npx duelist init --force # overwrite an existing config
320
410
  ```
321
411
 
412
+ | Option | Description |
413
+ |--------|-------------|
414
+ | `--force` | Overwrite existing config file |
415
+
322
416
  ### `duelist run`
323
417
 
324
418
  Run benchmarks defined in your arena config.
325
419
 
326
420
  ```bash
327
- # Run with the default config (arena.config.ts)
421
+ # Default config, console output
328
422
  npx duelist run
329
423
 
330
- # Use a custom config
424
+ # Custom config
331
425
  npx duelist run --config path/to/arena.config.ts
332
426
 
333
- # Get JSON instead of a table
334
- npx duelist run --reporter json
427
+ # Run a built-in task pack
428
+ npx duelist run --pack structured-output
429
+
430
+ # List available packs
431
+ npx duelist run --pack list
432
+
433
+ # JSON for piping
434
+ npx duelist run --reporter json > results.json
335
435
 
336
- # Suppress per-result progress lines
436
+ # HTML report
437
+ npx duelist run --reporter html --output report.html
438
+
439
+ # Quiet mode
337
440
  npx duelist run --quiet
338
441
  ```
339
442
 
340
443
  | Option | Description |
341
444
  |--------|-------------|
342
445
  | `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
343
- | `--reporter <type>` | Output format: `console` (default) or `json` |
446
+ | `--pack <names>` | Run built-in task pack(s) instead of config tasks. Comma-separated for multiple packs. Use `list` to show available packs. |
447
+ | `--reporter <type>` | Output format: `console` (default), `json`, or `html` |
448
+ | `--output <path>` | Output file path for HTML reporter (default: `duelist-report.html`) |
344
449
  | `-q, --quiet` | Suppress per-result progress |
345
450
 
346
451
  ### `duelist ci`
@@ -354,6 +459,9 @@ npx duelist ci --update-baseline
354
459
  # Subsequent runs — compare against baseline
355
460
  npx duelist ci --threshold correctness=0.1 --budget 1.00
356
461
 
462
+ # Run CI with a task pack
463
+ npx duelist ci --pack structured-output --threshold correctness=0.1
464
+
357
465
  # Post comparison table as a PR comment (GitHub Actions)
358
466
  npx duelist ci --threshold correctness=0.1 --comment
359
467
  ```
@@ -361,6 +469,7 @@ npx duelist ci --threshold correctness=0.1 --comment
361
469
  | Option | Description |
362
470
  |--------|-------------|
363
471
  | `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
472
+ | `--pack <names>` | Run built-in task pack(s) instead of config tasks |
364
473
  | `--baseline <path>` | Baseline JSON file (default: `.duelist/baseline.json`) |
365
474
  | `--budget <dollars>` | Max total cost in USD — fails if exceeded |
366
475
  | `--threshold <scorer=delta>` | Regression threshold (repeatable, e.g. `--threshold correctness=0.1 --threshold cost=0.002`) |
@@ -375,24 +484,61 @@ npx duelist ci --threshold correctness=0.1 --comment
375
484
  - Without `--threshold` flags, regression detection is skipped entirely — only `--budget` is enforced.
376
485
  - Results with high variance (CV > 0.3) are flagged as **flaky** with a warning.
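For readers unfamiliar with the CV rule, a minimal sketch follows. CV (coefficient of variation) is the standard deviation divided by the mean; whether the tool uses the sample or population standard deviation is not stated, so the sample form (`n - 1`) below is an assumption, as are the function names.

```typescript
// Coefficient of variation: standard deviation divided by the mean.
// Using the sample standard deviation (n - 1) is an assumption.
function coefficientOfVariation(samples: number[]): number {
  if (samples.length < 2) return 0 // a single run has no measurable spread
  const mean = samples.reduce((sum, x) => sum + x, 0) / samples.length
  if (mean === 0) return 0
  const variance =
    samples.reduce((sum, x) => sum + (x - mean) * (x - mean), 0) / (samples.length - 1)
  return Math.sqrt(variance) / Math.abs(mean)
}

// Hypothetical helper mirroring the documented CV > 0.3 rule.
function isFlaky(samples: number[], cutoff = 0.3): boolean {
  return coefficientOfVariation(samples) > cutoff
}

console.log(isFlaky([0.9, 0.88, 0.91])) // false
console.log(isFlaky([1.0, 0.2, 0.9, 0.1])) // true
```

With `runs: 1` there is only one sample per provider and task, so a check like this can never fire; raising `runs` is what makes flakiness detectable.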
377
486
 
378
- The CLI loads TypeScript configs directly using a lightweight runtime loader so users don't need to precompile their config.
487
+ The CLI loads TypeScript configs directly using a lightweight runtime loader so you don't need to precompile your config.
488
+
489
+ ---
490
+
491
+ ## Tool-calling agent example
492
+
493
+ agent-duelist supports tool-calling tasks — define tools with Zod-typed parameters and handlers, and the provider will execute them during the benchmark:
494
+
495
+ ```ts
496
+ import { defineArena, openai } from 'agent-duelist'
497
+ import { z } from 'zod'
498
+
499
+ const weatherTool = {
500
+ name: 'getCurrentWeather',
501
+ description: 'Get the current weather in a given city',
502
+ parameters: z.object({ city: z.string() }),
503
+ handler: async ({ city }: { city: string }) => ({
504
+ city,
505
+ tempC: 20,
506
+ }),
507
+ }
508
+
509
+ export default defineArena({
510
+ providers: [openai('gpt-5-mini')],
511
+ tasks: [
512
+ {
513
+ name: 'weather-tool-call',
514
+ prompt: 'What is the current temperature in Amsterdam? Use the tool.',
515
+ expected: { city: 'Amsterdam' },
516
+ tools: [weatherTool],
517
+ },
518
+ ],
519
+ scorers: ['latency', 'cost', 'tool-usage'],
520
+ runs: 1,
521
+ })
522
+ ```
523
+
524
+ The model calls `getCurrentWeather`, the handler returns a stub result, and the `tool-usage` scorer reports whether the expected tool was invoked. Tool calls and their results are included in the JSON output for inspection.
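The three-level `tool-usage` scale (1.0 exact match, 0.5 right tool with wrong args, 0.0 wrong tool) can be sketched as below. The `ToolCall` shape and the `scoreToolCall` name are hypothetical, and comparing arguments via sorted-key JSON is a simplification that only suits flat argument objects.

```typescript
type ToolCall = { name: string; args: Record<string, unknown> }

// Sketch of the documented scale: 1.0 exact match, 0.5 right tool / wrong args, 0.0 wrong tool.
// Sorted-key JSON comparison is a simplification for flat argument objects.
function scoreToolCall(expected: ToolCall, actual: ToolCall | undefined): number {
  if (!actual || actual.name !== expected.name) return 0
  const canon = (args: Record<string, unknown>) =>
    JSON.stringify(
      Object.keys(args)
        .sort()
        .map((key) => [key, args[key]]),
    )
  return canon(actual.args) === canon(expected.args) ? 1 : 0.5
}

const expectedCall: ToolCall = { name: 'getCurrentWeather', args: { city: 'Amsterdam' } }
console.log(scoreToolCall(expectedCall, { name: 'getCurrentWeather', args: { city: 'Amsterdam' } })) // 1
console.log(scoreToolCall(expectedCall, { name: 'getCurrentWeather', args: { city: 'Rotterdam' } })) // 0.5
console.log(scoreToolCall(expectedCall, { name: 'searchWeb', args: { q: 'weather' } })) // 0
```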
379
525
 
380
526
  ---
381
527
 
382
528
  ## Example: multi-provider benchmark
383
529
 
384
- Here's a richer example comparing multiple providers across tasks:
530
+ A richer example comparing multiple providers across tasks:
385
531
 
386
532
  ```ts
387
533
  // arena.config.ts
388
- import { defineArena, azureOpenai } from 'agent-duelist'
534
+ import { defineArena, azureOpenai, openai, gemini } from 'agent-duelist'
389
535
  import { z } from 'zod'
390
536
 
391
537
  export default defineArena({
392
538
  providers: [
393
- azureOpenai('gpt-5-mini'),
539
+ openai('gpt-5-mini'),
394
540
  azureOpenai('gpt-5-nano'),
395
- azureOpenai('gpt-5.2-chat'),
541
+ gemini('gemini-3-flash-preview'),
396
542
  ],
397
543
  tasks: [
398
544
  {
@@ -425,50 +571,13 @@ export default defineArena({
425
571
  npx duelist run
426
572
  ```
427
573
 
428
- Output includes box-drawing tables with medals, color-ranked metrics, sparkline bars, and a winner row per task — see the [screenshot above](#what-you-get) for a real example.
429
-
430
574
  **How scoring works:**
431
575
 
432
576
  - Providers are compared **head-to-head within each task** — all providers receive the same prompt at the same time.
433
577
  - Medals are awarded only when a provider is the **sole leader** in a metric column. Ties don't award medals, keeping rankings meaningful.
434
- - The overall winner is determined by category wins across correctness, latency, and cost.
435
-
436
- ---
437
-
438
- ## Tool-calling agent example
439
-
440
- agent-duelist supports tool-calling tasks — define tools with handlers, and the provider will execute them during the benchmark:
441
-
442
- ```ts
443
- import { defineArena, openai } from 'agent-duelist'
444
- import { z } from 'zod'
445
-
446
- const weatherTool = {
447
- name: 'getCurrentWeather',
448
- description: 'Get the current weather in a given city',
449
- parameters: z.object({ city: z.string() }),
450
- handler: async ({ city }: { city: string }) => ({
451
- city,
452
- tempC: 20,
453
- }),
454
- }
455
-
456
- export default defineArena({
457
- providers: [openai('gpt-5-mini')],
458
- tasks: [
459
- {
460
- name: 'weather-tool-call',
461
- prompt: 'What is the current temperature in Amsterdam? Use the tool.',
462
- expected: { city: 'Amsterdam' },
463
- tools: [weatherTool],
464
- },
465
- ],
466
- scorers: ['latency', 'cost', 'tool-usage'],
467
- runs: 1,
468
- })
469
- ```
470
-
471
- The model calls `getCurrentWeather`, the handler returns a stub result, and the `tool-usage` scorer reports whether the expected tool was invoked. Tool calls and their results are included in the JSON output for inspection.
578
+ - **Quality-first ranking**: medals are decided by quality scorer wins (correctness, schema-correctness, etc.). Efficiency metrics (latency, cost) only break ties among quality-equal providers.
579
+ - **Quality gate**: providers that score 0% on all quality scorers are ineligible for medals entirely — being fast or cheap doesn't compensate for getting nothing right.
580
+ - The overall winner is the provider with the highest average correctness score.
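The quality gate and the tie-breaking rule can be sketched as a filter followed by a multi-key sort. This is an illustration only: the `Result` shape, the single aggregated `quality` field, and the order of the tie-breakers (latency before cost) are assumptions, and the sole-leader medal rule is not modeled.

```typescript
type Result = { provider: string; quality: number; latencyMs: number; costUsd: number }

// Hypothetical ranking: quality first, then latency, then cost as tie-breakers.
// The tie-break order and the single `quality` field are assumptions.
function rankForMedals(results: Result[]): Result[] {
  return results
    .filter((r) => r.quality > 0) // quality gate: 0% quality is ineligible
    .sort(
      (a, b) => b.quality - a.quality || a.latencyMs - b.latencyMs || a.costUsd - b.costUsd,
    )
}

const ranked = rankForMedals([
  { provider: 'fast-but-wrong', quality: 0, latencyMs: 200, costUsd: 0.0001 },
  { provider: 'accurate-slow', quality: 1, latencyMs: 900, costUsd: 0.002 },
  { provider: 'accurate-fast', quality: 1, latencyMs: 400, costUsd: 0.003 },
])
console.log(ranked.map((r) => r.provider)) // [ 'accurate-fast', 'accurate-slow' ]
```

The fastest and cheapest provider drops out entirely because it scored zero on quality, while the two accurate providers are separated by latency alone.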
472
581
 
473
582
  ---
474
583
 
@@ -516,8 +625,8 @@ When `--comment` is enabled, the CI posts (or updates) a markdown table on the P
516
625
 
517
626
  | Provider | Task | Scorer | Baseline | Current | Delta | Status |
518
627
  |----------|------|--------|----------|---------|-------|--------|
519
- | openai/gpt-5-mini | extract | correctness | 0.900 | 0.850 | -0.050 | unchanged |
520
- | openai/gpt-5-mini | extract | latency | 0.920 ± 0.030 | 0.890 ± 0.025 | -0.030 | unchanged |
628
+ | openai/gpt-5-mini | extract | correctness | 0.900 | 0.850 | -0.050 | unchanged |
629
+ | openai/gpt-5-mini | extract | latency | 0.920 ± 0.030 | 0.890 ± 0.025 | -0.030 | unchanged |
521
630
 
522
631
  With cost summary, flakiness warnings, and pass/fail verdict.
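To show how the Delta and Status columns relate to a `--threshold` value, here is a simplified sketch. It assumes higher scores are better, and the status labels other than `unchanged` are assumptions. Confidence intervals, which the CI command is described as using, are deliberately left out.

```typescript
type Status = 'regressed' | 'improved' | 'unchanged'

// Simplified comparison: plain delta versus threshold, higher score is better.
// Labels other than "unchanged" are assumptions; confidence intervals are ignored.
function compareToBaseline(baseline: number, current: number, threshold: number) {
  const delta = current - baseline
  let status: Status = 'unchanged'
  if (delta < -threshold) status = 'regressed'
  else if (delta > threshold) status = 'improved'
  return { delta: delta.toFixed(3), status }
}

console.log(compareToBaseline(0.9, 0.85, 0.1)) // { delta: '-0.050', status: 'unchanged' }
console.log(compareToBaseline(0.9, 0.7, 0.1)) // { delta: '-0.200', status: 'regressed' }
```

The first call reproduces the `correctness` row of the table: a drop of 0.050 stays inside a threshold of 0.1, so the row reads `unchanged`.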
523
632
 
@@ -525,34 +634,31 @@ With cost summary, flakiness warnings, and pass/fail verdict.
525
634
 
526
635
  ## Roadmap
527
636
 
528
- Shipped so far:
637
+ **Shipped:**
529
638
 
530
- - OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible provider
639
+ - 5 provider types: OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible gateway
531
640
  - 7 built-in scorers including LLM-as-judge, tool-usage, schema validation, and fuzzy similarity
532
641
  - Tool-calling support with local handlers for agent task benchmarking
533
- - Fair head-to-head benchmarking: tasks run sequentially while providers race in parallel, ensuring fair latency comparison without queue-induced timeout penalties
534
- - Console reporter with box-drawing tables, sole-leader medal rankings, sparkline bars (toggleable), and per-task winner rows
535
- - Configurable per-request timeout to prevent hanging on unresponsive APIs
536
- - JSON reporter for CI/pipeline integration
537
- - Markdown reporter for PR comments
538
- - `duelist ci` command with regression detection, cost budgets, and flakiness warnings
642
+ - **Task packs**: built-in benchmark suites (`structured-output`, `tool-calling`, `reasoning`) run with `--pack`, no config writing needed
643
+ - Quality-first medal ranking: output quality decides medals, efficiency only breaks ties
644
+ - Fair head-to-head benchmarking with parallel provider execution
645
+ - 4 reporters: console (tables + medals + sparklines), JSON, HTML (sortable, self-contained), and Markdown (PR comments)
646
+ - `duelist ci` with regression detection (confidence intervals), cost budgets, and flakiness warnings
539
647
  - GitHub Action for CI/CD integration
540
- - Pricing catalog from OpenRouter with refresh script
648
+ - Pricing catalog with cross-provider fallback and `registerPricing()` for custom models
649
+ - `openaiCompatible` with `stripThinking` for reasoning models and `free` flag for local models
650
+ - Configurable per-request timeout
541
651
 
542
- Planned directions (subject to community feedback):
652
+ **Planned** (subject to community feedback):
543
653
 
544
- - **More providers**
545
- - OpenRouter-native and more OpenAI-compatible gateways.
546
- - **Better reporting**
547
- - HTML and CSV export options.
548
- - **Agent workflows**
549
- - Multi-step tool chains, multi-hop reasoning, and agent traces.
550
- - **Plugin system**
551
- - First-class support for user-defined providers and scorers.
552
- - **Embedding-based scoring**
553
- - Semantic similarity via embedding distance.
654
+ - **More task packs** — summarization, multi-turn conversation, and code generation packs
655
+ - **Agent workflows** — multi-step tool chains, multi-hop reasoning, and agent traces
656
+ - **More export formats** — CSV
657
+ - **Plugin system**: first-class support for user-defined providers and scorers
658
+ - **Embedding-based scoring** — semantic similarity via embedding distance
659
+ - **More providers**: OpenRouter-native and additional OpenAI-compatible gateways
554
660
 
555
- If you have a specific use case (framework comparisons, multi-agent competitions, tool-calling benchmarks), please open an issue; those will shape what gets built first.
661
+ Have a use case in mind? [Open an issue](https://github.com/DataGobes/agent-duelist/issues) — community feedback shapes what gets built first.
556
662
 
557
663
  ---
558
664
 
@@ -560,18 +666,18 @@ If you have a specific use case (framework comparisons, multi-agent competitions
560
666
 
561
667
  Contributions, issues, and feature requests are welcome.
562
668
 
563
- - **Bug reports / ideas**: open a GitHub issue.
669
+ - **Bug reports / ideas**: [open a GitHub issue](https://github.com/DataGobes/agent-duelist/issues).
564
670
  - **Code changes**:
565
- - Fork the repo.
566
- - Create a branch.
567
- - Run tests: `npm test`.
568
- - Run build: `npm run build`.
569
- - Open a PR with a clear description and, if possible, a small repro.
671
+ 1. Fork the repo.
672
+ 2. Create a branch.
673
+ 3. Run tests: `npm test`.
674
+ 4. Run build: `npm run build`.
675
+ 5. Open a PR with a clear description.
570
676
 
571
- Please try to keep PRs narrowly focused (single provider, one new scorer, etc.) so they're easy to review.
677
+ Please keep PRs narrowly focused (single provider, one new scorer, etc.) so they're easy to review.
572
678
 
573
679
  ---
574
680
 
575
681
  ## License
576
682
 
577
- MIT.
683
+ MIT