agent-duelist 0.2.1 → 0.3.0
- package/README.md +246 -142
- package/dist/cli.js +2004 -62
- package/dist/cli.js.map +1 -1
- package/dist/index.cjs +334 -105
- package/dist/index.cjs.map +1 -1
- package/dist/index.d.cts +28 -3
- package/dist/index.d.ts +28 -3
- package/dist/index.js +332 -105
- package/dist/index.js.map +1 -1
- package/package.json +1 -1
package/README.md
CHANGED
|
@@ -1,32 +1,46 @@
|
|
|
1
1
|
# agent-duelist
|
|
2
2
|
|
|
3
|
+
[](https://www.npmjs.com/package/agent-duelist)
|
|
3
4
|
[](https://github.com/DataGobes/agent-duelist/actions/workflows/ci.yml)
|
|
4
5
|
[](https://datagobes.github.io/agent-duelist/)
|
|
6
|
+
[](./LICENSE)
|
|
5
7
|
|
|
6
|
-
> Pit LLM providers against each other on agent tasks — Duel your models
|
|
8
|
+
> Pit LLM providers against each other on agent tasks — **Duel your models.**
|
|
7
9
|
>
|
|
8
10
|
> **[View the landing page →](https://datagobes.github.io/agent-duelist/)**
|
|
9
11
|
|
|
10
|
-
`agent-duelist` is a TypeScript-first framework
|
|
12
|
+
`agent-duelist` is a TypeScript-first benchmarking framework that runs the same tasks against multiple LLM providers and gives you structured, reproducible results: correctness, latency, tokens, cost, and more.
|
|
13
|
+
|
|
14
|
+
```bash
|
|
15
|
+
npx duelist init # scaffold a config
|
|
16
|
+
npx duelist run # see who wins
|
|
17
|
+
```
|
|
11
18
|
|
|
12
19
|
## What you get
|
|
13
|
-
> 
|
|
14
|
-
>
|
|
15
|
-
> 
|
|
16
20
|
|
|
17
|
-
-
|
|
18
|
-
|
|
19
|
-
|
|
21
|
+
**Console output** — box-drawing tables with medals, color-ranked metrics, sparklines, and per-task winners:
|
|
22
|
+
|
|
23
|
+

|
|
24
|
+
|
|
25
|
+
**HTML report** — a self-contained, shareable single-file report with sortable tables, progress bars, tab navigation, and summary cards:
|
|
26
|
+
|
|
27
|
+

|
|
20
28
|
|
|
21
29
|
---
|
|
22
30
|
|
|
23
31
|
## Why agent-duelist?
|
|
24
32
|
|
|
25
|
-
|
|
26
|
-
|
|
27
|
-
|
|
28
|
-
|
|
29
|
-
|
|
33
|
+
| | |
|
|
34
|
+
|---|---|
|
|
35
|
+
| **Provider-agnostic** | One config, many providers. Swap models and gateways without rewriting tasks. |
|
|
36
|
+
| **Agent-focused** | Built for agent workflows and tool use, not just single-turn prompts. |
|
|
37
|
+
| **Task packs** | Built-in benchmark suites — run with `--pack structured-output`, zero config needed. |
|
|
38
|
+
| **Quality-first ranking** | Medals decided by output quality. Speed and cost only break ties — a fast model that gets nothing right won't medal. |
|
|
39
|
+
| **7 built-in scorers** | Correctness, latency, cost, schema validation, fuzzy similarity, LLM-as-judge, and tool usage. |
|
|
40
|
+
| **Fair benchmarking** | Tasks run sequentially while providers race in parallel — fair latency comparison with no queue-induced penalties. |
|
|
41
|
+
| **TypeScript-native** | Strongly typed APIs, Zod schemas for structured outputs, and a simple `defineArena()` entrypoint. |
|
|
42
|
+
| **CI-ready** | Regression detection with confidence intervals, cost budgets, PR comments, and a prebuilt GitHub Action. |
|
|
43
|
+
| **CLI-first** | `npx duelist init` → `npx duelist run` gets you from zero to results in minutes. |
|
|
30
44
|
|
|
31
45
|
---
|
|
32
46
|
|
|
@@ -40,7 +54,7 @@ pnpm add agent-duelist
|
|
|
40
54
|
yarn add agent-duelist
|
|
41
55
|
```
|
|
42
56
|
|
|
43
|
-
|
|
57
|
+
Set API keys for the providers you want to benchmark:
|
|
44
58
|
|
|
45
59
|
```bash
|
|
46
60
|
export OPENAI_API_KEY=sk-...
|
|
@@ -51,7 +65,7 @@ export GOOGLE_API_KEY=...
|
|
|
51
65
|
|
|
52
66
|
---
|
|
53
67
|
|
|
54
|
-
##
|
|
68
|
+
## Quickstart
|
|
55
69
|
|
|
56
70
|
Initialize a config:
|
|
57
71
|
|
|
@@ -59,7 +73,7 @@ Initialize a config:
|
|
|
59
73
|
npx duelist init
|
|
60
74
|
```
|
|
61
75
|
|
|
62
|
-
This creates `arena.config.ts` in your project
|
|
76
|
+
This creates `arena.config.ts` in your project:
|
|
63
77
|
|
|
64
78
|
```ts
|
|
65
79
|
// arena.config.ts
|
|
@@ -99,39 +113,78 @@ Run the benchmark:
|
|
|
99
113
|
npx duelist run
|
|
100
114
|
```
|
|
101
115
|
|
|
102
|
-
You'll see a matrix
|
|
116
|
+
You'll see a results matrix:
|
|
103
117
|
|
|
104
|
-
- Rows
|
|
105
|
-
- Columns
|
|
106
|
-
- Cells
|
|
118
|
+
- **Rows**: tasks (`simple-qa`, `structured-extraction`)
|
|
119
|
+
- **Columns**: providers (`openai/gpt-5-mini`, `azure/gpt-5-mini`)
|
|
120
|
+
- **Cells**: correctness score, latency, tokens, and estimated cost
|
|
107
121
|
|
|
108
|
-
|
|
122
|
+
Export the results in different formats:
|
|
109
123
|
|
|
110
124
|
```bash
|
|
125
|
+
# JSON for CI pipelines and dashboards
|
|
111
126
|
npx duelist run --reporter json > results.json
|
|
127
|
+
|
|
128
|
+
# Self-contained HTML report you can share or host
|
|
129
|
+
npx duelist run --reporter html --output report.html
|
|
112
130
|
```
|
|
113
131
|
|
|
114
|
-
|
|
132
|
+
---
|
|
133
|
+
|
|
134
|
+
## Task packs
|
|
135
|
+
|
|
136
|
+
Task packs are **built-in benchmark suites** — curated sets of tasks with recommended scorers. Use them to benchmark providers without writing any tasks yourself.
|
|
115
137
|
|
|
116
138
|
```bash
|
|
117
|
-
|
|
139
|
+
# List available packs
|
|
140
|
+
npx duelist run --pack list
|
|
141
|
+
|
|
142
|
+
# Run a pack with your providers
|
|
143
|
+
npx duelist run --pack structured-output --config arena.config.ts
|
|
118
144
|
```
|
|
119
145
|
|
|
120
|
-
|
|
146
|
+
Your config only needs to define providers — the pack supplies tasks and scorers:
|
|
121
147
|
|
|
122
|
-
|
|
148
|
+
```ts
|
|
149
|
+
// arena.config.ts
|
|
150
|
+
import { defineArena, openai, anthropic } from 'agent-duelist'
|
|
123
151
|
|
|
124
|
-
|
|
152
|
+
export default defineArena({
|
|
153
|
+
providers: [
|
|
154
|
+
openai('gpt-5-mini'),
|
|
155
|
+
anthropic('claude-sonnet-4.6'),
|
|
156
|
+
],
|
|
157
|
+
tasks: [], // ignored when --pack is used
|
|
158
|
+
scorers: [], // pack supplies its own scorers
|
|
159
|
+
})
|
|
160
|
+
```
|
|
125
161
|
|
|
126
|
-
|
|
162
|
+
### Available packs
|
|
127
163
|
|
|
128
|
-
|
|
164
|
+
| Pack | Tasks | Description |
|
|
165
|
+
|------|-------|-------------|
|
|
166
|
+
| `structured-output` | 6 | Zod schema stress test — flat objects, nesting, arrays, enums, empty arrays, and adversarial input |
|
|
129
167
|
|
|
130
|
-
|
|
131
|
-
- Wrap or extend providers in your own code.
|
|
132
|
-
- Mock providers in tests.
|
|
168
|
+
Packs work with both `run` and `ci` commands:
|
|
133
169
|
|
|
134
|
-
|
|
170
|
+
```bash
|
|
171
|
+
# CI with a task pack
|
|
172
|
+
npx duelist ci --pack structured-output --threshold correctness=0.1 --budget 1.00
|
|
173
|
+
```
|
|
174
|
+
|
|
175
|
+
You can also combine multiple packs:
|
|
176
|
+
|
|
177
|
+
```bash
|
|
178
|
+
npx duelist run --pack structured-output,another-pack
|
|
179
|
+
```
|
|
180
|
+
|
|
181
|
+
---
|
|
182
|
+
|
|
183
|
+
## Core concepts
|
|
184
|
+
|
|
185
|
+
### Providers
|
|
186
|
+
|
|
187
|
+
Providers are **factory functions** that return plain objects implementing a shared `ArenaProvider` interface. This lets you swap providers without changing tasks, wrap or extend providers in your own code, and mock providers in tests.
|
|
135
188
|
|
|
136
189
|
```ts
|
|
137
190
|
import {
|
|
@@ -140,25 +193,40 @@ import {
|
|
|
140
193
|
anthropic,
|
|
141
194
|
gemini,
|
|
142
195
|
openaiCompatible,
|
|
143
|
-
type ArenaProvider,
|
|
144
196
|
} from 'agent-duelist'
|
|
145
197
|
|
|
198
|
+
// OpenAI
|
|
146
199
|
const oai = openai('gpt-5-mini')
|
|
147
200
|
|
|
201
|
+
// Azure OpenAI
|
|
148
202
|
const azure = azureOpenai('gpt-5-mini', {
|
|
149
203
|
deployment: 'my-deployment',
|
|
150
204
|
})
|
|
151
205
|
|
|
206
|
+
// Anthropic
|
|
152
207
|
const claude = anthropic('claude-sonnet-4.6')
|
|
153
208
|
|
|
154
|
-
|
|
209
|
+
// Google Gemini
|
|
210
|
+
const gem = gemini('gemini-3-flash-preview')
|
|
155
211
|
|
|
156
|
-
|
|
212
|
+
// Any OpenAI-compatible gateway (Ollama, LiteLLM, vLLM, etc.)
|
|
213
|
+
const local = openaiCompatible({
|
|
157
214
|
id: 'local/llama',
|
|
158
|
-
name: 'Local
|
|
215
|
+
name: 'Local Ollama',
|
|
159
216
|
baseURL: 'http://localhost:11434/v1',
|
|
160
217
|
model: 'llama3.3',
|
|
161
218
|
apiKeyEnv: 'LOCAL_LLM_API_KEY',
|
|
219
|
+
free: true, // registers zero-cost pricing
|
|
220
|
+
})
|
|
221
|
+
|
|
222
|
+
// Reasoning models that emit <think> blocks (DeepSeek-R1, MiniMax M2.5, etc.)
|
|
223
|
+
const deepseek = openaiCompatible({
|
|
224
|
+
id: 'deepseek/r1',
|
|
225
|
+
name: 'DeepSeek R1',
|
|
226
|
+
baseURL: 'https://api.deepseek.com/v1',
|
|
227
|
+
model: 'deepseek-reasoner',
|
|
228
|
+
apiKeyEnv: 'DEEPSEEK_API_KEY',
|
|
229
|
+
stripThinking: true, // strips <think>...</think> from output
|
|
162
230
|
})
|
|
163
231
|
```
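The README describes `stripThinking` only as stripping `<think>...</think>` from output. As a rough sketch of what such post-processing could look like, assuming a non-greedy regex over un-nested blocks is enough (`stripThinkBlocks` is a hypothetical helper for illustration, not the library's code):

```ts
// Hypothetical helper: removes <think>...</think> blocks from model output.
// Assumes blocks are not nested and the tags are spelled exactly as shown.
function stripThinkBlocks(output: string): string {
  return output.replace(/<think>[\s\S]*?<\/think>/g, '').trim()
}

const rawOutput = '<think>The user wants a city name.</think>\nAmsterdam'
const cleanedOutput = stripThinkBlocks(rawOutput)
console.log(cleanedOutput)
```

Stripping before scoring matters because exact-match and schema scorers would otherwise see the reasoning text as part of the answer.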
|
|
164
232
|
|
|
@@ -185,6 +253,7 @@ interface ArenaTask {
|
|
|
185
253
|
prompt: string
|
|
186
254
|
expected?: unknown // used by correctness scorers
|
|
187
255
|
schema?: ZodSchema<any> // used by schema-based scorers
|
|
256
|
+
tools?: ToolDefinition[] // used by tool-calling scorers
|
|
188
257
|
}
|
|
189
258
|
```
|
|
190
259
|
|
|
@@ -213,7 +282,7 @@ const tasks: ArenaTask[] = [
|
|
|
213
282
|
|
|
214
283
|
### Scorers
|
|
215
284
|
|
|
216
|
-
Scorers
|
|
285
|
+
Scorers turn raw model outputs into **numeric scores** (0–1) with optional details. Seven built-in scorers ship out of the box:
|
|
217
286
|
|
|
218
287
|
| Scorer | What it measures |
|
|
219
288
|
|--------|-----------------|
|
|
@@ -222,7 +291,8 @@ Scorers take raw model outputs and turn them into **numeric scores** (0–1) wit
|
|
|
222
291
|
| `correctness` | Exact match against `expected` (deep-equal, key-order independent for objects) |
|
|
223
292
|
| `schema-correctness` | Validates output against the task's Zod `schema` via `safeParse()` |
|
|
224
293
|
| `fuzzy-similarity` | Jaccard token-overlap similarity between output and `expected` |
|
|
225
|
-
| `
|
|
294
|
+
| `tool-usage` | Whether the model invoked the expected tool(s) during a tool-calling task |
|
|
295
|
+
| `llm-judge-correctness` | LLM-as-judge — calls a judge model to score accuracy, completeness, and conciseness |
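For readers unfamiliar with the Jaccard measure named in the `fuzzy-similarity` row, here is a minimal sketch. The tokenization (lowercase, whitespace-separated words) and the function names are assumptions made for illustration; the scorer's real tokenizer may differ.

```ts
// Assumption: tokens are lowercase, whitespace-separated words, de-duplicated.
function tokenize(text: string): string[] {
  const words = text.toLowerCase().split(/\s+/).filter((w) => w.length > 0)
  return words.filter((w, i) => words.indexOf(w) === i)
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, a value between 0 and 1.
function jaccardSimilarity(a: string, b: string): number {
  const ta = tokenize(a)
  const tb = tokenize(b)
  if (ta.length === 0 && tb.length === 0) return 1 // two empty texts are treated as identical here
  const shared = ta.filter((w) => tb.indexOf(w) !== -1).length
  const union = ta.length + tb.length - shared
  return shared / union
}

// Two of four distinct tokens are shared.
const similarity = jaccardSimilarity('the cat sat', 'the cat ran')
console.log(similarity)
```

Unlike exact-match `correctness`, this kind of score gives partial credit when the output overlaps with `expected` but is not identical.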
|
|
226
296
|
|
|
227
297
|
Configure them in your arena:
|
|
228
298
|
|
|
@@ -242,8 +312,6 @@ defineArena({
|
|
|
242
312
|
|
|
243
313
|
The judge model defaults to `gpt-5-mini`. It can also be set via the `DUELIST_JUDGE_MODEL` env var. The judge backend is auto-detected from the model name — `gemini-*` models use Google's API, otherwise it falls back to OpenAI or Azure OpenAI.
|
|
244
314
|
|
|
245
|
-
You can also add custom scorers for domain-specific metrics (e.g. tool-call correctness, safety, style).
|
|
246
|
-
|
|
247
315
|
---
|
|
248
316
|
|
|
249
317
|
### Arena options
|
|
@@ -252,7 +320,7 @@ You can also add custom scorers for domain-specific metrics (e.g. tool-call corr
|
|
|
252
320
|
|
|
253
321
|
| Option | Type | Default | Description |
|
|
254
322
|
|--------|------|---------|-------------|
|
|
255
|
-
| `runs` | `number` | `1` | Number of runs per provider
|
|
323
|
+
| `runs` | `number` | `1` | Number of runs per provider × task combination. Higher values improve statistical confidence for CI regression detection. |
|
|
256
324
|
| `judgeModel` | `string` | `'gpt-5-mini'` | Model used by the `llm-judge-correctness` scorer. Also settable via `DUELIST_JUDGE_MODEL` env var. Gemini models auto-route to Google's API. |
|
|
257
325
|
| `timeout` | `number` | `60000` | Per-request timeout in milliseconds. Requests exceeding this are marked as failures. Prevents hanging on unresponsive APIs. |
|
|
258
326
|
| `sparklines` | `boolean` | `true` | Show sparkline bars next to percentage scores in the console reporter. Disable with `false` if your terminal doesn't render Unicode block characters well. |
|
|
@@ -273,43 +341,62 @@ export default defineArena({
|
|
|
273
341
|
|
|
274
342
|
---
|
|
275
343
|
|
|
344
|
+
## Reporters
|
|
345
|
+
|
|
346
|
+
agent-duelist includes four output formats, each suited to a different workflow:
|
|
347
|
+
|
|
348
|
+
| Reporter | Flag | Use case |
|
|
349
|
+
|----------|------|----------|
|
|
350
|
+
| **Console** | `--reporter console` (default) | Interactive development — box-drawing tables with medals, sparklines, color-ranked metrics, and per-task winners |
|
|
351
|
+
| **JSON** | `--reporter json` | CI pipelines, dashboards, and downstream tooling |
|
|
352
|
+
| **HTML** | `--reporter html` | Shareable single-file reports with sortable tables, animated backgrounds, tab navigation, CSS progress bars, medal rankings, and summary cards |
|
|
353
|
+
| **Markdown** | `--comment` (CI mode) | Auto-posted PR comment with comparison table, cost summary, and pass/fail verdict |
|
|
354
|
+
|
|
355
|
+
Generate an HTML report:
|
|
356
|
+
|
|
357
|
+
```bash
|
|
358
|
+
npx duelist run --reporter html --output report.html
|
|
359
|
+
```
|
|
360
|
+
|
|
361
|
+
The HTML report is a single self-contained file — no external dependencies, no build step. Open it in any browser or host it as a static page.
|
|
362
|
+
|
|
363
|
+
---
|
|
364
|
+
|
|
276
365
|
## Cost & pricing
|
|
277
366
|
|
|
278
367
|
Cost estimation is intentionally transparent and conservative:
|
|
279
368
|
|
|
280
|
-
1. **Token counts**
|
|
281
|
-
Providers return token usage (prompt and completion tokens) in each `TaskResult`. These are treated as the source of truth.
|
|
282
|
-
|
|
283
|
-
2. **Pricing catalog**
|
|
284
|
-
`agent-duelist` ships with a **locally bundled catalog** of per-token prices for many models, derived from OpenRouter's public pricing pages.
|
|
369
|
+
1. **Token counts** — Providers return token usage (prompt and completion tokens) in each `TaskResult`. These are the source of truth.
|
|
285
370
|
|
|
286
|
-
|
|
287
|
-
-
|
|
371
|
+
2. **Pricing catalog** — `agent-duelist` ships with a **locally bundled catalog** of per-token prices for many models, derived from OpenRouter's public pricing pages.
|
|
372
|
+
- The catalog maps `(provider, model)` to `{ inputPerM, outputPerM }` in USD per 1M tokens.
|
|
373
|
+
- Azure OpenAI models resolve back to their base OpenAI models (e.g. `azure/gpt-5-mini` → `openai/gpt-5-mini`).
|
|
374
|
+
- Cross-provider fallback: models hosted on Groq, Together, Fireworks, etc. resolve to the original provider's pricing.
|
|
288
375
|
|
|
289
|
-
3. **Estimated USD**
|
|
290
|
-
The `cost` scorer computes:
|
|
376
|
+
3. **Estimated USD** — The `cost` scorer computes:
|
|
291
377
|
|
|
292
378
|
```text
|
|
293
379
|
estimatedUsd = (promptTokens * inputPerM + completionTokens * outputPerM) / 1_000_000
|
|
294
380
|
```
|
|
295
381
|
|
|
296
|
-
|
|
382
|
+
4. **Unknown models** — If a model is not in the catalog, tokens are still reported and cost is marked as unknown (no fake numbers).
|
|
297
383
|
|
|
298
|
-
|
|
299
|
-
- cost: `~$0.0XXm` (millicents; fractions of a cent)
|
|
300
|
-
- a short disclaimer that this is an **estimate** based on a pricing snapshot.
|
|
384
|
+
5. **Custom pricing** — Register pricing for models not in the catalog:
|
|
301
385
|
|
|
302
|
-
|
|
303
|
-
|
|
386
|
+
```ts
|
|
387
|
+
import { registerPricing } from 'agent-duelist'
|
|
304
388
|
|
|
305
|
-
-
|
|
306
|
-
|
|
389
|
+
registerPricing('custom/my-model', {
|
|
390
|
+
inputPerToken: 0.000003,
|
|
391
|
+
outputPerToken: 0.000015,
|
|
392
|
+
})
|
|
393
|
+
```
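The cost steps listed above can be sketched in a few lines. The `ModelPricing` and `TokenUsage` type names and the `estimateUsd` helper are assumptions for illustration; only the formula and the `inputPerM` / `outputPerM` field names come from the README.

```ts
// Illustrative types; field names follow the catalog shape described in the README.
interface ModelPricing {
  inputPerM: number // USD per 1M prompt tokens
  outputPerM: number // USD per 1M completion tokens
}

interface TokenUsage {
  promptTokens: number
  completionTokens: number
}

// Returns undefined when pricing is unknown, rather than inventing a number.
function estimateUsd(usage: TokenUsage, pricing?: ModelPricing): number | undefined {
  if (!pricing) return undefined
  return (
    (usage.promptTokens * pricing.inputPerM +
      usage.completionTokens * pricing.outputPerM) /
    1_000_000
  )
}

const usage: TokenUsage = { promptTokens: 1000, completionTokens: 500 }
// Example prices only: $3 and $15 per 1M tokens.
const knownCost = estimateUsd(usage, { inputPerM: 3, outputPerM: 15 })
const unknownCost = estimateUsd(usage, undefined)
console.log(knownCost, unknownCost)
```

With these example prices, 1,000 prompt tokens and 500 completion tokens come to roughly $0.0105, while the model without pricing yields `undefined`.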
|
|
307
394
|
|
|
308
|
-
You can update the catalog with
|
|
395
|
+
You can update the bundled catalog with `npm run update:pricing`, which re-scrapes OpenRouter's public pricing page.
|
|
309
396
|
|
|
310
397
|
---
|
|
311
398
|
|
|
312
|
-
## CLI
|
|
399
|
+
## CLI reference
|
|
313
400
|
|
|
314
401
|
### `duelist init`
|
|
315
402
|
|
|
@@ -317,30 +404,46 @@ Scaffold a new `arena.config.ts` in the current directory.
|
|
|
317
404
|
|
|
318
405
|
```bash
|
|
319
406
|
npx duelist init
|
|
407
|
+
npx duelist init --force # overwrite an existing config
|
|
320
408
|
```
|
|
321
409
|
|
|
410
|
+
| Option | Description |
|
|
411
|
+
|--------|-------------|
|
|
412
|
+
| `--force` | Overwrite existing config file |
|
|
413
|
+
|
|
322
414
|
### `duelist run`
|
|
323
415
|
|
|
324
416
|
Run benchmarks defined in your arena config.
|
|
325
417
|
|
|
326
418
|
```bash
|
|
327
|
-
#
|
|
419
|
+
# Default config, console output
|
|
328
420
|
npx duelist run
|
|
329
421
|
|
|
330
|
-
#
|
|
422
|
+
# Custom config
|
|
331
423
|
npx duelist run --config path/to/arena.config.ts
|
|
332
424
|
|
|
333
|
-
#
|
|
334
|
-
npx duelist run --
|
|
425
|
+
# Run a built-in task pack
|
|
426
|
+
npx duelist run --pack structured-output
|
|
427
|
+
|
|
428
|
+
# List available packs
|
|
429
|
+
npx duelist run --pack list
|
|
430
|
+
|
|
431
|
+
# JSON for piping
|
|
432
|
+
npx duelist run --reporter json > results.json
|
|
335
433
|
|
|
336
|
-
#
|
|
434
|
+
# HTML report
|
|
435
|
+
npx duelist run --reporter html --output report.html
|
|
436
|
+
|
|
437
|
+
# Quiet mode
|
|
337
438
|
npx duelist run --quiet
|
|
338
439
|
```
|
|
339
440
|
|
|
340
441
|
| Option | Description |
|
|
341
442
|
|--------|-------------|
|
|
342
443
|
| `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
|
|
343
|
-
| `--
|
|
444
|
+
| `--pack <names>` | Run built-in task pack(s) instead of config tasks. Comma-separated for multiple packs. Use `list` to show available packs. |
|
|
445
|
+
| `--reporter <type>` | Output format: `console` (default), `json`, or `html` |
|
|
446
|
+
| `--output <path>` | Output file path for HTML reporter (default: `duelist-report.html`) |
|
|
344
447
|
| `-q, --quiet` | Suppress per-result progress |
|
|
345
448
|
|
|
346
449
|
### `duelist ci`
|
|
@@ -354,6 +457,9 @@ npx duelist ci --update-baseline
|
|
|
354
457
|
# Subsequent runs — compare against baseline
|
|
355
458
|
npx duelist ci --threshold correctness=0.1 --budget 1.00
|
|
356
459
|
|
|
460
|
+
# Run CI with a task pack
|
|
461
|
+
npx duelist ci --pack structured-output --threshold correctness=0.1
|
|
462
|
+
|
|
357
463
|
# Post comparison table as a PR comment (GitHub Actions)
|
|
358
464
|
npx duelist ci --threshold correctness=0.1 --comment
|
|
359
465
|
```
|
|
@@ -361,6 +467,7 @@ npx duelist ci --threshold correctness=0.1 --comment
|
|
|
361
467
|
| Option | Description |
|
|
362
468
|
|--------|-------------|
|
|
363
469
|
| `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
|
|
470
|
+
| `--pack <names>` | Run built-in task pack(s) instead of config tasks |
|
|
364
471
|
| `--baseline <path>` | Baseline JSON file (default: `.duelist/baseline.json`) |
|
|
365
472
|
| `--budget <dollars>` | Max total cost in USD — fails if exceeded |
|
|
366
473
|
| `--threshold <scorer=delta>` | Regression threshold (repeatable, e.g. `--threshold correctness=0.1 --threshold cost=0.002`) |
|
|
@@ -375,24 +482,61 @@ npx duelist ci --threshold correctness=0.1 --comment
|
|
|
375
482
|
- Without `--threshold` flags, regression detection is skipped entirely — only `--budget` is enforced.
|
|
376
483
|
- Results with high variance (coefficient of variation, CV > 0.3) are flagged as **flaky** with a warning.
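To make the flakiness rule concrete, here is a small sketch of a coefficient-of-variation check. It assumes the population standard deviation and the helper names shown; the README states only the 0.3 cutoff, so the exact statistic used by the CLI may differ.

```ts
// Coefficient of variation: standard deviation divided by mean.
// Assumption: population standard deviation (divide by n, not n - 1).
function coefficientOfVariation(samples: number[]): number {
  if (samples.length === 0) return 0
  const mean = samples.reduce((sum, x) => sum + x, 0) / samples.length
  if (mean === 0) return 0 // avoid dividing by zero
  const variance =
    samples.reduce((sum, x) => sum + (x - mean) * (x - mean), 0) / samples.length
  return Math.sqrt(variance) / mean
}

const FLAKY_CV = 0.3 // cutoff quoted in the README

function isFlaky(samples: number[]): boolean {
  return coefficientOfVariation(samples) > FLAKY_CV
}

const stableRuns = isFlaky([0.9, 0.9, 0.9])
const noisyRuns = isFlaky([0.2, 0.9, 0.4])
console.log(stableRuns, noisyRuns)
```

A check like this only becomes meaningful with `runs` greater than 1, since a single sample has no variance.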
|
|
377
484
|
|
|
378
|
-
The CLI loads TypeScript configs directly using a lightweight runtime loader so
|
|
485
|
+
The CLI loads TypeScript configs directly using a lightweight runtime loader so you don't need to precompile your config.
|
|
486
|
+
|
|
487
|
+
---
|
|
488
|
+
|
|
489
|
+
## Tool-calling agent example
|
|
490
|
+
|
|
491
|
+
agent-duelist supports tool-calling tasks — define tools with Zod-typed parameters and handlers, and the provider will execute them during the benchmark:
|
|
492
|
+
|
|
493
|
+
```ts
|
|
494
|
+
import { defineArena, openai } from 'agent-duelist'
|
|
495
|
+
import { z } from 'zod'
|
|
496
|
+
|
|
497
|
+
const weatherTool = {
|
|
498
|
+
name: 'getCurrentWeather',
|
|
499
|
+
description: 'Get the current weather in a given city',
|
|
500
|
+
parameters: z.object({ city: z.string() }),
|
|
501
|
+
handler: async ({ city }: { city: string }) => ({
|
|
502
|
+
city,
|
|
503
|
+
tempC: 20,
|
|
504
|
+
}),
|
|
505
|
+
}
|
|
506
|
+
|
|
507
|
+
export default defineArena({
|
|
508
|
+
providers: [openai('gpt-5-mini')],
|
|
509
|
+
tasks: [
|
|
510
|
+
{
|
|
511
|
+
name: 'weather-tool-call',
|
|
512
|
+
prompt: 'What is the current temperature in Amsterdam? Use the tool.',
|
|
513
|
+
expected: { city: 'Amsterdam' },
|
|
514
|
+
tools: [weatherTool],
|
|
515
|
+
},
|
|
516
|
+
],
|
|
517
|
+
scorers: ['latency', 'cost', 'tool-usage'],
|
|
518
|
+
runs: 1,
|
|
519
|
+
})
|
|
520
|
+
```
|
|
521
|
+
|
|
522
|
+
The model calls `getCurrentWeather`, the handler returns a stub result, and the `tool-usage` scorer reports whether the expected tool was invoked. Tool calls and their results are included in the JSON output for inspection.
|
|
379
523
|
|
|
380
524
|
---
|
|
381
525
|
|
|
382
526
|
## Example: multi-provider benchmark
|
|
383
527
|
|
|
384
|
-
|
|
528
|
+
A richer example comparing multiple providers across tasks:
|
|
385
529
|
|
|
386
530
|
```ts
|
|
387
531
|
// arena.config.ts
|
|
388
|
-
import { defineArena, azureOpenai } from 'agent-duelist'
|
|
532
|
+
import { defineArena, azureOpenai, openai, gemini } from 'agent-duelist'
|
|
389
533
|
import { z } from 'zod'
|
|
390
534
|
|
|
391
535
|
export default defineArena({
|
|
392
536
|
providers: [
|
|
393
|
-
|
|
537
|
+
openai('gpt-5-mini'),
|
|
394
538
|
azureOpenai('gpt-5-nano'),
|
|
395
|
-
|
|
539
|
+
gemini('gemini-3-flash-preview'),
|
|
396
540
|
],
|
|
397
541
|
tasks: [
|
|
398
542
|
{
|
|
@@ -425,50 +569,13 @@ export default defineArena({
|
|
|
425
569
|
npx duelist run
|
|
426
570
|
```
|
|
427
571
|
|
|
428
|
-
Output includes box-drawing tables with medals, color-ranked metrics, sparkline bars, and a winner row per task — see the [screenshot above](#what-you-get) for a real example.
|
|
429
|
-
|
|
430
572
|
**How scoring works:**
|
|
431
573
|
|
|
432
574
|
- Providers are compared **head-to-head within each task** — all providers receive the same prompt at the same time.
|
|
433
575
|
- Medals are awarded only when a provider is the **sole leader** in a metric column. Ties don't award medals, keeping rankings meaningful.
|
|
434
|
-
-
|
|
435
|
-
|
|
436
|
-
|
|
437
|
-
|
|
438
|
-
## Tool-calling agent example
|
|
439
|
-
|
|
440
|
-
agent-duelist supports tool-calling tasks — define tools with handlers, and the provider will execute them during the benchmark:
|
|
441
|
-
|
|
442
|
-
```ts
|
|
443
|
-
import { defineArena, openai } from 'agent-duelist'
|
|
444
|
-
import { z } from 'zod'
|
|
445
|
-
|
|
446
|
-
const weatherTool = {
|
|
447
|
-
name: 'getCurrentWeather',
|
|
448
|
-
description: 'Get the current weather in a given city',
|
|
449
|
-
parameters: z.object({ city: z.string() }),
|
|
450
|
-
handler: async ({ city }: { city: string }) => ({
|
|
451
|
-
city,
|
|
452
|
-
tempC: 20,
|
|
453
|
-
}),
|
|
454
|
-
}
|
|
455
|
-
|
|
456
|
-
export default defineArena({
|
|
457
|
-
providers: [openai('gpt-5-mini')],
|
|
458
|
-
tasks: [
|
|
459
|
-
{
|
|
460
|
-
name: 'weather-tool-call',
|
|
461
|
-
prompt: 'What is the current temperature in Amsterdam? Use the tool.',
|
|
462
|
-
expected: { city: 'Amsterdam' },
|
|
463
|
-
tools: [weatherTool],
|
|
464
|
-
},
|
|
465
|
-
],
|
|
466
|
-
scorers: ['latency', 'cost', 'tool-usage'],
|
|
467
|
-
runs: 1,
|
|
468
|
-
})
|
|
469
|
-
```
|
|
470
|
-
|
|
471
|
-
The model calls `getCurrentWeather`, the handler returns a stub result, and the `tool-usage` scorer reports whether the expected tool was invoked. Tool calls and their results are included in the JSON output for inspection.
|
|
576
|
+
- **Quality-first ranking**: medals are decided by quality scorer wins (correctness, schema-correctness, etc.). Efficiency metrics (latency, cost) only break ties among quality-equal providers.
|
|
577
|
+
- **Quality gate**: providers that score 0% on all quality scorers are ineligible for medals entirely — being fast or cheap doesn't compensate for getting nothing right.
|
|
578
|
+
- The overall winner is the provider with the highest average correctness score.
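The scoring rules above can be illustrated with a short sketch. Everything here is an assumption layered on the README's description: the `ProviderTally` shape, how wins are counted, and the function names are hypothetical, not the package's internals.

```ts
// Hypothetical per-provider tally; not a type exported by agent-duelist.
interface ProviderTally {
  id: string
  qualityWins: number // sole-leader wins on quality scorers
  efficiencyWins: number // sole-leader wins on latency or cost
  bestQualityScore: number // highest score on any quality scorer, 0..1
}

// Index of the strictly highest value, or -1 when the lead is tied.
function soleLeader(values: number[]): number {
  let bestIndex = -1
  let bestValue = -Infinity
  let tied = false
  values.forEach((v, i) => {
    if (v > bestValue) {
      bestValue = v
      bestIndex = i
      tied = false
    } else if (v === bestValue) {
      tied = true
    }
  })
  return tied ? -1 : bestIndex
}

// Quality gate first, then quality wins, then efficiency wins as a tie-breaker.
function rankForMedals(tallies: ProviderTally[]): string[] {
  return tallies
    .filter((t) => t.bestQualityScore > 0)
    .sort((a, b) => b.qualityWins - a.qualityWins || b.efficiencyWins - a.efficiencyWins)
    .map((t) => t.id)
}

const ranking = rankForMedals([
  { id: 'fast-but-wrong', qualityWins: 0, efficiencyWins: 2, bestQualityScore: 0 },
  { id: 'provider-a', qualityWins: 1, efficiencyWins: 0, bestQualityScore: 1 },
  { id: 'provider-b', qualityWins: 1, efficiencyWins: 1, bestQualityScore: 0.8 },
])
console.log(ranking)
```

In this example the fastest provider is excluded by the quality gate, and the two remaining providers are tied on quality wins, so efficiency decides the order.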
|
|
472
579
|
|
|
473
580
|
---
|
|
474
581
|
|
|
@@ -516,8 +623,8 @@ When `--comment` is enabled, the CI posts (or updates) a markdown table on the P
|
|
|
516
623
|
|
|
517
624
|
| Provider | Task | Scorer | Baseline | Current | Delta | Status |
|
|
518
625
|
|----------|------|--------|----------|---------|-------|--------|
|
|
519
|
-
| openai/gpt-5-mini | extract | correctness | 0.900 | 0.850 | -0.050 |
|
|
520
|
-
| openai/gpt-5-mini | extract | latency | 0.920 ± 0.030 | 0.890 ± 0.025 | -0.030 |
|
|
626
|
+
| openai/gpt-5-mini | extract | correctness | 0.900 | 0.850 | -0.050 | unchanged |
|
|
627
|
+
| openai/gpt-5-mini | extract | latency | 0.920 ± 0.030 | 0.890 ± 0.025 | -0.030 | unchanged |
|
|
521
628
|
|
|
522
629
|
With cost summary, flakiness warnings, and pass/fail verdict.
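The Delta and Status columns can be read as a threshold comparison. The sketch below is a simplification that ignores confidence intervals; the `compareToBaseline` helper and the `regressed` / `improved` labels are hypothetical (only `unchanged` appears in the sample table).

```ts
// Only 'unchanged' is shown in the README table; the other labels are illustrative.
type Status = 'regressed' | 'improved' | 'unchanged'

// Assumes scores are normalized so that higher is better.
function compareToBaseline(
  baseline: number,
  current: number,
  threshold: number,
): { delta: number; status: Status } {
  const delta = current - baseline
  if (delta < -threshold) return { delta, status: 'regressed' }
  if (delta > threshold) return { delta, status: 'improved' }
  return { delta, status: 'unchanged' }
}

// Mirrors the first table row with --threshold correctness=0.1
const withinThreshold = compareToBaseline(0.9, 0.85, 0.1)
// A larger drop would exceed the same threshold
const beyondThreshold = compareToBaseline(0.9, 0.7, 0.1)
console.log(withinThreshold.status, beyondThreshold.status)
```

Under this reading, a drop of 0.050 stays within a 0.1 threshold, which matches the `unchanged` status in the sample rows.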
|
|
523
630
|
|
|
@@ -525,34 +632,31 @@ With cost summary, flakiness warnings, and pass/fail verdict.
|
|
|
525
632
|
|
|
526
633
|
## Roadmap
|
|
527
634
|
|
|
528
|
-
Shipped
|
|
635
|
+
**Shipped:**
|
|
529
636
|
|
|
530
|
-
- OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible
|
|
637
|
+
- 5 provider types: OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible gateway
|
|
531
638
|
- 7 built-in scorers including LLM-as-judge, tool-usage, schema validation, and fuzzy similarity
|
|
532
639
|
- Tool-calling support with local handlers for agent task benchmarking
|
|
533
|
-
-
|
|
534
|
-
-
|
|
535
|
-
-
|
|
536
|
-
- JSON
|
|
537
|
-
-
|
|
538
|
-
- `duelist ci` command with regression detection, cost budgets, and flakiness warnings
|
|
640
|
+
- **Task packs**: built-in benchmark suites (`structured-output`) — run with `--pack`, no config writing needed
|
|
641
|
+
- Quality-first medal ranking: output quality decides medals, efficiency only breaks ties
|
|
642
|
+
- Fair head-to-head benchmarking with parallel provider execution
|
|
643
|
+
- 4 reporters: console (tables + medals + sparklines), JSON, HTML (sortable, self-contained), and Markdown (PR comments)
|
|
644
|
+
- `duelist ci` with regression detection (confidence intervals), cost budgets, and flakiness warnings
|
|
539
645
|
- GitHub Action for CI/CD integration
|
|
540
|
-
- Pricing catalog
|
|
646
|
+
- Pricing catalog with cross-provider fallback and `registerPricing()` for custom models
|
|
647
|
+
- `openaiCompatible` with `stripThinking` for reasoning models and `free` flag for local models
|
|
648
|
+
- Configurable per-request timeout
|
|
541
649
|
|
|
542
|
-
Planned
|
|
650
|
+
**Planned** (subject to community feedback):
|
|
543
651
|
|
|
544
|
-
- **More
|
|
545
|
-
|
|
546
|
-
- **
|
|
547
|
-
|
|
548
|
-
- **
|
|
549
|
-
|
|
550
|
-
- **Plugin system**
|
|
551
|
-
- First-class support for user-defined providers and scorers.
|
|
552
|
-
- **Embedding-based scoring**
|
|
553
|
-
- Semantic similarity via embedding distance.
|
|
652
|
+
- **More task packs** — reasoning, summarization, tool-calling, and multi-turn conversation packs
|
|
653
|
+
- **Agent workflows** — multi-step tool chains, multi-hop reasoning, and agent traces
|
|
654
|
+
- **More export formats** — CSV
|
|
655
|
+
- **Plugin system** — first-class support for user-defined providers and scorers
|
|
656
|
+
- **Embedding-based scoring** — semantic similarity via embedding distance
|
|
657
|
+
- **More providers** — OpenRouter-native and additional OpenAI-compatible gateways
|
|
554
658
|
|
|
555
|
-
|
|
659
|
+
Have a use case in mind? [Open an issue](https://github.com/DataGobes/agent-duelist/issues) — community feedback shapes what gets built first.
|
|
556
660
|
|
|
557
661
|
---
|
|
558
662
|
|
|
@@ -560,18 +664,18 @@ If you have a specific use case (framework comparisons, multi-agent competitions
|
|
|
560
664
|
|
|
561
665
|
Contributions, issues, and feature requests are welcome.
|
|
562
666
|
|
|
563
|
-
- **Bug reports / ideas**: open a GitHub issue.
|
|
667
|
+
- **Bug reports / ideas**: [open a GitHub issue](https://github.com/DataGobes/agent-duelist/issues).
|
|
564
668
|
- **Code changes**:
|
|
565
|
-
|
|
566
|
-
|
|
567
|
-
|
|
568
|
-
|
|
569
|
-
|
|
669
|
+
1. Fork the repo.
|
|
670
|
+
2. Create a branch.
|
|
671
|
+
3. Run tests: `npm test`.
|
|
672
|
+
4. Run build: `npm run build`.
|
|
673
|
+
5. Open a PR with a clear description.
|
|
570
674
|
|
|
571
|
-
Please
|
|
675
|
+
Please keep PRs narrowly focused (single provider, one new scorer, etc.) so they're easy to review.
|
|
572
676
|
|
|
573
677
|
---
|
|
574
678
|
|
|
575
679
|
## License
|
|
576
680
|
|
|
577
|
-
MIT
|
|
681
|
+
MIT
|