agent-duelist 0.2.0 → 0.3.0
- package/README.md +251 -133
- package/dist/cli.js +4945 -2351
- package/dist/cli.js.map +1 -1
- package/dist/index.cjs +1405 -468
- package/dist/index.cjs.map +1 -1
- package/dist/index.d.cts +35 -9
- package/dist/index.d.ts +35 -9
- package/dist/index.js +1402 -468
- package/dist/index.js.map +1 -1
- package/package.json +1 -1
package/README.md
CHANGED
|
@@ -1,31 +1,46 @@
|
|
|
1
1
|
# agent-duelist
|
|
2
2
|
|
|
3
|
+
[](https://www.npmjs.com/package/agent-duelist)
|
|
3
4
|
[](https://github.com/DataGobes/agent-duelist/actions/workflows/ci.yml)
|
|
4
5
|
[](https://datagobes.github.io/agent-duelist/)
|
|
6
|
+
[](./LICENSE)
|
|
5
7
|
|
|
6
|
-
> Pit LLM providers against each other on agent tasks — Duel your models
|
|
8
|
+
> Pit LLM providers against each other on agent tasks — **Duel your models.**
|
|
7
9
|
>
|
|
8
10
|
> **[View the landing page →](https://datagobes.github.io/agent-duelist/)**
|
|
9
11
|
|
|
10
|
-
`agent-duelist` is a TypeScript-first framework
|
|
12
|
+
`agent-duelist` is a TypeScript-first benchmarking framework that runs the same tasks against multiple LLM providers and gives you structured, reproducible results: correctness, latency, tokens, cost, and more.
|
|
13
|
+
|
|
14
|
+
```bash
|
|
15
|
+
npx duelist init # scaffold a config
|
|
16
|
+
npx duelist run # see who wins
|
|
17
|
+
```
|
|
11
18
|
|
|
12
19
|
## What you get
|
|
13
|
-
> 
|
|
14
20
|
|
|
21
|
+
**Console output** — box-drawing tables with medals, color-ranked metrics, sparklines, and per-task winners:
|
|
22
|
+
|
|
23
|
+

|
|
24
|
+
|
|
25
|
+
**HTML report** — a self-contained, shareable single-file report with sortable tables, progress bars, tab navigation, and summary cards:
|
|
15
26
|
|
|
16
|
-
|
|
17
|
-
- Define tasks once, run them against many providers.
|
|
18
|
-
- Get CLI tables and JSON results you can feed into dashboards, CI, or docs.
|
|
27
|
+

|
|
19
28
|
|
|
20
29
|
---
|
|
21
30
|
|
|
22
31
|
## Why agent-duelist?
|
|
23
32
|
|
|
24
|
-
|
|
25
|
-
|
|
26
|
-
|
|
27
|
-
|
|
28
|
-
|
|
33
|
+
| | |
|
|
34
|
+
|---|---|
|
|
35
|
+
| **Provider-agnostic** | One config, many providers. Swap models and gateways without rewriting tasks. |
|
|
36
|
+
| **Agent-focused** | Built for agent workflows and tool use, not just single-turn prompts. |
|
|
37
|
+
| **Task packs** | Built-in benchmark suites — run with `--pack structured-output`, zero config needed. |
|
|
38
|
+
| **Quality-first ranking** | Medals decided by output quality. Speed and cost only break ties — a fast model that gets nothing right won't medal. |
|
|
39
|
+
| **7 built-in scorers** | Correctness, latency, cost, schema validation, fuzzy similarity, LLM-as-judge, and tool usage. |
|
|
40
|
+
| **Fair benchmarking** | Tasks run sequentially while providers race in parallel — fair latency comparison with no queue-induced penalties. |
|
|
41
|
+
| **TypeScript-native** | Strongly typed APIs, Zod schemas for structured outputs, and a simple `defineArena()` entrypoint. |
|
|
42
|
+
| **CI-ready** | Regression detection with confidence intervals, cost budgets, PR comments, and a prebuilt GitHub Action. |
|
|
43
|
+
| **CLI-first** | `npx duelist init` → `npx duelist run` gets you from zero to results in minutes. |
|
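The "Fair benchmarking" row describes a scheduling pattern: one task at a time, with every provider started together for that task. A minimal sketch of that pattern follows. `SketchProvider`, `TimedResult`, and `runArena` are hypothetical names for illustration, not the library's `ArenaProvider` interface or its actual runner.

```typescript
// Hypothetical provider shape, used only for this sketch.
interface SketchProvider {
  id: string
  run: (prompt: string) => Promise<string>
}

interface TimedResult {
  task: string
  providerId: string
  output: string
  latencyMs: number
}

async function runArena(
  tasks: { name: string; prompt: string }[],
  providers: SketchProvider[],
): Promise<TimedResult[]> {
  const results: TimedResult[] = []
  for (const task of tasks) {
    // Tasks run sequentially, so no request queues behind another task.
    // Within a task, all providers start at the same moment.
    const round = await Promise.all(
      providers.map(async (p) => {
        const start = Date.now()
        const output = await p.run(task.prompt)
        return {
          task: task.name,
          providerId: p.id,
          output,
          latencyMs: Date.now() - start,
        }
      }),
    )
    results.push(...round)
  }
  return results
}
```

Because `Promise.all` preserves input order, results come back grouped by task and, within each task, in the order the providers were listed.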
|
29
44
|
|
|
30
45
|
---
|
|
31
46
|
|
|
@@ -39,7 +54,7 @@ pnpm add agent-duelist
|
|
|
39
54
|
yarn add agent-duelist
|
|
40
55
|
```
|
|
41
56
|
|
|
42
|
-
|
|
57
|
+
Set API keys for the providers you want to benchmark:
|
|
43
58
|
|
|
44
59
|
```bash
|
|
45
60
|
export OPENAI_API_KEY=sk-...
|
|
@@ -50,7 +65,7 @@ export GOOGLE_API_KEY=...
|
|
|
50
65
|
|
|
51
66
|
---
|
|
52
67
|
|
|
53
|
-
##
|
|
68
|
+
## Quickstart
|
|
54
69
|
|
|
55
70
|
Initialize a config:
|
|
56
71
|
|
|
@@ -58,7 +73,7 @@ Initialize a config:
|
|
|
58
73
|
npx duelist init
|
|
59
74
|
```
|
|
60
75
|
|
|
61
|
-
This creates `arena.config.ts` in your project
|
|
76
|
+
This creates `arena.config.ts` in your project:
|
|
62
77
|
|
|
63
78
|
```ts
|
|
64
79
|
// arena.config.ts
|
|
@@ -98,33 +113,78 @@ Run the benchmark:
|
|
|
98
113
|
npx duelist run
|
|
99
114
|
```
|
|
100
115
|
|
|
101
|
-
You'll see a matrix
|
|
116
|
+
You'll see a results matrix:
|
|
102
117
|
|
|
103
|
-
- Rows
|
|
104
|
-
- Columns
|
|
105
|
-
- Cells
|
|
118
|
+
- **Rows**: tasks (`simple-qa`, `structured-extraction`)
|
|
119
|
+
- **Columns**: providers (`openai/gpt-5-mini`, `azure/gpt-5-mini`)
|
|
120
|
+
- **Cells**: correctness score, latency, tokens, and estimated cost
|
|
106
121
|
|
|
107
|
-
|
|
122
|
+
Export the results in different formats:
|
|
108
123
|
|
|
109
124
|
```bash
|
|
125
|
+
# JSON for CI pipelines and dashboards
|
|
110
126
|
npx duelist run --reporter json > results.json
|
|
127
|
+
|
|
128
|
+
# Self-contained HTML report you can share or host
|
|
129
|
+
npx duelist run --reporter html --output report.html
|
|
111
130
|
```
|
|
112
131
|
|
|
113
132
|
---
|
|
114
133
|
|
|
115
|
-
##
|
|
134
|
+
## Task packs
|
|
116
135
|
|
|
117
|
-
|
|
136
|
+
Task packs are **built-in benchmark suites** — curated sets of tasks with recommended scorers. Use them to benchmark providers without writing any tasks yourself.
|
|
118
137
|
|
|
119
|
-
|
|
138
|
+
```bash
|
|
139
|
+
# List available packs
|
|
140
|
+
npx duelist run --pack list
|
|
120
141
|
|
|
121
|
-
|
|
142
|
+
# Run a pack with your providers
|
|
143
|
+
npx duelist run --pack structured-output --config arena.config.ts
|
|
144
|
+
```
|
|
122
145
|
|
|
123
|
-
|
|
124
|
-
- Wrap or extend providers in your own code.
|
|
125
|
-
- Mock providers in tests.
|
|
146
|
+
Your config only needs to define providers — the pack supplies tasks and scorers:
|
|
126
147
|
|
|
127
|
-
|
|
148
|
+
```ts
|
|
149
|
+
// arena.config.ts
|
|
150
|
+
import { defineArena, openai, anthropic } from 'agent-duelist'
|
|
151
|
+
|
|
152
|
+
export default defineArena({
|
|
153
|
+
providers: [
|
|
154
|
+
openai('gpt-5-mini'),
|
|
155
|
+
anthropic('claude-sonnet-4.6'),
|
|
156
|
+
],
|
|
157
|
+
tasks: [], // ignored when --pack is used
|
|
158
|
+
scorers: [], // pack supplies its own scorers
|
|
159
|
+
})
|
|
160
|
+
```
|
|
161
|
+
|
|
162
|
+
### Available packs
|
|
163
|
+
|
|
164
|
+
| Pack | Tasks | Description |
|
|
165
|
+
|------|-------|-------------|
|
|
166
|
+
| `structured-output` | 6 | Zod schema stress test — flat objects, nesting, arrays, enums, empty arrays, and adversarial input |
|
|
167
|
+
|
|
168
|
+
Packs work with both `run` and `ci` commands:
|
|
169
|
+
|
|
170
|
+
```bash
|
|
171
|
+
# CI with a task pack
|
|
172
|
+
npx duelist ci --pack structured-output --threshold correctness=0.1 --budget 1.00
|
|
173
|
+
```
|
|
174
|
+
|
|
175
|
+
You can also combine multiple packs:
|
|
176
|
+
|
|
177
|
+
```bash
|
|
178
|
+
npx duelist run --pack structured-output,another-pack
|
|
179
|
+
```
|
|
180
|
+
|
|
181
|
+
---
|
|
182
|
+
|
|
183
|
+
## Core concepts
|
|
184
|
+
|
|
185
|
+
### Providers
|
|
186
|
+
|
|
187
|
+
Providers are **factory functions** that return plain objects implementing a shared `ArenaProvider` interface. This lets you swap providers without changing tasks, wrap or extend providers in your own code, and mock providers in tests.
|
|
128
188
|
|
|
129
189
|
```ts
|
|
130
190
|
import {
|
|
@@ -133,25 +193,40 @@ import {
|
|
|
133
193
|
anthropic,
|
|
134
194
|
gemini,
|
|
135
195
|
openaiCompatible,
|
|
136
|
-
type ArenaProvider,
|
|
137
196
|
} from 'agent-duelist'
|
|
138
197
|
|
|
198
|
+
// OpenAI
|
|
139
199
|
const oai = openai('gpt-5-mini')
|
|
140
200
|
|
|
201
|
+
// Azure OpenAI
|
|
141
202
|
const azure = azureOpenai('gpt-5-mini', {
|
|
142
203
|
deployment: 'my-deployment',
|
|
143
204
|
})
|
|
144
205
|
|
|
206
|
+
// Anthropic
|
|
145
207
|
const claude = anthropic('claude-sonnet-4.6')
|
|
146
208
|
|
|
147
|
-
|
|
209
|
+
// Google Gemini
|
|
210
|
+
const gem = gemini('gemini-3-flash-preview')
|
|
148
211
|
|
|
149
|
-
|
|
212
|
+
// Any OpenAI-compatible gateway (Ollama, LiteLLM, vLLM, etc.)
|
|
213
|
+
const local = openaiCompatible({
|
|
150
214
|
id: 'local/llama',
|
|
151
|
-
name: 'Local
|
|
215
|
+
name: 'Local Ollama',
|
|
152
216
|
baseURL: 'http://localhost:11434/v1',
|
|
153
217
|
model: 'llama3.3',
|
|
154
218
|
apiKeyEnv: 'LOCAL_LLM_API_KEY',
|
|
219
|
+
free: true, // registers zero-cost pricing
|
|
220
|
+
})
|
|
221
|
+
|
|
222
|
+
// Reasoning models that emit <think> blocks (DeepSeek-R1, MiniMax M2.5, etc.)
|
|
223
|
+
const deepseek = openaiCompatible({
|
|
224
|
+
id: 'deepseek/r1',
|
|
225
|
+
name: 'DeepSeek R1',
|
|
226
|
+
baseURL: 'https://api.deepseek.com/v1',
|
|
227
|
+
model: 'deepseek-reasoner',
|
|
228
|
+
apiKeyEnv: 'DEEPSEEK_API_KEY',
|
|
229
|
+
stripThinking: true, // strips <think>...</think> from output
|
|
155
230
|
})
|
|
156
231
|
```
|
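To illustrate what `stripThinking: true` is described as doing, here is a minimal sketch of removing `<think>...</think>` blocks from a model's output. `stripThinkBlocks` is a hypothetical helper written for this example; the library's own implementation may differ, for instance in how it treats unclosed tags.

```typescript
// Hypothetical helper: removes every <think>...</think> block from the text.
function stripThinkBlocks(text: string): string {
  // [\s\S] matches across newlines; the non-greedy *? keeps each
  // block separate instead of swallowing text between two blocks.
  return text.replace(/<think>[\s\S]*?<\/think>/g, '').trim()
}
```

Stripping matters for scorers such as `correctness` and `schema-correctness`, which compare or parse the final answer rather than the reasoning that preceded it.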
|
157
232
|
|
|
@@ -178,6 +253,7 @@ interface ArenaTask {
|
|
|
178
253
|
prompt: string
|
|
179
254
|
expected?: unknown // used by correctness scorers
|
|
180
255
|
schema?: ZodSchema<any> // used by schema-based scorers
|
|
256
|
+
tools?: ToolDefinition[] // used by tool-calling scorers
|
|
181
257
|
}
|
|
182
258
|
```
|
|
183
259
|
|
|
@@ -206,7 +282,7 @@ const tasks: ArenaTask[] = [
|
|
|
206
282
|
|
|
207
283
|
### Scorers
|
|
208
284
|
|
|
209
|
-
Scorers
|
|
285
|
+
Scorers turn raw model outputs into **numeric scores** (0–1) with optional details. Seven built-in scorers ship out of the box:
|
|
210
286
|
|
|
211
287
|
| Scorer | What it measures |
|
|
212
288
|
|--------|-----------------|
|
|
@@ -215,7 +291,8 @@ Scorers take raw model outputs and turn them into **numeric scores** (0–1) wit
|
|
|
215
291
|
| `correctness` | Exact match against `expected` (deep-equal, key-order independent for objects) |
|
|
216
292
|
| `schema-correctness` | Validates output against the task's Zod `schema` via `safeParse()` |
|
|
217
293
|
| `fuzzy-similarity` | Jaccard token-overlap similarity between output and `expected` |
|
|
218
|
-
| `
|
|
294
|
+
| `tool-usage` | Whether the model invoked the expected tool(s) during a tool-calling task |
|
|
295
|
+
| `llm-judge-correctness` | LLM-as-judge — calls a judge model to score accuracy, completeness, and conciseness |
|
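The `fuzzy-similarity` row names Jaccard token overlap: the size of the intersection of two token sets divided by the size of their union. A minimal sketch of that measure follows. The tokenizer (lowercase, split on non-alphanumerics) and the names `tokenize` and `jaccardSimilarity` are assumptions for illustration; the scorer's actual tokenization is not shown in this diff.

```typescript
// Assumed tokenizer: lowercase words and digits, everything else is a separator.
function tokenize(text: string): Set<string> {
  return new Set(
    text
      .toLowerCase()
      .split(/[^a-z0-9]+/)
      .filter((t) => t.length > 0),
  )
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, a value in [0, 1].
function jaccardSimilarity(a: string, b: string): number {
  const ta = tokenize(a)
  const tb = tokenize(b)
  // Assumption: two empty inputs count as identical.
  if (ta.size === 0 && tb.size === 0) return 1
  let intersection = 0
  ta.forEach((t) => {
    if (tb.has(t)) intersection++
  })
  const union = ta.size + tb.size - intersection
  return intersection / union
}
```

For example, "the cat sat" and "the cat ran" share two of four distinct tokens, so the score is 0.5.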
|
219
296
|
|
|
220
297
|
Configure them in your arena:
|
|
221
298
|
|
|
@@ -235,8 +312,6 @@ defineArena({
|
|
|
235
312
|
|
|
236
313
|
The judge model defaults to `gpt-5-mini`. It can also be set via the `DUELIST_JUDGE_MODEL` env var. The judge backend is auto-detected from the model name — `gemini-*` models use Google's API, otherwise it falls back to OpenAI or Azure OpenAI.
|
|
237
314
|
|
|
238
|
-
You can also add custom scorers for domain-specific metrics (e.g. tool-call correctness, safety, style).
|
|
239
|
-
|
|
240
315
|
---
|
|
241
316
|
|
|
242
317
|
### Arena options
|
|
@@ -245,7 +320,7 @@ You can also add custom scorers for domain-specific metrics (e.g. tool-call corr
|
|
|
245
320
|
|
|
246
321
|
| Option | Type | Default | Description |
|
|
247
322
|
|--------|------|---------|-------------|
|
|
248
|
-
| `runs` | `number` | `1` | Number of runs per provider
|
|
323
|
+
| `runs` | `number` | `1` | Number of runs per provider × task combination. Higher values improve statistical confidence for CI regression detection. |
|
|
249
324
|
| `judgeModel` | `string` | `'gpt-5-mini'` | Model used by the `llm-judge-correctness` scorer. Also settable via `DUELIST_JUDGE_MODEL` env var. Gemini models auto-route to Google's API. |
|
|
250
325
|
| `timeout` | `number` | `60000` | Per-request timeout in milliseconds. Requests exceeding this are marked as failures. Prevents hanging on unresponsive APIs. |
|
|
251
326
|
| `sparklines` | `boolean` | `true` | Show sparkline bars next to percentage scores in the console reporter. Disable with `false` if your terminal doesn't render Unicode block characters well. |
|
|
@@ -266,43 +341,62 @@ export default defineArena({
|
|
|
266
341
|
|
|
267
342
|
---
|
|
268
343
|
|
|
344
|
+
## Reporters
|
|
345
|
+
|
|
346
|
+
agent-duelist includes four output formats, each suited to a different workflow:
|
|
347
|
+
|
|
348
|
+
| Reporter | Flag | Use case |
|
|
349
|
+
|----------|------|----------|
|
|
350
|
+
| **Console** | `--reporter console` (default) | Interactive development — box-drawing tables with medals, sparklines, color-ranked metrics, and per-task winners |
|
|
351
|
+
| **JSON** | `--reporter json` | CI pipelines, dashboards, and downstream tooling |
|
|
352
|
+
| **HTML** | `--reporter html` | Shareable single-file reports with sortable tables, animated backgrounds, tab navigation, CSS progress bars, medal rankings, and summary cards |
|
|
353
|
+
| **Markdown** | `--comment` (CI mode) | Auto-posted PR comment with comparison table, cost summary, and pass/fail verdict |
|
|
354
|
+
|
|
355
|
+
Generate an HTML report:
|
|
356
|
+
|
|
357
|
+
```bash
|
|
358
|
+
npx duelist run --reporter html --output report.html
|
|
359
|
+
```
|
|
360
|
+
|
|
361
|
+
The HTML report is a single self-contained file — no external dependencies, no build step. Open it in any browser or host it as a static page.
|
|
362
|
+
|
|
363
|
+
---
|
|
364
|
+
|
|
269
365
|
## Cost & pricing
|
|
270
366
|
|
|
271
367
|
Cost estimation is intentionally transparent and conservative:
|
|
272
368
|
|
|
273
|
-
1. **Token counts**
|
|
274
|
-
Providers return token usage (prompt and completion tokens) in each `TaskResult`. These are treated as the source of truth.
|
|
369
|
+
1. **Token counts** — Providers return token usage (prompt and completion tokens) in each `TaskResult`. These are the source of truth.
|
|
275
370
|
|
|
276
|
-
2. **Pricing catalog**
|
|
277
|
-
|
|
371
|
+
2. **Pricing catalog** — `agent-duelist` ships with a **locally bundled catalog** of per-token prices for many models, derived from OpenRouter's public pricing pages.
|
|
372
|
+
- The catalog maps `(provider, model)` to `{ inputPerM, outputPerM }` in USD per 1M tokens.
|
|
373
|
+
- Azure OpenAI models resolve back to their base OpenAI models (e.g. `azure/gpt-5-mini` → `openai/gpt-5-mini`).
|
|
374
|
+
- Cross-provider fallback: models hosted on Groq, Together, Fireworks, etc. resolve to the original provider's pricing.
|
|
278
375
|
|
|
279
|
-
|
|
280
|
-
- Azure OpenAI models are resolved back to their base OpenAI models where possible (e.g. `azure/gpt-5-mini` → `openai/gpt-5-mini`) so you don't need to configure Azure pricing manually.
|
|
281
|
-
|
|
282
|
-
3. **Estimated USD**
|
|
283
|
-
The `cost` scorer computes:
|
|
376
|
+
3. **Estimated USD** — The `cost` scorer computes:
|
|
284
377
|
|
|
285
378
|
```text
|
|
286
379
|
estimatedUsd = (promptTokens * inputPerM + completionTokens * outputPerM) / 1_000_000
|
|
287
380
|
```
|
|
288
381
|
|
|
289
|
-
|
|
382
|
+
4. **Unknown models** — If a model is not in the catalog, tokens are still reported and cost is marked as unknown (no fake numbers).
|
|
290
383
|
|
|
291
|
-
|
|
292
|
-
- cost: `~$0.0XXm` (millicents; fractions of a cent)
|
|
293
|
-
- a short disclaimer that this is an **estimate** based on a pricing snapshot.
|
|
384
|
+
5. **Custom pricing** — Register pricing for models not in the catalog:
|
|
294
385
|
|
|
295
|
-
|
|
296
|
-
|
|
386
|
+
```ts
|
|
387
|
+
import { registerPricing } from 'agent-duelist'
|
|
297
388
|
|
|
298
|
-
-
|
|
299
|
-
|
|
389
|
+
registerPricing('custom/my-model', {
|
|
390
|
+
inputPerToken: 0.000003,
|
|
391
|
+
outputPerToken: 0.000015,
|
|
392
|
+
})
|
|
393
|
+
```
|
|
300
394
|
|
|
301
|
-
You can update the catalog with
|
|
395
|
+
You can update the bundled catalog with `npm run update:pricing`, which re-scrapes OpenRouter's public pricing page.
|
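The cost formula above can be worked through in a few lines. This is a sketch only: `ModelPricing` and `estimateUsd` are hypothetical names, the field names `inputPerM` and `outputPerM` come from the catalog description, and the prices are illustrative rather than real catalog entries.

```typescript
// Prices in USD per 1M tokens, matching the catalog's described shape.
interface ModelPricing {
  inputPerM: number
  outputPerM: number
}

// Hypothetical helper: returns undefined for unknown models
// instead of inventing a number.
function estimateUsd(
  promptTokens: number,
  completionTokens: number,
  pricing: ModelPricing | undefined,
): number | undefined {
  if (!pricing) return undefined
  return (
    (promptTokens * pricing.inputPerM + completionTokens * pricing.outputPerM) /
    1_000_000
  )
}

// Illustrative prices: 3 USD per 1M input tokens, 15 USD per 1M output tokens.
const examplePricing: ModelPricing = { inputPerM: 3, outputPerM: 15 }

// (1200 * 3 + 300 * 15) / 1_000_000 = 8100 / 1_000_000 = 0.0081 USD
const exampleCost = estimateUsd(1200, 300, examplePricing)
```

Note the unit difference between the two places prices appear: a per-token price of `0.000003` is the same rate as 3 USD per 1M tokens.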
|
302
396
|
|
|
303
397
|
---
|
|
304
398
|
|
|
305
|
-
## CLI
|
|
399
|
+
## CLI reference
|
|
306
400
|
|
|
307
401
|
### `duelist init`
|
|
308
402
|
|
|
@@ -310,30 +404,46 @@ Scaffold a new `arena.config.ts` in the current directory.
|
|
|
310
404
|
|
|
311
405
|
```bash
|
|
312
406
|
npx duelist init
|
|
407
|
+
npx duelist init --force # overwrite an existing config
|
|
313
408
|
```
|
|
314
409
|
|
|
410
|
+
| Option | Description |
|
|
411
|
+
|--------|-------------|
|
|
412
|
+
| `--force` | Overwrite existing config file |
|
|
413
|
+
|
|
315
414
|
### `duelist run`
|
|
316
415
|
|
|
317
416
|
Run benchmarks defined in your arena config.
|
|
318
417
|
|
|
319
418
|
```bash
|
|
320
|
-
#
|
|
419
|
+
# Default config, console output
|
|
321
420
|
npx duelist run
|
|
322
421
|
|
|
323
|
-
#
|
|
422
|
+
# Custom config
|
|
324
423
|
npx duelist run --config path/to/arena.config.ts
|
|
325
424
|
|
|
326
|
-
#
|
|
327
|
-
npx duelist run --
|
|
425
|
+
# Run a built-in task pack
|
|
426
|
+
npx duelist run --pack structured-output
|
|
427
|
+
|
|
428
|
+
# List available packs
|
|
429
|
+
npx duelist run --pack list
|
|
430
|
+
|
|
431
|
+
# JSON for piping
|
|
432
|
+
npx duelist run --reporter json > results.json
|
|
433
|
+
|
|
434
|
+
# HTML report
|
|
435
|
+
npx duelist run --reporter html --output report.html
|
|
328
436
|
|
|
329
|
-
#
|
|
437
|
+
# Quiet mode
|
|
330
438
|
npx duelist run --quiet
|
|
331
439
|
```
|
|
332
440
|
|
|
333
441
|
| Option | Description |
|
|
334
442
|
|--------|-------------|
|
|
335
443
|
| `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
|
|
336
|
-
| `--
|
|
444
|
+
| `--pack <names>` | Run built-in task pack(s) instead of config tasks. Comma-separated for multiple packs. Use `list` to show available packs. |
|
|
445
|
+
| `--reporter <type>` | Output format: `console` (default), `json`, or `html` |
|
|
446
|
+
| `--output <path>` | Output file path for HTML reporter (default: `duelist-report.html`) |
|
|
337
447
|
| `-q, --quiet` | Suppress per-result progress |
|
|
338
448
|
|
|
339
449
|
### `duelist ci`
|
|
@@ -347,6 +457,9 @@ npx duelist ci --update-baseline
|
|
|
347
457
|
# Subsequent runs — compare against baseline
|
|
348
458
|
npx duelist ci --threshold correctness=0.1 --budget 1.00
|
|
349
459
|
|
|
460
|
+
# Run CI with a task pack
|
|
461
|
+
npx duelist ci --pack structured-output --threshold correctness=0.1
|
|
462
|
+
|
|
350
463
|
# Post comparison table as a PR comment (GitHub Actions)
|
|
351
464
|
npx duelist ci --threshold correctness=0.1 --comment
|
|
352
465
|
```
|
|
@@ -354,6 +467,7 @@ npx duelist ci --threshold correctness=0.1 --comment
|
|
|
354
467
|
| Option | Description |
|
|
355
468
|
|--------|-------------|
|
|
356
469
|
| `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
|
|
470
|
+
| `--pack <names>` | Run built-in task pack(s) instead of config tasks |
|
|
357
471
|
| `--baseline <path>` | Baseline JSON file (default: `.duelist/baseline.json`) |
|
|
358
472
|
| `--budget <dollars>` | Max total cost in USD — fails if exceeded |
|
|
359
473
|
| `--threshold <scorer=delta>` | Regression threshold (repeatable, e.g. `--threshold correctness=0.1 --threshold cost=0.002`) |
|
|
@@ -368,24 +482,61 @@ npx duelist ci --threshold correctness=0.1 --comment
|
|
|
368
482
|
- Without `--threshold` flags, regression detection is skipped entirely — only `--budget` is enforced.
|
|
369
483
|
- Results with high variance (CV > 0.3) are flagged as **flaky** with a warning.
|
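The flakiness rule relies on the coefficient of variation (CV), the standard deviation divided by the mean. A minimal sketch of the check follows. `coefficientOfVariation` and `isFlaky` are hypothetical helpers, and the use of the sample standard deviation (dividing by n - 1) is an assumption; the CLI may compute it differently.

```typescript
// CV = standard deviation / |mean|. Returns 0 when it cannot be computed.
function coefficientOfVariation(samples: number[]): number {
  if (samples.length < 2) return 0
  const mean = samples.reduce((sum, x) => sum + x, 0) / samples.length
  if (mean === 0) return 0
  // Assumption: sample variance with Bessel's correction (n - 1).
  const variance =
    samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (samples.length - 1)
  return Math.sqrt(variance) / Math.abs(mean)
}

// Flags a result as flaky when CV exceeds the threshold (0.3 in the text above).
function isFlaky(samples: number[], threshold = 0.3): boolean {
  return coefficientOfVariation(samples) > threshold
}
```

With scores of 0.2, 1.0, and 0.6 across three runs, the mean is 0.6 and the sample standard deviation is 0.4, giving a CV of about 0.67, which would be flagged. This is also why a single run cannot be flagged: variance needs at least two samples.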
|
370
484
|
|
|
371
|
-
The CLI loads TypeScript configs directly using a lightweight runtime loader so
|
|
485
|
+
The CLI loads TypeScript configs directly using a lightweight runtime loader so you don't need to precompile your config.
|
|
486
|
+
|
|
487
|
+
---
|
|
488
|
+
|
|
489
|
+
## Tool-calling agent example
|
|
490
|
+
|
|
491
|
+
agent-duelist supports tool-calling tasks — define tools with Zod-typed parameters and handlers, and the provider will execute them during the benchmark:
|
|
492
|
+
|
|
493
|
+
```ts
|
|
494
|
+
import { defineArena, openai } from 'agent-duelist'
|
|
495
|
+
import { z } from 'zod'
|
|
496
|
+
|
|
497
|
+
const weatherTool = {
|
|
498
|
+
name: 'getCurrentWeather',
|
|
499
|
+
description: 'Get the current weather in a given city',
|
|
500
|
+
parameters: z.object({ city: z.string() }),
|
|
501
|
+
handler: async ({ city }: { city: string }) => ({
|
|
502
|
+
city,
|
|
503
|
+
tempC: 20,
|
|
504
|
+
}),
|
|
505
|
+
}
|
|
506
|
+
|
|
507
|
+
export default defineArena({
|
|
508
|
+
providers: [openai('gpt-5-mini')],
|
|
509
|
+
tasks: [
|
|
510
|
+
{
|
|
511
|
+
name: 'weather-tool-call',
|
|
512
|
+
prompt: 'What is the current temperature in Amsterdam? Use the tool.',
|
|
513
|
+
expected: { city: 'Amsterdam' },
|
|
514
|
+
tools: [weatherTool],
|
|
515
|
+
},
|
|
516
|
+
],
|
|
517
|
+
scorers: ['latency', 'cost', 'tool-usage'],
|
|
518
|
+
runs: 1,
|
|
519
|
+
})
|
|
520
|
+
```
|
|
521
|
+
|
|
522
|
+
The model calls `getCurrentWeather`, the handler returns a stub result, and the `tool-usage` scorer reports whether the expected tool was invoked. Tool calls and their results are included in the JSON output for inspection.
|
|
372
523
|
|
|
373
524
|
---
|
|
374
525
|
|
|
375
526
|
## Example: multi-provider benchmark
|
|
376
527
|
|
|
377
|
-
|
|
528
|
+
A richer example comparing multiple providers across tasks:
|
|
378
529
|
|
|
379
530
|
```ts
|
|
380
531
|
// arena.config.ts
|
|
381
|
-
import { defineArena, azureOpenai } from 'agent-duelist'
|
|
532
|
+
import { defineArena, azureOpenai, openai, gemini } from 'agent-duelist'
|
|
382
533
|
import { z } from 'zod'
|
|
383
534
|
|
|
384
535
|
export default defineArena({
|
|
385
536
|
providers: [
|
|
386
|
-
|
|
537
|
+
openai('gpt-5-mini'),
|
|
387
538
|
azureOpenai('gpt-5-nano'),
|
|
388
|
-
|
|
539
|
+
gemini('gemini-3-flash-preview'),
|
|
389
540
|
],
|
|
390
541
|
tasks: [
|
|
391
542
|
{
|
|
@@ -418,44 +569,13 @@ export default defineArena({
|
|
|
418
569
|
npx duelist run
|
|
419
570
|
```
|
|
420
571
|
|
|
421
|
-
|
|
422
|
-
|
|
423
|
-
---
|
|
424
|
-
|
|
425
|
-
## Tool-calling agent example
|
|
426
|
-
|
|
427
|
-
agent-duelist supports tool-calling tasks — define tools with handlers, and the provider will execute them during the benchmark:
|
|
428
|
-
|
|
429
|
-
```ts
|
|
430
|
-
import { defineArena, openai } from 'agent-duelist'
|
|
431
|
-
import { z } from 'zod'
|
|
432
|
-
|
|
433
|
-
const weatherTool = {
|
|
434
|
-
name: 'getCurrentWeather',
|
|
435
|
-
description: 'Get the current weather in a given city',
|
|
436
|
-
parameters: z.object({ city: z.string() }),
|
|
437
|
-
handler: async ({ city }: { city: string }) => ({
|
|
438
|
-
city,
|
|
439
|
-
tempC: 20,
|
|
440
|
-
}),
|
|
441
|
-
}
|
|
572
|
+
**How scoring works:**
|
|
442
573
|
|
|
443
|
-
|
|
444
|
-
|
|
445
|
-
|
|
446
|
-
|
|
447
|
-
|
|
448
|
-
prompt: 'What is the current temperature in Amsterdam? Use the tool.',
|
|
449
|
-
expected: { city: 'Amsterdam' },
|
|
450
|
-
tools: [weatherTool],
|
|
451
|
-
},
|
|
452
|
-
],
|
|
453
|
-
scorers: ['latency', 'cost', 'tool-usage'],
|
|
454
|
-
runs: 1,
|
|
455
|
-
})
|
|
456
|
-
```
|
|
457
|
-
|
|
458
|
-
The model calls `getCurrentWeather`, the handler returns a stub result, and the `tool-usage` scorer reports whether the expected tool was invoked. Tool calls and their results are included in the JSON output for inspection.
|
|
574
|
+
- Providers are compared **head-to-head within each task** — all providers receive the same prompt at the same time.
|
|
575
|
+
- Medals are awarded only when a provider is the **sole leader** in a metric column. Ties don't award medals, keeping rankings meaningful.
|
|
576
|
+
- **Quality-first ranking**: medals are decided by quality scorer wins (correctness, schema-correctness, etc.). Efficiency metrics (latency, cost) only break ties among quality-equal providers.
|
|
577
|
+
- **Quality gate**: providers that score 0% on all quality scorers are ineligible for medals entirely — being fast or cheap doesn't compensate for getting nothing right.
|
|
578
|
+
- The overall winner is the provider with the highest average correctness score.
|
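The rules above can be sketched as a small ranking function. This is one possible reading under stated assumptions, not the library's implementation: all scores are normalized to 0–1 with higher being better, the `QUALITY` and `EFFICIENCY` lists are a subset chosen for illustration, wins are counted only among gate-eligible providers, and `Entry`, `soleLeader`, `countWins`, and `rankForMedals` are hypothetical names.

```typescript
interface Entry {
  providerId: string
  scores: Record<string, number> // scorer name -> score in [0, 1]
}

const QUALITY = ['correctness', 'schema-correctness']
const EFFICIENCY = ['latency', 'cost']

// A column has a winner only if exactly one provider holds the top score.
function soleLeader(entries: Entry[], scorer: string): string | undefined {
  const best = Math.max(...entries.map((e) => e.scores[scorer] ?? 0))
  const leaders = entries.filter((e) => (e.scores[scorer] ?? 0) === best)
  return leaders.length === 1 ? leaders[0].providerId : undefined
}

function countWins(entries: Entry[], scorers: string[], id: string): number {
  return scorers.filter((s) => soleLeader(entries, s) === id).length
}

function rankForMedals(entries: Entry[]): string[] {
  // Quality gate: drop providers that score 0 on every quality scorer.
  const eligible = entries.filter((e) =>
    QUALITY.some((s) => (e.scores[s] ?? 0) > 0),
  )
  return eligible
    .map((e) => ({
      id: e.providerId,
      quality: countWins(eligible, QUALITY, e.providerId),
      efficiency: countWins(eligible, EFFICIENCY, e.providerId),
    }))
    // Quality wins decide the order; efficiency wins only break ties.
    .sort((a, b) => b.quality - a.quality || b.efficiency - a.efficiency)
    .map((r) => r.id)
}
```

In this sketch, a provider that is fastest and cheapest but scores 0 on every quality scorer never appears in the ranking, and a provider with one quality win outranks one with two efficiency wins.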
|
459
579
|
|
|
460
580
|
---
|
|
461
581
|
|
|
@@ -503,8 +623,8 @@ When `--comment` is enabled, the CI posts (or updates) a markdown table on the P
|
|
|
503
623
|
|
|
504
624
|
| Provider | Task | Scorer | Baseline | Current | Delta | Status |
|
|
505
625
|
|----------|------|--------|----------|---------|-------|--------|
|
|
506
|
-
| openai/gpt-5-mini | extract | correctness | 0.900 | 0.850 | -0.050 |
|
|
507
|
-
| openai/gpt-5-mini | extract | latency | 0.920 ± 0.030 | 0.890 ± 0.025 | -0.030 |
|
|
626
|
+
| openai/gpt-5-mini | extract | correctness | 0.900 | 0.850 | -0.050 | unchanged |
|
|
627
|
+
| openai/gpt-5-mini | extract | latency | 0.920 ± 0.030 | 0.890 ± 0.025 | -0.030 | unchanged |
|
|
508
628
|
|
|
509
629
|
With cost summary, flakiness warnings, and pass/fail verdict.
|
|
510
630
|
|
|
@@ -512,33 +632,31 @@ With cost summary, flakiness warnings, and pass/fail verdict.
|
|
|
512
632
|
|
|
513
633
|
## Roadmap
|
|
514
634
|
|
|
515
|
-
Shipped
|
|
635
|
+
**Shipped:**
|
|
516
636
|
|
|
517
|
-
- OpenAI, Azure OpenAI, Anthropic, Google Gemini, and OpenAI-compatible
|
|
637
|
+
- 5 provider types: OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible gateway
|
|
518
638
|
- 7 built-in scorers including LLM-as-judge, tool-usage, schema validation, and fuzzy similarity
|
|
519
639
|
- Tool-calling support with local handlers for agent task benchmarking
|
|
520
|
-
-
|
|
521
|
-
-
|
|
522
|
-
-
|
|
523
|
-
-
|
|
524
|
-
- `duelist ci`
|
|
640
|
+
- **Task packs**: built-in benchmark suites (`structured-output`) — run with `--pack`, no config writing needed
|
|
641
|
+
- Quality-first medal ranking: output quality decides medals, efficiency only breaks ties
|
|
642
|
+
- Fair head-to-head benchmarking with parallel provider execution
|
|
643
|
+
- 4 reporters: console (tables + medals + sparklines), JSON, HTML (sortable, self-contained), and Markdown (PR comments)
|
|
644
|
+
- `duelist ci` with regression detection (confidence intervals), cost budgets, and flakiness warnings
|
|
525
645
|
- GitHub Action for CI/CD integration
|
|
526
|
-
- Pricing catalog
|
|
646
|
+
- Pricing catalog with cross-provider fallback and `registerPricing()` for custom models
|
|
647
|
+
- `openaiCompatible` with `stripThinking` for reasoning models and `free` flag for local models
|
|
648
|
+
- Configurable per-request timeout
|
|
527
649
|
|
|
528
|
-
Planned
|
|
650
|
+
**Planned** (subject to community feedback):
|
|
529
651
|
|
|
530
|
-
- **More
|
|
531
|
-
|
|
532
|
-
- **
|
|
533
|
-
|
|
534
|
-
- **
|
|
535
|
-
|
|
536
|
-
- **Plugin system**
|
|
537
|
-
- First-class support for user-defined providers and scorers.
|
|
538
|
-
- **Embedding-based scoring**
|
|
539
|
-
- Semantic similarity via embedding distance.
|
|
652
|
+
- **More task packs** — reasoning, summarization, tool-calling, and multi-turn conversation packs
|
|
653
|
+
- **Agent workflows** — multi-step tool chains, multi-hop reasoning, and agent traces
|
|
654
|
+
- **More export formats** — CSV
|
|
655
|
+
- **Plugin system** — first-class support for user-defined providers and scorers
|
|
656
|
+
- **Embedding-based scoring** — semantic similarity via embedding distance
|
|
657
|
+
- **More providers** — OpenRouter-native and additional OpenAI-compatible gateways
|
|
540
658
|
|
|
541
|
-
|
|
659
|
+
Have a use case in mind? [Open an issue](https://github.com/DataGobes/agent-duelist/issues) — community feedback shapes what gets built first.
|
|
542
660
|
|
|
543
661
|
---
|
|
544
662
|
|
|
@@ -546,18 +664,18 @@ If you have a specific use case (framework comparisons, multi-agent competitions
|
|
|
546
664
|
|
|
547
665
|
Contributions, issues, and feature requests are welcome.
|
|
548
666
|
|
|
549
|
-
- **Bug reports / ideas**: open a GitHub issue.
|
|
667
|
+
- **Bug reports / ideas**: [open a GitHub issue](https://github.com/DataGobes/agent-duelist/issues).
|
|
550
668
|
- **Code changes**:
|
|
551
|
-
|
|
552
|
-
|
|
553
|
-
|
|
554
|
-
|
|
555
|
-
|
|
669
|
+
1. Fork the repo.
|
|
670
|
+
2. Create a branch.
|
|
671
|
+
3. Run tests: `npm test`.
|
|
672
|
+
4. Run build: `npm run build`.
|
|
673
|
+
5. Open a PR with a clear description.
|
|
556
674
|
|
|
557
|
-
Please
|
|
675
|
+
Please keep PRs narrowly focused (single provider, one new scorer, etc.) so they're easy to review.
|
|
558
676
|
|
|
559
677
|
---
|
|
560
678
|
|
|
561
679
|
## License
|
|
562
680
|
|
|
563
|
-
MIT
|
|
681
|
+
MIT
|