agent-duelist 0.2.1 → 0.3.1
- package/README.md +248 -142
- package/dist/cli.js +2284 -62
- package/dist/cli.js.map +1 -1
- package/dist/index.cjs +614 -109
- package/dist/index.cjs.map +1 -1
- package/dist/index.d.cts +28 -3
- package/dist/index.d.ts +28 -3
- package/dist/index.js +612 -109
- package/dist/index.js.map +1 -1
- package/package.json +9 -3
package/README.md
CHANGED
|
@@ -1,32 +1,46 @@
|
|
|
1
1
|
# agent-duelist
|
|
2
2
|
|
|
3
|
+
[](https://www.npmjs.com/package/agent-duelist)
|
|
3
4
|
[](https://github.com/DataGobes/agent-duelist/actions/workflows/ci.yml)
|
|
4
5
|
[](https://datagobes.github.io/agent-duelist/)
|
|
6
|
+
[](./LICENSE)
|
|
5
7
|
|
|
6
|
-
> Pit LLM providers against each other on agent tasks — Duel your models
|
|
8
|
+
> Pit LLM providers against each other on agent tasks — **Duel your models.**
|
|
7
9
|
>
|
|
8
10
|
> **[View the landing page →](https://datagobes.github.io/agent-duelist/)**
|
|
9
11
|
|
|
10
|
-
`agent-duelist` is a TypeScript-first framework
|
|
12
|
+
`agent-duelist` is a TypeScript-first benchmarking framework that runs the same tasks against multiple LLM providers and gives you structured, reproducible results: correctness, latency, tokens, cost, and more.
|
|
13
|
+
|
|
14
|
+
```bash
|
|
15
|
+
npx duelist init # scaffold a config
|
|
16
|
+
npx duelist run # see who wins
|
|
17
|
+
```
|
|
11
18
|
|
|
12
19
|
## What you get
|
|
13
|
-
> 
|
|
14
|
-
>
|
|
15
|
-
> 
|
|
16
20
|
|
|
17
|
-
-
|
|
18
|
-
|
|
19
|
-
|
|
21
|
+
**Console output** — box-drawing tables with medals, color-ranked metrics, sparklines, and per-task winners:
|
|
22
|
+
|
|
23
|
+

|
|
24
|
+
|
|
25
|
+
**HTML report** — a self-contained, shareable single-file report with sortable tables, progress bars, tab navigation, and summary cards:
|
|
26
|
+
|
|
27
|
+

|
|
20
28
|
|
|
21
29
|
---
|
|
22
30
|
|
|
23
31
|
## Why agent-duelist?
|
|
24
32
|
|
|
25
|
-
|
|
26
|
-
|
|
27
|
-
|
|
28
|
-
|
|
29
|
-
|
|
33
|
+
| | |
|
|
34
|
+
|---|---|
|
|
35
|
+
| **Provider-agnostic** | One config, many providers. Swap models and gateways without rewriting tasks. |
|
|
36
|
+
| **Agent-focused** | Built for agent workflows and tool use, not just single-turn prompts. |
|
|
37
|
+
| **Task packs** | Built-in benchmark suites — run with `--pack structured-output`, zero config needed. |
|
|
38
|
+
| **Quality-first ranking** | Medals decided by output quality. Speed and cost only break ties — a fast model that gets nothing right won't medal. |
|
|
39
|
+
| **7 built-in scorers** | Correctness, latency, cost, schema validation, fuzzy similarity, LLM-as-judge, and tool usage. |
|
|
40
|
+
| **Fair benchmarking** | Tasks run sequentially while providers race in parallel — fair latency comparison with no queue-induced penalties. |
|
|
41
|
+
| **TypeScript-native** | Strongly typed APIs, Zod schemas for structured outputs, and a simple `defineArena()` entrypoint. |
|
|
42
|
+
| **CI-ready** | Regression detection with confidence intervals, cost budgets, PR comments, and a prebuilt GitHub Action. |
|
|
43
|
+
| **CLI-first** | `npx duelist init` → `npx duelist run` gets you from zero to results in minutes. |
|
|
30
44
|
|
|
31
45
|
---
|
|
32
46
|
|
|
@@ -40,7 +54,7 @@ pnpm add agent-duelist
|
|
|
40
54
|
yarn add agent-duelist
|
|
41
55
|
```
|
|
42
56
|
|
|
43
|
-
|
|
57
|
+
Set API keys for the providers you want to benchmark:
|
|
44
58
|
|
|
45
59
|
```bash
|
|
46
60
|
export OPENAI_API_KEY=sk-...
|
|
@@ -51,7 +65,7 @@ export GOOGLE_API_KEY=...
|
|
|
51
65
|
|
|
52
66
|
---
|
|
53
67
|
|
|
54
|
-
##
|
|
68
|
+
## Quickstart
|
|
55
69
|
|
|
56
70
|
Initialize a config:
|
|
57
71
|
|
|
@@ -59,7 +73,7 @@ Initialize a config:
|
|
|
59
73
|
npx duelist init
|
|
60
74
|
```
|
|
61
75
|
|
|
62
|
-
This creates `arena.config.ts` in your project
|
|
76
|
+
This creates `arena.config.ts` in your project:
|
|
63
77
|
|
|
64
78
|
```ts
|
|
65
79
|
// arena.config.ts
|
|
@@ -99,39 +113,80 @@ Run the benchmark:
|
|
|
99
113
|
npx duelist run
|
|
100
114
|
```
|
|
101
115
|
|
|
102
|
-
You'll see a matrix
|
|
116
|
+
You'll see a results matrix:
|
|
103
117
|
|
|
104
|
-
- Rows
|
|
105
|
-
- Columns
|
|
106
|
-
- Cells
|
|
118
|
+
- **Rows**: tasks (`simple-qa`, `structured-extraction`)
|
|
119
|
+
- **Columns**: providers (`openai/gpt-5-mini`, `azure/gpt-5-mini`)
|
|
120
|
+
- **Cells**: correctness score, latency, tokens, and estimated cost
|
|
107
121
|
|
|
108
|
-
|
|
122
|
+
Export the results in different formats:
|
|
109
123
|
|
|
110
124
|
```bash
|
|
125
|
+
# JSON for CI pipelines and dashboards
|
|
111
126
|
npx duelist run --reporter json > results.json
|
|
127
|
+
|
|
128
|
+
# Self-contained HTML report you can share or host
|
|
129
|
+
npx duelist run --reporter html --output report.html
|
|
112
130
|
```
|
|
113
131
|
|
|
114
|
-
|
|
132
|
+
---
|
|
133
|
+
|
|
134
|
+
## Task packs
|
|
135
|
+
|
|
136
|
+
Task packs are **built-in benchmark suites** — curated sets of tasks with recommended scorers. Use them to benchmark providers without writing any tasks yourself.
|
|
115
137
|
|
|
116
138
|
```bash
|
|
117
|
-
|
|
139
|
+
# List available packs
|
|
140
|
+
npx duelist run --pack list
|
|
141
|
+
|
|
142
|
+
# Run a pack with your providers
|
|
143
|
+
npx duelist run --pack structured-output --config arena.config.ts
|
|
118
144
|
```
|
|
119
145
|
|
|
120
|
-
|
|
146
|
+
Your config only needs to define providers — the pack supplies tasks and scorers:
|
|
121
147
|
|
|
122
|
-
|
|
148
|
+
```ts
|
|
149
|
+
// arena.config.ts
|
|
150
|
+
import { defineArena, openai, anthropic } from 'agent-duelist'
|
|
123
151
|
|
|
124
|
-
|
|
152
|
+
export default defineArena({
|
|
153
|
+
providers: [
|
|
154
|
+
openai('gpt-5-mini'),
|
|
155
|
+
anthropic('claude-sonnet-4.6'),
|
|
156
|
+
],
|
|
157
|
+
tasks: [], // ignored when --pack is used
|
|
158
|
+
scorers: [], // pack supplies its own scorers
|
|
159
|
+
})
|
|
160
|
+
```
|
|
125
161
|
|
|
126
|
-
|
|
162
|
+
### Available packs
|
|
127
163
|
|
|
128
|
-
|
|
164
|
+
| Pack | Tasks | Scorers | Description |
|
|
165
|
+
|------|-------|---------|-------------|
|
|
166
|
+
| `structured-output` | 6 | correctness, schema-correctness, latency, cost | Zod schema stress test — flat objects, nesting, arrays, enums, empty arrays, and adversarial input |
|
|
167
|
+
| `tool-calling` | 4 | tool-usage, latency, cost | Function invocation accuracy — single calls, complex params, tool selection, and parallel calls |
|
|
168
|
+
| `reasoning` | 5 | correctness, latency, cost | Logic, math, and multi-step thinking — arithmetic, deduction, data interpretation, critical path, and business rules |
|
|
129
169
|
|
|
130
|
-
|
|
131
|
-
- Wrap or extend providers in your own code.
|
|
132
|
-
- Mock providers in tests.
|
|
170
|
+
Packs work with both `run` and `ci` commands:
|
|
133
171
|
|
|
134
|
-
|
|
172
|
+
```bash
|
|
173
|
+
# CI with a task pack
|
|
174
|
+
npx duelist ci --pack structured-output --threshold correctness=0.1 --budget 1.00
|
|
175
|
+
```
|
|
176
|
+
|
|
177
|
+
You can also combine multiple packs:
|
|
178
|
+
|
|
179
|
+
```bash
|
|
180
|
+
npx duelist run --pack structured-output,another-pack
|
|
181
|
+
```
|
|
182
|
+
|
|
183
|
+
---
|
|
184
|
+
|
|
185
|
+
## Core concepts
|
|
186
|
+
|
|
187
|
+
### Providers
|
|
188
|
+
|
|
189
|
+
Providers are **factory functions** that return plain objects implementing a shared `ArenaProvider` interface. This lets you swap providers without changing tasks, wrap or extend providers in your own code, and mock providers in tests.
|
|
135
190
|
|
|
136
191
|
```ts
|
|
137
192
|
import {
|
|
@@ -140,25 +195,40 @@ import {
|
|
|
140
195
|
anthropic,
|
|
141
196
|
gemini,
|
|
142
197
|
openaiCompatible,
|
|
143
|
-
type ArenaProvider,
|
|
144
198
|
} from 'agent-duelist'
|
|
145
199
|
|
|
200
|
+
// OpenAI
|
|
146
201
|
const oai = openai('gpt-5-mini')
|
|
147
202
|
|
|
203
|
+
// Azure OpenAI
|
|
148
204
|
const azure = azureOpenai('gpt-5-mini', {
|
|
149
205
|
deployment: 'my-deployment',
|
|
150
206
|
})
|
|
151
207
|
|
|
208
|
+
// Anthropic
|
|
152
209
|
const claude = anthropic('claude-sonnet-4.6')
|
|
153
210
|
|
|
154
|
-
|
|
211
|
+
// Google Gemini
|
|
212
|
+
const gem = gemini('gemini-3-flash-preview')
|
|
155
213
|
|
|
156
|
-
|
|
214
|
+
// Any OpenAI-compatible gateway (Ollama, LiteLLM, vLLM, etc.)
|
|
215
|
+
const local = openaiCompatible({
|
|
157
216
|
id: 'local/llama',
|
|
158
|
-
name: 'Local
|
|
217
|
+
name: 'Local Ollama',
|
|
159
218
|
baseURL: 'http://localhost:11434/v1',
|
|
160
219
|
model: 'llama3.3',
|
|
161
220
|
apiKeyEnv: 'LOCAL_LLM_API_KEY',
|
|
221
|
+
free: true, // registers zero-cost pricing
|
|
222
|
+
})
|
|
223
|
+
|
|
224
|
+
// Reasoning models that emit <think> blocks (DeepSeek-R1, MiniMax M2.5, etc.)
|
|
225
|
+
const deepseek = openaiCompatible({
|
|
226
|
+
id: 'deepseek/r1',
|
|
227
|
+
name: 'DeepSeek R1',
|
|
228
|
+
baseURL: 'https://api.deepseek.com/v1',
|
|
229
|
+
model: 'deepseek-reasoner',
|
|
230
|
+
apiKeyEnv: 'DEEPSEEK_API_KEY',
|
|
231
|
+
stripThinking: true, // strips <think>...</think> from output
|
|
162
232
|
})
|
|
163
233
|
```
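As a rough illustration of what the `stripThinking` option is described as doing, the sketch below removes `<think>...</think>` blocks from a model response with a regular expression. The helper name `stripThinkBlocks` is hypothetical and not part of the agent-duelist API; the library's actual implementation may differ.

```ts
// Hypothetical helper, not part of the agent-duelist API.
// Removes every <think>...</think> block (including multi-line ones)
// and trims whatever text remains.
function stripThinkBlocks(output: string): string {
  return output.replace(/<think>[\s\S]*?<\/think>/g, '').trim()
}

const raw = '<think>The user wants a capital city.</think>\nParis'
console.log(stripThinkBlocks(raw))
```

The non-greedy `*?` matters here: with several `<think>` blocks in one response, a greedy match would also swallow the answer text between them.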
|
|
164
234
|
|
|
@@ -185,6 +255,7 @@ interface ArenaTask {
|
|
|
185
255
|
prompt: string
|
|
186
256
|
expected?: unknown // used by correctness scorers
|
|
187
257
|
schema?: ZodSchema<any> // used by schema-based scorers
|
|
258
|
+
tools?: ToolDefinition[] // used by tool-calling scorers
|
|
188
259
|
}
|
|
189
260
|
```
|
|
190
261
|
|
|
@@ -213,7 +284,7 @@ const tasks: ArenaTask[] = [
|
|
|
213
284
|
|
|
214
285
|
### Scorers
|
|
215
286
|
|
|
216
|
-
Scorers
|
|
287
|
+
Scorers turn raw model outputs into **numeric scores** (0–1) with optional details. Seven built-in scorers ship out of the box:
|
|
217
288
|
|
|
218
289
|
| Scorer | What it measures |
|
|
219
290
|
|--------|-----------------|
|
|
@@ -222,7 +293,8 @@ Scorers take raw model outputs and turn them into **numeric scores** (0–1) wit
|
|
|
222
293
|
| `correctness` | Exact match against `expected` (deep-equal, key-order independent for objects) |
|
|
223
294
|
| `schema-correctness` | Validates output against the task's Zod `schema` via `safeParse()` |
|
|
224
295
|
| `fuzzy-similarity` | Jaccard token-overlap similarity between output and `expected` |
|
|
225
|
-
| `
|
|
296
|
+
| `tool-usage` | Tool calling accuracy — checks tool selection and argument correctness (1.0 exact match, 0.5 right tool / wrong args, 0.0 wrong tool) |
|
|
297
|
+
| `llm-judge-correctness` | LLM-as-judge — calls a judge model to score accuracy, completeness, and conciseness |
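To make the `fuzzy-similarity` row more concrete, here is a minimal sketch of Jaccard token-overlap similarity. The function names and the tokenization rule (lowercase, split on non-word characters) are assumptions for illustration; the README does not say how agent-duelist tokenizes text.

```ts
// Hypothetical sketch of Jaccard token overlap; not the library's code.
// Assumed tokenization: lowercase, split on non-word characters.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\W+/).filter(Boolean))
}

function jaccardSimilarity(a: string, b: string): number {
  const setA = tokenize(a)
  const setB = tokenize(b)
  // Two empty texts are treated as identical.
  if (setA.size === 0 && setB.size === 0) return 1
  let intersection = 0
  setA.forEach((token) => {
    if (setB.has(token)) intersection++
  })
  const union = setA.size + setB.size - intersection
  return intersection / union
}

// Shared tokens: "the", "cat" (2). Union: "the", "cat", "sat", "ran" (4).
console.log(jaccardSimilarity('the cat sat', 'the cat ran')) // 0.5
```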
|
|
226
298
|
|
|
227
299
|
Configure them in your arena:
|
|
228
300
|
|
|
@@ -242,8 +314,6 @@ defineArena({
|
|
|
242
314
|
|
|
243
315
|
The judge model defaults to `gpt-5-mini`. It can also be set via the `DUELIST_JUDGE_MODEL` env var. The judge backend is auto-detected from the model name — `gemini-*` models use Google's API, otherwise it falls back to OpenAI or Azure OpenAI.
|
|
244
316
|
|
|
245
|
-
You can also add custom scorers for domain-specific metrics (e.g. tool-call correctness, safety, style).
|
|
246
|
-
|
|
247
317
|
---
|
|
248
318
|
|
|
249
319
|
### Arena options
|
|
@@ -252,7 +322,7 @@ You can also add custom scorers for domain-specific metrics (e.g. tool-call corr
|
|
|
252
322
|
|
|
253
323
|
| Option | Type | Default | Description |
|
|
254
324
|
|--------|------|---------|-------------|
|
|
255
|
-
| `runs` | `number` | `1` | Number of runs per provider
|
|
325
|
+
| `runs` | `number` | `1` | Number of runs per provider × task combination. Higher values improve statistical confidence for CI regression detection. |
|
|
256
326
|
| `judgeModel` | `string` | `'gpt-5-mini'` | Model used by the `llm-judge-correctness` scorer. Also settable via `DUELIST_JUDGE_MODEL` env var. Gemini models auto-route to Google's API. |
|
|
257
327
|
| `timeout` | `number` | `60000` | Per-request timeout in milliseconds. Requests exceeding this are marked as failures. Prevents hanging on unresponsive APIs. |
|
|
258
328
|
| `sparklines` | `boolean` | `true` | Show sparkline bars next to percentage scores in the console reporter. Disable with `false` if your terminal doesn't render Unicode block characters well. |
|
|
@@ -273,43 +343,62 @@ export default defineArena({
|
|
|
273
343
|
|
|
274
344
|
---
|
|
275
345
|
|
|
346
|
+
## Reporters
|
|
347
|
+
|
|
348
|
+
agent-duelist includes four output formats, each suited to a different workflow:
|
|
349
|
+
|
|
350
|
+
| Reporter | Flag | Use case |
|
|
351
|
+
|----------|------|----------|
|
|
352
|
+
| **Console** | `--reporter console` (default) | Interactive development — box-drawing tables with medals, sparklines, color-ranked metrics, and per-task winners |
|
|
353
|
+
| **JSON** | `--reporter json` | CI pipelines, dashboards, and downstream tooling |
|
|
354
|
+
| **HTML** | `--reporter html` | Shareable single-file reports with sortable tables, animated backgrounds, tab navigation, CSS progress bars, medal rankings, and summary cards |
|
|
355
|
+
| **Markdown** | `--comment` (CI mode) | Auto-posted PR comment with comparison table, cost summary, and pass/fail verdict |
|
|
356
|
+
|
|
357
|
+
Generate an HTML report:
|
|
358
|
+
|
|
359
|
+
```bash
|
|
360
|
+
npx duelist run --reporter html --output report.html
|
|
361
|
+
```
|
|
362
|
+
|
|
363
|
+
The HTML report is a single self-contained file — no external dependencies, no build step. Open it in any browser or host it as a static page.
|
|
364
|
+
|
|
365
|
+
---
|
|
366
|
+
|
|
276
367
|
## Cost & pricing
|
|
277
368
|
|
|
278
369
|
Cost estimation is intentionally transparent and conservative:
|
|
279
370
|
|
|
280
|
-
1. **Token counts**
|
|
281
|
-
Providers return token usage (prompt and completion tokens) in each `TaskResult`. These are treated as the source of truth.
|
|
282
|
-
|
|
283
|
-
2. **Pricing catalog**
|
|
284
|
-
`agent-duelist` ships with a **locally bundled catalog** of per-token prices for many models, derived from OpenRouter's public pricing pages.
|
|
371
|
+
1. **Token counts** — Providers return token usage (prompt and completion tokens) in each `TaskResult`. These are the source of truth.
|
|
285
372
|
|
|
286
|
-
|
|
287
|
-
-
|
|
373
|
+
2. **Pricing catalog** — `agent-duelist` ships with a **locally bundled catalog** of per-token prices for many models, derived from OpenRouter's public pricing pages.
|
|
374
|
+
- The catalog maps `(provider, model)` to `{ inputPerM, outputPerM }` in USD per 1M tokens.
|
|
375
|
+
- Azure OpenAI models resolve back to their base OpenAI models (e.g. `azure/gpt-5-mini` → `openai/gpt-5-mini`).
|
|
376
|
+
- Cross-provider fallback: models hosted on Groq, Together, Fireworks, etc. resolve to the original provider's pricing.
|
|
288
377
|
|
|
289
|
-
3. **Estimated USD**
|
|
290
|
-
The `cost` scorer computes:
|
|
378
|
+
3. **Estimated USD** — The `cost` scorer computes:
|
|
291
379
|
|
|
292
380
|
```text
|
|
293
381
|
estimatedUsd = (promptTokens * inputPerM + completionTokens * outputPerM) / 1_000_000
|
|
294
382
|
```
|
|
295
383
|
|
|
296
|
-
|
|
384
|
+
4. **Unknown models** — If a model is not in the catalog, tokens are still reported and cost is marked as unknown (no fake numbers).
|
|
297
385
|
|
|
298
|
-
|
|
299
|
-
- cost: `~$0.0XXm` (millicents; fractions of a cent)
|
|
300
|
-
- a short disclaimer that this is an **estimate** based on a pricing snapshot.
|
|
386
|
+
5. **Custom pricing** — Register pricing for models not in the catalog:
|
|
301
387
|
|
|
302
|
-
|
|
303
|
-
|
|
388
|
+
```ts
|
|
389
|
+
import { registerPricing } from 'agent-duelist'
|
|
304
390
|
|
|
305
|
-
-
|
|
306
|
-
|
|
391
|
+
registerPricing('custom/my-model', {
|
|
392
|
+
inputPerToken: 0.000003,
|
|
393
|
+
outputPerToken: 0.000015,
|
|
394
|
+
})
|
|
395
|
+
```
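The cost formula from step 3 can be sketched as a small function. The `ModelPricing` interface and `estimateUsd` name are hypothetical, and the prices are illustrative values rather than entries from the bundled catalog; only the formula itself comes from the README.

```ts
// Prices are USD per 1M tokens, matching the { inputPerM, outputPerM }
// shape described for the catalog. Names here are illustrative.
interface ModelPricing {
  inputPerM: number
  outputPerM: number
}

function estimateUsd(
  promptTokens: number,
  completionTokens: number,
  pricing: ModelPricing,
): number {
  return (
    (promptTokens * pricing.inputPerM + completionTokens * pricing.outputPerM) /
    1_000_000
  )
}

// Illustrative prices only, not taken from the bundled catalog.
const examplePricing: ModelPricing = { inputPerM: 3, outputPerM: 15 }

// (1000 * 3 + 500 * 15) / 1_000_000 = 0.0105
console.log(estimateUsd(1_000, 500, examplePricing))
```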
|
|
307
396
|
|
|
308
|
-
You can update the catalog with
|
|
397
|
+
You can update the bundled catalog with `npm run update:pricing`, which re-scrapes OpenRouter's public pricing page.
|
|
309
398
|
|
|
310
399
|
---
|
|
311
400
|
|
|
312
|
-
## CLI
|
|
401
|
+
## CLI reference
|
|
313
402
|
|
|
314
403
|
### `duelist init`
|
|
315
404
|
|
|
@@ -317,30 +406,46 @@ Scaffold a new `arena.config.ts` in the current directory.
|
|
|
317
406
|
|
|
318
407
|
```bash
|
|
319
408
|
npx duelist init
|
|
409
|
+
npx duelist init --force # overwrite an existing config
|
|
320
410
|
```
|
|
321
411
|
|
|
412
|
+
| Option | Description |
|
|
413
|
+
|--------|-------------|
|
|
414
|
+
| `--force` | Overwrite existing config file |
|
|
415
|
+
|
|
322
416
|
### `duelist run`
|
|
323
417
|
|
|
324
418
|
Run benchmarks defined in your arena config.
|
|
325
419
|
|
|
326
420
|
```bash
|
|
327
|
-
#
|
|
421
|
+
# Default config, console output
|
|
328
422
|
npx duelist run
|
|
329
423
|
|
|
330
|
-
#
|
|
424
|
+
# Custom config
|
|
331
425
|
npx duelist run --config path/to/arena.config.ts
|
|
332
426
|
|
|
333
|
-
#
|
|
334
|
-
npx duelist run --
|
|
427
|
+
# Run a built-in task pack
|
|
428
|
+
npx duelist run --pack structured-output
|
|
429
|
+
|
|
430
|
+
# List available packs
|
|
431
|
+
npx duelist run --pack list
|
|
432
|
+
|
|
433
|
+
# JSON for piping
|
|
434
|
+
npx duelist run --reporter json > results.json
|
|
335
435
|
|
|
336
|
-
#
|
|
436
|
+
# HTML report
|
|
437
|
+
npx duelist run --reporter html --output report.html
|
|
438
|
+
|
|
439
|
+
# Quiet mode
|
|
337
440
|
npx duelist run --quiet
|
|
338
441
|
```
|
|
339
442
|
|
|
340
443
|
| Option | Description |
|
|
341
444
|
|--------|-------------|
|
|
342
445
|
| `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
|
|
343
|
-
| `--
|
|
446
|
+
| `--pack <names>` | Run built-in task pack(s) instead of config tasks. Comma-separated for multiple packs. Use `list` to show available packs. |
|
|
447
|
+
| `--reporter <type>` | Output format: `console` (default), `json`, or `html` |
|
|
448
|
+
| `--output <path>` | Output file path for HTML reporter (default: `duelist-report.html`) |
|
|
344
449
|
| `-q, --quiet` | Suppress per-result progress |
|
|
345
450
|
|
|
346
451
|
### `duelist ci`
|
|
@@ -354,6 +459,9 @@ npx duelist ci --update-baseline
|
|
|
354
459
|
# Subsequent runs — compare against baseline
|
|
355
460
|
npx duelist ci --threshold correctness=0.1 --budget 1.00
|
|
356
461
|
|
|
462
|
+
# Run CI with a task pack
|
|
463
|
+
npx duelist ci --pack structured-output --threshold correctness=0.1
|
|
464
|
+
|
|
357
465
|
# Post comparison table as a PR comment (GitHub Actions)
|
|
358
466
|
npx duelist ci --threshold correctness=0.1 --comment
|
|
359
467
|
```
|
|
@@ -361,6 +469,7 @@ npx duelist ci --threshold correctness=0.1 --comment
|
|
|
361
469
|
| Option | Description |
|
|
362
470
|
|--------|-------------|
|
|
363
471
|
| `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
|
|
472
|
+
| `--pack <names>` | Run built-in task pack(s) instead of config tasks |
|
|
364
473
|
| `--baseline <path>` | Baseline JSON file (default: `.duelist/baseline.json`) |
|
|
365
474
|
| `--budget <dollars>` | Max total cost in USD — fails if exceeded |
|
|
366
475
|
| `--threshold <scorer=delta>` | Regression threshold (repeatable, e.g. `--threshold correctness=0.1 --threshold cost=0.002`) |
|
|
@@ -375,24 +484,61 @@ npx duelist ci --threshold correctness=0.1 --comment
|
|
|
375
484
|
- Without `--threshold` flags, regression detection is skipped entirely — only `--budget` is enforced.
|
|
376
485
|
- Results with high variance (CV > 0.3) are flagged as **flaky** with a warning.
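The flakiness rule above can be sketched with the coefficient of variation (CV), the standard deviation divided by the mean. The function names are hypothetical, and the README does not say whether the sample or the population standard deviation is used; this sketch assumes the sample form (dividing by n - 1).

```ts
// Hypothetical sketch; not the library's code.
// CV = standard deviation / mean. Sample standard deviation is an assumption.
function coefficientOfVariation(values: number[]): number {
  if (values.length < 2) return 0
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length
  if (mean === 0) return 0 // avoid dividing by zero
  const variance =
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / (values.length - 1)
  return Math.sqrt(variance) / Math.abs(mean)
}

// Threshold quoted in the README.
const FLAKY_CV_THRESHOLD = 0.3

function isFlaky(scores: number[]): boolean {
  return coefficientOfVariation(scores) > FLAKY_CV_THRESHOLD
}

console.log(isFlaky([0.9, 0.92, 0.88])) // tightly clustered scores
console.log(isFlaky([1, 0, 1, 0])) // scores that swing between runs
```

A single run gives no variance to measure, which is one reason the `runs` option matters for CI regression detection.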
|
|
377
486
|
|
|
378
|
-
The CLI loads TypeScript configs directly using a lightweight runtime loader so
|
|
487
|
+
The CLI loads TypeScript configs directly using a lightweight runtime loader so you don't need to precompile your config.
|
|
488
|
+
|
|
489
|
+
---
|
|
490
|
+
|
|
491
|
+
## Tool-calling agent example
|
|
492
|
+
|
|
493
|
+
agent-duelist supports tool-calling tasks — define tools with Zod-typed parameters and handlers, and the provider will execute them during the benchmark:
|
|
494
|
+
|
|
495
|
+
```ts
|
|
496
|
+
import { defineArena, openai } from 'agent-duelist'
|
|
497
|
+
import { z } from 'zod'
|
|
498
|
+
|
|
499
|
+
const weatherTool = {
|
|
500
|
+
name: 'getCurrentWeather',
|
|
501
|
+
description: 'Get the current weather in a given city',
|
|
502
|
+
parameters: z.object({ city: z.string() }),
|
|
503
|
+
handler: async ({ city }: { city: string }) => ({
|
|
504
|
+
city,
|
|
505
|
+
tempC: 20,
|
|
506
|
+
}),
|
|
507
|
+
}
|
|
508
|
+
|
|
509
|
+
export default defineArena({
|
|
510
|
+
providers: [openai('gpt-5-mini')],
|
|
511
|
+
tasks: [
|
|
512
|
+
{
|
|
513
|
+
name: 'weather-tool-call',
|
|
514
|
+
prompt: 'What is the current temperature in Amsterdam? Use the tool.',
|
|
515
|
+
expected: { city: 'Amsterdam' },
|
|
516
|
+
tools: [weatherTool],
|
|
517
|
+
},
|
|
518
|
+
],
|
|
519
|
+
scorers: ['latency', 'cost', 'tool-usage'],
|
|
520
|
+
runs: 1,
|
|
521
|
+
})
|
|
522
|
+
```
|
|
523
|
+
|
|
524
|
+
The model calls `getCurrentWeather`, the handler returns a stub result, and the `tool-usage` scorer reports whether the expected tool was invoked. Tool calls and their results are included in the JSON output for inspection.
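The three-level `tool-usage` score described in the scorer table (1.0, 0.5, 0.0) can be sketched as follows. The `ToolCall` shape, the function names, and the shallow argument comparison are all assumptions for illustration; agent-duelist's real types and matching rules may differ.

```ts
// Hypothetical shapes; agent-duelist's real types may differ.
interface ToolCall {
  name: string
  args: Record<string, unknown>
}

// Shallow comparison: every expected argument must equal the actual one.
function argsMatch(
  expected: Record<string, unknown>,
  actual: Record<string, unknown>,
): boolean {
  return Object.keys(expected).every((key) => actual[key] === expected[key])
}

// 1.0 = right tool and arguments
// 0.5 = right tool, wrong arguments
// 0.0 = wrong tool or no tool call at all
function scoreToolUsage(expected: ToolCall, calls: ToolCall[]): number {
  const sameTool = calls.filter((call) => call.name === expected.name)
  if (sameTool.length === 0) return 0
  return sameTool.some((call) => argsMatch(expected.args, call.args)) ? 1 : 0.5
}

const expectedCall: ToolCall = {
  name: 'getCurrentWeather',
  args: { city: 'Amsterdam' },
}

console.log(
  scoreToolUsage(expectedCall, [
    { name: 'getCurrentWeather', args: { city: 'Amsterdam' } },
  ]),
)
```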
|
|
379
525
|
|
|
380
526
|
---
|
|
381
527
|
|
|
382
528
|
## Example: multi-provider benchmark
|
|
383
529
|
|
|
384
|
-
|
|
530
|
+
A richer example comparing multiple providers across tasks:
|
|
385
531
|
|
|
386
532
|
```ts
|
|
387
533
|
// arena.config.ts
|
|
388
|
-
import { defineArena, azureOpenai } from 'agent-duelist'
|
|
534
|
+
import { defineArena, azureOpenai, openai, gemini } from 'agent-duelist'
|
|
389
535
|
import { z } from 'zod'
|
|
390
536
|
|
|
391
537
|
export default defineArena({
|
|
392
538
|
providers: [
|
|
393
|
-
|
|
539
|
+
openai('gpt-5-mini'),
|
|
394
540
|
azureOpenai('gpt-5-nano'),
|
|
395
|
-
|
|
541
|
+
gemini('gemini-3-flash-preview'),
|
|
396
542
|
],
|
|
397
543
|
tasks: [
|
|
398
544
|
{
|
|
@@ -425,50 +571,13 @@ export default defineArena({
|
|
|
425
571
|
npx duelist run
|
|
426
572
|
```
|
|
427
573
|
|
|
428
|
-
Output includes box-drawing tables with medals, color-ranked metrics, sparkline bars, and a winner row per task — see the [screenshot above](#what-you-get) for a real example.
|
|
429
|
-
|
|
430
574
|
**How scoring works:**
|
|
431
575
|
|
|
432
576
|
- Providers are compared **head-to-head within each task** — all providers receive the same prompt at the same time.
|
|
433
577
|
- Medals are awarded only when a provider is the **sole leader** in a metric column. Ties don't award medals, keeping rankings meaningful.
|
|
434
|
-
-
|
|
435
|
-
|
|
436
|
-
|
|
437
|
-
|
|
438
|
-
## Tool-calling agent example
|
|
439
|
-
|
|
440
|
-
agent-duelist supports tool-calling tasks — define tools with handlers, and the provider will execute them during the benchmark:
|
|
441
|
-
|
|
442
|
-
```ts
|
|
443
|
-
import { defineArena, openai } from 'agent-duelist'
|
|
444
|
-
import { z } from 'zod'
|
|
445
|
-
|
|
446
|
-
const weatherTool = {
|
|
447
|
-
name: 'getCurrentWeather',
|
|
448
|
-
description: 'Get the current weather in a given city',
|
|
449
|
-
parameters: z.object({ city: z.string() }),
|
|
450
|
-
handler: async ({ city }: { city: string }) => ({
|
|
451
|
-
city,
|
|
452
|
-
tempC: 20,
|
|
453
|
-
}),
|
|
454
|
-
}
|
|
455
|
-
|
|
456
|
-
export default defineArena({
|
|
457
|
-
providers: [openai('gpt-5-mini')],
|
|
458
|
-
tasks: [
|
|
459
|
-
{
|
|
460
|
-
name: 'weather-tool-call',
|
|
461
|
-
prompt: 'What is the current temperature in Amsterdam? Use the tool.',
|
|
462
|
-
expected: { city: 'Amsterdam' },
|
|
463
|
-
tools: [weatherTool],
|
|
464
|
-
},
|
|
465
|
-
],
|
|
466
|
-
scorers: ['latency', 'cost', 'tool-usage'],
|
|
467
|
-
runs: 1,
|
|
468
|
-
})
|
|
469
|
-
```
|
|
470
|
-
|
|
471
|
-
The model calls `getCurrentWeather`, the handler returns a stub result, and the `tool-usage` scorer reports whether the expected tool was invoked. Tool calls and their results are included in the JSON output for inspection.
|
|
578
|
+
- **Quality-first ranking**: medals are decided by quality scorer wins (correctness, schema-correctness, etc.). Efficiency metrics (latency, cost) only break ties among quality-equal providers.
|
|
579
|
+
- **Quality gate**: providers that score 0% on all quality scorers are ineligible for medals entirely — being fast or cheap doesn't compensate for getting nothing right.
|
|
580
|
+
- The overall winner is the provider with the highest average correctness score.
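The quality gate and quality-first ordering above can be sketched like this. The `ProviderSummary` shape and `rankProviders` name are hypothetical, and the sketch simplifies the README's "quality scorer wins" to a mean quality score, so treat it as an approximation of the idea rather than the library's ranking code.

```ts
// Hypothetical per-provider summary; field names are illustrative.
interface ProviderSummary {
  id: string
  qualityScores: number[] // e.g. correctness, schema-correctness (0 to 1)
  avgLatencyMs: number
}

const mean = (xs: number[]): number =>
  xs.reduce((sum, x) => sum + x, 0) / xs.length

function rankProviders(providers: ProviderSummary[]): ProviderSummary[] {
  return (
    providers
      // Quality gate: drop providers that score 0 on every quality scorer.
      .filter((p) => p.qualityScores.some((score) => score > 0))
      // Quality decides the order; latency only breaks exact ties.
      .sort(
        (a, b) =>
          mean(b.qualityScores) - mean(a.qualityScores) ||
          a.avgLatencyMs - b.avgLatencyMs,
      )
  )
}

const ranked = rankProviders([
  { id: 'fast-but-wrong', qualityScores: [0, 0], avgLatencyMs: 100 },
  { id: 'accurate-slow', qualityScores: [1, 1], avgLatencyMs: 900 },
  { id: 'accurate-fast', qualityScores: [1, 1], avgLatencyMs: 400 },
])
console.log(ranked.map((p) => p.id))
```

In this example the fastest provider never appears in the ranking, because it fails the quality gate before latency is even considered.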
|
|
472
581
|
|
|
473
582
|
---
|
|
474
583
|
|
|
@@ -516,8 +625,8 @@ When `--comment` is enabled, the CI posts (or updates) a markdown table on the P
|
|
|
516
625
|
|
|
517
626
|
| Provider | Task | Scorer | Baseline | Current | Delta | Status |
|
|
518
627
|
|----------|------|--------|----------|---------|-------|--------|
|
|
519
|
-
| openai/gpt-5-mini | extract | correctness | 0.900 | 0.850 | -0.050 |
|
|
520
|
-
| openai/gpt-5-mini | extract | latency | 0.920 ± 0.030 | 0.890 ± 0.025 | -0.030 |
|
|
628
|
+
| openai/gpt-5-mini | extract | correctness | 0.900 | 0.850 | -0.050 | unchanged |
|
|
629
|
+
| openai/gpt-5-mini | extract | latency | 0.920 ± 0.030 | 0.890 ± 0.025 | -0.030 | unchanged |
|
|
521
630
|
|
|
522
631
|
The comment also includes a cost summary, flakiness warnings, and a pass/fail verdict.
|
|
523
632
|
|
|
@@ -525,34 +634,31 @@ With cost summary, flakiness warnings, and pass/fail verdict.
|
|
|
525
634
|
|
|
526
635
|
## Roadmap
|
|
527
636
|
|
|
528
|
-
Shipped
|
|
637
|
+
**Shipped:**
|
|
529
638
|
|
|
530
|
-
- OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible
|
|
639
|
+
- 5 provider types: OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible gateway
|
|
531
640
|
- 7 built-in scorers including LLM-as-judge, tool-usage, schema validation, and fuzzy similarity
|
|
532
641
|
- Tool-calling support with local handlers for agent task benchmarking
|
|
533
|
-
-
|
|
534
|
-
-
|
|
535
|
-
-
|
|
536
|
-
- JSON
|
|
537
|
-
-
|
|
538
|
-
- `duelist ci` command with regression detection, cost budgets, and flakiness warnings
|
|
642
|
+
- **Task packs**: built-in benchmark suites (`structured-output`, `tool-calling`, `reasoning`) — run with `--pack`, no config writing needed
|
|
643
|
+
- Quality-first medal ranking: output quality decides medals, efficiency only breaks ties
|
|
644
|
+
- Fair head-to-head benchmarking with parallel provider execution
|
|
645
|
+
- 4 reporters: console (tables + medals + sparklines), JSON, HTML (sortable, self-contained), and Markdown (PR comments)
|
|
646
|
+
- `duelist ci` with regression detection (confidence intervals), cost budgets, and flakiness warnings
|
|
539
647
|
- GitHub Action for CI/CD integration
|
|
540
|
-
- Pricing catalog
|
|
648
|
+
- Pricing catalog with cross-provider fallback and `registerPricing()` for custom models
|
|
649
|
+
- `openaiCompatible` with `stripThinking` for reasoning models and `free` flag for local models
|
|
650
|
+
- Configurable per-request timeout
|
|
541
651
|
|
|
542
|
-
Planned
|
|
652
|
+
**Planned** (subject to community feedback):
|
|
543
653
|
|
|
544
|
-
- **More
|
|
545
|
-
|
|
546
|
-
- **
|
|
547
|
-
|
|
548
|
-
- **
|
|
549
|
-
|
|
550
|
-
- **Plugin system**
|
|
551
|
-
- First-class support for user-defined providers and scorers.
|
|
552
|
-
- **Embedding-based scoring**
|
|
553
|
-
- Semantic similarity via embedding distance.
|
|
654
|
+
- **More task packs** — summarization, multi-turn conversation, and code generation packs
|
|
655
|
+
- **Agent workflows** — multi-step tool chains, multi-hop reasoning, and agent traces
|
|
656
|
+
- **More export formats** — CSV
|
|
657
|
+
- **Plugin system** — first-class support for user-defined providers and scorers
|
|
658
|
+
- **Embedding-based scoring** — semantic similarity via embedding distance
|
|
659
|
+
- **More providers** — OpenRouter-native and additional OpenAI-compatible gateways
|
|
554
660
|
|
|
555
|
-
|
|
661
|
+
Have a use case in mind? [Open an issue](https://github.com/DataGobes/agent-duelist/issues) — community feedback shapes what gets built first.
|
|
556
662
|
|
|
557
663
|
---
|
|
558
664
|
|
|
@@ -560,18 +666,18 @@ If you have a specific use case (framework comparisons, multi-agent competitions
|
|
|
560
666
|
|
|
561
667
|
Contributions, issues, and feature requests are welcome.
|
|
562
668
|
|
|
563
|
-
- **Bug reports / ideas**: open a GitHub issue.
|
|
669
|
+
- **Bug reports / ideas**: [open a GitHub issue](https://github.com/DataGobes/agent-duelist/issues).
|
|
564
670
|
- **Code changes**:
|
|
565
|
-
|
|
566
|
-
|
|
567
|
-
|
|
568
|
-
|
|
569
|
-
|
|
671
|
+
1. Fork the repo.
|
|
672
|
+
2. Create a branch.
|
|
673
|
+
3. Run tests: `npm test`.
|
|
674
|
+
4. Run build: `npm run build`.
|
|
675
|
+
5. Open a PR with a clear description.
|
|
570
676
|
|
|
571
|
-
Please
|
|
677
|
+
Please keep PRs narrowly focused (single provider, one new scorer, etc.) so they're easy to review.
|
|
572
678
|
|
|
573
679
|
---
|
|
574
680
|
|
|
575
681
|
## License
|
|
576
682
|
|
|
577
|
-
MIT
|
|
683
|
+
MIT
|