agent-duelist 0.1.0

package/README.md ADDED
# agent-duelist

[![CI](https://github.com/DataGobes/agent-duelist/actions/workflows/ci.yml/badge.svg)](https://github.com/DataGobes/agent-duelist/actions/workflows/ci.yml)

> Pit LLM providers against each other on agent tasks. Duel your models.

`agent-duelist` is a TypeScript-first framework to pit multiple LLM providers against each other on the same tasks and get structured, reproducible results: correctness, latency, tokens, and cost.

## What you get

<img width="739" height="473" alt="image" src="https://github.com/user-attachments/assets/41149222-1035-42fd-b643-4f8b856c30a0" />

- Compare OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible gateway.
- Define tasks once, run them against many providers.
- Get CLI tables and JSON results you can feed into dashboards, CI, or docs.

---

## Why agent-duelist?

- **Provider-agnostic**: One config, many providers. Swap models and gateways without rewriting your tasks.
- **Agent-focused**: Designed for agent workflows and tool use, not just single-turn prompts.
- **Realistic metrics**: Latency, token counts, and cost estimates based on a pricing catalog.
- **TypeScript-native DX**: Strongly typed APIs, Zod schemas for structured outputs, and a simple `defineArena()` entrypoint.
- **CLI-first**: `npx duelist init` → `npx duelist run` gets you from zero to a useful results table in minutes.

---

## Installation

```bash
npm install agent-duelist
# or
pnpm add agent-duelist
# or
yarn add agent-duelist
```

You'll also need API keys for the providers you want to benchmark, for example:

```bash
export OPENAI_API_KEY=sk-...
export AZURE_OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GOOGLE_API_KEY=...
```

---

## One-minute quickstart

Initialize a config:

```bash
npx duelist init
```

This creates `arena.config.ts` in your project. A minimal example:

```ts
// arena.config.ts
import { defineArena, openai, azureOpenai } from 'agent-duelist'
import { z } from 'zod'

export default defineArena({
  providers: [
    openai('gpt-4o'),
    azureOpenai('gpt-4o', { deployment: 'my-azure-deployment' }),
  ],
  tasks: [
    {
      name: 'simple-qa',
      prompt: 'In one sentence, explain what a monorepo is.',
      expected:
        'A monorepo is a single repository that contains code for multiple projects.',
    },
    {
      name: 'structured-extraction',
      prompt: 'Extract the company name and year from: "Acme was founded in 2024."',
      expected: { company: 'Acme', year: 2024 },
      schema: z.object({
        company: z.string(),
        year: z.number(),
      }),
    },
  ],
  scorers: ['latency', 'cost', 'correctness', 'schema-correctness', 'fuzzy-similarity'],
  runs: 3,
})
```

Run the benchmark:

```bash
npx duelist run
```

You'll see a matrix like:

- Rows: tasks (`simple-qa`, `structured-extraction`)
- Columns: providers (`openai/gpt-4o`, `azure/gpt-4o`)
- Cells: correctness score, latency, tokens, and estimated cost.

For CI or further processing:

```bash
npx duelist run --reporter json > results.json
```
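If you want to gate CI on these results, a small threshold check is one way to consume the JSON. Note: the result shape used below (`provider`, `task`, `scores.correctness`) is a hypothetical illustration, not the documented output schema; inspect your own `results.json` for the actual field names.

```typescript
// ci-gate.ts: a sketch of failing CI when correctness drops below a threshold.
// The RunResult shape here is an assumption for illustration only.
interface RunResult {
  provider: string
  task: string
  scores: { correctness?: number }
}

function failingRuns(results: RunResult[], threshold = 0.8): RunResult[] {
  // Treat a missing correctness score as 0 so it always fails the gate.
  return results.filter((r) => (r.scores.correctness ?? 0) < threshold)
}

// Example usage with inline data (in CI you would JSON.parse results.json):
const sample: RunResult[] = [
  { provider: 'openai/gpt-4o', task: 'simple-qa', scores: { correctness: 1 } },
  { provider: 'azure/gpt-4o', task: 'simple-qa', scores: { correctness: 0.5 } },
]
const bad = failingRuns(sample)
if (bad.length > 0) {
  console.error(`${bad.length} run(s) below threshold`)
}
```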

---

## Core concepts

### Providers

Providers are **factory functions** that return plain objects implementing a shared `ArenaProvider` interface.

This lets you:

- Swap providers without changing tasks.
- Wrap or extend providers in your own code.
- Mock providers in tests.

Examples:

```ts
import {
  openai,
  azureOpenai,
  anthropic,
  gemini,
  openaiCompatible,
  type ArenaProvider,
} from 'agent-duelist'

const oai = openai('gpt-4o')

const azure = azureOpenai('gpt-4o', {
  deployment: 'my-deployment',
})

const claude = anthropic('claude-sonnet-4-20250514')

const gem = gemini('gemini-2.5-flash') // uses GOOGLE_API_KEY

const local: ArenaProvider = openaiCompatible({
  id: 'local/gpt-4o-like',
  name: 'Local Gateway',
  baseURL: 'http://localhost:11434/v1',
  model: 'gpt-4o',
  apiKeyEnv: 'LOCAL_LLM_API_KEY',
})
```

At minimum, a provider implements:

```ts
interface ArenaProvider {
  id: string // e.g. 'openai/gpt-4o'
  name: string // e.g. 'OpenAI'
  model: string
  run(input: TaskInput): Promise<TaskResult>
}
```
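Because providers are plain objects, a test double is just an object literal. A minimal sketch, with `TaskInput` and `TaskResult` reduced to illustrative placeholder shapes (the library's real types have more fields):

```typescript
// A hypothetical in-memory provider for tests. TaskInput/TaskResult are
// simplified placeholders here, not the library's exact types.
interface TaskInput {
  prompt: string
}
interface TaskResult {
  output: string
  latencyMs: number
  promptTokens: number
  completionTokens: number
}
interface ArenaProvider {
  id: string
  name: string
  model: string
  run(input: TaskInput): Promise<TaskResult>
}

function mockProvider(cannedOutput: string): ArenaProvider {
  return {
    id: 'mock/fixed',
    name: 'Mock',
    model: 'fixed',
    async run(input: TaskInput): Promise<TaskResult> {
      // Always answer with the canned output; token counts are rough stand-ins.
      return {
        output: cannedOutput,
        latencyMs: 0,
        promptTokens: input.prompt.length,
        completionTokens: cannedOutput.length,
      }
    },
  }
}
```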

---

### Tasks

Tasks describe what you want the model to do:

```ts
interface ArenaTask {
  name: string
  prompt: string
  expected?: unknown // used by correctness scorers
  schema?: ZodSchema<any> // used by schema-based scorers
}
```

Examples:

```ts
const tasks: ArenaTask[] = [
  {
    name: 'classify-sentiment',
    prompt: 'Classify the sentiment of: "I love this product".',
    expected: 'positive',
  },
  {
    name: 'extract-structured-data',
    prompt: 'Extract { company, year } from: "Acme was founded in 2024."',
    expected: { company: 'Acme', year: 2024 },
    schema: z.object({
      company: z.string(),
      year: z.number(),
    }),
  },
]
```

---

### Scorers

Scorers take raw model outputs and turn them into **numeric scores** (most normalized to 0–1) with optional details. Built-in scorers:

| Scorer | What it measures |
|--------|-----------------|
| `latency` | Wall-clock response time in milliseconds |
| `cost` | Estimated USD cost from token usage and a bundled pricing catalog |
| `correctness` | Exact match against `expected` (deep-equal, key-order independent for objects) |
| `schema-correctness` | Validates output against the task's Zod `schema` via `safeParse()` |
| `fuzzy-similarity` | Jaccard token-overlap similarity between output and `expected` |
| `llm-judge-correctness` | Async LLM-as-judge that calls a judge model to score correctness 0–1 |

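For intuition, the Jaccard token overlap behind `fuzzy-similarity` can be sketched as follows. This is a simplified illustration (lowercase, whitespace-split tokens); the library's actual tokenization may differ:

```typescript
// Simplified Jaccard token overlap: |A ∩ B| / |A ∪ B| over lowercase
// whitespace-split tokens. Illustrative only; not the library's exact code.
function jaccard(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\s+/).filter(Boolean))
  const tb = new Set(b.toLowerCase().split(/\s+/).filter(Boolean))
  if (ta.size === 0 && tb.size === 0) return 1 // two empty strings match fully
  let overlap = 0
  for (const t of ta) if (tb.has(t)) overlap++
  const union = ta.size + tb.size - overlap
  return overlap / union
}
```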
Configure them in your arena:

```ts
scorers: ['latency', 'cost', 'correctness', 'schema-correctness', 'fuzzy-similarity']
```

The `llm-judge-correctness` scorer evaluates outputs on three criteria (accuracy, completeness, conciseness) and returns a composite decimal score. Configure the judge model directly in your arena config:

```ts
defineArena({
  // ...
  scorers: ['latency', 'cost', 'correctness', 'llm-judge-correctness'],
  judgeModel: 'gemini-3.1-pro-preview', // or any OpenAI/Azure/Gemini model
})
```

The judge model defaults to `gpt-4o-mini` and can also be set via the `DUELIST_JUDGE_MODEL` env var. The judge backend is auto-detected from the model name: `gemini-*` models use Google's API; otherwise it falls back to OpenAI or Azure OpenAI.

You can also add custom scorers for domain-specific metrics (e.g. tool-call correctness, safety, style).
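Conceptually, a custom scorer maps a result to a number. A hypothetical sketch of the idea (the `Scorer` shape and registration mechanism here are illustrative, not the library's actual API):

```typescript
// Hypothetical custom scorer sketch. The real registration API may differ;
// conceptually, a scorer maps a raw output (plus context) to a 0–1 score.
interface ScorerInput {
  output: string
  expected?: unknown
}

type Scorer = (input: ScorerInput) => number

// Example: a style metric that penalizes answers longer than a word budget.
const brevityScorer: Scorer = ({ output }) => {
  const words = output.trim().split(/\s+/).filter(Boolean).length
  const budget = 30
  // Full marks within budget, then linear falloff down to 0 at 2x budget.
  return words <= budget ? 1 : Math.max(0, 1 - (words - budget) / budget)
}
```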

---

## Cost & pricing

Cost estimation is intentionally transparent and conservative:

1. **Token counts**
   Providers return token usage (prompt and completion tokens) in each `TaskResult`. These are treated as the source of truth.

2. **Pricing catalog**
   `agent-duelist` ships with a **locally bundled catalog** of per-token prices for many models, derived from OpenRouter's public pricing pages.

   - The catalog maps `(provider, model)` → `{ inputPerM, outputPerM }` in USD per 1M tokens.
   - Azure OpenAI models are resolved back to their base OpenAI models where possible (e.g. `azure/gpt-4o` → `openai/gpt-4o`) so you don't need to configure Azure pricing manually.

3. **Estimated USD**
   The `cost` scorer computes:

   ```text
   estimatedUsd = (promptTokens * inputPerM + completionTokens * outputPerM) / 1_000_000
   ```

   In the console reporter, you'll see:

   - token counts: `prompt: X, completion: Y`
   - cost: `~$0.0XXm` (millicents; fractions of a cent)
   - a short disclaimer that this is an **estimate** based on a pricing snapshot.

4. **Unknown models**
   If a model is not in the catalog:

   - Tokens are still reported.
   - Cost is marked as unknown (no fake numbers).

You can update the catalog with a script that re-scrapes OpenRouter's public pricing page when prices change.
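The formula above in runnable form. The catalog entry shown is a made-up placeholder, not a real price; the bundled catalog holds the actual values:

```typescript
// Cost estimation following the formula above. The price below is a
// placeholder for illustration; real prices live in the bundled catalog.
interface Price {
  inputPerM: number // USD per 1M prompt tokens
  outputPerM: number // USD per 1M completion tokens
}

const catalog: Record<string, Price> = {
  'openai/gpt-4o': { inputPerM: 2.5, outputPerM: 10 }, // placeholder numbers
}

function estimateUsd(
  model: string,
  promptTokens: number,
  completionTokens: number,
): number | undefined {
  // Unknown models yield undefined rather than a fake number.
  const p = catalog[model]
  if (!p) return undefined
  return (promptTokens * p.inputPerM + completionTokens * p.outputPerM) / 1_000_000
}
```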

---

## CLI usage

Basic commands:

```bash
# Scaffold a new config
npx duelist init

# Run with the default config (arena.config.ts)
npx duelist run

# Use a custom config
npx duelist run --config path/to/arena.config.ts

# Get JSON instead of a table
npx duelist run --reporter json
```

Options (subject to change as the project evolves):

- `--config`: path to a config file (TypeScript).
- `--reporter`: `console` (default) or `json`.

The CLI loads TypeScript configs directly using a lightweight runtime loader, so users don't need to precompile their config.

---

## Example: multi-provider benchmark

Here's a richer example comparing multiple providers across tasks:

```ts
// arena.config.ts
import { defineArena, azureOpenai } from 'agent-duelist'
import { z } from 'zod'

export default defineArena({
  providers: [
    azureOpenai('gpt-5-mini'),
    azureOpenai('gpt-5-nano'),
    azureOpenai('gpt-5.2-chat'),
  ],
  tasks: [
    {
      name: 'extract-company',
      prompt:
        'Extract the company name and role as JSON from: "I work at Acme Corp as a senior engineer." Return {"company": "...", "role": "..."}',
      expected: { company: 'Acme Corp', role: 'senior engineer' },
      schema: z.object({ company: z.string(), role: z.string() }),
    },
    {
      name: 'summarize',
      prompt:
        'Summarize in one sentence: TypeScript is a strongly typed programming language that builds on JavaScript, giving you better tooling at any scale.',
      expected:
        'TypeScript is a typed superset of JavaScript that improves tooling for projects of any size.',
    },
    {
      name: 'classify-sentiment',
      prompt:
        'Classify the sentiment of this review as "positive", "negative", or "neutral". Return only the classification word.\n\nReview: "The product arrived on time and works exactly as described. Very happy with my purchase!"',
      expected: 'positive',
    },
  ],
  scorers: ['latency', 'cost', 'correctness', 'schema-correctness', 'fuzzy-similarity'],
  runs: 3,
})
```

```bash
npx duelist run
```

Output:

```
⬡ Agent Duelist Results (3 runs each)
──────────────────────────────────────────────────────────────────────

Task: extract-company
Provider             Latency   Cost       Tokens   Match   Schema   Fuzzy
─────────────────────────────────────────────────────────────────────────
azure/gpt-5-mini     1905ms    ~$0.189m   140      100%    100%     100%
azure/gpt-5-nano     2079ms    ~$0.081m   249      100%    100%     100%
azure/gpt-5.2-chat   1493ms    ~$0.0011   126      100%    100%     100%

Task: summarize
Provider             Latency   Cost       Tokens   Match   Schema   Fuzzy
─────────────────────────────────────────────────────────────────────────
azure/gpt-5-mini     1723ms    ~$0.192m   127      0%      —        36%
azure/gpt-5-nano     2117ms    ~$0.081m   234      0%      —        43%
azure/gpt-5.2-chat   1008ms    ~$0.584m   72       0%      —        43%

Task: classify-sentiment
Provider             Latency   Cost       Tokens   Match   Schema   Fuzzy
─────────────────────────────────────────────────────────────────────────
azure/gpt-5-mini     1012ms    ~$0.077m   82       100%    —        100%
azure/gpt-5-nano     1075ms    ~$0.024m   104      100%    —        100%
azure/gpt-5.2-chat   936ms     ~$0.526m   81       100%    —        100%

──────────────────────────────────────────────────────────────────────
Summary

◆ Most correct: azure/gpt-5-mini (OpenAI via Azure) (avg 67%)
◆ Fastest: azure/gpt-5.2-chat (OpenAI via Azure) (avg 1146ms)
◆ Cheapest: azure/gpt-5-nano (OpenAI via Azure) (avg ~$0.062m)

Costs estimated from OpenRouter pricing catalog.
```

---

## Tool-calling agent example

agent-duelist supports tool-calling tasks: define tools with handlers, and the provider will execute them during the benchmark.

```ts
import { defineArena, openai } from 'agent-duelist'
import { z } from 'zod'

const weatherTool = {
  name: 'getCurrentWeather',
  description: 'Get the current weather in a given city',
  parameters: z.object({ city: z.string() }),
  handler: async ({ city }: { city: string }) => ({
    city,
    tempC: 20,
  }),
}

export default defineArena({
  providers: [openai('gpt-4o')],
  tasks: [
    {
      name: 'weather-tool-call',
      prompt: 'What is the current temperature in Amsterdam? Use the tool.',
      expected: { city: 'Amsterdam' },
      tools: [weatherTool],
    },
  ],
  scorers: ['latency', 'cost', 'tool-usage'],
  runs: 1,
})
```

The model calls `getCurrentWeather`, the handler returns a stub result, and the `tool-usage` scorer reports whether the expected tool was invoked. Tool calls and their results are included in the JSON output for inspection.
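Conceptually, a tool-usage check scans the recorded tool calls for the expected tool name. A hypothetical sketch, with types simplified for illustration (not the library's actual scorer):

```typescript
// Illustrative sketch of a tool-usage check: did the run invoke the
// expected tool at least once? The library's real scorer may differ.
interface ToolCall {
  name: string
  args: Record<string, unknown>
}

function toolUsageScore(calls: ToolCall[], expectedTool: string): number {
  return calls.some((c) => c.name === expectedTool) ? 1 : 0
}
```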

---

## Roadmap

Shipped so far:

- OpenAI, Azure OpenAI, Anthropic, Google Gemini, and OpenAI-compatible providers
- 7 built-in scorers including LLM-as-judge, tool-usage, schema validation, and fuzzy similarity
- Tool-calling support with local handlers for agent task benchmarking
- Colored console reporter with per-task tables and cross-provider summary
- JSON reporter for CI/pipeline integration
- Pricing catalog from OpenRouter with refresh script

Planned directions (subject to community feedback):

- **More providers**
  - OpenRouter-native and more OpenAI-compatible gateways.
- **Better reporting**
  - Markdown/HTML/CSV reports.
  - GitHub Actions summaries.
- **Agent workflows**
  - Multi-step tool chains, multi-hop reasoning, and agent traces.
- **Plugin system**
  - First-class support for user-defined providers and scorers.
- **Embedding-based scoring**
  - Semantic similarity via embedding distance.

If you have a specific use case (framework comparisons, multi-agent competitions, tool-calling benchmarks), please open an issue; those will shape what gets built first.

---

## Contributing

Contributions, issues, and feature requests are welcome.

- **Bug reports / ideas**: open a GitHub issue.
- **Code changes**:
  - Fork the repo.
  - Create a branch.
  - Run tests: `npm test`.
  - Run build: `npm run build`.
  - Open a PR with a clear description and, if possible, a small repro.

Please try to keep PRs narrowly focused (single provider, one new scorer, etc.) so they're easy to review.

---

## License

MIT.