agent-duelist 0.2.0 → 0.2.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +17 -3
- package/dist/cli.js +2754 -2102
- package/dist/cli.js.map +1 -1
- package/dist/index.cjs +1054 -346
- package/dist/index.cjs.map +1 -1
- package/dist/index.d.cts +8 -7
- package/dist/index.d.ts +8 -7
- package/dist/index.js +1053 -346
- package/dist/index.js.map +1 -1
- package/package.json +1 -1
package/README.md
CHANGED
|
@@ -11,7 +11,8 @@
|
|
|
11
11
|
|
|
12
12
|
## What you get
|
|
13
13
|
> 
|
|
14
|
-
|
|
14
|
+
>
|
|
15
|
+
> 
|
|
15
16
|
|
|
16
17
|
- Compare OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible gateway.
|
|
17
18
|
- Define tasks once, run them against many providers.
|
|
@@ -110,6 +111,12 @@ For CI or further processing:
|
|
|
110
111
|
npx duelist run --reporter json > results.json
|
|
111
112
|
```
|
|
112
113
|
|
|
114
|
+
Generate a shareable HTML report:
|
|
115
|
+
|
|
116
|
+
```bash
|
|
117
|
+
npx duelist run --reporter html --output report.html
|
|
118
|
+
```
|
|
119
|
+
|
|
113
120
|
---
|
|
114
121
|
|
|
115
122
|
## Core concepts
|
|
@@ -420,6 +427,12 @@ npx duelist run
|
|
|
420
427
|
|
|
421
428
|
Output includes box-drawing tables with medals, color-ranked metrics, sparkline bars, and a winner row per task — see the [screenshot above](#what-you-get) for a real example.
|
|
422
429
|
|
|
430
|
+
**How scoring works:**
|
|
431
|
+
|
|
432
|
+
- Providers are compared **head-to-head within each task** — all providers receive the same prompt at the same time.
|
|
433
|
+
- Medals are awarded only when a provider is the **sole leader** in a metric column. Ties don't award medals, keeping rankings meaningful.
|
|
434
|
+
- The overall winner is determined by category wins across correctness, latency, and cost.
|
|
435
|
+
|
|
423
436
|
---
|
|
424
437
|
|
|
425
438
|
## Tool-calling agent example
|
|
@@ -514,10 +527,11 @@ With cost summary, flakiness warnings, and pass/fail verdict.
|
|
|
514
527
|
|
|
515
528
|
Shipped so far:
|
|
516
529
|
|
|
517
|
-
- OpenAI, Azure OpenAI, Anthropic, Google Gemini, and OpenAI-compatible
|
|
530
|
+
- OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible provider
|
|
518
531
|
- 7 built-in scorers including LLM-as-judge, tool-usage, schema validation, and fuzzy similarity
|
|
519
532
|
- Tool-calling support with local handlers for agent task benchmarking
|
|
520
|
-
-
|
|
533
|
+
- Fair head-to-head benchmarking: tasks run sequentially while providers race in parallel, ensuring fair latency comparison without queue-induced timeout penalties
|
|
534
|
+
- Console reporter with box-drawing tables, sole-leader medal rankings, sparkline bars (toggleable), and per-task winner rows
|
|
521
535
|
- Configurable per-request timeout to prevent hanging on unresponsive APIs
|
|
522
536
|
- JSON reporter for CI/pipeline integration
|
|
523
537
|
- Markdown reporter for PR comments
|