agent-duelist 0.2.0 → 0.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -11,7 +11,8 @@
11
11
 
12
12
  ## What you get
13
13
  > ![Agent Duelist console output](docs/assets/screenshot.png)
14
-
14
+ >
15
+ > ![Agent Duelist HTML report](docs/assets/screenshot-html.png)
15
16
 
16
17
  - Compare OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible gateway.
17
18
  - Define tasks once, run them against many providers.
@@ -110,6 +111,12 @@ For CI or further processing:
110
111
  npx duelist run --reporter json > results.json
111
112
  ```
112
113
 
114
+ Generate a shareable HTML report:
115
+
116
+ ```bash
117
+ npx duelist run --reporter html --output report.html
118
+ ```
119
+
113
120
  ---
114
121
 
115
122
  ## Core concepts
@@ -420,6 +427,12 @@ npx duelist run
420
427
 
421
428
  Output includes box-drawing tables with medals, color-ranked metrics, sparkline bars, and a winner row per task — see the [screenshot above](#what-you-get) for a real example.
422
429
 
430
+ **How scoring works:**
431
+
432
+ - Providers are compared **head-to-head within each task** — all providers receive the same prompt at the same time.
433
+ - Medals are awarded only when a provider is the **sole leader** in a metric column. Ties don't award medals, keeping rankings meaningful.
434
+ - The overall winner is determined by category wins across correctness, latency, and cost.
435
+
423
436
  ---
424
437
 
425
438
  ## Tool-calling agent example
@@ -514,10 +527,11 @@ With cost summary, flakiness warnings, and pass/fail verdict.
514
527
 
515
528
  Shipped so far:
516
529
 
517
- - OpenAI, Azure OpenAI, Anthropic, Google Gemini, and OpenAI-compatible providers
530
+ - OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible provider
518
531
  - 7 built-in scorers including LLM-as-judge, tool-usage, schema validation, and fuzzy similarity
519
532
  - Tool-calling support with local handlers for agent task benchmarking
520
- - Console reporter with box-drawing tables, medal rankings, sparkline bars (toggleable), and per-task winner rows
533
+ - Fair head-to-head benchmarking: tasks run sequentially while providers race in parallel, ensuring fair latency comparison without queue-induced timeout penalties
534
+ - Console reporter with box-drawing tables, sole-leader medal rankings, sparkline bars (toggleable), and per-task winner rows
521
535
  - Configurable per-request timeout to prevent hanging on unresponsive APIs
522
536
  - JSON reporter for CI/pipeline integration
523
537
  - Markdown reporter for PR comments