agent-duelist 0.2.1 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (10)

package/README.md CHANGED

@@ -1,32 +1,46 @@
 # agent-duelist
+[![npm version](https://img.shields.io/npm/v/agent-duelist?color=f59e0b)](https://www.npmjs.com/package/agent-duelist)
 [![CI](https://github.com/DataGobes/agent-duelist/actions/workflows/ci.yml/badge.svg)](https://github.com/DataGobes/agent-duelist/actions/workflows/ci.yml)
 [![Docs](https://img.shields.io/badge/docs-landing%20page-f59e0b)](https://datagobes.github.io/agent-duelist/)
+[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](./LICENSE)
-> Pit LLM providers against each other on agent tasks — Duel your models.
+> Pit LLM providers against each other on agent tasks — **Duel your models.**
 >
 > **[View the landing page →](https://datagobes.github.io/agent-duelist/)**
-`agent-duelist` is a TypeScript-first framework to pit multiple LLM providers against each other on the same tasks and get structured, reproducible results: correctness, latency, tokens, and cost.
+`agent-duelist` is a TypeScript-first benchmarking framework that runs the same tasks against multiple LLM providers and gives you structured, reproducible results: correctness, latency, tokens, cost, and more.
+```bash
+npx duelist init   # scaffold a config
+npx duelist run    # see who wins
+```
 ## What you get
-> ![Agent Duelist console output](docs/assets/screenshot.png)
->
-> ![Agent Duelist HTML report](docs/assets/screenshot-html.png)
-- Compare OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible gateway.
-- Define tasks once, run them against many providers.
-- Get CLI tables and JSON results you can feed into dashboards, CI, or docs.
+**Console output** — box-drawing tables with medals, color-ranked metrics, sparklines, and per-task winners:
+![Agent Duelist console output](docs/assets/screenshot.png)
+**HTML report** — a self-contained, shareable single-file report with sortable tables, progress bars, tab navigation, and summary cards:
+![Agent Duelist HTML report](docs/assets/screenshot-html.png)
 ---
 ## Why agent-duelist?
-- **Provider-agnostic**: One config, many providers. Swap models and gateways without rewriting your tasks.
-- **Agent-focused**: Designed for agent workflows and tool use, not just single-turn prompts.
-- **Realistic metrics**: Latency, token counts, and cost estimates based on a pricing catalog.
-- **TypeScript-native DX**: Strongly typed APIs, Zod schemas for structured outputs, and a simple `defineArena()` entrypoint.
-- **CLI-first**: `npx duelist init` → `npx duelist run` gets you from zero to useful table in minutes.
+| | |
+|---|---|
+| **Provider-agnostic** | One config, many providers. Swap models and gateways without rewriting tasks. |
+| **Agent-focused** | Built for agent workflows and tool use, not just single-turn prompts. |
+| **Task packs** | Built-in benchmark suites — run with `--pack structured-output`, zero config needed. |
+| **Quality-first ranking** | Medals decided by output quality. Speed and cost only break ties — a fast model that gets nothing right won't medal. |
+| **7 built-in scorers** | Correctness, latency, cost, schema validation, fuzzy similarity, LLM-as-judge, and tool usage. |
+| **Fair benchmarking** | Tasks run sequentially while providers race in parallel — fair latency comparison with no queue-induced penalties. |
+| **TypeScript-native** | Strongly typed APIs, Zod schemas for structured outputs, and a simple `defineArena()` entrypoint. |
+| **CI-ready** | Regression detection with confidence intervals, cost budgets, PR comments, and a prebuilt GitHub Action. |
+| **CLI-first** | `npx duelist init` → `npx duelist run` gets you from zero to results in minutes. |
 ---
@@ -40,7 +54,7 @@ pnpm add agent-duelist
 yarn add agent-duelist
 ```
-You'll also need API keys for the providers you want to benchmark, for example:
+Set API keys for the providers you want to benchmark:
 ```bash
 export OPENAI_API_KEY=sk-...
@@ -51,7 +65,7 @@ export GOOGLE_API_KEY=...
 ---
-## One-minute quickstart
+## Quickstart
 Initialize a config:
@@ -59,7 +73,7 @@ Initialize a config:
 npx duelist init
 ```
-This creates `arena.config.ts` in your project. A minimal example:
+This creates `arena.config.ts` in your project:
 ```ts
 // arena.config.ts
@@ -99,39 +113,78 @@ Run the benchmark:
 npx duelist run
 ```
-You'll see a matrix like:
+You'll see a results matrix:
-- Rows: tasks (`simple-qa`, `structured-extraction`)
-- Columns: providers (`openai/gpt-5-mini`, `azure/gpt-5-mini`)
-- Cells: correctness score, latency, tokens, and estimated cost.
+- **Rows**: tasks (`simple-qa`, `structured-extraction`)
+- **Columns**: providers (`openai/gpt-5-mini`, `azure/gpt-5-mini`)
+- **Cells**: correctness score, latency, tokens, and estimated cost
-For CI or further processing:
+Export the results in different formats:
 ```bash
+# JSON for CI pipelines and dashboards
 npx duelist run --reporter json > results.json
+# Self-contained HTML report you can share or host
+npx duelist run --reporter html --output report.html
 ```
-Generate a shareable HTML report:
+---
+## Task packs
+Task packs are **built-in benchmark suites** — curated sets of tasks with recommended scorers. Use them to benchmark providers without writing any tasks yourself.
 ```bash
-npx duelist run --reporter html --output report.html
+# List available packs
+npx duelist run --pack list
+# Run a pack with your providers
+npx duelist run --pack structured-output --config arena.config.ts
 ```
----
+Your config only needs to define providers — the pack supplies tasks and scorers:
-## Core concepts
+```ts
+// arena.config.ts
+import { defineArena, openai, anthropic } from 'agent-duelist'
-### Providers
+export default defineArena({
+  providers: [
+    openai('gpt-5-mini'),
+    anthropic('claude-sonnet-4.6'),
+  ],
+  tasks: [],     // ignored when --pack is used
+  scorers: [],   // pack supplies its own scorers
+})
+```
-Providers are **factory functions** that return plain objects implementing a shared `ArenaProvider` interface.
+### Available packs
-This lets you:
+| Pack | Tasks | Description |
+|------|-------|-------------|
+| `structured-output` | 6 | Zod schema stress test — flat objects, nesting, arrays, enums, empty arrays, and adversarial input |
-- Swap providers without changing tasks.
-- Wrap or extend providers in your own code.
-- Mock providers in tests.
+Packs work with both `run` and `ci` commands:
-Examples:
+```bash
+# CI with a task pack
+npx duelist ci --pack structured-output --threshold correctness=0.1 --budget 1.00
+```
+You can also combine multiple packs:
+```bash
+npx duelist run --pack structured-output,another-pack
+```
+---
+## Core concepts
+### Providers
+Providers are **factory functions** that return plain objects implementing a shared `ArenaProvider` interface. This lets you swap providers without changing tasks, wrap or extend providers in your own code, and mock providers in tests.
 ```ts
 import {
@@ -140,25 +193,40 @@ import {
   anthropic,
   gemini,
   openaiCompatible,
-  type ArenaProvider,
 } from 'agent-duelist'
+// OpenAI
 const oai = openai('gpt-5-mini')
+// Azure OpenAI
 const azure = azureOpenai('gpt-5-mini', {
   deployment: 'my-deployment',
 })
+// Anthropic
 const claude = anthropic('claude-sonnet-4.6')
-const gem = gemini('gemini-3-flash-preview') // uses GOOGLE_API_KEY
+// Google Gemini
+const gem = gemini('gemini-3-flash-preview')
-const local: ArenaProvider = openaiCompatible({
+// Any OpenAI-compatible gateway (Ollama, LiteLLM, vLLM, etc.)
+const local = openaiCompatible({
   id: 'local/llama',
-  name: 'Local Gateway',
+  name: 'Local Ollama',
   baseURL: 'http://localhost:11434/v1',
   model: 'llama3.3',
   apiKeyEnv: 'LOCAL_LLM_API_KEY',
+  free: true,           // registers zero-cost pricing
+})
+// Reasoning models that emit <think> blocks (DeepSeek-R1, MiniMax M2.5, etc.)
+const deepseek = openaiCompatible({
+  id: 'deepseek/r1',
+  name: 'DeepSeek R1',
+  baseURL: 'https://api.deepseek.com/v1',
+  model: 'deepseek-reasoner',
+  apiKeyEnv: 'DEEPSEEK_API_KEY',
+  stripThinking: true,  // strips <think>...</think> from output
 })
 ```
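The `stripThinking: true` option added in this hunk is described as stripping `<think>...</think>` from the output. A minimal sketch of that behavior, assuming a simple regex-based approach. The helper name `stripThinkBlocks` is hypothetical and is not part of the `agent-duelist` API; the library's actual implementation may differ.

```ts
// Hypothetical helper (not from agent-duelist): removes every
// <think>...</think> block from raw model output.
function stripThinkBlocks(raw: string): string {
  // [\s\S]*? matches across newlines and is non-greedy,
  // so each block is removed separately.
  return raw.replace(/<think>[\s\S]*?<\/think>/g, '').trim()
}

const rawOutput = '<think>The user wants a capital city.</think>\nParis'
console.log(stripThinkBlocks(rawOutput)) // prints "Paris"
```

Stripping before scoring matters for exact-match style scorers, since reasoning text in the output would otherwise count against the model.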
@@ -185,6 +253,7 @@ interface ArenaTask {
   prompt: string
   expected?: unknown         // used by correctness scorers
   schema?: ZodSchema<any>    // used by schema-based scorers
+  tools?: ToolDefinition[]   // used by tool-calling scorers
 }
 ```
@@ -213,7 +282,7 @@ const tasks: ArenaTask[] = [
 ### Scorers
-Scorers take raw model outputs and turn them into **numeric scores** (0–1) with optional details. Built-in scorers:
+Scorers turn raw model outputs into **numeric scores** (0–1) with optional details. Seven built-in scorers ship out of the box:
 | Scorer | What it measures |
 |--------|-----------------|
@@ -222,7 +291,8 @@ Scorers take raw model outputs and turn them into **numeric scores** (0–1) wit
 | `correctness` | Exact match against `expected` (deep-equal, key-order independent for objects) |
 | `schema-correctness` | Validates output against the task's Zod `schema` via `safeParse()` |
 | `fuzzy-similarity` | Jaccard token-overlap similarity between output and `expected` |
-| `llm-judge-correctness` | Async LLM-as-judge — calls a judge model to score correctness 0–1 |
+| `tool-usage` | Whether the model invoked the expected tool(s) during a tool-calling task |
+| `llm-judge-correctness` | LLM-as-judge — calls a judge model to score accuracy, completeness, and conciseness |
 Configure them in your arena:
@@ -242,8 +312,6 @@ defineArena({
 The judge model defaults to `gpt-5-mini`. It can also be set via the `DUELIST_JUDGE_MODEL` env var. The judge backend is auto-detected from the model name — `gemini-*` models use Google's API, otherwise it falls back to OpenAI or Azure OpenAI.
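The scorer table describes `fuzzy-similarity` as Jaccard token-overlap similarity. A minimal sketch of that metric, assuming lowercase word tokens split on non-word characters. The tokenization rule, the empty-input handling, and the function names are assumptions for illustration, not the library's code.

```ts
// Split text into a set of lowercase word tokens.
// This tokenization rule is an assumption, not the library's.
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\W+/).filter((t) => t.length > 0))
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, in the range 0..1.
function jaccardSimilarity(output: string, expected: string): number {
  const a = tokenize(output)
  const b = tokenize(expected)
  // Two empty inputs are treated as identical (assumption).
  if (a.size === 0 && b.size === 0) return 1
  let intersection = 0
  a.forEach((token) => {
    if (b.has(token)) intersection++
  })
  const union = a.size + b.size - intersection
  return intersection / union
}

// Shared tokens: {the, cat}; union: {the, cat, sat, ran} -> 2 / 4
console.log(jaccardSimilarity('the cat sat', 'the cat ran')) // 0.5
```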
-You can also add custom scorers for domain-specific metrics (e.g. tool-call correctness, safety, style).
 ---
 ### Arena options
@@ -252,7 +320,7 @@ You can also add custom scorers for domain-specific metrics (e.g. tool-call corr
 | Option | Type | Default | Description |
 |--------|------|---------|-------------|
-| `runs` | `number` | `1` | Number of runs per provider × task combination. Higher values improve statistical confidence for CI regression detection. |
+| `runs` | `number` | `1` | Number of runs per provider x task combination. Higher values improve statistical confidence for CI regression detection. |
 | `judgeModel` | `string` | `'gpt-5-mini'` | Model used by the `llm-judge-correctness` scorer. Also settable via `DUELIST_JUDGE_MODEL` env var. Gemini models auto-route to Google's API. |
 | `timeout` | `number` | `60000` | Per-request timeout in milliseconds. Requests exceeding this are marked as failures. Prevents hanging on unresponsive APIs. |
 | `sparklines` | `boolean` | `true` | Show sparkline bars next to percentage scores in the console reporter. Disable with `false` if your terminal doesn't render Unicode block characters well. |
@@ -273,43 +341,62 @@ export default defineArena({
 ---
+## Reporters
+agent-duelist includes four output formats, each suited to a different workflow:
+| Reporter | Flag | Use case |
+|----------|------|----------|
+| **Console** | `--reporter console` (default) | Interactive development — box-drawing tables with medals, sparklines, color-ranked metrics, and per-task winners |
+| **JSON** | `--reporter json` | CI pipelines, dashboards, and downstream tooling |
+| **HTML** | `--reporter html` | Shareable single-file reports with sortable tables, animated backgrounds, tab navigation, CSS progress bars, medal rankings, and summary cards |
+| **Markdown** | `--comment` (CI mode) | Auto-posted PR comment with comparison table, cost summary, and pass/fail verdict |
+Generate an HTML report:
+```bash
+npx duelist run --reporter html --output report.html
+```
+The HTML report is a single self-contained file — no external dependencies, no build step. Open it in any browser or host it as a static page.
+---
 ## Cost & pricing
 Cost estimation is intentionally transparent and conservative:
-1. **Token counts**
-   Providers return token usage (prompt and completion tokens) in each `TaskResult`. These are treated as the source of truth.
-2. **Pricing catalog**
-   `agent-duelist` ships with a **locally bundled catalog** of per-token prices for many models, derived from OpenRouter's public pricing pages.
+1. **Token counts** — Providers return token usage (prompt and completion tokens) in each `TaskResult`. These are the source of truth.
-   - The catalog maps `(provider, model)` → `{ inputPerM, outputPerM }` in USD per 1M tokens.
-   - Azure OpenAI models are resolved back to their base OpenAI models where possible (e.g. `azure/gpt-5-mini` → `openai/gpt-5-mini`) so you don't need to configure Azure pricing manually.
+2. **Pricing catalog** — `agent-duelist` ships with a **locally bundled catalog** of per-token prices for many models, derived from OpenRouter's public pricing pages.
+   - The catalog maps `(provider, model)` to `{ inputPerM, outputPerM }` in USD per 1M tokens.
+   - Azure OpenAI models resolve back to their base OpenAI models (e.g. `azure/gpt-5-mini` → `openai/gpt-5-mini`).
+   - Cross-provider fallback: models hosted on Groq, Together, Fireworks, etc. resolve to the original provider's pricing.
-3. **Estimated USD**
-   The `cost` scorer computes:
+3. **Estimated USD** — The `cost` scorer computes:
    ```text
    estimatedUsd = (promptTokens * inputPerM + completionTokens * outputPerM) / 1_000_000
    ```
-   In the console reporter, you'll see:
+4. **Unknown models** — If a model is not in the catalog, tokens are still reported and cost is marked as unknown (no fake numbers).
-   - token counts: `prompt: X, completion: Y`
-   - cost: `~$0.0XXm` (millicents; fractions of a cent)
-   - a short disclaimer that this is an **estimate** based on a pricing snapshot.
+5. **Custom pricing** — Register pricing for models not in the catalog:
-4. **Unknown models**
-   If a model is not in the catalog:
+   ```ts
+   import { registerPricing } from 'agent-duelist'
-   - Tokens are still reported.
-   - Cost is marked as unknown (no fake numbers).
+   registerPricing('custom/my-model', {
+     inputPerToken: 0.000003,
+     outputPerToken: 0.000015,
+   })
+   ```
-You can update the catalog with a script that re-scrapes OpenRouter's public pricing page when prices change.
+You can update the bundled catalog with `npm run update:pricing`, which re-scrapes OpenRouter's public pricing page.
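The formula in step 3 can be checked with a small worked example. This is a sketch only: the `ModelPricing` interface and `estimateUsd` function are hypothetical names, and the prices are made up for illustration ($3 and $15 per 1M tokens, which would correspond to the `inputPerToken: 0.000003` and `outputPerToken: 0.000015` values in step 5 if those are USD per single token).

```ts
// Prices in USD per 1M tokens, matching the catalog shape described above.
interface ModelPricing {
  inputPerM: number
  outputPerM: number
}

// Direct transcription of the README formula.
function estimateUsd(
  promptTokens: number,
  completionTokens: number,
  pricing: ModelPricing,
): number {
  return (promptTokens * pricing.inputPerM + completionTokens * pricing.outputPerM) / 1_000_000
}

// Made-up prices for illustration only.
const examplePricing: ModelPricing = { inputPerM: 3, outputPerM: 15 }

// 1,000 prompt tokens and 500 completion tokens:
// (1000 * 3 + 500 * 15) / 1_000_000 = 10_500 / 1_000_000 = 0.0105 USD
console.log(estimateUsd(1000, 500, examplePricing).toFixed(4)) // "0.0105"
```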
 ---
-## CLI usage
+## CLI reference
 ### `duelist init`
@@ -317,30 +404,46 @@ Scaffold a new `arena.config.ts` in the current directory.
 ```bash
 npx duelist init
+npx duelist init --force  # overwrite an existing config
 ```
+| Option | Description |
+|--------|-------------|
+| `--force` | Overwrite existing config file |
 ### `duelist run`
 Run benchmarks defined in your arena config.
 ```bash
-# Run with the default config (arena.config.ts)
+# Default config, console output
 npx duelist run
-# Use a custom config
+# Custom config
 npx duelist run --config path/to/arena.config.ts
-# Get JSON instead of a table
-npx duelist run --reporter json
+# Run a built-in task pack
+npx duelist run --pack structured-output
+# List available packs
+npx duelist run --pack list
+# JSON for piping
+npx duelist run --reporter json > results.json
-# Suppress per-result progress lines
+# HTML report
+npx duelist run --reporter html --output report.html
+# Quiet mode
 npx duelist run --quiet
 ```
 | Option | Description |
 |--------|-------------|
 | `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
-| `--reporter <type>` | Output format: `console` (default) or `json` |
+| `--pack <names>` | Run built-in task pack(s) instead of config tasks. Comma-separated for multiple packs. Use `list` to show available packs. |
+| `--reporter <type>` | Output format: `console` (default), `json`, or `html` |
+| `--output <path>` | Output file path for HTML reporter (default: `duelist-report.html`) |
 | `-q, --quiet` | Suppress per-result progress |
 ### `duelist ci`
@@ -354,6 +457,9 @@ npx duelist ci --update-baseline
 # Subsequent runs — compare against baseline
 npx duelist ci --threshold correctness=0.1 --budget 1.00
+# Run CI with a task pack
+npx duelist ci --pack structured-output --threshold correctness=0.1
 # Post comparison table as a PR comment (GitHub Actions)
 npx duelist ci --threshold correctness=0.1 --comment
 ```
@@ -361,6 +467,7 @@ npx duelist ci --threshold correctness=0.1 --comment
 | Option | Description |
 |--------|-------------|
 | `-c, --config <path>` | Path to config file (default: `arena.config.ts`) |
+| `--pack <names>` | Run built-in task pack(s) instead of config tasks |
 | `--baseline <path>` | Baseline JSON file (default: `.duelist/baseline.json`) |
 | `--budget <dollars>` | Max total cost in USD — fails if exceeded |
 | `--threshold <scorer=delta>` | Regression threshold (repeatable, e.g. `--threshold correctness=0.1 --threshold cost=0.002`) |
@@ -375,24 +482,61 @@ npx duelist ci --threshold correctness=0.1 --comment
 - Without `--threshold` flags, regression detection is skipped entirely — only `--budget` is enforced.
 - Results with high variance (CV > 0.3) are flagged as **flaky** with a warning.
-The CLI loads TypeScript configs directly using a lightweight runtime loader so users don't need to precompile their config.
+The CLI loads TypeScript configs directly using a lightweight runtime loader so you don't need to precompile your config.
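The flakiness rule in the notes above (CV > 0.3) refers to the coefficient of variation, the standard deviation divided by the mean. A minimal sketch, assuming the population standard deviation and treating a zero mean as no variation. Both choices and all names here are assumptions, since the README does not say how the CV is computed.

```ts
// Coefficient of variation: standard deviation divided by mean.
// Uses the population standard deviation (assumption).
function coefficientOfVariation(values: number[]): number {
  if (values.length === 0) return 0
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length
  // Avoid dividing by zero; treated as "no variation" here (assumption).
  if (mean === 0) return 0
  const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length
  return Math.sqrt(variance) / mean
}

// Threshold quoted in the notes above.
const FLAKY_CV_THRESHOLD = 0.3

function isFlaky(scores: number[]): boolean {
  return coefficientOfVariation(scores) > FLAKY_CV_THRESHOLD
}

console.log(isFlaky([0.9, 0.9, 0.9])) // false: near-identical runs, CV is approximately 0
console.log(isFlaky([0.2, 0.8]))      // true: mean 0.5, std dev 0.3, CV = 0.6
```

This is also why a higher `runs` value helps: with a single run there is no variance to measure.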
+---
+## Tool-calling agent example
+agent-duelist supports tool-calling tasks — define tools with Zod-typed parameters and handlers, and the provider will execute them during the benchmark:
+```ts
+import { defineArena, openai } from 'agent-duelist'
+import { z } from 'zod'
+const weatherTool = {
+  name: 'getCurrentWeather',
+  description: 'Get the current weather in a given city',
+  parameters: z.object({ city: z.string() }),
+  handler: async ({ city }: { city: string }) => ({
+    city,
+    tempC: 20,
+  }),
+}
+export default defineArena({
+  providers: [openai('gpt-5-mini')],
+  tasks: [
+    {
+      name: 'weather-tool-call',
+      prompt: 'What is the current temperature in Amsterdam? Use the tool.',
+      expected: { city: 'Amsterdam' },
+      tools: [weatherTool],
+    },
+  ],
+  scorers: ['latency', 'cost', 'tool-usage'],
+  runs: 1,
+})
+```
+The model calls `getCurrentWeather`, the handler returns a stub result, and the `tool-usage` scorer reports whether the expected tool was invoked. Tool calls and their results are included in the JSON output for inspection.
 ---
 ## Example: multi-provider benchmark
-Here's a richer example comparing multiple providers across tasks:
+A richer example comparing multiple providers across tasks:
 ```ts
 // arena.config.ts
-import { defineArena, azureOpenai } from 'agent-duelist'
+import { defineArena, azureOpenai, openai, gemini } from 'agent-duelist'
 import { z } from 'zod'
 export default defineArena({
   providers: [
-    azureOpenai('gpt-5-mini'),
+    openai('gpt-5-mini'),
     azureOpenai('gpt-5-nano'),
-    azureOpenai('gpt-5.2-chat'),
+    gemini('gemini-3-flash-preview'),
   ],
   tasks: [
     {
@@ -425,50 +569,13 @@ export default defineArena({
 npx duelist run
 ```
-Output includes box-drawing tables with medals, color-ranked metrics, sparkline bars, and a winner row per task — see the [screenshot above](#what-you-get) for a real example.
 **How scoring works:**
 - Providers are compared **head-to-head within each task** — all providers receive the same prompt at the same time.
 - Medals are awarded only when a provider is the **sole leader** in a metric column. Ties don't award medals, keeping rankings meaningful.
-- The overall winner is determined by category wins across correctness, latency, and cost.
----
-## Tool-calling agent example
-agent-duelist supports tool-calling tasks — define tools with handlers, and the provider will execute them during the benchmark:
-```ts
-import { defineArena, openai } from 'agent-duelist'
-import { z } from 'zod'
-const weatherTool = {
-  name: 'getCurrentWeather',
-  description: 'Get the current weather in a given city',
-  parameters: z.object({ city: z.string() }),
-  handler: async ({ city }: { city: string }) => ({
-    city,
-    tempC: 20,
-  }),
-}
-export default defineArena({
-  providers: [openai('gpt-5-mini')],
-  tasks: [
-    {
-      name: 'weather-tool-call',
-      prompt: 'What is the current temperature in Amsterdam? Use the tool.',
-      expected: { city: 'Amsterdam' },
-      tools: [weatherTool],
-    },
-  ],
-  scorers: ['latency', 'cost', 'tool-usage'],
-  runs: 1,
-})
-```
-The model calls `getCurrentWeather`, the handler returns a stub result, and the `tool-usage` scorer reports whether the expected tool was invoked. Tool calls and their results are included in the JSON output for inspection.
+- **Quality-first ranking**: medals are decided by quality scorer wins (correctness, schema-correctness, etc.). Efficiency metrics (latency, cost) only break ties among quality-equal providers.
+- **Quality gate**: providers that score 0% on all quality scorers are ineligible for medals entirely — being fast or cheap doesn't compensate for getting nothing right.
+- The overall winner is the provider with the highest average correctness score.
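The quality-first ranking and quality gate described above can be sketched as a filter followed by a sort. This is a simplified illustration, not the library's algorithm: the `ProviderSummary` shape and `rankForMedals` name are hypothetical, per-column sole-leader medals are ignored, and breaking ties by latency before cost is an assumption.

```ts
// Hypothetical per-provider summary; not an agent-duelist type.
interface ProviderSummary {
  id: string
  quality: number   // average of quality scorers, 0..1
  latencyMs: number // lower is better
  costUsd: number   // lower is better
}

function rankForMedals(providers: ProviderSummary[]): ProviderSummary[] {
  return providers
    // Quality gate: zero quality means no medal eligibility.
    .filter((p) => p.quality > 0)
    .sort(
      (a, b) =>
        b.quality - a.quality ||     // quality decides first
        a.latencyMs - b.latencyMs || // efficiency only breaks ties
        a.costUsd - b.costUsd,
    )
}

const ranked = rankForMedals([
  { id: 'fast-but-wrong', quality: 0, latencyMs: 200, costUsd: 0.0001 },
  { id: 'accurate-slow', quality: 0.9, latencyMs: 1500, costUsd: 0.002 },
  { id: 'accurate-fast', quality: 0.9, latencyMs: 800, costUsd: 0.003 },
])
console.log(ranked.map((p) => p.id)) // [ 'accurate-fast', 'accurate-slow' ]
```

The fastest and cheapest provider is dropped entirely because it scored 0 on quality, which is the behavior the quality gate bullet describes.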
 ---
@@ -516,8 +623,8 @@ When `--comment` is enabled, the CI posts (or updates) a markdown table on the P
 | Provider | Task | Scorer | Baseline | Current | Delta | Status |
 |----------|------|--------|----------|---------|-------|--------|
-| openai/gpt-5-mini | extract | correctness | 0.900 | 0.850 | -0.050 | ⚪ unchanged |
-| openai/gpt-5-mini | extract | latency | 0.920 ± 0.030 | 0.890 ± 0.025 | -0.030 | ⚪ unchanged |
+| openai/gpt-5-mini | extract | correctness | 0.900 | 0.850 | -0.050 | unchanged |
+| openai/gpt-5-mini | extract | latency | 0.920 ± 0.030 | 0.890 ± 0.025 | -0.030 | unchanged |
 With cost summary, flakiness warnings, and pass/fail verdict.
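The `unchanged` status in the first table row follows from the threshold used in the CI examples (`--threshold correctness=0.1`): a delta of -0.050 stays within 0.1. A minimal sketch of that comparison, under stated assumptions: the function names and the `regressed` and `improved` labels are hypothetical (only `unchanged` appears in the source), and the confidence-interval logic the README mentions is left out.

```ts
type Status = 'regressed' | 'improved' | 'unchanged'

// Parses a "scorer=delta" pair as passed to --threshold.
function parseThreshold(arg: string): { scorer: string; delta: number } {
  const [scorer, rawDelta] = arg.split('=')
  const delta = Number(rawDelta)
  if (!scorer || !rawDelta || Number.isNaN(delta)) {
    throw new Error(`Invalid threshold: ${arg}`)
  }
  return { scorer, delta }
}

// Compares two scores where higher is better.
// A change counts only if it exceeds the allowed delta.
function classify(baseline: number, current: number, delta: number): Status {
  const change = current - baseline
  if (change < -delta) return 'regressed'
  if (change > delta) return 'improved'
  return 'unchanged'
}

const { delta } = parseThreshold('correctness=0.1')
console.log(classify(0.9, 0.85, delta)) // "unchanged", as in the first table row
```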
@@ -525,34 +632,31 @@ With cost summary, flakiness warnings, and pass/fail verdict.
 ## Roadmap
-Shipped so far:
+**Shipped:**
-- OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible provider
+- 5 provider types: OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible gateway
 - 7 built-in scorers including LLM-as-judge, tool-usage, schema validation, and fuzzy similarity
 - Tool-calling support with local handlers for agent task benchmarking
-- Fair head-to-head benchmarking: tasks run sequentially while providers race in parallel, ensuring fair latency comparison without queue-induced timeout penalties
-- Console reporter with box-drawing tables, sole-leader medal rankings, sparkline bars (toggleable), and per-task winner rows
-- Configurable per-request timeout to prevent hanging on unresponsive APIs
-- JSON reporter for CI/pipeline integration
-- Markdown reporter for PR comments
-- `duelist ci` command with regression detection, cost budgets, and flakiness warnings
+- **Task packs**: built-in benchmark suites (`structured-output`) — run with `--pack`, no config writing needed
+- Quality-first medal ranking: output quality decides medals, efficiency only breaks ties
+- Fair head-to-head benchmarking with parallel provider execution
+- 4 reporters: console (tables + medals + sparklines), JSON, HTML (sortable, self-contained), and Markdown (PR comments)
+- `duelist ci` with regression detection (confidence intervals), cost budgets, and flakiness warnings
 - GitHub Action for CI/CD integration
-- Pricing catalog from OpenRouter with refresh script
+- Pricing catalog with cross-provider fallback and `registerPricing()` for custom models
+- `openaiCompatible` with `stripThinking` for reasoning models and `free` flag for local models
+- Configurable per-request timeout
-Planned directions (subject to community feedback):
+**Planned** (subject to community feedback):
-- **More providers**
-  - OpenRouter-native and more OpenAI-compatible gateways.
-- **Better reporting**
-  - HTML and CSV export options.
-- **Agent workflows**
-  - Multi-step tool chains, multi-hop reasoning, and agent traces.
-- **Plugin system**
-  - First-class support for user-defined providers and scorers.
-- **Embedding-based scoring**
-  - Semantic similarity via embedding distance.
+- **More task packs** — reasoning, summarization, tool-calling, and multi-turn conversation packs
+- **Agent workflows** — multi-step tool chains, multi-hop reasoning, and agent traces
+- **More export formats** — CSV
+- **Plugin system** — first-class support for user-defined providers and scorers
+- **Embedding-based scoring** — semantic similarity via embedding distance
+- **More providers** — OpenRouter-native and additional OpenAI-compatible gateways
-If you have a specific use case (framework comparisons, multi-agent competitions, tool-calling benchmarks), please open an issue — those will shape what gets built first.
+Have a use case in mind? [Open an issue](https://github.com/DataGobes/agent-duelist/issues) — community feedback shapes what gets built first.
 ---
@@ -560,18 +664,18 @@ If you have a specific use case (framework comparisons, multi-agent competitions
 Contributions, issues, and feature requests are welcome.
-- **Bug reports / ideas**: open a GitHub issue.
+- **Bug reports / ideas**: [open a GitHub issue](https://github.com/DataGobes/agent-duelist/issues).
 - **Code changes**:
-  - Fork the repo.
-  - Create a branch.
-  - Run tests: `npm test`.
-  - Run build: `npm run build`.
-  - Open a PR with a clear description and, if possible, a small repro.
+  1. Fork the repo.
+  2. Create a branch.
+  3. Run tests: `npm test`.
+  4. Run build: `npm run build`.
+  5. Open a PR with a clear description.
-Please try to keep PRs narrowly focused (single provider, one new scorer, etc.) so they're easy to review.
+Please keep PRs narrowly focused (single provider, one new scorer, etc.) so they're easy to review.
 ---
 ## License
-MIT.
+MIT