mcpbr-cli 0.4.1 → 0.4.3

This diff shows the changes between the published package versions as they appear in their public registry.
Files changed (2):
  1. package/README.md +66 -0
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -88,6 +88,15 @@ Large-scale MCP tool use evaluation across 45+ categories. Tests agent capabilit
 - **Categories**: Browser, Finance, Code Analysis, and 40+ more
 - **Learn more**: [MCPToolBench++ Paper](https://arxiv.org/pdf/2508.07575) | [GitHub](https://github.com/mcp-tool-bench/MCPToolBenchPP)
 
+### GSM8K
+Grade-school math word problems testing mathematical reasoning and chain-of-thought capabilities.
+
+- **Dataset**: [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)
+- **Task**: Solve math word problems with step-by-step reasoning
+- **Evaluation**: Numeric answer correctness with tolerance
+- **Problem Types**: Multi-step arithmetic and basic algebra
+- **Learn more**: [GSM8K Paper](https://arxiv.org/abs/2110.14168) | [GitHub](https://github.com/openai/grade-school-math)
+
 ```bash
 # Run SWE-bench Verified (default - manually validated tests)
 mcpbr run -c config.yaml
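
The new GSM8K entry scores by "numeric answer correctness with tolerance." mcpbr's actual checker is not part of this diff, so the sketch below only illustrates that kind of check in TypeScript; the function name, the last-number parsing rule, and the tolerance value are all assumptions.

```typescript
// Illustrative tolerance-based GSM8K answer check (hypothetical —
// mcpbr's real implementation is not shown in this diff).
function isCorrectGsm8k(
  modelOutput: string,
  goldAnswer: number,
  tol = 1e-6, // assumed tolerance
): boolean {
  // GSM8K gold answers are plain numbers; take the last number in the
  // model's output, stripping thousands separators like "1,234".
  const numbers = modelOutput.replace(/,/g, "").match(/-?\d+(\.\d+)?/g);
  if (!numbers) return false;
  const predicted = parseFloat(numbers[numbers.length - 1]);
  // Relative-or-absolute tolerance instead of exact string equality.
  return Math.abs(predicted - goldAnswer) <= tol * Math.max(1, Math.abs(goldAnswer));
}

// "…she makes 9 + 9 = 18 dollars. #### 18" → parses 18 → true
console.log(isCorrectGsm8k("she makes 9 + 9 = 18 dollars. #### 18", 18));
```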
@@ -98,6 +107,9 @@ mcpbr run -c config.yaml -b swe-bench-lite
 # Run SWE-bench Full (2,294 tasks, complete benchmark)
 mcpbr run -c config.yaml -b swe-bench-full
 
+# Run GSM8K
+mcpbr run -c config.yaml --benchmark gsm8k -n 50
+
 # List all available benchmarks
 mcpbr benchmarks
 ```
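
The commands above all read a `config.yaml` that this hunk does not show. Judging from the comparison example added further down (which uses `mcp_server_a`/`mcp_server_b`, `benchmark`, and `sample_size`), a single-server config plausibly looks like the sketch below; the top-level `mcp_server` key and the benchmark ID are guesses, not confirmed by this diff.

```yaml
# config.yaml — speculative single-server sketch; key names inferred
# from the comparison-config.yaml example below, not from mcpbr docs
mcp_server:           # guessed analogue of mcp_server_a / mcp_server_b
  name: "My Server"
  command: node
  args: [build/index.js]
  cwd: /path/to/my-server

benchmark: swe-bench-verified   # assumed ID for the default benchmark
sample_size: 10
```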
@@ -320,6 +332,60 @@ max_concurrent: 4
 mcpbr run --config config.yaml
 ```
 
+## Side-by-Side Server Comparison
+
+Compare two MCP servers head-to-head in a single evaluation run to see which implementation performs better.
+
+### Quick Example
+
+```yaml
+# comparison-config.yaml
+comparison_mode: true
+
+mcp_server_a:
+  name: "Task Queries"
+  command: node
+  args: [build/index.js]
+  cwd: /path/to/task-queries
+
+mcp_server_b:
+  name: "Edge Identity"
+  command: node
+  args: [build/index.js]
+  cwd: /path/to/edge-identity
+
+benchmark: swe-bench-lite
+sample_size: 10
+```
+
+```bash
+mcpbr run -c comparison-config.yaml -o results.json
+```
+
+### Results Output
+
+```text
+Side-by-Side MCP Server Comparison
+
+┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━┓
+┃ Metric            ┃ Task Queries ┃ Edge Identity┃ Δ (A - B)┃
+┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━┩
+│ Resolved Tasks    │ 4/10         │ 2/10         │ +2       │
+│ Resolution Rate   │ 40.0%        │ 20.0%        │ +100.0%  │
+└───────────────────┴──────────────┴──────────────┴──────────┘
+
+✓ Task Queries unique wins: 2 tasks
+  - django__django-12286
+  - astropy__astropy-7606
+```
+
+**Use cases:**
+- **A/B testing**: Compare optimized vs. baseline implementations
+- **Tool evaluation**: Test different MCP tool sets
+- **Version comparison**: Benchmark v2.0 vs. v1.5
+
+See [docs/comparison-mode.md](docs/comparison-mode.md) for complete documentation.
+
 ## Claude Code Integration
 
 [![Claude Code Ready](https://img.shields.io/badge/Claude_Code-Ready-5865F2?style=flat&logo=anthropic)](https://claude.ai/download)
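
Since `-o results.json` persists the comparison run, the Δ column and the "unique wins" list from the table above can be recomputed offline. A minimal TypeScript sketch follows; the results.json schema is not shown in this diff, so the `serverA`/`serverB` field names are hypothetical.

```typescript
// Hypothetical post-processing of the results.json written by
// `mcpbr run -c comparison-config.yaml -o results.json`.
// The schema (serverA/serverB, resolved task-ID arrays) is assumed.
import { readFileSync } from "node:fs";

interface ServerResult {
  name: string;
  resolved: string[]; // IDs of tasks this server resolved
  total: number;      // tasks attempted
}

const { serverA, serverB } = JSON.parse(
  readFileSync("results.json", "utf8"),
) as { serverA: ServerResult; serverB: ServerResult };

// Δ (A - B) over resolved counts, as in the comparison table above.
const delta = serverA.resolved.length - serverB.resolved.length;

// Tasks only server A resolved — the report's "unique wins".
const uniqueWins = serverA.resolved.filter(
  (id) => !serverB.resolved.includes(id),
);

console.log(`Δ resolved (A - B): ${delta >= 0 ? "+" : ""}${delta}`);
console.log(`${serverA.name} unique wins (${uniqueWins.length}):`);
for (const id of uniqueWins) console.log(`  - ${id}`);
```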
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "mcpbr-cli",
-  "version": "0.4.1",
+  "version": "0.4.3",
   "description": "Model Context Protocol Benchmark Runner - CLI tool for evaluating MCP servers",
   "keywords": [
     "mcpbr",