mcpbr-cli 0.4.2 → 0.4.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +66 -0
- package/package.json +1 -1
package/README.md
CHANGED
@@ -88,6 +88,15 @@ Large-scale MCP tool use evaluation across 45+ categories. Tests agent capabilit
 - **Categories**: Browser, Finance, Code Analysis, and 40+ more
 - **Learn more**: [MCPToolBench++ Paper](https://arxiv.org/pdf/2508.07575) | [GitHub](https://github.com/mcp-tool-bench/MCPToolBenchPP)
 
+### GSM8K
+Grade-school math word problems testing mathematical reasoning and chain-of-thought capabilities.
+
+- **Dataset**: [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)
+- **Task**: Solve math word problems with step-by-step reasoning
+- **Evaluation**: Numeric answer correctness with tolerance
+- **Problem Types**: Multi-step arithmetic and basic algebra
+- **Learn more**: [GSM8K Paper](https://arxiv.org/abs/2110.14168) | [GitHub](https://github.com/openai/grade-school-math)
+
 ```bash
 # Run SWE-bench Verified (default - manually validated tests)
 mcpbr run -c config.yaml
@@ -98,6 +107,9 @@ mcpbr run -c config.yaml -b swe-bench-lite
 # Run SWE-bench Full (2,294 tasks, complete benchmark)
 mcpbr run -c config.yaml -b swe-bench-full
 
+# Run GSM8K
+mcpbr run -c config.yaml --benchmark gsm8k -n 50
+
 # List all available benchmarks
 mcpbr benchmarks
 ```
@@ -320,6 +332,60 @@ max_concurrent: 4
 mcpbr run --config config.yaml
 ```
 
+## Side-by-Side Server Comparison
+
+Compare two MCP servers head-to-head in a single evaluation run to see which implementation performs better.
+
+### Quick Example
+
+```yaml
+# comparison-config.yaml
+comparison_mode: true
+
+mcp_server_a:
+  name: "Task Queries"
+  command: node
+  args: [build/index.js]
+  cwd: /path/to/task-queries
+
+mcp_server_b:
+  name: "Edge Identity"
+  command: node
+  args: [build/index.js]
+  cwd: /path/to/edge-identity
+
+benchmark: swe-bench-lite
+sample_size: 10
+```
+
+```bash
+mcpbr run -c comparison-config.yaml -o results.json
+```
+
+### Results Output
+
+```text
+Side-by-Side MCP Server Comparison
+
+┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━┓
+┃ Metric            ┃ Task Queries ┃ Edge Identity┃ Δ (A - B)┃
+┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━┩
+│ Resolved Tasks    │ 4/10         │ 2/10         │ +2       │
+│ Resolution Rate   │ 40.0%        │ 20.0%        │ +100.0%  │
+└───────────────────┴──────────────┴──────────────┴──────────┘
+
+✓ Task Queries unique wins: 2 tasks
+  - django__django-12286
+  - astropy__astropy-7606
+```
+
+**Use cases:**
+- **A/B testing**: Compare optimized vs. baseline implementations
+- **Tool evaluation**: Test different MCP tool sets
+- **Version comparison**: Benchmark v2.0 vs. v1.5
+
+See [docs/comparison-mode.md](docs/comparison-mode.md) for complete documentation.
+
 ## Claude Code Integration
 
 [](https://claude.ai/download)
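Note that the Δ column in the sample output above appears to mix absolute and relative deltas: resolved tasks are differenced directly (4 − 2 = +2), while the resolution-rate delta of +100.0% is consistent with a relative change, (A − B)/B. A sketch of that arithmetic — the function name is a hypothetical illustration, not part of mcpbr's API:

```python
def comparison_deltas(resolved_a: int, resolved_b: int, total: int) -> dict:
    """Reproduce the Δ (A - B) column: absolute task delta, relative rate delta."""
    rate_a = resolved_a / total
    rate_b = resolved_b / total
    return {
        "resolved_delta": resolved_a - resolved_b,            # 4 - 2 = +2
        "rate_delta_pct": 100.0 * (rate_a - rate_b) / rate_b,  # relative change of A vs. B
    }


# With the table's numbers (4/10 vs. 2/10):
deltas = comparison_deltas(4, 2, 10)
```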