mcpbr-cli 0.4.2 → 0.4.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +116 -0
- package/package.json +1 -1
package/README.md
CHANGED
@@ -88,6 +88,15 @@ Large-scale MCP tool use evaluation across 45+ categories. Tests agent capabilit
 - **Categories**: Browser, Finance, Code Analysis, and 40+ more
 - **Learn more**: [MCPToolBench++ Paper](https://arxiv.org/pdf/2508.07575) | [GitHub](https://github.com/mcp-tool-bench/MCPToolBenchPP)
 
+### GSM8K
+Grade-school math word problems testing mathematical reasoning and chain-of-thought capabilities.
+
+- **Dataset**: [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)
+- **Task**: Solve math word problems with step-by-step reasoning
+- **Evaluation**: Numeric answer correctness with tolerance
+- **Problem Types**: Multi-step arithmetic and basic algebra
+- **Learn more**: [GSM8K Paper](https://arxiv.org/abs/2110.14168) | [GitHub](https://github.com/openai/grade-school-math)
+
 ```bash
 # Run SWE-bench Verified (default - manually validated tests)
 mcpbr run -c config.yaml
@@ -98,6 +107,9 @@ mcpbr run -c config.yaml -b swe-bench-lite
 # Run SWE-bench Full (2,294 tasks, complete benchmark)
 mcpbr run -c config.yaml -b swe-bench-full
 
+# Run GSM8K
+mcpbr run -c config.yaml --benchmark gsm8k -n 50
+
 # List all available benchmarks
 mcpbr benchmarks
 ```
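The GSM8K scoring rule introduced above ("Numeric answer correctness with tolerance") can be sketched roughly as follows. This is a hypothetical illustration, not mcpbr's actual checker; the only dataset-specific assumption is that GSM8K gold solutions end their rationale with a `#### <number>` line, as in the official dataset.

```python
import re

def extract_gold_answer(solution: str) -> float:
    """GSM8K gold solutions end with a final line like '#### 1,920'."""
    match = re.search(r"####\s*([-+]?[\d,]*\.?\d+)", solution)
    if match is None:
        raise ValueError("no final '#### <number>' answer found")
    return float(match.group(1).replace(",", ""))

def is_correct(predicted: float, gold: float, tol: float = 1e-4) -> bool:
    """Numeric correctness within a small absolute-or-relative tolerance."""
    return abs(predicted - gold) <= tol * max(1.0, abs(gold))
```

A tolerance check like this avoids failing answers that differ from the gold value only by floating-point formatting; the exact tolerance mcpbr applies is not specified here.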
@@ -320,6 +332,110 @@ max_concurrent: 4
 mcpbr run --config config.yaml
 ```
 
+## Infrastructure Modes
+
+mcpbr supports running evaluations on different infrastructure platforms, allowing you to scale evaluations or offload compute-intensive tasks to cloud VMs.
+
+### Local (Default)
+
+Run evaluations on your local machine:
+
+```yaml
+infrastructure:
+  mode: local # default
+```
+
+This is the default mode - evaluations run directly on your machine using local Docker containers.
+
+### Azure VM
+
+Run evaluations on Azure Virtual Machines with automatic provisioning and cleanup:
+
+```yaml
+infrastructure:
+  mode: azure
+  azure:
+    resource_group: mcpbr-benchmarks
+    location: eastus
+    cpu_cores: 10
+    memory_gb: 40
+```
+
+**Key features:**
+- Zero manual VM setup - provisioned automatically from config
+- Automatic Docker, Python, and mcpbr installation
+- Test task validation before full evaluation
+- Auto-cleanup after completion (configurable)
+- Cost-optimized with automatic VM deletion
+
+**Example usage:**
+```bash
+# Run evaluation on Azure VM
+mcpbr run -c azure-config.yaml
+
+# VM is automatically created, evaluation runs, results are downloaded, VM is deleted
+```
+
+See [docs/infrastructure/azure.md](docs/infrastructure/azure.md) for full documentation including:
+- Prerequisites and authentication
+- VM sizing and cost estimation
+- Debugging with `preserve_on_error`
+- Troubleshooting guide
+
+## Side-by-Side Server Comparison
+
+Compare two MCP servers head-to-head in a single evaluation run to see which implementation performs better.
+
+### Quick Example
+
+```yaml
+# comparison-config.yaml
+comparison_mode: true
+
+mcp_server_a:
+  name: "Task Queries"
+  command: node
+  args: [build/index.js]
+  cwd: /path/to/task-queries
+
+mcp_server_b:
+  name: "Edge Identity"
+  command: node
+  args: [build/index.js]
+  cwd: /path/to/edge-identity
+
+benchmark: swe-bench-lite
+sample_size: 10
+```
+
+```bash
+mcpbr run -c comparison-config.yaml -o results.json
+```
+
+### Results Output
+
+```text
+Side-by-Side MCP Server Comparison
+
+┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━┓
+┃ Metric            ┃ Task Queries ┃ Edge Identity┃ Δ (A - B)┃
+┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━┩
+│ Resolved Tasks    │ 4/10         │ 2/10         │ +2       │
+│ Resolution Rate   │ 40.0%        │ 20.0%        │ +100.0%  │
+└───────────────────┴──────────────┴──────────────┴──────────┘
+
+✓ Task Queries unique wins: 2 tasks
+  - django__django-12286
+  - astropy__astropy-7606
+```
+
+**Use cases:**
+- **A/B testing**: Compare optimized vs. baseline implementations
+- **Tool evaluation**: Test different MCP tool sets
+- **Version comparison**: Benchmark v2.0 vs. v1.5
+
+See [docs/comparison-mode.md](docs/comparison-mode.md) for complete documentation.
+
 ## Claude Code Integration
 
 [](https://claude.ai/download)
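The Δ column and "unique wins" list in the comparison output above follow from set arithmetic over each server's resolved task IDs. A minimal sketch with made-up sample data shaped like the table (the two shared task IDs are invented placeholders; none of these names are mcpbr internals):

```python
# Task IDs each server resolved (sample data matching the 4/10 vs 2/10 table).
resolved_a = {"django__django-12286", "astropy__astropy-7606",
              "sympy__sympy-13480", "requests__requests-2317"}  # Task Queries
resolved_b = {"sympy__sympy-13480", "requests__requests-2317"}  # Edge Identity
total = 10

# Unique wins: tasks one server resolved that the other did not.
wins_a = sorted(resolved_a - resolved_b)
wins_b = sorted(resolved_b - resolved_a)

# Δ (A - B) on resolved counts, plus the relative rate difference.
delta_resolved = len(resolved_a) - len(resolved_b)          # +2
rate_a = len(resolved_a) / total                            # 0.40
rate_b = len(resolved_b) / total                            # 0.20
relative_delta = (rate_a - rate_b) / rate_b * 100           # +100.0

print(f"Task Queries unique wins: {len(wins_a)} tasks")
for task in wins_a:
    print(f"  - {task}")
```

Note that the Resolution Rate Δ in the sample output (+100.0% for 40% vs 20%) appears to be a relative difference (40% is twice 20%), not the percentage-point difference (+20.0).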