mcpbr-cli 0.4.3 → 0.4.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (2)
  1. package/README.md +80 -46
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -56,59 +56,37 @@ mcpbr runs controlled experiments: same model, same tasks, same environment - th
 
  ## Supported Benchmarks
 
- mcpbr supports multiple software engineering benchmarks through a flexible abstraction layer:
+ mcpbr supports 25+ benchmarks across 8 categories through a flexible abstraction layer:
 
- ### SWE-bench (Default)
- Real GitHub issues requiring bug fixes and patches. The agent generates unified diffs evaluated by running pytest test suites.
+ | Category | Benchmarks |
+ |----------|-----------|
+ | **Software Engineering** | [SWE-bench](https://greynewell.github.io/mcpbr/benchmarks/swe-bench/) (Verified/Lite/Full), [APPS](https://greynewell.github.io/mcpbr/benchmarks/apps/), [CodeContests](https://greynewell.github.io/mcpbr/benchmarks/codecontests/), [BigCodeBench](https://greynewell.github.io/mcpbr/benchmarks/bigcodebench/), [LeetCode](https://greynewell.github.io/mcpbr/benchmarks/leetcode/), [CoderEval](https://greynewell.github.io/mcpbr/benchmarks/codereval/), [Aider Polyglot](https://greynewell.github.io/mcpbr/benchmarks/aider-polyglot/) |
+ | **Code Generation** | [HumanEval](https://greynewell.github.io/mcpbr/benchmarks/humaneval/), [MBPP](https://greynewell.github.io/mcpbr/benchmarks/mbpp/) |
+ | **Math & Reasoning** | [GSM8K](https://greynewell.github.io/mcpbr/benchmarks/gsm8k/), [MATH](https://greynewell.github.io/mcpbr/benchmarks/math/), [BigBench-Hard](https://greynewell.github.io/mcpbr/benchmarks/bigbench-hard/) |
+ | **Knowledge & QA** | [TruthfulQA](https://greynewell.github.io/mcpbr/benchmarks/truthfulqa/), [HellaSwag](https://greynewell.github.io/mcpbr/benchmarks/hellaswag/), [ARC](https://greynewell.github.io/mcpbr/benchmarks/arc/), [GAIA](https://greynewell.github.io/mcpbr/benchmarks/gaia/) |
+ | **Tool Use & Agents** | [MCPToolBench++](https://greynewell.github.io/mcpbr/benchmarks/mcptoolbench/), [ToolBench](https://greynewell.github.io/mcpbr/benchmarks/toolbench/), [AgentBench](https://greynewell.github.io/mcpbr/benchmarks/agentbench/), [WebArena](https://greynewell.github.io/mcpbr/benchmarks/webarena/), [TerminalBench](https://greynewell.github.io/mcpbr/benchmarks/terminalbench/), [InterCode](https://greynewell.github.io/mcpbr/benchmarks/intercode/) |
+ | **ML Research** | [MLAgentBench](https://greynewell.github.io/mcpbr/benchmarks/mlagentbench/) |
+ | **Code Understanding** | [RepoQA](https://greynewell.github.io/mcpbr/benchmarks/repoqa/) |
+ | **Security** | [CyberGym](https://greynewell.github.io/mcpbr/benchmarks/cybergym/) |
 
- - **Task**: Generate patches to fix bugs
- - **Evaluation**: Test suite pass/fail
- - **Pre-built images**: Available for most tasks
+ ### Featured Benchmarks
 
- **Variants:**
- - **swe-bench-verified** (default) - Manually validated test cases for higher quality evaluation ([SWE-bench/SWE-bench_Verified](https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified))
- - **swe-bench-lite** - 300 tasks, quick testing ([SWE-bench/SWE-bench_Lite](https://huggingface.co/datasets/SWE-bench/SWE-bench_Lite))
- - **swe-bench-full** - 2,294 tasks, complete benchmark ([SWE-bench/SWE-bench](https://huggingface.co/datasets/SWE-bench/SWE-bench))
+ **SWE-bench** (Default) - Real GitHub issues requiring bug fixes. Three variants: Verified (500 manually validated), Lite (300 curated), and Full (2,294 complete). Pre-built Docker images available.
 
- ### CyberGym
- Security vulnerabilities requiring Proof-of-Concept (PoC) exploits. The agent generates exploits that trigger crashes in vulnerable code.
+ **CyberGym** - Security vulnerabilities requiring PoC exploits. 4 difficulty levels controlling context. Uses AddressSanitizer for crash detection.
 
- - **Dataset**: [sunblaze-ucb/cybergym](https://huggingface.co/datasets/sunblaze-ucb/cybergym)
- - **Task**: Generate PoC exploits
- - **Evaluation**: PoC crashes pre-patch, doesn't crash post-patch
- - **Difficulty levels**: 0-3 (controls context given to agent)
- - **Learn more**: [CyberGym Project](https://cybergym.cs.berkeley.edu/)
+ **MCPToolBench++** - Large-scale MCP tool use evaluation across 45+ categories. Tests tool discovery, selection, invocation, and result interpretation.
 
- ### MCPToolBench++
- Large-scale MCP tool use evaluation across 45+ categories. Tests agent capabilities in tool discovery, selection, invocation, and result interpretation.
-
- - **Dataset**: [MCPToolBench/MCPToolBenchPP](https://huggingface.co/datasets/MCPToolBench/MCPToolBenchPP)
- - **Task**: Complete tasks using appropriate MCP tools
- - **Evaluation**: Tool selection accuracy, parameter correctness, sequence matching
- - **Categories**: Browser, Finance, Code Analysis, and 40+ more
- - **Learn more**: [MCPToolBench++ Paper](https://arxiv.org/pdf/2508.07575) | [GitHub](https://github.com/mcp-tool-bench/MCPToolBenchPP)
-
- ### GSM8K
- Grade-school math word problems testing mathematical reasoning and chain-of-thought capabilities.
-
- - **Dataset**: [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)
- - **Task**: Solve math word problems with step-by-step reasoning
- - **Evaluation**: Numeric answer correctness with tolerance
- - **Problem Types**: Multi-step arithmetic and basic algebra
- - **Learn more**: [GSM8K Paper](https://arxiv.org/abs/2110.14168) | [GitHub](https://github.com/openai/grade-school-math)
+ **GSM8K** - Grade-school math word problems testing chain-of-thought reasoning with numeric answer matching.
 
  ```bash
- # Run SWE-bench Verified (default - manually validated tests)
+ # Run SWE-bench Verified (default)
  mcpbr run -c config.yaml
 
- # Run SWE-bench Lite (300 tasks, quick testing)
- mcpbr run -c config.yaml -b swe-bench-lite
-
- # Run SWE-bench Full (2,294 tasks, complete benchmark)
- mcpbr run -c config.yaml -b swe-bench-full
-
- # Run GSM8K
+ # Run any benchmark
+ mcpbr run -c config.yaml --benchmark humaneval -n 20
  mcpbr run -c config.yaml --benchmark gsm8k -n 50
+ mcpbr run -c config.yaml --benchmark cybergym --level 2
 
  # List all available benchmarks
  mcpbr benchmarks
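
The consolidated example block above boils the old per-variant commands down to a single pattern: pick a benchmark with `--benchmark` and cap the sample size with `-n`. Below is a hedged sketch of how that pattern scripts a quick smoke run across several of the newly listed benchmarks; it uses only the flags shown in the hunk (`-c`, `--benchmark`, `-n`), and the `mbpp` slug is an assumption taken from the table - confirm the exact slugs with `mcpbr benchmarks`.

```bash
# Hedged sketch: 10-task smoke runs across a few of the newly listed benchmarks.
# Flags (-c, --benchmark, -n) come from the README examples above; benchmark slugs
# other than humaneval/gsm8k are assumptions - verify them with `mcpbr benchmarks`.
for bench in humaneval mbpp gsm8k; do
  mcpbr run -c config.yaml --benchmark "$bench" -n 10
done
```
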
@@ -332,6 +310,56 @@ max_concurrent: 4
  mcpbr run --config config.yaml
  ```
 
+ ## Infrastructure Modes
+
+ mcpbr supports running evaluations on different infrastructure platforms, allowing you to scale evaluations or offload compute-intensive tasks to cloud VMs.
+
+ ### Local (Default)
+
+ Run evaluations on your local machine:
+
+ ```yaml
+ infrastructure:
+   mode: local # default
+ ```
+
+ This is the default mode - evaluations run directly on your machine using local Docker containers.
+
+ ### Azure VM
+
+ Run evaluations on Azure Virtual Machines with automatic provisioning and cleanup:
+
+ ```yaml
+ infrastructure:
+   mode: azure
+   azure:
+     resource_group: mcpbr-benchmarks
+     location: eastus
+     cpu_cores: 10
+     memory_gb: 40
+ ```
+
+ **Key features:**
+ - Zero manual VM setup - provisioned automatically from config
+ - Automatic Docker, Python, and mcpbr installation
+ - Test task validation before full evaluation
+ - Auto-cleanup after completion (configurable)
+ - Cost-optimized with automatic VM deletion
+
+ **Example usage:**
+ ```bash
+ # Run evaluation on Azure VM
+ mcpbr run -c azure-config.yaml
+
+ # VM is automatically created, evaluation runs, results are downloaded, VM is deleted
+ ```
+
+ See [docs/infrastructure/azure.md](docs/infrastructure/azure.md) for full documentation including:
+ - Prerequisites and authentication
+ - VM sizing and cost estimation
+ - Debugging with `preserve_on_error`
+ - Troubleshooting guide
+
  ## Side-by-Side Server Comparison
 
  Compare two MCP servers head-to-head in a single evaluation run to see which implementation performs better.
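
The new Azure section documents the happy path (provision, run, download, delete) and points to `preserve_on_error` for debugging. Below is a hedged sketch of a debug-oriented run; the `resource_group`, `location`, `cpu_cores`, and `memory_gb` keys are taken from the hunk above, while the name placement of `preserve_on_error` under the `azure:` block is an assumption - docs/infrastructure/azure.md is authoritative on the exact schema.

```bash
# Hedged sketch: build a debug-friendly Azure config and run it.
# The infrastructure keys mirror the README example; preserve_on_error's exact
# placement is an assumption - check docs/infrastructure/azure.md before relying on it.
cat > azure-debug.yaml <<'EOF'
# ... your usual mcpbr config (model, MCP server, benchmark settings) goes here ...
infrastructure:
  mode: azure
  azure:
    resource_group: mcpbr-benchmarks
    location: eastus
    cpu_cores: 10
    memory_gb: 40
    preserve_on_error: true  # assumed placement; keeps the VM alive if a run fails
EOF

mcpbr run -c azure-debug.yaml
```
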
@@ -1215,12 +1243,18 @@ mcpbr/
  │ ├── models.py # Supported model registry
  │ ├── providers.py # LLM provider abstractions (extensible)
  │ ├── harnesses.py # Agent harness implementations (extensible)
- │ ├── benchmarks/ # Benchmark abstraction layer
+ │ ├── benchmarks/ # Benchmark abstraction layer (25+ benchmarks)
  │ │ ├── __init__.py # Registry and factory
  │ │ ├── base.py # Benchmark protocol
- │ │ ├── swebench.py # SWE-bench implementation
- │ │ ├── cybergym.py # CyberGym implementation
- │ │ └── mcptoolbench.py # MCPToolBench++ implementation
+ │ │ ├── swebench.py # SWE-bench (Verified/Lite/Full)
+ │ │ ├── cybergym.py # CyberGym security
+ │ │ ├── humaneval.py # HumanEval code generation
+ │ │ ├── gsm8k.py # GSM8K math reasoning
+ │ │ ├── mcptoolbench.py # MCPToolBench++ tool use
+ │ │ ├── apps.py # APPS coding problems
+ │ │ ├── mbpp.py # MBPP Python problems
+ │ │ ├── math_benchmark.py # MATH competition math
+ │ │ └── ... # 15+ more benchmarks
  │ ├── harness.py # Main orchestrator
  │ ├── agent.py # Baseline agent implementation
  │ ├── docker_env.py # Docker environment management + in-container execution
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "mcpbr-cli",
- "version": "0.4.3",
+ "version": "0.4.5",
  "description": "Model Context Protocol Benchmark Runner - CLI tool for evaluating MCP servers",
  "keywords": [
  "mcpbr",