mcpbr-cli 0.4.3 → 0.4.5
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +80 -46
- package/package.json +1 -1
package/README.md
CHANGED
@@ -56,59 +56,37 @@ mcpbr runs controlled experiments: same model, same tasks, same environment - th
 
 ## Supported Benchmarks
 
-mcpbr supports
+mcpbr supports 25+ benchmarks across 8 categories through a flexible abstraction layer:
 
-
-
+| Category | Benchmarks |
+|----------|-----------|
+| **Software Engineering** | [SWE-bench](https://greynewell.github.io/mcpbr/benchmarks/swe-bench/) (Verified/Lite/Full), [APPS](https://greynewell.github.io/mcpbr/benchmarks/apps/), [CodeContests](https://greynewell.github.io/mcpbr/benchmarks/codecontests/), [BigCodeBench](https://greynewell.github.io/mcpbr/benchmarks/bigcodebench/), [LeetCode](https://greynewell.github.io/mcpbr/benchmarks/leetcode/), [CoderEval](https://greynewell.github.io/mcpbr/benchmarks/codereval/), [Aider Polyglot](https://greynewell.github.io/mcpbr/benchmarks/aider-polyglot/) |
+| **Code Generation** | [HumanEval](https://greynewell.github.io/mcpbr/benchmarks/humaneval/), [MBPP](https://greynewell.github.io/mcpbr/benchmarks/mbpp/) |
+| **Math & Reasoning** | [GSM8K](https://greynewell.github.io/mcpbr/benchmarks/gsm8k/), [MATH](https://greynewell.github.io/mcpbr/benchmarks/math/), [BigBench-Hard](https://greynewell.github.io/mcpbr/benchmarks/bigbench-hard/) |
+| **Knowledge & QA** | [TruthfulQA](https://greynewell.github.io/mcpbr/benchmarks/truthfulqa/), [HellaSwag](https://greynewell.github.io/mcpbr/benchmarks/hellaswag/), [ARC](https://greynewell.github.io/mcpbr/benchmarks/arc/), [GAIA](https://greynewell.github.io/mcpbr/benchmarks/gaia/) |
+| **Tool Use & Agents** | [MCPToolBench++](https://greynewell.github.io/mcpbr/benchmarks/mcptoolbench/), [ToolBench](https://greynewell.github.io/mcpbr/benchmarks/toolbench/), [AgentBench](https://greynewell.github.io/mcpbr/benchmarks/agentbench/), [WebArena](https://greynewell.github.io/mcpbr/benchmarks/webarena/), [TerminalBench](https://greynewell.github.io/mcpbr/benchmarks/terminalbench/), [InterCode](https://greynewell.github.io/mcpbr/benchmarks/intercode/) |
+| **ML Research** | [MLAgentBench](https://greynewell.github.io/mcpbr/benchmarks/mlagentbench/) |
+| **Code Understanding** | [RepoQA](https://greynewell.github.io/mcpbr/benchmarks/repoqa/) |
+| **Security** | [CyberGym](https://greynewell.github.io/mcpbr/benchmarks/cybergym/) |
 
-
-- **Evaluation**: Test suite pass/fail
-- **Pre-built images**: Available for most tasks
+### Featured Benchmarks
 
-**
-- **swe-bench-verified** (default) - Manually validated test cases for higher quality evaluation ([SWE-bench/SWE-bench_Verified](https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified))
-- **swe-bench-lite** - 300 tasks, quick testing ([SWE-bench/SWE-bench_Lite](https://huggingface.co/datasets/SWE-bench/SWE-bench_Lite))
-- **swe-bench-full** - 2,294 tasks, complete benchmark ([SWE-bench/SWE-bench](https://huggingface.co/datasets/SWE-bench/SWE-bench))
+**SWE-bench** (Default) - Real GitHub issues requiring bug fixes. Three variants: Verified (500 manually validated), Lite (300 curated), and Full (2,294 complete). Pre-built Docker images available.
 
-
-Security vulnerabilities requiring Proof-of-Concept (PoC) exploits. The agent generates exploits that trigger crashes in vulnerable code.
+**CyberGym** - Security vulnerabilities requiring PoC exploits. 4 difficulty levels controlling context. Uses AddressSanitizer for crash detection.
 
--
-- **Task**: Generate PoC exploits
-- **Evaluation**: PoC crashes pre-patch, doesn't crash post-patch
-- **Difficulty levels**: 0-3 (controls context given to agent)
-- **Learn more**: [CyberGym Project](https://cybergym.cs.berkeley.edu/)
+**MCPToolBench++** - Large-scale MCP tool use evaluation across 45+ categories. Tests tool discovery, selection, invocation, and result interpretation.
 
-
-Large-scale MCP tool use evaluation across 45+ categories. Tests agent capabilities in tool discovery, selection, invocation, and result interpretation.
-
-- **Dataset**: [MCPToolBench/MCPToolBenchPP](https://huggingface.co/datasets/MCPToolBench/MCPToolBenchPP)
-- **Task**: Complete tasks using appropriate MCP tools
-- **Evaluation**: Tool selection accuracy, parameter correctness, sequence matching
-- **Categories**: Browser, Finance, Code Analysis, and 40+ more
-- **Learn more**: [MCPToolBench++ Paper](https://arxiv.org/pdf/2508.07575) | [GitHub](https://github.com/mcp-tool-bench/MCPToolBenchPP)
-
-### GSM8K
-Grade-school math word problems testing mathematical reasoning and chain-of-thought capabilities.
-
-- **Dataset**: [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)
-- **Task**: Solve math word problems with step-by-step reasoning
-- **Evaluation**: Numeric answer correctness with tolerance
-- **Problem Types**: Multi-step arithmetic and basic algebra
-- **Learn more**: [GSM8K Paper](https://arxiv.org/abs/2110.14168) | [GitHub](https://github.com/openai/grade-school-math)
+**GSM8K** - Grade-school math word problems testing chain-of-thought reasoning with numeric answer matching.
 
 ```bash
-# Run SWE-bench Verified (default
+# Run SWE-bench Verified (default)
 mcpbr run -c config.yaml
 
-# Run
-mcpbr run -c config.yaml -
-
-# Run SWE-bench Full (2,294 tasks, complete benchmark)
-mcpbr run -c config.yaml -b swe-bench-full
-
-# Run GSM8K
+# Run any benchmark
+mcpbr run -c config.yaml --benchmark humaneval -n 20
 mcpbr run -c config.yaml --benchmark gsm8k -n 50
+mcpbr run -c config.yaml --benchmark cybergym --level 2
 
 # List all available benchmarks
 mcpbr benchmarks
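
Note: the commands in this hunk reference a `config.yaml` whose contents are not part of this diff. The following is a rough sketch only, combining the keys that do appear elsewhere in the README (`max_concurrent` from the next hunk's header and the `infrastructure` block it adds); `benchmark` and `sample_size` are hypothetical placeholder keys for the values otherwise passed via `--benchmark` and `-n`.

```yaml
# Hypothetical config.yaml sketch - the full schema is not shown in this diff.
# Only max_concurrent and the infrastructure block appear verbatim in the README;
# benchmark and sample_size are assumed names for values otherwise passed via
# --benchmark and -n on the command line.
benchmark: swe-bench-verified   # assumed key; matches the documented default benchmark
sample_size: 50                 # assumed key; mirrors the -n flag
max_concurrent: 4               # appears in the README's config example
infrastructure:
  mode: local                   # default, per the Infrastructure Modes section
```

Command-line flags would presumably override these file-level values, but that precedence is an assumption, not something this diff confirms.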
@@ -332,6 +310,56 @@ max_concurrent: 4
 mcpbr run --config config.yaml
 ```
 
+## Infrastructure Modes
+
+mcpbr supports running evaluations on different infrastructure platforms, allowing you to scale evaluations or offload compute-intensive tasks to cloud VMs.
+
+### Local (Default)
+
+Run evaluations on your local machine:
+
+```yaml
+infrastructure:
+  mode: local  # default
+```
+
+This is the default mode - evaluations run directly on your machine using local Docker containers.
+
+### Azure VM
+
+Run evaluations on Azure Virtual Machines with automatic provisioning and cleanup:
+
+```yaml
+infrastructure:
+  mode: azure
+  azure:
+    resource_group: mcpbr-benchmarks
+    location: eastus
+    cpu_cores: 10
+    memory_gb: 40
+```
+
+**Key features:**
+- Zero manual VM setup - provisioned automatically from config
+- Automatic Docker, Python, and mcpbr installation
+- Test task validation before full evaluation
+- Auto-cleanup after completion (configurable)
+- Cost-optimized with automatic VM deletion
+
+**Example usage:**
+```bash
+# Run evaluation on Azure VM
+mcpbr run -c azure-config.yaml
+
+# VM is automatically created, evaluation runs, results are downloaded, VM is deleted
+```
+
+See [docs/infrastructure/azure.md](docs/infrastructure/azure.md) for full documentation including:
+- Prerequisites and authentication
+- VM sizing and cost estimation
+- Debugging with `preserve_on_error`
+- Troubleshooting guide
+
 ## Side-by-Side Server Comparison
 
 Compare two MCP servers head-to-head in a single evaluation run to see which implementation performs better.
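
Note: the Azure example above runs `mcpbr run -c azure-config.yaml` without showing the full file. Below is a minimal sketch assembled from the keys documented in this hunk; the name and nesting of `preserve_on_error` are assumptions based on the "Debugging with `preserve_on_error`" bullet and may differ in docs/infrastructure/azure.md.

```yaml
# azure-config.yaml sketch assembled from the keys shown in the hunk above.
# The preserve_on_error entry is an assumption and its exact placement may differ.
infrastructure:
  mode: azure
  azure:
    resource_group: mcpbr-benchmarks
    location: eastus
    cpu_cores: 10
    memory_gb: 40
    preserve_on_error: true   # assumed: keep the VM alive for debugging when a run fails
```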
@@ -1215,12 +1243,18 @@ mcpbr/
 │   ├── models.py # Supported model registry
 │   ├── providers.py # LLM provider abstractions (extensible)
 │   ├── harnesses.py # Agent harness implementations (extensible)
-│   ├── benchmarks/ # Benchmark abstraction layer
+│   ├── benchmarks/ # Benchmark abstraction layer (25+ benchmarks)
 │   │   ├── __init__.py # Registry and factory
 │   │   ├── base.py # Benchmark protocol
-│   │   ├── swebench.py # SWE-bench
-│   │   ├── cybergym.py # CyberGym
-│   │
+│   │   ├── swebench.py # SWE-bench (Verified/Lite/Full)
+│   │   ├── cybergym.py # CyberGym security
+│   │   ├── humaneval.py # HumanEval code generation
+│   │   ├── gsm8k.py # GSM8K math reasoning
+│   │   ├── mcptoolbench.py # MCPToolBench++ tool use
+│   │   ├── apps.py # APPS coding problems
+│   │   ├── mbpp.py # MBPP Python problems
+│   │   ├── math_benchmark.py # MATH competition math
+│   │   └── ... # 15+ more benchmarks
 │   ├── harness.py # Main orchestrator
 │   ├── agent.py # Baseline agent implementation
 │   ├── docker_env.py # Docker environment management + in-container execution