mcpbr-cli 0.4.4 → 0.4.7
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +30 -46
- package/package.json +1 -1
package/README.md
CHANGED
@@ -56,59 +56,37 @@ mcpbr runs controlled experiments: same model, same tasks, same environment - th

 ## Supported Benchmarks

-mcpbr supports
+mcpbr supports 25+ benchmarks across 8 categories through a flexible abstraction layer:

-
-
+| Category | Benchmarks |
+|----------|-----------|
+| **Software Engineering** | [SWE-bench](https://greynewell.github.io/mcpbr/benchmarks/swe-bench/) (Verified/Lite/Full), [APPS](https://greynewell.github.io/mcpbr/benchmarks/apps/), [CodeContests](https://greynewell.github.io/mcpbr/benchmarks/codecontests/), [BigCodeBench](https://greynewell.github.io/mcpbr/benchmarks/bigcodebench/), [LeetCode](https://greynewell.github.io/mcpbr/benchmarks/leetcode/), [CoderEval](https://greynewell.github.io/mcpbr/benchmarks/codereval/), [Aider Polyglot](https://greynewell.github.io/mcpbr/benchmarks/aider-polyglot/) |
+| **Code Generation** | [HumanEval](https://greynewell.github.io/mcpbr/benchmarks/humaneval/), [MBPP](https://greynewell.github.io/mcpbr/benchmarks/mbpp/) |
+| **Math & Reasoning** | [GSM8K](https://greynewell.github.io/mcpbr/benchmarks/gsm8k/), [MATH](https://greynewell.github.io/mcpbr/benchmarks/math/), [BigBench-Hard](https://greynewell.github.io/mcpbr/benchmarks/bigbench-hard/) |
+| **Knowledge & QA** | [TruthfulQA](https://greynewell.github.io/mcpbr/benchmarks/truthfulqa/), [HellaSwag](https://greynewell.github.io/mcpbr/benchmarks/hellaswag/), [ARC](https://greynewell.github.io/mcpbr/benchmarks/arc/), [GAIA](https://greynewell.github.io/mcpbr/benchmarks/gaia/) |
+| **Tool Use & Agents** | [MCPToolBench++](https://greynewell.github.io/mcpbr/benchmarks/mcptoolbench/), [ToolBench](https://greynewell.github.io/mcpbr/benchmarks/toolbench/), [AgentBench](https://greynewell.github.io/mcpbr/benchmarks/agentbench/), [WebArena](https://greynewell.github.io/mcpbr/benchmarks/webarena/), [TerminalBench](https://greynewell.github.io/mcpbr/benchmarks/terminalbench/), [InterCode](https://greynewell.github.io/mcpbr/benchmarks/intercode/) |
+| **ML Research** | [MLAgentBench](https://greynewell.github.io/mcpbr/benchmarks/mlagentbench/) |
+| **Code Understanding** | [RepoQA](https://greynewell.github.io/mcpbr/benchmarks/repoqa/) |
+| **Security** | [CyberGym](https://greynewell.github.io/mcpbr/benchmarks/cybergym/) |

-
-- **Evaluation**: Test suite pass/fail
-- **Pre-built images**: Available for most tasks
+### Featured Benchmarks

-**
-- **swe-bench-verified** (default) - Manually validated test cases for higher quality evaluation ([SWE-bench/SWE-bench_Verified](https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified))
-- **swe-bench-lite** - 300 tasks, quick testing ([SWE-bench/SWE-bench_Lite](https://huggingface.co/datasets/SWE-bench/SWE-bench_Lite))
-- **swe-bench-full** - 2,294 tasks, complete benchmark ([SWE-bench/SWE-bench](https://huggingface.co/datasets/SWE-bench/SWE-bench))
+**SWE-bench** (Default) - Real GitHub issues requiring bug fixes. Three variants: Verified (500 manually validated), Lite (300 curated), and Full (2,294 complete). Pre-built Docker images available.

-
-Security vulnerabilities requiring Proof-of-Concept (PoC) exploits. The agent generates exploits that trigger crashes in vulnerable code.
+**CyberGym** - Security vulnerabilities requiring PoC exploits. 4 difficulty levels controlling context. Uses AddressSanitizer for crash detection.

--
-- **Task**: Generate PoC exploits
-- **Evaluation**: PoC crashes pre-patch, doesn't crash post-patch
-- **Difficulty levels**: 0-3 (controls context given to agent)
-- **Learn more**: [CyberGym Project](https://cybergym.cs.berkeley.edu/)
+**MCPToolBench++** - Large-scale MCP tool use evaluation across 45+ categories. Tests tool discovery, selection, invocation, and result interpretation.

-
-Large-scale MCP tool use evaluation across 45+ categories. Tests agent capabilities in tool discovery, selection, invocation, and result interpretation.
-
-- **Dataset**: [MCPToolBench/MCPToolBenchPP](https://huggingface.co/datasets/MCPToolBench/MCPToolBenchPP)
-- **Task**: Complete tasks using appropriate MCP tools
-- **Evaluation**: Tool selection accuracy, parameter correctness, sequence matching
-- **Categories**: Browser, Finance, Code Analysis, and 40+ more
-- **Learn more**: [MCPToolBench++ Paper](https://arxiv.org/pdf/2508.07575) | [GitHub](https://github.com/mcp-tool-bench/MCPToolBenchPP)
-
-### GSM8K
-Grade-school math word problems testing mathematical reasoning and chain-of-thought capabilities.
-
-- **Dataset**: [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)
-- **Task**: Solve math word problems with step-by-step reasoning
-- **Evaluation**: Numeric answer correctness with tolerance
-- **Problem Types**: Multi-step arithmetic and basic algebra
-- **Learn more**: [GSM8K Paper](https://arxiv.org/abs/2110.14168) | [GitHub](https://github.com/openai/grade-school-math)
+**GSM8K** - Grade-school math word problems testing chain-of-thought reasoning with numeric answer matching.

 ```bash
-# Run SWE-bench Verified (default
+# Run SWE-bench Verified (default)
 mcpbr run -c config.yaml

-# Run
-mcpbr run -c config.yaml -
-
-# Run SWE-bench Full (2,294 tasks, complete benchmark)
-mcpbr run -c config.yaml -b swe-bench-full
-
-# Run GSM8K
+# Run any benchmark
+mcpbr run -c config.yaml --benchmark humaneval -n 20
 mcpbr run -c config.yaml --benchmark gsm8k -n 50
+mcpbr run -c config.yaml --benchmark cybergym --level 2

 # List all available benchmarks
 mcpbr benchmarks
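The GSM8K entries (old and new) both describe evaluation as numeric answer matching with tolerance. The snippet below is a minimal, assumption-based sketch of that idea only; it is not mcpbr's actual scorer, and the function name `numeric_match` is hypothetical.

```python
import re


def numeric_match(predicted: str, expected: str, rel_tol: float = 1e-4) -> bool:
    """Hypothetical sketch: compare the last number in a free-form answer
    against the reference answer within a relative tolerance."""
    # Pull the final numeric token out of the model's answer text.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", predicted.replace(",", ""))
    if not numbers:
        return False
    try:
        got, want = float(numbers[-1]), float(expected.replace(",", ""))
    except ValueError:
        return False
    # Tolerance scaled by the magnitude of the reference answer.
    return abs(got - want) <= rel_tol * max(1.0, abs(want))


# Example: a chain-of-thought answer whose final number is the solution.
assert numeric_match("She pays 3 * 14 = 42, so the total is 42.", "42")
```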
@@ -1265,12 +1243,18 @@ mcpbr/
 │   ├── models.py             # Supported model registry
 │   ├── providers.py          # LLM provider abstractions (extensible)
 │   ├── harnesses.py          # Agent harness implementations (extensible)
-│   ├── benchmarks/           # Benchmark abstraction layer
+│   ├── benchmarks/           # Benchmark abstraction layer (25+ benchmarks)
 │   │   ├── __init__.py       # Registry and factory
 │   │   ├── base.py           # Benchmark protocol
-│   │   ├── swebench.py       # SWE-bench
-│   │   ├── cybergym.py       # CyberGym
-│   │
+│   │   ├── swebench.py       # SWE-bench (Verified/Lite/Full)
+│   │   ├── cybergym.py       # CyberGym security
+│   │   ├── humaneval.py      # HumanEval code generation
+│   │   ├── gsm8k.py          # GSM8K math reasoning
+│   │   ├── mcptoolbench.py   # MCPToolBench++ tool use
+│   │   ├── apps.py           # APPS coding problems
+│   │   ├── mbpp.py           # MBPP Python problems
+│   │   ├── math_benchmark.py # MATH competition math
+│   │   └── ...               # 15+ more benchmarks
 │   ├── harness.py            # Main orchestrator
 │   ├── agent.py              # Baseline agent implementation
 │   ├── docker_env.py         # Docker environment management + in-container execution