mcpbr-cli 0.4.4 → 0.4.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (2)
  1. package/README.md +30 -46
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -56,59 +56,37 @@ mcpbr runs controlled experiments: same model, same tasks, same environment - th
 
  ## Supported Benchmarks
 
- mcpbr supports multiple software engineering benchmarks through a flexible abstraction layer:
+ mcpbr supports 25+ benchmarks across 8 categories through a flexible abstraction layer:
 
- ### SWE-bench (Default)
- Real GitHub issues requiring bug fixes and patches. The agent generates unified diffs evaluated by running pytest test suites.
+ | Category | Benchmarks |
+ |----------|-----------|
+ | **Software Engineering** | [SWE-bench](https://greynewell.github.io/mcpbr/benchmarks/swe-bench/) (Verified/Lite/Full), [APPS](https://greynewell.github.io/mcpbr/benchmarks/apps/), [CodeContests](https://greynewell.github.io/mcpbr/benchmarks/codecontests/), [BigCodeBench](https://greynewell.github.io/mcpbr/benchmarks/bigcodebench/), [LeetCode](https://greynewell.github.io/mcpbr/benchmarks/leetcode/), [CoderEval](https://greynewell.github.io/mcpbr/benchmarks/codereval/), [Aider Polyglot](https://greynewell.github.io/mcpbr/benchmarks/aider-polyglot/) |
+ | **Code Generation** | [HumanEval](https://greynewell.github.io/mcpbr/benchmarks/humaneval/), [MBPP](https://greynewell.github.io/mcpbr/benchmarks/mbpp/) |
+ | **Math & Reasoning** | [GSM8K](https://greynewell.github.io/mcpbr/benchmarks/gsm8k/), [MATH](https://greynewell.github.io/mcpbr/benchmarks/math/), [BigBench-Hard](https://greynewell.github.io/mcpbr/benchmarks/bigbench-hard/) |
+ | **Knowledge & QA** | [TruthfulQA](https://greynewell.github.io/mcpbr/benchmarks/truthfulqa/), [HellaSwag](https://greynewell.github.io/mcpbr/benchmarks/hellaswag/), [ARC](https://greynewell.github.io/mcpbr/benchmarks/arc/), [GAIA](https://greynewell.github.io/mcpbr/benchmarks/gaia/) |
+ | **Tool Use & Agents** | [MCPToolBench++](https://greynewell.github.io/mcpbr/benchmarks/mcptoolbench/), [ToolBench](https://greynewell.github.io/mcpbr/benchmarks/toolbench/), [AgentBench](https://greynewell.github.io/mcpbr/benchmarks/agentbench/), [WebArena](https://greynewell.github.io/mcpbr/benchmarks/webarena/), [TerminalBench](https://greynewell.github.io/mcpbr/benchmarks/terminalbench/), [InterCode](https://greynewell.github.io/mcpbr/benchmarks/intercode/) |
+ | **ML Research** | [MLAgentBench](https://greynewell.github.io/mcpbr/benchmarks/mlagentbench/) |
+ | **Code Understanding** | [RepoQA](https://greynewell.github.io/mcpbr/benchmarks/repoqa/) |
+ | **Security** | [CyberGym](https://greynewell.github.io/mcpbr/benchmarks/cybergym/) |
 
- - **Task**: Generate patches to fix bugs
- - **Evaluation**: Test suite pass/fail
- - **Pre-built images**: Available for most tasks
+ ### Featured Benchmarks
 
- **Variants:**
- - **swe-bench-verified** (default) - Manually validated test cases for higher quality evaluation ([SWE-bench/SWE-bench_Verified](https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified))
- - **swe-bench-lite** - 300 tasks, quick testing ([SWE-bench/SWE-bench_Lite](https://huggingface.co/datasets/SWE-bench/SWE-bench_Lite))
- - **swe-bench-full** - 2,294 tasks, complete benchmark ([SWE-bench/SWE-bench](https://huggingface.co/datasets/SWE-bench/SWE-bench))
+ **SWE-bench** (Default) - Real GitHub issues requiring bug fixes. Three variants: Verified (500 manually validated), Lite (300 curated), and Full (2,294 complete). Pre-built Docker images available.
 
- ### CyberGym
- Security vulnerabilities requiring Proof-of-Concept (PoC) exploits. The agent generates exploits that trigger crashes in vulnerable code.
+ **CyberGym** - Security vulnerabilities requiring PoC exploits. 4 difficulty levels controlling context. Uses AddressSanitizer for crash detection.
 
- - **Dataset**: [sunblaze-ucb/cybergym](https://huggingface.co/datasets/sunblaze-ucb/cybergym)
- - **Task**: Generate PoC exploits
- - **Evaluation**: PoC crashes pre-patch, doesn't crash post-patch
- - **Difficulty levels**: 0-3 (controls context given to agent)
- - **Learn more**: [CyberGym Project](https://cybergym.cs.berkeley.edu/)
+ **MCPToolBench++** - Large-scale MCP tool use evaluation across 45+ categories. Tests tool discovery, selection, invocation, and result interpretation.
 
- ### MCPToolBench++
- Large-scale MCP tool use evaluation across 45+ categories. Tests agent capabilities in tool discovery, selection, invocation, and result interpretation.
-
- - **Dataset**: [MCPToolBench/MCPToolBenchPP](https://huggingface.co/datasets/MCPToolBench/MCPToolBenchPP)
- - **Task**: Complete tasks using appropriate MCP tools
- - **Evaluation**: Tool selection accuracy, parameter correctness, sequence matching
- - **Categories**: Browser, Finance, Code Analysis, and 40+ more
- - **Learn more**: [MCPToolBench++ Paper](https://arxiv.org/pdf/2508.07575) | [GitHub](https://github.com/mcp-tool-bench/MCPToolBenchPP)
-
- ### GSM8K
- Grade-school math word problems testing mathematical reasoning and chain-of-thought capabilities.
-
- - **Dataset**: [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)
- - **Task**: Solve math word problems with step-by-step reasoning
- - **Evaluation**: Numeric answer correctness with tolerance
- - **Problem Types**: Multi-step arithmetic and basic algebra
- - **Learn more**: [GSM8K Paper](https://arxiv.org/abs/2110.14168) | [GitHub](https://github.com/openai/grade-school-math)
+ **GSM8K** - Grade-school math word problems testing chain-of-thought reasoning with numeric answer matching.
 
  ```bash
- # Run SWE-bench Verified (default - manually validated tests)
+ # Run SWE-bench Verified (default)
  mcpbr run -c config.yaml
 
- # Run SWE-bench Lite (300 tasks, quick testing)
- mcpbr run -c config.yaml -b swe-bench-lite
-
- # Run SWE-bench Full (2,294 tasks, complete benchmark)
- mcpbr run -c config.yaml -b swe-bench-full
-
- # Run GSM8K
+ # Run any benchmark
+ mcpbr run -c config.yaml --benchmark humaneval -n 20
  mcpbr run -c config.yaml --benchmark gsm8k -n 50
+ mcpbr run -c config.yaml --benchmark cybergym --level 2
 
  # List all available benchmarks
  mcpbr benchmarks
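
The GSM8K entry and the `--benchmark gsm8k` command in the hunk above rely on numeric answer matching with tolerance. As a rough illustration of that evaluation style only (not mcpbr's actual scoring code; the function names below are hypothetical):

```python
# Illustrative sketch of tolerance-based numeric answer matching, in the spirit
# of the GSM8K evaluation the README describes. Names are hypothetical and do
# not reflect mcpbr's internal API.
import re

def extract_last_number(text: str) -> float | None:
    """Pull the last number out of a model response, e.g. 'so the answer is 72.'"""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def is_correct(response: str, gold: float, rel_tol: float = 1e-4) -> bool:
    """Compare the extracted answer to the gold label within a relative tolerance."""
    predicted = extract_last_number(response)
    if predicted is None:
        return False
    return abs(predicted - gold) <= rel_tol * max(1.0, abs(gold))

print(is_correct("Step by step... so the answer is 72.", 72.0))  # True
```

A relative tolerance of this kind avoids penalizing harmless formatting differences such as `72` versus `72.0`.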
@@ -1265,12 +1243,18 @@ mcpbr/
  │ ├── models.py # Supported model registry
  │ ├── providers.py # LLM provider abstractions (extensible)
  │ ├── harnesses.py # Agent harness implementations (extensible)
- │ ├── benchmarks/ # Benchmark abstraction layer
+ │ ├── benchmarks/ # Benchmark abstraction layer (25+ benchmarks)
  │ │ ├── __init__.py # Registry and factory
  │ │ ├── base.py # Benchmark protocol
- │ │ ├── swebench.py # SWE-bench implementation
- │ │ ├── cybergym.py # CyberGym implementation
- │ │ └── mcptoolbench.py # MCPToolBench++ implementation
+ │ │ ├── swebench.py # SWE-bench (Verified/Lite/Full)
+ │ │ ├── cybergym.py # CyberGym security
+ │ │ ├── humaneval.py # HumanEval code generation
+ │ │ ├── gsm8k.py # GSM8K math reasoning
+ │ │ ├── mcptoolbench.py # MCPToolBench++ tool use
+ │ │ ├── apps.py # APPS coding problems
+ │ │ ├── mbpp.py # MBPP Python problems
+ │ │ ├── math_benchmark.py # MATH competition math
+ │ │ └── ... # 15+ more benchmarks
  │ ├── harness.py # Main orchestrator
  │ ├── agent.py # Baseline agent implementation
  │ ├── docker_env.py # Docker environment management + in-container execution
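
The tree in this hunk points at a registry-and-factory layout (`__init__.py` as registry and factory, `base.py` as the benchmark protocol). A minimal sketch of that general pattern, assuming hypothetical names (`Benchmark`, `register`, `create_benchmark`) that are illustrative rather than mcpbr's real API:

```python
# Illustrative registry-plus-protocol sketch matching the directory layout above.
# All names here are assumptions for illustration, not mcpbr's actual code.
from typing import Protocol

class Benchmark(Protocol):
    """Minimal benchmark protocol: load tasks and score a single model output."""
    name: str

    def load_tasks(self, limit: int | None = None) -> list[dict]: ...
    def evaluate(self, task: dict, output: str) -> bool: ...

_REGISTRY: dict[str, type] = {}

def register(name: str):
    """Class decorator that adds a benchmark implementation under a CLI-style name."""
    def wrap(cls: type) -> type:
        _REGISTRY[name] = cls
        return cls
    return wrap

def create_benchmark(name: str) -> Benchmark:
    """Factory: instantiate a registered benchmark by name, e.g. 'gsm8k'."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown benchmark: {name}") from None
```

Under a pattern like this, each of the 25+ benchmark modules only has to satisfy the protocol and register itself, which is consistent with the `... # 15+ more benchmarks` placeholder in the tree.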
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "mcpbr-cli",
- "version": "0.4.4",
+ "version": "0.4.5",
  "description": "Model Context Protocol Benchmark Runner - CLI tool for evaluating MCP servers",
  "keywords": [
  "mcpbr",