@greynewell/mcpbr 0.3.18

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (4)
  1. package/LICENSE +21 -0
  2. package/README.md +1113 -0
  3. package/bin/mcpbr.js +184 -0
  4. package/package.json +50 -0
package/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 mcpbr contributors
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,1113 @@
1
+ # mcpbr
2
+
3
+ ```bash
4
+ pip install mcpbr && mcpbr init && mcpbr run -c mcpbr.yaml -n 1 -v
5
+ ```
6
+
7
+ Benchmark your MCP server against real GitHub issues. One command, hard numbers.
8
+
9
+ ---
10
+
11
+ <p align="center">
12
+ <img src="https://raw.githubusercontent.com/greynewell/mcpbr/main/assets/mcpbr-logo.jpg" alt="MCPBR Logo" width="400">
13
+ </p>
14
+
15
+ **Model Context Protocol Benchmark Runner**
16
+
17
+ [![PyPI version](https://badge.fury.io/py/mcpbr.svg)](https://pypi.org/project/mcpbr/)
18
+ [![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
19
+ [![CI](https://github.com/greynewell/mcpbr/actions/workflows/ci.yml/badge.svg)](https://github.com/greynewell/mcpbr/actions/workflows/ci.yml)
20
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
21
+ [![Documentation](https://img.shields.io/badge/docs-greynewell.github.io%2Fmcpbr-blue)](https://greynewell.github.io/mcpbr/)
22
+ ![CodeRabbit Pull Request Reviews](https://img.shields.io/coderabbit/prs/github/greynewell/mcpbr?utm_source=oss&utm_medium=github&utm_campaign=greynewell%2Fmcpbr&labelColor=171717&color=FF570A&link=https%3A%2F%2Fcoderabbit.ai&label=CodeRabbit+Reviews)
23
+
24
+ [![good first issues](https://img.shields.io/github/issues/greynewell/mcpbr/good%20first%20issue?label=good%20first%20issues&color=7057ff)](https://github.com/greynewell/mcpbr/labels/good%20first%20issue)
25
+ [![help wanted](https://img.shields.io/github/issues/greynewell/mcpbr/help%20wanted?label=help%20wanted&color=008672)](https://github.com/greynewell/mcpbr/labels/help%20wanted)
26
+ [![roadmap](https://img.shields.io/badge/roadmap-200%2B%20features-blue)](https://github.com/users/greynewell/projects/2)
27
+
28
+ > Stop guessing if your MCP server actually helps. Get hard numbers comparing tool-assisted vs. baseline agent performance on real GitHub issues.
29
+
30
+ <p align="center">
31
+ <img src="https://raw.githubusercontent.com/greynewell/mcpbr/main/assets/mcpbr-demo.gif" alt="mcpbr in action" width="700">
32
+ </p>
33
+
34
+ ## What You Get
35
+
36
+ <p align="center">
37
+ <img src="https://raw.githubusercontent.com/greynewell/mcpbr/main/assets/mcpbr-eval-results.png" alt="MCPBR Evaluation Results" width="600">
38
+ </p>
39
+
40
+ Real metrics showing whether your MCP server improves agent performance on SWE-bench tasks. No vibes, just data.
41
+
42
+ ## Why mcpbr?
43
+
44
+ MCP servers promise to make LLMs better at coding tasks. But how do you *prove* it?
45
+
46
+ mcpbr runs controlled experiments: same model, same tasks, same environment - the only variable is your MCP server. You get:
47
+
48
+ - **Apples-to-apples comparison** against a baseline agent
49
+ - **Real GitHub issues** from SWE-bench (not toy examples)
50
+ - **Reproducible results** via Docker containers with pinned dependencies
51
+
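+ In practice the comparison is a single run with both arms, or two separate runs using the `-M`/`-B` flags documented in the CLI reference below:
+
+ ```bash
+ mcpbr run -c config.yaml      # both arms: MCP agent + baseline
+ mcpbr run -c config.yaml -M   # MCP arm only
+ mcpbr run -c config.yaml -B   # baseline arm only
+ ```
+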
52
+ ## Supported Benchmarks
53
+
54
+ mcpbr supports multiple software engineering benchmarks through a flexible abstraction layer:
55
+
56
+ ### SWE-bench (Default)
57
+ Real GitHub issues requiring bug-fix patches. The agent generates unified diffs, which are evaluated by running each task's pytest test suite.
58
+
59
+ - **Dataset**: [SWE-bench/SWE-bench_Lite](https://huggingface.co/datasets/SWE-bench/SWE-bench_Lite)
60
+ - **Task**: Generate patches to fix bugs
61
+ - **Evaluation**: Test suite pass/fail
62
+ - **Pre-built images**: Available for most tasks
63
+
64
+ ### CyberGym
65
+ Security vulnerabilities requiring Proof-of-Concept (PoC) exploits. The agent generates exploits that trigger crashes in vulnerable code.
66
+
67
+ - **Dataset**: [sunblaze-ucb/cybergym](https://huggingface.co/datasets/sunblaze-ucb/cybergym)
68
+ - **Task**: Generate PoC exploits
69
+ - **Evaluation**: PoC crashes pre-patch, doesn't crash post-patch
70
+ - **Difficulty levels**: 0-3 (controls context given to agent)
71
+ - **Learn more**: [CyberGym Project](https://cybergym.cs.berkeley.edu/)
72
+
73
+ ### MCPToolBench++
74
+ Large-scale MCP tool use evaluation across 45+ categories. Tests agent capabilities in tool discovery, selection, invocation, and result interpretation.
75
+
76
+ - **Dataset**: [MCPToolBench/MCPToolBenchPP](https://huggingface.co/datasets/MCPToolBench/MCPToolBenchPP)
77
+ - **Task**: Complete tasks using appropriate MCP tools
78
+ - **Evaluation**: Tool selection accuracy, parameter correctness, sequence matching
79
+ - **Categories**: Browser, Finance, Code Analysis, and 40+ more
80
+ - **Learn more**: [MCPToolBench++ Paper](https://arxiv.org/pdf/2508.07575) | [GitHub](https://github.com/mcp-tool-bench/MCPToolBenchPP)
81
+
82
+ ```bash
83
+ # Run SWE-bench (default)
84
+ mcpbr run -c config.yaml
85
+
86
+ # Run CyberGym at level 2
87
+ mcpbr run -c config.yaml --benchmark cybergym --level 2
88
+
89
+ # Run MCPToolBench++
90
+ mcpbr run -c config.yaml --benchmark mcptoolbench
91
+
92
+ # List available benchmarks
93
+ mcpbr benchmarks
94
+ ```
95
+
96
+ See the **[benchmarks guide](https://greynewell.github.io/mcpbr/benchmarks/)** for details on each benchmark and how to configure them.
97
+
98
+ ## Overview
99
+
100
+ This harness runs two parallel evaluations for each task:
101
+
102
+ 1. **MCP Agent**: LLM with access to tools from your MCP server
103
+ 2. **Baseline Agent**: LLM without tools (single-shot generation)
104
+
105
+ By comparing these, you can measure the effectiveness of your MCP server for different software engineering tasks. See the **[MCP integration guide](https://greynewell.github.io/mcpbr/mcp-integration/)** for tips on testing your server.
106
+
107
+ ## Regression Detection
108
+
109
+ mcpbr includes built-in regression detection to catch performance degradations between MCP server versions:
110
+
111
+ ### Key Features
112
+
113
+ - **Automatic Detection**: Compare current results against a baseline to identify regressions
114
+ - **Detailed Reports**: See exactly which tasks regressed and which improved
115
+ - **Threshold-Based Exit Codes**: Fail CI/CD pipelines when regression rate exceeds acceptable limits
116
+ - **Multi-Channel Alerts**: Send notifications via Slack, Discord, or email
117
+
118
+ ### How It Works
119
+
120
+ A regression is detected when a task that passed in the baseline now fails in the current run. This helps you catch issues before deploying new versions of your MCP server.
121
+
122
+ ```bash
123
+ # First, run a baseline evaluation and save results
124
+ mcpbr run -c config.yaml -o baseline.json
125
+
126
+ # Later, compare a new version against the baseline
127
+ mcpbr run -c config.yaml --baseline-results baseline.json --regression-threshold 0.1
128
+
129
+ # With notifications
130
+ mcpbr run -c config.yaml --baseline-results baseline.json \
131
+   --regression-threshold 0.1 \
132
+   --slack-webhook https://hooks.slack.com/services/YOUR/WEBHOOK/URL
133
+ ```
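+
+ Conceptually, the comparison is a pass/fail diff between two result sets. A minimal sketch (illustrative only, not mcpbr's actual implementation; it assumes results keyed by `instance_id` with a boolean resolved status):
+
+ ```python
+ def compare_runs(baseline: dict[str, bool], current: dict[str, bool]) -> dict:
+     """Diff two runs, where each maps instance_id -> resolved."""
+     regressions = [t for t, ok in baseline.items() if ok and not current.get(t, False)]
+     improvements = [t for t, ok in baseline.items() if not ok and current.get(t, False)]
+     rate = len(regressions) / len(baseline) if baseline else 0.0  # e.g. 2/25 = 8.0%
+     return {"regressions": regressions, "improvements": improvements, "rate": rate}
+ ```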
134
+
135
+ ### Use Cases
136
+
137
+ - **CI/CD Integration**: Automatically detect regressions in pull requests
138
+ - **Version Comparison**: Compare different versions of your MCP server
139
+ - **Performance Monitoring**: Track MCP server performance over time
140
+ - **Team Notifications**: Alert your team when regressions are detected
141
+
142
+ ### Example Output
143
+
144
+ ```
145
+ ======================================================================
146
+ REGRESSION DETECTION REPORT
147
+ ======================================================================
148
+
149
+ Total tasks compared: 25
150
+ Regressions detected: 2
151
+ Improvements detected: 5
152
+ Regression rate: 8.0%
153
+
154
+ REGRESSIONS (previously passed, now failed):
155
+ ----------------------------------------------------------------------
156
+ - django__django-11099
157
+ Error: Timeout
158
+ - sympy__sympy-18087
159
+ Error: Test suite failed
160
+
161
+ IMPROVEMENTS (previously failed, now passed):
162
+ ----------------------------------------------------------------------
163
+ - astropy__astropy-12907
164
+ - pytest-dev__pytest-7373
165
+ - scikit-learn__scikit-learn-25570
166
+ - matplotlib__matplotlib-23913
167
+ - requests__requests-3362
168
+
169
+ ======================================================================
170
+ ```
171
+
172
+ For CI/CD integration, use `--regression-threshold` to fail the build when regressions exceed an acceptable rate:
173
+
174
+ ```yaml
175
+ # .github/workflows/test-mcp.yml
176
+ - name: Run mcpbr with regression detection
177
+   run: |
178
+     mcpbr run -c config.yaml \
179
+       --baseline-results baseline.json \
180
+       --regression-threshold 0.1 \
181
+       -o current.json
182
+ ```
183
+
184
+ This will exit with code 1 if the regression rate exceeds 10%, failing the CI job.
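+
+ Outside a CI runner you can branch on the exit code yourself, for example:
+
+ ```bash
+ if ! mcpbr run -c config.yaml --baseline-results baseline.json --regression-threshold 0.1; then
+   echo "Regression rate above 10% - blocking release" >&2
+   exit 1
+ fi
+ ```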
185
+
186
+ ## Installation
187
+
188
+ > **[Full installation guide](https://greynewell.github.io/mcpbr/installation/)** with detailed setup instructions.
189
+
190
+ <details>
191
+ <summary>Prerequisites</summary>
192
+
193
+ - Python 3.11+
194
+ - Docker (running)
195
+ - `ANTHROPIC_API_KEY` environment variable
196
+ - Claude Code CLI (`claude`) installed
197
+ - Network access (for pulling Docker images and API calls)
198
+
199
+ **Supported Models (aliases or full names):**
200
+ - Claude Opus 4.5: `opus` or `claude-opus-4-5-20251101`
201
+ - Claude Sonnet 4.5: `sonnet` or `claude-sonnet-4-5-20250929`
202
+ - Claude Haiku 4.5: `haiku` or `claude-haiku-4-5-20251001`
203
+
204
+ Run `mcpbr models` to see the full list.
205
+
206
+ </details>
207
+
208
+ ```bash
209
+ # Install from PyPI
210
+ pip install mcpbr
211
+
212
+ # Or install from source
213
+ git clone https://github.com/greynewell/mcpbr.git
214
+ cd mcpbr
215
+ pip install -e .
216
+
217
+ # Or with uv
218
+ uv pip install -e .
219
+ ```
220
+
221
+ > **Note for Apple Silicon users**: The harness automatically uses x86_64 Docker images via emulation. This may be slower than native ARM64 images but ensures compatibility with all SWE-bench tasks.
222
+
223
+ ## Quick Start
224
+
225
+ ### Option 1: Use Example Configurations (Recommended)
226
+
227
+ Get started in seconds with our example configurations:
228
+
229
+ ```bash
230
+ # Set your API key
231
+ export ANTHROPIC_API_KEY="your-api-key"
232
+
233
+ # Run your first evaluation using an example config
234
+ mcpbr run -c examples/quick-start/getting-started.yaml -v
235
+ ```
236
+
237
+ This runs 5 SWE-bench tasks with the filesystem server. Expected runtime: 15-30 minutes, cost: $2-5.
238
+
239
+ **Explore 25+ example configurations** in the [`examples/`](examples/) directory:
240
+ - **Quick Start**: Getting started, testing servers, comparing models
241
+ - **Benchmarks**: SWE-bench Lite/Full, CyberGym basic/advanced
242
+ - **MCP Servers**: Filesystem, GitHub, Brave Search, databases, custom servers
243
+ - **Scenarios**: Cost-optimized, performance-optimized, CI/CD, regression detection
244
+
245
+ See the **[Examples README](examples/README.md)** for the complete guide.
246
+
247
+ ### Option 2: Generate Custom Configuration
248
+
249
+ 1. **Set your API key:**
250
+
251
+ ```bash
252
+ export ANTHROPIC_API_KEY="your-api-key"
253
+ ```
254
+
255
+ 2. **Generate a configuration file:**
256
+
257
+ ```bash
258
+ mcpbr init
259
+ ```
260
+
261
+ 3. **Edit the configuration** to point to your MCP server:
262
+
263
+ ```yaml
264
+ mcp_server:
265
+   command: "npx"
266
+   args:
267
+     - "-y"
268
+     - "@modelcontextprotocol/server-filesystem"
269
+     - "{workdir}"
270
+   env: {}
271
+
272
+ provider: "anthropic"
273
+ agent_harness: "claude-code"
274
+
275
+ model: "sonnet"  # or full name: "claude-sonnet-4-5-20250929"
276
+ dataset: "SWE-bench/SWE-bench_Lite"
277
+ sample_size: 10
278
+ timeout_seconds: 300
279
+ max_concurrent: 4
280
+ ```
281
+
282
+ 4. **Run the evaluation:**
283
+
284
+ ```bash
285
+ mcpbr run --config config.yaml
286
+ ```
287
+
288
+ ## Claude Code Integration
289
+
290
+ [![Claude Code Ready](https://img.shields.io/badge/Claude_Code-Ready-5865F2?style=flat&logo=anthropic)](https://claude.ai/download)
291
+
292
+ mcpbr includes a built-in Claude Code plugin that makes Claude an expert at running benchmarks correctly. When you clone this repository, Claude Code automatically detects the plugin and gains specialized knowledge about mcpbr.
293
+
294
+ ### What This Means for You
295
+
296
+ When using Claude Code in this repository, you can simply say:
297
+
298
+ - "Run the SWE-bench Lite benchmark"
299
+ - "Generate a config for my MCP server"
300
+ - "Run a quick test with 1 task"
301
+
302
+ Claude will automatically:
303
+ - Verify Docker is running before starting
304
+ - Check for required API keys
305
+ - Generate valid configurations with proper `{workdir}` placeholders
306
+ - Use correct CLI flags and options
307
+ - Provide helpful troubleshooting when issues occur
308
+
309
+ ### Available Skills
310
+
311
+ The plugin includes three specialized skills:
312
+
313
+ 1. **run-benchmark**: Expert at running evaluations with proper validation
314
+ - Checks prerequisites (Docker, API keys, config files)
315
+ - Constructs valid `mcpbr run` commands
316
+ - Handles errors gracefully with actionable feedback
317
+
318
+ 2. **generate-config**: Generates valid mcpbr configuration files
319
+ - Ensures `{workdir}` placeholder is included
320
+ - Validates MCP server commands
321
+ - Provides benchmark-specific templates
322
+
323
+ 3. **swe-bench-lite**: Quick-start command for SWE-bench Lite
324
+ - Pre-configured for 5-task evaluation
325
+ - Includes sensible defaults for output files
326
+ - Perfect for testing and demonstrations
327
+
328
+ ### Getting Started with Claude Code
329
+
330
+ Just clone the repository and start asking Claude to run benchmarks:
331
+
332
+ ```bash
333
+ git clone https://github.com/greynewell/mcpbr.git
334
+ cd mcpbr
335
+
336
+ # In Claude Code, simply say:
337
+ # "Run the SWE-bench Lite eval with 5 tasks"
338
+ ```
339
+
340
+ The bundled plugin helps Claude avoid common mistakes and follow best practices automatically.
341
+
342
+ ## Configuration
343
+
344
+ > **[Full configuration reference](https://greynewell.github.io/mcpbr/configuration/)** with all options and examples.
345
+
346
+ ### MCP Server Configuration
347
+
348
+ The `mcp_server` section defines how to start your MCP server:
349
+
350
+ | Field | Description |
351
+ |-------|-------------|
352
+ | `command` | Executable to run (e.g., `npx`, `uvx`, `python`) |
353
+ | `args` | Command arguments. Use `{workdir}` as placeholder for the task repository path |
354
+ | `env` | Additional environment variables |
355
+
356
+ ### Example Configurations
357
+
358
+ **Anthropic Filesystem Server:**
359
+
360
+ ```yaml
361
+ mcp_server:
362
+   command: "npx"
363
+   args: ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]
364
+ ```
365
+
366
+ **Custom Python MCP Server:**
367
+
368
+ ```yaml
369
+ mcp_server:
370
+   command: "python"
371
+   args: ["-m", "my_mcp_server", "--workspace", "{workdir}"]
372
+   env:
373
+     LOG_LEVEL: "debug"
374
+ ```
375
+
376
+ **Supermodel Codebase Analysis Server:**
377
+
378
+ ```yaml
379
+ mcp_server:
380
+   command: "npx"
381
+   args: ["-y", "@supermodeltools/mcp-server"]
382
+   env:
383
+     SUPERMODEL_API_KEY: "${SUPERMODEL_API_KEY}"
384
+ ```
385
+
386
+ ### Custom Agent Prompt
387
+
388
+ You can customize the prompt sent to the agent using the `agent_prompt` field:
389
+
390
+ ```yaml
391
+ agent_prompt: |
392
+   Fix the following bug in this repository:
393
+
394
+   {problem_statement}
395
+
396
+   Make the minimal changes necessary to fix the issue.
397
+   Focus on the root cause, not symptoms.
398
+ ```
399
+
400
+ Use `{problem_statement}` as a placeholder for the SWE-bench issue text. You can also override the prompt via CLI with `--prompt`.
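+
+ For example:
+
+ ```bash
+ mcpbr run -c config.yaml --prompt "Fix this issue: {problem_statement}. Keep the patch minimal."
+ ```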
401
+
402
+ ### Evaluation Parameters
403
+
404
+ | Parameter | Default | Description |
405
+ |-----------|---------|-------------|
406
+ | `provider` | `anthropic` | LLM provider |
407
+ | `agent_harness` | `claude-code` | Agent backend |
408
+ | `benchmark` | `swe-bench` | Benchmark to run (`swe-bench`, `cybergym`, or `mcptoolbench`) |
409
+ | `agent_prompt` | `null` | Custom prompt template (use `{problem_statement}` placeholder) |
410
+ | `model` | `sonnet` | Model alias or full ID |
411
+ | `dataset` | `null` | HuggingFace dataset (optional, benchmark provides default) |
412
+ | `cybergym_level` | `1` | CyberGym difficulty level (0-3, only for CyberGym benchmark) |
413
+ | `sample_size` | `null` | Number of tasks (null = full dataset) |
414
+ | `timeout_seconds` | `300` | Timeout per task |
415
+ | `max_concurrent` | `4` | Parallel task limit |
416
+ | `max_iterations` | `10` | Max agent iterations per task |
417
+
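+ Putting these together, a config exercising most of the parameters might look like this (values are illustrative):
+
+ ```yaml
+ mcp_server:
+   command: "npx"
+   args: ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]
+
+ provider: "anthropic"
+ agent_harness: "claude-code"
+ benchmark: "swe-bench"
+ model: "sonnet"
+ sample_size: 25
+ timeout_seconds: 600
+ max_concurrent: 4
+ max_iterations: 30
+ ```
+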
418
+ ## CLI Reference
419
+
420
+ > **[Full CLI documentation](https://greynewell.github.io/mcpbr/cli/)** with all commands and options.
421
+
422
+ Get help for any command with `--help` or `-h`:
423
+
424
+ ```bash
425
+ mcpbr --help
426
+ mcpbr run --help
427
+ mcpbr init --help
428
+ ```
429
+
430
+ ### Commands Overview
431
+
432
+ | Command | Description |
433
+ |---------|-------------|
434
+ | `mcpbr run` | Run benchmark evaluation with configured MCP server |
435
+ | `mcpbr init` | Generate an example configuration file |
436
+ | `mcpbr models` | List supported models for evaluation |
437
+ | `mcpbr providers` | List available model providers |
438
+ | `mcpbr harnesses` | List available agent harnesses |
439
+ | `mcpbr benchmarks` | List available benchmarks (SWE-bench, CyberGym, MCPToolBench++) |
440
+ | `mcpbr cleanup` | Remove orphaned mcpbr Docker containers |
441
+
442
+ ### `mcpbr run`
443
+
444
+ Run SWE-bench evaluation with the configured MCP server.
445
+
446
+ <details>
447
+ <summary>All options</summary>
448
+
449
+ | Option | Short | Description |
450
+ |--------|-------|-------------|
451
+ | `--config PATH` | `-c` | Path to YAML configuration file (required) |
452
+ | `--model TEXT` | `-m` | Override model from config |
453
+ | `--benchmark TEXT` | `-b` | Override benchmark from config (`swe-bench`, `cybergym`, or `mcptoolbench`) |
454
+ | `--level INTEGER` | | Override CyberGym difficulty level (0-3) |
455
+ | `--sample INTEGER` | `-n` | Override sample size from config |
456
+ | `--mcp-only` | `-M` | Run only MCP evaluation (skip baseline) |
457
+ | `--baseline-only` | `-B` | Run only baseline evaluation (skip MCP) |
458
+ | `--no-prebuilt` | | Disable pre-built SWE-bench images (build from scratch) |
459
+ | `--output PATH` | `-o` | Path to save JSON results |
460
+ | `--report PATH` | `-r` | Path to save Markdown report |
461
+ | `--output-junit PATH` | | Path to save JUnit XML report (for CI/CD integration) |
462
+ | `--verbose` | `-v` | Verbose output (`-v` summary, `-vv` detailed) |
463
+ | `--log-file PATH` | `-l` | Path to write raw JSON log output (single file) |
464
+ | `--log-dir PATH` | | Directory to write per-instance JSON log files |
465
+ | `--task TEXT` | `-t` | Run specific task(s) by instance_id (repeatable) |
466
+ | `--prompt TEXT` | | Override agent prompt (use `{problem_statement}` placeholder) |
467
+ | `--baseline-results PATH` | | Path to baseline results JSON for regression detection |
468
+ | `--regression-threshold FLOAT` | | Maximum acceptable regression rate (0-1). Exit with code 1 if exceeded. |
469
+ | `--slack-webhook URL` | | Slack webhook URL for regression notifications |
470
+ | `--discord-webhook URL` | | Discord webhook URL for regression notifications |
471
+ | `--email-to EMAIL` | | Email address for regression notifications |
472
+ | `--email-from EMAIL` | | Sender email address for notifications |
473
+ | `--smtp-host HOST` | | SMTP server hostname for email notifications |
474
+ | `--smtp-port PORT` | | SMTP server port (default: 587) |
475
+ | `--smtp-user USER` | | SMTP username for authentication |
476
+ | `--smtp-password PASS` | | SMTP password for authentication |
477
+ | `--help` | `-h` | Show help message |
478
+
479
+ </details>
480
+
481
+ <details>
482
+ <summary>Examples</summary>
483
+
484
+ ```bash
485
+ # Full evaluation (MCP + baseline)
486
+ mcpbr run -c config.yaml
487
+
488
+ # Run only MCP evaluation
489
+ mcpbr run -c config.yaml -M
490
+
491
+ # Run only baseline evaluation
492
+ mcpbr run -c config.yaml -B
493
+
494
+ # Override model
495
+ mcpbr run -c config.yaml -m claude-3-5-sonnet-20241022
496
+
497
+ # Override sample size
498
+ mcpbr run -c config.yaml -n 50
499
+
500
+ # Save results and report
501
+ mcpbr run -c config.yaml -o results.json -r report.md
502
+
503
+ # Save JUnit XML for CI/CD
504
+ mcpbr run -c config.yaml --output-junit junit.xml
505
+
506
+ # Run specific tasks
507
+ mcpbr run -c config.yaml -t astropy__astropy-12907 -t django__django-11099
508
+
509
+ # Verbose output with per-instance logs
510
+ mcpbr run -c config.yaml -v --log-dir logs/
511
+
512
+ # Very verbose output
513
+ mcpbr run -c config.yaml -vv
514
+
515
+ # Run CyberGym benchmark
516
+ mcpbr run -c config.yaml --benchmark cybergym --level 2
517
+
518
+ # Run CyberGym with specific tasks
519
+ mcpbr run -c config.yaml --benchmark cybergym --level 3 -n 5
520
+
521
+ # Regression detection - compare against baseline
522
+ mcpbr run -c config.yaml --baseline-results baseline.json
523
+
524
+ # Regression detection with threshold (exit 1 if exceeded)
525
+ mcpbr run -c config.yaml --baseline-results baseline.json --regression-threshold 0.1
526
+
527
+ # Regression detection with Slack notifications
528
+ mcpbr run -c config.yaml --baseline-results baseline.json --slack-webhook https://hooks.slack.com/...
529
+
530
+ # Regression detection with Discord notifications
531
+ mcpbr run -c config.yaml --baseline-results baseline.json --discord-webhook https://discord.com/api/webhooks/...
532
+
533
+ # Regression detection with email notifications
534
+ mcpbr run -c config.yaml --baseline-results baseline.json \
535
+   --email-to team@example.com --email-from mcpbr@example.com \
536
+   --smtp-host smtp.gmail.com --smtp-port 587 \
537
+   --smtp-user user@gmail.com --smtp-password "app-password"
538
+ ```
539
+
540
+ </details>
541
+
542
+ ### `mcpbr init`
543
+
544
+ Generate an example configuration file.
545
+
546
+ <details>
547
+ <summary>Options and examples</summary>
548
+
549
+ | Option | Short | Description |
550
+ |--------|-------|-------------|
551
+ | `--output PATH` | `-o` | Path to write example config (default: `mcpbr.yaml`) |
552
+ | `--help` | `-h` | Show help message |
553
+
554
+ ```bash
555
+ mcpbr init
556
+ mcpbr init -o my-config.yaml
557
+ ```
558
+
559
+ </details>
560
+
561
+ ### `mcpbr models`
562
+
563
+ List supported Anthropic models for evaluation.
564
+
565
+ ### `mcpbr cleanup`
566
+
567
+ Remove orphaned mcpbr Docker containers that were not properly cleaned up.
568
+
569
+ <details>
570
+ <summary>Options and examples</summary>
571
+
572
+ | Option | Short | Description |
573
+ |--------|-------|-------------|
574
+ | `--dry-run` | | Show containers that would be removed without removing them |
575
+ | `--force` | `-f` | Skip confirmation prompt |
576
+ | `--help` | `-h` | Show help message |
577
+
578
+ ```bash
579
+ # Preview containers to remove
580
+ mcpbr cleanup --dry-run
581
+
582
+ # Remove containers with confirmation
583
+ mcpbr cleanup
584
+
585
+ # Remove containers without confirmation
586
+ mcpbr cleanup -f
587
+ ```
588
+
589
+ </details>
590
+
591
+ ## Example Run
592
+
593
+ Here's what a typical evaluation looks like:
594
+
595
+ ```bash
596
+ $ mcpbr run -c config.yaml -v -o results.json --log-dir my-logs
597
+
598
+ mcpbr Evaluation
599
+ Config: config.yaml
600
+ Provider: anthropic
601
+ Model: sonnet
602
+ Agent Harness: claude-code
603
+ Dataset: SWE-bench/SWE-bench_Lite
604
+ Sample size: 10
605
+ Run MCP: True, Run Baseline: True
606
+ Pre-built images: True
607
+ Log dir: my-logs
608
+
609
+ Loading dataset: SWE-bench/SWE-bench_Lite
610
+ Evaluating 10 tasks
611
+ Provider: anthropic, Harness: claude-code
612
+ 14:23:15 [MCP] Starting mcp run for astropy-12907:mcp
613
+ 14:23:22 astropy-12907:mcp > TodoWrite
614
+ 14:23:22 astropy-12907:mcp < Todos have been modified successfully...
615
+ 14:23:26 astropy-12907:mcp > Glob
616
+ 14:23:26 astropy-12907:mcp > Grep
617
+ 14:23:27 astropy-12907:mcp < $WORKDIR/astropy/modeling/separable.py
618
+ 14:23:27 astropy-12907:mcp < Found 5 files: astropy/modeling/tests/test_separable.py...
619
+ ...
620
+ 14:27:43 astropy-12907:mcp * done turns=31 tokens=115/6,542
621
+ 14:28:30 [BASELINE] Starting baseline run for astropy-12907:baseline
622
+ ...
623
+ ```
624
+
625
+ ## Output
626
+
627
+ > **[Understanding evaluation results](https://greynewell.github.io/mcpbr/evaluation-results/)** - detailed guide to interpreting output.
628
+
629
+ ### Console Output
630
+
631
+ The harness displays real-time progress with verbose mode (`-v`) and a final summary table:
632
+
633
+ ```text
634
+ Evaluation Results
635
+
636
+ Summary
637
+ +-----------------+-----------+----------+
638
+ | Metric | MCP Agent | Baseline |
639
+ +-----------------+-----------+----------+
640
+ | Resolved | 8/25 | 5/25 |
641
+ | Resolution Rate | 32.0% | 20.0% |
642
+ +-----------------+-----------+----------+
643
+
644
+ Improvement: +60.0%
645
+
646
+ Per-Task Results
647
+ +------------------------+------+----------+-------+
648
+ | Instance ID | MCP | Baseline | Error |
649
+ +------------------------+------+----------+-------+
650
+ | astropy__astropy-12907 | PASS | PASS | |
651
+ | django__django-11099 | PASS | FAIL | |
652
+ | sympy__sympy-18087 | FAIL | FAIL | |
653
+ +------------------------+------+----------+-------+
654
+
655
+ Results saved to results.json
656
+ ```
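+
+ The `Improvement` figure appears to be the relative change in resolution rate; checking it against the numbers above:
+
+ ```python
+ mcp_rate = 8 / 25        # 32.0%
+ baseline_rate = 5 / 25   # 20.0%
+ improvement = (mcp_rate - baseline_rate) / baseline_rate
+ print(f"{improvement:+.1%}")  # +60.0%
+ ```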
657
+
658
+ ### JSON Output (`--output`)
659
+
660
+ ```json
661
+ {
662
+   "metadata": {
663
+     "timestamp": "2026-01-17T07:23:39.871437+00:00",
664
+     "config": {
665
+       "model": "sonnet",
666
+       "provider": "anthropic",
667
+       "agent_harness": "claude-code",
668
+       "dataset": "SWE-bench/SWE-bench_Lite",
669
+       "sample_size": 25,
670
+       "timeout_seconds": 600,
671
+       "max_iterations": 30
672
+     },
673
+     "mcp_server": {
674
+       "command": "npx",
675
+       "args": ["-y", "@modelcontextprotocol/server-filesystem", "{workdir}"]
676
+     }
677
+   },
678
+   "summary": {
679
+     "mcp": {"resolved": 8, "total": 25, "rate": 0.32},
680
+     "baseline": {"resolved": 5, "total": 25, "rate": 0.20},
681
+     "improvement": "+60.0%"
682
+   },
683
+   "tasks": [
684
+     {
685
+       "instance_id": "astropy__astropy-12907",
686
+       "mcp": {
687
+         "patch_generated": true,
688
+         "tokens": {"input": 115, "output": 6542},
689
+         "iterations": 30,
690
+         "tool_calls": 72,
691
+         "tool_usage": {
692
+           "TodoWrite": 4, "Task": 1, "Glob": 4,
693
+           "Grep": 11, "Bash": 27, "Read": 22,
694
+           "Write": 2, "Edit": 1
695
+         },
696
+         "resolved": true,
697
+         "patch_applied": true,
698
+         "fail_to_pass": {"passed": 2, "total": 2},
699
+         "pass_to_pass": {"passed": 10, "total": 10}
700
+       },
701
+       "baseline": {
702
+         "patch_generated": true,
703
+         "tokens": {"input": 63, "output": 7615},
704
+         "iterations": 30,
705
+         "tool_calls": 57,
706
+         "tool_usage": {
707
+           "TodoWrite": 4, "Glob": 3, "Grep": 4,
708
+           "Read": 14, "Bash": 26, "Write": 4, "Edit": 1
709
+         },
710
+         "resolved": true,
711
+         "patch_applied": true
712
+       }
713
+     }
714
+   ]
715
+ }
716
+ ```
717
+
718
+ ### Markdown Report (`--report`)
719
+
720
+ Generates a human-readable report with:
721
+ - Summary statistics
722
+ - Per-task results table
723
+ - Analysis of which tasks each agent solved
724
+
725
+ ### Per-Instance Logs (`--log-dir`)
726
+
727
+ Creates a directory with detailed JSON log files for each task run. Filenames include timestamps to prevent overwrites:
728
+
729
+ ```text
730
+ my-logs/
731
+ astropy__astropy-12907_mcp_20260117_143052.json
732
+ astropy__astropy-12907_baseline_20260117_143156.json
733
+ django__django-11099_mcp_20260117_144023.json
734
+ django__django-11099_baseline_20260117_144512.json
735
+ ```
736
+
737
+ Each log file contains the full stream of events from the agent CLI:
738
+
739
+ ```json
740
+ {
741
+   "instance_id": "astropy__astropy-12907",
742
+   "run_type": "mcp",
743
+   "events": [
744
+     {
745
+       "type": "system",
746
+       "subtype": "init",
747
+       "cwd": "/workspace",
748
+       "tools": ["Task", "Bash", "Glob", "Grep", "Read", "Edit", "Write", "TodoWrite"],
749
+       "model": "claude-sonnet-4-5-20250929",
750
+       "claude_code_version": "2.1.12"
751
+     },
752
+     {
753
+       "type": "assistant",
754
+       "message": {
755
+         "content": [{"type": "text", "text": "I'll help you fix this bug..."}]
756
+       }
757
+     },
758
+     {
759
+       "type": "assistant",
760
+       "message": {
761
+         "content": [{"type": "tool_use", "name": "Grep", "input": {"pattern": "separability"}}]
762
+       }
763
+     },
764
+     {
765
+       "type": "result",
766
+       "num_turns": 31,
767
+       "usage": {"input_tokens": 115, "output_tokens": 6542}
768
+     }
769
+   ]
770
+ }
771
+ ```
772
+
773
+ This is useful for debugging failed runs or analyzing agent behavior in detail.
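+
+ As a quick example, a short script can tally tool calls from one log (field names taken from the sample log above; the file path is from the listing earlier):
+
+ ```python
+ import json
+ from collections import Counter
+
+ with open("my-logs/astropy__astropy-12907_mcp_20260117_143052.json") as f:
+     log = json.load(f)
+
+ tool_counts = Counter(
+     block["name"]
+     for event in log["events"]
+     if event.get("type") == "assistant"
+     for block in event["message"]["content"]
+     if block.get("type") == "tool_use"
+ )
+ print(tool_counts.most_common())
+ ```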
774
+
775
+ ### JUnit XML Output (`--output-junit`)
776
+
777
+ The harness can generate JUnit XML reports for integration with CI/CD systems like GitHub Actions, GitLab CI, and Jenkins. Each task is represented as a test case, with resolved/unresolved tasks mapped to pass/fail states.
778
+
779
+ ```bash
780
+ mcpbr run -c config.yaml --output-junit junit.xml
781
+ ```
782
+
783
+ The JUnit XML report includes:
784
+
785
+ - **Test Suites**: Separate suites for MCP and baseline evaluations
786
+ - **Test Cases**: Each task is a test case with timing information
787
+ - **Failures**: Unresolved tasks with detailed error messages
788
+ - **Properties**: Metadata about model, provider, benchmark configuration
789
+ - **System Output**: Token usage, tool calls, and test results per task
790
+
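+ For illustration, a generated report has roughly this shape (element names follow the JUnit convention; the exact attributes mcpbr emits may differ):
+
+ ```xml
+ <testsuites>
+   <testsuite name="mcp" tests="25" failures="17">
+     <properties>
+       <property name="model" value="sonnet"/>
+     </properties>
+     <testcase name="astropy__astropy-12907" time="268.4"/>
+     <testcase name="sympy__sympy-18087">
+       <failure message="Test suite failed"/>
+     </testcase>
+   </testsuite>
+ </testsuites>
+ ```
+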
791
+ #### CI/CD Integration Examples
792
+
793
+ **GitHub Actions:**
794
+
795
+ ```yaml
796
+ name: MCP Benchmark
797
+
798
+ on: [push, pull_request]
799
+
800
+ jobs:
801
+   benchmark:
802
+     runs-on: ubuntu-latest
803
+     steps:
804
+       - uses: actions/checkout@v3
805
+
806
+       - name: Set up Python
807
+         uses: actions/setup-python@v4
808
+         with:
809
+           python-version: '3.11'
810
+
811
+       - name: Install mcpbr
812
+         run: pip install mcpbr
813
+
814
+       - name: Run benchmark
815
+         env:
816
+           ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
817
+         run: |
818
+           mcpbr run -c config.yaml --output-junit junit.xml
819
+
820
+       - name: Publish Test Results
821
+         uses: EnricoMi/publish-unit-test-result-action@v2
822
+         if: always()
823
+         with:
824
+           files: junit.xml
825
+ ```
826
+
827
+ **GitLab CI:**
828
+
829
+ ```yaml
830
+ benchmark:
831
+   image: python:3.11
832
+   services:
833
+     - docker:dind
834
+   script:
835
+     - pip install mcpbr
836
+     - mcpbr run -c config.yaml --output-junit junit.xml
837
+   artifacts:
838
+     reports:
839
+       junit: junit.xml
840
+ ```
841
+
842
+ **Jenkins:**
843
+
844
+ ```groovy
845
+ pipeline {
846
+   agent any
847
+   stages {
848
+     stage('Benchmark') {
849
+       steps {
850
+         sh 'pip install mcpbr'
851
+         sh 'mcpbr run -c config.yaml --output-junit junit.xml'
852
+       }
853
+     }
854
+   }
855
+   post {
856
+     always {
857
+       junit 'junit.xml'
858
+     }
859
+   }
860
+ }
861
+ ```
862
+
863
+ The JUnit XML format enables native test result visualization in your CI/CD dashboard, making it easy to track benchmark performance over time and identify regressions.
864
+
865
+ ## How It Works
866
+
867
+ > **[Architecture deep dive](https://greynewell.github.io/mcpbr/architecture/)** - learn how mcpbr works internally.
868
+
869
+ 1. **Load Tasks**: Fetches tasks from the selected benchmark (SWE-bench, CyberGym, or MCPToolBench++) via HuggingFace
870
+ 2. **Create Environment**: For each task, creates an isolated Docker environment with the repository and dependencies
871
+ 3. **Run MCP Agent**: Invokes Claude Code CLI **inside the Docker container**, letting it explore and generate a solution (patch or PoC)
872
+ 4. **Run Baseline**: Same as MCP agent but without the MCP server
873
+ 5. **Evaluate**: Runs benchmark-specific evaluation (test suites for SWE-bench, crash detection for CyberGym, tool use accuracy for MCPToolBench++)
874
+ 6. **Report**: Aggregates results and calculates improvement
875
+
876
+ ### Pre-built Docker Images
877
+
878
+ The harness uses pre-built SWE-bench Docker images from [Epoch AI's registry](https://github.com/orgs/Epoch-Research/packages) when available. These images come with:
879
+
880
+ - The repository checked out at the correct commit
881
+ - All project dependencies pre-installed and validated
882
+ - A consistent environment for reproducible evaluations
883
+
884
+ The agent (Claude Code CLI) runs **inside the container**, which means:
885
+ - Python imports work correctly (e.g., `from astropy import ...`)
886
+ - The agent can run tests and verify fixes
887
+ - No dependency conflicts with the host machine
888
+
889
+ If a pre-built image is not available for a task, the harness falls back to cloning the repository and attempting to install dependencies (less reliable).
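+
+ You can also force this from-scratch path with the `--no-prebuilt` flag, for example to verify behavior for tasks without a published image:
+
+ ```bash
+ mcpbr run -c config.yaml --no-prebuilt
+ ```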
890
+
891
+ ## Architecture
892
+
893
+ ```
894
+ mcpbr/
895
+ ├── src/mcpbr/
896
+ │   ├── cli.py              # Command-line interface
897
+ │   ├── config.py           # Configuration models
898
+ │   ├── models.py           # Supported model registry
899
+ │   ├── providers.py        # LLM provider abstractions (extensible)
900
+ │   ├── harnesses.py        # Agent harness implementations (extensible)
901
+ │   ├── benchmarks/         # Benchmark abstraction layer
902
+ │   │   ├── __init__.py     # Registry and factory
903
+ │   │   ├── base.py         # Benchmark protocol
904
+ │   │   ├── swebench.py     # SWE-bench implementation
905
+ │   │   ├── cybergym.py     # CyberGym implementation
906
+ │   │   └── mcptoolbench.py # MCPToolBench++ implementation
907
+ │   ├── harness.py          # Main orchestrator
908
+ │   ├── agent.py            # Baseline agent implementation
909
+ │   ├── docker_env.py       # Docker environment management + in-container execution
910
+ │   ├── evaluation.py       # Patch application and testing
911
+ │   ├── log_formatter.py    # Log formatting and per-instance logging
912
+ │   └── reporting.py        # Output formatting
913
+ ├── tests/
914
+ │   ├── test_*.py           # Unit tests
915
+ │   ├── test_benchmarks.py  # Benchmark tests
916
+ │   └── test_integration.py # Integration tests
917
+ ├── Dockerfile              # Fallback image for task environments
918
+ └── config/
919
+     └── example.yaml        # Example configuration
920
+ ```
921
+
922
+ The architecture uses Protocol-based abstractions for providers, harnesses, and **benchmarks**, making it easy to add support for additional LLM providers, agent backends, or software engineering benchmarks in the future. See the **[API reference](https://greynewell.github.io/mcpbr/api/)** and **[benchmarks guide](https://greynewell.github.io/mcpbr/benchmarks/)** for more details.
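+
+ As a rough sketch, a benchmark implementation conforms to a small interface along these lines (illustrative only; the real Protocol lives in `src/mcpbr/benchmarks/base.py` and its exact signatures may differ):
+
+ ```python
+ from typing import Protocol
+
+ class Benchmark(Protocol):
+     """Hypothetical shape of the benchmark abstraction."""
+     name: str
+
+     def load_tasks(self, sample_size: int | None = None) -> list[dict]:
+         """Fetch tasks, e.g. from a HuggingFace dataset."""
+         ...
+
+     def evaluate(self, task: dict, solution: str) -> bool:
+         """Benchmark-specific scoring: test suite, crash check, or tool-use accuracy."""
+         ...
+ ```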
923
+
924
+ ### Execution Flow
925
+
926
+ ```
927
+ ┌─────────────────────────────────────────────────────────────────┐
928
+ │ Host Machine │
929
+ │ ┌───────────────────────────────────────────────────────────┐ │
930
+ │ │ mcpbr Harness (Python) │ │
931
+ │ │ - Loads SWE-bench tasks from HuggingFace │ │
932
+ │ │ - Pulls pre-built Docker images │ │
933
+ │ │ - Orchestrates agent runs │ │
934
+ │ │ - Collects results and generates reports │ │
935
+ │ └─────────────────────────┬─────────────────────────────────┘ │
936
+ │ │ docker exec │
937
+ │ ┌─────────────────────────▼─────────────────────────────────┐ │
938
+ │ │ Docker Container (per task) │ │
939
+ │ │ ┌─────────────────────────────────────────────────────┐ │ │
940
+ │ │ │ Pre-built SWE-bench Image │ │ │
941
+ │ │ │ - Repository at correct commit │ │ │
942
+ │ │ │ - All dependencies installed (astropy, django...) │ │ │
943
+ │ │ │ - Node.js + Claude CLI (installed at startup) │ │ │
944
+ │ │ └─────────────────────────────────────────────────────┘ │ │
945
+ │ │ │ │
946
+ │ │ Agent (Claude Code CLI) runs HERE: │ │
947
+ │ │ - Makes API calls to Anthropic │ │
948
+ │ │ - Executes Bash commands (with working imports!) │ │
949
+ │ │ - Reads/writes files │ │
950
+ │ │ - Generates patches │ │
951
+ │ │ │ │
952
+ │ │ Evaluation runs HERE: │ │
953
+ │ │ - Applies patch via git │ │
954
+ │ │ - Runs pytest with task's test suite │ │
955
+ │ └───────────────────────────────────────────────────────────┘ │
956
+ └─────────────────────────────────────────────────────────────────┘
957
+ ```
958
+
959
+ ## Troubleshooting
960
+
961
+ > **[FAQ](https://greynewell.github.io/mcpbr/FAQ/)** - Quick answers to common questions
962
+ >
963
+ > **[Full troubleshooting guide](https://greynewell.github.io/mcpbr/troubleshooting/)** - Detailed solutions to common issues
964
+
965
+ ### Docker Issues
966
+
967
+ Ensure Docker is running:
968
+ ```bash
969
+ docker info
970
+ ```
971
+
972
+ ### Pre-built Image Not Found
973
+
974
+ If the harness can't pull a pre-built image for a task, it will fall back to building from scratch. You can also manually pull images:
975
+ ```bash
976
+ docker pull ghcr.io/epoch-research/swe-bench.eval.x86_64.astropy__astropy-12907
977
+ ```
978
+
979
+ ### Slow on Apple Silicon
980
+
981
+ On ARM64 Macs, x86_64 Docker images run via emulation, which is slower. This is normal. If you're experiencing issues, ensure Rosetta 2 is installed:
982
+ ```bash
983
+ softwareupdate --install-rosetta
984
+ ```
985
+
986
+ ### MCP Server Not Starting
987
+
988
+ Test your MCP server independently:
989
+ ```bash
990
+ npx -y @modelcontextprotocol/server-filesystem /tmp/test
991
+ ```
992
+
993
+ ### API Key Issues
994
+
995
+ Ensure your Anthropic API key is set:
996
+
997
+ ```bash
998
+ export ANTHROPIC_API_KEY="sk-ant-..."
999
+ ```
1000
+
1001
+ ### Timeout Issues
1002
+
1003
+ Increase the timeout in your config:
1004
+ ```yaml
1005
+ timeout_seconds: 600
1006
+ ```
1007
+
1008
+ ### Claude CLI Not Found
1009
+
1010
+ Ensure the Claude Code CLI is installed and in your PATH:
1011
+ ```bash
1012
+ which claude # Should return the path to the CLI
1013
+ ```
1014
+
1015
+ ## Development
1016
+
1017
+ ```bash
1018
+ # Install dev dependencies
1019
+ pip install -e ".[dev]"
1020
+
1021
+ # Run unit tests
1022
+ pytest -m "not integration"
1023
+
1024
+ # Run integration tests (requires API keys and Docker)
1025
+ pytest -m integration
1026
+
1027
+ # Run all tests
1028
+ pytest
1029
+
1030
+ # Lint
1031
+ ruff check src/
1032
+ ```
1033
+
1034
+ ## Roadmap
1035
+
1036
+ We're building the de facto standard for MCP server benchmarking! Our [v1.0 Roadmap](https://github.com/greynewell/mcpbr/projects/2) includes 200+ features across 11 strategic categories:
1037
+
1038
+ 🎯 **[Good First Issues](https://github.com/greynewell/mcpbr/labels/good%20first%20issue)** | 🙋 **[Help Wanted](https://github.com/greynewell/mcpbr/labels/help%20wanted)** | 📋 **[View Roadmap](https://github.com/greynewell/mcpbr/projects/2)**
1039
+
1040
+ [![good first issues](https://img.shields.io/github/issues/greynewell/mcpbr/good%20first%20issue?label=good%20first%20issues&color=7057ff)](https://github.com/greynewell/mcpbr/labels/good%20first%20issue)
1041
+ [![help wanted](https://img.shields.io/github/issues/greynewell/mcpbr/help%20wanted?label=help%20wanted&color=008672)](https://github.com/greynewell/mcpbr/labels/help%20wanted)
1042
+ [![roadmap progress](https://img.shields.io/github/issues-pr-closed/greynewell/mcpbr?label=roadmap%20progress)](https://github.com/greynewell/mcpbr/projects/2)
1043
+
1044
+ ### Roadmap Highlights
1045
+
1046
+ **Phase 1: Foundation** (v0.3.0)
1047
+ - ✅ JUnit XML output format for CI/CD integration
1048
+ - CSV, YAML, XML output formats
1049
+ - Config validation and templates
1050
+ - Results persistence and recovery
1051
+ - Cost analysis in reports
1052
+
1053
+ **Phase 2: Benchmarks** (v0.4.0)
1054
+ - HumanEval, MBPP, ToolBench
1055
+ - GAIA for general AI capabilities
1056
+ - Custom benchmark YAML support
1057
+ - SWE-bench Verified
1058
+
1059
+ **Phase 3: Developer Experience** (v0.5.0)
1060
+ - Real-time dashboard
1061
+ - Interactive config wizard
1062
+ - Shell completion
1063
+ - Pre-flight checks
1064
+
1065
+ **Phase 4: Platform Expansion** (v0.6.0)
1066
+ - NPM package
1067
+ - GitHub Action for CI/CD
1068
+ - Homebrew formula
1069
+ - Official Docker image
1070
+
1071
+ **Phase 5: MCP Testing Suite** (v1.0.0)
1072
+ - Tool coverage analysis
1073
+ - Performance profiling
1074
+ - Error rate monitoring
1075
+ - Security scanning
1076
+
1077
+ ### Get Involved
1078
+
1079
+ We welcome contributions! Check out our **30+ good first issues** perfect for newcomers:
1080
+
1081
+ - **Output Formats**: CSV/YAML/XML export
1082
+ - **Configuration**: Validation, templates, shell completion
1083
+ - **Platform**: Homebrew formula, Conda package
1084
+ - **Documentation**: Best practices, examples, guides
1085
+
1086
+ See the [contributing guide](https://greynewell.github.io/mcpbr/contributing/) to get started!
1087
+
1088
+ ## Best Practices
1089
+
1090
+ New to mcpbr or want to optimize your workflow? Check out the **[Best Practices Guide](https://greynewell.github.io/mcpbr/best-practices/)** for:
1091
+
1092
+ - Benchmark selection guidelines
1093
+ - MCP server configuration tips
1094
+ - Performance optimization strategies
1095
+ - Cost management techniques
1096
+ - CI/CD integration patterns
1097
+ - Debugging workflows
1098
+ - Common pitfalls to avoid
1099
+
1100
+ ## Contributing
1101
+
1102
+ Please see [CONTRIBUTING.md](CONTRIBUTING.md) or the **[contributing guide](https://greynewell.github.io/mcpbr/contributing/)** for guidelines on how to contribute.
1103
+
1104
+ All contributors are expected to follow our [Community Guidelines](CODE_OF_CONDUCT.md).
1105
+
1106
+ ## License
1107
+
1108
+ MIT - see [LICENSE](LICENSE) for details.
1109
+
1110
+
1111
+ ---
1112
+
1113
+ Built by [Grey Newell](https://greynewell.com)
package/bin/mcpbr.js ADDED
@@ -0,0 +1,184 @@
+ #!/usr/bin/env node
+
+ /**
+  * mcpbr CLI wrapper for npm
+  *
+  * This wrapper provides npm/npx access to the mcpbr CLI tool,
+  * which is implemented in Python. It checks for Python 3.11+
+  * and the mcpbr Python package, then forwards all arguments
+  * to the Python CLI.
+  */
+
+ const { spawn } = require('cross-spawn');
+ const { execSync } = require('child_process');
+
+ /**
+  * Return the command for a Python 3.11+ interpreter, or null if none is available
+  */
+ function checkPython() {
+   // Try python3 first (most common on Unix), then python (common on Windows).
+   for (const cmd of ['python3', 'python']) {
+     try {
+       const version = execSync(`${cmd} --version`, { encoding: 'utf8', stdio: ['pipe', 'pipe', 'ignore'] });
+       const match = version.match(/Python (\d+)\.(\d+)/);
+
+       if (match && parseInt(match[1]) === 3 && parseInt(match[2]) >= 11) {
+         return cmd;
+       }
+     } catch (error) {
+       // Command not found or version check failed; try the next candidate.
+     }
+   }
+
+   return null;
+ }
+
+ /**
+  * Check if any Python interpreter is on the PATH, regardless of version
+  */
+ function anyPythonAvailable() {
+   for (const cmd of ['python3', 'python']) {
+     try {
+       execSync(`${cmd} --version`, { stdio: ['pipe', 'pipe', 'ignore'] });
+       return true;
+     } catch (error) {
+       // Not found; try the next candidate.
+     }
+   }
+   return false;
+ }
+
+ /**
+  * Check if mcpbr Python package is installed
+  */
+ function checkMcpbr(pythonCmd) {
+   try {
+     execSync(`${pythonCmd} -m mcpbr --version`, {
+       encoding: 'utf8',
+       stdio: ['pipe', 'pipe', 'ignore']
+     });
+     return true;
+   } catch (error) {
+     return false;
+   }
+ }
+
+ /**
+  * Print installation instructions
+  */
+ function printInstallInstructions() {
+   console.error(`
+ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+ mcpbr requires Python 3.11+ and the mcpbr Python package
+ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+
+ Please install the requirements:
+
+ 1. Install Python 3.11 or later:
+    • macOS: brew install python@3.11
+    • Ubuntu: sudo apt install python3.11
+    • Windows: https://www.python.org/downloads/
+
+ 2. Install mcpbr via pip:
+    • pip install mcpbr
+    or
+    • pip3 install mcpbr
+
+ For more information, visit: https://github.com/greynewell/mcpbr
+
+ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+ `);
+ }
+
+ /**
+  * Print Python version mismatch error
+  */
+ function printPythonVersionError() {
+   console.error(`
+ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+ mcpbr requires Python 3.11 or later
+ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+
+ Your Python version is too old. Please upgrade:
+
+ • macOS: brew install python@3.11
+ • Ubuntu: sudo apt install python3.11
+ • Windows: https://www.python.org/downloads/
+
+ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+ `);
+ }
+
+ /**
+  * Print mcpbr not installed error
+  */
+ function printMcpbrNotInstalledError() {
+   console.error(`
+ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+ mcpbr Python package not found
+ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+
+ Please install mcpbr via pip:
+
+   pip install mcpbr
+
+ or
+
+   pip3 install mcpbr
+
+ For more information, visit: https://github.com/greynewell/mcpbr
+
+ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+ `);
+ }
+
+ /**
+  * Main execution
+  */
+ function main() {
+   // Check for Python 3.11+
+   const pythonCmd = checkPython();
+
+   if (!pythonCmd) {
+     // Distinguish "no Python at all" from "Python present but too old",
+     // so the error message is not misleading.
+     if (anyPythonAvailable()) {
+       printPythonVersionError();
+     } else {
+       printInstallInstructions();
+     }
+     process.exit(1);
+   }
+
+   // Check if mcpbr is installed
+   if (!checkMcpbr(pythonCmd)) {
+     printMcpbrNotInstalledError();
+     process.exit(1);
+   }
+
+   // Forward all arguments to the mcpbr Python CLI
+   const args = process.argv.slice(2);
+   const mcpbr = spawn(pythonCmd, ['-m', 'mcpbr', ...args], {
+     stdio: 'inherit',
+     env: process.env
+   });
+
+   mcpbr.on('error', (error) => {
+     console.error(`Failed to start mcpbr: ${error.message}`);
+     process.exit(1);
+   });
+
+   mcpbr.on('exit', (code, signal) => {
+     // If killed by a signal, exit with an error code
+     if (signal) {
+       process.exit(1);
+     }
+     process.exit(code || 0);
+   });
+ }
+
+ // Run if executed directly
+ if (require.main === module) {
+   main();
+ }
+
+ module.exports = { checkPython, checkMcpbr };
package/package.json ADDED
@@ -0,0 +1,50 @@
1
+ {
2
+   "name": "@greynewell/mcpbr",
3
+   "version": "0.3.18",
4
+   "description": "Model Context Protocol Benchmark Runner - CLI tool for evaluating MCP servers",
5
+   "keywords": [
6
+     "mcpbr",
7
+     "mcp",
8
+     "benchmark",
9
+     "model-context-protocol",
10
+     "swe-bench",
11
+     "cybergym",
12
+     "llm",
13
+     "agents",
14
+     "evaluation",
15
+     "cli",
16
+     "testing"
17
+   ],
18
+   "homepage": "https://github.com/greynewell/mcpbr",
19
+   "repository": {
20
+     "type": "git",
21
+     "url": "https://github.com/greynewell/mcpbr.git"
22
+   },
23
+   "bugs": {
24
+     "url": "https://github.com/greynewell/mcpbr/issues"
25
+   },
26
+   "license": "MIT",
27
+   "author": "mcpbr Contributors",
28
+   "bin": {
29
+     "mcpbr": "./bin/mcpbr.js"
30
+   },
31
+   "files": [
32
+     "bin/",
33
+     "README.md"
34
+   ],
35
+   "scripts": {
36
+     "test": "node bin/mcpbr.js --version",
37
+     "prepublishOnly": "npm test"
38
+   },
39
+   "engines": {
40
+     "node": ">=18.0.0"
41
+   },
42
+   "os": [
43
+     "darwin",
44
+     "linux",
45
+     "win32"
46
+   ],
47
+   "dependencies": {
48
+     "cross-spawn": "^7.0.3"
49
+   }
50
+ }