mcpbr-cli 0.3.28 → 0.4.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (2)
  1. package/README.md +194 -48
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -1,11 +1,11 @@
  # mcpbr
 
  ```bash
- # Install via pip
- pip install mcpbr && mcpbr init && mcpbr run -c mcpbr.yaml -n 1 -v
+ # One-liner install (installs + runs quick test)
+ curl -sSL https://raw.githubusercontent.com/greynewell/mcpbr/main/install.sh | bash
 
- # Or via npm
- npm install -g mcpbr-cli && mcpbr init && mcpbr run -c mcpbr.yaml -n 1 -v
+ # Or install and run manually
+ pip install mcpbr && mcpbr run -n 1
  ```
 
  Benchmark your MCP server against real GitHub issues. One command, hard numbers.
@@ -19,6 +19,7 @@ Benchmark your MCP server against real GitHub issues. One command, hard numbers.
  **Model Context Protocol Benchmark Runner**
 
  [![PyPI version](https://badge.fury.io/py/mcpbr.svg)](https://pypi.org/project/mcpbr/)
+ [![npm version](https://badge.fury.io/js/mcpbr-cli.svg)](https://www.npmjs.com/package/mcpbr-cli)
  [![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
  [![CI](https://github.com/greynewell/mcpbr/actions/workflows/ci.yml/badge.svg)](https://github.com/greynewell/mcpbr/actions/workflows/ci.yml)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
@@ -60,11 +61,15 @@ mcpbr supports multiple software engineering benchmarks through a flexible abstr
  ### SWE-bench (Default)
  Real GitHub issues requiring bug fixes and patches. The agent generates unified diffs evaluated by running pytest test suites.
 
- - **Dataset**: [SWE-bench/SWE-bench_Lite](https://huggingface.co/datasets/SWE-bench/SWE-bench_Lite)
  - **Task**: Generate patches to fix bugs
  - **Evaluation**: Test suite pass/fail
  - **Pre-built images**: Available for most tasks
 
+ **Variants:**
+ - **swe-bench-verified** (default) - Manually validated test cases for higher quality evaluation ([SWE-bench/SWE-bench_Verified](https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified))
+ - **swe-bench-lite** - 300 tasks, quick testing ([SWE-bench/SWE-bench_Lite](https://huggingface.co/datasets/SWE-bench/SWE-bench_Lite))
+ - **swe-bench-full** - 2,294 tasks, complete benchmark ([SWE-bench/SWE-bench](https://huggingface.co/datasets/SWE-bench/SWE-bench))
+
  ### CyberGym
  Security vulnerabilities requiring Proof-of-Concept (PoC) exploits. The agent generates exploits that trigger crashes in vulnerable code.
 
@@ -84,16 +89,16 @@ Large-scale MCP tool use evaluation across 45+ categories. Tests agent capabilit
  - **Learn more**: [MCPToolBench++ Paper](https://arxiv.org/pdf/2508.07575) | [GitHub](https://github.com/mcp-tool-bench/MCPToolBenchPP)
 
  ```bash
- # Run SWE-bench (default)
+ # Run SWE-bench Verified (default - manually validated tests)
  mcpbr run -c config.yaml
 
- # Run CyberGym at level 2
- mcpbr run -c config.yaml --benchmark cybergym --level 2
+ # Run SWE-bench Lite (300 tasks, quick testing)
+ mcpbr run -c config.yaml -b swe-bench-lite
 
- # Run MCPToolBench++
- mcpbr run -c config.yaml --benchmark mcptoolbench
+ # Run SWE-bench Full (2,294 tasks, complete benchmark)
+ mcpbr run -c config.yaml -b swe-bench-full
 
- # List available benchmarks
+ # List all available benchmarks
  mcpbr benchmarks
  ```
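For reference, the CyberGym and MCPToolBench++ invocations dropped from this example block remain available through the same `-b`/`--benchmark` flag documented in the run options table further down; a brief sketch using only flags that appear elsewhere in this diff:

```bash
# The other benchmarks are still selected with -b/--benchmark;
# --level applies to CyberGym only (difficulty 0-3 per the options table).
mcpbr run -c config.yaml -b cybergym --level 2
mcpbr run -c config.yaml -b mcptoolbench
```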
 
@@ -211,6 +216,8 @@ Run `mcpbr models` to see the full list.
 
  ### via npm
 
+ [![npm package](https://img.shields.io/npm/v/mcpbr-cli.svg)](https://www.npmjs.com/package/mcpbr-cli)
+
  ```bash
  # Run with npx (no installation)
  npx mcpbr-cli run -c config.yaml
@@ -220,6 +227,8 @@ npm install -g mcpbr-cli
  mcpbr run -c config.yaml
  ```
 
+ > **Package**: [`mcpbr-cli`](https://www.npmjs.com/package/mcpbr-cli) on npm
+ >
  > **Note**: The npm package requires Python 3.11+ and the mcpbr Python package (`pip install mcpbr`)
 
  ### via pip
@@ -271,9 +280,13 @@ See the **[Examples README](examples/README.md)** for the complete guide.
  export ANTHROPIC_API_KEY="your-api-key"
  ```
 
- 2. **Generate a configuration file:**
+ 2. **Run mcpbr (config auto-created if missing):**
 
  ```bash
+ # Config is auto-created on first run
+ mcpbr run -n 1
+
+ # Or explicitly generate a config file first
  mcpbr init
  ```
 
@@ -311,55 +324,135 @@ mcpbr run --config config.yaml
 
  [![Claude Code Ready](https://img.shields.io/badge/Claude_Code-Ready-5865F2?style=flat&logo=anthropic)](https://claude.ai/download)
 
- mcpbr includes a built-in Claude Code plugin that makes Claude an expert at running benchmarks correctly. When you clone this repository, Claude Code automatically detects the plugin and gains specialized knowledge about mcpbr.
+ mcpbr includes a built-in Claude Code plugin that makes Claude an expert at running benchmarks correctly. The plugin provides specialized skills and knowledge about mcpbr configuration, execution, and troubleshooting.
 
- ### What This Means for You
+ ### Installation Options
 
- When using Claude Code in this repository, you can simply say:
+ You have three ways to enable the mcpbr plugin in Claude Code:
 
- - "Run the SWE-bench Lite benchmark"
- - "Generate a config for my MCP server"
- - "Run a quick test with 1 task"
+ #### Option 1: Clone Repository (Automatic Detection)
 
- Claude will automatically:
- - Verify Docker is running before starting
- - Check for required API keys
- - Generate valid configurations with proper `{workdir}` placeholders
- - Use correct CLI flags and options
- - Provide helpful troubleshooting when issues occur
+ When you clone this repository, Claude Code automatically detects and loads the plugin:
 
- ### Available Skills
+ ```bash
+ git clone https://github.com/greynewell/mcpbr.git
+ cd mcpbr
 
- The plugin includes three specialized skills:
+ # Plugin is now active - try asking Claude:
+ # "Run the SWE-bench Lite eval with 5 tasks"
+ ```
 
- 1. **run-benchmark**: Expert at running evaluations with proper validation
-    - Checks prerequisites (Docker, API keys, config files)
-    - Constructs valid `mcpbr run` commands
-    - Handles errors gracefully with actionable feedback
+ **Best for**: Contributors, developers testing changes, or users who want the latest unreleased features.
 
- 2. **generate-config**: Generates valid mcpbr configuration files
-    - Ensures `{workdir}` placeholder is included
-    - Validates MCP server commands
-    - Provides benchmark-specific templates
+ #### Option 2: npm Global Install (Planned for v0.4.0)
 
- 3. **swe-bench-lite**: Quick-start command for SWE-bench Lite
-    - Pre-configured for 5-task evaluation
-    - Includes sensible defaults for output files
-    - Perfect for testing and demonstrations
+ Install the plugin globally via npm for use across any project:
 
- ### Getting Started with Claude Code
+ ```bash
+ # Planned for v0.4.0 (not yet released)
+ npm install -g @mcpbr/claude-code-plugin
+ ```
 
- Just clone the repository and start asking Claude to run benchmarks:
+ > **Note**: The npm package is not yet published. This installation method will be available in a future release. Track progress in [issue #265](https://github.com/greynewell/mcpbr/issues/265).
 
- ```bash
- git clone https://github.com/greynewell/mcpbr.git
- cd mcpbr
+ **Best for**: Users who want plugin features available in any directory.
 
- # In Claude Code, simply say:
- # "Run the SWE-bench Lite eval with 5 tasks"
- ```
+ #### Option 3: Claude Code Plugin Manager (Planned for v0.4.0)
+
+ Install via Claude Code's built-in plugin manager:
+
+ 1. Open Claude Code settings
+ 2. Navigate to Plugins > Browse
+ 3. Search for "mcpbr"
+ 4. Click Install
+
+ > **Note**: Plugin manager installation is not yet available. This installation method will be available after plugin marketplace submission. Track progress in [issue #267](https://github.com/greynewell/mcpbr/issues/267).
+
+ **Best for**: Users who prefer a GUI and want automatic updates.
+
+ ### Installation Comparison
+
+ | Method | Availability | Auto-updates | Works Anywhere | Latest Features |
+ |--------|-------------|--------------|----------------|-----------------|
+ | Clone Repository | Available now | Manual (git pull) | No (repo only) | Yes (unreleased) |
+ | npm Global Install | Planned (not yet released) | Via npm | Yes | Yes (published) |
+ | Plugin Manager | Planned (not yet released) | Automatic | Yes | Yes (published) |
+
+ ### What You Get
+
+ The plugin includes three specialized skills that enhance Claude's ability to work with mcpbr:
+
+ #### 1. run-benchmark
+ Expert at running evaluations with proper validation and error handling.
+
+ **Capabilities**:
+ - Validates prerequisites (Docker running, API keys set, config files exist)
+ - Constructs correct `mcpbr run` commands with appropriate flags
+ - Handles errors gracefully with actionable troubleshooting steps
+ - Monitors progress and provides meaningful status updates
+
+ **Example interactions**:
+ - "Run the SWE-bench Lite benchmark with 10 tasks"
+ - "Evaluate my MCP server using CyberGym level 2"
+ - "Test my config with a single task"
+
+ #### 2. generate-config
+ Generates valid mcpbr configuration files with benchmark-specific templates.
+
+ **Capabilities**:
+ - Ensures required `{workdir}` placeholder is included in MCP server args
+ - Validates MCP server command syntax
+ - Provides templates for different benchmarks (SWE-bench, CyberGym, MCPToolBench++)
+ - Suggests appropriate timeouts and concurrency settings
+
+ **Example interactions**:
+ - "Generate a config for the filesystem MCP server"
+ - "Create a config for testing my custom MCP server"
+ - "Set up a CyberGym evaluation config"
+
+ #### 3. swe-bench-lite
+ Quick-start command for running SWE-bench Lite evaluations.
+
+ **Capabilities**:
+ - Pre-configured for 5-task evaluation (fast testing)
+ - Includes sensible defaults for output files and logging
+ - Perfect for demonstrations and initial testing
+ - Automatically sets up verbose output for debugging
 
- The bundled plugin ensures Claude makes no silly mistakes and follows best practices automatically.
+ **Example interactions**:
+ - "Run a quick SWE-bench Lite test"
+ - "Show me how mcpbr works"
+ - "Test the filesystem server"
+
+ ### Benefits
+
+ When using Claude Code with the mcpbr plugin active, Claude will automatically:
+
+ - Verify Docker is running before starting evaluations
+ - Check for required API keys (`ANTHROPIC_API_KEY`)
+ - Generate valid configurations with proper `{workdir}` placeholders
+ - Use correct CLI flags and avoid deprecated options
+ - Provide contextual troubleshooting when issues occur
+ - Follow mcpbr best practices for optimal results
+
+ ### Troubleshooting
+
+ **Plugin not detected in cloned repository**:
+ - Ensure you're in the repository root directory
+ - Verify the `claude-code.json` file exists in the repo
+ - Try restarting Claude Code
+
+ **Skills not appearing**:
+ - Check Claude Code version (requires v2.0+)
+ - Verify plugin is listed in Settings > Plugins
+ - Try running `/reload-plugins` in Claude Code
+
+ **Commands failing**:
+ - Ensure mcpbr is installed: `pip install mcpbr`
+ - Verify Docker is running: `docker info`
+ - Check API key is set: `echo $ANTHROPIC_API_KEY`
+
+ For more help, see the [troubleshooting guide](https://greynewell.github.io/mcpbr/troubleshooting/) or [open an issue](https://github.com/greynewell/mcpbr/issues).
 
  ## Configuration
 
@@ -509,7 +602,7 @@ Run SWE-bench evaluation with the configured MCP server.
 
  | Option | Short | Description |
  |--------|-------|-------------|
- | `--config PATH` | `-c` | Path to YAML configuration file (required) |
+ | `--config PATH` | `-c` | Path to YAML configuration file (default: `mcpbr.yaml`, auto-created if missing) |
  | `--model TEXT` | `-m` | Override model from config |
  | `--benchmark TEXT` | `-b` | Override benchmark from config (`swe-bench`, `cybergym`, or `mcptoolbench`) |
  | `--level INTEGER` | | Override CyberGym difficulty level (0-3) |
@@ -536,6 +629,7 @@ Run SWE-bench evaluation with the configured MCP server.
  | `--smtp-port PORT` | | SMTP server port (default: 587) |
  | `--smtp-user USER` | | SMTP username for authentication |
  | `--smtp-password PASS` | | SMTP password for authentication |
+ | `--profile` | | Enable comprehensive performance profiling (tool latency, memory, overhead) |
  | `--help` | `-h` | Show help message |
 
  </details>
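The run options above compose on a single command line. A sketch combining only flags shown in the table and the quick-start examples (the config path is the documented default, so `-c` could be omitted):

```bash
# -c defaults to mcpbr.yaml (auto-created if missing); -b picks any benchmark
# listed by `mcpbr benchmarks`; --profile adds the performance metrics
# described in the Performance Profiling section below.
mcpbr run -c mcpbr.yaml -b swe-bench-lite --profile -n 1 -v
```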
@@ -650,6 +744,58 @@ mcpbr cleanup -f
 
  </details>
 
+ ## Performance Profiling
+
+ mcpbr includes comprehensive performance profiling to understand MCP server overhead and identify optimization opportunities.
+
+ ### Enable Profiling
+
+ ```bash
+ # Via CLI flag
+ mcpbr run -c config.yaml --profile
+
+ # Or in config.yaml
+ enable_profiling: true
+ ```
+
+ ### What Gets Measured
+
+ - **Tool call latencies** with percentiles (p50, p95, p99)
+ - **Memory usage** (peak and average RSS/VMS)
+ - **Infrastructure overhead** (Docker and MCP server startup times)
+ - **Tool discovery speed** (time to first tool use)
+ - **Tool switching overhead** (time between tool calls)
+ - **Automated insights** from profiling data
+
+ ### Example Profiling Output
+
+ ```json
+ {
+   "profiling": {
+     "task_duration_seconds": 140.5,
+     "tool_call_latencies": {
+       "Read": {"count": 15, "avg_seconds": 0.8, "p95_seconds": 1.5},
+       "Bash": {"avg_seconds": 2.3, "p95_seconds": 5.1}
+     },
+     "memory_profile": {"peak_rss_mb": 512.3, "avg_rss_mb": 387.5},
+     "docker_startup_seconds": 2.1,
+     "mcp_server_startup_seconds": 1.8
+   }
+ }
+ ```
+
+ ### Automated Insights
+
+ The profiler automatically identifies performance issues:
+
+ ```text
+ - Bash is the slowest tool (avg: 2.3s, p95: 5.1s)
+ - Docker startup adds 2.1s overhead per task
+ - Fast tool discovery: first tool use in 8.3s
+ ```
+
+ See [docs/profiling.md](docs/profiling.md) for complete profiling documentation.
+
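Since the profiling block is plain JSON, it can be post-processed with standard tooling. An illustrative sketch using `jq` (the `results.json` filename is an assumption, not a documented mcpbr default; point it at wherever your run writes its JSON output):

```bash
# Pull out the per-tool latency stats (structure matches the example above).
jq '.profiling.tool_call_latencies' results.json

# Rank tools by average latency to spot the slowest one.
jq -r '.profiling.tool_call_latencies | to_entries
       | sort_by(-.value.avg_seconds)
       | .[] | "\(.key): \(.value.avg_seconds)s avg"' results.json
```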
  ## Example Run
 
  Here's what a typical evaluation looks like:
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "mcpbr-cli",
-   "version": "0.3.28",
+   "version": "0.4.1",
    "description": "Model Context Protocol Benchmark Runner - CLI tool for evaluating MCP servers",
    "keywords": [
      "mcpbr",