mcpbr-cli 0.3.26 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (2)
  1. package/README.md +184 -49
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -1,11 +1,11 @@
  # mcpbr
 
  ```bash
- # Install via pip
- pip install mcpbr && mcpbr init && mcpbr run -c mcpbr.yaml -n 1 -v
+ # One-liner install (installs + runs quick test)
+ curl -sSL https://raw.githubusercontent.com/greynewell/mcpbr/main/install.sh | bash
 
- # Or via npm
- npm install -g mcpbr-cli && mcpbr init && mcpbr run -c mcpbr.yaml -n 1 -v
+ # Or install and run manually
+ pip install mcpbr && mcpbr run -n 1
  ```
 
  Benchmark your MCP server against real GitHub issues. One command, hard numbers.
@@ -60,11 +60,15 @@ mcpbr supports multiple software engineering benchmarks through a flexible abstr
  ### SWE-bench (Default)
  Real GitHub issues requiring bug fixes and patches. The agent generates unified diffs evaluated by running pytest test suites.
 
- - **Dataset**: [SWE-bench/SWE-bench_Lite](https://huggingface.co/datasets/SWE-bench/SWE-bench_Lite)
  - **Task**: Generate patches to fix bugs
  - **Evaluation**: Test suite pass/fail
  - **Pre-built images**: Available for most tasks
 
+ **Variants:**
+ - **swe-bench-verified** (default) - Manually validated test cases for higher-quality evaluation ([SWE-bench/SWE-bench_Verified](https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified))
+ - **swe-bench-lite** - 300 tasks, quick testing ([SWE-bench/SWE-bench_Lite](https://huggingface.co/datasets/SWE-bench/SWE-bench_Lite))
+ - **swe-bench-full** - 2,294 tasks, complete benchmark ([SWE-bench/SWE-bench](https://huggingface.co/datasets/SWE-bench/SWE-bench))
+
  ### CyberGym
  Security vulnerabilities requiring Proof-of-Concept (PoC) exploits. The agent generates exploits that trigger crashes in vulnerable code.
 
@@ -84,16 +88,16 @@ Large-scale MCP tool use evaluation across 45+ categories. Tests agent capabilit
  - **Learn more**: [MCPToolBench++ Paper](https://arxiv.org/pdf/2508.07575) | [GitHub](https://github.com/mcp-tool-bench/MCPToolBenchPP)
 
  ```bash
- # Run SWE-bench (default)
+ # Run SWE-bench Verified (default - manually validated tests)
  mcpbr run -c config.yaml
 
- # Run CyberGym at level 2
- mcpbr run -c config.yaml --benchmark cybergym --level 2
+ # Run SWE-bench Lite (300 tasks, quick testing)
+ mcpbr run -c config.yaml -b swe-bench-lite
 
- # Run MCPToolBench++
- mcpbr run -c config.yaml --benchmark mcptoolbench
+ # Run SWE-bench Full (2,294 tasks, complete benchmark)
+ mcpbr run -c config.yaml -b swe-bench-full
 
- # List available benchmarks
+ # List all available benchmarks
  mcpbr benchmarks
  ```
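A quick smoke test can combine the benchmark selector above with the `-n` and `-v` flags used earlier in this README (a sketch; the `command -v` guard and `status` variable are our additions, not part of mcpbr):

```shell
# One-task verbose smoke test against SWE-bench Lite, built from
# flags documented in this README. Guarded so the snippet degrades
# gracefully on machines where mcpbr is not installed.
if command -v mcpbr >/dev/null 2>&1; then
  mcpbr run -c config.yaml -b swe-bench-lite -n 1 -v
  status="ran"
else
  echo "mcpbr not installed; try: pip install mcpbr"
  status="skipped"
fi
```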
 
@@ -271,9 +275,13 @@ See the **[Examples README](examples/README.md)** for the complete guide.
  export ANTHROPIC_API_KEY="your-api-key"
  ```
 
- 2. **Generate a configuration file:**
+ 2. **Run mcpbr (config auto-created if missing):**
 
  ```bash
+ # Config is auto-created on first run
+ mcpbr run -n 1
+
+ # Or explicitly generate a config file first
  mcpbr init
  ```
 
@@ -296,6 +304,9 @@ dataset: "SWE-bench/SWE-bench_Lite"
  sample_size: 10
  timeout_seconds: 300
  max_concurrent: 4
+
+ # Optional: disable default logging (logs are saved to output_dir/logs/ by default)
+ # disable_logs: true
  ```
 
  4. **Run the evaluation:**
@@ -308,55 +319,135 @@ mcpbr run --config config.yaml
 
  [![Claude Code Ready](https://img.shields.io/badge/Claude_Code-Ready-5865F2?style=flat&logo=anthropic)](https://claude.ai/download)
 
- mcpbr includes a built-in Claude Code plugin that makes Claude an expert at running benchmarks correctly. When you clone this repository, Claude Code automatically detects the plugin and gains specialized knowledge about mcpbr.
+ mcpbr includes a built-in Claude Code plugin that makes Claude an expert at running benchmarks correctly. The plugin provides specialized skills and knowledge about mcpbr configuration, execution, and troubleshooting.
 
- ### What This Means for You
+ ### Installation Options
 
- When using Claude Code in this repository, you can simply say:
+ You have three ways to enable the mcpbr plugin in Claude Code:
 
- - "Run the SWE-bench Lite benchmark"
- - "Generate a config for my MCP server"
- - "Run a quick test with 1 task"
+ #### Option 1: Clone Repository (Automatic Detection)
 
- Claude will automatically:
- - Verify Docker is running before starting
- - Check for required API keys
- - Generate valid configurations with proper `{workdir}` placeholders
- - Use correct CLI flags and options
- - Provide helpful troubleshooting when issues occur
+ When you clone this repository, Claude Code automatically detects and loads the plugin:
 
- ### Available Skills
+ ```bash
+ git clone https://github.com/greynewell/mcpbr.git
+ cd mcpbr
 
- The plugin includes three specialized skills:
+ # Plugin is now active - try asking Claude:
+ # "Run the SWE-bench Lite eval with 5 tasks"
+ ```
 
- 1. **run-benchmark**: Expert at running evaluations with proper validation
- - Checks prerequisites (Docker, API keys, config files)
- - Constructs valid `mcpbr run` commands
- - Handles errors gracefully with actionable feedback
+ **Best for**: Contributors, developers testing changes, or users who want the latest unreleased features.
 
- 2. **generate-config**: Generates valid mcpbr configuration files
- - Ensures `{workdir}` placeholder is included
- - Validates MCP server commands
- - Provides benchmark-specific templates
+ #### Option 2: npm Global Install (Planned for v0.4.0)
 
- 3. **swe-bench-lite**: Quick-start command for SWE-bench Lite
- - Pre-configured for 5-task evaluation
- - Includes sensible defaults for output files
- - Perfect for testing and demonstrations
+ Install the plugin globally via npm for use across any project:
 
- ### Getting Started with Claude Code
+ ```bash
+ # Planned for v0.4.0 (not yet released)
+ npm install -g @mcpbr/claude-code-plugin
+ ```
 
- Just clone the repository and start asking Claude to run benchmarks:
+ > **Note**: The npm package is not yet published. This installation method will be available in a future release. Track progress in [issue #265](https://github.com/greynewell/mcpbr/issues/265).
 
- ```bash
- git clone https://github.com/greynewell/mcpbr.git
- cd mcpbr
+ **Best for**: Users who want plugin features available in any directory.
 
- # In Claude Code, simply say:
- # "Run the SWE-bench Lite eval with 5 tasks"
- ```
+ #### Option 3: Claude Code Plugin Manager (Planned for v0.4.0)
+
+ Install via Claude Code's built-in plugin manager:
+
+ 1. Open Claude Code settings
+ 2. Navigate to Plugins > Browse
+ 3. Search for "mcpbr"
+ 4. Click Install
+
+ > **Note**: Plugin manager installation is not yet available. This installation method will be available after plugin marketplace submission. Track progress in [issue #267](https://github.com/greynewell/mcpbr/issues/267).
+
+ **Best for**: Users who prefer a GUI and want automatic updates.
+
+ ### Installation Comparison
+
+ | Method | Availability | Auto-updates | Works Anywhere | Latest Features |
+ |--------|-------------|--------------|----------------|-----------------|
+ | Clone Repository | Available now | Manual (git pull) | No (repo only) | Yes (unreleased) |
+ | npm Global Install | Planned (not yet released) | Via npm | Yes | Yes (published) |
+ | Plugin Manager | Planned (not yet released) | Automatic | Yes | Yes (published) |
+
+ ### What You Get
+
+ The plugin includes three specialized skills that enhance Claude's ability to work with mcpbr:
+
+ #### 1. run-benchmark
+ Expert at running evaluations with proper validation and error handling.
+
+ **Capabilities**:
+ - Validates prerequisites (Docker running, API keys set, config files exist)
+ - Constructs correct `mcpbr run` commands with appropriate flags
+ - Handles errors gracefully with actionable troubleshooting steps
+ - Monitors progress and provides meaningful status updates
+
+ **Example interactions**:
+ - "Run the SWE-bench Lite benchmark with 10 tasks"
+ - "Evaluate my MCP server using CyberGym level 2"
+ - "Test my config with a single task"
+
+ #### 2. generate-config
+ Generates valid mcpbr configuration files with benchmark-specific templates.
+
+ **Capabilities**:
+ - Ensures required `{workdir}` placeholder is included in MCP server args
+ - Validates MCP server command syntax
+ - Provides templates for different benchmarks (SWE-bench, CyberGym, MCPToolBench++)
+ - Suggests appropriate timeouts and concurrency settings
 
- The bundled plugin ensures Claude makes no silly mistakes and follows best practices automatically.
+ **Example interactions**:
+ - "Generate a config for the filesystem MCP server"
+ - "Create a config for testing my custom MCP server"
+ - "Set up a CyberGym evaluation config"
+
+ #### 3. swe-bench-lite
+ Quick-start command for running SWE-bench Lite evaluations.
+
+ **Capabilities**:
+ - Pre-configured for 5-task evaluation (fast testing)
+ - Includes sensible defaults for output files and logging
+ - Perfect for demonstrations and initial testing
+ - Automatically sets up verbose output for debugging
+
+ **Example interactions**:
+ - "Run a quick SWE-bench Lite test"
+ - "Show me how mcpbr works"
+ - "Test the filesystem server"
+
+ ### Benefits
+
+ When using Claude Code with the mcpbr plugin active, Claude will automatically:
+
+ - Verify Docker is running before starting evaluations
+ - Check for required API keys (`ANTHROPIC_API_KEY`)
+ - Generate valid configurations with proper `{workdir}` placeholders
+ - Use correct CLI flags and avoid deprecated options
+ - Provide contextual troubleshooting when issues occur
+ - Follow mcpbr best practices for optimal results
+
+ ### Troubleshooting
+
+ **Plugin not detected in cloned repository**:
+ - Ensure you're in the repository root directory
+ - Verify the `claude-code.json` file exists in the repo
+ - Try restarting Claude Code
+
+ **Skills not appearing**:
+ - Check Claude Code version (requires v2.0+)
+ - Verify plugin is listed in Settings > Plugins
+ - Try running `/reload-plugins` in Claude Code
+
+ **Commands failing**:
+ - Ensure mcpbr is installed: `pip install mcpbr`
+ - Verify Docker is running: `docker info`
+ - Check API key is set: `echo $ANTHROPIC_API_KEY`
+
+ For more help, see the [troubleshooting guide](https://greynewell.github.io/mcpbr/troubleshooting/) or [open an issue](https://github.com/greynewell/mcpbr/issues).
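The three checks above can be combined into one preflight sketch, built only from commands already mentioned in this section (the OK/WARN wording and the `warnings` counter are our additions):

```shell
# Preflight checks before running mcpbr, assembled from the
# troubleshooting list above. Prints one warning per missing
# prerequisite instead of failing outright.
warnings=0

if command -v mcpbr >/dev/null 2>&1; then
  echo "OK: mcpbr is installed"
else
  echo "WARN: mcpbr not found; install with: pip install mcpbr"
  warnings=$((warnings + 1))
fi

if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
  echo "OK: Docker is running"
else
  echo "WARN: Docker is not running; check with: docker info"
  warnings=$((warnings + 1))
fi

if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
  echo "OK: ANTHROPIC_API_KEY is set"
else
  echo "WARN: ANTHROPIC_API_KEY is not set"
  warnings=$((warnings + 1))
fi

echo "Preflight finished with $warnings warning(s)"
```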
 
  ## Configuration
 
@@ -506,7 +597,7 @@ Run SWE-bench evaluation with the configured MCP server.
 
  | Option | Short | Description |
  |--------|-------|-------------|
- | `--config PATH` | `-c` | Path to YAML configuration file (required) |
+ | `--config PATH` | `-c` | Path to YAML configuration file (default: `mcpbr.yaml`, auto-created if missing) |
  | `--model TEXT` | `-m` | Override model from config |
  | `--benchmark TEXT` | `-b` | Override benchmark from config (`swe-bench`, `cybergym`, or `mcptoolbench`) |
  | `--level INTEGER` | | Override CyberGym difficulty level (0-3) |
@@ -519,7 +610,8 @@ Run SWE-bench evaluation with the configured MCP server.
  | `--output-junit PATH` | | Path to save JUnit XML report (for CI/CD integration) |
  | `--verbose` | `-v` | Verbose output (`-v` summary, `-vv` detailed) |
  | `--log-file PATH` | `-l` | Path to write raw JSON log output (single file) |
- | `--log-dir PATH` | | Directory to write per-instance JSON log files |
+ | `--log-dir PATH` | | Directory to write per-instance JSON log files (default: `output_dir/logs/`) |
+ | `--disable-logs` | | Disable detailed execution logs (overrides default and config) |
  | `--task TEXT` | `-t` | Run specific task(s) by instance_id (repeatable) |
  | `--prompt TEXT` | | Override agent prompt (use `{problem_statement}` placeholder) |
  | `--baseline-results PATH` | | Path to baseline results JSON for regression detection |
@@ -773,6 +865,38 @@ Results saved to results.json
  }
  ```
 
+ ### Output Directory Structure
+
+ By default, mcpbr consolidates all outputs into a single timestamped directory:
+
+ ```text
+ .mcpbr_run_20260126_133000/
+ ├── config.yaml             # Copy of configuration used
+ ├── evaluation_state.json   # Task results and state
+ ├── logs/                   # Detailed MCP server logs
+ │   ├── task_1_mcp.log
+ │   ├── task_2_mcp.log
+ │   └── ...
+ └── README.txt              # Auto-generated explanation
+ ```
+
+ This makes it easy to:
+ - **Archive results**: `tar -czf results.tar.gz .mcpbr_run_*`
+ - **Clean up**: `rm -rf .mcpbr_run_*`
+ - **Share**: Just zip one directory
+
+ You can customize the output directory:
+
+ ```bash
+ # Custom output directory
+ mcpbr run -c config.yaml --output-dir ./my-results
+
+ # Or in config.yaml
+ output_dir: "./my-results"
+ ```
+
+ **Note:** The `--output-dir` CLI flag takes precedence over the `output_dir` config setting, so the auto-generated README.txt in the output directory reflects the effective configuration after all CLI overrides are applied.
+
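Because each run gets its own timestamped directory, the most recent one can be located with a small shell sketch (the `.mcpbr_run_` prefix follows the listing above; the helper itself is our addition):

```shell
# Find the most recent timestamped run directory, if any.
# Naming follows the .mcpbr_run_YYYYMMDD_HHMMSS pattern shown above.
latest=$(ls -dt .mcpbr_run_* 2>/dev/null | head -n 1)
if [ -n "$latest" ]; then
  msg="Latest run: $latest"
else
  msg="No .mcpbr_run_* directories found"
fi
echo "$msg"
```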
  ### Markdown Report (`--report`)
 
  Generates a human-readable report with:
@@ -782,6 +906,17 @@ Generates a human-readable report with:
 
  ### Per-Instance Logs (`--log-dir`)
 
+ **Logging is enabled by default** to prevent data loss. Detailed execution traces are automatically saved to `output_dir/logs/` unless disabled.
+
+ To disable logging:
+ ```bash
+ # Via CLI flag
+ mcpbr run -c config.yaml --disable-logs
+
+ # Or in config file
+ disable_logs: true
+ ```
+
  Creates a directory with detailed JSON log files for each task run. Filenames include timestamps to prevent overwrites:
 
  ```text
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "mcpbr-cli",
- "version": "0.3.26",
+ "version": "0.4.0",
  "description": "Model Context Protocol Benchmark Runner - CLI tool for evaluating MCP servers",
  "keywords": [
  "mcpbr",