@greynewell/mcpbr 0.6.0 → 0.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (2)
  1. package/README.md +23 -23
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -23,7 +23,7 @@ Benchmark your MCP server against real GitHub issues. One command, hard numbers.
 [![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
 [![CI](https://github.com/greynewell/mcpbr/actions/workflows/ci.yml/badge.svg)](https://github.com/greynewell/mcpbr/actions/workflows/ci.yml)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
-[![Documentation](https://img.shields.io/badge/docs-greynewell.github.io%2Fmcpbr-blue)](https://greynewell.github.io/mcpbr/)
+[![Documentation](https://img.shields.io/badge/docs-mcpbr.org-blue)](https://mcpbr.org/)
 ![CodeRabbit Pull Request Reviews](https://img.shields.io/coderabbit/prs/github/greynewell/mcpbr?utm_source=oss&utm_medium=github&utm_campaign=greynewell%2Fmcpbr&labelColor=171717&color=FF570A&link=https%3A%2F%2Fcoderabbit.ai&label=CodeRabbit+Reviews)
 
 [![good first issues](https://img.shields.io/github/issues/greynewell/mcpbr/good%20first%20issue?label=good%20first%20issues&color=7057ff)](https://github.com/greynewell/mcpbr/labels/good%20first%20issue)
@@ -60,17 +60,17 @@ mcpbr supports 30+ benchmarks across 10 categories through a flexible abstractio
 
 | Category | Benchmarks |
 |----------|-----------|
-| **Software Engineering** | [SWE-bench](https://greynewell.github.io/mcpbr/benchmarks/swe-bench/) (Verified/Lite/Full), [APPS](https://greynewell.github.io/mcpbr/benchmarks/apps/), [CodeContests](https://greynewell.github.io/mcpbr/benchmarks/codecontests/), [BigCodeBench](https://greynewell.github.io/mcpbr/benchmarks/bigcodebench/), [LeetCode](https://greynewell.github.io/mcpbr/benchmarks/leetcode/), [CoderEval](https://greynewell.github.io/mcpbr/benchmarks/codereval/), [Aider Polyglot](https://greynewell.github.io/mcpbr/benchmarks/aider-polyglot/) |
-| **Code Generation** | [HumanEval](https://greynewell.github.io/mcpbr/benchmarks/humaneval/), [MBPP](https://greynewell.github.io/mcpbr/benchmarks/mbpp/) |
-| **Math & Reasoning** | [GSM8K](https://greynewell.github.io/mcpbr/benchmarks/gsm8k/), [MATH](https://greynewell.github.io/mcpbr/benchmarks/math/), [BigBench-Hard](https://greynewell.github.io/mcpbr/benchmarks/bigbench-hard/) |
-| **Knowledge & QA** | [TruthfulQA](https://greynewell.github.io/mcpbr/benchmarks/truthfulqa/), [HellaSwag](https://greynewell.github.io/mcpbr/benchmarks/hellaswag/), [ARC](https://greynewell.github.io/mcpbr/benchmarks/arc/), [GAIA](https://greynewell.github.io/mcpbr/benchmarks/gaia/) |
-| **Tool Use & Agents** | [MCPToolBench++](https://greynewell.github.io/mcpbr/benchmarks/mcptoolbench/), [ToolBench](https://greynewell.github.io/mcpbr/benchmarks/toolbench/), [AgentBench](https://greynewell.github.io/mcpbr/benchmarks/agentbench/), [WebArena](https://greynewell.github.io/mcpbr/benchmarks/webarena/), [TerminalBench](https://greynewell.github.io/mcpbr/benchmarks/terminalbench/), [InterCode](https://greynewell.github.io/mcpbr/benchmarks/intercode/) |
-| **ML Research** | [MLAgentBench](https://greynewell.github.io/mcpbr/benchmarks/mlagentbench/) |
-| **Code Understanding** | [RepoQA](https://greynewell.github.io/mcpbr/benchmarks/repoqa/) |
+| **Software Engineering** | [SWE-bench](https://mcpbr.org/benchmarks/swe-bench/) (Verified/Lite/Full), [APPS](https://mcpbr.org/benchmarks/apps/), [CodeContests](https://mcpbr.org/benchmarks/codecontests/), [BigCodeBench](https://mcpbr.org/benchmarks/bigcodebench/), [LeetCode](https://mcpbr.org/benchmarks/leetcode/), [CoderEval](https://mcpbr.org/benchmarks/codereval/), [Aider Polyglot](https://mcpbr.org/benchmarks/aider-polyglot/) |
+| **Code Generation** | [HumanEval](https://mcpbr.org/benchmarks/humaneval/), [MBPP](https://mcpbr.org/benchmarks/mbpp/) |
+| **Math & Reasoning** | [GSM8K](https://mcpbr.org/benchmarks/gsm8k/), [MATH](https://mcpbr.org/benchmarks/math/), [BigBench-Hard](https://mcpbr.org/benchmarks/bigbench-hard/) |
+| **Knowledge & QA** | [TruthfulQA](https://mcpbr.org/benchmarks/truthfulqa/), [HellaSwag](https://mcpbr.org/benchmarks/hellaswag/), [ARC](https://mcpbr.org/benchmarks/arc/), [GAIA](https://mcpbr.org/benchmarks/gaia/) |
+| **Tool Use & Agents** | [MCPToolBench++](https://mcpbr.org/benchmarks/mcptoolbench/), [ToolBench](https://mcpbr.org/benchmarks/toolbench/), [AgentBench](https://mcpbr.org/benchmarks/agentbench/), [WebArena](https://mcpbr.org/benchmarks/webarena/), [TerminalBench](https://mcpbr.org/benchmarks/terminalbench/), [InterCode](https://mcpbr.org/benchmarks/intercode/) |
+| **ML Research** | [MLAgentBench](https://mcpbr.org/benchmarks/mlagentbench/) |
+| **Code Understanding** | [RepoQA](https://mcpbr.org/benchmarks/repoqa/) |
 | **Multimodal** | MMMU |
 | **Long Context** | LongBench |
 | **Safety & Adversarial** | Adversarial (HarmBench) |
-| **Security** | [CyberGym](https://greynewell.github.io/mcpbr/benchmarks/cybergym/) |
+| **Security** | [CyberGym](https://mcpbr.org/benchmarks/cybergym/) |
 | **Custom** | User-defined benchmarks via YAML |
 
 ### Featured Benchmarks
@@ -96,7 +96,7 @@ mcpbr run -c config.yaml --benchmark cybergym --level 2
 mcpbr benchmarks
 ```
 
-See the **[benchmarks guide](https://greynewell.github.io/mcpbr/benchmarks/)** for details on each benchmark and how to configure them.
+See the **[benchmarks guide](https://mcpbr.org/benchmarks/)** for details on each benchmark and how to configure them.
 
 ## Overview
 
@@ -105,7 +105,7 @@ This harness runs two parallel evaluations for each task:
 1. **MCP Agent**: LLM with access to tools from your MCP server
 2. **Baseline Agent**: LLM without tools (single-shot generation)
 
-By comparing these, you can measure the effectiveness of your MCP server for different software engineering tasks. See the **[MCP integration guide](https://greynewell.github.io/mcpbr/mcp-integration/)** for tips on testing your server.
+By comparing these, you can measure the effectiveness of your MCP server for different software engineering tasks. See the **[MCP integration guide](https://mcpbr.org/mcp-integration/)** for tips on testing your server.
 
 ## Regression Detection
 
@@ -188,7 +188,7 @@ This will exit with code 1 if the regression rate exceeds 10%, failing the CI jo
 
 ## Installation
 
-> **[Full installation guide](https://greynewell.github.io/mcpbr/installation/)** with detailed setup instructions.
+> **[Full installation guide](https://mcpbr.org/installation/)** with detailed setup instructions.
 
 <details>
 <summary>Prerequisites</summary>
@@ -550,11 +550,11 @@ When using Claude Code with the mcpbr plugin active, Claude will automatically:
 - Verify Docker is running: `docker info`
 - Check API key is set: `echo $ANTHROPIC_API_KEY`
 
-For more help, see the [troubleshooting guide](https://greynewell.github.io/mcpbr/troubleshooting/) or [open an issue](https://github.com/greynewell/mcpbr/issues).
+For more help, see the [troubleshooting guide](https://mcpbr.org/troubleshooting/) or [open an issue](https://github.com/greynewell/mcpbr/issues).
 
 ## Configuration
 
-> **[Full configuration reference](https://greynewell.github.io/mcpbr/configuration/)** with all options and examples.
+> **[Full configuration reference](https://mcpbr.org/configuration/)** with all options and examples.
 
 ### MCP Server Configuration
 
@@ -669,7 +669,7 @@ Use `{problem_statement}` as a placeholder for the SWE-bench issue text. You can
 
 ## CLI Reference
 
-> **[Full CLI documentation](https://greynewell.github.io/mcpbr/cli/)** with all commands and options.
+> **[Full CLI documentation](https://mcpbr.org/cli/)** with all commands and options.
 
 Get help for any command with `--help` or `-h`:
 
@@ -930,7 +930,7 @@ Provider: anthropic, Harness: claude-code
 
 ## Output
 
-> **[Understanding evaluation results](https://greynewell.github.io/mcpbr/evaluation-results/)** - detailed guide to interpreting output.
+> **[Understanding evaluation results](https://mcpbr.org/evaluation-results/)** - detailed guide to interpreting output.
 
 ### Console Output
 
@@ -1213,7 +1213,7 @@ The JUnit XML format enables native test result visualization in your CI/CD dash
 
 ## How It Works
 
-> **[Architecture deep dive](https://greynewell.github.io/mcpbr/architecture/)** - learn how mcpbr works internally.
+> **[Architecture deep dive](https://mcpbr.org/architecture/)** - learn how mcpbr works internally.
 
 1. **Load Tasks**: Fetches tasks from the selected benchmark (SWE-bench, CyberGym, or MCPToolBench++) via HuggingFace
 2. **Create Environment**: For each task, creates an isolated Docker environment with the repository and dependencies
@@ -1274,7 +1274,7 @@ mcpbr/
 └── example.yaml # Example configuration
 ```
 
-The architecture uses Protocol-based abstractions for providers, harnesses, and **benchmarks**, making it easy to add support for additional LLM providers, agent backends, or software engineering benchmarks in the future. See the **[API reference](https://greynewell.github.io/mcpbr/api/)** and **[benchmarks guide](https://greynewell.github.io/mcpbr/benchmarks/)** for more details.
+The architecture uses Protocol-based abstractions for providers, harnesses, and **benchmarks**, making it easy to add support for additional LLM providers, agent backends, or software engineering benchmarks in the future. See the **[API reference](https://mcpbr.org/api/)** and **[benchmarks guide](https://mcpbr.org/benchmarks/)** for more details.
 
 ### Execution Flow
 
@@ -1313,9 +1313,9 @@ The architecture uses Protocol-based abstractions for providers, harnesses, and
 
 ## Troubleshooting
 
-> **[FAQ](https://greynewell.github.io/mcpbr/FAQ/)** - Quick answers to common questions
+> **[FAQ](https://mcpbr.org/FAQ/)** - Quick answers to common questions
 >
-> **[Full troubleshooting guide](https://greynewell.github.io/mcpbr/troubleshooting/)** - Detailed solutions to common issues
+> **[Full troubleshooting guide](https://mcpbr.org/troubleshooting/)** - Detailed solutions to common issues
 
 ### Docker Issues
 
@@ -1462,11 +1462,11 @@ We welcome contributions! Check out our **30+ good first issues** perfect for ne
 - **Platform**: Homebrew formula, Conda package
 - **Documentation**: Best practices, examples, guides
 
-See the [contributing guide](https://greynewell.github.io/mcpbr/contributing/) to get started!
+See the [contributing guide](https://mcpbr.org/contributing/) to get started!
 
 ## Best Practices
 
-New to mcpbr or want to optimize your workflow? Check out the **[Best Practices Guide](https://greynewell.github.io/mcpbr/best-practices/)** for:
+New to mcpbr or want to optimize your workflow? Check out the **[Best Practices Guide](https://mcpbr.org/best-practices/)** for:
 
 - Benchmark selection guidelines
 - MCP server configuration tips
@@ -1478,7 +1478,7 @@ New to mcpbr or want to optimize your workflow? Check out the **[Best Practices
 
 ## Contributing
 
-Please see [CONTRIBUTING.md](CONTRIBUTING.md) or the **[contributing guide](https://greynewell.github.io/mcpbr/contributing/)** for guidelines on how to contribute.
+Please see [CONTRIBUTING.md](CONTRIBUTING.md) or the **[contributing guide](https://mcpbr.org/contributing/)** for guidelines on how to contribute.
 
 All contributors are expected to follow our [Community Guidelines](CODE_OF_CONDUCT.md).
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@greynewell/mcpbr",
-  "version": "0.6.0",
+  "version": "0.8.0",
   "description": "Model Context Protocol Benchmark Runner - CLI tool for evaluating MCP servers",
   "keywords": [
     "mcpbr",
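A diff like the one above can also be generated locally with npm's built-in `npm diff` command (a sketch, assuming npm >= 7.5 and network access to the public registry; the package spec and versions are taken from this page's header):

```shell
# Compare the two published versions of the package.
# Requires npm >= 7.5 and access to the npm registry.
npm diff --diff=@greynewell/mcpbr@0.6.0 --diff=@greynewell/mcpbr@0.8.0
```

If more or less surrounding context is wanted per hunk, `npm diff` accepts `--diff-unified=<n>` to control the number of context lines.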