@greynewell/mcpbr 0.6.0 → 0.8.0

This diff shows the content changes between publicly released versions of the package, as published to one of the supported registries. It is provided for informational purposes only.

- package/README.md (+23 −23)
- package/package.json (+1 −1)

package/README.md (changed)
@@ -23,7 +23,7 @@ Benchmark your MCP server against real GitHub issues. One command, hard numbers.
 [](https://www.python.org/downloads/)
 [](https://github.com/greynewell/mcpbr/actions/workflows/ci.yml)
 [](https://opensource.org/licenses/MIT)
-[](https://mcpbr.org/)
 
 
 [](https://github.com/greynewell/mcpbr/labels/good%20first%20issue)
@@ -60,17 +60,17 @@ mcpbr supports 30+ benchmarks across 10 categories through a flexible abstractio
 
 | Category | Benchmarks |
 |----------|-----------|
-| **Software Engineering** | [SWE-bench](https://
-| **Code Generation** | [HumanEval](https://
-| **Math & Reasoning** | [GSM8K](https://
-| **Knowledge & QA** | [TruthfulQA](https://
-| **Tool Use & Agents** | [MCPToolBench++](https://
-| **ML Research** | [MLAgentBench](https://
-| **Code Understanding** | [RepoQA](https://
+| **Software Engineering** | [SWE-bench](https://mcpbr.org/benchmarks/swe-bench/) (Verified/Lite/Full), [APPS](https://mcpbr.org/benchmarks/apps/), [CodeContests](https://mcpbr.org/benchmarks/codecontests/), [BigCodeBench](https://mcpbr.org/benchmarks/bigcodebench/), [LeetCode](https://mcpbr.org/benchmarks/leetcode/), [CoderEval](https://mcpbr.org/benchmarks/codereval/), [Aider Polyglot](https://mcpbr.org/benchmarks/aider-polyglot/) |
+| **Code Generation** | [HumanEval](https://mcpbr.org/benchmarks/humaneval/), [MBPP](https://mcpbr.org/benchmarks/mbpp/) |
+| **Math & Reasoning** | [GSM8K](https://mcpbr.org/benchmarks/gsm8k/), [MATH](https://mcpbr.org/benchmarks/math/), [BigBench-Hard](https://mcpbr.org/benchmarks/bigbench-hard/) |
+| **Knowledge & QA** | [TruthfulQA](https://mcpbr.org/benchmarks/truthfulqa/), [HellaSwag](https://mcpbr.org/benchmarks/hellaswag/), [ARC](https://mcpbr.org/benchmarks/arc/), [GAIA](https://mcpbr.org/benchmarks/gaia/) |
+| **Tool Use & Agents** | [MCPToolBench++](https://mcpbr.org/benchmarks/mcptoolbench/), [ToolBench](https://mcpbr.org/benchmarks/toolbench/), [AgentBench](https://mcpbr.org/benchmarks/agentbench/), [WebArena](https://mcpbr.org/benchmarks/webarena/), [TerminalBench](https://mcpbr.org/benchmarks/terminalbench/), [InterCode](https://mcpbr.org/benchmarks/intercode/) |
+| **ML Research** | [MLAgentBench](https://mcpbr.org/benchmarks/mlagentbench/) |
+| **Code Understanding** | [RepoQA](https://mcpbr.org/benchmarks/repoqa/) |
 | **Multimodal** | MMMU |
 | **Long Context** | LongBench |
 | **Safety & Adversarial** | Adversarial (HarmBench) |
-| **Security** | [CyberGym](https://
+| **Security** | [CyberGym](https://mcpbr.org/benchmarks/cybergym/) |
 | **Custom** | User-defined benchmarks via YAML |
 
 ### Featured Benchmarks
@@ -96,7 +96,7 @@ mcpbr run -c config.yaml --benchmark cybergym --level 2
 mcpbr benchmarks
 ```
 
-See the **[benchmarks guide](https://
+See the **[benchmarks guide](https://mcpbr.org/benchmarks/)** for details on each benchmark and how to configure them.
 
 ## Overview
 
@@ -105,7 +105,7 @@ This harness runs two parallel evaluations for each task:
 1. **MCP Agent**: LLM with access to tools from your MCP server
 2. **Baseline Agent**: LLM without tools (single-shot generation)
 
-By comparing these, you can measure the effectiveness of your MCP server for different software engineering tasks. See the **[MCP integration guide](https://
+By comparing these, you can measure the effectiveness of your MCP server for different software engineering tasks. See the **[MCP integration guide](https://mcpbr.org/mcp-integration/)** for tips on testing your server.
 
 ## Regression Detection
 
@@ -188,7 +188,7 @@ This will exit with code 1 if the regression rate exceeds 10%, failing the CI jo
 
 ## Installation
 
-> **[Full installation guide](https://
+> **[Full installation guide](https://mcpbr.org/installation/)** with detailed setup instructions.
 
 <details>
 <summary>Prerequisites</summary>
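The hunk above sits in a README section whose surrounding context notes that mcpbr exits with code 1 when the regression rate exceeds 10%, failing the CI job. Purely as an illustrative sketch for readers of this diff, a CI gate built on that exit-code behavior might look like the following; the workflow layout is a generic GitHub Actions pattern, and the `mcpbr run -c config.yaml` invocation is taken from the hunk headers above (any other detail is an assumption, not mcpbr's documented interface):

```yaml
# Hypothetical GitHub Actions job; only the mcpbr invocation comes
# from the diff context -- everything else is illustrative.
name: mcpbr-regression-gate
on: [pull_request]
jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # A nonzero exit (regression rate > 10%) fails this step,
      # and therefore the whole job.
      - name: Run mcpbr benchmark
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: mcpbr run -c config.yaml
```

Because the gate relies only on the process exit code, no extra result-parsing step is needed in the workflow.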
@@ -550,11 +550,11 @@ When using Claude Code with the mcpbr plugin active, Claude will automatically:
 - Verify Docker is running: `docker info`
 - Check API key is set: `echo $ANTHROPIC_API_KEY`
 
-For more help, see the [troubleshooting guide](https://
+For more help, see the [troubleshooting guide](https://mcpbr.org/troubleshooting/) or [open an issue](https://github.com/greynewell/mcpbr/issues).
 
 ## Configuration
 
-> **[Full configuration reference](https://
+> **[Full configuration reference](https://mcpbr.org/configuration/)** with all options and examples.
 
 ### MCP Server Configuration
 
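The README section touched above (`### MCP Server Configuration`, and a later hunk's mention of a `{problem_statement}` placeholder for the SWE-bench issue text) suggests a YAML config along these lines. This is a hypothetical sketch only — every key name below is an assumption made for illustration, not mcpbr's actual schema; the real reference is at mcpbr.org/configuration/:

```yaml
# Hypothetical config.yaml sketch -- key names are invented for
# illustration; see mcpbr.org/configuration/ for the real schema.
mcp_server:
  command: "node"              # launches the MCP server under test
  args: ["./my-server.js"]
prompt: |
  Fix the following issue:
  {problem_statement}
```

The `{problem_statement}` placeholder is the one the README itself documents; it is substituted with each task's issue text at run time.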
@@ -669,7 +669,7 @@ Use `{problem_statement}` as a placeholder for the SWE-bench issue text. You can
 
 ## CLI Reference
 
-> **[Full CLI documentation](https://
+> **[Full CLI documentation](https://mcpbr.org/cli/)** with all commands and options.
 
 Get help for any command with `--help` or `-h`:
 
|
|
|
930
930
|
|
|
931
931
|
## Output
|
|
932
932
|
|
|
933
|
-
> **[Understanding evaluation results](https://
|
|
933
|
+
> **[Understanding evaluation results](https://mcpbr.org/evaluation-results/)** - detailed guide to interpreting output.
|
|
934
934
|
|
|
935
935
|
### Console Output
|
|
936
936
|
|
|
@@ -1213,7 +1213,7 @@ The JUnit XML format enables native test result visualization in your CI/CD dash
|
|
|
1213
1213
|
|
|
1214
1214
|
## How It Works
|
|
1215
1215
|
|
|
1216
|
-
> **[Architecture deep dive](https://
|
|
1216
|
+
> **[Architecture deep dive](https://mcpbr.org/architecture/)** - learn how mcpbr works internally.
|
|
1217
1217
|
|
|
1218
1218
|
1. **Load Tasks**: Fetches tasks from the selected benchmark (SWE-bench, CyberGym, or MCPToolBench++) via HuggingFace
|
|
1219
1219
|
2. **Create Environment**: For each task, creates an isolated Docker environment with the repository and dependencies
|
|
@@ -1274,7 +1274,7 @@ mcpbr/
 └── example.yaml # Example configuration
 ```
 
-The architecture uses Protocol-based abstractions for providers, harnesses, and **benchmarks**, making it easy to add support for additional LLM providers, agent backends, or software engineering benchmarks in the future. See the **[API reference](https://
+The architecture uses Protocol-based abstractions for providers, harnesses, and **benchmarks**, making it easy to add support for additional LLM providers, agent backends, or software engineering benchmarks in the future. See the **[API reference](https://mcpbr.org/api/)** and **[benchmarks guide](https://mcpbr.org/benchmarks/)** for more details.
 
 ### Execution Flow
 
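The changed line in the hunk above describes Protocol-based abstractions for providers, harnesses, and benchmarks. As a generic illustration of that Python pattern — the `Benchmark` name and both method signatures below are invented for this sketch and are not mcpbr's actual API — a structural interface lets new benchmarks plug in without inheriting from anything:

```python
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class Benchmark(Protocol):
    """Hypothetical benchmark interface; names are illustrative,
    not mcpbr's real API."""

    def load_tasks(self, limit: Optional[int] = None) -> list: ...

    def score(self, task: dict, output: str) -> bool: ...


class ToyBenchmark:
    """Satisfies the Protocol structurally -- no inheritance needed,
    which is what makes adding new benchmarks or providers cheap."""

    def load_tasks(self, limit=None):
        tasks = [{"id": "t1", "answer": "42"}]
        return tasks[:limit] if limit is not None else tasks

    def score(self, task, output):
        return task["answer"] in output


bench = ToyBenchmark()
print(isinstance(bench, Benchmark))  # True: structural, not nominal, check
print(bench.score(bench.load_tasks()[0], "the answer is 42"))  # True
```

The `@runtime_checkable` decorator is what allows the `isinstance` check at run time; without it, the Protocol is only usable by static type checkers.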
@@ -1313,9 +1313,9 @@ The architecture uses Protocol-based abstractions for providers, harnesses, and
 
 ## Troubleshooting
 
-> **[FAQ](https://
+> **[FAQ](https://mcpbr.org/FAQ/)** - Quick answers to common questions
 >
-> **[Full troubleshooting guide](https://
+> **[Full troubleshooting guide](https://mcpbr.org/troubleshooting/)** - Detailed solutions to common issues
 
 ### Docker Issues
 
@@ -1462,11 +1462,11 @@ We welcome contributions! Check out our **30+ good first issues** perfect for ne
 - **Platform**: Homebrew formula, Conda package
 - **Documentation**: Best practices, examples, guides
 
-See the [contributing guide](https://
+See the [contributing guide](https://mcpbr.org/contributing/) to get started!
 
 ## Best Practices
 
-New to mcpbr or want to optimize your workflow? Check out the **[Best Practices Guide](https://
+New to mcpbr or want to optimize your workflow? Check out the **[Best Practices Guide](https://mcpbr.org/best-practices/)** for:
 
 - Benchmark selection guidelines
 - MCP server configuration tips
@@ -1478,7 +1478,7 @@ New to mcpbr or want to optimize your workflow? Check out the **[Best Practices
 
 ## Contributing
 
-Please see [CONTRIBUTING.md](CONTRIBUTING.md) or the **[contributing guide](https://
+Please see [CONTRIBUTING.md](CONTRIBUTING.md) or the **[contributing guide](https://mcpbr.org/contributing/)** for guidelines on how to contribute.
 
 All contributors are expected to follow our [Community Guidelines](CODE_OF_CONDUCT.md).
 