@greynewell/mcpbr-claude-plugin 0.3.18
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +127 -0
- package/package.json +38 -0
- package/plugin.json +6 -0
- package/skills/README.md +183 -0
- package/skills/benchmark-swe-lite/SKILL.md +102 -0
- package/skills/mcpbr-config/SKILL.md +204 -0
- package/skills/mcpbr-eval/SKILL.md +123 -0
package/README.md
ADDED
@@ -0,0 +1,127 @@
# @greynewell/mcpbr-claude-plugin

> Claude Code plugin for mcpbr - Makes Claude an expert at running MCP benchmarks

[](https://www.npmjs.com/package/@greynewell/mcpbr-claude-plugin)
[](https://opensource.org/licenses/MIT)

This Claude Code plugin provides specialized skills that make Claude an expert at using [mcpbr](https://github.com/greynewell/mcpbr) for MCP server benchmarking.

## What is mcpbr?

**Model Context Protocol Benchmark Runner** - Benchmark your MCP server against real GitHub issues. Get hard numbers comparing tool-assisted vs. baseline agent performance.

## Installation

### Option 1: Clone Repository (Automatic Detection)

```bash
git clone https://github.com/greynewell/mcpbr.git
cd mcpbr
# Claude Code automatically detects the plugin
```

### Option 2: Install via npm

```bash
npm install -g @greynewell/mcpbr-claude-plugin
```

### Option 3: Claude Code Plugin Manager

```bash
# In Claude Code, run:
/plugin install github:greynewell/mcpbr
```

## What You Get

When using Claude Code in a project with this plugin, Claude automatically gains expertise in:

### 1. run-benchmark (`mcpbr-eval`)

Expert at running evaluations with proper validation:

- Verifies Docker is running
- Checks for API keys
- Validates configuration files
- Uses correct CLI flags
- Provides actionable error messages

### 2. generate-config (`mcpbr-config`)

Generates valid mcpbr configuration files:

- Ensures `{workdir}` placeholder is included
- Validates MCP server commands
- Provides benchmark-specific templates
- Applies best practices

### 3. swe-bench-lite (`benchmark-swe-lite`)

Quick-start command for demonstrations:

- Pre-configured for 5-task evaluation
- Sensible defaults for output files
- Perfect for testing and demos
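
For example, the quick-start skill boils down to a single command, taken from its documented defaults:

```bash
mcpbr run -c mcpbr.yaml --dataset SWE-bench/SWE-bench_Lite -n 5 -v -o results.json -r report.md
```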

## Example Interactions

Simply ask Claude in natural language:

- "Run the SWE-bench Lite benchmark"
- "Generate a config for my MCP server"
- "Run a quick test with 1 task"

Claude will automatically:

- Verify prerequisites before starting
- Generate valid configurations
- Use correct CLI commands
- Handle errors gracefully

## How It Works

The plugin includes:

- **plugin.json** - Claude Code plugin manifest
- **skills/** - Three specialized SKILL.md files

When Claude Code loads this plugin, it injects the skill instructions into Claude's context, making it an expert at mcpbr without you having to explain anything.
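
The manifest itself is only a few lines. As a rough sketch (field names and values are illustrative of a typical Claude Code plugin manifest, not copied from this package; the shipped `plugin.json` is the source of truth):

```json
{
  "name": "mcpbr",
  "description": "Claude Code plugin for mcpbr - expert MCP benchmark runner",
  "version": "0.3.18",
  "author": "mcpbr Contributors"
}
```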

## Development

This package is part of the mcpbr project:

```bash
# Clone the repository
git clone https://github.com/greynewell/mcpbr.git
cd mcpbr

# Test the plugin
pytest tests/test_claude_plugin.py -v
```

## Related Packages

- [`mcpbr`](https://pypi.org/project/mcpbr/) - Python package (core implementation)
- [`@greynewell/mcpbr`](https://www.npmjs.com/package/@greynewell/mcpbr) - npm CLI wrapper

## Documentation

For full documentation, visit: <https://greynewell.github.io/mcpbr/>

- [Claude Code Plugin Guide](https://greynewell.github.io/mcpbr/claude-code-plugin/)
- [Installation Guide](https://greynewell.github.io/mcpbr/installation/)
- [Configuration Reference](https://greynewell.github.io/mcpbr/configuration/)
- [CLI Reference](https://greynewell.github.io/mcpbr/cli/)

## License

MIT - see [LICENSE](../LICENSE) for details.

## Contributing

Contributions welcome! See [CONTRIBUTING.md](../CONTRIBUTING.md) for guidelines.

## Support

- **Documentation**: <https://greynewell.github.io/mcpbr/>
- **Issues**: <https://github.com/greynewell/mcpbr/issues>
- **Discussions**: <https://github.com/greynewell/mcpbr/discussions>

---

Built by [Grey Newell](https://greynewell.com)
package/package.json
ADDED
@@ -0,0 +1,38 @@
{
  "name": "@greynewell/mcpbr-claude-plugin",
  "version": "0.3.18",
  "description": "Claude Code plugin for mcpbr - Expert benchmark runner for MCP servers with specialized skills",
  "keywords": [
    "claude-code",
    "claude-plugin",
    "mcpbr",
    "mcp",
    "benchmark",
    "model-context-protocol",
    "swe-bench",
    "llm",
    "agents",
    "evaluation"
  ],
  "homepage": "https://github.com/greynewell/mcpbr",
  "repository": {
    "type": "git",
    "url": "https://github.com/greynewell/mcpbr.git",
    "directory": ".claude-plugin"
  },
  "bugs": {
    "url": "https://github.com/greynewell/mcpbr/issues"
  },
  "license": "MIT",
  "author": "mcpbr Contributors",
  "files": [
    "plugin.json",
    "skills/"
  ],
  "scripts": {
    "test": "pytest tests/test_claude_plugin.py -v"
  },
  "engines": {
    "node": ">=18.0.0"
  }
}
package/plugin.json
ADDED
package/skills/README.md
ADDED
@@ -0,0 +1,183 @@
# mcpbr Claude Code Skills

This directory contains specialized skills that make Claude Code an expert at using mcpbr.

## What are Skills?

Skills are instruction sets that teach Claude Code how to perform specific tasks correctly. When Claude works in this repository, it automatically detects these skills and gains domain expertise about mcpbr.

## Available Skills

### 1. `mcpbr-eval` - Run Benchmark

**Purpose:** Expert at running evaluations with proper validation

**Key Features:**

- Validates prerequisites (Docker, API keys, config files)
- Checks for common mistakes before running
- Supports all benchmarks (SWE-bench, CyberGym, MCPToolBench++)
- Provides actionable troubleshooting

**When to use:** Anytime you want to run an evaluation with mcpbr

**Example prompts:**

- "Run the SWE-bench benchmark with 10 tasks"
- "Evaluate my MCP server on CyberGym level 2"
- "Run a quick test with 1 task"

### 2. `mcpbr-config` - Generate Config

**Purpose:** Generates valid mcpbr configuration files

**Key Features:**

- Ensures critical `{workdir}` placeholder is included
- Validates MCP server commands
- Provides templates for common MCP servers
- Supports all benchmark types

**When to use:** When creating or modifying mcpbr configuration files

**Example prompts:**

- "Generate a config for my Python MCP server"
- "Create a config using the filesystem server"
- "Help me configure my custom MCP server"

### 3. `benchmark-swe-lite` - Quick Start

**Purpose:** Streamlined command for SWE-bench Lite evaluation

**Key Features:**

- Pre-configured for 5-task evaluation
- Sensible defaults for quick testing
- Includes runtime/cost estimates
- Perfect for demos and testing

**When to use:** For quick validation or demonstrations

**Example prompts:**

- "Run a quick SWE-bench Lite test"
- "Show me how mcpbr works"
- "Do a fast evaluation"

## How Skills Work

When you clone this repository and work with Claude Code:

1. Claude Code detects the `.claude-plugin/plugin.json` manifest
2. It loads all skills from the `skills/` directory
3. Each skill provides specialized knowledge about mcpbr commands
4. Claude automatically follows best practices without being told

## Skill Structure

Each skill is a directory containing a `SKILL.md` file:

```text
skills/
├── mcpbr-eval/
│   └── SKILL.md      # Instructions for running evaluations
├── mcpbr-config/
│   └── SKILL.md      # Instructions for config generation
└── benchmark-swe-lite/
    └── SKILL.md      # Quick-start instructions
```

Each `SKILL.md` contains:

1. **Frontmatter** - Metadata (name, description)
2. **Instructions** - Main skill content
3. **Examples** - Usage examples
4. **Constraints** - Critical requirements
5. **Troubleshooting** - Common issues and solutions
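
Put together, a skeleton `SKILL.md` following this structure might look like the following (placeholder content; the section headings after the frontmatter are conventions from the list above, not an enforced schema):

```markdown
---
name: example-skill
description: One-line summary of what this skill does
---

# Instructions
[Main skill content]

## Examples
[Usage examples]

## Constraints
[Critical requirements, e.g. "Docker must be running"]

## Troubleshooting
[Common issues and solutions]
```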

## Benefits

### Without Skills

```text
User: "Run the benchmark"
Claude: *tries `mcpbr run` without config, fails*
Claude: *forgets to check Docker, fails*
Claude: *uses wrong flags, fails*
```

### With Skills

```text
User: "Run the benchmark"
Claude: *checks Docker is running*
Claude: *verifies config exists*
Claude: *uses correct flags*
Claude: *evaluation succeeds*
```

## Testing

Skills are validated by comprehensive tests in `tests/test_claude_plugin.py`:

- Validates plugin manifest structure
- Checks skill file format and content
- Ensures critical keywords are present (Docker, {workdir}, etc.)
- Verifies documentation mentions all commands

Run skill tests:

```bash
uv run pytest tests/test_claude_plugin.py -v
```

## Adding New Skills

To add a new skill:

1. Create a new directory: `skills/my-skill/`
2. Create `SKILL.md` with frontmatter:

   ```markdown
   ---
   name: my-skill
   description: Brief description of what this skill does
   ---

   # Instructions
   [Your skill content here]
   ```

3. Add tests in `tests/test_claude_plugin.py` (a sketch of such a test follows this list)
4. Update this README
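
As a sketch of what such a test might look like (the path arithmetic and assertions here are illustrative assumptions, not the project's actual tests; `tests/test_claude_plugin.py` is the reference):

```python
# Hypothetical skill test sketch; file layout and names are assumptions.
from pathlib import Path

# Assumes tests/ sits next to the skills/ directory in the repo.
SKILLS_DIR = Path(__file__).resolve().parent.parent / "skills"


def test_my_skill_has_frontmatter():
    text = (SKILLS_DIR / "my-skill" / "SKILL.md").read_text()
    # SKILL.md must open with a YAML frontmatter block.
    assert text.startswith("---")
    frontmatter = text.split("---")[1]
    assert "name: my-skill" in frontmatter
    assert "description:" in frontmatter
```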

## Version Management

The plugin version in `.claude-plugin/plugin.json` is automatically synced with `pyproject.toml`:

```bash
# Sync versions manually
make sync-version
```

The sync also runs automatically during `make build`, in pre-commit hooks, and in CI/CD checks.

## Learn More

- **Plugin Manifest**: `.claude-plugin/plugin.json`
- **Tests**: `tests/test_claude_plugin.py`
- **Documentation**: Main README.md, CONTRIBUTING.md
- **Issue**: [#262](https://github.com/greynewell/mcpbr/issues/262)

## Contributing

When modifying skills:

1. Update the relevant `SKILL.md` file
2. Run tests: `uv run pytest tests/test_claude_plugin.py`
3. Run pre-commit: `pre-commit run --all-files`
4. Submit a PR with your changes

Skills should:

- Be clear and concise
- Include examples
- Emphasize critical requirements
- Provide troubleshooting guidance
- Reference actual mcpbr commands

## Questions?

Open an issue or check the main project documentation.
package/skills/benchmark-swe-lite/SKILL.md
ADDED
@@ -0,0 +1,102 @@
---
name: swe-bench-lite
description: Quick-start command to run SWE-bench Lite evaluation with sensible defaults.
---

# Instructions

This skill provides a streamlined way to run the SWE-bench Lite benchmark with pre-configured defaults.

## What This Skill Does

This skill runs a quick SWE-bench Lite evaluation with:

- 5 sample tasks (configurable)
- Verbose output for visibility
- Results saved to `results.json`
- Report saved to `report.md`

## Prerequisites Check

Before running, verify:

1. **Docker is running:**

   ```bash
   docker ps
   ```

2. **API key is set:**

   ```bash
   echo $ANTHROPIC_API_KEY
   ```

3. **Config file exists:**
   - Check for `mcpbr.yaml` in the current directory
   - If missing, run `mcpbr init` to generate it

## Default Command

The default command for SWE-bench Lite:

```bash
mcpbr run -c mcpbr.yaml --dataset SWE-bench/SWE-bench_Lite -n 5 -v -o results.json -r report.md
```

## Customization Options

Users can customize the run by modifying:

- **Sample size:** Change `-n 5` to any number (or remove for full dataset)
- **Config file:** Change `-c mcpbr.yaml` to point to a different config
- **Verbosity:** Use `-vv` for very verbose output
- **Output files:** Change `results.json` and `report.md` to different paths

## Example Variations

### Minimal quick test (1 task)

```bash
mcpbr run -c mcpbr.yaml -n 1 -v
```

### Full evaluation (all ~300 tasks)

```bash
mcpbr run -c mcpbr.yaml --dataset SWE-bench/SWE-bench_Lite -v -o results.json
```

### MCP-only (skip baseline)

```bash
mcpbr run -c mcpbr.yaml -n 5 -M -v -o results.json
```

### Specific tasks

```bash
mcpbr run -c mcpbr.yaml -t astropy__astropy-12907 -t django__django-11099 -v
```

## Expected Runtime & Cost

For 5 tasks with default settings:

- **Runtime:** 15-30 minutes (depends on task complexity)
- **Cost:** $2-5 (depends on task complexity and model used)

## What to Do If It Fails

1. **Docker not running:** Start Docker Desktop
2. **API key missing:** Set with `export ANTHROPIC_API_KEY="sk-ant-..."`
3. **Config missing:** Run `mcpbr init` to generate default config
4. **Config invalid:** Check that `{workdir}` placeholder is in the `args` array
5. **MCP server fails:** Test the server command independently

## After the Run

Once complete, you'll have:

- **results.json:** Full evaluation data with metrics, token usage, and per-task results
- **report.md:** Human-readable summary with resolution rates and comparisons
- **Console output:** Real-time progress and summary table

Review the results to see how your MCP server performed compared to the baseline!
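
To take a first look at the output without assuming anything about the JSON schema, standard tools are enough (a sketch; adjust the paths if you changed the output flags):

```bash
# Pretty-print the raw results (schema not assumed)
python -m json.tool results.json | head -50

# Read the human-readable report in the terminal
cat report.md
```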

## Pro Tips

- Start with `-n 1` to verify everything works before running larger evaluations
- Use `--log-dir logs/` to save detailed per-task logs for debugging
- Compare multiple runs by changing the MCP server config between runs
- Use `--baseline-results baseline.json` to detect regressions between versions
package/skills/mcpbr-config/SKILL.md
ADDED
@@ -0,0 +1,204 @@
---
name: generate-config
description: Generate and validate mcpbr configuration files for MCP server benchmarking.
---

# Instructions

You are an expert at creating valid `mcpbr` configuration files. Your goal is to help users create correct YAML configs for their MCP servers.

## Critical Requirements

1. **Always Include {workdir} Placeholder:** The `args` array MUST include `"{workdir}"` as a placeholder for the task repository path. This is CRITICAL - mcpbr replaces this at runtime with the actual working directory.

2. **Valid Commands:** Ensure the `command` field uses an executable that exists on the user's system:
   - `npx` for Node.js-based MCP servers
   - `uvx` for Python MCP servers via uv
   - `python` or `python3` for direct Python execution
   - Custom binaries (verify they exist with `which <command>`)

3. **Model Aliases:** Use short aliases when possible:
   - `sonnet` instead of `claude-sonnet-4-5-20250929`
   - `opus` instead of `claude-opus-4-5-20251101`
   - `haiku` instead of `claude-haiku-4-5-20251001`

4. **Required Fields:** Every config MUST have:
   - `mcp_server.command`
   - `mcp_server.args` (with `"{workdir}"`)
   - `provider` (usually `"anthropic"`)
   - `agent_harness` (usually `"claude-code"`)
   - `model`
   - `dataset` (or rely on benchmark default)

## Common MCP Server Configurations

### Anthropic Filesystem Server

```yaml
mcp_server:
  name: "filesystem"
  command: "npx"
  args:
    - "-y"
    - "@modelcontextprotocol/server-filesystem"
    - "{workdir}"
  env: {}
```

### Custom Python MCP Server

```yaml
mcp_server:
  name: "my-server"
  command: "uvx"
  args:
    - "my-mcp-server"
    - "--workspace"
    - "{workdir}"
  env:
    LOG_LEVEL: "debug"
```

### Supermodel Codebase Analysis

```yaml
mcp_server:
  name: "supermodel"
  command: "npx"
  args:
    - "-y"
    - "@supermodeltools/mcp-server"
  env:
    SUPERMODEL_API_KEY: "${SUPERMODEL_API_KEY}"
```

## Configuration Template

When generating a new config, use this template:

```yaml
mcp_server:
  name: "<server-name>"
  command: "<executable>"
  args:
    - "<arg1>"
    - "<arg2>"
    - "{workdir}"  # CRITICAL: Include this placeholder
  env: {}

provider: "anthropic"
agent_harness: "claude-code"

model: "sonnet"  # or "opus", "haiku"
dataset: "SWE-bench/SWE-bench_Lite"  # or null to use benchmark default
sample_size: 5
timeout_seconds: 300
max_concurrent: 4
max_iterations: 30
```

## Validation Steps

Before saving a config, validate:

1. **Workdir Placeholder:** Ensure `"{workdir}"` appears in `args` array.
2. **Command Exists:** Verify the command is available:

   ```bash
   which npx  # or uvx, python, etc.
   ```

3. **Syntax:** YAML syntax is correct (no tabs, proper indentation).
4. **Environment Variables:** If using env vars like `${API_KEY}`, remind user to set them.
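
These checks can be scripted. A minimal shell sketch, assuming the config lives at `mcpbr.yaml` and that PyYAML is available for the syntax check:

```bash
# 1. Workdir placeholder present?
grep -n '{workdir}' mcpbr.yaml || echo "MISSING: {workdir} placeholder"

# 2. Command available? (swap npx for uvx/python as appropriate)
which npx || echo "MISSING: npx"

# 3. YAML parses? (requires PyYAML)
python3 -c "import yaml; yaml.safe_load(open('mcpbr.yaml'))" && echo "YAML OK"
```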

## Benchmark-Specific Configurations

### SWE-bench (Default)

```yaml
# ... mcp_server config ...
provider: "anthropic"
agent_harness: "claude-code"
model: "sonnet"
dataset: "SWE-bench/SWE-bench_Lite"  # or SWE-bench/SWE-bench_Verified
sample_size: 10
```

### CyberGym

```yaml
# ... mcp_server config ...
provider: "anthropic"
agent_harness: "claude-code"
model: "sonnet"
benchmark: "cybergym"
dataset: "sunblaze-ucb/cybergym"
cybergym_level: 2  # 0-3
sample_size: 10
```

### MCPToolBench++

```yaml
# ... mcp_server config ...
provider: "anthropic"
agent_harness: "claude-code"
model: "sonnet"
benchmark: "mcptoolbench"
dataset: "MCPToolBench/MCPToolBenchPP"
sample_size: 10
```

## Custom Agent Prompts

Users can customize the agent prompt using the `agent_prompt` field:

```yaml
agent_prompt: |
  Fix the following bug in this repository:

  {problem_statement}

  Make the minimal changes necessary to fix the issue.
  Focus on the root cause, not symptoms.
```

**Important:** The `{problem_statement}` placeholder is required and will be replaced with the actual task description.

## Common Mistakes to Avoid

1. **Missing {workdir}:** Forgetting to include `"{workdir}"` in args.
2. **Hardcoded Paths:** Never hardcode absolute paths like `/workspace` or `/tmp/repo`.
3. **Invalid Commands:** Using commands that don't exist (e.g., `uv` instead of `uvx`).
4. **Wrong Indentation:** YAML is whitespace-sensitive. Use 2 spaces, not tabs.
5. **Missing Quotes:** Environment variable references like `"${VAR}"` need quotes.

## Example Workflow

When a user asks to create a config:

1. Ask about their MCP server:
   - What package/command runs the server?
   - Does it need any special arguments or environment variables?
   - Is it Node.js-based (npx) or Python-based (uvx)?

2. Generate the config based on their answers.

3. Validate the config:
   - Check for `{workdir}` placeholder
   - Verify command exists
   - Confirm YAML syntax

4. Save the config (usually to `mcpbr.yaml`).

5. Optionally test the config with a small sample:

   ```bash
   mcpbr run -c mcpbr.yaml -n 1 -v
   ```

## Helpful Commands

```bash
# Generate a default config
mcpbr init

# List available models
mcpbr models

# List available benchmarks
mcpbr benchmarks

# Validate config by doing a dry run with 1 task
mcpbr run -c config.yaml -n 1 -v
```
package/skills/mcpbr-eval/SKILL.md
ADDED
@@ -0,0 +1,123 @@
---
name: run-benchmark
description: Run an MCP evaluation using mcpbr on SWE-bench or other datasets.
---

# Instructions

You are an expert at benchmarking AI agents using the `mcpbr` CLI. Your goal is to run valid, reproducible evaluations.

## Critical Constraints (DO NOT IGNORE)

1. **Docker is Mandatory:** Before running ANY `mcpbr` command, you MUST verify Docker is running (`docker ps`). If not, tell the user to start it.

2. **Config is Required:** `mcpbr run` FAILS without a config file. Never guess flags.
   - IF no config exists: Run `mcpbr init` first to generate a template.
   - IF config exists: Read it (`cat mcpbr.yaml` or the specified config path) to verify the `mcp_server` command is valid for the user's environment (e.g., check if `npx` or `uvx` is installed).

3. **Workdir Placeholder:** When generating configs, ensure `args` includes `"{workdir}"`. Do not resolve this path yourself; `mcpbr` handles it.

4. **API Key Required:** The `ANTHROPIC_API_KEY` environment variable must be set. Check for it before running evaluations.
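
These constraints translate directly into a preflight check. A minimal sketch, assuming the default `mcpbr.yaml` config path:

```bash
# Fail fast before launching an evaluation
docker ps > /dev/null 2>&1 || { echo "Docker is not running"; exit 1; }
[ -n "$ANTHROPIC_API_KEY" ] || { echo "ANTHROPIC_API_KEY is not set"; exit 1; }
[ -f mcpbr.yaml ] || { echo "No mcpbr.yaml found; run 'mcpbr init'"; exit 1; }
```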

## Common Pitfalls to Avoid

- **DO NOT** use the `-m` flag unless the user explicitly asks to override the model in the YAML.
- **DO NOT** hallucinate dataset names. Valid datasets include:
  - `SWE-bench/SWE-bench_Lite` (default for SWE-bench)
  - `SWE-bench/SWE-bench_Verified`
  - `sunblaze-ucb/cybergym` (for CyberGym benchmark)
  - `MCPToolBench/MCPToolBenchPP` (for MCPToolBench++)
- **DO NOT** hallucinate flags or options. Only use documented CLI flags.
- **DO NOT** forget to specify the config file with `-c` or `--config`.

## Supported Benchmarks

mcpbr supports three benchmarks:

1. **SWE-bench** (default): Real GitHub issues requiring bug fixes
   - Dataset: `SWE-bench/SWE-bench_Lite` or `SWE-bench/SWE-bench_Verified`
   - Use: `mcpbr run -c config.yaml` or `--benchmark swe-bench`

2. **CyberGym**: Security vulnerabilities requiring PoC exploits
   - Dataset: `sunblaze-ucb/cybergym`
   - Use: `mcpbr run -c config.yaml --benchmark cybergym --level [0-3]`

3. **MCPToolBench++**: Large-scale tool use evaluation
   - Dataset: `MCPToolBench/MCPToolBenchPP`
   - Use: `mcpbr run -c config.yaml --benchmark mcptoolbench`

## Execution Steps

Follow these steps in order:

1. **Verify Prerequisites:**

   ```bash
   # Check Docker is running
   docker ps

   # Verify API key is set
   echo $ANTHROPIC_API_KEY
   ```

2. **Check for Config File:**
   - If `mcpbr.yaml` (or user-specified config) does NOT exist: Run `mcpbr init` to generate it.
   - If config exists: Read it to understand the configuration.

3. **Validate Config:**
   - Ensure `mcp_server.command` is valid (e.g., `npx`, `uvx`, `python` are installed).
   - Ensure `mcp_server.args` includes the `"{workdir}"` placeholder.
   - Verify `model`, `dataset`, and other parameters are correctly set.

4. **Construct the Command:**
   - Base command: `mcpbr run --config <path-to-config>`
   - Add flags as needed based on the user request:
     - `-n <number>` or `--sample <number>`: Override sample size
     - `-v` or `-vv`: Verbose output
     - `-o <path>`: Save JSON results
     - `-r <path>`: Save Markdown report
     - `--log-dir <path>`: Save per-instance logs
     - `-M`: MCP-only evaluation (skip baseline)
     - `-B`: Baseline-only evaluation (skip MCP)
     - `--benchmark <name>`: Override benchmark
     - `--level <0-3>`: Set CyberGym difficulty level

5. **Run the Command:**
   Execute the constructed command and monitor the output.

6. **Handle Results:**
   - If the run completes successfully, inform the user about the results.
   - If errors occur, diagnose and provide actionable feedback.

## Example Commands

```bash
# Full evaluation with 5 tasks
mcpbr run -c config.yaml -n 5 -v

# MCP-only evaluation
mcpbr run -c config.yaml -M -n 10

# Save results and report
mcpbr run -c config.yaml -o results.json -r report.md

# Run CyberGym at level 2
mcpbr run -c config.yaml --benchmark cybergym --level 2 -n 5

# Run specific tasks
mcpbr run -c config.yaml -t astropy__astropy-12907 -t django__django-11099
```

## Troubleshooting

If you encounter errors:

1. **Docker not running:** Remind user to start Docker Desktop or Docker daemon.
2. **API key missing:** Ask user to set `export ANTHROPIC_API_KEY="sk-ant-..."`
3. **Config file invalid:** Re-generate with `mcpbr init` or fix the YAML syntax.
4. **MCP server fails to start:** Test the server command independently.
5. **Timeout issues:** Suggest increasing `timeout_seconds` in config.

## Important Reminders

- Always read the config file before making assumptions about what's configured.
- Never modify the config file without explicit user permission.
- Use the `mcpbr models` command to check available models if needed.
- Use the `mcpbr benchmarks` command to list available benchmarks.