@greynewell/mcpbr-claude-plugin 0.3.18

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md ADDED
@@ -0,0 +1,127 @@
1
+ # @greynewell/mcpbr-claude-plugin
2
+
3
+ > Claude Code plugin for mcpbr - Makes Claude an expert at running MCP benchmarks
4
+
5
+ [![npm version](https://badge.fury.io/js/%40greynewell%2Fmcpbr-claude-plugin.svg)](https://www.npmjs.com/package/@greynewell/mcpbr-claude-plugin)
6
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
7
+
8
+ This Claude Code plugin provides specialized skills that make Claude an expert at using [mcpbr](https://github.com/greynewell/mcpbr) for MCP server benchmarking.
9
+
10
+ ## What is mcpbr?
11
+
12
+ **Model Context Protocol Benchmark Runner** - Benchmark your MCP server against real GitHub issues. Get hard numbers comparing tool-assisted vs. baseline agent performance.
13
+
14
+ ## Installation
15
+
16
+ ### Option 1: Clone Repository (Automatic Detection)
17
+
18
+ ```bash
19
+ git clone https://github.com/greynewell/mcpbr.git
20
+ cd mcpbr
21
+ # Claude Code automatically detects the plugin
22
+ ```
23
+
24
+ ### Option 2: Install via npm
25
+
26
+ ```bash
27
+ npm install -g @greynewell/mcpbr-claude-plugin
28
+ ```
29
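+
+ To confirm the package is installed (optional; standard npm check):
+
+ ```bash
+ # Lists the globally installed plugin package and its version
+ npm ls -g @greynewell/mcpbr-claude-plugin
+ ```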
+
30
+ ### Option 3: Claude Code Plugin Manager
31
+
32
+ ```bash
33
+ # In Claude Code, run:
34
+ /plugin install github:greynewell/mcpbr
35
+ ```
36
+
37
+ ## What You Get
38
+
39
+ When using Claude Code in a project with this plugin, Claude automatically gains expertise in:
40
+
41
+ ### 1. run-benchmark (`mcpbr-eval`)
42
+ Expert at running evaluations with proper validation:
43
+ - Verifies Docker is running
44
+ - Checks for API keys
45
+ - Validates configuration files
46
+ - Uses correct CLI flags
47
+ - Provides actionable error messages
48
+
49
+ ### 2. generate-config (`mcpbr-config`)
50
+ Generates valid mcpbr configuration files:
51
+ - Ensures `{workdir}` placeholder is included
52
+ - Validates MCP server commands
53
+ - Provides benchmark-specific templates
54
+ - Applies best practices
55
+
56
+ ### 3. swe-bench-lite (`benchmark-swe-lite`)
57
+ Quick-start command for demonstrations:
58
+ - Pre-configured for 5-task evaluation
59
+ - Sensible defaults for output files
60
+ - Perfect for testing and demos
61
+
62
+ ## Example Interactions
63
+
64
+ Simply ask Claude in natural language:
65
+
66
+ - "Run the SWE-bench Lite benchmark"
67
+ - "Generate a config for my MCP server"
68
+ - "Run a quick test with 1 task"
69
+
70
+ Claude will automatically:
71
+ - Verify prerequisites before starting
72
+ - Generate valid configurations
73
+ - Use correct CLI commands
74
+ - Handle errors gracefully
75
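+
+ For reference, a request like "Run the SWE-bench Lite benchmark" maps to the default command defined in the bundled `swe-bench-lite` skill (Claude adjusts flags such as `-n` to match your request):
+
+ ```bash
+ mcpbr run -c mcpbr.yaml --dataset SWE-bench/SWE-bench_Lite -n 5 -v -o results.json -r report.md
+ ```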
+
76
+ ## How It Works
77
+
78
+ The plugin includes:
79
+ - **plugin.json** - Claude Code plugin manifest
80
+ - **skills/** - Three specialized SKILL.md files
81
+
82
+ When Claude Code loads this plugin, it injects the skill instructions into Claude's context, making it an expert at mcpbr without you having to explain anything.
83
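+
+ If you installed via npm (Option 2), you can inspect what the published package ships; the path below assumes a standard global install:
+
+ ```bash
+ # Show the plugin manifest and skill files bundled with the package
+ ls "$(npm root -g)/@greynewell/mcpbr-claude-plugin"
+ ```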
+
84
+ ## Development
85
+
86
+ This package is part of the mcpbr project:
87
+
88
+ ```bash
89
+ # Clone the repository
90
+ git clone https://github.com/greynewell/mcpbr.git
91
+ cd mcpbr
92
+
93
+ # Test the plugin
94
+ pytest tests/test_claude_plugin.py -v
95
+ ```
96
+
97
+ ## Related Packages
98
+
99
+ - [`mcpbr`](https://pypi.org/project/mcpbr/) - Python package (core implementation)
100
+ - [`@greynewell/mcpbr`](https://www.npmjs.com/package/@greynewell/mcpbr) - npm CLI wrapper
101
+
102
+ ## Documentation
103
+
104
+ For full documentation, visit: <https://greynewell.github.io/mcpbr/>
105
+
106
+ - [Claude Code Plugin Guide](https://greynewell.github.io/mcpbr/claude-code-plugin/)
107
+ - [Installation Guide](https://greynewell.github.io/mcpbr/installation/)
108
+ - [Configuration Reference](https://greynewell.github.io/mcpbr/configuration/)
109
+ - [CLI Reference](https://greynewell.github.io/mcpbr/cli/)
110
+
111
+ ## License
112
+
113
+ MIT - see [LICENSE](../LICENSE) for details.
114
+
115
+ ## Contributing
116
+
117
+ Contributions welcome! See [CONTRIBUTING.md](../CONTRIBUTING.md) for guidelines.
118
+
119
+ ## Support
120
+
121
+ - **Documentation**: <https://greynewell.github.io/mcpbr/>
122
+ - **Issues**: <https://github.com/greynewell/mcpbr/issues>
123
+ - **Discussions**: <https://github.com/greynewell/mcpbr/discussions>
124
+
125
+ ---
126
+
127
+ Built by [Grey Newell](https://greynewell.com)
package/package.json ADDED
@@ -0,0 +1,38 @@
1
+ {
2
+ "name": "@greynewell/mcpbr-claude-plugin",
3
+ "version": "0.3.18",
4
+ "description": "Claude Code plugin for mcpbr - Expert benchmark runner for MCP servers with specialized skills",
5
+ "keywords": [
6
+ "claude-code",
7
+ "claude-plugin",
8
+ "mcpbr",
9
+ "mcp",
10
+ "benchmark",
11
+ "model-context-protocol",
12
+ "swe-bench",
13
+ "llm",
14
+ "agents",
15
+ "evaluation"
16
+ ],
17
+ "homepage": "https://github.com/greynewell/mcpbr",
18
+ "repository": {
19
+ "type": "git",
20
+ "url": "https://github.com/greynewell/mcpbr.git",
21
+ "directory": ".claude-plugin"
22
+ },
23
+ "bugs": {
24
+ "url": "https://github.com/greynewell/mcpbr/issues"
25
+ },
26
+ "license": "MIT",
27
+ "author": "mcpbr Contributors",
28
+ "files": [
29
+ "plugin.json",
30
+ "skills/"
31
+ ],
32
+ "scripts": {
33
+ "test": "pytest tests/test_claude_plugin.py -v"
34
+ },
35
+ "engines": {
36
+ "node": ">=18.0.0"
37
+ }
38
+ }
package/plugin.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "name": "mcpbr",
3
+ "version": "0.3.18",
4
+ "description": "Expert benchmark runner for MCP servers using mcpbr. Handles Docker checks, config generation, and result parsing.",
5
+ "schema_version": "1.0"
6
+ }
@@ -0,0 +1,183 @@
1
+ # mcpbr Claude Code Skills
2
+
3
+ This directory contains specialized skills that make Claude Code an expert at using mcpbr.
4
+
5
+ ## What are Skills?
6
+
7
+ Skills are instruction sets that teach Claude Code how to perform specific tasks correctly. When Claude works in this repository, it automatically detects these skills and gains domain expertise about mcpbr.
8
+
9
+ ## Available Skills
10
+
11
+ ### 1. `mcpbr-eval` - Run Benchmark
12
+
13
+ **Purpose:** Expert at running evaluations with proper validation
14
+
15
+ **Key Features:**
16
+ - Validates prerequisites (Docker, API keys, config files)
17
+ - Checks for common mistakes before running
18
+ - Supports all benchmarks (SWE-bench, CyberGym, MCPToolBench++)
19
+ - Provides actionable troubleshooting
20
+
21
+ **When to use:** Anytime you want to run an evaluation with mcpbr
22
+
23
+ **Example prompts:**
24
+ - "Run the SWE-bench benchmark with 10 tasks"
25
+ - "Evaluate my MCP server on CyberGym level 2"
26
+ - "Run a quick test with 1 task"
27
+
28
+ ### 2. `mcpbr-config` - Generate Config
29
+
30
+ **Purpose:** Generates valid mcpbr configuration files
31
+
32
+ **Key Features:**
33
+ - Ensures critical `{workdir}` placeholder is included
34
+ - Validates MCP server commands
35
+ - Provides templates for common MCP servers
36
+ - Supports all benchmark types
37
+
38
+ **When to use:** When creating or modifying mcpbr configuration files
39
+
40
+ **Example prompts:**
41
+ - "Generate a config for my Python MCP server"
42
+ - "Create a config using the filesystem server"
43
+ - "Help me configure my custom MCP server"
44
+
45
+ ### 3. `benchmark-swe-lite` - Quick Start
46
+
47
+ **Purpose:** Streamlined command for SWE-bench Lite evaluation
48
+
49
+ **Key Features:**
50
+ - Pre-configured for 5-task evaluation
51
+ - Sensible defaults for quick testing
52
+ - Includes runtime/cost estimates
53
+ - Perfect for demos and testing
54
+
55
+ **When to use:** For quick validation or demonstrations
56
+
57
+ **Example prompts:**
58
+ - "Run a quick SWE-bench Lite test"
59
+ - "Show me how mcpbr works"
60
+ - "Do a fast evaluation"
61
+
62
+ ## How Skills Work
63
+
64
+ When you clone this repository and work with Claude Code:
65
+
66
+ 1. Claude Code detects the `.claude-plugin/plugin.json` manifest
67
+ 2. It loads all skills from the `skills/` directory
68
+ 3. Each skill provides specialized knowledge about mcpbr commands
69
+ 4. Claude automatically follows best practices without being told
70
+
71
+ ## Skill Structure
72
+
73
+ Each skill is a directory containing a `SKILL.md` file:
74
+
75
+ ```text
76
+ skills/
77
+ ├── mcpbr-eval/
78
+ │ └── SKILL.md # Instructions for running evaluations
79
+ ├── mcpbr-config/
80
+ │ └── SKILL.md # Instructions for config generation
81
+ └── benchmark-swe-lite/
82
+ └── SKILL.md # Quick-start instructions
83
+ ```
84
+
85
+ Each `SKILL.md` contains:
86
+
87
+ 1. **Frontmatter** - Metadata (name, description)
88
+ 2. **Instructions** - Main skill content
89
+ 3. **Examples** - Usage examples
90
+ 4. **Constraints** - Critical requirements
91
+ 5. **Troubleshooting** - Common issues and solutions
92
+
93
+ ## Benefits
94
+
95
+ ### Without Skills
96
+ ```text
97
+ User: "Run the benchmark"
98
+ Claude: *tries `mcpbr run` without config, fails*
99
+ Claude: *forgets to check Docker, fails*
100
+ Claude: *uses wrong flags, fails*
101
+ ```
102
+
103
+ ### With Skills
104
+ ```text
105
+ User: "Run the benchmark"
106
+ Claude: *checks Docker is running*
107
+ Claude: *verifies config exists*
108
+ Claude: *uses correct flags*
109
+ Claude: *evaluation succeeds*
110
+ ```
111
+
112
+ ## Testing
113
+
114
+ Skills are validated by comprehensive tests in `tests/test_claude_plugin.py`:
115
+
116
+ - Validates plugin manifest structure
117
+ - Checks skill file format and content
118
+ - Ensures critical keywords are present (Docker, {workdir}, etc.)
119
+ - Verifies documentation mentions all commands
120
+
121
+ Run skill tests:
122
+ ```bash
123
+ uv run pytest tests/test_claude_plugin.py -v
124
+ ```
125
+
126
+ ## Adding New Skills
127
+
128
+ To add a new skill:
129
+
130
+ 1. Create a new directory: `skills/my-skill/`
131
+ 2. Create `SKILL.md` with frontmatter:
132
+ ```markdown
133
+ ---
134
+ name: my-skill
135
+ description: Brief description of what this skill does
136
+ ---
137
+
138
+ # Instructions
139
+ [Your skill content here]
140
+ ```
141
+ 3. Add tests in `tests/test_claude_plugin.py`
142
+ 4. Update this README
143
+
144
+ ## Version Management
145
+
146
+ The plugin version in `.claude-plugin/plugin.json` is automatically synced with `pyproject.toml`:
147
+
148
+ ```bash
149
+ # Sync versions manually
150
+ make sync-version
151
+
152
+ # Syncs automatically during
153
+ make build
154
+ pre-commit hooks
155
+ CI/CD checks
156
+ ```
157
+
158
+ ## Learn More
159
+
160
+ - **Plugin Manifest**: `.claude-plugin/plugin.json`
161
+ - **Tests**: `tests/test_claude_plugin.py`
162
+ - **Documentation**: Main README.md, CONTRIBUTING.md
163
+ - **Issue**: [#262](https://github.com/greynewell/mcpbr/issues/262)
164
+
165
+ ## Contributing
166
+
167
+ When modifying skills:
168
+
169
+ 1. Update the relevant `SKILL.md` file
170
+ 2. Run tests: `uv run pytest tests/test_claude_plugin.py`
171
+ 3. Run pre-commit: `pre-commit run --all-files`
172
+ 4. Submit a PR with your changes
173
+
174
+ Skills should:
175
+ - Be clear and concise
176
+ - Include examples
177
+ - Emphasize critical requirements
178
+ - Provide troubleshooting guidance
179
+ - Reference actual mcpbr commands
180
+
181
+ ## Questions?
182
+
183
+ Open an issue or check the main project documentation.
@@ -0,0 +1,102 @@
1
+ ---
2
+ name: swe-bench-lite
3
+ description: Quick-start command to run SWE-bench Lite evaluation with sensible defaults.
4
+ ---
5
+
6
+ # Instructions
7
+ This skill provides a streamlined way to run the SWE-bench Lite benchmark with pre-configured defaults.
8
+
9
+ ## What This Skill Does
10
+
11
+ This skill runs a quick SWE-bench Lite evaluation with:
12
+ - 5 sample tasks (configurable)
13
+ - Verbose output for visibility
14
+ - Results saved to `results.json`
15
+ - Report saved to `report.md`
16
+
17
+ ## Prerequisites Check
18
+
19
+ Before running, verify:
20
+
21
+ 1. **Docker is running:**
22
+ ```bash
23
+ docker ps
24
+ ```
25
+
26
+ 2. **API key is set:**
27
+ ```bash
28
+ echo $ANTHROPIC_API_KEY
29
+ ```
30
+
31
+ 3. **Config file exists:**
32
+ - Check for `mcpbr.yaml` in the current directory
33
+ - If missing, run `mcpbr init` to generate it
34
+
35
+ ## Default Command
36
+
37
+ The default command for SWE-bench Lite:
38
+
39
+ ```bash
40
+ mcpbr run -c mcpbr.yaml --dataset SWE-bench/SWE-bench_Lite -n 5 -v -o results.json -r report.md
41
+ ```
42
+
43
+ ## Customization Options
44
+
45
+ Users can customize the run by modifying:
46
+
47
+ - **Sample size:** Change `-n 5` to any number (or remove for full dataset)
48
+ - **Config file:** Change `-c mcpbr.yaml` to point to a different config
49
+ - **Verbosity:** Use `-vv` for very verbose output
50
+ - **Output files:** Change `results.json` and `report.md` to different paths
51
+
52
+ ## Example Variations
53
+
54
+ ### Minimal quick test (1 task)
55
+ ```bash
56
+ mcpbr run -c mcpbr.yaml -n 1 -v
57
+ ```
58
+
59
+ ### Full evaluation (all ~300 tasks)
60
+ ```bash
61
+ mcpbr run -c mcpbr.yaml --dataset SWE-bench/SWE-bench_Lite -v -o results.json
62
+ ```
63
+
64
+ ### MCP-only (skip baseline)
65
+ ```bash
66
+ mcpbr run -c mcpbr.yaml -n 5 -M -v -o results.json
67
+ ```
68
+
69
+ ### Specific tasks
70
+ ```bash
71
+ mcpbr run -c mcpbr.yaml -t astropy__astropy-12907 -t django__django-11099 -v
72
+ ```
73
+
74
+ ## Expected Runtime & Cost
75
+
76
+ For 5 tasks with default settings:
77
+ - **Runtime:** 15-30 minutes (depends on task complexity)
78
+ - **Cost:** $2-5 (depends on task complexity and model used)
79
+
80
+ ## What to Do If It Fails
81
+
82
+ 1. **Docker not running:** Start Docker Desktop
83
+ 2. **API key missing:** Set with `export ANTHROPIC_API_KEY="sk-ant-..."`
84
+ 3. **Config missing:** Run `mcpbr init` to generate default config
85
+ 4. **Config invalid:** Check that the `{workdir}` placeholder is present in the `args` array
86
+ 5. **MCP server fails:** Test the server command independently
87
+
88
+ ## After the Run
89
+
90
+ Once complete, you'll have:
91
+ - **results.json:** Full evaluation data with metrics, token usage, and per-task results
92
+ - **report.md:** Human-readable summary with resolution rates and comparisons
93
+ - **Console output:** Real-time progress and summary table
94
+
95
+ Review the results to see how your MCP server performed compared to the baseline!
96
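+
+ A couple of quick ways to inspect the output (assumes `jq` is available; adjust file names if you changed `-o`/`-r`):
+
+ ```bash
+ # Read the human-readable summary
+ less report.md
+
+ # Skim the raw evaluation data
+ jq . results.json | head -n 50
+ ```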
+
97
+ ## Pro Tips
98
+
99
+ - Start with `-n 1` to verify everything works before running larger evaluations
100
+ - Use `--log-dir logs/` to save detailed per-task logs for debugging
101
+ - Compare multiple runs by changing the MCP server config between runs
102
+ - Use `--baseline-results baseline.json` to detect regressions between versions
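+
+ Putting a few of these tips together (flags as documented above; adjust file names to your setup):
+
+ ```bash
+ # Re-run with per-task logs and compare against a previous run's results
+ mcpbr run -c mcpbr.yaml -n 5 -v -o results.json -r report.md --log-dir logs/ --baseline-results baseline.json
+ ```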
@@ -0,0 +1,204 @@
1
+ ---
2
+ name: generate-config
3
+ description: Generate and validate mcpbr configuration files for MCP server benchmarking.
4
+ ---
5
+
6
+ # Instructions
7
+ You are an expert at creating valid `mcpbr` configuration files. Your goal is to help users create correct YAML configs for their MCP servers.
8
+
9
+ ## Critical Requirements
10
+
11
+ 1. **Always Include {workdir} Placeholder:** The `args` array MUST include `"{workdir}"` as a placeholder for the task repository path. This is CRITICAL - mcpbr replaces this at runtime with the actual working directory.
12
+
13
+ 2. **Valid Commands:** Ensure the `command` field uses an executable that exists on the user's system:
14
+ - `npx` for Node.js-based MCP servers
15
+ - `uvx` for Python MCP servers via uv
16
+ - `python` or `python3` for direct Python execution
17
+ - Custom binaries (verify they exist with `which <command>`)
18
+
19
+ 3. **Model Aliases:** Use short aliases when possible:
20
+ - `sonnet` instead of `claude-sonnet-4-5-20250929`
21
+ - `opus` instead of `claude-opus-4-5-20251101`
22
+ - `haiku` instead of `claude-haiku-4-5-20251001`
23
+
24
+ 4. **Required Fields:** Every config MUST have:
25
+ - `mcp_server.command`
26
+ - `mcp_server.args` (with `"{workdir}"`)
27
+ - `provider` (usually `"anthropic"`)
28
+ - `agent_harness` (usually `"claude-code"`)
29
+ - `model`
30
+ - `dataset` (or rely on benchmark default)
31
+
32
+ ## Common MCP Server Configurations
33
+
34
+ ### Anthropic Filesystem Server
35
+ ```yaml
36
+ mcp_server:
37
+ name: "filesystem"
38
+ command: "npx"
39
+ args:
40
+ - "-y"
41
+ - "@modelcontextprotocol/server-filesystem"
42
+ - "{workdir}"
43
+ env: {}
44
+ ```
45
+
46
+ ### Custom Python MCP Server
47
+ ```yaml
48
+ mcp_server:
49
+ name: "my-server"
50
+ command: "uvx"
51
+ args:
52
+ - "my-mcp-server"
53
+ - "--workspace"
54
+ - "{workdir}"
55
+ env:
56
+ LOG_LEVEL: "debug"
57
+ ```
58
+
59
+ ### Supermodel Codebase Analysis
60
+ ```yaml
61
+ mcp_server:
62
+ name: "supermodel"
63
+ command: "npx"
64
+ args:
65
+ - "-y"
66
+ - "@supermodeltools/mcp-server"
67
+ env:
68
+ SUPERMODEL_API_KEY: "${SUPERMODEL_API_KEY}"
69
+ ```
70
+
71
+ ## Configuration Template
72
+
73
+ When generating a new config, use this template:
74
+
75
+ ```yaml
76
+ mcp_server:
77
+ name: "<server-name>"
78
+ command: "<executable>"
79
+ args:
80
+ - "<arg1>"
81
+ - "<arg2>"
82
+ - "{workdir}" # CRITICAL: Include this placeholder
83
+ env: {}
84
+
85
+ provider: "anthropic"
86
+ agent_harness: "claude-code"
87
+
88
+ model: "sonnet" # or "opus", "haiku"
89
+ dataset: "SWE-bench/SWE-bench_Lite" # or null to use benchmark default
90
+ sample_size: 5
91
+ timeout_seconds: 300
92
+ max_concurrent: 4
93
+ max_iterations: 30
94
+ ```
95
+
96
+ ## Validation Steps
97
+
98
+ Before saving a config, validate:
99
+
100
+ 1. **Workdir Placeholder:** Ensure `"{workdir}"` appears in `args` array.
101
+ 2. **Command Exists:** Verify the command is available:
102
+ ```bash
103
+ which npx # or uvx, python, etc.
104
+ ```
105
+ 3. **Syntax:** YAML syntax is correct (no tabs, proper indentation).
106
+ 4. **Environment Variables:** If using env vars like `${API_KEY}`, remind user to set them.
107
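+
+ A minimal pre-flight check that combines the steps above (the `grep`/`which`/`python3` one-liners are a sketch; running mcpbr itself is the definitive validation):
+
+ ```bash
+ # 1. Placeholder is present in the config
+ grep -nF '{workdir}' mcpbr.yaml
+
+ # 2. The launcher executable is available
+ which npx uvx python3
+
+ # 3. The YAML parses (requires PyYAML)
+ python3 -c "import yaml; yaml.safe_load(open('mcpbr.yaml'))"
+ ```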
+
108
+ ## Benchmark-Specific Configurations
109
+
110
+ ### SWE-bench (Default)
111
+ ```yaml
112
+ # ... mcp_server config ...
113
+ provider: "anthropic"
114
+ agent_harness: "claude-code"
115
+ model: "sonnet"
116
+ dataset: "SWE-bench/SWE-bench_Lite" # or SWE-bench/SWE-bench_Verified
117
+ sample_size: 10
118
+ ```
119
+
120
+ ### CyberGym
121
+ ```yaml
122
+ # ... mcp_server config ...
123
+ provider: "anthropic"
124
+ agent_harness: "claude-code"
125
+ model: "sonnet"
126
+ benchmark: "cybergym"
127
+ dataset: "sunblaze-ucb/cybergym"
128
+ cybergym_level: 2 # 0-3
129
+ sample_size: 10
130
+ ```
131
+
132
+ ### MCPToolBench++
133
+ ```yaml
134
+ # ... mcp_server config ...
135
+ provider: "anthropic"
136
+ agent_harness: "claude-code"
137
+ model: "sonnet"
138
+ benchmark: "mcptoolbench"
139
+ dataset: "MCPToolBench/MCPToolBenchPP"
140
+ sample_size: 10
141
+ ```
142
+
143
+ ## Custom Agent Prompts
144
+
145
+ Users can customize the agent prompt using the `agent_prompt` field:
146
+
147
+ ```yaml
148
+ agent_prompt: |
149
+ Fix the following bug in this repository:
150
+
151
+ {problem_statement}
152
+
153
+ Make the minimal changes necessary to fix the issue.
154
+ Focus on the root cause, not symptoms.
155
+ ```
156
+
157
+ **Important:** The `{problem_statement}` placeholder is required and will be replaced with the actual task description.
158
+
159
+ ## Common Mistakes to Avoid
160
+
161
+ 1. **Missing {workdir}:** Forgetting to include `"{workdir}"` in args.
162
+ 2. **Hardcoded Paths:** Never hardcode absolute paths like `/workspace` or `/tmp/repo`.
163
+ 3. **Invalid Commands:** Using commands that don't exist (e.g., `uv` instead of `uvx`).
164
+ 4. **Wrong Indentation:** YAML is whitespace-sensitive. Use 2 spaces, not tabs.
165
+ 5. **Missing Quotes:** Environment variable references like `"${VAR}"` need quotes.
166
+
167
+ ## Example Workflow
168
+
169
+ When a user asks to create a config:
170
+
171
+ 1. Ask about their MCP server:
172
+ - What package/command runs the server?
173
+ - Does it need any special arguments or environment variables?
174
+ - Is it Node.js-based (npx) or Python-based (uvx)?
175
+
176
+ 2. Generate the config based on their answers.
177
+
178
+ 3. Validate the config:
179
+ - Check for `{workdir}` placeholder
180
+ - Verify command exists
181
+ - Confirm YAML syntax
182
+
183
+ 4. Save the config (usually to `mcpbr.yaml`).
184
+
185
+ 5. Optionally test the config with a small sample:
186
+ ```bash
187
+ mcpbr run -c mcpbr.yaml -n 1 -v
188
+ ```
189
+
190
+ ## Helpful Commands
191
+
192
+ ```bash
193
+ # Generate a default config
194
+ mcpbr init
195
+
196
+ # List available models
197
+ mcpbr models
198
+
199
+ # List available benchmarks
200
+ mcpbr benchmarks
201
+
202
+ # Validate config by doing a dry run with 1 task
203
+ mcpbr run -c config.yaml -n 1 -v
204
+ ```
@@ -0,0 +1,123 @@
1
+ ---
2
+ name: run-benchmark
3
+ description: Run an MCP evaluation using mcpbr on SWE-bench or other datasets.
4
+ ---
5
+
6
+ # Instructions
7
+ You are an expert at benchmarking AI agents using the `mcpbr` CLI. Your goal is to run valid, reproducible evaluations.
8
+
9
+ ## Critical Constraints (DO NOT IGNORE)
10
+
11
+ 1. **Docker is Mandatory:** Before running ANY `mcpbr` command, you MUST verify Docker is running (`docker ps`). If not, tell the user to start it.
12
+
13
+ 2. **Config is Required:** `mcpbr run` FAILS without a config file. Never guess flags.
14
+ - IF no config exists: Run `mcpbr init` first to generate a template.
15
+ - IF config exists: Read it (`cat mcpbr.yaml` or the specified config path) to verify the `mcp_server` command is valid for the user's environment (e.g., check if `npx` or `uvx` is installed).
16
+
17
+ 3. **Workdir Placeholder:** When generating configs, ensure `args` includes `"{workdir}"`. Do not resolve this path yourself; `mcpbr` handles it.
18
+
19
+ 4. **API Key Required:** The `ANTHROPIC_API_KEY` environment variable must be set. Check for it before running evaluations.
20
+
21
+ ## Common Pitfalls to Avoid
22
+
23
+ - **DO NOT** use the `-m` flag unless the user explicitly asks to override the model in the YAML.
24
+ - **DO NOT** hallucinate dataset names. Valid datasets include:
25
+ - `SWE-bench/SWE-bench_Lite` (default for SWE-bench)
26
+ - `SWE-bench/SWE-bench_Verified`
27
+ - `sunblaze-ucb/cybergym` (for CyberGym benchmark)
28
+ - `MCPToolBench/MCPToolBenchPP` (for MCPToolBench++)
29
+ - **DO NOT** hallucinate flags or options. Only use documented CLI flags.
30
+ - **DO NOT** forget to specify the config file with `-c` or `--config`.
31
+
32
+ ## Supported Benchmarks
33
+
34
+ mcpbr supports three benchmarks:
35
+ 1. **SWE-bench** (default): Real GitHub issues requiring bug fixes
36
+ - Dataset: `SWE-bench/SWE-bench_Lite` or `SWE-bench/SWE-bench_Verified`
37
+ - Use: `mcpbr run -c config.yaml` or `--benchmark swe-bench`
38
+
39
+ 2. **CyberGym**: Security vulnerabilities requiring PoC exploits
40
+ - Dataset: `sunblaze-ucb/cybergym`
41
+ - Use: `mcpbr run -c config.yaml --benchmark cybergym --level [0-3]`
42
+
43
+ 3. **MCPToolBench++**: Large-scale tool use evaluation
44
+ - Dataset: `MCPToolBench/MCPToolBenchPP`
45
+ - Use: `mcpbr run -c config.yaml --benchmark mcptoolbench`
46
+
47
+ ## Execution Steps
48
+
49
+ Follow these steps in order:
50
+
51
+ 1. **Verify Prerequisites:**
52
+ ```bash
53
+ # Check Docker is running
54
+ docker ps
55
+
56
+ # Verify API key is set
57
+ echo $ANTHROPIC_API_KEY
58
+ ```
59
+
60
+ 2. **Check for Config File:**
61
+ - If `mcpbr.yaml` (or user-specified config) does NOT exist: Run `mcpbr init` to generate it.
62
+ - If config exists: Read it to understand the configuration.
63
+
64
+ 3. **Validate Config:**
65
+ - Ensure `mcp_server.command` is valid (e.g., `npx`, `uvx`, `python` are installed).
66
+ - Ensure `mcp_server.args` includes `"{workdir}"` placeholder.
67
+ - Verify `model`, `dataset`, and other parameters are correctly set.
68
+
69
+ 4. **Construct the Command:**
70
+ - Base command: `mcpbr run --config <path-to-config>`
71
+ - Add flags as needed based on user request:
72
+ - `-n <number>` or `--sample <number>`: Override sample size
73
+ - `-v` or `-vv`: Verbose output
74
+ - `-o <path>`: Save JSON results
75
+ - `-r <path>`: Save Markdown report
76
+ - `--log-dir <path>`: Save per-instance logs
77
+ - `-M`: MCP-only evaluation (skip baseline)
78
+ - `-B`: Baseline-only evaluation (skip MCP)
79
+ - `--benchmark <name>`: Override benchmark
80
+ - `--level <0-3>`: Set CyberGym difficulty level
81
+
82
+ 5. **Run the Command:**
83
+ Execute the constructed command and monitor the output.
84
+
85
+ 6. **Handle Results:**
86
+ - If the run completes successfully, inform the user about the results.
87
+ - If errors occur, diagnose and provide actionable feedback.
88
+
89
+ ## Example Commands
90
+
91
+ ```bash
92
+ # Full evaluation with 5 tasks
93
+ mcpbr run -c config.yaml -n 5 -v
94
+
95
+ # MCP-only evaluation
96
+ mcpbr run -c config.yaml -M -n 10
97
+
98
+ # Save results and report
99
+ mcpbr run -c config.yaml -o results.json -r report.md
100
+
101
+ # Run CyberGym at level 2
102
+ mcpbr run -c config.yaml --benchmark cybergym --level 2 -n 5
103
+
104
+ # Run specific tasks
105
+ mcpbr run -c config.yaml -t astropy__astropy-12907 -t django__django-11099
106
+ ```
107
+
108
+ ## Troubleshooting
109
+
110
+ If you encounter errors:
111
+
112
+ 1. **Docker not running:** Remind user to start Docker Desktop or Docker daemon.
113
+ 2. **API key missing:** Ask user to set `export ANTHROPIC_API_KEY="sk-ant-..."`
114
+ 3. **Config file invalid:** Re-generate with `mcpbr init` or fix the YAML syntax.
115
+ 4. **MCP server fails to start:** Test the server command independently.
116
+ 5. **Timeout issues:** Suggest increasing `timeout_seconds` in config.
117
+
118
+ ## Important Reminders
119
+
120
+ - Always read the config file before making assumptions about what's configured.
121
+ - Never modify the config file without explicit user permission.
122
+ - Use the `mcpbr models` command to check available models if needed.
123
+ - Use the `mcpbr benchmarks` command to list available benchmarks.