mcpbr-cli 0.4.2 → 0.4.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +116 -0
- package/package.json +1 -1
package/README.md
CHANGED
@@ -88,6 +88,15 @@ Large-scale MCP tool use evaluation across 45+ categories. Tests agent capabilit
 - **Categories**: Browser, Finance, Code Analysis, and 40+ more
 - **Learn more**: [MCPToolBench++ Paper](https://arxiv.org/pdf/2508.07575) | [GitHub](https://github.com/mcp-tool-bench/MCPToolBenchPP)
 
+### GSM8K
+Grade-school math word problems testing mathematical reasoning and chain-of-thought capabilities.
+
+- **Dataset**: [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)
+- **Task**: Solve math word problems with step-by-step reasoning
+- **Evaluation**: Numeric answer correctness with tolerance
+- **Problem Types**: Multi-step arithmetic and basic algebra
+- **Learn more**: [GSM8K Paper](https://arxiv.org/abs/2110.14168) | [GitHub](https://github.com/openai/grade-school-math)
+
 ```bash
 # Run SWE-bench Verified (default - manually validated tests)
 mcpbr run -c config.yaml
@@ -98,6 +107,9 @@ mcpbr run -c config.yaml -b swe-bench-lite
 # Run SWE-bench Full (2,294 tasks, complete benchmark)
 mcpbr run -c config.yaml -b swe-bench-full
 
+# Run GSM8K
+mcpbr run -c config.yaml --benchmark gsm8k -n 50
+
 # List all available benchmarks
 mcpbr benchmarks
 ```
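The GSM8K scoring rule introduced above ("Numeric answer correctness with tolerance") can be sketched roughly as follows. This is a hypothetical illustration, not mcpbr's actual checker; the only dataset-specific assumption is that GSM8K gold solutions end their rationale with a `#### <number>` line, as in the official dataset.

```python
import re

def extract_gold_answer(solution: str) -> float:
    """GSM8K gold solutions end with a final line like '#### 1,920'."""
    match = re.search(r"####\s*([-+]?[\d,]*\.?\d+)", solution)
    if match is None:
        raise ValueError("no final '#### <number>' answer found")
    return float(match.group(1).replace(",", ""))

def is_correct(predicted: float, gold: float, tol: float = 1e-4) -> bool:
    """Numeric correctness within a small absolute-or-relative tolerance."""
    return abs(predicted - gold) <= tol * max(1.0, abs(gold))
```

A tolerance check like this avoids failing answers that differ from the gold value only by floating-point formatting; the exact tolerance mcpbr applies is not specified here.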
@@ -320,6 +332,110 @@ max_concurrent: 4
 mcpbr run --config config.yaml
 ```
 
+## Infrastructure Modes
+
+mcpbr supports running evaluations on different infrastructure platforms, allowing you to scale evaluations or offload compute-intensive tasks to cloud VMs.
+
+### Local (Default)
+
+Run evaluations on your local machine:
+
+```yaml
+infrastructure:
+  mode: local # default
+```
+
+This is the default mode - evaluations run directly on your machine using local Docker containers.
+
+### Azure VM
+
+Run evaluations on Azure Virtual Machines with automatic provisioning and cleanup:
+
+```yaml
+infrastructure:
+  mode: azure
+  azure:
+    resource_group: mcpbr-benchmarks
+    location: eastus
+    cpu_cores: 10
+    memory_gb: 40
+```
+
+**Key features:**
+- Zero manual VM setup - provisioned automatically from config
+- Automatic Docker, Python, and mcpbr installation
+- Test task validation before full evaluation
+- Auto-cleanup after completion (configurable)
+- Cost-optimized with automatic VM deletion
+
+**Example usage:**
+```bash
+# Run evaluation on Azure VM
+mcpbr run -c azure-config.yaml
+
+# VM is automatically created, evaluation runs, results are downloaded, VM is deleted
+```
+
+See [docs/infrastructure/azure.md](docs/infrastructure/azure.md) for full documentation including:
+- Prerequisites and authentication
+- VM sizing and cost estimation
+- Debugging with `preserve_on_error`
+- Troubleshooting guide
+
+## Side-by-Side Server Comparison
+
+Compare two MCP servers head-to-head in a single evaluation run to see which implementation performs better.
+
+### Quick Example
+
+```yaml
+# comparison-config.yaml
+comparison_mode: true
+
+mcp_server_a:
+  name: "Task Queries"
+  command: node
+  args: [build/index.js]
+  cwd: /path/to/task-queries
+
+mcp_server_b:
+  name: "Edge Identity"
+  command: node
+  args: [build/index.js]
+  cwd: /path/to/edge-identity
+
+benchmark: swe-bench-lite
+sample_size: 10
+```
+
+```bash
+mcpbr run -c comparison-config.yaml -o results.json
+```
+
+### Results Output
+
+```text
+Side-by-Side MCP Server Comparison
+
+┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━┓
+┃ Metric            ┃ Task Queries ┃ Edge Identity┃ Δ (A - B)┃
+┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━┩
+│ Resolved Tasks    │ 4/10         │ 2/10         │ +2       │
+│ Resolution Rate   │ 40.0%        │ 20.0%        │ +100.0%  │
+└───────────────────┴──────────────┴──────────────┴──────────┘
+
+✓ Task Queries unique wins: 2 tasks
+  - django__django-12286
+  - astropy__astropy-7606
+```
+
+**Use cases:**
+- **A/B testing**: Compare optimized vs. baseline implementations
+- **Tool evaluation**: Test different MCP tool sets
+- **Version comparison**: Benchmark v2.0 vs. v1.5
+
+See [docs/comparison-mode.md](docs/comparison-mode.md) for complete documentation.
+
 ## Claude Code Integration
 
 [](https://claude.ai/download)
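The Δ column and "unique wins" list in the comparison output above follow from set arithmetic over each server's resolved task IDs. A minimal sketch with made-up sample data shaped like the table (the two shared task IDs are invented placeholders; none of these names are mcpbr internals):

```python
# Task IDs each server resolved (sample data matching the 4/10 vs 2/10 table).
resolved_a = {"django__django-12286", "astropy__astropy-7606",
              "sympy__sympy-13480", "requests__requests-2317"}  # Task Queries
resolved_b = {"sympy__sympy-13480", "requests__requests-2317"}  # Edge Identity
total = 10

# Unique wins: tasks one server resolved that the other did not.
wins_a = sorted(resolved_a - resolved_b)
wins_b = sorted(resolved_b - resolved_a)

# Δ (A - B) on resolved counts, plus the relative rate difference.
delta_resolved = len(resolved_a) - len(resolved_b)          # +2
rate_a = len(resolved_a) / total                            # 0.40
rate_b = len(resolved_b) / total                            # 0.20
relative_delta = (rate_a - rate_b) / rate_b * 100           # +100.0

print(f"Task Queries unique wins: {len(wins_a)} tasks")
for task in wins_a:
    print(f"  - {task}")
```

Note that the Resolution Rate Δ in the sample output (+100.0% for 40% vs 20%) appears to be a relative difference (40% is twice 20%), not the percentage-point difference (+20.0).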