mcpbr-cli 0.4.2 → 0.4.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +66 -0
- package/package.json +1 -1
package/README.md
CHANGED
@@ -88,6 +88,15 @@ Large-scale MCP tool use evaluation across 45+ categories. Tests agent capabilit
 - **Categories**: Browser, Finance, Code Analysis, and 40+ more
 - **Learn more**: [MCPToolBench++ Paper](https://arxiv.org/pdf/2508.07575) | [GitHub](https://github.com/mcp-tool-bench/MCPToolBenchPP)
 
+### GSM8K
+Grade-school math word problems testing mathematical reasoning and chain-of-thought capabilities.
+
+- **Dataset**: [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)
+- **Task**: Solve math word problems with step-by-step reasoning
+- **Evaluation**: Numeric answer correctness with tolerance
+- **Problem Types**: Multi-step arithmetic and basic algebra
+- **Learn more**: [GSM8K Paper](https://arxiv.org/abs/2110.14168) | [GitHub](https://github.com/openai/grade-school-math)
+
 ```bash
 # Run SWE-bench Verified (default - manually validated tests)
 mcpbr run -c config.yaml
@@ -98,6 +107,9 @@ mcpbr run -c config.yaml -b swe-bench-lite
 # Run SWE-bench Full (2,294 tasks, complete benchmark)
 mcpbr run -c config.yaml -b swe-bench-full
 
+# Run GSM8K
+mcpbr run -c config.yaml --benchmark gsm8k -n 50
+
 # List all available benchmarks
 mcpbr benchmarks
 ```
@@ -320,6 +332,60 @@ max_concurrent: 4
 mcpbr run --config config.yaml
 ```
 
+## Side-by-Side Server Comparison
+
+Compare two MCP servers head-to-head in a single evaluation run to see which implementation performs better.
+
+### Quick Example
+
+```yaml
+# comparison-config.yaml
+comparison_mode: true
+
+mcp_server_a:
+  name: "Task Queries"
+  command: node
+  args: [build/index.js]
+  cwd: /path/to/task-queries
+
+mcp_server_b:
+  name: "Edge Identity"
+  command: node
+  args: [build/index.js]
+  cwd: /path/to/edge-identity
+
+benchmark: swe-bench-lite
+sample_size: 10
+```
+
+```bash
+mcpbr run -c comparison-config.yaml -o results.json
+```
+
+### Results Output
+
+```text
+Side-by-Side MCP Server Comparison
+
+┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━┓
+┃ Metric            ┃ Task Queries ┃ Edge Identity┃ Δ (A - B)┃
+┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━┩
+│ Resolved Tasks    │ 4/10         │ 2/10         │ +2       │
+│ Resolution Rate   │ 40.0%        │ 20.0%        │ +100.0%  │
+└───────────────────┴──────────────┴──────────────┴──────────┘
+
+✓ Task Queries unique wins: 2 tasks
+  - django__django-12286
+  - astropy__astropy-7606
+```
+
+**Use cases:**
+- **A/B testing**: Compare optimized vs. baseline implementations
+- **Tool evaluation**: Test different MCP tool sets
+- **Version comparison**: Benchmark v2.0 vs. v1.5
+
+See [docs/comparison-mode.md](docs/comparison-mode.md) for complete documentation.
+
 ## Claude Code Integration
 
 [](https://claude.ai/download)
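Note that the Δ column in the sample output above appears to mix absolute and relative deltas: resolved tasks are differenced directly (4 − 2 = +2), while the resolution-rate delta of +100.0% is consistent with a relative change, (A − B)/B. A sketch of that arithmetic — the function name is a hypothetical illustration, not part of mcpbr's API:

```python
def comparison_deltas(resolved_a: int, resolved_b: int, total: int) -> dict:
    """Reproduce the Δ (A - B) column: absolute task delta, relative rate delta."""
    rate_a = resolved_a / total
    rate_b = resolved_b / total
    return {
        "resolved_delta": resolved_a - resolved_b,            # 4 - 2 = +2
        "rate_delta_pct": 100.0 * (rate_a - rate_b) / rate_b,  # relative change of A vs. B
    }


# With the table's numbers (4/10 vs. 2/10):
deltas = comparison_deltas(4, 2, 10)
```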