mcpbr-cli 0.4.2 → 0.4.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (2)
  1. package/README.md +116 -0
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -88,6 +88,15 @@ Large-scale MCP tool use evaluation across 45+ categories. Tests agent capabilit
  - **Categories**: Browser, Finance, Code Analysis, and 40+ more
  - **Learn more**: [MCPToolBench++ Paper](https://arxiv.org/pdf/2508.07575) | [GitHub](https://github.com/mcp-tool-bench/MCPToolBenchPP)
 
+ ### GSM8K
+ Grade-school math word problems testing mathematical reasoning and chain-of-thought capabilities.
+
+ - **Dataset**: [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)
+ - **Task**: Solve math word problems with step-by-step reasoning
+ - **Evaluation**: Numeric answer correctness with tolerance
+ - **Problem Types**: Multi-step arithmetic and basic algebra
+ - **Learn more**: [GSM8K Paper](https://arxiv.org/abs/2110.14168) | [GitHub](https://github.com/openai/grade-school-math)
+
  ```bash
  # Run SWE-bench Verified (default - manually validated tests)
  mcpbr run -c config.yaml
@@ -98,6 +107,9 @@ mcpbr run -c config.yaml -b swe-bench-lite
  # Run SWE-bench Full (2,294 tasks, complete benchmark)
  mcpbr run -c config.yaml -b swe-bench-full
 
+ # Run GSM8K
+ mcpbr run -c config.yaml --benchmark gsm8k -n 50
+
  # List all available benchmarks
  mcpbr benchmarks
  ```
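The GSM8K entry above scores runs by "numeric answer correctness with tolerance". A minimal sketch of how such a check can work is shown below; the helper name, the number-extraction regex, and the tolerance value are illustrative assumptions, not mcpbr's actual implementation:

```python
import re

def check_gsm8k_answer(model_output: str, gold: float, tol: float = 1e-4) -> bool:
    """Extract the last number from a model's chain-of-thought output and
    compare it to the gold answer within a relative tolerance."""
    # Match integers and decimals, allowing a sign and thousands separators
    matches = re.findall(r"-?\d[\d,]*\.?\d*", model_output)
    if not matches:
        return False
    try:
        predicted = float(matches[-1].replace(",", ""))
    except ValueError:
        return False
    return abs(predicted - gold) <= tol * max(1.0, abs(gold))

# The last number in the output is what gets compared
print(check_gsm8k_answer("She has 3 + 4 = 7 apples. The answer is 7.", 7.0))  # True
```

Taking the *last* number is a common convention for GSM8K-style grading, since the final answer of a step-by-step solution comes after the intermediate arithmetic.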
@@ -320,6 +332,110 @@ max_concurrent: 4
  mcpbr run --config config.yaml
  ```
 
+ ## Infrastructure Modes
+
+ mcpbr supports running evaluations on different infrastructure platforms, allowing you to scale evaluations or offload compute-intensive tasks to cloud VMs.
+
+ ### Local (Default)
+
+ Run evaluations on your local machine:
+
+ ```yaml
+ infrastructure:
+   mode: local  # default
+ ```
+
+ This is the default mode: evaluations run directly on your machine using local Docker containers.
+
+ ### Azure VM
+
+ Run evaluations on Azure Virtual Machines with automatic provisioning and cleanup:
+
+ ```yaml
+ infrastructure:
+   mode: azure
+   azure:
+     resource_group: mcpbr-benchmarks
+     location: eastus
+     cpu_cores: 10
+     memory_gb: 40
+ ```
+
+ **Key features:**
+ - Zero manual VM setup - provisioned automatically from config
+ - Automatic Docker, Python, and mcpbr installation
+ - Test task validation before full evaluation
+ - Auto-cleanup after completion (configurable)
+ - Cost-optimized with automatic VM deletion
+
+ **Example usage:**
+ ```bash
+ # Run evaluation on Azure VM
+ mcpbr run -c azure-config.yaml
+
+ # VM is automatically created, evaluation runs, results are downloaded, VM is deleted
+ ```
+
+ See [docs/infrastructure/azure.md](docs/infrastructure/azure.md) for full documentation including:
+ - Prerequisites and authentication
+ - VM sizing and cost estimation
+ - Debugging with `preserve_on_error`
+ - Troubleshooting guide
+
+ ## Side-by-Side Server Comparison
+
+ Compare two MCP servers head-to-head in a single evaluation run to see which implementation performs better.
+
+ ### Quick Example
+
+ ```yaml
+ # comparison-config.yaml
+ comparison_mode: true
+
+ mcp_server_a:
+   name: "Task Queries"
+   command: node
+   args: [build/index.js]
+   cwd: /path/to/task-queries
+
+ mcp_server_b:
+   name: "Edge Identity"
+   command: node
+   args: [build/index.js]
+   cwd: /path/to/edge-identity
+
+ benchmark: swe-bench-lite
+ sample_size: 10
+ ```
+
+ ```bash
+ mcpbr run -c comparison-config.yaml -o results.json
+ ```
+
+ ### Results Output
+
+ ```text
+ Side-by-Side MCP Server Comparison
+
+ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
+ ┃ Metric            ┃ Task Queries ┃ Edge Identity ┃ Δ (A - B) ┃
+ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
+ │ Resolved Tasks    │ 4/10         │ 2/10          │ +2        │
+ │ Resolution Rate   │ 40.0%        │ 20.0%         │ +100.0%   │
+ └───────────────────┴──────────────┴───────────────┴───────────┘
+
+ ✓ Task Queries unique wins: 2 tasks
+   - django__django-12286
+   - astropy__astropy-7606
+ ```
+
+ **Use cases:**
+ - **A/B testing**: Compare optimized vs. baseline implementations
+ - **Tool evaluation**: Test different MCP tool sets
+ - **Version comparison**: Benchmark v2.0 vs. v1.5
+
+ See [docs/comparison-mode.md](docs/comparison-mode.md) for complete documentation.
+
  ## Claude Code Integration
 
  [![Claude Code Ready](https://img.shields.io/badge/Claude_Code-Ready-5865F2?style=flat&logo=anthropic)](https://claude.ai/download)
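The comparison-mode output added to the README reports a Δ (A - B) column and per-server "unique wins". Both figures can be derived from the two sets of resolved task IDs; the sketch below is illustrative only (the function name and dict layout are assumptions, not mcpbr's code):

```python
def compare_runs(resolved_a: set, resolved_b: set) -> dict:
    """Head-to-head summary: resolved-task delta and wins unique to each server."""
    return {
        "delta": len(resolved_a) - len(resolved_b),   # the Δ (A - B) on resolved tasks
        "unique_a": sorted(resolved_a - resolved_b),  # tasks only server A solved
        "unique_b": sorted(resolved_b - resolved_a),  # tasks only server B solved
    }

# Mirroring the sample table: server A resolves 4 tasks, server B resolves 2 of them
summary = compare_runs(
    {"django__django-12286", "astropy__astropy-7606", "task-3", "task-4"},
    {"task-3", "task-4"},
)
# summary["delta"] is +2 and summary["unique_a"] lists the two tasks only A solved
```

Set difference is the natural fit here because a task either is or is not in a server's resolved list, and the "unique wins" section of the report is exactly the asymmetric difference.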
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "mcpbr-cli",
-   "version": "0.4.2",
+   "version": "0.4.4",
    "description": "Model Context Protocol Benchmark Runner - CLI tool for evaluating MCP servers",
    "keywords": [
      "mcpbr",