@greynewell/mcpbr 0.4.14 → 0.4.16

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (2)
  1. package/README.md +9 -5
  2. package/package.json +1 -1
package/README.md CHANGED
@@ -56,7 +56,7 @@ mcpbr runs controlled experiments: same model, same tasks, same environment - th
 
 ## Supported Benchmarks
 
-mcpbr supports 25+ benchmarks across 8 categories through a flexible abstraction layer:
+mcpbr supports 30+ benchmarks across 10 categories through a flexible abstraction layer:
 
 | Category | Benchmarks |
 |----------|-----------|
@@ -67,7 +67,11 @@ mcpbr supports 25+ benchmarks across 8 categories through a flexible abstraction
 | **Tool Use & Agents** | [MCPToolBench++](https://greynewell.github.io/mcpbr/benchmarks/mcptoolbench/), [ToolBench](https://greynewell.github.io/mcpbr/benchmarks/toolbench/), [AgentBench](https://greynewell.github.io/mcpbr/benchmarks/agentbench/), [WebArena](https://greynewell.github.io/mcpbr/benchmarks/webarena/), [TerminalBench](https://greynewell.github.io/mcpbr/benchmarks/terminalbench/), [InterCode](https://greynewell.github.io/mcpbr/benchmarks/intercode/) |
 | **ML Research** | [MLAgentBench](https://greynewell.github.io/mcpbr/benchmarks/mlagentbench/) |
 | **Code Understanding** | [RepoQA](https://greynewell.github.io/mcpbr/benchmarks/repoqa/) |
+| **Multimodal** | MMMU |
+| **Long Context** | LongBench |
+| **Safety & Adversarial** | Adversarial (HarmBench) |
 | **Security** | [CyberGym](https://greynewell.github.io/mcpbr/benchmarks/cybergym/) |
+| **Custom** | User-defined benchmarks via YAML |
 
 ### Featured Benchmarks
 
 
@@ -1426,10 +1430,10 @@ We're building the defacto standard for MCP server benchmarking! Our [v1.0 Roadm
 - Cost analysis in reports
 
 **Phase 2: Benchmarks** (v0.4.0)
-- HumanEval, MBPP, ToolBench
-- GAIA for general AI capabilities
-- Custom benchmark YAML support
-- SWE-bench Verified
+- 30+ benchmarks across 10 categories
+- Custom benchmark YAML support
+- Custom metrics, failure analysis, sampling strategies
+- ✅ Dataset versioning, latency metrics, GPU support, few-shot learning
 
 **Phase 3: Developer Experience** (v0.5.0)
 - Real-time dashboard
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "@greynewell/mcpbr",
-  "version": "0.4.14",
+  "version": "0.4.16",
   "description": "Model Context Protocol Benchmark Runner - CLI tool for evaluating MCP servers",
   "keywords": [
     "mcpbr",
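
Both versions of the README advertise "User-defined benchmarks via YAML". The actual mcpbr schema is not shown in this diff, so the sketch below is purely illustrative: every field name here is an assumption, not the documented format.

```yaml
# Hypothetical custom benchmark definition.
# Field names are illustrative assumptions, not the real mcpbr schema.
name: my-internal-eval
description: Smoke-test tasks for an internal MCP server
tasks:
  - id: list-files
    prompt: "List all files in the workspace root."
    expected: "README.md"   # assumed substring match
  - id: add-numbers
    prompt: "What is 17 + 25?"
    expected: "42"
metrics:
  - exact_match
```

Consult the project's benchmark documentation for the real file format before writing one.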