@greynewell/mcpbr 0.4.14 → 0.4.16
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +9 -5
- package/package.json +1 -1
package/README.md
CHANGED
@@ -56,7 +56,7 @@ mcpbr runs controlled experiments: same model, same tasks, same environment - th
 
 ## Supported Benchmarks
 
-mcpbr supports 25+ benchmarks across 8 categories through a flexible abstraction layer:
+mcpbr supports 30+ benchmarks across 10 categories through a flexible abstraction layer:
 
 | Category | Benchmarks |
 |----------|-----------|
@@ -67,7 +67,11 @@ mcpbr supports 25+ benchmarks across 8 categories through a flexible abstraction
 | **Tool Use & Agents** | [MCPToolBench++](https://greynewell.github.io/mcpbr/benchmarks/mcptoolbench/), [ToolBench](https://greynewell.github.io/mcpbr/benchmarks/toolbench/), [AgentBench](https://greynewell.github.io/mcpbr/benchmarks/agentbench/), [WebArena](https://greynewell.github.io/mcpbr/benchmarks/webarena/), [TerminalBench](https://greynewell.github.io/mcpbr/benchmarks/terminalbench/), [InterCode](https://greynewell.github.io/mcpbr/benchmarks/intercode/) |
 | **ML Research** | [MLAgentBench](https://greynewell.github.io/mcpbr/benchmarks/mlagentbench/) |
 | **Code Understanding** | [RepoQA](https://greynewell.github.io/mcpbr/benchmarks/repoqa/) |
+| **Multimodal** | MMMU |
+| **Long Context** | LongBench |
+| **Safety & Adversarial** | Adversarial (HarmBench) |
 | **Security** | [CyberGym](https://greynewell.github.io/mcpbr/benchmarks/cybergym/) |
+| **Custom** | User-defined benchmarks via YAML |
 
 ### Featured Benchmarks
 
@@ -1426,10 +1430,10 @@ We're building the defacto standard for MCP server benchmarking! Our [v1.0 Roadm
 - Cost analysis in reports
 
 **Phase 2: Benchmarks** (v0.4.0)
--
--
-- Custom
--
+- ✅ 30+ benchmarks across 10 categories
+- ✅ Custom benchmark YAML support
+- ✅ Custom metrics, failure analysis, sampling strategies
+- ✅ Dataset versioning, latency metrics, GPU support, few-shot learning
 
 **Phase 3: Developer Experience** (v0.5.0)
 - Real-time dashboard