@greynewell/mcpbr 0.4.14 → 0.4.16
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +9 -5
- package/package.json +1 -1
package/README.md
CHANGED
@@ -56,7 +56,7 @@ mcpbr runs controlled experiments: same model, same tasks, same environment - th
 
 ## Supported Benchmarks
 
-mcpbr supports 25+ benchmarks across 8 categories through a flexible abstraction layer:
+mcpbr supports 30+ benchmarks across 10 categories through a flexible abstraction layer:
 
 | Category | Benchmarks |
 |----------|-----------|
@@ -67,7 +67,11 @@ mcpbr supports 25+ benchmarks across 8 categories through a flexible abstraction
 | **Tool Use & Agents** | [MCPToolBench++](https://greynewell.github.io/mcpbr/benchmarks/mcptoolbench/), [ToolBench](https://greynewell.github.io/mcpbr/benchmarks/toolbench/), [AgentBench](https://greynewell.github.io/mcpbr/benchmarks/agentbench/), [WebArena](https://greynewell.github.io/mcpbr/benchmarks/webarena/), [TerminalBench](https://greynewell.github.io/mcpbr/benchmarks/terminalbench/), [InterCode](https://greynewell.github.io/mcpbr/benchmarks/intercode/) |
 | **ML Research** | [MLAgentBench](https://greynewell.github.io/mcpbr/benchmarks/mlagentbench/) |
 | **Code Understanding** | [RepoQA](https://greynewell.github.io/mcpbr/benchmarks/repoqa/) |
+| **Multimodal** | MMMU |
+| **Long Context** | LongBench |
+| **Safety & Adversarial** | Adversarial (HarmBench) |
 | **Security** | [CyberGym](https://greynewell.github.io/mcpbr/benchmarks/cybergym/) |
+| **Custom** | User-defined benchmarks via YAML |
 
 ### Featured Benchmarks
 
@@ -1426,10 +1430,10 @@ We're building the defacto standard for MCP server benchmarking! Our [v1.0 Roadm
 - Cost analysis in reports
 
 **Phase 2: Benchmarks** (v0.4.0)
--
--
-- Custom
--
+- ✅ 30+ benchmarks across 10 categories
+- ✅ Custom benchmark YAML support
+- ✅ Custom metrics, failure analysis, sampling strategies
+- ✅ Dataset versioning, latency metrics, GPU support, few-shot learning
 
 **Phase 3: Developer Experience** (v0.5.0)
 - Real-time dashboard