nodebench-mcp 2.4.0 → 2.8.0 (npm)

This diff shows the changes between two publicly released versions of the package, as they appear in the public registry. It is provided for informational purposes only.

Files changed (38)

package/NODEBENCH_AGENTS.md +8 -4
package/README.md +56 -19
package/dist/__tests__/evalHarness.test.js +1 -1
package/dist/__tests__/gaiaCapabilityFilesEval.test.js +543 -57
package/dist/__tests__/gaiaCapabilityFilesEval.test.js.map +1 -1
package/dist/__tests__/tools.test.js +664 -6
package/dist/__tests__/tools.test.js.map +1 -1
package/dist/index.js +30 -6
package/dist/index.js.map +1 -1
package/dist/tools/boilerplateTools.d.ts +11 -0
package/dist/tools/boilerplateTools.js +500 -0
package/dist/tools/boilerplateTools.js.map +1 -0
package/dist/tools/cCompilerBenchmarkTools.d.ts +14 -0
package/dist/tools/cCompilerBenchmarkTools.js +453 -0
package/dist/tools/cCompilerBenchmarkTools.js.map +1 -0
package/dist/tools/figmaFlowTools.d.ts +13 -0
package/dist/tools/figmaFlowTools.js +183 -0
package/dist/tools/figmaFlowTools.js.map +1 -0
package/dist/tools/flickerDetectionTools.d.ts +14 -0
package/dist/tools/flickerDetectionTools.js +231 -0
package/dist/tools/flickerDetectionTools.js.map +1 -0
package/dist/tools/localFileTools.d.ts +1 -0
package/dist/tools/localFileTools.js +1926 -27
package/dist/tools/localFileTools.js.map +1 -1
package/dist/tools/metaTools.js +96 -2
package/dist/tools/metaTools.js.map +1 -1
package/dist/tools/progressiveDiscoveryTools.d.ts +14 -0
package/dist/tools/progressiveDiscoveryTools.js +222 -0
package/dist/tools/progressiveDiscoveryTools.js.map +1 -0
package/dist/tools/researchWritingTools.d.ts +12 -0
package/dist/tools/researchWritingTools.js +573 -0
package/dist/tools/researchWritingTools.js.map +1 -0
package/dist/tools/securityTools.js +128 -0
package/dist/tools/securityTools.js.map +1 -1
package/dist/tools/toolRegistry.d.ts +70 -0
package/dist/tools/toolRegistry.js +1437 -0
package/dist/tools/toolRegistry.js.map +1 -0
package/package.json +6 -3

package/NODEBENCH_AGENTS.md CHANGED

@@ -21,7 +21,7 @@ Add to `~/.claude/settings.json`:
 }
 ```
-Restart Claude Code. 56 tools available immediately.
+Restart Claude Code. 89 tools available immediately.
 **→ Quick Refs:** After setup, run `getMethodology("overview")` | First task? See [Verification Cycle](#verification-cycle-workflow) | New to codebase? See [Environment Setup](#environment-setup)
@@ -189,8 +189,9 @@ NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 n
 ```
 Modes:
-- Recommended (more stable): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag`
-- More realistic (higher variance): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (optional `NODEBENCH_GAIA_CAPABILITY_FORCE_WEB_SEARCH=1`)
+- Recommended (more stable): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag` (single deterministic extract + answer)
+- More realistic (higher variance): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (small tool loop)
+Web lane only: `NODEBENCH_GAIA_CAPABILITY_FORCE_WEB_SEARCH=1` and/or `NODEBENCH_GAIA_CAPABILITY_FORCE_FETCH_URL=1`
 Run all public lanes:
 ```bash
@@ -246,7 +247,7 @@ Use `getMethodology("overview")` to see all available workflows.
 | Category | Tools | When to Use |
 |----------|-------|-------------|
 | **Web** | `web_search`, `fetch_url` | Research, reading docs, market validation |
-| **Local Files** | `read_pdf_text`, `read_xlsx_file`, `read_csv_file` | Deterministic parsing of local attachments (GAIA file-backed lane) |
+| **Local Files** | `read_pdf_text`, `pdf_search_text`, `read_xlsx_file`, `xlsx_select_rows`, `xlsx_aggregate`, `read_csv_file`, `csv_select_rows`, `csv_aggregate`, `read_text_file`, `read_json_file`, `json_select`, `read_jsonl_file`, `zip_list_files`, `zip_read_text_file`, `zip_extract_file`, `read_docx_text`, `read_pptx_text` | Deterministic parsing and aggregation of local attachments (GAIA file-backed lane) |
 | **GitHub** | `search_github`, `analyze_repo` | Finding libraries, studying implementations |
 | **Verification** | `start_cycle`, `log_phase`, `complete_cycle` | Tracking the flywheel process |
 | **Eval** | `start_eval_run`, `log_test_result` | Test case management |
@@ -256,6 +257,9 @@ Use `getMethodology("overview")` to see all available workflows.
 | **Bootstrap** | `discover_infrastructure`, `triple_verify`, `self_implement` | Self-setup, triple verification |
 | **Autonomous** | `assess_risk`, `decide_re_update`, `run_self_maintenance` | Risk-aware execution, self-maintenance |
 | **Parallel Agents** | `claim_agent_task`, `release_agent_task`, `list_agent_tasks`, `assign_agent_role`, `get_agent_role`, `log_context_budget`, `run_oracle_comparison`, `get_parallel_status` | Multi-agent coordination, task locking, role specialization, oracle testing |
+| **LLM** | `call_llm`, `extract_structured_data`, `benchmark_models` | LLM calling, structured extraction, model comparison |
+| **Security** | `scan_dependencies`, `run_code_analysis` | Dependency auditing, static code analysis |
+| **Platform** | `query_daily_brief`, `query_funding_entities`, `query_research_queue`, `publish_to_queue` | Convex platform bridge: intelligence, funding, research, publishing |
 | **Meta** | `findTools`, `getMethodology` | Discover tools, get workflow guides |
 **→ Quick Refs:** Find tools by keyword: `findTools({ query: "verification" })` | Get workflow guide: `getMethodology({ topic: "..." })` | See [Methodology Topics](#methodology-topics) for all topics
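
The mode and lane flags documented in the hunk above can be read as a small configuration step. The following is a minimal sketch and not the package's actual code: `resolveGaiaCapabilityConfig` is a hypothetical helper, falling back to `rag` when the variable is unset is an assumption (the docs only call it "recommended"), and treating only the literal value `1` as "enabled" is also an assumption.

```javascript
// Hypothetical helper: shows one way the documented env vars could be interpreted.
// Not taken from nodebench-mcp; the default and the validation are assumptions.
function resolveGaiaCapabilityConfig(env, lane) {
  // Assumption: fall back to "rag" (the documented "recommended" mode) when unset.
  const mode = env.NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE ?? "rag";
  if (mode !== "rag" && mode !== "agent") {
    throw new Error(`Unknown tools mode: ${mode}`);
  }
  // The docs describe both force flags as "Web lane only".
  const isWebLane = lane === "web";
  return {
    mode,
    forceWebSearch: isWebLane && env.NODEBENCH_GAIA_CAPABILITY_FORCE_WEB_SEARCH === "1",
    forceFetchUrl: isWebLane && env.NODEBENCH_GAIA_CAPABILITY_FORCE_FETCH_URL === "1",
  };
}

// Example: agent mode on the web lane with forced web search.
const exampleConfig = resolveGaiaCapabilityConfig(
  { NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE: "agent", NODEBENCH_GAIA_CAPABILITY_FORCE_WEB_SEARCH: "1" },
  "web"
);
console.log(exampleConfig);
```

In this sketch the force flags are ignored outside the web lane rather than rejected. That is a design choice made for the illustration, not documented behavior.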

package/README.md CHANGED

@@ -39,7 +39,7 @@ Every additional tool call produces a concrete artifact — an issue found, a ri
 **QA engineer** — Transitioned a manual QA workflow website into an AI agent-driven app for a pet care messaging platform. Uses NodeBench's quality gates, verification cycles, and eval runs to ensure the AI agent handles edge cases that manual QA caught but bare AI agents miss.
-Both found different subsets of the 75 tools useful — which is why v2.1 ships with `--preset` gating to load only what you need.
+Both found different subsets of the 129 tools useful — which is why v2.8 ships with `--preset` gating to load only what you need.
 ---
@@ -77,10 +77,10 @@ Tasks 1-3 start with zero prior knowledge. By task 9, the agent finds 2+ relevan
 ### Install (30 seconds)
 ```bash
-# Claude Code CLI — all 75 tools
+# Claude Code CLI — all 129 tools
 claude mcp add nodebench -- npx -y nodebench-mcp
-# Or start lean — 30 tools, ~60% less token overhead
+# Or start lean — 39 tools, ~70% less token overhead
 claude mcp add nodebench -- npx -y nodebench-mcp --preset lite
 ```
@@ -117,6 +117,33 @@ export GEMINI_API_KEY="your-key"        # Web search + vision (recommended)
 export GITHUB_TOKEN="your-token"        # GitHub (higher rate limits)
 ```
+### Capability benchmarking (GAIA, gated)
+NodeBench MCP treats tools as "Access". To measure real capability lift, we benchmark baseline (LLM-only) vs tool-augmented accuracy on GAIA (gated).
+Notes:
+- GAIA fixtures and attachments are written under `.cache/gaia` (gitignored). Do not commit GAIA content.
+- Fixture generation requires `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN`.
+Web lane (web_search + fetch_url):
+```bash
+npm run mcp:dataset:gaia:capability:refresh
+NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:test
+```
+File-backed lane (PDF / XLSX / CSV / DOCX / PPTX / JSON / JSONL / TXT / ZIP via `local_file` tools):
+```bash
+npm run mcp:dataset:gaia:capability:files:refresh
+NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:files:test
+```
+Modes:
+- Stable: `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag`
+- More realistic: `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent`
+Notes:
+- ZIP attachments require `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (multi-step extract -> parse).
 ---
 ## What You Get
@@ -171,7 +198,7 @@ Research → Risk → Implement → Test (3 layers) → Eval → Gate → Learn
 **Outer loop** (over time): Eval-driven development ensures improvement.
 **Together**: The AI Flywheel — every verification produces eval artifacts, every regression triggers verification.
-Ask the agent: `Use getMethodology("overview")` to see all 18 methodology topics.
+Ask the agent: `Use getMethodology("overview")` to see all 19 methodology topics.
 ---
@@ -200,20 +227,20 @@ Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www
 ---
-## Toolset Gating (v2.1)
+## Toolset Gating (v2.8)
-75 tools means ~19K tokens of schema per API call. If you only need core methodology, gate the toolset:
+129 tools means tens of thousands of tokens of schema per API call. If you only need core methodology, gate the toolset:
 ### Presets
 ```bash
-# Lite — 30 tools (verification, eval, gates, learning, recon)
+# Lite — 39 tools (verification, eval, gates, learning, recon, security, boilerplate + meta + discovery)
 claude mcp add nodebench -- npx -y nodebench-mcp --preset lite
-# Core — 50 tools (adds flywheel, bootstrap, self-eval)
+# Core — 87 tools (adds flywheel, bootstrap, self-eval, llm, platform, research_writing, flicker_detection, figma_flow, benchmark + meta + discovery)
 claude mcp add nodebench -- npx -y nodebench-mcp --preset core
-# Full — all 75 tools (default)
+# Full — all 129 tools (default)
 claude mcp add nodebench -- npx -y nodebench-mcp
 ```
@@ -248,22 +275,32 @@ npx nodebench-mcp --help
 | Toolset | Tools | What it covers |
 |---|---|---|
 | verification | 8 | Cycles, gaps, triple-verify, status |
-| eval | 5 | Eval runs, results, comparison |
+| eval | 6 | Eval runs, results, comparison, diff |
 | quality_gate | 4 | Gates, presets, history |
 | learning | 4 | Knowledge, search, record |
-| recon | 5 | Research, findings, framework checks |
+| recon | 7 | Research, findings, framework checks, risk |
 | flywheel | 4 | Mandatory flywheel, promote, investigate |
-| bootstrap | 4 | Project setup, agents.md, self-implement |
+| bootstrap | 11 | Project setup, agents.md, self-implement, autonomous, test runner |
 | self_eval | 6 | Trajectory analysis, health reports |
 | parallel | 10 | Task locks, roles, context budget, oracle |
-| vision | 3 | Screenshot analysis, UI capture |
-| ui_capture | 3 | Playwright-based capture |
+| vision | 4 | Screenshot analysis, UI capture, diff |
+| ui_capture | 2 | Playwright-based capture |
 | web | 2 | Web search, URL fetch |
-| github | 3 | Repo search, analysis |
-| docs | 3 | Documentation generation |
-| local_file | 3 | CSV, XLSX, PDF parsing |
-`findTools` and `getMethodology` are always available regardless of gating — agents can discover tools on demand.
+| github | 3 | Repo search, analysis, monitoring |
+| docs | 4 | Documentation generation, reports |
+| local_file | 17 | Deterministic parsing (CSV/XLSX/PDF/DOCX/PPTX/ZIP/JSON/JSONL/TXT) |
+| llm | 3 | LLM calling, extraction, benchmarking |
+| security | 3 | Dependency scanning, code analysis, terminal security scanning |
+| platform | 4 | Convex bridge: briefs, funding, research, publish |
+| research_writing | 8 | Academic paper polishing, translation, de-AI, logic check, captions, experiment analysis, reviewer simulation |
+| flicker_detection | 5 | Android flicker detection + SSIM tooling |
+| figma_flow | 4 | Figma flow analysis + rendering |
+| boilerplate | 2 | Scaffold NodeBench projects + status |
+| benchmark | 3 | Autonomous benchmark lifecycle (C-compiler pattern) |
+Always included (regardless of gating):
+- Meta: `findTools`, `getMethodology`
+- Discovery: `discover_tools`, `get_tool_quick_ref`, `get_workflow_chain`
 ---
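
The preset totals quoted in the new README (39, 87, 129) can be cross-checked against the per-toolset counts in the table above. This is a sketch for checking the arithmetic only: the counts are copied from the v2.8 table, preset membership is read from the preset comments (assuming "adds" means Core includes everything in Lite), and `presetToolCount` is a hypothetical name rather than anything exported by the package.

```javascript
// Tool counts per toolset, copied from the v2.8 README table.
const TOOLSET_COUNTS = {
  verification: 8, eval: 6, quality_gate: 4, learning: 4, recon: 7,
  flywheel: 4, bootstrap: 11, self_eval: 6, parallel: 10, vision: 4,
  ui_capture: 2, web: 2, github: 3, docs: 4, local_file: 17, llm: 3,
  security: 3, platform: 4, research_writing: 8, flicker_detection: 5,
  figma_flow: 4, boilerplate: 2, benchmark: 3,
};

// Always-included tools: 2 meta (findTools, getMethodology) + 3 discovery.
const ALWAYS_INCLUDED = 5;

// Preset membership as listed in the README preset comments.
// Assumption: "adds" means Core is a superset of Lite.
const LITE = ["verification", "eval", "quality_gate", "learning", "recon",
  "security", "boilerplate"];
const CORE = [...LITE, "flywheel", "bootstrap", "self_eval", "llm", "platform",
  "research_writing", "flicker_detection", "figma_flow", "benchmark"];
const PRESETS = { lite: LITE, core: CORE, full: Object.keys(TOOLSET_COUNTS) };

// Hypothetical helper: total tools exposed by a preset, including the always-on ones.
function presetToolCount(preset) {
  const toolsets = PRESETS[preset];
  if (!toolsets) throw new Error(`Unknown preset: ${preset}`);
  return ALWAYS_INCLUDED + toolsets.reduce((sum, name) => sum + TOOLSET_COUNTS[name], 0);
}

console.log(presetToolCount("lite"), presetToolCount("core"), presetToolCount("full"));
// prints: 39 87 129
```

Under these assumptions the table and the preset comments agree: the 23 gated toolsets sum to 124 tools, and the five always-included tools bring the full total to 129.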

package/dist/__tests__/evalHarness.test.js CHANGED

@@ -592,7 +592,7 @@ describe("Scenario: Meta Tool Discovery", () => {
         }, "meta");
         expect(result.title).toContain("Overview");
         const topics = Object.keys(result.steps[0].topics);
-        expect(topics.length).toBe(18);
+        expect(topics.length).toBe(19);
     });
     it("Step 4: Get specific methodology", async () => {
         const methodologies = [