nodebench-mcp 2.4.0 → 2.8.0
- package/NODEBENCH_AGENTS.md +8 -4
- package/README.md +56 -19
- package/dist/__tests__/evalHarness.test.js +1 -1
- package/dist/__tests__/gaiaCapabilityFilesEval.test.js +543 -57
- package/dist/__tests__/gaiaCapabilityFilesEval.test.js.map +1 -1
- package/dist/__tests__/tools.test.js +664 -6
- package/dist/__tests__/tools.test.js.map +1 -1
- package/dist/index.js +30 -6
- package/dist/index.js.map +1 -1
- package/dist/tools/boilerplateTools.d.ts +11 -0
- package/dist/tools/boilerplateTools.js +500 -0
- package/dist/tools/boilerplateTools.js.map +1 -0
- package/dist/tools/cCompilerBenchmarkTools.d.ts +14 -0
- package/dist/tools/cCompilerBenchmarkTools.js +453 -0
- package/dist/tools/cCompilerBenchmarkTools.js.map +1 -0
- package/dist/tools/figmaFlowTools.d.ts +13 -0
- package/dist/tools/figmaFlowTools.js +183 -0
- package/dist/tools/figmaFlowTools.js.map +1 -0
- package/dist/tools/flickerDetectionTools.d.ts +14 -0
- package/dist/tools/flickerDetectionTools.js +231 -0
- package/dist/tools/flickerDetectionTools.js.map +1 -0
- package/dist/tools/localFileTools.d.ts +1 -0
- package/dist/tools/localFileTools.js +1926 -27
- package/dist/tools/localFileTools.js.map +1 -1
- package/dist/tools/metaTools.js +96 -2
- package/dist/tools/metaTools.js.map +1 -1
- package/dist/tools/progressiveDiscoveryTools.d.ts +14 -0
- package/dist/tools/progressiveDiscoveryTools.js +222 -0
- package/dist/tools/progressiveDiscoveryTools.js.map +1 -0
- package/dist/tools/researchWritingTools.d.ts +12 -0
- package/dist/tools/researchWritingTools.js +573 -0
- package/dist/tools/researchWritingTools.js.map +1 -0
- package/dist/tools/securityTools.js +128 -0
- package/dist/tools/securityTools.js.map +1 -1
- package/dist/tools/toolRegistry.d.ts +70 -0
- package/dist/tools/toolRegistry.js +1437 -0
- package/dist/tools/toolRegistry.js.map +1 -0
- package/package.json +6 -3
package/NODEBENCH_AGENTS.md
CHANGED
@@ -21,7 +21,7 @@ Add to `~/.claude/settings.json`:
 }
 ```
 
-Restart Claude Code.
+Restart Claude Code. 89 tools available immediately.
 
 **→ Quick Refs:** After setup, run `getMethodology("overview")` | First task? See [Verification Cycle](#verification-cycle-workflow) | New to codebase? See [Environment Setup](#environment-setup)
 
@@ -189,8 +189,9 @@ NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 n
 ```
 
 Modes:
-- Recommended (more stable): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag`
-- More realistic (higher variance): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (
+- Recommended (more stable): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag` (single deterministic extract + answer)
+- More realistic (higher variance): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (small tool loop)
+Web lane only: `NODEBENCH_GAIA_CAPABILITY_FORCE_WEB_SEARCH=1` and/or `NODEBENCH_GAIA_CAPABILITY_FORCE_FETCH_URL=1`
 
 Run all public lanes:
 ```bash
@@ -246,7 +247,7 @@ Use `getMethodology("overview")` to see all available workflows.
 | Category | Tools | When to Use |
 |----------|-------|-------------|
 | **Web** | `web_search`, `fetch_url` | Research, reading docs, market validation |
-| **Local Files** | `read_pdf_text`, `read_xlsx_file`, `read_csv_file` | Deterministic parsing of local attachments (GAIA file-backed lane) |
+| **Local Files** | `read_pdf_text`, `pdf_search_text`, `read_xlsx_file`, `xlsx_select_rows`, `xlsx_aggregate`, `read_csv_file`, `csv_select_rows`, `csv_aggregate`, `read_text_file`, `read_json_file`, `json_select`, `read_jsonl_file`, `zip_list_files`, `zip_read_text_file`, `zip_extract_file`, `read_docx_text`, `read_pptx_text` | Deterministic parsing and aggregation of local attachments (GAIA file-backed lane) |
 | **GitHub** | `search_github`, `analyze_repo` | Finding libraries, studying implementations |
 | **Verification** | `start_cycle`, `log_phase`, `complete_cycle` | Tracking the flywheel process |
 | **Eval** | `start_eval_run`, `log_test_result` | Test case management |
@@ -256,6 +257,9 @@ Use `getMethodology("overview")` to see all available workflows.
 | **Bootstrap** | `discover_infrastructure`, `triple_verify`, `self_implement` | Self-setup, triple verification |
 | **Autonomous** | `assess_risk`, `decide_re_update`, `run_self_maintenance` | Risk-aware execution, self-maintenance |
 | **Parallel Agents** | `claim_agent_task`, `release_agent_task`, `list_agent_tasks`, `assign_agent_role`, `get_agent_role`, `log_context_budget`, `run_oracle_comparison`, `get_parallel_status` | Multi-agent coordination, task locking, role specialization, oracle testing |
+| **LLM** | `call_llm`, `extract_structured_data`, `benchmark_models` | LLM calling, structured extraction, model comparison |
+| **Security** | `scan_dependencies`, `run_code_analysis` | Dependency auditing, static code analysis |
+| **Platform** | `query_daily_brief`, `query_funding_entities`, `query_research_queue`, `publish_to_queue` | Convex platform bridge: intelligence, funding, research, publishing |
 | **Meta** | `findTools`, `getMethodology` | Discover tools, get workflow guides |
 
 **→ Quick Refs:** Find tools by keyword: `findTools({ query: "verification" })` | Get workflow guide: `getMethodology({ topic: "..." })` | See [Methodology Topics](#methodology-topics) for all topics
package/README.md
CHANGED
@@ -39,7 +39,7 @@ Every additional tool call produces a concrete artifact — an issue found, a ri
 
 **QA engineer** — Transitioned a manual QA workflow website into an AI agent-driven app for a pet care messaging platform. Uses NodeBench's quality gates, verification cycles, and eval runs to ensure the AI agent handles edge cases that manual QA caught but bare AI agents miss.
 
-Both found different subsets of the
+Both found different subsets of the 129 tools useful — which is why v2.8 ships with `--preset` gating to load only what you need.
 
 ---
 
@@ -77,10 +77,10 @@ Tasks 1-3 start with zero prior knowledge. By task 9, the agent finds 2+ relevan
 ### Install (30 seconds)
 
 ```bash
-# Claude Code CLI — all
+# Claude Code CLI — all 129 tools
 claude mcp add nodebench -- npx -y nodebench-mcp
 
-# Or start lean —
+# Or start lean — 39 tools, ~70% less token overhead
 claude mcp add nodebench -- npx -y nodebench-mcp --preset lite
 ```
 
@@ -117,6 +117,33 @@ export GEMINI_API_KEY="your-key" # Web search + vision (recommended)
 export GITHUB_TOKEN="your-token" # GitHub (higher rate limits)
 ```
 
+### Capability benchmarking (GAIA, gated)
+
+NodeBench MCP treats tools as "Access". To measure real capability lift, we benchmark baseline (LLM-only) vs tool-augmented accuracy on GAIA (gated).
+
+Notes:
+- GAIA fixtures and attachments are written under `.cache/gaia` (gitignored). Do not commit GAIA content.
+- Fixture generation requires `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN`.
+
+Web lane (web_search + fetch_url):
+```bash
+npm run mcp:dataset:gaia:capability:refresh
+NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:test
+```
+
+File-backed lane (PDF / XLSX / CSV / DOCX / PPTX / JSON / JSONL / TXT / ZIP via `local_file` tools):
+```bash
+npm run mcp:dataset:gaia:capability:files:refresh
+NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:files:test
+```
+
+Modes:
+- Stable: `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag`
+- More realistic: `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent`
+
+Notes:
+- ZIP attachments require `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (multi-step extract -> parse).
+
 ---
 
 ## What You Get
@@ -171,7 +198,7 @@ Research → Risk → Implement → Test (3 layers) → Eval → Gate → Learn
 **Outer loop** (over time): Eval-driven development ensures improvement.
 **Together**: The AI Flywheel — every verification produces eval artifacts, every regression triggers verification.
 
-Ask the agent: `Use getMethodology("overview")` to see all
+Ask the agent: `Use getMethodology("overview")` to see all 19 methodology topics.
 
 ---
 
@@ -200,20 +227,20 @@ Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www
 
 ---
 
-## Toolset Gating (v2.
+## Toolset Gating (v2.8)
 
-
+129 tools means tens of thousands of tokens of schema per API call. If you only need core methodology, gate the toolset:
 
 ### Presets
 
 ```bash
-# Lite —
+# Lite — 39 tools (verification, eval, gates, learning, recon, security, boilerplate + meta + discovery)
 claude mcp add nodebench -- npx -y nodebench-mcp --preset lite
 
-# Core —
+# Core — 87 tools (adds flywheel, bootstrap, self-eval, llm, platform, research_writing, flicker_detection, figma_flow, benchmark + meta + discovery)
 claude mcp add nodebench -- npx -y nodebench-mcp --preset core
 
-# Full — all
+# Full — all 129 tools (default)
 claude mcp add nodebench -- npx -y nodebench-mcp
 ```
 
@@ -248,22 +275,32 @@ npx nodebench-mcp --help
 | Toolset | Tools | What it covers |
 |---|---|---|
 | verification | 8 | Cycles, gaps, triple-verify, status |
-| eval |
+| eval | 6 | Eval runs, results, comparison, diff |
 | quality_gate | 4 | Gates, presets, history |
 | learning | 4 | Knowledge, search, record |
-| recon |
+| recon | 7 | Research, findings, framework checks, risk |
 | flywheel | 4 | Mandatory flywheel, promote, investigate |
-| bootstrap |
+| bootstrap | 11 | Project setup, agents.md, self-implement, autonomous, test runner |
 | self_eval | 6 | Trajectory analysis, health reports |
 | parallel | 10 | Task locks, roles, context budget, oracle |
-| vision |
-| ui_capture |
+| vision | 4 | Screenshot analysis, UI capture, diff |
+| ui_capture | 2 | Playwright-based capture |
 | web | 2 | Web search, URL fetch |
-| github | 3 | Repo search, analysis |
-| docs |
-| local_file |
-
-
+| github | 3 | Repo search, analysis, monitoring |
+| docs | 4 | Documentation generation, reports |
+| local_file | 17 | Deterministic parsing (CSV/XLSX/PDF/DOCX/PPTX/ZIP/JSON/JSONL/TXT) |
+| llm | 3 | LLM calling, extraction, benchmarking |
+| security | 3 | Dependency scanning, code analysis, terminal security scanning |
+| platform | 4 | Convex bridge: briefs, funding, research, publish |
+| research_writing | 8 | Academic paper polishing, translation, de-AI, logic check, captions, experiment analysis, reviewer simulation |
+| flicker_detection | 5 | Android flicker detection + SSIM tooling |
+| figma_flow | 4 | Figma flow analysis + rendering |
+| boilerplate | 2 | Scaffold NodeBench projects + status |
+| benchmark | 3 | Autonomous benchmark lifecycle (C-compiler pattern) |
+
+Always included (regardless of gating):
+- Meta: `findTools`, `getMethodology`
+- Discovery: `discover_tools`, `get_tool_quick_ref`, `get_workflow_chain`
 
 ---
 
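Reviewer note: the preset sizes quoted in the README diff (39, 87, 129) can be cross-checked against the per-toolset counts in the new table. The sketch below is only an arithmetic check, not the package's gating code. `toolsetSizes`, `presetToolCount`, and the preset arrays are hypothetical names introduced for illustration, and it assumes that "gates" in the preset comment means the `quality_gate` toolset and "self-eval" means `self_eval`.

```typescript
// Per-toolset tool counts, copied from the "Toolset | Tools" table in the README diff.
const toolsetSizes: Record<string, number> = {
  verification: 8, eval: 6, quality_gate: 4, learning: 4, recon: 7,
  flywheel: 4, bootstrap: 11, self_eval: 6, parallel: 10, vision: 4,
  ui_capture: 2, web: 2, github: 3, docs: 4, local_file: 17, llm: 3,
  security: 3, platform: 4, research_writing: 8, flicker_detection: 5,
  figma_flow: 4, boilerplate: 2, benchmark: 3,
};

// Tools loaded regardless of gating: 2 meta tools plus 3 discovery tools.
const ALWAYS_INCLUDED = 2 + 3;

// Preset membership as listed in the README comments
// (assumption: "gates" = quality_gate, "self-eval" = self_eval).
const lite = [
  "verification", "eval", "quality_gate", "learning",
  "recon", "security", "boilerplate",
];
const core = [
  ...lite, "flywheel", "bootstrap", "self_eval", "llm", "platform",
  "research_writing", "flicker_detection", "figma_flow", "benchmark",
];
const full = Object.keys(toolsetSizes);

// Hypothetical helper: sum the sizes of the selected toolsets,
// then add the always-included meta and discovery tools.
function presetToolCount(toolsets: string[]): number {
  const gated = toolsets.reduce(
    (sum, name) => sum + (toolsetSizes[name] ?? 0),
    0,
  );
  return gated + ALWAYS_INCLUDED;
}

console.log(presetToolCount(lite)); // 39
console.log(presetToolCount(core)); // 87
console.log(presetToolCount(full)); // 129
```

Under these assumptions the table and the preset comments agree with each other. The "89 tools available immediately" line added to NODEBENCH_AGENTS.md does not match any of the three totals.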
package/dist/__tests__/evalHarness.test.js
CHANGED
@@ -592,7 +592,7 @@ describe("Scenario: Meta Tool Discovery", () => {
 }, "meta");
 expect(result.title).toContain("Overview");
 const topics = Object.keys(result.steps[0].topics);
-expect(topics.length).toBe(
+expect(topics.length).toBe(19);
 });
 it("Step 4: Get specific methodology", async () => {
 const methodologies = [