nodebench-mcp 2.11.0 → 2.14.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (67)
  1. package/NODEBENCH_AGENTS.md +809 -809
  2. package/README.md +443 -431
  3. package/STYLE_GUIDE.md +477 -477
  4. package/dist/__tests__/evalHarness.test.js +1 -1
  5. package/dist/__tests__/gaiaCapabilityAudioEval.test.js +9 -14
  6. package/dist/__tests__/gaiaCapabilityAudioEval.test.js.map +1 -1
  7. package/dist/__tests__/gaiaCapabilityEval.test.js +88 -14
  8. package/dist/__tests__/gaiaCapabilityEval.test.js.map +1 -1
  9. package/dist/__tests__/gaiaCapabilityFilesEval.test.js +9 -5
  10. package/dist/__tests__/gaiaCapabilityFilesEval.test.js.map +1 -1
  11. package/dist/__tests__/gaiaCapabilityMediaEval.test.js +165 -17
  12. package/dist/__tests__/gaiaCapabilityMediaEval.test.js.map +1 -1
  13. package/dist/__tests__/helpers/answerMatch.d.ts +36 -7
  14. package/dist/__tests__/helpers/answerMatch.js +224 -35
  15. package/dist/__tests__/helpers/answerMatch.js.map +1 -1
  16. package/dist/__tests__/helpers/textLlm.d.ts +1 -1
  17. package/dist/__tests__/presetRealWorldBench.test.d.ts +1 -0
  18. package/dist/__tests__/presetRealWorldBench.test.js +850 -0
  19. package/dist/__tests__/presetRealWorldBench.test.js.map +1 -0
  20. package/dist/__tests__/tools.test.js +20 -7
  21. package/dist/__tests__/tools.test.js.map +1 -1
  22. package/dist/__tests__/toolsetGatingEval.test.js +21 -11
  23. package/dist/__tests__/toolsetGatingEval.test.js.map +1 -1
  24. package/dist/db.js +21 -0
  25. package/dist/db.js.map +1 -1
  26. package/dist/index.js +424 -327
  27. package/dist/index.js.map +1 -1
  28. package/dist/tools/agentBootstrapTools.js +258 -258
  29. package/dist/tools/boilerplateTools.js +144 -144
  30. package/dist/tools/cCompilerBenchmarkTools.js +33 -33
  31. package/dist/tools/documentationTools.js +59 -59
  32. package/dist/tools/flywheelTools.js +6 -6
  33. package/dist/tools/gitWorkflowTools.d.ts +11 -0
  34. package/dist/tools/gitWorkflowTools.js +580 -0
  35. package/dist/tools/gitWorkflowTools.js.map +1 -0
  36. package/dist/tools/learningTools.js +26 -26
  37. package/dist/tools/localFileTools.d.ts +3 -0
  38. package/dist/tools/localFileTools.js +3164 -125
  39. package/dist/tools/localFileTools.js.map +1 -1
  40. package/dist/tools/metaTools.js +82 -0
  41. package/dist/tools/metaTools.js.map +1 -1
  42. package/dist/tools/parallelAgentTools.js +228 -0
  43. package/dist/tools/parallelAgentTools.js.map +1 -1
  44. package/dist/tools/patternTools.d.ts +13 -0
  45. package/dist/tools/patternTools.js +456 -0
  46. package/dist/tools/patternTools.js.map +1 -0
  47. package/dist/tools/reconTools.js +31 -31
  48. package/dist/tools/selfEvalTools.js +44 -44
  49. package/dist/tools/seoTools.d.ts +16 -0
  50. package/dist/tools/seoTools.js +866 -0
  51. package/dist/tools/seoTools.js.map +1 -0
  52. package/dist/tools/sessionMemoryTools.d.ts +15 -0
  53. package/dist/tools/sessionMemoryTools.js +348 -0
  54. package/dist/tools/sessionMemoryTools.js.map +1 -0
  55. package/dist/tools/toolRegistry.d.ts +4 -0
  56. package/dist/tools/toolRegistry.js +489 -0
  57. package/dist/tools/toolRegistry.js.map +1 -1
  58. package/dist/tools/toonTools.d.ts +15 -0
  59. package/dist/tools/toonTools.js +94 -0
  60. package/dist/tools/toonTools.js.map +1 -0
  61. package/dist/tools/verificationTools.js +41 -41
  62. package/dist/tools/visionTools.js +17 -17
  63. package/dist/tools/voiceBridgeTools.d.ts +15 -0
  64. package/dist/tools/voiceBridgeTools.js +1427 -0
  65. package/dist/tools/voiceBridgeTools.js.map +1 -0
  66. package/dist/tools/webTools.js +18 -18
  67. package/package.json +102 -101
package/README.md CHANGED
@@ -1,431 +1,443 @@
1
- # NodeBench MCP
2
-
3
- **Make AI agents catch the bugs they normally ship.**
4
-
5
- One command gives your agent structured research, risk assessment, 3-layer testing, quality gates, and a persistent knowledge base — so every fix is thorough and every insight compounds into future work.
6
-
7
- ```bash
8
- claude mcp add nodebench -- npx -y nodebench-mcp
9
- ```
10
-
11
- ---
12
-
13
- ## Why — What Bare Agents Miss
14
-
15
- We benchmarked 9 real production prompts — things like *"The LinkedIn posting pipeline is creating duplicate posts"* and *"The agent loop hits budget but still gets new events"* — comparing a bare agent vs one with NodeBench MCP.
16
-
17
- | What gets measured | Bare Agent | With NodeBench MCP |
18
- |---|---|---|
19
- | Issues detected before deploy | 0 | **13** (4 high, 8 medium, 1 low) |
20
- | Research findings before coding | 0 | **21** |
21
- | Risk assessments | 0 | **9** |
22
- | Test coverage layers | 1 | **3** (static + unit + integration) |
23
- | Integration failures caught early | 0 | **4** |
24
- | Regression eval cases created | 0 | **22** |
25
- | Quality gate rules enforced | 0 | **52** |
26
- | Deploys blocked by gate violations | 0 | **4** |
27
- | Knowledge entries banked | 0 | **9** |
28
- | Blind spots shipped to production | **26** | **0** |
29
-
30
- The bare agent reads the code, implements a fix, runs tests once, and ships. The MCP agent researches first, assesses risk, tracks issues to resolution, runs 3-layer tests, creates regression guards, enforces quality gates, and banks everything as knowledge for next time.
31
-
32
- Every additional tool call produces a concrete artifact — an issue found, a risk assessed, a regression guarded — that compounds across future tasks.
33
-
34
- ---
35
-
36
- ## Who's Using It
37
-
38
- **Vision engineer** — Built agentic vision analysis using GPT 5.2 with Set-of-Mark (SoM) for boundary boxing, similar to Google Gemini 3 Flash's agentic code execution approach. Uses NodeBench's verification pipeline to validate detection accuracy across screenshot variants before shipping model changes.
39
-
40
- **QA engineer** — Transitioned a manual QA workflow website into an AI agent-driven app for a pet care messaging platform. Uses NodeBench's quality gates, verification cycles, and eval runs to ensure the AI agent handles edge cases that manual QA caught but bare AI agents miss.
41
-
42
- Both found different subsets of the 129 tools useful — which is why v2.8 ships with 4 `--preset` levels to load only what you need.
43
-
44
- ---
45
-
46
- ## How It Works — 3 Real Examples
47
-
48
- ### Example 1: Bug fix
49
-
50
- You type: *"The content queue has 40 items stuck in 'judging' status for 6 hours"*
51
-
52
- **Bare agent:** Reads the queue code, finds a potential fix, runs tests, ships.
53
-
54
- **With NodeBench MCP:** The agent runs structured recon and discovers 3 blind spots the bare agent misses:
55
- - No retry backoff on OpenRouter rate limits (HIGH)
56
- - JSON regex `match(/\{[\s\S]*\}/)` grabs last `}` — breaks on multi-object responses (MEDIUM)
57
- - No timeout on LLM call — hung request blocks entire cron for 15+ min (not detected by unit tests)
58
-
59
- All 3 are logged as gaps, resolved, regression-tested, and the patterns banked so the next similar bug is fixed faster.
60
-
61
- ### Example 2: Parallel agents overwriting each other
62
-
63
- You type: *"I launched 3 Claude Code subagents but they keep overwriting each other's changes"*
64
-
65
- **Without NodeBench:** Both agents see the same bug and both implement a fix. The third agent re-investigates what agent 1 already solved. Agent 2 hits context limit mid-fix and loses work.
66
-
67
- **With NodeBench MCP:** Each subagent calls `claim_agent_task` to lock its work. Roles are assigned so they don't overlap. Context budget is tracked. Progress notes ensure handoff without starting from scratch.
68
-
69
- ### Example 3: Knowledge compounding
70
-
71
- Tasks 1-3 start with zero prior knowledge. By task 9, the agent finds 2+ relevant prior findings before writing a single line of code. Bare agents start from zero every time.
72
-
73
- ---
74
-
75
- ## Quick Start
76
-
77
- ### Install (30 seconds)
78
-
79
- ```bash
80
- # Claude Code CLI — all 129 tools
81
- claude mcp add nodebench -- npx -y nodebench-mcp
82
-
83
- # Or start with discovery only — 5 tools, agents self-escalate to what they need
84
- claude mcp add nodebench -- npx -y nodebench-mcp --preset meta
85
-
86
- # Or start lean — 39 tools, ~70% less token overhead
87
- claude mcp add nodebench -- npx -y nodebench-mcp --preset lite
88
- ```
89
-
90
- Or add to `~/.claude/settings.json` or `.claude.json`:
91
-
92
- ```json
93
- {
94
- "mcpServers": {
95
- "nodebench": {
96
- "command": "npx",
97
- "args": ["-y", "nodebench-mcp"]
98
- }
99
- }
100
- }
101
- ```
102
-
103
- ### First prompts to try
104
-
105
- ```
106
- # See what's available
107
- > Use getMethodology("overview") to see all workflows
108
-
109
- # Before your next task — search for prior knowledge
110
- > Use search_all_knowledge("what I'm about to work on")
111
-
112
- # Run the full verification pipeline on a change
113
- > Use getMethodology("mandatory_flywheel") and follow the 6 steps
114
- ```
115
-
116
- ### Optional: API keys for web search and vision
117
-
118
- ```bash
119
- export GEMINI_API_KEY="your-key" # Web search + vision (recommended)
120
- export GITHUB_TOKEN="your-token" # GitHub (higher rate limits)
121
- ```
122
-
123
- ### Capability benchmarking (GAIA, gated)
124
-
125
- NodeBench MCP treats tools as "Access". To measure real capability lift, we benchmark baseline (LLM-only) vs tool-augmented accuracy on GAIA (gated).
126
-
127
- Notes:
128
- - GAIA fixtures and attachments are written under `.cache/gaia` (gitignored). Do not commit GAIA content.
129
- - Fixture generation requires `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN`.
130
-
131
- Web lane (web_search + fetch_url):
132
- ```bash
133
- npm run mcp:dataset:gaia:capability:refresh
134
- NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:test
135
- ```
136
-
137
- File-backed lane (PDF / XLSX / CSV / DOCX / PPTX / JSON / JSONL / TXT / ZIP via `local_file` tools):
138
- ```bash
139
- npm run mcp:dataset:gaia:capability:files:refresh
140
- NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:files:test
141
- ```
142
-
143
- Modes:
144
- - Stable: `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag`
145
- - More realistic: `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent`
146
-
147
- Notes:
148
- - ZIP attachments require `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (multi-step extract -> parse).
149
-
150
- ---
151
-
152
- ## What You Get
153
-
154
- ### Core workflow (use these every session)
155
-
156
- | When you... | Use this | Impact |
157
- |---|---|---|
158
- | Start any task | `search_all_knowledge` | Find prior findings — avoid repeating past mistakes |
159
- | Research before coding | `run_recon` + `log_recon_finding` | Structured research with surfaced findings |
160
- | Assess risk before acting | `assess_risk` | Risk tier determines if action needs confirmation |
161
- | Track implementation | `start_verification_cycle` + `log_gap` | Issues logged with severity, tracked to resolution |
162
- | Test thoroughly | `log_test_result` (3 layers) | Static + unit + integration vs running tests once |
163
- | Guard against regression | `start_eval_run` + `record_eval_result` | Eval cases that protect this fix in the future |
164
- | Gate before deploy | `run_quality_gate` | Boolean rules enforced — violations block deploy |
165
- | Bank knowledge | `record_learning` | Persisted findings compound across future sessions |
166
- | Verify completeness | `run_mandatory_flywheel` | 6-step minimum — catches dead code and intent mismatches |
167
-
168
- ### When running parallel agents (Claude Code subagents, worktrees)
169
-
170
- | When you... | Use this | Impact |
171
- |---|---|---|
172
- | Prevent duplicate work | `claim_agent_task` / `release_agent_task` | Task locks — each task owned by exactly one agent |
173
- | Specialize agents | `assign_agent_role` | 7 roles: implementer, test_writer, critic, etc. |
174
- | Track context usage | `log_context_budget` | Prevents context exhaustion mid-fix |
175
- | Validate against reference | `run_oracle_comparison` | Compare output against known-good oracle |
176
- | Orient new sessions | `get_parallel_status` | See what all agents are doing and what's blocked |
177
- | Bootstrap any repo | `bootstrap_parallel_agents` | Auto-detect gaps, scaffold coordination infra |
178
-
179
- ### Research and discovery
180
-
181
- | When you... | Use this | Impact |
182
- |---|---|---|
183
- | Search the web | `web_search` | Gemini/OpenAI/Perplexity — latest docs and updates |
184
- | Fetch a URL | `fetch_url` | Read any page as clean markdown |
185
- | Find GitHub repos | `search_github` + `analyze_repo` | Discover and evaluate libraries and patterns |
186
- | Analyze screenshots | `analyze_screenshot` | AI vision (Gemini/GPT-4o/Claude) for UI QA |
187
-
188
- ---
189
-
190
- ## Progressive Discovery (v2.8.1)
191
-
192
- 129 tools is a lot. The progressive disclosure system helps agents find exactly what they need:
193
-
194
- ### Multi-modal search engine
195
-
196
- ```
197
- > discover_tools("verify my implementation")
198
- ```
199
-
200
- The `discover_tools` search engine scores tools using **9 parallel strategies**:
201
-
202
- | Strategy | What it does | Example |
203
- |---|---|---|
204
- | Keyword | Exact/partial word matching on name, tags, description | "benchmark" → `benchmark_models` |
205
- | Fuzzy | Levenshtein distance — tolerates typos | "verifiy" → `start_verification_cycle` |
206
- | N-gram | Trigram similarity for partial words | "screen" → `capture_ui_screenshot` |
207
- | Prefix | Matches tool name starts | "cap" → `capture_*` tools |
208
- | Semantic | Synonym expansion (30 word families) | "check" also finds "verify", "validate" |
209
- | TF-IDF | Rare tags score higher than common ones | "c-compiler" scores higher than "test" |
210
- | Regex | Pattern matching | `"^run_.*loop$"` → `run_closed_loop` |
211
- | Bigram | Phrase matching | "quality gate" matched as unit |
212
- | Domain boost | Related categories boosted together | verification + quality_gate cluster |
213
-
214
- **6 search modes**: `hybrid` (default, all strategies), `fuzzy`, `regex`, `prefix`, `semantic`, `exact`
215
-
216
- Pass `explain: true` to see exactly which strategies contributed to each score.
217
-
218
- ### Quick refs — what to do next
219
-
220
- Every tool response auto-appends a `_quickRef` with:
221
- - **nextAction**: What to do immediately after this tool
222
- - **nextTools**: Recommended follow-up tools
223
- - **methodology**: Which methodology guide to consult
224
- - **tip**: Practical usage advice
225
-
226
- Call `get_tool_quick_ref("tool_name")` for any tool's guidance.
227
-
228
- ### Workflow chains — step-by-step recipes
229
-
230
- 11 pre-built chains for common workflows:
231
-
232
- | Chain | Steps | Use case |
233
- |---|---|---|
234
- | `new_feature` | 12 | End-to-end feature development |
235
- | `fix_bug` | 6 | Structured debugging |
236
- | `ui_change` | 7 | Frontend with visual verification |
237
- | `parallel_project` | 7 | Multi-agent coordination |
238
- | `research_phase` | 8 | Context gathering |
239
- | `academic_paper` | 7 | Paper writing pipeline |
240
- | `c_compiler_benchmark` | 10 | Autonomous capability test |
241
- | `security_audit` | 9 | Comprehensive security assessment |
242
- | `code_review` | 8 | Structured code review |
243
- | `deployment` | 8 | Ship with full verification |
244
- | `migration` | 10 | SDK/framework upgrade |
245
-
246
- Call `get_workflow_chain("new_feature")` to get the step-by-step sequence.
247
-
248
- ### Boilerplate template
249
-
250
- Start new projects with everything pre-configured:
251
-
252
- ```bash
253
- gh repo create my-project --template HomenShum/nodebench-boilerplate --clone
254
- cd my-project && npm install
255
- ```
256
-
257
- Or use the scaffold tool: `scaffold_nodebench_project` creates AGENTS.md, .mcp.json, package.json, CI, Docker, and parallel agent infra.
258
-
259
- ---
260
-
261
- ## The Methodology Pipeline
262
-
263
- NodeBench MCP isn't just a bag of tools — it's a pipeline. Each step feeds the next:
264
-
265
- ```
266
- Research → Risk → Implement → Test (3 layers) → Eval → Gate → Learn → Ship
267
- ↑ │
268
- └──────────── knowledge compounds ─────────────────────────────┘
269
- ```
270
-
271
- **Inner loop** (per change): 6-phase verification ensures correctness.
272
- **Outer loop** (over time): Eval-driven development ensures improvement.
273
- **Together**: The AI Flywheel — every verification produces eval artifacts, every regression triggers verification.
274
-
275
- Ask the agent: `Use getMethodology("overview")` to see all 19 methodology topics.
276
-
277
- ---
278
-
279
- ## Parallel Agents with Claude Code
280
-
281
- Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www.anthropic.com/engineering/building-c-compiler) (Feb 2026).
282
-
283
- **When to use:** Only when running 2+ agent sessions. Single-agent workflows use the standard pipeline above.
284
-
285
- **How it works with Claude Code's Task tool:**
286
-
287
- 1. **COORDINATOR** (your main session) breaks work into independent tasks
288
- 2. Each **Task tool** call spawns a subagent with instructions to:
289
- - `claim_agent_task` lock the task
290
- - `assign_agent_role` — specialize (implementer, test_writer, critic, etc.)
291
- - Do the work
292
- - `release_agent_task` — handoff with progress note
293
- 3. Coordinator calls `get_parallel_status` to monitor all subagents
294
- 4. Coordinator runs `run_quality_gate` on the aggregate result
295
-
296
- **MCP Prompts available:**
297
- - `claude-code-parallel` — Step-by-step Claude Code subagent coordination
298
- - `parallel-agent-team` — Full team setup with role assignment
299
- - `oracle-test-harness` — Validate outputs against known-good reference
300
- - `bootstrap-parallel-agents` — Scaffold parallel infra for any repo
301
-
302
- ---
303
-
304
- ## Toolset Gating (v2.8)
305
-
306
- 129 tools means tens of thousands of tokens of schema per API call. If you only need core methodology, gate the toolset:
307
-
308
- ### Presets
309
-
310
- | Preset | Tools | Use case |
311
- |---|---|---|
312
- | `meta` | 5 | Discovery-only front door — agents start here and self-escalate via `discover_tools` |
313
- | `lite` | 39 | Core methodology — verification, eval, gates, learning, recon, security, boilerplate |
314
- | `core` | 87 | Full workflow — adds flywheel, bootstrap, self-eval, llm, platform, research_writing, flicker_detection, figma_flow, benchmark |
315
- | `full` | 129 | Everything (default) |
316
-
317
- ```bash
318
- # Meta — 5 tools (discovery-only: findTools, getMethodology, discover_tools, get_tool_quick_ref, get_workflow_chain)
319
- # Agents start here and self-escalate to the tools they need
320
- claude mcp add nodebench -- npx -y nodebench-mcp --preset meta
321
-
322
- # Lite — 39 tools (verification, eval, gates, learning, recon, security, boilerplate + meta + discovery)
323
- claude mcp add nodebench -- npx -y nodebench-mcp --preset lite
324
-
325
- # Core — 87 tools (adds flywheel, bootstrap, self-eval, llm, platform, research_writing, flicker_detection, figma_flow, benchmark + meta + discovery)
326
- claude mcp add nodebench -- npx -y nodebench-mcp --preset core
327
-
328
- # Full — all 129 tools (default)
329
- claude mcp add nodebench -- npx -y nodebench-mcp
330
- ```
331
-
332
- Or in config:
333
-
334
- ```json
335
- {
336
- "mcpServers": {
337
- "nodebench": {
338
- "command": "npx",
339
- "args": ["-y", "nodebench-mcp", "--preset", "meta"]
340
- }
341
- }
342
- }
343
- ```
344
-
345
- ### Fine-grained control
346
-
347
- ```bash
348
- # Include only specific toolsets
349
- npx nodebench-mcp --toolsets verification,eval,recon
350
-
351
- # Exclude heavy optional-dep toolsets
352
- npx nodebench-mcp --exclude vision,ui_capture,parallel
353
-
354
- # See all toolsets and presets
355
- npx nodebench-mcp --help
356
- ```
357
-
358
- ### Available toolsets
359
-
360
- | Toolset | Tools | What it covers |
361
- |---|---|---|
362
- | verification | 8 | Cycles, gaps, triple-verify, status |
363
- | eval | 6 | Eval runs, results, comparison, diff |
364
- | quality_gate | 4 | Gates, presets, history |
365
- | learning | 4 | Knowledge, search, record |
366
- | recon | 7 | Research, findings, framework checks, risk |
367
- | flywheel | 4 | Mandatory flywheel, promote, investigate |
368
- | bootstrap | 11 | Project setup, agents.md, self-implement, autonomous, test runner |
369
- | self_eval | 6 | Trajectory analysis, health reports |
370
- | parallel | 10 | Task locks, roles, context budget, oracle |
371
- | vision | 4 | Screenshot analysis, UI capture, diff |
372
- | ui_capture | 2 | Playwright-based capture |
373
- | web | 2 | Web search, URL fetch |
374
- | github | 3 | Repo search, analysis, monitoring |
375
- | docs | 4 | Documentation generation, reports |
376
- | local_file | 17 | Deterministic parsing (CSV/XLSX/PDF/DOCX/PPTX/ZIP/JSON/JSONL/TXT) |
377
- | llm | 3 | LLM calling, extraction, benchmarking |
378
- | security | 3 | Dependency scanning, code analysis, terminal security scanning |
379
- | platform | 4 | Convex bridge: briefs, funding, research, publish |
380
- | research_writing | 8 | Academic paper polishing, translation, de-AI, logic check, captions, experiment analysis, reviewer simulation |
381
- | flicker_detection | 5 | Android flicker detection + SSIM tooling |
382
- | figma_flow | 4 | Figma flow analysis + rendering |
383
- | boilerplate | 2 | Scaffold NodeBench projects + status |
384
- | benchmark | 3 | Autonomous benchmark lifecycle (C-compiler pattern) |
385
-
386
- Always included (regardless of gating) — these 5 tools form the `meta` preset:
387
- - Meta: `findTools`, `getMethodology`
388
- - Discovery: `discover_tools`, `get_tool_quick_ref`, `get_workflow_chain`
389
-
390
- The `meta` preset loads **only** these 5 tools (0 domain tools). Agents use `discover_tools` to find what they need and self-escalate.
391
-
392
- ---
393
-
394
- ## Build from Source
395
-
396
- ```bash
397
- git clone https://github.com/HomenShum/nodebench-ai.git
398
- cd nodebench-ai/packages/mcp-local
399
- npm install && npm run build
400
- ```
401
-
402
- Then use absolute path:
403
-
404
- ```json
405
- {
406
- "mcpServers": {
407
- "nodebench": {
408
- "command": "node",
409
- "args": ["/path/to/packages/mcp-local/dist/index.js"]
410
- }
411
- }
412
- }
413
- ```
414
-
415
- ---
416
-
417
- ## Troubleshooting
418
-
419
- **"No search provider available"** — Set `GEMINI_API_KEY`, `OPENAI_API_KEY`, or `PERPLEXITY_API_KEY`
420
-
421
- **"GitHub API error 403"** — Set `GITHUB_TOKEN` for higher rate limits
422
-
423
- **"Cannot find module"** — Run `npm run build` in the mcp-local directory
424
-
425
- **MCP not connecting** — Check path is absolute, run `claude --mcp-debug`, ensure Node.js >= 18
426
-
427
- ---
428
-
429
- ## License
430
-
431
- MIT
1
+ # NodeBench MCP
2
+
3
+ **Make AI agents catch the bugs they normally ship.**
4
+
5
+ One command gives your agent structured research, risk assessment, 3-layer testing, quality gates, and a persistent knowledge base — so every fix is thorough and every insight compounds into future work.
6
+
7
+ ```bash
8
+ claude mcp add nodebench -- npx -y nodebench-mcp
9
+ ```
10
+
11
+ ---
12
+
13
+ ## Why — What Bare Agents Miss
14
+
15
+ We benchmarked 9 real production prompts — things like *"The LinkedIn posting pipeline is creating duplicate posts"* and *"The agent loop hits budget but still gets new events"* — comparing a bare agent vs one with NodeBench MCP.
16
+
17
+ | What gets measured | Bare Agent | With NodeBench MCP |
18
+ |---|---|---|
19
+ | Issues detected before deploy | 0 | **13** (4 high, 8 medium, 1 low) |
20
+ | Research findings before coding | 0 | **21** |
21
+ | Risk assessments | 0 | **9** |
22
+ | Test coverage layers | 1 | **3** (static + unit + integration) |
23
+ | Integration failures caught early | 0 | **4** |
24
+ | Regression eval cases created | 0 | **22** |
25
+ | Quality gate rules enforced | 0 | **52** |
26
+ | Deploys blocked by gate violations | 0 | **4** |
27
+ | Knowledge entries banked | 0 | **9** |
28
+ | Blind spots shipped to production | **26** | **0** |
29
+
30
+ The bare agent reads the code, implements a fix, runs tests once, and ships. The MCP agent researches first, assesses risk, tracks issues to resolution, runs 3-layer tests, creates regression guards, enforces quality gates, and banks everything as knowledge for next time.
31
+
32
+ Every additional tool call produces a concrete artifact — an issue found, a risk assessed, a regression guarded — that compounds across future tasks.
33
+
34
+ ---
35
+
36
+ ## Who's Using It
37
+
38
+ **Vision engineer** — Built agentic vision analysis using GPT 5.2 with Set-of-Mark (SoM) for boundary boxing, similar to Google Gemini 3 Flash's agentic code execution approach. Uses NodeBench's verification pipeline to validate detection accuracy across screenshot variants before shipping model changes.
39
+
40
+ **QA engineer** — Transitioned a manual QA workflow website into an AI agent-driven app for a pet care messaging platform. Uses NodeBench's quality gates, verification cycles, and eval runs to ensure the AI agent handles edge cases that manual QA caught but bare AI agents miss.
41
+
42
+ Both found different subsets of the 143 tools useful — which is why NodeBench ships with 4 `--preset` levels to load only what you need.
43
+
44
+ ---
45
+
46
+ ## How It Works — 3 Real Examples
47
+
48
+ ### Example 1: Bug fix
49
+
50
+ You type: *"The content queue has 40 items stuck in 'judging' status for 6 hours"*
51
+
52
+ **Bare agent:** Reads the queue code, finds a potential fix, runs tests, ships.
53
+
54
+ **With NodeBench MCP:** The agent runs structured recon and discovers 3 blind spots the bare agent misses:
55
+ - No retry backoff on OpenRouter rate limits (HIGH)
56
+ - JSON regex `match(/\{[\s\S]*\}/)` grabs last `}` — breaks on multi-object responses (MEDIUM)
57
+ - No timeout on LLM call — hung request blocks entire cron for 15+ min (not detected by unit tests)
58
+
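The second blind spot above is easy to reproduce. A minimal sketch (the `reply` string and the lazy-quantifier fix are illustrative, not NodeBench code):

```javascript
// Greedy match from the first "{" to the LAST "}" spans both objects
// when the model returns more than one JSON object in a reply.
const reply = 'Here is the result: {"a": 1} and also {"b": 2}';
const greedy = reply.match(/\{[\s\S]*\}/)[0];
// greedy is '{"a": 1} and also {"b": 2}', which is not valid JSON:
let parsed = null;
try { parsed = JSON.parse(greedy); } catch { /* SyntaxError */ }

// A lazy quantifier stops at the first closing brace instead:
const lazy = reply.match(/\{[\s\S]*?\}/)[0];
console.log(JSON.parse(lazy).a); // 1
```

Note the lazy version has its own failure mode on nested objects, which is why a robust fix usually balances braces rather than relying on a regex at all.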
59
+ All 3 are logged as gaps, resolved, regression-tested, and the patterns banked so the next similar bug is fixed faster.
60
+
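The third blind spot (no timeout) can be guarded with a small wrapper. A hedged sketch using plain promises; `withTimeout` and `callLlm` are hypothetical names, not NodeBench tools:

```javascript
// Race a promise against a deadline so one hung LLM request
// cannot block the whole cron run.
function withTimeout(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), ms);
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Usage sketch: withTimeout(callLlm(prompt), 30_000) rejects after 30s
// instead of hanging indefinitely.
```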
61
+ ### Example 2: Parallel agents overwriting each other
62
+
63
+ You type: *"I launched 3 Claude Code subagents but they keep overwriting each other's changes"*
64
+
65
+ **Without NodeBench:** Both agents see the same bug and both implement a fix. The third agent re-investigates what agent 1 already solved. Agent 2 hits context limit mid-fix and loses work.
66
+
67
+ **With NodeBench MCP:** Each subagent calls `claim_agent_task` to lock its work. Roles are assigned so they don't overlap. Context budget is tracked. Progress notes ensure handoff without starting from scratch.
68
+
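The coordination loop described here can be sketched as a call sequence; the argument names are illustrative, since the actual tool schemas are not shown in this README:

```
coordinator:  split the work into independent tasks A, B, C
subagent 1:   claim_agent_task(A)        # lock: no other agent can pick up A
              assign_agent_role(implementer)
              ... do the work, logging context budget along the way ...
              release_agent_task(A)      # handoff with a progress note
coordinator:  get_parallel_status()      # who holds what, what is blocked
              run_quality_gate()         # gate the aggregate result
```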
69
+ ### Example 3: Knowledge compounding
70
+
71
+ Tasks 1-3 start with zero prior knowledge. By task 9, the agent finds 2+ relevant prior findings before writing a single line of code. Bare agents start from zero every time.
72
+
73
+ ---
74
+
75
+ ## Quick Start
76
+
77
+ ### Install (30 seconds)
78
+
79
+ ```bash
80
+ # Claude Code CLI — all 143 tools
81
+ claude mcp add nodebench -- npx -y nodebench-mcp
82
+
83
+ # Or start with discovery only — 5 tools, agents self-escalate to what they need
84
+ claude mcp add nodebench -- npx -y nodebench-mcp --preset meta
85
+
86
+ # Or start lean — 43 tools, ~70% less token overhead
87
+ claude mcp add nodebench -- npx -y nodebench-mcp --preset lite
88
+ ```
89
+
90
+ Or add to `~/.claude/settings.json` or `.claude.json`:
91
+
92
+ ```json
93
+ {
94
+ "mcpServers": {
95
+ "nodebench": {
96
+ "command": "npx",
97
+ "args": ["-y", "nodebench-mcp"]
98
+ }
99
+ }
100
+ }
101
+ ```
102
+
103
+ ### First prompts to try
104
+
105
+ ```
106
+ # See what's available
107
+ > Use getMethodology("overview") to see all workflows
108
+
109
+ # Before your next task — search for prior knowledge
110
+ > Use search_all_knowledge("what I'm about to work on")
111
+
112
+ # Run the full verification pipeline on a change
113
+ > Use getMethodology("mandatory_flywheel") and follow the 6 steps
114
+ ```
115
+
116
+ ### Optional: API keys for web search and vision
117
+
118
+ ```bash
119
+ export GEMINI_API_KEY="your-key" # Web search + vision (recommended)
120
+ export GITHUB_TOKEN="your-token" # GitHub (higher rate limits)
121
+ ```
122
+
123
+ ### Capability benchmarking (GAIA, gated)
124
+
125
+ NodeBench MCP treats tools as "Access". To measure real capability lift, we benchmark baseline (LLM-only) vs tool-augmented accuracy on GAIA (gated).
126
+
127
+ Notes:
128
+ - GAIA fixtures and attachments are written under `.cache/gaia` (gitignored). Do not commit GAIA content.
129
+ - Fixture generation requires `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN`.
130
+
131
+ Web lane (web_search + fetch_url):
132
+ ```bash
133
+ npm run mcp:dataset:gaia:capability:refresh
134
+ NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:test
135
+ ```
136
+
137
+ File-backed lane (PDF / XLSX / CSV / DOCX / PPTX / JSON / JSONL / TXT / ZIP via `local_file` tools):
138
+ ```bash
139
+ npm run mcp:dataset:gaia:capability:files:refresh
140
+ NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:files:test
141
+ ```
142
+
143
+ Modes:
144
+ - Stable: `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag`
145
+ - More realistic: `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent`
146
+
147
+ Notes:
148
+ - ZIP attachments require `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (multi-step extract -> parse).
149
+
150
+ ---
151
+
152
+ ## What You Get
153
+
154
+ ### Core workflow (use these every session)
155
+
156
+ | When you... | Use this | Impact |
157
+ |---|---|---|
158
+ | Start any task | `search_all_knowledge` | Find prior findings — avoid repeating past mistakes |
159
+ | Research before coding | `run_recon` + `log_recon_finding` | Structured research with surfaced findings |
160
+ | Assess risk before acting | `assess_risk` | Risk tier determines if action needs confirmation |
161
+ | Track implementation | `start_verification_cycle` + `log_gap` | Issues logged with severity, tracked to resolution |
162
+ | Test thoroughly | `log_test_result` (3 layers) | Static + unit + integration vs running tests once |
163
+ | Guard against regression | `start_eval_run` + `record_eval_result` | Eval cases that protect this fix in the future |
164
+ | Gate before deploy | `run_quality_gate` | Boolean rules enforced — violations block deploy |
165
+ | Bank knowledge | `record_learning` | Persisted findings compound across future sessions |
166
+ | Verify completeness | `run_mandatory_flywheel` | 6-step minimum — catches dead code and intent mismatches |
167

### When running parallel agents (Claude Code subagents, worktrees)

| When you... | Use this | Impact |
|---|---|---|
| Prevent duplicate work | `claim_agent_task` / `release_agent_task` | Task locks — each task owned by exactly one agent |
| Specialize agents | `assign_agent_role` | 7 roles: implementer, test_writer, critic, etc. |
| Track context usage | `log_context_budget` | Prevents context exhaustion mid-fix |
| Validate against reference | `run_oracle_comparison` | Compare output against known-good oracle |
| Orient new sessions | `get_parallel_status` | See what all agents are doing and what's blocked |
| Bootstrap any repo | `bootstrap_parallel_agents` | Auto-detect gaps, scaffold coordination infra |

### Research and discovery

| When you... | Use this | Impact |
|---|---|---|
| Search the web | `web_search` | Gemini/OpenAI/Perplexity — latest docs and updates |
| Fetch a URL | `fetch_url` | Read any page as clean markdown |
| Find GitHub repos | `search_github` + `analyze_repo` | Discover and evaluate libraries and patterns |
| Analyze screenshots | `analyze_screenshot` | AI vision (Gemini/GPT-4o/Claude) for UI QA |

---

## Progressive Discovery

143 tools is a lot. The progressive disclosure system helps agents find exactly what they need:

### Multi-modal search engine

```
> discover_tools("verify my implementation")
```

The `discover_tools` search engine scores tools using **9 parallel strategies**:

| Strategy | What it does | Example |
|---|---|---|
| Keyword | Exact/partial word matching on name, tags, description | "benchmark" → `benchmark_models` |
| Fuzzy | Levenshtein distance — tolerates typos | "verifiy" → `start_verification_cycle` |
| N-gram | Trigram similarity for partial words | "screen" → `capture_ui_screenshot` |
| Prefix | Matches tool name starts | "cap" → `capture_*` tools |
| Semantic | Synonym expansion (30 word families) | "check" also finds "verify", "validate" |
| TF-IDF | Rare tags score higher than common ones | "c-compiler" scores higher than "test" |
| Regex | Pattern matching | `"^run_.*loop$"` → `run_closed_loop` |
| Bigram | Phrase matching | "quality gate" matched as unit |
| Domain boost | Related categories boosted together | verification + quality_gate cluster |

**7 search modes**: `hybrid` (default, all strategies), `fuzzy`, `regex`, `prefix`, `semantic`, `exact`, `dense`

Pass `explain: true` to see exactly which strategies contributed to each score.
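To make the fuzzy strategy concrete, here is a minimal Levenshtein-based scorer. This is a sketch of the idea only, not the server's actual implementation, which blends all nine strategies:

```typescript
// Classic dynamic-programming Levenshtein distance: counts the minimum number
// of single-character insertions, deletions, and substitutions between a and b.
function levenshtein(a: string, b: string): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) => [i]);
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Normalize distance into a 0..1 similarity score: smaller distance, higher score.
function fuzzyScore(query: string, toolName: string): number {
  const d = levenshtein(query.toLowerCase(), toolName.toLowerCase());
  return 1 - d / Math.max(query.length, toolName.length);
}
```

With this scorer, the typo "verifiy" sits at distance 1 from "verify", so it still ranks verification tools far above unrelated names.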

### Quick refs — what to do next

Every tool response auto-appends a `_quickRef` with:
- **nextAction**: What to do immediately after this tool
- **nextTools**: Recommended follow-up tools
- **methodology**: Which methodology guide to consult
- **tip**: Practical usage advice

Call `get_tool_quick_ref("tool_name")` for any tool's guidance.
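The four fields above can be pictured as a shape like this. The interface and the sample values are a hypothetical rendering for illustration; the actual payload may carry additional keys:

```typescript
// Hypothetical TypeScript shape for the `_quickRef` appended to tool responses.
interface QuickRef {
  nextAction: string;  // what to do immediately after this tool
  nextTools: string[]; // recommended follow-up tools
  methodology: string; // which methodology guide to consult
  tip: string;         // practical usage advice
}

// Illustrative values only, not a real server response.
const ref: QuickRef = {
  nextAction: "Log each gap you find with log_gap",
  nextTools: ["log_gap", "log_test_result"],
  methodology: "verification",
  tip: "Close the cycle before running the quality gate",
};

console.log(ref.nextTools.join(", "));
```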

### Workflow chains — step-by-step recipes

21 pre-built chains for common workflows:

| Chain | Steps | Use case |
|---|---|---|
| `new_feature` | 12 | End-to-end feature development |
| `fix_bug` | 6 | Structured debugging |
| `ui_change` | 7 | Frontend with visual verification |
| `parallel_project` | 7 | Multi-agent coordination |
| `research_phase` | 8 | Context gathering |
| `academic_paper` | 7 | Paper writing pipeline |
| `c_compiler_benchmark` | 10 | Autonomous capability test |
| `security_audit` | 9 | Comprehensive security assessment |
| `code_review` | 8 | Structured code review |
| `deployment` | 8 | Ship with full verification |
| `migration` | 10 | SDK/framework upgrade |
| `coordinator_spawn` | 6 | Parallel coordinator setup |
| `self_setup` | 5 | Agent self-onboarding |
| `flicker_detection` | 5 | Android flicker analysis |
| `figma_flow_analysis` | 5 | Figma prototype flow audit |
| `agent_eval` | 6 | Evaluate agent performance |
| `contract_compliance` | 4 | Check agent contract adherence |
| `ablation_eval` | 6 | Ablation experiment design |
| `session_recovery` | 6 | Recover context after compaction |
| `attention_refresh` | 4 | Reload bearings mid-session |
| `task_bank_setup` | 5 | Create evaluation task banks |

Call `get_workflow_chain("new_feature")` to get the step-by-step sequence.

### Boilerplate template

Start new projects with everything pre-configured:

```bash
gh repo create my-project --template HomenShum/nodebench-boilerplate --clone
cd my-project && npm install
```

Or use the scaffold tool: `scaffold_nodebench_project` creates AGENTS.md, .mcp.json, package.json, CI, Docker, and parallel agent infra.

---

## The Methodology Pipeline

NodeBench MCP isn't just a bag of tools; it's a pipeline. Each step feeds the next:

```
Research → Risk → Implement → Test (3 layers) → Eval → Gate → Learn → Ship
   ↑                                                                 │
   └───────────────────── knowledge compounds ───────────────────────┘
```

**Inner loop** (per change): 6-phase verification ensures correctness.
**Outer loop** (over time): Eval-driven development ensures improvement.
**Together**: the AI Flywheel, where every verification produces eval artifacts and every regression triggers verification.
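A minimal sketch of how the two loops compound. The in-memory suite and helper names here are hypothetical stand-ins, not the server's data model:

```typescript
// Each eval case is a named regression guard banked by the inner loop.
type EvalCase = { name: string; pass: () => boolean };
const evalSuite: EvalCase[] = [];

// Inner loop (per change): verify the change, then emit an eval artifact.
function innerLoop(change: string): void {
  // ...implement and verify the change (elided)...
  evalSuite.push({ name: `regression guard: ${change}`, pass: () => true });
}

// Outer loop (over time): replay every banked case; a failure would
// re-enter the inner loop as a new verification cycle.
function outerLoop(): number {
  return evalSuite.filter((c) => c.pass()).length;
}

innerLoop("fix token refresh");
innerLoop("add retry backoff");
console.log(outerLoop()); // 2
```

Each verified change grows the suite, so later work is checked against everything shipped before it.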

Ask the agent: `Use getMethodology("overview")` to see all 19 methodology topics.

---

## Parallel Agents with Claude Code

Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www.anthropic.com/engineering/building-c-compiler) (Feb 2026).

**When to use:** Only when running 2+ agent sessions. Single-agent workflows use the standard pipeline above.

**How it works with Claude Code's Task tool:**

1. **COORDINATOR** (your main session) breaks work into independent tasks
2. Each **Task tool** call spawns a subagent with instructions to:
   - `claim_agent_task` — lock the task
   - `assign_agent_role` — specialize (implementer, test_writer, critic, etc.)
   - Do the work
   - `release_agent_task` — hand off with a progress note
3. Coordinator calls `get_parallel_status` to monitor all subagents
4. Coordinator runs `run_quality_gate` on the aggregate result
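The claim/release handoff above amounts to a per-task lock. A minimal in-memory sketch of that protocol (the real tools persist lock state server-side; these function names only mirror the tool names):

```typescript
// In-memory task locks: each task is owned by exactly one agent at a time.
const locks = new Map<string, string>(); // taskId -> owning agentId

function claimAgentTask(taskId: string, agentId: string): boolean {
  if (locks.has(taskId)) return false; // already owned: pick another task
  locks.set(taskId, agentId);
  return true;
}

function releaseAgentTask(taskId: string, agentId: string, note: string): boolean {
  if (locks.get(taskId) !== agentId) return false; // only the owner may release
  locks.delete(taskId);
  console.log(`handoff[${taskId}]: ${note}`);
  return true;
}
```

A second subagent's claim on a held task fails cleanly, which is what prevents duplicate work across parallel sessions.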

**MCP Prompts available:**
- `claude-code-parallel` — Step-by-step Claude Code subagent coordination
- `parallel-agent-team` — Full team setup with role assignment
- `oracle-test-harness` — Validate outputs against known-good reference
- `bootstrap-parallel-agents` — Scaffold parallel infra for any repo

---

## Toolset Gating

143 tools means tens of thousands of tokens of schema per API call. If you only need the core methodology, gate the toolset:

### Presets

| Preset | Tools | Domains | Use case |
|---|---|---|---|
| `meta` | 5 | 0 | Discovery-only front door — agents start here and self-escalate via `discover_tools` |
| `lite` | 43 | 8 | Core methodology — verification, eval, flywheel, learning, recon, security, boilerplate |
| `core` | 93 | 17 | Full workflow — adds bootstrap, self-eval, llm, platform, research_writing, flicker_detection, figma_flow, benchmark, session_memory |
| `full` | 143 | 25 | Everything — adds vision, UI capture, web, GitHub, docs, parallel, local files, GAIA solvers |

```bash
# Meta — 5 tools (discovery-only: findTools, getMethodology, discover_tools, get_tool_quick_ref, get_workflow_chain)
# Agents start here and self-escalate to the tools they need
claude mcp add nodebench -- npx -y nodebench-mcp --preset meta

# Lite — 43 tools (verification, eval, flywheel, learning, recon, security, boilerplate + meta + discovery)
claude mcp add nodebench -- npx -y nodebench-mcp --preset lite

# Core — 93 tools (adds bootstrap, self-eval, llm, platform, research_writing, flicker_detection, figma_flow, benchmark, session_memory + meta + discovery)
claude mcp add nodebench -- npx -y nodebench-mcp --preset core

# Full — all 143 tools (default)
claude mcp add nodebench -- npx -y nodebench-mcp
```

Or in config:

```json
{
  "mcpServers": {
    "nodebench": {
      "command": "npx",
      "args": ["-y", "nodebench-mcp", "--preset", "meta"]
    }
  }
}
```

### Fine-grained control

```bash
# Include only specific toolsets
npx nodebench-mcp --toolsets verification,eval,recon

# Exclude heavy optional-dep toolsets
npx nodebench-mcp --exclude vision,ui_capture,parallel

# See all toolsets and presets
npx nodebench-mcp --help
```

### Available toolsets

| Toolset | Tools | What it covers |
|---|---|---|
| verification | 8 | Cycles, gaps, triple-verify, status |
| eval | 6 | Eval runs, results, comparison, diff |
| quality_gate | 4 | Gates, presets, history |
| learning | 4 | Knowledge, search, record |
| recon | 7 | Research, findings, framework checks, risk |
| flywheel | 4 | Mandatory flywheel, promote, investigate |
| bootstrap | 11 | Project setup, agents.md, self-implement, autonomous, test runner |
| self_eval | 9 | Trajectory analysis, health reports, task banks, grading, contract compliance |
| parallel | 10 | Task locks, roles, context budget, oracle |
| vision | 4 | Screenshot analysis, UI capture, diff |
| ui_capture | 2 | Playwright-based capture |
| web | 2 | Web search, URL fetch |
| github | 3 | Repo search, analysis, monitoring |
| docs | 4 | Documentation generation, reports |
| local_file | 19 | Deterministic parsing (CSV/XLSX/PDF/DOCX/PPTX/ZIP/JSON/JSONL/TXT/OCR/audio) |
| llm | 3 | LLM calling, extraction, benchmarking |
| security | 3 | Dependency scanning, code analysis, terminal security scanning |
| platform | 4 | Convex bridge: briefs, funding, research, publish |
| research_writing | 8 | Academic paper polishing, translation, de-AI, logic check, captions, experiment analysis, reviewer simulation |
| flicker_detection | 5 | Android flicker detection + SSIM tooling |
| figma_flow | 4 | Figma flow analysis + rendering |
| boilerplate | 2 | Scaffold NodeBench projects + status |
| benchmark | 3 | Autonomous benchmark lifecycle (C-compiler pattern) |
| session_memory | 3 | Compaction-resilient notes, attention refresh, context reload |
| gaia_solvers | 6 | GAIA media image solvers (red/green deviation, polygon area, fraction quiz, bass clef, storage cost) |

Always included (regardless of gating) — these 5 tools form the `meta` preset:
- Meta: `findTools`, `getMethodology`
- Discovery: `discover_tools`, `get_tool_quick_ref`, `get_workflow_chain`

The `meta` preset loads **only** these 5 tools (0 domain tools). Agents use `discover_tools` to find what they need and self-escalate.

---

## Build from Source

```bash
git clone https://github.com/HomenShum/nodebench-ai.git
cd nodebench-ai/packages/mcp-local
npm install && npm run build
```

Then use the absolute path:

```json
{
  "mcpServers": {
    "nodebench": {
      "command": "node",
      "args": ["/path/to/packages/mcp-local/dist/index.js"]
    }
  }
}
```

---

## Troubleshooting

**"No search provider available"** — Set `GEMINI_API_KEY`, `OPENAI_API_KEY`, or `PERPLEXITY_API_KEY`

**"GitHub API error 403"** — Set `GITHUB_TOKEN` for higher rate limits

**"Cannot find module"** — Run `npm run build` in the mcp-local directory

**MCP not connecting** — Check that the path is absolute, run `claude --mcp-debug`, and ensure Node.js >= 18

---

## License

MIT