nodebench-mcp 2.22.0 → 2.26.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/NODEBENCH_AGENTS.md +5 -4
- package/README.md +495 -280
- package/dist/__tests__/architectComplex.test.js +3 -5
- package/dist/__tests__/architectComplex.test.js.map +1 -1
- package/dist/__tests__/batchAutopilot.test.d.ts +8 -0
- package/dist/__tests__/batchAutopilot.test.js +218 -0
- package/dist/__tests__/batchAutopilot.test.js.map +1 -0
- package/dist/__tests__/cliSubcommands.test.d.ts +1 -0
- package/dist/__tests__/cliSubcommands.test.js +138 -0
- package/dist/__tests__/cliSubcommands.test.js.map +1 -0
- package/dist/__tests__/evalHarness.test.js +1 -1
- package/dist/__tests__/forecastingDogfood.test.d.ts +9 -0
- package/dist/__tests__/forecastingDogfood.test.js +284 -0
- package/dist/__tests__/forecastingDogfood.test.js.map +1 -0
- package/dist/__tests__/forecastingScoring.test.d.ts +9 -0
- package/dist/__tests__/forecastingScoring.test.js +202 -0
- package/dist/__tests__/forecastingScoring.test.js.map +1 -0
- package/dist/__tests__/localDashboard.test.d.ts +1 -0
- package/dist/__tests__/localDashboard.test.js +226 -0
- package/dist/__tests__/localDashboard.test.js.map +1 -0
- package/dist/__tests__/multiHopDogfood.test.d.ts +12 -0
- package/dist/__tests__/multiHopDogfood.test.js +303 -0
- package/dist/__tests__/multiHopDogfood.test.js.map +1 -0
- package/dist/__tests__/openclawDogfood.test.d.ts +23 -0
- package/dist/__tests__/openclawDogfood.test.js +535 -0
- package/dist/__tests__/openclawDogfood.test.js.map +1 -0
- package/dist/__tests__/openclawMessaging.test.d.ts +14 -0
- package/dist/__tests__/openclawMessaging.test.js +232 -0
- package/dist/__tests__/openclawMessaging.test.js.map +1 -0
- package/dist/__tests__/tools.test.js +7 -3
- package/dist/__tests__/tools.test.js.map +1 -1
- package/dist/__tests__/traceabilityDogfood.test.d.ts +12 -0
- package/dist/__tests__/traceabilityDogfood.test.js +241 -0
- package/dist/__tests__/traceabilityDogfood.test.js.map +1 -0
- package/dist/__tests__/webmcpTools.test.d.ts +7 -0
- package/dist/__tests__/webmcpTools.test.js +195 -0
- package/dist/__tests__/webmcpTools.test.js.map +1 -0
- package/dist/dashboard/briefHtml.d.ts +20 -0
- package/dist/dashboard/briefHtml.js +1000 -0
- package/dist/dashboard/briefHtml.js.map +1 -0
- package/dist/dashboard/briefServer.d.ts +18 -0
- package/dist/dashboard/briefServer.js +320 -0
- package/dist/dashboard/briefServer.js.map +1 -0
- package/dist/dashboard/html.d.ts +18 -0
- package/dist/dashboard/html.js +1491 -0
- package/dist/dashboard/html.js.map +1 -0
- package/dist/dashboard/server.d.ts +17 -0
- package/dist/dashboard/server.js +403 -0
- package/dist/dashboard/server.js.map +1 -0
- package/dist/db.js +38 -0
- package/dist/db.js.map +1 -1
- package/dist/index.js +211 -5
- package/dist/index.js.map +1 -1
- package/dist/tools/critterTools.js +4 -0
- package/dist/tools/critterTools.js.map +1 -1
- package/dist/tools/forecastingTools.d.ts +11 -0
- package/dist/tools/forecastingTools.js +616 -0
- package/dist/tools/forecastingTools.js.map +1 -0
- package/dist/tools/localDashboardTools.d.ts +8 -0
- package/dist/tools/localDashboardTools.js +332 -0
- package/dist/tools/localDashboardTools.js.map +1 -0
- package/dist/tools/metaTools.js +170 -1
- package/dist/tools/metaTools.js.map +1 -1
- package/dist/tools/openclawTools.d.ts +11 -0
- package/dist/tools/openclawTools.js +1017 -0
- package/dist/tools/openclawTools.js.map +1 -0
- package/dist/tools/overstoryTools.d.ts +14 -0
- package/dist/tools/overstoryTools.js +426 -0
- package/dist/tools/overstoryTools.js.map +1 -0
- package/dist/tools/prReportTools.d.ts +11 -0
- package/dist/tools/prReportTools.js +911 -0
- package/dist/tools/prReportTools.js.map +1 -0
- package/dist/tools/progressiveDiscoveryTools.js +28 -9
- package/dist/tools/progressiveDiscoveryTools.js.map +1 -1
- package/dist/tools/selfEvalTools.js +8 -1
- package/dist/tools/selfEvalTools.js.map +1 -1
- package/dist/tools/sessionMemoryTools.js +14 -2
- package/dist/tools/sessionMemoryTools.js.map +1 -1
- package/dist/tools/skillUpdateTools.d.ts +24 -0
- package/dist/tools/skillUpdateTools.js +469 -0
- package/dist/tools/skillUpdateTools.js.map +1 -0
- package/dist/tools/toolRegistry.js +178 -0
- package/dist/tools/toolRegistry.js.map +1 -1
- package/dist/tools/uiUxDiveAdvancedTools.js +61 -0
- package/dist/tools/uiUxDiveAdvancedTools.js.map +1 -1
- package/dist/tools/uiUxDiveTools.js +154 -1
- package/dist/tools/uiUxDiveTools.js.map +1 -1
- package/dist/tools/visualQaTools.d.ts +2 -0
- package/dist/tools/visualQaTools.js +1088 -0
- package/dist/tools/visualQaTools.js.map +1 -0
- package/dist/tools/webmcpTools.d.ts +16 -0
- package/dist/tools/webmcpTools.js +703 -0
- package/dist/tools/webmcpTools.js.map +1 -0
- package/dist/toolsetRegistry.js +4 -0
- package/dist/toolsetRegistry.js.map +1 -1
- package/package.json +1 -1
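The per-file `+added -removed` counts above follow the standard diff-stat convention. As an illustrative sketch (not part of nodebench-mcp or the registry tooling — the function name is hypothetical), such a summary can be totaled like this:

```python
import re

def total_diffstat(lines):
    """Sum the +added/-removed counts from diff-stat lines like
    '- package/dist/index.js +211 -5'."""
    added = removed = 0
    for line in lines:
        # The two trailing integers are the added/removed line counts.
        m = re.search(r"\+(\d+) -(\d+)$", line.strip())
        if m:
            added += int(m.group(1))
            removed += int(m.group(2))
    return added, removed

stats = [
    "- package/dist/index.js +211 -5",
    "- package/README.md +495 -280",
    "- package/package.json +1 -1",
]
print(total_diffstat(stats))  # (707, 286)
```

Lines without a trailing `+N -M` pair are simply skipped, so the helper can be fed the raw file list verbatim.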
package/README.md
CHANGED
|
@@ -5,10 +5,12 @@
|
|
|
5
5
|
One command gives your agent structured research, risk assessment, 3-layer testing, quality gates, and a persistent knowledge base — so every fix is thorough and every insight compounds into future work.
|
|
6
6
|
|
|
7
7
|
```bash
|
|
8
|
-
#
|
|
8
|
+
# Claude Code — AI Flywheel core (50 tools, recommended)
|
|
9
9
|
claude mcp add nodebench -- npx -y nodebench-mcp
|
|
10
10
|
|
|
11
|
-
#
|
|
11
|
+
# Windsurf / Cursor — same tools, add to your MCP config (see setup below)
|
|
12
|
+
|
|
13
|
+
# Need everything? Vision, web, files, parallel agents, etc.
|
|
12
14
|
claude mcp add nodebench -- npx -y nodebench-mcp --preset full
|
|
13
15
|
```
|
|
14
16
|
|
|
@@ -37,16 +39,6 @@ Every additional tool call produces a concrete artifact — an issue found, a ri
|
|
|
37
39
|
|
|
38
40
|
---
|
|
39
41
|
|
|
40
|
-
## Who's Using It
|
|
41
|
-
|
|
42
|
-
**Vision engineer** — Built agentic vision analysis using GPT 5.2 with Set-of-Mark (SoM) for boundary boxing, similar to Google Gemini 3 Flash's agentic code execution approach. Uses NodeBench's verification pipeline to validate detection accuracy across screenshot variants before shipping model changes. (Uses `full` preset for vision tools)
|
|
43
|
-
|
|
44
|
-
**QA engineer** — Transitioned a manual QA workflow website into an AI agent-driven app for a pet care messaging platform. Uses NodeBench's quality gates, verification cycles, and eval runs to ensure the AI agent handles edge cases that manual QA caught but bare AI agents miss. (Uses `default` preset — all core AI Flywheel tools)
|
|
45
|
-
|
|
46
|
-
Both found different subsets of the tools useful — which is why NodeBench ships with just 2 `--preset` levels. The `default` preset (50 tools) covers the complete AI Flywheel methodology with ~76% fewer tools. Add `--preset full` for specialized tools (vision, web, files, parallel agents, security).
|
|
47
|
-
|
|
48
|
-
---
|
|
49
|
-
|
|
50
42
|
## How It Works — 3 Real Examples
|
|
51
43
|
|
|
52
44
|
### Example 1: Bug fix
|
|
@@ -68,7 +60,7 @@ You type: *"I launched 3 Claude Code subagents but they keep overwriting each ot
|
|
|
68
60
|
|
|
69
61
|
**Without NodeBench:** Both agents see the same bug and both implement a fix. The third agent re-investigates what agent 1 already solved. Agent 2 hits context limit mid-fix and loses work.
|
|
70
62
|
|
|
71
|
-
**With NodeBench MCP:** Each subagent calls `claim_agent_task` to lock its work. Roles are assigned so they don't overlap. Context budget is tracked. Progress notes ensure handoff without starting from scratch.
|
|
63
|
+
**With NodeBench MCP:** Each subagent calls `claim_agent_task` to lock its work. Roles are assigned so they don't overlap. Context budget is tracked. Progress notes ensure handoff without starting from scratch. (Requires `--preset multi_agent` or `--preset full`.)
|
|
72
64
|
|
|
73
65
|
### Example 3: Knowledge compounding
|
|
74
66
|
|
|
@@ -78,14 +70,16 @@ Tasks 1-3 start with zero prior knowledge. By task 9, the agent finds 2+ relevan
|
|
|
78
70
|
|
|
79
71
|
## Quick Start
|
|
80
72
|
|
|
81
|
-
###
|
|
73
|
+
### Claude Code (CLI)
|
|
82
74
|
|
|
83
75
|
```bash
|
|
84
|
-
#
|
|
76
|
+
# Recommended — AI Flywheel core (50 tools)
|
|
85
77
|
claude mcp add nodebench -- npx -y nodebench-mcp
|
|
86
78
|
|
|
87
|
-
#
|
|
88
|
-
claude mcp add nodebench -- npx -y nodebench-mcp --preset
|
|
79
|
+
# Or pick a themed preset for your workflow
|
|
80
|
+
claude mcp add nodebench -- npx -y nodebench-mcp --preset web_dev
|
|
81
|
+
claude mcp add nodebench -- npx -y nodebench-mcp --preset research
|
|
82
|
+
claude mcp add nodebench -- npx -y nodebench-mcp --preset data
|
|
89
83
|
```
|
|
90
84
|
|
|
91
85
|
Or add to `~/.claude/settings.json` or `.claude.json`:
|
|
@@ -101,98 +95,108 @@ Or add to `~/.claude/settings.json` or `.claude.json`:
|
|
|
101
95
|
}
|
|
102
96
|
```
|
|
103
97
|
|
|
104
|
-
###
|
|
98
|
+
### Windsurf
|
|
99
|
+
|
|
100
|
+
Add to `~/.codeium/windsurf/mcp_config.json` (or open Settings → MCP → View raw config):
|
|
105
101
|
|
|
102
|
+
```json
|
|
103
|
+
{
|
|
104
|
+
"mcpServers": {
|
|
105
|
+
"nodebench": {
|
|
106
|
+
"command": "npx",
|
|
107
|
+
"args": ["-y", "nodebench-mcp"]
|
|
108
|
+
}
|
|
109
|
+
}
|
|
110
|
+
}
|
|
106
111
|
```
|
|
107
|
-
# See what's available
|
|
108
|
-
> Use discover_tools("verify my implementation") to find relevant tools
|
|
109
112
|
|
|
110
|
-
|
|
111
|
-
> Use getMethodology("overview") to see all workflows
|
|
113
|
+
### Cursor
|
|
112
114
|
|
|
113
|
-
|
|
114
|
-
> Use search_all_knowledge("what I'm about to work on")
|
|
115
|
+
Add to `.cursor/mcp.json` in your project root (or open Settings → MCP):
|
|
115
116
|
|
|
116
|
-
|
|
117
|
-
|
|
117
|
+
```json
|
|
118
|
+
{
|
|
119
|
+
"mcpServers": {
|
|
120
|
+
"nodebench": {
|
|
121
|
+
"command": "npx",
|
|
122
|
+
"args": ["-y", "nodebench-mcp"]
|
|
123
|
+
}
|
|
124
|
+
}
|
|
125
|
+
}
|
|
118
126
|
```
|
|
119
127
|
|
|
120
|
-
###
|
|
128
|
+
### Other MCP Clients
|
|
121
129
|
|
|
122
|
-
|
|
130
|
+
Any MCP-compatible client works. The config format is the same — point `command` to `npx` and `args` to `["-y", "nodebench-mcp"]`. Add `"--preset", "<name>"` to the args array for themed presets.
|
|
131
|
+
|
|
132
|
+
### First Prompts to Try
|
|
123
133
|
|
|
124
|
-
**Get smart preset recommendation:**
|
|
125
|
-
```bash
|
|
126
|
-
npx nodebench-mcp --smart-preset
|
|
127
134
|
```
|
|
135
|
+
# See what's available
|
|
136
|
+
> Use discover_tools("verify my implementation") to find relevant tools
|
|
128
137
|
|
|
129
|
-
|
|
138
|
+
# Page through results
|
|
139
|
+
> Use discover_tools({ query: "verify", limit: 5, offset: 5 }) for page 2
|
|
130
140
|
|
|
131
|
-
|
|
132
|
-
|
|
133
|
-
npx nodebench-mcp --stats
|
|
134
|
-
```
|
|
141
|
+
# Expand results via conceptual neighbors
|
|
142
|
+
> Use discover_tools({ query: "deploy changes", expand: 3 }) for broader discovery
|
|
135
143
|
|
|
136
|
-
|
|
144
|
+
# Explore a tool's neighborhood (multi-hop)
|
|
145
|
+
> Use get_tool_quick_ref({ tool_name: "run_recon", depth: 2 }) to see 2-hop graph
|
|
137
146
|
|
|
138
|
-
|
|
139
|
-
|
|
140
|
-
npx nodebench-mcp --export-stats > usage-stats.json
|
|
141
|
-
```
|
|
147
|
+
# Get methodology guidance
|
|
148
|
+
> Use getMethodology("overview") to see all workflows
|
|
142
149
|
|
|
143
|
-
|
|
144
|
-
|
|
145
|
-
npx nodebench-mcp --list-presets
|
|
146
|
-
```
|
|
150
|
+
# Before your next task — search for prior knowledge
|
|
151
|
+
> Use search_all_knowledge("what I'm about to work on")
|
|
147
152
|
|
|
148
|
-
|
|
149
|
-
|
|
150
|
-
npx nodebench-mcp --reset-stats
|
|
153
|
+
# Run the full verification pipeline on a change
|
|
154
|
+
> Use getMethodology("mandatory_flywheel") and follow the 6 steps
|
|
151
155
|
```
|
|
152
156
|
|
|
153
|
-
|
|
154
|
-
|
|
155
|
-
### Optional: API keys for web search and vision
|
|
157
|
+
### Optional: API Keys
|
|
156
158
|
|
|
157
159
|
```bash
|
|
158
160
|
export GEMINI_API_KEY="your-key" # Web search + vision (recommended)
|
|
159
161
|
export GITHUB_TOKEN="your-token" # GitHub (higher rate limits)
|
|
160
162
|
```
|
|
161
163
|
|
|
162
|
-
|
|
164
|
+
Set these as environment variables, or add them to the `env` block in your MCP config:
|
|
163
165
|
|
|
164
|
-
|
|
166
|
+
```json
|
|
167
|
+
{
|
|
168
|
+
"mcpServers": {
|
|
169
|
+
"nodebench": {
|
|
170
|
+
"command": "npx",
|
|
171
|
+
"args": ["-y", "nodebench-mcp"],
|
|
172
|
+
"env": {
|
|
173
|
+
"GEMINI_API_KEY": "your-key",
|
|
174
|
+
"GITHUB_TOKEN": "your-token"
|
|
175
|
+
}
|
|
176
|
+
}
|
|
177
|
+
}
|
|
178
|
+
}
|
|
179
|
+
```
|
|
165
180
|
|
|
166
|
-
|
|
167
|
-
- GAIA fixtures and attachments are written under `.cache/gaia` (gitignored). Do not commit GAIA content.
|
|
168
|
-
- Fixture generation requires `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN`.
|
|
181
|
+
### Usage Analytics & Smart Presets
|
|
169
182
|
|
|
170
|
-
|
|
171
|
-
```bash
|
|
172
|
-
npm run mcp:dataset:gaia:capability:refresh
|
|
173
|
-
NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:test
|
|
174
|
-
```
|
|
183
|
+
NodeBench MCP tracks tool usage locally and can recommend optimal presets based on your project type and usage patterns.
|
|
175
184
|
|
|
176
|
-
File-backed lane (PDF / XLSX / CSV / DOCX / PPTX / JSON / JSONL / TXT / ZIP via `local_file` tools):
|
|
177
185
|
```bash
|
|
178
|
-
|
|
179
|
-
|
|
186
|
+
npx nodebench-mcp --smart-preset # Get AI-powered preset recommendation
|
|
187
|
+
npx nodebench-mcp --stats # Show usage statistics (last 30 days)
|
|
188
|
+
npx nodebench-mcp --export-stats # Export usage data to JSON
|
|
189
|
+
npx nodebench-mcp --list-presets # List all available presets
|
|
190
|
+
npx nodebench-mcp --reset-stats # Clear analytics data
|
|
180
191
|
```
|
|
181
192
|
|
|
182
|
-
|
|
183
|
-
- Stable: `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag`
|
|
184
|
-
- More realistic: `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent`
|
|
185
|
-
|
|
186
|
-
Notes:
|
|
187
|
-
- ZIP attachments require `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (multi-step extract -> parse).
|
|
193
|
+
All analytics data is stored locally in `~/.nodebench/analytics.db` and never leaves your machine.
|
|
188
194
|
|
|
189
195
|
---
|
|
190
196
|
|
|
191
|
-
## What You Get
|
|
197
|
+
## What You Get — The AI Flywheel
|
|
192
198
|
|
|
193
|
-
|
|
194
|
-
|
|
195
|
-
The `default` preset (50 tools) gives you the complete AI Flywheel methodology from [AI_FLYWHEEL.md](https://github.com/HomenShum/nodebench-ai/blob/main/AI_FLYWHEEL.md):
|
|
199
|
+
The default setup (no `--preset` flag) gives you **50 tools** that implement the complete [AI Flywheel](https://github.com/HomenShum/nodebench-ai/blob/main/AI_FLYWHEEL.md) methodology — two interlocking loops that compound quality over time:
|
|
196
200
|
|
|
197
201
|
```
|
|
198
202
|
Research → Risk → Implement → Test (3 layers) → Eval → Gate → Learn → Ship
|
|
@@ -203,21 +207,50 @@ Research → Risk → Implement → Test (3 layers) → Eval → Gate → Learn
|
|
|
203
207
|
**Inner loop** (per change): 6-phase verification ensures correctness.
|
|
204
208
|
**Outer loop** (over time): Eval-driven development ensures improvement.
|
|
205
209
|
|
|
206
|
-
###
|
|
210
|
+
### What's in the Default Preset (50 Tools)
|
|
207
211
|
|
|
208
|
-
The
|
|
212
|
+
The default preset has 3 layers:
|
|
209
213
|
|
|
210
|
-
1
|
|
214
|
+
**Layer 1 — Discovery (6 tools):** *"What tool should I use?"*
|
|
211
215
|
|
|
212
|
-
|
|
216
|
+
| Tool | Purpose |
|
|
217
|
+
|---|---|
|
|
218
|
+
| `findTools` | Keyword search across all tools |
|
|
219
|
+
| `getMethodology` | Access methodology guides (20 topics) |
|
|
220
|
+
| `check_mcp_setup` | Diagnostic wizard — checks env vars, API keys, optional deps |
|
|
221
|
+
| `discover_tools` | 14-strategy hybrid search with pagination (`offset`), result expansion (`expand`), and `relatedTools` neighbors |
|
|
222
|
+
| `get_tool_quick_ref` | Quick reference with multi-hop BFS traversal (`depth` 1-3) — discovers tools 2-3 hops away |
|
|
223
|
+
| `get_workflow_chain` | Step-by-step recipes for 28 common workflows |
|
|
213
224
|
|
|
214
|
-
|
|
225
|
+
**Layer 2 — Dynamic Loading (6 tools):** *"Add/remove tools from my session"*
|
|
215
226
|
|
|
216
|
-
|
|
227
|
+
| Tool | Purpose |
|
|
228
|
+
|---|---|
|
|
229
|
+
| `load_toolset` | Add a toolset to the current session on demand |
|
|
230
|
+
| `unload_toolset` | Remove a toolset to recover context budget |
|
|
231
|
+
| `list_available_toolsets` | See all 39 toolsets with tool counts |
|
|
232
|
+
| `call_loaded_tool` | Proxy for clients that don't support dynamic tool updates |
|
|
233
|
+
| `smart_select_tools` | LLM-powered tool selection (sends compact catalog to fast model) |
|
|
234
|
+
| `get_ab_test_report` | Compare static vs dynamic loading performance |
|
|
217
235
|
|
|
218
|
-
|
|
236
|
+
**Layer 3 — AI Flywheel Core Methodology (38 tools):** *"Do the work"*
|
|
219
237
|
|
|
220
|
-
|
|
238
|
+
| Domain | Tools | What You Get |
|
|
239
|
+
|---|---|---|
|
|
240
|
+
| **verification** | 8 | `start_verification_cycle`, `log_gap`, `resolve_gap`, `get_cycle_status`, `triple_verify`, `run_closed_loop`, `compare_cycles`, `list_cycles` |
|
|
241
|
+
| **eval** | 6 | `start_eval_run`, `record_eval_result`, `get_eval_summary`, `compare_eval_runs`, `get_eval_diff`, `list_eval_runs` |
|
|
242
|
+
| **quality_gate** | 4 | `run_quality_gate`, `create_gate_preset`, `get_gate_history`, `list_gate_presets` |
|
|
243
|
+
| **learning** | 4 | `record_learning`, `search_all_knowledge`, `get_knowledge_stats`, `list_recent_learnings` |
|
|
244
|
+
| **flywheel** | 4 | `run_mandatory_flywheel`, `promote_to_eval`, `investigate_blind_spot`, `get_flywheel_status` |
|
|
245
|
+
| **recon** | 7 | `run_recon`, `log_recon_finding`, `assess_risk`, `get_recon_summary`, `list_recon_sessions`, `check_framework_version`, `search_recon_findings` |
|
|
246
|
+
| **security** | 3 | `scan_dependencies`, `analyze_code_security`, `scan_terminal_output` |
|
|
247
|
+
| **boilerplate** | 2 | `scaffold_nodebench_project`, `get_boilerplate_status` |
|
|
248
|
+
|
|
249
|
+
> **Note:** `skill_update` (4 tools for rule file freshness tracking) is available via `load_toolset("skill_update")` when needed.
|
|
250
|
+
|
|
251
|
+
### Core Workflow — Use These Every Session
|
|
252
|
+
|
|
253
|
+
These are the AI Flywheel tools documented in [AI_FLYWHEEL.md](https://github.com/HomenShum/nodebench-ai/blob/main/AI_FLYWHEEL.md):
|
|
221
254
|
|
|
222
255
|
| When you... | Use this | Impact |
|
|
223
256
|
|---|---|---|
|
|
@@ -230,30 +263,59 @@ This approach minimizes token overhead while ensuring agents have access to the
|
|
|
230
263
|
| Gate before deploy | `run_quality_gate` | Boolean rules enforced — violations block deploy |
|
|
231
264
|
| Bank knowledge | `record_learning` | Persisted findings compound across future sessions |
|
|
232
265
|
| Verify completeness | `run_mandatory_flywheel` | 6-step minimum — catches dead code and intent mismatches |
|
|
266
|
+
| Re-examine for 11/10 | Fresh-eyes review | After completing, re-examine for exceptional quality — a11y, resilience, polish |
|
|
233
267
|
|
|
234
|
-
###
|
|
268
|
+
### Mandatory After Any Non-Trivial Change
|
|
235
269
|
|
|
236
|
-
|
|
237
|
-
|
|
238
|
-
|
|
239
|
-
|
|
240
|
-
|
|
241
|
-
|
|
242
|
-
|
|
243
|
-
| Bootstrap any repo | `bootstrap_parallel_agents` | Auto-detect gaps, scaffold coordination infra | `full` |
|
|
270
|
+
1. **Static analysis**: `tsc --noEmit` and linter checks
|
|
271
|
+
2. **Happy-path test**: Run the changed functionality with valid inputs
|
|
272
|
+
3. **Failure-path test**: Validate expected error handling + edge cases
|
|
273
|
+
4. **Gap analysis**: Dead code, unused vars, missing integrations, intent mismatch
|
|
274
|
+
5. **Fix and re-verify**: Rerun steps 1-3 from scratch after any fix
|
|
275
|
+
6. **Deploy and document**: Ship + write down what changed and why
|
|
276
|
+
7. **Re-examine for 11/10**: Re-examine the completed work with fresh eyes. Not "does it work?" but "is this the best it can be?" Check: prefers-reduced-motion, color-blind safety, print stylesheet, error resilience (partial failures, retry with backoff), keyboard efficiency (skip links, Ctrl+K search), skeleton loading, staggered animations, progressive disclosure for large datasets. Fix what you find, then re-examine your fixes.
|
|
244
277
|
|
|
245
|
-
|
|
278
|
+
---
|
|
279
|
+
|
|
280
|
+
## Themed Presets — Choose Your Workflow
|
|
246
281
|
|
|
247
|
-
|
|
282
|
+
The default preset covers the AI Flywheel. For specialized workflows, pick a themed preset that adds domain-specific tools on top:
|
|
248
283
|
|
|
249
|
-
|
|
|
284
|
+
| Preset | Tools | What it adds to the default | Use case |
|
|
250
285
|
|---|---|---|---|
|
|
251
|
-
|
|
|
252
|
-
|
|
|
253
|
-
|
|
|
254
|
-
|
|
|
286
|
+
| **default** ⭐ | **50** | — | Bug fixes, features, refactoring, code review |
|
|
287
|
+
| `web_dev` | 102 | + vision, UI capture, SEO, git workflow, architect, UI/UX dive, MCP bridge, PR reports | Web projects with visual QA |
|
|
288
|
+
| `mobile` | 91 | + vision, UI capture, flicker detection, UI/UX dive, MCP bridge | Mobile apps with screenshot analysis |
|
|
289
|
+
| `academic` | 82 | + research writing, LLM, web, local file parsing | Academic papers and research |
|
|
290
|
+
| `multi_agent` | 79 | + parallel agents, self-eval, session memory, pattern mining, TOON | Multi-agent coordination |
|
|
291
|
+
| `data` | 74 | + local file parsing (CSV/XLSX/PDF/DOCX/JSON), LLM, web | Data analysis and file processing |
|
|
292
|
+
| `content` | 69 | + LLM, critter, email, RSS, platform queue, architect | Content pipelines and publishing |
|
|
293
|
+
| `research` | 67 | + web search, LLM, RSS feeds, email, docs | Research workflows |
|
|
294
|
+
| `devops` | 64 | + git compliance, session memory, benchmarks, pattern mining, PR reports | CI/CD and operations |
|
|
295
|
+
| `full` | 218 | + everything (all 39 toolsets) | Maximum coverage |
|
|
296
|
+
|
|
297
|
+
```bash
|
|
298
|
+
# Claude Code
|
|
299
|
+
claude mcp add nodebench -- npx -y nodebench-mcp --preset web_dev
|
|
300
|
+
|
|
301
|
+
# Windsurf / Cursor — add --preset to args
|
|
302
|
+
{
|
|
303
|
+
"mcpServers": {
|
|
304
|
+
"nodebench": {
|
|
305
|
+
"command": "npx",
|
|
306
|
+
"args": ["-y", "nodebench-mcp", "--preset", "web_dev"]
|
|
307
|
+
}
|
|
308
|
+
}
|
|
309
|
+
}
|
|
310
|
+
```
|
|
255
311
|
|
|
256
|
-
|
|
312
|
+
### Let AI Pick Your Preset
|
|
313
|
+
|
|
314
|
+
```bash
|
|
315
|
+
npx nodebench-mcp --smart-preset
|
|
316
|
+
```
|
|
317
|
+
|
|
318
|
+
Analyzes your project (language, framework, project type) and usage history to recommend the best preset.
|
|
257
319
|
|
|
258
320
|
---
|
|
259
321
|
|
|
@@ -280,9 +342,135 @@ The comparative benchmark validates this with 9 real production scenarios:
|
|
|
280
342
|
|
|
281
343
|
---
|
|
282
344
|
|
|
283
|
-
##
|
|
345
|
+
## Governance Model — What Your Agent Can and Can't Do
|
|
346
|
+
|
|
347
|
+
NodeBench enforces decision rights so you know exactly what your agent does autonomously vs what requires your approval. This is the "King Mode" layer — you delegate outcomes, not tasks, and the governance model ensures the agent stays within bounds.
|
|
348
|
+
|
|
349
|
+
### Autonomous (agent acts without asking)
|
|
350
|
+
|
|
351
|
+
These actions are safe for the agent to perform without human confirmation:
|
|
352
|
+
|
|
353
|
+
- Run tests and fix failing assertions
|
|
354
|
+
- Refactor within existing patterns (no new dependencies)
|
|
355
|
+
- Add logging, comments, and documentation
|
|
356
|
+
- Update type definitions to match implementation
|
|
357
|
+
- Fix lint errors and format code
|
|
358
|
+
|
|
359
|
+
### Requires Confirmation (agent asks before acting)
|
|
360
|
+
|
|
361
|
+
These actions trigger a confirmation prompt because they have broader impact:
|
|
362
|
+
|
|
363
|
+
- Changes to auth, security, or permissions logic
|
|
364
|
+
- Database migrations or schema changes
|
|
365
|
+
- API contract changes (new endpoints, changed signatures)
|
|
366
|
+
- Adding or removing dependencies
|
|
367
|
+
- Deleting code, files, or features
|
|
368
|
+
- Changes to CI/CD configuration
|
|
369
|
+
|
|
370
|
+
### Quality Gates (enforced before any deploy)
|
|
371
|
+
|
|
372
|
+
Every change must pass these gates before the agent can consider the work done:
|
|
373
|
+
|
|
374
|
+
| Gate | What it checks | Failure behavior |
|
|
375
|
+
|------|---------------|------------------|
|
|
376
|
+
| Static analysis | `tsc --noEmit`, lint passes | Agent must fix before proceeding |
|
|
377
|
+
| Unit tests | All tests pass | Agent must fix or explain why skipped |
|
|
378
|
+
| Integration tests | E2E scenarios pass | Agent must fix or flag as known issue |
|
|
379
|
+
| Verification cycle | No unresolved HIGH gaps | Agent must resolve or escalate |
|
|
380
|
+
| Knowledge banked | Learning recorded for future | Agent must document what it learned |
|
|
381
|
+
|
|
382
|
+
### How this works in practice
|
|
383
|
+
|
|
384
|
+
**With Claude Code:**
|
|
385
|
+
```
|
|
386
|
+
> "Fix the LinkedIn posting bug"
|
|
387
|
+
|
|
388
|
+
Agent runs recon → finds 3 related issues
|
|
389
|
+
Agent logs gaps → 2 HIGH, 1 MEDIUM
|
|
390
|
+
Agent fixes all 3 → runs tests → all pass
|
|
391
|
+
Agent hits quality gate → knowledge not banked
|
|
392
|
+
Agent records learning → gate passes
|
|
393
|
+
Agent: "Fixed. 3 issues resolved, knowledge banked."
|
|
394
|
+
```
|
|
395
|
+
|
|
396
|
+
**With Cursor Agent:**
|
|
397
|
+
```
|
|
398
|
+
> "Add rate limiting to the API"
|
|
399
|
+
|
|
400
|
+
Agent runs risk assessment → HIGH (auth-adjacent)
|
|
401
|
+
Agent: "This touches auth middleware. Confirm?"
|
|
402
|
+
You: "Yes, proceed"
|
|
403
|
+
Agent implements → tests pass → gate passes
|
|
404
|
+
Agent: "Done. Added rate limiting with tests."
|
|
405
|
+
```
|
|
406
|
+
|
|
407
|
+
---
|
|
408
|
+
|
|
409
|
+
## Case Studies
|
|
410
|
+
|
|
411
|
+
### Case Study 1: Bug Fix with Knowledge Compounding
|
|
412
|
+
|
|
413
|
+
**Context:** Solo founder using Claude Code to fix a recurring bug in their SaaS.
|
|
414
|
+
|
|
415
|
+
**Before NodeBench:**
|
|
416
|
+
- Agent fixes the immediate bug
|
|
417
|
+
- Runs tests once, passes
|
|
418
|
+
- Ships
|
|
419
|
+
- 3 days later, related bug appears in production
|
|
420
|
+
- Agent re-investigates from scratch
|
|
421
|
+
|
|
422
|
+
**With NodeBench:**
|
|
423
|
+
- Agent runs `run_recon` → finds 2 related issues
|
|
424
|
+
- Agent runs `log_gap` → tracks all 3 issues
|
|
425
|
+
- Agent fixes all 3 → runs 3-layer tests
|
|
426
|
+
- Agent runs `run_quality_gate` → passes
|
|
427
|
+
- Agent runs `record_learning` → banks the pattern
|
|
428
|
+
- Next similar bug: agent finds the prior learning in `search_all_knowledge` and fixes in half the time
|
|
429
|
+
|
|
430
|
+
**Result:** Time to fix similar bugs decreased 50% over 30 days.
|
|
431
|
+
|
|
432
|
+
### Case Study 2: Parallel Agents Without Conflicts
|
|
433
|
+
|
|
434
|
+
**Context:** Developer spawns 3 Claude Code subagents to fix different bugs in the same codebase.
|
|
435
|
+
|
|
436
|
+
**Before NodeBench:**
|
|
437
|
+
- Agent 1 and Agent 2 both see the same bug
|
|
438
|
+
- Both implement a fix
|
|
439
|
+
- Agent 2's fix overwrites Agent 1's fix
|
|
440
|
+
- Agent 3 re-investigates what Agent 1 already solved
|
|
441
|
+
- Agent 2 hits context limit mid-fix, loses work
|
|
442
|
+
|
|
443
|
+
**With NodeBench:**
|
|
444
|
+
- Each agent calls `claim_agent_task` → locks its work
|
|
445
|
+
- Roles assigned via `assign_agent_role` → no overlap
|
|
446
|
+
- Context budget tracked via `log_context_budget`
|
|
447
|
+
- Progress notes shared via `release_agent_task`
|
|
448
|
+
- All 3 bugs fixed without conflict
|
|
449
|
+
|
|
450
|
+
**Result:** Parallel agent success rate increased from 60% to 95%.
|
|
451
|
+
|
|
452
|
+
### Case Study 3: Security-Sensitive Change
|
|
453
|
+
|
|
454
|
+
**Context:** Small team using Cursor Agent to add a new API endpoint.
|
|
284
455
|
|
|
285
|
-
|
|
456
|
+
**Before NodeBench:**
|
|
457
|
+
- Agent implements the endpoint
|
|
458
|
+
- Tests pass
|
|
459
|
+
- Ships
|
|
460
|
+
- 2 weeks later, security audit finds auth bypass
|
|
461
|
+
|
|
462
|
+
**With NodeBench:**
|
|
463
|
+
- Agent runs `assess_risk` → HIGH (auth-adjacent)
|
|
464
|
+
- Agent prompts for confirmation before proceeding
|
|
465
|
+
- Human reviews the planned changes
|
|
466
|
+
- Security issue caught before code is written
|
|
467
|
+
- Agent implements with security constraints
|
|
468
|
+
|
|
469
|
+
**Result:** Security-related incidents from AI code reduced to zero.
|
|
470
|
+
|
|
471
|
+
---
|
|
472
|
+
|
|
473
|
+
## Progressive Discovery
|
|
286
474
|
|
|
287
475
|
### Multi-modal search engine
|
|
288
476
|
|
|
@@ -309,19 +497,71 @@ The `discover_tools` search engine scores tools using **14 parallel strategies**
|
|
|
309
497
|
|
|
310
498
|
Pass `explain: true` to see exactly which strategies contributed to each score.
|
|
311
499
|
|
|
312
|
-
###
|
|
500
|
+
### Cursor pagination
|
|
501
|
+
|
|
502
|
+
Page through large result sets with `offset` and `limit`:
|
|
503
|
+
|
|
504
|
+
```
|
|
505
|
+
> discover_tools({ query: "verify", limit: 5 })
|
|
506
|
+
# Returns: { results: [...5 tools], totalMatches: 76, hasMore: true, offset: 0 }
|
|
507
|
+
|
|
508
|
+
> discover_tools({ query: "verify", limit: 5, offset: 5 })
|
|
509
|
+
# Returns: { results: [...next 5 tools], totalMatches: 76, hasMore: true, offset: 5 }
|
|
510
|
+
```
|
|
511
|
+
|
|
512
|
+
`totalMatches` is stable across pages. `hasMore` tells you whether another page exists.
|
|
513
|
+
|
|
514
|
+
### Result expansion via relatedTools
|
|
515
|
+
|
|
516
|
+
Broaden results by following conceptual neighbors:
|
|
517
|
+
|
|
518
|
+
```
|
|
519
|
+
> discover_tools({ query: "deploy and ship changes", expand: 3 })
|
|
520
|
+
# Top 3 results' relatedTools neighbors are added at 50% parent score
|
|
521
|
+
# "deploy" finds git_workflow tools → expansion adds quality_gate, flywheel tools
|
|
522
|
+
# Expanded results include depth: 1 and expandedFrom fields
|
|
523
|
+
```
|
|
524
|
+
|
|
525
|
+
Dogfood A/B results: 5/8 queries gained recall lift (+2 to +8 new tools per query). "deploy and ship changes" went from 82 → 90 matches.
|
|
526
|
+
|
|
527
|
+
### Quick refs — what to do next (with multi-hop)
|
|
313
528
|
|
|
314
529
|
Every tool response auto-appends a `_quickRef` with:
|
|
315
530
|
- **nextAction**: What to do immediately after this tool
|
|
316
|
-
- **nextTools**: Recommended follow-up tools
|
|
531
|
+
- **nextTools**: Recommended follow-up tools (workflow-sequential)
|
|
532
|
+
- **relatedTools**: Conceptually adjacent tools (same domain, shared tags — 949 connections across 218 tools)
|
|
317
533
|
- **methodology**: Which methodology guide to consult
|
|
318
534
|
- **tip**: Practical usage advice
|
|
319
535
|
|
|
320
|
-
Call `get_tool_quick_ref("tool_name")` for any tool's guidance
|
|
536
|
+
Call `get_tool_quick_ref("tool_name")` for any tool's guidance — or use **multi-hop BFS traversal** to discover tools 2-3 hops away:
|
|
537
|
+
|
|
538
|
+
```
|
|
539
|
+
> get_tool_quick_ref({ tool_name: "start_verification_cycle", depth: 1 })
|
|
540
|
+
# Returns: direct neighbors via nextTools + relatedTools (hopDistance: 1)
|
|
541
|
+
|
|
542
|
+
> get_tool_quick_ref({ tool_name: "start_verification_cycle", depth: 2 })
|
|
543
|
+
# Returns: direct neighbors + their neighbors (hopDistance: 1 and 2)
|
|
544
|
+
# Discovers 34 additional tools reachable in 2 hops
|
|
545
|
+
|
|
546
|
+
> get_tool_quick_ref({ tool_name: "start_verification_cycle", depth: 3 })
|
|
547
|
+
# Returns: 3-hop BFS traversal — full neighborhood graph
|
|
548
|
+
```
|
|
549
|
+
|
|
550
|
+
Each discovered tool includes `hopDistance` (1-3) and `reachedVia` (which parent tool led to it). BFS prevents cycles — no tool appears at multiple depths.
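The cycle-free BFS can be sketched directly. The graph below is hypothetical; the real traversal unions `nextTools` and `relatedTools`:

```typescript
interface Reached { tool: string; hopDistance: number; reachedVia: string }

// Hypothetical neighbor graph (nextTools ∪ relatedTools).
const neighbors: Record<string, string[]> = {
  start_verification_cycle: ["log_verification_gap", "run_eval_batch"],
  log_verification_gap: ["triple_verify"],
  run_eval_batch: ["get_eval_results", "triple_verify"],
};

function multiHop(root: string, depth: number): Reached[] {
  const visited = new Set([root]); // BFS prevents cycles
  const out: Reached[] = [];
  let frontier = [root];
  for (let hop = 1; hop <= depth; hop++) {
    const next: string[] = [];
    for (const via of frontier) {
      for (const t of neighbors[via] ?? []) {
        if (visited.has(t)) continue; // each tool appears at one depth only
        visited.add(t);
        out.push({ tool: t, hopDistance: hop, reachedVia: via });
        next.push(t);
      }
    }
    frontier = next;
  }
  return out;
}
```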
|
|
551
|
+
|
|
552
|
+
### `nextTools` vs `relatedTools`
|
|
553
|
+
|
|
554
|
+
| | `nextTools` | `relatedTools` |
|
|
555
|
+
|---|---|---|
|
|
556
|
+
| **Meaning** | Workflow-sequential ("do X then Y") | Conceptually adjacent ("if doing X, consider Y") |
|
|
557
|
+
| **Example** | `run_recon` → `log_recon_finding` | `run_recon` → `search_all_knowledge`, `bootstrap_project` |
|
|
558
|
+
| **Total connections** | 498 | 949 (1.9× the `nextTools` count) |
|
|
559
|
+
| **Overlap** | — | 0% (all net-new connections) |
|
|
560
|
+
| **Cross-domain** | Mostly same-domain | 90% bridge different domains |
|
|
321
561
|
|
|
322
562
|
### Workflow chains — step-by-step recipes
|
|
323
563
|
|
|
324
|
-
|
|
564
|
+
28 pre-built chains for common workflows:
|
|
325
565
|
|
|
326
566
|
| Chain | Steps | Use case |
|
|
327
567
|
|---|---|---|
|
|
@@ -349,6 +589,10 @@ Call `get_tool_quick_ref("tool_name")` for any tool's guidance.
|
|
|
349
589
|
| `pr_review` | 5 | Pull request review |
|
|
350
590
|
| `seo_audit` | 6 | Full SEO audit |
|
|
351
591
|
| `voice_pipeline` | 6 | Voice pipeline implementation |
|
|
592
|
+
| `intentionality_check` | 4 | Verify agent intent before action |
|
|
593
|
+
| `research_digest` | 6 | Summarize research across sessions |
|
|
594
|
+
| `email_assistant` | 5 | Email triage and response |
|
|
595
|
+
| `pr_creation` | 6 | Visual PR creation from UI Dive sessions |
|
|
352
596
|
|
|
353
597
|
Call `get_workflow_chain("new_feature")` to get the step-by-step sequence.
|
|
354
598
|
|
|
@@ -365,120 +609,21 @@ Or use the scaffold tool: `scaffold_nodebench_project` creates AGENTS.md, .mcp.j
|
|
|
365
609
|
|
|
366
610
|
---
|
|
367
611
|
|
|
368
|
-
##
|
|
369
|
-
|
|
370
|
-
NodeBench MCP isn't just a bag of tools — it's a pipeline. Each step feeds the next:
|
|
371
|
-
|
|
372
|
-
```
|
|
373
|
-
Research → Risk → Implement → Test (3 layers) → Eval → Gate → Learn → Ship
|
|
374
|
-
↑ │
|
|
375
|
-
└──────────── knowledge compounds ─────────────────────────────┘
|
|
376
|
-
```
|
|
377
|
-
|
|
378
|
-
**Inner loop** (per change): 6-phase verification ensures correctness.
|
|
379
|
-
**Outer loop** (over time): Eval-driven development ensures improvement.
|
|
380
|
-
**Together**: The AI Flywheel — every verification produces eval artifacts, every regression triggers verification.
|
|
381
|
-
|
|
382
|
-
### The 6-Phase Verification Process (Inner Loop)
|
|
383
|
-
|
|
384
|
-
Every non-trivial change should go through these 6 steps:
|
|
385
|
-
|
|
386
|
-
1. **Context Gathering** — Parallel subagent deep dive into SDK specs, implementation patterns, dispatcher/backend audit, external API research
|
|
387
|
-
2. **Gap Analysis** — Compare findings against current implementation, categorize gaps (CRITICAL/HIGH/MEDIUM/LOW)
|
|
388
|
-
3. **Implementation** — Apply fixes following production patterns exactly
|
|
389
|
-
4. **Testing & Validation** — 5 layers: static analysis, unit tests, integration tests, manual verification, live end-to-end
|
|
390
|
-
5. **Self-Closed-Loop Verification** — Parallel verification subagents check spec compliance, functional correctness, argument compatibility
|
|
391
|
-
6. **Document Learnings** — Update documentation with edge cases and key learnings
|
|
392
|
-
|
|
393
|
-
### The Eval-Driven Development Loop (Outer Loop)
|
|
394
|
-
|
|
395
|
-
1. **Run Eval Batch** — Send test cases through the target workflow
|
|
396
|
-
2. **Capture Telemetry** — Collect complete agent execution trace
|
|
397
|
-
3. **LLM-as-Judge Analysis** — Score goal alignment, tool efficiency, output quality
|
|
398
|
-
4. **Retrieve Results** — Aggregate pass/fail rates and improvement suggestions
|
|
399
|
-
5. **Fix, Optimize, Enhance** — Apply changes based on judge feedback
|
|
400
|
-
6. **Re-run Evals** — Deploy only if scores improve
|
|
401
|
-
|
|
402
|
-
**Rule: No change ships without an eval improvement.**
|
|
403
|
-
|
|
404
|
-
Ask the agent: `Use getMethodology("overview")` to see all 20 methodology topics.
|
|
405
|
-
|
|
406
|
-
---
|
|
407
|
-
|
|
408
|
-
## Parallel Agents with Claude Code
|
|
409
|
-
|
|
410
|
-
Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www.anthropic.com/engineering/building-c-compiler) (Feb 2026).
|
|
411
|
-
|
|
412
|
-
**When to use:** Only when running 2+ agent sessions. Single-agent workflows use the standard pipeline above.
|
|
413
|
-
|
|
414
|
-
**How it works with Claude Code's Task tool:**
|
|
415
|
-
|
|
416
|
-
1. **COORDINATOR** (your main session) breaks work into independent tasks
|
|
417
|
-
2. Each **Task tool** call spawns a subagent with instructions to:
|
|
418
|
-
- `claim_agent_task` — lock the task
|
|
419
|
-
- `assign_agent_role` — specialize (implementer, test_writer, critic, etc.)
|
|
420
|
-
- Do the work
|
|
421
|
-
- `release_agent_task` — handoff with progress note
|
|
422
|
-
3. Coordinator calls `get_parallel_status` to monitor all subagents
|
|
423
|
-
4. Coordinator runs `run_quality_gate` on the aggregate result
|
|
424
|
-
|
|
425
|
-
**MCP Prompts available:**
|
|
426
|
-
- `claude-code-parallel` — Step-by-step Claude Code subagent coordination
|
|
427
|
-
- `parallel-agent-team` — Full team setup with role assignment
|
|
428
|
-
- `oracle-test-harness` — Validate outputs against known-good reference
|
|
429
|
-
- `bootstrap-parallel-agents` — Scaffold parallel infra for any repo
|
|
430
|
-
|
|
431
|
-
**Note:** Parallel agent coordination tools are only available in the `full` preset. For single-agent workflows, the `default` preset provides all the core AI Flywheel tools you need.
|
|
432
|
-
|
|
433
|
-
---
|
|
434
|
-
|
|
435
|
-
## Toolset Gating
|
|
436
|
-
|
|
437
|
-
The default preset (50 tools) gives you the complete AI Flywheel methodology with ~78% fewer tools compared to the full suite (175 tools).
|
|
438
|
-
|
|
439
|
-
### Presets — Choose What You Need
|
|
440
|
-
|
|
441
|
-
| Preset | Tools | Domains | Use case |
|
|
442
|
-
|---|---|---|---|
|
|
443
|
-
| **default** ⭐ | **50** | 7 | **Recommended** — Complete AI Flywheel: verification, eval, quality_gate, learning, flywheel, recon, boilerplate + discovery + dynamic loading |
|
|
444
|
-
| `full` | 175 | 34 | Everything — vision, UI capture, web, GitHub, docs, parallel, local files, GAIA solvers, security, email, RSS, architect |
|
|
445
|
-
|
|
446
|
-
```bash
|
|
447
|
-
# ⭐ Recommended: Default (50 tools) - complete AI Flywheel
|
|
448
|
-
claude mcp add nodebench -- npx -y nodebench-mcp
|
|
449
|
-
|
|
450
|
-
# Everything: All 175 tools
|
|
451
|
-
claude mcp add nodebench -- npx -y nodebench-mcp --preset full
|
|
452
|
-
```
|
|
453
|
-
|
|
454
|
-
Or in config:
|
|
612
|
+
## Scaling MCP: How We Solved the 5 Biggest Industry Problems
|
|
455
613
|
|
|
456
|
-
|
|
457
|
-
{
|
|
458
|
-
"mcpServers": {
|
|
459
|
-
"nodebench": {
|
|
460
|
-
"command": "npx",
|
|
461
|
-
"args": ["-y", "nodebench-mcp"]
|
|
462
|
-
}
|
|
463
|
-
}
|
|
464
|
-
}
|
|
465
|
-
```
|
|
466
|
-
|
|
467
|
-
### Scaling MCP: How We Solved the 5 Biggest Industry Problems
|
|
468
|
-
|
|
469
|
-
MCP tool servers face 5 systemic problems documented across Anthropic, Microsoft Research, and the open-source community. We researched each one, built solutions, and tested them with automated eval harnesses. Here's the full breakdown — problem by problem.
|
|
614
|
+
MCP tool servers face 5 systemic problems documented across Anthropic, Microsoft Research, and the open-source community. We researched each one, built solutions, and tested them with automated eval harnesses.
|
|
470
615
|
|
|
471
616
|
---
|
|
472
617
|
|
|
473
|
-
|
|
618
|
+
### Problem 1: Context Bloat (too many tool definitions eat the context window)
|
|
474
619
|
|
|
475
|
-
**The research**: Anthropic measured that 58 tools from 5 MCP servers consume **~55K tokens** before the conversation starts. At
|
|
620
|
+
**The research**: Anthropic measured that 58 tools from 5 MCP servers consume **~55K tokens** before the conversation starts. At 218 tools, NodeBench would consume ~109K tokens — over half a 200K context window just on tool metadata. [Microsoft Research](https://www.microsoft.com/en-us/research/blog/tool-space-interference-in-the-mcp-era-designing-for-agent-compatibility-at-scale/) found LLMs "decline to act at all when faced with ambiguous or excessive tool options." [Cursor enforces a ~40-tool hard cap](https://www.lunar.dev/post/why-is-there-mcp-tool-overload-and-how-to-solve-it-for-your-ai-agents) for this reason.
|
|
476
621
|
|
|
477
622
|
**Our solutions** (layered, each independent):
|
|
478
623
|
|
|
479
624
|
| Layer | What it does | Token savings | Requires |
|
|
480
625
|
|---|---|---|---|
|
|
481
|
-
| Themed presets (`--preset web_dev`) | Load only relevant toolsets (
|
|
626
|
+
| Themed presets (`--preset web_dev`) | Load only relevant toolsets (54-106 tools vs 218) | **50-75%** | Nothing |
|
|
482
627
|
| TOON encoding (on by default) | Encode all tool responses in token-optimized format | **~40%** on responses | Nothing |
|
|
483
628
|
| `discover_tools({ compact: true })` | Return `{ name, category, hint }` only | **~60%** on search results | Nothing |
|
|
484
629
|
| `instructions` field (Claude Code) | Claude Code defers tool loading, searches on demand | **~85%** | Claude Code client |
|
|
@@ -488,11 +633,11 @@ MCP tool servers face 5 systemic problems documented across Anthropic, Microsoft
|
|
|
488
633
|
|
|
489
634
|
---
|
|
490
635
|
|
|
491
|
-
|
|
636
|
+
### Problem 2: Tool Selection Degradation (LLMs pick the wrong tool as count increases)
|
|
492
637
|
|
|
493
638
|
**The research**: [Anthropic's Tool Search Tool](https://www.anthropic.com/engineering/advanced-tool-use) improved accuracy from **49% → 74%** (Opus 4) and **79.5% → 88.1%** (Opus 4.5) by switching from all-tools-upfront to on-demand discovery. The [Dynamic ReAct paper (arxiv 2509.20386)](https://arxiv.org/html/2509.20386v1) tested 5 architectures and found **Search + Load** wins — flat search + deliberate loading beats hierarchical app→tool search.
|
|
494
639
|
|
|
495
|
-
**Our solution**: `discover_tools` — a 14-strategy hybrid search engine that finds the right tool from
|
|
640
|
+
**Our solution**: `discover_tools` — a 14-strategy hybrid search engine that finds the right tool from 218 candidates, with **cursor pagination**, **result expansion**, and **multi-hop traversal**:
|
|
496
641
|
|
|
497
642
|
| Strategy | What it does | Example |
|
|
498
643
|
|---|---|---|
|
|
@@ -502,8 +647,11 @@ MCP tool servers face 5 systemic problems documented across Anthropic, Microsoft
|
|
|
502
647
|
| N-gram + Bigram | Partial words and phrases | "screen" → `capture_ui_screenshot` |
|
|
503
648
|
| Dense (TF-IDF cosine) | Vector-like ranking | "audit compliance" surfaces related tools |
|
|
504
649
|
| Embedding (neural) | Agent-as-a-Graph bipartite RRF | Based on [arxiv 2511.01854](https://arxiv.org/html/2511.01854v1) |
|
|
505
|
-
| Execution traces | Co-occurrence mining from `tool_call_log` | Tools frequently used together boost each other |
|
|
650
|
+
| Execution traces | Co-occurrence mining from `tool_call_log` (direct + transitive A→B→C) | Tools frequently used together boost each other |
|
|
506
651
|
| Intent pre-filter | Narrow to relevant categories before search | `intent: "data_analysis"` → only local_file, llm, benchmark |
|
|
652
|
+
| **Pagination** | `offset` + `limit` with stable `totalMatches` and `hasMore` | Page through 76+ results 5 at a time |
|
|
653
|
+
| **Expansion** | Top N results' `relatedTools` neighbors added at 50% parent score | `expand: 3` adds 2-8 new tools per query |
|
|
654
|
+
| **Multi-hop BFS** | `get_tool_quick_ref` depth 1-3 with `hopDistance` + `reachedVia` | depth=2 discovers 24-40 additional tools |
|
|
507
655
|
|
|
508
656
|
Plus `smart_select_tools` for ambiguous queries — sends the catalog to Gemini 3 Flash / GPT-5-mini / Claude Haiku 4.5 for LLM-powered reranking.
|
|
509
657
|
|
|
@@ -518,7 +666,7 @@ Plus `smart_select_tools` for ambiguous queries — sends the catalog to Gemini
|
|
|
518
666
|
|
|
519
667
|
---
|
|
520
668
|
|
|
521
|
-
|
|
669
|
+
### Problem 3: Static Loading (all tools loaded upfront, even if unused)
|
|
522
670
|
|
|
523
671
|
**The research**: The Dynamic ReAct paper found that **Search + Load with 2 meta tools** beats all other architectures. Hierarchical search (search apps → search tools → load) adds overhead without improving accuracy. [ToolScope (arxiv 2510.20036)](https://arxiv.org/html/2510.20036) showed **+34.6%** tool selection accuracy with hybrid retrieval + tool deduplication.
|
|
524
672
|
|
|
@@ -542,7 +690,7 @@ npx nodebench-mcp --dynamic
|
|
|
542
690
|
Key design decisions from the research:
|
|
543
691
|
- **No hierarchical search** — Dynamic ReAct Section 3.4: "search_apps introduces an additional call without significantly improving accuracy"
|
|
544
692
|
- **Direct tool binding** — Dynamic ReAct Section 3.5: LLMs perform best with directly bound tools; `call_tool` indirection degrades in long conversations
|
|
545
|
-
- **Full-registry search** — `discover_tools` searches all
|
|
693
|
+
- **Full-registry search** — `discover_tools` searches all 218 tools even with 54 loaded, so it can suggest what to load
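The two-meta-tool Search + Load pattern reduces, in essence, to: search the full registry, then bind only the chosen tools. A sketch under those assumptions (registry entries are made up):

```typescript
// Minimal Search + Load sketch: two meta operations over a full registry.
const registry: Record<string, { desc: string }> = {
  run_recon: { desc: "research a codebase before changes" },
  run_eval_batch: { desc: "run eval test cases" },
  capture_ui_screenshot: { desc: "screenshot the UI" },
};

const loaded = new Set<string>();

// Meta tool 1: search the FULL registry, even for tools not yet loaded.
function searchTools(query: string): string[] {
  const q = query.toLowerCase();
  return Object.entries(registry)
    .filter(([name, t]) => name.includes(q) || t.desc.toLowerCase().includes(q))
    .map(([name]) => name);
}

// Meta tool 2: bind a tool directly so the model can call it.
function loadTool(name: string): boolean {
  if (!(name in registry)) return false;
  loaded.add(name);
  return true;
}
```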
|
|
546
694
|
|
|
547
695
|
**How we tested**: Automated A/B harness + live IDE session.
|
|
548
696
|
|
|
@@ -559,7 +707,7 @@ Key design decisions from the research:
|
|
|
559
707
|
|
|
560
708
|
---
|
|
561
709
|
|
|
562
|
-
|
|
710
|
+
### Problem 4: Client Fragmentation (not all clients handle dynamic tool updates)
|
|
563
711
|
|
|
564
712
|
**The research**: The MCP spec defines `notifications/tools/list_changed` for servers to tell clients to re-fetch the tool list. But [Cursor hasn't implemented it](https://forum.cursor.com/t/enhance-mcp-integration-in-cursor-dynamic-tool-updates-roots-support-progress-tokens-streamable-http/99903), [Claude Desktop didn't support it](https://github.com/orgs/modelcontextprotocol/discussions/76) (as of Dec 2024), and [Gemini CLI has an open issue](https://github.com/google-gemini/gemini-cli/issues/13850).
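The fallback pattern can be sketched as: the server still emits `list_changed` for clients that honor it, while an always-registered proxy tool reaches newly loaded tools for clients that drop the notification. An in-memory stand-in, not the actual server code:

```typescript
type Handler = (args: unknown) => string;
const dynamicTools = new Map<string, Handler>();

// Server side: loading registers the handler and emits the notification.
function loadToolset(name: string, handler: Handler, notify: () => void) {
  dynamicTools.set(name, handler);
  notify(); // notifications/tools/list_changed — some clients drop this
}

// Proxy path: works even if the client never re-fetched tools/list.
function callLoadedTool(name: string, args: unknown): string {
  const h = dynamicTools.get(name);
  if (!h) throw new Error(`tool not loaded: ${name}`);
  return h(args);
}
```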
|
|
565
713
|
|
|
@@ -592,7 +740,7 @@ tools/list AFTER UNLOAD: 95 tools (-4) ← tools removed
|
|
|
592
740
|
|
|
593
741
|
---
|
|
594
742
|
|
|
595
|
-
|
|
743
|
+
### Problem 5: Aggressive Filtering (over-filtering means the right tool isn't found)
|
|
596
744
|
|
|
597
745
|
**The research**: This is the flip side of Problem 1. If you reduce context aggressively (e.g., keyword-only search), ambiguous queries like "call an AI model" fail to match the `llm` toolset because every tool mentions "AI" in its description. [SynapticLabs' Bounded Context Packs](https://blog.synapticlabs.ai/bounded-context-packs-tool-bloat-tipping-point) addresses this with progressive disclosure. [SEP-1576](https://github.com/modelcontextprotocol/modelcontextprotocol/issues/1576) proposes adaptive granularity at the protocol level.
|
|
598
746
|
|
|
@@ -623,11 +771,11 @@ Neural bipartite graph search (tool nodes + domain nodes) based on [Agent-as-a-G
|
|
|
623
771
|
|
|
624
772
|
---
|
|
625
773
|
|
|
626
|
-
|
|
774
|
+
### Summary: research → solution → eval for each problem
|
|
627
775
|
|
|
628
776
|
| Problem | Research Source | Our Solution | Eval Method | Result |
|
|
629
777
|
|---|---|---|---|---|
|
|
630
|
-
| **Context bloat** (
|
|
778
|
+
| **Context bloat** (~109K tokens) | Anthropic (85% reduction), Lunar.dev (~40-tool cap), SEP-1576 | Presets, TOON, compact mode, `instructions`, `smart_select_tools` | A/B harness token measurement | 50-95% reduction depending on layer |
|
|
631
779
|
| **Selection degradation** | Anthropic (+25pp), Dynamic ReAct (Search+Load wins) | 14-strategy hybrid search, intent pre-filter, LLM reranking | 28-scenario discovery accuracy | **100% accuracy** (18/18 domains) |
|
|
632
780
|
| **Static loading** | Dynamic ReAct, ToolScope (+34.6%), MCP spec | `--dynamic` flag, `load_toolset` / `unload_toolset` | A/B harness + live IDE test | **100% success**, <1ms load latency |
|
|
633
781
|
| **Client fragmentation** | MCP discussions, client bug trackers | `list_changed` + `call_loaded_tool` proxy | Server-side `tools/list` verification | Works on **all clients** |
|
|
@@ -637,15 +785,17 @@ Neural bipartite graph search (tool nodes + domain nodes) based on [Agent-as-a-G
|
|
|
637
785
|
|
|
638
786
|
| Segment | R@5 Baseline | Most Critical Strategy | Impact When Removed |
|
|
639
787
|
|---|---|---|---|
|
|
640
|
-
| **New user** (vague, natural language) | 67% | Synonym expansion |
|
|
641
|
-
| **Experienced** (domain keywords) | 72% | All robust |
|
|
642
|
-
| **Power user** (exact tool names) | 100% | None needed |
|
|
788
|
+
| **New user** (vague, natural language) | 67% | Synonym expansion | -17pp R@5 |
|
|
789
|
+
| **Experienced** (domain keywords) | 72% | All robust | No single strategy >5pp |
|
|
790
|
+
| **Power user** (exact tool names) | 100% | None needed | Keyword alone = 100% |
|
|
643
791
|
|
|
644
792
|
Key insight: new users need synonym expansion ("website" → seo, "AI" → llm) and fuzzy matching (typo tolerance). Power users need nothing beyond keyword matching. The remaining 33% new user gap is filled by `smart_select_tools` (LLM-powered).
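The R@5 figures in the segment table can be computed with a straightforward recall@k. A sketch with illustrative scenario data:

```typescript
// Recall@k over discovery scenarios: fraction of cases where the
// expected tool appears in the top k results.
function recallAtK(cases: { expected: string; results: string[] }[], k: number): number {
  const hits = cases.filter(c => c.results.slice(0, k).includes(c.expected)).length;
  return hits / cases.length;
}
```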
|
|
645
793
|
|
|
646
794
|
Full methodology, per-scenario breakdown, ablation data, and research citations: [DYNAMIC_LOADING.md](./DYNAMIC_LOADING.md)
|
|
647
795
|
|
|
648
|
-
|
|
796
|
+
---
|
|
797
|
+
|
|
798
|
+
## Fine-Grained Control
|
|
649
799
|
|
|
650
800
|
```bash
|
|
651
801
|
# Include only specific toolsets
|
|
@@ -654,11 +804,14 @@ npx nodebench-mcp --toolsets verification,eval,recon
|
|
|
654
804
|
# Exclude heavy optional-dep toolsets
|
|
655
805
|
npx nodebench-mcp --exclude vision,ui_capture,parallel
|
|
656
806
|
|
|
807
|
+
# Dynamic loading — start with 12 tools, load on demand
|
|
808
|
+
npx nodebench-mcp --dynamic
|
|
809
|
+
|
|
657
810
|
# See all toolsets and presets
|
|
658
811
|
npx nodebench-mcp --help
|
|
659
812
|
```
|
|
660
813
|
|
|
661
|
-
###
|
|
814
|
+
### All 39 Toolsets
|
|
662
815
|
|
|
663
816
|
| Toolset | Tools | What it covers | In `default` |
|
|
664
817
|
|---|---|---|---|
|
|
@@ -666,11 +819,12 @@ npx nodebench-mcp --help
|
|
|
666
819
|
| eval | 6 | Eval runs, results, comparison, diff | ✅ |
|
|
667
820
|
| quality_gate | 4 | Gates, presets, history | ✅ |
|
|
668
821
|
| learning | 4 | Knowledge, search, record | ✅ |
|
|
669
|
-
| recon | 7 | Research, findings, framework checks, risk | ✅ |
|
|
670
822
|
| flywheel | 4 | Mandatory flywheel, promote, investigate | ✅ |
|
|
823
|
+
| recon | 7 | Research, findings, framework checks, risk | ✅ |
|
|
671
824
|
| security | 3 | Dependency scanning, code analysis, terminal security scanning | ✅ |
|
|
672
|
-
| **Total** | **44** | **Complete AI Flywheel** |
|
|
673
825
|
| boilerplate | 2 | Scaffold NodeBench projects + status | ✅ |
|
|
826
|
+
| skill_update | 4 | Skill tracking, freshness checks, sync | ✅ |
|
|
827
|
+
| **Subtotal** | **42** | **AI Flywheel core** | |
|
|
674
828
|
| bootstrap | 11 | Project setup, agents.md, self-implement, autonomous, test runner | — |
|
|
675
829
|
| self_eval | 9 | Trajectory analysis, health reports, task banks, grading, contract compliance | — |
|
|
676
830
|
| parallel | 13 | Task locks, roles, context budget, oracle, agent mailbox | — |
|
|
@@ -693,19 +847,22 @@ npx nodebench-mcp --help
|
|
|
693
847
|
| git_workflow | 3 | Branch compliance, PR checklist review, merge gate | — |
|
|
694
848
|
| seo | 5 | Technical SEO audit, page performance, content analysis | — |
|
|
695
849
|
| voice_bridge | 4 | Voice pipeline design, config analysis, scaffold | — |
|
|
850
|
+
| critter | 1 | Accountability checkpoint with calibrated scoring | — |
|
|
696
851
|
| email | 4 | SMTP/IMAP email ingestion, search, delivery | — |
|
|
697
852
|
| rss | 4 | RSS feed parsing and monitoring | — |
|
|
698
|
-
| architect | 3 |
|
|
853
|
+
| architect | 3 | Structural analysis, concept verification, implementation planning | — |
|
|
854
|
+
| ui_ux_dive | 11 | UI/UX deep analysis sessions, component reviews, flow audits | — |
|
|
855
|
+
| mcp_bridge | 5 | Connect external MCP servers, proxy tool calls, manage sessions | — |
|
|
856
|
+
| ui_ux_dive_v2 | 14 | Advanced UI/UX analysis with preflight, scoring, heuristic evaluation | — |
|
|
857
|
+
| pr_report | 3 | Visual PR creation with screenshot comparisons, timelines, past session links | — |
|
|
699
858
|
|
|
700
|
-
**Always included** — these 12 tools are
|
|
859
|
+
**Always included** — these 12 tools are available regardless of preset:
|
|
701
860
|
- **Meta/discovery (6):** `findTools`, `getMethodology`, `check_mcp_setup`, `discover_tools`, `get_tool_quick_ref`, `get_workflow_chain`
|
|
702
861
|
- **Dynamic loading (6):** `load_toolset`, `unload_toolset`, `list_available_toolsets`, `call_loaded_tool`, `smart_select_tools`, `get_ab_test_report`
|
|
703
862
|
|
|
704
|
-
The `default` preset includes 50 tools (38 domain + 6 meta/discovery + 6 dynamic loading).
|
|
705
|
-
|
|
706
863
|
### TOON Format — Token Savings
|
|
707
864
|
|
|
708
|
-
TOON (Token-Oriented Object Notation) is **on by default** for all presets
|
|
865
|
+
TOON (Token-Oriented Object Notation) is **on by default** for all presets. Every tool response is TOON-encoded for ~40% fewer tokens vs JSON. Disable with `--no-toon` if your client can't handle non-JSON responses.
|
|
709
866
|
|
|
710
867
|
```bash
|
|
711
868
|
# TOON on (default, all presets)
|
|
@@ -715,20 +872,13 @@ claude mcp add nodebench -- npx -y nodebench-mcp
|
|
|
715
872
|
claude mcp add nodebench -- npx -y nodebench-mcp --no-toon
|
|
716
873
|
```
|
|
717
874
|
|
|
718
|
-
Use the `toon_encode` and `toon_decode` tools to convert between TOON and JSON in your own workflows.
|
|
719
|
-
|
|
720
|
-
### When to Use Each Preset
|
|
721
|
-
|
|
722
|
-
| Preset | Use when... | Example |
|
|
723
|
-
|---|---|---|
|
|
724
|
-
| **default** ⭐ | You want the complete AI Flywheel methodology with minimal token overhead | Most users — bug fixes, features, refactoring, code review |
|
|
725
|
-
| `full` | You need vision, UI capture, web search, GitHub, local file parsing, or GAIA solvers | Vision QA, web scraping, file processing, parallel agents, capability benchmarking |
|
|
875
|
+
Use the `toon_encode` and `toon_decode` tools (in the `toon` toolset) to convert between TOON and JSON in your own workflows.
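The intuition behind the savings is that arrays of uniform objects repeat every key on every element. The toy header-plus-rows encoding below illustrates that effect only; it is NOT the actual TOON format:

```typescript
// Why tabular encodings shrink repeated-key JSON (toy illustration).
const rows = [
  { name: "run_recon", tools: 7 },
  { name: "eval", tools: 6 },
];

const asJson = JSON.stringify(rows);

// Toy encoding: keys appear once in a header, then one line per row.
const asTabular = ["name,tools", ...rows.map(r => `${r.name},${r.tools}`)].join("\n");
```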
|
|
726
876
|
|
|
727
877
|
---
|
|
728
878
|
|
|
729
|
-
## AI Flywheel — Complete Methodology
|
|
879
|
+
## The AI Flywheel — Complete Methodology
|
|
730
880
|
|
|
731
|
-
The AI Flywheel is documented in detail in [AI_FLYWHEEL.md](https://github.com/HomenShum/nodebench-ai/blob/main/AI_FLYWHEEL.md).
|
|
881
|
+
The AI Flywheel is documented in detail in [AI_FLYWHEEL.md](https://github.com/HomenShum/nodebench-ai/blob/main/AI_FLYWHEEL.md).
|
|
732
882
|
|
|
733
883
|
### Two Loops That Compound
|
|
734
884
|
|
|
@@ -780,6 +930,62 @@ The AI Flywheel is documented in detail in [AI_FLYWHEEL.md](https://github.com/H
|
|
|
780
930
|
|
|
781
931
|
---
|
|
782
932
|
|
|
933
|
+
## Parallel Agents with Claude Code
|
|
934
|
+
|
|
935
|
+
Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www.anthropic.com/engineering/building-c-compiler) (Feb 2026).
|
|
936
|
+
|
|
937
|
+
**When to use:** Only when running 2+ agent sessions. Single-agent workflows use the standard pipeline above.
|
|
938
|
+
|
|
939
|
+
**How it works with Claude Code's Task tool:**
|
|
940
|
+
|
|
941
|
+
1. **COORDINATOR** (your main session) breaks work into independent tasks
|
|
942
|
+
2. Each **Task tool** call spawns a subagent with instructions to:
|
|
943
|
+
- `claim_agent_task` — lock the task
|
|
944
|
+
- `assign_agent_role` — specialize (implementer, test_writer, critic, etc.)
|
|
945
|
+
- Do the work
|
|
946
|
+
- `release_agent_task` — handoff with progress note
|
|
947
|
+
3. Coordinator calls `get_parallel_status` to monitor all subagents
|
|
948
|
+
4. Coordinator runs `run_quality_gate` on the aggregate result
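The claim/release steps above amount to a simple lock protocol. An in-memory stand-in for the semantics (the real tools persist state across sessions):

```typescript
// taskId -> agentId holding the lock
const locks = new Map<string, string>();
const handoffNotes: string[] = [];

function claimAgentTask(taskId: string, agentId: string): boolean {
  if (locks.has(taskId)) return false; // another agent holds the lock
  locks.set(taskId, agentId);
  return true;
}

function releaseAgentTask(taskId: string, agentId: string, note: string): boolean {
  if (locks.get(taskId) !== agentId) return false; // only the holder may release
  locks.delete(taskId);
  handoffNotes.push(note); // progress note for the next claimant
  return true;
}
```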
|
|
949
|
+
|
|
950
|
+
**MCP Prompts available:**
|
|
951
|
+
- `claude-code-parallel` — Step-by-step Claude Code subagent coordination
|
|
952
|
+
- `parallel-agent-team` — Full team setup with role assignment
|
|
953
|
+
- `oracle-test-harness` — Validate outputs against known-good reference
|
|
954
|
+
- `bootstrap-parallel-agents` — Scaffold parallel infra for any repo
|
|
955
|
+
|
|
956
|
+
**Note:** Parallel agent coordination tools require `--preset multi_agent` or `--preset full`.
|
|
957
|
+
|
|
958
|
+
---
|
|
959
|
+
|
|
960
|
+
## Capability Benchmarking (GAIA, Gated)
|
|
961
|
+
|
|
962
|
+
NodeBench MCP treats tools as "Access" — access alone doesn't prove capability. To measure real capability lift, we benchmark baseline (LLM-only) vs tool-augmented accuracy on GAIA (gated).
|
|
963
|
+
|
|
964
|
+
Notes:
|
|
965
|
+
- GAIA fixtures and attachments are written under `.cache/gaia` (gitignored). Do not commit GAIA content.
|
|
966
|
+
- Fixture generation requires `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN`.
|
|
967
|
+
|
|
968
|
+
Web lane (web_search + fetch_url):
|
|
969
|
+
```bash
|
|
970
|
+
npm run mcp:dataset:gaia:capability:refresh
|
|
971
|
+
NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:test
|
|
972
|
+
```
|
|
973
|
+
|
|
974
|
+
File-backed lane (PDF / XLSX / CSV / DOCX / PPTX / JSON / JSONL / TXT / ZIP via `local_file` tools):
|
|
975
|
+
```bash
|
|
976
|
+
npm run mcp:dataset:gaia:capability:files:refresh
|
|
977
|
+
NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:files:test
|
|
978
|
+
```
|
|
979
|
+
|
|
980
|
+
Modes:
|
|
981
|
+
- Stable: `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag`
|
|
982
|
+
- More realistic: `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent`
|
|
983
|
+
|
|
984
|
+
Notes:
|
|
985
|
+
- ZIP attachments require `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (multi-step extract → parse).
|
|
986
|
+
|
|
987
|
+
---
|
|
988
|
+
|
|
783
989
|
## Build from Source
|
|
784
990
|
|
|
785
991
|
```bash
|
|
@@ -805,51 +1011,54 @@ Then use absolute path:
|
|
|
805
1011
|
|
|
806
1012
|
## Quick Reference
|
|
807
1013
|
|
|
808
|
-
### Recommended Setup
|
|
1014
|
+
### Recommended Setup
|
|
809
1015
|
|
|
810
1016
|
```bash
|
|
811
|
-
# Claude Code
|
|
1017
|
+
# Claude Code — AI Flywheel core (54 tools, default)
|
|
812
1018
|
claude mcp add nodebench -- npx -y nodebench-mcp
|
|
813
|
-
```
|
|
814
|
-
|
|
815
|
-
### What's in the default preset?
|
|
816
|
-
|
|
817
|
-
| Domain | Tools | What you get |
|
|
818
|
-
|---|---|---|
|
|
819
|
-
| verification | 8 | Cycles, gaps, triple-verify, status |
|
|
820
|
-
| eval | 6 | Eval runs, results, comparison, diff |
|
|
821
|
-
| quality_gate | 4 | Gates, presets, history |
|
|
822
|
-
| learning | 4 | Knowledge, search, record |
|
|
823
|
-
| recon | 7 | Research, findings, framework checks, risk |
|
|
824
|
-
| flywheel | 4 | Mandatory flywheel, promote, investigate |
|
|
825
|
-
| security | 3 | Dependency scanning, code analysis, terminal security scanning |
|
|
826
|
-
| boilerplate | 2 | Scaffold NodeBench projects + status |
|
|
827
|
-
| meta + discovery | 6 | findTools, getMethodology, check_mcp_setup, discover_tools, get_tool_quick_ref, get_workflow_chain |
|
|
828
|
-
| dynamic loading | 6 | load_toolset, unload_toolset, list_available_toolsets, call_loaded_tool, smart_select_tools, get_ab_test_report |
|
|
829
|
-
|
|
830
|
-
**Total: 50 tools** — Complete AI Flywheel methodology with ~70% less token overhead.
|
|
831
|
-
|
|
832
|
-
### When to Upgrade Presets
|
|
833
|
-
|
|
834
|
-
| Need | Upgrade to |
|
|
835
|
-
|---|---|
|
|
836
|
-
| Everything: vision, UI capture, web search, GitHub, local file parsing, GAIA solvers | `--preset full` (175 tools) |
|
|
837
|
-
|
|
838
|
-
### First Prompts to Try
|
|
839
1019
|
|
|
1020
|
+
# Windsurf — add to ~/.codeium/windsurf/mcp_config.json
|
|
1021
|
+
# Cursor — add to .cursor/mcp.json
|
|
1022
|
+
{
|
|
1023
|
+
"mcpServers": {
|
|
1024
|
+
"nodebench": {
|
|
1025
|
+
"command": "npx",
|
|
1026
|
+
"args": ["-y", "nodebench-mcp"]
|
|
1027
|
+
}
|
|
1028
|
+
}
|
|
1029
|
+
}
|
|
840
1030
|
```
|
|
841
|
-
# See what's available
|
|
842
|
-
> Use getMethodology("overview") to see all workflows
|
|
843
1031
|
|
|
844
|
-
|
|
845
|
-
> Use search_all_knowledge("what I'm about to work on")
|
|
1032
|
+
### What's in the Default?
|
|
846
1033
|
|
|
847
|
-
|
|
848
|
-
|
|
849
|
-
|
|
850
|
-
|
|
851
|
-
|
|
852
|
-
|
|
1034
|
+
| Category | Tools | What you get |
|
|
1035
|
+
|---|---|---|
|
|
1036
|
+
| Discovery | 6 | findTools, getMethodology, check_mcp_setup, discover_tools (pagination + expansion), get_tool_quick_ref (multi-hop BFS), get_workflow_chain |
|
|
1037
|
+
| Dynamic loading | 6 | load_toolset, unload_toolset, list_available_toolsets, call_loaded_tool, smart_select_tools, get_ab_test_report |
|
|
1038
|
+
| Verification | 8 | Cycles, gaps, triple-verify, status |
|
|
1039
|
+
| Eval | 6 | Eval runs, results, comparison, diff |
|
|
1040
|
+
| Quality gate | 4 | Gates, presets, history |
|
|
1041
|
+
| Learning | 4 | Knowledge, search, record |
|
|
1042
|
+
| Flywheel | 4 | Mandatory flywheel, promote, investigate |
|
|
1043
|
+
| Recon | 7 | Research, findings, framework checks, risk |
|
|
1044
|
+
| Security | 3 | Dependency scanning, code analysis, terminal security scanning |
|
|
1045
|
+
| Boilerplate | 2 | Scaffold NodeBench projects + status |
|
|
1046
|
+
| Skill update | 4 | Skill tracking, freshness checks, sync |
|
|
1047
|
+
| **Total** | **54** | **Complete AI Flywheel methodology** |
|
|
1048
|
+
|
|
1049
|
+
### When to Use a Themed Preset
|
|
1050
|
+
|
|
1051
|
+
| Need | Preset | Tools |
|
|
1052
|
+
|---|---|---|
|
|
1053
|
+
| Web development with visual QA | `--preset web_dev` | 106 |
|
|
1054
|
+
| Mobile apps with flicker detection | `--preset mobile` | 95 |
|
|
1055
|
+
| Academic papers and research writing | `--preset academic` | 86 |
|
|
1056
|
+
| Multi-agent coordination | `--preset multi_agent` | 83 |
|
|
1057
|
+
| Data analysis and file processing | `--preset data` | 78 |
|
|
1058
|
+
| Content pipelines and publishing | `--preset content` | 73 |
|
|
1059
|
+
| Research with web search and RSS | `--preset research` | 71 |
|
|
1060
|
+
| CI/CD and DevOps | `--preset devops` | 68 |
|
|
1061
|
+
| Everything | `--preset full` | 218 |
|
|
853
1062
|
|
|
854
1063
|
### Key Methodology Topics
|
|
855
1064
|
|
|
@@ -872,8 +1081,8 @@ NodeBench MCP runs locally on your machine. Here's what it can and cannot access
|
|
|
872
1081
|
- Analytics data never leaves your machine
|
|
873
1082
|
|
|
874
1083
|
### File system access
|
|
875
|
-
- The `local_file` toolset (
|
|
876
|
-
- The `security` toolset runs static analysis on files you point it at
|
|
1084
|
+
- The `local_file` toolset (in `data`, `academic`, `full` presets) can **read files anywhere on your filesystem** that the Node.js process has permission to access. This includes CSV, PDF, XLSX, DOCX, PPTX, JSON, TXT, and ZIP files
|
|
1085
|
+
- The `security` toolset (in all presets) runs static analysis on files you point it at
|
|
877
1086
|
- Session notes and project bootstrapping write to the current working directory or `~/.nodebench/`
|
|
878
1087
|
- **Trust boundary**: If you grant an AI agent access to NodeBench MCP with `--preset full`, that agent can read any file your user account can read. Use the `default` preset if you want to restrict file system access
|
|
879
1088
|
|
|
@@ -897,6 +1106,12 @@ NodeBench MCP runs locally on your machine. Here's what it can and cannot access
|
|
|
897
1106
|
|
|
898
1107
|
**MCP not connecting** — Check path is absolute, run `claude --mcp-debug`, ensure Node.js >= 18
|
|
899
1108
|
|
|
1109
|
+
**Windsurf not finding tools** — Verify `~/.codeium/windsurf/mcp_config.json` has the correct JSON structure. Open Settings → MCP → View raw config to edit directly.
|
|
1110
|
+
|
|
1111
|
+
**Cursor tools not loading** — Ensure `.cursor/mcp.json` exists in the project root. Restart Cursor after config changes.
|
|
1112
|
+
|
|
1113
|
+
**Dynamic loading not working** — Claude Code and GitHub Copilot support native dynamic loading. For Windsurf/Cursor, use `call_loaded_tool` as a fallback (it's always available).
|
|
1114
|
+
|
|
900
1115
|
---
|
|
901
1116
|
|
|
902
1117
|
## License
|