nodebench-mcp 2.22.0 → 2.25.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (38)
  1. package/README.md +366 -280
  2. package/dist/__tests__/multiHopDogfood.test.d.ts +12 -0
  3. package/dist/__tests__/multiHopDogfood.test.js +303 -0
  4. package/dist/__tests__/multiHopDogfood.test.js.map +1 -0
  5. package/dist/__tests__/presetRealWorldBench.test.js +2 -0
  6. package/dist/__tests__/presetRealWorldBench.test.js.map +1 -1
  7. package/dist/__tests__/tools.test.js +158 -6
  8. package/dist/__tests__/tools.test.js.map +1 -1
  9. package/dist/__tests__/toolsetGatingEval.test.js +2 -0
  10. package/dist/__tests__/toolsetGatingEval.test.js.map +1 -1
  11. package/dist/dashboard/html.d.ts +18 -0
  12. package/dist/dashboard/html.js +1251 -0
  13. package/dist/dashboard/html.js.map +1 -0
  14. package/dist/dashboard/server.d.ts +17 -0
  15. package/dist/dashboard/server.js +278 -0
  16. package/dist/dashboard/server.js.map +1 -0
  17. package/dist/db.js +38 -0
  18. package/dist/db.js.map +1 -1
  19. package/dist/index.js +19 -9
  20. package/dist/index.js.map +1 -1
  21. package/dist/tools/prReportTools.d.ts +11 -0
  22. package/dist/tools/prReportTools.js +911 -0
  23. package/dist/tools/prReportTools.js.map +1 -0
  24. package/dist/tools/progressiveDiscoveryTools.js +111 -24
  25. package/dist/tools/progressiveDiscoveryTools.js.map +1 -1
  26. package/dist/tools/skillUpdateTools.d.ts +24 -0
  27. package/dist/tools/skillUpdateTools.js +469 -0
  28. package/dist/tools/skillUpdateTools.js.map +1 -0
  29. package/dist/tools/toolRegistry.d.ts +15 -1
  30. package/dist/tools/toolRegistry.js +315 -11
  31. package/dist/tools/toolRegistry.js.map +1 -1
  32. package/dist/tools/uiUxDiveAdvancedTools.js +61 -0
  33. package/dist/tools/uiUxDiveAdvancedTools.js.map +1 -1
  34. package/dist/tools/uiUxDiveTools.js +154 -1
  35. package/dist/tools/uiUxDiveTools.js.map +1 -1
  36. package/dist/toolsetRegistry.js +4 -0
  37. package/dist/toolsetRegistry.js.map +1 -1
  38. package/package.json +2 -2
package/README.md CHANGED
@@ -5,10 +5,12 @@
  One command gives your agent structured research, risk assessment, 3-layer testing, quality gates, and a persistent knowledge base — so every fix is thorough and every insight compounds into future work.

  ```bash
- # Default (50 tools) - complete AI Flywheel methodology
+ # Claude Code AI Flywheel core (54 tools, recommended)
  claude mcp add nodebench -- npx -y nodebench-mcp

- # Full (175 tools) - everything including vision, web, files, etc.
+ # Windsurf / Cursor — same tools, add to your MCP config (see setup below)
+
+ # Need everything? Vision, web, files, parallel agents, etc.
  claude mcp add nodebench -- npx -y nodebench-mcp --preset full
  ```

@@ -37,16 +39,6 @@ Every additional tool call produces a concrete artifact — an issue found, a ri

  ---

- ## Who's Using It
-
- **Vision engineer** — Built agentic vision analysis using GPT 5.2 with Set-of-Mark (SoM) for boundary boxing, similar to Google Gemini 3 Flash's agentic code execution approach. Uses NodeBench's verification pipeline to validate detection accuracy across screenshot variants before shipping model changes. (Uses `full` preset for vision tools)
-
- **QA engineer** — Transitioned a manual QA workflow website into an AI agent-driven app for a pet care messaging platform. Uses NodeBench's quality gates, verification cycles, and eval runs to ensure the AI agent handles edge cases that manual QA caught but bare AI agents miss. (Uses `default` preset — all core AI Flywheel tools)
-
- Both found different subsets of the tools useful — which is why NodeBench ships with just 2 `--preset` levels. The `default` preset (50 tools) covers the complete AI Flywheel methodology with ~76% fewer tools. Add `--preset full` for specialized tools (vision, web, files, parallel agents, security).
-
- ---
-
  ## How It Works — 3 Real Examples

  ### Example 1: Bug fix
@@ -68,7 +60,7 @@ You type: *"I launched 3 Claude Code subagents but they keep overwriting each ot

  **Without NodeBench:** Both agents see the same bug and both implement a fix. The third agent re-investigates what agent 1 already solved. Agent 2 hits context limit mid-fix and loses work.

- **With NodeBench MCP:** Each subagent calls `claim_agent_task` to lock its work. Roles are assigned so they don't overlap. Context budget is tracked. Progress notes ensure handoff without starting from scratch.
+ **With NodeBench MCP:** Each subagent calls `claim_agent_task` to lock its work. Roles are assigned so they don't overlap. Context budget is tracked. Progress notes ensure handoff without starting from scratch. (Requires `--preset multi_agent` or `--preset full`.)

  ### Example 3: Knowledge compounding

@@ -78,14 +70,16 @@ Tasks 1-3 start with zero prior knowledge. By task 9, the agent finds 2+ relevan

  ## Quick Start

- ### Install (30 seconds)
+ ### Claude Code (CLI)

  ```bash
- # Default (50 tools) - complete AI Flywheel methodology
+ # Recommended AI Flywheel core (54 tools)
  claude mcp add nodebench -- npx -y nodebench-mcp

- # Full (175 tools) - everything including vision, UI capture, web, GitHub, docs, parallel, local files, GAIA solvers
- claude mcp add nodebench -- npx -y nodebench-mcp --preset full
+ # Or pick a themed preset for your workflow
+ claude mcp add nodebench -- npx -y nodebench-mcp --preset web_dev
+ claude mcp add nodebench -- npx -y nodebench-mcp --preset research
+ claude mcp add nodebench -- npx -y nodebench-mcp --preset data
  ```

  Or add to `~/.claude/settings.json` or `.claude.json`:
@@ -101,98 +95,108 @@ Or add to `~/.claude/settings.json` or `.claude.json`:
  }
  ```

- ### First prompts to try
+ ### Windsurf

+ Add to `~/.codeium/windsurf/mcp_config.json` (or open Settings → MCP → View raw config):
+
+ ```json
+ {
+   "mcpServers": {
+     "nodebench": {
+       "command": "npx",
+       "args": ["-y", "nodebench-mcp"]
+     }
+   }
+ }
  ```
- # See what's available
- > Use discover_tools("verify my implementation") to find relevant tools

- # Get methodology guidance
- > Use getMethodology("overview") to see all workflows
+ ### Cursor

- # Before your next task search for prior knowledge
- > Use search_all_knowledge("what I'm about to work on")
+ Add to `.cursor/mcp.json` in your project root (or open Settings → MCP):

- # Run the full verification pipeline on a change
- > Use getMethodology("mandatory_flywheel") and follow the 6 steps
+ ```json
+ {
+   "mcpServers": {
+     "nodebench": {
+       "command": "npx",
+       "args": ["-y", "nodebench-mcp"]
+     }
+   }
+ }
  ```

- ### Usage Analytics & Smart Presets
+ ### Other MCP Clients

- NodeBench MCP tracks tool usage locally and can recommend optimal presets based on your project type and usage patterns.
+ Any MCP-compatible client works. The config format is the same — point `command` to `npx` and `args` to `["-y", "nodebench-mcp"]`. Add `"--preset", "<name>"` to the args array for themed presets.
+
+ ### First Prompts to Try

- **Get smart preset recommendation:**
- ```bash
- npx nodebench-mcp --smart-preset
  ```
+ # See what's available
+ > Use discover_tools("verify my implementation") to find relevant tools

- This analyzes your project (detects language, framework, project type) and usage history to recommend the best preset.
+ # Page through results
+ > Use discover_tools({ query: "verify", limit: 5, offset: 5 }) for page 2

- **View usage statistics:**
- ```bash
- npx nodebench-mcp --stats
- ```
+ # Expand results via conceptual neighbors
+ > Use discover_tools({ query: "deploy changes", expand: 3 }) for broader discovery

- Shows tool usage patterns, most used toolsets, and success rates for the last 30 days.
+ # Explore a tool's neighborhood (multi-hop)
+ > Use get_tool_quick_ref({ tool_name: "run_recon", depth: 2 }) to see 2-hop graph

- **Export usage data:**
- ```bash
- npx nodebench-mcp --export-stats > usage-stats.json
- ```
+ # Get methodology guidance
+ > Use getMethodology("overview") to see all workflows

- **List all available presets:**
- ```bash
- npx nodebench-mcp --list-presets
- ```
+ # Before your next task — search for prior knowledge
+ > Use search_all_knowledge("what I'm about to work on")

- **Clear analytics data:**
- ```bash
- npx nodebench-mcp --reset-stats
+ # Run the full verification pipeline on a change
+ > Use getMethodology("mandatory_flywheel") and follow the 6 steps
  ```

- All analytics data is stored locally in `~/.nodebench/analytics.db` and never leaves your machine.
-
- ### Optional: API keys for web search and vision
+ ### Optional: API Keys

  ```bash
  export GEMINI_API_KEY="your-key"   # Web search + vision (recommended)
  export GITHUB_TOKEN="your-token"   # GitHub (higher rate limits)
  ```

- ### Capability benchmarking (GAIA, gated)
+ Set these as environment variables, or add them to the `env` block in your MCP config:

- NodeBench MCP treats tools as "Access". To measure real capability lift, we benchmark baseline (LLM-only) vs tool-augmented accuracy on GAIA (gated).
+ ```json
+ {
+   "mcpServers": {
+     "nodebench": {
+       "command": "npx",
+       "args": ["-y", "nodebench-mcp"],
+       "env": {
+         "GEMINI_API_KEY": "your-key",
+         "GITHUB_TOKEN": "your-token"
+       }
+     }
+   }
+ }
+ ```

- Notes:
- - GAIA fixtures and attachments are written under `.cache/gaia` (gitignored). Do not commit GAIA content.
- - Fixture generation requires `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN`.
+ ### Usage Analytics & Smart Presets

- Web lane (web_search + fetch_url):
- ```bash
- npm run mcp:dataset:gaia:capability:refresh
- NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:test
- ```
+ NodeBench MCP tracks tool usage locally and can recommend optimal presets based on your project type and usage patterns.

- File-backed lane (PDF / XLSX / CSV / DOCX / PPTX / JSON / JSONL / TXT / ZIP via `local_file` tools):
  ```bash
- npm run mcp:dataset:gaia:capability:files:refresh
- NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:files:test
+ npx nodebench-mcp --smart-preset    # Get AI-powered preset recommendation
+ npx nodebench-mcp --stats           # Show usage statistics (last 30 days)
+ npx nodebench-mcp --export-stats    # Export usage data to JSON
+ npx nodebench-mcp --list-presets    # List all available presets
+ npx nodebench-mcp --reset-stats     # Clear analytics data
  ```

- Modes:
- - Stable: `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag`
- - More realistic: `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent`
-
- Notes:
- - ZIP attachments require `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (multi-step extract -> parse).
+ All analytics data is stored locally in `~/.nodebench/analytics.db` and never leaves your machine.

  ---

- ## What You Get
+ ## What You Get — The AI Flywheel

- ### The AI Flywheel — Core Methodology
-
- The `default` preset (50 tools) gives you the complete AI Flywheel methodology from [AI_FLYWHEEL.md](https://github.com/HomenShum/nodebench-ai/blob/main/AI_FLYWHEEL.md):
+ The default setup (no `--preset` flag) gives you **54 tools** that implement the complete [AI Flywheel](https://github.com/HomenShum/nodebench-ai/blob/main/AI_FLYWHEEL.md) methodology — two interlocking loops that compound quality over time:

  ```
  Research → Risk → Implement → Test (3 layers) → Eval → Gate → Learn → Ship
@@ -203,21 +207,49 @@ Research → Risk → Implement → Test (3 layers) → Eval → Gate → Learn
  **Inner loop** (per change): 6-phase verification ensures correctness.
  **Outer loop** (over time): Eval-driven development ensures improvement.

- ### Recommended Workflow: Start with Default
+ ### What's in the Default Preset (54 Tools)

- The `default` preset includes 50 tools in 3 groups:
+ The default preset has 3 layers:

- 1. **Discovery tools (6)** *"What tool should I use?"* — `findTools`, `getMethodology`, `check_mcp_setup`, `discover_tools`, `get_tool_quick_ref`, `get_workflow_chain`. These help agents find the right tool via keyword search, 14-strategy hybrid search, workflow chains, and methodology guides.
+ **Layer 1 — Discovery (6 tools):** *"What tool should I use?"*

- 2. **Dynamic loading tools (6)** — *"Add/remove tools from my session"* — `load_toolset`, `unload_toolset`, `list_available_toolsets`, `call_loaded_tool`, `smart_select_tools`, `get_ab_test_report`. These let agents manage their own context budget by loading toolsets on demand and unloading them when done.
+ | Tool | Purpose |
+ |---|---|
+ | `findTools` | Keyword search across all tools |
+ | `getMethodology` | Access methodology guides (20 topics) |
+ | `check_mcp_setup` | Diagnostic wizard — checks env vars, API keys, optional deps |
+ | `discover_tools` | 14-strategy hybrid search with pagination (`offset`), result expansion (`expand`), and `relatedTools` neighbors |
+ | `get_tool_quick_ref` | Quick reference with multi-hop BFS traversal (`depth` 1-3) — discovers tools 2-3 hops away |
+ | `get_workflow_chain` | Step-by-step recipes for 28 common workflows |

- 3. **Core methodology (38)** *"Do the work"* — verification (8), eval (6), quality_gate (4), learning (4), flywheel (4), recon (7), security (3), boilerplate (2). These are the AI Flywheel tools that enforce structured research, risk assessment, 3-layer testing, quality gates, and persistent knowledge.
+ **Layer 2 — Dynamic Loading (6 tools):** *"Add/remove tools from my session"*

- **Self-escalate**: Add `--preset full` when you need vision, web, files, or parallel agents.
+ | Tool | Purpose |
+ |---|---|
+ | `load_toolset` | Add a toolset to the current session on demand |
+ | `unload_toolset` | Remove a toolset to recover context budget |
+ | `list_available_toolsets` | See all 39 toolsets with tool counts |
+ | `call_loaded_tool` | Proxy for clients that don't support dynamic tool updates |
+ | `smart_select_tools` | LLM-powered tool selection (sends compact catalog to fast model) |
+ | `get_ab_test_report` | Compare static vs dynamic loading performance |
+
+ **Layer 3 — AI Flywheel Core Methodology (42 tools):** *"Do the work"*
+
+ | Domain | Tools | What You Get |
+ |---|---|---|
+ | **verification** | 8 | `start_verification_cycle`, `log_gap`, `resolve_gap`, `get_cycle_status`, `triple_verify`, `run_closed_loop`, `compare_cycles`, `list_cycles` |
+ | **eval** | 6 | `start_eval_run`, `record_eval_result`, `get_eval_summary`, `compare_eval_runs`, `get_eval_diff`, `list_eval_runs` |
+ | **quality_gate** | 4 | `run_quality_gate`, `create_gate_preset`, `get_gate_history`, `list_gate_presets` |
+ | **learning** | 4 | `record_learning`, `search_all_knowledge`, `get_knowledge_stats`, `list_recent_learnings` |
+ | **flywheel** | 4 | `run_mandatory_flywheel`, `promote_to_eval`, `investigate_blind_spot`, `get_flywheel_status` |
+ | **recon** | 7 | `run_recon`, `log_recon_finding`, `assess_risk`, `get_recon_summary`, `list_recon_sessions`, `check_framework_version`, `search_recon_findings` |
+ | **security** | 3 | `scan_dependencies`, `analyze_code_security`, `scan_terminal_output` |
+ | **boilerplate** | 2 | `scaffold_nodebench_project`, `get_boilerplate_status` |
+ | **skill_update** | 4 | `register_skill`, `check_skill_freshness`, `sync_skill`, `list_skills` |

- This approach minimizes token overhead while ensuring agents have access to the complete methodology when needed.
+ ### Core Workflow — Use These Every Session

- ### Core workflow (use these every session)
+ These are the AI Flywheel tools documented in [AI_FLYWHEEL.md](https://github.com/HomenShum/nodebench-ai/blob/main/AI_FLYWHEEL.md):

  | When you... | Use this | Impact |
  |---|---|---|
@@ -230,30 +262,59 @@ This approach minimizes token overhead while ensuring agents have access to the
  | Gate before deploy | `run_quality_gate` | Boolean rules enforced — violations block deploy |
  | Bank knowledge | `record_learning` | Persisted findings compound across future sessions |
  | Verify completeness | `run_mandatory_flywheel` | 6-step minimum — catches dead code and intent mismatches |
+ | Re-examine for 11/10 | Fresh-eyes review | After completing, re-examine for exceptional quality — a11y, resilience, polish |

- ### When running parallel agents (Claude Code subagents, worktrees)
+ ### Mandatory After Any Non-Trivial Change

- | When you... | Use this | Impact | Preset |
- |---|---|---|---|
- | Prevent duplicate work | `claim_agent_task` / `release_agent_task` | Task locks — each task owned by exactly one agent | `full` |
- | Specialize agents | `assign_agent_role` | 7 roles: implementer, test_writer, critic, etc. | `full` |
- | Track context usage | `log_context_budget` | Prevents context exhaustion mid-fix | `full` |
- | Validate against reference | `run_oracle_comparison` | Compare output against known-good oracle | `full` |
- | Orient new sessions | `get_parallel_status` | See what all agents are doing and what's blocked | `full` |
- | Bootstrap any repo | `bootstrap_parallel_agents` | Auto-detect gaps, scaffold coordination infra | `full` |
+ 1. **Static analysis**: `tsc --noEmit` and linter checks
+ 2. **Happy-path test**: Run the changed functionality with valid inputs
+ 3. **Failure-path test**: Validate expected error handling + edge cases
+ 4. **Gap analysis**: Dead code, unused vars, missing integrations, intent mismatch
+ 5. **Fix and re-verify**: Rerun steps 1-3 from scratch after any fix
+ 6. **Deploy and document**: Ship + write down what changed and why
+ 7. **Re-examine for 11/10**: Re-examine the completed work with fresh eyes. Not "does it work?" but "is this the best it can be?" Check: prefers-reduced-motion, color-blind safety, print stylesheet, error resilience (partial failures, retry with backoff), keyboard efficiency (skip links, Ctrl+K search), skeleton loading, staggered animations, progressive disclosure for large datasets. Fix what you find, then re-examine your fixes.
+
+ ---

- **Note:** Parallel agent coordination tools are only available in the `full` preset. For single-agent workflows, the `default` preset provides all the core AI Flywheel tools you need.
+ ## Themed Presets — Choose Your Workflow

- ### Research and discovery
+ The default preset covers the AI Flywheel. For specialized workflows, pick a themed preset that adds domain-specific tools on top:

- | When you... | Use this | Impact | Preset |
+ | Preset | Tools | What it adds to the default | Use case |
  |---|---|---|---|
- | Search the web | `web_search` | Gemini/OpenAI/Perplexity latest docs and updates | `full` |
- | Fetch a URL | `fetch_url` | Read any page as clean markdown | `full` |
- | Find GitHub repos | `search_github` + `analyze_repo` | Discover and evaluate libraries and patterns | `full` |
- | Analyze screenshots | `analyze_screenshot` | AI vision (Gemini 3 Flash/GPT-5-mini/Claude) for UI QA | `full` |
+ | **default** | **54** | — | Bug fixes, features, refactoring, code review |
+ | `web_dev` | 106 | + vision, UI capture, SEO, git workflow, architect, UI/UX dive, MCP bridge, PR reports | Web projects with visual QA |
+ | `mobile` | 95 | + vision, UI capture, flicker detection, UI/UX dive, MCP bridge | Mobile apps with screenshot analysis |
+ | `academic` | 86 | + research writing, LLM, web, local file parsing | Academic papers and research |
+ | `multi_agent` | 83 | + parallel agents, self-eval, session memory, pattern mining, TOON | Multi-agent coordination |
+ | `data` | 78 | + local file parsing (CSV/XLSX/PDF/DOCX/JSON), LLM, web | Data analysis and file processing |
+ | `content` | 73 | + LLM, critter, email, RSS, platform queue, architect | Content pipelines and publishing |
+ | `research` | 71 | + web search, LLM, RSS feeds, email, docs | Research workflows |
+ | `devops` | 68 | + git compliance, session memory, benchmarks, pattern mining, PR reports | CI/CD and operations |
+ | `full` | 218 | + everything (all 39 toolsets) | Maximum coverage |
+
+ ```bash
+ # Claude Code
+ claude mcp add nodebench -- npx -y nodebench-mcp --preset web_dev
+
+ # Windsurf / Cursor — add --preset to args
+ {
+   "mcpServers": {
+     "nodebench": {
+       "command": "npx",
+       "args": ["-y", "nodebench-mcp", "--preset", "web_dev"]
+     }
+   }
+ }
+ ```
+
+ ### Let AI Pick Your Preset
+
+ ```bash
+ npx nodebench-mcp --smart-preset
+ ```

- **Note:** Web search, GitHub, and vision tools are only available in the `full` preset. The `default` preset focuses on the core AI Flywheel methodology (verification, eval, learning, recon, flywheel, security, boilerplate).
+ Analyzes your project (language, framework, project type) and usage history to recommend the best preset.

  ---

@@ -282,8 +343,6 @@ The comparative benchmark validates this with 9 real production scenarios:

  ## Progressive Discovery

- The `default` preset (50 tools) provides the complete AI Flywheel methodology with discovery built in. The progressive disclosure system helps agents find exactly what they need:
-
  ### Multi-modal search engine

  ```
@@ -309,19 +368,71 @@ The `discover_tools` search engine scores tools using **14 parallel strategies**

  Pass `explain: true` to see exactly which strategies contributed to each score.

- ### Quick refs — what to do next
+ ### Cursor pagination
+
+ Page through large result sets with `offset` and `limit`:
+
+ ```
+ > discover_tools({ query: "verify", limit: 5 })
+ # Returns: { results: [...5 tools], totalMatches: 76, hasMore: true, offset: 0 }
+
+ > discover_tools({ query: "verify", limit: 5, offset: 5 })
+ # Returns: { results: [...next 5 tools], totalMatches: 76, hasMore: true, offset: 5 }
+ ```
+
+ `totalMatches` is stable across pages. `hasMore` tells you whether another page exists.
+
+ ### Result expansion via relatedTools
+
+ Broaden results by following conceptual neighbors:
+
+ ```
+ > discover_tools({ query: "deploy and ship changes", expand: 3 })
+ # Top 3 results' relatedTools neighbors are added at 50% parent score
+ # "deploy" finds git_workflow tools → expansion adds quality_gate, flywheel tools
+ # Expanded results include depth: 1 and expandedFrom fields
+ ```
+
+ Dogfood A/B results: 5/8 queries gained recall lift (+2 to +8 new tools per query). "deploy and ship changes" went from 82 → 90 matches.
+
+ ### Quick refs — what to do next (with multi-hop)

  Every tool response auto-appends a `_quickRef` with:
  - **nextAction**: What to do immediately after this tool
- - **nextTools**: Recommended follow-up tools
+ - **nextTools**: Recommended follow-up tools (workflow-sequential)
+ - **relatedTools**: Conceptually adjacent tools (same domain, shared tags — 949 connections across 218 tools)
  - **methodology**: Which methodology guide to consult
  - **tip**: Practical usage advice

- Call `get_tool_quick_ref("tool_name")` for any tool's guidance.
+ Call `get_tool_quick_ref("tool_name")` for any tool's guidance — or use **multi-hop BFS traversal** to discover tools 2-3 hops away:
+
+ ```
+ > get_tool_quick_ref({ tool_name: "start_verification_cycle", depth: 1 })
+ # Returns: direct neighbors via nextTools + relatedTools (hopDistance: 1)
+
+ > get_tool_quick_ref({ tool_name: "start_verification_cycle", depth: 2 })
+ # Returns: direct neighbors + their neighbors (hopDistance: 1 and 2)
+ # Discovers 34 additional tools reachable in 2 hops
+
+ > get_tool_quick_ref({ tool_name: "start_verification_cycle", depth: 3 })
+ # Returns: 3-hop BFS traversal — full neighborhood graph
+ ```
+
+ Each discovered tool includes `hopDistance` (1-3) and `reachedVia` (which parent tool led to it). BFS prevents cycles — no tool appears at multiple depths.
+
+ ### `nextTools` vs `relatedTools`
+
+ | | `nextTools` | `relatedTools` |
+ |---|---|---|
+ | **Meaning** | Workflow-sequential ("do X then Y") | Conceptually adjacent ("if doing X, consider Y") |
+ | **Example** | `run_recon` → `log_recon_finding` | `run_recon` → `search_all_knowledge`, `bootstrap_project` |
+ | **Total connections** | 498 | 949 (191% amplification) |
+ | **Overlap** | — | 0% (all net-new connections) |
+ | **Cross-domain** | Mostly same-domain | 90% bridge different domains |

  ### Workflow chains — step-by-step recipes

- 24 pre-built chains for common workflows:
+ 28 pre-built chains for common workflows:

  | Chain | Steps | Use case |
  |---|---|---|
@@ -349,6 +460,10 @@ Call `get_tool_quick_ref("tool_name")` for any tool's guidance.
  | `pr_review` | 5 | Pull request review |
  | `seo_audit` | 6 | Full SEO audit |
  | `voice_pipeline` | 6 | Voice pipeline implementation |
+ | `intentionality_check` | 4 | Verify agent intent before action |
+ | `research_digest` | 6 | Summarize research across sessions |
+ | `email_assistant` | 5 | Email triage and response |
+ | `pr_creation` | 6 | Visual PR creation from UI Dive sessions |

  Call `get_workflow_chain("new_feature")` to get the step-by-step sequence.

@@ -365,120 +480,21 @@ Or use the scaffold tool: `scaffold_nodebench_project` creates AGENTS.md, .mcp.j
365
480
 
366
481
  ---
367
482
 
368
- ## The Methodology Pipeline
369
-
370
- NodeBench MCP isn't just a bag of tools — it's a pipeline. Each step feeds the next:
371
-
372
- ```
373
- Research → Risk → Implement → Test (3 layers) → Eval → Gate → Learn → Ship
374
- ↑ │
375
- └──────────── knowledge compounds ─────────────────────────────┘
376
- ```
377
-
378
- **Inner loop** (per change): 6-phase verification ensures correctness.
379
- **Outer loop** (over time): Eval-driven development ensures improvement.
380
- **Together**: The AI Flywheel — every verification produces eval artifacts, every regression triggers verification.
381
-
382
- ### The 6-Phase Verification Process (Inner Loop)
383
-
384
- Every non-trivial change should go through these 6 steps:
385
-
386
- 1. **Context Gathering** — Parallel subagent deep dive into SDK specs, implementation patterns, dispatcher/backend audit, external API research
387
- 2. **Gap Analysis** — Compare findings against current implementation, categorize gaps (CRITICAL/HIGH/MEDIUM/LOW)
388
- 3. **Implementation** — Apply fixes following production patterns exactly
389
- 4. **Testing & Validation** — 5 layers: static analysis, unit tests, integration tests, manual verification, live end-to-end
390
- 5. **Self-Closed-Loop Verification** — Parallel verification subagents check spec compliance, functional correctness, argument compatibility
391
- 6. **Document Learnings** — Update documentation with edge cases and key learnings
483
+ ## Scaling MCP: How We Solved the 5 Biggest Industry Problems
392
484
 
393
- ### The Eval-Driven Development Loop (Outer Loop)
394
-
395
- 1. **Run Eval Batch** — Send test cases through the target workflow
396
- 2. **Capture Telemetry** — Collect complete agent execution trace
397
- 3. **LLM-as-Judge Analysis** — Score goal alignment, tool efficiency, output quality
398
- 4. **Retrieve Results** — Aggregate pass/fail rates and improvement suggestions
399
- 5. **Fix, Optimize, Enhance** — Apply changes based on judge feedback
400
- 6. **Re-run Evals** — Deploy only if scores improve
401
-
402
- **Rule: No change ships without an eval improvement.**
403
-
404
- Ask the agent: `Use getMethodology("overview")` to see all 20 methodology topics.
405
-
406
- ---
407
-
408
- ## Parallel Agents with Claude Code
409
-
410
- Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www.anthropic.com/engineering/building-c-compiler) (Feb 2026).
411
-
412
- **When to use:** Only when running 2+ agent sessions. Single-agent workflows use the standard pipeline above.
413
-
414
- **How it works with Claude Code's Task tool:**
-
- 1. **COORDINATOR** (your main session) breaks work into independent tasks
- 2. Each **Task tool** call spawns a subagent with instructions to:
-    - `claim_agent_task` — lock the task
-    - `assign_agent_role` — specialize (implementer, test_writer, critic, etc.)
-    - Do the work
-    - `release_agent_task` — handoff with progress note
- 3. Coordinator calls `get_parallel_status` to monitor all subagents
- 4. Coordinator runs `run_quality_gate` on the aggregate result
-
- **MCP Prompts available:**
- - `claude-code-parallel` — Step-by-step Claude Code subagent coordination
- - `parallel-agent-team` — Full team setup with role assignment
- - `oracle-test-harness` — Validate outputs against known-good reference
- - `bootstrap-parallel-agents` — Scaffold parallel infra for any repo
-
- **Note:** Parallel agent coordination tools are only available in the `full` preset. For single-agent workflows, the `default` preset provides all the core AI Flywheel tools you need.
-
- ---
-
- ## Toolset Gating
-
- The default preset (50 tools) gives you the complete AI Flywheel methodology with ~78% fewer tools compared to the full suite (175 tools).
-
- ### Presets — Choose What You Need
-
- | Preset | Tools | Domains | Use case |
- |---|---|---|---|
- | **default** ⭐ | **50** | 7 | **Recommended** — Complete AI Flywheel: verification, eval, quality_gate, learning, flywheel, recon, boilerplate + discovery + dynamic loading |
- | `full` | 175 | 34 | Everything — vision, UI capture, web, GitHub, docs, parallel, local files, GAIA solvers, security, email, RSS, architect |
-
- ```bash
- # ⭐ Recommended: Default (50 tools) - complete AI Flywheel
- claude mcp add nodebench -- npx -y nodebench-mcp
-
- # Everything: All 175 tools
- claude mcp add nodebench -- npx -y nodebench-mcp --preset full
- ```
-
- Or in config:
-
- ```json
- {
-   "mcpServers": {
-     "nodebench": {
-       "command": "npx",
-       "args": ["-y", "nodebench-mcp"]
-     }
-   }
- }
- ```
-
- ### Scaling MCP: How We Solved the 5 Biggest Industry Problems
-
- MCP tool servers face 5 systemic problems documented across Anthropic, Microsoft Research, and the open-source community. We researched each one, built solutions, and tested them with automated eval harnesses. Here's the full breakdown — problem by problem.
+ MCP tool servers face 5 systemic problems documented across Anthropic, Microsoft Research, and the open-source community. We researched each one, built solutions, and tested them with automated eval harnesses.
 
 ---
 
- #### Problem 1: Context Bloat (too many tool definitions eat the context window)
+ ### Problem 1: Context Bloat (too many tool definitions eat the context window)
 
- **The research**: Anthropic measured that 58 tools from 5 MCP servers consume **~55K tokens** before the conversation starts. At 175 tools, NodeBench would consume ~87K tokens — up to 44% of a 200K context window just on tool metadata. [Microsoft Research](https://www.microsoft.com/en-us/research/blog/tool-space-interference-in-the-mcp-era-designing-for-agent-compatibility-at-scale/) found LLMs "decline to act at all when faced with ambiguous or excessive tool options." [Cursor enforces a ~40-tool hard cap](https://www.lunar.dev/post/why-is-there-mcp-tool-overload-and-how-to-solve-it-for-your-ai-agents) for this reason.
+ **The research**: Anthropic measured that 58 tools from 5 MCP servers consume **~55K tokens** before the conversation starts. At 218 tools, NodeBench would consume ~109K tokens — over half a 200K context window just on tool metadata. [Microsoft Research](https://www.microsoft.com/en-us/research/blog/tool-space-interference-in-the-mcp-era-designing-for-agent-compatibility-at-scale/) found LLMs "decline to act at all when faced with ambiguous or excessive tool options." [Cursor enforces a ~40-tool hard cap](https://www.lunar.dev/post/why-is-there-mcp-tool-overload-and-how-to-solve-it-for-your-ai-agents) for this reason.
 
 **Our solutions** (layered, each independent):
 
 | Layer | What it does | Token savings | Requires |
 |---|---|---|---|
- | Themed presets (`--preset web_dev`) | Load only relevant toolsets (44-60 tools vs 175) | **60-75%** | Nothing |
+ | Themed presets (`--preset web_dev`) | Load only relevant toolsets (54-106 tools vs 218) | **50-75%** | Nothing |
 | TOON encoding (on by default) | Encode all tool responses in token-optimized format | **~40%** on responses | Nothing |
 | `discover_tools({ compact: true })` | Return `{ name, category, hint }` only | **~60%** on search results | Nothing |
 | `instructions` field (Claude Code) | Claude Code defers tool loading, searches on demand | **~85%** | Claude Code client |
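As a back-of-envelope check on those figures, a minimal sketch (the ~500 tokens-per-tool average is an assumption inferred from the 218-tool ≈ ~109K figure above; real definitions vary in size):

```javascript
// Back-of-envelope context-window math for the figures above.
// ASSUMPTION: ~500 tokens per tool definition on average
// (218 tools ≈ ~109K tokens); real definitions vary in size.
const TOKENS_PER_TOOL = 500;
const CONTEXT_WINDOW = 200_000;

function contextCost(toolCount) {
  const tokens = toolCount * TOKENS_PER_TOOL;
  return { tokens, shareOfWindow: tokens / CONTEXT_WINDOW };
}

const full = contextCost(218); // full preset
const def = contextCost(54);   // default preset
// savings from loading the default preset instead of full
const savingsPct = Math.round((1 - def.tokens / full.tokens) * 100);
console.log(full.tokens, def.tokens, savingsPct + "%");
```

This is why the preset layer alone lands in the 50-75% range before TOON or compact mode even apply.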
@@ -488,11 +504,11 @@ MCP tool servers face 5 systemic problems documented across Anthropic, Microsoft
 
 ---
 
- #### Problem 2: Tool Selection Degradation (LLMs pick the wrong tool as count increases)
+ ### Problem 2: Tool Selection Degradation (LLMs pick the wrong tool as count increases)
 
 **The research**: [Anthropic's Tool Search Tool](https://www.anthropic.com/engineering/advanced-tool-use) improved accuracy from **49% → 74%** (Opus 4) and **79.5% → 88.1%** (Opus 4.5) by switching from all-tools-upfront to on-demand discovery. The [Dynamic ReAct paper (arxiv 2509.20386)](https://arxiv.org/html/2509.20386v1) tested 5 architectures and found **Search + Load** wins — flat search + deliberate loading beats hierarchical app→tool search.
 
- **Our solution**: `discover_tools` — a 14-strategy hybrid search engine that finds the right tool from 175 candidates:
+ **Our solution**: `discover_tools` — a 14-strategy hybrid search engine that finds the right tool from 218 candidates, with **cursor pagination**, **result expansion**, and **multi-hop traversal**:
 
 | Strategy | What it does | Example |
 |---|---|---|
@@ -502,8 +518,11 @@ MCP tool servers face 5 systemic problems documented across Anthropic, Microsoft
 | N-gram + Bigram | Partial words and phrases | "screen" → `capture_ui_screenshot` |
 | Dense (TF-IDF cosine) | Vector-like ranking | "audit compliance" surfaces related tools |
 | Embedding (neural) | Agent-as-a-Graph bipartite RRF | Based on [arxiv 2511.01854](https://arxiv.org/html/2511.01854v1) |
- | Execution traces | Co-occurrence mining from `tool_call_log` | Tools frequently used together boost each other |
+ | Execution traces | Co-occurrence mining from `tool_call_log` (direct + transitive A→B→C) | Tools frequently used together boost each other |
 | Intent pre-filter | Narrow to relevant categories before search | `intent: "data_analysis"` → only local_file, llm, benchmark |
+ | **Pagination** | `offset` + `limit` with stable `totalMatches` and `hasMore` | Page through 76+ results 5 at a time |
+ | **Expansion** | Top N results' `relatedTools` neighbors added at 50% parent score | `expand: 3` adds 2-8 new tools per query |
+ | **Multi-hop BFS** | `get_tool_quick_ref` depth 1-3 with `hopDistance` + `reachedVia` | depth=2 discovers 24-40 additional tools |
 
 Plus `smart_select_tools` for ambiguous queries — sends the catalog to Gemini 3 Flash / GPT-5-mini / Claude Haiku 4.5 for LLM-powered reranking.
 
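A client can drain the paginated results from the **Pagination** row with a simple loop. This sketch assumes only the fields named in the table (`offset`, `limit`, `hasMore`); `callTool` and the `results` array name stand in for whatever invoke function and response shape your MCP client exposes:

```javascript
// Drain paginated discover_tools results, pageSize at a time.
// `callTool` is a placeholder for your MCP client's invoke function.
async function discoverAll(callTool, query, pageSize = 5) {
  const results = [];
  let offset = 0;
  let hasMore = true;
  while (hasMore) {
    const page = await callTool("discover_tools", {
      query, offset, limit: pageSize, compact: true,
    });
    results.push(...page.results);
    hasMore = page.hasMore;   // stable flag, per the table above
    offset += pageSize;
  }
  return results;
}
```

Because `totalMatches` stays stable across pages, the loop terminates even when a query matches 76+ tools.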
@@ -518,7 +537,7 @@ Plus `smart_select_tools` for ambiguous queries — sends the catalog to Gemini
 
 ---
 
- #### Problem 3: Static Loading (all tools loaded upfront, even if unused)
+ ### Problem 3: Static Loading (all tools loaded upfront, even if unused)
 
 **The research**: The Dynamic ReAct paper found that **Search + Load with 2 meta tools** beats all other architectures. Hierarchical search (search apps → search tools → load) adds overhead without improving accuracy. [ToolScope (arxiv 2510.20036)](https://arxiv.org/html/2510.20036) showed **+34.6%** tool selection accuracy with hybrid retrieval + tool deduplication.
 
@@ -542,7 +561,7 @@ npx nodebench-mcp --dynamic
 Key design decisions from the research:
 - **No hierarchical search** — Dynamic ReAct Section 3.4: "search_apps introduces an additional call without significantly improving accuracy"
 - **Direct tool binding** — Dynamic ReAct Section 3.5: LLMs perform best with directly bound tools; `call_tool` indirection degrades in long conversations
- - **Full-registry search** — `discover_tools` searches all 175 tools even with 44 loaded, so it can suggest what to load
+ - **Full-registry search** — `discover_tools` searches all 218 tools even with 54 loaded, so it can suggest what to load
 
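The Search + Load loop those decisions describe looks roughly like this from the client side. A sketch, not the server's API: `callTool` is a placeholder, the `toolset` argument name is an assumption, and the compact `{ name, category, hint }` result shape is the one documented under Problem 1:

```javascript
// Dynamic ReAct's Search + Load pattern with the two meta tools:
// flat search over the full registry, then deliberately load one toolset.
// `callTool` is a placeholder for your MCP client's invoke function.
async function searchAndLoad(callTool, query) {
  const found = await callTool("discover_tools", { query, compact: true });
  const top = found.results[0]; // best match across the whole registry
  await callTool("load_toolset", { toolset: top.category }); // arg name assumed
  return top.name; // now callable (directly bound, where the client supports it)
}
```

Note there is no intermediate "search apps" step, matching the Section 3.4 finding above.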
566
  **How we tested**: Automated A/B harness + live IDE session.
548
567
 
@@ -559,7 +578,7 @@ Key design decisions from the research:
559
578
 
560
579
  ---
561
580
 
562
- #### Problem 4: Client Fragmentation (not all clients handle dynamic tool updates)
581
+ ### Problem 4: Client Fragmentation (not all clients handle dynamic tool updates)
563
582
 
564
583
  **The research**: The MCP spec defines `notifications/tools/list_changed` for servers to tell clients to re-fetch the tool list. But [Cursor hasn't implemented it](https://forum.cursor.com/t/enhance-mcp-integration-in-cursor-dynamic-tool-updates-roots-support-progress-tokens-streamable-http/99903), [Claude Desktop didn't support it](https://github.com/orgs/modelcontextprotocol/discussions/76) (as of Dec 2024), and [Gemini CLI has an open issue](https://github.com/google-gemini/gemini-cli/issues/13850).
565
584
 
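A client-side wrapper can route around missing `list_changed` support. This is a sketch, not the server's API: `callTool` and the proxy's `{ name, arguments }` argument shape are assumptions; only the tool names come from this README:

```javascript
// If the client re-fetches tools on list_changed, call the tool directly;
// otherwise go through the always-available call_loaded_tool proxy.
// ASSUMPTION: the proxy accepts a { name, arguments } payload.
async function invokeTool(callTool, supportsListChanged, name, args) {
  if (supportsListChanged) {
    return callTool(name, args); // directly bound after load_toolset
  }
  return callTool("call_loaded_tool", { name, arguments: args });
}
```

The direct path is preferable when available, since Problem 3's research notes that indirection degrades in long conversations.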
@@ -592,7 +611,7 @@ tools/list AFTER UNLOAD: 95 tools (-4) ← tools removed
 
 ---
 
- #### Problem 5: Aggressive Filtering (over-filtering means the right tool isn't found)
+ ### Problem 5: Aggressive Filtering (over-filtering means the right tool isn't found)
 
 **The research**: This is the flip side of Problem 1. If you reduce context aggressively (e.g., keyword-only search), ambiguous queries like "call an AI model" fail to match the `llm` toolset because every tool mentions "AI" in its description. [SynapticLabs' Bounded Context Packs](https://blog.synapticlabs.ai/bounded-context-packs-tool-bloat-tipping-point) addresses this with progressive disclosure. [SEP-1576](https://github.com/modelcontextprotocol/modelcontextprotocol/issues/1576) proposes adaptive granularity at the protocol level.
 
@@ -623,11 +642,11 @@ Neural bipartite graph search (tool nodes + domain nodes) based on [Agent-as-a-G
 
 ---
 
- #### Summary: research → solution → eval for each problem
+ ### Summary: research → solution → eval for each problem
 
 | Problem | Research Source | Our Solution | Eval Method | Result |
 |---|---|---|---|---|
- | **Context bloat** (87K tokens) | Anthropic (85% reduction), Lunar.dev (~40-tool cap), SEP-1576 | Presets, TOON, compact mode, `instructions`, `smart_select_tools` | A/B harness token measurement | 60-95% reduction depending on layer |
+ | **Context bloat** (107K tokens) | Anthropic (85% reduction), Lunar.dev (~40-tool cap), SEP-1576 | Presets, TOON, compact mode, `instructions`, `smart_select_tools` | A/B harness token measurement | 50-95% reduction depending on layer |
 | **Selection degradation** | Anthropic (+25pp), Dynamic ReAct (Search+Load wins) | 14-strategy hybrid search, intent pre-filter, LLM reranking | 28-scenario discovery accuracy | **100% accuracy** (18/18 domains) |
 | **Static loading** | Dynamic ReAct, ToolScope (+34.6%), MCP spec | `--dynamic` flag, `load_toolset` / `unload_toolset` | A/B harness + live IDE test | **100% success**, <1ms load latency |
 | **Client fragmentation** | MCP discussions, client bug trackers | `list_changed` + `call_loaded_tool` proxy | Server-side `tools/list` verification | Works on **all clients** |
@@ -637,15 +656,17 @@ Neural bipartite graph search (tool nodes + domain nodes) based on [Agent-as-a-G
 
 | Segment | R@5 Baseline | Most Critical Strategy | Impact When Removed |
 |---|---|---|---|
- | **New user** (vague, natural language) | 67% | Synonym expansion | 🔴 -17pp R@5 |
- | **Experienced** (domain keywords) | 72% | All robust | No single strategy >5pp |
- | **Power user** (exact tool names) | 100% | None needed | Keyword alone = 100% |
+ | **New user** (vague, natural language) | 67% | Synonym expansion | -17pp R@5 |
+ | **Experienced** (domain keywords) | 72% | All robust | No single strategy >5pp |
+ | **Power user** (exact tool names) | 100% | None needed | Keyword alone = 100% |
 
 Key insight: new users need synonym expansion ("website" → seo, "AI" → llm) and fuzzy matching (typo tolerance). Power users need nothing beyond keyword matching. The remaining 33% new user gap is filled by `smart_select_tools` (LLM-powered).
 
 Full methodology, per-scenario breakdown, ablation data, and research citations: [DYNAMIC_LOADING.md](./DYNAMIC_LOADING.md)
 
- ### Fine-grained control
+ ---
+
+ ## Fine-Grained Control
 
 ```bash
 # Include only specific toolsets
@@ -654,11 +675,14 @@ npx nodebench-mcp --toolsets verification,eval,recon
 # Exclude heavy optional-dep toolsets
 npx nodebench-mcp --exclude vision,ui_capture,parallel
 
+ # Dynamic loading — start with 12 tools, load on demand
+ npx nodebench-mcp --dynamic
+
 # See all toolsets and presets
 npx nodebench-mcp --help
 ```
 
- ### Available toolsets
+ ### All 39 Toolsets
 
 | Toolset | Tools | What it covers | In `default` |
 |---|---|---|---|
@@ -666,11 +690,12 @@ npx nodebench-mcp --help
 | eval | 6 | Eval runs, results, comparison, diff | ✅ |
 | quality_gate | 4 | Gates, presets, history | ✅ |
 | learning | 4 | Knowledge, search, record | ✅ |
- | recon | 7 | Research, findings, framework checks, risk | ✅ |
 | flywheel | 4 | Mandatory flywheel, promote, investigate | ✅ |
+ | recon | 7 | Research, findings, framework checks, risk | ✅ |
 | security | 3 | Dependency scanning, code analysis, terminal security scanning | ✅ |
- | **Total** | **44** | **Complete AI Flywheel** |
 | boilerplate | 2 | Scaffold NodeBench projects + status | ✅ |
+ | skill_update | 4 | Skill tracking, freshness checks, sync | ✅ |
+ | **Subtotal** | **42** | **AI Flywheel core** | |
 | bootstrap | 11 | Project setup, agents.md, self-implement, autonomous, test runner | — |
 | self_eval | 9 | Trajectory analysis, health reports, task banks, grading, contract compliance | — |
 | parallel | 13 | Task locks, roles, context budget, oracle, agent mailbox | — |
@@ -693,19 +718,22 @@ npx nodebench-mcp --help
 | git_workflow | 3 | Branch compliance, PR checklist review, merge gate | — |
 | seo | 5 | Technical SEO audit, page performance, content analysis | — |
 | voice_bridge | 4 | Voice pipeline design, config analysis, scaffold | — |
+ | critter | 1 | Accountability checkpoint with calibrated scoring | — |
 | email | 4 | SMTP/IMAP email ingestion, search, delivery | — |
 | rss | 4 | RSS feed parsing and monitoring | — |
- | architect | 3 | Architecture analysis and decision logging | — |
+ | architect | 3 | Structural analysis, concept verification, implementation planning | — |
+ | ui_ux_dive | 11 | UI/UX deep analysis sessions, component reviews, flow audits | — |
+ | mcp_bridge | 5 | Connect external MCP servers, proxy tool calls, manage sessions | — |
+ | ui_ux_dive_v2 | 14 | Advanced UI/UX analysis with preflight, scoring, heuristic evaluation | — |
+ | pr_report | 3 | Visual PR creation with screenshot comparisons, timelines, past session links | — |
 
- **Always included** — these 12 tools are always available:
+ **Always included** — these 12 tools are available regardless of preset:
 - **Meta/discovery (6):** `findTools`, `getMethodology`, `check_mcp_setup`, `discover_tools`, `get_tool_quick_ref`, `get_workflow_chain`
 - **Dynamic loading (6):** `load_toolset`, `unload_toolset`, `list_available_toolsets`, `call_loaded_tool`, `smart_select_tools`, `get_ab_test_report`
 
- The `default` preset includes 50 tools (38 domain + 6 meta/discovery + 6 dynamic loading).
-
 ### TOON Format — Token Savings
 
- TOON (Token-Oriented Object Notation) is **on by default** for all presets since v2.14.1. Every tool response is TOON-encoded for ~40% fewer tokens vs JSON. Disable with `--no-toon` if your client can't handle non-JSON responses.
+ TOON (Token-Oriented Object Notation) is **on by default** for all presets. Every tool response is TOON-encoded for ~40% fewer tokens vs JSON. Disable with `--no-toon` if your client can't handle non-JSON responses.
 
 ```bash
 # TOON on (default, all presets)
@@ -715,20 +743,13 @@ claude mcp add nodebench -- npx -y nodebench-mcp
 # TOON off
 claude mcp add nodebench -- npx -y nodebench-mcp --no-toon
 ```
 
- Use the `toon_encode` and `toon_decode` tools to convert between TOON and JSON in your own workflows.
-
- ### When to Use Each Preset
-
- | Preset | Use when... | Example |
- |---|---|---|
- | **default** ⭐ | You want the complete AI Flywheel methodology with minimal token overhead | Most users — bug fixes, features, refactoring, code review |
- | `full` | You need vision, UI capture, web search, GitHub, local file parsing, or GAIA solvers | Vision QA, web scraping, file processing, parallel agents, capability benchmarking |
+ Use the `toon_encode` and `toon_decode` tools (in the `toon` toolset) to convert between TOON and JSON in your own workflows.
 
 ---
 
- ## AI Flywheel — Complete Methodology
+ ## The AI Flywheel — Complete Methodology
 
- The AI Flywheel is documented in detail in [AI_FLYWHEEL.md](https://github.com/HomenShum/nodebench-ai/blob/main/AI_FLYWHEEL.md). Here's a summary:
+ The AI Flywheel is documented in detail in [AI_FLYWHEEL.md](https://github.com/HomenShum/nodebench-ai/blob/main/AI_FLYWHEEL.md).
 
 ### Two Loops That Compound
 
@@ -780,6 +801,62 @@ The AI Flywheel is documented in detail in [AI_FLYWHEEL.md](https://github.com/H
 
 ---
 
+ ## Parallel Agents with Claude Code
+
+ Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www.anthropic.com/engineering/building-c-compiler) (Feb 2026).
+
+ **When to use:** Only when running 2+ agent sessions. Single-agent workflows use the standard pipeline above.
+
+ **How it works with Claude Code's Task tool:**
+
+ 1. **COORDINATOR** (your main session) breaks work into independent tasks
+ 2. Each **Task tool** call spawns a subagent with instructions to:
+    - `claim_agent_task` — lock the task
+    - `assign_agent_role` — specialize (implementer, test_writer, critic, etc.)
+    - Do the work
+    - `release_agent_task` — handoff with progress note
+ 3. Coordinator calls `get_parallel_status` to monitor all subagents
+ 4. Coordinator runs `run_quality_gate` on the aggregate result
+
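The claim, specialize, work, release cycle in steps 1-4 can be sketched as one subagent's script (`callTool` and all argument shapes here are illustrative assumptions; only the tool names come from the list above):

```javascript
// One subagent's lifecycle under the coordinator pattern above.
// `callTool` and the argument shapes are illustrative placeholders.
async function runSubagentTask(callTool, taskId, role, doWork) {
  await callTool("claim_agent_task", { taskId });  // lock the task
  await callTool("assign_agent_role", { role });   // e.g. "implementer"
  const result = await doWork();                   // the actual work
  await callTool("release_agent_task", { taskId, note: "done" }); // handoff
  return result;
}
```

The coordinator then polls `get_parallel_status` while several of these run concurrently.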
+ **MCP Prompts available:**
+ - `claude-code-parallel` — Step-by-step Claude Code subagent coordination
+ - `parallel-agent-team` — Full team setup with role assignment
+ - `oracle-test-harness` — Validate outputs against known-good reference
+ - `bootstrap-parallel-agents` — Scaffold parallel infra for any repo
+
+ **Note:** Parallel agent coordination tools require `--preset multi_agent` or `--preset full`.
+
+ ---
+
+ ## Capability Benchmarking (GAIA, Gated)
+
+ NodeBench MCP treats tools as "Access". To measure real capability lift, we benchmark baseline (LLM-only) vs tool-augmented accuracy on GAIA (gated).
+
+ Notes:
+ - GAIA fixtures and attachments are written under `.cache/gaia` (gitignored). Do not commit GAIA content.
+ - Fixture generation requires `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN`.
+
+ Web lane (web_search + fetch_url):
+ ```bash
+ npm run mcp:dataset:gaia:capability:refresh
+ NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:test
+ ```
+
+ File-backed lane (PDF / XLSX / CSV / DOCX / PPTX / JSON / JSONL / TXT / ZIP via `local_file` tools):
+ ```bash
+ npm run mcp:dataset:gaia:capability:files:refresh
+ NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:files:test
+ ```
+
+ Modes:
+ - Stable: `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag`
+ - More realistic: `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent`
+
+ Notes:
+ - ZIP attachments require `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (multi-step extract -> parse).
+
+ ---
+
 ## Build from Source
 ```bash
785
862
  ```bash
@@ -805,51 +882,54 @@ Then use absolute path:
 
 ## Quick Reference
 
- ### Recommended Setup for Most Users
+ ### Recommended Setup
 
 ```bash
- # Claude Code / Windsurf — AI Flywheel core tools (50 tools, default)
+ # Claude Code — AI Flywheel core (54 tools, default)
 claude mcp add nodebench -- npx -y nodebench-mcp
- ```
-
- ### What's in the default preset?
-
- | Domain | Tools | What you get |
- |---|---|---|
- | verification | 8 | Cycles, gaps, triple-verify, status |
- | eval | 6 | Eval runs, results, comparison, diff |
- | quality_gate | 4 | Gates, presets, history |
- | learning | 4 | Knowledge, search, record |
- | recon | 7 | Research, findings, framework checks, risk |
- | flywheel | 4 | Mandatory flywheel, promote, investigate |
- | security | 3 | Dependency scanning, code analysis, terminal security scanning |
- | boilerplate | 2 | Scaffold NodeBench projects + status |
- | meta + discovery | 6 | findTools, getMethodology, check_mcp_setup, discover_tools, get_tool_quick_ref, get_workflow_chain |
- | dynamic loading | 6 | load_toolset, unload_toolset, list_available_toolsets, call_loaded_tool, smart_select_tools, get_ab_test_report |
-
- **Total: 50 tools** — Complete AI Flywheel methodology with ~70% less token overhead.
-
- ### When to Upgrade Presets
-
- | Need | Upgrade to |
- |---|---|
- | Everything: vision, UI capture, web search, GitHub, local file parsing, GAIA solvers | `--preset full` (175 tools) |
-
- ### First Prompts to Try
 
+ # Windsurf — add to ~/.codeium/windsurf/mcp_config.json
+ # Cursor — add to .cursor/mcp.json
+ {
+   "mcpServers": {
+     "nodebench": {
+       "command": "npx",
+       "args": ["-y", "nodebench-mcp"]
+     }
+   }
+ }
 ```
- # See what's available
- > Use getMethodology("overview") to see all workflows
-
- # Before your next task — search for prior knowledge
- > Use search_all_knowledge("what I'm about to work on")
 
- # Run the full verification pipeline on a change
- > Use getMethodology("mandatory_flywheel") and follow the 6 steps
+ ### What's in the Default?
 
- # Find tools for a specific task
- > Use discover_tools("verify my implementation")
- ```
+ | Category | Tools | What you get |
+ |---|---|---|
+ | Discovery | 6 | findTools, getMethodology, check_mcp_setup, discover_tools (pagination + expansion), get_tool_quick_ref (multi-hop BFS), get_workflow_chain |
+ | Dynamic loading | 6 | load_toolset, unload_toolset, list_available_toolsets, call_loaded_tool, smart_select_tools, get_ab_test_report |
+ | Verification | 8 | Cycles, gaps, triple-verify, status |
+ | Eval | 6 | Eval runs, results, comparison, diff |
+ | Quality gate | 4 | Gates, presets, history |
+ | Learning | 4 | Knowledge, search, record |
+ | Flywheel | 4 | Mandatory flywheel, promote, investigate |
+ | Recon | 7 | Research, findings, framework checks, risk |
+ | Security | 3 | Dependency scanning, code analysis, terminal security scanning |
+ | Boilerplate | 2 | Scaffold NodeBench projects + status |
+ | Skill update | 4 | Skill tracking, freshness checks, sync |
+ | **Total** | **54** | **Complete AI Flywheel methodology** |
+
+ ### When to Use a Themed Preset
+
+ | Need | Preset | Tools |
+ |---|---|---|
+ | Web development with visual QA | `--preset web_dev` | 106 |
+ | Mobile apps with flicker detection | `--preset mobile` | 95 |
+ | Academic papers and research writing | `--preset academic` | 86 |
+ | Multi-agent coordination | `--preset multi_agent` | 83 |
+ | Data analysis and file processing | `--preset data` | 78 |
+ | Content pipelines and publishing | `--preset content` | 73 |
+ | Research with web search and RSS | `--preset research` | 71 |
+ | CI/CD and DevOps | `--preset devops` | 68 |
+ | Everything | `--preset full` | 218 |
 
 ### Key Methodology Topics
 
@@ -872,8 +952,8 @@ NodeBench MCP runs locally on your machine. Here's what it can and cannot access
 - Analytics data never leaves your machine
 
 ### File system access
- - The `local_file` toolset (`--preset full` only) can **read files anywhere on your filesystem** that the Node.js process has permission to access. This includes CSV, PDF, XLSX, DOCX, PPTX, JSON, TXT, and ZIP files
- - The `security` toolset runs static analysis on files you point it at
+ - The `local_file` toolset (in `data`, `academic`, `full` presets) can **read files anywhere on your filesystem** that the Node.js process has permission to access. This includes CSV, PDF, XLSX, DOCX, PPTX, JSON, TXT, and ZIP files
+ - The `security` toolset (in all presets) runs static analysis on files you point it at
 - Session notes and project bootstrapping write to the current working directory or `~/.nodebench/`
 - **Trust boundary**: If you grant an AI agent access to NodeBench MCP with `--preset full`, that agent can read any file your user account can read. Use the `default` preset if you want to restrict file system access
 
@@ -897,6 +977,12 @@ NodeBench MCP runs locally on your machine. Here's what it can and cannot access
 
 **MCP not connecting** — Check path is absolute, run `claude --mcp-debug`, ensure Node.js >= 18
 
+ **Windsurf not finding tools** — Verify `~/.codeium/windsurf/mcp_config.json` has the correct JSON structure. Open Settings → MCP → View raw config to edit directly.
+
+ **Cursor tools not loading** — Ensure `.cursor/mcp.json` exists in the project root. Restart Cursor after config changes.
+
+ **Dynamic loading not working** — Claude Code and GitHub Copilot support native dynamic loading. For Windsurf/Cursor, use `call_loaded_tool` as a fallback (it's always available).
+
 ---
 
 ## License