nodebench-mcp 2.38.0 → 2.40.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -5,73 +5,70 @@
  [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
  [![GitHub stars](https://img.shields.io/github/stars/HomenShum/nodebench-ai.svg)](https://github.com/HomenShum/nodebench-ai)
  [![MCP Compatible](https://img.shields.io/badge/MCP-Compatible-green.svg)](https://modelcontextprotocol.io)
- [![Tools](https://img.shields.io/badge/Tools-260-orange.svg)](https://www.npmjs.com/package/nodebench-mcp)
+ [![Tools](https://img.shields.io/badge/Tools-338-orange.svg)](https://www.npmjs.com/package/nodebench-mcp)

- **Make AI agents catch the bugs they normally ship.**
+ **Entity intelligence for any company, market, or question — turn messy context into decision-ready packets, memos, and delegation briefs.**

- One command gives your agent structured research, risk assessment, 3-layer testing, quality gates, and a persistent knowledge base so every fix is thorough and every insight compounds into future work.
+ 338 tools across 55 domains. You start with 15 (starter preset). Call `discover_tools` to find what you need, then `load_toolset` to activate it. No context bloat, no IDE crashes.

  ```bash
- # Claude Code — AI Flywheel core (50 tools, recommended)
+ # Starter preset (15 tools) — decision intelligence + progressive discovery
  claude mcp add nodebench -- npx -y nodebench-mcp

- # Windsurf / Cursor — same tools, add to your MCP config (see setup below)
+ # Founder preset (~40 tools) — decision intelligence, company tracking, session memory
+ claude mcp add nodebench -- npx -y nodebench-mcp --preset founder

- # Need everything? Vision, web, files, parallel agents, etc.
+ # All 338 tools
  claude mcp add nodebench -- npx -y nodebench-mcp --preset full
  ```

  ---

- ## Why — What Bare Agents Miss
+ ## What You Get

- We benchmarked 9 real production prompts — things like *"The LinkedIn posting pipeline is creating duplicate posts"* and *"The agent loop hits budget but still gets new events"* — comparing a bare agent vs one with NodeBench MCP.
+ NodeBench is a decision-intelligence layer for your AI coding agent. Instead of dumping 300+ tools into context, you start with a tight starter set and expand on demand.

- | What gets measured | Bare Agent | With NodeBench MCP |
- |---|---|---|
- | Issues detected before deploy | 0 | **13** (4 high, 8 medium, 1 low) |
- | Research findings before coding | 0 | **21** |
- | Risk assessments | 0 | **9** |
- | Test coverage layers | 1 | **3** (static + unit + integration) |
- | Integration failures caught early | 0 | **4** |
- | Regression eval cases created | 0 | **22** |
- | Quality gate rules enforced | 0 | **52** |
- | Deploys blocked by gate violations | 0 | **4** |
- | Knowledge entries banked | 0 | **9** |
- | Blind spots shipped to production | **26** | **0** |
-
- The bare agent reads the code, implements a fix, runs tests once, and ships. The MCP agent researches first, assesses risk, tracks issues to resolution, runs 3-layer tests, creates regression guards, enforces quality gates, and banks everything as knowledge for next time.
-
- Every additional tool call produces a concrete artifact — an issue found, a risk assessed, a regression guarded — that compounds across future tasks.
-
- ---
-
- ## How It Works — 3 Real Examples
+ ### Starter Preset (default, 15 tools)

- ### Example 1: Bug fix
+ Decision intelligence core + progressive discovery. Enough to run Deep Sim scenarios, generate decision memos, and discover/load any of the 338 tools when needed.

- You type: *"The content queue has 40 items stuck in 'judging' status for 6 hours"*
-
- **Bare agent:** Reads the queue code, finds a potential fix, runs tests, ships.
-
- **With NodeBench MCP:** The agent runs structured recon and discovers 3 blind spots the bare agent misses:
- - No retry backoff on OpenRouter rate limits (HIGH)
- - JSON regex `match(/\{[\s\S]*\}/)` grabs last `}` — breaks on multi-object responses (MEDIUM)
- - No timeout on LLM call — hung request blocks entire cron for 15+ min (not detected by unit tests)
-
- All 3 are logged as gaps, resolved, regression-tested, and the patterns banked so the next similar bug is fixed faster.
+ | Domain | What it does |
+ |---|---|
+ | **Decision Intelligence (Deep Sim)** | Simulate decisions, run postmortems, score trajectories, generate decision memos |
+ | **Progressive Discovery** | `discover_tools` (14-strategy hybrid search), `get_tool_quick_ref` (multi-hop BFS), `get_workflow_chain` |
+ | **Dynamic Loading** | `load_toolset` / `unload_toolset` activate any toolset mid-session |

- ### Example 2: Parallel agents overwriting each other
+ ### Persona Presets (all under 50 tools — IDE-safe)

- You type: *"I launched 3 Claude Code subagents but they keep overwriting each other's changes"*
+ | Preset | Tools | What it adds | Best for |
+ |---|---|---|---|
+ | `founder` | ~40 | Company tracking, session memory, local dashboard, weekly reset, delegation briefs | Solo founders, CEOs making daily decisions |
+ | `banker` | ~39 | Company profiling, web research, recon, risk assessment | Due diligence, deal evaluation, market analysis |
+ | `operator` | ~40 | Company tracking, causal memory, action tracing, important-change review | COOs, ops leads tracking execution |
+ | `researcher` | ~32 | Web search, recon, session memory | Analysts, research-heavy workflows |

- **Without NodeBench:** Both agents see the same bug and both implement a fix. The third agent re-investigates what agent 1 already solved. Agent 2 hits context limit mid-fix and loses work.
+ ### Task Presets (specialized toolsets)

- **With NodeBench MCP:** Each subagent calls `claim_agent_task` to lock its work. Roles are assigned so they don't overlap. Context budget is tracked. Progress notes ensure handoff without starting from scratch. (Requires `--preset multi_agent` or `--preset full`.)
+ | Preset | Tools | Use case |
+ |---|---|---|
+ | `core` | ~81 | Full verification flywheel — recon, eval, quality gates, knowledge |
+ | `web_dev` | 150 | Web projects — vision, UI capture, SEO, git workflow, PR reports |
+ | `research` | 115 | Research workflows — web search, RSS, LLM, docs |
+ | `data` | 122 | Data analysis — CSV/XLSX/PDF/DOCX/JSON parsing, LLM |
+ | `devops` | 92 | CI/CD — git compliance, benchmarks, pattern mining |
+ | `mobile` | 126 | Mobile apps — vision, flicker detection, UI/UX analysis |
+ | `academic` | 113 | Academic papers — research writing, translation, citation |
+ | `multi_agent` | 136 | Parallel agents — task locks, roles, context budget, self-eval |
+ | `content` | 115 | Content pipelines — LLM, email, RSS, publishing |
+ | `cursor` | 28 | Cursor IDE — fits within Cursor's tool cap |
+ | `full` | 338 | Everything |

- ### Example 3: Knowledge compounding
+ ```bash
+ # Claude Code
+ claude mcp add nodebench -- npx -y nodebench-mcp --preset founder

- Tasks 1-3 start with zero prior knowledge. By task 9, the agent finds 2+ relevant prior findings before writing a single line of code. Bare agents start from zero every time.
+ # Windsurf / Cursor — add --preset to args in your MCP config
+ ```

  ---

@@ -80,13 +77,7 @@ Tasks 1-3 start with zero prior knowledge. By task 9, the agent finds 2+ relevan
  ### Claude Code (CLI)

  ```bash
- # Recommended — AI Flywheel core (50 tools)
  claude mcp add nodebench -- npx -y nodebench-mcp
-
- # Or pick a themed preset for your workflow
- claude mcp add nodebench -- npx -y nodebench-mcp --preset web_dev
- claude mcp add nodebench -- npx -y nodebench-mcp --preset research
- claude mcp add nodebench -- npx -y nodebench-mcp --preset data
  ```

  Or add to `~/.claude/settings.json` or `.mcp.json` in your project root:
@@ -96,8 +87,7 @@ Or add to `~/.claude/settings.json` or `.mcp.json` in your project root:
  "mcpServers": {
  "nodebench": {
  "command": "npx",
- "args": ["-y", "nodebench-mcp"],
- "env": {}
+ "args": ["-y", "nodebench-mcp"]
  }
  }
  }
@@ -105,14 +95,14 @@ Or add to `~/.claude/settings.json` or `.mcp.json` in your project root:
  ### Cursor

- Add to `.cursor/mcp.json` in your project root (or open Settings → MCP):
+ Add to `.cursor/mcp.json` (or Settings > MCP). Use the `cursor` preset to stay within Cursor's tool limit:

  ```json
  {
  "mcpServers": {
  "nodebench": {
  "command": "npx",
- "args": ["-y", "nodebench-mcp"]
+ "args": ["-y", "nodebench-mcp", "--preset", "cursor"]
  }
  }
  }
@@ -120,14 +110,14 @@ Add to `.cursor/mcp.json` in your project root (or open Settings → MCP):
  ### Windsurf

- Add to `.windsurf/mcp.json` in your project root (or open Settings → MCP → View raw config):
+ Add to `.windsurf/mcp.json` (or Settings > MCP > View raw config):

  ```json
  {
  "mcpServers": {
  "nodebench": {
  "command": "npx",
- "args": ["-y", "nodebench-mcp"]
+ "args": ["-y", "nodebench-mcp", "--preset", "founder"]
  }
  }
  }
@@ -135,38 +125,28 @@ Add to `.windsurf/mcp.json` in your project root (or open Settings → MCP → V
  ### Other MCP Clients

- Any MCP-compatible client works. The config format is the same — point `command` to `npx` and `args` to `["-y", "nodebench-mcp"]`. Add `"--preset", "<name>"` to the args array for themed presets.
-
- ### Local Mode vs Cloud Mode
-
- | Mode | Command | What it does |
- |------|---------|-------------|
- | **Local** (default) | `npx nodebench-mcp` | Runs entirely on your machine. No account needed. All data stays local. |
- | **Cloud** | `npx nodebench-mcp --cloud` | Syncs to NodeBench dashboard at nodebenchai.com. Knowledge persists across machines. |
+ Any MCP-compatible client works. Point `command` to `npx`, `args` to `["-y", "nodebench-mcp"]`. Add `"--preset", "<name>"` to the args array for presets.

  ### First Prompts to Try

  ```
- # See what's available
- > Use discover_tools("verify my implementation") to find relevant tools
-
- # Page through results
- > Use discover_tools({ query: "verify", limit: 5, offset: 5 }) for page 2
+ # Find tools for your task
+ > Use discover_tools("evaluate this acquisition target") to find relevant tools

- # Expand results via conceptual neighbors
- > Use discover_tools({ query: "deploy changes", expand: 3 }) for broader discovery
+ # Load a toolset
+ > Use load_toolset("deep_sim") to activate decision simulation tools

- # Explore a tool's neighborhood (multi-hop)
- > Use get_tool_quick_ref({ tool_name: "run_recon", depth: 2 }) to see 2-hop graph
+ # Run a decision simulation
+ > Use run_deep_sim_scenario to simulate a business decision with multiple variables

- # Get methodology guidance
- > Use getMethodology("overview") to see all workflows
+ # Generate a decision memo
+ > Use generate_decision_memo to produce a shareable memo from your analysis

- # Before your next task — search for prior knowledge
- > Use search_all_knowledge("what I'm about to work on")
+ # Weekly founder reset
+ > Use founder_weekly_reset to review the week's decisions and outcomes

- # Run the full verification pipeline on a change
- > Use getMethodology("mandatory_flywheel") and follow the 6 steps
+ # Pre-delegation briefing
+ > Use pre_delegation_briefing to prepare context before handing off a task
  ```

  ### Optional: API Keys
@@ -193,706 +173,95 @@ Set these as environment variables, or add them to the `env` block in your MCP c
  }
  ```

- ### Usage Analytics & Smart Presets
-
- NodeBench MCP tracks tool usage locally and can recommend optimal presets based on your project type and usage patterns.
-
- ```bash
- npx nodebench-mcp --smart-preset # Get AI-powered preset recommendation
- npx nodebench-mcp --stats # Show usage statistics (last 30 days)
- npx nodebench-mcp --export-stats # Export usage data to JSON
- npx nodebench-mcp --list-presets # List all available presets
- npx nodebench-mcp --reset-stats # Clear analytics data
- ```
-
- All analytics data is stored locally in `~/.nodebench/analytics.db` and never leaves your machine.
-
  ---
 
- ## Headless Engine API (v2.30.0)
-
- NodeBench now ships a **headless, API-first Agentic Engine** — plug it into any client workflow and sell results, not software seats.
-
- ```bash
- # Start MCP server with engine API on port 6276
- npx nodebench-mcp --engine
-
- # With auth token
- npx nodebench-mcp --engine --engine-secret "your-token"
- # or: ENGINE_SECRET=your-token npx nodebench-mcp --engine
- ```
-
- ### API Endpoints
-
- | Method | Path | Purpose |
- |--------|------|---------|
- | GET | `/` | Engine status, tool count, uptime |
- | GET | `/api/health` | Health check |
- | GET | `/api/tools` | List all available tools |
- | POST | `/api/tools/:name` | Execute a single tool |
- | GET | `/api/workflows` | List all 32 workflow chains |
- | POST | `/api/workflows/:name` | Execute a workflow (with SSE streaming) |
- | POST | `/api/sessions` | Create an isolated session |
- | GET | `/api/sessions/:id` | Session status + call history |
- | GET | `/api/sessions/:id/trace` | Full disclosure trace |
- | GET | `/api/sessions/:id/report` | Conformance report |
- | DELETE | `/api/sessions/:id` | End session |
- | GET | `/api/presets` | List presets with tool counts |
-
- ### Quick Examples
-
- ```bash
- # Execute a single tool
- curl -X POST http://127.0.0.1:6276/api/tools/discover_tools \
- -H "Content-Type: application/json" \
- -d '{"args": {"query": "security audit"}, "preset": "full"}'
-
- # Run a workflow with streaming
- curl -N -X POST http://127.0.0.1:6276/api/workflows/fix_bug \
- -H "Content-Type: application/json" \
- -d '{"preset": "web_dev", "streaming": true}'
-
- # Create a session, execute tools, get conformance report
- SESSION=$(curl -s -X POST http://127.0.0.1:6276/api/sessions \
- -H "Content-Type: application/json" \
- -d '{"preset": "web_dev"}' | jq -r .sessionId)
-
- curl -X POST "http://127.0.0.1:6276/api/tools/run_recon" \
- -H "Content-Type: application/json" \
- -d "{\"args\": {\"focusArea\": \"web\"}, \"sessionId\": \"$SESSION\"}"
-
- curl "http://127.0.0.1:6276/api/sessions/$SESSION/report"
- ```
-
- ### Conformance Reports
-
- Every workflow execution produces a conformance report scoring:
- - **Step completeness** — did all required tools execute?
- - **Quality gate** — did the quality gate pass?
- - **Test layers** — were unit/integration/e2e results logged?
- - **Flywheel** — was the methodology completed?
- - **Learnings** — were findings banked for next time?
-
- Grades: A (90+) / B (75+) / C (60+) / D (40+) / F (<40). Sell these reports as "Zero-bug deployment certificates" or "Automated WebMCP Conformance Reports."
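The grade bands listed above can be sketched as a tiny scoring helper. This is illustrative only — `gradeFor` is a hypothetical name, not part of the nodebench-mcp API:

```typescript
// Map a conformance score (0-100) to the letter grades stated above:
// A (90+) / B (75+) / C (60+) / D (40+) / F (<40).
// Hypothetical helper for illustration, not the package's implementation.
function gradeFor(score: number): "A" | "B" | "C" | "D" | "F" {
  if (score >= 90) return "A";
  if (score >= 75) return "B";
  if (score >= 60) return "C";
  if (score >= 40) return "D";
  return "F";
}

// A conformanceScore of 88 falls in the B band.
console.log(gradeFor(88)); // → "B"
```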
-
- ### SSE Streaming
-
- Workflow execution supports Server-Sent Events for real-time progress:
-
- ```
- event: start
- data: {"workflow":"fix_bug","totalSteps":7,"sessionId":"eng_..."}
-
- event: step
- data: {"stepIndex":0,"tool":"search_all_knowledge","status":"running"}
-
- event: step
- data: {"stepIndex":0,"tool":"search_all_knowledge","status":"complete","durationMs":42}
+ ## Progressive Discovery — How 338 Tools Fit in Any Context Window

- event: complete
- data: {"totalSteps":7,"totalDurationMs":340,"conformanceScore":88,"grade":"B"}
- ```
-
- ---
+ The starter preset loads 15 tools. The other 323 are discoverable and loadable on demand.

- ## What You Get — The AI Flywheel
-
- The default setup (no `--preset` flag) gives you **50 tools** that implement the complete [AI Flywheel](https://github.com/HomenShum/nodebench-ai/blob/main/AI_FLYWHEEL.md) methodology — two interlocking loops that compound quality over time:
+ ### How it works

  ```
- Research → Risk → Implement → Test (3 layers) → Eval → Gate → Learn → Ship
-     ↑                                                          │
-     └──────────── knowledge compounds ─────────────────────────┘
+ 1. discover_tools("your task description") → ranked results from all 338 tools
+ 2. load_toolset("deep_sim") → tools activate in your session
+ 3. Use the tools directly → no proxy, native binding
+ 4. unload_toolset("deep_sim") → free context budget when done
  ```
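The load/use/unload flow above can be mimicked with a toy in-memory registry. Everything below is hypothetical — the real server manages toolset activation for you; the names and shapes here are invented for illustration only:

```typescript
// Toy sketch of progressive tool loading (not the nodebench-mcp implementation).
// Toolsets activate into a session on demand and can be unloaded again.
type Tool = (args: string) => string;

const catalog: Record<string, Record<string, Tool>> = {
  deep_sim: { run_deep_sim_scenario: (s) => `simulated: ${s}` },
};

const active = new Map<string, Tool>();

function loadToolset(name: string): void {
  for (const [tool, fn] of Object.entries(catalog[name] ?? {})) active.set(tool, fn);
}

function unloadToolset(name: string): void {
  for (const tool of Object.keys(catalog[name] ?? {})) active.delete(tool);
}

function callTool(tool: string, args: string): string {
  const fn = active.get(tool);
  if (!fn) throw new Error(`${tool} not loaded — load its toolset first`);
  return fn(args);
}

loadToolset("deep_sim");
console.log(callTool("run_deep_sim_scenario", "acquire vs build")); // → simulated: acquire vs build
unloadToolset("deep_sim"); // frees the context budget again
```

The point of the sketch is the lifecycle: a tool is callable only between load and unload, which is what keeps the active context small.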

- **Inner loop** (per change): 6-phase verification ensures correctness.
- **Outer loop** (over time): Eval-driven development ensures improvement.
-
- ### What's in the Default Preset (50 Tools)
-
- The default preset has 3 layers:
-
- **Layer 1 — Discovery (6 tools):** *"What tool should I use?"*
-
- | Tool | Purpose |
- |---|---|
- | `findTools` | Keyword search across all tools |
- | `getMethodology` | Access methodology guides (20 topics) |
- | `check_mcp_setup` | Diagnostic wizard — checks env vars, API keys, optional deps |
- | `discover_tools` | 14-strategy hybrid search with pagination (`offset`), result expansion (`expand`), and `relatedTools` neighbors |
- | `get_tool_quick_ref` | Quick reference with multi-hop BFS traversal (`depth` 1-3) — discovers tools 2-3 hops away |
- | `get_workflow_chain` | Step-by-step recipes for 28 common workflows |
+ ### Multi-modal search engine

- **Layer 2 — Dynamic Loading (6 tools):** *"Add/remove tools from my session"*
+ `discover_tools` scores tools using 14 parallel strategies:

- | Tool | Purpose |
+ | Strategy | What it does |
  |---|---|
- | `load_toolset` | Add a toolset to the current session on demand |
- | `unload_toolset` | Remove a toolset to recover context budget |
- | `list_available_toolsets` | See all 39 toolsets with tool counts |
- | `call_loaded_tool` | Proxy for clients that don't support dynamic tool updates |
- | `smart_select_tools` | LLM-powered tool selection (sends compact catalog to fast model) |
- | `get_ab_test_report` | Compare static vs dynamic loading performance |
-
- **Layer 3 — AI Flywheel Core Methodology (38 tools):** *"Do the work"*
-
- | Domain | Tools | What You Get |
- |---|---|---|
- | **verification** | 8 | `start_verification_cycle`, `log_gap`, `resolve_gap`, `get_cycle_status`, `triple_verify`, `run_closed_loop`, `compare_cycles`, `list_cycles` |
- | **eval** | 6 | `start_eval_run`, `record_eval_result`, `get_eval_summary`, `compare_eval_runs`, `get_eval_diff`, `list_eval_runs` |
- | **quality_gate** | 4 | `run_quality_gate`, `create_gate_preset`, `get_gate_history`, `list_gate_presets` |
- | **learning** | 4 | `record_learning`, `search_all_knowledge`, `get_knowledge_stats`, `list_recent_learnings` |
- | **flywheel** | 4 | `run_mandatory_flywheel`, `promote_to_eval`, `investigate_blind_spot`, `get_flywheel_status` |
- | **recon** | 7 | `run_recon`, `log_recon_finding`, `assess_risk`, `get_recon_summary`, `list_recon_sessions`, `check_framework_version`, `search_recon_findings` |
- | **security** | 3 | `scan_dependencies`, `analyze_code_security`, `scan_terminal_output` |
- | **boilerplate** | 2 | `scaffold_nodebench_project`, `get_boilerplate_status` |
-
- > **Note:** `skill_update` (4 tools for rule file freshness tracking) is available via `load_toolset("skill_update")` when needed.
-
- ### Core Workflow — Use These Every Session
-
- These are the AI Flywheel tools documented in [AI_FLYWHEEL.md](https://github.com/HomenShum/nodebench-ai/blob/main/AI_FLYWHEEL.md):
-
- | When you... | Use this | Impact |
- |---|---|---|
- | Start any task | `search_all_knowledge` | Find prior findings — avoid repeating past mistakes |
- | Research before coding | `run_recon` + `log_recon_finding` | Structured research with surfaced findings |
- | Assess risk before acting | `assess_risk` | Risk tier determines if action needs confirmation |
- | Track implementation | `start_verification_cycle` + `log_gap` | Issues logged with severity, tracked to resolution |
- | Test thoroughly | `log_test_result` (3 layers) | Static + unit + integration vs running tests once |
- | Guard against regression | `start_eval_run` + `record_eval_result` | Eval cases that protect this fix in the future |
- | Gate before deploy | `run_quality_gate` | Boolean rules enforced — violations block deploy |
- | Bank knowledge | `record_learning` | Persisted findings compound across future sessions |
- | Verify completeness | `run_mandatory_flywheel` | 6-step minimum — catches dead code and intent mismatches |
- | Re-examine for 11/10 | Fresh-eyes review | After completing, re-examine for exceptional quality — a11y, resilience, polish |
-
- ### Mandatory After Any Non-Trivial Change
-
- 1. **Static analysis**: `tsc --noEmit` and linter checks
- 2. **Happy-path test**: Run the changed functionality with valid inputs
- 3. **Failure-path test**: Validate expected error handling + edge cases
- 4. **Gap analysis**: Dead code, unused vars, missing integrations, intent mismatch
- 5. **Fix and re-verify**: Rerun steps 1-3 from scratch after any fix
- 6. **Deploy and document**: Ship + write down what changed and why
- 7. **Re-examine for 11/10**: Re-examine the completed work with fresh eyes. Not "does it work?" but "is this the best it can be?" Check: prefers-reduced-motion, color-blind safety, print stylesheet, error resilience (partial failures, retry with backoff), keyboard efficiency (skip links, Ctrl+K search), skeleton loading, staggered animations, progressive disclosure for large datasets. Fix what you find, then re-examine your fixes.
-
- ---
-
- ## Themed Presets — Choose Your Workflow
+ | Keyword + TF-IDF | Exact matching, rare tags score higher |
+ | Fuzzy (Levenshtein) | Tolerates typos |
+ | Semantic (synonyms) | 30 word families — "check" finds "verify", "validate" |
+ | N-gram + Bigram | Partial words and phrases |
+ | Dense (TF-IDF cosine) | Vector-like ranking |
+ | Embedding (neural) | Agent-as-a-Graph bipartite search |
+ | Execution traces | Co-occurrence mining from usage logs |
+ | Intent pre-filter | Narrow to relevant categories before search |

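As a rough illustration of the keyword and fuzzy rows in that table, here is a toy scorer — not the package's actual 14-strategy ranking code, just the general idea of combining exact hits with Levenshtein-based typo tolerance:

```typescript
// Standard Levenshtein edit distance (dynamic programming).
function levenshtein(a: string, b: string): number {
  const dp: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    dp.push(new Array<number>(b.length + 1).fill(0));
    dp[i][0] = i; // i deletions to reach empty string
  }
  for (let j = 0; j <= b.length; j++) dp[0][j] = j; // j insertions
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
  return dp[a.length][b.length];
}

// Toy hybrid score: 2 points per exact keyword hit, 1 per near-miss (distance <= 1).
function score(query: string, toolName: string): number {
  const words = toolName.split("_");
  let s = 0;
  for (const q of query.toLowerCase().split(/\s+/)) {
    if (words.includes(q)) s += 2; // exact keyword hit
    else if (words.some((w) => levenshtein(q, w) <= 1)) s += 1; // typo-tolerant hit
  }
  return s;
}

console.log(score("discovr tools", "discover_tools")); // → 3 (1 fuzzy + 2 exact)
```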
- The default preset covers the AI Flywheel. For specialized workflows, pick a themed preset that adds domain-specific tools on top:
+ Plus cursor pagination (`offset`/`limit`), result expansion (`expand: N`), and multi-hop BFS traversal (`depth: 1-3`) via `get_tool_quick_ref`.

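The multi-hop traversal can be pictured as a depth-bounded BFS over a tool-relation graph. This is a sketch under assumptions — the graph below is invented for illustration, not the real relation data:

```typescript
// Depth-bounded BFS: collect every tool reachable within `depth` hops.
// The `related` adjacency map here is made up for illustration.
const related: Record<string, string[]> = {
  run_recon: ["log_recon_finding", "assess_risk"],
  assess_risk: ["run_quality_gate"],
  log_recon_finding: [],
  run_quality_gate: [],
};

function neighborsWithin(start: string, depth: number): string[] {
  const seen = new Set<string>([start]);
  let frontier = [start];
  for (let d = 0; d < depth; d++) {
    const next: string[] = [];
    for (const node of frontier)
      for (const n of related[node] ?? [])
        if (!seen.has(n)) { seen.add(n); next.push(n); }
    frontier = next; // expand one hop per iteration
  }
  seen.delete(start);
  return [...seen];
}

// Depth 2 reaches run_quality_gate via assess_risk; depth 1 would not.
console.log(neighborsWithin("run_recon", 2));
```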
- | Preset | Tools | What it adds to the default | Use case |
- |---|---|---|---|
- | **default** ⭐ | **50** | — | Bug fixes, features, refactoring, code review |
- | `web_dev` | 102 | + vision, UI capture, SEO, git workflow, architect, UI/UX dive, MCP bridge, PR reports | Web projects with visual QA |
- | `mobile` | 91 | + vision, UI capture, flicker detection, UI/UX dive, MCP bridge | Mobile apps with screenshot analysis |
- | `academic` | 82 | + research writing, LLM, web, local file parsing | Academic papers and research |
- | `multi_agent` | 79 | + parallel agents, self-eval, session memory, pattern mining, TOON | Multi-agent coordination |
- | `data` | 74 | + local file parsing (CSV/XLSX/PDF/DOCX/JSON), LLM, web | Data analysis and file processing |
- | `content` | 69 | + LLM, critter, email, RSS, platform queue, architect | Content pipelines and publishing |
- | `research` | 67 | + web search, LLM, RSS feeds, email, docs | Research workflows |
- | `devops` | 64 | + git compliance, session memory, benchmarks, pattern mining, PR reports | CI/CD and operations |
- | `full` | 218 | + everything (all 39 toolsets) | Maximum coverage |
-
- ```bash
- # Claude Code
- claude mcp add nodebench -- npx -y nodebench-mcp --preset web_dev
-
- # Windsurf / Cursor — add --preset to args
- {
- "mcpServers": {
- "nodebench": {
- "command": "npx",
- "args": ["-y", "nodebench-mcp", "--preset", "web_dev"]
- }
- }
- }
- ```
-
- ### Let AI Pick Your Preset
-
- ```bash
- npx nodebench-mcp --smart-preset
- ```
-
- Analyzes your project (language, framework, project type) and usage history to recommend the best preset.
-
- ---
+ ### Client compatibility

- ## Impact-Driven Methodology
-
- Every tool call, methodology step, and workflow path must answer: **"What concrete thing did this produce?"**
-
- | Tool / Phase | Concrete Impact |
+ | Client | Dynamic Loading |
  |---|---|
- | `run_recon` + `log_recon_finding` | N findings surfaced before writing code |
- | `assess_risk` | Risk tier assigned - HIGH triggers confirmation before action |
- | `start_verification_cycle` + `log_gap` | N issues detected with severity, all tracked to resolution |
- | `log_test_result` (3 layers) | 3x test coverage vs single-layer; catches integration failures |
- | `start_eval_run` + `record_eval_result` | N regression cases protecting against future breakage |
- | `run_quality_gate` | N gate rules enforced; violations blocked before deploy |
- | `record_learning` + `search_all_knowledge` | Knowledge compounds - later tasks reuse prior findings |
- | `run_mandatory_flywheel` | 6-step minimum verification; catches dead code and intent mismatches |
-
- The comparative benchmark validates this with 9 real production scenarios:
- - 13 issues detected (4 HIGH, 8 MEDIUM, 1 LOW) - bare agent ships all of them
- - 21 recon findings before implementation
- - 26 blind spots prevented
- - Knowledge compounding: 0 hits on task 1 → 2+ hits by task 9
-
- ---
-
- ## Governance Model — What Your Agent Can and Can't Do
-
- NodeBench enforces decision rights so you know exactly what your agent does autonomously vs what requires your approval. This is the "King Mode" layer — you delegate outcomes, not tasks, and the governance model ensures the agent stays within bounds.
-
- ### Autonomous (agent acts without asking)
-
- These actions are safe for the agent to perform without human confirmation:
-
- - Run tests and fix failing assertions
- - Refactor within existing patterns (no new dependencies)
- - Add logging, comments, and documentation
- - Update type definitions to match implementation
- - Fix lint errors and format code
-
- ### Requires Confirmation (agent asks before acting)
-
- These actions trigger a confirmation prompt because they have broader impact:
-
- - Changes to auth, security, or permissions logic
- - Database migrations or schema changes
- - API contract changes (new endpoints, changed signatures)
- - Adding or removing dependencies
- - Deleting code, files, or features
- - Changes to CI/CD configuration
-
- ### Quality Gates (enforced before any deploy)
-
- Every change must pass these gates before the agent can consider the work done:
-
- | Gate | What it checks | Failure behavior |
- |------|---------------|------------------|
- | Static analysis | `tsc --noEmit`, lint passes | Agent must fix before proceeding |
- | Unit tests | All tests pass | Agent must fix or explain why skipped |
- | Integration tests | E2E scenarios pass | Agent must fix or flag as known issue |
- | Verification cycle | No unresolved HIGH gaps | Agent must resolve or escalate |
- | Knowledge banked | Learning recorded for future | Agent must document what it learned |
-
- ### How this works in practice
-
- **With Claude Code:**
- ```
- > "Fix the LinkedIn posting bug"
-
- Agent runs recon → finds 3 related issues
- Agent logs gaps → 2 HIGH, 1 MEDIUM
- Agent fixes all 3 → runs tests → all pass
- Agent hits quality gate → knowledge not banked
- Agent records learning → gate passes
- Agent: "Fixed. 3 issues resolved, knowledge banked."
- ```
-
- **With Cursor Agent:**
- ```
- > "Add rate limiting to the API"
-
- Agent runs risk assessment → HIGH (auth-adjacent)
- Agent: "This touches auth middleware. Confirm?"
- You: "Yes, proceed"
- Agent implements → tests pass → gate passes
- Agent: "Done. Added rate limiting with tests."
- ```
+ | Claude Code, GitHub Copilot | Native — re-fetches tools after `list_changed` |
+ | Windsurf, Cursor, Claude Desktop, Gemini CLI | Via `call_loaded_tool` fallback (always available) |

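The fallback in that table can be sketched as a small routing decision: clients that refresh their tool list after a `list_changed` notification call newly loaded tools directly, while other clients wrap the call in the `call_loaded_tool` proxy. The shapes below are invented for illustration, not the MCP wire format:

```typescript
// Toy dispatcher for the native-vs-proxy fallback described above.
// ToolCall is a hypothetical shape, not an MCP protocol type.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

function routeCall(
  supportsListChanged: boolean,
  tool: string,
  args: Record<string, unknown>,
): ToolCall {
  return supportsListChanged
    ? { name: tool, args } // client re-fetched tools; call directly
    : { name: "call_loaded_tool", args: { tool, args } }; // proxy fallback
}

console.log(routeCall(false, "run_deep_sim_scenario", { scenario: "pricing" }).name); // → call_loaded_tool
```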
508
215
  ---
509
216
 
510
- ## Case Studies
511
-
512
- ### Case Study 1: Bug Fix with Knowledge Compounding
513
-
514
- **Context:** Solo founder using Claude Code to fix a recurring bug in their SaaS.
515
-
516
- **Before NodeBench:**
517
- - Agent fixes the immediate bug
518
- - Runs tests once, passes
519
- - Ships
520
- - 3 days later, related bug appears in production
521
- - Agent re-investigates from scratch
522
-
523
- **With NodeBench:**
524
- - Agent runs `run_recon` → finds 2 related issues
525
- - Agent runs `log_gap` → tracks all 3 issues
526
- - Agent fixes all 3 → runs 3-layer tests
527
- - Agent runs `run_quality_gate` → passes
528
- - Agent runs `record_learning` → banks the pattern
529
- - Next similar bug: agent finds the prior learning in `search_all_knowledge` and fixes in half the time
217
+ ## Key Features
530
218
 
531
- **Result:** Time to fix similar bugs decreased 50% over 30 days.
219
+ ### Decision Intelligence (Deep Sim)
532
220
 
533
- ### Case Study 2: Parallel Agents Without Conflicts
221
+ Simulate decisions before committing. Run scenarios with multiple variables, score trajectories, generate postmortems, produce decision memos.
534
222
 
535
- **Context:** Developer spawns 3 Claude Code subagents to fix different bugs in the same codebase.
223
+ ### Causal Memory
536
224
 
537
- **Before NodeBench:**
538
- - Agent 1 and Agent 2 both see the same bug
539
- - Both implement a fix
540
- - Agent 2's fix overwrites Agent 1's fix
541
- - Agent 3 re-investigates what Agent 1 already solved
542
- - Agent 2 hits context limit mid-fix, loses work
225
+ Track actions, paths, and state across sessions. Important-change review surfaces what shifted since your last session.
543
226
 
544
- **With NodeBench:**
545
- - Each agent calls `claim_agent_task` → locks its work
546
- - Roles assigned via `assign_agent_role` → no overlap
547
- - Context budget tracked via `log_context_budget`
548
- - Progress notes shared via `release_agent_task`
549
- - All 3 bugs fixed without conflict
227
+ ### Artifact Packets
550
228
 
- **Result:** Parallel agent success rate increased from 60% to 95%.
+ Every analysis produces a shareable artifact: decision memos, delegation briefs, investigation reports. The output is the distribution.
 
- ### Case Study 3: Security-Sensitive Change
+ ### Founder Tools
 
- **Context:** Small team using Cursor Agent to add a new API endpoint.
+ Weekly reset, pre-delegation briefing, company tracking, important-change review. Built for the founder who needs to make 20 decisions a day with incomplete information.
 
- **Before NodeBench:**
- - Agent implements the endpoint
- - Tests pass
- - Ships
- - 2 weeks later, security audit finds auth bypass
+ ### Knowledge Compounding
 
- **With NodeBench:**
- - Agent runs `assess_risk` → HIGH (auth-adjacent)
- - Agent prompts for confirmation before proceeding
- - Human reviews the planned changes
- - Security issue caught before code is written
- - Agent implements with security constraints
-
- **Result:** Security-related incidents from AI code reduced to zero.
+ `record_learning` + `search_all_knowledge` — findings persist across sessions. By session 9, the agent finds 2+ relevant prior findings before writing a single line of code.
 
 ---
 
- ## Progressive Discovery
-
- ### Multi-modal search engine
-
- ```
- > discover_tools("verify my implementation")
- ```
-
- The `discover_tools` search engine scores tools using **14 parallel strategies** (including Agent-as-a-Graph bipartite embedding search):
-
- | Strategy | What it does | Example |
- |---|---|---|
- | Keyword | Exact/partial word matching on name, tags, description | "benchmark" → `benchmark_models` |
- | Fuzzy | Levenshtein distance — tolerates typos | "verifiy" → `start_verification_cycle` |
- | N-gram | Trigram similarity for partial words | "screen" → `capture_ui_screenshot` |
- | Prefix | Matches tool name starts | "cap" → `capture_*` tools |
- | Semantic | Synonym expansion (30 word families) | "check" also finds "verify", "validate" |
- | TF-IDF | Rare tags score higher than common ones | "c-compiler" scores higher than "test" |
- | Regex | Pattern matching | `"^run_.*loop$"` → `run_closed_loop` |
- | Bigram | Phrase matching | "quality gate" matched as unit |
- | Domain boost | Related categories boosted together | verification + quality_gate cluster |
- | Dense | TF-IDF cosine similarity for vector-like ranking | "audit compliance" surfaces related tools |
-
- **7 search modes**: `hybrid` (default, all strategies), `fuzzy`, `regex`, `prefix`, `semantic`, `exact`, `dense`
-
- Pass `explain: true` to see exactly which strategies contributed to each score.
-
- ### Cursor pagination
-
- Page through large result sets with `offset` and `limit`:
-
- ```
- > discover_tools({ query: "verify", limit: 5 })
- # Returns: { results: [...5 tools], totalMatches: 76, hasMore: true, offset: 0 }
-
- > discover_tools({ query: "verify", limit: 5, offset: 5 })
- # Returns: { results: [...next 5 tools], totalMatches: 76, hasMore: true, offset: 5 }
- ```
-
- `totalMatches` is stable across pages. `hasMore` tells you whether another page exists.
-
- ### Result expansion via relatedTools
-
- Broaden results by following conceptual neighbors:
-
- ```
- > discover_tools({ query: "deploy and ship changes", expand: 3 })
- # Top 3 results' relatedTools neighbors are added at 50% parent score
- # "deploy" finds git_workflow tools → expansion adds quality_gate, flywheel tools
- # Expanded results include depth: 1 and expandedFrom fields
- ```
-
- Dogfood A/B results: 5/8 queries gained recall lift (+2 to +8 new tools per query). "deploy and ship changes" went from 82 → 90 matches.
-
- ### Quick refs — what to do next (with multi-hop)
-
- Every tool response auto-appends a `_quickRef` with:
- - **nextAction**: What to do immediately after this tool
- - **nextTools**: Recommended follow-up tools (workflow-sequential)
- - **relatedTools**: Conceptually adjacent tools (same domain, shared tags — 949 connections across 218 tools)
- - **methodology**: Which methodology guide to consult
- - **tip**: Practical usage advice
-
- Call `get_tool_quick_ref("tool_name")` for any tool's guidance — or use **multi-hop BFS traversal** to discover tools 2-3 hops away:
-
- ```
- > get_tool_quick_ref({ tool_name: "start_verification_cycle", depth: 1 })
- # Returns: direct neighbors via nextTools + relatedTools (hopDistance: 1)
-
- > get_tool_quick_ref({ tool_name: "start_verification_cycle", depth: 2 })
- # Returns: direct neighbors + their neighbors (hopDistance: 1 and 2)
- # Discovers 34 additional tools reachable in 2 hops
-
- > get_tool_quick_ref({ tool_name: "start_verification_cycle", depth: 3 })
- # Returns: 3-hop BFS traversal — full neighborhood graph
- ```
-
- Each discovered tool includes `hopDistance` (1-3) and `reachedVia` (which parent tool led to it). BFS prevents cycles — no tool appears at multiple depths.
-
- ### `nextTools` vs `relatedTools`
-
- | | `nextTools` | `relatedTools` |
- |---|---|---|
- | **Meaning** | Workflow-sequential ("do X then Y") | Conceptually adjacent ("if doing X, consider Y") |
- | **Example** | `run_recon` → `log_recon_finding` | `run_recon` → `search_all_knowledge`, `bootstrap_project` |
- | **Total connections** | 498 | 949 (191% amplification) |
- | **Overlap** | — | 0% (all net-new connections) |
- | **Cross-domain** | Mostly same-domain | 90% bridge different domains |
+ ## Headless Engine API
 
- ### Workflow chains — step-by-step recipes
-
- 28 pre-built chains for common workflows:
-
- | Chain | Steps | Use case |
- |---|---|---|
- | `new_feature` | 12 | End-to-end feature development |
- | `fix_bug` | 6 | Structured debugging |
- | `ui_change` | 7 | Frontend with visual verification |
- | `parallel_project` | 7 | Multi-agent coordination |
- | `research_phase` | 8 | Context gathering |
- | `academic_paper` | 7 | Paper writing pipeline |
- | `c_compiler_benchmark` | 10 | Autonomous capability test |
- | `security_audit` | 9 | Comprehensive security assessment |
- | `code_review` | 8 | Structured code review |
- | `deployment` | 8 | Ship with full verification |
- | `migration` | 10 | SDK/framework upgrade |
- | `coordinator_spawn` | 10 | Parallel coordinator setup |
- | `self_setup` | 8 | Agent self-onboarding |
- | `flicker_detection` | 7 | Android flicker analysis |
- | `figma_flow_analysis` | 5 | Figma prototype flow audit |
- | `agent_eval` | 9 | Evaluate agent performance |
- | `contract_compliance` | 5 | Check agent contract adherence |
- | `ablation_eval` | 10 | Ablation experiment design |
- | `session_recovery` | 6 | Recover context after compaction |
- | `attention_refresh` | 4 | Reload bearings mid-session |
- | `task_bank_setup` | 9 | Create evaluation task banks |
- | `pr_review` | 5 | Pull request review |
- | `seo_audit` | 6 | Full SEO audit |
- | `voice_pipeline` | 6 | Voice pipeline implementation |
- | `intentionality_check` | 4 | Verify agent intent before action |
- | `research_digest` | 6 | Summarize research across sessions |
- | `email_assistant` | 5 | Email triage and response |
- | `pr_creation` | 6 | Visual PR creation from UI Dive sessions |
-
- Call `get_workflow_chain("new_feature")` to get the step-by-step sequence.
-
- ### Boilerplate template
-
- Start new projects with everything pre-configured:
-
- ```bash
- gh repo create my-project --template HomenShum/nodebench-boilerplate --clone
- cd my-project && npm install
- ```
-
- Or use the scaffold tool: `scaffold_nodebench_project` creates AGENTS.md, .mcp.json, package.json, CI, Docker, and parallel agent infra.
-
- ---
-
- ## Scaling MCP: How We Solved the 5 Biggest Industry Problems
-
- MCP tool servers face 5 systemic problems documented across Anthropic, Microsoft Research, and the open-source community. We researched each one, built solutions, and tested them with automated eval harnesses.
-
- ---
-
- ### Problem 1: Context Bloat (too many tool definitions eat the context window)
-
- **The research**: Anthropic measured that 58 tools from 5 MCP servers consume **~55K tokens** before the conversation starts. At 218 tools, NodeBench would consume ~109K tokens — over half a 200K context window just on tool metadata. [Microsoft Research](https://www.microsoft.com/en-us/research/blog/tool-space-interference-in-the-mcp-era-designing-for-agent-compatibility-at-scale/) found LLMs "decline to act at all when faced with ambiguous or excessive tool options." [Cursor enforces a ~40-tool hard cap](https://www.lunar.dev/post/why-is-there-mcp-tool-overload-and-how-to-solve-it-for-your-ai-agents) for this reason.
-
- **Our solutions** (layered, each independent):
-
- | Layer | What it does | Token savings | Requires |
- |---|---|---|---|
- | Themed presets (`--preset web_dev`) | Load only relevant toolsets (54-106 tools vs 218) | **50-75%** | Nothing |
- | TOON encoding (on by default) | Encode all tool responses in token-optimized format | **~40%** on responses | Nothing |
- | `discover_tools({ compact: true })` | Return `{ name, category, hint }` only | **~60%** on search results | Nothing |
- | `instructions` field (Claude Code) | Claude Code defers tool loading, searches on demand | **~85%** | Claude Code client |
- | `smart_select_tools` (LLM-powered) | Fast model picks 8 best tools from compact catalog | **~95%** | Any API key |
-
- **How we tested**: The A/B harness (`scripts/ab-test-harness.ts`) measures tool counts, token overhead, and success rates across 28 scenarios in both static and dynamic modes. TOON savings validated by comparing JSON vs TOON serialized sizes across all tool responses.
-
- ---
-
- ### Problem 2: Tool Selection Degradation (LLMs pick the wrong tool as count increases)
-
- **The research**: [Anthropic's Tool Search Tool](https://www.anthropic.com/engineering/advanced-tool-use) improved accuracy from **49% → 74%** (Opus 4) and **79.5% → 88.1%** (Opus 4.5) by switching from all-tools-upfront to on-demand discovery. The [Dynamic ReAct paper (arxiv 2509.20386)](https://arxiv.org/html/2509.20386v1) tested 5 architectures and found **Search + Load** wins — flat search + deliberate loading beats hierarchical app→tool search.
-
- **Our solution**: `discover_tools` — a 14-strategy hybrid search engine that finds the right tool from 218 candidates, with **cursor pagination**, **result expansion**, and **multi-hop traversal**:
-
- | Strategy | What it does | Example |
- |---|---|---|
- | Keyword + TF-IDF | Rare tags score higher than common ones | "c-compiler" scores higher than "test" |
- | Fuzzy (Levenshtein) | Tolerates typos | "verifiy" → `start_verification_cycle` |
- | Semantic (synonyms) | Expands 30 word families | "check" also finds "verify", "validate" |
- | N-gram + Bigram | Partial words and phrases | "screen" → `capture_ui_screenshot` |
- | Dense (TF-IDF cosine) | Vector-like ranking | "audit compliance" surfaces related tools |
- | Embedding (neural) | Agent-as-a-Graph bipartite RRF | Based on [arxiv 2511.01854](https://arxiv.org/html/2511.01854v1) |
- | Execution traces | Co-occurrence mining from `tool_call_log` (direct + transitive A→B→C) | Tools frequently used together boost each other |
- | Intent pre-filter | Narrow to relevant categories before search | `intent: "data_analysis"` → only local_file, llm, benchmark |
- | **Pagination** | `offset` + `limit` with stable `totalMatches` and `hasMore` | Page through 76+ results 5 at a time |
- | **Expansion** | Top N results' `relatedTools` neighbors added at 50% parent score | `expand: 3` adds 2-8 new tools per query |
- | **Multi-hop BFS** | `get_tool_quick_ref` depth 1-3 with `hopDistance` + `reachedVia` | depth=2 discovers 24-40 additional tools |
-
- Plus `smart_select_tools` for ambiguous queries — sends the catalog to Gemini 3 Flash / GPT-5-mini / Claude Haiku 4.5 for LLM-powered reranking.
-
- **How we tested**: 28 scenarios with expected-toolset ground truth. The harness checks if `_loadSuggestions` points to the correct toolset for each domain query.
-
- | What we measured | Result |
- |---|---|
- | Discovery accuracy | **18/18 (100%)** — correct toolset suggested for every domain |
- | Domains covered | File I/O, email, GitHub, academic writing, SEO, git, Figma, CI/CD, browser automation, database, security, LLM, monitoring |
- | Natural language queries | "I need to look at what's in this zip file" → `local_file` ✓ |
- | Zero-match graceful degradation | "deploy Kubernetes pods" → closest tools, no errors ✓ |
-
- ---
-
- ### Problem 3: Static Loading (all tools loaded upfront, even if unused)
-
- **The research**: The Dynamic ReAct paper found that **Search + Load with 2 meta tools** beats all other architectures. Hierarchical search (search apps → search tools → load) adds overhead without improving accuracy. [ToolScope (arxiv 2510.20036)](https://arxiv.org/html/2510.20036) showed **+34.6%** tool selection accuracy with hybrid retrieval + tool deduplication.
-
- **Our solution**: `--dynamic` flag enables Search + Load:
+ NodeBench ships a headless, API-first engine for programmatic access.
 
 ```bash
- npx nodebench-mcp --dynamic
- ```
-
- ```
- > discover_tools("analyze screenshot for UI bugs")
- # _loadSuggestions: [{ toolset: "vision", action: "load_toolset('vision')" }]
-
- > load_toolset("vision")
- # 4 vision tools now directly bound (not indirected through a proxy)
-
- > unload_toolset("vision")
- # Tools removed, token budget recovered
- ```
-
- Key design decisions from the research:
- - **No hierarchical search** — Dynamic ReAct Section 3.4: "search_apps introduces an additional call without significantly improving accuracy"
- - **Direct tool binding** — Dynamic ReAct Section 3.5: LLMs perform best with directly bound tools; `call_tool` indirection degrades in long conversations
- - **Full-registry search** — `discover_tools` searches all 218 tools even with 54 loaded, so it can suggest what to load
-
- **How we tested**: Automated A/B harness + live IDE session.
-
- | What we measured | Result |
- |---|---|
- | Scenarios tested | **28** aligned to [real MCP usage data](https://towardsdatascience.com/mcp-in-practice/) — Web/Browser (24.8%), SWE (24.7%), DB/Search (23.1%), File Ops, Comms, Design, Security, AI, Monitoring |
- | Success rate | **100%** across 128 tool calls per round (both modes) |
- | Load latency | **<1ms** per `load_toolset` call |
- | Long sessions | 6 loads + 2 unloads in a single session — correct tool count at every step |
- | Burst performance | 6 consecutive calls averaging **1ms** each |
- | Live agent test | Verified in real Windsurf session: load, double-load (idempotent), unload, unload-protection |
- | Unit tests | **266 passing** (24 dedicated to dynamic loading) |
- | Bugs found during testing | 5 (all fixed) — most critical: search results only showed loaded tools, not full registry |
-
- ---
-
- ### Problem 4: Client Fragmentation (not all clients handle dynamic tool updates)
-
- **The research**: The MCP spec defines `notifications/tools/list_changed` for servers to tell clients to re-fetch the tool list. But [Cursor hasn't implemented it](https://forum.cursor.com/t/enhance-mcp-integration-in-cursor-dynamic-tool-updates-roots-support-progress-tokens-streamable-http/99903), [Claude Desktop didn't support it](https://github.com/orgs/modelcontextprotocol/discussions/76) (as of Dec 2024), and [Gemini CLI has an open issue](https://github.com/google-gemini/gemini-cli/issues/13850).
-
- **Our solution**: Two-tier compatibility — native `list_changed` for clients that support it, plus a `call_loaded_tool` proxy fallback for those that don't.
-
- | Client | Dynamic Loading | How |
- |---|---|---|
- | **Claude Code** | ✅ Native | Re-fetches tools automatically after `list_changed` |
- | **GitHub Copilot** | ✅ Native | Same |
- | **Windsurf / Cursor / Claude Desktop / Gemini CLI / LibreChat** | ✅ Via fallback | `call_loaded_tool` proxy (always in tool list) |
-
- ```
- > load_toolset("vision")
- # Response includes: toolNames: ["analyze_screenshot", "manipulate_screenshot", ...]
-
- > call_loaded_tool({ tool: "analyze_screenshot", args: { imagePath: "page.png" } })
- # Dispatches internally — works on ALL clients
- ```
-
- **How we tested**: Server-side verification in the A/B harness proves correct `tools/list` updates:
-
- ```
- tools/list BEFORE: 95 tools
- load_toolset("voice_bridge")
- tools/list AFTER: 99 tools (+4) ← new tools visible
- call_loaded_tool proxy: ✓ OK ← fallback dispatch works
- unload_toolset("voice_bridge")
- tools/list AFTER UNLOAD: 95 tools (-4) ← tools removed
- ```
-
- ---
-
- ### Problem 5: Aggressive Filtering (over-filtering means the right tool isn't found)
-
- **The research**: This is the flip side of Problem 1. If you reduce context aggressively (e.g., keyword-only search), ambiguous queries like "call an AI model" fail to match the `llm` toolset because every tool mentions "AI" in its description. [SynapticLabs' Bounded Context Packs](https://blog.synapticlabs.ai/bounded-context-packs-tool-bloat-tipping-point) addresses this with progressive disclosure. [SEP-1576](https://github.com/modelcontextprotocol/modelcontextprotocol/issues/1576) proposes adaptive granularity at the protocol level.
-
- **Our solutions** (3 tiers, progressively smarter):
-
- **Tier 1 — Intent pre-filter (no API key):**
- ```
- > discover_tools({ query: "parse a CSV file", intent: "data_analysis" })
- # Narrows search to: local_file, llm, benchmark categories only
- # 15 intents: file_processing, web_research, code_quality, security_audit,
- # academic_writing, data_analysis, llm_interaction, visual_qa, devops_ci,
- # team_coordination, communication, seo_audit, design_review, voice_ui, project_setup
- ```
+ # Start MCP server with engine API on port 6276
+ npx nodebench-mcp --engine
 
- **Tier 2 — LLM-powered selection (API key):**
- ```
- > smart_select_tools({ task: "parse a PDF, extract tables, email a summary" })
- # Sends compact catalog (~4K tokens: name + category + 5 tags per tool) to
- # Gemini 3 Flash / GPT-5-mini / Claude Haiku 4.5
- # Returns the 8 best tools + _loadSuggestions for unloaded toolsets
- # Falls back to heuristic search if no API key is set
+ # With auth token
+ npx nodebench-mcp --engine --engine-secret "your-token"
 ```
 
- **Tier 3 — Embedding search (optional):**
- Neural bipartite graph search (tool nodes + domain nodes) based on [Agent-as-a-Graph (arxiv 2511.18194)](https://arxiv.org/html/2511.18194). Enable with `--embedding` or set `OPENAI_API_KEY` / `GEMINI_API_KEY`.
-
- **How we tested**: The `llm_model_interaction` scenario in the A/B harness specifically tests this — the query "call LLM generate prompt GPT Claude Gemini" must surface the `llm` toolset in `_loadSuggestions`. A tag coverage bonus in hybrid search ensures tools where many query words match tags rank highest. For even more ambiguous queries, `smart_select_tools` lets an LLM pick the right tools semantically.
-
- ---
-
- ### Summary: research → solution → eval for each problem
-
- | Problem | Research Source | Our Solution | Eval Method | Result |
- |---|---|---|---|---|
- | **Context bloat** (107K tokens) | Anthropic (85% reduction), Lunar.dev (~40-tool cap), SEP-1576 | Presets, TOON, compact mode, `instructions`, `smart_select_tools` | A/B harness token measurement | 50-95% reduction depending on layer |
- | **Selection degradation** | Anthropic (+25pp), Dynamic ReAct (Search+Load wins) | 14-strategy hybrid search, intent pre-filter, LLM reranking | 28-scenario discovery accuracy | **100% accuracy** (18/18 domains) |
- | **Static loading** | Dynamic ReAct, ToolScope (+34.6%), MCP spec | `--dynamic` flag, `load_toolset` / `unload_toolset` | A/B harness + live IDE test | **100% success**, <1ms load latency |
- | **Client fragmentation** | MCP discussions, client bug trackers | `list_changed` + `call_loaded_tool` proxy | Server-side `tools/list` verification | Works on **all clients** |
- | **Aggressive filtering** | SynapticLabs, SEP-1576, our own `llm` gap | Intent pre-filter, `smart_select_tools`, embeddings | `llm_model_interaction` scenario | LLM-powered selection solves the gap |
-
- **Ablation study** (`scripts/ablation-test.ts`): We tested which strategies matter for each user segment by disabling them one at a time across 54 queries:
-
- | Segment | R@5 Baseline | Most Critical Strategy | Impact When Removed |
- |---|---|---|---|
- | **New user** (vague, natural language) | 67% | Synonym expansion | -17pp R@5 |
- | **Experienced** (domain keywords) | 72% | All robust | No single strategy >5pp |
- | **Power user** (exact tool names) | 100% | None needed | Keyword alone = 100% |
-
- Key insight: new users need synonym expansion ("website" → seo, "AI" → llm) and fuzzy matching (typo tolerance). Power users need nothing beyond keyword matching. The remaining 33% new user gap is filled by `smart_select_tools` (LLM-powered).
-
- Full methodology, per-scenario breakdown, ablation data, and research citations: [DYNAMIC_LOADING.md](./DYNAMIC_LOADING.md)
+ | Method | Path | Purpose |
+ |--------|------|---------|
+ | GET | `/` | Engine status, tool count, uptime |
+ | GET | `/api/health` | Health check |
+ | GET | `/api/tools` | List all available tools |
+ | POST | `/api/tools/:name` | Execute a single tool |
+ | GET | `/api/workflows` | List workflow chains |
+ | POST | `/api/workflows/:name` | Execute a workflow (SSE streaming) |
+ | POST | `/api/sessions` | Create an isolated session |
+ | GET | `/api/sessions/:id` | Session status + call history |
+ | GET | `/api/sessions/:id/report` | Conformance report |
+ | GET | `/api/presets` | List presets with tool counts |
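A minimal client sketch for the endpoint table above, from Node 18+ (global `fetch`). It assumes the engine listens on localhost:6276, that `POST /api/tools/:name` accepts a JSON args body, and that a server started with `--engine-secret` expects a bearer token; the header name, args shape, and response schema are assumptions, not documented behavior.

```javascript
// Sketch: build a request for POST /api/tools/:name on a local engine.
// Assumption: bearer-token auth and JSON bodies; verify against the server's docs.
const BASE = "http://localhost:6276";

function buildToolRequest(name, args, secret) {
  // Encode the tool name so any unusual characters stay URL-safe.
  const url = `${BASE}/api/tools/${encodeURIComponent(name)}`;
  return {
    url,
    options: {
      method: "POST",
      headers: {
        "content-type": "application/json",
        ...(secret ? { authorization: `Bearer ${secret}` } : {}),
      },
      body: JSON.stringify(args),
    },
  };
}

const req = buildToolRequest("discover_tools", { query: "verify" }, "your-token");
console.log(req.url); // http://localhost:6276/api/tools/discover_tools
// To actually execute against a running engine:
// const res = await fetch(req.url, req.options);
```

Only the URL and request shape are constructed here; swap in `fetch(req.url, req.options)` against a running engine to execute.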
 
 ---
 
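The table above marks `POST /api/workflows/:name` as SSE streaming. A minimal parser sketch for such a stream, assuming standard `text/event-stream` framing (`data: <json>` lines separated by blank lines); the payload fields used below (`step`, `tool`) are illustrative, not documented:

```javascript
// Sketch: extract JSON payloads from a chunk of an SSE stream.
// Assumption: each event is a "data: <json>" line and events end with a blank line.
function parseSseChunk(chunk) {
  const events = [];
  for (const block of chunk.split("\n\n")) {
    for (const line of block.split("\n")) {
      if (line.startsWith("data: ")) {
        // Strip the 6-character "data: " prefix and parse the JSON payload.
        events.push(JSON.parse(line.slice(6)));
      }
    }
  }
  return events;
}

const sample = 'data: {"step":1,"tool":"run_recon"}\n\ndata: {"step":2,"tool":"run_quality_gate"}\n\n';
console.log(parseSseChunk(sample).length); // 2
```

In practice you would feed decoded chunks from the workflow response body into this parser as they arrive.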
@@ -900,190 +269,42 @@ Full methodology, per-scenario breakdown, ablation data, and research citations:
900
269
 
901
270
  ```bash
902
271
  # Include only specific toolsets
903
- npx nodebench-mcp --toolsets verification,eval,recon
272
+ npx nodebench-mcp --toolsets deep_sim,recon,learning
904
273
 
905
- # Exclude heavy optional-dep toolsets
274
+ # Exclude heavy toolsets
906
275
  npx nodebench-mcp --exclude vision,ui_capture,parallel
907
276
 
908
- # Dynamic loading — start with 12 tools, load on demand
277
+ # Dynamic loading — start minimal, load on demand
909
278
  npx nodebench-mcp --dynamic
910
279
 
911
- # See all toolsets and presets
912
- npx nodebench-mcp --help
913
- ```
914
-
915
- ### All 39 Toolsets
916
-
917
- | Toolset | Tools | What it covers | In `default` |
918
- |---|---|---|---|
919
- | verification | 8 | Cycles, gaps, triple-verify, status | ✅ |
920
- | eval | 6 | Eval runs, results, comparison, diff | ✅ |
921
- | quality_gate | 4 | Gates, presets, history | ✅ |
922
- | learning | 4 | Knowledge, search, record | ✅ |
923
- | flywheel | 4 | Mandatory flywheel, promote, investigate | ✅ |
924
- | recon | 7 | Research, findings, framework checks, risk | ✅ |
925
- | security | 3 | Dependency scanning, code analysis, terminal security scanning | ✅ |
926
- | boilerplate | 2 | Scaffold NodeBench projects + status | ✅ |
927
- | skill_update | 4 | Skill tracking, freshness checks, sync | ✅ |
928
- | **Subtotal** | **42** | **AI Flywheel core** | |
929
- | bootstrap | 11 | Project setup, agents.md, self-implement, autonomous, test runner | — |
930
- | self_eval | 9 | Trajectory analysis, health reports, task banks, grading, contract compliance | — |
931
- | parallel | 13 | Task locks, roles, context budget, oracle, agent mailbox | — |
932
- | vision | 4 | Screenshot analysis, UI capture, diff | — |
933
- | ui_capture | 2 | Playwright-based capture | — |
934
- | web | 2 | Web search, URL fetch | — |
935
- | github | 3 | Repo search, analysis, monitoring | — |
936
- | docs | 4 | Documentation generation, reports | — |
937
- | local_file | 19 | Deterministic parsing (CSV/XLSX/PDF/DOCX/PPTX/ZIP/JSON/JSONL/TXT/OCR/audio) | — |
938
- | llm | 3 | LLM calling, extraction, benchmarking | — |
939
- | platform | 4 | Convex bridge: briefs, funding, research, publish | — |
940
- | research_writing | 8 | Academic paper polishing, translation, de-AI, logic check, captions | — |
941
- | flicker_detection | 5 | Android flicker detection + SSIM tooling | — |
942
- | figma_flow | 4 | Figma flow analysis + rendering | — |
943
- | benchmark | 3 | Autonomous benchmark lifecycle | — |
944
- | session_memory | 3 | Compaction-resilient notes, attention refresh, context reload | — |
945
- | gaia_solvers | 6 | GAIA media image solvers | — |
946
- | toon | 2 | TOON encode/decode (~40% token savings) | — |
947
- | pattern | 2 | Session pattern mining + risk prediction | — |
948
- | git_workflow | 3 | Branch compliance, PR checklist review, merge gate | — |
949
- | seo | 5 | Technical SEO audit, page performance, content analysis | — |
950
- | voice_bridge | 4 | Voice pipeline design, config analysis, scaffold | — |
951
- | critter | 1 | Accountability checkpoint with calibrated scoring | — |
952
- | email | 4 | SMTP/IMAP email ingestion, search, delivery | — |
953
- | rss | 4 | RSS feed parsing and monitoring | — |
954
- | architect | 3 | Structural analysis, concept verification, implementation planning | — |
955
- | ui_ux_dive | 11 | UI/UX deep analysis sessions, component reviews, flow audits | — |
956
- | mcp_bridge | 5 | Connect external MCP servers, proxy tool calls, manage sessions | — |
957
- | ui_ux_dive_v2 | 14 | Advanced UI/UX analysis with preflight, scoring, heuristic evaluation | — |
958
- | pr_report | 3 | Visual PR creation with screenshot comparisons, timelines, past session links | — |
959
-
960
- **Always included** — these 12 tools are available regardless of preset:
961
- - **Meta/discovery (6):** `findTools`, `getMethodology`, `check_mcp_setup`, `discover_tools`, `get_tool_quick_ref`, `get_workflow_chain`
962
- - **Dynamic loading (6):** `load_toolset`, `unload_toolset`, `list_available_toolsets`, `call_loaded_tool`, `smart_select_tools`, `get_ab_test_report`
963
-
964
- ### TOON Format — Token Savings
965
-
966
- TOON (Token-Oriented Object Notation) is **on by default** for all presets. Every tool response is TOON-encoded for ~40% fewer tokens vs JSON. Disable with `--no-toon` if your client can't handle non-JSON responses.
967
-
968
- ```bash
969
- # TOON on (default, all presets)
970
- claude mcp add nodebench -- npx -y nodebench-mcp
971
-
972
- # TOON off
973
- claude mcp add nodebench -- npx -y nodebench-mcp --no-toon
974
- ```
975
-
976
- Use the `toon_encode` and `toon_decode` tools (in the `toon` toolset) to convert between TOON and JSON in your own workflows.
977
-
978
- ---
979
-
980
- ## The AI Flywheel — Complete Methodology
280
+ # Smart preset recommendation based on your project
281
+ npx nodebench-mcp --smart-preset
981
282
 
982
- The AI Flywheel is documented in detail in [AI_FLYWHEEL.md](https://github.com/HomenShum/nodebench-ai/blob/main/AI_FLYWHEEL.md).
283
+ # Usage stats
284
+ npx nodebench-mcp --stats
983
285
 
984
- ### Two Loops That Compound
286
+ # List all presets
287
+ npx nodebench-mcp --list-presets
985
288
 
289
+ # See all options
290
+ npx nodebench-mcp --help
986
291
  ```
987
- ┌─────────────────────────────────────────────────────────────────┐
988
- │ OUTER LOOP: Eval-Driven Development │
989
- │ │
990
- │ Eval Batch ──→ Telemetry ──→ LLM Judge ──→ Suggestions │
991
- │ │ │ │
992
- │ │ ┌───────────────────────────┐ │ │
993
- │ │ │ INNER LOOP: 6-Phase │ │ │
994
- │ │ │ │ │ │
995
- │ ▼ │ P1 Context Gather │ │ │
996
- │ Regression │ P2 Gap Analysis ◄─────┼────┘ │
997
- │ detected or │ P3 Implementation │ Judge suggestions │
998
- │ new intent │ P4 Test & Validate ─────┼──► feeds back as │
999
- │ added │ P5 Self-Closed Verify │ new eval cases │
1000
- │ │ │ P6 Document Learnings ──┼──► updates edge │
1001
- │ │ │ │ case registry │
1002
- │ ▼ └───────────────────────────┘ │
1003
- │ Re-run Eval Batch ──→ Score improved? ──→ Deploy │
1004
- │ │ │
1005
- │ NO → revert, try different approach │
1006
- └─────────────────────────────────────────────────────────────────┘
1007
- ```
1008
-
1009
- ### Inner Loop → Outer Loop (Verification feeds Evals)
1010
-
1011
- | 6-Phase output | Feeds into Eval Loop as |
1012
- |---|---|
1013
- | Phase 4 test cases (static, unit, integration, E2E) | New eval batch test cases with known-good expected outputs |
1014
- | Phase 5 subagent PASS/FAIL checklists | Eval scoring rubrics — each checklist item becomes a boolean eval criterion |
- | Phase 6 edge cases & learnings | New adversarial eval cases targeting discovered failure modes |
-
- ### Outer Loop → Inner Loop (Evals trigger Verification)
-
- | Eval Loop output | Triggers 6-Phase as |
- |---|---|
- | Judge finds tool calling inefficiency | Phase 2 gap analysis scoped to that tool's implementation |
- | Eval scores regress after deploy | Full Phase 1-6 cycle on the regression — treat as a production incident |
- | Judge suggests new tool or prompt change | Phase 3 implementation following existing patterns, validated through Phase 4-5 |
- | Recurring failure pattern across batch | Phase 1 deep dive into root cause (maybe upstream API changed, maybe schema drifted) |
-
- ### When to Use Which
-
- - **Building or changing a feature** → Run the 6-Phase inner loop. You're asking: *"Is this implementation correct?"*
- - **Measuring system quality over time** → Run the Eval outer loop. You're asking: *"Is the system getting better?"*
- - **Both, always** → Every 6-Phase run produces artifacts (test cases, edge cases, checklists) that expand the eval suite. Every eval regression triggers a 6-Phase investigation. They are not optional alternatives — they compound.
-
- ---
-
- ## Parallel Agents with Claude Code
-
- Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www.anthropic.com/engineering/building-c-compiler) (Feb 2026).
-
- **When to use:** Only when running 2+ agent sessions. Single-agent workflows use the standard pipeline above.
-
- **How it works with Claude Code's Task tool:**
-
- 1. **COORDINATOR** (your main session) breaks work into independent tasks
- 2. Each **Task tool** call spawns a subagent with instructions to:
-    - `claim_agent_task` — lock the task
-    - `assign_agent_role` — specialize (implementer, test_writer, critic, etc.)
-    - Do the work
-    - `release_agent_task` — handoff with progress note
- 3. Coordinator calls `get_parallel_status` to monitor all subagents
- 4. Coordinator runs `run_quality_gate` on the aggregate result
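The claim → assign → work → release lifecycle in the steps above can be sketched as a small in-memory simulation. This is an illustrative sketch only: the `TaskBoard` class and its types are hypothetical, not NodeBench's implementation (the real tools named in the steps persist state in `~/.nodebench/` SQLite).

```typescript
// Hypothetical sketch of the task lifecycle a coordinator drives through
// subagents. Names and types are illustrative, not NodeBench internals.
type Role = "implementer" | "test_writer" | "critic";

interface AgentTask {
  id: string;
  claimedBy?: string; // agent id currently holding the lock
  role?: Role;
  done: boolean;
  note?: string;      // handoff note left on release
}

class TaskBoard {
  private tasks = new Map<string, AgentTask>();

  add(id: string): void {
    this.tasks.set(id, { id, done: false });
  }

  // analogous to claim_agent_task: lock the task for one agent
  claim(id: string, agent: string): boolean {
    const t = this.tasks.get(id);
    if (!t || t.claimedBy) return false; // unknown task or already locked
    t.claimedBy = agent;
    return true;
  }

  // analogous to assign_agent_role: specialize the claimed agent
  assignRole(id: string, role: Role): void {
    const t = this.tasks.get(id);
    if (t) t.role = role;
  }

  // analogous to release_agent_task: mark done and hand off with a note
  release(id: string, note: string): void {
    const t = this.tasks.get(id);
    if (!t) return;
    t.done = true;
    t.note = note;
    t.claimedBy = undefined;
  }

  // analogous to get_parallel_status: the coordinator's monitoring view
  status(): { open: number; done: number } {
    let open = 0;
    let done = 0;
    this.tasks.forEach((t) => (t.done ? done++ : open++));
    return { open, done };
  }
}

const board = new TaskBoard();
board.add("parse-module");
board.add("write-tests");

board.claim("parse-module", "agent-1");          // true: lock acquired
board.claim("parse-module", "agent-2");          // false: already claimed
board.assignRole("parse-module", "implementer");
board.release("parse-module", "parser done, edge cases listed");

console.log(board.status()); // → { open: 1, done: 1 }
```

The lock on `claim` is what makes parallel sessions safe: a second agent cannot grab a task mid-flight, and the release note is the handoff artifact the coordinator aggregates before running the quality gate.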

- **MCP Prompts available:**
- - `claude-code-parallel` — Step-by-step Claude Code subagent coordination
- - `parallel-agent-team` — Full team setup with role assignment
- - `oracle-test-harness` — Validate outputs against known-good reference
- - `bootstrap-parallel-agents` — Scaffold parallel infra for any repo
+ ### TOON Format — Token Savings

- **Note:** Parallel agent coordination tools require `--preset multi_agent` or `--preset full`.
+ TOON (Token-Oriented Object Notation) is on by default. Every tool response is TOON-encoded for ~40% fewer tokens vs JSON. Disable with `--no-toon`.

 ---

- ## Capability Benchmarking (GAIA, Gated)
-
- NodeBench MCP treats tools as "Access". To measure real capability lift, we benchmark baseline (LLM-only) vs tool-augmented accuracy on GAIA (gated).
-
- Notes:
- - GAIA fixtures and attachments are written under `.cache/gaia` (gitignored). Do not commit GAIA content.
- - Fixture generation requires `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN`.
-
- Web lane (web_search + fetch_url):
- ```bash
- npm run mcp:dataset:gaia:capability:refresh
- NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:test
- ```
-
- File-backed lane (PDF / XLSX / CSV / DOCX / PPTX / JSON / JSONL / TXT / ZIP via `local_file` tools):
- ```bash
- npm run mcp:dataset:gaia:capability:files:refresh
- NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:files:test
- ```
+ ## Security & Trust Boundaries

- Modes:
- - Stable: `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag`
- - More realistic: `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent`
+ NodeBench MCP runs locally on your machine.

- Notes:
- - ZIP attachments require `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (multi-step extract -> parse).
+ - All persistent data stored in `~/.nodebench/` (SQLite). No data sent to external servers unless you provide API keys and use tools that call external APIs.
+ - Analytics data never leaves your machine.
+ - The `local_file` toolset can read files anywhere your Node.js process has permission. Use the `starter` preset to restrict file system access.
+ - All API keys read from environment variables — never hardcoded or logged.
+ - All database queries use parameterized statements.

 ---

@@ -1110,93 +331,6 @@ Then use absolute path:

 ---

- ## Quick Reference
-
- ### Recommended Setup
-
- ```bash
- # Claude Code — AI Flywheel core (50 tools, default)
- claude mcp add nodebench -- npx -y nodebench-mcp
-
- # Windsurf — add to ~/.codeium/windsurf/mcp_config.json
- # Cursor — add to .cursor/mcp.json
- {
-   "mcpServers": {
-     "nodebench": {
-       "command": "npx",
-       "args": ["-y", "nodebench-mcp"]
-     }
-   }
- }
- ```
-
- ### What's in the Default?
-
- | Category | Tools | What you get |
- |---|---|---|
- | Discovery | 6 | findTools, getMethodology, check_mcp_setup, discover_tools (pagination + expansion), get_tool_quick_ref (multi-hop BFS), get_workflow_chain |
- | Dynamic loading | 6 | load_toolset, unload_toolset, list_available_toolsets, call_loaded_tool, smart_select_tools, get_ab_test_report |
- | Verification | 8 | Cycles, gaps, triple-verify, status |
- | Eval | 6 | Eval runs, results, comparison, diff |
- | Quality gate | 4 | Gates, presets, history |
- | Learning | 4 | Knowledge, search, record |
- | Flywheel | 4 | Mandatory flywheel, promote, investigate |
- | Recon | 7 | Research, findings, framework checks, risk |
- | Security | 3 | Dependency scanning, code analysis, terminal security scanning |
- | Boilerplate | 2 | Scaffold NodeBench projects + status |
- | Skill update | 4 | Skill tracking, freshness checks, sync |
- | **Total** | **54** | **Complete AI Flywheel methodology** |
-
- ### When to Use a Themed Preset
-
- | Need | Preset | Tools |
- |---|---|---|
- | Web development with visual QA | `--preset web_dev` | 106 |
- | Mobile apps with flicker detection | `--preset mobile` | 95 |
- | Academic papers and research writing | `--preset academic` | 86 |
- | Multi-agent coordination | `--preset multi_agent` | 83 |
- | Data analysis and file processing | `--preset data` | 78 |
- | Content pipelines and publishing | `--preset content` | 73 |
- | Research with web search and RSS | `--preset research` | 71 |
- | CI/CD and DevOps | `--preset devops` | 68 |
- | Everything | `--preset full` | 218 |
-
- ### Key Methodology Topics
-
- | Topic | Command |
- |---|---|
- | AI Flywheel overview | `getMethodology("overview")` |
- | 6-phase verification | `getMethodology("mandatory_flywheel")` |
- | Parallel agents | `getMethodology("parallel_agent_teams")` |
- | Eval-driven development | `getMethodology("eval_driven_development")` |
-
- ---
-
- ## Security & Trust Boundaries
-
- NodeBench MCP runs locally on your machine. Here's what it can and cannot access:
-
- ### Data locality
- - All persistent data is stored in **`~/.nodebench/`** (SQLite databases for tool logs, analytics, learnings, eval results)
- - **No data is sent to external servers** unless you explicitly provide API keys and use tools that call external APIs (web search, LLM, GitHub, email)
- - Analytics data never leaves your machine
-
- ### File system access
- - The `local_file` toolset (in `data`, `academic`, `full` presets) can **read files anywhere on your filesystem** that the Node.js process has permission to access. This includes CSV, PDF, XLSX, DOCX, PPTX, JSON, TXT, and ZIP files
- - The `security` toolset (in all presets) runs static analysis on files you point it at
- - Session notes and project bootstrapping write to the current working directory or `~/.nodebench/`
- - **Trust boundary**: If you grant an AI agent access to NodeBench MCP with `--preset full`, that agent can read any file your user account can read. Use the `default` preset if you want to restrict file system access
-
- ### API keys
- - All API keys are read from environment variables (`GEMINI_API_KEY`, `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GITHUB_TOKEN`, etc.)
- - No keys are hardcoded or logged
- - Keys are passed to their respective provider APIs only — never to NodeBench servers (there are none)
-
- ### SQL injection protection
- - All database queries use parameterized statements — no string concatenation in SQL
-
- ---
-
 ## Troubleshooting

 **"No search provider available"** — Set `GEMINI_API_KEY`, `OPENAI_API_KEY`, or `PERPLEXITY_API_KEY`
@@ -1207,11 +341,11 @@ NodeBench MCP runs locally on your machine. Here's what it can and cannot access

 **MCP not connecting** — Check path is absolute, run `claude --mcp-debug`, ensure Node.js >= 18

- **Windsurf not finding tools** — Verify `~/.codeium/windsurf/mcp_config.json` has the correct JSON structure. Open Settings → MCP → View raw config to edit directly.
+ **Windsurf not finding tools** — Verify `~/.codeium/windsurf/mcp_config.json` has correct JSON structure

- **Cursor tools not loading** — Ensure `.cursor/mcp.json` exists in the project root. Restart Cursor after config changes.
+ **Cursor tools not loading** — Ensure `.cursor/mcp.json` exists in project root. Use `--preset cursor` to stay within the tool cap. Restart Cursor after config changes.

- **Dynamic loading not working** — Claude Code and GitHub Copilot support native dynamic loading. For Windsurf/Cursor, use `call_loaded_tool` as a fallback (it's always available).
+ **Dynamic loading not working** — Claude Code and GitHub Copilot support native dynamic loading. For Windsurf/Cursor, use `call_loaded_tool` as a fallback.

 ---