nodebench-mcp 2.17.0 → 2.18.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (57)
  1. package/LICENSE +21 -0
  2. package/NODEBENCH_AGENTS.md +2 -2
  3. package/README.md +514 -82
  4. package/dist/__tests__/analytics.test.d.ts +11 -0
  5. package/dist/__tests__/analytics.test.js +546 -0
  6. package/dist/__tests__/analytics.test.js.map +1 -0
  7. package/dist/__tests__/dynamicLoading.test.d.ts +1 -0
  8. package/dist/__tests__/dynamicLoading.test.js +278 -0
  9. package/dist/__tests__/dynamicLoading.test.js.map +1 -0
  10. package/dist/__tests__/evalHarness.test.js +1 -1
  11. package/dist/__tests__/evalHarness.test.js.map +1 -1
  12. package/dist/__tests__/helpers/answerMatch.js +22 -22
  13. package/dist/__tests__/presetRealWorldBench.test.js +9 -0
  14. package/dist/__tests__/presetRealWorldBench.test.js.map +1 -1
  15. package/dist/__tests__/tools.test.js +1 -1
  16. package/dist/__tests__/toolsetGatingEval.test.js +9 -1
  17. package/dist/__tests__/toolsetGatingEval.test.js.map +1 -1
  18. package/dist/analytics/index.d.ts +10 -0
  19. package/dist/analytics/index.js +11 -0
  20. package/dist/analytics/index.js.map +1 -0
  21. package/dist/analytics/projectDetector.d.ts +19 -0
  22. package/dist/analytics/projectDetector.js +259 -0
  23. package/dist/analytics/projectDetector.js.map +1 -0
  24. package/dist/analytics/schema.d.ts +57 -0
  25. package/dist/analytics/schema.js +157 -0
  26. package/dist/analytics/schema.js.map +1 -0
  27. package/dist/analytics/smartPreset.d.ts +63 -0
  28. package/dist/analytics/smartPreset.js +300 -0
  29. package/dist/analytics/smartPreset.js.map +1 -0
  30. package/dist/analytics/toolTracker.d.ts +59 -0
  31. package/dist/analytics/toolTracker.js +163 -0
  32. package/dist/analytics/toolTracker.js.map +1 -0
  33. package/dist/analytics/usageStats.d.ts +64 -0
  34. package/dist/analytics/usageStats.js +252 -0
  35. package/dist/analytics/usageStats.js.map +1 -0
  36. package/dist/db.js +359 -321
  37. package/dist/db.js.map +1 -1
  38. package/dist/index.d.ts +2 -1
  39. package/dist/index.js +652 -89
  40. package/dist/index.js.map +1 -1
  41. package/dist/tools/architectTools.js +13 -13
  42. package/dist/tools/critterTools.js +14 -14
  43. package/dist/tools/parallelAgentTools.js +176 -176
  44. package/dist/tools/patternTools.js +11 -11
  45. package/dist/tools/progressiveDiscoveryTools.d.ts +5 -1
  46. package/dist/tools/progressiveDiscoveryTools.js +111 -19
  47. package/dist/tools/progressiveDiscoveryTools.js.map +1 -1
  48. package/dist/tools/researchWritingTools.js +42 -42
  49. package/dist/tools/rssTools.js +396 -396
  50. package/dist/tools/toolRegistry.d.ts +17 -0
  51. package/dist/tools/toolRegistry.js +65 -17
  52. package/dist/tools/toolRegistry.js.map +1 -1
  53. package/dist/tools/voiceBridgeTools.js +498 -498
  54. package/dist/toolsetRegistry.d.ts +10 -0
  55. package/dist/toolsetRegistry.js +84 -0
  56. package/dist/toolsetRegistry.js.map +1 -0
  57. package/package.json +4 -4
package/README.md CHANGED
@@ -5,7 +5,11 @@
5
5
  One command gives your agent structured research, risk assessment, 3-layer testing, quality gates, and a persistent knowledge base — so every fix is thorough and every insight compounds into future work.
6
6
 
7
7
  ```bash
8
+ # Default (39 tools) - complete AI Flywheel methodology
8
9
  claude mcp add nodebench -- npx -y nodebench-mcp
10
+
11
+ # Full (175 tools) - everything including vision, web, files, etc.
12
+ claude mcp add nodebench -- npx -y nodebench-mcp --preset full
9
13
  ```
10
14
 
11
15
  ---
@@ -35,11 +39,11 @@ Every additional tool call produces a concrete artifact — an issue found, a ri
35
39
 
36
40
  ## Who's Using It
37
41
 
38
- **Vision engineer** — Built agentic vision analysis using GPT 5.2 with Set-of-Mark (SoM) for boundary boxing, similar to Google Gemini 3 Flash's agentic code execution approach. Uses NodeBench's verification pipeline to validate detection accuracy across screenshot variants before shipping model changes.
42
+ **Vision engineer** — Built agentic vision analysis using GPT 5.2 with Set-of-Mark (SoM) for boundary boxing, similar to Google Gemini 3 Flash's agentic code execution approach. Uses NodeBench's verification pipeline to validate detection accuracy across screenshot variants before shipping model changes. (Uses `full` preset for vision tools)
39
43
 
40
- **QA engineer** — Transitioned a manual QA workflow website into an AI agent-driven app for a pet care messaging platform. Uses NodeBench's quality gates, verification cycles, and eval runs to ensure the AI agent handles edge cases that manual QA caught but bare AI agents miss.
44
+ **QA engineer** — Transitioned a manual QA workflow website into an AI agent-driven app for a pet care messaging platform. Uses NodeBench's quality gates, verification cycles, and eval runs to ensure the AI agent handles edge cases that manual QA caught but bare AI agents miss. (Uses `default` preset — all core AI Flywheel tools)
41
45
 
42
- Both found different subsets of the 163 tools useful — which is why NodeBench ships with 4 `--preset` levels to load only what you need.
46
+ Both found different subsets of the tools useful — which is why NodeBench ships with just 2 `--preset` levels. The `default` preset (39 tools) covers the complete AI Flywheel methodology with ~78% fewer tools than the `full` preset (175 tools). Add `--preset full` for specialized tools (vision, web, files, parallel agents, security).
43
47
 
44
48
  ---
45
49
 
@@ -77,14 +81,11 @@ Tasks 1-3 start with zero prior knowledge. By task 9, the agent finds 2+ relevan
77
81
  ### Install (30 seconds)
78
82
 
79
83
  ```bash
80
- # Claude Code CLI — all 163 tools (TOON encoding on by default for ~40% token savings)
84
+ # Default (39 tools) - complete AI Flywheel methodology
81
85
  claude mcp add nodebench -- npx -y nodebench-mcp
82
86
 
83
- # Or start with discovery only 5 tools, agents self-escalate to what they need
84
- claude mcp add nodebench -- npx -y nodebench-mcp --preset meta
85
-
86
- # Or start lean — 43 tools, ~70% less token overhead
87
- claude mcp add nodebench -- npx -y nodebench-mcp --preset lite
87
+ # Full (175 tools) - everything including vision, UI capture, web, GitHub, docs, parallel, local files, GAIA solvers
88
+ claude mcp add nodebench -- npx -y nodebench-mcp --preset full
88
89
  ```
89
90
 
90
91
  Or add to `~/.claude/settings.json` or `.claude.json`:
@@ -104,6 +105,9 @@ Or add to `~/.claude/settings.json` or `.claude.json`:
104
105
 
105
106
  ```
106
107
  # See what's available
108
+ > Use discover_tools("verify my implementation") to find relevant tools
109
+
110
+ # Get methodology guidance
107
111
  > Use getMethodology("overview") to see all workflows
108
112
 
109
113
  # Before your next task — search for prior knowledge
@@ -113,6 +117,41 @@ Or add to `~/.claude/settings.json` or `.claude.json`:
113
117
  > Use getMethodology("mandatory_flywheel") and follow the 6 steps
114
118
  ```
115
119
 
120
+ ### Usage Analytics & Smart Presets
121
+
122
+ NodeBench MCP tracks tool usage locally and can recommend optimal presets based on your project type and usage patterns.
123
+
124
+ **Get smart preset recommendation:**
125
+ ```bash
126
+ npx nodebench-mcp --smart-preset
127
+ ```
128
+
129
+ This analyzes your project (detects language, framework, project type) and usage history to recommend the best preset.
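
As an illustration of what that detection step looks like, here is a minimal file-system probe in TypeScript. It is a sketch only: the real detector ships in `dist/analytics/projectDetector.js`, and the marker files and labels below are assumptions, not its actual heuristics.

```ts
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Illustrative probe: map well-known marker files to a coarse project label.
function detectProject(root: string): { language: string; framework?: string } {
  const pkgPath = join(root, "package.json");
  if (existsSync(pkgPath)) {
    const pkg = JSON.parse(readFileSync(pkgPath, "utf8"));
    const deps = { ...pkg.dependencies, ...pkg.devDependencies };
    const framework = deps.next ? "next" : deps.react ? "react" : undefined;
    const language = existsSync(join(root, "tsconfig.json")) ? "typescript" : "javascript";
    return { language, framework };
  }
  if (existsSync(join(root, "pyproject.toml")) || existsSync(join(root, "requirements.txt"))) {
    return { language: "python" };
  }
  return { language: "unknown" };
}

console.log(detectProject(process.cwd()));
```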
130
+
131
+ **View usage statistics:**
132
+ ```bash
133
+ npx nodebench-mcp --stats
134
+ ```
135
+
136
+ Shows tool usage patterns, most used toolsets, and success rates for the last 30 days.
137
+
138
+ **Export usage data:**
139
+ ```bash
140
+ npx nodebench-mcp --export-stats > usage-stats.json
141
+ ```
142
+
143
+ **List all available presets:**
144
+ ```bash
145
+ npx nodebench-mcp --list-presets
146
+ ```
147
+
148
+ **Clear analytics data:**
149
+ ```bash
150
+ npx nodebench-mcp --reset-stats
151
+ ```
152
+
153
+ All analytics data is stored locally in `~/.nodebench/analytics.db` and never leaves your machine.
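
As a sketch of what you can do with the export, the snippet below reads `usage-stats.json` and prints the most-used tools. The field names (`toolCalls`, `tool`, `success`) are assumptions about the export shape; check your own `--export-stats` output for the actual schema.

```ts
import { readFileSync } from "node:fs";

// Assumed export shape; verify against your own `--export-stats` output.
interface ToolCall { tool: string; success: boolean; timestamp: string; }

const stats = JSON.parse(readFileSync("usage-stats.json", "utf8")) as { toolCalls: ToolCall[] };

const counts = new Map<string, number>();
for (const call of stats.toolCalls) {
  counts.set(call.tool, (counts.get(call.tool) ?? 0) + 1);
}

const top = [...counts.entries()].sort((a, b) => b[1] - a[1]).slice(0, 5);
console.log("Top tools in the last export:", top);
```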
154
+
116
155
  ### Optional: API keys for web search and vision
117
156
 
118
157
  ```bash
@@ -151,6 +190,29 @@ Notes:
151
190
 
152
191
  ## What You Get
153
192
 
193
+ ### The AI Flywheel — Core Methodology
194
+
195
+ The `default` preset (39 tools) gives you the complete AI Flywheel methodology from [AI_FLYWHEEL.md](https://github.com/HomenShum/nodebench-ai/blob/main/AI_FLYWHEEL.md):
196
+
197
+ ```
198
+ Research → Risk → Implement → Test (3 layers) → Eval → Gate → Learn → Ship
199
+ ↑ │
200
+ └──────────── knowledge compounds ─────────────────────────────┘
201
+ ```
202
+
203
+ **Inner loop** (per change): 6-phase verification ensures correctness.
204
+ **Outer loop** (over time): Eval-driven development ensures improvement.
205
+
206
+ ### Recommended Workflow: Start with Default
207
+
208
+ The `default` preset includes:
209
+
210
+ 1. **Discovery tools** — 6 tools: `findTools`, `getMethodology`, `check_mcp_setup`, `discover_tools`, `get_tool_quick_ref`, `get_workflow_chain`
211
+ 2. **Core methodology** — 38 tools: verification, eval, quality_gate, learning, flywheel, recon, security, boilerplate
212
+ 3. **Self-escalate** — Add `--preset full` when you need vision, web, files, or parallel agents
213
+
214
+ This approach minimizes token overhead while ensuring agents have access to the complete methodology when needed.
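
To make that workflow concrete, here is a minimal client sketch using the official MCP TypeScript SDK (`@modelcontextprotocol/sdk`). It assumes the SDK's standard stdio client API, which may differ slightly between SDK versions, and the query string is only an example.

```ts
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function main(): Promise<void> {
  // Spawn the server with the default preset over stdio (same command the IDE uses)
  const transport = new StdioClientTransport({
    command: "npx",
    args: ["-y", "nodebench-mcp"],
  });
  const client = new Client({ name: "flywheel-demo", version: "0.0.1" });
  await client.connect(transport);

  // List what the default preset exposes
  const { tools } = await client.listTools();
  console.log(`tools exposed: ${tools.length}`);

  // Ask the discovery layer what to use next
  const found = await client.callTool({
    name: "discover_tools",
    arguments: { query: "verify my implementation" },
  });
  console.log(found.content);

  await client.close();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```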
215
+
154
216
  ### Core workflow (use these every session)
155
217
 
156
218
  | When you... | Use this | Impact |
@@ -167,29 +229,56 @@ Notes:
167
229
 
168
230
  ### When running parallel agents (Claude Code subagents, worktrees)
169
231
 
170
- | When you... | Use this | Impact |
171
- |---|---|---|
172
- | Prevent duplicate work | `claim_agent_task` / `release_agent_task` | Task locks — each task owned by exactly one agent |
173
- | Specialize agents | `assign_agent_role` | 7 roles: implementer, test_writer, critic, etc. |
174
- | Track context usage | `log_context_budget` | Prevents context exhaustion mid-fix |
175
- | Validate against reference | `run_oracle_comparison` | Compare output against known-good oracle |
176
- | Orient new sessions | `get_parallel_status` | See what all agents are doing and what's blocked |
177
- | Bootstrap any repo | `bootstrap_parallel_agents` | Auto-detect gaps, scaffold coordination infra |
232
+ | When you... | Use this | Impact | Preset |
233
+ |---|---|---|---|
234
+ | Prevent duplicate work | `claim_agent_task` / `release_agent_task` | Task locks — each task owned by exactly one agent | `full` |
235
+ | Specialize agents | `assign_agent_role` | 7 roles: implementer, test_writer, critic, etc. | `full` |
236
+ | Track context usage | `log_context_budget` | Prevents context exhaustion mid-fix | `full` |
237
+ | Validate against reference | `run_oracle_comparison` | Compare output against known-good oracle | `full` |
238
+ | Orient new sessions | `get_parallel_status` | See what all agents are doing and what's blocked | `full` |
239
+ | Bootstrap any repo | `bootstrap_parallel_agents` | Auto-detect gaps, scaffold coordination infra | `full` |
240
+
241
+ **Note:** Parallel agent coordination tools are only available in the `full` preset. For single-agent workflows, the `default` preset provides all the core AI Flywheel tools you need.
178
242
 
179
243
  ### Research and discovery
180
244
 
181
- | When you... | Use this | Impact |
182
- |---|---|---|
183
- | Search the web | `web_search` | Gemini/OpenAI/Perplexity — latest docs and updates |
184
- | Fetch a URL | `fetch_url` | Read any page as clean markdown |
185
- | Find GitHub repos | `search_github` + `analyze_repo` | Discover and evaluate libraries and patterns |
186
- | Analyze screenshots | `analyze_screenshot` | AI vision (Gemini/GPT-4o/Claude) for UI QA |
245
+ | When you... | Use this | Impact | Preset |
246
+ |---|---|---|---|
247
+ | Search the web | `web_search` | Gemini/OpenAI/Perplexity — latest docs and updates | `full` |
248
+ | Fetch a URL | `fetch_url` | Read any page as clean markdown | `full` |
249
+ | Find GitHub repos | `search_github` + `analyze_repo` | Discover and evaluate libraries and patterns | `full` |
250
+ | Analyze screenshots | `analyze_screenshot` | AI vision (Gemini/GPT-4o/Claude) for UI QA | `full` |
251
+
252
+ **Note:** Web search, GitHub, and vision tools are only available in the `full` preset. The `default` preset focuses on the core AI Flywheel methodology (verification, eval, learning, recon, flywheel, security, boilerplate).
253
+
254
+ ---
255
+
256
+ ## Impact-Driven Methodology
257
+
258
+ Every tool call, methodology step, and workflow path must answer: **"What concrete thing did this produce?"**
259
+
260
+ | Tool / Phase | Concrete Impact |
261
+ |---|---|
262
+ | `run_recon` + `log_recon_finding` | N findings surfaced before writing code |
263
+ | `assess_risk` | Risk tier assigned - HIGH triggers confirmation before action |
264
+ | `start_verification_cycle` + `log_gap` | N issues detected with severity, all tracked to resolution |
265
+ | `log_test_result` (3 layers) | 3x test coverage vs single-layer; catches integration failures |
266
+ | `start_eval_run` + `record_eval_result` | N regression cases protecting against future breakage |
267
+ | `run_quality_gate` | N gate rules enforced; violations blocked before deploy |
268
+ | `record_learning` + `search_all_knowledge` | Knowledge compounds - later tasks reuse prior findings |
269
+ | `run_mandatory_flywheel` | 6-step minimum verification; catches dead code and intent mismatches |
270
+
271
+ The comparative benchmark validates this with 9 real production scenarios:
272
+ - 13 issues detected (4 HIGH, 8 MEDIUM, 1 LOW) - bare agent ships all of them
273
+ - 21 recon findings before implementation
274
+ - 26 blind spots prevented
275
+ - Knowledge compounding: 0 hits on task 1 → 2+ hits by task 9
187
276
 
188
277
  ---
189
278
 
190
279
  ## Progressive Discovery
191
280
 
192
- 163 tools is a lot. The progressive disclosure system helps agents find exactly what they need:
281
+ The `default` preset (39 tools) provides the complete AI Flywheel methodology with discovery built in. The progressive disclosure system helps agents find exactly what they need:
193
282
 
194
283
  ### Multi-modal search engine
195
284
 
@@ -286,6 +375,28 @@ Research → Risk → Implement → Test (3 layers) → Eval → Gate → Learn
286
375
  **Outer loop** (over time): Eval-driven development ensures improvement.
287
376
  **Together**: The AI Flywheel — every verification produces eval artifacts, every regression triggers verification.
288
377
 
378
+ ### The 6-Phase Verification Process (Inner Loop)
379
+
380
+ Every non-trivial change should go through these 6 steps:
381
+
382
+ 1. **Context Gathering** — Parallel subagent deep dive into SDK specs, implementation patterns, dispatcher/backend audit, external API research
383
+ 2. **Gap Analysis** — Compare findings against current implementation, categorize gaps (CRITICAL/HIGH/MEDIUM/LOW)
384
+ 3. **Implementation** — Apply fixes following production patterns exactly
385
+ 4. **Testing & Validation** — 5 layers: static analysis, unit tests, integration tests, manual verification, live end-to-end
386
+ 5. **Self-Closed-Loop Verification** — Parallel verification subagents check spec compliance, functional correctness, argument compatibility
387
+ 6. **Document Learnings** — Update documentation with edge cases and key learnings
388
+
389
+ ### The Eval-Driven Development Loop (Outer Loop)
390
+
391
+ 1. **Run Eval Batch** — Send test cases through the target workflow
392
+ 2. **Capture Telemetry** — Collect complete agent execution trace
393
+ 3. **LLM-as-Judge Analysis** — Score goal alignment, tool efficiency, output quality
394
+ 4. **Retrieve Results** — Aggregate pass/fail rates and improvement suggestions
395
+ 5. **Fix, Optimize, Enhance** — Apply changes based on judge feedback
396
+ 6. **Re-run Evals** — Deploy only if scores improve
397
+
398
+ **Rule: No change ships without an eval improvement.**
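
A minimal sketch of that rule as code; the score fields and the comparison policy below are assumptions for illustration, not NodeBench's actual eval schema.

```ts
// Encode "no change ships without an eval improvement" as a deploy gate.
interface EvalRun {
  runId: string;
  passRate: number;   // fraction of eval cases passing (0..1)
  judgeScore: number; // mean LLM-as-judge score (0..1)
}

function shouldShip(baseline: EvalRun, candidate: EvalRun): boolean {
  // Ship only if the batch does not regress and the judge score strictly improves
  return candidate.passRate >= baseline.passRate && candidate.judgeScore > baseline.judgeScore;
}

const baseline: EvalRun = { runId: "eval-041", passRate: 0.82, judgeScore: 0.74 };
const candidate: EvalRun = { runId: "eval-042", passRate: 0.86, judgeScore: 0.79 };

console.log(shouldShip(baseline, candidate) ? "deploy" : "revert and try a different approach");
```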
399
+
289
400
  Ask the agent: `Use getMethodology("overview")` to see all 20 methodology topics.
290
401
 
291
402
  ---
@@ -313,34 +424,27 @@ Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www
313
424
  - `oracle-test-harness` — Validate outputs against known-good reference
314
425
  - `bootstrap-parallel-agents` — Scaffold parallel infra for any repo
315
426
 
427
+ **Note:** Parallel agent coordination tools are only available in the `full` preset. For single-agent workflows, the `default` preset provides all the core AI Flywheel tools you need.
428
+
316
429
  ---
317
430
 
318
431
  ## Toolset Gating
319
432
 
320
- 163 tools means tens of thousands of tokens of schema per API call. If you only need core methodology, gate the toolset:
433
+ The default preset (39 tools) gives you the complete AI Flywheel methodology with ~78% fewer tools compared to the full suite (175 tools).
321
434
 
322
- ### Presets
435
+ ### Presets — Choose What You Need
323
436
 
324
437
  | Preset | Tools | Domains | Use case |
325
438
  |---|---|---|---|
326
- | `meta` | 5 | 0 | Discovery-only front door agents start here and self-escalate via `discover_tools` |
327
- | `lite` | 43 | 8 | Core methodology verification, eval, flywheel, learning, recon, security, boilerplate |
328
- | `core` | 110 | 23 | Full workflow — adds bootstrap, self-eval, llm, platform, research_writing, flicker_detection, figma_flow, benchmark, session_memory, toon, pattern, git_workflow, seo, voice_bridge, critter |
329
- | `full` | 163 | 31 | Everything — adds vision, UI capture, web, GitHub, docs, parallel, local files, GAIA solvers |
439
+ | **default** | **39** | 8 | **Recommended.** Complete AI Flywheel: verification, eval, quality_gate, learning, flywheel, recon, security, boilerplate |
440
+ | `full` | 175 | 34 | Everything: vision, UI capture, web, GitHub, docs, parallel, local files, GAIA solvers, security, email, RSS, architect |
330
441
 
331
442
  ```bash
332
- # Meta 5 tools (discovery-only: findTools, getMethodology, discover_tools, get_tool_quick_ref, get_workflow_chain)
333
- # Agents start here and self-escalate to the tools they need
334
- claude mcp add nodebench -- npx -y nodebench-mcp --preset meta
335
-
336
- # Lite — 43 tools (verification, eval, flywheel, learning, recon, security, boilerplate + meta + discovery)
337
- claude mcp add nodebench -- npx -y nodebench-mcp --preset lite
338
-
339
- # Core — 110 tools (adds bootstrap, self-eval, llm, platform, research_writing, flicker_detection, figma_flow, benchmark, session_memory, toon, pattern, git_workflow, seo, voice_bridge, critter + meta + discovery)
340
- claude mcp add nodebench -- npx -y nodebench-mcp --preset core
341
-
342
- # Full — all 163 tools (default, TOON encoding on by default)
443
+ # Recommended: Default (39 tools) - complete AI Flywheel
343
444
  claude mcp add nodebench -- npx -y nodebench-mcp
445
+
446
+ # Everything: All 175 tools
447
+ claude mcp add nodebench -- npx -y nodebench-mcp --preset full
344
448
  ```
345
449
 
346
450
  Or in config:
@@ -350,12 +454,193 @@ Or in config:
350
454
  "mcpServers": {
351
455
  "nodebench": {
352
456
  "command": "npx",
353
- "args": ["-y", "nodebench-mcp", "--preset", "meta"]
457
+ "args": ["-y", "nodebench-mcp"]
354
458
  }
355
459
  }
356
460
  }
357
461
  ```
358
462
 
463
+ ### Scaling MCP: How We Solved the 5 Biggest Industry Problems
464
+
465
+ MCP tool servers face 5 systemic problems documented across Anthropic, Microsoft Research, and the open-source community. We researched each one, built solutions, and tested them with automated eval harnesses. Here's the full breakdown — problem by problem.
466
+
467
+ ---
468
+
469
+ #### Problem 1: Context Bloat (too many tool definitions eat the context window)
470
+
471
+ **The research**: Anthropic measured that 58 tools from 5 MCP servers consume **~55K tokens** before the conversation starts. At 175 tools, NodeBench would consume ~87K tokens — up to 44% of a 200K context window just on tool metadata. [Microsoft Research](https://www.microsoft.com/en-us/research/blog/tool-space-interference-in-the-mcp-era-designing-for-agent-compatibility-at-scale/) found LLMs "decline to act at all when faced with ambiguous or excessive tool options." [Cursor enforces a ~40-tool hard cap](https://www.lunar.dev/post/why-is-there-mcp-tool-overload-and-how-to-solve-it-for-your-ai-agents) for this reason.
472
+
473
+ **Our solutions** (layered, each independent):
474
+
475
+ | Layer | What it does | Token savings | Requires |
476
+ |---|---|---|---|
477
+ | Themed presets (`--preset web_dev`) | Load only relevant toolsets (44-60 tools vs 175) | **60-75%** | Nothing |
478
+ | TOON encoding (on by default) | Encode all tool responses in token-optimized format | **~40%** on responses | Nothing |
479
+ | `discover_tools({ compact: true })` | Return `{ name, category, hint }` only | **~60%** on search results | Nothing |
480
+ | `instructions` field (Claude Code) | Claude Code defers tool loading, searches on demand | **~85%** | Claude Code client |
481
+ | `smart_select_tools` (LLM-powered) | Fast model picks 8 best tools from compact catalog | **~95%** | Any API key |
482
+
483
+ **How we tested**: The A/B harness (`scripts/ab-test-harness.ts`) measures tool counts, token overhead, and success rates across 28 scenarios in both static and dynamic modes. TOON savings validated by comparing JSON vs TOON serialized sizes across all tool responses.
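
For intuition on where the overhead comes from, here is a back-of-the-envelope sketch: serialize N tool definitions and divide characters by roughly 4. The synthetic definitions and the chars-per-token ratio are assumptions, not the harness's actual measurement.

```ts
// Rough context-cost estimate for N tool definitions (~4 chars per token heuristic).
interface ToolDef { name: string; description: string; inputSchema: object; }

const makeTool = (i: number): ToolDef => ({
  name: `tool_${i}`,
  description: "Does one well-scoped thing and returns a structured result. ".repeat(4),
  inputSchema: { type: "object", properties: { input: { type: "string" } }, required: ["input"] },
});

function estimateTokens(tools: ToolDef[]): number {
  const chars = tools.reduce((sum, t) => sum + JSON.stringify(t).length, 0);
  return Math.round(chars / 4);
}

const defaultPreset = Array.from({ length: 39 }, (_, i) => makeTool(i));
const fullPreset = Array.from({ length: 175 }, (_, i) => makeTool(i));

console.log("default preset ~", estimateTokens(defaultPreset), "tokens of schema");
console.log("full preset    ~", estimateTokens(fullPreset), "tokens of schema");
```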
484
+
485
+ ---
486
+
487
+ #### Problem 2: Tool Selection Degradation (LLMs pick the wrong tool as count increases)
488
+
489
+ **The research**: [Anthropic's Tool Search Tool](https://www.anthropic.com/engineering/advanced-tool-use) improved accuracy from **49% → 74%** (Opus 4) and **79.5% → 88.1%** (Opus 4.5) by switching from all-tools-upfront to on-demand discovery. The [Dynamic ReAct paper (arxiv 2509.20386)](https://arxiv.org/html/2509.20386v1) tested 5 architectures and found **Search + Load** wins — flat search + deliberate loading beats hierarchical app→tool search.
490
+
491
+ **Our solution**: `discover_tools` — a 14-strategy hybrid search engine that finds the right tool from 175 candidates:
492
+
493
+ | Strategy | What it does | Example |
494
+ |---|---|---|
495
+ | Keyword + TF-IDF | Rare tags score higher than common ones | "c-compiler" scores higher than "test" |
496
+ | Fuzzy (Levenshtein) | Tolerates typos | "verifiy" → `start_verification_cycle` |
497
+ | Semantic (synonyms) | Expands 30 word families | "check" also finds "verify", "validate" |
498
+ | N-gram + Bigram | Partial words and phrases | "screen" → `capture_ui_screenshot` |
499
+ | Dense (TF-IDF cosine) | Vector-like ranking | "audit compliance" surfaces related tools |
500
+ | Embedding (neural) | Agent-as-a-Graph bipartite RRF | Based on [arxiv 2511.01854](https://arxiv.org/html/2511.01854v1) |
501
+ | Execution traces | Co-occurrence mining from `tool_call_log` | Tools frequently used together boost each other |
502
+ | Intent pre-filter | Narrow to relevant categories before search | `intent: "data_analysis"` → only local_file, llm, benchmark |
503
+
504
+ Plus `smart_select_tools` for ambiguous queries — sends the catalog to Gemini Flash / GPT-4o-mini / Claude Haiku for LLM-powered reranking.
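
A toy sketch of the idea behind two of those strategies (keyword overlap plus synonym expansion); the registry, tags, and synonym map here are invented for illustration and are not the package's internals.

```ts
// Toy hybrid search: keyword overlap after synonym expansion.
interface ToolEntry { name: string; tags: string[]; }

const registry: ToolEntry[] = [
  { name: "start_verification_cycle", tags: ["verify", "check", "gap", "cycle"] },
  { name: "analyze_screenshot", tags: ["vision", "screenshot", "ui", "image"] },
  { name: "web_search", tags: ["web", "search", "docs"] },
];

const synonyms: Record<string, string[]> = {
  check: ["verify", "validate"],
  picture: ["image", "screenshot"],
};

// Expand each query word with its synonym family before matching tags
function expand(words: string[]): string[] {
  return words.flatMap((w) => [w, ...(synonyms[w] ?? [])]);
}

function score(query: string, tool: ToolEntry): number {
  const words = expand(query.toLowerCase().split(/\s+/));
  return words.filter((w) => tool.tags.includes(w)).length;
}

function discover(query: string): ToolEntry[] {
  return [...registry].sort((a, b) => score(query, b) - score(query, a));
}

// "check" expands to "verify"/"validate", so the verification tool ranks first
console.log(discover("check my implementation")[0].name); // start_verification_cycle
```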
505
+
506
+ **How we tested**: 28 scenarios with expected-toolset ground truth. The harness checks if `_loadSuggestions` points to the correct toolset for each domain query.
507
+
508
+ | What we measured | Result |
509
+ |---|---|
510
+ | Discovery accuracy | **18/18 (100%)** — correct toolset suggested for every domain |
511
+ | Domains covered | File I/O, email, GitHub, academic writing, SEO, git, Figma, CI/CD, browser automation, database, security, LLM, monitoring |
512
+ | Natural language queries | "I need to look at what's in this zip file" → `local_file` ✓ |
513
+ | Zero-match graceful degradation | "deploy Kubernetes pods" → closest tools, no errors ✓ |
514
+
515
+ ---
516
+
517
+ #### Problem 3: Static Loading (all tools loaded upfront, even if unused)
518
+
519
+ **The research**: The Dynamic ReAct paper found that **Search + Load with 2 meta tools** beats all other architectures. Hierarchical search (search apps → search tools → load) adds overhead without improving accuracy. [ToolScope (arxiv 2510.20036)](https://arxiv.org/html/2510.20036) showed **+34.6%** tool selection accuracy with hybrid retrieval + tool deduplication.
520
+
521
+ **Our solution**: `--dynamic` flag enables Search + Load:
522
+
523
+ ```bash
524
+ npx nodebench-mcp --dynamic
525
+ ```
526
+
527
+ ```
528
+ > discover_tools("analyze screenshot for UI bugs")
529
+ # _loadSuggestions: [{ toolset: "vision", action: "load_toolset('vision')" }]
530
+
531
+ > load_toolset("vision")
532
+ # 4 vision tools now directly bound (not indirected through a proxy)
533
+
534
+ > unload_toolset("vision")
535
+ # Tools removed, token budget recovered
536
+ ```
537
+
538
+ Key design decisions from the research:
539
+ - **No hierarchical search** — Dynamic ReAct Section 3.4: "search_apps introduces an additional call without significantly improving accuracy"
540
+ - **Direct tool binding** — Dynamic ReAct Section 3.5: LLMs perform best with directly bound tools; `call_tool` indirection degrades in long conversations
541
+ - **Full-registry search** — `discover_tools` searches all 175 tools even with 44 loaded, so it can suggest what to load
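
Conceptually, Search + Load reduces to a registry that binds and unbinds named handlers and then tells the client to re-fetch `tools/list`. The sketch below is illustrative only; names and shapes are not nodebench-mcp internals.

```ts
// Illustrative Search + Load registry: toolsets bind handlers directly when loaded.
type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

class DynamicToolRegistry {
  private loaded = new Map<string, ToolHandler>();

  constructor(private catalog: Map<string, Map<string, ToolHandler>>) {}

  // What the client would see in tools/list right now
  listLoaded(): string[] {
    return [...this.loaded.keys()];
  }

  // load_toolset: bind every tool in the set; caller then emits tools/list_changed
  loadToolset(name: string): string[] {
    const tools = this.catalog.get(name);
    if (!tools) throw new Error(`unknown toolset: ${name}`);
    for (const [toolName, handler] of tools) this.loaded.set(toolName, handler);
    return [...tools.keys()];
  }

  // unload_toolset: unbind and recover the token budget
  unloadToolset(name: string): void {
    for (const toolName of this.catalog.get(name)?.keys() ?? []) {
      this.loaded.delete(toolName);
    }
  }
}

const catalog = new Map([
  ["vision", new Map<string, ToolHandler>([["analyze_screenshot", async () => "ok"]])],
]);
const registry = new DynamicToolRegistry(catalog);
registry.loadToolset("vision");
console.log(registry.listLoaded()); // ["analyze_screenshot"]
```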
542
+
543
+ **How we tested**: Automated A/B harness + live IDE session.
544
+
545
+ | What we measured | Result |
546
+ |---|---|
547
+ | Scenarios tested | **28** aligned to [real MCP usage data](https://towardsdatascience.com/mcp-in-practice/) — Web/Browser (24.8%), SWE (24.7%), DB/Search (23.1%), File Ops, Comms, Design, Security, AI, Monitoring |
548
+ | Success rate | **100%** across 128 tool calls per round (both modes) |
549
+ | Load latency | **<1ms** per `load_toolset` call |
550
+ | Long sessions | 6 loads + 2 unloads in a single session — correct tool count at every step |
551
+ | Burst performance | 6 consecutive calls averaging **1ms** each |
552
+ | Live agent test | Verified in real Windsurf session: load, double-load (idempotent), unload, unload-protection |
553
+ | Unit tests | **266 passing** (24 dedicated to dynamic loading) |
554
+ | Bugs found during testing | 5 (all fixed) — most critical: search results only showed loaded tools, not full registry |
555
+
556
+ ---
557
+
558
+ #### Problem 4: Client Fragmentation (not all clients handle dynamic tool updates)
559
+
560
+ **The research**: The MCP spec defines `notifications/tools/list_changed` for servers to tell clients to re-fetch the tool list. But [Cursor hasn't implemented it](https://forum.cursor.com/t/enhance-mcp-integration-in-cursor-dynamic-tool-updates-roots-support-progress-tokens-streamable-http/99903), [Claude Desktop didn't support it](https://github.com/orgs/modelcontextprotocol/discussions/76) (as of Dec 2024), and [Gemini CLI has an open issue](https://github.com/google-gemini/gemini-cli/issues/13850).
561
+
562
+ **Our solution**: Two-tier compatibility — native `list_changed` for clients that support it, plus a `call_loaded_tool` proxy fallback for those that don't.
563
+
564
+ | Client | Dynamic Loading | How |
565
+ |---|---|---|
566
+ | **Claude Code** | ✅ Native | Re-fetches tools automatically after `list_changed` |
567
+ | **GitHub Copilot** | ✅ Native | Same |
568
+ | **Windsurf / Cursor / Claude Desktop / Gemini CLI / LibreChat** | ✅ Via fallback | `call_loaded_tool` proxy (always in tool list) |
569
+
570
+ ```
571
+ > load_toolset("vision")
572
+ # Response includes: toolNames: ["analyze_screenshot", "manipulate_screenshot", ...]
573
+
574
+ > call_loaded_tool({ tool: "analyze_screenshot", args: { imagePath: "page.png" } })
575
+ # Dispatches internally — works on ALL clients
576
+ ```
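
The fallback boils down to a single always-registered tool that dispatches into whatever handlers are currently loaded. An illustrative sketch follows; the shapes are assumptions, not the server's actual code.

```ts
// Proxy dispatch: clients that never re-fetch tools/list can still reach loaded tools.
type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

const loadedTools = new Map<string, ToolHandler>();

async function callLoadedTool(input: { tool: string; args: Record<string, unknown> }) {
  const handler = loadedTools.get(input.tool);
  if (!handler) {
    // Same failure a client would hit if it skipped load_toolset
    return { isError: true, message: `"${input.tool}" is not loaded; call load_toolset first` };
  }
  // Internal dispatch: the client only ever sees the call_loaded_tool proxy
  return { isError: false, result: await handler(input.args) };
}

// Usage sketch
loadedTools.set("analyze_screenshot", async (args) => `analyzed ${args.imagePath}`);
callLoadedTool({ tool: "analyze_screenshot", args: { imagePath: "page.png" } }).then(console.log);
```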
577
+
578
+ **How we tested**: Server-side verification in the A/B harness proves correct `tools/list` updates:
579
+
580
+ ```
581
+ tools/list BEFORE: 95 tools
582
+ load_toolset("voice_bridge")
583
+ tools/list AFTER: 99 tools (+4) ← new tools visible
584
+ call_loaded_tool proxy: ✓ OK ← fallback dispatch works
585
+ unload_toolset("voice_bridge")
586
+ tools/list AFTER UNLOAD: 95 tools (-4) ← tools removed
587
+ ```
588
+
589
+ ---
590
+
591
+ #### Problem 5: Aggressive Filtering (over-filtering means the right tool isn't found)
592
+
593
+ **The research**: This is the flip side of Problem 1. If you reduce context aggressively (e.g., keyword-only search), ambiguous queries like "call an AI model" fail to match the `llm` toolset because every tool mentions "AI" in its description. [SynapticLabs' Bounded Context Packs](https://blog.synapticlabs.ai/bounded-context-packs-tool-bloat-tipping-point) addresses this with progressive disclosure. [SEP-1576](https://github.com/modelcontextprotocol/modelcontextprotocol/issues/1576) proposes adaptive granularity at the protocol level.
594
+
595
+ **Our solutions** (3 tiers, progressively smarter):
596
+
597
+ **Tier 1 — Intent pre-filter (no API key):**
598
+ ```
599
+ > discover_tools({ query: "parse a CSV file", intent: "data_analysis" })
600
+ # Narrows search to: local_file, llm, benchmark categories only
601
+ # 15 intents: file_processing, web_research, code_quality, security_audit,
602
+ # academic_writing, data_analysis, llm_interaction, visual_qa, devops_ci,
603
+ # team_coordination, communication, seo_audit, design_review, voice_ui, project_setup
604
+ ```
605
+
606
+ **Tier 2 — LLM-powered selection (API key):**
607
+ ```
608
+ > smart_select_tools({ task: "parse a PDF, extract tables, email a summary" })
609
+ # Sends compact catalog (~4K tokens: name + category + 5 tags per tool) to
610
+ # Gemini Flash / GPT-4o-mini / Claude Haiku
611
+ # Returns the 8 best tools + _loadSuggestions for unloaded toolsets
612
+ # Falls back to heuristic search if no API key is set
613
+ ```
614
+
615
+ **Tier 3 — Embedding search (optional):**
616
+ Neural bipartite graph search (tool nodes + domain nodes) based on [Agent-as-a-Graph (arxiv 2511.18194)](https://arxiv.org/html/2511.18194). Enable with `--embedding` or set `OPENAI_API_KEY` / `GEMINI_API_KEY`.
617
+
618
+ **How we tested**: The `llm_model_interaction` scenario in the A/B harness specifically tests this — the query "call LLM generate prompt GPT Claude Gemini" must surface the `llm` toolset in `_loadSuggestions`. A tag coverage bonus in hybrid search ensures tools where many query words match tags rank highest. For even more ambiguous queries, `smart_select_tools` lets an LLM pick the right tools semantically.
619
+
620
+ ---
621
+
622
+ #### Summary: research → solution → eval for each problem
623
+
624
+ | Problem | Research Source | Our Solution | Eval Method | Result |
625
+ |---|---|---|---|---|
626
+ | **Context bloat** (87K tokens) | Anthropic (85% reduction), Lunar.dev (~40-tool cap), SEP-1576 | Presets, TOON, compact mode, `instructions`, `smart_select_tools` | A/B harness token measurement | 60-95% reduction depending on layer |
627
+ | **Selection degradation** | Anthropic (+25pp), Dynamic ReAct (Search+Load wins) | 14-strategy hybrid search, intent pre-filter, LLM reranking | 28-scenario discovery accuracy | **100% accuracy** (18/18 domains) |
628
+ | **Static loading** | Dynamic ReAct, ToolScope (+34.6%), MCP spec | `--dynamic` flag, `load_toolset` / `unload_toolset` | A/B harness + live IDE test | **100% success**, <1ms load latency |
629
+ | **Client fragmentation** | MCP discussions, client bug trackers | `list_changed` + `call_loaded_tool` proxy | Server-side `tools/list` verification | Works on **all clients** |
630
+ | **Aggressive filtering** | SynapticLabs, SEP-1576, our own `llm` gap | Intent pre-filter, `smart_select_tools`, embeddings | `llm_model_interaction` scenario | LLM-powered selection solves the gap |
631
+
632
+ **Ablation study** (`scripts/ablation-test.ts`): We tested which strategies matter for each user segment by disabling them one at a time across 54 queries:
633
+
634
+ | Segment | R@5 Baseline | Most Critical Strategy | Impact When Removed |
635
+ |---|---|---|---|
636
+ | **New user** (vague, natural language) | 67% | Synonym expansion | 🔴 -17pp R@5 |
637
+ | **Experienced** (domain keywords) | 72% | All robust | ⚪ No single strategy >5pp |
638
+ | **Power user** (exact tool names) | 100% | None needed | ⚪ Keyword alone = 100% |
639
+
640
+ Key insight: new users need synonym expansion ("website" → seo, "AI" → llm) and fuzzy matching (typo tolerance). Power users need nothing beyond keyword matching. The remaining 33% new user gap is filled by `smart_select_tools` (LLM-powered).
641
+
642
+ Full methodology, per-scenario breakdown, ablation data, and research citations: [DYNAMIC_LOADING.md](./DYNAMIC_LOADING.md)
643
+
359
644
  ### Fine-grained control
360
645
 
361
646
  ```bash
@@ -371,51 +656,54 @@ npx nodebench-mcp --help
371
656
 
372
657
  ### Available toolsets
373
658
 
374
- | Toolset | Tools | What it covers |
375
- |---|---|---|
376
- | verification | 8 | Cycles, gaps, triple-verify, status |
377
- | eval | 6 | Eval runs, results, comparison, diff |
378
- | quality_gate | 4 | Gates, presets, history |
379
- | learning | 4 | Knowledge, search, record |
380
- | recon | 7 | Research, findings, framework checks, risk |
381
- | flywheel | 4 | Mandatory flywheel, promote, investigate |
382
- | bootstrap | 11 | Project setup, agents.md, self-implement, autonomous, test runner |
383
- | self_eval | 9 | Trajectory analysis, health reports, task banks, grading, contract compliance |
384
- | parallel | 13 | Task locks, roles, context budget, oracle, agent mailbox (point-to-point + broadcast) |
385
- | vision | 4 | Screenshot analysis, UI capture, diff |
386
- | ui_capture | 2 | Playwright-based capture |
387
- | web | 2 | Web search, URL fetch |
388
- | github | 3 | Repo search, analysis, monitoring |
389
- | docs | 4 | Documentation generation, reports |
390
- | local_file | 19 | Deterministic parsing (CSV/XLSX/PDF/DOCX/PPTX/ZIP/JSON/JSONL/TXT/OCR/audio) |
391
- | llm | 3 | LLM calling, extraction, benchmarking |
392
- | security | 3 | Dependency scanning, code analysis, terminal security scanning |
393
- | platform | 4 | Convex bridge: briefs, funding, research, publish |
394
- | research_writing | 8 | Academic paper polishing, translation, de-AI, logic check, captions, experiment analysis, reviewer simulation |
395
- | flicker_detection | 5 | Android flicker detection + SSIM tooling |
396
- | figma_flow | 4 | Figma flow analysis + rendering |
397
- | boilerplate | 2 | Scaffold NodeBench projects + status |
398
- | benchmark | 3 | Autonomous benchmark lifecycle (C-compiler pattern) |
399
- | session_memory | 3 | Compaction-resilient notes, attention refresh, context reload |
400
- | gaia_solvers | 6 | GAIA media image solvers (red/green deviation, polygon area, fraction quiz, bass clef, storage cost) |
401
- | toon | 2 | TOON encode/decode Token-Oriented Object Notation (~40% token savings) |
402
- | pattern | 2 | Session pattern mining + risk prediction from historical sequences |
403
- | git_workflow | 3 | Branch compliance, PR checklist review, merge gate enforcement |
404
- | seo | 5 | Technical SEO audit, page performance, content analysis, WordPress detection + updates |
405
- | voice_bridge | 4 | Voice pipeline design, config analysis, scaffold generation, latency benchmarking |
406
-
407
- Always included (regardless of gating) these 5 tools form the `meta` preset:
408
- - Meta: `findTools`, `getMethodology`
409
- - Discovery: `discover_tools`, `get_tool_quick_ref`, `get_workflow_chain`
410
-
411
- The `meta` preset loads **only** these 5 tools (0 domain tools). Agents use `discover_tools` to find what they need and self-escalate.
659
+ | Toolset | Tools | What it covers | In `default` |
660
+ |---|---|---|---|
661
+ | verification | 8 | Cycles, gaps, triple-verify, status | ✅ |
662
+ | eval | 6 | Eval runs, results, comparison, diff | ✅ |
663
+ | quality_gate | 4 | Gates, presets, history | ✅ |
664
+ | learning | 4 | Knowledge, search, record | ✅ |
665
+ | recon | 7 | Research, findings, framework checks, risk | ✅ |
666
+ | flywheel | 4 | Mandatory flywheel, promote, investigate | ✅ |
667
+ | security | 3 | Dependency scanning, code analysis, terminal security scanning | ✅ |
668
+ | **Total** | **44** | **Complete AI Flywheel** |
669
+ | boilerplate | 2 | Scaffold NodeBench projects + status | |
670
+ | bootstrap | 11 | Project setup, agents.md, self-implement, autonomous, test runner | — |
671
+ | self_eval | 9 | Trajectory analysis, health reports, task banks, grading, contract compliance | — |
672
+ | parallel | 13 | Task locks, roles, context budget, oracle, agent mailbox | — |
673
+ | vision | 4 | Screenshot analysis, UI capture, diff | — |
674
+ | ui_capture | 2 | Playwright-based capture | |
675
+ | web | 2 | Web search, URL fetch | — |
676
+ | github | 3 | Repo search, analysis, monitoring | — |
677
+ | docs | 4 | Documentation generation, reports | |
678
+ | local_file | 19 | Deterministic parsing (CSV/XLSX/PDF/DOCX/PPTX/ZIP/JSON/JSONL/TXT/OCR/audio) | |
679
+ | llm | 3 | LLM calling, extraction, benchmarking | |
680
+ | platform | 4 | Convex bridge: briefs, funding, research, publish | — |
681
+ | research_writing | 8 | Academic paper polishing, translation, de-AI, logic check, captions | — |
682
+ | flicker_detection | 5 | Android flicker detection + SSIM tooling | — |
683
+ | figma_flow | 4 | Figma flow analysis + rendering | — |
684
+ | benchmark | 3 | Autonomous benchmark lifecycle | |
685
+ | session_memory | 3 | Compaction-resilient notes, attention refresh, context reload | |
686
+ | gaia_solvers | 6 | GAIA media image solvers | |
687
+ | toon | 2 | TOON encode/decode (~40% token savings) | |
688
+ | pattern | 2 | Session pattern mining + risk prediction | |
689
+ | git_workflow | 3 | Branch compliance, PR checklist review, merge gate | |
690
+ | seo | 5 | Technical SEO audit, page performance, content analysis | |
691
+ | voice_bridge | 4 | Voice pipeline design, config analysis, scaffold | — |
692
+ | email | 4 | SMTP/IMAP email ingestion, search, delivery | |
693
+ | rss | 4 | RSS feed parsing and monitoring | — |
694
+ | architect | 3 | Architecture analysis and decision logging | — |
695
+
696
+ **Always included:** these 6 tools are available regardless of preset:
697
+ - `findTools`, `getMethodology`, `check_mcp_setup`, `discover_tools`, `get_tool_quick_ref`, `get_workflow_chain`
698
+
699
+ The `default` preset includes 39 tools (38 domain tools + 6 meta/discovery tools).
412
700
 
413
701
  ### TOON Format — Token Savings
414
702
 
415
- TOON (Token-Oriented Object Notation) is **on by default** since v2.14.1. Every tool response is TOON-encoded for ~40% fewer tokens vs JSON. Disable with `--no-toon` if your client can't handle non-JSON responses.
703
+ TOON (Token-Oriented Object Notation) is **on by default** for all presets since v2.14.1. Every tool response is TOON-encoded for ~40% fewer tokens vs JSON. Disable with `--no-toon` if your client can't handle non-JSON responses.
416
704
 
417
705
  ```bash
418
- # TOON on (default)
706
+ # TOON on (default, all presets)
419
707
  claude mcp add nodebench -- npx -y nodebench-mcp
420
708
 
421
709
  # TOON off
@@ -424,6 +712,67 @@ claude mcp add nodebench -- npx -y nodebench-mcp --no-toon
424
712
 
425
713
  Use the `toon_encode` and `toon_decode` tools to convert between TOON and JSON in your own workflows.
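
As a sketch of calling these tools programmatically, the snippet below round-trips a payload through `toon_encode` and `toon_decode` from an MCP client (see the client setup sketch earlier). The argument names `json` and `toon` are assumptions; check the real tool schemas with `get_tool_quick_ref` before relying on them.

```ts
import type { Client } from "@modelcontextprotocol/sdk/client/index.js";

// Assumes `client` is already connected, as in the earlier install sketch.
// Argument names below are hypothetical; inspect the real schemas first.
async function toonRoundTrip(client: Client): Promise<void> {
  const encoded = (await client.callTool({
    name: "toon_encode",
    arguments: { json: { runs: [{ id: 1, pass: true }, { id: 2, pass: false }] } },
  })) as { content?: Array<{ type: string; text?: string }> };

  // Tool results arrive as a content array; join the text blocks back together
  const toonText = (encoded.content ?? []).map((c) => c.text ?? "").join("\n");

  const decoded = await client.callTool({
    name: "toon_decode",
    arguments: { toon: toonText },
  });
  console.log(decoded.content);
}
```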
426
714
 
715
+ ### When to Use Each Preset
716
+
717
+ | Preset | Use when... | Example |
718
+ |---|---|---|
719
+ | **default** ⭐ | You want the complete AI Flywheel methodology with minimal token overhead | Most users — bug fixes, features, refactoring, code review |
720
+ | `full` | You need vision, UI capture, web search, GitHub, local file parsing, or GAIA solvers | Vision QA, web scraping, file processing, parallel agents, capability benchmarking |
721
+
722
+ ---
723
+
724
+ ## AI Flywheel — Complete Methodology
725
+
726
+ The AI Flywheel is documented in detail in [AI_FLYWHEEL.md](https://github.com/HomenShum/nodebench-ai/blob/main/AI_FLYWHEEL.md). Here's a summary:
727
+
728
+ ### Two Loops That Compound
729
+
730
+ ```
731
+ ┌─────────────────────────────────────────────────────────────────┐
732
+ │ OUTER LOOP: Eval-Driven Development │
733
+ │ │
734
+ │ Eval Batch ──→ Telemetry ──→ LLM Judge ──→ Suggestions │
735
+ │ │ │ │
736
+ │ │ ┌───────────────────────────┐ │ │
737
+ │ │ │ INNER LOOP: 6-Phase │ │ │
738
+ │ │ │ │ │ │
739
+ │ ▼ │ P1 Context Gather │ │ │
740
+ │ Regression │ P2 Gap Analysis ◄─────┼────┘ │
741
+ │ detected or │ P3 Implementation │ Judge suggestions │
742
+ │ new intent │ P4 Test & Validate ─────┼──► feeds back as │
743
+ │ added │ P5 Self-Closed Verify │ new eval cases │
744
+ │ │ │ P6 Document Learnings ──┼──► updates edge │
745
+ │ │ │ │ case registry │
746
+ │ ▼ └───────────────────────────┘ │
747
+ │ Re-run Eval Batch ──→ Score improved? ──→ Deploy │
748
+ │ │ │
749
+ │ NO → revert, try different approach │
750
+ └─────────────────────────────────────────────────────────────────┘
751
+ ```
752
+
753
+ ### Inner Loop → Outer Loop (Verification feeds Evals)
754
+
755
+ | 6-Phase output | Feeds into Eval Loop as |
756
+ |---|---|
757
+ | Phase 4 test cases (static, unit, integration, E2E) | New eval batch test cases with known-good expected outputs |
758
+ | Phase 5 subagent PASS/FAIL checklists | Eval scoring rubrics — each checklist item becomes a boolean eval criterion |
759
+ | Phase 6 edge cases & learnings | New adversarial eval cases targeting discovered failure modes |
760
+
761
+ ### Outer Loop → Inner Loop (Evals trigger Verification)
762
+
763
+ | Eval Loop output | Triggers 6-Phase as |
764
+ |---|---|
765
+ | Judge finds tool calling inefficiency | Phase 2 gap analysis scoped to that tool's implementation |
766
+ | Eval scores regress after deploy | Full Phase 1-6 cycle on the regression — treat as a production incident |
767
+ | Judge suggests new tool or prompt change | Phase 3 implementation following existing patterns, validated through Phase 4-5 |
768
+ | Recurring failure pattern across batch | Phase 1 deep dive into root cause (maybe upstream API changed, maybe schema drifted) |
769
+
770
+ ### When to Use Which
771
+
772
+ - **Building or changing a feature** → Run the 6-Phase inner loop. You're asking: *"Is this implementation correct?"*
773
+ - **Measuring system quality over time** → Run the Eval outer loop. You're asking: *"Is the system getting better?"*
774
+ - **Both, always** → Every 6-Phase run produces artifacts (test cases, edge cases, checklists) that expand the eval suite. Every eval regression triggers a 6-Phase investigation. They are not optional alternatives — they compound.
775
+
427
776
  ---
428
777
 
429
778
  ## Build from Source
@@ -449,6 +798,89 @@ Then use absolute path:
449
798
 
450
799
  ---
451
800
 
801
+ ## Quick Reference
802
+
803
+ ### Recommended Setup for Most Users
804
+
805
+ ```bash
806
+ # Claude Code / Windsurf — AI Flywheel core tools (39 tools, default)
807
+ claude mcp add nodebench -- npx -y nodebench-mcp
808
+ ```
809
+
810
+ ### What's in the default preset?
811
+
812
+ | Domain | Tools | What you get |
813
+ |---|---|---|
814
+ | verification | 8 | Cycles, gaps, triple-verify, status |
815
+ | eval | 6 | Eval runs, results, comparison, diff |
816
+ | quality_gate | 4 | Gates, presets, history |
817
+ | learning | 4 | Knowledge, search, record |
818
+ | recon | 7 | Research, findings, framework checks, risk |
819
+ | flywheel | 4 | Mandatory flywheel, promote, investigate |
820
+ | security | 3 | Dependency scanning, code analysis, terminal security scanning |
821
+ | boilerplate | 2 | Scaffold NodeBench projects + status |
822
+ | meta + discovery | 6 | findTools, getMethodology, check_mcp_setup, discover_tools, get_tool_quick_ref, get_workflow_chain |
823
+
824
+ **Total: 39 tools** — Complete AI Flywheel methodology with ~70% less token overhead than the `full` preset.
825
+
826
+ ### When to Upgrade Presets
827
+
828
+ | Need | Upgrade to |
829
+ |---|---|
830
+ | Everything: vision, UI capture, web search, GitHub, local file parsing, GAIA solvers | `--preset full` (175 tools) |
831
+
832
+ ### First Prompts to Try
833
+
834
+ ```
835
+ # See what's available
836
+ > Use getMethodology("overview") to see all workflows
837
+
838
+ # Before your next task — search for prior knowledge
839
+ > Use search_all_knowledge("what I'm about to work on")
840
+
841
+ # Run the full verification pipeline on a change
842
+ > Use getMethodology("mandatory_flywheel") and follow the 6 steps
843
+
844
+ # Find tools for a specific task
845
+ > Use discover_tools("verify my implementation")
846
+ ```
847
+
848
+ ### Key Methodology Topics
849
+
850
+ | Topic | Command |
851
+ |---|---|
852
+ | AI Flywheel overview | `getMethodology("overview")` |
853
+ | 6-phase verification | `getMethodology("mandatory_flywheel")` |
854
+ | Parallel agents | `getMethodology("parallel_agent_teams")` |
855
+ | Eval-driven development | `getMethodology("eval_driven_development")` |
856
+
857
+ ---
858
+
859
+ ## Security & Trust Boundaries
860
+
861
+ NodeBench MCP runs locally on your machine. Here's what it can and cannot access:
862
+
863
+ ### Data locality
864
+ - All persistent data is stored in **`~/.nodebench/`** (SQLite databases for tool logs, analytics, learnings, eval results)
865
+ - **No data is sent to external servers** unless you explicitly provide API keys and use tools that call external APIs (web search, LLM, GitHub, email)
866
+ - Analytics data never leaves your machine
867
+
868
+ ### File system access
869
+ - The `local_file` toolset (`--preset full` only) can **read files anywhere on your filesystem** that the Node.js process has permission to access. This includes CSV, PDF, XLSX, DOCX, PPTX, JSON, TXT, and ZIP files
870
+ - The `security` toolset runs static analysis on files you point it at
871
+ - Session notes and project bootstrapping write to the current working directory or `~/.nodebench/`
872
+ - **Trust boundary**: If you grant an AI agent access to NodeBench MCP with `--preset full`, that agent can read any file your user account can read. Use the `default` preset if you want to restrict file system access
873
+
874
+ ### API keys
875
+ - All API keys are read from environment variables (`GEMINI_API_KEY`, `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GITHUB_TOKEN`, etc.)
876
+ - No keys are hardcoded or logged
877
+ - Keys are passed to their respective provider APIs only — never to NodeBench servers (there are none)
878
+
879
+ ### SQL injection protection
880
+ - All database queries use parameterized statements — no string concatenation in SQL
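
For illustration, this is what that looks like in practice, using `better-sqlite3` as an example driver; the table and column names below are hypothetical, not NodeBench's actual schema.

```ts
import Database from "better-sqlite3";

const db = new Database("example.db");
db.exec("CREATE TABLE IF NOT EXISTS tool_call_log (tool TEXT, args TEXT, ts TEXT)");

// Parameterized statement: user-supplied values are bound, never concatenated into SQL.
const insert = db.prepare("INSERT INTO tool_call_log (tool, args, ts) VALUES (?, ?, ?)");
insert.run(
  "discover_tools",
  JSON.stringify({ query: "verify'); DROP TABLE tool_call_log;--" }),
  new Date().toISOString()
);

// The malicious-looking string is stored as data, not executed as SQL.
const rows = db.prepare("SELECT tool, args FROM tool_call_log WHERE tool = ?").all("discover_tools");
console.log(rows);
```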
881
+
882
+ ---
883
+
452
884
  ## Troubleshooting
453
885
 
454
886
  **"No search provider available"** — Set `GEMINI_API_KEY`, `OPENAI_API_KEY`, or `PERPLEXITY_API_KEY`