nodebench-mcp 2.11.0 → 2.14.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (67)
  1. package/NODEBENCH_AGENTS.md +809 -809
  2. package/README.md +443 -431
  3. package/STYLE_GUIDE.md +477 -477
  4. package/dist/__tests__/evalHarness.test.js +1 -1
  5. package/dist/__tests__/gaiaCapabilityAudioEval.test.js +9 -14
  6. package/dist/__tests__/gaiaCapabilityAudioEval.test.js.map +1 -1
  7. package/dist/__tests__/gaiaCapabilityEval.test.js +88 -14
  8. package/dist/__tests__/gaiaCapabilityEval.test.js.map +1 -1
  9. package/dist/__tests__/gaiaCapabilityFilesEval.test.js +9 -5
  10. package/dist/__tests__/gaiaCapabilityFilesEval.test.js.map +1 -1
  11. package/dist/__tests__/gaiaCapabilityMediaEval.test.js +165 -17
  12. package/dist/__tests__/gaiaCapabilityMediaEval.test.js.map +1 -1
  13. package/dist/__tests__/helpers/answerMatch.d.ts +36 -7
  14. package/dist/__tests__/helpers/answerMatch.js +224 -35
  15. package/dist/__tests__/helpers/answerMatch.js.map +1 -1
  16. package/dist/__tests__/helpers/textLlm.d.ts +1 -1
  17. package/dist/__tests__/presetRealWorldBench.test.d.ts +1 -0
  18. package/dist/__tests__/presetRealWorldBench.test.js +850 -0
  19. package/dist/__tests__/presetRealWorldBench.test.js.map +1 -0
  20. package/dist/__tests__/tools.test.js +20 -7
  21. package/dist/__tests__/tools.test.js.map +1 -1
  22. package/dist/__tests__/toolsetGatingEval.test.js +21 -11
  23. package/dist/__tests__/toolsetGatingEval.test.js.map +1 -1
  24. package/dist/db.js +21 -0
  25. package/dist/db.js.map +1 -1
  26. package/dist/index.js +424 -327
  27. package/dist/index.js.map +1 -1
  28. package/dist/tools/agentBootstrapTools.js +258 -258
  29. package/dist/tools/boilerplateTools.js +144 -144
  30. package/dist/tools/cCompilerBenchmarkTools.js +33 -33
  31. package/dist/tools/documentationTools.js +59 -59
  32. package/dist/tools/flywheelTools.js +6 -6
  33. package/dist/tools/gitWorkflowTools.d.ts +11 -0
  34. package/dist/tools/gitWorkflowTools.js +580 -0
  35. package/dist/tools/gitWorkflowTools.js.map +1 -0
  36. package/dist/tools/learningTools.js +26 -26
  37. package/dist/tools/localFileTools.d.ts +3 -0
  38. package/dist/tools/localFileTools.js +3164 -125
  39. package/dist/tools/localFileTools.js.map +1 -1
  40. package/dist/tools/metaTools.js +82 -0
  41. package/dist/tools/metaTools.js.map +1 -1
  42. package/dist/tools/parallelAgentTools.js +228 -0
  43. package/dist/tools/parallelAgentTools.js.map +1 -1
  44. package/dist/tools/patternTools.d.ts +13 -0
  45. package/dist/tools/patternTools.js +456 -0
  46. package/dist/tools/patternTools.js.map +1 -0
  47. package/dist/tools/reconTools.js +31 -31
  48. package/dist/tools/selfEvalTools.js +44 -44
  49. package/dist/tools/seoTools.d.ts +16 -0
  50. package/dist/tools/seoTools.js +866 -0
  51. package/dist/tools/seoTools.js.map +1 -0
  52. package/dist/tools/sessionMemoryTools.d.ts +15 -0
  53. package/dist/tools/sessionMemoryTools.js +348 -0
  54. package/dist/tools/sessionMemoryTools.js.map +1 -0
  55. package/dist/tools/toolRegistry.d.ts +4 -0
  56. package/dist/tools/toolRegistry.js +489 -0
  57. package/dist/tools/toolRegistry.js.map +1 -1
  58. package/dist/tools/toonTools.d.ts +15 -0
  59. package/dist/tools/toonTools.js +94 -0
  60. package/dist/tools/toonTools.js.map +1 -0
  61. package/dist/tools/verificationTools.js +41 -41
  62. package/dist/tools/visionTools.js +17 -17
  63. package/dist/tools/voiceBridgeTools.d.ts +15 -0
  64. package/dist/tools/voiceBridgeTools.js +1427 -0
  65. package/dist/tools/voiceBridgeTools.js.map +1 -0
  66. package/dist/tools/webTools.js +18 -18
  67. package/package.json +102 -101
@@ -1,809 +1,809 @@
1
- # NodeBench Agents Protocol
2
-
3
- This file provides the standard operating procedure for AI agents using the NodeBench MCP. Drop this into any repository and agents will auto-configure their workflow.
4
-
5
- Reference: [agents.md](https://agents.md/) standard.
6
-
7
- ---
8
-
9
- ## Quick Setup
10
-
11
- Add to `~/.claude/settings.json`:
12
-
13
- ```json
14
- {
15
- "mcpServers": {
16
- "nodebench": {
17
- "command": "npx",
18
- "args": ["-y", "nodebench-mcp"]
19
- }
20
- }
21
- }
22
- ```
23
-
24
- Restart Claude Code. 89+ tools available immediately.
25
-
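For per-repository setup, Claude Code can also load servers from a `.mcp.json` at the project root; assuming your Claude Code version supports project-scoped servers, the shape is the same as the global config:

```json
{
  "mcpServers": {
    "nodebench": {
      "command": "npx",
      "args": ["-y", "nodebench-mcp"]
    }
  }
}
```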
26
- ### Preset Selection
27
-
28
- By default all toolsets are enabled. Use `--preset` to start with a scoped subset:
29
-
30
- ```json
31
- {
32
- "mcpServers": {
33
- "nodebench": {
34
- "command": "npx",
35
- "args": ["-y", "nodebench-mcp", "--preset", "meta"]
36
- }
37
- }
38
- }
39
- ```
40
-
41
- The **meta** preset is the recommended front door for new agents: start with just 5 discovery tools, use `discover_tools` to find what you need, then self-escalate to a larger preset. See [Toolset Gating & Presets](#toolset-gating--presets) for the full breakdown.
42
-
43
- **→ Quick Refs:** After setup, run `getMethodology("overview")` | First task? See [Verification Cycle](#verification-cycle-workflow) | New to codebase? See [Environment Setup](#environment-setup) | Preset options: See [Toolset Gating & Presets](#toolset-gating--presets)
44
-
45
- ---
46
-
47
- ## The AI Flywheel (Mandatory)
48
-
49
- Every non-trivial change MUST go through this 7-step verification process before shipping.
50
-
51
- ### Step 1: Static Analysis
52
- ```
53
- tsc --noEmit
54
- ```
55
- Zero errors. Zero warnings. No exceptions.
56
-
57
- ### Step 2: Happy-Path Test
58
- Run the changed functionality with valid inputs. Confirm expected output.
59
-
60
- ### Step 3: Failure-Path Test
61
- Test each failure mode the code handles. Invalid inputs, edge cases, error states.
62
-
63
- ### Step 4: Gap Analysis
64
- Review the code for:
65
- - Dead code, unused variables
66
- - Missing integrations (new functions not wired to existing systems)
67
- - Logic that doesn't match stated intent
68
- - Hardcoded values that should be configurable
69
-
70
- ### Step 5: Fix and Re-Verify
71
- If any gap found: fix it, then restart from Step 1.
72
-
73
- ### Step 6: Live E2E Test (MANDATORY)
74
- **Before declaring done or publishing:**
75
- ```bash
76
- echo '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"YOUR_TOOL","arguments":{...}}}' | node dist/index.js
77
- ```
78
- Every new/modified tool MUST pass stdio E2E test. No exceptions.
79
-
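The one-liner above is easier to audit if the request is built as a variable first; this is only a sketch, with the tool name and arguments as placeholders:

```shell
# Build the JSON-RPC request up front so the payload can be inspected before
# it is piped to the server (YOUR_TOOL and the arguments are placeholders).
payload='{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"YOUR_TOOL","arguments":{}}}'
echo "$payload"
# echo "$payload" | node dist/index.js   # the actual stdio E2E invocation
```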
80
- For workflow-level changes (verification, eval, recon, quality gates, flywheel, or knowledge tools), also run the long-running open-source benchmark:
81
- ```bash
82
- npm --prefix packages/mcp-local run dataset:bfcl:refresh
83
- NODEBENCH_OPEN_DATASET_TASK_LIMIT=12 NODEBENCH_OPEN_DATASET_CONCURRENCY=6 npm --prefix packages/mcp-local run test:open-dataset
84
- npm --prefix packages/mcp-local run dataset:toolbench:refresh
85
- NODEBENCH_TOOLBENCH_TASK_LIMIT=6 NODEBENCH_TOOLBENCH_CONCURRENCY=3 npm --prefix packages/mcp-local run test:open-dataset:toolbench
86
- npm --prefix packages/mcp-local run dataset:swebench:refresh
87
- NODEBENCH_SWEBENCH_TASK_LIMIT=8 NODEBENCH_SWEBENCH_CONCURRENCY=4 npm --prefix packages/mcp-local run test:open-dataset:swebench
88
- ```
89
-
90
- ### Step 7: Document Learnings
91
- Record edge cases discovered. Update this file if needed.
92
-
93
- **Rule: No change ships without passing all 7 steps.**
94
-
95
- **→ Quick Refs:** Track progress with `start_verification_cycle` | Record findings with `record_learning` | Run gate with `run_quality_gate` | See [Post-Implementation Checklist](#post-implementation-checklist)
96
-
97
- ---
98
-
99
- ## Open-Source Long-Running MCP Benchmark
100
-
101
- Use open-source long-context tasks to validate real orchestration behavior under parallel load.
102
-
103
- - Dataset: `gorilla-llm/Berkeley-Function-Calling-Leaderboard`
104
- - Split: `BFCL_v3_multi_turn_long_context`
105
- - Source: `https://huggingface.co/datasets/gorilla-llm/Berkeley-Function-Calling-Leaderboard`
106
-
107
- Refresh local fixture:
108
- ```bash
109
- npm run mcp:dataset:refresh
110
- ```
111
-
112
- Run parallel subagent benchmark:
113
- ```bash
114
- NODEBENCH_OPEN_DATASET_TASK_LIMIT=12 NODEBENCH_OPEN_DATASET_CONCURRENCY=6 npm run mcp:dataset:test
115
- ```
116
-
117
- Run refresh + benchmark in one shot:
118
- ```bash
119
- npm run mcp:dataset:bench
120
- ```
121
-
122
- Second lane (ToolBench multi-tool instructions):
123
- - Dataset: `OpenBMB/ToolBench`
124
- - Split: `data_example/instruction (G1,G2,G3)`
125
- - Source: `https://github.com/OpenBMB/ToolBench`
126
-
127
- Refresh ToolBench fixture:
128
- ```bash
129
- npm run mcp:dataset:toolbench:refresh
130
- ```
131
-
132
- Run ToolBench parallel subagent benchmark:
133
- ```bash
134
- NODEBENCH_TOOLBENCH_TASK_LIMIT=6 NODEBENCH_TOOLBENCH_CONCURRENCY=3 npm run mcp:dataset:toolbench:test
135
- ```
136
-
137
- Run all public lanes:
138
- ```bash
139
- npm run mcp:dataset:bench:all
140
- ```
141
-
142
- Third lane (SWE-bench Verified long-horizon software tasks):
143
- - Dataset: `princeton-nlp/SWE-bench_Verified`
144
- - Split: `test`
145
- - Source: `https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified`
146
-
147
- Refresh SWE-bench fixture:
148
- ```bash
149
- npm run mcp:dataset:swebench:refresh
150
- ```
151
-
152
- Run SWE-bench parallel subagent benchmark:
153
- ```bash
154
- NODEBENCH_SWEBENCH_TASK_LIMIT=8 NODEBENCH_SWEBENCH_CONCURRENCY=4 npm run mcp:dataset:swebench:test
155
- ```
156
-
157
- Fourth lane (GAIA gated long-horizon tool-augmented tasks):
158
- - Dataset: `gaia-benchmark/GAIA` (gated)
159
- - Default config: `2023_level3`
160
- - Default split: `validation`
161
- - Source: `https://huggingface.co/datasets/gaia-benchmark/GAIA`
162
-
163
- Notes:
164
- - Fixture is written to `.cache/gaia` (gitignored). Do not commit GAIA question/answer content.
165
- - Refresh requires `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN` in your shell.
166
- - Python deps: `pandas`, `huggingface_hub`, `pyarrow` (or equivalent parquet engine).
167
-
168
- Refresh GAIA fixture:
169
- ```bash
170
- npm run mcp:dataset:gaia:refresh
171
- ```
172
-
173
- Run GAIA parallel subagent benchmark:
174
- ```bash
175
- NODEBENCH_GAIA_TASK_LIMIT=8 NODEBENCH_GAIA_CONCURRENCY=4 npm run mcp:dataset:gaia:test
176
- ```
177
-
178
- GAIA capability benchmark (accuracy: LLM-only vs LLM+tools):
179
- - This runs real model calls and web search. It is disabled by default and only intended for regression checks.
180
- - Uses Gemini by default. Ensure `GEMINI_API_KEY` is available (repo `.env.local` is loaded by the test).
181
- - Scoring fixture includes ground-truth answers and MUST remain under `.cache/gaia` (gitignored).
182
-
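For intuition, GAIA-style accuracy scoring reduces to normalized exact match against the ground-truth answer; a toy sketch (not the repo's actual `answerMatch` helper):

```shell
# Toy normalized exact-match scorer: lowercase, trim whitespace, drop a
# trailing period, then compare. Illustrative only.
normalize() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e 's/\.$//'
}
score() { [ "$(normalize "$1")" = "$(normalize "$2")" ] && echo 1 || echo 0; }
score "Paris." " paris"   # 1
```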
183
- Generate scoring fixture (local only, gated):
184
- ```bash
185
- npm run mcp:dataset:gaia:capability:refresh
186
- ```
187
-
188
- Run capability benchmark:
189
- ```bash
190
- NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:test
191
- ```
192
-
193
- GAIA capability benchmark (file-backed lane: PDF / XLSX / CSV):
194
- - This lane measures the impact of deterministic local parsing tools on GAIA tasks with attachments.
195
- - Fixture includes ground-truth answers and MUST remain under `.cache/gaia` (gitignored).
196
- - Attachments are copied into `.cache/gaia/data/<file_path>` for offline deterministic runs after the first download.
197
-
198
- Generate file-backed scoring fixture + download attachments (local only, gated):
199
- ```bash
200
- npm run mcp:dataset:gaia:capability:files:refresh
201
- ```
202
-
203
- Run file-backed capability benchmark:
204
- ```bash
205
- NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:files:test
206
- ```
207
-
208
- Modes:
209
- - Recommended (more stable): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag` (single deterministic extract + answer)
210
- - More realistic (higher variance): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (small tool loop)
211
- Web lane only: `NODEBENCH_GAIA_CAPABILITY_FORCE_WEB_SEARCH=1` and/or `NODEBENCH_GAIA_CAPABILITY_FORCE_FETCH_URL=1`
212
-
213
- Run all public lanes:
214
- ```bash
215
- npm run mcp:dataset:bench:all
216
- ```
217
-
218
- Run full lane suite (includes GAIA):
219
- ```bash
220
- npm run mcp:dataset:bench:full
221
- ```
222
-
223
- Implementation files:
224
- - `packages/mcp-local/src/__tests__/fixtures/generateBfclLongContextFixture.ts`
225
- - `packages/mcp-local/src/__tests__/fixtures/bfcl_v3_long_context.sample.json`
226
- - `packages/mcp-local/src/__tests__/openDatasetParallelEval.test.ts`
227
- - `packages/mcp-local/src/__tests__/fixtures/generateToolbenchInstructionFixture.ts`
228
- - `packages/mcp-local/src/__tests__/fixtures/toolbench_instruction.sample.json`
229
- - `packages/mcp-local/src/__tests__/openDatasetParallelEvalToolbench.test.ts`
230
- - `packages/mcp-local/src/__tests__/fixtures/generateSwebenchVerifiedFixture.ts`
231
- - `packages/mcp-local/src/__tests__/fixtures/swebench_verified.sample.json`
232
- - `packages/mcp-local/src/__tests__/openDatasetParallelEvalSwebench.test.ts`
233
- - `packages/mcp-local/src/__tests__/fixtures/generateGaiaLevel3Fixture.py`
234
- - `.cache/gaia/gaia_2023_level3_validation.sample.json`
235
- - `packages/mcp-local/src/__tests__/openDatasetParallelEvalGaia.test.ts`
236
- - `packages/mcp-local/src/__tests__/fixtures/generateGaiaCapabilityFixture.py`
237
- - `.cache/gaia/gaia_capability_2023_all_validation.sample.json`
238
- - `packages/mcp-local/src/__tests__/gaiaCapabilityEval.test.ts`
239
- - `packages/mcp-local/src/__tests__/fixtures/generateGaiaCapabilityFilesFixture.py`
240
- - `.cache/gaia/gaia_capability_files_2023_all_validation.sample.json`
241
- - `.cache/gaia/data/...` (local GAIA attachments; do not commit)
242
- - `packages/mcp-local/src/__tests__/gaiaCapabilityFilesEval.test.ts`
243
-
244
- Required tool chain per dataset task:
245
- - `run_recon`
246
- - `log_recon_finding`
247
- - `findTools`
248
- - `getMethodology`
249
- - `start_eval_run`
250
- - `record_eval_result`
251
- - `complete_eval_run`
252
- - `run_closed_loop`
253
- - `run_mandatory_flywheel`
254
- - `search_all_knowledge`
255
-
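A run's trajectory can be checked against this list mechanically; a hypothetical post-run sketch:

```shell
# Hypothetical check that a task trajectory called every required tool.
required="run_recon log_recon_finding findTools getMethodology start_eval_run record_eval_result complete_eval_run run_closed_loop run_mandatory_flywheel search_all_knowledge"
check_chain() {
  trajectory=" $1 "
  for t in $required; do
    case "$trajectory" in
      *" $t "*) ;;                        # tool was called
      *) echo "missing: $t"; return 1 ;;  # fail fast on the first gap
    esac
  done
  echo "complete"
}
check_chain "$required"   # complete
```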
256
- **→ Quick Refs:** Core process in [AI Flywheel](#the-ai-flywheel-mandatory) | Verification flow in [Verification Cycle](#verification-cycle-workflow) | Loop discipline in [Closed Loop Principle](#closed-loop-principle)
257
-
258
- ---
259
-
260
- ## MCP Tool Categories
261
-
262
- Use `getMethodology("overview")` to see all available workflows.
263
-
264
- | Category | Tools | When to Use |
265
- |----------|-------|-------------|
266
- | **Web** | `web_search`, `fetch_url` | Research, reading docs, market validation |
267
- | **Local Files** | `read_pdf_text`, `pdf_search_text`, `read_xlsx_file`, `xlsx_select_rows`, `xlsx_aggregate`, `read_csv_file`, `csv_select_rows`, `csv_aggregate`, `read_text_file`, `read_json_file`, `json_select`, `read_jsonl_file`, `zip_list_files`, `zip_read_text_file`, `zip_extract_file`, `read_docx_text`, `read_pptx_text` | Deterministic parsing and aggregation of local attachments (GAIA file-backed lane) |
268
- | **GitHub** | `search_github`, `analyze_repo` | Finding libraries, studying implementations |
269
- | **Verification** | `start_cycle`, `log_phase`, `complete_cycle` | Tracking the flywheel process |
270
- | **Eval** | `start_eval_run`, `log_test_result` | Test case management |
271
- | **Quality Gates** | `run_quality_gate`, `get_gate_history` | Pass/fail checkpoints |
272
- | **Learning** | `record_learning`, `search_all_knowledge` | Persistent knowledge base |
273
- | **Vision** | `analyze_screenshot`, `capture_ui_screenshot` | UI/UX verification |
274
- | **Bootstrap** | `discover_infrastructure`, `triple_verify`, `self_implement` | Self-setup, triple verification |
275
- | **Autonomous** | `assess_risk`, `decide_re_update`, `run_self_maintenance` | Risk-aware execution, self-maintenance |
276
- | **Parallel Agents** | `claim_agent_task`, `release_agent_task`, `list_agent_tasks`, `assign_agent_role`, `get_agent_role`, `log_context_budget`, `run_oracle_comparison`, `get_parallel_status` | Multi-agent coordination, task locking, role specialization, oracle testing |
277
- | **LLM** | `call_llm`, `extract_structured_data`, `benchmark_models` | LLM calling, structured extraction, model comparison |
278
- | **Security** | `scan_dependencies`, `run_code_analysis` | Dependency auditing, static code analysis |
279
- | **Platform** | `query_daily_brief`, `query_funding_entities`, `query_research_queue`, `publish_to_queue` | Convex platform bridge: intelligence, funding, research, publishing |
280
- | **Meta** | `findTools`, `getMethodology` | Discover tools, get workflow guides |
281
- | **Discovery** | `discover_tools`, `get_tool_quick_ref`, `get_workflow_chain` | Hybrid search, quick refs, workflow chains |
282
-
283
- Meta + Discovery tools (5 total) are **always included** regardless of preset. See [Toolset Gating & Presets](#toolset-gating--presets).
284
-
285
- **→ Quick Refs:** Find tools by keyword: `findTools({ query: "verification" })` | Hybrid search: `discover_tools({ query: "security" })` | Get workflow guide: `getMethodology({ topic: "..." })` | See [Methodology Topics](#methodology-topics) for all topics
286
-
287
- ---
288
-
289
- ## Toolset Gating & Presets
290
-
291
- NodeBench MCP supports 4 presets that control which domain toolsets are loaded at startup. Meta + Discovery tools (5 total) are **always included** on top of any preset.
292
-
293
- ### Preset Table
294
-
295
- | Preset | Domain Toolsets | Domain Tools | Total (with meta+discovery) | Use Case |
296
- |--------|----------------|-------------|----------------------------|----------|
297
- | **meta** | 0 | 0 | 5 | Discovery-only front door. Agents start here and self-escalate. |
298
- | **lite** | 7 | ~35 | ~40 | Lightweight verification-focused workflows. CI bots, quick checks. |
299
- | **core** | 16 | ~75 | ~80 | Full development workflow. Most agent sessions. |
300
- | **full** | all | 89+ | 94+ | Everything enabled. Benchmarking, exploration, advanced use. |
301
-
302
- ### Usage
303
-
304
- ```bash
305
- npx nodebench-mcp --preset meta # Discovery-only (5 tools)
306
- npx nodebench-mcp --preset lite # Verification + eval + recon + security
307
- npx nodebench-mcp --preset core # Full dev workflow without vision/parallel
308
- npx nodebench-mcp --preset full # All toolsets (default)
309
- npx nodebench-mcp --toolsets verification,eval,recon # Custom selection
310
- npx nodebench-mcp --exclude vision,ui_capture # Exclude specific toolsets
311
- ```
312
-
313
- ### The Meta Preset — Discovery-Only Front Door
314
-
315
- The **meta** preset loads zero domain tools. Agents start with only 5 tools:
316
-
317
- | Tool | Purpose |
318
- |------|---------|
319
- | `findTools` | Keyword search across all registered tools |
320
- | `getMethodology` | Get workflow guides by topic |
321
- | `discover_tools` | Hybrid search with relevance scoring (richer than findTools) |
322
- | `get_tool_quick_ref` | Quick reference card for any specific tool |
323
- | `get_workflow_chain` | Recommended tool sequence for common workflows |
324
-
325
- This is the recommended starting point for autonomous agents. The self-escalation pattern:
326
-
327
- ```
328
- 1. Start with --preset meta (5 tools)
329
- 2. discover_tools({ query: "what I need to do" }) // Find relevant tools
330
- 3. get_workflow_chain({ workflow: "verification" }) // Get the tool sequence
331
- 4. If needed tools are not loaded:
332
- → Restart with --preset core or --preset full
333
- → Or use --toolsets to add specific domains
334
- 5. Proceed with full workflow
335
- ```
336
-
337
- ### Preset Domain Breakdown
338
-
339
- **meta** (0 domains): No domain tools. Meta + Discovery only.
340
-
341
- **lite** (7 domains): `verification`, `eval`, `quality_gate`, `learning`, `recon`, `security`, `boilerplate`
342
-
343
- **core** (16 domains): Everything in lite plus `flywheel`, `bootstrap`, `self_eval`, `llm`, `platform`, `research_writing`, `flicker_detection`, `figma_flow`, `benchmark`
344
-
345
- **full** (all domains): All toolsets in TOOLSET_MAP including `ui_capture`, `vision`, `local_file`, `web`, `github`, `docs`, `parallel`, and everything in core.
346
-
347
- **→ Quick Refs:** Check current toolset: `findTools({ query: "*" })` | Self-escalate: restart with `--preset core` | See [MCP Tool Categories](#mcp-tool-categories) | CLI help: `npx nodebench-mcp --help`
348
-
349
- ---
350
-
351
- ## Verification Cycle Workflow
352
-
353
- Start every significant task with a verification cycle:
354
-
355
- ```
356
- 1. start_cycle({ goal: "Implement feature X" })
357
- 2. log_phase({ phase: "context", notes: "Researched existing patterns..." })
358
- 3. log_phase({ phase: "implementation", notes: "Added new function..." })
359
- 4. log_phase({ phase: "testing", notes: "All tests pass..." })
360
- 5. complete_cycle({ status: "success", summary: "Feature X shipped" })
361
- ```
362
-
363
- If blocked or failed:
364
- ```
365
- abandon_cycle({ reason: "Blocked by external dependency" })
366
- ```
367
-
368
- **→ Quick Refs:** Before starting: `search_all_knowledge({ query: "your task" })` | After completing: `record_learning({ ... })` | Run flywheel: See [AI Flywheel](#the-ai-flywheel-mandatory) | Track quality: See [Quality Gates](#quality-gates)
369
-
370
- ---
371
-
372
- ## Recording Learnings
373
-
374
- After discovering something useful, record it:
375
-
376
- ```
377
- record_learning({
378
- title: "Convex index predicates must use withIndex chaining",
379
- content: "Using .filter() after .withIndex() bypasses the index...",
380
- category: "convex",
381
- tags: ["database", "performance", "gotcha"]
382
- })
383
- ```
384
-
385
- Search later with:
386
- ```
387
- search_all_knowledge({ query: "convex index" })
388
- ```
389
-
390
- **→ Quick Refs:** Search before implementing: `search_all_knowledge` | `search_learnings` and `list_learnings` are DEPRECATED | Part of flywheel Step 7 | See [Verification Cycle](#verification-cycle-workflow)
391
-
392
- ---
393
-
394
- ## Quality Gates
395
-
396
- Before shipping, run quality gates:
397
-
398
- ```
399
- run_quality_gate({
400
- gateName: "deploy_readiness",
401
- results: [
402
- { rule: "tests_pass", passed: true },
403
- { rule: "no_type_errors", passed: true },
404
- { rule: "code_reviewed", passed: true }
405
- ]
406
- })
407
- ```
408
-
409
- Gate history tracks pass/fail over time.
410
-
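The verdict logic is plain aggregation: one failing rule fails the whole gate. A local sketch (the `rule:result` syntax here is illustrative, not the tool's API):

```shell
# Illustrative gate aggregation: any rule marked :false fails the whole gate.
gate_verdict() {
  for result in "$@"; do
    case "$result" in
      *:false) echo "fail"; return 1 ;;
    esac
  done
  echo "pass"
}
gate_verdict tests_pass:true no_type_errors:true code_reviewed:true   # pass
```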
411
- **→ Quick Refs:** Get preset rules: `get_gate_preset({ preset: "ui_ux_qa" })` | View history: `get_gate_history({ gateName: "..." })` | UI/UX gates: See [Vision](#vision-analysis) | Part of flywheel Step 5 re-verify
412
-
413
- ---
414
-
415
- ## Web Research Workflow
416
-
417
- For market research or tech evaluation:
418
-
419
- ```
420
- 1. web_search({ query: "MCP servers 2026", maxResults: 10 })
421
- 2. search_github({ query: "mcp typescript", minStars: 100 })
422
- 3. analyze_repo({ repoUrl: "owner/repo" }) // study top results
423
- 4. fetch_url({ url: "https://docs.example.com" }) // read their docs
424
- 5. record_learning({ ... }) // save key findings
425
- ```
426
-
427
- **→ Quick Refs:** Analyze repo structure: `analyze_repo` | Save findings: `record_learning` | Part of: `getMethodology({ topic: "project_ideation" })` | See [Recording Learnings](#recording-learnings)
428
-
429
- ---
430
-
431
- ## Project Ideation Workflow
432
-
433
- Before building anything new:
434
-
435
- ```
436
- getMethodology({ topic: "project_ideation" })
437
- ```
438
-
439
- This returns a 6-step process:
440
- 1. Define Concept
441
- 2. Research Market
442
- 3. Analyze Competition
443
- 4. Define Requirements
444
- 5. Plan Metrics
445
- 6. Gate Approval
446
-
447
- **→ Quick Refs:** Research tools: `web_search`, `search_github`, `analyze_repo` | Record requirements: `log_recon_finding` | Create baseline: `start_eval_run` | See [Web Research](#web-research-workflow)
448
-
449
- ---
450
-
451
- ## Closed Loop Principle
452
-
453
- **Never present changes without full local verification.**
454
-
455
- The loop:
456
- 1. Compile. Build clean.
457
- 2. Lint. Style clean. No warnings.
458
- 3. Test. Run automated suites.
459
- 4. Self-debug. If 1-3 fail: read logs, hypothesize, fix, restart loop.
460
-
461
- Only when all green: present to user.
462
-
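The gating semantics can be sketched with stand-in stubs (substitute the real `tsc --noEmit`, lint, and test commands); later stages never run once an earlier one fails:

```shell
# Stub stages stand in for the real compile/lint/test commands.
compile()   { true; }
lint()      { true; }
run_tests() { true; }
closed_loop() {
  if compile && lint && run_tests; then
    echo "all green"            # only now is it safe to present to the user
  else
    echo "self-debug and restart"
  fi
}
closed_loop   # all green
```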
463
- **→ Quick Refs:** Track loop: `run_closed_loop({ ... })` | Part of flywheel Steps 1-5 | See [AI Flywheel](#the-ai-flywheel-mandatory) | After loop: See [Post-Implementation Checklist](#post-implementation-checklist)
464
-
465
- ---
466
-
467
- ## Environment Setup
468
-
469
- Check your environment:
470
- ```
471
- setup_local_env({ checkSdks: true })
472
- ```
473
-
474
- Returns:
475
- - Node/npm versions
476
- - Missing API keys
477
- - Recommended SDK installations
478
- - Actionable next steps
479
-
480
- **→ Quick Refs:** After setup: `getMethodology("overview")` | Check vision: `discover_vision_env()` | See [API Keys](#api-keys-optional) | Then: See [Verification Cycle](#verification-cycle-workflow)
481
-
482
- ---
483
-
484
- ## API Keys (Optional)
485
-
486
- Set these for enhanced functionality:
487
-
488
- | Key | Purpose |
489
- |-----|---------|
490
- | `GEMINI_API_KEY` | Web search (Google grounding), vision analysis |
491
- | `OPENAI_API_KEY` | Alternative search/vision provider |
492
- | `GITHUB_TOKEN` | Higher rate limits (5000/hr vs 60/hr) |
493
- | `ANTHROPIC_API_KEY` | Alternative vision provider |
494
-
495
- **→ Quick Refs:** Check what's available: `setup_local_env({ checkSdks: true })` | Vision capabilities: `discover_vision_env()` | See [Environment Setup](#environment-setup)
496
-
497
- ---
498
-
499
- ## Vision Analysis
500
-
501
- For UI/UX verification:
502
-
503
- ```
504
- 1. capture_ui_screenshot({ url: "http://localhost:3000", viewport: "desktop" })
505
- 2. analyze_screenshot({ imageBase64: "...", prompt: "Check accessibility" })
506
- 3. capture_responsive_suite({ url: "...", label: "homepage" })
507
- ```
508
-
509
- **→ Quick Refs:** Check capabilities: `discover_vision_env()` | UI QA methodology: `getMethodology({ topic: "ui_ux_qa" })` | Agentic vision: `getMethodology({ topic: "agentic_vision" })` | See [Quality Gates](#quality-gates)
510
-
511
- ---
512
-
513
- ## Post-Implementation Checklist
514
-
515
- After every implementation, answer these 3 questions:
516
-
517
- 1. **MCP gaps?** — Were all relevant tools called? Any unexpected results?
518
- 2. **Implementation gaps?** — Dead code? Missing integrations? Hardcoded values?
519
- 3. **Flywheel complete?** — All 7 steps passed including E2E test?
520
-
521
- If any answer reveals a gap: fix it before proceeding.
522
-
523
- **→ Quick Refs:** Run self-check: `run_self_maintenance({ scope: "quick" })` | Record learnings: `record_learning` | Update docs: `update_agents_md` | See [AI Flywheel](#the-ai-flywheel-mandatory)
524
-
525
- ---
526
-
527
- ## Agent Self-Bootstrap System
528
-
529
- For agents to self-configure and validate against authoritative sources.
530
-
531
- ### 1. Discover Existing Infrastructure
532
- ```
533
- discover_infrastructure({
534
- categories: ["agent_loop", "telemetry", "evaluation", "verification"],
535
- depth: "thorough"
536
- })
537
- ```
538
-
539
- Returns: discovered patterns, missing components, bootstrap plan.
540
-
541
- ### 2. Triple Verification (with Source Citations)
542
-
543
- Run 3-layer verification with authoritative sources:
544
-
545
- ```
546
- triple_verify({
547
- target: "my-feature",
548
- scope: "full",
549
- includeWebSearch: true,
550
- generateInstructions: true
551
- })
552
- ```
553
-
554
- **V1: Internal Analysis** — Checks codebase patterns
555
- **V2: External Validation** — Cross-references Anthropic, OpenAI, LangChain, MCP spec
556
- **V3: Synthesis** — Generates recommendations with source citations
557
-
558
- ### 3. Self-Implement Missing Components
559
- ```
560
- self_implement({
561
- component: "telemetry", // or: agent_loop, evaluation, verification, multi_channel
562
- dryRun: true
563
- })
564
- ```
565
-
566
- Generates production-ready templates based on industry patterns.
567
-
568
- ### 4. Generate Self-Instructions
569
- ```
570
- generate_self_instructions({
571
- format: "skills_md", // or: rules_md, guidelines, claude_md
572
- includeExternalSources: true
573
- })
574
- ```
575
-
576
- Creates persistent instructions with authoritative source citations.
577
-
578
- ### 5. Multi-Channel Information Gathering
579
- ```
580
- connect_channels({
581
- channels: ["web", "github", "slack", "docs"],
582
- query: "agent verification patterns",
583
- aggressive: true
584
- })
585
- ```
586
-
587
- Aggregates findings from multiple sources.
588
-
589
- ### Authoritative Sources (Tier 1)
590
- - https://www.anthropic.com/research/building-effective-agents
591
- - https://openai.github.io/openai-agents-python/
592
- - https://www.langchain.com/langgraph
593
- - https://modelcontextprotocol.io/specification/2025-11-25
594
-
595
- **→ Quick Refs:** Full methodology: `getMethodology({ topic: "agent_bootstrap" })` | After bootstrap: See [Autonomous Maintenance](#autonomous-self-maintenance-system) | Before implementing: `assess_risk` | See [Triple Verification](#2-triple-verification-with-source-citations)
596
-
597
- ---
598
-
599
- ## Autonomous Self-Maintenance System
600
-
601
- Aggressive autonomous self-management with risk-aware execution. Based on OpenClaw patterns and Ralph Wiggum stop-hooks.
602
-
603
- ### 1. Risk-Tiered Execution
604
-
605
- Before any action, assess its risk tier:
606
-
607
- ```
608
- assess_risk({ action: "push to remote" })
609
- ```
610
-
611
- Risk tiers:
612
- - **Low**: Reading, analyzing, searching — auto-approve
613
- - **Medium**: Writing local files, running tests — log and proceed
614
- - **High**: Pushing to remote, posting externally — require confirmation
615
-
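A rough local sketch of this triage (the keyword heuristics below are hypothetical; the real `assess_risk` tool does the actual classification):

```shell
# Hypothetical keyword triage mirroring the three tiers above.
classify_risk() {
  case "$1" in
    *push*|*publish*|*post*|*deploy*) echo "high" ;;    # external side effects
    *write*|*test*|*run*)             echo "medium" ;;  # local side effects
    *)                                echo "low" ;;     # read-only
  esac
}
classify_risk "push to remote"   # high
```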
616
- ### 2. Re-Update Before Create
617
-
618
- **CRITICAL:** Before creating new files, check if updating existing is better:
619
-
620
- ```
621
- decide_re_update({
622
- targetContent: "New agent instructions",
623
- contentType: "instructions",
624
- existingFiles: ["AGENTS.md", "README.md"]
625
- })
626
- ```
627
-
628
- This prevents file sprawl and maintains single source of truth.
629
-
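The heuristic can be sketched locally: if an existing file already covers the topic, update it instead of creating a new one (the file names and content below are demo values):

```shell
# Hypothetical sketch of the re-update-before-create decision.
should_update_existing() {
  topic=$1; shift
  for f in "$@"; do
    if [ -f "$f" ] && grep -qi "$topic" "$f"; then
      echo "$f"; return 0       # update this file
    fi
  done
  echo "create new"; return 1    # nothing covers the topic yet
}
printf 'Standard agent instructions live here.\n' > /tmp/demo_agents.md
should_update_existing "instructions" /tmp/demo_agents.md /tmp/absent.md   # /tmp/demo_agents.md
```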
630
- ### 3. Self-Maintenance Cycles
631
-
632
- Run periodic self-checks:
633
-
634
- ```
635
- run_self_maintenance({
636
- scope: "standard", // quick | standard | thorough
637
- autoFix: false,
638
- dryRun: true
639
- })
640
- ```
641
-
642
- Checks: TypeScript compilation, documentation sync, tool counts, test coverage.
643
-
644
- ### 4. Directory Scaffolding (OpenClaw Style)
645
-
646
- When adding infrastructure, use standardized scaffolding:
647
-
648
- ```
649
- scaffold_directory({
650
- component: "agent_loop", // or: telemetry, evaluation, multi_channel, etc.
651
- includeTests: true,
652
- dryRun: true
653
- })
654
- ```
655
-
656
- Creates organized subdirectories with proper test structure.
657
-
658
- ### 5. Autonomous Loops with Guardrails
659
-
660
- For multi-step autonomous tasks, use controlled loops:
661
-
662
- ```
663
- run_autonomous_loop({
664
- goal: "Verify all tools pass static analysis",
665
- maxIterations: 5,
666
- maxDurationMs: 60000,
667
- stopOnFirstFailure: true
668
- })
669
- ```
670
-
671
- Implements Ralph Wiggum pattern with checkpoints and stop conditions.
672
-
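The pattern reduces to a bounded loop with a hard iteration cap and an explicit stop condition; a toy stand-in (`check_step` is a stub that succeeds on its third try):

```shell
# Toy bounded loop in the spirit of run_autonomous_loop: hard iteration cap
# plus a stop condition. check_step is a stub that passes on attempt 3.
attempt=0
check_step() { attempt=$((attempt + 1)); [ "$attempt" -ge 3 ]; }
max_iterations=5
i=0
while [ "$i" -lt "$max_iterations" ]; do
  i=$((i + 1))
  if check_step; then
    echo "goal reached on iteration $i"
    break
  fi
done
```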
673
- **→ Quick Refs:** Full methodology: `getMethodology({ topic: "autonomous_maintenance" })` | Before actions: `assess_risk` | Before new files: `decide_re_update` | Scaffold structure: `scaffold_directory` | See [Self-Bootstrap](#agent-self-bootstrap-system)
674
-
675
- ---
676
-
677
- ## Methodology Topics
678
-
679
- Available via `getMethodology({ topic: "..." })`:
680
-
681
- | Topic | Description | Quick Ref |
682
- |-------|-------------|-----------|
683
- | `overview` | See all methodologies | Start here |
684
- | `verification` | 6-phase development cycle | [AI Flywheel](#the-ai-flywheel-mandatory) |
685
- | `eval` | Test case management | [Quality Gates](#quality-gates) |
686
- | `flywheel` | Continuous improvement loop | [AI Flywheel](#the-ai-flywheel-mandatory) |
687
- | `mandatory_flywheel` | Required verification for changes | [AI Flywheel](#the-ai-flywheel-mandatory) |
688
- | `reconnaissance` | Codebase discovery | [Self-Bootstrap](#agent-self-bootstrap-system) |
689
- | `quality_gates` | Pass/fail checkpoints | [Quality Gates](#quality-gates) |
690
- | `ui_ux_qa` | Frontend verification | [Vision Analysis](#vision-analysis) |
691
- | `agentic_vision` | AI-powered visual QA | [Vision Analysis](#vision-analysis) |
692
- | `closed_loop` | Build/test before presenting | [Closed Loop](#closed-loop-principle) |
693
- | `learnings` | Knowledge persistence | [Recording Learnings](#recording-learnings) |
694
- | `project_ideation` | Validate ideas before building | [Project Ideation](#project-ideation-workflow) |
695
- | `tech_stack_2026` | Dependency management | [Environment Setup](#environment-setup) |
696
- | `agents_md_maintenance` | Keep docs in sync | [Auto-Update](#auto-update-this-file) |
697
- | `agent_bootstrap` | Self-discover, triple verify | [Self-Bootstrap](#agent-self-bootstrap-system) |
698
- | `autonomous_maintenance` | Risk-tiered execution | [Autonomous Maintenance](#autonomous-self-maintenance-system) |
699
- | `parallel_agent_teams` | Multi-agent coordination, task locking, oracle testing | [Parallel Agent Teams](#parallel-agent-teams) |
700
- | `self_reinforced_learning` | Trajectory analysis, self-eval, improvement recs | [Self-Reinforced Learning](#self-reinforced-learning-loop) |
701
- | `toolset_gating` | 4 presets (meta, lite, core, full) and self-escalation | [Toolset Gating & Presets](#toolset-gating--presets) |
702
-
703
- **→ Quick Refs:** Find tools: `findTools({ query: "..." })` | Get any methodology: `getMethodology({ topic: "..." })` | See [MCP Tool Categories](#mcp-tool-categories)
704
-
705
- ---
706
-
707
- ## Parallel Agent Teams
708
-
709
- Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www.anthropic.com/engineering/building-c-compiler) (Feb 2026).
710
-
711
- Run multiple AI agents in parallel on a shared codebase with coordination via task locking, role specialization, context budget management, and oracle-based testing.
712
-
713
- ### Quick Start — Parallel Agents
714
-
715
- ```
716
- 1. get_parallel_status({ includeHistory: true }) // Orient: what's happening?
717
- 2. assign_agent_role({ role: "implementer" }) // Specialize
718
- 3. claim_agent_task({ taskKey: "fix_auth" }) // Lock task
719
- 4. ... do work ...
720
- 5. log_context_budget({ eventType: "test_output", tokensUsed: 5000 }) // Track budget
721
- 6. run_oracle_comparison({ testLabel: "auth_output", actualOutput: "...", expectedOutput: "...", oracleSource: "prod_v2" })
722
- 7. release_agent_task({ taskKey: "fix_auth", status: "completed", progressNote: "Fixed JWT, added tests" })
723
- ```
724
-
725
- ### Predefined Agent Roles
726
-
727
- | Role | Focus |
728
- |------|-------|
729
- | `implementer` | Primary feature work. Picks failing tests, implements fixes. |
730
- | `dedup_reviewer` | Finds and coalesces duplicate implementations. |
731
- | `performance_optimizer` | Profiles bottlenecks, optimizes hot paths. |
732
- | `documentation_maintainer` | Keeps READMEs and progress files in sync. |
733
- | `code_quality_critic` | Structural improvements, pattern enforcement. |
734
- | `test_writer` | Writes targeted tests for edge cases and failure modes. |
735
- | `security_auditor` | Audits for vulnerabilities, logs CRITICAL gaps. |
736
-
737
- ### Key Patterns (from Anthropic blog)
738
-
739
- - **Task Locking**: Claim before working. If two agents try the same task, the second picks a different one.
740
- - **Context Window Budget**: Do NOT print thousands of useless bytes. Pre-compute summaries. Use `--fast` mode (a 1-10% random sample) for large test suites. Log errors with an `ERROR` prefix on the same line so they are easy to grep.
741
- - **Oracle Testing**: Compare output against known-good reference. Each failing comparison is an independent work item for a parallel agent.
742
- - **Time Blindness**: Agents can't tell time. Print progress infrequently. Use deterministic random sampling per-agent but randomized across VMs.
743
- - **Progress Files**: Maintain running docs of status, failed approaches, and remaining tasks. Fresh agent sessions read these to orient.
744
- - **Delta Debugging**: When tests pass individually but fail together, split the set in half to narrow down the minimal failing combination.
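The delta-debugging step can be sketched as a simple bisection. This is an illustrative minimal version (full ddmin also handles failures that span both halves); `failsTogether` is a hypothetical predicate that runs a set of tests together and reports whether the combination fails:

```javascript
// Bisect a failing test combination toward a smaller failing subset.
// failsTogether(tests) runs the given tests together and returns true on failure.
function minimizeFailing(tests, failsTogether) {
  let suspects = tests;
  while (suspects.length > 2) {
    const mid = Math.floor(suspects.length / 2);
    const left = suspects.slice(0, mid);
    const right = suspects.slice(mid);
    if (failsTogether(left)) suspects = left;
    else if (failsTogether(right)) suspects = right;
    else break; // failure spans both halves; stop with the current set
  }
  return suspects;
}
```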
745
-
746
- ### Bootstrap for External Repos
747
-
748
- When nodebench-mcp is connected to a project that lacks parallel agent infrastructure, it can auto-detect gaps and scaffold everything needed:
749
-
750
- ```
751
- 1. bootstrap_parallel_agents({ projectRoot: "/path/to/their/repo", dryRun: true })
752
- // Scans 7 categories: task coordination, roles, oracle, context budget,
753
- // progress files, AGENTS.md parallel section, git worktrees
754
-
755
- 2. bootstrap_parallel_agents({ projectRoot: "...", dryRun: false, techStack: "TypeScript/React" })
756
- // Creates .parallel-agents/ dir, progress.md, roles.json, lock dirs, oracle dirs
757
-
758
- 3. generate_parallel_agents_md({ techStack: "TypeScript/React", projectName: "their-project", maxAgents: 4 })
759
- // Generates portable AGENTS.md section — paste into their repo
760
-
761
- 4. Run the 6-step flywheel plan returned by the bootstrap tool to verify
762
- 5. Fix any issues, re-verify
763
- 6. record_learning({ key: "bootstrap_their_project", content: "...", category: "pattern" })
764
- ```
765
-
766
- The generated AGENTS.md section is framework-agnostic and works with any AI agent (Claude, GPT, etc.). It includes:
767
- - Task locking protocol (file-based, no dependencies)
768
- - Role definitions and assignment guide
769
- - Oracle testing workflow with idiomatic examples
770
- - Context budget rules
771
- - Progress file protocol
772
- - Anti-patterns to avoid
773
- - Optional nodebench-mcp tool mapping table
774
-
775
- ### MCP Prompts for Parallel Agent Teams
776
-
777
- - `parallel-agent-team` — Full team setup with role assignment and task breakdown
778
- - `oracle-test-harness` — Oracle-based testing setup for a component
779
- - `bootstrap-parallel-agents` — Detect and scaffold parallel agent infra for any external repo
780
-
781
- **→ Quick Refs:** Full methodology: `getMethodology({ topic: "parallel_agent_teams" })` | Find parallel tools: `findTools({ category: "parallel_agents" })` | Bootstrap external repo: `bootstrap_parallel_agents({ projectRoot: "..." })` | See [AI Flywheel](#the-ai-flywheel-mandatory)
782
-
783
- ---
784
-
785
- ## Auto-Update This File
786
-
787
- Agents can self-update this file:
788
-
789
- ```
790
- update_agents_md({
791
- operation: "update_section",
792
- section: "Custom Section",
793
- content: "New content here...",
794
- projectRoot: "/path/to/project"
795
- })
796
- ```
797
-
798
- Or read current structure:
799
- ```
800
- update_agents_md({ operation: "read", projectRoot: "/path/to/project" })
801
- ```
802
-
803
- **→ Quick Refs:** Before updating: `decide_re_update({ contentType: "instructions", ... })` | After updating: Run flywheel Steps 1-7 | See [Re-Update Before Create](#2-re-update-before-create)
804
-
805
- ---
806
-
807
- ## License
808
-
809
- MIT — Use freely in any project.
1
+ # NodeBench Agents Protocol
2
+
3
+ This file provides the standard operating procedure for AI agents using the NodeBench MCP. Drop this into any repository and agents will auto-configure their workflow.
4
+
5
+ Reference: [agents.md](https://agents.md/) standard.
6
+
7
+ ---
8
+
9
+ ## Quick Setup
10
+
11
+ Add to `~/.claude/settings.json`:
12
+
13
+ ```json
14
+ {
15
+ "mcpServers": {
16
+ "nodebench": {
17
+ "command": "npx",
18
+ "args": ["-y", "nodebench-mcp"]
19
+ }
20
+ }
21
+ }
22
+ ```
23
+
24
+ Restart Claude Code. 89+ tools available immediately.
25
+
26
+ ### Preset Selection
27
+
28
+ By default all toolsets are enabled. Use `--preset` to start with a scoped subset:
29
+
30
+ ```json
31
+ {
32
+ "mcpServers": {
33
+ "nodebench": {
34
+ "command": "npx",
35
+ "args": ["-y", "nodebench-mcp", "--preset", "meta"]
36
+ }
37
+ }
38
+ }
39
+ ```
40
+
41
+ The **meta** preset is the recommended front door for new agents: start with just 5 discovery tools, use `discover_tools` to find what you need, then self-escalate to a larger preset. See [Toolset Gating & Presets](#toolset-gating--presets) for the full breakdown.
42
+
43
+ **→ Quick Refs:** After setup, run `getMethodology("overview")` | First task? See [Verification Cycle](#verification-cycle-workflow) | New to codebase? See [Environment Setup](#environment-setup) | Preset options: See [Toolset Gating & Presets](#toolset-gating--presets)
44
+
45
+ ---
46
+
47
+ ## The AI Flywheel (Mandatory)
48
+
49
+ Every non-trivial change MUST go through this 7-step verification process before shipping.
50
+
51
+ ### Step 1: Static Analysis
52
+ ```
53
+ tsc --noEmit
54
+ ```
55
+ Zero errors. Zero warnings. No exceptions.
56
+
57
+ ### Step 2: Happy-Path Test
58
+ Run the changed functionality with valid inputs. Confirm expected output.
59
+
60
+ ### Step 3: Failure-Path Test
61
+ Test each failure mode the code handles. Invalid inputs, edge cases, error states.
62
+
63
+ ### Step 4: Gap Analysis
64
+ Review the code for:
65
+ - Dead code, unused variables
66
+ - Missing integrations (new functions not wired to existing systems)
67
+ - Logic that doesn't match stated intent
68
+ - Hardcoded values that should be configurable
69
+
70
+ ### Step 5: Fix and Re-Verify
71
+ If any gap found: fix it, then restart from Step 1.
72
+
73
+ ### Step 6: Live E2E Test (MANDATORY)
74
+ **Before declaring done or publishing:**
75
+ ```bash
76
+ echo '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"YOUR_TOOL","arguments":{...}}}' | node dist/index.js
77
+ ```
78
+ Every new/modified tool MUST pass stdio E2E test. No exceptions.
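A small helper for constructing that request envelope; the tool name and arguments below are examples only:

```javascript
// Build a JSON-RPC 2.0 tools/call request for piping into `node dist/index.js`.
function toolCallRequest(name, args, id = 1) {
  return JSON.stringify({
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  });
}

console.log(toolCallRequest("findTools", { query: "verification" }));
```

Pipe the printed line into the built server on stdin, as in the command above.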
79
+
80
+ For workflow-level changes (verification, eval, recon, quality gates, flywheel, or knowledge tools), also run the long-running open-source benchmarks:
81
+ ```bash
82
+ npm --prefix packages/mcp-local run dataset:bfcl:refresh
83
+ NODEBENCH_OPEN_DATASET_TASK_LIMIT=12 NODEBENCH_OPEN_DATASET_CONCURRENCY=6 npm --prefix packages/mcp-local run test:open-dataset
84
+ npm --prefix packages/mcp-local run dataset:toolbench:refresh
85
+ NODEBENCH_TOOLBENCH_TASK_LIMIT=6 NODEBENCH_TOOLBENCH_CONCURRENCY=3 npm --prefix packages/mcp-local run test:open-dataset:toolbench
86
+ npm --prefix packages/mcp-local run dataset:swebench:refresh
87
+ NODEBENCH_SWEBENCH_TASK_LIMIT=8 NODEBENCH_SWEBENCH_CONCURRENCY=4 npm --prefix packages/mcp-local run test:open-dataset:swebench
88
+ ```
89
+
90
+ ### Step 7: Document Learnings
91
+ Record edge cases discovered. Update this file if needed.
92
+
93
+ **Rule: No change ships without passing all 7 steps.**
94
+
95
+ **→ Quick Refs:** Track progress with `start_verification_cycle` | Record findings with `record_learning` | Run gate with `run_quality_gate` | See [Post-Implementation Checklist](#post-implementation-checklist)
96
+
97
+ ---
98
+
99
+ ## Open-Source Long-Running MCP Benchmark
100
+
101
+ Use open-source long-context tasks to validate real orchestration behavior under parallel load.
102
+
103
+ - Dataset: `gorilla-llm/Berkeley-Function-Calling-Leaderboard`
104
+ - Split: `BFCL_v3_multi_turn_long_context`
105
+ - Source: `https://huggingface.co/datasets/gorilla-llm/Berkeley-Function-Calling-Leaderboard`
106
+
107
+ Refresh local fixture:
108
+ ```bash
109
+ npm run mcp:dataset:refresh
110
+ ```
111
+
112
+ Run parallel subagent benchmark:
113
+ ```bash
114
+ NODEBENCH_OPEN_DATASET_TASK_LIMIT=12 NODEBENCH_OPEN_DATASET_CONCURRENCY=6 npm run mcp:dataset:test
115
+ ```
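The `*_TASK_LIMIT` and `*_CONCURRENCY` variables cap how many tasks run and how many are in flight at once. The concurrency cap behaves like a generic bounded-concurrency runner; a sketch of the idea (not the harness's actual code):

```javascript
// Run async task factories with at most `limit` in flight at a time.
async function runWithConcurrency(tasks, limit) {
  const results = [];
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const i = next++; // single-threaded JS: no race between check and increment
      results[i] = await tasks[i]();
    }
  }
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, worker);
  await Promise.all(workers);
  return results;
}
```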
116
+
117
+ Run refresh + benchmark in one shot:
118
+ ```bash
119
+ npm run mcp:dataset:bench
120
+ ```
121
+
122
+ Second lane (ToolBench multi-tool instructions):
123
+ - Dataset: `OpenBMB/ToolBench`
124
+ - Split: `data_example/instruction (G1,G2,G3)`
125
+ - Source: `https://github.com/OpenBMB/ToolBench`
126
+
127
+ Refresh ToolBench fixture:
128
+ ```bash
129
+ npm run mcp:dataset:toolbench:refresh
130
+ ```
131
+
132
+ Run ToolBench parallel subagent benchmark:
133
+ ```bash
134
+ NODEBENCH_TOOLBENCH_TASK_LIMIT=6 NODEBENCH_TOOLBENCH_CONCURRENCY=3 npm run mcp:dataset:toolbench:test
135
+ ```
136
+
137
+ Run all public lanes:
138
+ ```bash
139
+ npm run mcp:dataset:bench:all
140
+ ```
141
+
142
+ Third lane (SWE-bench Verified long-horizon software tasks):
143
+ - Dataset: `princeton-nlp/SWE-bench_Verified`
144
+ - Split: `test`
145
+ - Source: `https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified`
146
+
147
+ Refresh SWE-bench fixture:
148
+ ```bash
149
+ npm run mcp:dataset:swebench:refresh
150
+ ```
151
+
152
+ Run SWE-bench parallel subagent benchmark:
153
+ ```bash
154
+ NODEBENCH_SWEBENCH_TASK_LIMIT=8 NODEBENCH_SWEBENCH_CONCURRENCY=4 npm run mcp:dataset:swebench:test
155
+ ```
156
+
157
+ Fourth lane (GAIA gated long-horizon tool-augmented tasks):
158
+ - Dataset: `gaia-benchmark/GAIA` (gated)
159
+ - Default config: `2023_level3`
160
+ - Default split: `validation`
161
+ - Source: `https://huggingface.co/datasets/gaia-benchmark/GAIA`
162
+
163
+ Notes:
164
+ - Fixture is written to `.cache/gaia` (gitignored). Do not commit GAIA question/answer content.
165
+ - Refresh requires `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN` in your shell.
166
+ - Python deps: `pandas`, `huggingface_hub`, `pyarrow` (or equivalent parquet engine).
167
+
168
+ Refresh GAIA fixture:
169
+ ```bash
170
+ npm run mcp:dataset:gaia:refresh
171
+ ```
172
+
173
+ Run GAIA parallel subagent benchmark:
174
+ ```bash
175
+ NODEBENCH_GAIA_TASK_LIMIT=8 NODEBENCH_GAIA_CONCURRENCY=4 npm run mcp:dataset:gaia:test
176
+ ```
177
+
178
+ GAIA capability benchmark (accuracy: LLM-only vs LLM+tools):
179
+ - This runs real model calls and web search. It is disabled by default and only intended for regression checks.
180
+ - Uses Gemini by default. Ensure `GEMINI_API_KEY` is available (repo `.env.local` is loaded by the test).
181
+ - Scoring fixture includes ground-truth answers and MUST remain under `.cache/gaia` (gitignored).
182
+
183
+ Generate scoring fixture (local only, gated):
184
+ ```bash
185
+ npm run mcp:dataset:gaia:capability:refresh
186
+ ```
187
+
188
+ Run capability benchmark:
189
+ ```bash
190
+ NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:test
191
+ ```
192
+
193
+ GAIA capability benchmark (file-backed lane: PDF / XLSX / CSV):
194
+ - This lane measures the impact of deterministic local parsing tools on GAIA tasks with attachments.
195
+ - Fixture includes ground-truth answers and MUST remain under `.cache/gaia` (gitignored).
196
+ - Attachments are copied into `.cache/gaia/data/<file_path>` for offline deterministic runs after the first download.
197
+
198
+ Generate file-backed scoring fixture + download attachments (local only, gated):
199
+ ```bash
200
+ npm run mcp:dataset:gaia:capability:files:refresh
201
+ ```
202
+
203
+ Run file-backed capability benchmark:
204
+ ```bash
205
+ NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:files:test
206
+ ```
207
+
208
+ Modes:
209
+ - Recommended (more stable): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag` (single deterministic extract + answer)
210
+ - More realistic (higher variance): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (small tool loop)
211
+ - Web lane only: `NODEBENCH_GAIA_CAPABILITY_FORCE_WEB_SEARCH=1` and/or `NODEBENCH_GAIA_CAPABILITY_FORCE_FETCH_URL=1`
212
+
213
+ Run all public lanes:
214
+ ```bash
215
+ npm run mcp:dataset:bench:all
216
+ ```
217
+
218
+ Run full lane suite (includes GAIA):
219
+ ```bash
220
+ npm run mcp:dataset:bench:full
221
+ ```
222
+
223
+ Implementation files:
224
+ - `packages/mcp-local/src/__tests__/fixtures/generateBfclLongContextFixture.ts`
225
+ - `packages/mcp-local/src/__tests__/fixtures/bfcl_v3_long_context.sample.json`
226
+ - `packages/mcp-local/src/__tests__/openDatasetParallelEval.test.ts`
227
+ - `packages/mcp-local/src/__tests__/fixtures/generateToolbenchInstructionFixture.ts`
228
+ - `packages/mcp-local/src/__tests__/fixtures/toolbench_instruction.sample.json`
229
+ - `packages/mcp-local/src/__tests__/openDatasetParallelEvalToolbench.test.ts`
230
+ - `packages/mcp-local/src/__tests__/fixtures/generateSwebenchVerifiedFixture.ts`
231
+ - `packages/mcp-local/src/__tests__/fixtures/swebench_verified.sample.json`
232
+ - `packages/mcp-local/src/__tests__/openDatasetParallelEvalSwebench.test.ts`
233
+ - `packages/mcp-local/src/__tests__/fixtures/generateGaiaLevel3Fixture.py`
234
+ - `.cache/gaia/gaia_2023_level3_validation.sample.json`
235
+ - `packages/mcp-local/src/__tests__/openDatasetParallelEvalGaia.test.ts`
236
+ - `packages/mcp-local/src/__tests__/fixtures/generateGaiaCapabilityFixture.py`
237
+ - `.cache/gaia/gaia_capability_2023_all_validation.sample.json`
238
+ - `packages/mcp-local/src/__tests__/gaiaCapabilityEval.test.ts`
239
+ - `packages/mcp-local/src/__tests__/fixtures/generateGaiaCapabilityFilesFixture.py`
240
+ - `.cache/gaia/gaia_capability_files_2023_all_validation.sample.json`
241
+ - `.cache/gaia/data/...` (local GAIA attachments; do not commit)
242
+ - `packages/mcp-local/src/__tests__/gaiaCapabilityFilesEval.test.ts`
243
+
244
+ Required tool chain per dataset task:
245
+ - `run_recon`
246
+ - `log_recon_finding`
247
+ - `findTools`
248
+ - `getMethodology`
249
+ - `start_eval_run`
250
+ - `record_eval_result`
251
+ - `complete_eval_run`
252
+ - `run_closed_loop`
253
+ - `run_mandatory_flywheel`
254
+ - `search_all_knowledge`
255
+
256
+ **→ Quick Refs:** Core process in [AI Flywheel](#the-ai-flywheel-mandatory) | Verification flow in [Verification Cycle](#verification-cycle-workflow) | Loop discipline in [Closed Loop Principle](#closed-loop-principle)
257
+
258
+ ---
259
+
260
+ ## MCP Tool Categories
261
+
262
+ Use `getMethodology("overview")` to see all available workflows.
263
+
264
+ | Category | Tools | When to Use |
265
+ |----------|-------|-------------|
266
+ | **Web** | `web_search`, `fetch_url` | Research, reading docs, market validation |
267
+ | **Local Files** | `read_pdf_text`, `pdf_search_text`, `read_xlsx_file`, `xlsx_select_rows`, `xlsx_aggregate`, `read_csv_file`, `csv_select_rows`, `csv_aggregate`, `read_text_file`, `read_json_file`, `json_select`, `read_jsonl_file`, `zip_list_files`, `zip_read_text_file`, `zip_extract_file`, `read_docx_text`, `read_pptx_text` | Deterministic parsing and aggregation of local attachments (GAIA file-backed lane) |
268
+ | **GitHub** | `search_github`, `analyze_repo` | Finding libraries, studying implementations |
269
+ | **Verification** | `start_cycle`, `log_phase`, `complete_cycle` | Tracking the flywheel process |
270
+ | **Eval** | `start_eval_run`, `log_test_result` | Test case management |
271
+ | **Quality Gates** | `run_quality_gate`, `get_gate_history` | Pass/fail checkpoints |
272
+ | **Learning** | `record_learning`, `search_all_knowledge` | Persistent knowledge base |
273
+ | **Vision** | `analyze_screenshot`, `capture_ui_screenshot` | UI/UX verification |
274
+ | **Bootstrap** | `discover_infrastructure`, `triple_verify`, `self_implement` | Self-setup, triple verification |
275
+ | **Autonomous** | `assess_risk`, `decide_re_update`, `run_self_maintenance` | Risk-aware execution, self-maintenance |
276
+ | **Parallel Agents** | `claim_agent_task`, `release_agent_task`, `list_agent_tasks`, `assign_agent_role`, `get_agent_role`, `log_context_budget`, `run_oracle_comparison`, `get_parallel_status` | Multi-agent coordination, task locking, role specialization, oracle testing |
277
+ | **LLM** | `call_llm`, `extract_structured_data`, `benchmark_models` | LLM calling, structured extraction, model comparison |
278
+ | **Security** | `scan_dependencies`, `run_code_analysis` | Dependency auditing, static code analysis |
279
+ | **Platform** | `query_daily_brief`, `query_funding_entities`, `query_research_queue`, `publish_to_queue` | Convex platform bridge: intelligence, funding, research, publishing |
280
+ | **Meta** | `findTools`, `getMethodology` | Discover tools, get workflow guides |
281
+ | **Discovery** | `discover_tools`, `get_tool_quick_ref`, `get_workflow_chain` | Hybrid search, quick refs, workflow chains |
282
+
283
+ Meta + Discovery tools (5 total) are **always included** regardless of preset. See [Toolset Gating & Presets](#toolset-gating--presets).
284
+
285
+ **→ Quick Refs:** Find tools by keyword: `findTools({ query: "verification" })` | Hybrid search: `discover_tools({ query: "security" })` | Get workflow guide: `getMethodology({ topic: "..." })` | See [Methodology Topics](#methodology-topics) for all topics
286
+
287
+ ---
288
+
289
+ ## Toolset Gating & Presets
290
+
291
+ NodeBench MCP supports 4 presets that control which domain toolsets are loaded at startup. Meta + Discovery tools (5 total) are **always included** on top of any preset.
292
+
293
+ ### Preset Table
294
+
295
+ | Preset | Domain Toolsets | Domain Tools | Total (with meta+discovery) | Use Case |
296
+ |--------|----------------|-------------|----------------------------|----------|
297
+ | **meta** | 0 | 0 | 5 | Discovery-only front door. Agents start here and self-escalate. |
298
+ | **lite** | 7 | ~35 | ~40 | Lightweight verification-focused workflows. CI bots, quick checks. |
299
+ | **core** | 16 | ~75 | ~80 | Full development workflow. Most agent sessions. |
300
+ | **full** | all | 89+ | 94+ | Everything enabled. Benchmarking, exploration, advanced use. |
301
+
302
+ ### Usage
303
+
304
+ ```bash
305
+ npx nodebench-mcp --preset meta # Discovery-only (5 tools)
306
+ npx nodebench-mcp --preset lite # Verification + eval + recon + security
307
+ npx nodebench-mcp --preset core # Full dev workflow without vision/parallel
308
+ npx nodebench-mcp --preset full # All toolsets (default)
309
+ npx nodebench-mcp --toolsets verification,eval,recon # Custom selection
310
+ npx nodebench-mcp --exclude vision,ui_capture # Exclude specific toolsets
311
+ ```
312
+
313
+ ### The Meta Preset — Discovery-Only Front Door
314
+
315
+ The **meta** preset loads zero domain tools. Agents start with only 5 tools:
316
+
317
+ | Tool | Purpose |
318
+ |------|---------|
319
+ | `findTools` | Keyword search across all registered tools |
320
+ | `getMethodology` | Get workflow guides by topic |
321
+ | `discover_tools` | Hybrid search with relevance scoring (richer than findTools) |
322
+ | `get_tool_quick_ref` | Quick reference card for any specific tool |
323
+ | `get_workflow_chain` | Recommended tool sequence for common workflows |
324
+
325
+ This is the recommended starting point for autonomous agents. The self-escalation pattern:
326
+
327
+ ```
328
+ 1. Start with --preset meta (5 tools)
329
+ 2. discover_tools({ query: "what I need to do" }) // Find relevant tools
330
+ 3. get_workflow_chain({ workflow: "verification" }) // Get the tool sequence
331
+ 4. If needed tools are not loaded:
332
+ → Restart with --preset core or --preset full
333
+ → Or use --toolsets to add specific domains
334
+ 5. Proceed with full workflow
335
+ ```
336
+
337
+ ### Preset Domain Breakdown
338
+
339
+ **meta** (0 domains): No domain tools. Meta + Discovery only.
340
+
341
+ **lite** (7 domains): `verification`, `eval`, `quality_gate`, `learning`, `recon`, `security`, `boilerplate`
342
+
343
+ **core** (16 domains): Everything in lite plus `flywheel`, `bootstrap`, `self_eval`, `llm`, `platform`, `research_writing`, `flicker_detection`, `figma_flow`, `benchmark`
344
+
345
+ **full** (all domains): All toolsets in TOOLSET_MAP including `ui_capture`, `vision`, `local_file`, `web`, `github`, `docs`, `parallel`, and everything in core.
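Conceptually, resolving a preset is a set union over its domain toolsets plus the always-on meta/discovery tools. A hypothetical sketch (the `toolsetMap` contents here are invented for illustration, not the real TOOLSET_MAP):

```javascript
// Always-on meta + discovery tools, included regardless of preset.
const ALWAYS_ON = [
  "findTools",
  "getMethodology",
  "discover_tools",
  "get_tool_quick_ref",
  "get_workflow_chain",
];

// Resolve the loaded tool list from a preset's domains and a toolset map.
function resolveTools(presetDomains, toolsetMap) {
  const domainTools = presetDomains.flatMap((domain) => toolsetMap[domain] ?? []);
  return [...ALWAYS_ON, ...domainTools];
}
```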
346
+
347
+ **→ Quick Refs:** Check current toolset: `findTools({ query: "*" })` | Self-escalate: restart with `--preset core` | See [MCP Tool Categories](#mcp-tool-categories) | CLI help: `npx nodebench-mcp --help`
348
+
349
+ ---
350
+
351
+ ## Verification Cycle Workflow
352
+
353
+ Start every significant task with a verification cycle:
354
+
355
+ ```
356
+ 1. start_cycle({ goal: "Implement feature X" })
357
+ 2. log_phase({ phase: "context", notes: "Researched existing patterns..." })
358
+ 3. log_phase({ phase: "implementation", notes: "Added new function..." })
359
+ 4. log_phase({ phase: "testing", notes: "All tests pass..." })
360
+ 5. complete_cycle({ status: "success", summary: "Feature X shipped" })
361
+ ```
362
+
363
+ If blocked or failed:
364
+ ```
365
+ abandon_cycle({ reason: "Blocked by external dependency" })
366
+ ```
367
+
368
+ **→ Quick Refs:** Before starting: `search_all_knowledge({ query: "your task" })` | After completing: `record_learning({ ... })` | Run flywheel: See [AI Flywheel](#the-ai-flywheel-mandatory) | Track quality: See [Quality Gates](#quality-gates)
369
+
370
+ ---
371
+
372
+ ## Recording Learnings
373
+
374
+ After discovering something useful, record it:
375
+
376
+ ```
377
+ record_learning({
378
+ title: "Convex index predicates must use withIndex chaining",
379
+ content: "Using .filter() after .withIndex() bypasses the index...",
380
+ category: "convex",
381
+ tags: ["database", "performance", "gotcha"]
382
+ })
383
+ ```
384
+
385
+ Search later with:
386
+ ```
387
+ search_all_knowledge({ query: "convex index" })
388
+ ```
389
+
390
+ **→ Quick Refs:** Search before implementing: `search_all_knowledge` | `search_learnings` and `list_learnings` are DEPRECATED | Part of flywheel Step 7 | See [Verification Cycle](#verification-cycle-workflow)
391
+
392
+ ---
393
+
394
+ ## Quality Gates
395
+
396
+ Before shipping, run quality gates:
397
+
398
+ ```
399
+ run_quality_gate({
400
+ gateName: "deploy_readiness",
401
+ results: [
402
+ { rule: "tests_pass", passed: true },
403
+ { rule: "no_type_errors", passed: true },
404
+ { rule: "code_reviewed", passed: true }
405
+ ]
406
+ })
407
+ ```
408
+
409
+ Gate history tracks pass/fail over time.
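The pass/fail semantics are simple: a gate passes only when every rule passes. A sketch of that evaluation (illustrative, not the tool's internals):

```javascript
// A gate passes only if all of its rules passed; otherwise report the failures.
function evaluateGate(results) {
  const failed = results.filter((r) => !r.passed).map((r) => r.rule);
  return { passed: failed.length === 0, failed };
}
```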
410
+
411
+ **→ Quick Refs:** Get preset rules: `get_gate_preset({ preset: "ui_ux_qa" })` | View history: `get_gate_history({ gateName: "..." })` | UI/UX gates: See [Vision](#vision-analysis) | Part of flywheel Step 5 re-verify
412
+
413
+ ---
414
+
415
+ ## Web Research Workflow
416
+
417
+ For market research or tech evaluation:
418
+
419
+ ```
420
+ 1. web_search({ query: "MCP servers 2026", maxResults: 10 })
421
+ 2. search_github({ query: "mcp typescript", minStars: 100 })
422
+ 3. analyze_repo({ repoUrl: "owner/repo" }) // study top results
423
+ 4. fetch_url({ url: "https://docs.example.com" }) // read their docs
424
+ 5. record_learning({ ... }) // save key findings
425
+ ```
426
+
427
+ **→ Quick Refs:** Analyze repo structure: `analyze_repo` | Save findings: `record_learning` | Part of: `getMethodology({ topic: "project_ideation" })` | See [Recording Learnings](#recording-learnings)
428
+
429
+ ---
430
+
431
+ ## Project Ideation Workflow
432
+
433
+ Before building anything new:
434
+
435
+ ```
436
+ getMethodology({ topic: "project_ideation" })
437
+ ```
438
+
439
+ This returns a 6-step process:
440
+ 1. Define Concept
441
+ 2. Research Market
442
+ 3. Analyze Competition
443
+ 4. Define Requirements
444
+ 5. Plan Metrics
445
+ 6. Gate Approval
446
+
447
+ **→ Quick Refs:** Research tools: `web_search`, `search_github`, `analyze_repo` | Record requirements: `log_recon_finding` | Create baseline: `start_eval_run` | See [Web Research](#web-research-workflow)
448
+
449
+ ---
450
+
451
+ ## Closed Loop Principle
452
+
453
+ **Never present changes without full local verification.**
454
+
455
+ The loop:
456
+ 1. Compile. Build clean.
457
+ 2. Lint. Style clean. No warnings.
458
+ 3. Test. Run automated suites.
459
+ 4. Self-debug. If 1-3 fail: read logs, hypothesize, fix, restart loop.
460
+
461
+ Only when all green: present to user.
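The loop can be sketched as a small shell helper. The stage commands below are stand-ins; substitute your project's real compile, lint, and test commands:

```shell
# Run each stage in order; stop at the first failure. Only an all-green run
# is presentable. After fixing a failure, restart the whole loop from stage 1.
closed_loop() {
  for stage in "$@"; do
    if sh -c "$stage"; then
      echo "PASS: $stage"
    else
      echo "FAIL: $stage -- read logs, hypothesize, fix, restart loop" >&2
      return 1
    fi
  done
  echo "all green: present to user"
}

closed_loop "true" "true" "true"   # e.g. "tsc --noEmit" "eslint ." "npm test"
```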
462
+
463
+ **→ Quick Refs:** Track loop: `run_closed_loop({ ... })` | Part of flywheel Steps 1-5 | See [AI Flywheel](#the-ai-flywheel-mandatory) | After loop: See [Post-Implementation Checklist](#post-implementation-checklist)
464
+
465
+ ---
466
+
467
+ ## Environment Setup
468
+
469
+ Check your environment:
470
+ ```
471
+ setup_local_env({ checkSdks: true })
472
+ ```
473
+
474
+ Returns:
475
+ - Node/npm versions
476
+ - Missing API keys
477
+ - Recommended SDK installations
478
+ - Actionable next steps
479
+
480
+ **→ Quick Refs:** After setup: `getMethodology("overview")` | Check vision: `discover_vision_env()` | See [API Keys](#api-keys-optional) | Then: See [Verification Cycle](#verification-cycle-workflow)
481
+
482
+ ---
483
+
484
+ ## API Keys (Optional)
485
+
486
+ Set these for enhanced functionality:
487
+
488
+ | Key | Purpose |
489
+ |-----|---------|
490
+ | `GEMINI_API_KEY` | Web search (Google grounding), vision analysis |
491
+ | `OPENAI_API_KEY` | Alternative search/vision provider |
492
+ | `GITHUB_TOKEN` | Higher rate limits (5000/hr vs 60/hr) |
493
+ | `ANTHROPIC_API_KEY` | Alternative vision provider |
494
+
495
+ **→ Quick Refs:** Check what's available: `setup_local_env({ checkSdks: true })` | Vision capabilities: `discover_vision_env()` | See [Environment Setup](#environment-setup)
496
+
497
+ ---
498
+
499
+ ## Vision Analysis
500
+
501
+ For UI/UX verification:
502
+
503
+ ```
504
+ 1. capture_ui_screenshot({ url: "http://localhost:3000", viewport: "desktop" })
505
+ 2. analyze_screenshot({ imageBase64: "...", prompt: "Check accessibility" })
506
+ 3. capture_responsive_suite({ url: "...", label: "homepage" })
507
+ ```
508
+
509
+ **→ Quick Refs:** Check capabilities: `discover_vision_env()` | UI QA methodology: `getMethodology({ topic: "ui_ux_qa" })` | Agentic vision: `getMethodology({ topic: "agentic_vision" })` | See [Quality Gates](#quality-gates)
510
+
511
+ ---
512
+
513
+ ## Post-Implementation Checklist
514
+
515
+ After every implementation, answer these 3 questions:
516
+
517
+ 1. **MCP gaps?** — Were all relevant tools called? Any unexpected results?
518
+ 2. **Implementation gaps?** — Dead code? Missing integrations? Hardcoded values?
519
+ 3. **Flywheel complete?** — All 7 steps passed including E2E test?
520
+
521
+ If any answer reveals a gap: fix it before proceeding.
522
+
523
+ **→ Quick Refs:** Run self-check: `run_self_maintenance({ scope: "quick" })` | Record learnings: `record_learning` | Update docs: `update_agents_md` | See [AI Flywheel](#the-ai-flywheel-mandatory)
524
+
525
+ ---
526
+
527
+ ## Agent Self-Bootstrap System
528
+
529
+ Tools that let agents self-configure and validate their setup against authoritative sources.
530
+
531
+ ### 1. Discover Existing Infrastructure
532
+ ```
533
+ discover_infrastructure({
534
+ categories: ["agent_loop", "telemetry", "evaluation", "verification"],
535
+ depth: "thorough"
536
+ })
537
+ ```
538
+
539
+ Returns: discovered patterns, missing components, bootstrap plan.
540
+
541
### 2. Triple Verification (with Source Citations)

Run 3-layer verification with authoritative sources:

```
triple_verify({
  target: "my-feature",
  scope: "full",
  includeWebSearch: true,
  generateInstructions: true
})
```

**V1: Internal Analysis** — Checks codebase patterns
**V2: External Validation** — Cross-references Anthropic, OpenAI, LangChain, MCP spec
**V3: Synthesis** — Generates recommendations with source citations

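The three layers above can be sketched as a small pipeline. This is an illustrative TypeScript sketch, not the actual `triple_verify` implementation: the verifier bodies are hypothetical stand-ins, and the one real constraint shown is that a V3 recommendation must carry a citation surfaced during V2.

```typescript
// Illustrative V1 -> V2 -> V3 shape; verifier bodies are stand-ins.
type Finding = { layer: "V1" | "V2" | "V3"; note: string; source?: string };

function verifyInternal(target: string): Finding[] {
  // V1: stand-in for scanning codebase patterns
  return [{ layer: "V1", note: `scanned patterns for ${target}` }];
}

function verifyExternal(target: string): Finding[] {
  // V2: stand-in for cross-referencing external authoritative docs
  return [{
    layer: "V2",
    note: `cross-referenced ${target}`,
    source: "https://modelcontextprotocol.io",
  }];
}

function synthesize(findings: Finding[]): Finding[] {
  // V3: every recommendation must cite a source surfaced by V2
  const cited = findings.find((f) => f.source !== undefined);
  return [
    ...findings,
    { layer: "V3", note: "recommendation", source: cited?.source },
  ];
}

function tripleVerify(target: string): Finding[] {
  return synthesize([...verifyInternal(target), ...verifyExternal(target)]);
}
```

The point of the shape: V3 never invents authority of its own; it can only recommend with a citation that V2 actually gathered.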
### 3. Self-Implement Missing Components
```
self_implement({
  component: "telemetry", // or: agent_loop, evaluation, verification, multi_channel
  dryRun: true
})
```

Generates production-ready templates based on industry patterns.

### 4. Generate Self-Instructions
```
generate_self_instructions({
  format: "skills_md", // or: rules_md, guidelines, claude_md
  includeExternalSources: true
})
```

Creates persistent instructions with authoritative source citations.

### 5. Multi-Channel Information Gathering
```
connect_channels({
  channels: ["web", "github", "slack", "docs"],
  query: "agent verification patterns",
  aggressive: true
})
```

Aggregates findings from multiple sources.

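The aggregation step can be sketched as a tolerant fan-out: one unreachable channel should not sink the whole gathering pass. The fetchers below are hypothetical stand-ins for real channel connectors, not the `connect_channels` internals.

```typescript
// Tolerant multi-channel fan-out; fetchers are hypothetical stand-ins.
type ChannelResult = { channel: string; findings: string[] };

const fetchers: Record<string, (query: string) => string[]> = {
  web: (q) => [`web hit for "${q}"`],
  github: (q) => [`github issue mentioning "${q}"`],
  slack: () => { throw new Error("not connected"); }, // a flaky channel
};

function connectChannelsSketch(channels: string[], query: string): ChannelResult[] {
  const results: ChannelResult[] = [];
  for (const channel of channels) {
    try {
      results.push({ channel, findings: fetchers[channel](query) });
    } catch {
      // One failed channel must not abort the whole pass.
      results.push({ channel, findings: [] });
    }
  }
  return results;
}
```

A failed channel yields an empty result instead of an exception, so downstream synthesis always sees one entry per requested channel.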
### Authoritative Sources (Tier 1)
- https://www.anthropic.com/research/building-effective-agents
- https://openai.github.io/openai-agents-python/
- https://www.langchain.com/langgraph
- https://modelcontextprotocol.io/specification/2025-11-25

**→ Quick Refs:** Full methodology: `getMethodology({ topic: "agent_bootstrap" })` | After bootstrap: See [Autonomous Maintenance](#autonomous-self-maintenance-system) | Before implementing: `assess_risk` | See [Triple Verification](#2-triple-verification-with-source-citations)

---

## Autonomous Self-Maintenance System

Aggressive autonomous self-management with risk-aware execution, based on OpenClaw patterns and Ralph Wiggum stop-hooks.

### 1. Risk-Tiered Execution

Before any action, assess its risk tier:

```
assess_risk({ action: "push to remote" })
```

Risk tiers:
- **Low**: Reading, analyzing, searching — auto-approve
- **Medium**: Writing local files, running tests — log and proceed
- **High**: Pushing to remote, posting externally — require confirmation

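A minimal version of the gate behind these tiers can be sketched with keyword rules. The patterns below are illustrative only; the real `assess_risk` tool does richer classification.

```typescript
// Illustrative risk gate: keyword rules standing in for real classification.
type Tier = "low" | "medium" | "high";

const HIGH = [/push/, /deploy/, /post/, /publish/];
const MEDIUM = [/write/, /test/, /install/];

function assessTier(action: string): Tier {
  const a = action.toLowerCase();
  if (HIGH.some((re) => re.test(a))) return "high";
  if (MEDIUM.some((re) => re.test(a))) return "medium";
  return "low";
}

function gate(action: string, confirmed = false): boolean {
  const tier = assessTier(action);
  if (tier === "high" && !confirmed) return false; // require explicit confirmation
  if (tier === "medium") console.log(`[risk:medium] ${action}`); // log and proceed
  return true; // low tier auto-approves
}
```

The design point is the asymmetry: low-tier actions pass silently, medium-tier actions leave an audit trail, and high-tier actions are blocked until someone confirms.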
### 2. Re-Update Before Create

**CRITICAL:** Before creating new files, check whether updating an existing one is better:

```
decide_re_update({
  targetContent: "New agent instructions",
  contentType: "instructions",
  existingFiles: ["AGENTS.md", "README.md"]
})
```

This prevents file sprawl and maintains a single source of truth.

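The decision can be sketched as a heading lookup: if an existing doc already owns the topic, extend it; otherwise a new file is justified. This heuristic is illustrative, and the real `decide_re_update` weighs more signals than a heading match.

```typescript
// Illustrative "update vs create" heuristic based on a heading match.
type Decision = { action: "update" | "create"; target?: string };

function decideReUpdateSketch(
  sectionHeading: string,
  existingDocs: Record<string, string>, // filename -> contents
): Decision {
  for (const [file, body] of Object.entries(existingDocs)) {
    if (body.includes(`## ${sectionHeading}`)) {
      return { action: "update", target: file }; // extend the existing section
    }
  }
  return { action: "create" }; // no existing home: a new file is justified
}
```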
### 3. Self-Maintenance Cycles

Run periodic self-checks:

```
run_self_maintenance({
  scope: "standard", // quick | standard | thorough
  autoFix: false,
  dryRun: true
})
```

Checks: TypeScript compilation, documentation sync, tool counts, test coverage.

### 4. Directory Scaffolding (OpenClaw Style)

When adding infrastructure, use standardized scaffolding:

```
scaffold_directory({
  component: "agent_loop", // or: telemetry, evaluation, multi_channel, etc.
  includeTests: true,
  dryRun: true
})
```

Creates organized subdirectories with proper test structure.

### 5. Autonomous Loops with Guardrails

For multi-step autonomous tasks, use controlled loops:

```
run_autonomous_loop({
  goal: "Verify all tools pass static analysis",
  maxIterations: 5,
  maxDurationMs: 60000,
  stopOnFirstFailure: true
})
```

Implements the Ralph Wiggum pattern with checkpoints and stop conditions.

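The guardrail shape is worth seeing in miniature: hard caps on iterations and wall-clock time, plus an early stop on failure. The `step` callback below is a stand-in for real work; this sketch is not the `run_autonomous_loop` implementation.

```typescript
// Bounded loop sketch: every exit path is an explicit stop condition.
type LoopResult = {
  iterations: number;
  stopped: "done" | "maxIterations" | "timeout" | "failure";
};

function runBoundedLoop(
  step: (i: number) => "pass" | "fail" | "done",
  opts: { maxIterations: number; maxDurationMs: number; stopOnFirstFailure: boolean },
): LoopResult {
  const start = Date.now();
  for (let i = 0; i < opts.maxIterations; i++) {
    if (Date.now() - start > opts.maxDurationMs) {
      return { iterations: i, stopped: "timeout" }; // wall-clock guardrail
    }
    const outcome = step(i);
    if (outcome === "done") return { iterations: i + 1, stopped: "done" };
    if (outcome === "fail" && opts.stopOnFirstFailure) {
      return { iterations: i + 1, stopped: "failure" };
    }
  }
  return { iterations: opts.maxIterations, stopped: "maxIterations" };
}
```

Because the loop can only exit through a named stop condition, a caller can always tell whether the agent finished, gave up, or was cut off.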
**→ Quick Refs:** Full methodology: `getMethodology({ topic: "autonomous_maintenance" })` | Before actions: `assess_risk` | Before new files: `decide_re_update` | Scaffold structure: `scaffold_directory` | See [Self-Bootstrap](#agent-self-bootstrap-system)

---

## Methodology Topics

Available via `getMethodology({ topic: "..." })`:

| Topic | Description | Quick Ref |
|-------|-------------|-----------|
| `overview` | See all methodologies | Start here |
| `verification` | 6-phase development cycle | [AI Flywheel](#the-ai-flywheel-mandatory) |
| `eval` | Test case management | [Quality Gates](#quality-gates) |
| `flywheel` | Continuous improvement loop | [AI Flywheel](#the-ai-flywheel-mandatory) |
| `mandatory_flywheel` | Required verification for changes | [AI Flywheel](#the-ai-flywheel-mandatory) |
| `reconnaissance` | Codebase discovery | [Self-Bootstrap](#agent-self-bootstrap-system) |
| `quality_gates` | Pass/fail checkpoints | [Quality Gates](#quality-gates) |
| `ui_ux_qa` | Frontend verification | [Vision Analysis](#vision-analysis) |
| `agentic_vision` | AI-powered visual QA | [Vision Analysis](#vision-analysis) |
| `closed_loop` | Build/test before presenting | [Closed Loop](#closed-loop-principle) |
| `learnings` | Knowledge persistence | [Recording Learnings](#recording-learnings) |
| `project_ideation` | Validate ideas before building | [Project Ideation](#project-ideation-workflow) |
| `tech_stack_2026` | Dependency management | [Environment Setup](#environment-setup) |
| `agents_md_maintenance` | Keep docs in sync | [Auto-Update](#auto-update-this-file) |
| `agent_bootstrap` | Self-discover, triple verify | [Self-Bootstrap](#agent-self-bootstrap-system) |
| `autonomous_maintenance` | Risk-tiered execution | [Autonomous Maintenance](#autonomous-self-maintenance-system) |
| `parallel_agent_teams` | Multi-agent coordination, task locking, oracle testing | [Parallel Agent Teams](#parallel-agent-teams) |
| `self_reinforced_learning` | Trajectory analysis, self-eval, improvement recs | [Self-Reinforced Learning](#self-reinforced-learning-loop) |
| `toolset_gating` | 4 presets (meta, lite, core, full) and self-escalation | [Toolset Gating & Presets](#toolset-gating--presets) |

**→ Quick Refs:** Find tools: `findTools({ query: "..." })` | Get any methodology: `getMethodology({ topic: "..." })` | See [MCP Tool Categories](#mcp-tool-categories)

---

+
707
+ ## Parallel Agent Teams
708
+
709
+ Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www.anthropic.com/engineering/building-c-compiler) (Feb 2026).
710
+
711
+ Run multiple AI agents in parallel on a shared codebase with coordination via task locking, role specialization, context budget management, and oracle-based testing.
712
+
713
+ ### Quick Start — Parallel Agents
714
+
715
+ ```
716
+ 1. get_parallel_status({ includeHistory: true }) // Orient: what's happening?
717
+ 2. assign_agent_role({ role: "implementer" }) // Specialize
718
+ 3. claim_agent_task({ taskKey: "fix_auth" }) // Lock task
719
+ 4. ... do work ...
720
+ 5. log_context_budget({ eventType: "test_output", tokensUsed: 5000 }) // Track budget
721
+ 6. run_oracle_comparison({ testLabel: "auth_output", actualOutput: "...", expectedOutput: "...", oracleSource: "prod_v2" })
722
+ 7. release_agent_task({ taskKey: "fix_auth", status: "completed", progressNote: "Fixed JWT, added tests" })
723
+ ```
724
+
725
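Step 6 above, the oracle comparison, can be sketched as normalize-then-diff, so each mismatching line becomes an independent work item for a parallel agent. This is an illustrative sketch, not the `run_oracle_comparison` internals.

```typescript
// Oracle comparison sketch: normalize, then diff line-by-line so each
// mismatch is an independently claimable work item.
function normalize(s: string): string[] {
  return s.split("\n").map((l) => l.trim()).filter((l) => l.length > 0);
}

function oracleCompare(actual: string, expected: string): { match: boolean; mismatches: string[] } {
  const a = normalize(actual);
  const e = normalize(expected);
  const mismatches: string[] = [];
  const n = Math.max(a.length, e.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== e[i]) {
      mismatches.push(`line ${i + 1}: got ${JSON.stringify(a[i])}, oracle says ${JSON.stringify(e[i])}`);
    }
  }
  return { match: mismatches.length === 0, mismatches };
}
```

Normalizing first keeps whitespace noise from masquerading as real failures, which matters when the oracle output comes from a different build or platform.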
### Predefined Agent Roles

| Role | Focus |
|------|-------|
| `implementer` | Primary feature work. Picks failing tests, implements fixes. |
| `dedup_reviewer` | Finds and coalesces duplicate implementations. |
| `performance_optimizer` | Profiles bottlenecks, optimizes hot paths. |
| `documentation_maintainer` | Keeps READMEs and progress files in sync. |
| `code_quality_critic` | Structural improvements, pattern enforcement. |
| `test_writer` | Writes targeted tests for edge cases and failure modes. |
| `security_auditor` | Audits for vulnerabilities, logs CRITICAL gaps. |

### Key Patterns (from Anthropic blog)

- **Task Locking**: Claim before working. If two agents try the same task, the second picks a different one.
- **Context Window Budget**: Do NOT print thousands of useless bytes. Pre-compute summaries. Use `--fast` mode (1-10% random sample) for large test suites. Log errors with an ERROR prefix on the same line for grep.
- **Oracle Testing**: Compare output against a known-good reference. Each failing comparison is an independent work item for a parallel agent.
- **Time Blindness**: Agents can't tell time. Print progress infrequently. Use deterministic random sampling per-agent but randomized across VMs.
- **Progress Files**: Maintain running docs of status, failed approaches, and remaining tasks. Fresh agent sessions read these to orient.
- **Delta Debugging**: When tests pass individually but fail together, split the set in half to narrow down the minimal failing combination.

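Delta debugging is the most mechanical of the patterns above, and a minimal ddmin-style sketch shows the shape: repeatedly try dropping chunks of the test set while the failure still reproduces, increasing granularity when no chunk can be dropped. The `fails` predicate is a stand-in for actually running the suite subset.

```typescript
// ddmin-style minimization sketch: shrink a failing test set to a minimal
// combination that still reproduces the failure.
function minimizeFailingSet<T>(tests: T[], fails: (subset: T[]) => boolean): T[] {
  let current = tests;
  let n = 2; // number of chunks to split into
  while (current.length >= 2) {
    const size = Math.ceil(current.length / n);
    let reduced = false;
    for (let i = 0; i < current.length; i += size) {
      // Try the complement: everything except one chunk.
      const complement = [...current.slice(0, i), ...current.slice(i + size)];
      if (complement.length > 0 && fails(complement)) {
        current = complement; // the dropped chunk was not needed to fail
        n = Math.max(n - 1, 2);
        reduced = true;
        break;
      }
    }
    if (!reduced) {
      if (n >= current.length) break; // already at finest granularity
      n = Math.min(n * 2, current.length); // split finer and retry
    }
  }
  return current;
}
```

Each `fails` call is itself parallelizable work: in a team setting, different agents can probe different complements at once.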
### Bootstrap for External Repos

When nodebench-mcp is connected to a project that lacks parallel agent infrastructure, it can auto-detect gaps and scaffold everything needed:

```
1. bootstrap_parallel_agents({ projectRoot: "/path/to/their/repo", dryRun: true })
   // Scans 7 categories: task coordination, roles, oracle, context budget,
   // progress files, AGENTS.md parallel section, git worktrees

2. bootstrap_parallel_agents({ projectRoot: "...", dryRun: false, techStack: "TypeScript/React" })
   // Creates .parallel-agents/ dir, progress.md, roles.json, lock dirs, oracle dirs

3. generate_parallel_agents_md({ techStack: "TypeScript/React", projectName: "their-project", maxAgents: 4 })
   // Generates portable AGENTS.md section — paste into their repo

4. Run the 6-step flywheel plan returned by the bootstrap tool to verify
5. Fix any issues, re-verify
6. record_learning({ key: "bootstrap_their_project", content: "...", category: "pattern" })
```

The generated AGENTS.md section is framework-agnostic and works with any AI agent (Claude, GPT, etc.). It includes:
- Task locking protocol (file-based, no dependencies)
- Role definitions and assignment guide
- Oracle testing workflow with idiomatic examples
- Context budget rules
- Progress file protocol
- Anti-patterns to avoid
- Optional nodebench-mcp tool mapping table

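The "file-based, no dependencies" locking protocol listed above rests on one primitive: `mkdir` is atomic, so whichever agent creates the lock directory first owns the task. A dependency-free sketch follows; the paths and naming here are illustrative, not the generated protocol verbatim.

```typescript
// File-based task lock sketch: mkdir either succeeds (lock acquired) or
// throws EEXIST (another agent holds it). No external dependencies.
import { mkdirSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

function claimTask(lockRoot: string, taskKey: string, agentId: string): boolean {
  const lockDir = join(lockRoot, `${taskKey}.lock`);
  try {
    mkdirSync(lockDir); // atomic: throws if another agent already holds it
  } catch {
    return false;
  }
  writeFileSync(join(lockDir, "owner"), agentId); // record the holder for humans
  return true;
}

function releaseTask(lockRoot: string, taskKey: string): void {
  rmSync(join(lockRoot, `${taskKey}.lock`), { recursive: true, force: true });
}

// Demo root under the OS temp dir (illustrative location).
const root = join(tmpdir(), `agent-locks-${Date.now()}`);
mkdirSync(root, { recursive: true });
```

Because the lock is just a directory, a stuck agent's claim is visible and removable with ordinary shell tools, which is exactly what a human supervisor needs when a session dies mid-task.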
### MCP Prompts for Parallel Agent Teams

- `parallel-agent-team` — Full team setup with role assignment and task breakdown
- `oracle-test-harness` — Oracle-based testing setup for a component
- `bootstrap-parallel-agents` — Detect and scaffold parallel agent infra for any external repo

**→ Quick Refs:** Full methodology: `getMethodology({ topic: "parallel_agent_teams" })` | Find parallel tools: `findTools({ category: "parallel_agents" })` | Bootstrap external repo: `bootstrap_parallel_agents({ projectRoot: "..." })` | See [AI Flywheel](#the-ai-flywheel-mandatory)

---

+
785
+ ## Auto-Update This File
786
+
787
+ Agents can self-update this file:
788
+
789
+ ```
790
+ update_agents_md({
791
+ operation: "update_section",
792
+ section: "Custom Section",
793
+ content: "New content here...",
794
+ projectRoot: "/path/to/project"
795
+ })
796
+ ```
797
+
798
+ Or read current structure:
799
+ ```
800
+ update_agents_md({ operation: "read", projectRoot: "/path/to/project" })
801
+ ```
802
+
803
+ **→ Quick Refs:** Before updating: `decide_re_update({ contentType: "instructions", ... })` | After updating: Run flywheel Steps 1-7 | See [Re-Update Before Create](#2-re-update-before-create)
804
+
805
+ ---
806
+
807
+ ## License
808
+
809
+ MIT — Use freely in any project.