nodebench-mcp 1.4.1 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (38)
  1. package/NODEBENCH_AGENTS.md +154 -2
  2. package/README.md +147 -228
  3. package/dist/__tests__/comparativeBench.test.d.ts +1 -0
  4. package/dist/__tests__/comparativeBench.test.js +722 -0
  5. package/dist/__tests__/comparativeBench.test.js.map +1 -0
  6. package/dist/__tests__/evalHarness.test.js +24 -2
  7. package/dist/__tests__/evalHarness.test.js.map +1 -1
  8. package/dist/__tests__/gaiaCapabilityEval.test.d.ts +14 -0
  9. package/dist/__tests__/gaiaCapabilityEval.test.js +420 -0
  10. package/dist/__tests__/gaiaCapabilityEval.test.js.map +1 -0
  11. package/dist/__tests__/gaiaCapabilityFilesEval.test.d.ts +15 -0
  12. package/dist/__tests__/gaiaCapabilityFilesEval.test.js +303 -0
  13. package/dist/__tests__/gaiaCapabilityFilesEval.test.js.map +1 -0
  14. package/dist/__tests__/openDatasetParallelEvalGaia.test.d.ts +7 -0
  15. package/dist/__tests__/openDatasetParallelEvalGaia.test.js +279 -0
  16. package/dist/__tests__/openDatasetParallelEvalGaia.test.js.map +1 -0
  17. package/dist/__tests__/openDatasetPerfComparison.test.d.ts +10 -0
  18. package/dist/__tests__/openDatasetPerfComparison.test.js +318 -0
  19. package/dist/__tests__/openDatasetPerfComparison.test.js.map +1 -0
  20. package/dist/__tests__/tools.test.js +155 -7
  21. package/dist/__tests__/tools.test.js.map +1 -1
  22. package/dist/db.js +56 -0
  23. package/dist/db.js.map +1 -1
  24. package/dist/index.js +370 -11
  25. package/dist/index.js.map +1 -1
  26. package/dist/tools/localFileTools.d.ts +15 -0
  27. package/dist/tools/localFileTools.js +386 -0
  28. package/dist/tools/localFileTools.js.map +1 -0
  29. package/dist/tools/metaTools.js +170 -3
  30. package/dist/tools/metaTools.js.map +1 -1
  31. package/dist/tools/parallelAgentTools.d.ts +18 -0
  32. package/dist/tools/parallelAgentTools.js +1272 -0
  33. package/dist/tools/parallelAgentTools.js.map +1 -0
  34. package/dist/tools/selfEvalTools.js +240 -10
  35. package/dist/tools/selfEvalTools.js.map +1 -1
  36. package/dist/tools/webTools.js +171 -37
  37. package/dist/tools/webTools.js.map +1 -1
  38. package/package.json +19 -7
@@ -117,7 +117,7 @@ Run ToolBench parallel subagent benchmark:
117
117
  NODEBENCH_TOOLBENCH_TASK_LIMIT=6 NODEBENCH_TOOLBENCH_CONCURRENCY=3 npm run mcp:dataset:toolbench:test
118
118
  ```
119
119
 
120
- Run all lanes:
120
+ Run all public lanes:
121
121
  ```bash
122
122
  npm run mcp:dataset:bench:all
123
123
  ```
@@ -137,11 +137,71 @@ Run SWE-bench parallel subagent benchmark:
137
137
  NODEBENCH_SWEBENCH_TASK_LIMIT=8 NODEBENCH_SWEBENCH_CONCURRENCY=4 npm run mcp:dataset:swebench:test
138
138
  ```
139
139
 
140
- Run all lanes:
140
+ Fourth lane (GAIA: gated, long-horizon, tool-augmented tasks):
141
+ - Dataset: `gaia-benchmark/GAIA` (gated)
142
+ - Default config: `2023_level3`
143
+ - Default split: `validation`
144
+ - Source: `https://huggingface.co/datasets/gaia-benchmark/GAIA`
145
+
146
+ Notes:
147
+ - Fixture is written to `.cache/gaia` (gitignored). Do not commit GAIA question/answer content.
148
+ - Refresh requires `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN` in your shell.
149
+ - Python deps: `pandas`, `huggingface_hub`, `pyarrow` (or equivalent parquet engine).
150
+
151
+ Refresh GAIA fixture:
152
+ ```bash
153
+ npm run mcp:dataset:gaia:refresh
154
+ ```
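
For orientation, a minimal sketch of what the refresh step does (illustrative only; the real generator is `packages/mcp-local/src/__tests__/fixtures/generateGaiaLevel3Fixture.py`, and the parquet path and column name below are assumptions about the gated dataset's layout):

```python
import os
import pandas as pd
from huggingface_hub import hf_hub_download

token = os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_HUB_TOKEN")

# Download one shard of the gated dataset (requires accepted access on the Hub).
# The exact file layout inside gaia-benchmark/GAIA is an assumption here.
parquet_path = hf_hub_download(
    repo_id="gaia-benchmark/GAIA",
    repo_type="dataset",
    filename="2023/validation/metadata.parquet",
    token=token,
)

df = pd.read_parquet(parquet_path)           # needs pyarrow or another parquet engine
level3 = df[df["Level"] == 3]                # "Level" column name is an assumption

os.makedirs(".cache/gaia", exist_ok=True)    # gitignored; never commit GAIA content
level3.to_json(".cache/gaia/gaia_2023_level3_validation.sample.json",
               orient="records", indent=2)
```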
155
+
156
+ Run GAIA parallel subagent benchmark:
157
+ ```bash
158
+ NODEBENCH_GAIA_TASK_LIMIT=8 NODEBENCH_GAIA_CONCURRENCY=4 npm run mcp:dataset:gaia:test
159
+ ```
160
+
161
+ GAIA capability benchmark (accuracy: LLM-only vs LLM+tools):
162
+ - This runs real model calls and web search. It is disabled by default and only intended for regression checks.
163
+ - Uses Gemini by default. Ensure `GEMINI_API_KEY` is available (repo `.env.local` is loaded by the test).
164
+ - Scoring fixture includes ground-truth answers and MUST remain under `.cache/gaia` (gitignored).
165
+
166
+ Generate scoring fixture (local only, gated):
167
+ ```bash
168
+ npm run mcp:dataset:gaia:capability:refresh
169
+ ```
170
+
171
+ Run capability benchmark:
172
+ ```bash
173
+ NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:test
174
+ ```
175
+
176
+ GAIA capability benchmark (file-backed lane: PDF / XLSX / CSV):
177
+ - This lane measures the impact of deterministic local parsing tools on GAIA tasks with attachments.
178
+ - Fixture includes ground-truth answers and MUST remain under `.cache/gaia` (gitignored).
179
+ - Attachments are copied into `.cache/gaia/data/<file_path>` for offline deterministic runs after the first download.
180
+
181
+ Generate file-backed scoring fixture + download attachments (local only, gated):
182
+ ```bash
183
+ npm run mcp:dataset:gaia:capability:files:refresh
184
+ ```
185
+
186
+ Run file-backed capability benchmark:
187
+ ```bash
188
+ NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:files:test
189
+ ```
190
+
191
+ Modes:
192
+ - Recommended (more stable): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag`
193
+ - More realistic (higher variance): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (optional `NODEBENCH_GAIA_CAPABILITY_FORCE_WEB_SEARCH=1`)
194
+
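
For example, combining the mode variables with the file-backed command above (same env vars and npm scripts as documented):

```bash
# Recommended: stable RAG mode
NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag \
  npm run mcp:dataset:gaia:capability:files:test

# More realistic: agent mode with forced web search (higher variance)
NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent \
NODEBENCH_GAIA_CAPABILITY_FORCE_WEB_SEARCH=1 \
  npm run mcp:dataset:gaia:capability:files:test
```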
195
+ Run all public lanes:
141
196
  ```bash
142
197
  npm run mcp:dataset:bench:all
143
198
  ```
144
199
 
200
+ Run full lane suite (includes GAIA):
201
+ ```bash
202
+ npm run mcp:dataset:bench:full
203
+ ```
204
+
145
205
  Implementation files:
146
206
  - `packages/mcp-local/src/__tests__/fixtures/generateBfclLongContextFixture.ts`
147
207
  - `packages/mcp-local/src/__tests__/fixtures/bfcl_v3_long_context.sample.json`
@@ -152,6 +212,16 @@ Implementation files:
152
212
  - `packages/mcp-local/src/__tests__/fixtures/generateSwebenchVerifiedFixture.ts`
153
213
  - `packages/mcp-local/src/__tests__/fixtures/swebench_verified.sample.json`
154
214
  - `packages/mcp-local/src/__tests__/openDatasetParallelEvalSwebench.test.ts`
215
+ - `packages/mcp-local/src/__tests__/fixtures/generateGaiaLevel3Fixture.py`
216
+ - `.cache/gaia/gaia_2023_level3_validation.sample.json`
217
+ - `packages/mcp-local/src/__tests__/openDatasetParallelEvalGaia.test.ts`
218
+ - `packages/mcp-local/src/__tests__/fixtures/generateGaiaCapabilityFixture.py`
219
+ - `.cache/gaia/gaia_capability_2023_all_validation.sample.json`
220
+ - `packages/mcp-local/src/__tests__/gaiaCapabilityEval.test.ts`
221
+ - `packages/mcp-local/src/__tests__/fixtures/generateGaiaCapabilityFilesFixture.py`
222
+ - `.cache/gaia/gaia_capability_files_2023_all_validation.sample.json`
223
+ - `.cache/gaia/data/...` (local GAIA attachments; do not commit)
224
+ - `packages/mcp-local/src/__tests__/gaiaCapabilityFilesEval.test.ts`
155
225
 
156
226
  Required tool chain per dataset task:
157
227
  - `run_recon`
@@ -176,6 +246,7 @@ Use `getMethodology("overview")` to see all available workflows.
176
246
  | Category | Tools | When to Use |
177
247
  |----------|-------|-------------|
178
248
  | **Web** | `web_search`, `fetch_url` | Research, reading docs, market validation |
249
+ | **Local Files** | `read_pdf_text`, `read_xlsx_file`, `read_csv_file` | Deterministic parsing of local attachments (GAIA file-backed lane) |
179
250
  | **GitHub** | `search_github`, `analyze_repo` | Finding libraries, studying implementations |
180
251
  | **Verification** | `start_cycle`, `log_phase`, `complete_cycle` | Tracking the flywheel process |
181
252
  | **Eval** | `start_eval_run`, `log_test_result` | Test case management |
@@ -184,6 +255,7 @@ Use `getMethodology("overview")` to see all available workflows.
184
255
  | **Vision** | `analyze_screenshot`, `capture_ui_screenshot` | UI/UX verification |
185
256
  | **Bootstrap** | `discover_infrastructure`, `triple_verify`, `self_implement` | Self-setup, triple verification |
186
257
  | **Autonomous** | `assess_risk`, `decide_re_update`, `run_self_maintenance` | Risk-aware execution, self-maintenance |
258
+ | **Parallel Agents** | `claim_agent_task`, `release_agent_task`, `list_agent_tasks`, `assign_agent_role`, `get_agent_role`, `log_context_budget`, `run_oracle_comparison`, `get_parallel_status` | Multi-agent coordination, task locking, role specialization, oracle testing |
187
259
  | **Meta** | `findTools`, `getMethodology` | Discover tools, get workflow guides |
188
260
 
189
261
  **→ Quick Refs:** Find tools by keyword: `findTools({ query: "verification" })` | Get workflow guide: `getMethodology({ topic: "..." })` | See [Methodology Topics](#methodology-topics) for all topics
@@ -538,11 +610,91 @@ Available via `getMethodology({ topic: "..." })`:
538
610
  | `agents_md_maintenance` | Keep docs in sync | [Auto-Update](#auto-update-this-file) |
539
611
  | `agent_bootstrap` | Self-discover, triple verify | [Self-Bootstrap](#agent-self-bootstrap-system) |
540
612
  | `autonomous_maintenance` | Risk-tiered execution | [Autonomous Maintenance](#autonomous-self-maintenance-system) |
613
+ | `parallel_agent_teams` | Multi-agent coordination, task locking, oracle testing | [Parallel Agent Teams](#parallel-agent-teams) |
614
+ | `self_reinforced_learning` | Trajectory analysis, self-eval, improvement recs | [Self-Reinforced Learning](#self-reinforced-learning-loop) |
541
615
 
542
616
  **→ Quick Refs:** Find tools: `findTools({ query: "..." })` | Get any methodology: `getMethodology({ topic: "..." })` | See [MCP Tool Categories](#mcp-tool-categories)
543
617
 
544
618
  ---
545
619
 
620
+ ## Parallel Agent Teams
621
+
622
+ Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www.anthropic.com/engineering/building-c-compiler) (Feb 2026).
623
+
624
+ Run multiple AI agents in parallel on a shared codebase with coordination via task locking, role specialization, context budget management, and oracle-based testing.
625
+
626
+ ### Quick Start — Parallel Agents
627
+
628
+ ```
629
+ 1. get_parallel_status({ includeHistory: true }) // Orient: what's happening?
630
+ 2. assign_agent_role({ role: "implementer" }) // Specialize
631
+ 3. claim_agent_task({ taskKey: "fix_auth" }) // Lock task
632
+ 4. ... do work ...
633
+ 5. log_context_budget({ eventType: "test_output", tokensUsed: 5000 }) // Track budget
634
+ 6. run_oracle_comparison({ testLabel: "auth_output", actualOutput: "...", expectedOutput: "...", oracleSource: "prod_v2" })
635
+ 7. release_agent_task({ taskKey: "fix_auth", status: "completed", progressNote: "Fixed JWT, added tests" })
636
+ ```
637
+
638
+ ### Predefined Agent Roles
639
+
640
+ | Role | Focus |
641
+ |------|-------|
642
+ | `implementer` | Primary feature work. Picks failing tests, implements fixes. |
643
+ | `dedup_reviewer` | Finds and coalesces duplicate implementations. |
644
+ | `performance_optimizer` | Profiles bottlenecks, optimizes hot paths. |
645
+ | `documentation_maintainer` | Keeps READMEs and progress files in sync. |
646
+ | `code_quality_critic` | Structural improvements, pattern enforcement. |
647
+ | `test_writer` | Writes targeted tests for edge cases and failure modes. |
648
+ | `security_auditor` | Audits for vulnerabilities, logs CRITICAL gaps. |
649
+
650
+ ### Key Patterns (from Anthropic blog)
651
+
652
+ - **Task Locking**: Claim before working. If two agents try the same task, the second picks a different one.
653
+ - **Context Window Budget**: Do NOT print thousands of useless bytes. Pre-compute summaries. Use `--fast` mode (a 1-10% random sample) for large test suites. Log errors with an ERROR prefix on the same line so they are easy to grep.
654
+ - **Oracle Testing**: Compare output against known-good reference. Each failing comparison is an independent work item for a parallel agent.
655
+ - **Time Blindness**: Agents can't tell time. Print progress infrequently. Use deterministic random sampling per-agent but randomized across VMs.
656
+ - **Progress Files**: Maintain running docs of status, failed approaches, and remaining tasks. Fresh agent sessions read these to orient.
657
+ - **Delta Debugging**: When tests pass individually but fail together, split the set in half to narrow down the minimal failing combination.
658
+
659
+ ### Bootstrap for External Repos
660
+
661
+ When nodebench-mcp is connected to a project that lacks parallel agent infrastructure, it can auto-detect gaps and scaffold everything needed:
662
+
663
+ ```
664
+ 1. bootstrap_parallel_agents({ projectRoot: "/path/to/their/repo", dryRun: true })
665
+ // Scans 7 categories: task coordination, roles, oracle, context budget,
666
+ // progress files, AGENTS.md parallel section, git worktrees
667
+
668
+ 2. bootstrap_parallel_agents({ projectRoot: "...", dryRun: false, techStack: "TypeScript/React" })
669
+ // Creates .parallel-agents/ dir, progress.md, roles.json, lock dirs, oracle dirs
670
+
671
+ 3. generate_parallel_agents_md({ techStack: "TypeScript/React", projectName: "their-project", maxAgents: 4 })
672
+ // Generates portable AGENTS.md section — paste into their repo
673
+
674
+ 4. Run the 6-step flywheel plan returned by the bootstrap tool to verify
675
+ 5. Fix any issues, re-verify
676
+ 6. record_learning({ key: "bootstrap_their_project", content: "...", category: "pattern" })
677
+ ```
678
+
679
+ The generated AGENTS.md section is framework-agnostic and works with any AI agent (Claude, GPT, etc.). It includes:
680
+ - Task locking protocol (file-based, no dependencies)
681
+ - Role definitions and assignment guide
682
+ - Oracle testing workflow with idiomatic examples
683
+ - Context budget rules
684
+ - Progress file protocol
685
+ - Anti-patterns to avoid
686
+ - Optional nodebench-mcp tool mapping table
687
+
688
+ ### MCP Prompts for Parallel Agent Teams
689
+
690
+ - `parallel-agent-team` — Full team setup with role assignment and task breakdown
691
+ - `oracle-test-harness` — Oracle-based testing setup for a component
692
+ - `bootstrap-parallel-agents` — Detect and scaffold parallel agent infra for any external repo
693
+
694
+ **→ Quick Refs:** Full methodology: `getMethodology({ topic: "parallel_agent_teams" })` | Find parallel tools: `findTools({ category: "parallel_agents" })` | Bootstrap external repo: `bootstrap_parallel_agents({ projectRoot: "..." })` | See [AI Flywheel](#the-ai-flywheel-mandatory)
695
+
696
+ ---
697
+
546
698
  ## Auto-Update This File
547
699
 
548
700
  Agents can self-update this file:
package/README.md CHANGED
@@ -1,200 +1,208 @@
1
- # NodeBench MCP Server
1
+ # NodeBench MCP
2
2
 
3
- A fully local, zero-config MCP server with **60 tools** for AI-powered development workflows.
3
+ **Make AI agents catch the bugs they normally ship.**
4
4
 
5
- **Features:**
6
- - Web search (Gemini/OpenAI/Perplexity)
7
- - GitHub repository discovery and analysis
8
- - Job market research
9
- - AGENTS.md self-maintenance
10
- - AI vision for screenshot analysis
11
- - 6-phase verification flywheel
12
- - Self-reinforced learning (trajectory analysis, health reports, improvement recommendations)
13
- - Autonomous agent bootstrap and self-maintenance
14
- - SQLite-backed learning database
15
-
16
- ## Quick Start (30 seconds)
17
-
18
- ### Option A: Claude Code CLI (recommended)
5
+ One command gives your agent structured research, risk assessment, 3-layer testing, quality gates, and a persistent knowledge base — so every fix is thorough and every insight compounds into future work.
19
6
 
20
7
  ```bash
21
8
  claude mcp add nodebench -- npx -y nodebench-mcp
22
9
  ```
23
10
 
24
- That's it. One command, 60 tools. No restart needed.
11
+ ---
25
12
 
26
- ### Option B: Manual config
13
+ ## What Bare Agents Miss
27
14
 
28
- Add to `~/.claude/settings.json` (global) or `.claude.json` (per-project):
15
+ We benchmarked 9 real production prompts — things like *"The LinkedIn posting pipeline is creating duplicate posts"* and *"The agent loop hits budget but still gets new events"* — comparing a bare agent vs one with NodeBench MCP.
29
16
 
30
- ```json
31
- {
32
- "mcpServers": {
33
- "nodebench": {
34
- "command": "npx",
35
- "args": ["-y", "nodebench-mcp"]
36
- }
37
- }
38
- }
39
- ```
17
+ | What gets measured | Bare Agent | With NodeBench MCP |
18
+ |---|---|---|
19
+ | Issues detected before deploy | 0 | **13** (4 high, 8 medium, 1 low) |
20
+ | Research findings before coding | 0 | **21** |
21
+ | Risk assessments | 0 | **9** |
22
+ | Test coverage layers | 1 | **3** (static + unit + integration) |
23
+ | Integration failures caught early | 0 | **4** |
24
+ | Regression eval cases created | 0 | **22** |
25
+ | Quality gate rules enforced | 0 | **52** |
26
+ | Deploys blocked by gate violations | 0 | **4** |
27
+ | Knowledge entries banked | 0 | **9** |
28
+ | Blind spots shipped to production | **26** | **0** |
29
+
30
+ The bare agent reads the code, implements a fix, runs tests once, and ships. The MCP agent researches first, assesses risk, tracks issues to resolution, runs 3-layer tests, creates regression guards, enforces quality gates, and banks everything as knowledge for next time.
31
+
32
+ Every additional tool call produces a concrete artifact — an issue found, a risk assessed, a regression guarded — that compounds across future tasks.
33
+
34
+ ---
35
+
36
+ ## How It Works — 3 Real Examples
37
+
38
+ ### Example 1: Bug fix
39
+
40
+ You type: *"The content queue has 40 items stuck in 'judging' status for 6 hours"*
41
+
42
+ **Bare agent:** Reads the queue code, finds a potential fix, runs tests, ships.
43
+
44
+ **With NodeBench MCP:** The agent runs structured recon and discovers 3 blind spots the bare agent misses:
45
+ - No retry backoff on OpenRouter rate limits (HIGH)
46
+ - JSON regex `match(/\{[\s\S]*\}/)` grabs last `}` — breaks on multi-object responses (MEDIUM)
47
+ - No timeout on LLM call — hung request blocks entire cron for 15+ min (not detected by unit tests)
48
+
49
+ All 3 are logged as gaps, resolved, regression-tested, and the patterns banked so the next similar bug is fixed faster.
50
+
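
To see why the second blind spot above matters, here is a hypothetical snippet (not code from the audited pipeline) showing how the greedy extraction fails:

```typescript
const reply = 'Sure, here are the results: {"id": 1} and also {"id": 2}';

// Greedy: matches from the FIRST "{" to the LAST "}", so a reply containing
// two objects yields '{"id": 1} and also {"id": 2}', which is not valid JSON.
const greedy = reply.match(/\{[\s\S]*\}/)?.[0];
// JSON.parse(greedy!)  -> throws SyntaxError

// Lazy: stops at the first "}", so it recovers '{"id": 1}' here, but it still
// breaks on nested objects; a robust fix needs a JSON-aware parse step.
const lazy = reply.match(/\{[\s\S]*?\}/)?.[0];
// JSON.parse(lazy!)    -> { id: 1 }
```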
51
+ ### Example 2: Parallel agents overwriting each other
40
52
 
41
- Then restart Claude Code.
53
+ You type: *"I launched 3 Claude Code subagents but they keep overwriting each other's changes"*
54
+
55
+ **Without NodeBench:** Two of the agents see the same bug and both implement a fix. The third re-investigates what agent 1 already solved. Agent 2 hits its context limit mid-fix and loses work.
56
+
57
+ **With NodeBench MCP:** Each subagent calls `claim_agent_task` to lock its work. Roles are assigned so they don't overlap. Context budget is tracked. Progress notes ensure handoff without starting from scratch.
58
+
59
+ ### Example 3: Knowledge compounding
60
+
61
+ Tasks 1-3 start with zero prior knowledge. By task 9, the agent finds 2+ relevant prior findings before writing a single line of code. Bare agents start from zero every time.
42
62
 
43
63
  ---
44
64
 
45
- ## Alternative: Build from source
65
+ ## Quick Start
66
+
67
+ ### Install (30 seconds)
46
68
 
47
69
  ```bash
48
- git clone https://github.com/nodebench/nodebench-ai.git
49
- cd nodebench-ai/packages/mcp-local
50
- npm install && npm run build
70
+ # Claude Code CLI (recommended)
71
+ claude mcp add nodebench -- npx -y nodebench-mcp
51
72
  ```
52
73
 
53
- Then use absolute path in settings:
74
+ Or add to `~/.claude/settings.json` or `.claude.json`:
54
75
 
55
76
  ```json
56
77
  {
57
78
  "mcpServers": {
58
79
  "nodebench": {
59
- "command": "node",
60
- "args": ["/path/to/packages/mcp-local/dist/index.js"]
80
+ "command": "npx",
81
+ "args": ["-y", "nodebench-mcp"]
61
82
  }
62
83
  }
63
84
  }
64
85
  ```
65
86
 
66
- ### 3. Add API keys (optional but recommended)
87
+ ### First prompts to try
67
88
 
68
- Add to your shell profile (`~/.bashrc`, `~/.zshrc`, or Windows Environment Variables):
89
+ ```
90
+ # See what's available
91
+ > Use getMethodology("overview") to see all workflows
69
92
 
70
- ```bash
71
- # Required for web search (pick one)
72
- export GEMINI_API_KEY="your-key" # Best: Google Search grounding
73
- export OPENAI_API_KEY="your-key" # Alternative: GPT-4o web search
74
- export PERPLEXITY_API_KEY="your-key" # Alternative: Perplexity
75
-
76
- # Required for GitHub (higher rate limits)
77
- export GITHUB_TOKEN="your-token" # github.com/settings/tokens
78
-
79
- # Required for vision analysis (pick one)
80
- export GEMINI_API_KEY="your-key" # Best: Gemini 2.5 Flash
81
- export OPENAI_API_KEY="your-key" # Alternative: GPT-4o
82
- export ANTHROPIC_API_KEY="your-key" # Alternative: Claude
93
+ # Before your next task — search for prior knowledge
94
+ > Use search_all_knowledge("what I'm about to work on")
95
+
96
+ # Run the full verification pipeline on a change
97
+ > Use getMethodology("mandatory_flywheel") and follow the 6 steps
83
98
  ```
84
99
 
85
- ### 4. Restart Claude Code
100
+ ### Optional: API keys for web search and vision
86
101
 
87
102
  ```bash
88
- # Quit and reopen Claude Code, or run:
89
- claude --mcp-debug
103
+ export GEMINI_API_KEY="your-key" # Web search + vision (recommended)
104
+ export GITHUB_TOKEN="your-token" # GitHub (higher rate limits)
90
105
  ```
91
106
 
92
- ### 5. Test it works
107
+ ---
93
108
 
94
- In Claude Code, try these prompts:
109
+ ## What You Get
110
+
111
+ ### Core workflow (use these every session)
112
+
113
+ | When you... | Use this | Impact |
114
+ |---|---|---|
115
+ | Start any task | `search_all_knowledge` | Find prior findings — avoid repeating past mistakes |
116
+ | Research before coding | `run_recon` + `log_recon_finding` | Structured research with surfaced findings |
117
+ | Assess risk before acting | `assess_risk` | Risk tier determines if action needs confirmation |
118
+ | Track implementation | `start_verification_cycle` + `log_gap` | Issues logged with severity, tracked to resolution |
119
+ | Test thoroughly | `log_test_result` (3 layers) | Static + unit + integration vs running tests once |
120
+ | Guard against regression | `start_eval_run` + `record_eval_result` | Eval cases that protect this fix in the future |
121
+ | Gate before deploy | `run_quality_gate` | Boolean rules enforced — violations block deploy |
122
+ | Bank knowledge | `record_learning` | Persisted findings compound across future sessions |
123
+ | Verify completeness | `run_mandatory_flywheel` | 6-step minimum — catches dead code and intent mismatches |
124
+
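
A rough sketch of how these calls chain in one session (tool names from the table above; the argument shapes are illustrative, not the exact schemas):

```
1. search_all_knowledge("duplicate LinkedIn posts")       // any prior findings?
2. run_recon({ ... }) + log_recon_finding({ ... })        // research before coding
3. assess_risk({ ... })                                   // does this action need confirmation?
4. start_verification_cycle({ ... }) + log_gap({ ... })   // track issues to resolution
5. log_test_result({ ... })                               // static, unit, integration
6. start_eval_run({ ... }) + record_eval_result({ ... })  // regression guard for this fix
7. run_quality_gate({ ... })                              // violations block the deploy
8. record_learning({ ... })                               // bank it for the next session
```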
125
+ ### When running parallel agents (Claude Code subagents, worktrees)
126
+
127
+ | When you... | Use this | Impact |
128
+ |---|---|---|
129
+ | Prevent duplicate work | `claim_agent_task` / `release_agent_task` | Task locks — each task owned by exactly one agent |
130
+ | Specialize agents | `assign_agent_role` | 7 roles: implementer, test_writer, critic, etc. |
131
+ | Track context usage | `log_context_budget` | Prevents context exhaustion mid-fix |
132
+ | Validate against reference | `run_oracle_comparison` | Compare output against known-good oracle |
133
+ | Orient new sessions | `get_parallel_status` | See what all agents are doing and what's blocked |
134
+ | Bootstrap any repo | `bootstrap_parallel_agents` | Auto-detect gaps, scaffold coordination infra |
135
+
136
+ ### Research and discovery
137
+
138
+ | When you... | Use this | Impact |
139
+ |---|---|---|
140
+ | Search the web | `web_search` | Gemini/OpenAI/Perplexity — latest docs and updates |
141
+ | Fetch a URL | `fetch_url` | Read any page as clean markdown |
142
+ | Find GitHub repos | `search_github` + `analyze_repo` | Discover and evaluate libraries and patterns |
143
+ | Analyze screenshots | `analyze_screenshot` | AI vision (Gemini/GPT-4o/Claude) for UI QA |
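
For example, a typical discovery chain (argument shapes are illustrative):

```
> Use search_github({ query: "mcp server", language: "typescript", minStars: 100 })
> Use analyze_repo({ repoUrl: "owner/repo" }) to see the tech stack and patterns
> Use fetch_url to read the project's documentation
```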
95
144
 
96
- ```
97
- # Check your environment
98
- > Use setup_local_env to check my development environment
145
+ ---
99
146
 
100
- # Search GitHub
101
- > Use search_github to find TypeScript MCP servers with at least 100 stars
147
+ ## The Methodology Pipeline
102
148
 
103
- # Fetch documentation
104
- > Use fetch_url to read https://modelcontextprotocol.io/introduction
149
+ NodeBench MCP isn't just a bag of tools — it's a pipeline. Each step feeds the next:
105
150
 
106
- # Get methodology
107
- > Use getMethodology("overview") to see all available workflows
151
+ ```
152
+ Research → Risk → Implement → Test (3 layers) → Eval → Gate → Learn → Ship
+     ↑                                                                 │
+     └───────────────────── knowledge compounds ───────────────────────┘
108
155
  ```
109
156
 
110
- ---
157
+ **Inner loop** (per change): 6-phase verification ensures correctness.
158
+ **Outer loop** (over time): Eval-driven development ensures improvement.
159
+ **Together**: The AI Flywheel — every verification produces eval artifacts, every regression triggers verification.
111
160
 
112
- ## Tool Categories
113
-
114
- | Category | Tools | Description |
115
- |----------|-------|-------------|
116
- | **Web** | `web_search`, `fetch_url` | Search the web, fetch URLs as markdown |
117
- | **GitHub** | `search_github`, `analyze_repo` | Find repos, analyze tech stacks |
118
- | **Documentation** | `update_agents_md`, `research_job_market`, `setup_local_env` | Self-maintaining docs, job research |
119
- | **Vision** | `discover_vision_env`, `analyze_screenshot`, `manipulate_screenshot` | AI-powered image analysis |
120
- | **UI Capture** | `capture_ui_screenshot`, `capture_responsive_suite` | Browser screenshots (requires Playwright) |
121
- | **Verification** | `start_cycle`, `log_phase`, `complete_cycle` | 6-phase dev workflow |
122
- | **Eval** | `start_eval_run`, `log_test_result`, `list_eval_runs` | Test case tracking |
123
- | **Quality Gates** | `run_quality_gate`, `get_gate_history` | Pass/fail checkpoints |
124
- | **Learning** | `record_learning`, `search_learnings`, `search_all_knowledge` | Persistent knowledge base |
125
- | **Flywheel** | `run_closed_loop`, `check_framework_updates` | Automated workflows |
126
- | **Recon** | `run_recon`, `log_recon_finding`, `log_gap` | Discovery and gap tracking |
127
- | **Agent Bootstrap** | `bootstrap_project`, `setup_local_env`, `triple_verify`, `self_implement` | Self-discover infrastructure, auto-configure |
128
- | **Autonomous** | `assess_risk`, `decide_re_update`, `run_self_maintenance`, `run_autonomous_loop` | Risk-tiered autonomous execution |
129
- | **Self-Eval** | `log_tool_call`, `get_trajectory_analysis`, `get_self_eval_report`, `get_improvement_recommendations` | Self-reinforced learning loop |
130
- | **Meta** | `findTools`, `getMethodology` | Tool discovery, methodology guides |
161
+ Ask the agent: `Use getMethodology("overview")` to see all 18 methodology topics.
131
162
 
132
163
  ---
133
164
 
134
- ## Methodology Topics (17 total)
135
-
136
- Ask Claude: `Use getMethodology("topic_name")`
137
-
138
- - `overview` — See all methodologies
139
- - `verification` — 6-phase development cycle
140
- - `eval` — Test case management
141
- - `flywheel` — Continuous improvement loop
142
- - `mandatory_flywheel` — Required verification for changes
143
- - `reconnaissance` — Codebase discovery
144
- - `quality_gates` — Pass/fail checkpoints
145
- - `ui_ux_qa` — Frontend verification
146
- - `agentic_vision` — AI-powered visual QA
147
- - `closed_loop` — Build/test before presenting
148
- - `learnings` — Knowledge persistence
149
- - `project_ideation` — Validate ideas before building
150
- - `tech_stack_2026` — Dependency management
151
- - `telemetry_setup` — Observability setup
152
- - `agents_md_maintenance` — Keep docs in sync
153
- - `agent_bootstrap` — Self-discover and auto-configure infrastructure
154
- - `autonomous_maintenance` — Risk-tiered autonomous execution
155
- - `self_reinforced_learning` — Trajectory analysis and improvement loop
165
+ ## Parallel Agents with Claude Code
156
166
 
157
- ---
167
+ Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www.anthropic.com/engineering/building-c-compiler) (Feb 2026).
158
168
 
159
- ## Self-Reinforced Learning (v1.4.0)
169
+ **When to use:** Only when running 2+ agent sessions. Single-agent workflows use the standard pipeline above.
160
170
 
161
- The MCP learns from its own usage. As you develop with the tools, the system accumulates trajectory data and surfaces recommendations.
171
+ **How it works with Claude Code's Task tool:**
162
172
 
163
- ```
164
- Use → Log → Analyze → Recommend → Apply → Re-analyze
165
- ```
173
+ 1. **COORDINATOR** (your main session) breaks work into independent tasks
174
+ 2. Each **Task tool** call spawns a subagent with instructions to:
175
+ - `claim_agent_task` — lock the task
176
+ - `assign_agent_role` — specialize (implementer, test_writer, critic, etc.)
177
+ - Do the work
178
+ - `release_agent_task` — handoff with progress note
179
+ 3. Coordinator calls `get_parallel_status` to monitor all subagents
180
+ 4. Coordinator runs `run_quality_gate` on the aggregate result
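
A sketch of that flow as tool calls (values are illustrative; the subagent instructions go inside each Task prompt):

```
# COORDINATOR (main session)
1. get_parallel_status({ includeHistory: true })     // orient
2. Task → "claim_agent_task({ taskKey: 'fix_dedup' }), assign_agent_role({ role: 'implementer' }),
           do the work, release_agent_task({ taskKey: 'fix_dedup', status: 'completed', progressNote: '...' })"
3. Task → "claim_agent_task({ taskKey: 'add_tests' }), assign_agent_role({ role: 'test_writer' }), ..."
4. get_parallel_status({})                            // monitor all subagents
5. run_quality_gate({ ... })                          // gate the aggregate result
```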
166
181
 
167
- **Try it:**
168
- ```
169
- > Use getMethodology("self_reinforced_learning") for the 5-step guide
170
- > Use get_self_eval_report to see your project's health score
171
- > Use get_improvement_recommendations to find actionable improvements
172
- > Use get_trajectory_analysis to see your tool usage patterns
173
- ```
174
-
175
- The health score is a weighted composite:
176
- - Cycle completion (25%) — Are verification cycles being completed?
177
- - Eval pass rate (25%) — Are eval runs succeeding?
178
- - Gap resolution (20%) — Are logged gaps getting resolved?
179
- - Gate pass rate (15%) — Are quality gates passing?
180
- - Tool error rate (15%) — Are tools running without errors?
182
+ **MCP Prompts available:**
183
+ - `claude-code-parallel` — Step-by-step Claude Code subagent coordination
184
+ - `parallel-agent-team` Full team setup with role assignment
185
+ - `oracle-test-harness` Validate outputs against known-good reference
186
+ - `bootstrap-parallel-agents` Scaffold parallel infra for any repo
181
187
 
182
188
  ---
183
189
 
184
- ## VSCode Extension Setup
190
+ ## Build from Source
185
191
 
186
- If using the Claude Code VSCode extension:
192
+ ```bash
193
+ git clone https://github.com/nodebench/nodebench-ai.git
194
+ cd nodebench-ai/packages/mcp-local
195
+ npm install && npm run build
196
+ ```
187
197
 
188
- 1. Open VSCode Settings (Ctrl/Cmd + ,)
189
- 2. Search for "Claude Code MCP"
190
- 3. Add server configuration:
198
+ Then use the absolute path in your MCP config:
191
199
 
192
200
  ```json
193
201
  {
194
- "claude-code.mcpServers": {
202
+ "mcpServers": {
195
203
  "nodebench": {
196
204
  "command": "node",
197
- "args": ["/absolute/path/to/packages/mcp-local/dist/index.js"]
205
+ "args": ["/path/to/packages/mcp-local/dist/index.js"]
198
206
  }
199
207
  }
200
208
  }
@@ -202,104 +210,15 @@ If using the Claude Code VSCode extension:
202
210
 
203
211
  ---
204
212
 
205
- ## Optional Dependencies
206
-
207
- Install for additional features:
208
-
209
- ```bash
210
- # Screenshot capture (headless browser)
211
- npm install playwright
212
- npx playwright install chromium
213
-
214
- # Image manipulation
215
- npm install sharp
216
-
217
- # HTML parsing (already included)
218
- npm install cheerio
219
-
220
- # AI providers (pick your preferred)
221
- npm install @google/genai # Gemini
222
- npm install openai # OpenAI
223
- npm install @anthropic-ai/sdk # Anthropic
224
- ```
225
-
226
- ---
227
-
228
213
  ## Troubleshooting
229
214
 
230
- **"No search provider available"**
231
- - Set at least one API key: `GEMINI_API_KEY`, `OPENAI_API_KEY`, or `PERPLEXITY_API_KEY`
232
-
233
- **"GitHub API error 403"**
234
- - Set `GITHUB_TOKEN` for higher rate limits (60/hour without, 5000/hour with)
215
+ **"No search provider available"** — Set `GEMINI_API_KEY`, `OPENAI_API_KEY`, or `PERPLEXITY_API_KEY`
235
216
 
236
- **"Cannot find module"**
237
- - Run `npm run build` in the mcp-local directory
238
-
239
- **MCP not connecting**
240
- - Check path is absolute in settings.json
241
- - Run `claude --mcp-debug` to see connection errors
242
- - Ensure Node.js >= 18
243
-
244
- ---
217
+ **"GitHub API error 403"** — Set `GITHUB_TOKEN` for higher rate limits
245
218
 
246
- ## Example Workflows
247
-
248
- ### Research a new project idea
249
-
250
- ```
251
- 1. Use getMethodology("project_ideation") for the 6-step process
252
- 2. Use web_search to validate market demand
253
- 3. Use search_github to find similar projects
254
- 4. Use analyze_repo to study competitor implementations
255
- 5. Use research_job_market to understand skill demand
256
- ```
257
-
258
- ### Analyze a GitHub repo before using it
259
-
260
- ```
261
- 1. Use search_github({ query: "mcp server", language: "typescript", minStars: 100 })
262
- 2. Use analyze_repo({ repoUrl: "owner/repo" }) to see tech stack and patterns
263
- 3. Use fetch_url to read their documentation
264
- ```
265
-
266
- ### Set up a new development environment
267
-
268
- ```
269
- 1. Use setup_local_env to scan current environment
270
- 2. Follow the recommendations to install missing SDKs
271
- 3. Use getMethodology("tech_stack_2026") for ongoing maintenance
272
- ```
273
-
274
- ---
275
-
276
- ## Agent Protocol (NODEBENCH_AGENTS.md)
277
-
278
- The package includes `NODEBENCH_AGENTS.md` — a portable agent operating procedure that any AI agent can use to self-configure.
279
-
280
- **What it provides:**
281
- - The 6-step AI Flywheel verification process (mandatory for all changes)
282
- - MCP tool usage patterns and workflows
283
- - Quality gate definitions
284
- - Post-implementation checklists
285
- - Self-update instructions
286
-
287
- **To use in your project:**
288
-
289
- 1. Copy `NODEBENCH_AGENTS.md` to your repo root
290
- 2. Agents will auto-discover and follow the protocol
291
- 3. Use `update_agents_md` tool to keep it in sync
292
-
293
- Or fetch it directly:
294
-
295
- ```bash
296
- curl -o AGENTS.md https://raw.githubusercontent.com/nodebench/nodebench-ai/main/packages/mcp-local/NODEBENCH_AGENTS.md
297
- ```
219
+ **"Cannot find module"** — Run `npm run build` in the mcp-local directory
298
220
 
299
- The file is designed to be:
300
- - **Portable** — Works in any repo, any language
301
- - **Self-updating** — Agents can modify it via MCP tools
302
- - **Composable** — Add your own sections alongside the standard protocol
221
+ **MCP not connecting** — Check path is absolute, run `claude --mcp-debug`, ensure Node.js >= 18
303
222
 
304
223
  ---
305
224
 
@@ -0,0 +1 @@
1
+ export {};