nodebench-mcp 1.4.0 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (38)
  1. package/NODEBENCH_AGENTS.md +154 -2
  2. package/README.md +152 -192
  3. package/dist/__tests__/comparativeBench.test.d.ts +1 -0
  4. package/dist/__tests__/comparativeBench.test.js +722 -0
  5. package/dist/__tests__/comparativeBench.test.js.map +1 -0
  6. package/dist/__tests__/evalHarness.test.js +24 -2
  7. package/dist/__tests__/evalHarness.test.js.map +1 -1
  8. package/dist/__tests__/gaiaCapabilityEval.test.d.ts +14 -0
  9. package/dist/__tests__/gaiaCapabilityEval.test.js +420 -0
  10. package/dist/__tests__/gaiaCapabilityEval.test.js.map +1 -0
  11. package/dist/__tests__/gaiaCapabilityFilesEval.test.d.ts +15 -0
  12. package/dist/__tests__/gaiaCapabilityFilesEval.test.js +303 -0
  13. package/dist/__tests__/gaiaCapabilityFilesEval.test.js.map +1 -0
  14. package/dist/__tests__/openDatasetParallelEvalGaia.test.d.ts +7 -0
  15. package/dist/__tests__/openDatasetParallelEvalGaia.test.js +279 -0
  16. package/dist/__tests__/openDatasetParallelEvalGaia.test.js.map +1 -0
  17. package/dist/__tests__/openDatasetPerfComparison.test.d.ts +10 -0
  18. package/dist/__tests__/openDatasetPerfComparison.test.js +318 -0
  19. package/dist/__tests__/openDatasetPerfComparison.test.js.map +1 -0
  20. package/dist/__tests__/tools.test.js +155 -7
  21. package/dist/__tests__/tools.test.js.map +1 -1
  22. package/dist/db.js +56 -0
  23. package/dist/db.js.map +1 -1
  24. package/dist/index.js +370 -11
  25. package/dist/index.js.map +1 -1
  26. package/dist/tools/localFileTools.d.ts +15 -0
  27. package/dist/tools/localFileTools.js +386 -0
  28. package/dist/tools/localFileTools.js.map +1 -0
  29. package/dist/tools/metaTools.js +170 -3
  30. package/dist/tools/metaTools.js.map +1 -1
  31. package/dist/tools/parallelAgentTools.d.ts +18 -0
  32. package/dist/tools/parallelAgentTools.js +1272 -0
  33. package/dist/tools/parallelAgentTools.js.map +1 -0
  34. package/dist/tools/selfEvalTools.js +240 -10
  35. package/dist/tools/selfEvalTools.js.map +1 -1
  36. package/dist/tools/webTools.js +171 -37
  37. package/dist/tools/webTools.js.map +1 -1
  38. package/package.json +19 -7
@@ -117,7 +117,7 @@ Run ToolBench parallel subagent benchmark:
117
117
  NODEBENCH_TOOLBENCH_TASK_LIMIT=6 NODEBENCH_TOOLBENCH_CONCURRENCY=3 npm run mcp:dataset:toolbench:test
118
118
  ```
119
119
 
120
- Run all lanes:
120
+ Run all public lanes:
121
121
  ```bash
122
122
  npm run mcp:dataset:bench:all
123
123
  ```
@@ -137,11 +137,71 @@ Run SWE-bench parallel subagent benchmark:
137
137
  NODEBENCH_SWEBENCH_TASK_LIMIT=8 NODEBENCH_SWEBENCH_CONCURRENCY=4 npm run mcp:dataset:swebench:test
138
138
  ```
139
139
 
140
- Run all lanes:
140
+ Fourth lane (GAIA: gated, long-horizon, tool-augmented tasks):
141
+ - Dataset: `gaia-benchmark/GAIA` (gated)
142
+ - Default config: `2023_level3`
143
+ - Default split: `validation`
144
+ - Source: `https://huggingface.co/datasets/gaia-benchmark/GAIA`
145
+
146
+ Notes:
147
+ - Fixture is written to `.cache/gaia` (gitignored). Do not commit GAIA question/answer content.
148
+ - Refresh requires `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN` in your shell.
149
+ - Python deps: `pandas`, `huggingface_hub`, `pyarrow` (or equivalent parquet engine).
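+
+ For example, the one-time setup before running the refresh below (the token value is a placeholder):
+
+ ```bash
+ export HF_TOKEN="hf_xxx"                     # or HUGGINGFACE_HUB_TOKEN
+ pip install pandas huggingface_hub pyarrow   # pyarrow supplies the parquet engine
+ ```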
150
+
151
+ Refresh GAIA fixture:
152
+ ```bash
153
+ npm run mcp:dataset:gaia:refresh
154
+ ```
155
+
156
+ Run GAIA parallel subagent benchmark:
157
+ ```bash
158
+ NODEBENCH_GAIA_TASK_LIMIT=8 NODEBENCH_GAIA_CONCURRENCY=4 npm run mcp:dataset:gaia:test
159
+ ```
160
+
161
+ GAIA capability benchmark (accuracy: LLM-only vs LLM+tools):
162
+ - This runs real model calls and web search. It is disabled by default and only intended for regression checks.
163
+ - Uses Gemini by default. Ensure `GEMINI_API_KEY` is available (repo `.env.local` is loaded by the test).
164
+ - Scoring fixture includes ground-truth answers and MUST remain under `.cache/gaia` (gitignored).
165
+
166
+ Generate scoring fixture (local only, gated):
167
+ ```bash
168
+ npm run mcp:dataset:gaia:capability:refresh
169
+ ```
170
+
171
+ Run capability benchmark:
172
+ ```bash
173
+ NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:test
174
+ ```
175
+
176
+ GAIA capability benchmark (file-backed lane: PDF / XLSX / CSV):
177
+ - This lane measures the impact of deterministic local parsing tools on GAIA tasks with attachments.
178
+ - Fixture includes ground-truth answers and MUST remain under `.cache/gaia` (gitignored).
179
+ - Attachments are copied into `.cache/gaia/data/<file_path>` for offline deterministic runs after the first download.
180
+
181
+ Generate file-backed scoring fixture + download attachments (local only, gated):
182
+ ```bash
183
+ npm run mcp:dataset:gaia:capability:files:refresh
184
+ ```
185
+
186
+ Run file-backed capability benchmark:
187
+ ```bash
188
+ NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:files:test
189
+ ```
190
+
191
+ Modes:
192
+ - Recommended (more stable): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag`
193
+ - More realistic (higher variance): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (optional `NODEBENCH_GAIA_CAPABILITY_FORCE_WEB_SEARCH=1`)
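+
+ For example, to pin the file-backed run above to one mode or the other:
+
+ ```bash
+ # recommended, more stable
+ NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag npm run mcp:dataset:gaia:capability:files:test
+
+ # more realistic, higher variance; optionally force web search
+ NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent NODEBENCH_GAIA_CAPABILITY_FORCE_WEB_SEARCH=1 \
+   npm run mcp:dataset:gaia:capability:files:test
+ ```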
194
+
195
+ Run all public lanes:
141
196
  ```bash
142
197
  npm run mcp:dataset:bench:all
143
198
  ```
144
199
 
200
+ Run full lane suite (includes GAIA):
201
+ ```bash
202
+ npm run mcp:dataset:bench:full
203
+ ```
204
+
145
205
  Implementation files:
146
206
  - `packages/mcp-local/src/__tests__/fixtures/generateBfclLongContextFixture.ts`
147
207
  - `packages/mcp-local/src/__tests__/fixtures/bfcl_v3_long_context.sample.json`
@@ -152,6 +212,16 @@ Implementation files:
152
212
  - `packages/mcp-local/src/__tests__/fixtures/generateSwebenchVerifiedFixture.ts`
153
213
  - `packages/mcp-local/src/__tests__/fixtures/swebench_verified.sample.json`
154
214
  - `packages/mcp-local/src/__tests__/openDatasetParallelEvalSwebench.test.ts`
215
+ - `packages/mcp-local/src/__tests__/fixtures/generateGaiaLevel3Fixture.py`
216
+ - `.cache/gaia/gaia_2023_level3_validation.sample.json`
217
+ - `packages/mcp-local/src/__tests__/openDatasetParallelEvalGaia.test.ts`
218
+ - `packages/mcp-local/src/__tests__/fixtures/generateGaiaCapabilityFixture.py`
219
+ - `.cache/gaia/gaia_capability_2023_all_validation.sample.json`
220
+ - `packages/mcp-local/src/__tests__/gaiaCapabilityEval.test.ts`
221
+ - `packages/mcp-local/src/__tests__/fixtures/generateGaiaCapabilityFilesFixture.py`
222
+ - `.cache/gaia/gaia_capability_files_2023_all_validation.sample.json`
223
+ - `.cache/gaia/data/...` (local GAIA attachments; do not commit)
224
+ - `packages/mcp-local/src/__tests__/gaiaCapabilityFilesEval.test.ts`
155
225
 
156
226
  Required tool chain per dataset task:
157
227
  - `run_recon`
@@ -176,6 +246,7 @@ Use `getMethodology("overview")` to see all available workflows.
176
246
  | Category | Tools | When to Use |
177
247
  |----------|-------|-------------|
178
248
  | **Web** | `web_search`, `fetch_url` | Research, reading docs, market validation |
249
+ | **Local Files** | `read_pdf_text`, `read_xlsx_file`, `read_csv_file` | Deterministic parsing of local attachments (GAIA file-backed lane) |
179
250
  | **GitHub** | `search_github`, `analyze_repo` | Finding libraries, studying implementations |
180
251
  | **Verification** | `start_cycle`, `log_phase`, `complete_cycle` | Tracking the flywheel process |
181
252
  | **Eval** | `start_eval_run`, `log_test_result` | Test case management |
@@ -184,6 +255,7 @@ Use `getMethodology("overview")` to see all available workflows.
184
255
  | **Vision** | `analyze_screenshot`, `capture_ui_screenshot` | UI/UX verification |
185
256
  | **Bootstrap** | `discover_infrastructure`, `triple_verify`, `self_implement` | Self-setup, triple verification |
186
257
  | **Autonomous** | `assess_risk`, `decide_re_update`, `run_self_maintenance` | Risk-aware execution, self-maintenance |
258
+ | **Parallel Agents** | `claim_agent_task`, `release_agent_task`, `list_agent_tasks`, `assign_agent_role`, `get_agent_role`, `log_context_budget`, `run_oracle_comparison`, `get_parallel_status` | Multi-agent coordination, task locking, role specialization, oracle testing |
187
259
  | **Meta** | `findTools`, `getMethodology` | Discover tools, get workflow guides |
188
260
 
189
261
  **→ Quick Refs:** Find tools by keyword: `findTools({ query: "verification" })` | Get workflow guide: `getMethodology({ topic: "..." })` | See [Methodology Topics](#methodology-topics) for all topics
@@ -538,11 +610,91 @@ Available via `getMethodology({ topic: "..." })`:
538
610
  | `agents_md_maintenance` | Keep docs in sync | [Auto-Update](#auto-update-this-file) |
539
611
  | `agent_bootstrap` | Self-discover, triple verify | [Self-Bootstrap](#agent-self-bootstrap-system) |
540
612
  | `autonomous_maintenance` | Risk-tiered execution | [Autonomous Maintenance](#autonomous-self-maintenance-system) |
613
+ | `parallel_agent_teams` | Multi-agent coordination, task locking, oracle testing | [Parallel Agent Teams](#parallel-agent-teams) |
614
+ | `self_reinforced_learning` | Trajectory analysis, self-eval, improvement recs | [Self-Reinforced Learning](#self-reinforced-learning-loop) |
541
615
 
542
616
  **→ Quick Refs:** Find tools: `findTools({ query: "..." })` | Get any methodology: `getMethodology({ topic: "..." })` | See [MCP Tool Categories](#mcp-tool-categories)
543
617
 
544
618
  ---
545
619
 
620
+ ## Parallel Agent Teams
621
+
622
+ Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www.anthropic.com/engineering/building-c-compiler) (Feb 2026).
623
+
624
+ Run multiple AI agents in parallel on a shared codebase with coordination via task locking, role specialization, context budget management, and oracle-based testing.
625
+
626
+ ### Quick Start — Parallel Agents
627
+
628
+ ```
629
+ 1. get_parallel_status({ includeHistory: true }) // Orient: what's happening?
630
+ 2. assign_agent_role({ role: "implementer" }) // Specialize
631
+ 3. claim_agent_task({ taskKey: "fix_auth" }) // Lock task
632
+ 4. ... do work ...
633
+ 5. log_context_budget({ eventType: "test_output", tokensUsed: 5000 }) // Track budget
634
+ 6. run_oracle_comparison({ testLabel: "auth_output", actualOutput: "...", expectedOutput: "...", oracleSource: "prod_v2" })
635
+ 7. release_agent_task({ taskKey: "fix_auth", status: "completed", progressNote: "Fixed JWT, added tests" })
636
+ ```
637
+
638
+ ### Predefined Agent Roles
639
+
640
+ | Role | Focus |
641
+ |------|-------|
642
+ | `implementer` | Primary feature work. Picks failing tests, implements fixes. |
643
+ | `dedup_reviewer` | Finds and coalesces duplicate implementations. |
644
+ | `performance_optimizer` | Profiles bottlenecks, optimizes hot paths. |
645
+ | `documentation_maintainer` | Keeps READMEs and progress files in sync. |
646
+ | `code_quality_critic` | Structural improvements, pattern enforcement. |
647
+ | `test_writer` | Writes targeted tests for edge cases and failure modes. |
648
+ | `security_auditor` | Audits for vulnerabilities, logs CRITICAL gaps. |
649
+
650
+ ### Key Patterns (from Anthropic blog)
651
+
652
+ - **Task Locking**: Claim before working. If two agents try the same task, the second picks a different one.
653
+ - **Context Window Budget**: Do NOT print thousands of useless bytes. Pre-compute summaries. Use `--fast` mode (1-10% random sample) for large test suites. Log errors with an ERROR prefix on the same line for grep (see the sketch after this list).
654
+ - **Oracle Testing**: Compare output against known-good reference. Each failing comparison is an independent work item for a parallel agent.
655
+ - **Time Blindness**: Agents can't tell time. Print progress infrequently. Use deterministic random sampling per-agent but randomized across VMs.
656
+ - **Progress Files**: Maintain running docs of status, failed approaches, and remaining tasks. Fresh agent sessions read these to orient.
657
+ - **Delta Debugging**: When tests pass individually but fail together, split the set in half to narrow down the minimal failing combination.
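+
+ A minimal shell sketch of the budget-friendly sampling idea above (file names and the test runner invocation are hypothetical, not nodebench-mcp commands):
+
+ ```bash
+ # Deterministic ~5% sample per agent: the same AGENT_ID yields the same sample on every
+ # rerun, while a different AGENT_ID per VM spreads coverage across the fleet.
+ AGENT_ID="${AGENT_ID:-agent-1}"
+ TOTAL=$(wc -l < tests.txt)
+ shuf --random-source=<(yes "$AGENT_ID") -n $(( TOTAL / 20 )) tests.txt > fast-sample.txt
+
+ # Keep output grep-friendly: one ERROR line per failure, nothing else.
+ while read -r t; do
+   npm test -- "$t" >/dev/null 2>&1 || echo "ERROR $t failed"
+ done < fast-sample.txt
+ ```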
658
+
659
+ ### Bootstrap for External Repos
660
+
661
+ When nodebench-mcp is connected to a project that lacks parallel agent infrastructure, it can auto-detect gaps and scaffold everything needed:
662
+
663
+ ```
664
+ 1. bootstrap_parallel_agents({ projectRoot: "/path/to/their/repo", dryRun: true })
665
+ // Scans 7 categories: task coordination, roles, oracle, context budget,
666
+ // progress files, AGENTS.md parallel section, git worktrees
667
+
668
+ 2. bootstrap_parallel_agents({ projectRoot: "...", dryRun: false, techStack: "TypeScript/React" })
669
+ // Creates .parallel-agents/ dir, progress.md, roles.json, lock dirs, oracle dirs
670
+
671
+ 3. generate_parallel_agents_md({ techStack: "TypeScript/React", projectName: "their-project", maxAgents: 4 })
672
+ // Generates portable AGENTS.md section — paste into their repo
673
+
674
+ 4. Run the 6-step flywheel plan returned by the bootstrap tool to verify
675
+ 5. Fix any issues, re-verify
676
+ 6. record_learning({ key: "bootstrap_their_project", content: "...", category: "pattern" })
677
+ ```
678
+
679
+ The generated AGENTS.md section is framework-agnostic and works with any AI agent (Claude, GPT, etc.). It includes:
680
+ - Task locking protocol (file-based, no dependencies; see the sketch after this list)
681
+ - Role definitions and assignment guide
682
+ - Oracle testing workflow with idiomatic examples
683
+ - Context budget rules
684
+ - Progress file protocol
685
+ - Anti-patterns to avoid
686
+ - Optional nodebench-mcp tool mapping table
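+
+ A minimal sketch of the file-based locking idea (paths are hypothetical; the scaffold's own lock dirs live under `.parallel-agents/`):
+
+ ```bash
+ # mkdir is atomic: whichever agent creates the lock directory first owns the task.
+ TASK="fix_auth"
+ LOCK=".parallel-agents/locks/$TASK"
+ mkdir -p "$(dirname "$LOCK")"
+ if mkdir "$LOCK" 2>/dev/null; then
+   echo "claimed $TASK"
+   # ... do the work, update the progress file ...
+   rmdir "$LOCK"                 # release when done
+ else
+   echo "$TASK already claimed; picking a different task"
+ fi
+ ```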
687
+
688
+ ### MCP Prompts for Parallel Agent Teams
689
+
690
+ - `parallel-agent-team` — Full team setup with role assignment and task breakdown
691
+ - `oracle-test-harness` — Oracle-based testing setup for a component
692
+ - `bootstrap-parallel-agents` — Detect and scaffold parallel agent infra for any external repo
693
+
694
+ **→ Quick Refs:** Full methodology: `getMethodology({ topic: "parallel_agent_teams" })` | Find parallel tools: `findTools({ category: "parallel_agents" })` | Bootstrap external repo: `bootstrap_parallel_agents({ projectRoot: "..." })` | See [AI Flywheel](#the-ai-flywheel-mandatory)
695
+
696
+ ---
697
+
546
698
  ## Auto-Update This File
547
699
 
548
700
  Agents can self-update this file:
package/README.md CHANGED
@@ -1,264 +1,224 @@
1
- # NodeBench MCP Server
1
+ # NodeBench MCP
2
2
 
3
- A fully local, zero-config MCP server with 46 tools for AI-powered development workflows.
3
+ **Make AI agents catch the bugs they normally ship.**
4
4
 
5
- **Features:**
6
- - Web search (Gemini/OpenAI/Perplexity)
7
- - GitHub repository discovery and analysis
8
- - Job market research
9
- - AGENTS.md self-maintenance
10
- - AI vision for screenshot analysis
11
- - 6-phase verification flywheel
12
- - SQLite-backed learning database
5
+ One command gives your agent structured research, risk assessment, 3-layer testing, quality gates, and a persistent knowledge base — so every fix is thorough and every insight compounds into future work.
13
6
 
14
- ## Quick Start (1 minute)
15
-
16
- ### 1. Add to Claude Code settings
17
-
18
- Add to `~/.claude/settings.json`:
19
-
20
- ```json
21
- {
22
- "mcpServers": {
23
- "nodebench": {
24
- "command": "npx",
25
- "args": ["-y", "nodebench-mcp"]
26
- }
27
- }
28
- }
7
+ ```bash
8
+ claude mcp add nodebench -- npx -y nodebench-mcp
29
9
  ```
30
10
 
31
- That's it. Restart Claude Code and you have 46 tools.
32
-
33
11
  ---
34
12
 
35
- ## Alternative: Build from source
13
+ ## What Bare Agents Miss
36
14
 
37
- ```bash
38
- git clone https://github.com/nodebench/nodebench-ai.git
39
- cd nodebench-ai/packages/mcp-local
40
- npm install && npm run build
41
- ```
15
+ We benchmarked 9 real production prompts — things like *"The LinkedIn posting pipeline is creating duplicate posts"* and *"The agent loop hits budget but still gets new events"* — comparing a bare agent vs one with NodeBench MCP.
42
16
 
43
- Then use absolute path in settings:
17
+ | What gets measured | Bare Agent | With NodeBench MCP |
18
+ |---|---|---|
19
+ | Issues detected before deploy | 0 | **13** (4 high, 8 medium, 1 low) |
20
+ | Research findings before coding | 0 | **21** |
21
+ | Risk assessments | 0 | **9** |
22
+ | Test coverage layers | 1 | **3** (static + unit + integration) |
23
+ | Integration failures caught early | 0 | **4** |
24
+ | Regression eval cases created | 0 | **22** |
25
+ | Quality gate rules enforced | 0 | **52** |
26
+ | Deploys blocked by gate violations | 0 | **4** |
27
+ | Knowledge entries banked | 0 | **9** |
28
+ | Blind spots shipped to production | **26** | **0** |
44
29
 
45
- ```json
46
- {
47
- "mcpServers": {
48
- "nodebench": {
49
- "command": "node",
50
- "args": ["/path/to/packages/mcp-local/dist/index.js"]
51
- }
52
- }
53
- }
54
- ```
30
+ The bare agent reads the code, implements a fix, runs tests once, and ships. The MCP agent researches first, assesses risk, tracks issues to resolution, runs 3-layer tests, creates regression guards, enforces quality gates, and banks everything as knowledge for next time.
55
31
 
56
- ### 3. Add API keys (optional but recommended)
32
+ Every additional tool call produces a concrete artifact — an issue found, a risk assessed, a regression guarded — that compounds across future tasks.
57
33
 
58
- Add to your shell profile (`~/.bashrc`, `~/.zshrc`, or Windows Environment Variables):
34
+ ---
59
35
 
60
- ```bash
61
- # Required for web search (pick one)
62
- export GEMINI_API_KEY="your-key" # Best: Google Search grounding
63
- export OPENAI_API_KEY="your-key" # Alternative: GPT-4o web search
64
- export PERPLEXITY_API_KEY="your-key" # Alternative: Perplexity
65
-
66
- # Required for GitHub (higher rate limits)
67
- export GITHUB_TOKEN="your-token" # github.com/settings/tokens
68
-
69
- # Required for vision analysis (pick one)
70
- export GEMINI_API_KEY="your-key" # Best: Gemini 2.5 Flash
71
- export OPENAI_API_KEY="your-key" # Alternative: GPT-4o
72
- export ANTHROPIC_API_KEY="your-key" # Alternative: Claude
73
- ```
36
+ ## How It Works — 3 Real Examples
74
37
 
75
- ### 4. Restart Claude Code
38
+ ### Example 1: Bug fix
76
39
 
77
- ```bash
78
- # Quit and reopen Claude Code, or run:
79
- claude --mcp-debug
80
- ```
40
+ You type: *"The content queue has 40 items stuck in 'judging' status for 6 hours"*
81
41
 
82
- ### 5. Test it works
42
+ **Bare agent:** Reads the queue code, finds a potential fix, runs tests, ships.
83
43
 
84
- In Claude Code, try these prompts:
44
+ **With NodeBench MCP:** The agent runs structured recon and discovers 3 blind spots the bare agent misses:
45
+ - No retry backoff on OpenRouter rate limits (HIGH)
46
+ - JSON regex `match(/\{[\s\S]*\}/)` grabs last `}` — breaks on multi-object responses (MEDIUM)
47
+ - No timeout on LLM call — hung request blocks entire cron for 15+ min (not detected by unit tests)
85
48
 
86
- ```
87
- # Check your environment
88
- > Use setup_local_env to check my development environment
49
+ All 3 are logged as gaps, resolved, regression-tested, and the patterns banked so the next similar bug is fixed faster.
89
50
 
90
- # Search GitHub
91
- > Use search_github to find TypeScript MCP servers with at least 100 stars
51
+ ### Example 2: Parallel agents overwriting each other
92
52
 
93
- # Fetch documentation
94
- > Use fetch_url to read https://modelcontextprotocol.io/introduction
53
+ You type: *"I launched 3 Claude Code subagents but they keep overwriting each other's changes"*
95
54
 
96
- # Get methodology
97
- > Use getMethodology("overview") to see all available workflows
98
- ```
99
-
100
- ---
55
+ **Without NodeBench:** Two of the agents see the same bug and both implement a fix. The third agent re-investigates what agent 1 already solved. Agent 2 hits its context limit mid-fix and loses work.
101
56
 
102
- ## Tool Categories
103
-
104
- | Category | Tools | Description |
105
- |----------|-------|-------------|
106
- | **Web** | `web_search`, `fetch_url` | Search the web, fetch URLs as markdown |
107
- | **GitHub** | `search_github`, `analyze_repo` | Find repos, analyze tech stacks |
108
- | **Documentation** | `update_agents_md`, `research_job_market`, `setup_local_env` | Self-maintaining docs, job research |
109
- | **Vision** | `discover_vision_env`, `analyze_screenshot`, `manipulate_screenshot` | AI-powered image analysis |
110
- | **UI Capture** | `capture_ui_screenshot`, `capture_responsive_suite` | Browser screenshots (requires Playwright) |
111
- | **Verification** | `start_cycle`, `log_phase`, `complete_cycle` | 6-phase dev workflow |
112
- | **Eval** | `start_eval_run`, `log_test_result`, `list_eval_runs` | Test case tracking |
113
- | **Quality Gates** | `run_quality_gate`, `get_gate_history` | Pass/fail checkpoints |
114
- | **Learning** | `record_learning`, `search_learnings`, `search_all_knowledge` | Persistent knowledge base |
115
- | **Flywheel** | `run_closed_loop`, `check_framework_updates` | Automated workflows |
116
- | **Recon** | `run_recon`, `log_recon_finding`, `log_gap` | Discovery and gap tracking |
117
- | **Meta** | `findTools`, `getMethodology` | Tool discovery, methodology guides |
57
+ **With NodeBench MCP:** Each subagent calls `claim_agent_task` to lock its work. Roles are assigned so they don't overlap. Context budget is tracked. Progress notes ensure handoff without starting from scratch.
118
58
 
119
- ---
59
+ ### Example 3: Knowledge compounding
120
60
 
121
- ## Methodology Topics (15 total)
122
-
123
- Ask Claude: `Use getMethodology("topic_name")`
124
-
125
- - `overview` — See all methodologies
126
- - `verification` — 6-phase development cycle
127
- - `eval` — Test case management
128
- - `flywheel` — Continuous improvement loop
129
- - `mandatory_flywheel` — Required verification for changes
130
- - `reconnaissance` — Codebase discovery
131
- - `quality_gates` — Pass/fail checkpoints
132
- - `ui_ux_qa` — Frontend verification
133
- - `agentic_vision` — AI-powered visual QA
134
- - `closed_loop` — Build/test before presenting
135
- - `learnings` — Knowledge persistence
136
- - `project_ideation` — Validate ideas before building
137
- - `tech_stack_2026` — Dependency management
138
- - `telemetry_setup` — Observability setup
139
- - `agents_md_maintenance` — Keep docs in sync
61
+ Tasks 1-3 start with zero prior knowledge. By task 9, the agent finds 2+ relevant prior findings before writing a single line of code. Bare agents start from zero every time.
140
62
 
141
63
  ---
142
64
 
143
- ## VSCode Extension Setup
65
+ ## Quick Start
144
66
 
145
- If using the Claude Code VSCode extension:
67
+ ### Install (30 seconds)
146
68
 
147
- 1. Open VSCode Settings (Ctrl/Cmd + ,)
148
- 2. Search for "Claude Code MCP"
149
- 3. Add server configuration:
69
+ ```bash
70
+ # Claude Code CLI (recommended)
71
+ claude mcp add nodebench -- npx -y nodebench-mcp
72
+ ```
73
+
74
+ Or add to `~/.claude/settings.json` or `.claude.json`:
150
75
 
151
76
  ```json
152
77
  {
153
- "claude-code.mcpServers": {
78
+ "mcpServers": {
154
79
  "nodebench": {
155
- "command": "node",
156
- "args": ["/absolute/path/to/packages/mcp-local/dist/index.js"]
80
+ "command": "npx",
81
+ "args": ["-y", "nodebench-mcp"]
157
82
  }
158
83
  }
159
84
  }
160
85
  ```
161
86
 
162
- ---
87
+ ### First prompts to try
163
88
 
164
- ## Optional Dependencies
89
+ ```
90
+ # See what's available
91
+ > Use getMethodology("overview") to see all workflows
165
92
 
166
- Install for additional features:
93
+ # Before your next task — search for prior knowledge
94
+ > Use search_all_knowledge("what I'm about to work on")
167
95
 
168
- ```bash
169
- # Screenshot capture (headless browser)
170
- npm install playwright
171
- npx playwright install chromium
172
-
173
- # Image manipulation
174
- npm install sharp
96
+ # Run the full verification pipeline on a change
97
+ > Use getMethodology("mandatory_flywheel") and follow the 6 steps
98
+ ```
175
99
 
176
- # HTML parsing (already included)
177
- npm install cheerio
100
+ ### Optional: API keys for web search and vision
178
101
 
179
- # AI providers (pick your preferred)
180
- npm install @google/genai # Gemini
181
- npm install openai # OpenAI
182
- npm install @anthropic-ai/sdk # Anthropic
102
+ ```bash
103
+ export GEMINI_API_KEY="your-key" # Web search + vision (recommended)
104
+ export GITHUB_TOKEN="your-token" # GitHub (higher rate limits)
183
105
  ```
184
106
 
185
107
  ---
186
108
 
187
- ## Troubleshooting
188
-
189
- **"No search provider available"**
190
- - Set at least one API key: `GEMINI_API_KEY`, `OPENAI_API_KEY`, or `PERPLEXITY_API_KEY`
191
-
192
- **"GitHub API error 403"**
193
- - Set `GITHUB_TOKEN` for higher rate limits (60/hour without, 5000/hour with)
194
-
195
- **"Cannot find module"**
196
- - Run `npm run build` in the mcp-local directory
197
-
198
- **MCP not connecting**
199
- - Check path is absolute in settings.json
200
- - Run `claude --mcp-debug` to see connection errors
201
- - Ensure Node.js >= 18
109
+ ## What You Get
110
+
111
+ ### Core workflow (use these every session)
112
+
113
+ | When you... | Use this | Impact |
114
+ |---|---|---|
115
+ | Start any task | `search_all_knowledge` | Find prior findings and avoid repeating past mistakes |
116
+ | Research before coding | `run_recon` + `log_recon_finding` | Structured research with surfaced findings |
117
+ | Assess risk before acting | `assess_risk` | Risk tier determines if action needs confirmation |
118
+ | Track implementation | `start_verification_cycle` + `log_gap` | Issues logged with severity, tracked to resolution |
119
+ | Test thoroughly | `log_test_result` (3 layers) | Static + unit + integration vs running tests once |
120
+ | Guard against regression | `start_eval_run` + `record_eval_result` | Eval cases that protect this fix in the future |
121
+ | Gate before deploy | `run_quality_gate` | Boolean rules enforced — violations block deploy |
122
+ | Bank knowledge | `record_learning` | Persisted findings compound across future sessions |
123
+ | Verify completeness | `run_mandatory_flywheel` | 6-step minimum catches dead code and intent mismatches |
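+
+ A sketch of how one bug-fix session strings these together (arguments abbreviated):
+
+ ```
+ 1. search_all_knowledge("<the bug>")              // any prior findings?
+ 2. run_recon(...) + log_recon_finding(...)        // research before coding
+ 3. assess_risk(...)                               // pick the risk tier
+ 4. start_verification_cycle(...), then log_gap(...) for each issue found
+ 5. log_test_result(...) for the static, unit, and integration layers
+ 6. start_eval_run(...) + record_eval_result(...)  // regression guard
+ 7. run_quality_gate(...)                          // violations block the ship
+ 8. record_learning(...)                           // bank it for next time
+ ```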
124
+
125
+ ### When running parallel agents (Claude Code subagents, worktrees)
126
+
127
+ | When you... | Use this | Impact |
128
+ |---|---|---|
129
+ | Prevent duplicate work | `claim_agent_task` / `release_agent_task` | Task locks — each task owned by exactly one agent |
130
+ | Specialize agents | `assign_agent_role` | 7 roles: implementer, test_writer, critic, etc. |
131
+ | Track context usage | `log_context_budget` | Prevents context exhaustion mid-fix |
132
+ | Validate against reference | `run_oracle_comparison` | Compare output against known-good oracle |
133
+ | Orient new sessions | `get_parallel_status` | See what all agents are doing and what's blocked |
134
+ | Bootstrap any repo | `bootstrap_parallel_agents` | Auto-detect gaps, scaffold coordination infra |
135
+
136
+ ### Research and discovery
137
+
138
+ | When you... | Use this | Impact |
139
+ |---|---|---|
140
+ | Search the web | `web_search` | Gemini/OpenAI/Perplexity — latest docs and updates |
141
+ | Fetch a URL | `fetch_url` | Read any page as clean markdown |
142
+ | Find GitHub repos | `search_github` + `analyze_repo` | Discover and evaluate libraries and patterns |
143
+ | Analyze screenshots | `analyze_screenshot` | AI vision (Gemini/GPT-4o/Claude) for UI QA |
202
144
 
203
145
  ---
204
146
 
205
- ## Example Workflows
147
+ ## The Methodology Pipeline
206
148
 
207
- ### Research a new project idea
149
+ NodeBench MCP isn't just a bag of tools — it's a pipeline. Each step feeds the next:
208
150
 
209
151
  ```
210
- 1. Use getMethodology("project_ideation") for the 6-step process
211
- 2. Use web_search to validate market demand
212
- 3. Use search_github to find similar projects
213
- 4. Use analyze_repo to study competitor implementations
214
- 5. Use research_job_market to understand skill demand
152
+ Research → Risk → Implement → Test (3 layers) → Eval → Gate → Learn → Ship
153
+     ↑                                                                   │
154
+     └──────────────────────── knowledge compounds ──────────────────────┘
215
155
  ```
216
156
 
217
- ### Analyze a GitHub repo before using it
157
+ **Inner loop** (per change): 6-phase verification ensures correctness.
158
+ **Outer loop** (over time): Eval-driven development ensures improvement.
159
+ **Together**: The AI Flywheel — every verification produces eval artifacts, every regression triggers verification.
218
160
 
219
- ```
220
- 1. Use search_github({ query: "mcp server", language: "typescript", minStars: 100 })
221
- 2. Use analyze_repo({ repoUrl: "owner/repo" }) to see tech stack and patterns
222
- 3. Use fetch_url to read their documentation
223
- ```
161
+ Ask the agent: `Use getMethodology("overview")` to see all 18 methodology topics.
224
162
 
225
- ### Set up a new development environment
163
+ ---
226
164
 
227
- ```
228
- 1. Use setup_local_env to scan current environment
229
- 2. Follow the recommendations to install missing SDKs
230
- 3. Use getMethodology("tech_stack_2026") for ongoing maintenance
231
- ```
165
+ ## Parallel Agents with Claude Code
232
166
 
233
- ---
167
+ Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www.anthropic.com/engineering/building-c-compiler) (Feb 2026).
234
168
 
235
- ## Agent Protocol (NODEBENCH_AGENTS.md)
169
+ **When to use:** Only when running 2+ agent sessions. Single-agent workflows use the standard pipeline above.
236
170
 
237
- The package includes `NODEBENCH_AGENTS.md` a portable agent operating procedure that any AI agent can use to self-configure.
171
+ **How it works with Claude Code's Task tool:**
238
172
 
239
- **What it provides:**
240
- - The 6-step AI Flywheel verification process (mandatory for all changes)
241
- - MCP tool usage patterns and workflows
242
- - Quality gate definitions
243
- - Post-implementation checklists
244
- - Self-update instructions
173
+ 1. **COORDINATOR** (your main session) breaks work into independent tasks
174
+ 2. Each **Task tool** call spawns a subagent with instructions to:
175
+ - `claim_agent_task` to lock the task
176
+ - `assign_agent_role` to specialize (implementer, test_writer, critic, etc.)
177
+ - Do the work
178
+ - `release_agent_task` — handoff with progress note
179
+ 3. Coordinator calls `get_parallel_status` to monitor all subagents
180
+ 4. Coordinator runs `run_quality_gate` on the aggregate result
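+
+ A sketch of the instruction a coordinator might pass to each Task tool call (wording illustrative; tool names and parameters as above):
+
+ ```
+ You are one of several parallel subagents working on this repo.
+ 1. claim_agent_task({ taskKey: "<your task>" })   // if the claim fails, pick another task
+ 2. assign_agent_role({ role: "implementer" })
+ 3. Do the work; keep printed output terse to protect your context budget
+ 4. release_agent_task({ taskKey: "<your task>", status: "completed", progressNote: "<summary>" })
+ ```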
245
181
 
246
- **To use in your project:**
182
+ **MCP Prompts available:**
183
+ - `claude-code-parallel` — Step-by-step Claude Code subagent coordination
184
+ - `parallel-agent-team` — Full team setup with role assignment
185
+ - `oracle-test-harness` — Validate outputs against known-good reference
186
+ - `bootstrap-parallel-agents` — Scaffold parallel infra for any repo
247
187
 
248
- 1. Copy `NODEBENCH_AGENTS.md` to your repo root
249
- 2. Agents will auto-discover and follow the protocol
250
- 3. Use `update_agents_md` tool to keep it in sync
188
+ ---
251
189
 
252
- Or fetch it directly:
190
+ ## Build from Source
253
191
 
254
192
  ```bash
255
- curl -o AGENTS.md https://raw.githubusercontent.com/nodebench/nodebench-ai/main/packages/mcp-local/NODEBENCH_AGENTS.md
193
+ git clone https://github.com/nodebench/nodebench-ai.git
194
+ cd nodebench-ai/packages/mcp-local
195
+ npm install && npm run build
256
196
  ```
257
197
 
258
- The file is designed to be:
259
- - **Portable** — Works in any repo, any language
260
- - **Self-updating** — Agents can modify it via MCP tools
261
- - **Composable** — Add your own sections alongside the standard protocol
198
+ Then use absolute path:
199
+
200
+ ```json
201
+ {
202
+ "mcpServers": {
203
+ "nodebench": {
204
+ "command": "node",
205
+ "args": ["/path/to/packages/mcp-local/dist/index.js"]
206
+ }
207
+ }
208
+ }
209
+ ```
210
+
211
+ ---
212
+
213
+ ## Troubleshooting
214
+
215
+ **"No search provider available"** — Set `GEMINI_API_KEY`, `OPENAI_API_KEY`, or `PERPLEXITY_API_KEY`
216
+
217
+ **"GitHub API error 403"** — Set `GITHUB_TOKEN` for higher rate limits
218
+
219
+ **"Cannot find module"** — Run `npm run build` in the mcp-local directory
220
+
221
+ **MCP not connecting** — Check path is absolute, run `claude --mcp-debug`, ensure Node.js >= 18
262
222
 
263
223
  ---
264
224
 
@@ -0,0 +1 @@
1
+ export {};