nodebench-mcp 1.4.0 → 2.0.0
- package/NODEBENCH_AGENTS.md +154 -2
- package/README.md +152 -192
- package/dist/__tests__/comparativeBench.test.d.ts +1 -0
- package/dist/__tests__/comparativeBench.test.js +722 -0
- package/dist/__tests__/comparativeBench.test.js.map +1 -0
- package/dist/__tests__/evalHarness.test.js +24 -2
- package/dist/__tests__/evalHarness.test.js.map +1 -1
- package/dist/__tests__/gaiaCapabilityEval.test.d.ts +14 -0
- package/dist/__tests__/gaiaCapabilityEval.test.js +420 -0
- package/dist/__tests__/gaiaCapabilityEval.test.js.map +1 -0
- package/dist/__tests__/gaiaCapabilityFilesEval.test.d.ts +15 -0
- package/dist/__tests__/gaiaCapabilityFilesEval.test.js +303 -0
- package/dist/__tests__/gaiaCapabilityFilesEval.test.js.map +1 -0
- package/dist/__tests__/openDatasetParallelEvalGaia.test.d.ts +7 -0
- package/dist/__tests__/openDatasetParallelEvalGaia.test.js +279 -0
- package/dist/__tests__/openDatasetParallelEvalGaia.test.js.map +1 -0
- package/dist/__tests__/openDatasetPerfComparison.test.d.ts +10 -0
- package/dist/__tests__/openDatasetPerfComparison.test.js +318 -0
- package/dist/__tests__/openDatasetPerfComparison.test.js.map +1 -0
- package/dist/__tests__/tools.test.js +155 -7
- package/dist/__tests__/tools.test.js.map +1 -1
- package/dist/db.js +56 -0
- package/dist/db.js.map +1 -1
- package/dist/index.js +370 -11
- package/dist/index.js.map +1 -1
- package/dist/tools/localFileTools.d.ts +15 -0
- package/dist/tools/localFileTools.js +386 -0
- package/dist/tools/localFileTools.js.map +1 -0
- package/dist/tools/metaTools.js +170 -3
- package/dist/tools/metaTools.js.map +1 -1
- package/dist/tools/parallelAgentTools.d.ts +18 -0
- package/dist/tools/parallelAgentTools.js +1272 -0
- package/dist/tools/parallelAgentTools.js.map +1 -0
- package/dist/tools/selfEvalTools.js +240 -10
- package/dist/tools/selfEvalTools.js.map +1 -1
- package/dist/tools/webTools.js +171 -37
- package/dist/tools/webTools.js.map +1 -1
- package/package.json +19 -7
package/NODEBENCH_AGENTS.md
CHANGED

@@ -117,7 +117,7 @@ Run ToolBench parallel subagent benchmark:
 NODEBENCH_TOOLBENCH_TASK_LIMIT=6 NODEBENCH_TOOLBENCH_CONCURRENCY=3 npm run mcp:dataset:toolbench:test
 ```
 
-Run all lanes:
+Run all public lanes:
 ```bash
 npm run mcp:dataset:bench:all
 ```
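
Each lane above is tuned by a `*_TASK_LIMIT` / `*_CONCURRENCY` pair: how many fixture tasks to run and how many to run at once. As a sketch of what that pairing typically means in a Node test harness (the `runWithConcurrency` helper below is an illustration, not the package's actual scheduler):

```typescript
// Hypothetical bounded-parallelism runner matching the TASK_LIMIT /
// CONCURRENCY semantics of the benchmark lanes above.
async function runWithConcurrency<T, R>(
  tasks: readonly T[],
  limit: number,
  worker: (task: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(tasks.length);
  let next = 0;
  // Each lane repeatedly pulls the next unclaimed index; JS is
  // single-threaded, so `next++` cannot race between lanes.
  const lane = async (): Promise<void> => {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await worker(tasks[i]);
    }
  };
  await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, lane));
  return results;
}

// e.g. run the first TASK_LIMIT tasks across CONCURRENCY lanes:
// await runWithConcurrency(fixture.slice(0, 6), 3, runTask);
```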

@@ -137,11 +137,71 @@ Run SWE-bench parallel subagent benchmark:
 NODEBENCH_SWEBENCH_TASK_LIMIT=8 NODEBENCH_SWEBENCH_CONCURRENCY=4 npm run mcp:dataset:swebench:test
 ```
 
-
+Fourth lane (GAIA gated long-horizon tool-augmented tasks):
+- Dataset: `gaia-benchmark/GAIA` (gated)
+- Default config: `2023_level3`
+- Default split: `validation`
+- Source: `https://huggingface.co/datasets/gaia-benchmark/GAIA`
+
+Notes:
+- Fixture is written to `.cache/gaia` (gitignored). Do not commit GAIA question/answer content.
+- Refresh requires `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN` in your shell.
+- Python deps: `pandas`, `huggingface_hub`, `pyarrow` (or equivalent parquet engine).
+
+Refresh GAIA fixture:
+```bash
+npm run mcp:dataset:gaia:refresh
+```
+
+Run GAIA parallel subagent benchmark:
+```bash
+NODEBENCH_GAIA_TASK_LIMIT=8 NODEBENCH_GAIA_CONCURRENCY=4 npm run mcp:dataset:gaia:test
+```
+
+GAIA capability benchmark (accuracy: LLM-only vs LLM+tools):
+- This runs real model calls and web search. It is disabled by default and only intended for regression checks.
+- Uses Gemini by default. Ensure `GEMINI_API_KEY` is available (repo `.env.local` is loaded by the test).
+- Scoring fixture includes ground-truth answers and MUST remain under `.cache/gaia` (gitignored).
+
+Generate scoring fixture (local only, gated):
+```bash
+npm run mcp:dataset:gaia:capability:refresh
+```
+
+Run capability benchmark:
+```bash
+NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:test
+```
+
+GAIA capability benchmark (file-backed lane: PDF / XLSX / CSV):
+- This lane measures the impact of deterministic local parsing tools on GAIA tasks with attachments.
+- Fixture includes ground-truth answers and MUST remain under `.cache/gaia` (gitignored).
+- Attachments are copied into `.cache/gaia/data/<file_path>` for offline deterministic runs after the first download.
+
+Generate file-backed scoring fixture + download attachments (local only, gated):
+```bash
+npm run mcp:dataset:gaia:capability:files:refresh
+```
+
+Run file-backed capability benchmark:
+```bash
+NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:files:test
+```
+
+Modes:
+- Recommended (more stable): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag`
+- More realistic (higher variance): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (optional `NODEBENCH_GAIA_CAPABILITY_FORCE_WEB_SEARCH=1`)
+
+Run all public lanes:
 ```bash
 npm run mcp:dataset:bench:all
 ```
 
+Run full lane suite (includes GAIA):
+```bash
+npm run mcp:dataset:bench:full
+```
+
 Implementation files:
 - `packages/mcp-local/src/__tests__/fixtures/generateBfclLongContextFixture.ts`
 - `packages/mcp-local/src/__tests__/fixtures/bfcl_v3_long_context.sample.json`
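
The capability lanes added in this hunk are configured entirely through environment variables. A minimal sketch of how a test might fold them into a typed config; the `parseLaneConfig` helper and its fallback values are assumptions (taken from the example commands above), not the package's actual code:

```typescript
// Hypothetical reader for the GAIA capability lane env vars shown above.
interface GaiaLaneConfig {
  taskLimit: number;
  concurrency: number;
  toolsMode: "rag" | "agent"; // "rag" is the recommended mode per the notes above
  forceWebSearch: boolean;
}

function parseLaneConfig(
  env: Record<string, string | undefined> = process.env,
): GaiaLaneConfig {
  const int = (value: string | undefined, fallback: number): number => {
    const n = Number.parseInt(value ?? "", 10);
    return Number.isFinite(n) && n > 0 ? n : fallback;
  };
  return {
    taskLimit: int(env.NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT, 6),    // fallback assumed
    concurrency: int(env.NODEBENCH_GAIA_CAPABILITY_CONCURRENCY, 1), // fallback assumed
    toolsMode: env.NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE === "agent" ? "agent" : "rag",
    forceWebSearch: env.NODEBENCH_GAIA_CAPABILITY_FORCE_WEB_SEARCH === "1",
  };
}
```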

@@ -152,6 +212,16 @@ Implementation files:
 - `packages/mcp-local/src/__tests__/fixtures/generateSwebenchVerifiedFixture.ts`
 - `packages/mcp-local/src/__tests__/fixtures/swebench_verified.sample.json`
 - `packages/mcp-local/src/__tests__/openDatasetParallelEvalSwebench.test.ts`
+- `packages/mcp-local/src/__tests__/fixtures/generateGaiaLevel3Fixture.py`
+- `.cache/gaia/gaia_2023_level3_validation.sample.json`
+- `packages/mcp-local/src/__tests__/openDatasetParallelEvalGaia.test.ts`
+- `packages/mcp-local/src/__tests__/fixtures/generateGaiaCapabilityFixture.py`
+- `.cache/gaia/gaia_capability_2023_all_validation.sample.json`
+- `packages/mcp-local/src/__tests__/gaiaCapabilityEval.test.ts`
+- `packages/mcp-local/src/__tests__/fixtures/generateGaiaCapabilityFilesFixture.py`
+- `.cache/gaia/gaia_capability_files_2023_all_validation.sample.json`
+- `.cache/gaia/data/...` (local GAIA attachments; do not commit)
+- `packages/mcp-local/src/__tests__/gaiaCapabilityFilesEval.test.ts`
 
 Required tool chain per dataset task:
 - `run_recon`

@@ -176,6 +246,7 @@ Use `getMethodology("overview")` to see all available workflows.
 | Category | Tools | When to Use |
 |----------|-------|-------------|
 | **Web** | `web_search`, `fetch_url` | Research, reading docs, market validation |
+| **Local Files** | `read_pdf_text`, `read_xlsx_file`, `read_csv_file` | Deterministic parsing of local attachments (GAIA file-backed lane) |
 | **GitHub** | `search_github`, `analyze_repo` | Finding libraries, studying implementations |
 | **Verification** | `start_cycle`, `log_phase`, `complete_cycle` | Tracking the flywheel process |
 | **Eval** | `start_eval_run`, `log_test_result` | Test case management |
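
The `read_pdf_text` / `read_xlsx_file` / `read_csv_file` tools added in this row are ordinary MCP tools, so any MCP client can call them. A sketch using the TypeScript MCP SDK; the `filePath` argument name is an assumption, so check the tool's declared input schema for the real parameter names:

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function main(): Promise<void> {
  // Spawn the published server over stdio, the same way Claude Code does.
  const transport = new StdioClientTransport({
    command: "npx",
    args: ["-y", "nodebench-mcp"],
  });
  const client = new Client({ name: "local-files-demo", version: "0.0.0" });
  await client.connect(transport);

  // Deterministic local parse of a GAIA attachment (argument name assumed).
  const result = await client.callTool({
    name: "read_csv_file",
    arguments: { filePath: ".cache/gaia/data/example.csv" },
  });
  console.log(result.content);

  await client.close();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```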

@@ -184,6 +255,7 @@ Use `getMethodology("overview")` to see all available workflows.
 | **Vision** | `analyze_screenshot`, `capture_ui_screenshot` | UI/UX verification |
 | **Bootstrap** | `discover_infrastructure`, `triple_verify`, `self_implement` | Self-setup, triple verification |
 | **Autonomous** | `assess_risk`, `decide_re_update`, `run_self_maintenance` | Risk-aware execution, self-maintenance |
+| **Parallel Agents** | `claim_agent_task`, `release_agent_task`, `list_agent_tasks`, `assign_agent_role`, `get_agent_role`, `log_context_budget`, `run_oracle_comparison`, `get_parallel_status` | Multi-agent coordination, task locking, role specialization, oracle testing |
 | **Meta** | `findTools`, `getMethodology` | Discover tools, get workflow guides |
 
 **→ Quick Refs:** Find tools by keyword: `findTools({ query: "verification" })` | Get workflow guide: `getMethodology({ topic: "..." })` | See [Methodology Topics](#methodology-topics) for all topics
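
Of the tools added in this row, `log_context_budget` has the least obvious mechanics: it is bookkeeping for the context-window budget pattern described in the Parallel Agent Teams section below. A rough sketch of the idea; the event shape and the warning threshold are assumptions:

```typescript
// Hypothetical ledger behind a tool like log_context_budget.
interface BudgetEvent {
  eventType: string; // e.g. "test_output", "file_read"
  tokensUsed: number;
}

class ContextBudget {
  private spent = 0;
  private readonly events: BudgetEvent[] = [];

  constructor(private readonly capTokens: number) {}

  log(event: BudgetEvent): number {
    this.events.push(event);
    this.spent += event.tokensUsed;
    if (this.spent > this.capTokens * 0.8) {
      // ERROR prefix on one line so a coordinating agent can grep for it.
      console.error(`ERROR context budget at ${this.spent}/${this.capTokens} tokens`);
    }
    return this.capTokens - this.spent; // remaining budget
  }
}
```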

@@ -538,11 +610,91 @@ Available via `getMethodology({ topic: "..." })`:
 | `agents_md_maintenance` | Keep docs in sync | [Auto-Update](#auto-update-this-file) |
 | `agent_bootstrap` | Self-discover, triple verify | [Self-Bootstrap](#agent-self-bootstrap-system) |
 | `autonomous_maintenance` | Risk-tiered execution | [Autonomous Maintenance](#autonomous-self-maintenance-system) |
+| `parallel_agent_teams` | Multi-agent coordination, task locking, oracle testing | [Parallel Agent Teams](#parallel-agent-teams) |
+| `self_reinforced_learning` | Trajectory analysis, self-eval, improvement recs | [Self-Reinforced Learning](#self-reinforced-learning-loop) |
 
 **→ Quick Refs:** Find tools: `findTools({ query: "..." })` | Get any methodology: `getMethodology({ topic: "..." })` | See [MCP Tool Categories](#mcp-tool-categories)
 
 ---
 
+## Parallel Agent Teams
+
+Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www.anthropic.com/engineering/building-c-compiler) (Feb 2026).
+
+Run multiple AI agents in parallel on a shared codebase with coordination via task locking, role specialization, context budget management, and oracle-based testing.
+
+### Quick Start — Parallel Agents
+
+```
+1. get_parallel_status({ includeHistory: true })   // Orient: what's happening?
+2. assign_agent_role({ role: "implementer" })      // Specialize
+3. claim_agent_task({ taskKey: "fix_auth" })       // Lock task
+4. ... do work ...
+5. log_context_budget({ eventType: "test_output", tokensUsed: 5000 })  // Track budget
+6. run_oracle_comparison({ testLabel: "auth_output", actualOutput: "...", expectedOutput: "...", oracleSource: "prod_v2" })
+7. release_agent_task({ taskKey: "fix_auth", status: "completed", progressNote: "Fixed JWT, added tests" })
+```
+
+### Predefined Agent Roles
+
+| Role | Focus |
+|------|-------|
+| `implementer` | Primary feature work. Picks failing tests, implements fixes. |
+| `dedup_reviewer` | Finds and coalesces duplicate implementations. |
+| `performance_optimizer` | Profiles bottlenecks, optimizes hot paths. |
+| `documentation_maintainer` | Keeps READMEs and progress files in sync. |
+| `code_quality_critic` | Structural improvements, pattern enforcement. |
+| `test_writer` | Writes targeted tests for edge cases and failure modes. |
+| `security_auditor` | Audits for vulnerabilities, logs CRITICAL gaps. |
+
+### Key Patterns (from Anthropic blog)
+
+- **Task Locking**: Claim before working. If two agents try the same task, the second picks a different one.
+- **Context Window Budget**: Do NOT print thousands of useless bytes. Pre-compute summaries. Use `--fast` mode (1-10% random sample) for large test suites. Log errors with ERROR prefix on same line for grep.
+- **Oracle Testing**: Compare output against known-good reference. Each failing comparison is an independent work item for a parallel agent.
+- **Time Blindness**: Agents can't tell time. Print progress infrequently. Use deterministic random sampling per-agent but randomized across VMs.
+- **Progress Files**: Maintain running docs of status, failed approaches, and remaining tasks. Fresh agent sessions read these to orient.
+- **Delta Debugging**: When tests pass individually but fail together, split the set in half to narrow down the minimal failing combination.
+
+### Bootstrap for External Repos
+
+When nodebench-mcp is connected to a project that lacks parallel agent infrastructure, it can auto-detect gaps and scaffold everything needed:
+
+```
+1. bootstrap_parallel_agents({ projectRoot: "/path/to/their/repo", dryRun: true })
+   // Scans 7 categories: task coordination, roles, oracle, context budget,
+   // progress files, AGENTS.md parallel section, git worktrees
+
+2. bootstrap_parallel_agents({ projectRoot: "...", dryRun: false, techStack: "TypeScript/React" })
+   // Creates .parallel-agents/ dir, progress.md, roles.json, lock dirs, oracle dirs
+
+3. generate_parallel_agents_md({ techStack: "TypeScript/React", projectName: "their-project", maxAgents: 4 })
+   // Generates portable AGENTS.md section — paste into their repo
+
+4. Run the 6-step flywheel plan returned by the bootstrap tool to verify
+5. Fix any issues, re-verify
+6. record_learning({ key: "bootstrap_their_project", content: "...", category: "pattern" })
+```
+
+The generated AGENTS.md section is framework-agnostic and works with any AI agent (Claude, GPT, etc.). It includes:
+- Task locking protocol (file-based, no dependencies)
+- Role definitions and assignment guide
+- Oracle testing workflow with idiomatic examples
+- Context budget rules
+- Progress file protocol
+- Anti-patterns to avoid
+- Optional nodebench-mcp tool mapping table
+
+### MCP Prompts for Parallel Agent Teams
+
+- `parallel-agent-team` — Full team setup with role assignment and task breakdown
+- `oracle-test-harness` — Oracle-based testing setup for a component
+- `bootstrap-parallel-agents` — Detect and scaffold parallel agent infra for any external repo
+
+**→ Quick Refs:** Full methodology: `getMethodology({ topic: "parallel_agent_teams" })` | Find parallel tools: `findTools({ category: "parallel_agents" })` | Bootstrap external repo: `bootstrap_parallel_agents({ projectRoot: "..." })` | See [AI Flywheel](#the-ai-flywheel-mandatory)
+
+---
+
 ## Auto-Update This File
 
 Agents can self-update this file:
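
The task-locking protocol added above is described as file-based with no dependencies. One way such a lock can work, sketched under the assumption that claims map to directories under `.parallel-agents/locks` (the on-disk layout here is illustrative, not the scaffold's documented format):

```typescript
import { mkdirSync, rmSync, writeFileSync } from "node:fs";
import { join } from "node:path";

const LOCK_ROOT = ".parallel-agents/locks"; // assumed location
mkdirSync(LOCK_ROOT, { recursive: true });  // ensure the lock root exists

function claimTask(taskKey: string, agentId: string): boolean {
  try {
    // mkdir without `recursive` is atomic: it throws EEXIST when another
    // agent already holds the lock, so at most one claim can succeed.
    mkdirSync(join(LOCK_ROOT, taskKey));
    writeFileSync(
      join(LOCK_ROOT, taskKey, "owner.json"),
      JSON.stringify({ agentId, claimedAt: Date.now() }),
    );
    return true;
  } catch {
    return false; // already claimed: pick a different task
  }
}

function releaseTask(taskKey: string, progressNote: string): void {
  // Leave a handoff note so a fresh session can orient, then drop the lock.
  writeFileSync(join(LOCK_ROOT, `${taskKey}.note.md`), progressNote);
  rmSync(join(LOCK_ROOT, taskKey), { recursive: true, force: true });
}
```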
package/README.md
CHANGED

@@ -1,264 +1,224 @@
-# NodeBench MCP
+# NodeBench MCP
 
-
+**Make AI agents catch the bugs they normally ship.**
 
-
-- Web search (Gemini/OpenAI/Perplexity)
-- GitHub repository discovery and analysis
-- Job market research
-- AGENTS.md self-maintenance
-- AI vision for screenshot analysis
-- 6-phase verification flywheel
-- SQLite-backed learning database
+One command gives your agent structured research, risk assessment, 3-layer testing, quality gates, and a persistent knowledge base — so every fix is thorough and every insight compounds into future work.
 
-
-
-### 1. Add to Claude Code settings
-
-Add to `~/.claude/settings.json`:
-
-```json
-{
-  "mcpServers": {
-    "nodebench": {
-      "command": "npx",
-      "args": ["-y", "nodebench-mcp"]
-    }
-  }
-}
+```bash
+claude mcp add nodebench -- npx -y nodebench-mcp
 ```
 
-That's it. Restart Claude Code and you have 46 tools.
-
 ---
 
-##
+## Why — What Bare Agents Miss
 
-
-git clone https://github.com/nodebench/nodebench-ai.git
-cd nodebench-ai/packages/mcp-local
-npm install && npm run build
-```
+We benchmarked 9 real production prompts — things like *"The LinkedIn posting pipeline is creating duplicate posts"* and *"The agent loop hits budget but still gets new events"* — comparing a bare agent vs one with NodeBench MCP.
 
-
+| What gets measured | Bare Agent | With NodeBench MCP |
+|---|---|---|
+| Issues detected before deploy | 0 | **13** (4 high, 8 medium, 1 low) |
+| Research findings before coding | 0 | **21** |
+| Risk assessments | 0 | **9** |
+| Test coverage layers | 1 | **3** (static + unit + integration) |
+| Integration failures caught early | 0 | **4** |
+| Regression eval cases created | 0 | **22** |
+| Quality gate rules enforced | 0 | **52** |
+| Deploys blocked by gate violations | 0 | **4** |
+| Knowledge entries banked | 0 | **9** |
+| Blind spots shipped to production | **26** | **0** |
 
-
-{
-  "mcpServers": {
-    "nodebench": {
-      "command": "node",
-      "args": ["/path/to/packages/mcp-local/dist/index.js"]
-    }
-  }
-}
-```
+The bare agent reads the code, implements a fix, runs tests once, and ships. The MCP agent researches first, assesses risk, tracks issues to resolution, runs 3-layer tests, creates regression guards, enforces quality gates, and banks everything as knowledge for next time.
 
-
+Every additional tool call produces a concrete artifact — an issue found, a risk assessed, a regression guarded — that compounds across future tasks.
 
-
+---
 
-
-# Required for web search (pick one)
-export GEMINI_API_KEY="your-key"     # Best: Google Search grounding
-export OPENAI_API_KEY="your-key"     # Alternative: GPT-4o web search
-export PERPLEXITY_API_KEY="your-key" # Alternative: Perplexity
-
-# Required for GitHub (higher rate limits)
-export GITHUB_TOKEN="your-token"     # github.com/settings/tokens
-
-# Required for vision analysis (pick one)
-export GEMINI_API_KEY="your-key"     # Best: Gemini 2.5 Flash
-export OPENAI_API_KEY="your-key"     # Alternative: GPT-4o
-export ANTHROPIC_API_KEY="your-key"  # Alternative: Claude
-```
+## How It Works — 3 Real Examples
 
-###
+### Example 1: Bug fix
 
-
-# Quit and reopen Claude Code, or run:
-claude --mcp-debug
-```
+You type: *"The content queue has 40 items stuck in 'judging' status for 6 hours"*
 
-
+**Bare agent:** Reads the queue code, finds a potential fix, runs tests, ships.
 
-
+**With NodeBench MCP:** The agent runs structured recon and discovers 3 blind spots the bare agent misses:
+- No retry backoff on OpenRouter rate limits (HIGH)
+- JSON regex `match(/\{[\s\S]*\}/)` grabs last `}` — breaks on multi-object responses (MEDIUM)
+- No timeout on LLM call — hung request blocks entire cron for 15+ min (not detected by unit tests)
 
-
-# Check your environment
-> Use setup_local_env to check my development environment
+All 3 are logged as gaps, resolved, regression-tested, and the patterns banked so the next similar bug is fixed faster.
 
-
-> Use search_github to find TypeScript MCP servers with at least 100 stars
+### Example 2: Parallel agents overwriting each other
 
-
-> Use fetch_url to read https://modelcontextprotocol.io/introduction
+You type: *"I launched 3 Claude Code subagents but they keep overwriting each other's changes"*
 
-
-> Use getMethodology("overview") to see all available workflows
-```
-
----
+**Without NodeBench:** Both agents see the same bug and both implement a fix. The third agent re-investigates what agent 1 already solved. Agent 2 hits context limit mid-fix and loses work.
 
-
-
-| Category | Tools | Description |
-|----------|-------|-------------|
-| **Web** | `web_search`, `fetch_url` | Search the web, fetch URLs as markdown |
-| **GitHub** | `search_github`, `analyze_repo` | Find repos, analyze tech stacks |
-| **Documentation** | `update_agents_md`, `research_job_market`, `setup_local_env` | Self-maintaining docs, job research |
-| **Vision** | `discover_vision_env`, `analyze_screenshot`, `manipulate_screenshot` | AI-powered image analysis |
-| **UI Capture** | `capture_ui_screenshot`, `capture_responsive_suite` | Browser screenshots (requires Playwright) |
-| **Verification** | `start_cycle`, `log_phase`, `complete_cycle` | 6-phase dev workflow |
-| **Eval** | `start_eval_run`, `log_test_result`, `list_eval_runs` | Test case tracking |
-| **Quality Gates** | `run_quality_gate`, `get_gate_history` | Pass/fail checkpoints |
-| **Learning** | `record_learning`, `search_learnings`, `search_all_knowledge` | Persistent knowledge base |
-| **Flywheel** | `run_closed_loop`, `check_framework_updates` | Automated workflows |
-| **Recon** | `run_recon`, `log_recon_finding`, `log_gap` | Discovery and gap tracking |
-| **Meta** | `findTools`, `getMethodology` | Tool discovery, methodology guides |
+**With NodeBench MCP:** Each subagent calls `claim_agent_task` to lock its work. Roles are assigned so they don't overlap. Context budget is tracked. Progress notes ensure handoff without starting from scratch.
 
-
+### Example 3: Knowledge compounding
 
-
-
-Ask Claude: `Use getMethodology("topic_name")`
-
-- `overview` — See all methodologies
-- `verification` — 6-phase development cycle
-- `eval` — Test case management
-- `flywheel` — Continuous improvement loop
-- `mandatory_flywheel` — Required verification for changes
-- `reconnaissance` — Codebase discovery
-- `quality_gates` — Pass/fail checkpoints
-- `ui_ux_qa` — Frontend verification
-- `agentic_vision` — AI-powered visual QA
-- `closed_loop` — Build/test before presenting
-- `learnings` — Knowledge persistence
-- `project_ideation` — Validate ideas before building
-- `tech_stack_2026` — Dependency management
-- `telemetry_setup` — Observability setup
-- `agents_md_maintenance` — Keep docs in sync
+Tasks 1-3 start with zero prior knowledge. By task 9, the agent finds 2+ relevant prior findings before writing a single line of code. Bare agents start from zero every time.
 
 ---
 
-##
+## Quick Start
 
-
+### Install (30 seconds)
 
-
-
-
+```bash
+# Claude Code CLI (recommended)
+claude mcp add nodebench -- npx -y nodebench-mcp
+```
+
+Or add to `~/.claude/settings.json` or `.claude.json`:
 
 ```json
 {
-  "
+  "mcpServers": {
     "nodebench": {
-      "command": "
-      "args": ["
+      "command": "npx",
+      "args": ["-y", "nodebench-mcp"]
     }
   }
 }
 ```
 
-
+### First prompts to try
 
-
+```
+# See what's available
+> Use getMethodology("overview") to see all workflows
 
-
+# Before your next task — search for prior knowledge
+> Use search_all_knowledge("what I'm about to work on")
 
-
-
-
-npx playwright install chromium
-
-# Image manipulation
-npm install sharp
+# Run the full verification pipeline on a change
+> Use getMethodology("mandatory_flywheel") and follow the 6 steps
+```
 
-
-npm install cheerio
+### Optional: API keys for web search and vision
 
-
-
-
-npm install @anthropic-ai/sdk  # Anthropic
+```bash
+export GEMINI_API_KEY="your-key"   # Web search + vision (recommended)
+export GITHUB_TOKEN="your-token"   # GitHub (higher rate limits)
 ```
 
 ---
 
-##
-
-
-
-
-
-
-
-
-
-
-
-
-
-
--
+## What You Get
+
+### Core workflow (use these every session)
+
+| When you... | Use this | Impact |
+|---|---|---|
+| Start any task | `search_all_knowledge` | Find prior findings — avoid repeating past mistakes |
+| Research before coding | `run_recon` + `log_recon_finding` | Structured research with surfaced findings |
+| Assess risk before acting | `assess_risk` | Risk tier determines if action needs confirmation |
+| Track implementation | `start_verification_cycle` + `log_gap` | Issues logged with severity, tracked to resolution |
+| Test thoroughly | `log_test_result` (3 layers) | Static + unit + integration vs running tests once |
+| Guard against regression | `start_eval_run` + `record_eval_result` | Eval cases that protect this fix in the future |
+| Gate before deploy | `run_quality_gate` | Boolean rules enforced — violations block deploy |
+| Bank knowledge | `record_learning` | Persisted findings compound across future sessions |
+| Verify completeness | `run_mandatory_flywheel` | 6-step minimum — catches dead code and intent mismatches |
+
+### When running parallel agents (Claude Code subagents, worktrees)
+
+| When you... | Use this | Impact |
+|---|---|---|
+| Prevent duplicate work | `claim_agent_task` / `release_agent_task` | Task locks — each task owned by exactly one agent |
+| Specialize agents | `assign_agent_role` | 7 roles: implementer, test_writer, critic, etc. |
+| Track context usage | `log_context_budget` | Prevents context exhaustion mid-fix |
+| Validate against reference | `run_oracle_comparison` | Compare output against known-good oracle |
+| Orient new sessions | `get_parallel_status` | See what all agents are doing and what's blocked |
+| Bootstrap any repo | `bootstrap_parallel_agents` | Auto-detect gaps, scaffold coordination infra |
+
+### Research and discovery
+
+| When you... | Use this | Impact |
+|---|---|---|
+| Search the web | `web_search` | Gemini/OpenAI/Perplexity — latest docs and updates |
+| Fetch a URL | `fetch_url` | Read any page as clean markdown |
+| Find GitHub repos | `search_github` + `analyze_repo` | Discover and evaluate libraries and patterns |
+| Analyze screenshots | `analyze_screenshot` | AI vision (Gemini/GPT-4o/Claude) for UI QA |
 
 ---
 
-##
+## The Methodology Pipeline
 
-
+NodeBench MCP isn't just a bag of tools — it's a pipeline. Each step feeds the next:
 
 ```
-
-
-
-4. Use analyze_repo to study competitor implementations
-5. Use research_job_market to understand skill demand
+Research → Risk → Implement → Test (3 layers) → Eval → Gate → Learn → Ship
+    ↑                                                              │
+    └──────────── knowledge compounds ─────────────────────────────┘
 ```
 
-
+**Inner loop** (per change): 6-phase verification ensures correctness.
+**Outer loop** (over time): Eval-driven development ensures improvement.
+**Together**: The AI Flywheel — every verification produces eval artifacts, every regression triggers verification.
 
-
-1. Use search_github({ query: "mcp server", language: "typescript", minStars: 100 })
-2. Use analyze_repo({ repoUrl: "owner/repo" }) to see tech stack and patterns
-3. Use fetch_url to read their documentation
-```
+Ask the agent: `Use getMethodology("overview")` to see all 18 methodology topics.
 
-
+---
 
-
-1. Use setup_local_env to scan current environment
-2. Follow the recommendations to install missing SDKs
-3. Use getMethodology("tech_stack_2026") for ongoing maintenance
-```
+## Parallel Agents with Claude Code
 
-
+Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www.anthropic.com/engineering/building-c-compiler) (Feb 2026).
 
-
+**When to use:** Only when running 2+ agent sessions. Single-agent workflows use the standard pipeline above.
 
-
+**How it works with Claude Code's Task tool:**
 
-**
-
--
--
--
--
+1. **COORDINATOR** (your main session) breaks work into independent tasks
+2. Each **Task tool** call spawns a subagent with instructions to:
+   - `claim_agent_task` — lock the task
+   - `assign_agent_role` — specialize (implementer, test_writer, critic, etc.)
+   - Do the work
+   - `release_agent_task` — handoff with progress note
+3. Coordinator calls `get_parallel_status` to monitor all subagents
+4. Coordinator runs `run_quality_gate` on the aggregate result
 
-**
+**MCP Prompts available:**
+- `claude-code-parallel` — Step-by-step Claude Code subagent coordination
+- `parallel-agent-team` — Full team setup with role assignment
+- `oracle-test-harness` — Validate outputs against known-good reference
+- `bootstrap-parallel-agents` — Scaffold parallel infra for any repo
 
-
-2. Agents will auto-discover and follow the protocol
-3. Use `update_agents_md` tool to keep it in sync
+---
 
-
+## Build from Source
 
 ```bash
-
+git clone https://github.com/nodebench/nodebench-ai.git
+cd nodebench-ai/packages/mcp-local
+npm install && npm run build
 ```
 
-
-
-
-
+Then use absolute path:
+
+```json
+{
+  "mcpServers": {
+    "nodebench": {
+      "command": "node",
+      "args": ["/path/to/packages/mcp-local/dist/index.js"]
+    }
+  }
+}
+```
+
+---
+
+## Troubleshooting
+
+**"No search provider available"** — Set `GEMINI_API_KEY`, `OPENAI_API_KEY`, or `PERPLEXITY_API_KEY`
+
+**"GitHub API error 403"** — Set `GITHUB_TOKEN` for higher rate limits
+
+**"Cannot find module"** — Run `npm run build` in the mcp-local directory
+
+**MCP not connecting** — Check path is absolute, run `claude --mcp-debug`, ensure Node.js >= 18
 
 ---
 
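The oracle-testing workflow referenced throughout these docs reduces to diffing actual output against a known-good reference and turning each mismatch into an independent work item. A minimal sketch; the result shape below is illustrative, not `run_oracle_comparison`'s actual schema:

```typescript
// Hypothetical line-by-line oracle comparison.
interface OracleResult {
  testLabel: string;
  oracleSource: string; // where the known-good output came from, e.g. "prod_v2"
  passed: boolean;
  firstDiffLine?: number; // 1-based line of the first mismatch, if any
}

function compareToOracle(
  testLabel: string,
  actualOutput: string,
  expectedOutput: string,
  oracleSource: string,
): OracleResult {
  const actual = actualOutput.split("\n");
  const expected = expectedOutput.split("\n");
  const n = Math.max(actual.length, expected.length);
  for (let i = 0; i < n; i++) {
    if (actual[i] !== expected[i]) {
      // Each failing comparison is a self-contained task a parallel agent can claim.
      return { testLabel, oracleSource, passed: false, firstDiffLine: i + 1 };
    }
  }
  return { testLabel, oracleSource, passed: true };
}
```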

package/dist/__tests__/comparativeBench.test.d.ts
ADDED

@@ -0,0 +1 @@
+export {};