nodebench-mcp 1.4.1 → 2.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/NODEBENCH_AGENTS.md +154 -2
- package/README.md +214 -215
- package/dist/__tests__/comparativeBench.test.d.ts +1 -0
- package/dist/__tests__/comparativeBench.test.js +722 -0
- package/dist/__tests__/comparativeBench.test.js.map +1 -0
- package/dist/__tests__/evalHarness.test.js +24 -2
- package/dist/__tests__/evalHarness.test.js.map +1 -1
- package/dist/__tests__/gaiaCapabilityEval.test.d.ts +14 -0
- package/dist/__tests__/gaiaCapabilityEval.test.js +420 -0
- package/dist/__tests__/gaiaCapabilityEval.test.js.map +1 -0
- package/dist/__tests__/gaiaCapabilityFilesEval.test.d.ts +15 -0
- package/dist/__tests__/gaiaCapabilityFilesEval.test.js +303 -0
- package/dist/__tests__/gaiaCapabilityFilesEval.test.js.map +1 -0
- package/dist/__tests__/openDatasetParallelEvalGaia.test.d.ts +7 -0
- package/dist/__tests__/openDatasetParallelEvalGaia.test.js +279 -0
- package/dist/__tests__/openDatasetParallelEvalGaia.test.js.map +1 -0
- package/dist/__tests__/openDatasetPerfComparison.test.d.ts +10 -0
- package/dist/__tests__/openDatasetPerfComparison.test.js +318 -0
- package/dist/__tests__/openDatasetPerfComparison.test.js.map +1 -0
- package/dist/__tests__/tools.test.js +155 -7
- package/dist/__tests__/tools.test.js.map +1 -1
- package/dist/__tests__/toolsetGatingEval.test.d.ts +1 -0
- package/dist/__tests__/toolsetGatingEval.test.js +1031 -0
- package/dist/__tests__/toolsetGatingEval.test.js.map +1 -0
- package/dist/db.js +56 -0
- package/dist/db.js.map +1 -1
- package/dist/index.js +462 -28
- package/dist/index.js.map +1 -1
- package/dist/tools/localFileTools.d.ts +15 -0
- package/dist/tools/localFileTools.js +386 -0
- package/dist/tools/localFileTools.js.map +1 -0
- package/dist/tools/metaTools.js +170 -3
- package/dist/tools/metaTools.js.map +1 -1
- package/dist/tools/parallelAgentTools.d.ts +18 -0
- package/dist/tools/parallelAgentTools.js +1272 -0
- package/dist/tools/parallelAgentTools.js.map +1 -0
- package/dist/tools/selfEvalTools.js +240 -10
- package/dist/tools/selfEvalTools.js.map +1 -1
- package/dist/tools/webTools.js +171 -37
- package/dist/tools/webTools.js.map +1 -1
- package/package.json +26 -8
package/NODEBENCH_AGENTS.md
CHANGED
@@ -117,7 +117,7 @@ Run ToolBench parallel subagent benchmark:
 NODEBENCH_TOOLBENCH_TASK_LIMIT=6 NODEBENCH_TOOLBENCH_CONCURRENCY=3 npm run mcp:dataset:toolbench:test
 ```
 
-Run all lanes:
+Run all public lanes:
 ```bash
 npm run mcp:dataset:bench:all
 ```
@@ -137,11 +137,71 @@ Run SWE-bench parallel subagent benchmark:
 NODEBENCH_SWEBENCH_TASK_LIMIT=8 NODEBENCH_SWEBENCH_CONCURRENCY=4 npm run mcp:dataset:swebench:test
 ```
 
-
+Fourth lane (GAIA gated long-horizon tool-augmented tasks):
+- Dataset: `gaia-benchmark/GAIA` (gated)
+- Default config: `2023_level3`
+- Default split: `validation`
+- Source: `https://huggingface.co/datasets/gaia-benchmark/GAIA`
+
+Notes:
+- Fixture is written to `.cache/gaia` (gitignored). Do not commit GAIA question/answer content.
+- Refresh requires `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN` in your shell.
+- Python deps: `pandas`, `huggingface_hub`, `pyarrow` (or equivalent parquet engine).
+
+Refresh GAIA fixture:
+```bash
+npm run mcp:dataset:gaia:refresh
+```
+
+Run GAIA parallel subagent benchmark:
+```bash
+NODEBENCH_GAIA_TASK_LIMIT=8 NODEBENCH_GAIA_CONCURRENCY=4 npm run mcp:dataset:gaia:test
+```
+
+GAIA capability benchmark (accuracy: LLM-only vs LLM+tools):
+- This runs real model calls and web search. It is disabled by default and only intended for regression checks.
+- Uses Gemini by default. Ensure `GEMINI_API_KEY` is available (repo `.env.local` is loaded by the test).
+- Scoring fixture includes ground-truth answers and MUST remain under `.cache/gaia` (gitignored).
+
+Generate scoring fixture (local only, gated):
+```bash
+npm run mcp:dataset:gaia:capability:refresh
+```
+
+Run capability benchmark:
+```bash
+NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:test
+```
+
+GAIA capability benchmark (file-backed lane: PDF / XLSX / CSV):
+- This lane measures the impact of deterministic local parsing tools on GAIA tasks with attachments.
+- Fixture includes ground-truth answers and MUST remain under `.cache/gaia` (gitignored).
+- Attachments are copied into `.cache/gaia/data/<file_path>` for offline deterministic runs after the first download.
+
+Generate file-backed scoring fixture + download attachments (local only, gated):
+```bash
+npm run mcp:dataset:gaia:capability:files:refresh
+```
+
+Run file-backed capability benchmark:
+```bash
+NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:files:test
+```
+
+Modes:
+- Recommended (more stable): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag`
+- More realistic (higher variance): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (optional `NODEBENCH_GAIA_CAPABILITY_FORCE_WEB_SEARCH=1`)
+
+Run all public lanes:
 ```bash
 npm run mcp:dataset:bench:all
 ```
 
+Run full lane suite (includes GAIA):
+```bash
+npm run mcp:dataset:bench:full
+```
+
 Implementation files:
 - `packages/mcp-local/src/__tests__/fixtures/generateBfclLongContextFixture.ts`
 - `packages/mcp-local/src/__tests__/fixtures/bfcl_v3_long_context.sample.json`
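The `NODEBENCH_*_TASK_LIMIT` / `*_CONCURRENCY` variables in the hunk above are plain environment knobs: cap how many tasks run at all, and how many run at once. A minimal TypeScript sketch of how such knobs can gate a benchmark lane; this pool is illustrative, not the package's actual harness.

```typescript
// Illustrative runner for TASK_LIMIT / CONCURRENCY style knobs.
type Task<T> = () => Promise<T>;

// Read a positive integer from the environment, falling back when unset/invalid.
function envInt(name: string, fallback: number): number {
  const raw = process.env[name];
  const n = raw === undefined ? NaN : Number(raw);
  return Number.isFinite(n) && n > 0 ? n : fallback;
}

async function runLane<T>(tasks: Task<T>[], taskLimit: number, concurrency: number): Promise<T[]> {
  const selected = tasks.slice(0, taskLimit); // TASK_LIMIT: only the first N tasks
  const results: T[] = new Array(selected.length);
  let next = 0;
  const worker = async () => {
    while (next < selected.length) {
      const i = next++; // single-threaded event loop: no race on this counter
      results[i] = await selected[i]();
    }
  };
  // CONCURRENCY: this many workers pull from the shared queue
  await Promise.all(Array.from({ length: Math.min(concurrency, selected.length) }, worker));
  return results;
}
```

Usage would look like `runLane(tasks, envInt("NODEBENCH_GAIA_TASK_LIMIT", 8), envInt("NODEBENCH_GAIA_CONCURRENCY", 4))`, matching the defaults shown for the GAIA lane.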
@@ -152,6 +212,16 @@ Implementation files:
 - `packages/mcp-local/src/__tests__/fixtures/generateSwebenchVerifiedFixture.ts`
 - `packages/mcp-local/src/__tests__/fixtures/swebench_verified.sample.json`
 - `packages/mcp-local/src/__tests__/openDatasetParallelEvalSwebench.test.ts`
+- `packages/mcp-local/src/__tests__/fixtures/generateGaiaLevel3Fixture.py`
+- `.cache/gaia/gaia_2023_level3_validation.sample.json`
+- `packages/mcp-local/src/__tests__/openDatasetParallelEvalGaia.test.ts`
+- `packages/mcp-local/src/__tests__/fixtures/generateGaiaCapabilityFixture.py`
+- `.cache/gaia/gaia_capability_2023_all_validation.sample.json`
+- `packages/mcp-local/src/__tests__/gaiaCapabilityEval.test.ts`
+- `packages/mcp-local/src/__tests__/fixtures/generateGaiaCapabilityFilesFixture.py`
+- `.cache/gaia/gaia_capability_files_2023_all_validation.sample.json`
+- `.cache/gaia/data/...` (local GAIA attachments; do not commit)
+- `packages/mcp-local/src/__tests__/gaiaCapabilityFilesEval.test.ts`
 
 Required tool chain per dataset task:
 - `run_recon`
@@ -176,6 +246,7 @@ Use `getMethodology("overview")` to see all available workflows.
 | Category | Tools | When to Use |
 |----------|-------|-------------|
 | **Web** | `web_search`, `fetch_url` | Research, reading docs, market validation |
+| **Local Files** | `read_pdf_text`, `read_xlsx_file`, `read_csv_file` | Deterministic parsing of local attachments (GAIA file-backed lane) |
 | **GitHub** | `search_github`, `analyze_repo` | Finding libraries, studying implementations |
 | **Verification** | `start_cycle`, `log_phase`, `complete_cycle` | Tracking the flywheel process |
 | **Eval** | `start_eval_run`, `log_test_result` | Test case management |
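The new Local Files row advertises "deterministic parsing". As an illustration of what deterministic buys here (the actual `read_csv_file` signature and options are not shown in this table), a naive pure function that maps the same bytes to the same rows on every run, with no model in the loop:

```typescript
// Naive sketch of a read_csv_file-style parser: pure text -> rows, so the
// same attachment always yields identical output. Ignores quoted fields;
// the real tool's behavior is not documented in this table.
function parseCsv(text: string): string[][] {
  return text
    .split(/\r?\n/)                                  // tolerate CRLF and LF
    .filter((line) => line.length > 0)               // drop trailing blank line
    .map((line) => line.split(",").map((cell) => cell.trim()));
}
```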
@@ -184,6 +255,7 @@ Use `getMethodology("overview")` to see all available workflows.
 | **Vision** | `analyze_screenshot`, `capture_ui_screenshot` | UI/UX verification |
 | **Bootstrap** | `discover_infrastructure`, `triple_verify`, `self_implement` | Self-setup, triple verification |
 | **Autonomous** | `assess_risk`, `decide_re_update`, `run_self_maintenance` | Risk-aware execution, self-maintenance |
+| **Parallel Agents** | `claim_agent_task`, `release_agent_task`, `list_agent_tasks`, `assign_agent_role`, `get_agent_role`, `log_context_budget`, `run_oracle_comparison`, `get_parallel_status` | Multi-agent coordination, task locking, role specialization, oracle testing |
 | **Meta** | `findTools`, `getMethodology` | Discover tools, get workflow guides |
 
 **→ Quick Refs:** Find tools by keyword: `findTools({ query: "verification" })` | Get workflow guide: `getMethodology({ topic: "..." })` | See [Methodology Topics](#methodology-topics) for all topics
@@ -538,11 +610,91 @@ Available via `getMethodology({ topic: "..." })`:
 | `agents_md_maintenance` | Keep docs in sync | [Auto-Update](#auto-update-this-file) |
 | `agent_bootstrap` | Self-discover, triple verify | [Self-Bootstrap](#agent-self-bootstrap-system) |
 | `autonomous_maintenance` | Risk-tiered execution | [Autonomous Maintenance](#autonomous-self-maintenance-system) |
+| `parallel_agent_teams` | Multi-agent coordination, task locking, oracle testing | [Parallel Agent Teams](#parallel-agent-teams) |
+| `self_reinforced_learning` | Trajectory analysis, self-eval, improvement recs | [Self-Reinforced Learning](#self-reinforced-learning-loop) |
 
 **→ Quick Refs:** Find tools: `findTools({ query: "..." })` | Get any methodology: `getMethodology({ topic: "..." })` | See [MCP Tool Categories](#mcp-tool-categories)
 
 ---
 
+## Parallel Agent Teams
+
+Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www.anthropic.com/engineering/building-c-compiler) (Feb 2026).
+
+Run multiple AI agents in parallel on a shared codebase with coordination via task locking, role specialization, context budget management, and oracle-based testing.
+
+### Quick Start — Parallel Agents
+
+```
+1. get_parallel_status({ includeHistory: true }) // Orient: what's happening?
+2. assign_agent_role({ role: "implementer" }) // Specialize
+3. claim_agent_task({ taskKey: "fix_auth" }) // Lock task
+4. ... do work ...
+5. log_context_budget({ eventType: "test_output", tokensUsed: 5000 }) // Track budget
+6. run_oracle_comparison({ testLabel: "auth_output", actualOutput: "...", expectedOutput: "...", oracleSource: "prod_v2" })
+7. release_agent_task({ taskKey: "fix_auth", status: "completed", progressNote: "Fixed JWT, added tests" })
+```
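Step 6's `run_oracle_comparison` boils down to a normalize-and-diff check. A minimal TypeScript sketch of that comparison logic; the whitespace normalization rule and the result shape are assumptions here, not the tool's actual contract:

```typescript
// Oracle comparison sketch: normalize both sides, then either pass or emit a
// diff that becomes an independent work item for a parallel agent.
interface OracleResult {
  testLabel: string;
  pass: boolean;
  diff?: { actual: string; expected: string };
}

function compareWithOracle(
  testLabel: string,
  actualOutput: string,
  expectedOutput: string,
): OracleResult {
  // Assumed normalization: trim and collapse runs of whitespace.
  const norm = (s: string) => s.trim().replace(/\s+/g, " ");
  const actual = norm(actualOutput);
  const expected = norm(expectedOutput);
  return actual === expected
    ? { testLabel, pass: true }
    : { testLabel, pass: false, diff: { actual, expected } };
}
```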
+
+### Predefined Agent Roles
+
+| Role | Focus |
+|------|-------|
+| `implementer` | Primary feature work. Picks failing tests, implements fixes. |
+| `dedup_reviewer` | Finds and coalesces duplicate implementations. |
+| `performance_optimizer` | Profiles bottlenecks, optimizes hot paths. |
+| `documentation_maintainer` | Keeps READMEs and progress files in sync. |
+| `code_quality_critic` | Structural improvements, pattern enforcement. |
+| `test_writer` | Writes targeted tests for edge cases and failure modes. |
+| `security_auditor` | Audits for vulnerabilities, logs CRITICAL gaps. |
+
+### Key Patterns (from Anthropic blog)
+
+- **Task Locking**: Claim before working. If two agents try the same task, the second picks a different one.
+- **Context Window Budget**: Do NOT print thousands of useless bytes. Pre-compute summaries. Use `--fast` mode (1-10% random sample) for large test suites. Log errors with ERROR prefix on same line for grep.
+- **Oracle Testing**: Compare output against known-good reference. Each failing comparison is an independent work item for a parallel agent.
+- **Time Blindness**: Agents can't tell time. Print progress infrequently. Use deterministic random sampling per-agent but randomized across VMs.
+- **Progress Files**: Maintain running docs of status, failed approaches, and remaining tasks. Fresh agent sessions read these to orient.
+- **Delta Debugging**: When tests pass individually but fail together, split the set in half to narrow down the minimal failing combination.
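The delta-debugging pattern above can be sketched in a few lines. This simplified bisection only narrows while one half still reproduces the failure; when the failing combination straddles both halves it stops at the current set (the full ddmin algorithm would also try complements). `failsTogether` is a stand-in for actually running the subset:

```typescript
// Simplified delta debugging: bisect a failing test set down to a smaller
// set that still fails together. `failsTogether` is hypothetical: it should
// run the given subset and report whether the combined run breaks.
function minimizeFailingSet<T>(tests: T[], failsTogether: (subset: T[]) => boolean): T[] {
  let current = tests;
  while (current.length > 2) {
    const mid = Math.floor(current.length / 2);
    const left = current.slice(0, mid);
    const right = current.slice(mid);
    if (failsTogether(left)) current = left;        // failure localizes to the left half
    else if (failsTogether(right)) current = right; // ... or to the right half
    else break; // failure needs tests from both halves; stop here
  }
  return current;
}
```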
+
+### Bootstrap for External Repos
+
+When nodebench-mcp is connected to a project that lacks parallel agent infrastructure, it can auto-detect gaps and scaffold everything needed:
+
+```
+1. bootstrap_parallel_agents({ projectRoot: "/path/to/their/repo", dryRun: true })
+   // Scans 7 categories: task coordination, roles, oracle, context budget,
+   // progress files, AGENTS.md parallel section, git worktrees
+
+2. bootstrap_parallel_agents({ projectRoot: "...", dryRun: false, techStack: "TypeScript/React" })
+   // Creates .parallel-agents/ dir, progress.md, roles.json, lock dirs, oracle dirs
+
+3. generate_parallel_agents_md({ techStack: "TypeScript/React", projectName: "their-project", maxAgents: 4 })
+   // Generates portable AGENTS.md section — paste into their repo
+
+4. Run the 6-step flywheel plan returned by the bootstrap tool to verify
+5. Fix any issues, re-verify
+6. record_learning({ key: "bootstrap_their_project", content: "...", category: "pattern" })
+```
+
+The generated AGENTS.md section is framework-agnostic and works with any AI agent (Claude, GPT, etc.). It includes:
+- Task locking protocol (file-based, no dependencies)
+- Role definitions and assignment guide
+- Oracle testing workflow with idiomatic examples
+- Context budget rules
+- Progress file protocol
+- Anti-patterns to avoid
+- Optional nodebench-mcp tool mapping table
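The "file-based, no dependencies" locking protocol in the list above can be as small as one atomic directory create: `fs.mkdirSync` either succeeds (you own the task) or throws `EEXIST` (someone else does). A sketch; the lock layout is illustrative, and the scaffold's actual paths under `.parallel-agents/` may differ:

```typescript
// File-based task locking: the first agent to create the lock directory
// owns the task; everyone else backs off and picks a different one.
import * as fs from "node:fs";
import * as path from "node:path";

function claimTask(lockRoot: string, taskKey: string, agentId: string): boolean {
  const lockDir = path.join(lockRoot, `${taskKey}.lock`);
  try {
    fs.mkdirSync(lockDir); // atomic: throws EEXIST if another agent got here first
    fs.writeFileSync(path.join(lockDir, "owner"), agentId); // record who holds it
    return true;
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "EEXIST") return false;
    throw err;
  }
}

function releaseTask(lockRoot: string, taskKey: string): void {
  fs.rmSync(path.join(lockRoot, `${taskKey}.lock`), { recursive: true, force: true });
}
```

Because the claim is a filesystem primitive rather than a library, any agent framework touching the same checkout observes the same locks.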
+
+### MCP Prompts for Parallel Agent Teams
+
+- `parallel-agent-team` — Full team setup with role assignment and task breakdown
+- `oracle-test-harness` — Oracle-based testing setup for a component
+- `bootstrap-parallel-agents` — Detect and scaffold parallel agent infra for any external repo
+
+**→ Quick Refs:** Full methodology: `getMethodology({ topic: "parallel_agent_teams" })` | Find parallel tools: `findTools({ category: "parallel_agents" })` | Bootstrap external repo: `bootstrap_parallel_agents({ projectRoot: "..." })` | See [AI Flywheel](#the-ai-flywheel-mandatory)
+
+---
+
 ## Auto-Update This File
 
 Agents can self-update this file: