nodebench-mcp 1.4.1 → 2.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (41)
  1. package/NODEBENCH_AGENTS.md +154 -2
  2. package/README.md +214 -215
  3. package/dist/__tests__/comparativeBench.test.d.ts +1 -0
  4. package/dist/__tests__/comparativeBench.test.js +722 -0
  5. package/dist/__tests__/comparativeBench.test.js.map +1 -0
  6. package/dist/__tests__/evalHarness.test.js +24 -2
  7. package/dist/__tests__/evalHarness.test.js.map +1 -1
  8. package/dist/__tests__/gaiaCapabilityEval.test.d.ts +14 -0
  9. package/dist/__tests__/gaiaCapabilityEval.test.js +420 -0
  10. package/dist/__tests__/gaiaCapabilityEval.test.js.map +1 -0
  11. package/dist/__tests__/gaiaCapabilityFilesEval.test.d.ts +15 -0
  12. package/dist/__tests__/gaiaCapabilityFilesEval.test.js +303 -0
  13. package/dist/__tests__/gaiaCapabilityFilesEval.test.js.map +1 -0
  14. package/dist/__tests__/openDatasetParallelEvalGaia.test.d.ts +7 -0
  15. package/dist/__tests__/openDatasetParallelEvalGaia.test.js +279 -0
  16. package/dist/__tests__/openDatasetParallelEvalGaia.test.js.map +1 -0
  17. package/dist/__tests__/openDatasetPerfComparison.test.d.ts +10 -0
  18. package/dist/__tests__/openDatasetPerfComparison.test.js +318 -0
  19. package/dist/__tests__/openDatasetPerfComparison.test.js.map +1 -0
  20. package/dist/__tests__/tools.test.js +155 -7
  21. package/dist/__tests__/tools.test.js.map +1 -1
  22. package/dist/__tests__/toolsetGatingEval.test.d.ts +1 -0
  23. package/dist/__tests__/toolsetGatingEval.test.js +1031 -0
  24. package/dist/__tests__/toolsetGatingEval.test.js.map +1 -0
  25. package/dist/db.js +56 -0
  26. package/dist/db.js.map +1 -1
  27. package/dist/index.js +462 -28
  28. package/dist/index.js.map +1 -1
  29. package/dist/tools/localFileTools.d.ts +15 -0
  30. package/dist/tools/localFileTools.js +386 -0
  31. package/dist/tools/localFileTools.js.map +1 -0
  32. package/dist/tools/metaTools.js +170 -3
  33. package/dist/tools/metaTools.js.map +1 -1
  34. package/dist/tools/parallelAgentTools.d.ts +18 -0
  35. package/dist/tools/parallelAgentTools.js +1272 -0
  36. package/dist/tools/parallelAgentTools.js.map +1 -0
  37. package/dist/tools/selfEvalTools.js +240 -10
  38. package/dist/tools/selfEvalTools.js.map +1 -1
  39. package/dist/tools/webTools.js +171 -37
  40. package/dist/tools/webTools.js.map +1 -1
  41. package/package.json +26 -8
@@ -117,7 +117,7 @@ Run ToolBench parallel subagent benchmark:
  NODEBENCH_TOOLBENCH_TASK_LIMIT=6 NODEBENCH_TOOLBENCH_CONCURRENCY=3 npm run mcp:dataset:toolbench:test
  ```
 
- Run all lanes:
+ Run all public lanes:
  ```bash
  npm run mcp:dataset:bench:all
  ```
@@ -137,11 +137,71 @@ Run SWE-bench parallel subagent benchmark:
  NODEBENCH_SWEBENCH_TASK_LIMIT=8 NODEBENCH_SWEBENCH_CONCURRENCY=4 npm run mcp:dataset:swebench:test
  ```
 
- Run all lanes:
+ Fourth lane (GAIA gated long-horizon tool-augmented tasks):
+ - Dataset: `gaia-benchmark/GAIA` (gated)
+ - Default config: `2023_level3`
+ - Default split: `validation`
+ - Source: `https://huggingface.co/datasets/gaia-benchmark/GAIA`
+
+ Notes:
+ - Fixture is written to `.cache/gaia` (gitignored). Do not commit GAIA question/answer content.
+ - Refresh requires `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN` in your shell.
+ - Python deps: `pandas`, `huggingface_hub`, `pyarrow` (or equivalent parquet engine).
+
+ Refresh GAIA fixture:
+ ```bash
+ npm run mcp:dataset:gaia:refresh
+ ```
+
+ Run GAIA parallel subagent benchmark:
+ ```bash
+ NODEBENCH_GAIA_TASK_LIMIT=8 NODEBENCH_GAIA_CONCURRENCY=4 npm run mcp:dataset:gaia:test
+ ```
+
+ GAIA capability benchmark (accuracy: LLM-only vs LLM+tools):
+ - This runs real model calls and web search. It is disabled by default and only intended for regression checks.
+ - Uses Gemini by default. Ensure `GEMINI_API_KEY` is available (repo `.env.local` is loaded by the test).
+ - Scoring fixture includes ground-truth answers and MUST remain under `.cache/gaia` (gitignored).
+
+ Generate scoring fixture (local only, gated):
+ ```bash
+ npm run mcp:dataset:gaia:capability:refresh
+ ```
+
+ Run capability benchmark:
+ ```bash
+ NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:test
+ ```
+
+ GAIA capability benchmark (file-backed lane: PDF / XLSX / CSV):
+ - This lane measures the impact of deterministic local parsing tools on GAIA tasks with attachments.
+ - Fixture includes ground-truth answers and MUST remain under `.cache/gaia` (gitignored).
+ - Attachments are copied into `.cache/gaia/data/<file_path>` for offline deterministic runs after the first download.
+
+ Generate file-backed scoring fixture + download attachments (local only, gated):
+ ```bash
+ npm run mcp:dataset:gaia:capability:files:refresh
+ ```
+
+ Run file-backed capability benchmark:
+ ```bash
+ NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:files:test
+ ```
+
+ Modes:
+ - Recommended (more stable): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag`
+ - More realistic (higher variance): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (optional `NODEBENCH_GAIA_CAPABILITY_FORCE_WEB_SEARCH=1`)
+
+ Run all public lanes:
  ```bash
  npm run mcp:dataset:bench:all
  ```
 
+ Run full lane suite (includes GAIA):
+ ```bash
+ npm run mcp:dataset:bench:full
+ ```
+
  Implementation files:
  - `packages/mcp-local/src/__tests__/fixtures/generateBfclLongContextFixture.ts`
  - `packages/mcp-local/src/__tests__/fixtures/bfcl_v3_long_context.sample.json`
@@ -152,6 +212,16 @@ Implementation files:
  - `packages/mcp-local/src/__tests__/fixtures/generateSwebenchVerifiedFixture.ts`
  - `packages/mcp-local/src/__tests__/fixtures/swebench_verified.sample.json`
  - `packages/mcp-local/src/__tests__/openDatasetParallelEvalSwebench.test.ts`
+ - `packages/mcp-local/src/__tests__/fixtures/generateGaiaLevel3Fixture.py`
+ - `.cache/gaia/gaia_2023_level3_validation.sample.json`
+ - `packages/mcp-local/src/__tests__/openDatasetParallelEvalGaia.test.ts`
+ - `packages/mcp-local/src/__tests__/fixtures/generateGaiaCapabilityFixture.py`
+ - `.cache/gaia/gaia_capability_2023_all_validation.sample.json`
+ - `packages/mcp-local/src/__tests__/gaiaCapabilityEval.test.ts`
+ - `packages/mcp-local/src/__tests__/fixtures/generateGaiaCapabilityFilesFixture.py`
+ - `.cache/gaia/gaia_capability_files_2023_all_validation.sample.json`
+ - `.cache/gaia/data/...` (local GAIA attachments; do not commit)
+ - `packages/mcp-local/src/__tests__/gaiaCapabilityFilesEval.test.ts`
 
  Required tool chain per dataset task:
  - `run_recon`
@@ -176,6 +246,7 @@ Use `getMethodology("overview")` to see all available workflows.
  | Category | Tools | When to Use |
  |----------|-------|-------------|
  | **Web** | `web_search`, `fetch_url` | Research, reading docs, market validation |
+ | **Local Files** | `read_pdf_text`, `read_xlsx_file`, `read_csv_file` | Deterministic parsing of local attachments (GAIA file-backed lane) |
  | **GitHub** | `search_github`, `analyze_repo` | Finding libraries, studying implementations |
  | **Verification** | `start_cycle`, `log_phase`, `complete_cycle` | Tracking the flywheel process |
  | **Eval** | `start_eval_run`, `log_test_result` | Test case management |
@@ -184,6 +255,7 @@ Use `getMethodology("overview")` to see all available workflows.
  | **Vision** | `analyze_screenshot`, `capture_ui_screenshot` | UI/UX verification |
  | **Bootstrap** | `discover_infrastructure`, `triple_verify`, `self_implement` | Self-setup, triple verification |
  | **Autonomous** | `assess_risk`, `decide_re_update`, `run_self_maintenance` | Risk-aware execution, self-maintenance |
+ | **Parallel Agents** | `claim_agent_task`, `release_agent_task`, `list_agent_tasks`, `assign_agent_role`, `get_agent_role`, `log_context_budget`, `run_oracle_comparison`, `get_parallel_status` | Multi-agent coordination, task locking, role specialization, oracle testing |
  | **Meta** | `findTools`, `getMethodology` | Discover tools, get workflow guides |
 
  **→ Quick Refs:** Find tools by keyword: `findTools({ query: "verification" })` | Get workflow guide: `getMethodology({ topic: "..." })` | See [Methodology Topics](#methodology-topics) for all topics
@@ -538,11 +610,91 @@ Available via `getMethodology({ topic: "..." })`:
  | `agents_md_maintenance` | Keep docs in sync | [Auto-Update](#auto-update-this-file) |
  | `agent_bootstrap` | Self-discover, triple verify | [Self-Bootstrap](#agent-self-bootstrap-system) |
  | `autonomous_maintenance` | Risk-tiered execution | [Autonomous Maintenance](#autonomous-self-maintenance-system) |
+ | `parallel_agent_teams` | Multi-agent coordination, task locking, oracle testing | [Parallel Agent Teams](#parallel-agent-teams) |
+ | `self_reinforced_learning` | Trajectory analysis, self-eval, improvement recs | [Self-Reinforced Learning](#self-reinforced-learning-loop) |
 
  **→ Quick Refs:** Find tools: `findTools({ query: "..." })` | Get any methodology: `getMethodology({ topic: "..." })` | See [MCP Tool Categories](#mcp-tool-categories)
 
  ---
 
+ ## Parallel Agent Teams
+
+ Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www.anthropic.com/engineering/building-c-compiler) (Feb 2026).
+
+ Run multiple AI agents in parallel on a shared codebase with coordination via task locking, role specialization, context budget management, and oracle-based testing.
+
+ ### Quick Start — Parallel Agents
+
+ ```
+ 1. get_parallel_status({ includeHistory: true }) // Orient: what's happening?
+ 2. assign_agent_role({ role: "implementer" }) // Specialize
+ 3. claim_agent_task({ taskKey: "fix_auth" }) // Lock task
+ 4. ... do work ...
+ 5. log_context_budget({ eventType: "test_output", tokensUsed: 5000 }) // Track budget
+ 6. run_oracle_comparison({ testLabel: "auth_output", actualOutput: "...", expectedOutput: "...", oracleSource: "prod_v2" })
+ 7. release_agent_task({ taskKey: "fix_auth", status: "completed", progressNote: "Fixed JWT, added tests" })
+ ```
+
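The claim/release steps in the quick start rest on a simple file-based lock. A minimal TypeScript sketch, assuming a `.parallel-agents/locks` layout; the directory names and JSON shape here are illustrative, not the package's actual implementation:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

const LOCK_ROOT = ".parallel-agents/locks"; // assumed layout, not the package's real path

// Claim a task by atomically creating its lock directory.
// mkdir is atomic: if two agents race for the same task, exactly one wins.
function claimTask(taskKey: string, agentId: string): boolean {
  fs.mkdirSync(LOCK_ROOT, { recursive: true });
  const lockDir = path.join(LOCK_ROOT, taskKey);
  try {
    fs.mkdirSync(lockDir); // throws EEXIST if another agent holds the lock
  } catch {
    return false; // already claimed: pick a different task
  }
  fs.writeFileSync(
    path.join(lockDir, "owner.json"),
    JSON.stringify({ agentId, claimedAt: new Date().toISOString() })
  );
  return true;
}

// Release appends a progress note (so fresh agent sessions can orient),
// then drops the lock so the task becomes claimable again.
function releaseTask(taskKey: string, status: string, note: string): void {
  fs.appendFileSync(
    path.join(LOCK_ROOT, "..", "progress.md"),
    `- ${taskKey}: ${status}: ${note}\n`
  );
  fs.rmSync(path.join(LOCK_ROOT, taskKey), { recursive: true, force: true });
}
```

Relying on `mkdir` rather than a lock file avoids check-then-create races without any external dependency.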
+ ### Predefined Agent Roles
+
+ | Role | Focus |
+ |------|-------|
+ | `implementer` | Primary feature work. Picks failing tests, implements fixes. |
+ | `dedup_reviewer` | Finds and coalesces duplicate implementations. |
+ | `performance_optimizer` | Profiles bottlenecks, optimizes hot paths. |
+ | `documentation_maintainer` | Keeps READMEs and progress files in sync. |
+ | `code_quality_critic` | Structural improvements, pattern enforcement. |
+ | `test_writer` | Writes targeted tests for edge cases and failure modes. |
+ | `security_auditor` | Audits for vulnerabilities, logs CRITICAL gaps. |
+
+ ### Key Patterns (from Anthropic blog)
+
+ - **Task Locking**: Claim before working. If two agents try the same task, the second picks a different one.
+ - **Context Window Budget**: Do NOT print thousands of useless bytes. Pre-compute summaries. Use `--fast` mode (1-10% random sample) for large test suites. Log errors with ERROR prefix on same line for grep.
+ - **Oracle Testing**: Compare output against known-good reference. Each failing comparison is an independent work item for a parallel agent.
+ - **Time Blindness**: Agents can't tell time. Print progress infrequently. Use deterministic random sampling per-agent but randomized across VMs.
+ - **Progress Files**: Maintain running docs of status, failed approaches, and remaining tasks. Fresh agent sessions read these to orient.
+ - **Delta Debugging**: When tests pass individually but fail together, split the set in half to narrow down the minimal failing combination.
+
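The delta-debugging pattern above can be sketched generically. `passes` stands in for whatever batch test runner the project uses; this is the simple bisection variant, not full ddmin:

```typescript
// Delta debugging: tests pass individually but fail when run together.
// Bisect the batch to isolate a smaller subset that still reproduces the failure.
type Runner = (batch: string[]) => boolean; // true = the batch passes

function minimizeFailingBatch(tests: string[], passes: Runner): string[] {
  if (tests.length <= 1) return tests;
  const mid = Math.floor(tests.length / 2);
  const left = tests.slice(0, mid);
  const right = tests.slice(mid);
  if (!passes(left)) return minimizeFailingBatch(left, passes); // failure survives in left half
  if (!passes(right)) return minimizeFailingBatch(right, passes); // failure survives in right half
  // Both halves pass alone: the failure needs tests from both sides,
  // so this sketch stops here with the current combination.
  return tests;
}
```

Full ddmin additionally tries removing complements to shrink cross-half interactions; the bisection step alone already cuts most failing batches down to a handful of suspects.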
+ ### Bootstrap for External Repos
+
+ When nodebench-mcp is connected to a project that lacks parallel agent infrastructure, it can auto-detect gaps and scaffold everything needed:
+
+ ```
+ 1. bootstrap_parallel_agents({ projectRoot: "/path/to/their/repo", dryRun: true })
+ // Scans 7 categories: task coordination, roles, oracle, context budget,
+ // progress files, AGENTS.md parallel section, git worktrees
+
+ 2. bootstrap_parallel_agents({ projectRoot: "...", dryRun: false, techStack: "TypeScript/React" })
+ // Creates .parallel-agents/ dir, progress.md, roles.json, lock dirs, oracle dirs
+
+ 3. generate_parallel_agents_md({ techStack: "TypeScript/React", projectName: "their-project", maxAgents: 4 })
+ // Generates portable AGENTS.md section — paste into their repo
+
+ 4. Run the 6-step flywheel plan returned by the bootstrap tool to verify
+ 5. Fix any issues, re-verify
+ 6. record_learning({ key: "bootstrap_their_project", content: "...", category: "pattern" })
+ ```
+
+ The generated AGENTS.md section is framework-agnostic and works with any AI agent (Claude, GPT, etc.). It includes:
+ - Task locking protocol (file-based, no dependencies)
+ - Role definitions and assignment guide
+ - Oracle testing workflow with idiomatic examples
+ - Context budget rules
+ - Progress file protocol
+ - Anti-patterns to avoid
+ - Optional nodebench-mcp tool mapping table
+
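The oracle testing workflow in the list above boils down to a plain comparator whose mismatches become independent work items. A minimal sketch; the result shape is an assumption, not `run_oracle_comparison`'s actual schema:

```typescript
// Oracle testing: diff actual output against a known-good reference.
interface OracleResult {
  testLabel: string;
  match: boolean;
  firstDiffLine?: number; // 1-based line where the outputs diverge
}

function compareToOracle(
  testLabel: string,
  actualOutput: string,
  expectedOutput: string
): OracleResult {
  if (actualOutput === expectedOutput) return { testLabel, match: true };
  const actual = actualOutput.split("\n");
  const expected = expectedOutput.split("\n");
  const len = Math.max(actual.length, expected.length);
  for (let i = 0; i < len; i++) {
    if (actual[i] !== expected[i]) {
      return { testLabel, match: false, firstDiffLine: i + 1 };
    }
  }
  return { testLabel, match: false }; // defensive fallback; unreachable for \n-joined text
}
```

Reporting only the first diverging line keeps the output small, in line with the context-budget rule: an agent greps for the label and line number instead of reading both outputs in full.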
+ ### MCP Prompts for Parallel Agent Teams
+
+ - `parallel-agent-team` — Full team setup with role assignment and task breakdown
+ - `oracle-test-harness` — Oracle-based testing setup for a component
+ - `bootstrap-parallel-agents` — Detect and scaffold parallel agent infra for any external repo
+
+ **→ Quick Refs:** Full methodology: `getMethodology({ topic: "parallel_agent_teams" })` | Find parallel tools: `findTools({ category: "parallel_agents" })` | Bootstrap external repo: `bootstrap_parallel_agents({ projectRoot: "..." })` | See [AI Flywheel](#the-ai-flywheel-mandatory)
+
+ ---
+
  ## Auto-Update This File
 
  Agents can self-update this file: