nodebench-mcp 1.4.1 → 2.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (41)
  1. package/NODEBENCH_AGENTS.md +154 -2
  2. package/README.md +214 -215
  3. package/dist/__tests__/comparativeBench.test.d.ts +1 -0
  4. package/dist/__tests__/comparativeBench.test.js +722 -0
  5. package/dist/__tests__/comparativeBench.test.js.map +1 -0
  6. package/dist/__tests__/evalHarness.test.js +24 -2
  7. package/dist/__tests__/evalHarness.test.js.map +1 -1
  8. package/dist/__tests__/gaiaCapabilityEval.test.d.ts +14 -0
  9. package/dist/__tests__/gaiaCapabilityEval.test.js +420 -0
  10. package/dist/__tests__/gaiaCapabilityEval.test.js.map +1 -0
  11. package/dist/__tests__/gaiaCapabilityFilesEval.test.d.ts +15 -0
  12. package/dist/__tests__/gaiaCapabilityFilesEval.test.js +303 -0
  13. package/dist/__tests__/gaiaCapabilityFilesEval.test.js.map +1 -0
  14. package/dist/__tests__/openDatasetParallelEvalGaia.test.d.ts +7 -0
  15. package/dist/__tests__/openDatasetParallelEvalGaia.test.js +279 -0
  16. package/dist/__tests__/openDatasetParallelEvalGaia.test.js.map +1 -0
  17. package/dist/__tests__/openDatasetPerfComparison.test.d.ts +10 -0
  18. package/dist/__tests__/openDatasetPerfComparison.test.js +318 -0
  19. package/dist/__tests__/openDatasetPerfComparison.test.js.map +1 -0
  20. package/dist/__tests__/tools.test.js +155 -7
  21. package/dist/__tests__/tools.test.js.map +1 -1
  22. package/dist/__tests__/toolsetGatingEval.test.d.ts +1 -0
  23. package/dist/__tests__/toolsetGatingEval.test.js +1031 -0
  24. package/dist/__tests__/toolsetGatingEval.test.js.map +1 -0
  25. package/dist/db.js +56 -0
  26. package/dist/db.js.map +1 -1
  27. package/dist/index.js +462 -28
  28. package/dist/index.js.map +1 -1
  29. package/dist/tools/localFileTools.d.ts +15 -0
  30. package/dist/tools/localFileTools.js +386 -0
  31. package/dist/tools/localFileTools.js.map +1 -0
  32. package/dist/tools/metaTools.js +170 -3
  33. package/dist/tools/metaTools.js.map +1 -1
  34. package/dist/tools/parallelAgentTools.d.ts +18 -0
  35. package/dist/tools/parallelAgentTools.js +1272 -0
  36. package/dist/tools/parallelAgentTools.js.map +1 -0
  37. package/dist/tools/selfEvalTools.js +240 -10
  38. package/dist/tools/selfEvalTools.js.map +1 -1
  39. package/dist/tools/webTools.js +171 -37
  40. package/dist/tools/webTools.js.map +1 -1
  41. package/package.json +26 -8
@@ -117,7 +117,7 @@ Run ToolBench parallel subagent benchmark:
  NODEBENCH_TOOLBENCH_TASK_LIMIT=6 NODEBENCH_TOOLBENCH_CONCURRENCY=3 npm run mcp:dataset:toolbench:test
  ```
 
- Run all lanes:
+ Run all public lanes:
  ```bash
  npm run mcp:dataset:bench:all
  ```
@@ -137,11 +137,71 @@ Run SWE-bench parallel subagent benchmark:
  NODEBENCH_SWEBENCH_TASK_LIMIT=8 NODEBENCH_SWEBENCH_CONCURRENCY=4 npm run mcp:dataset:swebench:test
  ```
 
- Run all lanes:
+ Fourth lane (GAIA gated long-horizon tool-augmented tasks):
+ - Dataset: `gaia-benchmark/GAIA` (gated)
+ - Default config: `2023_level3`
+ - Default split: `validation`
+ - Source: `https://huggingface.co/datasets/gaia-benchmark/GAIA`
+
+ Notes:
+ - Fixture is written to `.cache/gaia` (gitignored). Do not commit GAIA question/answer content.
+ - Refresh requires `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN` in your shell.
+ - Python deps: `pandas`, `huggingface_hub`, `pyarrow` (or equivalent parquet engine).
+
+ Refresh GAIA fixture:
+ ```bash
+ npm run mcp:dataset:gaia:refresh
+ ```
+
+ Run GAIA parallel subagent benchmark:
+ ```bash
+ NODEBENCH_GAIA_TASK_LIMIT=8 NODEBENCH_GAIA_CONCURRENCY=4 npm run mcp:dataset:gaia:test
+ ```
+
+ GAIA capability benchmark (accuracy: LLM-only vs LLM+tools):
+ - This runs real model calls and web search. It is disabled by default and only intended for regression checks.
+ - Uses Gemini by default. Ensure `GEMINI_API_KEY` is available (repo `.env.local` is loaded by the test).
+ - Scoring fixture includes ground-truth answers and MUST remain under `.cache/gaia` (gitignored).
+
+ Generate scoring fixture (local only, gated):
+ ```bash
+ npm run mcp:dataset:gaia:capability:refresh
+ ```
+
+ Run capability benchmark:
+ ```bash
+ NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:test
+ ```
+
+ GAIA capability benchmark (file-backed lane: PDF / XLSX / CSV):
+ - This lane measures the impact of deterministic local parsing tools on GAIA tasks with attachments.
+ - Fixture includes ground-truth answers and MUST remain under `.cache/gaia` (gitignored).
+ - Attachments are copied into `.cache/gaia/data/<file_path>` for offline deterministic runs after the first download.
+
+ Generate file-backed scoring fixture + download attachments (local only, gated):
+ ```bash
+ npm run mcp:dataset:gaia:capability:files:refresh
+ ```
+
+ Run file-backed capability benchmark:
+ ```bash
+ NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:files:test
+ ```
+
+ Modes:
+ - Recommended (more stable): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag`
+ - More realistic (higher variance): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (optional `NODEBENCH_GAIA_CAPABILITY_FORCE_WEB_SEARCH=1`)
+
+ Run all public lanes:
  ```bash
  npm run mcp:dataset:bench:all
  ```
 
+ Run full lane suite (includes GAIA):
+ ```bash
+ npm run mcp:dataset:bench:full
+ ```
+
  Implementation files:
  - `packages/mcp-local/src/__tests__/fixtures/generateBfclLongContextFixture.ts`
  - `packages/mcp-local/src/__tests__/fixtures/bfcl_v3_long_context.sample.json`
@@ -152,6 +212,16 @@ Implementation files:
  - `packages/mcp-local/src/__tests__/fixtures/generateSwebenchVerifiedFixture.ts`
  - `packages/mcp-local/src/__tests__/fixtures/swebench_verified.sample.json`
  - `packages/mcp-local/src/__tests__/openDatasetParallelEvalSwebench.test.ts`
+ - `packages/mcp-local/src/__tests__/fixtures/generateGaiaLevel3Fixture.py`
+ - `.cache/gaia/gaia_2023_level3_validation.sample.json`
+ - `packages/mcp-local/src/__tests__/openDatasetParallelEvalGaia.test.ts`
+ - `packages/mcp-local/src/__tests__/fixtures/generateGaiaCapabilityFixture.py`
+ - `.cache/gaia/gaia_capability_2023_all_validation.sample.json`
+ - `packages/mcp-local/src/__tests__/gaiaCapabilityEval.test.ts`
+ - `packages/mcp-local/src/__tests__/fixtures/generateGaiaCapabilityFilesFixture.py`
+ - `.cache/gaia/gaia_capability_files_2023_all_validation.sample.json`
+ - `.cache/gaia/data/...` (local GAIA attachments; do not commit)
+ - `packages/mcp-local/src/__tests__/gaiaCapabilityFilesEval.test.ts`
 
  Required tool chain per dataset task:
  - `run_recon`
@@ -176,6 +246,7 @@ Use `getMethodology("overview")` to see all available workflows.
  | Category | Tools | When to Use |
  |----------|-------|-------------|
  | **Web** | `web_search`, `fetch_url` | Research, reading docs, market validation |
+ | **Local Files** | `read_pdf_text`, `read_xlsx_file`, `read_csv_file` | Deterministic parsing of local attachments (GAIA file-backed lane) |
  | **GitHub** | `search_github`, `analyze_repo` | Finding libraries, studying implementations |
  | **Verification** | `start_cycle`, `log_phase`, `complete_cycle` | Tracking the flywheel process |
  | **Eval** | `start_eval_run`, `log_test_result` | Test case management |
@@ -184,6 +255,7 @@ Use `getMethodology("overview")` to see all available workflows.
  | **Vision** | `analyze_screenshot`, `capture_ui_screenshot` | UI/UX verification |
  | **Bootstrap** | `discover_infrastructure`, `triple_verify`, `self_implement` | Self-setup, triple verification |
  | **Autonomous** | `assess_risk`, `decide_re_update`, `run_self_maintenance` | Risk-aware execution, self-maintenance |
+ | **Parallel Agents** | `claim_agent_task`, `release_agent_task`, `list_agent_tasks`, `assign_agent_role`, `get_agent_role`, `log_context_budget`, `run_oracle_comparison`, `get_parallel_status` | Multi-agent coordination, task locking, role specialization, oracle testing |
  | **Meta** | `findTools`, `getMethodology` | Discover tools, get workflow guides |
 
  **→ Quick Refs:** Find tools by keyword: `findTools({ query: "verification" })` | Get workflow guide: `getMethodology({ topic: "..." })` | See [Methodology Topics](#methodology-topics) for all topics
@@ -538,11 +610,91 @@ Available via `getMethodology({ topic: "..." })`:
  | `agents_md_maintenance` | Keep docs in sync | [Auto-Update](#auto-update-this-file) |
  | `agent_bootstrap` | Self-discover, triple verify | [Self-Bootstrap](#agent-self-bootstrap-system) |
  | `autonomous_maintenance` | Risk-tiered execution | [Autonomous Maintenance](#autonomous-self-maintenance-system) |
+ | `parallel_agent_teams` | Multi-agent coordination, task locking, oracle testing | [Parallel Agent Teams](#parallel-agent-teams) |
+ | `self_reinforced_learning` | Trajectory analysis, self-eval, improvement recs | [Self-Reinforced Learning](#self-reinforced-learning-loop) |
 
  **→ Quick Refs:** Find tools: `findTools({ query: "..." })` | Get any methodology: `getMethodology({ topic: "..." })` | See [MCP Tool Categories](#mcp-tool-categories)
 
  ---
 
+ ## Parallel Agent Teams
+
+ Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www.anthropic.com/engineering/building-c-compiler) (Feb 2026).
+
+ Run multiple AI agents in parallel on a shared codebase with coordination via task locking, role specialization, context budget management, and oracle-based testing.
+
+ ### Quick Start — Parallel Agents
+
+ ```
+ 1. get_parallel_status({ includeHistory: true }) // Orient: what's happening?
+ 2. assign_agent_role({ role: "implementer" }) // Specialize
+ 3. claim_agent_task({ taskKey: "fix_auth" }) // Lock task
+ 4. ... do work ...
+ 5. log_context_budget({ eventType: "test_output", tokensUsed: 5000 }) // Track budget
+ 6. run_oracle_comparison({ testLabel: "auth_output", actualOutput: "...", expectedOutput: "...", oracleSource: "prod_v2" })
+ 7. release_agent_task({ taskKey: "fix_auth", status: "completed", progressNote: "Fixed JWT, added tests" })
+ ```
+
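The claim/release steps in the quick start rest on a simple file-based lock. A minimal TypeScript sketch, assuming a `.parallel-agents/locks` layout; the directory names and JSON shape here are illustrative, not the package's actual implementation:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

const LOCK_ROOT = ".parallel-agents/locks"; // assumed layout, not the package's real path

// Claim a task by atomically creating its lock directory.
// mkdir is atomic: if two agents race for the same task, exactly one wins.
function claimTask(taskKey: string, agentId: string): boolean {
  fs.mkdirSync(LOCK_ROOT, { recursive: true });
  const lockDir = path.join(LOCK_ROOT, taskKey);
  try {
    fs.mkdirSync(lockDir); // throws EEXIST if another agent holds the lock
  } catch {
    return false; // already claimed: pick a different task
  }
  fs.writeFileSync(
    path.join(lockDir, "owner.json"),
    JSON.stringify({ agentId, claimedAt: new Date().toISOString() })
  );
  return true;
}

// Release appends a progress note (so fresh agent sessions can orient),
// then drops the lock so the task becomes claimable again.
function releaseTask(taskKey: string, status: string, note: string): void {
  fs.appendFileSync(
    path.join(LOCK_ROOT, "..", "progress.md"),
    `- ${taskKey}: ${status}: ${note}\n`
  );
  fs.rmSync(path.join(LOCK_ROOT, taskKey), { recursive: true, force: true });
}
```

Relying on `mkdir` rather than a lock file avoids check-then-create races without any external dependency.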
+ ### Predefined Agent Roles
+
+ | Role | Focus |
+ |------|-------|
+ | `implementer` | Primary feature work. Picks failing tests, implements fixes. |
+ | `dedup_reviewer` | Finds and coalesces duplicate implementations. |
+ | `performance_optimizer` | Profiles bottlenecks, optimizes hot paths. |
+ | `documentation_maintainer` | Keeps READMEs and progress files in sync. |
+ | `code_quality_critic` | Structural improvements, pattern enforcement. |
+ | `test_writer` | Writes targeted tests for edge cases and failure modes. |
+ | `security_auditor` | Audits for vulnerabilities, logs CRITICAL gaps. |
+
+ ### Key Patterns (from Anthropic blog)
+
+ - **Task Locking**: Claim before working. If two agents try the same task, the second picks a different one.
+ - **Context Window Budget**: Do NOT print thousands of useless bytes. Pre-compute summaries. Use `--fast` mode (1-10% random sample) for large test suites. Log errors with ERROR prefix on same line for grep.
+ - **Oracle Testing**: Compare output against known-good reference. Each failing comparison is an independent work item for a parallel agent.
+ - **Time Blindness**: Agents can't tell time. Print progress infrequently. Use deterministic random sampling per-agent but randomized across VMs.
+ - **Progress Files**: Maintain running docs of status, failed approaches, and remaining tasks. Fresh agent sessions read these to orient.
+ - **Delta Debugging**: When tests pass individually but fail together, split the set in half to narrow down the minimal failing combination.
+
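The delta-debugging pattern above can be sketched generically. `passes` stands in for whatever batch test runner the project uses; this is the simple bisection variant, not full ddmin:

```typescript
// Delta debugging: tests pass individually but fail when run together.
// Bisect the batch to isolate a smaller subset that still reproduces the failure.
type Runner = (batch: string[]) => boolean; // true = the batch passes

function minimizeFailingBatch(tests: string[], passes: Runner): string[] {
  if (tests.length <= 1) return tests;
  const mid = Math.floor(tests.length / 2);
  const left = tests.slice(0, mid);
  const right = tests.slice(mid);
  if (!passes(left)) return minimizeFailingBatch(left, passes); // failure survives in left half
  if (!passes(right)) return minimizeFailingBatch(right, passes); // failure survives in right half
  // Both halves pass alone: the failure needs tests from both sides,
  // so this sketch stops here with the current combination.
  return tests;
}
```

Full ddmin additionally tries removing complements to shrink cross-half interactions; the bisection step alone already cuts most failing batches down to a handful of suspects.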
+ ### Bootstrap for External Repos
+
+ When nodebench-mcp is connected to a project that lacks parallel agent infrastructure, it can auto-detect gaps and scaffold everything needed:
+
+ ```
+ 1. bootstrap_parallel_agents({ projectRoot: "/path/to/their/repo", dryRun: true })
+ // Scans 7 categories: task coordination, roles, oracle, context budget,
+ // progress files, AGENTS.md parallel section, git worktrees
+
+ 2. bootstrap_parallel_agents({ projectRoot: "...", dryRun: false, techStack: "TypeScript/React" })
+ // Creates .parallel-agents/ dir, progress.md, roles.json, lock dirs, oracle dirs
+
+ 3. generate_parallel_agents_md({ techStack: "TypeScript/React", projectName: "their-project", maxAgents: 4 })
+ // Generates portable AGENTS.md section — paste into their repo
+
+ 4. Run the 6-step flywheel plan returned by the bootstrap tool to verify
+ 5. Fix any issues, re-verify
+ 6. record_learning({ key: "bootstrap_their_project", content: "...", category: "pattern" })
+ ```
+
+ The generated AGENTS.md section is framework-agnostic and works with any AI agent (Claude, GPT, etc.). It includes:
+ - Task locking protocol (file-based, no dependencies)
+ - Role definitions and assignment guide
+ - Oracle testing workflow with idiomatic examples
+ - Context budget rules
+ - Progress file protocol
+ - Anti-patterns to avoid
+ - Optional nodebench-mcp tool mapping table
+
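The oracle testing workflow in the list above boils down to a plain comparator whose mismatches become independent work items. A minimal sketch; the result shape is an assumption, not `run_oracle_comparison`'s actual schema:

```typescript
// Oracle testing: diff actual output against a known-good reference.
interface OracleResult {
  testLabel: string;
  match: boolean;
  firstDiffLine?: number; // 1-based line where the outputs diverge
}

function compareToOracle(
  testLabel: string,
  actualOutput: string,
  expectedOutput: string
): OracleResult {
  if (actualOutput === expectedOutput) return { testLabel, match: true };
  const actual = actualOutput.split("\n");
  const expected = expectedOutput.split("\n");
  const len = Math.max(actual.length, expected.length);
  for (let i = 0; i < len; i++) {
    if (actual[i] !== expected[i]) {
      return { testLabel, match: false, firstDiffLine: i + 1 };
    }
  }
  return { testLabel, match: false }; // defensive fallback; unreachable for \n-joined text
}
```

Reporting only the first diverging line keeps the output small, in line with the context-budget rule: an agent greps for the label and line number instead of reading both outputs in full.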
+ ### MCP Prompts for Parallel Agent Teams
+
+ - `parallel-agent-team` — Full team setup with role assignment and task breakdown
+ - `oracle-test-harness` — Oracle-based testing setup for a component
+ - `bootstrap-parallel-agents` — Detect and scaffold parallel agent infra for any external repo
+
+ **→ Quick Refs:** Full methodology: `getMethodology({ topic: "parallel_agent_teams" })` | Find parallel tools: `findTools({ category: "parallel_agents" })` | Bootstrap external repo: `bootstrap_parallel_agents({ projectRoot: "..." })` | See [AI Flywheel](#the-ai-flywheel-mandatory)
+
+ ---
+
  ## Auto-Update This File
 
  Agents can self-update this file: