nodebench-mcp 1.4.1 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (38)
  1. package/NODEBENCH_AGENTS.md +154 -2
  2. package/README.md +147 -228
  3. package/dist/__tests__/comparativeBench.test.d.ts +1 -0
  4. package/dist/__tests__/comparativeBench.test.js +722 -0
  5. package/dist/__tests__/comparativeBench.test.js.map +1 -0
  6. package/dist/__tests__/evalHarness.test.js +24 -2
  7. package/dist/__tests__/evalHarness.test.js.map +1 -1
  8. package/dist/__tests__/gaiaCapabilityEval.test.d.ts +14 -0
  9. package/dist/__tests__/gaiaCapabilityEval.test.js +420 -0
  10. package/dist/__tests__/gaiaCapabilityEval.test.js.map +1 -0
  11. package/dist/__tests__/gaiaCapabilityFilesEval.test.d.ts +15 -0
  12. package/dist/__tests__/gaiaCapabilityFilesEval.test.js +303 -0
  13. package/dist/__tests__/gaiaCapabilityFilesEval.test.js.map +1 -0
  14. package/dist/__tests__/openDatasetParallelEvalGaia.test.d.ts +7 -0
  15. package/dist/__tests__/openDatasetParallelEvalGaia.test.js +279 -0
  16. package/dist/__tests__/openDatasetParallelEvalGaia.test.js.map +1 -0
  17. package/dist/__tests__/openDatasetPerfComparison.test.d.ts +10 -0
  18. package/dist/__tests__/openDatasetPerfComparison.test.js +318 -0
  19. package/dist/__tests__/openDatasetPerfComparison.test.js.map +1 -0
  20. package/dist/__tests__/tools.test.js +155 -7
  21. package/dist/__tests__/tools.test.js.map +1 -1
  22. package/dist/db.js +56 -0
  23. package/dist/db.js.map +1 -1
  24. package/dist/index.js +370 -11
  25. package/dist/index.js.map +1 -1
  26. package/dist/tools/localFileTools.d.ts +15 -0
  27. package/dist/tools/localFileTools.js +386 -0
  28. package/dist/tools/localFileTools.js.map +1 -0
  29. package/dist/tools/metaTools.js +170 -3
  30. package/dist/tools/metaTools.js.map +1 -1
  31. package/dist/tools/parallelAgentTools.d.ts +18 -0
  32. package/dist/tools/parallelAgentTools.js +1272 -0
  33. package/dist/tools/parallelAgentTools.js.map +1 -0
  34. package/dist/tools/selfEvalTools.js +240 -10
  35. package/dist/tools/selfEvalTools.js.map +1 -1
  36. package/dist/tools/webTools.js +171 -37
  37. package/dist/tools/webTools.js.map +1 -1
  38. package/package.json +19 -7
@@ -117,7 +117,7 @@ Run ToolBench parallel subagent benchmark:
117
117
  NODEBENCH_TOOLBENCH_TASK_LIMIT=6 NODEBENCH_TOOLBENCH_CONCURRENCY=3 npm run mcp:dataset:toolbench:test
118
118
  ```
119
119
 
120
- Run all lanes:
120
+ Run all public lanes:
121
121
  ```bash
122
122
  npm run mcp:dataset:bench:all
123
123
  ```
@@ -137,11 +137,71 @@ Run SWE-bench parallel subagent benchmark:
137
137
  NODEBENCH_SWEBENCH_TASK_LIMIT=8 NODEBENCH_SWEBENCH_CONCURRENCY=4 npm run mcp:dataset:swebench:test
138
138
  ```
139
139
 
140
- Run all lanes:
140
+ Fourth lane (GAIA: gated, long-horizon, tool-augmented tasks):
141
+ - Dataset: `gaia-benchmark/GAIA` (gated)
142
+ - Default config: `2023_level3`
143
+ - Default split: `validation`
144
+ - Source: `https://huggingface.co/datasets/gaia-benchmark/GAIA`
145
+
146
+ Notes:
147
+ - Fixture is written to `.cache/gaia` (gitignored). Do not commit GAIA question/answer content.
148
+ - Refresh requires `HF_TOKEN` or `HUGGINGFACE_HUB_TOKEN` in your shell.
149
+ - Python deps: `pandas`, `huggingface_hub`, `pyarrow` (or equivalent parquet engine).
150
+
151
+ Refresh GAIA fixture:
152
+ ```bash
153
+ npm run mcp:dataset:gaia:refresh
154
+ ```
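
For orientation, a minimal sketch of what the refresh step does (illustrative only; the real generator is `packages/mcp-local/src/__tests__/fixtures/generateGaiaLevel3Fixture.py`, and the parquet path and column name below are assumptions about the gated dataset's layout):

```python
import os
import pandas as pd
from huggingface_hub import hf_hub_download

token = os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_HUB_TOKEN")

# Download one shard of the gated dataset (requires accepted access on the Hub).
# The exact file layout inside gaia-benchmark/GAIA is an assumption here.
parquet_path = hf_hub_download(
    repo_id="gaia-benchmark/GAIA",
    repo_type="dataset",
    filename="2023/validation/metadata.parquet",
    token=token,
)

df = pd.read_parquet(parquet_path)           # needs pyarrow or another parquet engine
level3 = df[df["Level"] == 3]                # "Level" column name is an assumption

os.makedirs(".cache/gaia", exist_ok=True)    # gitignored; never commit GAIA content
level3.to_json(".cache/gaia/gaia_2023_level3_validation.sample.json",
               orient="records", indent=2)
```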
155
+
156
+ Run GAIA parallel subagent benchmark:
157
+ ```bash
158
+ NODEBENCH_GAIA_TASK_LIMIT=8 NODEBENCH_GAIA_CONCURRENCY=4 npm run mcp:dataset:gaia:test
159
+ ```
160
+
161
+ GAIA capability benchmark (accuracy: LLM-only vs LLM+tools):
162
+ - This runs real model calls and web search. It is disabled by default and only intended for regression checks.
163
+ - Uses Gemini by default. Ensure `GEMINI_API_KEY` is available (repo `.env.local` is loaded by the test).
164
+ - Scoring fixture includes ground-truth answers and MUST remain under `.cache/gaia` (gitignored).
165
+
166
+ Generate scoring fixture (local only, gated):
167
+ ```bash
168
+ npm run mcp:dataset:gaia:capability:refresh
169
+ ```
170
+
171
+ Run capability benchmark:
172
+ ```bash
173
+ NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:test
174
+ ```
175
+
176
+ GAIA capability benchmark (file-backed lane: PDF / XLSX / CSV):
177
+ - This lane measures the impact of deterministic local parsing tools on GAIA tasks with attachments.
178
+ - Fixture includes ground-truth answers and MUST remain under `.cache/gaia` (gitignored).
179
+ - Attachments are copied into `.cache/gaia/data/<file_path>` for offline deterministic runs after the first download.
180
+
181
+ Generate file-backed scoring fixture + download attachments (local only, gated):
182
+ ```bash
183
+ npm run mcp:dataset:gaia:capability:files:refresh
184
+ ```
185
+
186
+ Run file-backed capability benchmark:
187
+ ```bash
188
+ NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT=6 NODEBENCH_GAIA_CAPABILITY_CONCURRENCY=1 npm run mcp:dataset:gaia:capability:files:test
189
+ ```
190
+
191
+ Modes:
192
+ - Recommended (more stable): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag`
193
+ - More realistic (higher variance): `NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent` (optional `NODEBENCH_GAIA_CAPABILITY_FORCE_WEB_SEARCH=1`)
194
+
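
For example, combining the mode variables with the file-backed command above (same env vars and npm scripts as documented):

```bash
# Recommended: stable RAG mode
NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=rag \
  npm run mcp:dataset:gaia:capability:files:test

# More realistic: agent mode with forced web search (higher variance)
NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE=agent \
NODEBENCH_GAIA_CAPABILITY_FORCE_WEB_SEARCH=1 \
  npm run mcp:dataset:gaia:capability:files:test
```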
195
+ Run all public lanes:
141
196
  ```bash
142
197
  npm run mcp:dataset:bench:all
143
198
  ```
144
199
 
200
+ Run full lane suite (includes GAIA):
201
+ ```bash
202
+ npm run mcp:dataset:bench:full
203
+ ```
204
+
145
205
  Implementation files:
146
206
  - `packages/mcp-local/src/__tests__/fixtures/generateBfclLongContextFixture.ts`
147
207
  - `packages/mcp-local/src/__tests__/fixtures/bfcl_v3_long_context.sample.json`
@@ -152,6 +212,16 @@ Implementation files:
152
212
  - `packages/mcp-local/src/__tests__/fixtures/generateSwebenchVerifiedFixture.ts`
153
213
  - `packages/mcp-local/src/__tests__/fixtures/swebench_verified.sample.json`
154
214
  - `packages/mcp-local/src/__tests__/openDatasetParallelEvalSwebench.test.ts`
215
+ - `packages/mcp-local/src/__tests__/fixtures/generateGaiaLevel3Fixture.py`
216
+ - `.cache/gaia/gaia_2023_level3_validation.sample.json`
217
+ - `packages/mcp-local/src/__tests__/openDatasetParallelEvalGaia.test.ts`
218
+ - `packages/mcp-local/src/__tests__/fixtures/generateGaiaCapabilityFixture.py`
219
+ - `.cache/gaia/gaia_capability_2023_all_validation.sample.json`
220
+ - `packages/mcp-local/src/__tests__/gaiaCapabilityEval.test.ts`
221
+ - `packages/mcp-local/src/__tests__/fixtures/generateGaiaCapabilityFilesFixture.py`
222
+ - `.cache/gaia/gaia_capability_files_2023_all_validation.sample.json`
223
+ - `.cache/gaia/data/...` (local GAIA attachments; do not commit)
224
+ - `packages/mcp-local/src/__tests__/gaiaCapabilityFilesEval.test.ts`
155
225
 
156
226
  Required tool chain per dataset task:
157
227
  - `run_recon`
@@ -176,6 +246,7 @@ Use `getMethodology("overview")` to see all available workflows.
176
246
  | Category | Tools | When to Use |
177
247
  |----------|-------|-------------|
178
248
  | **Web** | `web_search`, `fetch_url` | Research, reading docs, market validation |
249
+ | **Local Files** | `read_pdf_text`, `read_xlsx_file`, `read_csv_file` | Deterministic parsing of local attachments (GAIA file-backed lane) |
179
250
  | **GitHub** | `search_github`, `analyze_repo` | Finding libraries, studying implementations |
180
251
  | **Verification** | `start_cycle`, `log_phase`, `complete_cycle` | Tracking the flywheel process |
181
252
  | **Eval** | `start_eval_run`, `log_test_result` | Test case management |
@@ -184,6 +255,7 @@ Use `getMethodology("overview")` to see all available workflows.
184
255
  | **Vision** | `analyze_screenshot`, `capture_ui_screenshot` | UI/UX verification |
185
256
  | **Bootstrap** | `discover_infrastructure`, `triple_verify`, `self_implement` | Self-setup, triple verification |
186
257
  | **Autonomous** | `assess_risk`, `decide_re_update`, `run_self_maintenance` | Risk-aware execution, self-maintenance |
258
+ | **Parallel Agents** | `claim_agent_task`, `release_agent_task`, `list_agent_tasks`, `assign_agent_role`, `get_agent_role`, `log_context_budget`, `run_oracle_comparison`, `get_parallel_status` | Multi-agent coordination, task locking, role specialization, oracle testing |
187
259
  | **Meta** | `findTools`, `getMethodology` | Discover tools, get workflow guides |
188
260
 
189
261
  **→ Quick Refs:** Find tools by keyword: `findTools({ query: "verification" })` | Get workflow guide: `getMethodology({ topic: "..." })` | See [Methodology Topics](#methodology-topics) for all topics
@@ -538,11 +610,91 @@ Available via `getMethodology({ topic: "..." })`:
538
610
  | `agents_md_maintenance` | Keep docs in sync | [Auto-Update](#auto-update-this-file) |
539
611
  | `agent_bootstrap` | Self-discover, triple verify | [Self-Bootstrap](#agent-self-bootstrap-system) |
540
612
  | `autonomous_maintenance` | Risk-tiered execution | [Autonomous Maintenance](#autonomous-self-maintenance-system) |
613
+ | `parallel_agent_teams` | Multi-agent coordination, task locking, oracle testing | [Parallel Agent Teams](#parallel-agent-teams) |
614
+ | `self_reinforced_learning` | Trajectory analysis, self-eval, improvement recs | [Self-Reinforced Learning](#self-reinforced-learning-loop) |
541
615
 
542
616
  **→ Quick Refs:** Find tools: `findTools({ query: "..." })` | Get any methodology: `getMethodology({ topic: "..." })` | See [MCP Tool Categories](#mcp-tool-categories)
543
617
 
544
618
  ---
545
619
 
620
+ ## Parallel Agent Teams
621
+
622
+ Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www.anthropic.com/engineering/building-c-compiler) (Feb 2026).
623
+
624
+ Run multiple AI agents in parallel on a shared codebase with coordination via task locking, role specialization, context budget management, and oracle-based testing.
625
+
626
+ ### Quick Start — Parallel Agents
627
+
628
+ ```
629
+ 1. get_parallel_status({ includeHistory: true }) // Orient: what's happening?
630
+ 2. assign_agent_role({ role: "implementer" }) // Specialize
631
+ 3. claim_agent_task({ taskKey: "fix_auth" }) // Lock task
632
+ 4. ... do work ...
633
+ 5. log_context_budget({ eventType: "test_output", tokensUsed: 5000 }) // Track budget
634
+ 6. run_oracle_comparison({ testLabel: "auth_output", actualOutput: "...", expectedOutput: "...", oracleSource: "prod_v2" })
635
+ 7. release_agent_task({ taskKey: "fix_auth", status: "completed", progressNote: "Fixed JWT, added tests" })
636
+ ```
637
+
638
+ ### Predefined Agent Roles
639
+
640
+ | Role | Focus |
641
+ |------|-------|
642
+ | `implementer` | Primary feature work. Picks failing tests, implements fixes. |
643
+ | `dedup_reviewer` | Finds and coalesces duplicate implementations. |
644
+ | `performance_optimizer` | Profiles bottlenecks, optimizes hot paths. |
645
+ | `documentation_maintainer` | Keeps READMEs and progress files in sync. |
646
+ | `code_quality_critic` | Structural improvements, pattern enforcement. |
647
+ | `test_writer` | Writes targeted tests for edge cases and failure modes. |
648
+ | `security_auditor` | Audits for vulnerabilities, logs CRITICAL gaps. |
649
+
650
+ ### Key Patterns (from Anthropic blog)
651
+
652
+ - **Task Locking**: Claim before working. If two agents try the same task, the second picks a different one.
653
+ - **Context Window Budget**: Do NOT print thousands of useless bytes. Pre-compute summaries. Use `--fast` mode (a 1-10% random sample) for large test suites. Log errors with an ERROR prefix on the same line so they are easy to grep.
654
+ - **Oracle Testing**: Compare output against known-good reference. Each failing comparison is an independent work item for a parallel agent.
655
+ - **Time Blindness**: Agents can't tell time. Print progress infrequently. Use deterministic random sampling per-agent but randomized across VMs.
656
+ - **Progress Files**: Maintain running docs of status, failed approaches, and remaining tasks. Fresh agent sessions read these to orient.
657
+ - **Delta Debugging**: When tests pass individually but fail together, split the set in half to narrow down the minimal failing combination.
658
+
659
+ ### Bootstrap for External Repos
660
+
661
+ When nodebench-mcp is connected to a project that lacks parallel agent infrastructure, it can auto-detect gaps and scaffold everything needed:
662
+
663
+ ```
664
+ 1. bootstrap_parallel_agents({ projectRoot: "/path/to/their/repo", dryRun: true })
665
+ // Scans 7 categories: task coordination, roles, oracle, context budget,
666
+ // progress files, AGENTS.md parallel section, git worktrees
667
+
668
+ 2. bootstrap_parallel_agents({ projectRoot: "...", dryRun: false, techStack: "TypeScript/React" })
669
+ // Creates .parallel-agents/ dir, progress.md, roles.json, lock dirs, oracle dirs
670
+
671
+ 3. generate_parallel_agents_md({ techStack: "TypeScript/React", projectName: "their-project", maxAgents: 4 })
672
+ // Generates portable AGENTS.md section — paste into their repo
673
+
674
+ 4. Run the 6-step flywheel plan returned by the bootstrap tool to verify
675
+ 5. Fix any issues, re-verify
676
+ 6. record_learning({ key: "bootstrap_their_project", content: "...", category: "pattern" })
677
+ ```
678
+
679
+ The generated AGENTS.md section is framework-agnostic and works with any AI agent (Claude, GPT, etc.). It includes:
680
+ - Task locking protocol (file-based, no dependencies)
681
+ - Role definitions and assignment guide
682
+ - Oracle testing workflow with idiomatic examples
683
+ - Context budget rules
684
+ - Progress file protocol
685
+ - Anti-patterns to avoid
686
+ - Optional nodebench-mcp tool mapping table
687
+
688
+ ### MCP Prompts for Parallel Agent Teams
689
+
690
+ - `parallel-agent-team` — Full team setup with role assignment and task breakdown
691
+ - `oracle-test-harness` — Oracle-based testing setup for a component
692
+ - `bootstrap-parallel-agents` — Detect and scaffold parallel agent infra for any external repo
693
+
694
+ **→ Quick Refs:** Full methodology: `getMethodology({ topic: "parallel_agent_teams" })` | Find parallel tools: `findTools({ category: "parallel_agents" })` | Bootstrap external repo: `bootstrap_parallel_agents({ projectRoot: "..." })` | See [AI Flywheel](#the-ai-flywheel-mandatory)
695
+
696
+ ---
697
+
546
698
  ## Auto-Update This File
547
699
 
548
700
  Agents can self-update this file:
package/README.md CHANGED
@@ -1,200 +1,208 @@
1
- # NodeBench MCP Server
1
+ # NodeBench MCP
2
2
 
3
- A fully local, zero-config MCP server with **60 tools** for AI-powered development workflows.
3
+ **Make AI agents catch the bugs they normally ship.**
4
4
 
5
- **Features:**
6
- - Web search (Gemini/OpenAI/Perplexity)
7
- - GitHub repository discovery and analysis
8
- - Job market research
9
- - AGENTS.md self-maintenance
10
- - AI vision for screenshot analysis
11
- - 6-phase verification flywheel
12
- - Self-reinforced learning (trajectory analysis, health reports, improvement recommendations)
13
- - Autonomous agent bootstrap and self-maintenance
14
- - SQLite-backed learning database
15
-
16
- ## Quick Start (30 seconds)
17
-
18
- ### Option A: Claude Code CLI (recommended)
5
+ One command gives your agent structured research, risk assessment, 3-layer testing, quality gates, and a persistent knowledge base — so every fix is thorough and every insight compounds into future work.
19
6
 
20
7
  ```bash
21
8
  claude mcp add nodebench -- npx -y nodebench-mcp
22
9
  ```
23
10
 
24
- That's it. One command, 60 tools. No restart needed.
11
+ ---
25
12
 
26
- ### Option B: Manual config
13
+ ## What Bare Agents Miss
27
14
 
28
- Add to `~/.claude/settings.json` (global) or `.claude.json` (per-project):
15
+ We benchmarked 9 real production prompts — things like *"The LinkedIn posting pipeline is creating duplicate posts"* and *"The agent loop hits budget but still gets new events"* — comparing a bare agent vs one with NodeBench MCP.
29
16
 
30
- ```json
31
- {
32
- "mcpServers": {
33
- "nodebench": {
34
- "command": "npx",
35
- "args": ["-y", "nodebench-mcp"]
36
- }
37
- }
38
- }
39
- ```
17
+ | What gets measured | Bare Agent | With NodeBench MCP |
18
+ |---|---|---|
19
+ | Issues detected before deploy | 0 | **13** (4 high, 8 medium, 1 low) |
20
+ | Research findings before coding | 0 | **21** |
21
+ | Risk assessments | 0 | **9** |
22
+ | Test coverage layers | 1 | **3** (static + unit + integration) |
23
+ | Integration failures caught early | 0 | **4** |
24
+ | Regression eval cases created | 0 | **22** |
25
+ | Quality gate rules enforced | 0 | **52** |
26
+ | Deploys blocked by gate violations | 0 | **4** |
27
+ | Knowledge entries banked | 0 | **9** |
28
+ | Blind spots shipped to production | **26** | **0** |
29
+
30
+ The bare agent reads the code, implements a fix, runs tests once, and ships. The MCP agent researches first, assesses risk, tracks issues to resolution, runs 3-layer tests, creates regression guards, enforces quality gates, and banks everything as knowledge for next time.
31
+
32
+ Every additional tool call produces a concrete artifact — an issue found, a risk assessed, a regression guarded — that compounds across future tasks.
33
+
34
+ ---
35
+
36
+ ## How It Works — 3 Real Examples
37
+
38
+ ### Example 1: Bug fix
39
+
40
+ You type: *"The content queue has 40 items stuck in 'judging' status for 6 hours"*
41
+
42
+ **Bare agent:** Reads the queue code, finds a potential fix, runs tests, ships.
43
+
44
+ **With NodeBench MCP:** The agent runs structured recon and discovers 3 blind spots the bare agent misses:
45
+ - No retry backoff on OpenRouter rate limits (HIGH)
46
+ - JSON regex `match(/\{[\s\S]*\}/)` grabs last `}` — breaks on multi-object responses (MEDIUM)
47
+ - No timeout on LLM call — hung request blocks entire cron for 15+ min (not detected by unit tests)
48
+
49
+ All 3 are logged as gaps, resolved, regression-tested, and the patterns banked so the next similar bug is fixed faster.
50
+
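
To see why the second blind spot above matters, here is a hypothetical snippet (not code from the audited pipeline) showing how the greedy extraction fails:

```typescript
const reply = 'Sure, here are the results: {"id": 1} and also {"id": 2}';

// Greedy: matches from the FIRST "{" to the LAST "}", so a reply containing
// two objects yields '{"id": 1} and also {"id": 2}', which is not valid JSON.
const greedy = reply.match(/\{[\s\S]*\}/)?.[0];
// JSON.parse(greedy!)  -> throws SyntaxError

// Lazy: stops at the first "}", so it recovers '{"id": 1}' here, but it still
// breaks on nested objects; a robust fix needs a JSON-aware parse step.
const lazy = reply.match(/\{[\s\S]*?\}/)?.[0];
// JSON.parse(lazy!)    -> { id: 1 }
```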
51
+ ### Example 2: Parallel agents overwriting each other
40
52
 
41
- Then restart Claude Code.
53
+ You type: *"I launched 3 Claude Code subagents but they keep overwriting each other's changes"*
54
+
55
+ **Without NodeBench:** Two of the agents see the same bug and both implement a fix. The third re-investigates what agent 1 already solved. Agent 2 hits its context limit mid-fix and loses work.
56
+
57
+ **With NodeBench MCP:** Each subagent calls `claim_agent_task` to lock its work. Roles are assigned so they don't overlap. Context budget is tracked. Progress notes ensure handoff without starting from scratch.
58
+
59
+ ### Example 3: Knowledge compounding
60
+
61
+ Tasks 1-3 start with zero prior knowledge. By task 9, the agent finds 2+ relevant prior findings before writing a single line of code. Bare agents start from zero every time.
42
62
 
43
63
  ---
44
64
 
45
- ## Alternative: Build from source
65
+ ## Quick Start
66
+
67
+ ### Install (30 seconds)
46
68
 
47
69
  ```bash
48
- git clone https://github.com/nodebench/nodebench-ai.git
49
- cd nodebench-ai/packages/mcp-local
50
- npm install && npm run build
70
+ # Claude Code CLI (recommended)
71
+ claude mcp add nodebench -- npx -y nodebench-mcp
51
72
  ```
52
73
 
53
- Then use absolute path in settings:
74
+ Or add to `~/.claude/settings.json` or `.claude.json`:
54
75
 
55
76
  ```json
56
77
  {
57
78
  "mcpServers": {
58
79
  "nodebench": {
59
- "command": "node",
60
- "args": ["/path/to/packages/mcp-local/dist/index.js"]
80
+ "command": "npx",
81
+ "args": ["-y", "nodebench-mcp"]
61
82
  }
62
83
  }
63
84
  }
64
85
  ```
65
86
 
66
- ### 3. Add API keys (optional but recommended)
87
+ ### First prompts to try
67
88
 
68
- Add to your shell profile (`~/.bashrc`, `~/.zshrc`, or Windows Environment Variables):
89
+ ```
90
+ # See what's available
91
+ > Use getMethodology("overview") to see all workflows
69
92
 
70
- ```bash
71
- # Required for web search (pick one)
72
- export GEMINI_API_KEY="your-key" # Best: Google Search grounding
73
- export OPENAI_API_KEY="your-key" # Alternative: GPT-4o web search
74
- export PERPLEXITY_API_KEY="your-key" # Alternative: Perplexity
75
-
76
- # Required for GitHub (higher rate limits)
77
- export GITHUB_TOKEN="your-token" # github.com/settings/tokens
78
-
79
- # Required for vision analysis (pick one)
80
- export GEMINI_API_KEY="your-key" # Best: Gemini 2.5 Flash
81
- export OPENAI_API_KEY="your-key" # Alternative: GPT-4o
82
- export ANTHROPIC_API_KEY="your-key" # Alternative: Claude
93
+ # Before your next task — search for prior knowledge
94
+ > Use search_all_knowledge("what I'm about to work on")
95
+
96
+ # Run the full verification pipeline on a change
97
+ > Use getMethodology("mandatory_flywheel") and follow the 6 steps
83
98
  ```
84
99
 
85
- ### 4. Restart Claude Code
100
+ ### Optional: API keys for web search and vision
86
101
 
87
102
  ```bash
88
- # Quit and reopen Claude Code, or run:
89
- claude --mcp-debug
103
+ export GEMINI_API_KEY="your-key" # Web search + vision (recommended)
104
+ export GITHUB_TOKEN="your-token" # GitHub (higher rate limits)
90
105
  ```
91
106
 
92
- ### 5. Test it works
107
+ ---
93
108
 
94
- In Claude Code, try these prompts:
109
+ ## What You Get
110
+
111
+ ### Core workflow (use these every session)
112
+
113
+ | When you... | Use this | Impact |
114
+ |---|---|---|
115
+ | Start any task | `search_all_knowledge` | Find prior findings — avoid repeating past mistakes |
116
+ | Research before coding | `run_recon` + `log_recon_finding` | Structured research with surfaced findings |
117
+ | Assess risk before acting | `assess_risk` | Risk tier determines if action needs confirmation |
118
+ | Track implementation | `start_verification_cycle` + `log_gap` | Issues logged with severity, tracked to resolution |
119
+ | Test thoroughly | `log_test_result` (3 layers) | Static + unit + integration vs running tests once |
120
+ | Guard against regression | `start_eval_run` + `record_eval_result` | Eval cases that protect this fix in the future |
121
+ | Gate before deploy | `run_quality_gate` | Boolean rules enforced — violations block deploy |
122
+ | Bank knowledge | `record_learning` | Persisted findings compound across future sessions |
123
+ | Verify completeness | `run_mandatory_flywheel` | 6-step minimum — catches dead code and intent mismatches |
124
+
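
A rough sketch of how these calls chain in one session (tool names from the table above; the argument shapes are illustrative, not the exact schemas):

```
1. search_all_knowledge("duplicate LinkedIn posts")       // any prior findings?
2. run_recon({ ... }) + log_recon_finding({ ... })        // research before coding
3. assess_risk({ ... })                                   // does this action need confirmation?
4. start_verification_cycle({ ... }) + log_gap({ ... })   // track issues to resolution
5. log_test_result({ ... })                               // static, unit, integration
6. start_eval_run({ ... }) + record_eval_result({ ... })  // regression guard for this fix
7. run_quality_gate({ ... })                              // violations block the deploy
8. record_learning({ ... })                               // bank it for the next session
```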
125
+ ### When running parallel agents (Claude Code subagents, worktrees)
126
+
127
+ | When you... | Use this | Impact |
128
+ |---|---|---|
129
+ | Prevent duplicate work | `claim_agent_task` / `release_agent_task` | Task locks — each task owned by exactly one agent |
130
+ | Specialize agents | `assign_agent_role` | 7 roles: implementer, test_writer, critic, etc. |
131
+ | Track context usage | `log_context_budget` | Prevents context exhaustion mid-fix |
132
+ | Validate against reference | `run_oracle_comparison` | Compare output against known-good oracle |
133
+ | Orient new sessions | `get_parallel_status` | See what all agents are doing and what's blocked |
134
+ | Bootstrap any repo | `bootstrap_parallel_agents` | Auto-detect gaps, scaffold coordination infra |
135
+
136
+ ### Research and discovery
137
+
138
+ | When you... | Use this | Impact |
139
+ |---|---|---|
140
+ | Search the web | `web_search` | Gemini/OpenAI/Perplexity — latest docs and updates |
141
+ | Fetch a URL | `fetch_url` | Read any page as clean markdown |
142
+ | Find GitHub repos | `search_github` + `analyze_repo` | Discover and evaluate libraries and patterns |
143
+ | Analyze screenshots | `analyze_screenshot` | AI vision (Gemini/GPT-4o/Claude) for UI QA |
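
For example, a typical discovery chain (argument shapes are illustrative):

```
> Use search_github({ query: "mcp server", language: "typescript", minStars: 100 })
> Use analyze_repo({ repoUrl: "owner/repo" }) to see the tech stack and patterns
> Use fetch_url to read the project's documentation
```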
95
144
 
96
- ```
97
- # Check your environment
98
- > Use setup_local_env to check my development environment
145
+ ---
99
146
 
100
- # Search GitHub
101
- > Use search_github to find TypeScript MCP servers with at least 100 stars
147
+ ## The Methodology Pipeline
102
148
 
103
- # Fetch documentation
104
- > Use fetch_url to read https://modelcontextprotocol.io/introduction
149
+ NodeBench MCP isn't just a bag of tools — it's a pipeline. Each step feeds the next:
105
150
 
106
- # Get methodology
107
- > Use getMethodology("overview") to see all available workflows
151
+ ```
152
+ Research → Risk → Implement → Test (3 layers) → Eval → Gate → Learn → Ship
+     ↑                                                                 │
+     └───────────────────── knowledge compounds ───────────────────────┘
108
155
  ```
109
156
 
110
- ---
157
+ **Inner loop** (per change): 6-phase verification ensures correctness.
158
+ **Outer loop** (over time): Eval-driven development ensures improvement.
159
+ **Together**: The AI Flywheel — every verification produces eval artifacts, every regression triggers verification.
111
160
 
112
- ## Tool Categories
113
-
114
- | Category | Tools | Description |
115
- |----------|-------|-------------|
116
- | **Web** | `web_search`, `fetch_url` | Search the web, fetch URLs as markdown |
117
- | **GitHub** | `search_github`, `analyze_repo` | Find repos, analyze tech stacks |
118
- | **Documentation** | `update_agents_md`, `research_job_market`, `setup_local_env` | Self-maintaining docs, job research |
119
- | **Vision** | `discover_vision_env`, `analyze_screenshot`, `manipulate_screenshot` | AI-powered image analysis |
120
- | **UI Capture** | `capture_ui_screenshot`, `capture_responsive_suite` | Browser screenshots (requires Playwright) |
121
- | **Verification** | `start_cycle`, `log_phase`, `complete_cycle` | 6-phase dev workflow |
122
- | **Eval** | `start_eval_run`, `log_test_result`, `list_eval_runs` | Test case tracking |
123
- | **Quality Gates** | `run_quality_gate`, `get_gate_history` | Pass/fail checkpoints |
124
- | **Learning** | `record_learning`, `search_learnings`, `search_all_knowledge` | Persistent knowledge base |
125
- | **Flywheel** | `run_closed_loop`, `check_framework_updates` | Automated workflows |
126
- | **Recon** | `run_recon`, `log_recon_finding`, `log_gap` | Discovery and gap tracking |
127
- | **Agent Bootstrap** | `bootstrap_project`, `setup_local_env`, `triple_verify`, `self_implement` | Self-discover infrastructure, auto-configure |
128
- | **Autonomous** | `assess_risk`, `decide_re_update`, `run_self_maintenance`, `run_autonomous_loop` | Risk-tiered autonomous execution |
129
- | **Self-Eval** | `log_tool_call`, `get_trajectory_analysis`, `get_self_eval_report`, `get_improvement_recommendations` | Self-reinforced learning loop |
130
- | **Meta** | `findTools`, `getMethodology` | Tool discovery, methodology guides |
161
+ Ask the agent: `Use getMethodology("overview")` to see all 18 methodology topics.
131
162
 
132
163
  ---
133
164
 
134
- ## Methodology Topics (17 total)
135
-
136
- Ask Claude: `Use getMethodology("topic_name")`
137
-
138
- - `overview` — See all methodologies
139
- - `verification` — 6-phase development cycle
140
- - `eval` — Test case management
141
- - `flywheel` — Continuous improvement loop
142
- - `mandatory_flywheel` — Required verification for changes
143
- - `reconnaissance` — Codebase discovery
144
- - `quality_gates` — Pass/fail checkpoints
145
- - `ui_ux_qa` — Frontend verification
146
- - `agentic_vision` — AI-powered visual QA
147
- - `closed_loop` — Build/test before presenting
148
- - `learnings` — Knowledge persistence
149
- - `project_ideation` — Validate ideas before building
150
- - `tech_stack_2026` — Dependency management
151
- - `telemetry_setup` — Observability setup
152
- - `agents_md_maintenance` — Keep docs in sync
153
- - `agent_bootstrap` — Self-discover and auto-configure infrastructure
154
- - `autonomous_maintenance` — Risk-tiered autonomous execution
155
- - `self_reinforced_learning` — Trajectory analysis and improvement loop
165
+ ## Parallel Agents with Claude Code
156
166
 
157
- ---
167
+ Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www.anthropic.com/engineering/building-c-compiler) (Feb 2026).
158
168
 
159
- ## Self-Reinforced Learning (v1.4.0)
169
+ **When to use:** Only when running 2+ agent sessions. Single-agent workflows use the standard pipeline above.
160
170
 
161
- The MCP learns from its own usage. As you develop with the tools, the system accumulates trajectory data and surfaces recommendations.
171
+ **How it works with Claude Code's Task tool:**
162
172
 
163
- ```
164
- Use → Log → Analyze → Recommend → Apply → Re-analyze
165
- ```
173
+ 1. **COORDINATOR** (your main session) breaks work into independent tasks
174
+ 2. Each **Task tool** call spawns a subagent with instructions to:
175
+ - `claim_agent_task` — lock the task
176
+ - `assign_agent_role` — specialize (implementer, test_writer, critic, etc.)
177
+ - Do the work
178
+ - `release_agent_task` — handoff with progress note
179
+ 3. Coordinator calls `get_parallel_status` to monitor all subagents
180
+ 4. Coordinator runs `run_quality_gate` on the aggregate result
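
A sketch of that flow as tool calls (values are illustrative; the subagent instructions go inside each Task prompt):

```
# COORDINATOR (main session)
1. get_parallel_status({ includeHistory: true })     // orient
2. Task → "claim_agent_task({ taskKey: 'fix_dedup' }), assign_agent_role({ role: 'implementer' }),
           do the work, release_agent_task({ taskKey: 'fix_dedup', status: 'completed', progressNote: '...' })"
3. Task → "claim_agent_task({ taskKey: 'add_tests' }), assign_agent_role({ role: 'test_writer' }), ..."
4. get_parallel_status({})                            // monitor all subagents
5. run_quality_gate({ ... })                          // gate the aggregate result
```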
166
181
 
167
- **Try it:**
168
- ```
169
- > Use getMethodology("self_reinforced_learning") for the 5-step guide
170
- > Use get_self_eval_report to see your project's health score
171
- > Use get_improvement_recommendations to find actionable improvements
172
- > Use get_trajectory_analysis to see your tool usage patterns
173
- ```
174
-
175
- The health score is a weighted composite:
176
- - Cycle completion (25%) — Are verification cycles being completed?
177
- - Eval pass rate (25%) — Are eval runs succeeding?
178
- - Gap resolution (20%) — Are logged gaps getting resolved?
179
- - Gate pass rate (15%) — Are quality gates passing?
180
- - Tool error rate (15%) — Are tools running without errors?
182
+ **MCP Prompts available:**
183
+ - `claude-code-parallel` — Step-by-step Claude Code subagent coordination
184
+ - `parallel-agent-team` Full team setup with role assignment
185
+ - `oracle-test-harness` Validate outputs against known-good reference
186
+ - `bootstrap-parallel-agents` Scaffold parallel infra for any repo
181
187
 
182
188
  ---
183
189
 
184
- ## VSCode Extension Setup
190
+ ## Build from Source
185
191
 
186
- If using the Claude Code VSCode extension:
192
+ ```bash
193
+ git clone https://github.com/nodebench/nodebench-ai.git
194
+ cd nodebench-ai/packages/mcp-local
195
+ npm install && npm run build
196
+ ```
187
197
 
188
- 1. Open VSCode Settings (Ctrl/Cmd + ,)
189
- 2. Search for "Claude Code MCP"
190
- 3. Add server configuration:
198
+ Then use the absolute path in your MCP config:
191
199
 
192
200
  ```json
193
201
  {
194
- "claude-code.mcpServers": {
202
+ "mcpServers": {
195
203
  "nodebench": {
196
204
  "command": "node",
197
- "args": ["/absolute/path/to/packages/mcp-local/dist/index.js"]
205
+ "args": ["/path/to/packages/mcp-local/dist/index.js"]
198
206
  }
199
207
  }
200
208
  }
@@ -202,104 +210,15 @@ If using the Claude Code VSCode extension:
202
210
 
203
211
  ---
204
212
 
205
- ## Optional Dependencies
206
-
207
- Install for additional features:
208
-
209
- ```bash
210
- # Screenshot capture (headless browser)
211
- npm install playwright
212
- npx playwright install chromium
213
-
214
- # Image manipulation
215
- npm install sharp
216
-
217
- # HTML parsing (already included)
218
- npm install cheerio
219
-
220
- # AI providers (pick your preferred)
221
- npm install @google/genai # Gemini
222
- npm install openai # OpenAI
223
- npm install @anthropic-ai/sdk # Anthropic
224
- ```
225
-
226
- ---
227
-
228
213
  ## Troubleshooting
229
214
 
230
- **"No search provider available"**
231
- - Set at least one API key: `GEMINI_API_KEY`, `OPENAI_API_KEY`, or `PERPLEXITY_API_KEY`
232
-
233
- **"GitHub API error 403"**
234
- - Set `GITHUB_TOKEN` for higher rate limits (60/hour without, 5000/hour with)
215
+ **"No search provider available"** — Set `GEMINI_API_KEY`, `OPENAI_API_KEY`, or `PERPLEXITY_API_KEY`
235
216
 
236
- **"Cannot find module"**
237
- - Run `npm run build` in the mcp-local directory
238
-
239
- **MCP not connecting**
240
- - Check path is absolute in settings.json
241
- - Run `claude --mcp-debug` to see connection errors
242
- - Ensure Node.js >= 18
243
-
244
- ---
217
+ **"GitHub API error 403"** — Set `GITHUB_TOKEN` for higher rate limits
245
218
 
246
- ## Example Workflows
247
-
248
- ### Research a new project idea
249
-
250
- ```
251
- 1. Use getMethodology("project_ideation") for the 6-step process
252
- 2. Use web_search to validate market demand
253
- 3. Use search_github to find similar projects
254
- 4. Use analyze_repo to study competitor implementations
255
- 5. Use research_job_market to understand skill demand
256
- ```
257
-
258
- ### Analyze a GitHub repo before using it
259
-
260
- ```
261
- 1. Use search_github({ query: "mcp server", language: "typescript", minStars: 100 })
262
- 2. Use analyze_repo({ repoUrl: "owner/repo" }) to see tech stack and patterns
263
- 3. Use fetch_url to read their documentation
264
- ```
265
-
266
- ### Set up a new development environment
267
-
268
- ```
269
- 1. Use setup_local_env to scan current environment
270
- 2. Follow the recommendations to install missing SDKs
271
- 3. Use getMethodology("tech_stack_2026") for ongoing maintenance
272
- ```
273
-
274
- ---
275
-
276
- ## Agent Protocol (NODEBENCH_AGENTS.md)
277
-
278
- The package includes `NODEBENCH_AGENTS.md` — a portable agent operating procedure that any AI agent can use to self-configure.
279
-
280
- **What it provides:**
281
- - The 6-step AI Flywheel verification process (mandatory for all changes)
282
- - MCP tool usage patterns and workflows
283
- - Quality gate definitions
284
- - Post-implementation checklists
285
- - Self-update instructions
286
-
287
- **To use in your project:**
288
-
289
- 1. Copy `NODEBENCH_AGENTS.md` to your repo root
290
- 2. Agents will auto-discover and follow the protocol
291
- 3. Use `update_agents_md` tool to keep it in sync
292
-
293
- Or fetch it directly:
294
-
295
- ```bash
296
- curl -o AGENTS.md https://raw.githubusercontent.com/nodebench/nodebench-ai/main/packages/mcp-local/NODEBENCH_AGENTS.md
297
- ```
219
+ **"Cannot find module"** — Run `npm run build` in the mcp-local directory
298
220
 
299
- The file is designed to be:
300
- - **Portable** — Works in any repo, any language
301
- - **Self-updating** — Agents can modify it via MCP tools
302
- - **Composable** — Add your own sections alongside the standard protocol
221
+ **MCP not connecting** — Check path is absolute, run `claude --mcp-debug`, ensure Node.js >= 18
303
222
 
304
223
  ---
305
224
 
@@ -0,0 +1 @@
1
+ export {};