nodebench-mcp 1.2.0 → 1.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (48)
  1. package/NODEBENCH_AGENTS.md +253 -20
  2. package/STYLE_GUIDE.md +477 -0
  3. package/dist/__tests__/evalDatasetBench.test.d.ts +1 -0
  4. package/dist/__tests__/evalDatasetBench.test.js +738 -0
  5. package/dist/__tests__/evalDatasetBench.test.js.map +1 -0
  6. package/dist/__tests__/evalHarness.test.d.ts +1 -0
  7. package/dist/__tests__/evalHarness.test.js +830 -0
  8. package/dist/__tests__/evalHarness.test.js.map +1 -0
  9. package/dist/__tests__/fixtures/bfcl_v3_long_context.sample.json +264 -0
  10. package/dist/__tests__/fixtures/generateBfclLongContextFixture.d.ts +10 -0
  11. package/dist/__tests__/fixtures/generateBfclLongContextFixture.js +135 -0
  12. package/dist/__tests__/fixtures/generateBfclLongContextFixture.js.map +1 -0
  13. package/dist/__tests__/fixtures/generateSwebenchVerifiedFixture.d.ts +14 -0
  14. package/dist/__tests__/fixtures/generateSwebenchVerifiedFixture.js +189 -0
  15. package/dist/__tests__/fixtures/generateSwebenchVerifiedFixture.js.map +1 -0
  16. package/dist/__tests__/fixtures/generateToolbenchInstructionFixture.d.ts +16 -0
  17. package/dist/__tests__/fixtures/generateToolbenchInstructionFixture.js +154 -0
  18. package/dist/__tests__/fixtures/generateToolbenchInstructionFixture.js.map +1 -0
  19. package/dist/__tests__/fixtures/swebench_verified.sample.json +162 -0
  20. package/dist/__tests__/fixtures/toolbench_instruction.sample.json +109 -0
  21. package/dist/__tests__/openDatasetParallelEval.test.d.ts +7 -0
  22. package/dist/__tests__/openDatasetParallelEval.test.js +209 -0
  23. package/dist/__tests__/openDatasetParallelEval.test.js.map +1 -0
  24. package/dist/__tests__/openDatasetParallelEvalSwebench.test.d.ts +7 -0
  25. package/dist/__tests__/openDatasetParallelEvalSwebench.test.js +220 -0
  26. package/dist/__tests__/openDatasetParallelEvalSwebench.test.js.map +1 -0
  27. package/dist/__tests__/openDatasetParallelEvalToolbench.test.d.ts +7 -0
  28. package/dist/__tests__/openDatasetParallelEvalToolbench.test.js +218 -0
  29. package/dist/__tests__/openDatasetParallelEvalToolbench.test.js.map +1 -0
  30. package/dist/__tests__/tools.test.js +252 -3
  31. package/dist/__tests__/tools.test.js.map +1 -1
  32. package/dist/db.js +20 -0
  33. package/dist/db.js.map +1 -1
  34. package/dist/index.js +2 -0
  35. package/dist/index.js.map +1 -1
  36. package/dist/tools/agentBootstrapTools.d.ts +5 -1
  37. package/dist/tools/agentBootstrapTools.js +566 -1
  38. package/dist/tools/agentBootstrapTools.js.map +1 -1
  39. package/dist/tools/documentationTools.js +102 -8
  40. package/dist/tools/documentationTools.js.map +1 -1
  41. package/dist/tools/learningTools.js +6 -2
  42. package/dist/tools/learningTools.js.map +1 -1
  43. package/dist/tools/metaTools.js +112 -1
  44. package/dist/tools/metaTools.js.map +1 -1
  45. package/dist/tools/selfEvalTools.d.ts +12 -0
  46. package/dist/tools/selfEvalTools.js +568 -0
  47. package/dist/tools/selfEvalTools.js.map +1 -0
  48. package/package.json +11 -3
@@ -21,7 +21,9 @@ Add to `~/.claude/settings.json`:
  }
  ```
 
- Restart Claude Code. 51 tools available immediately.
+ Restart Claude Code. 56 tools available immediately.
+
+ **→ Quick Refs:** After setup, run `getMethodology("overview")` | First task? See [Verification Cycle](#verification-cycle-workflow) | New to codebase? See [Environment Setup](#environment-setup)
 
  ---
 
@@ -51,10 +53,119 @@ Review the code for:
  ### Step 5: Fix and Re-Verify
  If any gap found: fix it, then restart from Step 1.
 
- ### Step 6: Document Learnings
+ ### Step 6: Live E2E Test (MANDATORY)
+ **Before declaring done or publishing:**
+ ```bash
+ echo '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"YOUR_TOOL","arguments":{...}}}' | node dist/index.js
+ ```
+ Every new/modified tool MUST pass the stdio E2E test. No exceptions.
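The same check can be scripted. A minimal TypeScript sketch, assuming the server at `dist/index.js` speaks newline-delimited JSON-RPC over stdio as the `echo` pipe above implies; `my_tool` and the empty arguments are placeholders, and a full MCP client would send an `initialize` request first:

```typescript
// Hedged sketch of the stdio E2E smoke test described above.
import { spawn } from "node:child_process";

function callTool(name: string, args: Record<string, unknown>): Promise<string> {
  return new Promise((resolve, reject) => {
    const server = spawn("node", ["dist/index.js"]);
    let output = "";
    server.stdout.on("data", (chunk: Buffer) => { output += chunk.toString(); });
    server.on("close", () => (output ? resolve(output) : reject(new Error("no response from server"))));
    const msg = { jsonrpc: "2.0", id: 1, method: "tools/call", params: { name, arguments: args } };
    server.stdin.write(JSON.stringify(msg) + "\n");
    server.stdin.end();
  });
}

// "my_tool" is a placeholder for the tool you just added or modified.
callTool("my_tool", {}).then((raw) => {
  if (!raw.includes('"result"')) throw new Error(`E2E failed: ${raw}`);
  console.log("stdio E2E ok");
});
```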
+
+ For workflow-level changes (verification, eval, recon, quality gates, flywheel, or knowledge tools), also run the long-running open-source benchmarks:
+ ```bash
+ npm --prefix packages/mcp-local run dataset:bfcl:refresh
+ NODEBENCH_OPEN_DATASET_TASK_LIMIT=12 NODEBENCH_OPEN_DATASET_CONCURRENCY=6 npm --prefix packages/mcp-local run test:open-dataset
+ npm --prefix packages/mcp-local run dataset:toolbench:refresh
+ NODEBENCH_TOOLBENCH_TASK_LIMIT=6 NODEBENCH_TOOLBENCH_CONCURRENCY=3 npm --prefix packages/mcp-local run test:open-dataset:toolbench
+ npm --prefix packages/mcp-local run dataset:swebench:refresh
+ NODEBENCH_SWEBENCH_TASK_LIMIT=8 NODEBENCH_SWEBENCH_CONCURRENCY=4 npm --prefix packages/mcp-local run test:open-dataset:swebench
+ ```
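The `*_TASK_LIMIT` and `*_CONCURRENCY` variables suggest the runners cap how many dataset tasks execute and how many run at once. A sketch of how such knobs are commonly consumed; the pool helper is illustrative, not the package's actual runner:

```typescript
// Illustrative consumption of the env knobs set on the commands above.
const taskLimit = Number(process.env.NODEBENCH_OPEN_DATASET_TASK_LIMIT ?? 12);
const concurrency = Number(process.env.NODEBENCH_OPEN_DATASET_CONCURRENCY ?? 6);

// Simple promise pool: run at most `limit` tasks at a time.
async function runPool<T>(tasks: Array<() => Promise<T>>, limit: number): Promise<T[]> {
  const results: T[] = [];
  let next = 0;
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, worker));
  return results;
}

// e.g. await runPool(allTasks.slice(0, taskLimit), concurrency);
```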
+
+ ### Step 7: Document Learnings
  Record edge cases discovered. Update this file if needed.
 
- **Rule: No change ships without passing all 6 steps.**
+ **Rule: No change ships without passing all 7 steps.**
+
+ **→ Quick Refs:** Track progress with `start_verification_cycle` | Record findings with `record_learning` | Run gate with `run_quality_gate` | See [Post-Implementation Checklist](#post-implementation-checklist)
+
+ ---
+
+ ## Open-Source Long-Running MCP Benchmark
+
+ Use open-source long-context tasks to validate real orchestration behavior under parallel load.
+
+ - Dataset: `gorilla-llm/Berkeley-Function-Calling-Leaderboard`
+ - Split: `BFCL_v3_multi_turn_long_context`
+ - Source: `https://huggingface.co/datasets/gorilla-llm/Berkeley-Function-Calling-Leaderboard`
+
+ Refresh local fixture:
+ ```bash
+ npm run mcp:dataset:refresh
+ ```
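`npm run mcp:dataset:refresh` delegates to `generateBfclLongContextFixture.ts` (listed under implementation files below). The script itself is not shown here, but one plausible shape, assuming the dataset is reachable through Hugging Face's public `datasets-server` rows API; the `config` value is an assumption:

```typescript
// Hypothetical fixture refresh: pull a handful of rows and write the sample file.
import { writeFileSync } from "node:fs";

async function refreshFixture(): Promise<void> {
  const url = new URL("https://datasets-server.huggingface.co/rows");
  url.searchParams.set("dataset", "gorilla-llm/Berkeley-Function-Calling-Leaderboard");
  url.searchParams.set("config", "default"); // assumption
  url.searchParams.set("split", "BFCL_v3_multi_turn_long_context");
  url.searchParams.set("offset", "0");
  url.searchParams.set("length", "12");
  const res = await fetch(url);
  if (!res.ok) throw new Error(`fixture refresh failed: HTTP ${res.status}`);
  const body = (await res.json()) as { rows: Array<{ row: unknown }> };
  writeFileSync("bfcl_v3_long_context.sample.json", JSON.stringify(body.rows.map((r) => r.row), null, 2));
}

refreshFixture().catch((err) => { console.error(err); process.exit(1); });
```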
+
+ Run parallel subagent benchmark:
+ ```bash
+ NODEBENCH_OPEN_DATASET_TASK_LIMIT=12 NODEBENCH_OPEN_DATASET_CONCURRENCY=6 npm run mcp:dataset:test
+ ```
+
+ Run refresh + benchmark in one shot:
+ ```bash
+ npm run mcp:dataset:bench
+ ```
+
+ Second lane (ToolBench multi-tool instructions):
+ - Dataset: `OpenBMB/ToolBench`
+ - Split: `data_example/instruction (G1,G2,G3)`
+ - Source: `https://github.com/OpenBMB/ToolBench`
+
+ Refresh ToolBench fixture:
+ ```bash
+ npm run mcp:dataset:toolbench:refresh
+ ```
+
+ Run ToolBench parallel subagent benchmark:
+ ```bash
+ NODEBENCH_TOOLBENCH_TASK_LIMIT=6 NODEBENCH_TOOLBENCH_CONCURRENCY=3 npm run mcp:dataset:toolbench:test
+ ```
+
+ Third lane (SWE-bench Verified long-horizon software tasks):
+ - Dataset: `princeton-nlp/SWE-bench_Verified`
+ - Split: `test`
+ - Source: `https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified`
+
+ Refresh SWE-bench fixture:
+ ```bash
+ npm run mcp:dataset:swebench:refresh
+ ```
+
+ Run SWE-bench parallel subagent benchmark:
+ ```bash
+ NODEBENCH_SWEBENCH_TASK_LIMIT=8 NODEBENCH_SWEBENCH_CONCURRENCY=4 npm run mcp:dataset:swebench:test
+ ```
+
+ Run all lanes:
+ ```bash
+ npm run mcp:dataset:bench:all
+ ```
+
+ Implementation files:
+ - `packages/mcp-local/src/__tests__/fixtures/generateBfclLongContextFixture.ts`
+ - `packages/mcp-local/src/__tests__/fixtures/bfcl_v3_long_context.sample.json`
+ - `packages/mcp-local/src/__tests__/openDatasetParallelEval.test.ts`
+ - `packages/mcp-local/src/__tests__/fixtures/generateToolbenchInstructionFixture.ts`
+ - `packages/mcp-local/src/__tests__/fixtures/toolbench_instruction.sample.json`
+ - `packages/mcp-local/src/__tests__/openDatasetParallelEvalToolbench.test.ts`
+ - `packages/mcp-local/src/__tests__/fixtures/generateSwebenchVerifiedFixture.ts`
+ - `packages/mcp-local/src/__tests__/fixtures/swebench_verified.sample.json`
+ - `packages/mcp-local/src/__tests__/openDatasetParallelEvalSwebench.test.ts`
+
+ Required tool chain per dataset task (checked in the sketch below):
+ - `run_recon`
+ - `log_recon_finding`
+ - `findTools`
+ - `getMethodology`
+ - `start_eval_run`
+ - `record_eval_result`
+ - `complete_eval_run`
+ - `run_closed_loop`
+ - `run_mandatory_flywheel`
+ - `search_all_knowledge`
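A TypeScript sketch of the coverage check this chain implies; the transcript shape is an assumption, but a task should pass only when every tool in the chain was actually called:

```typescript
// Sketch: every benchmark task's tool-call transcript must cover the chain.
const REQUIRED_CHAIN = [
  "run_recon", "log_recon_finding", "findTools", "getMethodology",
  "start_eval_run", "record_eval_result", "complete_eval_run",
  "run_closed_loop", "run_mandatory_flywheel", "search_all_knowledge",
] as const;

function missingTools(transcript: Array<{ tool: string }>): string[] {
  const called = new Set(transcript.map((c) => c.tool));
  return REQUIRED_CHAIN.filter((t) => !called.has(t));
}

// A task passes the chain check only when missingTools(calls).length === 0.
```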
+
+ **→ Quick Refs:** Core process in [AI Flywheel](#the-ai-flywheel-mandatory) | Verification flow in [Verification Cycle](#verification-cycle-workflow) | Loop discipline in [Closed Loop Principle](#closed-loop-principle)
 
  ---
 
@@ -72,8 +183,11 @@ Use `getMethodology("overview")` to see all available workflows.
  | **Learning** | `record_learning`, `search_all_knowledge` | Persistent knowledge base |
  | **Vision** | `analyze_screenshot`, `capture_ui_screenshot` | UI/UX verification |
  | **Bootstrap** | `discover_infrastructure`, `triple_verify`, `self_implement` | Self-setup, triple verification |
+ | **Autonomous** | `assess_risk`, `decide_re_update`, `run_self_maintenance` | Risk-aware execution, self-maintenance |
  | **Meta** | `findTools`, `getMethodology` | Discover tools, get workflow guides |
 
+ **→ Quick Refs:** Find tools by keyword: `findTools({ query: "verification" })` | Get workflow guide: `getMethodology({ topic: "..." })` | See [Methodology Topics](#methodology-topics) for all topics
+
 
  ---
 
  ## Verification Cycle Workflow
@@ -93,6 +207,8 @@ If blocked or failed:
  abandon_cycle({ reason: "Blocked by external dependency" })
  ```
 
+ **→ Quick Refs:** Before starting: `search_all_knowledge({ query: "your task" })` | After completing: `record_learning({ ... })` | Run flywheel: See [AI Flywheel](#the-ai-flywheel-mandatory) | Track quality: See [Quality Gates](#quality-gates)
+
 
  ---
 
  ## Recording Learnings
@@ -113,6 +229,8 @@ Search later with:
  search_all_knowledge({ query: "convex index" })
  ```
 
+ **→ Quick Refs:** Search before implementing: `search_all_knowledge` | `search_learnings` and `list_learnings` are DEPRECATED | Part of flywheel Step 7 | See [Verification Cycle](#verification-cycle-workflow)
+
 
  ---
 
  ## Quality Gates
@@ -132,6 +250,8 @@ run_quality_gate({
 
  Gate history tracks pass/fail over time.
 
+ **→ Quick Refs:** Get preset rules: `get_gate_preset({ preset: "ui_ux_qa" })` | View history: `get_gate_history({ gateName: "..." })` | UI/UX gates: See [Vision](#vision-analysis) | Part of flywheel Step 5 re-verify
+
 
  ---
 
  ## Web Research Workflow
@@ -146,6 +266,8 @@ For market research or tech evaluation:
  5. record_learning({ ... }) // save key findings
  ```
 
+ **→ Quick Refs:** Analyze repo structure: `analyze_repo` | Save findings: `record_learning` | Part of: `getMethodology({ topic: "project_ideation" })` | See [Recording Learnings](#recording-learnings)
+
 
  ---
 
  ## Project Ideation Workflow
@@ -164,6 +286,8 @@ This returns a 6-step process:
  5. Plan Metrics
  6. Gate Approval
 
+ **→ Quick Refs:** Research tools: `web_search`, `search_github`, `analyze_repo` | Record requirements: `log_recon_finding` | Create baseline: `start_eval_run` | See [Web Research](#web-research-workflow)
+
 
  ---
 
  ## Closed Loop Principle
@@ -178,6 +302,8 @@ The loop:
 
  Only when all green: present to user.
 
+ **→ Quick Refs:** Track loop: `run_closed_loop({ ... })` | Part of flywheel Steps 1-5 | See [AI Flywheel](#the-ai-flywheel-mandatory) | After loop: See [Post-Implementation Checklist](#post-implementation-checklist)
+
 
  ---
 
  ## Environment Setup
@@ -193,6 +319,8 @@ Returns:
  - Recommended SDK installations
  - Actionable next steps
 
+ **→ Quick Refs:** After setup: `getMethodology("overview")` | Check vision: `discover_vision_env()` | See [API Keys](#api-keys-optional) | Then: See [Verification Cycle](#verification-cycle-workflow)
+
 
  ---
 
  ## API Keys (Optional)
@@ -206,6 +334,22 @@ Set these for enhanced functionality:
  | `GITHUB_TOKEN` | Higher rate limits (5000/hr vs 60/hr) |
  | `ANTHROPIC_API_KEY` | Alternative vision provider |
 
+ **→ Quick Refs:** Check what's available: `setup_local_env({ checkSdks: true })` | Vision capabilities: `discover_vision_env()` | See [Environment Setup](#environment-setup)
+
+ ---
+
+ ## Vision Analysis
+
+ For UI/UX verification:
+
+ ```
+ 1. capture_ui_screenshot({ url: "http://localhost:3000", viewport: "desktop" })
+ 2. analyze_screenshot({ imageBase64: "...", prompt: "Check accessibility" })
+ 3. capture_responsive_suite({ url: "...", label: "homepage" })
+ ```
+
+ **→ Quick Refs:** Check capabilities: `discover_vision_env()` | UI QA methodology: `getMethodology({ topic: "ui_ux_qa" })` | Agentic vision: `getMethodology({ topic: "agentic_vision" })` | See [Quality Gates](#quality-gates)
+
  ---
 
  ## Post-Implementation Checklist
@@ -214,13 +358,15 @@ After every implementation, answer these 3 questions:
 
  1. **MCP gaps?** — Were all relevant tools called? Any unexpected results?
  2. **Implementation gaps?** — Dead code? Missing integrations? Hardcoded values?
- 3. **Flywheel complete?** — All 6 steps passed?
+ 3. **Flywheel complete?** — All 7 steps passed, including the E2E test?
 
  If any answer reveals a gap: fix it before proceeding.
 
+ **→ Quick Refs:** Run self-check: `run_self_maintenance({ scope: "quick" })` | Record learnings: `record_learning` | Update docs: `update_agents_md` | See [AI Flywheel](#the-ai-flywheel-mandatory)
+
 
  ---
 
- ## Agent Self-Bootstrap System (NEW)
+ ## Agent Self-Bootstrap System
 
  For agents to self-configure and validate against authoritative sources.
 
@@ -288,27 +434,112 @@ Aggregates findings from multiple sources.
  - https://www.langchain.com/langgraph
  - https://modelcontextprotocol.io/specification/2025-11-25
 
+ **→ Quick Refs:** Full methodology: `getMethodology({ topic: "agent_bootstrap" })` | After bootstrap: See [Autonomous Maintenance](#autonomous-self-maintenance-system) | Before implementing: `assess_risk` | See [Triple Verification](#2-triple-verification-with-source-citations)
+
+ ---
+
+ ## Autonomous Self-Maintenance System
+
+ Aggressive autonomous self-management with risk-aware execution. Based on OpenClaw patterns and Ralph Wiggum stop-hooks.
+
+ ### 1. Risk-Tiered Execution
+
+ Before any action, assess its risk tier:
+
+ ```
+ assess_risk({ action: "push to remote" })
+ ```
+
+ Risk tiers (see the sketch after this list):
+ - **Low**: Reading, analyzing, searching — auto-approve
+ - **Medium**: Writing local files, running tests — log and proceed
+ - **High**: Pushing to remote, posting externally — require confirmation
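A TypeScript sketch of the tiering policy; the tier names and dispositions come from the list above, while the keyword heuristics are invented for illustration:

```typescript
// Hedged sketch of risk-tiered execution; classifier rules are assumptions.
type RiskTier = "low" | "medium" | "high";

function classify(action: string): RiskTier {
  if (/\b(push|publish|post|deploy)\b/i.test(action)) return "high";
  if (/\b(write|edit|run tests?)\b/i.test(action)) return "medium";
  return "low"; // reading, analyzing, searching
}

function disposition(tier: RiskTier): string {
  switch (tier) {
    case "low": return "auto-approve";
    case "medium": return "log and proceed";
    case "high": return "require confirmation";
  }
}

console.log(disposition(classify("push to remote"))); // "require confirmation"
```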
+
+ ### 2. Re-Update Before Create
+
+ **CRITICAL:** Before creating new files, check whether updating an existing one is better:
+
+ ```
+ decide_re_update({
+   targetContent: "New agent instructions",
+   contentType: "instructions",
+   existingFiles: ["AGENTS.md", "README.md"]
+ })
+ ```
+
+ This prevents file sprawl and maintains a single source of truth.
+
+ ### 3. Self-Maintenance Cycles
+
+ Run periodic self-checks:
+
+ ```
+ run_self_maintenance({
+   scope: "standard", // quick | standard | thorough
+   autoFix: false,
+   dryRun: true
+ })
+ ```
+
+ Checks: TypeScript compilation, documentation sync, tool counts, test coverage.
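A sketch of how a scope-gated runner for those checks might look; the commands and the scope-to-check mapping are assumptions, not the tool's documented behavior:

```typescript
// Hypothetical scope-gated self-checks mirroring the list above.
import { execSync } from "node:child_process";

const CHECKS: Record<string, () => void> = {
  typescript: () => execSync("npx tsc --noEmit", { stdio: "inherit" }),
  tests: () => execSync("npm test", { stdio: "inherit" }),
  // documentation sync and tool counts would plug in as further entries
};

function runSelfMaintenance(scope: "quick" | "standard" | "thorough", dryRun: boolean): void {
  const selected = scope === "quick" ? ["typescript"] : Object.keys(CHECKS);
  for (const name of selected) {
    if (dryRun) { console.log(`[dry-run] would run check: ${name}`); continue; }
    CHECKS[name]();
  }
}

runSelfMaintenance("standard", true);
```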
+
+ ### 4. Directory Scaffolding (OpenClaw Style)
+
+ When adding infrastructure, use standardized scaffolding:
+
+ ```
+ scaffold_directory({
+   component: "agent_loop", // or: telemetry, evaluation, multi_channel, etc.
+   includeTests: true,
+   dryRun: true
+ })
+ ```
+
+ Creates organized subdirectories with proper test structure.
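A minimal sketch of such scaffolding, assuming a `src/<component>` layout with an `__tests__` subdirectory; the real tool's directory conventions may differ:

```typescript
// Hypothetical scaffold: the component -> directory mapping is illustrative.
import { mkdirSync } from "node:fs";
import { join } from "node:path";

function scaffoldDirectory(component: string, includeTests: boolean, dryRun: boolean): void {
  const dirs = [join("src", component)];
  if (includeTests) dirs.push(join("src", component, "__tests__"));
  for (const dir of dirs) {
    if (dryRun) { console.log(`[dry-run] would create ${dir}`); continue; }
    mkdirSync(dir, { recursive: true });
  }
}

scaffoldDirectory("agent_loop", true, true);
```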
+
+ ### 5. Autonomous Loops with Guardrails
+
+ For multi-step autonomous tasks, use controlled loops:
+
+ ```
+ run_autonomous_loop({
+   goal: "Verify all tools pass static analysis",
+   maxIterations: 5,
+   maxDurationMs: 60000,
+   stopOnFirstFailure: true
+ })
+ ```
+
+ Implements the Ralph Wiggum pattern with checkpoints and stop conditions.
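A guardrail loop with those parameters can be sketched directly; the step callback and its boolean pass/fail contract are assumptions:

```typescript
// Illustrative guardrail loop mirroring the parameters above.
interface LoopOptions {
  maxIterations: number;
  maxDurationMs: number;
  stopOnFirstFailure: boolean;
}

async function runGuardedLoop(step: (i: number) => Promise<boolean>, opts: LoopOptions): Promise<void> {
  const deadline = Date.now() + opts.maxDurationMs;
  for (let i = 0; i < opts.maxIterations; i++) {
    if (Date.now() >= deadline) return; // stop condition: time budget exhausted
    const ok = await step(i);           // checkpoint: each iteration reports pass/fail
    if (!ok && opts.stopOnFirstFailure) return;
  }
}
```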
+
+ **→ Quick Refs:** Full methodology: `getMethodology({ topic: "autonomous_maintenance" })` | Before actions: `assess_risk` | Before new files: `decide_re_update` | Scaffold structure: `scaffold_directory` | See [Self-Bootstrap](#agent-self-bootstrap-system)
+
 
  ---
 
  ## Methodology Topics
 
  Available via `getMethodology({ topic: "..." })`:
 
- - `overview` — See all methodologies
- - `verification` — 6-phase development cycle
- - `eval` — Test case management
- - `flywheel` — Continuous improvement loop
- - `mandatory_flywheel` — Required verification for changes
- - `reconnaissance` — Codebase discovery
- - `quality_gates` — Pass/fail checkpoints
- - `ui_ux_qa` — Frontend verification
- - `agentic_vision` — AI-powered visual QA
- - `closed_loop` — Build/test before presenting
- - `learnings` — Knowledge persistence
- - `project_ideation` — Validate ideas before building
- - `tech_stack_2026` — Dependency management
- - `agents_md_maintenance` — Keep docs in sync
- - `agent_bootstrap` — Self-discover, triple verify, self-implement
+ | Topic | Description | Quick Ref |
+ |-------|-------------|-----------|
+ | `overview` | See all methodologies | Start here |
+ | `verification` | 6-phase development cycle | [AI Flywheel](#the-ai-flywheel-mandatory) |
+ | `eval` | Test case management | [Quality Gates](#quality-gates) |
+ | `flywheel` | Continuous improvement loop | [AI Flywheel](#the-ai-flywheel-mandatory) |
+ | `mandatory_flywheel` | Required verification for changes | [AI Flywheel](#the-ai-flywheel-mandatory) |
+ | `reconnaissance` | Codebase discovery | [Self-Bootstrap](#agent-self-bootstrap-system) |
+ | `quality_gates` | Pass/fail checkpoints | [Quality Gates](#quality-gates) |
+ | `ui_ux_qa` | Frontend verification | [Vision Analysis](#vision-analysis) |
+ | `agentic_vision` | AI-powered visual QA | [Vision Analysis](#vision-analysis) |
+ | `closed_loop` | Build/test before presenting | [Closed Loop](#closed-loop-principle) |
+ | `learnings` | Knowledge persistence | [Recording Learnings](#recording-learnings) |
+ | `project_ideation` | Validate ideas before building | [Project Ideation](#project-ideation-workflow) |
+ | `tech_stack_2026` | Dependency management | [Environment Setup](#environment-setup) |
+ | `agents_md_maintenance` | Keep docs in sync | [Auto-Update](#auto-update-this-file) |
+ | `agent_bootstrap` | Self-discover, triple verify | [Self-Bootstrap](#agent-self-bootstrap-system) |
+ | `autonomous_maintenance` | Risk-tiered execution | [Autonomous Maintenance](#autonomous-self-maintenance-system) |
+
+ **→ Quick Refs:** Find tools: `findTools({ query: "..." })` | Get any methodology: `getMethodology({ topic: "..." })` | See [MCP Tool Categories](#mcp-tool-categories)
 
  ---
 
@@ -330,6 +561,8 @@ Or read current structure:
  update_agents_md({ operation: "read", projectRoot: "/path/to/project" })
  ```
 
+ **→ Quick Refs:** Before updating: `decide_re_update({ contentType: "instructions", ... })` | After updating: Run flywheel Steps 1-7 | See [Re-Update Before Create](#2-re-update-before-create)
+
 
  ---
  ## License