nodebench-mcp 1.2.0 → 1.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (48)
  1. package/NODEBENCH_AGENTS.md +253 -20
  2. package/STYLE_GUIDE.md +477 -0
  3. package/dist/__tests__/evalDatasetBench.test.d.ts +1 -0
  4. package/dist/__tests__/evalDatasetBench.test.js +738 -0
  5. package/dist/__tests__/evalDatasetBench.test.js.map +1 -0
  6. package/dist/__tests__/evalHarness.test.d.ts +1 -0
  7. package/dist/__tests__/evalHarness.test.js +830 -0
  8. package/dist/__tests__/evalHarness.test.js.map +1 -0
  9. package/dist/__tests__/fixtures/bfcl_v3_long_context.sample.json +264 -0
  10. package/dist/__tests__/fixtures/generateBfclLongContextFixture.d.ts +10 -0
  11. package/dist/__tests__/fixtures/generateBfclLongContextFixture.js +135 -0
  12. package/dist/__tests__/fixtures/generateBfclLongContextFixture.js.map +1 -0
  13. package/dist/__tests__/fixtures/generateSwebenchVerifiedFixture.d.ts +14 -0
  14. package/dist/__tests__/fixtures/generateSwebenchVerifiedFixture.js +189 -0
  15. package/dist/__tests__/fixtures/generateSwebenchVerifiedFixture.js.map +1 -0
  16. package/dist/__tests__/fixtures/generateToolbenchInstructionFixture.d.ts +16 -0
  17. package/dist/__tests__/fixtures/generateToolbenchInstructionFixture.js +154 -0
  18. package/dist/__tests__/fixtures/generateToolbenchInstructionFixture.js.map +1 -0
  19. package/dist/__tests__/fixtures/swebench_verified.sample.json +162 -0
  20. package/dist/__tests__/fixtures/toolbench_instruction.sample.json +109 -0
  21. package/dist/__tests__/openDatasetParallelEval.test.d.ts +7 -0
  22. package/dist/__tests__/openDatasetParallelEval.test.js +209 -0
  23. package/dist/__tests__/openDatasetParallelEval.test.js.map +1 -0
  24. package/dist/__tests__/openDatasetParallelEvalSwebench.test.d.ts +7 -0
  25. package/dist/__tests__/openDatasetParallelEvalSwebench.test.js +220 -0
  26. package/dist/__tests__/openDatasetParallelEvalSwebench.test.js.map +1 -0
  27. package/dist/__tests__/openDatasetParallelEvalToolbench.test.d.ts +7 -0
  28. package/dist/__tests__/openDatasetParallelEvalToolbench.test.js +218 -0
  29. package/dist/__tests__/openDatasetParallelEvalToolbench.test.js.map +1 -0
  30. package/dist/__tests__/tools.test.js +252 -3
  31. package/dist/__tests__/tools.test.js.map +1 -1
  32. package/dist/db.js +20 -0
  33. package/dist/db.js.map +1 -1
  34. package/dist/index.js +2 -0
  35. package/dist/index.js.map +1 -1
  36. package/dist/tools/agentBootstrapTools.d.ts +5 -1
  37. package/dist/tools/agentBootstrapTools.js +566 -1
  38. package/dist/tools/agentBootstrapTools.js.map +1 -1
  39. package/dist/tools/documentationTools.js +102 -8
  40. package/dist/tools/documentationTools.js.map +1 -1
  41. package/dist/tools/learningTools.js +6 -2
  42. package/dist/tools/learningTools.js.map +1 -1
  43. package/dist/tools/metaTools.js +112 -1
  44. package/dist/tools/metaTools.js.map +1 -1
  45. package/dist/tools/selfEvalTools.d.ts +12 -0
  46. package/dist/tools/selfEvalTools.js +568 -0
  47. package/dist/tools/selfEvalTools.js.map +1 -0
  48. package/package.json +11 -3
@@ -21,7 +21,9 @@ Add to `~/.claude/settings.json`:
  }
  ```
 
- Restart Claude Code. 51 tools available immediately.
+ Restart Claude Code. 56 tools available immediately.
+
+ **→ Quick Refs:** After setup, run `getMethodology("overview")` | First task? See [Verification Cycle](#verification-cycle-workflow) | New to codebase? See [Environment Setup](#environment-setup)
 
  ---
 
@@ -51,10 +53,119 @@ Review the code for:
  ### Step 5: Fix and Re-Verify
  If any gap found: fix it, then restart from Step 1.
 
- ### Step 6: Document Learnings
+ ### Step 6: Live E2E Test (MANDATORY)
+ **Before declaring done or publishing:**
+ ```bash
+ echo '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"YOUR_TOOL","arguments":{...}}}' | node dist/index.js
+ ```
+ Every new/modified tool MUST pass the stdio E2E test. No exceptions.
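The same check can be scripted. A minimal TypeScript sketch, assuming the server at `dist/index.js` speaks newline-delimited JSON-RPC over stdio as the `echo` pipe above implies; `my_tool` and the empty arguments are placeholders, and a full MCP client would send an `initialize` request first:

```typescript
// Hedged sketch of the stdio E2E smoke test described above.
import { spawn } from "node:child_process";

function callTool(name: string, args: Record<string, unknown>): Promise<string> {
  return new Promise((resolve, reject) => {
    const server = spawn("node", ["dist/index.js"]);
    let output = "";
    server.stdout.on("data", (chunk: Buffer) => { output += chunk.toString(); });
    server.on("close", () => (output ? resolve(output) : reject(new Error("no response from server"))));
    const msg = { jsonrpc: "2.0", id: 1, method: "tools/call", params: { name, arguments: args } };
    server.stdin.write(JSON.stringify(msg) + "\n");
    server.stdin.end();
  });
}

// "my_tool" is a placeholder for the tool you just added or modified.
callTool("my_tool", {}).then((raw) => {
  if (!raw.includes('"result"')) throw new Error(`E2E failed: ${raw}`);
  console.log("stdio E2E ok");
});
```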
+
+ For workflow-level changes (verification, eval, recon, quality gates, flywheel, or knowledge tools), also run the long-running open-source benchmarks:
+ ```bash
+ npm --prefix packages/mcp-local run dataset:bfcl:refresh
+ NODEBENCH_OPEN_DATASET_TASK_LIMIT=12 NODEBENCH_OPEN_DATASET_CONCURRENCY=6 npm --prefix packages/mcp-local run test:open-dataset
+ npm --prefix packages/mcp-local run dataset:toolbench:refresh
+ NODEBENCH_TOOLBENCH_TASK_LIMIT=6 NODEBENCH_TOOLBENCH_CONCURRENCY=3 npm --prefix packages/mcp-local run test:open-dataset:toolbench
+ npm --prefix packages/mcp-local run dataset:swebench:refresh
+ NODEBENCH_SWEBENCH_TASK_LIMIT=8 NODEBENCH_SWEBENCH_CONCURRENCY=4 npm --prefix packages/mcp-local run test:open-dataset:swebench
+ ```
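The `*_TASK_LIMIT` and `*_CONCURRENCY` variables suggest the runners cap how many dataset tasks execute and how many run at once. A sketch of how such knobs are commonly consumed; the pool helper is illustrative, not the package's actual runner:

```typescript
// Illustrative consumption of the env knobs set on the commands above.
const taskLimit = Number(process.env.NODEBENCH_OPEN_DATASET_TASK_LIMIT ?? 12);
const concurrency = Number(process.env.NODEBENCH_OPEN_DATASET_CONCURRENCY ?? 6);

// Simple promise pool: run at most `limit` tasks at a time.
async function runPool<T>(tasks: Array<() => Promise<T>>, limit: number): Promise<T[]> {
  const results: T[] = [];
  let next = 0;
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, worker));
  return results;
}

// e.g. await runPool(allTasks.slice(0, taskLimit), concurrency);
```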
+
+ ### Step 7: Document Learnings
  Record edge cases discovered. Update this file if needed.
 
- **Rule: No change ships without passing all 6 steps.**
+ **Rule: No change ships without passing all 7 steps.**
+
+ **→ Quick Refs:** Track progress with `start_verification_cycle` | Record findings with `record_learning` | Run gate with `run_quality_gate` | See [Post-Implementation Checklist](#post-implementation-checklist)
+
+ ---
+
+ ## Open-Source Long-Running MCP Benchmark
+
+ Use open-source long-context tasks to validate real orchestration behavior under parallel load.
+
+ - Dataset: `gorilla-llm/Berkeley-Function-Calling-Leaderboard`
+ - Split: `BFCL_v3_multi_turn_long_context`
+ - Source: `https://huggingface.co/datasets/gorilla-llm/Berkeley-Function-Calling-Leaderboard`
+
+ Refresh local fixture:
+ ```bash
+ npm run mcp:dataset:refresh
+ ```
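`npm run mcp:dataset:refresh` delegates to `generateBfclLongContextFixture.ts` (listed under implementation files below). The script itself is not shown here, but one plausible shape, assuming the dataset is reachable through Hugging Face's public `datasets-server` rows API; the `config` value is an assumption:

```typescript
// Hypothetical fixture refresh: pull a handful of rows and write the sample file.
import { writeFileSync } from "node:fs";

async function refreshFixture(): Promise<void> {
  const url = new URL("https://datasets-server.huggingface.co/rows");
  url.searchParams.set("dataset", "gorilla-llm/Berkeley-Function-Calling-Leaderboard");
  url.searchParams.set("config", "default"); // assumption
  url.searchParams.set("split", "BFCL_v3_multi_turn_long_context");
  url.searchParams.set("offset", "0");
  url.searchParams.set("length", "12");
  const res = await fetch(url);
  if (!res.ok) throw new Error(`fixture refresh failed: HTTP ${res.status}`);
  const body = (await res.json()) as { rows: Array<{ row: unknown }> };
  writeFileSync("bfcl_v3_long_context.sample.json", JSON.stringify(body.rows.map((r) => r.row), null, 2));
}

refreshFixture().catch((err) => { console.error(err); process.exit(1); });
```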
+
+ Run parallel subagent benchmark:
+ ```bash
+ NODEBENCH_OPEN_DATASET_TASK_LIMIT=12 NODEBENCH_OPEN_DATASET_CONCURRENCY=6 npm run mcp:dataset:test
+ ```
+
+ Run refresh + benchmark in one shot:
+ ```bash
+ npm run mcp:dataset:bench
+ ```
+
+ Second lane (ToolBench multi-tool instructions):
+ - Dataset: `OpenBMB/ToolBench`
+ - Split: `data_example/instruction (G1,G2,G3)`
+ - Source: `https://github.com/OpenBMB/ToolBench`
+
+ Refresh ToolBench fixture:
+ ```bash
+ npm run mcp:dataset:toolbench:refresh
+ ```
+
+ Run ToolBench parallel subagent benchmark:
+ ```bash
+ NODEBENCH_TOOLBENCH_TASK_LIMIT=6 NODEBENCH_TOOLBENCH_CONCURRENCY=3 npm run mcp:dataset:toolbench:test
+ ```
+
+ Third lane (SWE-bench Verified long-horizon software tasks):
+ - Dataset: `princeton-nlp/SWE-bench_Verified`
+ - Split: `test`
+ - Source: `https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified`
+
+ Refresh SWE-bench fixture:
+ ```bash
+ npm run mcp:dataset:swebench:refresh
+ ```
+
+ Run SWE-bench parallel subagent benchmark:
+ ```bash
+ NODEBENCH_SWEBENCH_TASK_LIMIT=8 NODEBENCH_SWEBENCH_CONCURRENCY=4 npm run mcp:dataset:swebench:test
+ ```
+
+ Run all lanes:
+ ```bash
+ npm run mcp:dataset:bench:all
+ ```
+
+ Implementation files:
+ - `packages/mcp-local/src/__tests__/fixtures/generateBfclLongContextFixture.ts`
+ - `packages/mcp-local/src/__tests__/fixtures/bfcl_v3_long_context.sample.json`
+ - `packages/mcp-local/src/__tests__/openDatasetParallelEval.test.ts`
+ - `packages/mcp-local/src/__tests__/fixtures/generateToolbenchInstructionFixture.ts`
+ - `packages/mcp-local/src/__tests__/fixtures/toolbench_instruction.sample.json`
+ - `packages/mcp-local/src/__tests__/openDatasetParallelEvalToolbench.test.ts`
+ - `packages/mcp-local/src/__tests__/fixtures/generateSwebenchVerifiedFixture.ts`
+ - `packages/mcp-local/src/__tests__/fixtures/swebench_verified.sample.json`
+ - `packages/mcp-local/src/__tests__/openDatasetParallelEvalSwebench.test.ts`
+
+ Required tool chain per dataset task (checked in the sketch below):
+ - `run_recon`
+ - `log_recon_finding`
+ - `findTools`
+ - `getMethodology`
+ - `start_eval_run`
+ - `record_eval_result`
+ - `complete_eval_run`
+ - `run_closed_loop`
+ - `run_mandatory_flywheel`
+ - `search_all_knowledge`
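A TypeScript sketch of the coverage check this chain implies; the transcript shape is an assumption, but a task should pass only when every tool in the chain was actually called:

```typescript
// Sketch: every benchmark task's tool-call transcript must cover the chain.
const REQUIRED_CHAIN = [
  "run_recon", "log_recon_finding", "findTools", "getMethodology",
  "start_eval_run", "record_eval_result", "complete_eval_run",
  "run_closed_loop", "run_mandatory_flywheel", "search_all_knowledge",
] as const;

function missingTools(transcript: Array<{ tool: string }>): string[] {
  const called = new Set(transcript.map((c) => c.tool));
  return REQUIRED_CHAIN.filter((t) => !called.has(t));
}

// A task passes the chain check only when missingTools(calls).length === 0.
```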
+
+ **→ Quick Refs:** Core process in [AI Flywheel](#the-ai-flywheel-mandatory) | Verification flow in [Verification Cycle](#verification-cycle-workflow) | Loop discipline in [Closed Loop Principle](#closed-loop-principle)
 
  ---
 
@@ -72,8 +183,11 @@ Use `getMethodology("overview")` to see all available workflows.
  | **Learning** | `record_learning`, `search_all_knowledge` | Persistent knowledge base |
  | **Vision** | `analyze_screenshot`, `capture_ui_screenshot` | UI/UX verification |
  | **Bootstrap** | `discover_infrastructure`, `triple_verify`, `self_implement` | Self-setup, triple verification |
+ | **Autonomous** | `assess_risk`, `decide_re_update`, `run_self_maintenance` | Risk-aware execution, self-maintenance |
  | **Meta** | `findTools`, `getMethodology` | Discover tools, get workflow guides |
 
+ **→ Quick Refs:** Find tools by keyword: `findTools({ query: "verification" })` | Get workflow guide: `getMethodology({ topic: "..." })` | See [Methodology Topics](#methodology-topics) for all topics
+
 
  ---
 
  ## Verification Cycle Workflow
@@ -93,6 +207,8 @@ If blocked or failed:
  abandon_cycle({ reason: "Blocked by external dependency" })
  ```
 
+ **→ Quick Refs:** Before starting: `search_all_knowledge({ query: "your task" })` | After completing: `record_learning({ ... })` | Run flywheel: See [AI Flywheel](#the-ai-flywheel-mandatory) | Track quality: See [Quality Gates](#quality-gates)
+
 
  ---
 
  ## Recording Learnings
@@ -113,6 +229,8 @@ Search later with:
  search_all_knowledge({ query: "convex index" })
  ```
 
+ **→ Quick Refs:** Search before implementing: `search_all_knowledge` | `search_learnings` and `list_learnings` are DEPRECATED | Part of flywheel Step 7 | See [Verification Cycle](#verification-cycle-workflow)
+
 
  ---
 
  ## Quality Gates
@@ -132,6 +250,8 @@ run_quality_gate({
 
  Gate history tracks pass/fail over time.
 
+ **→ Quick Refs:** Get preset rules: `get_gate_preset({ preset: "ui_ux_qa" })` | View history: `get_gate_history({ gateName: "..." })` | UI/UX gates: See [Vision](#vision-analysis) | Part of flywheel Step 5 re-verify
+
 
  ---
 
  ## Web Research Workflow
@@ -146,6 +266,8 @@ For market research or tech evaluation:
  5. record_learning({ ... }) // save key findings
  ```
 
+ **→ Quick Refs:** Analyze repo structure: `analyze_repo` | Save findings: `record_learning` | Part of: `getMethodology({ topic: "project_ideation" })` | See [Recording Learnings](#recording-learnings)
+
 
  ---
 
  ## Project Ideation Workflow
@@ -164,6 +286,8 @@ This returns a 6-step process:
  5. Plan Metrics
  6. Gate Approval
 
+ **→ Quick Refs:** Research tools: `web_search`, `search_github`, `analyze_repo` | Record requirements: `log_recon_finding` | Create baseline: `start_eval_run` | See [Web Research](#web-research-workflow)
+
 
  ---
 
  ## Closed Loop Principle
@@ -178,6 +302,8 @@ The loop:
 
  Only when all green: present to user.
 
+ **→ Quick Refs:** Track loop: `run_closed_loop({ ... })` | Part of flywheel Steps 1-5 | See [AI Flywheel](#the-ai-flywheel-mandatory) | After loop: See [Post-Implementation Checklist](#post-implementation-checklist)
+
 
  ---
 
  ## Environment Setup
@@ -193,6 +319,8 @@ Returns:
  - Recommended SDK installations
  - Actionable next steps
 
+ **→ Quick Refs:** After setup: `getMethodology("overview")` | Check vision: `discover_vision_env()` | See [API Keys](#api-keys-optional) | Then: See [Verification Cycle](#verification-cycle-workflow)
+
 
  ---
 
  ## API Keys (Optional)
@@ -206,6 +334,22 @@ Set these for enhanced functionality:
  | `GITHUB_TOKEN` | Higher rate limits (5000/hr vs 60/hr) |
  | `ANTHROPIC_API_KEY` | Alternative vision provider |
 
+ **→ Quick Refs:** Check what's available: `setup_local_env({ checkSdks: true })` | Vision capabilities: `discover_vision_env()` | See [Environment Setup](#environment-setup)
+
+ ---
+
+ ## Vision Analysis
+
+ For UI/UX verification:
+
+ ```
+ 1. capture_ui_screenshot({ url: "http://localhost:3000", viewport: "desktop" })
+ 2. analyze_screenshot({ imageBase64: "...", prompt: "Check accessibility" })
+ 3. capture_responsive_suite({ url: "...", label: "homepage" })
+ ```
+
+ **→ Quick Refs:** Check capabilities: `discover_vision_env()` | UI QA methodology: `getMethodology({ topic: "ui_ux_qa" })` | Agentic vision: `getMethodology({ topic: "agentic_vision" })` | See [Quality Gates](#quality-gates)
+
  ---
 
  ## Post-Implementation Checklist
@@ -214,13 +358,15 @@ After every implementation, answer these 3 questions:
 
  1. **MCP gaps?** — Were all relevant tools called? Any unexpected results?
  2. **Implementation gaps?** — Dead code? Missing integrations? Hardcoded values?
- 3. **Flywheel complete?** — All 6 steps passed?
+ 3. **Flywheel complete?** — All 7 steps passed, including the E2E test?
 
  If any answer reveals a gap: fix it before proceeding.
 
+ **→ Quick Refs:** Run self-check: `run_self_maintenance({ scope: "quick" })` | Record learnings: `record_learning` | Update docs: `update_agents_md` | See [AI Flywheel](#the-ai-flywheel-mandatory)
+
 
  ---
 
- ## Agent Self-Bootstrap System (NEW)
+ ## Agent Self-Bootstrap System
 
  For agents to self-configure and validate against authoritative sources.
 
@@ -288,27 +434,112 @@ Aggregates findings from multiple sources.
  - https://www.langchain.com/langgraph
  - https://modelcontextprotocol.io/specification/2025-11-25
 
+ **→ Quick Refs:** Full methodology: `getMethodology({ topic: "agent_bootstrap" })` | After bootstrap: See [Autonomous Maintenance](#autonomous-self-maintenance-system) | Before implementing: `assess_risk` | See [Triple Verification](#2-triple-verification-with-source-citations)
+
+ ---
+
+ ## Autonomous Self-Maintenance System
+
+ Aggressive autonomous self-management with risk-aware execution. Based on OpenClaw patterns and Ralph Wiggum stop-hooks.
+
+ ### 1. Risk-Tiered Execution
+
+ Before any action, assess its risk tier:
+
+ ```
+ assess_risk({ action: "push to remote" })
+ ```
+
+ Risk tiers (see the sketch after this list):
+ - **Low**: Reading, analyzing, searching — auto-approve
+ - **Medium**: Writing local files, running tests — log and proceed
+ - **High**: Pushing to remote, posting externally — require confirmation
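A TypeScript sketch of the tiering policy; the tier names and dispositions come from the list above, while the keyword heuristics are invented for illustration:

```typescript
// Hedged sketch of risk-tiered execution; classifier rules are assumptions.
type RiskTier = "low" | "medium" | "high";

function classify(action: string): RiskTier {
  if (/\b(push|publish|post|deploy)\b/i.test(action)) return "high";
  if (/\b(write|edit|run tests?)\b/i.test(action)) return "medium";
  return "low"; // reading, analyzing, searching
}

function disposition(tier: RiskTier): string {
  switch (tier) {
    case "low": return "auto-approve";
    case "medium": return "log and proceed";
    case "high": return "require confirmation";
  }
}

console.log(disposition(classify("push to remote"))); // "require confirmation"
```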
+
+ ### 2. Re-Update Before Create
+
+ **CRITICAL:** Before creating new files, check whether updating an existing one is better:
+
+ ```
+ decide_re_update({
+   targetContent: "New agent instructions",
+   contentType: "instructions",
+   existingFiles: ["AGENTS.md", "README.md"]
+ })
+ ```
+
+ This prevents file sprawl and maintains a single source of truth.
+
+ ### 3. Self-Maintenance Cycles
+
+ Run periodic self-checks:
+
+ ```
+ run_self_maintenance({
+   scope: "standard", // quick | standard | thorough
+   autoFix: false,
+   dryRun: true
+ })
+ ```
+
+ Checks: TypeScript compilation, documentation sync, tool counts, test coverage.
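A sketch of how a scope-gated runner for those checks might look; the commands and the scope-to-check mapping are assumptions, not the tool's documented behavior:

```typescript
// Hypothetical scope-gated self-checks mirroring the list above.
import { execSync } from "node:child_process";

const CHECKS: Record<string, () => void> = {
  typescript: () => execSync("npx tsc --noEmit", { stdio: "inherit" }),
  tests: () => execSync("npm test", { stdio: "inherit" }),
  // documentation sync and tool counts would plug in as further entries
};

function runSelfMaintenance(scope: "quick" | "standard" | "thorough", dryRun: boolean): void {
  const selected = scope === "quick" ? ["typescript"] : Object.keys(CHECKS);
  for (const name of selected) {
    if (dryRun) { console.log(`[dry-run] would run check: ${name}`); continue; }
    CHECKS[name]();
  }
}

runSelfMaintenance("standard", true);
```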
+
+ ### 4. Directory Scaffolding (OpenClaw Style)
+
+ When adding infrastructure, use standardized scaffolding:
+
+ ```
+ scaffold_directory({
+   component: "agent_loop", // or: telemetry, evaluation, multi_channel, etc.
+   includeTests: true,
+   dryRun: true
+ })
+ ```
+
+ Creates organized subdirectories with proper test structure.
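A minimal sketch of such scaffolding, assuming a `src/<component>` layout with an `__tests__` subdirectory; the real tool's directory conventions may differ:

```typescript
// Hypothetical scaffold: the component -> directory mapping is illustrative.
import { mkdirSync } from "node:fs";
import { join } from "node:path";

function scaffoldDirectory(component: string, includeTests: boolean, dryRun: boolean): void {
  const dirs = [join("src", component)];
  if (includeTests) dirs.push(join("src", component, "__tests__"));
  for (const dir of dirs) {
    if (dryRun) { console.log(`[dry-run] would create ${dir}`); continue; }
    mkdirSync(dir, { recursive: true });
  }
}

scaffoldDirectory("agent_loop", true, true);
```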
+
+ ### 5. Autonomous Loops with Guardrails
+
+ For multi-step autonomous tasks, use controlled loops:
+
+ ```
+ run_autonomous_loop({
+   goal: "Verify all tools pass static analysis",
+   maxIterations: 5,
+   maxDurationMs: 60000,
+   stopOnFirstFailure: true
+ })
+ ```
+
+ Implements the Ralph Wiggum pattern with checkpoints and stop conditions.
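A guardrail loop with those parameters can be sketched directly; the step callback and its boolean pass/fail contract are assumptions:

```typescript
// Illustrative guardrail loop mirroring the parameters above.
interface LoopOptions {
  maxIterations: number;
  maxDurationMs: number;
  stopOnFirstFailure: boolean;
}

async function runGuardedLoop(step: (i: number) => Promise<boolean>, opts: LoopOptions): Promise<void> {
  const deadline = Date.now() + opts.maxDurationMs;
  for (let i = 0; i < opts.maxIterations; i++) {
    if (Date.now() >= deadline) return; // stop condition: time budget exhausted
    const ok = await step(i);           // checkpoint: each iteration reports pass/fail
    if (!ok && opts.stopOnFirstFailure) return;
  }
}
```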
+
+ **→ Quick Refs:** Full methodology: `getMethodology({ topic: "autonomous_maintenance" })` | Before actions: `assess_risk` | Before new files: `decide_re_update` | Scaffold structure: `scaffold_directory` | See [Self-Bootstrap](#agent-self-bootstrap-system)
+
 
  ---
 
  ## Methodology Topics
 
  Available via `getMethodology({ topic: "..." })`:
 
- - `overview` — See all methodologies
- - `verification` — 6-phase development cycle
- - `eval` — Test case management
- - `flywheel` — Continuous improvement loop
- - `mandatory_flywheel` — Required verification for changes
- - `reconnaissance` — Codebase discovery
- - `quality_gates` — Pass/fail checkpoints
- - `ui_ux_qa` — Frontend verification
- - `agentic_vision` — AI-powered visual QA
- - `closed_loop` — Build/test before presenting
- - `learnings` — Knowledge persistence
- - `project_ideation` — Validate ideas before building
- - `tech_stack_2026` — Dependency management
- - `agents_md_maintenance` — Keep docs in sync
- - `agent_bootstrap` — Self-discover, triple verify, self-implement
+ | Topic | Description | Quick Ref |
+ |-------|-------------|-----------|
+ | `overview` | See all methodologies | Start here |
+ | `verification` | 6-phase development cycle | [AI Flywheel](#the-ai-flywheel-mandatory) |
+ | `eval` | Test case management | [Quality Gates](#quality-gates) |
+ | `flywheel` | Continuous improvement loop | [AI Flywheel](#the-ai-flywheel-mandatory) |
+ | `mandatory_flywheel` | Required verification for changes | [AI Flywheel](#the-ai-flywheel-mandatory) |
+ | `reconnaissance` | Codebase discovery | [Self-Bootstrap](#agent-self-bootstrap-system) |
+ | `quality_gates` | Pass/fail checkpoints | [Quality Gates](#quality-gates) |
+ | `ui_ux_qa` | Frontend verification | [Vision Analysis](#vision-analysis) |
+ | `agentic_vision` | AI-powered visual QA | [Vision Analysis](#vision-analysis) |
+ | `closed_loop` | Build/test before presenting | [Closed Loop](#closed-loop-principle) |
+ | `learnings` | Knowledge persistence | [Recording Learnings](#recording-learnings) |
+ | `project_ideation` | Validate ideas before building | [Project Ideation](#project-ideation-workflow) |
+ | `tech_stack_2026` | Dependency management | [Environment Setup](#environment-setup) |
+ | `agents_md_maintenance` | Keep docs in sync | [Auto-Update](#auto-update-this-file) |
+ | `agent_bootstrap` | Self-discover, triple verify | [Self-Bootstrap](#agent-self-bootstrap-system) |
+ | `autonomous_maintenance` | Risk-tiered execution | [Autonomous Maintenance](#autonomous-self-maintenance-system) |
+
+ **→ Quick Refs:** Find tools: `findTools({ query: "..." })` | Get any methodology: `getMethodology({ topic: "..." })` | See [MCP Tool Categories](#mcp-tool-categories)
 
  ---
 
@@ -330,6 +561,8 @@ Or read current structure:
  update_agents_md({ operation: "read", projectRoot: "/path/to/project" })
  ```
 
+ **→ Quick Refs:** Before updating: `decide_re_update({ contentType: "instructions", ... })` | After updating: Run flywheel Steps 1-7 | See [Re-Update Before Create](#2-re-update-before-create)
+
 
  ---
  ## License