nodebench-mcp 2.10.0 → 2.11.0
- package/NODEBENCH_AGENTS.md +86 -3
- package/README.md +19 -3
- package/dist/__tests__/toolsetGatingEval.test.js +67 -33
- package/dist/__tests__/toolsetGatingEval.test.js.map +1 -1
- package/dist/index.js +5 -4
- package/dist/index.js.map +1 -1
- package/dist/tools/localFileTools.js +207 -0
- package/dist/tools/localFileTools.js.map +1 -1
- package/package.json +2 -2
package/NODEBENCH_AGENTS.md
CHANGED
@@ -21,9 +21,26 @@ Add to `~/.claude/settings.json`:
 }
 ```
 
-Restart Claude Code. 89 tools available immediately.
+Restart Claude Code. 89+ tools available immediately.
 
-
+### Preset Selection
+
+By default all toolsets are enabled. Use `--preset` to start with a scoped subset:
+
+```json
+{
+  "mcpServers": {
+    "nodebench": {
+      "command": "npx",
+      "args": ["-y", "nodebench-mcp", "--preset", "meta"]
+    }
+  }
+}
+```
+
+The **meta** preset is the recommended front door for new agents: start with just 5 discovery tools, use `discover_tools` to find what you need, then self-escalate to a larger preset. See [Toolset Gating & Presets](#toolset-gating--presets) for the full breakdown.
+
+**→ Quick Refs:** After setup, run `getMethodology("overview")` | First task? See [Verification Cycle](#verification-cycle-workflow) | New to codebase? See [Environment Setup](#environment-setup) | Preset options: See [Toolset Gating & Presets](#toolset-gating--presets)
 
 ---
 
@@ -261,8 +278,73 @@ Use `getMethodology("overview")` to see all available workflows.
 | **Security** | `scan_dependencies`, `run_code_analysis` | Dependency auditing, static code analysis |
 | **Platform** | `query_daily_brief`, `query_funding_entities`, `query_research_queue`, `publish_to_queue` | Convex platform bridge: intelligence, funding, research, publishing |
 | **Meta** | `findTools`, `getMethodology` | Discover tools, get workflow guides |
+| **Discovery** | `discover_tools`, `get_tool_quick_ref`, `get_workflow_chain` | Hybrid search, quick refs, workflow chains |
+
+Meta + Discovery tools (5 total) are **always included** regardless of preset. See [Toolset Gating & Presets](#toolset-gating--presets).
+
+**→ Quick Refs:** Find tools by keyword: `findTools({ query: "verification" })` | Hybrid search: `discover_tools({ query: "security" })` | Get workflow guide: `getMethodology({ topic: "..." })` | See [Methodology Topics](#methodology-topics) for all topics
+
+---
+
+## Toolset Gating & Presets
+
+NodeBench MCP supports 4 presets that control which domain toolsets are loaded at startup. Meta + Discovery tools (5 total) are **always included** on top of any preset.
+
+### Preset Table
+
+| Preset | Domain Toolsets | Domain Tools | Total (with meta+discovery) | Use Case |
+|--------|----------------|-------------|----------------------------|----------|
+| **meta** | 0 | 0 | 5 | Discovery-only front door. Agents start here and self-escalate. |
+| **lite** | 7 | ~35 | ~40 | Lightweight verification-focused workflows. CI bots, quick checks. |
+| **core** | 16 | ~75 | ~80 | Full development workflow. Most agent sessions. |
+| **full** | all | 89+ | 94+ | Everything enabled. Benchmarking, exploration, advanced use. |
+
+### Usage
+
+```bash
+npx nodebench-mcp --preset meta   # Discovery-only (5 tools)
+npx nodebench-mcp --preset lite   # Verification + eval + recon + security
+npx nodebench-mcp --preset core   # Full dev workflow without vision/parallel
+npx nodebench-mcp --preset full   # All toolsets (default)
+npx nodebench-mcp --toolsets verification,eval,recon   # Custom selection
+npx nodebench-mcp --exclude vision,ui_capture          # Exclude specific toolsets
+```
+
+### The Meta Preset — Discovery-Only Front Door
+
+The **meta** preset loads zero domain tools. Agents start with only 5 tools:
+
+| Tool | Purpose |
+|------|---------|
+| `findTools` | Keyword search across all registered tools |
+| `getMethodology` | Get workflow guides by topic |
+| `discover_tools` | Hybrid search with relevance scoring (richer than findTools) |
+| `get_tool_quick_ref` | Quick reference card for any specific tool |
+| `get_workflow_chain` | Recommended tool sequence for common workflows |
+
+This is the recommended starting point for autonomous agents. The self-escalation pattern:
+
+```
+1. Start with --preset meta (5 tools)
+2. discover_tools({ query: "what I need to do" })    // Find relevant tools
+3. get_workflow_chain({ workflow: "verification" })  // Get the tool sequence
+4. If needed tools are not loaded:
+   → Restart with --preset core or --preset full
+   → Or use --toolsets to add specific domains
+5. Proceed with full workflow
+```
+
+### Preset Domain Breakdown
+
+**meta** (0 domains): No domain tools. Meta + Discovery only.
+
+**lite** (7 domains): `verification`, `eval`, `quality_gate`, `learning`, `recon`, `security`, `boilerplate`
+
+**core** (16 domains): Everything in lite plus `flywheel`, `bootstrap`, `self_eval`, `llm`, `platform`, `research_writing`, `flicker_detection`, `figma_flow`, `benchmark`
+
+**full** (all domains): All toolsets in TOOLSET_MAP including `ui_capture`, `vision`, `local_file`, `web`, `github`, `docs`, `parallel`, and everything in core.
 
-**→ Quick Refs:**
+**→ Quick Refs:** Check current toolset: `findTools({ query: "*" })` | Self-escalate: restart with `--preset core` | See [MCP Tool Categories](#mcp-tool-categories) | CLI help: `npx nodebench-mcp --help`
 
 ---
 

@@ -616,6 +698,7 @@ Available via `getMethodology({ topic: "..." })`:
 | `autonomous_maintenance` | Risk-tiered execution | [Autonomous Maintenance](#autonomous-self-maintenance-system) |
 | `parallel_agent_teams` | Multi-agent coordination, task locking, oracle testing | [Parallel Agent Teams](#parallel-agent-teams) |
 | `self_reinforced_learning` | Trajectory analysis, self-eval, improvement recs | [Self-Reinforced Learning](#self-reinforced-learning-loop) |
+| `toolset_gating` | 4 presets (meta, lite, core, full) and self-escalation | [Toolset Gating & Presets](#toolset-gating--presets) |
 
 **→ Quick Refs:** Find tools: `findTools({ query: "..." })` | Get any methodology: `getMethodology({ topic: "..." })` | See [MCP Tool Categories](#mcp-tool-categories)
 
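The self-escalation pattern documented in the NODEBENCH_AGENTS.md additions above can be sketched as code. This is a hedged illustration, not part of the package: `callTool` and the response fields (`loaded`, `tools`) are hypothetical placeholders for an MCP client helper — only the tool names `discover_tools` and `get_workflow_chain` come from the diff.

```javascript
// Sketch of the meta-preset self-escalation loop. `callTool` is a
// hypothetical MCP client helper; the response shapes (`loaded`, `tools`)
// are illustrative assumptions, not the real nodebench-mcp schemas.
async function selfEscalate(callTool, taskDescription) {
  // Under --preset meta only the 5 discovery tools are loaded.
  const found = await callTool("discover_tools", { query: taskDescription });
  const chain = await callTool("get_workflow_chain", { workflow: "verification" });
  // Tools the recommended chain needs but the current preset did not load.
  const missing = chain.tools.filter((name) => !found.loaded.includes(name));
  if (missing.length > 0) {
    // Self-escalate: restart the server with a larger preset (or --toolsets).
    return { action: "restart", args: ["--preset", "core"], missing };
  }
  return { action: "proceed", tools: chain.tools };
}
```

The agent itself cannot add tools to a running session; escalation means restarting the server process with broader gating flags, as the README notes.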
package/README.md
CHANGED
@@ -39,7 +39,7 @@ Every additional tool call produces a concrete artifact — an issue found, a ri
 
 **QA engineer** — Transitioned a manual QA workflow website into an AI agent-driven app for a pet care messaging platform. Uses NodeBench's quality gates, verification cycles, and eval runs to ensure the AI agent handles edge cases that manual QA caught but bare AI agents miss.
 
-Both found different subsets of the 129 tools useful — which is why v2.8 ships with `--preset`
+Both found different subsets of the 129 tools useful — which is why v2.8 ships with 4 `--preset` levels to load only what you need.
 
 ---
 

@@ -80,6 +80,9 @@ Tasks 1-3 start with zero prior knowledge. By task 9, the agent finds 2+ relevan
 # Claude Code CLI — all 129 tools
 claude mcp add nodebench -- npx -y nodebench-mcp
 
+# Or start with discovery only — 5 tools, agents self-escalate to what they need
+claude mcp add nodebench -- npx -y nodebench-mcp --preset meta
+
 # Or start lean — 39 tools, ~70% less token overhead
 claude mcp add nodebench -- npx -y nodebench-mcp --preset lite
 ```

@@ -304,7 +307,18 @@ Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www
 
 ### Presets
 
+| Preset | Tools | Use case |
+|---|---|---|
+| `meta` | 5 | Discovery-only front door — agents start here and self-escalate via `discover_tools` |
+| `lite` | 39 | Core methodology — verification, eval, gates, learning, recon, security, boilerplate |
+| `core` | 87 | Full workflow — adds flywheel, bootstrap, self-eval, llm, platform, research_writing, flicker_detection, figma_flow, benchmark |
+| `full` | 129 | Everything (default) |
+
 ```bash
+# Meta — 5 tools (discovery-only: findTools, getMethodology, discover_tools, get_tool_quick_ref, get_workflow_chain)
+# Agents start here and self-escalate to the tools they need
+claude mcp add nodebench -- npx -y nodebench-mcp --preset meta
+
 # Lite — 39 tools (verification, eval, gates, learning, recon, security, boilerplate + meta + discovery)
 claude mcp add nodebench -- npx -y nodebench-mcp --preset lite
 

@@ -322,7 +336,7 @@ Or in config:
   "mcpServers": {
     "nodebench": {
       "command": "npx",
-      "args": ["-y", "nodebench-mcp", "--preset", "
+      "args": ["-y", "nodebench-mcp", "--preset", "meta"]
     }
   }
 }

@@ -369,10 +383,12 @@ npx nodebench-mcp --help
 | boilerplate | 2 | Scaffold NodeBench projects + status |
 | benchmark | 3 | Autonomous benchmark lifecycle (C-compiler pattern) |
 
-Always included (regardless of gating):
+Always included (regardless of gating) — these 5 tools form the `meta` preset:
 - Meta: `findTools`, `getMethodology`
 - Discovery: `discover_tools`, `get_tool_quick_ref`, `get_workflow_chain`
 
+The `meta` preset loads **only** these 5 tools (0 domain tools). Agents use `discover_tools` to find what they need and self-escalate.
+
 ---
 
 ## Build from Source
@@ -75,6 +75,7 @@ const TOOLSET_MAP = {
     benchmark: cCompilerBenchmarkTools,
 };
 const PRESETS = {
+    meta: [],
     lite: ["verification", "eval", "quality_gate", "learning", "recon", "security", "boilerplate"],
     core: ["verification", "eval", "quality_gate", "learning", "flywheel", "recon", "bootstrap", "self_eval", "llm", "security", "platform", "research_writing", "flicker_detection", "figma_flow", "boilerplate", "benchmark"],
     full: Object.keys(TOOLSET_MAP),
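The eval hunks below call `buildToolset(preset)`; a minimal sketch of how a `PRESETS` map like the one in this hunk could resolve to a flat tool list. The toolset contents and the `ALWAYS_ON` helper here are illustrative assumptions — only the preset shape, the empty `meta` preset, and the two always-on meta tools come from the diff.

```javascript
// Minimal sketch: resolve a preset name to a flat tool list from a
// TOOLSET_MAP-style registry. The tool entries are hypothetical.
const TOOLSET_MAP = {
  verification: [{ name: "run_verification" }], // illustrative entries
  eval: [{ name: "run_eval" }],
};
const PRESETS = {
  meta: [],                        // discovery-only: no domain toolsets
  lite: ["verification", "eval"],  // truncated stand-in for the real list
  full: Object.keys(TOOLSET_MAP),
};
// Meta + discovery tools are always included on top of any preset.
const ALWAYS_ON = [{ name: "findTools" }, { name: "getMethodology" }];

function buildToolset(preset) {
  const domains = PRESETS[preset] ?? PRESETS.full;
  return [...ALWAYS_ON, ...domains.flatMap((d) => TOOLSET_MAP[d] ?? [])];
}

console.log(buildToolset("meta").length); // 2: the always-on tools only
```

Because `meta` maps to an empty domain list, the always-on tools fall out of the same code path as every other preset — no special case is needed.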
@@ -721,41 +722,51 @@ async function cleanupAll() {
 const allTrajectories = [];
 describe("Toolset Gating Eval", () => {
     afterAll(async () => { await cleanupAll(); });
-    for (const preset of ["lite", "core", "full"]) {
+    for (const preset of ["meta", "lite", "core", "full"]) {
         describe(`Preset: ${preset}`, () => {
             for (const scenario of SCENARIOS) {
                 it(`${preset}/${scenario.id}: runs 8-phase pipeline`, async () => {
                     const t = await runTrajectory(preset, scenario);
                     allTrajectories.push(t);
-                    //
+                    // Meta phase always succeeds (findTools + getMethodology always present)
                     const metaPhase = t.phases.find((p) => p.phase === "meta");
                     expect(metaPhase?.success).toBe(true);
-
-
-
-
-
-
-
-
-
-
-
+                    if (preset === "meta") {
+                        // meta preset: only meta tools available — all other phases skipped
+                        expect(t.phasesCompleted).toBe(1); // only meta phase
+                        expect(t.toolCount).toBe(2); // findTools + getMethodology
+                    }
+                    else {
+                        // lite, core, full: domain tools available
+                        const reconPhase = t.phases.find((p) => p.phase === "recon");
+                        expect(reconPhase?.success).toBe(true);
+                        const verifyPhase = t.phases.find((p) => p.phase === "verification");
+                        expect(verifyPhase?.success).toBe(true);
+                        const evalPhase = t.phases.find((p) => p.phase === "eval");
+                        expect(evalPhase?.success).toBe(true);
+                        const gatePhase = t.phases.find((p) => p.phase === "quality-gate");
+                        expect(gatePhase?.success).toBe(true);
+                        // Knowledge phase depends on preset (learning tools in lite + core + full)
+                        const knowledgePhase = t.phases.find((p) => p.phase === "knowledge");
+                        expect(knowledgePhase?.success).toBe(true);
+                    }
                 }, 30_000);
             }
         });
     }
     describe("Flywheel availability", () => {
-        it("lite
-        const
-        for (const t of
+        it("meta and lite presets do NOT have flywheel tools", () => {
+            const noFlywheel = allTrajectories.filter((t) => t.preset === "meta" || t.preset === "lite");
+            for (const t of noFlywheel) {
                 const fw = t.phases.find((p) => p.phase === "flywheel");
                 expect(fw?.success).toBe(false);
-
+                if (t.preset === "lite") {
+                    expect(fw?.toolsMissing).toContain("run_mandatory_flywheel");
+                }
             }
         });
         it("core and full presets HAVE flywheel tools", () => {
-            const coreFullTrajectories = allTrajectories.filter((t) => t.preset
+            const coreFullTrajectories = allTrajectories.filter((t) => t.preset === "core" || t.preset === "full");
             for (const t of coreFullTrajectories) {
                 expect(t.flywheelComplete).toBe(true);
             }
@@ -784,16 +795,16 @@ describe("Toolset Gating Eval", () => {
         });
     });
     describe("Self-eval availability", () => {
-        it("lite
-        const
-        for (const t of
+        it("meta and lite do NOT have self-eval tools", () => {
+            const noSelfEval = allTrajectories.filter((t) => t.preset === "meta" || t.preset === "lite");
+            for (const t of noSelfEval) {
                 const se = t.phases.find((p) => p.phase === "self-eval");
                 if (se)
                     expect(se.success).toBe(false);
             }
         });
         it("core and full HAVE self-eval tools", () => {
-            const coreFullTrajectories = allTrajectories.filter((t) => t.preset
+            const coreFullTrajectories = allTrajectories.filter((t) => t.preset === "core" || t.preset === "full");
             for (const t of coreFullTrajectories) {
                 const se = t.phases.find((p) => p.phase === "self-eval");
                 expect(se?.success).toBe(true);
@@ -801,6 +812,14 @@ describe("Toolset Gating Eval", () => {
         });
     });
     describe("Token surface area reduction", () => {
+        it("meta has the fewest tools (only meta tools)", () => {
+            const metaT = allTrajectories.find((t) => t.preset === "meta");
+            const liteT = allTrajectories.find((t) => t.preset === "lite");
+            expect(metaT.toolCount).toBe(2); // findTools + getMethodology only
+            expect(metaT.toolCount).toBeLessThan(liteT.toolCount);
+            const reduction = 1 - metaT.toolCount / liteT.toolCount;
+            expect(reduction).toBeGreaterThan(0.9); // meta is 90%+ fewer tools than lite
+        });
         it("lite reduces tool count and estimated token overhead vs full", () => {
             const liteT = allTrajectories.find((t) => t.preset === "lite");
             const fullT = allTrajectories.find((t) => t.preset === "full");
@@ -809,11 +828,13 @@ describe("Toolset Gating Eval", () => {
             const reduction = 1 - liteT.toolCount / fullT.toolCount;
             expect(reduction).toBeGreaterThan(0.5); // lite is at least 50% fewer tools
         });
-        it("
+        it("presets are ordered: meta < lite < core < full", () => {
+            const metaT = allTrajectories.find((t) => t.preset === "meta");
             const liteT = allTrajectories.find((t) => t.preset === "lite");
             const coreT = allTrajectories.find((t) => t.preset === "core");
             const fullT = allTrajectories.find((t) => t.preset === "full");
-            expect(
+            expect(metaT.toolCount).toBeLessThan(liteT.toolCount);
+            expect(liteT.toolCount).toBeLessThan(coreT.toolCount);
             expect(coreT.toolCount).toBeLessThan(fullT.toolCount);
         });
     });
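The reduction thresholds asserted in the hunk above are comfortably met by the counts quoted elsewhere in this diff (the README's 39-tool lite and 129-tool full, and the test's own comment that meta's exercised `toolCount` is 2). A quick arithmetic check, under those assumed counts:

```javascript
// Verify the reduction math behind the >0.9 and >0.5 assertions, using
// counts quoted in the diff (meta toolCount of 2 per the test comment;
// lite/full totals from the README preset table).
const counts = { meta: 2, lite: 39, core: 87, full: 129 };
const metaVsLite = 1 - counts.meta / counts.lite; // ≈ 0.949, clears > 0.9
const liteVsFull = 1 - counts.lite / counts.full; // ≈ 0.698, clears > 0.5
console.log(metaVsLite.toFixed(3), liteVsFull.toFixed(3)); // 0.949 0.698
```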
@@ -823,7 +844,7 @@ describe("Toolset Gating Eval", () => {
 // ═══════════════════════════════════════════════════════════════════════════
 describe("Toolset Gating Report", () => {
     it("generates trajectory comparison across presets", () => {
-        expect(allTrajectories.length).toBe(
+        expect(allTrajectories.length).toBe(36); // 4 presets × 9 scenarios
         console.log("\n");
         console.log("╔══════════════════════════════════════════════════════════════════════════════╗");
         console.log("║ TOOLSET GATING EVAL — Trajectory Comparison ║");

@@ -834,7 +855,7 @@ describe("Toolset Gating Report", () => {
         console.log("┌──────────────────────────────────────────────────────────────────────────────┐");
         console.log("│ 1. TOOL COUNT & ESTIMATED TOKEN OVERHEAD │");
         console.log("├──────────────────────────────────────────────────────────────────────────────┤");
-        for (const preset of ["lite", "core", "full"]) {
+        for (const preset of ["meta", "lite", "core", "full"]) {
             const t = allTrajectories.find((tr) => tr.preset === preset);
             const bar = "█".repeat(Math.round(t.toolCount / 3));
             console.log(`│ ${preset.padEnd(6)} ${String(t.toolCount).padStart(3)} tools ~${String(t.estimatedSchemaTokens).padStart(5)} tokens ${bar}`.padEnd(79) + "│");

@@ -855,7 +876,7 @@ describe("Toolset Gating Report", () => {
         const allPhaseNames = ["meta", "recon", "risk", "verification", "eval", "quality-gate", "knowledge", "flywheel", "parallel", "self-eval"];
         for (const phase of allPhaseNames) {
             const cols = [];
-            for (const preset of ["lite", "core", "full"]) {
+            for (const preset of ["meta", "lite", "core", "full"]) {
                 const trajectories = allTrajectories.filter((t) => t.preset === preset);
                 const phaseResults = trajectories.map((t) => t.phases.find((p) => p.phase === phase));
                 const present = phaseResults.some((p) => p);

@@ -886,7 +907,7 @@ describe("Toolset Gating Report", () => {
             { label: "Total tool calls", key: "totalToolCalls" },
         ]) {
             const cols = [];
-            for (const preset of ["lite", "core", "full"]) {
+            for (const preset of ["meta", "lite", "core", "full"]) {
                 const sum = allTrajectories
                     .filter((t) => t.preset === preset)
                     .reduce((s, t) => s + t[metric.key], 0);

@@ -901,7 +922,7 @@ describe("Toolset Gating Report", () => {
             { label: "Flywheel complete", fn: (t) => t.flywheelComplete },
         ]) {
             const cols = [];
-            for (const preset of ["lite", "core", "full"]) {
+            for (const preset of ["meta", "lite", "core", "full"]) {
                 const count = allTrajectories
                     .filter((t) => t.preset === preset)
                     .filter(metric.fn).length;

@@ -915,7 +936,7 @@ describe("Toolset Gating Report", () => {
         console.log("┌──────────────────────────────────────────────────────────────────────────────┐");
         console.log("│ 4. TOOLS MISSING BY PRESET (what you lose with gating) │");
         console.log("├──────────────────────────────────────────────────────────────────────────────┤");
-        for (const preset of ["lite", "core"]) {
+        for (const preset of ["meta", "lite", "core"]) {
             const missingCalls = callLog.filter((c) => c.preset === preset && c.status === "missing");
             const uniqueMissing = [...new Set(missingCalls.map((c) => c.tool))];
             if (uniqueMissing.length > 0) {
@@ -984,7 +1005,7 @@ describe("Toolset Gating Report", () => {
         console.log("┌──────────────────────────────────────────────────────────────────────────────┐");
         console.log("│ 7. UNIQUE TOOLS EXERCISED PER PRESET │");
         console.log("├──────────────────────────────────────────────────────────────────────────────┤");
-        for (const preset of ["lite", "core", "full"]) {
+        for (const preset of ["meta", "lite", "core", "full"]) {
             const successCalls = callLog.filter((c) => c.preset === preset && c.status === "success");
             const uniqueTools = [...new Set(successCalls.map((c) => c.tool))];
             const availableTools = buildToolset(preset).length;

@@ -1002,24 +1023,37 @@ describe("Toolset Gating Report", () => {
         console.log("║ VERDICT ║");
         console.log("╠══════════════════════════════════════════════════════════════════════════════╣");
         console.log("║ ║");
+        const metaCompleted = allTrajectories.filter((t) => t.preset === "meta").reduce((s, t) => s + t.phasesCompleted, 0);
+        const metaTotal = allTrajectories.filter((t) => t.preset === "meta").reduce((s, t) => s + t.phasesCompleted + t.phasesSkipped, 0);
         const liteCompleted = allTrajectories.filter((t) => t.preset === "lite").reduce((s, t) => s + t.phasesCompleted, 0);
         const liteTotal = allTrajectories.filter((t) => t.preset === "lite").reduce((s, t) => s + t.phasesCompleted + t.phasesSkipped, 0);
         const coreCompleted = allTrajectories.filter((t) => t.preset === "core").reduce((s, t) => s + t.phasesCompleted, 0);
         const coreTotal = allTrajectories.filter((t) => t.preset === "core").reduce((s, t) => s + t.phasesCompleted + t.phasesSkipped, 0);
         const fullCompleted = allTrajectories.filter((t) => t.preset === "full").reduce((s, t) => s + t.phasesCompleted, 0);
         const fullTotal = allTrajectories.filter((t) => t.preset === "full").reduce((s, t) => s + t.phasesCompleted + t.phasesSkipped, 0);
+        console.log(`║ meta: ${metaCompleted}/${metaTotal} phases (${Math.round(metaCompleted / metaTotal * 100)}%) — discovery only, 5 tools, minimal context`.padEnd(79) + "║");
         console.log(`║ lite: ${liteCompleted}/${liteTotal} phases (${Math.round(liteCompleted / liteTotal * 100)}%) — ${savings}% fewer tokens, loses flywheel + parallel`.padEnd(79) + "║");
         console.log(`║ core: ${coreCompleted}/${coreTotal} phases (${Math.round(coreCompleted / coreTotal * 100)}%) — full methodology loop, no parallel/vision/web`.padEnd(79) + "║");
         console.log(`║ full: ${fullCompleted}/${fullTotal} phases (${Math.round(fullCompleted / fullTotal * 100)}%) — everything`.padEnd(79) + "║");
         console.log("║ ║");
         console.log("║ Recommendation: ║");
+        console.log("║ Discovery-first / front door → --preset meta (5 tools, self-escalate) ║");
         console.log("║ Solo dev, standard tasks → --preset lite (fast, low token overhead) ║");
         console.log("║ Team with methodology needs → --preset core (full flywheel loop) ║");
         console.log("║ Multi-agent / full pipeline → --preset full (parallel + self-eval) ║");
         console.log("║ ║");
         console.log("╚══════════════════════════════════════════════════════════════════════════════╝");
         // ─── ASSERTIONS ───
-        //
+        // meta preset: only meta phase succeeds (discovery-only gate)
+        {
+            const metaTrajectories = allTrajectories.filter((t) => t.preset === "meta");
+            for (const t of metaTrajectories) {
+                expect(t.phases.find((p) => p.phase === "meta")?.success).toBe(true);
+                expect(t.phasesCompleted).toBe(1);
+                expect(t.toolCount).toBe(2);
+            }
+        }
+        // lite, core, full: complete the core 6 phases (meta, recon, risk, verification, eval, quality-gate)
         for (const preset of ["lite", "core", "full"]) {
             const trajectories = allTrajectories.filter((t) => t.preset === preset);
             for (const t of trajectories) {

@@ -1031,7 +1065,7 @@ describe("Toolset Gating Report", () => {
                 expect(t.phases.find((p) => p.phase === "knowledge")?.success).toBe(true);
             }
         }
-        // lite
+        // lite, core, full detect issues (core methodology is intact)
         for (const preset of ["lite", "core", "full"]) {
             const totalIssues = allTrajectories
                 .filter((t) => t.preset === preset)