nodebench-mcp 2.8.1 → 2.8.3

package/README.md CHANGED
@@ -184,6 +184,77 @@ Notes:
 
  ---
 
+ ## Progressive Discovery (v2.8.1)
+
+ 129 tools is a lot. The progressive disclosure system helps agents find exactly what they need:
+
+ ### Multi-modal search engine
+
+ ```
+ > discover_tools("verify my implementation")
+ ```
+
+ The `discover_tools` search engine scores tools using **9 parallel strategies**:
+
+ | Strategy | What it does | Example |
+ |---|---|---|
+ | Keyword | Exact/partial word matching on name, tags, description | "benchmark" → `benchmark_models` |
+ | Fuzzy | Levenshtein distance — tolerates typos | "verifiy" → `start_verification_cycle` |
+ | N-gram | Trigram similarity for partial words | "screen" → `capture_ui_screenshot` |
+ | Prefix | Matches tool name starts | "cap" → `capture_*` tools |
+ | Semantic | Synonym expansion (30 word families) | "check" also finds "verify", "validate" |
+ | TF-IDF | Rare tags score higher than common ones | "c-compiler" scores higher than "test" |
+ | Regex | Pattern matching | `"^run_.*loop$"` → `run_closed_loop` |
+ | Bigram | Phrase matching | "quality gate" matched as unit |
+ | Domain boost | Related categories boosted together | verification + quality_gate cluster |
+
+ **6 search modes**: `hybrid` (default, all strategies), `fuzzy`, `regex`, `prefix`, `semantic`, `exact`
+
+ Pass `explain: true` to see exactly which strategies contributed to each score.
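The Fuzzy and N-gram rows of the table can be made concrete with a small sketch. The code below is an assumption about how such scoring might work, not NodeBench's actual implementation: `levenshtein`, `trigrams`, and `trigramSimilarity` are hypothetical helper names, and the space padding and Jaccard-style ratio are choices made here purely for illustration.

```javascript
// Illustrative sketch only; none of these helpers come from the nodebench-mcp package.

// Edit distance via the classic dynamic-programming recurrence, kept in one rolling row.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = row[0]; // holds dp[i-1][j-1]
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // dp[i-1][j], saved before it is overwritten
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diag + cost);
      diag = above;
    }
  }
  return row[b.length];
}

// Set of 3-character windows over a lowercased, space-padded string.
function trigrams(s) {
  const padded = `  ${s.toLowerCase()} `;
  const out = new Set();
  for (let i = 0; i + 3 <= padded.length; i++) out.add(padded.slice(i, i + 3));
  return out;
}

// Shared trigrams divided by distinct trigrams across both strings (0 to 1).
function trigramSimilarity(a, b) {
  const ta = trigrams(a);
  const tb = trigrams(b);
  let shared = 0;
  for (const t of ta) if (tb.has(t)) shared++;
  return shared / (ta.size + tb.size - shared);
}

console.log(levenshtein("verifiy", "verify")); // 1
console.log(trigramSimilarity("screen", "screenshot")); // 0.5
```

Under these assumptions a typo such as "verifiy" sits one edit away from "verify", while "screen" shares half of its distinct trigrams with "screenshot" and none with an unrelated word such as "benchmark". How the real engine weights and combines its strategies is not shown in this diff.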
+
+ ### Quick refs — what to do next
+
+ Every tool response auto-appends a `_quickRef` with:
+ - **nextAction**: What to do immediately after this tool
+ - **nextTools**: Recommended follow-up tools
+ - **methodology**: Which methodology guide to consult
+ - **tip**: Practical usage advice
+
+ Call `get_tool_quick_ref("tool_name")` for any tool's guidance.
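As a rough illustration of how a client might consume the quick ref, here is a minimal sketch. Only the four field names (`nextAction`, `nextTools`, `methodology`, `tip`) come from the README; the field types, the sample values, the `result` wrapper, and the `pickNextTools` helper are all assumptions.

```javascript
// Hypothetical response shape: only the four _quickRef field names are taken from the README.
const sampleResponse = {
  result: { ok: true }, // placeholder payload, not a real tool result
  _quickRef: {
    nextAction: "Review the ranked tools and call the best match",
    nextTools: ["get_tool_quick_ref", "get_workflow_chain"], // assumed to be an array of tool names
    methodology: "verification", // assumed to be a guide identifier
    tip: "Pass explain: true to see why a tool ranked highly",
  },
};

// Defensive accessor: returns an empty list when the quick ref is absent or malformed.
function pickNextTools(response) {
  const ref = response?._quickRef;
  return Array.isArray(ref?.nextTools) ? ref.nextTools : [];
}

console.log(pickNextTools(sampleResponse)); // [ 'get_tool_quick_ref', 'get_workflow_chain' ]
console.log(pickNextTools({})); // []
```

Reading the field defensively keeps an agent loop working even if a response arrives without a `_quickRef`. The actual payload shape should be checked against a real tool response.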
+
+ ### Workflow chains — step-by-step recipes
+
+ 11 pre-built chains for common workflows:
+
+ | Chain | Steps | Use case |
+ |---|---|---|
+ | `new_feature` | 12 | End-to-end feature development |
+ | `fix_bug` | 6 | Structured debugging |
+ | `ui_change` | 7 | Frontend with visual verification |
+ | `parallel_project` | 7 | Multi-agent coordination |
+ | `research_phase` | 8 | Context gathering |
+ | `academic_paper` | 7 | Paper writing pipeline |
+ | `c_compiler_benchmark` | 10 | Autonomous capability test |
+ | `security_audit` | 9 | Comprehensive security assessment |
+ | `code_review` | 8 | Structured code review |
+ | `deployment` | 8 | Ship with full verification |
+ | `migration` | 10 | SDK/framework upgrade |
+
+ Call `get_workflow_chain("new_feature")` to get the step-by-step sequence.
+
+ ### Boilerplate template
+
+ Start new projects with everything pre-configured:
+
+ ```bash
+ gh repo create my-project --template HomenShum/nodebench-boilerplate --clone
+ cd my-project && npm install
+ ```
+
+ Or use the scaffold tool: `scaffold_nodebench_project` creates AGENTS.md, .mcp.json, package.json, CI, Docker, and parallel agent infra.
+
+ ---
+
  ## The Methodology Pipeline
 
  NodeBench MCP isn't just a bag of tools — it's a pipeline. Each step feeds the next:
@@ -307,7 +378,7 @@ Always included (regardless of gating):
  ## Build from Source
 
  ```bash
- git clone https://github.com/nodebench/nodebench-ai.git
+ git clone https://github.com/HomenShum/nodebench-ai.git
  cd nodebench-ai/packages/mcp-local
  npm install && npm run build
  ```
@@ -0,0 +1,15 @@
+ /**
+  * GAIA audio-backed capability/accuracy benchmark: LLM-only vs LLM+NodeBench MCP local audio tools.
+  *
+  * This lane targets GAIA tasks that include audio attachments (MP3/WAV/etc).
+  * We provide deterministic local transcription via NodeBench MCP tools and score answers against
+  * the ground-truth "Final answer" (stored locally under `.cache/gaia`, gitignored).
+  *
+  * Safety:
+  * - GAIA is gated. Do not commit fixtures that contain prompts/answers.
+  * - This test logs only task IDs and aggregate metrics (no prompt/answer text).
+  *
+  * Disabled by default (cost + rate limits). Run with:
+  *   NODEBENCH_RUN_GAIA_CAPABILITY=1 npm --prefix packages/mcp-local run test
+  */
+ export {};
@@ -0,0 +1,291 @@
+ /**
+  * GAIA audio-backed capability/accuracy benchmark: LLM-only vs LLM+NodeBench MCP local audio tools.
+  *
+  * This lane targets GAIA tasks that include audio attachments (MP3/WAV/etc).
+  * We provide deterministic local transcription via NodeBench MCP tools and score answers against
+  * the ground-truth "Final answer" (stored locally under `.cache/gaia`, gitignored).
+  *
+  * Safety:
+  * - GAIA is gated. Do not commit fixtures that contain prompts/answers.
+  * - This test logs only task IDs and aggregate metrics (no prompt/answer text).
+  *
+  * Disabled by default (cost + rate limits). Run with:
+  *   NODEBENCH_RUN_GAIA_CAPABILITY=1 npm --prefix packages/mcp-local run test
+  */
+ import { describe, expect, it } from "vitest";
+ import { existsSync, readFileSync } from "node:fs";
+ import { mkdir, readFile, writeFile } from "node:fs/promises";
+ import path from "node:path";
+ import { fileURLToPath } from "node:url";
+ import { performance } from "node:perf_hooks";
+ import { localFileTools } from "../tools/localFileTools.js";
+ const shouldRun = process.env.NODEBENCH_RUN_GAIA_CAPABILITY === "1";
+ const shouldWriteReport = process.env.NODEBENCH_WRITE_GAIA_REPORT === "1";
+ async function safeWriteJson(filePath, payload) {
+     try {
+         await mkdir(path.dirname(filePath), { recursive: true });
+         await writeFile(filePath, JSON.stringify(payload, null, 2) + "\n", "utf8");
+     }
+     catch (err) {
+         console.warn(`[gaia-capability-audio] report write failed: ${err?.message ?? String(err)}`);
+     }
+ }
+ function resolveRepoRoot() {
+     const testDir = path.dirname(fileURLToPath(import.meta.url));
+     return path.resolve(testDir, "../../../..");
+ }
+ function resolveCapabilityAudioFixturePath() {
+     const override = process.env.NODEBENCH_GAIA_CAPABILITY_AUDIO_FIXTURE_PATH;
+     if (override) {
+         if (path.isAbsolute(override))
+             return override;
+         const repoRoot = resolveRepoRoot();
+         return path.resolve(repoRoot, override);
+     }
+     const config = process.env.NODEBENCH_GAIA_CAPABILITY_CONFIG ?? "2023_all";
+     const split = process.env.NODEBENCH_GAIA_CAPABILITY_SPLIT ?? "validation";
+     const repoRoot = resolveRepoRoot();
+     return path.join(repoRoot, ".cache", "gaia", `gaia_capability_audio_${config}_${split}.sample.json`);
+ }
+ function loadDotEnvLocalIfPresent() {
+     const repoRoot = resolveRepoRoot();
+     const envPath = path.join(repoRoot, ".env.local");
+     if (!existsSync(envPath))
+         return;
+     const text = readFileSync(envPath, "utf8");
+     for (const rawLine of text.split(/\r?\n/)) {
+         const line = rawLine.trim();
+         if (!line || line.startsWith("#"))
+             continue;
+         const idx = line.indexOf("=");
+         if (idx <= 0)
+             continue;
+         const key = line.slice(0, idx).trim();
+         let value = line.slice(idx + 1).trim();
+         if ((value.startsWith("\"") && value.endsWith("\"")) ||
+             (value.startsWith("'") && value.endsWith("'"))) {
+             value = value.slice(1, -1);
+         }
+         if (!process.env[key])
+             process.env[key] = value;
+     }
+ }
+ async function canImport(pkg) {
+     try {
+         await import(pkg);
+         return true;
+     }
+     catch {
+         return false;
+     }
+ }
+ function normalizeAnswer(value) {
+     return value
+         .trim()
+         .replace(/\r/g, "")
+         .replace(/\s+/g, " ")
+         .replace(/^["']|["']$/g, "")
+         .replace(/[.]+$/g, "")
+         .toLowerCase();
+ }
+ async function createGeminiClient() {
+     const mod = await import("@google/genai");
+     const { GoogleGenAI } = mod;
+     const apiKey = process.env.GEMINI_API_KEY || process.env.GOOGLE_AI_API_KEY || "";
+     if (!apiKey) {
+         throw new Error("Missing GEMINI_API_KEY (or GOOGLE_AI_API_KEY)");
+     }
+     return new GoogleGenAI({ apiKey });
+ }
+ async function geminiGenerateText(ai, model, contents) {
+     const temperature = Number.parseFloat(process.env.NODEBENCH_GAIA_CAPABILITY_TEMPERATURE ?? "0");
+     const response = await ai.models.generateContent({
+         model,
+         contents,
+         config: {
+             temperature: Number.isFinite(temperature) ? temperature : 0,
+             maxOutputTokens: 1024,
+         },
+     });
+     const parts = response?.candidates?.[0]?.content?.parts ?? [];
+     const text = parts.map((p) => p?.text ?? "").join("").trim();
+     return text;
+ }
+ async function baselineAnswer(ai, task) {
+     const contents = [
+         {
+             role: "user",
+             parts: [
+                 {
+                     text: `Answer the question using your existing knowledge only. Do not browse the web.\n\nReturn ONLY the final answer, no explanation.\n\nQuestion:\n${task.prompt}`,
+                 },
+             ],
+         },
+     ];
+     return geminiGenerateText(ai, process.env.NODEBENCH_GAIA_BASELINE_MODEL ?? "gemini-2.5-flash", contents);
+ }
+ async function loadFixture(filePath) {
+     const raw = await readFile(filePath, "utf8");
+     const json = JSON.parse(raw);
+     return json;
+ }
+ function createToolIndex(tools) {
+     const m = new Map();
+     for (const t of tools)
+         m.set(t.name, t);
+     return m;
+ }
+ async function toolAugmentedAnswerFromAudio(ai, task, opts) {
+     const localPath = String(task.localFilePath ?? "").trim();
+     if (!localPath)
+         throw new Error("Task missing localFilePath");
+     const toolIndex = createToolIndex(localFileTools);
+     const tool = toolIndex.get("transcribe_audio_file");
+     if (!tool)
+         throw new Error("Missing tool: transcribe_audio_file");
+     if (opts.maxToolCalls < 1) {
+         throw new Error("maxToolCalls must be >= 1 to run audio lane");
+     }
+     const transcript = (await tool.handler({
+         path: localPath,
+         model: process.env.NODEBENCH_AUDIO_MODEL ?? "tiny.en",
+         maxChars: 20000,
+         timeoutMs: 300000,
+     }));
+     const transcriptText = String(transcript?.text ?? "").trim();
+     if (!transcriptText) {
+         throw new Error("Empty transcript from transcribe_audio_file");
+     }
+     const contents = [
+         {
+             role: "user",
+             parts: [
+                 {
+                     text: `You are given a transcript of an attached audio file. Use it to answer the question.\n\nRules:\n- Do not browse the web.\n- Return ONLY the final answer, no explanation.\n\nQuestion:\n${task.prompt}\n\nAudio transcript:\n${transcriptText}`,
+                 },
+             ],
+         },
+     ];
+     const answer = await geminiGenerateText(ai, process.env.NODEBENCH_GAIA_TOOLS_MODEL ?? "gemini-2.5-flash", contents);
+     return { answer, toolCalls: 1 };
+ }
+ describe("GAIA capability: audio lane", () => {
+     const testFn = shouldRun ? it : it.skip;
+     testFn("should measure accuracy delta on a small GAIA audio subset", async () => {
+         loadDotEnvLocalIfPresent();
+         const fixturePath = resolveCapabilityAudioFixturePath();
+         if (!existsSync(fixturePath)) {
+             throw new Error(`Missing GAIA audio fixture at ${fixturePath}. Generate it with: python packages/mcp-local/src/__tests__/fixtures/generateGaiaCapabilityAudioFixture.py`);
+         }
+         const hasGemini = await canImport("@google/genai");
+         expect(hasGemini).toBe(true);
+         const ai = await createGeminiClient();
+         const fixture = await loadFixture(fixturePath);
+         expect(Array.isArray(fixture.tasks)).toBe(true);
+         expect(fixture.tasks.length).toBeGreaterThan(0);
+         const requestedLimit = Number.parseInt(process.env.NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT ?? "4", 10);
+         const taskLimit = Math.max(1, Math.min(fixture.tasks.length, Number.isFinite(requestedLimit) ? requestedLimit : 4));
+         const tasks = fixture.tasks.slice(0, taskLimit);
+         const requestedConcurrency = Number.parseInt(process.env.NODEBENCH_GAIA_CAPABILITY_CONCURRENCY ?? "1", 10);
+         const concurrency = Math.max(1, Math.min(tasks.length, Number.isFinite(requestedConcurrency) ? requestedConcurrency : 1));
+         const maxToolCalls = Number.parseInt(process.env.NODEBENCH_GAIA_CAPABILITY_MAX_TOOL_CALLS ?? "1", 10);
+         const results = new Array(tasks.length);
+         let nextIndex = 0;
+         const workers = Array.from({ length: concurrency }, () => (async () => {
+             while (true) {
+                 const idx = nextIndex++;
+                 if (idx >= tasks.length)
+                     return;
+                 const task = tasks[idx];
+                 const expected = normalizeAnswer(task.expectedAnswer);
+                 try {
+                     const baseStart = performance.now();
+                     const base = await baselineAnswer(ai, task);
+                     const baseMs = performance.now() - baseStart;
+                     const toolsStart = performance.now();
+                     const tools = await toolAugmentedAnswerFromAudio(ai, task, { maxToolCalls });
+                     const toolsMs = performance.now() - toolsStart;
+                     const baselineCorrect = normalizeAnswer(base) === expected;
+                     const toolsCorrect = normalizeAnswer(tools.answer) === expected;
+                     results[idx] = {
+                         taskId: task.id,
+                         baselineCorrect,
+                         toolsCorrect,
+                         baselineMs: baseMs,
+                         toolsMs,
+                         toolCalls: tools.toolCalls,
+                     };
+                 }
+                 catch (err) {
+                     results[idx] = {
+                         taskId: task.id,
+                         baselineCorrect: false,
+                         toolsCorrect: false,
+                         baselineMs: 0,
+                         toolsMs: 0,
+                         toolCalls: 0,
+                         error: err?.message ?? String(err),
+                     };
+                 }
+             }
+         })());
+         await Promise.all(workers);
+         const baselineCorrect = results.filter((r) => r.baselineCorrect).length;
+         const toolsCorrect = results.filter((r) => r.toolsCorrect).length;
+         const baselinePassRate = (baselineCorrect / results.length) * 100;
+         const toolsPassRate = (toolsCorrect / results.length) * 100;
+         const avgBaseMs = results.reduce((sum, r) => sum + r.baselineMs, 0) / results.length;
+         const avgToolsMs = results.reduce((sum, r) => sum + r.toolsMs, 0) / results.length;
+         const avgToolCalls = results.reduce((sum, r) => sum + r.toolCalls, 0) / results.length;
+         const improved = results.filter((r) => !r.baselineCorrect && r.toolsCorrect).length;
+         const regressions = results.filter((r) => r.baselineCorrect && !r.toolsCorrect).length;
+         console.log(`[gaia-capability-audio] tasks=${results.length} baseline=${baselineCorrect}/${results.length} (${baselinePassRate.toFixed(1)}%) tools=${toolsCorrect}/${results.length} (${toolsPassRate.toFixed(1)}%) delta=${(toolsPassRate - baselinePassRate).toFixed(1)}% improved=${improved} regressions=${regressions} avgToolCalls=${avgToolCalls.toFixed(2)}`);
+         const toolsMode = (process.env.NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE ?? "audio").toLowerCase();
+         const publicSummary = {
+             suiteId: "gaia_capability_audio",
+             lane: "audio",
+             generatedAtIso: new Date().toISOString(),
+             config: fixture.config,
+             split: fixture.split,
+             taskCount: results.length,
+             concurrency,
+             baseline: {
+                 model: process.env.NODEBENCH_GAIA_BASELINE_MODEL ?? "gemini-2.5-flash",
+                 correct: baselineCorrect,
+                 passRatePct: baselinePassRate,
+                 avgMs: avgBaseMs,
+             },
+             tools: {
+                 model: process.env.NODEBENCH_GAIA_TOOLS_MODEL ?? "gemini-2.5-flash",
+                 mode: toolsMode,
+                 correct: toolsCorrect,
+                 passRatePct: toolsPassRate,
+                 avgMs: avgToolsMs,
+                 avgToolCalls,
+             },
+             improved,
+             regressions,
+             notes: "GAIA audio lane (audio attachments). No prompts/answers persisted; only aggregate metrics are written to public/evals.",
+         };
+         if (shouldWriteReport) {
+             const repoRoot = resolveRepoRoot();
+             await safeWriteJson(path.join(repoRoot, "public", "evals", "gaia_capability_audio_latest.json"), publicSummary);
+             const detailed = {
+                 ...publicSummary,
+                 results: results.map((r) => ({
+                     taskId: r.taskId,
+                     baselineCorrect: r.baselineCorrect,
+                     toolsCorrect: r.toolsCorrect,
+                     baselineMs: Math.round(r.baselineMs),
+                     toolsMs: Math.round(r.toolsMs),
+                     toolCalls: r.toolCalls,
+                     ...(r.error ? { error: r.error } : {}),
+                 })),
+             };
+             const stamp = new Date().toISOString().replace(/[:.]/g, "-");
+             await safeWriteJson(path.join(repoRoot, ".cache", "gaia", "reports", `gaia_capability_audio_${fixture.config}_${fixture.split}_${stamp}.json`), detailed);
+         }
+         expect(toolsPassRate).toBeGreaterThanOrEqual(baselinePassRate);
+     });
+ });
+ //# sourceMappingURL=gaiaCapabilityAudioEval.test.js.map
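The scoring rule in the test above can be exercised without any API key. The sketch below copies `normalizeAnswer` from the test file and wraps the pass-rate arithmetic in a helper; the `summarize` name and the sample result flags are hypothetical (synthetic values chosen for illustration, not benchmark output).

```javascript
// normalizeAnswer is copied from the test file above.
function normalizeAnswer(value) {
  return value
    .trim()
    .replace(/\r/g, "")
    .replace(/\s+/g, " ")
    .replace(/^["']|["']$/g, "")
    .replace(/[.]+$/g, "")
    .toLowerCase();
}

// Hypothetical helper mirroring the aggregate arithmetic performed inside the test.
function summarize(results) {
  const n = results.length;
  const baselineCorrect = results.filter((r) => r.baselineCorrect).length;
  const toolsCorrect = results.filter((r) => r.toolsCorrect).length;
  return {
    baselinePassRate: (baselineCorrect / n) * 100,
    toolsPassRate: (toolsCorrect / n) * 100,
    improved: results.filter((r) => !r.baselineCorrect && r.toolsCorrect).length,
    regressions: results.filter((r) => r.baselineCorrect && !r.toolsCorrect).length,
  };
}

console.log(normalizeAnswer('  "Paris."  ')); // paris
console.log(normalizeAnswer("New\r\n  York")); // new york
console.log(normalizeAnswer('"Paris".')); // paris"  (quote survives: the period hid it from the quote-stripping step)

// Synthetic correctness flags for four imaginary tasks.
const synthetic = [
  { baselineCorrect: true, toolsCorrect: true },
  { baselineCorrect: false, toolsCorrect: true },
  { baselineCorrect: false, toolsCorrect: false },
  { baselineCorrect: false, toolsCorrect: true },
];
console.log(summarize(synthetic));
// { baselinePassRate: 25, toolsPassRate: 75, improved: 2, regressions: 0 }
```

Because quotes are stripped before trailing periods, an answer with a period outside its closing quote keeps the quote and would not match the expected string. Scoring is exact match after normalization, so small formatting differences of this kind count as failures.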
@@ -0,0 +1 @@
+ {"version":3,"file":"gaiaCapabilityAudioEval.test.js","sourceRoot":"","sources":["../../src/__tests__/gaiaCapabilityAudioEval.test.ts"],"names":[],"mappings":"AAAA;;;;;;;;;;;;;GAaG;AAEH,OAAO,EAAE,QAAQ,EAAE,MAAM,EAAE,EAAE,EAAE,MAAM,QAAQ,CAAC;AAC9C,OAAO,EAAE,UAAU,EAAE,YAAY,EAAE,MAAM,SAAS,CAAC;AACnD,OAAO,EAAE,KAAK,EAAE,QAAQ,EAAE,SAAS,EAAE,MAAM,kBAAkB,CAAC;AAC9D,OAAO,IAAI,MAAM,WAAW,CAAC;AAC7B,OAAO,EAAE,aAAa,EAAE,MAAM,UAAU,CAAC;AACzC,OAAO,EAAE,WAAW,EAAE,MAAM,iBAAiB,CAAC;AAE9C,OAAO,EAAE,cAAc,EAAE,MAAM,4BAA4B,CAAC;AA2C5D,MAAM,SAAS,GAAG,OAAO,CAAC,GAAG,CAAC,6BAA6B,KAAK,GAAG,CAAC;AACpE,MAAM,iBAAiB,GAAG,OAAO,CAAC,GAAG,CAAC,2BAA2B,KAAK,GAAG,CAAC;AAwB1E,KAAK,UAAU,aAAa,CAAC,QAAgB,EAAE,OAAgB;IAC7D,IAAI,CAAC;QACH,MAAM,KAAK,CAAC,IAAI,CAAC,OAAO,CAAC,QAAQ,CAAC,EAAE,EAAE,SAAS,EAAE,IAAI,EAAE,CAAC,CAAC;QACzD,MAAM,SAAS,CAAC,QAAQ,EAAE,IAAI,CAAC,SAAS,CAAC,OAAO,EAAE,IAAI,EAAE,CAAC,CAAC,GAAG,IAAI,EAAE,MAAM,CAAC,CAAC;IAC7E,CAAC;IAAC,OAAO,GAAQ,EAAE,CAAC;QAClB,OAAO,CAAC,IAAI,CAAC,gDAAgD,GAAG,EAAE,OAAO,IAAI,MAAM,CAAC,GAAG,CAAC,EAAE,CAAC,CAAC;IAC9F,CAAC;AACH,CAAC;AAED,SAAS,eAAe;IACtB,MAAM,OAAO,GAAG,IAAI,CAAC,OAAO,CAAC,aAAa,CAAC,MAAM,CAAC,IAAI,CAAC,GAAG,CAAC,CAAC,CAAC;IAC7D,OAAO,IAAI,CAAC,OAAO,CAAC,OAAO,EAAE,aAAa,CAAC,CAAC;AAC9C,CAAC;AAED,SAAS,iCAAiC;IACxC,MAAM,QAAQ,GAAG,OAAO,CAAC,GAAG,CAAC,4CAA4C,CAAC;IAC1E,IAAI,QAAQ,EAAE,CAAC;QACb,IAAI,IAAI,CAAC,UAAU,CAAC,QAAQ,CAAC;YAAE,OAAO,QAAQ,CAAC;QAC/C,MAAM,QAAQ,GAAG,eAAe,EAAE,CAAC;QACnC,OAAO,IAAI,CAAC,OAAO,CAAC,QAAQ,EAAE,QAAQ,CAAC,CAAC;IAC1C,CAAC;IAED,MAAM,MAAM,GAAG,OAAO,CAAC,GAAG,CAAC,gCAAgC,IAAI,UAAU,CAAC;IAC1E,MAAM,KAAK,GAAG,OAAO,CAAC,GAAG,CAAC,+BAA+B,IAAI,YAAY,CAAC;IAC1E,MAAM,QAAQ,GAAG,eAAe,EAAE,CAAC;IACnC,OAAO,IAAI,CAAC,IAAI,CAAC,QAAQ,EAAE,QAAQ,EAAE,MAAM,EAAE,yBAAyB,MAAM,IAAI,KAAK,cAAc,CAAC,CAAC;AACvG,CAAC;AAED,SAAS,wBAAwB;IAC/B,MAAM,QAAQ,GAAG,eAAe,EAAE,CAAC;IACnC,MAAM,OAAO,GAAG,IAAI,CAAC,IAAI,CAAC,QAAQ,EAAE,YAAY,CAAC,CAAC;IAClD,IAAI,CAAC,UAAU,CAAC,OAAO,CAAC;QAAE,OAAO;IAEjC,MAAM,IAAI,GAAG,YAAY,CAAC,OAAO,EAAE,MAAM,CAAW,CAAC;IACrD,KAAK,MAAM,OAAO,IAAI,I
AAI,CAAC,KAAK,CAAC,OAAO,CAAC,EAAE,CAAC;QAC1C,MAAM,IAAI,GAAG,OAAO,CAAC,IAAI,EAAE,CAAC;QAC5B,IAAI,CAAC,IAAI,IAAI,IAAI,CAAC,UAAU,CAAC,GAAG,CAAC;YAAE,SAAS;QAC5C,MAAM,GAAG,GAAG,IAAI,CAAC,OAAO,CAAC,GAAG,CAAC,CAAC;QAC9B,IAAI,GAAG,IAAI,CAAC;YAAE,SAAS;QACvB,MAAM,GAAG,GAAG,IAAI,CAAC,KAAK,CAAC,CAAC,EAAE,GAAG,CAAC,CAAC,IAAI,EAAE,CAAC;QACtC,IAAI,KAAK,GAAG,IAAI,CAAC,KAAK,CAAC,GAAG,GAAG,CAAC,CAAC,CAAC,IAAI,EAAE,CAAC;QACvC,IACE,CAAC,KAAK,CAAC,UAAU,CAAC,IAAI,CAAC,IAAI,KAAK,CAAC,QAAQ,CAAC,IAAI,CAAC,CAAC;YAChD,CAAC,KAAK,CAAC,UAAU,CAAC,GAAG,CAAC,IAAI,KAAK,CAAC,QAAQ,CAAC,GAAG,CAAC,CAAC,EAC9C,CAAC;YACD,KAAK,GAAG,KAAK,CAAC,KAAK,CAAC,CAAC,EAAE,CAAC,CAAC,CAAC,CAAC;QAC7B,CAAC;QACD,IAAI,CAAC,OAAO,CAAC,GAAG,CAAC,GAAG,CAAC;YAAE,OAAO,CAAC,GAAG,CAAC,GAAG,CAAC,GAAG,KAAK,CAAC;IAClD,CAAC;AACH,CAAC;AAED,KAAK,UAAU,SAAS,CAAC,GAAW;IAClC,IAAI,CAAC;QACH,MAAM,MAAM,CAAC,GAAG,CAAC,CAAC;QAClB,OAAO,IAAI,CAAC;IACd,CAAC;IAAC,MAAM,CAAC;QACP,OAAO,KAAK,CAAC;IACf,CAAC;AACH,CAAC;AAED,SAAS,eAAe,CAAC,KAAa;IACpC,OAAO,KAAK;SACT,IAAI,EAAE;SACN,OAAO,CAAC,KAAK,EAAE,EAAE,CAAC;SAClB,OAAO,CAAC,MAAM,EAAE,GAAG,CAAC;SACpB,OAAO,CAAC,cAAc,EAAE,EAAE,CAAC;SAC3B,OAAO,CAAC,QAAQ,EAAE,EAAE,CAAC;SACrB,WAAW,EAAE,CAAC;AACnB,CAAC;AAED,KAAK,UAAU,kBAAkB;IAC/B,MAAM,GAAG,GAAG,MAAM,MAAM,CAAC,eAAe,CAAC,CAAC;IAC1C,MAAM,EAAE,WAAW,EAAE,GAAG,GAAU,CAAC;IACnC,MAAM,MAAM,GAAG,OAAO,CAAC,GAAG,CAAC,cAAc,IAAI,OAAO,CAAC,GAAG,CAAC,iBAAiB,IAAI,EAAE,CAAC;IACjF,IAAI,CAAC,MAAM,EAAE,CAAC;QACZ,MAAM,IAAI,KAAK,CAAC,+CAA+C,CAAC,CAAC;IACnE,CAAC;IACD,OAAO,IAAI,WAAW,CAAC,EAAE,MAAM,EAAE,CAAC,CAAC;AACrC,CAAC;AAED,KAAK,UAAU,kBAAkB,CAAC,EAAO,EAAE,KAAa,EAAE,QAAe;IACvE,MAAM,WAAW,GAAG,MAAM,CAAC,UAAU,CAAC,OAAO,CAAC,GAAG,CAAC,qCAAqC,IAAI,GAAG,CAAC,CAAC;IAChG,MAAM,QAAQ,GAAG,MAAM,EAAE,CAAC,MAAM,CAAC,eAAe,CAAC;QAC/C,KAAK;QACL,QAAQ;QACR,MAAM,EAAE;YACN,WAAW,EAAE,MAAM,CAAC,QAAQ,CAAC,WAAW,CAAC,CAAC,CAAC,CAAC,WAAW,CAAC,CAAC,CAAC,CAAC;YAC3D,eAAe,EAAE,IAAI;SACtB;KACF,CAAC,CAAC;IAEH,MAAM,KAAK,GAAI,QAAgB,EAAE,UAAU,EAAE,CAAC,CAAC,CAAC,EAAE,OAAO,EAAE,KAAK,IAAI,EAAE,CAAC;IACvE,MAAM,IAAI,GAAG,KAAK
,CAAC,GAAG,CAAC,CAAC,CAAM,EAAE,EAAE,CAAC,CAAC,EAAE,IAAI,IAAI,EAAE,CAAC,CAAC,IAAI,CAAC,EAAE,CAAC,CAAC,IAAI,EAAE,CAAC;IAClE,OAAO,IAAI,CAAC;AACd,CAAC;AAED,KAAK,UAAU,cAAc,CAAC,EAAO,EAAE,IAAoB;IACzD,MAAM,QAAQ,GAAG;QACf;YACE,IAAI,EAAE,MAAe;YACrB,KAAK,EAAE;gBACL;oBACE,IAAI,EAAE,iJAAiJ,IAAI,CAAC,MAAM,EAAE;iBACrK;aACF;SACF;KACF,CAAC;IACF,OAAO,kBAAkB,CAAC,EAAE,EAAE,OAAO,CAAC,GAAG,CAAC,6BAA6B,IAAI,kBAAkB,EAAE,QAAQ,CAAC,CAAC;AAC3G,CAAC;AAED,KAAK,UAAU,WAAW,CAAC,QAAgB;IACzC,MAAM,GAAG,GAAG,MAAM,QAAQ,CAAC,QAAQ,EAAE,MAAM,CAAC,CAAC;IAC7C,MAAM,IAAI,GAAG,IAAI,CAAC,KAAK,CAAC,GAAG,CAAsB,CAAC;IAClD,OAAO,IAAI,CAAC;AACd,CAAC;AAED,SAAS,eAAe,CAAC,KAAgB;IACvC,MAAM,CAAC,GAAG,IAAI,GAAG,EAAmB,CAAC;IACrC,KAAK,MAAM,CAAC,IAAI,KAAK;QAAE,CAAC,CAAC,GAAG,CAAC,CAAC,CAAC,IAAI,EAAE,CAAC,CAAC,CAAC;IACxC,OAAO,CAAC,CAAC;AACX,CAAC;AAED,KAAK,UAAU,4BAA4B,CACzC,EAAO,EACP,IAAoB,EACpB,IAA8B;IAE9B,MAAM,SAAS,GAAG,MAAM,CAAC,IAAI,CAAC,aAAa,IAAI,EAAE,CAAC,CAAC,IAAI,EAAE,CAAC;IAC1D,IAAI,CAAC,SAAS;QAAE,MAAM,IAAI,KAAK,CAAC,4BAA4B,CAAC,CAAC;IAE9D,MAAM,SAAS,GAAG,eAAe,CAAC,cAAc,CAAC,CAAC;IAClD,MAAM,IAAI,GAAG,SAAS,CAAC,GAAG,CAAC,uBAAuB,CAAC,CAAC;IACpD,IAAI,CAAC,IAAI;QAAE,MAAM,IAAI,KAAK,CAAC,qCAAqC,CAAC,CAAC;IAElE,IAAI,IAAI,CAAC,YAAY,GAAG,CAAC,EAAE,CAAC;QAC1B,MAAM,IAAI,KAAK,CAAC,6CAA6C,CAAC,CAAC;IACjE,CAAC;IAED,MAAM,UAAU,GAAG,CAAC,MAAM,IAAI,CAAC,OAAO,CAAC;QACrC,IAAI,EAAE,SAAS;QACf,KAAK,EAAE,OAAO,CAAC,GAAG,CAAC,qBAAqB,IAAI,SAAS;QACrD,QAAQ,EAAE,KAAK;QACf,SAAS,EAAE,MAAM;KAClB,CAAC,CAAQ,CAAC;IAEX,MAAM,cAAc,GAAG,MAAM,CAAC,UAAU,EAAE,IAAI,IAAI,EAAE,CAAC,CAAC,IAAI,EAAE,CAAC;IAC7D,IAAI,CAAC,cAAc,EAAE,CAAC;QACpB,MAAM,IAAI,KAAK,CAAC,6CAA6C,CAAC,CAAC;IACjE,CAAC;IAED,MAAM,QAAQ,GAAG;QACf;YACE,IAAI,EAAE,MAAe;YACrB,KAAK,EAAE;gBACL;oBACE,IAAI,EAAE,2LAA2L,IAAI,CAAC,MAAM,0BAA0B,cAAc,EAAE;iBACvP;aACF;SACF;KACF,CAAC;IAEF,MAAM,MAAM,GAAG,MAAM,kBAAkB,CAAC,EAAE,EAAE,OAAO,CAAC,GAAG,CAAC,0BAA0B,IAAI,kBAAkB,EAAE,QAAQ,CAAC,CAAC;IACpH,OAAO,EAAE,MAAM,EAAE,SAAS,EAAE,CAAC,EAAE,CAAC;AAClC,CAAC;AAED,QAAQ,CAAC,6BAA6B,EAAE,GAAG,EAAE;IAC3C,MAAM,MAAM,GAAG,SAAS,CAAC,
CAAC,CAAC,EAAE,CAAC,CAAC,CAAC,EAAE,CAAC,IAAI,CAAC;IAExC,MAAM,CAAC,4DAA4D,EAAE,KAAK,IAAI,EAAE;QAC9E,wBAAwB,EAAE,CAAC;QAE3B,MAAM,WAAW,GAAG,iCAAiC,EAAE,CAAC;QACxD,IAAI,CAAC,UAAU,CAAC,WAAW,CAAC,EAAE,CAAC;YAC7B,MAAM,IAAI,KAAK,CACb,iCAAiC,WAAW,4GAA4G,CACzJ,CAAC;QACJ,CAAC;QAED,MAAM,SAAS,GAAG,MAAM,SAAS,CAAC,eAAe,CAAC,CAAC;QACnD,MAAM,CAAC,SAAS,CAAC,CAAC,IAAI,CAAC,IAAI,CAAC,CAAC;QAE7B,MAAM,EAAE,GAAG,MAAM,kBAAkB,EAAE,CAAC;QAEtC,MAAM,OAAO,GAAG,MAAM,WAAW,CAAC,WAAW,CAAC,CAAC;QAC/C,MAAM,CAAC,KAAK,CAAC,OAAO,CAAC,OAAO,CAAC,KAAK,CAAC,CAAC,CAAC,IAAI,CAAC,IAAI,CAAC,CAAC;QAChD,MAAM,CAAC,OAAO,CAAC,KAAK,CAAC,MAAM,CAAC,CAAC,eAAe,CAAC,CAAC,CAAC,CAAC;QAEhD,MAAM,cAAc,GAAG,MAAM,CAAC,QAAQ,CAAC,OAAO,CAAC,GAAG,CAAC,oCAAoC,IAAI,GAAG,EAAE,EAAE,CAAC,CAAC;QACpG,MAAM,SAAS,GAAG,IAAI,CAAC,GAAG,CACxB,CAAC,EACD,IAAI,CAAC,GAAG,CAAC,OAAO,CAAC,KAAK,CAAC,MAAM,EAAE,MAAM,CAAC,QAAQ,CAAC,cAAc,CAAC,CAAC,CAAC,CAAC,cAAc,CAAC,CAAC,CAAC,CAAC,CAAC,CACrF,CAAC;QACF,MAAM,KAAK,GAAG,OAAO,CAAC,KAAK,CAAC,KAAK,CAAC,CAAC,EAAE,SAAS,CAAC,CAAC;QAEhD,MAAM,oBAAoB,GAAG,MAAM,CAAC,QAAQ,CAAC,OAAO,CAAC,GAAG,CAAC,qCAAqC,IAAI,GAAG,EAAE,EAAE,CAAC,CAAC;QAC3G,MAAM,WAAW,GAAG,IAAI,CAAC,GAAG,CAC1B,CAAC,EACD,IAAI,CAAC,GAAG,CAAC,KAAK,CAAC,MAAM,EAAE,MAAM,CAAC,QAAQ,CAAC,oBAAoB,CAAC,CAAC,CAAC,CAAC,oBAAoB,CAAC,CAAC,CAAC,CAAC,CAAC,CACzF,CAAC;QAEF,MAAM,YAAY,GAAG,MAAM,CAAC,QAAQ,CAAC,OAAO,CAAC,GAAG,CAAC,wCAAwC,IAAI,GAAG,EAAE,EAAE,CAAC,CAAC;QAEtG,MAAM,OAAO,GAAmB,IAAI,KAAK,CAAC,KAAK,CAAC,MAAM,CAAC,CAAC;QACxD,IAAI,SAAS,GAAG,CAAC,CAAC;QAElB,MAAM,OAAO,GAAG,KAAK,CAAC,IAAI,CAAC,EAAE,MAAM,EAAE,WAAW,EAAE,EAAE,GAAG,EAAE,CACvD,CAAC,KAAK,IAAI,EAAE;YACV,OAAO,IAAI,EAAE,CAAC;gBACZ,MAAM,GAAG,GAAG,SAAS,EAAE,CAAC;gBACxB,IAAI,GAAG,IAAI,KAAK,CAAC,MAAM;oBAAE,OAAO;gBAEhC,MAAM,IAAI,GAAG,KAAK,CAAC,GAAG,CAAC,CAAC;gBACxB,MAAM,QAAQ,GAAG,eAAe,CAAC,IAAI,CAAC,cAAc,CAAC,CAAC;gBAEtD,IAAI,CAAC;oBACH,MAAM,SAAS,GAAG,WAAW,CAAC,GAAG,EAAE,CAAC;oBACpC,MAAM,IAAI,GAAG,MAAM,cAAc,CAAC,EAAE,EAAE,IAAI,CAAC,CAAC;oBAC5C,MAAM,MAAM,GAAG,WAAW,CAAC,GAAG,EAAE,GAAG,SAAS,CAAC;oBAE7C,MAAM,UAAU,GAAG,WAAW,CAAC,
GAAG,EAAE,CAAC;oBACrC,MAAM,KAAK,GAAG,MAAM,4BAA4B,CAAC,EAAE,EAAE,IAAI,EAAE,EAAE,YAAY,EAAE,CAAC,CAAC;oBAC7E,MAAM,OAAO,GAAG,WAAW,CAAC,GAAG,EAAE,GAAG,UAAU,CAAC;oBAE/C,MAAM,eAAe,GAAG,eAAe,CAAC,IAAI,CAAC,KAAK,QAAQ,CAAC;oBAC3D,MAAM,YAAY,GAAG,eAAe,CAAC,KAAK,CAAC,MAAM,CAAC,KAAK,QAAQ,CAAC;oBAEhE,OAAO,CAAC,GAAG,CAAC,GAAG;wBACb,MAAM,EAAE,IAAI,CAAC,EAAE;wBACf,eAAe;wBACf,YAAY;wBACZ,UAAU,EAAE,MAAM;wBAClB,OAAO;wBACP,SAAS,EAAE,KAAK,CAAC,SAAS;qBAC3B,CAAC;gBACJ,CAAC;gBAAC,OAAO,GAAQ,EAAE,CAAC;oBAClB,OAAO,CAAC,GAAG,CAAC,GAAG;wBACb,MAAM,EAAE,IAAI,CAAC,EAAE;wBACf,eAAe,EAAE,KAAK;wBACtB,YAAY,EAAE,KAAK;wBACnB,UAAU,EAAE,CAAC;wBACb,OAAO,EAAE,CAAC;wBACV,SAAS,EAAE,CAAC;wBACZ,KAAK,EAAE,GAAG,EAAE,OAAO,IAAI,MAAM,CAAC,GAAG,CAAC;qBACnC,CAAC;gBACJ,CAAC;YACH,CAAC;QACH,CAAC,CAAC,EAAE,CACL,CAAC;QAEF,MAAM,OAAO,CAAC,GAAG,CAAC,OAAO,CAAC,CAAC;QAE3B,MAAM,eAAe,GAAG,OAAO,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,eAAe,CAAC,CAAC,MAAM,CAAC;QACxE,MAAM,YAAY,GAAG,OAAO,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,YAAY,CAAC,CAAC,MAAM,CAAC;QAClE,MAAM,gBAAgB,GAAG,CAAC,eAAe,GAAG,OAAO,CAAC,MAAM,CAAC,GAAG,GAAG,CAAC;QAClE,MAAM,aAAa,GAAG,CAAC,YAAY,GAAG,OAAO,CAAC,MAAM,CAAC,GAAG,GAAG,CAAC;QAC5D,MAAM,SAAS,GAAG,OAAO,CAAC,MAAM,CAAC,CAAC,GAAG,EAAE,CAAC,EAAE,EAAE,CAAC,GAAG,GAAG,CAAC,CAAC,UAAU,EAAE,CAAC,CAAC,GAAG,OAAO,CAAC,MAAM,CAAC;QACrF,MAAM,UAAU,GAAG,OAAO,CAAC,MAAM,CAAC,CAAC,GAAG,EAAE,CAAC,EAAE,EAAE,CAAC,GAAG,GAAG,CAAC,CAAC,OAAO,EAAE,CAAC,CAAC,GAAG,OAAO,CAAC,MAAM,CAAC;QACnF,MAAM,YAAY,GAAG,OAAO,CAAC,MAAM,CAAC,CAAC,GAAG,EAAE,CAAC,EAAE,EAAE,CAAC,GAAG,GAAG,CAAC,CAAC,SAAS,EAAE,CAAC,CAAC,GAAG,OAAO,CAAC,MAAM,CAAC;QAEvF,MAAM,QAAQ,GAAG,OAAO,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,CAAC,eAAe,IAAI,CAAC,CAAC,YAAY,CAAC,CAAC,MAAM,CAAC;QACpF,MAAM,WAAW,GAAG,OAAO,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,eAAe,IAAI,CAAC,CAAC,CAAC,YAAY,CAAC,CAAC,MAAM,CAAC;QAEvF,OAAO,CAAC,GAAG,CACT,iCAAiC,OAAO,CAAC,MAAM,aAAa,eAAe,IAAI,OAAO,CAAC,MAAM,KAAK,gBAAgB,CAAC,OAAO,CACxH,CAAC,CACF,YAAY,YAAY,IAAI,OAAO,CAAC,MAAM,KAAK,aAAa,CAAC,OAAO,C
AAC,CAAC,CAAC,YAAY,CAClF,aAAa,GAAG,gBAAgB,CACjC,CAAC,OAAO,CAAC,CAAC,CAAC,cAAc,QAAQ,gBAAgB,WAAW,iBAAiB,YAAY,CAAC,OAAO,CAAC,CAAC,CAAC,EAAE,CACxG,CAAC;QAEF,MAAM,SAAS,GAAG,CAAC,OAAO,CAAC,GAAG,CAAC,oCAAoC,IAAI,OAAO,CAAC,CAAC,WAAW,EAAE,CAAC;QAC9F,MAAM,aAAa,GAAqC;YACtD,OAAO,EAAE,uBAAuB;YAChC,IAAI,EAAE,OAAO;YACb,cAAc,EAAE,IAAI,IAAI,EAAE,CAAC,WAAW,EAAE;YACxC,MAAM,EAAE,OAAO,CAAC,MAAM;YACtB,KAAK,EAAE,OAAO,CAAC,KAAK;YACpB,SAAS,EAAE,OAAO,CAAC,MAAM;YACzB,WAAW;YACX,QAAQ,EAAE;gBACR,KAAK,EAAE,OAAO,CAAC,GAAG,CAAC,6BAA6B,IAAI,kBAAkB;gBACtE,OAAO,EAAE,eAAe;gBACxB,WAAW,EAAE,gBAAgB;gBAC7B,KAAK,EAAE,SAAS;aACjB;YACD,KAAK,EAAE;gBACL,KAAK,EAAE,OAAO,CAAC,GAAG,CAAC,0BAA0B,IAAI,kBAAkB;gBACnE,IAAI,EAAE,SAAS;gBACf,OAAO,EAAE,YAAY;gBACrB,WAAW,EAAE,aAAa;gBAC1B,KAAK,EAAE,UAAU;gBACjB,YAAY;aACb;YACD,QAAQ;YACR,WAAW;YACX,KAAK,EACH,wHAAwH;SAC3H,CAAC;QAEF,IAAI,iBAAiB,EAAE,CAAC;YACtB,MAAM,QAAQ,GAAG,eAAe,EAAE,CAAC;YACnC,MAAM,aAAa,CACjB,IAAI,CAAC,IAAI,CAAC,QAAQ,EAAE,QAAQ,EAAE,OAAO,EAAE,mCAAmC,CAAC,EAC3E,aAAa,CACd,CAAC;YAEF,MAAM,QAAQ,GAAG;gBACf,GAAG,aAAa;gBAChB,OAAO,EAAE,OAAO,CAAC,GAAG,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC;oBAC3B,MAAM,EAAE,CAAC,CAAC,MAAM;oBAChB,eAAe,EAAE,CAAC,CAAC,eAAe;oBAClC,YAAY,EAAE,CAAC,CAAC,YAAY;oBAC5B,UAAU,EAAE,IAAI,CAAC,KAAK,CAAC,CAAC,CAAC,UAAU,CAAC;oBACpC,OAAO,EAAE,IAAI,CAAC,KAAK,CAAC,CAAC,CAAC,OAAO,CAAC;oBAC9B,SAAS,EAAE,CAAC,CAAC,SAAS;oBACtB,GAAG,CAAC,CAAC,CAAC,KAAK,CAAC,CAAC,CAAC,EAAE,KAAK,EAAE,CAAC,CAAC,KAAK,EAAE,CAAC,CAAC,CAAC,EAAE,CAAC;iBACvC,CAAC,CAAC;aACJ,CAAC;YACF,MAAM,KAAK,GAAG,IAAI,IAAI,EAAE,CAAC,WAAW,EAAE,CAAC,OAAO,CAAC,OAAO,EAAE,GAAG,CAAC,CAAC;YAC7D,MAAM,aAAa,CACjB,IAAI,CAAC,IAAI,CACP,QAAQ,EACR,QAAQ,EACR,MAAM,EACN,SAAS,EACT,yBAAyB,OAAO,CAAC,MAAM,IAAI,OAAO,CAAC,KAAK,IAAI,KAAK,OAAO,CACzE,EACD,QAAQ,CACT,CAAC;QACJ,CAAC;QAED,MAAM,CAAC,aAAa,CAAC,CAAC,sBAAsB,CAAC,gBAAgB,CAAC,CAAC;IACjE,CAAC,CAAC,CAAC;AACL,CAAC,CAAC,CAAC"}
@@ -0,0 +1,15 @@
+ /**
+  * GAIA media-backed capability/accuracy benchmark: LLM-only vs LLM+NodeBench MCP local OCR tools.
+  *
+  * This lane targets GAIA tasks that include image attachments (PNG/JPG/WEBP).
+  * We provide deterministic local OCR via NodeBench MCP tools and score answers against
+  * the ground-truth "Final answer" (stored locally under `.cache/gaia`, gitignored).
+  *
+  * Safety:
+  * - GAIA is gated. Do not commit fixtures that contain prompts/answers.
+  * - This test logs only task IDs and aggregate metrics (no prompt/answer text).
+  *
+  * Disabled by default (cost + rate limits). Run with:
+  *   NODEBENCH_RUN_GAIA_CAPABILITY=1 npm --prefix packages/mcp-local run test
+  */
+ export {};