nodebench-mcp 2.8.1 → 2.8.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +72 -1
- package/dist/__tests__/gaiaCapabilityAudioEval.test.d.ts +15 -0
- package/dist/__tests__/gaiaCapabilityAudioEval.test.js +291 -0
- package/dist/__tests__/gaiaCapabilityAudioEval.test.js.map +1 -0
- package/dist/__tests__/gaiaCapabilityMediaEval.test.d.ts +15 -0
- package/dist/__tests__/gaiaCapabilityMediaEval.test.js +421 -0
- package/dist/__tests__/gaiaCapabilityMediaEval.test.js.map +1 -0
- package/dist/__tests__/tools.test.js +153 -4
- package/dist/__tests__/tools.test.js.map +1 -1
- package/dist/tools/localFileTools.d.ts +1 -0
- package/dist/tools/localFileTools.js +353 -0
- package/dist/tools/localFileTools.js.map +1 -1
- package/dist/tools/toolRegistry.js +93 -6
- package/dist/tools/toolRegistry.js.map +1 -1
- package/package.json +10 -5
package/README.md CHANGED
@@ -184,6 +184,77 @@ Notes:

 ---

+## Progressive Discovery (v2.8.1)
+
+129 tools is a lot. The progressive disclosure system helps agents find exactly what they need:
+
+### Multi-modal search engine
+
+```
+> discover_tools("verify my implementation")
+```
+
+The `discover_tools` search engine scores tools using **9 parallel strategies**:
+
+| Strategy | What it does | Example |
+|---|---|---|
+| Keyword | Exact/partial word matching on name, tags, description | "benchmark" → `benchmark_models` |
+| Fuzzy | Levenshtein distance — tolerates typos | "verifiy" → `start_verification_cycle` |
+| N-gram | Trigram similarity for partial words | "screen" → `capture_ui_screenshot` |
+| Prefix | Matches tool name starts | "cap" → `capture_*` tools |
+| Semantic | Synonym expansion (30 word families) | "check" also finds "verify", "validate" |
+| TF-IDF | Rare tags score higher than common ones | "c-compiler" scores higher than "test" |
+| Regex | Pattern matching | `"^run_.*loop$"` → `run_closed_loop` |
+| Bigram | Phrase matching | "quality gate" matched as unit |
+| Domain boost | Related categories boosted together | verification + quality_gate cluster |
+
+**6 search modes**: `hybrid` (default, all strategies), `fuzzy`, `regex`, `prefix`, `semantic`, `exact`
+
+Pass `explain: true` to see exactly which strategies contributed to each score.
+
+### Quick refs — what to do next
+
+Every tool response auto-appends a `_quickRef` with:
+- **nextAction**: What to do immediately after this tool
+- **nextTools**: Recommended follow-up tools
+- **methodology**: Which methodology guide to consult
+- **tip**: Practical usage advice
+
+Call `get_tool_quick_ref("tool_name")` for any tool's guidance.
+
+### Workflow chains — step-by-step recipes
+
+11 pre-built chains for common workflows:
+
+| Chain | Steps | Use case |
+|---|---|---|
+| `new_feature` | 12 | End-to-end feature development |
+| `fix_bug` | 6 | Structured debugging |
+| `ui_change` | 7 | Frontend with visual verification |
+| `parallel_project` | 7 | Multi-agent coordination |
+| `research_phase` | 8 | Context gathering |
+| `academic_paper` | 7 | Paper writing pipeline |
+| `c_compiler_benchmark` | 10 | Autonomous capability test |
+| `security_audit` | 9 | Comprehensive security assessment |
+| `code_review` | 8 | Structured code review |
+| `deployment` | 8 | Ship with full verification |
+| `migration` | 10 | SDK/framework upgrade |
+
+Call `get_workflow_chain("new_feature")` to get the step-by-step sequence.
+
+### Boilerplate template
+
+Start new projects with everything pre-configured:
+
+```bash
+gh repo create my-project --template HomenShum/nodebench-boilerplate --clone
+cd my-project && npm install
+```
+
+Or use the scaffold tool: `scaffold_nodebench_project` creates AGENTS.md, .mcp.json, package.json, CI, Docker, and parallel agent infra.
+
+---
+
 ## The Methodology Pipeline

 NodeBench MCP isn't just a bag of tools — it's a pipeline. Each step feeds the next:
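
As an aside on the `discover_tools` description in the README hunk: the sketch below shows one way two of the listed strategies, fuzzy (Levenshtein) matching and trigram similarity, could be blended into a single score with a per-strategy breakdown. It is an illustration only, not the package's implementation. The function names (`levenshtein`, `trigramSimilarity`, `scoreTool`, `rankTools`), the equal 0.5 weights, and the shape of the returned `explain` object are all assumptions made for this example.

```javascript
// Hypothetical sketch: NOT nodebench-mcp's actual scoring code.

// Levenshtein edit distance using a single rolling row.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = row[0]; // dp[i-1][j-1] for the first column
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // dp[i-1][j], saved before overwrite
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diag + cost);
      diag = above;
    }
  }
  return row[b.length];
}

// Set of character trigrams, padded so word boundaries count.
function trigrams(s) {
  const padded = `  ${s.toLowerCase()} `;
  const grams = new Set();
  for (let i = 0; i + 3 <= padded.length; i++) grams.add(padded.slice(i, i + 3));
  return grams;
}

// Dice coefficient over trigram sets, in the range [0, 1].
function trigramSimilarity(a, b) {
  const ta = trigrams(a);
  const tb = trigrams(b);
  let shared = 0;
  for (const g of ta) if (tb.has(g)) shared++;
  return (2 * shared) / (ta.size + tb.size);
}

// Score one tool against a query; weights are arbitrary (assumed) values.
function scoreTool(query, tool) {
  const q = query.toLowerCase();
  let fuzzy = 0;
  let ngram = 0;
  for (const word of tool.name.toLowerCase().split("_")) {
    const maxLen = Math.max(q.length, word.length) || 1;
    fuzzy = Math.max(fuzzy, 1 - levenshtein(q, word) / maxLen);
    ngram = Math.max(ngram, trigramSimilarity(q, word));
  }
  return { score: 0.5 * fuzzy + 0.5 * ngram, explain: { fuzzy, ngram } };
}

// Rank tools by descending combined score.
function rankTools(query, tools) {
  return tools
    .map((t) => ({ name: t.name, ...scoreTool(query, t) }))
    .sort((x, y) => y.score - x.score);
}

const demo = rankTools("verifiy", [
  { name: "capture_ui_screenshot" },
  { name: "start_verification_cycle" },
]);
console.log(demo.map((r) => `${r.name}: ${r.score.toFixed(2)}`));
```

Keeping the per-strategy numbers next to the combined score is what makes an `explain`-style option cheap to offer: the breakdown is already computed and only needs to be returned.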
@@ -307,7 +378,7 @@ Always included (regardless of gating):
 ## Build from Source

 ```bash
-git clone https://github.com/
+git clone https://github.com/HomenShum/nodebench-ai.git
 cd nodebench-ai/packages/mcp-local
 npm install && npm run build
 ```
package/dist/__tests__/gaiaCapabilityAudioEval.test.d.ts ADDED
@@ -0,0 +1,15 @@
+/**
+ * GAIA audio-backed capability/accuracy benchmark: LLM-only vs LLM+NodeBench MCP local audio tools.
+ *
+ * This lane targets GAIA tasks that include audio attachments (MP3/WAV/etc).
+ * We provide deterministic local transcription via NodeBench MCP tools and score answers against
+ * the ground-truth "Final answer" (stored locally under `.cache/gaia`, gitignored).
+ *
+ * Safety:
+ * - GAIA is gated. Do not commit fixtures that contain prompts/answers.
+ * - This test logs only task IDs and aggregate metrics (no prompt/answer text).
+ *
+ * Disabled by default (cost + rate limits). Run with:
+ *   NODEBENCH_RUN_GAIA_CAPABILITY=1 npm --prefix packages/mcp-local run test
+ */
+export {};
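
The header comment describes scoring model answers against a ground-truth final answer and comparing an LLM-only lane with a tool-augmented lane. A minimal sketch of that kind of normalized exact-match scoring is shown here. The normalization steps mirror the `normalizeAnswer` function in the compiled test in this diff; the `passRate` helper and the sample rows are made up for illustration and are not GAIA data.

```javascript
// Same normalization steps as the test's normalizeAnswer:
// trim, drop CRs, collapse whitespace, strip one leading/trailing quote,
// strip trailing periods, lowercase.
function normalizeAnswer(value) {
  return value
    .trim()
    .replace(/\r/g, "")
    .replace(/\s+/g, " ")
    .replace(/^["']|["']$/g, "")
    .replace(/[.]+$/g, "")
    .toLowerCase();
}

// Hypothetical helper: percentage of rows where the given boolean field is true.
function passRate(rows, key) {
  if (rows.length === 0) return 0;
  return (rows.filter((r) => r[key]).length / rows.length) * 100;
}

// Made-up sample answers (NOT real benchmark content).
const samples = [
  { expected: "Paris", baseline: "I am not sure", tools: "  PARIS. " },
  { expected: "42", baseline: "42", tools: "'42'" },
];

const rows = samples.map((s) => {
  const expected = normalizeAnswer(s.expected);
  return {
    baselineCorrect: normalizeAnswer(s.baseline) === expected,
    toolsCorrect: normalizeAnswer(s.tools) === expected,
  };
});

const baselineRate = passRate(rows, "baselineCorrect");
const toolsRate = passRate(rows, "toolsCorrect");
const delta = toolsRate - baselineRate;
console.log(`baseline=${baselineRate}% tools=${toolsRate}% delta=${delta}%`);
```

Exact match after light normalization is strict by design: it avoids needing a judge model, at the cost of marking differently formatted but equivalent answers as wrong.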
package/dist/__tests__/gaiaCapabilityAudioEval.test.js ADDED
@@ -0,0 +1,291 @@
+/**
+ * GAIA audio-backed capability/accuracy benchmark: LLM-only vs LLM+NodeBench MCP local audio tools.
+ *
+ * This lane targets GAIA tasks that include audio attachments (MP3/WAV/etc).
+ * We provide deterministic local transcription via NodeBench MCP tools and score answers against
+ * the ground-truth "Final answer" (stored locally under `.cache/gaia`, gitignored).
+ *
+ * Safety:
+ * - GAIA is gated. Do not commit fixtures that contain prompts/answers.
+ * - This test logs only task IDs and aggregate metrics (no prompt/answer text).
+ *
+ * Disabled by default (cost + rate limits). Run with:
+ *   NODEBENCH_RUN_GAIA_CAPABILITY=1 npm --prefix packages/mcp-local run test
+ */
+import { describe, expect, it } from "vitest";
+import { existsSync, readFileSync } from "node:fs";
+import { mkdir, readFile, writeFile } from "node:fs/promises";
+import path from "node:path";
+import { fileURLToPath } from "node:url";
+import { performance } from "node:perf_hooks";
+import { localFileTools } from "../tools/localFileTools.js";
+const shouldRun = process.env.NODEBENCH_RUN_GAIA_CAPABILITY === "1";
+const shouldWriteReport = process.env.NODEBENCH_WRITE_GAIA_REPORT === "1";
+async function safeWriteJson(filePath, payload) {
+    try {
+        await mkdir(path.dirname(filePath), { recursive: true });
+        await writeFile(filePath, JSON.stringify(payload, null, 2) + "\n", "utf8");
+    }
+    catch (err) {
+        console.warn(`[gaia-capability-audio] report write failed: ${err?.message ?? String(err)}`);
+    }
+}
+function resolveRepoRoot() {
+    const testDir = path.dirname(fileURLToPath(import.meta.url));
+    return path.resolve(testDir, "../../../..");
+}
+function resolveCapabilityAudioFixturePath() {
+    const override = process.env.NODEBENCH_GAIA_CAPABILITY_AUDIO_FIXTURE_PATH;
+    if (override) {
+        if (path.isAbsolute(override))
+            return override;
+        const repoRoot = resolveRepoRoot();
+        return path.resolve(repoRoot, override);
+    }
+    const config = process.env.NODEBENCH_GAIA_CAPABILITY_CONFIG ?? "2023_all";
+    const split = process.env.NODEBENCH_GAIA_CAPABILITY_SPLIT ?? "validation";
+    const repoRoot = resolveRepoRoot();
+    return path.join(repoRoot, ".cache", "gaia", `gaia_capability_audio_${config}_${split}.sample.json`);
+}
+function loadDotEnvLocalIfPresent() {
+    const repoRoot = resolveRepoRoot();
+    const envPath = path.join(repoRoot, ".env.local");
+    if (!existsSync(envPath))
+        return;
+    const text = readFileSync(envPath, "utf8");
+    for (const rawLine of text.split(/\r?\n/)) {
+        const line = rawLine.trim();
+        if (!line || line.startsWith("#"))
+            continue;
+        const idx = line.indexOf("=");
+        if (idx <= 0)
+            continue;
+        const key = line.slice(0, idx).trim();
+        let value = line.slice(idx + 1).trim();
+        if ((value.startsWith("\"") && value.endsWith("\"")) ||
+            (value.startsWith("'") && value.endsWith("'"))) {
+            value = value.slice(1, -1);
+        }
+        if (!process.env[key])
+            process.env[key] = value;
+    }
+}
+async function canImport(pkg) {
+    try {
+        await import(pkg);
+        return true;
+    }
+    catch {
+        return false;
+    }
+}
+function normalizeAnswer(value) {
+    return value
+        .trim()
+        .replace(/\r/g, "")
+        .replace(/\s+/g, " ")
+        .replace(/^["']|["']$/g, "")
+        .replace(/[.]+$/g, "")
+        .toLowerCase();
+}
+async function createGeminiClient() {
+    const mod = await import("@google/genai");
+    const { GoogleGenAI } = mod;
+    const apiKey = process.env.GEMINI_API_KEY || process.env.GOOGLE_AI_API_KEY || "";
+    if (!apiKey) {
+        throw new Error("Missing GEMINI_API_KEY (or GOOGLE_AI_API_KEY)");
+    }
+    return new GoogleGenAI({ apiKey });
+}
+async function geminiGenerateText(ai, model, contents) {
+    const temperature = Number.parseFloat(process.env.NODEBENCH_GAIA_CAPABILITY_TEMPERATURE ?? "0");
+    const response = await ai.models.generateContent({
+        model,
+        contents,
+        config: {
+            temperature: Number.isFinite(temperature) ? temperature : 0,
+            maxOutputTokens: 1024,
+        },
+    });
+    const parts = response?.candidates?.[0]?.content?.parts ?? [];
+    const text = parts.map((p) => p?.text ?? "").join("").trim();
+    return text;
+}
+async function baselineAnswer(ai, task) {
+    const contents = [
+        {
+            role: "user",
+            parts: [
+                {
+                    text: `Answer the question using your existing knowledge only. Do not browse the web.\n\nReturn ONLY the final answer, no explanation.\n\nQuestion:\n${task.prompt}`,
+                },
+            ],
+        },
+    ];
+    return geminiGenerateText(ai, process.env.NODEBENCH_GAIA_BASELINE_MODEL ?? "gemini-2.5-flash", contents);
+}
+async function loadFixture(filePath) {
+    const raw = await readFile(filePath, "utf8");
+    const json = JSON.parse(raw);
+    return json;
+}
+function createToolIndex(tools) {
+    const m = new Map();
+    for (const t of tools)
+        m.set(t.name, t);
+    return m;
+}
+async function toolAugmentedAnswerFromAudio(ai, task, opts) {
+    const localPath = String(task.localFilePath ?? "").trim();
+    if (!localPath)
+        throw new Error("Task missing localFilePath");
+    const toolIndex = createToolIndex(localFileTools);
+    const tool = toolIndex.get("transcribe_audio_file");
+    if (!tool)
+        throw new Error("Missing tool: transcribe_audio_file");
+    if (opts.maxToolCalls < 1) {
+        throw new Error("maxToolCalls must be >= 1 to run audio lane");
+    }
+    const transcript = (await tool.handler({
+        path: localPath,
+        model: process.env.NODEBENCH_AUDIO_MODEL ?? "tiny.en",
+        maxChars: 20000,
+        timeoutMs: 300000,
+    }));
+    const transcriptText = String(transcript?.text ?? "").trim();
+    if (!transcriptText) {
+        throw new Error("Empty transcript from transcribe_audio_file");
+    }
+    const contents = [
+        {
+            role: "user",
+            parts: [
+                {
+                    text: `You are given a transcript of an attached audio file. Use it to answer the question.\n\nRules:\n- Do not browse the web.\n- Return ONLY the final answer, no explanation.\n\nQuestion:\n${task.prompt}\n\nAudio transcript:\n${transcriptText}`,
+                },
+            ],
+        },
+    ];
+    const answer = await geminiGenerateText(ai, process.env.NODEBENCH_GAIA_TOOLS_MODEL ?? "gemini-2.5-flash", contents);
+    return { answer, toolCalls: 1 };
+}
+describe("GAIA capability: audio lane", () => {
+    const testFn = shouldRun ? it : it.skip;
+    testFn("should measure accuracy delta on a small GAIA audio subset", async () => {
+        loadDotEnvLocalIfPresent();
+        const fixturePath = resolveCapabilityAudioFixturePath();
+        if (!existsSync(fixturePath)) {
+            throw new Error(`Missing GAIA audio fixture at ${fixturePath}. Generate it with: python packages/mcp-local/src/__tests__/fixtures/generateGaiaCapabilityAudioFixture.py`);
+        }
+        const hasGemini = await canImport("@google/genai");
+        expect(hasGemini).toBe(true);
+        const ai = await createGeminiClient();
+        const fixture = await loadFixture(fixturePath);
+        expect(Array.isArray(fixture.tasks)).toBe(true);
+        expect(fixture.tasks.length).toBeGreaterThan(0);
+        const requestedLimit = Number.parseInt(process.env.NODEBENCH_GAIA_CAPABILITY_TASK_LIMIT ?? "4", 10);
+        const taskLimit = Math.max(1, Math.min(fixture.tasks.length, Number.isFinite(requestedLimit) ? requestedLimit : 4));
+        const tasks = fixture.tasks.slice(0, taskLimit);
+        const requestedConcurrency = Number.parseInt(process.env.NODEBENCH_GAIA_CAPABILITY_CONCURRENCY ?? "1", 10);
+        const concurrency = Math.max(1, Math.min(tasks.length, Number.isFinite(requestedConcurrency) ? requestedConcurrency : 1));
+        const maxToolCalls = Number.parseInt(process.env.NODEBENCH_GAIA_CAPABILITY_MAX_TOOL_CALLS ?? "1", 10);
+        const results = new Array(tasks.length);
+        let nextIndex = 0;
+        const workers = Array.from({ length: concurrency }, () => (async () => {
+            while (true) {
+                const idx = nextIndex++;
+                if (idx >= tasks.length)
+                    return;
+                const task = tasks[idx];
+                const expected = normalizeAnswer(task.expectedAnswer);
+                try {
+                    const baseStart = performance.now();
+                    const base = await baselineAnswer(ai, task);
+                    const baseMs = performance.now() - baseStart;
+                    const toolsStart = performance.now();
+                    const tools = await toolAugmentedAnswerFromAudio(ai, task, { maxToolCalls });
+                    const toolsMs = performance.now() - toolsStart;
+                    const baselineCorrect = normalizeAnswer(base) === expected;
+                    const toolsCorrect = normalizeAnswer(tools.answer) === expected;
+                    results[idx] = {
+                        taskId: task.id,
+                        baselineCorrect,
+                        toolsCorrect,
+                        baselineMs: baseMs,
+                        toolsMs,
+                        toolCalls: tools.toolCalls,
+                    };
+                }
+                catch (err) {
+                    results[idx] = {
+                        taskId: task.id,
+                        baselineCorrect: false,
+                        toolsCorrect: false,
+                        baselineMs: 0,
+                        toolsMs: 0,
+                        toolCalls: 0,
+                        error: err?.message ?? String(err),
+                    };
+                }
+            }
+        })());
+        await Promise.all(workers);
+        const baselineCorrect = results.filter((r) => r.baselineCorrect).length;
+        const toolsCorrect = results.filter((r) => r.toolsCorrect).length;
+        const baselinePassRate = (baselineCorrect / results.length) * 100;
+        const toolsPassRate = (toolsCorrect / results.length) * 100;
+        const avgBaseMs = results.reduce((sum, r) => sum + r.baselineMs, 0) / results.length;
+        const avgToolsMs = results.reduce((sum, r) => sum + r.toolsMs, 0) / results.length;
+        const avgToolCalls = results.reduce((sum, r) => sum + r.toolCalls, 0) / results.length;
+        const improved = results.filter((r) => !r.baselineCorrect && r.toolsCorrect).length;
+        const regressions = results.filter((r) => r.baselineCorrect && !r.toolsCorrect).length;
+        console.log(`[gaia-capability-audio] tasks=${results.length} baseline=${baselineCorrect}/${results.length} (${baselinePassRate.toFixed(1)}%) tools=${toolsCorrect}/${results.length} (${toolsPassRate.toFixed(1)}%) delta=${(toolsPassRate - baselinePassRate).toFixed(1)}% improved=${improved} regressions=${regressions} avgToolCalls=${avgToolCalls.toFixed(2)}`);
+        const toolsMode = (process.env.NODEBENCH_GAIA_CAPABILITY_TOOLS_MODE ?? "audio").toLowerCase();
+        const publicSummary = {
+            suiteId: "gaia_capability_audio",
+            lane: "audio",
+            generatedAtIso: new Date().toISOString(),
+            config: fixture.config,
+            split: fixture.split,
+            taskCount: results.length,
+            concurrency,
+            baseline: {
+                model: process.env.NODEBENCH_GAIA_BASELINE_MODEL ?? "gemini-2.5-flash",
+                correct: baselineCorrect,
+                passRatePct: baselinePassRate,
+                avgMs: avgBaseMs,
+            },
+            tools: {
+                model: process.env.NODEBENCH_GAIA_TOOLS_MODEL ?? "gemini-2.5-flash",
+                mode: toolsMode,
+                correct: toolsCorrect,
+                passRatePct: toolsPassRate,
+                avgMs: avgToolsMs,
+                avgToolCalls,
+            },
+            improved,
+            regressions,
+            notes: "GAIA audio lane (audio attachments). No prompts/answers persisted; only aggregate metrics are written to public/evals.",
+        };
+        if (shouldWriteReport) {
+            const repoRoot = resolveRepoRoot();
+            await safeWriteJson(path.join(repoRoot, "public", "evals", "gaia_capability_audio_latest.json"), publicSummary);
+            const detailed = {
+                ...publicSummary,
+                results: results.map((r) => ({
+                    taskId: r.taskId,
+                    baselineCorrect: r.baselineCorrect,
+                    toolsCorrect: r.toolsCorrect,
+                    baselineMs: Math.round(r.baselineMs),
+                    toolsMs: Math.round(r.toolsMs),
+                    toolCalls: r.toolCalls,
+                    ...(r.error ? { error: r.error } : {}),
+                })),
+            };
+            const stamp = new Date().toISOString().replace(/[:.]/g, "-");
+            await safeWriteJson(path.join(repoRoot, ".cache", "gaia", "reports", `gaia_capability_audio_${fixture.config}_${fixture.split}_${stamp}.json`), detailed);
+        }
+        expect(toolsPassRate).toBeGreaterThanOrEqual(baselinePassRate);
+    });
+});
+//# sourceMappingURL=gaiaCapabilityAudioEval.test.js.map
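
The test in this file bounds its parallelism by starting a fixed number of async workers that each pull the next task index from a shared counter, recording a per-task error instead of aborting the run. A standalone sketch of that pattern is given here. The name `mapWithConcurrency`, the `{ ok, value, error }` result shape, and the sample workload are assumptions for illustration and are not part of the package.

```javascript
// Hypothetical helper illustrating the shared-index worker pool pattern.
async function mapWithConcurrency(items, concurrency, fn) {
  const results = new Array(items.length);
  let nextIndex = 0;
  // Never start more workers than there are items, and always at least one.
  const workerCount = Math.max(1, Math.min(items.length, concurrency));
  const workers = Array.from({ length: workerCount }, async () => {
    while (true) {
      // No lock is needed: the read-and-increment runs synchronously,
      // and workers only interleave at `await` points.
      const idx = nextIndex++;
      if (idx >= items.length) return;
      try {
        results[idx] = { ok: true, value: await fn(items[idx], idx) };
      } catch (err) {
        // Record the failure in place so one bad task does not stop the rest.
        results[idx] = { ok: false, error: err?.message ?? String(err) };
      }
    }
  });
  await Promise.all(workers);
  return results;
}

// Example workload: one task fails, the others still complete in order.
mapWithConcurrency([1, 2, 3], 2, async (n) => {
  if (n === 2) throw new Error("boom");
  return n * 10;
}).then((out) => console.log(JSON.stringify(out)));
```

Because each worker writes to `results[idx]`, output order matches input order regardless of which task finishes first, which is what lets the aggregate metrics be computed with plain `filter` and `reduce` calls afterwards.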
package/dist/__tests__/gaiaCapabilityAudioEval.test.js.map ADDED
@@ -0,0 +1 @@
+
{"version":3,"file":"gaiaCapabilityAudioEval.test.js","sourceRoot":"","sources":["../../src/__tests__/gaiaCapabilityAudioEval.test.ts"],"names":[],"mappings":"AAAA;;;;;;;;;;;;;GAaG;AAEH,OAAO,EAAE,QAAQ,EAAE,MAAM,EAAE,EAAE,EAAE,MAAM,QAAQ,CAAC;AAC9C,OAAO,EAAE,UAAU,EAAE,YAAY,EAAE,MAAM,SAAS,CAAC;AACnD,OAAO,EAAE,KAAK,EAAE,QAAQ,EAAE,SAAS,EAAE,MAAM,kBAAkB,CAAC;AAC9D,OAAO,IAAI,MAAM,WAAW,CAAC;AAC7B,OAAO,EAAE,aAAa,EAAE,MAAM,UAAU,CAAC;AACzC,OAAO,EAAE,WAAW,EAAE,MAAM,iBAAiB,CAAC;AAE9C,OAAO,EAAE,cAAc,EAAE,MAAM,4BAA4B,CAAC;AA2C5D,MAAM,SAAS,GAAG,OAAO,CAAC,GAAG,CAAC,6BAA6B,KAAK,GAAG,CAAC;AACpE,MAAM,iBAAiB,GAAG,OAAO,CAAC,GAAG,CAAC,2BAA2B,KAAK,GAAG,CAAC;AAwB1E,KAAK,UAAU,aAAa,CAAC,QAAgB,EAAE,OAAgB;IAC7D,IAAI,CAAC;QACH,MAAM,KAAK,CAAC,IAAI,CAAC,OAAO,CAAC,QAAQ,CAAC,EAAE,EAAE,SAAS,EAAE,IAAI,EAAE,CAAC,CAAC;QACzD,MAAM,SAAS,CAAC,QAAQ,EAAE,IAAI,CAAC,SAAS,CAAC,OAAO,EAAE,IAAI,EAAE,CAAC,CAAC,GAAG,IAAI,EAAE,MAAM,CAAC,CAAC;IAC7E,CAAC;IAAC,OAAO,GAAQ,EAAE,CAAC;QAClB,OAAO,CAAC,IAAI,CAAC,gDAAgD,GAAG,EAAE,OAAO,IAAI,MAAM,CAAC,GAAG,CAAC,EAAE,CAAC,CAAC;IAC9F,CAAC;AACH,CAAC;AAED,SAAS,eAAe;IACtB,MAAM,OAAO,GAAG,IAAI,CAAC,OAAO,CAAC,aAAa,CAAC,MAAM,CAAC,IAAI,CAAC,GAAG,CAAC,CAAC,CAAC;IAC7D,OAAO,IAAI,CAAC,OAAO,CAAC,OAAO,EAAE,aAAa,CAAC,CAAC;AAC9C,CAAC;AAED,SAAS,iCAAiC;IACxC,MAAM,QAAQ,GAAG,OAAO,CAAC,GAAG,CAAC,4CAA4C,CAAC;IAC1E,IAAI,QAAQ,EAAE,CAAC;QACb,IAAI,IAAI,CAAC,UAAU,CAAC,QAAQ,CAAC;YAAE,OAAO,QAAQ,CAAC;QAC/C,MAAM,QAAQ,GAAG,eAAe,EAAE,CAAC;QACnC,OAAO,IAAI,CAAC,OAAO,CAAC,QAAQ,EAAE,QAAQ,CAAC,CAAC;IAC1C,CAAC;IAED,MAAM,MAAM,GAAG,OAAO,CAAC,GAAG,CAAC,gCAAgC,IAAI,UAAU,CAAC;IAC1E,MAAM,KAAK,GAAG,OAAO,CAAC,GAAG,CAAC,+BAA+B,IAAI,YAAY,CAAC;IAC1E,MAAM,QAAQ,GAAG,eAAe,EAAE,CAAC;IACnC,OAAO,IAAI,CAAC,IAAI,CAAC,QAAQ,EAAE,QAAQ,EAAE,MAAM,EAAE,yBAAyB,MAAM,IAAI,KAAK,cAAc,CAAC,CAAC;AACvG,CAAC;AAED,SAAS,wBAAwB;IAC/B,MAAM,QAAQ,GAAG,eAAe,EAAE,CAAC;IACnC,MAAM,OAAO,GAAG,IAAI,CAAC,IAAI,CAAC,QAAQ,EAAE,YAAY,CAAC,CAAC;IAClD,IAAI,CAAC,UAAU,CAAC,OAAO,CAAC;QAAE,OAAO;IAEjC,MAAM,IAAI,GAAG,YAAY,CAAC,OAAO,EAAE,MAAM,CAAW,CAAC;IACrD,KAAK,MAAM,OAAO,IAAI,IAA
I,CAAC,KAAK,CAAC,OAAO,CAAC,EAAE,CAAC;QAC1C,MAAM,IAAI,GAAG,OAAO,CAAC,IAAI,EAAE,CAAC;QAC5B,IAAI,CAAC,IAAI,IAAI,IAAI,CAAC,UAAU,CAAC,GAAG,CAAC;YAAE,SAAS;QAC5C,MAAM,GAAG,GAAG,IAAI,CAAC,OAAO,CAAC,GAAG,CAAC,CAAC;QAC9B,IAAI,GAAG,IAAI,CAAC;YAAE,SAAS;QACvB,MAAM,GAAG,GAAG,IAAI,CAAC,KAAK,CAAC,CAAC,EAAE,GAAG,CAAC,CAAC,IAAI,EAAE,CAAC;QACtC,IAAI,KAAK,GAAG,IAAI,CAAC,KAAK,CAAC,GAAG,GAAG,CAAC,CAAC,CAAC,IAAI,EAAE,CAAC;QACvC,IACE,CAAC,KAAK,CAAC,UAAU,CAAC,IAAI,CAAC,IAAI,KAAK,CAAC,QAAQ,CAAC,IAAI,CAAC,CAAC;YAChD,CAAC,KAAK,CAAC,UAAU,CAAC,GAAG,CAAC,IAAI,KAAK,CAAC,QAAQ,CAAC,GAAG,CAAC,CAAC,EAC9C,CAAC;YACD,KAAK,GAAG,KAAK,CAAC,KAAK,CAAC,CAAC,EAAE,CAAC,CAAC,CAAC,CAAC;QAC7B,CAAC;QACD,IAAI,CAAC,OAAO,CAAC,GAAG,CAAC,GAAG,CAAC;YAAE,OAAO,CAAC,GAAG,CAAC,GAAG,CAAC,GAAG,KAAK,CAAC;IAClD,CAAC;AACH,CAAC;AAED,KAAK,UAAU,SAAS,CAAC,GAAW;IAClC,IAAI,CAAC;QACH,MAAM,MAAM,CAAC,GAAG,CAAC,CAAC;QAClB,OAAO,IAAI,CAAC;IACd,CAAC;IAAC,MAAM,CAAC;QACP,OAAO,KAAK,CAAC;IACf,CAAC;AACH,CAAC;AAED,SAAS,eAAe,CAAC,KAAa;IACpC,OAAO,KAAK;SACT,IAAI,EAAE;SACN,OAAO,CAAC,KAAK,EAAE,EAAE,CAAC;SAClB,OAAO,CAAC,MAAM,EAAE,GAAG,CAAC;SACpB,OAAO,CAAC,cAAc,EAAE,EAAE,CAAC;SAC3B,OAAO,CAAC,QAAQ,EAAE,EAAE,CAAC;SACrB,WAAW,EAAE,CAAC;AACnB,CAAC;AAED,KAAK,UAAU,kBAAkB;IAC/B,MAAM,GAAG,GAAG,MAAM,MAAM,CAAC,eAAe,CAAC,CAAC;IAC1C,MAAM,EAAE,WAAW,EAAE,GAAG,GAAU,CAAC;IACnC,MAAM,MAAM,GAAG,OAAO,CAAC,GAAG,CAAC,cAAc,IAAI,OAAO,CAAC,GAAG,CAAC,iBAAiB,IAAI,EAAE,CAAC;IACjF,IAAI,CAAC,MAAM,EAAE,CAAC;QACZ,MAAM,IAAI,KAAK,CAAC,+CAA+C,CAAC,CAAC;IACnE,CAAC;IACD,OAAO,IAAI,WAAW,CAAC,EAAE,MAAM,EAAE,CAAC,CAAC;AACrC,CAAC;AAED,KAAK,UAAU,kBAAkB,CAAC,EAAO,EAAE,KAAa,EAAE,QAAe;IACvE,MAAM,WAAW,GAAG,MAAM,CAAC,UAAU,CAAC,OAAO,CAAC,GAAG,CAAC,qCAAqC,IAAI,GAAG,CAAC,CAAC;IAChG,MAAM,QAAQ,GAAG,MAAM,EAAE,CAAC,MAAM,CAAC,eAAe,CAAC;QAC/C,KAAK;QACL,QAAQ;QACR,MAAM,EAAE;YACN,WAAW,EAAE,MAAM,CAAC,QAAQ,CAAC,WAAW,CAAC,CAAC,CAAC,CAAC,WAAW,CAAC,CAAC,CAAC,CAAC;YAC3D,eAAe,EAAE,IAAI;SACtB;KACF,CAAC,CAAC;IAEH,MAAM,KAAK,GAAI,QAAgB,EAAE,UAAU,EAAE,CAAC,CAAC,CAAC,EAAE,OAAO,EAAE,KAAK,IAAI,EAAE,CAAC;IACvE,MAAM,IAAI,GAAG,KAAK,C
AAC,GAAG,CAAC,CAAC,CAAM,EAAE,EAAE,CAAC,CAAC,EAAE,IAAI,IAAI,EAAE,CAAC,CAAC,IAAI,CAAC,EAAE,CAAC,CAAC,IAAI,EAAE,CAAC;IAClE,OAAO,IAAI,CAAC;AACd,CAAC;AAED,KAAK,UAAU,cAAc,CAAC,EAAO,EAAE,IAAoB;IACzD,MAAM,QAAQ,GAAG;QACf;YACE,IAAI,EAAE,MAAe;YACrB,KAAK,EAAE;gBACL;oBACE,IAAI,EAAE,iJAAiJ,IAAI,CAAC,MAAM,EAAE;iBACrK;aACF;SACF;KACF,CAAC;IACF,OAAO,kBAAkB,CAAC,EAAE,EAAE,OAAO,CAAC,GAAG,CAAC,6BAA6B,IAAI,kBAAkB,EAAE,QAAQ,CAAC,CAAC;AAC3G,CAAC;AAED,KAAK,UAAU,WAAW,CAAC,QAAgB;IACzC,MAAM,GAAG,GAAG,MAAM,QAAQ,CAAC,QAAQ,EAAE,MAAM,CAAC,CAAC;IAC7C,MAAM,IAAI,GAAG,IAAI,CAAC,KAAK,CAAC,GAAG,CAAsB,CAAC;IAClD,OAAO,IAAI,CAAC;AACd,CAAC;AAED,SAAS,eAAe,CAAC,KAAgB;IACvC,MAAM,CAAC,GAAG,IAAI,GAAG,EAAmB,CAAC;IACrC,KAAK,MAAM,CAAC,IAAI,KAAK;QAAE,CAAC,CAAC,GAAG,CAAC,CAAC,CAAC,IAAI,EAAE,CAAC,CAAC,CAAC;IACxC,OAAO,CAAC,CAAC;AACX,CAAC;AAED,KAAK,UAAU,4BAA4B,CACzC,EAAO,EACP,IAAoB,EACpB,IAA8B;IAE9B,MAAM,SAAS,GAAG,MAAM,CAAC,IAAI,CAAC,aAAa,IAAI,EAAE,CAAC,CAAC,IAAI,EAAE,CAAC;IAC1D,IAAI,CAAC,SAAS;QAAE,MAAM,IAAI,KAAK,CAAC,4BAA4B,CAAC,CAAC;IAE9D,MAAM,SAAS,GAAG,eAAe,CAAC,cAAc,CAAC,CAAC;IAClD,MAAM,IAAI,GAAG,SAAS,CAAC,GAAG,CAAC,uBAAuB,CAAC,CAAC;IACpD,IAAI,CAAC,IAAI;QAAE,MAAM,IAAI,KAAK,CAAC,qCAAqC,CAAC,CAAC;IAElE,IAAI,IAAI,CAAC,YAAY,GAAG,CAAC,EAAE,CAAC;QAC1B,MAAM,IAAI,KAAK,CAAC,6CAA6C,CAAC,CAAC;IACjE,CAAC;IAED,MAAM,UAAU,GAAG,CAAC,MAAM,IAAI,CAAC,OAAO,CAAC;QACrC,IAAI,EAAE,SAAS;QACf,KAAK,EAAE,OAAO,CAAC,GAAG,CAAC,qBAAqB,IAAI,SAAS;QACrD,QAAQ,EAAE,KAAK;QACf,SAAS,EAAE,MAAM;KAClB,CAAC,CAAQ,CAAC;IAEX,MAAM,cAAc,GAAG,MAAM,CAAC,UAAU,EAAE,IAAI,IAAI,EAAE,CAAC,CAAC,IAAI,EAAE,CAAC;IAC7D,IAAI,CAAC,cAAc,EAAE,CAAC;QACpB,MAAM,IAAI,KAAK,CAAC,6CAA6C,CAAC,CAAC;IACjE,CAAC;IAED,MAAM,QAAQ,GAAG;QACf;YACE,IAAI,EAAE,MAAe;YACrB,KAAK,EAAE;gBACL;oBACE,IAAI,EAAE,2LAA2L,IAAI,CAAC,MAAM,0BAA0B,cAAc,EAAE;iBACvP;aACF;SACF;KACF,CAAC;IAEF,MAAM,MAAM,GAAG,MAAM,kBAAkB,CAAC,EAAE,EAAE,OAAO,CAAC,GAAG,CAAC,0BAA0B,IAAI,kBAAkB,EAAE,QAAQ,CAAC,CAAC;IACpH,OAAO,EAAE,MAAM,EAAE,SAAS,EAAE,CAAC,EAAE,CAAC;AAClC,CAAC;AAED,QAAQ,CAAC,6BAA6B,EAAE,GAAG,EAAE;IAC3C,MAAM,MAAM,GAAG,SAAS,CAAC,CA
AC,CAAC,EAAE,CAAC,CAAC,CAAC,EAAE,CAAC,IAAI,CAAC;IAExC,MAAM,CAAC,4DAA4D,EAAE,KAAK,IAAI,EAAE;QAC9E,wBAAwB,EAAE,CAAC;QAE3B,MAAM,WAAW,GAAG,iCAAiC,EAAE,CAAC;QACxD,IAAI,CAAC,UAAU,CAAC,WAAW,CAAC,EAAE,CAAC;YAC7B,MAAM,IAAI,KAAK,CACb,iCAAiC,WAAW,4GAA4G,CACzJ,CAAC;QACJ,CAAC;QAED,MAAM,SAAS,GAAG,MAAM,SAAS,CAAC,eAAe,CAAC,CAAC;QACnD,MAAM,CAAC,SAAS,CAAC,CAAC,IAAI,CAAC,IAAI,CAAC,CAAC;QAE7B,MAAM,EAAE,GAAG,MAAM,kBAAkB,EAAE,CAAC;QAEtC,MAAM,OAAO,GAAG,MAAM,WAAW,CAAC,WAAW,CAAC,CAAC;QAC/C,MAAM,CAAC,KAAK,CAAC,OAAO,CAAC,OAAO,CAAC,KAAK,CAAC,CAAC,CAAC,IAAI,CAAC,IAAI,CAAC,CAAC;QAChD,MAAM,CAAC,OAAO,CAAC,KAAK,CAAC,MAAM,CAAC,CAAC,eAAe,CAAC,CAAC,CAAC,CAAC;QAEhD,MAAM,cAAc,GAAG,MAAM,CAAC,QAAQ,CAAC,OAAO,CAAC,GAAG,CAAC,oCAAoC,IAAI,GAAG,EAAE,EAAE,CAAC,CAAC;QACpG,MAAM,SAAS,GAAG,IAAI,CAAC,GAAG,CACxB,CAAC,EACD,IAAI,CAAC,GAAG,CAAC,OAAO,CAAC,KAAK,CAAC,MAAM,EAAE,MAAM,CAAC,QAAQ,CAAC,cAAc,CAAC,CAAC,CAAC,CAAC,cAAc,CAAC,CAAC,CAAC,CAAC,CAAC,CACrF,CAAC;QACF,MAAM,KAAK,GAAG,OAAO,CAAC,KAAK,CAAC,KAAK,CAAC,CAAC,EAAE,SAAS,CAAC,CAAC;QAEhD,MAAM,oBAAoB,GAAG,MAAM,CAAC,QAAQ,CAAC,OAAO,CAAC,GAAG,CAAC,qCAAqC,IAAI,GAAG,EAAE,EAAE,CAAC,CAAC;QAC3G,MAAM,WAAW,GAAG,IAAI,CAAC,GAAG,CAC1B,CAAC,EACD,IAAI,CAAC,GAAG,CAAC,KAAK,CAAC,MAAM,EAAE,MAAM,CAAC,QAAQ,CAAC,oBAAoB,CAAC,CAAC,CAAC,CAAC,oBAAoB,CAAC,CAAC,CAAC,CAAC,CAAC,CACzF,CAAC;QAEF,MAAM,YAAY,GAAG,MAAM,CAAC,QAAQ,CAAC,OAAO,CAAC,GAAG,CAAC,wCAAwC,IAAI,GAAG,EAAE,EAAE,CAAC,CAAC;QAEtG,MAAM,OAAO,GAAmB,IAAI,KAAK,CAAC,KAAK,CAAC,MAAM,CAAC,CAAC;QACxD,IAAI,SAAS,GAAG,CAAC,CAAC;QAElB,MAAM,OAAO,GAAG,KAAK,CAAC,IAAI,CAAC,EAAE,MAAM,EAAE,WAAW,EAAE,EAAE,GAAG,EAAE,CACvD,CAAC,KAAK,IAAI,EAAE;YACV,OAAO,IAAI,EAAE,CAAC;gBACZ,MAAM,GAAG,GAAG,SAAS,EAAE,CAAC;gBACxB,IAAI,GAAG,IAAI,KAAK,CAAC,MAAM;oBAAE,OAAO;gBAEhC,MAAM,IAAI,GAAG,KAAK,CAAC,GAAG,CAAC,CAAC;gBACxB,MAAM,QAAQ,GAAG,eAAe,CAAC,IAAI,CAAC,cAAc,CAAC,CAAC;gBAEtD,IAAI,CAAC;oBACH,MAAM,SAAS,GAAG,WAAW,CAAC,GAAG,EAAE,CAAC;oBACpC,MAAM,IAAI,GAAG,MAAM,cAAc,CAAC,EAAE,EAAE,IAAI,CAAC,CAAC;oBAC5C,MAAM,MAAM,GAAG,WAAW,CAAC,GAAG,EAAE,GAAG,SAAS,CAAC;oBAE7C,MAAM,UAAU,GAAG,WAAW,CAAC,GA
AG,EAAE,CAAC;oBACrC,MAAM,KAAK,GAAG,MAAM,4BAA4B,CAAC,EAAE,EAAE,IAAI,EAAE,EAAE,YAAY,EAAE,CAAC,CAAC;oBAC7E,MAAM,OAAO,GAAG,WAAW,CAAC,GAAG,EAAE,GAAG,UAAU,CAAC;oBAE/C,MAAM,eAAe,GAAG,eAAe,CAAC,IAAI,CAAC,KAAK,QAAQ,CAAC;oBAC3D,MAAM,YAAY,GAAG,eAAe,CAAC,KAAK,CAAC,MAAM,CAAC,KAAK,QAAQ,CAAC;oBAEhE,OAAO,CAAC,GAAG,CAAC,GAAG;wBACb,MAAM,EAAE,IAAI,CAAC,EAAE;wBACf,eAAe;wBACf,YAAY;wBACZ,UAAU,EAAE,MAAM;wBAClB,OAAO;wBACP,SAAS,EAAE,KAAK,CAAC,SAAS;qBAC3B,CAAC;gBACJ,CAAC;gBAAC,OAAO,GAAQ,EAAE,CAAC;oBAClB,OAAO,CAAC,GAAG,CAAC,GAAG;wBACb,MAAM,EAAE,IAAI,CAAC,EAAE;wBACf,eAAe,EAAE,KAAK;wBACtB,YAAY,EAAE,KAAK;wBACnB,UAAU,EAAE,CAAC;wBACb,OAAO,EAAE,CAAC;wBACV,SAAS,EAAE,CAAC;wBACZ,KAAK,EAAE,GAAG,EAAE,OAAO,IAAI,MAAM,CAAC,GAAG,CAAC;qBACnC,CAAC;gBACJ,CAAC;YACH,CAAC;QACH,CAAC,CAAC,EAAE,CACL,CAAC;QAEF,MAAM,OAAO,CAAC,GAAG,CAAC,OAAO,CAAC,CAAC;QAE3B,MAAM,eAAe,GAAG,OAAO,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,eAAe,CAAC,CAAC,MAAM,CAAC;QACxE,MAAM,YAAY,GAAG,OAAO,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,YAAY,CAAC,CAAC,MAAM,CAAC;QAClE,MAAM,gBAAgB,GAAG,CAAC,eAAe,GAAG,OAAO,CAAC,MAAM,CAAC,GAAG,GAAG,CAAC;QAClE,MAAM,aAAa,GAAG,CAAC,YAAY,GAAG,OAAO,CAAC,MAAM,CAAC,GAAG,GAAG,CAAC;QAC5D,MAAM,SAAS,GAAG,OAAO,CAAC,MAAM,CAAC,CAAC,GAAG,EAAE,CAAC,EAAE,EAAE,CAAC,GAAG,GAAG,CAAC,CAAC,UAAU,EAAE,CAAC,CAAC,GAAG,OAAO,CAAC,MAAM,CAAC;QACrF,MAAM,UAAU,GAAG,OAAO,CAAC,MAAM,CAAC,CAAC,GAAG,EAAE,CAAC,EAAE,EAAE,CAAC,GAAG,GAAG,CAAC,CAAC,OAAO,EAAE,CAAC,CAAC,GAAG,OAAO,CAAC,MAAM,CAAC;QACnF,MAAM,YAAY,GAAG,OAAO,CAAC,MAAM,CAAC,CAAC,GAAG,EAAE,CAAC,EAAE,EAAE,CAAC,GAAG,GAAG,CAAC,CAAC,SAAS,EAAE,CAAC,CAAC,GAAG,OAAO,CAAC,MAAM,CAAC;QAEvF,MAAM,QAAQ,GAAG,OAAO,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,CAAC,eAAe,IAAI,CAAC,CAAC,YAAY,CAAC,CAAC,MAAM,CAAC;QACpF,MAAM,WAAW,GAAG,OAAO,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,eAAe,IAAI,CAAC,CAAC,CAAC,YAAY,CAAC,CAAC,MAAM,CAAC;QAEvF,OAAO,CAAC,GAAG,CACT,iCAAiC,OAAO,CAAC,MAAM,aAAa,eAAe,IAAI,OAAO,CAAC,MAAM,KAAK,gBAAgB,CAAC,OAAO,CACxH,CAAC,CACF,YAAY,YAAY,IAAI,OAAO,CAAC,MAAM,KAAK,aAAa,CAAC,OAAO,CAA
C,CAAC,CAAC,YAAY,CAClF,aAAa,GAAG,gBAAgB,CACjC,CAAC,OAAO,CAAC,CAAC,CAAC,cAAc,QAAQ,gBAAgB,WAAW,iBAAiB,YAAY,CAAC,OAAO,CAAC,CAAC,CAAC,EAAE,CACxG,CAAC;QAEF,MAAM,SAAS,GAAG,CAAC,OAAO,CAAC,GAAG,CAAC,oCAAoC,IAAI,OAAO,CAAC,CAAC,WAAW,EAAE,CAAC;QAC9F,MAAM,aAAa,GAAqC;YACtD,OAAO,EAAE,uBAAuB;YAChC,IAAI,EAAE,OAAO;YACb,cAAc,EAAE,IAAI,IAAI,EAAE,CAAC,WAAW,EAAE;YACxC,MAAM,EAAE,OAAO,CAAC,MAAM;YACtB,KAAK,EAAE,OAAO,CAAC,KAAK;YACpB,SAAS,EAAE,OAAO,CAAC,MAAM;YACzB,WAAW;YACX,QAAQ,EAAE;gBACR,KAAK,EAAE,OAAO,CAAC,GAAG,CAAC,6BAA6B,IAAI,kBAAkB;gBACtE,OAAO,EAAE,eAAe;gBACxB,WAAW,EAAE,gBAAgB;gBAC7B,KAAK,EAAE,SAAS;aACjB;YACD,KAAK,EAAE;gBACL,KAAK,EAAE,OAAO,CAAC,GAAG,CAAC,0BAA0B,IAAI,kBAAkB;gBACnE,IAAI,EAAE,SAAS;gBACf,OAAO,EAAE,YAAY;gBACrB,WAAW,EAAE,aAAa;gBAC1B,KAAK,EAAE,UAAU;gBACjB,YAAY;aACb;YACD,QAAQ;YACR,WAAW;YACX,KAAK,EACH,wHAAwH;SAC3H,CAAC;QAEF,IAAI,iBAAiB,EAAE,CAAC;YACtB,MAAM,QAAQ,GAAG,eAAe,EAAE,CAAC;YACnC,MAAM,aAAa,CACjB,IAAI,CAAC,IAAI,CAAC,QAAQ,EAAE,QAAQ,EAAE,OAAO,EAAE,mCAAmC,CAAC,EAC3E,aAAa,CACd,CAAC;YAEF,MAAM,QAAQ,GAAG;gBACf,GAAG,aAAa;gBAChB,OAAO,EAAE,OAAO,CAAC,GAAG,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC;oBAC3B,MAAM,EAAE,CAAC,CAAC,MAAM;oBAChB,eAAe,EAAE,CAAC,CAAC,eAAe;oBAClC,YAAY,EAAE,CAAC,CAAC,YAAY;oBAC5B,UAAU,EAAE,IAAI,CAAC,KAAK,CAAC,CAAC,CAAC,UAAU,CAAC;oBACpC,OAAO,EAAE,IAAI,CAAC,KAAK,CAAC,CAAC,CAAC,OAAO,CAAC;oBAC9B,SAAS,EAAE,CAAC,CAAC,SAAS;oBACtB,GAAG,CAAC,CAAC,CAAC,KAAK,CAAC,CAAC,CAAC,EAAE,KAAK,EAAE,CAAC,CAAC,KAAK,EAAE,CAAC,CAAC,CAAC,EAAE,CAAC;iBACvC,CAAC,CAAC;aACJ,CAAC;YACF,MAAM,KAAK,GAAG,IAAI,IAAI,EAAE,CAAC,WAAW,EAAE,CAAC,OAAO,CAAC,OAAO,EAAE,GAAG,CAAC,CAAC;YAC7D,MAAM,aAAa,CACjB,IAAI,CAAC,IAAI,CACP,QAAQ,EACR,QAAQ,EACR,MAAM,EACN,SAAS,EACT,yBAAyB,OAAO,CAAC,MAAM,IAAI,OAAO,CAAC,KAAK,IAAI,KAAK,OAAO,CACzE,EACD,QAAQ,CACT,CAAC;QACJ,CAAC;QAED,MAAM,CAAC,aAAa,CAAC,CAAC,sBAAsB,CAAC,gBAAgB,CAAC,CAAC;IACjE,CAAC,CAAC,CAAC;AACL,CAAC,CAAC,CAAC"}
package/dist/__tests__/gaiaCapabilityMediaEval.test.d.ts ADDED
@@ -0,0 +1,15 @@
+/**
+ * GAIA media-backed capability/accuracy benchmark: LLM-only vs LLM+NodeBench MCP local OCR tools.
+ *
+ * This lane targets GAIA tasks that include image attachments (PNG/JPG/WEBP).
+ * We provide deterministic local OCR via NodeBench MCP tools and score answers against
+ * the ground-truth "Final answer" (stored locally under `.cache/gaia`, gitignored).
+ *
+ * Safety:
+ * - GAIA is gated. Do not commit fixtures that contain prompts/answers.
+ * - This test logs only task IDs and aggregate metrics (no prompt/answer text).
+ *
+ * Disabled by default (cost + rate limits). Run with:
+ *   NODEBENCH_RUN_GAIA_CAPABILITY=1 npm --prefix packages/mcp-local run test
+ */
+export {};
|