nodebench-mcp 2.14.2 → 2.15.0
- package/NODEBENCH_AGENTS.md +3 -3
- package/README.md +9 -9
- package/dist/__tests__/critterCalibrationEval.d.ts +8 -0
- package/dist/__tests__/critterCalibrationEval.js +370 -0
- package/dist/__tests__/critterCalibrationEval.js.map +1 -0
- package/dist/__tests__/embeddingProvider.test.d.ts +1 -0
- package/dist/__tests__/embeddingProvider.test.js +86 -0
- package/dist/__tests__/embeddingProvider.test.js.map +1 -0
- package/dist/__tests__/gaiaCapabilityAudioEval.test.js +1 -1
- package/dist/__tests__/gaiaCapabilityAudioEval.test.js.map +1 -1
- package/dist/__tests__/gaiaCapabilityEval.test.js +541 -27
- package/dist/__tests__/gaiaCapabilityEval.test.js.map +1 -1
- package/dist/__tests__/gaiaCapabilityFilesEval.test.js +1 -1
- package/dist/__tests__/gaiaCapabilityFilesEval.test.js.map +1 -1
- package/dist/__tests__/gaiaCapabilityMediaEval.test.js +473 -4
- package/dist/__tests__/gaiaCapabilityMediaEval.test.js.map +1 -1
- package/dist/__tests__/tools.test.js +1010 -8
- package/dist/__tests__/tools.test.js.map +1 -1
- package/dist/db.js +64 -0
- package/dist/db.js.map +1 -1
- package/dist/index.js +70 -9
- package/dist/index.js.map +1 -1
- package/dist/tools/critterTools.d.ts +21 -0
- package/dist/tools/critterTools.js +230 -0
- package/dist/tools/critterTools.js.map +1 -0
- package/dist/tools/embeddingProvider.d.ts +67 -0
- package/dist/tools/embeddingProvider.js +299 -0
- package/dist/tools/embeddingProvider.js.map +1 -0
- package/dist/tools/progressiveDiscoveryTools.js +24 -7
- package/dist/tools/progressiveDiscoveryTools.js.map +1 -1
- package/dist/tools/reconTools.js +83 -33
- package/dist/tools/reconTools.js.map +1 -1
- package/dist/tools/toolRegistry.d.ts +30 -2
- package/dist/tools/toolRegistry.js +253 -25
- package/dist/tools/toolRegistry.js.map +1 -1
- package/package.json +7 -3
package/NODEBENCH_AGENTS.md
CHANGED
@@ -21,7 +21,7 @@ Add to `~/.claude/settings.json`:
 }
 ```

-Restart Claude Code.
+Restart Claude Code. 163 tools available immediately.

 ### Preset Selection

@@ -303,8 +303,8 @@ NodeBench MCP supports 4 presets that control which domain toolsets are loaded a
 |--------|----------------|-------------|----------------------------|----------|
 | **meta** | 0 | 0 | 5 | Discovery-only front door. Agents start here and self-escalate. |
 | **lite** | 8 | 38 | 43 | Lightweight verification-focused workflows. CI bots, quick checks. |
-| **core** |
-| **full** |
+| **core** | 23 | 105 | 110 | Full development workflow. Most agent sessions. |
+| **full** | 31 | 158 | 163 | Everything enabled. Benchmarking, exploration, advanced use. |

 ### Usage

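The three numeric columns in the preset rows above appear to follow a simple rule: the last number equals the middle number plus the 5 discovery tools that the `meta` preset loads. The sketch below checks that reading against the four rows. The column names (`domains`, `domainTools`, `totalTools`) and the helper `checkPresetTotals` are illustrative assumptions, since the table header is not part of the hunk.

```javascript
// Rows copied from the preset table in the hunk above.
// Assumed column meaning: [domains, domain tools, total tools].
const presets = {
  meta: { domains: 0, domainTools: 0, totalTools: 5 },
  lite: { domains: 8, domainTools: 38, totalTools: 43 },
  core: { domains: 23, domainTools: 105, totalTools: 110 },
  full: { domains: 31, domainTools: 158, totalTools: 163 },
};

// Assumption: every preset also loads the 5 discovery tools of `meta`.
const ALWAYS_LOADED = presets.meta.totalTools;

// Hypothetical helper: compare each row's total with domainTools + 5.
function checkPresetTotals(table) {
  return Object.entries(table).map(([name, p]) => ({
    name,
    expected: p.domainTools + ALWAYS_LOADED,
    actual: p.totalTools,
    consistent: p.domainTools + ALWAYS_LOADED === p.totalTools,
  }));
}

const presetReport = checkPresetTotals(presets);
for (const row of presetReport) {
  console.log(`${row.name}: ${row.expected} vs ${row.actual} -> ${row.consistent ? "ok" : "mismatch"}`);
}
```

All four rows satisfy the rule (0 + 5, 38 + 5, 105 + 5, 158 + 5), which supports reading the third column as the tool count before the discovery tools are added.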
package/README.md
CHANGED
@@ -39,7 +39,7 @@ Every additional tool call produces a concrete artifact — an issue found, a ri

 **QA engineer** — Transitioned a manual QA workflow website into an AI agent-driven app for a pet care messaging platform. Uses NodeBench's quality gates, verification cycles, and eval runs to ensure the AI agent handles edge cases that manual QA caught but bare AI agents miss.

-Both found different subsets of the
+Both found different subsets of the 163 tools useful — which is why NodeBench ships with 4 `--preset` levels to load only what you need.

 ---

@@ -77,7 +77,7 @@ Tasks 1-3 start with zero prior knowledge. By task 9, the agent finds 2+ relevan
 ### Install (30 seconds)

 ```bash
-# Claude Code CLI — all
+# Claude Code CLI — all 163 tools (TOON encoding on by default for ~40% token savings)
 claude mcp add nodebench -- npx -y nodebench-mcp

 # Or start with discovery only — 5 tools, agents self-escalate to what they need
@@ -189,7 +189,7 @@ Notes:

 ## Progressive Discovery

-
+163 tools is a lot. The progressive disclosure system helps agents find exactly what they need:

 ### Multi-modal search engine

@@ -197,7 +197,7 @@ Notes:
 > discover_tools("verify my implementation")
 ```

-The `discover_tools` search engine scores tools using **
+The `discover_tools` search engine scores tools using **14 parallel strategies** (including Agent-as-a-Graph bipartite embedding search):

 | Strategy | What it does | Example |
 |---|---|---|
@@ -317,7 +317,7 @@ Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www

 ## Toolset Gating

-
+163 tools means tens of thousands of tokens of schema per API call. If you only need core methodology, gate the toolset:

 ### Presets

@@ -325,8 +325,8 @@ Based on Anthropic's ["Building a C Compiler with Parallel Claudes"](https://www
 |---|---|---|---|
 | `meta` | 5 | 0 | Discovery-only front door — agents start here and self-escalate via `discover_tools` |
 | `lite` | 43 | 8 | Core methodology — verification, eval, flywheel, learning, recon, security, boilerplate |
-| `core` |
-| `full` |
+| `core` | 110 | 23 | Full workflow — adds bootstrap, self-eval, llm, platform, research_writing, flicker_detection, figma_flow, benchmark, session_memory, toon, pattern, git_workflow, seo, voice_bridge, critter |
+| `full` | 163 | 31 | Everything — adds vision, UI capture, web, GitHub, docs, parallel, local files, GAIA solvers |

 ```bash
 # Meta — 5 tools (discovery-only: findTools, getMethodology, discover_tools, get_tool_quick_ref, get_workflow_chain)
@@ -336,10 +336,10 @@ claude mcp add nodebench -- npx -y nodebench-mcp --preset meta
 # Lite — 43 tools (verification, eval, flywheel, learning, recon, security, boilerplate + meta + discovery)
 claude mcp add nodebench -- npx -y nodebench-mcp --preset lite

-# Core —
+# Core — 110 tools (adds bootstrap, self-eval, llm, platform, research_writing, flicker_detection, figma_flow, benchmark, session_memory, toon, pattern, git_workflow, seo, voice_bridge, critter + meta + discovery)
 claude mcp add nodebench -- npx -y nodebench-mcp --preset core

-# Full — all
+# Full — all 163 tools (default, TOON encoding on by default)
 claude mcp add nodebench -- npx -y nodebench-mcp
 ```

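To make the cost argument behind toolset gating concrete, the following sketch computes what share of the full toolset each preset loads, using the tool counts from the README preset table above. It assumes schema size grows roughly in proportion to the number of tools, which the README does not state; `shareOfFull` is a hypothetical helper, not part of nodebench-mcp.

```javascript
// Tool counts per preset, taken from the README preset table.
const toolCounts = { meta: 5, lite: 43, core: 110, full: 163 };

// Hypothetical helper: percentage of the full toolset loaded by each preset.
// Assumes every tool schema costs about the same number of tokens.
function shareOfFull(counts) {
  const shares = {};
  for (const [name, n] of Object.entries(counts)) {
    shares[name] = ((n / counts.full) * 100).toFixed(1);
  }
  return shares;
}

const shares = shareOfFull(toolCounts);
for (const [name, pct] of Object.entries(shares)) {
  console.log(`${name}: ${pct}% of the full toolset`);
}
```

Under that assumption, `lite` carries about a quarter of the schema that `full` does (43 of 163 tools) and `meta` about 3%, which is the saving the `--preset` flag is meant to buy.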
package/dist/__tests__/critterCalibrationEval.js
ADDED
@@ -0,0 +1,370 @@
+/**
+ * Critter Tool Calibration Eval — tests how useful the scoring really is.
+ *
+ * 19 scenarios across 4 tiers. Scoring logic mirrored from critterTools.ts.
+ *
+ * Run: npx tsx packages/mcp-local/src/__tests__/critterCalibrationEval.ts
+ */
+// ── Mirror of critterTools.ts scoreCritterCheck — keep in sync! ──────────
+function scoreCritterCheck(input) {
+    const feedback = [];
+    let score = 100;
+    const taskLower = input.task.toLowerCase().trim();
+    const whyLower = input.why.toLowerCase().trim();
+    const whoLower = input.who.toLowerCase().trim();
+    // 1: Circular reasoning (threshold 0.5)
+    const taskWords = new Set(taskLower.split(/\s+/).filter((w) => w.length > 3));
+    const whyWords = whyLower.split(/\s+/).filter((w) => w.length > 3);
+    const overlap = whyWords.filter((w) => taskWords.has(w));
+    if (whyWords.length > 0 && overlap.length / whyWords.length > 0.5) {
+        score -= 30;
+        feedback.push("Circular");
+    }
+    // 2: Vague audience
+    const vagueAudiences = ["users", "everyone", "people", "the team", "stakeholders", "clients", "customers", "developers"];
+    if (vagueAudiences.includes(whoLower)) {
+        score -= 20;
+        feedback.push(`Vague: "${input.who}"`);
+    }
+    // 3: Empty or too short
+    if (whyLower.length === 0) {
+        score -= 40;
+        feedback.push("Empty why");
+    }
+    else if (whyLower.length < 10) {
+        score -= 25;
+        feedback.push("Why too short");
+    }
+    if (whoLower.length < 3) {
+        score -= 25;
+        feedback.push("Who too short");
+    }
+    // 4: Deference (patterns must be lowercase: they are matched against whyLower)
+    const deferPatterns = [
+        "was told", "asked to", "ticket says", "was asked", "requirement says",
+        "spec says", "jira", "because i was", "they said", "assigned to me",
+    ];
+    if (deferPatterns.some((p) => whyLower.includes(p))) {
+        score -= 15;
+        feedback.push("Deference");
+    }
+    // 5: Non-answer (count matches, -20 each, cap -40)
+    const nonAnswerPatterns = [
+        "just because", "don't know", "not sure", "why not", "might need it",
+        "no reason", "no idea", "whatever", "idk", "tbd",
+    ];
+    const nonAnswerHits = nonAnswerPatterns.filter((p) => whyLower.includes(p)).length;
+    if (nonAnswerHits > 0) {
+        const penalty = Math.min(nonAnswerHits * 20, 40);
+        score -= penalty;
+        feedback.push(`Non-answer x${nonAnswerHits} (-${penalty})`);
+    }
+    // 6: Repetitive padding (why + who)
+    const whyAllWords = whyLower.split(/\s+/).filter((w) => w.length > 2);
+    if (whyAllWords.length >= 5) {
+        const whyUniqueWords = new Set(whyAllWords);
+        if (whyUniqueWords.size / whyAllWords.length < 0.4) {
+            score -= 25;
+            feedback.push("Why repetitive");
+        }
+    }
+    const whoAllWords = whoLower.split(/\s+/).filter((w) => w.length > 2);
+    if (whoAllWords.length >= 5) {
+        const whoUniqueWords = new Set(whoAllWords);
+        if (whoUniqueWords.size / whoAllWords.length < 0.4) {
+            score -= 25;
+            feedback.push("Who repetitive");
+        }
+    }
+    // 7: Buzzword-heavy (scans why + who, graduated: 4+ = -35, 3 = -30, 2 = -20)
+    const buzzwords = [
+        "leverage", "synergies", "synergy", "paradigm", "holistic", "alignment",
+        "transformation", "innovative", "disruptive", "best practices",
+        "streamline", "ecosystem", "actionable", "circle back",
+    ];
+    const allText = `${whyLower} ${whoLower}`;
+    const buzzCount = buzzwords.filter((b) => allText.includes(b)).length;
+    if (buzzCount >= 4) {
+        score -= 35;
+        feedback.push(`Buzzwords x${buzzCount}`);
+    }
+    else if (buzzCount >= 3) {
+        score -= 30;
+        feedback.push(`Buzzwords x${buzzCount}`);
+    }
+    else if (buzzCount >= 2) {
+        score -= 20;
+        feedback.push(`Buzzwords x${buzzCount}`);
+    }
+    // 8: Hedging language
+    const hedgeWords = ["could", "potentially", "maybe", "possibly", "might", "perhaps", "hopefully"];
+    const hedgeCount = hedgeWords.filter((h) => {
+        const regex = new RegExp(`\\b${h}\\b`, "i");
+        return regex.test(whyLower);
+    }).length;
+    if (hedgeCount >= 2) {
+        score -= 15;
+        feedback.push(`Hedging x${hedgeCount}`);
+    }
+    // 9: Task-word echo
+    for (const tw of taskWords) {
+        const twCount = whyWords.filter((w) => w === tw).length;
+        if (twCount >= 3) {
+            score -= 20;
+            feedback.push(`Echo: "${tw}" x${twCount}`);
+            break;
+        }
+    }
+    // 10: Bonuses
+    if (input.success_looks_like && input.success_looks_like.length > 20) {
+        score += 10;
+        feedback.push("+success");
+    }
+    if (input.simplest_version && input.simplest_version.length > 20) {
+        score += 10;
+        feedback.push("+simplest");
+    }
+    score = Math.max(0, Math.min(100, score));
+    let verdict;
+    if (score >= 70)
+        verdict = "proceed";
+    else if (score >= 40)
+        verdict = "reconsider";
+    else
+        verdict = "stop";
+    return { score, verdict, feedback };
+}
+const scenarios = [
+    // ── GOOD — should proceed ────────────────────────────────────────────
+    {
+        label: "GOOD-1: Clear purpose, specific audience",
+        input: {
+            task: "Add rate limiting to the public API endpoints",
+            why: "Our API is getting hammered by a scraper bot causing 503s for real customers — rate limiting protects availability",
+            who: "E-commerce customers who see checkout failures during bot attacks",
+        },
+        expectedVerdict: "proceed",
+        expectedScoreRange: [80, 100],
+    },
+    {
+        label: "GOOD-2: With both bonuses",
+        input: {
+            task: "Migrate from REST to GraphQL",
+            why: "Mobile app makes 12 API calls per screen because REST endpoints return fixed shapes — GraphQL lets us fetch exactly what each screen needs in one round trip",
+            who: "Mobile team (3 iOS + 2 Android devs) who spend 40% of sprint on pagination workarounds",
+            success_looks_like: "Screen load API calls drop from 12 to 1-2, mobile team velocity increases by at least 20%",
+            simplest_version: "Start with the 3 highest-traffic screens, keep REST alive for backwards compat",
+        },
+        expectedVerdict: "proceed",
+        expectedScoreRange: [100, 100],
+    },
+    {
+        label: "GOOD-3: Short but specific",
+        input: {
+            task: "Fix the dark mode toggle",
+            why: "Toggle doesn't persist across page reloads — users report losing their setting every time",
+            who: "Users with visual sensitivities who rely on dark mode for comfort",
+        },
+        expectedVerdict: "proceed",
+        expectedScoreRange: [80, 100],
+    },
+    {
+        label: "GOOD-4: Infra task with system audience",
+        input: {
+            task: "Add a health check endpoint",
+            why: "Kubernetes needs a liveness probe to restart crashed pods",
+            who: "The Kubernetes orchestration layer",
+        },
+        expectedVerdict: "proceed",
+        expectedScoreRange: [80, 100],
+    },
+    {
+        label: "GOOD-5: Shares domain words but isn't circular",
+        input: {
+            task: "Add password reset flow to the authentication module",
+            why: "Users who forget their password currently have to email support and wait 24h for a manual reset — this is our #1 support ticket category",
+            who: "End users who lock themselves out (estimated 15% of monthly active users)",
+        },
+        expectedVerdict: "proceed",
+        expectedScoreRange: [80, 100],
+    },
+    // ── MEDIOCRE — should reconsider ──────────────────────────────────────
+    {
+        label: "MED-1: Circular + vague",
+        input: {
+            task: "Add user authentication and login system to the application",
+            why: "Because we need user authentication and login system in the application",
+            who: "users",
+        },
+        expectedVerdict: "reconsider",
+        expectedScoreRange: [40, 55],
+    },
+    {
+        label: "MED-2: Deference + vague",
+        input: {
+            task: "Refactor the payment module",
+            why: "The ticket says we need to refactor payments before Q3",
+            who: "the team",
+        },
+        expectedVerdict: "reconsider",
+        expectedScoreRange: [45, 69],
+    },
+    {
+        label: "MED-3: Good who but lazy why",
+        input: {
+            task: "Add caching to the dashboard API",
+            why: "For speed",
+            who: "Internal analytics team who refreshes dashboards 50+ times daily",
+        },
+        expectedVerdict: "proceed", // Borderline — score 75 with warning is acceptable
+        expectedScoreRange: [70, 80],
+    },
+    {
+        label: "MED-4: Deference rescued by specificity",
+        input: {
+            task: "Add WebSocket support",
+            why: "The PM asked to add real-time updates because customer support agents currently poll every 30 seconds, causing delayed responses to urgent tickets",
+            who: "Customer support team handling 200+ tickets/day",
+            success_looks_like: "Ticket updates appear within 1 second instead of up to 30 seconds",
+        },
+        expectedVerdict: "proceed",
+        expectedScoreRange: [85, 100],
+    },
+    {
+        label: "MED-5: Good why but vague 'stakeholders'",
+        input: {
+            task: "Add audit logging to the admin panel",
+            why: "SOC2 compliance requires immutable audit trails for all admin actions — we fail the next audit without this",
+            who: "stakeholders",
+        },
+        expectedVerdict: "proceed",
+        expectedScoreRange: [70, 85],
+    },
+    // ── BAD ──────────────────────────────────────────────────────────────
+    {
+        label: "BAD-1: Deference + vague",
+        input: { task: "Do the thing", why: "Was told to", who: "everyone" },
+        expectedVerdict: "reconsider",
+        expectedScoreRange: [40, 69],
+    },
+    {
+        label: "BAD-2: Empty why",
+        input: { task: "Add a new endpoint", why: "", who: "me" },
+        expectedVerdict: "stop",
+        expectedScoreRange: [0, 39],
+    },
+    {
+        label: "BAD-3: Circular + vague (threshold 0.5)",
+        input: {
+            task: "Update the user interface for the dashboard component",
+            why: "We need to update the user interface for the dashboard component to make it better",
+            who: "people",
+        },
+        expectedVerdict: "reconsider",
+        expectedScoreRange: [40, 55],
+    },
+    // ── EDGE CASES ──────────────────────────────────────────────────────
+    {
+        label: "EDGE-1: Buzzwords masking emptiness (why + who scanned)",
+        input: {
+            task: "Implement the new feature",
+            why: "To leverage synergies and drive engagement through digital transformation and innovation paradigms",
+            who: "Cross-functional stakeholder alignment team",
+        },
+        // 5 buzzwords in why+who (leverage, synergies, transformation, paradigm, alignment) → -35 → score 65
+        expectedVerdict: "reconsider",
+        expectedScoreRange: [50, 69],
+    },
+    {
+        label: "EDGE-2: Padding with repetition (why + who)",
+        input: {
+            task: "Fix the bug",
+            why: "need need need need need need need need need need need need",
+            who: "test test test test test test test test test test test test",
+        },
+        // Why repetition (-25) + who repetition (-25) = 50 → reconsider
+        expectedVerdict: "reconsider",
+        expectedScoreRange: [40, 55],
+    },
+    {
+        label: "EDGE-3: Double non-answer ('just because' + 'might need it')",
+        input: {
+            task: "Add a new database table",
+            why: "Just because we might need it later",
+            who: "Maybe someone eventually",
+        },
+        // 2 non-answer hits × -20 = -40 → score 60 → reconsider
+        expectedVerdict: "reconsider",
+        expectedScoreRange: [40, 65],
+    },
+    {
+        label: "EDGE-4: Hedging with good audience",
+        input: {
+            task: "Build a recommendation engine",
+            why: "It could potentially maybe help with user retention if people use it",
+            who: "Product manager who wants to try ML features",
+        },
+        // Hedging (could+potentially+maybe) = -15 → score 85 → proceed with warning
+        expectedVerdict: "proceed",
+        expectedScoreRange: [80, 90],
+    },
+    {
+        label: "EDGE-5: Long vacuous echo + vague audience",
+        input: {
+            task: "Refactor the codebase",
+            why: "We need to refactor the codebase because the codebase needs refactoring and the current state of the codebase is such that refactoring would be beneficial for the overall quality of the codebase moving forward in the future",
+            who: "developers",
+        },
+        // Echo: 'codebase' x4 (-20) + vague 'developers' (-20) = 60 → reconsider
+        expectedVerdict: "reconsider",
+        expectedScoreRange: [40, 69],
+    },
+    {
+        label: "EDGE-6: 'customers' now vague",
+        input: {
+            task: "Deploy to production",
+            why: "Performance improvements need to reach production for customer benefit today",
+            who: "customers",
+        },
+        // Vague 'customers' (-20) → 80 → proceed with warning
+        expectedVerdict: "proceed",
+        expectedScoreRange: [70, 85],
+    },
+];
+// ── Run ────────────────────────────────────────────────────────────────────
+console.log("═══════════════════════════════════════════════════════════════");
+console.log(`CRITTER CALIBRATION EVAL — ${scenarios.length} scenarios`);
+console.log("═══════════════════════════════════════════════════════════════\n");
+let passed = 0;
+let failed = 0;
+const gaps = [];
+for (const tc of scenarios) {
+    const r = scoreCritterCheck(tc.input);
+    const vOk = r.verdict === tc.expectedVerdict;
+    const sOk = r.score >= tc.expectedScoreRange[0] && r.score <= tc.expectedScoreRange[1];
+    const ok = vOk && sOk;
+    const status = ok ? " PASS" : " FAIL";
+    console.log(`${status} ${tc.label}`);
+    console.log(` Score: ${r.score}${sOk ? "" : ` (expected ${tc.expectedScoreRange[0]}-${tc.expectedScoreRange[1]})`}, Verdict: ${r.verdict}${vOk ? "" : ` (expected ${tc.expectedVerdict})`}`);
+    console.log(` Checks: [${r.feedback.join("; ")}]`);
+    if (!ok) {
+        failed++;
+        gaps.push(`${tc.label}: score=${r.score} verdict=${r.verdict}`);
+    }
+    else {
+        passed++;
+    }
+    console.log();
+}
+console.log("═══════════════════════════════════════════════════════════════");
+console.log(`RESULTS: ${passed}/${scenarios.length} passed, ${failed} failed`);
+console.log(`CALIBRATION: ${Math.round(passed / scenarios.length * 100)}%`);
+console.log("═══════════════════════════════════════════════════════════════");
+if (gaps.length > 0) {
+    console.log("\nREMAINING GAPS:");
+    for (const g of gaps)
+        console.log(` -> ${g}`);
+}
+process.exit(failed > 0 ? 1 : 0);
+export {};
+//# sourceMappingURL=critterCalibrationEval.js.map
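The circular-reasoning check at the top of `scoreCritterCheck` is easier to follow with one scenario worked through. The sketch below repeats only that check, with hypothetical helper names `longWords` and `circularRatio`, and applies it to the MED-1 and GOOD-4 inputs from the listing above. It is an illustration of the overlap ratio, not a replacement for the full scorer.

```javascript
// Tokenize the way the listing does: lowercase, trim, split on whitespace,
// and keep only words longer than 3 characters.
function longWords(text) {
  return text.toLowerCase().trim().split(/\s+/).filter((w) => w.length > 3);
}

// Fraction of long words in `why` that also appear in `task`.
// The listing flags the input as circular when this exceeds 0.5.
function circularRatio(task, why) {
  const taskWords = new Set(longWords(task));
  const whyWords = longWords(why);
  if (whyWords.length === 0) return 0;
  const overlap = whyWords.filter((w) => taskWords.has(w));
  return overlap.length / whyWords.length;
}

const med1Ratio = circularRatio(
  "Add user authentication and login system to the application",
  "Because we need user authentication and login system in the application"
);
const good4Ratio = circularRatio(
  "Add a health check endpoint",
  "Kubernetes needs a liveness probe to restart crashed pods"
);

console.log(`MED-1: ${med1Ratio.toFixed(3)} circular=${med1Ratio > 0.5}`);
console.log(`GOOD-4: ${good4Ratio.toFixed(3)} circular=${good4Ratio > 0.5}`);
```

For MED-1, 5 of the 7 long words in `why` also occur in `task`, a ratio of about 0.714, which is above the 0.5 threshold and costs 30 points. Together with the 20-point penalty for the vague audience `users`, that gives the score of 50 the scenario expects. GOOD-4 shares no long words with its task, so its ratio is 0.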
package/dist/__tests__/critterCalibrationEval.js.map
ADDED
@@ -0,0 +1 @@
{"version":3,"file":"critterCalibrationEval.js","sourceRoot":"","sources":["../../src/__tests__/critterCalibrationEval.ts"],"names":[],"mappings":"AAAA;;;;;;GAMG;AAgBH,4EAA4E;AAE5E,SAAS,iBAAiB,CAAC,KAAmB;IAC5C,MAAM,QAAQ,GAAa,EAAE,CAAC;IAC9B,IAAI,KAAK,GAAG,GAAG,CAAC;IAChB,MAAM,SAAS,GAAG,KAAK,CAAC,IAAI,CAAC,WAAW,EAAE,CAAC,IAAI,EAAE,CAAC;IAClD,MAAM,QAAQ,GAAG,KAAK,CAAC,GAAG,CAAC,WAAW,EAAE,CAAC,IAAI,EAAE,CAAC;IAChD,MAAM,QAAQ,GAAG,KAAK,CAAC,GAAG,CAAC,WAAW,EAAE,CAAC,IAAI,EAAE,CAAC;IAEhD,wCAAwC;IACxC,MAAM,SAAS,GAAG,IAAI,GAAG,CAAC,SAAS,CAAC,KAAK,CAAC,KAAK,CAAC,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,MAAM,GAAG,CAAC,CAAC,CAAC,CAAC;IAC9E,MAAM,QAAQ,GAAG,QAAQ,CAAC,KAAK,CAAC,KAAK,CAAC,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,MAAM,GAAG,CAAC,CAAC,CAAC;IACnE,MAAM,OAAO,GAAG,QAAQ,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,SAAS,CAAC,GAAG,CAAC,CAAC,CAAC,CAAC,CAAC;IACzD,IAAI,QAAQ,CAAC,MAAM,GAAG,CAAC,IAAI,OAAO,CAAC,MAAM,GAAG,QAAQ,CAAC,MAAM,GAAG,GAAG,EAAE,CAAC;QAClE,KAAK,IAAI,EAAE,CAAC;QACZ,QAAQ,CAAC,IAAI,CAAC,UAAU,CAAC,CAAC;IAC5B,CAAC;IAED,oBAAoB;IACpB,MAAM,cAAc,GAAG,CAAC,OAAO,EAAE,UAAU,EAAE,QAAQ,EAAE,UAAU,EAAE,cAAc,EAAE,SAAS,EAAE,WAAW,EAAE,YAAY,CAAC,CAAC;IACzH,IAAI,cAAc,CAAC,QAAQ,CAAC,QAAQ,CAAC,EAAE,CAAC;QACtC,KAAK,IAAI,EAAE,CAAC;QACZ,QAAQ,CAAC,IAAI,CAAC,WAAW,KAAK,CAAC,GAAG,GAAG,CAAC,CAAC;IACzC,CAAC;IAED,wBAAwB;IACxB,IAAI,QAAQ,CAAC,MAAM,KAAK,CAAC,EAAE,CAAC;QAC1B,KAAK,IAAI,EAAE,CAAC;QACZ,QAAQ,CAAC,IAAI,CAAC,WAAW,CAAC,CAAC;IAC7B,CAAC;SAAM,IAAI,QAAQ,CAAC,MAAM,GAAG,EAAE,EAAE,CAAC;QAChC,KAAK,IAAI,EAAE,CAAC;QACZ,QAAQ,CAAC,IAAI,CAAC,eAAe,CAAC,CAAC;IACjC,CAAC;IACD,IAAI,QAAQ,CAAC,MAAM,GAAG,CAAC,EAAE,CAAC;QACxB,KAAK,IAAI,EAAE,CAAC;QACZ,QAAQ,CAAC,IAAI,CAAC,eAAe,CAAC,CAAC;IACjC,CAAC;IAED,eAAe;IACf,MAAM,aAAa,GAAG;QACpB,UAAU,EAAE,UAAU,EAAE,aAAa,EAAE,WAAW,EAAE,kBAAkB;QACtE,WAAW,EAAE,MAAM,EAAE,eAAe,EAAE,WAAW,EAAE,gBAAgB;KACpE,CAAC;IACF,IAAI,aAAa,CAAC,IAAI,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,QAAQ,CAAC,QAAQ,CAAC,CAAC,CAAC,CAAC,EAAE,CAAC;QACpD,KAAK,IAAI,EAAE,CAAC;QACZ,QAAQ,CAAC,IAAI,CAAC,WAAW,CAAC
,CAAC;IAC7B,CAAC;IAED,mDAAmD;IACnD,MAAM,iBAAiB,GAAG;QACxB,cAAc,EAAE,YAAY,EAAE,UAAU,EAAE,SAAS,EAAE,eAAe;QACpE,WAAW,EAAE,SAAS,EAAE,UAAU,EAAE,KAAK,EAAE,KAAK;KACjD,CAAC;IACF,MAAM,aAAa,GAAG,iBAAiB,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,QAAQ,CAAC,QAAQ,CAAC,CAAC,CAAC,CAAC,CAAC,MAAM,CAAC;IACnF,IAAI,aAAa,GAAG,CAAC,EAAE,CAAC;QACtB,MAAM,OAAO,GAAG,IAAI,CAAC,GAAG,CAAC,aAAa,GAAG,EAAE,EAAE,EAAE,CAAC,CAAC;QACjD,KAAK,IAAI,OAAO,CAAC;QACjB,QAAQ,CAAC,IAAI,CAAC,eAAe,aAAa,MAAM,OAAO,GAAG,CAAC,CAAC;IAC9D,CAAC;IAED,oCAAoC;IACpC,MAAM,WAAW,GAAG,QAAQ,CAAC,KAAK,CAAC,KAAK,CAAC,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,MAAM,GAAG,CAAC,CAAC,CAAC;IACtE,IAAI,WAAW,CAAC,MAAM,IAAI,CAAC,EAAE,CAAC;QAC5B,MAAM,cAAc,GAAG,IAAI,GAAG,CAAC,WAAW,CAAC,CAAC;QAC5C,IAAI,cAAc,CAAC,IAAI,GAAG,WAAW,CAAC,MAAM,GAAG,GAAG,EAAE,CAAC;YACnD,KAAK,IAAI,EAAE,CAAC;YACZ,QAAQ,CAAC,IAAI,CAAC,gBAAgB,CAAC,CAAC;QAClC,CAAC;IACH,CAAC;IACD,MAAM,WAAW,GAAG,QAAQ,CAAC,KAAK,CAAC,KAAK,CAAC,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,CAAC,MAAM,GAAG,CAAC,CAAC,CAAC;IACtE,IAAI,WAAW,CAAC,MAAM,IAAI,CAAC,EAAE,CAAC;QAC5B,MAAM,cAAc,GAAG,IAAI,GAAG,CAAC,WAAW,CAAC,CAAC;QAC5C,IAAI,cAAc,CAAC,IAAI,GAAG,WAAW,CAAC,MAAM,GAAG,GAAG,EAAE,CAAC;YACnD,KAAK,IAAI,EAAE,CAAC;YACZ,QAAQ,CAAC,IAAI,CAAC,gBAAgB,CAAC,CAAC;QAClC,CAAC;IACH,CAAC;IAED,oEAAoE;IACpE,MAAM,SAAS,GAAG;QAChB,UAAU,EAAE,WAAW,EAAE,SAAS,EAAE,UAAU,EAAE,UAAU,EAAE,WAAW;QACvE,gBAAgB,EAAE,YAAY,EAAE,YAAY,EAAE,gBAAgB;QAC9D,YAAY,EAAE,WAAW,EAAE,YAAY,EAAE,aAAa;KACvD,CAAC;IACF,MAAM,OAAO,GAAG,GAAG,QAAQ,IAAI,QAAQ,EAAE,CAAC;IAC1C,MAAM,SAAS,GAAG,SAAS,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,OAAO,CAAC,QAAQ,CAAC,CAAC,CAAC,CAAC,CAAC,MAAM,CAAC;IACtE,IAAI,SAAS,IAAI,CAAC,EAAE,CAAC;QACnB,KAAK,IAAI,EAAE,CAAC;QACZ,QAAQ,CAAC,IAAI,CAAC,cAAc,SAAS,EAAE,CAAC,CAAC;IAC3C,CAAC;SAAM,IAAI,SAAS,IAAI,CAAC,EAAE,CAAC;QAC1B,KAAK,IAAI,EAAE,CAAC;QACZ,QAAQ,CAAC,IAAI,CAAC,cAAc,SAAS,EAAE,CAAC,CAAC;IAC3C,CAAC;SAAM,IAAI,SAAS,IAAI,CAAC,EAAE,CAAC;QAC1B,KAAK,IAAI,EAAE,CAAC;QACZ,QAAQ,CAAC,IAAI,CAAC,cAAc,SAAS,EAAE,CAAC,CAAC;IAC3C,CAAC;IAED,sBAAsB;IACtB
,MAAM,UAAU,GAAG,CAAC,OAAO,EAAE,aAAa,EAAE,OAAO,EAAE,UAAU,EAAE,OAAO,EAAE,SAAS,EAAE,WAAW,CAAC,CAAC;IAClG,MAAM,UAAU,GAAG,UAAU,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE;QACzC,MAAM,KAAK,GAAG,IAAI,MAAM,CAAC,MAAM,CAAC,KAAK,EAAE,GAAG,CAAC,CAAC;QAC5C,OAAO,KAAK,CAAC,IAAI,CAAC,QAAQ,CAAC,CAAC;IAC9B,CAAC,CAAC,CAAC,MAAM,CAAC;IACV,IAAI,UAAU,IAAI,CAAC,EAAE,CAAC;QACpB,KAAK,IAAI,EAAE,CAAC;QACZ,QAAQ,CAAC,IAAI,CAAC,YAAY,UAAU,EAAE,CAAC,CAAC;IAC1C,CAAC;IAED,oBAAoB;IACpB,KAAK,MAAM,EAAE,IAAI,SAAS,EAAE,CAAC;QAC3B,MAAM,OAAO,GAAG,QAAQ,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,EAAE,CAAC,CAAC,KAAK,EAAE,CAAC,CAAC,MAAM,CAAC;QACxD,IAAI,OAAO,IAAI,CAAC,EAAE,CAAC;YACjB,KAAK,IAAI,EAAE,CAAC;YACZ,QAAQ,CAAC,IAAI,CAAC,UAAU,EAAE,MAAM,OAAO,EAAE,CAAC,CAAC;YAC3C,MAAM;QACR,CAAC;IACH,CAAC;IAED,cAAc;IACd,IAAI,KAAK,CAAC,kBAAkB,IAAI,KAAK,CAAC,kBAAkB,CAAC,MAAM,GAAG,EAAE,EAAE,CAAC;QACrE,KAAK,IAAI,EAAE,CAAC;QACZ,QAAQ,CAAC,IAAI,CAAC,UAAU,CAAC,CAAC;IAC5B,CAAC;IACD,IAAI,KAAK,CAAC,gBAAgB,IAAI,KAAK,CAAC,gBAAgB,CAAC,MAAM,GAAG,EAAE,EAAE,CAAC;QACjE,KAAK,IAAI,EAAE,CAAC;QACZ,QAAQ,CAAC,IAAI,CAAC,WAAW,CAAC,CAAC;IAC7B,CAAC;IAED,KAAK,GAAG,IAAI,CAAC,GAAG,CAAC,CAAC,EAAE,IAAI,CAAC,GAAG,CAAC,GAAG,EAAE,KAAK,CAAC,CAAC,CAAC;IAC1C,IAAI,OAA0C,CAAC;IAC/C,IAAI,KAAK,IAAI,EAAE;QAAE,OAAO,GAAG,SAAS,CAAC;SAChC,IAAI,KAAK,IAAI,EAAE;QAAE,OAAO,GAAG,YAAY,CAAC;;QACxC,OAAO,GAAG,MAAM,CAAC;IAEtB,OAAO,EAAE,KAAK,EAAE,OAAO,EAAE,QAAQ,EAAE,CAAC;AACtC,CAAC;AAWD,MAAM,SAAS,GAAe;IAC5B,wEAAwE;IACxE;QACE,KAAK,EAAE,0CAA0C;QACjD,KAAK,EAAE;YACL,IAAI,EAAE,+CAA+C;YACrD,GAAG,EAAE,oHAAoH;YACzH,GAAG,EAAE,mEAAmE;SACzE;QACD,eAAe,EAAE,SAAS;QAC1B,kBAAkB,EAAE,CAAC,EAAE,EAAE,GAAG,CAAC;KAC9B;IACD;QACE,KAAK,EAAE,2BAA2B;QAClC,KAAK,EAAE;YACL,IAAI,EAAE,8BAA8B;YACpC,GAAG,EAAE,8JAA8J;YACnK,GAAG,EAAE,wFAAwF;YAC7F,kBAAkB,EAAE,2FAA2F;YAC/G,gBAAgB,EAAE,gFAAgF;SACnG;QACD,eAAe,EAAE,SAAS;QAC1B,kBAAkB,EAAE,CAAC,GAAG,EAAE,GAAG,CAAC;KAC/B;IACD;QACE,KAAK,EAAE,4BAA4B;QACnC,KAAK,EAAE;YACL,IAAI,EAAE,0BAA0B;YAChC,GAAG,EAAE,2FAA2F;YAChG,GAAG,EAAE,mEAAmE;SACzE;QACD,eAAe,EAAE,SAAS;QAC1B,kBAAkB,EAAE,CAAC,EAAE,EAAE,GAAG,CAAC;
KAC9B;IACD;QACE,KAAK,EAAE,yCAAyC;QAChD,KAAK,EAAE;YACL,IAAI,EAAE,6BAA6B;YACnC,GAAG,EAAE,2DAA2D;YAChE,GAAG,EAAE,oCAAoC;SAC1C;QACD,eAAe,EAAE,SAAS;QAC1B,kBAAkB,EAAE,CAAC,EAAE,EAAE,GAAG,CAAC;KAC9B;IACD;QACE,KAAK,EAAE,gDAAgD;QACvD,KAAK,EAAE;YACL,IAAI,EAAE,sDAAsD;YAC5D,GAAG,EAAE,0IAA0I;YAC/I,GAAG,EAAE,2EAA2E;SACjF;QACD,eAAe,EAAE,SAAS;QAC1B,kBAAkB,EAAE,CAAC,EAAE,EAAE,GAAG,CAAC;KAC9B;IAED,yEAAyE;IACzE;QACE,KAAK,EAAE,yBAAyB;QAChC,KAAK,EAAE;YACL,IAAI,EAAE,6DAA6D;YACnE,GAAG,EAAE,yEAAyE;YAC9E,GAAG,EAAE,OAAO;SACb;QACD,eAAe,EAAE,YAAY;QAC7B,kBAAkB,EAAE,CAAC,EAAE,EAAE,EAAE,CAAC;KAC7B;IACD;QACE,KAAK,EAAE,0BAA0B;QACjC,KAAK,EAAE;YACL,IAAI,EAAE,6BAA6B;YACnC,GAAG,EAAE,wDAAwD;YAC7D,GAAG,EAAE,UAAU;SAChB;QACD,eAAe,EAAE,YAAY;QAC7B,kBAAkB,EAAE,CAAC,EAAE,EAAE,EAAE,CAAC;KAC7B;IACD;QACE,KAAK,EAAE,8BAA8B;QACrC,KAAK,EAAE;YACL,IAAI,EAAE,kCAAkC;YACxC,GAAG,EAAE,WAAW;YAChB,GAAG,EAAE,kEAAkE;SACxE;QACD,eAAe,EAAE,SAAS,EAAE,mDAAmD;QAC/E,kBAAkB,EAAE,CAAC,EAAE,EAAE,EAAE,CAAC;KAC7B;IACD;QACE,KAAK,EAAE,yCAAyC;QAChD,KAAK,EAAE;YACL,IAAI,EAAE,uBAAuB;YAC7B,GAAG,EAAE,oJAAoJ;YACzJ,GAAG,EAAE,iDAAiD;YACtD,kBAAkB,EAAE,mEAAmE;SACxF;QACD,eAAe,EAAE,SAAS;QAC1B,kBAAkB,EAAE,CAAC,EAAE,EAAE,GAAG,CAAC;KAC9B;IACD;QACE,KAAK,EAAE,0CAA0C;QACjD,KAAK,EAAE;YACL,IAAI,EAAE,sCAAsC;YAC5C,GAAG,EAAE,6GAA6G;YAClH,GAAG,EAAE,cAAc;SACpB;QACD,eAAe,EAAE,SAAS;QAC1B,kBAAkB,EAAE,CAAC,EAAE,EAAE,EAAE,CAAC;KAC7B;IAED,wEAAwE;IACxE;QACE,KAAK,EAAE,0BAA0B;QACjC,KAAK,EAAE,EAAE,IAAI,EAAE,cAAc,EAAE,GAAG,EAAE,aAAa,EAAE,GAAG,EAAE,UAAU,EAAE;QACpE,eAAe,EAAE,YAAY;QAC7B,kBAAkB,EAAE,CAAC,EAAE,EAAE,EAAE,CAAC;KAC7B;IACD;QACE,KAAK,EAAE,kBAAkB;QACzB,KAAK,EAAE,EAAE,IAAI,EAAE,oBAAoB,EAAE,GAAG,EAAE,EAAE,EAAE,GAAG,EAAE,IAAI,EAAE;QACzD,eAAe,EAAE,MAAM;QACvB,kBAAkB,EAAE,CAAC,CAAC,EAAE,EAAE,CAAC;KAC5B;IACD;QACE,KAAK,EAAE,yCAAyC;QAChD,KAAK,EAAE;YACL,IAAI,EAAE,uDAAuD;YAC7D,GAAG,EAAE,oFAAoF;YACzF,GAAG,EAAE,QAAQ;SACd;QACD,eAAe,EAAE,YAAY;QAC7B,kBAAkB,EAAE,CAAC,EAAE,EAAE,EAAE,CAAC;KAC7B;IAED,uEAAuE;IACvE;QACE,KAAK,EAAE,yDAAyD;QAChE,KAAK,EAAE;YACL,IAAI,EAAE,2BAA2B;YACjC,GAAG,EAAE,oGA
AoG;YACzG,GAAG,EAAE,6CAA6C;SACnD;QACD,2FAA2F;QAC3F,eAAe,EAAE,YAAY;QAC7B,kBAAkB,EAAE,CAAC,EAAE,EAAE,EAAE,CAAC;KAC7B;IACD;QACE,KAAK,EAAE,6CAA6C;QACpD,KAAK,EAAE;YACL,IAAI,EAAE,aAAa;YACnB,GAAG,EAAE,6DAA6D;YAClE,GAAG,EAAE,6DAA6D;SACnE;QACD,gEAAgE;QAChE,eAAe,EAAE,YAAY;QAC7B,kBAAkB,EAAE,CAAC,EAAE,EAAE,EAAE,CAAC;KAC7B;IACD;QACE,KAAK,EAAE,8DAA8D;QACrE,KAAK,EAAE;YACL,IAAI,EAAE,0BAA0B;YAChC,GAAG,EAAE,qCAAqC;YAC1C,GAAG,EAAE,0BAA0B;SAChC;QACD,wDAAwD;QACxD,eAAe,EAAE,YAAY;QAC7B,kBAAkB,EAAE,CAAC,EAAE,EAAE,EAAE,CAAC;KAC7B;IACD;QACE,KAAK,EAAE,oCAAoC;QAC3C,KAAK,EAAE;YACL,IAAI,EAAE,+BAA+B;YACrC,GAAG,EAAE,sEAAsE;YAC3E,GAAG,EAAE,8CAA8C;SACpD;QACD,4EAA4E;QAC5E,eAAe,EAAE,SAAS;QAC1B,kBAAkB,EAAE,CAAC,EAAE,EAAE,EAAE,CAAC;KAC7B;IACD;QACE,KAAK,EAAE,4CAA4C;QACnD,KAAK,EAAE;YACL,IAAI,EAAE,uBAAuB;YAC7B,GAAG,EAAE,iOAAiO;YACtO,GAAG,EAAE,YAAY;SAClB;QACD,yEAAyE;QACzE,eAAe,EAAE,YAAY;QAC7B,kBAAkB,EAAE,CAAC,EAAE,EAAE,EAAE,CAAC;KAC7B;IACD;QACE,KAAK,EAAE,+BAA+B;QACtC,KAAK,EAAE;YACL,IAAI,EAAE,sBAAsB;YAC5B,GAAG,EAAE,8EAA8E;YACnF,GAAG,EAAE,WAAW;SACjB;QACD,sDAAsD;QACtD,eAAe,EAAE,SAAS;QAC1B,kBAAkB,EAAE,CAAC,EAAE,EAAE,EAAE,CAAC;KAC7B;CACF,CAAC;AAEF,8EAA8E;AAE9E,OAAO,CAAC,GAAG,CAAC,iEAAiE,CAAC,CAAC;AAC/E,OAAO,CAAC,GAAG,CAAC,8BAA8B,SAAS,CAAC,MAAM,YAAY,CAAC,CAAC;AACxE,OAAO,CAAC,GAAG,CAAC,mEAAmE,CAAC,CAAC;AAEjF,IAAI,MAAM,GAAG,CAAC,CAAC;AACf,IAAI,MAAM,GAAG,CAAC,CAAC;AACf,MAAM,IAAI,GAAa,EAAE,CAAC;AAE1B,KAAK,MAAM,EAAE,IAAI,SAAS,EAAE,CAAC;IAC3B,MAAM,CAAC,GAAG,iBAAiB,CAAC,EAAE,CAAC,KAAK,CAAC,CAAC;IACtC,MAAM,GAAG,GAAG,CAAC,CAAC,OAAO,KAAK,EAAE,CAAC,eAAe,CAAC;IAC7C,MAAM,GAAG,GAAG,CAAC,CAAC,KAAK,IAAI,EAAE,CAAC,kBAAkB,CAAC,CAAC,CAAC,IAAI,CAAC,CAAC,KAAK,IAAI,EAAE,CAAC,kBAAkB,CAAC,CAAC,CAAC,CAAC;IACvF,MAAM,EAAE,GAAG,GAAG,IAAI,GAAG,CAAC;IAEtB,MAAM,MAAM,GAAG,EAAE,CAAC,CAAC,CAAC,QAAQ,CAAC,CAAC,CAAC,QAAQ,CAAC;IACxC,OAAO,CAAC,GAAG,CAAC,GAAG,MAAM,KAAK,EAAE,CAAC,KAAK,EAAE,CAAC,CAAC;IACtC,OAAO,CAAC,GAAG,CAAC,kBAAkB,CAAC,CAAC,KAAK,GAAG,GAAG,CAAC,CAAC,CAAC,EAAE,CAAC,CAAC,CAAC,cAAc,EAAE,CAAC,kBAAkB,CAAC,CAAC,CAAC,IAAI,EAAE,CAAC,kBAAkB,CAAC
,CAAC,CAAC,GAAG,cAAc,CAAC,CAAC,OAAO,GAAG,GAAG,CAAC,CAAC,CAAC,EAAE,CAAC,CAAC,CAAC,cAAc,EAAE,CAAC,eAAe,GAAG,EAAE,CAAC,CAAC;IACpM,OAAO,CAAC,GAAG,CAAC,oBAAoB,CAAC,CAAC,QAAQ,CAAC,IAAI,CAAC,IAAI,CAAC,GAAG,CAAC,CAAC;IAC1D,IAAI,CAAC,EAAE,EAAE,CAAC;QACR,MAAM,EAAE,CAAC;QACT,IAAI,CAAC,IAAI,CAAC,GAAG,EAAE,CAAC,KAAK,WAAW,CAAC,CAAC,KAAK,YAAY,CAAC,CAAC,OAAO,EAAE,CAAC,CAAC;IAClE,CAAC;SAAM,CAAC;QACN,MAAM,EAAE,CAAC;IACX,CAAC;IACD,OAAO,CAAC,GAAG,EAAE,CAAC;AAChB,CAAC;AAED,OAAO,CAAC,GAAG,CAAC,iEAAiE,CAAC,CAAC;AAC/E,OAAO,CAAC,GAAG,CAAC,YAAY,MAAM,IAAI,SAAS,CAAC,MAAM,YAAY,MAAM,SAAS,CAAC,CAAC;AAC/E,OAAO,CAAC,GAAG,CAAC,gBAAgB,IAAI,CAAC,KAAK,CAAC,MAAM,GAAG,SAAS,CAAC,MAAM,GAAG,GAAG,CAAC,GAAG,CAAC,CAAC;AAC5E,OAAO,CAAC,GAAG,CAAC,iEAAiE,CAAC,CAAC;AAE/E,IAAI,IAAI,CAAC,MAAM,GAAG,CAAC,EAAE,CAAC;IACpB,OAAO,CAAC,GAAG,CAAC,mBAAmB,CAAC,CAAC;IACjC,KAAK,MAAM,CAAC,IAAI,IAAI;QAAE,OAAO,CAAC,GAAG,CAAC,QAAQ,CAAC,EAAE,CAAC,CAAC;AACjD,CAAC;AAED,OAAO,CAAC,IAAI,CAAC,MAAM,GAAG,CAAC,CAAC,CAAC,CAAC,CAAC,CAAC,CAAC,CAAC,CAAC,CAAC,CAAC"}
@@ -0,0 +1 @@
+export {};
@@ -0,0 +1,86 @@
+/**
+ * Unit tests for embeddingProvider — cosine similarity, cache, mock provider, graceful fallback.
+ */
+import { describe, it, expect, beforeEach } from "vitest";
+import { embeddingSearch, isEmbeddingReady, getProviderName, _resetForTesting, _setProviderForTesting, _setIndexForTesting, } from "../tools/embeddingProvider.js";
+beforeEach(() => {
+    _resetForTesting();
+});
+describe("embeddingProvider: cosine similarity correctness", () => {
+    it("identical vectors should have similarity ~1.0", () => {
+        const vec = new Float32Array([0.5, 0.3, 0.8, 0.1]);
+        _setIndexForTesting([{ name: "tool_a", nodeType: "tool", vector: vec }]);
+        const results = embeddingSearch(vec, 5);
+        expect(results.length).toBe(1);
+        expect(results[0].name).toBe("tool_a");
+        expect(results[0].similarity).toBeCloseTo(1.0, 4);
+    });
+    it("orthogonal vectors should have similarity ~0.0", () => {
+        const vecA = new Float32Array([1, 0, 0, 0]);
+        const vecB = new Float32Array([0, 1, 0, 0]);
+        _setIndexForTesting([
+            { name: "tool_a", nodeType: "tool", vector: vecA },
+            { name: "tool_b", nodeType: "tool", vector: vecB },
+        ]);
+        const queryVec = new Float32Array([1, 0, 0, 0]);
+        const results = embeddingSearch(queryVec, 5);
+        expect(results[0].name).toBe("tool_a");
+        expect(results[0].similarity).toBeCloseTo(1.0, 4);
+        expect(results[1].name).toBe("tool_b");
+        expect(results[1].similarity).toBeCloseTo(0.0, 4);
+    });
+    it("should rank by similarity (highest first)", () => {
+        _setIndexForTesting([
+            { name: "low", nodeType: "tool", vector: new Float32Array([0.1, 0.9, 0.0]) },
+            { name: "high", nodeType: "tool", vector: new Float32Array([0.9, 0.1, 0.0]) },
+            { name: "mid", nodeType: "tool", vector: new Float32Array([0.5, 0.5, 0.0]) },
+        ]);
+        const query = new Float32Array([1.0, 0.0, 0.0]);
+        const results = embeddingSearch(query, 3);
+        expect(results[0].name).toBe("high");
+        expect(results[1].name).toBe("mid");
+        expect(results[2].name).toBe("low");
+    });
+});
+describe("embeddingProvider: graceful fallback", () => {
+    it("isEmbeddingReady returns false when no index", () => {
+        expect(isEmbeddingReady()).toBe(false);
+    });
+    it("embeddingSearch returns empty when no index", () => {
+        const results = embeddingSearch(new Float32Array([1, 0, 0]), 5);
+        expect(results).toEqual([]);
+    });
+    it("getProviderName returns null when no provider", () => {
+        expect(getProviderName()).toBe(null);
+    });
+});
+describe("embeddingProvider: mock provider", () => {
+    it("can inject a mock provider", () => {
+        const mockProvider = {
+            name: "mock",
+            dimensions: 3,
+            embed: async (texts) => texts.map(() => new Float32Array([0.5, 0.5, 0.0])),
+        };
+        _setProviderForTesting(mockProvider);
+        expect(getProviderName()).toBe("mock");
+    });
+    it("isEmbeddingReady returns true when index is set", () => {
+        _setIndexForTesting([
+            { name: "tool_a", nodeType: "tool", vector: new Float32Array([1, 0, 0]) },
+        ]);
+        expect(isEmbeddingReady()).toBe(true);
+    });
+});
+describe("embeddingProvider: limit parameter", () => {
+    it("respects limit in embeddingSearch", () => {
+        _setIndexForTesting([
+            { name: "a", nodeType: "tool", vector: new Float32Array([1, 0, 0]) },
+            { name: "b", nodeType: "tool", vector: new Float32Array([0, 1, 0]) },
+            { name: "c", nodeType: "tool", vector: new Float32Array([0, 0, 1]) },
+        ]);
+        const results = embeddingSearch(new Float32Array([1, 0, 0]), 2);
+        expect(results.length).toBe(2);
+        expect(results[0].name).toBe("a");
+    });
+});
+//# sourceMappingURL=embeddingProvider.test.js.map
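The tests in the hunk above pin down four behaviours of `embeddingSearch`: cosine similarity scoring, descending ranking, a result limit, and an empty result when no index is loaded. The body of `embeddingProvider.js` is not shown in this part of the diff, so the following is only a minimal sketch of a search that would satisfy those assertions, not the package's implementation. The names `IndexEntry`, `SearchHit`, `cosineSimilarity`, and `searchIndex` are hypothetical, and the index is passed in explicitly here rather than held in module state.

```typescript
// Hypothetical sketch; names and structure are assumptions, not the package's code.
interface IndexEntry {
  name: string;
  nodeType: string;
  vector: Float32Array;
}

interface SearchHit {
  name: string;
  nodeType: string;
  similarity: number;
}

// Cosine similarity: dot(a, b) / (|a| * |b|).
// Returns 0 for empty, mismatched, or zero-norm vectors instead of NaN.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length === 0 || a.length !== b.length) return 0;
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Scores every entry against the query, sorts highest first, keeps `limit` hits.
// A missing or empty index yields [] so callers can fall back gracefully.
function searchIndex(
  index: IndexEntry[] | null,
  query: Float32Array,
  limit: number,
): SearchHit[] {
  if (!index || index.length === 0) return [];
  return index
    .map((entry) => ({
      name: entry.name,
      nodeType: entry.nodeType,
      similarity: cosineSimilarity(query, entry.vector),
    }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, limit);
}
```

A brute-force scan like this is linear in the index size, which is usually acceptable for a tool registry with a few hundred entries; whether the real module caches norms or pre-normalizes vectors is not visible from the tests.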
@@ -0,0 +1 @@
+
{"version":3,"file":"embeddingProvider.test.js","sourceRoot":"","sources":["../../src/__tests__/embeddingProvider.test.ts"],"names":[],"mappings":"AAAA;;GAEG;AACH,OAAO,EAAE,QAAQ,EAAE,EAAE,EAAE,MAAM,EAAE,UAAU,EAAE,MAAM,QAAQ,CAAC;AAC1D,OAAO,EACL,eAAe,EACf,gBAAgB,EAChB,eAAe,EACf,gBAAgB,EAChB,sBAAsB,EACtB,mBAAmB,GACpB,MAAM,+BAA+B,CAAC;AAGvC,UAAU,CAAC,GAAG,EAAE;IACd,gBAAgB,EAAE,CAAC;AACrB,CAAC,CAAC,CAAC;AAEH,QAAQ,CAAC,kDAAkD,EAAE,GAAG,EAAE;IAChE,EAAE,CAAC,+CAA+C,EAAE,GAAG,EAAE;QACvD,MAAM,GAAG,GAAG,IAAI,YAAY,CAAC,CAAC,GAAG,EAAE,GAAG,EAAE,GAAG,EAAE,GAAG,CAAC,CAAC,CAAC;QACnD,mBAAmB,CAAC,CAAC,EAAE,IAAI,EAAE,QAAQ,EAAE,QAAQ,EAAE,MAAe,EAAE,MAAM,EAAE,GAAG,EAAE,CAAC,CAAC,CAAC;QAElF,MAAM,OAAO,GAAG,eAAe,CAAC,GAAG,EAAE,CAAC,CAAC,CAAC;QACxC,MAAM,CAAC,OAAO,CAAC,MAAM,CAAC,CAAC,IAAI,CAAC,CAAC,CAAC,CAAC;QAC/B,MAAM,CAAC,OAAO,CAAC,CAAC,CAAC,CAAC,IAAI,CAAC,CAAC,IAAI,CAAC,QAAQ,CAAC,CAAC;QACvC,MAAM,CAAC,OAAO,CAAC,CAAC,CAAC,CAAC,UAAU,CAAC,CAAC,WAAW,CAAC,GAAG,EAAE,CAAC,CAAC,CAAC;IACpD,CAAC,CAAC,CAAC;IAEH,EAAE,CAAC,gDAAgD,EAAE,GAAG,EAAE;QACxD,MAAM,IAAI,GAAG,IAAI,YAAY,CAAC,CAAC,CAAC,EAAE,CAAC,EAAE,CAAC,EAAE,CAAC,CAAC,CAAC,CAAC;QAC5C,MAAM,IAAI,GAAG,IAAI,YAAY,CAAC,CAAC,CAAC,EAAE,CAAC,EAAE,CAAC,EAAE,CAAC,CAAC,CAAC,CAAC;QAC5C,mBAAmB,CAAC;YAClB,EAAE,IAAI,EAAE,QAAQ,EAAE,QAAQ,EAAE,MAAe,EAAE,MAAM,EAAE,IAAI,EAAE;YAC3D,EAAE,IAAI,EAAE,QAAQ,EAAE,QAAQ,EAAE,MAAe,EAAE,MAAM,EAAE,IAAI,EAAE;SAC5D,CAAC,CAAC;QAEH,MAAM,QAAQ,GAAG,IAAI,YAAY,CAAC,CAAC,CAAC,EAAE,CAAC,EAAE,CAAC,EAAE,CAAC,CAAC,CAAC,CAAC;QAChD,MAAM,OAAO,GAAG,eAAe,CAAC,QAAQ,EAAE,CAAC,CAAC,CAAC;QAC7C,MAAM,CAAC,OAAO,CAAC,CAAC,CAAC,CAAC,IAAI,CAAC,CAAC,IAAI,CAAC,QAAQ,CAAC,CAAC;QACvC,MAAM,CAAC,OAAO,CAAC,CAAC,CAAC,CAAC,UAAU,CAAC,CAAC,WAAW,CAAC,GAAG,EAAE,CAAC,CAAC,CAAC;QAClD,MAAM,CAAC,OAAO,CAAC,CAAC,CAAC,CAAC,IAAI,CAAC,CAAC,IAAI,CAAC,QAAQ,CAAC,CAAC;QACvC,MAAM,CAAC,OAAO,CAAC,CAAC,CAAC,CAAC,UAAU,CAAC,CAAC,WAAW,CAAC,GAAG,EAAE,CAAC,CAAC,CAAC;IACpD,CAAC,CAAC,CAAC;IAEH,EAAE,CAAC,2CAA2C,EAAE,GAAG,EAAE;QACnD,mBAAmB,CAAC;YAClB,EAAE,IAAI,EAAE,KAAK,EAAE,QAAQ,EAAE,MAAe,EAAE,MA
AM,EAAE,IAAI,YAAY,CAAC,CAAC,GAAG,EAAE,GAAG,EAAE,GAAG,CAAC,CAAC,EAAE;YACrF,EAAE,IAAI,EAAE,MAAM,EAAE,QAAQ,EAAE,MAAe,EAAE,MAAM,EAAE,IAAI,YAAY,CAAC,CAAC,GAAG,EAAE,GAAG,EAAE,GAAG,CAAC,CAAC,EAAE;YACtF,EAAE,IAAI,EAAE,KAAK,EAAE,QAAQ,EAAE,MAAe,EAAE,MAAM,EAAE,IAAI,YAAY,CAAC,CAAC,GAAG,EAAE,GAAG,EAAE,GAAG,CAAC,CAAC,EAAE;SACtF,CAAC,CAAC;QAEH,MAAM,KAAK,GAAG,IAAI,YAAY,CAAC,CAAC,GAAG,EAAE,GAAG,EAAE,GAAG,CAAC,CAAC,CAAC;QAChD,MAAM,OAAO,GAAG,eAAe,CAAC,KAAK,EAAE,CAAC,CAAC,CAAC;QAC1C,MAAM,CAAC,OAAO,CAAC,CAAC,CAAC,CAAC,IAAI,CAAC,CAAC,IAAI,CAAC,MAAM,CAAC,CAAC;QACrC,MAAM,CAAC,OAAO,CAAC,CAAC,CAAC,CAAC,IAAI,CAAC,CAAC,IAAI,CAAC,KAAK,CAAC,CAAC;QACpC,MAAM,CAAC,OAAO,CAAC,CAAC,CAAC,CAAC,IAAI,CAAC,CAAC,IAAI,CAAC,KAAK,CAAC,CAAC;IACtC,CAAC,CAAC,CAAC;AACL,CAAC,CAAC,CAAC;AAEH,QAAQ,CAAC,sCAAsC,EAAE,GAAG,EAAE;IACpD,EAAE,CAAC,8CAA8C,EAAE,GAAG,EAAE;QACtD,MAAM,CAAC,gBAAgB,EAAE,CAAC,CAAC,IAAI,CAAC,KAAK,CAAC,CAAC;IACzC,CAAC,CAAC,CAAC;IAEH,EAAE,CAAC,6CAA6C,EAAE,GAAG,EAAE;QACrD,MAAM,OAAO,GAAG,eAAe,CAAC,IAAI,YAAY,CAAC,CAAC,CAAC,EAAE,CAAC,EAAE,CAAC,CAAC,CAAC,EAAE,CAAC,CAAC,CAAC;QAChE,MAAM,CAAC,OAAO,CAAC,CAAC,OAAO,CAAC,EAAE,CAAC,CAAC;IAC9B,CAAC,CAAC,CAAC;IAEH,EAAE,CAAC,+CAA+C,EAAE,GAAG,EAAE;QACvD,MAAM,CAAC,eAAe,EAAE,CAAC,CAAC,IAAI,CAAC,IAAI,CAAC,CAAC;IACvC,CAAC,CAAC,CAAC;AACL,CAAC,CAAC,CAAC;AAEH,QAAQ,CAAC,kCAAkC,EAAE,GAAG,EAAE;IAChD,EAAE,CAAC,4BAA4B,EAAE,GAAG,EAAE;QACpC,MAAM,YAAY,GAAsB;YACtC,IAAI,EAAE,MAAM;YACZ,UAAU,EAAE,CAAC;YACb,KAAK,EAAE,KAAK,EAAE,KAAK,EAAE,EAAE,CACrB,KAAK,CAAC,GAAG,CAAC,GAAG,EAAE,CAAC,IAAI,YAAY,CAAC,CAAC,GAAG,EAAE,GAAG,EAAE,GAAG,CAAC,CAAC,CAAC;SACrD,CAAC;QACF,sBAAsB,CAAC,YAAY,CAAC,CAAC;QACrC,MAAM,CAAC,eAAe,EAAE,CAAC,CAAC,IAAI,CAAC,MAAM,CAAC,CAAC;IACzC,CAAC,CAAC,CAAC;IAEH,EAAE,CAAC,iDAAiD,EAAE,GAAG,EAAE;QACzD,mBAAmB,CAAC;YAClB,EAAE,IAAI,EAAE,QAAQ,EAAE,QAAQ,EAAE,MAAe,EAAE,MAAM,EAAE,IAAI,YAAY,CAAC,CAAC,CAAC,EAAE,CAAC,EAAE,CAAC,CAAC,CAAC,EAAE;SACnF,CAAC,CAAC;QACH,MAAM,CAAC,gBAAgB,EAAE,CAAC,CAAC,IAAI,CAAC,IAAI,CAAC,CAAC;IACxC,CAAC,CAAC,CAAC;AACL,CAAC,CAAC,CAAC;AAEH,QAAQ,CAAC,oCAAoC,EAAE,GAAG,EAAE;
IAClD,EAAE,CAAC,mCAAmC,EAAE,GAAG,EAAE;QAC3C,mBAAmB,CAAC;YAClB,EAAE,IAAI,EAAE,GAAG,EAAE,QAAQ,EAAE,MAAe,EAAE,MAAM,EAAE,IAAI,YAAY,CAAC,CAAC,CAAC,EAAE,CAAC,EAAE,CAAC,CAAC,CAAC,EAAE;YAC7E,EAAE,IAAI,EAAE,GAAG,EAAE,QAAQ,EAAE,MAAe,EAAE,MAAM,EAAE,IAAI,YAAY,CAAC,CAAC,CAAC,EAAE,CAAC,EAAE,CAAC,CAAC,CAAC,EAAE;YAC7E,EAAE,IAAI,EAAE,GAAG,EAAE,QAAQ,EAAE,MAAe,EAAE,MAAM,EAAE,IAAI,YAAY,CAAC,CAAC,CAAC,EAAE,CAAC,EAAE,CAAC,CAAC,CAAC,EAAE;SAC9E,CAAC,CAAC;QAEH,MAAM,OAAO,GAAG,eAAe,CAAC,IAAI,YAAY,CAAC,CAAC,CAAC,EAAE,CAAC,EAAE,CAAC,CAAC,CAAC,EAAE,CAAC,CAAC,CAAC;QAChE,MAAM,CAAC,OAAO,CAAC,MAAM,CAAC,CAAC,IAAI,CAAC,CAAC,CAAC,CAAC;QAC/B,MAAM,CAAC,OAAO,CAAC,CAAC,CAAC,CAAC,IAAI,CAAC,CAAC,IAAI,CAAC,GAAG,CAAC,CAAC;IACpC,CAAC,CAAC,CAAC;AACL,CAAC,CAAC,CAAC"}
@@ -145,7 +145,7 @@ describe("GAIA capability: audio lane", () => {
 if (!existsSync(fixturePath)) {
     throw new Error(`Missing GAIA audio fixture at ${fixturePath}. Generate it with: python packages/mcp-local/src/__tests__/fixtures/generateGaiaCapabilityAudioFixture.py`);
 }
-const baselineModel = process.env.NODEBENCH_GAIA_BASELINE_MODEL ?? "gemini-
+const baselineModel = process.env.NODEBENCH_GAIA_BASELINE_MODEL ?? "gemini-3-flash-preview";
 const toolsModel = process.env.NODEBENCH_GAIA_TOOLS_MODEL ?? baselineModel;
 const baselineLlm = await createTextLlmClient({ model: baselineModel });
 const toolsLlm = await createTextLlmClient({ model: toolsModel });